Perhaps the best way to understand Bayesian statistics is to
contrast them with frequentist statistics. In most statistics
courses we learn how to create and interpret frequentist statistics
almost exclusively. In other words, we’ve been treating probability as a
measure of frequency. In this paradigm, there is a true,
correct, value for a parameter of interest (the estimand), and as we collect
more and more data, our estimate of that parameter will
approach this ‘true’ value. This is fine when it works, but what do we
do when we realize we need a sample larger than what
we can realistically manage? What if we really want to think
of confidence intervals in terms of probability and not
confidence? What even is confidence, anyway?
Let’s say we don’t know what the population parameter might be for some
variable of interest, but we think our sample data are good. We might
be feeling a bit iffy about how well our data match up with the
truth…
What if our data can’t meet those pesky assumptions that our statistics
professors have been telling us about (and that we’ve probably been
trying to ignore), like normally distributed residuals or
homoskedasticity?
We’re mostly all biologists here, right? We know our data frequently
(hah) sucks for doing frequentist statistics, and maybe we don’t want to
be in an email correspondence trying to convince someone keeping our
paper in reviewer Hell that our data is probably normal. We
must ask ourselves: how well do we think we could defend our statistical
methods?
Enter Bayesian statistics.
Bayesian statistics are otherwise known as subjective
statistics, and for good reason! Rather than estimating a parameter
we’re assuming to be true and trying to figure out how close our data
come to it, we’re actually estimating our uncertainty about a
parameter based on our dataset itself.
For example, let’s say we want to learn about a parameter: \(\theta\).
With a Bayesian paradigm, we make inferences about \(\theta\) by combining our prior
knowledge of \(\theta\) with the
likelihood of observing our data given \(\theta\). In other words, we’re integrating
whatever data we have with our subjective beliefs about how the data are
parameterized. From this fun collaboration, we generate a
posterior probability distribution for \(\theta\). Rather than a binary Yes/No
answer, as in frequentist statistics, the posterior distribution is our
“answer”.
So now we have a probability distribution describing how \(\theta\) is distributed; what does this
mean, and why should we care?
Well… if we’ve done it all correctly the posterior distribution tells us
how certain/uncertain we are about \(\theta\) given our specific data. Instead
of rejecting/failing to reject a hypothesis based on our data and
whatever assumptions we’re making about it, the posterior distribution
tells us the exact probabilities of \(\theta\) given our data as it actually
is.
Bayesian inference is an application of Bayes’ Rule (we already
learned this fun fact when we talked about Probability!).
Bayes’ Rule states that \(P(A|B) =
\frac{P(B|A) \times P(A)}{P(B)}\). Instead of individual
probabilities, however, we can think of these values in the context of
the example given above. In other words, our posterior
probability (\(P(A|B)\), where \(A\) is \(\theta\) and \(B\) is our data) is operationalized in this
formula using our prior probability (what we already think we know about
\(\theta\)) and the likelihood of
various values of \(\theta\) given our
dataset… but how do we get to that from this equation?
With Bayesian statistics, we can workshop our formula a bit to get \(P(\theta | X_{1,2,...,n}) \propto
P(X_{1,2,...,n}|\theta) \times P(\theta)\). (The denominator \(P(X_{1,2,...,n})\) doesn’t depend on
\(\theta\), so we can drop it and work with proportionality.) In this equation, our
posterior probability is the probability of \(\theta\) given our dataset (\(P(\theta | X_{1,2,...,n})\)), which is
proportional to the likelihood of our dataset given that \(\theta\) is a parameter that describes it
(\(P(X_{1,2,...,n}|\theta)\)) times our
prior knowledge of the probability of the parameter (\(P(\theta)\)).
We can make an educated guess on how \(\theta\) is distributed (an informative
prior), or we can throw in the towel and admit we know very little
regarding how \(\theta\) is distributed
(an uninformative prior). Our data can’t be changed (this is
the objective part of Bayesian statistics), but we can see that they are
distributed in a certain way. From these two distributions (our
prior and our actual data distribution), our goal is then to find how
\(\theta\) is distributed
GIVEN our data.
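To make that idea concrete, here’s a minimal, purely illustrative sketch in base R of “posterior \(\propto\) likelihood \(\times\) prior” on a grid of candidate \(\theta\) values. The counts in data_obs are made up, and the Gamma(15, 1) prior is an assumption chosen just for this example:
x <- seq(0.01, 30, by = 0.01) # a grid of candidate values for theta (a rate, so > 0)
data_obs <- c(14, 12, 17, 15, 13) # made-up counts, purely for illustration
prior <- dgamma(x, shape = 15, rate = 1) # "we think theta is somewhere around 15"
likelihood <- sapply(x, function(t) prod(dpois(data_obs, lambda = t))) # P(data | theta)
posterior <- prior * likelihood # proportional to the posterior
posterior <- posterior / sum(posterior * 0.01) # renormalize so it integrates to ~1
plot(x, posterior, type = "l", xlab = "theta", ylab = "Density",
     main = "Prior x likelihood, renormalized: the posterior")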
But again: what does all of this mean?
Bayesian statistics can be complex and a lot of work. Before we had today’s beefy computers, Bayesian statistics had to be calculated by hand (this is hard and not fun) and were EXTREMELY limited in their applications.
However, we now have the computational power to make Bayesian statistics more accessible and realistically applicable, including in R.
Enter: {rjags}.
rjags
To demonstrate, let’s check out the titi monkeys from Homework 2 again, but this time from a Bayesian perspective!
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(coda)
library(rjags)
## Linked to JAGS 4.3.2
## Loaded modules: basemod,bugs
So if you remember, our primatologist believed from prior experience that the average number of titi monkey calls she would hear in a 2hr period at her study site in Ecuador would be around 15. If we think each call response to her playback is a unique titi monkey group, could that accurately represent the actual number of titi monkeys at the field site?
Let’s imagine (and simulate) that she collected titi monkey call data during a week of field work using 2 separate methods: one using a playback point transect method (listening for the number of calls, assuming that each separate call she heard represented a different group of titi monkeys), and the other using actual observations from surveying the forest to visit known titi monkey home ranges at the field site.
Let’s use the published population means from the paper by Anand Dacier (the inspiration for this homework assignment), collected at the Tiputini Biodiversity Station in Ecuador, to simulate the results for these two methods:
set.seed(812) # set a seed (812 happens to be Jimmy's birthday); feel free to change it, it's just a number
titi_ppt <- rpois(n = 7, lambda = 13.8) # this will "collect" group counts via the playback point transect method
titi_homerange <- rpois(n = 7, lambda = 16) # this will "collect" group counts via the home range visitation method
Let’s first check how frequentist methods would evaluate whether Anand’s prior assumption of 15 calls or groups at the field site is consistent with the data she collected. We can do this with one-sample t-tests that compare the mean number of groups detected by each method against the assumed value of 15:
# are these consistent with 15 groups?
t.test(titi_homerange, alternative = "greater", mu = 15) # this test asks: is the home range method detecting more than 15 groups?
##
## One Sample t-test
##
## data: titi_homerange
## t = 0.23625, df = 6, p-value = 0.4105
## alternative hypothesis: true mean is greater than 15
## 95 percent confidence interval:
## 12.93568 Inf
## sample estimates:
## mean of x
## 15.28571
t.test(titi_ppt, alternative = "less", mu = 15) # this test asks: is the playback method detecting fewer than 15 groups?
##
## One Sample t-test
##
## data: titi_ppt
## t = -1.2914, df = 6, p-value = 0.122
## alternative hypothesis: true mean is less than 15
## 95 percent confidence interval:
## -Inf 15.93727
## sample estimates:
## mean of x
## 13.14286
Here, neither test gives us a significant result; in other words, with either method we fail to reject the null hypothesis that there are, on average, 15 groups of titi monkeys in a given area. This is no big deal, but also: if this were the work of a whole field season in the Amazon, we’ve spent all this money, done all this work, and gotten what looks like interesting data, just to confirm that our two methods are pretty much the same. It happens…
t.test(titi_homerange, titi_ppt)
##
## Welch Two Sample t-test
##
## data: titi_homerange and titi_ppt
## t = 1.1404, df = 11.657, p-value = 0.277
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.964508 6.250222
## sample estimates:
## mean of x mean of y
## 15.28571 13.14286
So… we don’t have significant evidence to suggest that either method is different from the other. But… we know they’re different (we simulated them with different lambdas!). So how can we dig into this system a bit more to better understand or visualize these differences…?
rjags
We can try to do this by generating a LOT of simulated parameter values via Markov chain Monte Carlo (MCMC) methods. These methods essentially draw a candidate value of the parameter from a probability distribution, then another one, then another one, and another one, and so on… and allow us to decide how to sample from this long chain of simulated values to better understand our data. We’re going to be sampling using the Just Another Gibbs Sampler (JAGS) algorithm, which creates MCMC samples using Gibbs and Metropolis-Hastings style updates. Essentially, we’re asking the algorithm to build a distribution of parameter values using both our prior beliefs about how we think the data should look (i.e., that the average number of groups detected should be around 15) and the data we actually collected using each of our methods (i.e., how likely our observed counts are under each candidate value). This amounts to a random walk: we “randomly walk” around possible values of the parameter we hope to estimate until the algorithm converges on a plausible range of values given both our prior beliefs and the data we’ve collected. These samples constitute our posterior distribution for the parameter, meaning we’ll be generating a lot of potential parameter estimates and looking at how those, themselves, are distributed in order to infer what we can know about the parameter.
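If you’d like to see the “random walk” idea in miniature, here’s a toy Metropolis sampler, written purely for intuition (this is NOT what JAGS does internally in any exact sense): it proposes small random moves for lambda and accepts or rejects them based on how well they agree with a Gamma(15, 1) prior and the home range counts we simulated above.
set.seed(1)
log_post <- function(lam) { # log prior + log likelihood, up to a constant
  if (lam <= 0) return(-Inf) # a Poisson rate can't be negative
  sum(dpois(titi_homerange, lambda = lam, log = TRUE)) +
    dgamma(lam, shape = 15, rate = 1, log = TRUE)
}
n_steps <- 5000
lam <- numeric(n_steps)
lam[1] <- 15 # start the walk at our prior guess
for (i in 2:n_steps) {
  proposal <- lam[i - 1] + rnorm(1, mean = 0, sd = 1) # propose a small random step
  accept_prob <- exp(log_post(proposal) - log_post(lam[i - 1])) # Metropolis acceptance ratio
  lam[i] <- if (runif(1) < accept_prob) proposal else lam[i - 1] # accept or stay put
}
plot(lam, type = "l", xlab = "Iteration", ylab = "lambda",
     main = "Toy Metropolis random walk (illustration only)")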
Let’s initialize our algorithm…
To begin, as our algorithm conducts its random walk it will try out some truly outlandish parameter estimates. This is because, in part, the algorithm compares successive parameter estimates and their resulting distributions as it converges on the most likely estimate (again, given both our prior assumptions and the data itself). Because these first few steps are going to be a bit wild, it’s always a good idea to declare what’s called a burn-in… this is a number that defines how many initial iterations of parameter estimates you’d like to throw out of your final distribution of estimates. This number could be big or small, but typically you just want to make sure that it throws out the “wobble” period of outlandish estimates at the beginning of the MCMC chain. In this case, I’m throwing out 5000 iterations via my n_burnin option.
Of course, we also need to declare how many iterations we’d like to run in order to converge on a reasonable parameter estimate… this is more an art than a science. Generally, you want enough iterations for your MCMC chain to converge on a stable estimate, but not so many that you’re burning computation time long after it has settled. For now, we’ll start with 20000 iterations, set using the n_iter option.
Finally, we want to give the sampler time to tune itself. The n_adapt option sets an adaptation period: an initial block of iterations that JAGS uses to adjust its samplers for efficiency, and which is discarded rather than kept as part of our posterior sample. (Something we also want to avoid is autocorrelation between successive samples, i.e., a pattern where each draw depends too strongly on the one before it, because that would mean our parameter estimates aren’t independent; the knob for that is thinning, via the thin argument to coda.samples(), which keeps only every k-th draw.) Here, I’m setting n_adapt to 5000.
n_burnin <- 5000 # we're going to throw the first 5000 samples away as the algorithm settles in
n_iter <- 20000 # we're going to generate 20000 total samples per chain
n_adapt <- 5000 # and give JAGS 5000 iterations up front to tune (adapt) its samplers
Once this runs, and after dropping the early iterations, we’re going to have roughly 10,000 (hopefully well-behaved) samples per chain generated by our MCMC algorithm.
We’ll also record the sample size for each method we’re testing, based on the pilot data collected over 7 days. In this case, we have 7 observations per method (both happen to be 7, but it’s good practice to keep them as separate variables):
n_homerange <- length(titi_homerange)
n_ppt <- length(titi_ppt)
Finally, we can get started with our estimation by declaring initial values for the parameters we’d like to estimate. Let’s start each chain at the sample mean of each dataset:
homerange_init <- mean(titi_homerange)
ppt_init <- mean(titi_ppt)
Ok, now we’re ready to formulate our JAGS dataset. In this case, we’re creating a jags_data object, which will contain our pilot data, and a jags_init object, which will contain our initial estimates of lambda from each of our pilot datasets:
jags_data <- list(n_homerange = n_homerange, n_ppt = n_ppt,
homerange = titi_homerange, ppt = titi_ppt)
jags_init <- list(lambda1 = homerange_init,
lambda2 = ppt_init) # oi jags innit
So we know already based on our statistical knowledge that Anand’s titi data follows a Poisson distribution, because she is counting calls after playbacks and also counting monkey groups as she searches through their home ranges.
Given this, what do we think could make a good prior distribution for our parameter (centered at 15); in other words, what do we think \(\theta\) will look like given what we already know?
Will it look uniform (Unif(10, 20))?
x <- seq(0, 30, by = 0.01)
plot(x, dunif(x, min = 10, max = 20), type = "l",
main = "Uniform?",
xlab = "Prior values",
ylab = "Probabilities")
Or will it look normal/Gaussian (Normal(15, 3))?
x <- seq(0, 30, by = 0.01)
plot(x, dnorm(x, mean = 15, sd = 3), type = "l",
main = "Normal?",
xlab = "Prior values",
ylab = "Probabilities")
Or maybe Gamma (Gamma(15, 1))?
x <- seq(0, 30, by = 0.01)
plot(x, dgamma(x, shape = 15, rate = 1), type = "l",
main = "Gamma?",
xlab = "Prior values",
ylab = "Probabilities")
In this case, let’s pick Gamma(15, 1) for the prior. Why?
Let’s think… What values can a Poisson distribution take? What do we remember about distributions in general?
Can Poisson count data be negative?
Can Poisson count data be 0?
Is Poisson count data discrete? Can \(\theta\) be continuous?
All in all, it’s up to YOU regarding what the prior should be… just be prepared to justify it!
Before computers were as powerful as they are today, statisticians were limited to only using prior distributions conjugate to the likelihood of their data. A conjugate prior is one that, when combined with a specific likelihood function, results in a posterior distribution from the same family as the prior distribution. In this case, you’ll just have to trust that this holds for a Gamma prior and Poisson-distributed data.
Other common conjugate pairs that you might run into and actually use include Beta priors with binomial likelihoods, Gamma priors with Poisson (or exponential) likelihoods, and Normal priors with normal likelihoods (when the variance is known).
Doing the math for all of this by hand can be… not fun, and in those cases you’re limited to modeling exclusively via conjugate priors. If your data looked roughly normal but you weren’t at all confident that they truly were, you might give up and use a uniform prior instead… which makes calculating the posterior by hand cumbersome. Prior to good computers, this would be tough! This is why Bayesian statistics are only now really getting their time in the limelight.
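Incidentally, for this particular prior/likelihood pair the by-hand math is manageable. Here’s a small check of the conjugacy claim above (my addition, not part of the homework): with a Gamma(a, b) prior and Poisson counts \(x_1, ..., x_n\), the posterior is Gamma(a + \(\sum x_i\), b + n) in closed form, so the JAGS posterior we build below for the home range data should end up looking very similar to this curve.
a <- 15; b <- 1                      # our Gamma(15, 1) prior
a_post <- a + sum(titi_homerange)    # updated shape
b_post <- b + length(titi_homerange) # updated rate
x <- seq(5, 25, by = 0.01)
plot(x, dgamma(x, shape = a_post, rate = b_post), type = "l",
     main = "Closed-form posterior (Gamma-Poisson conjugacy)",
     xlab = "lambda", ylab = "Density")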
set.seed(812) # feel free to change the seed
# let's make the model:
jags_model <- "model{
# likelihood
for(i in 1:n_homerange){
homerange[i] ~ dpois(lambda1)
}
for (i in 1:n_ppt){
ppt[i] ~ dpois(lambda2)
}
# prior
lambda1 ~ dgamma(15, 1)
lambda2 ~ dgamma(15, 1)
}"
# and fit the model:
fit <- jags.model(textConnection(jags_model),
data = jags_data, inits = jags_init,
n.chains = 2, n.adapt = n_adapt)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 14
## Unobserved stochastic nodes: 2
## Total graph size: 20
##
## Initializing model
fit_samples <- coda.samples(fit, c("lambda1", "lambda2"), n.iter = n_iter) %>%
window(start = n_burnin + n_adapt)
You may have noticed that we have two chains going on up there (n.chains = 2); what does that mean?
When we initiate and begin to run our algorithm, how do we keep it from just jumping wildly all over the place? We can obviously omit the early samples (burn-in plus adaptation), but we can also run the algorithm a SECOND time and then concatenate our simulated samples. This gives us a good idea of how consistent our model is and how much uncertainty there might be in our estimates.
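Having two chains also lets us run an optional convergence check (my addition) on the mcmc.list that coda.samples() just returned, before we convert anything to a data frame: the Gelman-Rubin diagnostic in {coda} compares variability within each chain to variability between chains, and values near 1 suggest the chains converged on the same distribution, while effectiveSize() estimates how many roughly independent draws we actually have.
gelman.diag(fit_samples)   # potential scale reduction factors; ~1 is what we want
effectiveSize(fit_samples) # effective number of independent samples per parameter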
Ok, once this runs we have our samples… now what? Let’s check to see if our algorithm actually worked the way we’d hoped by taking a look at our samples, and then we can run some diagnostics (trace and ACF plots).
plot(window(fit_samples), density = FALSE) # this is a trace plot (tells us where we're randomly walking)
plot(window(fit_samples), trace = FALSE) # this is a density plot (you know what this is!)
summary(window(fit_samples)) # these are our samples
##
## Iterations = 10000:20000
## Thinning interval = 1
## Number of chains = 2
## Sample size per chain = 10001
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## lambda1 15.27 1.391 0.009838 0.009724
## lambda2 13.37 1.285 0.009086 0.009228
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## lambda1 12.68 14.30 15.23 16.18 18.12
## lambda2 10.95 12.49 13.34 14.21 16.00
fit_samples <- as.data.frame(as.array(fit_samples)) # got to make a df
acf(fit_samples$lambda1.1)
acf(fit_samples$lambda1.2)
# let's do another diagnostic, when we're walking around, we need to guarantee that samples, from say 10 samples ago, aren't influencing the current sample. ACF plot!
acf(fit_samples$lambda2.1)
acf(fit_samples$lambda2.2)
# all of these look great! autocorrelation drops to basically 0 after the first couple of lags; you want the ACF to fall within the blue dashed bounds well before lag 10
# we've got two chains, so we need to concatenate our dudes
fit_samples <- data.frame(homerange =
c(fit_samples[, "lambda1.1"],
fit_samples[, "lambda1.2"]),
ppt =
c(fit_samples[, "lambda2.1"],
fit_samples[, "lambda2.2"]))
You can informally determine if a trace plot is “good” based on whether or not you could draw a straight horizontal line through the plot with a sharpie (this is literally how Bayesian statisticians talk about it)… in other words, does the scramble of your plot hover around a horizontal line, without trending or getting stuck anywhere? In this case, it looks like both of our chains fit this description well.
Autocorrelation, summarized by the autocorrelation function (ACF), tells us how much the i-th sample influences the samples that follow it… again, we don’t want to see a pattern, so you’d like the ACF plot to plummet as soon as possible (hopefully before lag 10) and not come back up again.
Overall, our algorithm is simple, and our data isn’t too messy, so all looks great.
Let’s now do some inference with our samples!
colors <- c("Homerange" = "orange3", "Playback point transect" = "steelblue3")
# let's put our two samples against each other
fit_samples %>% ggplot() + # ggplot them
geom_density(aes(x = homerange, fill = "Homerange", alpha = 0.5)) +
geom_density(aes(x = ppt, fill = "Playback point transect", alpha = 0.5)) +
xlab("Lambda Samples") +
ggtitle("Posterior distributions of Homerange\nand Playback point transect lambdas") +
scale_fill_manual(values = colors) +
geom_vline(xintercept = 15, linetype = 3) +
guides(alpha="none")
So let’s see how the playback point transect is working out.
# what percentage are less than 15?
ppt_credinterval <- quantile(fit_samples$ppt, probs = c(0.025, 0.975)); ppt_credinterval
## 2.5% 97.5%
## 10.95465 16.00195
# the probability lambda < 15
ppt_problessthan <- sum(fit_samples$ppt < 15)/length(fit_samples$ppt); ppt_problessthan
## [1] 0.8933607
So our credible interval is analogous to a confidence interval: it tells us the range of the most likely values for our parameter, and, as calculated above, we also get the proportion of our parameter estimates that fall below 15. In this case, our parameter estimates range from about 10.95 to 16.00, for a total range of about 5.05. In other words, with 95% probability there are between roughly 10.95 and 16.00 titi monkey groups at our field site according to the PPT method. According to this output, the probability that the PPT method results in an estimate of fewer than 15 titi monkey groups is about 89.3%. As a frequentist, this might be frustrating… that’s not a significant p-value telling me I’m right or wrong! In the Bayesian world, however, this can be a valid and very satisfying answer; it just depends on how we interpret it. How do we interpret this probability? What does it mean to not think of things in terms of significance? How certain are we that the parameter falls under 15? Do you think an 89.3% probability is high enough?
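As a side note (not required here), {coda} can also give you a highest posterior density (HPD) interval, the narrowest interval containing 95% of the posterior mass; since fit_samples$ppt is now just a numeric vector, we wrap it in as.mcmc() first.
HPDinterval(as.mcmc(fit_samples$ppt), prob = 0.95) # narrowest interval holding 95% of the posterior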
Let’s contrast this with our calculations using the home range method:
# credible interval
hr_credinterval <- quantile(fit_samples$homerange, probs = c(0.025, 0.975)); hr_credinterval
## 2.5% 97.5%
## 12.68252 18.12263
# the probability lambda > 15
hr_probmorethan <- sum(fit_samples$homerange > 15)/length(fit_samples$homerange); hr_probmorethan
## [1] 0.5640936
So our credible interval for the home range estimate is about 12.68 to 18.12, a range of about 5.44. Based on this, would you say there is more or less certainty about the number of titi groups at the field site using the home range method or the playback point transect method? What about our hypothesis that there are more than 15 groups? About 56.4% of our samples are greater than 15, so what do you think? Is this meaningfully different from 15?
Finally, are our estimates different from one another? The graph of our posterior distributions for the mean under each method, above, gives us an idea, but let’s take a more direct approach:
diff_data <- fit_samples$homerange - fit_samples$ppt # subtract the two samples, then create a credible interval!
diffr_credinterval <- quantile(diff_data, probs = c(0.025, 0.975)); diffr_credinterval
## 2.5% 97.5%
## -1.818458 5.608200
# okay but what's the probability homerange estimates more than ppt?
diffr_prop <- sum(diff_data > 0)/length(diff_data); diffr_prop
## [1] 0.8439156
Overall, based on our credible interval here, it looks like we still can’t say with 95% probability that these two methods give different estimates, but compared to the frequentist confidence interval (-1.964508, 6.250222) we’re reducing uncertainty: the Bayesian interval is a bit narrower. We also see that about 84.4% of the home range estimates are larger than the corresponding playback estimates. In this case, we can’t blame our sample size… our posterior itself tells us there’s roughly an 84% probability that the home range lambda is larger than the playback lambda. Is this different enough for you?
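If it helps to see it, here’s a quick plot (my addition) of the posterior for the difference itself; the mass to the right of the dotted line at zero corresponds to the ~84% we just calculated.
ggplot(data.frame(diff = diff_data)) + # posterior draws of (homerange lambda - ppt lambda)
  geom_density(aes(x = diff), fill = "grey70", alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = 3) +
  xlab("Homerange lambda - Playback point transect lambda") +
  ggtitle("Posterior distribution of the difference in lambdas")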
Using Bayesian methods, then, I’m able to say that the playback method provides a slightly more certain (narrower) estimate of the number of titi monkey groups at a given field site than the home range method does. Moreover, I’d say we have reason to think the number of titi monkey groups at our field site, using the playback method, is probably less than 15.
So, ultimately, what’s our answer? Again, with Bayesian methods, it’s up to you to interpret… no easy p-value to fall back on!
If you would prefer to use a normal prior, that’s a valid choice! Being able to use different priors than the conjugate is one of the benefits of using a JAGS algorithm.
That being said, do you think a normal distribution is an effective prior for these data? Why or why not?
Let’s do our initializations again, but this time using a normal prior. Theoretically, a normal prior may not be the best choice but it’s good to see how and why…
jags_data <- list(n_homerange = n_homerange, n_ppt = n_ppt,
homerange = titi_homerange, ppt = titi_ppt)
jags_init <- list(lambda1 = homerange_init,
lambda2 = ppt_init)
Whoa! What’s up with \(\tau\)? Instead of using the variance, JAGS parameterizes a normal prior with precision (which is just \(\frac{1}{\sigma^2}\)). Otherwise, our model should function identically to the one implemented with the gamma prior.
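Concretely, our earlier Normal(15, 3) candidate prior has a standard deviation of 3, which corresponds to the precision that will appear in the dnorm() calls in the model below:
sigma <- 3 # prior standard deviation from the Normal(15, 3) candidate prior
tau <- 1 / sigma^2 # precision = 1/variance = 1/9, the value used in dnorm(15, 1/9)
tau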
set.seed(812) # feel free to change my seed
# we first need to make the model
jags_model <- "model{
# likelihood
for(i in 1:n_homerange){
homerange[i] ~ dpois(lambda1)
}
for (i in 1:n_ppt){
ppt[i] ~ dpois(lambda2)
}
# prior
lambda1 ~ dnorm(15, 1/9)
lambda2 ~ dnorm(15, 1/9)
}"
fit_norm <- jags.model(textConnection(jags_model),
data = jags_data, inits = jags_init,
n.chains = 2, n.adapt = n_adapt) # what do the chains do?
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 14
## Unobserved stochastic nodes: 2
## Total graph size: 22
##
## Initializing model
# this saves the model as a variable
fit_samples_norm <- coda.samples(fit_norm, c("lambda1", "lambda2"), n.iter = n_iter) %>%
window(start = n_burnin + n_adapt) # let's get our samples <3
(Again: the 1/9 in the model is the precision, \(\tau = \frac{1}{\sigma^2}\), corresponding to a prior standard deviation of 3. And we’re still estimating lambdas because our data are Poisson generated.)
plot(window(fit_samples_norm), density = FALSE) # this is a trace plot (tells us where we're randomly walking)
plot(window(fit_samples_norm), trace = FALSE) # this is a density plot (you know what this is!)
summary(window(fit_samples_norm)) # these are our samples
##
## Iterations = 10000:25000
## Thinning interval = 1
## Number of chains = 2
## Sample size per chain = 15001
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## lambda1 15.34 1.325 0.007650 0.009881
## lambda2 13.56 1.284 0.007413 0.009540
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## lambda1 12.85 14.43 15.30 16.22 18.03
## lambda2 11.15 12.68 13.53 14.40 16.18
fit_samples_norm <- as.data.frame(as.array(fit_samples_norm)) # got to make a df
acf(fit_samples_norm$lambda1.1) # notice how we have a bit more lag; sample i is influencing i+1 a bit more than with the Poisson-gamma model
acf(fit_samples_norm$lambda1.2)
# let's do another diagnostic, when we're walking around, we need to guarantee that samples, from say 10 samples ago, aren't influencing the current sample. ACF plot!
acf(fit_samples_norm$lambda2.1)
acf(fit_samples_norm$lambda2.2)
# all of these still look good! autocorrelation drops to basically 0 after the first few lags; again, you want the ACF to fall within the blue dashed bounds well before lag 10
# we've got two chains, so we need to concatenate our dudes
fit_samples_norm <- data.frame(homerange =
c(fit_samples_norm[, "lambda1.1"],
fit_samples_norm[, "lambda1.2"]),
ppt =
c(fit_samples_norm[, "lambda2.1"],
fit_samples_norm[, "lambda2.2"]))
colors <- c("Gamma" = "pink", "Normal" = "dodgerblue")
# let's put our two samples against each other
ggplot() + # ggplot them
geom_density(aes(x = fit_samples$homerange, fill = "Gamma", alpha = 0.5)) +
geom_density(aes(x = fit_samples_norm$homerange, fill = "Normal", alpha = 0.5)) +
xlab("Lambda Samples") +
ggtitle("Posterior distributions of Homerange\nfor Gamma and Normal Priors") +
scale_fill_manual(values = colors) +
geom_vline(xintercept = 15, linetype = 3) +
guides(alpha="none")
ggplot() + # ggplot them
geom_density(aes(x = fit_samples$ppt, fill = "Gamma", alpha = 0.5)) +
geom_density(aes(x = fit_samples_norm$ppt, fill = "Normal", alpha = 0.5)) +
xlab("Lambda Samples") +
ggtitle("Posterior distributions of Playback point transect\nfor Gamma and Normal Priors") +
scale_fill_manual(values = colors) +
geom_vline(xintercept = 15, linetype = 3) +
guides(alpha="none")
Just a snapshot of what’s going on here: it looks like our normal prior is reducing uncertainty a bit with respect to the home range estimate, but maybe not so much for the playback point transect. We also see that while the centers of the home range posteriors are pretty similar, the same isn’t quite true for the playback point transect. Why? Think about our priors: how are they parameterized? Is our normal prior really influencing the posterior compared to the gamma prior? Is normal better? Or just different?
Keep this in mind as we figure out credible intervals and check out what’s up!
# credible interval
hr_normcredinterval <- quantile(fit_samples_norm$homerange, probs = c(0.025, 0.975)); hr_normcredinterval
## 2.5% 97.5%
## 12.85157 18.02919
# the probability lambda > 15
hr_normprobmorethan <- sum(fit_samples_norm$homerange > 15)/length(fit_samples_norm$homerange);
hr_normprobmorethan
## [1] 0.5919939
How does this compare to the gamma values? The normal prior’s credible interval has a range of about 5.18, versus about 5.44 for the gamma. So we’re shaving off a bit of uncertainty with the normal prior, cool! Looking at the probability that lambda is greater than 15, we get about 59.2%, pretty close to the gamma’s 56.4%. Why do you think we see these results?
What’s going on with the playback point transect?
ppt_normcredinterval <- quantile(fit_samples_norm$ppt, probs = c(0.025, 0.975)); ppt_normcredinterval
## 2.5% 97.5%
## 11.14939 16.17933
# the probability lambda < 15
ppt_normprobmorethan <- sum(fit_samples_norm$ppt < 15)/length(fit_samples_norm$ppt);
ppt_normprobmorethan
## [1] 0.8687421
So with the playback point transect, the normal prior gives a very slightly smaller credible interval (about 5.03) compared to the gamma (about 5.05). However, the probability that the playback point transect lambda is less than 15 drops slightly to about 86.9% (versus about 89.3% under the gamma prior).
Let’s try comparing the two again:
diff_data <- fit_samples_norm$homerange - fit_samples_norm$ppt # subtract the two samples, then create a credible interval!
diffr_credinterval <- quantile(diff_data, probs = c(0.025, 0.975)); diffr_credinterval
## 2.5% 97.5%
## -1.844981 5.398575
# okay but what's the probability homerange estimates more than ppt?
diffr_prop <- sum(diff_data > 0)/length(diff_data); diffr_prop
## [1] 0.8333444
So we get very similar results to the gamma estimate, but our credible interval is ever so slightly smaller. The probability that the home range lambda is greater than the playback lambda is also a bit smaller; what’s going on there?
The answer is… it depends
Let’s be clear here: you shouldn’t run multiple JAGS models and pick the one that gives you the answer you want. Rather, you gotta pick a prior BECAUSE it best models the parameter you’re looking at, and stick with it. If you’re a stats wizard, you can think it through. For instance, despite the slight increase in uncertainty relative to the normal prior, I think the gamma prior works better in this context because it respects the constraints of a Poisson rate (for one, it can’t be negative). But let’s say you aren’t a stats wizard; then you should probably go for either a normal or a uniform prior.
Got preliminary data? Cool! You’ve got a mean and a variance you can compute from that preliminary data and plug into your prior. That is, in the rjags model you can parameterize the prior via dnorm(preliminary mean, preliminary precision). Remember, precision is \(\frac{1}{\sigma^2}\)! Double remember: make sure you DON’T parameterize your prior with the actual data you’re analyzing! This is cheating and fake statistics!
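Here’s a hedged sketch of that idea. The pilot vector is hypothetical (it is not part of the titi dataset), and the resulting prior mean and precision would be passed into the model rather than computed from the data you’re actually analyzing:
pilot <- c(14, 16, 15, 13, 17) # hypothetical pilot counts from an earlier trip
prior_mean <- mean(pilot)      # where to center the prior
prior_prec <- 1 / var(pilot)   # precision = 1 / variance
# these could then be added to the jags_data list and used in the model string as:
#   lambda1 ~ dnorm(prior_mean, prior_prec)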
No preliminary data? Cry. No, but seriously, this is a tough situation to be in. First, get an idea of where the prior should be centered. What does the literature say about your parameter? Do you have numbers you can use? Can you guesstimate a value based on what’s been done before? If yes to any of those, next determine how certain you are about your parameter.
Feeling fairly certain? Great! Feel free to go for a normal or uniform prior. Choose a level of precision that fits your prior knowledge (I can’t tell you what will work for your situation!).
Not so certain? Bummer! You should still have some measure of center (it doesn’t have to be good). You should probably use a uniform prior whose upper and lower bounds reflect your uncertainty. In the JAGS code this might look like dunif(measure of center - large number, measure of center + large number). How large a number? It’s your choice; you gotta learn to trust yourself here!
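And here’s what such a deliberately vague prior might look like, using a made-up center of 15 and a half-width of 20 (both placeholders), truncated at 0 because a Poisson rate can’t be negative:
x <- seq(0, 40, by = 0.01)
plot(x, dunif(x, min = max(0, 15 - 20), max = 15 + 20), type = "l",
     main = "A deliberately vague uniform prior",
     xlab = "Prior values",
     ylab = "Probabilities")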
Pat yourself on the back! This is some pretty tough stuff and I’m
proud of you :)
I’ve hopefully illustrated the utility of Bayesian statistics for you; here’s the take-home:
Bayesian statistics aren’t better, they just require a different way of thinking about things than frequentist statistics
Bayesian statistics are about modeling our uncertainty about a given parameter
So when your sample sucks, you may want to hedge your bets with Bayesian statistics over frequentist
We looked at some titi monkey data to compare both frequentist and Bayesian methods, as well as our choice of prior
There’s not really a simple one-line piece of code to get you a posterior distribution (brms is a bit easier but doesn’t afford you the same flexibility as JAGS modeling, in my opinion)
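For the curious, here’s a rough sketch of what a comparable intercept-only Poisson model might look like in brms. This is my addition and isn’t run in this tutorial; it assumes {brms} and a working Stan toolchain are installed, and it leans on brms’ default priors rather than our Gamma(15, 1).
library(brms) # assumes brms (and Stan) are installed; compiling the model takes a while
# With family = poisson(), the intercept is modeled on the log scale, so a prior
# centered near log(15) (about 2.7) would play the role of our Gamma(15, 1) prior.
fit_brms <- brm(count ~ 1,
                data = data.frame(count = titi_homerange),
                family = poisson())
summary(fit_brms)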
This is a hyper simplified version of how JAGS and Bayesian statistics work. If you’d like an introduction in making linear models (and more complex ones, they follow the same-ish procedure) check out: https://bookdown.org/kevin_davisross/bayesian-reasoning-and-methods/regression.html
What if my prior has priors? https://bookdown.org/kevin_davisross/bayesian-reasoning-and-methods/hierarchical.html
Congrats!