Every statistics professor i’ve had has a serious parasocial relationship with xkcd

1 What are Bayesian statistics?

1.1 How do you pronounce Bayesian?

Perhaps the best way to understand Bayesian statistics is to contrast them with frequentist statistics. In most statistics courses we learn how to create and interpret frequentist statistics almost exclusively. In other words, we’ve been treating probability as a measure of frequency. In this paradigm, there is a true, correct, value for a parameter of interest, and as we collect more and more data, our estimation of the parameter (our estimand) will approach this ‘true’ value. This is fine when it works, but what do we do when we realize we realize we need a sample larger than what we can realistically manage? What if you really want to think of confidence intervals in terms of probability and not confidence? What even is confidence anyways?

Let’s say we don’t know what the population parameter might be for some variable of interest, but we think your sample data are good. We might feeling a bit iffy about how well our data matches up with the truth…

What if our data can’t meet those pesky assumptions that our statistics professors have been telling us about (and that we’ve probably been trying to ignore), like normally distributed residuals or homoskedasticity?

We’re mostly all biologists here, right? We know our data frequently (hah) sucks for doing frequentist statistics, and maybe we don’t want to be in an email correspondence trying to convince someone keeping our paper in reviewer Hell that our data is probably normal. We must ask ourselves: how well do we think we could defend our statistical methods?

Enter Bayesian statistics.

Bayesian statistics are otherwise known as subjective statistics, and for good reason! Rather than estimating a parameter we’re assuming to be true and trying to figure out how close our data come to it, we’re actually estimating our uncertainty about a parameter based on our dataset itself.

For example, let’s say we want to learn about a parameter: \(\theta\).

With a Bayesian paradigm, we make inferences on \(\theta\) combining our prior knowledge of \(\theta\) and the likelihood of receiving our data given \(\theta\). In other words, we’re integrating whatever data we have with our subjective thoughts of how the data is parameterized. From this fun collaboration, we generate a posterior probability distribution of samples of \(\theta\). Rather than a binary Yes/No answer, as in frequentist statistics, the posterior distribution is our “answer”.

So we have a probability distribution of how \(\theta\) is distributed, what does this mean and why should we care?

Well… if we’ve done it all correctly the posterior distribution tells us how certain/uncertain we are about \(\theta\) given our specific data. Instead of rejecting/failing to reject a hypothesis based on our data and whatever assumptions we’re making about it, the posterior distribution tells us the exact probabilities of \(\theta\) given our data as it actually is.

1.2 What’s going on under the hood

Bayesian inference is an application of Bayes’ Rule (we already learned this fun fact when we talked about Probability!).

Bayes’ Rule states that \(P(A|B) = \frac{P(A|B) \times P(A)}{P(B)}\). Instead of individual probabilities, however, we can think of these values in the context of the example given above. In other words, our posterior probability (\(P(A|B)\) for \(\theta\)) is operationalized in this formula using our prior probability (what we already think we know about \(\theta\)) and the likelihood of various values of \(\theta\) given our dataset… but how do we get to that from this equation?

With Bayesian statistics, we can workshop our formula a bit to get \(P(\theta | X_{1,2,...,n}) \propto P(X_{1,2,...,n}|\theta) \times P(\theta)\). In this equation, our posterior probability is the probability of \(\theta\) given our dataset (\(P(\theta | X_{1,2,...,n})\)), which is proportional to the likelihood of our dataset given \(\theta\) is a parameter that describes it (\(P(X_{1,2,...,n}|\theta)\)) and our prior knowledge of the probability of the parameter (\(P(\theta)\)).

We can make an educated guess on how \(\theta\) is distributed (an informative prior), or we can throw in the towel and admit we know very little regarding how \(\theta\) is distributed (an uninformative prior). Our data can’t be changed (this is the objective part of Bayesian statistics), but we can see that is distributed in a certain way. From these two distributions (our prior and our actual data distribution), our goal is then to find how \(\theta\) is distributed GIVEN our data.

But again: what does all of this mean?

Bayesian statistics are hard

Bayesian statistics can be complex and a lot of work. Without recent beefy computers, Bayesian statistics had to be calculated by hand (this is hard and not fun) and were EXTREMELY limited in their applications.

However, we now have the computational power to make Bayesian statistics more accessible and realistically applicable, including in R.

Enter: {rjags}.

2 Bayesian modeling using `rjags`

RAWR, Jaguar

To demonstrate, let’s check out the titi monkeys from Homework 2 again, but this time from a Bayesian perspective!

Titi, she’s mother

2.1 Let’s load some libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(coda)
library(rjags)

## Linked to JAGS 4.3.2
## Loaded modules: basemod,bugs

2.2 …and simulate some data

So if you remember, our primatologist believed from prior experience that the average number of titi monkey calls she would hear in a 2hr period at her study site in Ecuador would be around 15. If we think each call response to her playback is a unique titi monkey group, could that accurately represent the actual number of titi monkeys at the field site?

Let’s imagine (and simulate) that she collected titi monkey call data during a week of field work using 2 separate methods: one using a playback point transect method (listening for the number of calls, assuming that each separate call she heard represented a different group of titi monkeys), and the other using actual observations from surveying the forest to visit known titi monkey home ranges at the field site.

Let’s use the published population means from the paper by Anand Dacier (the inspiration for this homework assignment), collected at the Tiputini Biodiversity Station, in Ecuador) to simulate the results for these two methods:

set.seed(812) # set seed, Jimmy and this is his birthday; feel free to change up your seed; it's just a random number
titi_ppt <- rpois(n = 7, lambda = 13.8) # this will "collect" group counts via the playback point transect method
titi_homerange <- rpois(n = 7, lambda = 16) # this will "collect" group counts via the home range visitation method

2.3 Frequentist inference

2.3.1 Hypothesis Test

Let’s first check and see how frequentist methods would estimate whether our Anand’s prior assumption of 15 calls or groups in the field site is correct given the data she collected. We can do this using a t-test to see if there is a significant difference between the assumed mean number of calls heard at playbacks and the observed number of groups seen by visiting home ranges as measured in the field:

# are these consistent with 15 groups?
t.test(titi_homerange, alternative = "greater", mu = 15) # this test asks: is the home range method detecting more than 15 groups?

## 
##  One Sample t-test
## 
## data:  titi_homerange
## t = 0.23625, df = 6, p-value = 0.4105
## alternative hypothesis: true mean is greater than 15
## 95 percent confidence interval:
##  12.93568      Inf
## sample estimates:
## mean of x 
##  15.28571

t.test(titi_ppt, alternative = "less", mu = 15) # this test asks: is the playback method detecting more than 15 groups?

## 
##  One Sample t-test
## 
## data:  titi_ppt
## t = -1.2914, df = 6, p-value = 0.122
## alternative hypothesis: true mean is less than 15
## 95 percent confidence interval:
##      -Inf 15.93727
## sample estimates:
## mean of x 
##  13.14286

Here, we see there is no significant difference in our answer based on the method used; in other words, we fail to reject our null hypothesis that there are on average 15 groups of titi monkeys in a given area. This is no big deal but also - given this were the work of a whole field season in the Amazon - we’ve spent all this money, and done all this work, got what looks like interesting data just to confirm that our two methods are pretty much the same. It happens…

t.test(titi_homerange, titi_ppt)

## 
##  Welch Two Sample t-test
## 
## data:  titi_homerange and titi_ppt
## t = 1.1404, df = 11.657, p-value = 0.277
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.964508  6.250222
## sample estimates:
## mean of x mean of y 
##  15.28571  13.14286

So… we don’t have significant evidence to suggest that either method is different from the other. But… we know they’re different (we just learned that they area!). So how can we dig into this system a bit more to better understand or vidualize these differences…?

2.4 Bayesian inference with `rjags`

We can try to do this by generating a LOT of data via machine learning Markov-Chain Monte-Carlo (MCMC) methods. These methods essentially create a dataset drawn from a predefined probability distribution, then another one, then another one, and another one, and so on… and allow us to define how we’d like to sample distributions from this large chain of simulated distributions to better understand our sample. We’re going to be sampling using the Just Another Gibbs Sampler (JAGS) algorithm, which creates Markov-Chain Monte Carlo (MCMC) samples via a Metropolis Hastings algorithm. Essentially, we’re training the algorithm to create a distribution of data using both our prior beliefs of how we think the data should look (i.e., that the average number of groups detected should be 15) and the data we actually collected using each of our methods (i.e., how likely it is that each method detects 15 groups). This is a Gaussian random walk, meaning we’re going to be “randomly walking” around possible values of the parameter we hope to estimate until the algorithm begins to converge on a plausible range of values for our parameter given both our prior beliefs and the data we’ve collected. These samples constitute our posterior distribution for the parameter, meaning we’ll be generating a lot of potential parameter estimates and looking at how those, themselves, are distributed in order to infer what we can know about the parameter.

2.4.1 JAGS Initializations

Let’s initialize our algorithm…

To begin, as our algorithm conducts its random walk it will try out some truly outlandish parameter estimates. This is because, in part, the algorithm compares parameter estimates and subsquent distributions to converge on the most likely estimate (again, given both our prior assumptions and the data itself). Because these first few steps are going to be a bit wild, it’s always a good idea to declare what’s called a burn-in… this is a number that defines how many iterations of parameter estimates you’d like to throw our from our final distribution of estimates once our algorithm has converged. This number could be big or small, but typically you just want to make sure that it throws out the “wobble” period of outlandish estimates at the beginning of the MCMC chain. In this case, I’m throwing out 5000 iterations in my n_burnin option.

Of course, we also need to declare how many iterations we’d like to try in order to converge on what might be a reasonable parameter estimate… this is more an art than a science. Generally, you want to have enough iterations for your MCMC chain to converge on a reasonable parameter estimate, but not so much that you’re over-inflating the significance of your estimate. For now, we’ll start with 20000 iterations, set using the n_iter option.

Finally, we also want to mix things up a bit. We can deflate the build-up of estimates by randomly excluding, or throwing out, some of our estimates from the chain. This is a method that’s intended to functionally randomize our estimates… because something we definitely want to avoid is auto-correlation between our simulated distributions (in other words, we don’t want there to be a pattern in our distributions, given that would mean our parameter estimates aren’t independent). Do do this, I’m setting the n_adapt parameter to 5000, which will throw out every 5000th simulated sample distribution and parameter estimate.

n_burnin <- 5000 # we're going to throw 5000 samples away as the algorithm settles in

n_iter <- 20000 # we're going to generate 20000 total samples

n_adapt <- 5000 # we're  going throw another 5000 random samples away from the total chain

Once this runs, we’re going to have a total of 10,000 (hopefully) random sampling distributions generated by our MCMC algorithm.

We’ll also set out parameter estimates for each method we’re testing based on some pilot data collected over 7 days. In this case, we’ll start with 7 random samples for each method in trying to estimate the number of titi groups (both are 7, but it’s good practice to differentiate):

n_homerange <- length(titi_homerange) 
n_ppt <- length(titi_ppt)

Finally, we’re can get started with our estimation be declaring which parameter we would like to estimate. Let’s start with the mean:

homerange_init <- mean(titi_homerange)
ppt_init <- mean(titi_ppt)

Ok, now we’re ready to formulate our JAGS dataset. In this case, we’re creating a jags_data object, which will contain our pilot data, and a jags_init object, which will contain our initial estimates of lambda from each of our pilot datasets:

jags_data <- list(n_homerange = n_homerange, n_ppt = n_ppt, 
                  homerange = titi_homerange, ppt = titi_ppt)
jags_init <- list(lambda1 = homerange_init, 
                  lambda2 = ppt_init) # oi jags innit

2.4.2 What prior should we pick?

So we know already based on our statistical knowledge that Anand’s titi data follows a Poisson distribution, because she is counting calls after playbacks and also counting monkey groups as she searches through their home ranges.

Given this, what do we think could make a good prior distribution of our data (centered at 15); in other words, what do we think our data will look like given that we already know?

Will it look uniform (Unif(10, 20))?

x <- seq(0, 30, by = 0.01)
plot(x, dunif(x, min = 10, max = 20), type = "l", 
     main = "Uniform?",
     xlab = "Prior values",
     ylab = "Probabilities")

Or will it look normal/Gaussian (Normal(15, 3))?

x <- seq(0, 30, by = 0.01)
plot(x, dnorm(x, mean = 15, sd = 3), type = "l", 
     main = "Normal?",
     xlab = "Prior values",
     ylab = "Probabilities")

or maybe Gamma Gamma(15, 1)?

x <- seq(0, 30, by = 0.01)
plot(x, dgamma(x, shape = 15, rate = 1), type = "l", 
     main = "Gamma?",
     xlab = "Prior values",
     ylab = "Probabilities")

In this case, let’s pick Gamma(15, 1) for the prior, why?

Let’s think… What values can a poisson distribution take? What do we remember about distributions in general?

Can Poisson count data be negative?

Can Poisson count data be 0?

Is Poisson count data discrete? Can \(\theta\) be continuous?

All in all, it’s up to YOU regarding what the prior should be… just be prepared to justify it!

Before computers were as powerful as they are today, statisticians were limited to only using prior distributions conjugate to the likelihood of their data. A conjugate prior is one that, when combined with a specific likelihood function, results in a posterior distribution that’s parameterized by the same family as the prior distribution. In this case, you’ll just have to trust that this is the case with the a Gamma prior and Poisson distributed data.

Other common conjugates that you might run in to and actually use include:

Binomial likelihood = Beta prior
Normal likelihood = Normal-inverse gamma prior

Doing the math for all of this by hand can be… not fun, and in those cases one is limited to modeling exclusively via conjugate priors. In a case where you have normally distributed data and you’re not at all confident about whether or not it’s truly normal, you might give up and use a uniform prior… which might make calculating posteriors cumbersome. Prior to good computers, this would be tough! This is why Bayesian statistics are only now really getting their time in the limelight.

2.4.3 Gamma JAGS model

set.seed(812) # feel free to change the seed

# let's make the model:
jags_model <- "model{
  # likelihood
  for(i in 1:n_homerange){
  homerange[i] ~ dpois(lambda1)
  }
  for (i in 1:n_ppt){
  ppt[i] ~ dpois(lambda2)
  }
  
  # prior
  lambda1 ~ dgamma(15, 1)
  lambda2 ~ dgamma(15, 1)
}"

# and fit the model:
fit <- jags.model(textConnection(jags_model),
               data = jags_data, inits = jags_init, 
               n.chains = 2, n.adapt = n_adapt)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 14
##    Unobserved stochastic nodes: 2
##    Total graph size: 20
## 
## Initializing model

fit_samples <- coda.samples(fit, c("lambda1", "lambda2"), n.iter = n_iter) %>% 
  window(start = n_burnin + n_adapt)

You may have noticed that we have two chains going on up there (n.chains = 2), what does that mean?

When we initiate and begin to run our algorithm, how do we control for the algorithm from just going wild jumping all over the place? We can obviously omit the first 5,000 samples and other random elements (burn in plus adaptation) but we can also run the algorithm a SECOND time and then concatenate our simulated samples. This will give us a good idea of how consistent our model is and how much uncertanty there might be in our estimates.

Ok, once this runs we have our samples… now what? Let’s check to see if our algorithm actually worked the way we’d hoped by taking a look at our samples, and then we can run some diagnostics (trace and ACF plots).

plot(window(fit_samples), density = FALSE) # this is a trace plot (tells us where we're randomly walking)

plot(window(fit_samples), trace = FALSE) # this is a density plot (you know what this is!)

summary(window(fit_samples)) # these are our samples

## 
## Iterations = 10000:20000
## Thinning interval = 1 
## Number of chains = 2 
## Sample size per chain = 10001 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##          Mean    SD Naive SE Time-series SE
## lambda1 15.27 1.391 0.009838       0.009724
## lambda2 13.37 1.285 0.009086       0.009228
## 
## 2. Quantiles for each variable:
## 
##          2.5%   25%   50%   75% 97.5%
## lambda1 12.68 14.30 15.23 16.18 18.12
## lambda2 10.95 12.49 13.34 14.21 16.00

fit_samples <- as.data.frame(as.array(fit_samples)) # got to make a df

acf(fit_samples$lambda1.1)

acf(fit_samples$lambda1.2)

# let's do another diagnostic, when we're walking around, we need to guarantee that samples, from say 10 samples ago, aren't influencing the current sample. ACF plot! 
acf(fit_samples$lambda2.1)

acf(fit_samples$lambda2.2)

# all of these look great! autocorrelation between sample 1 and sample 3 is basically 0, good model (you want your model to be between the blue dotted lines I'd say before lag 10)

# we've got two chains, so we need to concatenate our dudes
fit_samples <- data.frame(homerange = 
                            c(fit_samples[, "lambda1.1"], 
                              fit_samples[, "lambda1.2"]),
                          ppt = 
                            c(fit_samples[, "lambda2.1"],
                                  fit_samples[, "lambda2.2"]))

You can informally determine if a trace plot is “good” based on whether or not you could draw a line through the plot with a sharpie (this is literally how Bayesian statisticians conceptualize this)… in other words, does the scramble of your plot result in a horizontally straigt line? In this case, it looks like both of our chains fit this well.

Autocorrelation, modeled by an autocorrelation function (ACF) tells us how much the ith sample impacts the following samples… again, we don’t want to see a pattern, so you’d like to see the ACF plot plummet as soon as possible (hopefully before lag 10) and not come back up again.

Overall, our algorithm is simple, and our data isn’t too messy, so all looks great.

Let’s now do some inference with our samples!

2.4.4 Gamma JAGS inference

colors <- c("Homerange" = "orange3", "Playback point transect" = "steelblue3")
# let's put our two samples against each other
fit_samples %>% ggplot() + # ggplot them
  geom_density(aes(x = homerange, fill = "Homerange", alpha = 0.5)) +
  geom_density(aes(x = ppt, fill = "Playback point transect", alpha = 0.5)) +
  xlab("Lambda Samples") +
  ggtitle("Posterior distributions of Homerange\nand Playback point transect lambdas") +
  scale_fill_manual(values = colors) +
  geom_vline(xintercept = 15, linetype = 3) + 
  guides(alpha="none")

So let’s see how the playback point transect is working out.

# what percentage are less than 15?
ppt_credinterval <- quantile(fit_samples$ppt, probs = c(0.025, 0.975)); ppt_credinterval

##     2.5%    97.5% 
## 10.95465 16.00195

# the probability lambda < 15
ppt_problessthan <- sum(fit_samples$ppt < 15)/length(fit_samples$ppt); ppt_problessthan

## [1] 0.8933607

So our credible interval is analogous to a confidence interval, and is telling us the range of the most likely values for our parameter, and, as articulated above, what proportion of our parameter estimates are less than 15. In this case, our parameter estimates range from 10.97 to 15.98, for a total range of 5.01. On other words, with 95% probability we can be confident that there are between 10.97 an 15.98 titi monkey groups at our field site using the PPT method. According to this output, the probability that the PPT method will result in an estimate of less than 15 titi monkeys is 89.5%. As a frequentist, this might be frustrating… that’s not a significant p-value telling me I’m right or wrong! However in Bayesian world this can be a valid, and very satisfying answer, it just matters on how we interpret it. How do we interpret this probability? What does it mean to not think of things in terms of significance? How certain are we that the parameter falls under 15? Do you think 89.5% probability is high enough?

Let’s contrast this with our calculations using the home range method:

# credible interval
hr_credinterval <- quantile(fit_samples$homerange, probs = c(0.025, 0.975)); hr_credinterval

##     2.5%    97.5% 
## 12.68252 18.12263

# the probability lambda > 15
hr_probmorethan <- sum(fit_samples$homerange > 15)/length(fit_samples$homerange); hr_probmorethan

## [1] 0.5640936

So our crediblity interval for the home range estimate is 12.66 to 18.12, with a range of 5.46. Based on this, would you say that there is more or less certainty about the number of titis at the field site based on the home range method or the playback point transects methods? What about our null hypothesis that we have greater than 15 groups. 56.3% of our samples are greater than 15, so what do you think? Is this different from 15?

Finally, are our estimates different from one another? The graph of our posterior distributions for the mean using each method, above, gives us an diea, but let’s take a more direct approach:

diff_data <- fit_samples$homerange - fit_samples$ppt # subtract the two samples, then create a credible interval!
diffr_credinterval <- quantile(diff_data, probs = c(0.025, 0.975)); diffr_credinterval

##      2.5%     97.5% 
## -1.818458  5.608200

# okay but what's the probability homerange estimates more than ppt?
diffr_prop <- sum(diff_data > 0)/length(diff_data); diffr_prop

## [1] 0.8439156

Overall, based on our credibility interval here, it looks like we still can’t say with 95% probability that these two methods are different, but compared to the frequentist confidence interval (-1.964508, 6.250222) we’re reducing uncertainty. The Bayesian method is giving us a more precise estimate with less uncertainty. We also can see that 84% of homerange estimates are larger than those provided by playbacks. In this case, we can’t blame our sample size… our sample itself tells us that 84% of home range estimates are going to be larger. Is this different enough for you?

Using Bayesian methods, then, I’m able to say that the playback method not only provides a slightly more certain estimate of the number of titi monkeys in a given field site over the home range method. Moreover, I’d say we have reason to think the number of titi monkeys in our field site, using the playback method, is probably less than 15.

So, ultimately, what’s our answer? Again, with Bayesian methods, it’s up to you to interpret… no easy p-value to fall back on!

2.4.5 How would a normal prior work with this dataset?

“Normies possess a lack of interest in ideas not easily accessible or being outside of their/society’s current range of acceptance. A straight. A follower.” - Urban Dictionary

If you would prefer to use a normal prior, that’s a valid choice! Being able to use different priors than the conjugate is one of the benefits of using a JAGS algorithm.

That being said, do you think a normal distribution is an effective prior for these data? Why or why not?

Let’s do our initializations again, but this time using a normal prior. Theoretically, a normal prior may not be the best choice but it’s good to see how and why…

jags_data <- list(n_homerange = n_homerange, n_ppt = n_ppt, 
                  homerange = titi_homerange, ppt = titi_ppt)
jags_init <- list(lambda1 = homerange_init, 
                  lambda2 = ppt_init)

Whoa! what’s up with \(\tau\)? Instead of using variance, JAGS parameterizes a normal prior with precision (which is just \(\frac{1}{\sigma^2}\)). Otherwise, our model should function identically to that implemented with the gamma prior.

set.seed(812) # feel free to change my seed
# we first need to make the model
jags_model <- "model{
  # likelihood
  for(i in 1:n_homerange){
  homerange[i] ~ dpois(lambda1)
  }
  for (i in 1:n_ppt){
  ppt[i] ~ dpois(lambda2)
  }
  
  # prior
  lambda1 ~ dnorm(15, 1/9)
  lambda2 ~ dnorm(15, 1/9)
}"

fit_norm <- jags.model(textConnection(jags_model),
               data = jags_data, inits = jags_init, 
               n.chains = 2, n.adapt = n_adapt) # what do the chains do?

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 14
##    Unobserved stochastic nodes: 2
##    Total graph size: 22
## 
## Initializing model

# this saves the model as a variable

fit_samples_norm <- coda.samples(fit_norm, c("lambda1", "lambda2"), n.iter = n_iter) %>% 
  window(start = n_burnin + n_adapt) # let's get our samples <3

Whoa! what’s up with 1/3 in the model? Instead of using variance, JAGS parameterizes a normal prior with precision (which is just \(\frac{1}{\sigma^2}\) OR \(\tau\)). Otherwise, our model should function identical to the gamma prior. Again, we’re looking at lambdas because our data is poisson generated.

plot(window(fit_samples_norm), density = FALSE) # this is a trace plot (tells us where we're randomly walking)

plot(window(fit_samples_norm), trace = FALSE) # this is a density plot (you know what this is!)

summary(window(fit_samples_norm)) # these are our samples

## 
## Iterations = 10000:25000
## Thinning interval = 1 
## Number of chains = 2 
## Sample size per chain = 15001 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##          Mean    SD Naive SE Time-series SE
## lambda1 15.34 1.325 0.007650       0.009881
## lambda2 13.56 1.284 0.007413       0.009540
## 
## 2. Quantiles for each variable:
## 
##          2.5%   25%   50%   75% 97.5%
## lambda1 12.85 14.43 15.30 16.22 18.03
## lambda2 11.15 12.68 13.53 14.40 16.18

fit_samples_norm <- as.data.frame(as.array(fit_samples_norm)) # got to make a df

acf(fit_samples_norm$lambda1.1) # notice how we have a bit more lag, sample i is influencing i+1 a bit more than w the poisson-gamma

acf(fit_samples_norm$lambda1.2)

# let's do another diagnostic, when we're walking around, we need to guarantee that samples, from say 10 samples ago, aren't influencing the current sample. ACF plot! 
acf(fit_samples_norm$lambda2.1)

acf(fit_samples_norm$lambda2.2)

# all of these look great! autocorrelation between sample 1 and sample 3 is basically 0, good model (you want your model to be between the blue dotted lines I'd say before lag 10)

# we've got two chains, so we need to concatenate our dudes
fit_samples_norm <- data.frame(homerange = 
                            c(fit_samples_norm[, "lambda1.1"], 
                              fit_samples_norm[, "lambda1.2"]),
                          ppt = 
                            c(fit_samples_norm[, "lambda2.1"],
                                  fit_samples_norm[, "lambda2.2"]))

2.4.6 How does normal compare to gamma as a prior

colors <- c("Gamma" = "pink", "Normal" = "dodgerblue")
# let's put our two samples against each other
ggplot() + # ggplot them
  geom_density(aes(x = fit_samples$homerange, fill = "Gamma", alpha = 0.5)) +
  geom_density(aes(x = fit_samples_norm$homerange, fill = "Normal", alpha = 0.5)) +
  xlab("Lambda Samples") +
  ggtitle("Posterior distributions of Homerange\nfor Gamma and Normal Priors") +
  scale_fill_manual(values = colors) +
  geom_vline(xintercept = 15, linetype = 3) + 
  guides(alpha="none")

ggplot() + # ggplot them
  geom_density(aes(x = fit_samples$ppt, fill = "Gamma", alpha = 0.5)) +
  geom_density(aes(x = fit_samples_norm$ppt, fill = "Normal", alpha = 0.5)) +
  xlab("Lambda Samples") +
  ggtitle("Posterior distributions of Playback point transect\nfor Gamma and Normal Priors") +
  scale_fill_manual(values = colors) +
  geom_vline(xintercept = 15, linetype = 3) + 
  guides(alpha="none")

Just a snapshot of what’s going on here, it looks like our normal prior is reducing uncertainty with respect to homerange, but maybe not that much with playback point transect. We also see that while the centers of homerange are pretty similar, it is not the same with the playback point transect. Why? Think about our priors, how are they parameterized? Is our normal prior really influencing the posterior compared to gamma prior? Is normal better? Or just different?

Keep this in mind, let’s figure out credible intervals and check out what’s up!

2.4.7 Normal JAGS inference

# credible interval
hr_normcredinterval <- quantile(fit_samples_norm$homerange, probs = c(0.025, 0.975)); hr_normcredinterval

##     2.5%    97.5% 
## 12.85157 18.02919

# the probability lambda > 15
hr_normprobmorethan <- sum(fit_samples_norm$homerange > 15)/length(fit_samples_norm$homerange);
hr_normprobmorethan

## [1] 0.5919939

How does this compare to the gamma values? The normal credible interval range is 5.18, contrast with the poisson which is 5.47. We’re reducing a bit of uncertainty with the normal credible interval relative to the gamma, cool! Looking at probabilities greater than 15, we’re basically identical to the gamma, cool! Why do we see these results do you think?

What’s going on with the playback point transect?

ppt_normcredinterval <- quantile(fit_samples_norm$ppt, probs = c(0.025, 0.975)); ppt_normcredinterval

##     2.5%    97.5% 
## 11.14939 16.17933

# the probability lambda < 15
ppt_normprobmorethan <- sum(fit_samples_norm$ppt < 15)/length(fit_samples_norm$ppt);
ppt_normprobmorethan

## [1] 0.8687421

So with playback point transect for the normal we have a slightly smaller credible interval (4.95) compared to the gamma (5.2). However, the probability that our playback point transect estimates less than 15 is 86.7%.

Let’s try comparing the two again:

diff_data <- fit_samples_norm$homerange - fit_samples_norm$ppt # subtract the two samples, then create a credible interval!
diffr_credinterval <- quantile(diff_data, probs = c(0.025, 0.975)); diffr_credinterval

##      2.5%     97.5% 
## -1.844981  5.398575

# okay but what's the probability homerange estimates more than ppt?
diffr_prop <- sum(diff_data > 0)/length(diff_data); diffr_prop

## [1] 0.8333444

So we get very similar results to the Gamma estimate, but our credible interval is ever so slightly smaller. Our probability homerange is greater than ppt is also a bit smaller, what’s going on there?

This figure was uploaded by Hideyoshi Yanagisawa

2.4.8 So… which is better?

The answer is… it depends

Let’s be clear here, you shouldn’t work multiple JAGS models and pick the one that gives you the answer you want. Rather, you gotta pick a prior BECAUSE it best models the parameter you’re looking at and stick with it. If you’re a stats wizard, you can think it through. For instance, despite the slight increase in uncertainty relative to a normal prior, I think a gamma prior works better for this context because of the limitations of a poisson distribution. But let’s say you aren’t a stats wizard, you should probably go for either a normal or a uniform prior.

2.4.9 I’m still confused at what my prior should be

2.4.9.1 Situation 1: You have old preliminary data

Cool! You’ve got a variance and a mean you can get from preliminary data and plug into your prior. That is, in the rjags model you can parameterize via dnorm(preliminary mean, preliminary precision). Remember precision is \(\frac{1}{\sigma^2}\)! Double remember, make sure you DON’T parameterize your prior with your actual data! This is cheating and fake statistics!

2.4.9.2 Situation 2: You don’t have preliminary data

Cry. No, but seriously, this is not the most fun tough situation to be in. First, get an idea of where the prior should be centered. What does the literature say about your parameter? Do you have numbers you can use? Can you guesstimate a value based on what’s been done before? If yes to any of those things, determine how certain you are about your parameter.

2.4.9.2.1 You’re a bit certain

Great! Feel free to go for a normal or uniform prior. Choose a level of precision that fits your prior (I can’t tell you what’s going to work for your situation!)

2.4.9.2.2 You’re more than a bit uncertain

Bummer! You should have some measure of center (it doesn’t have to be good). You should probably use a Uniform prior such that upper and lower bounds are reflective of your uncertainty. In the JAGS code this might look like…
dunif(measure of center - large number, measure of center + large number).
How large a number? It’s your choice, you gotta learn to trust yourself here!

3 Conclusion

slay

Pat yourself on the back! This is some pretty tough stuff and I’m proud of you :)

I’ve hopefully illustrated the utility of Bayesian statistics for you, here’s the take home:

Bayesian statistics aren’t better, they just require a different way of thinking about things than frequentist statistics
Bayesian statistics are about modeling our uncertainty about a given parameter
So when your sample sucks, you may want to hedge your bets with Bayesian statistics over frequentist
We looked at some titi monkey data to compare both frequentist and Bayesian methods, as well as our choice of prior
There’s not really a simple one-line piece of code to get you a posterior distribution (brms is a bit easier but doesn’t afford you the same flexibility as JAGS modeling in my opinion)
This is a hyper simplified version of how JAGS and Bayesian statistics work. If you’d like an introduction in making linear models (and more complex ones, they follow the same-ish procedure) check out: https://bookdown.org/kevin_davisross/bayesian-reasoning-and-methods/regression.html
What if my prior has priors? https://bookdown.org/kevin_davisross/bayesian-reasoning-and-methods/hierarchical.html

Congrats!

An Introduction to Bayesian Analysis for Biological Anthropologists

Jimmy Erkens

2025-04-15

1 What are Bayesian statistics?

1.1 How do you pronounce Bayesian?

1.2 What’s going on under the hood

2 Bayesian modeling using `rjags`

2.1 Let’s load some libraries

2.2 …and simulate some data

2.3 Frequentist inference

2.3.1 Hypothesis Test

2.4 Bayesian inference with `rjags`

2.4.1 JAGS Initializations

2.4.2 What prior should we pick?

2.4.3 Gamma JAGS model

2.4.4 Gamma JAGS inference

2.4.5 How would a normal prior work with this dataset?

2.4.6 How does normal compare to gamma as a prior

2.4.7 Normal JAGS inference

2.4.8 So… which is better?

2.4.9 I’m still confused at what my prior should be

2.4.9.1 Situation 1: You have old preliminary data

2.4.9.2 Situation 2: You don’t have preliminary data

2.4.9.2.1 You’re a bit certain

2.4.9.2.2 You’re more than a bit uncertain

3 Conclusion

An Introduction to Bayesian Analysis for Biological Anthropologists

Jimmy Erkens

2025-04-15

1 What are Bayesian statistics?

1.1 How do you pronounce Bayesian?

1.2 What’s going on under the hood

2 Bayesian modeling using rjags

2.1 Let’s load some libraries

2.2 …and simulate some data

2.3 Frequentist inference

2.3.1 Hypothesis Test

2.4 Bayesian inference with rjags

2.4.1 JAGS Initializations

2.4.2 What prior should we pick?

2.4.3 Gamma JAGS model

2.4.4 Gamma JAGS inference

2.4.5 How would a normal prior work with this dataset?

2.4.6 How does normal compare to gamma as a prior

2.4.7 Normal JAGS inference

2.4.8 So… which is better?

2.4.9 I’m still confused at what my prior should be

2.4.9.1 Situation 1: You have old preliminary data

2.4.9.2 Situation 2: You don’t have preliminary data

2.4.9.2.1 You’re a bit certain

2.4.9.2.2 You’re more than a bit uncertain

3 Conclusion

2 Bayesian modeling using `rjags`

2.4 Bayesian inference with `rjags`