Bayesian analysis: probability of model parameters given data and priors
Frequentist analysis: (im)probability of data given model
Works better with small samples (meaningful priors improve convergence)
More complex models are possible
Straightforward to quantify uncertainty of any fitted values / contrasts / predictions
I got it!!!
Generative models
Bayesian mixed models with brms
How to fit and diagnose
How to plot
How to calculate effects
How to report
Useful resources and further directions
Generative = “generating data”
Explicitly specify the process that generated observations
A single normally distributed measure, eg human height
“Normal” = Gaussian = bell curve = 1 / exp^2
rnorm()
## [1] 1.90524940 -2.15555377 -0.05026634 -0.04281362 1.00146129 1.15929253
dnorm()
quantiles = seq(-3, 3, length.out = 100)
dens = dnorm(x = quantiles, mean = 0, sd = 1)
plot(quantiles, dens, type = 'l')
# d is 100 numbers drawn from a normal distribution
d = rnorm(100, mean = 0, sd = 1)
hist(d, freq = FALSE)
# overlay the theoretical pdf
x = seq(-3, 3, length.out = 100)
points(x, dnorm(x, mean = 0, sd = 1), type = 'l', col = 'blue')
Guess the mean and SD of this (hypothetical) distribution of the height of adult humans:
mean = 150, sd = 50
mean = 150, sd = 3
mean = 165, sd = 10
Pick a generative model
Pick parameter values based on prior knowledge
Evaluate fit to observed data
Good fit? Keep these parameters and try again
Bad fit? Discard and try again
Long list of good parameters = posterior distribution = credible parameter values that might have generated our data
Define a process that generated data, here dnorm(mean, sd) –> LIKELIHOOD
Define reasonable ranges of our parameters, here mean & sd –> PRIOR
Explore fit to data within this “prior” range, analytically or with MCMC
Credible values of parameters are those that best reproduce data
Posterior distribution of model parameters ~ prior * likelihood
Bayes rule: posterior = probability of data * prior / average probability of data
McElreath 2020, Ch. 4.1 - 4.3
Model: height ~ normal(mean, SD)
MCMC = Markov Chain Monte Carlo
Algorithm(s) for approximating posterior distribution of model parameters
Ex.: if guessing mean/SD in rnorm() for human height, we get a JOINT posterior distribution of:
credible values for MEAN given a particular SD
credible values for SD given a particular MEAN
## # A draws_df: 1000 iterations, 4 chains, and 3 variables
## b_Intercept sigma lp__
## 1 163 9.6 -378
## 2 165 10.5 -378
## 3 165 10.7 -378
## 4 163 9.9 -377
## 5 164 10.7 -378
## 6 164 9.5 -378
## 7 164 10.1 -377
## 8 163 10.2 -377
## 9 163 10.1 -377
## 10 164 10.2 -377
## # ... with 3990 more draws
## # ... hidden reserved variables {'.chain', '.iteration', '.draw'}
## Warning: Dropping 'draws_df' class as required metadata was removed.
Correlated model parameters –> difference in confidence intervals ≠ confidence interval on difference
Point estimate: mode / median / mean of posterior distribution
## [1] 163.2034
“The most credible mean height is 163 cm”
Credible interval
## 50% 2.5% 97.5%
## 163.2102 161.1560 165.2060
“The most credible mean height is 163 cm, 95% CI [161, 165]”
Proportion of posterior distribution that satisfies condition
## [1] 0.99925
“99.9% of the posterior distribution of mean height is above 160 cm”
Bayes Factors (BF)
Region of practical equivalence (ROPE)
Ratio of prior-weighted likelihoods for two models
Very sensitive to priors
Same problems as p-values
More info:
Dienes 2014 “Using Bayes to get the most out of
non-significant results”
Dienes 2016 “How Bayes factors change
scientific practice”
Posterior within ROPE -> no effect
rope = c(.45, .55)
posterior = rnorm(1000, mean = .52, sd = .015)
hist(posterior, xlim = c(0, 1)); abline(v = rope, col = 'blue', lwd = 5)
Proportion of posterior within ROPE
## [1] 0.975
Posterior outside ROPE -> effect
rope = c(.45, .55)
posterior = rnorm(1000, mean = .72, sd = .015)
mean(posterior > rope[1] & posterior < rope[2])
## [1] 0
Posterior overlaps with ROPE -> not clear
rope = c(.45, .55)
posterior = rnorm(1000, mean = .6, sd = .1)
mean(posterior > rope[1] & posterior < rope[2])
## [1] 0.242
Reject the null (p < .05)
Fail to reject the null (p >= .05)
Prove a hypothesis (null / alternative / …)
Reject the null
Accept the null
Any posterior condition (eg mean within specific range like 3 to 5)
Not enough evidence either way
Goal 1 (power): prove H0 (posterior within ROPE) or prove H1 (posterior outside ROPE)
Goal 2 (precision): estimate effect size(s) with some precision, eg ± 0.1
Kruschke 2018 “The Bayesian new statistics” Psychon Bull Rev (2018)
25:178–206
Kruschke 2015 “Bayesian data analysis, 2nd ed.”, ch. 12
Kruschke 2015 ch. 13
Multilevel models with “random” intercepts and/or slopes
How it works: group-level parameters are taken from a single distribution instead of being independent
## raven trial success
## archie:15 Min. : 1 Mode :logical
## bred :15 1st Qu.: 4 FALSE:23
## k5 :15 Median : 8 TRUE :40
## lola :15 Mean : 8 NA's :27
## mandy :15 3rd Qu.:12
## mum :15 Max. :15
## raven trial success
## 2 lola 2 FALSE
## 3 lola 3 FALSE
## 4 lola 4 FALSE
## 9 lola 9 FALSE
## 10 lola 10 FALSE
## 11 lola 11 FALSE
##
## Call:
## glm(formula = success ~ 1, family = "binomial", data = ani)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5534 0.2617 2.115 0.0345 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 82.692 on 62 degrees of freedom
## Residual deviance: 82.692 on 62 degrees of freedom
## AIC: 84.692
##
## Number of Fisher Scoring iterations: 4
## [1] 0.6349206
## [1] 0.6349206
Thus: we are just reproducing sample mean
##
## archie bred k5 lola mandy mum
## FALSE 0 0 6 7 7 3
## TRUE 7 12 8 0 6 7
##
## Call:
## glm(formula = success ~ raven, family = "binomial", data = ani)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.957e+01 4.065e+03 0.005 0.996
## ravenbred 1.557e-11 5.115e+03 0.000 1.000
## ravenk5 -1.928e+01 4.065e+03 -0.005 0.996
## ravenlola -3.913e+01 5.748e+03 -0.007 0.995
## ravenmandy -1.972e+01 4.065e+03 -0.005 0.996
## ravenmum -1.872e+01 4.065e+03 -0.005 0.996
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 82.692 on 62 degrees of freedom
## Residual deviance: 49.284 on 57 degrees of freedom
## AIC: 61.284
##
## Number of Fisher Scoring iterations: 18
NB: no such thing as “average raven” in this model
pr = predict(
mod1,
newdata = data.frame(raven = unique(ani$raven)), # for each raven
type = 'response' # probability of success
)
round(pr * 100)
## 1 2 3 4 5 6
## 0 100 100 57 70 46
## raven success
## 1 archie 1.0000000
## 2 bred 1.0000000
## 3 k5 0.5714286
## 4 lola 0.0000000
## 5 mandy 0.4615385
## 6 mum 0.7000000
Thus: same rubbish as glm()
Each raven is treated as completely unique
Doesn’t generalize to unobserved participants
Doesn’t work for small clusters (eg few trials from some participants)
Can’t deal with floor/ceiling effects in some clusters (eg 100% accuracy)
## Warning: Parts of the model have not converged (some Rhats are > 1.05). Be
## careful when analysing the results! We recommend running more iterations and/or
## setting stronger priors.
## Family: bernoulli
## Links: mu = logit
## Formula: success ~ 1 + raven
## Data: ani (Number of observations: 63)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 2037.00 1884.67 23.15 5416.64 2.30 5 29
## ravenbred 12941.26 7679.48 511.46 28250.36 1.99 6 13
## ravenk5 -2036.69 1884.68 -5415.89 -22.55 2.30 5 29
## ravenlola -29746.51 13291.50 -53152.14 -5801.50 2.17 5 29
## ravenmandy -2037.17 1884.67 -5416.69 -23.40 2.30 5 29
## ravenmum -2036.03 1884.68 -5415.51 -21.64 2.29 5 29
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
par(mfrow = c(1, 3))
intercept1 = rnorm(1000, 0, .5)
hist(logistic(intercept1), main = '')
intercept2 = rnorm(1000, 0, 2)
hist(logistic(intercept2), main = '')
intercept3 = rnorm(1000, 0, 5)
hist(logistic(intercept3), main = '')
par(mfrow = c(1, 3))
beta_raven1 = rnorm(1000, 0, 100)
hist(logistic(intercept2 + beta_raven1), main = '')
beta_raven2 = rnorm(1000, 0, 3)
hist(logistic(intercept2 + beta_raven2), main = '')
beta_raven3 = rnorm(1000, 3, .01)
hist(logistic(intercept2 + beta_raven3), main = '')
## prior class coef group resp dpar nlpar lb ub
## (flat) b
## (flat) b ravenbred
## (flat) b ravenk5
## (flat) b ravenlola
## (flat) b ravenmandy
## (flat) b ravenmum
## student_t(3, 0, 2.5) Intercept
## source
## default
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## default
## Family: bernoulli
## Links: mu = logit
## Formula: success ~ 1 + raven
## Data: ani (Number of observations: 63)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 2.15 0.95 0.45 4.21 1.00 1720 1703
## ravenbred 2.54 1.88 -0.70 6.67 1.00 2048 2253
## ravenk5 -1.79 1.07 -4.03 0.16 1.00 1779 2154
## ravenlola -5.30 1.67 -8.86 -2.43 1.00 2085 2290
## ravenmandy -2.26 1.08 -4.51 -0.26 1.00 1926 1957
## ravenmum -1.15 1.16 -3.55 1.00 1.00 2042 1917
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
##
## archie bred k5 lola mandy mum
## FALSE 0 0 6 7 7 3
## TRUE 7 12 8 0 6 7
Note: Lola has 0/11 successes, yet the predicted accuracy is between 0 and 30%
Why? Because our prior expects only limited difference between animals (a raven is a raven)
SHRINKAGE
Assume that success rates of ravens belong to a single distribution
Learn the parameters of this distribution from the data (adaptive prior)
success ~ bernoulli(mu[raven]) –> LIKELIHOOD
logit(mu) ~ normal(mean, sd) –> adaptive PRIOR
mean and sd (HYPERPRIORS) determined from the distribution across ravens
Success rates of all ravens define prior expectations for each raven
## Family: bernoulli
## Links: mu = logit
## Formula: success ~ 1 + (1 | raven)
## Data: ani (Number of observations: 63)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~raven (Number of levels: 6)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 3.61 1.74 1.32 8.13 1.00 1099 1074
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.79 1.40 -1.96 3.67 1.00 863 1000
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
library(ggplot2)
ggplot(pl, aes(raven, fit, ymin = lwr, ymax = upr)) + geom_pointrange() +
geom_point(aes(raven, obs), shape = 4, color = 'blue', size = 3)
Group-level intercepts “shrink” to global mean
Amount of shrinkage depends on:
Group size: less data –> more shrinkage
Distance from global mean: extreme values –> more shrinkage
Similarity between clusters: most clusters similar –> more shrinkage
Thus: predicted values ≠ observed values in mixed models
View posterior distribution of predicted success rate:
Asymmetric distribution –> mode ≠ median ≠ mean
## [1] 0.6419514
## 50% 2.5% 97.5%
## 0.6759189 0.1236150 0.9752390
# or: logistic(summary(mod_brm2)$fixed[, c(1, 3, 4)])
# Compare to observed global mean success rate:
mean(ani$success, na.rm = TRUE)
## [1] 0.6349206
Compare groups / conditions / ravens…
Method: extract posterior distribution OF THE CONTRAST
##
## FALSE TRUE
## archie 0 7
## bred 0 12
## k5 6 8
## lola 7 0
## mandy 7 6
## mum 3 7
fit_bredLola = fitted(mod_brm2,
newdata = data.frame(raven = c('bred', 'lola')),
summary = FALSE)
head(fit_bredLola)
## [,1] [,2]
## [1,] 0.9990973 0.46316016
## [2,] 0.9992929 0.02048516
## [3,] 0.9931419 0.01229632
## [4,] 0.9973438 0.10188524
## [5,] 0.9999752 0.09040533
## [6,] 0.9999998 0.03309967
## 50% 2.5% 97.5%
## 0.9290251 0.5771936 0.9996498
## [1] 1
Thus: 100% of the posterior of the “bred vs. lola” contrast is positive. The most credible difference is 92.9%, 95% CI [57.7, 99.9].
fit_mumMandy = fitted(mod_brm2,
newdata = data.frame(raven = c('mum', 'mandy')),
summary = FALSE)
d_mumMandy = fit_mumMandy[, 1] - fit_mumMandy[, 2]
hist(d_mumMandy)
## 50% 2.5% 97.5%
## 0.2372693 -0.1533093 0.5679236
## [1] 0.8865
Thus: quite uncertain. Only 88% of the posterior of the “mum vs. mandy” contrast is positive. The most credible difference is 23.7%, 95% CI [-15.3, 56.7].
Nested data with multiple observations per subject / stimulus / family group / …
Group-level intercepts belong to a single distribution
Model with lme4::lmer() / lme4::glmer() or with brm()
Formula: y ~ fixed effects + (random effects), eg y ~ x + (x|group)
Fit model brm(y ~ x + (x|group)), with reasonable priors if possible
Check convergence plot(mod)
, posterior predictive
checks pp_check(mod)
Get fitted values for all effects of interest –> plot with
conditional_effects(mod)
Get posterior for comparisons of interest with
fitted(mod)
, summarize –> plot / text
Pool info across groups (ravens, participants, schools, …)
Generalize better to unobserved groups
Better estimates for small clusters (eg few trials from some participants)
Deal with floor/ceiling effects in some clusters (eg 100% accuracy)
Predictions at both group and population levels (eg for observed raven, for new ravens, or for a “typical” raven)
There’s more to brm()…
Other families ?brmsfamily
Priors
Multiple outcomes (multivariate regression)
Splines
Autocorrelated residuals
Imputed missing data
Measurement error
Outcomes with >2 categories
Generalization of logistic regression to multiple possible outcomes
Ex.: predict for which party someone votes from sex, age, income, education, …
Ordinal outcomes like Likert scales
Generalization of multinomial regression to ordered outcomes
Ex.: “How much do you like this picture?” 1…5
Beta for proportions, eg ratings (0, 1)
Zero-one-inflated to include 0 and 1
Good for continuous ratings on a Visual Analog Scale (VAS)
Model several outcomes simultaneously
Residuals on both outcomes can be modeled as correlated
Saves typing
Ex.: brm(mvbind(height, weight) ~ age, …)
Model nonlinear relationships flexibly
“Wiggliness” determined automatically from the data
Formula: s(predictor), s(predictor, by = group), etc.
See ?brms::s
Usually exclude the entire row in case of NA in any variable
Wasteful in “expensive” datasets
Alternative: impute NAs with MICE
https://www.gerkovink.com/miceVignettes/Missingness_inspection/Missingness_inspection.html
Alternative: analyze with imputation using brm_multiple(): https://cran.r-project.org/web/packages/brms/vignettes/brms_missings.html
Autocorrelated residuals: value at time t predicts values at time (t+1)
Ex.: pupil size changes over time, event-related potentials, fMRI signal, skin conductance
Solution: use autoregressive / autocorrelated / moving-average models
Formula: brm(y ~ x + arma(p = 1, q = 1))
See ?brms::autocor-terms
(Library: 2nd ed. paper form, 1st ed. eBook)
MrElreath lecture videos 2022: https://www.youtube.com/playlist?list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN
Code in brms/tidyverse: https://bookdown.org/content/4857/
Kruschke & Liddell 2018 “The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective”
Kruschke 2015 “Bayesian data analysis”