Mathematical foundations of Bayesian inference
Christopher B. Germann (Ph.D., M.Sc., B.Sc. / Marie Curie Alumnus)
2018
URL: https://christopher-germann.de
Bayesian inference allocates credibility (viz., belief) across the parameter space Θ¹ of the model (conditional on the a priori obtained empirical data). The mathematical axiomatic basis is provided by Bayes’ theorem, which derives the probability of θ given the empirical data from its inverse probability (i.e., the probability of the data given θ) and the prior probability of θ. In other words, “Bayesian data analysis involves describing data by meaningful mathematical models, and allocating credibility to parameter values that are consistent with the data and with prior knowledge” (Kruschke & Vanpaemel, 2015, p. 279).
The mathematical formula for the allocation of credibility across parameters is axiomatized in Bayes’ theorem (Bayes & Price, 1763); that is, Bayes’ theorem formally defines the posterior distribution over the parameter values.
p(A|B) = p(B|A) × p(A) / p(B)
Where:
p(A) signifies the prior (the preliminary belief about A)
p(B) signifies the evidence
p(A|B) signifies the posterior probability (the belief about A given B)
p(B|A) signifies the likelihood.
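To make the roles of these factors concrete, the following minimal sketch applies Bayes’ theorem to a simple two-hypothesis situation; the prior and likelihood values are hypothetical and serve only as an illustration.

# Minimal numerical illustration of Bayes' theorem with hypothetical values.
# Hypothesis A ("effect present") starts with prior belief p(A) = 0.5.
p_A = 0.5                      # prior p(A)
p_not_A = 1.0 - p_A            # prior p(not A)
p_B_given_A = 0.8              # likelihood p(B|A): probability of observation B if A holds
p_B_given_not_A = 0.3          # likelihood p(B|not A)

# Evidence p(B): total probability of the observation, averaged over both hypotheses
p_B = p_B_given_A * p_A + p_B_given_not_A * p_not_A

# Posterior p(A|B) via Bayes' theorem
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))   # 0.727: credibility of A is updated from 0.5 to roughly 0.73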
Applied to the current analysis, Bayes’ theorem takes the following form:
p(μ₁, μ₂, σ₁, σ₂, ν | D) = p(D | μ₁, μ₂, σ₁, σ₂, ν) × p(μ₁, μ₂, σ₁, σ₂, ν) / p(D)
   
¹ Uppercase Theta (Θ) denotes the set of all possible combinations of parameter values in a specific mathematical model (the joint parameter space). Lowercase theta (θ), on the other hand, denotes a single k-dimensional parameter vector.
Here, D denotes the empirical data, μ₁ and μ₂ the means per experimental condition (e.g., conditions V₀₀ and V₀₁), σ₁ and σ₂ the associated standard deviations, and ν the normality parameter. This is Bayes’ theorem (Bayes & Price, 1763) as specified for the descriptive model used to estimate the parameters of interest in Experiment 2.
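For readers who prefer a computational formulation, the sketch below shows how the unnormalised log-posterior of such a five-parameter descriptive model could be written, assuming a t-distributed likelihood in the spirit of Kruschke (2013); the broad priors and all function and variable names are illustrative assumptions and do not reproduce the exact prior specification used in Experiment 2.

# Sketch of the unnormalised log-posterior for the parameters <mu1, mu2, sigma1, sigma2, nu>,
# assuming a t-distributed (robust) likelihood per condition. The broad priors below are
# illustrative assumptions, not the exact priors used in Experiment 2.
import numpy as np
from scipy import stats

def log_posterior(theta, y1, y2):
    mu1, mu2, sigma1, sigma2, nu = theta
    if sigma1 <= 0 or sigma2 <= 0 or nu <= 1:
        return -np.inf                                    # outside the admissible parameter space

    # log-likelihood: p(D | mu1, mu2, sigma1, sigma2, nu)
    log_lik = (stats.t.logpdf(y1, df=nu, loc=mu1, scale=sigma1).sum()
               + stats.t.logpdf(y2, df=nu, loc=mu2, scale=sigma2).sum())

    # log-prior: p(mu1, mu2, sigma1, sigma2, nu), deliberately vague for illustration
    log_prior = (stats.norm.logpdf(mu1, loc=0, scale=100)
                 + stats.norm.logpdf(mu2, loc=0, scale=100)
                 + stats.uniform.logpdf(sigma1, loc=0.01, scale=100)
                 + stats.uniform.logpdf(sigma2, loc=0.01, scale=100)
                 + stats.expon.logpdf(nu - 1, scale=29))  # shifted exponential prior on nu

    # posterior is proportional to likelihood x prior; the constant p(D) can be ignored here
    return log_lik + log_prior

A function of this kind is all an MCMC sampler needs, because the normalising constant p(D) cancels out of the sampling step.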
Bayes’ theorem emphasises the posterior (conditional) distribution of parameter val-
ues (the Latin term “a posteriori” signifies empirical knowledge which proceeds from experiences/observations). The factors of Bayes’ theorem have specific meaning assigned to them: The “evidence” for the specified model, p(D), equals the total probability of the data under the model, which can be computed by averaging over
the parameter space Θ (Kruschke, 2015). Each parameter value is weighted by the
“strength of belief” in the respective values of θ. For the current model, Bayes’ theo-
rem can be semantically summarised as follows: it signifies that the posterior probability of the combination of parameter values (i.e., < μ₁, μ₂, σ₁, σ₂, ν >) is equal to the likelihood of that parameter value combination multiplied by the prior probability of that parameter combination, divided by the constant p(D). This constant is often referred to as the “evidence” for the model and is also called the “marginal likelihood” (Kruschke, 2013). Its numerical value is calculated by taking the average of
the likelihood, p(D|θ), across all values of θ (i.e., over the entire parameter space Θ),
weighted by the prior probability of θ (Kruschke, 2014). The posterior distribution is
thus always a compromise between the prior believability of the parameter values
and the likelihood of the parameter values, given the data (Kruschke, 2010). Our experimental data were measured on a visual analogue scale (VAS) ranging across a continuum of values. Given the extremely fine-grained nature of our measurements, the resulting numerical values are “quasi-continuous”. Therefore, all parameters are regarded as continuous variables for all practical purposes. It thus follows that the posterior distribution is continuously distributed across the joint parameter space Θ (Kruschke & Liddell, 2017).
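The following toy sketch illustrates, for a single hypothetical parameter μ, how p(D) can be obtained numerically as the prior-weighted average of the likelihood over a finely discretised parameter space; the data, the prior, and the fixed measurement scale are invented solely for illustration.

# Toy illustration of p(D) as the prior-weighted average of the likelihood over the
# parameter space, using a one-dimensional grid approximation. All numbers are hypothetical.
import numpy as np
from scipy import stats

data = np.array([0.30, 0.50, 0.40, 0.60, 0.45])      # hypothetical quasi-continuous VAS ratings
mu_grid = np.linspace(-1.0, 2.0, 3001)               # discretised parameter space for mu
d_mu = mu_grid[1] - mu_grid[0]

prior = stats.norm.pdf(mu_grid, loc=0.5, scale=1.0)              # prior p(mu)
likelihood = np.array([stats.norm.pdf(data, loc=m, scale=0.2).prod()
                       for m in mu_grid])                        # p(D|mu) for each grid value

evidence = np.sum(likelihood * prior) * d_mu          # p(D): the likelihood averaged over the prior
posterior = likelihood * prior / evidence             # p(mu|D) evaluated on the grid

print(evidence)                                       # the marginal likelihood ("model evidence")
print(mu_grid[np.argmax(posterior)])                  # posterior mode of mu

With several parameters this averaging has to be carried out over the joint parameter space Θ, which is exactly why sampling-based approximations such as MCMC become attractive.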
Given that Bayesian parameter estimation (BPE) is currently not a methodological standard in psychology, we will provide some terminological clarifications of the underlying Bayesian nomenclature. The credibility of the parameter values after the
empirical observation is termed the “posterior distribution”, and the believability of
the parameter values before the empirical observation is termed the “prior distribu-
tion”. The probability of the observed data for a particular parameter value combination is called the “likelihood”. Averaged across all possible parameter values, weighted proportionally to their respective believability, it indicates the degree to which the observed outcome is anticipated (Kruschke, 2008).
The denominator labelled as “evidence”, p(D), is the marginal likelihood, also referred to as the “model evidence”. In BPE, Bayes’ theorem is used to make inferences about distribution parameters, i.e., the conditional distribution of θ given the observed data is calculated: what is the probability of θ conditional on the observed data? The prior is an unconditional distribution associated with θ. In contrast to NHST, placing a distribution on θ does not mean that θ is assumed to be random; we are merely nescient² of its value. In other words, probability is conceptualised as a state of subjective belief or state of knowledge (as opposed to objective “pure” probability as an intrinsic characteristic of θ).
The posterior distribution is approximated by a powerful class of algorithms known
as Markov chain Monte Carlo (MCMC) methods (named in analogy to the randomness
of events observed at games in casinos). MCMC generates a large representative sample from the posterior distribution which, in principle, allows one to approximate that distribution to an arbitrarily high degree of accuracy (as n → ∞). The MCMC sample (or chain) contains a large number (i.e., > 1000) of combinations of the parameter values of interest. Our model of perceptual judgments contains the following parameters: < μ₁, μ₂, σ₁, σ₂, ν >. In other words, the MCMC algorithm randomly samples a very large number n of combinations of θ from the posterior distribution. This representative sample of θ values is subsequently utilised in order to estimate various characteristics of the posterior (Gustafsson, Montelius, Starck, & Ljungberg, 2017), e.g., its mean, mode, standard deviation, etc. The thus obtained sample of parameter values can then
be plotted in the form of a histogram in order to visualise the distributional proper-
ties, and a prespecified highest density interval (HDI) can be superimposed on the histogram.
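As a didactic illustration of these mechanics, the sketch below implements a minimal random-walk Metropolis sampler for a single hypothetical parameter and derives a 95% credible interval from the resulting chain. In practice, a multi-parameter model of the kind described above would be handed to a dedicated sampler such as JAGS or Stan (Kruschke, 2014); the data, proposal width, and chain length here are purely illustrative.

# Didactic sketch of a random-walk Metropolis sampler for a single parameter mu,
# followed by a credible interval computed from the chain. All values are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=0.5, scale=0.2, size=30)        # hypothetical data

def log_post(mu):
    # unnormalised log-posterior: normal likelihood plus a broad normal prior on mu
    return (stats.norm.logpdf(data, loc=mu, scale=0.2).sum()
            + stats.norm.logpdf(mu, loc=0.0, scale=10.0))

n_steps, current = 20000, 0.0
chain = np.empty(n_steps)
for i in range(n_steps):
    proposal = current + rng.normal(scale=0.05)       # random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(current):
        current = proposal                            # accept the proposed value
    chain[i] = current                                # otherwise keep the current value
chain = chain[2000:]                                  # discard burn-in samples

# 95% credible interval from the MCMC sample (equal-tailed; an HDI would instead be
# the narrowest interval containing 95% of the sampled values)
lo, hi = np.percentile(chain, [2.5, 97.5])
print(chain.mean(), lo, hi)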
Relatively recent advances in technology make these computationally demanding
methods feasible. The combination of powerful microprocessors and sophisticated computational algorithms allows researchers to perform computationally intensive Bayesian statistical analyses that would have been very expensive only 15 years ago and virtually impossible circa 25 years ago. The statistical “Bayesian revolution” is relevant for many scientific disciplines (Beaumont & Rannala, 2004; Brooks, 2003; Gregory, 2001; Shultz, 2007) and the scientific method in general. This Kuhnian paradigm shift (Kuhn, 1970) goes hand in hand with Moore’s law (Moore, 1965), the exponential progress of information technologies (Kurzweil, 2005; cf. Goertzel, 2007), and the associated ephemeralization³ (Heylighen, 2008). For the current
Bayesian analysis, the parameter space Θ is a five-dimensional space that embeds
the joint distribution of all possible combinations of parameter values (Kruschke,
2014). Hence, exact parameter values can be approximated by sampling large numbers of values from the posterior distribution. The larger the number of random samples, the more accurate the estimate. A longer MCMC chain (a larger sample) pro-
vides a more accurate representation (i.e., better estimate or higher resolution) of
the posterior distribution of the parameter values (given the empirical data). For instance, if the number of MCMC samples is relatively small and the analysis were repeated, the resulting values would differ noticeably and, on visual inspection, the
² The term “nescient” is a composite lexeme formed from the Latin prefix ne- (“not”) + scire (“to know”; cf. science). It is not synonymous with “ignorant”, because ignorance has a different semantic meaning (“to ignore” is very different from “not knowing”).
³ A concept popularised by Buckminster Fuller which is frequently cited as an argument against Malthusianism.
associated histogram would appear “edgy”. With larger MCMC samples, the esti-
mated values (on average) approximate the true values of the posterior distribution
of the parameter values and the associated histogram becomes smoother (Kruschke,
2014). The larger the MCMC sample size, the higher the accuracy, because the Monte Carlo error (MCE) decreases as the sample size n increases (i.e., accuracy is a function of MCMC sample size). To sum up, the MCMC approach yields approximate parameter values, and its accuracy depends on the number of values n that are used to calculate the average. Quantitative methods have been developed to measure the Monte Carlo error “objectively” (Koehler, Brown, & Haneuse, 2009); however, this
intricate topic goes beyond the scope of this chapter. Of great relevance for our purpose is the fact that this analytic approach also allows one to compute the credible difference of means between experimental conditions by computing μ₁ - μ₂ for every combination of sampled values. Moreover, BPE provides a distribution of credible effect sizes. The same distributional information can be obtained for the difference between σ₁ and σ₂ (and the associated distributional range of credible effect sizes).
In summary, BPE is currently one of the most effective statistical approaches for obtaining detailed information about the various parameters of interest.
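As an illustration of this final step, the sketch below derives the credible difference of means, a distribution of credible effect sizes, and the credible difference of the standard deviations from MCMC samples. The arrays standing in for the chains of μ₁, μ₂, σ₁ and σ₂ are synthetic placeholders, not our experimental results, and the effect-size formula shown is one common (Cohen’s-d-like) choice.

# Sketch: deriving the credible difference of means and a distribution of credible effect
# sizes from MCMC samples. The arrays below are synthetic stand-ins for the sampled chains
# of mu1, mu2, sigma1 and sigma2 (one entry per MCMC step); they are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 10000
mu1, mu2 = rng.normal(0.55, 0.02, n), rng.normal(0.48, 0.02, n)
sigma1, sigma2 = rng.normal(0.12, 0.01, n), rng.normal(0.11, 0.01, n)

diff_means = mu1 - mu2                                           # credible difference mu1 - mu2
effect_size = diff_means / np.sqrt((sigma1**2 + sigma2**2) / 2)  # Cohen's-d-like effect size
diff_sigmas = sigma1 - sigma2                                    # credible difference of the sigmas

for name, x in [("mu1 - mu2", diff_means),
                ("effect size", effect_size),
                ("sigma1 - sigma2", diff_sigmas)]:
    lo, hi = np.percentile(x, [2.5, 97.5])
    print(f"{name}: mean = {x.mean():.3f}, 95% interval = [{lo:.3f}, {hi:.3f}]")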
References
Bayes, M., & Price, M. (1763). An Essay towards Solving a Problem in the Doctrine of Chances. By the Late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a Letter to John Canton, A. M. F. R. S. Philosophical Transactions of the Royal Society of London, 53, 370–418. https://doi.org/10.1098/rstl.1763.0053
Beaumont, M. A., & Rannala, B. (2004). The Bayesian revolution in genetics. Nature Reviews Genetics, 5(4), 251–261. https://doi.org/10.1038/nrg1318
Brooks, S. P. (2003). Bayesian computation: a statistical revolution. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 361(1813), 2681–2697. https://doi.org/10.1098/rsta.2003.1263
Goertzel, B. (2007). Human-level artificial general intelligence and the possibility of a technological singularity. A reaction to Ray Kurzweil’s The Singularity Is Near, and McDermott’s critique of Kurzweil. Artificial Intelligence, 171(18), 1161–1173. https://doi.org/10.1016/j.artint.2007.10.011
Gregory, P. C. (2001). A Bayesian revolution in spectral analysis. In AIP Conference Proceedings (Vol. 568, pp. 557–568). https://doi.org/10.1063/1.1381917
Gustafsson, O., Montelius, M., Starck, G., & Ljungberg, M. (2017). Impact of prior
distributions and central tendency measures on Bayesian intravoxel
incoherent motion model fitting. Magnetic Resonance in Medicine.
https://doi.org/10.1002/mrm.26783
Heylighen, F. (2008). Accelerating Socio-Technological Evolution: from ephemeralization and stigmergy to the global brain. In Globalization as Evolutionary Process: Modeling global change (pp. 284–309). https://doi.org/10.4324/9780203937297
Koehler, E., Brown, E., & Haneuse, S. J.-P. A. (2009). On the Assessment of Monte Carlo Error in Simulation-Based Statistical Analyses. The American Statistician, 63(2), 155–162. https://doi.org/10.1198/tast.2009.0030
Kruschke, J. K. (2008). Bayesian approaches to associative learning: From passive to active learning. Learning & Behavior, 36(3), 210–226. https://doi.org/10.3758/LB.36.3.210
Kruschke, J. K. (2010). What to believe: Bayesian methods for data analysis. Trends
in Cognitive Sciences. https://doi.org/10.1016/j.tics.2010.05.001
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603. https://doi.org/10.1037/a0029146
Kruschke, J. K. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd ed.). https://doi.org/10.1016/B978-0-12-405888-0.09999-2
Kruschke, J. K. (2015). Doing Bayesian data analysis: a tutorial with R, JAGS and Stan.
Amsterdam: Elsevier.
Kruschke, J. K., & Liddell, T. M. (2017). Bayesian data analysis for newcomers. Psychonomic Bulletin & Review, 1–29. https://doi.org/10.3758/s13423-017-1272-1
Kruschke, J. K., & Vanpaemel, W. (2015). Bayesian estimation in hierarchical models. In The Oxford Handbook of Computational and Mathematical Psychology (pp. 279–299). https://doi.org/10.1093/oxfordhb/9780199957996.013.13
Kuhn, T. (1970). The Structure of Scientific Revolutions. Chicago: University of Chicago Press.
Kurzweil, R. (2005). The Singularity Is Near. New York: Viking.
Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics, 38(8), 114–117. https://doi.org/10.1109/jproc.1998.658762
Shultz, T. R. (2007). The Bayesian revolution approaches psychological development. Developmental Science, 10(3), 357–364. https://doi.org/10.1111/j.1467-7687.2007.00588.x