Mathematical foundations of Bayesian inference

Christopher B. Germann (Ph.D., M.Sc., B.Sc. / Marie Curie Alumnus)

2018

URL: https://christopher-germann.de

Bayesian inference allocates credibility (viz., belief) across the parameter space Θ

of the model (conditional on the a priori obtained empirical data). The mathemati-

cal axiomatic basis is provided by Bayes’ theorem. Bayes’ theorem derives the

probability of θ given the empirical data in terms of its inverse probability (i.e., the

probability of the data given θ and the prior probabilities of θ). In other words,

“Bayesian data analysis involves describing data by meaningful mathematical models,

and allocating credibility to parameter values that are consistent with the data and with

prior knowledge” (Kruschke & Vanpaemel, 2015, p. 279).

The mathematical formula for the allocation of credibility across parameters is axi-

omatized in Bayes’ theorem (Bayes & Price, 1763), i.e., Bayes’ theorem mathematically

defines the posterior distribution on the parameter values in a formal manner.

(|) =

(|)  ()

()

Where:

• p(A) signifies the prior (the preliminary belief about A)

• p(B) signifies the evidence

• p(A|B) signifies the posterior probability (the belief about A given B)

• p(B|A) signifies the likelihood.
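As a minimal numerical sketch of the formula above, the following Python snippet applies Bayes' theorem to a binary event A. The probabilities used (a prior of 0.01, a likelihood of 0.95, and a probability of 0.05 for B when A is false) and the function name `posterior` are illustrative assumptions and are not taken from the present study.

```python
def posterior(prior_a, lik_b_given_a, lik_b_given_not_a):
    """Return p(A|B) via Bayes' theorem for a binary event A."""
    # Evidence p(B): total probability of B, summed over A and not-A.
    evidence = lik_b_given_a * prior_a + lik_b_given_not_a * (1.0 - prior_a)
    # Posterior = likelihood x prior / evidence.
    return lik_b_given_a * prior_a / evidence


# Hypothetical values, chosen only to illustrate the arithmetic.
p = posterior(prior_a=0.01, lik_b_given_a=0.95, lik_b_given_not_a=0.05)
print(round(p, 3))  # prints 0.161
```

Even with a high likelihood, the posterior stays modest here because the prior is small, which illustrates how the prior and the likelihood are combined.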

Applied to the current analysis Bayes’ theorem takes the following form:



$$p(\mu_1, \mu_2, \sigma_1, \sigma_2, \nu \mid D) = \frac{p(D \mid \mu_1, \mu_2, \sigma_1, \sigma_2, \nu) \times p(\mu_1, \mu_2, \sigma_1, \sigma_2, \nu)}{p(D)}$$



   

Uppercase Theta (Θ) denotes the set of all possible combinations of parameter values in a specific mathemati-

cal model (the joint parameter space). Lowercase theta (θ) on the other hand, denotes a single k-dimensional pa-

rameter vector.

Let D be the empirical data, μ₁ and μ₂ the means per experimental condition (e.g., two V conditions), σ₁ and σ₂ the associated standard deviations, and ν the normality parameter.

Bayes’ theorem (Bayes & Price, 1763) as specified for the descriptive model used to

estimate the parameters of interest in Experiment 2.

Bayes’ theorem emphasises the posterior (conditional) distribution of parameter val-

ues (the Latin terminus “a posteriori” signifies empirical knowledge which proceeds

from experiences/observations). The factors of Bayes’ theorem have specific mean-

ing assigned to them: The “evidence” for the specified model, p(D), equals the total

probability of the data under the model which can be computed by averaging over

the parameter space Θ (Kruschke, 2015). Each parameter value is weighted by the

“strength of belief” in the respective values of θ. For the current model, Bayes’ theorem can be semantically summarised as follows: the posterior probability of a combination of parameter values (i.e., <μ₁, μ₂, σ₁, σ₂, ν>) is equal to the likelihood of that parameter value combination multiplied by the prior probability of that parameter combination, divided by the constant p(D). This constant is often referred to as the “evidence” for the model and is also called the “marginal likelihood

function” (Kruschke, 2013). Its numerical value is calculated by taking the average of

the likelihood, p(D|θ), across all values of θ (i.e., over the entire parameter space Θ),

weighted by the prior probability of θ (Kruschke, 2014). The posterior distribution is thus always a compromise between the prior believability of the parameter values and the likelihood of the parameter values, given the data (Kruschke, 2010). Our experimental data were measured on a visual analogue scale (VAS) ranging across a continuum of values. Given the extremely fine-grained nature of our measurements, the

resulting numerical values are “quasi-continuous”. Therefore, all parameters are re-

garded as continuous variables for all practical purposes. It thus follows that the pos-

terior distribution is continuously distributed across the joint parameter space Θ

(Kruschke et al., 2017).
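The weighted average that defines p(D) can be made concrete with a grid approximation. The sketch below is not the model used in the present analysis: it assumes a deliberately simple binomial likelihood with a single parameter θ, a uniform prior, and hypothetical data (7 successes in 10 trials). The function name `grid_posterior` is likewise an assumption introduced for illustration.

```python
from math import comb


def grid_posterior(z, n, grid_size=1001):
    """Grid approximation to the posterior of a binomial success probability."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    # Uniform prior: equal probability mass on every grid point (assumption).
    prior = [1.0 / grid_size] * grid_size
    likelihood = [comb(n, z) * t ** z * (1 - t) ** (n - z) for t in thetas]
    # Evidence p(D): the likelihood averaged over the grid, weighted by the prior.
    evidence = sum(l * p for l, p in zip(likelihood, prior))
    # Posterior: likelihood x prior / evidence at every grid point.
    posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
    return thetas, posterior, evidence


# Hypothetical data: z = 7 successes in n = 10 trials.
thetas, post, evidence = grid_posterior(z=7, n=10)
post_mean = sum(t * p for t, p in zip(thetas, post))
print(f"p(D) ~ {evidence:.4f}, posterior mean ~ {post_mean:.3f}")
```

A grid of this kind is only practical for very few parameters; with a five-dimensional parameter space the number of grid points grows too quickly, which is one motivation for sampling-based approximations.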

Given that Bayesian parameter estimation (BPE) is currently not a methodological standard in psychology, we provide some terminological clarifications of the underlying Bayesian nomenclature. The credibility of the parameter values after the empirical observation is termed the “posterior distribution”, and the believability of the parameter values before the empirical observation is termed the “prior distribution”. The probability of the observation for a particular parameter value combination is called the “likelihood”. The denominator labelled as “evidence”, p(D), is the marginal likelihood, also referred to as “model evidence”. It indicates the degree to which the observed outcome is anticipated when averaged across all possible parameter values, weighted by their respective believability (Kruschke, 2008). In BPE, Bayes’ theorem is used to make inferences about distribution parameters, i.e., the conditional distribution of θ is calculated given the observed data; it answers the question: what is the probability of θ conditional on the observed data? The prior is an unconditional distribution associated with θ. In contrast to NHST, θ is not assumed to be random; we are merely nescient of its value. In other words, probability is conceptualised as a state of subjective belief or state of knowledge (as opposed to objective “pure” probability as an intrinsic characteristic of θ).
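In symbols, and assuming continuous parameters as in the present model, the marginal likelihood described above can be written as an integral over the joint parameter space. This is a restatement for illustration, not an additional result:

$$p(D) = \int_{\Theta} p(D \mid \theta)\, p(\theta)\, d\theta$$

The posterior then follows as $p(\theta \mid D) = p(D \mid \theta)\, p(\theta) / p(D)$, with the likelihood $p(D \mid \theta)$ and the prior $p(\theta)$ in the numerator.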

The posterior distribution is approximated by a powerful class of algorithms known as Markov chain Monte Carlo (MCMC) methods (named in analogy to the randomness of events observed at games in casinos). MCMC generates a large representative sample from the posterior distribution which, in principle, allows the posterior to be approximated to an arbitrarily high degree of accuracy (as the number of samples tends to infinity). The MCMC sample (or chain) contains a large number (i.e., > 1000) of combinations of the parameter values of interest. Our model of perceptual judgments contains the following parameters: <μ₁, μ₂, σ₁, σ₂, ν>. In other words, the MCMC algorithm randomly samples a very large number n of combinations of θ from the posterior distribution. This representative sample of θ values is subsequently utilised in order to estimate various characteristics of the posterior (Gustafsson, Montelius, Starck, & Ljungberg, 2017), e.g., its mean, mode, standard deviation, etc. The sample of parameter values thus obtained can then be plotted in the form of a histogram in order to visualise the distributional properties, and a prespecified highest density interval (HDI) can be superimposed on the histogram.
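The sampling idea can be sketched with a random-walk Metropolis algorithm, one of the simplest MCMC methods. The text does not state which sampler was used for the present analysis, so this is an illustration only. It assumes a one-parameter model (a normal mean μ with known standard deviation 1 and a wide normal prior), hypothetical data, and hypothetical function names (`log_posterior`, `metropolis`, `hdi`).

```python
import math
import random
import statistics


def log_posterior(mu, data, sigma=1.0, prior_mean=0.0, prior_sd=10.0):
    """Unnormalised log posterior: normal likelihood (known sigma) plus normal prior."""
    log_lik = sum(-0.5 * ((x - mu) / sigma) ** 2 for x in data)
    log_prior = -0.5 * ((mu - prior_mean) / prior_sd) ** 2
    return log_lik + log_prior


def metropolis(data, n_samples=20000, burn_in=1000, step=0.5, seed=1):
    """Random-walk Metropolis sampler for mu; returns the retained chain."""
    rng = random.Random(seed)
    mu = statistics.fmean(data)  # start near the data
    current = log_posterior(mu, data)
    chain = []
    for i in range(n_samples + burn_in):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio); p(D) cancels in the ratio.
        if math.log(1.0 - rng.random()) < candidate - current:
            mu, current = proposal, candidate
        if i >= burn_in:
            chain.append(mu)
    return chain


def hdi(samples, mass=0.95):
    """Narrowest interval that contains `mass` of the sampled values."""
    s = sorted(samples)
    k = math.ceil(mass * len(s))
    widths = [(s[i + k - 1] - s[i], i) for i in range(len(s) - k + 1)]
    _, start = min(widths)
    return s[start], s[start + k - 1]


# Hypothetical observations, used only to exercise the sampler.
data = [4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 4.7]
chain = metropolis(data)
print("posterior mean:", statistics.fmean(chain))
print("posterior sd:  ", statistics.stdev(chain))
print("95% HDI:       ", hdi(chain))
```

Because the acceptance rule only uses a ratio of posterior densities, the constant p(D) never has to be computed, which is what makes this class of methods practical for models with several parameters.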

Relatively recent advances in technology make these computationally demanding methods feasible. The combination of powerful microprocessors and sophisticated computational algorithms allows researchers to perform extremely powerful Bayesian statistical analyses that would have been very expensive only 15 years ago and virtually impossible circa 25 years ago. The statistical “Bayesian revolution” is relevant for many scientific disciplines (Beaumont & Rannala, 2004; Brooks, 2003; Gregory, 2001; Shultz, 2007) and the scientific method in general. This Kuhnian paradigm shift (Kuhn, 1970) goes hand in hand with Moore's law (Moore, 1965), the exponential progress of information technologies (Kurzweil, 2005; cf. Goertzel, 2007), and the associated ephemeralization (Heylighen, 2008). For the current

Bayesian analysis, the parameter space Θ is a five-dimensional space that embeds

the joint distribution of all possible combinations of parameter values (Kruschke,

2014). Hence exact parameter values can be approximated by sampling large num-

bers of values from the posterior distribution. The larger the number of random

samples the more accurate the estimate. A longer MCMC chain (a larger sample) pro-

vides a more accurate representation (i.e., better estimate or higher resolution) of

the posterior distribution of the parameter values (given the empirical data). For instance, if the number of MCMC samples is relatively small and the analysis were repeated, the values would differ noticeably and, on visual inspection, the associated histogram would appear “edgy”.

The term “nescient” is a composite lexeme derived from the Latin ne "not" + scire "to know" (cf. “science”). It is not synonymous with “ignorant” because ignorance has a different semantic meaning (“to ignore” is very different from “not knowing”).

Ephemeralization is a concept popularised by Buckminster Fuller which is frequently cited as an argument against Malthusianism.

With larger MCMC samples, the estimated values (on average) approximate the true values of the posterior distribution

of the parameter values and the associated histogram becomes smoother (Kruschke,

2014). The larger the MCMC sample size, the higher the accuracy, because the “Monte Carlo Error” (MCE) decreases as the sample size n increases (i.e., accuracy is a function of MCMC sample size). To sum up, the MCMC approach yields approximate parameter values, and its accuracy depends on the number of values n that are used to calculate the average. Quantitative methods have been developed to measure the Monte Carlo Error “objectively” (Koehler, Brown, & Haneuse, 2009); however, this intricate topic goes beyond the scope of this chapter. Of great relevance for our purpose is the fact that this analytic approach also makes it possible to compute the credible difference of means between experimental conditions by computing μ₁ - μ₂ for every combination of sampled values. Moreover, BPE provides a distribution of credible effect sizes. The same distributional information can be obtained for the differences between σ₁ and σ₂ (and the associated distributional range of credible effect sizes).

To sum up, BPE is currently one of the most effective statistical approaches to obtain

detailed information about the various parameters of interest.
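The computation of derived quantities from a posterior sample can be sketched as follows. The draws below are not MCMC output from the present study: they are independent random numbers with hypothetical centres and spreads, standing in for a chain of (μ₁, μ₂, σ₁, σ₂) combinations. The effect-size formula, (μ₁ - μ₂) divided by the square root of the average of the two variances, is one common definition and is assumed here. The Monte Carlo standard error is computed under the simplifying assumption of independent draws, whereas real MCMC chains are usually autocorrelated.

```python
import math
import random
import statistics

rng = random.Random(42)
n_draws = 10000

# Hypothetical stand-in for MCMC output: one (mu1, mu2, sigma1, sigma2) tuple per draw.
draws = [
    (rng.gauss(3.2, 0.10), rng.gauss(3.7, 0.10),
     abs(rng.gauss(1.0, 0.05)), abs(rng.gauss(1.1, 0.05)))
    for _ in range(n_draws)
]

# Derived quantities are computed separately for every sampled combination.
mean_diff = [m1 - m2 for m1, m2, _, _ in draws]
sd_diff = [s1 - s2 for _, _, s1, s2 in draws]
effect_size = [
    (m1 - m2) / math.sqrt((s1 ** 2 + s2 ** 2) / 2.0)
    for m1, m2, s1, s2 in draws
]


def summarise(values):
    """Posterior mean, SD, and Monte Carlo standard error (independent draws assumed)."""
    sd = statistics.stdev(values)
    return statistics.fmean(values), sd, sd / math.sqrt(len(values))


for name, values in [("mu1 - mu2", mean_diff),
                     ("sigma1 - sigma2", sd_diff),
                     ("effect size", effect_size)]:
    mean, sd, mcse = summarise(values)
    print(f"{name}: mean={mean:.3f}, sd={sd:.3f}, MCSE={mcse:.4f}")
```

Because each derived value is computed per draw, the result is a full distribution of credible differences and effect sizes rather than a single point estimate. Under these assumptions the Monte Carlo standard error shrinks with the square root of the number of draws.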

References

Bayes, T., & Price, R. (1763). An Essay towards Solving a Problem in the Doctrine of

Chances. By the Late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a

Letter to John Canton, A. M. F. R. S. Philosophical Transactions of the Royal

Society of London, 53, 370–418. https://doi.org/10.1098/rstl.1763.0053

Beaumont, M. A., & Rannala, B. (2004). The Bayesian revolution in genetics. Nature

Reviews Genetics, 5(4), 251–261. https://doi.org/10.1038/nrg1318

Brooks, S. P. (2003). Bayesian computation: a statistical revolution. Philosophical

Transactions of the Royal Society A: Mathematical, Physical and Engineering

Sciences, 361(1813), 2681–2697. https://doi.org/10.1098/rsta.2003.1263

Goertzel, B. (2007). Human-level artificial general intelligence and the possibility of

a technological singularity. A reaction to Ray Kurzweil’s The Singularity Is

Near, and McDermott’s critique of Kurzweil. Artificial Intelligence, 171(18), 1161–

1173. https://doi.org/10.1016/j.artint.2007.10.011

Gregory, P. C. (2001). A Bayesian revolution in spectral analysis. In AIP Conference

Proceedings (Vol. 568, pp. 557–568). https://doi.org/10.1063/1.1381917

Gustafsson, O., Montelius, M., Starck, G., & Ljungberg, M. (2017). Impact of prior

distributions and central tendency measures on Bayesian intravoxel

incoherent motion model fitting. Magnetic Resonance in Medicine.

https://doi.org/10.1002/mrm.26783

Heylighen, F. (2008). Accelerating Socio-Technological Evolution: from

ephemeralization and stigmergy to the global brain. In Globalization as

Evolutionary Process: Modeling global change (pp. 284–309).

https://doi.org/10.4324/9780203937297

Koehler, E., Brown, E., & Haneuse, S. J.-P. A. (2009). On the Assessment of Monte

Carlo Error in Simulation-Based Statistical Analyses. The American Statistician,

63(2), 155–162. https://doi.org/10.1198/tast.2009.0030

Kruschke, J. K. (2008). Bayesian approaches to associative learning: From passive to

active learning. Learning & Behavior, 36(3), 210–226.

https://doi.org/10.3758/LB.36.3.210

Kruschke, J. K. (2010). What to believe: Bayesian methods for data analysis. Trends

in Cognitive Sciences. https://doi.org/10.1016/j.tics.2010.05.001

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of

Experimental Psychology: General, 142(2), 573–603.

https://doi.org/10.1037/a0029146

Kruschke, J. K. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and

Stan, second edition. https://doi.org/10.1016/B978-0-12-405888-0.09999-2

Kruschke, J. K. (2015). Doing Bayesian data analysis: a tutorial with R, JAGS and Stan.

Amsterdam: Elsevier.

Kruschke, J. K., & Liddell, T. M. (2017). Bayesian data analysis for newcomers.

Psychonomic Bulletin & Review, 1–29. https://doi.org/10.3758/s13423-017-1272-1

Kruschke, J. K., & Vanpaemel, W. (2015). Bayesian estimation in hierarchical models.

The Oxford Handbook of Computational and Mathematical Psychology, 279–299.

https://doi.org/10.1093/oxfordhb/9780199957996.013.13

Kuhn, T. (1970). The Structure of Scientific Revolutions. University of Chicago Press.

Kurzweil, R. (2005). The singularity is near. Viking (Vol. 45).

https://doi.org/10.1109/MSPEC.2008.4635038

Moore, G. E. (1965). Cramming more components onto integrated circuits.

Electronics, 38(8), 114–117. https://doi.org/10.1109/jproc.1998.658762

Shultz, T. R. (2007). The Bayesian revolution approaches psychological

development. Developmental Science, 10(3), 357–364.

https://doi.org/10.1111/j.1467-7687.2007.00588.x