# Bayesian Model and Conjugate Priors

2021-01-30 by xiaoguang

When reading “Thompson Sampling for Dynamic Multi-Armed Bandits”, I learned a new concept called “conjugate priors”. For Bayesian models, we can choose a conjugate prior for a likelihood function, and then the posterior distribution will have the same form as the prior. This greatly simplifies updating the Bayesian model: we just update the hyperparameters of the prior distribution.

In the following parts, I’ll first go through some concepts related to the Bayesian model, and then give an example of the usage of conjugate prior.

## Bayesian Model

The Bayesian model computes the posterior $p(\theta|X)$ as the product of the likelihood $p(X|\theta)$ and the prior probability $p(\theta)$, divided by the probability of the observed data $p(X)$:

$p(\theta|X) = \frac{p(X|\theta)p(\theta)}{p(X)}$

Let’s use coin tossing as an example.

### Likelihood Function

Most of the time, the likelihood function can be considered fixed, since we can determine it from the description of the problem. For example, for the coin-tossing model, we can use the Bernoulli trial as the likelihood function, and the data distribution can then be described by the Binomial distribution:

$p(X = h|\theta) = \binom{n}{h}\theta^h(1-\theta)^{n-h}$

($h$ stands for the number of heads observed in $n$ tosses.)
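As a quick sanity check, the likelihood above can be evaluated directly (a minimal sketch; the example numbers are my own):

```python
from math import comb

def binomial_likelihood(theta, h, n):
    """p(X = h | theta): probability of h heads in n tosses."""
    return comb(n, h) * theta**h * (1 - theta)**(n - h)

# Probability of seeing 7 heads in 10 tosses of a fair coin:
print(binomial_likelihood(0.5, 7, 10))  # 120/1024 ≈ 0.1172
```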

### Prior Distribution

$p(\theta)$ is called the prior distribution because it encodes our belief/assumption about $\theta$ before seeing any data.

As we run more and more trials, we collect more and more data. Using that data we can compute the posterior distribution, and then use the posterior in place of the prior for future predictions.

This is the core process of Bayesian model updating.

However, choosing the prior belief can be complicated. For the coin-tossing model, we may assume $\theta = 0.5$ at first, which is reasonable for most coins in the world. But what is the distribution of the model parameter $\theta$?

### Probability of the Observed Data

We can only get the probability of the observed data by assuming a prior and marginalizing over it:

$p(X) = \int p(X, \theta)\,\mathrm{d}{\theta} = \int {p(X|\theta)p(\theta)}\, \mathrm{d}{\theta}$

It’s an integral. How scary is that? What if we choose a wrong prior distribution that makes this calculation really complicated?

This is actually a serious problem. It will be answered in the Conjugate Prior section.
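To make the integral concrete, here is a numeric approximation of $p(X)$ for the coin model under a uniform prior $p(\theta) = 1$ (a sketch; the midpoint-rule quadrature and step count are my own choices):

```python
from math import comb

def binomial_likelihood(theta, h, n):
    return comb(n, h) * theta**h * (1 - theta)**(n - h)

def evidence(h, n, steps=10_000):
    """Approximate p(X = h) = ∫ p(X|theta) p(theta) d theta on [0, 1]
    with a uniform prior p(theta) = 1, using the midpoint rule."""
    dt = 1.0 / steps
    return sum(binomial_likelihood((i + 0.5) * dt, h, n) * dt
               for i in range(steps))

# Under a uniform prior, p(X = h) = 1 / (n + 1) for every h,
# so for n = 10 the result should be close to 1/11.
print(evidence(7, 10))  # ≈ 0.0909
```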

### Expand the Bayesian Model

According to the above descriptions, we can now expand the Bayesian model to:

$p(\theta|X) = \frac{p(X|\theta)p(\theta)}{\int {p(X|\theta)p(\theta)}\, \mathrm{d}{\theta}}$

### Get Prediction from Bayesian Model

Let’s put aside the prior-selection problem, assume it’s been done, and focus on how to use a Bayesian model to make the prediction $p(x|X)$:

$p(x|X) = \int {p(x|\theta)p(\theta|X)}\,\mathrm{d}{\theta}$

This is called the posterior prediction. Note that $p(\theta|X)$ is the posterior distribution and $p(x|\theta)$ is the likelihood function, so we’re using the posterior and the likelihood function to predict future events.

It looks like both the posterior and the posterior prediction contain integrals. However, both can be simplified by carefully choosing a conjugate prior.
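The posterior prediction can also be worked out numerically for the coin model. Under a uniform prior, the probability of the next toss being heads after $h$ heads in $n$ tosses should come out to $(h+1)/(n+2)$ (Laplace's rule of succession); the quadrature sketch below is my own illustration:

```python
from math import comb

def binomial_likelihood(theta, h, n):
    return comb(n, h) * theta**h * (1 - theta)**(n - h)

def predict_head(h, n, steps=10_000):
    """p(next toss = head | h heads in n tosses), under a uniform prior:
    p(x|X) = ∫ theta * p(theta|X) d theta, via midpoint-rule quadrature."""
    dt = 1.0 / steps
    thetas = [(i + 0.5) * dt for i in range(steps)]
    evidence = sum(binomial_likelihood(t, h, n) * dt for t in thetas)
    numer = sum(t * binomial_likelihood(t, h, n) * dt for t in thetas)
    return numer / evidence

# Laplace's rule of succession predicts (h + 1) / (n + 2) = 8/12 here.
print(predict_head(7, 10))  # ≈ 0.6667
```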

## Conjugate Prior

In the Prior Distribution section, we described the process of updating the Bayesian model. In summary: each update iteration uses the previous posterior as the next prior.

It’s reasonable to want the prior and the posterior to have the same form, i.e. to belong to the same family of distributions. It would be even better if we could skip the integral calculation and get the posterior by just updating the hyperparameters of the prior distribution. Is that possible? Yes: the answer is the conjugate prior.

Even better, with a conjugate prior, the posterior prediction can also be simplified.

### Beta Distribution as Conjugate Prior

The Beta distribution is the conjugate prior of the Bernoulli, Binomial, and Geometric likelihoods. Let’s simplify the problem in the Thompson Sampling paper and use only a single-armed bandit as an example.

Let’s say each time we pull the arm, it gives back either a success or a failure, and the results conform to the Binomial distribution, such that:

$p(S=s|\theta) = \binom{n}{s}\theta^s(1-\theta)^{n-s}$

(meaning there are $s$ successes after $n$ pulls)

and we choose Beta distribution as the prior:

$p(\theta) = \frac{\theta^{\alpha - 1}(1-\theta)^{\beta - 1}}{B(\alpha,\beta)}$

After working through the integral (with $f = n - s$ denoting the number of failures), the posterior is:

$p(\theta|S) = \frac{\theta^{s + \alpha - 1}(1-\theta)^{f + \beta - 1}}{B(s + \alpha,f + \beta)}$

Can you see the beauty in it? We just need to update $\alpha$ and $\beta$ after each pull of the arm, and we get the posterior: another Beta distribution with different hyperparameters $\alpha$ and $\beta$!
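The whole update loop can be sketched in a few lines (my own illustration, with a hypothetical true success rate of 0.7): each success increments $\alpha$, each failure increments $\beta$, and no integral is ever computed.

```python
import random

class BetaBernoulliArm:
    """Conjugate Beta prior over a single arm's success probability."""

    def __init__(self, alpha=1.0, beta=1.0):  # Beta(1, 1) = uniform prior
        self.alpha = alpha
        self.beta = beta

    def update(self, success):
        # The conjugate update: bump one hyperparameter per observation.
        if success:
            self.alpha += 1
        else:
            self.beta += 1

    def sample(self):
        # Thompson sampling draws theta from the current posterior.
        return random.betavariate(self.alpha, self.beta)

    def mean(self):
        # Posterior mean of Beta(alpha, beta).
        return self.alpha / (self.alpha + self.beta)

# Simulate 1000 pulls of an arm whose (hypothetical) true rate is 0.7.
random.seed(0)
arm = BetaBernoulliArm()
for _ in range(1000):
    arm.update(random.random() < 0.7)
print(arm.mean())  # posterior mean should be close to the true rate 0.7
```

For the multi-armed case in the paper, you would keep one `BetaBernoulliArm` per arm, call `sample()` on each, and pull the arm with the highest sample.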