Bayesian Inference and Graphical ModelsMarkov Chain Monte Carlo
While conjugate priors are elegant, they aren't always applicable. We might want to use a prior distribution which isn't in a conjugate family, or we might not be able to come up with a conjugate family for a given problem. So more general techniques are called for. Let's look at a simple example where we run into difficulties trying to work entirely with on-paper formulas.
Example
Suppose that we have a prior density for a parameter given by . Suppose that the data are drawn from the distribution .
Derive a formula for the posterior mean of , given a single observation from this distribution. Can the formula be evaluated analytically?
Solution. The posterior mean is given by
The integrals in the numerator and denominator cannot be computed analytically.
We could evaluate these integrals to high precision using standard numerical integration techniques like the trapezoid rule. However we're going to want to use a less accurate but more flexible approach based on Markov chains, because the trapezoid rule and similar approaches become intractable when our data are high-dimensional. For example, a 28 by 28 grayscale image is an element of , which is practically impossible to sample densely—counting just the corners, it contains a set of points with no point closer than one unit to another.
Markov Chains
A discrete Markov chain is a sequence of random variables whose joint distribution is specified by a diagram like this one:
The value of the initial random variable is one of the four states (A, B, C, or D in this example), and then the conditional distribution of given is determined by the outgoing arrows from . For example, if , then will be with probability 1/2 or with probability . Then the conditional distribution of given is determined by the outgoing arrows from , and so on.
Exercise
Write down the matrix whose entry is the probability of transitioning to the state given that the Markov chain is currently at the state.
Solution. The transition matrix is given by
P = [1/2 1/2 0 0
1/2 0 1/2 0
0 0 1/2 1/2
1 0 0 0]
Example
Plot an example trajectory of the Markov chain above starting from state A.
using Plots
Solution.
We sample the next state from the corresponding row of :
using Plots, StatsBase function takestep(i) sample(1:4, weights(P[i,:])) end function chain(start, n_steps) x = [start] for i in 1:n_steps push!(x, takestep(x[end])) end x end P = [1/2 1/2 0 0 1/2 0 1/2 0 0 0 1/2 1/2 1 0 0 0] plot(chain(1, 100), size = (800, 100), legend = false, yticks = (1:4, ["A", "B", "C", "D"]))
Exercise
Show that (the top left entry of ) yields the probability of being at state 1 in two steps given that you start from state .
Solution. Squaring and looking at the top left entry gives , while calculating the probability of being at state 1 in two steps starting from state 1 gives the same result: for the probability that we stay at A for both steps, plus for the probability that we go to B and come back to A.
More generally, we can say that the $(i,j)$th entry of is equal to the probability of being at state after steps, given that we began in state .
The animation below shows that happens to the probability of being in each state after steps, starting from state .
We see that the probability of being at state after steps converges as . We call this collection of limiting probabilities the stationary distribution of the Markov chain. It turns out not to depend on the starting state:
The goal of Markov Chain Monte Carlo is to reverse this procedure: we'll start with a desired probability measure on a set and come up with a collection of transition probabilities that give that measure as its stationary distribution. Then we can sample from the desired measure by starting somewhere in and running a long Markov chain with those transition probabilities.
Metropolis-Hastings
Finding transition probabilities which give rise to a particular stationary distribution might seem like a tall order, but there is a surprisingly general method of achieving it.
Suppose is the set we're looking to sample from, and is either a PMF or PDF on . Given a symmetric
- Choose the initial state arbitrarily, and sample from the distribution
. - Define to be with probability (or 1, if this ratio exceeds 1) and otherwise.
- Repeat steps (ii) and (iii) to obtain from , from , and so on.
Note: we call the
This algorithm is called Metropolis-Hastings. To the extent that Metropolis-Hastings is sometimes difficult to implement to the desired effect, the main difficulties usually have to do with choosing an efficient proposal distribution or knowing how long to run the Markov chain until the distribution of the current state is close to the stationary distribution.
Exercise
Apply Metropolis-Hastings to the four-state Markov chain above, with the stationary distribution .
Solution. The probability of moving from given that you start at is the probability that the proposal suggests , which is , multiplied by the probability you accept the proposal, which is . So that entry is 1/4. Likewise, the transition probability from to is . Continuing in this way, we fill out the matrix, and it indeed gives rise to the desired stationary distribution:
P = [5/8 1/4 0 1/8 1/2 0 1/2 0 0 1/2 1/4 1/4 1/2 0 1/2 0] # let's check that each row is close to the # desired stationary distribution: P^100
You might be thinking that surely we could sample from this distribution in easier ways (like generating a uniform random variable and checking whether it's between 0 and 4/9, or between 4/9 and 4/9 + 2/9, etc.). That's a fair point. Let's look at an example where Metropolis-Hastings actually allows us to do something new.
Example
Suppose we want to sample from a probability measure on the set of length-100 binary strings.
(a) How do we sample from the uniform distribution on such strings?
(b) How do we sample from the distribution whose probability mass function assigns to each string a mass proportional to the number of 1's (for example, a string with 82 ones would be twice as likely as a string with 41 ones)?
Solution. In neither case do we want to try inverse CDF sampling, because splitting the unit interval into tiny intervals is impractical (even the
rand(0:1, 100)
For (b), we can use Metropolis-Hastings. We choose an initial state somehow, maybe all ones. Then we propose changing the string by, let's say, choosing a random position and flipping the bit in that position. If the number of ones goes up, we accept that switch. If it goes down, we accept it with probability , where is the current number of heads.
function rand_weighted_by_num_ones(n) x = fill(1, n) n_ones = n for i in 1:10^6 k = rand(1:n) if x[k] == 0 x[k] = 1 n_ones += 1 else if rand() < (n_ones-1)/n_ones x[k] = 0 n_ones -= 1 end end end x end rand_weighted_by_num_ones(100)
The following video provides a visualization for the algorithm that rand_weighted_by_num_ones
executes.
Notice here that it's helpful to choose a proposal distribution that suggests changes which can be executed quickly (for example, in this case we just looked at one position at a time rather than several), because we'll have to perform lots of these changes to get close to the stationary distribution.
Bayesian Linear Regression
Finally, let's see how to use MCMC to do Bayesian linear regression.
Consider a linear regression model , where is normally distributed with mean 0 and variance . We'll treat the parameters amd as Bayesian parameters, meaning that they're random variables with some distribution. As priors, let's take normal distributions with mean 0 and variance 100 for both and .
Our goal is to sample from the posterior distribution given some observations as ranges from 1 to . Let's use kind of an extreme example to make the point:
using Plots, Distributions x = rand(Uniform(0, 1), 1000) y_mean = 0.4x .+ 0.2 y = y_mean + rand(Normal(0, 0.05), 1000) function observations() scatter(x, y, ms = 1, msw = 0.2, size = (400, 250), legend = false) end observations()
To sample from the posterior, we'll use Metropolis-Hastings on the parameters and . That means we'll start at initial values for those variables, then propose an update drawn from . We'll calculate the acceptance ratio for the posterior distribution, move or stay based on that ratio, and so on.
Exercise
What is the acceptance ratio for the posterior distribution, given the observations?
Solution. Since posterior is proportional to likelihood times prior, we need to compute likelihood ratios and ratios of priors. The latter we'll get by deferring to Julia's
Let's code this up. We begin by writing a function to compute the acceptance ratio.
N = Normal(0, 10) δ(x,y,m,b) = sum((yᵢ - m*xᵢ - b)^2/2 for (xᵢ, yᵢ) in zip(x,y)) function α(x, y, m, b, m_prop, b_prop) min(1.0, exp(-δ(x,y,m_prop,b_prop) + δ(x,y,m,b)) * pdf(N, m_prop)/pdf(N, m) * pdf(N, b_prop)/pdf(N, b)) end
function mcmc(n_iterations) m, b, σ = 0.0, 0.0, 1.0 θs = [(m, b)] for i in 1:n_iterations m_prop, b_prop = m + rand(Normal(0,0.005)), b + rand(Normal(0,0.005)) if rand() < α(x, y, m, b, m_prop, b_prop) m, b = m_prop, b_prop push!(θs, (m, b)) end end θs end
Finally, we can visualize the result:
θs = mcmc(3_000) m, b = θs[end] observations() plot!(0:1, x-> m*x + b, size = (400, 300), ylims = (-0.5, 1))
If we run this code cell several times, then we'll get slightly different results each time. That's because we're sampling and from their posterior distributions, rather than choosing specific "best" values as we did in frequentist statistics. We could use many runs of the Markov chain sampling algorithm to get quantiles for and and even see how they're distributed jointly given the observed data:
scatter([mcmc(3000)[end] for _ in 1:1000], size = (400, 400), ms = 2, msw = 0.2, title = "posterior distribution", xlabel = "m", ylabel = "b", legend = false)
Exercise
Explain why it makes intuitive sense that the joint posterior distribution of and would be negatively correlated.
Solution. If happens to be on the large side, then the slope needs to be smaller than usual to be reasonably compatible with the data (that is, to cut through the middle of the point cloud). Said another way, and are both exceptionally large, then the likelihood for that pair would be extremely small because of the Gaussian factors picking up large vertical differences between the observed points and the line. Similarly, if is smaller than usual, then is likely to be larger than usual.