Week 9 Day 2: Markov Chain Monte Carlo

Objectives

  • Set up a new virtual environment
  • Broad overview of MCMC
  • Look at PyMC3

Virtual environments

Or how to install a conda package properly

We are going to need to install Theano on OSC. This means we need a new conda virtual environment! Let's see how we would do that. I would open a new terminal, because it's a bit slow:

conda create -n theano_env python=3.6 anaconda theano

  • Name (-n): theano_env
  • Python version: 3.6
  • Package list: all packages from anaconda, plus theano

This will also install up-to-date packages to link into the new environment, so it might be a bit slow. On your local machine it might be faster, since you probably run conda update anaconda occasionally.

Next, you'll need the kernel available in Jupyter: you can either add nb_conda_kernels to your base environment if you own it (not possible on OSC), or install the kernel manually:

source activate theano_env
python -m ipykernel install --user --name theano_env

Note: in recent Conda releases (which also ship Python 3.7), you can use conda activate instead of source activate.

While you are in the environment, go ahead and install pymc3: pip install pymc3 or conda install -c conda-forge pymc3.
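Once that's done, a quick sanity check from a notebook running the theano_env kernel (a minimal sketch; your version numbers will differ):

import theano
import pymc3
print("theano:", theano.__version__)
print("pymc3:", pymc3.__version__)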

# NumPy keeps throwing a FutureWarning, but we will ignore it.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import scipy.stats
import scipy.special
import numpy as np
import matplotlib.pyplot as plt

Example: Biased coin

This is a very popular example for understanding Bayesian statistics. Let's assume you have a coin and you think it may be biased. Your job is to flip the coin and estimate the bias $p$, where $p$ is the chance of obtaining a head ($p=0.5$ for an unbiased coin).

You flip the coin 14 times. It lands heads up 10 times.

Let's illustrate the difference between the Frequentist and Bayesian approaches by asking each of them a question: would you bet even money that the next two flips will both land heads up?

n = 14
heads = 10
a = 1
b = 1

Frequentist statistics

The Frequentist simply computes a point estimate of $p$ from the data:

p_freq = heads / n
prob_2_heads = p_freq**2
print(f"p={p_freq:.3}, and the chance of two heads is {prob_2_heads:.1%}")

Bayesian statistics: Analytical

For us, $p$ is no longer a value, but a distribution. What is the probability of $p$, given data $x$? This is the Bayes formula:

$$ P(p|x) = \frac{P(x|p) P(p)}{P(x)} \tag{1} $$

We can compute $P(x|p)$ using a binomial distribution:

$$ P(x|p) \propto p^{10} \left( 1-p \right)^4 $$

Bayesians need to select a prior distribution $P(p)$. We'll pick the Beta distribution:

$$ P(p) \propto p^{a-1} \left( 1-p \right)^{b-1} $$

This gives us a combined distribution of:

$$ P(p|x) \propto p^{10+a-1} \left( 1-p \right)^{4+b-1} $$

Now, we need to normalize (we've been dropping constants everywhere intentionally). This is again a Beta distribution, so we can just write it that way:

$$ P(p|x) = \mathrm{Beta}(p; a+10, b+4) $$
x = np.linspace(0,1,200)
orig = scipy.stats.beta.pdf(x, a,b)
new = scipy.stats.beta.pdf(x, a + heads,b + n - heads)
plt.plot(x,orig, label="Prior")
plt.plot(x,new, label="Posterior")
plt.legend()
plt.show()

Working through some math, we could find the probability of two heads in a row:

$$ P(HH|x) = \frac{B(10+a+2, 4+b)}{B(10+a, 4+b)} $$

(Here $B$ is the Beta function, not the Beta distribution.)

# Posterior predictive probability of two heads in a row: E[p^2] under Beta(a+10, b+4)
(  scipy.special.beta(10 + a + 2, 4 + b)
 / scipy.special.beta(10 + a, 4 + b) )

This works out to about 48.5%, just under even odds, so the Bayesian would vote against betting on two heads in a row (while the Frequentist's 51% says take the bet).

We were lucky here, thanks to our careful choice of problem: the conjugate prior gave us a nice posterior and we could do everything analytically. What if we didn't have such a perfect setup?

Let's look again at the Bayes formula for a general parameter $\theta$:

$$ P(\theta|x) = \frac{P(x|\theta) P(\theta)}{P(x)} \tag{1} $$
  • $P(\theta)$: Prior
  • $P(x|\theta)$: Likelihood
  • $P(x)$: Evidence

We need to compute:

$$ P(x) = \int _\Theta P(x,\theta) d\theta $$

For all but the simplest models, this integral has no closed form. So how about using Monte Carlo and sampling from the posterior directly? But to sample from the posterior we would need to be able to evaluate it, which requires the very normalization we cannot compute. We didn't fix anything.
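For a single parameter we could still brute-force the evidence on a grid (a sketch of my own, reusing the imports above and the coin model); the real problem is that the cost of such a grid grows exponentially with the number of parameters:

# Grid approximation of the evidence P(x) for the coin model (illustrative sketch;
# feasible only because theta is one-dimensional)
theta = np.linspace(0, 1, 10_001)
likelihood = scipy.stats.binom.pmf(10, 14, theta)   # P(x|theta): 10 heads in 14 flips
prior = scipy.stats.beta.pdf(theta, 1, 1)           # P(theta): flat Beta(1, 1) prior
evidence = np.trapz(likelihood * prior, theta)      # P(x) = integral of P(x, theta) d theta
posterior = likelihood * prior / evidence           # properly normalized posterior density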

How about we go crazy and decide to build a Markov chain whose equilibrium (stationary) distribution matches our unknown posterior? While this may seem even harder, it turns out that we can do it rather easily!

Markov Chains

Let's change gears for a second and talk about Markov chains. A Markov chain is a system where the next state depends only on the current state, not on any prior states. Chutes and Ladders, the children's board game, is the canonical example: given the current marker locations, the distribution over next positions depends only on the dice rolls; the game has no "memory" of how you reached the current state.
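As a toy numerical illustration (a made-up 3-state chain, not part of the lecture, reusing the numpy import from above): repeatedly applying the transition matrix drives the state distribution toward a stationary distribution, no matter where the chain starts.

# Toy 3-state Markov chain: row i of T holds the transition probabilities out of state i
T = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])

state = np.array([1.0, 0.0, 0.0])   # start with all probability mass in state 0
for _ in range(50):
    state = state @ T               # take one step of the chain
print(state)                        # approximately the stationary distribution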

For us, it means that we start at some point, then propose a "jump" to a new point. If the new point has higher posterior density, we keep it. On its own that would just be a search that stops at the peak, but we want a distribution, not a single value, so instead we accept the jump with probability $p_\mathrm{accept} = p_\mathrm{proposal} / p_\mathrm{current}$: a ratio above one is always accepted, while (for example) a proposal with half the current density is kept half the time.
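To make the accept rule concrete, here is a minimal Metropolis sketch for the coin posterior above (the names are my own, and it reuses the numpy and matplotlib imports from earlier; PyMC3 will do all of this for us in a moment):

# Minimal Metropolis sampler for the coin posterior (illustrative sketch)
def coin_posterior_unnorm(p, heads=10, tails=4, a=1, b=1):
    # Unnormalized Beta posterior from the analytical example above
    if p <= 0 or p >= 1:
        return 0.0
    return p**(heads + a - 1) * (1 - p)**(tails + b - 1)

p_current = 0.5
samples = []
for _ in range(5000):
    p_proposal = p_current + np.random.normal(0, 0.1)    # propose a jump near the current point
    p_accept = coin_posterior_unnorm(p_proposal) / coin_posterior_unnorm(p_current)
    if np.random.rand() < p_accept:                      # a ratio > 1 always accepts
        p_current = p_proposal
    samples.append(p_current)

plt.hist(samples, bins=50, density=True)
plt.show()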

PyMC3

Let's look at MCMC to solve the coin problem above. We'll adjust our numbers a bit:

n = 100
heads = 61
a = 10
b = 10
niter = 2_000
x = np.linspace(0,1,200)
orig = scipy.stats.beta.pdf(x, a, b)
new = scipy.stats.beta.pdf(x, a + heads, b + n - heads)
plt.plot(x,orig, label="Prior")
plt.plot(x,new, label="Posterior")
plt.legend()
plt.show()
import pymc3 as pm
with pm.Model() as coin_context:
    # Beta(2, 2) prior on the bias (note: not the same a=b=10 prior as the plot above)
    p = pm.Beta('p', alpha=2, beta=2)
    # Binomial likelihood: `heads` successes observed out of n flips
    y = pm.Binomial('y', n=n, p=p, observed=heads)
    trace = pm.sample(niter)
# Raw trace of the sampled p values
plt.plot(trace['p'])
plt.show()
# Discard the first half as burn-in and look at the remaining samples
p = trace.get_values('p', burn=niter//2, combine=True, chains=[0,2])
plt.plot(p)
plt.show()
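As a quick check (my own addition, assuming the trace from the cell above is still in scope), we can compare the sampled values against the analytical posterior implied by the model's Beta(2, 2) prior:

# Compare MCMC samples to the conjugate analytical posterior Beta(2 + heads, 2 + n - heads)
x = np.linspace(0, 1, 200)
analytic = scipy.stats.beta.pdf(x, 2 + heads, 2 + n - heads)
plt.hist(trace['p'], bins=50, density=True, alpha=0.5, label='MCMC samples')
plt.plot(x, analytic, label='Analytical posterior')
plt.legend()
plt.show()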

Gaussian example

Taken verbatim from https://twiecki.github.io/blog/2015/11/10/mcmc-sampling/. It uses a Gaussian instead of the coin flip: we estimate the mean $\mu$ from 20 data points drawn from a standard normal centered at 0.
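Concretely, the model is $x_i \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma = 1$ and a $\mathcal{N}(\mu_0, \sigma_0^2)$ prior on $\mu$. Because the prior is conjugate, the posterior for $\mu$ is again normal; this is the standard result that calc_posterior_analytical below implements:

$$ \mu_\mathrm{post} = \frac{\mu_0/\sigma_0^2 + \sum_i x_i/\sigma^2}{1/\sigma_0^2 + n/\sigma^2}, \qquad \sigma_\mathrm{post}^2 = \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right)^{-1} $$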

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import norm

sns.set_style('white')
sns.set_context('talk')

np.random.seed(123)
data = np.random.randn(20)
ax = plt.subplot()
sns.distplot(data, kde=False, ax=ax)
_ = ax.set(title='Histogram of observed data', xlabel='x', ylabel='# observations');
def calc_posterior_analytical(data, x, mu_0, sigma_0):
    # Conjugate update for the mean of a normal likelihood with known sigma = 1
    sigma = 1.
    n = len(data)
    mu_post = (mu_0 / sigma_0**2 + data.sum() / sigma**2) / (1. / sigma_0**2 + n / sigma**2)
    sigma_post = (1. / sigma_0**2 + n / sigma**2)**-1
    return norm(mu_post, np.sqrt(sigma_post)).pdf(x)

ax = plt.subplot()
x = np.linspace(-1, 1, 500)
posterior_analytical = calc_posterior_analytical(data, x, 0., 1.)
ax.plot(x, posterior_analytical)
ax.set(xlabel='mu', ylabel='belief', title='Analytical posterior');
sns.despine()
def sampler(data, samples=4, mu_init=.5, proposal_width=.5, plot=False, mu_prior_mu=0, mu_prior_sd=1.):
    mu_current = mu_init
    posterior = [mu_current]
    for i in range(samples):
        # suggest new position
        mu_proposal = norm(mu_current, proposal_width).rvs()

        # Compute likelihood by multiplying probabilities of each data point
        likelihood_current = norm(mu_current, 1).pdf(data).prod()
        likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()
        
        # Compute prior probability of current and proposed mu        
        prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
        prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)
        
        p_current = likelihood_current * prior_current
        p_proposal = likelihood_proposal * prior_proposal
        
        # Accept proposal?
        p_accept = p_proposal / p_current
        
        # Accept if a uniform draw falls below the ratio (a ratio > 1 always accepts)
        accept = np.random.rand() < p_accept
        
        if plot:
            plot_proposal(mu_current, mu_proposal, mu_prior_mu, mu_prior_sd, data, accept, posterior, i)
        
        if accept:
            # Update position
            mu_current = mu_proposal
        
        posterior.append(mu_current)
        
    return posterior
# Function to display
def plot_proposal(mu_current, mu_proposal, mu_prior_mu, mu_prior_sd, data, accepted, trace, i):
    from copy import copy
    trace = copy(trace)
    fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols=4, figsize=(16, 4))
    fig.suptitle('Iteration %i' % (i + 1))
    x = np.linspace(-3, 3, 5000)
    color = 'g' if accepted else 'r'
        
    # Plot prior
    prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
    prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)
    prior = norm(mu_prior_mu, mu_prior_sd).pdf(x)
    ax1.plot(x, prior)
    ax1.plot([mu_current] * 2, [0, prior_current], marker='o', color='b')
    ax1.plot([mu_proposal] * 2, [0, prior_proposal], marker='o', color=color)
    ax1.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2),
                 arrowprops=dict(arrowstyle="->", lw=2.))
    ax1.set(ylabel='Probability Density', title='current: prior(mu=%.2f) = %.2f\nproposal: prior(mu=%.2f) = %.2f' % (mu_current, prior_current, mu_proposal, prior_proposal))
    
    # Likelihood
    likelihood_current = norm(mu_current, 1).pdf(data).prod()
    likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()
    y = norm(loc=mu_proposal, scale=1).pdf(x)
    sns.distplot(data, kde=False, norm_hist=True, ax=ax2)
    ax2.plot(x, y, color=color)
    ax2.axvline(mu_current, color='b', linestyle='--', label='mu_current')
    ax2.axvline(mu_proposal, color=color, linestyle='--', label='mu_proposal')
    #ax2.title('Proposal {}'.format('accepted' if accepted else 'rejected'))
    ax2.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2),
                 arrowprops=dict(arrowstyle="->", lw=2.))
    ax2.set(title='likelihood(mu=%.2f) = %.2f\nlikelihood(mu=%.2f) = %.2f' % (mu_current, 1e14*likelihood_current, mu_proposal, 1e14*likelihood_proposal))
    
    # Posterior
    posterior_analytical = calc_posterior_analytical(data, x, mu_prior_mu, mu_prior_sd)
    ax3.plot(x, posterior_analytical)
    posterior_current = calc_posterior_analytical(data, mu_current, mu_prior_mu, mu_prior_sd)
    posterior_proposal = calc_posterior_analytical(data, mu_proposal, mu_prior_mu, mu_prior_sd)
    ax3.plot([mu_current] * 2, [0, posterior_current], marker='o', color='b')
    ax3.plot([mu_proposal] * 2, [0, posterior_proposal], marker='o', color=color)
    ax3.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2),
                 arrowprops=dict(arrowstyle="->", lw=2.))
    #x3.set(title=r'prior x likelihood $\propto$ posterior')
    ax3.set(title='posterior(mu=%.2f) = %.5f\nposterior(mu=%.2f) = %.5f' % (mu_current, posterior_current, mu_proposal, posterior_proposal))
    
    if accepted:
        trace.append(mu_proposal)
    else:
        trace.append(mu_current)
    ax4.plot(trace)
    ax4.set(xlabel='iteration', ylabel='mu', title='trace')
    plt.tight_layout()
    #plt.legend()
np.random.seed(123)
sampler(data, samples=8, mu_init=-1., plot=True);
posterior = sampler(data, samples=15000, mu_init=1.)
fig, ax = plt.subplots()
ax.plot(posterior)
_ = ax.set(xlabel='sample', ylabel='mu');
ax = plt.subplot()

sns.distplot(posterior[500:], ax=ax, label='estimated posterior')
x = np.linspace(-.5, .5, 500)
post = calc_posterior_analytical(data, x, 0, 1)
ax.plot(x, post, 'g', label='analytic posterior')
_ = ax.set(xlabel='mu', ylabel='belief');
ax.legend();