$$\DeclareMathOperator*{\E}{\mathbb{E}} \newcommand{\cE}{\mathcal{E}}$$

Machine Learning and Statistical Mechanics

What do ML and SM have in common?

• Both fields use probabilistic models with large numbers of variables

• There are theoretical concepts and tools that apply to both

Probability in Statistical Mechanics

• Goal: describe thermodynamic properties of macroscopic system probabilistically in terms of microscopic constituents.

• The probabilistic model is normally the Boltzmann distribution

$$p(\mathbf{x})=\frac{\exp\left[-\beta \mathcal{E}(\mathbf{x})\right]}{Z},$$

• Normalizing constant $Z$ is partition function, $\mathcal{E}(\mathbf{x})$ is the energy of configuration $\mathbf{x}$, and $\beta=1/T$ is inverse temperature

• Central problem of SM: computing averages of physical quantities

• Principle difficulty: this is hard

Example I: Classical gas

• $\mathbf{x}$ corresponds to positions of each gas molecule: $\mathbf{x}=(\mathbf{x}_1,\ldots \mathbf{x}_N)$

• Average is a $3N$-dimensional integral

• Only tractable case: noninteracting (ideal) gas, in which case

$$\mathcal{E}(\mathbf{x}) = \sum_{n=1}^N \mathcal{E}_1(\mathbf{x}_n)$$

• If we introduce interactions between particles of the form $$\mathcal{E}(\mathbf{x}) = \sum_{n<m}^N \mathcal{E}_2(\mathbf{x}_n,\mathbf{x}_m)$$ things get a lot harder.

Example 2: Ising model

• Can also have discrete random variables, e.g. Ising model

• Configuration corresponds to fixing the values of $N$ “spins” $\sigma_n=\pm 1$ with an energy function of the form $$\mathcal{E}(\sigma)=\sum_n h_n\sigma_n + \sum_{n,m} J_{mn}\sigma_m\sigma_n.$$ It’s the couplings $J_{mn}$ that causes problems / interest

• Worst case: sum over $2^N$ configurations

• Solve approximately with: mean field theory, Monte Carlo, etc.

Probability in Machine Learning

• Example: computer vision.

• Image defined by set $(R,G,B)$ values of each pixel each $\in [0,255]$

• Basic hypothesis of probabilistic ML

Dataset represents a set of independent and identically distributed (iid) samples of some random variables.

• For image, variables are RGB values of pixels

• Distribution has to be highly correlated and have a great deal of complex structure: cats and dogs not white noise

SM vs. ML

• Classical SM: motion of molecules deterministic but complicated. Replace with probability model constrained by physics

• ML: (mostly) rely solely on data. Infer properties of model. How?

• Recent prgress using models based on NNs + training algorithms

• Allows rich probability models describing images or audio signals

Applications of probablistic ML

• Sampling (generative modelling). Obvious physics applications
• Density estimation. Anomaly detection
• Compression

This lecture

• Some mathematical background
• Variational inference

Some mathematical background

Probabilities

• Probabilities $p(x)\geq 0$ satisfy

$$\sum_x p(x)=1$$

• For continuous variables $$\int p(x) dx=1,$$ but we’ll use the discrete notation throughout.
• Joint probabilities denoted $p(x_1,\ldots x_N)$

• Sum over subset to give marginal distribution of remaining

$$p(x)= \sum_{y} p(x,y).$$

• Conditional probability $p(x|y)$: distribution of $x$ given fixed $y$. The relation between joint and conditional probabilities is

$$p(x,y)=p(x|y)p(y) \tag{1} \label{eq:joint}$$

Chain rule of probability

$$p(x_1,\ldots x_N)=p(x_1)p(x_2|x_1)p(x_3|x_2,x_1)\cdots p(x_N|x_1,\ldots x_{N-1}), \tag{2} \label{eq:chain}$$

• Sampling is easy!

Priors and posteriors

• Another way to express the joint probability is

$$p(x,y)=p(y|x)p(x)$$

$$p(y|x)=\frac{p(x|y)p(y)}{p(x)}$$

Bayesian statistics

• Bayes' theorem is workhorse of Bayesian statistics

• Regard parameters $z$ in your probability model as random variables taken from some initial distribution $p(z)$, called the prior distribution (or just the prior)

Example

• Model distribution is Gaussian normal distribution with mean $\mu$ and variance $\sigma^2$

• Parameters are $z=(\mu,\sigma^2)$

• For prior could choose a normal distribution: $\mu\sim \mathcal{N}(\mu_\mu,\sigma^2_\mu))$

• For $\sigma^2$ a distribution of a positive quantity: the inverse gamma distribution is a popular choice.

• Once parameters fixed, have a model distribution for your data that can be thought of as the conditional distribution $p(x|z)$

• What does an observation of $x$ tell me? Just use Bayes:

$$p(z|x) = \frac{p(x|z)p(z)}{p(x)}.$$

• This is the posterior distribution (or just posterior)

• Note that the denominator doesn’t depend on $z$, it just provides a normalization. If you have lots of data points then

$$p(z|x_1,\ldots x_N) \propto p(x_1,\ldots x_N|z)p(z).$$

• Bayes' theorem lets us update our beliefs about parameters based on our initial beliefs and any evidence we receive. This is inference.

Latent variables; generative models

• We allow the $z$s to have different distributions for different data points $p(z_n|x_n)$

• Equivalently, our model is defined by a joint distribution $p(x,z)$.

Example: mixture model

• $M$ different components, each with their own distribution $p(x|m)$ and occurring with probability $p(m)$, so that

$$p(x) = \sum_m p(m)p(x|m).$$

• Observation $x$ will give me information about $p(m|x)$, telling which of the $M$ components that observation belongs to.

• This may bring insight, if latent variables are interpretable

• Or: a more powerful model

• Latent variables allow for structure learning

• Example: for a dataset of images of people walking we’d like to find latent variables parameterize a manifold of different poses.

• Latent variable models are also the basis of generative modelling: sampling from a distribution $p(x)$ learnt from data.

• If the model has been formulated in terms of a prior $p(z)$ over latent variables and a generative model $p(x|z)$, sampling is straightforward in principle.

Entropy

• In SM we’re familiar with entropy associated with probability distribution.

• Arrived in ML from information theory

$$H[p]=- \sum_x p(x)\log_2 p(x).$$

• Taking the logarithm base 2 means we measure in bits

One interpretation of entropy

• $N$ iid variables with distribution $p(x)$

• Probability of observing a sequence $x_1,\ldots x_N$ is

$$$$p(x_1,\ldots x_N)=\prod_{n=1}^N p(x_n).$$ \tag{3} \label{eq:seq}$$

• Probability is obviously exponentially small as $N\to\infty$, but how small?

$$\lim_{N\to\infty} \frac{1}{N}\log p(x_1,\ldots x_N) = -H[p].$$

• Asymptotic partition property

• Shouldn’t the probability depend on what you actually get?

• Suppose you have a biased coin that give heads with probability $p_H>0.5$ and tails with probability $p_T=1-p_H$

• Chance of getting half heads and half tails exponentially small

• What you’re going to get instead is

$$\frac{N_H}{N}\to p_H\qquad \frac{N_T}{N}\to p_T\qquad .$$

• What is the probability of such a sequence? $p_H^{N_H}p_T^{N_T}$

$$\log_2\left(p_H^{N_H}p_T^{N_T}\right)= N_H\log_2 p_H + N_T\log_2 p_T = -N H[p_H, p_T].$$

Entropy and information

• A way to quantify information in a signal

• If the coin is really biased, you will be surprised when you get tails

• Entropy lower than for fair coin, which has maximum entropy $H=1$

HHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHTHHHHT

• To describe such a sequence, you might say “21 H, 13 H, 4 H”

• Shorter than the original sequence; possible because of the high degree of predictability

• But: extra symbols including the digits 0-9 and comma.

• Should instead compare with a binary code of only two symbols

• How can we exploit the lower entropy of the sequence?

• Use fact that we expect $N_H=Np_H$ heads and $N_T=N p_T$ tails, so we can just give the ordering of these heads and tails, which is one of $$\frac{N!}{N_H! N_T!}$$ possibilities. If we label each of these with a binary number, we end up with a description of length $$\log_2\left(\frac{N!}{N_H! N_T!}\right)\sim N H[p]\leq N$$ (where we used Stirling’s approximation $\log n! \sim n\log n -n$).

N i.i.d. random variables each with entropy H(X) can be compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞; but conversely, if they are compressed into fewer than N H(X) bits it is virtually certain that information will be lost.

• Shannon’s theorem is the core idea that underlies (lossless) data compression

• The more predictable a signal (i.e. the lower the entropy) the more it can be compressed, with the entropy setting a fundamental limit on the number of bits required.

Divergences

• We need some way of talking about the degree to which two distributions differ

• Most common measure in use in ML is the Kullback–Leibler divergence (KL)

$$D_\text{KL}(p||q)=\sum_x p(x)\log\left(\frac{p(x)}{q(x)}\right)=\E_{x\sim p}\log\left(\frac{p(x)}{q(x)}\right).$$

• KL has property

$$D_\text{KL}(p||q)\geq 0$$

$$\E\left[\varphi(x))\right]\geq \varphi\left(\E\left[x\right]\right)$$

• Apply this to the KL then, using the convexity of $\varphi(x)=-\log(x)$ $$D_\text{KL}(p||q)=-\E_{x\sim p}\log\left(\frac{q(x)}{p(x)}\right)\geq -\log\left(\E_{x\sim p}\left[\frac{q(x)}{p(x)}\right]\right)=-\log(1)=0,$$ with equality if and only if $p=q$.

Variational inference (VI)

• Recall Bayes'

$$p(z|x) = \frac{p(x|z)p(z)}{p(x)}=\frac{p(x,z)}{p(x)}.$$

• Complicated latent variable model $\longrightarrow$ intractable denominator

Mean field theory

• For SM model like Ising the probability has the form

$$p(\sigma) = \frac{\exp\left[-\beta\cE(\sigma)\right]}{Z}.$$

$$q_\theta(\sigma)=\prod_n q_{\theta_n}(\sigma_n).$$

• Natural to try to minimize

$$D_\text{KL}(q||p)(q||p)=\E_{\sigma\sim q_\theta}\left[\log\left(\frac{q_\theta(\sigma)}{p(\sigma)}\right)\right].$$

• Substituting in the Boltzmann distribution $$D_\text{KL}(q||p)(q||p)= \log Z - H[q_\theta] + \beta \E_{\sigma\sim q_\theta}\left[\cE(\sigma)\right]\geq 0,$$ or in usual SM language $$\E_{\sigma\sim q_\theta}\left[\cE(\sigma)\right]-TH[q_\theta] \geq F,$$ where $F=-T\log Z$ is the Helmholtz free energy.

• This is the Bogoliubov or Gibbs inequality

• For Ising spins our factorized distributions are defined by fields on each site $$q_{\theta_n}(\sigma_n) = \frac{\exp\left[-\beta\theta_n\sigma_n\right]}{2\cosh (\beta\theta_n)},$$ with average spin $$\E_{\sigma_n\sim q_n}\left[\sigma_n\right] = -\tanh\left(\beta\theta_n\right).$$

VI in latent variable models

• Just need to replace the Boltzmann distribution with $$p(z|x) =\frac{p(x,z)}{p_\text{M}(x)}.$$ (we add the subscript “M” for model)

• Role of spins $\sigma$ is now played by the latent variables

• Following same steps leads us to

$$\log p_\text{M}(x) \geq \E_{z\sim q_\theta(\cdot|x)}\left[\log p(x,z)\right]+ H[q_\theta(\cdot|z)].$$

• RHS is Evidence lower bound or ELBO (marginalized probability $p(x)$ on the left is sometimes called the model evidence).

• Possible to rewrite as $$\log p_\text{M}(x) \geq \log p_\text{M}(x) - D_\text{KL}(q_\theta(\cdot|x)||p(\cdot|x)),$$ so the bound is saturated when the variational posterior for the latent variables coincides with the true posterior $$p(z|x)=p(x,z)/p_\text{M}(x)$$