Chapter 3 Multivariate normal distributions and numerical linear algebra
The multivariate normal distribution is one of the most commonly encountered distributions in statistics.
The \(d\)-dimensional multivariate normal \(\mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})\) is parametrised by the \(d\)-dimensional mean vector \(\mathbf{\mu}\) and a symmetric positive-definite \(d \times d\) covariance matrix \(\mathbf{\Sigma}\).
Some important properties of symmetric positive-definite matrices \(\mathbf{\Sigma}\):
- all eigenvalues of symmetric matrices are real; and
- all eigenvalues of symmetric positive-definite matrices are positive.
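Both properties are easy to check numerically; a minimal sketch in NumPy (the example matrix is an arbitrary illustration):

```python
import numpy as np

# An example symmetric positive-definite matrix (arbitrary illustration)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# eigvalsh is specialised for symmetric matrices and returns real eigenvalues
eigenvalues = np.linalg.eigvalsh(Sigma)
print(eigenvalues)               # real eigenvalues, in ascending order
print(np.all(eigenvalues > 0))   # True for a positive-definite matrix
```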
3.1 Cholesky decomposition
A symmetric positive-definite matrix \(\mathbf{\Sigma}\) can be represented as \[ \mathbf{\Sigma} = \mathbf{L} \mathbf{L}^T, \] where \(\mathbf{L}\) is a lower-triangular matrix. \(\mathbf{L}\) is called the Cholesky decomposition (or Cholesky factor) of \(\mathbf{\Sigma}\).
Note: Python's NumPy and SciPy have two incompatible implementations of the Cholesky decomposition: scipy.linalg.cholesky and numpy.linalg.cholesky. The NumPy version numpy.linalg.cholesky uses the definition used here, while scipy.linalg.cholesky returns by default the upper-triangular matrix \(\mathbf{U} = \mathbf{L}^T\). scipy.linalg.cholesky can be made to return \(\mathbf{L}\) using scipy.linalg.cholesky(..., lower=True).
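A small sketch of the two conventions (the matrix is an arbitrary illustration):

```python
import numpy as np
import scipy.linalg

Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])

L_numpy = np.linalg.cholesky(Sigma)                  # lower-triangular L
U_scipy = scipy.linalg.cholesky(Sigma)               # upper-triangular U = L^T (default)
L_scipy = scipy.linalg.cholesky(Sigma, lower=True)   # lower-triangular L

print(np.allclose(L_numpy @ L_numpy.T, Sigma))  # True: L L^T reconstructs Sigma
print(np.allclose(U_scipy.T, L_numpy))          # True: U is just L transposed
print(np.allclose(L_scipy, L_numpy))            # True
```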
3.2 Evaluating the multivariate normal density
The log-density of the \(d\)-dimensional multivariate normal is: \[ \log p(\mathbf{x}) = -\frac{d}{2} \log(2 \pi) -\frac{1}{2} \log | \det \mathbf{\Sigma} | - \frac{1}{2} (\mathbf{x}-\mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu}).\]
It contains two terms that are more difficult to evaluate: \[ \log | \det \mathbf{\Sigma} | \] and \[ - \frac{1}{2} (\mathbf{x}-\mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu}).\]
Both difficult terms can be evaluated efficiently and in a numerically stable manner using the Cholesky decomposition of \(\mathbf{\Sigma}\).
3.2.1 Determinant evaluation using the Cholesky decomposition
Assume \(\mathbf{\Sigma} = \mathbf{L} \mathbf{L}^T\) with \(\mathbf{L} = \begin{pmatrix} l_{11} & 0 & 0 & \dots & 0 \\ l_{21} & l_{22} & 0 & \dots & 0 \\ l_{31} & l_{32} & l_{33} & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ l_{d1} & l_{d2} & l_{d3} & \dots & l_{dd} \\ \end{pmatrix}.\)
Now using basic properties of the determinant and the logarithm, we obtain \[\begin{align*} \log \det \boldsymbol{\Sigma} &= \log\left( \det ( \mathbf{L} \mathbf{L}^T) \right) = \log\left( \det \mathbf{L} \det \mathbf{L}^T \right) \\ &= \log\left( (\det \mathbf{L})^2 \right) = 2 \log\left( \det \mathbf{L} \right) \\ &= 2 \log\left( \prod_{i=1}^d l_{ii} \right) = 2 \sum_{i=1}^d \log\left( l_{ii} \right). \end{align*}\]
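In code, the log-determinant thus reduces to a sum over the diagonal of \(\mathbf{L}\); a minimal sketch (the function name is my own):

```python
import numpy as np

def logdet_via_cholesky(Sigma):
    """Return log(det(Sigma)) for a symmetric positive-definite Sigma."""
    L = np.linalg.cholesky(Sigma)
    return 2.0 * np.sum(np.log(np.diag(L)))

# Quick check against the direct (less stable) computation
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
print(logdet_via_cholesky(Sigma), np.log(np.linalg.det(Sigma)))
```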
3.2.2 Quadratic form evaluation with Cholesky
We can simplify the quadratic form as \[\begin{align*} (\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) &= (\mathbf{x}-\boldsymbol{\mu})^T (\mathbf{L} \mathbf{L}^T)^{-1} (\mathbf{x}-\boldsymbol{\mu}) \\ &= (\mathbf{x}-\boldsymbol{\mu})^T \mathbf{L}^{-T} \mathbf{L}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \\ &= (\mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu}))^T \mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \\ &= \mathbf{z}^T \mathbf{z} = \sum_{i=1}^d z_i^2, \end{align*}\] where \(\mathbf{z} = (z_1, \dots, z_d)^T = \mathbf{L}^{-1}(\mathbf{x}-\mathbf{\mu})\).
Note: You should never explicitly compute \(\mathbf{\Sigma}^{-1}\)!
To compute \(\mathbf{z} = \mathbf{L}^{-1}(\mathbf{x}-\mathbf{\mu})\), we solve a linear system of equations \[ \mathbf{L} \mathbf{z} = \mathbf{x}-\mathbf{\mu}. \] In our case \(\mathbf{L}\) is triangular, so this can be solved efficiently using forward substitution, e.g. with scipy.linalg.solve_triangular(..., lower=True).
Even in general, the best algorithm for solving \(\mathbf{\Sigma} \mathbf{x} = \mathbf{b}\) for positive-definite \(\mathbf{\Sigma}\) uses the Cholesky decomposition of \(\mathbf{\Sigma}\) internally.
Note: Even if you are given \(\mathbf{\Sigma}^{-1}\) for free, it may still be better (as fast or faster, more stable) to use the Cholesky approach.
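Putting the two terms together, a minimal log-density sketch using the Cholesky factor and forward substitution (the function name and the check against scipy.stats.multivariate_normal are my own additions):

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.stats import multivariate_normal

def mvn_logpdf(x, mu, Sigma):
    """Log-density of N(mu, Sigma) at x, evaluated via the Cholesky factor."""
    d = len(mu)
    L = np.linalg.cholesky(Sigma)
    z = solve_triangular(L, x - mu, lower=True)    # solves L z = x - mu
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2 * np.pi) + logdet + z @ z)

# Sanity check in 2D
mu = np.array([1.0, -1.0])
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
x = np.array([0.5, 0.5])
print(mvn_logpdf(x, mu, Sigma), multivariate_normal(mu, Sigma).logpdf(x))
```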
Note: SciPy provides slightly faster routines for solving linear systems with Cholesky factors as scipy.linalg.cho_factor and scipy.linalg.cho_solve.
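A short usage sketch: factorise once with cho_factor and reuse the factor for any right-hand side with cho_solve (the matrix and vector are arbitrary illustrations):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])

c, low = cho_factor(Sigma)         # Cholesky factorisation, computed once
z = cho_solve((c, low), b)         # solves Sigma z = b using the factor
print(np.allclose(Sigma @ z, b))   # True
```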
3.3 Simulating multivariate normal random draws
The Box-Muller transform (Sec. 2.2.1) can be used to simulate multivariate normal random draws with a unit covariance \(\mathbf{I}_d\): stacking \(d\) independent standard normal draws gives \(\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)\). A draw \(\mathbf{y} \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})\) with a general mean and covariance can then be obtained through the affine transformation \(\mathbf{y} = \mathbf{L} \mathbf{x} + \mathbf{\mu}\), where \(\mathbf{L}\) is the Cholesky factor of \(\mathbf{\Sigma} = \mathbf{L} \mathbf{L}^T\).
Proof. \(\mathbf{y}\) is multivariate normal because an affine transformation preserves normality: the (unnormalised) multivariate normal density, after the transformation \(\mathbf{x} = \mathbf{A} \mathbf{z} + \mathbf{b}\) with invertible \(\mathbf{A}\), still has the form of a multivariate normal over \(\mathbf{z}\) (the constant Jacobian of the transformation is absorbed into the normaliser \(Z'\)): \[\begin{align*} p(\mathbf{x}) &= \frac{1}{Z} \exp(- \frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}))\\ &= \frac{1}{Z'} \exp(- \frac{1}{2} (\mathbf{A} \mathbf{z} + \mathbf{b} -\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{A} \mathbf{z} + \mathbf{b}-\boldsymbol{\mu}))\\ &= \frac{1}{Z'} \exp(- \frac{1}{2} (\mathbf{z} - \mathbf{A}^{-1}(\boldsymbol{\mu} - \mathbf{b}))^T (\mathbf{A}^{-1} \boldsymbol{\Sigma} (\mathbf{A}^{-1})^{T})^{-1} (\mathbf{z} - \mathbf{A}^{-1}(\boldsymbol{\mu} - \mathbf{b}))). \end{align*}\]
The mean and covariance of \(\mathbf{y}\) can be evaluated as \[ \operatorname{E}[\mathbf{y}] = \mathbf{L} \operatorname{E}[\mathbf{x}] + \mathbf{\mu} = \mathbf{\mu} \] \[ \operatorname{Cov}[\mathbf{y}] = \operatorname{E}[ (\mathbf{y} - \operatorname{E}[\mathbf{y}]) (\mathbf{y} - \operatorname{E}[\mathbf{y}])^T ] = \operatorname{E}[ (\mathbf{L} \mathbf{x})(\mathbf{L} \mathbf{x})^T] = \operatorname{E}[ \mathbf{L} \mathbf{x} \mathbf{x}^T \mathbf{L}^T ] = \mathbf{L} \operatorname{E}[ \mathbf{x} \mathbf{x}^T ] \mathbf{L}^T = \mathbf{L} \mathbf{I}_d \mathbf{L}^T = \mathbf{\Sigma}. \]
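A minimal sampling sketch based on this result; the function name is my own, and NumPy's standard normal generator stands in for a hand-written Box-Muller implementation:

```python
import numpy as np

def sample_mvn(mu, Sigma, n, rng=None):
    """Draw n samples from N(mu, Sigma) via y = L x + mu with x ~ N(0, I_d)."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(Sigma)
    X = rng.standard_normal((n, len(mu)))   # rows are independent N(0, I_d) draws
    return X @ L.T + mu                     # each row becomes mu + L x

mu = np.array([1.0, -1.0])
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
Y = sample_mvn(mu, Sigma, 100_000)
print(Y.mean(axis=0))            # approximately mu
print(np.cov(Y, rowvar=False))   # approximately Sigma
```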
3.4 Further numerical linear algebra tricks
- If \(\mathbf{\Sigma}\) is not positive definite because of numerical problems (it has small negative eigenvalues), adding a small multiple of the identity, \(\mathbf{\Sigma} + \epsilon \mathbf{I}_d\) with \(\epsilon > | \lambda_{\text{min}} |\), will fix this (see the sketch below).
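A minimal sketch of this fix (the nearly-singular matrix is an artificial example):

```python
import numpy as np

# A symmetric matrix that has drifted slightly indefinite due to round-off
Sigma = np.array([[1.0, 1.0],
                  [1.0, 1.0 - 1e-12]])
lam_min = np.linalg.eigvalsh(Sigma).min()
print(lam_min)                         # slightly negative

eps = 2 * abs(lam_min)                 # any eps > |lambda_min| works
Sigma_fixed = Sigma + eps * np.eye(Sigma.shape[0])
L = np.linalg.cholesky(Sigma_fixed)    # now succeeds
```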
Written algorithms often use the matrix trace \[ \operatorname{tr}(\mathbf{A}) = \sum_{i=1}^d a_{ii}, \] which only depends on the diagonal elements of the matrix.
If the argument \(\mathbf{A}\) is a complex expression, explicit evaluation is very inefficient. Consider e.g. \[ \operatorname{tr}(\mathbf{A}^T \mathbf{B}) = \sum_{i,j} a_{ij} b_{ij}, \] which is much faster (\(\mathcal{O}(n^2)\)) to compute without forming the matrix product (\(\mathcal{O}(n^3)\)).
- Using np.trace(A.T @ B) would waste the off-diagonal elements computed in the matrix product.
- Using np.sum(A * B) evaluates the same sum directly in \(\mathcal{O}(n^2)\) operations, as in the sketch below.
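A small sketch comparing the two evaluations (the matrix size is chosen arbitrarily):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

t0 = time.perf_counter()
slow = np.trace(A.T @ B)     # forms the full n x n product: O(n^3)
t1 = time.perf_counter()
fast = np.sum(A * B)         # elementwise product and sum: O(n^2)
t2 = time.perf_counter()

print(np.isclose(slow, fast))                                    # True
print(f"trace of product: {t1 - t0:.3f} s, elementwise sum: {t2 - t1:.3f} s")
```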
3.5 Numerical integration
Bayesian statistics provides a theoretically well-founded framework for answering many statistical questions. Unfortunately it is often difficult to apply in practice because the answers depend on the evaluation of complicated integrals which cannot be solved analytically.
Numerical integration concerns the approximation of definite integrals through finite sums: \[ \int_V f(x) \mathrm{d}x \approx \sum_i w_i f(x_i). \] There are a number of efficient methods for this problem when the space is one-dimensional, \(V\) is bounded and \(f(x)\) is well-behaved.
The simplest methods for numerical integration are based on a uniform grid. Different ways of interpolating the function between the grid points result in different algorithms with different weights \(w_i\):
- Piece-wise constant interpolation yields the so-called rectangle rule.
- Piece-wise linear interpolation yields the so-called trapezoidal rule.
- Piece-wise quadratic interpolation yields the so-called Simpson’s rule.
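The first two rules above can be written out directly; a minimal sketch (the integrand \(\sin x\) on \([0, \pi]\) is an arbitrary test case with exact value 2):

```python
import numpy as np

f = np.sin
a, b, n = 0.0, np.pi, 101
x = np.linspace(a, b, n)
h = x[1] - x[0]

# Rectangle rule: piece-wise constant interpolation (left endpoints)
rectangle = h * np.sum(f(x[:-1]))

# Trapezoidal rule: piece-wise linear interpolation,
# so the end points get weight h/2 and the interior points weight h
trapezoid = h * (0.5 * f(x[0]) + np.sum(f(x[1:-1])) + 0.5 * f(x[-1]))

print(rectangle, trapezoid)   # both close to the exact value 2
```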
Given a fixed number of function evaluations, more accurate methods can be obtained by allowing them to be non-uniformly spaced. Gaussian quadrature methods have been developed to yield optimal accuracy for a small number of function evaluations in various settings such as targets with a known weighting function.
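As a sketch of such a rule, NumPy provides Gauss-Legendre points and weights for the interval \([-1, 1]\) via numpy.polynomial.legendre.leggauss:

```python
import numpy as np

# A 5-point Gauss-Legendre rule integrates polynomials up to degree 9 exactly
x, w = np.polynomial.legendre.leggauss(5)
print(np.sum(w * x**8), 2.0 / 9.0)   # integral of x^8 over [-1, 1]
```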
While these simple quadrature methods work well in 1D, they become impractical very quickly as the dimensionality of the space increases. If accurately covering the integration domain in 1D requires \(n\) points, the number of points needed to cover a \(d\)-dimensional domain with equal coverage is of the order \(n^d\), which grows very rapidly with \(d\).
No efficient general methods for high-dimensional numerical integration are known. Monte Carlo methods are often the best choice.
3.5.1 Software for numerical integration
Numerical software packages usually include functions for computing integrals that apply a number of advanced tricks to make the process significantly more efficient.
In Python, standard 1D numerical integration is performed by scipy.integrate.quad. This function uses an adaptive method to choose the evaluation points and therefore needs access to the target function directly. The module scipy.integrate also has functions for higher-dimensional integration.
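A minimal usage sketch, integrating the standard normal density over the real line (cf. the exercises in Sec. 3.7); quad also handles infinite limits:

```python
import numpy as np
from scipy.integrate import quad

# Standard normal density; its integral over the real line should be 1
pdf = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
value, error_estimate = quad(pdf, -np.inf, np.inf)
print(value, error_estimate)   # ~1.0 and a small error estimate
```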
Sometimes the integrand cannot be easily written as a full function but can only be evaluated at a fixed set of points. In Python, the functions scipy.integrate.trapz and scipy.integrate.simps can compute integrals from such a fixed set of function evaluations.
Similar functions exist in other numerical software, too.
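A small sketch using fixed function evaluations; note that recent SciPy versions expose these routines under the names scipy.integrate.trapezoid and scipy.integrate.simpson:

```python
import numpy as np
from scipy.integrate import trapezoid, simpson   # trapz/simps in older SciPy

x = np.linspace(0.0, np.pi, 51)   # fixed evaluation points
y = np.sin(x)                     # only the function values are available

print(trapezoid(y, x=x))   # ~2.0
print(simpson(y, x=x))     # ~2.0; the exact value is 2
```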
3.6 Some best practices
- Never explicitly compute matrix inverses \(\mathbf{A}^{-1}\) in real code
- Cholesky decomposition is your friend when dealing with covariance matrices
- Always use Cholesky instead of explicit inverse
- Always use Cholesky instead of explicit determinant
- Never explicitly compute the trace of a more complex expression, because many of the computed elements of the expression would be wasted
- Numerical quadrature works well in low dimensions; check your software's documentation for the available implementations
- High-dimensional integration is difficult; Monte Carlo methods may be your best choice in general
3.7 Exercises
- Verify that the formulas derived for the log-determinant and the quadratic form in the normal density using Cholesky decomposition are valid.
- Use numerical integration to verify that the normal pdf implemented in Sec. 2.3 is properly normalised.
- Test and time the two ways of evaluating the trace of the product of two square matrices illustrated by the equation \[ \operatorname{tr}{}(\mathbf{A}^T \mathbf{B}) = \sum_{i,j} a_{ij} b_{ij}. \]