Chapter 1 Numerical mathematics
For reasons of computational efficiency, essentially all numerical and statistical computations are performed on computers with finite precision arithmetic. The finite precision needs to be taken into account when designing algorithms because the numbers represented by a computer do not have all the same properties that real numbers have.
1.1 Floating point numbers
Most numerical and statistical algorithms and computations use floating point numbers. These are numbers of the form \[\begin{equation} x_{FP} = \pm c \cdot 2^q, \tag{1.1} \end{equation}\] where both the coefficient \(c\) and exponent \(q\) are integers chosen from a finite range of values. Floating point numbers are useful because they can easily represent values that are very small in absolute value (close to zero) and ones that are very large by using negative and positive exponents \(q\), respectively. The range of values is still finite and as a consequence there are real numbers \(x \neq 0\) for which the closest floating point number \(x_{FP}=0\), leading to what is known as underflow in computations. Conversely, computations with a result that is larger than the largest floating point number are said to overflow.
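To make the representation (1.1) concrete, the coefficient and exponent of a double precision value can be recovered with Python's math.frexp. The sketch below is illustrative only; the helper fp_decompose is not a standard function:

```python
import math

def fp_decompose(x):
    # Recover integers c and q with x = c * 2**q for a double precision x
    m, e = math.frexp(x)   # x = m * 2**e with 0.5 <= |m| < 1
    c = int(m * 2**53)     # scale the 53-bit significand to an integer
    q = e - 53
    return c, q

# 0.1 has no exact binary representation: it is stored as the
# nearest representable value of the form c * 2**q
c, q = fp_decompose(0.1)
print(c, q)
assert c * 2.0**q == 0.1
```

Printing c and q for 0.1 shows that the stored value is a large integer scaled by a power of two, not exactly 1/10.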
The floating point numbers used by all modern computers are defined by the IEEE Standard for Floating-Point Arithmetic (IEEE 754). The most commonly used formats are 64-bit double precision and 32-bit single precision numbers, although others, such as 16-bit half precision, are also occasionally used.
Table 1.1: Key parameters of common IEEE 754 floating point formats
Precision | Bits | \(c\) bits | \(q\) bits | Max value |
---|---|---|---|---|
Single | 32 | 24 | 8 | \(\approx 3.4 \cdot 10^{38}\) |
Double | 64 | 53 | 11 | \(\approx 1.8 \cdot 10^{308}\) |
Half | 16 | 11 | 5 | \(\approx 6.6 \cdot 10^4\) |
Python/NumPy allows exploring floating point type properties using np.finfo():
## Machine parameters for float32
## ---------------------------------------------------------------
## precision = 6 resolution = 1.0000000e-06
## machep = -23 eps = 1.1920929e-07
## negep = -24 epsneg = 5.9604645e-08
## minexp = -126 tiny = 1.1754944e-38
## maxexp = 128 max = 3.4028235e+38
## nexp = 8 min = -max
## ---------------------------------------------------------------
## Machine parameters for float64
## ---------------------------------------------------------------
## precision = 15 resolution = 1.0000000000000001e-15
## machep = -52 eps = 2.2204460492503131e-16
## negep = -53 epsneg = 1.1102230246251565e-16
## minexp = -1022 tiny = 2.2250738585072014e-308
## maxexp = 1024 max = 1.7976931348623157e+308
## nexp = 11 min = -max
## ---------------------------------------------------------------
The value eps returned above is especially interesting because it is the smallest number of the given type for which \(1 + eps \neq 1\).
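This property is easy to verify directly; a short sketch using np.finfo:

```python
import numpy as np

eps = np.finfo(np.float64).eps  # about 2.22e-16 for double precision

# Adding eps to 1 changes the result, but adding eps/2 is rounded away:
print(1.0 + eps != 1.0)       # True
print(1.0 + eps / 2 == 1.0)   # True
```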
1.1.1 Special values
In addition to numbers in the prescribed range, floating point numbers can represent the following special values:
- \(\pm\)Inf (\(\pm \infty\))
- NaN (not-a-number)
Infinity values appear when a computation overflows, e.g. np.exp(1000):
## exp(1000) = inf
##
## <string>:1: RuntimeWarning: overflow encountered in exp
Infinities obey the usual rules of arithmetic and can be compared as usual.
NaN is used to represent undefined values such as \(0/0\), \(0 \cdot \infty\), \(\infty - \infty\).
All arithmetic operations involving a NaN result in a NaN, and all comparisons with NaN (except \(\neq\)) evaluate to false. To test whether a number is NaN, you therefore need to use np.isnan().
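These rules can be checked interactively; a minimal sketch:

```python
import numpy as np

# Infinities follow the usual arithmetic and ordering rules:
print(np.inf + 1 == np.inf)   # True
print(-np.inf < 0 < np.inf)   # True

# Ill-defined combinations yield NaN:
print(np.inf - np.inf)        # nan

# NaN compares unequal to everything, including itself:
print(np.nan == np.nan)       # False
print(np.nan != np.nan)       # True
print(np.isnan(np.nan))       # True
```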
1.2 Some important properties of floating point numbers
Some mathematical properties of floating point numbers that differ from real numbers:
- Uneven distribution of numbers, more dense near zero
- Finite range of representable values
  - Example: the cumulative distribution function of the standard normal gives \(\Phi(10) \approx 1 - 7.619853 \cdot 10^{-24} =_F 1\)
  - Using \(\log p\) instead of \(p\) can help with probabilities
- Rounding errors break many standard rules of arithmetic
  - Equality comparison is non-trivial because of rounding: never use \(x == y\)
  - Avoid computing \(x - y\) when \(x \approx y\) (catastrophic cancellation), e.g. for \(x, y \gg 0\) or \(x, y \ll 0\)
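A couple of these failures are easy to reproduce; the sketch below demonstrates non-associativity, absorption of small terms, and tolerance-based comparison:

```python
import numpy as np

# Associativity fails under rounding:
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)            # False: 0.6000000000000001 vs 0.6
print(np.isclose(a, b))  # True: compare with a tolerance instead of ==

# Absorption: a small term can vanish entirely next to a large one
print((1e16 + 1.0) - 1e16)   # 0.0, not 1.0
```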
1.3 Computing with probabilities
As probabilities are non-negative but can be very close to 0, it is often numerically advantageous to store their logarithms \(\log p\) instead of the raw values \(p\).
1.3.1 Logsumexp
(Re-)normalisation of probabilities defined on the log scale requires converting back to raw probabilities with exp(), which can cause an overflow or an underflow. This can be avoided using the so-called logsumexp trick.
Let us consider a vector \(\boldsymbol{\pi} = (\pi_i) = (\log p_i)\), where \(p_i > 0\), \(i= 1,\dots,n\). In order to compute \[\log S = \log \sum_{i = 1}^n p_i,\] we need to evaluate \[\log S = \log \sum_{i=1}^n \exp(\pi_i).\] However, the computation of \(\exp(\pi_i)\) can overflow even if the final \(\log S\) is not that large. The overflow can be avoided by finding \[M = \max_{i \in \{1, \dots, n\}} \pi_i\] and computing \[\begin{equation} \log S = M + \log \sum_{i=1}^n \exp(\pi_i - M), \tag{1.2} \end{equation}\] which is guaranteed to be well-behaved because \(\max_{i \in \{1, \dots, n\}} (\pi_i - M) = 0\). The same trick avoids potential underflows when the \(\pi_i\) are small, because at least one of the \(\pi_i - M\) is guaranteed not to be small.
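Eq. (1.2) translates directly into a few lines of NumPy. The sketch below is illustrative; scipy.special.logsumexp offers a production implementation:

```python
import numpy as np

def logsumexp(pi):
    # Stable log(sum(exp(pi))) via the shift of Eq. (1.2)
    pi = np.asarray(pi, dtype=float)
    M = np.max(pi)
    return M + np.log(np.sum(np.exp(pi - M)))

# Naive np.log(np.sum(np.exp(pi))) would underflow to log(0) = -inf here:
pi = np.array([-1000.0, -1000.0])
print(logsumexp(pi))   # -1000 + log(2), about -999.307
```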
1.3.2 Special functions
Mathematical libraries in various programming environments contain a number of special functions that are specifically designed for computing challenging expressions in a numerically stable manner by avoiding overflows and underflows. Some examples of such functions are listed in Table 1.2 (here scs refers to the scipy.special module).
Table 1.2: Examples of special functions for numerically stable computations

Name | Purpose |
---|---|
np.log1p | Computes \(\log(1 + x)\), useful for \(x \approx 0\) |
np.expm1 | Computes \(\exp(x) - 1\), useful for \(x \approx 0\) |
scs.gammaln | Computes \(\log \Gamma(x)\) |
scs.erf | Computes \(\operatorname{erf}(x) = 2/\sqrt{\pi} \int_{0}^x e^{-t^2} \,\mathrm{d} t\) (the cumulative distribution function of \(\mathcal{N}(0, 1)\) is \(\Phi(x) = 1/2[1 + \operatorname{erf}(x/\sqrt{2})]\)) |
scs.erfc | Computes \(1 - \operatorname{erf}(x)\), useful for \(x \gg 0\) |
scs.erfcx | Computes \(\exp(x^2) (1-\operatorname{erf}(x))\), useful for \(x \gg 0\) |
Examples:
## 1e-99
## 0.0
## 1e-20
## 0.0
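The outputs above match the behaviour of np.expm1 and np.log1p near zero; the calls below are consistent with them, but are an assumed reconstruction since the original code was not preserved:

```python
import numpy as np

# exp(x) - 1 computed naively loses everything for tiny x; expm1 does not:
print(np.expm1(1e-99))       # 1e-99
print(np.exp(1e-99) - 1.0)   # 0.0

# The same for log(1 + x) vs log1p:
print(np.log1p(1e-20))       # 1e-20
print(np.log(1.0 + 1e-20))   # 0.0
```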
1.4 Reporting extreme values computed using logarithms
Most people are used to working with base-10 numbers. When using logarithms to represent extreme values, it is sometimes non-trivial to print them as \(x \cdot 10^y\).
Assuming \(z = \log_{10}(p)\), we have for all \(y\), \[ p = 10^z = 10^{z - y} \cdot 10^y. \] By choosing \(y = \lfloor z \rfloor\), we have \[ 1 \le 10^{z - y} < 10. \] Numbers computed in different logarithmic bases can be converted as usual using \[ \log_{10}(p) = \frac{\log_b(p)}{\log_b(10)}. \]
import numpy as np

def pretty_print_log10(logp):
    # Print base-10 logarithm value as x*10^y
    # For all y, we have
    #   p = 10^logp = 10^y * 10^(logp - y)
    # By choosing y to be the largest integer not greater than logp,
    # we have 1 <= x < 10
    y = np.floor(logp)
    x = 10**(logp - y)
    print(x, 'e', int(y), sep='')  # int(), as np.int has been removed from NumPy
print("Computed using floating point values and logarithms")
## Computed using floating point values and logarithms
## 1.2676506002282257e30
## Computed using Python's arbitrary precision integer arithmetic
## 1267650600228229401496703205376
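The comparison above can be reproduced with a short driver. The exact call was not preserved in the text, but the exact integer output identifies the value as \(2^{100}\), so the following is a plausible reconstruction:

```python
import numpy as np

def pretty_print_log10(logp):
    # Print a base-10 logarithm value as x*10^y with 1 <= x < 10
    y = np.floor(logp)
    x = 10**(logp - y)
    print(x, 'e', int(y), sep='')

print("Computed using floating point values and logarithms")
pretty_print_log10(100 * np.log10(2.0))   # log10(2**100) via logarithms

print("Computed using Python's arbitrary precision integer arithmetic")
print(2**100)
```

The small discrepancy between the two printed values is exactly the accumulated rounding error of the logarithmic route.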
1.5 Best practices and hints
- Be aware of your data types and their ranges, especially with GPUs and other special hardware
- Compute and store values in a way that maintains precision (\(p\) vs. \(1 - p\), \(\log p\), renormalising things as necessary)
- Order of evaluation does matter!
- You should not do equality comparisons for floating point numbers because of possible rounding errors
  - Python has np.isclose() to test approximate equality
  - Equality tests for special values should be done using the special functions np.isfinite(), np.isinf(), np.isnan()
- NaN values can be annoying, because any operation with NaN yields a NaN and thus a single NaN value often propagates to all values in a large computation.
  - It can be useful to use np.isnan() to check the output of a suspicious function
1.6 Literature
Gentle (2009) contains an introduction to basic challenges raised by floating point representation.
1.7 Exercises
- Write a program to increment x = 0.0 by 0.1 100 times. Compute x - 10. How do you interpret the result?
- Verify (prove) that the logsumexp trick of Eq. (1.2) gives the correct result for \(\log S\).
- Find examples of floating point computations that break standard rules of arithmetic.
- For \(x \sim \mathcal{N}(0, 1)\), compute \(\log p(28 < x < 30)\).
- Check if the following properties of real numbers hold for floating point numbers and provide examples of how the invalid properties fail.
  - \(\forall x < y, \exists z: x < z < y\)
  - \(x, y \in \mathbb{R} \Rightarrow x + y \in \mathbb{R}\)
  - \(x, y \in \mathbb{R} \Rightarrow x y \in \mathbb{R}\)
  - There exists a unique \(a = 0\) such that \(x + a = a + x = x\)
  - \(x, y, z \in \mathbb{R} \Rightarrow (x + y) + z = x + (y + z)\)
  - \(x, y, z \in \mathbb{R} \Rightarrow (x y) z = x (y z)\)
  - \(x, y, z \in \mathbb{R} \Rightarrow x (y + z) = xy + xz\)
References
Gentle, James E. 2009. Computational Statistics. 1st ed. New York, NY: Springer.