Lies, damn lies, and statistics
1 Uncertainty estimates from curve_fit and scipy.optimize.minimize
We previously examined how to estimate uncertainty from the covariance matrix returned by curve_fit. Recall that we need the diagonal of the covariance matrix, which is estimated during the fit. The covariance matrix is related to the inverse Hessian of the objective function; here we explore how the two are related.
We will consider fitting a line to the following data.
import numpy as np
from scipy.optimize import curve_fit

x = np.array([0., 2.5, 5., 7.5, 10.])
y = np.array([1.14, 1.91, 2.48, 2.2, 4.0])

def model(x, m, b):
    return m * x + b

p, pcov = curve_fit(model, x, y, [0.2, 0.1])
print(p)
pcov
[0.2404 1.14399999]
array([[ 0.00430672, -0.0215336 ],
[-0.0215336 , 0.161502 ]])
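As a quick reminder of how that diagonal is used (a short sketch consistent with what we did previously), the square roots of the diagonal elements give the standard errors of the fitted slope and intercept:

perr = np.sqrt(np.diag(pcov))  # standard error of the slope and intercept
print(perr)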
scipy.optimize.minimize does not return the covariance matrix; with some of the methods (e.g. the default BFGS), it returns an estimate of the inverse Hessian matrix instead. For a summed squared error objective, the covariance matrix is related to the inverse Hessian by \(cov \approx 2 s^2 H^{-1}\), where \(s^2 = SSE / (n - k)\) is the estimated residual variance for \(n\) data points and \(k\) parameters. For this data, \(2 s^2\) happens to be close to 0.5, so below we compare \(0.5 H^{-1}\) to the covariance from curve_fit. Note this relationship is specific to the minimization of the summed squared errors.
from scipy.optimize import minimize

def model(pars, x):
    x = np.array(x)
    m, b = pars
    return m * x + b

def objective(pars):
    errs = y - model(pars, x)
    return np.sum(errs**2)

sol = minimize(objective, [0.2, 1])
print(sol.x)
0.5 * sol.hess_inv
[0.2404 1.14399997]
array([[ 0.01019113, -0.06596866],
[-0.06596866, 0.49131361]])
That does not look very good. But remember that hess_inv is only an estimate of the inverse Hessian, and we need to be careful about its accuracy. The minimizer terminates when the solution reaches the tolerance, not when the Hessian estimate is accurate! If we tighten the tolerance, we get a more accurate result.
sol = minimize(objective, [0.2, 1], tol=1e-9)
print(sol.x)
print(0.5 * sol.hess_inv)
[0.24039999 1.144     ]
[[ 0.00424425 -0.02201408]
 [-0.02201408  0.16613705]]
With the tighter tolerance, you can see the covariance is approximately equal to 1/2 the inverse Hessian. That means you can use it to estimate the uncertainty in the same way we did with curve_fit.
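For example, here is a minimal sketch (reusing sol from the cell above) of turning the scaled inverse Hessian into standard errors and approximate 95% confidence intervals, just as we would with pcov:

from scipy.stats import t

cov = 0.5 * sol.hess_inv             # approximate covariance matrix from the minimizer
se = np.sqrt(np.diag(cov))           # standard errors of m and b
tval = t.ppf(0.975, len(x) - 2)      # two-sided 95% interval, n - 2 degrees of freedom
for name, val, s in zip(['m', 'b'], sol.x, se):
    print(f'{name} = {val:1.4f} +/- {tval * s:1.4f}')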
Not all solvers estimate the inverse Hessian matrix; SLSQP, for example, does not provide it. In that case you have three options: compute the Hessian analytically, compute it numerically with a differentiation package such as numdifftools (which you may have to install), or compute it with automatic differentiation (which we return to later). Either way, you have to compute the Hessian of the objective function that is being minimized. Here we use numdifftools.
!pip install numdifftools
Collecting numdifftools
  Downloading numdifftools-0.9.39-py2.py3-none-any.whl (953 kB)
Installing collected packages: numdifftools
Successfully installed numdifftools-0.9.39
Now, similar to what we did with scipy.misc.derivative, we can write a function and use numdifftools to get its Hessian. Here, we define the summed squared error function, create a Hessian function for it, and evaluate that Hessian at the parameters at the minimum. Finally, numpy.linalg.inv gives the inverse of the Hessian, from which we estimate the covariance.
import numdifftools as nd

def f(pars):
    m, b = pars
    return np.sum((y - (m * x + b))**2)

H = nd.Hessian(f)  # H is now a callable function that takes one argument, the parameters

0.5 * np.linalg.inv(H(p))
array([[ 0.004, -0.02 ],
[-0.02 , 0.15 ]])
numdifftools (https://pypi.org/project/numdifftools/) is a numerical differentiation package. It is more sophisticated than scipy.misc.derivative, but it is still fundamentally a numerical approximation to the derivatives. With it, you can estimate the uncertainties even for optimizers that do not provide an estimated inverse Hessian.
Later we will learn about one more approach to getting the derivatives that is used in machine learning called automatic differentiation.
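As a small preview (this is only a sketch; it assumes the jax library is installed and is not otherwise used in this notebook), an automatic differentiation package can compute the Hessian of the same objective function:

import jax
import jax.numpy as jnp

def f_jax(pars):
    m, b = pars
    return jnp.sum((jnp.asarray(y) - (m * jnp.asarray(x) + b))**2)

H_ad = jax.hessian(f_jax)(jnp.asarray(p))  # Hessian computed by automatic differentiation
print(0.5 * jnp.linalg.inv(H_ad))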
2 Effects of outliers on regression
Outliers can have a significant effect on the fit of a model to data. Let's consider this example, where we want to fit a line to some data that has an outlier in it. This is just a linear regression, and we start out using numpy.polyfit.
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

x = [0., 2.5, 5., 7.5, 10.]
y = [1.14, 1.91, 2.48, 2.2, 4.0]
#                            ^
#                            |
#                         outlier

p = np.polyfit(x, y, 1)
print(p)
xfit = np.linspace(0, 10)
plt.plot(x, y, 'bo')
plt.plot(xfit, np.polyval(p, xfit))
plt.xlabel('x')
plt.ylabel('y')
[0.2404 1.144 ]
[Figure: the data points and the least squares fit line, x vs. y.]
You can see that the fitted line is "dragged" towards the outlier. We say that least squares minimization is not robust to outliers.
This may be undesirable: if you believe that point is an outlier, perhaps due to experimental error, then it influences the fitted model more than the points you believe to be more accurate.
Today we will consider a variety of approaches to minimize the effects of outliers. We first begin by re-examining how these parameters are obtained. Here, we illustrate that the results from polyfit are equivalent to minimizing the summed squared errors between the model and the data.
from scipy.optimize import minimize

def model(pars, x):
    x = np.array(x)
    m, b = pars
    return m * x + b

def objective(pars):
    errs = y - model(pars, x)
    return np.sum(errs**2)

minimize(objective, [0.2, 1])
fun: 0.8075100000000078
hess_inv: array([[ 0.02038226, -0.13193732],
[-0.13193732, 0.98262721]])
jac: array([ 5.28991222e-07, -3.05473804e-07])
message: 'Optimization terminated successfully.'
nfev: 20
nit: 2
njev: 5
status: 0
success: True
x: array([0.2404 , 1.14399997])
The problem is that we are minimizing the squared error, \(\sum \epsilon_i^2\), which puts more weight on large errors than on small errors.
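To see this concretely, here is a quick check (not part of the original analysis; it reuses p from polyfit and the model function above) of how much each point contributes to the summed squared error:

errs = y - model(p, x)            # residuals of the least squares fit
print(errs**2 / np.sum(errs**2))  # fraction of the summed squared error from each point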
Least squares regression is also called L2 norm regression; that is, we minimize the L2 norm of the vector of errors.
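As a sketch of that equivalence (assuming the same x and y as above), the L2 solution can also be obtained directly from the linear least squares problem, e.g. with numpy.linalg.lstsq:

A = np.column_stack([np.array(x), np.ones(len(x))])   # design matrix for y = m * x + b
pars_l2, *_ = np.linalg.lstsq(A, np.array(y), rcond=None)
print(pars_l2)                                        # should match the polyfit result above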
2.1 Minimizing the summed absolute errors
We can choose to minimize another objective function, for example the summed absolute value of the errors. This will reduce the emphasis on large errors. This is also called L1 norm regression.
def objective(pars):
    errs = y - model(pars, x)
    return np.sum(np.abs(errs))

L1_sol = minimize(objective, [0.2, 1])
print(L1_sol.x)
plt.plot(x, y, 'bo')
plt.plot(xfit, model(L1_sol.x, xfit))
[0.26845682 1.14 ]
[Figure: the data points and the L1 norm fit line.]
There is a historical reason this is not done as often: the absolute value function has a discontinuity in its first derivative at the origin, which can be problematic for some optimization algorithms. It is evidently not a problem here, and you can see that the outlier has less of an effect on the fitted line in this case.
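If the non-smooth objective ever does cause trouble for a gradient-based solver, one workaround (shown here only as a sketch) is a derivative-free method such as Nelder-Mead:

# The L1 objective defined above, minimized with a derivative-free simplex method
L1_nm = minimize(objective, [0.2, 1], method='Nelder-Mead')
print(L1_nm.x)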
Finally, we can generalize these ideas to something called Lp norm regressions where we seek to minimize:
\(\sum |\epsilon_i|^p\)
In this paper a value of \(p=1.5\) is recommended for general use. Note this is less than two, and greater than one, so it is expected to have an intermediate effect compared to L1 and L2 norm regression.
def objective(pars):
    p = 1.5
    errs = y - model(pars, x)
    return np.sum(np.abs(errs)**p)

Lp_sol = minimize(objective, [0.2, 1])
print(Lp_sol.x)
plt.plot(x, y, 'bo')
plt.plot(xfit, model(Lp_sol.x, xfit))
[0.25741034 1.15352086]
[Figure: the data points and the Lp norm (p = 1.5) fit line.]
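For reference, here is a quick side-by-side comparison of the fitted parameters (reusing p, L1_sol, and Lp_sol from the cells above), showing how the slope changes as the outlier is de-emphasized:

print('L2   (polyfit):', p)
print('L1            :', L1_sol.x)
print('L1.5          :', Lp_sol.x)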
The downside of these approaches is that they complicate the analysis of uncertainty. The uncertainty analysis we have considered so far is only formally correct when we minimize the summed squared errors. It is only approximately correct when something else is minimized.
2.2 Robust regression approaches
An alternative approach to least squares or absolute error minimization is called robust regression (see Applied Regression Analysis, 3rd edition, Draper and Smith, chapter 25). This is a class of methods that uses a different metric to minimize in the objective function.
The simplest approach is to minimize the median of the squared error. Note that minimizing the sum of squared errors is practically like minimizing the average or mean squared error. If you have a symmetric distribution of errors, then the mean and median are practically the same. If there is an outlier, however, the mean will be skewed towards the outlier, while the median will be at a position that splits the distribution in half, and is closer to what you believe the mean to be.
Here we show that given an asymmetric distribution, the median is smaller than the mean.
errs = np.array([0.1, 0.01, 0.05, 0.02, 0.8])
s = errs**2

plt.hist(s, density=True)
plt.axvline(np.mean(s), color='r', label='mean')    # label the lines directly so the legend is correct
plt.axvline(np.median(s), color='k', label='median')
plt.legend()
plt.xlabel('squared error')
[Figure: histogram of the squared errors, with the mean (red) and median (black) marked.]
2.2.1 Least Median regression
It is straightforward to modify the objective function to minimize the median of the squared errors.
def objective(pars):
    errs = y - model(pars, x)
    return np.median(errs**2)

LMS_sol = minimize(objective, [0.2, 1])
print(LMS_sol.x)
plt.plot(x, y, 'bo')
plt.plot(xfit, model(LMS_sol.x, xfit))
[0.26804924 1.18981534]
[Figure: the data points and the least median squares fit line.]
2.3 Weighted nonlinear regression
Outliers are often associated with larger uncertainties about their values. An alternative to the approaches described above is to use weights to specify how important each data point is. This example is adapted from https://www.mathworks.com/help/stats/examples/weighted-nonlinear-regression.html.
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

x = [1, 2, 3, 5, 7, 10]
y = [109, 149, 149, 191, 213, 224]

plt.plot(x, y, 'bo')
plt.xlabel('Incubation (days)')
plt.ylabel('BOD')
[Figure: BOD vs. incubation time data.]
The aim of this work is to fit the nonlinear model \(y = a (1 - e^{-b x})\) to this data. We first consider a standard minimization of the summed squared errors. Inspection of the model suggests that at large \(x\), \(a\) is the plateau value, which we can read from the graph. For the value of \(b\), we might estimate a half-life of about one day and solve \(110 = 240 (1 - e^{-b})\) for \(b\).
-np.log(-(110 / 240 - 1))
0.6131044728864088
def model(pars, x):
    a, b = pars
    x = np.array(x)
    return a * (1 - np.exp(-b * x))

def objective(pars):
    errs = y - model(pars, x)
    return np.sum(errs**2)

guesses = [240, 0.6]

from scipy.optimize import minimize
sol = minimize(objective, guesses)
pars = sol.x

plt.plot(x, y, 'bo')
xfit = np.linspace(0, 10)
plt.plot(xfit, model(pars, xfit))
plt.xlabel('Incubation (days)')
plt.ylabel('BOD')
[Figure: BOD data and the unweighted nonlinear fit.]
The fit generally goes through the data, but it is not clear whether the point near \(x = 2\) is a small outlier that skews the fit, perhaps leading to an inaccurate asymptote at long times.
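One quick way to look at that (a sketch, not part of the original example) is to print the residuals of the unweighted fit:

resid = y - model(pars, x)   # residuals of the unweighted fit
for xi, ri in zip(x, resid):
    print(f'x = {xi:4.1f}  residual = {ri:6.2f}')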
Suppose, however, that these data points represent averages from multiple measurements, and we only measured the first two points once, and the rest of the points 5 times. In this case, we might want to put more weight on the points we measured multiple times.
We achieve this by modifying the objective function, in this case multiplying each error by the number of times the measurement was made. This makes reducing the errors on the points we measured many times more important than reducing the errors on the points we measured only once.
w = np.array([1, 1, 5, 5, 5, 5])

def objective(pars):
    errs = (y - model(pars, x)) * w
    return np.sum(errs**2)

guesses = [240, 0.5]

from scipy.optimize import minimize
sol = minimize(objective, guesses)
pars = sol.x
print(pars)

plt.plot(x, y, 'bo')
xfit = np.linspace(0, 10)
plt.plot(xfit, model(pars, xfit))
plt.xlabel('Incubation (days)')
plt.ylabel('BOD')
[230.77020941 0.35563065]
[Figure: BOD data and the weighted nonlinear fit.]
The result is that the model now fits the heavily weighted points (the ones measured five times) better than the points measured only once.
There are many ways you could choose to weight the points depending on what you know about them. If you have uncertainties about the measured data, you can weight the points accordingly, e.g. defining the weights as inversely proportional to the uncertainty.
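For example, a minimal sketch with made-up measurement uncertainties (the sigma values below are hypothetical, not from the original example) weights each error by the inverse of its standard deviation:

sigma = np.array([15.0, 15.0, 5.0, 5.0, 5.0, 5.0])  # hypothetical standard deviations

def objective_wls(pars):
    errs = (y - model(pars, x)) / sigma   # larger uncertainty -> smaller weight
    return np.sum(errs**2)

sol_wls = minimize(objective_wls, [240, 0.5])
print(sol_wls.x)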
3 Summary
Regression is an important technical skill required in modern engineering. It is the method which we use to convert data into models. Sometimes it is the parameters that are important, e.g. when they represent properties of a system that we are interested in. Sometimes it is the model that is interesting, e.g. when we need to use it for optimization or predictions.
At the core, regression involves minimization of some error function. The standard method is to minimize the summed squared error between the model and the data. There are some benefits to this method: it is straightforward, and there are well-established methods for estimating the uncertainty in the parameters. However, it is known to be sensitive to outliers.
A variety of alternative approaches exist to reduce the influence of outliers, including minimizing the summed absolute errors, robust regression methods, and weighted regression methods. It is not always obvious which method is right; choosing one takes experience and an understanding of the model, the data, and the goals of the regression.