
Linear regression#

  • KEYWORDS: numpy.linalg.solve


In linear regression, we seek to find models in the form \(y = a_{0} f_{0}(x) + a_{1} f_{1}(x) + ... + a_{n} f_{n}(x) + \epsilon\), where \(a_{i}\) are coefficients to be determined, and ε are the residual errors. We call this linear regression because the model is linear in the unknown coefficients \(a_{i}\). The functions can be any function of \(x\). In the function numpy.polyfit these functions are polynomials in \(x\).

If we are given some data as pairs of (x, y), we can construct a set of equations of the form:

\([f_{0}(x_{i}), f_{1}(x_{i}), ..., f_{n}(x_{i})]\cdot[a_{0}, a_{1}, ..., a_{n}]^T = y_{i}\)

There will be one of these equations for every data point, so we end up with a matrix equation that looks like:

\(\mathbf{X} \mathbf{a} = \mathbf{y}\)

There are usually more data points than elements in the vector \(\mathbf{a}\), so the shapes of these arrays are not suitable for solving the equation directly. You could of course set up an objective function and use scipy.optimize.minimize, but there is a better approach.

To be a little more specific, suppose we have \(m\) pairs of \((x, y)\) data points, and we want to fit a model containing \(n\) parameters. Then, the dimensions of \(\mathbf{X}\) will be \((m, n)\), the dimensions of \(\mathbf{a}\) will be \((n, 1)\), and the dimensions of \(\mathbf{y}\) will be \((m, 1)\). We have more equations than unknowns here, and we cannot use numpy.linalg.solve because \(\mathbf{X}\) is not square. Note that if it were square, we would be doing the kind of interpolation we described in the last lecture.
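
Before developing that better approach, here is a quick sketch of the scipy.optimize.minimize route mentioned above, on a small made-up data set (the data and line model here are hypothetical, chosen only for illustration). The rest of this section develops the faster linear-algebra approach.

import numpy as np
from scipy.optimize import minimize

xd = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
yd = np.array([0.1, 1.9, 4.2, 5.8, 8.1, 9.9])  # made-up data, roughly y = 2x

Xd = np.column_stack([xd**0, xd])  # model: y = a0 + a1 * x


def sse(pars):
    "Summed squared error between the model Xd @ pars and the data yd."
    return np.sum((Xd @ pars - yd) ** 2)


sol = minimize(sse, np.zeros(2))
print(sol.x)  # close to [0, 2]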

We can work around this, though, by left-multiplying each side of the equation by \(\mathbf{X}^T\):

\(\mathbf{X}^T \mathbf{X} \mathbf{a} = \mathbf{X}^T \mathbf{y}\)

The array \(\mathbf{X}^T \mathbf{X}\) now has the shape \((n, m) * (m, n) = (n, n)\). The right hand side \(\mathbf{X}^T \mathbf{y}\) has a shape of \((n, m) * (m, 1) = (n, 1)\), and \(\mathbf{a}\) is still \((n, 1)\). This new matrix equation can be solved efficiently with numpy.linalg.solve. We will not prove this in detail, but solving this modified equation is equivalent to finding the set of parameters that minimizes the summed squared errors: \(\sum (\mathbf{X} \mathbf{a} - \mathbf{y})^2\).
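
The essence of the argument is short: the summed squared error can be written as \(SSE = (\mathbf{X} \mathbf{a} - \mathbf{y})^T (\mathbf{X} \mathbf{a} - \mathbf{y})\), and setting its gradient with respect to \(\mathbf{a}\) to zero gives

\(\nabla_{\mathbf{a}} SSE = 2 \mathbf{X}^T (\mathbf{X} \mathbf{a} - \mathbf{y}) = \mathbf{0} \quad \Rightarrow \quad \mathbf{X}^T \mathbf{X} \mathbf{a} = \mathbf{X}^T \mathbf{y},\)

which is exactly the modified equation above.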

The parameters are then found by:

# np.linalg.solve(X.T @ X, X.T @ y)

An alternative form is called the normal equation: \(\mathbf{a} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{y}\). This is symbolically correct, but it relies on computing a matrix inverse, which is expensive for large systems. It is not used in practice; instead, the equations are solved with more efficient algorithms (numpy.linalg.lstsq, which we use below, is one of them).
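
To make this concrete, here is a sketch on a small made-up overdetermined system (the numbers are arbitrary and only for illustration) showing that the modified equation, numpy.linalg.lstsq, and the pseudoinverse form of the normal equation all recover the same parameters.

import numpy as np

# A made-up overdetermined system: 6 equations, 2 unknowns (illustration only).
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
ys = 1.0 + 2.0 * xs + np.array([0.05, -0.02, 0.03, -0.04, 0.01, -0.03])

Xs = np.column_stack([xs**0, xs])  # model: y = a0 + a1 * x

a_normal = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)  # the modified equation above
a_lstsq = np.linalg.lstsq(Xs, ys, rcond=None)[0]  # NumPy's least-squares solver
a_pinv = np.linalg.pinv(Xs) @ ys  # the normal equation via the pseudoinverse

print(a_normal, a_lstsq, a_pinv)  # all three agree to within round-off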

An example of polynomial fitting#

Our goal in this example is to fit a polynomial to some time-dependent concentration data.

import numpy as np

time = np.array([0.0, 50.0, 100.0, 150.0, 200.0, 250.0, 300.0])
Ca = np.array([50.0, 38.0, 30.6, 25.6, 22.2, 19.5, 17.4]) * 1e-3

Fit a fourth order polynomial to this data and determine the confidence interval for each parameter. This data is from example 5-1 in Fogler, Elements of Chemical Reaction Engineering.

We want the equation \(Ca(t) = b_0 + b_1 t + b_2 t^2 + b_3 t^3 + b_4 t^4\) fit to the data in the least squares sense. We can write this in a linear algebra form as \(\mathbf{T} \mathbf{p} = \mathbf{Ca}\), where \(\mathbf{T}\) is a matrix with columns \([1, t, t^2, t^3, t^4]\), and \(\mathbf{p}\) is a column vector of the fitting parameters. We want to solve for the \(\mathbf{p}\) vector and estimate the confidence intervals.

First, we set up the array of function values (the matrix \(\mathbf{T}\) above, called X in the code), and then we solve for the parameters.

?np.column_stack
(time**2).shape
(7,)
X = np.column_stack([time**0, time, time**2, time**3, time**4])
with np.printoptions(precision=2):
    print(X)
[[1.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00]
 [1.00e+00 5.00e+01 2.50e+03 1.25e+05 6.25e+06]
 [1.00e+00 1.00e+02 1.00e+04 1.00e+06 1.00e+08]
 [1.00e+00 1.50e+02 2.25e+04 3.38e+06 5.06e+08]
 [1.00e+00 2.00e+02 4.00e+04 8.00e+06 1.60e+09]
 [1.00e+00 2.50e+02 6.25e+04 1.56e+07 3.91e+09]
 [1.00e+00 3.00e+02 9.00e+04 2.70e+07 8.10e+09]]
(X.T @ X).shape, (X.T @ Ca).shape
((5, 5), (5,))
a = np.linalg.solve(X.T @ X, X.T @ Ca)  # this is the normal equation
print(a)
[ 4.99902597e-02 -2.97846320e-04  1.34348485e-06 -3.48484848e-09
  3.69696970e-12]
import matplotlib.pyplot as plt

plt.plot(time, Ca, "bo", time, X @ a)
plt.legend(["data", "fit"])
plt.xlabel("Time")
plt.ylabel("Ca");

We previously claimed that solving this equation is equivalent to minimizing the summed squared errors. Here we demonstrate that this is consistent with our observation for the first parameter: we vary it around the best-fit value and show that the summed squared error is minimized there.

P = np.linspace(0.9 * a[0], 1.1 * a[0])

errs = [np.sum(np.square(X @ [p, *a[1:]] - Ca)) for p in P]

plt.plot(P, errs)
plt.axvline(a[0], color="k", linestyle="--")
plt.xlabel("slope")
plt.ylabel("SSE")
plt.legend(["SSE", "best fit"]);
def SSE(a):
    return np.sum((Ca - X @ a) ** 2)


SSE(a)
1.0519480519480308e-08

The SSE is quite small, which looks good. To check that these parameters correspond to a minimum of the SSE, we can examine the Hessian of the SSE with respect to the parameters: at a minimum, all of its eigenvalues should be positive.

import numdifftools as nd

H = nd.Hessian(SSE)
e, _ = np.linalg.eig(H(a))
e
array([1.67392305e+20, 2.87547709e+13, 5.43139168e+07, 2.02426698e+00,
       9.28206459e+02])

The Hessian eigenvalues are all positive, so we are at a minimum, but yikes, they span nearly 20 orders of magnitude. That is concerning. It indicates the model is not well-behaved in some way, even though the fit looks good.

np.linalg.det(H(a))
4.911617807628561e+44
np.linalg.cond(H(a))
8.271680468493854e+19
np.linalg.inv(H(a))
array([[ 4.93508334e-01, -1.59453781e-02,  1.62882298e-04,
        -6.56584202e-07,  9.09121476e-10],
       [-1.59453781e-02,  1.59343441e-03, -2.34774419e-05,
         1.14680966e-07, -1.78789195e-10],
       [ 1.62882298e-04, -2.34774419e-05,  3.79875951e-07,
        -1.95152565e-09,  3.14143043e-12],
       [-6.56584202e-07,  1.14680966e-07, -1.95152565e-09,
         1.03300152e-11, -1.69697718e-14],
       [ 9.09121476e-10, -1.78789195e-10,  3.14143043e-12,
        -1.69697718e-14,  2.82829421e-17]])

Exercise Demonstrate that the SSE is minimized for the other parameters. Try estimating the Hessian of the sum of squared errors and then see if it is positive definite.
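
As a hint for the positive-definiteness check (not a full solution), one way is to verify that all eigenvalues of the symmetric Hessian are positive; this reuses the H and a defined above.

# Hint, not a full solution: a symmetric matrix is positive definite
# if and only if all of its eigenvalues are positive.
print(np.all(np.linalg.eigvalsh(H(a)) > 0))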

As we have seen many times before, NumPy provides a function for doing least squares linear regression. It returns more information about the fit than we have computed so far, and it is a little more convenient because we do not have to do all the transposes and left multiplications ourselves.

?np.linalg.lstsq
SSE(a)
1.0519480519480308e-08
pars, residuals, rank, singular_values = np.linalg.lstsq(X, Ca, rcond=None)
pars, residuals, rank, singular_values
(array([ 4.99902596e-02, -2.97846320e-04,  1.34348484e-06, -3.48484840e-09,
         3.69696954e-12]),
 array([1.05194694e-08]),
 5,
 array([9.14856013e+09, 3.79175229e+06, 5.21123386e+03, 2.15423665e+01,
        1.00603128e+00]))
pars, _, _, _ = np.linalg.lstsq(
    X, Ca, rcond=None
)  # _ means ignore the value in that position
pars
array([ 4.99902596e-02, -2.97846320e-04,  1.34348484e-06, -3.48484840e-09,
        3.69696954e-12])
pars = np.linalg.lstsq(X, Ca, rcond=None)[0]  # just get the first return
pars
array([ 4.99902596e-02, -2.97846320e-04,  1.34348484e-06, -3.48484840e-09,
        3.69696954e-12])

The key point to note is that the rank is equal to the number of parameters we are estimating, which means we have enough independent information to get good estimates of all of them.

Confidence intervals on the parameters#

The confidence intervals reflect the range of values we are confident the true parameter lies in. Remember we are only estimating these parameters from a small amount of data.

The number of degrees of freedom is the number of data points minus the number of fitted parameters.

We define \(\sigma^2 = SSE / dof\) where \(SSE\) is the summed squared error, and \(dof\) is the degrees of freedom.

The covariance matrix is defined as \((\mathbf{X}^T \mathbf{X})^{-1}\). Finally, we compute the standard error on the parameters as:

\(\mathbf{se} = \sqrt{diag(\sigma^2 cov)}\).

This will be an array with one element for each parameter. You can think of this standard error as the uncertainty in the mean value of each parameter.

The confidence intervals are finally computed by calculating a student t-value that accounts for the additional uncertainty we have because of the small number of degrees of freedom.

dof = len(Ca) - len(pars)  # degrees of freedom
errs = Ca - X @ pars
sigma2 = np.sum(errs**2) / dof

covariance = np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(sigma2 * covariance))

from scipy.stats.distributions import t

alpha = 0.05  # 100*(1 - alpha) confidence level
sT = t.ppf(1.0 - alpha / 2.0, dof - 1)  # student T multiplier

CI = sT * se

for beta, ci in zip(pars, CI):
    print(f"{beta: 1.2e} [{beta - ci: 1.4e} {beta + ci: 1.4e}]")
 5.00e-02 [ 4.9075e-02  5.0906e-02]
-2.98e-04 [-3.4987e-04 -2.4583e-04]
 1.34e-06 [ 5.4027e-07  2.1467e-06]
-3.48e-09 [-7.6734e-09  7.0369e-10]
 3.70e-12 [-3.2337e-12  1.0628e-11]

Now you can see what is not well-behaved: several of the coefficient confidence intervals contain zero! This model is overfitting the data.

Here is an alternative way to write the loop that does not use zip.

for i in range(len(pars)):
    beta = pars[i]
    ci = CI[i]
    print(f"{beta: 1.2e} [{beta - ci: 1.4e} {beta + ci: 1.4e}]")
 5.00e-02 [ 4.9075e-02  5.0906e-02]
-2.98e-04 [-3.4987e-04 -2.4583e-04]
 1.34e-06 [ 5.4027e-07  2.1467e-06]
-3.48e-09 [-7.6734e-09  7.0369e-10]
 3.70e-12 [-3.2337e-12  1.0628e-11]

It is also common to estimate an \(R^2\) value, where values close to one mean the model accounts for most of the variance in the data.

SS_tot = np.sum((Ca - np.mean(Ca)) ** 2)
SS_err = np.sum(errs**2)

#  http://en.wikipedia.org/wiki/Coefficient_of_determination
Rsq = 1 - SS_err / SS_tot
print("R^2 = {0}".format(Rsq))
R^2 = 0.9999869672459537

As usual, there is a way to do this in just a few lines…

from pycse import regress
??regress
b, bint, se = regress(X, Ca, alpha=0.05, rcond=None)
with np.printoptions(precision=4, suppress=False):
    print(bint)
[[ 4.9075e-02  5.0906e-02]
 [-3.4987e-04 -2.4583e-04]
 [ 5.4027e-07  2.1467e-06]
 [-7.6734e-09  7.0369e-10]
 [-3.2337e-12  1.0628e-11]]
with np.printoptions(precision=2, suppress=False):
    print(b)
[ 5.00e-02 -2.98e-04  1.34e-06 -3.48e-09  3.70e-12]

Here we would say the model looks very good, but with the caveat that we fit five parameters to seven data points, and some of the parameters are very small, suggesting they may not be necessary (although they multiply terms like \(t^4\), which can be very large).

Now you can use this model to interpolate new values within the fitted range. This is not a model you can extrapolate with, though, even though it is linear in the parameters. What is happening?

newt = np.linspace(0, 500)

newT = np.column_stack([newt**i for i in range(5)])
newCa = newT @ pars

plt.plot(time, Ca, "b.", label="data")
plt.plot(newt, newCa, label="predictions")
plt.xlabel("Time")
plt.legend()
plt.ylabel("Ca");

It is almost certainly not reasonable for the concentration of A to start increasing again after about 350 time units.

The problem is that this is a polynomial model with no physical basis; it is unconstrained outside the range of data we fit it to.

from pycse import predict
?predict
X.shape, Ca.shape
((7, 5), (7,))

You can see here the uncertainty begins to grow substantially once the predictions get outside the data. This is an indication the predictions are not reliable.

newt = np.linspace(0, 500)

newT = np.column_stack([newt**i for i in range(5)])
m, mint, se = predict(X, Ca, b, newT)

plt.plot(time, Ca, 'bo')
plt.plot(newt, mint)
plt.legend(['UB', 'LB']);

Reflective Questions

from jupyterquiz import display_quiz
display_quiz('.quiz.json')

Regularization#

When we do linear regression we get a coefficient for every function in the model. However, there can be bad behavior with regular regression, especially for certain classes of functions, and when the functions are correlated with each other. To explore why this happens, we will look at some regression models of varying complexity. We start by looking at some data.

import numpy as np

np.random.seed(10)  # Setting seed for reproducibility

x = np.linspace(0.3, 1.5 * np.pi)
y = np.sin(x) + np.random.normal(0, 0.15, len(x))

import matplotlib.pyplot as plt

plt.plot(x, y, "b.")
plt.xlabel("x")
plt.ylabel("y");

Our goal is to fit a linear regression model to this data. We want to avoid underfitting and overfitting. If we just fit polynomials to this data, we find some undesirable behavior. Let's look at fits up to a 12th order polynomial.

N = [1, 3, 6, 9, 12]

print("       ", f"".join([f"x^{i:<9d}" for i in range(12, -1, -1)]))

for i in N:
    pars = np.polyfit(x, y, i)
    p = np.zeros(13)
    p[13 - (i + 1) :] = pars
    # This way of printing is to get columnar output
    print(f"{i:2d}", f"  ".join([f"{j: 9.2f}" for j in p]))
    plt.plot(x, y, "b.")
    plt.plot(x, np.polyval(pars, x), label=f"{i}")
plt.legend();
        x^12       x^11       x^10       x^9        x^8        x^7        x^6        x^5        x^4        x^3        x^2        x^1        x^0        
 1      0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00      -0.47       1.40
 3      0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.09      -0.92       2.08      -0.33
 6      0.00       0.00       0.00       0.00       0.00       0.00       0.01      -0.09       0.58      -1.80       2.37      -0.66       0.43
 9      0.00       0.00       0.00      -0.00       0.10      -1.02       5.90     -20.81      46.10     -63.24      50.45     -19.91       3.34
12      0.01      -0.21       2.83     -22.43     114.61    -395.70     940.66   -1541.20    1715.97   -1258.64     574.27    -144.86      15.53

The most undesirable behavior is that the coefficients grow large, which puts a lot of weight in places we might not want. This also leads to wiggles in the fit, which are probably not reasonable. The solution to this issue is called regularization, which means we add a penalty to our objective function that serves to reduce the magnitude of the parameters. There are several approaches to regularization. In ridge regression we add an L2 penalty to the parameters, i.e. the sum of the parameters squared. In LASSO regression we add an L1 penalty to the parameters, i.e. the sum of the absolute values of the parameters.

In ridge regression the parameters are driven by the penalty to become smaller. In LASSO regression as many of the parameters are driven to zero as possible.

Ridge regression#

In ridge regression we define our objective function to minimize the summed squared error as usual, and add a term proportional to the sum of the squared parameters.

So, if our regression model looks like \(\mathbf{X} \mathbf{p} = \mathbf{y}\) we seek to minimize:

\((\mathbf{y} - \mathbf{X} \mathbf{p})^T (\mathbf{y} - \mathbf{X} \mathbf{p}) + \lambda ||\mathbf{p}||_2^2\)

Where \(\mathbf{p}\) are the fitting parameters, and \(\lambda\) is the proportionality constant.

Finding the parameters is done by solving this modified normal equation:

\((\mathbf{Z}^T \mathbf{Z} + \lambda \mathbf{I}) \mathbf{p} = \mathbf{Z}^T \mathbf{w}\)

We have changed variable names because it is considered important to standardize our variables:

\(\mathbf{Z} = (\mathbf{X} - mean(\mathbf{X})) / std(\mathbf{X})\)

and

\(\mathbf{w} = (\mathbf{y} - mean(\mathbf{y})) / std(\mathbf{y})\)

Standardization means that each variable has a mean of 0 and a standard deviation of 1. The parameter \(\lambda\) controls the amount of regularization.

It is common to standardize the input/output variables, which means we make the average of each column equal to zero and scale each column to have unit variance. Doing this eliminates the intercept from the model, since the fit then passes through the point (0, 0).

X = np.vander(x, 12)[:, 0:-1]  # since we standardize we do not consider the last column of ones.
xmean = X.mean(axis=0)  # average of every column
xstd = X.std(axis=0)
xmean, xstd
(array([2.48293800e+06, 5.69542539e+05, 1.31727857e+05, 3.07737861e+04,
        7.27890923e+03, 1.74895299e+03, 4.28974856e+02, 1.08219836e+02,
        2.84377137e+01, 7.96966389e+00, 2.50619449e+00]),
 array([5.49844745e+06, 1.19967517e+06, 2.62434616e+05, 5.75785285e+04,
        1.26746927e+04, 2.80017452e+03, 6.20905075e+02, 1.38066119e+02,
        3.06634869e+01, 6.68612694e+00, 1.29948184e+00]))

Pandas is a data science library that provides better visualization of arrays than the default printing. We use it here to look at \(\mathbf{X}\), and then we standardize it.

import pandas as pd

pd.DataFrame(X)
0 1 2 3 4 5 6 7 8 9 10
0 1.771470e-06 5.904900e-06 1.968300e-05 0.000066 0.000219 0.000729 0.002430 0.008100 0.027000 0.090000 0.300000
1 3.179127e-05 8.150588e-05 2.089633e-04 0.000536 0.001374 0.003521 0.009028 0.023146 0.059341 0.152138 0.390049
2 3.123374e-04 6.505707e-04 1.355080e-03 0.002823 0.005879 0.012246 0.025506 0.053127 0.110659 0.230494 0.480098
3 2.069422e-03 3.629634e-03 6.366145e-03 0.011166 0.019584 0.034349 0.060246 0.105668 0.185336 0.325067 0.570146
4 1.038472e-02 1.572977e-02 2.382595e-02 0.036089 0.054665 0.082801 0.125418 0.189972 0.287751 0.435857 0.660195
5 4.238639e-02 5.649682e-02 7.530462e-02 0.100374 0.133788 0.178326 0.237691 0.316818 0.422287 0.562866 0.750244
6 1.474808e-01 1.755113e-01 2.088693e-01 0.248567 0.295810 0.352033 0.418941 0.498565 0.593323 0.706092 0.840293
7 4.519238e-01 4.857613e-01 5.221324e-01 0.561227 0.603248 0.648416 0.696966 0.749151 0.805243 0.865535 0.930341
8 1.248614e+00 1.223664e+00 1.199212e+00 1.175248 1.151764 1.128749 1.106193 1.084089 1.062426 1.041196 1.020390
9 3.165489e+00 2.850665e+00 2.567152e+00 2.311836 2.081912 1.874855 1.688391 1.520472 1.369254 1.233074 1.110439
10 7.463358e+00 6.216939e+00 5.178678e+00 4.313813 3.593384 2.993270 2.493379 2.076972 1.730107 1.441170 1.200488
11 1.653760e+01 1.281452e+01 9.929606e+00 7.694170 5.961995 4.619781 3.579737 2.773837 2.149367 1.665484 1.290536
12 3.472921e+01 2.515543e+01 1.822085e+01 13.197916 9.559654 6.924350 5.015519 3.632894 2.631416 1.906015 1.380585
13 6.959051e+01 4.732008e+01 3.217666e+01 21.879449 14.877564 10.116430 6.878959 4.677547 3.180634 2.162764 1.470634
14 1.338013e+02 8.573256e+01 5.493273e+01 35.197890 22.552882 14.450653 9.259188 5.932781 3.801401 2.435730 1.560683
15 2.479935e+02 1.502325e+02 9.100967e+01 55.132937 33.399098 20.232910 12.256937 7.425156 4.498101 2.724914 1.650731
16 4.448186e+02 2.555283e+02 1.467896e+02 84.324011 48.440359 27.826811 15.985254 9.182811 5.275112 3.030315 1.740780
17 7.746850e+02 4.231335e+02 2.311158e+02 126.235627 68.949989 37.660532 20.570209 11.235463 6.136818 3.351934 1.830829
18 1.313701e+03 6.839067e+02 3.560387e+02 185.352099 96.493447 50.234043 26.151611 13.614408 7.087598 3.689771 1.920878
19 2.174492e+03 1.081338e+03 5.377315e+02 267.404863 132.975961 66.126719 32.883710 16.352519 8.131834 4.043825 2.010926
20 3.520707e+03 1.675749e+03 7.976055e+02 379.635864 180.695080 86.005341 40.935917 19.484247 9.273906 4.414096 2.100975
21 5.586222e+03 2.549594e+03 1.163654e+03 531.100656 242.398391 110.632475 50.493506 23.045621 10.518197 4.800586 2.191024
22 8.700219e+03 3.814091e+03 1.672060e+03 733.015010 321.346637 140.875234 61.758330 27.074250 11.869087 5.203292 2.281073
23 1.331961e+04 5.617432e+03 2.369104e+03 999.149004 421.382483 177.714431 74.949530 31.609318 13.330958 5.622216 2.371121
24 2.007049e+04 8.154857e+03 3.313406e+03 1346.272767 547.005166 222.254107 90.304244 36.691590 14.908189 6.057358 2.461170
25 2.980059e+04 1.168092e+04 4.578566e+03 1794.658187 703.451284 275.731452 108.078321 42.363406 16.605163 6.508718 2.551219
26 4.364520e+04 1.652434e+04 6.256215e+03 2368.641102 896.781940 339.527101 128.547027 48.668687 18.426261 6.976295 2.641268
27 6.310912e+04 2.310575e+04 8.459566e+03 3097.248633 1133.976517 415.175820 152.005760 55.652930 20.375864 7.460089 2.731316
28 9.016796e+04 3.195898e+04 1.132749e+04 4014.896535 1423.033300 504.377574 178.770755 63.363211 22.458352 7.960101 2.821365
29 1.273924e+05 4.375619e+04 1.502919e+04 5162.161563 1773.077194 609.008977 209.179801 71.848184 24.678107 8.476331 2.911414
30 1.780994e+05 5.933752e+04 1.976954e+04 6586.634078 2194.474783 731.135132 243.592948 81.158081 27.039511 9.008778 3.001463
31 2.465354e+05 7.974593e+04 2.579513e+04 8343.856256 2698.956979 873.021844 282.393215 91.344711 29.546943 9.557443 3.091511
32 3.380972e+05 1.062677e+05 3.340114e+04 10498.351446 3299.749478 1037.148230 325.987308 102.461463 32.204786 10.122325 3.181560
33 4.595949e+05 1.404798e+05 4.293905e+04 13124.750407 4011.711298 1226.219702 374.806322 114.563303 35.017420 10.703425 3.271609
34 6.195671e+05 1.843040e+05 5.482534e+04 16309.020322 4851.481612 1443.181342 429.306458 127.706775 37.989227 11.300742 3.361658
35 8.286523e+05 2.400703e+05 6.955120e+04 20149.802653 5837.635138 1691.231651 489.969727 141.950001 41.124587 11.914277 3.451706
36 1.100028e+06 3.105885e+05 8.769338e+04 24759.866088 6990.846319 1973.836695 557.304670 157.352681 44.427882 12.544030 3.541755
37 1.449928e+06 3.992307e+05 1.099263e+05 30267.681004 8334.062531 2294.744624 631.847058 173.976093 47.903493 13.190000 3.631804
38 1.898239e+06 5.100254e+05 1.370353e+05 36819.122033 9892.686576 2658.000579 714.160610 191.883095 51.555801 13.852187 3.721853
39 2.469211e+06 6.477637e+05 1.699319e+05 44579.305505 11694.768683 3067.961984 804.837700 211.138119 55.389187 14.530593 3.811901
40 3.192265e+06 8.181204e+05 2.096696e+05 53734.568710 13771.208273 3529.314217 904.500070 231.807180 59.408032 15.225215 3.901950
41 4.102935e+06 1.027789e+06 2.574624e+05 64494.598101 16155.965727 4047.086675 1013.799536 253.957867 63.616717 15.936056 3.991999
42 5.243953e+06 1.284638e+06 3.147043e+05 77094.713715 18886.284392 4626.669209 1133.418704 277.659348 68.019624 16.663113 4.082048
43 6.666498e+06 1.597877e+06 3.829914e+05 91798.317303 22002.923076 5273.828956 1264.071676 302.982371 72.621133 17.406389 4.172096
44 8.431613e+06 1.978256e+06 4.641455e+05 108899.511786 25550.399271 5994.727545 1406.504765 329.999260 77.425626 18.165882 4.262145
45 1.061184e+07 2.438274e+06 5.602401e+05 128725.899867 29577.243340 6795.938692 1561.497202 358.783918 82.437484 18.941592 4.352194
46 1.329306e+07 2.992422e+06 6.736287e+05 151641.569775 34136.263923 7684.466182 1729.861845 389.411825 87.661087 19.733520 4.442243
47 1.657663e+07 3.657449e+06 8.069757e+05 178050.276326 39284.824788 8667.762221 1912.445896 421.960041 93.100818 20.541666 4.532291
48 2.058169e+07 4.452655e+06 9.632903e+05 208398.825615 45085.133383 9753.746194 2110.131604 456.507202 98.761056 21.366029 4.622340
49 2.544793e+07 5.400218e+06 1.145962e+06 243180.671861 51604.541321 10950.823783 2323.836981 493.133523 104.646184 22.206610 4.712389
Z = (X - xmean) / xstd  # this is broadcasting!
import pandas as pd

pd.DataFrame(Z)
0 1 2 3 4 5 6 7 8 9 10
0 -0.451571 -0.474747 -0.501945 -0.534466 -0.574287 -0.624587 -0.690882 -0.783767 -0.926532 -1.178509 -1.697749
1 -0.451571 -0.474747 -0.501945 -0.534466 -0.574287 -0.624586 -0.690872 -0.783659 -0.925478 -1.169216 -1.628453
2 -0.451571 -0.474747 -0.501945 -0.534466 -0.574286 -0.624583 -0.690845 -0.783441 -0.923804 -1.157497 -1.559158
3 -0.451571 -0.474747 -0.501945 -0.534466 -0.574285 -0.624575 -0.690789 -0.783061 -0.921369 -1.143352 -1.489862
4 -0.451571 -0.474747 -0.501945 -0.534466 -0.574283 -0.624558 -0.690684 -0.782450 -0.918029 -1.126782 -1.420566
5 -0.451571 -0.474747 -0.501945 -0.534465 -0.574276 -0.624523 -0.690504 -0.781531 -0.913641 -1.107786 -1.351270
6 -0.451571 -0.474747 -0.501945 -0.534462 -0.574264 -0.624461 -0.690212 -0.780215 -0.908063 -1.086365 -1.281974
7 -0.451571 -0.474747 -0.501943 -0.534457 -0.574239 -0.624356 -0.689764 -0.778400 -0.901152 -1.062518 -1.212678
8 -0.451571 -0.474746 -0.501941 -0.534446 -0.574196 -0.624184 -0.689105 -0.775974 -0.892765 -1.036245 -1.143382
9 -0.451570 -0.474745 -0.501936 -0.534426 -0.574123 -0.623918 -0.688167 -0.772814 -0.882759 -1.007547 -1.074086
10 -0.451569 -0.474742 -0.501926 -0.534391 -0.574003 -0.623518 -0.686871 -0.768783 -0.870991 -0.976424 -1.004790
11 -0.451568 -0.474737 -0.501908 -0.534333 -0.573816 -0.622937 -0.685121 -0.763736 -0.857318 -0.942875 -0.935495
12 -0.451564 -0.474726 -0.501876 -0.534237 -0.573533 -0.622114 -0.682809 -0.757513 -0.841597 -0.906900 -0.866199
13 -0.451558 -0.474708 -0.501823 -0.534086 -0.573113 -0.620974 -0.679807 -0.749947 -0.823686 -0.868500 -0.796903
14 -0.451546 -0.474676 -0.501736 -0.533855 -0.572507 -0.619427 -0.675974 -0.740856 -0.803441 -0.827674 -0.727607
15 -0.451526 -0.474622 -0.501599 -0.533509 -0.571652 -0.617362 -0.671146 -0.730046 -0.780721 -0.784423 -0.658311
16 -0.451490 -0.474534 -0.501386 -0.533002 -0.570465 -0.614650 -0.665141 -0.717316 -0.755381 -0.738746 -0.589015
17 -0.451430 -0.474395 -0.501065 -0.532274 -0.568847 -0.611138 -0.657757 -0.702449 -0.727279 -0.690643 -0.519719
18 -0.451332 -0.474177 -0.500589 -0.531247 -0.566674 -0.606648 -0.648768 -0.685218 -0.696272 -0.640115 -0.450423
19 -0.451175 -0.473846 -0.499896 -0.529822 -0.563795 -0.600972 -0.637925 -0.665386 -0.662217 -0.587162 -0.381127
20 -0.450930 -0.473350 -0.498906 -0.527873 -0.560030 -0.593873 -0.624957 -0.642704 -0.624972 -0.531783 -0.311832
21 -0.450555 -0.472622 -0.497511 -0.525242 -0.555162 -0.585078 -0.609564 -0.616909 -0.584393 -0.473978 -0.242536
22 -0.449988 -0.471568 -0.495574 -0.521736 -0.548933 -0.574278 -0.591421 -0.587730 -0.540337 -0.413748 -0.173240
23 -0.449148 -0.470065 -0.492918 -0.517114 -0.541041 -0.561122 -0.570176 -0.554883 -0.492663 -0.351092 -0.103944
24 -0.447921 -0.467950 -0.489320 -0.511085 -0.531130 -0.545216 -0.545447 -0.518072 -0.441226 -0.286011 -0.034648
25 -0.446151 -0.465011 -0.484499 -0.503297 -0.518786 -0.526118 -0.516821 -0.476992 -0.385884 -0.218504 0.034648
26 -0.443633 -0.460973 -0.478106 -0.493329 -0.503533 -0.503335 -0.483855 -0.431323 -0.326494 -0.148572 0.103944
27 -0.440093 -0.455487 -0.469710 -0.480675 -0.484819 -0.476319 -0.446073 -0.380737 -0.262914 -0.076214 0.173240
28 -0.435172 -0.448108 -0.458782 -0.464737 -0.462013 -0.444464 -0.402967 -0.324892 -0.194999 -0.001430 0.242536
29 -0.428402 -0.438274 -0.444677 -0.444812 -0.434396 -0.407097 -0.353991 -0.263436 -0.122609 0.075779 0.311832
30 -0.419180 -0.425286 -0.426614 -0.420072 -0.401149 -0.363484 -0.298567 -0.196006 -0.045598 0.155413 0.381127
31 -0.406733 -0.408274 -0.403654 -0.389554 -0.361346 -0.312813 -0.236077 -0.122225 0.036174 0.237474 0.450423
32 -0.390081 -0.386167 -0.374671 -0.352135 -0.313945 -0.254200 -0.165867 -0.041707 0.122852 0.321959 0.519719
33 -0.367984 -0.357649 -0.338327 -0.306521 -0.257773 -0.186679 -0.087241 0.045945 0.214578 0.408871 0.589015
34 -0.338890 -0.321119 -0.293035 -0.251218 -0.191518 -0.109197 0.000534 0.141142 0.311495 0.498207 0.658311
35 -0.300864 -0.274635 -0.236922 -0.184513 -0.113713 -0.020613 0.098235 0.244304 0.413745 0.589970 0.727607
36 -0.251509 -0.215853 -0.167792 -0.104447 -0.022727 0.080311 0.206682 0.355865 0.521473 0.684158 0.796903
37 -0.187873 -0.141965 -0.083074 -0.008790 0.083249 0.194913 0.326736 0.476266 0.634819 0.780771 0.866199
38 -0.106339 -0.049611 0.020224 0.104993 0.206220 0.324640 0.459307 0.605965 0.753929 0.879810 0.935495
39 -0.002496 0.065202 0.145576 0.239769 0.348400 0.471045 0.605347 0.745428 0.878944 0.981275 1.004790
40 0.129005 0.207204 0.296995 0.398773 0.512225 0.635804 0.765858 0.895132 1.010006 1.085165 1.074086
41 0.294628 0.381976 0.479108 0.585649 0.700376 0.820711 0.941891 1.055567 1.147260 1.191481 1.143382
42 0.502145 0.596074 0.697227 0.804483 0.915791 1.027692 1.134544 1.227235 1.290848 1.300222 1.212678
43 0.760862 0.857178 0.957433 1.059849 1.161686 1.258806 1.344967 1.410647 1.440913 1.411389 1.281974
44 1.081883 1.174245 1.266669 1.356855 1.441573 1.516254 1.574363 1.606328 1.597598 1.524981 1.351270
45 1.478399 1.557697 1.632834 1.701192 1.759280 1.802383 1.823986 1.814812 1.761045 1.640999 1.420566
46 1.966033 2.019613 2.064898 2.099182 2.118975 2.119694 2.095146 2.036647 1.931397 1.759443 1.489862
47 2.563212 2.573952 2.573014 2.557837 2.525183 2.470849 2.389207 2.272391 2.108798 1.880312 1.559158
48 3.291611 3.236804 3.168646 3.084918 2.982812 2.858677 2.707591 2.522613 2.293390 2.003606 1.628453
49 4.176632 4.026653 3.864711 3.688995 3.497176 3.286178 3.051774 2.787894 2.485317 2.129326 1.697749

Here we just confirm that we have standardized all the columns: each one has a mean of zero (to within round-off) and a standard deviation of one.

with np.printoptions(suppress=True):
    print(Z.mean(axis=0), Z.std(axis=0))
[-0. -0.  0.  0.  0. -0.  0.  0.  0. -0.  0.] [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

We similarly standardize the y data.

ymean = y.mean()
ystd = y.std()

w = (y - ymean) / ystd

To get an estimate of the parameters we have to specify a value of λ. If we set λ=0, we have regular linear regression. If we set λ=∞, all the weights will go to zero. We need something in between. It is a good idea to try several values of λ from a very small value to a large value, on a log scale.

lambdas = np.concatenate([[0], np.geomspace(1e-13, 10, 5)])

print("lambda     ", f"".join([f"x^{i:<11d}" for i in range(len(X[0]), 0, -1)]))
for lam in lambdas:
    l2p = np.linalg.solve(
        Z.T @ Z + lam * np.eye(len(Z[0])), Z.T @ w
    )  # for ridge regression
    p = np.zeros(len(X[0]))
    p[len(p) - len(l2p) :] = l2p
    # This way of printing is to get columnar output
    print(f"{lam:8.2g}", f"".join([f"{j: 12.2f}" for j in p]))
    plt.plot(x, y, "b.")
    plt.plot(x, (Z @ l2p) * ystd + ymean, label=f"{lam:1.2g}")
plt.legend();
lambda      x^11         x^10         x^9          x^8          x^7          x^6          x^5          x^4          x^3          x^2          x^1          
       0    -34043.54   190691.01  -464362.98   645651.76  -566688.59   328402.49  -128144.25    33874.94    -6032.60      691.22      -40.28
   1e-13    -13149.13    64363.30  -129320.28   133383.30   -67613.99     5161.66    12758.58    -7038.91     1606.47     -156.63        4.82
 3.2e-10     -1054.80     3732.42    -3866.47     -865.34     3642.53     -286.76    -3426.18     3217.32    -1354.26      284.92      -24.21
   1e-06       -11.38        6.95       17.29        8.03      -18.81      -29.90       13.53       55.80      -61.16       19.93       -1.06
  0.0032        -0.28       -0.10        0.10        0.32        0.54        0.63        0.39       -0.43       -1.76       -2.04        1.87
      10         0.11        0.08        0.04       -0.01       -0.06       -0.11       -0.17       -0.22       -0.25       -0.22       -0.06

One way people have evaluated a reasonable value of λ is to look at how the coefficients vary with λ using a ridge plot. In this plot, you look for a range that balances the large swings associated with regular unconstrained regression and the damping caused by large values of λ. Here a value of \(10^{-8} \le \lambda \le 10^{-6}\) would be considered reasonable.

lambdas = np.geomspace(1e-8, 1)

pars = np.zeros((11, len(lambdas)))

for i, lam in enumerate(lambdas):
    l2p = np.linalg.solve(Z.T @ Z + lam * np.eye(len(Z[0])), Z.T @ w)
    pars[:, i] = l2p

plt.semilogx(lambdas, pars.T)
plt.xlabel(r"$\lambda$")
plt.ylabel("parameters");

LASSO regression#

In LASSO regression, we seek to minimize the summed squared errors plus the sum of the absolute value of the parameters. Unlike linear least squares regression and ridge regression, there is no analytical solution to get the parameters; they can only be obtained numerically using an iterative solver. We again have a parameter λ we have to choose. Setting this parameter to zero will be equivalent to normal linear regression. Setting this parameter to infinity will again cause all coefficients to go to zero. We again have to find a balance.

def objective(pars, lam=0.0):
    SSE = np.sum(np.square(y - ((Z @ pars) * ystd + ymean)))
    return SSE + lam * np.sum(np.abs(pars))


from scipy.optimize import minimize

sol = minimize(
    objective,
    np.random.random(len(Z[0])),
    args=(0.15,),
    method="nelder-mead",
    options={"maxiter": 5000},
)

with np.printoptions(
    suppress=True, precision=3
):  # prints small numbers as practically zero
    print(sol.message, sol.x)

plt.plot(x, y, "b.")
plt.plot(x, (Z @ sol.x) * ystd + ymean);
Optimization terminated successfully. [-0.825  1.61   0.     0.967  0.963 -2.048 -1.275 -0.371 -0.003  0.
  0.182]

Now, we can explore the effect of λ more thoroughly.

lambdas = np.concatenate([[0], np.geomspace(1e-5, 10, 5)])

print("lambda     ", f"".join([f"x^{i:<11d}" for i in range(len(X[0]), 0, -1)]))
for lam in lambdas:
    sol = minimize(
        objective, np.random.random(len(Z[0])), (lam,), options={"maxiter": 5000}
    )

    # This way of printing is to get columnar output
    print(f"{lam:8.2g}", f"".join([f"{j: 12.2f}" for j in sol.x]))
    plt.plot(x, y, "b.")
    plt.plot(x, (Z @ sol.x) * ystd + ymean, label=f"{lam:1.2g}")
plt.legend();
lambda      x^11         x^10         x^9          x^8          x^7          x^6          x^5          x^4          x^3          x^2          x^1          
       0       136.55     -298.38       -4.64      296.91      108.40     -326.21     -179.04      521.01     -339.93       93.29       -8.74
   1e-05        59.02     -135.95       -9.68      179.06       16.04     -162.72      -90.66      298.40     -208.36       59.31       -5.25
 0.00032         0.16       -0.13       -0.04       -0.22       -0.08       -0.00        2.72        0.78       -5.27       -0.32        1.63
    0.01        -0.01       -0.16        0.00       -0.00        0.73        0.68        0.01        0.01       -1.75       -2.16        1.88
    0.32         0.35        0.17        0.08       -0.00        0.00        0.00       -0.00       -0.00       -1.99       -0.05        0.65
      10        -0.00       -0.00        0.00        0.00       -0.00       -0.00        0.00       -0.02       -0.64       -0.05       -0.00

You can see that by increasing λ we are making more and more of the parameters go to zero; in other words the functions they correspond to are not part of the model any longer. This is called sparsifying the model. It reduces over-fitting by reducing the model complexity. Finding the most suitable value for λ requires some sophisticated programming and analysis, and it is an important topic in machine learning and data science.

LASSO has some important benefits, and some disadvantages. The benefits include sparsification of the model; the method removes inputs that are not needed, or that are highly correlated with other inputs. This can make models more interpretable, as there are fewer terms and the terms are more independent.

The disadvantages, however, are that we cannot use linear algebra to find the parameters. The penalty adds a non-smooth term to the objective function, so we must use an iterative solver. For features that are correlated, we have no control over which feature is eliminated. Different initial guesses may lead to different features being eliminated. If the features are really correlated, this will not affect the fit quality, but it does mean some models favor one feature over another. This is less of a problem in polynomial models, but it is often a problem in models based on physical properties that are correlated, e.g. high melting points of materials tend to be correlated with how hard they are. With LASSO, one model could favor the melting point and another could favor the hardness.

Advanced selection of \(\lambda\)#

A more advanced way to select a value of λ is called k-fold cross-validation. It is complex to code this yourself, and the standard way to do it is with scikit-learn; see specifically the ridge regression example and the LASSO example. The basic idea is that you split your data set into \(k\) folds, fit on \(k-1\) of them to get the parameters, and estimate the model errors on the remaining fold (which was not used for fitting). Initially, with no regularization, the errors will be high due to overfitting. As you add regularization, the errors will begin to decrease. Eventually, though, the model will start underfitting and the errors will go back up. The λ that gives the lowest test errors is usually considered the best choice.
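
Just to give a flavor, here is a minimal sketch of that idea using scikit-learn's cross-validated estimators (this assumes scikit-learn is installed; it is not used elsewhere in this lecture, and its regularization parameter is called alpha and is scaled differently than the λ above, so the numbers are not directly comparable).

# Minimal sketch of cross-validated selection of the regularization strength
# with scikit-learn. Z and w are the standardized arrays defined above.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

ridge = RidgeCV(alphas=np.geomspace(1e-8, 1, 50)).fit(Z, w)
print("ridge alpha:", ridge.alpha_)

lasso = LassoCV(cv=5, max_iter=100000).fit(Z, w)
print("lasso alpha:", lasso.alpha_)
print("lasso coefficients:", lasso.coef_)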

We will not cover these more advanced methods as they rely on learning the scikit-learn API in depth, and some other higher level Python libraries we have not covered like Pandas. These are more appropriate in a data science/machine learning focused course.

Summary#

In this lecture we introduced the concept of linear regression. In the normal linear regression, we simply solve linear equations that ultimately minimize the summed squared errors between the model and data. With some additional linear algebra, we can also estimate the confidence intervals on the parameters.

One issue with normal linear regression is that the parameters are unconstrained, which can lead to some functions having undesirably large parameters. We introduced two types of regularization to mitigate this issue: ridge regression and LASSO regression. In both cases, a penalty function is added to the objective function being minimized. In ridge regression the penalty is an L2 norm on the parameters, which penalizes large parameters and leads to a reduction in their magnitude. In LASSO regression the penalty is an L1 norm, which drives parameters towards zero. Both methods rely on a hyperparameter λ that determines how much regularization is applied. With both regularization approaches we have to use some judgment in how much regularization to apply (the magnitude of λ), and we only provided heuristic approaches to doing this.

display_quiz('.quiz-2.json')