
Linear regression#

  • KEYWORDS: numpy.linalg.solve


In linear regression, we seek to find models in the form \(y = a_{0} f_{0}(x) + a_{1} f_{1}(x) + ... + a_{n} f_{n}(x) + \epsilon\), where \(a_{i}\) are coefficients to be determined, and ε are the residual errors. We call this linear regression because the model is linear in the unknown coefficients \(a_{i}\). The functions can be any function of \(x\). In the function numpy.polyfit these functions are polynomials in \(x\).

If we are given some data as pairs of (x, y), we can construct a set of equations of the form:

\([f_{0}(x_{i}), f_{1}(x_{i}), ..., f_{n}(x_{i})]\cdot[a_{0}, a_{1}, ..., a_{n}]^T = y_{i}\)

There will be one of these equations for every data point, so we end up with a matrix equation that looks like:

\(\mathbf{X} \mathbf{a} = \mathbf{y}\)

There are usually more data points than elements in the vector \(\mathbf{a}\), so the shapes of these arrays are not suitable for solving the equation directly. You could of course set up an objective function and use scipy.optimize.minimize, but there is a better approach.

To be a little more specific, suppose we have \(m\) pairs of \((x, y)\) data points, and we want to fit a model containing \(n\) parameters. Then, the dimensions of \(\mathbf{X}\) will be \((m, n)\), the dimensions of \(\mathbf{a}\) will be \((n, 1)\), and the dimensions of \(\mathbf{y}\) will be \((m, 1)\). We have more equations than unknowns here, and we cannot use numpy.linalg.solve because \(\mathbf{X}\) is not square. Note that if it were square, we would be doing the kind of interpolation we described in the last lecture.
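
Before developing that better approach, here is a quick sketch of the scipy.optimize.minimize route mentioned above, on a small made-up data set (the data and line model here are hypothetical, chosen only for illustration). The rest of this section develops the faster linear-algebra approach.

import numpy as np
from scipy.optimize import minimize

xd = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
yd = np.array([0.1, 1.9, 4.2, 5.8, 8.1, 9.9])  # made-up data, roughly y = 2x

Xd = np.column_stack([xd**0, xd])  # model: y = a0 + a1 * x


def sse(pars):
    "Summed squared error between the model Xd @ pars and the data yd."
    return np.sum((Xd @ pars - yd) ** 2)


sol = minimize(sse, np.zeros(2))
print(sol.x)  # close to [0, 2]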

We can work around this, though, by left-multiplying each side of the equation by \(\mathbf{X}^T\):

\(\mathbf{X}^T \mathbf{X} \mathbf{a} = \mathbf{X}^T \mathbf{y}\)

The array \(\mathbf{X}^T \mathbf{X}\) now has the shape \((n, m) * (m, n) = (n, n)\). The right hand side \(\mathbf{X}^T \mathbf{y}\) has a shape of \((n, m) * (m, 1) = (n, 1)\), and \(\mathbf{a}\) is still \((n, 1)\). This new matrix equation can be solved efficiently with numpy.linalg.solve. We will not prove this in detail, but solving this modified equation is equivalent to finding the set of parameters that minimizes the summed squared errors: \(\sum (\mathbf{X} \mathbf{a} - \mathbf{y})^2\).
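
The essence of the argument is short: the summed squared error can be written as \(SSE = (\mathbf{X} \mathbf{a} - \mathbf{y})^T (\mathbf{X} \mathbf{a} - \mathbf{y})\), and setting its gradient with respect to \(\mathbf{a}\) to zero gives

\(\nabla_{\mathbf{a}} SSE = 2 \mathbf{X}^T (\mathbf{X} \mathbf{a} - \mathbf{y}) = \mathbf{0} \quad \Rightarrow \quad \mathbf{X}^T \mathbf{X} \mathbf{a} = \mathbf{X}^T \mathbf{y},\)

which is exactly the modified equation above.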

The parameters are then found by:

# np.linalg.solve(X.T @ X, X.T @ y)

An alternative form is called the normal equation: \(\mathbf{a} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{y}\). This is symbolically correct, but it relies on computing a matrix inverse, which is expensive for large systems. It is not used in practice; instead, the equations are solved with more efficient algorithms (numpy.linalg.lstsq, which we use below, is one of them).
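
To make this concrete, here is a sketch on a small made-up overdetermined system (the numbers are arbitrary and only for illustration) showing that the modified equation, numpy.linalg.lstsq, and the pseudoinverse form of the normal equation all recover the same parameters.

import numpy as np

# A made-up overdetermined system: 6 equations, 2 unknowns (illustration only).
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
ys = 1.0 + 2.0 * xs + np.array([0.05, -0.02, 0.03, -0.04, 0.01, -0.03])

Xs = np.column_stack([xs**0, xs])  # model: y = a0 + a1 * x

a_normal = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)  # the modified equation above
a_lstsq = np.linalg.lstsq(Xs, ys, rcond=None)[0]  # NumPy's least-squares solver
a_pinv = np.linalg.pinv(Xs) @ ys  # the normal equation via the pseudoinverse

print(a_normal, a_lstsq, a_pinv)  # all three agree to within round-off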

An example of polynomial fitting#

Our goal in this example is to fit a polynomial to some time-dependent concentration data.

import numpy as np

time = np.array([0.0, 50.0, 100.0, 150.0, 200.0, 250.0, 300.0])
Ca = np.array([50.0, 38.0, 30.6, 25.6, 22.2, 19.5, 17.4]) * 1e-3

Fit a fourth order polynomial to this data and determine the confidence interval for each parameter. This data is from example 5-1 in Fogler, Elements of Chemical Reaction Engineering.

We want the equation \(Ca(t) = b_0 + b_1 t + b_2 t^2 + b_3 t^3 + b_4 t^4\) fit to the data in the least squares sense. We can write this in a linear algebra form as \(\mathbf{T} \mathbf{p} = \mathbf{Ca}\), where \(\mathbf{T}\) is a matrix with columns \([1, t, t^2, t^3, t^4]\), and \(\mathbf{p}\) is a column vector of the fitting parameters. We want to solve for the \(\mathbf{p}\) vector and estimate the confidence intervals.

First, we set up the array of function values (the matrix \(\mathbf{T}\) above, called X in the code), and then we solve for the parameters.

?np.column_stack
(time**2).shape
(7,)
X = np.column_stack([time**0, time, time**2, time**3, time**4])
with np.printoptions(precision=2):
    print(X)
[[1.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00]
 [1.00e+00 5.00e+01 2.50e+03 1.25e+05 6.25e+06]
 [1.00e+00 1.00e+02 1.00e+04 1.00e+06 1.00e+08]
 [1.00e+00 1.50e+02 2.25e+04 3.38e+06 5.06e+08]
 [1.00e+00 2.00e+02 4.00e+04 8.00e+06 1.60e+09]
 [1.00e+00 2.50e+02 6.25e+04 1.56e+07 3.91e+09]
 [1.00e+00 3.00e+02 9.00e+04 2.70e+07 8.10e+09]]
(X.T @ X).shape, (X.T @ Ca).shape
((5, 5), (5,))
a = np.linalg.solve(X.T @ X, X.T @ Ca)  # this is the normal equation
print(a)
[ 4.99902597e-02 -2.97846320e-04  1.34348485e-06 -3.48484848e-09
  3.69696970e-12]
import matplotlib.pyplot as plt

plt.plot(time, Ca, "bo", time, X @ a)
plt.legend(["data", "fit"])
plt.xlabel("Time")
plt.ylabel("Ca");

We previously claimed that solving this equation is equivalent to minimizing the summed squared errors. Here we demonstrate that this is consistent with our observation for the first parameter: we vary it around the best-fit value and show that the summed squared error is minimized there.

P = np.linspace(0.9 * a[0], 1.1 * a[0])

errs = [np.sum(np.square(X @ [p, *a[1:]] - Ca)) for p in P]

plt.plot(P, errs)
plt.axvline(a[0], color="k", linestyle="--")
plt.xlabel("slope")
plt.ylabel("SSE")
plt.legend(["SSE", "best fit"]);
def SSE(a):
    return np.sum((Ca - X @ a) ** 2)


SSE(a)
1.0519480519480308e-08

The SSE is quite small, which looks good. To check that these parameters correspond to a minimum of the SSE, we can examine the Hessian of the SSE with respect to the parameters: at a minimum, all of its eigenvalues should be positive.

import numdifftools as nd

H = nd.Hessian(SSE)
e, _ = np.linalg.eig(H(a))
e
array([1.67392305e+20, 2.87547709e+13, 5.43139168e+07, 2.02426698e+00,
       9.28206459e+02])

The Hessian eigenvalues are all positive, so we are at a minimum, but yikes, they span nearly 20 orders of magnitude. That is concerning. It indicates the model is not well-behaved in some way, even though the fit looks good.

np.linalg.det(H(a))
4.911617807628561e+44
np.linalg.cond(H(a))
8.271680468493854e+19
np.linalg.inv(H(a))
array([[ 4.93508334e-01, -1.59453781e-02,  1.62882298e-04,
        -6.56584202e-07,  9.09121476e-10],
       [-1.59453781e-02,  1.59343441e-03, -2.34774419e-05,
         1.14680966e-07, -1.78789195e-10],
       [ 1.62882298e-04, -2.34774419e-05,  3.79875951e-07,
        -1.95152565e-09,  3.14143043e-12],
       [-6.56584202e-07,  1.14680966e-07, -1.95152565e-09,
         1.03300152e-11, -1.69697718e-14],
       [ 9.09121476e-10, -1.78789195e-10,  3.14143043e-12,
        -1.69697718e-14,  2.82829421e-17]])

Exercise Demonstrate that the SSE is minimized for the other parameters. Try estimating the Hessian of the sum of squared errors and then see if it is positive definite.
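
As a hint for the positive-definiteness check (not a full solution), one way is to verify that all eigenvalues of the symmetric Hessian are positive; this reuses the H and a defined above.

# Hint, not a full solution: a symmetric matrix is positive definite
# if and only if all of its eigenvalues are positive.
print(np.all(np.linalg.eigvalsh(H(a)) > 0))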

As we have seen many times before, NumPy provides a function for doing least squares linear regression. It returns more information about the fit than we have computed so far, and it is a little more convenient because we do not have to do all the transposes and left multiplications ourselves.

?np.linalg.lstsq
SSE(a)
1.0519480519480308e-08
pars, residuals, rank, singular_values = np.linalg.lstsq(X, Ca, rcond=None)
pars, residuals, rank, singular_values
(array([ 4.99902596e-02, -2.97846320e-04,  1.34348484e-06, -3.48484840e-09,
         3.69696954e-12]),
 array([1.05194694e-08]),
 5,
 array([9.14856013e+09, 3.79175229e+06, 5.21123386e+03, 2.15423665e+01,
        1.00603128e+00]))
pars, _, _, _ = np.linalg.lstsq(
    X, Ca, rcond=None
)  # _ means ignore the value in that position
pars
array([ 4.99902596e-02, -2.97846320e-04,  1.34348484e-06, -3.48484840e-09,
        3.69696954e-12])
pars = np.linalg.lstsq(X, Ca, rcond=None)[0]  # just get the first return
pars
array([ 4.99902596e-02, -2.97846320e-04,  1.34348484e-06, -3.48484840e-09,
        3.69696954e-12])

The key point to note is that the rank is equal to the number of parameters we are estimating, which means we have enough independent information to get good estimates of all of them.

Confidence intervals on the parameters#

The confidence intervals reflect the range of values we are confident the true parameter lies in. Remember we are only estimating these parameters from a small amount of data.

The number of degrees of freedom is the number of data points minus the number of fitted parameters.

We define \(\sigma^2 = SSE / dof\) where \(SSE\) is the summed squared error, and \(dof\) is the degrees of freedom.

The covariance matrix is defined as \((\mathbf{X}^T \mathbf{X})^{-1}\). Finally, we compute the standard error on the parameters as:

\(\mathbf{se} = \sqrt{diag(\sigma^2 cov)}\).

This will be an array with one element for each parameter. You can think of this standard error as the uncertainty in the mean value of each parameter.

The confidence intervals are finally computed by calculating a student t-value that accounts for the additional uncertainty we have because of the small number of degrees of freedom.

dof = len(Ca) - len(pars)  # degrees of freedom
errs = Ca - X @ pars
sigma2 = np.sum(errs**2) / dof

covariance = np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(sigma2 * covariance))

from scipy.stats.distributions import t

alpha = 0.05  # 100*(1 - alpha) confidence level
sT = t.ppf(1.0 - alpha / 2.0, dof - 1)  # student T multiplier

CI = sT * se

for beta, ci in zip(pars, CI):
    print(f"{beta: 1.2e} [{beta - ci: 1.4e} {beta + ci: 1.4e}]")
 5.00e-02 [ 4.9075e-02  5.0906e-02]
-2.98e-04 [-3.4987e-04 -2.4583e-04]
 1.34e-06 [ 5.4027e-07  2.1467e-06]
-3.48e-09 [-7.6734e-09  7.0369e-10]
 3.70e-12 [-3.2337e-12  1.0628e-11]

Now you can see what is not well-behaved: several of the coefficient confidence intervals contain zero! This model is overfitting the data.

Here is an alternative way to write the loop that does not use zip.

for i in range(len(pars)):
    beta = pars[i]
    ci = CI[i]
    print(f"{beta: 1.2e} [{beta - ci: 1.4e} {beta + ci: 1.4e}]")
 5.00e-02 [ 4.9075e-02  5.0906e-02]
-2.98e-04 [-3.4987e-04 -2.4583e-04]
 1.34e-06 [ 5.4027e-07  2.1467e-06]
-3.48e-09 [-7.6734e-09  7.0369e-10]
 3.70e-12 [-3.2337e-12  1.0628e-11]

It is also common to estimate an \(R^2\) value, where values close to one mean the model accounts for most of the variance in the data.

SS_tot = np.sum((Ca - np.mean(Ca)) ** 2)
SS_err = np.sum(errs**2)

#  http://en.wikipedia.org/wiki/Coefficient_of_determination
Rsq = 1 - SS_err / SS_tot
print("R^2 = {0}".format(Rsq))
R^2 = 0.9999869672459537

As usual, there is a way to do this in just a few lines…

from pycse import regress
??regress
b, bint, se = regress(X, Ca, alpha=0.05, rcond=None)
with np.printoptions(precision=4, suppress=False):
    print(bint)
[[ 4.9075e-02  5.0906e-02]
 [-3.4987e-04 -2.4583e-04]
 [ 5.4027e-07  2.1467e-06]
 [-7.6734e-09  7.0369e-10]
 [-3.2337e-12  1.0628e-11]]
with np.printoptions(precision=2, suppress=False):
    print(b)
[ 5.00e-02 -2.98e-04  1.34e-06 -3.48e-09  3.70e-12]

Here we would say the model looks very good, but with the caveat that we fit five parameters to seven data points, and some of the parameters are very small, suggesting they may not be necessary (although they multiply terms like \(t^4\), which can be very large).

Now you can use this model to interpolate new values within the fitted range. This is not a model you can extrapolate with, though, even though it is linear in the parameters. What is happening?

newt = np.linspace(0, 500)

newT = np.column_stack([newt**i for i in range(5)])
newCa = newT @ pars

plt.plot(time, Ca, "b.", label="data")
plt.plot(newt, newCa, label="predictions")
plt.xlabel("Time")
plt.legend()
plt.ylabel("Ca");

It is almost certainly not reasonable for the concentration of A to start increasing again after about 350 time units.

The problem is that this is a polynomial model with no physical basis; it is unconstrained outside the range of data we fit it to.

from pycse import predict
?predict
X.shape, Ca.shape
((7, 5), (7,))

You can see here the uncertainty begins to grow substantially once the predictions get outside the data. This is an indication the predictions are not reliable.

newt = np.linspace(0, 500)

newT = np.column_stack([newt**i for i in range(5)])
m, mint, se = predict(X, Ca, b, newT)

plt.plot(time, Ca, 'bo')
plt.plot(newt, mint)
plt.legend(['UB', 'LB']);

Reflective Questions

from jupyterquiz import display_quiz
display_quiz('.quiz.json')

Regularization#

When we do linear regression we get a coefficient for every function in the model. However, there can be bad behavior with regular regression, especially for certain classes of functions, and when the functions are correlated with each other. To explore why this happens, we will look at some regression models of varying complexity. We start by looking at some data.

import numpy as np

np.random.seed(10)  # Setting seed for reproducibility

x = np.linspace(0.3, 1.5 * np.pi)
y = np.sin(x) + np.random.normal(0, 0.15, len(x))

import matplotlib.pyplot as plt

plt.plot(x, y, "b.")
plt.xlabel("x")
plt.ylabel("y");

Our goal is to fit a linear regression model to this data. We want to avoid underfitting and overfitting. If we just fit polynomials to this data, we find some undesirable behavior. Let's look at fits up to a 12th order polynomial.

N = [1, 3, 6, 9, 12]

print("       ", f"".join([f"x^{i:<9d}" for i in range(12, -1, -1)]))

for i in N:
    pars = np.polyfit(x, y, i)
    p = np.zeros(13)
    p[13 - (i + 1) :] = pars
    # This way of printing is to get columnar output
    print(f"{i:2d}", f"  ".join([f"{j: 9.2f}" for j in p]))
    plt.plot(x, y, "b.")
    plt.plot(x, np.polyval(pars, x), label=f"{i}")
plt.legend();
        x^12       x^11       x^10       x^9        x^8        x^7        x^6        x^5        x^4        x^3        x^2        x^1        x^0        
 1      0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00      -0.47       1.40
 3      0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.09      -0.92       2.08      -0.33
 6      0.00       0.00       0.00       0.00       0.00       0.00       0.01      -0.09       0.58      -1.80       2.37      -0.66       0.43
 9      0.00       0.00       0.00      -0.00       0.10      -1.02       5.90     -20.81      46.10     -63.24      50.45     -19.91       3.34
12      0.01      -0.21       2.83     -22.43     114.61    -395.70     940.66   -1541.20    1715.97   -1258.64     574.27    -144.86      15.53

The most undesirable behavior is that the coefficients grow large, which puts a lot of weight in places we might not want. This also leads to wiggles in the fit, which are probably not reasonable. The solution to this issue is called regularization, which means we add a penalty to our objective function that serves to reduce the magnitude of the parameters. There are several approaches to regularization. In ridge regression we add an L2 penalty to the parameters, i.e. the sum of the parameters squared. In LASSO regression we add an L1 penalty to the parameters, i.e. the sum of the absolute values of the parameters.

In ridge regression the parameters are driven by the penalty to become smaller. In LASSO regression as many of the parameters are driven to zero as possible.

Ridge regression#

In ridge regression we define our objective function to minimize the summed squared error as usual, and add a term proportional to the sum of the squared parameters.

So, if our regression model looks like \(\mathbf{X} \mathbf{p} = \mathbf{y}\) we seek to minimize:

\((\mathbf{y} - \mathbf{X} \mathbf{p})^T (\mathbf{y} - \mathbf{X} \mathbf{p}) + \lambda ||\mathbf{p}||_2^2\)

Where \(\mathbf{p}\) are the fitting parameters, and \(\lambda\) is the proportionality constant.

Finding the parameters is done by solving this modified normal equation:

\((\mathbf{Z}^T \mathbf{Z} + \lambda \mathbf{I}) \mathbf{p} = \mathbf{Z}^T \mathbf{w}\)

We have changed variable names because it is considered important to standardize our variables:

\(\mathbf{Z} = (\mathbf{X} - mean(\mathbf{X})) / std(\mathbf{X})\)

and

\(\mathbf{w} = (\mathbf{y} - mean(\mathbf{y})) / std(\mathbf{y})\)

Standardization means that each variable has a mean of 0 and a standard deviation of 1. The parameter \(\lambda\) controls the amount of regularization.

It is common to standardize the input/output variables, which means we make the average of each column equal to zero and scale each column to have unit variance. Doing this eliminates the intercept from the model, since the fit then passes through the point (0, 0).

X = np.vander(x, 12)[:, 0:-1]  # since we standardize we do not consider the last column of ones.
xmean = X.mean(axis=0)  # average of every column
xstd = X.std(axis=0)
xmean, xstd
(array([2.48293800e+06, 5.69542539e+05, 1.31727857e+05, 3.07737861e+04,
        7.27890923e+03, 1.74895299e+03, 4.28974856e+02, 1.08219836e+02,
        2.84377137e+01, 7.96966389e+00, 2.50619449e+00]),
 array([5.49844745e+06, 1.19967517e+06, 2.62434616e+05, 5.75785285e+04,
        1.26746927e+04, 2.80017452e+03, 6.20905075e+02, 1.38066119e+02,
        3.06634869e+01, 6.68612694e+00, 1.29948184e+00]))

Pandas is a data science library that provides better visualization of arrays than the default printing. We use it here to look at \(\mathbf{X}\), and then we standardize it.

import pandas as pd

pd.DataFrame(X)
0 1 2 3 4 5 6 7 8 9 10
0 1.771470e-06 5.904900e-06 1.968300e-05 0.000066 0.000219 0.000729 0.002430 0.008100 0.027000 0.090000 0.300000
1 3.179127e-05 8.150588e-05 2.089633e-04 0.000536 0.001374 0.003521 0.009028 0.023146 0.059341 0.152138 0.390049
2 3.123374e-04 6.505707e-04 1.355080e-03 0.002823 0.005879 0.012246 0.025506 0.053127 0.110659 0.230494 0.480098
3 2.069422e-03 3.629634e-03 6.366145e-03 0.011166 0.019584 0.034349 0.060246 0.105668 0.185336 0.325067 0.570146
4 1.038472e-02 1.572977e-02 2.382595e-02 0.036089 0.054665 0.082801 0.125418 0.189972 0.287751 0.435857 0.660195
5 4.238639e-02 5.649682e-02 7.530462e-02 0.100374 0.133788 0.178326 0.237691 0.316818 0.422287 0.562866 0.750244
6 1.474808e-01 1.755113e-01 2.088693e-01 0.248567 0.295810 0.352033 0.418941 0.498565 0.593323 0.706092 0.840293
7 4.519238e-01 4.857613e-01 5.221324e-01 0.561227 0.603248 0.648416 0.696966 0.749151 0.805243 0.865535 0.930341
8 1.248614e+00 1.223664e+00 1.199212e+00 1.175248 1.151764 1.128749 1.106193 1.084089 1.062426 1.041196 1.020390
9 3.165489e+00 2.850665e+00 2.567152e+00 2.311836 2.081912 1.874855 1.688391 1.520472 1.369254 1.233074 1.110439
10 7.463358e+00 6.216939e+00 5.178678e+00 4.313813 3.593384 2.993270 2.493379 2.076972 1.730107 1.441170 1.200488
11 1.653760e+01 1.281452e+01 9.929606e+00 7.694170 5.961995 4.619781 3.579737 2.773837 2.149367 1.665484 1.290536
12 3.472921e+01 2.515543e+01 1.822085e+01 13.197916 9.559654 6.924350 5.015519 3.632894 2.631416 1.906015 1.380585
13 6.959051e+01 4.732008e+01 3.217666e+01 21.879449 14.877564 10.116430 6.878959 4.677547 3.180634 2.162764 1.470634
14 1.338013e+02 8.573256e+01 5.493273e+01 35.197890 22.552882 14.450653 9.259188 5.932781 3.801401 2.435730 1.560683
15 2.479935e+02 1.502325e+02 9.100967e+01 55.132937 33.399098 20.232910 12.256937 7.425156 4.498101 2.724914 1.650731
16 4.448186e+02 2.555283e+02 1.467896e+02 84.324011 48.440359 27.826811 15.985254 9.182811 5.275112 3.030315 1.740780
17 7.746850e+02 4.231335e+02 2.311158e+02 126.235627 68.949989 37.660532 20.570209 11.235463 6.136818 3.351934 1.830829
18 1.313701e+03 6.839067e+02 3.560387e+02 185.352099 96.493447 50.234043 26.151611 13.614408 7.087598 3.689771 1.920878
19 2.174492e+03 1.081338e+03 5.377315e+02 267.404863 132.975961 66.126719 32.883710 16.352519 8.131834 4.043825 2.010926
20 3.520707e+03 1.675749e+03 7.976055e+02 379.635864 180.695080 86.005341 40.935917 19.484247 9.273906 4.414096 2.100975
21 5.586222e+03 2.549594e+03 1.163654e+03 531.100656 242.398391 110.632475 50.493506 23.045621 10.518197 4.800586 2.191024
22 8.700219e+03 3.814091e+03 1.672060e+03 733.015010 321.346637 140.875234 61.758330 27.074250 11.869087 5.203292 2.281073
23 1.331961e+04 5.617432e+03 2.369104e+03 999.149004 421.382483 177.714431 74.949530 31.609318 13.330958 5.622216 2.371121
24 2.007049e+04 8.154857e+03 3.313406e+03 1346.272767 547.005166 222.254107 90.304244 36.691590 14.908189 6.057358 2.461170
25 2.980059e+04 1.168092e+04 4.578566e+03 1794.658187 703.451284 275.731452 108.078321 42.363406 16.605163 6.508718 2.551219
26 4.364520e+04 1.652434e+04 6.256215e+03 2368.641102 896.781940 339.527101 128.547027 48.668687 18.426261 6.976295 2.641268
27 6.310912e+04 2.310575e+04 8.459566e+03 3097.248633 1133.976517 415.175820 152.005760 55.652930 20.375864 7.460089 2.731316
28 9.016796e+04 3.195898e+04 1.132749e+04 4014.896535 1423.033300 504.377574 178.770755 63.363211 22.458352 7.960101 2.821365
29 1.273924e+05 4.375619e+04 1.502919e+04 5162.161563 1773.077194 609.008977 209.179801 71.848184 24.678107 8.476331 2.911414
30 1.780994e+05 5.933752e+04 1.976954e+04 6586.634078 2194.474783 731.135132 243.592948 81.158081 27.039511 9.008778 3.001463
31 2.465354e+05 7.974593e+04 2.579513e+04 8343.856256 2698.956979 873.021844 282.393215 91.344711 29.546943 9.557443 3.091511
32 3.380972e+05 1.062677e+05 3.340114e+04 10498.351446 3299.749478 1037.148230 325.987308 102.461463 32.204786 10.122325 3.181560
33 4.595949e+05 1.404798e+05 4.293905e+04 13124.750407 4011.711298 1226.219702 374.806322 114.563303 35.017420 10.703425 3.271609
34 6.195671e+05 1.843040e+05 5.482534e+04 16309.020322 4851.481612 1443.181342 429.306458 127.706775 37.989227 11.300742 3.361658
35 8.286523e+05 2.400703e+05 6.955120e+04 20149.802653 5837.635138 1691.231651 489.969727 141.950001 41.124587 11.914277 3.451706
36 1.100028e+06 3.105885e+05 8.769338e+04 24759.866088 6990.846319 1973.836695 557.304670 157.352681 44.427882 12.544030 3.541755
37 1.449928e+06 3.992307e+05 1.099263e+05 30267.681004 8334.062531 2294.744624 631.847058 173.976093 47.903493 13.190000 3.631804
38 1.898239e+06 5.100254e+05 1.370353e+05 36819.122033 9892.686576 2658.000579 714.160610 191.883095 51.555801 13.852187 3.721853
39 2.469211e+06 6.477637e+05 1.699319e+05 44579.305505 11694.768683 3067.961984 804.837700 211.138119 55.389187 14.530593 3.811901
40 3.192265e+06 8.181204e+05 2.096696e+05 53734.568710 13771.208273 3529.314217 904.500070 231.807180 59.408032 15.225215 3.901950
41 4.102935e+06 1.027789e+06 2.574624e+05 64494.598101 16155.965727 4047.086675 1013.799536 253.957867 63.616717 15.936056 3.991999
42 5.243953e+06 1.284638e+06 3.147043e+05 77094.713715 18886.284392 4626.669209 1133.418704 277.659348 68.019624 16.663113 4.082048
43 6.666498e+06 1.597877e+06 3.829914e+05 91798.317303 22002.923076 5273.828956 1264.071676 302.982371 72.621133 17.406389 4.172096
44 8.431613e+06 1.978256e+06 4.641455e+05 108899.511786 25550.399271 5994.727545 1406.504765 329.999260 77.425626 18.165882 4.262145
45 1.061184e+07 2.438274e+06 5.602401e+05 128725.899867 29577.243340 6795.938692 1561.497202 358.783918 82.437484 18.941592 4.352194
46 1.329306e+07 2.992422e+06 6.736287e+05 151641.569775 34136.263923 7684.466182 1729.861845 389.411825 87.661087 19.733520 4.442243
47 1.657663e+07 3.657449e+06 8.069757e+05 178050.276326 39284.824788 8667.762221 1912.445896 421.960041 93.100818 20.541666 4.532291
48 2.058169e+07 4.452655e+06 9.632903e+05 208398.825615 45085.133383 9753.746194 2110.131604 456.507202 98.761056 21.366029 4.622340
49 2.544793e+07 5.400218e+06 1.145962e+06 243180.671861 51604.541321 10950.823783 2323.836981 493.133523 104.646184 22.206610 4.712389
Z = (X - xmean) / xstd  # this is broadcasting!
import pandas as pd

pd.DataFrame(Z)
0 1 2 3 4 5 6 7 8 9 10
0 -0.451571 -0.474747 -0.501945 -0.534466 -0.574287 -0.624587 -0.690882 -0.783767 -0.926532 -1.178509 -1.697749
1 -0.451571 -0.474747 -0.501945 -0.534466 -0.574287 -0.624586 -0.690872 -0.783659 -0.925478 -1.169216 -1.628453
2 -0.451571 -0.474747 -0.501945 -0.534466 -0.574286 -0.624583 -0.690845 -0.783441 -0.923804 -1.157497 -1.559158
3 -0.451571 -0.474747 -0.501945 -0.534466 -0.574285 -0.624575 -0.690789 -0.783061 -0.921369 -1.143352 -1.489862
4 -0.451571 -0.474747 -0.501945 -0.534466 -0.574283 -0.624558 -0.690684 -0.782450 -0.918029 -1.126782 -1.420566
5 -0.451571 -0.474747 -0.501945 -0.534465 -0.574276 -0.624523 -0.690504 -0.781531 -0.913641 -1.107786 -1.351270
6 -0.451571 -0.474747 -0.501945 -0.534462 -0.574264 -0.624461 -0.690212 -0.780215 -0.908063 -1.086365 -1.281974
7 -0.451571 -0.474747 -0.501943 -0.534457 -0.574239 -0.624356 -0.689764 -0.778400 -0.901152 -1.062518 -1.212678
8 -0.451571 -0.474746 -0.501941 -0.534446 -0.574196 -0.624184 -0.689105 -0.775974 -0.892765 -1.036245 -1.143382
9 -0.451570 -0.474745 -0.501936 -0.534426 -0.574123 -0.623918 -0.688167 -0.772814 -0.882759 -1.007547 -1.074086
10 -0.451569 -0.474742 -0.501926 -0.534391 -0.574003 -0.623518 -0.686871 -0.768783 -0.870991 -0.976424 -1.004790
11 -0.451568 -0.474737 -0.501908 -0.534333 -0.573816 -0.622937 -0.685121 -0.763736 -0.857318 -0.942875 -0.935495
12 -0.451564 -0.474726 -0.501876 -0.534237 -0.573533 -0.622114 -0.682809 -0.757513 -0.841597 -0.906900 -0.866199
13 -0.451558 -0.474708 -0.501823 -0.534086 -0.573113 -0.620974 -0.679807 -0.749947 -0.823686 -0.868500 -0.796903
14 -0.451546 -0.474676 -0.501736 -0.533855 -0.572507 -0.619427 -0.675974 -0.740856 -0.803441 -0.827674 -0.727607
15 -0.451526 -0.474622 -0.501599 -0.533509 -0.571652 -0.617362 -0.671146 -0.730046 -0.780721 -0.784423 -0.658311
16 -0.451490 -0.474534 -0.501386 -0.533002 -0.570465 -0.614650 -0.665141 -0.717316 -0.755381 -0.738746 -0.589015
17 -0.451430 -0.474395 -0.501065 -0.532274 -0.568847 -0.611138 -0.657757 -0.702449 -0.727279 -0.690643 -0.519719
18 -0.451332 -0.474177 -0.500589 -0.531247 -0.566674 -0.606648 -0.648768 -0.685218 -0.696272 -0.640115 -0.450423
19 -0.451175 -0.473846 -0.499896 -0.529822 -0.563795 -0.600972 -0.637925 -0.665386 -0.662217 -0.587162 -0.381127
20 -0.450930 -0.473350 -0.498906 -0.527873 -0.560030 -0.593873 -0.624957 -0.642704 -0.624972 -0.531783 -0.311832
21 -0.450555 -0.472622 -0.497511 -0.525242 -0.555162 -0.585078 -0.609564 -0.616909 -0.584393 -0.473978 -0.242536
22 -0.449988 -0.471568 -0.495574 -0.521736 -0.548933 -0.574278 -0.591421 -0.587730 -0.540337 -0.413748 -0.173240
23 -0.449148 -0.470065 -0.492918 -0.517114 -0.541041 -0.561122 -0.570176 -0.554883 -0.492663 -0.351092 -0.103944
24 -0.447921 -0.467950 -0.489320 -0.511085 -0.531130 -0.545216 -0.545447 -0.518072 -0.441226 -0.286011 -0.034648
25 -0.446151 -0.465011 -0.484499 -0.503297 -0.518786 -0.526118 -0.516821 -0.476992 -0.385884 -0.218504 0.034648
26 -0.443633 -0.460973 -0.478106 -0.493329 -0.503533 -0.503335 -0.483855 -0.431323 -0.326494 -0.148572 0.103944
27 -0.440093 -0.455487 -0.469710 -0.480675 -0.484819 -0.476319 -0.446073 -0.380737 -0.262914 -0.076214 0.173240
28 -0.435172 -0.448108 -0.458782 -0.464737 -0.462013 -0.444464 -0.402967 -0.324892 -0.194999 -0.001430 0.242536
29 -0.428402 -0.438274 -0.444677 -0.444812 -0.434396 -0.407097 -0.353991 -0.263436 -0.122609 0.075779 0.311832
30 -0.419180 -0.425286 -0.426614 -0.420072 -0.401149 -0.363484 -0.298567 -0.196006 -0.045598 0.155413 0.381127
31 -0.406733 -0.408274 -0.403654 -0.389554 -0.361346 -0.312813 -0.236077 -0.122225 0.036174 0.237474 0.450423
32 -0.390081 -0.386167 -0.374671 -0.352135 -0.313945 -0.254200 -0.165867 -0.041707 0.122852 0.321959 0.519719
33 -0.367984 -0.357649 -0.338327 -0.306521 -0.257773 -0.186679 -0.087241 0.045945 0.214578 0.408871 0.589015
34 -0.338890 -0.321119 -0.293035 -0.251218 -0.191518 -0.109197 0.000534 0.141142 0.311495 0.498207 0.658311
35 -0.300864 -0.274635 -0.236922 -0.184513 -0.113713 -0.020613 0.098235 0.244304 0.413745 0.589970 0.727607
36 -0.251509 -0.215853 -0.167792 -0.104447 -0.022727 0.080311 0.206682 0.355865 0.521473 0.684158 0.796903
37 -0.187873 -0.141965 -0.083074 -0.008790 0.083249 0.194913 0.326736 0.476266 0.634819 0.780771 0.866199
38 -0.106339 -0.049611 0.020224 0.104993 0.206220 0.324640 0.459307 0.605965 0.753929 0.879810 0.935495
39 -0.002496 0.065202 0.145576 0.239769 0.348400 0.471045 0.605347 0.745428 0.878944 0.981275 1.004790
40 0.129005 0.207204 0.296995 0.398773 0.512225 0.635804 0.765858 0.895132 1.010006 1.085165 1.074086
41 0.294628 0.381976 0.479108 0.585649 0.700376 0.820711 0.941891 1.055567 1.147260 1.191481 1.143382
42 0.502145 0.596074 0.697227 0.804483 0.915791 1.027692 1.134544 1.227235 1.290848 1.300222 1.212678
43 0.760862 0.857178 0.957433 1.059849 1.161686 1.258806 1.344967 1.410647 1.440913 1.411389 1.281974
44 1.081883 1.174245 1.266669 1.356855 1.441573 1.516254 1.574363 1.606328 1.597598 1.524981 1.351270
45 1.478399 1.557697 1.632834 1.701192 1.759280 1.802383 1.823986 1.814812 1.761045 1.640999 1.420566
46 1.966033 2.019613 2.064898 2.099182 2.118975 2.119694 2.095146 2.036647 1.931397 1.759443 1.489862
47 2.563212 2.573952 2.573014 2.557837 2.525183 2.470849 2.389207 2.272391 2.108798 1.880312 1.559158
48 3.291611 3.236804 3.168646 3.084918 2.982812 2.858677 2.707591 2.522613 2.293390 2.003606 1.628453
49 4.176632 4.026653 3.864711 3.688995 3.497176 3.286178 3.051774 2.787894 2.485317 2.129326 1.697749

Here we just confirm that we have standardized all the columns: each one has a mean of zero (to within round-off) and a standard deviation of one.

with np.printoptions(suppress=True):
    print(Z.mean(axis=0), Z.std(axis=0))
[-0. -0.  0.  0.  0. -0.  0.  0.  0. -0.  0.] [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

We similarly standardize the y data.

ymean = y.mean()
ystd = y.std()

w = (y - ymean) / ystd

To get an estimate of the parameters we have to specify a value of λ. If we set λ=0, we have regular linear regression. If we set λ=∞, all the weights will go to zero. We need something in between. It is a good idea to try several values of λ from a very small value to a large value, on a log scale.

lambdas = np.concatenate([[0], np.geomspace(1e-13, 10, 5)])

print("lambda     ", f"".join([f"x^{i:<11d}" for i in range(len(X[0]), 0, -1)]))
for lam in lambdas:
    l2p = np.linalg.solve(
        Z.T @ Z + lam * np.eye(len(Z[0])), Z.T @ w
    )  # for ridge regression
    p = np.zeros(len(X[0]))
    p[len(p) - len(l2p) :] = l2p
    # This way of printing is to get columnar output
    print(f"{lam:8.2g}", f"".join([f"{j: 12.2f}" for j in p]))
    plt.plot(x, y, "b.")
    plt.plot(x, (Z @ l2p) * ystd + ymean, label=f"{lam:1.2g}")
plt.legend();
lambda      x^11         x^10         x^9          x^8          x^7          x^6          x^5          x^4          x^3          x^2          x^1          
       0    -34043.54   190691.01  -464362.98   645651.76  -566688.59   328402.49  -128144.25    33874.94    -6032.60      691.22      -40.28
   1e-13    -13149.13    64363.30  -129320.28   133383.30   -67613.99     5161.66    12758.58    -7038.91     1606.47     -156.63        4.82
 3.2e-10     -1054.80     3732.42    -3866.47     -865.34     3642.53     -286.76    -3426.18     3217.32    -1354.26      284.92      -24.21
   1e-06       -11.38        6.95       17.29        8.03      -18.81      -29.90       13.53       55.80      -61.16       19.93       -1.06
  0.0032        -0.28       -0.10        0.10        0.32        0.54        0.63        0.39       -0.43       -1.76       -2.04        1.87
      10         0.11        0.08        0.04       -0.01       -0.06       -0.11       -0.17       -0.22       -0.25       -0.22       -0.06

One way people have evaluated a reasonable value of λ is to look at how the coefficients vary with λ using a ridge plot. In this plot, you look for a range that balances the large swings associated with regular unconstrained regression and the damping caused by large values of λ. Here a value of \(10^{-8} \le \lambda \le 10^{-6}\) would be considered reasonable.

lambdas = np.geomspace(1e-8, 1)

pars = np.zeros((11, len(lambdas)))

for i, lam in enumerate(lambdas):
    l2p = np.linalg.solve(Z.T @ Z + lam * np.eye(len(Z[0])), Z.T @ w)
    pars[:, i] = l2p

plt.semilogx(lambdas, pars.T)
plt.xlabel(r"$\lambda$")
plt.ylabel("parameters");

LASSO regression#

In LASSO regression, we seek to minimize the summed squared errors plus the sum of the absolute value of the parameters. Unlike linear least squares regression and ridge regression, there is no analytical solution to get the parameters; they can only be obtained numerically using an iterative solver. We again have a parameter λ we have to choose. Setting this parameter to zero will be equivalent to normal linear regression. Setting this parameter to infinity will again cause all coefficients to go to zero. We again have to find a balance.

def objective(pars, lam=0.0):
    SSE = np.sum(np.square(y - ((Z @ pars) * ystd + ymean)))
    return SSE + lam * np.sum(np.abs(pars))


from scipy.optimize import minimize

sol = minimize(
    objective,
    np.random.random(len(Z[0])),
    args=(0.15,),
    method="nelder-mead",
    options={"maxiter": 5000},
)

with np.printoptions(
    suppress=True, precision=3
):  # prints small numbers as practically zero
    print(sol.message, sol.x)

plt.plot(x, y, "b.")
plt.plot(x, (Z @ sol.x) * ystd + ymean);
Optimization terminated successfully. [-0.825  1.61   0.     0.967  0.963 -2.048 -1.275 -0.371 -0.003  0.
  0.182]

Now, we can explore the effect of λ more thoroughly.

lambdas = np.concatenate([[0], np.geomspace(1e-5, 10, 5)])

print("lambda     ", f"".join([f"x^{i:<11d}" for i in range(len(X[0]), 0, -1)]))
for lam in lambdas:
    sol = minimize(
        objective, np.random.random(len(Z[0])), (lam,), options={"maxiter": 5000}
    )

    # This way of printing is to get columnar output
    print(f"{lam:8.2g}", f"".join([f"{j: 12.2f}" for j in sol.x]))
    plt.plot(x, y, "b.")
    plt.plot(x, (Z @ sol.x) * ystd + ymean, label=f"{lam:1.2g}")
plt.legend();
lambda      x^11         x^10         x^9          x^8          x^7          x^6          x^5          x^4          x^3          x^2          x^1          
       0       136.55     -298.38       -4.64      296.91      108.40     -326.21     -179.04      521.01     -339.93       93.29       -8.74
   1e-05        59.02     -135.95       -9.68      179.06       16.04     -162.72      -90.66      298.40     -208.36       59.31       -5.25
 0.00032         0.16       -0.13       -0.04       -0.22       -0.08       -0.00        2.72        0.78       -5.27       -0.32        1.63
    0.01        -0.01       -0.16        0.00       -0.00        0.73        0.68        0.01        0.01       -1.75       -2.16        1.88
    0.32         0.35        0.17        0.08       -0.00        0.00        0.00       -0.00       -0.00       -1.99       -0.05        0.65
      10        -0.00       -0.00        0.00        0.00       -0.00       -0.00        0.00       -0.02       -0.64       -0.05       -0.00

You can see that by increasing λ we are making more and more of the parameters go to zero; in other words the functions they correspond to are not part of the model any longer. This is called sparsifying the model. It reduces over-fitting by reducing the model complexity. Finding the most suitable value for λ requires some sophisticated programming and analysis, and it is an important topic in machine learning and data science.

LASSO has some important benefits, and some disadvantages. The benefits include sparsification of the model; the method removes inputs that are not needed, or that are highly correlated with other inputs. This can make models more interpretable, as there are fewer terms and the terms are more independent.

The disadvantages, however, are that we cannot use linear algebra to find the parameters. The penalty adds a non-smooth term to the objective function, so we must use an iterative solver. For features that are correlated, we have no control over which feature is eliminated. Different initial guesses may lead to different features being eliminated. If the features are really correlated, this will not affect the fit quality, but it does mean some models favor one feature over another. This is less of a problem in polynomial models, but it is often a problem in models based on physical properties that are correlated, e.g. high melting points of materials tend to be correlated with how hard they are. With LASSO, one model could favor the melting point and another could favor the hardness.

Advanced selection of \(\lambda\)#

A more advanced way to select a value of λ is called k-fold cross-validation. It is complex to code this yourself, and the standard way to do it is with scikit-learn; see specifically the ridge regression example and the LASSO example. The basic idea is that you split your data set into \(k\) folds, fit on \(k-1\) of them to get the parameters, and estimate the model errors on the remaining fold (which was not used for fitting). Initially, with no regularization, the errors will be high due to overfitting. As you add regularization, the errors will begin to decrease. Eventually, though, the model will start underfitting and the errors will go back up. The λ that gives the lowest test errors is usually considered the best choice.
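
Just to give a flavor, here is a minimal sketch of that idea using scikit-learn's cross-validated estimators (this assumes scikit-learn is installed; it is not used elsewhere in this lecture, and its regularization parameter is called alpha and is scaled differently than the λ above, so the numbers are not directly comparable).

# Minimal sketch of cross-validated selection of the regularization strength
# with scikit-learn. Z and w are the standardized arrays defined above.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

ridge = RidgeCV(alphas=np.geomspace(1e-8, 1, 50)).fit(Z, w)
print("ridge alpha:", ridge.alpha_)

lasso = LassoCV(cv=5, max_iter=100000).fit(Z, w)
print("lasso alpha:", lasso.alpha_)
print("lasso coefficients:", lasso.coef_)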

We will not cover these more advanced methods as they rely on learning the scikit-learn API in depth, and some other higher level Python libraries we have not covered like Pandas. These are more appropriate in a data science/machine learning focused course.

Summary#

In this lecture we introduced the concept of linear regression. In the normal linear regression, we simply solve linear equations that ultimately minimize the summed squared errors between the model and data. With some additional linear algebra, we can also estimate the confidence intervals on the parameters.

One issue with normal linear regression is that the parameters are unconstrained, which can lead to some functions having undesirably large parameters. We introduced two types of regularization to mitigate this issue: ridge regression and LASSO regression. In both cases, a penalty function is added to the objective function being minimized. In ridge regression the penalty is an L2 norm on the parameters, which penalizes large parameters and leads to a reduction in their magnitude. In LASSO regression the penalty is an L1 norm, which drives parameters towards zero. Both methods rely on a hyperparameter λ that determines how much regularization is applied. With both regularization approaches we have to use some judgment in how much regularization to apply (the magnitude of λ), and we only provided heuristic approaches to doing this.

display_quiz('.quiz-2.json')