## Linear regression with confidence intervals.

| categories: | tags:

Matlab post Fit a fourth order polynomial to this data and determine the confidence interval for each parameter. Data from example 5-1 in Fogler, Elements of Chemical Reaction Engineering.

We want the equation $$Ca(t) = b0 + b1*t + b2*t^2 + b3*t^3 + b4*t^4$$ fit to the data in the least squares sense. We can write this in a linear algebra form as: T*p = Ca where T is a matrix of columns [1 t t^2 t^3 t^4], and p is a column vector of the fitting parameters. We want to solve for the p vector and estimate the confidence intervals.

import numpy as np
from scipy.stats.distributions import  t

time = np.array([0.0, 50.0, 100.0, 150.0, 200.0, 250.0, 300.0])
Ca = np.array([50.0, 38.0, 30.6, 25.6, 22.2, 19.5, 17.4])*1e-3

T = np.column_stack([time**0, time, time**2, time**3, time**4])

p, res, rank, s = np.linalg.lstsq(T, Ca)
# the parameters are now in p

# compute the confidence intervals
n = len(Ca)
k = len(p)

sigma2 = np.sum((Ca - np.dot(T, p))**2) / (n - k)  # RMSE

C = sigma2 * np.linalg.inv(np.dot(T.T, T)) # covariance matrix
se = np.sqrt(np.diag(C)) # standard error

alpha = 0.05 # 100*(1 - alpha) confidence level

sT = t.ppf(1.0 - alpha/2.0, n - k) # student T multiplier
CI = sT * se

for beta, ci in zip(p, CI):
print '{2: 1.2e} [{0: 1.4e} {1: 1.4e}]'.format(beta - ci, beta + ci, beta)

SS_tot = np.sum((Ca - np.mean(Ca))**2)
SS_err = np.sum((np.dot(T, p) - Ca)**2)

#  http://en.wikipedia.org/wiki/Coefficient_of_determination
Rsq = 1 - SS_err/SS_tot
print 'R^2 = {0}'.format(Rsq)

# plot fit
import matplotlib.pyplot as plt
plt.plot(time, Ca, 'bo', label='raw data')
plt.plot(time, np.dot(T, p), 'r-', label='fit')
plt.xlabel('Time')
plt.ylabel('Ca (mol/L)')
plt.legend(loc='best')
plt.savefig('images/linregress-conf.png')

 5.00e-02 [ 4.9680e-02  5.0300e-02]
-2.98e-04 [-3.1546e-04 -2.8023e-04]
1.34e-06 [ 1.0715e-06  1.6155e-06]
-3.48e-09 [-4.9032e-09 -2.0665e-09]
3.70e-12 [ 1.3501e-12  6.0439e-12]
R^2 = 0.999986967246 A fourth order polynomial fits the data well, with a good R^2 value. All of the parameters appear to be significant, i.e. zero is not included in any of the parameter confidence intervals. This does not mean this is the best model for the data, just that the model fits well.