Model selection

| categories: data analysis, statistics |

Matlab post

adapted from http://www.itl.nist.gov/div898/handbook/pmd/section4/pmd44.htm

In this example, we show some ways to choose which of several models fit data the best. We have data for the total pressure and temperature of a fixed amount of a gas in a tank that was measured over the course of several days. We want to select a model that relates the pressure to the gas temperature.

The data is stored in a text file, PT.txt, with the following structure:

Run          Ambient                            Fitted
 Order  Day  Temperature  Temperature  Pressure    Value    Residual
  1      1      23.820      54.749      225.066   222.920     2.146
...

We need to read the data in and perform a regression analysis of P vs. T. In Python we start counting at 0, so we actually want columns 3 and 4.

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('data/PT.txt', skiprows=2)
T = data[:, 3]
P = data[:, 4]

plt.plot(T, P, 'k.')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.savefig('images/model-selection-1.png')

It appears the data is roughly linear, and we know from the ideal gas law that PV = nRT, or P = nR/V*T, which says P should be linearly correlated with T. Note that the temperature data is in degC, not in K, so it is not expected that P = 0 at T = 0. We will use linear algebra to compute the line coefficients.

A = np.vstack([T**0, T]).T
b = P

x, res, rank, s = np.linalg.lstsq(A, b)
intercept, slope = x
print 'b, m =', intercept, slope

n = len(b)
k = len(x)

sigma2 = np.sum((b - np.dot(A,x))**2) / (n - k)

C = sigma2 * np.linalg.inv(np.dot(A.T, A))
se = np.sqrt(np.diag(C))

from scipy.stats.distributions import  t
alpha = 0.05

sT = t.ppf(1-alpha/2., n - k) # student T multiplier
CI = sT * se

print 'CI = ',CI
for beta, ci in zip(x, CI):
    print '[{0} {1}]'.format(beta - ci, beta + ci)
b, m = 7.74899739238 3.93014043824
CI =  [ 4.76511545  0.1026405 ]
[2.98388194638 12.5141128384]
[3.82749994079 4.03278093569]

The confidence interval on the intercept is large, but it does not contain zero at the 95% confidence level.
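
As a quick cross-check on the linear algebra above, scipy's linregress should reproduce the same slope and intercept in one call. This is a sketch, not part of the original analysis; it assumes T and P are defined as above.

from scipy.stats import linregress

# one-line linear regression; returns slope, intercept, correlation
# coefficient, p-value, and the standard error of the slope
slope, intercept, r_value, p_value, slope_se = linregress(T, P)
print 'm = {0}, b = {1}'.format(slope, intercept)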

The R^2 value accounts roughly for the fraction of variation in the data that can be described by the model. Hence, a value close to one means nearly all the variations are described by the model, except for random variations.

ybar = np.mean(P)
SStot = np.sum((P - ybar)**2)
SSerr = np.sum((P - np.dot(A, x))**2)
R2 = 1 - SSerr/SStot
print R2
0.993715411798
plt.figure(); plt.clf()
plt.plot(T, P, 'k.', T, np.dot(A, x), 'b-')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.title('R^2 = {0:1.3f}'.format(R2))
plt.savefig('images/model-selection-2.png')

The fit looks good, and R^2 is near one, but is it a good model? There are a few ways to examine this. We want to make sure that there are no systematic trends in the errors between the fit and the data, and we want to make sure there are not hidden correlations with other variables. The residuals are the error between the fit and the data. The residuals should not show any patterns when plotted against any variables, and they do not in this case.

residuals = P - np.dot(A, x)

f, (ax1, ax2, ax3) = plt.subplots(3) # subplots creates the figure for us

ax1.plot(T,residuals,'ko')
ax1.set_xlabel('Temperature')


run_order = data[:, 0]
ax2.plot(run_order, residuals, 'ko')
ax2.set_xlabel('run order')

ambientT = data[:, 2]
ax3.plot(ambientT, residuals,'ko')
ax3.set_xlabel('ambient temperature')

plt.tight_layout() # make sure plots do not overlap

plt.savefig('images/model-selection-3.png')

There may be some correlations in the residuals with the run order. That could indicate an experimental source of error.
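
We can put a rough number on that impression with the correlation coefficient between the run order and the residuals; this is a small sketch, assuming the arrays defined above are still in scope.

# correlation coefficient between run order and the residuals;
# a value near zero indicates no linear association
print np.corrcoef(run_order, residuals)[0, 1]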

We assume all the errors are uncorrelated with each other. We can use a lag plot to assess this, where we plot residual[i] vs residual[i-1], i.e. we look for correlations between adjacent residuals. This plot should look random, with no correlations if the model is good.

plt.figure(); plt.clf()
plt.plot(residuals[1:], residuals[:-1], 'ko')
plt.xlabel('residual[i]')
plt.ylabel('residual[i-1]')
plt.savefig('images/model-selection-correlated-residuals.png')

It is hard to argue there is any correlation here.
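
We can also quantify this with the lag-1 autocorrelation coefficient of the residuals; values near zero are consistent with uncorrelated errors. A minimal sketch using the residuals computed above:

# correlation between adjacent residuals (lag-1 autocorrelation)
print np.corrcoef(residuals[1:], residuals[:-1])[0, 1]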

Let us consider a quadratic model instead.

A = np.vstack([T**0, T, T**2]).T
b = P

x, res, rank, s = np.linalg.lstsq(A, b)
print x

n = len(b)
k = len(x)

sigma2 = np.sum((b - np.dot(A,x))**2) / (n - k)

C = sigma2 * np.linalg.inv(np.dot(A.T, A))
se = np.sqrt(np.diag(C))

from scipy.stats.distributions import  t
alpha = 0.05

sT = t.ppf(1-alpha/2., n - k) # student T multiplier
CI = sT * se

print 'CI = ',CI
for beta, ci in zip(x, CI):
    print '[{0} {1}]'.format(beta - ci, beta + ci)


ybar = np.mean(P)
SStot = np.sum((P - ybar)**2)
SSerr = np.sum((P - np.dot(A,x))**2)
R2 = 1 - SSerr/SStot
print 'R^2 = {0}'.format(R2)
[  9.00353031e+00   3.86669879e+00   7.26244301e-04]
CI =  [  1.38030344e+01   6.62100654e-01   7.48516727e-03]
[-4.79950412123 22.8065647329]
[3.20459813681 4.52879944409]
[-0.00675892296907 0.00821141157035]
R^2 = 0.993721969407

You can see that the confidence intervals on the constant and the T^2 term include zero. That is a good indication the additional parameter is not significant. You can also see that the R^2 value is no better than the one from the linear fit, so adding a parameter does not increase the goodness of fit. This is an example of overfitting the data.
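
A more formal way to penalize extra parameters is an information criterion such as the AIC; the model with the lower value is preferred. This is a minimal sketch, not part of the original analysis; SSerr_linear and SSerr_quad are hypothetical names for the residual sums of squares of the two fits computed above.

import numpy as np

def aic(n, k, SSerr):
    # AIC for a least-squares fit: n*ln(SSerr/n) + 2*k,
    # where k is the number of fitted parameters
    return n * np.log(SSerr / n) + 2 * k

# compare the two models fitted above, e.g.:
# print aic(n, 2, SSerr_linear)   # linear model
# print aic(n, 3, SSerr_quad)     # quadratic model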

Since the constant in the quadratic model is apparently not significant, let us consider the simplest alternative: a model with a fixed intercept of zero, P = alpha*T.

A = np.vstack([T]).T
b = P

x, res, rank, s = np.linalg.lstsq(A, b)

n = len(b)
k = len(x)

sigma2 = np.sum((b - np.dot(A,x))**2) / (n - k)

C = sigma2 * np.linalg.inv(np.dot(A.T, A))
se = np.sqrt(np.diag(C))

from scipy.stats.distributions import  t
alpha = 0.05

sT = t.ppf(1-alpha/2.0, n - k) # student T multiplier
CI = sT * se

for beta, ci in zip(x, CI):
    print '[{0} {1}]'.format(beta - ci, beta + ci)

plt.figure()
plt.plot(T, P, 'k.', T, np.dot(A, x))
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.legend(['data', 'fit'])

ybar = np.mean(P)
SStot = np.sum((P - ybar)**2)
SSerr = np.sum((P - np.dot(A,x))**2)
R2 = 1 - SSerr/SStot
plt.title('R^2 = {0:1.3f}'.format(R2))
plt.savefig('images/model-selection-no-intercept.png')
[4.05680124495 4.12308349899]

The fit is visually still pretty good, and the R^2 value is only slightly worse. Let us examine the residuals again.

residuals = P - np.dot(A,x)

plt.figure()
plt.plot(T,residuals,'ko')
plt.xlabel('Temperature')
plt.ylabel('residuals')
plt.savefig('images/model-selection-no-incpt-resid.png')

You can see a slight downward trend in the residuals as the temperature increases. This may indicate a deficiency in the model with no intercept. For the ideal gas law in degC: \(PV = nR(T+273)\) or \(P = nR/V*T + 273*nR/V\), so the intercept is expected to be non-zero in this case; specifically, we expect it to be 273*nR/V. Since the molar density of a gas is pretty small, the intercept may be close to, but not equal to, zero. That is why the fit still looks ok, but is not as good as letting the intercept be a fitting parameter. That is an example of the deficiency in our model.

In the end, it is hard to justify a model more complex than a line in this case.

Copyright (C) 2013 by John Kitchin. See the License for information about copying.

org-mode source


Random thoughts

| categories: math, statistics |

Matlab post

Random numbers are used in a variety of simulation methods, most notably Monte Carlo simulations. In another later example, we will see how we can use random numbers for error propagation analysis. First, we discuss two types of pseudorandom numbers we can use in python: uniformly distributed and normally distributed numbers.

Say you are the gambling type, and bet your friend $5 the next random number will be greater than 0.49. Let us ask Python to roll the random number generator for us.

import numpy as np

n = np.random.uniform()
print 'n = {0}'.format(n)

if n > 0.49:
    print 'You win!'
else:
    print 'you lose.'
n = 0.381896986693
you lose.

The odds of winning this bet are slightly stacked in your favor: there is only a 49% chance your friend wins, but a 51% chance that you win. Let us play the game many times and count how often you win. First, let us generate a bunch of numbers and look at the distribution with a histogram.

import numpy as np

N = 10000
games = np.random.uniform(size=(1,N))

wins = np.sum(games > 0.49)
losses = N - wins

print 'You won {0} times ({1:%})'.format(wins, float(wins) / N)

import matplotlib.pyplot as plt
count, bins, ignored = plt.hist(games)
plt.savefig('images/random-thoughts-1.png')
You won 5090 times (50.900000%)

As you can see, you win slightly more often than you lose.
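
That outcome is consistent with the binomial distribution: with p = 0.51, we expect about N*p wins, give or take roughly sqrt(N*p*(1-p)). A quick sketch:

import numpy as np

N = 10000
p = 0.51 # probability of winning a single game

print 'expected wins = {0}'.format(N * p)
print 'standard deviation = {0:1.1f}'.format(np.sqrt(N * p * (1 - p)))

The 5090 wins observed above is well within one standard deviation (about 50) of the expected 5100.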

It is possible to get random integers. Here are a few examples of getting a random integer between 1 and 100. You might do this to get random indices of a list, for example.

import numpy as np

print np.random.random_integers(1, 100)
print np.random.random_integers(1, 100, 3)
print np.random.random_integers(1, 100, (2,2))
96
[ 95  49 100]
[[69 54]
 [41 93]]

The normal distribution is defined by \(f(x)=\frac{1}{\sqrt{2\pi \sigma^2}} \exp (-\frac{(x-\mu)^2}{2\sigma^2})\) where \(\mu\) is the mean value and \(\sigma\) is the standard deviation. For the standard normal distribution, \(\mu=0\) and \(\sigma=1\).

import numpy as np

mu = 1
sigma = 0.5
print np.random.normal(mu, sigma)
print np.random.normal(mu, sigma, 2)
1.04225842065
[ 0.58105204  0.64853157]

Let us compare the sampled distribution to the analytical distribution. We generate a large set of samples, and calculate the probability of getting each value using the matplotlib.pyplot.hist command.

import numpy as np
import matplotlib.pyplot as plt

mu = 0; sigma = 1

N = 5000
samples = np.random.normal(mu, sigma, N)

counts, bins, ignored = plt.hist(samples, 50, normed=True)

plt.plot(bins, 1.0/np.sqrt(2 * np.pi * sigma**2)*np.exp(-((bins - mu)**2)/(2*sigma**2)))
plt.savefig('images/random-thoughts-2.png')

What fraction of points lie between plus and minus one standard deviation of the mean?

samples >= mu-sigma returns a boolean vector that is True where the inequality holds and False where it does not. (samples >= mu-sigma) & (samples <= mu+sigma) returns a vector that is True only where both inequalities hold. Summing that vector counts the elements where both inequalities are true (each True counts as one), and normalizing by the total number of samples gives the fraction of samples between mu-sigma and mu+sigma.

import numpy as np
import matplotlib.pyplot as plt

mu = 0; sigma = 1

N = 5000
samples = np.random.normal(mu, sigma, N)

a = np.sum((samples >= (mu - sigma)) & (samples <= (mu + sigma))) / float(N) 
b = np.sum((samples >= (mu - 2*sigma)) & (samples <= (mu + 2*sigma))) / float(N) 
print '{0:%} of samples are within +- 1 standard deviation of the mean'.format(a)
print '{0:%} of samples are within +- 2 standard deviations of the mean'.format(b)
67.500000% of samples are within +- 1 standard deviation of the mean
95.580000% of samples are within +- 2 standard deviations of the mean
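
These sampled fractions are close to the analytical values for a normal distribution, about 68.3% and 95.4%, which we can get from the cumulative distribution function in scipy. A short sketch:

from scipy.stats import norm

# analytical fractions within +/- 1 and +/- 2 standard deviations
print '{0:%} within one standard deviation'.format(norm.cdf(1) - norm.cdf(-1))
print '{0:%} within two standard deviations'.format(norm.cdf(2) - norm.cdf(-2))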

1 Summary

We only considered the numpy.random functions here, and not all of them. There are many distributions of random numbers to choose from. There are also random numbers in the python random module. Remember these are only pseudorandom numbers, but they are still useful for many applications.

Copyright (C) 2013 by John Kitchin. See the License for information about copying.

org-mode source


Numerical propagation of errors

| categories: statistics |

Matlab post

Propagation of errors is essential to understanding how the uncertainty in a parameter affects computations that use that parameter. The uncertainty propagates by a set of rules into your solution. These rules are not easy to remember or to apply in complicated situations, and they are only approximate for equations that are nonlinear in the parameters.

We will use a Monte Carlo simulation to illustrate error propagation. The idea is to generate a distribution of possible parameter values, and to evaluate your equation for each parameter value. Then, we perform statistical analysis on the results to determine the standard error of the results.

We will assume all parameters are defined by a normal distribution with known mean and standard deviation.

1 Addition and subtraction

import numpy as np
import matplotlib.pyplot as plt

N = int(1e6) # number of samples of parameters

A_mu = 2.5; A_sigma = 0.4
B_mu = 4.1; B_sigma = 0.3

A = np.random.normal(A_mu, A_sigma, size=(1, N))
B = np.random.normal(B_mu, B_sigma, size=(1, N))

p = A + B
m = A - B

print np.std(p)
print np.std(m)

print np.sqrt(A_sigma**2 + B_sigma**2) # the analytical std dev
0.500505424616
0.500113385681
0.5

2 Multiplication

F_mu = 25.0; F_sigma = 1
x_mu = 6.4; x_sigma = 0.4

F = np.random.normal(F_mu, F_sigma, size=(1, N))
x = np.random.normal(x_mu, x_sigma, size=(1, N))

t = F * x
print np.std(t)
print np.sqrt((F_sigma / F_mu)**2 + (x_sigma / x_mu)**2) * F_mu * x_mu
11.8900166284
11.8726576637

3 Division

This is really like multiplication: F / x = F * (1 / x).

d = F / x
print np.std(d)
print np.sqrt((F_sigma / F_mu)**2 + (x_sigma / x_mu)**2) * F_mu / x_mu
0.293757533168
0.289859806243

4 Exponents

This rule is different from multiplication (A^2 = A*A) because in the previous examples we assumed the errors in A and B were uncorrelated. In A*A the errors are perfectly correlated, so a different error propagation rule applies.

t_mu = 2.03; t_sigma = 0.01 * t_mu # 1% error
A_mu = 16.07; A_sigma = 0.06

t = np.random.normal(t_mu, t_sigma, size=(1, N))
A = np.random.normal(A_mu, A_sigma, size=(1, N))

# Compute t^5 and sqrt(A) with error propagation
print np.std(t**5)
print (5 * t_sigma / t_mu) * t_mu**5
1.72454836176
1.72365440621
print np.std(np.sqrt(A))
print 1.0 / 2.0 * A_sigma / A_mu * np.sqrt(A_mu)
0.00748903477329
0.00748364738749
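
To see why the distinction matters, compare the (incorrect) uncorrelated multiplication rule applied to t*t*t*t*t with the power rule used above: the prefactor is sqrt(5) instead of 5. A sketch, reusing t_mu and t_sigma from above:

# incorrect: treating t**5 as a product of five uncorrelated factors
print np.sqrt(5) * (t_sigma / t_mu) * t_mu**5
# correct power rule, which matches the Monte Carlo result above
print 5 * (t_sigma / t_mu) * t_mu**5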

5 The chain rule in error propagation

Let v = v0 + a*t, with uncertainties in v0, a, and t.

vo_mu = 1.2; vo_sigma = 0.02
a_mu = 3.0;  a_sigma  = 0.3
t_mu = 12.0; t_sigma  = 0.12

vo = np.random.normal(vo_mu, vo_sigma, (1, N))
a = np.random.normal(a_mu, a_sigma, (1, N))
t = np.random.normal(t_mu, t_sigma, (1, N))

v = vo + a*t

print np.std(v)
print np.sqrt(vo_sigma**2 + t_mu**2 * a_sigma**2 + a_mu**2 * t_sigma**2)
3.62232509326
3.61801050303

6 Summary

You can numerically perform error propagation analysis if you know the underlying distribution of errors on the parameters in your equations. One benefit of the numerical propagation is that you do not have to remember the error propagation rules, and you can directly look at the distribution in nonlinear cases. Some limitations of this approach include:

  1. You have to know the distribution of the errors in the parameters
  2. You have to assume the errors in parameters are uncorrelated.

Copyright (C) 2013 by John Kitchin. See the License for information about copying.

org-mode source


Confidence interval on an average

| categories: statistics |

scipy has a statistical package available for working with statistical distributions. This is useful for computing confidence intervals using the Student's t-distribution. Here is an example of computing a 95% confidence interval on an average.

import numpy as np
from scipy.stats.distributions import  t

n = 10 # number of measurements
dof = n - 1 # degrees of freedom
avg_x = 16.1 # average measurement
std_x = 0.01 # standard deviation of measurements

# Find the 95% confidence interval on the average

alpha = 1.0 - 0.95

conf_interval = t.ppf(1 - alpha/2.0, dof) * std_x / np.sqrt(n)

s = ['We are 95% confident the true average',
     ' is between {0:1.3f} and {1:1.3f}']
print ''.join(s).format(avg_x - conf_interval, avg_x + conf_interval)
We are 95% confident the true average is between 16.093 and 16.107
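
scipy can also produce this interval in a single call with t.interval, where the first argument is the confidence level and loc/scale shift and scale the distribution. A sketch that should reproduce the bounds above, assuming the variables defined earlier:

lower, upper = t.interval(0.95, dof, loc=avg_x, scale=std_x / np.sqrt(n))
print 'between {0:1.3f} and {1:1.3f}'.format(lower, upper)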

Copyright (C) 2013 by John Kitchin. See the License for information about copying.

org-mode source
