## Basic statistics

| categories: statistics | tags:

Given several measurements of a single quantity, determine the average value of the measurements, the standard deviation of the measurements and the 95% confidence interval for the average.

This is a recipe for computing the confidence interval. The strategy is:

1. compute the average
2. Compute the standard deviation of your data
3. Define the confidence interval, e.g. 95% = 0.95
4. compute the student-t multiplier. This is a function of the confidence

interval you specify, and the number of data points you have minus 1. You subtract 1 because one degree of freedom is lost from calculating the average. The confidence interval is defined as ybar +- T_multiplier*std/sqrt(n).

import numpy as np
from scipy.stats.distributions import  t

y = [8.1, 8.0, 8.1]

ybar = np.mean(y)
s = np.std(y)

ci = 0.95
alpha = 1.0 - ci

n = len(y)
T_multiplier = t.ppf(1-alpha/2.0, n-1)

ci95 = T_multiplier * s / np.sqrt(n-1)

print [ybar - ci95, ybar + ci95]

[7.9232449090029595, 8.210088424330376]


We are 95% certain the next measurement will fall in the interval above.

org-mode source

## Introduction to statistical data analysis

| categories: statistics | tags:

Given several measurements of a single quantity, determine the average value of the measurements, the standard deviation of the measurements and the 95% confidence interval for the average.

import numpy as np

y = [8.1, 8.0, 8.1]

ybar = np.mean(y)
s = np.std(y, ddof=1)

print ybar, s

>>> >>> >>> >>> >>> >>> 8.06666666667 0.057735026919


Interesting, we have to specify the divisor in numpy.std by the ddof argument. The default for this in Matlab is 1, the default for this function is 0.

Here is the principle of computing a confidence interval.

1. compute the average
2. Compute the standard deviation of your data
3. Define the confidence interval, e.g. 95% = 0.95
4. compute the student-t multiplier. This is a function of the confidence interval you specify, and the number of data points you have minus 1. You subtract 1 because one degree of freedom is lost from calculating the average.

The confidence interval is defined as ybar +- T_multiplier*std/sqrt(n).

from scipy.stats.distributions import  t
ci = 0.95
alpha = 1.0 - ci

n = len(y)
T_multiplier = t.ppf(1.0 - alpha / 2.0, n - 1)

ci95 = T_multiplier * s / np.sqrt(n)

print 'T_multiplier = {0}'.format(T_multiplier)
print 'ci95 = {0}'.format(ci95)
print 'The true average is between {0} and {1} at a 95% confidence level'.format(ybar - ci95, ybar + ci95)

>>> >>> >>> >>> >>> >>> >>> >>> T_multiplier = 4.30265272991
ci95 = 0.143421757664
The true average is between 7.923244909 and 8.21008842433 at a 95% confidence level


org-mode source

## Numerical propagation of errors

| categories: statistics | tags:

Propagation of errors is essential to understanding how the uncertainty in a parameter affects computations that use that parameter. The uncertainty propagates by a set of rules into your solution. These rules are not easy to remember, or apply to complicated situations, and are only approximate for equations that are nonlinear in the parameters.

We will use a Monte Carlo simulation to illustrate error propagation. The idea is to generate a distribution of possible parameter values, and to evaluate your equation for each parameter value. Then, we perform statistical analysis on the results to determine the standard error of the results.

We will assume all parameters are defined by a normal distribution with known mean and standard deviation.

import numpy as np
import matplotlib.pyplot as plt

N = 1e6 # number of samples of parameters

A_mu = 2.5; A_sigma = 0.4
B_mu = 4.1; B_sigma = 0.3

A = np.random.normal(A_mu, A_sigma, size=(1, N))
B = np.random.normal(B_mu, B_sigma, size=(1, N))

p = A + B
m = A - B

print np.std(p)
print np.std(m)

print np.sqrt(A_sigma**2 + B_sigma**2) # the analytical std dev

>>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> 0.500505424616
0.500113385681
>>> 0.5


## 2 Multiplication

F_mu = 25.0; F_sigma = 1;
x_mu = 6.4; x_sigma = 0.4;

F = np.random.normal(F_mu, F_sigma, size=(1, N))
x = np.random.normal(x_mu, x_sigma, size=(1, N))

t = F * x
print np.std(t)
print np.sqrt((F_sigma / F_mu)**2 + (x_sigma / x_mu)**2) * F_mu * x_mu

>>> >>> >>> >>> >>> >>> 11.8900166284
11.8726576637


## 3 Division

This is really like multiplication: F / x = F * (1 / x).

d = F / x
print np.std(d)
print np.sqrt((F_sigma / F_mu)**2 + (x_sigma / x_mu)**2) * F_mu / x_mu

0.293757533168
0.289859806243


## 4 exponents

This rule is different than multiplication (A^2 = A*A) because in the previous examples we assumed the errors in A and B for A*B were uncorrelated. in A*A, the errors are not uncorrelated, so there is a different rule for error propagation.

t_mu = 2.03; t_sigma = 0.01*t_mu; # 1% error
A_mu = 16.07; A_sigma = 0.06;

t = np.random.normal(t_mu, t_sigma, size=(1, N))
A = np.random.normal(A_mu, A_sigma, size=(1, N))

# Compute t^5 and sqrt(A) with error propagation
print np.std(t**5)
print (5 * t_sigma / t_mu) * t_mu**5

>>> >>> >>> >>> >>> ... 1.72454836176
1.72365440621

print np.std(np.sqrt(A))
print 1.0 / 2.0 * A_sigma / A_mu * np.sqrt(A_mu)

0.00748903477329
0.00748364738749


## 5 the chain rule in error propagation

let v = v0 + a*t, with uncertainties in vo,a and t

vo_mu = 1.2; vo_sigma = 0.02;
a_mu = 3.0;  a_sigma  = 0.3;
t_mu = 12.0; t_sigma  = 0.12;

vo = np.random.normal(vo_mu, vo_sigma, (1, N))
a = np.random.normal(a_mu, a_sigma, (1, N))
t = np.random.normal(t_mu, t_sigma, (1, N))

v = vo + a*t

print np.std(v)
print np.sqrt(vo_sigma**2 + t_mu**2 * a_sigma**2 + a_mu**2 * t_sigma**2)

>>> >>> >>> >>> >>> >>> >>> >>> >>> 3.62232509326
3.61801050303


## 6 Summary

You can numerically perform error propagation analysis if you know the underlying distribution of errors on the parameters in your equations. One benefit of the numerical propogation is you do not have to remember the error propagation rules, and you directly look at the distribution in nonlinear cases. Some limitations of this approach include

1. You have to know the distribution of the errors in the parameters
2. You have to assume the errors in parameters are uncorrelated.

org-mode source

## Numerical propogation of errors

| categories: statistics | tags:

Propagation of errors is essential to understanding how the uncertainty in a parameter affects computations that use that parameter. The uncertainty propogates by a set of rules into your solution. These rules are not easy to remember, or apply to complicated situations, and are only approximate for equations that are nonlinear in the parameters.

We will use a Monte Carlo simulation to illustrate error propogation. The idea is to generate a distribution of possible parameter values, and to evaluate your equation for each parameter value. Then, we perform statistical analysis on the results to determine the standard error of the results.

We will assume all parameters are defined by a normal distribution with known mean and standard deviation.

import numpy as np
import matplotlib.pyplot as plt

N = 1e6 # number of samples of parameters

A_mu = 2.5; A_sigma = 0.4
B_mu = 4.1; B_sigma = 0.3

A = np.random.normal(A_mu, A_sigma, size=(1, N))
B = np.random.normal(B_mu, B_sigma, size=(1, N))

p = A + B
m = A - B

print np.std(p)
print np.std(m)

print np.sqrt(A_sigma**2 + B_sigma**2) # the analytical std dev

>>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> 0.500505424616
0.500113385681
>>> 0.5


## 2 Multiplication

F_mu = 25.0; F_sigma = 1;
x_mu = 6.4; x_sigma = 0.4;

F = np.random.normal(F_mu, F_sigma, size=(1, N))
x = np.random.normal(x_mu, x_sigma, size=(1, N))

t = F * x
print np.std(t)
print np.sqrt((F_sigma / F_mu)**2 + (x_sigma / x_mu)**2) * F_mu * x_mu

>>> >>> >>> >>> >>> >>> 11.8900166284
11.8726576637


## 3 Division

This is really like multiplication: F / x = F * (1 / x).

d = F / x
print np.std(d)
print np.sqrt((F_sigma / F_mu)**2 + (x_sigma / x_mu)**2) * F_mu / x_mu

0.293757533168
0.289859806243


## 4 exponents

This rule is different than multiplication (A^2 = A*A) because in the previous examples we assumed the errors in A and B for A*B were uncorrelated. in A*A, the errors are not uncorrelated, so there is a different rule for error propagation.

t_mu = 2.03; t_sigma = 0.01*t_mu; # 1% error
A_mu = 16.07; A_sigma = 0.06;

t = np.random.normal(t_mu, t_sigma, size=(1, N))
A = np.random.normal(A_mu, A_sigma, size=(1, N))

# Compute t^5 and sqrt(A) with error propogation
print np.std(t**5)
print (5 * t_sigma / t_mu) * t_mu**5

>>> >>> >>> >>> >>> ... 1.72454836176
1.72365440621

print np.std(np.sqrt(A))
print 1.0 / 2.0 * A_sigma / A_mu * np.sqrt(A_mu)

0.00748903477329
0.00748364738749


## 5 the chain rule in error propogation

let v = v0 + a*t, with uncertainties in vo,a and t

vo_mu = 1.2; vo_sigma = 0.02;
a_mu = 3.0;  a_sigma  = 0.3;
t_mu = 12.0; t_sigma  = 0.12;

vo = np.random.normal(vo_mu, vo_sigma, (1, N))
a = np.random.normal(a_mu, a_sigma, (1, N))
t = np.random.normal(t_mu, t_sigma, (1, N))

v = vo + a*t

print np.std(v)
print np.sqrt(vo_sigma**2 + t_mu**2 * a_sigma**2 + a_mu**2 * t_sigma**2)

>>> >>> >>> >>> >>> >>> >>> >>> >>> 3.62232509326
3.61801050303


## 6 Summary

You can numerically perform error propogation analysis if you know the underlying distribution of errors on the parameters in your equations. One benefit of the numerical propogation is you do not have to remember the error propogation rules, and you directly look at the distribution in nonlinear cases. Some limitations of this approach include

1. You have to know the distribution of the errors in the parameters
2. You have to assume the errors in parameters are uncorrelated.

org-mode source

## Confidence interval on an average

| categories: statistics | tags:

nil has a statistical package available for getting statistical distributions. This is useful for computing confidence intervals using the student-t tables. Here is an example of computing a 95% confidence interval on an average.

import numpy as np
from scipy.stats.distributions import  t

n = 10 # number of measurements
dof = n - 1 # degrees of freedom
avg_x = 16.1 # average measurement
std_x = 0.01 # standard deviation of measurements

# Find 95% prediction interval for next measurement

alpha = 1.0 - 0.95

pred_interval = t.ppf(1-alpha/2.0, dof) * std_x / np.sqrt(n)

s = ['We are 95% confident the next measurement',
' will be between {0:1.3f} and {1:1.3f}']
print ''.join(s).format(avg_x - pred_interval, avg_x + pred_interval)

We are 95% confident the next measurement will be between 16.093 and 16.107