## Uncertainty in implicit functions

| categories: statistics | tags: | View Comments

Suppose we have an equation $$y = e^{a y}$$ that we want to solve, where $$a$$ is a constant with some uncertainty. What is the uncertainty in the solution $$y$$?

Finding a solution is not difficult. The uncertainty in the solution, however, is not easy, since we do not have an explicit function to propagate errors through. Let us examine the solution first.

import numpy as np
from scipy.optimize import fsolve

a = 0.20

def f(y):
return y - np.exp(a * y)

sol, = fsolve(f, 1)
print sol

1.2958555091


A way to estimate the uncertainty is by Monte Carlo simulation. We solve the equation many times, using values sampled from the uncertainty distribution. Here we assume that the $$a$$ parameter is normally distributed with an average of 0.2 and a std deviation of 0.02. We solve the equation 10000 times for different values of $$a$$ sampled according to the normal distribution. That gives us a distribution of solutions that we can do statistical analysis of to get the average and std deviation.

import numpy as np
from scipy.optimize import fsolve
N = 10000

A = np.random.normal(0.2, 0.02, size=N)

sol = np.zeros(A.shape)

for i, a in enumerate(A):
s, = fsolve(lambda y:y - np.exp(a * y), 1)
sol[i] = s

ybar = np.mean(sol)
s_y = np.std(sol)

print ybar, s_y, s_y / ybar

import matplotlib.pyplot as plt
count, bins, ignored = plt.hist(sol)
plt.savefig('images/implicit-uncertainty.png')

1.29887470397 0.0465110111613 0.0358086973433


We get approximately the same answer, and you can see here the distribution of solution values is not quite normal. We compute the standard deviation anyway, and find the standard deviation is about 3.6%. It would be nice to have some analytical method to estimate this uncertainty. So far I have not figured that out.

This method could have relevance in estimating the uncertainty in the friction factor for turbulent flow ($$Re > 2100$$). In that case we have the implicit equation $$\frac{1}{\sqrt{f_F}}=4.0 \log(Re \sqrt{f_F})-0.4$$. Uncertainties in the Re number would lead to uncertainties in the friction factor. Whether those uncertainties are larger than the uncertainties from the original correlation would require some investigation.

org-mode source

## Another approach to error propagation

| categories: statistics | tags: | View Comments

In the previous section we examined an analytical approach to error propagation, and a simulation based approach. There is another approach to error propagation, using the uncertainties module (https://pypi.python.org/pypi/uncertainties/). You have to install this package, e.g. pip install uncertainties. After that, the module provides new classes of numbers and functions that incorporate uncertainty and propagate the uncertainty through the functions. In the examples that follow, we repeat the calculations from the previous section using the uncertainties module.

import uncertainties as u

A = u.ufloat((2.5, 0.4))
B = u.ufloat((4.1, 0.3))
print A + B
print A - B

>>> >>> >>> 6.6+/-0.5
-1.6+/-0.5


Multiplication and division

F = u.ufloat((25, 1))
x = u.ufloat((6.4, 0.4))

t = F * x
print t

d = F / x
print d

>>> >>> >>> 160.0+/-11.8726576637
>>> >>> 3.90625+/-0.289859806243


Exponentiation

t = u.ufloat((2.03, 0.0203))
print t**5

from uncertainties.umath import sqrt
A = u.ufloat((16.07, 0.06))
print sqrt(A)
# print np.sqrt(A) # this does not work

from uncertainties import unumpy as unp
print unp.sqrt(A)

34.4730881243+/-1.72365440621
>>> >>> >>> >>> 4.00874045057+/-0.00748364738749
... >>> >>> 4.00874045057+/-0.00748364738749


Note in the last example, we had to either import a function from uncertainties.umath or import a special version of numpy that handles uncertainty. This may be a limitation of teh uncertainties package as not all functions in arbitrary modules can be covered. Note, however, that you can wrap a function to make it handle uncertainty like this.

import numpy as np

wrapped_sqrt = u.wrap(np.sqrt)
print wrapped_sqrt(A)

>>> >>> 4.00874045057+/-0.00748364738774


Propagation of errors in an integral

import numpy as np
import uncertainties as u

x = np.array([u.ufloat((1, 0.01)),
u.ufloat((2, 0.1)),
u.ufloat((3, 0.1))])

y = 2 * x

print np.trapz(x, y)

>>> >>> ... ... >>> >>> >>> >>> 8.0+/-0.600333240792


Chain rule in error propagation

v0 = u.ufloat((1.2, 0.02))
a = u.ufloat((3.0, 0.3))
t = u.ufloat((12.0, 0.12))

v = v0 + a * t
print v

>>> >>> >>> >>> 37.2+/-3.61801050303


A real example? This is what I would setup for a real working example. We try to compute the exit concentration from a CSTR. The idea is to wrap the “external” fsolve function using the uncertainties.wrap function, which handles the units. Unfortunately, it does not work, and it is not clear why. But see the following discussion for a fix.

from scipy.optimize import fsolve

Fa0 = u.ufloat((5.0, 0.05))
v0 = u.ufloat((10., 0.1))

V = u.ufloat((66000.0, 100))  # reactor volume L^3
k = u.ufloat((3.0, 0.2))      # rate constant L/mol/h

def func(Ca):
"Mole balance for a CSTR. Solve this equation for func(Ca)=0"
Fa = v0 * Ca     # exit molar flow of A
ra = -k * Ca**2  # rate of reaction of A L/mol/h
return Fa0 - Fa + V * ra

# CA guess that that 90 % is reacted away
CA_guess = 0.1 * Fa0 / v0

wrapped_fsolve = u.wrap(fsolve)
CA_sol = wrapped_fsolve(func, CA_guess)

print 'The exit concentration is {0} mol/L'.format(CA_sol)

>>> >>> >>> >>> >>> >>> >>> ... ... ... ... ... >>> ... >>> >>> >>> <function fsolve at 0x148f25f0>
>>> >>> The exit concentration is NotImplemented mol/L


I got a note from the author of the uncertainties package explaining the cryptic error above, and a solution for it. The error arises because fsolve does not know how to deal with uncertainties. The idea is to create a function that returns a float, when everything is given as a float. Then, we wrap the fsolve call, and finally wrap the wrapped fsolve call!

• Step 1. Write the function to solve with arguments for all unitted quantities. This function may be called with uncertainties, or with floats.
• Step 2. Wrap the call to fsolve in a function that takes all the parameters as arguments, and that returns the solution.
• Step 3. Use uncertainties.wrap to wrap the function in Step 2 to get the answer with uncertainties.

Here is the code that does work:

import uncertainties as u
from scipy.optimize import fsolve

Fa0 = u.ufloat((5.0, 0.05))
v0 = u.ufloat((10., 0.1))

V = u.ufloat((66000.0, 100.0))  # reactor volume L^3
k = u.ufloat((3.0, 0.2))      # rate constant L/mol/h

# Step 1
def func(Ca, v0, k, Fa0, V):
"Mole balance for a CSTR. Solve this equation for func(Ca)=0"
Fa = v0 * Ca     # exit molar flow of A
ra = -k * Ca**2  # rate of reaction of A L/mol/h
return Fa0 - Fa + V * ra

# Step 2
def Ca_solve(v0, k, Fa0, V):
'wrap fsolve to pass parameters as float or units'
# this line is a little fragile. You must put  at the end or
# you get the NotImplemented result
sol = fsolve(func, 0.1 * Fa0 / v0, args=(v0, k, Fa0, V))
return sol

# Step 3
print u.wrap(Ca_solve)(v0, k, Fa0, V)

0.005+/-0.000167764327667


It would take some practice to get used to this, but the payoff is that you have an “automatic” error propagation method.

Being ever the skeptic, let us compare the result above to the Monte Carlo approach to error estimation below.

import numpy as np
from scipy.optimize import fsolve

N = 10000
Fa0 = np.random.normal(5, 0.05, (1, N))
v0 = np.random.normal(10.0, 0.1, (1, N))
V =  np.random.normal(66000, 100, (1,N))
k = np.random.normal(3.0, 0.2, (1, N))

SOL = np.zeros((1, N))

for i in range(N):
def func(Ca):
return Fa0[0,i] - v0[0,i] * Ca + V[0,i] * (-k[0,i] * Ca**2)
SOL[0,i] = fsolve(func, 0.1 * Fa0[0,i] / v0[0,i])

print 'Ca(exit) = {0}+/-{1}'.format(np.mean(SOL), np.std(SOL))

Ca(exit) = 0.00500829453185+/-0.000169103578901


I am pretty content those are the same!

## 1 Summary

The uncertainties module is pretty amazing. It automatically propagates errors through a pretty broad range of computations. It is a little tricky for third-party packages, but it seems doable.

org-mode source

## Are averages different

| categories: | tags: | View Comments

Class A had 30 students who received an average test score of 78, with standard deviation of 10. Class B had 25 students an average test score of 85, with a standard deviation of 15. We want to know if the difference in these averages is statistically relevant. Note that we only have estimates of the true average and standard deviation for each class, and there is uncertainty in those estimates. As a result, we are unsure if the averages are really different. It could have just been luck that a few students in class B did better.

The hypothesis:

the true averages are the same. We need to perform a two-sample t-test of the hypothesis that $$\mu_1 - \mu_2 = 0$$ (this is often called the null hypothesis). we use a two-tailed test because we do not care if the difference is positive or negative, either way means the averages are not the same.

import numpy as np

n1 = 30  # students in class A
x1 = 78.0  # average grade in class A
s1 = 10.0  # std dev of exam grade in class A

n2 = 25  # students in class B
x2 = 85.0  # average grade in class B
s2 = 15.0  # std dev of exam grade in class B

# the standard error of the difference between the two averages.
SE = np.sqrt(s1**2 / n1 + s2**2 / n2)

# compute DOF
DF = (n1 - 1) + (n2 - 1)


see the discussion at http://stattrek.com/Help/Glossary.aspx?Target=Two-sample%20t-test for a more complex definition of degrees of freedom. Here we simply subtract one from each sample size to account for the estimation of the average of each sample.

compute the t-score for our data

The difference between two averages determined from small sample numbers follows the t-distribution. the t-score is the difference between the difference of the means and the hypothesized difference of the means, normalized by the standard error. we compute the absolute value of the t-score to make sure it is positive for convenience later.

tscore = np.abs(((x1 - x2) - 0) / SE)
print tscore

1.99323179108


Interpretation

A way to approach determinining if the difference is significant or not is to ask, does our computed average fall within a confidence range of the hypothesized value (zero)? If it does, then we can attribute the difference to statistical variations at that confidence level. If it does not, we can say that statistical variations do not account for the difference at that confidence level, and hence the averages must be different.

Let us compute the t-value that corresponds to a 95% confidence level for a mean of zero with the degrees of freedom computed earlier. This means that 95% of the t-scores we expect to get will fall within $$\pm$$ t95.

from scipy.stats.distributions import  t

ci = 0.95;
alpha = 1 - ci;
t95 = t.ppf(1.0 - alpha/2.0, DF)

print t95

>>> >>> >>> >>> >>> 2.00574599354


since tscore < t95, we conclude that at the 95% confidence level we cannot say these averages are statistically different because our computed t-score falls in the expected range of deviations. Note that our t-score is very close to the 95% limit. Let us consider a smaller confidence interval.

ci = 0.94
alpha = 1 - ci;
t95 = t.ppf(1.0 - alpha/2.0, DF)

print t95

>>> >>> >>> 1.92191364181


at the 94% confidence level, however, tscore > t94, which means we can say with 94% confidence that the two averages are different; class B performed better than class A did. Alternatively, there is only about a 6% chance we are wrong about that statement. another way to get there

An alternative way to get the confidence that the averages are different is to directly compute it from the cumulative t-distribution function. We compute the difference between all the t-values less than tscore and the t-values less than -tscore, which is the fraction of measurements that are between them. You can see here that we are practically 95% sure that the averages are different.

f = t.cdf(tscore, DF) - t.cdf(-tscore, DF)
print f

0.948605075732


org-mode source

## Basic statistics

| categories: statistics | tags: | View Comments

Given several measurements of a single quantity, determine the average value of the measurements, the standard deviation of the measurements and the 95% confidence interval for the average.

This is a recipe for computing the confidence interval. The strategy is:

1. compute the average
2. Compute the standard deviation of your data
3. Define the confidence interval, e.g. 95% = 0.95
4. compute the student-t multiplier. This is a function of the confidence

interval you specify, and the number of data points you have minus 1. You subtract 1 because one degree of freedom is lost from calculating the average. The confidence interval is defined as ybar +- T_multiplier*std/sqrt(n).

import numpy as np
from scipy.stats.distributions import  t

y = [8.1, 8.0, 8.1]

ybar = np.mean(y)
s = np.std(y)

ci = 0.95
alpha = 1.0 - ci

n = len(y)
T_multiplier = t.ppf(1-alpha/2.0, n-1)

ci95 = T_multiplier * s / np.sqrt(n-1)

print [ybar - ci95, ybar + ci95]

[7.9232449090029595, 8.210088424330376]


We are 95% certain the next measurement will fall in the interval above.

org-mode source

## Introduction to statistical data analysis

| categories: statistics | tags: | View Comments

Given several measurements of a single quantity, determine the average value of the measurements, the standard deviation of the measurements and the 95% confidence interval for the average.

import numpy as np

y = [8.1, 8.0, 8.1]

ybar = np.mean(y)
s = np.std(y, ddof=1)

print ybar, s

>>> >>> >>> >>> >>> >>> 8.06666666667 0.057735026919


Interesting, we have to specify the divisor in numpy.std by the ddof argument. The default for this in Matlab is 1, the default for this function is 0.

Here is the principle of computing a confidence interval.

1. compute the average
2. Compute the standard deviation of your data
3. Define the confidence interval, e.g. 95% = 0.95
4. compute the student-t multiplier. This is a function of the confidence interval you specify, and the number of data points you have minus 1. You subtract 1 because one degree of freedom is lost from calculating the average.

The confidence interval is defined as ybar +- T_multiplier*std/sqrt(n).

from scipy.stats.distributions import  t
ci = 0.95
alpha = 1.0 - ci

n = len(y)
T_multiplier = t.ppf(1.0 - alpha / 2.0, n - 1)

ci95 = T_multiplier * s / np.sqrt(n)

print 'T_multiplier = {0}'.format(T_multiplier)
print 'ci95 = {0}'.format(ci95)
print 'The true average is between {0} and {1} at a 95% confidence level'.format(ybar - ci95, ybar + ci95)

>>> >>> >>> >>> >>> >>> >>> >>> T_multiplier = 4.30265272991
ci95 = 0.143421757664
The true average is between 7.923244909 and 8.21008842433 at a 95% confidence level