Are averages different

| categories: data analysis, statistics | tags:

Matlab post

Adapted from http://stattrek.com/ap-statistics-4/unpaired-means.aspx

Class A had 30 students who received an average test score of 78, with a standard deviation of 10. Class B had 25 students with an average test score of 85, and a standard deviation of 15. We want to know if the difference between these averages is statistically significant. Note that we only have estimates of the true average and standard deviation for each class, and there is uncertainty in those estimates. As a result, we are unsure if the averages are really different. It could have just been luck that a few students in class B did better.

The hypothesis:

The true averages are the same. We need to perform a two-sample t-test of the hypothesis that \(\mu_1 - \mu_2 = 0\) (this is often called the null hypothesis). We use a two-tailed test because we do not care whether the difference is positive or negative; either way, it means the averages are not the same.

import numpy as np

n1 = 30  # students in class A
x1 = 78.0  # average grade in class A
s1 = 10.0  # std dev of exam grade in class A

n2 = 25  # students in class B
x2 = 85.0  # average grade in class B
s2 = 15.0  # std dev of exam grade in class B

# the standard error of the difference between the two averages. 
SE = np.sqrt(s1**2 / n1 + s2**2 / n2)

# compute DOF
DF = (n1 - 1) + (n2 - 1)

See the discussion at http://stattrek.com/Help/Glossary.aspx?Target=Two-sample%20t-test for a more rigorous definition of the degrees of freedom. Here we simply subtract one from each sample size to account for estimating the average of each sample.
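For reference, the more rigorous definition referenced there is the Welch-Satterthwaite approximation, which does not assume equal variances in the two classes. Here is a minimal sketch of that calculation with the variables defined above; it is not used in the rest of this post.

# Welch-Satterthwaite approximation to the degrees of freedom
num = (s1**2 / n1 + s2**2 / n2)**2
den = (s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1)
DF_welch = num / den
print DF_welch  # about 40.5 here, compared to DF = 53 above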

Compute the t-score for our data

The difference between two averages determined from small sample sizes follows the t-distribution. The t-score is the difference between the difference of the means and the hypothesized difference of the means, normalized by the standard error. We compute the absolute value of the t-score to make sure it is positive, for convenience later.

tscore = np.abs(((x1 - x2) - 0) / SE)
print tscore
1.99323179108

Interpretation

One way to approach determining whether the difference is significant is to ask: does our computed average fall within a confidence range of the hypothesized value (zero)? If it does, then we can attribute the difference to statistical variations at that confidence level. If it does not, we can say that statistical variations do not account for the difference at that confidence level, and hence the averages must be different.

Let us compute the t-value that corresponds to a 95% confidence level for a mean of zero with the degrees of freedom computed earlier. This means that 95% of the t-scores we expect to get will fall within \(\pm\) t95.

from scipy.stats.distributions import t

ci = 0.95
alpha = 1 - ci
t95 = t.ppf(1.0 - alpha/2.0, DF)

print t95
2.00574599354

Since tscore < t95, we conclude that at the 95% confidence level we cannot say these averages are statistically different, because our computed t-score falls within the expected range of deviations. Note that our t-score is very close to the 95% limit. Let us consider a lower confidence level.

ci = 0.94
alpha = 1 - ci
t94 = t.ppf(1.0 - alpha/2.0, DF)

print t94
1.92191364181

At the 94% confidence level, however, tscore > t94, which means we can say with 94% confidence that the two averages are different; class B performed better than class A did. Alternatively, there is only about a 6% chance we are wrong about that statement.

Another way to get there

An alternative way to get the confidence that the averages are different is to compute it directly from the cumulative t-distribution function. We compute the difference between the fraction of t-values less than tscore and the fraction less than -tscore, which is the fraction of the distribution that lies between them. You can see here that we are practically 95% sure that the averages are different.

f = t.cdf(tscore, DF) - t.cdf(-tscore, DF)
print f
0.948605075732
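As a cross-check (not part of the original post), newer versions of scipy (0.16 and later) can run this two-sample test directly from the summary statistics with scipy.stats.ttest_ind_from_stats. A minimal sketch:

from scipy import stats

# Welch's t-test from the summary statistics. The magnitude of the
# statistic matches the t-score computed above, but scipy uses the
# Welch-Satterthwaite degrees of freedom rather than (n1 - 1) + (n2 - 1),
# so the p-value differs slightly from 1 - f.
t_stat, p_value = stats.ttest_ind_from_stats(x1, s1, n1, x2, s2, n2,
                                             equal_var=False)
print t_stat, p_value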

Copyright (C) 2013 by John Kitchin. See the License for information about copying.

org-mode source

Nonlinear curve fitting with confidence intervals

| categories: data analysis | tags:

Our goal is to fit the equation \(y = c_0 e^{-x} + c_1 x\) to data and compute the confidence intervals on the parameters.

This could actually be solved as a linear regression problem, but it is convenient to illustrate the use of the nonlinear fitting routine here because it makes it easy to get confidence intervals for comparison. The basic idea is to use the covariance matrix returned by the nonlinear fitting routine to estimate the student-t corrected confidence interval.

# Nonlinear curve fit with confidence interval
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats.distributions import t

x = np.array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ])
y = np.array([ 4.70192769,  4.46826356,  4.57021389,  4.29240134,  3.88155125,
               3.78382253,  3.65454727,  3.86379487,  4.16428541,  4.06079909])

# this is the function we want to fit to our data
def func(x, c0, c1):
    return c0 * np.exp(-x) + c1 * x

pars, pcov = curve_fit(func, x, y, p0=[4.96, 2.11])

alpha = 0.05 # 95% confidence interval

n = len(y)    # number of data points
p = len(pars) # number of parameters

dof = max(0, n-p) # number of degrees of freedom

tval = t.ppf(1.0 - alpha / 2.0, dof) # student-t value for the dof and confidence level

for i, par, var in zip(range(p), pars, np.diag(pcov)):
    sigma = var**0.5
    print 'c{0}: {1} [{2}  {3}]'.format(i, par,
                                        par - sigma*tval,
                                        par + sigma*tval)

import matplotlib.pyplot as plt
plt.plot(x, y, 'bo')
xfit = np.linspace(0,1)
yfit = func(xfit, pars[0], pars[1])
plt.plot(xfit,yfit,'b-')
plt.legend(['data','fit'],loc='best')
plt.savefig('images/nonlin-fit-ci.png')
c0: 4.96713966439 [4.62674476567  5.30753456311]
c1: 2.10995112628 [1.76711622427  2.45278602828]
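Because the model \(y = c_0 e^{-x} + c_1 x\) is linear in the parameters, we can also solve for them by ordinary linear least squares and compare. A minimal sketch using the x and y arrays defined above:

# the model is linear in c0 and c1, so linear least squares also works
A = np.column_stack([np.exp(-x), x])
c, residuals, rank, sv = np.linalg.lstsq(A, y)
print c  # should match pars from curve_fit

The two approaches should agree to within numerical precision; the convenience of curve_fit here is the covariance matrix it returns for the confidence intervals.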

Copyright (C) 2013 by John Kitchin. See the License for information about copying.

org-mode source

Nonlinear curve fitting

| categories: data analysis, nonlinear regression | tags:

Here is a typical nonlinear function fit to data. You need to provide an initial guess. In this example we fit the Murnaghan equation of state to energy vs. volume data from density functional theory calculations.

from scipy.optimize import leastsq
import numpy as np

vols = np.array([13.71, 14.82, 16.0, 17.23, 18.52])

energies = np.array([-56.29, -56.41, -56.46, -56.463, -56.41])

def Murnaghan(parameters, vol):
    'From Phys. Rev. B 28, 5480 (1983)'
    E0, B0, BP, V0 = parameters

    E = E0 + B0 * vol / BP * (((V0 / vol)**BP) / (BP - 1) + 1) - V0 * B0 / (BP - 1.0)

    return E

def objective(pars, y, x):
    #we will minimize this function
    err =  y - Murnaghan(pars, x)
    return err

x0 = [ -56.0, 0.54, 2.0, 16.5] #initial guess of parameters

plsq = leastsq(objective, x0, args=(energies, vols))

print 'Fitted parameters = {0}'.format(plsq[0])

import matplotlib.pyplot as plt
plt.plot(vols,energies, 'ro')

#plot the fitted curve on top
x = np.linspace(min(vols), max(vols), 50)
y = Murnaghan(plsq[0], x)
plt.plot(x, y, 'k-')
plt.xlabel('Volume')
plt.ylabel('Energy')
plt.savefig('images/nonlinear-curve-fitting.png')
Fitted parameters = [-56.46839641   0.57233217   2.7407944   16.55905648]
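The fitted parameters carry the units of the input data. Assuming the energies are in eV and the volumes in Å³, as is typical for DFT data (a related post below labels its axes this way), we can convert the bulk modulus B0 to GPa with the factor 1 eV/Å³ ≈ 160.2 GPa. A quick sketch:

# convert B0 from eV/A^3 to GPa, assuming eV and A^3 input units
E0, B0, BP, V0 = plsq[0]
print 'B0 = {0:.1f} GPa'.format(B0 * 160.21766)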

See additional examples at http://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html.

Copyright (C) 2013 by John Kitchin. See the License for information about copying.

org-mode source

The Gibbs free energy of a reacting mixture and the equilibrium composition

| categories: optimization | tags: thermodynamics, reaction engineering

Matlab post

In this post we derive the equations needed to find the equilibrium composition of a reacting mixture. We use the method of direct minimization of the Gibbs free energy of the reacting mixture.

The Gibbs free energy of a mixture is defined as \(G = \sum\limits_j \mu_j n_j\), where \(\mu_j\) is the chemical potential of species \(j\) (which is temperature and pressure dependent) and \(n_j\) is the number of moles of species \(j\).

We define the chemical potential as \(\mu_j = G_j^\circ + RT\ln a_j\), where \(G_j^\circ\) is the Gibbs energy in a standard state, and \(a_j\) is the activity of species \(j\) if the pressure and temperature are not at standard state conditions.

If a reaction is occurring, then the number of moles of each species are related to each other through the reaction extent \(\epsilon\) and stoichiometric coefficients: \(n_j = n_{j0} + \nu_j \epsilon\). Note that the reaction extent has units of moles.

Combining these three equations, \(G = \sum\limits_j (n_{j0} + \nu_j \epsilon)(G_j^\circ + RT \ln a_j)\), and expanding the terms leads to:

$$G = \sum\limits_j n_{j0}G_j^\circ +\sum\limits_j \nu_j G_j^\circ \epsilon +RT\sum\limits_j(n_{j0} + \nu_j\epsilon)\ln a_j $$

The first term is simply the initial Gibbs free energy that is present before any reaction begins, and it is a constant. It is difficult to evaluate, so we will move it to the left side of the equation in the next step; since it is a constant, its actual value does not matter. The second term is related to the Gibbs free energy of reaction: \(\Delta_rG = \sum\limits_j \nu_j G_j^\circ\). With these observations we rewrite the equation as:

$$G - \sum\limits_j n_{j0}G_j^\circ = \Delta_rG \epsilon +RT\sum\limits_j(n_{j0} + \nu_j\epsilon)\ln a_j $$

Now, we have an equation that allows us to compute the change in Gibbs free energy as a function of the reaction extent, initial number of moles of each species, and the activities of each species. This difference in Gibbs free energy has no natural scale, and depends on the size of the system, i.e. on \(n_{j0}\). It is desirable to avoid this, so we now rescale the equation by the total initial moles present, \(n_{T0}\) and define a new variable \(\epsilon' = \epsilon/n_{T0}\), which is dimensionless. This leads to:

$$ \frac{G - \sum\limits_j n_{j0}G_j^\circ}{n_{T0}} = \Delta_rG \epsilon' + RT \sum\limits_j(y_{j0} + \nu_j\epsilon')\ln a_j $$

where \(y_{j0}\) is the initial mole fraction of species \(j\) present. The mole fractions are intensive properties that do not depend on the system size. Finally, we need to address \(a_j\). For an ideal gas, we know that \(a_j = \frac{y_j P}{P^\circ}\), where the numerator is the partial pressure of species \(j\), computed from the mole fraction of species \(j\) times the total pressure. To get the mole fraction we note:

$$y_j = \frac{n_j}{n_T} = \frac{n_{j0} + \nu_j \epsilon}{n_{T0} + \epsilon \sum\limits_j \nu_j} = \frac{y_{j0} + \nu_j \epsilon'}{1 + \epsilon'\sum\limits_j \nu_j} $$

This finally leads us to an equation that we can evaluate as a function of reaction extent:

$$ \frac{G - \sum\limits_j n_{j0}G_j^\circ}{n_{T0}} = \widetilde{\widetilde{G}} = \Delta_rG \epsilon' + RT\sum\limits_j(y_{j0} + \nu_j\epsilon') \ln\left(\frac{y_{j0}+\nu_j\epsilon'}{1+\epsilon'\sum\limits_j\nu_j} \frac{P}{P^\circ}\right) $$

We use a double tilde notation to distinguish this quantity from the quantity derived by Rawlings and Ekerdt, which is further normalized by a factor of \(RT\). That additional scaling makes the quantity dimensionless and of order unity, but otherwise has no effect on the shape of the graph.

Finally, if we know the initial mole fractions, the initial total pressure, the Gibbs energy of reaction, and the stoichiometric coefficients, we can plot the scaled reacting mixture energy as a function of reaction extent. At equilibrium, this energy will be a minimum. We consider the example in Rawlings and Ekerdt where isobutane (I) reacts with 1-butene (B) to form 2,2,3-trimethylpentane (P). The reaction occurs at a total pressure of 2.5 atm at 400K, with equal molar amounts of I and B. The standard Gibbs free energy of reaction at 400K is -3.72 kcal/mol. Compute the equilibrium composition.

import numpy as np

R = 8.314
P = 250000  # Pa
P0 = 100000 # Pa, approximately 1 atm
T = 400 # K

Grxn = -15564.0 #J/mol
yi0, yb0, yp0 = 0.5, 0.5, 0.0  # initial mole fractions

yj0 = np.array([yi0, yb0, yp0])
nu_j = np.array([-1.0, -1.0, 1.0])   # stoichiometric coefficients

def Gwigglewiggle(extentp):
    diffg = Grxn * extentp
    sum_nu_j = np.sum(nu_j)
    for y0, nu in zip(yj0, nu_j):
        x1 = y0 + nu * extentp
        x2 = x1 / (1.0 + extentp * sum_nu_j)
        diffg += R * T * x1 * np.log(x2 * P / P0)
    return diffg

There are bounds on how large \(\epsilon'\) can be. Recall that \(n_j = n_{j0} + \nu_j \epsilon\), and that \(n_j \ge 0\). Thus, \(\epsilon_{max} = -n_{j0}/\nu_j\), and the maximum value that \(\epsilon'\) can have is therefore \(-y_{j0}/\nu_j\) where \(y_{j0}>0\). When there are multiple species, you need the smallest \(\epsilon'_{max}\) to avoid getting negative mole numbers.

epsilonp_max = min(-yj0[yj0 > 0] / nu_j[yj0 > 0])
epsilonp = np.linspace(1e-6, epsilonp_max, 1000)

import matplotlib.pyplot as plt

plt.plot(epsilonp, Gwigglewiggle(epsilonp))
plt.xlabel("$\epsilon'$")
plt.ylabel('Gwigglewiggle')
plt.savefig('images/gibbs-minim-1.png')
__main__:7: RuntimeWarning: divide by zero encountered in log
__main__:7: RuntimeWarning: invalid value encountered in multiply

Now we simply minimize our Gwigglewiggle function. Based on the figure above, the minimum is near 0.45. (The runtime warnings above come from evaluating the logarithm at the end of the extent range, where the reactant mole fractions go to zero; they do not affect the rest of the curve.)

from scipy.optimize import fminbound

epsilonp_eq = fminbound(Gwigglewiggle, 0.4, 0.5)
print epsilonp_eq

plt.plot([epsilonp_eq], [Gwigglewiggle(epsilonp_eq)], 'ro')
plt.savefig('images/gibbs-minim-2.png')
0.46959618249
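Before moving on, it is worth checking (this check is not in the original post) that the derivative of \(\widetilde{\widetilde{G}}\) with respect to \(\epsilon'\) is indeed approximately zero at the computed minimum. A central difference estimate:

# numerical derivative of Gwigglewiggle at the computed minimum;
# it should be near zero compared to the scale of the function
h = 1e-6
dG = (Gwigglewiggle(epsilonp_eq + h) - Gwigglewiggle(epsilonp_eq - h)) / (2 * h)
print dG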

To compute equilibrium mole fractions we do this:

yi = (yi0 + nu_j[0]*epsilonp_eq) / (1.0 + epsilonp_eq*np.sum(nu_j))
yb = (yb0 + nu_j[1]*epsilonp_eq) / (1.0 + epsilonp_eq*np.sum(nu_j))
yp = (yp0 + nu_j[2]*epsilonp_eq) / (1.0 + epsilonp_eq*np.sum(nu_j))

print yi, yb, yp

# or this
y_j = (yj0 + nu_j * epsilonp_eq) / (1.0 + epsilonp_eq*np.sum(nu_j))
print y_j
0.0573220186324 0.0573220186324 0.885355962735
[ 0.05732202  0.05732202  0.88535596]

We can check this result against the equilibrium constant. For this reaction, \(K = \frac{a_P}{a_I a_B} = \frac{y_P P/P^\circ}{(y_I P/P^\circ)(y_B P/P^\circ)} = \frac{y_P}{y_I y_B}\frac{P^\circ}{P}\).

We can also express the equilibrium constant as \(K = \prod\limits_j a_j^{\nu_j}\), and compute it with a single line of code.

K = np.exp(-Grxn/R/T)
print 'K from delta G ',K
print 'K as ratio of mole fractions ',yp / (yi * yb) * P0 / P
print 'compact notation: ',np.prod((y_j * P / P0)**nu_j)
K from delta G  107.776294742
K as ratio of mole fractions  107.779200065
compact notation:  107.779200065

These results are very close, and only disagree because of the default tolerance used in locating the minimum of our function. You could tighten the tolerance by passing the xtol argument to fminbound.
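For example, here is a sketch of the same minimization with a tighter tolerance (xtol is the tolerance argument of fminbound):

# redo the minimization with a tighter tolerance on epsilon'
epsilonp_eq = fminbound(Gwigglewiggle, 0.4, 0.5, xtol=1e-12)
y_j = (yj0 + nu_j * epsilonp_eq) / (1.0 + epsilonp_eq * np.sum(nu_j))
print 'K at tighter tolerance: ', np.prod((y_j * P / P0)**nu_j)

This should bring the mole-fraction based value of K closer to the value computed from \(\Delta_r G\).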

Summary

In this post we derived an equation for the Gibbs free energy of a reacting mixture and used it to find the equilibrium composition. In future posts we will examine some alternate forms of the equations that may be more useful in some circumstances.

Copyright (C) 2014 by John Kitchin. See the License for information about copying.

org-mode source

Nonlinear curve fitting by direct least squares minimization

| categories: data analysis | tags:

Here is an example of fitting a nonlinear function to data by direct minimization of the summed squared error.

from scipy.optimize import fmin
import numpy as np

volumes = np.array([13.71, 14.82, 16.0, 17.23, 18.52])

energies = np.array([-56.29, -56.41, -56.46, -56.463,-56.41])

def Murnaghan(parameters, vol):
    'From Phys. Rev. B 28, 5480 (1983)'
    E0 = parameters[0]
    B0 = parameters[1]
    BP = parameters[2]
    V0 = parameters[3]

    E = E0 + B0*vol/BP*(((V0/vol)**BP)/(BP-1)+1) - V0*B0/(BP-1.)

    return E

def objective(pars,vol):
    #we will minimize this function
    err =  energies - Murnaghan(pars,vol)
    return np.sum(err**2) #we return the summed squared error directly

x0 = [ -56., 0.54, 2., 16.5] #initial guess of parameters

plsq = fmin(objective,x0,args=(volumes,)) #note args is a tuple

print 'parameters = {0}'.format(plsq)

import matplotlib.pyplot as plt
plt.plot(volumes,energies,'ro')

#plot the fitted curve on top
x = np.linspace(min(volumes),max(volumes),50)
y = Murnaghan(plsq,x)
plt.plot(x,y,'k-')
plt.xlabel('Volume ($\AA^3$)')
plt.ylabel('Total energy (eV)')
plt.savefig('images/nonlinear-fitting-lsq.png')
Optimization terminated successfully.
         Current function value: 0.000020
         Iterations: 137
         Function evaluations: 240
parameters = [-56.46932645   0.59141447   1.9044796   16.59341303]
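In newer versions of scipy (0.11 and later), scipy.optimize.minimize offers a unified interface to this kind of minimization; fmin is the Nelder-Mead simplex method under the hood. A minimal sketch of the same fit through that interface:

from scipy.optimize import minimize

# same objective and initial guess, via the unified interface
res = minimize(objective, x0, args=(volumes,), method='Nelder-Mead')
print res.x  # should agree with the parameters from fmin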

Copyright (C) 2013 by John Kitchin. See the License for information about copying.

org-mode source
