Peak finding in Raman spectroscopy

| categories: data analysis | tags: |

Raman spectroscopy is a vibrational spectroscopy. The data typically comes as intensity vs. wavenumber, and it is discrete. Sometimes it is necessary to identify the precise location of a peak. In this post, we will use spline smoothing to construct a smooth function approximating the data, and then locate the peak positions from the roots of its first derivative.

This example was originally worked out in Matlab at http://matlab.cheme.cmu.edu/2012/08/27/peak-finding-in-raman-spectroscopy/

Let us take a look at the raw data, which we read in with numpy.loadtxt.

import numpy as np
import matplotlib.pyplot as plt

w, i = np.loadtxt('data/raman.txt', usecols=(0, 1), unpack=True)

plt.plot(w, i)
plt.xlabel('Raman shift (cm$^{-1}$)')
plt.ylabel('Intensity (counts)')
plt.savefig('images/raman-1.png')
plt.show()

The next thing to do is narrow our focus to the region we are interested in, between 1340 cm^{-1} and 1360 cm^{-1}.

ind = (w > 1340) & (w < 1360)
w1 = w[ind]
i1 = i[ind]

plt.plot(w1, i1, 'b. ')
plt.xlabel('Raman shift (cm$^{-1}$)')
plt.ylabel('Intensity (counts)')
plt.savefig('images/raman-2.png')
plt.show()

Next we consider a scipy.interpolate.UnivariateSpline. This function "smooths" the data.

from scipy.interpolate import UnivariateSpline

# s is a "smoothing" factor
sp = UnivariateSpline(w1, i1, k=4, s=2000)

plt.plot(w1, i1, 'b. ')
plt.plot(w1, sp(w1), 'r-')
plt.xlabel('Raman shift (cm$^{-1}$)')
plt.ylabel('Intensity (counts)')
plt.savefig('images/raman-3.png')
plt.show()

Note that the UnivariateSpline function returns a "callable" function! Our next goal is to find the places where there are peaks, which are defined by the first derivative of the data being equal to zero. It is easy to get the first derivative of a UnivariateSpline with its derivative method, as shown below.
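As a brief aside, the same derivative values can also be obtained by passing the derivative order as a second argument when calling the spline. A minimal check (not part of the original post) that the two approaches agree:

# nu=1 evaluates the first derivative of the spline directly
print(np.allclose(sp(w1, 1), sp.derivative()(w1)))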

# get the first derivative evaluated at all the points
d1s = sp.derivative()

d1 = d1s(w1)

# we can get the roots of the derivative directly here, which correspond
# to minima and maxima.
minmax = d1s.roots()
print('Roots = {}'.format(minmax))

plt.clf()
plt.plot(w1, d1, label='first derivative')
plt.xlabel('Raman shift (cm$^{-1}$)')
plt.ylabel('First derivative')
plt.grid()

plt.plot(minmax, d1s(minmax), 'ro ', label='zeros')
plt.legend(loc='best')

plt.plot(w1, i1, 'b. ')
plt.plot(w1, sp(w1), 'r-')
plt.xlabel('Raman shift (cm$^{-1}$)')
plt.ylabel('Intensity (counts)')
plt.plot(minmax, sp(minmax), 'ro ')

plt.savefig('images/raman-4.png')
Roots = [ 1346.4623087   1347.42700893  1348.16689639]

In the end, we have illustrated how to construct a smoothing spline through the data and how to find its maxima from the roots of the first derivative (these roots would also serve as good initial guesses if further refinement were needed). There is more art to this than you might like, since you have to judge how much smoothing is enough or too much. With too much, you may smooth peaks out. With too little, noise may be mistaken for peaks.
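One way to develop that judgment is simply to look at a few smoothing factors side by side. Here is a minimal sketch; the s values are arbitrary choices for illustration:

plt.figure()
plt.plot(w1, i1, 'b.', label='data')
for s in [500, 2000, 10000]:
    # each s gives a different amount of smoothing; very large values flatten the peak
    spi = UnivariateSpline(w1, i1, k=4, s=s)
    plt.plot(w1, spi(w1), label='s = {}'.format(s))
plt.xlabel('Raman shift (cm$^{-1}$)')
plt.ylabel('Intensity (counts)')
plt.legend(loc='best')
plt.show()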

Summary notes

Using org-mode with :session allows a large script to be broken up into mini sections. However, it only seems to work with the default python mode in Emacs, and it does not work with emacs-for-python or the latest python-mode. I also do not really like the output style, e.g. the output from the plotting commands.

Copyright (C) 2014 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 8.2.7c


Are averages different

| categories: data analysis, statistics | tags: |

Matlab post

Adapted from http://stattrek.com/ap-statistics-4/unpaired-means.aspx

Class A had 30 students who received an average test score of 78, with a standard deviation of 10. Class B had 25 students who received an average test score of 85, with a standard deviation of 15. We want to know if the difference in these averages is statistically significant. Note that we only have estimates of the true average and standard deviation for each class, and there is uncertainty in those estimates. As a result, we are unsure if the averages are really different. It could have just been luck that a few students in class B did better.

The hypothesis:

The true averages are the same. We need to perform a two-sample t-test of the hypothesis that \(\mu_1 - \mu_2 = 0\) (this is often called the null hypothesis). We use a two-tailed test because we do not care if the difference is positive or negative; either way means the averages are not the same.

import numpy as np

n1 = 30  # students in class A
x1 = 78.0  # average grade in class A
s1 = 10.0  # std dev of exam grade in class A

n2 = 25  # students in class B
x2 = 85.0  # average grade in class B
s2 = 15.0  # std dev of exam grade in class B

# the standard error of the difference between the two averages. 
SE = np.sqrt(s1**2 / n1 + s2**2 / n2)

# compute DOF
DF = (n1 - 1) + (n2 - 1)

See the discussion at http://stattrek.com/Help/Glossary.aspx?Target=Two-sample%20t-test for a more complex definition of degrees of freedom. Here we simply subtract one from each sample size to account for the estimation of the average of each sample.
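For reference, the more complex definition is the Welch-Satterthwaite approximation, which does not assume equal variances. A sketch of that calculation with the numbers above (it is not used in the rest of this post):

# Welch-Satterthwaite approximation to the degrees of freedom
DF_welch = (s1**2 / n1 + s2**2 / n2)**2 / ((s1**2 / n1)**2 / (n1 - 1)
                                           + (s2**2 / n2)**2 / (n2 - 1))
print DF_welch

That comes out to roughly 40 here, noticeably smaller than the 53 computed above.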

Compute the t-score for our data

The difference between two averages determined from small sample numbers follows the t-distribution. The t-score is the difference between the difference of the means and the hypothesized difference of the means, normalized by the standard error. We compute the absolute value of the t-score to make sure it is positive, for convenience later.

tscore = np.abs(((x1 - x2) - 0) / SE)
print tscore
1.99323179108

Interpretation

A way to approach determining if the difference is significant or not is to ask: does our computed t-score fall within the confidence range of the hypothesized value (zero)? If it does, then we can attribute the difference to statistical variations at that confidence level. If it does not, we can say that statistical variations do not account for the difference at that confidence level, and hence the averages must be different.

Let us compute the t-value that corresponds to a 95% confidence level for a mean of zero with the degrees of freedom computed earlier. This means that 95% of the t-scores we expect to get will fall within \(\pm\) t95.

from scipy.stats.distributions import t

ci = 0.95
alpha = 1 - ci
t95 = t.ppf(1.0 - alpha/2.0, DF)

print t95
2.00574599354

Since tscore < t95, we conclude that at the 95% confidence level we cannot say these averages are statistically different, because our computed t-score falls in the expected range of deviations. Note that our t-score is very close to the 95% limit. Let us consider a smaller confidence level.

ci = 0.94
alpha = 1 - ci
t94 = t.ppf(1.0 - alpha/2.0, DF)

print t94
1.92191364181

At the 94% confidence level, however, tscore > t94, which means we can say with 94% confidence that the two averages are different; class B performed better than class A did. Alternatively, there is only about a 6% chance we are wrong about that statement.

Another way to get there

An alternative way to get the confidence that the averages are different is to directly compute it from the cumulative t-distribution function. We compute the difference between the probability of a t-value less than tscore and the probability of a t-value less than -tscore, which is the fraction of the distribution that lies between them. You can see here that we are practically 95% sure that the averages are different.

f = t.cdf(tscore, DF) - t.cdf(-tscore, DF)
print f
0.948605075732
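As a cross-check, newer versions of scipy can run a two-sample t-test directly from the summary statistics. A sketch using the unequal-variance (Welch) form, which uses the same standard error as above; the statistic is negative because x1 < x2, and the p-value differs slightly from 1 - f because it uses the Welch-Satterthwaite degrees of freedom rather than \((n_1 - 1) + (n_2 - 1)\):

from scipy.stats import ttest_ind_from_stats

statistic, pvalue = ttest_ind_from_stats(x1, s1, n1, x2, s2, n2,
                                         equal_var=False)
print statistic, pvalue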

Copyright (C) 2013 by John Kitchin. See the License for information about copying.

org-mode source


Fit a line to numerical data

| categories: data analysis | tags: |

Matlab post

We want to fit a line to this data:

x = [0, 0.5, 1, 1.5, 2.0, 3.0, 4.0, 6.0, 10]
y = [0, -0.157, -0.315, -0.472, -0.629, -0.942, -1.255, -1.884, -3.147]

We use numpy.polyfit(x, y, n), where n is the polynomial order; n=1 for a line.

import numpy as np

p = np.polyfit(x, y, 1)
print p
slope, intercept = p
print slope, intercept
[-0.31452218  0.00062457]
-0.3145221843 0.00062457337884

To show the fit, we can use numpy.polyval to evaluate the fit at many points.

import matplotlib.pyplot as plt

xfit = np.linspace(0, 10)
yfit = np.polyval(p, xfit)

plt.plot(x, y, 'bo', label='raw data')
plt.plot(xfit, yfit, 'r-', label='fit')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.savefig('images/linefit-1.png')
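If an estimate of the uncertainty in the slope and intercept is also wanted, numpy.polyfit can return the covariance matrix of the fitted parameters. A minimal sketch:

p, cov = np.polyfit(x, y, 1, cov=True)
# standard errors are the square roots of the diagonal of the covariance matrix
print np.sqrt(np.diag(cov))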

Copyright (C) 2013 by John Kitchin. See the License for information about copying.

org-mode source


Fitting a numerical ODE solution to data

| categories: data analysis, nonlinear regression | tags: |

Matlab post

Suppose we know the concentration of A follows this differential equation: \(\frac{dC_A}{dt} = -k C_A\), and we have data we want to fit to it. Here is an example of doing that.

import numpy as np
from scipy.optimize import curve_fit
from scipy.integrate import odeint

# given data we want to fit
tspan = [0, 0.1, 0.2, 0.4, 0.8, 1]
Ca_data = [2.0081,  1.5512,  1.1903,  0.7160,  0.2562,  0.1495]

def fitfunc(t, k):
    'Function that returns Ca computed from an ODE for a k'
    def myode(Ca, t):
        return -k * Ca

    Ca0 = Ca_data[0]
    Casol = odeint(myode, Ca0, t)
    return Casol[:,0]

k_fit, kcov = curve_fit(fitfunc, tspan, Ca_data, p0=1.3)
print k_fit

tfit = np.linspace(0, 1)
fit = fitfunc(tfit, k_fit)

import matplotlib.pyplot as plt
plt.plot(tspan, Ca_data, 'ro', label='data')
plt.plot(tfit, fit, 'b-', label='fit')
plt.legend(loc='best')
plt.savefig('images/ode-fit.png')
[ 2.58893455]
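The covariance matrix returned by curve_fit can be used to estimate the uncertainty in the fitted rate constant. A short sketch using a simple normal approximation for a 95% interval (with only six data points, a t-based interval would be somewhat wider):

# standard error of k from the covariance matrix
k_se = np.sqrt(np.diag(kcov))[0]
print k_fit[0], '+/-', 1.96 * k_se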

Copyright (C) 2013 by John Kitchin. See the License for information about copying.

org-mode source


Graphical methods to help get initial guesses for multivariate nonlinear regression

| categories: data analysis, plotting | tags: |

Matlab post

Fit the model f(x1,x2; a,b) = a*x1 + x2^b to the data given below. This model has two independent variables, and two parameters.

We want to do a nonlinear fit to find a and b that minimize the summed squared errors between the model predictions and the data. With only two parameters, we can graph how the summed squared error varies with them, which may help us get initial guesses. Let us assume the parameters lie in a range; here we choose 0 to 5. In other problems you would adjust this as needed.

import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [0.2, 0.4, 0.8, 0.9, 1.1, 2.1]
X = np.column_stack([x1, x2]) # independent variables

f = [ 3.3079,    6.6358,   10.3143,   13.6492,   17.2755,   23.6271]

fig = plt.figure()
ax = fig.gca(projection = '3d')

ax.plot(x1, x2, f)
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('f(x1,x2)')

plt.savefig('images/graphical-mulvar-1.png')


arange = np.linspace(0, 5)
brange = np.linspace(0, 5)

A,B = np.meshgrid(arange, brange)

def model(X, a, b):
    'Nested function for the model'
    x1 = X[:, 0]
    x2 = X[:, 1]
    
    f = a * x1 + x2**b
    return f

@np.vectorize
def errfunc(a, b):
    # function for the summed squared error
    fit = model(X, a, b)
    sse = np.sum((fit - f)**2)
    return sse

SSE = errfunc(A, B)

plt.clf()
plt.contourf(A, B, SSE, 50)
plt.plot([3.2], [2.1], 'ro')
plt.text(3.4, 2.2, 'Minimum near here', color='r')

plt.savefig('images/graphical-mulvar-2.png')

guesses = [3.18, 2.02]

from scipy.optimize import curve_fit

popt, pcov = curve_fit(model, X, f, guesses)
print popt

plt.plot([popt[0]], [popt[1]], 'r*')
plt.savefig('images/graphical-mulvar-3.png')

print model(X, *popt)

fig = plt.figure()
ax = fig.gca(projection = '3d')

ax.plot(x1, x2, f, 'ko', label='data')
ax.plot(x1, x2, model(X, *popt), 'r-', label='fit')
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('f(x1,x2)')

plt.savefig('images/graphical-mulvar-4.png')
[ 3.21694798  1.9728254 ]
[  3.25873623   6.59792994  10.29473657  13.68011436  17.29161001
  23.62366445]

It can be difficult to figure out initial guesses for nonlinear fitting problems. For one and two dimensional systems, graphical techniques may be useful to visualize how the summed squared error between the model and data depends on the parameters.
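As a small complement to reading the guess off the contour plot, the coarse grid minimum can also be extracted programmatically and used as the initial guess. A sketch using the arrays computed above:

# locate the grid point with the smallest summed squared error
imin = np.unravel_index(np.argmin(SSE), SSE.shape)
print [A[imin], B[imin]]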

Copyright (C) 2013 by John Kitchin. See the License for information about copying.

org-mode source

