The Kitchin Research Group: machine-learning

Lies, damn lies, statistics and Bayesian statistics

Posted June 22, 2025 at 11:14 AM | categories: machine-learning | tags:

Updated June 23, 2025 at 01:33 PM

1. The data
2. GPR with a RBF kernel
3. A better kernel solves these issues
4. How about with feature engineering?
5. Summary

This post on LinkedIn (https://www.linkedin.com/posts/activity-7341134401705041920-gaEd?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAfqmO0BzyXpJw8w7yyHwkoMSiaKfGg-sKI) reminded me of a quip I often make: "Lies, damn lies, statistics, and Bayesian statistics". I am frequently skeptical of claims about "Bayesian something something", especially when the claim is about uncertainty quantification. That skepticism comes from practical experience: "Bayesian something something" is rarely as well behaved and informative as advertised (in my hands, of course).

To illustrate, I will use some noisy, 1d data from a Lennard-Jones function and Gaussian process regression to fit the data.

1. The data

We get our data by sampling a Lennard-Jones function, adding some noise, and removing a section of the data to leave a gap. The gap in the middle might be classically considered an interpolation region.

import numpy as np
import matplotlib.pyplot as plt

r = np.linspace(0.95, 3, 200)

eps, sig = 1, 1
y = 4 * eps * ((sig / r)**12 - (sig / r)**6) + np.random.normal(0, 0.03, size=r.shape)


ind = ((r > 1) & (r < 1.25)) | ((r > 2) & (r < 2.5))
_R = r[ind][:, None]
_y = y[ind]
plt.plot(_R, _y, '.')
plt.xlabel('R')
plt.ylabel('E');
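
For reference, the function sampled above is the standard Lennard-Jones form with \(\epsilon = \sigma = 1\), plus Gaussian noise whose standard deviation is the 0.03 passed to np.random.normal. The symbol \(\eta\) for the noise is my notation, not something from the code.

```latex
E(r) = 4 \epsilon \left[ \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right] + \eta,
\qquad \eta \sim \mathcal{N}(0,\; 0.03^2)
```

The noise-free curve has its minimum at \(r = 2^{1/6} \sigma \approx 1.12\) with depth \(-\epsilon\), so the first data window (1 < r < 1.25) covers the well and the second (2 < r < 2.5) covers part of the tail.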

2. GPR with a RBF kernel

The RBF kernel is the most standard kernel. It does an ok job fitting here, although I see evidence of overfitting (the wiggles are caused by the noise). You can reduce the overfitting by using a larger alpha value in the gpr, but that requires you to know in advance how smooth it should be.

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
kernel = RBF() + WhiteKernel()
gpr = GaussianProcessRegressor(kernel=kernel,
                               random_state=0, normalize_y=True).fit(_R, _y)

plt.plot(_R, _y, 'b.')
plt.plot(r, y, 'b.', alpha=0.2)

yp, se = gpr.predict(r[:, None], return_std=True)
plt.plot(r, yp)
plt.plot(r, yp + 2 * se, 'k--', r, yp - 2 * se, 'k--');
plt.plot(_R, _y, '.')
plt.xlabel('R')
plt.ylabel('E');

gpr.kernel_

RBF(length_scale=0.0948) + WhiteKernel(noise_level=0.00635)

The uncertainty here is primarily model uncertainty: the model is constrained to be correct where there is data, but where there is no data, it is simply not the right model.

The model does well in the region where there is data, but is qualitatively wrong in the gap (even though classically this would be considered interpolation), and overestimates the uncertainty in this region. The problem is the covariance kernel decays to 0 about two length scales away from the last point, which means there is no data to inform what the weights in that region should look like. That causes the model to revert to the mean of the data.
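
To put rough numbers on that decay, here is a minimal sketch that evaluates the RBF kernel, \(k(d) = \exp(-d^2 / 2\ell^2)\), at a few distances. It uses the fitted length scale reported by gpr.kernel_ above; the helper name rbf and the choice of r = 1.6 as a point in the gap are mine, for illustration only.

```python
import math

def rbf(distance, length_scale):
    """RBF (squared-exponential) kernel value for two points `distance` apart."""
    return math.exp(-0.5 * (distance / length_scale) ** 2)

length_scale = 0.0948  # fitted value reported by gpr.kernel_ above

# Correlation with a training point that is 1, 2, 3 and 4 length scales away.
for n in (1, 2, 3, 4):
    print(n, round(rbf(n * length_scale, length_scale), 4))

# r = 1.6 lies in the gap; the nearest training point is close to r = 1.25.
k_gap = rbf(1.6 - 1.25, length_scale)
print(k_gap)
```

The kernel is about 0.14 at two length scales and about 0.01 at three, so beyond a few length scales a training point has essentially no say in the prediction.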

gpr.predict([[1.8]]), gpr.predict([[3.0]]), np.mean(_y)

(array([-0.2452041]), array([-0.29363654]), -0.2936364964541409)

Why is this happening? It is not that tricky. You can think of the GP as an expansion of the data in basis functions. The kernel trick effectively makes this expansion in the infinite limit. What are those basis functions? We can draw samples of them, which we show here. You can see that where there is no data, the basis functions are "wiggly". That means they are simply not good at making predictions here.

y_samples = gpr.sample_y(r[:, None], n_samples=15, random_state=0)

plt.plot(r, yp)
plt.plot(r, yp + 2 * se, 'k--', r, yp - 2 * se, 'k--');
plt.plot(_R, _y, '.')

plt.plot(r, y_samples, 'k', alpha=0.2);

plt.xlabel('R')
plt.ylabel('E');

This kernel simply cannot be used for extrapolation, or any predictions more than about two length scales away from the nearest point. Calling it Bayesian doesn't make it better. For similar reasons, this model will not work well outside the data range.
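
The reversion to the mean can be reproduced without sklearn. The sketch below is a simplified stand-in, not the fitted model above: it fixes the length scale at the fitted value, assumes the noise variance is 0.03², only centers the targets (sklearn's normalize_y also rescales them), and uses a seeded random generator, so the numbers will not match the post exactly. The names rbf_kernel and predict are mine.

```python
import numpy as np

def rbf_kernel(a, b, length_scale):
    """RBF kernel matrix between two sets of 1-d points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

# Regenerate data like the post's (seeded here, so values differ from the post).
rng = np.random.default_rng(0)
r = np.linspace(0.95, 3, 200)
y = 4 * ((1 / r) ** 12 - (1 / r) ** 6) + rng.normal(0, 0.03, size=r.shape)
ind = ((r > 1) & (r < 1.25)) | ((r > 2) & (r < 2.5))
r_train, y_train = r[ind], y[ind]

length_scale = 0.0948  # fitted value from the post
noise_var = 0.03 ** 2  # assumption: variance of the added noise
y_mean = y_train.mean()

# The weights solve (K + noise_var * I) alpha = y - mean(y).
K = rbf_kernel(r_train, r_train, length_scale)
alpha = np.linalg.solve(K + noise_var * np.eye(len(r_train)), y_train - y_mean)

def predict(r_new):
    """Posterior mean: data mean plus a weighted sum of kernels centered on the data."""
    r_new = np.atleast_1d(r_new).astype(float)
    return y_mean + rbf_kernel(r_new, r_train, length_scale) @ alpha

print(predict(2.25), predict(3.5), y_mean)
```

Every prediction is the data mean plus kernel-weighted contributions from the training points. Far from the data all of those kernel values vanish, so nothing is left but the mean, regardless of what the underlying function does there.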

A practical person would still consider using this model, and might even rely on the uncertainty being too large to identify regions of low reliability.

3. A better kernel solves these issues

Not all is lost if we know more. In this case we can construct a kernel that reflects our understanding that the data came from a Lennard-Jones-like interaction model. You can construct kernels by adding and multiplying kernels. Here we start from a linear kernel, the DotProduct kernel, and construct a new kernel that is the sum of a linear kernel to the 12th power, a linear kernel to the 6th power, and a WhiteKernel for the noise. It is a little subtle that this kernel should work in \(1 / r\) space, so in addition to kernel engineering, we also do feature engineering.

from sklearn.gaussian_process.kernels import DotProduct

kernel = DotProduct()**12 + DotProduct()**6 +  WhiteKernel()
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(1 / _R, _y)

plt.plot(_R, _y, 'b.')
plt.plot(r, y, 'b.', alpha=0.2)


yp, se = gpr.predict(1 / r[:, None], return_std=True)
plt.plot(r, yp)
plt.plot(r, yp + 2 * se, 'k--', r, yp - 2 * se, 'k--');

plt.xlabel('R')
plt.ylabel('E');

gpr.kernel_

DotProduct(sigma_0=0.0281) ** 12 + DotProduct(sigma_0=0.936) ** 6 + WhiteKernel(noise_level=0.0077)

Note that this GPR does fine in the gap, including the right level of uncertainty there. This model is better because we used the kernel to constrain what forms the model can have. This model actually extrapolates correctly outside the data. It is worth noting that although this model has great predictive and UQ properties, it does not tell us anything about the values of ε and σ in the Lennard-Jones model. Although we might say the kernel is physics-based, i.e. it is based on the relevant features and equation, it does not have physical parameters in it.
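
One way to see why the kernel constrains the model is to write it out. In sklearn, DotProduct is \(k(x, x') = \sigma_0^2 + x \cdot x'\), and raising a kernel to a power raises its value to that power. With \(x = 1/r\), the kernel used here is the following, where the labels \(\sigma_a\), \(\sigma_b\) and \(\sigma_n\) are my notation for the two fitted sigma_0 values and the noise level:

```latex
k(x, x') = \left( \sigma_a^2 + x x' \right)^{12} + \left( \sigma_b^2 + x x' \right)^{6} + \sigma_n^2 \, \delta_{x x'},
\qquad x = \frac{1}{r}

\left( \sigma^2 + x x' \right)^{p} = \sum_{k=0}^{p} \binom{p}{k} \, \sigma^{2 (p - k)} \, x^{k} \, x'^{k}
```

The binomial expansion shows that each term is a sum of products of monomials \(x^k\) with \(k \le p\). So the functions this GP can represent are polynomials in \(1/r\) of degree at most 12, which includes the Lennard-Jones form. It also suggests why the small fitted value sigma_0 = 0.0281 matters: it strongly suppresses the lower-order terms of the 12th-power kernel, leaving that term dominated by \(1/r^{12}\).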

How about those basis functions here? You can see that all of them basically look like the LJ potential. That means they are good basis functions to expand this data set in.

y_samples = gpr.sample_y(1 / r[:, None], n_samples=15, random_state=0)

plt.plot(_R, _y, '.')

plt.plot(r, y_samples, 'k', alpha=0.2);

plt.xlabel('R')
plt.ylabel('E');

4. How about with feature engineering?

Can we do even better with feature engineering here? Motivated by this comment by Cory Simon, we cast the problem as a linear regression in [1 / r⁶, 1 / r¹²] feature space. This is also a perfectly reasonable thing to do. Since our output is linear in these features, we simply use a linear kernel (aka the DotProduct kernel in sklearn).

r6 = 1 / _R**6
r12 = r6**2

kernel = DotProduct() + WhiteKernel()

gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(np.hstack([r6, r12]), _y)

plt.plot(_R, _y, 'b.')
plt.plot(r, y, 'b.', alpha=0.2)

fr6 = 1 / r[:, None]**6
fr12 = fr6**2

yp, se = gpr.predict(np.hstack([fr6, fr12]), return_std=True)
plt.plot(r, yp)
plt.plot(r, yp + 2 * se, 'k--', r, yp - 2 * se, 'k--');

plt.xlabel('R')
plt.ylabel('E');

gpr.kernel_

DotProduct(sigma_0=0.74) + WhiteKernel(noise_level=0.00654)

We can't easily plot these basis functions the same way, so we reduce them to a 1-d plot. You can see here that these basis functions are practically the same as the ones from the advanced kernel.

y_samples = gpr.sample_y(np.hstack([fr6, fr12]),
                         n_samples=15, random_state=0)

plt.plot(_R, _y, '.')

plt.plot(r, y_samples, 'k', alpha=0.2);

plt.xlabel('R')
plt.ylabel('E');

This also works quite well, and is another way to leverage knowledge about what we are building a model for.
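
Because the output is linear in these two features, ordinary least squares is a useful cross-check, and unlike the GPR it hands back physical parameters. The sketch below is not the approach used in this post: it swaps the GP for np.linalg.lstsq, regenerates the data with a seeded generator (so values differ from the plots above), and assumes no intercept. The names c6, c12, sig_fit and eps_fit are mine.

```python
import numpy as np

# Regenerate data like the post's (seeded here, so values differ from the post).
rng = np.random.default_rng(0)
r = np.linspace(0.95, 3, 200)
eps_true, sig_true = 1.0, 1.0
y = 4 * eps_true * ((sig_true / r) ** 12 - (sig_true / r) ** 6)
y = y + rng.normal(0, 0.03, size=r.shape)

# Same training windows as in the post.
ind = ((r > 1) & (r < 1.25)) | ((r > 2) & (r < 2.5))
r_train, y_train = r[ind], y[ind]

# Design matrix with the two engineered features and no intercept.
X = np.column_stack([r_train ** -6, r_train ** -12])
(c6, c12), *_ = np.linalg.lstsq(X, y_train, rcond=None)

# E = c12 / r^12 + c6 / r^6, with c12 = 4 eps sig^12 and c6 = -4 eps sig^6.
sig_fit = (-c12 / c6) ** (1 / 6)
eps_fit = c6 ** 2 / (4 * c12)
print(f"sigma = {sig_fit:.3f}, epsilon = {eps_fit:.3f}")
```

The trade-off is that plain least squares gives point estimates only; you would need the usual regression machinery to get uncertainty on the predictions.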

5. Summary

Naive use of GPR can provide useful models when you have enough data, but these models likely do not accurately capture uncertainty outside that data, nor is it likely they are reliable in extrapolation. It is possible to do better than this, when you know what to do. Through feature and kernel engineering, you can sometimes create situations where the problem essentially becomes linear regression, where a simple linear kernel is what you want, or you develop a kernel that represents the underlying model. Kernel engineering is generally hard, with limited opportunities to be flexible. See https://www.cs.toronto.edu/~duvenaud/cookbook/ for examples of kernels and combining them.

You can see it is not adequate to say "we used Gaussian process regression". That is about as informative as saying linear regression without identifying the features, or nonlinear regression without saying what model. You have to be specific about the kernel, and thoughtful about how you know if a prediction is reliable or not. Just because you get an uncertainty prediction doesn't mean it's right.

org-mode source

Org-mode version = 9.8-pre

Modeling a Cu dimer by EMT, nonlinear regression and neural networks

Posted March 18, 2017 at 03:47 PM | categories: machine-learning, molecular-simulation, neural-network, python | tags:

In this post we consider a Cu₂ dimer and how its energy varies with the separation of the atoms. We assume we have a way to calculate this, but that it is expensive, and that we want to create a simpler model that is as accurate, but cheaper to run. A simple way to do that is to regress a physical model, but we will illustrate some challenges with that. We then show a neural network can be used as an accurate regression function without needing to know more about the physics.

We will use an effective medium theory calculator to demonstrate this. The calculations are not expected to be very accurate or relevant to any experimental data, but they are fast, and will illustrate several useful points that are independent of that. We will take as our energy zero the energy of two atoms at a large separation, in this case about 10 angstroms. Here we plot the energy as a function of the distance between the two atoms, which is the only degree of freedom that matters in this example.

import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

from ase.calculators.emt import EMT
from ase import Atoms

atoms = Atoms('Cu2',[[0, 0, 0], [10, 0, 0]], pbc=[False, False, False])
atoms.calc = EMT()  # set_calculator() is deprecated in recent ASE releases

e0 = atoms.get_potential_energy()

# Array of bond lengths to get the energy for
d = np.linspace(1.7, 3, 30)

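def get_e(distance):
    """Return the EMT energy (eV) of the dimer at the given separation (Å)."""
    a = atoms.copy()
    a[1].x = distance
    a.calc = EMT()  # atoms.copy() does not carry the calculator over
    return a.get_potential_energy()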

e = np.array([get_e(dist) for dist in d])
e -=  e0  # set the energy zero

plt.plot(d, e, 'bo ')
plt.xlabel('d (Å)')
plt.ylabel('energy (eV)')

We see there is a minimum, and the energy is asymmetric about the minimum. We have no functional form for the energy here, just the data in the plot. So to get another energy, we have to run another calculation. If that was expensive, we might prefer an analytical equation to evaluate instead. We will get an analytical form by fitting a function to the data. A classic one is the Buckingham potential: \(E = A \exp(-B r) - \frac{C}{r^6}\). Here we perform the regression.

def model(r, A, B, C):
    return A * np.exp(-B * r) - C / r**6

from pycse import nlinfit
import pprint

p0 = [-80, 1, 1]
p, pint, se = nlinfit(model, d, e, p0, 0.05)
print('Parameters = ', p)
print('Confidence intervals = ')
pprint.pprint(pint)
plt.plot(d, e, 'bo ', label='calculations')

x = np.linspace(min(d), max(d))
plt.plot(x, model(x, *p), label='fit')
plt.legend(loc='best')
plt.xlabel('d (Å)')
plt.ylabel('energy (eV)')

Parameters =  [ -83.21072545    1.18663393 -266.15259507]
Confidence intervals = 
array([[ -93.47624687,  -72.94520404],
       [   1.14158438,    1.23168348],
       [-280.92915682, -251.37603331]])

That fit is ok, but not great. We would be better off with a spline for this simple system! The trouble is, how do we get anything better? If we had a better equation to fit to, we might get better results. While one might come up with one for this dimer, how would you extend it to more complex systems, even just a trimer? There have been decades of research dedicated to that, and we are not smarter than those researchers, so it is time for a new approach.

We will use a Neural Network regressor. The input will be \(d\) and we want to regress a function to predict the energy.

There are a couple of important points to make here.

This is just another kind of regression.
We need a lot more data to do the regression. Here we use 300 data points.
We need to specify a network architecture. Here we use one hidden layer with 10 neurons, and the tanh activation function on each neuron. The last layer is just the output layer. I do not claim this is any kind of optimal architecture. It is just one that works to illustrate the idea.
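
To make that architecture concrete, here is a minimal NumPy sketch of the function such a network represents: one hidden layer of 10 tanh units followed by a linear output. It is not the sknn implementation, whose internals may differ. The weights are random rather than trained, and the names W1, b1, W2, b2 and forward are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 10

# Randomly initialized parameters; a real fit would optimize these.
W1 = rng.normal(size=(1, n_hidden))  # input -> hidden weights
b1 = np.zeros(n_hidden)              # hidden biases
W2 = rng.normal(size=(n_hidden, 1))  # hidden -> output weights
b2 = np.zeros(1)                     # output bias

def forward(X):
    """Map distances of shape (n, 1) to predicted energies of shape (n, 1)."""
    hidden = np.tanh(X @ W1 + b1)
    return hidden @ W2 + b2

X = np.linspace(1.7, 3, 300).reshape(-1, 1)
n_params = W1.size + b1.size + W2.size + b2.size
print(forward(X).shape, n_params)
```

Counting the weights and biases gives 31 adjustable parameters, compared with 3 in the Buckingham model, which is part of why the network needs more data.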

Here is the code that uses a neural network regressor, which is lightly adapted from http://scikit-neuralnetwork.readthedocs.io/en/latest/guide_model.html.

from sknn.mlp import Regressor, Layer

D = np.linspace(1.7, 3, 300)

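def get_e(distance):
    """Return the EMT energy (eV) of the dimer at the given separation (Å)."""
    a = atoms.copy()
    a[1].x = distance
    a.calc = EMT()  # atoms.copy() does not carry the calculator over
    return a.get_potential_energy()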

E = np.array([get_e(dist) for dist in D])
E -=  e0  # set the energy zero

X_train = np.row_stack(np.array(D))

N = 10
nn = Regressor(layers=[Layer("Tanh", units=N),
                       Layer('Linear')])
nn.fit(X_train, E)

dfit = np.linspace(min(d), max(d))

efit = nn.predict(np.row_stack(dfit))

plt.plot(d, e, 'bo ')
plt.plot(dfit, efit)
plt.legend(['calculations', 'neural network'])
plt.xlabel('d (Å)')
plt.ylabel('energy (eV)')

This fit looks pretty good, better than we got for the Buckingham potential. Well, it probably should look better; we have many more fitted parameters! It is not perfect, but it could be systematically improved by increasing the number of hidden layers and the number of neurons in each layer. I am being a little loose here by relying on a visual assessment of the fit. To systematically improve it you would need a quantitative analysis of the errors. I also note that if I run the block above several times in succession, I get different fits each time. I suppose that is due to some random numbers used to initialize the fit, but sometimes the fit is about as good as the result you see above, and sometimes it is terrible.
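
A quantitative assessment could start with a few summary statistics of the residuals. The helper below is a hypothetical sketch, not something from the original analysis, and the arrays are toy numbers chosen only to show the call.

```python
import numpy as np

def fit_errors(e_ref, e_fit):
    """Return (RMSE, MAE, max abs error) between reference and fitted energies."""
    resid = np.asarray(e_fit, dtype=float) - np.asarray(e_ref, dtype=float)
    rmse = np.sqrt(np.mean(resid ** 2))
    mae = np.mean(np.abs(resid))
    max_err = np.max(np.abs(resid))
    return rmse, mae, max_err

# Toy numbers for illustration only, not results from this post.
e_ref = np.array([0.0, 1.0, 2.0, 3.0])
e_fit = np.array([0.1, 0.9, 2.1, 3.3])
rmse, mae, max_err = fit_errors(e_ref, e_fit)
print(rmse, mae, max_err)
```

For the network above, you would pass the EMT energies and the network predictions at the same distances, flattened to 1-d arrays of equal length. Ideally those distances would be ones that were not used in training. Comparing these numbers across several runs would also quantify how much the fit varies with the random initialization.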

Ok, what is the point after all? We developed a neural network that pretty accurately captures the energy of a Cu dimer with no knowledge of the physics involved. Now, EMT is not that expensive, but suppose this required 300 DFT calculations at 1 minute or more apiece? That is at least five hours just to get the data! With this neural network, we can quickly compute energies. For example, this shows we get about 10000 energy calculations in just 287 ms.

%%timeit

dfit = np.linspace(min(d), max(d), 10000)
efit = nn.predict(np.row_stack(dfit))

1 loop, best of 3: 287 ms per loop

Compare that to the time it took to compute the 300 energies with EMT

%%timeit
E = np.array([get_e(dist) for dist in D])

1 loop, best of 3: 230 ms per loop

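The neural network is a lot faster than the way we get the EMT energies! The two timings are for different numbers of energies: 10000 in 287 ms is about 0.03 ms per energy for the neural network, while 300 in 230 ms is about 0.77 ms per energy for EMT, so the network is roughly 27 times faster per evaluation.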

It is true in this case we could have used a spline, or interpolating function and it would likely be even better than this Neural Network. We are aiming to get more complicated soon though. For a trimer, we will have three dimensions to worry about, and that can still be worked out in a similar fashion I think. Past that, it becomes too hard to reduce the dimensions, and this approach breaks down. Then we have to try something else. We will get to that in another post.

org-mode source

Org-mode version = 9.0.5