<?xml version="1.0" encoding="UTF-8"?>

<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     >
  <channel>
    <atom:link href="http://kitchingroup.cheme.cmu.edu/blog/feed/index.xml" rel="self" type="application/rss+xml" />
    <title>The Kitchin Research Group</title>
    <link>https://kitchingroup.cheme.cmu.edu/blog</link>
    <description>Chemical Engineering at Carnegie Mellon University</description>
    <pubDate>Sat, 01 Nov 2025 13:47:46 GMT</pubDate>
    <generator>Blogofile</generator>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    
    <item>
      <title>Lies, damn lies, statistics and Bayesian statistics</title>
      <link>https://kitchingroup.cheme.cmu.edu/blog/2025/06/22/Lies-damn-lies-statistics-and-Bayesian-statistics</link>
      <pubDate>Sun, 22 Jun 2025 11:14:23 EDT</pubDate>
      <category><![CDATA[machine-learning]]></category>
      <guid isPermaLink="false">14mSE18aHhiNdMq3g69gXgG5w2o=</guid>
      <description>Lies, damn lies, statistics and Bayesian statistics</description>
      <content:encoded><![CDATA[


&lt;div id="table-of-contents" role="doc-toc"&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;div id="text-table-of-contents" role="doc-toc"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#org035da70"&gt;1. The data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#org28ad175"&gt;2. GPR with a RBF kernel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#orga7fb619"&gt;3. a better kernel solves these issues&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#orgca7200a"&gt;4. How about with feature engineering?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#orgdd16d18"&gt;5. Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;
This post on LinkedIn (&lt;a href="https://www.linkedin.com/posts/activity-7341134401705041920-gaEd?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAAAfqmO0BzyXpJw8w7yyHwkoMSiaKfGg-sKI"&gt;https://www.linkedin.com/posts/activity-7341134401705041920-gaEd?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAAAfqmO0BzyXpJw8w7yyHwkoMSiaKfGg-sKI&lt;/a&gt;) reminded me of a quip I often make of "Lies, damn lies, statistics, and Bayesian statistics". I am frequently skeptical of claims about "Bayesian something something", especially when the claim is about uncertainty quantification. That skepticism comes from practical experience of mine that "Bayesian something something" is rarely as well behaved and informative as advertised (in my hands of course).
&lt;/p&gt;

&lt;p&gt;
To illustrate, I will use some noisy, 1d data from a Lennard-Jones function and Gaussian process regression to fit the data.
&lt;/p&gt;
&lt;div id="outline-container-org035da70" class="outline-2"&gt;
&lt;h2 id="org035da70"&gt;&lt;span class="section-number-2"&gt;1.&lt;/span&gt; The data&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-1"&gt;
&lt;p&gt;
We get our data by sampling a Lennard-Jones function, adding some noise, and removing a gap in the data. The gap in the middle might be classically considered an interpolation region.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-jupyter-python"&gt;&lt;span style="color: #0000FF; font-weight: bold;"&gt;import&lt;/span&gt; numpy &lt;span style="color: #0000FF; font-weight: bold;"&gt;as&lt;/span&gt; np
&lt;span style="color: #0000FF; font-weight: bold;"&gt;import&lt;/span&gt; matplotlib.pyplot &lt;span style="color: #0000FF; font-weight: bold;"&gt;as&lt;/span&gt; plt

&lt;span style="color: #BA36A5;"&gt;r&lt;/span&gt; = np.linspace(0.95, 3, 200)

&lt;span style="color: #BA36A5;"&gt;eps&lt;/span&gt;, &lt;span style="color: #BA36A5;"&gt;sig&lt;/span&gt; = 1, 1
&lt;span style="color: #BA36A5;"&gt;y&lt;/span&gt; = 4 * eps * ((1 / r)**12 - (1 / r)**6) + np.random.normal(0, 0.03, size=r.shape)


&lt;span style="color: #BA36A5;"&gt;ind&lt;/span&gt; = ((r &amp;gt; 1) &amp;amp; (r &amp;lt; 1.25)) | ((r &amp;gt; 2) &amp;amp; (r &amp;lt; 2.5))
&lt;span style="color: #BA36A5;"&gt;_R&lt;/span&gt; = r[ind][:, &lt;span style="color: #D0372D;"&gt;None&lt;/span&gt;]
&lt;span style="color: #BA36A5;"&gt;_y&lt;/span&gt; = y[ind]
plt.plot(_R, _y, &lt;span style="color: #008000;"&gt;'.'&lt;/span&gt;)
plt.xlabel(&lt;span style="color: #008000;"&gt;'R'&lt;/span&gt;)
plt.ylabel(&lt;span style="color: #008000;"&gt;'E'&lt;/span&gt;);
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;figure&gt;&lt;img src="/media/653165863df7654b10ddaca2f7645560768bd870.png"&gt;&lt;/figure&gt; 
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id="outline-container-org28ad175" class="outline-2"&gt;
&lt;h2 id="org28ad175"&gt;&lt;span class="section-number-2"&gt;2.&lt;/span&gt; GPR with a RBF kernel&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-2"&gt;
&lt;p&gt;
The RBF kernel is the most standard kernel. It does an ok job fitting here, although I see evidence of overfitting (the wiggles are caused by the noise). You can reduce the overfitting by using a larger &lt;code&gt;alpha&lt;/code&gt; value in the gpr, but that requires you to know in advance how smooth it should be.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-jupyter-python"&gt;&lt;span style="color: #0000FF; font-weight: bold;"&gt;from&lt;/span&gt; sklearn.gaussian_process &lt;span style="color: #0000FF; font-weight: bold;"&gt;import&lt;/span&gt; GaussianProcessRegressor
&lt;span style="color: #0000FF; font-weight: bold;"&gt;from&lt;/span&gt; sklearn.gaussian_process.kernels &lt;span style="color: #0000FF; font-weight: bold;"&gt;import&lt;/span&gt; RBF, WhiteKernel
&lt;span style="color: #BA36A5;"&gt;kernel&lt;/span&gt; = RBF() + WhiteKernel()
&lt;span style="color: #BA36A5;"&gt;gpr&lt;/span&gt; = GaussianProcessRegressor(kernel=kernel,
                               random_state=0, normalize_y=&lt;span style="color: #D0372D;"&gt;True&lt;/span&gt;).fit(_R, _y)

plt.plot(_R, _y, &lt;span style="color: #008000;"&gt;'b.'&lt;/span&gt;)
plt.plot(r, y, &lt;span style="color: #008000;"&gt;'b.'&lt;/span&gt;, alpha=0.2)

&lt;span style="color: #BA36A5;"&gt;yp&lt;/span&gt;, &lt;span style="color: #BA36A5;"&gt;se&lt;/span&gt; = gpr.predict(r[:, &lt;span style="color: #D0372D;"&gt;None&lt;/span&gt;], return_std=&lt;span style="color: #D0372D;"&gt;True&lt;/span&gt;)
plt.plot(r, yp)
plt.plot(r, yp + 2 * se, &lt;span style="color: #008000;"&gt;'k--'&lt;/span&gt;, r, yp - 2 * se, &lt;span style="color: #008000;"&gt;'k--'&lt;/span&gt;);
plt.plot(_R, _y, &lt;span style="color: #008000;"&gt;'.'&lt;/span&gt;)
plt.xlabel(&lt;span style="color: #008000;"&gt;'R'&lt;/span&gt;)
plt.ylabel(&lt;span style="color: #008000;"&gt;'E'&lt;/span&gt;);

gpr.kernel_
&lt;/pre&gt;
&lt;/div&gt;

&lt;pre class="example"&gt;
RBF(length_scale=0.0948) + WhiteKernel(noise_level=0.00635)
&lt;/pre&gt;

&lt;p&gt;
&lt;figure&gt;&lt;img src="/media/6acc7dccc37ee773aeb4c97c62929401733a02f6.png"&gt;&lt;/figure&gt; 
&lt;/p&gt;

&lt;p&gt;
The uncertainty here is primarily related to the model, i.e. it is constrained to be correct where there is data, but with no data, the model is not the right one.
&lt;/p&gt;

&lt;p&gt;
The model does well in the region where there is data, but is qualitatively wrong in the gap (even though classically this would be considered interpolation), and overestimates the uncertainty in this region. The problem is the covariance kernel decays to 0 about two length scales away from the last point, which means there is no data to inform what the weights in that region should look like.  That causes the model to revert to the mean of the data.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-jupyter-python"&gt;gpr.predict([[1.8]]), gpr.predict([[3.0]]), np.mean(_y)
&lt;/pre&gt;
&lt;/div&gt;

&lt;table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides"&gt;


&lt;colgroup&gt;
&lt;col  class="org-left" /&gt;

&lt;col  class="org-left" /&gt;

&lt;col  class="org-left" /&gt;

&lt;col  class="org-left" /&gt;

&lt;col  class="org-right" /&gt;
&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class="org-left"&gt;array&lt;/td&gt;
&lt;td class="org-left"&gt;((-0.2452041))&lt;/td&gt;
&lt;td class="org-left"&gt;array&lt;/td&gt;
&lt;td class="org-left"&gt;((-0.29363654))&lt;/td&gt;
&lt;td class="org-right"&gt;-0.2936364964541409&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;
Why is this happening? It is not that tricky. You can think of the GP as an expansion of the data in basis functions. The kernel trick effectively makes this expansion in the infinite limit. What are those basis functions? We can draw samples of them, which we show here. You can see that where there is no data, the basis functions are "wiggly". That means they are simply not good at making predictions here.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-jupyter-python"&gt;&lt;span style="color: #BA36A5;"&gt;y_samples&lt;/span&gt; = gpr.sample_y(r[:, &lt;span style="color: #D0372D;"&gt;None&lt;/span&gt;], n_samples=15, random_state=0)

plt.plot(r, yp)
plt.plot(r, yp + 2 * se, &lt;span style="color: #008000;"&gt;'k--'&lt;/span&gt;, r, yp - 2 * se, &lt;span style="color: #008000;"&gt;'k--'&lt;/span&gt;);
plt.plot(_R, _y, &lt;span style="color: #008000;"&gt;'.'&lt;/span&gt;)

plt.plot(r, y_samples, &lt;span style="color: #008000;"&gt;'k'&lt;/span&gt;, alpha=0.2);

plt.xlabel(&lt;span style="color: #008000;"&gt;'R'&lt;/span&gt;)
plt.ylabel(&lt;span style="color: #008000;"&gt;'E'&lt;/span&gt;);
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;figure&gt;&lt;img src="/media/cff4cfa5cacedcef6ecde8ec2b63dcee659949fb.png"&gt;&lt;/figure&gt; 
&lt;/p&gt;


&lt;p&gt;
This kernel simply cannot be used for extrapolation, or any predictions more than about two length scales away from the nearest point. Calling it Bayesian doesn't make it better. For similar reasons, this model will not work well outside the data range.
&lt;/p&gt;

&lt;p&gt;
A practical person would still consider using this model, and might even rely on the uncertainty being too large to identify regions of low reliability.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id="outline-container-orga7fb619" class="outline-2"&gt;
&lt;h2 id="orga7fb619"&gt;&lt;span class="section-number-2"&gt;3.&lt;/span&gt; a better kernel solves these issues&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-3"&gt;
&lt;p&gt;
Not all is lost, if we know more. In this case we can construct a kernel that reflects our understanding that the data came from a Lennard Jones like interaction model. You can construct kernels by adding and multiplying kernels. Here we consider a linear kernel, the &lt;code&gt;DotProduct&lt;/code&gt; kernel, and construct a new kernel that is a sum of the linear kernel to the 12&lt;sup&gt;th&lt;/sup&gt; power, a linear kernel to the 6&lt;sup&gt;th&lt;/sup&gt; power, and a &lt;code&gt;WhiteKernel&lt;/code&gt; for the noise. It is a little subtle that this kernel should work in \(1 / r\) space, so in addition to kernel engineering, we also do feature engineering.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-jupyter-python"&gt;&lt;span style="color: #0000FF; font-weight: bold;"&gt;from&lt;/span&gt; sklearn.gaussian_process.kernels &lt;span style="color: #0000FF; font-weight: bold;"&gt;import&lt;/span&gt; DotProduct

&lt;span style="color: #BA36A5;"&gt;kernel&lt;/span&gt; = DotProduct()**12 + DotProduct()**6 +  WhiteKernel()
&lt;span style="color: #BA36A5;"&gt;gpr&lt;/span&gt; = GaussianProcessRegressor(kernel=kernel, normalize_y=&lt;span style="color: #D0372D;"&gt;True&lt;/span&gt;).fit(1 / _R, _y)

plt.plot(_R, _y, &lt;span style="color: #008000;"&gt;'b.'&lt;/span&gt;)
plt.plot(r, y, &lt;span style="color: #008000;"&gt;'b.'&lt;/span&gt;, alpha=0.2)


&lt;span style="color: #BA36A5;"&gt;yp&lt;/span&gt;, &lt;span style="color: #BA36A5;"&gt;se&lt;/span&gt; = gpr.predict(1 / r[:, &lt;span style="color: #D0372D;"&gt;None&lt;/span&gt;], return_std=&lt;span style="color: #D0372D;"&gt;True&lt;/span&gt;)
plt.plot(r, yp)
plt.plot(r, yp + 2 * se, &lt;span style="color: #008000;"&gt;'k--'&lt;/span&gt;, r, yp - 2 * se, &lt;span style="color: #008000;"&gt;'k--'&lt;/span&gt;);

plt.xlabel(&lt;span style="color: #008000;"&gt;'R'&lt;/span&gt;)
plt.ylabel(&lt;span style="color: #008000;"&gt;'E'&lt;/span&gt;);

gpr.kernel_
&lt;/pre&gt;
&lt;/div&gt;

&lt;pre class="example"&gt;
DotProduct(sigma_0=0.0281) ** 12 + DotProduct(sigma_0=0.936) ** 6 + WhiteKernel(noise_level=0.0077)
&lt;/pre&gt;

&lt;p&gt;
&lt;figure&gt;&lt;img src="/media/70e91f8419a473ed578a14442694e67a3409bd1e.png"&gt;&lt;/figure&gt; 
&lt;/p&gt;

&lt;p&gt;
Note that this GPR does fine in the gap, including the right level of uncertainty there. This model is better because we used the kernel to constrain what forms the model can have. This model actually extrapolates correctly outside the data. It is worth noting that although this model has great predictive and UQ properties, it does not tell us anything about the values of &amp;epsilon; and &amp;sigma; in the Lennard Jones model. Although we might say the kernel is physics-based, i.e. it is based on the relevant features and equation, it does not have physical parameters in it.
&lt;/p&gt;

&lt;p&gt;
How about those basis functions here? You can see that all of them basically look like the LJ potential. That means they are good basis functions to expand this data set in.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-jupyter-python"&gt;&lt;span style="color: #BA36A5;"&gt;y_samples&lt;/span&gt; = gpr.sample_y(1 / r[:, &lt;span style="color: #D0372D;"&gt;None&lt;/span&gt;], n_samples=15, random_state=0)

plt.plot(_R, _y, &lt;span style="color: #008000;"&gt;'.'&lt;/span&gt;)

plt.plot(r, y_samples, &lt;span style="color: #008000;"&gt;'k'&lt;/span&gt;, alpha=0.2);

plt.xlabel(&lt;span style="color: #008000;"&gt;'R'&lt;/span&gt;)
plt.ylabel(&lt;span style="color: #008000;"&gt;'E'&lt;/span&gt;);
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;figure&gt;&lt;img src="/media/e7fe34a01c52cb228cbbcde85e5f334e7f8237a1.png"&gt;&lt;/figure&gt; 
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id="outline-container-orgca7200a" class="outline-2"&gt;
&lt;h2 id="orgca7200a"&gt;&lt;span class="section-number-2"&gt;4.&lt;/span&gt; How about with feature engineering?&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-4"&gt;
&lt;p&gt;
Can we do even better with feature engineering here? Motivated by &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7342573774774386688?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7342573774774386688%2C7342949865590530052%29&amp;amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287342949865590530052%2Curn%3Ali%3Aactivity%3A7342573774774386688%29"&gt;this comment&lt;/a&gt; by Cory Simon, we cast the problem as a linear regression in [1 / r&lt;sup&gt;6&lt;/sup&gt;, 1 / r&lt;sup&gt;12&lt;/sup&gt;] feature space. This is also a perfectly reasonable thing to do. Since our output is linear in these features, we simply use a linear kernel (aka the DotProduct kernel in sklearn).
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-jupyter-python"&gt;&lt;span style="color: #BA36A5;"&gt;r6&lt;/span&gt; = 1 / _R**6
&lt;span style="color: #BA36A5;"&gt;r12&lt;/span&gt; = r6**2

&lt;span style="color: #BA36A5;"&gt;kernel&lt;/span&gt; = DotProduct() + WhiteKernel()

&lt;span style="color: #BA36A5;"&gt;gpr&lt;/span&gt; = GaussianProcessRegressor(kernel=kernel, normalize_y=&lt;span style="color: #D0372D;"&gt;True&lt;/span&gt;).fit(np.hstack([r6, r12]), _y)

plt.plot(_R, _y, &lt;span style="color: #008000;"&gt;'b.'&lt;/span&gt;)
plt.plot(r, y, &lt;span style="color: #008000;"&gt;'b.'&lt;/span&gt;, alpha=0.2)

&lt;span style="color: #BA36A5;"&gt;fr6&lt;/span&gt; = 1 / r[:, &lt;span style="color: #D0372D;"&gt;None&lt;/span&gt;]**6
&lt;span style="color: #BA36A5;"&gt;fr12&lt;/span&gt; = fr6**2

&lt;span style="color: #BA36A5;"&gt;yp&lt;/span&gt;, &lt;span style="color: #BA36A5;"&gt;se&lt;/span&gt; = gpr.predict(np.hstack([fr6, fr12]), return_std=&lt;span style="color: #D0372D;"&gt;True&lt;/span&gt;)
plt.plot(r, yp)
plt.plot(r, yp + 2 * se, &lt;span style="color: #008000;"&gt;'k--'&lt;/span&gt;, r, yp - 2 * se, &lt;span style="color: #008000;"&gt;'k--'&lt;/span&gt;);

plt.xlabel(&lt;span style="color: #008000;"&gt;'R'&lt;/span&gt;)
plt.ylabel(&lt;span style="color: #008000;"&gt;'E'&lt;/span&gt;);

gpr.kernel_
&lt;/pre&gt;
&lt;/div&gt;

&lt;pre class="example"&gt;
DotProduct(sigma_0=0.74) + WhiteKernel(noise_level=0.00654)
&lt;/pre&gt;

&lt;p&gt;
&lt;figure&gt;&lt;img src="/media/d8769fe652b9e902e3d349ce26cdbd7d8050b190.png"&gt;&lt;/figure&gt; 
&lt;/p&gt;

&lt;p&gt;
We can't easily plot these basis functions the same way, so we reduce them to a 1-d plot. You can see here that these basis functions practically the same as the one with the advanced kernel.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-jupyter-python"&gt;&lt;span style="color: #BA36A5;"&gt;y_samples&lt;/span&gt; = gpr.sample_y(np.hstack([fr6, fr12]),
                         n_samples=15, random_state=0)

plt.plot(_R, _y, &lt;span style="color: #008000;"&gt;'.'&lt;/span&gt;)

plt.plot(r, y_samples, &lt;span style="color: #008000;"&gt;'k'&lt;/span&gt;, alpha=0.2);

plt.xlabel(&lt;span style="color: #008000;"&gt;'R'&lt;/span&gt;)
plt.ylabel(&lt;span style="color: #008000;"&gt;'E'&lt;/span&gt;);
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;figure&gt;&lt;img src="/media/f777586bb8e17bac5ca3dadbfba97119addeb46b.png"&gt;&lt;/figure&gt; 
&lt;/p&gt;



&lt;p&gt;
This also works quite well, and is another way to leverage knowledge about what we are building a model for.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id="outline-container-orgdd16d18" class="outline-2"&gt;
&lt;h2 id="orgdd16d18"&gt;&lt;span class="section-number-2"&gt;5.&lt;/span&gt; Summary&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-5"&gt;
&lt;p&gt;
Naive use of GPR can provide useful models when you have enough data, but these models likely do not accurately capture uncertainty outside that data, nor is it likely they are reliable in extrapolation. It is possible to do better than this, when you know what to do. Through feature and kernel engineering, you can sometimes create situations where the problem essentially becomes linear regression, where a simple linear kernel is what you want, or you develop a kernel that represents the underlying model. Kernel engineering is generally hard, with limited opportunities to be flexible. See &lt;a href="https://www.cs.toronto.edu/~duvenaud/cookbook/"&gt;https://www.cs.toronto.edu/~duvenaud/cookbook/&lt;/a&gt; for examples of kernels and combining them.
&lt;/p&gt;

&lt;p&gt;
You can see it is not adequate to say "we used Gaussian process regression". That is about as informative as saying linear regression without identifying the features, or nonlinear regression and not saying what model. You have to be specific about the kernel, and thoughtful about how you know if a prediction is reliable or not. Just because you get an uncertainty prediction doesn't mean its right.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Copyright (C) 2025 by John Kitchin. See the &lt;a href="/copying.html"&gt;License&lt;/a&gt; for information about copying.&lt;p&gt;
&lt;p&gt;&lt;a href="/org/2025/06/22/Lies,-damn-lies,-statistics-and-Bayesian-statistics.org"&gt;org-mode source&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Org-mode version = 9.8-pre&lt;/p&gt;]]></content:encoded>
    </item>
    <item>
      <title>Modeling a Cu dimer by EMT, nonlinear regression and neural networks</title>
      <link>https://kitchingroup.cheme.cmu.edu/blog/2017/03/18/Modeling-a-Cu-dimer-by-EMT-nonlinear-regression-and-neural-networks</link>
      <pubDate>Sat, 18 Mar 2017 15:47:55 EDT</pubDate>
      <category><![CDATA[machine-learning]]></category>
      <category><![CDATA[molecular-simulation]]></category>
      <category><![CDATA[neural-network]]></category>
      <category><![CDATA[python]]></category>
      <guid isPermaLink="false">YF7pYoEPLUP4Of_3TO50RTAKQhM=</guid>
      <description>Modeling a Cu dimer by EMT, nonlinear regression and neural networks</description>
      <content:encoded><![CDATA[


&lt;p&gt;
In this post we consider a Cu&lt;sub&gt;2&lt;/sub&gt; dimer and how its energy varies with the separation of the atoms. We assume we have a way to calculate this, but that it is expensive, and that we want to create a simpler model that is as accurate, but cheaper to run. A simple way to do that is to regress a physical model, but we will illustrate some challenges with that. We then show a neural network can be used as an accurate regression function without needing to know more about the physics.
&lt;/p&gt;

&lt;p&gt;
We will use an &lt;a href="https://wiki.fysik.dtu.dk/ase/ase/calculators/emt.html"&gt;effective medium theory&lt;/a&gt; calculator to demonstrate this. The calculations are not expected to be very accurate or relevant to any experimental data, but they are fast, and will illustrate several useful points that are independent of that. We will take as our energy zero the energy of two atoms at a large separation, in this case about 10 angstroms. Here we plot the energy as a function of the distance between the two atoms, which is the only degree of freedom that matters in this example.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-ipython"&gt;&lt;span style="color: #0000FF;"&gt;import&lt;/span&gt; numpy &lt;span style="color: #0000FF;"&gt;as&lt;/span&gt; np
%matplotlib inline
&lt;span style="color: #0000FF;"&gt;import&lt;/span&gt; matplotlib.pyplot &lt;span style="color: #0000FF;"&gt;as&lt;/span&gt; plt

&lt;span style="color: #0000FF;"&gt;from&lt;/span&gt; ase.calculators.emt &lt;span style="color: #0000FF;"&gt;import&lt;/span&gt; EMT
&lt;span style="color: #0000FF;"&gt;from&lt;/span&gt; ase &lt;span style="color: #0000FF;"&gt;import&lt;/span&gt; Atoms

&lt;span style="color: #BA36A5;"&gt;atoms&lt;/span&gt; = Atoms(&lt;span style="color: #008000;"&gt;'Cu2'&lt;/span&gt;,[[0, 0, 0], [10, 0, 0]], pbc=[&lt;span style="color: #D0372D;"&gt;False&lt;/span&gt;, &lt;span style="color: #D0372D;"&gt;False&lt;/span&gt;, &lt;span style="color: #D0372D;"&gt;False&lt;/span&gt;])
atoms.set_calculator(EMT())

&lt;span style="color: #BA36A5;"&gt;e0&lt;/span&gt; = atoms.get_potential_energy()

&lt;span style="color: #8D8D84;"&gt;# &lt;/span&gt;&lt;span style="color: #8D8D84; font-style: italic;"&gt;Array of bond lengths to get the energy for&lt;/span&gt;
&lt;span style="color: #BA36A5;"&gt;d&lt;/span&gt; = np.linspace(1.7, 3, 30)

&lt;span style="color: #0000FF;"&gt;def&lt;/span&gt; &lt;span style="color: #006699;"&gt;get_e&lt;/span&gt;(distance):
&lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   &lt;span style="color: #BA36A5;"&gt;a&lt;/span&gt; = atoms.copy()
&lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   a[1]&lt;span style="color: #BA36A5;"&gt;.x&lt;/span&gt; = distance
&lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   a.set_calculator(EMT())
&lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   &lt;span style="color: #BA36A5;"&gt;e&lt;/span&gt; = a.get_potential_energy()
&lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   &lt;span style="color: #0000FF;"&gt;return&lt;/span&gt; e

&lt;span style="color: #BA36A5;"&gt;e&lt;/span&gt; = np.array([get_e(dist) &lt;span style="color: #0000FF;"&gt;for&lt;/span&gt; dist &lt;span style="color: #0000FF;"&gt;in&lt;/span&gt; d])
&lt;span style="color: #BA36A5;"&gt;e&lt;/span&gt; -=  e0  &lt;span style="color: #8D8D84;"&gt;# &lt;/span&gt;&lt;span style="color: #8D8D84; font-style: italic;"&gt;set the energy zero&lt;/span&gt;

plt.plot(d, e, &lt;span style="color: #008000;"&gt;'bo '&lt;/span&gt;)
plt.xlabel(&lt;span style="color: #008000;"&gt;'d (&amp;#197;)'&lt;/span&gt;)
plt.ylabel(&lt;span style="color: #008000;"&gt;'energy (eV)'&lt;/span&gt;)
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;img src="/media/ob-ipython-82aeda9421056689d5212f9033da900a.png"&gt; 
&lt;/p&gt;


&lt;p&gt;
We see there is a minimum, and the energy is asymmetric about the minimum. We have no functional form for the energy here, just the data in the plot. So to get another energy, we have to run another calculation. If that was expensive, we might prefer an analytical equation to evaluate instead.  We will get an analytical form by fitting a function to the data. A classic one is the &lt;a href="https://en.wikipedia.org/wiki/Buckingham_potential"&gt;Buckingham potential&lt;/a&gt;: \(E = A \exp(-B r) - \frac{C}{r^6}\). Here we perform the regression.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-ipython"&gt;&lt;span style="color: #0000FF;"&gt;def&lt;/span&gt; &lt;span style="color: #006699;"&gt;model&lt;/span&gt;(r, A, B, C):
&lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   &lt;span style="color: #0000FF;"&gt;return&lt;/span&gt; A * np.exp(-B * r) - C / r**6

&lt;span style="color: #0000FF;"&gt;from&lt;/span&gt; pycse &lt;span style="color: #0000FF;"&gt;import&lt;/span&gt; nlinfit
&lt;span style="color: #0000FF;"&gt;import&lt;/span&gt; pprint

&lt;span style="color: #BA36A5;"&gt;p0&lt;/span&gt; = [-80, 1, 1]
&lt;span style="color: #BA36A5;"&gt;p&lt;/span&gt;, &lt;span style="color: #BA36A5;"&gt;pint&lt;/span&gt;, &lt;span style="color: #BA36A5;"&gt;se&lt;/span&gt; = nlinfit(model, d, e, p0, 0.05)
&lt;span style="color: #0000FF;"&gt;print&lt;/span&gt;(&lt;span style="color: #008000;"&gt;'Parameters = '&lt;/span&gt;, p)
&lt;span style="color: #0000FF;"&gt;print&lt;/span&gt;(&lt;span style="color: #008000;"&gt;'Confidence intervals = '&lt;/span&gt;)
pprint.pprint(pint)
plt.plot(d, e, &lt;span style="color: #008000;"&gt;'bo '&lt;/span&gt;, label=&lt;span style="color: #008000;"&gt;'calculations'&lt;/span&gt;)

&lt;span style="color: #BA36A5;"&gt;x&lt;/span&gt; = np.linspace(&lt;span style="color: #006FE0;"&gt;min&lt;/span&gt;(d), &lt;span style="color: #006FE0;"&gt;max&lt;/span&gt;(d))
plt.plot(x, model(x, *p), label=&lt;span style="color: #008000;"&gt;'fit'&lt;/span&gt;)
plt.legend(loc=&lt;span style="color: #008000;"&gt;'best'&lt;/span&gt;)
plt.xlabel(&lt;span style="color: #008000;"&gt;'d (&amp;#197;)'&lt;/span&gt;)
plt.ylabel(&lt;span style="color: #008000;"&gt;'energy (eV)'&lt;/span&gt;)
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
Parameters =  [ -83.21072545    1.18663393 -266.15259507]
Confidence intervals =
array([[ -93.47624687,  -72.94520404],
       [   1.14158438,    1.23168348],
       [-280.92915682, -251.37603331]])
&lt;img src="/media/ob-ipython-a05811588d06f090a2462ba9f48dccb3.png"&gt; 
&lt;/p&gt;

&lt;p&gt;
That fit is ok, but not great. We would be better off with a spline for this simple system! The trouble is how do we get anything better? If we had a better equation to fit to we might get better results.  While one might come up with one for this dimer, how would you extend it to more complex systems, even just a trimer? There have been decades of research dedicated to that, and we are not smarter than those researchers so, it is time for a new approach.
&lt;/p&gt;

&lt;p&gt;
We will use a Neural Network regressor. The input will be \(d\) and we want to regress a function to predict the energy.
&lt;/p&gt;

&lt;p&gt;
There are a couple of important points to make here.
&lt;/p&gt;

&lt;ol class="org-ol"&gt;
&lt;li&gt;This is just another kind of regression.&lt;/li&gt;
&lt;li&gt;We need a lot more data to do the regression. Here we use 300 data points.&lt;/li&gt;
&lt;li&gt;We need to specify a network architecture. Here we use one hidden layer with 10 neurons, and the tanh activation function on each neuron. The last layer is just the output layer. I do not claim this is any kind of optimal architecture. It is just one that works to illustrate the idea.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
Here is the code that uses a neural network regressor, which is lightly adapted from &lt;a href="http://scikit-neuralnetwork.readthedocs.io/en/latest/guide_model.html"&gt;http://scikit-neuralnetwork.readthedocs.io/en/latest/guide_model.html&lt;/a&gt;.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-ipython"&gt;&lt;span style="color: #0000FF;"&gt;from&lt;/span&gt; sknn.mlp &lt;span style="color: #0000FF;"&gt;import&lt;/span&gt; Regressor, Layer

&lt;span style="color: #BA36A5;"&gt;D&lt;/span&gt; = np.linspace(1.7, 3, 300)

&lt;span style="color: #0000FF;"&gt;def&lt;/span&gt; &lt;span style="color: #006699;"&gt;get_e&lt;/span&gt;(distance):
&lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   &lt;span style="color: #BA36A5;"&gt;a&lt;/span&gt; = atoms.copy()
&lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   a[1]&lt;span style="color: #BA36A5;"&gt;.x&lt;/span&gt; = distance
&lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   a.set_calculator(EMT())
&lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   &lt;span style="color: #BA36A5;"&gt;e&lt;/span&gt; = a.get_potential_energy()
&lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   &lt;span style="color: #0000FF;"&gt;return&lt;/span&gt; e

&lt;span style="color: #BA36A5;"&gt;E&lt;/span&gt; = np.array([get_e(dist) &lt;span style="color: #0000FF;"&gt;for&lt;/span&gt; dist &lt;span style="color: #0000FF;"&gt;in&lt;/span&gt; D])
&lt;span style="color: #BA36A5;"&gt;E&lt;/span&gt; -=  e0  &lt;span style="color: #8D8D84;"&gt;# &lt;/span&gt;&lt;span style="color: #8D8D84; font-style: italic;"&gt;set the energy zero&lt;/span&gt;

&lt;span style="color: #BA36A5;"&gt;X_train&lt;/span&gt; = np.row_stack(np.array(D))

&lt;span style="color: #BA36A5;"&gt;N&lt;/span&gt; = 10
&lt;span style="color: #BA36A5;"&gt;nn&lt;/span&gt; = Regressor(layers=[Layer(&lt;span style="color: #008000;"&gt;"Tanh"&lt;/span&gt;, units=N),
&lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   &lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   &lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   &lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   &lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;   &lt;span style="color: #9B9B9B; background-color: #EDEDED;"&gt; &lt;/span&gt;  Layer(&lt;span style="color: #008000;"&gt;'Linear'&lt;/span&gt;)])
nn.fit(X_train, E)

&lt;span style="color: #BA36A5;"&gt;dfit&lt;/span&gt; = np.linspace(&lt;span style="color: #006FE0;"&gt;min&lt;/span&gt;(d), &lt;span style="color: #006FE0;"&gt;max&lt;/span&gt;(d))

&lt;span style="color: #BA36A5;"&gt;efit&lt;/span&gt; = nn.predict(np.row_stack(dfit))

plt.plot(d, e, &lt;span style="color: #008000;"&gt;'bo '&lt;/span&gt;)
plt.plot(dfit, efit)
plt.legend([&lt;span style="color: #008000;"&gt;'calculations'&lt;/span&gt;, &lt;span style="color: #008000;"&gt;'neural network'&lt;/span&gt;])
plt.xlabel(&lt;span style="color: #008000;"&gt;'d (&amp;#197;)'&lt;/span&gt;)
plt.ylabel(&lt;span style="color: #008000;"&gt;'energy (eV)'&lt;/span&gt;)
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;img src="/media/ob-ipython-025c1b30f565c5806510804582e91242.png"&gt; 
&lt;/p&gt;

&lt;p&gt;
This fit looks pretty good, better than we got for the Buckingham potential. Well, it probably should look better, we have many more parameters that were fitted! It is not perfect, but it could be systematically improved by increasing the number of hidden layers, and neurons in each layer. I am being a little loose here by relying on a visual assessment of the fit. To systematically improve it you would need a quantitative analysis of the errors. I also note though, that if I run the block above several times in succession, I get different fits each time. I suppose that is due to some random numbers used to initialize the fit, but sometimes the fit is about as good as the result you see above, and sometimes it is terrible.
&lt;/p&gt;

&lt;p&gt;
Ok, what is the point after all? We developed a neural network that pretty accurately captures the energy of a Cu dimer &lt;i&gt;with no knowledge&lt;/i&gt; of the physics involved. Now, EMT is not that expensive, but suppose this required 300 DFT calculations at 1 minute or more a piece? That is five hours just to get the data! With this neural network, we can quickly compute energies. For example, this shows we get about 10000 energy calculations in just 287 ms.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-ipython"&gt;%%timeit

&lt;span style="color: #BA36A5;"&gt;dfit&lt;/span&gt; = np.linspace(&lt;span style="color: #006FE0;"&gt;min&lt;/span&gt;(d), &lt;span style="color: #006FE0;"&gt;max&lt;/span&gt;(d), 10000)
&lt;span style="color: #BA36A5;"&gt;efit&lt;/span&gt; = nn.predict(np.row_stack(dfit))
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
1 loop, best of 3: 287 ms per loop
&lt;/p&gt;

&lt;p&gt;
Compare that to the time it took to compute the 300 energies with EMT
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-ipython"&gt;%%timeit
&lt;span style="color: #BA36A5;"&gt;E&lt;/span&gt; = np.array([get_e(dist) &lt;span style="color: #0000FF;"&gt;for&lt;/span&gt; dist &lt;span style="color: #0000FF;"&gt;in&lt;/span&gt; D])
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
1 loop, best of 3: 230 ms per loop
&lt;/p&gt;

&lt;p&gt;
The neural network is a lot faster than the way we get the EMT energies!
&lt;/p&gt;

&lt;p&gt;
It is true in this case we could have used a spline, or interpolating function and it would likely be even better than this Neural Network. We are aiming to get more complicated soon though. For a trimer, we will have three dimensions to worry about, and that can still be worked out in a similar fashion I think. Past that, it becomes too hard to reduce the dimensions, and this approach breaks down. Then we have to try something else. We will get to that in another post.
&lt;/p&gt;
&lt;p&gt;Copyright (C) 2017 by John Kitchin. See the &lt;a href="/copying.html"&gt;License&lt;/a&gt; for information about copying.&lt;p&gt;
&lt;p&gt;&lt;a href="/org/2017/03/18/Modeling-a-Cu-dimer-by-EMT,-nonlinear-regression-and-neural-networks.org"&gt;org-mode source&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Org-mode version = 9.0.5&lt;/p&gt;]]></content:encoded>
    </item>
  </channel>
</rss>
