A better defun for emacs-lisp

| categories: elisp, macro, emacs | tags: | View Comments

Table of Contents

I have been thinking of better ways to write code that is more likely to have decent docstrings that are up to date, and maybe that enable automatic validation. One strategy is to keep documentation and code together, and by together I mean close together. The closer the better. I made some interesting progress in the last post, where I used a macro to let me put argument specific documentation in the same place that the argument is defined. Here I expand the idea to also provide argument default values, and validation code where the argument is defined inside the function, in addition to generating docstrings. This post is written in Emacs-lisp, mostly because I am more familiar with the macro language. The idea should apply to other lisps too.

Let's consider this prototypical, vanilla function definition, usage, and docstring.

(defun f1 (arg1 arg2)
  "Add two numbers."
  (+ arg1 arg2))

;; usage
(f1 3 4)

Here is what the help looks like from emacs.

(describe-function 'f1)

It is clear I was lazy in writing the docstring; it does not even mention the arguments. There is also no validation of the arguments so if you pass a string and a number, you will get an error. There are no defaults either, so you have to provide both arguments. It seems like there could be significant room for improvement. Of course, I could bite the bullet and write a better function like this one:

(defun f1a (arg1 &optional arg2)
  "Add ARG1 and ARG2 together.
ARG1 and  ARG2 should both be numbers."
  (when (null arg2) (setq arg2 2))
  (unless (and (numberp arg1) (numberp arg2)) (error "arg1 and arg2 should both be numbers"))
  (+ arg1 arg2))

(list (f1a 3 4) (f1a 3))

Yes, I could do that, but it is tedious to do it all the time. And it still leaves something to be desired for me. The docstring does not say what the default value is for example, and that is hard-coded in the code, i.e. not introspectible until you look at the code. Next we consider an alternative way to write the function. Compare that to this function definition, usage and documentation. The function definition is a little more verbose. Providing documentation, defaults and validation code in any form would make it that way no matter what.

(defn f2 ((arg1 "A number" :validate numberp)
          (arg2 "A number" :validate numberp :default 2))
  "Add the arguments."
  (+ arg1 arg2))

;; usage
(list (f2 3 4) (f2 3))
(describe-function 'f2)

The documentation is built up from the information in the function definition, in a form that is mostly consistent with emacs-lisp documentation standards. defn is not a regular emacs-lisp function; it is a macro I developed to generate the function code. It turned out to be long, but the gist of it is that before defining the function I loop through the arguments and collect the docstrings, along with any information about default values and/or validation functions. Then I build up the list of arguments to put in the function. Then if any default values are set, I generate some code to set those values if they are not set in the function call, and finally a similar block of validation code. At the end, I construct the defun and return it. You can check out the code if you want here: https://github.com/jkitchin/scimax/blob/master/scimax-macros.el.

Let's take a look at what this code expands to.

 '(defn f2 ((arg1 "A number" :validate numberp)
            (arg2 "A number" :validate numberp :default 2))
    "Add the arguments."
    (+ arg1 arg2)))

You can see it expands to a regular defun, with a generated docstring, generated default settings code block, and generated validation code. Pretty nice.

Let's see what happens with a function that fails the validation. We should get an error. Here we capture the error so we can see it in the post.

(condition-case err
    (f2 "oak")
   (error-message-string err)))

So we even get a useful error message when the wrong type of argument is provided. Compare that to the error message from the original version of this function. It tells us we got the wrong type, but not which argument.

(condition-case err
    (f1 "oak" 4)
   (error-message-string err)))

One last example to check out the &rest argument, with validation that every arg is a number.

(defn f4 ((rarg :rest
                :validate (lambda (x)
                            (-all-p 'identity (mapcar 'numberp x)))))
  "multiply all the arguments."
  (apply '* rarg))

(f4 1 2 3)
(condition-case err
    (f4 "oak" 4)
   (error-message-string err)))
(describe-function 'f4)

That looks ok too.

1 Summary

The motivation for this was to help me write better code with better documentation. Better code in the sense that it can provide run-time validation, with better feedback, and automatic documentation, including that there is none if that is the case. It is basically compatible with the regular defun, but enhances what kind of documentation is possible with less work on my part. I think it will make it easier to keep documentation in sync, since the argument documentation would be kept near the argument, and you can build in validation if you want to.

It is no news to lispers that macros are good for this kind of application.

Copyright (C) 2017 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 9.0.5

Read and Post Comments

A Hy macro for defining functions with docstrings on each argument

| categories: hylang, python | tags: | View Comments

For functions with a lot of arguments, python style docstrings leave something to be desired. For one, they are not that close to the arguments, so if you have a function with say 20 arguments, the docstring might take up a whole page! That means they are hard to keep synchronized too. Let's not argue now over the merits of a function with 20+ arguments, it is enough that they exist, and are a problem.

So what are typical documentation standards? Here is a Numpy style doc string:

def func(arg1, arg2):
    """multiply arg1 and arg2

    arg1 : a number
    arg2 : a number

    return arg1 * arg2

It works well for a small number of arguments with limited descriptions. This is a proper docstring that is accessible by introspection and pydoc. With much longer argument lists, this falls apart. I will not pick on any code in particular here, but suffice it to say I was inspired today to think of a better way. There are some other documentation solutions at http://stackoverflow.com/questions/9195455/how-to-document-a-method-with-parameters, but None of them are better in my opinion. I want accessible docstrings by instrospection, and only if that is unavailable do I want to read the code! Finally, if I have to read the code, I want it to be easy to figure out, which means the documentation is close to the arguments.

There is bad news, I do not have one for vanilla python. Python does not even give you a way to deal with this. But, if we had a lisp, we could make a macro to help us out. In fact, we have a lisp with hy! And we can use a macro to make a syntax that lets us keep the docstring close to the argument, and that constructs a real docstring so we get help later!

Here it is:

(defmacro mydef [func args &optional docstring &rest body]
  `(defn ~func [~@(map (lambda [x] (nth x 0)) args)]
     ~(+ (if docstring (+ docstring "\n\n") "")
         (.join "\n" (map (lambda [x]
                            (.format "{} : {}"
                                     (nth x 0)
                                     (nth x 1))) args)))

We can checkout how it expands like this:

(print (macroexpand '(mydef f [(a "an int")
                               (b "an int")]
                            "some doc"
                            (* a b))))
('setv' 'f' ('fn' ['a' 'b'] 'some doc\n\nParameters\n----------\na : an int\nb : an int' ('*' 'a' 'b')))

That looks ok. Now, for an example of using that. Here is the same function we defined before, but I put the documentation for each argument with the argument.

(mydef func ((arg1 "a number")
             (arg2 "a number"))
  "Multiply arg1 by arg2"
  (* arg1 arg2))

We can use the function now like a regular function.

(print (func 24 3))

And now for the help.

(help func)
Help on function func in module __main__:

func(arg1, arg2)
    Multiply arg1 by arg2

    arg1 : a number
    arg2 : a number

Now, that should amaze and astonish you if you are a vanilla Pythonista! We have our cake, and we eat it too. You just can not make up your own syntax that way in Python. Imagine, we could add type information, validation code, etc… into that macro. Maybe it could even be possible to store argument dependent documentation on the function, say in the function dictionary. That would require some conventions I guess, but they could become introspectable then. For example, in this vanilla Python:

def f(x): return x*x
f.__dict__['args'] = {'x': 'A number'}

{'args': {'x': 'A number'}}

In the end, this does not really solve all the problems I have with current docstrings in Python. It does solve a problem with writing and reading the code by keeping documentation close to the arguments, but ultimately the docstring from Python's point of view will basically look the same. It is pretty awesome that it is even possible. Hy lisp for the win here (again!).

Copyright (C) 2017 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 9.0.5

Read and Post Comments

Modeling a Cu dimer by EMT, nonlinear regression and neural networks

| categories: python, molecular-simulation, neural-network, machine-learning | tags: | View Comments

In this post we consider a Cu2 dimer and how its energy varies with the separation of the atoms. We assume we have a way to calculate this, but that it is expensive, and that we want to create a simpler model that is as accurate, but cheaper to run. A simple way to do that is to regress a physical model, but we will illustrate some challenges with that. We then show a neural network can be used as an accurate regression function without needing to know more about the physics.

We will use an effective medium theory calculator to demonstrate this. The calculations are not expected to be very accurate or relevant to any experimental data, but they are fast, and will illustrate several useful points that are independent of that. We will take as our energy zero the energy of two atoms at a large separation, in this case about 10 angstroms. Here we plot the energy as a function of the distance between the two atoms, which is the only degree of freedom that matters in this example.

import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

from ase.calculators.emt import EMT
from ase import Atoms

atoms = Atoms('Cu2',[[0, 0, 0], [10, 0, 0]], pbc=[False, False, False])

e0 = atoms.get_potential_energy()

# Array of bond lengths to get the energy for
d = np.linspace(1.7, 3, 30)

def get_e(distance):
    a = atoms.copy()
    a[1].x = distance
    e = a.get_potential_energy()
    return e

e = np.array([get_e(dist) for dist in d])
e -=  e0  # set the energy zero

plt.plot(d, e, 'bo ')
plt.xlabel('d (Å)')
plt.ylabel('energy (eV)')

We see there is a minimum, and the energy is asymmetric about the minimum. We have no functional form for the energy here, just the data in the plot. So to get another energy, we have to run another calculation. If that was expensive, we might prefer an analytical equation to evaluate instead. We will get an analytical form by fitting a function to the data. A classic one is the Buckingham potential: \(E = A \exp(-B r) - \frac{C}{r^6}\). Here we perform the regression.

def model(r, A, B, C):
    return A * np.exp(-B * r) - C / r**6

from pycse import nlinfit
import pprint

p0 = [-80, 1, 1]
p, pint, se = nlinfit(model, d, e, p0, 0.05)
print('Parameters = ', p)
print('Confidence intervals = ')
plt.plot(d, e, 'bo ', label='calculations')

x = np.linspace(min(d), max(d))
plt.plot(x, model(x, *p), label='fit')
plt.xlabel('d (Å)')
plt.ylabel('energy (eV)')

Parameters = [ -83.21072545 1.18663393 -266.15259507] Confidence intervals = array([[ -93.47624687, -72.94520404], [ 1.14158438, 1.23168348], [-280.92915682, -251.37603331]])

That fit is ok, but not great. We would be better off with a spline for this simple system! The trouble is how do we get anything better? If we had a better equation to fit to we might get better results. While one might come up with one for this dimer, how would you extend it to more complex systems, even just a trimer? There have been decades of research dedicated to that, and we are not smarter than those researchers so, it is time for a new approach.

We will use a Neural Network regressor. The input will be \(d\) and we want to regress a function to predict the energy.

There are a couple of important points to make here.

  1. This is just another kind of regression.
  2. We need a lot more data to do the regression. Here we use 300 data points.
  3. We need to specify a network architecture. Here we use one hidden layer with 10 neurons, and the tanh activation function on each neuron. The last layer is just the output layer. I do not claim this is any kind of optimal architecture. It is just one that works to illustrate the idea.

Here is the code that uses a neural network regressor, which is lightly adapted from http://scikit-neuralnetwork.readthedocs.io/en/latest/guide_model.html.

from sknn.mlp import Regressor, Layer

D = np.linspace(1.7, 3, 300)

def get_e(distance):
    a = atoms.copy()
    a[1].x = distance
    e = a.get_potential_energy()
    return e

E = np.array([get_e(dist) for dist in D])
E -=  e0  # set the energy zero

X_train = np.row_stack(np.array(D))

N = 10
nn = Regressor(layers=[Layer("Tanh", units=N),
nn.fit(X_train, E)

dfit = np.linspace(min(d), max(d))

efit = nn.predict(np.row_stack(dfit))

plt.plot(d, e, 'bo ')
plt.plot(dfit, efit)
plt.legend(['calculations', 'neural network'])
plt.xlabel('d (Å)')
plt.ylabel('energy (eV)')

This fit looks pretty good, better than we got for the Buckingham potential. Well, it probably should look better, we have many more parameters that were fitted! It is not perfect, but it could be systematically improved by increasing the number of hidden layers, and neurons in each layer. I am being a little loose here by relying on a visual assessment of the fit. To systematically improve it you would need a quantitative analysis of the errors. I also note though, that if I run the block above several times in succession, I get different fits each time. I suppose that is due to some random numbers used to initialize the fit, but sometimes the fit is about as good as the result you see above, and sometimes it is terrible.

Ok, what is the point after all? We developed a neural network that pretty accurately captures the energy of a Cu dimer with no knowledge of the physics involved. Now, EMT is not that expensive, but suppose this required 300 DFT calculations at 1 minute or more a piece? That is five hours just to get the data! With this neural network, we can quickly compute energies. For example, this shows we get about 10000 energy calculations in just 287 ms.


dfit = np.linspace(min(d), max(d), 10000)
efit = nn.predict(np.row_stack(dfit))

1 loop, best of 3: 287 ms per loop

Compare that to the time it took to compute the 300 energies with EMT

E = np.array([get_e(dist) for dist in D])

1 loop, best of 3: 230 ms per loop

The neural network is a lot faster than the way we get the EMT energies!

It is true in this case we could have used a spline, or interpolating function and it would likely be even better than this Neural Network. We are aiming to get more complicated soon though. For a trimer, we will have three dimensions to worry about, and that can still be worked out in a similar fashion I think. Past that, it becomes too hard to reduce the dimensions, and this approach breaks down. Then we have to try something else. We will get to that in another post.

Copyright (C) 2017 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 9.0.5

Read and Post Comments

New publication in Calphad

| categories: publication, news | tags: | View Comments

Alloys can have rich, complex phase behavior. Cu-Pd alloys for example show an unusual behavior where a BCC lattice forms for some compositions, even though the alloy is made from two metals that are exclusively FCC in structure! Being able to model and predict this kind of behavior is a major challenge. In this work, we use cluster expansions to model the configurational degrees of freedom in the FCC and BCC lattices and show qualitatively that we can predict the region where the B2 phase (the BCC one) forms. The agreement with experiment is not quantitative though, and we show that part this disagreement is due to the lack of vibrational entropy in the cluster expansion. When we include vibrational entropy, the qualitative agreement improves.

  author =       "Feiyang Geng and Jacob R. Boes and John R. Kitchin",
  title =        {First-Principles Study of the Cu-Pd Phase Diagram},
  journal =      "Calphad ",
  volume =       56,
  pages =        "224 - 229",
  year =         2017,
  doi =          {10.1016/j.calphad.2017.01.009},
  url =
  abstract =     "Abstract The equilibrium phase diagram of a Cu-Pd alloy has
                  been computed using cluster expansion and Monte Carlo
                  simulation methods combined with density functional theory.
                  The computed phase boundaries show basic features that are
                  consistent with the experimentally reported phase diagram.
                  Without vibrational free energy contributions, the
                  order-disorder transition temperature is underestimated by 100
                  K and the critical point is inconsistent with experimental
                  result. The addition of vibrational free energy contributions
                  yields a more qualitatively correct Cu-Pd phase diagram in the
                  Cu rich region. ",
  issn =         "0364-5916",
  keywords =     "Density functional theory",

Copyright (C) 2017 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 9.0.3

Read and Post Comments

New publication in Molecular Simulation

| categories: publication, news | tags: | View Comments

Molecules interact with each other when they adsorb on surfaces and these interactions are coverage dependent. Modeling these interactions is a challenge though, because there are many configurations of adsorbates on the surface, and the surface changes due to the interactions. To mitigate these challenges, one often simplifies the model, e.g. by using a cluster expansion or lattice gas Hamiltonian. These approaches have their own limitations though, and are not that useful for modeling dynamic processes like diffusion. Using molecular potentials enables the dynamic simulations, but not at the same level of accuracy as density functional theory. In this work we use density functional theory to train a neural network, and then use the neural network to model coverage-dependent adsorption isotherms and the dynamics of oxygen diffusion on a Pd(111) surface. We show the neural network can capture the onset of surface oxidation, and that the simulation results have comparable accuracy to the DFT calculations it was trained from.

  author =       {Jacob R. Boes and John R. Kitchin},
  title =        {Neural Network Predictions of Oxygen Interactions on a Dynamic
                  Pd Surface},
  journal =      {Molecular Simulation},
  pages =        {1-9},
  year =         2017,
  doi =          {10.1080/08927022.2016.1274984},
  url =          {https://doi.org/10.1080/08927022.2016.1274984},
  keywords =     {CBET-1506770},

Copyright (C) 2017 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 9.0.3

Read and Post Comments

Next Page ยป