## Timing Lennard-Jones implementations - ASE vs autograd

| categories: | tags:

In a comment on this post Konrad Hinsen asked if the autograd forces on a Lennard-Jones potential would be useable in production. I wasn't sure, and was suspicious that analytical force functions would be faster. It turns out to not be so simple. In this post, I attempt to do some timing experiments for comparison. These are tricky to do right, and in a meaningful way, so I will start by explaining what is tricky and why I think the results are meaningful.

The ASE calculators cache their results, and return the cached results after the first run. We will do these on a 13-atom icosahedron cluster.

from ase.calculators.lj import LennardJones
from ase.cluster.icosahedron import Icosahedron

atoms = Icosahedron('Ar', noshells=2, latticeconstant=3)
atoms.set_calculator(LennardJones())

import time
t0 = time.time()
print('energy: ', atoms.get_potential_energy())
print(' time: ', time.time() - t0)
print()

t0 = time.time()
print('energy: ', atoms.get_potential_energy())
print(' time: ', time.time() - t0)
print()

atoms.calc.results = {}
t0 = time.time()
print('energy: ', atoms.get_potential_energy())
print('time: ', time.time() - t0)

energy:  -1.25741774649
time:  0.0028629302978515625

energy:  -1.25741774649
time:  0.00078582763671875

energy:  -1.25741774649
time:  0.0031850337982177734



Note the approximate order of magnitude reduction in elapsed time for the second call. After that, we reset the calculator results, and the time goes back up. So, we have to incorporate that into our timing. Also, in the ASE calculator, the forces are simultaneously calculated. There isn't a way to separate the calls. I am going to use the timeit feature in Ipython for the timing. I don't have a lot of control over what else is running on my machine, so I have observed some variability in the timing results. Finally, I am running these on a MacBook Air.

%%timeit
atoms.get_potential_energy()
atoms.calc.results = {} # this resets it so it runs each time. Otherwise, we get cached results


1.46 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That seems like a surprisingly long time. If you neglect the calculator reset, it looks about 10 times faster because the cache lookup is fast!

Let's compare that to an implementation of the Lennard-Jones potential similar to the last time. This implementation differs from the first one I blogged about. This one is fully vectorized. It still does not support periodic boundary conditions though. This version may be up to 10 times faster than the previous version. I haven't tested this very well, I only assured it gives the same energy as ASE for the example in this post.

import autograd.numpy as np

def energy(positions):
"Compute the energy of a Lennard-Jones system."

sigma = 1.0
epsilon = 1.0
rc = 3 * sigma

e0 = 4 * epsilon * ((sigma / rc)**12 - (sigma / rc)**6)

natoms = len(positions)
# These are the pairs of indices we need to compute distances for.
a, b = np.triu_indices(natoms, 1)

d = positions[a] - positions[b]
r2 = np.sum(d**2, axis=1)
c6 = np.where(r2 <= rc**2, (sigma**2 / r2)**3, np.zeros_like(r2))

energy = -e0 * (c6 != 0.0).sum()
c12 = c6**2
energy += np.sum(4 * epsilon * (c12 - c6))

return energy

# Just to check we get the same answer
print(energy(atoms.positions))


-1.25741774649

The energy looks good. For timing, we store the positions in a variable, in case there is any lookup time, since this function only needs an array.

pos = atoms.positions


There is no caching to worry about here, we can just get the timing.

%%timeit
energy(pos)


82.2 µs ± 2.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Wow, that is a lot faster than the ASE implementation. Score one for vectorization.

print('Vectorized is {0:1.1f} times faster than ASE at energy.'.format(1.46e-3 / 82.5e-6))

Vectorized is 17.7 times faster than ASE at energy.



Yep, a fully vectorized implementation is a lot faster than the ASE version which uses loops. So far the difference has nothing to do with autograd.

## 1 Timing on the forces

The forces are where derivatives are important, and it is a reasonable question of whether hand-coded derivatives are faster or slower than autograd derivatives. We first look at the forces from ASE. The analytical forces take about the same time as the energy, which is not surprising. The same work is done for both of them.

np.set_printoptions(precision=3, suppress=True)
print(atoms.get_forces())

[[-0.    -0.    -0.   ]
[-0.296 -0.     0.183]
[-0.296 -0.    -0.183]
[ 0.296 -0.     0.183]
[ 0.296 -0.    -0.183]
[ 0.183 -0.296 -0.   ]
[-0.183 -0.296  0.   ]
[ 0.183  0.296 -0.   ]
[-0.183  0.296  0.   ]
[-0.     0.183 -0.296]
[ 0.    -0.183 -0.296]
[-0.     0.183  0.296]
[ 0.    -0.183  0.296]]


%%timeit
atoms.get_forces()
atoms.calc.results = {}

1.22 ms ± 38.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



Here is our auto-differentiated force function.

from autograd import elementwise_grad

def forces(pos):
return -dEdR(pos)


Let's just check the forces for consistency.

print(forces(atoms.positions))

print(np.allclose(forces(atoms.positions), atoms.get_forces()))

[[-0.    -0.    -0.   ]
[-0.296 -0.     0.183]
[-0.296 -0.    -0.183]
[ 0.296 -0.     0.183]
[ 0.296 -0.    -0.183]
[ 0.183 -0.296 -0.   ]
[-0.183 -0.296  0.   ]
[ 0.183  0.296 -0.   ]
[-0.183  0.296  0.   ]
[-0.     0.183 -0.296]
[ 0.    -0.183 -0.296]
[-0.     0.183  0.296]
[ 0.    -0.183  0.296]]
True



Those all look the same, so now performance for that:

%%timeit

forces(pos)

727 µs ± 47.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



This is faster than the ASE version. I suspect that it is largely because of the faster, vectorized algorithm overall.

print('autograd is {0:1.1f} times faster than ASE on forces.'.format(1.22e-3 / 727e-6))

autograd is 1.7 times faster than ASE on forces.



autograd forces are consistently 2-6 times faster than the ASE implementation. It could be possible to hand-code a faster function for the forces, if it was fully vectorized. I spent a while seeing what would be required for that, and it is not obvious how to do that. Any solution that uses loops will be slower I think.

This doesn't directly answer the question of whether this can work in production. Everything is still written in Python here, which might limit the size and length of calculations you can practically do. With the right implementation though, it looks promising.