Caching expensive function calls so you don't have to rerun them
Posted February 01, 2023 at 08:09 PM | categories: python | tags:
Updated February 01, 2023 at 08:11 PM
Table of Contents
Check out the video at:
Nobody likes to run expensive jobs more than necessary, so cache solutions are often used where you save the results, and look them up later. There is functools.cache in Python, but it is memory only, and not persistent, so you start over in a new session.
For persistent cache, joblib (https://joblib.readthedocs.io/en/latest/) is a standard tool for this. Here is a simple example:
from joblib import Memory location = '/Users/jkitchin/Dropbox/emacs/journal/2023/02/01/joblib_cache/joblib_cache' memory = Memory(location, verbose=0) @memory.cache def fun(x=1.0): print('If you see this, go get a coffee while it runs 🐌.') return x**2 print(fun(2)) # Runs the function print(fun(2)) # Looks up the cached value
If you see this, go get a coffee while it runs 🐌. 4 4
That works because joblib saves the results in a file in the location you specify.
Here is another example with another arg.
@memory.cache def f2(x=1.0, a=3): print(f'If you see this, go get a coffee while it runs. a={"🐌"*a}') return a*x**2 (f2(2), # Runs function f2(2, a=3), # calls cache f2(2, 4)) # Runs another function because a changed
If you see this, go get a coffee while it runs. a=🐌🐌🐌 If you see this, go get a coffee while it runs. a=🐌🐌🐌🐌
12 | 12 | 16 |
Here, we look up from the cache each time.
(f2(2), f2(2, a=3), f2(2, 4))
12 | 12 | 16 |
1. where things start to go wrong
1.1. Global variables
First, we look at an uncached version of a function that uses a global variable.
a = 3 def f3(x=1.0): return a*x**2 f3(2)
12
We can change a
and see the change.
a=0
f3(2)
0
Now we look at a cached version.
a = 3 @memory.cache def f4(x=1.0): print('If you see this, go get a coffee while it runs 🐌.') return a*x**2 (f4(2), f4(2), f4(2))
If you see this, go get a coffee while it runs 🐌.
12 | 12 | 12 |
Changing the global variable does not change the cache though. uh oh. This is just wrong. The answers should clearly be 0. Incorrect cache invalidation strikes.
a = 0
(f4(2), f4(2), f4(2))
12 | 12 | 12 |
1.2. running functions with mutable arguments
Using mutable arguments is a recipe for trouble and unanticipated problems, but it is easy to unintentionally do, and not always obvious, as I show here.
from ase.build import bulk from ase.calculators.emt import EMT atoms = bulk('Pd') atoms.set_calculator(EMT()) @memory.cache def f(atoms): print('If you see this, go get a coffee.') return atoms.get_potential_energy() (f(atoms), f(atoms))
If you see this, go get a coffee. If you see this, go get a coffee.
0.0003422625372841992 | 0.0003422625372841992 |
You can see this ran twice. The reason is that the atoms object was mutated by adding data onto it. Here is how I know:
import joblib atoms = bulk('Pd') atoms.set_calculator(EMT()) joblib.hash(atoms), atoms._calc.results
ee2ed2eb9fdb4b3d6416803a33f43a22 | nil |
Here you can see that simply running the get energy function the hash changes because the results dictionary on the calculator changes. That means subsequent uses of the atoms object will have a different hash, and you cannot rely on that to look up the results. In this case the results should not change the output of the function, but since they are included in the hash, it incorrectly invalidates the hash.
atoms.get_potential_energy()
joblib.hash(atoms), atoms._calc.results
d37ef0a5761f499060b4f55bdf644814 | (energy : 0.0003422625372841992 energies : array ((0.00034226)) free_energy : 0.0003422625372841992 forces : array (((0 0 0)))) |
Suffice to say, this is non-obvious, but having seen it, not a surprise; mutable arguments are frequently a source of problems.
1.3. If you run the same function different ways, the cache is not reused
Some aspects of this are specific to org-mode and how scripts are run in it. Here we have to use an absolute path to make sure we use the right cache. That still doesn't solve the problem though as we will see.
from joblib import Memory location = '/Users/jkitchin/Dropbox/emacs/journal/2023/02/01/joblib_cache/joblib_cache' memory = Memory(location, verbose=0) a = 3 @memory.cache def f4(x=1.0): print('If you see this, go get a coffee while it runs') return a*x**2 print((f4(2), f4(2), f4(2)))
The issue is that joblib uses the file name it thinks the function is from in the path it saves the results. The filename is different
1.4. Fragile cache invalidation
joblib uses the function source in its hash. That means any change to the source, including the function name, renaming variables, whitespace, comments or docstring changes invalidates the hash even though they may have no change in the output. That is an overabundance of caution, but simple to implement.
@memory.cache def f4(x=1.0): 'add a ds.' # comment print('If you see this, go get a coffee while it runs') return a*x**2 print((f4(2), f4(2), f4(2)))
If you see this, go get a coffee while it runs (0, 0, 0)
2. Some partial solutions with pycse.hashcache
I wrote hashcache
to solve some of these problems. It is actually built on top of joblib.
import pycse
pycse.__version__, pycse.__file__
2.2.1 | /Users/jkitchin/Dropbox/python/pycse/pycse/__init__.py |
from pycse.hashcache import hashcache hashcache.location = "/Users/jkitchin/Dropbox/emacs/journal/2023/02/01/cache" hashcache.verbose = False @hashcache def h1(x): print('This runs soo slow... Go get a coffee') return x**2 h1(2), h1(2)
4 | 4 |
2.1. No known problem with global variables
a = 3 @hashcache def h4(x=1.0): print('If you see this, go get a coffee while it runs') return a*x**2 (h4(2), h4(2), h4(2))
If you see this, go get a coffee while it runs
12 | 12 | 12 |
a=0
(h4(2), h4(2), h4(2))
0 | 0 | 0 |
Whew!!! we got the right answers. hashcache does a better job detecting the external change.
2.2. hashcache and mutable arguments
hashcache does not solve the mutable argument problem, but, it does warn you it detected it.
from ase.build import bulk from ase.calculators.emt import EMT atoms = bulk('Pd') atoms.set_calculator(EMT()) @hashcache def h(atoms): print('If you see this, go get a coffee.') return atoms.get_potential_energy() (h(atoms), h(atoms), h(atoms))
0.0003422625372841992 | 0.0003422625372841992 | 0.0003422625372841992 |
2.3. Reuse the cache when you run different ways
hashcache uses the same cache at the function and function environment level, so it avoids reruns even from different places. It is a judgement call by you to say if this is the right thing to do.
print(h1(2), h1(2))
4 4
from pycse.hashcache import hashcache hashcache.location = "/Users/jkitchin/Dropbox/emacs/journal/2023/02/01/cache" @hashcache def h1(x): print('This runs soo slow... Go get a coffee') return x**2 print(h1(2), h1(2))
2.4. Insensitivity to unimportant changes
Instead of hashing the source of the function, in hashcache I hash the bytecode instead. This is certainly less sensitive to unimportant changes like docstrings, comments or whitespace. I do use the function name in the hash, so even though that does not affect the output, I thought it might be confusing in the future.
Here, small changes like comments, docstrings, etc, don't affect the hash.
a = 3 @hashcache def h4(x=1.0): 'doc string' # comments print('If you see this, go get a coffee while it runs') return a*x**2 (h4(2), h4(2), h4(2))
12 | 12 | 12 |
3. Is it the answer?
Probably not completely. It is almost certain I have not captured all the ways the cache should be invalidated, or when a new cache should be used. hashcache is for now, a proof of concept in understanding why this is a hard problem to solve. I prefer its behavior over the defaults in joblib so far though.
Copyright (C) 2023 by John Kitchin. See the License for information about copying.
Org-mode version = 9.5.5