Adding new backends to hashcache

| categories: programming | tags:

I have been working on hashcache to make it more flexible. I like the base functionality that uses the filesystem for caching. That still works.

Here I set up a timeit decorator to show how this works.

from pycse.hashcache import hashcache
import time

!rm -fr ./cache

def timeit(func):
    def wrapper(*args, **kwargs):
        t0 = time.time()
        res = func(*args, **kwargs)
        print(f'Elapsed time = {time.time() - t0}s')
        return res
    return wrapper

Now we decorate a function that is "expensive". The first time we run it, it takes a long time.

@timeit
@hashcache
def expensive_func(x):
    time.sleep(3)
    return x

expensive_func(2)

Elapsed time = 3.007030963897705s 2

The second time is very fast, since we just look it up.

expensive_func(2)

Elapsed time = 0.0012097358703613281s 2

Where did we look it up from? It is stored on disk. You can see where by adding a verbose option to the decorator. This shows you all the data that was stored in the cache.

@hashcache(verbose=True)
def expensive_func(x):
    time.sleep(3)
    return x

expensive_func(2)

{ 'args': (2,), 'cwd': '/Users/jkitchin/Dropbox/emacs/journal/2023/09/23', 'elapsed_time': 3.0048787593841553, 'func': 'expensive_func', 'hash': 'b5436cc21714a7ea619729cc9768b8c5b3a03307', 'kwargs': {}, 'module': 'main', 'output': 2, 'run-at': 1695572717.2020931, 'run-at-human': 'Sun Sep 24 12:25:17 2023', 'standardized-kwargs': {'x': 2}, 'user': 'jkitchin', 'version': '0.0.2'} 2

1. Alternative backends for hashcache

The file system is an amazing cache with many benefits. There are few reasons you might like something different though. For example, it is slow to search if you have to iterate over all the directories and read the files, and it might be slow to sync lots of directories to another place.

hashcache is more flexible now, so you can define the functions that load and dump the cache. Here we use lmdb as a key-value database. lmdb expects the keys and values to be bytes, so we do some tricks with io.BytesIO to get these as strings from joblib.dump which expects to write to a file.

The load function has the signature (hash, verbose), and the dump function has the signature (hash, data, verbose). In both cases, hash will be a string for the key to save data in. data will be a dictionary that should be saved in a way that it can be reloaded. verbose is a flag that you can ignore or use to provide some kind of logging.

from pycse.hashcache import hashcache

import io, joblib, lmdb

def lmdb_dump(hsh, data, verbose=False):
    if verbose:
        print('running lmdb_dump')
    with io.BytesIO() as f:
        joblib.dump(data, f)
        value = f.getvalue()

    db = lmdb.Environment(hashcache.cache)
    with db.begin(write=True) as txn:
        txn.put(hsh.encode('utf-8'), value)

def lmdb_load(hsh, verbose=False):
    if verbose:
        print('running lmdb_load')
    db = lmdb.Environment(hashcache.cache)
    with db.begin() as txn:
        val = txn.get(hsh.encode('utf-8'))
        if val is None:
            return False, None
        else:
            return True, joblib.load(io.BytesIO(val))['output']
                                    
! rm -fr cache.lmdb

hashcache.cache = 'cache.lmdb'


@hashcache(loader=lmdb_load, dumper=lmdb_dump, verbose=True)
def f(x):
    return x

f(2)   

running lmdb_load running lmdb_dump 2

And we can recall the result as easily.

f(2)

running lmdb_load 2

2. a shelve version

Maybe you prefer a built in library like shelve. This is also quite simple.

from pycse.hashcache import hashcache

import io, joblib, shelve

def shlv_dump(hsh, data, verbose=False):
    print('running shlv_dump')
    with io.BytesIO() as f:
        joblib.dump(data, f)
        value = f.getvalue()

    with shelve.open(hashcache.cache) as db:
        db[hsh] = value

def shlv_load(hsh, verbose=False):
    print('running shlv_load')
    with shelve.open(hashcache.cache) as db:
        if hsh in db:
            return True, joblib.load(io.BytesIO(db[hsh]))['output']
        else:
            return False, None

hashcache.cache = 'cache.shlv'
! rm -f cache.shlv.db

@hashcache(loader=shlv_load, dumper=shlv_dump)
def f(x):
    return x

f(2)
    

running shlv_load running shlv_dump 2

And again loading is easy.

f(2)

running shlv_load 2

3. sqlite version

I am a big fan of sqlite. Here I use a simple table mapping a key to a value. I think it could be interesting to consider storing the value as json that would make it more searchable, or you could make a more complex table, but here we keep it simple.

from pycse.hashcache import hashcache

import io, joblib, sqlite3

def sql_dump(hsh, data, verbose=False):
    print('running sql_dump')
    with io.BytesIO() as f:
        joblib.dump(data, f)
        value = f.getvalue()

    with con:
        con.execute("INSERT INTO cache(hash, value) VALUES(?, ?)",
                    (hsh, value))

def sql_load(hsh, verbose=False):
    print('running sql_load')
    with con:        
        cur = con.execute("SELECT value FROM cache WHERE hash = ?",
                          (hsh,))
        value = cur.fetchone()
        if value is None:
            return False, None
        else:
            return True, joblib.load(io.BytesIO(value[0]))['output']

! rm -f cache.sql
hashcache.cache = 'cache.sql'
con = sqlite3.connect(hashcache.cache)
con.execute("CREATE TABLE cache(hash TEXT unique, value BLOB)")
        
@hashcache(loader=sql_load, dumper=sql_dump)
def f(x):
    return x

f(2)    

running sql_load running sql_dump 2

Once again, running is easy.

f(2)

running sql_load 2

4. redis

Finally, you might like a server to cache in. This opens the door to running the server remotely so it is accessible by multiple processes using the cache on different machines. We use redis for this example, but only run it locally. Make sure you run redis-server --daemonize yes

from pycse.hashcache import hashcache

import io, joblib, redis

db = redis.Redis(host='localhost', port=6379)

def redis_dump(hsh, data, verbose=False):
    print('running redis_dump')
    with io.BytesIO() as f:
        joblib.dump(data, f)
        value = f.getvalue()

    db.set(hsh, value)

def redis_load(hsh, verbose=False):
    print('running redis_load')
    if not hsh in db:
        return False, None
    else:
        return True, joblib.load(io.BytesIO(db.get(hsh)))['output']

    
import functools    
hashcache_redis = functools.partial(hashcache,
                                    loader=redis_load,
                                    dumper=redis_dump)    

@hashcache_redis
def f(x):
    return x

f(2)    

running redis_load running redis_dump 2

No surprise here, loading is the same as before.

f(2)

running redis_load 2

5. Summary

I have refactored hashcache to make it much easier to add new backends. You might do that for performance, ease of backup or transferability, to add new capabilities for searching, etc. The new code is a little cleaner than it was before IMO. I am not sure it is API-stable yet, but it is getting there.

Copyright (C) 2023 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 9.7-pre

Discuss on Twitter