New publication - Applying Large Graph Neural Networks to Predict Transition Metal Complex Energies Using the tmQM_wB97MV Data Set

| categories: publication, news | tags:

In this work, we show that we can use large graph neural networks to predict transition metal complex energies. We developed an improved dataset at a higher level of theory, and tested models ranging from GemNet-T (best) to SchNet (worst). The model performance saturates with the size of neutral structures, and improves with increasing size of charged structures. Finally, we showed that a pre-trained model from OC20 was even better than training from scratch. This indicates a degree of transferability from heterogeneous catalyst models to homogeneous molecular catalysts.

@article{garrison-2023-apply-large,
  author =       {Garrison, Aaron G. and Heras-Domingo, Javier and Kitchin, John
                  R. and dos Passos Gomes, Gabriel and Ulissi, Zachary W. and
                  Blau, Samuel M.},
  title =        {Applying Large Graph Neural Networks To Predict Transition
                  Metal Complex Energies Using the tmQM\_wB97MV Data Set},
  journal =      {Journal of Chemical Information and Modeling},
  volume =       0,
  number =       0,
  pages =        {null},
  year =         2023,
  doi =          {10.1021/acs.jcim.3c01226},
  URL =          {https://doi.org/10.1021/acs.jcim.3c01226},
  eprint =       {https://doi.org/10.1021/acs.jcim.3c01226},
  note =         {PMID: 38049389},
}

Copyright (C) 2023 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 9.7-pre

Discuss on Twitter

New publication - Chemical Properties from Graph Neural Network-Predicted Electron Densities

| categories: publication, news | tags:

The electron density is one of the most important quantities we use DFT to calculate. It is the foundation of how we compute energy, forces, and many other properties. DFT is expensive though, so in this work we show that we can build a graph neural network that can be used to predict electron densities directly from the atomic coordinates of a system. We show that the predicted densities can also be used to estimate dipole moments and Bader charges.

@article{sunshine-2023-chemic-proper,
  author =       {Sunshine, Ethan M. and Shuaibi, Muhammed and Ulissi, Zachary
                  W. and Kitchin, John R.},
  title =        {Chemical Properties From Graph Neural Network-Predicted
                  Electron Densities},
  journal =      {The Journal of Physical Chemistry C},
  volume =       0,
  number =       0,
  pages =        {null},
  year =         2023,
  doi =          {10.1021/acs.jpcc.3c06157},
  url =          {https://doi.org/10.1021/acs.jpcc.3c06157},
  eprint =       {https://doi.org/10.1021/acs.jpcc.3c06157},
}

Copyright (C) 2023 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 9.7-pre

Discuss on Twitter

New publication - Beyond Independent Error Assumptions in Large GNN Atomistic Models

| categories: publication, news | tags:

In this work we show that prediction errors from graph neural networks for related atomistic systems tend to be correlated, and as a result the differences in energy are more accurate than the absolute energies. This is similar to what is observed in DFT calculations where systematic errors also cancel in differences. We show this quantitatively through differences of systems with systematically different levels of similarity. This article is part of the Special Collection: 2023 JCP Emerging Investigators Special Collection.

@article{ock-2023-beyon-indep,
  author =       {Janghoon Ock and Tian Tian and John Kitchin and Zachary
                  Ulissi},
  title =        {Beyond Independent Error Assumptions in Large {GNN} Atomistic
                  Models},
  journal =      {The Journal of Chemical Physics},
  volume =       158,
  number =       21,
  pages =        {nil},
  year =         2023,
  doi =          {10.1063/5.0151159},
  url =          {http://dx.doi.org/10.1063/5.0151159},
  DATE_ADDED =   {Sun Sep 24 14:53:46 2023},
}

Copyright (C) 2023 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 9.7-pre

Discuss on Twitter

Adding new backends to hashcache

| categories: programming | tags:

I have been working on hashcache to make it more flexible. I like the base functionality that uses the filesystem for caching. That still works.

Here I set up a timeit decorator to show how this works.

from pycse.hashcache import hashcache
import time

!rm -fr ./cache

def timeit(func):
    def wrapper(*args, **kwargs):
        t0 = time.time()
        res = func(*args, **kwargs)
        print(f'Elapsed time = {time.time() - t0}s')
        return res
    return wrapper

Now we decorate a function that is "expensive". The first time we run it, it takes a long time.

@timeit
@hashcache
def expensive_func(x):
    time.sleep(3)
    return x

expensive_func(2)

Elapsed time = 3.007030963897705s 2

The second time is very fast, since we just look it up.

expensive_func(2)

Elapsed time = 0.0012097358703613281s 2

Where did we look it up from? It is stored on disk. You can see where by adding a verbose option to the decorator. This shows you all the data that was stored in the cache.

@hashcache(verbose=True)
def expensive_func(x):
    time.sleep(3)
    return x

expensive_func(2)

{ 'args': (2,), 'cwd': '/Users/jkitchin/Dropbox/emacs/journal/2023/09/23', 'elapsed_time': 3.0048787593841553, 'func': 'expensive_func', 'hash': 'b5436cc21714a7ea619729cc9768b8c5b3a03307', 'kwargs': {}, 'module': 'main', 'output': 2, 'run-at': 1695572717.2020931, 'run-at-human': 'Sun Sep 24 12:25:17 2023', 'standardized-kwargs': {'x': 2}, 'user': 'jkitchin', 'version': '0.0.2'} 2

1. Alternative backends for hashcache

The file system is an amazing cache with many benefits. There are few reasons you might like something different though. For example, it is slow to search if you have to iterate over all the directories and read the files, and it might be slow to sync lots of directories to another place.

hashcache is more flexible now, so you can define the functions that load and dump the cache. Here we use lmdb as a key-value database. lmdb expects the keys and values to be bytes, so we do some tricks with io.BytesIO to get these as strings from joblib.dump which expects to write to a file.

The load function has the signature (hash, verbose), and the dump function has the signature (hash, data, verbose). In both cases, hash will be a string for the key to save data in. data will be a dictionary that should be saved in a way that it can be reloaded. verbose is a flag that you can ignore or use to provide some kind of logging.

from pycse.hashcache import hashcache

import io, joblib, lmdb

def lmdb_dump(hsh, data, verbose=False):
    if verbose:
        print('running lmdb_dump')
    with io.BytesIO() as f:
        joblib.dump(data, f)
        value = f.getvalue()

    db = lmdb.Environment(hashcache.cache)
    with db.begin(write=True) as txn:
        txn.put(hsh.encode('utf-8'), value)

def lmdb_load(hsh, verbose=False):
    if verbose:
        print('running lmdb_load')
    db = lmdb.Environment(hashcache.cache)
    with db.begin() as txn:
        val = txn.get(hsh.encode('utf-8'))
        if val is None:
            return False, None
        else:
            return True, joblib.load(io.BytesIO(val))['output']
                                    
! rm -fr cache.lmdb

hashcache.cache = 'cache.lmdb'


@hashcache(loader=lmdb_load, dumper=lmdb_dump, verbose=True)
def f(x):
    return x

f(2)   

running lmdb_load running lmdb_dump 2

And we can recall the result as easily.

f(2)

running lmdb_load 2

2. a shelve version

Maybe you prefer a built in library like shelve. This is also quite simple.

from pycse.hashcache import hashcache

import io, joblib, shelve

def shlv_dump(hsh, data, verbose=False):
    print('running shlv_dump')
    with io.BytesIO() as f:
        joblib.dump(data, f)
        value = f.getvalue()

    with shelve.open(hashcache.cache) as db:
        db[hsh] = value

def shlv_load(hsh, verbose=False):
    print('running shlv_load')
    with shelve.open(hashcache.cache) as db:
        if hsh in db:
            return True, joblib.load(io.BytesIO(db[hsh]))['output']
        else:
            return False, None

hashcache.cache = 'cache.shlv'
! rm -f cache.shlv.db

@hashcache(loader=shlv_load, dumper=shlv_dump)
def f(x):
    return x

f(2)
    

running shlv_load running shlv_dump 2

And again loading is easy.

f(2)

running shlv_load 2

3. sqlite version

I am a big fan of sqlite. Here I use a simple table mapping a key to a value. I think it could be interesting to consider storing the value as json that would make it more searchable, or you could make a more complex table, but here we keep it simple.

from pycse.hashcache import hashcache

import io, joblib, sqlite3

def sql_dump(hsh, data, verbose=False):
    print('running sql_dump')
    with io.BytesIO() as f:
        joblib.dump(data, f)
        value = f.getvalue()

    with con:
        con.execute("INSERT INTO cache(hash, value) VALUES(?, ?)",
                    (hsh, value))

def sql_load(hsh, verbose=False):
    print('running sql_load')
    with con:        
        cur = con.execute("SELECT value FROM cache WHERE hash = ?",
                          (hsh,))
        value = cur.fetchone()
        if value is None:
            return False, None
        else:
            return True, joblib.load(io.BytesIO(value[0]))['output']

! rm -f cache.sql
hashcache.cache = 'cache.sql'
con = sqlite3.connect(hashcache.cache)
con.execute("CREATE TABLE cache(hash TEXT unique, value BLOB)")
        
@hashcache(loader=sql_load, dumper=sql_dump)
def f(x):
    return x

f(2)    

running sql_load running sql_dump 2

Once again, running is easy.

f(2)

running sql_load 2

4. redis

Finally, you might like a server to cache in. This opens the door to running the server remotely so it is accessible by multiple processes using the cache on different machines. We use redis for this example, but only run it locally. Make sure you run redis-server --daemonize yes

from pycse.hashcache import hashcache

import io, joblib, redis

db = redis.Redis(host='localhost', port=6379)

def redis_dump(hsh, data, verbose=False):
    print('running redis_dump')
    with io.BytesIO() as f:
        joblib.dump(data, f)
        value = f.getvalue()

    db.set(hsh, value)

def redis_load(hsh, verbose=False):
    print('running redis_load')
    if not hsh in db:
        return False, None
    else:
        return True, joblib.load(io.BytesIO(db.get(hsh)))['output']

    
import functools    
hashcache_redis = functools.partial(hashcache,
                                    loader=redis_load,
                                    dumper=redis_dump)    

@hashcache_redis
def f(x):
    return x

f(2)    

running redis_load running redis_dump 2

No surprise here, loading is the same as before.

f(2)

running redis_load 2

5. Summary

I have refactored hashcache to make it much easier to add new backends. You might do that for performance, ease of backup or transferability, to add new capabilities for searching, etc. The new code is a little cleaner than it was before IMO. I am not sure it is API-stable yet, but it is getting there.

Copyright (C) 2023 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 9.7-pre

Discuss on Twitter

A better manager for supervising Python functions

| categories: programming | tags:

In the previous post I introduced a supervisor decorator to automate rerunning functions with new arguments to fix issues in them. Almost immediately after posting it, two things started bugging me. First, I thought it was annoying to have two separate arguments for results and exceptions. I would prefer one list of functions that do the right thing. Second, and most annoying, you have to be very careful in writing your checker functions to be consistent with how you called the function so you use exactly the same positional and keyword arguments. That is tedious and limits reusability/flexibility.

So, I wrote a new manager decorator that solves these two problems. Now, you can write checker functions that work on all the arguments of a function. You decorate the checker functions to indicate if they are for results or exceptions. This was a little more of a rabbit hole than I anticipated, but I persevered, and got to a solution that works for these examples. You can find all the code here.

Here is an example where we have a test function that we want to run with new arguments until we get a positive result. We start in a way that it is possible to get a ZeroDivisionError, and we handle that too.

from pycse.supyrvisor import manager, check_result, check_exception

@check_exception
def check1(args, exc):
    if isinstance(exc, ZeroDivisionError):
        print('ooo. caught 1/0, incrementing x')
        return {'x': 1}

@check_result
def check2(args, result):
    print(args)
    if result < 0:
        args['x'] += 1
        return args
        

@manager(checkers=[check1, check2])
def test(x, a=1):
    return a / x

test(-1)

{'x': -1, 'a': 1} ooo. caught 1/0, incrementing x {'x': 1} 1.0

This also works, so you can see this is better than the previous version which would not work if you change the signature.

test(a=1, x=-1)

{'x': -1, 'a': 1} ooo. caught 1/0, incrementing x {'x': 1} 1.0

1. The previous examples with manager

Here is the new syntax with manager.

import numpy as np
from scipy.optimize import minimize

def objective(x):
    return np.exp(x**2) - 10 * np.exp(x)


@check_result
def maxIterationsExceeded(args, sol):
    if sol.message == 'Maximum number of iterations has been exceeded.':
        args['maxiter'] *= 2
        return args

@manager(checkers=[maxIterationsExceeded], verbose=True)
def get_result(maxiter=2):
    return minimize(objective, 0.0, options={'maxiter': maxiter})

get_result(2)
Proposed fix in wrapper: {'maxiter': 4}
Proposed fix in wrapper: {'maxiter': 8}
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: -36.86307468296428
        x: [ 1.662e+00]
      nit: 5
      jac: [-4.768e-07]
 hess_inv: [[ 6.481e-03]]
     nfev: 26
     njev: 13

It works!

2. Stateful supervision

In this example, we aim to find the steady state concentrations of two species by integrating a mass balance to steady state. This is visually easy to see below, the concentrations are essentially flat after 10 min or so. Computationally this is somewhat tricky to find though. A way to do it is to compare some windows of integration to see if the values are not changing very fast. For instance you could average the values from 10 to 11, and compare that to the values in 11 to 12, and keep doing that until they are close enough to the same.

def ode(t, C):
    Ca, Cb = C
    dCadt = -0.2 * Ca + 0.3 * Cb
    dCbdt = -0.3 * Cb + 0.2 * Ca
    return dCadt, dCbdt

tspan = (0, 20)

from scipy.integrate import solve_ivp
sol = solve_ivp(ode, tspan, (1, 0))

import matplotlib.pyplot as plt
plt.plot(sol.t, sol.y.T)
plt.xlabel('t (min)')
plt.ylabel('C')
plt.legend(['A', 'B']);
sol.y.T[-1]
array([0.60003278, 0.39996722])

It is not crucial to use a class here; you could also use global variables, or function attributes. A class is a standard way of encapsulating state though. We just have to make the class callable so it acts like a function when we need it to.

class ReachedSteadyState:        
    def __init__(self, tolerance=0.01):
        self.tolerance = tolerance
        self.last_solution = None
        self.count = 0

    def __str__(self):
        return 'ReachedSteadyState'

    @check_result
    def __call__(self, args, sol):
        if self.last_solution is None:
            self.last_solution = sol
            self.count += 1
            args['C0'] = sol.y.T[-1]
            return args

        # we have a previous solution
        if not np.allclose(self.last_solution.y.mean(axis=1),
                           sol.y.mean(axis=1),
                           rtol=self.tolerance,
                           atol=self.tolerance):
            self.last_solution = sol
            self.count += 1
            args['C0'] = sol.y.T[-1]
            return args

rss = ReachedSteadyState(0.0001)

@manager(checkers=[rss], max_errors=20, verbose=True)        
def get_sol(C0=(1, 0), window=1):
    sol = solve_ivp(ode, t_span=(0, window), y0=C0)
    return sol

sol = get_sol((1, 0), window=2)
sol

Proposed fix in ReachedSteadyState: {'C0': array([0.74716948, 0.25283052]), 'window': 2} Proposed fix in ReachedSteadyState: {'C0': array([0.65414484, 0.34585516]), 'window': 2} Proposed fix in ReachedSteadyState: {'C0': array([0.61992776, 0.38007224]), 'window': 2} Proposed fix in ReachedSteadyState: {'C0': array([0.60733496, 0.39266504]), 'window': 2} Proposed fix in ReachedSteadyState: {'C0': array([0.60269957, 0.39730043]), 'window': 2} Proposed fix in ReachedSteadyState: {'C0': array([0.60099346, 0.39900654]), 'window': 2} Proposed fix in ReachedSteadyState: {'C0': array([0.60036557, 0.39963443]), 'window': 2} Proposed fix in ReachedSteadyState: {'C0': array([0.60013451, 0.39986549]), 'window': 2} Proposed fix in ReachedSteadyState: {'C0': array([0.60004949, 0.39995051]), 'window': 2}

  message: The solver successfully reached the end of the integration interval.
  success: True
   status: 0
        t: [ 0.000e+00  7.179e-01  2.000e+00]
        y: [[ 6.000e-01  6.000e-01  6.000e-01]
            [ 4.000e-01  4.000e-01  4.000e-01]]
      sol: None
 t_events: None
 y_events: None
     nfev: 14
     njev: 0
      nlu: 0

We can plot the two solutions to see how different they are. This shows they are close.

import matplotlib.pyplot as plt
plt.plot(rss.last_solution.t, rss.last_solution.y.T, label=['A previous' ,'B previous'])
plt.plot(sol.t, sol.y.T, '--', label=['A current', 'B current'])
plt.legend()
plt.xlabel('relative t')
plt.ylabel('C');

Those look pretty similar on this graph.

3. Handling exceptions

Suppose you have a function that randomly fails. This could be because something does not converge with a randomly chosen initial guess, converges to an unphysical answer, etc. In these cases, it makes sense to simply try again with a new initial guess.

For this example, say we have this objective function with two minima. We will say that any solution above 0.5 is unphysical.

def f(x):
    return -(np.exp(-50 * (x - 0.25)**2) + 0.5 * np.exp(-100 * (x - 0.75)**2))


x = np.linspace(0, 1)
plt.plot(x, f(x))
plt.xlabel('x')
plt.ylabel('y');

Here we define a function that takes a guess, and gets a solution. If the solution is unphysical, we raise an exception. We define a custom exception so we can handle it specifically.

class UnphysicalSolution(Exception):
    pass

def get_minima(guess):
    sol = minimize(f, guess)

    if sol.x > 0.5:
        raise UnphysicalSolution
    else:
        return sol

@check_exception
def try_again(args, exc):
    if isinstance(exc, UnphysicalSolution):
        args['guess'] = np.random.random()
        return args
  
@manager(checkers=(try_again,), verbose=True)    
def get_minima(guess):
    sol = minimize(f, guess)

    if sol.x > 0.5:
        raise UnphysicalSolution
    else:
        return sol

get_minima(np.random.random())
Proposed fix in wrapper: {'guess': 0.03789731690063758}
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: -1.0000000000069411
        x: [ 2.500e-01]
      nit: 4
      jac: [ 0.000e+00]
 hess_inv: [[ 1.000e-02]]
     nfev: 18
     njev: 9

You can see it took four iterations to find a solution. Other times it might take zero or one, or maybe more, it depends on where the guesses fall.

4. Summary

This solution works as well as supervisor did. It was a little deeper rabbit hole to go down, mostly because of some subtlety in making the result and exception decorators work for both functions and class methods. I think it is more robust now, as it should not matter how you call the function, and any combination of args and kwargs should be working.

Copyright (C) 2023 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 9.7-pre

Discuss on Twitter
« Previous Page -- Next Page »