Generating an atomic stoichiometric matrix

| categories: thermodynamics, python | tags:

In computing thermodynamic properties with species, it is sometimes required to get a matrix that specifies number of each type of atom in each species. For example, we can create this by hand as follows:

  H2O CO2 H2 CO
H 2 0 2 0
C 0 1 0 1
O 1 2 0 1

Here we aim to generate this table from code. Why? 1. We can readily add species to it if we do it right. 2. We are less likely to make mistakes in generation of the table, and if we do, it will be faster to regenerate the table.

We will start with a list of strings that represent the chemical formula of each species. We will need to parse the strings to find the elements, and number of them. We will use a fairly naive regular expression to parse a chemical formula. Basically, we match a capital letter + an optional lowercase letter, followed by an optional number. Here is a fictitous example to illustrate. Note, this will not work with formulas that have parentheses, or charges.

import re
m = re.findall('([A-Z][a-z]?)(\d?)' , 'ArC2H6Cu56Pd47Co')
print m
[('Ar', ''), ('C', '2'), ('H', '6'), ('Cu', '5'), ('Pd', '4'), ('Co', '')]

Now, we need to loop over the species, and collect all the elements in them. We will just make a list of all of the elments, and then get the set.

import re

# save for future use
cf = re.compile('([A-Z][a-z]?)(\d?)')

species = ['H2O', 'CO2', 'H2', 'CO2']

all_elements = []

for s in species:
    for el, count in re.findall(cf, s):
        all_elements += [el]

print set(all_elements)
set(['H', 'C', 'O'])

Finally, we can create the table. We need to loop through each element, and then through each species

import re

# save for future use
cf = re.compile('([A-Z][a-z]?)(\d?)')

species = ['H2O', 'CO2', 'H2', 'CO2']

all_elements = []

for s in species:
    for el, count in re.findall(cf, s):
        all_elements += [el]

atoms = set(all_elements)

# we put a placeholder in the first row
counts = [[""] + species]
for e in atoms:
    # store the element in the first column
    count = [e]
    for s in species:    
        d = dict(re.findall(cf, s))
        n = d.get(e, 0)
        if n == '': n = 1
        count += [int(n)]
    counts += [count]

# this directly returns the array to org-mode
return counts
  H2O CO2 H2 CO2
H 2 0 2 0
C 0 1 0 1
O 1 2 0 2

For this simple example it seems like a lot of code. If there were 200 species though, it would be the same code! Only the list of species would be longer. It might be possible to avoid the two sets of looping, if you could represent the stoichiometric matrix as a sparse matrix, i.e. only store non-zero elements. The final comment I have is related to the parsing of the chemical formulas. Here we can only parse simple formulas. To do better than this would require a pretty sophisticated parser, probably built on the grammar of chemical formulas. The example here implements the code above using pyparsing, and could probably be extended to include more complex formulas such as (CH3)3CH.

Copyright (C) 2014 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 8.2.7c

Discuss on Twitter