Pyparsing meets Emacs to find chemical formulas

| categories: emacs, python | tags:

see the video: https://www.youtube.com/watch?v=sjxS9m8QCoo

Today we expand the concepts of clickable text and merge an idea from Python with Emacs. Here we will use Python to find chemical formulas in the buffer, and then highlight them with Emacs. We will use pyparsing to find the chemical formulas and then use them to create a pattern for button-lock. I chose this approach because regular expressions are hard to use on the most general kinds of chemical formulas, and a (possibly recursive) parser should be better equipped to handle this. I adapted an example grammar to match simple chemical formulas, i.e. ones that do not have any parentheses, or charges different than + or -. I think something like this could be done in Emacs, but I am not as familiar with this kind of parsing in Emacs.

Basically, we treat a formula as a group of one or more Elements that have an optional number following them. Spoiler alert: This mostly works, but in the end I conclude there is a clear benefit to a markup language for chemical formulas. Here is an example usage of a parser:

# adapted from [[https://pyparsing.wikispaces.com/file/view/chemicalFormulas.py/31041705/chemicalFormulas.py]]

from pyparsing import *

element = oneOf( """H He Li Be B C N O F Ne Na Mg Al Si P S Cl
            Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge
            As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag
            Cd In Sn Sb Te I Xe Cs Ba Lu Hf Ta W Re Os
            Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Lr Rf
            Db Sg Bh Hs Mt Ds Rg Uub Uut Uuq Uup Uuh Uus
            Uuo La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm
            Yb Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No""" )

integer = Word(nums)
elementRef = Group(element + Optional(integer))
chemicalFormula = (WordStart(alphas.upper())
                   + OneOrMore(elementRef).leaveWhitespace()
                   + Optional(Or([Literal("-"),
                                  Literal("+")]))
                   + WordEnd(alphas + nums + "-+"))


s = '''Water is  H2O or OH2  not h2O, methane is CH4 and of course there is PtCl4.
What about H+ and OH-? and carbon or Carbon or H2SO4?

Is this C6H6? or C2H5OH?

and a lot of elements:
H He Li Be B C N O F Ne Na Mg Al Si P S Cl
            Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge
            As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag
            Cd In Sn Sb Te I Xe Cs Ba Lu Hf Ta W Re Os
            Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Lr Rf
            Db Sg Bh Hs Mt Ds Rg Uub Uut Uuq Uup Uuh Uus
            Uuo La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm
            Yb Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No'''

matches = []
for match, start, stop in chemicalFormula.scanString(s):
   matches.append(s[start:stop])

print sorted(matches, key=lambda x: len(x), reverse=True)
['C2H5OH', 'PtCl4', 'H2SO4', 'C6H6', 'H2O', 'OH2', 'CH4', 'OH-', 'Uub', 'Uut', 'Uuq', 'Uup', 'Uuh', 'Uus', 'Uuo', 'H+', 'He', 'Li', 'Be', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'Cl', 'Ar', 'Ca', 'Sc', 'Ti', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'Xe', 'Cs', 'Ba', 'Lu', 'Hf', 'Ta', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Lr', 'Rf', 'Db', 'Sg', 'Bh', 'Hs', 'Mt', 'Ds', 'Rg', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Ac', 'Th', 'Pa', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'O', 'H', 'B', 'C', 'N', 'O', 'F', 'P', 'S', 'K', 'V', 'Y', 'I', 'W', 'U']

That is pretty good. If the string was actually our buffer, we could use those to create a regexp to put text-properties on them. The trick is how to get the buffer string to the Python function, and then get back usable information in lisp. We actually explored this before ! Rather than use that, we will just create the lisp output manually since this is a simple list of strings.

The first thing we should do is work out a Python script that will output the lisp results we want, which are the found formulas (I tried getting the start and stop positions, but I don't think they map onto the buffer positions very well). Here it is. We set it up as a command line tool that takes a string. We use set to get a unique list, then sort the list by length so we try matching the longest patterns first. There are a few subtle differences in this script and the example above because of some odd false hits I unsuccessfully tried to get rid of.

import sys
from pyparsing import *

element_string =  """H He Li Be B C N O F Ne Na Mg Al Si P S Cl
            Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge
            As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag
            Cd In Sn Sb Te I Xe Cs Ba Lu Hf Ta W Re Os
            Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Lr Rf
            Db Sg Bh Hs Mt Ds Rg Uub Uut Uuq Uup Uuh Uus
            Uuo La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm
            Yb Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No"""
element = oneOf([x for x in element_string.split()])

integer = Word(nums)
elementRef = Group(element + Optional(integer))
chemicalFormula = (WordStart(alphas.upper()).leaveWhitespace()
                   + OneOrMore(elementRef).leaveWhitespace()
                   + Optional(Or([Literal("-"),
                                  Literal("+")])).leaveWhitespace()
                   + WordEnd(alphas + alphas.lower() + nums + "-+").leaveWhitespace())

s = sys.stdin.read().strip()

matches = []
for match, start, stop in chemicalFormula.scanString(s):
   matches.append(s[start:stop])
matches = list(set(matches))
matches.sort(key=lambda x: len(x), reverse=True)

print "'(" + ' '.join(["\"{}\"".format(m) for m in matches]) + ')'

Now we can test this:

echo "Water is H2O, methane is CH4 and of course PtCl4, what about H+ and OH-? and carbon or Carbon. Water is H2O not h2o or mH2o, methane is CH4 and of course PtCl4, what about H+ and OH-? carbon, Carbon and SRC, or H2SO4? Is this C6H6? Ethanol is C2H5OH in a sentence.

 C2H5OH firs con

This is CH3OH

H He Li Be B C N O F Ne Na Mg Al Si P S Cl
            Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge
            As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag
            Cd In Sn Sb Te I Xe Cs Ba Lu Hf Ta W Re Os
            Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Lr Rf
            Db Sg Bh Hs Mt Ds Rg Uub Uut Uuq Uup Uuh Uus
            Uuo La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm
            Yb Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No
" | ./parse_chemical_formulas.py
'("C2H5OH" "CH3OH" "PtCl4" "H2SO4" "C6H6" "CH4" "OH-" "Uub" "Uuq" "Uup" "Uus" "Uuo" "Uuh" "H2O" "Uut" "Ru" "Re" "Rf" "Rg" "Ra" "Rb" "Rn" "Rh" "Be" "Ba" "Bh" "Bi" "Bk" "Br" "Ho" "Os" "Es" "Hg" "Ge" "Gd" "Ga" "Pr" "Pt" "Pu" "Pb" "Pa" "Pd" "Cd" "Po" "Pm" "Hs" "Hf" "He" "Md" "Mg" "Mo" "Mn" "Mt" "Zn" "H+" "Eu" "Zr" "Er" "Ni" "No" "Na" "Nb" "Nd" "Ne" "Np" "Fr" "Fe" "Fm" "Sr" "Kr" "Si" "Sn" "Sm" "Sc" "Sb" "Sg" "Se" "Co" "Cm" "Cl" "Ca" "Cf" "Ce" "Xe" "Tm" "Cs" "Cr" "Cu" "La" "Li" "Tl" "Lu" "Lr" "Th" "Ti" "Te" "Tb" "Tc" "Ta" "Yb" "Db" "Dy" "Ds" "Ac" "Ag" "Ir" "Am" "Al" "As" "Ar" "Au" "At" "In" "H" "P" "C" "K" "O" "S" "W" "B" "F" "N" "V" "I" "U" "Y")

That seems to work great. Now, we have a list of chemical formulas. Now, the Emacs side to call that function. We do not use regexp-opt here because I found it optimizes too much, and doesn't always match the formulas. We want explicit matches on each formula.

(defun shell-command-on-region-to-string (start end command)
  (with-output-to-string
    (shell-command-on-region start end command standard-output)))

(read (shell-command-on-region-to-string
        (point-min) (point-max)
        "./parse_chemical_formulas.py"))
quote (C2H5OH ext; t CH3OH PtCl4 H2SO4 the fir C6H6 CH4 OH- OH2 Uub co Uuq Uup Uus Uuo Uuh ord H2O Uut Ru Re Rf Rg Ra Rb Rn Rh Be Ba Bh Bi Bk Br Ho Os Es Hg Ge Gd Ga Pr t Pt Pu Pb Pa Pd Cd Po Pm Hs Hf He Md Mg Mo Mn Mt Zn H+ Eu Zr Er Ni No Na Nb Nd Ne Np Fr Fe Fm Sr Kr Si Sn Sm Sc Sb Sg Se Co Cm Cl Ca Cf Ce Xe Tm Cs Cr Cu La Li Tl Lu Lr Th Ti Te Tb Tc as Ta Yb Db Dy Ds In Ac Ag Ir Am Al As Ar Au At n H P l t C r K O S W w B F N V I U Y e i)

That is certainly less than perfect, you can see a few false hits that are not too easy to understand, e.g. why is "fir" or "the " or "as" in the list? They don't even start with an uppercase letter. One day maybe I will figure it out. I assume it is a logic flaw in my parser. Until then, let's go ahead and make the text functional, so it looks up the formula in the NIST webbook. The regexp is a little funny, we have to add word-boundaries to each formula to avoid some funny, bad matches.

(defvar chemical-formula-button nil "store button for removal later.")

(require 'nist-webbook)
(setq chemical-formula-button
      (button-lock-set-button
       (mapconcat
        (lambda (formula)
          (concat "\\<" (regexp-quote formula) "\\>"))
        (eval (read (shell-command-on-region-to-string
                     (point-min) (point-max)
                     "./parse_chemical_formulas.py")))
        "\\|")
       (lambda () (interactive)
         (nist-webbook-formula
          (get-surrounding-text-with-property
           'chemical-formula)))
       :face '((:underline t) (:background "gray80"))
       :help-echo "A chemical formula"
       :additional-property 'chemical-formula))

Here are a few tests: CH4, C2H5OH, C6H6. C(CH3)4. C6H6 is benzene. As you can see our pattern lacks context; the first word of the sentence is "as" not the symbol for arsenic. Also, our parser does not consider formulas with parentheses in them. Whenever I refer to myself, I mean myself, and not the element iodine. There are a few weird matchs I just don't understand, like firs d t x rn lac? These do not seem to match anything, and I wonder how they are getting in the list. I think this really shows that it would be useful to use some light markup for chemical formulas which would a) provide context, and b) enhance parsing accuracy. In LaTeX you would use \ce{I} to indicate that is iodine, and not a reference to myself. That is more clear than saying I use I in chemical reactions ;) And it also clarifies sentences like the letter W is used to represent tungsten as the symbol \ce{W}.

Nevertheless, we can click on the formulas, and get something to happen that is potentially useful. Is this actually useful? Conceptually yes, I think it could be, but clearly the parsing is not recognizing formulas perfectly. Sending the buffer to a dedicated program that can return a list of matches to highlight in Emacs is a good idea, especially if it is not easy to build in Emacs, or if a proven solution already exists.

Finally, we can remove the highlighted text like this. That was the reason for saving the button earlier!

(when chemical-formula-button
  (button-lock-unset-button chemical-formula-button)
  (setq chemical-formula-button nil))

Copyright (C) 2015 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 8.2.10

Discuss on Twitter