Another approach to embedded molecular data in org-mode

| categories: orgmode, chemistry, emacs | tags:

In the last post we examined a molecule link to a src-block defining a molecule in some format. We blurred the distinction between program and data there. Here we re-separate them to try out some different ideas. We will use an org-mode special block to contain the "data" which is a molecular representation in some format. Then, we will use open-babel to convert the format to various other formats to explore using the data.

Here is a methane molecule (with 4 implicit hydrogens in the SMILES format). We put it in a named special block in org-mode, and even put a header on it to indicate the format and a display name!


We can use the SMILES representation block as input to a new command that converts it to the CML format, with coordinates. We use a simple shell command here and pass the contents of the molecule in as a variable. That is nice because in SMILES methane is represented by a single "C", and this CML is much more verbose.

echo $input | obabel -ismi -o cml --gen3d
<?xml version="1.0"?>
<molecule xmlns="">
  <atom id="a1" elementType="C" x3="1.047517" y3="-0.064442" z3="0.060284"/>
  <atom id="a2" elementType="H" x3="2.139937" y3="-0.064341" z3="0.059898"/>
  <atom id="a3" elementType="H" x3="0.683568" y3="-0.799429" z3="-0.661322"/>
  <atom id="a4" elementType="H" x3="0.683566" y3="0.927794" z3="-0.216100"/>
  <atom id="a5" elementType="H" x3="0.683669" y3="-0.321317" z3="1.056822"/>
  <bond atomRefs2="a1 a2" order="1"/>
  <bond atomRefs2="a1 a3" order="1"/>
  <bond atomRefs2="a1 a4" order="1"/>
  <bond atomRefs2="a1 a5" order="1"/>

We can also use the CML output as input to a command that generates an SVG image, again, passing the CML in via a variable in the header.

echo $cml | obabel -icml -o svg

With our previous molecule link we can refer to these in our text now as methane-smiles and methane-cml.

So far it all looks good. Let us do something new. We will use the SMILES representation to create an ase.atoms object in Python. First, we create an xyz format that ase can read. Rather than clutter up our document with the output, we silence it.

echo $input | obabel -ismi -o xyz --gen3d

Now, we can use the string generated in a Python file to generate a tempfile (or you could have saved the result above to a file and just read it in here). I was too lazy to make the file link to the image myself, so I setup a :file header and just print the result to stdout in this block. Although all we do here is create a new image, this demonstrates you can use data from a MOLECULE block and pass it into a Python script where other kinds of calculations might occur.

from import read, write

from tempfile import mkstemp
fd, fname = mkstemp(suffix=".xyz")
with open(fname, 'w') as f:

atoms = read(fname)
write('-', atoms, format="png")

The last point to discuss is discoverability. It would be helpful if we could use a program to "extract" molecular information about the molecules we use in our work. Here is a block that will map over the MOLECULE blocks and summarize what is found with a common format (SMILES again). We generate a table of clickable links to each molecule found in the documents. There is a small appendix in this document containing h2o and caffeine that will show in this table.

(defun mlc-to-smiles (blk)
  "Convert a molecule BLK to smiles format using openbabel."
  (let* ((headers (-flatten
                   (mapcar 'org-babel-parse-header-arguments
                           (org-element-property :header blk))))
         (format (cdr (assoc :format headers)))
         (content (buffer-substring-no-properties
                   (org-element-property :contents-begin blk)
                   (org-element-property :contents-end blk)))
         (tempfile (make-temp-file "obabel-")))
    (with-temp-file tempfile
      (insert content))

    ;; convert to smiles. This outputs a smiles string and the file it was
    ;; generated from. I don't know how to suppress the file, so we use awk to
    ;; just get the SMILEs strings. It is not pretty. I know.
        (s-trim (shell-command-to-string
                 (format  "obabel %s %s -osmi 2> /dev/null | awk '{print $1}'"
                          (format "-i%s" format) tempfile)))
      (delete-file tempfile))))

;; Generate the table of molecules
(append '(("Display name" "Name" "format" "SMILES representation"))
        (org-element-map (org-element-parse-buffer) 'special-block
          (lambda (sb)
            (when (string= "MOLECULE" (org-element-property :type sb))
              (let ((headers (-flatten
                              (mapcar 'org-babel-parse-header-arguments
                                      (org-element-property :header sb)))))

                 (format "[[molecule:%s][%s]]" (org-element-property :name sb)
                         (cdr (assoc :display-name headers)))
                 (org-element-property :name sb)
                 (cdr (assoc :format headers))
                 (mlc-to-smiles sb)))))))
Display name Name format SMILES representation
methane-smiles methane-smiles smiles C
h2o h2o cml OO
caffeine caffeine xyz Cn1cnc2n(C)c(=O)n(C)c(=O)c12

That seems pretty discoverable to me. We not only can discover the molecules in this post, but can pretty easily convert them to other formats (SMILES) in this case. Since we can run any code we want on them, we could just as well import them to a database, or do subsequent calculations on them.

The MOLECULE block is not standard, and I have only demonstrated here that it is suitable for this purpose. But, it looks like we could extend it and deal with a variety of formats. We can use headers to add metadata, format, etc… Some features I find missing are similar to those in code blocks where we can type C-c ' to edit them in special modes, and the nice syntax highlighting that often comes with that.

It might be helpful to make the export of MOLECULE blocks nicer looking and more functional. The default export, for example doesn't put an id attribute in the block. First, we rewrite an org-function to add the id attribute to the exported blocks so our molecule links will work.

(defun org-html-special-block (special-block contents info)
  "Transcode a SPECIAL-BLOCK element from Org to HTML.
CONTENTS holds the contents of the block.  INFO is a plist
holding contextual information."
  (let* ((block-type (downcase
                      (org-element-property :type special-block)))
         (contents (or contents ""))
         (html5-fancy (and (org-html-html5-p info)
                           (plist-get info :html-html5-fancy)
                           (member block-type org-html-html5-elements)))
         (attributes (org-export-read-attribute :attr_html special-block)))
    (unless html5-fancy
      (let ((class (plist-get attributes :class)))
        (setq attributes (plist-put attributes :class
                                    (if class (concat class " " block-type)
        (when (org-element-property :name special-block)
          (setq attributes (plist-put
                            attributes :id
                            (org-element-property :name special-block))))))
    (setq attributes (org-html--make-attribute-string attributes))
    (when (not (equal attributes ""))
      (setq attributes (concat " " attributes)))
    (if html5-fancy
        (format "<%s%s>\n%s</%s>" block-type attributes
                contents block-type)
      (format "<div%s>\n%s\n</div>" attributes contents))))

It would be nice to add some additional information around the block, e.g. that it is a molecule, maybe some tooltip about the format, etc…, but we leave that to another day. These should probably be handled specially with a dedicated export function. You will note that MOLECULE blocks don't export too well, they should probably be wrapped in <pre> for HTML export. We will at least make them stand out with this bit of css magic.

#+HTML_HEAD_EXTRA:  <style>.molecule {background-color:LightSkyBlue;}</style>

1 Summary thoughts

This looks pretty promising as a way to embed molecular data into org-files so that the data is reusable and discoverable. If there is metadata that cannot go into the MOLECULE format we can put it in headers instead. This seems like it could be useful.

2 Appendix of molecules

2.1 Water

Here is water in the CML format.

<?xml version="1.0"?> <molecule xmlns=""> <atomArray> <atom id="a1" elementType="O"/> <atom id="a2" elementType="O"/> </atomArray> <bondArray> <bond atomRefs2="a1 a2" order="1"/> </bondArray> </molecule>

2.2 Caffeine

This is a simple xyz format of caffeine.


C 1.02887 -0.01688 -0.03460 N 2.46332 0.11699 -0.03522 C 3.33799 -0.94083 -0.03530 N 4.59156 -0.53767 -0.03594 C 4.50847 0.82120 -0.03623 N 5.57252 1.69104 -0.03687 C 6.93040 1.17620 -0.03898 C 5.33446 3.06602 -0.03685 O 6.26078 3.88171 -0.03594 N 3.98960 3.48254 -0.03830 C 3.70813 4.90531 -0.04199 C 2.87287 2.63769 -0.03747 O 1.71502 3.04777 -0.03830 C 3.21603 1.25723 -0.03610 H 0.54478 0.95872 -0.03440 H 0.73663 -0.56946 0.86233 H 0.73584 -0.56959 -0.93118 H 3.00815 -1.97242 -0.03493 H 7.67209 1.97927 -0.03815 H 7.07929 0.56516 -0.93486 H 7.08112 0.56135 0.85404 H 4.61163 5.51902 -0.04152 H 3.11230 5.15092 0.84340 H 3.11643 5.14660 -0.93127

Copyright (C) 2016 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 8.2.10

Discuss on Twitter