Caching searches using biblio and only seeing new results

| categories: arxiv, biblio, elisp | tags: |

In this issue in scimax, Robert asked if it was possible to save searches, and then to repeat them every so often and only see the new results. This needs some persistent caching of the records, and a comparison of the current search results with the previous search results.

biblio provides a nice interface for searching a range of resources for bibliographic references. In this post, I will focus on arxiv. Out of the box, biblio does not seem to support this use case, but as you will see, it has many of the pieces required to achieve it. Let's start picking those pieces apart.

(require 'biblio)
biblio

Here is the first piece we need: a way to run a query, and get results back as a data structure. Here we just look at the first result.

(let* ((query "alloy segregration")
       (backend 'biblio-arxiv-backend)
       (cb (url-retrieve-synchronously (funcall backend 'url query)))
       (results (with-current-buffer cb
                  (funcall backend 'parse-buffer))))
  (car results))
((doi . "10.1103/PhysRevB.76.014112")
 (identifier . "0704.2752v2")
 (year . "2007")
 (title . "Modelling Thickness-Dependence of Ferroelectric Thin Film Properties")
 (authors nil nil nil nil nil nil nil nil nil nil nil nil nil "L. Palova" nil "P. Chandra" nil "K. M. Rabe" nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil)
 (container . "PRB 76, 014112 (2007)")
 (category . "cond-mat.mtrl-sci")
 (references "10.1103/PhysRevB.76.014112" "0704.2752v2")
 (type . "eprint")
 (url . "https://doi.org/10.1103/PhysRevB.76.014112")
 (direct-url . "http://arxiv.org/pdf/0704.2752v2"))

Next, we need a database to store the results in. I will just use a flat file database with a file for each record. The filename will be the md5 hash of the doi or the record itself. Why is that a good idea? Well, the doi is a constant, so if it exists the md5 will also be a constant. The doi itself is not a good filename in general, but the md5 is. The md5 of the record itself will be fragile to any changes, so if it has a doi, we should use it. If it doesn't and later gets one, we should see it again since that could mean it has been published. Also, if it changes because of some new version we might want to see it again. In any case, the existence of that file will be evidence we have seen that record before, and will indicate we need to remove it from the current view.
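To make the choice of key concrete, here is a minimal sketch of just the hashing step. The name my/record-key is a hypothetical helper used only for illustration, and the two sample records are made up from fields shown above. It assumes a record is an alist with a doi entry that may be empty.

```elisp
(defun my/record-key (record)
  "Return the md5 hash to use as the filename for RECORD.
Hypothetical helper: hash the DOI when there is one, otherwise
hash the printed form of the whole record."
  (let ((doi (cdr (assoc 'doi record))))
    (md5 (or doi (prin1-to-string record)))))

;; Two records with the same DOI get the same key even if other
;; fields differ. A record with an empty doi falls back to hashing
;; the whole record. Either way the key is 32 hex characters.
(let ((v1 '((doi . "10.1103/PhysRevB.76.014112") (year . "2007")))
      (v2 '((doi . "10.1103/PhysRevB.76.014112") (year . "2008")))
      (no-doi '((doi) (identifier . "1101.3464v3"))))
  (list (string= (my/record-key v1) (my/record-key v2))
        (length (my/record-key no-doi))))
;; => (t 32)
```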

The flat file database is not super inspired. It is modeled a little after elfeed. Other solutions might work better for large sets of records, but this approach will work fine for this post.

Here is a function that returns nil if the record has been seen, and if not, saves the record and returns it.

(require 'f) ; provides f-dir? and f-exists?
(defvar db-dir "~/.arxiv-db/"
  "Directory holding one file per seen record.")

(unless (f-dir? db-dir) (make-directory db-dir t))

(defun unseen-record-p (record)
  "Given a RECORD return it if it is unseen.
Also, save the record so next time it will be marked seen. A
record is seen if we have seen the DOI or the record as a string
before."
  (let* ((doi (cdr (assoc 'doi record)))
         (contents (with-temp-buffer
                     (prin1 record (current-buffer))
                     (buffer-string)))
         (hash (md5 (or doi contents)))
         (fname (expand-file-name hash db-dir)))

    (if (f-exists? fname)
        nil
      (with-temp-file fname
        (insert contents))
      record)))
unseen-record-p

Now we can use that as a filter that saves records by side effect.

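(require 'dash) ; provides -filter

(defun scimax-arxiv (query)
  "Search arXiv for QUERY and show only the records not seen before."
  (interactive "sQuery: ")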

  (let* ((backend 'biblio-arxiv-backend)
         (cb (url-retrieve-synchronously (funcall backend 'url query)))
         (results (-filter 'unseen-record-p (with-current-buffer cb
                                              (funcall backend 'parse-buffer))))
         (results-buffer (biblio--make-results-buffer (current-buffer) query backend)))
    (with-current-buffer results-buffer
      (biblio-insert-results results ""))
    (pop-to-buffer results-buffer)))

(scimax-arxiv "alloy segregation")
#<buffer *arXiv search*>

Now, the first time I run that, the results buffer lists the matching entries. If I run it again:

(scimax-arxiv "alloy segregation")
#<buffer *arXiv search*>

Then the buffer is empty, since we have seen all the entries before.

Here are the files in our database:

ls ~/.arxiv-db/

Here are the contents of one of those files:

(with-temp-buffer
 (insert-file-contents "~/.arxiv-db/18085fe2512e15d66addc7dfb71f7cd2")
 (read (buffer-string)))
((doi) (identifier . 1101.3464v3) (year . 2011) (title . Characterizing Solute Segregation and Grain Boundary Energy in a Binary
  Alloy Phase Field Crystal Model) (authors nil nil nil nil nil nil nil nil nil nil nil nil nil Jonathan Stolle nil Nikolas Provatas nil nil nil nil nil nil nil nil nil nil nil) (container) (category . cond-mat.mtrl-sci) (references nil 1101.3464v3) (type . eprint) (url . http://arxiv.org/abs/1101.3464v3) (direct-url . http://arxiv.org/pdf/1101.3464v3))

So, if you need to read this in again later, no problem.
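Reading the whole database back is a small step from there. This is a minimal sketch, not something from scimax or biblio: my/read-cached-records is a hypothetical helper, and it assumes every file named with a 32-character md5 hash holds exactly one record written with prin1.

```elisp
(defun my/read-cached-records (dir)
  "Return a list of all records stored as files in DIR.
Hypothetical helper. Only files whose names look like md5 hashes
are read, and each is assumed to hold one printed record."
  (mapcar (lambda (fname)
            (with-temp-buffer
              (insert-file-contents fname)
              ;; Point is at the start of the inserted text, so
              ;; `read' parses the record from the beginning.
              (read (current-buffer))))
          ;; FULL is t so we get absolute paths; the regexp is
          ;; matched against the bare filename.
          (directory-files dir t "\\`[0-9a-f]\\{32\\}\\'")))
```

With the database above, (length (my/read-cached-records db-dir)) would give the number of records seen so far.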

Now, what could go wrong? I don't know much about how the search results from arxiv are returned. For example, this query returns 10 hits.

(let* ((query "alloy segregation")
       (backend 'biblio-arxiv-backend)
       (cb (url-retrieve-synchronously (funcall backend 'url query)))
       (results (with-current-buffer cb
                  (funcall backend 'parse-buffer))))
  (length results))
10

There is just no way there are only 10 hits for this query. So, there must be a bunch more that you get by either changing the requested number in some argument, or by using subsequent queries to get the rest of them. I don't know if there are more advanced query options with biblio, e.g. to find entries newer than the last time it was run. On the advanced search page for arxiv, it looks like there is only a by year option.
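For what it is worth, the arXiv API itself documents start and max_results parameters for paging, and sortBy and sortOrder for ordering. Here is a minimal sketch of building such a URL by hand. The function my/arxiv-url is a hypothetical helper and not part of biblio, and I am assuming the export.arxiv.org endpoint; I have not checked how biblio builds its own URL.

```elisp
(require 'url-util) ; provides url-hexify-string

(defun my/arxiv-url (query &optional start max-results)
  "Return an arXiv API URL for QUERY, newest submissions first.
START is the zero-based index of the first hit (default 0) and
MAX-RESULTS is the page size (default 10).
Hypothetical helper, not part of biblio."
  (format (concat "http://export.arxiv.org/api/query"
                  "?search_query=%s&start=%d&max_results=%d"
                  "&sortBy=submittedDate&sortOrder=descending")
          (url-hexify-string query)
          (or start 0)
          (or max-results 10)))

;; Ask for hits 10 through 59 of the query used in this post.
(my/arxiv-url "alloy segregation" 10 50)
```

In principle the resulting URL could be passed to url-retrieve-synchronously and the buffer parsed with the backend's parse-buffer command as before, but I have not tested that here.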

This is still a good idea, and a lot of the pieces are here.

Copyright (C) 2018 by John Kitchin. See the License for information about copying.

Org-mode version = 9.1.6

Zhitao Guo receives the 2017-2018 James C. Meade Fellowship in Chemical Engineering

| categories: news | tags: |

The James C. Meade Fellowship was made possible by a generous donation from James Meade. This will help support Zhitao during his research this year. Zhitao is a first-year PhD student who is co-advised by Andy Gellman and myself (John Kitchin), and is working on segregation in ternary alloy thin films.

Zhitao joined us from Tsinghua University in Beijing, China, where he studied chemical engineering and double majored in economics.

Congratulations Zhitao!

New publication in Catalysis Today

| categories: publication, news | tags: |

In this paper we continue our exploration of using high-dimensional neural networks (NN) to model metal surface properties. Our first work started with modeling Au in a variety of structures using ReaxFF and a NN (boes-2016-neural-networ). We then modeled atomic oxygen adsorbates on a Pd(111) surface (boes-2017-neural-networ), and segregation of an Au-Pd alloy surface (boes-2017-model-segreg). Our goal throughout this work has been to systematically build up complexity in the systems we are modeling, and to explore the limitations of these potentials for modeling surfaces. This current work happened in parallel with those works, and focused on modeling Pd adatom diffusion on Pd(111) surfaces. We show another example of how to train a neural network, and then use it to model the temperature-dependent diffusion of adatoms on a metal surface using molecular dynamics and Arrhenius analysis.

@article{gao-2018-model-pallad,
  author =       {Tianyu Gao and John R. Kitchin},
  title =        {Modeling Palladium Surfaces With Density Functional Theory,
                  Neural Networks and Molecular Dynamics},
  journal =      {Catalysis Today},
  year =         2018,
  doi =          {10.1016/j.cattod.2018.03.045},
  url =          {https://doi.org/10.1016/j.cattod.2018.03.045},
  DATE_ADDED =   {Sun Apr 1 18:47:55 2018},
}

Bibliography

New publication in Topics in Catalysis

| categories: publication, news | tags: |

Single atom alloys are alloys in the extreme dilute limit, where single atoms of a reactive metal are surrounded by comparatively unreactive metals. Each reactive atom then acts like an isolated single-atom site where reactions can occur. These sites are interesting because they are metallic, but their electronic structure is different from that of the atoms in more concentrated alloys. This means there is the opportunity for different, perhaps better, catalytic performance for the single atom alloys. In this paper, we studied the electronic structure and some representative reaction pathways on a series of single atom alloy surfaces.

@article{Thirumalai2018,
  author =       "Thirumalai, Hari and Kitchin, John R.",
  title =        "Investigating the Reactivity of Single Atom Alloys Using
                  Density Functional Theory",
  journal =      "Topics in Catalysis",
  year =         "2018",
  month =        "Jan",
  day =          "25",
  abstract =     "Single atom alloys are gaining importance as atom-efficient
                  catalysts which can be extremely selective and active towards
                  the formation of desired products. They possess such desirable
                  characteristics because of the presence of a highly reactive
                  single atom in a less reactive host surface. In this work, we
                  calculated the electronic structure of several representative
                  single atom alloys. We examined single atom alloys of gold,
                  silver and copper doped with single atoms of platinum,
                  palladium, iridium, rhodium and nickel in the context of the
                  d-band model of Hammer and N{\o}rskov. The reactivity of these
                  alloys was probed through the dissociation of water and nitric
                  oxide and the hydrogenation of acetylene to ethylene. We
                  observed that these alloys exhibit a sharp peak in their atom
                  projected d-band density of states, which we hypothesize could
                  be the cause of high surface reactivity. We found that the
                  d-band centers and d-band widths of these systems correlated
                  linearly as with other alloys, but that the energy of
                  adsorption of a hydrogen atom on these surfaces could not be
                  correlated with the d-band center, or the average reactivity
                  of the surface. Finally, the single atom alloys, with the
                  exception of copper--palladium showed good catalytic behavior
                  by activating the reactant molecules more strongly than the
                  bulk atom behavior and showing favorable reaction pathways on
                  the free energy diagrams for the reactions investigated.",
  issn =         "1572-9028",
  doi =          "10.1007/s11244-018-0899-0",
  url =          "https://doi.org/10.1007/s11244-018-0899-0"
}

New publication in Molecular Simulation

| categories: publication, news | tags: |

This paper is our latest work using neural networks in molecular simulation. In this work, we build a Behler-Parrinello neural network potential of bulk zirconia. The potential can describe several polymorphs of zirconia, as well as oxygen vacancy defect formation energies and diffusion barriers. We show that we can use the potential to model oxygen vacancy diffusion using molecular dynamics at different temperatures, and use those data to estimate the effective diffusion activation energy. This is further evidence of the general utility of the neural network-based potential for molecular simulations with DFT accuracy.

@article{wang-2018-densit-funct,
  author =       {Chen Wang and Akshay Tharval and John R. Kitchin},
  title =        {A Density Functional Theory Parameterised Neural Network Model
                  of Zirconia},
  journal =      {Molecular Simulation},
  volume =       0,
  number =       0,
  pages =        {1-8},
  year =         2018,
  doi =          {10.1080/08927022.2017.1420185},
  url =          {https://doi.org/10.1080/08927022.2017.1420185},
  eprint =       { https://doi.org/10.1080/08927022.2017.1420185 },
  publisher =    {Taylor \& Francis},
}

Org-mode version = 9.1.5