<?xml version="1.0" encoding="UTF-8"?>

<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     >
  <channel>
    <atom:link href="http://kitchingroup.cheme.cmu.edu/blog/feed/index.xml" rel="self" type="application/rss+xml" />
    <title>The Kitchin Research Group</title>
    <link>https://kitchingroup.cheme.cmu.edu/blog</link>
    <description>Chemical Engineering at Carnegie Mellon University</description>
    <pubDate>Sat, 01 Nov 2025 13:47:46 GMT</pubDate>
    <generator>Blogofile</generator>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    
    <item>
      <title>Caching searches using biblio and only seeing new results</title>
      <link>https://kitchingroup.cheme.cmu.edu/blog/2018/04/11/Caching-searches-using-biblio-and-only-seeing-new-results</link>
      <pubDate>Wed, 11 Apr 2018 20:46:56 EDT</pubDate>
      <category><![CDATA[arxiv]]></category>
      <category><![CDATA[elisp]]></category>
      <category><![CDATA[biblio]]></category>
      <guid isPermaLink="false">KOPSYg1aBm18lX8lJYHeRtMOL74=</guid>
      <description>Caching searches using biblio and only seeing new results</description>
      <content:encoded><![CDATA[


&lt;p&gt;
In this &lt;a href="https://github.com/jkitchin/scimax/issues/196"&gt;issue&lt;/a&gt; in scimax, Robert asked if it was possible to save searches, and then to repeat them every so often and only see the new results. This needs some persistent caching of the records, and a comparison of the current search results with the previous search results.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://github.com/cpitclaudel/biblio.el"&gt;biblio&lt;/a&gt; provides a nice interface to searching a range of resources for bibliographic references. In this post, I will focus on arxiv. Out of the box, biblio does not seem to support this use case, but as you will see, it has many of the pieces required to achieve it. Let's start picking those pieces apart.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-emacs-lisp"&gt;(&lt;span style="color: #0000FF;"&gt;require&lt;/span&gt; '&lt;span style="color: #D0372D;"&gt;biblio&lt;/span&gt;)
&lt;/pre&gt;
&lt;/div&gt;

&lt;pre class="example"&gt;
biblio

&lt;/pre&gt;

&lt;p&gt;
Here is the first piece we need: a way to run a query, and get results back as a data structure. Here we just look at the first result.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-emacs-lisp"&gt;(&lt;span style="color: #0000FF;"&gt;let*&lt;/span&gt; ((query &lt;span style="color: #008000;"&gt;"alloy segregration"&lt;/span&gt;)
       (backend 'biblio-arxiv-backend)
       (cb (url-retrieve-synchronously (funcall backend 'url query)))
       (results (&lt;span style="color: #0000FF;"&gt;with-current-buffer&lt;/span&gt; cb
                  (funcall backend 'parse-buffer))))
  (car results))
&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-emacs-lisp"&gt;((doi . &lt;span style="color: #008000;"&gt;"10.1103/PhysRevB.76.014112"&lt;/span&gt;)
 (identifier . &lt;span style="color: #008000;"&gt;"0704.2752v2"&lt;/span&gt;)
 (year . &lt;span style="color: #008000;"&gt;"2007"&lt;/span&gt;)
 (title . &lt;span style="color: #008000;"&gt;"Modelling Thickness-Dependence of Ferroelectric Thin Film Properties"&lt;/span&gt;)
 (authors nil nil nil nil nil nil nil nil nil nil nil nil nil &lt;span style="color: #008000;"&gt;"L. Palova"&lt;/span&gt; nil &lt;span style="color: #008000;"&gt;"P. Chandra"&lt;/span&gt; nil &lt;span style="color: #008000;"&gt;"K. M. Rabe"&lt;/span&gt; nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil)
 (container . &lt;span style="color: #008000;"&gt;"PRB 76, 014112 (2007)"&lt;/span&gt;)
 (category . &lt;span style="color: #008000;"&gt;"cond-mat.mtrl-sci"&lt;/span&gt;)
 (references &lt;span style="color: #008000;"&gt;"10.1103/PhysRevB.76.014112"&lt;/span&gt; &lt;span style="color: #008000;"&gt;"0704.2752v2"&lt;/span&gt;)
 (type . &lt;span style="color: #008000;"&gt;"eprint"&lt;/span&gt;)
 (url . &lt;span style="color: #008000;"&gt;"https://doi.org/10.1103/PhysRevB.76.014112"&lt;/span&gt;)
 (direct-url . &lt;span style="color: #008000;"&gt;"http://arxiv.org/pdf/0704.2752v2"&lt;/span&gt;))
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
Next, we need a database to store the results in. I will just use a flat file database with a file for each record. The filename will be the md5 hash of the doi or the record itself. Why is that a good idea? Well, the doi is a constant, so if it exists the md5 will also be a constant. The doi itself is not a good filename in general, but the md5 is. The md5 of the record itself will be fragile to any changes, so if it has a doi, we should use it. If it doesn't and later gets one, we should see it again since that could mean it has been published. Also, if it changes because of some new version we might want to see it again. In any case, the existence of that file will be evidence we have seen that record before, and will indicate we need to remove it from the current view.
&lt;/p&gt;

&lt;p&gt;
The flat file database is not super inspired. It is modeled a little after elfeed, but other solutions might work better for large sets of records, but this approach will work fine for this post.
&lt;/p&gt;

&lt;p&gt;
Here is a function that returns nil if the record has been seen, and if not, saves the record and returns it.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-emacs-lisp"&gt;(&lt;span style="color: #0000FF;"&gt;defvar&lt;/span&gt; &lt;span style="color: #BA36A5;"&gt;db-dir&lt;/span&gt; &lt;span style="color: #008000;"&gt;"~/.arxiv-db/"&lt;/span&gt;)

(&lt;span style="color: #0000FF;"&gt;unless&lt;/span&gt; (f-dir? db-dir) (make-directory db-dir t))

(&lt;span style="color: #0000FF;"&gt;defun&lt;/span&gt; &lt;span style="color: #006699;"&gt;unseen-record-p&lt;/span&gt; (record)
  &lt;span style="color: #036A07;"&gt;"Given a RECORD return it if it is unseen.&lt;/span&gt;
&lt;span style="color: #036A07;"&gt;Also, save the record so next time it will be marked seen. A&lt;/span&gt;
&lt;span style="color: #036A07;"&gt;record is seen if we have seen the DOI or the record as a string&lt;/span&gt;
&lt;span style="color: #036A07;"&gt;before."&lt;/span&gt;
  (&lt;span style="color: #0000FF;"&gt;let*&lt;/span&gt; ((doi (cdr (assoc 'doi record)))
         (contents (&lt;span style="color: #0000FF;"&gt;with-temp-buffer&lt;/span&gt;
                     (prin1 record (current-buffer))
                     (buffer-string)))
         (hash (md5 (&lt;span style="color: #0000FF;"&gt;or&lt;/span&gt; doi contents)))
         (fname (expand-file-name hash db-dir)))

    (&lt;span style="color: #0000FF;"&gt;if&lt;/span&gt; (f-exists? fname)
        nil
      (&lt;span style="color: #0000FF;"&gt;with-temp-file&lt;/span&gt; fname
        (insert contents))
      record)))
&lt;/pre&gt;
&lt;/div&gt;

&lt;pre class="example"&gt;
unseen-record-p

&lt;/pre&gt;

&lt;p&gt;
Now we can use that as a filter that saves records by side effect.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-emacs-lisp"&gt;(&lt;span style="color: #0000FF;"&gt;defun&lt;/span&gt; &lt;span style="color: #006699;"&gt;scimax-arxiv&lt;/span&gt; (query)
  (&lt;span style="color: #0000FF;"&gt;interactive&lt;/span&gt; &lt;span style="color: #008000;"&gt;"Query: "&lt;/span&gt;)

  (&lt;span style="color: #0000FF;"&gt;let*&lt;/span&gt; ((backend 'biblio-arxiv-backend)
         (cb (url-retrieve-synchronously (funcall backend 'url query)))
         (results (-filter 'unseen-record-p (&lt;span style="color: #0000FF;"&gt;with-current-buffer&lt;/span&gt; cb
                                              (funcall backend 'parse-buffer))))
         (results-buffer (biblio--make-results-buffer (current-buffer) query backend)))
    (&lt;span style="color: #0000FF;"&gt;with-current-buffer&lt;/span&gt; results-buffer
      (biblio-insert-results results &lt;span style="color: #008000;"&gt;""&lt;/span&gt;))
    (pop-to-buffer results-buffer)))

(scimax-arxiv &lt;span style="color: #008000;"&gt;"alloy segregation"&lt;/span&gt;)
&lt;/pre&gt;
&lt;/div&gt;

&lt;pre class="example"&gt;
#&amp;lt;buffer *arXiv search*&amp;gt;

&lt;/pre&gt;

&lt;p&gt;
Now, when I run that once I see something like this:
&lt;/p&gt;

&lt;p&gt;
&lt;img src="/media/date-11-04-2018-time-20-19-52.png"&gt;
&lt;/p&gt;


&lt;p&gt;
and if I run it again:
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-emacs-lisp"&gt;(scimax-arxiv &lt;span style="color: #008000;"&gt;"alloy segregation"&lt;/span&gt;)
&lt;/pre&gt;
&lt;/div&gt;

&lt;pre class="example"&gt;
#&amp;lt;buffer *arXiv search*&amp;gt;

&lt;/pre&gt;

&lt;p&gt;
Then the buffer is empty, since we have seen all the entries before.
&lt;/p&gt;


&lt;p&gt;
&lt;img src="/media/date-11-04-2018-time-20-20-37.png"&gt;
&lt;/p&gt;

&lt;p&gt;
Here are the files in our database:
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-sh"&gt;ls ~/.arxiv-db/
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
Here are the contents of one of those files:
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-emacs-lisp"&gt;(&lt;span style="color: #0000FF;"&gt;with-temp-buffer&lt;/span&gt;
 (insert-file-contents &lt;span style="color: #008000;"&gt;"~/.arxiv-db/18085fe2512e15d66addc7dfb71f7cd2"&lt;/span&gt;)
 (read (buffer-string)))
&lt;/pre&gt;
&lt;/div&gt;

&lt;pre class="example"&gt;
((doi) (identifier . 1101.3464v3) (year . 2011) (title . Characterizing Solute Segregation and Grain Boundary Energy in a Binary
  Alloy Phase Field Crystal Model) (authors nil nil nil nil nil nil nil nil nil nil nil nil nil Jonathan Stolle nil Nikolas Provatas nil nil nil nil nil nil nil nil nil nil nil) (container) (category . cond-mat.mtrl-sci) (references nil 1101.3464v3) (type . eprint) (url . http://arxiv.org/abs/1101.3464v3) (direct-url . http://arxiv.org/pdf/1101.3464v3))

&lt;/pre&gt;

&lt;p&gt;
So, if you need to read this in again later, no problem.
&lt;/p&gt;

&lt;p&gt;
Now, what could go wrong? I don't know much about how the search results from arxiv are returned. For example, this query returns 10 hits.
&lt;/p&gt;

&lt;div class="org-src-container"&gt;
&lt;pre class="src src-emacs-lisp"&gt;(&lt;span style="color: #0000FF;"&gt;let*&lt;/span&gt; ((query &lt;span style="color: #008000;"&gt;"alloy segregration"&lt;/span&gt;)
       (backend 'biblio-arxiv-backend)
       (cb (url-retrieve-synchronously (funcall backend 'url query)))
       (results (&lt;span style="color: #0000FF;"&gt;with-current-buffer&lt;/span&gt; cb
                  (funcall backend 'parse-buffer))))
  (length results))
&lt;/pre&gt;
&lt;/div&gt;

&lt;pre class="example"&gt;
10

&lt;/pre&gt;

&lt;p&gt;
There is just no way there are only 10 hits for this query. So, there must be a bunch more that you get by either changing the requested number in some argument, or by using subsequent queries to get the rest of them. I don't know if there are more advanced query options with biblio, e.g. to find entries newer than the last time it was run. On the advanced search &lt;a href="https://arxiv.org/find"&gt;page&lt;/a&gt; for arxiv, it looks like there is only a by year option.
&lt;/p&gt;

&lt;p&gt;
This is still a good idea, and a lot of the pieces are here,
&lt;/p&gt;
&lt;p&gt;Copyright (C) 2018 by John Kitchin. See the &lt;a href="/copying.html"&gt;License&lt;/a&gt; for information about copying.&lt;p&gt;
&lt;p&gt;&lt;a href="/org/2018/04/11/Caching-searches-using-biblio-and-only-seeing-new-results.org"&gt;org-mode source&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Org-mode version = 9.1.6&lt;/p&gt;
]]></content:encoded>
    </item>
  </channel>
</rss>
