A better defun for emacs-lisp

| categories: elisp, macro, emacs |

I have been thinking of better ways to write code that is more likely to have decent, up-to-date docstrings, and that maybe enables automatic validation. One strategy is to keep documentation and code together, and by together I mean close together. The closer the better. I made some interesting progress in the last post, where I used a macro to let me put argument-specific documentation in the same place that the argument is defined. Here I expand the idea to also provide argument default values and validation code at the point where the argument is defined, in addition to generating docstrings. This post is written in Emacs-lisp, mostly because I am more familiar with its macro language. The idea should apply to other lisps too.

Let's consider this prototypical, vanilla function definition, usage, and docstring.

(defun f1 (arg1 arg2)
  "Add two numbers."
  (+ arg1 arg2))

;; usage
(f1 3 4)

Here is what the help looks like from emacs.

(describe-function 'f1)

It is clear I was lazy in writing the docstring; it does not even mention the arguments. There is also no validation of the arguments so if you pass a string and a number, you will get an error. There are no defaults either, so you have to provide both arguments. It seems like there could be significant room for improvement. Of course, I could bite the bullet and write a better function like this one:

(defun f1a (arg1 &optional arg2)
  "Add ARG1 and ARG2 together.
ARG1 and  ARG2 should both be numbers."
  (when (null arg2) (setq arg2 2))
  (unless (and (numberp arg1) (numberp arg2)) (error "arg1 and arg2 should both be numbers"))
  (+ arg1 arg2))

(list (f1a 3 4) (f1a 3))

Yes, I could do that, but it is tedious to do all the time, and it still leaves something to be desired. The docstring does not say what the default value is, for example; that is hard-coded in the body, i.e. not introspectable until you look at the code. So let's consider an alternative way to write the function, with its usage and documentation. The function definition is a little more verbose, but providing documentation, defaults and validation code in any form would make it that way no matter what.

(defn f2 ((arg1 "A number" :validate numberp)
          (arg2 "A number" :validate numberp :default 2))
  "Add the arguments."
  (+ arg1 arg2))

;; usage
(list (f2 3 4) (f2 3))
(describe-function 'f2)

The documentation is built up from the information in the function definition, in a form that is mostly consistent with emacs-lisp documentation standards. defn is not a regular emacs-lisp function; it is a macro I developed to generate the function code. It turned out to be long, but the gist of it is that before defining the function I loop through the arguments and collect the docstrings, along with any information about default values and/or validation functions. Then I build up the list of arguments to put in the function. Then if any default values are set, I generate some code to set those values if they are not set in the function call, and finally a similar block of validation code. At the end, I construct the defun and return it. You can check out the code if you want here: https://github.com/jkitchin/scimax/blob/master/scimax-macros.el.
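
To make the approach concrete, here is a heavily simplified sketch of how such a macro can be structured (illustrative only, not the actual scimax implementation; it assumes any defaulted arguments come after the required ones):

(defmacro defn-sketch (name args docstring &rest body)
  "Define NAME with ARGS, building the docstring and arg handling.
Each element of ARGS is (SYMBOL DOC [:default VALUE] [:validate FUNC]).
Assumes arguments with defaults come after the required ones."
  (let (required optional defaults validators doc-lines)
    (dolist (arg args)
      (let* ((sym (car arg))
             (doc (cadr arg))
             (default (plist-get (cddr arg) :default))
             (validate (plist-get (cddr arg) :validate)))
        (if default (push sym optional) (push sym required))
        ;; collect a docstring line for this argument
        (push (format "%s : %s%s" (upcase (symbol-name sym)) doc
                      (if default (format " (default %S)" default) ""))
              doc-lines)
        (when default
          (push `(unless ,sym (setq ,sym ,default)) defaults))
        (when validate
          (push `(unless (funcall #',validate ,sym)
                   (error "%s = %S failed to validate with %S"
                          ',sym ,sym ',validate))
                validators))))
    `(defun ,name (,@(reverse required)
                   ,@(when optional (cons '&optional (reverse optional))))
       ,(concat docstring "\n\n"
                (mapconcat #'identity (reverse doc-lines) "\n"))
       ;; set defaults for arguments that were not provided
       ,@(reverse defaults)
       ;; run the validation code
       ,@(reverse validators)
       ,@body)))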

Let's take a look at what this code expands to.

(macroexpand-1
 '(defn f2 ((arg1 "A number" :validate numberp)
            (arg2 "A number" :validate numberp :default 2))
    "Add the arguments."
    (+ arg1 arg2)))
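
The expansion is not reproduced here, but it looks roughly like this (a hedged sketch; the exact docstring layout and error messages depend on the actual defn implementation):

;; approximately what the macro generates (illustrative):
(defun f2 (arg1 &optional arg2)
  "Add the arguments.

ARG1 : A number
ARG2 : A number (default 2)"
  (unless arg2 (setq arg2 2))
  (unless (funcall #'numberp arg1)
    (error "arg1 = %S failed to validate with numberp" arg1))
  (unless (funcall #'numberp arg2)
    (error "arg2 = %S failed to validate with numberp" arg2))
  (+ arg1 arg2))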

You can see it expands to a regular defun, with a generated docstring, generated default settings code block, and generated validation code. Pretty nice.

Let's see what happens with a function that fails the validation. We should get an error. Here we capture the error so we can see it in the post.

(condition-case err
    (f2 "oak")
  (error
   (error-message-string err)))

So we even get a useful error message when the wrong type of argument is provided. Compare that to the error message from the original version of this function. It tells us we got the wrong type, but not which argument.

(condition-case err
    (f1 "oak" 4)
  (error
   (error-message-string err)))

One last example to check out the &rest argument, with validation that every arg is a number.

(defn f4 ((rarg :rest
                :validate (lambda (x)
                            (-all-p 'identity (mapcar 'numberp x)))))
  "multiply all the arguments."
  (apply '* rarg))

(f4 1 2 3)
(condition-case err
    (f4 "oak" 4)
  (error
   (error-message-string err)))
(describe-function 'f4)

That looks ok too.

1 Summary

The motivation for this was to help me write better code with better documentation. Better code in the sense that it can provide run-time validation with better feedback, and automatic documentation that even notes when an argument is undocumented. It is basically compatible with the regular defun, but enhances what kind of documentation is possible with less work on my part. I think it will make it easier to keep documentation in sync, since the argument documentation is kept near the argument, and you can build in validation if you want to.

It is no news to lispers that macros are good for this kind of application.

Copyright (C) 2017 by John Kitchin. See the License for information about copying.


Exporting org-mode to Jupyter notebooks

| categories: orgmode, python, jupyter, emacs |

I am going to use Jupyter notebooks to teach from this semester. I really dislike preparing notebooks though. A browser is a poor editor, and I dislike Markdown. Notebooks also do not seem to have any real structure in them, e.g. the collapsible outline I am used to in org-mode, so for long notebooks it is difficult to get a sense of the structure. I am anticipating spending up to 80 hours preparing notebooks this semester, so today I worked out some code to export org-mode to an IPython notebook!

This will let me use the power tools I am accustomed to for the creation of IPython notebooks for my students, and perhaps others who do not use org-mode.

Jupyter notebooks are just json files, so all we need to do is generate it from an org document. The basic strategy was to build up a lisp data structure that represents the notebook and then just convert that data structure to json. I split the document up into sequential markdown and code cells, and then encode those in the format required for the notebook (json).
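
To make that concrete, here is a minimal sketch (not the actual exporter; the function name and the cell representation are illustrative) that builds a notebook as nested alists and vectors and lets json-encode do the rest:

(require 'json)

(defun ipynb-sketch (cells)
  "Return a JSON string for a minimal notebook containing CELLS.
Each cell is a cons of a type symbol (markdown or code) and a source string."
  (json-encode
   `((cells . ,(vconcat
                (mapcar
                 (lambda (cell)
                   (if (eq (car cell) 'markdown)
                       `((cell_type . "markdown")
                         (metadata . ,(make-hash-table))
                         (source . ,(cdr cell)))
                     `((cell_type . "code")
                       (execution_count . :json-null)
                       (metadata . ,(make-hash-table))
                       (outputs . [])
                       (source . ,(cdr cell)))))
                 cells)))
     (metadata . ((language_info . ((name . "python")))))
     (nbformat . 4)
     (nbformat_minor . 0))))

;; (ipynb-sketch '((markdown . "# A title") (code . "print(42)")))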

So, here is an example of what can be easily written in org-mode, posted to this blog, and exported to an IPython notebook, all from one org-document.

Check out the notebook: exporting-orgmode-to-ipynb.ipynb.

1 Solve a nonlinear problem

Consider the equation \(x^2 = 4\). Find a solution to it in Python using a nonlinear solver.

To do that, we need to define an objective function that will be equal to zero at the solution. Here is the function:

def objective(x):
    return x**2 - 4

Next, we use fsolve with an initial guess. We get fsolve from scipy.optimize.

from scipy.optimize import fsolve

ans = fsolve(objective, 3)
print(ans)
[ 2.]

That should have been an obvious answer. The answer is in brackets because fsolve returns an array. In the next block we will unpack the solution into the answer using the comma operator. Also, we can see that using a different guess leads to a different answer. There are, of course, two answers: \(x = \pm 2\)

ans, = fsolve(objective, -3)
print(ans)
-2.0

Now you see we get a float answer!

Here are some other ways to get a float:

ans = fsolve(objective, -3)

print(float(ans))
print(ans[0])
-2.0000000000000084
-2.0

It is worth noting from the first result that fsolve is iterative and stops when it reaches zero within a tolerance. That is why it is not exactly -2.

2 Benefits of export to ipynb

  1. I can use org-mode,
  2. and Emacs,
  3. and ipynb for teaching.

The export supports org-markup: bold, italic, underlined, and strikethrough.

We can use tables:

Table 1: A table of squares.

| x | y  |
|---+----|
| 1 | 2  |
| 2 | 4  |
| 3 | 9  |
| 4 | 16 |

We can make plots.

import numpy as np

t = np.linspace(0, 2 * np.pi)

x = np.cos(t)
y = np.sin(t)

import matplotlib.pyplot as plt
plt.plot(x, y)
plt.axis('equal')
plt.xlabel('x')
plt.ylabel('y')
plt.savefig('circle.png')

Even include HTML: <font color="red">Pay special attention to the axis labels!</font>

3 Limitations

  • Only supports IPython blocks
  • Does not do inline images in results
  • Will not support src-block variables
  • Currently only supports vanilla output results

4 Summary

The code that does this is here: ox-ipynb.el. After I use it a while I will put it in scimax. There are some tricks in it to fix up some markdown export of latex fragments and links with no descriptions.
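
To illustrate the kind of fixup meant here (a hedged sketch, not the actual ox-ipynb code): markdown export produces \( \) delimited latex fragments, while the notebook renders $-delimited math more reliably:

;; illustrative only: convert \( and \) latex delimiters to $ in a string
(defun ipynb-fix-latex-sketch (s)
  "Replace \\( and \\) latex delimiters in S with $."
  (replace-regexp-in-string "\\\\(\\|\\\\)" "$" s))

;; (ipynb-fix-latex-sketch "the equation \\(x^2 = 4\\)")
;; => "the equation $x^2 = 4$"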

I just run this command in Emacs to get the notebook. It even renders reasonably in the notebook.

(export-ipynb-buffer)

Overall, this looks extremely promising for developing lecture notes and assignments in org-mode and exporting them to IPython notebooks for the students.

Copyright (C) 2017 by John Kitchin. See the License for information about copying.


A simple emacs-lisp interface to CRUD operations in mongodb

| categories: mongodb, emacslisp, database, emacs |

In this post I showed that MongoDB is pretty easy to use for simple database applications. I showed a way to get data out of the database that is native to Emacs, but to use Mongo in emacs applications comfortably, it would be really helpful to be able to create, read, update and delete (CRUD) entries. There is a minimal interface to mongodb for emacs-lisp here: https://github.com/m2ym/mongo-el. From what I can see, it seems limited to simple, single queries, and it is written with advanced features of emacs-lisp I do not understand enough to extend it. In the last post, I showed an easy way to use mongoexport to get data from a query out of a mongo database. Here I explore a similar approach to round out the CRUD (create, read, update and delete) operations for using emacs-lisp to work with mongodb. This will enable Emacs to easily use MongoDB in applications.

We use the mongo cli with the --eval option, which allows you to run commands on the database. The basic idea is to generate the json we need from a lisp data structure, and use that json in mongo commands as needed. This sounds simple, but below you will see there are plenty of corners to take care of.

The goal here is to get something that is pretty functional. It will not be able to support all the capabilities of MongoDB and the options available in the cli.

1 Inserting entries

Here we insert a document into the contacts collection of the contacts database. As in the Python example we considered earlier, this database is automatically created when we run this command.

(require 'json)
(let* ((json (json-encode '((first-name . "John")
                            (last-name . "Kitchin")
                            (email . "jkitchin@cmu.edu"))))
       (cmd (format "mongo 127.0.0.1/contacts --quiet --eval 'db.contacts.insert(%s)'"
                    json)))
  (shell-command-to-string cmd))
json

Here is a function we can use for inserting, and as you can see it works for multiple inserts too. There is a limit on how long the json string can be for this, so you cannot add too many entries at once with this. I do not know what the limit is, and suspect it is related to using a shell command. When this succeeds there is data returned about what happened, which we try to get in lisp form. Also, I noticed I had to do a little bit of escaping, especially for entries containing a single quote, which messes up the quoting on the shell command, and for non-ascii characters which the shell did not handle well. Maybe this could be avoided with a file-based approach, or if we used a pipe to a process.

(defun mongo-insert (db collection document)
  "Insert into DB.COLLECTION the DOCUMENT.
DOCUMENT will be some lisp structure that is converted to json."
  ;; we have to escape any single quotes. This came from
  ;; http://stackoverflow.com/questions/1250079/how-to-escape-single-quotes-within-single-quoted-strings
  (let* ((json (replace-regexp-in-string "'" "'\"'\"'" (json-encode document)))
         ;; it seems utf-8 characters may cause issues. Let's just remove them.
         (json (replace-regexp-in-string "[^[:ascii:]]" "" json))
         (cmd (format "mongo %s --quiet --eval 'db.%s.insert(%s)'"
                      db collection
                      json))
         (output (shell-command-to-string cmd)))
    (cond
     ((string-match "BulkWriteResult(" output)
      (json-read-from-string (substring output 16 -2)))
     ((string-match "WriteResult(" output)
      (json-read-from-string (substring output 12 -2)))
     (t
      output))))
mongo-insert

Here it is in action.

(mongo-insert "contacts" "contacts"
              '(((first-name . "John")
                 (last-name . "Kitchin")
                 (email . "jkitchin@cmu.edu"))
                ((first-name . "Someone")
                 (last-name . "Else")
                 ("email" . "someone@out.there"))))
((writeErrors . []) (writeConcernErrors . []) (nInserted . 2) (nUpserted . 0) (nMatched . 0) (nModified . 0) (nRemoved . 0) (upserted . []))

Seems like an ok way to get data from Emacs into a Mongo DB, and we get lisp data returned telling us what happened.

2 Finding a document

To update documents we need to find them. We would like to find a document by the _id, but we have a small dilemma. The json we need for that needs to look like: {"_id": ObjectId("587babfaef131d0d4603b3ad")}, where the ObjectId is not quoted. The json library does not seem to be able to do that. So, we have to modify our find code to do this. This is possible by manipulating the json string after it is generated with regular expression replacement. It feels hacky, and hopefully there are not many more examples of that. If there are, we will need another approach to generating the json data. Here is the modified find function, also with the projection option. Here is another place we have to tread somewhat lightly with the _id, in this case we have to requote it so that it can be read by emacs. It might make sense to just replace it with the quoted _id string, rather than the ObjectId call. Time will tell.

Here we create two helper functions to unquote input, and requote output. We also need some code to make an array of all the results, and put commas between all the results so that we end up with valid json in the output.

(defun mongo-unquote-query (query)
  "Json-encode QUERY, and unquote any ObjectId calls.

We don't have syntax for the ObjectId call that mongo wants in
lisp, so a query has to look like this:
'((_id . \"ObjectId(\\\"587babfaef131d0d4603b3ad\\\")\"))

Mongo can't have the quotes around the call, so this function
removes them."
  (replace-regexp-in-string "\"\\(ObjectId(\\\\\"\\(.*?\\)\\\\\")\\)\""
                            "ObjectId(\"\\2\")"
                            (json-encode query)))

(defun mongo-requote-output (output)
  "Adds quotes around ObjectId in OUTPUT.
When mongo outputs json, it has unquoted ObjectIds in it that
emacs cannot interpret as json. "
  (replace-regexp-in-string
   "ObjectId(\"\\(.*?\\)\")"
   "\"ObjectId(\\\\\"\\1\\\\\")\""
   output))

(defun mongo-find (db collection query &optional projection)
  "Find documents in DB.COLLECTION matching QUERY.
PROJECTION optionally limits which fields are returned."
  (let* ((query-json (mongo-unquote-query query))
         (projection-json
          (and projection (json-encode projection)))
         (output (mongo-requote-output
                  ;; add [] to make an array of output in json,
                  ;; and separate results by a comma
                  (concat "["
                          (replace-regexp-in-string
                           "\n" ""
                           (shell-command-to-string
                            (format "mongo %s --quiet --eval 'db.%s.find(%s).forEach(function(myDoc) { printjsononeline(myDoc); print( \",\"); })'"
                                    db collection
                                    (if projection
                                        (format "%s, %s" query-json projection-json)
                                      query-json))))
                          "]")))) 
    (json-read-from-string output)))
mongo-find

So, finally we can run something like this:

(mongo-find "contacts" "contacts" '((email . "someone@out.there")))
[((_id . "ObjectId(\"587c166cdfcd649d3acf99fd\")") (first-name . "Someone") (last-name . "Else") (email . "someone@out.there")) ((_id . "ObjectId(\"587c16ad410565dd4c16c748\")") (first-name . "Someone") (last-name . "Else") (email . "someone@out.there")) ((_id . "ObjectId(\"587c17550e586b4f8df21de0\")") (first-name . "Someone") (last-name . "Else") (email . "someone@out.there")) ((_id . "ObjectId(\"587c1764d75279a55ffec483\")") (first-name . "Someone") (last-name . "Else") (email . "someone@out.there")) ((_id . "ObjectId(\"587c17743281f1e9d5054396\")") (first-name . "Someone") (last-name . "Else") (email . "someone@out.there")) ((_id . "ObjectId(\"587c178ad92706d2bd5a6e3c\")") (first-name . "Someone") (last-name . "Else") (email . "someone@out.there")) ((_id . "ObjectId(\"587c1794756bb2bd0f0ac499\")") (first-name . "Someone") (last-name . "Else") (email . "someone@out.there"))]

Here is an example usage with a projection that returns only the information you want, in this case, just the id.

(mongo-find "contacts" "contacts" '((email . "someone@out.there"))
            '((_id . 1)))
[((_id . "ObjectId(\"587c166cdfcd649d3acf99fd\")")) ((_id . "ObjectId(\"587c16ad410565dd4c16c748\")")) ((_id . "ObjectId(\"587c17550e586b4f8df21de0\")")) ((_id . "ObjectId(\"587c1764d75279a55ffec483\")")) ((_id . "ObjectId(\"587c17743281f1e9d5054396\")")) ((_id . "ObjectId(\"587c178ad92706d2bd5a6e3c\")")) ((_id . "ObjectId(\"587c1794756bb2bd0f0ac499\")"))]

3 Updating an entry

Ok, back to the update. To make sure that we update exactly the document we want, we will use the document _id. First, we define an update command.

(defun mongo-update (db collection query $set)
  "In DB.COLLECTION update records matching QUERY with the contents of $SET."
  (let* ((query-json (mongo-unquote-query query))
         ($set-json (mongo-unquote-query $set))
         (cmd (format "mongo %s --quiet --eval 'db.%s.update(%s, %s)'"
                      db collection
                      query-json $set-json))
         (output (shell-command-to-string cmd)))
    (if (string-match "WriteResult(" output)
        (json-read-from-string
         (substring output 12 -2))
      output)))
mongo-update

First a reminder of what is in this record.

(mongo-find "contacts" "contacts" '((_id . "ObjectId(\"587c16ad410565dd4c16c748\")")))
[((_id . "ObjectId(\"587c16ad410565dd4c16c748\")") (first-name . "Someone") (last-name . "Else") (email . "someone@out.there"))]

Here we set the email field to a new address. Without $set, the whole document gets replaced.

(mongo-update "contacts" "contacts"
              '((_id . "ObjectId(\"587c16ad410565dd4c16c748\")"))
              '(($set . ((email . "someone@out.there.com")))))
((nMatched . 1) (nUpserted . 0) (nModified . 1))

Finally, let's see the document again to verify it is modified.

(mongo-find "contacts" "contacts" '((_id . "ObjectId(\"587c16ad410565dd4c16c748\")")))
[((_id . "ObjectId(\"587c16ad410565dd4c16c748\")") (first-name . "Someone") (last-name . "Else") (email . "someone@out.there.com"))]

Looks good, you can see it got changed. There is a potential gotcha though. This next command looks like it should do the same thing, but it does not. The whole document gets replaced!

(mongo-update "contacts" "contacts"
              '((_id . "ObjectId(\"587c16ad410565dd4c16c748\")"))
              '((email . "someone@out.there.com")))
((nMatched . 1) (nUpserted . 0) (nModified . 1))
(mongo-find "contacts" "contacts" '((_id . "ObjectId(\"587c16ad410565dd4c16c748\")")))
[((_id . "ObjectId(\"587c16ad410565dd4c16c748\")") (email . "someone@out.there.com"))]

Do not forget the $set operator if you just want to update some fields!

4 Deleting a document

Next, let's get a delete function. I will only implement the deleteMany function here, since you can give it a document id to delete only one, and usually I would want to delete all documents that meet some criteria anyway.

(defun mongo-deleteMany (db collection filter)
  "Delete records in DB.COLLECTION matched by FILTER.
TODO: add write concern."
  (let* ((filter-json (mongo-unquote-query filter))
         (cmd (format "mongo %s --quiet --eval 'db.%s.deleteMany(%s)'"
                      db collection
                      filter-json))
         (output (shell-command-to-string cmd)))
    (json-read-from-string output)))
mongo-deleteMany

Since we borked that last document, let's just delete it.

(mongo-deleteMany "contacts" "contacts" '((_id . "ObjectId(\"587be3fa6009a569a277b680\")")))
((acknowledged . t) (deletedCount . 0))

5 Generic commands

We may want some flexibility to run collection commands. The most generic command will simply be to write the shell-command completely. We can keep a little syntax by encapsulating most of the boilerplate though. Here is a function for that.

(defun mongo-cmd (db collection cmd &rest args)
  "In DB.COLLECTION run CMD. 
ARGS if present will be used to format CMD."
  (shell-command-to-string
   (format "mongo %s --quiet --eval 'db.%s.%s'"
           db collection
           (apply #'format cmd args))))
mongo-cmd

We can get the number of documents with this:

(mongo-cmd "contacts" "contacts" "count()")
4341

Or run a more sophisticated command with arguments like this.

(mongo-cmd "contacts" "contacts" "explain().remove(%s)" (json-encode '(("category" . "enemy"))))
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "contacts.contacts",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"category" : {
				"$eq" : "enemy"
			}
		},
		"winningPlan" : {
			"stage" : "DELETE",
			"inputStage" : {
				"stage" : "COLLSCAN",
				"filter" : {
					"category" : {
						"$eq" : "enemy"
					}
				},
				"direction" : "forward"
			}
		},
		"rejectedPlans" : [ ]
	},
	"serverInfo" : {
		"host" : "Johns-MacBook-Air.local",
		"port" : 27017,
		"version" : "3.4.1",
		"gitVersion" : "5e103c4f5583e2566a45d740225dc250baacfbd7"
	},
	"ok" : 1
}

Or, drop the collection with:

(mongo-cmd "contacts" "contacts" "drop()")
true

All gone! Note, we do not try to handle the output of any of those, and they are returned as strings.

6 A MongoDB contacts database

Now, let's re-populate it for real. I store my contacts in a variable called "contacts", where each contact is a list of a descriptive string followed by cons cells of data. These are actually harvested from a set of org-files. It is way too slow to parse these files each time, so I keep the contacts cached in memory and only update them if a file changes.
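
The caching can be as simple as keying on file modification times. Here is a hedged sketch of the idea, where parse-contacts-file stands in for the real org parsing code:

(defvar my-contacts-cache (make-hash-table :test 'equal)
  "Map of file name to (MODTIME . CONTACTS).")

(defun my-cached-contacts (file)
  "Return the contacts in FILE, re-parsing only when FILE has changed."
  (let ((modtime (nth 5 (file-attributes file)))
        (cached (gethash file my-contacts-cache)))
    (if (and cached (equal modtime (car cached)))
        (cdr cached)
      ;; parse-contacts-file is hypothetical; it would walk the org file
      (let ((contacts (parse-contacts-file file)))
        (puthash file (cons modtime contacts) my-contacts-cache)
        contacts))))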

(length contacts)
6047

There are over 6000 contacts. Let's put them in a MongoDB.

Here is a limitation of our approach. This will not work because the generated shell command ends up being too long for the shell.

(mongo-insert "contacts" "contacts"
              (loop for contact in contacts
                    collect
                    (append `((desc . ,(car contact))) (cdr contact))))
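
A file-based approach would avoid the length limit; for example, we could write one json document per line and load the file with mongoimport (a hedged sketch, not used in the rest of this post):

(let ((file (make-temp-file "contacts" nil ".json")))
  (with-temp-file file
    (dolist (contact contacts)
      (insert (json-encode (append `((desc . ,(car contact))) (cdr contact)))
              "\n")))
  ;; mongoimport reads one json document per line by default
  (shell-command-to-string
   (format "mongoimport -d contacts -c contacts --file %s" file)))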

So, we do them one at a time here:

(let ((ct (current-time)))
  (loop for contact in contacts
        do
        (let ((output (mongo-insert "contacts" "contacts"
                                    (append `((desc . ,(car contact))) (cdr contact)))))
          (unless (= 1 (cdr (assoc 'nInserted output)))
            (warn "error: %S for %S" (cdr (assoc 'nInserted output)) contact))))
  (message "Elapsed time %.02f seconds" (float-time (time-since ct))))
Elapsed time 762.95 seconds

That took almost 13 minutes to add. That seems long to me. This next step confirms that they were added.

(mongo-cmd "contacts" "contacts" "count()")
6047

Next we will compare some timing of finding data in the database vs looping through the cached contacts. Here is a timing macro to measure how long it takes to run a bit of code.

;; http://stackoverflow.com/questions/23622296/emacs-timing-execution-of-function-calls-in-emacs-lisp
(defmacro measure-time (&rest body)
  "Measure the time it takes to evaluate BODY."
  `(let ((time (current-time)))
     ,@body
     (message "%.06f seconds elapsed" (float-time (time-since time)))))
measure-time

Here is the old way I would extract data. Many contacts I have are academics, and I have stored their academic ranks in each contact.

(loop for contact in contacts
      if (string= "Professor" (cdr (assoc "RANK" (cdr contact))))
      collect contact into professors
      if (string= "Associate Professor" (cdr (assoc "RANK" (cdr contact))))
      collect contact into associate-professors
      if (string= "Assistant Professor" (cdr (assoc "RANK" (cdr contact))))
      collect contact into assistant-professors
      finally return `(("Assistant Professor" ,(length assistant-professors))
                       ("Associate Professor" ,(length associate-professors))
                       ("Professor" ,(length professors))))
Assistant Professor 313
Associate Professor 283
Professor 879

How long did it take to do that?

(measure-time
 (loop for contact in contacts
       if (string= "Professor" (cdr (assoc "RANK" (cdr contact))))
       collect contact into professors
       if (string= "Associate Professor" (cdr (assoc "RANK" (cdr contact))))
       collect contact into associate-professors
       if (string= "Assistant Professor" (cdr (assoc "RANK" (cdr contact))))
       collect contact into assistant-professors
       finally return (list (length assistant-professors)
                            (length associate-professors)
                            (length professors))))
0.008772 seconds elapsed

Not long at all! Comparatively, it is very slow to get this information out of the mongodb, although considerably less code is required. That might not be surprising, considering the json parsing that has to get done here.

Here is the equivalent code to extract that data from the database.

(loop for rank in '("Assistant Professor" "Associate Professor" "Professor")
       collect (list rank (length (mongo-find "contacts" "contacts"
                                              `((RANK . ,rank))))))
Assistant Professor 313
Associate Professor 283
Professor 879

It is comparatively slow to do this. This requires three json parses, and profiling indicates that a lot of the work is done in parsing the json.

(measure-time
 (loop for rank in '("Assistant Professor" "Associate Professor" "Professor")
       collect (list rank (length (mongo-find "contacts" "contacts"
                                              `((RANK . ,rank)))))))
1.914817 seconds elapsed

Here is a smarter way to do it that avoids the json parsing.

(loop for rank in '("Assistant Professor" "Associate Professor" "Professor")
      collect (list rank (mongo-cmd "contacts" "contacts" "count(%s)"
                                    (json-encode `((RANK . ,rank))))))
Assistant Professor 313
Associate Professor 283
Professor 879

And you can see here it is about five times faster, but not nearly as fast as running the lisp code on the cache.

(measure-time
 (loop for rank in '("Assistant Professor" "Associate Professor" "Professor")
       collect (list rank (mongo-cmd "contacts" "contacts" "count(%s)"
                                     (json-encode `((RANK . ,rank)))))))
0.349413 seconds elapsed

This is how you might integrate this into a completion command:

(ivy-read "choose: "
          (loop for c across (mongo-find "contacts" "contacts" "")
                collect
                (list (cdr (assoc 'desc c)) c)))

This is basically unusable though, because it takes so long to generate the candidates (over six seconds).

(measure-time
 (loop for c across (mongo-find "contacts" "contacts" "")
       collect
       (list (cdr (assoc 'desc c)) c)))
6.228225 seconds elapsed

We can get back to usable by making the database do more work for us. Here, we simply make the database print a list of cons cells that we can read into lisp, using a javascript function with some escaping and quoting. The escaping was necessary because there is some bad data in the email field that messed up the cons cells, e.g. entries like "name" <email> with nested single and double quoting. The quoting was necessary to get cons cells of the form ("desc" . "email"). Finally, we wrap the output in parentheses and read back the list of cons cells. At about a quarter of a second, this is very usable for getting a list of over 6000 candidates, though it is still many times slower than working on the contacts list in memory. I am not a super fan of the one-line javascript, and if it were much more complicated than this, another strategy would probably be desirable.

(measure-time
 (read
  (concat
   "("
   (shell-command-to-string "mongo contacts --quiet --eval 'db.contacts.find().forEach(function (doc) {print(\"(\\\"\" + doc.desc + \"\\\" . \\\"\" + escape(doc.EMAIL) +\"\\\")\");})'")
   ")")))
0.284730 seconds elapsed

7 Text searching

Finally, let us make a text index to make searching easy. This allows a very flexible search where we do not have to specify a field or use regular expressions. We set up the index on all the fields, so we can find entries that match even on fields that do not exist in all documents.

(mongo-cmd "contacts" "contacts" "createIndex(%s)" (json-encode '(("$**" . "text"))))
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}

Now, let's use that to find the GOOGLE-SCHOLAR url of contacts matching the following query.

(mongo-find "contacts" "contacts" '(($text . (($search . "\"Carnegie Mellon\"")))
                                    ($text . (($search . "\"John Kitchin\""))))
            '((GOOGLE-SCHOLAR . 1) (_id . 0)))
[((GOOGLE-SCHOLAR . "https://scholar.google.com/citations?hl=en&user=jD_4h7sAAAAJ"))
 nil nil]

So, you can see there were three contacts, and one of them lists my google-scholar url.

8 Summary

This looks like the foundation of a mongo/emacs-lisp interface. This interface is not that fast though, and suffers from some limitations related to the use of the shell. Depending on the actual use, it is clear you can gain performance by passing some work to the database, which requires some javascript coding. Even that revealed some subtlety, e.g. making sure the database output text that was compatible with the lisp reader. That mostly means taking care of quotes and other special characters, which I managed with a simple escape mechanism. It is probably worth investing a few more days in building an interface that uses a process and communicates with the mongo cli directly (a minimal sketch follows the list below) before moving forward with any significant application that uses Mongo in emacs. There are many good ideas for that:

  1. Index all your org files (e.g. http://kitchingroup.cheme.cmu.edu/blog/2017/01/03/Find-stuff-in-org-mode-anywhere/)
  2. Index all your bibtex files (e.g. http://kitchingroup.cheme.cmu.edu/blog/2017/01/15/Querying-a-MongoDB-bibtex-database-with-Python-and-emacs-lisp/)
  3. Download RSS feeds into a searchable database
  4. Manage your contacts
  5. Index your email? mu and notmuch use xapian for this, but I have found they cannot search for things like hashtags. Maybe MongoDB would be better?
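
Here is a minimal sketch of what that process-based interface could look like (illustrative only; a real version would need a process filter to collect and parse the output):

(defvar mongo-process nil
  "A persistent mongo shell process.")

(defun mongo-start ()
  "Start a persistent mongo shell on the contacts database."
  (setq mongo-process
        (start-process "mongo" "*mongo*" "mongo" "contacts" "--quiet")))

(defun mongo-send (command)
  "Send COMMAND to the running mongo shell.
Output accumulates in the *mongo* buffer."
  (process-send-string mongo-process (concat command "\n")))

;; (mongo-start)
;; (mongo-send "db.contacts.count()")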

The tradeoffs between this and sqlite are more clear now. With Mongo we do not have to create the normalized tables (although it is still a good idea to think about how to structure the documents, and if they should be a little normalized). It is much easier to map lisp data structures to Mongo queries than it is to do that with SQL queries. On the other hand, it is necessary to do some javascript programming with Mongo to get some desired output. It still seems worth exploring further.

Copyright (C) 2017 by John Kitchin. See the License for information about copying.


Querying a MongoDB bibtex database with Python and emacs-lisp

| categories: mongodb, python, database, emacs |

I have been exploring using databases to help with searching my data. In this post we explore using MongoDB for bibtex entries. I am choosing bibtex entries because it is easy to parse bibtex files, I already have a lot of them, and I have several kinds of queries I regularly use. So, they are a good candidate to test out a new database on!

MongoDB is a noSQL database that is pretty easy to use. I installed it from homebrew, and then followed the directions to run the server.

With pymongo you can make a database as easy as this:

import bibtexparser

# Read the bibtex file to get entries
with open('../../../Dropbox/bibliography/references.bib', 'r') as bibfile:
    bp = bibtexparser.load(bibfile)
    entries = bp.entries

print("N = ", len(entries))

print(entries[0])

import pymongo
from pymongo import MongoClient
client = MongoClient('localhost', 27017)

# This creates the "entries" collection
db = client['bibtex'].entries

# add each entry
for entry in entries:
    db.insert_one(entry)

N = 1671
{'keyword': 'test, word', 'year': '2006', 'publisher': 'American Chemical Society (ACS)', 'title': 'The ACS Style Guide', 'ENTRYTYPE': 'book', 'editor': 'Janet S. Dodd', 'address': 'Washington, D.C.', 'ID': '2006-acs-style-guide', 'doi': '10.1021/bk-2006-styg', 'link': 'http://dx.doi.org/10.1021/bk-2006-STYG', 'date_added': 'Wed Apr 1 10:17:54 2015', 'pages': 'nil'}

That was easy. We have a database with 1671 documents in it, and each document is essentially a dictionary of key-value pairs. You might even argue it was too easy. I didn't specify any structure to the entries at all. No required fields, no validation that the keys are spelled correctly, no validation on the values, e.g. you can see the year looks like a string. The benefit of that is that every entry went in, with no issues. On the other hand, the authors went in as a single string, as did the keywords, which affects our ability to search a little bit later. Note if you run that twice, it will add each entry again, since we do not check if the entry already exists.

A database is only useful though if it is easy to get stuff out of it. So, let's consider some test queries. First we find entries that have years less than 1950. The query is basically a little json bundle that describes a field and condition that we want to match. Here we use a less than operator, "$lt". The results come back as a list of dictionaries. This is in stark contrast to a SQL query, which is an expression in its own declarative language; a query here is a chunk of data that must get converted to code by the server. I am not 100% clear if the less than here is in the string sense or the numeric sense, but for four-digit years the two agree, so it probably does not matter for a long time.

import pymongo
from pymongo import MongoClient
client = MongoClient('localhost', 27017)

db = client['bibtex'].entries

for i, result in enumerate(db.find({"year" : {"$lt": "1950"}})):
    print('{i: 2d}. {author}, {title}, {journal}, {year}.'.format(i=i+1, **result))
  1. Birch, Francis, Finite Elastic Strain of Cubic Crystals, Phys. Rev., 1947.
  2. Ditchburn, R. W. and Gilmour, J. C., The Vapor Pressures of Monatomic Vapors, Rev. Mod. Phys., 1941.
  3. J. Korringa, On the Calculation of the Energy of a Bloch Wave in a Metal, Physica, 1947.
  4. Nix, F. C. and MacNair, D., The Thermal Expansion of Pure Metals. {II}: Molybdenum, Palladium, Silver, Tantalum, Tungsten, Platinum, and Lead, Phys. Rev., 1942.

That seems easy enough, and those strings could easily be used as candidates for a selection tool like helm.

How about articles published by myself and my student Jacob Boes? This requires "and" logic. Multiple conditions in a query document are combined with "and" by default, but a Python dict cannot hold two "author" keys, so we combine the two author conditions explicitly with the "$and" operator. One condition is an exact match on articles, and the other two are case-insensitive regular expression matches. I guess this has to be done on every document, since there probably is no way to index a regex match! This search was very fast, but it is not clear how fast it would be for a million entries. This matching is necessary because we stored all authors in a single field rather than splitting them into an array. We might still have to match strings for this even in an array, since an author might then be "John R. Kitchin", rather than further decomposed into first and last names.

import pymongo
from pymongo import MongoClient
client = MongoClient('localhost', 27017)

db = client['bibtex']
entries = db['entries']

for i, result in enumerate(entries.find({"ENTRYTYPE": "article",
                                         "author" : {"$regex": "kitchin", '$options' : 'i'},
                                         "author" : {"$regex": "boes", '$options' : 'i'}})):
    if result.get('doi', None):
        result['doi'] = 'http://dx.doi.org/{doi}'.format(doi=result['doi'])
    else:
        result['doi'] = ''
    print('{i: 2d}. {author}, {title}, {journal}, {year}. {doi}'.format(i=i+1, **result).replace("\n", ""))
  1. Jacob R. Boes and Peter Kondratyuk and Chunrong Yin and JamesB. Miller and Andrew J. Gellman and John R. Kitchin, Core Level Shifts in {Cu-Pd} Alloys As a Function of BulkComposition and Structure, Surface Science, 2015. http://dx.doi.org/10.1016/j.susc.2015.02.011
  2. Jacob R. Boes and Gamze Gumuslu and James B. Miller and AndrewJ. Gellman and John R. Kitchin, Estimating Bulk-Composition-Dependent \ce{H2} AdsorptionEnergies on \ce{CuxPd1-x} Alloy (111) Surfaces, ACS Catalysis, 2015. http://dx.doi.org/10.1021/cs501585k
  3. Jacob R. Boes and Gamze Gumuslu and James B. Miller and AndrewJ. Gellman and John R. Kitchin, Supporting Information: Estimating Bulk-Composition-Dependent\ce{H2} Adsorption Energies on \ce{CuxPd1-x} Alloy (111)Surfaces, ACS Catalysis, 2015. http://dx.doi.org/10.1021/cs501585k
  4. G. Gumuslu and P. Kondratyuk and J. R. Boes and B. Morrealeand J. B. Miller and J. R. Kitchin and A. J. Gellman, Correlation of Electronic Structure With Catalytic Activity:\ce{H2}-\ce{D2} Exchange Across \ce{CuxPd1-x}Composition Space, ACS Catalysis, 2015. http://dx.doi.org/10.1021/cs501586t
  5. John D. Michael and Ethan L. Demeter and Steven M. Illes andQingqi Fan and Jacob R. Boes and John R. Kitchin, Alkaline Electrolyte and {Fe} Impurity Effects on thePerformance and Active-Phase Structure of {NiOOH} Thin Filmsfor {OER} Catalysis Applications, J. Phys. Chem. C, 2015. http://dx.doi.org/10.1021/acs.jpcc.5b02458
  6. Jacob R. Boes and Mitchell C. Groenenboom and John A. Keithand John R. Kitchin, Neural Network and {Reaxff} Comparison for {Au} Properties, Int. J. Quantum Chem., 2016. http://dx.doi.org/10.1002/qua.25115
  7. Jacob R. Boes and John R. Kitchin, Neural Network Predictions of Oxygen Interactions on a Dynamic Pd Surface, Molecular Simulation, Accepted 12/2016. http://dx.doi.org/10.1080/08927022.2016.1274984
  8. Jacob R. Boes and John R. Kitchin, Modeling Segregation on {AuPd}(111) Surfaces With DensityFunctional Theory and Monte Carlo Simulations, Submitted to J. Phys. Chem. C, 2016.

We can find out how many different entry types we have, as well as how many distinct keyword entries there are. The documents do not separate the keywords though, so this is just the unique strings of comma-separated keywords values. We would have had to split those in advance to have a list of keywords to search for a specific one beyond string matching. Curiously, in my bibtex entries, these are in a field called "keywords". It appears the bibtex parser may have changed the name to "keyword".

import pymongo
from pymongo import MongoClient
client = MongoClient('localhost', 27017)

db = client['bibtex']
entries = db['entries']

print(entries.distinct("ENTRYTYPE"))
print(len(entries.distinct("keyword")))
print(entries.find({"keyword": {"$exists": "true"}})[22]['keyword'])

['book', 'article', 'techreport', 'phdthesis', 'inproceedings', 'inbook', 'mastersthesis', 'misc', 'incollection']
176
Bildungsw{\"a}rmen, Dichtefunktionalrechnungen, Perowskite, Thermochemie

1 Text searching

You can do text search as well. You first have to create an index on one or more fields, and then use the $text and $search operators. Here I made an index on a few fields, and then searched on it. Note that you can only have one text index, so think about it in advance! This simplifies the query a bit: we do not have to use the regex syntax for matching on a field.

import pymongo
from pymongo import MongoClient
client = MongoClient('localhost', 27017)

db = client['bibtex']
entries = db['entries']

entries.create_index([('author', pymongo.TEXT),
                      ('title', pymongo.TEXT),
                      ('keyword', pymongo.TEXT)], sparse=True)

for i, result in enumerate(entries.find({"$text" : {"$search": "kitchin", "$search": "boes"}})):
    print('{i: 2d}. {author}, {title}, {journal}, {year}.'.format(i=i, **result).replace("\n", ""))
  1. G. Gumuslu and P. Kondratyuk and J. R. Boes and B. Morrealeand J. B. Miller and J. R. Kitchin and A. J. Gellman, Correlation of Electronic Structure With Catalytic Activity:\ce{H2}-\ce{D2} Exchange Across \ce{CuxPd1-x}Composition Space, ACS Catalysis, 2015.
  2. Jacob R. Boes and Peter Kondratyuk and Chunrong Yin and JamesB. Miller and Andrew J. Gellman and John R. Kitchin, Core Level Shifts in {Cu-Pd} Alloys As a Function of BulkComposition and Structure, Surface Science, 2015.
  3. Jacob R. Boes and Gamze Gumuslu and James B. Miller and AndrewJ. Gellman and John R. Kitchin, Estimating Bulk-Composition-Dependent \ce{H2} AdsorptionEnergies on \ce{CuxPd1-x} Alloy (111) Surfaces, ACS Catalysis, 2015.
  4. Jacob R. Boes and John R. Kitchin, Neural Network Predictions of Oxygen Interactions on a Dynamic Pd Surface, Molecular Simulation, Accepted 12/2016.
  5. Jacob R. Boes and John R. Kitchin, Modeling Segregation on {AuPd}(111) Surfaces With DensityFunctional Theory and Monte Carlo Simulations, Submitted to J. Phys. Chem. C, 2016.
  6. Jacob R. Boes and Gamze Gumuslu and James B. Miller and AndrewJ. Gellman and John R. Kitchin, Supporting Information: Estimating Bulk-Composition-Dependent\ce{H2} Adsorption Energies on \ce{CuxPd1-x} Alloy (111)Surfaces, ACS Catalysis, 2015.
  7. John D. Michael and Ethan L. Demeter and Steven M. Illes andQingqi Fan and Jacob R. Boes and John R. Kitchin, Alkaline Electrolyte and {Fe} Impurity Effects on thePerformance and Active-Phase Structure of {NiOOH} Thin Filmsfor {OER} Catalysis Applications, J. Phys. Chem. C, 2015.
  8. Jacob R. Boes and Mitchell C. Groenenboom and John A. Keithand John R. Kitchin, Neural Network and {Reaxff} Comparison for {Au} Properties, Int. J. Quantum Chem., 2016.

We can use this to search for documents with orgmode in a keyword or title too.

import pymongo
from pymongo import MongoClient
client = MongoClient('localhost', 27017)

db = client['bibtex']
entries = db['entries']

entries.create_index([('author', pymongo.TEXT),
                      ('title', pymongo.TEXT),
                      ('keyword', pymongo.TEXT)], sparse=True)

for i, result in enumerate(entries.find({"$text" : {"$search": "orgmode"}})):
    print('{i: 2d}. {author}, {title}, {journal}, {year}.'.format(i=i, **result).replace("\n", ""))
  1. John R. Kitchin, Data Sharing in Surface Science, Surface Science, 2016.
  2. Zhongnan Xu and John R. Kitchin, Probing the Coverage Dependence of Site and AdsorbateConfigurational Correlations on (111) Surfaces of LateTransition Metals, J. Phys. Chem. C, 2014.
  3. Xu, Zhongnan and Rossmeisl, Jan and Kitchin, John R., A Linear Response {DFT}+{U} Study of Trends in the OxygenEvolution Activity of Transition Metal Rutile Dioxides, The Journal of Physical Chemistry C, 2015.
  4. Prateek Mehta and Paul A. Salvador and John R. Kitchin, Identifying Potential \ce{BO2} Oxide Polymorphs for EpitaxialGrowth Candidates, ACS Appl. Mater. Interfaces, 2015.
  5. Xu, Zhongnan and Joshi, Yogesh V. and Raman, Sumathy andKitchin, John R., Accurate Electronic and Chemical Properties of 3d TransitionMetal Oxides Using a Calculated Linear Response {U} and a {DFT+ U(V)} Method, The Journal of Chemical Physics, 2015.
  6. Zhongnan Xu and John R. Kitchin, Relationships Between the Surface Electronic and ChemicalProperties of Doped 4d and 5d Late Transition Metal Dioxides, The Journal of Chemical Physics, 2015.
  7. Zhongnan Xu and John R Kitchin, Tuning Oxide Activity Through Modification of the Crystal andElectronic Structure: From Strain To Potential Polymorphs, Phys. Chem. Chem. Phys., 2015.
  8. Jacob R. Boes and Gamze Gumuslu and James B. Miller and AndrewJ. Gellman and John R. Kitchin, Supporting Information: Estimating Bulk-Composition-Dependent\ce{H2} Adsorption Energies on \ce{CuxPd1-x} Alloy (111)Surfaces, ACS Catalysis, 2015.
  9. Kitchin, John R., Examples of Effective Data Sharing in Scientific Publishing, ACS Catalysis, 2015.
  10. Curnan, Matthew T. and Kitchin, John R., Effects of Concentration, Crystal Structure, Magnetism, andElectronic Structure Method on First-Principles Oxygen VacancyFormation Energy Trends in Perovskites, The Journal of Physical Chemistry C, 2014.
  11. Kitchin, John R. and Van Gulick, Ana E. and Zilinski, Lisa D., Automating Data Sharing Through Authoring Tools, International Journal on Digital Libraries, 2016.
  12. Jacob R. Boes and Gamze Gumuslu and James B. Miller and AndrewJ. Gellman and John R. Kitchin, Estimating Bulk-Composition-Dependent \ce{H2} AdsorptionEnergies on \ce{CuxPd1-x} Alloy (111) Surfaces, ACS Catalysis, 2015.
  13. Zhongnan Xu and John R. Kitchin, Relating the Electronic Structure and Reactivity of the 3dTransition Metal Monoxide Surfaces, Catalysis Communications, 2014.
  14. Spencer D. Miller and Vladimir V. Pushkarev and AndrewJ. Gellman and John R. Kitchin, Simulating Temperature Programmed Desorption of Oxygen on{P}t(111) Using {DFT} Derived Coverage Dependent DesorptionBarriers, Topics in Catalysis, 2014.
  15. Hallenbeck, Alexander P. and Kitchin, John R., Effects of \ce{O_2} and \ce{SO_2} on the Capture Capacity of aPrimary-Amine Based Polymeric \ce{CO_2} Sorbent, Industrial \& Engineering Chemistry Research, 2013.

2 Querying from emacs-lisp

It is hard to get too excited about this if it is not easy to query from emacs and get data in a form we can use in emacs ;) The json library allows us to convert lisp data structures to json pretty easily. For example:

(require 'json)

(json-encode '((ENTRYTYPE . article)
               (author . (($regex . kitchin)
                          ($options . i)))
               (author . (($regex . boes)
                          ($options . i)))))
{"ENTRYTYPE":"article","author":{"$regex":"kitchin","$options":"i"},"author":{"$regex":"boes","$options":"i"}}

So, we can use an alist syntax to build up the query. Then we can send it to mongo using mongoexport, which returns a json string that we can read back into emacs to get lisp data. Here is an example that runs a query; we print the first element of the results.

(pp
 (aref (json-read-from-string
        (shell-command-to-string
         (format "mongoexport --quiet --jsonArray -d bibtex -c entries -q '%s'"
                 (json-encode '((ENTRYTYPE . article)
                                (author . (($regex . kitchin)
                                           ($options . i)))
                                (author . (($regex . boes)
                                           ($options . i))))))))
       0))
((_id
  ($oid . "5878d9644c114f59fe86cb36"))
 (author . "Jacob R. Boes and Peter Kondratyuk and Chunrong Yin and James\nB. Miller and Andrew J. Gellman and John R. Kitchin")
 (year . "2015")
 (title . "Core Level Shifts in {Cu-Pd} Alloys As a Function of Bulk\nComposition and Structure")
 (ENTRYTYPE . "article")
 (ID . "boes-2015-core-cu")
 (keyword . "DESC0004031, early-career")
 (volume . "640")
 (doi . "10.1016/j.susc.2015.02.011")
 (link . "http://dx.doi.org/10.1016/j.susc.2015.02.011")
 (issn . "0039-6028")
 (journal . "Surface Science")
 (pages . "127-132"))

That is pretty sweet, we get a lisp data structure we can use. We can wrap that into a reasonable looking function here:

(defun mongo-find (db collection query)
  "Query DB.COLLECTION for QUERY using mongoexport.
Returns the parsed json results as lisp data."
  (json-read-from-string
   (shell-command-to-string
    (format "mongoexport --quiet --jsonArray -d %s -c %s -q '%s'"
            db collection (json-encode query)))))
mongo-find

Now we can use the function to query the database, and then format the results. Here we look at the example of articles with authors that match "kitchin" and "boes".

(loop for counter from 1 for entry across
      (mongo-find "bibtex" "entries" '((ENTRYTYPE . article)
                                       (author . (($regex . kitchin)
                                                  ($options . i)))
                                       (author . (($regex . boes)
                                                  ($options . i)))))
      do
      (setq entry (append `(,(cons "counter" counter)) entry))
      ;; make sure we have a doi field.
      (if (assoc 'doi entry)
          (push (cons "doi" (format "http://dx.doi.org/%s" (cdr (assoc 'doi entry)))) entry)
        (push (cons "doi" "") entry))
      concat
      (concat (replace-regexp-in-string
               "\n" " "
               (s-format "${counter}. ${author}, ${title} (${year}). ${doi}"
                         'aget entry)) "\n"))
1. Jacob R. Boes and Peter Kondratyuk and Chunrong Yin and James B. Miller and Andrew J. Gellman and John R. Kitchin, Core Level Shifts in {Cu-Pd} Alloys As a Function of Bulk Composition and Structure (2015). http://dx.doi.org/10.1016/j.susc.2015.02.011
2. Jacob R. Boes and Gamze Gumuslu and James B. Miller and Andrew J. Gellman and John R. Kitchin, Estimating Bulk-Composition-Dependent \ce{H2} Adsorption Energies on \ce{Cu_{x}Pd_{1-x}} Alloy (111) Surfaces (2015). http://dx.doi.org/10.1021/cs501585k
3. Jacob R. Boes and Gamze Gumuslu and James B. Miller and Andrew J. Gellman and John R. Kitchin, Supporting Information: Estimating Bulk-Composition-Dependent \ce{H2} Adsorption Energies on \ce{Cu_{x}Pd_{1-x}} Alloy (111) Surfaces (2015). http://dx.doi.org/10.1021/cs501585k
4. G. Gumuslu and P. Kondratyuk and J. R. Boes and B. Morreale and J. B. Miller and J. R. Kitchin and A. J. Gellman, Correlation of Electronic Structure With Catalytic Activity: \ce{H2}-\ce{D2} Exchange Across \ce{Cu_{x}Pd_{1-x}} Composition Space (2015). http://dx.doi.org/10.1021/cs501586t
5. John D. Michael and Ethan L. Demeter and Steven M. Illes and Qingqi Fan and Jacob R. Boes and John R. Kitchin, Alkaline Electrolyte and {Fe} Impurity Effects on the Performance and Active-Phase Structure of {NiOOH} Thin Films for {OER} Catalysis Applications (2015). http://dx.doi.org/10.1021/acs.jpcc.5b02458
6. Jacob R. Boes and Mitchell C. Groenenboom and John A. Keith and John R. Kitchin, Neural Network and {Reaxff} Comparison for {Au} Properties (2016). http://dx.doi.org/10.1002/qua.25115
7. Jacob R. Boes and John R. Kitchin, Neural Network Predictions of Oxygen Interactions on a Dynamic Pd Surface (Accepted 12/2016). http://dx.doi.org/10.1080/08927022.2016.1274984
8. Jacob R. Boes and John R. Kitchin, Modeling Segregation on {AuPd}(111) Surfaces With Density Functional Theory and Monte Carlo Simulations (2016). 

Wow, that looks like a pretty lispy way to query the database and use the results. It is probably pretty easy to do similar things for inserting and updating documents. I will save that for another day.

3 Summary thoughts

This is not an exhaustive study of Mongo for a bibtex database. It does illustrate that it is potentially useful. Imagine a group of users can enter bibtex entries, and then share them through a central server. Or you query the server for entries and then select them using helm/ivy. That is probably faster than parsing large bibtex files (note, in org-ref I already cache the files in parsed form for performance reasons!).

It would make sense to split the authors, and keywords in another version of this database. It also could make sense to have a field that is the bibtex string, and to do text search on that string. That way you get everything in the entry for searching, and an easy way to generate bibtex files without having to reconstruct them.

It is especially interesting to run the queries through emacs-lisp since we get the benefit of editing lisp code while writing the query, e.g. parenthesis navigation, less quoting, etc… and we get back lisp data that can be used to construct helm/ivy queries, or other emacs things. That makes this look competitive with emacsql at least for the syntax. I predict that there will be more posts on this in the future.

Copyright (C) 2017 by John Kitchin. See the License for information about copying.


Scoring elfeed articles

| categories: elfeed, emacs |

I use elfeed to read RSS feeds of scientific journals, python, emacs, and lisp blogs, and the emacs stackexchange feed. Here are the current feeds I follow.

(mapcar 'list elfeed-feeds)
(http://syndic8.scopus.com/getMessage?registrationId=ADEJAEEKHFESCDEOCDFOAHGSCDHSAKHREFMSADHNJA cmu)
http://feeds.feedburner.com/acs/accacs
http://feeds.feedburner.com/acs/enfuem
http://feeds.feedburner.com/acs/esthag
http://feeds.feedburner.com/acs/jacsat
http://feeds.feedburner.com/acs/jpcbfk
http://feeds.feedburner.com/acs/jpccck
http://feeds.feedburner.com/acs/jpclcd
http://feeds.feedburner.com/acs/cmatex
http://feeds.feedburner.com/acs/jctcce
http://feeds.feedburner.com/acs/jcisd8
http://feeds.feedburner.com/acs/iecred
http://feeds.aps.org/rss/recent/prl.xml
http://feeds.aps.org/rss/recent/prb.xml
http://www.sciencemag.org/rss/current.xml
http://feeds.nature.com/nature/rss/current
http://feeds.nature.com/nmat/rss/current
http://feeds.nature.com/nchem/rss/current
http://rss.sciencedirect.com/publication/science/09270256
http://onlinelibrary.wiley.com/rss/journal/10.1002/(ISSN)1521-3773
http://scitation.aip.org/rss/content/aip/journal/jcp/latestarticles;jsessionid=6k76xb11z253.x-aip-live-06?fmt=rss
(http://planetpython.org/rss20.xml python)
(http://planet.scipy.org/rss20.xml python)
(http://planet.emacsen.org/atom.xml emacs)
http://planet.lisp.org/rss20.xml
http://catalysis-preprint-archive.github.io/updates.rss
https://www.cmu.edu/policies/news/rss-feed.rss
(http://emacs.stackexchange.com/feeds emacs)

I get a lot of articles this way. The current size of the database is:

(elfeed-db-size)
79721

Elfeed tells me I have over 300 unread entries to review at the moment.

(elfeed-search--count-unread)
341/363:24

To deal with this deluge, I have done a couple of things. I set up some new key-bindings so I can alternate marking entries as read if the titles do not look interesting. These keybindings let me alternate fingers, so they do not get too tired (that really happens some days!).

;; help me alternate fingers in marking entries as read
(define-key elfeed-search-mode-map (kbd "f") 'elfeed-search-untag-all-unread)
(define-key elfeed-search-mode-map (kbd "j") 'elfeed-search-untag-all-unread)

I also set up some auto-tagging of the emacs and python feeds, and set up some custom faces so these tags are highlighted and easy to see. Anything highlighted in blue is related to emacs, green is related to python, and pink is related to my department, and I can type s, then the tag, to see only those entries. Here is what my feed looks like:

[screenshot of the elfeed search buffer]

Today I want to explore adding tags to entries to further prioritize them. There is a way to tag entries that is described here: https://github.com/skeeto/elfeed#tag-hooks where you can create patterns to match an entry's feed title, feed url, entry title or link. Basically, you create a function that takes an entry, and have it add or remove a tag conditionally.
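
elfeed even ships a helper for building those functions; for example (a sketch along the lines of the elfeed README, with an illustrative feed url):

;; tag anything from a feed url matching the pattern with 'python
(add-hook 'elfeed-new-entry-hook
          (elfeed-make-tagger :feed-url "planetpython\\.org"
                              :add 'python))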

I want to tag entries that meet certain criteria, for example keywords, and set a tag based on the number of matches. Ideally, one day this would be integrated with machine learning so it could rank entries by similarity to entries I have liked. Today, though, we set up code that creates a score for an entry based on the number of matches, and then tags it so that it gets highlighted for me. First, we define two custom faces and set up elfeed to use them. I will use two tags: important and relevant. relevant will be for entries that get a score of exactly 1, and important for entries that get a score greater than 1.

(defface relevant-elfeed-entry
  `((t :background ,(color-lighten-name "orange1" 40)))
  "Marks a relevant Elfeed entry.")

(defface important-elfeed-entry
  `((t :background ,(color-lighten-name "OrangeRed2" 40)))
  "Marks an important Elfeed entry.")

(push '(relevant relevant-elfeed-entry)
      elfeed-search-face-alist)

(push '(important important-elfeed-entry)
      elfeed-search-face-alist)

In elfeed, each entry is a structure, and we can access the title and content for matching. Here is an example of a simple scoring function. The idea is just to match patterns, and then add to the score if it matches. This is not as advanced as gnus scoring, but it is a good starting point.

(defun score-elfeed-entry (entry)
  (let ((title (elfeed-entry-title entry))
        ;; content can be nil for some entries, so default to ""
        (content (or (elfeed-deref (elfeed-entry-content entry)) ""))
        (score 0))
    (loop for (pattern n) in '(("alloy" 1)
                               ("machine learning\\|neural" 1)
                               ("database" 1)
                               ("reproducible" 1)
                               ("carbon dioxide\\|CO2" 1)
                               ("oxygen evolution\\|OER\\|electrolysis" 1)
                               ("perovskite\\|polymorph\\|epitax" 1)
                               ("kitchin" 2))
          if (string-match pattern title)
          do (incf score n)
          if (string-match pattern content)
          do (incf score n))
    (message "%s - %s" title score)

    ;; store score for later in case I ever integrate machine learning
    (setf (elfeed-meta entry :my/score) score)

    (cond
     ((= score 1)
      (elfeed-tag entry 'relevant))
     ((> score 1)
      (elfeed-tag entry 'important)))
    entry))

(add-hook 'elfeed-new-entry-hook 'score-elfeed-entry)
score-elfeed-entry

New entries now automatically get tagged with relevant or important, depending on the score that function gives them, and they get color-coded. The feed then looks like this:

[screenshot of the scored elfeed search buffer]

I saved some bookmarks to see just the important or relevant ones (http://nullprogram.com/blog/2015/12/03/) so I can see new relevant entries with C-x r b and selecting the relevant bookmark. These work from anywhere in Emacs.

  elfeed @6-months-ago +unread +relevant
  elfeed @6-months-ago +unread +important

I usually access elfeed from a command that shows me everything. Here, I define key-bindings to show me just the important or relevant ones. I could not see a way to get an or in there to show me both of them. These keys make it a one key press to show only these entries, and then get back to the full list.

(define-key elfeed-search-mode-map (kbd "i")
  (lambda () (interactive)
    (elfeed-search-set-filter "@6-months-ago +unread +important")))

(define-key elfeed-search-mode-map (kbd "v")
  (lambda () (interactive)
    (elfeed-search-set-filter "@6-months-ago +unread +relevant")))

(define-key elfeed-search-mode-map (kbd "c")
  (lambda () (interactive)
    (elfeed-search-set-filter "@6-months-ago +unread")))

That summarizes the experiment of the day. There is clearly some room for improvement on the scoring function, e.g. moving the patterns out of the function and into a customizable variable, or making the patterns specific to either the title or content. I am going to try this for a few days first and see if it is actually helpful.
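
For example, the patterns could move into a customizable variable along these lines (a sketch; the variable name is mine, and score-elfeed-entry would then loop over it instead of a literal list):

(defcustom my-elfeed-score-patterns
  '(("alloy" 1)
    ("machine learning\\|neural" 1)
    ("kitchin" 2))
  "List of (REGEXP POINTS) used to score new elfeed entries."
  :type '(repeat (list regexp integer))
  :group 'elfeed)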

Copyright (C) 2017 by John Kitchin. See the License for information about copying.

