Table of Contents
I use org-mode extensively. I write scientific papers, keep notes on meetings, write letters of recommendation, notes on scientific articles, keep TODO lists in projects, help files for software, write lecture notes, students send me homework solutions in it, it is a contact database, … Some files are on Dropbox, Google Drive, Box, some in git repos, etc. The problem is that leads to org-files everywhere on my hard drive. At this point I have several thousand org-files that span about five years of work.
It is not that easy after a while to find them. Yes there are things like recent-files, bookmarks, counsel-find-file, helm-for-files, counsel/helm-locate, helm/counsel-grep/ag/pt, projectile for searching within a project, a slew of tools to search open buffers, there is recoll, etc… There are desktop search tools, and of course, good organization habits. Over a five year time span though, these change, and I have yet to find a solution to finding what I want. What about a file I made a year ago that is not in the current directory or this project, and not in my org-agenda-files list? How do I get a dynamic todo list across all these files? Or find all the files that cite a particular bibtex entry, or that were authored by a particular student?
Previously, I indexed org files with Swish-e to make it easy to search them, with an ability to search just headlines, or paragraphs, etc. The problem with that is the nightly indexing was slow since I basically had to regenerate the database each time due to limitations in Swish-e. Finally I have gotten around to the next iteration of this idea, which is a better database. In this post, I explore using sqlite to store headlines and links in org-files.
The idea is that anytime I open or save any org file, it will be added/updated in the database. The database will store the headlines and its properties and content, as well as the location and properties of all links and file keywords. That means I should be able to efficiently query all org files I have ever visited to find TODO headlines, tagged headlines, different types of links, etc. Here we try it out and see if it is useful.
1 The database design
I used emacsql to create and interact with a sqlite3 database. It is a lispy way to generate SQL queries. I will not talk about the code much here, you can see this version org-db.el . The database design consists of several tables that contain the filenames, headlines, tags, properties, (optionally) headline-content, headline-tags, headline-properties, and links. The lisp code is a work in progress, and not something I use on a daily basis yet. This post is a proof of concept to see how well this approach works.
I use hooks to update the database when an org-file is opened (only if it is different than what is in the database based on an md5 hash) and when it is saved. Basically, these functions delete the current entries in the database for a file, then use regular expressions to go to each headline or link in the file, and add data back to the database. I found this to be faster than parsing the org-file with org-element especially for large files. Since this is all done by a hook, anytime I open an org-file anywhere it gets added/updated to the database. The performance of this is ok. This approach will not guarantee the database is 100% accurate all the time (e.g. if something modifies the file outside of emacs, like a git pull), but it doesn't need to be. Most of the files do not change often, the database gets updated each time you open a file, and you can always reindex the database from files it knows about. Time will tell how often that seems necessary.
emacsql lets you use lisp code to generate SQL that is sent to the database. Here is an example:
(emacsql-flatten-sql [:select [name] :from main:sqlite_master :where (= type table)])
SELECT name FROM main.sqlite_master WHERE type = "table";
There are some nuances, for example, main:sqlite_master gets converted to main.sqlite_master. You use vectors, keywords, and sexps to setup the command. emacsql will turn a name like filename-id into filename_id. It was not too difficulty to figure out, and the author of emacsql was really helpful on a few points. I will be referring to this post in the future to remember some of these nuances!
Here is a list of tables in the database. There are a few primary tables, and then some that store tags, properties, and keywords on the headlines. This is typical of emacsql code; it is a lisp expression that generates SQL. In this next expression org-db is a variable that stores the database connection created in org-db.el.
(emacsql org-db [:select [name] :from main:sqlite_master :where (= type table)])
Here is a description of the columns in the files table:
(emacsql org-db [:pragma (funcall table_info files)])
and the headlines table.
(emacsql org-db [:pragma (funcall table_info headlines)])
Tags and properties on a headline are stored in headline-tags and headline-properties.
The database is not large if all it has is headlines and links (no content). It got up to half a GB with content, and seemed a little slow, so for this post I leave the content out.
du -hs ~/org-db/org-db.sqlite
Here we count how many files are in the database. These are just the org-files in my Dropbox folder. There are a lot of them! If I include all the org-files from my research and teaching projects this number grows to about 10,000! You do not want to run org-map-entries on that. Note this also includes all of the org_archive files.
(emacsql org-db [:select (funcall count) :from files])
Here is the headlines count. You can see there is no chance of remembering where these are because there are so many!
(emacsql org-db [:select (funcall count) :from headlines])
And the links. So many links!
(emacsql org-db [:select (funcall count) :from links])
That is a surprising number of links.
2 Querying the link table
Let's see how many are cite links from org-ref there are.
(emacsql org-db [:select (funcall count) :from links :where (= type "cite")])
Wow, I find that to also be surprisingly large! I make a living writing proposals and scientific papers, and I wrote org-ref to make that easier, so maybe it should not be so surprising. We can search the link database for files containing citations of "kitchin-2015-examp" like this. The links table only stores the filename-id, so we join it with the files table to get useful information. Here we show the list of files that contain a citation of that reference. It is a mix of manuscripts, proposals, presentations, documentation files and notes.
(emacsql org-db [:select :distinct [files:filename] :from links :inner :join files :on (= links:filename-id files:rowid) :where (and (= type "cite") (like path "%kitchin-2015-examp%"))])
Obviously we could use this to generate candidates for something like helm or ivy like this.
(ivy-read "Open: " (emacsql org-db [:select [files:filename links:begin] :from links :inner :join files :on (= links:filename-id files:rowid) :where (and (= type "cite") (like path "%kitchin-2015-examp%"))]) :action '(1 ("o" (lambda (c) (find-file (car c)) (goto-char (nth 1 c)) (org-show-entry)))))
Now, you can find every org-file containing any bibtex key as a citation. Since SQL is the query language, you should be able to build really sophisticated queries that combine filters for multiple citations, different kinds of citations, etc.
3 Headline queries
Every headline is stored, along with its location, tags and properties. We can use the database to find headlines that are tagged or with certain properties. You can see here I have 293 tags in the database.
(emacsql org-db [:select (funcall count) :from tags])
Here we find headlines tagged with electrolyte. I tagged some papers I read with this at some point.
(emacsql org-db [:select :distinct [files:filename headlines:title] :from headlines :inner :join headline-tags :on (= headlines:rowid headline-tags:headline-id) :inner :join tags :on (= tags:rowid headline-tags:tag-id) :inner :join files :on (= headlines:filename-id files:rowid) :where (= tags:tag "electrolyte") :limit 5])
|/Users/jkitchin/Dropbox/org-mode/prj-doe-early-career.org||2010 - Nickel-borate oxygen-evolving catalyst that functions under benign conditions|
|/Users/jkitchin/Dropbox/bibliography/notes.org||1971 - A Correlation of the Solution Properties and the Electrochemical Behavior of the Nickel Hydroxide Electrode in Binary Aqueous Alkali Hydroxides|
|/Users/jkitchin/Dropbox/bibliography/notes.org||1981 - Studies concerning charged nickel hydroxide electrodes IV. Reversible potentials in LiOH, NaOH, RbOH and CsOH|
|/Users/jkitchin/Dropbox/bibliography/notes.org||1986 - The effect of lithium in preventing iron poisoning in the nickel hydroxide electrode|
|/Users/jkitchin/Dropbox/bibliography/notes.org||1996 - The role of lithium in preventing the detrimental effect of iron on alkaline battery nickel hydroxide electrode: A mechanistic aspect|
Here we see how many entries have an EMAIL property. These could serve as contacts to send email to.
(emacsql org-db [:select [(funcall count)] :from headlines :inner :join headline-properties :on (= headlines:rowid headline-properties:headline-id) :inner :join properties :on (= properties:rowid headline-properties:property-id) :where (and (= properties:property "EMAIL") (not (null headline-properties:value)))])
If you want to see the ones that match "jkitchin", here they are.
(emacsql org-db [:select :distinct [headlines:title headline-properties:value] :from headlines :inner :join headline-properties :on (= headlines:rowid headline-properties:headline-id) :inner :join properties :on (= properties:rowid headline-properties:property-id) :where (and (= properties:property "EMAIL") (like headline-properties:value "%jkitchin%"))])
Here is a query to find the number of headlines where the deadline matches 2017. Looks like I am already busy!
(emacsql org-db [:select (funcall count) :from headlines :inner :join headline-properties :on (= headlines:rowid headline-properties:headline-id) :inner :join properties :on (= properties:rowid headline-properties:property-id) :where (and (= properties:property "DEADLINE") (glob headline-properties:value "*2017*"))])
4 Keyword queries
We also store file keywords, so we can search on document titles, authors, etc. Here are five documents with titles longer than 35 characters sorted in descending order.
(emacsql org-db [:select :distinct [value] :from file-keywords :inner :join keywords :on (= file-keywords:keyword-id keywords:rowid) :where (and (> (funcall length value) 35) (= keywords:keyword "TITLE")) :order :by value :desc :limit 5])
|pycse - Python3 Computations in Science and Engineering|
|org-show - simple presentations in org-mode|
|org-mode - A Human Readable, Machine Addressable Approach to Data Archiving and Sharing in Science and Engineering|
|modifying emacs to make typing easier.|
|jmax - John's customizations to maximize Emacs|
It is possible to search on AUTHOR, and others. My memos have a #+SUBJECT keyword, so I can find memos on a subject. They also use the LATEX_CLASS of cmu-memo, so I can find all of them easily too:
(emacsql org-db [:select [(funcall count)] :from file-keywords :inner :join keywords :on (= file-keywords:keyword-id keywords:rowid) :where (and (= value "cmu-memo") (= keywords:keyword "LATEX_CLASS")) :limit 5])
How about that, 119 memos… Still it sure is nice to be able to find them.
5 Full text search
In theory, the database has a table for the headline content, and it should be fully searchable. I found the database got a little sluggish, and nearly 1/2 a GB in size when using it so I am leaving it out for now.
The foundation for something really good is here. It is still a little tedious to wrote the queries with all the table joins, but some of that could be wrapped into a function for a query. I like the lispy style of the queries, although it can be tricky to map all the concepts onto SQL. A function that might wrap this could look like this:
(org-db-query (and (= properties:property "DEADLINE") (glob headline-properties:value "*2017*")))
This is what it would ideally look like using the org tag/property match syntax. Somehow that string would have to get expanded to generate the code above. I do not have a sense for how difficult that would be. It might not be hard with a recursive descent parser, written by the same author as emacsql.
The performance is only ok. For large org files there is a notable lag in updating the database, which is notable because while updating, Emacs is blocked. I could try using an idle timer for updates with a queue, or get more clever about when to update. It is not essential that the updates be real-time, only that they are reasonably accurate or done by the time I next search. For now, it is not too annoying though. As a better database, I have had my eye on xapian since that is what mu4e (and notmuch) uses. It might be good to have an external library for parsing org-files, i.e. not through emacs, for this. It would certainly be faster. It seems like a big project though, maybe next summer ;)
Another feature this might benefit from is ignore patterns, or some file feature that prevents it from being indexed. For example, I keep an encrypted password file in org-mode, but as soon as I opened it, it got indexed right into the database, in plain text. If you walk your file system, it might make sense to avoid some directories, like .dropbox.cache. Otherwise, this still looks like a promising approach.
Copyright (C) 2017 by John Kitchin. See the License for information about copying.
Org-mode version = 9.0.3