Finding similar bibtex entries

| categories: similarity, bibtex | tags:

A common task while writing scientific papers is citing previous research. I use org-ref extensively for that, and it makes it pretty easy to find similar references, e.g. that have common authors, or common keywords. It also lets me find similar articles in Web of Science or Scopus. Suppose that I have cited a particular paper, e.g. e boes-2016-neural-networ, and I want to find similar references to it that are already in my bibtex file, and similar by my definition. With org-ref I can easily search by keyword or author to find similar entries, but these are limited by what I search for, and they are not sorted. Today, I will explore the first step in a recommender system that calculates similarity, and provides a sorted list of candidates with the most relevant ones first.

The idea is to calculate some measure of similarity between the title of that reference, and the titles of other references in my bibtex file, and then sort them by similarity. This is the reference I want to find similar entries for:

Boes, J. R., Groenenboom, M. C., Keith, J. A., & Kitchin, J. R., Neural network and Reaxff comparison for Au properties, Int. J. Quantum Chem., 116(13), 979–987 (2016). https://doi.org/10.1002/qua.25115

The first thing we do is read in our bibtex file, and print a representative entry.

import bibtexparser
from bibtexparser.bparser import BibTexParser

with open('/Users/jkitchin/Dropbox/bibliography/references.bib') as bibtex_file:
    parser = BibTexParser()
    bib_database = bibtexparser.load(bibtex_file, parser=parser)
    entries = bib_database.entries

print(entries[10])

{'author': 'Jaan Aarik and Aleks Aidla and V{\\"a}ino Sammelselg and Teet\nUustare', 'title': 'Effect of Growth Conditions on Formation of \\ce{TiO_2}-{II}\nThin Films in Atomic Layer Deposition Process', 'journal': 'Journal of Crystal Growth', 'volume': '181', 'number': '3', 'pages': '259 - 264', 'year': '1997', 'doi': '10.1016/S0022-0248(97)00279-0', 'link': 'http://www.sciencedirect.com/science/article/pii/S0022024897002790', 'issn': '0022-0248', 'ENTRYTYPE': 'article', 'ID': 'aarik-1997-effec-growt'}

Each entry is a dictionary containing the fields and their values. For this exploration, I will only consider similarities between titles. The next step is we find which entry corresponds to the reference we want to find similarities to.

ids = [e['ID'] for e in entries]
i = ids.index('boes-2016-neural-networ')
print(entries[i])

{'author': 'Jacob R. Boes and Mitchell C. Groenenboom and John A. Keith\nand John R. Kitchin', 'title': 'Neural Network and {Reaxff} Comparison for {Au} Properties', 'journal': 'Int. J. Quantum Chem.', 'volume': '116', 'number': '13', 'pages': '979-987', 'year': '2016', 'doi': '10.1002/qua.25115', 'link': 'https://doi.org/10.1002/qua.25115', 'issn': '1097-461X', 'keyword': 'Kohn-Sham density functional theory, neural networks, reactive\nforce fields, potential energy surfaces, machine learning', 'ENTRYTYPE': 'article', 'ID': 'boes-2016-neural-networ'}

It is best if we make the entry we want to find similarities to the first one, so here we swap the first and ith entries.

entries[0], entries[i] = entries[i], entries[0]

Now, we prepare the list of strings to get similarities for.

titles = [e.get('title', '') for e in entries]

We will use term frequency–inverse document frequency to get a vector that represents each title, and then use cosine similarity as a measure of similarity. Here is the place to note that I chose these, and could choose other ones too. Also, it is worth noting that in this measure of similarity I did not choose which keywords to measure similarity on.

The functionality for this is provided by sklearn. It has implemented functions for the algorithms above, and in just a few lines of code you get an array of tf-idf features to analyze. The array we get from our vectorizer contains normalized vectors, so we can get the cosine similarity just from a dot product of the vectors. The first row corresponds to the similarity of the first string to all the others. I want them sorted in descending order. The argsort function returns ascending order, so we use a trick to sort the negative of the similarity score which achieves that. There are certainly more advanced treatments of the text we could use by customizing the vectorizer, e.g. word stemming, but for now we neglect that.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(titles)

cosine_similarities = (X * X.T).A[0]

related_docs_indices = (-cosine_similarities).argsort()

print('The top 10 recommendations for {} are:\n'.format(S[0]))
for i, j in enumerate(related_docs_indices[1:11]):
    print('{i}. {ID}: {title}, {author}\n'.format(i=i + 1, **entries[j]))

The top 10 recommendations for Neural Network and {Reaxff} Comparison for {Au} Properties are:

  1. behler-2010-neural: Neural network potential-energy surfaces for atomistic

simulations, J{\"o}rg Behler

  1. boes-2017-neural-networ: Neural Network Predictions of Oxygen Interactions on a Dynamic

{Pd} Surface, Jacob R. Boes and John R. Kitchin

  1. eshet-2010-ab: Ab Initio Quality Neural-Network Potential for Sodium, Hagai Eshet and Rustam Z. Khaliullin and Thomas D. K{\"u}hne

and J{\"o}rg Behler and Michele Parrinello

  1. behler-2014-repres-poten: Representing Potential Energy Surfaces By High-Dimensional

Neural Network Potentials, J Behler

  1. behler-2007-gener-neural: Generalized Neural-Network Representation of High-Dimensional

Potential-Energy Surfaces, J{\"o}rg Behler and Michele Parrinello

  1. artrith-2012-high: High-Dimensional Neural Network Potentials for Metal Surfaces:

A Prototype Study for Copper, Nongnuch Artrith and J{\"o}rg Behler

  1. behler-2015-const: Constructing High-Dimensional Neural Network Potentials: A

Tutorial Review, J{\"o}rg Behler

  1. artrith-2011-high: High-Dimensional Neural-Network Potentials for Multicomponent

Systems: Applications To Zinc Oxide, Nongnuch Artrith and Tobias Morawietz and J{\"o}rg Behler

  1. sosso-2012-neural-gete: Neural Network Interatomic Potential for the Phase Change

Material \ce{GeTe}, Gabriele C. Sosso and Giacomo Miceli and Sebastiano Caravati and J{\"o}rg Behler and Marco Bernasconi

  1. lorenz-2006-descr: Descriptions of Surface Chemical Reactions Using a Neural

Network Representation of the Potential-Energy Surface, S{\"o}nke Lorenz and Matthias Scheffler and Axel Gross

It is evident that this is showing other references containing the words "neural network"! I guess that is a little disappointing, since these would just as easily been narrowed down in org-ref. On the other hand, they are sorted and grouped, which would not happen in org-ref. This is a comparison of pretty short strings (just the titles), so maybe this would be much more interesting if abstracts were also included. Including authors would give a different set as well (I tried it, and got a bunch of my own references!).

I don't think it would be very difficult to get this into an Emacs selection tool, e.g. helm/ivy. Check this out:

import pycse.lisp

related_docs_indices[1:6].lisp

'(1592 1650 299 1751 103)'

That is a result that can be read directly by lisp, so we could simply write the code above as a shell script that takes an argument, and returns a list of indices to sort the candidates on. The alternative is to implement this in elisp, perhaps via a dynamic module if there is already a good C library for this. My sense is the Python libraries are more advanced in functionality.

This could have a number of other applications. Given some reference content, you could imagine finding emails that are similar to it, finding RSS entries that are similar to it, finding org headlines that are related, similar files, or similarity with any other set of strings that can be gathered, e.g. from Crossref or some other search, etc. I predict there will be more on these topics in the future!

Copyright (C) 2017 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 9.0.7

Discuss on Twitter