Merging bibtex files and avoiding duplicates

| categories: bibtex | tags: | View Comments

I usually advocate to have a master bibtex file with all entries in it. Emacs is helpful at avoiding duplicate entries as you enter them. Sometimes though, you have more than one bibtex file. Maybe you started one for a new project, or someone sent you one. In any case, you want to merge the files into one file. Bibtex requires each entry to have a unique key.

Let us begin. I have two bibtex files I exported from Endnote. I have already removed all the non-ascii characters and cleaned them up pretty well. We start with some analysis.

from bibtexparser.bparser import BibTexParser
with open('../../CMU/proposals/link-to-2014/bo2-polymorphs/bo2-polymorphs.bib', 'r') as bibfile:
    bp = BibTexParser(bibfile)
    entries1 = bp.get_entry_list()

print '{0} entries in file 1'.format(len(entries1))

with open('../../CMU/proposals/link-to-2014/perovskite-strain/perovskite-strain.bib', 'r') as bibfile:
    bp = BibTexParser(bibfile)
    entries2 = bp.get_entry_list()

print '{0} entries in file 2'.format(len(entries2))
100 entries in file 1
129 entries in file 2

Now, let see how many duplicates there are. It is easy to use sets for this.

# store keys to check for duplicates
from bibtexparser.bparser import BibTexParser
with open('../../CMU/proposals/link-to-2014/bo2-polymorphs/bo2-polymorphs.bib', 'r') as bibfile:
    bp = BibTexParser(bibfile)
    entries1 = bp.get_entry_list()

with open('../../CMU/proposals/link-to-2014/perovskite-strain/perovskite-strain.bib', 'r') as bibfile:
    bp = BibTexParser(bibfile)
    entries2 = bp.get_entry_list()

entry1_keys = set([entry['id'] for entry in entries1])
entry2_keys = set([entry['id'] for entry in entries2])

duplicates = entry1_keys & entry2_keys
print 'There are {0} duplicates'.format(len(duplicates))
print duplicates
There are 20 duplicates
set(['nolan-2008-vacan-co', 'giocondi-2001-spatial', 'giocondi-2001-spatial-batio3', 'wang-2006-oxidat-gga', 'piskunov-2008-elect-lamno3', 'pala-2007-modif-oxidat', 'chretien-2006-densit-funct', 'giocondi-2008-sr2nb-batio3', 'kushima-2010-compet-lacoo3', 'pala-2009-co-ti', 'giocondi-2007-srtio3', 'lee-2009-ab-labo3', 'balasubramanian-2005-epitax-phase', 'mastrikov-2010-pathw-oxygen', 'shapovalov-2007-catal', 'evarestov-2005-compar-lcao', 'choi-2007-comput-study', 'havelia-2009-nucleat-growt', 'lee-2009-ab-defec', 'lee-2009-predic-surfac'])

Ok, now we make a function to format each entry. We take that code from this this post and turn it into a function. Then we add all the entries from the first file. Then, we add entries from the second file as long as the key is not in the list from the first file.

from bibtexparser.bparser import BibTexParser
import os, textwrap

def format_bibtex_entry(entry):
    # field, format, wrap or not
    field_order = [(u'author', '{{{0}}},\n', True),
                   (u'title', '{{{0}}},\n', True),
                   (u'journal','"{0}",\n', True),
                   (u'volume','{{{0}}},\n', True),
                   (u'number', '{{{0}}},\n', True),
                   (u'pages', '{{{0}}},\n', True),
                   (u'year', '{0},\n', True),
                   (u'doi','{{{0}}},\n', False),
                   (u'url','{{\url{{{0}}}}},\n', False),
                   (u'link','{{\url{{{0}}}}},\n', False)]
    
    keys = set(entry.keys())

    extra_fields = keys.difference([f[0] for f in field_order])
    # we do not want these in our entry
    extra_fields.remove('type')
    extra_fields.remove('id')

    # Now build up our entry string
    s = '@{type}{{{id},\n'.format(type=entry['type'].upper(),
                                  id=entry['id'])

    for field, fmt, wrap in field_order:
        if field in entry:
            s1 = '  {0} ='.format(field.upper())
            s2 = fmt.format(entry[field])
            s3 = '{0:17s}{1}'.format(s1, s2)
            if wrap:
                # fill seems to remove trailing '\n'
                s3 = textwrap.fill(s3, subsequent_indent=' '*18, width=70) + '\n'
            s += s3  

    for field in extra_fields:
        if field in entry:
            s1 = '  {0} ='.format(field.upper())
            s2 = entry[field]
            s3 = '{0:17s}{{{1}}}'.format(s1, s2)
            s3 = textwrap.fill(s3, subsequent_indent=' '*18, width=70) + '\n'
            s += s3  

    s += '}\n\n'
    return s

if os.path.exists('merged.bib'): os.unlink('merged.bib')    

with open('../../CMU/proposals/link-to-2014/bo2-polymorphs/bo2-polymorphs.bib', 'r') as bibfile:
    bp = BibTexParser(bibfile)
    entries1 = bp.get_entry_list()

for entry in entries1:
    with open('merged.bib', 'a') as f:
        f.write(format_bibtex_entry(entry))

# store keys to check for duplicates
entry1_keys = [entry['id'] for entry in entries1]

with open('../../CMU/proposals/link-to-2014/perovskite-strain/perovskite-strain.bib', 'r') as bibfile:
    bp = BibTexParser(bibfile)
    entries2 = bp.get_entry_list()

for entry in entries2:
    if not entry['id'] in entry1_keys:
        with open('merged.bib', 'a') as f:
            f.write(format_bibtex_entry(entry))

Here is the merged file: merged.bib and the corresponding bibliography merged.pdf . There are 209 entries in it, which is what we expected given that there were 20 duplicates. There are no doubt other programs that merge bibtex files, but I like this approach for the following reasons:

  1. I learned a new python module that parses bibtex files.
  2. I got my entries formatted exactly the way I wanted them.
  3. I defined what constituted a duplicate.

Of course, here we only eliminate entries with duplicate keys. If the same entry has different keys, they will be merged. This is a very hard problem to get right, since there are many possible ways to abbreviate author names, journal names, and multiple ways to write the title. That is a problem best solved by using a systematic way of generating the keys, so that you minimize the possibility of duplicates that way.

Copyright (C) 2014 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 8.2.5h

blog comments powered by Disqus