Sorting fields in bibtex entries

| categories: bibtex | tags:

I like consistency. In particular, for bibtex entries, I would like all the fields to be in the same order, and in all caps. Why? Because then I know where to look, and incorrect entries stand out more easily. My current bibtex file does not look like this! That is a result of adding bibtex entries from various journals, which all have different conventions. Today, I am going to look at a way to achieve what I want.

The principle idea is that we will parse the bibtex file into a list of entries represented by a convenient data structure. Then, we will format each entry the way we want, and print the result back out to a new file. I will use bibtexparser and python to do this.

Let us examine what bibtexparser does for us. Here we read in a file and get the entries. Each entry is represented as a dictionary.

from bibtexparser.bparser import BibTexParser

with open('../../CMU/proposals/link-to-2014/bo2-polymorphs/bo2-polymorphs.bib', 'r') as bibfile:
    bp = BibTexParser(bibfile)
    entries = bp.get_entry_list()

# look at the first entry
print entries[0]
{u'title': u'Effect of growth conditions on formation of TiO2-II\nthin films in atomic layer deposition process', u'journal': u'Journal of Crystal Growth', u'author': u'Aarik, J. and Aidla, A. and Sammelselg, V. and\nUustare, T.', u'number': u'3', 'id': 'aarik-1997-effec-tio2', u'volume': u'181', u'link': u'<Go to ISI>://A1997YD52700011', u'year': u'1997', 'type': u'article', u'pages': u'259-264'}

Let us take a moment to analyze our bibtex file. Let us see how many types of entries we have. That gives a chance to practice counting .

from bibtexparser.bparser import BibTexParser

with open('../../CMU/proposals/link-to-2014/bo2-polymorphs/bo2-polymorphs.bib', 'r') as bibfile:
    bp = BibTexParser(bibfile)
    entries = bp.get_entry_list()

types = [entry['type'] for entry in entries]
print dict((typ, types.count(typ)) for typ in types)
{u'inbook': 2, u'article': 90, u'book': 4, u'misc': 3, u'phdthesis': 1}

Indeed, there are a lot of entries that we do not want to do by hand. Here is the order I would like the fields to be for articles. A similar order for the other types would be fine too.

AUTHOR
TITLE
JOURNAL
VOLUME
ISSUE
PAGES
YEAR
DOI
URL or link
other fields

Bibtex lets you define arbitrary fields, and we do not want to lose these in the entries. I have for example defined fields for the path to a pdf, or to a notes file in some files. We will use python sets to handle this for us. With sets, we can conveniently compute the difference in fields between our ordered list, and the entry. Here is an example. We have a master list of keys, and an entry with extra keys. We use the difference function to get the list of extra keys.

entry = set(['author', 'title', 'journal', 'field1'])
master = set(['author', 'title'])

print entry.difference(master)
set(['journal', 'field1'])

So, we will use the list we want the order of, and then add the rest of the keys.

from bibtexparser.bparser import BibTexParser

with open('../../CMU/proposals/link-to-2014/bo2-polymorphs/bo2-polymorphs.bib', 'r') as bibfile:
    bp = BibTexParser(bibfile)
    entries = bp.get_entry_list()

field_order = ['author', 'title', 'journal', 'volume', 'issue' 'pages', 'year', 'doi', 'url', 'link']

entry_keys = set(entries[0].keys())
print entry_keys.difference(field_order)
set([u'number', 'id', 'type', u'pages'])

You can see a subtlety here, the pages key is a unicode string, but our fieldorder is a regular string. Also, number is a unicode string. It appears that all the keys are unicode except type and id. In the next block we will address that.

You should probably go ahead and remove non-ascii characters from your bib-file. We got lucky with this entry, but some entries have non-ascii characters and these cause errors.

So we need to specify the order of fields, how they should be formatted, and whether we should wrap the field contents into a nice block. We do that in the next block. Note that in the formats we use double {{ to get a literal { when we use string formatting. We use the formats to wrap the fields in brackets or quotes as needed. We use the textwrap module to neatly wrap multiline fields with indentation of the second line and beyond. By some iteration, I have made this print an entry that emacs-bibtex likes, and does not need to further reformat.

WARNING: The code below creates new files, and deletes files. Make sure you pay attention to this to avoid losing your own files. You do keep your bib-file under version control right ;).

from bibtexparser.bparser import BibTexParser
import textwrap

with open('../../CMU/proposals/link-to-2014/bo2-polymorphs/bo2-polymorphs.bib', 'r') as bibfile:
    bp = BibTexParser(bibfile)
    entries = bp.get_entry_list()

# field, format, wrap or not
field_order = [(u'author', '{{{0}}},\n', True),
               (u'title', '{{{0}}},\n', True),
               (u'journal','"{0}",\n', True),
               (u'volume','{0},\n', True),
               (u'number', '{0},\n', True),
               (u'pages', '{{{0}}},\n', True),
               (u'year', '{0},\n', True),
               (u'doi','{{{0}}},\n', False),
               (u'url','{{\url{{{0}}}}},\n', False),
               (u'link','{{\url{{{0}}}}},\n', False)]

# pick an entry, this time second to last one
entry = entries[-2]
keys = set(entry.keys())

extra_fields = keys.difference([f[0] for f in field_order])

# we do not want these in our entry, they go in the "header"
extra_fields.remove('type')
extra_fields.remove('id')

# Now build up our entry string
s = '@{type}{{{id},\n'.format(type=entry['type'].upper(),
                              id=entry['id'])

# Now handle the ordered fields, then the extra fields
for field, fmt, wrap in field_order:
    if field in entry:
        s1 = '  {0} ='.format(field.upper())
        s2 = fmt.format(entry[field])
        s3 = '{0:17s}{1}'.format(s1, s2)
        if wrap:
            # fill seems to remove trailing '\n'
            s3 = textwrap.fill(s3, subsequent_indent=' '*18, width=70) + '\n'
        s += s3  

for field in extra_fields:
    if field in entry:
        s1 = '  {0} ='.format(field.upper())
        s2 = entry[field]
        s3 = '{0:17s}{{{1}}}'.format(s1, s2)
        s3 = textwrap.fill(s3, subsequent_indent=' '*18, width=70) + '\n'
        s += s3  

s += '}\n\n'

print s
@ARTICLE{yang-2008-anatas-tio2,
  AUTHOR =       {Yang, H. G. and Sun, C. H. and Qiao, S. Z. and Zou,
                  J. and Liu, G. and Smith, S. C. and Cheng, H. M. and
                  Lu, G. Q.},
  TITLE =        {Anatase \ce{TiO_2} single crystals with a large
                  percentage of reactive facets},
  JOURNAL =      "Nature",
  VOLUME =       453,
  NUMBER =       7195,
  PAGES =        {638-U4},
  YEAR =         2008,
  DOI =          {10.1038/nature06964},
  LINK =         {\url{http://www.nature.com/nature/journal/v453/n7195/pdf/nature06964.pdf}},
  KEYWORD =      {TOTAL-ENERGY CALCULATIONS WAVE BASIS-SET
                  HYDROTHERMAL CONDITIONS TITANIUM-DIOXIDE SURFACE
                  OXIDE NANOSTRUCTURES NANOPARTICLES NANOCRYSTALS
                  EFFICIENCY}
}

That looks pretty good. Now, we are ready to try the whole file. We simply loop through all the entries, and append the string to a file for each entry.

from bibtexparser.bparser import BibTexParser
import os, textwrap

with open('../../CMU/proposals/link-to-2014/bo2-polymorphs/bo2-polymorphs.bib', 'r') as bibfile:
    bp = BibTexParser(bibfile)
    entries = bp.get_entry_list()

# field, format, wrap or not
field_order = [(u'author', '{{{0}}},\n', True),
               (u'title', '{{{0}}},\n', True),
               (u'journal','"{0}",\n', True),
               (u'volume','{{{0}}},\n', True),
               (u'number', '{{{0}}},\n', True),
               (u'pages', '{{{0}}},\n', True),
               (u'year', '{0},\n', True),
               (u'doi','{{{0}}},\n', False),
               (u'url','{{\url{{{0}}}}},\n', False),
               (u'link','{{\url{{{0}}}}},\n', False)]

# rm file if it exists. this is a new file, not our bibliography!
if os.path.exists('bib.bib'): os.unlink('bib.bib')

for entry in entries:
    
    keys = set(entry.keys())

    extra_fields = keys.difference([f[0] for f in field_order])
    # we do not want these in our entry
    extra_fields.remove('type')
    extra_fields.remove('id')

    # Now build up our entry string
    s = '@{type}{{{id},\n'.format(type=entry['type'].upper(),
                                  id=entry['id'])

    for field, fmt, wrap in field_order:
        if field in entry:
            s1 = '  {0} ='.format(field.upper())
            s2 = fmt.format(entry[field])
            s3 = '{0:17s}{1}'.format(s1, s2)
            if wrap:
                # fill seems to remove trailing '\n'
                s3 = textwrap.fill(s3, subsequent_indent=' '*18, width=70) + '\n'
            s += s3  

    for field in extra_fields:
        if field in entry:
            s1 = '  {0} ='.format(field.upper())
            s2 = entry[field]
            s3 = '{0:17s}{{{1}}}'.format(s1, s2)
            s3 = textwrap.fill(s3, subsequent_indent=' '*18, width=70) + '\n'
            s += s3  

    s += '}\n\n'

    with open('bib.bib', 'a') as f:
        f.write(s)

This results in bib.bib with 100 entries, which according to emacs is a syntactically correct bibtex file, and which builds this bibliography bib.pdf , which also has 100 entries. That usually means everything is in order (minor intention of pun there). More importantly, the fields are ordered the way I want them!

Getting to this point was an iterative process. You will want to make sure the original bib file is under version control or backed up someway, in case something happens during this transformation!

Copyright (C) 2014 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 8.2.5h

Discuss on Twitter