Reading and writing pdf metadata

Posted June 13, 2013 at 10:19 AM | categories: programming | tags:

One key to automating analysis of files is that data be stored in files in a way that is easy to add and retrieve. I have been examining ways to add grades to files so that a program could read the file, extract the grade, and collect it in a gradebook.

PDF files could be one format where this is possible. The idea is that students would turn in a PDF file as their homework. The assignment would be graded, perhaps with hand-written markup from a tablet, and the grader would enter the grade as custom metadata in the file. Later a program would read the grade in and compile the results.

In this post I examine some python modules that can do this. There are several python modules that can interact with PDF files.

pyPDF (seems to be replaced by PyPDF2).
pdfrw
pdfMiner (seems to be read-only)

All of these can be installed with pip. First, we look at getting existing information from a PDF file.

from pyPdf import PdfFileReader, PdfFileWriter

fname = '../../../Desktop/Program Organizer Controls.pdf'

pdf = PdfFileReader(open(fname, 'rb'))

print pdf.getDocumentInfo()

{'/Creator': u'Bluebeam Stapler 10.2.1', '/Author': u'John Kitchin', '/Producer': u'Bluebeam Brewery 5.0', '/CreationDate': u"D:20130612143804-04'00'", '/ModDate': u"D:20130613095927-04'00'"}

That is pretty straightfoward. Now, let us add some new metadata. We will create a Grade metadata, and store the grade in it. It appears we need to read in the pdf file, add its pages and metadata to a writer, set the new metadata, and then write out the file.

It seems that we cannot directly modify the PDF file, so we will write out to a new file, delete the old file, and rename the new file to the old file.

from pyPdf import PdfFileReader, PdfFileWriter
from pyPdf.generic import NameObject, createStringObject

inpfn = '../../../Desktop/Program Organizer Controls.pdf'

fin = file(inpfn, 'rb')
pdf_in = PdfFileReader(fin)

writer = PdfFileWriter()

for page in range(pdf_in.getNumPages()):
    writer.addPage(pdf_in.getPage(page))

infoDict = writer._info.getObject()

info = pdf_in.documentInfo
for key in info:
    infoDict.update({NameObject(key): createStringObject(info[key])})

# add the grade
infoDict.update({NameObject('/Grade'): createStringObject(u'A+')})

# It does not appear possible to alter in place.
fout = open(inpfn+'out.pdf', 'wb')

writer.write(fout)
fin.close()
fout.close()

import os
os.unlink(inpfn)
os.rename(inpfn+'out.pdf', inpfn)

Finally, we can see we successfully modified the file.

from pyPdf import PdfFileReader, PdfFileWriter

fname = '../../../Desktop/Program Organizer Controls.pdf'

pdf = PdfFileReader(open(fname, 'rb'))

print pdf.getDocumentInfo()
print pdf.getDocumentInfo()['/Grade']

{'/Grade': u'A+', '/CreationDate': u"D:20130612143804-04'00'", '/Producer': u'Bluebeam Brewery 5.0', '/Creator': u'Bluebeam Stapler 10.2.1', '/ModDate': u"D:20130613095927-04'00'", '/Author': u'John Kitchin'}
A+

You can see we were able to successfully add a Grade metadata field. It is stored as a Custom Document Property in my PDF viewer. I am not sure how easy it would be for a grader to enter this into a PDF. It could be possible to automate this with some kind of script that made a decent workflow. For example an org-mode file could have links that open the PDF, allow you to grade it. Then, you could click on another link that would prompt you for the grade, and then add it to the pdf. Or maybe a small script could be written that would open the pdf, wait for your to close it, then prompt you for the grade before moving to the next one.

org-mode source

The Kitchin Research Group

Chemical Engineering at Carnegie Mellon University

Reading and writing pdf metadata