
One key to automating the analysis of files is storing data in them in a way that makes it easy to add and retrieve. I have been examining ways to add grades to files so that a program could read a file, extract the grade, and collect it in a gradebook.

PDF files could be one format where this is possible. The idea is that students would turn in a PDF file as their homework. The assignment would be graded, perhaps with hand-written markup from a tablet, and the grader would enter the grade as custom metadata in the file. Later a program would read the grade in and compile the results.

In this post I examine some Python modules that can do this. Several Python modules can interact with PDF files:

1. pyPdf (seems to be replaced by PyPDF2).
2. pdfrw
3. PDFMiner (seems to be read-only)

All of these can be installed with pip. First, we look at getting existing information from a PDF file.

from pyPdf import PdfFileReader, PdfFileWriter

fname = '../../../Desktop/Program Organizer Controls.pdf'

pdf = PdfFileReader(open(fname, 'rb'))
print pdf.getDocumentInfo()

{'/Creator': u'Bluebeam Stapler 10.2.1', '/Author': u'John Kitchin', '/Producer': u'Bluebeam Brewery 5.0', '/CreationDate': u"D:20130612143804-04'00'", '/ModDate': u"D:20130613095927-04'00'"}


That is pretty straightforward. Now, let us add some new metadata. We will create a Grade metadata entry and store the grade in it. It appears we need to read in the PDF file, add its pages and metadata to a writer, set the new metadata, and then write out the file.

It seems that we cannot directly modify the PDF file, so we will write out to a new file, delete the old file, and rename the new file to the old file.

from pyPdf import PdfFileReader, PdfFileWriter
from pyPdf.generic import NameObject, createStringObject

inpfn = '../../../Desktop/Program Organizer Controls.pdf'

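fin = open(inpfn, 'rb')
pdf_in = PdfFileReader(fin)

writer = PdfFileWriter()

# Copy every page from the original into the writer.
for page in range(pdf_in.getNumPages()):
    writer.addPage(pdf_in.getPage(page))

infoDict = writer._info.getObject()

# Carry over the existing metadata.
info = pdf_in.documentInfo
for key in info:
    infoDict.update({NameObject(key): createStringObject(info[key])})

# Add the grade as a custom metadata entry.
infoDict.update({NameObject('/Grade'): createStringObject(u'A+')})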

# It does not appear possible to alter in place.
fout = open(inpfn+'out.pdf', 'wb')

writer.write(fout)
fin.close()
fout.close()

import os
# Delete the original, then move the new file into its place.
os.unlink(inpfn)
os.rename(inpfn+'out.pdf', inpfn)
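
The write, delete, and rename steps can be wrapped so that the original file is only touched once the new one has been written completely. The following is a minimal sketch using only the standard library; `replacement_file` is a hypothetical helper name, not part of any PDF module, and it assumes the original file is closed before the `with` block ends.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def replacement_file(path):
    """Yield a writable binary file that replaces `path` on success."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file beside the original so the final rename
    # never has to cross filesystems.
    fd, tmp_path = tempfile.mkstemp(suffix='.tmp', dir=directory)
    fout = os.fdopen(fd, 'wb')
    try:
        yield fout
    except BaseException:
        # Something failed: discard the partial output, keep the original.
        fout.close()
        os.unlink(tmp_path)
        raise
    fout.close()
    if os.path.exists(path):
        os.unlink(path)  # os.rename will not overwrite a file on Windows
    os.rename(tmp_path, path)
```

With this in place, the writing step above could become `with replacement_file(inpfn) as fout: writer.write(fout)`, with `fin.close()` called inside the block after the write. If the write raises an exception, the original PDF is left as it was.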


Finally, we can see we successfully modified the file.

from pyPdf import PdfFileReader, PdfFileWriter

fname = '../../../Desktop/Program Organizer Controls.pdf'

pdf = PdfFileReader(open(fname, 'rb'))
print pdf.getDocumentInfo()

{'/Grade': u'A+', '/CreationDate': u"D:20130612143804-04'00'", '/Producer': u'Bluebeam Brewery 5.0', '/Creator': u'Bluebeam Stapler 10.2.1', '/ModDate': u"D:20130613095927-04'00'", '/Author': u'John Kitchin'}
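
With the grade stored under the /Grade key, the collection step described at the start of the post could look something like the sketch below. It works on plain metadata dictionaries of the kind printed above, so it does not depend on a particular PDF module. The function name `collect_grades` and the file names are hypothetical, chosen only for illustration.

```python
def collect_grades(info_by_file, key='/Grade'):
    """Split files into graded and ungraded based on a metadata key.

    info_by_file maps a file name to its metadata dictionary, e.g. the
    result of pdf.getDocumentInfo() for that file (None is treated as
    an empty dictionary).
    """
    gradebook = {}
    ungraded = []
    for fname in sorted(info_by_file):
        info = info_by_file[fname] or {}
        if key in info:
            gradebook[fname] = info[key]
        else:
            ungraded.append(fname)
    return gradebook, ungraded


# Hypothetical metadata for two submissions; in practice each dictionary
# would come from reading the corresponding PDF file.
submissions = {
    'alice.pdf': {'/Author': u'Alice', '/Grade': u'A+'},
    'bob.pdf': {'/Author': u'Bob'},
}
gradebook, ungraded = collect_grades(submissions)
```

Returning the ungraded files separately makes it easy to spot submissions that were never marked, rather than silently leaving them out of the gradebook.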