Using the Scopus api with xml output
Posted April 06, 2015 at 09:35 AM | categories: xml, scopus, python | tags:
According to http://api.elsevier.com/documentation/AbstractRetrievalAPI.wadl , the native form of the Scopus abstract document is xml, and the full abstract cannot always be represented as json. So… I am going to just bite the bullet and learn to deal with the xml. This is a companion post to http://kitchingroup.cheme.cmu.edu/blog/2015/04/04/Making-highly-linked-bibliographies-from-the-Scopus-API/ . Most of the code in this post gets tangled out to scopus_xml.py. I know it is not totally robust yet, but I have been using it for a lot of analysis, and it works pretty well so far.
This is another long post, with code that probably runs off screen. You can see the end result of what we do in this post here: http://kitchingroup.cheme.cmu.edu/publications.html .
We start with a general function to return an xml elementtree. We build in some caching to avoid downloading things we already have; this is slow, and there are limits on how many times you can download.
import requests import os import xml.etree.ElementTree as ET from my_scopus import MY_API_KEY def get_abstract_info(EID, refresh=False): 'Get and save the json data for EID.' base = 'scopus-xml/get_abstract_info' if not os.path.exists(base): os.makedirs(base) fname = '{0}/{1}'.format(base, EID) if os.path.exists(fname) and not refresh: with open(fname) as f: return ET.fromstring(f.read()) # Otherwise retrieve and save results url = ("http://api.elsevier.com/content/abstract/eid/" + EID + '?view=META_ABS') resp = requests.get(url, headers={'Accept':'application/xml', 'X-ELS-APIKey': MY_API_KEY}) with open(fname, 'w') as f: f.write(resp.text.encode('utf-8')) results = ET.fromstring(resp.text.encode('utf-8')) return results
Next, we do some introspection to see what we have.
from scopus_xml import * #results = get_abstract_info('2-s2.0-84896759135') #results = get_abstract_info('2-s2.0-84924911828') results = get_abstract_info('2-s2.0-84901638552') for el in results: print el.tag for el1 in el: print ' -->',el1.tag print
Now, some examples for myself to see how to get things.
from scopus_xml import * results = get_abstract_info('2-s2.0-84901638552') coredata = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}coredata') print coredata.find('{http://www.elsevier.com/xml/svapi/abstract/dtd}srctype').text print coredata.find('{http://www.elsevier.com/xml/svapi/abstract/dtd}source-id').text #authors = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}authors') #for author in results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}authors'): # print author.find('{http://www.elsevier.com/xml/ani/common}indexed-name').text for creator in coredata.find('{http://purl.org/dc/elements/1.1/}creator'): print creator.attrib print coredata.find('{http://purl.org/dc/elements/1.1/}title').text print coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}publicationName').text print coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}volume').text print coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}pageRange').text print coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}coverDate').text print coredata.find('{http://www.elsevier.com/xml/svapi/abstract/dtd}citedby-count').text print coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}doi').text for link in coredata.findall('{http://www.elsevier.com/xml/svapi/abstract/dtd}link'): if link.attrib['rel'] == 'scopus': print link.attrib['href'] else: print link.attrib['href'] # alternative xpath to get the link print coredata.find("./{http://www.elsevier.com/xml/svapi/abstract/dtd}link/[@rel='scopus']").attrib['href']
That is basically it. In the next sections, we basically recreate the previous functions from scopus.py using the xml data.
1 Authors
def get_author_link(EID): results = get_abstract_info(EID) authors = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}authors') if authors is None: return 'No authors found' s = [] for author in authors: name = author.find('{http://www.elsevier.com/xml/ani/common}indexed-name').text auid = author.attrib['auid'] s += ['<a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId={0}">{1}</a>'.format(auid, name)] return ', '.join(s)
from scopus_xml import * print get_author_link('2-s2.0-84896759135') print get_author_link('2-s2.0-84901638552')
<a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=8724572500">Thompson R.L.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=22981503200">Shi W.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=6506329719">Albenze E.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=23004637900">Kusuma V.A.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=55676869000">Hopkinson D.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=7003584159">Damodaran K.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=55005205100">Lee A.S.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=7004212771">Kitchin J.R.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=6701399651">Luebke D.R.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=24081524800">Nulwala H.</a> <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=55569461200">Xu Z.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=7004212771">Kitchin J.R.</a>
2 Journal
def get_journal_link(EID): results = get_abstract_info(EID) coredata = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}coredata') journal = coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}publicationName').text sid = coredata.find('{http://www.elsevier.com/xml/svapi/abstract/dtd}source-id').text s = '<a href="http://www.scopus.com/source/sourceInfo.url?sourceId={sid}">{journal}</a>' return s.format(sid=sid, journal=journal)
from scopus_xml import * print get_journal_link('2-s2.0-84901638552')
<a href="http://www.scopus.com/source/sourceInfo.url?sourceId=22746">Catalysis Communications</a>
3 DOI link
def get_doi_link(EID): results = get_abstract_info(EID) coredata = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}coredata') doi = coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}doi') if doi is not None: doi = doi.text s = '<a href="https://doi.org/{doi}">doi:{doi}</a>' return s.format(doi=doi)
from scopus_xml import * print get_doi_link('2-s2.0-84901638552')
4 Abstract link
def get_abstract_link(EID): results = get_abstract_info(EID) coredata = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}coredata') data = get_abstract_info(EID) title = coredata.find('{http://purl.org/dc/elements/1.1/}title').text.encode('utf-8') link = coredata.find("./{http://www.elsevier.com/xml/svapi/abstract/dtd}link/[@rel='scopus']").attrib['href'].encode('utf-8') s = '<a href="{link}">{title}</a>' return s.format(link=link, title=title)
from scopus_xml import * print get_abstract_link('2-s2.0-84901638552')
5 Citation image
def get_cite_img_link(EID): results = get_abstract_info(EID) coredata = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}coredata') doi = coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}doi') if doi is not None: doi = doi.text s = '<img src="http://api.elsevier.com/content/abstract/citation-count?doi={doi}&httpAccept=image/jpeg&apiKey={apikey}"></img>' return s.format(doi=doi, apikey=MY_API_KEY, cite_link=None)
from scopus_xml import * print get_cite_img_link('2-s2.0-84901638552')
6 Getting it all together
def get_html_citation(EID): results = get_abstract_info(EID) coredata = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}coredata') s = '{authors}, <i>{title}</i>, {journal}, <b>{volume}{issue}</b>, {pages}, ({year}), {doi}, {cites}.' issue = '' if coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}issueIdentifier') is not None: issue = '({})'.format( coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}issueIdentifier').text) volume = coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}volume') if volume is not None: volume = coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}volume').text else: volume = 'None' pages = '' if coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}pageRange') is not None: pages = 'p. ' + coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}pageRange').text elif coredata.find('{http://www.elsevier.com/xml/svapi/abstract/dtd}article-number') is not None: pages = coredata.find('{http://www.elsevier.com/xml/svapi/abstract/dtd}article-number').text else: pages = 'no pages found' year = coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}coverDate').text return s.format(authors=get_author_link(EID), title=get_abstract_link(EID), journal=get_journal_link(EID), volume=volume, issue=issue, pages=pages, year=year, doi=get_doi_link(EID), cites=get_cite_img_link(EID))
from scopus_xml import * print '<ol>' print '<li>',get_html_citation('2-s2.0-84896759135'),'</li>' print print '<li>',get_html_citation('2-s2.0-84924911828'),'</li>' print print '<li>',get_html_citation('2-s2.0-84901638552'),'</li>' print '</ol>'
- Thompson R.L., Shi W., Albenze E., Kusuma V.A., Hopkinson D., Damodaran K., Lee A.S., Kitchin J.R., Luebke D.R., Nulwala H., Probing the effect of electron donation on CO2 absorbing 1,2,3-triazolide ionic liquids, RSC Advances, 4(25), p. 12748-12755, (2014-03-17), doi:10.1039/c3ra47097k, .
- Xu Z., Kitchin J.R., Relationships between the surface electronic and chemical properties of doped 4d and 5d late transition metal dioxides, Journal of Chemical Physics, 142(10), 104703, (2015-03-14), doi:10.1063/1.4914093, .
- Xu Z., Kitchin J.R., Relating the electronic structure and reactivity of the 3d transition metal monoxide surfaces, Catalysis Communications, 52, p. 60-64, (2014-07-05), doi:10.1016/j.catcom.2013.10.028, .
7 Finally getting my documents
Here we get the EIDs from a search query. We use these in the next section to get a new bibliography.
import requests import json from my_scopus import MY_API_KEY resp = requests.get("http://api.elsevier.com/content/search/scopus?query=AU-ID(7004212771)&field=eid,aggregationType&count=100", headers={'Accept':'application/json', 'X-ELS-APIKey': MY_API_KEY}) results = resp.json() return [[str(r['eid']), str(r['prism:aggregationType'])] for r in results['search-results']["entry"] if str(r['prism:aggregationType']) == 'Journal']
2-s2.0-84924911828 | Journal |
2-s2.0-84923164062 | Journal |
2-s2.0-84924778427 | Journal |
2-s2.0-84924130725 | Journal |
2-s2.0-84901638552 | Journal |
2-s2.0-84898934670 | Journal |
2-s2.0-84896759135 | Journal |
2-s2.0-84896380535 | Journal |
2-s2.0-84896585411 | Journal |
2-s2.0-84916613197 | Journal |
2-s2.0-84908637059 | Journal |
2-s2.0-84880986072 | Journal |
2-s2.0-84881394200 | Journal |
2-s2.0-84873706643 | Journal |
2-s2.0-84876703352 | Journal |
2-s2.0-84867809683 | Journal |
2-s2.0-84864914806 | Journal |
2-s2.0-84865730756 | Journal |
2-s2.0-84864592302 | Journal |
2-s2.0-84863684845 | Journal |
2-s2.0-84866142469 | Journal |
2-s2.0-84861127526 | Journal |
2-s2.0-80052944171 | Journal |
2-s2.0-80051809046 | Journal |
2-s2.0-79953651013 | Journal |
2-s2.0-79952860396 | Journal |
2-s2.0-77956568341 | Journal |
2-s2.0-77954747189 | Journal |
2-s2.0-77956693843 | Journal |
2-s2.0-77949916234 | Journal |
2-s2.0-77955464573 | Journal |
2-s2.0-72049114200 | Journal |
2-s2.0-73149124752 | Journal |
2-s2.0-73149109096 | Journal |
2-s2.0-67449106405 | Journal |
2-s2.0-63649114440 | Journal |
2-s2.0-60849113132 | Journal |
2-s2.0-58649114498 | Journal |
2-s2.0-40949100780 | Journal |
2-s2.0-33750804660 | Journal |
2-s2.0-20544467859 | Journal |
2-s2.0-15744396507 | Journal |
2-s2.0-9744261716 | Journal |
2-s2.0-13444307808 | Journal |
2-s2.0-3042820285 | Journal |
2-s2.0-2942640180 | Journal |
2-s2.0-0142023762 | Journal |
2-s2.0-0141924604 | Journal |
2-s2.0-0037368024 | Journal |
2-s2.0-0037197884 | Journal |
8 And my html bibliography
This generates my blog bibliography page..
from scopus_xml import * import requests import json from my_scopus import MY_API_KEY resp = requests.get("http://api.elsevier.com/content/search/scopus?query=AU-ID(7004212771)&field=eid,aggregationType&count=100", headers={'Accept':'application/json', 'X-ELS-APIKey': MY_API_KEY}) results = resp.json() data = [[str(r['eid']), str(r['prism:aggregationType'])] for r in results['search-results']["entry"] if str(r['prism:aggregationType']) == 'Journal'] with open('../publications.html.mako', 'w') as f: f.write('''<%inherit file="_templates/site.mako" /> <article class="page_box"> <%self:filter chain="markdown"> <h1>Online collections of our work</h1> Pick your favorite: <ul> <li><a href="http://orcid.org/0000-0003-2625-9232">orcid:0000-0003-2625-9232</a></li> <li><a href="http://www.researcherid.com/rid/A-2363-2010">researcherid:A-2363-2010</a></li> <li><a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=7004212771">scopusid:7004212771</a></li> <li><a href="https://scholar.google.com/citations?user=jD_4h7sAAAAJ">Google Scholar</a></li> <li><a href="https://www.researchgate.net/profile/John_Kitchin">Research Gate</a></li> <li><a href="https://www.growkudos.com/profiles/40205">Kudos</a></li> </ul> <h1>Publications</h1> The authors are linked to their Scopus page, the title linked to the Scopus abstract, the journal linked to the Scopus journal page, and the DOI is linked to https://doi.org which normally redirects you to the journal page. <ol reversed="reversed"> ''') for eid,type in data: f.write('<li>{}</li>'.format(get_html_citation(eid))) f.write('''</ol> </%self:filter> </article> ''')
9 Summary
The XML format is not that intuitive to me. It takes some practice writing robust code, e.g. sometimes the find command does not find anything, and then there is not text attribute to get, so you should check for success on finding things. Also, some text is unicode, and you have to take care to encode it, which my library does not do uniformly. Finally, not all journals have things like volume or issue. My formatting code is not super flexible, so these bibliography entries show None in them occasionally. Still, it is not too bad, and this enables a lot of analysis of your publications, as well as displaying them in different ways. See the result of this page here: http://kitchingroup.cheme.cmu.edu/publications.html
Copyright (C) 2015 by John Kitchin. See the License for information about copying.
Org-mode version = 8.2.10