Monday, February 11, 2008

Parsing references

I'm using the same table for authors and users of the site, so that when a new person logs in, that person can see which papers/software are listed with her/him as an author and correct any mistakes. I also want people to be able to search for author names in the database. Both require being able to extract author names from submitted references. I'd originally written a simple RIS parser, but it didn't deal well with input files that had slightly odd formatting.

I've thought of using Connotea to parse references (its API allows submission of references or just URLs/DOIs, and can return author names, article titles, etc.), but I've found problems with latency: it can take a few minutes for it to load submissions into its database. There is also a problem with its strong URL focus. When I uploaded a batch of references from EndNote in RIS format, it loaded only one Bioinformatics citation: somewhere in each RIS record there was a generic URL for Bioinformatics, so Connotea saw them all as pointing to the same resource. Connotea also has the complementary problem (referred to as "buggotea") of having the same reference entered multiple times in its database under different URLs (PMID with one submission, DOI with another, for example). I'll export tagged articles to Connotea, but it probably won't work for reference parsing.

Thus, I'm now playing with Christopher Putnam's bibutils, which convert from/to EndNote, RIS, BibTeX, ISI, and other reference formats using an XML intermediate: parsing this XML intermediate for author names should be fairly easy. Once the author names are identified, I have the code for adding authors to the database, including author order, and for expanding author names when more information is available. For example, if a paper by "M. Sanderson" is entered and that author is otherwise unknown in the database, his first name is stored as just "M"; when a later paper by "Michael Sanderson" is entered into the database, the script changes "M Sanderson" to "Michael Sanderson", assuming that they are the same author. Currently, the only problem with bibutils is getting it called correctly by PHP on our server, but that's being addressed.
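As a rough illustration of the two steps described above (pulling author names out of the XML intermediate, then expanding an initial to a full first name), here is a minimal sketch in Python rather than the PHP the site runs on. It rests on several assumptions: that the XML intermediate is MODS with the namespace shown, that each author appears as a personal `name` element with `given` and `family` parts, and that a plain dictionary can stand in for the author table. The function names `extract_authors`, `is_abbreviation`, and `merge_author` are hypothetical, and the prefix-matching rule is a simplification of whatever the real script does.

```python
import xml.etree.ElementTree as ET

# Assumption: the XML intermediate is MODS in this namespace.
MODS_NS = {"m": "http://www.loc.gov/mods/v3"}

def extract_authors(mods_xml):
    """Return (given, family) tuples in document order.

    Only names that are direct children of a record are read, so names
    attached to nested elements (e.g. a host journal) are skipped.
    The list index doubles as the author order.
    """
    root = ET.fromstring(mods_xml)
    authors = []
    for name in root.iterfind("m:mods/m:name[@type='personal']", MODS_NS):
        parts = {"given": [], "family": []}
        for part in name.findall("m:namePart", MODS_NS):
            kind = part.get("type")
            if kind in parts and part.text:
                parts[kind].append(part.text.strip())
        if parts["family"]:
            authors.append((" ".join(parts["given"]), " ".join(parts["family"])))
    return authors

def is_abbreviation(short, full):
    """True if `short` is a non-empty, case-insensitive prefix of `full`."""
    return bool(short) and full.lower().startswith(short.lower())

def merge_author(known, given, family):
    """Store an author, expanding an earlier initial if possible.

    `known` maps a lowercase family name to the given names stored so far
    (a stand-in for the database table). Returns the given name kept.
    Like the approach in the post, this assumes that "M Sanderson" and
    "Michael Sanderson" are the same person.
    """
    new = given.replace(".", "").strip()
    stored = known.setdefault(family.lower(), [])
    for i, old in enumerate(stored):
        if is_abbreviation(old, new):   # stored "M", incoming "Michael"
            stored[i] = new
            return new
        if is_abbreviation(new, old):   # stored "Michael", incoming "M"
            return old
    stored.append(new)
    return new

# Made-up record for illustration only.
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods>
    <name type="personal">
      <namePart type="given">M.</namePart>
      <namePart type="family">Sanderson</namePart>
    </name>
    <name type="personal">
      <namePart type="given">Jane</namePart>
      <namePart type="family">Doe</namePart>
    </name>
  </mods>
</modsCollection>"""

if __name__ == "__main__":
    known = {}
    for order, (given, family) in enumerate(extract_authors(SAMPLE), start=1):
        print(order, merge_author(known, given, family), family)
    # A later, fuller record replaces the stored initial.
    print(merge_author(known, "Michael", "Sanderson"), "Sanderson")
```

The prefix rule will wrongly merge two different people who share a surname and first initial, so a real version would probably want a way for users to split authors apart again, which fits with letting people correct their own listings.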
