Wednesday, February 13, 2008

Added people

I've gotten the reference parsing to work and have added 1545 references, from which 1840 authors were extracted. To get the references, I looked through various online reference databases (PubMed, ISI Web of Science) to select articles about methods (searching for particular authors, looking in particular journals (for example, issues of Systematic Biology over the past several years and articles in Bioinformatics that mentioned phylogenies), etc.) based on the article titles, downloaded these citations, imported them with their various formats into EndNote, and then exported a RIS-formatted file that I then uploaded to the website (in several chunks) and parsed using bibutils conversion and then a custom XML parser written using SimpleXML in PHP. The XML from bibutils is also saved with each reference in the database, making conversion of user-selected references for export in various formats easier (I hope, but we'll see when I code that). I've also created templates on the development site to automatically display information on the included authors in a RESTful way: http://treetapper.nescent.org/person will display a paginated table of all the authors in the database with the number of references, methods, and software each has in the database; clicking on an author's name will go to a page listing her or his coauthors (ranked by number of papers in common) and references. For example, going to http://treetapper.nescent.org/person/23 will go to a page for Mike Sanderson. You can then link from person to person in this way. 

Left to do: add XML output as an option, rather than just the html output with datatables; get the datatables to sort properly (currently, Yahoo User Interface datatables (version 2.4.1) sort lexically: sorting [5, 200, 12] gives [12, 200, 5], but author names also aren't sorting properly); and do tables and a REST interface for references (I also want to be able to autocomplete on an author's name and then just display the relevant references). It might be interesting at some point to add a way to output files to visualize author relationships (perhaps with Graphviz); this could also provide another way to navigate the database.

Monday, February 11, 2008

Parsing references

I'm using the same table for authors and users of the site, so that when a new person logs in, that person can see which papers/software are listed with her/him as an author and correct any mistakes. I also want people to be able to search for author names in the database. Both require being able to extract author names from submitted references. I'd originally written a simple RIS parser, but it didn't deal well with input files that had slightly odd formatting. I've thought of using Connotea to parse references (its API allows submission of references or just URLs/DOIs, and can return author names, article titles, etc.) but I've found problems with latency (it can take a few minutes for it to load submissions into its database). There is also a problem with its strong URL focus -- I found when uploading a batch of references from Endnote in RIS format, it only loaded one  Bioinformatics citation: somewhere in each RIS record, there was a generic URL for Bioinformatics, so Connotea saw them all as pointing to the same resource. Connotea also has the complementary problem (referred to as "buggotea") of having the same reference with different URLs (PMID with one submission, DOI with another, for example) entered multiple times in its database. I'll export tagged articles to Connotea, but it probably won't work for reference parsing. Thus, I'm now playing with using Christopher Putnam's bibutils, which convert from/to Endnote, RIS, Bibtex, ISI, and other reference formats using an XML intermediate: parsing this XML intermediate for author names should be fairly easy. Once the author names are identified, I have the code for adding authors to the database, including author order, and expanding author names when more information is available (for example, if entering a paper by "M. Sanderson", otherwise unknown in the database, his first name is stored as just "M"; when a later paper by "Michael Sanderson" is entered to the database, the script changes "M Sanderson" to "Michael Sanderson" in the database, assuming that they are the same author). Currently, the only problem with bibutils is having it called correctly by PHP on our server, but that's being addressed.