Tuesday, April 8, 2008

CiteSeerX launched

An alpha version of CiteSeerX has launched. This is cool, because it provides a way to get citation counts by crawling the web (unlike Thomson ISI or Google Scholar). Its database is limited to computer science (which includes many phylogeny articles, but not enough), so I'd have to get a copy of the source code (it's not yet evident where to download this) and start crawling on my own.

Wednesday, March 26, 2008

Google summer of code

NESCent is a hosting organization for Google Summer of Code; I've proposed a project to make a tool to allow databases to be navigated visually (essentially by combining WebDot (part of GraphViz) with something like sqlt-diagram, but for navigating table entries rather than just looking at the schema). Only one interested student so far; if it's not funded, I'll likely do it myself later for TreeTapper.

Friday, March 21, 2008

Progress update

I've started adding some methods to the database. It always feels a bit odd to shoehorn a method into a database like this, but it should be useful in the end. I'm also working on a better (i.e., working) findmethod/tool page. As users select options, the table of methods will automatically update. Maybe. I've also updated to YUI 2.5.1.

Friday, February 29, 2008

Captcha

To prevent bots from registering for the site and adding spam, captchas are useful, so I know I'll have to add them to TreeTapper before allowing users to register. While signing up for the new Encyclopedia of Life site, I found they use the reCAPTCHA service. This uses scanned words from books as images for people to interpret (people receive a pair of words, one known and one unknown). Free captcha service, plus gets people to help digitize books -- sounds like something I'll use for TreeTapper.

Friday, February 22, 2008

Upgrading to YUI 2.5.0

I'm upgrading the site to use YUI 2.5.0 (the previous version was 2.4.1). I decided to do this because the new version has many more features in data tables, and much of the utility of the TreeTapper site comes from interaction with data tables. The process of upgrading isn't too bad: the only problem has been needing either to add a "?" at the end of the datasource URL and then pass the query through initialRequest, or at least to set initialRequest to "" rather than the new default of "null" [under 2.4.1,

this.myDataSource = new YAHOO.util.DataSource("templates/vocabtable_js.php?table=applicationkind");

worked, while under 2.5.0, it is transmitted as

[URL]/templates/vocabtable_js.php?table=applicationkindnull

which won't work. The solution is to do either

this.myDataSource = new YAHOO.util.DataSource("templates/vocabtable_js.php?");
[...]
this.myDataTable = new YAHOO.widget.DataTable("applicationkind", myColumnDefs, this.myDataSource, {initialRequest: "table=applicationkind"});

or

this.myDataSource = new YAHOO.util.DataSource("templates/vocabtable_js.php?table=applicationkind");
[...]
this.myDataTable = new YAHOO.widget.DataTable("applicationkind", myColumnDefs, this.myDataSource, {initialRequest: ""});

The first option seems better].
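
The symptom above is what plain string concatenation of the DataSource URL and the initial request would produce. Here is a minimal sketch of that behavior; it is an illustration, not YUI's actual source, and `buildRequestUrl` is a hypothetical helper:

```javascript
// Hypothetical helper (not part of YUI): mimics a request URL being formed
// by appending the initial request to the DataSource URL.
function buildRequestUrl(liveData, initialRequest) {
  // String concatenation turns null into the literal text "null".
  return liveData + initialRequest;
}

const base = "templates/vocabtable_js.php";

// A null initial request appended to a complete URL corrupts the query.
const broken = buildRequestUrl(base + "?table=applicationkind", null);

// Option 1: URL ends in "?", query passed as the initial request.
const option1 = buildRequestUrl(base + "?", "table=applicationkind");

// Option 2: complete URL, empty string as the initial request.
const option2 = buildRequestUrl(base + "?table=applicationkind", "");

console.log(broken);
console.log(option1);
console.log(option2);
```

Both options end up requesting the same URL; the first keeps the query in one place if the table is later refreshed with a different request.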

The new version also has parsers (perhaps they were there before, but I missed them) that allow text coming over XHR to be converted to numbers, allowing numerical sorting.

The new version has at least a couple of downsides. First, making the tables seems slower (see this discussion), which is a problem when you need big tables, as TreeTapper does. Second, column headers are now drawn separately, and it can take a long time (>5 seconds) for them to line up with the corresponding columns in the table.

Tuesday, February 19, 2008

Google charts


I've added some Google charts to the front page of the TreeTapper site. I'm using googlechartseasyphpclass v 1.02 to more easily generate the code to call the charts (it involves converting numbers to letters for plotting, for one thing), though it does limit flexibility a bit (but the source code for the PHP script can be modified easily). YUI also has a new charts API, but it requires a very recent version of Flash (more recent than I had in Firefox); Google can simply create a PNG, which is convenient. The code that builds the data to pass to the charts takes a little while to run (dozens of Postgres calls): I might just update this daily and have the site call a saved version of the data. The charts I've made allow tracking of how much the database has grown in the previous month as well as breakdowns of references by year.
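
The number-to-letter conversion mentioned above is, as far as I understand it, the Chart API's "simple encoding": each value is scaled to an integer from 0 to 61 and written as one character from A-Z, a-z, 0-9, with "_" for a missing value. A rough sketch of the idea in JavaScript (the site itself relies on the PHP class; `simpleEncode` is a hypothetical name, and `maxValue` is assumed to be greater than zero):

```javascript
// The 62 characters used by simple encoding: index 0 is "A", index 61 is "9".
const SIMPLE_CHARS =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

// Hypothetical helper: scales each value to 0..61 and maps it to a character.
// Missing or negative values become "_". Assumes maxValue > 0.
function simpleEncode(values, maxValue) {
  const top = SIMPLE_CHARS.length - 1;
  return values
    .map((v) => {
      if (typeof v !== "number" || Number.isNaN(v) || v < 0) return "_";
      const scaled = Math.round((top * v) / maxValue);
      return SIMPLE_CHARS.charAt(Math.min(scaled, top));
    })
    .join("");
}

console.log(simpleEncode([0, 61], 61));
```

With only 62 levels the plotted values are coarse, which is fine for small front-page charts.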

Citation counts

When deciding which method to use, or on which areas to focus development, the popularity of related references matters. For example, if method A is used 20 times more frequently than method B to answer a given question, then, in the absence of other information, a naive user should probably use method A. Just counting total citations could be misleading, though: a good new method available only in the last year will take some time to gain as many citations as a poorer older method, despite acquiring citations at a faster rate. Thus, both number and rate matter. I'm thinking of showing both the total number of citations and the rate of gain of citations over a year or so; another way of displaying related info would be comparing the number of citations for a paper with the median number of citations for papers published in the same year (perhaps limiting the papers in the reference set to those similar in scope to the paper of interest).

To get this info, I'll need citation info, which is not generally available (see earlier post). I'm using the number of hits in Yahoo, obtained through its search API (which gives slightly different numbers than its HTML form search results), for both all pages and only PDF-formatted pages containing an article's title phrase, the last name of the first author, and the publication year. The tricky parts of getting this to work were converting apostrophes to HTML characters and making sure to include the title as a phrase, rather than as a string of words.

This approach has a few disadvantages. While the number of web hits probably correlates with how many times a paper is cited (early work suggests this is roughly true), the two are not wonderfully correlated (though web hits can pick up interesting new papers faster than waiting for later citing papers to appear). It is also easier to mislead: I could add my papers' titles, authors, and years as signatures to all my posts and then start posting on various forums, and TreeTapper would see my papers as very popular. Perhaps with the upcoming release of the new CiteSeerX, I can use that system to track citations better.

I've set up TreeTapper to store the number of hits for each paper every two weeks; this will allow me to track how the popularity of articles changes through time, recovering the rate rather than just the number of citations. As this recording is new, I'm showing only total hits until data have been recorded over more time intervals.
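
The search described above can be sketched as follows. The Yahoo API endpoint and parameters aren't given in this post, so the sketch stops at building and encoding the query string. Both function names are hypothetical, and percent-encoding the apostrophe is only my guess at what the apostrophe conversion involved:

```javascript
// Hypothetical helper: wraps the title in double quotes so it is searched as
// a phrase, then adds the first author's last name and the publication year.
function buildQuery(title, firstAuthorLastName, year) {
  return `"${title}" ${firstAuthorLastName} ${year}`;
}

// Hypothetical helper: encodes the query for use in a URL.
// encodeURIComponent leaves apostrophes untouched, so they are replaced
// explicitly (an assumption about what the apostrophe handling required).
function encodeQuery(query) {
  return encodeURIComponent(query).replace(/'/g, "%27");
}

// Made-up title and author, for illustration only.
const query = buildQuery("A paper's title", "Smith", 2008);
console.log(query);
console.log(encodeQuery(query));
```

Without the surrounding quotes, the search would match pages containing the title words anywhere, which inflates hit counts for papers with common words in their titles.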

Wednesday, February 13, 2008

Added people

I've gotten the reference parsing to work and have added 1545 references, from which 1840 authors were extracted. To get the references, I looked through various online reference databases (PubMed, ISI Web of Science) to select articles about methods (searching for particular authors, looking in particular journals (for example, issues of Systematic Biology over the past several years and articles in Bioinformatics that mentioned phylogenies), etc.) based on the article titles, downloaded these citations, imported them with their various formats into EndNote, and then exported a RIS-formatted file that I then uploaded to the website (in several chunks) and parsed using bibutils conversion and then a custom XML parser written using SimpleXML in PHP. The XML from bibutils is also saved with each reference in the database, making conversion of user-selected references for export in various formats easier (I hope, but we'll see when I code that). I've also created templates on the development site to automatically display information on the included authors in a RESTful way: http://treetapper.nescent.org/person will display a paginated table of all the authors in the database with the number of references, methods, and software each has in the database; clicking on an author's name will go to a page listing her or his coauthors (ranked by number of papers in common) and references. For example, going to http://treetapper.nescent.org/person/23 will go to a page for Mike Sanderson. You can then link from person to person in this way. 
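
The coauthor ranking on each person page amounts to counting shared papers. A minimal sketch, assuming each reference carries a list of author ids (the `authorIds` field and the function name are hypothetical, and the real site does this with database queries rather than in JavaScript):

```javascript
// Hypothetical helper: returns the coauthors of personId, ranked by the
// number of papers they share with that person (ties broken by id).
function rankCoauthors(references, personId) {
  const shared = new Map();
  for (const ref of references) {
    if (!ref.authorIds.includes(personId)) continue;
    for (const id of ref.authorIds) {
      if (id === personId) continue;
      shared.set(id, (shared.get(id) || 0) + 1);
    }
  }
  return [...shared.entries()]
    .map(([id, papers]) => ({ id, papers }))
    .sort((a, b) => b.papers - a.papers || a.id - b.id);
}

// Made-up data: person 1 shares two papers with person 2 and one with person 3.
const references = [
  { authorIds: [1, 2] },
  { authorIds: [1, 2, 3] },
  { authorIds: [3, 2] },
];
const ranked = rankCoauthors(references, 1);
console.log(ranked);
```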

Left to do: add XML output as an option, rather than just the html output with datatables; get the datatables to sort properly (currently, Yahoo User Interface datatables (version 2.4.1) sort lexically: sorting [5, 200, 12] gives [12, 200, 5], but author names also aren't sorting properly); and do tables and a REST interface for references (I also want to be able to autocomplete on an author's name and then just display the relevant references). It might be interesting at some point to add a way to output files to visualize author relationships (perhaps with Graphviz); this could also provide another way to navigate the database.
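
The sorting problem comes from values arriving as text. A short illustration in plain JavaScript (independent of YUI) of the lexical order versus the numeric order obtained after converting to numbers first:

```javascript
// Values as they might arrive over XHR: strings, not numbers.
const raw = ["5", "200", "12"];

// Default sort compares strings character by character.
const lexical = [...raw].sort();

// Converting to numbers and supplying a comparator gives numeric order.
const numeric = raw.map(Number).sort((a, b) => a - b);

console.log(lexical);
console.log(numeric);
```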

Monday, February 11, 2008

Parsing references

I'm using the same table for authors and users of the site, so that when a new person logs in, that person can see which papers/software are listed with her/him as an author and correct any mistakes. I also want people to be able to search for author names in the database. Both require being able to extract author names from submitted references. I'd originally written a simple RIS parser, but it didn't deal well with input files that had slightly odd formatting. I've thought of using Connotea to parse references (its API allows submission of references or just URLs/DOIs, and can return author names, article titles, etc.), but I've found problems with latency (it can take a few minutes for it to load submissions into its database). There is also a problem with its strong URL focus: when I uploaded a batch of references from EndNote in RIS format, it loaded only one Bioinformatics citation, because somewhere in each RIS record there was a generic URL for Bioinformatics, so Connotea saw them all as pointing to the same resource. Connotea also has the complementary problem (referred to as "buggotea") of having the same reference with different URLs (PMID with one submission, DOI with another, for example) entered multiple times in its database. I'll export tagged articles to Connotea, but it probably won't work for reference parsing. Thus, I'm now playing with using Christopher Putnam's bibutils, which convert from/to EndNote, RIS, BibTeX, ISI, and other reference formats using an XML intermediate: parsing this XML intermediate for author names should be fairly easy. Once the author names are identified, I have the code for adding authors to the database, including author order, and expanding author names when more information is available (for example, if entering a paper by "M. Sanderson", otherwise unknown in the database, his first name is stored as just "M"; when a later paper by "Michael Sanderson" is entered into the database, the script changes "M Sanderson" to "Michael Sanderson" in the database, assuming that they are the same author). Currently, the only problem with bibutils is having it called correctly by PHP on our server, but that's being addressed.
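
The name-expansion rule can be sketched like this. It is an illustration in JavaScript, not the site's PHP code; `expandFirstName` is a hypothetical name, and it assumes the last names have already been matched:

```javascript
// Hypothetical helper: given the stored first name and a newly seen first
// name for the same last name, keep the longer form when the shorter one
// is a prefix of it (e.g. "M" and "Michael"). Periods are ignored.
function expandFirstName(stored, incoming) {
  const a = stored.replace(/\./g, "").trim();
  const b = incoming.replace(/\./g, "").trim();
  if (a.length === 0) return b;
  if (b.length === 0) return a;
  const [short, long] = a.length <= b.length ? [a, b] : [b, a];
  // If the names are incompatible, leave the stored name unchanged.
  return long.toLowerCase().startsWith(short.toLowerCase()) ? long : a;
}

console.log(expandFirstName("M", "Michael"));
console.log(expandFirstName("Michael", "M."));
console.log(expandFirstName("M", "Robert"));
```

As noted above, this treats matching initials as the same person, so two different authors who share a last name and first initial would be merged.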

Thursday, January 31, 2008

Firebug debugger

TreeTapper will be fairly Ajax-rich (though with more RESTful interfaces, too, and with concern for accessibility). I didn't know any JavaScript (or PHP) when starting, so having a good debugger is handy. I first tried a "Javascript Debugger" plugin for Firefox, but it seemed a bit clunky. I've now installed Firebug, and I am so far quite happy with it. It allows debugging/inspection of the JavaScript, CSS, and HTML of the page.

Wednesday, January 16, 2008

Updated schema

I've updated the schema slightly, keeping track of citations in a separate table rather than just keeping one citation field in the reference table. This way, the number of citations over time can be stored (so the rate of increase of citations can be recovered, as well as the number at any point in time). For right now, "citations" will be the number of hits in Yahoo for the title, year, and lead author last name of a reference, since ISI and CrossRef both require payment to use for citation info and Google Scholar's terms of use prohibit display of info from that site (see Rod Page's post about this).
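
Once counts are stored as dated snapshots, the rate can be recovered from the first and last snapshot. A minimal sketch, with hypothetical field names (`date`, `hits`) since the actual table columns aren't listed here:

```javascript
const MS_PER_DAY = 86400000;

// Hypothetical helper: average gain in hits per day between the earliest
// and latest snapshot. Returns null when a rate cannot be computed.
function hitsPerDay(snapshots) {
  if (snapshots.length < 2) return null;
  const sorted = [...snapshots].sort(
    (a, b) => Date.parse(a.date) - Date.parse(b.date)
  );
  const first = sorted[0];
  const last = sorted[sorted.length - 1];
  const days = (Date.parse(last.date) - Date.parse(first.date)) / MS_PER_DAY;
  if (days <= 0) return null;
  return (last.hits - first.hits) / days;
}

// Made-up numbers: two snapshots taken 14 days apart.
const rate = hitsPerDay([
  { date: "2008-01-15", hits: 128 },
  { date: "2008-01-01", hits: 100 },
]);
console.log(rate);
```

A single stored count, as in the old one-field design, gives `null` here, which is the reason for moving the counts into their own table.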

Friday, January 11, 2008

Database schema


Here is the schema for the TreeTapper database. Entries are only shown after being verified by a DB curator (the approved field in each table): this (I hope) will allow users to add missing terms while allowing me to keep the list of terms from getting too disorganized.

Tuesday, December 18, 2007

Progress so far

TreeTapper is hosted by NESCent and uses a PostgreSQL database as a backend, an assortment of PHP scripts for generating pages, and the Yahoo User Interface Library (YUI) for a front end. Reasons for these decisions:
  • NESCent hosting: Free, stable, well-supported, appropriate for the project, and allows things like use of mod_rewrite to make creating a RESTful site easier.
  • Postgres: NESCent supports this rather than MySQL; triggers and views will be useful (though now also allowed in MySQL). I had thought about using arrays for some fields (a Postgres-only feature), such as when a particular bit of software can read multiple tree formats, but will instead use additional tables, following recommendations from NESCent's IT staff.
  • PHP scripts: I had looked into using Ruby on Rails, CakePHP, or other frameworks for development, but there seemed to be a lot of overhead in learning them for the benefit I'd receive.
  • YUI: This is one of the many libraries for Ajax development. YUI is feature-rich and has both great documentation (with many examples) and an active user forum, both important for me.
I've also created the database schema and a way to securely log in (using PHP sessions). Next, I'm adding the ability to store and look up references in Connotea (using a Perl API) and other reference-parsing tools so that I can start adding papers to the database; I'll also create forms for adding methods and software and start actually adding information.

First post

Part of my NESCent project is the creation of TreeTapper.org, a site dedicated to methods and software that use trees to understand biology. It has two main aims: 1) allowing users from fields as diverse as genomics, ecology, paleontology, and phylogenetics to identify methods and software that will allow them to answer interesting questions using phylogenetic trees, and 2) allowing developers to identify areas where methods are not yet available or where methods need to be implemented in software. I wanted a way to keep track of progress and issues involved in the development of the site; since this process might be of interest to others, I'm using this blog to do so. Any feedback is welcome. Also, though the site is (barely) live now, it's not yet ready for users: a formal announcement of its release will be made later (through EvolDir, Ecolog, and perhaps a journal).