Tuesday, April 8, 2008
CiteSeerX launched
An alpha version of CiteSeerX has launched. This is cool, because it provides a way to get citation counts by crawling the web (unlike Thomson ISI or Google Scholar). Its database is limited to computer science (which includes many phylogeny articles, but not enough), so I'd have to get a copy of the source code (it isn't evident yet where to download it) and start crawling on my own.
Wednesday, March 26, 2008
Google summer of code
NESCent is a hosting organization for Google Summer of Code; I've proposed a project to make a tool to allow databases to be navigated visually (essentially by combining WebDot (part of GraphViz) with something like sqlt-diagram, but for navigating table entries rather than just looking at the schema). Only one interested student so far; if it's not funded, I'll likely do it myself later for TreeTapper.
Labels:
Google,
Google Summer of Code,
GraphViz,
WebDot
Friday, March 21, 2008
Progress update
I've started adding some methods to the database. It always feels a bit odd to shoehorn a method into a database like this, but it should be useful in the end. I'm also working on a better (i.e., working) findmethod/tool page. As users select options, the table of methods will automatically update. Maybe. I've also updated to YUI 2.5.1.
Friday, February 29, 2008
Captcha
To prevent bots from registering for the site and adding spam, captchas are useful, so I know I'll have to add them to TreeTapper before allowing users to register. While signing up for the new Encyclopedia of Life site, I found they use the reCAPTCHA service. This uses scanned words from books as images for people to interpret (people receive a pair of words, one known and one unknown). Free captcha service, plus gets people to help digitize books -- sounds like something I'll use for TreeTapper.
Friday, February 22, 2008
Upgrading to YUI 2.5.0
I'm upgrading the site to use YUI 2.5.0 (the previous version was 2.4.1). I decided to do this because the new version has many more features in data tables, and much of the utility of the TreeTapper site comes from interaction with data tables. The process of upgrading isn't too bad: the only problem has been needing either to add a "?" at the end of the datasource URL and pass the query through initialRequest, or at least to set initialRequest to "" rather than leaving it at the new default of null [under 2.4.1,
this.myDataSource = new YAHOO.util.DataSource("templates/vocabtable_js.php?table=applicationkind");
worked, while under 2.5.0, it is transmitted as
[URL]/templates/vocabtable_js.php?table=applicationkindnull
which won't work. The solution is to do either
this.myDataSource = new YAHOO.util.DataSource("templates/vocabtable_js.php?");
[...]
this.myDataTable = new YAHOO.widget.DataTable("applicationkind", myColumnDefs, this.myDataSource, {initialRequest: "table=applicationkind"});
or
this.myDataSource = new YAHOO.util.DataSource("templates/vocabtable_js.php?table=applicationkind");
[...]
this.myDataTable = new YAHOO.widget.DataTable("applicationkind", myColumnDefs, this.myDataSource, {initialRequest: ""});
The first option seems better].
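The "null" on the end of the URL looks like ordinary JavaScript string coercion. Here is a minimal sketch of that effect; buildRequestUrl is a hypothetical helper, not a YUI function, and it only assumes (based on the URL observed above) that the datasource URL and initialRequest are joined as strings.

```javascript
// Hypothetical helper imitating how the DataSource URL and the DataTable's
// initialRequest appear to be joined; this is not YUI code.
function buildRequestUrl(liveData, initialRequest) {
  // String + null coerces null to the text "null".
  return liveData + initialRequest;
}

// Default of null: the text "null" ends up inside the query string.
var broken = buildRequestUrl("templates/vocabtable_js.php?table=applicationkind", null);

// Option 1: keep the query out of the DataSource URL and pass it as initialRequest.
var option1 = buildRequestUrl("templates/vocabtable_js.php?", "table=applicationkind");

// Option 2: keep the full URL and pass an empty initialRequest.
var option2 = buildRequestUrl("templates/vocabtable_js.php?table=applicationkind", "");

console.log(broken);
console.log(option1);
console.log(option2);
```

Both options produce the same final URL; the first just keeps the base URL reusable for other requests.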
The new version also has parsers (perhaps they were there before, but I missed them) that allow text coming over XHR to be converted to numbers, allowing numerical sorting.
The new version has at least a couple of downsides. First, making the tables seems slower (see this discussion), which is a problem when you need big tables, as TreeTapper does. Second, column headers are now drawn separately, and it can take a long time (>5 seconds) for them to line up with the corresponding columns in the table.
Tuesday, February 19, 2008
Google charts

Citation counts
When deciding which method to use, or on what areas to focus development, the popularity of related references matters. For example, if method A is used 20 times more frequently than method B to answer a given question, then in the absence of other information a naive user should probably use method A. Just counting total citations could be misleading, though: a good new method available only in the last year will take some time to gain as many citations as a poorer older method, despite acquiring citations at a faster rate. Thus, both number and rate matter. I'm thinking of showing both the total number of citations and the rate of gain of citations over a year or so; another way of displaying related info would be comparing the number of citations for a paper with the median number of citations for papers published in the same year (perhaps limiting the reference set to papers similar in scope to the paper of interest).
To get this info, I'll need citation data, which are not generally available (see earlier post). I'm using the number of hits in Yahoo, obtained through its search API (which gives slightly different numbers than its HTML search form), for both all pages and only PDF-formatted pages containing an article's title phrase, the last name of the first author, and the publication year. The tricky parts of getting this to work were converting apostrophes to HTML characters and making sure to include the title as a phrase, rather than as a string of words.
This approach has a few disadvantages. Web hits probably correlate with how many times a paper is cited (early work suggests this is roughly true), but the correlation is not wonderful (though web hits can pick up interesting new papers faster than waiting for later citing papers to appear). Web hits are also easier to manipulate: I could add my papers' titles, authors, and years as signatures to all my posts and then start posting on various forums, and TreeTapper would see my papers as very popular.
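The query described above (title as a phrase, first author's last name, year) can be sketched roughly as follows. This is only an illustration: buildCitationQuery is a hypothetical name, the sample title is made up, and percent-encoding the apostrophe is my assumption about one workable escaping, not necessarily what the site's scripts do.

```javascript
// Hypothetical helper: builds a search query with the title as an exact phrase,
// followed by the first author's last name and the publication year.
function buildCitationQuery(title, firstAuthorLastName, year) {
  // Double quotes make the search engine treat the title as a phrase.
  var phrase = '"' + title + '"';
  var query = phrase + " " + firstAuthorLastName + " " + year;
  // encodeURIComponent leaves apostrophes untouched, so escape them explicitly.
  return encodeURIComponent(query).replace(/'/g, "%27");
}

// Made-up example title containing an apostrophe.
console.log(buildCitationQuery("Darwin's finches", "Grant", 2002));
```

The returned string is meant to be dropped into whatever query parameter the search API expects; that parameter name is deliberately left out here.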
Perhaps with the upcoming release of the new CiteSeerX, I can use that system to track citations better. I've set up TreeTapper to store the number of hits for each paper every two weeks; this will allow me to track how the popularity of articles changes through time to recover rate rather than just number of citations. As this recording is new, I'm now only showing total hits until the data are recorded over more time intervals.
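Once several recordings exist, a rate can be recovered from the stored counts. A minimal sketch, assuming each recording is an object with a Date and a hit count (the field names date and hits, and the function name hitsPerYear, are hypothetical and not the actual database columns):

```javascript
// Estimate the gain in hits per year from the earliest and latest recordings.
// Returns null when there are too few recordings to compute a rate.
function hitsPerYear(snapshots) {
  if (snapshots.length < 2) {
    return null; // a rate needs at least two recordings
  }
  // Copy before sorting so the caller's array is left untouched.
  var sorted = snapshots.slice().sort(function (a, b) {
    return a.date - b.date; // Date objects subtract to milliseconds
  });
  var first = sorted[0];
  var last = sorted[sorted.length - 1];
  var msPerYear = 365.25 * 24 * 60 * 60 * 1000;
  var years = (last.date - first.date) / msPerYear;
  if (years === 0) {
    return null; // both recordings share a timestamp
  }
  return (last.hits - first.hits) / years;
}
```

Using only the first and last recordings keeps the sketch simple; a fitted slope over all recordings would be less sensitive to one odd count.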
Wednesday, February 13, 2008
Added people
I've gotten the reference parsing to work and have added 1545 references, from which 1840 authors were extracted. To get the references, I looked through various online reference databases (PubMed, ISI Web of Science) to select articles about methods (searching for particular authors, looking in particular journals (for example, issues of Systematic Biology over the past several years and articles in Bioinformatics that mentioned phylogenies), etc.) based on the article titles, downloaded these citations, imported them with their various formats into EndNote, and then exported a RIS-formatted file that I then uploaded to the website (in several chunks) and parsed using bibutils conversion and then a custom XML parser written using SimpleXML in PHP. The XML from bibutils is also saved with each reference in the database, making conversion of user-selected references for export in various formats easier (I hope, but we'll see when I code that). I've also created templates on the development site to automatically display information on the included authors in a RESTful way: http://treetapper.nescent.org/person will display a paginated table of all the authors in the database with the number of references, methods, and software each has in the database; clicking on an author's name will go to a page listing her or his coauthors (ranked by number of papers in common) and references. For example, going to http://treetapper.nescent.org/person/23 will go to a page for Mike Sanderson. You can then link from person to person in this way.
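The coauthor ranking on a person page amounts to counting shared papers. Here is a rough sketch in JavaScript for illustration only (the site does this with PHP and the database); rankCoauthors and the authorIds field are hypothetical names, and breaking ties by id is my own choice.

```javascript
// Count how many papers each coauthor shares with the given person, then
// rank coauthors from most to fewest shared papers (ties broken by id).
function rankCoauthors(references, personId) {
  var counts = {};
  references.forEach(function (ref) {
    // Skip references the person did not author.
    if (ref.authorIds.indexOf(personId) === -1) {
      return;
    }
    ref.authorIds.forEach(function (id) {
      if (id !== personId) {
        counts[id] = (counts[id] || 0) + 1;
      }
    });
  });
  return Object.keys(counts)
    .map(function (id) {
      return { id: Number(id), papers: counts[id] };
    })
    .sort(function (a, b) {
      return b.papers - a.papers || a.id - b.id;
    });
}
```

Calling rankCoauthors(refs, 23) would then give the list shown on a page like person/23, most frequent coauthor first.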
Left to do: add XML output as an option, rather than just the HTML output with datatables; get the datatables to sort properly (currently, Yahoo User Interface datatables (version 2.4.1) sort lexically, so sorting [5, 200, 12] gives [12, 200, 5]; author names also aren't sorting properly); and build tables and a REST interface for references (I also want to be able to autocomplete on an author's name and then just display the relevant references). It might be interesting at some point to add a way to output files to visualize author relationships (perhaps with Graphviz); this could also provide another way to navigate the database.
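The lexical sorting problem is easy to reproduce in plain JavaScript, independent of YUI. This sketch uses only the built-in Array sort, not the datatable's own sorting code:

```javascript
// Values arriving as text are compared character by character by default.
var asText = ["5", "200", "12"].sort();

// Converting to numbers and supplying a numeric comparator sorts by value.
var asNumbers = ["5", "200", "12"].map(Number).sort(function (a, b) {
  return a - b;
});

console.log(asText);    // "12", "200", "5"
console.log(asNumbers); // 5, 12, 200
```

So the fix is to get the column values into number form before they are sorted, rather than to change the sort itself.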
Monday, February 11, 2008
Parsing references
I'm using the same table for authors and users of the site, so that when a new person logs in, that person can see which papers/software are listed with her/him as an author and correct any mistakes. I also want people to be able to search for author names in the database. Both require being able to extract author names from submitted references. I'd originally written a simple RIS parser, but it didn't deal well with input files that had slightly odd formatting. I've thought of using Connotea to parse references (its API allows submission of references or just URLs/DOIs, and can return author names, article titles, etc.) but I've found problems with latency (it can take a few minutes for it to load submissions into its database). There is also a problem with its strong URL focus -- I found when uploading a batch of references from EndNote in RIS format, it only loaded one Bioinformatics citation: somewhere in each RIS record, there was a generic URL for Bioinformatics, so Connotea saw them all as pointing to the same resource. Connotea also has the complementary problem (referred to as "buggotea") of having the same reference with different URLs (PMID with one submission, DOI with another, for example) entered multiple times in its database. I'll export tagged articles to Connotea, but it probably won't work for reference parsing. Thus, I'm now playing with using Christopher Putnam's bibutils, which convert from/to EndNote, RIS, BibTeX, ISI, and other reference formats using an XML intermediate: parsing this XML intermediate for author names should be fairly easy. Once the author names are identified, I have the code for adding authors to the database, including author order, and expanding author names when more information is available (for example, if entering a paper by "M. Sanderson", otherwise unknown in the database, his first name is stored as just "M"; when a later paper by "Michael Sanderson" is entered into the database, the script changes "M Sanderson" to "Michael Sanderson" in the database, assuming that they are the same author). Currently, the only problem with bibutils is having it called correctly by PHP on our server, but that's being addressed.
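The name-expansion rule can be sketched like this. It is written in JavaScript for illustration only (the site's scripts are PHP), expandFirstName is a hypothetical name, and it assumes the last names have already been matched:

```javascript
// Decide which first name to keep when a new reference mentions an author
// whose last name is already stored.
// Assumption: same last name plus a matching initial means the same person.
function expandFirstName(storedFirst, incomingFirst) {
  // Drop periods so "M." and "M" are treated alike.
  var stored = storedFirst.replace(/\./g, "").trim();
  var incoming = incomingFirst.replace(/\./g, "").trim();
  var storedIsPrefix =
    incoming.toLowerCase().indexOf(stored.toLowerCase()) === 0;
  // Take the incoming form only when it extends the stored one.
  if (storedIsPrefix && incoming.length > stored.length) {
    return incoming;
  }
  return stored;
}

console.log(expandFirstName("M", "Michael"));
```

The same assumption that makes this work is also its weakness: two different authors who share a last name and first initial would be merged.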
Labels:
bibutils,
citations,
Connotea,
reference parsing
Thursday, January 31, 2008
Firebug debugger
TreeTapper will be fairly Ajax-rich (though with more RESTful interfaces, too, and with concern for accessibility). I didn't know any JavaScript (or PHP) when starting, so having a good debugger is handy. I first tried a "Javascript Debugger" plugin for Firefox, but it seemed a bit clunky. I've now installed Firebug, and I am so far quite happy with it. It allows debugging/inspection of the JavaScript, CSS, and HTML of the page.
Wednesday, January 16, 2008
Updated schema

Friday, January 11, 2008
Database schema
Tuesday, December 18, 2007
Progress so far
TreeTapper is hosted by NESCent and uses a PostgreSQL database as a backend, an assortment of PHP scripts for generating pages, and the Yahoo User Interface Library (YUI) for a front end. Reasons for these decisions:
- NESCent hosting: Free, stable, well-supported, appropriate for the project, and allows things like use of mod_rewrite to make creating a RESTful site easier.
- Postgres: NESCent supports this rather than MySQL; triggers and views will be useful (though now also allowed in MySQL). I had thought about using arrays for some fields (a Postgres-only feature), such as when a particular bit of software can read multiple tree formats, but will instead use additional tables, following recommendations from NESCent's IT staff.
- PHP scripts: I had looked into using Ruby on Rails, CakePHP, or other frameworks for development, but there seemed to be a lot of overhead in learning them for the benefit I'd receive.
- YUI: This is one of the many libraries for Ajax development. YUI is feature-rich and has both great documentation (with many examples) and an active user forum, both important for me.
First post
Part of my NESCent project is the creation of TreeTapper.org, a site dedicated to methods and software that use trees to understand biology. It has two main aims: 1) allowing users from fields as diverse as genomics, ecology, paleontology, and phylogenetics to identify methods and software that will allow them to answer interesting questions using phylogenetic trees, and 2) allowing developers to identify areas where methods are not yet available or where methods need to be implemented in software. I wanted a way to keep track of progress and issues involved in the development of the site; since this process might be of interest to others, I'm using this blog to do so. Any feedback is welcome. Also, though the site is (barely) live now, it's not yet ready for users: a formal announcement of its release will be made later (through EvolDir, Ecolog, and perhaps a journal).