From Sunday, May 15, 8:00pm to Monday, May 16, 9:30am, TreeTapper will be offline. University of Tennessee, Knoxville, will be turning off power to much of campus in order to wire some renovated buildings.
When I left NESCent, I deposited TreeTapper's code in Google Code ( http://code.google.com/p/treetapper/ ). Subsequent development happened on a privately-hosted version of that code (I was worried about accidentally committing a file with database access passwords). I've since learned a lot more about SVN and have svn:ignored the dangerous files, as well as files that aren't really things to put in version control, such as cached images. Now, changes made on the Google Code repository will be made live on the site. This should make it easier to collaborate with others going forward.
UTK's Office of Information Technology will be performing some work on the network that will leave the main university website, department websites, individual sites, and many campus buildings completely offline from the afternoon of Friday, March 18 until sometime March 20. TreeTapper will sadly be included in this outage.
Today, TreeTapper is moving from hosting at NESCent to hosting at the University of Tennessee. NESCent hosting has been excellent (I remember one occasion during a weekend snow storm when NESCent's sysadmin braved Durham's poorly-treated streets and spinning drivers to go in and take care of servers that were, ironically enough, overheating). However, TreeTapper will be a bit easier for me to maintain hosted locally, as I'll be able to log in to the production server directly and change things without worrying as much about affecting other sites (though other O'Meara lab sites, like http://www.myrmecocystus.org, are also hosted on the server). So far the move has gone well; now to see if the DNS starts pointing properly. TreeTapper may be down for a few hours while the move completes. MacPorts has been helpful for things like recompiling a version of PHP that can talk to a database.
I have applied to NSF's Software Infrastructure for Sustained Innovation program for a developer for TreeTapper and some salary support for me. Consistent with TreeTapper's open way of doing science, you can see the submitted proposal (minus the legalese pages and my CV/other-grants pages) here (PDF). The goals are threefold: 1) improve the basic TreeTapper code to make it more sustainable, secure, and faster; 2) improve usability, especially the generation of the missing methods/software diagrams and user login (implementing OpenID), hopefully making it easier for others to get involved; and 3) make the code base deployable in other fields -- imagine a TreeTapper for methods in astronomy or ecology.
Continued thanks to NESCent for funding initial development of TreeTapper (feeding me, hosting the site, providing extensive expertise) [NSF grant EF-0423641] and Google Summer of Code for funding a student to create code for visual database traversal.
Getting everything working on Mac OS 10.5 was an issue, and there's also the security concern of having one's primary machine also act as a web server. I'm now exploring putting TreeTapper in a Sun VirtualBox on my Mac. I'm downloading Ubuntu Server now and will give it a try. Suggestions welcome, of course.
I'm moving my hosting from NESCent's servers to a server in my lab (running Mac OS 10.5). Having immediate access to the database structure will be helpful in continuing TreeTapper's development. Installing and configuring postgres on the Mac has taken some work. One useful site so far is here (though note that the latest version of postgres is 8.4, not 8.3). I'll post more as progress is made...
The code for TreeTapper (the website's mostly PHP code) is now available. I'll be continuing to modify it, but please feel free to look it over and suggest (or submit) corrections, esp. regarding security.
I'm shortly (<1 month) going to be posting TreeTapper's backend code to Google Code hosting (final site at http://code.google.com/p/treetapper/). I'll have all the relevant php files (except the one that has the database login information [I hope]) but not basic images. If anyone has any advice about things to watch out for when posting code of this sort, please let me know [I'm concerned about exposing security vulnerabilities -- don't want to rely on "security through obscurity", but if it helps...].
I'm working on slides for a job talk. I made one that might be of interest to readers of this blog: a plot of the number of references with "phylogen*" in any field in Scopus through time (the line on the figure, also the position of the blue circles). I also found the number of references with "evolution" or "ecolog*", as well as all combinations of these three terms, to make Venn diagrams of the overlap of the terms (using Google Charts). It's interesting to see how the three areas overlap and change in size through time: "phylogen*" is now much more common (up to 38% of the frequency of "evolution") and now overlaps more with references containing "ecolog*" than with those containing "evolution". I'm sure there are all sorts of artifacts and biases, but it's interesting to look at nonetheless. It does suggest that I should start thinking more about ecologists using TreeTapper.
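For anyone who wants to make similar figures: Google's Chart API builds a Venn diagram from a plain URL with seven data values. A minimal Python sketch of how the URL is assembled -- the counts below are placeholders, not the real Scopus numbers:

```python
from urllib.parse import urlencode

def venn_chart_url(a, b, c, ab, ac, bc, abc, size="400x200"):
    """Build a Google Chart API URL for a three-circle Venn diagram.

    The venn chart type (cht=v) takes seven values: the sizes of circles
    A, B, C, then the overlaps A&B, A&C, B&C, and A&B&C."""
    data = [a, b, c, ab, ac, bc, abc]
    biggest = max(data)  # scale everything relative to the largest circle
    scaled = ",".join("%.1f" % (100.0 * v / biggest) for v in data)
    params = urlencode({"cht": "v", "chs": size, "chd": "t:" + scaled})
    return "http://chart.apis.google.com/chart?" + params

# Invented counts, just to show the shape of the call:
url = venn_chart_url(a=100000, b=40000, c=38000,
                     ab=9000, ac=7000, bc=5000, abc=2500)
```

Paste the resulting URL into a browser (or an img tag) and Google renders the diagram for you.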
There has been a lot of effort and interest in phylogenetics in getting big trees (1000+ taxa). This should be the most difficult problem in phylogenetics -- after all, finding the best tree is generally (under most optimization criteria) NP-hard. My interest, and therefore that of TreeTapper, is in what happens after you get the big tree: using it to understand biological processes. This should be relatively simple. After searching an enormous tree space to get the best tree, doing something like estimating ancestral states is easy: just a downpass (postorder traversal) and an uppass (preorder traversal) calculating probabilities at each node for a single character. In practice, it seems many of the programs in this area fail with large trees, often due to rounding or underflow errors (inferred from my own experiences and those of NESCent postdocs or visitors Sam Price, Stephen Smith, and Jeremy Beaulieu). Most computer programs use numbers with finite precision -- a number smaller in magnitude than around 10E-308 can't be stored as a double, for example. That's a really small number, but if it's a likelihood (the probability of the data given the tree and model), a probability of 10E-308 is just a -lnL of 706.9, a number that isn't at all unusual for our sort of problems. Many calculations can simply be done using ln likelihoods rather than untransformed likelihoods, but for some calculations this is more difficult (certainly more difficult to code). R apparently quietly rounds small numbers to zero, leading to erroneous results [from reports from others -- I haven't verified this]. My program Brownie, which in the development branch can do discrete character reconstruction, failed on a tree of ~1600 taxa due to underflow errors, so I made a new class for really small (or really large) numbers that basically stores numbers in scientific notation, using a double for the mantissa and an int for the exponent (code available here).
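Brownie's actual class is C++ and lives at the link above; here is a minimal Python sketch of the same idea, with invented names, just to show why separating the mantissa from the exponent sidesteps the underflow:

```python
import math

class TinyFloat:
    """Sketch of a number stored as mantissa * 10**exponent, so that
    probabilities far below the ~1e-308 floor of a double survive
    multiplication. (Not Brownie's real class; names are invented.)"""

    def __init__(self, mantissa, exponent=0):
        self.m, self.e = mantissa, exponent
        self._normalize()

    def _normalize(self):
        # Keep the mantissa in [1, 10); the int exponent absorbs the magnitude.
        if self.m != 0:
            shift = math.floor(math.log10(abs(self.m)))
            self.m /= 10.0 ** shift
            self.e += shift

    def __mul__(self, other):
        # Multiply mantissas, add exponents -- neither ever underflows.
        return TinyFloat(self.m * other.m, self.e + other.e)

    def neg_ln(self):
        # -ln(m * 10**e) = -(ln m + e * ln 10)
        return -(math.log(self.m) + self.e * math.log(10.0))

# Multiplying 500 site likelihoods of 0.001 each underflows a plain double
# (0.001**500 is 0.0), but survives here with exponent -1500:
p = TinyFloat(1.0)
for _ in range(500):
    p = p * TinyFloat(1.0, -3)
```

The cost is that every arithmetic operation goes through the class rather than the hardware's native floating point, but for likelihood calculations on big trees that trade is worth it.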
Other programs might need similar kludges to work. It seems odd to me, doing programming as a biologist without much formal CS training, that common programming languages don't handle this sort of thing automatically (in the same way that needing to manage memory in C++ feels surprising), but it is an issue we may run into more and more often as trees get bigger.
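For the calculations that can stay in log space, the standard trick is to sum probabilities without ever leaving log space ("log-sum-exp"). A small sketch:

```python
import math

def log_sum_exp(log_values):
    """Compute ln(sum of exp(x) for x in log_values) without ever forming
    the untransformed likelihoods, which might underflow to zero."""
    biggest = max(log_values)
    # exp(x - biggest) is at most 1, so the sum is always representable.
    return biggest + math.log(sum(math.exp(x - biggest) for x in log_values))

# Summing three probabilities of about e^-1000 each: the raw values are 0.0
# as doubles (math.exp(-1000.0) underflows), but in log space the answer is
# just ln(3) above the inputs.
logs = [-1000.0, -1000.0, -1000.0]
total = log_sum_exp(logs)
```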
The relevance to TreeTapper is how and whether to record information about how programs perform with large trees. The first question is whether it's meaningful. For a simple program with one function, it's possible that it always fails if trees are over N taxa in size or if likelihoods get below X. But for complex programs like Mesquite, they might fail for some trees for some methods but work for others, so just storing a single number might be misleading. A second issue is that gathering the data might be difficult: for a typical user, deciding whether a program crashed with a tree with N taxa due to the tree size or some other bug will be hard. It also seems unreasonable to expect people to report to TreeTapper every time a program works or fails for them and under what conditions (I wouldn't make the time to do so). On the other hand, it would be helpful to users to know that a certain program just can't work with trees of a certain size and helpful for developers to know which programs need tweaking for large trees. I guess for now I won't have a separate field for this and will just rely on user comments on each program page, but let me know if you have any suggestions.
We recently had a community summit at NESCent on directions for the future. As part of the biodiversity and phylogenetics breakout group (see notes here), one thing that came up was a need for better ways of visualizing trees [one good idea was inviting Ben Fry to NESCent to work on this]. Rod Page suggested putting tree visualization in TreeTapper as a category, which is a great idea, as there are dozens of programs (see partial list at Felsenstein's site). But why, with so many programs, do people feel that so much more work is needed?
Well, based on behavior, it's obvious that current solutions don't work. NESCent has a fairly sophisticated set of users and builders of trees. When they need to view a large tree (hundreds to thousands of taxa), they don't use any sort of cool tree stretching or zooming program -- they find an old Mac, open Paup in Classic, and print out a tree over multiple pages, which they then assemble using tape and scissors. I think the reason they do this comes down to resolution of paper versus monitors (see some of Edward Tufte's books for a more general and informed discussion of this). My 1920 x 1200 pixel giant Apple Cinema display monitor (a perk loaned to all NESCent postdocs) can display fewer than 600 distinguishable horizontal lines (one pixel thick with one pixel between them). Our laser printer has a resolution of ~1200 DPI, suggesting that it could print this many lines in about one inch (I might be slightly off if there are some sort of constraints on dot geometry, but the basic idea still holds). By this calculation, my entire monitor display can be reproduced pixel by pixel in a few square inches. Plus, a printed tree can be arbitrarily large (Michael Donoghue described one several feet in diameter). Speed of visually parsing such a tree is related to how quickly the eye can move around a page: focus closely on one section to read the taxon names, jump to look at the overall picture, etc. On a screen, one would have to move the mouse around and wait for the screen to update. Even with tremendous zoom, there are only 1200 vertical positions my cursor can occupy (assuming cursor resolution == screen resolution) -- a tree larger than this, and there's no way to display even a cartoon of the whole tree on one screen in such a way that moving the mouse can select just one taxon (other than a nesting of zooms within zooms). 
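The arithmetic behind those claims, spelled out (approximate figures, ignoring any dot-geometry constraints on real printers):

```python
# Back-of-the-envelope comparison of monitor vs. laser-printer resolution.
monitor_w, monitor_h = 1920, 1200   # Apple Cinema display, pixels
printer_dpi = 1200                  # laser printer, dots per inch

# Distinguishable 1-px horizontal lines with 1-px gaps on the monitor:
monitor_lines = monitor_h // 2      # 600 lines

# The printer can alternate printed/blank rows of dots, so those same
# lines fit in this many vertical inches of paper:
inches_needed = (2 * monitor_lines) / printer_dpi   # 1.0 inch

# Paper area needed to reproduce every monitor pixel as one printer dot:
paper_sq_inches = (monitor_w * monitor_h) / printer_dpi ** 2   # 1.6 sq in
```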
In contrast, even on an 11"-high piece of paper, with half-inch margins and default line spacing, I can print a column of 155 taxon names (using 4-point Times, about the size used on insect labels), all easily readable. Just a few pieces of paper provide much more resolution than is possible on a costly monitor. It's similar to the comparison between a paper map and a GPS navigation system (or Google Maps/Earth): for a given area, the paper map has much more detail. The advantage of the navigation system (besides the whole navigation bit) is that it lets you zoom in and out, giving access to an effectively unlimited amount of information. For looking at a tree, though, as for visualizing a trip, seeing the entire thing in great detail at once can be much easier than constantly zooming in and out.
One solution is to have even higher-resolution displays (see Mike Sanderson's wall of monitors [taken from his web page] at the end of this post, for example). I think there will be limits to this, and we might have to wait for other fields to advance first. Instead, it might be worthwhile to work on better ways to print large trees on tiled pieces of paper. Imagine software that takes a set of bootstrap trees and creates a PDF you can print out and stitch together, showing support values and branch lengths, perhaps with color reconstructions of characters at the nodes, too. It seems appallingly low-tech and yet rather useful. As far as I know, Classic Paup is the only program that can do this, though Mesquite has some tree-printing options that might allow it [this will be much easier to know once this section of TreeTapper's database has been filled in]. The downside is that this doesn't do much for the problem of displaying trees in papers, but there, perhaps some summary graphic would work better.
I've been adding methods and software to the DB slowly, I think because of a somewhat awkward interface. TreeTapper is currently set up so that the various traits of a method or program (criterion, character type, etc.) are selected from pull-down menus or tables. If an element isn't there (such as a particular tree-file format), I currently have to go to a separate page, add the item, and then reload the first page. I'd been playing a bit with popup panels for this kind of entry and have decided to just do them. This will make adding things to the database simpler for a user: he or she can just go to the add-method, add-software, or add-reference page.
Paul McMillan has been working hard at his dbgraphnav Google Summer of Code project, and it's now pretty much ready to use. I've installed it on TreeTapper (see an example here). This allows navigation of author and reference relationships, a visual complement to the coauthor and reference tables already present. It's a general tool allowing navigation of relational databases with a lot of user configuration options. After all Paul's work, and our frequent IRC conversations, it's good to see it running. One nice aspect of it is caching: GraphViz takes a long time to draw graphs, and there's a user-set option to cache by time or by using diff to only update graphs that have changed (the latter works remarkably fast). I'll be firing off a few scripts tonight to generate cached files for all the people and references in the database to make it easier for people to use.
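The caching idea is roughly this (a sketch with invented names, not dbgraphnav's actual API, and using a hash of the DOT source where it can use diff):

```python
import hashlib
import os
import tempfile

def cached_graph_path(dot_source, cache_dir, render):
    """Re-render a GraphViz graph only when its DOT source has changed.
    Unchanged source hashes to the same filename, so the expensive
    GraphViz call is skipped entirely on a cache hit."""
    digest = hashlib.sha1(dot_source.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, digest + ".png")
    if not os.path.exists(path):
        render(dot_source, path)   # the slow GraphViz step
    return path

# Demo with a stand-in renderer that counts how often it is called:
calls = []
def fake_render(src, path):
    calls.append(src)
    with open(path, "w") as f:
        f.write("fake image bytes")

cache = tempfile.mkdtemp()
first = cached_graph_path("digraph { a -> b }", cache, fake_render)
again = cached_graph_path("digraph { a -> b }", cache, fake_render)  # hit
```

The diff-based variant dbgraphnav offers has the same shape; it just compares the new DOT source against the stored copy instead of hashing.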
One aspect of the project I've been disappointed by is GraphViz itself. When I proposed this project, it seemed like the logical choice to use, but I've been amazed at some of its limitations. For example, on graphs with many nodes, using the "neato" layout (an undirected graph layout), some nodes end up far from the rest of the graph (see Michael Donoghue's TreeTapper page for an example). This leaves nodes in the center tightly clumped while nodes at the edges have far too much whitespace around them. No matter what options we've tried for desired edge length, spring parameters, etc., we just can't make GraphViz pull those far-off nodes in. Any suggestions? Other than that sort of annoyance, though, this should be a very useful addition to TreeTapper and perhaps to other websites as well.
I've been quiet on the blog lately. I was prepping for the Evolution meetings; since coming back, I've been mostly focusing on a revision of a paper. Paul McMillan's Google Summer of Code project is running along well (dbgraphnav) -- a beta might be on TreeTapper as early as next week. Already, it can be used to navigate from author to paper to author again -- since the database has many authors and papers already (~2000), it's pretty useful.
I just made something to parse a series of Subversion logs to look at commits over time (so I can see how I'm allocating effort between projects). Here's a plot of TreeTapper commits. Some of the commits are automated backups of the DB, so the plot overestimates my coding productivity; on the other hand, it only tracks changes to the code, not additions to the database itself.
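My actual script isn't shown here, but the idea is simple enough to sketch in a few lines of Python (the sample revisions below are made up, not real TreeTapper commits):

```python
import re
from collections import Counter

# `svn log` prints a header line for each revision, e.g.:
#   r140 | bomeara | 2009-07-23 21:14:02 -0400 (Thu, 23 Jul 2009) | 1 line
# Pull the year and month out of each header and tally commits per month.
HEADER = re.compile(r"^r\d+ \| \S+ \| (\d{4})-(\d{2})-\d{2}", re.M)

def commits_per_month(svn_log_text):
    return Counter("%s-%s" % (y, m) for y, m in HEADER.findall(svn_log_text))

sample = """\
r140 | bomeara | 2009-07-23 21:14:02 -0400 (Thu, 23 Jul 2009) | 1 line
Added caching scripts
------------------------------------------------------------------------
r139 | bomeara | 2009-07-20 09:01:55 -0400 (Mon, 20 Jul 2009) | 1 line
Automated DB backup
------------------------------------------------------------------------
r120 | bomeara | 2009-06-02 11:30:00 -0400 (Tue, 02 Jun 2009) | 1 line
Tweaked author pages
"""
counts = commits_per_month(sample)
```

Feeding the counts into any plotting tool then gives the commits-over-time figure.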