Friday, October 17, 2008

Using large trees is surprisingly difficult

There has been a lot of effort and interest in phylogenetics in getting big trees (1000+ taxa). This should be the most difficult problem in phylogenetics -- after all, finding the best tree is generally (under most optimization criteria) NP-hard. My interest, and therefore that of TreeTapper, is in what happens after you get the big tree: using it for understanding biological processes. This should be relatively simple. After searching an enormous tree space to get the best tree, doing something like estimating ancestral states is easy, just a downpass (postorder traversal) and uppass (preorder traversal) calculating probabilities at each node for just one character. In practice, it seems many of the programs in this area fail with large trees, often due to rounding or underflow errors (inferred from my own experiences, and those of NESCent postdocs or visitors Sam Price, Stephen Smith, and Jeremy Beaulieu). Most computer programs use numbers with finite precision -- a number smaller in magnitude than around 10E-308 can't be stored as a double, for example. That's a really small number, but if it's a likelihood (probability of the data given the tree and model), a probability of 10E-308 is just a -lnL of 706.9, a number that isn't that unusual for our sort of problems. Many calculations can just be done using ln likelihoods rather than untransformed likelihoods, but for some calculations this is more difficult (certainly more difficult to code). R apparently quietly rounds things to zero with small numbers, leading to erroneous results [from reports from others -- I haven't verified this]. My program Brownie, which in the development branch can do discrete character reconstruction, failed on a tree of ~1600 taxa due to underflow errors, so I made a new class for really small or small numbers that basically stores numbers in scientific notation, using a double for the mantissa and an int for the exponent (code available here). Other programs might need similar kludges to work. It seems odd to me, doing programming as a biologist without much formal CS training, that common programming languages don't do this sort of thing automatically (in the same way that needing to manage memory in C++ feels surprising), but it is an issue that may become more frequently encountered by us.

The relevance to TreeTapper is how and whether to record information about how programs perform with large trees. The first question is whether it's meaningful. For a simple program with one function, it's possible that it always fails if trees are over N taxa in size or if likelihoods get below X. But for complex programs like Mesquite, they might fail for some trees for some methods but work for others, so just storing a single number might be misleading. A second issue is that gathering the data might be difficult: for a typical user, deciding whether a program crashed with a tree with N taxa due to the tree size or some other bug will be hard. It also seems unreasonable to expect people to report to TreeTapper every time a program works or fails for them and under what conditions (I wouldn't make the time to do so). On the other hand, it would be helpful to users to know that a certain program just can't work with trees of a certain size and helpful for developers to know which programs need tweaking for large trees. I guess for now I won't have a separate field for this and will just rely on user comments on each program page, but let me know if you have any suggestions.

Tuesday, October 14, 2008

Visualizing trees

We recently had a community summit at NESCent on directions for the future. As part of the biodiversity and phylogenetics breakout group (see notes here), one thing that came up was a need for better ways of visualizing trees [one good idea was inviting Ben Fry to NESCent to work on this]. Rod Page suggested putting tree visualization in TreeTapper as a category, which is a great idea, as there are dozens of programs (see partial list at Felsenstein's site). But why, with so many programs, do people feel that so much more work is needed?

Well, based on behavior, it's obvious that current solutions don't work. NESCent has a fairly sophisticated set of users and builders of trees. When they need to view a large tree (hundreds to thousands of taxa), they don't use any sort of cool tree stretching or zooming program -- they find an old Mac, open Paup in Classic, and print out a tree over multiple pages, which they then assemble using tape and scissors. I think the reason they do this comes down to resolution of paper versus monitors (see some of Edward Tufte's books for a more general and informed discussion of this). My 1920 x 1200 pixel giant Apple Cinema display monitor (a perk loaned to all NESCent postdocs) can display fewer than 600 distinguishable horizontal lines (one pixel thick with one pixel between them). Our laser printer has a resolution of ~1200 DPI, suggesting that it could print this many lines in about one inch (I might be slightly off if there are some sort of constraints on dot geometry, but the basic idea still holds). By this calculation, my entire monitor display can be reproduced pixel by pixel in a few square inches. Plus, a printed tree can be arbitrarily large (Michael Donoghue described one several feet in diameter). Speed of visually parsing such a tree is related to how quickly the eye can move around a page: focus closely on one section to read the taxon names, jump to look at the overall picture, etc. On a screen, one would have to move the mouse around and wait for the screen to update. Even with tremendous zoom, there are only 1200 vertical positions my cursor can occupy (assuming cursor resolution == screen resolution) -- a tree larger than this, and there's no way to display even a cartoon of the whole tree on one screen in such a way that moving the mouse can select just one taxon (other than a nesting of zooms within zooms). In contrast, even on a 11" high piece of paper, with half inch margins and default line spacing, I can print a column of 155 taxon names (using 4 point Times font, about the size used on insect labels), all easily readable. Just a few pieces of paper provide much more resolution than possible on a costly monitor. It's similar to the comparison between a paper map and a GPS navigation system (or Google Maps/Earth): in a given area, the paper map has much more detail. The advantage of the navigation system (besides the whole navigation bit) is that it allows you to zoom in and out for an unlimited amount of information. For looking at a tree, though, as for visualizing a trip, seeing the entire thing with a great amount of detail rather than zooming in and out constantly can be much easier.

One solution is to have even higher resolution displays (see Mike Sanderson's wall of monitors [taken from his web page] at the end of this post, for example). I think that there will be limits to this, and we might have to wait for other fields to advance first. Instead, it might be worthwhile to work on better ways to print out large trees on tiled pieces of paper. Imagine software that takes a set of bootstrap trees and can create a PDF you can print out and stitch together showing support values and branch lengths, perhaps with reconstruction of characters in color at the nodes, too. It seems appallingly low-tech and yet rather useful. As far as I know, Classic Paup is the only program that can do this, though Mesquite has some options for tree printing that might allow this [this will be much easier to know once this section of TreeTapper's database has been filled in]. The downside is that this doesn't help much the issue of displaying trees in papers, but there, perhaps some summary graphic would work better.


image from http://loco.biosci.arizona.edu