Friday, October 17, 2008

Using large trees is surprisingly difficult

There has been a lot of effort and interest in phylogenetics in getting big trees (1000+ taxa). This should be the most difficult problem in phylogenetics -- after all, finding the best tree is generally (under most optimization criteria) NP-hard. My interest, and therefore that of TreeTapper, is in what happens after you get the big tree: using it for understanding biological processes. This should be relatively simple. After searching an enormous tree space to get the best tree, doing something like estimating ancestral states is easy, just a downpass (postorder traversal) and uppass (preorder traversal) calculating probabilities at each node for just one character. In practice, it seems many of the programs in this area fail with large trees, often due to rounding or underflow errors (inferred from my own experiences, and those of NESCent postdocs or visitors Sam Price, Stephen Smith, and Jeremy Beaulieu). Most computer programs use numbers with finite precision -- a number smaller in magnitude than around 10E-308 can't be stored as a double, for example. That's a really small number, but if it's a likelihood (probability of the data given the tree and model), a probability of 10E-308 is just a -lnL of 706.9, a number that isn't that unusual for our sort of problems. Many calculations can just be done using ln likelihoods rather than untransformed likelihoods, but for some calculations this is more difficult (certainly more difficult to code). R apparently quietly rounds things to zero with small numbers, leading to erroneous results [from reports from others -- I haven't verified this]. My program Brownie, which in the development branch can do discrete character reconstruction, failed on a tree of ~1600 taxa due to underflow errors, so I made a new class for really small or small numbers that basically stores numbers in scientific notation, using a double for the mantissa and an int for the exponent (code available here). Other programs might need similar kludges to work. It seems odd to me, doing programming as a biologist without much formal CS training, that common programming languages don't do this sort of thing automatically (in the same way that needing to manage memory in C++ feels surprising), but it is an issue that may become more frequently encountered by us.

The relevance to TreeTapper is how and whether to record information about how programs perform with large trees. The first question is whether it's meaningful. For a simple program with one function, it's possible that it always fails if trees are over N taxa in size or if likelihoods get below X. But for complex programs like Mesquite, they might fail for some trees for some methods but work for others, so just storing a single number might be misleading. A second issue is that gathering the data might be difficult: for a typical user, deciding whether a program crashed with a tree with N taxa due to the tree size or some other bug will be hard. It also seems unreasonable to expect people to report to TreeTapper every time a program works or fails for them and under what conditions (I wouldn't make the time to do so). On the other hand, it would be helpful to users to know that a certain program just can't work with trees of a certain size and helpful for developers to know which programs need tweaking for large trees. I guess for now I won't have a separate field for this and will just rely on user comments on each program page, but let me know if you have any suggestions.

1 comment:

Luke J. Harmon said...

I usually don't have problems with this sort of thing in R because I work in log-space, by default. It's easy to do, and solves both the small numbers problem and boundary problems for optimization. If you ever have to add things, it's a bit cumbersome, but this isn't too common.