Wednesday, November 5, 2008

Growth of phylogenetics, or at least phylogen*


I'm working on slides for a job talk. I made one that might be of interest to readers of this blog, a plot of number of references with "phylogen*" in any field in Scopus through time (line on the figure, also the position of the blue circles). I also found the number of references with "evolution" or "ecolog*" as well as all combinations of these three terms to make Venn diagrams of the overlap of the terms (using Google Charts). It's interesting to see how the three areas overlap and change in size through time: "phylogen*" is now much more common (up to 38% of the frequency of "evolution") and now overlaps more with references with "ecolog*" than with references with "evolution". I'm sure there are all sorts of artifacts and biases, but it is interesting to look at nonetheless. It does suggest that I should start thinking more about ecologists using TreeTapper.


Monday, November 3, 2008

Viewing missing methods/software, part VII: PHP+GD

I was getting dissatisfied with Processing for visualizing missing methods. It was nice to have animation of the diagrams, but it was just too brittle: it would only work in certain browsers, it stopped working sometimes when you moved between tabs or windows, it was hard to update and keep working. This is probably more due to my lack of experience in the language and the odd way I was using it (read a variable stored in Javascript on a page to talk to a PHP script to talk to a Postgres database and display the results in real time) than any inherent flaws in it. I had thought about going to Flash, but that also involved learning a new programming language. I had recently started using PHP with GD to visualize progress of my jobs on Duke's computing cluster and found it easy to use (though I'm having issues with transparency). I decided to try this with the method visualization (I wanted to get TreeTapper cleaned up a bit more before a job interview next week). I wrote a script that reads information stored in GET variables and saves an image and image map. In front of this I wrote a controller script that only calls the drawing script when an entry in the database is newer than the saved image and image map (using ideas from my Google Summer of Code student Paul McMillan, who put sophisticated caching in his DBGraphNav). The image and image map are displayed in an iframe in the calling page. There's a Javascript function that calls the controller script as a user chooses which options to display. There's also a Javascript function that uses a Panel object from YUI to display a lot of information about each node on the diagram on mouseover in a floating panel. This is a lot more flexible than the Processing implementation. For example, you can now go to a summary page for each node or click on an entry in the floating panel to go to an info page on just that item (such as a program).  
Caching the images makes display a lot faster, though only a subset of possible combinations can be cached. The cache of images also provides some nice pictures for screen savers or talks.
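
As a rough illustration of the controller logic described above (redraw only when the database is newer than the saved image), here is a minimal Python sketch. It is not the TreeTapper PHP code: the function names, the `draw` callback, and the idea of passing the database's last-modified time as a Unix timestamp are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable


def needs_redraw(cached_image: Path, db_last_modified: float) -> bool:
    """True when the cached image is missing or older than the newest DB entry."""
    if not cached_image.exists():
        return True
    return cached_image.stat().st_mtime < db_last_modified


def get_image(cached_image: Path, db_last_modified: float,
              draw: Callable[[], bytes]) -> bytes:
    """Return the image bytes, calling the (slow) drawing step only when needed."""
    if needs_redraw(cached_image, db_last_modified):
        # Hypothetical drawing step: in the post this is the PHP+GD script.
        cached_image.write_bytes(draw())
    return cached_image.read_bytes()
```

The same check would apply to the image map, which would be saved alongside the image and refreshed together with it.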

You can try out the new interface here.

Here's a picture of the new display:


And some of the cached images:


Friday, October 17, 2008

Using large trees is surprisingly difficult

There has been a lot of effort and interest in phylogenetics in getting big trees (1000+ taxa). This should be the most difficult problem in phylogenetics -- after all, finding the best tree is generally (under most optimization criteria) NP-hard. My interest, and therefore that of TreeTapper, is in what happens after you get the big tree: using it for understanding biological processes. This should be relatively simple. After searching an enormous tree space to get the best tree, doing something like estimating ancestral states is easy, just a downpass (postorder traversal) and uppass (preorder traversal) calculating probabilities at each node for just one character. In practice, it seems many of the programs in this area fail with large trees, often due to rounding or underflow errors (inferred from my own experiences, and those of NESCent postdocs or visitors Sam Price, Stephen Smith, and Jeremy Beaulieu). Most computer programs use numbers with finite precision -- a number smaller in magnitude than around 10E-308 can't be stored as a double, for example. That's a really small number, but if it's a likelihood (probability of the data given the tree and model), a probability of 10E-308 is just a -lnL of 706.9, a number that isn't that unusual for our sort of problems. Many calculations can just be done using ln likelihoods rather than untransformed likelihoods, but for some calculations this is more difficult (certainly more difficult to code). R apparently quietly rounds things to zero with small numbers, leading to erroneous results [from reports from others -- I haven't verified this]. My program Brownie, which in the development branch can do discrete character reconstruction, failed on a tree of ~1600 taxa due to underflow errors, so I made a new class for really large or small numbers that basically stores numbers in scientific notation, using a double for the mantissa and an int for the exponent (code available here). 
Other programs might need similar kludges to work. It seems odd to me, doing programming as a biologist without much formal CS training, that common programming languages don't do this sort of thing automatically (in the same way that needing to manage memory in C++ feels surprising), but it is an issue that may become more frequently encountered by us.
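
To make the underflow problem concrete, here is a minimal Python sketch of the mantissa-plus-exponent idea. It is not Brownie's C++ class: the name `ScaledNumber`, its methods, and the choice of base 10 are assumptions for illustration only. Python floats are doubles, so the same underflow appears.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledNumber:
    """A number stored as mantissa * 10**exponent (hypothetical helper)."""
    mantissa: float  # kept near the range [1, 10), or exactly 0.0
    exponent: int    # power of ten, an ordinary unbounded Python int

    @staticmethod
    def from_float(x: float) -> "ScaledNumber":
        return ScaledNumber(x, 0).normalized()

    def normalized(self) -> "ScaledNumber":
        if self.mantissa == 0.0:
            return ScaledNumber(0.0, 0)
        # Move the order of magnitude out of the double and into the int.
        shift = math.floor(math.log10(abs(self.mantissa)))
        return ScaledNumber(self.mantissa / 10.0 ** shift, self.exponent + shift)

    def __mul__(self, other: "ScaledNumber") -> "ScaledNumber":
        return ScaledNumber(self.mantissa * other.mantissa,
                            self.exponent + other.exponent).normalized()

    def ln(self) -> float:
        """Natural log of the stored value (only defined for positive numbers)."""
        if self.mantissa <= 0.0:
            raise ValueError("ln is only defined for positive numbers")
        return math.log(self.mantissa) + self.exponent * math.log(10.0)


# Multiplying 400 probabilities of 0.01 underflows a plain double to zero,
# while the scaled representation still carries the value 10**-800.
plain = 1.0
scaled = ScaledNumber.from_float(1.0)
for _ in range(400):
    plain *= 0.01
    scaled = scaled * ScaledNumber.from_float(0.01)
```

After the loop, `plain` has collapsed to zero while `scaled.ln()` is still usable as a log likelihood. Addition is the harder operation for a class like this, since the two exponents have to be aligned first; it is left out of the sketch.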

The relevance to TreeTapper is how and whether to record information about how programs perform with large trees. The first question is whether it's meaningful. For a simple program with one function, it's possible that it always fails if trees are over N taxa in size or if likelihoods get below X. But complex programs like Mesquite might fail for some trees with some methods but work for others, so just storing a single number might be misleading. A second issue is that gathering the data might be difficult: for a typical user, deciding whether a program crashed with a tree with N taxa due to the tree size or some other bug will be hard. It also seems unreasonable to expect people to report to TreeTapper every time a program works or fails for them and under what conditions (I wouldn't make the time to do so). On the other hand, it would be helpful to users to know that a certain program just can't work with trees of a certain size and helpful for developers to know which programs need tweaking for large trees. I guess for now I won't have a separate field for this and will just rely on user comments on each program page, but let me know if you have any suggestions.

Tuesday, October 14, 2008

Visualizing trees

We recently had a community summit at NESCent on directions for the future. As part of the biodiversity and phylogenetics breakout group (see notes here), one thing that came up was a need for better ways of visualizing trees [one good idea was inviting Ben Fry to NESCent to work on this]. Rod Page suggested putting tree visualization in TreeTapper as a category, which is a great idea, as there are dozens of programs (see partial list at Felsenstein's site). But why, with so many programs, do people feel that so much more work is needed?

Well, based on behavior, it's obvious that current solutions don't work. NESCent has a fairly sophisticated set of users and builders of trees. When they need to view a large tree (hundreds to thousands of taxa), they don't use any sort of cool tree stretching or zooming program -- they find an old Mac, open Paup in Classic, and print out a tree over multiple pages, which they then assemble using tape and scissors. I think the reason they do this comes down to resolution of paper versus monitors (see some of Edward Tufte's books for a more general and informed discussion of this). My 1920 x 1200 pixel giant Apple Cinema display monitor (a perk loaned to all NESCent postdocs) can display fewer than 600 distinguishable horizontal lines (one pixel thick with one pixel between them). Our laser printer has a resolution of ~1200 DPI, suggesting that it could print this many lines in about one inch (I might be slightly off if there are some sort of constraints on dot geometry, but the basic idea still holds). By this calculation, my entire monitor display can be reproduced pixel by pixel in a few square inches. Plus, a printed tree can be arbitrarily large (Michael Donoghue described one several feet in diameter). Speed of visually parsing such a tree is related to how quickly the eye can move around a page: focus closely on one section to read the taxon names, jump to look at the overall picture, etc. On a screen, one would have to move the mouse around and wait for the screen to update. Even with tremendous zoom, there are only 1200 vertical positions my cursor can occupy (assuming cursor resolution == screen resolution) -- a tree larger than this, and there's no way to display even a cartoon of the whole tree on one screen in such a way that moving the mouse can select just one taxon (other than a nesting of zooms within zooms). 
In contrast, even on an 11" high piece of paper, with half inch margins and default line spacing, I can print a column of 155 taxon names (using 4 point Times font, about the size used on insect labels), all easily readable. Just a few pieces of paper provide much more resolution than is possible on a costly monitor. It's similar to the comparison between a paper map and a GPS navigation system (or Google Maps/Earth): in a given area, the paper map has much more detail. The advantage of the navigation system (besides the whole navigation bit) is that it allows you to zoom in and out for an unlimited amount of information. For looking at a tree, though, as for visualizing a trip, seeing the entire thing with a great amount of detail rather than zooming in and out constantly can be much easier.
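
The arithmetic behind the screen versus paper comparison can be sketched in a few lines of Python. The function names are made up for the example, and the line spacing factor is an assumption: the post only says "default line spacing", and a factor of about 1.16 times the font size happens to reproduce the 155 names quoted above (a factor of 1.2 would give 150).

```python
POINTS_PER_INCH = 72  # typographic points per inch


def distinguishable_lines(pixels: int) -> int:
    """One-pixel lines separated by one-pixel gaps: each line costs two pixels."""
    return pixels // 2


def inches_to_print(lines: int, dpi: int) -> float:
    """Height needed to print the same lines, each with a one-dot gap."""
    return lines * 2 / dpi


def names_per_column(page_height_in: float, margin_in: float,
                     font_pt: float, line_spacing: float) -> int:
    """Taxon names that fit in one column (line_spacing is an assumed factor)."""
    usable_pt = (page_height_in - 2 * margin_in) * POINTS_PER_INCH
    return int(usable_pt // (font_pt * line_spacing))


screen_lines = distinguishable_lines(1200)         # 1200-pixel-high monitor
paper_inches = inches_to_print(screen_lines, 1200)  # ~1200 DPI laser printer
column_names = names_per_column(11, 0.5, 4, 1.16)   # 11" page, 4 point font
```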

One solution is to have even higher resolution displays (see Mike Sanderson's wall of monitors [taken from his web page] at the end of this post, for example). I think that there will be limits to this, and we might have to wait for other fields to advance first. Instead, it might be worthwhile to work on better ways to print out large trees on tiled pieces of paper. Imagine software that takes a set of bootstrap trees and can create a PDF you can print out and stitch together showing support values and branch lengths, perhaps with reconstruction of characters in color at the nodes, too. It seems appallingly low-tech and yet rather useful. As far as I know, Classic Paup is the only program that can do this, though Mesquite has some options for tree printing that might allow this [this will be much easier to know once this section of TreeTapper's database has been filled in]. The downside is that this doesn't do much to help with the issue of displaying trees in papers, but there, perhaps some summary graphic would work better.


image from http://loco.biosci.arizona.edu

Friday, October 10, 2008

Better input

I've been adding methods and software to the DB slowly, I think due to a somewhat awkward interface. TreeTapper is currently configured so that various traits of a method or program (criterion, character type, etc.) are selected from pull down menus or tables. If an element isn't there (such as a particular format of tree file), I currently have to go to a separate page, add the item, and then reload the first page. I'd been playing a bit with having popup panels for this entry and have decided to just do them. This will simplify, for a user, adding things to the database: she or he can just go to the add method, add software, or add reference pages.

Tuesday, September 23, 2008

Flash rather than Processing?

I'm now wondering whether Processing was a bad choice for visualizing missing methods (though it's still better than the horribly slow Google Maps implementation). It results in Java applets, which can take quite some time to load on a page (see other general criticisms here). I had to do some hacky things (use Java code rather than stuff built into Processing) to get the applet to talk to Javascript on the page to find what items a user wants to examine, which might not be stable as Processing develops [and it probably prevents me from just converting to Processing.js or the like]. Also, on the stable site (http://www.treetapper.org, rather than http://treetapper.nescent.org), the existing Java applet I made doesn't work (to see a working version, go here). I've tried playing with the code (changing it to point to the stable site, of course, and looking at other potential issues) and it still doesn't work. Flash animations seem to work much faster and more stably across many browsers. Some versions of PHP (>4.0.5, <5.3.0) can create Flash animations (the code's been moved into a separate install for later versions), so I could use that. Alternatively, I could just code it in ActionScript and somehow compile it. Either way, I don't know anything about coding Flash animations, how they can connect to the database and page elements, how to write them without depending on expensive software, etc., so it'd mean a bit of work to learn a new language. Any ideas?

Thursday, August 21, 2008

dbgraphnav up


Paul McMillan has been working hard at his dbgraphnav Google Summer of Code project, and it's now pretty much ready to use. I've installed it on TreeTapper (see an example here). This allows navigation of author and reference relationships, a visual complement to the coauthor and reference tables already present. It's a general tool allowing navigation of relational databases with a lot of user configuration options. After all Paul's work, and our frequent IRC conversations, it's good to see it running. One nice aspect of it is caching: GraphViz takes a long time to draw graphs, and there's a user-set option to cache by time or by using diff to only update graphs that have changed (the latter works remarkably fast). I'll be firing off a few scripts tonight to generate cached files for all the people and references in the database to make it easier for people to use.
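
The "only update graphs that have changed" idea can be sketched as follows. This is not dbgraphnav's code: the post says it uses diff, whereas this Python sketch swaps in a content hash of the graph description to detect changes, and the class and parameter names are hypothetical.

```python
import hashlib
from typing import Callable, Dict


class GraphCache:
    """Re-render a graph only when its textual description has changed."""

    def __init__(self, render: Callable[[str], bytes]) -> None:
        self._render = render               # slow step, e.g. a GraphViz layout
        self._digests: Dict[str, str] = {}  # key -> hash of last rendered source
        self._images: Dict[str, bytes] = {}

    def get(self, key: str, graph_source: str) -> bytes:
        digest = hashlib.sha256(graph_source.encode("utf-8")).hexdigest()
        if self._digests.get(key) != digest:
            # Source differs from what was rendered last time: redraw and remember.
            self._images[key] = self._render(graph_source)
            self._digests[key] = digest
        return self._images[key]
```

Compared with caching by time, this never serves a stale graph, at the cost of regenerating the graph description on every request, which is cheap next to the layout step.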

One aspect of the project I've been disappointed by is GraphViz itself. When I proposed this project, it seemed like the logical choice to use, but I've been amazed at some of its limitations. For example, on graphs with many nodes, using "neato" layout (an undirected graph layout), some nodes are far from the rest of the graph (see Michael Donoghue's TreeTapper page for an example). This results in nodes in the center being tightly clumped while nodes at the edges have far too much whitespace around them. No matter what options we've tried for desired edge length, spring parameter, etc., we just can't make GraphViz pull those far-off nodes in. Any suggestions? Other than that sort of annoyance, though, this should be a very useful addition to TreeTapper and perhaps other websites as well.

Friday, July 11, 2008

Status



I've been quiet on the blog lately. I was prepping for the Evolution meetings; since coming back, I've been mostly focusing on a revision of a paper. Paul McMillan's Google Summer of Code project is running along well (dbgraphnav) -- a beta might be on TreeTapper as early as next week. Already, it can be used to navigate from author to paper to author again -- since the database has many authors and papers already (~2000), it's pretty useful.

I just made something to parse a series of subversion logs to look at commits over time (so I can see how I'm allocating effort between projects). Here's a plot of TreeTapper commits. Some of the commits are automated backups of the DB, so it gives an overestimate of my productivity. However, it only tracks changes to the code, not additions to the database itself.
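
For anyone wanting to do something similar, here is a minimal Python sketch of counting commits per month from the plain-text output of `svn log`. It is not the script mentioned above; the function name is made up, and it assumes the default log format, in which each entry has a header line like `r42 | someuser | 2008-07-10 14:02:11 -0400 (Thu, 10 Jul 2008) | 1 line`.

```python
import re
from collections import Counter

# Header line of one log entry: revision | author | date ... | line count
HEADER = re.compile(r"^r\d+ \| [^|]* \| (\d{4}-\d{2})-\d{2} ", re.MULTILINE)


def commits_per_month(log_text: str) -> Counter:
    """Count commits by 'YYYY-MM' in the text printed by `svn log`."""
    return Counter(HEADER.findall(log_text))
```

Filtering out automated commits (such as database backups) would need an extra test on the author or the log message, which this sketch does not attempt.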



Tuesday, May 20, 2008

Viewing missing methods/software, part VI: Processing done!

I've written the code to generate the missing methods/software tree diagram using Processing. The relevant page is http://treetapper.nescent.org/findneed.php . I've designed it so that as data streams back from the database via a PHP script, Processing draws this on the diagram in real time. As users update the sortable list of options (tree type, general question, etc.) (using YUI drag and drop), a Javascript function updates the string (stored in a Javascript variable) that is passed to the PHP script. Processing checks this string (so, Java talking to a Javascript object), and if it has changed, Processing closes its old connection to the PHP script and opens a new one using the new options. It thus dynamically updates the chart and is far faster than the Google Maps API version. It's also easy to do sophisticated animations in Processing: currently, I have the nodes flying out of their parents, zooming and shifting the image, and point highlighting on mouseover. These actually don't slow down the rendering: I have the script written so that it only adds a new node to the diagram once the previous node has reached its destination, which sets an upper limit on rendering speed, but I override this and put many nodes on the tree at once if there's a backlog (>10 nodes read from the server but not drawn yet). There is rarely such a backlog, indicating that nodes are being drawn as fast as the server is passing them to Processing. It's pretty cool to be able to visualize where our field needs work (or, rather, where I need to fill in the database) using a dynamic interface, and also surprising that it wasn't too bad to program (especially considering I didn't know Processing/Java, Javascript, PHP, or Postgres when I started this in November 2007). Here is a video showing the new site being used (also available here); you may need to widen your browser window to see it all. As always, suggestions are encouraged.
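
The pacing rule described above (one node at a time, unless a backlog builds up) can be sketched like this in Python. It is not the Processing code: the function name is hypothetical, and how many nodes to release at once when there is a backlog is an assumption (here, enough to bring the queue back down to the limit).

```python
from collections import deque
from typing import Deque, List

BACKLOG_LIMIT = 10  # nodes read from the server but not yet drawn


def nodes_to_start(pending: Deque[str], animating: bool) -> List[str]:
    """Decide which queued nodes should start animating on this frame."""
    if len(pending) > BACKLOG_LIMIT:
        # Backlog: stop waiting for the current animation and release
        # nodes until the queue is back at the limit (assumed policy).
        started = []
        while len(pending) > BACKLOG_LIMIT:
            started.append(pending.popleft())
        return started
    if pending and not animating:
        # Normal case: previous node has arrived, so start exactly one more.
        return [pending.popleft()]
    return []
```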

Thursday, May 8, 2008

Viewing missing methods/software, part V: Processing

In my previous post, I was bemoaning the slow speed of the visualization chart using Google Maps. Looking around in the Google Maps group, it seemed that there wasn't a good way to speed up the drawing of many markers, except for drawing them in another program first and storing them on image tiles. Since I was going to have to draw markers in a different program anyway, I decided to try using such a program by itself without involving Google Maps. First, I thought to create a static image using GD with PHP and then do an image map on top of that with Javascript, but GD wasn't working on the NESCent server (this was fixed in just a couple of hours, as is common with NESCent's great tech support). While it was being fixed, I decided to learn Processing -- I've been impressed by the quality of some of the diagrams it creates, and having a dynamic diagram might be more useful for users. It has good documentation and can parse XML, so within a few hours, I learned enough to be able to create the diagram below.


Following advice from the TreeTapper design consultant (my wife), I made the nodes smaller and changed the color slightly, but otherwise it is the same as the Google diagram (though I don't have popups or mouseovers working yet), and is much faster to create. It does require users to have Java installed in their browsers, but might allow cool features later on.

Viewing missing methods/software, part IV: Sort of working

Well, I sort of have the visualization working. Users can choose which elements to graph (on Firefox) by dragging boxes to move them above or below a plotting line, and they can choose to limit to plotting only one of many options. See below.

On clicking the update chart button, a tree is plotted using the Google Maps API, with branches colored based on whether they have no methods or software, methods but not software, or both methods and software [I originally had it update the map automatically on any change, but this is too slow]. Users can click on nodes to get an info window showing the choices made going from the tree root to the tips and links to any relevant software or methods. Just mousing over a node tells a user what option that node represents (e.g., "Treetype: Unrooted, polytomies, incomplete tree"). If a user has chosen just one setting for an option (as for criterion, below), the edge leading to that node is shorter and light gray edges are shown to indicate the options not examined. See below (click to make larger).


The end product is beautiful and information-rich. It's also VERY slow. It takes literally minutes to generate the plot, and then >10 seconds between clicking on a node and getting the info window. Zooming in or out or moving the plot also takes an agonizingly long time. Getting info from the script that talks to the database to make code for the map takes a while, about 30 seconds, but the real slowdown comes when drawing the map. Google maps are slow with many markers and polylines: the map above has 221 markers and even more polylines, with the three circle polylines having 360 points each (using fewer makes a plot that's too rough). I'll have to decide what to do. I'd stayed away from Processing due to usability concerns, but it seems Google Maps isn't so great, either. I've read a little about generating image tiles rather than markers to speed up the map -- I'll look into this and other options.

Thursday, April 24, 2008

Actual science

One thing that bothers me is that so far, my project seems to be about database and website design and coding, not science. However, the science comes later: adding items to the DB is at least related to scientific methods, and once the DB is full enough, I'll be able to use it to figure out what new methods need to be created to answer questions (the real goal of the project). I'm also still doing science, despite the impression blog readers might get: this week, I have done a series of likelihood bootstraps on my ant data (I had to move a couple of intron boundaries based on info from GenBank, which then required a new partitioned analysis), started doing the power/bias tests for new methods of trait evolution I've developed, worked on analyses for a paper on fish evolution with Dave Collar using new methods in my program Brownie, talked to a student about models of gene evolution [see the published appendix I authored on this], and twiddled my thumbs waiting for reviews on a species delimitation paper (>9 weeks in review so far [but at least it's in review]).

Viewing missing methods/software, part III: Table views

Repeat of the design goals for viewing missing methods/software: 
  1. Allow users to see on a tree (using branch coloring) which questions don't have methods, which methods aren't in software, etc.
  2. Allow users to arrange the order in which things are displayed: question->criterion->method, or question->character type 1 -> tree type -> branch length type -> data format -> software
  3. Allow users to filter by option (only show methods relating to DNA data, for example)
  4. Make it fast, intuitive, etc.
I think a way to address these is to create table views: one has all the available methods ("actualmethods"), one has all the available methods and software ("actualsoftware"), and one could have all the imaginable combinations of all options ("biggie"). That way, all the database logic for combining the primitive tables (method->methodtotreetypetobranchlengthtype->treetype, for example) is taken care of at the view creation step, rather than requiring it all to be created on the fly when a user re-orders options. [Aha: so this is a reason for using MVC]. To draw the tree of methods/software based on user choice, one gets the tree structure by looking in the [actually hypothetical] "biggie" table; combinations (edges on the display tree) present in the "actualsoftware" table are shown in the "+methods +software" color (purple?), combinations not present there but in the "actualmethods" table get the "+methods -software" color (black?), and others get the "-methods -software" color (gray?). The only problem with this is the size of the "biggie" table view: except for built-in relations between general and posed questions, and between posed questions and relevant combinations of characters (see schema), it's basically a massive cross join. That means that if there are 2 data formats, 6 tree formats, and 7 platforms, a table of just those three options has 2 x 6 x 7 = 84 rows. The actual "biggie" table, having info on all the imaginable options (input formats, tree types, character types, criteria, etc.) would currently have 1,361,817,600 (about 1.36 billion) distinct rows. Instead of creating such a huge table, I will have a view ("generaltoposedtochartype") containing the essential relations between general question, posed question, and character combinations (only 2,480 rows currently) and then just have the program returning possible branches for the tree know that all the other options can essentially be cross joined.
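
A small Python sketch may make the row-count arithmetic and the coloring rule concrete. It is only an illustration under stated assumptions: combinations are represented as plain tuples, the two sets stand in for the "actualmethods" and "actualsoftware" views, and the function names and sample values are invented.

```python
from math import prod
from typing import Dict, Set, Tuple

Combo = Tuple[str, ...]


def cross_join_rows(option_counts: Dict[str, int]) -> int:
    """Rows in a full cross join: the product of the number of choices per option."""
    return prod(option_counts.values())


def edge_color(combo: Combo, actual_methods: Set[Combo],
               actual_software: Set[Combo]) -> str:
    """Color for one edge of the display tree, following the scheme in the post."""
    if combo in actual_software:
        return "purple"  # +methods +software
    if combo in actual_methods:
        return "black"   # +methods -software
    return "gray"        # -methods -software


# Invented sample data: every combination with software also has a method.
methods = {("likelihood", "DNA"), ("parsimony", "DNA")}
software = {("likelihood", "DNA")}
```

With 2 data formats, 6 tree formats, and 7 platforms, `cross_join_rows` gives the 84 rows mentioned above, and each extra option multiplies the total again, which is why the full table gets so large.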

I had a bit of trouble creating the initial "actualmethods" view efficiently; Hilmar Lapp, an IT guru here at NESCent (codes for BioPerl, organizes hackathons, organizes people) edited the query to make it more efficient and eliminate return of duplicate rows (without using "distinct"). Below I've posted the SQL statements used to make these views in case they're useful for others (or for me in the future).

-- All actual methods
CREATE VIEW 
actualmethods
(
actualmethods_generalquestion,
actualmethods_posedquestion,
actualmethods_char1,
actualmethods_char2,
actualmethods_char3,
actualmethods_treetype,
actualmethods_branchlengthtype,
actualmethods_criterion,
actualmethods_method
)
AS
SELECT
pq.posedquestion_generalquestion,
pq.posedquestion_id,
cc.charactercombination_char1,
cc.charactercombination_char2,
cc.charactercombination_char3,
mttblt.methodtotreetypetobranchlengthtype_treetype,
mttblt.methodtotreetypetobranchlengthtype_branchlengthtype,
mc.methodtocriterion_criterion,
mttblt.methodtotreetypetobranchlengthtype_method
FROM
methodtotreetypetobranchlengthtype mttblt,
methodtocriterion mc,
methodtocharactercombination mcc,
methodtoposedquestion mpq,
posedquestion pq,
charactercombination cc,
posedquestiontocharactercombination pqcc
WHERE
methodtocriterion_method=methodtotreetypetobranchlengthtype_method
AND  
methodtocharactercombination_method=methodtotreetypetobranchlengthtype_method
AND  
methodtocharactercombination_charactercombination=charactercombination_id
AND  
methodtoposedquestion_method=methodtotreetypetobranchlengthtype_method
AND
methodtoposedquestion_posedquestion=posedquestion_id
AND
posedquestiontocharactercombination_posedquestion=posedquestion_id
AND  
posedquestiontocharactercombination_charactercombination=charactercombination_id
;

-- all actual methods and software
CREATE VIEW 
actualsoftware
(
actualsoftware_generalquestion,
actualsoftware_posedquestion,
actualsoftware_char1,
actualsoftware_char2,
actualsoftware_char3,
actualsoftware_treetype,
actualsoftware_branchlengthtype,
actualsoftware_criterion,
actualsoftware_method,
actualsoftware_dataformat,
actualsoftware_treeformat,
actualsoftware_applicationkind,
actualsoftware_platform,
actualsoftware_program
)
AS
SELECT
pq.posedquestion_generalquestion,
pq.posedquestion_id,
cc.charactercombination_char1, 
cc.charactercombination_char2, 
cc.charactercombination_char3,
mttblt.methodtotreetypetobranchlengthtype_treetype,
mttblt.methodtotreetypetobranchlengthtype_branchlengthtype,
mc.methodtocriterion_criterion,
mttblt.methodtotreetypetobranchlengthtype_method,
pdf.programtodataformat_dataformat,
ptf.programtotreeformat_treeformat,
ppak.programtoplatformappkind_applicationkind,
ppak.programtoplatformappkind_platform,
pmcc.programtomethodtocharactercombination_program
FROM
methodtotreetypetobranchlengthtype mttblt,
methodtocriterion mc,
methodtocharactercombination mcc,
methodtoposedquestion mpq,
posedquestion pq,
charactercombination cc,
posedquestiontocharactercombination pqcc,
programtodataformat pdf,
programtotreeformat ptf,
programtomethodtocharactercombination pmcc,
programtoplatformappkind ppak
WHERE
methodtocriterion_method=methodtotreetypetobranchlengthtype_method
AND  
methodtocharactercombination_method=methodtotreetypetobranchlengthtype_method
AND  
methodtocharactercombination_charactercombination=charactercombination_id
AND  
methodtoposedquestion_method=methodtotreetypetobranchlengthtype_method
AND 
methodtoposedquestion_posedquestion=posedquestion_id
AND 
posedquestiontocharactercombination_posedquestion=posedquestion_id
AND  
posedquestiontocharactercombination_charactercombination=charactercombination_id
AND 
programtomethodtocharactercombination_methodtocharactercombination=methodtocharactercombination_id
AND 
programtomethodtocharactercombination_program=programtodataformat_program
AND 
programtodataformat_program=programtotreeformat_program
AND 
programtotreeformat_program=programtoplatformappkind_program
;

-- all general+posedquestions+chartypes
CREATE VIEW
generaltoposedtochartype
(
generaltoposedtochartype_generalquestion,
generaltoposedtochartype_posedquestion,
generaltoposedtochartype_char1,
generaltoposedtochartype_char2,
generaltoposedtochartype_char3
)
AS
SELECT
generalquestion_id,
posedquestion_id, 
charactercombination.charactercombination_char1, 
charactercombination.charactercombination_char2, 
charactercombination.charactercombination_char3
FROM
charactercombination,
posedquestiontocharactercombination,
posedquestion,
generalquestion
WHERE
posedquestion_generalquestion=generalquestion_id
AND
posedquestiontocharactercombination_charactercombination=charactercombination_id
AND
posedquestiontocharactercombination_posedquestion=posedquestion_id
;

-- all conceivable combinations of parameters other than question and chartype (a massive cross join, probably not used)
CREATE VIEW
crossjoinoptions
(
crossjoinoptions_treetype,
crossjoinoptions_branchlengthtype,
crossjoinoptions_criterion,
crossjoinoptions_dataformat,
crossjoinoptions_treeformat,
crossjoinoptions_applicationkind,
crossjoinoptions_platform
)
AS
SELECT 
treetype_id, 
branchlengthtype_id, 
criterion_id,
dataformat_id,
treeformat_id,
treeformat_id,
applicationkind_id,
platform_id
FROM
treetype,
branchlengthtype,
criterion,
dataformat,
treeformat,
applicationkind,
platform
;
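
To see that a comma-separated FROM list with no WHERE clause really is a cross join, here is a small self-contained Python sketch using the standard library's sqlite3 module. It uses an in-memory SQLite database rather than Postgres, only three of the seven option tables, and made-up row counts; the function name is hypothetical.

```python
import sqlite3
from typing import Dict


def count_cross_join_rows(sizes: Dict[str, int]) -> int:
    """Build tiny option tables, create a cross-join view, and count its rows."""
    conn = sqlite3.connect(":memory:")
    try:
        for table, n in sizes.items():
            # Table names come from the trusted dict above, not from user input.
            conn.execute(f"CREATE TABLE {table} ({table}_id INTEGER PRIMARY KEY)")
            conn.executemany(
                f"INSERT INTO {table} ({table}_id) VALUES (?)",
                [(i,) for i in range(1, n + 1)],
            )
        columns = ", ".join(f"{table}_id" for table in sizes)
        tables = ", ".join(sizes)
        # No WHERE clause: every row of each table pairs with every row of the others.
        conn.execute(
            f"CREATE VIEW crossjoinoptions AS SELECT {columns} FROM {tables}"
        )
        (rows,) = conn.execute("SELECT COUNT(*) FROM crossjoinoptions").fetchone()
        return rows
    finally:
        conn.close()


rows = count_cross_join_rows({"dataformat": 2, "treeformat": 6, "platform": 7})
```

Because the view is only a stored query, creating it is cheap; the cost appears when it is selected from, which is one reason to filter on it rather than read it whole.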

Wednesday, April 23, 2008

Evolution talk

I'm giving a talk at Evolution 2008 in June on TreeTapper and information learned about missing methods/software so far (the idea is that there will be something learnable by that point, besides the proper use of YUI APIs). No word on time/session yet. 

Viewing missing methods/software, part II

It appears the strategy of using Google Maps with YUI display of items to add to the map will basically work. Rather than doing drag and drop between YUI datatables, I'm using just the drag and drop YUI code on a list of options, each with the possibility of limiting it to one item (for example, one can first organize by optimality criterion, deciding to show all or just likelihood). The interface is based on the YUI example, but with just one list, with one element of a different color so that options placed above this element appear on the tree while ones below do not (inspired by "the line" on Google Summer of Code's mentorship application). I originally thought of having two lists side by side, allowing people to move elements from one list to the other, but this was too wide for some screens once the possibility of options selection was added. 

I've also gotten a plain white Google Map (to replace traditional geographic maps) working, as well as overlays. Google maps have a wrapper for XmlHttpRequest called GXmlHttp that should make refreshing the chart based on user-sorted options possible. Now the question is how to efficiently recover information from the database to draw the tree, highlighting which branches lead to software+methods, just methods, or nothing.

Tuesday, April 22, 2008

Google summer of code

For Google Summer of Code, NESCent had 11 project ideas, 31 applicants, and just 5 slots. Paul McMillan, an undergraduate student at UC Berkeley, was one of the applicants and proposed working on the WebDot navigation of databases project (though he might end up using GraphViz directly, rather than WebDot). His application was detailed and showed good background knowledge; more impressive were his conversations (over IRC) regarding the problem, where it became evident that he had given it a lot of thought and certainly had the background to do this. This project should help with TreeTapper navigation (looking at coauthor networks, for example) and become an easy-to-use solution for other website developers. Congrats to Paul. 

This was my first year with Google Summer of Code. I was impressed by the quality of the applications NESCent-affiliated projects received and by how passionate the students are about them. Several whose projects didn't get funding have volunteered to work on them anyway, which is amazing, since they'll have to do something else for money and so will have less time.

Monday, April 21, 2008

Viewing missing methods/software

For me, the most interesting thing about TreeTapper is the ability to find missing methods or software. Any list of software and methods will tell you what's available (and such a list isn't trivial to make), but for developers, finding what doesn't exist yet is key. At first, I was just doing a typical treeview (not in the phylogeny sense, but in the nested series of folders sense):



I started adding the beginnings of bar plots (the red squares above) to show the number of techniques/questions available for each topic. The problem with this is that it's very hard to get a quick overview of what's missing: a user has to drill down into each section and remember what's there (sensibly displaying some of this alongside the bar plots might help, but it's still not great). Thinking about it, what would be good to show is an actual tree: a given starting point (such as a topic: speciation rate), and then all possible descendants (such as all possible questions for this topic). Descendants that are available in methods/software get one branch color (say, black); those that are not get another (gray), though it might be good to distinguish those present in methods but not software. Here's a hand-drawn example of the basic idea:

And with colors and labels:
Under this approach, it's easy to distinguish areas with methods/software available (black/solid) from those lacking methods or software (gray/translucent). In the example above, the central dot represents a topic, the first circle represents questions, the second circle criteria, the third perhaps character type, etc. Derrick Zwickl had the good suggestion to allow users to set the order in which options are plotted; I'd also like to allow users to fix certain values (only look for missing methods under a likelihood criterion, for example).
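
One way to place nodes on such concentric circles is sketched below in Python. This is an assumption about the geometry, not how the Keynote mockup was drawn: each level sits on a ring whose radius grows with depth, and the nodes on a ring are spaced evenly. The function name `radial_position` and the `ring_spacing` parameter are hypothetical.

```python
import math

def radial_position(depth, index, count, ring_spacing=1.0):
    """Return (x, y) for node number `index` of `count` nodes on ring `depth`.

    depth 0 is the central dot (the topic); depth 1 is the first circle, and so on.
    """
    if depth == 0:
        return (0.0, 0.0)
    # Centre each node in its own equal slice of the circle.
    angle = 2 * math.pi * (index + 0.5) / count
    radius = depth * ring_spacing
    return (radius * math.cos(angle), radius * math.sin(angle))
```

If every possible combination is drawn and the nodes on each ring are listed in nested order (all children of the first parent, then all children of the second, and so on), each child falls inside its parent's slice, so branches do not cross.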

The problem is that this is just a dummy layout, drawn in Apple Keynote, not actually a working image. The question becomes how to make it. I'm thinking of first having a YUI table with the various options to plot (criterion, method, character types, etc.), and then having a second table (or allowing ranking on the first table) where users can drag the options to plot them on the tree in the given order (see an example of something roughly similar here). One problem may be writing the logic to look at all the options for variable Y when it descends from variable X, including which ones are and are not available, when what X and Y are is up to the user (perhaps comparing a cross-join and a left-join Postgres table, or something like that, would be the key).

Another question is how to actually generate the plot. There are various Java libraries for interactive data plotting (the first thing I would try if I went this route would be generating an interface with Processing), but many of them failed when I tried them with the most recent version of Mac's Safari, and Java online (and on the desktop, too) always feels a bit clunky to me. There are various ways to make plots on the web (such as Google Charts and Yahoo Charts), but they only have a few mouseover options.

I'm actually thinking of using the Google Maps API for this. With it, plotting points, lines, and areas is now possible, users can get information on nodes by clicking on them, and one can attach various JavaScript functions to onmouseover, onclick, etc. Users will be able to zoom in on parts of a tree. Finally, one can add custom map tiles to replace the Google tiles; in my case, I'd just have a white background and do all the plotting with polylines and the like. This sort of use of Google Maps has been done before; I remember Katy Böhner mentioning this in a talk (though I couldn't find anything on her lab's website), and there are other examples online. 
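
The cross-join versus left-join idea can be sketched as follows. This is a toy example under stated assumptions, not the TreeTapper schema: the `criterion` and `platform` tables and their `_id` columns follow the naming used on this blog, but the `method` table, the `name` columns, and the sample rows are hypothetical. It also runs on an in-memory SQLite database through Python so that it is self-contained, though the same query shape should work in Postgres.

```python
import sqlite3

def missing_combinations(conn):
    """Return (criterion, platform) name pairs that no row in `method` covers."""
    sql = """
        SELECT c.name, p.name
        FROM criterion AS c
        CROSS JOIN platform AS p          -- every possible combination
        LEFT JOIN method AS m             -- what actually exists
               ON m.criterion_id = c.criterion_id
              AND m.platform_id = p.platform_id
        WHERE m.method_id IS NULL         -- keep only the unmatched combinations
        ORDER BY c.name, p.name
    """
    return conn.execute(sql).fetchall()

def demo():
    """Build a tiny made-up database and list its missing combinations."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE criterion (criterion_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE platform (platform_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE method (method_id INTEGER PRIMARY KEY,
                             criterion_id INTEGER, platform_id INTEGER);
        INSERT INTO criterion VALUES (1, 'likelihood'), (2, 'parsimony');
        INSERT INTO platform VALUES (1, 'mac'), (2, 'windows');
        INSERT INTO method VALUES (1, 1, 1), (2, 1, 2), (3, 2, 2);
    """)
    return missing_combinations(conn)
```

In this made-up data the only combination without a method is parsimony on the Mac, so that is the single pair `demo()` returns. Letting the user pick X and Y would mean building the table and column names in the query from a fixed whitelist of options.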

Well, we'll see how it goes. As with all posts, please feel free to make suggestions in the comments.

Tuesday, April 8, 2008

CiteSeerX launched

An alpha version of CiteSeerX has launched. This is cool, because it provides a way to get citation counts by crawling the web (unlike Thomson ISI or Google Scholar). Its database is limited to computer science (which includes many phylogeny articles, but not enough), so I'd have to get a copy of the source code (not evident where to download this yet) and start crawling on my own.

Wednesday, March 26, 2008

Google summer of code

NESCent is a hosting organization for Google Summer of Code; I've proposed a project to make a tool to allow databases to be navigated visually (essentially by combining WebDot (part of GraphViz) with something like sqlt-diagram, but for navigating table entries rather than just looking at the schema). Only one interested student so far; if it's not funded, I'll likely do it myself later for TreeTapper.