Thursday, April 24, 2008

Actual science

One thing that bothers me is that so far, my project seems to be about database and website design and coding, not science. However, the science comes later: adding items to the DB is at least related to scientific methods, and once the DB is full enough, I'll be able to use it to figure out what new methods need to be created to answer questions (the real goal of the project). I'm also still doing science, despite the impression blog readers might get: this week, I have done a series of likelihood bootstraps on my ant data (I had to move a couple of intron boundaries based on info from genbank, which then required a new partitioned analysis), started doing the power/bias tests for new methods of trait evolution I've developed, worked on analyses for a paper on fish evolution with Dave Collar using new methods in my program Brownie, talked to a student about models of gene evolution [see my published authored appendix on this], and twiddled my thumbs waiting for reviews on a species delimitation paper (>9 weeks in review so far [but at least it's in review]). 

Viewing missing methods/software, part III: Table views

Repeat of the design goals for viewing missing methods/software: 
  1. Allow users to see on a tree (using branch coloring) which questions don't have methods, which methods aren't in software, etc.
  2. Allow users to arrange the order in which things are displayed: question->criterion->method, or question->character type 1 -> tree type -> branch length type -> data format -> software
  3. Allow users to filter by option (only show methods relating to DNA data, for example)
  4. Make it fast, intuitive, etc.
I think a way to address these is to create table views; one has all the available methods ("actualmethods"), one has all the available methods and software ("actualsoftware"), and one could have all the imaginable combinations of all options ("biggie"). That way, all the database logic for combining the primitive tables (method->methodtotreetypetobranchlengthtype->treetype, for example) is taken care of at the view creation step, rather than requiring it all to be created on the fly when a user re-orders options. [Aha: so this is a reason for using MVC]. To draw the tree of methods/software based on user choice, one gets the tree structure by looking in the [actually hypothetical] "biggie" table; combinations (edges on the display tree) present in the "actualsoftware" table are shown in the "+methods +software" color (purple?), combinations not present there but in the "actualmethods" table get the "+methods -software" color (black?), and others get the "-methods -software" color (gray?). The only problem with this is the size of the "biggie" table view: except for built-in relations between general and posed questions and posed questions and relevant combinations of characters (see schema), it's basically a massive cross join. That means that if there are 2 data formats, 6 tree formats, and 7 platforms, the table has 2 x 6 x 7 = 84 rows.  The actual "biggie" table, having info on all the imaginable options (input formats, tree types, character types, criteria, etc.) would currently have 1,361,817,600 (1.3 billion) distinct rows. Instead of creating such a huge table, I will have a view ("generaltoposedtochartype") containing the essential relations between general question, posed question, and character combinations (only 2,480 rows currently) and then just have the program returning possible branches for the tree know that all the other options can essentially be cross joined.

I had a bit of trouble creating the initial "actualmethods" view efficiently; Hilmar Lapp, an IT guru here at NESCent (codes for BioPerl, organizes hackathons, organizes people) edited the query to make it more efficient and eliminate return of duplicate rows (without using "distinct"). Below I've posted the sql statement used to make these views in case it's useful for others (or for me in the future).

-- All actual methods
CREATE VIEW 
actualmethods
(
actualmethods_generalquestion,
actualmethods_posedquestion,
actualmethods_char1,
actualmethods_char2,
actualmethods_char3,
actualmethods_treetype,
actualmethods_branchlengthtype,
actualmethods_criterion,
actualmethods_method
)
AS
SELECT
pq.posedquestion_generalquestion,
pq.posedquestion_id,
  cc.charactercombination_char1, 
cc.charactercombination_char2, 
cc.charactercombination_char3,
mttblt.methodtotreetypetobranchlengthtype_treetype,
mttblt.methodtotreetypetobranchlengthtype_branchlengthtype,
   mc.methodtocriterion_criterion,

mttblt.methodtotreetypetobranchlengthtype_method

FROM
methodtotreetypetobranchlengthtype mttblt,
methodtocriterion mc,
methodtocharactercombination mcc,
methodtoposedquestion mpq,
posedquestion pq,
charactercombination cc,
posedquestiontocharactercombination pqcc
WHERE
methodtocriterion_method=methodtotreetypetobranchlengthtype_method
AND  
methodtocharactercombination_method=methodtotreetypetobranchlengthtype_method
AND  
methodtocharactercombination_charactercombination=charactercombination_id
AND  
methodtoposedquestion_method=methodtotreetypetobranchlengthtype_method
AND
methodtoposedquestion_posedquestion=posedquestion_id
AND
posedquestiontocharactercombination_posedquestion=posedquestion_id
AND  
posedquestiontocharactercombination_charactercombination=charactercombination_id
;

-- all actual methods and software
CREATE VIEW 
actualsoftware
(
actualsoftware_generalquestion,
actualsoftware_posedquestion,
actualsoftware_char1,
actualsoftware_char2,
actualsoftware_char3,
actualsoftware_treetype,
actualsoftware_branchlengthtype,
actualsoftware_criterion,
actualsoftware_method,
actualsoftware_dataformat,
actualsoftware_treeformat,
actualsoftware_applicationkind,
actualsoftware_platform,
actualsoftware_program
)
AS
SELECT
pq.posedquestion_generalquestion,
pq.posedquestion_id,
cc.charactercombination_char1, 
cc.charactercombination_char2, 
cc.charactercombination_char3,
mttblt.methodtotreetypetobranchlengthtype_treetype,
mttblt.methodtotreetypetobranchlengthtype_branchlengthtype,
mc.methodtocriterion_criterion,

mttblt.methodtotreetypetobranchlengthtype_method,

pdf.programtodataformat_dataformat, 
ptf.programtotreeformat_treeformat, 
ppak.programtoplatformappkind_applicationkind, 
ppak.programtoplatformappkind_platform, 

  pmcc.programtomethodtocharactercombination_program
FROM
methodtotreetypetobranchlengthtype mttblt,
methodtocriterion mc,
methodtocharactercombination mcc,
methodtoposedquestion mpq,
posedquestion pq,
charactercombination cc,
posedquestiontocharactercombination pqcc,
programtodataformat pdf,
programtotreeformat ptf,
programtomethodtocharactercombination pmcc,
programtoplatformappkind ppak
WHERE
methodtocriterion_method=methodtotreetypetobranchlengthtype_method
AND  
methodtocharactercombination_method=methodtotreetypetobranchlengthtype_method
AND  
methodtocharactercombination_charactercombination=charactercombination_id
AND  
methodtoposedquestion_method=methodtotreetypetobranchlengthtype_method
AND 
methodtoposedquestion_posedquestion=posedquestion_id
AND 
posedquestiontocharactercombination_posedquestion=posedquestion_id
AND  
posedquestiontocharactercombination_charactercombination=charactercombination_id
AND 
programtomethodtocharactercombination_methodtocharactercombination=methodtocharactercombination_id
AND 
programtomethodtocharactercombination_program=programtodataformat_program
AND 
programtodataformat_program=programtotreeformat_program
AND 
programtotreeformat_program=programtoplatformappkind_program
;

-- all general+posedquestions+chartypes
CREATE VIEW
generaltoposedtochartype
(
generaltoposedtochartype_generalquestion,
generaltoposedtochartype_posedquestion,
generaltoposedtochartype_char1,
generaltoposedtochartype_char2,
generaltoposedtochartype_char3
)
AS
SELECT
generalquestion_id,
posedquestion_id, 
charactercombination.charactercombination_char1, 
charactercombination.charactercombination_char2, 
charactercombination.charactercombination_char3
FROM
charactercombination,
posedquestiontocharactercombination,
posedquestion,
generalquestion
WHERE
posedquestion_generalquestion=generalquestion_id
AND
posedquestiontocharactercombination_charactercombination=charactercombination_id
AND
posedquestiontocharactercombination_posedquestion=posedquestion_id
;

-- all conceivable combinations of parameters other than question and chartype (a massive cross join, probably not used)
CREATE VIEW
crossjoinoptions
(
crossjoinoptions_treetype,
crossjoinoptions_branchlengthtype,
crossjoinoptions_criterion,
crossjoinoptions_dataformat,
crossjoinoptions_treeformat,
crossjoinoptions_applicationkind,
crossjoinoptions_platform
)
AS
SELECT 
treetype_id, 
branchlengthtype_id, 
criterion_id,

dataformat_id,
treeformat_id,
applicationkind_id,
platform_id
FROM
treetype,
branchlengthtype,
criterion,
dataformat,
treeformat,
applicationkind,
platform
;

Wednesday, April 23, 2008

Evolution talk

I'm giving a talk at Evolution 2008 in June on TreeTapper and information learned about missing methods/software so far (the idea is that there will be something learnable by that point, besides the proper use of YUI APIs). No word on time/session yet. 

Viewing missing methods/software, part II

It appears the strategy of using Google Maps with YUI display of items to add to the map will basically work. Rather than doing drag and drop between YUI datatables, I'm using just the drag and drop YUI code on a list of options, each with the possibility of limiting it to one item (for example, one can first organize by optimality criterion, deciding to show all or just likelihood). The interface is based on the YUI example, but with just one list, with one element of a different color so that options placed above this element appear on the tree while ones below do not (inspired by "the line" on Google Summer of Code's mentorship application). I originally thought of having two lists side by side, allowing people to move elements from one list to the other, but this was too wide for some screens once the possibility of options selection was added. 

I've also gotten a plain white Google Map (to replace traditional geographic maps) working, as well as overlays. Google maps have a wrapper for XmlHttpRequest called GXmlHttp that should make refreshing the chart based on user-sorted options possible. Now the question is how to efficiently recover information from the database to draw the tree, highlighting which branches lead to software+methods, just methods, or nothing.

Tuesday, April 22, 2008

Google summer of code

For Google Summer of Code, NESCent had 11 project ideas, 31 applicants, and just 5 slots. Paul McMillan, an undergraduate student at UC Berkeley, was one of the applicants and proposed working on the WebDot navigation of databases project (though he might end up using GraphViz directly, rather than WebDot). His application was detailed and showed good background knowledge; more impressive were his conversations (over IRC) regarding the problem, where it became evident that he had given it a lot of thought and certainly had the background to do this. This project should help with TreeTapper navigation (looking at coauthor networks, for example) and become an easy-to-use solution for other website developers. Congrats to Paul. 

This was my first year with Google Summer of Code. I was impressed by the quality of the applications NESCent-affiliated projects received and how passionate the students are about them (several whose projects didn't get funding have volunteered to work on them anyway, which is amazing (since they'll have to do something else for money, and  so will have less time)).

Monday, April 21, 2008

Viewing missing methods/software

The key interesting thing about TreeTapper for me is the ability to find missing methods or software. Any list of software and methods will tell you what's available (and isn't trivial to make), but for developers, finding what doesn't exist yet is key. At first, I was just doing a typical treeview (not in the phylogeny sense, but in the nested series of folders sense):



I started adding the beginnings of  bar plots (the red squares above) to show the number of techniques/question available for each topic. The problem with this is that it's very hard to get a quick overview of what's missing: a user has to drill down into each section and remember what's there (sensible display of some of this in with bar plots might help, but it's still not great).  But thinking about it, what would be good to show is an actual tree: for a given starting point (such as a topic: speciation rate), and then all possible descendants (such as all possible questions for this topic). Those descendants available in methods/software get one branch color (say, black), those not get another branch color (gray) [though it might be good to distinguish those present in methods but not software]. Here's a hand-drawn example for the basic idea:

And with colors and labels:
Under this approach, it's easy to distinguish areas with methods/software available (black/solid) from those lacking methods or software (gray/translucent). In the example above, the central dot represents a topic, the first circle represents questions, the second circle criteria, the third perhaps character type, etc. Derrick Zwickl had the good suggestion to allow users to set the order in which options are plotted; I'd also like to allow users to fix certain values (only look for missing methods under a likelihood criterion, for example).

The problem is that this is just a dummy layout, drawn in Apple Keynote, not actually a working image. The question becomes how to make it. I'm thinking of first having a YUI table with the various options to plot (criterion, method, character types, etc.), and then having a second table (or allowing ranking on the first table) where users can drag the options to plot them on the tree in the given order (see an example of something grossly similar here). One problem may be writing the logic to be able to look at all the options for variable Y when it descends from variable X, including which ones are and are not available, when what X and Y are is up to the user (perhaps comparing a cross-join and a left-join postgres table, or something like that, would be the key). Another question is how to actually generate the plot. There are  various Java libraries for interactive data plotting (the first thing I would try if I went this route would be generating an interface with Processing), but many of them failed when I tried them with the most recent version of Mac's Safari, and Java online (and on the desktop, too) always feels a bit clunky to me. There are various ways to make plots on the web (such as Google Charts and Yahoo Charts), but they only have a few mouseover options. I'm actually thinking of using the Google Maps API for this. Using that, plotting points, lines, and areas is now possible, and users can get information on nodes by clicking on them and one can add various javascript functions called onmouseover, onclick, etc. Users will be able to zoom in on parts of a tree. Finally, one can add custom map tiles to replace the Google tiles; in my case, I'd just have a white background, and do all the plotting with polylines and the like. This sort of use of Google Maps been done before; I remember Katy Böhner mentioning this in a talk (though I couldn't find anything on her lab's website), and there are other examples online. 

Well, we'll see how it goes. As with all posts, please feel free to make suggestions in the comments.