Monday, October 22, 2012

Resolving free-form citations

CrossRef have released CrossRef Metadata Search, a nice tool that can take a free-form citation and return possible matches from CrossRef's database. If you get a match, CrossRef can take the DOI and format it for you in a variety of styles using DOI content negotiation.
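
As a rough sketch of what DOI content negotiation looks like from code, the Python below builds a request for a formatted citation. The `text/x-bibliography` media type with a `style` parameter is how the DOI resolver is asked for a formatted entry; the helper names, the `https://doi.org/` base URL, and the default `apa` style are choices made here for illustration. The network call is kept in its own function so the request can be inspected without sending it.

```python
import urllib.request
from urllib.parse import quote


def build_citation_request(doi, style="apa"):
    """Build (but do not send) a content-negotiation request for a DOI."""
    url = "https://doi.org/" + quote(doi, safe="/")
    # Ask the resolver for a formatted bibliography entry instead of the landing page.
    accept = "text/x-bibliography; style=" + style
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch_citation(doi, style="apa", timeout=10):
    """Send the request and return the formatted citation as text (needs network)."""
    request = build_citation_request(doi, style)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


if __name__ == "__main__":
    req = build_citation_request("10.1371/journal.pbio.1001406")
    print(req.full_url, req.get_header("Accept"))
```

Calling `fetch_citation` with a DOI returned by the search should give back a single formatted reference string, provided the DOI's registration agency supports that style.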

If, like me, you spend a lot of time trying to find DOIs (and other identifiers) for articles by first parsing citations into their component parts, then this is good news. It's also good news for publishers that may balk at one of CrossRef's requirements for joining its club: if you want DOIs for your articles it's not enough to submit metadata for your article, you also need to submit the list of references that article cites, including their DOIs. This requirement enables CrossRef to offer their "cited by" service, but imposes a burden on smaller journals operating on a tight budget (e.g., Zootaxa). With CrossRef Metadata Search you can just send author-supplied citation strings from the manuscript and have a good chance of finding the corresponding DOI, if it exists.

Of course, the service only works if the article has a DOI, so it's not a complete solution to being able to parse bibliographic citations into their component parts. But it's a nice model, and I'm tempted to apply the same approach to my databases, such as BioStor or my ever growing Mendeley library (which is larger than the Mendeley desktop client can easily handle). A quick way to do this would be to use Cloudant which has cloud-based CouchDB coupled with a Lucene-based fulltext search engine. If I've time I may try and put a demo together.
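
To make the idea concrete, here is a toy version of matching a free-form citation against a local collection. It is only a sketch: a few lines of token overlap stand in for what a Lucene index would do far better, and the record layout (`id`, `citation`) and the function names are invented for illustration rather than taken from BioStor, Mendeley, or Cloudant.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, candidate):
    """Fraction of query tokens that also occur in the candidate (0.0 to 1.0)."""
    q = Counter(tokenize(query))
    c = Counter(tokenize(candidate))
    if not q:
        return 0.0
    shared = sum(min(count, c[token]) for token, count in q.items())
    return shared / sum(q.values())


def best_matches(query, records, threshold=0.5):
    """Return (score, record) pairs at or above the threshold, best first."""
    scored = [(score(query, r["citation"]), r) for r in records]
    hits = [pair for pair in scored if pair[0] >= threshold]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)
```

A real service would add stemming, field weighting, and a proper inverted index, but the shape is the same: tokenize the citation string, rank the candidates, and accept only matches above some threshold.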

Friday, October 19, 2012

The failure of phylogeny databases

It is well known that phylogeny databases such as TreeBASE capture a small fraction of the published phylogenies. This raises the question of how to increase the number of trees that get archived. One approach is compulsion:

In other words:
  1. Databasing trees is the Right Thing™ to do
  2. Few people are doing the Right Thing™
  3. This is because those people are bad/misguided and must be made to see the light

I want to suggest an alternative explanation:
  1. It is not at all obvious that databasing trees is useful
  2. The databases we have suck
  3. There's no obvious incentive for the people producing trees to database them

Why do we need a database of trees?

That we don't have a decent, widely used database of trees suggests that the argument still has to be made. Way back in the mid-1990s, when TreeBASE was first starting, I was at Oxford University and Paul Harvey (coauthor of The Comparative Method in Evolutionary Biology) was sceptical of its merits. Given that the comparative method depends on phylogenies, and people like Andy Purvis were in the Harvey lab building supertrees, this may seem odd (it certainly did to me), but Paul shared the view of many systematists. Phylogenies are labile: they change with increased data and taxon sampling, hence individual trees have a short life span.

Data, in contrast, is long-lived. You'd happily reuse GenBank sequences published a decade ago, you probably wouldn't use a decade-old phylogeny. I made this point in an earlier post about the data archive Dryad (Data matters but do data sets?). A problem facing packages of data (such as papers, data sets, and phylogenies) is that the package itself may be of limited interest, beyond reproducing earlier results and benchmarking. In the case of phylogenies, if someone has a tree ((a,b),c) and someone else has a tree ((d,e),f), it's not obvious that we can combine these. But if we have sequences for the same gene from the same six taxa we can build a larger tree, say (((a,d),(b,e)),(c,f)).
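
The point about combining trees can be checked mechanically. The sketch below pulls the leaf labels out of simple Newick strings like the ones above and asks whether two trees share any taxa; with no shared leaves there is nothing to anchor one tree to the other. The regex-based reader is a deliberate simplification that assumes unquoted labels with no branch lengths or internal node names, so it is not a general Newick parser.

```python
import re


def leaf_labels(newick):
    """Return the set of leaf labels in a simple Newick string.

    Assumes unquoted labels and no branch lengths or internal node names.
    """
    # A leaf label directly follows "(" or ","; anything after ")" is skipped.
    return set(re.findall(r"(?<=[(,])\s*([^(),:;\s]+)", newick))


def shared_taxa(tree1, tree2):
    """Return the leaf labels that occur in both trees."""
    return leaf_labels(tree1) & leaf_labels(tree2)


if __name__ == "__main__":
    print(shared_taxa("((a,b),c)", "((d,e),f)"))
    print(sorted(leaf_labels("(((a,d),(b,e)),(c,f))")))
```

The two small trees have an empty intersection, whereas the tree built from the pooled sequences contains all six taxa.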

I think this is part of the reason why GenBank works. Yes, there is compulsion (it's very hard to publish on sequences if you haven't deposited the data in GenBank), but there are clear benefits of depositing data. As the database grows we can do bigger analyses. If you are trying to identify a species based on its DNA, the chances are that the nearest sequence will have been deposited by somebody else. By depositing data your work also lasts longer than if people just had the paper (your tree is likely to be outdated, but that sequence from a rare, hard-to-obtain species might be used for decades to come).

Note that I'm not saying a database of trees isn't a good idea, but there seems to be an assumption that it is so obvious that it doesn't need justification. Demonstrably this isn't the case. Maybe we should figure out what we'd want to do with such a database, then tackle how we'd make that possible. For example, I'd want to query a phylogeny database geographically (show me trees from this part of the globe), by ecological association (find the trees for any parasites on this clade), by temporal period (what clades originated in the Miocene?), by data (what trees used this sequence which we now know is chimeric?), by topology (have we settled on the sister group to snakes yet?), and so on. I would also argue that much of this is doable, but might not actually require archiving published phylogenies. Personally I think anybody tackling these questions would do well to use PhyLoTA as their starting point.

TreeBASE sucks

Yes, I'm as sick of saying this as you are of reading it. But it doesn't change the fact that just about everything about TreeBASE, from the complexity of the underlying data model, the choice of programming language, the use of a Java applet to display trees, and the Byzantine search interface to the voluminous XML output, makes TreeBASE a bag of hurt. None of this would matter much if it was an indispensable part of people's research toolkit, but this isn't the case. If you are trying to convince people of the benefits of sharing trees you really want a tool that makes it seem a no-brainer. We aren't there yet.

The "fuck this" point

In a great post on the piracy threshold, Matt Gemmell argues that piracy is largely the fault of content providers because they make being honest too difficult. How many times have you wanted to buy something such as a book or a movie only to discover that the content provider doesn't sell it in your part of the world (e.g., in the iBooks store in the US but not the UK) or doesn't provide it in the media you want (e.g., DVD but not online)? To top it off, every time you go to the movies you are subjected to emotional blackmail or threats of unlimited fines if you were to copy the movie you already paid to watch.

I think databases have the same "fuck this" threshold. If you are asking people to submit data you want to make it as easy as possible. And you want at least some of the benefits to be immediate and obvious. Otherwise you are left with coercing people, and that's being, at best, lazy.

If you want an example of how to do it right, look at Mendeley's model. They want to build a public cloud of academic papers, a laudable goal, the Right Thing™ to do. But they sell the idea not as a public good, not as the Right Thing™, nor by trying to compel people (they can't, they're a private company). Instead they address a major point of pain - where the hell did I put that PDF? - and make it trivial to organise your collection of articles. Then they make it possible to back them up to the cloud, to view them on multiple devices, to share them, and voilà, we get a huge database of publications. The sociology works. So, my question is, what would the equivalent be for phylogenetics?

Wednesday, October 17, 2012


James Rosindell's OneZoom tree viewer is out and the paper describing the viewer has been published in PLoS Biology (disclosure, I was a reviewer):

Rosindell, J., & Harmon, L. J. (2012). OneZoom: A Fractal Explorer for the Tree of Life. PLoS Biology, 10(10), e1001406. doi:10.1371/journal.pbio.1001406
Below is a video where James describes OneZoom.

OneZoom is fun, and is deservedly attracting a lot of attention. But as visually striking as it is, I confess I have reservations about fractal-based viewers. For a start they make it hard to get a sense of the relative size of taxonomic groups. Looking at the mammal tree shown in the video above your eye is drawn to the monotremes, one of the smallest mammalian lineages. That the greatest number of extant mammals are either rodents or bats is not readily apparent. Fractal geometry also removes the timescale, so you can't discover whether radiations in different clades are the same age (unlike, say, if the tree were drawn in a "traditional" fashion with a linear timescale). In some ways I think fractal viewers are rather like the hyperbolic viewers that attracted attention about a decade ago - visually striking but ultimately difficult to interpret. What I'd like to see are studies which evaluate how easily people can navigate different trees and accomplish specific tasks (such as determining closest relationships, relative clade diversity, etc.).

In some ways OneZoom resembles Google Maps with its zoomable interface. But ironically this only serves to illustrate a key difference between OneZoom and Google Maps. Part of the strength of the latter is the consistent conventions for drawing maps (e.g., north is up, south is down) which, when coupled with agreed co-ordinates (latitude and longitude), enable people to mash up geographic data. What I'd like is the equivalent of CartoDB for trees.

Friday, October 12, 2012

Mapping evolutionary biology: @evoldir and #ProjectEvoMap

Robert M. Griffin (@GriffinEvo) has launched ProjectEvoMap. Rob explains:
I have decided this week to try to create a resource where evolutionary
biologists can find info on labs and groups from all around the world. I
have created a collaborative Google map online which evolutionary biology
research groups can pin their labs to with a brief description of their
interests. Others can then browse the map to look for labs in specific
areas – for example, if someone wants to find suitable labs in their
current country for work they can see all the labs in that area, likewise
anyone looking for work in a specific region or who needs access to labs
while on fieldwork can look for nearby groups which may be able to help.

Below is a screenshot of part of the map. If you're working on evolutionary biology, now is your chance to literally put your lab on the map.

In parallel I'm experimenting with adding a map to the venerable EvolDir mailing list, for which I run a Twitter stream (@evoldir). Some terribly crude code extracts what looks like an address from each EvolDir post, then calls Google's Geocoding API to produce a map of recent posts. A live version of the map is online. This service complements Rob's by giving a sense of current activity in the community (e.g., conferences, courses, jobs).
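
The address-to-map step described above can be sketched as follows. This is not the code behind the EvolDir map, just a minimal illustration: the address heuristic (take the first line of a post that names a university, institute, museum, or college) and the function names are assumptions. The endpoint and its `address` and `key` parameters are those of Google's Geocoding web service, which requires your own API key.

```python
import json
import re
import urllib.request
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

# Hypothetical heuristic: a line naming an institution is probably an address.
ADDRESS_HINT = re.compile(r"\b(University|Institute|Museum|College)\b", re.IGNORECASE)


def guess_address(post_text):
    """Return the first line that looks like an institutional address, or None."""
    for line in post_text.splitlines():
        line = line.strip()
        if line and ADDRESS_HINT.search(line):
            return line
    return None


def geocode_url(address, api_key):
    """Build the Geocoding API request URL for an address string."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def geocode(address, api_key, timeout=10):
    """Call the API (needs network and a valid key); return (lat, lng) or None."""
    with urllib.request.urlopen(geocode_url(address, api_key), timeout=timeout) as resp:
        data = json.load(resp)
    if data.get("status") != "OK" or not data.get("results"):
        return None
    location = data["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]
```

Posts for which `guess_address` returns `None`, or for which the geocoder finds nothing, would simply be left off the map.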