Friday, September 26, 2008

Half-baked ideas. II. Degree of interest tree of NCBI taxonomy

Among the many ways to display trees, degree of interest (DOI) trees strike me as one potentially useful way to display trees such as the NCBI taxonomy. For background see, e.g. doi:10.1145/1133265.1133358 (or Google "degree of interest trees").

The thing that would make this really useful is if an application were written that, like Google Earth, supported a simple annotation file format. Users could then create their own annotation files (e.g., taxa of a certain size, those with eyes, etc.) and upload them, creating their own annotation layers, in much the same way as we can load sets of geographical annotations into Google Earth. I think it's this feature that makes Google Earth what it is, so my question is whether we can replicate this for classifications/phylogeny?
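To make the annotation-layer idea a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the tab-delimited format (taxon id, then a label), the toy tree, and the function names are hypothetical and not taken from any existing tool. It counts the annotated taxa at or below each node, which is the kind of number a viewer might use to decide which collapsed branches are worth expanding.

```python
import csv
import io
from collections import Counter

# Hypothetical toy tree: child taxon id -> parent taxon id (not real NCBI data).
PARENT = {
    "2": "1",
    "3": "1",
    "4": "2",
    "5": "2",
    "6": "3",
}


def load_layer(text):
    """Parse a tab-delimited annotation layer: taxon id, then a free-text label."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def annotated_descendants(layer, parent):
    """Count, for every node, the annotated taxa at or below it."""
    counts = Counter()
    for taxon in layer:
        node = taxon
        # Walk from the annotated taxon up to the root, crediting each ancestor.
        while node is not None:
            counts[node] += 1
            node = parent.get(node)
    return counts


# A made-up layer: two taxa flagged as "has eyes".
layer = load_layer("4\thas eyes\n6\thas eyes\n")
counts = annotated_descendants(layer, PARENT)
```

A user-supplied layer would simply be another file in the same format, loaded with load_layer and overlaid on the same tree.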

Half-baked ideas. I. Wiki for taxonomy

Next few weeks will be busy with term starting, kids visiting, and other commitments, so time to jot down some ideas. The first is to have a Wiki for taxonomic names. Bit like Wikispecies, but actually useful, by which I mean useful for working biologists. This would mean links to digital literature (DOIs, Handles, etc.), use of identifiers for names and taxa (such as NCBI taxids, LSIDs, etc.), and having it pre-populated with data. Imagine merging the NCBI taxonomy, Catalogue of Life, Index Fungorum, and IPNI, say, and having it automatically updated with sources such as WoRMS and uBio RSS. Why a Wiki? Well, partly to capture all the little textual annotations that are needed to flesh out the taxonomy, and partly to make it easy to correct the numerous mistakes that litter existing databases.

As an initial target, I'd aim for a comprehensively annotated NCBI taxonomy, as this is probably the most important taxonomic database that we have.

Wednesday, September 24, 2008

Friday, September 19, 2008

Challenge data

Just to provide a sense of how much data I want to analyse for the Challenge, I have the XML, PDF, and images for 1687 articles from Molecular Phylogenetics and Evolution to play with.


Last week I was at NESCent's 2008 Community Summit. As part of that meeting a few of us had a breakout group on "Biodiversity and phylogenetics". Brian O'Meara took some spectacularly thorough notes, including the pithy:

S[wofford]: What?

Julia Clarke and I were advocating data mining, not entirely successfully. At one point I started ranting about post-phylogenetics (i.e., what to do when we've basically got the tree of life). For a brief moment I thought this might be a cool new term to use, although Googling finds that W. Ford Doolittle has used it in the title of talks given at the Wenner-Gren Foundations International Symposium at Stockholm in 2003, and at Penn State in 2006. However, the 2006 talk title (Postphylogenetics: The Tree of Life in the Light of Lateral Gene Transfer) suggests a different meaning (i.e., there isn't a tree of life to be found). I prefer to think of it in the same sense as "postgenomics" -- now that we have all this information, how can we make the best use of it?

Thursday, September 04, 2008

When ISSNs disappear, taking DOIs with them

I've been using ISSNs (International Standard Serial Numbers) to uniquely identify journals, both to generate article identifiers, and as a parameter to send to CrossRef's OpenURL resolver. Recently I've come across journals that change their ISSN, which has fairly catastrophic effects on my lookup tools. For example, the Canadian Journal of Botany has the ISSN 0008-4026, or at least this is what JournalSeek tells me. However, the journal web site tells me that it has been renamed as Botany, with ISSN 1916-2804. The thing is, if I want to look up DOIs for articles published in the Canadian Journal of Botany, I have to use the ISSN for Botany to get a result. Hence, I can't rely on looking up the ISSN for the Canadian Journal of Botany. I've come across this in other journals as well.

WorldCat's xISSN web services provide some tools to help, including a graphical display of the history of a journal and its ISSN(s). Here is the history for 1916-2790, redrawn using Graphviz. WorldCat use Webdot, which I've written about earlier. If you view the source of the WorldCat page you can get the link to the original dot file.

The problem with these changes is that they make ISSNs more fragile. Ideally, the original ISSN would be preserved, and/or CrossRef would have a table mapping old ISSNs onto new ones. At the rate things are going, I may have to create such a table myself.
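A home-made mapping table of this kind could be as simple as the Python sketch below. The only entry is the Canadian Journal of Botany example from this post. The names ISSN_SUCCESSOR, is_valid_issn, and current_issn are hypothetical, and the sketch assumes each old ISSN has at most one successor, which won't hold for journals that split. The check-digit test (weights 8 down to 2, modulus 11, with X standing for 10) is there only to catch typos when adding entries by hand.

```python
# Old ISSN -> successor ISSN. The single entry is the example from the post;
# a real table would have to be curated by hand or built from a service.
ISSN_SUCCESSOR = {
    "0008-4026": "1916-2804",  # Canadian Journal of Botany -> Botany
}


def is_valid_issn(issn):
    """Check the ISSN check digit (weights 8..2, modulus 11, 'X' means 10)."""
    digits = issn.replace("-", "").upper()
    if len(digits) != 8 or not digits[:7].isdecimal():
        return False
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected


def current_issn(issn, table=ISSN_SUCCESSOR):
    """Follow renames until an ISSN with no recorded successor is reached."""
    seen = set()
    # The 'seen' set guards against accidental cycles in a hand-edited table.
    while issn in table and issn not in seen:
        seen.add(issn)
        issn = table[issn]
    return issn
```

The idea would be to call current_issn on whatever ISSN a citation supplies before sending it to a resolver, so that a lookup for the old title is made with the ISSN the resolver actually knows about.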

Wednesday, September 03, 2008

Hell is other people's data

Starting to get serious about the Grand Challenge. First step is to parse the XML data Elsevier made available. Sadly this is only for Molecular Phylogenetics and Evolution for 2007; I would have liked the whole journal in XML to avoid hassles with parsing PDF. However, XML is not without its own problems. I'm slowly getting my head around Elsevier's XML (which is, it has to be said, documented in depth). Two tools I find invaluable are the oXygen XML editor, and Marc Liyanage's TestXSLT application.

As a first attempt, I'm converting Elsevier XML into JSON (being a much simpler format to handle). I'm just after what I regard as the core data, namely the bibliography, and the tables (rich with GenBank accession numbers, specimen codes, and geocoordinates). There are a few "gotchas", such as missing namespaces and HTML entities that need to be added. Then there's the fact that the XML describes both the document content and its presentation. Tables can get complicated (cells can span more than one row or column), which makes tasks such as identifying cell contents by using the heading of the corresponding column a bit harder. I hope to put an XSLT style sheet online once I'm happy that it can handle most, if not all, of the tables I've come across. Then the fun of trying to extract the information can begin.
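To illustrate the spanning-cell problem, here is a rough Python sketch rather than the XSLT itself. It assumes the table has already been pulled out of the XML into rows of (text, rowspan, colspan) tuples. That representation, the function name expand_table, and the sample values are all made up for illustration and are not Elsevier's actual table markup. The sketch copies each spanning cell into every grid position it covers, so that every data cell can then be looked up by its column heading and written out as JSON.

```python
import json


def expand_table(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid.

    A spanning cell's text is copied into every grid position it covers.
    Assumes a well-formed table (spans never overlap).
    """
    grid = []
    pending = {}  # (row, col) -> text carried down from a cell spanning rows
    for r, row in enumerate(rows):
        out = []
        cells = iter(row)
        c = 0
        while True:
            if (r, c) in pending:
                # This position is covered by a cell from an earlier row.
                out.append(pending.pop((r, c)))
                c += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break
            text, rowspan, colspan = cell
            for dc in range(colspan):
                out.append(text)
                for dr in range(1, rowspan):
                    pending[(r + dr, c + dc)] = text
            c += colspan
        grid.append(out)
    return grid


# Made-up sample: the species cell spans two rows.
table = [
    [("Species", 1, 1), ("Accession", 1, 1), ("Locality", 1, 1)],
    [("Aus bus", 2, 1), ("ACC1", 1, 1), ("Site A", 1, 1)],
    [("ACC2", 1, 1), ("Site B", 1, 1)],
]

grid = expand_table(table)
header = grid[0]
records = [dict(zip(header, row)) for row in grid[1:]]
as_json = json.dumps(records, indent=2)
```

Multi-row headers would need extra handling, for example joining the stacked heading texts for each column; this sketch assumes a single header row.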