Tuesday, November 29, 2011

Mapping names to literature: closing in on 250,000 names

Following on from my earlier post Linking taxonomic names to literature: beyond digitised 5×3 index cards I've been slowly updating my latest toy:


This site displays a database mapping over 200,000 animal names to the primary literature, using a mix of identifiers (DOIs, Handles, PubMed, URLs) as well as links to freely available PDFs where they are available. Lots still to do as about a third of the 1.5 million names in the database have citations that my code hasn't been able to parse. There are also lots of gaps that need to be filled in, for example missing DOIs or PubMed identifiers, and a lot of the earlier names are linked by "microcitations" to names, and I'll need to handle those (using code from my earlier project Nomenclator Zoologicus meets Biodiversity Heritage Library: linking names directly to literature).

The mapping itself is stored in a database that I'm constantly editing, so this is far from production quality, but I've found it eye-opening just how much literature is available. There is a lot of scope for generating customised lists of papers, for example, primary taxonomic sources for taxa currently on the IUCN Red List, or those taxa which have sequences in GenBank (building on the mapping of NCBI taxa onto Wikipedia). Given that a lot of the relevant literature is in BHL, or available as PDFs, we could do some data mining, such as extracting geographical coordinates, taxonomic names, and citations. And if linked data is your thing, the 110,000 DOIs and nearly 9,000 CiNiii URLs all serve RDF (albeit not without a few problems).

I've set a "goal" of having 250,000 names mapped to the primary literature, at which point the database interface will get some much-needed attention, but for now have a look for your favourite animal and see if it's original description has been digitised.

Towards the bibliography of life

David King et al.'s paper "Towards the bibliography of life" http://dx.doi.org/10.3897/zookeys.150.2167 has just appeared in a special issue of ZooKeys. I've written a number of posts on this topic, so I've a few comments.

King et al. survey some of the issues, but don't really tackle the big issue of how we're going to build this. If we define the "bibliography of life" somewhat narrowly as the list of all papers that have published a scientific name (or a new combination, such as moving a species from one genus to another), then this is a large, but measurable undertaking. According to ION's metrics page, these are the numbers involved (for animals and protozoa):

Total New Names1,510,402
Total New Genera / Subgenera215,242
Total New Species / Subspecies1,192,366
Total Other New Names102,794
Total New Combinations241,296
Total New Synonyms260,544

Even in the worse case scenario of one name per publication (clearly not the case) this is big, but not insurmountable, task.

Publications not taxa
Part of the challenge is figuring out the best way to tackle the problem. In the past, most efforts at building taxonomic bibliographies have focussed on specific taxa, which is natural — the bibliographies are being built by taxonomists and they specialise in particular groups. But I'd argue that this is not the most efficient way to tackle the problem. Because the taxonomic literature is so widely dispersed, after the obvious "low hanging fruit" have been collected, considerable effort must be spent tracking down the harder to find citations. There are few economies of scale in this approach. In contrast, if we focus on publications at, say, the level of journal, then we can build a bibliography much more quickly. Once we've found the source, say, for one article, often we could use that information to harvest many articles from the same source (e.g., write scripts to harvest from a digital repository such as a DSpace server, or a digital library such as Gallica). But if we are focussed on a particular taxon, we will ignore the other articles in that journal ("what do I care about fish, I like turtles").

Put another way, if we imagine a taxa × publication matrix, then we can either go after rows (i.e., a bibliography for a specific taxonomic group), or columns (a list of articles in a specific journal). The article-based approach will be faster, albeit at the cost of finding articles that aren't necessarily relevant to taxonomy. This is why I'm spending what feels like far too much time harvesting article lists and uploading these to Mendeley. It is also one reason BHL has been so successful. They've simply gone after scanning the literature wholesale, rather than focussing on particular taxonomic groups.

TaxapublicationmatrixWikispecies logo enCrowd sourcing and Wikispecies
Crowd sourcing often strikes me as a euphemism for "we can't be bothered doing the tedious stuff, lets get the public to do it for us (plus it will look like we're engaged with the public)." I'm not denying can work, but I suspect it's not a magic bullet. Perhaps the best crowd sourcing is not to try and bring the crowd to a project, but go where the crowd has already gathered. In this case, an obvious crowd is the Wikispecies community. Working with the ION database for my Sherborn presentation, it's clear that the quality of bibliographic data in ION is variable, and rather poor for older references. In contrast, the reference lists on Wikispecies can be very good (e.g., the bibliography for George Boulenger). There are some issues with Wikispecies, notably the lack of a decent bibliographic template (unlike Wikipedia) so parsing references can be *cough* interesting, but there is scope here to use it to improve other databases. Citation matching can be a challenge, but in this case we have citations indexed by taxonomic name (in both ION and Wikispecies), which greatly reduces the scope of possible matches.

I think building the "bibliography of life" needs a combination of aggressive data gathering, and avoiding building additional tools unless absolutely needed. There are great tools and communities that can already be leveraged (e.g., Mendeley, Wikispecies), let's make use of them.

Thursday, November 24, 2011

BHL needs to engage with publishers (and EOL needs to link to primary literature)

Browsing EOL I stumbled upon the recently described fish Protoanguilla palau, shown below in an image by rairaiken2011:
Palauan Primitive Cave Eel

Two things struck me, the first is that the EOL page for this fish gives absolutely no clue as to where you would to find out more about this fish (apart from an unclickable link to the Wikipedia page http://en.wikipedia.org/wiki/Protoanguilla - seriously, a link that isn't clickable?), despite the fact this fish has been recently described in an Open Access publication ("A 'living fossil eel (Anguilliformes: Protanguillidae, fam. nov.) from an undersea cave in Palau", http://dx.doi.org/10.1098/rspb.2011.1289).

Now that I've got my customary grumble about EOL out of the way, let's look at the article itself. On the first page of the PDF it states:
This article cites 29 articles, 7 of which can be accessed free

So 22 of the articles or books cited in this paper are, apparently, not freely available. However, looking at the list of literature cited it becomes obvious that rather more of these citations are available online than we might think. For example, there are articles that are in the Biodiversity Heritage Library (BHL), e.g.

Then there are articles that are available in other digitising projects

  • Hay O. P. 1903 On a collection of Upper Cretaceous fishes from Mount Lebanon, Syria, with descriptions of four new genera and nineteen new species. Bull. Am. Mus. Nat. Hist. N. Y. 19, 395–452. http://hdl.handle.net/2246/1500
  • Nelson G. J. 1966 Gill arches of fishes of the order Anguilliformes. Pac. Sci. 20, 391–408. http://hdl.handle.net/10125/7805

Furthermore, there are articles that aren't necessarily free, but which have been digitised and have DOIs that have been missed by the publisher, such as the Regan paper above, and

So, the Proceedings of the Royal Society has underestimated just how many citations the reader can view online. The problem, of course, is how does a publisher discover these additional citations? Some have been missed because of sloppy bibliographic data. The missing DOIs are probably because the Regan citation lacks a volume number, and the Trewavas paper uses a different volume number to that used by Wiley (who digitised Proc. Zool. Soc. Lond.). But the content in BHL and other digital archives will be missed because finding these is not part of a publisher's normal workflow. Typically citations are matched by using services ultimately provided by CrossRef, and the bulk of BHL content is not in CrossRef.

So it seems there's an opportunity here for someone to provide a service for publishers that adds value to their content in at least three ways:
  1. Add missing DOIs due to problematic citations for older literature
  2. Add links to BHL content
  3. Add links to content in additional digitisation projects, such as journal archives in DSpace respositories

For readers this would enhance their experience (more of the literature becomes accessible to them), and for BHL and the repositories it will drive more readers to those repositories (how many people reading the paper on Protoanguilla palau have even heard of BHL?). I've said most of this before, but I really think there's an opportunity here to provide services to the publishing industry, and we don't seem to be grasping it yet.

Wednesday, November 23, 2011

Wikipedia History Flow tool now in GitHub

Inspired by a comment on my post Visualising edit history of a Wikipedia page, the code I use to make history flow diagrams like the one below is now in GitHub at https://github.com/rdmpage/wikihistoryflow.


There is also a live version at http://iphylo.org/~rpage/wikihistoryflow. If you enter the name of a Wikipedia page the tool will display the edit history with columns representing page versions and individual contributors (people and bots) distinguished by different colours.

This tool will fall over for pages with a lengthy history of edits, and requires a web browser that can support SVG, but it's a fun visualisation, and may inspire someone to do this properly.

Tuesday, November 22, 2011

Apache mod_rewrite and question marks "?"

Quick note to self in case I (inevitably) forget later. If you are using Apache mod_rewrite to make nice, clean URLs, and are also supporting JSONP, you may run into the situation where you have code that wants to append "?callback=xxx" to your URL (e.g., a cross-domain AJAX call in jQuery). Imagine you have a nice clean URL /user/123, which actually corresponds to user.php?id=123. If you append ?callback=xxx to the URL then chances are the code will break, because mod_rewrite will rewrite the URL to something like user.php?id=123?callback=xxx. What you actually want to send to your web server is user.php?id=123&callback=xxx (note the & before "callback"). After much grief trying to figure out how to coerce Apache mod_rewrite into handling this situation I found the answer, of course, on Stack Overflow. If you use the [QSA] flag, Apache will append the additional callback parameter onto the end of the rewritten URL, so JSONP will now work. Once again, Stack Overflow turned a show-stopper into a learning experience.

Friday, November 18, 2011

Adding article-level metadata to BHL

Recently I've been thinking about the best ways to make article-level metadata from BioStor more widely available. For example, for someone visiting the BHL site there is no easy way to find articles, which are the basic unit for much of the scientific literature. How hard would it be to add articles to BHL? In the past I've wanted an all-singing all dancing article-level interface to BHL content (sort of BioStor on steroids), but that's a way off, and ideally would have a broader scope than BHL. So instead I've been thinking of ways to add articles to BHL without requiring a lot of re-engineering of BHL itself.

Looking at other digital archive projects like Gallica and Google Books it strikes me that if the BHL interface to a scanned item had a "Contents" drop down menu then users would be able to go to individual articles very easily. Below is a screen shot of how Gallica does this (see http://gallica.bnf.fr/ark:/12148/bpt6k61331684/f57).


There's also a screen shot of something similar in Google Books (see http://books.google.co.uk/books?id=PkvoRnAM6WUC)


The idea would be that if BioStor had found articles within a scanned item, they would be listed in the contents menu (title, author, starting page), and if the user clicked on the article title then the BHL viewer would jump to that page. If there were no known articles, but the scanned item had a table of contents flagged (e.g., http://www.biodiversitylibrary.org/item/25703) then the menu could function as a button that takes you to that page. If there are no articles or contents, then the menu could be grayed out, or simply not displayed. This way the interface would work for books, monographs, and journal volumes.

Now, admittedly this is not the most elegant interface, and it treats articles as fragments of books rather than individual units, but it would be a start. It would also require minimal effort both on the part of BHL (who need to add the contents button), and myself (it would be easy to create a dump of the article titles indexed by scanned item).

Nature iPhone app clone in GitHub

One thing I'm increasingly conscious of is that I've a lot of demos and toy projects hanging around and the code for most of these isn't readily available. So, I plan to clean these up and put them in GitHub so others can explore the code, and reuse it if they see fit.

First up is the code to create a HTML+Javascript clone of Nature's iPhone app, as described in an earlier post.


There's a live version of the clone here here. and the code is now available from GitHub at https://github.com/rdmpage/natureiphone.