Thursday, June 25, 2015

Biodiversity Data Journal data lost on the way to GBIF and EOL

Two ongoing challenges in biodiversity informatics are getting data into a form that is usable, and linking that data across different projects platforms. A recent and interesting approach to this problem are "data journals" as exemplified by the Biodiversity Data Journal. I've been exploring some data from this journal that has been aggregated by GBIf and EOL, and have come across a few issues. In this post I'll firstly outline the standard format for moving data between biodiversity projects, the Darwin Core Archive, then illustrate some of the pitfalls.

Darwin Core Archive

Firstly a quick digression on the Darwin Core Archive format, which has a few gotchas for newcomers to the format (such as myself). The Darwin Core Archive supports a "star schema" like this.

Dwca

At the centre of the star is a table containing data either about taxa or occurrences. We can have additional tables with other sorts of data, and we also have a meta.xml file which tells us what all the data columns are and how the different tables are related to the core table.

For example, if we have taxa as our core, then we can have a table like this were each taxon has a unique taxon_id:

taxon_idtaxon stuff
1stuff
2stuff
3stuff

Now, imagine that we have a reference for each of these taxa (say it's the paper that originally described these species). Then we could add a unique identifier for that reference reference_id to the taxon table:

taxon_idreference_idtaxon stuff
1astuff
2astuff
3astuff

Now, if we were building a relational database we could have a separate table for the references, and link the two table using the reference_id as a primary key for the references and as a foreign key in the taxon table, like this:

reference_idreference stuff
areference

This means that we need only have the reference stored once, which means there's no redundancy. If we need to update the reference data, we only need to do it once.

However, this is not how Darwin Core Archive works. Because it's a star schema, we need to have a references table like this:

reference_idtaxon_idreference stuff
a1reference
a2reference
a3reference

Note that we have added the taxon_id to link the reference to each taxon, and that the same reference occurs three times (once for each taxon it refers to), hence we have redundancy. Note also that if we don't include the taxon_id key then there's no way for a Darwin Core Archive reader to link the reference to the corresponding taxa (we'll come back to this below).

I've said that the reference are in their own table. In fact, we can have everything in one big table, and use the meta.xml table to tell a Darwin Core Archive reader to process that same table but extract different data each time (the Mammal Species of the World checklist http://doi.org/10.15468/csfquc is an example of this). Hence, we could extract taxon_id and taxon stuff for the taxa, then reference_id, reference stuff for the references.

taxon_idreference_idtaxon stuffreference stuff
1astuffreference
2astuffreference
3astuffreference

The other thing to remember is that the meta.xml file is responsible for describing the data. It does this in two ways (1) it defines the type of data a given table contains (e.g., taxa, occurrence, image, etc.), and (2) it defines what each column in the data represents, using a controlled vocabulary.

The type of data each table contains is defined by a URI, and the list of these "registered extensions" is available from GBIF. The two "core" extensions are for taxa and occurrences, the two things GBIF primarily deals with, while the other extensions enable richer data to be added. Of course, a Darwin Core Archive consumer that doesn't understand these extensions can simply ignore them. Rather unfortunately, some extensions, such as the EOL media and references extensions overlap with the GBIF multimedia and references extensions. Hence, if you have, say images or bibliographic data, you have two extensions to choose from. If you choose EOL's then EOL will import your data, but GBIF won't. Furthermore, the extensions vary in richness. If you have bibliographic data then GBIF's vocabulary for references looks sparse and lacking many of the fields one might expect, whereas EOL's is quite rich.

Problems with Biodiversity Data Journal and GBIF

With that background, let's take a look at what happens to Biodiversity Data Journal (BDJ) data once it enters GBIF. For example, the species Eupolybothrus cavernicolus, described using "transcriptomic, DNA barcoding and micro-CT imaging data" (http://dx.doi.org/10.3897/BDJ.1.e1013). Data from this paper is in GBIF as both an occurrence dataset (http://doi.org/10.15468/zpz4ls) and checklist dataset (http://doi.org/10.15468/rpavbl).

Images

The checklist dataset includes both media and references. The images don't appear in GBIF, but are visible in EOL (e.g., http://eol.org/data_objects/26558840 shown below:

36163 orig Because the type for the media is set to a type (http://eol.org/schema/media/Document) that only EOL recognises, GBIF doesn't harvest the images, and hence misses out on all this extra multimedia goodness.

References

The references in the BDJ dataset don't appear in either GBIF or EOL (see http://eol.org/pages/38177334/literature). Presumably they don't appear in GBIF because BDJ uses EOL's extension, but why don't they appear in EOL? Looking at the raw data, the references.csv file in the Darwin Core lacks the coreid field needed to link the references to the corresponding taxon (the fiels is defined in the meta.xml file, but there is no corresponding column in the references.csv file. Looking at other BDJ Darwin Core Archives this seems to be a common problem.

Map

Strangely the BDJ paper shows a map with a point locality, but the same data in GBIF does not (see http://doi.org/10.15468/zpz4ls). Mapbdj

A look at the occurrences.csv shows that the file has verbatim latitude and longitude but not decimal versions of the coordinates, which is what GBIF uses to locate records on the map. So the BDJ data set isn't contributing any geographical data. Clearly a lot of BDJ data is georeferenced (see map), but not this example.

Taxa

The centipede Eupolybothrus cavernicolus is not in GBIF's backbone classification. This is a common issue, especially with newly described taxa. GBIF does not have access to recent nomenclatural data, and so even though the BDJ data comes with a ZooBank LSID urn:lsid:zoobank.org:act:6F9A6F3C-687A-436A-9497-70596584678C for the name Eupolybothrus cavernicolus, GBIF itself doesn't know about and so if you do a default search on the name Eupolybothrus cavernicolus you get only the genus.

Summary

Here are the issues I uncovered after a little bit of messing about:

  1. BDJ Darwin Core Archives don't support extensions recognised by GBIF.
  2. BDJ references lack the coreid for the taxa/occurrences and hence are not ingested by Darwin Core readers.
  3. BDJ does not seem to parse and interpret verbatim coordinates when generating Darwin Core Archives.
  4. GBIF doesn't support the extensions output by BDJ.
  5. GBIF's references extension is woefully inadequate for handling bibliographic metadata.
  6. GBIF's list of taxonomic names is woefully out of date.

What both puzzles and frustrates me is that a much trumpeted collaboration between these projects has significant problems which seem to have gone undetected. It seems as if it is enough to have a pipeline between a data journal and a project, without actually testing whether that pipeline loses or misrepresents the data. In some cases, very little of the data in a BDJ archive actually makes it into GBIF, which is wasteful and rather defeats the point of having a data journal to database pipeline in the first place.