What can the Global Biodiversity Information Facility (GBIF) do for you?

I've recently been appointed Chair of the Science Committee of the Global Biodiversity Information Facility (GBIF) http://www.gbif.org [1]. The committee is a small group of people with a range of backgrounds, and one of our roles is to advise GBIF on scientific matters (e.g., what kinds of data should GBIF collect, and what kinds of scientific questions should GBIF help answer?).

There have been formal surveys (see the papers in the journal "Biodiversity Informatics" https://journals.ku.edu/index.php/jbi/issue/view/370/showToc ), meetings, and a "vision" statement (the "Global Biodiversity Informatics Outlook", http://www.biodiversityinformatics.org/ ). But there's always the chance that these fora may miss some points of view, so I'm keen to get feedback on what GBIF could do to better help people tackle the scientific questions they are interested in.

For example, is there some fundamental limitation of GBIF that prevents it being useful to you? Is there some feature, data type, geographic coverage, etc. that could be addressed to make it more useful? Is there a role that GBIF should take on but hasn't? A useful analogy might be the central role GenBank plays in genomics: it is a place to archive your data (sequences), a repository of other people's data that you can access, and a research tool (e.g., BLAST searches to locate similar sequences). Is that the sort of thing you'd want from GBIF, or is it something entirely different?

I'd welcome any comments, suggestions, views, etc. Feel free to add them as comments to this blog, or email me (rdmpage at gmail.com).

I should stress that this is simply me trying to calibrate my perception of GBIF's role against what others think. Also, if you have specific comments on things such as the GBIF web site, please use the feedback tab on the site (that way they will reach the people who can do something about them).

[1] For those unfamiliar with GBIF, its mission "is to make the world's biodiversity data freely and openly available via the Internet". At present the bulk of the data are observations of organisms (mostly multicellular eukaryotes, i.e., animals, plants and fungi) based on either museum collections or observations of living organisms. You can get an idea of the kind of science that uses GBIF-hosted data from this list of papers on Mendeley http://www.mendeley.com/groups/1068301/gbif-public-library/


Based on responses so far I'll compile a list below of suggestions/themes.


  • Have the ability to annotate records (e.g., flag errors) and some mechanism where those annotations get incorporated into GBIF and/or primary data providers.

Dashboard/gap analysis

  • For any search provide information on how complete and/or representative the data is likely to be (for example, are vertebrates over-represented, what is the extent of sampling in this area, etc.).

Geographic coverage

  • Fill big gaps in coverage (e.g., Russia, China, much of the tropics).


  • Link GBIF occurrence records to sequences in GenBank


  • Who identified specimen?
  • Details on georeferencing (esp. if not GPS)

Data types

  • DNA sequences
  • abundance

Data sources

  • GenBank
  • Literature records (e.g., data mining published papers)
  • "Gray" literature, e.g. field books, reports


  • Lack of stable identifiers for occurrences
  • Contributors of specimen data not (yet) in an institution have to mint their own identifiers, with no way of linking those to any future identifier minted by the institution that will eventually house their collection


  • Being able to refine taxon search by geographic region
  • Search on any Darwin Core field
  • Wild card search
  • Support for GIS data formats
  • Search using arbitrary bounding polygons (e.g., draw a shape on a map)
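To make the polygon search suggestion concrete, here is a minimal sketch of filtering occurrence records by a user-drawn shape. It is not how GBIF does (or would do) this; it is a plain ray-casting point-in-polygon test, it treats longitude and latitude as flat coordinates (so it ignores the antimeridian and the poles), and the function names are hypothetical. The only thing borrowed from a standard is the pair of Darwin Core field names, decimalLongitude and decimalLatitude.

```python
def point_in_polygon(lon, lat, polygon):
    """Return True if (lon, lat) lies inside polygon.

    polygon is a list of (lon, lat) vertices in order; the last vertex
    is implicitly joined to the first. Uses the ray-casting rule: count
    how many edges a horizontal ray from the point crosses.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that latitude.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def filter_by_polygon(records, polygon):
    """Keep records (dicts with Darwin Core coordinates) inside polygon."""
    return [
        r for r in records
        if point_in_polygon(r["decimalLongitude"], r["decimalLatitude"], polygon)
    ]


# Hypothetical example: a 10 x 10 degree box and two made-up records.
box = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
records = [
    {"id": "a", "decimalLongitude": 5.0, "decimalLatitude": 5.0},
    {"id": "b", "decimalLongitude": 15.0, "decimalLatitude": 5.0},
]
selected = filter_by_polygon(records, box)
```

A real service would more likely push this test into a spatial index rather than scan every record, but the sketch shows what "draw a shape on a map" reduces to for each occurrence.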


Which taxonomic journals should be digitised next?

One reason I was able to build BioNames is that a significant fraction of the taxonomic literature for animals is now online, thanks to the efforts of the Biodiversity Heritage Library, digital archives, commercial publishers, and individual institutions and scientific societies. However, there are still big gaps in literature availability. To get a sense of these gaps I've constructed a table listing all the journals in BioNames that have an ISSN, ordered by the number of articles in BioNames (i.e., mostly articles that publish new names). The full table is here; I've reproduced part of it below (limited to those journals with at least 500 articles in BioNames). If you click on the ISSN in the table you can go to the corresponding page in BioNames to get full details of what BioNames currently knows about that journal.

The journals in red are the ones with the worst online presence (see complete key below). Note that BioNames is still a work in progress so there will be some journals that are online but I've simply not had a chance to add them to BioNames. With that in mind, there are some striking gaps in the digital availability of taxonomic publications. Several Russian journals (collectively publishing thousands of articles) are not online (the story here is somewhat complicated because some Russian journals also have English-language translations available but these are mostly recent articles). A number of large entomological journals are not available (perhaps not surprising given that most described animal taxa are insects).

We can think of this as a "league table" of literature availability. My hope is that digitising projects such as the Biodiversity Heritage Library will look at this and use it to help prioritise which journals to scan. In particular, if a journal is not pre-1923 (pre-1923 publications being out of US copyright), I hope BHL will contact the journal's publisher and see if they would be willing to add their journal to those (such as Proceedings of the Biological Society of Washington) that have opened up their complete back catalogue to being scanned by BHL.

I also hope that scientific societies or organisations that publish journals in the "red" or "orange" zones will consider digitising their journals and making their contents accessible to the wider community. We are reaching the point where if knowledge is not online then it effectively doesn't exist.

> 90%: Almost all are available
< 90%: Most are available
< 50%: Limited availability
< 10%: Mostly inaccessible
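As a rough illustration of how the key above could be applied to a journal, the sketch below assigns one of the four categories from a pair of counts. The function name and the counts are hypothetical, and two things are assumptions rather than anything stated in the key: that "availability" is measured as the fraction of a journal's articles in BioNames that are online, and that the boundary cases fall as coded (exactly 90% counts as "Most", exactly 50% as "Most", exactly 10% as "Limited").

```python
def availability_category(n_online, n_total):
    """Map a journal's online coverage to a category from the key.

    n_online: number of the journal's articles that are online (assumed measure)
    n_total:  number of the journal's articles known to BioNames
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    fraction = n_online / n_total
    if fraction > 0.9:
        return "Almost all are available"
    if fraction >= 0.5:   # assumption: 50% up to and including 90%
        return "Most are available"
    if fraction >= 0.1:   # assumption: 10% up to (not including) 50%
        return "Limited availability"
    return "Mostly inaccessible"


# Hypothetical counts, purely for illustration.
example = availability_category(40, 800)
```

With the made-up counts above (40 of 800 articles online, i.e. 5%), the journal would land in the "Mostly inaccessible" band.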
