Monday, August 31, 2009

Comparing Wikipedia and Mammal Species of the World classifications



Continuing the saga of making sense of the mammal classification in Wikipedia, I've done a quick comparison with the Mammal Species of the World (third edition) classification. MSW is the default taxonomic reference used by WikiProject Mammals. I downloaded the MSW taxonomy as a CSV file (warning, it's big), and wrote a script to pull out the classification as a GML file (my preferred graph format).
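The CSV-to-GML step can be sketched roughly as follows. This is a minimal illustration, not the script I actually used: the column names (Order, Family, Genus, Species) and the single "Mammalia" root are assumptions, and the real MSW file has more ranks and columns.

```python
import csv
import io

# Hypothetical column names; the real MSW CSV header may differ.
RANKS = ["Order", "Family", "Genus", "Species"]

def csv_to_gml(csv_text, root_label="Mammalia"):
    """Convert rows of rank columns into a GML tree of parent -> child edges."""
    ids = {}     # path of names from the root -> node id
    labels = []  # node id -> label
    edges = []   # (parent id, child id)

    def node_id(path, label):
        if path not in ids:
            ids[path] = len(labels)
            labels.append(label)
        return ids[path]

    root_path = (root_label,)
    node_id(root_path, root_label)
    for row in csv.DictReader(io.StringIO(csv_text)):
        path = root_path
        for rank in RANKS:
            name = (row.get(rank) or "").strip()
            if not name:
                continue  # this row has no value at this rank
            # Species cells hold only the epithet, so label them as a binomial.
            label = f"{row['Genus'].strip()} {name}" if rank == "Species" else name
            child_path = path + (name,)
            if child_path not in ids:
                edges.append((ids[path], node_id(child_path, label)))
            path = child_path

    lines = ["graph [", "  directed 1"]
    for i, label in enumerate(labels):
        lines.append(f'  node [ id {i} label "{label}" ]')
    for source, target in edges:
        lines.append(f"  edge [ source {source} target {target} ]")
    lines.append("]")
    return "\n".join(lines)

sample = """Order,Family,Genus,Species
Chiroptera,Vespertilionidae,Hypsugo,anthonyi
Chiroptera,Vespertilionidae,Pipistrellus,pipistrellus
"""
print(csv_to_gml(sample))
```

Keying nodes on the full path from the root, rather than on the name alone, keeps identical epithets in different genera apart.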

Based on some earlier work with Gabriel Valiente, I wrote a simple program that takes two trees and highlights the nodes in common to the two trees. I then input into this program the MSW tree and the largest component of the graph of Wikipedia mammals. The MSW tree has 13582 nodes, the Wikipedia tree has 6287. Note that Wikipedia has more taxa than these 6287 nodes suggest, but they aren't connected to the largest tree (often due to intermediate nodes in the classification lacking a page in Wikipedia). The two trees have 4935 nodes in common (again, this number will be a little low, because there are some weird taxon names due to problems parsing Wikipedia).
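The simplest version of the comparison is just matching node labels. The sketch below assumes each tree is stored as a dictionary from taxon name to parent name, and that exact string matching is good enough; the actual program may well do more than this, and the toy data is only for illustration.

```python
def common_nodes(tree_a, tree_b):
    """Return the taxon names present in both trees.

    Each tree is a dict mapping a taxon name to its parent's name
    (the root maps to None). Matching is by exact name only.
    """
    return set(tree_a) & set(tree_b)

# Toy fragments, heavily simplified.
msw = {
    "Mammalia": None,
    "Chiroptera": "Mammalia",
    "Vespertilionidae": "Chiroptera",
    "Hypsugo": "Vespertilionidae",
    "Hypsugo anthonyi": "Hypsugo",
}
wikipedia = {
    "Mammalia": None,
    "Chiroptera": "Mammalia",
    "Vespertilionidae": "Chiroptera",
    "Pipistrellus": "Vespertilionidae",
    "Pipistrellus anthonyi": "Pipistrellus",
}

shared = common_nodes(msw, wikipedia)
print(sorted(shared))
print(sorted(set(wikipedia) - shared))  # Wikipedia names with no match in MSW
```

Name-only matching is exactly why synonyms show up as "missing": the same species under two different genus names counts as two unrelated nodes.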

MSW versus Wikipedia
Below is the MSW classification with taxa in Wikipedia shown in red.
w-msw.jpg


[Larger scale view here]

The impression given is that most Wikipedia mammal pages are in MSW, with some notable exceptions, including higher level taxa such as Afrotheria, and extinct taxa such as the Multituberculata. Some extant taxa are missing due to synonymy. For example, Wikipedia gives the scientific name of Anthony's pipistrelle as Pipistrellus anthonyi, whereas MSW has it as Hypsugo anthonyi.
As an aside, Wikipedia pages often get muddled about parentheses around taxonomic author names. The authority is in parentheses if the current genus is not the genus in which the species was originally placed. Hence, Pipistrellus anthonyi (Tate, 1942) should actually be Pipistrellus anthonyi Tate, 1942, as Tate originally described this taxon as a species of Pipistrellus (see hdl:2246/1783). However, the name Hypsugo anthonyi (Tate, 1942) does need parentheses.
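The parentheses rule is mechanical enough to write down. The function below is a small sketch (the function and parameter names are mine, purely for illustration), and it only covers the simple case of a single author and a change of genus.

```python
def format_name(current_genus, epithet, author, year, original_genus):
    """Format a species name with its authority.

    The authority goes in parentheses only when the species now sits
    in a different genus from the one it was originally described in.
    """
    authority = f"{author}, {year}"
    if current_genus != original_genus:
        authority = f"({authority})"
    return f"{current_genus} {epithet} {authority}"

# Tate described the species in Pipistrellus, so no parentheses there.
print(format_name("Pipistrellus", "anthonyi", "Tate", 1942, "Pipistrellus"))
# → Pipistrellus anthonyi Tate, 1942
print(format_name("Hypsugo", "anthonyi", "Tate", 1942, "Pipistrellus"))
# → Hypsugo anthonyi (Tate, 1942)
```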


Some Wikipedia taxa also postdate the publication of MSW, such as Philander deltae (see doi:10.1644/05-MAMM-A-065R2.1).


Wikipedia versus MSW
When we do the reverse comparison we see something rather different.

msw-w.jpg


[Larger scale view here]

This is the MSW tree, coloured red where the MSW taxon has a page in Wikipedia. There are big gaps, some of which are due to those pages being in another component (in other words, many "missing" taxa do have pages in Wikipedia, they are just not properly linked to the bigger tree). MSW is also rich in subspecies, which tend to lack their own pages in Wikipedia (possibly a good thing in the cases of taxa such as pocket gophers).

It would be nice to make these comparisons automatic, and develop tools so that managing taxonomy in Wikipedia could be made easier.

Saturday, August 29, 2009

Mammal tree from Wikipedia

Following on from my previous post about visualising the mammalian classification in Wikipedia, I've extracted the largest component from the graph for all mammal taxa in Wikipedia, and it is a tree. This wasn't apparent in the previous diagram, where the component appeared as a big ball due to the layout algorithm used.
tree.jpg


What this suggests is that Wikipedia contributors are quite capable of generating trees, it's just that not all the bits of the tree are connected (hence all the components in the previous post).
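Checking that a component really is a tree doesn't need a drawing: a graph with n nodes is a tree exactly when it is connected and has n - 1 edges. Here is a minimal sketch of that test, assuming the component is given as a collection of node names and a list of undirected edges (this is not the code I used, just the idea).

```python
from collections import defaultdict, deque

def is_tree(nodes, edges):
    """Return True if the undirected graph is connected with n - 1 edges."""
    nodes = set(nodes)
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    adjacent = defaultdict(set)
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    # Breadth-first search from an arbitrary node to test connectedness.
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacent[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == nodes

print(is_tree(
    ["Mammalia", "Chiroptera", "Rodentia"],
    [("Mammalia", "Chiroptera"), ("Mammalia", "Rodentia")],
))
```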

As Cyndy Parr suggested in her comments, it would be useful to compare the Wikipedia-derived tree with other trees, say from Mammal Species of the World or ITIS.

Friday, August 28, 2009

Visualising the Wikipedia classification of mammals

As part of my on-going experiments with Wikipedia as a repository of taxonomic information, I've extracted mammal pages from Wikipedia. There's a lot to be done with these, but the first thing I wanted to ask was whether the Wikipedia pages would form a tree (i.e., had the authors of these pages managed to ensure the pages formed a single, coherent taxonomic classification). The answer, as shown in the graph below, is no.
m.jpg


The graph contains 7750 nodes, each one representing a Wikipedia page with a Taxobox containing the class Mammalia. A node is connected to the node corresponding to its parent in the mammalian classification.

If it formed a single classification there would be just one component. Instead, it contains 841 distinct components, many of which you can see at the bottom. If you want to explore the graph, I've made an image map here using the wonderful graph editor yEd. You'll need to move the browser's scroll bars to see the graph. If you click on a node you'll be taken to the corresponding Wikipedia page.
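Counting components is a standard union-find job. The sketch below assumes the parsed pages are held as a dictionary from page title to the parent named in the Taxobox; a parent with no page of its own simply isn't linked, so its children start a component of their own. The data is a toy example, not my parser's output.

```python
def count_components(parent_of):
    """Count connected components among pages.

    parent_of maps each page title to the parent taxon named in its
    Taxobox (None for the root). Parents without a page are ignored.
    """
    root = {}

    def find(x):
        root.setdefault(x, x)
        while root[x] != x:
            root[x] = root[root[x]]  # path halving keeps the chains short
            x = root[x]
        return x

    for child, parent in parent_of.items():
        find(child)
        if parent is not None and parent in parent_of:
            root[find(child)] = find(parent)
    return len({find(page) for page in parent_of})

pages = {
    "Mammalia": None,
    "Chiroptera": "Mammalia",
    "Hypsugo anthonyi": "Hypsugo",     # no page for Hypsugo in this toy set
    "Philander": "Didelphidae",        # no page for Didelphidae either
    "Philander deltae": "Philander",
}
print(count_components(pages))
```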

Note: The graph has been laid out using yEd's organic layout command, so it won't look tree-like. The diagram is intended to test for connectedness only.

Some of these components may be due to errors in my parser, but many are due to inconsistencies in Wikipedia. Typical problems are Taxoboxes containing taxa for which there is no page in Wikipedia (these are visible as redlinks), or monotypic taxa where the pages for the genus and species are the same.

Of course, the joy of Wikipedia is that these problems can be easily fixed, but the trick is discovering the problems in the first place. There is a distinct lack of tools to enable Wikipedia editors to view the entire classification of interest and identify areas that need fixing (something Roger Hyam alluded to in his comment on an earlier posting). It would, of course, be great to be able to edit the graph shown above and have those changes automatically transmitted to Wikipedia.

Friday, August 21, 2009

Scientific citations in Wikipedia

wikipediaisaccuratecitationneeded.jpg
While thinking about measuring the quality of Wikipedia articles by counting the number of times they cite external literature, and conversely measuring the impact of papers by how many times they're cited in Wikipedia, I discovered, as usual, that somebody has already done it. I came across this nice paper by Finn Årup Nielsen (arXiv:0705.2106v1) (originally published in First Monday as an HTML document; I've embedded the PDF from arXiv below).

Nielsen retrieved 30,368 citations from Wikipedia, and summarised how many times each journal is cited within Wikipedia. He then compared this with a measure of citations within the scientific literature by multiplying the journal's impact factor by the total number of citations. In general there's a pretty good correlation.
1997-20088-1-PB.gif
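To make the comparison concrete, here is a small sketch of the arithmetic as I've described it: impact factor times total citations on one side, Wikipedia citation counts on the other. The journals and numbers are invented purely for illustration, and the Spearman rank correlation is my choice for the sketch, not necessarily the statistic used in the paper.

```python
def ranks(values):
    """Rank values from 1 (smallest) upwards; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Made-up data: (impact factor, total citations, citations from Wikipedia).
journals = {
    "Journal A": (2.0, 1000, 50),
    "Journal B": (1.0, 500, 5),
    "Journal C": (10.0, 10000, 400),
    "Journal D": (1.5, 2000, 30),
}
literature = [impact * total for impact, total, _ in journals.values()]
wikipedia = [wiki for _, _, wiki in journals.values()]
print(spearman(literature, wikipedia))
```

A journal that sits well above the trend, as the Australian botany journals do, is one whose Wikipedia rank is much higher than its literature rank.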


What is striking to me is this passage from the paper:
When individual journals are examined Wikipedia citations to astronomy journals stand out compared to the overall trend (Figure 2). Also Australian botany journals received a considerable number of citations, e.g., Nuytsia (101 [citations]), in part due to concerted effort for the genus Banksia, where several Wikipedia articles for Banksia species have reached "featured article" status.


In the diagram, note also that Australian Systematic Botany (ISSN 1030-1887), which has an impact factor of 1.351, is punching well above its weight in Wikipedia. What I want to find out is whether this is true for other taxonomic journals. Nielsen's study was based on a Wikipedia dump from 2 April 2007, and a lot has been added since then (and the journal Zootaxa has become a major publisher of new taxonomic names).

But what I'm also wondering is whether this is not a great opportunity for the taxonomic community. By responding to {{citation needed}}, we can improve the quality of Wikipedia, and increase the visibility of our work. Given that many Wikipedia taxon pages are in the top 10 Google hits {{citation needed}}, our work is but one click away from the Google results page. Instead of endlessly moaning about the low impact factor of taxonomic journals, we can actively do something that increases the quality and visibility of taxonomic information, and by extension, taxonomy itself.


Tuesday, August 18, 2009

To wiki or not to wiki?

What follows are some random thoughts as I try and sort out what things I want to focus on in the coming days/weeks. If you don't want to see some wallowing and general procrastination, look away now.

I see four main strands in what I've been up to in the last year or so:
  1. services
  2. mashups
  3. wikis
  4. phyloinformatics
Let's take these in turn.

Services
Not glamorous, but necessary. This is basically bioGUID (see also hdl:10101/npre.2009.3079.1). bioGUID provides OpenURL services for resolving articles (it has nearly 84,000 articles in its cache), looking up journal names, resolving LSIDs, and RSS feeds.
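For anyone who hasn't met OpenURL, a request is just a set of key-value pairs describing an article, tacked onto a resolver's address. The sketch below builds one; the base URL is a placeholder (example.org, not bioGUID's real address), and the keys are the common OpenURL 0.1 ones, which may not be exactly the set any given resolver accepts.

```python
from urllib.parse import urlencode

# Placeholder resolver address, for illustration only.
RESOLVER = "http://example.org/openurl"

def openurl(journal, volume, start_page, year):
    """Build an OpenURL-style query for a journal article."""
    query = {
        "genre": "article",
        "title": journal,      # journal title
        "volume": volume,
        "spage": start_page,   # first page (or article number)
        "date": year,
    }
    return f"{RESOLVER}?{urlencode(query)}"

# The TbMap paper (doi:10.1186/1471-2105-8-158) as an example.
print(openurl("BMC Bioinformatics", 8, 158, 2007))
```

A resolver that recognises the article would typically send back an identifier such as a DOI, or redirect to the article itself.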

Mashups
iSpecies is my now aging tool for mashing up data from diverse sources, such as Wikipedia, NCBI, GBIF, Yahoo, and Google Scholar. I tweak it every so often (mainly to deal with Google Scholar forever mucking around with their HTML). The big limitation of iSpecies is that it doesn't make its results reusable (i.e., you can't write a script to call iSpecies and return data). However, it's still the place I go to quickly find out about a taxon.

The other mashups I've been playing with focus on taking standardised RSS feeds (provided by bioGUID, see above) and mashing them up, sometimes with a nice front end (e.g., my e-Biosphere 09 challenge entry).

Wiki
I've invested a huge amount of effort in learning how wikis (especially Mediawiki and its semantic extensions) work, documented in earlier posts. I created a wiki of taxonomic names as a sandbox to explore some of these ideas.

I've come to the conclusion that for basic taxonomic and biological information, the only sensible strategy for our community is to use (and contribute to) Wikipedia. I'm struggling to see any justification for continuing with a proliferation of taxonomic databases. After e-Biosphere 09 the game's up: people have started to notice that we've an excess of databases (see Claire Thomas in Science, "Biodiversity Databases Spread, Prompting Unification Call", doi:10.1126/science.324_1632).

Phyloinformatics
In truth I've not been doing much on this, apart from releasing tvwidget (code available from Google Code), and playing with a mapping of TreeBASE studies to bibliographic identifiers (available as a featured download from here). I've played with tvwidget in Mediawiki, and it seems to work quite well.

Where now?
So, where now? Here are some thoughts:
  1. I will continue to hack bioGUID (it's now consuming RSS feeds from journals, as well as Zotero). Everything I do pretty much depends on the services bioGUID provides.

  2. iSpecies really needs a big overhaul to serve data in a form that can be built upon. But this requires decisions on what that format should be, so this isn't likely to happen soon. But I think the future of mashup work is to use RDF and triple stores (providing that some degree of editing is possible). I think a tool linking together different data sources (along the lines of my ill-fated Elsevier Challenge entry) has enormous potential.

  3. I'm exploring Wikipedia and Wikispecies. I'm tempted to do a quantitative analysis of Wikipedia's classification. I think there needs to be some serious analysis of Wikipedia if people are going to use it as a major taxonomic resource.

  4. If I focus on Wikipedia (i.e., using an existing wiki rather than try to create my own), then that leaves me wondering what all the playing with iTaxon was for. Well, actually I think the original goal of this blog (way back in December 2005) is ideally suited to a wiki. Pretty much all the elements are in place to dump a copy of TreeBASE into a wiki and open up the editing of links to literature and taxonomic names. I think this is going to handily beat my previous efforts (TbMap, doi:10.1186/1471-2105-8-158), especially as errors will be easy to fix.

So, food for thought. Now, I just need to focus a little and get down to actually doing the work.

Monday, August 17, 2009

Nexus Data Editor and Windows Vista

nde.gif
Sometimes it's just amazing/frightening how long a piece of software remains useful. I wrote Nexus Data Editor (NDE) in the late 1990s, mainly to keep my then PhD student Vince Smith happy. Vince was constructing a morphological dataset for lice, and he didn't like Macs (in those days, he's seen the light now), and even if he did MacClade didn't allow him to wax lyrical about character states, so I wrote NDE for Windows (in those days this meant Windows 95 and NT). Vince and other students found it useful, so I wrote a manual and released it.

Turns out people still use NDE, but it doesn't install on Vista. I finally bit the bullet and installed a copy of Vista in VMware Fusion on my MacBook, and confirmed that the installation was broken. I feared I'd have to compile NDE for Vista (a challenge, as it was built using the wonderful Borland 5.02 C++ compiler and IDE, now defunct). Turns out, it's the install package itself that's broken (built using InstallShield). The Inno Setup installer I use for TreeView X works fine, however.

The upshot is, if you use(d) NDE and have Vista, download a new copy of NDE from the web site, and it should work. Thanks to Mike Polcyn at the Southern Methodist University, Dallas, for the prompting that finally got this done.

Wednesday, August 12, 2009

GBIF and Linked Data

At the end of day two of the GBIF LSID-GUID Task Group I put together this crude diagram to summarise some of the possible links between biodiversity data and the larger linked data cloud, which I, among others, have argued is where biodiversity informatics should be heading. Here's my hastily put together diagram (created using the wonderful OmniGraffle):
Links.jpg


I've put GBIF at the centre since we're at GBIF, and it's them we are trying to convince. Yellow circles are biodiversity data sources (which aren't linked data providers, but some can be made so using, for example, my LSID proxy resolver); white circles are linked data sources.

The "sales pitch" is that if we join the linked data cloud we open up the possibility of some very powerful queries, especially ones that are outside the relatively narrow scope of what GBIF and TDWG concern themselves with. Imagine being able to query biodiversity data with respect to population and economic data across countries. These are the sort of things we could realistically aim for.

On a practical level, it also means biodiversity databases could devolve a lot of their tasks to other databases (via reusing identifiers). Some taxonomists have DBPedia URIs, and more could be added to Wikipedia (and so will find their way into DBPedia). Geonames provides geographic URIs which we could reuse, and so on. Within our own community we could do a better job of reusing our own identifiers, and reusing external ones (such as taxa in Wikipedia).

It's late, this is a rushed diagram, and I don't know if it's going to end up in whatever report we manage to assemble tomorrow (our final day). But I hope it captures some of the scope of what we're looking at. I know there are some problems (as have been pointed out to me on Twitter), I'll try and deal with these tomorrow.

Tuesday, August 11, 2009

Wikispecies RSS feed

Following on from my previous post about Wikispecies (which generated some discussion on TAXACOM) I've played some more with Wikispecies.

As a first step I've added a Wikispecies RSS feed to my list of RSS feeds. This feed takes the original Wikispecies RSS feed for new pages (generated by the page Special:NewPages) and tries to extract some details before reformatting it as an ATOM feed. Specifically, I extract GUIDs such as IPNI and Index Fungorum identifiers, bibliographic references (which I will later parse to try and extract identifiers such as DOIs), and latitude and longitude if the Wikispecies page has type locality information. Having the latter means that the RSS feed can be displayed as a map (Google Maps can take an RSS feed with geotagged items and display it on a map for you).
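The geotagging part boils down to adding a point to each feed item. The sketch below builds a single Atom entry with a GeoRSS-Simple point; it is not the feed code itself, the page title and URL are made up, and a complete Atom entry would also need elements such as an updated date, which I've left out.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
GEORSS = "http://www.georss.org/georss"
ET.register_namespace("atom", ATOM)
ET.register_namespace("georss", GEORSS)

def atom_entry(title, link, lat=None, lon=None):
    """Build a (partial) Atom entry, geotagged if coordinates are known."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}id").text = link
    ET.SubElement(entry, f"{{{ATOM}}}link", href=link)
    if lat is not None and lon is not None:
        # GeoRSS-Simple writes a point as "latitude longitude".
        ET.SubElement(entry, f"{{{GEORSS}}}point").text = f"{lat} {lon}"
    return ET.tostring(entry, encoding="unicode")

# Hypothetical page and type locality, for illustration only.
print(atom_entry(
    "Example species",
    "http://species.wikimedia.org/wiki/Example_species",
    lat=-41.3,
    lon=174.8,
))
```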

The map below is live, so it will show any geotagged items in the current Wikispecies feed.




Friday, August 07, 2009

Wikispecies is not a database


This post was prompted by Stephen Thorpe's post on TAXACOM about Wikispecies in which he wrote (in a thread discussing Roger Hyam's recent blog post) that
[i]f it [Wikispecies] isn't a true database, then it is BETTER than a database. It can do anything a database can do, and more, if you know how it works properly.
I beg to differ. Wikispecies runs on a database (the Mediawiki software uses a database to store the wiki), and Mediawiki can be thought of as a database of semi-structured text, but it lacks a lot of the functionality database users would expect. For example, in Wikispecies there's no way to perform basic queries such as how many descendants a given taxon has, what names a particular author has published, or in which geographic region most new names are being described. Much of this information is in Wikispecies; it just isn't in a form that we can readily query.
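To show how little is being asked for, here is the "how many descendants" query once the classification is available as structured data. This is a sketch over a toy dictionary of taxon to parent (which is an assumption about how the data might be stored, not something Wikispecies offers), and it assumes the classification has no cycles.

```python
from collections import defaultdict

def count_descendants(parent_of, taxon):
    """Count all taxa below the given taxon.

    parent_of maps each taxon name to its parent's name (None for the
    root). The classification is assumed to be acyclic.
    """
    children = defaultdict(list)
    for child, parent in parent_of.items():
        if parent is not None:
            children[parent].append(child)
    count = 0
    stack = [taxon]
    while stack:
        for child in children[stack.pop()]:
            count += 1
            stack.append(child)
    return count

classification = {
    "Vespertilionidae": None,
    "Hypsugo": "Vespertilionidae",
    "Hypsugo anthonyi": "Hypsugo",
    "Pipistrellus": "Vespertilionidae",
    "Pipistrellus pipistrellus": "Pipistrellus",
}
print(count_descendants(classification, "Vespertilionidae"))
```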

These limitations are mostly due to the underlying software (Mediawiki), which fortunately can be extended to address these issues using Semantic Mediawiki. I've explored these ideas earlier. With some restructuring, Wikispecies could become a database, but it would require some serious work.

But this raises the real issue with Wikispecies, namely what is it for? Wikipedia is much more informative for many taxa, and the two wikis are very poorly linked (surely we'd want Wikipedia pages linked to the corresponding Wikispecies pages?). Given that Wikipedia is the basis for some core efforts in linked data (e.g., DBPedia), it seems a no-brainer that we would want our information stored in Wikipedia, rather than Wikispecies.

It seems to me that the split between Wikipedia and Wikispecies parallels that between "taxonomic concepts" and "taxonomic names". Wikipedia provides the former, in that it provides one (consensus) view of what a taxon is. Wikispecies would be ideally placed to be a nomenclatural database (and a great place to put all the synonyms that we've accumulated over time, but which would swamp Wikipedia). But Wikispecies also seems to want to provide a classification, which strikes me as unnecessary (and raises the issue of how this relates to the classification in Wikipedia).

I don't wish to denigrate the efforts of Wikispecies contributors (they are doing some neat things, such as harvesting new names from Zookeys), and by clever use of templates they avoid some of the serious problems with classification in Wikipedia, but it's not a taxonomic database, at least, not yet.