It’s probably time I got to this- I did promise it over three months ago. Yesterday, I finally sat down to revisit a dataset which was created in 2008-2009, but then orphaned when staff moved on to other positions. It’s a fantastic dataset- bees were collected in bioenergy cropping systems across two states, using bee bowl arrays. The survey resulted in nearly 5000 specimens. What we’re interested in is how the community of bees differs between crops, how nesting guilds, sociality structures, and pollination potential is affected by these cropping systems.
As far as datasets go, this one is middling in size, from my perspective, and it’s had a lot of different hands on it, so there are some (ahem) issues with the consistency with which the data has been entered.
The problem? Inconsistent entry of data. Treating spreadsheets like lab notebooks. Relying on the human brain to be able to see patterns that computers have trouble picking out.
So, say you are entering species names for a handful of specimens. We’ll make them all honey bees. For specimen one, you enter:
1, Apis mellifera
You go to lunch, come back, and enter the next few specimens
2, Apis melifera
3, A. mellifera
4, Apis mell.
5, Apis mellifera L.
Then, you hand it over to your undergrad assistant:
6, Apis mallifera
7, Apis Mellifera
8, apis mellifera
Done. You come back, three months later, load your data into R [do some intermediate stuff to reshape your data table, more on this later], ask it to compute species richness*, and:
Well, crap. That didn’t work.
When you read that list, you probably understand that the species listed there are all the same thing. The problem is, the computer doesn’t. It can’t see things like you do.** The computer looks at that list and sees eight unique strings- thus, eight unique species.
Now, imagine this, times five thousand. Or 700,000, in the case of some of the datasets I’ve worked with. If you try to manually re-enter all the data, it will take you a whole lot of time, better spent doing science.*** In short- this looks like a job for OpenRefine
OpenRefine was a revelation to me when I first discovered it- a friend on twitter**** mentioned it as a way of handling some problems that seemed very, very familiar to me. I first applied it to the aphid suction trap dataset I’ve told you about before, and I used it to find out that over 800 unique strings actually just***** represented 250 species.
The beauty of OpenRefine is that its’s very intuitive. An Excel user with no coding experience can use it and quickly and effectively clean large amounts of data with very little effort. Go download OpenRefine now. Install it, run it.******
Got it? Good.
Okay, so I just pointed OpenRefine at my file, uploaded it, accepted the default options, and clicked “Create Project.” Here’s what I see:
And now, you can look at each data field to check up on the kinds of data in that field (i.e. ‘faceting’)
I started by checking data in the “State” field. As you can see- look at the Facet/Filter tab to the left of the image, we get two facets. This is good, because I know from my metadata that collections only occurred in Michigan (in 2008 and 2009) and Wisconsin (2009 only). You can also see the number of times that string occurs in that column and you now know there’s almost twice as many observations in MI vs WI. Similarly, when faceting by year, we find there’s only values for 2008 and 2009, with nearly twice as many observations in 2009. Good. This all makes sense.
When we get to faceting things by taxonomic family (lower left), things get a bit more interesting. We quickly can see that more than 60% of the bees we collected were from the family Halictidae, and we collected very few Chrysidiade. But still, everything looks pretty good from a data quality standpoint.
So next, let’s look at “Taxon” which is the most precise taxonomic identification we could get for each specimen:
So, OpenRefine sees 148 unique strings under “Taxon.” But if you just eyeball some of the first species names, you can see there’s an entry that misspells “Andrena chromotricha.” Hover over the choice, and you’ll get an option to edit that name. OpenRefine will also catch things like extra spaces before or after text strings- things very, very difficult to pick up on casual observation.
If you click the “cluster” button, OpenRefine will suggest strings that might mean the same thing and you can quickly edit them into one facet. Try a variety of methods and keying functions- you might find different things. Also, go through your list and just eyeball- sometimes OpenRefine can’t detect things you can (and sometimes it will try to overzealously cluster things) so it’s important to use your judgement.
In the end, I was able to get my Taxon facets down to 126 from 148- that means there was 22 minor typographical errors in this field with 4904 entries. Finding them manually would have been very time consuming, but this exercise took me less than 10 mins (excluding, of course, the time it took me to take screenshots and explain what I was doing to you.)
So, in short- if you have categorical data and lots of it, there is no tool better to have in your toolbelt than OpenRefine.
*Number of species, for the non-ecologists. Nothing to do with how much money these bees are making us.
**You’re very insightful, gentle reader.
***Or playing Candy Crush Saga or watching 19 Kids and Counting. I don’t judge.
— Dave Harris (@davidjayharris) August 28, 2013
***** “just” 250 species. HAHAHAHAHA.
******Troubleshooting tip- I do not know why, but it won’t run for me if I previously had an active Chrome session in my boot. I have to restart my creaky old (middleaged?) laptop, and use the OpenRefine app to launch Chrome. After that, it doesn’t care what I do, but just take this into consideration if you’re having trouble.