Bridging the data style gap- Part 1- different disciplines, different data formats

This post is the first in a series where I will essentially be live-blogging my work on an #otherpeoplesdata dataset. These data are associated with a project examining bee diversity in bioenergy cropping systems. I’ve already done a little bit of work on them: they were collected by two different teams, in two states, in two years, so each set was associated with its own spreadsheet, and there was a bit of formatting I needed to do to get it all harmonized and in one place. Once the data were all in one place, I was able to get a comprehensive list of the specimens that had no identifications. Just now, our resident bee taxonomist finished identifying these last few specimens, so over the next few days I will be working on the data, getting them into a format where I can do the diversity and function analyses I have planned.

I love taxonomists/systematists. Why? Because I’m mostly interested in diversity/function relationships in insect communities, and I’m kind of terrible at IDing insects. If I want to get taxonomic specificity beyond family level, I need to work with people who are much, much better at identifying insects than I am.

However, taxonomists deal with data in a very different way than I typically do. This primarily comes down to the way we process samples. In my subdiscipline, we tend to look at data in a per-observation (i.e., per trap, per plot, etc.) kind of way: in plot 6, on Tuesday, we found 12 ladybugs, 15 pirate bugs, 15478 aphids, and no elephants. Taxonomists, however, tend to deal with data on a per-specimen basis (i.e., each specimen is an observation): Ladybug #10, a Hippodamia convergens, was found in plot 6 on Tuesday. Ladybug #11, a Harmonia axyridis, was found in plot 6 on Tuesday. This, I think, is a direct result of how samples with more complicated identifications need to be processed. Typically, when my colleagues are processing captures of taxa with a more involved identification process, things go something like this:

1. Traps are placed in the field
2. Traps are collected and brought back from the field
3. Specimens are taken, one by one, from the traps. Each specimen is labeled with its originating trap and date of capture, and prepared for identification.*
4. Specimens are grouped largely by morphospecies in specimen trays**
5. Specimens are individually identified to the most precise taxon possible, and tagged with these IDs
6. Identifications are entered, by specimen, into a database, with relevant data about trap of origin, etc.

When I work with data like these, I ask questions like “How does diversity differ between habitats?” and “How do different functional groups of organisms differ between habitats?” But when data are in the per-specimen format I’ve described, it’s really tricky to quickly pull out ecologically relevant statistics, so some work needs to be done to change the data into a form that’s useful for the questions I’m asking. In the next post or two,*** I will work through preparing a real dataset in this format for analysis. We’re going to look at using data refining tools to find errors in categorical variables, reshaping data in R, and dealing with ‘implied’ zeroes.****
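To make the two data shapes concrete before getting to the real dataset, here is a minimal sketch of the per-specimen to per-observation conversion. It is written in plain Python rather than the R I’ll be using later in the series, the field names and the `to_counts` helper are made up for illustration, and the plot 7 record is invented so there is an implied zero to fill in.

```python
from collections import Counter

# Per-specimen records: one row per insect. The first two rows echo the
# ladybug example above; the plot 7 row is invented for illustration.
specimens = [
    {"specimen": 10, "species": "Hippodamia convergens", "plot": 6, "day": "Tuesday"},
    {"specimen": 11, "species": "Harmonia axyridis", "plot": 6, "day": "Tuesday"},
    {"specimen": 12, "species": "Harmonia axyridis", "plot": 7, "day": "Tuesday"},
]


def to_counts(records):
    """Collapse per-specimen rows into one row of species counts per (plot, day)."""
    tallies = Counter((r["plot"], r["day"], r["species"]) for r in records)
    samples = sorted({(r["plot"], r["day"]) for r in records})
    species = sorted({r["species"] for r in records})
    # A species that was not found in a sample has no row in the specimen
    # table, so its zero is only implied. Counter returns 0 for missing keys,
    # which makes those zeroes explicit here.
    return {
        (plot, day): {sp: tallies[(plot, day, sp)] for sp in species}
        for plot, day in samples
    }


counts = to_counts(specimens)
for sample, row in counts.items():
    print(sample, row)
```

Note that this only recovers zeroes for samples that caught at least one specimen. A trap that came back completely empty never shows up in a per-specimen table at all, so the full list of samples would have to come from somewhere else, such as the field records.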

*Either by slide mounting, in the case of many aphids, or by pinning, in the case of bees. Bees are especially fun: when they’re trapped using methods involving liquids, like bee bowls, pinning them involves an extra step where you need to blow dry them to make their setae (i.e., body hairs) fluff out, so that the hairs don’t dry clumped and obscure their features. Thus, many of our undergrad assistants become quite qualified bee morticians/beauticians by the time we’re done with them.
**this considerably speeds up IDing- morphospecies more often than not belong to similar groups, meaning you won’t have to start at square one from your keys between each specimen.
***The number of posts will depend on how many curve balls the data throws me!
**** Fun fact about me: the very first scientific dataset I created took this precise specimen-as-observation form. I was an undergrad, examining parasitoids of an introduced agricultural pest. The grad students I was working with, who did the lion’s share of the analyses on how parasitism varied by the pest’s crop host, kept sending me back to re-organize my data and I could NOT understand why.


About cbahlai

Hi! I'm Christie and I'm an applied quantitative ecologist and new professor. I am an #otherpeoplesdata wrangler, stats enthusiast, and, of course, a bug counter. I cohabitate with five other vertebrates: one spouse, one first grader, one preschooler and two cats.

4 Responses to Bridging the data style gap- Part 1- different disciplines, different data formats

  1. ibartomeus says:

    And now Data is on Bees! What else can you ask…

  2. cbahlai says:

    I haven’t worked much with bees yet, but I’m getting into it! I’m working with some great data producers!

  3. Pingback: Guest post- About Style | Practical Data Management for Bug Counters

  4. Pingback: Help me, I’m covered in bees -or- using OpenRefine to clean specimen data | Practical Data Management for Bug Counters
