All right, I’m finally back from all my travelling. I’ve seen many a strange thing in my travels.
But now, back to reality.* It’s time for some data manipulation.
Where we left off, we had a nice, clean specimen list, with information on the site where each specimen was collected, the date, and the plant community at the site at the time of collection, as well as information about the specimen itself (species identification, taxonomic family, nesting habitat, sociality, and a variable describing membership in various pollen deposition groupings).**
So, what I’ve got is a perfectly serviceable specimen list with a bunch of supplemental information about each of the specimens. But I’m an ecologist. I’m interested in the bee communities by habitat: namely, how the habitat affects the bee community as a whole, how it affects different functional groups (i.e., nesting guilds, sociality structures), and how it affects the community’s ability to pollinate. In order to do these analyses, I need observations on a per-habitat basis, not a per-specimen basis.
Friends, it’s time to get our reshape2 on. ***
First, we need to get my specimen list into R, and into an object called bee.specimens. Then, let’s see what we have:
 "State" "Unique.ID" "Year" "DOY" "Family" "Genus" "Species" "Taxon"
 "Group" "Site.ID" "Sample" "Treatment" "Trap.height" "Nest.biology" "Sociality"
For your reference, “DOY” refers to ‘day of year’, “Taxon” is a concatenation of the most precise taxonomic identification of the specimen possible, usually in the form Genus.species, and “Group” is the pollen deposition group of that species. “Unique.ID” refers to the tag number that specimen has in our cupboard, and “Site.ID” is our unique site name (the details of location, landscape parameters, etc., describing each site are stored elsewhere).
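For anyone following along at home, the loading step might look something like the sketch below. The file and its contents are invented stand-ins (my real sheet has more columns), written to a temporary file so the example runs on its own; in practice you would point read.csv at your own file.

```r
# Hypothetical stand-in for the real specimen sheet: write a tiny CSV to a
# temporary file so the example is self-contained (all values are made up)
csv.path <- tempfile(fileext = ".csv")
writeLines(c(
  "State,Unique.ID,Year,DOY,Taxon,Site.ID",
  "XX,1001,2010,152,Bombus.impatiens,A",
  "XX,1002,2010,152,Apis.mellifera,B"
), csv.path)

# Read the sheet into an object called bee.specimens
bee.specimens <- read.csv(csv.path, stringsAsFactors = FALSE)

# List the column headers, then peek at the column types
names(bee.specimens)
str(bee.specimens)
```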
As I said before, I want to know how bee diversity, abundance, pollen deposition, nesting guild, etc., vary across site and treatment. Since I’m not really interested in within-season variation, and all sites were sampled the same number of times within a year, I want to reshape my data so I can compute these values on a by-site, by-year basis.
Reshaping data relies on two key commands: melt and cast. You ‘melt’ the data to make it fluid, and then you ‘cast’ it into the shape you want it to take. In reshape2, ‘cast’ takes the form ‘acast’ and ‘dcast’, where the former outputs to an array or vector, and the latter outputs to a data frame. For this application, we want a data frame output, so we’ll use dcast.
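If melt and cast are new to you, here is a minimal sketch on a toy data frame before we touch the real thing. Everything in it is invented for illustration, including the Mass.mg column, which is not in my dataset; it is only there so melt has something to treat as a measured variable.

```r
library(reshape2)

# Toy data, invented for illustration: one row per specimen
toy <- data.frame(
  Year    = c(2010, 2010, 2010, 2011, 2011),
  Site.ID = c("A", "A", "B", "A", "B"),
  Taxon   = c("Bombus.impatiens", "Apis.mellifera", "Bombus.impatiens",
              "Bombus.impatiens", "Bombus.impatiens"),
  Mass.mg = c(110, 95, 120, 105, 115),  # hypothetical measured variable
  stringsAsFactors = FALSE
)

# Melt: Year, Site.ID and Taxon stay as identifiers; Mass.mg is stacked
# into a 'variable'/'value' pair of columns (long format)
toy.long <- melt(toy, id.vars = c("Year", "Site.ID", "Taxon"), na.rm = TRUE)

# Cast: count the rows for each taxon in each year.
# dcast returns a data frame; acast returns a matrix for the same formula
counts.df  <- dcast(toy.long, Taxon ~ Year, fun.aggregate = length,
                    value.var = "value")
counts.mat <- acast(toy.long, Taxon ~ Year, fun.aggregate = length,
                    value.var = "value")

counts.df
counts.mat
```

The two results hold the same counts. The difference is that the data frame keeps Taxon as a regular column, while the matrix stores it in the row names.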
# load the reshape2 package
library(reshape2)
# melt the data so that R understands which headers are needed for moving the data around
specimen.list<-melt(bee.specimens, id=1:17, na.rm=TRUE)
And now the fun part begins. First, let’s ask for a species list, by year, with number of observations of each species:
#see how many species we have
species.list<-dcast(specimen.list, Taxon~Year, length)
And we get something that looks like this:
So, a bit about the code: ‘length’ is the default for ‘fun.aggregate’ (that is, the aggregation function), and what ‘length’ does is simply count the number of observations of each taxon in each year. In this dataset, one row equals one specimen, but in many cases you’d have a ‘count’ variable for the number of individuals of a given species observed at a time. You could do other operations on a variable like that: say you wanted the sum, the mean, or the standard deviation of the rows meeting your criteria. Those functions are all possible.
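To make that last point concrete, here is a small sketch of what casting with a different aggregation function could look like. The data frame and its Count column are hypothetical (my dataset has no such column); the idea is only to show sum and mean standing in for length.

```r
library(reshape2)

# Invented example: each row is one trap sample, with a Count of individuals
samples <- data.frame(
  Year  = c(2010, 2010, 2010, 2011),
  Taxon = c("Bombus.impatiens", "Bombus.impatiens",
            "Apis.mellifera", "Bombus.impatiens"),
  Count = c(3, 5, 2, 4),  # hypothetical count variable
  stringsAsFactors = FALSE
)

samples.long <- melt(samples, id.vars = c("Year", "Taxon"),
                     measure.vars = "Count")

# Total individuals of each taxon in each year
totals <- dcast(samples.long, Taxon ~ Year, fun.aggregate = sum,
                value.var = "value")

# Mean individuals per sample; taxon-by-year combinations with no samples
# are explicitly filled with NA rather than left to the default
means <- dcast(samples.long, Taxon ~ Year, fun.aggregate = mean,
               value.var = "value", fill = NA_real_)

totals
means
```

Any function that takes a vector and returns a single value (sd, max, and so on) can be dropped in the same way.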
Now that we know we’ve got our data flowing freely, let’s focus on casting it in some useful ways. We need to create a matrix of bee species by year and site so that we can calculate rarefied richness, diversity, and so on. That goes a little something like this:
#cast a cross-tab by taxon
bee.matrix<-dcast(specimen.list, Year+Site.ID~Taxon, length)
Well, that was easy:
We can follow through and reshape the data to get values by other classifications too, i.e., groupings, nest biology, etc. I’m going to go do this now. Next time, we can talk about re-aggregating these data into a single matrix.****
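For the curious, casting by functional group might look roughly like the sketch below. It uses a toy data frame rather than my real specimen list, and the nesting and sociality labels in it are placeholders I made up for the example.

```r
library(reshape2)

# Invented toy specimens; the nesting and sociality labels are placeholders
toy <- data.frame(
  Year         = c(2010, 2010, 2010, 2011),
  Site.ID      = c("A", "A", "B", "A"),
  Nest.biology = c("hive", "ground", "ground", "hive"),
  Sociality    = c("social", "solitary", "solitary", "social"),
  Unique.ID    = 1:4,
  stringsAsFactors = FALSE
)

toy.long <- melt(toy,
                 id.vars = c("Year", "Site.ID", "Nest.biology", "Sociality"),
                 measure.vars = "Unique.ID")

# Number of specimens in each nesting guild, by year and site
nest.matrix <- dcast(toy.long, Year + Site.ID ~ Nest.biology,
                     fun.aggregate = length, value.var = "value")

# Same idea, by sociality
social.matrix <- dcast(toy.long, Year + Site.ID ~ Sociality,
                       fun.aggregate = length, value.var = "value")

nest.matrix
social.matrix
```

Only the right-hand side of the formula changes between the two casts; the Year + Site.ID rows stay the same, which should make the results easy to line up later.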
* For those of you whose reality includes writing a blog about data management. Although, honestly, if I don’t keep busy, I will climb the walls waiting to hear from the places I’ve interviewed.
** Note that data such as family, nesting habits, etc., could have been better handled in a connected database table, and not repeated over and over again in this original data sheet, because Bombus impatiens is always a bumble bee, and is always a hive nester. But, well, #otherpeoplesdata.
*** An excellent guidebook for the reshape package is available here. It’s not for reshape2, but I found it an incredibly useful resource for learning the basic concepts of reshaping data.
**** And for those of you who can’t wait to get out of R, I might even show you how to export the data. Or maybe I won’t. I haven’t decided.