If you follow me on twitter, you probably know I’m compiling a huge dataset. I’ve used a variety of tools to work on these data (most notably reshape and OpenRefine), and now the data is in its cleanest form- a list style table. These data are now in a CSV file containing over 3.2 million observations, and I couldn’t work with them in Excel if I wanted to.*,**
The reason these data are so long is because they are in list format, and we’ve got a heck of a lot of species, by a heck of a lot of sampling dates, by a heck of a lot of locations. I could considerably shorten the number of rows in the file if I put them into cross-tab format- that is, turned all the sites, or dates, or species into headers. So why am I being so difficult and insisting these data stay in this format? First, I’m at a place where these data are analysis ready and no longer require a spreadsheet, and secondly, this is the master sheet, and it is important to keep all observations together in one form for the purpose of archiving. List format is how to do that.
First, let me show you an example of list format versus cross-tab format, so you know what I’m talking about. The White Lab wrote an excellent paper on data management that described this very clearly.
Now, let’s discuss why list format is a better form to store your data in than cross-tab. Many applications commonly used by entomologists and ecologists actually want your data in a certain cross-tab form (Think PRIMER or EcoSim),*** so why not store your data in a way that could feed directly into that particular package? The answer my friends, is that other applications and end uses will require your data to be in list form, and it is 8.3462X1027**** easier to turn data from list form to cross-tab than vice-versa, particularly if you don’t code- you can actually do it very easily in Excel using Pivot tables.***** Thus, list form is simply the most appropriate way to store your data. Your master dataset should be in list form.
So even if the end use you envision for your data involves cross-tab format, I REALLY REALLY want you to go through and make a master, list form data. It’s probably easiest to do this first, while you are entering the data. Once it’s all cleaned and checked over for errors, save it, back it up, and leave that file alone. If you need to do manipulations for a certain application, make a copy of the file and do the manipulations there. DO NOT alter your master dataset. Your master dataset is the one you’ll most likely need to share if anyone requests your data, and it probably wouldn’t hurt to have your data backed up in an online public archive****** (more on this down the road).
*Fun fact: Excel stops at 1,048,576 rows. If you exceed this number of rows, you get to eat a cupcake. Why? I don’t make the rules*******, but I sure like following them.
**In fact, I think I’m getting beyond the capacity of most spreadsheet programs- I even crashed Gnumeric a couple of times with it on my Windows machine, but admittedly, I haven’t tried from my Ubuntu machine.
*** #Rstats people: yes, I know, you can do anything these packages do in R. Yes, I know R is better. Yes, I know, R actually massages your feet while it rarefies. These tools are still commonly used and it’s my job to help my colleagues use them.
*****#Rstats people: baby steps. People get VERY intimidated by coding. Don’t crucify the non-believers yet.
****** Unless your data is somehow inherently evil and could bring about the destruction of humankind. Then, please, lock that data away in a drawer. I have a kid and want there to be an earth for her to grow up in.
*******I made this rule.