Formatting your data tables for storage and re-use

If you follow me on twitter, you probably know I’m compiling a huge dataset. I’ve used a variety of tools to work on these data (most notably reshape and OpenRefine), and now the data is in its cleanest form- a list style table. These data are now in a CSV file containing over 3.2 million observations, and I couldn’t work with them in Excel if I wanted to.*,**

The reason these data are so long is because they are in list format, and we’ve got a heck of a lot of species, by a heck of a lot of sampling dates, by a heck of a lot of locations. I could considerably shorten the number of rows in the file if I put them into cross-tab format- that is, turned all the sites, or dates, or species into headers. So why am I being so difficult and insisting these data stay in this format? First, I’m at a place where these data are analysis ready and no longer require a spreadsheet, and secondly, this is the master sheet, and it is important to keep all observations together in one form for the purpose of archiving. List format is how to do that.

First, let me show you an example of list format versus cross-tab format, so you know what I’m talking about. The White Lab wrote an excellent paper on data management that described this very clearly.

Figure 1 from White et al, linked above. Panel b shows two examples of cross-tab data, and how that data would be displayed in list form. Panel a is also very important, but we'll get to that in a later post.

Figure 1 from White et al, linked above. Panel b shows two examples of cross-tab data, and how that data would be displayed in list form. Panel a is also very important, but we’ll get to that in a later post.

Another good example of how to put your data in list form is discussed in Luis Apiolaza’s blog post on preparing data for analysis in Excel.

Now, let’s discuss why list format is a better form to store your data in than cross-tab. Many applications commonly used by entomologists and ecologists actually want your data in a certain cross-tab form (Think PRIMER or EcoSim),*** so why not store your data in a way that could feed directly into that particular package? The answer my friends, is that other applications and end uses will require your data to be in list form, and it is 8.3462X1027**** easier to turn data from list form to cross-tab than vice-versa, particularly if you don’t code- you can actually do it very easily in Excel using Pivot tables.***** Thus, list form is simply the most appropriate way to store your data. Your master dataset should be in list form.

So even if the end use you envision for your data involves cross-tab format, I REALLY REALLY want you to go through and make a master, list form data. It’s probably easiest to do this first, while you are entering the data. Once it’s all cleaned and checked over for errors, save it, back it up, and leave that file alone. If you need to do manipulations for a certain application, make a copy of the file and do the manipulations there. DO NOT alter your master dataset. Your master dataset is the one you’ll most likely need to share if anyone requests your data, and it probably wouldn’t hurt to have your data backed up in an online public archive****** (more on this down the road).

*Fun fact: Excel stops at 1,048,576 rows. If you exceed this number of rows, you get to eat a cupcake. Why? I don’t make the rules*******, but I sure like following them.
**In fact, I think I’m getting beyond the capacity of most spreadsheet programs- I even crashed Gnumeric a couple of times with it on my Windows machine, but admittedly, I haven’t tried from my Ubuntu machine.
*** #Rstats people: yes, I know, you can do anything these packages do in R. Yes, I know R is better. Yes, I know, R actually massages your feet while it rarefies. These tools are still commonly used and it’s my job to help my colleagues use them.
****Lowball estimate.
*****#Rstats people: baby steps. People get VERY intimidated by coding. Don’t crucify the non-believers yet.
****** Unless your data is somehow inherently evil and could bring about the destruction of humankind. Then, please, lock that data away in a drawer. I have a kid and want there to be an earth for her to grow up in.
*******I made this rule.


About cbahlai

Hi! I'm Christie and I'm an applied quantitative ecologist and new professor. I am an #otherpeoplesdata wrangler, stats enthusiast, and, of course, a bug counter. I cohabitate with five other vertebrates: one spouse, one first grader, one preschooler and two cats.
This entry was posted in Uncategorized and tagged , , , , , , , , , . Bookmark the permalink.

5 Responses to Formatting your data tables for storage and re-use

  1. Careful when making copies of the master file that you don’t end up with a bajillion versions and no record of what you did to each, though!

  2. Pingback: Teaching students how to data- some reflections | Practical Data Management for Bug Counters

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s