Almost every entomologist I know uses excel for data storage. I’m not going to turn this into a rant about why that’s wrong, in fact, I’m going to confess something that may seriously impair my street cred* as a data manager. I use excel, more than any one single tool, when doing basic data entry and manipulations. It’s tactile. It’s efficient. It has a lot of useful features. And here’s the biggest reason: almost all of my colleagues use it. Using the same basic tools as they do means we can efficiently work together, I can help my co-workers through problems without switching gears in my brain, and we don’t have annoying compatibility** issues cropping up here and there.***
But fellow bug counters- the data managers of the world cringe when you send them a file called “Site 1 data final 2013 .xls”
There are several reasons for this, but mostly, it’s because you’ve treated Excel like it’s a lab notebook or a physical datasheet. I think this is the main reason that data managers will often spit after uttering the word “Excel”- as a tool, it simply makes it too easy to treat it this way.
I’m just going to say it. Adding bold headers, borders, colors… this is not data management. I have worked with many people over the years that have sent me datasheets that sure look pretty, but are formatted in a way that require me to do hours, if not days of work before I can do so much as an ANOVA with it. Making your data look pretty in excel is like writing notes in color coded glitter pen****- pretty, but not adding substance- and sometimes actually harmful- but I’ll get to that in another post. But even more potentially damaging- people also often use tabs in excel as if they were pages in a notebook- another thing that makes data very difficult to compile and work with. Worse yet, they treat one page of a spreadsheet as if it were multiple datasheets laid out on a table.
In the interest of keeping this digestible, I’m going to save a lot of the details for some subsequent posts, so I’m going to focus on this latter issue.
Excel is powerful because it allows us to connect things that relate to each other in a machine-readable way. Failing to use this functionality is essentially the same as scanning your datasheets and saving them as jpegs- digitally backed up, but not really doing anything for anybody. When you don’t set up your spreadsheet which allows the computer to see how things are connected, you’re either creating a lot of work for you or for someone else, or dooming your data to obscurity.
There are two simple rules you should keep in mind when entering your data into any spreadsheet:
- Each data cell is an observation that must have all the relevant information connected to it for it to stand on its own.
- You must make it clear to the computer how the data cells relate to the relevant information and each other.
So, you can see how these two points apply directly to how you set up your spreadsheets. When you create extra tabs, you fail to allow the computer to see connections in the data that are there. When you create multiple tables within one spreadsheet, you’re drawing false associations between things for the computer, which sees each row as an observation.
Point 1 seems kind of obvious, doesn’t it? I’m lucky, in that my colleagues are very good at writing down all the information relevant to an observation. I’ve very rarely had trouble eventually tracking down the information I need to make data usable- often by reading the spreadsheet as if it were a lab notebook. The problem is, when datasets get large, this becomes kind of a nightmare, where I dream the world is a white screen covered in light grey lines and…. yeah.
Point 2 requires a bit more forethought, but it’s actually as simple as the first. The next time you’re entering data, and you go to create another tab or table, I want you to ask yourself “Self, could I avoid adding this tab by adding another column to my original spreadsheet?”
Say, for instance, you were counting bugs at two sites over the course of the summer. You might initially set up your spreadsheet with the headers:
Site Species Abundance
So on sampling date #1, you enter your data, and, wow, it’s a pretty long list of species you captured, you have to scroll down on the spreadsheet to get to the end. So on sampling date # 2, you get the impulse to just create a new tab, call it date_2 and be done with it.
Don’t do it.
Pssht, you say. Whatever, if I really need it together I’ll just paste it together later. That way I only have to type date_2 once.
Add another column. Call it date. Put the date that the samples were taken in for each observation.
There are three reasons for this. One, pretty much no statistical analysis you’re going to want to do at the end of this experiment will only require the data from one sampling date. Two, pasting together data after the fact is a bigger pain in the butt***** than you’re anticipating. You’re probably going to have a lot of sampling dates. Three: it is very easy for inconsistencies to creep in when you’re using multiple sheets. This is a pain in the butt on a new level.
If you’ve been treating excel like a lab notebook, don’t be embarrassed. Data management is taught by almost no one, and almost all of us who do know anything about it have learned through trial and error. We’ll get through this, together. You might even learn to like it.
*as a postdoc who is actively on the faculty job market, you know street cred is very important. VERY VERY IMPORTANT.
** Heh, my MS Surface is autocorrecting compatibility to comparability. I can hear the guffaws from the Linux users from here.
*** I can’t afford to be labeled ‘difficult’ at this stage in my career, see first footnote above.
****And don’t get me wrong, I love glitter pens.
*****My surface is also autocorrecting ‘butt’ to ‘but’. Oh, Surface, you’re such a prude.