Cleaning up your act: simple ways to find errors in your raw data

It’s been an interesting week. We had a massive ice storm in the eastern Great Lakes area: trees and power lines were totaled, power was out everywhere, and I didn’t have electricity for four days. Things, basically, are a disordered mess, and everyone is working really hard to clean things up.

Which brings us to today’s topic: data cleaning and quality control.*

Even if you’re the most meticulous person in the world, if you have a lot of data, errors are probably going to creep in as you enter your data into your spreadsheet or database.** Often, they aren’t noticed until late in the game: after you’ve done your stats, after you’ve generated figures, after you’ve fussed with the figures to ensure the fonts are all right and everything is perfect. Basically, after days and days of work. I think you know where I’m going here. The thing with errors is, the earlier you find them, the easier it is to deal with them.

So how do you find the errors before you do a lot of work? There are a variety of approaches you can take, and the right one will depend on what type of software you feel most competent with. True to my theme, I’m going to assume you, the reader, are most comfortable with spreadsheets, with a willingness to play around with R and try other neat tools. For today, I’m going to focus on numerical data; we can get to cleaning categorical data later. I’m going to walk you through a few examples using an actual dataset that I used during my PhD, examining how captures of flying soybean aphids in suction traps related to different environmental conditions.***

These data are pretty much a classic example of the type of bug-counting datasets I use. I took a subset of the data produced by the North Central IPM Center’s suction trap network, using only captures of soybean aphid, and only 2005-2009 data, and for each trap and each sampling date, I had my trusty summer student extract weather station data and field scouting records. I calculated photoperiod for each site at each sampling date. The resulting data looked something like this:

[Figure: screenshot of the spreadsheet. Look at all those columns! Gosh! This is roughly what the dataset looked like when we had finished collecting all the components we needed, before I started analysis.]

The first step I took with these data was a very simple one: I sorted them, by each column, one at a time. And I looked at them. I primarily looked at the top and the bottom of each column as I sorted, and I looked for two things: impossible values and improbable values.

For each parameter in your dataset, you should think about what values or states that parameter can take. For instance, a column giving temperatures in Celsius would never contain letters, and going under -273 °C would violate the laws of physics.**** Counts of things must take the form of non-negative, whole numbers. Sorting helps these obvious issues jump out at you. You may also be able to spot any highly improbable values (say, temperatures exceeding 50 °C), which will give you a chance to go back to your records and correct things.
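If your data have already made it into R, the same sanity checks take only a line or two. Here’s a minimal sketch, assuming a data frame called traps with entirely hypothetical columns max_temp and aphid_count:

```r
# Impossible values: temperatures below absolute zero, or cells that were
# not numeric and became NA when the column was read in
which(traps$max_temp < -273 | is.na(traps$max_temp))

# Impossible values: counts that are negative or not whole numbers
which(traps$aphid_count < 0 | traps$aphid_count %% 1 != 0)

# Improbable values: anything over 50 C deserves a second look in the records
which(traps$max_temp > 50)

# Or just eyeball the extremes of every column at once, as sorting does
summary(traps)
```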

At this point, it’s not actually a bad idea to make a little table of your parameters: define the possible and probable values, and give a brief description of what each parameter is and how it was obtained. Not only will it help you assess what’s actually going on with your data, but you’ll also get a smug sense of self-satisfaction when you realize your metadata is half written.
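If you’d rather keep that table in a machine-readable form alongside the data, a rough sketch might look like this (the parameter names, descriptions, and ranges below are made up for illustration, not taken from the actual dataset):

```r
# A toy parameter-definition table; everything here is illustrative
param_defs <- data.frame(
  parameter   = c("aphid_count", "max_temp", "ordinal_date"),
  description = c("Suction trap capture (# of aphids)",
                  "Daily maximum temperature (C)",
                  "Day of year of sampling"),
  min_ok      = c(0, -40, 1),
  max_ok      = c(Inf, 50, 366),
  stringsAsFactors = FALSE
)

# Save it next to the raw data and your metadata is, indeed, half written
write.csv(param_defs, "parameter_definitions.csv", row.names = FALSE)
```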

NB: You can also fairly easily use database software to set rules for the form the data must take so that these errors do not creep in at the time of data entry.***** 

You can also use sorting methods to clean up your categorical variables******; that’s manageable if you only have a few, but I’ll get to some tools you can use to do this MUCH more efficiently in another post. But now, we’re going to get into a slightly (but only slightly) more complicated method of detecting improbable numerical data. This is especially good for large, multivariate datasets like the one I’ve shown you above.

Scatterplot matrices. They are the BEST. The code is simple. There is simply no reason NOT to do them.

A scatterplot matrix is, um, a matrix of scatterplots. So, they deliver what they advertise. Essentially, what they do is create an X-Y plot for each combination of numerical parameters in your dataset. Temperature versus ordinal date. Counts of organisms versus precipitation. And every combination thereafter.
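In R, the pairs() function builds one of these directly from a data frame. A minimal sketch, again assuming a hypothetical data frame called traps:

```r
# Scatterplot matrix of every pairwise combination of numeric columns;
# traps is the hypothetical data frame holding the dataset shown above
numeric_cols <- sapply(traps, is.numeric)
pairs(traps[, numeric_cols], pch = 20, cex = 0.5,
      main = "Pairwise plots of suction trap dataset")
```

One call like that, and you get a grid along the lines of the figure below.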

[Figure: Lots of plots!!! Scatterplot matrix to examine covariance structure between parameters used in models for captures of soybean aphid in suction traps in central North America, 2005-2009. Parameters are 1) Suction trap captures (# of aphids captured), 2) Degree day accumulation (base 10 °C, from Jul. 1), 3) Field infestation (categorical values 0-6), 4) Ordinal date, 5) Latitude (°N), 6) Longitude (°W), 7) Maximum temperature (°C), 8) Mean temperature (°C), 9) Minimum temperature (°C), 10) Photoperiod (h), and 11) Precipitation (mm). I’ve circled some obvious outliers in yellow.]

The reason these are so useful for detecting improbable data is that sometimes outliers are only really noticeable in context. A temperature of -15 °C in the Midwest in early March? Sure. In late August? Probably not. Scatterplot matrices let you see parameter values that don’t fall in line with the usual pattern and flag them for extra checking.

In the example I’ve provided above, I’ve circled two sets of outliers. In the upper circle (plot 4-7), you can see a handful of points deviating from the normal arcing pattern of maximum temperature versus ordinal date. I went back to the weather station records, and it turns out that this is real data: an extreme, fairly localized heat wave affecting a few sampling sites in one of our sampling years. The second outlier I’ve highlighted, in the plot of minimum temperature by ordinal date, shows one extremely low value. Going back to the weather station records associated with that data point, I discovered that someone had mistakenly input -14 instead of 14 as the minimum temperature for that day. Oops. An easy mistake to make, but something that could have affected my analysis and results.
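Once the matrix has pointed you at something odd, a line or two of subsetting will pull out the offending rows so you can trace them back to the original records. A sketch, with the same hypothetical column names and some arbitrary thresholds chosen purely for illustration:

```r
# Pull out the rows behind a suspicious point, e.g. an implausibly low
# minimum temperature late in the season (thresholds are illustrative)
suspects <- subset(traps, min_temp < -10 & ordinal_date > 200)
suspects

# After checking against the weather station records, correct the raw data
# (here, a sign error) and re-run the scatterplot matrix to confirm the fix
```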

This form of data cleaning will not catch all input errors, but it will certainly help keep major errors from cropping up down the road. Taking these steps right after you enter data has another benefit: you’ll have better familiarity with the data when you go to do statistical analyses. Knowing what your data look like, overall, will help you select appropriate statistical analyses. You will get grey hairs and wrinkles at a slower rate. It also makes your data more re-usable and sharable. Really, everyone wins!

*What? That was a totally logical transition.
**But you’re probably not the most meticulous person in the world. It’s okay, I’m not, either.
***The paper was just fully accepted at Agricultural and Forest Entomology. Can I get a w00t w00t?
**** It is inadvisable to violate the laws of physics. The penalties are harsh. Very, very harsh. If you could see me right now, you’d see the haunted look in my eyes and know how serious I am.
***** But in my experience, less than 5% of entomologists use database software. I’m working on it.
******Confession time: I relied on sorting and then careful, manual combing to clean the categorical variables in this dataset. It was horrible. I know better now and will never do it that way again.



9 Responses to Cleaning up your act: simple ways to find errors in your raw data

  1. Andrew says:

    great post! This is one area in which using R can reduce one’s workload. When analysis is scripted, you just re-run everything after discovering an error!

  2. I wish some of the people I work with would do some error checking before writing a manuscript that either ignores the blatant error, or tries to rationalise it.

    Using the argument gap=0 in the pairs function can make the scatter plot matrix more readable when you have many variables by reducing white space.

  3. Have you ever used Open Refine for data cleanup? It’s a great tool and works especially well for messy text data.

