This post is about nothing*

You’ll all be pleased to know that I got my big, Big-Data pre-proposal to NSF sent off.** It’s been occupying about 90% of my brain these last few weeks, so it’s good to send it off, and get myself back in a relative groove.

In fact, today, let’s think about nothing.

Now, I suppose I could end it here, but really, that wouldn’t be my style. Let’s dissect this. I want us all to really think about nothing, the different kinds of nothing, and what that means- especially when it comes to data.

When it comes to bug counting, there are two kinds of nothing.

1) We looked, and found no bugs there.
2) We didn’t look there (or didn’t get a reliable look there).

So, case number 1 happens when you go to the field and put out a sticky card/malaise trap/extremely tolerant undergrad assistant with an aspirator, and you catch no bugs (or no bugs that you’re interested in). This is referred to as a true zero.

Case number 2 happens in places where you didn’t look like Antarctica or the moon or downtown Toronto***, or it was before the growing season started, or a bear/monster/competing lab set fire to your malaise trap, or your undergrad went to lunch and forgot to count the rep. This type of data is referred to as null.

It is very, very important to keep track of which kind of nothing you have. Why? Well, I’m going to tell you. The main reason, of course, is, if you don’t keep track, in some reasonable way, what is zero, and what is null, it makes my life a lot harder.

Also, it can really screw with your stats. You don’t want your stats to be screwed, right?

One of the number one problems I come across in every new #otherpeoplesdata dataset I deal with is a universal one:

Is your blank a zero? A null? Not entered yet? There have been COUNTLESS occasions where I have been given a datasheet where most (though not necessarily all) of the true zeroes are blank. And, conversely, in some cases, people will enter a zero to indicate that there is no data. The thing is, nulls and zeros behave differently in spreadsheets and statistical programs, and failing to distinguish between them can give you radically different answers. I’ll give you an excel example:

Yes, I know, it's crosstab. Shhh! I'm trying to make a point.

A hypothetical example, in three scenarios. as you can see, in all cases, something terrible happened on Day 3…but WHAT?!

In the spreadsheet above, I’ve got three scenarios, where the user has entered three different things on the third day of the study- 0, blank and NA.

In scenario 1, the 0 under Day_3 suggests that you went to your site, and you counted ladybugs. But on that day, you saw zero ladybugs. Thus, the average number of ladybugs you saw in a day is 15. In scenario 3, something kept you from counting ladybugs on day 3, and so when you apply an average calculation to that set of four cells, excel skips the NA cell when it’s computing the average- and (correctly) tells you that you averaged 20 ladybugs per day on days you sampled. But scenario 2 contains a blank- and as you can see, excel gives you the same average as when you had an NA in the cell. SO- if you’re indicating zeroes with blanks, what will happen with many calculations is you will artificially inflate your resulting means (and conversely, if you indicate nulls with zeroes, you will artificially deflate the means).Essentially, what you’re doing is introducing a whole ugly new source of error and inaccuracy to your data.

So, what should you do to resolve this? First and foremost- if you found zero of anything- enter that in your datasheet. Do not leave it blank. Even if most of your values are zeros. Fill them in. Do it. Tell your summer help. Tell your great Uncle next Thanksgiving at the dinner table.**** Tell everyone. A zero is a zero is a zero, and it’s never a blank.

Next, decide on a good null indicator, and make a note of it in your metadata. NA is the null indicator I prefer to use because it’s what R uses, and I mostly use R. Ways to indicate null vary between stats programs, so some people will suggest leaving nulls blank. I personally prefer not to do this, because I like to have every cell accounted for in some way, and a blank suggests to me the possibility that someone just forgot to enter the data. However, as long as it’s consistent and clear, there are a few reasonable ways to indicate nulls. Use it throughout your data- that’s right- the same one throughout.*****

Finally, be sure you don’t use nulls and zeroes interchangeably. Most, but not all****** ecological datasets will have a LOT of one, the other, or both in them.

*Alternate titles: “Much ado about nothing”, “Getting worked up over nothing”, ‘Zero heroes”, “Embracing nothingness”, “Divided by zero”, “If The Earl of Grantham was better with managing null data, would the Abbey’s financial woes finally go away?”, “This is not a Seinfeld reference. I prefer My Little Pony References.”
**if you follow me on twitter, you know that I was screaming it from the rooftops. Also, I got a bit of a big head when the VP of research agreed to give me, in all my lowly postdocness, PI status at our not-so-little university. Color me psyched right out.
***urban ecologists excepted. I actually know a whole bunch of people that count bugs in downtown Toronto.
****Protip: discussing data management in graphic detail at dinners with the extended family very quickly stops all conversation about when, exactly, you plan to get a real job. And most other awkward conversations.
*****The most epic use of an inappropriate null indicator I’ve ever seen was when N-O-S-A-M-P-L-E was spelled, one letter per cell, down a column. I’d long since re-sorted the data, so solving the mystery of what all these letters meant basically involved solving a jumble.
******Presence only data, and/or abundance given presence- i.e.- data that only report where a given organism is found, is commonly produced by natural history collections. This type of data is useful, but comes with a set of challenges. A good paper on all that by Pearce and Boyce is worth a read if these are the types of data you’re working with.

Advertisements

About cbahlai

Hi! I'm Christie and I'm an applied ecologist and postdoc in the midwestern US. I am an #otherpeoplesdata wrangler, stats enthusiast, and, of course, a bug counter. I cohabitate with five other vertebrates: one spouse, one preschooler, one teeny baby and two cats.
This entry was posted in Uncategorized and tagged , , , , , , , , , . Bookmark the permalink.

2 Responses to This post is about nothing*

  1. Another great post on another important topic. One other thing I’d specifically recommend against is the use of all the other numerical null indicators in addition to zero. It doesn’t take long playing around with ecological data to run into the famous 999 and -999 null indicators. They have equivalent problems to using zeros in that they can then get figured into calculations without realizing it and they also tend to get entered inconsistently (I typically see at least 3 members of 999, 9999, -999, -9999, etc. anytime a dataset designates a 999 based null).

    I’m one of those folks who has gravitated towards documented blanks, perhaps in part because I work across a lot of different tools and none of the standard keywords are meaningfully portable. ‘NA’ is close, but if I want to import data with NAs into most SQL databases, I first have to replace those separately, and extra data cleaning steps lead to more opportunities for mistakes. There are similar opportunities for errors if you’re reading the data in yourself instead of having a library do it. My other concern with ‘NA’ is that it can be an abbreviation for a lot of things in ecology related data. I’ve run into issues with ‘North America’ and ‘Neotoma Albigula’ being abbreviate by ‘NA’ in datasets and depending on the import system you’re using these cells or even entire lines can silently be replaced in confusing ways that most people will never catch.

    We did a fair bit of looking around when writing:
    http://library.queensu.ca/ojs/index.php/IEE/article/view/4608

    and never did find anything that we felt represented a meaningful cross-tool standard, so I think that NA and NULL are both reasonable as well. It’s just that they add the extra layer of detective work in a different place than blanks.

    • cbahlai says:

      Oy! Yes, I forgot to mention about using numerical values. How did the 999 thing even start, anyway? It just seems like a bad idea from the get-go!

      And yeah, NA is not without its problems. One dataset I cleaned used n.a. to mean “no aphids” which threw me for a loop while In was trying to figure things out.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s