You’ll all be pleased to know that I got my big, Big-Data pre-proposal to NSF sent off.** It’s been occupying about 90% of my brain these last few weeks, so it’s good to send it off, and get myself back in a relative groove.
In fact, today, let’s think about nothing.
Now, I suppose I could end it here, but really, that wouldn’t be my style. Let’s dissect this. I want us all to really think about nothing, the different kinds of nothing, and what that means, especially when it comes to data.
When it comes to bug counting, there are two kinds of nothing.
1) We looked, and found no bugs there.
2) We didn’t look there (or didn’t get a reliable look there).
So, case number 1 happens when you go to the field and put out a sticky card/Malaise trap/extremely tolerant undergrad assistant with an aspirator, and you catch no bugs (or no bugs that you’re interested in). This is referred to as a true zero.
Case number 2 happens in places where you didn’t look, like Antarctica or the moon or downtown Toronto***, or it was before the growing season started, or a bear/monster/competing lab set fire to your Malaise trap, or your undergrad went to lunch and forgot to count the rep. This type of data is referred to as null.
It is very, very important to keep track of which kind of nothing you have. Why? Well, I’m going to tell you. The main reason, of course, is that if you don’t keep track, in some reasonable way, of what is zero and what is null, it makes my life a lot harder.
Also, it can really screw with your stats. You don’t want your stats to be screwed, right?
One of the number one problems I come across in every new #otherpeoplesdata dataset I deal with is a universal one:
Do those empty cells mean data was left out or that the data doesn't exist? No way to know. #otherpeoplesdata
— Brian Rolek (@brols) January 30, 2014
Is your blank a zero? A null? Not entered yet? There have been COUNTLESS occasions where I have been given a datasheet where most (though not necessarily all) of the true zeroes are blank. And, conversely, in some cases, people will enter a zero to indicate that there is no data. The thing is, nulls and zeroes behave differently in spreadsheets and statistical programs, and failing to distinguish between them can give you radically different answers. I’ll give you an Excel example:
In the spreadsheet above, I’ve got three scenarios, where the user has entered three different things on the third day of the study: 0, blank and NA.
In scenario 1, the 0 under Day_3 says that you went to your site and counted ladybugs, but on that day you saw zero ladybugs. Thus, the average number of ladybugs you saw in a day is 15. In scenario 3, something kept you from counting ladybugs on day 3, so when you apply an average calculation to that set of four cells, Excel skips the NA cell when it’s computing the average, and (correctly) tells you that you averaged 20 ladybugs per day on days you sampled. But scenario 2 contains a blank, and as you can see, Excel gives you the same average as when you had an NA in the cell. SO: if you’re indicating zeroes with blanks, many calculations will artificially inflate your resulting means (and conversely, if you indicate nulls with zeroes, you will artificially deflate the means). Essentially, what you’re doing is introducing a whole ugly new source of error and inaccuracy to your data.
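If you’d like to see the same effect outside of a spreadsheet, here is a minimal Python sketch. The counts are made up for illustration; I’ve simply picked four days that reproduce the averages of 15 and 20 described above. `mean_skipping_nulls` is a hypothetical helper name, and `None` stands in for NA.

```python
from statistics import mean

def mean_skipping_nulls(counts):
    """Average the counts, leaving out nulls (None) the way a
    spreadsheet average leaves out blank or NA cells."""
    observed = [c for c in counts if c is not None]
    return mean(observed)

# Hypothetical ladybug counts for days 1-4 (not the original data).
true_zero = [20, 20, 0, 20]     # day 3: we looked and found nothing
null_day = [20, 20, None, 20]   # day 3: we didn't get a count

print(mean_skipping_nulls(true_zero))  # averages to 15
print(mean_skipping_nulls(null_day))   # averages to 20
```

The only thing that differs between the two lists is which kind of nothing sits in the day 3 slot, and that alone moves the mean from 15 to 20.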
So, what should you do to resolve this? First and foremost, if you found zero of anything, enter that in your datasheet. Do not leave it blank. Even if most of your values are zeroes. Fill them in. Do it. Tell your summer help. Tell your great-uncle next Thanksgiving at the dinner table.**** Tell everyone. A zero is a zero is a zero, and it’s never a blank.
Next, decide on a good null indicator, and make a note of it in your metadata. NA is the null indicator I prefer to use because it’s what R uses, and I mostly use R. Ways to indicate null vary between stats programs, so some people will suggest leaving nulls blank. I personally prefer not to do this, because I like to have every cell accounted for in some way, and a blank suggests to me the possibility that someone just forgot to enter the data. However, as long as it’s consistent and clear, there are a few reasonable ways to indicate nulls. Pick one and use it throughout your data. That’s right: the same one throughout.*****
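Here is one way that rule might look when you read a datasheet back in, as a rough Python sketch rather than a recommended tool. It assumes NA is the single null indicator recorded in your metadata; `NULL_INDICATOR` and `parse_count` are hypothetical names, and the row of cells is invented.

```python
NULL_INDICATOR = "NA"  # assumption: the one null marker noted in your metadata

def parse_count(cell):
    """Turn one datasheet cell into an int count, or None for a documented null."""
    text = cell.strip()
    if text == NULL_INDICATOR:
        return None  # we didn't look (or didn't get a reliable look)
    if text == "":
        # A blank is ambiguous: zero, null, or not entered yet?
        raise ValueError("blank cell: enter 0 or " + NULL_INDICATOR)
    return int(text)  # true zeros come through as 0

# Hypothetical row of cells as they might appear in a CSV file.
row = ["20", "20", "NA", "0"]
counts = [parse_count(cell) for cell in row]
print(counts)  # [20, 20, None, 0]
```

Refusing blanks outright is a design choice, not the only option. It matches the preference above for having every cell accounted for, and it catches forgotten entries before they quietly turn into nulls.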
Finally, be sure you don’t use nulls and zeroes interchangeably. Most, but not all,****** ecological datasets will have a LOT of one, the other, or both in them.
*Alternate titles: “Much ado about nothing”, “Getting worked up over nothing”, “Zero heroes”, “Embracing nothingness”, “Divided by zero”, “If The Earl of Grantham was better with managing null data, would the Abbey’s financial woes finally go away?”, “This is not a Seinfeld reference. I prefer My Little Pony References.”
**If you follow me on Twitter, you know that I was screaming it from the rooftops. Also, I got a bit of a big head when the VP of research agreed to give me, in all my lowly postdocness, PI status at our not-so-little university. Color me psyched right out.
***urban ecologists excepted. I actually know a whole bunch of people that count bugs in downtown Toronto.
****Protip: discussing data management in graphic detail at dinners with the extended family very quickly stops all conversation about when, exactly, you plan to get a real job. And most other awkward conversations.
*****The most epic use of an inappropriate null indicator I’ve ever seen was when N-O-S-A-M-P-L-E was spelled, one letter per cell, down a column. I’d long since re-sorted the data, so solving the mystery of what all these letters meant basically involved solving a jumble.
******Presence-only data and/or abundance-given-presence data (i.e., data that only report where a given organism is found) are commonly produced by natural history collections. This type of data is useful, but comes with a set of challenges. A good paper on all that by Pearce and Boyce is worth a read if these are the types of data you’re working with.