If you love your data, set it free.

I’m emerging from a deep-focus period for a brief comment. I tried not to. I didn’t want to engage. But dangit, it’s getting too hot. And it’s hitting close to home.

Sharing data. Specifically, being required to share data.

The reason that this hits so close to home for me is pretty apparent. I’ve touched on this before. I like it when people share data. I think it’s very important to share data. Heck, I think it’s RIGHT to share data.* I want to help people make their data sharable. It’s my MO. My raison d’être. And so I genuinely want to help people realize this goal.

But part of this is selfish. See, I like data. I like using other people’s data to do my science. So I’ve come to realize that in addition to my educational mandate, I’m also…(cue dramatic music)…the person who is most likely to scoop you.** My friends, I am a Data Vampire. ***

I don’t think of myself as particularly threatening. I’m a nerdy person who likes to use big datasets, compiled from multiple sources, to see patterns in economically important insect populations. I use results of my work to try and help farmers grow crops more efficiently, and with fewer environmental impacts. I like giving credit where credit is due, and I’m first to acknowledge the roles of my many collaborators. My collaborators and I tend to function complementarily- they produce data, I analyze and write up. Really, this niche has worked out pretty well for me. And I like to think I’m one of the good guys.

The main two arguments I see against data sharing are:

1. “Sharing is a pain.”

2. “I need to retain control.”

I’ll do what I can to respond to each of these with very specific examples.

1. Yep, just like anything we take on, the learning curve with data sharing can be daunting. I’m not all the way there yet myself. I have a dirty secret- pretty much all the data I produced prior to 2008 or so is in NO FORM for public consumption. I was young and stupid and didn’t know any better. It’s only been since I started working with #otherpeoplesdata that I really clued in- we need to have community standards for data formatting and sharing- because there are entire months of my life that I’m not getting back where I have been cleaning and reformatting data. So- don’t put your back up. This is really just a practice that I want you to try and implement, moving forward.

But it’s important. Published science that does not include its data limits how that work can be reused. In addition to my own data curation, I’ve also spent time trying, desperately, to pull relevant parameters out of other people’s published works- but because the data is in a figure or summary table, I’ve had to eyeball it or just use means. Not ideal.

For instance, in a paper I recently co-authored, I had to develop an estimate for the temperature-dependent intrinsic rate of increase of soybean aphid. There are two papers published on this subject, but in these papers, the rate of increase in response to temperature was presented at a series of constant temperatures as a mean and standard error. Had I access to the raw data, I could have done a weighted regression, played with the error structure- but as it was, I had to make do with the means and just regress on them as they were. I’m sure the authors hadn’t intended this end use- how could they anticipate every possible application of their data in the future? But this was a VERY useful application of their original data, and it would have been useful to have access to it.
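To make the difference concrete, here is a rough sketch in Python of the two situations. Everything in it is an assumption for illustration: the replicate values are invented (they are not from the papers I mentioned), the straight-line relationship is a simplification, and the function names are made up. It is not the analysis from my paper. It just shows how a fit on bare means can differ from one that uses the replicate-level spread to weight each temperature.

```python
import numpy as np

# Hypothetical replicate growth-rate measurements at each constant temperature (deg C).
# All values are invented for illustration only.
raw = {
    15.0: [0.10, 0.12, 0.11, 0.13],
    20.0: [0.20, 0.22, 0.19, 0.23],
    25.0: [0.30, 0.25, 0.36, 0.29],
    30.0: [0.28, 0.40, 0.18, 0.34],
}


def summarize(raw):
    """Collapse replicates into per-temperature means and standard errors."""
    temps = np.array(sorted(raw))
    means = np.array([np.mean(raw[t]) for t in temps])
    ses = np.array(
        [np.std(raw[t], ddof=1) / np.sqrt(len(raw[t])) for t in temps]
    )
    return temps, means, ses


def fit_unweighted(temps, means):
    """Ordinary least squares on the means, treating every temperature equally."""
    slope, intercept = np.polyfit(temps, means, 1)
    return slope, intercept


def fit_weighted(temps, means, ses):
    """Inverse-variance weighted least squares on the means.

    np.polyfit applies weights to the unsquared residuals, so 1/SE here
    corresponds to weighting squared residuals by 1/SE**2.
    """
    slope, intercept = np.polyfit(temps, means, 1, w=1.0 / ses)
    return slope, intercept


temps, means, ses = summarize(raw)
slope_u, intercept_u = fit_unweighted(temps, means)
slope_w, intercept_w = fit_weighted(temps, means, ses)

print(f"unweighted slope: {slope_u:.4f}")
print(f"weighted slope:   {slope_w:.4f}")
```

In this made-up example the noisy high-temperature measurements drag the unweighted line down, while the weighted fit leans on the temperatures that were measured more precisely. With the replicates in hand you could go further and model the error structure directly, which is exactly the kind of playing around that a summary table makes awkward.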

2. On retaining control- in the example I described above, I used someone else’s data, from their paper, without their express permission- I simply extracted it from the published literature and provided appropriate citations in my paper. I think this is fundamental to how science works- the earlier scientists studied the phenomena, and published on it, and my co-authors and I used insights from that work as part of a next step. This is, in my experience, how people will use other people’s data that accompanies a published work. This is the type of data sharing that PLoS is aiming for, if I’m understanding correctly.

So, you may say, sharing data when expectations about how it will be used are clear and the paper is published is one thing. Sharing data in a completely open way is another.

But I would take it even further. As I have mentioned, in my current position, my primary role is to analyze data that has been produced by long term projects. These data are available online, and have been for years. Have we been scooped? There are no documented cases of this in our group, and I have yet to find any good examples of anyone in my field getting scooped because they shared their data. If anything, these beautifully curated, scientifically interesting data are under-analyzed- we’ve only just scratched the surface with some of them.

Now, if you want to use someone’s data in a substantial way, it’s just good form to talk with them. It’s mutually beneficial, even. Obviously, not everyone is as much of a free-love-and-data hippie (/vampire) as I am, and it doesn’t help your cause to alienate people. But I would argue that community norms prevent full-on scooping from occurring. Frankly, datasets, even the best curated datasets, have a get-to-know-you period, and the data-creator has an advantage over any potential scooper. Use this advantage to help people see the value of these data. Heck, seek out the data vampires. If you’re holding onto data that you love but haven’t done anything with, it’s time to set it free.

*Science belongs to society. NO. It does. Stop arguing with me. Call me an idealist and dismiss me as naive instead.
**or at least, SEEN as the person who is most likely to scoop you. Now cue that sound where someone scratches a vinyl record to indicate something startling just happened in a Rob Schneider movie.
*** And yes, I DO sparkle in the sunshine, since you asked.