Sharing: it can be harder than you think.

I have a three-year-old daughter, so you don’t have to tell me how hard it is for some people to share. It’s hard, when you perceive something as yours, when you love it and MOMMY! SHE TOOK IT AND IT’S MINE!!! MINE MINE!!!

But what about data? Who’s the boss of it?

It’s a matter of personal policy that, whenever possible*, the data I’m currently working on is publicly available. In fact- here are the two datasets I’ve been spending most of my time on for the past year or so:
Aphids
Ladybeetles

And there they are. Yep.

But it’s not that simple, as it turns out. If it was, everyone would share everything, right? ** So, why aren’t you sharing your data?

This is a real question. And I don’t know the answer to it.
In fact, earlier, I tweeted:

I gotta ask the broader community. Tweeps, have you ever been *hurt* by sharing data? Examples? @ethanwhite @RobLanfear @davidjayharris

— Christie Bahlai (@cbahlai) January 21, 2014

And I have heard nary a peep since then. Of course, I could be polling the converted, so I wanted to put this out there again.

The best impediments I can come up with from the side of a data-creator,*** real or perceived:

1) getting scooped. I.e., jerks who take your data, pass it off as something created by them, get a paper in Nature, and laugh at you from atop their piles of scientific accolades. I’m not aware of any specific examples of this in my field, but we’re mostly laid back ecology types. There’s lots of bug counting to go around. But this could be my good-natured Canadian naiveté.**** Does this happen?

2) your data is inappropriate for sharing. Say it’s proprietary, sensitive, personal, or even dangerous. What are examples where this is the case, and how do you publish findings, then?

3) Your data is too messed up for anyone to be able to interpret but you. Well, that’s a solvable problem, and I want to help you fix it. See every other post I’ve written. Heck, call me. Don’t just leave it.

I really want to know- what are your personal impediments to sharing data? I think the key to creating truly open science is to understand what scientists perceive as roadblocks, and to work WITH them to remove the impediments. We can accomplish a lot more working together.

*whenever possible means whenever I’m working with data that I’m primarily responsible for, and my colleagues are okay with me sharing. Usually, they’re okay with it, because usually, they were sharing the data with me in the first place, but when I’m, say, consulting on the analysis for a grad student’s project, I treat those data as if they were confidential, and provide the student with guidance to encourage them to share. In case they’re hit by a bus, their legacy lives on. Nothing convinces a grad student more than a looming bus and the possibility of a legacy.
**I’m not going to go into detail about licensing in this post, but this IS a big issue in data sharing. In fact, it was a discussion with @davidjayharris, @ethanwhite and others that inspired this very post. For more on that, check this out. There are lots of other resources on the web, but that’s a good place to start.
***data-creators, legacy….am I buttering up the grad students enough?
****We’re all in this together, eh?

About cbahlai

Hi! I'm Christie and I'm a computational ecologist and professor. I am an #otherpeoplesdata wrangler, stats enthusiast, and, of course, a bug counter. I cohabitate with five other vertebrates: one spouse, one spirited grade schooler, one energetic preschooler and two cats.


10 Responses to Sharing: it can be harder than you think.

Kristin Briney says:

January 21, 2014 at 2:37 am

Christopher Gutteridge and Alexander Dutton compiled a great list of why researchers don’t want to share their data, with possible responses to each concern. The document contains lots of great things to think about.

https://docs.google.com/document/d/1nDtHpnIDTY_G32EMJniXaOGBufjHCCk4VC9WGOf7jK4/edit

- cbahlai says:
  
  January 21, 2014 at 2:44 am
  
  That’s really useful! Thanks!
  
ibartomeus says:

January 21, 2014 at 6:47 am

A 4) option is “if people have to ask me for my data in order to use it, I can force them to make me a co-author on their paper solely on the basis of being the data provider”. Usually it is not stated in this form, but it is a real thing.

A variant of 1) is “if other people can come up with another way of using my data, I may come up with this idea too”. (but the point is that you may not).

Finally, reason 3) is valid if there is no reward (or you see no reward) for sharing the data, because just writing up the right metadata and uploading it somewhere takes some time, and we are all very busy.

- cbahlai says:
  
  January 21, 2014 at 2:52 pm
  
  Hunh- #4 as a way to retain control (or perceived control). I can see that.
  
  Re: “if other people can come up with another way of using my data, I may come up with this idea too”- I see this as sort of a variant of retaining control, too. It’s probably my idealistic naiveté talking again, but I see new ideas coming at my work as opportunities to learn new things, work with new people, etc, etc. I know there are different ways of looking at it, but isn’t it, for the most part, looked upon favorably when you’ve published with a diverse group of collaborators? I see sharing as an opportunity to facilitate these interactions.
  
  On lack of incentive- I absolutely agree. If you can’t see a good payoff for sharing your data, why spread yourself thinner? I have been known to put my back up when approached with ideas about how I could do *more* work at the wrong time. That’s why I advocate a baby-steps approach to learning data management, so people don’t get overwhelmed, give up, and cut themselves off from the OS community.
  
Timothée Poisot says:

January 21, 2014 at 2:50 pm

Great post!

I would not worry too much about being scooped. It can happen (probably already did), but by getting public data and knowingly scooping someone, you will get yourself a bad reputation in the community. I don’t think one more paper is worth destroying your reputation.

- cbahlai says:
  
  January 21, 2014 at 3:55 pm
  
  See, that’s sort of my take, too. There are social/professional consequences for not conforming to community standards. Does that keep scooping enough in check to make sharing worth the risk to the risk-averse? I think the balance is toward benefit, but there may also be reasonable steps that we can take to reassure the hesitant.
  
Timothée Poisot says:

January 21, 2014 at 4:01 pm

Another comment I’ve heard a lot is “I’m an empiricist, and I worked hard for my data, so I don’t want lazy data crunchers or theoreticians using them and publishing papers on my back” – which I think is idiotic, but still probably the most common objection to data sharing I’ve heard. And I’m not sure how to address it. Ideas?

- cbahlai says:
  
  January 21, 2014 at 4:11 pm
  
  I can see that- a lot of data is hard-won, and it’s hard for people to let go of it because of that. I think it’s a very important matter of practice for data crunchers, theoreticians and modellers to acknowledge the role of the empiricists in their work- at the very least, through proper data citation, but also by considering co-authorships with data-producers where appropriate. I know people are working to change this, too, but data citations don’t really count for much when it comes to success metrics for scientists- we need to all work on the culture around that.
  
  My most interesting work, in my opinion, has come from projects where I partnered with data generators. Because our skills are complementary, we can do things that neither of us would be able to do alone.
  
astoltzfus says:

January 24, 2014 at 9:21 am

Nice post. I’ve collected information on this semi-systematically for a paper on data sharing in evolutionary biology (see http://www.biomedcentral.com/1756-0500/5/574). I didn’t hear any stories about actual harm. The most common reasons for not sharing are that (1) the cost of getting the data in a shareable form (which is often high due to lack of appropriate standards and technology) is perceived to be high relative to the expected tangible benefits (authorship, citations, etc) of sharing and (2) the originator wants to maintain control of the data, either to continue milking it for credit, or to avoid misinterpretation.

Personally I think the desire to control scientific data in order to prevent its misuse is complete bs, possibly borderline evil. I’ve even heard a phylogeny software developer say that if it was too easy to run analysis programs then every fool would be doing it, and that would lower the professional standards. That kind of thinking makes my head spin, but the reality is that actual people with legit PhDs sometimes think like that.
