How do I find stuff? An undergraduate’s journey through an online data archive

This post is written by Tasia North, an undergraduate student who’s working with us on the Bad Breakup project. As part of our research plan on this project, we’re identifying barriers to data reuse from publicly shared, NSF-produced data sources. Tasia is working on a project which is examining patterns within tri-trophic interactions in long term data- basically asking the question- do significant trends move between trophic levels, and if they do, how? But in order to do it, they were first tasked with finding a few representative sets of data. I wanted them to have an authentic experience- just that there were data like this, here’s a database of datasets, now see how you can find information to support this investigation- and write down what you find. This is Tasia’s first blog post with their reflections on the experience!



Hello Blogosphere,

I’m the new undergrad working here at the Bahlai lab. If you’ve been following along with the blog you’re aware of the bad break up project that’s been going on. This project looks at a long term data set, and breaks it up into shorter clumps to look at the trends. This will allow us to quantify how often we are wrong when we base conclusions off of three or four year studies.

We are digging through approximately 58,000 datasets from the US-LTER. My task is to continue the work on tritrophic interactions that Julia had started. She had created a list of sites that are likely to have the data needed, and the organisms I can look for. I needed to take that list, sort through the available LTER data to find each data set, determine whether it covered at least 12 years, and check that the data were usable and accessible.

Easy enough, right?

So a few things about me might be useful for context here. As stated, I am an undergraduate student studying Ecology and Conservation Biology. I’ve essentially no experience with large scale data management, and no experience using the LTER website or getting data from this site. In fact I had to google to find the LTER website since I have never used it before and thought everyone was saying LTR. In other words I am a newb at this. However I am armed with four and a half years of college experience (super seniors represent!), and I’m a millennial with the standard ‘navigating internet and sorting through stuff’ skills that are common to my generation. Someone with my education level, computer skills, and the reasonable level of guidance that I have should be able to navigate this site and find the information that I need. Here’s a step by step walkthrough of how successful I was at navigating these sites, what I found, and also some memes to express the feelings that arose during this experience.

The first thing I did was google the LTER website, which takes me to the data portal. Now, Christie wrote in the last blog post that this portal was, erm, less than helpful. But everyone else in the lab was busy when I started working on this and I didn’t want to interrupt anyone. So I got to find out about the data portal all on my own! I start out by clicking on the advanced search option. According to the list Julia gave me, there is probably a survey of small mammals at the Konza Prairie LTER site that is at least 12 years long. So I type in small mammals, select Konza Prairie, and I am presented with... this...


As you can see, nowhere does it say how many years are included in the data set. It only lists the publication date, which is of no use to me if I want to know how long they studied something.  In order to find the length of study, I have to click on the title, scroll down, find the metadata report, click on that, and then scroll down to find the years.
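
For anyone who would rather not click through every metadata report by hand, the study length can in principle be read straight out of the metadata file. Below is a minimal Python sketch, assuming the metadata is an EML-style XML record that stores its date range under temporalCoverage/rangeOfDates. The sample record, the function names (years_covered, long_enough) and the 12-year default are all illustrative assumptions, not something the data portal provides.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EML-style record (made up for illustration,
# not a real LTER metadata file).
SAMPLE_EML = """<eml>
  <dataset>
    <title>Example small mammal survey</title>
    <coverage>
      <temporalCoverage>
        <rangeOfDates>
          <beginDate><calendarDate>1981-01-01</calendarDate></beginDate>
          <endDate><calendarDate>2013-12-31</calendarDate></endDate>
        </rangeOfDates>
      </temporalCoverage>
    </coverage>
  </dataset>
</eml>"""


def years_covered(eml_text):
    """Return (first_year, last_year, span_in_years), or None if no date range is found."""
    root = ET.fromstring(eml_text)
    begin = root.find(".//temporalCoverage/rangeOfDates/beginDate/calendarDate")
    end = root.find(".//temporalCoverage/rangeOfDates/endDate/calendarDate")
    if begin is None or end is None:
        return None
    first = int(begin.text.strip()[:4])  # calendarDate is assumed to start with YYYY
    last = int(end.text.strip()[:4])
    return first, last, last - first + 1


def long_enough(eml_text, min_years=12):
    """True only when a date range exists and spans at least min_years calendar years."""
    covered = years_covered(eml_text)
    return covered is not None and covered[2] >= min_years


print(years_covered(SAMPLE_EML), long_enough(SAMPLE_EML))
```

Real records may describe their dates differently (as single dates, for example), so a None result here should be read as "go and check the metadata report yourself" rather than "no data".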

I have spent literal hours over the last couple weeks going through searching for keywords that will hopefully bring up what I need, clicking on a title, then clicking on the metadata report, and then scrolling all the way down just to see something like this:


Or this:


This was a huge time suck and mildly frustrating to say the least. I was about ready to take my extensive credentials as a *checks notes* Mildly Annoyed Undergrad™ and march right up to the LTER office and demand they change their name to the Year Long Ecological Research Network. In fact I was so peeved I took a break to make this extremely niche meme that about 6 people will think is funny.


Finally though, after sifting through what feels like a million data sets, I find one that actually goes for more than a few years. A bit of clicking around brings me to an Excel spreadsheet with the data on it.


Now I just need to determine if this data on small mammals is usable. Thankfully it looks complete and without any weird blank spaces or scary looking errors (an earlier Excel file I found had an error code of -99999 and that was a scary looking data sheet if I’d ever seen one).
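
A value like -99999 is usually a stand-in for "no measurement" rather than a real count, and it will wreck any average or trend if it is left in as a number. Here is a small Python sketch of one way to handle it, assuming the data have been exported to CSV. The column names, the sample rows and the choice to treat both -99999 and empty cells as missing are assumptions for illustration; the real codes should come from the data set's own metadata.

```python
import csv
import io

# Assumed placeholder codes for "no measurement"; check the metadata for the real ones.
MISSING_CODES = {"-99999", ""}

# Made-up rows standing in for an exported spreadsheet.
SAMPLE_CSV = """year,season,watershed,count
1990,fall,W1,4
1991,fall,W1,-99999
1992,fall,W1,7
"""


def read_counts(csv_text, column="count"):
    """Parse the CSV text, turning placeholder codes in `column` into None."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row[column].strip()
        row[column] = None if raw in MISSING_CODES else float(raw)
        rows.append(row)
    return rows


rows = read_counts(SAMPLE_CSV)
n_missing = sum(1 for r in rows if r["count"] is None)
print(f"{n_missing} of {len(rows)} rows have no usable count")
```

Counting the missing rows first gives a quick sense of whether a data sheet is merely scary looking or actually too full of holes to use.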

Most of the spreadsheet is logical. There’s the year (this was out of order but clearly labeled so it’s fine), the season, and a watershed ID number; all of that’s fine. Then we get to the actual data: a whole bunch of acronyms, followed by a series of numbers.


Now it may be a cool science thing that I’m not privy to, but this spreadsheet is so full of acronyms that it’s essentially illegible to an outsider unfamiliar with the system. Isn’t the goal of these types of data sets to allow future scientists to come in and reuse the data with relative ease? Well, that’s what the metadata was for!

Thankfully there was an easy to find (it was not) and logically labeled (it also was not) file attached named knb-lter-knz.88.7.txt. This file is not to be confused with or knb-lter-knz.88.7.xml, these other two files contain… information (it’s actually probably really important stuff but I don’t know what any of it means yet). Thankfully the metadata was mostly legible and explained the acronyms clearly. It took me a couple of extra clicks, but I think this data set will work for what I need!

Next steps are cleaning up the data!




Posted in Uncategorized | Leave a comment

Equity and Ethics in Environmental Data Science

Guest post by Kaitlin Stack Whitney

Recently Christie and Kaitlin had the privilege of participating in the first ever NSF INCLUDES funded Environmental Data Science Inclusion Network (EDSIN) conference, which took place in early April 2019 hosted by the National Ecological Observatory Network.

I (Kaitlin) had the great pleasure of being on a keynote panel focused on “further defining the problem space” and focused my comments on disability access and inclusion in academia and science more broadly. A much more in depth focus on those topics was presented by my colleague Dr. Drew Hasley during a plenary the next day. You can check out all the presenters and their presentations here.

Yet as the keynote plenary speaker and eminent scholar Dr. Carolyn Finney explained, problem space is probably the wrong term – and framing. Diversity, equity, and inclusion in environmental data science isn’t a problem! It doesn’t need fixing! Addressing it isn’t a problem solving exercise. In her words, “it’s not a problem to solve, it’s a process.”

Dr. Finney’s keynote was hands down the best start to a conference I have ever witnessed, and I took copious notes of the wisdom and experiences she generously shared with us. Some key takeaways (for me) that I want to share, as they inform my (our) collaborations and work:

  • “Outreach is outdated, individuals are fully formed.” How do we ensure that our outreach (especially Broader Impacts as a concept) acknowledges and honors that about everyone?
  • “You have to do something different, it’s not about being comfortable.” How do we work to make sure that our collaborators and trainees are comfortable and fully seen? Many of my colleagues, especially women of color, have never been fully comfortable or seen in their work. It’s not about my comfort and being a better ally means taking active steps not to center myself or my comfort.
  • “Diversity is not assimilation.” How do we respect and honor – dare I say cherish – difference in our collaborators and trainees? How do we make space for them, as opposed to only inviting them into pre-determined spaces?
  • “Privilege has the privilege of not seeing itself.” How do people with privilege learn about the things they have the privilege of not experiencing, let alone “seeing”? I am a white woman, a faculty member, and co-PI on a federally-funded-grant-project; I have a lot of privilege and power in relation to many people in my professional community. How do I educate myself to be a better ally and advocate without shifting the burden to marginalized colleagues who already do too much uncompensated service work?

I also presented a poster, “10 steps to make science (more) accessible” which has been a years-long collaborative effort with 4 colleagues across North America (in alpha order – Dr. Emilio Bruna, Dr. Simon J. Goring, Dr. Aerin Jacob, and Dr. Timothée Poisot). Check it out on FigShare and you can check out our corresponding preprint here. We’ll be making sure that our Bahlai and Whitney lab collaborations are multimodal and accessible. It’s a lifelong and continuous effort, but the focus of resources like these is to make clear that there are simple and quick steps that people can take, even if they don’t have control over the publishing platform or conference format, to increase access and inclusion.

We both also had the opportunity to participate – both as contributors and listeners – in a lot of excellent breakout sessions. Some of my favorites included those focused on disability inclusion and ethics in data science education. As faculty and researchers, how can we introduce ethical data collection and analysis to students from the beginning of their education in environmental science and data science, not as an add-on? Christie shared some great insights from her own teaching, and I brought some of those back to my own classroom this semester. There’s going to be a lot more to come out of our participation in EDSIN, both as the conference outcomes and collaborations develop further, and as I implement more of what I learned from new and old collaborators there. We’ll keep sharing.

Maybe you’re reading this blog generally for the quality bug counting content – and not sure how this fits in. But it’s at the heart of what we do – and it shapes our science. As Dr. Finney also said, “our relationships are worth the risk of going there, to discuss our differences.” People count the bugs and create the scientific community – I (we) must work harder and more effectively to ensure the field of environmental data science is ethical, equitable, diverse, and inclusive of everyone who wants to be in it.

You and everyone you know can join the EDSIN community to make environmental data science more accessible, inclusive, equitable, and awesome! Sign up on the QUBES platform:

The EDSIN conference is based upon work supported by the National Science Foundation under Grant No. 1812997. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


The calm before the (algorithmic) storm

Kaitlin, Julia and I have been busy. Not that you’d know. It’s been a lot of behind the scenes work so far on this project- it’s started with weekly meetings where we talk out our ideas and approaches. You’ll recall, the bad breakup project is an algorithm I designed to mine data from the perspective of a human who was looking to find statistically significant trends.  Essentially, it asks “Given a snippet of data, what conclusion would a typical biologist applying typical statistics make about these data?”* We were going to use the 58,000 datasets compiled through the US-LTER over its 40 year history as our data fuel- with the longest series (the complete record) serving as a proxy for truth.**

Turns out, this problem that we’re working on is pretty big, so establishing exactly how to approach these 58,000 archival datasets is our main challenge. It looks kinda like this:


A crudely constructed powerpoint flowchart. Top row: Hypothetical workflow- Get data > put data in the thing > ?????? > Profit!  Bottom row: Actual workflow-  ?????? > Put data in the thing > ?????? > Profit!

Our collaborator, Sarah Cusser at Michigan State, has had amazing successes in the context of a ‘deep look’ at a single long term experiment- she has been examining how long we have to watch a system to see treatment differences across several common agricultural practices- and how consistent this effect is. Her findings, in prep right now, can be used to make recommendations about how we make recommendations to farmers- essentially when can we be confident our recommendations are right, and when should we moderate our confidence, when guiding how farmers select practices.

Building on the idea of the deep dive, Kaitlin got an idea- what if we specifically sought out datasets that document tritrophic interactions from the LTER- we could use these focal datasets to examine within-site patterns between trophic levels- i.e. do misleading results travel together between trophic levels? A brilliant idea, so Julia sat down and started combing the organismal abundance/plant biomass data across the LTER- and proceeded to spend a lot of time spinning her wheels.

The challenge, it seems, is also one of LTER’s strengths. LTER data is available through two paths- in data archives set up by individual sites, and centrally, through a portal at DataONE. There is a heck of a lot of stuff going on at each individual LTER site, which is so, so awesome and cool and I love learning from it***, but, given that each site is so unique, navigating between them to try and find information which supports synthetic approaches is…okay, I’m not going to mince words, not awesome****. Julia has been working on LTER related projects for nearly a decade now***** and found that the DataONE catalog was fine if she knew what she was looking for, but it wasn’t easy to browse or to extract meaning from what she found. She quickly found the individual sites were, generally, superior from a browsing perspective, because each site gave a bit of context about what the sites actually are, what the major experiments were, and their data. The data catalogs at each site, though, were all slightly different, and thus varied in their ease of navigation- and moving between the sites, the approaches differed. So she’s still chipping away at this.

Meanwhile, though, something big hit the news. The INSECT DECLINE THING.

Oh boy.  So, I am going to summon the full authority of my position as your friendly local data scientist, insect ecologist and time series expert. This study has some serious design flaws and its results are unreliable******.  Manu Saunders wrote an excellent blog post on the subject so I won’t go into it in detail here.  But our group sat down and discussed the study at one of our regular meetings, and realized we were perhaps in one of the best positions to critically, quantitatively examine these claims- so we got to work. More on this soon.

Soon after we got started, a reporter from Discover magazine reached out to Kaitlin for comment on the story, and a truly excellent article resulted.

To quote a quote from the article:

“You can’t just draw a line through some data points, take it down to zero and say, ‘Right, that’s how long we’ve got,’” Broad says. “That’s not how stats works, it’s not how insect populations work, either.”

Later in the article,  Kaitlin highlights how we’re addressing these problems in the work we’re doing– and examines why the scientists making these dire extrapolations are reaching the conclusions they do.

I’m really excited to see what we find.


*  my long term goal is to replace typical biologists making typical conclusions with about 185 lines of well commented R code by 2027

** It’s a proxy because even the longest time series I have is a snippet of the whole story. Ever since I started partying with sociologists, I find I use phrases like “proxy for truth” in  my day-to-day more than I’d like to admit. Wait till we write our book together.

*** Can you tell that I have a deep, deep, identity-defining love for LTER? Because I do.


***** Just to make you feel old, Juj.



Bad breakups (with your data)

Hi all,

It’s been a long time. It turns out this assistant professoring thing does not leave me with a lot of time. Hmm. Who knew? Since we last spoke, I’ve been building my lab- both the physical space:

and the online infrastructure.

I’m building collaborations with friends and colleagues all over the place, and working to help finish the student projects that I became involved with through my previous positions at Michigan State. I’m also working hard to get my uniquely Bahlai Lab research vision off the ground. A big part of that is people.

I have people now! Julia, my long-suffering technician, puts up with my barrage of ridiculous ideas and helps me bring the vision to reality. She’s also in training to be a butt-kicking librarian. Cheyan, PhD student, is studying how ‘non-traditional’ data sources (with a focus on citizen science) can be used to develop and engage people in long term ecosystem management. Katie, PhD Student, is studying how we can measure insect mediated ecosystem services and functions in green infrastructure projects. Christian, PhD Student, is examining the use cases and factors affecting the quality and quantity of citizen science data, in the context of Odonate conservation under climate and habitat change.*  (Yes, you counted right- that’s 3 PhD students already). And, my undergraduate project student, Erin, will be working with me on my next big thing.

Which brings us to IT. My Next big thing.

A while back, I had an idea. It wasn’t completely my idea- it came out of conversations with a few people. As you know, I’m interested in big(ish) data- finding trends from patterns we see when we put together a lot of information about a system.** But the big is a combination of a lot of littles.***

When we look at systems for a while we get to see a lot more of the whole, big, messy variability of a system. I’ll illustrate with an example.

Y’all know about the fireflies. No? Okay, it’s been a while and I can remind you about the fireflies. I recorded a video****:

TL;DW- My Reproducible Quantitative Methods class produced a paper about firefly phenology. Fancy people liked it and it got media attention.

During my interview about the piece, the reporter kept going back to the idea of trajectory. Yes, sure, phenology, cool, but what is the *trajectory* of firefly populations? ARE firefly populations in decline?

Here I had one of the longest time series documenting systematic collection of fireflies, known to science, and I could not answer this seemingly simple question. For your reference, here is the data from our site, grouped by plant community of capture:


My reply to the reporter was “I don’t know. But if I had less data I would tell you, and I’d be surer of my answer.”

A little tongue in cheek, to be certain, but isn’t that what we’re doing every day in science? One of the fundamental questions we ask in ecology is where is my system going? and we’re making extrapolations based on the data we have available. We know it’s not always the right thing to do, but we do our best, looking at the world through the limited windows available to us. In ecology, the three year study is pretty much the standard:

We know that this is problematic. This is why the USLTER network exists. People get that. But we’ve still got to do work in the shorter time scales. We gotta graduate students. My grants don’t go on forever. We can learn lots of things from studying systems in the short term.


How do we know when these short term studies are misleading us? What are the effects of the time period we’re looking at? The length of time we’re watching, and the type of process? How often we’re measuring? And how the heck can we test this, if we’re mostly doing short term studies?

Friends, I had an idea. Why not re-analyse long time series data– as if they were short term data? Break it up in all sorts of objectively bad ways (THERE! I EXPLAINED THE POST TITLE), analyse using standard statistical methods, collect these statistics up, and look for trends in conclusions we reach, given different ways of collecting the data?
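
To make the idea concrete, here is a rough Python sketch of the break-it-up step. It is an illustration under stated assumptions, not the project's actual code: the abundance numbers are made up, the "standard statistical method" is assumed to be an ordinary least squares trend line judged against a two-sided 5% t-test, and the names (fit_trend, classify, bad_breakup) are invented for this example. The verdict from the full series is used as the stand-in for truth.

```python
# Two-sided 5% critical values of Student's t for 1..10 degrees of freedom.
T_CRIT_05 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228]


def fit_trend(years, values):
    """Ordinary least squares slope of values on years, plus the slope's t statistic."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, values))
    se = (sse / (n - 2) / sxx) ** 0.5
    if se > 0:
        t_stat = slope / se
    else:  # perfect fit: flat line is "no trend", anything else is as strong as it gets
        t_stat = 0.0 if slope == 0 else float("inf")
    return slope, t_stat


def classify(years, values):
    """What would a typical analysis conclude from this snippet? Needs 3+ points."""
    slope, t_stat = fit_trend(years, values)
    df = len(years) - 2
    # Beyond the table, reuse the df=10 value (slightly conservative; an assumption).
    crit = T_CRIT_05[min(df, len(T_CRIT_05)) - 1]
    if abs(t_stat) < crit:
        return "no trend"
    return "increase" if slope > 0 else "decrease"


def bad_breakup(years, values, min_length=3):
    """Classify every run of consecutive years, from min_length up to the full series."""
    results = []
    n = len(years)
    for length in range(min_length, n + 1):
        for start in range(n - length + 1):
            window_years = years[start:start + length]
            window_values = values[start:start + length]
            results.append((window_years[0], length, classify(window_years, window_values)))
    return results


# Made-up abundance series, one value per year.
years = list(range(2000, 2012))
values = [10, 12, 11, 14, 15, 15, 18, 19, 21, 20, 23, 25]

results = bad_breakup(years, values)
truth = results[-1][2]  # the last entry is the full series, our proxy for truth
misleading = sum(1 for _, _, verdict in results if verdict != truth) / len(results)
print(f"full series says '{truth}'; {misleading:.0%} of windows say something else")
```

Grouping the results by window length rather than lumping them together would show how the share of misleading windows changes as the study gets longer.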

All this would take would be a relatively simple algorithm, a whole pile of time series data, and some money, time and patience for the personnel to drop data in and collect the stuff that comes out of the algorithm machine. I can write an algorithm, and hey, the USLTER has lots of data that would be appropriate to get this done, but the latter components are a little harder for a new professor to come by.  So, I put it on the back burner.

Anyway, this summer, my friend and collaborator Kaitlin Stack Whitney brought this grant opportunity to my attention.

EAGER proposals for high-risk/high-reward innovative studies that address development and testing of important science and engineering ideas and theories through use of existing data. […..] proposals must:

Involve, for data proposed for use, publicly-available data generated through NSF funding; and

Agree to make public the details about their experiences reusing the data, including especially challenges associated with that reuse.


Hey! I, in fact, am a professional at reusing data produced by NSF projects and publicly documenting my experiences using said data! I [cough] kinda have a blog about it. So me, Kaitlin, and my technician Julia sat down.

We wrote a proposal.

And it got funded.


Three junior women scientists getting an NSF award? No Big Deal.***** We’re getting this project underway, now. My undergraduate, Erin, will be gathering candidate data sets for trial bulk runs of the algorithm over the winter semester. Collaborators Sarah Cusser and Nick Haddad at Michigan State are using the algorithm on a focal dataset to do a deep dive into how patterns of observations affect conclusions in agricultural systems. Basically, we’re going to figure out once and for all- how often are we wrong when we look at our data?

This is going to be big, my friends, stay tuned.

*Note to self, get Christian on the website!! another item for The List.

**thank you for coming to my TED talk that’s not actually a TED talk.

*** you can put that wisdom on my tombstone

****thank you for coming to my other TED talk that’s not actually a TED talk.

*****This is a big deal and I am pretty excited about it.


Official Blog-nouncement

Hi, all.

So I’ve been very quiet lately here on the blog. This is, largely, because: academic job market. My search kicked up a notch this past semester, and 1) I was very busy with travelling for job interviews and 2) it’s considered gauche to talk about interviews while they’re ongoing. I was also busy with other things (y’know, research and teaching), and I’ll get into that in a bit, but for now, here’s the big announcement.



The Bahlai Lab of Applied Quantitative Ecology starts this fall at Kent State University, in the Department of Biological Sciences. It’s not a coincidence that the lab bears my name. It’s my lab.[1]

In the Bahlai Lab[2], we focus on developing tools and metrics for better understanding the functional ecology of communities over time. Our two major areas of research are currently:
1) Studying the responses of insect predator-prey communities to disturbance, such as invasions and climate change
2) Developing break-point analysis tools to better quantify the impacts of change in long term ecological observations.

But more on that later. Lots, lots more 🙂

Incidentally, now that I’m on the other side, I’m allowed to talk about fight club, er, the academic job market, so I wrote a piece for American Scientist about it. It seems to be resonating with people, so go check it out.

The past semester was very full, as I mentioned. I taught another offering of Reproducible Quantitative Methods, this time to a group of 20. Projects (which are still ongoing) included examining the role of environmental drivers in rockhopper penguin egg size, a community and functional analysis of landscape and climate impacts on Midwest aphid populations, an integrated population model to better understand the trajectory of, and constraints on American woodcock population dynamics, and a data rescue project to digitally preserve the National Eutrophication database.

In my personal research, I’m also working on building out my technique for detecting regime shifts (i.e., changes to the rules governing population regulation) in dynamic populations. Currently (like, literally, iterations 201-250 are running right now), I’m running simulated data through the model to determine which conditions it works best under. I’ll be applying it to case studies from insect populations, and seeing what we can see.
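
For a flavour of what "running simulated data through the model" can look like, here is a deliberately simple Python stand-in. It is not the regime shift technique described above: it swaps in a plain single-break search that splits a series where two separate means fit best, then checks how often the known break is recovered as the simulated noise grows. The function names, segment lengths, means and noise levels are all assumptions chosen for illustration.

```python
import random


def sse(values):
    """Sum of squared deviations from the segment mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_break(values, min_segment=3):
    """Index that splits the series into two segments with the lowest total squared error."""
    candidates = range(min_segment, len(values) - min_segment + 1)
    return min(candidates, key=lambda i: sse(values[:i]) + sse(values[i:]))


def recovery_rate(noise_sd, n_runs=50, seed=1):
    """Share of simulated series in which the true break (index 15) is found exactly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        before = [rng.gauss(10, noise_sd) for _ in range(15)]
        after = [rng.gauss(20, noise_sd) for _ in range(15)]
        if best_break(before + after) == 15:
            hits += 1
    return hits / n_runs


# Try a quiet, a noisy and a very noisy system.
rates = {sd: recovery_rate(sd) for sd in (1, 5, 10)}
for sd, rate in rates.items():
    print(f"noise sd {sd}: break recovered in {rate:.0%} of runs")
```

The same loop structure works for other conditions worth testing, such as shorter series or smaller shifts; only the simulated series inside recovery_rate needs to change.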

It’s going to be an exciting few months, as I transition from Mid-Michigan to Northeast Ohio. 🙂


1. [barely audible squeal of joy]
2. I like saying it. In the Bahlai Lab… Welcome to the Bahlai lab…


Soft(ware) skills

In case you’ve been missing hearing my voice in your head, I had a post published over at the Data Carpentry blog today.


Inference and being wrong in a post-truth era

There’s been a confluence of recent events that have got me thinking about truth and facts. First of all, I’ve been thinking a lot about how we move forward as scientists when there’s this steaming mess going on. Secondly, there’s been a series of really honest blog posts where some amazing, respected scientists talk about the times they’ve found errors in their published work. Thirdly, a student member of an online community of academic women that I’m part of posted about a severe dressing down she got from her adviser for what appeared to be a few minor errors; she worried she’d undermined her professional credibility, and possibly her whole career. Finally, I’ve been preparing for another offering of my quantitative methods course, and as such, am thinking about how to best talk to students about data, inference, and what it all means.


Sci, Robot.

It’s more important now than ever that scientists are willing to stand up against misinformation. We are knee deep in it, and friends, we’re the ones holding the shovels. But we don’t always, because it’s exhausting and often makes us vulnerable. We live in an era where scientists are not particularly trusted. When we’re trusted, we’re seen as inhuman. And we’re certainly not understood- who can forget how our research is openly mocked with reductio ad absurdum arguments by people with political agendas to disprove our work[1] or undermine science in general (a situation that, I fear, is only going to get worse over the next few years)? Add into the mix the funding climate leading to cutthroat competition for positions[2] and stir in the scientific enterprise’s (essential and important) self-criticism, and it’s no wonder that scientists are hesitant to engage. In short, it’s soul crushing, and the stakes are high.

This puts a lot of pressure on scientists to be right.

But I believe this climate is hurting us- in a lot of ways. Toxic levels of perfectionism. In the example I mentioned above, we have a graduate student worried about her career because of a typo. I’ve often cited the times I’ve observed students I’m working with get stuck on data problems because they’re afraid to do the wrong thing. Heck, I’m struggling to get the words out right now, because I’m afraid that someone will read what I write and think I’m advocating for sloppy science.[3]

But how much value does science gain when its practitioners are paralyzed into inaction? And even more pressing, how do we encourage more diverse contributions in science when people trying to join the community are shunned when they misstep?

This whole seeing scientists as robots, expecting scientists to be robots, even from within, I believe has a much more insidious consequence as well. It sets up the expectation that science, all science, every preliminary study, is absolute. And we, as practitioners, know that’s not true. And when science is seen as absolute and invariable, it sets itself up as a convenient straw-man for people with anti-science agendas. For example, anthropogenic climate change is happening.[4] If you’re reading this blog, you probably agree with this statement. But there’s a controversy, and often cited by team Give-The-Planet-An-Uninvited-Sweaty-Hug[5] is the so-called Antarctic cooling, as a point to ‘disprove’ global climate change.[6]

So, no. Obviously no. Thinking that an overwhelming body of evidence can be brought down by localized variation in trends? And this is not even to mention the role interpretation, statistics choice, even method of observation play. This is a problem, and I can’t help but feel it is up to me, at least in a small part, to fix it. My gut says this is something we can only address by helping people really understand how science is done, and by inviting more people in.

So back to my course- my corner of influence. As this is a course that advocates an open science approach, it forces students into a paradigm where they will be publicly wrong, or at least incomplete and unpolished. As I mentioned, this is something not all scientists take kindly to. But I know I’ve had a lot more luck convincing people of my sometimes controversial conclusions when I open up and show them the steps I took to get there. This is why I applaud the scientists who are open about their missteps and mistakes- these people are teaching the world more about science and process than any ‘perfect’ paper may.

In my own work, I’ve adopted something of a radical openness. The projects I’m leading right now are out there for the world to see, from the time I create the first file and start dumping stream-of-consciousness comments in about what I intend to do, through each bump in the road. It is my hope that through this openness and transparency, people feel invited to build on my work, they can validate the robustness of what I conclude, and they can use these ideas to help, in a small way, understand and buffer change in the world.[7]

Seventeen students have chosen to join me on this journey next semester. I hope to give them what they need to save the world, too. No pressure.

  1. The blog title makes it easy for you to do this with my work. Since really transitioning to quantitative ecology full time, you might even say I use a computer to count imaginary bugs.
  2. #operationhiremeplease2017
  3. See footnote #2. I’m not. I’m meticulous.
  4. I do hope that NASA article I linked isn’t turned into an ad for a certain prominent brand of luxury hotel in newly coastal Ames, Iowa anytime soon.
  5. This was the most PG  glib nickname for this group I could muster.
  6. Incidentally, I was at a conference in South Africa in October, and Michael Gooseff presented the data from the McMurdo dry valleys that showed this apparent slight cooling trend from 1978-1998, and then showed the rapid increase since then. I was most intrigued by the inflection point, as a person interested in breakpoint analysis would naturally be, so stay tuned for some upcoming work in that area. I’ve started a new project. 🙂 [8]
  7. And I mean this quite literally. The project I’ve linked to is developing a tool to understand when changes are occurring in dynamic populations from legacy population data. Because it’s really hard to determine when and how the factors regulating a population changed in an unbiased way- and without knowing that, it’s pretty hard to mitigate those factors.
  8. I’m being all secretive about this to create intrigue but if you’re motivated you can check out exactly what I’m doing on my github. 🙂 🙂