Of time series and ticks

Editors’ note: This is a guest post by Sofie Christie, a RIT RISE research fellow and an undergraduate at the Rochester Institute of Technology in Bioinformatics. You can check out more of Sofie’s research on her github and figshare sites.

This summer I’ve been working with Dr. Kaitlin Stack Whitney and the Bahlai Lab at Kent State University on their ‘Managing Our Expectations’ project. For my own research as part of the project, I decided to focus on Ixodes scapularis. Also known as the deer tick, it is a primary vector of Borrelia burgdorferi, the bacterium that causes Lyme disease. Ixodes scapularis populations appear to be increasing. Media coverage also often frames deer ticks as increasing:

However, biological studies generally collect short term datasets, and the trends observed from these studies may not be indicative of longer term trajectories.

For example, what might you conclude if you only saw this:

Versus seeing this:

First, I had to find long term, publicly available tick datasets. My criteria were that the dataset had to be at least 10 years long, be collected in at least one location, and have numerical density, abundance, or count data for the tick species Ixodes scapularis; any life stage was acceptable.

I started with the LTER Data portal. I typed in the keyword scapularis. There were two results. Both were at least 12 years long, but I excluded one because it had no count data, only a categorical metric of uncommon, rare, and common. The other dataset, which was focused on Harvard Forest, did have count data, so I added the link to my list of candidate datasets.

Still in the LTER Data portal, I then typed in the keyword tick and found 11 results. However, most focused on different animals and had no tick sampling data, and I did not find any new datasets that fit the criteria. I was hoping that there would be more candidate datasets, especially from a data portal focusing on long term datasets! [Editors’ note: Given how few long term deer tick datasets Sofie found in the LTER system, even though ‘bad breakups’ is mainly focused on LTER and publicly-funded datasets, we were compelled to keep going to see if other sources had more!]

Next, I searched the Data Dryad repository by first typing in scapularis, which turned up 9 results. Several focused on the microbiology of the tick. I found two datasets that fit the criteria. I then typed in tick for the search term, and found several more datasets focusing on tick abundance, but some of them were only short term. I found one possible dataset that focused on several different tick species.

Then I searched through the NEON data portal by typing in scapularis, which turned up 0 results, then tick, which turned up 2 results. Only one of them contained tick sampling data, and it was too short of a study to fit our criteria.

I searched through Google Datasets by typing in the keyword scapularis. There were 76 results. Many of the results were duplicates. I also found some that focused on the microbiology of ticks, rather than working to sample ticks out in the field. Others were not long enough. In short, there were a lot of datasets that had to be filtered through, but I found 4 datasets fitting the criteria from the NY Department of Health. These 4 datasets were from studies that collected data in the same way, by dragging, and reported it as the average number of ticks per 1000 meters. However, this metric was linear, so it could not be compared as easily to a spatial metric like the number of ticks per square meter.
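To make the unit problem concrete, here is a rough sketch of how a linear drag metric could be converted to an areal one. It assumes the width of the drag cloth is known. The 1 meter width and the 50 ticks below are made-up placeholders, not values reported by the NY Department of Health, and the function name is hypothetical.

```python
def ticks_per_square_meter(ticks_per_1000_m: float, cloth_width_m: float) -> float:
    """Convert a linear drag density to an areal density.

    Dragging a cloth of width w over 1000 m sweeps an area of 1000 * w square meters.
    """
    if cloth_width_m <= 0:
        raise ValueError("cloth width must be positive")
    area_swept_m2 = 1000.0 * cloth_width_m
    return ticks_per_1000_m / area_swept_m2


# Hypothetical numbers: 50 ticks per 1000 m, with an ASSUMED 1 m wide cloth.
density = ticks_per_square_meter(50.0, cloth_width_m=1.0)
print(density)  # ticks per square meter
```

If the cloth width is not reported alongside the data, this conversion is not possible, which is part of why the two kinds of metric are hard to compare.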

When I showed my mentor the NY Department of Health data I had found, she suggested looking at data portals from other states, especially the Eastern ones, such as Connecticut, which could have publicly available tick datasets due to the higher prevalence of ticks in those states. I actually looked at a map of the US based on infected deer tick presence to help me prioritize which states I should be looking at. I predicted that if a state has a confirmed deer tick population, then it will be more likely to have collected tick data, although whether or not that data is publicly available is still questionable.

Map of Lyme disease risk. Digital image. Live Science. 7 February 2012, https://www.livescience.com/18340-lyme-disease-risk-map.html

For example, the University of Rhode Island has collected tick data, but they do not seem to have it publicly available. They have a site called TickEncounter which provides all sorts of information about ticks, like what species are common in Rhode Island, how to identify them, how to remove them, and what kind of habitat different tick species may be found in. They also have a crowd-sourced survey called TickSpotters where the public can submit tick data, which includes the tick species, life stage, whether it was found on a person, pet, or wandering, date, and what country it was found. They then use this information to provide a TickEncounter index, which seems to be ranked as ‘Low’, ‘Medium’, or ‘High.’ That’s not exactly as helpful as providing the raw data, however.

Another common issue I encountered was that some of the states only had surveillance reports for Lyme disease, and no surveillance tick data.

For New Jersey, I started by searching for the NJ Department of Health webpage. I accessed the Offices & Programs dropdown at the front of the page, tabbed to the Communicable Disease page, accessed the Statistics, Reports, & Publication dropdown, and clicked on Vector borne surveillance reports. While some of the earlier reports did provide data on tick related emergency visits and tick borne diseases, the older reports only provided data about mosquito borne diseases, and there weren’t enough reports with tick data to constitute 10 years. So I did not include the New Jersey reports in my list of candidate datasets.

I searched for the Connecticut Department of Health webpage and typed in the keyword tick. There were 75 results. I tabbed to the link worded “Tick” which led to this page, and scrolling down to the CAES Tick Office & Tick Testing section led me to a link titled “Tick Test Summaries.” This page provides a list of tick testing results for each page. These reports include data for several tick species, such as Ixodes scapularis, and also include data on the percentage found positive for Borrelia burgdorferi, the bacterium that causes Lyme disease. As there were over 10 years’ worth of reports, it met my criteria and I added it to my list of candidate datasets.

I then looked to Pennsylvania for tick datasets. I searched their department of health webpage using the keyword tick, but found no datasets. After this, I tried something else and searched Google using the keyword pennsylvania tick datasets. Lo and behold, there was actually a study that provided 117 years of data. The data provided through the study had count data of Ixodes scapularis over all life stages, and as it fit the criteria, I added it to my list of candidate datasets. Last but not least, I used the same method to search for tick data in the state of Delaware; however, I did not find any results.

At this point, I stopped looking for tick data, and I now had a compiled list of 10 candidate tick datasets. Searching through all of these different sites showed me how diverse studies can be in terms of collecting and reporting tick data. There are many different metrics, standards, and methods for tick data, and comparing the tick abundance, count, or density data from different studies isn’t that simple. For example, the dataset in Harvard Forest was an opportunistic study where they recorded the number of ticks and tick bites found on summer research interns, which is a rather different way to sample than dragging. In addition, the datasets also varied in terms of the life stage they sampled for: some only sampled for adults and nymphs, while others sampled for all of the life stages. There was also a lot of averaging in some of the datasets like the NY Department of Health one, where they only provide the data on the county level scale, rather than providing the specific location or plot. This makes it hard to compare to other datasets like the one done in Cary Forest because the geographical scope is so different. I was especially surprised at the lack of publicly available data from the state Department of Health sites. The states that I looked at are well known for their deer tick populations, so you might expect that they would have collected deer tick data, given that it is a critical health concern.

[Do you have 10+ years of deer tick data – or know of another source – that we should be including? If so, let us know in the comments or via social media!]


Irrigrated: In which Tasia Complains About Things in List Form Because Narratives are Difficult

This is a second post  written by Tasia North, an undergraduate student who’s working with us on the Bad Breakup project. As part of our research plan on this project, we’re identifying barriers to data reuse from publicly shared, NSF-produced data sources. Tasia is working on a project which is examining patterns within tri-trophic interactions in long term data- basically asking the question- do significant trends move between trophic levels, and if they do, how? But in order to do it, they were first tasked with finding a few representative sets of data. I wanted them to have an authentic experience- just that there were data like this, here’s a database of datasets, now see how you can find information to support this investigation- and write down what you find. 

-Christie

Hi folks,

These past couple weeks I’ve been cleaning data and wading through metadata. Trying to decipher work other people have done is difficult and I have some feelings about it (hence this blog post). This is for the bad breakup project, in which we take long term data sets (12 years or more) and chop them up to see what trends would appear if we had only measured that system for a shorter period of time. This will allow us to quantify how often we are wrong when we make conclusions off of those shorter 3 to 5 year studies that are so common in science.
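For anyone who wants to picture the "chop it up" idea, here is a minimal sketch. It is not the project's actual algorithm: it only fits a straight line to every short window and compares the direction of each window's slope with the direction of the full series, and it skips significance testing entirely. The counts and all function names are made up for illustration.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def window_slopes(years, values, window):
    """Return (start_year, slope) for every run of `window` consecutive observations."""
    results = []
    for start in range(len(years) - window + 1):
        xs = years[start:start + window]
        ys = values[start:start + window]
        results.append((xs[0], ols_slope(xs, ys)))
    return results


def share_disagreeing(years, values, window):
    """Fraction of windows whose slope direction differs from the full series.

    A perfectly flat window counts as disagreeing with a rising full series.
    """
    full_rising = ols_slope(years, values) > 0
    slopes = [slope for _, slope in window_slopes(years, values, window)]
    disagree = [s for s in slopes if (s > 0) != full_rising]
    return len(disagree) / len(slopes)


# Made-up 12-year series: a slow rise with bumps along the way.
years = list(range(2000, 2012))
counts = [10, 12, 9, 11, 14, 12, 13, 16, 13, 15, 18, 16]
print(share_disagreeing(years, counts, window=3))
```

Even in this toy series, a fair number of the 3-year windows point the opposite way from the 12-year record, which is the kind of mismatch the project is trying to quantify.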

More specifically I’m searching for data from LTER sites that have complete long term abundance data across three trophic levels. I’ve already gone through the arduous process of sifting through the LTER data portal and finding what I need. Now that I’ve found the datasets, I need to wade through their metadata, make sure I understand what they did well enough to know if it will work for the analysis, and ensure that it meets all the criteria.

Now part of what Christie wanted me to do during this process was to identify any barriers to reusing this data and write about that. So rather than writing a narrative of how my metadata experience has gone so far, I’m just gonna write a list of all the things that have been confusing and made this process more difficult because I am lazy. Please enjoy my list!

1. Typos

A lot of these are kinda funny and perfectly harmless, but someone really should have done one last proofread before putting this up. I do love that they sampled the yee ole aire


Image description: A screenshot of several highlighted typos in metadata, and a screengrab of Captain Picard from Star Trek giggling with his hand on his cheek saying “oopsie”. Highlighted typos include “wind speed at astart of sampling” and “aire temperature at start of sampling”

 

2. Not writing down the units of things

Yes, mmhm, you read that right. This person measured mass and then recorded the units as dimensionless. How can a unit of mass be dimensionless?? Maybe if I were more familiar with this particular system and how much biomass it usually produces it would be obvious, but I know nothing about this ecosystem and I don’t know how much grass would be a logical amount of grass in this area! Should I assume that it’s kilograms since that’s the SI unit of mass? I don’t really like to assume things when working with a big set of someone else’s data but since they didn’t write it down I guess we have to spend a bunch more time digging. Thankfully this doesn’t really matter for what we’re doing (our algorithm works on Z scores, so take THAT dimensionless data!) but it would matter a lot for just about any other type of analysis or if I wanted to reproduce this study.


Image description: A screenshot of metadata showing that for the attribute “mass of live grass” the unit is listed as dimensionless. For the attribute “mass of forbs” the unit is also listed as dimensionless
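As a rough illustration of why missing units don't break a Z-score-based analysis: standardizing subtracts the mean and divides by the standard deviation, so any constant unit conversion cancels out. Here is a minimal sketch with made-up biomass values. The numbers and the grams-to-kilograms guess are purely illustrative, and this is not the project's actual code.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# Made-up biomass values; pretend we do not know whether they are grams or kilograms.
biomass_g = [120.0, 150.0, 90.0, 200.0, 160.0]
biomass_kg = [v / 1000.0 for v in biomass_g]

z_g = z_scores(biomass_g)
z_kg = z_scores(biomass_kg)

# The two sets of Z scores agree up to floating-point rounding.
same = all(abs(a - b) < 1e-9 for a, b in zip(z_g, z_kg))
print(same)
```

This only covers rescaling by a constant factor. It would not rescue an analysis that needs absolute amounts, which is why the missing units still matter for anyone trying to reproduce the study.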

3. Very vague details or no details at all

I’m having a really hard time finding anything meaningful about methods in a lot of these datasets. This screenshot is just one example of many vague things I found. Did they treat it with herbicide? Pesticide? Did they pull weeds? Or fertilize? Were they burning? Playing music to the plants to see if they grew better? Who knows? All I know is that there are at least two treatments and that one of them is old but that’s it.  I searched high and low for a paragraph about methods but found N O T H I N G directly connected with the data set.


Image description: A screen shot of metadata, showing the first attribute as Old treatment. The storage type is listed as string, and measurement scale says old treatment. The second attribute is Current treatment, with the same things listed for storage type and measurement scale. There is no other meaningful information that could lend clues as to what the old and current treatments actually were.

4. USING LABELS IN THE DATA SET AND THEN NOT DEFINING THAT LABEL IN THE METADATA. I AM WRITING THIS IN ALL CAPS JUST TO LET EVERYONE KNOW HOW STRONG I AM FEELING ABOUT THIS ONE.

Yeesh. This is grass biomass divided into the 2 (or maybe 3??) treatment plots that they used in the study. The top screenshot is the only thing in the metadata referencing these labels. And the screenshot below is what’s in the actual dataset. “c” is defined in metadata. “i” is defined. “ni” is not defined anywhere and is only used in the dataset those three times. I pored through the metadata looking for these definitions but at this point I think the best I can do is assume that “ni” stands for non-irrigated and means the same thing as control. I really don’t like making assumptions like that but there isn’t really another option with the given information! And to top it all off they misspelled “irrigation” in parts of the metadata!!*


Image description: A screenshot of metadata showing an attribute Treatment. Under measurement scale, it defines two variables, c = control, and i = irrigrated (typo is copied exactly from metadata). Text does not define ni at any point


Image description: Screenshot of a pivot table created in excel. The table shows data of average livegrass biomass from 1991 to 2015, with columns for treatments c, i, and ni. There are only three values in the ni column. When there is a value in the ni column, the c column is blank, and anytime there is something in the c column, ni column is blank. Since ni is not labeled anywhere in the metadata, and it follows this pattern, the most logical assumption is that c and ni are the same thing and there was just a mistake somewhere in the data recording process
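The "c and ni never appear together" pattern can also be checked in code rather than by eye. Below is a small sketch of that check on a made-up table. The years, values, and helper names are hypothetical, and a passing check only supports the assumption that the two labels mean the same thing; it doesn't prove it.

```python
# Hypothetical rows: year -> mean live grass biomass per treatment label.
# None stands for a blank cell in the pivot table.
pivot = {
    1991: {"c": 210.0, "i": 305.0, "ni": None},
    1992: {"c": None, "i": 298.0, "ni": 187.0},
    1993: {"c": 195.0, "i": 310.0, "ni": None},
}


def labels_never_overlap(table, label_a, label_b):
    """True if no row has a value for both labels at once."""
    return not any(
        row.get(label_a) is not None and row.get(label_b) is not None
        for row in table.values()
    )


def years_using(table, label):
    """Sorted list of years in which the label has a value."""
    return sorted(year for year, row in table.items() if row.get(label) is not None)


print(labels_never_overlap(pivot, "c", "ni"))
print(years_using(pivot, "ni"))
```

Listing the years where the stray label shows up is also a handy thing to include when asking the data managers what happened.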

I get that this is a long data set which has probably had a lot of people working on it over the years. But my goodness pretty please stick to the same labeling system and if you have to change it ya gotta define the new labels you use and maybe put a sentence about the change somewhere!

The cats do not approve of this kind of behavior. They do not approve.


Image description: A meme showing two angry looking grey and black striped cats glaring at the camera with a text that says “When an LTER dataset uses a label and the doesn’t define it in the metadata”

5. Being extremely vague on methods and labeling

A lot of these data have only a couple sentences on methods, so it would be nearly impossible to recreate the survey from the information given. For example, I’m looking at a grasshopper survey; the labeling system they used is a little confusing and I’m not quite sure what they actually did for sampling. The metadata has literally 3 sentences about methods. They did a sweep sample in July/August of each year, and “At each site on each occasion, 10 sets of 20 sweeps (200 sweeps total) are taken”. That’s it. From the data I can see the set of 10, that’s labeled very clearly, but I can’t find any clear set of 20. I’m unclear if they had multiple people who each did 20 sweeps 10 times? Or multiple people who together did 10 sets of 20 sweeps? I’m also not sure if the 10 sets were over the same area or if they had transects of some kind? I’ve never done grasshopper sampling before so I have no reference of what normally happens with this kind of a survey, and since they didn’t write it in the methods I’ll just be confused about it and hope it won’t change anything about the analysis. (I’ve since been informed by someone more familiar with these types of sampling methods that the 20 sweeps refers to each actual sweep of the net, so they did 20 net sweeps 10 times. The jury is still out about where they did these 10 sets of sweeps, and what parameters they used for selecting the sampling transects.)

 

~about a week later~

 

Friends I LITERALLY CANNOT. I have just discovered that the metadata that I downloaded in a zip file from the LTER site is different from the one on the internet. They literally wrote two different sets of metadata. One is conveniently placed in a downloadable zipfile along with the data. This one is extremely vague and has almost no information on methods. And the other metadata has detailed paragraphs on methods, changes to the study over time, etc. Where is this detailed file? Tucked in some corner of the website that I would never have thought to check because I thought that it was the same as the one I downloaded. I assumed that everything put inside that zipfile would be the most complete set of information and that clicking around the site for random files would be a waste of time when I have the “exact” “same” “file” already downloaded.


Image description: picture of a white baby flamingo on a blue background. The chick has its mouth open in what looks like an angry scream. Text reads *incoherent screaming*

Is this common knowledge that there are 2 different sets? Why didn’t they just put those couple of detailed paragraphs into the downloadable metadata? Now I need to go back to every other place I was confused and check to see if they have the answers in a second secret file somewhere online. Unfortunately it looks like it doesn’t have any information on the missing labels/units, but it cleared up a lot of my questions about methods and larger context and how/why/where they did what they did.

I was SO confused and spent SO much time looking through the files I had downloaded looking for answers and it turns out the answers were there all along just in a file I had not thought to check since I assumed that the 2 files both called metadata were the same thing. *heavy sigh*


Image description: Fat fuzzy cat with short stubby legs sits at a laptop wearing a red bowtie and small round glasses. The cat has its mouth open looking surprised, indignant, and offended at the computer screen. Text reads “My face when I found the second set of metadata.”

* this is when Christie made a bad dad joke about being very  “irrigrated” about the whole situation. Do you see what her poor lab has to put up with?


How do I find stuff? An undergraduate’s journey through an online data archive

This post is written by Tasia North, an undergraduate student who’s working with us on the Bad Breakup project. As part of our research plan on this project, we’re identifying barriers to data reuse from publicly shared, NSF-produced data sources. Tasia is working on a project which is examining patterns within tri-trophic interactions in long term data- basically asking the question- do significant trends move between trophic levels, and if they do, how? But in order to do it, they were first tasked with finding a few representative sets of data. I wanted them to have an authentic experience- just that there were data like this, here’s a database of datasets, now see how you can find information to support this investigation- and write down what you find. This is Tasia’s first blog post with their reflections on the experience!

-Christie

 

Hello Blogosphere,

I’m the new undergrad working here at the Bahlai lab. If you’ve been following along with the blog you’re aware of the bad break up project that’s been going on. This project looks at a long term data set, and breaks it up into shorter clumps to look at the trends. This will allow us to quantify how often we are wrong when we base conclusions off of three or four year studies.

We are digging through approximately 58,000 datasets from the US-LTER. My task is to continue working on the tritrophic interactions that Julia had started. She had created a list of sites that are likely to have the data needed, and what organisms I can look for. I needed to take that list, sort through available LTER data to find the data set, determine if it was at least 12 years long, and confirm that it is usable and accessible data.

Easy enough right?

So a few things about me might be useful for context here. As stated, I am an undergraduate student studying Ecology and Conservation Biology. I’ve essentially no experience with large scale data management, and no experience using the LTER website or getting data from this site. In fact I had to google to find the LTER website since I have never used it before and thought everyone was saying LTR. In other words I am a newb at this. However I am armed with four and a half years of college experience (super seniors represent!), and I’m a millennial with the standard ‘navigating internet and sorting through stuff’ skills that are common to my generation. Someone with my education level, computer skills, and the reasonable level of guidance that I have should be able to navigate this site and find the information that I need. Here’s a step by step walkthrough of how successful I was at navigating these sites, what I found, and also some memes to express the feelings that arose during this experience.

The first thing I did was google for the LTER website, which takes me to the data portal. Now Christie wrote in the last blog post that this portal was, erm, less than helpful. But everyone else in the lab was busy when I started working on this and I didn’t want to interrupt anyone. So I got to find out about the data portal all on my own! I start out clicking on the advanced search option. According to the list Julia gave me there is probably a survey of small mammals in the Konza Prairie LTER site that is at least 12 years long. So I type in small mammals, select Konza Prairie, and I am presented with . . . this . .

tasia-blog1

As you can see, nowhere does it say how many years are included in the data set. It only lists the publication date, which is of no use to me if I want to know how long they studied something.  In order to find the length of study, I have to click on the title, scroll down, find the metadata report, click on that, and then scroll down to find the years.

I have spent literal hours over the last couple weeks going through searching for keywords that will hopefully bring up what I need, clicking on a title, then clicking on the metadata report, and then scrolling all the way down just to see something like this:

tasia-blog2

Or this:

tasia-blog3

This was a huge time suck and mildly frustrating to say the least. I was about ready to take my extensive credentials as a *checks notes* Mildly Annoyed Undergrad™ and march right up to the LTER office and demand they change their name to the Year Long Ecological Research Network. In fact I was so peeved I took a break to make this extremely niche meme that about 6 people will think is funny.

tasia-blog4.png

Finally though, after sifting through what feels like a million data sets, I find one that actually goes for more than a few years. A bit of clicking around brings me to an excel spreadsheet with the data on it.

Hallelujah!

Now I just need to determine if this data on small mammals is usable. Thankfully it looks complete and without any weird blank spaces or scary looking errors (an earlier excel file I found had an error code of  -99999 and that was a scary looking data sheet if I’d ever seen one).
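Codes like -99999 are less scary once they are turned into explicit blanks before any math happens. Here is a minimal sketch that assumes -99999 marks a missing value (the metadata should confirm that before anyone relies on it). The column names and numbers are made up.

```python
import csv
import io

# ASSUMPTION: these strings mean "no data"; check the metadata to be sure.
MISSING_CODES = {"-99999", ""}


def parse_count(cell):
    """Return the cell as a float, or None if it holds a missing-value code."""
    text = cell.strip()
    if text in MISSING_CODES:
        return None
    return float(text)


# A tiny made-up file standing in for the real spreadsheet export.
sample = io.StringIO("year,count\n1995,12\n1996,-99999\n1997,8\n")
rows = [
    {"year": int(r["year"]), "count": parse_count(r["count"])}
    for r in csv.DictReader(sample)
]

# Only the real observations go into the average.
observed = [r["count"] for r in rows if r["count"] is not None]
print(sum(observed) / len(observed))
```

Leaving the sentinel in place would drag any average far below zero, so it is worth handling as soon as the file is read.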

Most of the spreadsheet is logical. There’s the year (this was out of order but clearly labeled so it’s fine), the season, and a watershed ID number, all of that’s fine. Then we get to the actual data, it’s a whole bunch of acronyms, followed by a series of numbers.

tasia-blog5.jpg

Now it may be a cool science thing that I’m not privy to, but this spreadsheet is so full of acronyms that it’s essentially illegible to an outsider unfamiliar with the system. Isn’t the goal of these types of data sets to allow future scientists to come in and reuse the data with relative ease? Well, that’s what the metadata was for!

Thankfully there was an easy to find (it was not) and logically labeled (it also was not) file attached named knb-lter-knz.88.7.txt. This file is not to be confused with knb-lter-knz.88.7.report.xml or knb-lter-knz.88.7.xml, these other two files contain… information (it’s actually probably really important stuff but I don’t know what any of it means yet). Thankfully the metadata was mostly legible and explained the acronyms clearly. Took me a couple extra clicks but I think this data set will work for what I need!

Next steps are cleaning up the data!

 

 

 


Equity and Ethics in Environmental Data Science

Guest post by Kaitlin Stack Whitney

Recently Christie and Kaitlin had the privilege of participating in the first ever NSF INCLUDES funded Environmental Data Science Inclusion Network (EDSIN) conference, which took place in early April 2019 hosted by the National Ecological Observatory Network.

I (Kaitlin) had the great pleasure of being on a keynote panel focused on “further defining the problem space” and focused my comments on disability access and inclusion in academia and science more broadly. A much more in depth focus on those topics was presented by my colleague Dr. Drew Hasley during a plenary the next day. You can check out all the presenters and their presentations here.

Yet as the keynote plenary speaker and eminent scholar Dr. Carolyn Finney explained, problem space is probably the wrong term – and framing. Diversity, equity, and inclusion in environmental data science isn’t a problem! It doesn’t need fixing! Addressing it isn’t a problem solving exercise. In her words, “it’s not a problem to solve, it’s a process.”

Dr. Finney’s keynote was hands down the best start to a conference I have ever witnessed, and I took copious notes of the wisdom and experiences she generously shared with us. Some key takeaways (for me) that I want to share, as they inform my (our) collaborations and work:

  • “Outreach is outdated, individuals are fully formed.” How do we ensure that our outreach (especially Broader Impacts as a concept) acknowledges and honors that about everyone?
  • “You have to do something different, it’s not about being comfortable.” How do we work to make sure that our collaborators and trainees are comfortable and fully seen? Many of my colleagues, especially women of color, have never been fully comfortable or seen in their work. It’s not about my comfort and being a better ally means taking active steps not to center myself or my comfort.
  • “Diversity is not assimilation.” How do we respect and honor -dare I say cherish – difference in our collaborators and trainees? How do we make space for them, as opposed to only inviting them into pre-determined spaces?
  • “Privilege has the privilege of not seeing itself.” How do people with privilege learn about the things they have the privilege of not experiencing, let alone “see?” I am a white woman, a faculty member, and co-PI on a federally-funded-grant-project; I have a lot of privilege and power in relation to many people in my professional community. How do I educate myself to be a better ally and advocate without shifting the burden to marginalized colleagues who already do too much uncompensated service work?

I also presented a poster, “10 steps to make science (more) accessible” which has been a years-long collaborative effort with 4 colleagues across North America (in alpha order – Dr. Emilio Bruna, Dr. Simon J. Goring, Dr. Aerin Jacob, and Dr. Timothée Poisot). Check it out on FigShare and you can check out our corresponding preprint here. We’ll be making sure that our Bahlai and Whitney lab collaborations are multimodal and accessible. It’s a lifelong and continuous effort, but the focus of resources like these is to make clear that there are simple and quick steps that people can take even if they don’t have control over the publishing platform or conference format to increase access and inclusion.

We both also had the opportunity to participate – both as contributors and listeners – in a lot of excellent breakout sessions. Some of my favorites included those focused on disability inclusion and ethics in data science education. As faculty and researchers, how can we introduce ethical data collection and analysis to students from the beginning of their education in environmental science and data science, not as an add-on? Christie shared some great insights from her own teaching, and I brought some of those back to my own classroom this semester. There’s going to be a lot more to come out of our participation in EDSIN, both as the conference outcomes and collaborations develop further, and as I implement more of what I learned from new and old collaborators there. We’ll keep sharing.

Maybe you’re reading this blog generally for the quality bug counting content – and not sure how this fits in. But it’s at the heart of what we do – and it shapes our science. As Dr. Finney also said, “our relationships are worth the risk of going there, to discuss our differences.” People count the bugs and create the scientific community – I (we) must work harder and more effectively to ensure the field of environmental data science is ethical, equitable, diverse, and inclusive of everyone who wants to be in it.

You and everyone you know can join the EDSIN community to make environmental data science more accessible, inclusive, equitable, and awesome! Sign up on the QUBES platform: https://qubeshub.org/community/groups/edsin

The EDSIN conference is based upon work supported by the National Science Foundation under Grant No. 1812997. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


The calm before the (algorithmic) storm

Kaitlin, Julia and I have been busy. Not that you’d know. It’s been a lot of behind the scenes work so far on this project- it’s started with weekly meetings where we talk out our ideas and approaches. You’ll recall, the bad breakup project is an algorithm I designed to mine data from the perspective of a human who was looking to find statistically significant trends.  Essentially, it asks “Given a snippet of data, what conclusion would a typical biologist applying typical statistics make about these data?”* We were going to use the 58,000 datasets compiled through the US-LTER over its 40 year history as our data fuel- with the longest series (the complete record) serving as a proxy for truth.**

Turns out, this problem that we’re working on is pretty big, so establishing exactly how to approach these 58,000 archival  datasets  is our main challenge. It looks kinda like this:


A crudely constructed powerpoint flowchart. Top row: Hypothetical workflow- Get data > put data in the thing > ?????? > Profit!  Bottom row: Actual workflow-  ?????? > Put data in the thing > ?????? > Profit!

Our collaborator, Sarah Cusser at Michigan State, has had amazing successes in the context of a ‘deep look’ at a single long term experiment- she has been examining how long we have to watch a system to see treatment differences across several common agricultural practices- and how consistent this effect is. Her findings, in prep right now, can be used to make recommendations about how we make recommendations to farmers- essentially when can we be confident our recommendations are right, and when should we moderate our confidence, when guiding how farmers select practices.

Building on the idea of the deep dive, Kaitlin got an idea- what if we specifically sought out datasets that document tritrophic interactions from the LTER- we could use these focal datasets to examine within-site patterns between trophic levels- i.e. do misleading results travel together between trophic levels? A brilliant idea, so Julia sat down and started combing the organismal abundance/plant biomass data across the LTER- and proceeded to spend a lot of time spinning her wheels.

The challenge, it seems, is also one of LTER’s strengths.  LTER data is available through two paths- in data archives set up by individual sites, and centrally, through a portal at DataONE. There is a heck of a lot of stuff going on at each individual LTER site, which is so, so awesome and cool and I love learning from it***, but, given that each site is so unique, navigating between them to try and find information which supports synthetic approaches is…okay, I’m not going to mince words, not awesome****. Julia has been working on LTER related projects for nearly a decade now***** and found that the DataONE catalog was fine if she knew what she was looking for, but it wasn’t easy to browse or to extract meaning from what turned up. She quickly found the individual sites were, generally, superior from a browsing perspective, because each site gave a bit of context about what the sites actually are, what the major experiments were, and their data. The data catalogs at each site, though, were all slightly different, and thus varied in their ease of navigation, so the approach had to change with every move between sites.  So she’s still chipping away at this.

Meanwhile, though, something big hit the news. The INSECT DECLINE THING.

Oh boy.  So, I am going to summon the full authority of my position as your friendly local data scientist, insect ecologist and time series expert. This study has some serious design flaws and its results are unreliable******.  Manu Saunders wrote an excellent blog post on the subject so I won’t go into it in detail here.  But our group sat down and discussed the study at one of our regular meetings, and realized we were perhaps in one of the best positions to critically, quantitatively examine these claims- so we got to work. More on this soon.

Soon after we got started, a reporter from Discover magazine reached out to Kaitlin for comment on the story, and a truly excellent article resulted.

To quote a quote from the article:

“You can’t just draw a line through some data points, take it down to zero and say, ‘Right, that’s how long we’ve got,’” Broad says. “That’s not how stats works, it’s not how insect populations work, either.”

Later in the article,  Kaitlin highlights how we’re addressing these problems in the work we’re doing– and examines why the scientists making these dire extrapolations are reaching the conclusions they do.

I’m really excited to see what we find.

——-

*  my long term goal is to replace typical biologists making typical conclusions with about 185 lines of well commented R code by 2027

** It’s a proxy because even the longest time series I have is a snippet of the whole story. Ever since I started partying with sociologists, I find I use phrases like “proxy for truth” in  my day-to-day more than I’d like to admit. Wait till we write our book together.

*** Can you tell that I have a deep, deep, identity-defining love for LTER? Because I do.

**** MAXIMUM CANADIAN SHADE.

***** Just to make you feel old, Juj.

****** DOUBLE MAXIMUM CANADIAN SHADE. NOT MANY CAN SURVIVE THIS LEVEL OF SHADE


Bad breakups (with your data)

Hi all,

It’s been a long time. It turns out this assistant professoring thing does not leave me with a lot of time. Hmm. Who knew? Since we last spoke, I’ve been building my lab- both the physical space:

and the online infrastructure.

I’m building collaborations with friends and colleagues all over the place, and working to help finish the student projects that I became involved with through my previous positions at Michigan State. I’m also working hard to get my uniquely Bahlai Lab research vision off the ground. A big part of that is people.

I have people now! Julia, my long-suffering technician, puts up with my barrage of ridiculous ideas and helps me bring the vision to reality. She’s also in training to be a butt-kicking librarian. Cheyan, PhD student, is studying how ‘non-traditional’ data sources (with a focus on citizen science) can be used to develop and engage people in long term ecosystem management. Katie, PhD Student, is studying how we can measure insect mediated ecosystem services and functions in green infrastructure projects. Christian, PhD Student, is examining the use cases and factors affecting the quality and quantity of citizen science data, in the context of Odonate conservation under climate and habitat change.*  (Yes, you counted right- that’s 3 PhD students already). And, my undergraduate project student, Erin, will be working with me on my next big thing.

Which brings us to IT. My Next big thing.

A while back, I had an idea. It wasn’t completely my idea- it came out of conversations with a few people. As you know, I’m interested in big(ish) data- finding trends from patterns we see when we put together a lot of information about a system.** But the big is a combination of a lot of littles.***

When we look at systems for a while we get to see a lot more of the whole, big, messy variability of a system. I’ll illustrate with an example.

Y’all know about the fireflies. No? Okay, it’s been a while and I can remind you about the fireflies. I recorded a video****:

TL;DW- My Reproducible Quantitative Methods class produced a paper about firefly phenology.  Fancy people liked it and it got media attention.

During my interview about the piece, the reporter kept going back to the idea of trajectory. Yes, sure, phenology, cool, but what is the *trajectory* of firefly populations? ARE firefly populations in decline?

Here I had one of the longest time series known to science documenting systematic collection of fireflies, and I could not answer this seemingly simple question. For your reference, here is the data from our site, grouped by plant community of capture:

[Figure: firefly captures at our site, grouped by plant community of capture]

My reply to the reporter was “I don’t know. But if I had less data I would tell you, and I’d be surer of my answer.”

A little tongue in cheek, to be certain, but isn’t that what we’re doing every day in science? One of the fundamental questions we ask in ecology is where is my system going? and we’re making extrapolations based on the data we have available. We know it’s not always the right thing to do, but we do our best, looking at the world through the limited windows available to us. In ecology, the three year study is pretty much the standard:

We know that this is problematic. This is why the USLTER network exists. People get that. But we’ve still got to do work in the shorter time scales. We gotta graduate students. My grants don’t go on forever. We can learn lots of things from studying systems in the short term.

But.

How do we know when these short term studies are misleading us? What are the effects of the time period we’re looking at? The length of time we’re watching, and the type of process? How often we’re measuring? And how the heck can we test this, if we’re mostly doing short term studies?

Friends, I had an idea. Why not re-analyse long time series data– as if they were short term data? Break it up in all sorts of objectively bad ways (THERE! I EXPLAINED THE POST TITLE), analyse using standard statistical methods, collect these statistics up, and look for trends in conclusions we reach, given different ways of collecting the data?
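
One way to picture the procedure above is the following minimal Python sketch. It is not the project's actual algorithm: the function names (`slope_and_se`, `verdict`, `bad_breakups`), the default window length, and the rough "slope bigger than two standard errors" rule standing in for a proper significance test are all assumptions made for illustration.

```python
def slope_and_se(years, values):
    """Ordinary least-squares slope of values on years, with its standard error."""
    n = len(years)
    x_bar = sum(years) / n
    y_bar = sum(values) / n
    sxx = sum((x - x_bar) ** 2 for x in years)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, values))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, se


def verdict(years, values):
    """Crude stand-in for 'what would a typical analysis conclude?'

    Assumption: a slope counts as a trend if it exceeds two standard errors.
    A real analysis would use the t distribution, which matters a lot
    for very short windows.
    """
    slope, se = slope_and_se(years, values)
    if abs(slope) <= 2 * se:
        return "no trend"
    return "increasing" if slope > 0 else "decreasing"


def bad_breakups(years, values, window=4):
    """Break a long series into every run of `window` consecutive points
    and record the verdict each short 'study' would have reached."""
    if window < 3:
        raise ValueError("need at least 3 points to estimate a standard error")
    results = []
    for start in range(len(years) - window + 1):
        ys = years[start:start + window]
        vs = values[start:start + window]
        results.append((ys[0], ys[-1], verdict(ys, vs)))
    return results
```

Running `bad_breakups` over several window lengths and tallying the verdicts gives a first look at how the conclusion depends on when, and for how long, someone happened to be watching.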

All this would take would be a relatively simple algorithm, a whole pile of time series data, and some money, time and patience for the personnel to drop data in and collect the stuff that comes out of the algorithm machine. I can write an algorithm, and hey, the USLTER has lots of data that would be appropriate to get this done, but the latter components are a little harder for a new professor to come by.  So, I put it on the back burner.

Anyway, this summer, my friend and collaborator Kaitlin Stack Whitney brought this grant opportunity to my attention.

EAGER proposals for high-risk/high-reward innovative studies that address development and testing of important science and engineering ideas and theories through use of existing data. […..] proposals must:

Involve, for data proposed for use, publicly-available data generated through NSF funding; and

Agree to make public the details about their experiences reusing the data, including especially challenges associated with that reuse.

!!!

Hey! I, in fact, am a professional at reusing data produced by NSF projects and publicly documenting my experiences using said data! I [cough] kinda have a blog about it. So me, Kaitlin, and my technician Julia sat down.

We wrote a proposal.

And it got funded.

[GIF: Tina Fey giving herself a high five]

Three junior women scientists getting an NSF award? No Big Deal.***** We’re getting this project underway, now. My undergraduate, Erin, will be gathering candidate data sets for trial bulk runs of the algorithm over the winter semester. Collaborators Sarah Cusser and Nick Haddad at Michigan State are using the algorithm on a focal dataset to do a deep dive into how patterns of observations affect conclusions in agricultural systems. Basically, we’re going to figure out once and for all- how often are we wrong when we look at our data?

This is going to be big, my friends, stay tuned.

*Note to self, get Christian on the website!! another item for The List.

**thank you for coming to my TED talk that’s not actually a TED talk.

*** you can put that wisdom on my tombstone

****thank you for coming to my other TED talk that’s not actually a TED talk.

*****This is a big deal and I am pretty excited about it.


Official Blog-nouncement

Hi, all.

So I’ve been very quiet lately here on the blog. This is, largely, because: academic job market. My search kicked up a notch this past semester, and 1) I was very busy with travelling for job interviews and 2) it’s considered gauche to talk about interviews while they’re ongoing. I was also busy with other things (y’know, research and teaching), and I’ll get into that in a bit, but for now, here’s the big announcement.


Tah-dahhh!

The Bahlai Lab of Applied Quantitative Ecology starts this fall at Kent State University, in the Department of Biological Sciences. It’s not a coincidence that the lab bears my name. It’s my lab.1

In the Bahlai Lab2, we focus on developing tools and metrics for better understanding the functional ecology of communities over time. Our two major areas of research are currently:
1) Studying the responses of insect predator-prey communities to disturbance, such as invasions and climate change
-and-
2) Developing break-point analysis tools to better quantify the impacts of change in long term ecological observations.

But more on that later. Lots, lots more 🙂

Incidentally, now that I’m on the other side, I’m allowed to talk about fight club, er, the academic job market, so I wrote a piece for American Scientist about it. It seems to be resonating with people, so go check it out.

The past semester was very full, as I mentioned. I taught another offering of Reproducible Quantitative Methods, this time to a group of 20. Projects (which are still ongoing) included examining the role of environmental drivers in rockhopper penguin egg size, a community and functional analysis of landscape and climate impacts on Midwest aphid populations, an integrated population model to better understand the trajectory of, and constraints on, American woodcock population dynamics, and a data rescue project to digitally preserve the National Eutrophication database.

In my personal research, I’m also working on building out my technique for detecting regime shifts (i.e., changes to the rules governing population regulation) in dynamic populations.  Currently (like, literally, iterations 201-250 are running right now), I’m running simulated data through the model to determine which conditions it works best under. I’ll be applying it to case studies from insect populations, and seeing what we can see.
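
The simulate-then-detect loop can be sketched in a few lines of Python. To be clear about the assumptions: the model in the post is not described here, so this swaps in the simplest possible stand-in, a single shift in the mean found by least squares, rather than a change in population regulation. The function names (`simulate_shift`, `best_break`) and all parameter values are hypothetical.

```python
import random


def simulate_shift(n_years=30, shift_at=15, before=10.0, after=30.0,
                   noise_sd=1.0, seed=1):
    """Simulated counts whose mean jumps from `before` to `after` at `shift_at`."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    return [(before if t < shift_at else after) + rng.gauss(0, noise_sd)
            for t in range(n_years)]


def sum_sq(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_break(values, min_segment=3):
    """Index that best splits the series into two constant-mean segments.

    Tries every split that leaves at least `min_segment` points on each side
    and keeps the one with the smallest total within-segment squared error.
    """
    best_k, best_rss = None, float("inf")
    for k in range(min_segment, len(values) - min_segment + 1):
        rss = sum_sq(values[:k]) + sum_sq(values[k:])
        if rss < best_rss:
            best_k, best_rss = k, rss
    return best_k
```

Repeating `best_break(simulate_shift(...))` while varying the noise, the size of the jump, and the series length is one way to learn which conditions a detector works best under.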

It’s going to be an exciting few months, as I transition from Mid-Michigan to Northeast Ohio. 🙂

1. [barely audible squeal of joy]
2. I like saying it. In the Bahlai Lab… Welcome to the Bahlai lab…
