Is accessibility part of “open” science?

Today’s post is a guest blog post by Jay Wickard, an undergraduate research student at Rochester Institute of Technology, who is working with the ‘Managing Our Expectations’ team this summer to examine accessibility (or the lack thereof) as a dimension of open and reproducible science.

Hey everyone, this past month I have been doing research on the accessibility guidelines found in various scientific journals. The one similarity connecting all of them is that they are open access journals. These journals have been instrumental in fostering inclusivity within the scientific world, but are they as accessible as they claim to be? The term “accessibility” has multiple competing definitions, which inevitably leads to confusion. In many scientific circles, “accessibility” refers to the process of obtaining or accessing data. For example, saying “the data are accessible” might mean the data are posted online or available at no cost. This is often the meaning within open access statements and policies, but it isn’t the only definition we should be taking into consideration. Accessibility also means being usable by people with disabilities.

Just how many “Open Access Journals” provide disability accessibility to their readers? My research aimed to explore this question by analyzing the guidelines for submissions/authors and the open statement/vision sections of three hundred “open” scientific journals, which represented a wide variety of topics from across the world. My focus was on guidelines about image accessibility: I was specifically searching for requirements like alternative text, captions, legends, or titles for figures, as well as image requirements that mentioned contrast, color, and size. Within the open access statements, I was looking for mentions of disability or accessibility as part of each journal’s conceived understanding of open science and inclusion.

While I wish I could say I was surprised by my findings, I have to admit that the lack of specific accessibility guidelines became a pervasive issue within my data set. Most of the journals in the dataset only included general image guidelines that focused on sizing and sometimes resolution. When color choice was referenced, it was either in relation to accessibility for those with color blindness or in reference to a preferred color model (typically RGB). In the table below, one can see the extent of this amongst the journals in my dataset.

Author / Submission Guidelines:

| | Image Guidelines | Alternative Text | Image Contrast | Color Choice | Description (captions, titles, legends) |
| --- | --- | --- | --- | --- | --- |
| Journals With: | 213 | 0 | 6 | 38 | 139 |
| Journals Without: | 76 | 289 | 283 | 251 | 150 |

Open Access Statements:

| | Access/Accessibility | Disabled/Disability | Inclusive/Inclusion |
| --- | --- | --- | --- |
| Journals With: | 228 | 0 | 2 |
| Journals Without: | 70 | 298 | 296 |
Table 1: Data on the accessibility guidelines found within open access scientific journals. Journals either had the above requirements present in their guidelines, or they were marked as being without them. Note that only 289 journals (of 300) had complete author guidelines, and 298 journals had complete open access statements.

The journal that appeared to be the most accessible out of my data set was Living Reviews in Relativity, a physics journal from Germany published by SpringerOpen. This journal had all the author guideline requirements listed in the table except for alternative text. Within the guidelines themselves was a section on accessibility, which gave authors advice on ways to ensure their images were viewable and on using universal design principles to be accessible to more people.

Journals published by Elsevier also commonly had accessibility guidelines specifically geared toward those with vision-related disabilities. All eighteen Elsevier journals within my dataset had sections providing information on colorblindness, along with tips on how to pick colors that are easy for everyone to view. However, not all journals were as accommodating as the ones mentioned above.

The lack of accessibility guidelines is striking. Some journals made reference to figures and images needing detailed captions; however, none of the journals in my dataset required alternative text for their images. While there are text-to-speech devices and software known as screen readers, these cannot help if there is no alternative text to be read in the first place. As scientists, we all know how important figures can be in understanding research. Not knowing what is within an image directly impacts your understanding of the material. So why do so many journals allow authors to submit and publish work without the formatting that screen reader users may want or need?

In my opinion, this issue of accessibility is one caused by ignorance. Be honest: how many times have you even thought about whether an article you were reading was accessible? When was the last time you included a caption or alternative text with an image you posted online? Even I admit that I have been ignorant of this issue. Our society instills in us an idea of what is normal, and we use this framework to make advancements toward the future. However, as time passes, we begin to see the ways in which our frameworks of normalcy lie to us and fail us. People with disabilities are often left out of this false notion of normalcy, which explains why many of them are left out of online discourse, including science.

While researching these different journals, I started to find a pattern in the ways we structure scientific literature. The expectations put forth in the three hundred journals I researched were almost always the same. There were even cases where different journals had exactly the same guidelines. It is safe to say that the guidelines authors are following today could be the same guidelines that journals have defaulted to for years. It’s my suspicion that the reason these guidelines are not more inclusive is simply because they are out of date and in desperate need of updating. It’s just doing things “the way they’ve always been done.”

Whether this lack of accessibility is on purpose or not, it still has real consequences for real people. By not providing what is needed, the scientific community is actively gatekeeping and preventing people from interacting with data and one another. Science is all about connection and understanding. That is why I think the lack of accessibility within science is such an important issue. The inaccessibility prevents connection, which in turn limits our understanding of ourselves and the universe.

Moving forward, we need updated and more inclusive guidelines for things like publishing in scientific journals. Authors should be required to provide alternative text for their images, and they should also use colorblind-friendly color palettes. These are just a few of the many steps that can be taken to increase the accessibility of online spaces. But what is most important is that we actively listen to the demands and needs of scientists with disabilities. I am not the first person to bring up accessibility issues within science, and I will not be the last. In the hopes of highlighting first-person perspectives, I have included some further information on this topic below.

So, where does this research go from here? More analysis needs to be done on both the Author Guidelines and the Open Access statements, so that better comparisons can be made across the hundreds of journals. I hope to find patterns in phrasing and other similarities. I also hope that this research will encourage more people to make their content more accessible. Even if the journal publishers do not require alternative text, or any other universal design principles, that doesn’t mean you can’t include them yourself.

For more information on disabilities and accessibility in STEM, check out these resources:


Re-thinking what we think we know about insect declines

Happy pollinator week! Understandably, during this week there is a lot of attention on the decline of pollinating insects and the actions that everyone can take (individual, community, regulatory, structural) to address these threats. (I would be remiss here not to say: go read my new pre-print analyzing the US state pollinator plans and how they align – or not – with best practices in evidence-based policymaking.)

This conversation about pollinator declines often goes hand in hand with concern over the declines of insects more broadly. As some recent major papers have claimed, there is evidence for large-scale and ongoing insect declines across all major orders. Or is there? There have also been responses and rebuttals to those claims – not necessarily denying the existence of insect declines, but pointing out how critical it is to conduct these studies in ways that rigorously and appropriately test the question.

One recent paper that generated a lot of media attention and discussion in the research community was published in early 2019; I am colloquially referring to it as The Insect Decline Study. The attention from the media mainly focused on the study’s claim that the world would experience “loss of all insects within 100 years.” The attention from the research community was focused elsewhere – on why that astounding claim is very likely false. Wagner (and many others) pointed out the significant limitations of requiring the keyword search to include “declin*.” Mupepele et al pointed out that the rate of decline was calculated as “percentage of species declining per year” across studies that measured populations and changes in many different ways and with varying sampling efforts. Saunders wrote about the complexities of studying declines, especially when most academic journals are “averse to publishing null results” – introducing another survey/data bias into the mix. [Despite these issues, the paper has now been cited 1197 times according to Google Scholar – and 641 times in the Web of Science database, as of today.]

The data the authors used to generate their conclusions came from existing studies – as many of the insect decline studies do. So since we’re working on a project about reproducibility in biology and we work with insects, this study seemed like a perfect case to dig into and unpack a bit more. Our original idea was to use the broken windows algorithm that Christie developed on the data in this paper (see our recent paper on it here). This seemed like a great idea: the broken windows algorithm is designed for long term studies, and the stated objective of The Insect Decline Study was “compiling all long-term insect surveys conducted over the past 40 years,” so we expected that the study would include lots of long term datasets. We would then examine how long term datasets about insect populations, when chunked up into a variety of short term bins, did or didn’t support conclusions about declines versus other kinds of trajectories. This is critical, since the vast majority of empirical work on insect populations consists of short term studies.

So we started the seemingly innocent task of finding all the underlying studies in The Insect Decline Study and extracting their underlying data. Easy, right? Then plug it in, right?

Well, in practice, this wasn’t possible, for reasons I’ll explain below. It’s been a wild ride exploring this data – and even though we couldn’t use the broken windows algorithm, we believe it’s still a great case for exploring reproducibility, the science of insect declines, population trajectories – and just generally working with data. (For more on that, go check out Christie’s excellent new podcast, How Do You Know?)

Figure caption / alt text = customized meme of the Natalie Portman / Anakin Star Wars meme. Top left quadrant has male character and text box reading “Insects are declining”. Top right quadrant then has woman character with a big smile and a text box saying “You used long term studies, right?” Then the lower left quadrant has the male character staring with no reply. The last, lower right quadrant has the woman character looking back with angst and the same text “You used long term studies, right?”

The underlying data

Others have already mentioned the limitation that The Insect Decline Study was only looking for declines in its literature review. Yet there were immediately other problems in understanding which papers and datasets were included. The methods section lists which keywords were used in a specific database, along with some additional parameters. It doesn’t mention what date the search was run – but more importantly, there’s a key sentence after that: “Additional papers were obtained from the literature references.”

But the study and its supplemental data never explain which papers/datasets came from the literature review – and which came from this ‘snowball’ approach. It does potentially explain why there are studies included that would be unlikely to have appeared in their keyword search – such as studies from the 1990s in Czech and Italian, a UK natural history pamphlet, or an entire book in Swedish on bark beetles. But if they wanted to use references within references to find more sources, potentially to ensure a representative sample, why stop there? As Christie has previously noted, major US long term insect datasets were not included.

The study also never included full references for the datasets and papers used. The Supplemental Data table provides only Author and Year – no dataset or paper titles, no journals. So for today, I’ll focus on the references we could find and translate (one reference/dataset was never found, and four we have but have yet to translate). I’ll quickly run through some of the key findings from our exploration below – we’ll be sharing lots more on this soon.

Temporal sampling methods

So it was hard to find the data and hard to understand why this data had been selected. This was even more true after assessing the lengths of the underlying datasets. Analyzing insect declines requires insect population data over time – so we would expect this study to include lots of underlying data with repeated measures, ideally over a long period of time. Yet we found that this was overwhelmingly not the case. A few had annual data – but under 10 years; just as many had sporadic data collection under 10 years. Only about a fifth of the datasets had data collected for more than 10 years – but it was sporadic (meaning not every year). Another fifth had a ‘snapshot’ – meaning only one time point. And lots of other underlying datasets had no temporal sampling methods to speak of, based on how they used literature or other methods – it was just not part of their framing. Bye-bye, dreams of using the broken windows algorithm – this paper, despite its stated objective, just did not compile long term, publicly available insect population datasets.

Geographic scopes

And while the study made claims about global insect declines, most of the underlying datasets were not conducted at the global scale. Several of the studies covered an individual field, or a handful of fields, in a region. About a third were regional studies. And almost half the references were conducted at the scale of one country. However, the size of a country varies greatly! These ranged from the UK to Japan to New Zealand to Brazil – and more. Roughly a quarter of the references included data from multiple countries. Yet as we know from other responses that have been written, including multiple countries doesn’t mean those countries best represent the diversity or distributions of the taxa the paper was assessing – or insects generally.

Response variables

If you’re studying insect declines, you’d want to use papers that document population abundance over time, right? Well, The Insect Decline Study was ostensibly about species and threat status – not actually a test of declines. So it’s unfortunately both surprising – and very much not – that the response variables in the underlying studies were not abundance metrics. Many datasets discussed IUCN red list threat status and had no quantitative information. Others were numbers of occurrences or numbers of species. Some were community composition and distribution; many were species richness.

This goes hand in hand with many of the underlying papers/datasets themselves not being empirical studies. A significant number were literature reviews or other publications that were descriptive lists or records of species presence or range. Even when the underlying datasets were quantitative or empirical studies, they were carried out and analyzed in a wide variety of ways: citizen science datasets, museum records, surveys (netting, kicknets, etc), or data compiled from other studies. Can scientists draw conclusions across a range of sampling methods? Absolutely – it just needs to be done with transparency, care, and parameters for uncertainty, sampling effort, and other factors.

So are insects declining? Are pollinators declining? There is absolutely reason to be concerned about this – and to fight myths about pollinators and use best practices to combat declines, as Dr. Sheila Colla discusses here and in her scholarship. Yet as previous work from Christie, this team, and others has discussed – it’s critical to test those questions and make those claims using the best available data and careful analytical approaches.


How does sampling design impact what we conclude about deer tick populations?

Today’s post is a guest post by Rowan Christie, a RIT Bioinformatics BS student and RIT/NTID RISE research fellow. Rowan has been working for the past year on the NSF-funded ‘Managing Our Expectations’ project with Christie and Kaitlin. Check out Rowan’s work and progress on GitHub and FigShare. This work updates Rowan’s earlier post, which you can read here.

Deer ticks are vectors of Lyme disease, a debilitating illness that can cause fever, muscle aches, facial paralysis, and other symptoms that seriously affect many people’s lives.

Image credit: Global Lyme Alliance (https://globallymealliance.org/about-lyme/diagnosis/symptoms/)

In addition, deer ticks may also be spreading – the CDC reports that the number of US counties with blacklegged ticks has more than doubled over the past twenty years. You can learn more about the CDC’s deer tick surveillance in their report here (https://www.cdc.gov/ticks/resources/TickSurveillance_Iscapularis-P.pdf).

For these reasons, deer ticks are a major public health concern. To better understand risks to public health, it is important to study deer tick abundance and population trajectories.

Many studies have already measured deer tick abundance trends and the occurrence of pathogens within ticks. However, one challenge is that most biological studies are short term (~3 years). This can be problematic because the trends observed may not be indicative of longer-term patterns, and could be only a small variation on a much larger temporal scale. So you might assume that an upward trend indicates a major increase in tick abundance. However, when the rest of the data is present…

Timeseries showing the density per m² of Ixodes scapularis (deer ticks) by year in all grids (plots) of Cary Forest, NY.

You may find that the upward trend is not particularly significant compared to the rest of the data – so essentially, you are missing the big picture. Because of this, we decided to focus on long term datasets and investigate how the stability of patterns within a dataset responded to the number of years in the dataset.

Tick studies also vary in methodology, making it difficult to weigh their relative quality of evidence. For example, studies use different sampling techniques, such as dragging (shown in the first picture above), where you take a large sheet and drag it around a site to pick up and count the ticks in a given location, and public surveys, an opportunistic method where people send in reports of ticks found on themselves. Some studies focus on different life stages: larval (youngest), nymph (middle stage), and adult (final stage). Studies can also differ in geographic scope – for example, some collect data at a county-level scale while others provide data at a more specific level like plot or state forest. Differences between study methods could be influencing the results we observe and the conclusions we reach, so comparing studies that differ in, say, sampling technique may be misleading: a standardized technique like dragging may produce rather different results than an opportunistic technique like people sending in reports of ticks they found on themselves.

Thus, our objective was to investigate how study factors such as length, life stage, sampling technique, and geographic scope influence deer tick abundance trends. Several studies have already investigated how study length impacts the ability to detect consistent differences between samples/datasets. There is support for long term datasets being more likely to lead to stable patterns, according to this study.

To analyze patterns between datasets, we need to consider statistics such as years to stability – the number of years it took for a dataset to reach a stable pattern – which reflects the consistency and strength of the trend. For example, the longer it takes a dataset to reach stability, the more support this provides for long term datasets, because they are more likely to reach stability. Similarly, other study factors can also influence how likely a dataset is to reach stability. So our hypotheses were:

  • Longer deer tick datasets would have more consistent population trends, so they are more likely to reach stability;
  • Studies using dragging would have more consistent population trends, so they will vary less in stability time.

First, we searched for publicly available datasets from observational studies from data repositories such as LTER, DataDryad, NCBI, DataOne, Google Datasets, and various department of health websites from different states in the US that measured count or density data of deer ticks at least annually for 10 or more years.

The datasets we collected came from New York, Massachusetts, New Jersey, Iowa, and Connecticut; ranged from 9 to 24 years long between 1995 and 2017; included larval, nymph, and adult life stages; and were recorded at grid (plot), state forest, town, and county scales.

To test what led to stable population trends, we used the ‘bad breakup’ algorithm in R, developed by Dr. Christie Bahlai of Kent State University, to model every subset of data greater than two years within each dataset and determine whether or not each subset was statistically significant – thus determining the number of years it takes for a dataset to reach a stable pattern.

First, the data (an example dataset of tick counts in Bethany, CT, is shown below) are run through a standardization algorithm to normalize them. This makes it easier to compare tick count/density data with very different magnitudes and minimizes the impact of different units (such as ticks per square meter versus counts of ticks people found on themselves) on the observed trends.
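The exact standardization lives in the bad breakup repo (in R); as a minimal Python sketch, assuming a simple z-score standardization (center on the mean, scale by the standard deviation), this step might look like:

```python
import numpy as np

def standardize(counts):
    """Z-score standardization: subtract the mean and divide by the
    standard deviation, so series with very different magnitudes or
    units (ticks per square meter vs. raw counts) become comparable."""
    counts = np.asarray(counts, dtype=float)
    return (counts - counts.mean()) / counts.std(ddof=1)

# hypothetical yearly tick counts, just for illustration
print(standardize([12, 40, 7, 55]))
```

The standardized series always has mean 0 and unit standard deviation, so slopes fitted to different datasets are on a comparable scale.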

Then the algorithm iterates through the data (as shown below), and each interval is run through a linear model. This linear model calculates the slope statistic – the change in standardized density over the change in year. So, say the interval was this:

| Year | stand.response |
| --- | --- |
| 1996 | 0.446773 |
| 1997 | -0.29591 |
| 1998 | 0.168265 |

The linear model will calculate the slope of the fitted line based on this interval, like this:

And the slope of that line (which is -0.139254) is the statistic we are using. The standard error of this slope, the p-value, and r² are also calculated.
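The original algorithm is written in R; here is a minimal Python sketch of the same calculation, fitting an ordinary least squares line to the three-year interval above and recovering the slope reported in the post:

```python
import numpy as np

# the three-year interval from the table above
years = np.array([1996, 1997, 1998])
stand_response = np.array([0.446773, -0.29591, 0.168265])

# fit a degree-1 polynomial, i.e., a simple linear model
slope, intercept = np.polyfit(years, stand_response, 1)
print(round(slope, 6))  # -0.139254
```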

The pyramid plot shown is a way to represent these summary statistics: the slope and standard error per year are visualized. A red cross indicates insignificance (p > 0.05) and a black circle indicates significance (p < 0.05).

We also needed a way to pull relevant metrics out of the computation: stability time, relative range, absolute range, proportion significant.  You can see how these statistics are calculated in Dr. Christie’s bad breakup github repo.

Stability can be defined as greater than some percentage of slopes occurring within the standard deviation of the slope of the longest series, for a given window length.

The absolute range of significant findings, and the relative range – the absolute over- and under-estimate compared to the slope of the longest series – were also computed.

Proportion significant is the proportion of total windows with statistically significant values.

Proportion significantly wrong is ‘directionally wrong’: the proportion of total windows where there is a significant relationship that does not match the direction of the true slope. We also included another statistic, proportion significantly right, which is essentially its complement, equal to 1 − proportion significantly wrong.
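Putting the pieces together, here is a hedged Python sketch of how these window statistics and summary metrics might be computed. The real implementation is the R code in the bad breakup repo; the window loop, the 95% threshold, and the use of the full-series standard error as the stability band are my assumptions for illustration:

```python
import numpy as np
from scipy import stats

def window_stats(years, y, min_len=3):
    """Fit a linear model to every contiguous window of at least
    min_len years; return (window length, slope, p-value) per window."""
    out = []
    n = len(years)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            fit = stats.linregress(years[start:end], y[start:end])
            out.append((end - start, fit.slope, fit.pvalue))
    return out

def summary_metrics(windows, true_slope, true_se, threshold=0.95):
    """Stability time, proportion significant, and proportion
    significantly wrong, roughly as described in the post."""
    # stability time: shortest window length at which at least
    # `threshold` of slopes fall within one standard error of the
    # slope of the longest (full) series
    stability_time = None
    for length in sorted({w for w, _, _ in windows}):
        slopes = [s for w, s, _ in windows if w == length]
        frac_ok = np.mean([abs(s - true_slope) <= true_se for s in slopes])
        if frac_ok >= threshold:
            stability_time = length
            break
    sig = [(s, p) for _, s, p in windows if p < 0.05]
    prop_sig = len(sig) / len(windows)
    # "significantly wrong": significant but opposite in sign to the true slope
    prop_sig_wrong = sum(s * true_slope < 0 for s, _ in sig) / len(windows)
    return stability_time, prop_sig, prop_sig_wrong
```

Run on a synthetic trending series, `stability_time` reports how many years of data a trend needs before short windows reliably agree with the full series.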

We calculated all of these metrics for every dataset we had collected – across all locations and life stages, and whether or not the dataset included the percentage of ticks infected with Borrelia burgdorferi. We recorded these stats in a dataset of their own and ended up with 289 records. So basically, we had a dataset of stats we could analyze – stats on stats.

To analyze the results, we first looked at the frequencies of years to reach stability, to see when the datasets reached stability (i.e., the length of time needed for any given observation period to reflect the same trend as the longest time series) overall. We found that none of the studies we looked at reached stable patterns in under four years. This is critical to note because it indicates that studies with less than five years of data are unlikely to reach stable patterns – and may thus be insufficient to characterize tick dynamics. This supports our first hypothesis and the case for pursuing longer term studies, as they are more likely to be accurate and to show stable patterns.

Figure 1: Line chart showing years to reach stability for all datasets. Most of our datasets reached stability within 5 to 10 years. None of the datasets reached stability under four years.

To test our second hypothesis that dragging leads to more stable trends, we compared years to stability between the sampling techniques dragging (standardized) and reports from people having found ticks on themselves (opportunistic). We found that data produced by dragging varies less in stability time (t(236.23) = -8.5346, p < 0.05). This difference is largely explained by variability: opportunistic sampling has a wider range of stability times, which means it is more likely to vary in reliability – supporting our second hypothesis.

Figure 3: Boxplot showing how stability time differs between two sampling techniques: dragging (standardized) and reports from people having found ticks on themselves (opportunistic). The opportunistic sampling technique appears to have a wider range of times to reach stability.
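As an illustrative sketch of this kind of comparison (with made-up numbers – the real stability times come from our 289 records), a Welch's two-sample t-test like the one reported above could be run as:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# hypothetical stability times in years: dragging (standardized) is
# tighter; opportunistic reports have a wider spread
dragging = rng.normal(6, 1, 120)
opportunistic = rng.normal(9, 3, 120)

# Welch's t-test (equal_var=False) tolerates the unequal variances
t_stat, p_val = stats.ttest_ind(dragging, opportunistic, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_val:.3g}")
```

Welch's variant is the safer default here precisely because the two techniques differ in spread, which an equal-variance t-test assumes away.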

Another important factor to consider is the life stage of the tick. Many studies collect data for each life stage, but some focus only on nymphs or adults, or don’t record the life stage at all. Results may differ between life stages, which could obscure patterns and lead studies to different conclusions. To investigate this possibility, we looked at how life stages differed in stability with the available data. We found that surveys for adults and nymphs, adults and larvae, and nymphs and larvae did not differ in stability time (t(126.84) = -5.9627, p<0.05; t(10.111) = -5.9627, p<0.05; t(10.54) = -5.5196, p<0.05, respectively). This suggests that all life stages behave similarly with regard to abundance.

Figure 5: Line graph of the stability time between each of the different tick life stages (nymph, larvae, adult) by the number of datasets. Life stages are color coded (nymph is blue, larvae is orange, adult is green). 

We also analyzed geographic scope and the occurrence of pathogens that some studies recorded, and we compared proportion significantly wrong and right.

Our main takeaway is that long term studies are very useful for understanding long term population dynamics and are more likely to reach a stable pattern. We should be cautious about using short term studies (especially those under 5 years) to interpret longer term population trends of deer ticks, because they may produce misleading results. In addition, dragging proved to be a more reliable method for reaching consistent results, providing support for using standardized methods.

Rowan is supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R25GM122672. Kaitlin and Christie are supported by the Office of Advanced Cyberinfrastructure in the National Science Foundation under Award Number #1838807. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.


Building access into open – writing image description templates into our code annotations

One of the critical elements of our ongoing projects is to identify barriers to reproducible science with public data. Our teammates have previously mentioned examples like lacking or irregular data, data collection, sampling periods, and metadata. But today I want to focus on accessibility in science that’s trying to be open and reproducible.

“Open science” means a lot of things. I don’t have space in one blog post to get into all the uses, but today I’m concerned with the word “access” when it comes to open science. Often “access” is itself used within the open science domain to refer to multiple things! Some people mean cost – as in, open software and hardware are cheaper, so more people can purchase and use the same scientific gear and protocols that better-resourced people and institutions can. Some people mean learning curve – as in, the tools of open science may be easier to learn than some proprietary ones, so it’s easier for nonexperts or nonacademics to participate using the same protocols.

But access also means – and I mean – disability accessibility. Is your project, product, dataset, workflow, team – let alone your code – accessible to collaborators (and would-be collaborators) with disabilities? This work is informed by two researchers and professors I greatly respect – Dr. Jon Henner of UNC Greensboro and Dr. Liz Hare of Dog Genetics. Dr. Henner has previously noted that “access and inclusion” is now often used to indicate the other senses of “accessible” but NOT disability. And Dr. Hare informed me – and the rest of the twittersphere – about the lack of compatibility between open science software and access technologies. Open science that isn’t accessible isn’t open! So what can practitioners do to contribute to an actually more open science in the public domain?

Number one is (always) follow the lead of disabled scientists and researchers. Disabled scientists are the experts in making their science accessible. So if you’re a nondisabled scientist like me, the first and best thing to do is educate yourself to ensure you’re not erasing, paving over, or counteracting the work of disabled experts, peers, and mentees.  The rest of what I’ll type about today uses an example our team realized was a barrier in our own workflow and products. But that guidance above holds constant across any example I could share.

Christie and collaborator Sarah Cusser have been working on an analysis studying how the length of the observation period impacts the signal and significance of the phenomenon studied. And with the help of some of Sarah’s other collaborators, they developed a way to visualize their results in what they’re referring to as a pyramid plot. These plots are built into the latest software release, which you can find on the Bahlai Lab github repo here – https://github.com/BahlaiLab/bad_breakup_2. To give you an example of the kind of figure, from Rowan’s presentations already online on FigShare, here’s one of the pyramid plots:

The figure above is titled “Adult deer ticks in Cary Forest.” The x axis reads “slope” with a scale from -1.5 to 1, while the y axis reads “number of years in window” with a scale from 0 to 25 years. There’s also a key labeled significance, with red X meaning ‘no’ and black O meaning ‘yes.’ In the plot itself, there’s a dotted trendline parallel to the y axis that shows the trend the full-length dataset converges on, and two fainter dotted lines parallel to the y axis indicating a standard error range. For every possible window length up to the total number of years in the dataset, there are Xs and Os showing the signal (positive or negative, based on where each falls on the x axis) and significance (X for not significant, O for significant). Slopes outside the error range would indicate a misleading result, as in, misrepresenting the longer term pattern shown by the full dataset. In the case of this specific figure, the trend seems to converge after about 15 years, based on the Xs and Os falling within the error bounds.

This is a complicated figure, with a lot of information embedded in the image. And this is just one, specific to this dataset, slope, significance, and years included. We want to make our results easily interpreted. So Christie adjusted it based on best practices for image access – for example, ensuring that significance differences weren’t indicated by color coding alone: the symbols are also different shapes and still high contrast. The same goes for effect size, which is represented by the size of the X and O, not by color shading.

Yet knowing what Dr. Hare has previously shared about the software tools we’re using, I was interested in making sure our figures and the data within them were represented multiple ways, to ensure access. Just like Dr. Drew Hasley told everyone at EDSIN – data visualization alone isn’t ideal; multimodal representation of data is – to ensure there are multiple points of entry for collaborators and learners. So Christie continues to update the code to output text tables that contain everything in the image (and several of those functions are now in the latest release!).

Since the whole goal of releasing the moving window analysis code is for other people to use it with their data, why not write image description templates into our code annotations, right alongside the code that makes the plots and the tables? But I realized I had never, ever come across R code with image description templates for scientific figures.

So earlier in the summer, I shared a poll on twitter, asking R users whether they had ever come across or written image description templates into code to go with the plots the code creates.

Here’s a screenshot of the poll from my twitter feed:

My poll on August 20, 2019 asked “hey #RStats and other #OA #opensci friends, poll time – have you ever written – or seen – image description templates built into code?” 46 people voted, and 83% of respondents replied that they had never come across image description templates written into code. 15% responded they had no idea or just wanted to see the results and 2% (so, 1 person) responded that they had come across or written this (they didn’t clarify).  Unfortunately I can’t say I am surprised, but it’s something we’d like to include with our next release – again, in addition to tabular representations of the same information in the figure, to achieve that multimodal representation of the analysis to make it more accessible.

Getting back to listening/reading and educating ourselves, rather than reinventing or breaking the wheel, I’ll be drafting our first attempt at the image description templates for the code using best practices from the National Center for Accessible Media and WebAIM, using specifically their guidance for image descriptions for complex scientific images.  And while image descriptions are supposed to be *specific* to the image, we are trying to write a template, so we’ll also need to indicate where in our template the description needs to be adjusted to be specific to the completed analysis.
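To make the idea concrete, here’s a rough sketch of what a template like that might look like. This is an illustration in Python with hypothetical function and parameter names, not the actual Bahlai Lab R code; the fields a user fills in are the parts that must be adjusted to be specific to their completed analysis:

```python
def pyramid_plot_description(dataset_name, n_years, converge_year, x_range):
    """Fill a reusable image-description template for a pyramid plot.

    Every argument is a field the user must set to match their own
    completed analysis, following the guidance that image descriptions
    be specific to the image they describe.
    """
    return (
        f"Pyramid plot for {dataset_name}. The x axis shows slope, "
        f"from {x_range[0]} to {x_range[1]}; the y axis shows number of "
        f"years in the window, from 0 to {n_years}. Black O marks "
        "significant windows and red X marks non-significant ones. "
        "The windowed slopes converge on the full-series trend after "
        f"about {converge_year} years."
    )

# Example use alongside the plotting call, with the values from the
# figure described earlier in this post:
desc = pyramid_plot_description(
    dataset_name="Adult deer ticks in Cary Forest",
    n_years=25, converge_year=15, x_range=(-1.5, 1),
)
print(desc)
```

The point of shipping this next to the plotting code is that the description gets generated (and updated) with the figure, rather than written once and forgotten.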

Know of a team doing this well that we should learn from?  Read or listened to a great scientific or complex figure description that we should model descriptions on? We’re very open to feedback as we try to make our work more open – as in, accessible.

Posted in Uncategorized | Tagged , | Leave a comment

Of time series and ticks

Editors’ note: This is a guest post by Rowan Christie, a RIT RISE research fellow and an undergraduate in Bioinformatics at the Rochester Institute of Technology. You can check out more of Rowan’s research on her github and figshare sites.

This summer I’ve been working with Dr. Kaitlin Stack Whitney and the Bahlai Lab at Kent State University on their ‘Managing Our Expectations‘ project. For my own research as part of the project, I decided to focus on Ixodes scapularis. Also known as the deer tick, it is a primary vector of Borrelia burgdorferi, the bacterium that causes Lyme disease. Ixodes scapularis populations appear to be increasing. Media coverage also often frames deer ticks as increasing:

However, biological studies generally collect short term datasets, and the trends observed from these studies may not be indicative of longer term trajectories.

For example, what might you conclude if you only saw this:

Versus seeing this:

First, I had to find long term, publicly available tick datasets. My criteria were that the dataset had to be at least 10 years long, collected in at least one location, and have numerical density, abundance, or count data for the tick species Ixodes scapularis, at any life stage.

First, I started with the LTER Data portal. I typed in the keyword scapularis. There were two results – both were at least 12 years long, but I excluded one because it had no count data, only a categorical metric of uncommon, rare, and common. The other dataset, which was focused on Harvard Forest, did have count data, so I added the link to my list of candidate datasets.

Still in the LTER Data portal, I then typed in the keyword tick and found 11 results; however, most focused on different animals and had no tick sampling data, and I did not find any new datasets that fit the criteria. I was hoping that there would be more candidate datasets, especially from a data portal focusing on long term datasets! [Editors’ note: Given how few long term deer tick datasets Rowan found in the LTER system, even though ‘bad breakups’ is mainly focused on LTER and publicly-funded datasets, we were compelled to keep going to see if other sources had more!]

Next, I searched the Data Dryad repository by first typing in scapularis, which turned up 9 results. Several focused on the microbiology of the tick. I found two datasets that fit the criteria. I then typed in tick for the search term, and found several more datasets focusing on tick abundance, but some of them were only short term. I found one possible dataset that focused on several different tick species.

Then I searched through the NEON data portal by typing in scapularis, which turned up 0 results, then tick, which turned up 2 results. Only one of them contained tick sampling data, and that study was too short to fit our criteria.

I searched through Google Datasets by typing in the keyword scapularis. There were 76 results. Many of the results were duplicates. I also found some that focused on the microbiology of ticks, rather than sampling ticks out in the field. Others were not long enough. In short, there were a lot of datasets to filter through, but I found 4 datasets fitting the criteria from the NY Department of Health. These 4 datasets were from studies that collected data in the same way, by dragging, and reported the metric of average number of ticks per 1,000 meters. However, this metric is linear, so it cannot easily be compared to an areal metric like number of ticks per square meter.
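To see why the linear metric is hard to compare: converting ticks per 1,000 meters dragged into ticks per square meter requires knowing the drag-cloth width, which these reports don’t provide. The sketch below is purely illustrative (the function name and the 1 m width are my assumptions, not anything from the NY data):

```python
def per_1000m_to_per_m2(ticks_per_1000m, drag_width_m):
    """Convert a linear drag metric (ticks per 1,000 m dragged) into an
    areal density (ticks per square meter). This is only possible when
    the drag-cloth width is known - which is exactly the missing piece
    in reports that publish the linear metric alone."""
    area_m2 = 1000 * drag_width_m  # total area swept by the drag cloth
    return ticks_per_1000m / area_m2

# If a 1 m wide drag cloth is assumed (an assumption, not reported):
density = per_1000m_to_per_m2(50, drag_width_m=1.0)
print(density)  # 0.05 ticks per square meter under that assumption
```

Change the assumed width and the answer changes proportionally, which is why the two metrics can’t be compared without the missing methods detail.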

When I showed my mentor the NY Department of Health data I had found, she suggested looking at data portals from other states, especially the Eastern ones, such as Connecticut, which could have publicly available tick datasets due to the higher prevalence of ticks in those states. I also looked at a US map of infected deer tick presence to help me prioritize which states I should be looking at. I predicted that if a state has a confirmed deer tick population, it will be more likely to have collected tick data, although whether that data is publicly available is another question.

Map of Lyme disease risk. Digital image. Live Science. 7 February 2012, https://www.livescience.com/18340-lyme-disease-risk-map.html

For example, the University of Rhode Island has collected tick data, but they do not seem to have it publicly available. They have a site called TickEncounter which provides all sorts of information about ticks, like what species are common in Rhode Island, how to identify them, how to remove them, and what kind of habitat different tick species may be found in. They also have a crowd-sourced survey called TickSpotters where the public can submit tick data, including the tick species, life stage, whether it was found on a person, pet, or wandering, the date, and what country it was found in. They then use this information to provide a TickEncounter index, which seems to be ranked as ‘Low’, ‘Medium’, or ‘High.’ That’s not exactly as helpful as providing the raw data, however.

Another common issue I encountered was that some of the states only had surveillance reports for Lyme disease, and no surveillance tick data.

For New Jersey, I started by searching for the NJ Department of Health webpage. I accessed the Offices & Programs dropdown at the front of the page, tabbed to the Communicable Disease page, accessed the Statistics, Reports, & Publication dropdown, and clicked on Vector borne surveillance reports. While some of the more recent reports did provide data on tick related emergency visits and tick borne diseases, the older reports only provided data about mosquito borne diseases, and there weren’t enough reports with tick data to constitute 10 years. So I did not include the New Jersey reports in my list of candidate datasets.

I searched for the Connecticut Department of Health webpage and typed in the keyword tick. There were 75 results. I tabbed to the link worded “Tick” which led to this page, and upon scrolling down to the CAES Tick Office & Tick Testing section, found a link titled “Tick Test Summaries.” This page provides a list of tick testing results for each year. These reports include data for several tick species, such as Ixodes scapularis, and also include the percentage found positive for Borrelia burgdorferi, the bacterium that causes Lyme disease. As there were over 10 years worth of reports, it met my criteria and I added it to my list of candidate datasets.

I then looked to Pennsylvania for tick datasets. I searched their department of health webpage using the keyword tick, but found no datasets. After this, I tried something else and searched google using the keywords pennsylvania tick datasets. Lo and behold, there was actually a study that provided 117 years of data. The data provided through the study had count data for Ixodes scapularis across all life stages, and as it fit the criteria, I added it to my list of candidate datasets. Last but not least, I used the same method to search for tick data in the state of Delaware; however, I did not find any results.

At this point, I stopped looking for tick data, and I now had a compiled list of 10 candidate tick datasets. Searching through all of these different sites showed me how diverse studies can be in terms of collecting and reporting tick data. There are many different metrics, standards, and methods for tick data, and comparing tick abundance, count, or density data from different studies isn’t that simple. For example, the Harvard Forest dataset was an opportunistic study where they recorded the number of ticks and tick bites found on summer research interns, which is a rather different way to sample than dragging. In addition, the datasets also varied in terms of the life stages they sampled for – some only sampled for adults and nymphs, while others sampled all of the life stages. There was also a lot of averaging in some of the datasets, like the NY Department of Health one, which only provides the data at the county scale rather than the specific location or plot. This makes it hard to compare to other datasets, like the one from Cary Forest, because the geographical scope is so different. I was especially surprised at the lack of publicly available data from the state Department of Health sites. The states that I looked at are well known for their deer tick populations, so you might expect that they would have collected deer tick data, given that it is a critical health concern.

 

[Do you have 10+ years of deer tick data – or know of another source – that we should be including? If so, let us know in the comments or via social media!]

Posted in Uncategorized | Tagged , , , , | 1 Comment

Irrigrated: In which Tasia Complains About Things in List Form Because Narratives are Difficult

This is a second post written by Tasia North, an undergraduate student who’s working with us on the Bad Breakup project. As part of our research plan on this project, we’re identifying barriers to data reuse from publicly shared, NSF-produced data sources. Tasia is working on a project examining patterns within tri-trophic interactions in long term data – basically asking: do significant trends move between trophic levels, and if they do, how? But in order to do it, they were first tasked with finding a few representative sets of data. I wanted them to have an authentic experience – just that there were data like this, here’s a database of datasets, now see how you can find information to support this investigation – and write down what you find.

-Christie

Hi folks,

These past couple weeks I’ve been cleaning data and wading through metadata. Trying to decipher work other people have done is difficult and I have some feelings about it (hence this blog post). This is for the bad breakup project, in which we take long term data sets (12 years or more) and chop them up to see what trends would appear if we had only measured that system for a shorter period of time. This will allow us to quantify how often we are wrong when we make conclusions off of those shorter 3 to 5 year studies that are so common in science.
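The chopping-up idea can be sketched as a moving-window analysis: fit a trend to every contiguous window of every allowed length, then compare the short-window trends to the full-series one. This is a toy Python sketch with made-up numbers, not the lab’s actual bad_breakup R code:

```python
from statistics import mean

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def moving_window_slopes(years, counts, min_len=3):
    """For every contiguous window of at least min_len years,
    return (window_length, start_year, slope)."""
    out = []
    n = len(years)
    for length in range(min_len, n + 1):
        for start in range(n - length + 1):
            w_years = years[start:start + length]
            w_counts = counts[start:start + length]
            out.append((length, w_years[0], slope(w_years, w_counts)))
    return out

# A toy 12-year series with a weak long-term increase:
years = list(range(2000, 2012))
counts = [5, 7, 4, 8, 6, 9, 7, 11, 9, 12, 10, 13]
results = moving_window_slopes(years, counts)
full = [r for r in results if r[0] == len(years)][0]
print(f"{len(results)} windows; full-series slope = {full[2]:.2f}")
# prints: 55 windows; full-series slope = 0.66
```

Short windows in a noisy series like this can easily produce negative slopes even though the full 12-year trend is positive, which is exactly the “how often are we wrong?” question.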

More specifically I’m searching for data from LTER sites that have complete long term abundance data across three trophic levels. I’ve already gone through the arduous process of sifting through the LTER data portal and finding what I need. Now that I’ve found the datasets, I need to wade through their metadata, make sure I understand what they did well enough to know if it will work for the analysis, and ensure that it meets all the criteria.

Now part of what Christie wanted me to do during this process was to identify any barriers to reusing this data and write about that. So rather than writing a narrative of how my metadata experience has gone so far, I’m just gonna write a list of all the things that have been confusing and made this process more difficult because I am lazy. Please enjoy my list!

1. Typos

A lot of these are kinda funny and perfectly harmless, but someone really should have done one last proofread before putting this up. I do love that they sampled the yee ole aire.

2-tasia-blog-1

Image description: A screenshot of several highlighted typos in metadata, and a screengrab of Captain Picard from Star Trek giggling with his hand on his cheek saying “oopsie”. Highlighted typos include “wind speed at astart of sampling” and “aire temperature at start of sampling”

 

2. Not writing down the units of things

Yes, mmhm, you read that right. This person measured mass and then recorded the units as dimensionless. How can a unit of mass be dimensionless?? Maybe if I were more familiar with this particular system and how much biomass it usually produces it would be obvious, but I know nothing about this ecosystem and I don’t know how much grass would be a logical amount of grass in this area! Should I assume that it’s kilograms since that’s the SI unit of mass? I don’t really like to assume things when working with a big set of someone else’s data, but since they didn’t write it down I guess we have to spend a bunch more time digging. Thankfully this doesn’t really matter for what we’re doing (our algorithm works on Z scores, so take THAT dimensionless data!) but it would matter a lot for just about any other type of analysis, or if I wanted to reproduce this study.

2-tasia-blog-2

Image description: A screenshot of metadata showing that for the attribute “mass of live grass” the unit is listed as dimensionless. For the attribute “mass of forbs” the unit is also listed as dimensionless
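As an aside on why the missing units don’t sink our particular analysis: z-scoring a series strips the units entirely, so grams and kilograms standardize to the same values. A quick Python illustration with toy numbers (not the actual dataset):

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize a series: subtract the mean, divide by the standard
    deviation. The result is unitless, so the original units drop out."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

grams = [1200, 1500, 900, 1800]
kilograms = [g / 1000 for g in grams]  # same masses, different unit

zg, zk = z_scores(grams), z_scores(kilograms)
same = all(abs(a - b) < 1e-9 for a, b in zip(zg, zk))
print(same)  # True: the unit choice doesn't matter after z-scoring
```

So a trend analysis on z-scores survives “dimensionless” units, but anyone trying to reproduce the study or compare absolute biomass across sites is out of luck.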

3. Very vague details or no details at all

I’m having a really hard time finding anything meaningful about methods in a lot of these datasets. This screenshot is just one example of many vague things I found. Did they treat it with herbicide? Pesticide? Did they pull weeds? Or fertilize? Were they burning? Playing music to the plants to see if they grew better? Who knows? All I know is that there are at least two treatments and that one of them is old but that’s it.  I searched high and low for a paragraph about methods but found N O T H I N G directly connected with the data set.

2-tasia-blog-3

Image description: A screen shot of metadata, showing the first attribute as Old treatment. The storage type is listed as string, and measurement scale says old treatment. The second attribute is Current treatment, with the same things listed for storage type and measurement scale. There is no other meaningful information that could lend clues as to what the old and current treatments actually were.

4. USING LABELS IN THE DATA SET AND THEN NOT DEFINING THAT LABEL IN THE METADATA. I AM WRITING THIS IN ALL CAPS JUST TO LET EVERYONE KNOW HOW STRONG I AM FEELING ABOUT THIS ONE.

Yeesh. This is grass biomass divided into the 2 (or maybe 3??) treatment plots that they used in the study. The top screenshot is the only thing in the metadata referencing these labels. And the screenshot below is what’s in the actual dataset. “C” is defined in the metadata. “i” is defined. “ni” is not defined anywhere and is only used in the dataset those three times. I pored through the metadata looking for these definitions, but at this point I think the best I can do is assume that “ni” stands for non-irrigated and means the same thing as control. I really don’t like making assumptions like that, but there isn’t really another option with the given information! And to top it all off they misspelled “irrigation” in parts of the metadata!!*

2-tasia-blog-4

Image description: A screenshot of metadata showing an attribute Treatment. Under measurement scale, it defines two variables, c = control, and i = irrigrated (typo is copied exactly from metadata). Text does not define ni at any point

2-tasia-blog-5

Image description: Screenshot of a pivot table created in excel. The table shows data of average livegrass biomass from 1991 to 2015, with columns for treatments c, i, and ni. There are only three values in the ni column. When there is a value in the ni column, the c column is blank, and anytime there is something in the c column, ni column is blank. Since ni is not labeled anywhere in the metadata, and it follows this pattern, the most logical assumption is that c and ni are the same thing and there was just a mistake somewhere in the data recording process

I get that this is a long data set which has probably had a lot of people working on it over the years. But my goodness pretty please stick to the same labeling system and if you have to change it ya gotta define the new labels you use and maybe put a sentence about the change somewhere!

The cats do not approve of this kind of behavior. They do not approve.

2-tasia-blog-6

Image description: A meme showing two angry looking grey and black striped cats glaring at the camera, with text that says “When an LTER dataset uses a label and then doesn’t define it in the metadata”

5. Being extremely vague on methods and labeling

A lot of these data have only a couple sentences on methods; it would be nearly impossible to recreate the survey from the information given. For example, I’m looking at a grasshopper survey, and the labeling system they used is a little confusing – I’m not quite sure what they actually did for sampling. The metadata has literally 3 sentences about methods. They did a sweep sample in July/August of each year, and “At each site on each occasion, 10 sets of 20 sweeps (200 sweeps total) are taken”. That’s it. From the data I can see the set of 10, that’s labeled very clearly, but I can’t find any clear set of 20. I’m unclear if they had multiple people who each did 20 sweeps 10 times? Or multiple people who together did 10 sets of 20 sweeps? I’m also not sure if the 10 sets were over the same area or if they had transects of some kind? I’ve never done grasshopper sampling before, so I have no reference for what normally happens with this kind of survey, and since they didn’t write it in the methods I’ll just be confused about it and hope it won’t change anything about the analysis. (I’ve since been informed by someone more familiar with these types of sampling methods that the 20 sweeps refers to each actual sweep of the net, so they did 20 net sweeps 10 times; the jury is still out on where they did these 10 sets, and what parameters they used for selecting the sampling transects.)

 

~about a week later~

 

Friends, I LITERALLY CANNOT. I have just discovered that the metadata I download in a zip file from the LTER site is different from the one on the internet. They literally wrote two different sets of metadata. One is conveniently placed in a downloadable zipfile along with the data. This one is extremely vague and has almost no information on methods. And the other metadata has detailed paragraphs on methods, changes to the study over time, etc. Where is this detailed file? Tucked in some corner of the website that I would never have thought to check, because I thought that it was the same as the one I downloaded. I assumed that everything put inside that zipfile would be the most complete set of information and that clicking around the site for random files would be a waste of time when I had the “exact” “same” “file” already downloaded.

2-tasia-blog-7

Image description: picture of a white baby flamingo on a blue background. The chick has its mouth open in what looks like an angry scream. Text reads *incoherent screaming*

Is this common knowledge that there are 2 different sets? Why didn’t they just put those couple of detailed paragraphs into the downloadable metadata? Now I need to go back to every other place I was confused and check to see if they have the answers in a second secret file somewhere online. Unfortunately it looks like it doesn’t have any information on the missing labels/units, but it cleared up a lot of my questions about methods and larger context and how/why/where they did what they did.

I was SO confused and spent SO much time searching through the files I had downloaded for answers, and it turns out the answers were there all along – just in a file I had not thought to check, since I assumed that the 2 files both called metadata were the same thing. *heavy sigh*

2-tasia-blog-8

Image description: Fat fuzzy cat with short stubby legs sits at a laptop wearing a red bowtie and small round glasses. The cat has its mouth open looking surprised, indignant, and offended at the computer screen. Text reads “My face when I found the second set of metadata.”

* this is when Christie made a bad dad joke about being very  “irrigrated” about the whole situation. Do you see what her poor lab has to put up with?

Posted in Uncategorized | Leave a comment

How do I find stuff? An undergraduate’s journey through an online data archive

This post is written by Tasia North, an undergraduate student who’s working with us on the Bad Breakup project. As part of our research plan on this project, we’re identifying barriers to data reuse from publicly shared, NSF-produced data sources. Tasia is working on a project examining patterns within tri-trophic interactions in long term data – basically asking: do significant trends move between trophic levels, and if they do, how? But in order to do it, they were first tasked with finding a few representative sets of data. I wanted them to have an authentic experience – just that there were data like this, here’s a database of datasets, now see how you can find information to support this investigation – and write down what you find. This is Tasia’s first blog post with their reflections on the experience!

-Christie

 

Hello Blogosphere,

I’m the new undergrad working here at the Bahlai lab. If you’ve been following along with the blog you’re aware of the bad break up project that’s been going on. This project looks at a long term data set, and breaks it up into shorter clumps to look at the trends. This will allow us to quantify how often we are wrong when we base conclusions off of three or four year studies.

We are digging through approximately 58,000 datasets from the US-LTER. My task is to continue working on the tritrophic interactions that Julia had started. She had created a list of sites that are likely to have the data needed, and what organisms I can look for. I needed to take that list, sort through available LTER data to find each dataset, determine if it was at least 12 years long, and check that the data were usable and accessible.

Easy enough right?

So a few things about me might be useful for context here. As stated, I am an undergraduate student studying Ecology and Conservation Biology. I’ve essentially no experience with large scale data management, and no experience using the LTER website or getting data from this site. In fact I had to google to find the LTER website since I have never used it before and thought everyone was saying LTR. In other words I am a newb at this. However I am armed with four and a half years of college experience (super seniors represent!), and I’m a millennial with the standard ‘navigating internet and sorting through stuff’ skills that are common to my generation. Someone with my education level, computer skills, and the reasonable level of guidance that I have should be able to navigate this site and find the information that I need. Here’s a step by step walkthrough of how successful I was at navigating these sites, what I found, and also some memes to express the feelings that arose during this experience.

The first thing I did was google the LTER website, which took me to the data portal. Now Christie wrote in the last blog post that this portal was, erm, less than helpful. But everyone else in the lab was busy when I started working on this and I didn’t want to interrupt anyone. So I got to find out about the data portal all on my own! I start out clicking on the advanced search option. According to the list Julia gave me, there is probably a survey of small mammals in the Konza Prairie LTER site that is at least 12 years long. So I type in small mammals, select Konza Prairie, and I am presented with . . . this:

tasia-blog1

As you can see, nowhere does it say how many years are included in the data set. It only lists the publication date, which is of no use to me if I want to know how long they studied something.  In order to find the length of study, I have to click on the title, scroll down, find the metadata report, click on that, and then scroll down to find the years.

I have spent literal hours over the last couple weeks searching for keywords that will hopefully bring up what I need, clicking on a title, then clicking on the metadata report, and then scrolling all the way down just to see something like this:

tasia-blog2

Or this:

tasia-blog3

This was a huge time suck and mildly frustrating to say the least. I was about ready to take my extensive credentials as a *checks notes* Mildly Annoyed Undergrad™ and march right up to the LTER office and demand they change their name to the Year Long Ecological Research Network. In fact I was so peeved I took a break to make this extremely niche meme that about 6 people will think is funny.

tasia-blog4.png

Finally though, after sifting through what feels like a million data sets, I find one that actually goes for more than a few years. A bit of clicking around brings me to an excel spreadsheet with the data on it.

Hallelujah!

Now I just need to determine if this data on small mammals is usable. Thankfully it looks complete and without any weird blank spaces or scary looking errors (an earlier excel file I found had an error code of -99999 and that was a scary looking data sheet if I’d ever seen one).

Most of the spreadsheet is logical. There’s the year (this was out of order but clearly labeled, so it’s fine), the season, and a watershed ID number – all of that’s fine. Then we get to the actual data: it’s a whole bunch of acronyms, followed by a series of numbers.

tasia-blog5.jpg

Now it may be a cool science thing that I’m not privy to, but this spreadsheet is so full of acronyms that it’s essentially illegible to an outsider unfamiliar with the system. Isn’t the goal of these types of data sets to allow future scientists to come in and reuse the data with relative ease? Well, that’s what the metadata is for!

Thankfully there was an easy to find (it was not) and logically labeled (it also was not) file attached named knb-lter-knz.88.7.txt. This file is not to be confused with knb-lter-knz.88.7.report.xml or knb-lter-knz.88.7.xml; these other two files contain… information (it’s actually probably really important stuff, but I don’t know what any of it means yet). Thankfully the metadata was mostly legible and explained the acronyms clearly. Took me a couple extra clicks, but I think this data set will work for what I need!

Next steps are cleaning up the data!
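For instance, sentinel codes like the -99999 I spotted earlier need to be converted to a real missing-value marker before any averaging, or they’ll silently drag down the numbers. A hypothetical Python sketch of that first cleaning step (the sentinel values here are illustrative, not from the Konza metadata):

```python
SENTINELS = {-99999, -9999}  # hypothetical missing-value codes

def clean(values, sentinels=SENTINELS):
    """Replace sentinel missing-value codes with None so they can't
    silently skew a mean or a trend estimate."""
    return [None if v in sentinels else v for v in values]

raw = [12, -99999, 7, 30, -99999]
cleaned = clean(raw)
usable = [v for v in cleaned if v is not None]
print(cleaned, sum(usable) / len(usable))  # mean of the real values only
```

If the sentinels were left in, this toy mean would come out hugely negative instead of around 16 – which is exactly why metadata that documents missing-value codes matters for reuse.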

 

 

 

Posted in Uncategorized | Leave a comment

Equity and Ethics in Environmental Data Science

Guest post by Kaitlin Stack Whitney

Recently Christie and Kaitlin had the privilege of participating in the first ever NSF INCLUDES funded Environmental Data Science Inclusion Network (EDSIN) conference, which took place in early April 2019, hosted by the National Ecological Observatory Network.

I (Kaitlin) had the great pleasure of being on a keynote panel about “further defining the problem space,” and I focused my comments on disability access and inclusion in academia and science more broadly. A much more in-depth look at those topics was presented by my colleague Dr. Drew Hasley during a plenary the next day. You can check out all the presenters and their presentations here.

Yet as the keynote plenary speaker and eminent scholar Dr. Carolyn Finney explained, problem space is probably the wrong term – and framing. Diversity, equity, and inclusion in environmental data science isn’t a problem! It doesn’t need fixing! Addressing it isn’t a problem-solving exercise. In her words: “it’s not a problem to solve, it’s a process.”

Dr. Finney’s keynote was hands down the best start to a conference I have ever witnessed, and I took copious notes of the wisdom and experiences she generously shared with us. Some key takeaways (for me) that I want to share, as they inform my (our) collaborations and work:

  • “Outreach is outdated, individuals are fully formed.” How do we ensure that our outreach (especially Broader Impacts as a concept) acknowledges and honors that about everyone?
  • “You have to do something different, it’s not about being comfortable.” How do we work to make sure that our collaborators and trainees are comfortable and fully seen? Many of my colleagues, especially women of color, have never been fully comfortable or seen in their work. It’s not about my comfort and being a better ally means taking active steps not to center myself or my comfort.
  • “Diversity is not assimilation.” How do we respect and honor -dare I say cherish – difference in our collaborators and trainees? How do we make space for them, as opposed to only inviting them into pre-determined spaces?
  • “Privilege has the privilege of not seeing itself.” How do people with privilege learn about the things they have the privilege of not experiencing, let alone “seeing?” I am a white woman, a faculty member, and co-PI on a federally funded grant project; I have a lot of privilege and power in relation to many people in my professional community. How do I educate myself to be a better ally and advocate without shifting the burden to marginalized colleagues who already do too much uncompensated service work?

I also presented a poster, “10 steps to make science (more) accessible,” which has been a years-long collaborative effort with four colleagues across North America (in alpha order – Dr. Emilio Bruna, Dr. Simon J. Goring, Dr. Aerin Jacob, and Dr. Timothée Poisot). Check it out on FigShare, and you can check out our corresponding preprint here. We’ll be making sure that our Bahlai and Whitney lab collaborations are multimodal and accessible. It’s a lifelong and continuous effort, but the focus of resources like these is to make clear that there are simple and quick steps people can take to increase access and inclusion, even if they don’t have control over the publishing platform or conference format.

We both also had the opportunity to participate – both as contributors and listeners – in a lot of excellent breakout sessions. Some of my favorites included those focused on disability inclusion and ethics in data science education. As faculty and researchers, how can we introduce ethical data collection and analysis to students from the beginning of their education in environmental science and data science, not as an add-on? Christie shared some great insights from her own teaching, and I brought some of those back to my own classroom this semester. There’s going to be a lot more to come out of our participation in EDSIN, both as the conference outcomes and collaborations develop further, and as I implement more of what I learned from new and old collaborators there. We’ll keep sharing.

Maybe you’re reading this blog generally for the quality bug counting content – and not sure how this fits in. But it’s at the heart of what we do – and it shapes our science. As Dr. Finney also said, “our relationships are worth the risk of going there, to discuss our differences.” People count the bugs and create the scientific community – I (we) must work harder and more effectively to ensure the field of environmental data science is ethical, equitable, diverse, and inclusive of everyone who wants to be in it.

You and everyone you know can join the EDSIN community to make environmental data science more accessible, inclusive, equitable, and awesome! Sign up on the QUBES platform: https://qubeshub.org/community/groups/edsin

The EDSIN conference is based upon work supported by the National Science Foundation under Grant No. 1812997. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


The calm before the (algorithmic) storm

Kaitlin, Julia and I have been busy. Not that you’d know. It’s been a lot of behind-the-scenes work so far on this project- it’s started with weekly meetings where we talk out our ideas and approaches. You’ll recall, the bad breakup project is an algorithm I designed to mine data from the perspective of a human looking to find statistically significant trends. Essentially, it asks “Given a snippet of data, what conclusion would a typical biologist applying typical statistics make about these data?”* We were going to use the 58,000 datasets compiled through the US-LTER over its 40-year history as our data fuel- with the longest series (the complete record) serving as a proxy for truth.**

Turns out, this problem that we’re working on is pretty big, so establishing exactly how to approach these 58,000 archival datasets is our main challenge. It looks kinda like this:

[Image: hypothetical workflow]

A crudely constructed powerpoint flowchart. Top row: Hypothetical workflow- Get data > put data in the thing > ?????? > Profit!  Bottom row: Actual workflow-  ?????? > Put data in the thing > ?????? > Profit!

Our collaborator, Sarah Cusser at Michigan State, has had amazing successes in the context of a ‘deep look’ at a single long term experiment- she has been examining how long we have to watch a system to see treatment differences across several common agricultural practices- and how consistent this effect is. Her findings, in prep right now, can be used to make recommendations about how we make recommendations to farmers- essentially when can we be confident our recommendations are right, and when should we moderate our confidence, when guiding how farmers select practices.

Building on the idea of the deep dive, Kaitlin got an idea- what if we specifically sought out datasets that document tritrophic interactions from the LTER- we could use these focal datasets to examine within-site patterns between trophic levels- i.e. do misleading results travel together between trophic levels? A brilliant idea, so Julia sat down and started combing the organismal abundance/plant biomass data across the LTER- and proceeded to spend a lot of time spinning her wheels.

The challenge, it seems, is also one of LTER’s strengths. LTER data is available through two paths- in data archives maintained by individual sites, and centrally, through a portal at DataONE. There is a heck of a lot of stuff going on at each individual LTER site, which is so, so awesome and cool and I love learning from it***, but, given that each site is so unique, navigating between them to try and find information which supports synthetic approaches is…okay, I’m not going to mince words, not awesome****. Julia has been working on LTER-related projects for nearly a decade now***** and found that the DataONE catalog was fine if she knew what she was looking for, but it wasn’t easy to browse or to extract meaning from otherwise. She quickly found the individual sites were, generally, superior from a browsing perspective, because each site gave a bit of context about what the sites actually are, what the major experiments were, and their data. The data catalogs at each site, though, were all slightly different, and thus varied in their ease of navigation- and moving between the sites, the approaches differed. So she’s still chipping away at this.

Meanwhile, though, something big hit the news. The INSECT DECLINE THING.

Oh boy. So, I am going to summon the full authority of my position as your friendly local data scientist, insect ecologist and time series expert. This study has some serious design flaws and its results are unreliable******. Manu Saunders wrote an excellent blog post on the subject so I won’t go into it in detail here. But our group sat down and discussed the study at one of our regular meetings, and realized we were perhaps in one of the best positions to critically, quantitatively examine these claims- so we got to work. More on this soon.

Soon after we got started, a reporter from Discover magazine reached out to Kaitlin for comment on the story, and a truly excellent article resulted.

To quote a quote from the article:

“You can’t just draw a line through some data points, take it down to zero and say, ‘Right, that’s how long we’ve got,’” Broad says. “That’s not how stats works, it’s not how insect populations work, either.”

Later in the article, Kaitlin highlights how we’re addressing these problems in the work we’re doing– and examines why the scientists making these dire extrapolations are reaching the conclusions they do.

I’m really excited to see what we find.

——-

*  my long-term goal is to replace typical biologists making typical conclusions with about 185 lines of well-commented R code by 2027

** It’s a proxy because even the longest time series I have is a snippet of the whole story. Ever since I started partying with sociologists, I find I use phrases like “proxy for truth” in my day-to-day more than I’d like to admit. Wait till we write our book together.

*** Can you tell that I have a deep, deep, identity-defining love for LTER? Because I do.

**** MAXIMUM CANADIAN SHADE.

***** Just to make you feel old, Juj.

****** DOUBLE MAXIMUM CANADIAN SHADE. NOT MANY CAN SURVIVE THIS LEVEL OF SHADE


Bad breakups (with your data)

Hi all,

It’s been a long time. It turns out this assistant professoring thing does not leave me with a lot of time. Hmm. Who knew? Since we last spoke, I’ve been building my lab- both the physical space:

and the online infrastructure.

I’m building collaborations with friends and colleagues all over the place, and working to help finish the student projects that I became involved with through my previous positions at Michigan State. I’m also working hard to get my uniquely Bahlai Lab research vision off the ground. A big part of that is people.

I have people now! Julia, my long-suffering technician, puts up with my barrage of ridiculous ideas and helps me bring the vision to reality. She’s also in training to be a butt-kicking librarian. Cheyan, PhD student, is studying how ‘non-traditional’ data sources (with a focus on citizen science) can be used to develop and engage people in long term ecosystem management. Katie, PhD Student, is studying how we can measure insect mediated ecosystem services and functions in green infrastructure projects. Christian, PhD Student, is examining the use cases and factors affecting the quality and quantity of citizen science data, in the context of Odonate conservation under climate and habitat change.*  (Yes, you counted right- that’s 3 PhD students already). And, my undergraduate project student, Erin, will be working with me on my next big thing.

Which brings us to IT. My next big thing.

A while back, I had an idea. It wasn’t completely my idea- it came out of conversations with a few people. As you know, I’m interested in big(ish) data- finding trends from patterns we see when we put together a lot of information about a system.** But the big is a combination of a lot of littles.***

When we look at systems for a while we get to see a lot more of the whole, big, messy variability of a system. I’ll illustrate with an example.

Y’all know about the fireflies. No? Okay, it’s been a while and I can remind you about the fireflies. I recorded a video****:

TL:DW- My Reproducible Quantitative methods class produced a paper about firefly phenology.  Fancy people liked it and it got media attention.

During my interview about the piece, the reporter kept going back to the idea of trajectory. Yes, sure, phenology, cool, but what is the *trajectory* of firefly populations? ARE firefly populations in decline?

Here I had one of the longest time series of systematic firefly collection known to science, and I could not answer this seemingly simple question. For your reference, here is the data from our site, grouped by plant community of capture:

[Image: firefly time series, grouped by plant community of capture]

My reply to the reporter was “I don’t know. But if I had less data I would tell you, and I’d be surer of my answer.”

A little tongue in cheek, to be certain, but isn’t that what we’re doing every day in science? One of the fundamental questions we ask in ecology is where is my system going? and we’re making extrapolations based on the data we have available. We know it’s not always the right thing to do, but we do our best, looking at the world through the limited windows available to us. In ecology, the three-year study is pretty much the standard:

We know that this is problematic. This is why the USLTER network exists. People get that. But we’ve still got to do work in the shorter time scales. We gotta graduate students. My grants don’t go on forever. We can learn lots of things from studying systems in the short term.

But.

How do we know when these short-term studies are misleading us? What are the effects of the time period we’re looking at, the length of time we’re watching, the type of process, and how often we’re measuring? And how the heck can we test this, if we’re mostly doing short-term studies?

Friends, I had an idea. Why not re-analyse long time series data– as if they were short term data? Break it up in all sorts of objectively bad ways (THERE! I EXPLAINED THE POST TITLE), analyse using standard statistical methods, collect these statistics up, and look for trends in conclusions we reach, given different ways of collecting the data?

All this would take is a relatively simple algorithm, a whole pile of time series data, and some money, time and patience for the personnel to drop data in and collect the stuff that comes out of the algorithm machine. I can write an algorithm, and hey, the USLTER has lots of data that would be appropriate to get this done, but the latter components are a little harder for a new professor to come by. So, I put it on the back burner.
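To make the idea concrete, here’s a minimal sketch of the algorithm machine: slice a long series into every possible short “study,” fit a standard linear trend to each, and record the conclusion a typical analysis would reach. This is an illustration of the general approach under made-up data, not our actual pipeline (which, per the footnote, lives in R), and the series below is pure noise by construction:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(42)

# A made-up 30-"year" series: no real long-term trend, just noise
# around a stable mean (standing in for one long-term data set).
years = np.arange(1990, 2020)
counts = rng.normal(loc=50, scale=10, size=years.size)

def breakup_conclusions(x, y, window=3):
    """Re-analyse a long series as if it were many short-term studies.

    For every run of `window` consecutive years, fit an ordinary
    linear regression and record the verdict a typical analysis
    would report: increasing, declining, or no significant trend.
    """
    conclusions = []
    for start in range(len(x) - window + 1):
        fit = linregress(x[start:start + window], y[start:start + window])
        if fit.pvalue < 0.05:
            verdict = "increasing" if fit.slope > 0 else "declining"
        else:
            verdict = "no significant trend"
        conclusions.append((int(x[start]), verdict))
    return conclusions

results = breakup_conclusions(years, counts)
print(len(results))  # 28 possible three-year "studies" in a 30-year series
```

Tallying how often those short windows report a significant trend in a series with no trend at all is exactly the kind of “how often are we wrong” summary the project is after.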

Anyway, this summer, my friend and collaborator Kaitlin Stack Whitney brought this grant opportunity to my attention.

EAGER proposals for high-risk/high-reward innovative studies that address development and testing of important science and engineering ideas and theories through use of existing data. […..] proposals must:

Involve, for data proposed for use, publicly-available data generated through NSF funding; and

Agree to make public the details about their experiences reusing the data, including especially challenges associated with that reuse.

!!!

Hey! I, in fact, am a professional at reusing data produced by NSF projects and publicly documenting my experiences using said data! I [cough] kinda have a blog about it. So me, Kaitlin, and my technician Julia sat down.

We wrote a proposal.

And it got funded.

[Image: Tina Fey giving herself a high five]

Three junior women scientists getting an NSF award? No Big Deal.***** We’re getting this project underway, now. My undergraduate, Erin, will be gathering candidate data sets for trial bulk runs of the algorithm over the winter semester. Collaborators Sarah Cusser and Nick Haddad at Michigan State are using the algorithm on a focal dataset to do a deep dive into how patterns of observations affect conclusions in agricultural systems. Basically, we’re going to figure out once and for all- how often are we wrong when we look at our data?

This is going to be big, my friends, stay tuned.

*Note to self, get Christian on the website!! another item for The List.

**thank you for coming to my TED talk that’s not actually a TED talk.

*** you can put that wisdom on my tombstone

****thank you for coming to my other TED talk that’s not actually a TED talk.

*****This is a big deal and I am pretty excited about it.
