Statistical solutions for error and bias in global citizen science datasets

doi:10.1016/j.biocon.2013.07.037

Biological Conservation

Volume 173, May 2014, Pages 144-154

https://doi.org/10.1016/j.biocon.2013.07.037

Highlights

• Citizen-scientist (CS) datasets offer unique opportunities and challenges to the study of global conservation priorities.
• Fortunately, issues of error and bias found in CS data are similar to those found in other large-scale databases.
• As a consequence, statistical tools exist to handle many kinds of error and bias common to CS data.
• We highlight some statistical approaches that are used in ecological contexts and are available in free software packages.

Abstract

Networks of citizen scientists (CS) have the potential to observe biodiversity and species distributions at global scales. Yet the adoption of such datasets in conservation science may be hindered by a perception that the data are of low quality. This perception likely stems from the propensity of data generated by CS to contain greater levels of variability (e.g., measurement error) or bias (e.g., spatio-temporal clustering) in comparison to data collected by scientists or instruments. Modern analytical approaches can account for many types of error and bias typical of CS datasets. It is possible to (1) describe how pseudo-replication in sampling influences the overall variability in response data using mixed-effects modeling, (2) integrate data to explicitly model the sampling process and account for bias using a hierarchical modeling framework, and (3) examine the relative influence of many different or related explanatory factors using machine learning tools. Information from these modeling approaches can be used to predict species distributions and to estimate biodiversity. Even so, achieving the full potential from CS projects requires meta-data describing the sampling process, reference data to allow for standardization, and insightful modeling suitable to the question of interest.

Introduction

Evaluating global changes in the distribution and diversity of Earth’s biota requires datasets of ambitious proportions, in which effort is shared among hundreds or even thousands of individuals (Silvertown, 2009). In recent decades, volunteers, often labeled as ‘citizen scientists’ (CS), have been central to the collection of broad-scale datasets, allowing the scientific community to address questions that would otherwise be logistically or financially unfeasible, even for the most dedicated scientific team (Dickinson et al., 2010; Stuart-Smith et al., in press). Consequently, volunteer networks provide an opportunity to answer conservation-related questions on the broad temporal and spatial scales that are relevant to understanding global biodiversity patterns. As proof of this concept, long-running volunteer monitoring programs have generated thousands of peer-reviewed papers (Sullivan et al., 2009) and can thus offer models for the development of similar programs in novel systems (Bonney et al., 2009).

As well as providing a practical means of addressing large-scale questions in ecology, involving citizens in the collection of data has a number of benefits to conservation-related projects. By being inclusive and engaging large numbers of people, CS projects can bring important publicity and discourse on conservation issues, and provide opportunities for the public to take an active role in management and conservation (Pattengill-Semmens and Semmens, 2003). Additionally, CS projects can often afford to be more exploratory than more regimented monitoring programs, making observations of rare events possible with sightings from large networks of volunteers that span broad spatial scales. Given these advantages, the capacity for addressing global-scale conservation may well rest in the realm of citizen science (Silvertown, 2009).

In spite of the proven success and potential for using CS datasets to address pressing global issues, there has been intense debate over the utility of such data in a scientific framework. Detractors suggest that involving large numbers of individuals with varying skill and commitment will decrease the precision of measurements, such as the identification or counting of species. Moreover, significant sources of bias may be present in the data, such as under-detection of species or the non-random distribution of effort (Crall et al., 2011). Such concerns have motivated CS projects to maximize the quality of data collected through improved sampling protocols and training (Edgar and Stuart-Smith, 2009), database management (Crall et al., 2011), and filtering or subsampling data to deal with error and uneven effort (Wiggins and Crowston, 2011; Wiggins et al., 2011). However, in many broadly distributed databases it may be impossible to implement rigid protocols or to eliminate all sources of error and bias. Thus, global CS datasets will likely violate the basic assumptions of some statistical analyses.
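Subsampling to even out effort can be sketched in a few lines. The Python example below is only an illustration under stated assumptions, not the procedure of any cited project: it assumes each record carries a latitude and longitude, bins records into grid cells of a hypothetical size `cell_deg`, and keeps at most `max_per_cell` randomly chosen records per cell so that heavily visited areas do not dominate. The function name, parameters and coordinates are all invented for the example.

```python
import random
from collections import defaultdict

def thin_records(records, cell_deg=1.0, max_per_cell=2, seed=0):
    """Keep at most `max_per_cell` randomly chosen records per grid cell.

    `records` is a list of (lat, lon, payload) tuples. The grid size and the
    cap are illustrative choices, not values taken from the article.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for rec in records:
        lat, lon = rec[0], rec[1]
        # Floor division assigns each record to a cell_deg x cell_deg cell.
        key = (lat // cell_deg, lon // cell_deg)
        cells[key].append(rec)
    thinned = []
    for key in sorted(cells):
        group = cells[key]
        if len(group) > max_per_cell:
            # Over-sampled cell: draw a random subset without replacement.
            group = rng.sample(group, max_per_cell)
        thinned.extend(group)
    return thinned

# Hypothetical sightings: five clustered in one cell, one in a remote cell.
sightings = [
    (42.1, 147.2, "a"), (42.3, 147.4, "b"), (42.5, 147.6, "c"),
    (42.7, 147.8, "d"), (42.9, 147.1, "e"), (10.5, 20.5, "f"),
]
subsample = thin_records(sightings, cell_deg=1.0, max_per_cell=2)
```

Thinning of this kind discards data, which is one reason the model-based approaches discussed in the article may be preferable when information on sampling effort is available.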

Fortunately, the issues of error and bias that are often present in CS data are not unique; analogous problems exist in datasets across a wide variety of disciplines and can be addressed using a suite of analytical approaches. In many cases, CS databases resemble the data collected for meta-analytical and landscape ecology syntheses, where methods for accurately estimating and incorporating within-study or within-observer variability are key to drawing conclusions from the data (Hedges et al., 2010). For complex datasets, machine learning (ML) approaches are available that can examine the relative importance of large numbers of predictive variables in explaining the response data (Fink and Hochachka, 2012; Olden et al., 2008). Moreover, custom hierarchical analyses can recognize and account for the variable and clustered nature of CS data (Hochachka et al., 2012).
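To make the idea of separating within-observer from between-observer variability concrete, here is a minimal Python sketch. It uses a method-of-moments (one-way ANOVA) estimator for a balanced design, which is a simplified stand-in for the mixed-effects models discussed in the article; a real analysis would typically fit such a model in dedicated software. The observer IDs and counts are hypothetical.

```python
from statistics import mean

def variance_components(groups):
    """Method-of-moments variance components for a balanced one-way design.

    `groups` maps an observer ID to a list of repeated measurements, with all
    lists the same length. Returns (between_var, within_var, icc).
    """
    sizes = {len(v) for v in groups.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes a balanced design")
    n = sizes.pop()   # measurements per observer
    k = len(groups)   # number of observers
    grand = mean(x for v in groups.values() for x in v)
    group_means = {g: mean(v) for g, v in groups.items()}
    # Mean squares between and within observers.
    ms_between = n * sum((m - grand) ** 2 for m in group_means.values()) / (k - 1)
    ms_within = sum(
        (x - group_means[g]) ** 2 for g, v in groups.items() for x in v
    ) / (k * (n - 1))
    # Truncate at zero, since the moment estimator can go negative.
    between_var = max((ms_between - ms_within) / n, 0.0)
    # Intraclass correlation: share of variance attributable to observers.
    icc = between_var / (between_var + ms_within)
    return between_var, ms_within, icc

# Hypothetical repeated counts of the same transect by three observers.
counts = {
    "obs_A": [10, 12, 11],
    "obs_B": [20, 19, 21],
    "obs_C": [15, 14, 16],
}
between_var, within_var, icc = variance_components(counts)
```

In this made-up example most of the variance lies between observers rather than within them, which is the kind of clustering that an observer-level random effect is meant to absorb.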

Here, our overall objective is to promote the use of CS data in conservation ecology and policy by highlighting how issues of data quality can be addressed using a suite of relatively new statistical tools. We first provide context by describing the main considerations for identifying and quantifying data quality issues present in CS data. Second, we explore a number of modeling approaches available for use with CS data, with case examples to illustrate how specific issues of error and bias can alter understanding of biological patterns when left unaccounted for. Our perspective is that CS data have the potential to describe global patterns in biodiversity and the mechanisms driving change in ecosystems, communities and species. The inferential capacity to do so rests on the continued development and use of modeling approaches to identify and correct for data quality issues.
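A small worked example may help show how an unaccounted-for bias can change an inferred pattern. The Python sketch below takes under-detection as the bias and compares the naive proportion of sites with at least one record against a basic single-season occupancy estimate that models detection explicitly. It is an assumption-laden illustration, not a method taken from the article: the data are invented, detection probability is assumed constant across sites and visits, and the likelihood is maximized by a crude grid search rather than a proper optimizer.

```python
from math import comb, log

def occupancy_mle(detections, visits, step=0.01):
    """Grid-search MLE for a basic single-season occupancy model.

    `detections[i]` is the number of visits (out of `visits`) on which the
    species was recorded at site i. Returns (psi_hat, p_hat), where psi is the
    probability a site is occupied and p the per-visit detection probability.
    """
    # Candidate values on the open interval (0, 1).
    grid = [i * step for i in range(1, int(round(1 / step)))]
    best = (float("-inf"), None, None)
    for psi in grid:
        for p in grid:
            loglik = 0.0
            for y in detections:
                # Occupied sites yield a binomial number of detections.
                lik = psi * comb(visits, y) * p ** y * (1 - p) ** (visits - y)
                if y == 0:
                    # An all-zero history may also come from an unoccupied site.
                    lik += 1 - psi
                loglik += log(lik)
            if loglik > best[0]:
                best = (loglik, psi, p)
    return best[1], best[2]

# Hypothetical data: 10 sites, 3 visits each.
detections = [0, 0, 0, 0, 0, 1, 1, 2, 1, 2]
naive_occupancy = sum(1 for y in detections if y > 0) / len(detections)
psi_hat, p_hat = occupancy_mle(detections, visits=3)
```

With these made-up data the naive estimate is 0.5, while the model-based estimate of occupancy is higher because some sites with no records are attributed to missed detections rather than true absence.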

Section snippets

Contextualizing the quality issues present in citizen science data

Most CS projects recognize the potential issues of error and bias present when using large numbers of volunteers to collect data. Volunteer training, data standardization, validation and filtering procedures reduce potential sources of error and bias before, during and after the data are collected (Bonter and Cooper, 2012; Wiggins et al., 2011). In fact, studies comparing data generated by skilled volunteers vs. experts often show comparable estimates (e.g. Delaney et al., 2008, Edgar and

Modeling approaches

Modern statistical tools present options for accounting for many types of error and bias. In the following sections, we describe a variety of such techniques that may be particularly relevant to CS data. We aim to indicate where and why one might use each tool, to describe the different approaches, and to illustrate applications by drawing on examples from the literature. Table 1 provides examples of freely available statistical packages for implementing many of the approaches we describe in the

Recommendations

There is great potential for the use of CS data as a mainstream tool to address the important ecological and conservation questions of our time. However, to realize this potential, researchers will need to consider some basic principles of data collection, management and analysis. Taking an overview of recent techniques used in research based on CS data (Table 2) and incorporating the advice found in Zuur et al. (2010), we have extracted a few recommendations.

First, working with both statisticians and

Acknowledgments

T.J.B. was supported by an Australian Research Council Linkage doctoral fellowship and the Australian Research Council Centre for Excellence in Environmental Decisions. Salary for A.E.B. was provided by the Fisheries Research Development Corporation, the National Climate Change Adaptation Facility, and the Australian National Network in Marine Science (a collaborative funding initiative between James Cook University, the University of Tasmania, and the University of Western Australia). Development of the

References (88)

T. Amano et al. Hierarchical models for smoothed population indices: the importance of considering variations in trends of count data among sites. Ecol. Ind. (2012)
B.M. Bolker et al. Review: generalized linear mixed models: a practical guide for ecology and evolution. Trends Ecol. Evol. (2009)
A. Comber et al. Using control data to determine the reliability of volunteered geographic information about land cover. Int. J. Appl. Earth Obs. Geoinf. (2013)
S.R. deSolla et al. Effect of sampling effort and species detectability on volunteer based anuran monitoring programs. Biol. Conserv. (2005)
G.J. Edgar et al. Biases associated with the use of underwater visual census techniques to quantify the density and size-structure of fish populations. J. Exp. Mar. Biol. Ecol. (2004)
W.M. Hochachka et al. Data-intensive science applied to broad-scale citizen science. Trends Ecol. Evol. (Pers. Ed.) (2012)
M.N. Maunder et al. Standardizing catch and effort data: a review of recent approaches. Fish. Res. (2004)
B. Parsons et al. Using community observations to predict occurrence of malleefowl (Leipoa ocellata) in the Western Australian wheatbelt. Biol. Conserv. (2009)
S.J. Phillips et al. Maximum entropy modeling of species geographic distributions. Ecol. Model. (2006)
J. Seys et al. An evaluation of beached bird monitoring approaches. Mar. Pollut. Bull. (2002)
J. Silvertown. A new dawn for citizen science. Trends Ecol. Evol. (Pers. Ed.) (2009)
T. Snäll et al. Evaluating citizen-based presence data for bird monitoring. Biol. Conserv. (2011)
B.L. Sullivan et al. eBird: a citizen-based bird observation network in the biological sciences. Biol. Conserv. (2009)
A.I.T. Tulloch et al. Realizing the full potential of citizen science monitoring schemes. Biol. Conserv. (2013)
A. Ahrends et al. Funding begets biodiversity. Divers. Distrib. (2011)
J. Alroy. Geographical, environmental and intrinsic biotic controls on Phanerozoic marine diversification. Palaeontology (2010)
A. Arab et al. Zero-inflated modeling of fish catch per unit area resulting from multiple gears: application to channel catfish and shovelnose sturgeon in the Missouri River. N. Am. J. Fish. Manag. (2008)
M.B. Ashcroft et al. Combining citizen science, bioclimatic envelope models and observed habitat preferences to determine the distribution of an inconspicuous, recently detected introduced bee (Halictus smaragdulus Vachal Hymenoptera: Halictidae) in Australia. Biol. Invasions (2012)
M. Barbet-Massin et al. Selecting pseudo-absences for species distribution models: how, where and how many? Methods Ecol. Evol. (2012)
E.H. Boakes et al. Distorted views of biodiversity: spatial and temporal bias in species occurrence data. PLoS Biol. (2010)
R. Bonney et al. Citizen science: a developing tool for expanding science knowledge and scientific literacy. Bioscience (2009)
N.D. Bonter et al. Data validation in citizen science: a case study from Project FeederWatch. Front. Ecol. Environ. (2012)
B.S. Bray et al. Evaluation of a statewide volunteer angler diary program for use as a fishery assessment tool. N. Am. J. Fish. Manag. (2001)
C. Brunsdon et al. Assessing the changing flowering date of the common lilac in North America: a random coefficient model approach. Geoinformatica (2012)
J.R. Busby. BIOCLIM - A bioclimate analysis and prediction system. Nature conservation: cost effective biological surveys and data analysis
N. Butt et al. Quantifying the sampling error in tree census measurements by volunteers and its effect on carbon stock estimates. Ecol. Appl. (2013)
M.W. Cadotte et al. Phylogenetic diversity metrics for ecological communities: integrating species richness, abundance and evolutionary history. Ecol. Lett. (2010)
A. Chao et al. Coverage-based rarefaction and extrapolation: standardizing samples by completeness rather than size. Ecology (2012)
A. Chao et al. Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies. Ecol. Monogr. (2013)
Conn, P., McClintock, B.T., Cameron, M., Johnson, D.S., Moreland, E., Boveng, P.L., in press. Accommodating species...
T.E. Cox et al. Expert variability provides perspectives on the strengths and weaknesses of citizen-driven intertidal monitoring program. Ecol. Appl. (2012)
A.W. Crall et al. Assessing citizen science data quality: a case study. Conserv. Lett. (2011)
A. Cutler et al. Random forests for classification in ecology. Ecology (2007)
G. De’ath et al. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology (2000)
D. Delaney et al. Marine invasive species: validation of citizen science and implications for national monitoring networks. Biol. Invasions (2008)
J.L. Dickinson et al. Citizen science as an ecological research tool: challenges and benefits. Annu. Rev. Ecol. Evol. Syst. (2010)
C.F. Dormann et al. Methods to account for spatial autocorrelation in the analysis of species distribution data: a review. Ecography (2007)
M. Dudík et al. Correcting sample selection bias in maximum entropy density estimation
G.J. Edgar et al. Ecological effects of marine protected areas on rocky reef communities; a continental-scale analysis. Mar. Ecol. Prog. Ser. (2009)
J. Elith et al. Novel methods improve prediction of species’ distributions from occurrence data. Ecography (2006)
J. Elith et al. A working guide to boosted regression trees. J. Anim. Ecol. (2008)
R.G. Farmer et al. Observer effects and avian-call-count survey quality: rare-species biases and overconfidence. Auk (2012)
S. Ferrier et al. Spatial modelling of biodiversity at the community level. J. Appl. Ecol. (2006)
R.M. Fewster et al. Analysis of population trends for farmland birds using generalized additive models. Ecology (2000)
