Statistical solutions for error and bias in global citizen science datasets
Introduction
Evaluating global changes in the distribution and diversity of Earth’s biota requires datasets of ambitious proportions, where effort is shared among hundreds, or even thousands, of individuals (Silvertown, 2009). In recent decades, volunteers, often labeled ‘citizen scientists’, have been central to the collection of broad-scale datasets, allowing the scientific community to address questions that would otherwise be logistically or financially infeasible, even for the most dedicated scientific team (Dickinson et al., 2010; Stuart-Smith et al., in press). Consequently, citizen science (CS) volunteer networks provide an opportunity to answer conservation-related questions on the broad temporal and spatial scales that are relevant to understanding global biodiversity patterns. As proof of this concept, long-running volunteer monitoring programs have generated thousands of peer-reviewed papers (Sullivan et al., 2009) and can thus offer models for the development of similar programs in novel systems (Bonney et al., 2009).
As well as providing a practical means of addressing large-scale questions in ecology, involving citizens in the collection of data offers a number of benefits for conservation-related projects. By being inclusive and engaging large numbers of people, CS projects can generate important publicity and discourse on conservation issues, and provide opportunities for the public to take an active role in management and conservation (Pattengill-Semmens and Semmens, 2003). Additionally, CS projects can often afford to be more exploratory than regimented monitoring programs, because sightings from large networks of volunteers spanning broad spatial scales make observations of rare events possible. Given these advantages, the capacity for addressing global-scale conservation may well rest in the realm of citizen science (Silvertown, 2009).
In spite of the proven success and potential of CS datasets for addressing pressing global issues, there has been intense debate over the utility of such data in a scientific framework. Detractors suggest that involving large numbers of individuals with varying skill and commitment will reduce the precision of measurements, such as the identification or counting of species. Moreover, significant sources of bias may be present in the data, such as under-detection of species or the non-random distribution of effort (Crall et al., 2011). Such concerns have motivated CS projects to maximize the quality of data collected through improved sampling protocols and training (Edgar and Stuart-Smith, 2009), database management (Crall et al., 2011), and filtering or subsampling data to deal with error and uneven effort (Wiggins and Crowston, 2011; Wiggins et al., 2011). However, in many broadly distributed databases it may be impossible to implement rigid protocols or to eliminate all sources of error and bias. Thus, global CS datasets will likely violate the basic assumptions of some statistical analyses.
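As a minimal illustration of what subsampling to deal with uneven effort can look like in practice, the Python sketch below thins a set of records so that no grid cell contributes more than a fixed number of observations. It is a sketch under stated assumptions rather than a method taken from the studies cited above: the record layout, the `cell` field, the function name `subsample_by_cell` and the cap of two records per cell are all hypothetical.

```python
import random
from collections import defaultdict


def subsample_by_cell(records, max_per_cell, seed=0):
    """Return at most `max_per_cell` randomly chosen records per grid cell.

    `records` is a list of dicts with a hypothetical "cell" key that
    identifies the spatial unit in which the observation was made.
    """
    rng = random.Random(seed)  # fixed seed so the thinning is repeatable
    by_cell = defaultdict(list)
    for rec in records:
        by_cell[rec["cell"]].append(rec)

    kept = []
    for cell in sorted(by_cell):  # sorted for a deterministic output order
        group = by_cell[cell]
        if len(group) > max_per_cell:
            # Heavily sampled cell: draw a random subset without replacement.
            group = rng.sample(group, max_per_cell)
        kept.extend(group)
    return kept


# Hypothetical data: cell "A" is heavily over-sampled relative to "B" and "C".
records = (
    [{"cell": "A", "count": c} for c in (3, 5, 4, 6, 2, 7)]
    + [{"cell": "B", "count": 1}]
    + [{"cell": "C", "count": 2}, {"cell": "C", "count": 0}]
)
thinned = subsample_by_cell(records, max_per_cell=2)
print(len(records), "records before thinning;", len(thinned), "after")
```

Thinning of this kind discards data from well-sampled cells, so it trades precision for a more even distribution of effort; the modeling approaches discussed later aim to account for uneven effort without discarding observations.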
Fortunately, the issues of error and bias that are often present in CS data are not unique; analogous problems exist in datasets across a wide variety of disciplines and can be addressed using a suite of analytical approaches. In many cases, CS databases resemble the data collected for meta-analytical and landscape ecology syntheses, where methods for accurately estimating and incorporating within-study or within-observer variability are key to drawing conclusions from the data (Hedges et al., 2010). For complex datasets, machine learning (ML) approaches are available that can examine the relative importance of large numbers of predictor variables in explaining the response data (Fink and Hochachka, 2012; Olden et al., 2008). Moreover, custom hierarchical analyses can recognize and account for the variable and clustered nature of CS data (Hochachka et al., 2012).
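To make the meta-analytic analogy concrete, the Python sketch below pools estimates from several observers by inverse-variance weighting, the simplest fixed-effect way of incorporating within-observer variability. It is an illustrative assumption, not the approach of any study cited above: the function name `inverse_variance_pool` and the three observer estimates and variances are hypothetical, and a fuller analysis would typically also model between-observer variance.

```python
import math


def inverse_variance_pool(estimates, variances):
    """Fixed-effect pooled estimate and its standard error.

    Each estimate is weighted by the reciprocal of its sampling variance,
    so noisier observers contribute less to the pooled value.
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need equally sized, non-empty inputs")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    return pooled, standard_error


# Hypothetical abundance estimates from three observers and their variances.
estimates = [10.0, 14.0, 12.0]
variances = [1.0, 4.0, 2.0]

pooled, se = inverse_variance_pool(estimates, variances)
simple_mean = sum(estimates) / len(estimates)
print(f"unweighted mean = {simple_mean:.2f}")
print(f"pooled estimate = {pooled:.2f} (SE {se:.2f})")
```

In this hypothetical case the pooled value sits closer to the first observer's estimate than the unweighted mean does, because that observer has the smallest variance and therefore the largest weight.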
Here, our overall objective is to promote the use of CS data in conservation ecology and policy by highlighting how issues of data quality can be addressed using a suite of relatively new statistical tools. We first provide context by describing the main considerations for identifying and quantifying data quality issues present in CS data. Second, we explore a number of modeling approaches available for use with CS data, with case examples to illustrate how specific issues of error and bias can alter understanding of biological patterns when left unaccounted for. Our perspective is that CS data have the potential to describe global patterns in biodiversity and the mechanisms driving change in ecosystems, communities and species. The inferential capacity to do so rests on the continued development and use of modeling approaches to identify and correct for data quality issues.
Contextualizing the quality issues present in citizen science data
Most CS projects recognize the potential for error and bias when large numbers of volunteers collect data. Volunteer training, data standardization, validation and filtering procedures reduce potential sources of error and bias before, during and after the data are collected (Bonter and Cooper, 2012; Wiggins et al., 2011). In fact, studies comparing data generated by skilled volunteers vs. experts often show comparable estimates (e.g. Delaney et al., 2008, Edgar and
Modeling approaches
Modern statistical tools present options for accounting for many types of error and bias. In the following sections, we describe a variety of such techniques that may be particularly relevant to CS data. We aim to indicate where and why one might use each tool, to describe the different approaches and to illustrate applications by drawing on examples from the literature. Table 1 provides examples of freely available statistical packages for implementing many of the approaches we describe in the
Recommendations
There is great potential for the use of CS data as a mainstream tool to address the important ecological and conservation questions of our time. However, in order to do so, researchers will need to consider some basic principles of data collection, management and analysis. Taking an overview of recent techniques used in research based on CS data (Table 2) and incorporating the advice found in Zuur et al. (2010), we have extracted a few recommendations.
First, working with both statisticians and
Acknowledgments
T.J.B. was supported by an Australian Research Council Linkage doctoral fellowship and the Australian Research Council Centre for Excellence in Environmental Decisions. Salary for A.E.B. was provided by the Fisheries Research Development Corporation, the National Climate Change Adaptation Facility, and the Australian National Network in Marine Science (a collaborative funding initiative between James Cook University, the University of Tasmania, and the University of Western Australia). Development of the
References (88)
- et al. Hierarchical models for smoothed population indices: the importance of considering variations in trends of count data among sites. Ecol. Ind. (2012)
- et al. Review: generalized linear mixed models: a practical guide for ecology and evolution. Trends Ecol. Evol. (2009)
- et al. Using control data to determine the reliability of volunteered geographic information about land cover. Int. J. Appl. Earth Obs. Geoinf. (2013)
- et al. Effect of sampling effort and species detectability on volunteer based anuran monitoring programs. Biol. Conserv. (2005)
- et al. Biases associated with the use of underwater visual census techniques to quantify the density and size-structure of fish populations. J. Exp. Mar. Biol. Ecol. (2004)
- et al. Data-intensive science applied to broad-scale citizen science. Trends Ecol. Evol. (2012)
- et al. Standardizing catch and effort data: a review of recent approaches. Fish. Res. (2004)
- et al. Using community observations to predict occurrence of malleefowl (Leipoa ocellata) in the Western Australian wheatbelt. Biol. Conserv. (2009)
- et al. Maximum entropy modeling of species geographic distributions. Ecol. Model. (2006)
- et al. An evaluation of beached bird monitoring approaches. Mar. Pollut. Bull. (2002)
- A new dawn for citizen science. Trends Ecol. Evol.
- Evaluating citizen-based presence data for bird monitoring. Biol. Conserv.
- eBird: a citizen-based bird observation network in the biological sciences. Biol. Conserv.
- Realizing the full potential of citizen science monitoring schemes. Biol. Conserv.
- Funding begets biodiversity. Divers. Distrib.
- Geographical, environmental and intrinsic biotic controls on Phanerozoic marine diversification. Palaeontology
- Zero-inflated modeling of fish catch per unit area resulting from multiple gears: application to channel catfish and shovelnose sturgeon in the Missouri River. N. Am. J. Fish. Manage.
- Combining citizen science, bioclimatic envelope models and observed habitat preferences to determine the distribution of an inconspicuous, recently detected introduced bee (Halictus smaragdulus Vachal Hymenoptera: Halictidae) in Australia. Biol. Invasions
- Selecting pseudo-absences for species distribution models: how, where and how many? Methods Ecol. Evol.
- Distorted views of biodiversity: spatial and temporal bias in species occurrence data. PLoS Biol.
- Citizen science: a developing tool for expanding science knowledge and scientific literacy. Bioscience
- Data validation in citizen science: a case study from project FeederWatch. Front. Ecol. Environ.
- Evaluation of a statewide volunteer angler diary program for use as a fishery assessment tool. N. Am. J. Fish. Manage.
- Assessing the changing flowering date of the common lilac in North America: a random coefficient model approach. Geoinformatica
- BIOCLIM - A bioclimate analysis and prediction system. Nature conservation: cost effective biological surveys and data analysis
- Quantifying the sampling error in tree census measurements by volunteers and its effect on carbon stock estimates. Ecol. Appl.
- Phylogenetic diversity metrics for ecological communities: integrating species richness, abundance and evolutionary history. Ecol. Lett.
- Coverage-based rarefaction and extrapolation: standardizing samples by completeness rather than size. Ecology
- Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies. Ecol. Monogr.
- Expert variability provides perspectives on the strengths and weaknesses of citizen-driven intertidal monitoring program. Ecol. Appl.
- Assessing citizen science data quality: a case study. Conserv. Lett.
- Random forests for classification in ecology. Ecology
- Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology
- Marine invasive species: validation of citizen science and implications for national monitoring networks. Biol. Invasions
- Citizen science as an ecological research tool: challenges and benefits. Annu. Rev. Ecol. Evol. Syst.
- Methods to account for spatial autocorrelation in the analysis of species distribution data: a review. Ecography
- Correcting sample selection bias in maximum entropy density estimation
- Ecological effects of marine protected areas on rocky reef communities; a continental-scale analysis. Mar. Ecol. Prog. Ser.
- Novel methods improve prediction of species’ distributions from occurrence data. Ecography
- A working guide to boosted regression trees. J. Anim. Ecol.
- Observer effects and avian-call-count survey quality: rare-species biases and overconfidence. Auk
- Spatial modelling of biodiversity at the community level. J. Appl. Ecol.
- Analysis of population trends for farmland birds using generalized additive models. Ecology