Elsevier

Biological Conservation

Volume 173, May 2014, Pages 144-154
Statistical solutions for error and bias in global citizen science datasets

https://doi.org/10.1016/j.biocon.2013.07.037

Highlights

  • Citizen-scientist (CS) datasets offer unique opportunities and challenges to the study of global conservation priorities.

  • Fortunately, issues of error and bias found in CS data are similar to those found in other large-scale databases.

  • As a consequence, statistical tools exist to handle many kinds of error and bias common to CS data.

  • We highlight some statistical approaches that are used in ecological contexts and are available in free software packages.

Abstract

Networks of citizen scientists (CS) have the potential to observe biodiversity and species distributions at global scales. Yet the adoption of such datasets in conservation science may be hindered by a perception that the data are of low quality. This perception likely stems from the propensity of data generated by CS to contain greater levels of variability (e.g., measurement error) or bias (e.g., spatio-temporal clustering) in comparison to data collected by scientists or instruments. Modern analytical approaches can account for many types of error and bias typical of CS datasets. It is possible to (1) describe how pseudo-replication in sampling influences the overall variability in response data using mixed-effects modeling, (2) integrate data to explicitly model the sampling process and account for bias using a hierarchical modeling framework, and (3) examine the relative influence of many different or related explanatory factors using machine learning tools. Information from these modeling approaches can be used to predict species distributions and to estimate biodiversity. Even so, achieving the full potential of CS projects requires metadata describing the sampling process, reference data to allow for standardization, and insightful modeling suitable to the question of interest.

Introduction

Evaluating global changes in the distribution and diversity of Earth’s biota requires datasets of ambitious proportions where effort is shared over hundreds, or even thousands, of individuals (Silvertown, 2009). In recent decades, volunteers, often labeled as ‘citizen scientists’ (CS), have been central to the collection of broad-scale datasets allowing the scientific community to address questions that would otherwise be logistically or financially unfeasible, even for the most dedicated scientific team (Dickinson et al., 2010; Stuart-Smith et al., in press). Consequently, volunteer networks provide an opportunity to answer conservation-related questions on the broad temporal and spatial scales that are relevant to understanding global biodiversity patterns. As proof of this concept, long-running volunteer monitoring programs have generated thousands of peer-reviewed papers (Sullivan et al., 2009) and can thus offer models for the development of similar programs in novel systems (Bonney et al., 2009).

As well as providing a practical means of addressing large-scale questions in ecology, involving citizens in the collection of data has a number of benefits to conservation-related projects. By being inclusive and engaging large numbers of people, CS projects can bring important publicity and discourse on conservation issues, and provide opportunities for the public to take an active role in management and conservation (Pattengill-Semmens and Semmens, 2003). Additionally, CS projects can often afford to be more exploratory than more regimented monitoring programs, making observations of rare events possible with sightings from large networks of volunteers that span broad spatial scales. Given these advantages, the capacity for addressing global-scale conservation may well rest in the realm of citizen science (Silvertown, 2009).

In spite of the proven success and potential for using CS datasets to address pressing global issues, there has been intense debate over the utility of such data in a scientific framework. Detractors suggest that involving large numbers of individuals with varying skill and commitment will lead to decreased precision in measurements such as in the identification or counting of species. Moreover, significant sources of bias may be present in the data, such as under-detection of species or the non-random distribution of effort (Crall et al., 2011). Such concerns have motivated CS projects to maximize the quality of data collected through improved sampling protocols and training (Edgar and Stuart-Smith, 2009), database management (Crall et al., 2011), and filtering or subsampling data to deal with error and uneven effort (Wiggins and Crowston, 2011; Wiggins et al., 2011). However, in many broadly distributed databases it may be impossible to implement rigid protocols or to eliminate all sources of error and bias. Thus, global CS datasets will likely violate the basic assumptions of some statistical analyses.
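
As one illustration of such a violation, repeated records from the same volunteer are rarely independent of one another. The following minimal Python sketch is not drawn from the studies cited here. It simulates counts from hypothetical observers who each carry their own systematic offset, then uses a one-way ANOVA method-of-moments estimator (a simpler stand-in for a fitted mixed-effects model, valid for a balanced design) to split the variance into between-observer and within-observer components. All names and parameter values (`simulate_counts`, `sd_between`, 200 observers with 10 records each) are assumptions chosen for illustration only.

```python
import random
from statistics import mean


def simulate_counts(n_observers=200, n_per_observer=10,
                    sd_between=2.0, sd_within=1.0, seed=1):
    """Simulate records grouped by observer (all values are hypothetical)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_observers):
        # Each observer has a persistent offset (e.g. skill or habit).
        offset = rng.gauss(0.0, sd_between)
        data.append([10.0 + offset + rng.gauss(0.0, sd_within)
                     for _ in range(n_per_observer)])
    return data


def variance_components(groups):
    """Method-of-moments estimates for a balanced one-way random-effects layout."""
    k = len(groups)          # number of observers
    n = len(groups[0])       # records per observer (balanced design assumed)
    grand = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]
    ms_between = n * sum((m - grand) ** 2 for m in group_means) / (k - 1)
    ms_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means)
                    for x in g) / (k * (n - 1))
    var_between = max((ms_between - ms_within) / n, 0.0)
    return var_between, ms_within


def effective_sample_size(groups):
    """Intraclass correlation and the design-effect-adjusted sample size."""
    var_between, var_within = variance_components(groups)
    icc = var_between / (var_between + var_within)
    n = len(groups[0])
    total = n * len(groups)
    design_effect = 1.0 + (n - 1) * icc
    return icc, total / design_effect


data = simulate_counts()
icc, n_eff = effective_sample_size(data)
print(f"intraclass correlation: {icc:.2f}")
print(f"nominal records: {sum(len(g) for g in data)}, effective: {n_eff:.0f}")
```

Under these assumed settings most of the variance lies between observers, so the nominal records should carry far less information than the same number of independent observations. An analysis that treats them as independent would overstate its precision.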

Fortunately, the issues of error and bias that are often present in CS data are not unique; analogous problems exist in datasets across a wide variety of disciplines and can be addressed using a suite of analytical approaches. In many cases, CS databases resemble the data collected for meta-analytical and landscape ecology syntheses where methods for accurately estimating and incorporating within-study or within-observer variability are key to drawing conclusions from the data (Hedges et al., 2010). For complex datasets, machine learning (ML) approaches are available that can examine the relative importance of large numbers of predictive variables in explaining the response data (Fink and Hochachka, 2012; Olden et al., 2008). Moreover, custom hierarchical analyses can recognize and account for the variable and clustered nature of CS data (Hochachka et al., 2012).
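
The logic of a hierarchical analysis that separates the ecological process from the sampling process can be sketched with a basic single-season occupancy model. The example below is a hedged illustration under assumed values, not a method taken from the cited papers. It simulates repeat visits to hypothetical sites where a species is present with probability `psi` and detected on each visit with probability `p`, then recovers both parameters by a coarse grid search over the likelihood. The function names, the grid step, and the values `psi=0.6`, `p=0.3`, and four visits per site are all assumptions. A real analysis would normally use a dedicated package and a proper optimizer.

```python
import math
import random


def simulate_detections(n_sites=3000, n_visits=4, psi=0.6, p=0.3, seed=7):
    """Return the number of detections per site (all values are hypothetical)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sites):
        occupied = rng.random() < psi          # ecological process
        detections = sum(1 for _ in range(n_visits)
                         if occupied and rng.random() < p)  # sampling process
        counts.append(detections)
    return counts


def log_likelihood(psi, p, tally, n_visits):
    """Occupancy log-likelihood, dropping constant binomial coefficients."""
    total = 0.0
    for d, n_sites in tally.items():
        if d > 0:
            # Species was seen, so the site must be occupied.
            site_lik = psi * p ** d * (1 - p) ** (n_visits - d)
        else:
            # Never seen: occupied but missed every time, or truly absent.
            site_lik = psi * (1 - p) ** n_visits + (1 - psi)
        total += n_sites * math.log(site_lik)
    return total


def fit_occupancy(counts, n_visits, step=0.01):
    """Coarse grid-search maximum likelihood estimate of (psi, p)."""
    tally = {}
    for d in counts:
        tally[d] = tally.get(d, 0) + 1
    grid = [i * step for i in range(1, int(round(1 / step)))]
    best = max((log_likelihood(psi, p, tally, n_visits), psi, p)
               for psi in grid for p in grid)
    return best[1], best[2]


counts = simulate_detections()
naive = sum(1 for d in counts if d > 0) / len(counts)
psi_hat, p_hat = fit_occupancy(counts, n_visits=4)
print(f"naive occupancy (sites with >= 1 detection): {naive:.2f}")
print(f"estimated occupancy: {psi_hat:.2f}, estimated detection: {p_hat:.2f}")
```

The naive proportion of sites with at least one detection confounds presence with detectability and so should fall below the simulated occupancy, whereas the model that represents detection explicitly should recover a value close to it. This is the sense in which modeling the sampling process can correct for under-detection.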

Here, our overall objective is to promote the use of CS data in conservation ecology and policy by highlighting how issues of data quality can be addressed using a suite of relatively new statistical tools. We first provide context by describing the main considerations for identifying and quantifying data quality issues present in CS data. Second, we explore a number of modeling approaches available for use with CS data with case examples to illustrate how specific issues of error and bias can alter understanding of biological patterns when left unaccounted for. Our perspective is that CS data have the potential to describe global patterns in biodiversity and the mechanisms driving change in ecosystems, communities and species. The inferential capacity to do so rests on the continued development and use of modeling approaches to identify and correct for data quality issues.

Section snippets

Contextualizing the quality issues present in citizen science data

Most CS projects recognize the potential issues of error and bias present when using large numbers of volunteers to collect data. Volunteer training, data standardization, validation and filtering procedures reduce potential sources of error and bias before, during and after the data are collected (Bonter and Cooper, 2012; Wiggins et al., 2011). In fact, studies comparing data generated by skilled volunteers vs. experts often show comparable estimates (e.g. Delaney et al., 2008; Edgar and

Modeling approaches

Modern statistical tools present options for accounting for many types of error and bias. In the following sections, we describe a variety of such techniques that may be particularly relevant to CS data. We aim to indicate where and why one might use each tool, to describe the different approaches and illustrate applications by drawing on examples from the literature. Table 1 provides examples of freely available statistical packages for implementing many of the approaches we describe in the

Recommendations

There is great potential for the use of CS data as a mainstream tool to address the important ecological and conservation questions of our time. However, in order to do so, researchers will need to consider some basic principles of data collection, management and analysis. Taking an overview of recent techniques used in research based on CS data (Table 2) and incorporating the advice found in Zuur et al. (2010), we have extracted a few recommendations.

First, working with both statisticians and

Acknowledgments

T.J.B. was supported by an Australian Research Council Linkage doctoral fellowship and the Australian Research Council Centre for Excellence in Environmental Decisions. Salary for A.E.B. was provided by the Fisheries Research Development Corporation, the National Climate Change Adaptation Facility, and the Australian National Network in Marine Science (a collaborative funding initiative between James Cook University, the University of Tasmania, and the University of Western Australia). Development of the

References (88)

  • J. Silvertown

    A new dawn for citizen science

    Trends Ecol. Evol. (Pers. Ed.)

    (2009)
  • T. Snäll et al.

    Evaluating citizen-based presence data for bird monitoring

    Biol. Conserv.

    (2011)
  • B.L. Sullivan et al.

    eBird: a citizen-based bird observation network in the biological sciences

    Biol. Conserv.

    (2009)
  • A.I.T. Tulloch et al.

    Realizing the full potential of citizen science monitoring schemes

    Biol. Conserv.

    (2013)
  • A. Ahrends et al.

    Funding begets biodiversity

    Divers. Distrib.

    (2011)
  • J. Alroy

    Geographical, environmental and intrinsic biotic controls on Phanerozoic marine diversification

    Palaeontology

    (2010)
  • A. Arab et al.

    Zero-inflated modeling of fish catch per unit area resulting from multiple gears: application to channel catfish and shovelnose sturgeon in the Missouri River

    N. Am. J. Fish. Manag.

    (2008)
  • M.B. Ashcroft et al.

    Combining citizen science, bioclimatic envelope models and observed habitat preferences to determine the distribution of an inconspicuous, recently detected introduced bee (Halictus smaragdulus Vachal Hymenoptera: Halictidae) in Australia

    Biol. Invasions

    (2012)
  • M. Barbet-Massin et al.

    Selecting pseudo-absences for species distribution models: how, where and how many?

    Methods Ecol. Evol.

    (2012)
  • E.H. Boakes et al.

    Distorted views of biodiversity: spatial and temporal bias in species occurrence data

    PLoS Biol.

    (2010)
  • R. Bonney et al.

    Citizen science: a developing tool for expanding science knowledge and scientific literacy

    Bioscience

    (2009)
  • N.D. Bonter et al.

    Data validation in citizen science: a case study from project FeederWatch

    Front. Ecol. Environ.

    (2012)
  • B.S. Bray et al.

    Evaluation of a statewide volunteer angler diary program for use as a fishery assessment tool

    N. Am. J. Fish. Manag.

    (2001)
  • C. Brunsdon et al.

    Assessing the changing flowering date of the common lilac in North America: a random coefficient model approach

    Geoinformatica

    (2012)
  • J.R. Busby

    BIOCLIM — A bioclimate analysis and prediction system. Nature conservation: cost effective biological surveys and data analysis

  • N. Butt et al.

    Quantifying the sampling error in tree census measurements by volunteers and its effect on carbon stock estimates

    Ecol. Appl.

    (2013)
  • M.W. Cadotte et al.

    Phylogenetic diversity metrics for ecological communities: integrating species richness, abundance and evolutionary history

    Ecol. Lett.

    (2010)
  • A. Chao et al.

    Coverage-based rarefaction and extrapolation: standardizing samples by completeness rather than size

    Ecology

    (2012)
  • A. Chao et al.

    Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies

    Ecol. Monogr.

    (2013)
  • Conn, P., McClintock, B.T., Cameron, M., Johnson, D.S., Moreland, E., Boveng, P.L., in press. Accommodating species...
  • T.E. Cox et al.

    Expert variability provides perspectives on the strengths and weaknesses of citizen-driven intertidal monitoring program

    Ecol. Appl.

    (2012)
  • A.W. Crall et al.

    Assessing citizen science data quality: a case study

    Conserv. Lett.

    (2011)
  • A. Cutler et al.

    Random forests for classification in ecology

    Ecology

    (2007)
  • G. De’ath et al.

    Classification and regression trees: a powerful yet simple technique for ecological data analysis

    Ecology

    (2000)
  • D. Delaney et al.

    Marine invasive species: validation of citizen science and implications for national monitoring networks

    Biol. Invasions

    (2008)
  • J.L. Dickinson et al.

    Citizen science as an ecological research tool: challenges and benefits

    Annu. Rev. Ecol. Evol. Syst.

    (2010)
  • C.F. Dormann et al.

    Methods to account for spatial autocorrelation in the analysis of species distribution data: a review

    Ecography

    (2007)
  • M. Dudík et al.

    Correcting sample selection bias in maximum entropy density estimation

  • G.J. Edgar et al.

    Ecological effects of marine protected areas on rocky reef communities; a continental-scale analysis

    Mar. Ecol. Prog. Ser.

    (2009)
  • J. Elith et al.

    Novel methods improve prediction of species’ distributions from occurrence data

    Ecography

    (2006)
  • J. Elith et al.

    A working guide to boosted regression trees

    J. Anim. Ecol.

    (2008)
  • R.G. Farmer et al.

    Observer effects and avian-call-count survey quality: rare-species biases and overconfidence

    Auk

    (2012)
  • S. Ferrier et al.

    Spatial modelling of biodiversity at the community level

    J. Appl. Ecol.

    (2006)
  • R.M. Fewster et al.

    Analysis of population trends for farmland birds using generalized additive models

    Ecology

    (2000)