Selecting pseudo-absence data for presence-only distribution modeling: How far should you stray from what you know?

doi:10.1016/j.ecolmodel.2008.11.010

Ecological Modelling

Volume 220, Issue 4, 24 February 2009, Pages 589-594

https://doi.org/10.1016/j.ecolmodel.2008.11.010

Abstract

An important decision in presence-only species distribution modeling is how to select background (or pseudo-absence) localities for model parameterization. The selection of such localities may influence model parameterization and thus can influence the appropriateness and accuracy of the model prediction when extrapolating the species distribution across time and space. We used 12 species from the Australian Wet Tropics (AWT) to evaluate the relationship between the geographic extent from which pseudo-absences are taken and model performance, as well as the shape and importance of predictor variables, using the MAXENT modeling method. Model performance was lower when pseudo-absence points were taken from either a restricted or a broad region with respect to species occurrence data than when they were taken from an intermediate region. Furthermore, variable importance (i.e., contribution to the model) changed such that models became increasingly simplified, dominated by just two variables, as the area from which pseudo-absence points were drawn increased. Our results suggest that it is important to consider the spatial extent from which pseudo-absence data are taken. We suggest species distribution modeling exercises should begin with exploratory analyses evaluating what extent might provide both the most accurate results and a biologically meaningful fit between species occurrence and predictor variables. This is especially important when modeling across space or time, a growing application for species distributional modeling.

Introduction

Appropriate selection of pseudo-absence or background locations is essential for presence-only species distribution modeling (SDM) (Chefaoui and Lobo, 2008). Recent studies have highlighted several methods for selecting pseudo-absence points, including: random selection (e.g., Stockwell and Peters, 1999); random selection with geographically weighted exclusion (e.g., Hirzel et al., 2001); random selection with environmentally weighted exclusion (e.g., Zaniewski et al., 2002); locations that have been visited (i.e., occurrences for other species) but where the target species was not recorded (e.g., Elith and Leathwick, 2007); and occurrences for an entire group of species collected using the same methods, which encapsulates the sampling bias of the data (e.g., Phillips and Dudik, 2008). While the relative merits of these different methods have been discussed previously (e.g., Lütolf et al., 2006; Chefaoui and Lobo, 2008; Phillips and Dudik, 2008), one important methodological step that has not been properly evaluated is the extent of the geographic region from which background or pseudo-absence points are taken. We suspect that, in practice, the decision to set spatial constraints on the background is typically made unconsciously: modelers simply default to the extent of an arbitrarily defined study area. But does this really matter?

There are several reasons why pseudo-absences selected at large distances from known occurrences may be problematic. Essentially, pseudo-absences are meant to provide a comparative data set so that the conditions under which a species occurs can be contrasted against those where it is absent. If pseudo-absences are geographically distant from the presence locations, predictive models will be dominated by parameters that coarsely discriminate regional conditions, with a weakened ability to tease out the fine-scale conditions that actually limit the species' distribution. This is in direct conflict with the purpose of generating pseudo-absences in the first place.
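The concern can be illustrated with a toy calculation. The sketch below is not the authors' analysis: the one-dimensional transect, the 5 km presence range, the sample sizes and the function names `auc` and `toy_auc` are all hypothetical. It scores points with a deliberately coarse "model" (suitability falls off with distance from the range centre) and computes a rank-based AUC against background drawn from progressively wider windows.

```python
import random


def auc(presence_scores, background_scores):
    """Rank-based AUC: probability that a random presence outscores a random
    background point, counting ties as one half."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


def toy_auc(half_width_km, rng, n_presence=200, n_background=1000):
    # Hypothetical setup: presences lie within 5 km of a range centre at x = 0;
    # background points are uniform within +/- half_width_km of that centre.
    presences = [rng.uniform(-5.0, 5.0) for _ in range(n_presence)]
    background = [rng.uniform(-half_width_km, half_width_km)
                  for _ in range(n_background)]

    # Coarse "model": suitability simply declines with distance from the centre.
    def score(x):
        return -abs(x)

    return auc([score(x) for x in presences], [score(x) for x in background])


if __name__ == "__main__":
    # Illustrative window sizes only; they are not the paper's design.
    for half_width in (10.0, 100.0, 500.0):
        print(half_width, round(toy_auc(half_width, random.Random(0)), 3))
```

Under these assumptions the expected AUC is 1 - 2.5/R for a window of half-width R km, so roughly 0.75 at 10 km and 0.995 at 500 km. The coarse predictor looks nearly perfect against a distant background even though it says nothing about fine-scale limits within the occupied region.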

The objective of this study was to ask whether background size really matters and, if so, how far from presence localities pseudo-absence points should be selected. We address both questions by selecting random pseudo-absences from increasingly larger background areas and monitoring the impact this has on the predictions of species distribution models. Specifically, we examine 12 rainforest vertebrates from the Australian Wet Tropics (AWT) and employ a common presence-only ecological niche modeling methodology, MAXENT (Phillips et al., 2006). In this application MAXENT is used to represent both background and pseudo-absence modeling. We explore changes in model accuracy, predicted distributional area and relative importance of predictor variables with increasing background size.
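Drawing random pseudo-absences from increasingly larger background areas can be sketched as follows. This is a minimal illustration and not the authors' procedure: it assumes the background is a buffer of fixed radius around the presence points, that coordinates are in a planar projection with units of km, and the names `sample_background` and `nearest_presence_km` are hypothetical. Points are drawn by rejection sampling so that they are uniform over the union of the buffers.

```python
import math
import random


def nearest_presence_km(point, presences):
    """Distance (km) from a point to its nearest presence locality."""
    return min(math.hypot(point[0] - px, point[1] - py) for px, py in presences)


def sample_background(presences, radius_km, n_points, rng):
    """Draw n_points uniformly from the union of discs of radius_km
    centred on the presence localities (planar km coordinates assumed)."""
    xs = [p[0] for p in presences]
    ys = [p[1] for p in presences]
    # Bounding box that encloses every buffer.
    x_min, x_max = min(xs) - radius_km, max(xs) + radius_km
    y_min, y_max = min(ys) - radius_km, max(ys) + radius_km

    points = []
    while len(points) < n_points:
        candidate = (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))
        # Rejection step: keep the candidate only if it falls inside a buffer.
        if nearest_presence_km(candidate, presences) <= radius_km:
            points.append(candidate)
    return points


if __name__ == "__main__":
    rng = random.Random(0)
    presences = [(0.0, 0.0), (15.0, 5.0), (40.0, -10.0)]  # hypothetical localities
    for radius in (10, 100, 500):  # illustrative radii in km
        background = sample_background(presences, radius, 1000, rng)
        farthest = max(nearest_presence_km(p, presences) for p in background)
        print(radius, round(farthest, 1))
```

A real workflow would additionally mask out sea or unsampled cells and extract predictor values at each sampled point before model fitting; those steps are omitted here.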

Methods

The AWT of northeastern Australia is an ideal candidate region for testing our objectives (Fig. 1). The region contains a diverse, well-studied vertebrate fauna and encompasses strong environmental gradients. The AWT supports 1.8 million ha of rainforest-dominated vegetation that was once widespread in Miocene Australia but now forms a distinct and isolated environmental domain of high diversity surrounded by drier and warmer environments (Nix, 1991, Moritz, 2005).

We modeled vertebrate species

Results

The results are summarized in Fig. 2. Model predictions and performance changed in at least four important ways as the area of the background from which pseudo-absences were drawn increased. First, the flexible area AUC increased. Specifically, the AUC increased rapidly as background size expanded from 10 to 100 km. Subsequent expansions resulted in only minor increases in AUC (i.e., at 100 km all models already had an AUC > 0.93 and by 500 km AUC > 0.99). Second, in 50% of the species, the fixed area

Discussion

Here we show that the size of background from which pseudo-absences are drawn has important ramifications for predictions and performance of SDMs. We have focused on predictions of current distributions but this issue will likely be even more problematic for models that are projected onto different geographic space or under different climate scenarios. For example, inappropriate background selection may unduly affect studies of invasive species (e.g., Mau-Crimmins et al., 2006, Steiner et al.,

Acknowledgements

This research was supported by the James Cook University Research Advancement Program, the Marine and Tropical Sciences Research Facility, Earthwatch Institute and the Queensland Smart State Program.

References (23)

M.B. Araújo et al. Ensemble forecasting of species distributions. Trends Ecol. Evol. (2007)
R.M. Chefaoui et al. Assessing the effects of pseudo-absences on predictive distribution model performance. Ecol. Model. (2008)
I. Growns et al. Classification of aquatic bioregions through the use of distributional modelling of freshwater fish. Ecol. Model. (2008)
D.W. Hilbert et al. The utility of artificial neural networks for modelling the distribution of vegetation in past, present and future climates. Ecol. Model. (2001)
A.H. Hirzel et al. Assessing habitat-suitability models with a virtual species. Ecol. Model. (2001)
T.M. Mau-Crimmins et al. Can the invaded range of a species be predicted sufficiently using only native-range data? Lehmann lovegrass (Eragrostis lehmanniana) in the southwestern United States. Ecol. Model. (2006)
S.J. Phillips et al. Maximum entropy modeling of species geographic distributions. Ecol. Model. (2006)
A.E. Zaniewski et al. Predicting species spatial distributions using presence-only data: a case study of native New Zealand ferns. Ecol. Model. (2002)
J. Elith et al. Novel methods improve prediction of species’ distributions from occurrence data. Ecography (2006)
J. Elith et al. Predicting species distributions from museum and herbarium records using multiresponse models fitted with multivariate adaptive regression splines. Divers. Distrib. (2007)
G.M. Harris et al. Redefining biodiversity conservation priorities. Conserv. Biol. (2005)


Short Communication