Preferential sampling for presence/absence data and for fusion of presence/absence data with presence-only data

Presence/absence data and presence-only data are the two customary sources for learning about species distributions over a region. We illuminate the fundamental modeling differences between the two types of data. Most simply, locations are considered as fixed under presence/absence data; locations are random under presence-only data. The definition of"probability of presence"is incompatible between the two. So, we take issue with modeling strategies in the literature which ignore this incompatibility, which assume that presence/absence modeling can be induced from presence-only specifications and therefore, that fusion of presence-only and presence/absence data sources is routine. We argue that presence/absence data should be modeled at point level. That is, we need to specify a surface which provides the probability of presence at any location in the region. A realization from this surface is a binary map yielding the results of Bernoulli trials across all locations. Presence-only data should be modeled as a point pattern driven by specification of an intensity function. We further argue that, with just presence/absence data, preferential sampling, using a shared process perspective, can improve our estimated presence/absence surface and prediction of presence. We also argue that preferential sampling can enable a probabilistically coherent fusion of the two data types. We illustrate with two real datasets, one presence/absence, one presence-only for invasive species presence in New England in the United States. We demonstrate that potential bias in sampling locations can affect inference with regard to presence/absence and show that inference can be improved with preferential sampling ideas. We also provide a probabilistically coherent fusion of the two datasets to again improve inference with regard to presence/absence.


Introduction
Learning about species distributions is an important activity in the ecology community. The literature discusses two types of data collection to learn about species distributions: presence/absence and presence-only. The former works with some version of designed sampling where say plots (grid cells, quadrats, etc.) are sampled and presence/absence or abundance of a species is observed for the sampled plots. That is, locations are fixed. Presence-only data arises through randomly encountering a species within a region and is typically collected in the form of museum or citizen science data. That is, locations are random. In fact, the distinction between the two types of data collection can be murky since, if data collection is developed through gridding of cells, then, conceptually, the observations associated with the cells can be viewed as capturing presence/absence as well as presence-only, as we elaborate below. In any event, the literature on modeling presence/absence data is enormous by now and, more recently, there has been a consequential growth in the literature addressing modeling for the presence-only setting. References to this literature are supplied as part of the development in Sections "Some presence/absence modeling details" and "Fusing presence/absence and presence-only data", respectively.
The contribution of this paper is to address some fundamental and occasionally contentious threads in the literature with regard to the foregoing data collection. For instance, it is asserted that a common modeling framework can be used for both data types, that presence/absence data modeling can be induced under a presence-only framework, and, moreover, that presence-only data can be used to infer about presence/absence (Dorazio, 2014;Royle et al., 2012;Hastie and Fithian, 2013). A further implication is that fusion of general presence/absence and presence-only data sources can be implemented within what is essentially the presence-only framework (Pacifici et al., 2017).
We consider these issues first with discussion to attempt to clarify what "presence at a location" means. We argue that probabilistic modeling for the two data types is distinct and incompatible.
Specifically, since, in nature, presence/absence is seen at point locations, we propose that presence/absence data should be modeled at point level. We note that, in practice, presence/absence data is not always collected at fine resolution. Often data collection is such that presence/absence is only recorded as a binary event over an areal unit. Such data collection loses information by ignoring say, the number of individuals found in the unit and perhaps even information on the locations of the individuals in the unit. It also suggests that the size of the unit should be considered with regard to the chance of presence in the unit. Furthermore, this does not refute the assertion that the presence/absence process should be modeled at point-level; rather, it reveals that the data collection makes it difficult to implement point-level modeling. In practice, the size of the unit is rarely incorporated into the modeling and, in fact, if size is not considered then the modeling returns us to point level specification where the unit is geo-referenced say, by its centroid. If the size of the unit is incorporated into the modeling then we find ourselves closer to the spirit of presence-only modeling, as we clarify below.
Next, under point-level modeling for such data, we bring in preferential sampling ideas to clarify how potential bias in selection of sampling locations can affect inference with regard to presence/absence. Using what is referred to as the "shared process" perspective, we demonstrate that estimation of the probability of presence as well as prediction of presence can be improved by accounting for preferential sampling. Then, we briefly turn to the fusion problem, arguing that current versions of such fusion in the literature have fundamental flaws. We propose a probabilistically coherent fusion, again employing the shared process perspective for implementing the fusion, extending application of preferential sampling. This allows the two data sources to be probabilistically independent or dependent. Altogether, this perspective provides a collection of models to take presence/absence modeling to a richer explanatory level.
To examine the foregoing issues, we need to look carefully at how presence/absence has been customarily modeled in the literature. We also need to do the same with regard to the presence-only literature. Further, we need to elaborate what preferential sampling is in order to reveal its utility for these issues. In the interest of keeping the explication at a concise and, hopefully, comfortable level, we only consider individual species models. However, extension to joint species distribution modeling is available and will be presented in a subsequent paper. In order to go forward, we first offer some preliminary words regarding what a presence/absence event means (to expedite the flow we defer referencing to subsequent sections).

The fundamental issue
The fundamental issue that underlies the development of this paper is the attempt to clarify exactly what a presence/absence event means? It seems that if one asks different ecologists one may get different answers. As above, for a given species, is it going to a geo-coded location and recording whether or not the species is there? Or, is it a binary summary for an areal unit? Was the species observed on the unit or not? Here, in order to bridge with preferential sampling ideas, we need to conceptualize presence/absence at point-level. In this way, we can formalize a "probability of presence" surface which provides the presence/absence probability for the species at each location in the region, driven by environmental features at the location. Below, we attempt to clarify more about the behavior of this surface. However, for example, this surface can be thresholded at a selected probability in order to obtain a niche for the species within the region. If we add to this surface a conceptual Bernoulli trial at every location, we obtain a realization of a presence/absence surface for the species, a binary map with a 1 or 0 at each point in the region. This surface is only partially observed through the data collection. In terms of "seeing" this surface, at best we can display it with a high resolution grid of points. Therefore, coherent modeling for these two surfaces which enables a generative probabilistic model for presence/absence data is a primary objective of this paper. Presence/absence modeling in the context of areal units does not permit modeling of a probability of presence surface; in this case presence/absence probability will depend upon the size, shape, orientation, etc. of the areal units.
We focus on plants (in order to remove movement challenges). (To attempt to align with animal movement modeling terminology, for plants the term occupancy is equivalent to presence and the term use might be connected to high probability of presence.) Then, for a given plant species, we can ask what the true realization of a presence/absence surface over a fixed region at a fixed time would look like? At any location in the region, this surface must take on the value of either 1 if the species is present there or 0 if it is not. If this surface is specified to be 1 for some areal units and 0 for others, then the realization of the surface will be dependent upon the selection of the units, their size, their shape, their orientation. This would seem to conflict with how presence/absence arises in nature. Such incoherence can be avoided if presence/absence is viewed at point level. Furthermore, from a point level definition, we can scale up to arbitrary areal units (see below) whereas we can not do the reverse.
More explicitly, from a point-level perspective, we can "see" the realization of the presence/absence surface over the entire region of interest, and thus, over any subset of the region without imposing any areal scales. In this regard, presence/absence data is frequently associated with areal units, e.g., described as presence/absence over a grid cell, e.g., a plot or a quadrat. Depending upon the size of the region relative to the size of the areal units, the unit may be considered as a point in the region. However, formally, presence/absence is never observed at a point. Even at fine resolution, a point is only specified with regard to a number of significant decimal places so, implicitly, it is an area due to rounding. The idea of a point-level process specification is accepted as conceptual.We adopt this idea routinely in modeling data. We never observe continuous measurements; we only observe them up to decimal accuracy. Nonetheless, conceptually, we proceed to model them as continuous.
When presence/absence data is recorded at areal units, presence is customarily declared if the species is found anywhere in the unit. But then, it seems necessary to model the probability of presence considering the size of the unit. Moreover, this definition would ignore the abundance on the unit. Should presence associated with one individual in a unit be the same as presence associated with ten individuals in a unit of the same size? Shouldn't there be implications for probability of presence in the unit? Coherence finds us wanting to think of presence/absence in a dimensionless fashion.
It can be argued that the presences of a species over a region form a point pattern. That is, there are a random, finite, number of individuals randomly located in the region. We agree and pursue this line of thinking more precisely below. However, we seek to make a connection to a realization of a presence/absence surface as well as to a model for a probability of presence/absence surface. In this regard, would an ecologist who went out to sample a fixed collection of units for presence/absence attempt to model presence/absence through a (partial) realization of a point pattern? We think, instead, that some version of a regression model using suitable unit-level covariates would be attempted, as we describe below. Furthermore, we see that there is no notion of an intensity associated with presence/absence observations. Intensities arise from thinking about presence through point patterns, a perspective that is associated with presence-only data, as we develop below. Intensity surfaces can be normalized to density surfaces. Such density surfaces reflect the relative chance of observing a species at a given location compared with other locations in the region. They have nothing to do with providing a probability of presence surface.
If we scale a realization of a presence/absence surface as a binary map to an areal unit then it makes sense to think about the average of the realization over that unit, i.e., the proportion of 1's over the unit. This proportion is the empirical chance of finding the species present at a randomly selected location within the unit. In fact, the proportion of 1's over the entire region can be interpreted as the prevalence of the species over the region. Similarly, with a modeled probability of presence surface, if we scale this surface over the unit, we obtain an average probability over the unit. This average conveys the modeled probability of finding a presence at a randomly selected location within the unit. The issue is that we need not think in terms of areal units in order to model presence/absence. If we want to investigate units then we can scale accordingly.

Our motivating dataset
We illustrate all of the above with invasive plant data from New England in the U.S. We extracted a subregion of the six New England states (Connecticut, Rhode Island, Massachusetts, Vermont, New Hampshire and Maine). The presence/absence dataset comes from the Invasive Plant Atlas of New England (IPANE) and consists of more than 4000 sites where invasive species surveys were conducted and focuses on seven species. Details are provided below. The presence-only dataset comes from the Global Biodiversity Information Facility (GBIF) which is a data aggregator for biological collections worldwide. The number of observations will vary from species to species.
Details are provided below.
IPANE is a citizen science organization that engages volunteers in scientifically rigorous sampling protocols. There are 4314 unique sampling sites across New England where invasive plant surveys were conducted. Each site is provided with a location (latitude, longitude) and has been classified with regard to each focal species as a presence (focal species recorded) or an absence (focal species not recorded). The dataset includes seven of the most common invasive plant species in the IPANE database: multiflora rose (MR), oriental bittersweet (OB), Japanese barberry (JB), glossy buckthorn (GB), autumn olive (AO), burning bush (BB) and garlic mustard (GM). All species are terrestrial and all but garlic mustard are woody (shrubs, small trees, or vines). These species vary in their land cover associations (e.g., some occur in forest understory and others occur in open habitats). We consider the same species within GBIF. Duplicated points and points lying outside the study region are discarded from the original dataset. Table 1 displays the species name and sample size for the IPANE and the GBIF datasets. In the analysis below, for convenience, longitude and latitude are transformed to eastings and northings, and rescaled from km units to 10 km units. Figure 1 displays the distribution of presence and absence locations from IPANE for each species across the study region. Figure 2 displays the distribution of presence-only points from GBIF for each species across the study region. For some species, for example garlic mustard, the distribution of the presence-only points shows a different pattern from that of the observed presences in the presence/absence data. Once more, the presences in IPANE arise from fixed sampling locations while the presences in GBIF arise at random locations. Importantly, we removed all of upper Maine as the figures show. Both the IPANE and the GBIF data were so sparse there that extending spatial modeling to include this region produced poorly behaved model fitting.
Adding to the original database, we have 19 potential covariates provided by WorldClim (version 1.4, http:/ /www.worldclim.org/version1) as 30-arc second (∼1 km) raster data. We select 7 covariates from them by discarding highly correlated covariates. They are (1) mean diurnal range (mDR, mean of monthly (max temp-min temp)), (2) max temperature of warmest month (max-TWM), (3) min temperature of coldest month (minTCM), (4) mean temperature of driest quarter (meanTDQ), (5) precipitation of wettest month (PWM), (6) precipitation seasonality (PS, the standard deviation of the monthly precipitation estimates expressed as a percentage of the mean of those estimates, that is, the annual mean), and (7) precipitation of warmest quarter (PWQ). With regard to possible multicollinearity concerns, these seven covariates were chosen such that each pair has absolute correlation less than 0.7. Figure 3

Some presence/absence modeling details
Confining ourselves to plants and following the discussion in the Introduction, we make the assumption that presence/absence data arises as observation of binary responses, presence (1) or absence (0) at a collection of sampling locations (see, e.g., Elith et al., 2006, and references therein for a review). The goal is to explain the probability of presence at a location given the environmental conditions that are present there. The customary approach is to build a binary regression model with say a logit or probit link where the covariates can be introduced linearly (see below) or as smoothly varying functions. The latter choice results in generalized additive models (GAMs) which tend to fit data well since they employ additional parameters to enable the response to assume nonlinear and multimodal relationships with the predictors (Guisan et al., 2002;Elith et al., 2006). The price that is paid for using GAMs is a loss of simplicity in interpretation as well as the risk of overfitting resulting in poor out-of-sample prediction. We don't consider GAMs further here.
Much of the early presence/absence work was non-spatial in the sense that, in modeling presence/absence probabilities, though spatial covariate information was included, potential spatial dependence in the residuals was not. Accounting for the latter seems critical. Causal ecological explanations such as localized dispersal as well as omitted (or unobserved) explanatory variables with spatial pattern such as local smoothness of soil or topographic features suggest that, at sufficiently high resolution, occurrence of a species at one location will be associated with its occurrence at neighboring locations (Ver Hoef et al., 2001). In particular, such dependence structure, introduced through spatial random effects, facilitates learning about presence/absence for portions of a study region that have not been sampled, accommodating gaps in sampling and irregular sampling effort.
Following the framework presented in Gelfand et al. (2006), suppose Y (s) denotes the presence/absence (1/0) of the species at sample location s. If the study region D is partitioned into grid cells, say at the level of resolution of the environmental covariates, then, summing up Y (s) over n i , the number of sites sampled in cell i, yields grid cell level counts: Y i+ = s∈gridi Y (s). This is an elementary illustration of scaling up from points to areal units. If the sampling site is viewed as the grid cell then we have n i = 1, a single Bernoulli trial for the cell. If the cell was not sampled, we have n i = 0.
If we assume independence for the trials, a binomial distribution results for Y i+ , i.e., Y i+ ∼ Binomial(n i , p i ). Explicitly, the probability that the species occurs in cell i, p i , is related functionally to the environmental variables with a logit link function and a linear (in coefficients) pre- Here w i is a vector of explanatory environmental variables associated with cell i and β is a vector of associated coefficients. Here, and in the sequel, we could equally well use a probit link function.
We can extend this grid cell level model to be spatially explicit by adding spatial random effects. In modeling p i , a spatial term ρ i associated with grid i is added yielding log p i The random effect ρ i adjusts the probability of presence of the modeled species up or down, depending on the values in a spatial neighborhood of cell i. To capture this behavior, we customarily employ a Gaussian intrinsic or conditional auto-regressive (CAR) model (Besag, 1974). Such a model proposes that the effect for a particular grid cell should be roughly the average of the effects of its neighboring cells and results in a multivariate normal as the joint distribution over all the cells. There are many ways to specify neighbor structure; see Banerjee et al. (2014) for a full discussion.
If we view the visited sites as points and therefore model at point scale, Y (s) would be taken as analogously relating the probability that the species occurs in site s, p(s), to the set of environmental variables as log p(s) 1−p(s) = w T (s)β. Such modelling requires that we have covariate levels w(s) for each site. This model is referred to as a spatial regression in the sense that the regressors are spatially referenced. If we set w(s) = w i when s is within grid i, we return to the grid cell model above.
Most relevant for the remainder of this paper, we extend (1) to bring in spatial dependence between points based on their relative locations using Gaussian processes, creating geostatistical models (Banerjee et al., 2014). We would model Y (s) given p(s) and augment the explanation of Here, ω(s) is the spatial random effect associated with point s, arising as a realization of a Gaussian process. A suitable covariance function would be selected. With binary response, this model is referred to a spatial generalized linear model (GLM); see Diggle et al. (1998). The model has two levels: the first or data-level specification is a Bernoulli trial and the second or process level presents the probability of presence surface. Inference from (2) would be about this surface at any location in the study region, with these probabilities explained through the spatially referenced predictors. The realized presence/absence surface, i.e. {Y (s) : s ∈ D} associated with the first level is also of interest.
The fact that, in practice, presence/absence is not observable at point level does not preclude useful point level modeling. Indeed, this is the case with all geostatistical modeling (Banerjee et al., 2014), e.g., temperature is never observed at a dimensionless location but we routinely model temperature surfaces. Assuming data of the form, (w(s i ), Y (s i )) for sites i = 1, 2, ..., n and adopting the hierarchical (multi-level) regression in (2) with say, a probit link, P (Y (s) = 1) ≡ Here, (s) is pure error, i.e., ∼ N (0, 1) and ω(s) is a mean 0 Gaussian process with a suitable correlation function, typically an exponential or, more generally, a Matérn. See, e.g., Banerjee et al. (2014) Chapter 6 for full discussion of such regression models.
Under this model, the Y (s) are drawn as conditionally independent Bernoulli trials given p(s) (and the associated Z(s) are conditionally independent normals). As a result, even if the probability of presence surface p(s) is smooth, realizations of the presence/absence surface are everywhere discontinuous. Below ("What does "probability of presence" mean?") we suggest that, under a point-level modeling specification, such behavior may not be desirable. In the Supplementary Material (Appendix S2) we present an alternative specification which deals with this concern as well as an associated technical problem.
What does "probability of presence" mean?
Following "Introduction: The fundamental issue" we now attempt more explicit discussion regarding what an observed presence means and the associated implications. The issue is whether presence/absence is viewed as an event at point level or at areal level. Is it a Bernoulli trial at say location s or is it the probability that the number of individuals of a species in a set, say A, is ≥ 1?
If we model presence/absence at point level, then Y (s) = 1 is a Bernoulli trial at location s. However, what does Y (A) mean? A coherent probabilistic definition specifies it as a block It is the proportion of the Y (s) in A that equal 1; it is not a Bernoulli trial and P (Y (A) = 1) = 0 since the probability that almost every Bernoulli trial in A results in a 1 equals 0. We can calculate (2). That is, E(Y (A)) becomes the average probability of presence over A. It is the probability that, at a randomly selected location in A, the species is . Since presence-only data samples the point pattern (although likely not fully but, rather, up to sampling effort over the region (Chakraborty et al., 2011;Fithian et al., 2015)), it is compatible with this definition of presence/absence. However, the probability of a presence in A is only defined with regard to the size of A and will vary with A, a concern raised in Hastie and Fithian (2013)  only data has been used to provide presence/absence probabilities and also the way presence-only data has been fused with presence/absence data (see, e.g., Pacifici et al. (2017)). Instead, we propose probabilistically coherent remediation for this incompatibility in Sections "Preferential sampling" and "Fusing presence/absence and presence-only data" below. However, first, in the next subsection, we attempt further clarification of point-level presence/absence modeling.

Further clarification of point level presence/absence modeling
In reconciling the differences above it may be useful to think more carefully about what the distribution of a species looks like within a specified region, D. Suppose we consider the complete census of individuals in the region. To be realistic, we have to view the number of presences in a bounded region as finite and therefore a presence must be bigger than a (dimensionless) point since there are an uncountable number of points in D. The scaling issue arises once more. Formally, a presence can not arise at a point, it is not dimensionless in size; practically, it can be observed as point-referenced. So, at point-level, the presence/absence surface over the region consists of a finite set of "patches" where the species is present and, outside of these patches, the species is absent. From an ecological and practical perspective, we could think of a patch as a collection of individuals of a particular species (it might be just one) that is dense enough so that, at point level, we would declare presence for every location in the patch. However, if the gaps between the individuals become sufficiently large, then those locations in the gaps must now become absences.
The scaling here is qualitative, not quantitative -an ecologist would not attempt to be precise here and the denseness needed to define a patch depends upon the sizes of the patch relative to the size of D. In the sequel, we also avoid defining patch sizes.
Then, a presence-only realization over the region becomes this finite set of patches. To view it as a point pattern, we might assume that each individual is located at the centroid of its patch. That is, with a complete census, the number of patches equals the number of points in the point pattern.
However, with regard to a dimensionless definition of presence/absence, a presence at a location is observed if the location falls within a patch associated with a point. This definition of the realized presence/absence surface gives an immediately rigorous definition of prevalence. The prevalence of the species over D is the total area of the patches for the species relative to the total area of D. A model which assumes conditionally independent Bernoulli trials across locations is not formally appropriate since such a model will provide random 0s and 1s across locations, yielding no local constancy. However, in practice, a suitable version of such a model will usually perform well (see Appendix S2 of the Supplementary Material) and, in fact, such a model is adopted below for computational convenience.
Third, conceptually, the number points in the point pattern can be smaller or larger than the number of observed presence locations. That is, observing a presence at a location is not identical to observing the centroid associated with the patch containing the observed presence. According to selection of sampling sites, the same individual may be observed at more than one point (though, in practice, it is not likely to be recorded as such) but also, some individuals may never be observed.
Practically, we acknowledge that presence/absence sampling will never observe all individuals but also, that presence-only sampling will rarely observe all individuals. So, formally, with a dataset of point-level presence/absence locations and a dataset of presence-only random locations, at sufficiently fine spatial resolution, the two sets of locations will be disjoint.

Preferential sampling
Working in a point-referenced framework, we bring in preferential sampling ideas to clarify how potential bias in selection of sampling locations can affect inference with regard to presence/absence.
Consideration of preferential sampling can improve presence/absence prediction as well as providing modeling for fusing presence-only data with presence/absence data.
What is preferential sampling all about?
The notion of preferential sampling was introduced into the literature in the seminal paper of Dig- from these stations will necessarily produce only high predictions. A remedy lies in suitable spatial design of the locations, e.g., a random or space-filling design (Saltzman and Nychka, 1998) for locations over the region of interest is expected to preclude such bias. Figure 4 presents three examples showing choices of presence/absence sampling locations relative to a (simulated) response surface. Interpolation under the preferential sampling scenario will tend to produce predictions which are too high in the blue regions.
In practice, sampling for presence/absence may be designed such that ecologists will tend to sample where they expect to find individuals. Such bias in the collection of sampling locations can affect predictive performance. Recognizing the possibility of such bias, can we revise presence/absence prediction to adjust for it? This is the intention of preferential sampling modeling.
While the set of sampling locations may not have been developed randomly, we study it as if it was a realization of a spatial point process. That is, it may be designed/specified in some fashion but not necessarily with the intention of being roughly uniformly distributed over D. Then, the question becomes a stochastic one: is the realization of the responses independent of the realization of the locations? If no, then we have what is called preferential sampling. Importantly, the dependence here is stochastic dependence. Notationally/functionally, the responses are associated with the locations. We make this more clear below.
In our context, as we have discussed, the presence/absence data is driven by a probability of presence surface. This surface plays the role of the "exposure" surface, with the observed set of binary responses, Y = (Y (s 1 ), Y (s 2 ), ..., Y (s n )), informing about it. Taking the set of sampling locations as a realization of a random point pattern, S = {s 1 , s 2 , ..., s n }, the question we ask is whether Y is independent of S, again in a stochastic sense? The answer will depend on the models we supply for Y and S. Below, we develop several choices, using the idea of a shared process, that enable us to address this question and, furthermore, whether S enables us to improve our inference regarding prediction of presence for a species at a location.

Preferential sampling models for presence/absence data
To develop the stochastic specifications that formalize preferential sampling for a region D, we consider two cases for the intensity associated with the point pattern of sampling locations, S: (i) log λ(s) = w T (s)β, i.e., a nonhomogeneous Poisson process (NHPP) and (ii) log λ(s) = w T (s)β + η(s), a logGaussian Cox process (LGCP).
Here, w(s) is a vector of predictors with associated regression coefficients β and η(s) is a mean 0 GP (below, for convenience, with an exponential covariance function). See, e.g., Illian et al. (2008) for full discussion of NHPPs and LGCPs. In the sequel we only work with (ii).
We adopt a direct model for Y (s) through a latent Gaussian process, Z(s), i.e., Y (s) = 1(Z(s) > 0), as in the Supplementary Material (Appendix S2). So, we only need to propose models for Z(s). We start with a simple spatial regression, Here, the coefficient δ plays a preferential sampling role. For example, suppose the design S oversamples locations in D where we observe presences, where Y (s) tends to be 1, i.e., where Z(s) tends to be high. Then, η(s) will tend to be high around those locations. Therefore, η(s) can be a significant predictor for Z(s) (hence for Y (s)) with δ > 0. (A similar argument applies when δ < 0.) With (ii) and (c), η(s) is the shared process.
A further shared process model for Y that can be explored in this regard extends (a) to (d): Z(s) = x T (s)α + δη(s) + (s).
Here, interest is in comparing (d) Table 2. However, in the next subsection, we examine just a subset of possible model comparison. We compare (a) and (ii) with (d) and (ii). We compare (b) and (ii) with (c) and (ii).
Since the intent is to improve the predictive performance of the model for Y, model comparison criteria focuses on out-of-sample prediction for Y (s)'s.
Model fitting and inference for presence/absence data using preferential sampling To make the model comparison between (b) with (ii) vs. (c) with (ii) we only need to fit the latter and look at the posterior distribution for δ. Similarly, for the model comparison between (a) with (ii) vs. (d) with (ii) we only need to fit the latter. We do this below for each of the seven species.

We fit models (a) -(d) for the presence/absence data. For model (c) and (d), we include log
Gaussian Cox process models for S, i.e., for model (ii), by taking 2,666 regular grid cells over D to approximate the likelihood over the region. The regular grid is needed because we introduce the η surface into models (c) and (d). Among these grid cells, 1870 do not include any presence/absence locations. For all species, we use the same seven covariates presented in "Our motivating dataset" for both w and x.
These hierarchical models are fitted within a Bayesian framework. Model fitting details are given in the Supplementary Material (Appendix S4). As for Bayesian inference, although Gibbs sampling is available for the ω(s) process, its computational cost/time is O(n 3 ) and required memory is O(n 2 ). In our case, we have a relatively large n = 4314, so we implement a nearest neighbor Gaussian process (Datta et al., 2016a, NNGP), which is a sparse Gaussian process model whose  the results for model (d) suggest significant preferential sampling effects; the means for δ are significantly different from 0. When δ > 0, this implies that, in the selection of the presence/absence locations for the species, presences were oversampled. When δ < 0, this means that in the selection of the presence/absence locations for the species, presences were undersampled. This insight is useful and can help in predicting probability of presence at unobserved locations. Furthermore, failing to include the η(s) into the modeling might lead to misinterpretation of the effects of the regressors.
The η(s) also provide improved prediction of presence/absence (see below). However, with inclusion of the ω(s) surface (model (c)), the δ coefficients become insignificant. The flexibility of the ω(s) surface seems to remove the benefit of using the η(s) surface as a predictor.  With binary response, to demonstrate improved prediction we consider misclassification error using the Tjur R 2 coefficient of determination (Tjur, 2009). This measure prefers a model with high probability of presence when presence is observed and low probability of presence when absence is observed. For species j, this quantity is given by T R j = (π j (1) −π j (0)) whereπ j (1) andπ j (0) are the average probabilities of presence for the observed ones and zeros associated with the j-th species across the locations. The larger the T R j , the better the discrimination.
We held out 20% (879) of the presence/absence locations, chosen at random, for the seven species. Table 5

Fusing presence/absence and presence-only data
We turn to the data fusion question. Data fusion (also assimilation) is a widely employed objective when multiple data sources are available to inform about a common response of interest (Nychka and Anderson, 2010;Wikle and Berliner, 2007 (Wikle et al., 2001;Sahu et al., 2016;Rundel et al., 2015).
In our setting, data fusion is different from customary settings. Rather than multiple data sources informing about a common response, e.g., ozone level, we have two different types of data. While both inform about species distribution, we have argued above that presence/absence data is not described stochastically in the same way as presence-only data. The fusion approaches considered in the literature (Fithian et al., 2015;Dorazio, 2014;Giraud et al., 2016;Pacifici et al., 2017) ignore this and assume a latent point pattern model for the presence-only data and that the presence/absence data is induced under this model. Since we argue that a point pattern specification is inappropriate for presence/absence data, we think a different type of fusion is required. We have a point pattern model for the presence-only data and a binary map model for the presence/absence data. Since the point pattern of presences may inform about the probability of presence at a location, again we turn to preferential sampling ideas (Diggle et al., 2010) in order to explore a coherent probabilistic fusion.
The extra information available to make a data fusion story is S P O , the set of observed presenceonly locations. Formally, what information does S P O bring with regard to learning about the probability of presence surface? As in "What does "probability of presence" mean?", assume that .., s * m } is a complete census of species locations in D. Associated with S P O , we consider an intensity, λ P O (s), specified with a set of models similar to (i) or (ii) above. We expect λ P O (s) to be elevated near these observations. For example, analogous to (ii), let log λ P O (s) = w T (s)β P O + η P O (s), using the same predictors as with the presence/absence modeling. However, the mechanisms that created S P O and S P A (the point pattern of presence/absence locations) are different, so it doesn't make sense that S P O and S P A follow the same model. In order to capture the influence of S P O on the p(s) surface associated with Y P A (the presence/absence data), we could add δ P O η P O (s) to the mean for Z(s) in model (c) of "Preferential sampling: Preferential sampling models for presence/absence data", i.e. we could have a δ P A η P A (s) term and a δ P O η P O (s) term in order to improve prediction of presence/absence. So, we have two sources for possible preferential sampling, one for each dataset. We might insist that δ P O > 0. Then, from the presence-only data, the probability of presence will be increased around the s * j 's and decreased away from them. Indeed, the locations in S P O are severely biased; they are locations where we see only 1's. We are severely over-sampling presences with S P O and we should increase probability of presence where we do.
We adopt a model for S P O analogous to model (ii) in "Preferential sampling: Model fitting and inference for presence/absence data using preferential sampling" for S P A Then, we can add a δ P O η P O (s) term to the mean of Z(s) under (b), (c), or (d). In other words, we model S P A as a LGCP with intensity and S P O as a LGCP with intensity We consider the following models for Y arising directly through specification for Z(s): These models replace δ P A η P A (s) with δ P O η P O (s) in models (c) and (d). They allow only the point pattern of the presence-only data to help explain probability of presence. In addition, we consider two models which also include preferential sampling associated with the presence/absence data, δ P A η P A (s): The Supplementary Material (Appendix S4) shows how to fit these models.
As a last remark, in practice, with a partial realization of the presence-only point pattern, we need to degrade λ P O (s) in the model fitting. The following subsection briefly reviews an approach to implement such degradation. The Supplementary Material (Appendix S4) shows how to adjust the fitting of the models above in the presence of a partially observed presence-only point pattern.

Spatial modeling of presence-only data in practice
Analysis of presence-only data has seen growth in recent years due to increased availability of such records from museum databases and other non-systematic surveys, see Graham et al. (2004).
Presence-only data is not inferior to presence/absence data. In fact, it can be viewed as the opposite; in principle, presence-only data offer a complete census while presence/absence data, since confined to a specified set of sampling sites, contains less information. However, in practice, a complete census of individuals is rarely achieved. The sampling effort required to obtain such censuses usually exceeds the available resources.
An early model-based strategy for presence-only data attempts to implement a presence/absence approach by drawing so-called background samples, a random sample of locations in the region with known environmental features. These samples were characterized as pseudo-absences (Engler et al., 2004;Ferrier et al., 2002) and a logistic regression was fitted to the observed presences and these pseudo-absences, following "Some presence/absence modeling details". Since presence/absence is unknown for these samples, work of Pearce and Boyce (2006) and Ward et al. (2009) showed how to adjust the resulting logistic regression to account for this. In any event, this approach manufactures an arbitrary amount of data. Additionally, it ignores spatial dependence for presence/absence across locations. The observed presences, as a random number of random locations, should be viewed as a spatial point pattern (see Warton and Shepherd, 2010;Chakraborty et al., 2011, in this regard).
An algorithmic strategy in common use these days is the maximum entropy (Maxent) approach, (see, e.g., Phillips et al., 2006Phillips et al., , 2009. Maxent is a constrained optimization method which finds the optimal species density (closest to a uniform) subject to moment constraints. The availability of an attractive software package (http://biodiversityinformatics.amnh.org /open source/maxent/), encourages its use for presence-only data analysis. The resultant density surface is interpreted as providing the relative chance of observing a species at a given location compared to other locations in the region (and can not be interpreted as providing presence/absence probabilities). However, as an optimization strategy rather than a stochastic modeling approach, Maxent is unable to attach any uncertainty to resulting optimized estimates. Also, Maxent is unable to provide an intensity surface. Hence, for example, we are unable to determine the expected number of individuals in a specified region or the probability of at least one individual in a specified region.
Arguably, a formal point pattern modeling approach is preferable since it enables full inference, with associated uncertainty, over the region. Modeling presence-only data as a point pattern specifies an associated intensity in terms of the available environments, at available spatial scale, across the region. Spatial structure for the intensity surface is introduced through spatial random effects, resulting in a log Gaussian Cox process (Møller et al., 1998;Møller and Waagepetersen, 2004), as discussed in "Preferential sampling" above.
Employing the LGCP in practice acknowledges that the observed point pattern is biased through anthropogenic processes, e.g., human intervention to transform the landscape and non-uniform (in fact, often very irregular) sampling effort. Such bias in sampling is a common problem, see for example Loiselle et al. (2008) and references therein. This requires adjusting the potential species intensity to a realized intensity which is treated as a degradation of the former. Such modeling adjustment is discussed in detail in Chakraborty et al. (2011) which we briefly review below.
Variation in site access is one of the factors that influences the likelihood of the site to be visited/sampled. For example, sites (i) adjacent to roads or along paths, (ii) near urban areas, (iii) with public ownership, e.g., state or national parks, or (iv) with flat topography are likely to be over-sampled relative to more inaccessible sites. When bias implies that only a portion of the region is sampled, it is likely that only a portion of the overall point pattern is observed. Land use, as a result of human intervention, affects availability of locations, hence, also inference about the intensity. Also, agricultural transformation and dense stands of alien invasive species preclude availability. Transformed areas are not sampled and this information must also be included in the modeling. Altogether, sampling tends to be sparse and irregular; we rarely collect a random sample of all available environments.
Detection can affect inference regarding the intensity. That is, we may incorrectly identify a species as present when it is actually absent (false presence) or fail to detect a species that is actually present (false absence) (Reese et al., 2005). The prevalence of these false records will affect the performance of an explanatory model on response to environmental features (Tyre et al., 2003). Modeling for these errors can be attempted but requires information beyond the scope here.

Some explicit modeling details
Following ideas in Chakraborty et al. (2011), we conceptualize a potential intensity, i.e., the intensity in the absence of degradation, as well as a realized (or effective) intensity that operates in the presence of degradation. The intensity is tiled to grid cells at the resolution of the available environmental covariate surface. We consider three surfaces over a region of interest, D.

Model fitting and inference for data fusion
We fit the models above in (3) and (4)  There is no simple solution for modeling sampling effort. However, in the absence of a complete census, some assumption needs to be made in order to sensibly degrade the intensity. We adopt the sampling effort surface T (s) for each grid cell such that T (s) = 1 for all cells where at least one presence-only point is observed across all species, T (s) = 0 otherwise. There is nothing in the modeling that, for a given species, prevents/discourages λ P O (s) from being small if the data suggests it. We are only attempting to account for degradation apart from this.
The estimation and predictive performance results for models (a) and (b) are the same as those in "Preferential sampling: Model fitting and inference for presence/absence data using preferential sampling". Since δ P O is expected to be positive, a priori, we adopt a truncated normal prior on the nonnegative domain, i.e., δ P O ∼ N ≥0 (0, 100). Table 6 displays the estimation results for model (e) which include both δ P A η P A (s) and δ P O η P O (s).
None of the δ P A are significantly different from 0. However, all of the δ P O are significantly differ-ent from 0, revealing that the locations of the presence-only sites significantly improve the performance of the presence-absence model. Table 7 displays the results for the TR measure under the same settings as in "Preferential sampling: Model fitting and inference for presence/absence data using preferential sampling". The results are similar to those in Table 5. Performance is essentially indistinguishable across all models other than model (a); however, model (f) emerges as the best.
As a last remark here, if we focus on presence/absence locations which are near observed presenceonly locations, we find an improvement in the TR measure for presence-absence at those locations compared to the corresponding model ignoring the presence-only data (results not shown).

Summary and future work
Our contribution is to attempt to bring more clarity to a frequent activity for ecologists, modeling presence/absence for species, confining ourselves to plants. We have done this from a probabilistic perspective, arguing that presence/absence data should be viewed as a point level phenomenon and therefore, stochastic modeling for presence/absence should be done at dimensionless points.
In the development we have also argued that attempting to model presence/absence at areal scale raises challenges and, further, that any such modeling is incompatible with point-level modeling.
We have also asserted that the number of presences in a fixed bounded region must be finite and therefore, that a physical realization of a presence in the region is larger than a dimensionless point.
We acknowledge that we are being more formal in developing this perspective than is customary. When presence/absence data is supplied at point-level, it will be a geo-coded location and, in many cases, it is supplied at areal scale, recording presence of the species anywhere in the areal unit. However, this is not a deterrent from considering our perspective. All continuous measurements are obtained up to rounding error. When a temperature is recorded at a location, the location is provided up to the accuracy of the geo-coding device; nonetheless, we routinely model temperatures at (dimensionless) points.
Next, we turned our attention to attempting to improve prediction of probability of presence at a location for a presence/absence dataset. We introduced the usefulness of preferential sampling in this context, anticipating that there may be bias in sampling sites visited for presence/absence data; sampling may favor seeing more presences. We argued that the idea of a shared process model, Finally, we asserted that presence-only data should be modeled as a point pattern, albeit degraded due to availability and sampling effort over the study region. We showed that, as a result, a common model for presence/absence and for presence-only data can not be stochastically coherent.
Hence, if we seek a data fusion having both presence/absence data as well as presence-only data, a different approach is needed. We argued that, again, a shared process specification is coherent for such fusion and illustrated by adding a presence-only plant dataset from New England to the presence/absence dataset.
Future work offers much opportunity. More experience is needed with regard to the rich set of modeling specifications that we have presented in Sections "Preferential sampling", "Fusing presence/absence and presence-only data" and Supplementary Material Appendix S2. We also anticipate the need to supply user-friendly software to enable ecologists to play with these models with their own datasets. A particularly useful future direction leads us to joint species distribution models. These are easy to envision but challenging to fit. Another useful future direction will consider different types of response data, e.g., abundance or basal area, where preferential sampling of locations may occur. Possibly the most difficult challenge will be to move to animal movement data where the concepts of occupancy, use, and dynamics need to be carefully brought into play.      Table 7: Estimation results for the TR measure for the data fusion. The results for (a) and (b) are the same as in Table 5 Model (

Figure captions
• Figure 1: The distribution of presence (blue) and absence (green) points for each species across the study region.
• Figure 2: The distribution of presence only points (red) for each species across the study region.
• Figure 3: The standardized covariate surface for each of the 7 selected covariates across the study region.