Modelling palaeoecological time series using generalized additive models

In the absence of annual laminations, time series generated from lake sediments or other similar stratigraphic sequences are irregularly spaced in time, which complicates formal analysis using classical statistical time series models. Instead, statistical analyses of trends in palaeoenvironmental time series, if done at all, have typically used simpler linear regressions or (non-)parametric correlations, with little regard for the violation of assumptions that almost surely occurs due to temporal dependencies in the data, or for the fact that correlations do not provide estimates of the magnitude of change, only whether or not there is a linear or monotonic trend. Alternative approaches have used Loess-estimated trends to justify data interpretations or test hypotheses as to the causal factors, without considering the inherent subjectivity of the choice of parameters used to achieve the Loess fit (e.g. span width, degree of polynomial). Generalized additive models (GAMs) are statistical models that can be used to estimate trends as smooth functions of time. Unlike Loess, GAMs use automatic smoothness selection methods to objectively determine the complexity of the fitted trend, and, as formal statistical models, GAMs allow for potentially complex, non-linear trends, a proper accounting of model uncertainty, and the identification of periods of significant temporal change. Here, I present a consistent and modern approach to the estimation of trends in palaeoenvironmental time series using GAMs, illustrating features of the methodology with two example time series of contrasting complexity: a 150-year bulk organic matter δ15N time series from Small Water, UK, and a 3000-year alkenone record from Braya-Sø, Greenland.
I discuss the underlying mechanics of GAMs that allow them to learn the shape of the trend from the data themselves, and how simultaneous confidence intervals and the first derivatives of the trend are used to properly account for model uncertainty and identify periods of change. It is hoped that by using GAMs greater attention is paid to the statistical estimation of trends in palaeoenvironmental time series, leading to a more robust and reproducible palaeoscience.

reasons, not least because the CV scheme used must involve the time ordering of the data (e.g. …).

Figure 1: Flowchart showing the main steps in the analysis of time series using generalized additive models. The main R functions associated with each step or decision are shown in bold.

To illustrate trend estimation in palaeoenvironmental data using GAMs, I use two proxy time series of contrasting complexity.

Example time series
…

yₜ = β₀ + β₁xₜ + εₜ    (1)

where β₀ is a constant term, the model intercept, representing the expected value of yₜ where xₜ is 0. β₁ is the slope of the best-fit line through the data; it measures the rate of change in yₜ. More complex trends can be accommodated by adding polynomial terms in xₜ, for example

yₜ = β₀ + β₁xₜ + β₂xₜ² + ⋯ + βₖxₜᵏ + εₜ    (2)

where polynomials of up to order k are used. This model allows for more complex trends, but the order of polynomial used in the fit is an additional choice left for the analyst to specify; choosing the optimal k is not a trivial task when the data are a time series and residual autocorrelation is likely present.
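The paper's analyses are carried out in R, but the polynomial model (2) is easy to sketch in any language. The following minimal numpy example fits the model by ordinary least squares to synthetic, irregularly spaced data; every value here is invented for illustration and stands in for nothing in the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Irregularly spaced time axis, loosely mimicking an undated sediment core
time = np.sort(rng.uniform(0.0, 150.0, size=100))

# Synthetic response following a known quadratic trend plus noise
y = 2.0 + 0.05 * time - 3e-4 * time**2 + rng.normal(0.0, 0.1, size=time.size)

# Fit the order-k polynomial model (2) by ordinary least squares;
# np.polyfit returns coefficients from the highest order down to the intercept
k = 2
coefs = np.polyfit(time, y, deg=k)
trend = np.polyval(coefs, time)
```

Choosing k was easy here because the truth is known; for real proxy series, as noted above, selecting the polynomial order is far from trivial.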

Can we do better than these polynomial fits? In the remainder, I hope to demonstrate that the answer to that question is emphatically "yes"! Below I describe a coherent and consistent approach to modelling palaeoenvironmental time series using generalized additive models that builds upon the linear regression framework.

Generalized additive models

The GAM version of the linear model (1) is

yₜ = β₀ + f(xₜ) + εₜ    (3)

where the linear effect of time (the β₁xₜ part) has been replaced by a smooth function of time, f(xₜ). The immediate advantage of the GAM is that we are no longer restricted to the shapes of trends that can be fitted via global polynomial functions such as (2). Instead, the shape of the fitted trend will be estimated from the data themselves.

The linear model is a special case of a broader class, known as the generalized linear model (GLM),

yₜ ∼ EF(μₜ, Θ)    (4a)
g(μₜ) = ηₜ = β₀ + f(xₜ)    (4b)
μₜ = g⁻¹(ηₜ)    (4c)

where μₜ is the expected value (e.g. the mean count or the probability of occurrence) of the random variable Yₜ (μₜ ≡ E(Yₜ)) of which we have observations yₜ. g is the link function, an invertible function that maps values of μₜ onto an unbounded scale between −∞ and +∞, and it is on this scale that model fitting actually occurs (i.e. using equation (4b)).

However, we need to map these unbounded values back onto the non-negative response scale.

The inverse of the log link function, the exponential function, achieves this and maps values to the interval 0–∞ (equation (4c)).
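The link-function mechanics just described can be checked numerically. This toy snippet (the linear-predictor values are arbitrary) shows the log link and its exponential inverse doing the mapping, plus the logit link commonly used for presence/absence data.

```python
import numpy as np

# Arbitrary values of the linear predictor on the unbounded link scale
eta = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])

# Log link, g(mu) = log(mu): fitting happens on the eta scale, and the
# inverse link -- the exponential -- maps back to the 0-to-infinity scale
mu = np.exp(eta)
assert np.all(mu > 0)                # expected counts are strictly positive
assert np.allclose(np.log(mu), eta)  # applying g recovers the link scale

# For presence/absence data, the logit link's inverse maps eta into (0, 1)
p = 1.0 / (1.0 + np.exp(-eta))
```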

In (4a), we further assume that the observations are drawn from a member of the exponential family of distributions - such as the Poisson for count data, or the binomial for presence/absence or counts from a total - with expected value μₜ and possibly some additional parameters Θ (yₜ ∼ EF(μₜ, Θ)). Additionally, many software implementations of the above model also allow for distributions that are not within the exponential family but which can be fitted using an algorithm superficially similar to the one used to fit GAMs to members of the exponential family (e.g. Wood et al., 2016).

… the data themselves. We will refer to these non-linear functions as smooth functions, or just smooths for short, and we will denote a smooth using f(x). Further, we would like to represent the smooths in such a way that (4) is represented parametrically, so that it can be estimated within the well-studied GLM framework. This is achieved by representing the smooth using a basis. A basis is a set of functions that collectively span a space of smooths that, we hope, contains the true f(x) or a close approximation to it.

There is a bewildering array of different types of spline. In the models discussed below we will largely restrict ourselves to cubic regression splines (CRS) and thin plate regression splines (TPRS). In addition, I also discuss two special types of spline basis: an adaptive spline basis and a Gaussian process spline basis.

A cubic spline is a smooth curve comprised of sections of cubic polynomials, where the sections are joined together at some specified locations - known as knots - in such a way that, at the joins, the two sections of cubic polynomial that meet have the same value as well as the same first and second derivatives. These properties mean that the sections join smoothly and differentiably at the knots (Wood, 2017, 5.3.1).
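These continuity properties are easy to verify numerically. The sketch below uses scipy's interpolating CubicSpline, which is not the penalized regression spline mgcv uses but shares the same piecewise-cubic construction; the data values are arbitrary, and we probe behaviour just either side of an interior knot.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# A handful of arbitrary points; the x locations act as the knots
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.8, 0.2, -0.5, 0.1])
cs = CubicSpline(x, y)

knot, eps = 2.0, 1e-6
for nu in (0, 1, 2):
    # The value, first and second derivative agree across the join ...
    assert np.isclose(cs(knot - eps, nu), cs(knot + eps, nu), atol=1e-3)

# ... while the third derivative is free to jump at the knot, which is what
# distinguishes joined cubic sections from a single global cubic polynomial
third_jump = abs(float(cs(knot - eps, 3) - cs(knot + eps, 3)))
```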

The CRS can be parameterized in a number of different ways. One requires a knot at each …

To fit a GAM using either of the two regression spline bases described above, the analyst is generally only required to specify the size (rank) of the basis expansion needed to represent, or closely approximate, the true function f. With practice and some knowledge of the system from which the observations arise, it can be relatively easy to put an upper limit on the expected complexity of the true trend in the data. Additionally, the number of available data points places a constraint on the upper limit of the size of basis expansion that can be used.

In practice, the size of the basis is an upper limit on the expected complexity of the trend, …

Having identified low-rank regression splines as a useful way to represent f, we next need a way to decide how wiggly the fitted trend should be. A backwards elimination approach to sequentially remove knots or basis functions might seem appropriate; however, such an approach would likely fail, as the resulting sequence of models would not be strictly nested, precluding many forms of statistical comparison (Wood, 2017). Alternatively, we could keep the basis dimension at a fixed size but guard against fitting very complex models through the use of a wiggliness penalty.
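The penalty idea can be sketched with plain numpy: build a deliberately over-sized basis, form the quadratic penalty on the second derivative, and watch the wiggliness of the fitted curve shrink as the smoothing parameter λ grows. Everything here (the truncated-power basis, the knots, the data) is an illustrative stand-in for what mgcv does internally, not its actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, x.size)

# Over-sized cubic truncated-power basis: 1, x, x^2, x^3, (x - k)_+^3
knots = np.linspace(0.1, 0.9, 9)
X = np.column_stack([np.ones_like(x), x, x**2, x**3] +
                    [np.clip(x - k, 0.0, None)**3 for k in knots])

# Second derivative of each basis function on the same grid; the penalty
# matrix S approximates the integral of B_i''(x) * B_j''(x) dx
D2 = np.column_stack([np.zeros_like(x), np.zeros_like(x),
                      np.full_like(x, 2.0), 6.0 * x] +
                     [6.0 * np.clip(x - k, 0.0, None) for k in knots])
S = (D2.T @ D2) * (x[1] - x[0])

def penalized_fit(lam):
    # Minimise ||y - X b||^2 + lam * b' S b  (the quadratic-form penalty)
    return np.linalg.solve(X.T @ X + lam * S, X.T @ y)

# Wiggliness b' S b of the fitted curve for increasing smoothing parameters
wiggliness = [b @ S @ b for b in map(penalized_fit, (1e-6, 1e-2, 10.0))]
```

As λ increases the penalty shrinks the fit toward a straight line, whose second derivative, and hence wiggliness, is zero.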

The default wiggliness penalty used in GAMs is on the second derivative of the spline, which measures the curvature of the estimated function:

∫ f″(x)² dx = βᵀSβ    (5)

The right-hand side of (5) is the penalty in quadratic form. The convenience of the quadratic form is that it is a function of the estimated coefficients β of f(x), where S is known as the penalty matrix.

Figure 6: … The shaded bands surrounding the estimated trends are approximate 95% across-the-function confidence intervals. For the Uᴷ₃₇ series, two models are shown; the orange fit is the result of a GAM with a continuous-time AR(1) process estimated using REML smoothness selection, while the blue fit is that of a simple GAM with GCV-based smoothness selection. The REML-based fit significantly oversmooths the Uᴷ₃₇ time series.

The fitted trend is shown in Figure 6a, and captures the strong pattern in the data well. … what is happening when we are fitting these two models.

The primary issue leading to the poor fit is that neither model accounts for the different variance …

The 0.95 probability quantile of the t distribution may be used instead, which will account for the estimation of σ², the variance of the data. However, given the number of observations, and hence residual degrees of freedom, needed to motivate fitting GAMs, differences between intervals computed using extreme quantiles of the standard normal or the t distribution will be tiny.

Simultaneous intervals computed using the method described are shown in Figure 9, alongside the across-the-function confidence intervals for the trends fitted to both example data sets. As expected, the simultaneous interval is somewhat wider than the across-the-function interval.
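The claim about the normal and t quantiles is easy to check; the degrees of freedom below are arbitrary examples, not those of the fitted models.

```python
from scipy.stats import norm, t

z = norm.ppf(0.975)  # the familiar 1.96 used for a 95% interval

# The gap between the t and standard normal quantiles shrinks quickly
# as the residual degrees of freedom grow
diffs = [t.ppf(0.975, df) - z for df in (50, 100, 200)]
```

Even at 50 residual degrees of freedom the difference is only about 0.05, and it halves each time the degrees of freedom double.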

The critical value m₁₋α for the simultaneous interval of the estimated trend in δ15N is 3.07, whilst the same value for the Uᴷ₃₇ series is 3.43, leading to intervals that are approximately ±50% and ±75% wider than the equivalent across-the-function intervals.
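The simulation idea behind such critical values can be sketched as follows: draw coefficient deviations from the posterior covariance of a fitted smooth, track the maximum absolute deviation of the trend in units of its pointwise standard error, and take the 0.95 quantile of those maxima. The design matrix and covariance below are arbitrary stand-ins, not the models fitted in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the basis expansion X and Bayesian covariance V of a smooth
n_grid, p = 100, 8
grid = np.linspace(0.05, 3.1, n_grid)
X = np.column_stack([np.sin((j + 1) * grid) for j in range(p)])
A = rng.normal(size=(p, p))
V = A @ A.T / p                                   # positive definite

se = np.sqrt(np.einsum('ij,jk,ik->i', X, V, X))   # pointwise SE of the trend

# For each posterior draw, record the largest absolute deviation of the
# trend from its estimate, standardised by the pointwise SE
draws = rng.multivariate_normal(np.zeros(p), V, size=10000)  # (10000, p)
max_abs = np.max(np.abs(X @ draws.T) / se[:, None], axis=0)

crit = np.quantile(max_abs, 0.95)
```

Because the maximum over the grid stochastically dominates any single point, crit exceeds the pointwise 1.96, which is exactly why the simultaneous band is wider than the across-the-function band.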
In the basic GAM, Λ ≡ I is an identity matrix - a matrix with 1s on the diagonal and 0s elsewhere - …

Figure caption: … ρ^Δ is the correlation between residuals separated by Δ years, where ρ̂ = 0.6. The shaded band is a 95% pointwise confidence interval on the estimated correlation h.
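The continuous-time AR(1) correlation function referred to in the caption can be written down directly; ρ = 0.6 is the estimate quoted above, while the time separations are arbitrary illustrative values.

```python
import numpy as np

rho = 0.6                                 # estimated correlation at a 1-year lag
dt = np.array([0.0, 0.5, 1.0, 2.0, 5.0])  # separations in years (arbitrary)

# Continuous-time AR(1): correlation between residuals dt years apart.
# Unlike a discrete AR(1), dt need not be an integer, which suits the
# irregular sampling of sediment-core data
h = rho ** dt
```

h(0) = 1 (a residual is perfectly correlated with itself) and the correlation decays smoothly toward zero as the separation grows.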
which is where the independence assumption of the model comes from; a model residual is …

Figure 13: Gaussian process smooths fitted to the Uᴷ₃₇ time series. REML score traces for GAMs fitted using power exponential (κ = 1) or Matérn (κ = 1.5) correlation functions, as a function of the effective range parameter (ℓ), are shown in (a). The optimal model for each function is that with the lowest REML score. (b) shows the resulting trends estimated using the respective correlation function with the value of ℓ set to the optimal value.

In all cases, ℓ is the effective range parameter, which sets the distance beyond which the correlation function is effectively zero. Figure 12 shows examples of two different correlation functions: the power exponential (Figure 12a), and the Matérn (Figure 12b). Figure 13a shows the REML score for models estimated using …

… implements this approach within R and can use mgcv to fit GAMs to each species.
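The two correlation functions shown in Figure 12 can be sketched directly. Note that parameterisations of the Matérn family differ between software packages, so the exact scaling of the range parameter ℓ used here is an assumption for illustration only.

```python
import numpy as np

def power_exponential(d, ell, kappa=1.0):
    # h(d) = exp(-(d / ell)^kappa); kappa = 1 gives the exponential
    # correlation, kappa = 2 the Gaussian
    return np.exp(-(d / ell) ** kappa)

def matern(d, ell):
    # Matern correlation with smoothness 1.5, in one common parameterisation
    a = np.sqrt(3.0) * d / ell
    return (1.0 + a) * np.exp(-a)

d = np.linspace(0.0, 10.0, 201)   # separations, in the same units as ell
h_pe = power_exponential(d, ell=2.0)
h_m = matern(d, ell=2.0)
```

Both functions equal 1 at zero separation and decay monotonically, becoming effectively zero once the separation is a few multiples of the effective range ℓ.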

A pragmatic, although inelegant, approach that has been used to estimate trends in multivariate palaeoecological data is to first summarise the response data using an unconstrained ordination …

… palaeoecologists already use to summarise multivariate stratigraphic data, it is best thought of as modelling changes in abundance or relative composition at the community level. It is less well suited to unpicking taxon-specific trends, however, because the ordination step combines individual species' information into latent variables (axes) that are linear combinations of all species, and it is these latent variables that are then modelled using a GAM.

Conflict of interest statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.