Data-driven causal analysis of observational time series in ecology

Complex ecosystems are challenging to understand as they often defy manipulative experiments for practical or ethical reasons. In response, several fields have developed parallel approaches to infer causal relations from observational time series. Yet these methods are easy to misunderstand and often controversial. Here, we provide an accessible and critical review of three statistical causal inference approaches popular in ecological time series analysis: pairwise correlation, Granger causality, and state space reconstruction. For each, we ask what a method tests for, what causal statement it might imply, and when it could lead us astray. We devise new ways of visualizing key concepts, describe some novel pathologies of causal inference methods, and point out how so-called “model-free” causality tests are not assumption-free. We hope that our synthesis will facilitate thoughtful application of causal inference approaches and encourage explicit statements of assumptions.


Introduction
Figure 1: Causality. (A) Definition. If a perturbation in X results in a change in future (or current) Y, then X causes Y. This definition does not require that any perturbation in X will perturb Y. For example, if the effect of X on Y has saturated, then a further increase in X will not affect Y. To embody probabilistic thinking (e.g. drunk driving increases the chance of accidents) [16], X and Y are depicted as histograms. Perturbing X can perturb the current value of Y if, for example, X and Y describe the same quantity or if X and Y are linked by a conservation law (e.g. conservation of energy). (B) Direct versus indirect causality. The direct causers of Y are given by the minimal set of variables such that once the entire set is fixed, no other variables can cause Y. Here, three players X, Z, and U activate Y. The set {X, Z} constitutes the direct causers of Y (Y's "Markovian parents" [29,16]), since if we fix both X and Z, then Y becomes independent of U. If a causer is not direct, we say that it is indirect. Whether a causer is direct or indirect can depend on the scope of included variables. For example, suppose that yeast releases acetate, and acetate inhibits the growth of bacteria. If acetate is not in our scope, then yeast density has a direct causal effect on bacterial density. Conversely, if acetate is included in our scope, then acetate (but not yeast) is the direct causer of bacterial density, since fixing acetate concentration would fix bacterial growth regardless of yeast density. (C) Direct causality does not always imply causality. Here, {X, Y} are the direct causers of W. While X directly causes W, X does not cause W because its direct effect and its indirect effect (through Y) cancel out. Also note that causality is not always transitive: X causes Y and Y causes W, but X does not cause W. In this article, causality is represented by a hollow arrow, while activation and inhibition are represented by filled arrows and blunt-head arrows, respectively.

Figure 2 (caption fragment): … enough information for easy causal identifiability [81,73]. (B) In an island, individuals stochastically migrate to and from a "reservoir" (a random walk). At each time step, the net change in island biomass is drawn from a standard normal distribution (mean = 0; standard deviation = 1 biomass unit). (C) A population receives cells through migration and loses cells to death. Population size has reached an equilibrium. In both cases, we plotted example time series (upper right), a scatterplot comparing two island populations (lower left), the histogram of Pearson correlation coefficient strength (blue shading), and the fraction of simulations in which the correlation was deemed significant (p ≤ 0.05) by surrogate data tests using either permutation or phase randomization (see main text). Ideally, the proportion of correlations that are significant (false positives) should not exceed 5%. The strength of correlation is weaker in (C) than in (B), yet it is still often deemed significant by the permutation test. See Supplementary Methods for more details.
Several procedures are commonly used for generating surrogate Y data, each corresponding to an assumption about how the original Y time series was generated. One popular procedure is to shuffle the values of Y in time [33,56,57,58]. This procedure, often called permutation, assumes that all time points of Y are independent of each other. A rejection of this null hypothesis (i.e. a low p-value) often provides little evidence of dependence between X and Y, because this independence assumption is rarely met: autocorrelation within Y alone can produce a rejection. For example, for the independent time series in Figure 2B-C, this test returns p < 0.05 at rates of 30-92%, much higher than the nominal 5%. Nevertheless, the permutation test sneaks into many applied works, perhaps because it is the default option in some popular software packages.
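To make this concrete, below is a minimal sketch of a permutation surrogate test (Python with NumPy assumed; the function name and defaults are ours, not taken from any particular package):

```python
import numpy as np

def permutation_surrogate_test(x, y, n_surrogates=1000, seed=0):
    """Compare |Pearson correlation| of x and y against surrogates made
    by shuffling y in time. Assumes the time points of y are independent
    of each other, which autocorrelated series violate."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    exceed = sum(
        abs(np.corrcoef(x, rng.permutation(y))[0, 1]) >= observed
        for _ in range(n_surrogates)
    )
    # One-sided p-value with the standard +1 correction.
    return (exceed + 1) / (n_surrogates + 1)
```

Applied to autocorrelated series such as those in Figure 2B-C, this test rejects far more often than its nominal 5% rate, as described above.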

Another procedure for generating surrogate Y data has been called phase randomization.
It first uses the Fourier transform to represent a time series as a sum of sine waves, then randomly shifts each of the component sine waves in time, and finally sums the phase-shifted components [23,60] (Figure S8).

This procedure assumes that Y is Gaussian, linear, and stationary [60,59], where "Gaussian" means that any subsequence of Y follows a multivariate Gaussian distribution, "stationary" means that this distribution does not change over time, and "linear" means that future Y values depend linearly on past Y values and past random events ("process noise", Figure 7A). Indeed, this test performed well (with a false positive rate of 4%) when Y satisfied the three assumptions (Figure 2C), and relatively poorly when the stationarity assumption was violated (with a false positive rate of 21%; Figure 2B). Other surrogate data procedures exist, each encoding its own assumptions about the data-generating process.
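A corresponding sketch of phase randomization follows (again illustrative; published implementations differ in details such as how the zero-frequency and Nyquist terms are handled):

```python
import numpy as np

def phase_randomized_surrogate(y, rng):
    """Surrogate with the same power spectrum (hence autocorrelation) as y
    but randomized phases. Assumes y is Gaussian, linear, and stationary."""
    n = len(y)
    spec = np.fft.rfft(y)
    phases = rng.uniform(0, 2 * np.pi, len(spec))
    phases[0] = np.angle(spec[0])        # keep the mean unchanged
    if n % 2 == 0:
        phases[-1] = np.angle(spec[-1])  # keep the Nyquist term unchanged
    return np.fft.irfft(np.abs(spec) * np.exp(1j * phases), n=n)

def phase_randomization_test(x, y, n_surrogates=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    exceed = sum(
        abs(np.corrcoef(x, phase_randomized_surrogate(y, rng))[0, 1]) >= observed
        for _ in range(n_surrogates)
    )
    return (exceed + 1) / (n_surrogates + 1)
```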

In sum, surrogate data allow a researcher to use an observed correlation statistic to test for dependence under some assumption about the data-generating process. Dependence indicates the presence of a causal relationship, and conditional dependence can sometimes even indicate the direction [80,73,85] (Figure S5).
Below we consider Granger causality and convergent cross mapping, two methods which explicitly take advantage of temporal information to infer the direction of causality.

Granger causality

In simple language, X is said to Granger-cause Y if a collection of time series containing all historical measurements predicts Y's future behavior better than a similar collection that excludes the history of X. An important consequence of this definition is that Granger causality excludes indirect causes, as illustrated in Figure 3A. In practice, whether a causal relationship is direct or indirect depends on which variables are observed (e.g. in Figure 3A, if Y were not observed, then X would "directly" and Granger-cause Z).

Under linear Granger causality, X Granger-causes Y if including the history of X in a linear autoregressive model (Eq. 1) allows for a better prediction of future Y than not including the history of X (i.e. setting all α_k coefficients to zero). By "linear autoregressive model", we mean that the future value of variable Y is modeled as a linear combination of historical values of X and Y and of all other observed variables ("···") that might help predict Y:

$$Y_{t+1} = c + \sum_{k=0}^{n} \alpha_k X_{t-k} + \sum_{k=0}^{n} \beta_k Y_{t-k} + \cdots + \varepsilon_t \qquad (1)$$

Here, t is the time index, k = 0, 1, ..., n is a time lag index, c is a constant, coefficients such as α_k and β_k represent the strength of contributions from the respective terms, and ε_t represents independent and identically distributed (IID, Appendix 1.2) process noise (Figure 7A).
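The comparison behind Eq. 1 can be sketched directly: fit the autoregression with and without the lagged X terms, then compare residual variances with an F-test. This is an illustrative implementation, not the MVGC procedure (which adds lag selection, stationarity checks, and other safeguards):

```python
import numpy as np
from scipy import stats

def lagged_design(series, n_lags):
    """Columns: series lagged by 1..n_lags, aligned with series[n_lags:].
    (For simplicity we omit the contemporaneous k = 0 term of Eq. 1.)"""
    return np.column_stack([series[n_lags - k: len(series) - k]
                            for k in range(1, n_lags + 1)])

def linear_granger_test(x, y, n_lags=2):
    """Does the history of x improve a linear autoregressive prediction
    of y beyond y's own history? Returns (F statistic, p-value)."""
    target = y[n_lags:]
    ones = np.ones((len(target), 1))
    restricted = np.hstack([ones, lagged_design(y, n_lags)])  # all alpha_k = 0
    full = np.hstack([restricted, lagged_design(x, n_lags)])  # Eq. 1
    rss = []
    for design in (restricted, full):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        rss.append(np.sum((target - design @ coef) ** 2))
    df1 = n_lags                        # extra parameters in the full model
    df2 = len(target) - full.shape[1]   # residual degrees of freedom
    f_stat = ((rss[0] - rss[1]) / df1) / (rss[1] / df2)
    return f_stat, stats.f.sf(f_stat, df1, df2)
```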

General Granger causality [26]:
Let X_t, Y_t, and Z_t be series of random variables indexed by time t. X Granger-causes Y with respect to Z if

$$P\left(Y_{t+1} \mid \{X_k, Y_k, Z_k\}_{k \le t}\right) \neq P\left(Y_{t+1} \mid \{Y_k, Z_k\}_{k \le t}\right) \qquad (2)$$

for at least one possible realization of the X, Y, and Z series. Here, P(Y_t | S) is the probability of Y_t conditional on the variable set S. Note that Z_k in Eq. 2 may include multiple variables and thus plays the same role as "···" in Eq. 1.

Failure modes

We discuss four conceptually or practically important instances where Granger causality can fail as an indicator of direct causality (Figure 3B). These pathologies apply to both linear and general Granger causality.

Figure 3: (A) Granger causality excludes indirect causality. Although X causes Z, X does not Granger-cause Z because, with the history of Y available, the history of X no longer adds value for predicting Z. In this example, we can also see that Granger causality is not transitive: X Granger-causes Y and Y Granger-causes Z, but X does not Granger-cause Z. (B) Failure modes of Granger causality when inferring direct causality. (i) False negative due to lack of stochasticity. X and Y mutually and deterministically cause one another through a copy operation [91,8]: X(t) copies Y(t-1) and vice versa. Since X(t-2) already contains sufficient information to know X(t) exactly, the history of Y cannot improve prediction of X, and so Y does not Granger-cause X. By symmetry, X does not Granger-cause Y. (ii) False positive due to unobserved common cause. X causes Y with a delay of 1, and causes Z with a delay of 2. We only observe Y and Z. Since Y receives the same "information" before Z, the history of Y helps to predict Z, and thus Y Granger-causes Z, resulting in a false positive.
(iii) Infrequent sampling induces false negatives. Note that in the equation of X, the ratio of signal (the coefficient of the causer Y) to noise (the coefficient of the process noise ε) is much smaller when sampling is infrequent (compare right with left). The quantitative relationship is derived in Appendix 1.10. (iv) Measurement noise can lead Granger causality to suffer both false positives and false negatives.
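Failure mode (ii) is straightforward to reproduce in simulation. The sketch below uses illustrative dynamics and the grangercausalitytests function from statsmodels (not the MVGC tool used later in this article); observing only Y and Z yields the spurious conclusion that Y Granger-causes Z:

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(1)
T = 2000
x = rng.normal(size=T)  # unobserved common cause
y = np.zeros(T)
z = np.zeros(T)
for t in range(2, T):
    y[t] = x[t - 1] + 0.1 * rng.normal()  # X causes Y with a delay of 1
    z[t] = x[t - 2] + 0.1 * rng.normal()  # X causes Z with a delay of 2

# Observing only y and z: test whether y Granger-causes z.
data = np.column_stack([z, y])  # second column tested as a cause of the first
result = grangercausalitytests(data, maxlag=2, verbose=False)
print(result[1][0]["ssr_ftest"])  # tiny p-value: a false positive
```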
Figure caption (fragment): … the delay vector [Z(t), Z(t-τ), Z(t-2τ)] (shown as three red dots) can be represented as a single point in the 3-dimensional Z delay space (C, red dot). (C) We then shade each point in the Z delay space by its corresponding contemporaneous value of Y(t) (without measurement noise). The shading is continuous (i.e. gradual transitions in shade), and note that Y causes Z (so the inference is correct in this case). (D) When we repeat this procedure but now shade the Z delay space by W(t), the shading is bumpy, and note that W does not cause Z (even though Z causes W). (E) Shading the delay space of Z by the causally unrelated V also gives a bumpy result. (F) Dynamics as in (C), but now with noisy measurements of Y (purple in B). The shading is no longer gradual. Thus, with noisy data, causal inference is more difficult. See Supplementary Methods for more details.
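The shading intuition can be turned into a crude numeric score. The sketch below is a simplified form of cross-mapping, not the full CCM algorithm in rEDM (which additionally requires that skill converge as more data are used): embed the putative causee, then predict the putative causer from nearest neighbors in the causee's delay space.

```python
import numpy as np

def delay_embed(series, dim=3, tau=1):
    """Rows: [series(t), series(t - tau), ..., series(t - (dim-1)*tau)]."""
    start = (dim - 1) * tau
    return np.column_stack([series[start - k * tau: len(series) - k * tau]
                            for k in range(dim)])

def cross_map_skill(causee, causer, dim=3, tau=1, k=4):
    """Predict causer(t) from nearest neighbors in the causee's delay space.
    High correlation between predictions and truth suggests (but does not
    prove) that the putative causer influences the causee."""
    emb = delay_embed(causee, dim, tau)
    truth = causer[(dim - 1) * tau:]
    preds = np.empty(len(emb))
    for i in range(len(emb)):
        dists = np.linalg.norm(emb - emb[i], axis=1)
        dists[i] = np.inf  # exclude the point itself
        nbrs = np.argsort(dists)[:k]
        # Exponential weighting by distance, as in simplex-style prediction.
        w = np.exp(-dists[nbrs] / max(dists[nbrs][0], 1e-12))
        preds[i] = np.sum(w * truth[nbrs]) / np.sum(w)
    return np.corrcoef(preds, truth)[0, 1]
```

Under the figure's setup, cross_map_skill(z, y) would be expected to be high, while cross_map_skill(z, w) and cross_map_skill(z, v) would be low; adding measurement noise to Y (panel F) lowers the skill.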
In this example, there is a continuous delay map from a causee to a causer, but not the other way around, and also no continuous delay map between causally unrelated variables. If this behavior reflects a broader principle, then perhaps continuous delay maps can be used to infer the presence and direction of causation.

Figure caption: Top row: Nonreverting continuous dynamics may lead one to infer causality where there is none. This example consists of two time series: a wavy linear increase and a parabolic trajectory. Although they are causally unrelated, we can find continuous delay maps between them. This is because there is (i) a continuous map from the delay vector [X(t), X(t-τ)] to t (X is "nonreverting"), and (ii) a continuous map from t to Z (Z is "continuous"); thus there is a continuous delay map from X to Z ("nonreverting continuous dynamics"). Thus, one falsely infers that Z causes X, and by similar reasoning that X causes Z. Second row: X drives Z such that their dynamics are "synchronized", and consequently, we find a continuous delay map also from X to Z even though Z does not drive X. Note that the extent of synchronization is not always apparent from inspecting equations (e.g. Figure 12 of [13]) or dynamics (Figure S16). Third row: X oscillates at a period that is 5 times the oscillatory period of Z. There is a continuous delay map from X to Z even though X and Z are causally unrelated. Note that true causality sometimes also induces oscillations with periods that are integer multiples of one another (e.g. in Figure 4, the period of Z is 3 times the period of X). Bottom row: In the classic chaotic Lorenz attractor, the Z variable is caused by X and Y, but we do not see a continuous map from the delay space of Z to either X (shown) or Y (not shown). This is because, as mentioned earlier, satisfying the conditions in Takens's theorem makes a continuous mapping likely but not guaranteed (Appendix 1.13). Here, Z is an example of this lack of guarantee [21] due to a symmetry in the system (see "Background definitions for causation in dynamic systems" in the Supplementary Information of [5]).
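The top-row pathology can be reproduced with the cross_map_skill sketch above (the series and parameters here are our own illustrative choices, not the figure's exact dynamics):

```python
import numpy as np

t = np.linspace(0, 10, 500)
x = t + 0.1 * np.sin(5 * t)  # wavy increase; strictly increasing, hence nonreverting
z = (t - 5) ** 2             # parabolic trajectory; continuous in t

# Reusing cross_map_skill from the sketch above: the skill is high in both
# directions, falsely suggesting that Z causes X and that X causes Z.
print(cross_map_skill(x, z, dim=2, tau=10))
print(cross_map_skill(z, x, dim=2, tau=10))
```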
Despite the caveats above, one might still attempt to use SSR to hypothesize causal relations. In this case, one must test for continuity (which has a precise mathematical definition, see Figure S12). However, if a delay map from a putative causee to a putative causer is not continuous, we need to decide whether this reflects a genuine absence of causation or is instead an artifact of, for example, noise or limited data.

In this section we examine how environmental drivers, process noise, and measurement noise can influence the performance of Granger causality and CCM, using computer simulations. We constructed a toy ecological system with a known causal structure, obtained its dynamics (with noise) through simulations, and applied a linear Granger causality test (using the MVGC package [27]) and CCM (using the R package rEDM) to test how well we can infer causal relationships.

We simulated a two-species community in which one species (S1) causally influences the other species (S2) but S2 has no influence on S1 (Figure 7B). Additionally, S1 is causally influenced by an unobserved periodic external driver, and S2 either is (Figure 7D) or is not (Figure 7E) causally influenced by its own (also unobserved) periodic external driver. We added process noise to model the stochastic nature of natural ecosystems and added measurement noise to model measurement uncertainty. Process noise propagates to future time points through the dynamics, whereas measurement noise does not.
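The simulation setup can be sketched as follows (a deliberately simplified caricature with assumed functional forms, driver periods, and coefficients; the actual model is specified in the Supplementary Methods):

```python
import numpy as np

def simulate_community(T=500, sd_process=0.05, sd_measure=0.05,
                       s2_has_driver=True, seed=0):
    """S1 influences S2 but not vice versa; S1 has an unobserved periodic
    driver, and S2 optionally has its own. Linear dynamics chosen only
    for illustration."""
    rng = np.random.default_rng(seed)
    s1 = np.zeros(T)
    s2 = np.zeros(T)
    for t in range(1, T):
        d1 = 0.5 * np.sin(2 * np.pi * t / 23)  # driver of S1
        d2 = 0.5 * np.sin(2 * np.pi * t / 41) if s2_has_driver else 0.0
        eps_p1, eps_p2 = rng.normal(0.0, sd_process, 2)  # process noise
        s1[t] = 0.8 * s1[t - 1] + d1 + eps_p1                    # S2 absent here
        s2[t] = 0.8 * s2[t - 1] + 0.5 * s1[t - 1] + d2 + eps_p2  # S1 -> S2
    # Measurement noise corrupts observations only; unlike process noise,
    # it does not feed back into the dynamics.
    return (s1 + rng.normal(0.0, sd_measure, T),
            s2 + rng.normal(0.0, sd_measure, T))
```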

Granger causality and CCM can perform well when their respective requirements are met, but both are fairly sensitive to the levels of process and measurement noise (Figure 7D and E, correct inferences colored green in pie charts). In reality, where a system lies on the spectrum of process versus measurement noise is often unknown, and we are not aware of any test that reliably distinguishes between process noise and measurement noise. Both Granger causality and CCM are also sensitive to details of the ecosystem (whether or not S2 has its own external driver; compare Figure 7D with E). CCM is additionally sensitive to details of the test procedure (Figure 7D and E, olive brackets).

Figure 7 (caption fragment): We simulated a two-species community. The process noise terms ε_p1(t) and ε_p2(t), as well as the measurement noise terms ε_m1(t) and ε_m2(t), are IID normal random variables with a mean of zero and a standard deviation whose value we vary. (C) Five possible outcomes of causal inference. (D, E) Community dynamics and causal inference outcomes. We varied the level (i.e. standard deviation) of process noise and measurement noise. Each pie chart shows the distribution of inference outcomes from 1000 independent replicates. Granger causality: in the deterministic setting (lower left corner), the MVGC tool correctly rejects the data as inappropriate. With modest (but not high) process noise and no measurement noise, Granger causality frequently infers the correct causal structure. High measurement noise prevents the detection of true causality, as expected. With low process noise (i.e. ≤ 0.05), adding an intermediate level of measurement noise can increase the false positive rate. CCM: the performance of CCM is sensitive to test procedure (olive brackets) and ecological details (one versus two drivers), and quickly deteriorates with measurement or process noise. When S2 does not have its own external driver (E), CCM infers false causality in the deterministic case due to the (visually apparent) synchrony. In both the main and alternative CCM procedures, criterion 1 (positive ρ) was checked directly and random-phase surrogate data were used to test criterion 2 (significance of ρ). Criterion …