Improved permutation tests enable detection of dependence between nonstationary time series despite limited numbers of replicates

In disciplines from ecology to neuroscience, researchers analyze correlations between pairs of nonstationary time series to infer relationships among variables. This often involves a statistical test to determine whether an observed correlation is stronger than expected under the null hypothesis of independence. Testing for dependence between nonstationary time series with only one experimental replicate is exceedingly challenging. However, with many replicates, a nonparametric trial-swapping permutation test can be employed, comparing within-replicate correlations to between-replicate correlations. Although largely assumption-free, this test is severely limited by the number of replicates because its minimum achievable p-value is 1/n!, where n is the number of replicates. This curtails its applicability to many biomedical studies, where n is frequently as low as 3, rendering significance thresholds like 0.05 unattainable. To address this, we propose modified permutation tests that can report lower p-values of 2/n^n or 1/n^n when there is strong evidence of dependence. We prove that the tests guarantee a false positive rate at or below the significance level, as long as replicates come from independent and identical experiments. We demonstrate this approach by confirming the observation that groups of zebrafish swim faster when directionally aligned, using an existing dataset with 3 biological replicates.

Data availability: All code used, along with execution instructions, is within S1 Code. The time series data analyzed in this paper can be obtained via the following link: https://drive.google.com/drive/folders/1UmzlX-yJhzQ5KX5rGry8wZgXvcz6HefD


Introduction
Scientists frequently look for correlations between variables to identify potentially important relationships, or to support conceptual models. In disciplines with a focus on dynamics, such as ecology and physiology, it is common to measure correlations between temporal processes (i.e. time series correlations). Many important biological processes are nonstationary, meaning that their statistical properties (such as mean or variance) change systematically across time. For instance, the expansion of an invasive species and a cell's response to a new environmental stress are both likely nonstationary.
Interpreting a correlation between a pair of nonstationary time series can be highly fraught because it is easy to obtain a seemingly impressive correlation between two time series that have no meaningful relationship [1]. For example, the sizes of any two exponentially-growing populations will be correlated over time due to a shared growth law, even if the populations lack any interaction or shared influences. To avoid spurious correlations, it helps to distinguish between the concepts of "correlation" and "dependence". In time series research, the term "correlation" is often used procedurally [2,3,4]. Similarly, here we define a correlation function (ρ) to be any function that takes two time series and produces a statistic, although it is usually interpreted as a measure of similarity or relatedness.
In contrast to correlation, "dependence" is a hypothesis about the relationship between variables, and it has clearer scientific implications. Two events A and B are called dependent if the probability that they both occur, P(A, B), differs from the product of their individual probabilities P(A)P(B). Similarly, two temporal processes are dependent if the probability of observing any particular pair of trajectories differs from the product of the probabilities of the individual trajectories. (The formal definition of dependence extends this idea to continuous variables, in which case discrete event probabilities may be replaced by probability densities [5,6].) The importance of dependence relationships can be attributed to Reichenbach's common cause principle, which states that if two variables are dependent, then either they share a common causal influence, or one variable causally influences the other (possibly indirectly) [7,8]. Thus, it is useful to first test whether an observed correlation indicates dependence before pursuing specific mechanistic explanations.
There are a handful of systematic approaches to testing for dependence between nonstationary time series. However, most are fairly limited in their scope. First, when possible, a nonstationary time series can be transformed to become stationary, meaning that its statistical properties do not change systematically across time. This enables access to a wide arsenal of tests applicable to stationary data. Transformations include subtracting a trend, taking the derivative (more precisely, "differencing" between neighboring points), or choosing a stationary-seeming window of time [9,10]. However, it is easy to see potential pathologies in each of these three transformations: Taking the derivative of an exponential curve just produces another exponential curve; subtracting a fitted linear trend from the random walk (X(t) = X(t − 1) + ϵ(t), where ϵ(t) is random noise) does not make it stationary [9]; similarly, searching for a stationary window of the same random walk process is futile since the variance of X(t) increases with each step. In a second approach, one may compare the observed correlation between two time series with correlations obtained when one of the time series is replaced by a synthetic replicate, where synthetic replicates are generated in a way that models the form of nonstationarity in the process [11]. However, this requires a correct model of how the statistical properties of the process evolve over time. Lastly, there are tests within the econometrics literature that can provide evidence for dependence between random-walk-like nonstationary processes by detecting a property called cointegration, but cointegration only occurs when some linear combination of the time series is stationary [12].
Alternatively, an approach based on permutations is possible when there are multiple identically distributed and independent (iid) replicates (or equivalently, trials; we use "replicates" and "trials" interchangeably). In this case, one may evaluate the significance of a within-replicate correlation by comparing it to between-replicate correlations. That is, if X_i and Y_i are time series of variables X and Y from replicate i, the correlation of (X_i, Y_i) may be compared to the correlation of (X_i, Y_{j≠i}). If the two variables are dependent, within-replicate correlations should tend to be stronger than between-replicate correlations. This approach is sometimes called "inter-subject surrogates" [13,14].
The inter-subject surrogate approach can be used to test for correlations in each trial separately [15]. In this case, a simple nonparametric test of the correlation between, for instance, X_1 and Y_1 can be performed by computing ρ(X_1, Y_j) for j = 1, ..., n and writing down a p-value as the proportion of these correlations that are greater than or equal to ρ(X_1, Y_1) [16,17]. For this approach, a minimum of n = 20 trials is needed if one wishes to possibly obtain a p-value of 0.05 or below. Since this procedure checks for dependence in trial 1 specifically, checking for dependence in other trials via this method would require analogous tests, and potentially a multiple testing correction.
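As a minimal sketch (assuming the Pearson correlation as ρ; any correlation function could be substituted), the per-trial procedure can be written as:

```python
import numpy as np

def per_trial_p(X, Y, i, rho=lambda a, b: np.corrcoef(a, b)[0, 1]):
    """P-value for dependence in trial i: the rank of the within-trial
    correlation rho(X_i, Y_i) among the correlations of X_i with every
    Y_j (the inter-subject surrogates), including j = i itself."""
    corrs = [rho(X[i], Y[j]) for j in range(len(Y))]
    return sum(c >= corrs[i] for c in corrs) / len(corrs)
```

With n = 20 trials, the smallest reportable p-value is 1/20 = 0.05, matching the requirement stated above.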
However, the number of replicates is sometimes constrained by considerations of logistics, ethics, or cost [18], and single-digit replicate numbers are common in biomedical research [19,20]. For instance, two influential longitudinal human gut microbiome sampling studies relied on cohorts of two subjects each [21,22]. Additionally, secondary analysis of public data from earlier studies can be a resource-efficient mode of research [23], and this approach is necessarily limited to the number of replicates within the existing dataset.
A straightforward strategy for testing for dependence with smaller replicate numbers is to perform a single permutation test (Fig 1A) using the data from all replicates [24,25]. As an example with n = 4 trials, the test procedure begins by computing the mean within-trial correlation (ρ denotes correlation):

ρ_orig = (1/4) [ρ(X_1, Y_1) + ρ(X_2, Y_2) + ρ(X_3, Y_3) + ρ(X_4, Y_4)].

Next, as a null model, recompute the mean correlation after randomly permuting the Y time series while holding the X time series in the original order. (This is the same as permuting the Xs and maintaining the order of the Ys.) For instance, one such permutation might be:

ρ_shfl = (1/4) [ρ(X_1, Y_2) + ρ(X_2, Y_4) + ρ(X_3, Y_1) + ρ(X_4, Y_3)].

A p-value can be calculated as the proportion of permutations (including the original ordering) that produce a mean correlation that is as strong as, or stronger than, ρ_orig (Fig 1A). One may then detect a correlation at the significance level α if p ≤ α. We emphasize that these permutations are obtained by swapping trials, not by swapping time points. This test has been used to detect correlations between time series in neuroscience and psychology settings [26,24,27]. It has also been used to detect correlations between variables measured in brain images, which can have similar nonstationarity challenges as time series [28]. A noteworthy advantage of this approach is that it is valid (meaning that the false positive rate is guaranteed to not exceed α) under very mild assumptions. Namely, the test is valid if the X_i trials are exchangeable with one another (i.e. all permutations of the sequence X_1, ..., X_n have the same joint probability distribution), or if the Y_i trials are exchangeable (see Corollary 4 in Appendix 1).
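The trial-swapping procedure above can be sketched in a few lines (a minimal implementation assuming the Pearson correlation as ρ; any correlation function can be substituted):

```python
import itertools
import math
import numpy as np

def permutation_test(X, Y, rho=lambda a, b: np.corrcoef(a, b)[0, 1]):
    """Trial-swapping permutation test: the p-value is the fraction of
    all n! reorderings of Y (including the identity ordering) whose mean
    correlation is at least the original within-trial mean."""
    n = len(X)
    def mean_corr(order):
        return np.mean([rho(X[i], Y[j]) for i, j in enumerate(order)])
    rho_orig = mean_corr(range(n))
    n_ge = sum(mean_corr(p) >= rho_orig
               for p in itertools.permutations(range(n)))
    return n_ge / math.factorial(n)
```

With n = 4 strongly dependent trials, this returns the test's floor of 1/4! ≈ 0.042.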
Yet, even the permutation test remains considerably limited by the number of trials. Specifically, the lowest p-value that can be obtained with n replicates is 1/n!, since there are n! possible permutations. For example, with n = 3, the lowest possible p-value is 1/3! ≈ 0.17, and with n = 4, the lowest is 1/4! ≈ 0.04.
These minima apply even if the average within-trial correlation is much greater than the average between-trial correlation. Thus, the level of evidence against the null hypothesis will always be modest at best for very small numbers of replicates.
In this article, we show that by adding one additional step to the permutation test, we may achieve p-values as low as 1/n^n. For n = 3 or 4 respectively, the lowest possible p-value is then ≈ 0.04 or 0.004.
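The gap between the two floors is easy to tabulate (a quick check of the arithmetic above):

```python
from math import factorial

# Compare the minimum achievable p-values of the two tests.
for n in (3, 4, 5):
    p_perm_floor = 1 / factorial(n)  # permutation test: 1/n!
    p_match_floor = 1 / n**n         # permute-match test: 1/n^n
    print(f"n={n}: 1/n! = {p_perm_floor:.4f}, 1/n^n = {p_match_floor:.5f}")
```

Notably, with n = 3 only the permute-match floor falls below the conventional 0.05 threshold.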
This new result only requires that the replicates be iid. Thus, this modified permutation test allows the data analyst to detect dependence with stronger confidence when there is sufficient evidence to do so. We illustrate this approach using simulations as well as a real-world example involving recordings of zebrafish swimming behavior with only 3 replicate trials.

Results
The concept of an X-perfect or Y-perfect match

Central to our approach is the concept of an X-perfect match or a Y-perfect match. To motivate the ideas, suppose that a group of n graduate students X = (X_1, ..., X_n) has been paired with a group of n advisors Y = (Y_1, ..., Y_n) so that the ith student (X_i) is paired with the ith advisor (Y_i). Moreover, let ρ(X_i, Y_j) be a number that measures how well the ith student and the jth advisor get along. We say that the ith student is "happy" if they get along with their own advisor strictly better than they get along with any other advisor, that is, if ρ(X_i, Y_i) > ρ(X_i, Y_j) for all j ≠ i. An X-perfect match has occurred if and only if every student is happy, i.e. ρ(X_i, Y_i) > ρ(X_i, Y_j) for all pairs (i, j) such that i ≠ j. Similarly, a Y-perfect match has occurred if and only if ρ(X_i, Y_i) > ρ(X_j, Y_i) for all pairs (i, j) such that i ≠ j.
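In code, checking for a perfect match reduces to inspecting a matrix of pairwise correlations. The sketch below assumes the convention C[i, j] = ρ(X_i, Y_j):

```python
import numpy as np

def x_perfect(C):
    """X-perfect match: each X_i gets along strictly best with its own
    Y_i, i.e. C[i, i] > C[i, j] for all j != i."""
    n = len(C)
    return all(C[i, i] > C[i, j]
               for i in range(n) for j in range(n) if j != i)

def y_perfect(C):
    """Y-perfect match: each Y_i gets along strictly best with its own
    X_i, i.e. C[i, i] > C[j, i] for all j != i."""
    return x_perfect(np.asarray(C).T)
```

For example, the matrix [[0.9, 0.95], [0.1, 0.96]] is Y-perfect (each diagonal entry beats the off-diagonal entry in its column) but not X-perfect (0.9 < 0.95 in the first row).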
Permute-match tests: modified permutation tests of dependence

We now describe tests for dependence between nonstationary time series based on the ideas just laid out, which we will call permute-match tests. Suppose that an experiment produces time series X = (X_1, ..., X_n) and Y = (Y_1, ..., Y_n), where X_i and Y_i are time series obtained from the ith replicate, and we want to test whether X and Y are dependent. For example, we may have video recordings of n animals in separate but identically constructed enclosures; for the ith animal, we extract a time series of movement speed (X_i) and a time series of the distance from a light source (Y_i). Using these data, we may wish to investigate whether the animals under study tend to move at different speeds depending on their distance from the light source.
We introduce two tests: simultaneous permute-match and sequential permute-match.As we will see, the former can be more reproducible while the latter can be more powerful.
In the simultaneous permute-match test (Fig 1E), we check for both an X-perfect and a Y-perfect match. If either the X- or the Y-perfect match occurs, we write down a p-value of p = 2/n^n per equation 1.
If neither X nor Y achieves a perfect match, then we fall back to the permutation test: p = p_perm = N_≥/n!, where N_≥ is the number of all n! possible permutations (including the original ordering) that produce an average correlation that is at least as large as that of the original ordering (Fig 1A).
In the sequential permute-match test, we begin by choosing a variable (either X or Y) for the first perfect match test. For example, suppose we choose X first (in which case we refer to the procedure as the "permute-match (X▶Y)" test). If an X-perfect match is observed, then p = 1/n^n (Fig 1F). If an X-perfect match is not observed, we then check for a Y-perfect match, and write p = 2/n^n if a Y-perfect match is observed.
If neither an X-perfect nor a Y-perfect match occurs, then p = p_perm. A permute-match (Y▶X) test is defined analogously.
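Putting the pieces together, a minimal sketch of the sequential permute-match test might look as follows (the Pearson correlation stands in for ρ as an assumption; the matrix convention is C[i, j] = ρ(X_i, Y_j)):

```python
import itertools
import math
import numpy as np

def sequential_permute_match(X, Y,
                             rho=lambda a, b: np.corrcoef(a, b)[0, 1],
                             first="X"):
    """Sequential permute-match test: permute-match (X>Y) when
    first="X", permute-match (Y>X) when first="Y". Falls back to the
    trial-swapping permutation test when neither match occurs."""
    n = len(X)
    C = np.array([[rho(X[i], Y[j]) for j in range(n)] for i in range(n)])
    x_match = all(C[i, i] > C[i, j]
                  for i in range(n) for j in range(n) if j != i)
    y_match = all(C[i, i] > C[j, i]
                  for i in range(n) for j in range(n) if j != i)
    matches = (x_match, y_match) if first == "X" else (y_match, x_match)
    if matches[0]:
        return 1 / n**n
    if matches[1]:
        return 2 / n**n
    # Fall back to the permutation test over all n! trial orderings.
    rho_orig = np.mean(np.diag(C))
    n_ge = sum(np.mean([C[i, p[i]] for i in range(n)]) >= rho_orig
               for p in itertools.permutations(range(n)))
    return n_ge / math.factorial(n)
```

With n = 3 dependent trials and a perfect match, this reports p = 1/27 ≈ 0.04, below the 0.05 threshold that the plain permutation test (floor 1/3! ≈ 0.17) can never reach.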
Under H_0 (i.e. the Y_i replicates are iid, the X_i replicates are iid, and X and Y are independent), all the above tests are valid, meaning that for any significance level α, we have P(p ≤ α) ≤ α. This fact is proven as Theorems 10-12 in Appendix 2.

Permute-match test variants may differ in statistical power
Although the different tests (the permutation test, the simultaneous permute-match test, and the sequential permute-match tests) are all valid, they may differ in statistical power (Fig 2). In the regime where 2/n^n ≤ α < 1/n!, the permutation test has zero power, since its p-value is limited at 1/n!. Conversely, if either an X-perfect or a Y-perfect match occurs, then the sequential permute-match tests as well as the simultaneous permute-match (X;Y) test will all detect dependence. The two sequential tests may differ in power from each other, as can be seen in the upper left panel of Fig 2D. We do not currently know how to pre-diagnose which of these two tests will be most powerful based on a given dataset, so the choice of which test to run in this case is arbitrary.
Regime 4: when α < 1/n^n, no test has any power, even when a perfect match occurs. The above example was chosen for ease of presentation; in fact, the example is so simple that it may be correctly treated by just fitting and removing the linear trend. In Appendix 4, we show that the same ideas hold in an example with a more complex nonstationary data-generating process (a logistic map with time-varying parameters) and a nonlinear correlation function (cross-map skill).

An example from animal behavior science
Living in groups is a common experience among animals. For example, at least half of fish species are thought to form groups at some stage of life [29,30]. The properties of these groups can impart a variety of important fitness effects [31]. In groups of fish, coordinated swimming may facilitate foraging (e.g. by enabling fish to "communicate" the location of food to others) and predator escape (e.g. by confusing predators) [32,30].
One study [30] used videos of small groups of zebrafish (8 fish per group) to observe that fish swam faster during video segments when they were in a high-polarization state (i.e. when they were directionally aligned) than during segments when they were in a low-polarization state. A correlation between polarization and speed might indicate statistical dependence between the two variables, or it might just be a consequence of temporal trends (e.g. polarization and swimming speed both happen to decrease with time).
Using a publicly available dataset [33] containing fish trajectories from 3 replicate 10-minute videos of small groups of juvenile Danio rerio zebrafish (10 fish per group), we performed a permute-match test of dependence between polarization and fish swimming speed. As we will see, although these quantities are potentially nonstationary, the permute-match test nevertheless detects a statistical dependence between them.
To quantify polarization, we use "circular variance" (v_circ; see Methods), a common measure of angular dispersion [34]. Since high polarization means low dispersion, we quantify polarization as 1 − v_circ, and call this quantity the circular concentration, since it indicates "how concentrated the [circular] data is toward the center" ([35], page 15). Circular concentration is bounded between 0 (low alignment) and 1 (perfect alignment). To quantify speed, we took the "average individual speed", a term we use to denote an average across individuals, not across time. More precisely, the average individual speed of a 10-fish group at time t is (s_{1,t} + s_{2,t} + ⋯ + s_{10,t})/10, where s_{i,t} is the speed of fish i at time t. Note that average individual speed is distinct from the speed of the group center, which has also been studied in Danio fish [31].
Neither average individual speed nor circular concentration is obviously stationary (Fig 3A). By visual inspection, the average individual speed seems to decrease across time, and all time series are deemed nonstationary by a Kwiatkowski-Phillips-Schmidt-Shin test (p < 0.01), so we may be on shaky ground to use methods that require stationarity. Additionally, since only 3 trials are available, the permutation test and the simultaneous permute-match test cannot detect dependence at the 0.05 level, so we perform a sequential permute-match test.
We arbitrarily started with average individual speed and identified a "speed"-perfect match, indicating a significant correlation between average individual speed and circular concentration (Fig 3B; p = 1/3^3 ≈ 0.04).
Significance does not depend on the choice of which match is tested, since both a "speed"-perfect and a "circular concentration"-perfect match are obtained. Note that although trials are independent of each other, all between-trial correlations from the full dataset are positive (Fig 3B), illustrating the danger in applying naive data analysis methods to data that are autocorrelated or even nonstationary.
To try a parametric alternative, for each trial we averaged values of average individual speed and circular concentration over the full 10 minutes (Fig 3C; each point is one trial). The sample Pearson correlation of the time-averaged variables is 0.99996, and a one-tailed test of significance (using the function stats.pearsonr from the Python package Scipy [36]) also detects dependence (p ≈ 0.003), consistent with the permute-match tests. This test relies on a null distribution derived under the assumption that the three points in Fig 3C are drawn from a bivariate Gaussian distribution with zero covariance. We view this test as comparable to the permute-match tests because although it requires a parametric assumption, it can be valid even when the time series are nonstationary.
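The parametric alternative can be sketched as follows. The numbers here are illustrative placeholders, not the actual trial averages from the fish dataset, and the `alternative="greater"` keyword assumes Scipy ≥ 1.9:

```python
import numpy as np
from scipy import stats

# Hypothetical per-trial time averages (one point per trial, n = 3);
# the real values come from averaging each 10-minute time series.
mean_speed = np.array([4.1, 5.0, 5.9])
mean_concentration = np.array([0.52, 0.60, 0.70])

# One-tailed Pearson test: H1 is that the correlation is positive.
r, p = stats.pearsonr(mean_speed, mean_concentration,
                      alternative="greater")
```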
Since data are limited in many scientific applications, we asked whether the results of the sequential permute-match tests and the parametric Pearson correlation test were consistent across different time series lengths. We truncated time series to various lengths (from 20 seconds to 10 minutes), and checked whether each test reported a significant (p ≤ 0.05) correlation (Fig 3D). Length was varied in 20-second increments, with each increment representing 640 frames since the frame rate is 32 frames per second. Both sequential permute-match tests reported a significant correlation at all lengths tested, whereas the parametric test was inconsistent, finding a significant correlation for about one third of lengths. We used the Pearson correlation (not its magnitude) in the permute-match tests since the alternative hypothesis is that the correlation is positive, not merely nonzero.

Discussion
Outside the context of time series, permutation tests have been celebrated because they require only minimal assumptions [38]. To satisfy the requirements of permutation tests, it is sufficient (see Corollary 4 in Appendix 1) to collect data from exchangeable replicates. Data in each trial can be autocorrelated (e.g. temporally or spatially) and nonstationary. No distributional assumptions are required, and the correlation function can even be asymmetric so that ρ(X, Y) ≠ ρ(Y, X), as occurs when correlations are based on techniques that use one time series to predict values of another (e.g. Appendix 4). Such radical freedom from assumptions stands in stark contrast to the situation in the analysis of single-replicate time series, where the practitioner may be forced to make assumptions that can be difficult to verify. For example, most statistical tests of correlation between two time series require stationary data at a minimum [39]. Yet, it is difficult to verify that a single time series is stationary because stationarity is a property of an entire ensemble of time series [39], so that checking for stationarity in a single time series is philosophically analogous to checking for the Gaussianity of a single data point.
However, applying a trial-swapping permutation test to check for significant correlations requires that a sufficient number of trials be available to permute amongst one another. When only small numbers of trials are available, the permutation test is severely limited by mathematics. The minimum p-value that can possibly be achieved by a permutation test is 1/n!, which will provide only modest evidence against a null hypothesis when n = 4 and is essentially useless when n = 3, regardless of how strong the true dependence may be. This limitation is particularly severe for researchers testing several related hypotheses, as this ideally requires a test that can produce p-values sufficiently low to maintain significance even after a multiple-testing correction.
Here we have described modified permutation tests, the sequential and simultaneous permute-match tests, that have access to lower p-values. In the sequential permute-match test, if the first variable does not display a perfect match, this test falls back to the simultaneous permute-match test. Thus, the sequential permute-match test is more powerful than the simultaneous test. However, the sequential test requires the arbitrary choice of checking for an X- or a Y-perfect match first, making it potentially less reproducible between studies than the simultaneous test, which does not involve an arbitrary choice. If neither an X- nor a Y-perfect match is obtained, both tests default to a permutation test. This step is valid because the tests are nested, in the sense that a perfect match can only occur when the permutation p-value takes on its lowest possible value (Lemma 8 in Appendix 1).
It seems likely that the tests could be further modified to report an even lower p-value when both an X- and a Y-perfect match occur, as in Fig 3B. However, we have not yet determined a sharp upper bound on the probability of this two-way match event. This problem is left for future efforts.
Although we have here used language and examples involving time series, the permute-match test can in principle be applied to other scenarios where iid replicates are available, but data within each replicate are dependent. One example is spatial data in brain imaging, where the permutation test has been performed using multiple scanned brains [28]. Beyond spatial processes, dependent data also appear in nucleotide sequences and natural language text. Overall, advances in methods that take advantage of multiple replicates may facilitate statistical testing without requiring assumptions that are difficult to verify.

Data preprocessing
We obtained fish trajectory data [33] from the web address at https://drive.google.com/drive/folders/1UmzlX-yJhzQ5KX5rGry8wZgXvcz6HefD. For the first trial we used the file at the path "10/1/trajectories_wo_gaps.npy" (where "10" indicates the number of fish and "1" indicates the first trial). For the second and third trials, the file path was the same except that it began with "10/2/" and "10/3/" respectively. These files contain sequences of fish positions indexed by video frame, as well as constants such as frame rate and approximate fish body length.
We estimated the velocity of each fish as

v⃗_{i,t} = R (p⃗_{i,t+1} − p⃗_{i,t}),

where p⃗_{i,t} is a 2-dimensional column vector specifying the position of the ith fish in units of body length at video frame t, v⃗_{i,t} is a 2-dimensional column vector specifying the fish velocity in units of body length per second at video frame t, and R is the frame rate (32 frames per second). A group's average individual speed at each time was then given by

S_t = (1/N) Σ_{i=1}^{N} ∥v⃗_{i,t}∥,

where N is the number of fish. The circular concentration C_t of a group at each time was given by the formula

C_t = 1 − v_circ(θ_{1,t}, ..., θ_{N,t}), where θ_{i,t} = atan2(v_{y,i,t}, v_{x,i,t}).

Here, atan2 is the 2-argument arctangent function (i.e. "arctan2" in the Python package Numpy, ver. 1.21.6), and v_{x,i,t} and v_{y,i,t} denote the x and y components of the velocity vector v⃗_{i,t}. The circular variance calculation was carried out using the function "stats.circstats.circvar" in the Python package Astropy (ver. 3.2.2) [40].
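The formulas above can be condensed into a few lines of Numpy. This sketch assumes `positions` is an array of shape (T, N, 2) holding each fish's position in body lengths per frame, and computes the circular concentration directly as the mean resultant length (which equals 1 − v_circ) rather than through Astropy:

```python
import numpy as np

R = 32  # frame rate, frames per second

def speed_and_concentration(positions):
    """Average individual speed S_t and circular concentration C_t
    from a (T, N, 2) array of fish positions in body lengths."""
    v = R * np.diff(positions, axis=0)               # velocity, body lengths/s
    speed = np.linalg.norm(v, axis=2).mean(axis=1)   # average individual speed
    theta = np.arctan2(v[..., 1], v[..., 0])         # heading angle of each fish
    # circular concentration = 1 - circular variance = mean resultant length
    conc = np.hypot(np.cos(theta).mean(axis=1),
                    np.sin(theta).mean(axis=1))
    return speed, conc
```

Perfectly aligned fish give C_t = 1, while headings spread uniformly around the circle give C_t near 0.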
sequence. This notation is useful as we will be dealing with the set {T(h_1 Z), ..., T(h_m Z)} for potentially large values of m. Returning to our example, if z = 0, then T^{(1)}(0) = 10 and T^{(2)}(0) = 11. Similarly, for z = 1, T^{(1)}(1) = 10 and T^{(2)}(1) = 11. Also, in the statement of the lemma, T^{(i)} is defined as a function of z, not Z. This is to emphasize that the ordering depends on the particular outcome of the random variable and is not fixed. Finally, the lemma makes a claim about the probability that T(Z) exceeds T^{(k)}(Z), which is the kth lowest statistic in the set. In the example, if α = 0.05, then k = 2 − ⌊0.1⌋ = 2. T^{(2)}(Z) is always going to be the second smallest, i.e. the largest, element of {T(h_1 z), T(h_2 z)}, which is 11. Of course, the probability of T(Z) being even greater than the largest number is 0, and since 0 is less than α = 0.05, we have verified the lemma in this case. As another example, suppose α = 0.5.
Proof of Lemma 1: Let I(A) be the indicator function for the event A (meaning that I(A) = 1 if A occurs and I(A) = 0 otherwise).
First suppose that T^{(k)}(Z) is not tied with any other T^{(i)}(Z). Then Eq. A-1 implies that there are ⌊mα⌋ values of T^{(i)}(Z) that are greater than T^{(k)}(Z). To arrive at the second equality, we used the fact that T^{(k)}(Z) = T^{(k)}(h_i Z), which follows from the second requirement of the lemma (i.e. {h_1 Z, ..., h_m Z} = {h_1 h_i Z, ..., h_m h_i Z}).
Alternatively, if there is possibly a tie for T^{(k)}(Z) (meaning that there may be some j ≠ k such that T^{(j)}(Z) = T^{(k)}(Z)), then we have a similar formula, but with an inequality instead of an equality. This version with the inequality is of course general since it is true under both cases. Using this formula, and since Z and h_i Z have the same distribution, we may remove the h_i in the above formula. Dividing through by m then gives the result, as required.
Next we give a definition of the permutation test. The definition here is somewhat more general than in the main text. As with h_i, we will denote "g_i applied to Z" by writing g_i Z.
Definition 2 (The permutation test of independence). Let X = (X_1, ..., X_n) and Y = (Y_1, ..., Y_n) be sequences of random variables. Let G = {g_1, ..., g_{n!}} be the complete set of functions that permute a sequence of length n. For example, if Y is (2, 7, 4), then g_i Y might be (2, 4, 7). Define the statistic T(X, Y) to be a function that returns a real number, typically interpreted as the overall correlation strength. Repeatedly compute the statistic after permuting the elements of Y; that is, compute T(X, g_1 Y), ..., T(X, g_{n!} Y). Let N_≥ be the number of permuted statistic values that are at least as large as the original statistic value; that is,

N_≥ = Σ_{i=1}^{n!} I(T(X, g_i Y) ≥ T(X, Y)),

where I(A) is the indicator function for the event A (meaning that I(A) = 1 if A occurs and I(A) = 0 otherwise). Then the p-value of the test is given by p_perm = N_≥/n!. We now prove the validity of the permutation test.

Lemma 3
The permutation test of independence is valid when Y is exchangeable.
Let X = (X_1, ..., X_n) and Y = (Y_1, ..., Y_n) be sequences of random variables such that Y is exchangeable and X is independent of Y. Perform the permutation test of independence (Def. 2) to obtain p_perm.
Proof: Here we use the symbols g_i and T with their meanings from Def. 2.
We first set up the problem in a way that allows us to apply Lemma 1. To construct the random variable Z and transformation set H for Lemma 1, we choose Z = (X, Y) and h_i Z = (X, g_i Y). Lemma 1 has two requirements. First, Z and h_i Z must have the same distribution. This requirement is satisfied: Since Y is exchangeable and X is independent of Y, it follows that (X, Y) has the same distribution as (X, g_i Y). The second requirement is that {h_1 Z, ..., h_m Z} = {h_1 h_i Z, ..., h_m h_i Z}. This requirement is also satisfied: The set of permutations of Y is the same as the set of permutations of g_i Y. Since both requirements are satisfied, we may apply Lemma 1.
We now prove the result directly. Below, A ↔ B means "A if and only if B".
As in Eq. A-1, let us define k = m − ⌊mα⌋, where m = n!. Then, the final line above says "the number of permuted Ys that produce a smaller T than the original Y is greater than or equal to k." Said another way, "when we rank the Ts from least to greatest, the T from the original Y will have a higher rank than k". This in turn is equivalent to T(X, Y) > T^{(k)}(X, Y). Thus, we may directly apply Lemma 1, thereby obtaining

P(p_perm ≤ α) = P(T(X, Y) > T^{(k)}(X, Y)) ≤ α.

The procedure of Eq A-10 produces a false positive rate that generally exceeds α and depends on the relationship between tests X and Y. This relationship between X and Y can be quantified by P′, the probability that test Y is significant (i.e. 2p_Y ≤ α) but test X is not significant (i.e. p_X > α). The false positive rate is shown over the full range of P′, from 0 to α/2. The only way for the overall procedure of Eq A-10 to be valid is when P′ = 0, where tests X and Y are so tightly coupled that a significant result in test Y deterministically ensures that test X is also significant. Other than this edge case, the false positive rate of the overall procedure will exceed α.

                                 Test Y significant (2p_Y ≤ α)   Test Y not significant   Total
Test X significant (p_X ≤ α)              α/2 − P′                     α/2 + P′             α
Test X not significant                       P′                       1 − α − P′          1 − α
Total probability                           α/2                        1 − α/2              1

Table A-1: Joint and marginal probabilities of the outcomes of tests X and Y, assuming that p_X and p_Y both follow Unif(0, 1). The parameter P′ is the probability that test Y is significant but test X is not.
Each joint probability is of course bounded between 0 and 1. Thus, moving clockwise from the upper left of the 2 × 2 inner table, we have the following constraints on P′: P′ ≤ α/2 (upper left), 0 ≤ P′ ≤ 1 (upper right), P′ ≤ 1 − α (lower right), and P′ ≥ −α/2 (lower left). Among these constraints, the tightest bounds on P′ are 0 ≤ P′ ≤ α/2. Notice that the false positive rate of the overall procedure is given by

P(p ≤ α) = P(p_X ≤ α) + P(2p_Y ≤ α, p_X > α) = α + P′.

A notable special case is independence of the two tests, in which P′ = P(2p_Y ≤ α) × P(p_X > α) = (α/2)(1 − α). Other than the extreme case of P′ = 0, in which a significant Y test result deterministically ensures that the X test is also significant, the false positive rate of the overall procedure, P(p ≤ α), will be greater than α. Thus the procedure is not valid.
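As a quick numerical check of the independence case, the following Monte Carlo sketch assumes (as described above) that the overall procedure rejects when either p_X ≤ α or 2p_Y ≤ α, with both p-values Unif(0, 1) under the null hypothesis; variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.05
n_sim = 200_000

# Under the null hypothesis, p_X and p_Y are independent Unif(0, 1) draws.
p_x = rng.uniform(size=n_sim)
p_y = rng.uniform(size=n_sim)

# The combined procedure rejects if either test fires.
reject = (p_x <= alpha) | (2 * p_y <= alpha)
empirical_fpr = reject.mean()

# Predicted rate alpha + P', with P' = (alpha/2)(1 - alpha) under independence.
predicted = alpha + (alpha / 2) * (1 - alpha)
```

With α = 0.05 the predicted false positive rate is about 0.074, well above the nominal level, matching the conclusion that the procedure is invalid.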

Figure 1: The permutation test, perfect match concept, and permute-match tests. (A) The permutation test. Circles indicate the average correlation obtained with (grey) or without (pink) trial swapping. Here, N_≥ is the number of average correlations (including the original correlation) that are equal to or larger than the original correlation. p_perm is then N_≥/n! because n! is the total number of average correlations, including the original. (B) Illustration of a Y-perfect match, represented as a table of correlations wherein each diagonal entry is the greatest in its own row. Each line with a ">" or "<" symbol denotes a required ordering relationship between the two numbers at either end of the line. Note that this example has a Y-perfect match, but not an X-perfect match. For an X-perfect match, each diagonal entry would need to be the greatest in its own column. (C) If an X- or Y-perfect match occurs, then the original correlation ρ_orig is always greater than any shuffled correlation ρ_shfl. Top: If a Y-perfect match occurs, then each Y_i gives the greatest correlation when it is paired with the X_i from the same trial. A shuffled correlation ρ_shfl pairs Y_i's from some trials with X_j's from different trials, thereby reducing the correlation. Lines with symbols ">", "<", and "=" denote comparisons between terms. Bottom: If an X-perfect match occurs, a similar argument can be applied. (D) Since a perfect match ensures that the original correlation is greater than any shuffled correlation, a perfect match also ensures that p_perm takes on its lowest possible value of 1/n!. (E) The simultaneous permute-match (X;Y) test. (F) The sequential permute-match (X▶Y) test; note that the permute-match (Y▶X) test is defined analogously.
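The procedure of panel (A) can be sketched in a few lines. This is a minimal illustration assuming the Pearson correlation as ρ; the function name is illustrative (the published implementation is in S1 Code):

```python
import itertools
from math import factorial

import numpy as np

def permutation_test_pvalue(X, Y):
    """Trial-swapping permutation test (Fig 1A).

    X, Y: arrays of shape (n_trials, n_timepoints).
    Returns p_perm = N_geq / n!, where N_geq counts trial pairings
    (including the original) whose average Pearson correlation is
    at least the original average within-trial correlation.
    """
    n = len(X)

    def avg_corr(pairing):
        # average correlation of each X_i with its assigned Y trial
        return np.mean([np.corrcoef(X[i], Y[j])[0, 1]
                        for i, j in enumerate(pairing)])

    rho_orig = avg_corr(range(n))
    n_geq = sum(avg_corr(p) >= rho_orig
                for p in itertools.permutations(range(n)))
    return n_geq / factorial(n)
```

Because the original pairing is always counted, the smallest value this function can return is 1/n!.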
match tests -are all valid in the sense of correctly controlling false positive rates, how do they vary in power -the probability of detecting true dependence? The answer depends on the number of replicates n and the desired significance level α. There are four regimes to consider (Fig 2A). Regime 1: when α ≥ 1/n!, all tests have the same power. If the permutation test detects dependence, the permute-match tests will too, since they fall back to a permutation test in the "worst case" (Fig 2A, column 1). Conversely, if the permutation test did not detect dependence (i.e. p_perm > α), then it follows that p_perm > 1/n!, which implies that neither an X-perfect nor a Y-perfect match occurred (Fig 1D), and so the permute-match tests cannot have detected dependence either. Regime 2: when 1/n! > α ≥ 2/n^n, all permute-match tests are tied for highest power (Fig 2A, column 2).

Regime 3: when 2/n^n > α ≥ 1/n^n, the simultaneous permute-match (X;Y) test has no power, because the desired significance level is below the lowest p-value achievable by that test. Only the permute-match (X▶Y) or (Y▶X) tests retain power (Fig 2A, column 3). Note that permute-match (X▶Y) and permute-match (Y▶X)

Figure 2B-D illustrates these different scenarios using a simulated nonstationary system and some common significance cutoffs. We simulated time series of two variables (X and Y) from a small number of replicates (between 3 and 5). Here, the time series were produced from a pair of linear time trends with coupled additive noise (r_X,Y determines the strength of coupling; Fig 2B). We chose the Pearson correlation coefficient as our correlation function ρ. The various sub-panels of Fig 2D illustrate the different aforementioned regimes.

Figure 2: Statistical power of various permutation test variants, illustrated using a simple nonstationary system. (A) Summary of test power as a function of the significance level α and the number of replicates n. (B, C) System equations and example dynamics. The processes X and Y are given by a linear trend with additive noise on a time grid t = 1, 2, ..., 100. The noise terms ϵ_X,i(t) and ϵ_Y,i(t) are drawn from a bivariate normal distribution with a mean of 0, variance of 1, and covariance of r_X,Y. The chart shows an example pair of time series where X and Y are dependent (r_X,Y = 0.3). (D) Statistical power of the permutation test and various permute-match tests as a function of the replicate number n, the significance level α, and the strength of dependence r_X,Y. Power was calculated from 5000 simulations at each value of r_X,Y between 0 and 0.54 in steps of size 0.01. We chose the Pearson correlation coefficient as our correlation function ρ.
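The system of panel (B) can be simulated directly. The slope of the linear trend below is an illustrative choice, since the caption does not fix it; the noise structure follows the caption:

```python
import numpy as np

def simulate_trend_system(n_reps, r_xy, T=100, seed=0):
    """Linear time trends with coupled additive noise (Fig 2B).

    The noise pairs (eps_X, eps_Y) are bivariate normal with unit
    variances and covariance r_xy. The slope of the trend (0.05) is
    an illustrative choice, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    cov = [[1.0, r_xy], [r_xy, 1.0]]
    eps = rng.multivariate_normal([0.0, 0.0], cov, size=(n_reps, T))
    X = 0.05 * t + eps[:, :, 0]  # shared trend + coupled noise
    Y = 0.05 * t + eps[:, :, 1]
    return X, Y
```

Note that even with r_X,Y = 0, the shared trend makes X and Y strongly correlated within each trial, which is exactly why a test against trial-swapped correlations, rather than against zero correlation, is needed.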

Figure 3: A sequential permute-match test detects dependence between average individual speed and circular concentration in small groups of zebrafish. (A) Time series of average individual speed and circular concentration for the three replicate videos. Black curves show original time series, and red curves show a 30-second moving average. Average individual speed appears to decrease with time in all trials. All 12 time series (2 variables × 3 trials × 2 smoothing conditions) were deemed nonstationary by a Kwiatkowski-Phillips-Schmidt-Shin (KPSS; [37]) test (p < 0.01). Note that the KPSS test seeks to reject a stationary null hypothesis, in contrast to other common tests, whose null hypotheses are nonstationary. (B) Pearson correlation between circular concentration and average individual speed, both within trials and between trials, using the full 10 minutes of data. Table entries are shaded by correlation. We observe a speed-perfect match (and a circular concentration-perfect match), and thus detect dependence with p = 1/3^3 ≈ 0.04. (C) A parametric test also detects dependence. As a parametric alternative, for each trial we averaged values of speed and circular concentration over the full 10 minutes (e.g. the scatter plot shown, where each point is one trial). The sample Pearson correlation of the time-averaged variables is 0.99996, and a one-tailed test of significance gives p ≈ 0.003. (D) Permute-match tests detected a significant correlation between speed and circular concentration more consistently than the parametric test. We truncated time series to different lengths starting from the first frame, and compared the parametric test with the two possible permute-match tests. Length was varied between 20 seconds and 10 minutes in 20-second increments, with each increment representing 640 frames since the frame rate is 32 frames per second. We used the Pearson correlation (not its magnitude) in the permute-match tests since the alternative hypothesis is that the correlation is positive, not merely nonzero.
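The correlation table of panel (B) and the perfect-match checks can be sketched as follows. This is an illustration with hypothetical function names (the published analysis is in S1 Code); entry (i, j) of the table is the correlation of trial X_i with trial Y_j, so the two match conditions correspond to the diagonal dominating rows versus columns of this table:

```python
import numpy as np

def correlation_table(X, Y):
    """Entry (i, j) is the Pearson correlation of trial X_i with trial Y_j,
    so the diagonal holds the within-trial correlations."""
    n = len(X)
    return np.array([[np.corrcoef(X[i], Y[j])[0, 1] for j in range(n)]
                     for i in range(n)])

def is_x_perfect(table):
    """Each X_i correlates best with its own Y_i: every diagonal entry
    strictly exceeds all other entries sharing its X trial."""
    n = table.shape[0]
    return all(table[i, i] > table[i, j]
               for i in range(n) for j in range(n) if j != i)

def is_y_perfect(table):
    """Each Y_j correlates best with its own X_j."""
    return is_x_perfect(table.T)
```

With n = 3 trials, observing a perfect match licenses the p-value 1/3^3 ≈ 0.04 reported in panel (B).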

Figure A-1: The procedure of Eq A-10 produces a false positive rate that generally exceeds α and depends on the relationship between tests X and Y. This relationship can be quantified by P′, the probability that test Y is significant (i.e. 2p_Y ≤ α) but test X is not significant (i.e. p_X > α). The false positive rate is shown over the full range of P′, from 0 to α/2. The only way for the overall procedure of Eq A-10 to be valid is when P′ = 0, where tests X and Y are so tightly coupled that a significant result in test Y deterministically ensures that test X is also significant. Other than this edge case, the false positive rate of the overall procedure will exceed α.

Fig A-1 plots Eq A-11 within the allowed bounds of P′, noting the extreme cases of P′ = 0 and P′ = α/2.

Figure A-2: Statistical power of permute-match tests and the permutation test in a nonlinear and nonstationary system. (A) System equations. (B) Example dynamics. The processes X and Y are given by a coupled logistic map on a time grid t = 1, 2, ..., 100. The system is nonstationary because the parameter r(t) varies with time from 3.72 when t = 1 to 3.82 when t = 99. To see the nonstationarity clearly, we plot X_i(t) against X_i(t + 1) and color points by time, showing that as time progresses, there is an upward drift in the parabola that the points lie on. To better show the trend, 10 replicates are shown simultaneously in this chart. (D) Statistical power of the permutation test and permute-match tests as a function of the number of replicates, the significance level, and r_X,Y. Power was calculated from 5000 simulations at each value of r_X,Y between 0 and 0.2 in steps of size 0.005. The correlation statistic (ρ) was cross-map skill, which is known to readily detect dependence between X and Y in this system [45]. For the cross-map skill calculation, we used Y to estimate X, which corresponds to a scenario where the data analyst hypothesizes that X influences Y. Cross-map skill requires two parameters, the embedding dimension and the embedding lag, and these were set to 2 and 1 respectively, following prior works that used the logistic map for benchmarking [45,46].
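Since the exact equations of panel (A) are not reproduced in this excerpt, the following sketch uses the standard coupled logistic map from the cross-mapping benchmarking literature [45], with X driving Y and with r(t) drifting linearly from 3.72 to 3.82 as the caption describes; the multiplicative coupling form and the initial conditions are assumptions and may differ from the paper's exact system:

```python
import numpy as np

def simulate_coupled_logistic(r_xy, T=100, seed=0):
    """Coupled logistic map with a nonstationary growth parameter.

    r(t) drifts linearly from 3.72 to 3.82 (per the caption); X drives
    Y with strength r_xy. The coupling form follows the usual CCM
    benchmark and is an assumption, not the paper's panel (A).
    """
    rng = np.random.default_rng(seed)
    x, y = np.empty(T), np.empty(T)
    x[0], y[0] = rng.uniform(0.2, 0.8, size=2)
    r = np.linspace(3.72, 3.82, T)
    for t in range(T - 1):
        x[t + 1] = x[t] * r[t] * (1 - x[t])
        y[t + 1] = y[t] * (r[t] * (1 - y[t]) - r_xy * x[t])
    return x, y
```

For small coupling strengths both trajectories remain in the unit interval while drifting with r(t), reproducing the kind of nonstationary, chaotic dynamics shown in panel (B).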
The arrangement of student-advisor pairs is X-perfect if all students are happy, and Y-perfect if all advisors are happy. Analogously, if X = (X_1, ..., X_n) and Y = (Y_1, ..., Y_n) are collections of time series with n trials each, and if ρ is some correlation function, we say that an X-perfect match has occurred if (and only if) for all i, ρ(X_i, Y_i) > ρ(X_i, Y_j) for every j ≠ i; a Y-perfect match is defined analogously, requiring ρ(X_i, Y_i) > ρ(X_j, Y_i) for every j ≠ i. Throughout the article, we use boldface X or Y to indicate a collection of trials, X_i or Y_i to indicate an individual trial, and X or Y to refer to the variable in general.
than any of the shuffled correlations, as illustrated in Fig 1C. Thus, a perfect match guarantees p_perm = 1/n!, the lowest p-value achievable with the permutation test (Fig 1D). Furthermore, letting P denote probability, we can prove that P(X-perfect match) ≤ 1/n^n under the null hypothesis H_0 wherein (1) the Y_i trials are iid, (2) the X_i trials are iid, and (3) X and Y are independent (Lemma 6 in Appendix 2). By an analogous argument (Lemma 7 in Appendix 2), under H_0 we have P(Y-perfect match) ≤ 1/n^n, and thus P(X- or Y-perfect match) ≤ 2/n^n.

The permute-match tests (Fig 1E-F) are "upgrades" to the permutation test (Fig 1A) because permute-match tests always report the same or a lower p-value (since 1/n^n ≤ 2/n^n < 1/n! for all integers n ≥ 2; see Lemma 9 in Appendix 1). In exchange for lower p-values, the permute-match tests have a slightly more restrictive data requirement than the permutation test. The permutation test is valid either if the X_i trials are exchangeable, or if the Y_i trials are exchangeable (Corollary 4 in Appendix 1). The permute-match tests require that all X_i trials are iid and that all Y_i trials are iid. If trials are iid, then they are necessarily exchangeable. However, exchangeable trials are not always iid. For example, the first five cards drawn from a shuffled deck are exchangeable because all possible orderings of cards are equally likely, but they are not iid because if the first card is an ace of hearts, the second card cannot be the ace of hearts. In practice, biological experimental designs typically use replicates that are both exchangeable and iid, so the more restrictive requirement of the permute-match tests is often inconsequential.

The different tests -permutation test, simultaneous permute-match test, and sequential permute-match tests -are valid tests of dependence given iid replicates (Theorems 10-12 in Appendix 1). In other words, they all guarantee false positive rates at or below the significance level. Two subtle aspects of the permute-match tests are worth mentioning here. First, no multiple testing correction is applied to account for the fact that both match tests and a permutation test are used. In other words, although the simultaneous permute-match (X;Y) test assigns p = 2/n^n instead of p = 1/n^n when an X- or Y-perfect match occurs as a Bonferroni-type correction, no such correction is needed when combining the match test and the permutation test (Fig 1E). This is because the permutation test and perfect match events are nested. That is, an X- or Y-perfect match is possible only if p_perm is already at its lowest possible value of 1/n! (Fig 1C-D). Thus, unlike combining un-nested tests where each can separately contribute false positives, we can safely combine the checks for perfect matches with the permutation test. The second subtlety refers specifically to the sequential permute-match tests. These tests use the lower p-value of 1/n^n if the first perfect match occurs, but use the more conservative "Bonferroni-corrected" p-value of 2/n^n if only the second match occurs (Fig 1F). This superficially resembles the practice of conducting additional tests merely because earlier tests did not result in a detection -a form of p-hacking that generally results in an invalid final p-value. However, our sequential permute-match procedure is actually valid because it combines binary events instead of continuous variables. See Appendix 3 for a detailed discussion of this point.
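Putting the pieces together, the sequential permute-match (X▶Y) test described above can be sketched as follows, assuming the Pearson correlation as ρ; the function name is illustrative and the published implementation is in S1 Code:

```python
import itertools
from math import factorial

import numpy as np

def sequential_permute_match_xy(X, Y):
    """Sequential permute-match (X > Y) test (Fig 1F).

    Returns 1/n^n if an X-perfect match occurs, 2/n^n if only a
    Y-perfect match occurs, and otherwise falls back to the ordinary
    trial-swapping permutation p-value.
    """
    n = len(X)
    # table[i, j] is the correlation of trial X_i with trial Y_j
    table = np.array([[np.corrcoef(X[i], Y[j])[0, 1] for j in range(n)]
                      for i in range(n)])
    x_perfect = all(table[i, i] > table[i, j]
                    for i in range(n) for j in range(n) if j != i)
    y_perfect = all(table[j, j] > table[i, j]
                    for j in range(n) for i in range(n) if i != j)
    if x_perfect:
        return 1 / n**n          # first match checked: uncorrected p-value
    if y_perfect:
        return 2 / n**n          # second match only: Bonferroni-corrected
    # Fall back to the permutation test over all n! trial pairings.
    rho_orig = np.mean(np.diag(table))
    n_geq = sum(np.mean([table[i, p[i]] for i in range(n)]) >= rho_orig
                for p in itertools.permutations(range(n)))
    return n_geq / factorial(n)
```

Because a perfect match can occur only when the permutation p-value is already at its floor of 1/n!, the three return branches never inflate the false positive rate relative to running the permutation test alone.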