Applying non-parametric testing to discrete transfer entropy

Transfer entropy (TE) is a powerful algorithm which attempts to detect the transfer of information from one system to another. In neuroscience, it has the potential to track the movement of information through complex neuronal systems, and provide powerful insights into their organization and operation. One such application is the ability to infer the existence of causal connectivity (such as synaptic pathways) between neurons in a culture being recorded by micro-electrode array (MEA). There are several challenges, however, in applying TE to neurological data; one of these is the ability to robustly determine whether an experimental TE value qualifies as significant. We find that common methods in spike train analysis such as a Z-test cannot be applied, as their assumptions are not met. Instead, we utilize surrogate data to compute a sample under the null hypothesis (no causal connection), and resample experimental data through Markov chain Monte Carlo (MCMC) methods to create a sample of TE values under experimental conditions. A standard non-parametric test (Mann-Whitney U-test) is then applied to compare these samples, and determine if they represent a significant connection. We have applied this methodology to MEA recordings of neuronal cultures developing over a period of roughly a month, and find that it provides a wealth of information regarding the cultures' maturity. This includes features such as the directed graph of causal connections across the MEA and identification of information exchange centers. These results are consistent and carry a well-defined significance level.

Micro-electrode arrays (MEAs) are tools which contain microscopic electrodes, allowing for measurements of electrical activity across a variety of samples (such as cultures of primary neurons). The resolution and capabilities of MEAs are growing, as measurement electrodes become smaller, more numerous, and more sensitive ([1]). While this allows researchers to study the development and behavior of neuronal networks in unprecedented detail, it also creates challenges in data analysis as the size

A challenge for both these algorithms is determining what value is significant, as both methods have biases and can be fooled by coincidences in the dataset, producing non-zero values even when a significant connection does not exist ([9], [10]). As a result, it is necessary to filter these data using estimates of their values under the null hypothesis that no connection exists.

We present a novel Markov-chain Monte Carlo (MCMC) based method of detecting significance from TE results, by sampling surrogate and experimental values. A non-parametric statistical test, the Mann-Whitney U-test, is then applied to these samples to determine the significance of each hypothesis regarding the connectivity.

We then apply this method to analyze the connectivity found within experimental recordings of developing neuronal cultures created by Wagenaar et al. ([11]). The TE results provide a wealth of information, including the directed graph of connectivity during development, and the relative importance of different sites in exchanging information.

Lastly, we briefly review the performance of our software, written in the Julia language and compatible with common Python tools for plotting, network analysis, and browser-based interactivity [12][13][14]. Our implementation is fully open-source, portable, and can be quickly built from an image.
Transfer entropy (TE) is an information-theory based measure which quantifies how much one's prediction of a signal changes based on the state of another signal ([15]). It is a measure which has been widely adopted in neuroscience ([10,16]), as it has been found to give the most accurate results among several different methods applied to simulated neural networks [17].

As detailed by Vicente et al. [10], TE may be calculated on discrete data (such as spike trains) by creating a 'delay vector.' Given an original, discrete spike train x(t) which indicates the presence or absence of a spike at t, let x(w) be a vector counting the number of spikes in each temporal bin of width w, with sufficient bins to span x(t). The short-term behavior of x(w) at an index i is then represented as a single state by creating a delay vector X^τ_i = (x_{i−τ+1}, …, x_i), which captures x_i together with the previous τ − 1 values.
The combination of a signal x's past state, its next value, and the past state of another signal y is used to create a joint distribution (JD, Fig 1a). The objective of this JD is to summarize the probability of each future state of x, given the delay vectors representing the past of both signals. By summing this JD along its rows, a marginal distribution (MD) is found, which predicts the future of x using only its own past state, excluding the past of y. Often, the past state may be offset by an additional o bins, to inspect for relationships which may have a delay.

The divergence between conditional probabilities predicting the future of x using the JD and MD is the transfer entropy between the signals. When the information from y is useful to predict the future of x, the divergence between these distributions will increase, indicating the possibility of a causal relationship from y to x. The inverse is not true, as TE is not commutative; the influence of x on y must be separately examined.

Fig 1. Illustration of calculating TE. a) Illustration of generating the joint distribution (JD) which describes the future state of a signal given its past state, and the past state of another signal. It can be reduced to the marginal distribution (MD, left side), which predicts its future outcome based on only its own past state. The divergence between each distribution's predictions is used to calculate TE. b) Markov-chain Monte Carlo (MCMC) method of generating spike count vectors representing the null hypothesis (no connectivity). In this example, the embedding length τ is 2, and the maximum possible spike count in a bin is 1. This creates a simple system which can transition through 4 states to generate a spike train. c) The MCMC method of resampling experimental TE values. Two signals' futures must be chosen randomly using the conditional probabilities from their original JDs. This provides a transition to the next joint state. If this state was not encountered in the original recording, it must be rejected (bottom).
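Stated explicitly (this equation does not appear in the excerpt; it is the standard discrete form of Schreiber's definition [15], written with the delay-vector notation above), the TE from y to x is:

```latex
TE_{Y \to X} \;=\; \sum_{x_{i+1},\, X_i^\tau,\, Y_i^\tau}
  p\!\left(x_{i+1}, X_i^\tau, Y_i^\tau\right)\,
  \log_2 \frac{p\!\left(x_{i+1} \mid X_i^\tau, Y_i^\tau\right)}
              {p\!\left(x_{i+1} \mid X_i^\tau\right)}
```

The numerator of the ratio is the conditional probability taken from the JD and the denominator the one taken from the MD; when y adds no predictive information the ratio is 1 and the contribution vanishes.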
While TE is a powerful informational method for investigating possible causality between different signals, it is computationally expensive. To apply it to spiking data, one must compute joint distributions of all spike embeddings between all combinations of electrodes on the array. One must also select the dimension of the delay embedding (τ), and the period the delay embedding spans. The correct values for these parameters are not obvious, and each has a significant impact on the algorithm's end result.
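To make the quantities concrete, here is a minimal plug-in estimator of a single pairwise TE value between two binned spike-count vectors. It is a sketch only: the authors' implementation is in Julia, this is Python for illustration, and `transfer_entropy` and its arguments are our own names.

```python
import numpy as np
from collections import Counter

def transfer_entropy(x, y, tau=3, offset=0):
    """Plug-in estimate of TE from y to x, in bits.

    x, y: equal-length sequences of spike counts per bin.
    tau:  delay-embedding length.
    offset: extra delay (in bins) applied to the past of y when
    searching for lagged relationships.
    """
    joint = Counter()
    for i in range(tau + offset, len(x) - 1):
        x_next = x[i + 1]
        x_past = tuple(x[i - tau + 1 : i + 1])                # X_i^tau
        y_past = tuple(y[i - offset - tau + 1 : i - offset + 1])
        joint[(x_next, x_past, y_past)] += 1

    n = sum(joint.values())
    c_state = Counter()    # counts of (x_past, y_past): JD conditioning
    c_xp = Counter()       # counts of x_past: MD conditioning
    c_next_xp = Counter()  # counts of (x_next, x_past)
    for (xn, xp, yp), c in joint.items():
        c_state[(xp, yp)] += c
        c_xp[xp] += c
        c_next_xp[(xn, xp)] += c

    te = 0.0
    for (xn, xp, yp), c in joint.items():
        p_joint_cond = c / c_state[(xp, yp)]            # p(x_next | Xp, Yp)
        p_marg_cond = c_next_xp[(xn, xp)] / c_xp[xp]    # p(x_next | Xp)
        te += (c / n) * np.log2(p_joint_cond / p_marg_cond)
    return te
```

As the surrounding text notes, this plug-in value is biased upward for finite data, which is exactly why a comparison against surrogate values is needed.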

Additionally, it has been shown that TE only achieves peak accuracy for spike trains when it searches for causality at multiple time-offsets, as neurons influence one another at differing time scales ([18]).

To determine what level a potential relation must exceed in order to be significant, it must be compared to the amount which could be expected under the null hypothesis (there is no causal relationship between the sources). If a value exceeds this, the alternative hypothesis (there exists a causal connection between the sources) is accepted.

'Surrogate' datasets approximating each spike train operating under the null hypothesis of no connectivity can be created by several methods. All of these methods must tread a thin path between maintaining the overall properties of a recording (such as spike count and inter-spike intervals), while obfuscating information that was created by causal relationships between sources ([9]).

Under the simpler case of the null hypothesis, only one MD for each source is needed. Assuming a source starts at a quiet state, its first state vector X^τ_i is created. Using the conditional probability P(x_{i+1}|X^τ_i), an outcome for the next state x_{i+1} is randomly selected. The state vector is then moved forward in time to include this outcome, and this process is repeated until a sample of sufficient length is created (Fig 1b). In our implementation, spike-count vectors for each channel are generated until the total number of spikes is equal to that of the original recording. This surrogate recording is then re-analyzed to calculate TE values for each possible connection. In this case, each source is independently sampled from a conditional distribution dependent on only its own past, and there is no causal behavior between sources. As a result, any detected information transfer is due to the TE algorithm's biases.
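The null-hypothesis chain of Fig 1b can be sketched as follows, again in Python for illustration and with our own hypothetical function names:

```python
import numpy as np
from collections import defaultdict

def fit_conditional(x, tau):
    """Empirical P(x_{i+1} | X_i^tau) from a binned spike-count vector."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(tau - 1, len(x) - 1):
        state = tuple(x[i - tau + 1 : i + 1])
        counts[state][int(x[i + 1])] += 1
    return counts

def surrogate_train(x, tau, rng):
    """Markov-chain surrogate: per-channel statistics preserved, coupling destroyed.

    Starts from a quiescent (all-zero) state and repeatedly draws the next
    bin from the empirical conditional distribution, until the surrogate
    carries as many spikes as the original recording.
    """
    cond = fit_conditional(x, tau)
    target = int(np.sum(x))
    state = (0,) * tau
    out, spikes = list(state), 0
    while spikes < target:
        outcomes = cond[state]
        if not outcomes:              # state seen only at the recording's end
            state = (0,) * tau        # restart from quiescence
            continue
        vals = list(outcomes)
        p = np.array([outcomes[v] for v in vals], dtype=float)
        nxt = int(rng.choice(vals, p=p / p.sum()))
        out.append(nxt)
        spikes += nxt
        state = state[1:] + (nxt,)
    return np.array(out)
```

Generating one surrogate per channel and re-running the TE analysis on the result yields a sample of TE values attributable purely to bias.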

The more challenging case is to re-sample the value of each possible connection under the alternative hypothesis that sources do influence one another. In this case, the joint distributions predicting the next state of the two relevant sources must be simultaneously sampled. Similarly to the null case, this is done by starting at a quiet state, and randomly choosing each signal's next state based on the conditional probability P(x_{i+1}|X^τ_i, Y^τ_i), given the joint state between both sources' state vectors.

The occurrence of each new joint state is incremented in a JD, and the process is repeated until the number of total joint states in this distribution equals the number calculated using the original recording.

One complication arises, however: both signals are being generated at once, and they can occasionally choose conflicting states for the next step. This represents a state which was never encountered in the original recording (such as both channels spiking simultaneously). When this rare situation occurs, outcomes are re-chosen until a new, valid configuration is found. At least one such configuration always exists; given that the signal entered one state in phase space, it also had to traverse out of it (the trajectory of the signal is closed, given that the beginning and ending of each recording correspond to a quiescent state).
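A sketch of this joint sampling with the rejection step (Fig 1c), restricted to a single zero-offset pair of signals and using our own, hypothetical names:

```python
import numpy as np
from collections import defaultdict

def fit_joint(x, y, tau):
    """Empirical P(x_{i+1} | X_i^tau, Y_i^tau), plus the set of joint
    states actually observed in the recording."""
    cond = defaultdict(lambda: defaultdict(int))
    seen = set()
    for i in range(tau - 1, len(x) - 1):
        xs = tuple(int(v) for v in x[i - tau + 1 : i + 1])
        ys = tuple(int(v) for v in y[i - tau + 1 : i + 1])
        seen.add((xs, ys))
        cond[(xs, ys)][int(x[i + 1])] += 1
    return cond, seen

def _draw(outcomes, rng):
    """Draw one outcome from an empirical conditional (0 if state unseen)."""
    if not outcomes:
        return 0
    vals = list(outcomes)
    p = np.array([outcomes[v] for v in vals], dtype=float)
    return int(rng.choice(vals, p=p / p.sum()))

def resample_pair(x, y, tau, n_steps, rng, max_tries=100):
    """Simultaneously resample two coupled trains; reject joint states
    never encountered in the original recording (Fig 1c, bottom)."""
    cx, seen = fit_joint(x, y, tau)    # predicts x's next bin
    cy, _ = fit_joint(y, x, tau)       # predicts y's next bin
    xs = ys = (0,) * tau               # quiescent starting state
    out_x, out_y = list(xs), list(ys)
    for _ in range(n_steps):
        for _attempt in range(max_tries):
            nx = _draw(cx[(xs, ys)], rng)
            ny = _draw(cy[(ys, xs)], rng)
            cand = (xs[1:] + (nx,), ys[1:] + (ny,))
            if cand in seen:           # accept only observed joint states
                break
        else:
            cand, nx, ny = ((0,) * tau, (0,) * tau), 0, 0
        xs, ys = cand
        out_x.append(nx)
        out_y.append(ny)
    return np.array(out_x), np.array(out_y)
```

Re-computing TE on many such resampled pairs gives the sample of experimental values that is later compared against the surrogate sample.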

Additionally, these Markov chains can only be made from distributions which were created at a zero offset. When the next outcome is delayed from the current state vector, the next state vector is not immediately dependent on the previous spiking state, and a Markov chain which gives appropriate state-changes cannot be constructed. As a result, to re-sample TE values at a non-zero delay, the signal must be reconstructed from the 0-delay distribution, and the delayed state measured from its generated signal.

which the experimental samples are lesser. If this proportion is below the risk level, the null hypothesis is rejected, and the difference between these two samples indicates that the experimental TE value is significant, and provides sufficient evidence to infer a causal connection exists.

We denote this method as 'resampling transfer entropy' (R-TE), as it compares surrogate data and resampled experimental values to determine if a causal connection is significant.
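The final comparison step can be sketched as follows; we hand-roll a one-sided Mann-Whitney U-test using the normal approximation (no tie correction) so the example stays dependency-free, though in practice a library routine such as SciPy's `mannwhitneyu` would be used. Names and thresholds here are illustrative:

```python
import numpy as np
from math import erf, sqrt

def mann_whitney_greater(a, b):
    """One-sided Mann-Whitney U-test: p-value for the alternative that
    values in `a` tend to be greater than values in `b`.
    Normal approximation; ties are broken arbitrarily for brevity."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    ranks = np.concatenate([a, b]).argsort().argsort() + 1.0
    u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2    # U statistic for sample a
    mu = n1 * n2 / 2.0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mu) / sigma
    return u1, 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))

def connection_is_significant(resampled_te, surrogate_te, alpha=0.01):
    """Accept the connection when resampled experimental TE values are
    significantly greater than surrogate (null-hypothesis) TE values."""
    _, p = mann_whitney_greater(resampled_te, surrogate_te)
    return p < alpha
```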

To test the viability and performance of R-TE on experimental data, we applied it to MEA recordings created by Wagenaar et al., as they provide a large collection of neuronal cultures which developed over several weeks [11]. These cultures were prepared under differing conditions, including the source of the cortical material and the density at which it was plated. Cortical material was collected from eight Wistar rat embryos, giving rise to 8 'batches' of source material. The material was then disassociated, and

We sub-selected 4 cultures to investigate as a demonstration of R-TE's capabilities. The first two cultures, A and B, were densely plated (2.5 ± 1.5×10^3 cells/mm^2; batch 1, dense cultures 1 and 2). Culture C has a 'small' plating density, with 1.6 ± 0.6×10^3 cells/mm^2 (batch 6, small culture 1), and culture D has a 'sparse' 0.6 ± 0.24×10^3 cells/mm^2 plating density (batch 6, sparse culture 1). It was reported that the sparser cultures C and D developed more slowly than the dense cultures A and B, as the former required more time to develop complex spiking features. This contrast provides an

To find the appropriate bin width w, delay embedding length τ, and range of delay offsets o, we first make a basic inspection of the signal dynamics. A correlogram shows that many of the interactions between signals fall within a window of 20 ms (Fig 2a), providing a relevant period which the series of offsets o should inspect. Next, the minimum value of w is given by the sampling rate of the recording (25 kHz). At this minimum of w = 40 µs, no more than one spike can occur in each bin. As w increases, the maximum number of spikes within one bin increases. If the bin is too large, many spikes which could be interacting or transferring information will fall into a single bin, improperly capturing the signal.
We see an inflection above 1 ms, where the maximum number of spikes begins to rapidly increase; this suggests that w should not be larger than this value (Fig 2b). With o and w selected from signal dynamics, τ must be selected from its effect on results. At larger values of τ and w, the improper embedding of the signal causes the JD to effectively begin 'memorizing' it, creating artificially high TE values. We find that τ = 3 is a good selection, as the results with a larger or smaller τ do not differ strongly (Fig 2c). When results are analyzed, the power of the test being applied also depends on the number of resampled and surrogate values being sampled. Sampling the variation in total number of significant connections detected by bootstrap, we find that 32 samples of each is more than sufficient to effectively eliminate the variation in a test's outcome (Fig 2d). These parameters (w = 1 ms, τ = 3, o = 0-20 bins, 32 resamples/surrogates) are used to analyze the recordings for cultures A-D.

Fig 2. Parameter Selection for TE. a) Total correlation counts between pairs of signals over a range of delays (Culture A, 25 days in vitro (DIV)). Many of the strongest interactions occur within 20 ms. b) The maximum number of spike counts found in a single bin, for recordings from 8-18 DIV for culture A. After a bin width of 1 ms, many events can begin falling within a single bin, indicating the signal is being improperly embedded. c) The total number of significant connections inferred by R-TE while varying w and τ, given a constant o which ranges from 0-20 bins offset (Culture C, 28 DIV). A region of parameters which leads to results that do not change greatly indicates possible good values to use. The selected parameters w = 1 ms, τ = 3 are outlined in red. d) The standard deviation in number of significant connections for a sample file (culture A, 10 DIV), by the number of resampled and surrogate values. The selected parameter (32 of each) is highlighted in red.
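The bin-width check of Fig 2b is straightforward to reproduce: for a list of spike times, record the largest number of spikes landing in any single bin across a range of candidate widths. A minimal sketch with hypothetical data (spike times in milliseconds):

```python
import numpy as np

def max_count_per_bin(spike_times, widths):
    """For each candidate bin width w, return the largest number of spikes
    that fall into any single bin. A sharp rise in this maximum suggests w
    is too coarse and the signal is no longer embedded properly."""
    maxima = []
    for w in widths:
        bins = (np.asarray(spike_times) // w).astype(int)
        maxima.append(int(np.bincount(bins).max()))
    return maxima
```

Plotting the returned maxima against the candidate widths reproduces the inflection used to choose w = 1 ms.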
out the majority of these insignificant connections when standard significance levels (α ≤ 5%) are used (Fig 3a). But as the cultures develop, R-TE begins to detect connections with a significance which cannot be reasonably rejected (Fig 3b).

are low (Fig 4a,b). As development progresses and values grow, Gaussians may become suitable (Fig 4c), but it is not a safe assumption to make for all cases. This motivates our non-parametric approach to testing significance.

All recordings of cultures A-D were analyzed using R-TE, with a significance threshold of α = 1%. This yields an adjacency matrix for each recording, representing the connectivity of the neural network between the recording sites on the MEA (Fig 5a). As expected, the two dense cultures A and B show similar development, reaching near full connectivity after approximately two weeks in vitro. In contrast, cultures C and D (with small and sparse densities, respectively) develop more slowly and have fewer connections, as would be expected from their lower plating densities (Fig 5b). Additionally, the number of detected connections for all cultures does not vary greatly when standard significance levels (α ≤ 5%) are used (Fig 5c).

This finding is consistent with those of Wagenaar et al., who reported that cultures with lower plating densities developed more slowly and exhibited bursting behavior later than dense cultures. One feature of note is that the onset of greater connectivity levels in cultures A-D (Fig 5b) correlates to the onset of more consistent and complex bursting activity, as described in their original work ([11]).

Fig 5c) The variation in number of accepted connections for this culture by significance level.

The TE values of each connection also provide insight into the structure of the neuronal network. Each connection's TE value approximates the exchange of information from one site to the other, adding a meaningful weight to each edge between nodes in the graph describing connectivity. By summing the TE values on edges incident on a node, the relative importance of this node in processing information can be seen. We find that stable rankings of nodes, sorted by the sum of incoming TE, emerge alongside connectivity for all cultures (Fig 6a). In other words, connections which develop early remain important, and often grow in strength as the neuronal network's connectivity increases (Fig 6d,e,f). Incoming TE also correlates well with a coarse description of centrality, the in-degree (Fig 5e,f).

modern multi-core machine. Fully analyzing a 60-channel recording, including creating surrogates and resampling values, rarely took more than an hour (Fig 7a).

Fig 7. Performance of R-TE. a) An overview of R-TE's runtime on the MEA recordings, using the specified parameters (w = 1 ms, τ = 3, 0-20 offsets, 32 resamples, 32 surrogates). b) Runtime scales linearly with spike count, and roughly linearly with delay. c) R-TE's execution time is dominated by the MCMC resampling process, which must inspect every possible connection separately.
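The node ranking by summed incoming TE described earlier reduces to a weighted column sum over the TE matrix, masked by the significant-edge adjacency. A small sketch with our own names and hypothetical data:

```python
import numpy as np

def node_importance(te_matrix, significant):
    """Rank recording sites by total incoming TE over significant edges.

    te_matrix[i, j]: TE value from site i to site j.
    significant:     boolean directed adjacency matrix from the R-TE test.
    """
    weighted = np.where(significant, te_matrix, 0.0)
    in_strength = weighted.sum(axis=0)     # total TE arriving at each node
    in_degree = significant.sum(axis=0)    # coarse centrality, for comparison
    ranking = np.argsort(in_strength)[::-1]
    return in_strength, in_degree, ranking
```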
Runtime scales linearly with the number of spikes in the recording, and the length of the delay embedding used (Fig 7b). It is also linearly dependent on the number of surrogates, though this can be offset by parallelizing their generation. Lastly, runtime depends on the number of channels squared. As a consequence, the procedure which dominates the time used is the resampling process (Fig 7c), which must inspect the variability of every possible connection in the recording.

values. Non-parametric statistics directly applied to these samples are used as a basis for significance testing, instead of assuming that these values follow a distribution. This creates a more solid basis for statistical significance testing to determine whether a TE value corresponds to a significant connection.

We find that this makes a marked and significant improvement in inferring the connectivity of a neuronal network from its discrete spike train (as recorded by an MEA). TE also provides insights into the organization of information flow within this biological network, providing a view into development and identifying the emergence of hubs.

In future work, exporting the resampling algorithm to a GPU computation platform could dramatically increase the speed of computation, as it is a massively parallel problem. Additionally, other methods of determining the optimal TE parameters would be useful to increase confidence that the signal is being properly captured.

In conclusion, we believe this work provides a novel basis to infer the functional connectivity of a large neuronal network via transfer entropy, and yields insights into its operation.