## Abstract

Phenotype measurements frequently take the form of time series, but we currently lack a systematic method for relating these complex data streams to scientifically meaningful outcomes, such as relating the movement dynamics of a model organism to its genotype, or measurements of the brain dynamics of a patient to their disease diagnosis. Here we report a new tool, *hctsa*, that automatically selects interpretable and useful properties of time series by comparing over 7 700 time-series features drawn from diverse scientific literatures. Using exemplar applications to high-throughput phenotyping experiments, we show how *hctsa* allows researchers to leverage decades of time-series research to understand and quantify informative structure in time-series data.

Time-series data, repeated measurements of a quantity taken through time, are being recorded in increasing volumes in biology and medicine. This wealth of data has opened the door to a range of new research problems, including diagnosis of pathology from biomedical data streams in human patients [1, 2], understanding the role of specific neural circuits for behavior [3], and linking genotype to phenotype to understand gene function [4, 5, 6] or disease processes [7, 8, 9]. While differences in scalar phenotypes are relatively simple to calculate (such as the body length of a worm or the blood pressure of a human subject), it is less clear how to compare complex time-varying data streams (such as the movement dynamics of a worm, the heart rate fluctuations of a clinical patient, or the sequence of reaction times across a cognitive task). In all of these diverse applications, we require a method for reducing complex time-series data streams to informative, low-dimensional summaries.

A common way of summarizing a time series is by measuring a simple statistic such as its sample mean, which has the advantage of being easily interpretable; for example, knocking out the gene *unc-9* decreases the mean movement speed of the nematode worm, *Caenorhabditis elegans* [10]. However, this approach fails for many real-world applications in which the phenotypic differences are more subtle than simple mean shifts. Sophisticated tools for measuring structure in time-series data have been developed by a broad range of researchers, including contributions from the fields of statistics, electrical engineering, economics, statistical physics, dynamical systems, and biomedicine. This interdisciplinary literature includes summaries of the distribution of values in the data (e.g., Gaussianity, properties of outliers), autocorrelation structure (including power spectral measures), stationarity (how properties change over time), information-theoretic measures of entropy and temporal predictability, linear and nonlinear model fits to the data, and methods from the physical nonlinear time-series analysis literature [11]. There is currently no systematic way of leveraging this giant corpus of scientific work to determine which of these thousands of possible summary statistics best address a particular scientific hypothesis, because the methods have typically been locked in discipline-specific journal articles. Here we introduce a software package for performing highly comparative time-series analysis, *hctsa*, that opens up the interdisciplinary time-series analysis literature in the form of over 7 700 *features*, each of which captures a different type of interpretable structure in a univariate time series.
By comparing the performance of these features on a given dataset, *hctsa* facilitates data-driven, statistically-controlled selection of informative time-series summary statistics for phenotyping applications, overcoming an otherwise time-consuming and subjective manual task.

The general problem is depicted in Fig. 1, where we focus on distinguishing time series recorded from two different classes for demonstration (e.g., a patient group and a control group), although we note that the same general framework applies to multiclass classification or regression problems [11]. After measuring time-series data, *hctsa* is used to perform massive feature extraction, where the behavior of over 7 700 different scientific analysis methods applied to the dataset can be visualized as a feature matrix with a row for every time series and a column for every feature, shown in Fig. 1. The rich structure of the feature matrix reveals some sets of features (i.e., areas of the scientific time-series analysis literature) that capture meaningful differences between groups (e.g., lighter color for type A and darker color for type B in Fig. 1) and thus represent promising candidates as quantitative phenotypes for distinguishing data of the two types. In addition to facilitating massive feature extraction, *hctsa* includes a comprehensive suite of analytics for extracting useful and interpretable insights into the structure of the dataset, including: (i) identifying scientific methods that best quantify differences between labeled groups of data, providing interpretable insights into the phenotypic differences (incorporating permutation testing to statistically control for multiple hypothesis testing), (ii) building a classifier that draws on the full diversity of scientific methods to optimize the accuracy of phenotypic classification, and (iii) visualizing low-dimensional structure in the dataset to understand potential clustering structure or other relationships between the time series. By leveraging a comprehensive interdisciplinary literature on time-series analysis, *hctsa* thus enables researchers to gain a range of interpretable and useful insights into their data.

To demonstrate the approach, we applied *hctsa* to two case studies of *C. elegans* and *Drosophila melanogaster* movement, as shown in Fig. 2. Sample time series of the movement speed dynamics of five different strains of *C. elegans* [5] are shown in Fig. 2a (upper). Because these are noisy empirical recordings with no clear visual differences between strains, it is unclear what types of analysis methods might capture differences between the genotypes. Using *hctsa* to compute the behavior of over 7 700 different time-series features (subsequently filtered down to 6504 well-behaved features, see *Online Methods*), we found that the feature set as a whole predicted genotype from the short, noisy time series with a 10-fold cross-validated balanced accuracy of 80% (using a linear SVM; chance level: 20%). A total of 4499 different features of movement speed time series were individually informative of the genotype label (*q* < 0.05, FDR-corrected, permutation test), with the 40 most informative features spanning diverse methodological literatures, including AR and state space model fitting methods, detrended fluctuation analysis, local mean forecasting, multiscale Sample Entropy, and wavelet decompositions of the signal, shown as a structured pairwise correlation plot in Fig. 2a (lower). One of the multiscale entropy measures [12], highlighted in Fig. 2a (middle), computes the Sample Entropy, SampEn(2, 0.15), at scale level 3 (corresponding to 100 ms bins), which can be thought of as quantifying the ‘unpredictability’ of the time series at this timescale. Violin plots in Fig. 2a reveal overall, physiologically interpretable differences between the genotypes, with the lab strain N2 and wild isolate strain CB4856 showing the most predictable time series at this timescale, the morphological mutant *dpy-20* being intermediate, and neural mutants such as *unc-38* and *unc-9* being the least predictable.
The automatic selection of this temporal entropy measure by *hctsa* is consistent with recent work proposing the similar concept of ‘compressibility’ of posture sequences as a quantitative phenotype for *C. elegans* [13]. By comparing a wide interdisciplinary literature of analysis methods, *hctsa* thus selects biologically informative quantitative phenotypes from these time-series data automatically.
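To make the highlighted feature concrete, the following is a minimal Python sketch of Sample Entropy computed on a coarse-grained time series. It is an illustration only, not the *hctsa* implementation (which is written in Matlab): the function names, the use of non-overlapping averaging for coarse-graining, and the choice to set the tolerance to 0.15 times the standard deviation of the coarse-grained series are all assumptions.

```python
import math
import statistics


def coarse_grain(x, scale):
    """Average non-overlapping windows of `scale` samples (assumed coarse-graining)."""
    n_windows = len(x) // scale
    return [sum(x[k * scale:(k + 1) * scale]) / scale for k in range(n_windows)]


def sample_entropy(x, m=2, r=0.15):
    """SampEn(m, r): -log of the ratio of (m+1)-length to m-length template matches."""
    n = len(x)
    tol = r * statistics.pstdev(x)  # assumption: tolerance relative to the SD of x

    def count_matches(length):
        # Use the same number of templates (n - m) for both lengths
        templates = [x[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                # Chebyshev distance between the two templates
                dist = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if dist <= tol:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")  # undefined when no matches are found
    return -math.log(a / b)


def multiscale_sample_entropy(x, scale=3, m=2, r=0.15):
    """Sample Entropy of the series coarse-grained at the given scale."""
    return sample_entropy(coarse_grain(x, scale), m=m, r=r)
```

In this sketch, a perfectly periodic series yields a value of zero, because every matching template of length 2 still matches when extended to length 3.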

We also applied *hctsa* to 12 h movement speed time series of *Drosophila* restricted to a one-dimensional tube, labeled as either ‘day’ (light on) or ‘night’ (light off), and as either ‘male’ or ‘female’, as shown in Fig. 2b [14]. Leveraging the full feature library, we successfully classified recordings made during the day versus at night with a mean 10-fold cross-validated balanced accuracy of 98%, and also clearly distinguished the sex of the organisms (96%). The software selected interpretable quantitative time-series features for different groupings of the data, as annotated in Fig. 2b (upper), highlighting the increased spectral flatness and shorter durations between outliers in females relative to males (driven by their less predictable movement), and increased stationarity of movement dynamics during the day, with fewer extreme outliers than at night (which is dominated by more bursty patterns between sleep and activity). Taking the four combination classes (colored in Fig. 2b), the mean 10-fold cross-validated balanced accuracy remains high, at 95%. Again, *hctsa* identifies interpretable features like the standard deviation of incremental differences in the *z*-scored time series, shown as a violin plot in Fig. 2b (middle). This simple measure gives higher values to time series exhibiting greater changes from one time point to the next, providing a simple but easily interpretable measure of temporal predictability, which is increased during the day, and in females. Even when class labels are not used, *hctsa* can draw on the combined behavior of thousands of scientific methods to provide an informative low-dimensional principal components representation of the dataset, shown in Fig. 2b (lower), in which the four classes are clearly separated.
The subtle quantitative phenotypes of *Drosophila* movement provided automatically by *hctsa* go beyond simple comparisons of the overall amount of movement between day and night, or between males and females [14, 15]. For example, while it is known that females have shorter sleep bouts than males, *hctsa* quantifies the sexually dimorphic behavior by selecting new measures, such as the predictability of movement (reduced in females) or the time intervals between large movements (reduced in females). This picture of erratic female *Drosophila* movement potentially reflects their need to forage for food and select egg laying sites, in contrast to the more predictable male behavior of conserving energy to avoid predators [15]. These results demonstrate the benefits of leveraging a wide variety of time-series analysis methods to automatically learn informative structure in time-series data.
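The standard deviation of incremental differences in the *z*-scored time series, mentioned above, is simple enough to sketch in a few lines of Python. This is a hedged illustration rather than the *hctsa* code: the function names are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption.

```python
import statistics


def zscore(x):
    """Rescale a series to zero mean and unit (sample) standard deviation."""
    mu = statistics.fmean(x)
    sd = statistics.stdev(x)
    return [(v - mu) / sd for v in x]


def std_of_increments(x):
    """Standard deviation of successive differences of the z-scored series."""
    z = zscore(x)
    increments = [later - earlier for earlier, later in zip(z, z[1:])]
    return statistics.stdev(increments)
```

A smoothly varying series (such as a linear ramp) gives a value near zero, whereas a series that jumps between values from one time point to the next gives a larger value.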

In summary, we introduce a new software framework, *hctsa*, which automates the selection of quantitative phenotypes from time-series data by leveraging a large and interdisciplinary literature on time-series analysis. In a reversal of the typical time-series analysis process in which methods are selected manually by researchers, here we show that statistical machine learning can aid this process of human learning by subjecting time-series data to thousands of scientific methods. In addition to the high throughput phenotyping applications demonstrated here, *hctsa* has general utility, including behavioral phenotyping in cognitive science, and diagnosis of disease from biomedical data streams such as heart rates or brain dynamics. Furthermore, although we focus on classification problems here, we note that the same approach applies to regression problems, where one aims to find time-series features that vary with a continuous variable (such as the dosage of a drug, a standardized depression score of a patient, etc.) [11]. Code for running *hctsa* in Matlab has been developed and refined over many years through a range of diverse applications [11, 16, 17] and is available at www.github.com/benfulcher/hctsa, with accompanying comprehensive documentation at www.gitbook.com/book/benfulcher/hctsa-manual.

## MATERIALS AND METHODS

### Software details and reproducibility

Following from the original concept and proof of principle for a highly comparative approach to time-series analysis [11], this article introduces a well-documented and user-friendly Matlab-based software platform for performing it (Matlab is a product of The MathWorks, Natick, MA). The set of over 7 700 features has been developed and refined through applications to a wide range of research and industrial problems over many years. A full analysis pipeline has also been built to allow researchers to run highly comparative analysis on their own data, including functions for initiating new analysis tasks, computing features locally in Matlab or through an interface to a MySQL server (enabling distributed computing for large datasets), processing the results of the feature extraction (including options for filtering features on their behavior and feature normalization), and a range of other analytic outputs to facilitate scientific interpretation (including the plots shown in this paper).

### Dataset availability

The two datasets analysed here, including the labeled time series and the full results of *hctsa* feature extraction, are available in the form of Matlab files (.mat) for *C. elegans*: https://dx.doi.org/10.4225/03/580478f951263, and for *Drosophila*: https://dx.doi.org/10.4225/03/5804798d2a2ec.

### Code availability

Analyses presented here were computed using v0.92 of *hctsa*, which contains a total of 7 749 features. The *hctsa* software is freely available at github.com/benfulcher/hctsa/. Analysis pipelines used to produce the results reported here (as well as many other outputs from *hctsa*) are available at github.com/benfulcher/hctsa_phenotypingWorm/ and github.com/benfulcher/hctsa_phenotypingFly/ for the *C. elegans* and *Drosophila* datasets, respectively.

### Analysis details

#### Feature filtration and normalization

For any given analysis, we filtered out any features that were constant across the dataset or contained any ‘special’ values (e.g., due to applying a method that is inappropriate for the data, such as fitting a positive-only distribution to data that are not positive only, or attempting to fit a model to the data that does not converge, etc.). Due to this filtering, a different number of total features will be usable for a given dataset, depending on its properties.
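The filtering step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the *hctsa* routine: the function name is hypothetical, the feature matrix is represented as a list of rows (one per time series), and ‘special’ values are assumed to be encoded as NaN or ±Inf.

```python
import math


def filter_features(matrix, names):
    """Drop feature columns that are constant or contain NaN/Inf values.

    matrix: list of rows, one row per time series, one column per feature.
    names:  list of feature names, one per column.
    Returns the filtered matrix and the names of the retained features.
    """
    keep = []
    for j in range(len(names)):
        column = [row[j] for row in matrix]
        has_special = any(not math.isfinite(v) for v in column)
        is_constant = all(v == column[0] for v in column)
        if not has_special and not is_constant:
            keep.append(j)
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [names[j] for j in keep]
```

Because the retained set depends on the values in each column, the same feature library can yield different numbers of usable features on different datasets.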

When searching for discriminative individual features, we did not normalize or rescale feature values, so that results can be interpreted in the natural scale of each feature. However, when computing the Principal Components of a dataset, or learning a classifier in the full feature space, we normalized each feature to the unit interval using a scaled robust sigmoid function [11]:

$$\hat{\mathbf{f}} = \left[1 + \exp\left(-\frac{\mathbf{f} - m_f}{r_f/1.35}\right)\right]^{-1},$$

where $\hat{\mathbf{f}}$ represents the normalized feature values across a time-series dataset, $\mathbf{f}$ is the vector of un-normalized feature values, $m_f$ is the median of $\mathbf{f}$, and $r_f$ is its interquartile range.
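As a hedged illustration of this normalization, the Python sketch below applies a robust sigmoid centered on the median and scaled by the interquartile range, then linearly rescales the result to the unit interval. Several details are assumptions rather than facts from the text: the function name, the factor 1.35 (chosen so that the interquartile range divided by 1.35 approximates the standard deviation for Gaussian data), the quartile convention of `statistics.quantiles`, and the interpretation of ‘scaled’ as a final linear rescaling to [0, 1].

```python
import math
import statistics


def scaled_robust_sigmoid(f):
    """Map feature values to [0, 1] with an outlier-robust sigmoid (a sketch)."""
    m_f = statistics.median(f)
    q1, _, q3 = statistics.quantiles(f, n=4)  # assumed quartile convention
    r_f = q3 - q1
    if r_f == 0:
        raise ValueError("interquartile range is zero; cannot normalize")

    def sigmoid(z):
        # Numerically stable logistic function for large |z|
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)

    s = [sigmoid((v - m_f) / (r_f / 1.35)) for v in f]  # 1.35 is an assumption
    lo, hi = min(s), max(s)
    return [(v - lo) / (hi - lo) for v in s]  # linear rescale to the unit interval
```

Using the median and interquartile range rather than the mean and standard deviation means that a single extreme value (such as 100 in `[1, 2, 3, 4, 5, 100]`) does not compress the remaining values into a narrow band.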

#### Classification

For multi-class classification, we trained linear support vector machine classifiers in Matlab 2015b (a product of The MathWorks, Natick, MA) using the `fitcecoc` function with a linear kernel SVM. To compare single univariate features, we used simple linear discriminant analysis (using `classify`). When training SVM classifiers, we weighted each observation, *x*, by the inverse probability of its class label across the dataset to account for class imbalance.
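The observation weighting described above can be illustrated with a short Python sketch. The function name is hypothetical, and the class probability is assumed to be the empirical frequency of each label in the dataset; the resulting weights would then be passed to whichever classifier is being trained.

```python
from collections import Counter


def inverse_class_probability_weights(labels):
    """Weight each observation by 1 / (empirical probability of its class)."""
    counts = Counter(labels)
    n = len(labels)
    return [n / counts[label] for label in labels]
```

With this scheme the weights of each class sum to the same total, so a class with few examples (for instance, 20 time series among 226) contributes as much to the training objective as a class with many.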

Due to imbalance of observations across the multiclass classification problems investigated here, we report balanced classification accuracy, $C_{\mathrm{bal}}$, over $m$ classes in terms of the number of correctly identified examples of a given class, $t_i$, and the total number of examples of each class, $c_i$, as

$$C_{\mathrm{bal}} = \frac{1}{m}\sum_{i=1}^{m}\frac{t_i}{c_i}.$$

Balancing the accuracy in this way ensures that all classes contribute equally to the classification statistic.
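A minimal Python sketch of this statistic follows: it averages, over classes, the fraction of examples of each class that were correctly identified. The function name and the list-based inputs are illustrative assumptions.

```python
def balanced_accuracy(true_labels, predicted_labels):
    """Mean over classes of (correctly identified examples) / (examples in class)."""
    classes = sorted(set(true_labels))
    per_class = []
    for c in classes:
        indices = [i for i, y in enumerate(true_labels) if y == c]
        correct = sum(1 for i in indices if predicted_labels[i] == c)
        per_class.append(correct / len(indices))
    return sum(per_class) / len(per_class)
```

For example, a classifier that always predicts the majority class on a dataset with four examples of one class and one of another has a raw accuracy of 0.8 but a balanced accuracy of 0.5, which is the chance level for two classes.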

### Datasets

#### Caenorhabditis elegans movement speed

Movement speed time-series data were extracted from approximately 15 min videos recorded using tracking microscopes [5, 10] (see wormbase.org for more information). A total of 226 movement time series sampled at 30.03 Hz were obtained from the CB4856 (Hawaiian wild isolate, 29 time series) and N2 (lab strain, 100 time series) strains, and the mutants *dpy-20* (34 time series), *unc-9* (20 time series), and *unc-38* (43 time series). For the *dpy-20*, *unc-9*, and *unc-38* knockouts and the CB4856 strain, all available data at the specified frame rate were used. For the wild-type N2 strain, we took a random sample of 100 of the 1200 time series recorded at a sampling rate of 30.03 Hz. If missing data in a time series made up less than 15% of its length and formed a contiguous block at the beginning or end of the recording, the time series was retained with this section of missing data removed; otherwise the time series was excluded.
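The missing-data rule above can be sketched in Python as follows. This is an illustration under assumptions: the function name is hypothetical, missing samples are assumed to be encoded as NaN, and a series with missing samples at both ends (or anywhere in the interior) is treated as excluded, which is one reading of the rule.

```python
import math


def trim_missing(x, max_fraction=0.15):
    """Return the series with a leading or trailing missing block removed.

    Returns None if the series should be excluded: either too much data is
    missing, or the missing samples do not form one block at the start or end.
    """
    is_missing = [math.isnan(v) for v in x]
    n_missing = sum(is_missing)
    if n_missing == 0:
        return list(x)
    if n_missing >= max_fraction * len(x):
        return None  # 15% or more of the recording is missing
    if all(is_missing[:n_missing]):
        return list(x[n_missing:])  # single block at the beginning
    if all(is_missing[-n_missing:]):
        return list(x[:-n_missing])  # single block at the end
    return None
```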

#### Drosophila melanogaster movement speed

We analysed time series of the movement speed of flies restricted to a one-dimensional tube and tracked continuously for between 3 and 6 days using video tracking [14, 18]. Movement speed was estimated as the maximum speed of the measured data (sampled at approximately 2 Hz) in each non-overlapping 10 s time window, where displacements were measured as the Euclidean distance between successive coordinates of the fly. The resulting movement speed time series analysed here are therefore sampled at a rate of 0.1 Hz. Time series were split into 12 h segments and labeled as either ‘day’ (light on, 574 time series) or ‘night’ (light off, 574 time series), and as either ‘male’ (554 time series) or ‘female’ (594 time series).
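The downsampling from tracked coordinates to a 0.1 Hz speed series can be sketched in Python as below. The function name and parameters are hypothetical, and the sketch assumes a fixed sampling rate of exactly 2 Hz (the text states only ‘approximately 2 Hz’), so that each 10 s window contains 20 speed estimates.

```python
import math


def windowed_max_speed(coords, sample_rate_hz=2.0, window_s=10.0):
    """Maximum speed in each non-overlapping window of tracked coordinates.

    coords: sequence of (x, y) positions sampled at a fixed rate (assumed).
    Returns one value per complete window; a trailing partial window is dropped.
    """
    dt = 1.0 / sample_rate_hz
    # Speed from the Euclidean distance between successive coordinates
    speeds = [math.dist(p, q) / dt for p, q in zip(coords, coords[1:])]
    per_window = int(round(window_s * sample_rate_hz))
    n_windows = len(speeds) // per_window
    return [
        max(speeds[k * per_window:(k + 1) * per_window])
        for k in range(n_windows)
    ]
```

Taking the maximum rather than the mean within each window makes the resulting series sensitive to brief bursts of movement, at the cost of being more affected by isolated tracking errors.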

## Acknowledgements

We thank Andre Brown and Bertalan Gyenes for sharing the *C. elegans* movement dataset, and helpful feedback on the resulting analysis and manuscript. We thank Giorgio Gilestro and Quentin Geissmann for sharing the *Drosophila* movement dataset, and helpful feedback on the resulting analysis. Many thanks to Rachael Fulcher for help with graphic design, and to Alex Fornito and Iain Johnston for useful feedback on the manuscript.