bioRxiv
New Results

Automatic time-series phenotyping using massive feature extraction

B. D. Fulcher, N. S. Jones
doi: https://doi.org/10.1101/081463
B. D. Fulcher
1Monash Institute of Cognitive and Clinical Neurosciences (MICCN), Monash University, Victoria, Australia
N. S. Jones
2Department of Mathematics, Imperial College London, London, United Kingdom

Abstract

Phenotype measurements frequently take the form of time series, but we currently lack a systematic method for relating these complex data streams to scientifically meaningful outcomes, such as relating the movement dynamics of a model organism to their genotype, or measurements of brain dynamics of a patient to their disease diagnosis. Here we report a new tool, hctsa, that automatically selects interpretable and useful properties of time series by comparing over 7 700 time-series features drawn from diverse scientific literatures. Using exemplar applications to high throughput phenotyping experiments, we show how hctsa allows researchers to leverage decades of time-series research to understand and quantify informative structure in time-series data.

Time-series data, repeated measurements of a quantity taken through time, are being recorded in increasing volumes in biology and medicine. This wealth of data has opened the door to a range of new research problems, including diagnosis of pathology from biomedical data streams in human patients [1, 2], understanding the role of specific neural circuits for behavior [3], and linking genotype to phenotype to understand gene function [4, 5, 6] or disease processes [7, 8, 9]. While differences in scalar phenotypes are relatively simple to calculate (such as the body length of a worm or the blood pressure of a human subject), it is less clear how to compare complex time-varying data streams (such as the movement dynamics of a worm, the heart rate fluctuations of a clinical patient, or the sequence of reaction times across a cognitive task). In all of these diverse applications, we require a method for reducing complex time-series data streams to informative, low-dimensional summaries.

A common way of summarizing a time series is by measuring a simple statistic such as its sample mean, which has the advantage of being easily interpretable—e.g., knocking out the gene unc-9 decreases the mean movement speed of the nematode worm, Caenorhabditis elegans [10]. However, this approach fails for many real-world applications in which the phenotypic differences are more subtle than simple mean shifts. Sophisticated tools for measuring structure in time-series data have been developed by a broad range of researchers, including contributions from the fields of statistics, electrical engineering, economics, statistical physics, dynamical systems, and biomedicine. This interdisciplinary literature includes summaries of the distribution of values in the data (e.g., Gaussianity, properties of outliers), autocorrelation structure (including power spectral measures), stationarity (how properties change over time), information theoretic measures of entropy and temporal predictability, linear and nonlinear model fits to the data, and methods from the physical nonlinear time-series analysis literature [11]. There is currently no systematic way of leveraging this giant corpus of scientific work to determine which of these thousands of possible summary statistics best address a particular scientific hypothesis, because the methods have typically been locked in discipline-specific journal articles. Here we introduce a software package for performing highly comparative time-series analysis, hctsa, that opens up the interdisciplinary time-series analysis literature in the form of over 7 700 features, each of which captures a different type of interpretable structure in a univariate time series. By comparing the performance of these features on a given dataset, hctsa facilitates data-driven, statistically-controlled selection of informative time-series summary statistics for phenotyping applications, overcoming an otherwise time-consuming and subjective manual task.

The general problem is depicted in Fig. 1, where we focus on distinguishing time series recorded from two different classes for demonstration (e.g., a patient group and a control group), although we note that the same general framework applies to multiclass classification or regression problems [11]. After measuring time-series data, hctsa is used to perform massive feature extraction, where the behavior of over 7 700 different scientific analysis methods applied to the dataset can be visualized as a feature matrix with a row for every time series and a column for every feature, shown in Fig. 1. The rich structure of the feature matrix reveals some sets of features (i.e., areas of the scientific time-series analysis literature) that capture meaningful differences between groups (e.g., lighter color for type A and darker color for type B in Fig. 1) and thus represent promising candidates as quantitative phenotypes for distinguishing data of the two types. In addition to facilitating massive feature extraction, hctsa includes a comprehensive suite of analytics for extracting useful and interpretable insights into the structure of the dataset, including: (i) identifying scientific methods that best quantify differences between labeled groups of data, providing interpretable insights into the phenotypic differences (incorporating permutation testing to statistically control for multiple hypothesis testing), (ii) building a classifier that draws on the full diversity of scientific methods to optimize the accuracy of phenotypic classification, and (iii) visualizing low-dimensional structure in the dataset to understand potential clustering structure or other relationships between the time series. By leveraging a comprehensive interdisciplinary literature on time-series analysis, hctsa thus enables researchers to gain a range of interpretable and useful insights into their data.

Figure 1: Using a massive interdisciplinary library of time-series analysis methods to quantify and interpret phenotypic difference using hctsa.

We illustrate the problem of distinguishing two labeled classes of systems using measured time-series data. The hctsa package facilitates massive feature extraction to compare over 7 700 features of each time series, derived from an interdisciplinary time-series analysis literature. The feature matrix contains the result of this feature extraction, where each row represents a time series and each column represents a feature that encapsulates some property of that time series (e.g., measures of its autocorrelation structure, entropy, etc.). Color (blue and red) labels the two types of data—e.g., electrophysiological recordings from healthy controls (A) or people with schizophrenia (B)—and dark/light labels low/high values of each feature, revealing rich structure in the dynamical properties of the dataset. A range of analysis functions are included with hctsa, including learning interpretable differences between the labeled groups (visualized as a box plot revealing that time series of type A have increased entropy), and visualizing informative low-dimensional structure in the dataset.

To demonstrate the approach, we applied hctsa to two case studies of C. elegans and Drosophila melanogaster movement, as shown in Fig. 2. Sample time series of the movement speed dynamics of five different strains of C. elegans [5] are shown in Fig. 2a (upper). Being noisy empirical recordings with no clear visual differences between strains, it is unclear what types of analysis methods might capture differences between the genotypes. Using hctsa to compute the behavior of over 7 700 different time-series features (subsequently filtered down to 6504 well-behaved features, see Online Methods), we found that the feature set as a whole predicted genotype from the short, noisy time series with a ten-fold cross-validated balanced accuracy of 80% (using a linear SVM; chance level: 20%). A total of 4499 different features of movement speed time series were individually informative of the genotype label (q < 0.05, FDR-corrected, permutation test), with the 40 most informative features spanning diverse methodological literatures, including AR and state space model fitting methods, detrended fluctuation analysis, local mean forecasting, multi-scale Sample Entropy, and wavelet decompositions of the signal, shown as a structured pairwise correlation plot in Fig. 2a (lower). One of the multiscale entropy measures [12] is highlighted in Fig. 2a (middle), which computes the Sample Entropy, SampEn(2, 0.15), at a scale level 3 (corresponding to 100 ms bins), which can be thought of as quantifying the ‘unpredictability’ of the time series at this timescale. Violin plots in Fig. 2a reveal overall, physiologically interpretable differences between the genotypes, with the lab strain N2 and wild isolate strain CB4856 showing the most predictable time series at this timescale, the morphological mutant, dpy-20 being intermediate, and neural mutants like the unc-38 and unc-9 being the least predictable. 
The automatic selection of this temporal entropy measure by hctsa is consistent with recent work proposing the similar concept of ‘compressibility’ of posture sequences as a quantitative phenotype for C. elegans [13]. By comparing a wide interdisciplinary literature of analysis methods, hctsa thus selects biologically informative quantitative phenotypes from these time-series data automatically.
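To make the highlighted feature concrete, the following is a minimal Python sketch of multiscale Sample Entropy in the spirit of [12]: the series is coarse-grained by averaging non-overlapping windows, and SampEn(m, r) is then computed on the coarse-grained series. It is not the hctsa implementation. The function names are hypothetical, and setting the tolerance from the standard deviation of the original series (rather than the coarse-grained one) is an assumption.

```python
import math
import statistics


def coarse_grain(x, scale):
    """Average consecutive non-overlapping windows of length `scale`."""
    n = len(x) // scale
    return [sum(x[i * scale:(i + 1) * scale]) / scale for i in range(n)]


def sample_entropy(x, m, tol):
    """SampEn(m, tol) = -log(A / B), where B counts pairs of length-m templates
    within `tol` of each other (Chebyshev distance) and A counts the pairs that
    still match when extended to length m + 1. Self-matches are excluded."""
    n = len(x)

    def count_matches(length):
        # The same n - m template start points are used for both lengths.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                if all(abs(x[i + k] - x[j + k]) <= tol for k in range(length)):
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return float("nan")  # undefined when no template matches are found
    return -math.log(a / b)


def multiscale_sample_entropy(x, scale=3, m=2, r=0.15):
    # Assumption: tolerance is r times the SD of the original (not coarse-grained) series.
    tol = r * statistics.pstdev(x)
    return sample_entropy(coarse_grain(x, scale), m, tol)
```

A perfectly periodic series is fully predictable, so every template match extends and the entropy is zero; noisier series give larger values. The pairwise comparison is quadratic in the series length, which is acceptable for a sketch but slow for long recordings.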

Figure 2: hctsa uncovers interpretable, quantitative phenotypic differences in movement speed time series of C. elegans and Drosophila.

A C. elegans. Top: Two examples of movement speed time series are shown for each of five genotypes. Middle: Class distributions of multiscale Sample Entropy, selected by hctsa as an informative measure, are shown as a violin plot, demonstrating that the neural mutant unc-9 genotype has the highest average Sample Entropy at this scale, followed by unc-38, the morphological mutant dpy-20, the lab-based strain N2, and the wild type strain CB4856. Bottom: The top 40 features identified by hctsa for distinguishing the five genotypes span a wide range of time-series analysis techniques, labeled using color, and form sets of highly correlated groups. The multiscale Sample Entropy shown above is indicated with a red arrow and star. B Drosophila. Top: Two examples of movement speed time series are shown for each of four groups, labeled as either ‘male’ or ‘female’, and either ‘day’ or ‘night’. Interpretable measures of difference between each pair of conditions were extracted using hctsa, and are summarized using text. Middle: hctsa identified the standard deviation of successive changes in movement speed as a simple but highly discriminative feature, shown as a violin plot. Bottom: A two-dimensional principal components projection of the dataset across the full hctsa feature library is informative of the class structure in the dataset. Shading has been added to guide the eye.

We also applied hctsa to 12 h movement speed time series of Drosophila restricted to a one-dimensional tube, labeled as either ‘day’ (light on) or ‘night’ (light off), and as either ‘male’ or ‘female’, as shown in Fig. 2b [14]. Leveraging the full feature library, we successfully classified recordings made during the day versus at night with a mean 10-fold cross-validated balanced accuracy of 98%, and also clearly distinguished the sex of the organisms (96%). The software selected interpretable quantitative time-series features for different groupings of the data, as annotated in Fig. 2b (upper), highlighting the increased spectral flatness and shorter durations between outliers in females relative to males (driven by their less predictable movement), and increased stationarity of movement dynamics during the day, with fewer extreme outliers than at night (which is dominated by more bursty patterns between sleep and activity). Taking the four combined classes (colored in Fig. 2b), the mean 10-fold cross-validated balanced accuracy remains high, at 95%. Again, hctsa identifies interpretable features like the standard deviation of incremental differences in the z-scored time series, shown as a violin plot in Fig. 2b (middle). This simple measure gives higher values to time series exhibiting greater changes from one time point to the next, providing a simple but easily interpretable measure of temporal predictability, which is increased during the day, and in females. Even when class labels are not used, hctsa can draw on the combined behavior of thousands of scientific methods to provide an informative low-dimensional principal components representation of the dataset, shown in Fig. 2b (lower), in which the four classes are clearly separated. The subtle quantitative phenotypes of Drosophila movement provided automatically by hctsa go beyond simple comparisons of the overall amount of movement between day and night, or between males and females [14, 15]. 
For example, while it is known that females have shorter sleep bouts than males, hctsa quantifies the sexually dimorphic behavior by selecting new measures of, for example, predictability of movement (reduced in females) or time intervals between large movements (reduced in females), with this picture of erratic female Drosophila movement potentially reflecting their need to forage for food and select egg laying sites, in contrast to the more predictable male behavior of conserving energy to avoid predators [15]. These results demonstrate the benefits of leveraging a wide variety of time-series analysis methods to automatically learn informative structure in time-series data.
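The simple feature highlighted for Drosophila, the standard deviation of incremental differences in the z-scored time series, can be sketched in a few lines of Python. This is an illustrative reading of the description above rather than the hctsa code; the function names are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption.

```python
import statistics


def zscore(x):
    """Rescale a series to zero mean and unit (sample) standard deviation."""
    mu = statistics.fmean(x)
    sd = statistics.stdev(x)
    return [(v - mu) / sd for v in x]


def sd_successive_differences(x):
    """Standard deviation of the changes between consecutive z-scored values."""
    z = zscore(x)
    diffs = [b - a for a, b in zip(z, z[1:])]
    return statistics.stdev(diffs)
```

A smooth ramp changes by the same amount at every step and so scores close to zero, whereas a series that jumps back and forth between two values scores high. Z-scoring first means the feature reflects the pattern of changes rather than the overall amount of movement.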

In summary, we introduce a new software framework, hctsa, which automates the selection of quantitative phenotypes from time-series data by leveraging a large and interdisciplinary literature on time-series analysis. In a reversal of the typical time-series analysis process in which methods are selected manually by researchers, here we show that statistical machine learning can aid this process of human learning by subjecting time-series data to thousands of scientific methods. In addition to the high throughput phenotyping applications demonstrated here, hctsa has general utility, including behavioral phenotyping in cognitive science, and diagnosis of disease from biomedical data streams such as heart rates or brain dynamics. Furthermore, although we focus on classification problems here, we note that the same approach applies to regression problems, where one aims to find time-series features that vary with a continuous variable (such as the dosage of a drug, a standardized depression score of a patient, etc.) [11]. Code for running hctsa in Matlab has been developed and refined over many years through a range of diverse applications [11, 16, 17] and is available at www.github.com/benfulcher/hctsa, with accompanying comprehensive documentation at www.gitbook.com/book/benfulcher/hctsa-manual.

MATERIALS AND METHODS

Software details and reproducibility

Following from the original concept and proof of principle for a highly comparative approach to time-series analysis [11], this article introduces a well-documented and user-friendly Matlab-based software platform for performing it (Matlab is a product of The MathWorks, Natick, MA). The set of over 7 700 features has been developed and refined through applications to a wide range of research and industrial problems over many years. A full analysis pipeline has also been built to allow researchers to run highly comparative analysis on their own data, including functions for initiating new analysis tasks, computing features locally in Matlab or through an interface to a mySQL server (enabling distributed computing for large datasets), processing the results of the feature extraction (including options for filtering features on their behavior and feature normalization), and a range of other analytic outputs to facilitate scientific interpretation (including the plots shown in this paper).

Dataset availability

The two datasets analysed here, including the labeled time series and the full results of hctsa feature extraction, are available in the form of Matlab files (.mat) for C. elegans: https://dx.doi.org/10.4225/03/580478f951263, and for Drosophila: https://dx.doi.org/10.4225/03/5804798d2a2ec.

Code availability

Analyses presented here were computed using v0.92 of hctsa, which contains a total of 7 749 features. The hctsa software is freely available at github.com/benfulcher/hctsa/. Analysis pipelines used to produce the results reported here (as well as many other outputs from hctsa) are available at github.com/benfulcher/hctsa_phenotypingWorm/ and github.com/benfulcher/hctsa_phenotypingFly/ for the C. elegans and Drosophila datasets, respectively.

Analysis details

Feature filtration and normalization

For any given analysis, we filtered out any features that were constant across the dataset or contained any ‘special’ values (e.g., due to applying a method that is inappropriate for the data, such as fitting a positive-only distribution to data that are not positive only, or attempting to fit a model to the data that does not converge, etc.). Due to this filtering, a different number of total features will be usable for a given dataset, depending on its properties.
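As a rough illustration of this filtering step, the Python sketch below keeps only the feature columns that are finite for every time series and not constant across the dataset. It assumes that ‘special’ values are represented as NaN or infinity, which is an assumption of this sketch; the function name is hypothetical.

```python
import math


def filter_features(matrix):
    """Return the indices of feature columns to keep.

    `matrix` is a list of rows (one per time series), each a list of feature
    values (one per feature). A column is dropped if it contains any NaN or
    infinite value, or if it is constant across all time series.
    """
    n_cols = len(matrix[0])
    keep = []
    for j in range(n_cols):
        col = [row[j] for row in matrix]
        if any(not math.isfinite(v) for v in col):
            continue  # contains a 'special' value for at least one time series
        if max(col) == min(col):
            continue  # constant across the dataset, so uninformative
        keep.append(j)
    return keep
```

Because the test is applied to the dataset at hand, the set of surviving columns changes from dataset to dataset, which is why the number of usable features differs between analyses.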

When searching for discriminative individual features, we did not normalize or rescale feature values, so that results can be interpreted in the natural scale of each feature. However, when computing the principal components of a dataset, or learning a classifier in the full feature space, we normalized each feature to the unit interval using a scaled robust sigmoid function [11] (the equation image could not be recovered from this copy), where $\hat{f}$ represents the normalized feature values across a time-series dataset, $f$ is the vector of un-normalized feature values, $m_f$ is the median of $f$, and $r_f$ is its interquartile range.
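For orientation, one plausible form of such an outlier-robust normalization is sketched below in Python: a logistic sigmoid centered on the median and scaled by the interquartile range, followed by a linear rescaling to the unit interval. The exact functional form, and in particular the constant 1.35 (chosen here because the interquartile range of a Gaussian is roughly 1.35 standard deviations), are assumptions of this sketch and should be checked against [11] and the hctsa source. The function names are hypothetical, and the quartiles come from Python's `statistics.quantiles`, which may differ slightly from Matlab's definition.

```python
import math
import statistics


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def scaled_robust_sigmoid(f, iqr_scale=1.35):
    """Map feature values to [0, 1] using the median and interquartile range.

    Assumes the feature is not constant and has a non-zero interquartile range
    (constant features are taken to have been filtered out beforehand).
    """
    m_f = statistics.median(f)
    q1, _, q3 = statistics.quantiles(f, n=4)
    r_f = q3 - q1
    s = [_sigmoid((v - m_f) / (r_f / iqr_scale)) for v in f]
    lo, hi = min(s), max(s)
    return [(v - lo) / (hi - lo) for v in s]  # linear rescale to the unit interval
```

Because the median and interquartile range are insensitive to extreme values, a single large outlier saturates near 1 without squeezing the bulk of the feature values into a narrow band, which a plain min-max rescaling would do.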

Classification

For multi-class classification, we trained linear support vector machine classifiers in Matlab 2015b (a product of The MathWorks, Natick, MA) using the fitcecoc function with a linear kernel SVM. To compare single univariate features, we used simple linear discriminant analysis (using classify). When training SVM classifiers, we weighted each observation, x, as the inverse probability of its class label across the dataset to account for class imbalance.
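The observation weighting described above can be illustrated with a short Python sketch that assigns each observation the inverse of its class's empirical probability. The function name is hypothetical, and the way the weights are passed to the classifier is not shown.

```python
from collections import Counter


def inverse_class_probability_weights(labels):
    """Weight each observation by 1 / P(class), with P estimated from the labels."""
    counts = Counter(labels)
    n = len(labels)
    return [n / counts[y] for y in labels]
```

With these weights the summed weight of every class is the same, so a rare class such as unc-9 (20 of 226 time series, weight 11.3 each) carries as much total weight in training as a common one such as N2 (100 of 226, weight 2.26 each).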

Due to imbalance of observations across the multiclass classification problems investigated here, we report balanced classification accuracy, $C_{\mathrm{bal}}$, over $m$ classes in terms of the number of correctly identified examples of a given class, $t_i$, and the total number of examples of each class, $c_i$, as

$$C_{\mathrm{bal}} = \frac{1}{m} \sum_{i=1}^{m} \frac{t_i}{c_i}.$$

Balancing the accuracy in this way ensures that all classes contribute equally to the classification statistic.
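A minimal Python sketch of this statistic, read here as the mean over classes of the fraction of each class that is correctly identified (t_i / c_i), is given below. The function name is hypothetical.

```python
def balanced_accuracy(true_labels, predicted_labels):
    """Mean over classes of (correctly identified in class) / (total in class)."""
    classes = sorted(set(true_labels))
    total = 0.0
    for c in classes:
        c_i = sum(1 for y in true_labels if y == c)
        t_i = sum(
            1 for y, p in zip(true_labels, predicted_labels) if y == c and p == c
        )
        total += t_i / c_i
    return total / len(classes)
```

For example, with eight examples of class A and two of class B, a classifier that always predicts A has a raw accuracy of 80% but a balanced accuracy of only 50%, which is the chance level for two classes.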

Datasets

Caenorhabditis elegans movement speed

Movement speed time-series data were extracted from approximately 15 min videos recorded using tracking microscopes [5, 10] (see wormbase.org for more information). A total of 226 movement time series sampled at 30.03 Hz were obtained from the CB4856 (Hawaiian wild isolate, 29 time series) and N2 (lab strain, 100 time series) strains, and the mutants dpy-20 (34 time series), unc-9 (20 time series), and unc-38 (43 time series). For the dpy-20, unc-9, and unc-38 knockouts and the CB4856 strain, all available data at the specified frame rate were used. For the wildtype N2 strain, we took a random sample of 100 of the 1200 time series recorded at a sampling rate of 30.03 Hz. A time series was retained, with its section of missing data removed, only if the missing data made up less than 15% of its length and formed a contiguous block at the beginning or end of the recording; otherwise, the time series was removed.
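The missing-data rule can be sketched in Python as follows. This is an interpretation rather than the authors' code: missing samples are assumed to be encoded as NaN, a series with missing blocks at both ends is accepted (the text does not say how that case was handled), and the function name is hypothetical.

```python
import math


def trim_missing(x, max_missing_fraction=0.15):
    """Return the series with leading/trailing missing values removed, or None
    if the series should be discarded."""
    is_missing = [isinstance(v, float) and math.isnan(v) for v in x]
    n_missing = sum(is_missing)
    if n_missing == 0:
        return list(x)  # nothing to trim
    if n_missing / len(x) >= max_missing_fraction:
        return None  # too much of the recording is missing
    start = 0
    while is_missing[start]:
        start += 1
    end = len(x)
    while is_missing[end - 1]:
        end -= 1
    if any(is_missing[start:end]):
        return None  # missing data in the interior of the recording
    return list(x[start:end])
```

Returning `None` for rejected series lets a caller drop them with a simple filter while keeping the trimmed versions of the rest.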

Drosophila melanogaster movement speed

We analysed time series of the movement speed of flies restricted to a one-dimensional tube and tracked continuously for between 3 and 6 days using video tracking [14, 18]. Movement speed was estimated as the maximum speed of the measured data (sampled at approximately 2 Hz) in each non-overlapping 10 s time window, where displacements are measured as the Euclidean distance between successive coordinates of the fly. The resulting movement speed time series analyzed here are therefore sampled at a rate of 0.1 Hz. Time series were split into 12 h segments and labeled as either ‘day’ (light on, 574 time series) or ‘night’ (light off, 574 time series), and as either ‘male’ (554 time series) or ‘female’ (594 time series).
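The speed estimate described above can be sketched in Python as follows. The sketch assumes positions sampled at exactly 2 Hz (so 20 displacements per 10 s window) and expresses speed as distance per second; both are assumptions, as are the function and parameter names.

```python
import math


def windowed_max_speed(coords, sample_rate_hz=2.0, window_s=10.0):
    """Maximum speed in each non-overlapping window.

    `coords` is a sequence of (x, y) positions sampled at `sample_rate_hz`.
    Speed between successive samples is the Euclidean displacement divided by
    the sampling interval. Trailing samples that do not fill a window are dropped.
    """
    dt = 1.0 / sample_rate_hz
    speeds = [math.dist(p, q) / dt for p, q in zip(coords, coords[1:])]
    per_window = int(round(window_s * sample_rate_hz))
    n_windows = len(speeds) // per_window
    return [
        max(speeds[k * per_window:(k + 1) * per_window]) for k in range(n_windows)
    ]
```

With one output value per 10 s window, the resulting series has a sampling rate of 0.1 Hz, and a 12 h segment contains 4320 values.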

Acknowledgements

We thank Andre Brown and Bertalan Gyenes for sharing the C. elegans movement dataset, and helpful feedback on the resulting analysis and manuscript. We thank Giorgio Gilestro and Quentin Geissmann for sharing the Drosophila movement dataset, and helpful feedback on the resulting analysis. Many thanks to Rachael Fulcher for help with graphic design, and to Alex Fornito and Iain Johnston for useful feedback on the manuscript.

References

[1] G. Hripcsak and D. J. Albers. Next-generation phenotyping of electronic health records. J. Am. Med. Inform. Assoc. 20, 117 (2013).
[2] T. Insel, B. Cuthbert, M. Garvey, et al. Research Domain Criteria (RDoC): Toward a new classification framework for research on mental disorders. Am. J. Psychiatry 167, 748 (2010).
[3] J. T. Vogelstein, Y. Park, T. Ohyama, et al. Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning. Science 344, 386 (2014).
[4] P. M. Nolan, J. Peters, M. Strivens, et al. A systematic, genome-wide, phenotype-driven mutagenesis programme for gene function studies in the mouse. Nat. Genet. 25, 440 (2000).
[5] A. E. X. Brown, E. I. Yemini, L. J. Grundy, T. Jucikas, and W. R. Schafer. A dictionary of behavioral motifs reveals clusters of genes affecting Caenorhabditis elegans locomotion. Proc. Natl. Acad. Sci. USA 110, 791 (2013).
[6] J. Kain, C. Stokes, Q. Gaudry, et al. Leg-tracking and automated behavioural classification in Drosophila. Nat. Comm. 4, 1910 (2013).
[7] J. T. Johnson, M. S. Hansen, I. Wu, et al. Virtual histology of transgenic mouse embryos for high-throughput phenotyping. PLoS Genet. 2, e61 (2006).
[8] H. Gates, A.-M. Mallon, and S. D. M. Brown. High-throughput mouse phenotyping. Methods 53, 394 (2011).
[9] B. Yang, J. B. Treweek, R. P. Kulkarni, et al. Single-cell phenotyping within transparent intact tissue through whole-body clearing. Cell 158, 945 (2014).
[10] E. Yemini, T. Jucikas, L. J. Grundy, A. E. X. Brown, and W. R. Schafer. A database of Caenorhabditis elegans behavioral phenotypes. Nat. Methods 10, 877 (2013).
[11] B. D. Fulcher, M. A. Little, and N. S. Jones. Highly comparative time-series analysis: The empirical structure of time series and their methods. J. Roy. Soc. Interface 10, 20130048 (2013).
[12] M. Costa, A. L. Goldberger, and C. K. Peng. Multiscale entropy analysis of biological signals. Phys. Rev. E 71, 021906 (2005).
[13] A. Gomez-Marin, G. J. Stephens, and A. E. X. Brown. Hierarchical compression of C. elegans locomotion reveals phenotypic differences in the organisation of behaviour. bioRxiv p. 029462 (2015).
[14] G. F. Gilestro. Video tracking and analysis of sleep in Drosophila melanogaster. Nat. Protoc. 7, 995 (2012).
[15] R. E. Isaac, C. Li, A. E. Leedale, and A. D. Shirras. Drosophila male sex peptide inhibits siesta sleep and promotes locomotor activity in the post-mated female. Proc. R. Soc. Lond. B 277, 65 (2010).
[16] B. D. Fulcher, A. E. Georgieva, C. W. G. Redman, and N. S. Jones. Highly comparative fetal heart rate analysis. 34th Ann. Int. Conf. IEEE EMBC pp. 3135–3138 (2012).
[17] B. D. Fulcher and N. S. Jones. Highly comparative feature-based time-series classification. IEEE Trans. Knowl. Data Eng. 26, 3026 (2014).
[18] N. Donelson, E. Z. Kim, J. B. Slawson, et al. High-resolution positional tracking for long-term analysis of drosophila sleep and locomotion using the “tracker” program. PLoS ONE 7, e37250 (2012).
Posted October 17, 2016.