Electrocardiogram feature extraction and interval measurements using optimal representative cycles from persistent homology

Cardiovascular diseases are among the leading causes of death, and their early detection and treatment is important for lowering their prevalence and mortality rate. Electrocardiograms (ECGs) record electrical activity of the heart to provide information used to diagnose and treat various cardiovascular diseases. Many approaches to computer-aided ECG analysis have been performed, including Fourier analysis, principal component analysis, analyzing morphological changes, and machine learning. Due to the high accuracy required of ECG-analysis software, there is no universally-agreed upon algorithm to identify P,Q,R,S, and T-waves and measure intervals of interest. Topological data analysis uses tools from algebraic topology to quantify hole-like shapes within data, and methods using persistence statistics and fractal dimension with machine learning have been applied to ECG signals in the context of detecting arrhythmias within recent years. To our knowledge, there does not exist a method of identifying P,Q,S, and T-waves and measuring intervals of interest which relies on topological features of the data, and we propose a novel topological method for performing these aspects of ECG analysis. Specifically, we establish criteria to identify cardinality-minimal and area-minimal 1-cycles with certain properties as P,Q,S, and T-waves. This yields a procedure for measuring the PR-interval, QT-interval, ST-segment, QRS-duration, P-wave duration, and T-wave duration in Lead II ECG data. We apply our procedure to 400 sets of simulated Lead II ECG signals and compare with the interval values set by the model. Additionally, the algorithm is used to identify cardinality-minimal and area-minimal 1-cycles as P,Q,S, and T-waves in two sets of 200 randomly sampled Lead II ECG signals of real patients with 11 common rhythms. Analysis of optimal 1-cycles identified as P,Q,S, and T-waves and comparison of interval measurements shows that 1-cycle reconstructions can provide useful information about the ECG signal and could hold utility in characterizing arrhythmias. Author summary Topological data analysis (TDA) has been a rapidly growing field within the past 15 years and has found applications across many fields. In the context of TDA, several algorithms primarily using persistence barcode statistics and machine learning have been applied to electrocardiogram (ECG) signals in recent years. We use a topological data-analytic method to identify subsets of an ECG signal which are representative of certain topological features in the ECG signal, and we propose that those subsets coincide with the P,Q,S, and T-waves in the ECG signal. We then use information about these subsets of the signal identified as P,Q,S, and T-waves to measure the PR-interval, QT-interval, ST-segment, QRS-duration, P-wave duration, and T-wave duration. We demonstrate our method on both simulated and real Lead II ECG data. These results show how identifying subsets of an ECG signal with certain topological properties could be used in analyzing the morphology of the signal over time and in arrhythmia-detection algorithms.

Cardiovascular diseases are among the leading causes of death due to their high 2 prevalence and mortality rate [1] [2] [3]. Electrocardiograms (ECGs) provide a 3 non-invasive measure of the heart's electrical activity and are used in diagnosing and 4 managing various cardiovascular diseases. Thus the analysis of ECGs is important for 5 accurate diagnosis and proper treatment of cardiovascular diseases. Several approaches 6 to automated ECG analysis have been performed, including machine 7 learning [4] [5] [6] [7] [8] [9] [10] [11], wavelet transforms [12] [13] [14] [15] [16], and 8 persistent homology [17] [18] [19] [20] [21] [22]. Due to the high accuracy required of 9 ECG-analysis software and the fact that the bulk of ECG analysis is carried out by 10 healthcare providers, the development of algorithms that identifies P,Q,R,S, and 11 T-waves, measures intervals of interest, and/or detects arrhythmias is an active area of 12 research. 13 Topological data analysis (TDA) is concerned with the study of shapes constructed 14 from a dataset which are invariant under continuous deformations such as stretching 15 and twisting. Applications of TDA to ECG signals have used persistence 16 statistics [20] [18], fractal dimension [18], and machine learning [17] [19]. Using cycle 17 reconstructions has shown utility in various applications outside of ECG analysis such 18 as analyzing structures on the atomic scale [23] and in structural engineering [24]. To 19 our knowledge, nobody has performed an "inverse" analysis of using information from 20 cycle reconstructions to analyze ECG signals. We focus our attention on the "rhythm 21 lead", i.e. Lead II, and for the remainder of the paper, any reference to an ECG signal, 22 whether simulated or real, refers to a Lead II ECG signal. 23 Suppose we are given an n-dimensional dataset, that is, a set of points in R n . There 24 are n distinct types of topological features associated with that dataset: roughly, these 25 "types" correspond to what one would intuitively regard as "holes" or "gaps" of differing 26 dimensions in the dataset. To apply these methods to ECG data, we treat an ECG 27 signal as two dimensional with one temporal dimension and one amplitude dimension 28 rather than one dimensional with a specified sampling frequency. Consequently, from 29 this perspective, there are two distinct types of topological features associated with 30 ECG data: 0-dimensional homology features represent equivalence classes of connected 31 components, while 1-dimensional homology features represent equivalence classes of 32 non-contractible loops. The 0-dimensional homology features are useful in analyzing 33 clustering phenomena and have recently shown utility as a metric of heart rate 34 variability when applied to ECG signals [21] [22]. 35 In this paper, our focus is instead on the 1-dimensional homology features, which we 36 use in a novel way to analyze ECG data. Specifically, we identify subsets of an ECG 37 signal as P,Q,S, and T-waves by considering these subsets to be representative cycles of 38 1-dimensional homology features of the signal with certain properties and then use these 39 subsets identified as individual waves to measure the PR-interval, QT-interval, 40 ST-segment, QRS-duration, P-wave duration, and T-wave duration. To illustrate the 41 intuition behind 1-dimensional homology features and their representative cycles, we 42 first present an example which demonstrates how the 1-dimensional homology features 43 of a simple 2-dimensional dataset is related to the shape of the data in the "Example: a 44 1-dimensional topological feature in a simple dataset" section. This section describes 45 some geometric intuition of 1-dimensional homology features without going into linear 46 algebra details. These details are then given and the ideas from this section are 47 formalized in the "Background on topological data analysis" section, where we define 48 key terms in algebraic and computational topology to provide minimal background and 49 establish terms that will be used in identifying features of ECG signals. Example: a 1-dimensional topological feature in a simple dataset 52 Consider the set of points in the Cartesian plane R 2 shown in Fig 1A. Then consider a 53 circle drawn around each datum, each with the same radius r. The geometric Čech 54 complex of radius r is defined as the union of the interiors of these circles. It is a subset 55 of R 2 , and as r increases, the geometric Čech complex of radius r is a larger and larger 56 subset of R 2 . Fig 1B- 2 , respectively. F: Persistence diagram: each blue triangle represents a 1-dimensional homological feature, i.e. an equivalence class of non-contractible loops, with coordinates (birth radius, death radius); the y=x line is drawn to depict the persistence distribution of all 1-dimensional homological features over a set of r values by noting that the more persistent a 1-dimensional homological feature is, i.e. the larger the difference between its death radius and birth radius, the more 'above' the y=x line it will be.
Recall that if we can draw a non-contractible loop within the geometric Čech 70 complex of radius r for some r > 0, then this loop must be "stuck" around some void 71 encircled by the geometric Čech complex. This non-contractible loop can be 72 continuously deformed to construct another non-contractible loop "stuck" around the 73 same void. We say the two loops are homotopic. As an example, the red and green 74 loops shown in Fig 1D are homotopic. The set of all possible non-contractible loops 75 "stuck" around some void encircled by the geometric Čech complex forms an equivalence 76 class of non-contractible loops, i.e. a set of non-contractible loops where any two 77 non-contractible loops in the set are homotopic. In practice, rather than homotopy, we 78 use a weaker but more technically-involved equivalence relation on loops called homology 79 to utilize efficient algorithms such as Ripser [25] and GUDHI [26] in computing 80 topological features. For a rigorous treatment of homotopy and homology, see [27].

82
Suppose we are given a dataset X and a positive real number r. We writeC r (X) for 83 the geometric Čech complex of X of radius r. For a given two-dimensional dataset such 84 that there exists a non-contractible loop ℓ within its geometric Čech complex of radius 85 r, we define the birth radius of ℓ as the smallest real number b such that some loop in 86 C r (X) which is equivalent to ℓ and which is contained in the subsetC b (X) ofC r (X) 87 exists. It follows immediately from this definition that b ≤ r. Similarly, we define the 88 death radius of ℓ as the smallest real number d such that r ≤ d and such that ℓ is 89 contractible when regarded as a loop inC d (X). That is, the birth radius of a 90 non-contractible loop is the smallest radius at which the equivalence class of that The data comprising the larger square in the example dataset were included to 98 compare two 1-dimensional homology features of different persistence within a given 99 2-dimensional dataset. As previously noted, the birth radius and death radius of the Background on topological data analysis 124 We now set out to formalize the notion of "equivalence classes of non-contractible loops 125 that persist for a given range of radius values." Given a set of data X represented as a 126 finite set of points in R 2 , we construct a simplicial complex as a topological space that 127 approximates the structure of the data.

128
Definition 0.1 A simplicial complex is a collection K of subsets of a finite set V such 129 that: An element of V is referred to as a vertex, and an element of K with cardinality n + 1 133 is referred to as an n-simplex.

134
There are several ways to construct a simplicial complex given a finite set of points 135 in R 2 , and to be consistent with the geometry of the simple dataset example, we  Definition 0.2 Given a dataset X represented as a finite subset of R 2 , and given a 143 positive real number r, the radius r Vietoris-Rips complex of X, denoted V R r (X), is the 144 simplicial complex given by the collection of all subsets U of X with the property that if 145 Thus the radius r Vietoris Rips complex of a finite 148 subset of R 2 defines a simplicial complex.

150
We are now in a position to be more concrete about the notion of an "equivalence 151 class of non-contractible loops" within the geometric Čech complex, as discussed in the 152 "Example: a 1-dimensional topological feature in a simple dataset" section. By an 153 "equivalence class of non-contractible loops," we are referring to an element of the 154 1-dimensional homology group of some radius r Vietoris-Rips complex, which we now 155 set out to define.

157
Let X be a finite subset of R 2 , let r be a positive real number, and let C n be the 158 vector space over F 2 with basis consisting of the elements of V R r (X) of cardinality 159 n + 1 for n = 0, 1, 2. Furthermore, suppose there is an ordering on V R r (X). Consider 160 and 161 x i indicates that x i is omitted from the ordered simplex. The elements of C 1 are 162 referred to as 1-chains, the elements of ker(δ 0 ) are referred to as 1-cycles, and elements 163 of im(δ 1 ) are referred to as 1-boundaries. Since δ 0 (δ 1 (v)) = 0 for all v ∈ C 2 , every 164 1-boundary is an 1-cycle. However, it is not necessarily true that every 1-cycle is an 165 1-boundary. Intuitively, if we think of X as a cloud of points in the plane, the 166 1-dimensional homology group of V R r (X) is defined such that its dimension over F 2 167 counts the number of "holes" in that cloud.

168
Definition 0.3 Given r > 0 and V R r (X) where X is a finite subset of R 2 , we follow 169 the construction of F 2 -vector spaces C 0 , C 1 , C 2 and linear transformations δ −1 , δ 0 , δ 1 , 170 δ 2 as outlined above and define the first homology group of V R r (X) as the quotient By increasing r, we create a sequence of Vietoris-Rips Complexes where where V R ri (X) is a proper subset of V R rj (X) for i < j and i 0 , i 1 ,...,i m−1 are inclusion 177 homomorphisms. This induces a sequence of , and all n = 0, 1, ..., m − 1. 179 We now give a name to the "smallest" and "largest" r > 0 such that a given 1-cycle thorough treatment of persistent homology, see [29].

195
In this section, we first describe how ECG data are processed to be in a form such that 196 our topological approach can be applied consistently. We then describe our topological 197 method of identifying P,Q,S, and T-waves and measuring the PR-interval, 198 ST-segment, QRS-duration, P-wave duration, and T-wave duration. Lastly, we present 199 the methods used to simulate ECG signals with the PR-interval, QT-interval, Before an ECG signal can be analyzed using our topological approach, the signal is 204 processed such that the topological approach can be consistently applied. An outline of 205 this processing pipeline is depicted in Fig 3. This processing pipeline is the same 206 regardless of whether the ECG signal is simulated or real.
is a local maximum of the signal, 215 and there are exactly i − 1 local maxima f (t R,j ) such that t R,j < t R,i and 0.6 < f (t R,j ), 216 j = 1, 2, ..., i − 1 [30]. Next, an isoelectric line is incorporated into the signal by 217 mapping (t, f (t) M ) to (t, baseline) where baseline is the median value of the set 218 {f * (t) | t ∈ D and f * (t) is f (t) rounded to the nearest hundredth} for all t ∈ D such 219 that t = h(2k) for some integer k. Lastly, we consider the restriction of the processed 220 ECG signal such that the signal begins and ends with an R-wave peak as the input of 221 our topological method.

222
Extracting features of processed ECG signals using TDA  • Time coordinate of centroid of 1-cycle: It is a property of ECG signals that 244 a T-wave follows a QRS-complex. We also expect a detected P-wave to necessarily 245 precede a QRS-complex due to our simulations not featuring atrial activity with 246 missed ventricular beats. Thus we expect that the centroids of 1-cycles comprising 247 a T-wave and P-wave lie between a pair of adjacent R-waves such that the T-wave 248 is some distance after the preceding R-wave, the P-wave is some distance prior to 249 the following R-wave, and the P-wave follows the T-wave. Similarly, we impose 250 that a 1-cycle comprising a detected Q-wave and S-wave slightly precedes an 251 R-wave and shortly follows an R-wave, respectively.  Conversely, it is expected that the average amplitude of the signal in the Q,S-wave 257 be less than the signal amplitude at baseline.

258
• Optimality: Of the 1-cycles satisfying the persistence, birth filtration, and 259 centroid coordinate conditions above, the 1-cycle representing the given P,Q,S, or 260 T-wave is expected to be optimal in the sense of i) having a minimal number of data points comprising the 1-cycle and/or ii) having a minimal-area convex hull. 262 These optimal cycles were computed using HomCloud [31].

Fig 4. Illustration of properties of an ECG used to detect P,Q,S, and T-waves
A: The red loops illustrate elements from equivalence classes of non-contractible loops corresponding to the P,Q,S, and T-wave. Notice that without the addition of the isoelectric baseline, we would be unable to draw non-contractible loops corresponding to the P,Q,S, and T-waves within the geometric Čech complex. B: Example of output without imposing an upper bound on the birth filtration of a 1-cycle identified as a P-wave.
We now present the specific criteria that identifies 1-cycles with certain properties 264 related to those described above as being a P-wave or T-wave. We note that differences 265 in the classification criteria for P-waves and T-waves involves changes in the bounds of 266 the persistence range and changes to the upper bound of the birth filtration.

267
Additionally, the centroid of a 1-cycle identified as a P-wave is expected to be closer to 268 the subsequent R-wave than the preceding R-wave, and vice versa for the centroid of a 269 1-cycle identified as a T-wave.  • the birth filtration of the 1-cycle is less than 0.04 R_peak_time_coordinate[i] + 0.5 * RR_interval for some i. 287 We continue with describing the criteria for a 1-cycle to be identified as a Q-wave or 288 S-wave. The interval which the persistence of a 1-cycle must belong to for the 1-cycle to 289 be classified as a Q-wave or S-wave includes a smaller lower bound and smaller upper 290 bound compared to the interval of valid persistence values for a 1-cycle to be classified 291 as a P,T-wave. This is due to the data comprising Q-waves and S-waves having less of a 292 loop-like structure than data comprising the P-waves and T-waves. We also impose that 293 the centroid of a 1-cycle identified as a Q-wave or S-wave more closely precede or follow 294 February 1, 2022 9/17 the nearby R-wave peak relative to a P-wave or T-wave, respectively. • the birth filtration of the 1-cycle is less than 0.06 R_peak_time_coordinate[i] for some i.

308
• the birth filtration of the 1-cycle is less than 0.06 R_peak_time_coordinate[i] + 0.12 * RR_interval for some i.

312
In the above criteria, "optimal 1-cycle" refers to a 1-cycle computed as being minimal 313 with respect to i) the cardinality of the set of data points comprising the 1-cycle or ii) 314 the area spanned by the convex hull of the 1-cycle. Both types of optimal 1-cycles were 315 computed in our analysis and are compared. An example of area-minimal 1-cycles 316 identified as P,Q,S, and T-waves using the above criteria depicted on an ECG signal is 317 shown in Fig 5. Cardinality-minimal and area-minimal 1-cycles identified as P,Q,S, and 318 T-waves for the 400 simulated ECG signals is provided as Supplementary File 1 319 and Supplementary File 2. Additionally, cardinality-minimal and area-minimal 1-cycles 320 identified as P,Q,S, and T-waves for 200 Lead II ECG signals randomly sampled from a 321 pool of 10646 denoised signals with 11 common rhythms as described in [32] are shown 322 in Supplementary File 3 and Supplementary File 4, respectively. Example of simulated ECG signal and persistence diagram with area-optimal 1-cycles identified as P,Q,S, and T-waves The points comprising P,Q,S, and T-waves were identified as area-optimal 1-cycles with certain properties depending on their birth filtration, persistence, and centroid. The Q-wave and S-wave 1-cycles drawn have the left-most and right-most centroid of all Q-wave and S-waves detected within a given heartbeat, respectively. This is consistent with the choice of Q-waves and S-waves used to measure intervals of interest. The color of 1-cycles identified as P,Q,S, and T-waves matches the color of the corresponding points in the persistence diagram.
We now describe how the PR-interval, QT-interval, ST-segment, QRS-duration, 325 P-wave duration, and T-wave duration are measured using 1-cycles identified as P,Q,S, 326 and T-waves. To measure these intervals solely using topological features of the data, ECG signals were generated with a sampling frequency of 500Hz using an ECG 356 Simulator in MATLAB [33]. The set of parameters that completely characterizes what 357 we refer to as a simulated ECG signal includes the heart rate, the amplitude and 358 duration of the R-wave, and the amplitude, duration, and location with respect to i) the 359 subsequent R-wave for the P,Q-wave and ii) the preceding R-wave for the S,T-wave. We 360 now go into detail about how this parameter set is used to construct a simulated ECG 361 signal with a given PR-interval, QT-interval, ST-segment, QRS-duration, P-wave 362 duration, and T-wave duration. We now briefly describe the P,Q,R,S, and T-waves. The QRS complex consists of six 377 line segments, and the shape of the P and T-wave is dictated by a Fourier series [33]. 378 The amplitude and duration of the X-wave is given by the parameters 379 amplitude_X_wave and duration_X_wave, respectively, for X=P,Q,R,S,T. The 380 distance along the time-axis from the centroid of the ordered pairs comprising the 381 P,Q-wave to the coordinate pair of the following R-wave peak is given by the parameter 382 location_P,Q_wave. Similarly, the distance along the time-axis from the coordinate 383 pair of an R-wave peak to the centroid of the ordered pairs (t, f (t)) comprising the 384 subsequent S,T-wave is given by the parameter location_S,T_wave. By uniformly sampling the heart rate from [60,100] and the parameter values of the 405 P,Q,R,S, and T-waves from the ranges in Table 1 heart rate from one another and that the location of the P,Q,S, and T-waves are defined with respect to the subsequent R-wave in the case of P,Q-waves and to the preceding R-wave in the case of S,T-waves.

408
The PR-interval, QT-interval, ST-segment, QRS-duration, P-wave duration, and T-wave 409 duration of 1000 simulated ECG signals were measured using i) cardinality-minimal 410 1-cycle reconstructions and ii) area-minimal 1-cycle reconstructions and compared with 411 the interval measurements made by model parameters. For each simulation, the percent 412 difference between the model's measurement of each interval and the measurement of 413 each interval performed using optimal 1-cycle reconstructions was calculated. The 414 distribution of these measurements using cardinality-minimal and area-minimal 1-cycles 415 are shown in Fig 6 and Fig 7, respectively. The average percent difference and the 416 standard deviation of the percent difference measurements for each interval of interest 417 across the 1000 simulated ECG signals is shown in Table 2.   2.972 ± 9.136 The intervals of interest of 1000 simulated ECG signals were measured using parameters of the model, cardinality-minimal 1-cycle reconstructions, and area-minimal 1-cycle reconstructions as outlined in the Methods section. The reported percent difference values refer to the percent difference in interval measurements between the model's measurements and measurements made using optimal 1-cycles.
From comparing cardinality-minimal 1-cycles identified as P,Q,S, and T-waves to 419 area-minimal 1-cycles identified as P,Q,S, and T-waves in Supplementary File 1 and 420 Supplementary File 2, we see that these cycles are often "close" to one another, likely 421 resulting in the similar percent error distributions seen in Fig 6, Fig 7, and Table 2. It is 422 also evident from Supplementary File 1 and Supplementary File 2 that the algorithm 423 could be improved by using a more stable method to include an isoelectric baseline in 424 the signal. Additionally, to ensure that P,Q,S, and T-wave identifications and interval 425 measurements are not compromised due to misidentified or undetected R-wave peaks, a 426 more robust R-wave peak detection algorithm could be used such as those 427 in [34], [35], [36].