A Personalized and Adaptive Distribution Classification of Actigraphy Segments into Sleep-Wake States

Wearable actimeters have the potential to greatly improve our understanding of sleep in natural environments and in long-term experiments. Current technologies have served the sleep community well, but they have known weaknesses that introduce errors and can compromise the reliability and clinical relevance of the sleep and wakefulness profiles derived from these data. Newer data collection technologies, such as microelectromechanical systems (MEMS), offer opportunities to gather movement data in different forms and at higher frequencies, making new analytical methods possible and potentially advantageous. We have developed a novel statistical algorithm, the Wasserstein Algorithm for Classifying Sleep and Wakefulness (WACSAW), that is based on optimal transport statistics and uses MEMS data as its input. WACSAW groups data into segments with similar movement patterns and uses optimal transport methods to generate a Wasserstein profile for each segment. A second application of optimal transport methodology measures the difference between each segment profile and a hypothetical segment of idealized sleep. Characteristic functions derived from individual activity segments were then clustered to classify each segment as sleep or wakefulness. WACSAW was initially developed on a 6-person cohort and applied to an additional 16 independent participants. WACSAW returned >95% overall accuracy in sleep and wakefulness assignments validated against participant logs. Compared to the Actiwatch Spectrum Plus, WACSAW delivered a ∼10% improvement in accuracy, sensitivity, and selectivity and showed a reduced standard error between participants, indicating that WACSAW conforms to individualized data. In addition, we directly compared WACSAW to GGIR, a currently used algorithm designed to accept MEMS data. WACSAW showed an improvement in overall statistics and handles the time series and segmentation differently, which may contribute unique information to activity recordings.
Here, we provide a novel statistical approach to actimetry that improves sleep/wakefulness designations, adapts to individuals, provides interim metrics that aid interpretation, and is open source for modification.

Author summary

Wearables are an emerging class of technologies that have the potential to provide real-time, important information to individuals based on their unique biological and behavioral makeup. For 40 years, actimetry has been applied to identifying sleep and wakefulness in natural living environments, because sleep changes between laboratory and home settings. Yet these valuable analyses have weaknesses that may compromise accuracy. Newer actimetry wearables can collect high-frequency movement data and offer an opportunity for different types of analyses. We have developed a novel algorithm called the Wasserstein Algorithm for Classifying Sleep and Wakefulness (WACSAW). WACSAW employs optimal transport statistics to compare movement variation between different time segments and classify each segment as sleep or wakefulness. WACSAW produces >95% accuracy across the 24-hr day for both sleep and wakefulness categorization and classifies behavior more accurately than the Actiwatch Spectrum Plus. In addition, WACSAW produces interim metrics that can be used to assess the reliability of the output, and it requires no human intervention to run. WACSAW may enable observations in daily-living situations to determine factors that alter sleep, understand variation in sleep, and set the stage for more home-based diagnosis and disease identification.


Inadequate sleep is increasingly linked to numerous harmful diseases. Insufficient sleep has been linked to 7 of the 15 leading causes of death in the U.S. [1] and is highly correlated with diabetes [2,3], cardiovascular disease [4-6], cognitive impairment [7], inflammation and immune dysfunction [8-10], and other possible health complications, including earlier death [5,11-15]. Many of these studies are based on self-report, and subjective sleep reporting can be prone to errors [16,17]. Unfortunately, sleep deprivation continues to be a pervasive problem, with sleep time potentially decreasing and individuals choosing other activities over sleep [18,19], which can reduce overall public health. One confound to laboratory approaches is that people sleep differently in real-world environments than in a lab setting [20], and the populations in which these effects have been demonstrated are typically kept small to accommodate laboratory space and time. Thus, objective recording in real-world environments would provide practical insight into how natural sleep varies across individuals, manifests in clinical populations, and relates to increased likelihood of health disorders.

Wearables, such as wrist actimeters, have been employed by the sleep field for the last 40 years to gather objective sleep data in the natural environment. They have emerged as a practical method for determining sleep parameters, especially in long-term or field experiments where polysomnography (PSG), the gold standard for sleep detection, is impractical. In 2007, the American Academy of Sleep Medicine determined that actigraphy was accurate enough to determine total sleep time for both healthy individuals and individuals with sleeping disorders [21].
Thus, there is a general interest in having an automated system that can be deployed widely across the population and does not require a large expense of human effort.

Despite the successes of current actimetry, there remain acknowledged weaknesses in current data collection and analysis that limit the utility of these algorithms. Two of the most commonly employed approaches, the Sadeh [22] and Cole-Kripke [23] algorithms, were developed using regression methods. While these approaches have provided valuable insights into sleep biology, regression methodologies optimize results for population statistics, while some individuals may substantially deviate from the overall average; this distorts reliable interpretation for every individual without an objective way to validate sleep and wakefulness assignments. Additionally, these regression-based algorithms have difficulty identifying quiet wakefulness [24], possibly due to sampling rate or computational or statistical power. On the whole, these algorithms have provided useful data for understanding variations in naturalistic sleep and these

Mitter, Respironics Inc., Bend, Oregon, USA) for two days and keep a similar detailed activity log as in the first protocol. There was a subset of individuals that participated in both protocols, as well as some that completed only one or the other. Participants were instructed to wear the watch(es) on their nondominant wrist 24 hours a day. The activity log was completed either in a Qualtrics survey accessed through a mobile device or by maintaining an Excel spreadsheet. Participants were asked to reflect on activities at least 2 times per day but were encouraged to report as often as possible to increase accuracy in the reports. Surveyed data included start time of activity, end time of activity, activity level, alertness, and ambient light conditions.
Temperature and activity measurements from the GENEActiv actimeter were used to validate watch-on compliance.

One concern with detailed activity logs is that they can be subjective and reflect an estimate of start and stop rather than exact times of activity transitions [37]. This human error can impact the validation metrics. Thus, we compared WACSAW output to participant logs using two methods. The first uses the raw activity log data provided by the participant (referred to as the 'Raw' data comparison). The second involved a research assistant manually adjusting the start and end times according to the following criterion: there was a sudden change in the volatility of a participant's movement within ±20 minutes of the participant indicating a change in activity. In such circumstances, the start and end point of the activity were adjusted to this timepoint. Adjustments were also made if a participant noted an activity during a reported interval but gave no precise time. For example, if they stated they got up to check on an animal in the night without providing a time, we assumed a small period of movement during that night represented their logged activity. We refer to these data as the adjusted log or adjusted data (Adj), and these data are analyzed independently from the raw data. Both sets of data are presented in this report, though there are only minor differences.

Comparison to GGIR

Our study includes a comparison with the method introduced by van Hees et al. [38]. Whenever two significant posture changes occur within a time period shorter than the predefined time threshold, they are considered part of the same active period; if these shifts are closer in time than the specified threshold, the segments are merged together, akin to linking activity segments and indicating continued activity. A posture shift is defined based on the instantaneous intensity of the movement, contrasting with WACSAW, which analyzes local volatility. When more than one such posture change is detected, the method identifies periods where these changes occur less frequently than the specified time threshold. If such periods exist, they are classified as periods of no significant posture change, implying sedentary behavior. The selection of the time and angle thresholds are crucial hyperparameters in this approach. van Hees et al. [38] found that a time threshold of 5 minutes and a tilt angle threshold of 5 degrees generally yielded the most reliable results. For our comparison, we used these same values [38].

activity records with challenging situations. These participants were used for all development sections, and the algorithm was validated on a larger, independent cohort after development.
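The time-threshold merging rule described above can be sketched as follows. This is a schematic illustration of the logic, not GGIR's actual implementation; the function name and data are ours.

```python
# Sketch of the posture-change merging rule: posture changes closer
# together than a time threshold are grouped into one active period.
# Times are in seconds; 300 s (5 min) follows van Hees et al.

def merge_posture_changes(change_times, gap_threshold=300.0):
    """Group posture-change timestamps into active periods; gaps longer
    than gap_threshold separate periods (implying sedentary behavior)."""
    if not change_times:
        return []
    periods = []
    start = prev = change_times[0]
    for t in change_times[1:]:
        if t - prev <= gap_threshold:
            prev = t              # still the same active period
        else:
            periods.append((start, prev))
            start = prev = t      # a new active period begins
    periods.append((start, prev))
    return periods

# Example: three changes within 5 min of each other, then one isolated change
print(merge_posture_changes([0, 120, 360, 2000]))
# -> [(0, 360), (2000, 2000)]
```

Gaps longer than the threshold thus delimit periods of no significant posture change.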

The GENEActiv device uses MEMS to record components of gravitational force changes in perpendicular x, y, and z directions, where x and y are in the plane of the face of the watch and z is perpendicular to that plane. Data were recorded at a 10 Hz frequency, resulting in a representative movement time series (Fig. 1). It has been determined that movement away from the direction of gravity sufficiently captures meaningful movement in all 3 directions and can be captured with the tilt angle time series [28]. The tilt angle reduces the dimension of the data stream needed for developing the algorithm. Thus, data from the 3 directions were merged to determine the tilt angle (θ_t) by the following formula (Eq. 1), where x_t, y_t, and z_t are the force changes in the x, y, and z axes at time t (Fig. 1B). The tilt angle preserves areas of high and low volatility. In addition, there are periods in between high and low volatility (orange label, Fig. 1) that represent quiet wakefulness according to the activity logs.

Step 2: Segmentation

Higher frequency data captured by the MEMS sensors provide more opportunity to detect rapid changes in the tilt angle (Fig. 2A). To isolate changes in the tilt angle irrespective of the absolute magnitude of force, we took the first difference of the tilt angle (Eq. 2), which is the difference between adjacent observed angles (Fig. 2B):

∆θ_t = θ_t − θ_{t−1}. (2)

Periods of high activity (and thus high volatility) produce distributions of first differences with larger magnitude compared to periods of lower activity. Notably, distributions of the first difference with varying characteristics can be observed. To emphasize such differences and to distinguish between states, we need a sensitive yet robust method for comparing distributions. Going forward, we will refer to the first difference of the tilt angle as "movement" and the distribution of the first difference of the tilt angle as the "movement distribution."

Fig 2. (B) Conversion of adjacent points of the raw tilt angle series from A (∆θ_t) isolates the movement variability from the intensity level. The orange background indicates an intense period of movement, the green indicates more quiescent wakefulness, and blue indicates sleep according to the activity logs. (C-E) The first differences within these 5-min periods were plotted as histograms in the corresponding colors to visualize the differences in movement. Note the similar nature of quiescent wakefulness (D) and sleep (E).
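The tilt-angle and first-difference steps can be sketched as follows. The arctan2 convention used for the tilt angle here is an assumption for illustration (the paper's Eq. 1 defines the exact form); the data are simulated, not GENEActiv recordings.

```python
import numpy as np

# Illustrative tilt-angle and "movement" computation. The arctan2
# convention (angle of the z axis against the watch-face plane) is an
# assumed form chosen for this sketch.

def tilt_angle(x, y, z):
    """Tilt angle (degrees) from triaxial MEMS force components."""
    return np.degrees(np.arctan2(z, np.sqrt(x ** 2 + y ** 2)))

def movement(theta):
    """First difference of the tilt angle (Eq. 2)."""
    return np.diff(theta)

rng = np.random.default_rng(0)
n = 3000  # 5 minutes of data at 10 Hz
# Quiet wrist: a constant orientation plus small jitter
quiet = tilt_angle(0.02 * rng.standard_normal(n),
                   0.02 * rng.standard_normal(n),
                   1.0 + 0.02 * rng.standard_normal(n))
# Active wrist: orientation changes freely
active = tilt_angle(*rng.standard_normal((3, n)))

# High-activity periods yield first differences of much larger magnitude
print(np.std(movement(quiet)) < np.std(movement(active)))
```

The comparison at the end mirrors the observation in the text: active periods produce movement distributions with larger spread than quiescent ones.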

The Wasserstein metric, a scalar measure based on optimal transport theory, is well suited to distinguishing subtle distributional changes. The more two distributions differ from one another, the higher the Wasserstein distance between them. Mathematically, the Wasserstein distance is given by (Eq. 3)

W_p(x, z) = ( (1/n) Σ_{k=1}^{n} |x_k − z_k|^p )^{1/p}, (3)

where x and z are two time periods of the same length n, with the first differences for each time period sorted in ascending order. We use the notation x_k and z_k to denote the k-th element of the sorted periods (x_k ≤ x_{k+1}; z_k ≤ z_{k+1}), and p to denote the selected order of the Wasserstein distance. Note that in the current context, the elements x_k and z_k denote the ordered k-th first-difference tilt angle (∆θ) corresponding to the two time periods being compared. Details about the order p will be provided in a later section (see "Step 3: Classification"). A detailed treatment of optimal transport (OT) is presented in [39,40] and methods of computational OT in [41].
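As a sketch, the order-p distance between two equal-length samples can be computed directly from their sorted order statistics, mirroring Eq. 3 (the 1/n normalization is assumed as written there; the function name is ours):

```python
import numpy as np

# Order-p Wasserstein distance between two equal-length 1D samples:
# sort both, then take the p-power mean of the gaps between the
# corresponding order statistics (Eq. 3).
def wasserstein_p(x, z, p=1.0):
    xs, zs = np.sort(x), np.sort(z)
    return np.mean(np.abs(xs - zs) ** p) ** (1.0 / p)

rng = np.random.default_rng(1)
calm = 0.1 * rng.standard_normal(3000)    # low-volatility movement
lively = 2.0 * rng.standard_normal(3000)  # high-volatility movement

# Identical samples are at distance zero; the more two distributions
# differ, the larger the score.
print(wasserstein_p(calm, calm))   # 0.0
print(wasserstein_p(calm, lively) > wasserstein_p(calm, 0.2 * calm))
```

For one-dimensional distributions this order-statistics form coincides with the optimal transport solution, which is what makes the score cheap enough to compute in a sliding fashion.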

The Wasserstein distance, in our use case, converts differences between two movement distributions into a single score (Fig. 3). Given a 1-second point of interest

Fig 3. In a magnified example of where the segmentation points are selected, there is a quiet period (Seg 1) next to a short active period (Seg 2), and then the participant returns to a quiet period (Seg 3). Because the quiet period of Seg 1 is longer than 5 min, there is a very small score at the beginning of the Wasserstein scores. As the WACSAW calculation moves into a region of larger difference, the Wasserstein score increases and crosses a threshold (horizontal orange stippled line; see text for details on how it is calculated). The whole area above the threshold is designated with a gray background. From this region, the peak of the Wasserstein score is identified and marked as the changepoint. As WACSAW moves through the area in Seg 2, the distributions become more similar even as they are more volatile, and the score decreases. As the calculation moves into the second quiet period, the Wasserstein score again increases as the distributions become more different. The score crosses the threshold and the peak is again determined. Thus, from this small time series, WACSAW identified 3 segments corresponding to the activity plot in (B).

result in too many segments as well as segments of very small duration. To address this concern, we employed a peak detection algorithm above a threshold, which we termed the Change Point Threshold (CPT; horizontal, dashed red line in Fig. 3B). The CPT is defined as the mean of the Wasserstein scores over a 2-day period (see the hyperparameter optimization section for a detailed explanation). This threshold is applied over the entire 2-day period and is unique for that individual and for that 2-day period (Fig. 3B), allowing it to vary based on the unique characteristics of both variables.
We designate each region of contiguous time points above the CPT as a change point region; the peak detection algorithm was applied within each change point region (gray regions in Fig. 3B). Thus, within each such region, WACSAW determines and labels the highest Wasserstein score as the change point (vertical red lines).
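The CPT and peak-detection step can be sketched as follows. The scores here are synthetic and the helper name is illustrative, not WACSAW's actual code.

```python
import numpy as np

# Sketch of the change-point step: the Change Point Threshold (CPT) is
# the mean Wasserstein score over the whole recording, and the peak of
# each contiguous above-threshold region is labeled a change point.
def change_points(scores):
    cpt = scores.mean()                   # individual-specific threshold
    above = scores > cpt
    points = []
    start = None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                     # a change point region opens
        elif not flag and start is not None:
            region = scores[start:i]
            points.append(start + int(region.argmax()))  # peak = change point
            start = None
    if start is not None:                 # region runs to the end of the series
        points.append(start + int(scores[start:].argmax()))
    return points

# Two above-threshold regions -> two change points, at the regional peaks
scores = np.array([0.1, 0.1, 0.9, 1.4, 0.8, 0.1, 0.1, 1.0, 2.0, 0.2])
print(change_points(scores))
# -> [3, 8]
```

Because the threshold is the mean of that individual's own scores over that 2-day period, the same code adapts to recordings with very different overall volatility.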

How sensitive the Wasserstein distance is to differences in distribution can be controlled by the distance order p; lower values magnify differences between distributions. Thus, an appropriately chosen value of p enables us to magnify the difference between distributions associated with quiescent wakefulness and sleep. In addition, the individual-specific CPT and the Wasserstein equation can be employed to detect relative differences in volatility in order to identify periods with different activity specific to that individual. In contrast, a population-based threshold that signifies sleep periods below it and wakefulness above it would not be adaptive to individual differences.

With this change point detection procedure, we identified regions within which relatively homogeneous volatility is detected, especially in the waking period. However, the segmentation protocol produced too many small regions, with too little data to analyze for classification, because of the highly sensitive nature of the Wasserstein score. Adjacent segments were therefore tested against one another and merged when not significantly different (see the hyperparameter section for a discussion). If the segment of interest was significantly different from the previous segment, then we compared it with the subsequent segment. If the segments were not different from either the previous or following segments, they were merged. If they were different, they remained their own entity. For example, this process resulted in a reduction of segments from 716 to 118 in a 2-day recording, which was typical of the results we analyzed.

Some brief movements during sleep were not reported in the logs and may reflect quick postural shifts such as roll-over events.

The quick shifts did not elevate the Wasserstein score above the change point threshold, and thus they are not designated as wakefulness events by WACSAW.

Step 3: Classification

The last step is to classify the segments into sleep and wakeful states. The classification of segments is done in two steps: (A) initial separation of segments into sleep, wake, and undetermined groups, and (B) final classification, based on the sleep and wake groups obtained in Step A, using that person's activity data to assign the segments in the undetermined group to sleep or wake categories.

In Step A, we compared each segment with an idealized sleep segment in which the movement values are all zero. For each segment, a thousand subintervals, each with a minimum length of 30 seconds within the segment of interest, are randomly selected (Fig. 6A). The Wasserstein distance between the distribution of the middle 95% of the first differences in each subsegment and the idealized sleep distribution (a baseline distribution degenerate at zero) was then computed (Fig. 6B). This provides scores reflecting the transport energy required to move each subsegment distribution to the baseline distribution. We filtered the data by eliminating the extreme 5% of raw first-difference values to avoid misclassification of short sleep segments due to the presence of a few large movements. This procedure generates one thousand Wasserstein scores for each segment, and these scores form a distribution reflecting the difference of a segment from that of idealized sleep. For this segment, there are scattered values to the left and right of the peak within each distribution, indicating subsegments with more or less variability than those that make up the prominent peak, but the large peak defines this segment because it is the most frequent value, roughly 0.0075 (Fig. 6C). This distribution is typical of one in which there are short bursts of activity within a larger low-variability segment. Mathematically, with the baseline degenerate at zero, the Wasserstein distance used in this step of classification reduces to

D_ℓ = ( (1/n_ℓ) Σ_t |∆θ*_t|^p )^{1/p},

where ℓ = 1, ..., L indexes the subsegment of interest within segment i and n_ℓ is its number of observations. Note that ∆θ*_t is the first difference at time t if the first difference is within the middle 95% of data in the segment and zero otherwise. Borrowing from optimal transport phraseology, we refer to the value D as the transport energy.

Fig 6. (A) The segment being analyzed is in black, while grayed-out segments will be treated similarly in an independent calculation.
Five random hypothetical segments of the 1000 generated by WACSAW are illustrated under the segment of interest. Each of the randomly generated segments is individually compared to "idealized sleep" by optimal transport methods. (B) The left histogram is the distribution of first-difference values from subsegment i, where we remove the extreme 5% of points (red bars are censored, blue bars are kept). That distribution is compared to the "idealized sleep" distribution on the right to generate an optimal transport score. (C) The histogram of the optimal transport scores for each of the 1000 iterations of subsegment i characterizes the variability of the entire segment, which appears to have a peak around 0.0075. There are portions of the segment that have higher and lower variability, with scores greater and less than 0.0075, but they are a small minority of subsegments. The overall shape of the distribution characterizes the variability of the segment.
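Step A can be sketched as below. The subinterval lengths, p value, and all names are illustrative assumptions rather than WACSAW's implementation; with the baseline degenerate at zero, the Wasserstein score reduces to a power mean of the censored |movement| values.

```python
import numpy as np

# Sketch of Step A: draw random subintervals (>= 30 s, i.e. >= 300
# samples at an assumed 10 Hz) from a segment, zero out the extreme 5%
# of first differences, and score each subinterval's distance to
# "idealized sleep" (a distribution degenerate at zero).
def segment_profile(movement, n_sub=1000, min_len=300, p=1.0, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.quantile(movement, [0.025, 0.975])
    scores = np.empty(n_sub)
    for i in range(n_sub):
        length = int(rng.integers(min_len, len(movement) + 1))
        start = int(rng.integers(0, len(movement) - length + 1))
        sub = movement[start:start + length]
        # Keep the middle 95% of first differences; set the extremes to zero
        censored = np.where((sub >= lo) & (sub <= hi), sub, 0.0)
        # Distance to the zero-degenerate baseline = power mean of |movement|
        scores[i] = np.mean(np.abs(censored) ** p) ** (1.0 / p)
    return scores  # 1000 scores characterizing the segment's variability

rng = np.random.default_rng(2)
sleep_like = 0.002 * rng.standard_normal(6000)  # 10 min of near-quiescence
wake_like = 0.2 * rng.standard_normal(6000)

# Quiescent segments sit far to the left of active ones in the profile
print(segment_profile(sleep_like).mean() < segment_profile(wake_like).mean())
```

The 1000 scores per segment play the role of the per-segment histograms in Fig. 6C: tall, narrow distributions near zero suggest sleep-like segments, flatter ones suggest wakeful activity.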
Over a full 2-day recording period, WACSAW generates a profile typical of the one presented in Fig. 7A. Histograms from individual segments are differentiated by color. There is a cluster of tall, thin histograms on the left side of the figure, indicating little difference from idealized sleep. Moving to the right of the distribution profile, the peaks become flatter and broader, indicating more variability throughout the segment. More variability is usually an indicator of some wakeful activity. Note that there are values to the right of 0.05, but the graph has been truncated to highlight the distributions from 0-0.01 of the profile.

worth of data without identifying individual segments. By eye, there appears to be a difference in the appearance of histograms below a 0.01 score and those above (represented by the vertical red line). (C) We iteratively tested the percent of correct categorization using thresholds of Wasserstein scores from 0-0.05 compared against the activity logs. We evaluated overall accuracy (green line), sensitivity to correctly identifying sleep periods (yellow line), and selectivity to correctly identifying wakefulness periods (green line). In this example, the best balance of all three statistics was at a Wasserstein score of 0.01 (vertical red line). (D) Results of using a hard 0.01 threshold to classify sleep and wakefulness overlaid on the tilt angle data gathered for that time series. Orange indicates wakefulness calls and purple indicates sleep periods, as identified by WACSAW. The overall accuracy is >90% under these conditions.

Based on observations from Fig. 7B, which shows the cumulative Wasserstein score histogram, we hypothesized that sleep-like segments fall on the left side of the graph while waking segments tend to reside on the right side.
To test this hypothesis, we used a sliding cutoff point to determine whether there was a threshold at which the accuracy of sleep and wakefulness calls was optimized, using the data from Fig. 7B. The correct designations for each second of the recording were calculated as a percent of total recording time (accuracy), the percent of correct sleep calls over the total amount of sleep (specificity), and the percent of correct wakefulness calls over the total amount of wakefulness (sensitivity). WACSAW calls were validated using the participant logs or adjusted logs as the comparator for what occurred during that time period. As the Wasserstein value went from 0 up through 0.01, each of the measures improved, except sleep sensitivity, because when the program calls everything sleep it will always get the sleep segments correct. The optimal value appeared to be 0.01 for this individual, which resulted in the calls in Fig. 7C. For this individual dataset, it resulted in >95% accuracy across the 2 days. Yet this hard cutoff did not work for all individuals. We observed several shallow drops in accuracy between 0.01 and 0.02 (Fig. 7C), indicating an overlap of sleep and wakeful segments, because there was not a threshold that optimized the sleep and wakefulness calls for every individual. For the example participant, setting the threshold to a transport energy of 0.01 resulted in >97% accuracy (Fig. 7D). But in other participants, a threshold of 0.01 reduced the accuracy, suggesting that an absolute population-based threshold was impractical.
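The sliding-cutoff search can be sketched as follows, using synthetic scores and log labels. Here specificity is the correct-sleep rate and sensitivity the correct-wake rate, matching the definitions in the text; the names and grid are illustrative.

```python
import numpy as np

# Sketch of the sliding-cutoff search: segments scoring below the cutoff
# are called sleep, and each candidate cutoff is validated
# second-by-second against the activity log.
def sweep(scores, is_sleep_truth, cutoffs):
    results = []
    for c in cutoffs:
        call_sleep = scores < c
        accuracy = np.mean(call_sleep == is_sleep_truth)
        specificity = np.mean(call_sleep[is_sleep_truth])    # correct sleep
        sensitivity = np.mean(~call_sleep[~is_sleep_truth])  # correct wake
        results.append((c, accuracy, specificity, sensitivity))
    return results

rng = np.random.default_rng(3)
truth = np.r_[np.ones(500, bool), np.zeros(500, bool)]  # sleep, then wake
scores = np.r_[0.005 + 0.002 * rng.random(500),         # sleep-like scores
               0.02 + 0.02 * rng.random(500)]           # wake-like scores

# Pick the cutoff on a 0-0.05 grid that maximizes overall accuracy
best = max(sweep(scores, truth, np.linspace(0, 0.05, 51)),
           key=lambda r: r[1])
print(best[0])  # the accuracy-maximizing cutoff for this synthetic person
```

In this toy example the two score clouds are separable, so a whole range of cutoffs achieves perfect accuracy; with real overlapping sleep and wake segments, the curve shows the shallow accuracy dips described above instead.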
where S_i is the i-th segment, n_i is the number of observations in S_i, and j is the imaginary unit.

Within this Development set, WACSAW performed admirably (Table 2). WACSAW classifies sleep or wakefulness on a second-by-second basis and produces results without human intervention. As mentioned above, we compared the output both to the raw log entries from the participants (raw) and to an investigator-adjusted log that was defined by a set of rules (see above; adj). We present both sets of data for comparison and note that the adjusted data set results in only a marginal improvement. The overall accuracy of WACSAW is 96.8% correct classification with the raw data and 97.7% with the adjusted data. The correct classification of sleep episodes (specificity) is above 96% across this set, and the correct classification of wakefulness (selectivity) is above 97%. The medians indicate that single values are unlikely to be responsible for pulling the mean in one direction or the other. In addition, the standard deviation (Std dev) and median absolute deviation (MAD), which is the median of the absolute deviations from the median across the group, suggest that the results are consistently accurate across individuals. Given the age and situational differences, this suggests that WACSAW can adapt to individual circumstances. Because the adjusted logs document specific times for changes in activity, we will discuss only the adjusted values, but the raw data will also be presented for the performance of WACSAW.

Five of the 6 participants in the Development cohort had a second data-collection period that was independent of the original period on which WACSAW was developed. As an initial test of WACSAW's performance, we applied WACSAW to this independent data-collection period in participants with the same properties as the Development cohort. Overall accuracy, specificity, and selectivity were all above the 95% accuracy threshold for the entire group, with only one participant's record below this threshold for the 48-hour recording period. The standard deviation (SD) and MAD remain low across this small population. This test shows that the performance of WACSAW is not totally reliant upon the training data and can perform well on independent data, albeit obtained from the same subjects used in the development cohort.

To test whether WACSAW can perform equally well on independent data from individuals without the same underlying properties, we applied WACSAW to 48-hour recordings from 16 independent individuals (Table 3). Across the "Independent validation" data set, WACSAW maintained its average of greater than 95% accuracy across overall performance, sensitivity, and selectivity. The Std dev and MAD remained low. One individual, participant 11, had one day on which WACSAW demonstrated a significant underperformance. In a post hoc interview, it was determined that the person was sick over the data-collection period. This highlights that WACSAW has room for improvement under specific circumstances. Under such circumstances, the interim metrics, like the cumulative histogram of transport energy discussed in Fig. 7, may suggest that the results are not as reliable and are worth further investigator validation (see Fig. 11B below). Despite the 24-hr underperformance, WACSAW performance in all individuals was above 86%, with an overall average over 95%.
Table 3. WACSAW performance in the independent validation cohort.
It is possible that we have been evaluating a set of data for which it is simple and easy to identify sleep and wakefulness. Therefore, we compared WACSAW's output to the sleep and wakefulness designations produced by the clinically validated Actiwatch, which is produced and maintained by Respironics. The Actiwatch is a popular, validated actimeter that uses piezoelectric accelerometers to obtain movement counts, which are then used to determine sleep and wakefulness. For a detailed treatment of the Actiwatch algorithm, we refer you to the Actiwatch documentation [35]. The data from these samples were analyzed in earlier charts but had not been benchmarked to an activity device that has been validated and approved for use in clinical and research settings. For this period of data collection, participants wore both watches at the same time on the same wrist. The data from the Actiwatch were processed through the company's analysis software to obtain the classifications, whereas WACSAW was applied to the GENEActiv MEMS data as described above.

We maintained the same divisions in the population because the first 5 participants provided the majority of the data on which WACSAW was originally developed. WACSAW performs well on the data on which it was optimized, as discussed above (Table 4), and it outperforms the Actiwatch by roughly 10% on measures of accuracy, sensitivity, and selectivity. Across the population, WACSAW performs better between individuals as

Table 4. Comparison of WACSAW performance to Actiwatch designations.
When we evaluated 10 individuals from the independent validation cohort, the accuracy of the Actiwatch algorithm was lower and the variation within the population was higher, with almost none of the values crossing the 95% accuracy threshold. When the data are combined in the "Full cohort," our interpretation is that the conclusions remain the same. Thus, WACSAW is not simply analyzing data that are easy to categorize; if it were, the Actiwatch would also have found them easy to categorize and produced higher accuracies.

We hypothesize that one major difference between the accuracy of WACSAW and that of the Actiwatch designations lies in classifying quiescent wakefulness. Fig. 10 shows one such instance, a participant reading in bed (Fig. 10A), and two other instances where the Actiwatch classification differed from the participant log during activities such as watching TV (Fig. 10B) and desk work (Fig. 10C), but WACSAW classified them correctly.

Fig 10. Comparison of behavioral categorization between WACSAW and Actiwatch. We chose 3 raw-data examples (A-C) to illustrate differences in the performance of WACSAW compared to the Actiwatch, with a particular focus on the challenges of quiet wakefulness. An orange background indicates a wakefulness categorization by WACSAW or the Actiwatch algorithm, or that the person logged that they were awake (middle bar in each panel). A blue background indicates sleep for all three. In (A), a participant was reading in bed during this period. In (B), a second participant was watching TV on their couch during the wakefulness period leading up to sleep. In (C), a third participant was working at their desk during the entire recording period shown. Vertical red lines in the WACSAW trace indicate segmentation breaks determined by the algorithm. The Actiwatch assigns many more sections as sleep when the participant was engaged in quiet waking activities (see middle bar for actual activity).
Lastly, we compared our output to that of an existing analysis program that also accepts data from the GENEActiv watch, to determine whether the frequency of data collection was the only source of the difference in prediction accuracy. We submitted the data used for the earlier WACSAW predictions to the GGIR program and compared its output to the participant logs using the adjusted measurements. We found that WACSAW was approximately 5% better in overall accuracy, roughly 12% more accurate for sleep, and approximately 2% better in selectivity, which identifies wakefulness. We also compared the visual outputs from each program (Fig. 11). GGIR makes substantially more state changes than WACSAW, especially during the night. For instance, in participant 19, the first night of sleep is segmented far more frequently by GGIR than by WACSAW. It may be these numerous transitions that distinguish WACSAW from GGIR.

Fig 11. Comparison of time series of sleep/wakefulness output from WACSAW and GGIR. We directly compared the output from 3 participants: Part 5, an older individual; Part 11, for whom WACSAW demonstrated its worst performance; and Part 19, for whom WACSAW did very well. An orange portion of the trace indicates that the given software called the section wakefulness, and a blue portion indicates that the software labeled the section sleep.

Overall, WACSAW delivered better and more personalized performance in categorizing sleep and wakefulness. We developed WACSAW from data collected using the GENEActiv actimeter, which uses MEMS to detect changes in movement at a subsecond sampling rate. This sampling property has distinct advantages for classifying sleep and wakefulness states compared to traditional piezoelectric methods.
First, we could compose a more detailed representation of movement variability within a smaller time window, which allows a faster determination of segmentation boundaries between two periods of different activity levels. Second, we could generate probability distributions within smaller time windows. Third, higher-frequency collection provides robust distributions of movement statistics in a short time window, making the Wasserstein distance more accurate in detecting changes in the underlying distributions. Lastly, segmentation becomes more localized, which minimizes sleep-wake overlap in any given region and provides a more accurate partitioning of sleep and wakeful regions. MEMS technology has permitted us to explore statistically based approaches that accommodate individual differences rather than regression-based approaches. Because WACSAW is agnostic to the exact nature of the data, it could also be applied to other classification systems that rely on changes in variability.

Some of the most utilized algorithms, such as Sadeh [22] and Cole-Kripke [23], are subject to flaws because they are regression-based.

The approach presented here has limitations, and future investigations may validate or mitigate its generalizability. The most obvious limitation is the absence of validation against PSG.

Our validation against logs is a practical and opportunistic method that allowed us to make progress on WACSAW and that captures many of the characteristics researchers and participants care about. Its weakness is that it lacks the moment-to-moment validation of brain state that can reveal brief periods of wakefulness and define sleep latency, an important parameter for specific sleep disorders. In fact, we adjusted sleep and wakefulness periods based on a 20-minute window, acknowledging that sleep logs can be inaccurate. That said, this adjustment altered accuracy percentages by less than 2%. WACSAW has demonstrated the potential to identify brief nighttime awakenings and potential restlessness in individuals. However, it has thus far been validated only against detailed activity logs, so nighttime states have not been validated against PSG.

Another limitation is the sample size of participants. We were careful to include people of different ages, sexes, and behavior types. We are aware of the limitations of other algorithms and wanted to specifically challenge WACSAW with these situations to improve the interpretation of activity data. We assessed 22 unique individuals, repeated the experiment on a large subset of those individuals, and found that WACSAW performed well across all of them. One participant (11) was sick over one night of the test period. This reduced the high accuracy seen across the other participants and highlights the unique circumstances that exist in individuals with illness or sleep disorders. We anticipate that the alterations described in the paragraph above could optimize WACSAW performance in these situations. Also, the vast majority of our population was affiliated with a university, which may inadvertently skew WACSAW results in an unidentified way. Lastly, at this time WACSAW does not output some of the typical metrics provided by other algorithms; adding them is a future goal of this project so that the program is as useful as possible for the sleep community. It is possible that WACSAW could be used in combination with analysis suites such as GGIR to further improve the understanding of how naturalistic sleep impacts health and behavior.

Despite the above drawbacks, we propose that WACSAW can contribute to classifying sleep in naturalistic settings. WACSAW may be more automated in its application than other methodologies. We submit that WACSAW removes the absolute necessity for participant logs in certain types of studies. In fact, we found circumstances where the segmentation points from WACSAW were more reliable than the participant logs, because the logged times misidentified behaviors owing to memory or attention lapses by the participants. This may account for a 1-2% discrepancy based on our data. This makes WACSAW more useful in broad population experiments because less human intervention is required. Additionally, because WACSAW uses only a single stream of data, namely movement data, to classify behavioral states, longer recordings may be possible given equal battery life and data storage capacities. Moreover, WACSAW classifies segments of varying length rather than fixed-length epochs, thereby providing more information that could be used to improve the interpretation of a sleep or wakefulness call. These interim metrics may offer opportunities to identify important patterns and relationships within the movement data and could be used to determine the reliability of the classification results. WACSAW adapts to individuals by utilizing the Wasserstein threshold and the characteristic function, both derived from the movement data of that specific individual. In contrast to machine-learning algorithms, WACSAW is more explainable while maintaining high accuracy. Lastly, WACSAW and its code will be made available to researchers and the public to evaluate and modify for their particular situations. The code for WACSAW can be found at [45].

Supporting information

S1 Appendix. Results of the hyperparameter grid search.

Hyperparameter Grid Search
The final WACSAW algorithm contains 4 hyperparameters to which we assigned values as we developed the WACSAW process. These assignments were based on quick trial and error, so they needed to be formally tuned to determine the values that optimize WACSAW output. We used data from the same 6 training individuals, with their varying sleep characteristics, and tuned the model on accuracy to determine the hyperparameter values that led to the best outcomes.
We needed to tune (1) the order of the Wasserstein distance (p in Eq. 3); (2) the window size of data input; (3) the change point threshold for segmentation; and (4) the significance value of the Levene test.
For the order of the Wasserstein distance, we considered p ∈ {1, 2}. For the window size, we considered windows of 2, 5, and 7 minutes. For the change point threshold, we considered the statistics mean, median, and α-trimmed mean with α = 0.25 over a 2-day period. For the Levene significance level, we checked values between 0 and 1 ({10^-s : s ∈ [0, 81]}; note that the extremely small values were computed using arbitrary-precision arithmetic). It may be more appropriate to treat this value not as the significance level of a test but as a parameter one can tune to accentuate the differences between sleep and wakeful segments, such as quiescent wakefulness.
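As a small illustration of the first and last of these hyperparameters, the sketch below contrasts order-1 and order-2 Wasserstein distances and applies the Levene test with an extreme significance level. The data are synthetic stand-ins, not WACSAW inputs; note that SciPy implements only the order-1 distance directly, so the order-2 value is computed from sorted, equal-size samples via the quantile-difference formula:

```python
import numpy as np
from scipy.stats import levene, wasserstein_distance

rng = np.random.default_rng(3)

# Two hypothetical movement windows with equal means but different spread.
a = np.sort(rng.normal(0.0, 0.050, 2000))  # quiescent wakefulness: more variable
b = np.sort(rng.normal(0.0, 0.005, 2000))  # sleep-like: very low variability

# For equal-size sorted samples, the order-p distance is the p-norm of the
# pointwise quantile differences (the p of Eq. 3).
w1 = wasserstein_distance(a, b)            # order 1 (SciPy's built-in)
w2 = np.mean(np.abs(a - b) ** 2) ** 0.5    # order-2 analogue

# Levene's test on variability; the "significance level" acts as a tunable
# separation parameter, e.g. the 1e-25 value selected by the sweep.
stat, p = levene(a, b)
alpha = 1e-25
print(w2 >= w1, p < alpha)
```

By Jensen's inequality the order-2 distance is at least the order-1 distance, and with this large a difference in variability the Levene p-value falls below even the extreme 1e-25 threshold.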
We used a parameter sweep to test the hyperparameters. The sweep revealed that WACSAW is not overly sensitive to variations in the hyperparameters within the search space, indicating that inadvertent deviations from the optimum do not change the final output of WACSAW to any fundamental extent. Comparing the accuracy of WACSAW to the adjusted log classifications revealed that accuracy stayed above 90% for all hyperparameter combinations on the training set, and the standard deviation, computed from the 6 individuals in the development set, ranged from roughly 1% to just below 11%. Moreover, the standard deviation decreases as accuracy increases. The top-performing combination was order 1 with a 5-minute window, a mean change point threshold, and a Levene significance level of 1e-25 (Table 6). The top ten combinations all had an accuracy above 97%, and all had a standard deviation in accuracy of 1.30-2.8% between individuals, which indicates that the adaptive aspects of WACSAW reduce the variability in accuracy sometimes observed when a common algorithm is applied across individuals.

Table 6. Top 20 results based on accuracy from the hyperparameter sweep. The initial columns are the values of the hyperparameters, and the validation-metric columns are the metrics for WACSAW when run with that hyperparameter combination on the development cohort.
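The structure of such a sweep can be sketched as follows. The score function here is a toy stand-in: the real sweep runs WACSAW on the 6-person development set and scores accuracy against the adjusted logs, whereas this hypothetical version simply rewards agreement with the reported optimum:

```python
from itertools import product

# Hypothetical search space mirroring the one described above.
orders = [1, 2]                                   # order p of the Wasserstein distance
windows_min = [2, 5, 7]                           # window size in minutes
thresholds = ["mean", "median", "trimmed_mean"]   # change-point statistic
levene_exponents = [1, 5, 25, 50, 81]             # significance level 10**-s

def score(order, window, threshold, s):
    """Toy stand-in for a WACSAW accuracy evaluation: counts how many
    hyperparameters match the reported optimum (order 1, 5-min window,
    mean threshold, level 1e-25, i.e. s = 25)."""
    optimum = (1, 5, "mean", 25)
    return sum(x == y for x, y in zip((order, window, threshold, s), optimum))

# Rank every combination in the Cartesian product by its score.
results = sorted(
    product(orders, windows_min, thresholds, levene_exponents),
    key=lambda combo: score(*combo),
    reverse=True,
)
print(results[0])  # → (1, 5, 'mean', 25)
```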
Other hyperparameters include the permutation test threshold, the ξ cutoff for the characteristic functions, and the characteristic clustering parameter, namely the 90% closeness parameter. These parameters were not tuned because they were either chosen empirically, restricted by computational cost, or chosen heuristically.
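To illustrate how a change-point threshold such as the "mean" statistic can act on windowed Wasserstein distances, here is a hypothetical sketch on synthetic data. The stream, window size, and values are illustrative only and do not reproduce WACSAW's segmentation procedure:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)

# Hypothetical movement stream: quiet period, active period, quiet again.
stream = np.concatenate([
    rng.normal(0.02, 0.005, 1500),  # quiet
    rng.normal(0.10, 0.040, 1500),  # active
    rng.normal(0.02, 0.005, 1500),  # quiet
])

win = 300  # samples per window (a stand-in for the minutes-long windows above)
windows = [stream[i:i + win] for i in range(0, len(stream) - win + 1, win)]

# Wasserstein distance between each adjacent pair of windows.
dists = np.array([wasserstein_distance(a, b) for a, b in zip(windows, windows[1:])])

# "Mean" change-point threshold: flag boundaries whose distance exceeds the mean.
breaks = np.where(dists > dists.mean())[0] + 1  # window index starting a new segment
print(breaks)  # boundaries at windows 5 and 10, where the regime changes
```

Distances between windows drawn from the same regime stay near zero, so only the two genuine regime changes exceed the mean threshold.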