Flying In-formation: A computational method for the classification of host seeking mosquito flight patterns using path segmentation and unsupervised machine learning

The rational design of effective vector control tools requires detailed knowledge of vector behaviour. Yet, behavioural observations, interpretations, evaluations and definitions by even the most experienced researcher are constrained by subjectivity and perceptual limits. Seeking an objective alternative to ‘expertise’, we developed and tested an unsupervised method for the automatic identification of video-tracked mosquito flight behaviour. This method unites path-segmentation and unsupervised machine learning in an innovative workflow and is implemented using a combination of R and python. The workflow (1) records movement trajectories; (2) applies path-segmentation; (3) clusters path segments using unsupervised learning; and (4) interprets results. Analysis of the flight patterns of An. gambiae s.s., responding to human-baited insecticide-treated bednets (ITNs), by the new method identified four distinct behaviour modes: with ‘swooping’ and ‘approaching’ modes predominant at ITNs; increased ‘walking’ behaviours at untreated nets; similar rates of ‘reacting’ at both nets; and higher overall activity at treated nets. The method’s validity was tested by comparing these findings with those from a similar setting using an expertise-based method. The level of correspondence found between the studies validated the accuracy of the new method. While researcher-defined behaviours are inherently subjective, and prone to corollary shortcomings, the new approach’s mathematical method is objective, automatic, repeatable and a validated alternative for analysing complex vector behaviour. This method provides a novel and adaptable analytical tool and is freely available to vector biologists, ethologists and behavioural ecologists. Author summary Vector control targets the insects and arachnids that transmit 1 in every 6 communicable diseases worldwide. Since the effectiveness of many vector control tools depends on exploiting or changing vector behaviour, a firm understanding of this behaviour is required to maximise the impact of existing tools and design new interventions. However, current methods for identifying such behaviours are based primarily on expert knowledge, which can be inefficient, difficult to scale and limited by perceptual abilities. To overcome this, we present, detail and validate a new method for categorising vector behaviour. This method combines existing path segmentation and unsupervised machine learning algorithms to identify changes in vector movement trajectories and classify behaviours. The accuracy of the new method is demonstrated by replicating existing, expert-derived, findings covering the behaviour of host-seeking mosquitos around insecticide treated bednets, compared to nets without insecticide. As the method found the same changes in mosquito activity as previous research, it is said to be validated. The new method is significant, as it improves the analytical capabilities of biologists working to reduce the burden of vector-borne diseases, such as malaria, through an understanding of behaviour.


125
To assess the accuracy of the new workflow, we applied the method to the activity of An. gambiae, a principal 126 vector of malaria in sub-Saharan Africa, around either an insecticide-treated net (as approved by the World 127 Health Organisation, hereafter 'treated') or an untreated polyester net ('untreated'). A strain of mosquito 128 susceptible to all insecticides, Kisumu, was used in both the untreated and treated arms of the experiment. 129 The results of this application were then compared with those from a previous, expert-derived, study to 130 validate the accuracy of the workflow. 131 Data acquisition, cleaning and assessment 132 Activity rates, based on observations from the raw data, were found to be much higher around an untreated 133 net, with the number and length of movements significantly lower when an ITN was used (Table 1). When 134 the autocorrelation of the datasets was assessed, it was found that movement velocity was positively 135 autocorrelated through 50 time-lags in both the untreated (Fig 2A) and treated (Fig 2B)

147
This association becomes weaker as the lag increases (from 0.7 at lag 1 to 0.3 at lag 50).

148
Path segmentation 149 Results of the path segmentation are found in Table 1. Activity of An. gambiae s.s. was again found to be 150 significantly higher in the untreated trial. 151

152
Internal validation indicated that the optimal algorithm and parameters to cluster both the untreated and 153 treated data was an agglomerative clustering algorithm using Ward's method for linkage and four clusters. 154 The untreated clustering produced a silhouette score of 0.36 (Fig 3A), while the treated grouping's silhouette 155 score was 0.41 ( Fig 3B). Results of this analysis are shown in Table 2  Interpreting results

171
Mean statistics of each group were investigated to interpret the results (Table 2). Similarly broad behavioural 172 types were found in both arms of the study. After interpretation, these groups were labelled 'swooping', 173 'approaching', 'reacting' and 'walking.' 'Swooping' captures fast, short and highly autocorrelated 174 movements; 'approaching' slower, less variable behaviour with low autocorrelation; 'reacting' faster, more 175 variable actions with some autocorrelation; and 'walking' encompasses long, slow movements that are not Tanzania using wild An. arabiensis, a sibling species closely related to Anopheles gambiae s.s. and that 193 exhibits many of the same host seeking behaviour characteristics [9]. This previous study used expert 194 knowledge to identify behaviour types, determining that mosquitos exhibited four behaviours around both 195 an untreated and treated net ('swooping', 'visiting', 'bouncing' and 'resting') and that total activity levels 196 dropped significantly at ITNs compared to untreated bednet (from a geometric mean time of 73.5 mins to 197 23.8 mins). Where particular behaviours are concerned, and comparing total mean times, the study found 198 that 'swooping' (where "tracks do not contact the bednet"), 'visiting' ("long periods of flight are interspersed 199 with infrequent net contacts') and 'resting' ("mosquito movement is under 1.33 mm /s") all increased in the 200 presence of a treated net (however, this increase in swooping was not found to be statistically significant). 201 The study also found that 'bouncing' ("rapid contacts with the bednet surface… include[ing] walking") 202 reduced significantly around the treated net (when evaluating geometric mean times). 203 Comparing findings from this study and those of [9], several replications are clear: (1) both studies recognised 204 four types of mosquito behaviour; (2) total vector activity fell at a treated bednet; (3) 'swooping' behaviour 205 increased with a treated net; and (4) 'walking' / 'bouncing' is decreased when using a treated net. However, 206 although 'swooping' and 'bouncing' from [9] are acceptable analogues to the behaviours 'swooping' and 207 'walking' from this study, it is not possible to align the prior study's 'resting' and 'visiting' with the 208 'approaching' or 'reacting' categories of this study, due to divergent definitions. 209 As such, comparison with previous results can only be said to validate four of this study's conclusions (i.e., 210 (1), (2), (3) and (6) from the Conclusions section). Although there are slight differences between [9] and the 211 current study (i.e., in vectors observed and the definition of behaviour modes), these differences are minor, 212 potentially explicable by the use of a wild population, which is inherently more genetically diverse. With this 213 knowledge, the similarities are such that [9] can be said to support several major findings from this study in 214 the given setting. Consequently, the external validity of the new method was deemed to be proven. 215 Here we supply a preliminary application of the new method, analysing the behaviour of An. gambiae s.s., in 225 the presence of both baited untreated bednets and baited ITNs. As the study replicated previous findings, 226 the method is deemed to be an innovative, validated and productive approach that improves and expands 227 the existing toolkit available to vector biologists. Furthermore, the method is repeatable, as any individual 228 with the same dataset will produce the same behavioural tokens and behavioural types; it is generalisable, 229 as it is not limited to a single domain, but can be applied to any vector in any setting; and its foundation in 230 mathematical processes ensures it is immune to observer bias. The accuracy of the new method is confirmed 231 using both internal and external validation [37,38]. The former ensures the correct algorithm and initial 232 parameters are applied, while the latter tests the accuracy of the approach itself. Internal validity is measured 233 in two ways: (1) formal metrics of similarity between datapoints (i.e., silhouette scores) are studied; and (2)  for the particular difficulties encountered when tracking vectors must be used in this instance. As the key 250 difficulty here is the frequency of lost frames (caused by the recording system momentarily losing track of 251 the small vector), a method that can handle an irregular dataset is required. As BCPA is a likelihood-based 252 form of path segmentation, which sweeps an analysis window over an entire movement path to identify 253 significant shifts in a parameter value, it provides a robust method for dissecting vector activity into 254 behavioural tokens that can account for irregular temporal measurement intervals and does so without any 255 a priori assumptions [19,20]. 256 Although offering several advancements, the method presented here is subject to its own limitations. One 257 such constraint concerns the clarity of the silhouette created by any grouping of behaviours. That is, as 258 behavioural units are nebulous concepts, any silhouette of their classification will be equally unclear and 259 datapoints from different behaviours will not necessarily have a high separateness [51-53]. For example, the 260 distinction between fast walking and slow running is not clear. Consequently, the identification of strong 261 patterns when assessing the clustering of behaviour is unlikely. This is shown in the contiguous silhouettes 262 and the silhouette values produced by movement data (Fig 3). Additionally, it is important to make explicit 263 the assumptions on which this study is based. These assumptions are that vectors are always in some 264 behavioural state, that vectors have more than one potential behavioural mode and that these modes are 265 discrete and expressed over a period of time. Finally, it needs to be clarified that the method presented here 266 is not totally objective. Since the workflow's Interpretation stage requires experts to attach a label to clusters, 267 a level of subjectivity is still required to implement this analysis. Although this labelling is not theoretically 268 necessary to produce and compare results (as clusters can be described by their characteristics alone), a level 269 of subjectivity is still needed to interpret these results and apply them to everyday discourse concerning

284
We present a four-stage workflow in which vector movement trajectories are first collected and pre-285 processed via BCPA. The most appropriate unsupervised clustering algorithm, and initial parameters, are 286 then identified and applied before the workflow concludes with the interpretation of results, decoding and 287 attaching a behavioural label to each group. The whole workflow is then validated by measuring the accuracy 288 of its results. 289

290
The workflow presented here is implemented using a combination of R and Python. R is used for pre-291 processing, utilising the BCPA package built for that language. Python, through a Jupyter notebook, is used 292 at the clustering stage to exploit the scikit-learn library. We recommend that the Anaconda platform 293 be used to access RStudio and JupyterLab, as up-to-date installations for Windows, Linux and Mac can all be 294 found in that single distribution. Code, and further details, needed to run the workflow can be accessed 295 through a public GitLab repository: https://gitlab.com/MTFowler/lstm_flightcluster. All analysis found here 296 was performed on a ThinkPad X1 Carbon, using an Intel i7-7500u CPU. 297 All procedures associated with the collection of mosquito flight data are as described in [9,11,55,56]. Briefly, 298 the 'Kisumu' laboratory strain of An gambiae, a primary malaria vector across sub-Saharan Africa and 299 susceptible to all insecticides, was used in both the untreated and treated arms of the experiment. All 300 mosquito flight assays were completed in a purpose-built climate-controlled insectary in Liverpool. 301

302
Vector movement paths were represented by spatial identifiers ordered sequentially via a time variable [18-303 20,30]. Each event was captured by a unique identifier, an x (longitude or easting) coordinate, a y (latitude 304 or northing) coordinate and a time variable (Fig 1A). This data was collected using an optical imaging and 305 flight-tracking system detailed in [55,56]. This system allowed for multiple vectors to move unconstrained 306 within an enclosed area, a subset of this space being within the field of view of the recording system, creating 307 the recording volume. After collection, movement trajectory data was cleaned and assessed (Table 1). 308 Movements considered noise were removed. This 'noise' included short tracks deemed to be isolated 309 fragments from a larger track, or disturbance that has been missed during video cleaning [9-11]. 310 Furthermore, although BCPA accounts for semi-regular sampling [19,20], allowing for some irregularity in 311 the dataset, movement tracks were removed from the analysis if they contained two datapoints at the same 312 time or if they had especially large time gaps (i.e., greater than 10 seconds). Finally, as path segmentation 313 assumes that all time series data displays serial dependence, it was confirmed that the dataset was 314 autocorrelated (i.e., that the velocity of each datapoint is statistically correlated with its recent past) [20]. 315 This was accomplished in R using the 'Autocorrelation and Cross-Correlation Function Estimation', ACF(). 316

317
With a correctly formatted dataset, that had been cleaned and assessed, BCPA was applied. BCPA is a form 318 of path segmentation that identifies changes in animal behaviour, at the path-level, based on significant 319 shifts in a parameter value of an organism's movement trajectory. As BCPA accepts movement paths as 320 sequentially ordered step lengths, turning angles and velocities, rather than the spatial identifiers collected 321 by tracking technology, spatial values were converted into the required variables using the GetVT() 322 function from R's BCPA package [57]. Within BCPA there are four user defined parameters: (1) the 323 'Parameter Value' (the response time-series variable in which significant changes will identify a behavioural 324 change point); (2) the 'Window Size' (the number of datapoints the window will capture when sweeping); 325 (3) a sensitivity parameter 'K'; and (4) the 'Cluster Width.' For arthropod activity, it was determined that 326 optimal segmentation occurs at a significant change in persistence velocity (Velocity*cos(Turning Angle)), 327 using a window size of 30, a window step of 1, a sensitivity value of 2 and a cluster width of 1. These initial 328 parameters were determined following BCPA documentation recommendations [19,20,57] and to maximise 329 sensitivity to behavioural shifts. (Note, however, that this increase in sensitivity amplifies the chances of 330 spurious shifts being detected which will ultimately result in transitions to the same behaviour in the final 331 output. However, as the alternative is to lower sensitivity and potentially miss legitimate changes in 332 behaviour, a high sensitivity, with corollary spurious shifts, is preferred.) 333

334
To determine the optimal unsupervised learning algorithm and initial parameters for clustering, internal 335 validation was undertaken. Following [37,38], the form of internal validation used was silhouette scores 336 [58,59] and visual inspection of t-SNE plots [39][40][41]. A silhouette score measures how well data had been 337 grouped, comparing each object's similarity to others within its own cluster (group tightness) and those from 338 other clusters (group separation) and was calculated using Python's silhouette_score() function from 339 the metrics module of the scikit-learn library. This measure gives a score between -1.00 and +1.00, 340 with a silhouette value below 0.20 showing no structure is present in the data and the grouping is invalid; a 341 figure over 0.70 representing a strong structure and a valid grouping; and a silhouette score around 0.50 342 illustrating that a reasonable structure has been found within the data and that the clustering is acceptable 343 [59]. Detailed silhouette coefficients for each sample was then visualised using a silhouette plot in Python 344 with the silhouette_samples() function from the sklearn.metrics module ( Fig 1C). As all 345 clusterings require manual review to validate appropriateness [37], the high-dimensional data was reduced 346 and positioned in a two-dimension map using t-distributed Stochastic Neighbour Embedding (t-SNE) [39]. 347 Once mapped, the appropriateness of the clustering was verified through manual visual inspection. Review 348 ensured there were acceptable levels of cohesion between members of the same group and separateness 349 between members of different groups. t-SNE was undertaken using the TSNE() function from the 350 sklearn.manifold module found within Python. When performing t-SNE, the user needs to define the 351 perplexity (an estimate number of nearest neighbours for each datapoint), with larger datasets requiring a 352 larger perplexity [41]. Consequently, perplexity was fine-tuned to show global geometry. 353 Once the optimum algorithm and parameters were determined by silhouette score and inspection of t-SNE 354 plot, findings were applied to the BCPA output. Python's scikit-learn library was used as it is an efficient 355 means of building standard machine learning models [60]. (Other packages, such as R's class, are available 356 when implementing unsupervised machine learning, however different packages may produce different 357 results.) 358

359
The final stage of analysis is to interpret and label clusters [35]. Although the naming of individual clusters 360 can be systematised, its final interpretation is ultimately a somewhat subjective process. Interpretation here 361 entailed scrutiny of the attributes of each cluster. Taking each group's mean velocity, standard deviation, 362 autocorrelation and duration, an expert analysed and named the behaviour associated with the movement 363 classes. This initial understanding was then confirmed and refined through visual inspection of 364 representative samples. 'Representative' was defined as those samples found at the centre point of a group 365 and such datapoints were deemed to be the most typical of that behaviour class [33,34]. Consequently, 366 interpretation of results was bolstered through centroid analysis, with those datapoints at the heart of each 367 group's t-SNE mapping, or as close to the centre as possible, isolated and visually inspected by an expert. 368 Multiple such examples close to the centre were isolated and inspected, thereby confirming, or refining, the 369 initial analysis. 370 Using both the analysis of descriptive statistics and centroid analysis, an expert was able to interpret the 371 broad behavioural type of each cluster and attach a sensible label. If no label was able to be attached, either 372 because no behaviour is being demonstrated or behaviours are spread between clusters, the full workflow 373 was undertaken again. By manipulating the user defined settings during BCPA or dimensionality reduction, 374 significant changes in clustering can result. Consequently, these parameters were fine-tuned to optimise 375 performance and final behaviour identification. 376

377
To ascertain the accuracy of the new method, its performance was externally validated by comparing results 378 concerning the difference in flight patterns of An. gambiae s.s. around human-baited insecticide-treated 379 bednets (ITNs) and untreated bednets. Findings generated by the new, computation-derived, method were 380 contrasted with those from a previous study that employed the existing, expert-defined, method for 381 behavioural classification. Although a standard method of external validation is through comparison against 382 an a priori dataset, testing whether a known, true, clustering can be recreated [38], this was not possible in 383 this instance. There is no known, incontrovertibly true, grouping for this An. gambiae s.s. behaviour and 384 therefore no such comparison can be made. Consequently, external validation was established via 385 confirmation with previous results. That is, the workflow's accuracy as a method was corroborated by 386 comparing its conclusions to those already present in the literature [9] to verify whether prior findings could 387 be replicated.