Milo: differential abundance testing on single-cell data using k-NN graphs

Single-cell omic protocols applied to disease, development or mechanistic studies can reveal the emergence of aberrant cell states or changes in differentiation. These perturbations can manifest as a shift in the abundance of cells associated with a biological condition. Current computational workflows for comparative analyses typically use discrete clusters as input when testing for differential abundance between experimental conditions. However, clusters are not always an optimal representation of the biological manifold on which cells lie, especially in the context of continuous differentiation trajectories. To overcome these barriers to discovery, we present Milo, a flexible and scalable statistical framework that performs differential abundance testing by assigning cells to partially overlapping neighbourhoods on a k-nearest neighbour graph. Our method samples and refines neighbourhoods across the graph and leverages the flexibility of generalized linear models, making it applicable to a wide range of experimental settings. Using simulations, we show that Milo is both robust and sensitive, and can reveal subtle but important cell state perturbations that are obscured by discretizing cells into clusters. We illustrate the power of Milo by identifying the perturbed differentiation during ageing of a lineage-biased thymic epithelial precursor state and by uncovering extensive perturbation to multiple lineages in human cirrhotic liver. Milo is provided as an open-source R software package with documentation and tutorials at https://github.com/MarioniLab/miloR.

conditions using graph neighbourhoods (Fig 1). Our computational approach allows [8], which leads to high coverage of the graph while simultaneously controlling the number of 80 neighbourhoods that need to be tested. For each neighbourhood we then perform hypothesis 81 testing between biological conditions to identify differentially abundant cell states whilst 82 controlling the FDR across the graph neighbourhoods.

84
Our method works on a k-NN graph that represents the high-dimensional relationships 85 between single-cells, a common scaffold for many single-cell analyses [1-4] ( Fig 1A). The first 86 step in our method is to define a set of representative neighbourhoods on the k-NN graph,

87
where a neighbourhood is defined as the group of cells that are connected to an index cell by 88 an edge in the graph. Consequently, we need to sample a subset of single-cells to use as 89 neighbourhood indices. Adopting a purely random sampling approach means that the number 90 of neighbourhoods required to sample a fixed proportion of cells scales linearly with the total 91 number of index cells (Supp Fig 1B). This leads to an increased multiple testing burden, with 92 the potential to reduce statistical power. To solve this problem we have implemented a refined To illustrate the power and accuracy of Milo we first simulated 100 independent continuous of cells in a specific region of each trajectory from condition 'B'. Moreover, we assigned cells 134 to one of 3 replicates per condition, thus mimicking a common experimental design. These 135 simulated data sets provide a ground truth against which the performance of differential 136 abundance testing approaches can be compared (Fig 2A).

138
As well as Milo, we applied two methods designed for differential abundance testing using 139 single-cell data to these simulated datasets: Cydar, originally designed to model differential 140 abundance in mass cytometry data [6], and DAseq, which utilises a logistic classifier to predict 141 which cells are from single-cell DA subpopulations represented by a reduced dimensional 142 space [7] (Fig 2B). In addition, we applied the current best-practise analysis strategy for single-143 cell analysis: graph-clustering followed by differential abundance testing of clusters between 144 conditions. For this approach we applied 2 commonly used community detection algorithms:

145
Louvain [15] and Walktrap [16]. We modelled the differential abundance of clusters from these 146 algorithms using a NB GLM, as implemented in edgeR [9]. To ensure comparability between methods we used the same reduced dimensional space as the input for all methods and the 148 same parameter values, where these were shared, e.g. the value of 'k' for k-NN graph building.

149
Where parameters were specific to a method, we made use of the recommended practise by 150 the method developers to select an appropriate value (Supp Table 1).

152
For each simulated dataset we computed the confusion matrix of each method against the 153 ground truth, and calculated a number of common summary statistics (Fig 2C, Supp Fig 5), 154 enabling an assessment of how well each method performs across a variety of metrics. To generate a single value for comparing methods, and integrate across the four categories of a the additional gains of modelling cell states using overlapping neighbourhoods (Fig 2C, Supp  Fig 5). This was confirmed when examining the MCC, where we observed that Milo yielded 162 the highest median correlation (0.85) and lowest variance ( Fig 2D). Conversely, the clustering-163 based methods resulted in highly variable MCC values, illustrating the sensitivity of these 164 approaches to the input data set. In sum, our simulations demonstrate that Milo circumvents 165 a common bottleneck in single-cell analyses: the need to perform iterative rounds of 166 community detection to achieve an optimal clustering prior to differential abundance testing. The benchmarking dataset is fairly typical in size for current single-cell experiments. However, 171 moving forward, the number of cells assayed is likely to increase with advances in 172 experimental sample multiplexing [18,19]. As such, we tested the scalability of the Milo 173 workflow, and profiled the memory usage across multiple steps. For this we ran Milo on 3 174 published datasets of differing sizes from ~2000 to ~130,000 cells, representing differences 175 in both biological and experimental complexity [2-4], as well as a dataset of 200,000 simulated single-cells from a linear trajectory (see Methods). Using these 4 data sets we measured the amount of time required to execute the Milo workflow from graph-building through to differential abundance testing (Fig 3A). In parallel, we profiled the amount of memory used 179 across the entire workflow ( Fig 3B) and at each defined step (Supp Fig 6). Notably, the amount 180 of time taken increased linearly with the total size of the data set (Fig 3A), which for a large of the Milo workflow scaled primarily with the size of the input dataset (Fig 3B), indicating that 183 the complexity and composition of the single-cells largely determines the memory 184 requirements (Supp Fig 6). Importantly, these memory requirements are within the resources 185 of common desktop computers (i.e. <16GB). This benchmarking analysis demonstrates that

186
Milo is able to perform differential abundance analysis in large and complex datasets at a 187 scale and speed that is feasible on a desktop computer.

192
which were previously clustered into 9 distinct TEC subtypes ( Fig 4A) [3]. These data, 193 generated using plate-based SMART-seq2, consist of 2327 single-cells equally sampled from 194 mice at 5 different ages: 1, 4, 16, 32 and 52 weeks old ( Fig 4B). Moreover, the experimental 195 design included 5 replicate experimental samples of cells for each age. The goal of the study 196 was to identify TEC subtypes that change in frequency during natural ageing.

198
To this end, we first constructed a k-NN graph, before assigning cells to 363 neighbourhoods, 199 which were then used to test for differential abundance of TEC states across time. At a 10% 200 FDR, we identified 217 DA neighbourhoods (112 showed a decreased abundance with age, 201 105 an increased abundance with age) spanning multiple TEC states (Fig 4C). We compared 202 our results to those generated in the original publication, which demonstrated that we were 203 able to identify the same DA states (Fig 4D), including changes in the abundance of the 'sTEC' 204 population, which consisted of just 24 cells. Moreover, whilst we recovered the previously reported accumulation of Intertypical TEC with age, we also identified a novel subset of these

219
Therefore, these analyses demonstrate the sensitivity of Milo by identifying that a mTEC 220 progenitor state is depleted with age, a finding that was not resolved using clustering

226
The original study assigned cells to multiple lineages, including immune, endothelial and 227 mesenchymal cells (Fig 5A-B). A key goal of the study was to ask whether different cell types 228 were differentially abundant between experimental samples taken from healthy and cirrhotic 229 tissue. In the original study, cells from each lineage were sub-clustered and these sub-clusters 230 were interrogated using a Poisson GLM to identify whether there were differential contributions 231 from cirrhotic and healthy donors.
To explore whether more subtle differences could be detected, we applied Milo analysis to 234 2696 neighbourhoods spanning the k-NN graph and identified 1404 neighbourhoods with 235 differential abundance (10% FDR; Fig 5C). To assess performance, we compared DA results 236 with those from the compositional analysis performed by Ramachandran et al. [2]. Milo 237 recovered DA neighbourhoods in all clusters identified as differentially abundant between 238 cirrhotic or uninjured tissue in the original study ( Fig 5D).

240
Moreover, Milo identified multiple groups of neighbourhoods within the same pre-defined sub-241 clusters that showed opposing directions of differential abundance between the control and 242 cirrhotic liver experimental samples ( Fig 5D). In other words, within a sub-cluster, some 243 neighbourhoods were enriched for control experimental samples whilst others were enriched 244 for disease experimental samples. These patterns, exemplified by the T cell (2) and the endothelial (5) compartments were obscured in the previous study due to the reliance on pre-246 clustering ( Fig 5D).

248
To further explore the biological meaning of these neighbourhoods, we first focused on the 249 hepatic endothelial cells, where we resolved disease specific subpopulations at higher

265
Milo also identified strong DA between healthy and cirrhotic cells in lineages that were 266 unexplored in the original study, such as the cholangiocyte compartment ( Fig 5D).  The definition of neighbourhoods, as implemented in Milo, overcomes the main limitations of 288 standard-of-practice clustering-based DA analysis, whilst utilising a common data-structure in 289 single-cell analysis -graphs. A strength of our approach is that it is applicable to a wide range 290 of datasets with vastly different topologies, including gradual state transitions, thus removing 291 the need for time-consuming iterative sub-clustering and identifying subtle differences in 292 differential abundance that would otherwise be obscured ( Fig 5D).

294
Recently, other clustering-free methods have been proposed to detect compositional

435
With this large simulation we down-sampled to specific proportions, ranging from 1 to 100%, 436 and recorded the elapsed system time to complete the Milo workflow using the system.time

441
To assess the memory usage of the Milo workflow we made use of the Rprof function in R 442 to record the total amount of memory used at each step. We followed the same approach as above, down-sampling simulated and published datasets from 1 to 100% of the total cell numbers. All memory usage is reported in megabytes (MB).       Assign cells to neighbourhoods Test neighbourhoods for differential abundance