Models of Throughput for Multi-Cell, Multi-Type Droplet Microfluidics

New experimental platforms encapsulate multiple cells per microfluidic droplet, with each cell belonging to one of multiple possible types. The motivating example comes from microbial ecology, where we want to observe the interactions of microbial strains. Because droplets are formed randomly, we want to accurately predict the data throughput, the numbers of droplets containing desired combinations of cell types. Herein I identify the default statistical model for predicting the data throughput of multi-cell, multi-type droplet microfluidics experiments, which fits to cell type count data. I explain the assumptions behind this model and issues that in practice may cause these assumptions to fail. One such issue, “compositional heterogeneity”, is unique to multi-type experiments. I show how to modify the default statistical model to describe the consequences of these issues, without needing to mechanistically model their causes. In practice, only two of these issues may substantially change the data throughput predictions. The changes depend on both (1) which combination of these issues are present, and (2) the precise definition of data throughput. Finally, I show that for a given experimental platform one can estimate the severity of these two issues, enabling more accurate data throughput predictions that account for these two issues.


27
This paper highlights the work presented in Part II of the PhD Thesis (Krinsman, 2022). Notation 48 Let S denote the total number of cell types in the sampling pool. In the motivating microbial ecology 49 example, "S" for "strains" is the total number of microbial strains in the microbial community sample.
where for each type s ("s" for "strain") of the S total types, N (s) (0) is the random variable equalling the 57 number of cells from type s at time 0.

58
For each cell type s, let f (s) (" f " for "frequency") denote the relative abundance of cell type s in the 59 sampling pool. In particular, we necessarily have that ∑ S s=1 f (s) = 1. s 2 but not type s 1 . With these groups, we can quantify the effect of type s 1 on type s 2 by targeting an 91 average treatment effect (ATE) or other similar causal estimand. 92 Even when we can implement estimators for such causal estimands, sometimes the results may be 93 unreliable. Consider e.g. a situation where our treatment group T (s 1 ,s 2 ) consists of only T (s 1 ,s 2 ) = 5 94 droplets and our control group C (s 1 ,s 2 ) consists of C (s 1 ,s 2 ) ≈ 50, 000 droplets. This example may seem 95 unrealistic, but it is actually optimistic compared to some of the disparities between the treatment and 96 control group sizes T (s 1 ,s 2 ) , C (s 1 ,s 2 ) that one may expect to see in practice. Cf. the discussion in (Krinsman,97 2022, Appendix 2.C). Regardless of how we choose to quantify interaction effects, we need to estimate 98 the reliability of our estimates in order for them to be scientifically useful. We need to estimate their 99 statistical power P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ).

100
Of course (even for a fixed causal estimand) the statistical power P function will depend on the 101 particular estimator, so at this level of generality we cannot complete such a calculation. However, a 102 priori we do also know that the statistical power P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ) will always depend on the sizes 103 T (s 1 ,s 2 ) , C (s 1 ,s 2 ) of the treatment group and control group, respectively. Thus, to estimate statistical power 104 P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ), we always need to know the treatment group and control group sizes T (s 1 ,s 2 ) , C (s 1 ,s 2 ) , 105 regardless of the particular estimator.

106
Unfortunately, due to the random nature of droplet formation in droplet-based microfluidics experi-107 ments, the scientist can never directly control the treatment group and control group sizes T (s 1 ,s 2 ) , C (s 1 ,s 2 ) .

108
The treatment group and control group sizes T (s 1 ,s 2 ) , C (s 1 ,s 2 ) are random variables. Therefore, for any 109 estimator of any causal estimand, the statistical power P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ) will itself be a random variable.

110
This is not inherently a "deal-breaker", but we do still need to constrain the possibilities for the distribution 111 of the statistical power P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ). At the very least, we would like to be able to compute the 112 expected statistical power E[P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) )].

113
Specifically, for a given estimator of a given causal estimand, the statistical power P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ) 114 will be a deterministic function P of the treatment and control group sizes T (s 1 ,s 2 ) , C (s 1 ,s 2 ) (and possibly 115 other factors which for simplicity we will assume to be known and fixed). If we know specific values 116 T (s 1 ,s 2 ) = T (s 1 ,s 2 ) , C (s 1 ,s 2 ) = C (s 1 ,s 2 ) for the treatment and control group sizes T (s 1 ,s 2 ) , C (s 1 ,s 2 ) , then we 117 can evaluate this statistical power function P and get a deterministic value P(T (s 1 ,s 2 ) ,C (s 1 ,s 2 ) ). If 118 instead the treatment and control group sizes T (s 1 ,s 2 ) , C (s 1 ,s 2 ) are random variables, then when evaluating 119 this statistical power function we get a new random variable P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ) whose distribution is 120 determined by the pushforward measure.

121
Obviously, without specifying the particular causal estimand and its particular estimator, we still 122 cannot complete the calculation of the expectation E P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ) of a random statistical power 123 P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ). Nevertheless, for any given estimator of any given causal estimand, the distribution 124 of the statistical power P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ) will be determined by the distribution of the treatment group 125 and control group sizes T (s 1 ,s 2 ) , C (s 1 ,s 2 ) . Thus, for any estimator of any causal estimand, as a prerequisite 126 for being able to estimate expected statistical power E P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ) , we must first understand 127 the distribution of the treatment group and control group sizes T (s 1 ,s 2 ) , C (s 1 ,s 2 ) .

128
Note that the treatment group and control group sizes T (s 1 ,s 2 ) , C (s 1 ,s 2 ) can be understood as sums of 129 indicator random variables: To completely characterize the distribution of indicator random variables it suffices to know their expecta-131 tions, i.e. the probabilities: If we assume the droplets are identically distributed, then linearity of expectation allows us to compute we don't want to make additional assumptions about the distributions of the droplets.

137
However, if we additionally assume that the droplets are mutually independent, then we can compute 138 the distributions of the treatment and control group sizes as marginals of a Multinomial distribution 1 .

139
Given the statistical power function P, i.e. choice of a particular causal estimand and estimator thereof, 140 this would then in principle 2 be enough information to compute the distribution of the statistical power 141 P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ) or any summary statistic thereof.

142
One might object to treating the statistical power P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ) as a random variable. Certainly, 143 after the experiment has been run, the treatment group and control group sizes T (s 1 ,s 2 ) , C (s 1 ,s 2 ) will be 144 known, allowing us to condition on their observed values T (s 1 ,s 2 ) ,C (s 1 ,s 2 ) and "restore determinism" by is insufficient to help the scientist before running the experiment. The scientist has no direct control in 147 advance over what exactly the treatment group and control group sizes T (s 1 ,s 2 ) , C (s 1 ,s 2 ) will turn out to be. power P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ) they will most likely be able to achieve as a function of the deterministic 155 factors that the scientist can directly control. From that perspective it is inescapable that

157
(2) the most basic prerequisite for constraining the distribution of the statistical power P(T (s 1 ,s 2 ) , C (s 1 ,s 2 ) ) 158 is to characterize the distributions of the treatment group and control group sizes T (s 1 ,s 2 ) , C (s 1 ,s 2 ) , (1) · · ·n 1 Cf. sections S3.2.2 and S4.2.2. 2 In practice, given realistic problem sizes, evaluating this exact formula would most likely be intractable, such that one would probably also need to use additional heuristics and approximations. The choice of those heuristics and approximations would depend on the particular causal estimand and estimator thereof, and thus is outside the scope of this discussion.

5/20
correspond to mutually independent Poisson-distributed random variables, is Heuristically speaking, equation (7) says that during the formation of each droplet the sampling rate λ is 179 evenly spread out among the multiple types according to their frequencies in the population. • For every type s, the number of cells in the population which belong to type s is very large.

184
• Each cell (regardless of its type) has the same small probability of ending up in the droplet as any 185 other cell.

186
• Whether a given cell (regardless of its type) ends up in the droplet is completely independent of 187 what happens to any other cell.

188
The first can be thought of as stating that the population size for each type is effectively infinite. The 189 second and third can together be thought of as stating that the sampling pool is perfectly homogeneous.

191
While hPoMu is definitely a sensible working model to start with, its implicit assumptions failing to be 192 realistic could undermine its usefulness in practice.

193
Populations are Finite 194 We assumed that for each individual type s the number of cells is "effectively infinite". For "rare" types 195 it is a priori unclear whether this assumption is reasonable. However in practice, even for types with  The goal for these experiments is to have one cell per droplet, whence the name "single-cell". However, because the droplet formation process is random, in practice some droplets in "single-cell" experiments may still end up with more than one cell.

7/20
"compositional heterogeneity". Higher compositional heterogeneity implies more variance in the relative 214 abundances for the cell types throughout the sampling pool. Cf. figures 4 and 5.

(8)
In general the marginal distributions of members of the ghNBDM family have non-zero cross-covariance.

240
Thus the entries N (s) (0) of the random vector N(0) will in general not be mutually independent. This is 241 unlike the default working model (hPoMu), cf. again equation (7).

242
Notice how this working model has two additional parameters compared to the hPoMu working 243 model, namely the density concentration ζ D parameter and the compositional concentration ζ C parameter.

244
Higher values of the density concentration ζ D parameter correspond to lower density heterogeneity (cf.

9/20
For all simulations and distributions, the same "simulated community" of 91 types was used. These 264 correspond to 90 types distributed across 9 distinct relative abundances: 10 types for each of .01%, .02%,

266
All distributions used the same value of the rate parameter, λ = 2, the expected total number of cells for The procedure was then otherwise exactly the same as for the gluttonous groups.

293
Compositional heterogeneity, which is unique to multi-type experiments, differently affects gluttonous and 294 picky data throughput, depending on the presence or absence of density heterogeneity. These differences

328
The Effect of Heterogeneity on Data Throughput is Equivocal 329 The general effect that heterogeneities will have on the data throughput of multi-cell, multi-type droplet 330 microfluidics experiments is not straightforward. The answer depends on the particular combination of is negative. Both the numbers of picky treatments and gluttonous treatments decrease. Regardless of 334 whether one uses picky or gluttonous controls, the number of treatments is still the limiting "bottleneck".

335
Given only compositional heterogeneity but no density heterogeneity, whether the effect on data 336 throughput is positive depends primarily on our opinion of what is useful data. If our primary concern 337 is avoiding "data starvation" and being able to say anything about the interactions of the rarest types, 338 then the strong decrease in the number of gluttonous treatment groups clearly means that the effect of 339 heterogeneity is negative. On the other hand, if we consider only the data from the picky treatments useful, 340 then in the absence of density heterogeneity the effect of compositional heterogeneity is slightly positive.

341
The Estimators are Good Enough for Use in Practice

342
The above results show that the proposed estimators both (1) converge towards correct answers as the 343 data size increases and (2) have decreasing variance as the data size increases. Therefore they satisfy the 344 minimal requirements one generally expects of "decent" or "adequate" statistical estimators.

345
The estimators were already guaranteed to be asymptotically consistent when the statistical model We need to be able to estimate both density and compositional heterogeneity from empirical data produced 353 by multi-cell, multi-type droplet microfluidics experiments in order to predict how these heterogeneities 354 might affect the throughput of a given experimental platform. We have viable statistical estimators for 355 both density heterogeneity and compositional heterogeneity. Therefore, neither quantity needs to be 356 treated as an "unfathomable unknown". We are able to actively account for the effects of both. Given 357 empirical data, even of only moderate size, we can get estimates for concentration parameters describing 358 levels of heterogeneity that are realistic for the experimental platform.

359
Having some sense for the amount of either type of heterogeneity one is likely to encounter in practice 360 for any given experimental platform is important. It enables more accurately predicting the throughput 361 from multi-cell, multi-type droplet microfluidics experiments. More accurate predictions of throughput 362 will make droplet microfluidics experiments more viable alternatives to manually co-culturing multiple 363 types of cells. More accurate predictions of throughput are also necessary to accurately simulate multi-cell, 364 multi-type droplet microfluidics experiments. Accurate simulations in turn will help us to identify the 365 statistical analyses that perform best for the novel kinds of datasets produced by multi-cell, multi-type 366 droplet microfluidics experiments. Identifying best performing statistical analyses for these experiments 367 will further make them more viable alternatives to manually co-culturing multiple types of cells.