## Abstract

**Motivation** New technologies allow for the elaborate measurement of different traits of single cells. These data promise to elucidate intra-cellular networks in unprecedented detail and further help to improve treatment of diseases like cancer. However, cell populations can be very heterogeneous.

**Results** We developed a mixture of Nested Effects Models (M&NEM) for single-cell data to simultaneously identify different cellular sub-populations and their corresponding causal networks to explain the heterogeneity in a cell population. For inference, we assign each cell to a network with a certain probability and iteratively update the optimal networks and cell probabilities in an Expectation Maximization scheme. We validate our method in the controlled setting of a simulation study and apply it to three data sets of pooled CRISPR screens generated previously by two novel experimental techniques, namely Crop-Seq and Perturb-Seq.

**Availability** The mixture Nested Effects Model (M&NEM) is available as the R-package mnem at https://github.com/cbgethz/mnem/.

**Contact** martin.pirkl{at}bsse.ethz.ch, niko.beerenwinkel{at}bsse.ethz.ch

## 1 Introduction

Understanding heterogeneous diseases like cancer on a molecular level is challenging, but also crucial for the improvement and development of therapies. Molecular intra-tumor heterogeneity is an important factor for cancer treatment (Sun, 2015; Prasetyanti and Medema, 2017). Treatments often assume cancer to be homogeneous across cells. However, if different cell types are resistant to different treatments, the success of current treatment strategies is limited.

Signaling pathways, and how they are causally wired in healthy and diseased cells, are a key component of the molecular landscape. De-regulation of pathways in diseased cells is prevalent (Mao, 2012; Giancotti, 2014), and different mathematical methods have been developed to study it. Several algorithms have been proposed to analyze causal interactions of genes from different types of data (Friedman *et al.*, 2000; Nachman *et al.*, 2004; Margolin *et al.*, 2006; Kalisch and Bühlmann, 2007). Nested Effects Models (NEM; Markowetz *et al.*, 2005, 2007) infer pathways from perturbation data. In each experiment, one protein in the pathway is knocked down and a multi-trait read-out is produced, e.g., gene expression or cell imaging data (Siebourg-Polster *et al.*, 2015). If the expression of a gene changes during the knock-down compared to the unperturbed control, the knock-down has an effect on the gene and the gene responds to the knock-down. If the genes responding to the knock-down of protein B are a subset of the genes responding to the knock-down of protein A, NEMs will place A upstream of B in the pathway and a causal edge from A to B is inferred.

NEMs have been successfully applied to different biological data sets to infer the causal network of signaling pathways (Markowetz *et al.*, 2005; Froehlich *et al.*, 2009; MacNeil *et al.*, 2015). Several extensions of NEMs have been developed, e.g. to account for hidden variables (Sadeh *et al.*, 2013). Epistatic Nested Effects Models (Pirkl *et al.*, 2017) systematically infer epistasis from double knock-down screens. Boolean Nested Effects Models (Pirkl *et al.*, 2016) make use of arbitrary combinations of knock-downs and knock-ins per experiment to infer a full boolean network and additionally integrate literature knowledge. Dynamic Nested Effects Models (Anchang *et al.*, 2009; Froehlich *et al.*, 2011) infer the rate of the signal flow within the network from time series data, while Hidden Markov Nested Effects Models (Wang *et al.*, 2014) model the evolution of the network itself during a time course. NEMix (Siebourg-Polster *et al.*, 2015) introduces a hidden variable to account for unobserved pathway activation.

The arrival of single-cell technologies provides new opportunities to improve resolution and account for heterogeneity in a population of cells. Pooled CRISPR screens enable gene expression measurements for thousands of cells with each cell having been the target of a CRISPR modification, i.e. a knock-down (Dixit *et al.*, 2016; Datlinger *et al.*, 2017). However, the heterogeneity in cell populations measured with single-cell technologies remains an open problem and there is a need for methods tailored to this new type of data.

Motivated by evidence that causal signaling pathways can be wired differently in sub-populations of cells (Gaudet and Miller-Jensen, 2016), we introduce a mixture model which simultaneously infers different sub-populations of cells across knock-downs and a causal network of the perturbed genes (Fig. 1). Cells are clustered softly rather than hard, such that each cell has a certain probability of being generated by each network (component). This probability defines how much a cell contributes to the network inference for each component.

We show that Mixture Nested Effects Models (M&NEMs) work well in the controlled setting of a simulation study and apply our method to three data sets from two different pooled CRISPR screens based on Crop-Seq (Datlinger *et al.*, 2017) and Perturb-Seq (Dixit *et al.*, 2016). In those screens thousands of cells were pooled and each transfected with a different sgRNA to knock-out a specific gene. Gene expression data was generated by single cell RNA-Seq. For the Crop-Seq screen we concentrated on one data set investigating the T-cell receptor pathway in the T-Cell leukemia derived Jurkat cell line and key regulators DOK2, EGR3, LAT, LCK, PTPN6, PTPN11 and ZAP70. From the Perturb-Seq screen we model the causal interplay of cell cycle genes in one data set and transcription factors in another data set. Both data sets of the Perturb-Seq screens are derived from K562 leukemia cells.

## 2 Overview

In this section we review the original Nested Effects Model and extend it to a mixture of NEMs. Furthermore, we discuss identifiability and propose a method for model selection to prevent overfitting.

### 2.1 Nested Effects Model

A Nested Effects Model (NEM) is parametrized by an adjacency matrix $\Phi \in M_{n \times n}(\{0, 1\})$ for the directed acyclic graph (DAG) representation of the signaling graph with perturbed genes as nodes (S-genes) and an adjacency matrix $\Theta \in M_{n \times m}(\{0, 1\})$ for the attachments of the different features from the data (E-genes), e.g., genes from gene expression data. $\theta_{ij} = 1$ if E-gene $j$ is attached to S-gene $i$. Each column of $\Theta$ has at most one non-zero entry, because NEMs make the assumption that each E-gene can have at most one parent. Similar to Tresch and Markowetz (2008) we add a null S-gene, which predicts no effects, to account for uninformative features.

We calculate the expected E-gene profiles for a given model $(\Phi, \Theta)$ as the matrix product

$$F = \Phi\Theta = (f_{ij}) \tag{1}$$

with $f_{ij}$ the predicted state of E-gene $j$ in knock-down $i$.

Let $D = (d_{ij})$ be the raw data matrix of the perturbation experiments and $R = (r_{ij})$ the log ratio matrix with perturbed cells indexing the columns and observed genes indexing the rows,

$$r_{ij} = \log \frac{P(d_{ij} \mid e_{ij} = 1)}{P(d_{ij} \mid e_{ij} = 0)}$$

with $e_{ij}$ the unknown state of E-gene $i$ in knock-down $j$. As in Tresch and Markowetz (2008) we can write the log likelihood ratio of a given model $(\Phi, \Theta)$ and the null model $N$, which predicts no effects, as

$$\log \frac{P(D \mid \Phi, \Theta)}{P(D \mid N)} = \operatorname{tr}(FR) \tag{2}$$

where tr denotes the trace of a square matrix. However, $FR$ is only square if the data includes only one cell per knock-down, i.e. $l = n$. Hence, the data has to be summarized beforehand, e.g., by taking the average over all experiments with the same knock-down (replicates).
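As a minimal sketch of the matrix product and trace described above, the following Python snippet scores a toy two-gene pathway. The function names and toy matrices are illustrative, and the snippet assumes that $\Phi$ is supplied transitively closed with self-loops, so that a knock-down also affects E-genes attached to downstream S-genes. It is not taken from the mnem package.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def expected_profiles(phi, theta):
    """Expected E-gene profiles F = Phi * Theta (rows: knock-downs, columns: E-genes)."""
    return matmul(phi, theta)


def log_likelihood_ratio(phi, theta, r):
    """tr(F R) for an m x n log ratio matrix R with one column per knock-down."""
    fr = matmul(expected_profiles(phi, theta), r)
    return sum(fr[i][i] for i in range(len(fr)))


# Toy pathway A -> B; assumed transitively closed with self-loops.
phi = [[1, 1],
       [0, 1]]
# Three E-genes: the first is attached to A, the other two to B.
theta = [[1, 0, 0],
         [0, 1, 1]]
# Rows: E-genes, columns: knock-downs of A and B. Positive values support an effect.
r = [[2.0, -1.0],
     [1.5, 2.0],
     [1.0, 1.5]]

llr = log_likelihood_ratio(phi, theta, r)
```

In this toy case the knock-down of A is expected to affect all three E-genes and the knock-down of B only the last two, so the trace adds up exactly those entries of `r`.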

### 2.2 Mixture Nested Effects Model

Instead of inferring a single network Φ and E-gene attachments Θ from the whole data set as in the previous section, we formulate a mixture, which infers several networks with unique attachments and different sub-populations of cells.

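The model parameters for a mixture of $K$ components (**Φ**, **Θ**) are

$$\boldsymbol{\Phi} = (\Phi_1, \dots, \Phi_K), \quad \boldsymbol{\Theta} = (\Theta_1, \dots, \Theta_K).$$

Given a component $(\Phi_k, \Theta_k)$ we calculate the expected knock-down profiles for all single perturbations using Eq. 1 as

$$F_k = \Phi_k\Theta_k = (f_{k,ij})$$

with $f_{k,ij}$ the expected value of E-gene $j$ under the perturbation of S-gene $i$ in component $k$.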

The log ratio profile of all cells given component *k* is
and the log likelihood ratio of component *k* is

Let $Z \in M_{K \times l}(\{0, 1\})$ be a matrix for the hidden cell attachments to the components, with $z_{ki} = 1$ if cell $i$ belongs to component $k$. Each column of $Z$ has exactly one non-zero entry. The distribution of $Z$ is defined by the mixing coefficients $\pi_k$ as

$$P(z_{ki} = 1) = \pi_k$$

for all $i \in \{1, \dots, l\}$ with $\pi = (\pi_1, \dots, \pi_K)$ and $\sum_{k=1}^{K} \pi_k = 1$.

**Log likelihood of the mixture**. For model optimization we choose a maximum likelihood (ML) approach using the log likelihood ratios similarly to the formulation for a single mixture component,

The full derivation of the likelihood ratio is in Eq. S1 of the supplement.

### 2.3 Inference with an Expectation Maximization algorithm

We developed an Expectation Maximization scheme (Dempster *et al.*, 1977) for inference.

**E step**. Let *π*, (**Φ, Θ**) be the current parametrization of our mixture model. We calculate *L _{k}* from Eq. 3 with
with cell and component specific weights $\gamma_{kj}$ substituted for $R$ for every component $k$ and subsequently the responsibilities (supplement, Eq. (S2)), which we summarise in $\Gamma$, and the log likelihood ratio (Eq. 4).
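The responsibility computation of the E step can be sketched as follows. The exact expression is Eq. (S2) of the supplement and is not reproduced here, so this sketch assumes the standard mixture-model form, in which the responsibility of component $k$ for cell $j$ is proportional to $\pi_k$ times the exponentiated log likelihood ratio of that cell under component $k$. The function name and the input layout are hypothetical.

```python
import math


def responsibilities(pi, cell_llr):
    """Assumed standard form: gamma[k][j] is proportional to pi[k] * exp(cell_llr[k][j]).

    pi:       K mixing weights
    cell_llr: K x l log likelihood ratios, one per component and cell
    Returns a K x l matrix whose columns sum to 1.
    """
    n_comp, n_cells = len(pi), len(cell_llr[0])
    gamma = [[0.0] * n_cells for _ in range(n_comp)]
    for j in range(n_cells):
        logs = [math.log(pi[k]) + cell_llr[k][j] for k in range(n_comp)]
        top = max(logs)  # subtract the maximum for numerical stability
        weights = [math.exp(v - top) for v in logs]
        total = sum(weights)
        for k in range(n_comp):
            gamma[k][j] = weights[k] / total
    return gamma


# Two components, three cells: the first cell is ambiguous, the other two are not.
gamma = responsibilities([0.5, 0.5], [[0.0, 10.0, -3.0],
                                      [0.0, -10.0, 3.0]])
```

Working on the log scale and subtracting the column maximum avoids overflow when the log likelihood ratios are large, which is typical for hundreds of E-genes per cell.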

**$M_{\Theta}$ step**. We update $\pi$ with

**Φ** remains fixed and we estimate **Θ** by their maximum a posteriori attachment to each S-gene. For this we use the known perturbation map $\rho = (\varrho_{ij})$ with $\varrho_{ij} = 1$ if cell $j$ has been perturbed by a knock-down of S-gene $i$. We compute the fit of every E-gene to every S-gene and set

We alternate between the E step and the $M_{\Theta}$ step until the log likelihood ratio in Eq. 4 converges.
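The two updates of this step can be illustrated with a short sketch. Both functions are assumptions for illustration: the weight update uses the standard EM rule (the average responsibility per component), and the attachment rule picks, for every E-gene, the S-gene with the largest positive summed log ratio over the cells in which an effect is expected. The network is assumed to be transitively closed with self-loops, and `None` stands for the null S-gene.

```python
def update_mixing_weights(gamma):
    """Average responsibility per component (standard EM update, assumed here)."""
    return [sum(row) / len(row) for row in gamma]


def attach_egenes(phi, rho, r):
    """Attach each E-gene to its best fitting S-gene, or to None (null S-gene).

    phi: n x n adjacency matrix (assumed transitively closed, with self-loops)
    rho: n x l perturbation map, rho[i][j] = 1 if cell j was perturbed in S-gene i
    r:   m x l log ratios (rows: E-genes, columns: cells)
    """
    n, n_cells = len(phi), len(rho[0])
    # effect[c][i] = 1 if an E-gene attached to S-gene i should respond in cell c
    effect = [[min(1, sum(rho[p][c] * phi[p][i] for p in range(n)))
               for i in range(n)] for c in range(n_cells)]
    attachment = []
    for row in r:
        fits = [sum(row[c] * effect[c][i] for c in range(n_cells)) for i in range(n)]
        best = max(range(n), key=lambda i: fits[i])
        attachment.append(best if fits[best] > 0 else None)
    return attachment


phi = [[1, 1],
       [0, 1]]                      # A -> B
rho = [[1, 1, 0, 0],
       [0, 0, 1, 1]]                # cells 0-1: knock-down of A, cells 2-3: of B
r = [[2.0, 1.5, -1.0, -2.0],        # responds to A only
     [1.0, 2.0, 1.5, 1.0],          # responds to A and B
     [-1.0, -1.0, -1.0, -2.0]]      # uninformative

attachment = attach_egenes(phi, rho, r)
weights = update_mixing_weights([[0.9, 0.8, 0.1, 0.2],
                                 [0.1, 0.2, 0.9, 0.8]])
```

In the mixture setting the log ratios passed to `attach_egenes` would be the responsibility-weighted data of the respective component.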

**M step**. Given $\Gamma$, we optimize each component $(\Phi_k, \Theta_k)$ with respect to $R_k$. We maximize the log likelihood ratio defined in Eq. 2 to find a new optimum in the following way.

We optimize each individual component with a natural extension of the module network approach by Froehlich *et al.* (2008). We cluster knock-downs, averaged over cells, into groups of size $n$ (e.g. $n = 5$) and perform a local neighborhood search on each group. In the local neighborhood search we evaluate each edge for absence and presence, check whether a change in status improves the log likelihood ratio, and change the edge which improves it most. We combine the inferred sub-networks into one large network including all S-genes and use it as the initial network for a local neighborhood search on the full set of S-genes. During the optimization of $\Phi_k$, we estimate $\Theta_k$ as in Eq. 6 before we calculate the log likelihood ratio.

We alternate between the E, $M_{\Theta}$ and M steps until the log likelihood ratio in Eq. 4 converges. To increase the probability of convergence to a global optimum, the EM algorithm is initialized several times with random responsibilities between 0 and 1.
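The local neighborhood search described above can be sketched as a greedy edge-toggling loop. This is a simplified illustration, not the package implementation: it omits the clustering of knock-downs into modules, scores a network by attaching every E-gene to its best fitting S-gene (or to none), and rejects candidates that would create a cycle. All names and the toy data are hypothetical.

```python
def transitive_closure(adj):
    """Floyd-Warshall style closure of a 0/1 adjacency matrix."""
    n = len(adj)
    closed = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if closed[i][k] and closed[k][j]:
                    closed[i][j] = 1
    return closed


def score(phi, r):
    """Sum over E-genes of the best non-negative fit to any S-gene.

    r: m x n log ratios averaged per knock-down (rows: E-genes).
    """
    n = len(phi)
    total = 0.0
    for row in r:
        fits = [sum(row[p] * phi[p][i] for p in range(n)) for i in range(n)]
        total += max(0.0, max(fits))
    return total


def greedy_search(r, n):
    """Toggle the single edge that improves the score most until nothing improves."""
    phi = [[int(i == j) for j in range(n)] for i in range(n)]  # self-loops only
    best = score(phi, r)
    while True:
        candidates = []
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                trial = [row[:] for row in phi]
                trial[i][j] = 1 - trial[i][j]
                trial = transitive_closure(trial)
                # skip networks with a cycle to stay within DAGs
                if any(trial[a][b] and trial[b][a] for a in range(n) for b in range(a)):
                    continue
                candidates.append((score(trial, r), trial))
        if not candidates:
            return phi, best
        top_score, top_phi = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            return phi, best
        phi, best = top_phi, top_score


# Noise-free toy data for the chain A -> B -> C with two E-genes per S-gene.
r = [[1, -1, -1], [1, -1, -1],
     [1, 1, -1], [1, 1, -1],
     [1, 1, 1], [1, 1, 1]]
phi_hat, best = greedy_search(r, 3)
```

On this noise-free example the search first adds the edge from A to B, then the edge from B to C, and stops at the transitively closed chain.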

### 2.4 Model identifiability

In the case of the original NEMs, two NEMs Φ_{1} and Φ_{2} are identical if and only if they have equal transitive closures, i.e. they produce identical data. This identity still holds for each component of a mixture of NEMs. However, mixture NEMs have additional identifiability issues.
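The equivalence of two networks with equal transitive closures can be checked directly. The following sketch uses hypothetical function names and three toy networks on the S-genes A, B and C.

```python
def transitive_closure(adj):
    """Floyd-Warshall style closure of a 0/1 adjacency matrix."""
    n = len(adj)
    closed = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if closed[i][k] and closed[k][j]:
                    closed[i][j] = 1
    return closed


def equivalent(phi1, phi2):
    """Two NEM graphs are indistinguishable if their transitive closures agree."""
    return transitive_closure(phi1) == transitive_closure(phi2)


chain = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]                # A -> B -> C
chain_with_shortcut = [[0, 1, 1],
                       [0, 0, 1],
                       [0, 0, 0]]  # A -> B -> C plus A -> C
fork = [[0, 1, 1],
        [0, 0, 0],
        [0, 0, 0]]                 # A -> B and A -> C

same = equivalent(chain, chain_with_shortcut)
different = equivalent(chain, fork)
```

The shortcut edge from A to C is already implied by the chain, so the first pair is equivalent, while the fork lacks the relation between B and C.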

In general, two M&NEMs are not distinguishable if they generate the same data. Let $F = (F_1, \dots, F_m)$ be the expected data pattern for M&NEM A, and consider likewise the expected data pattern for M&NEM B. If each column $f_v$ of $F$ is included in the pattern of B and each column of the pattern of B is included in $F$, A and B are not distinguishable.

Fig. 2 shows a schematic example for two identical mixtures (A,B) with different components. For convenience of this example we assume an a posteriori hard clustering of the cells to the components and equal attachments Θ_{1} = Θ_{2}. For two cell clusters we compute an optimal mixture of two NEMs (A). However, if we divide the same data into two different clusters, we compute an optimal mixture of NEMs (B), which differs from A. Nevertheless, both mixtures perfectly explain the same data and are therefore indistinguishable from each other.

### 2.5 Model selection

In a typical situation for M&NEMs we do not know the correct number of components *K*. To prevent over fitting and enforce sparsity to the solution, we choose the optimal *K* via a penalized log likelihood ratio, penalizing complex and redundant network structures in a similar fashion as Froehlich *et al.* (2007). For each *K* ∊ {1, …, 5} we infer an optimal solution using the EM. Then we score each of the five solutions with a penalized log likelihood ratio, which we define as
with a complexity parameter $s$, model log likelihood ratio $LLR$ (Eq. 4) and the sample size $n$ (number of cells). We define $s$ for a mixture of $K$ components as

$$s = \sum_{k=1}^{K} \left( |\Phi_k| + |\Theta_k| \right) + K - 1$$

with the number of edges of an adjacency matrix $A$ denoted by $|A|$. Thus the number of parameters $s$ counts all edges in the graphs of $\Phi_k$ and $\Theta_k$ plus one less than the number of mixture weights, since the last weight is determined by the others. Finally we choose the solution which minimizes the penalized log likelihood ratio. Fig. 3 shows the raw and the penalized log likelihood ratio as functions of the number of components for the data sets in our application.
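The parameter count and the selection of $K$ can be sketched as follows. The count follows the verbal definition above (edges of every $\Phi_k$ and $\Theta_k$ plus $K - 1$ free mixture weights). The penalty itself is an assumption of this sketch: it uses a BIC-style form, $-2 \cdot LLR + \log(n) \cdot s$, which matches the ingredients named in the text but should be checked against the original equation. Function names and the numbers in the example are made up.

```python
import math


def count_edges(adjacency):
    """Number of non-zero entries of a 0/1 adjacency matrix."""
    return sum(sum(row) for row in adjacency)


def complexity(phis, thetas):
    """Edges of all components plus K - 1 free mixture weights."""
    edges = sum(count_edges(p) + count_edges(t) for p, t in zip(phis, thetas))
    return edges + len(phis) - 1


def penalized_llr(llr, s, n_cells):
    """Assumed BIC-style penalty; smaller values are better."""
    return -2.0 * llr + math.log(n_cells) * s


# Two components with two S-genes and three E-genes each.
phis = [[[0, 1], [0, 0]],
        [[0, 0], [1, 0]]]
thetas = [[[1, 0, 0], [0, 1, 1]],
          [[0, 1, 0], [1, 0, 1]]]
s = complexity(phis, thetas)

# Hypothetical (LLR, s) pairs for K = 1, 2, 3 and 1000 cells.
fits = {1: (80.0, 5), 2: (100.0, 9), 3: (102.0, 14)}
best_k = min(fits, key=lambda k: penalized_llr(fits[k][0], fits[k][1], 1000))
```

In the made-up example the third component raises the log likelihood ratio only slightly while adding five parameters, so the penalized score prefers two components.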

### 2.6 Effect log-odds

We calculate log odds for the effects analogous to Siebourg-Polster *et al.* (2015). Let $d_{ij}$ be the normalized count value for gene $i$ and cell $j$, where cell $j$ was perturbed by a knock-down of gene $k$. We estimate the empirical distribution function $F_0$ of the normalized control counts for gene $i$ and the empirical distribution function $F_k$ of the normalized counts from cells perturbed by $k$ for gene $i$ and calculate the log odds by

If the E-gene shows a clear effect in the cell, *r _{ij}* will be greater than zero and if it shows no effect, it will be less than or equal to zero.

We remove E-genes with a standard deviation smaller than the global standard deviation over the whole data set, i.e. E-genes which have small log odds apart from outliers.
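The filtering rule can be written in a few lines. This sketch assumes the population standard deviation and a log odds matrix with E-genes in rows. The text does not state which estimator was used, and the function name is hypothetical.

```python
import statistics


def informative_egenes(r):
    """Indices of E-genes whose standard deviation reaches the global one.

    r: m x l log odds matrix (rows: E-genes, columns: cells).
    """
    global_sd = statistics.pstdev([value for row in r for value in row])
    return [i for i, row in enumerate(r) if statistics.pstdev(row) >= global_sd]


r = [[0.1, -0.1, 0.0, 0.1],    # nearly constant, removed
     [2.0, -2.0, 1.5, -1.5],   # strong effects, kept
     [0.2, 0.0, -0.2, 0.1]]    # nearly constant, removed
kept = informative_egenes(r)
```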

## 3 Simulations

We showed that M&NEMs work well in simulations under reasonable conditions. For *n* ∊ {3, 5, 10, 20} S-genes and *K* ∊ {1, 2, 3, 4, 5} we drew random mixture weights *π* and component(s) (**Φ, Θ**) as the ground truth. We simulated 1000 cells overall, two E-genes per S-gene and 10% uninformative E-genes. The simulated data were log odds with added Gaussian noise around −1 for no effect and 1 for effect. Fig. 4 shows the result of 100 runs and Gaussian noise *N* (0, *σ*) with *σ* ∊ {1, 2.5, 5}.
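The data generation for one component can be sketched as below. The sketch assumes a transitively closed network with self-loops and one knock-down per cell. The function name, the argument layout and the seeding are illustrative choices, not the simulation code used for Fig. 4.

```python
import random


def simulate_log_odds(phi, theta, knockdowns, sigma, seed=0):
    """Draw log odds around 1 (effect) or -1 (no effect) with Gaussian noise.

    phi:        n x n adjacency matrix (assumed transitively closed, with self-loops)
    theta:      n x m E-gene attachments; an all-zero column is an uninformative E-gene
    knockdowns: index of the perturbed S-gene for every cell
    Returns an m x l matrix (rows: E-genes, columns: cells).
    """
    rng = random.Random(seed)
    n, m = len(theta), len(theta[0])
    data = []
    for e in range(m):
        row = []
        for p in knockdowns:
            effect = any(phi[p][i] and theta[i][e] for i in range(n))
            row.append(rng.gauss(1.0 if effect else -1.0, sigma))
        data.append(row)
    return data


phi = [[1, 1],
       [0, 1]]                  # A -> B
theta = [[1, 0, 0],
         [0, 1, 0]]             # third E-gene is uninformative
noisy = simulate_log_odds(phi, theta, [0, 0, 1, 1], sigma=1.0, seed=1)
```

For a mixture, each cell would first be assigned to a component according to the mixture weights and then simulated from that component's network and attachments.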

We computed accuracy from the similarity of the ground truth **Φ** = $(\Phi_1, \dots, \Phi_K)$ and the inferred optimum. That is, we check how accurately we find a column from the ground truth **Φ** in the inferred optimum and vice versa with the following score, where $U$ and $V$ are the sets of columns with the same perturbation as $u$ and $v$, respectively, $\phi_i$ denotes a column of **Φ** (and analogously for the inferred optimum), and hd is the Hamming distance.

The simulations show that M&NEMs can identify the ground truth with high accuracy for reasonable noise levels and are still robust in settings with high noise over a varying number of components and S-genes. The accuracy for *K* and the mixture weights are shown in Fig. S1-S2 of the supplement.

## 4 Application to pooled single cell CRISPR screens

In our application of M&NEM we analyze three data sets which combine pooled CRISPR screening with single cell RNA-seq readouts. One data set was generated with Crop-Seq (Datlinger *et al.*, 2017) and the other two with Perturb-Seq (Dixit *et al.*, 2016).

### 4.1 CRISPR droplet sequencing (Crop-Seq)

Datlinger *et al.* (2017) combined pooled CRISPR screening with single-cell RNA sequencing to produce gene expression count data on the single-cell level. They showed the validity of their method with an analysis of T-cell receptor (TCR) activation in Jurkat cells. We downloaded the processed Crop-Seq data from the NCBI GEO database (Edgar *et al.*, 2002; GSE92872). We reduced the data to stimulated cells and to genes which have a median count greater than 0 over the remaining cells. We normalized the count data to counts per 10000, i.e. we divided each count by the sum of counts of its respective column and multiplied by 10000. Next we took the log of the normalized counts plus a pseudo count of 0.5 and calculated log odds (Eq. 9).
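The normalization step can be sketched as follows. The function name is hypothetical, the natural logarithm is assumed because the base is not stated, and cells with a total count of zero are assumed to have been removed beforehand.

```python
import math


def normalize_counts(counts, scale=10000, pseudo_count=0.5):
    """Counts per `scale` for every cell (column), then log with a pseudo count.

    counts: genes x cells matrix of raw counts; every column sum must be positive.
    """
    column_sums = [sum(column) for column in zip(*counts)]
    return [[math.log(value / total * scale + pseudo_count)
             for value, total in zip(row, column_sums)]
            for row in counts]


counts = [[1, 0],
          [3, 5]]   # two genes, two cells
normalized = normalize_counts(counts)
```

Here the first cell has four counts in total, so its single count for the first gene becomes 2500 counts per 10000 before the pseudo count is added.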

As a set of knock-outs we concentrated on S-genes involved in T-cell receptor activity as in Fig. 2, h of Datlinger *et al.* (2017), namely: DOK2, EGR3, LAT, LCK, PTPN6, PTPN11 and ZAP70. This leaves us with a population of 535 unique cells and 663 E-genes. Fig. 5 shows the highest scoring result with *K* = 2. Around 58% of cells are assigned to the red network and 42% to the blue. M&NEM confirms key down-stream regulators LCK and LAT (Datlinger *et al.*, 2017, Fig. 2, h). However, ZAP70 is placed more upstream, especially in the blue network. DOK2, on the other hand, is correctly placed as an upstream regulator in the red network (Datlinger *et al.*, 2017, Fig. 2, h), but placed downstream of everything else in the blue one, hinting at an altered causal role of DOK2 in the smaller cell population. PTPN6 and PTPN11 are placed as the main regulators in the red and blue network, respectively.

A posteriori a majority of 310 cells are attached to the red network. However, for DOK2 and PTPN6 the majority of cells for each knock-out are attached to the blue network, which explains the relatively high mixture weight of 42%. The responsibilities for each network are almost binary, 100% and 0%, respectively (Fig. 6, A). This is almost equivalent to a hard clustering of the cells, i.e. there is virtually no uncertainty of the cell attachments.

More detailed versions of the networks with E-gene/cell attachments for the two highest scoring results (*K* = 2 and *K* = 3) are shown in the supplement, Fig. S3-S4.

### 4.2 Combining CRISPR-based perturbation and RNA-seq (Perturb-Seq)

The data sets of Dixit *et al.* (2016) consist of RNA-seq transcriptome read-outs for single cells. We downloaded them from the BROAD single-cell portal (https://portals.broadinstitute.org/single_cell) and used the log transformed counts per 10000 normalized expression values.

**Cell Cycle Regulators**. Dixit *et al.* (2016) performed knock-out experiments for thirteen cell cycle regulators in K562 cells. After preprocessing the data set consists of 19283 cells and 980 E-genes. Fig. 7 shows the highest scoring M&NEM result (*K* = 2) with mixture weights 46.8% (red) and 53.2% (blue).

Dixit *et al.* (2016) identified the perturbations of PTGER2, CAB7 and CIT as advantageous for proliferation. We found PTGER2 and CIT downstream in both our networks, especially in the heavier blue one, while CAB7 is placed in the middle in both. However, Dixit *et al.* (2016) found a distinct transcriptional phenotype for CAB7, which can explain the different roles in the networks in comparison to PTGER2 and CIT.

Reciprocally, RACGAP1, TOR1AIP1 and AURKA are placed right at the top of the blue network, while their perturbations are identified by Dixit *et al.* (2016) as disadvantageous to proliferation. However, in the other (red) network, only RACGAP1 remains on top and TOR1AIP1 and AURKA are placed right at the bottom. This hints at much more diverse regulatory roles of the latter two and a necessity for RACGAP1 to stay upstream in the network as a key regulator (Imaoka *et al.*, 2015).

Overall the networks differ also in their general shape. While the red network consists of two co-regulating branches, that converge, the blue network is much more inter-connected.

The histogram of responsibilities is shown in Fig. 6, B. The a posteriori attachment of cells shows a much softer gradient than for the Crop-Seq data set. While each S-gene in each component has at least one cell with responsibility ≥ 99%, for many cells the responsibilities are between 5% and 95%.

We show more detailed depictions of the two highest scoring M&NEMs in the supplement, Fig. S5-S6.

**Transcription Factor Interplay**. In a second data set, Dixit *et al.* (2016) performed knock-out experiments for ten transcription factors in K562 cells. The pre-processed data set consists of 22402 cells and 700 E-genes. Fig. 8 shows the optimal network inferred by M&NEM (*K* = 2) with mixture weights of 53.3% (red) and 46.7%.

We identify YY1 as a major regulator for all other genes as it is placed most upstream in both networks. YY1's importance as a major transcription factor has been shown before (Tastanova *et al.*, 2016). This is further confirmed as the second highest scoring M&NEM (*K* = 3, supplement, Fig. S8) still places YY1 most upstream in all networks. Similarly, the upstream causal relation of YY1 to NR2C2 is conserved as well.

The other transcription factors mainly switch places in the middle part of the network, except for GABPA and EGR1, which alternately function as the sink node.

Again, the a posteriori attachment of cells shows a much softer gradient than for the Crop-Seq data set (Fig. 6, A,C). While each S-gene in each component has at least one cell with responsibility ≥ 95%, for many cells the responsibilities are between 20% and 80%.

More detailed depictions of the two highest scoring M&NEMs are shown in the supplement, Fig. S7-S8.

## 5 Discussion

We have introduced M&NEM, a novel method for the identification of heterogeneous sub-populations of single cells with a mixture of networks. M&NEM infers multiple networks from a heterogeneous cell population instead of a single one averaged over the whole population. This additional flexibility allows us to compensate for model limitations of the original NEM. M&NEM successfully infers sub-populations and the underlying ground truth mixture of networks in a simulation study under reasonable assumptions.

In our application study, we have investigated three data sets from single cell CRISPR experiments combined with full transcriptomic read-outs. M&NEM confirms known causal interactions and infers novel ambiguous roles for several key regulators (e.g. DOK2), which might be differently regulated in a sub-population of cells. We also identify key players like RACGAP1 and YY1, which seem to be necessary for upstream regulation.

Without the use of our model selection to enforce sparseness, our model might lead to overfitting. However, this overfitting might not always be due to noise or technical artifacts, but due to hidden players not perturbed in the data as proposed by Sadeh *et al.* (2013). For example, if we look at the second highest scoring M&NEM for the cell cycle regulators (*K* = 3, supplement, Fig. S6), we see that TOR1AIP1 is placed at the bottom of the blue network with no cells attached and the highest responsibility for a cell at 2%, i.e. almost no information for this placement of the TOR1AIP1 S-gene comes from a cell in which TOR1AIP1 was perturbed. Our hypothesis is that many E-genes react to AURKA and many E-genes react to CIT, but also many E-genes react to both. Original NEMs cannot model this, and it is the exact situation for which Sadeh *et al.* (2013) introduce a hidden player (not perturbed) to account for the diversity of E-genes. In our blue network, TOR1AIP1 is placed to model this scenario and is therefore a stand-in for the unknown hidden player and *not* the actual TOR1AIP1 S-gene (Fig. 9). However, Sadeh *et al.* (2013) use a binomial test based on the binarized data to account for noise, while our model does it in a greedy fashion, which we penalize with our penalized log likelihood ratio. Hence, an integration of the method of Sadeh *et al.* (2013) into our mixture model to identify hidden players accounting for noise would be an interesting addition.

## Funding

Part of this work has been funded by SystemsX.ch, the Swiss Initiative in Systems Biology, under Grant No. RTD 2013/152 (TargetInfectX – Multi-Pronged Perturbation of Pathogen Infection in Human Cells), evaluated by the Swiss National Science Foundation.

**Conflict of interest**: none declared