Extracting Biological Insights from the Project Achilles Genome-Scale CRISPR Screens in Cancer Cell Lines

One of the main goals of the Cancer Dependency Map project is to systematically identify cancer vulnerabilities across cancer types to accelerate therapeutic discovery. Project Achilles serves this goal through the in vitro study of genetic dependencies in cancer cell lines using CRISPR/Cas9 (and, previously, RNAi) loss-of-function screens. The project is committed to the public release of its experimental results quarterly on the DepMap Portal (https://depmap.org), on a pre-publication basis. As the experiment has evolved, data processing procedures have changed. Here we present the current and projected Achilles processing pipeline, including recent improvements and the analyses that led us to adopt them, spanning data releases from early 2018 to the first quarter of 2020. Notable changes include quality control metrics, calculation of probabilities of dependency, and correction for screen quality and other biases. Developing and improving methods for extracting biologically-meaningful scores from Achilles experiments is an ongoing process, and we will continue to evaluate and revise data processing procedures to produce the best results.


Introduction
Most contemporary cancer drug therapies are broadly toxic to cells. Precision medicine aims to avoid this broad toxicity by exploiting cancer-specific vulnerabilities. This approach has shown success with small molecules targeting oncogenes such as B-RAF, ALK, and ER. However, the majority of patients still lack clinically indicated targeted therapies 1,2 . The Cancer Dependency Map aims to address this need and accelerate the development of targeted therapies by charting the landscape of cancer vulnerability 3 . Within this effort, Project Achilles identifies genetic dependencies in cancer cell lines using genome-scale loss-of-function screens.

Overview

Fig. 1: The Achilles data pipeline. The major steps used to generate Achilles matrices from deconvoluted screen sgRNA readcount data. This pipeline applies to public 19Q1 and later datasets. Black and white grids illustrate which axis is being reduced during QC steps.
The Achilles pipeline consists of four stages, named for their outputs: readcount, log fold change, CERES gene effect, and gene dependency (Fig. 1). In the readcount stage, deconvoluted sgRNA read counts are assembled, and replicates failing fingerprinting, sgRNAs with inadequate pDNA representation, sgRNAs with suspected off-target activity, and replicates with too few reads are removed. At the log fold change stage, individual replicates and whole cell lines with inadequate control separation or poor replicate correlation are removed; then CERES inference is performed. As part of CERES, sgRNAs are aligned to the hg38 genome assembly and sgRNAs with no annotated gene targets are dropped. In the CERES gene effect stage, the CERES output is rescaled and corrected to remove known confounders, and common dependencies are identified. Finally, for each gene score in each cell line, we infer the probability that the score represents a true biological depletion phenotype.
Various details of this pipeline have evolved over time. Changes are described throughout this document. For reference, Table 1 lists the changes introduced with each quarterly release.

Table 1. Changes introduced with each quarterly release.

Public 18Q1:
• Introduced gene_dependency, pan_dependent_genes, and a standardized file naming system and formatting
• Added Entrez IDs to gene names
• Identified suspected off-target guides and removed them

Public 18Q2:
• Changed copy number prioritization from preferring Broad data of all kinds to preferring WES data from all sources
• Removed compound gene annotations
• Began providing gene_effect_unscaled (the raw CERES output)
• Public releases combined with CCLE releases; Achilles file names have the prefix "Achilles_" added

Public 19Q3:
• Changed reference assembly to hg38
• Changed positive controls used for scaling the data to a more inclusive and recent list
• Changed gene dependency probability inference to use the same positive controls used for scaling, rather than those found from the data itself
• Changed cell line and replicate QC metrics to use NNMD for control separation and pDNA-normalized Pearson correlation of high-variance genes for replicate reproducibility
• Changed batch correction to only remove five principal components from the data
• Retained sgRNAs failing QC in logfold_change and provided mapping information for them in Achilles_dropped_guides.csv
• Removed dropped_guides
• Removed Y chromosome data from gene_effect_unscaled and later files

Public 19Q4: None anticipated.

Public 20Q1: Other changes undergoing evaluation.

Data Generation and Assembly of Read Counts
Replicates from the Avana library are sequenced in four technical replicates and deconvoluted as described in Meyers et al. 5 . Read count measurements belonging to the same batch are summed. All biological replicate and pDNA batch read counts are combined into a read count matrix ( Achilles_raw_readcounts.csv ) with replicates and pDNA batches as columns and sgRNAs as rows.
We apply four filters to the resulting read count matrix. First, we remove suspected off-target Avana sgRNAs. These sgRNAs were flagged by early runs of CERES that identified a single sgRNA from among those targeting a gene as the sole efficacious sgRNA for that gene. A list of these sgRNAs is provided with each Avana dataset as ( Achilles_)dropped_guides.txt for datasets before 19Q3; in later datasets, failed guides also have their gene targets indicated in the file Achilles_dropped_guides.csv . The sgRNA abundance in the pDNA pool is not uniform but roughly log-normal, as expected of the results of an exponential process, with a tail of sgRNAs with low or zero representation (Fig. 2). Steps contributing to the range of pDNA sgRNA abundance include the production of oligos on chip, DNA cloning, and amplification in bacteria.
To prevent excessive noise, we set to null (NA) the pDNA measurement for any sgRNA with less than one read per million in that pDNA batch, affecting about 0.605% of the 74,661 sgRNAs (Fig. 2). These sgRNAs are then set to NA in all replicates belonging to the same pDNA batch. Next, we remove all replicates that fail to match their parent cell line using SNP fingerprinting (described further below). Finally, we remove replicates that have fewer than 15 million total read counts; these are sent for resequencing to generate sufficient read depth. We normalize each read count column by its total reads, multiply by one million, and add one pseudocount to generate reads per million (RPM). We take the log2 of the ratio of each replicate RPM to the corresponding pDNA batch RPM to generate an sgRNA-by-replicate log fold change (LFC) matrix.
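The RPM and LFC computation above can be sketched as follows. This is an illustration, not the pipeline's actual code; the function name and data layout (sgRNAs as rows, replicates as columns) are our assumptions.

```python
import numpy as np
import pandas as pd

def reads_to_lfc(readcounts: pd.DataFrame, pdna: pd.Series) -> pd.DataFrame:
    """Convert an sgRNA x replicate readcount matrix to log2 fold change
    versus a pDNA reference batch (illustrative sketch)."""
    # reads per million, with one pseudocount added after normalization
    rpm = readcounts / readcounts.sum(axis=0) * 1e6 + 1
    pdna_rpm = pdna / pdna.sum() * 1e6 + 1
    # null out sgRNAs with less than one read per million in the pDNA batch
    pdna_rpm[pdna / pdna.sum() * 1e6 < 1] = np.nan
    # log2 ratio of replicate RPM to pDNA RPM
    return np.log2(rpm.div(pdna_rpm, axis=0))
```

sgRNAs nulled in the pDNA reference propagate NA values to every replicate in that batch, matching the filtering described above.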

Data Quality Control
Quality control for the Achilles CRISPR data proceeds in the following order: (1) the fingerprint integrity of each replicate is checked and failing replicates are dropped; (2) replicates are next dropped if they show insufficient separation between positive and negative control genes; (3) surviving replicates are dropped if they are not sufficiently similar to other replicates of that cell line; and (4) whole cell lines are dropped if their remaining replicates show inadequate control separation or poor replicate correlation.

Fingerprinting is based on SNP calls 6 . In the case of aneuploidy, a homozygous SNP would not affect the SNP calling, but a heterozygous SNP would cause a failed call due to unequal signal from alleles; a sample with insufficient called SNPs will fail fingerprinting. For samples A and B to match, the Pearson correlation between the SNP calls of A and B must be greater than or equal to 0.9, and the z-score of the Pearson correlation between A and B must be greater than or equal to 3.0 relative to the distribution of Pearson correlations between A and every sample that we have seen so far. With this metric, we check that the fingerprints of the Cas9-positive "derivative" samples and sgRNA-infected "screen" samples match the fingerprint of the original "parental" cell line. 95% of screens match their respective parentals (Fig. 3). Additionally, the p-value for a screen matching a parental other than its own, given the distribution of z-scores of unpaired parentals and screens, is 0.00038 (Fig. 3b). In addition to excluding contaminated samples, this method allows us to identify and reverse labeling errors. In one case, we found that the derivative for DMS53 matched the parental for HCC38 rather than its own (and vice versa), indicating that either parental or derivative data had been swapped.

Separation of Gene Controls
The separation control step in the Achilles pipeline is used to check that essential and nonessential genes are behaving as expected. Previously, we used the strictly standardized median difference (SSMD) to assess the separation between the killing effects of the two groups. However, SSMD penalizes high variance in the LFC values of both the essential and nonessential genes. While we expect the LFC values of the nonessential genes to have a tight distribution centered around zero, no analogous expectation exists for the essential genes.
Instead, we measured the difference between the means of the LFC values of the essential and nonessential genes divided by the standard deviation of the LFC values of the nonessential genes, which we refer to as the null-normalized mean difference (NNMD; Fig. 4a ). Note that for both NNMD and SSMD more negative scores indicate a better screen.
NNMD is more strongly related than SSMD to the first two principal components of the data (absolute Pearson's r = 0.65 and 0.52, respectively; Fig. 4b) as well as to Cas9 activity (Pearson's r = -0.30).
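The NNMD definition above reduces to a one-line calculation; a minimal sketch (the function name is ours):

```python
import numpy as np

def nnmd(essential_lfc, nonessential_lfc):
    """Null-normalized mean difference: the gap between the mean LFCs of
    essential and nonessential control genes, in units of the standard
    deviation of the nonessential LFCs. More negative = better screen."""
    essential_lfc = np.asarray(essential_lfc, dtype=float)
    nonessential_lfc = np.asarray(nonessential_lfc, dtype=float)
    return (essential_lfc.mean() - nonessential_lfc.mean()) / nonessential_lfc.std()
```

Unlike SSMD, the denominator uses only the nonessential (null) variance, so a screen is not penalized for spread among essential-gene LFCs.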

Agreement between Replicates
Because at least two replicates were performed for each screen, we compute the Pearson correlation between the replicates to ensure that the results are reproducible. Previously, we computed the Pearson correlation using log2-normalized read counts across all genes. However, we observed that log2-normalized data can produce high Pearson correlations between random pairs of replicates, driven by the pDNA distribution alone; we therefore use gene-level LFC values instead, as these are more sensitive to cell line-specific screen response.

Additionally, when calculating correlation, we restrict genes to those with high variance across cell lines, identified as the top 3% most variable scores in the public 18Q4 gene effect data ( Achilles_high_variance_genes.csv ). The separation between the null distribution and the actual distribution of Pearson correlations between replicates is larger using high-variance genes than all genes, meaning we better capture a distinct representation of each cell line (Fig. 4f). We require a Pearson correlation of at least 0.61, equivalent to a p-value of 0.05 for observing the correlation between random pairs of replicates. In the case of more than two replicates per cell line, replicates not reaching the 0.61 threshold when correlated with any of the other replicates for that cell line are excluded from the dataset.

Estimation of gene effect scores
Cas9 cutting of DNA induces a toxic effect that increases with the number of sites cut. This results in the "copy effect," in which sgRNAs targeting highly amplified regions of the genome produce depletion effects comparable to knockout of an essential gene, even when the specific locus targeted is an intergenic region 10 . To remove the copy effect, the Broad Institute developed the CERES package 5,10 . CERES consists of two phases: prepare_inputs, which aligns sgRNAs to genes and identifies the number of cuts expected for each sgRNA in each cell line; and run_ceres, which treats each observed sgRNA log fold change as the sum of a gene effect and a copy number effect, multiplied by a guide efficacy. More details can be found in Meyers et al. 5 .

Alignment of sgRNAs to genes
In previous Achilles releases, sgRNAs were aligned to hg19. Beginning with DepMap_Public_19Q3, sgRNAs are aligned to hg38. Alignments are performed using Bowtie to find all exact matches to the 20-nt sgRNA sequence. Matches are then filtered for the presence of an NGG PAM sequence at the 3' end.
Matches that fall in the exon of a gene as annotated by the Consensus Coding Sequence (CCDS) GRCh38.p12 (GRCh37.p13 for Achilles releases DepMap_public_19Q2 and earlier) are annotated as targeting that gene. CERES models multiple gene targeting as linear and additive.
This assumption can fail for synergistic cases, as when a single sgRNA targets many members of a family 11 . It can also introduce problems in the case of compound genes (those containing the superset of two other genes' transcripts). With compound genes, CERES can arbitrarily shift the scores of the two individual genes in one direction (negative or positive) while shifting the scores of the compound gene in the opposite direction. The resulting scores produce good matches to the sgRNA log fold change data according to the CERES model, but the inferred gene effects are unlikely to be biological. As it is uncertain how often the compound forms of most genes really occur in cells, we have opted to avoid this problem by removing compound genes from gene annotations from avana_public_18Q2 onwards.
sgRNA Copy Number

In previous releases of Achilles, it was also necessary to discard genes located on the sex chromosomes. This was due to systematically biased copy number estimations for these chromosomes using the single-nucleotide polymorphism (SNP) pipeline, which normalized copy number to the background across cell lines. Combined with CERES, this led to small but persistent biases in the scores for genes located on these chromosomes. This problem does not arise with WES-based copy number. Additionally, there is clear evidence that WES copy number has a generally stronger relationship with log fold change than SNP copy number. The Broad has therefore undertaken whole-exome sequencing of all cell lines in Achilles that have not previously been sequenced by Sanger. Beginning with avana_public_19Q3, cell lines with WES-based copy number were released with CERES scores for the X chromosome, and by 20Q1 we anticipate that all cell lines will have WES copy number and X chromosome scores. In general, the correlation analysis we introduced above uses the gene-level relative CN profile.
Relative CN for each profile is multiplied by 2 to be relative to a ploidy of 2 and capped at a maximum value of 20 for each gene in the Achilles screen. For one cell line with CN profiles from both centers, we observe distinct CNAs between the two (Fig. 5a). When we correlate its gene-level LFC with the CN profiles, we observe a more negative correlation with the Broad Institute SNP profile than with the Sanger WES profile (Fig. 5b). Based on this analysis, the Sanger WES profile is too dissimilar to the cell line screened in Achilles and should not be used for CN correction. 25 cell lines failed this correlation analysis in DepMap_public_19Q1 (Fig. 5c).
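The ploidy scaling and capping step described above is a simple transform; a sketch (function name ours, applied per gene-level CN profile):

```python
import numpy as np

def prepare_cn(relative_cn):
    """Scale relative copy number to a ploidy-2 baseline and cap extreme
    amplifications at 20, per the procedure described above (sketch)."""
    return np.minimum(np.asarray(relative_cn, dtype=float) * 2, 20)
```

Capping prevents a handful of massively amplified loci from dominating the copy-effect fit in CERES.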

Running CERES
The CERES method is described in Meyers et al. 5 , and the code is available on GitHub 13 . The critical hyperparameter of the CERES model is lambda_g, which controls the strength of the regularization of individual gene scores towards the mean. A high value for this parameter produces good separation of common essential and nonessential genes, at the expense of compressing the scores of selectively essential genes. We have chosen the value 0.4 as a compromise between these competing considerations. This value is slightly less than the values that tend to maximize out-of-sample accuracy in most datasets. However, because most genes are either essential or nonessential in all lines, maximizing global out-of-sample accuracy would produce over-regularized datasets for the purpose of identifying selective dependencies.
CERES-estimated gene effect scores are provided unaltered (except for filtering sex chromosome scores in lines with SNP copy number estimates) as gene_effect_unscaled.

To further assess the extent of data contamination by experimental confounders, we perform principal component analysis (PCA) on gene-effect-level data and study the relationship of the dominant components to confounding features. We are able to create strong predictive models for the first six principal components using only experimental confounders. Combined, these principal components explain more than 20% of the dataset-wide variance. A similar bias is present at all levels of processing, suggesting the phenomenon is a result of experimental design, not a specific data processing step (Fig. 6a, 6b).
Additionally, we find that experimental confounders are often stronger predictors of individual gene scores than biological features. We construct predictive models for the gene dependency scores of each gene using both experimental confounders and biological features (Fig. 6c).
In the following three sections, we first describe a new step in the Avana data processing pipeline called Removal of Confounding Principal Components (RCPC). Next, we compare RCPC to alternative methods on the basis of their ability to remove unwanted variation and denoise biological signal. In the last section, we examine the effects of RCPC on dependency prediction.

We also investigate a recently developed method 17 which utilizes a novel correction of the Avana dataset to identify co-functional gene interactions. The authors use PCA on a set of negative control genes to isolate undesirable systematic variation. In this report, we refer to this method as negative-control PCA correction (ncPCA). The original ncPCA method is not appropriate for the standard Avana data release because it does not preserve the absolute dependency status of genes: after correction, gene means approach zero. We therefore developed an adaptation of ncPCA, which we call modified negative control PCA correction (ncPCA mod). For details on how the methods are applied, see the Analysis Methods section.
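The PCA-based corrections discussed here share a common mechanic: project the data onto leading principal components and subtract that projection. A minimal sketch of this mechanic, assuming five removed components per Table 1 (the production pipeline's centering, component selection, and renormalization details differ, and the function name is ours):

```python
import numpy as np

def remove_top_pcs(X, n_components=5):
    """Subtract the projection of a (cell line x gene) matrix onto its
    top principal components, in the spirit of RCPC (sketch)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # SVD of the centered matrix; rows of Vt are PC loadings
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]
    # restore gene means so absolute dependency status is preserved
    return Xc - top + mu
```

Restoring the gene means at the end is the key difference from a naive PCA correction, which would shift gene means toward zero and destroy absolute dependency information.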
All presented methods remove the observed relationship with experimental confounders (Fig. 7a), so correction methods are also evaluated according to their ability to improve biological signal by four criteria: (1) strength of associations between known pairs of related genes (Fig. 7b); (2) agreement of cell line clustering with tissue-of-origin annotations (Fig. 7c); (3) strength of association between gene dependency data and gene expression data (Fig. 7d); and (4) strength of known biomarkers of dependency (Fig. 7e). All of the presented methods yield substantial improvements in metrics 1-3, but decrease the strength of biomarker-gene associations for a set of 25 genes with strong, consistent biomarkers across the DepMap CRISPR knockout and RNAi datasets. These biomarker-gene associations are referred to as "consensus biomarkers" (Supplementary Table 2). RCPC outperforms the other methods on most metrics, but is not the top choice with respect to two: RCPC yields a smaller increase in the number of discovered relationships between known paralog genes compared to two other methods, and it causes the second-largest decrease in the strength of Spearman correlation for consensus biomarkers, reducing the median Spearman correlation from 0.55 to 0.50.

The ncPCA method has been reported to show large gains in identifying functional gene relationships 17 . Accordingly, we compare ncPCA and RCPC based on their utility for discovering co-functional gene interactions and their ability to remove systematic bias. We observe that while ncPCA reduces systematic bias among nonessential genes, it reimposes this bias on essential genes. This phenomenon is exemplified by PC1, which is identified as explaining the majority of the bias among the negative control gene set 17 . After ncPCA, correlations to the loading of PC1 decrease among nonessential genes but increase among essential genes (Fig. 8a).
The introduction of a confounding effect results in inflated correlations among all essential genes regardless of biological function (Fig. 8b). In the original publication of ncPCA, a 50-fold increase in the number of significantly correlated genes is reported. This is partially a result of the introduction of a shared confounding effect and partially a result of an improvement in biological signal quality. We use a previously developed method 18 for evaluating the significance of protein complexes in fitness similarity networks to compare the utility of RCPC and ncPCA for discovery of gene interactions, and find that RCPC outperforms ncPCA in this regard (Fig. 8c).
To illustrate the confounding effect of ncPCA, we draw networks containing four essential complexes and include the top 400 ranked correlations (within or between complex subunits) as edges. For the uncorrected and RCPC datasets, the majority of edges lie within protein complexes. In contrast, after ncPCA, most edges connect essential genes with high mean effect scores in different complexes (Fig. 9). Because ncPCA creates indiscriminately high correlations between most essential genes, we recommend the use of RCPC. Fig. 8c shows the recall of CORUM complexes with a significant internal-to-external edge ratio, colored by correction method.

Effect of Correction on Dependency Predictions
One of the primary use cases of the Avana dataset is dependency prediction. To identify biomarkers of dependency, we construct predictive models for the gene effect scores of each gene using a combination of experimental confounders and CCLE cell line characterizations as features (Analysis Methods). To quantify the effect of the correction method on dependency prediction, we compare the performance of models trained on uncorrected data to that of models trained on RCPC-corrected data. The impact of confounders is almost completely eliminated (Fig. 10a): after correction, the percentage of models that use a technical artifact as their top feature decreases from 61% to 1%. This improvement holds at all levels of performance; at most 3% of models with scores greater than any threshold use a technical artifact as their top feature. While systematic bias is reduced, we also observe an overall decrease in model predictability, for two reasons. First, some biological signal is contained in the discarded principal components. Second, models with a biological top feature still use technical artifacts and generate improved predictions as a result (Fig. 10b): even if a model is not explicitly using a technical artifact, it may use a biological feature that is associated with one.

Post-Processing
Generating Dependency Probabilities

The majority of CERES scores appear to follow the distribution of known nonessential or unexpressed genes, suggesting that the true cell viability impact of knocking out these genes is zero (Fig. 11a). On the other hand, if knocking out a gene produces any nonzero reduction of viability, cells with that gene knocked out will vanish as a proportion of the cell pool over a sufficiently long time interval. Therefore it may sometimes be less interesting to ask how strongly depleted cells are after some gene knockout, and more interesting to ask how likely it is that the observed depletion represents a real viability phenotype rather than noise. To answer this question, we model each cell line's gene scores as a mixture of a null distribution, estimated from nonessential or unexpressed genes, and a positive (real depletion) distribution, estimated from the positive control genes. We then employ a standard Bayesian E-M optimization procedure with a single free parameter: the total fraction of gene scores in the cell line that was generated by the null distribution. We use an initial guess of 0.85, which is generally close to the final result. The final probability that each gene score was generated by the positive (real depletion) distribution is provided to users in the matrix Achilles_gene_dependency.csv .
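The E-M step can be sketched as below. This is an illustration only: the pipeline's actual null and positive densities are estimated from control genes, whereas Gaussians are used here, and the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def dependency_probabilities(scores, null_dist, pos_dist, pi_null=0.85, n_iter=100):
    """Two-component mixture with fixed component densities and a single
    free parameter: the fraction of scores drawn from the null. Returns
    P(positive | score) for each score (sketch)."""
    f0, f1 = null_dist.pdf(scores), pos_dist.pdf(scores)
    for _ in range(n_iter):
        # E-step: posterior probability each score came from the null
        p_null = pi_null * f0 / (pi_null * f0 + (1 - pi_null) * f1)
        # M-step: update the single free parameter, the null mixing fraction
        pi_null = p_null.mean()
    return 1 - p_null
```

Scores near the null mode receive probabilities near zero; scores near the positive mode receive probabilities near one, with the crossover governed by the inferred mixing fraction.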

The relationship between dependency probability and CERES gene effect is monotonic and
approximately sigmoidal for all cell lines, but the dependency probability associated with intermediate gene effect scores varies widely between lines ( Fig. 11b ). This is due to the variance in screen quality between different lines. In a low-quality screen, all dependency probabilities are flattened towards the prior (generally around 0.15), creating a shallower slope.
Consequently, a CERES score of -0.5 has very different interpretations in high- and low-quality screens.
Identifying common dependencies

To generate this list, our approach follows the basic intuition that if a gene is universally important for cell viability, it should fall in the top Z most depleting genes in at least 90% of cell lines. (We assume that noise or contamination might cause the gene not to score strongly in up to 10% of cell lines.) The threshold ranking quantile Z can be chosen in a data-driven way. For a given gene, we rank its gene effect score within each cell line, then arrange cell lines in order of increasing gene effect score for that gene. For most genes, the 90th-percentile least dependent cell line will show little or no depletion, while common dependencies will still be depleted (Fig. 12a). This creates a bimodal distribution of gene ranks in their 90th-percentile least depleted lines, where the left peak contains common dependencies and the right peak the remainder of genes (Fig. 12b). Accordingly, we choose the threshold Z to be the point of minimum density between the two peaks, using a Gaussian smoothing kernel with width 0.1 and rounding to the nearest 0.001. The resulting list of common essentials is insensitive to the choice of the 90th-percentile cutoff (Fig. 12c) because the stringency of the threshold Z changes as the percentile cutoff varies (Fig. 12d).
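The procedure above can be sketched end to end. This is illustrative: the function name is ours, and the pipeline's exact smoothing and rounding details differ.

```python
import numpy as np
from scipy.stats import rankdata, gaussian_kde

def common_essentials(gene_effect, threshold=None, percentile=90):
    """Rank genes within each cell line (0 = most depleted), take each
    gene's rank in its 90th-percentile least depleted line, and call
    genes below a rank threshold common essentials. If no threshold is
    given, use the minimum-density point between the two peaks of the
    rank distribution. gene_effect: cell lines x genes (sketch)."""
    # fractional rank of each gene within each cell line
    ranks = np.apply_along_axis(rankdata, 1, gene_effect) / gene_effect.shape[1]
    rank_90 = np.percentile(ranks, percentile, axis=0)
    if threshold is None:
        kde = gaussian_kde(rank_90, bw_method=0.1)
        grid = np.linspace(rank_90.min(), rank_90.max(), 1000)
        dens = kde(grid)
        # interior minimum of the smoothed density separates the two peaks
        threshold = grid[np.argmin(dens[100:900]) + 100]
    return np.flatnonzero(rank_90 <= threshold), threshold
```

On a synthetic matrix where 20 of 200 genes are depleted in every line, the data-driven threshold recovers exactly those 20 genes.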

Predicting Dependencies
We used scikit-learn's RandomForestRegressor to predict gene effect scores and calculate feature importance from avana_public_19Q1 before and after the PC batch correction. The forest is constrained to have 100 estimators, maximum depth eight, and minimum samples per leaf node five. The following CCLE features were used:

• RNASeq expression data for both protein-coding and non-coding regions

For each gene, training and prediction were performed over ten folds of cross-validation. For each fold, cell features were filtered to the 1,000 with the highest Pearson correlation to the target gene effect profile in the training set, the forest was trained, and predictions were made and stored for the remaining 10% of validation cell lines. After all ten folds were completed, the complete set of out-of-sample predictions of gene effect was correlated with the actual gene effect profile to score the model. To report feature importance, the model was retrained on all samples.
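The cross-validation scheme above can be sketched as follows. Forest hyperparameters follow the text; the function name, feature-filtering implementation, and fold seeding are our assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def predict_gene_effect(features, target, n_splits=10, n_features=1000, seed=0):
    """Out-of-sample gene effect predictions via 10-fold CV, filtering
    to the features most correlated with the target within each
    training split (sketch)."""
    preds = np.empty_like(target)
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(features):
        # rank features by |Pearson r| with the target on the training split
        X_tr = features[train] - features[train].mean(axis=0)
        y_tr = target[train] - target[train].mean()
        denom = np.linalg.norm(X_tr, axis=0) * np.linalg.norm(y_tr)
        r = np.abs(X_tr.T @ y_tr) / np.where(denom == 0, 1, denom)
        keep = np.argsort(r)[::-1][:n_features]
        model = RandomForestRegressor(n_estimators=100, max_depth=8,
                                      min_samples_leaf=5, random_state=seed)
        model.fit(features[train][:, keep], target[train])
        preds[test] = model.predict(features[test][:, keep])
    return preds
```

Filtering features inside each fold, rather than once on the full dataset, avoids leaking validation-set information into the feature selection.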

Replicate reproducibility -generating null distributions
Null distributions were generated by randomly selecting two replicates from all replicates and taking the Pearson correlation between them. This was done 1 million times for each null distribution.
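The null-generation procedure above can be sketched as below; the function name is ours, and precomputing the full replicate-replicate correlation matrix before sampling pairs is an implementation convenience, not necessarily how the pipeline does it.

```python
import numpy as np

def replicate_correlation_null(lfc, n_draws=1_000_000, seed=0):
    """Null distribution of Pearson correlations between randomly chosen
    pairs of distinct replicate columns of an LFC matrix (sketch)."""
    rng = np.random.default_rng(seed)
    corr = np.corrcoef(lfc, rowvar=False)  # replicate x replicate correlations
    n = corr.shape[0]
    i = rng.integers(0, n, n_draws)
    j = rng.integers(0, n - 1, n_draws)
    j = np.where(j >= i, j + 1, j)  # shift to guarantee i != j (no self-pairs)
    return corr[i, j]
```

A reproducibility threshold (such as the 0.61 cutoff above) can then be read off as a quantile of the returned null sample.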

High variance genes
High variance genes were found by looking at the variance of gene effect across cell lines for each gene. The genes with the highest variance (top 3%) were taken; this yielded 529 genes.

ComBat correction
ComBat only supports the correction of data using discrete batches, but the Cas9 activity and nonessential z-score of a cell line are continuous values. To create discrete batches, each continuous quantity was split into three equal-sized groups. ComBat was first applied to correct Cas9 activity with nonessential z-score (discretized) and pDNA batch as covariates. ComBat was then applied to correct nonessential z-score with pDNA batch as a covariate. ComBat was then applied to correct pDNA batch without covariates. Finally, cell lines were renormalized according to the procedure in the CERES Normalization section.
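The discretization step can be sketched with pandas; the values and labels here are illustrative, not actual Cas9 activity measurements.

```python
import pandas as pd

# Split a continuous confounder (e.g. Cas9 activity) into three
# equal-sized groups for use as discrete ComBat batches (sketch).
cas9_activity = pd.Series([0.91, 0.55, 0.78, 0.99, 0.42, 0.65])
batches = pd.qcut(cas9_activity, 3, labels=["low", "mid", "high"])
```

`pd.qcut` cuts at quantiles rather than fixed values, so each group receives the same number of cell lines regardless of how the confounder is distributed.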
The development of this procedure required multiple iterations. Two factors, the order in which experimental confounders were corrected and the size of continuous-variable groups, were varied extensively. ComBat procedures were compared based on their performance on the metrics summarized in Figure 7.
The SVA R package implementation of ComBat was used.

RUV2 Correction
The random effect version of Naive RUV-2 implemented as naiveRandRUV in the RUVnormalize package was used. The ridge-regression regularization parameter was set to 0.1 and the rank of the estimated unwanted variation was set to 10. Nonessential genes were used as negative control genes.
Before application of the correction, gene means were removed. After application, gene means were added back and cell lines were renormalized according to the procedure outlined in the CERES normalization section of this report.

Evaluation of cell line clustering
Each gene effect matrix was pruned so that every cell line had at least one other cell line of the same primary tissue type. Subsequently, t-distributed stochastic neighbor embedding (t-SNE) with perplexity equal to five was applied to each pruned gene effect matrix to reduce the dimension to two. K-means clustering was then applied to each embedded matrix with k equal to the number of primary tissue types represented. The adjusted Rand index was calculated between the k-means groupings and the primary tissue groups. This process was repeated 100 times for each correction.
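One iteration of this evaluation can be sketched with scikit-learn (the function name is ours, and the pruning of singleton tissues is assumed to have happened upstream):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def clustering_score(gene_effect, tissues, seed=0):
    """Embed cell lines with t-SNE (perplexity 5), cluster with k-means
    (k = number of tissue types), and score agreement with tissue
    labels via adjusted Rand index (sketch of one iteration)."""
    emb = TSNE(n_components=2, perplexity=5, random_state=seed).fit_transform(gene_effect)
    k = len(set(tissues))
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(emb)
    return adjusted_rand_score(tissues, labels)
```

Because both t-SNE and k-means are stochastic, the document's protocol repeats this 100 times per correction method.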

Evaluation of correlation significance between related gene pairs
Correlation matrices for each gene effect matrix were computed and values within columns were ranked from high to low. The p-value for each correlation was taken as the rank of the correlation divided by 17,633 (the number of other genes in gene-effect-level data). Q-values were then calculated across all p-values within a correction method. The number of biologically related gene pairs with a q-value less than 0.1 was reported. Two lists of biologically related genes were used: human core complexes from CORUM and human paralogs from Ensembl.
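The rank-based empirical p-value scheme above can be sketched for a single column of the correlation matrix (function name ours):

```python
import numpy as np

def rank_pvalues(corr_column):
    """Within one column of a gene-gene correlation matrix, assign each
    correlation a p-value equal to its high-to-low rank divided by the
    number of other genes (sketch)."""
    ranks = (-corr_column).argsort().argsort() + 1  # 1 = highest correlation
    return ranks / len(corr_column)
```

The strongest correlation in a column thus gets the smallest p-value, and q-values can then be computed across all columns of a correction method.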

Evaluation of concordance with gene expression data
The correlation between gene effect data and Public 18Q4 CCLE TPM expression data was evaluated according to the following procedure. A pseudocount of 1 was added to the Public 18Q4 CCLE TPM expression data and the resulting matrix was log2-transformed. All genes not present in both the Public 18Q4 CCLE TPM data and avana_public_19Q1 were discarded from both datasets.

Correlation matrices were computed between each gene effect matrix and the processed expression data (rows correspond to gene effect data and columns correspond to expression data). P-values were determined for each gene as the within-row rank of the correlation between its knockout profile and its expression profile, divided by 17,395 (the number of genes in both datasets). Q-values were calculated for all genes within a correction method. The number of genes with a q-value less than 0.1 was reported.

Evaluation of consensus biomarkers
Probabilities of dependency were calculated for DEMETER2 19 gene scores using the methodology described under Post-Processing above. The dependency prediction pipeline described in the Analysis Methods section was run on DEMETER2 probabilities for genes with at least 10 dependent and 10 nondependent lines, in addition to Avana dependency probabilities. Genes with models from both datasets that shared a feature in common among the top two most important features and had model scores greater than 0.4 were selected.
Spearman correlations were calculated between the corrected gene effect profiles of the selected genes and their top predictive features. Public 18Q4 CCLE TPM expression data and 18Q4 public DepMap Mutation call data were used.

Implementation of ncPCA
A list of olfactory receptor genes was downloaded from the HGNC website ( https://www.genenames.org/data/genegroup/#!/group/141 ). Genes with guides which targeted multiple genes as indicated in the guide_gene_map.csv file were dropped to create a list of 278 olfactory receptor genes.
Permutation testing was implemented as explained in Boyle et al . 17 . The first four principal components explained a significant percent of the variation among olfactory genes.
PCA was performed on the olfactory gene effect matrix (278 olfactory genes by 558 cell lines) using the prcomp function in R, without column scaling. Gene effect profiles were projected onto the first four principal components and returned to their original space via matrix multiplication, as specified in Boyle et al . 17 , to create a correction matrix. The difference between the original, full dataset and the correction matrix was used as the corrected dataset.
To generate a modified ncPCA dataset (benchmarked in Figure 7), we modified the original protocol in two ways. First, gene means were removed before the correction procedure and added back after. Second, cell lines were renormalized after the gene means were restored.
More specifically, the normalization was performed in four steps. First, gene effect data was gene-centered and transposed (genes as rows and cell lines as columns). Second, permutation testing was performed to find principal components of nonessential genes that contain significant variation related to nonspecific killing. Third, the full, preprocessed gene effect matrix was projected onto the loadings of the nonsignificant principal components of the nonessential gene effect data and subsequently projected back to the original space. Lastly, gene means were added back to the matrix and cell lines were recentered according to the procedure in the CERES Normalization section.

Correlation network significance testing
Significance testing of correlation networks for CORUM protein complexes was performed as described in Pan et al ., 2018 18 , with the following modifications: 1) The results of 2000 (rather than 10,000) shuffled networks were used to generate the null distribution for significance testing; 2) Correlation ranks were computed on the full correlation matrices (17,634 x 17,634 genes) for each underlying dataset.
Genes present in more than one geneset were discarded. For each of the three gene effect matrices considered (uncorrected, RCPC and ncPCA-corrected), Pearson correlations were calculated between all pairs of genes present in the union of these four genesets. For each dataset, the top 400 correlations were taken as edges in a correlation network, which was visualized in the Cytoscape desktop application. The network layout was computed by combining the top edges from all three networks into a single combined network (1200 edges total), selecting the "Group Attributes Layout", and selectively plotting edges specific to each dataset. Finally, to generate the ROC curve, true positives were defined as correlations between genes from the same geneset, while false positives were those between genes from different genesets.