Improved prediction of bacterial CRISPRi guide efficiency through data integration and automated machine learning

CRISPR interference (CRISPRi), the targeting of a catalytically dead Cas protein to block transcription, is the leading technique to silence gene expression in bacteria. Genome-scale CRISPRi essentiality screens provide one data source from which rules for guide design can be extracted. However, depletion confounds guide efficiency with effects from the targeted gene. Using automated machine learning, we show that depletion can be predicted by a combination of guide and gene features, with expression of the target gene having an outsized influence. Further, integrating data across independent CRISPRi screens improves performance. We develop a mixed-effect random forest regression model that learns from multiple datasets and isolates effects manipulable in guide design, and apply methods from explainable AI to infer interpretable design rules. Our method outperforms the state-of-the-art in predicting depletion in an independent saturating screen targeting purine biosynthesis genes in Escherichia coli . Our approach provides a blueprint for the development of predictive models for CRISPR technologies in bacteria.


Introduction
CRISPR interference (CRISPRi), in which a catalytically dead Cas protein incapable of DNA cleavage (dCas) is targeted to interfere with transcription of a gene of choice (Bikard et al., 2013;Qi et al., 2013), is the most widely used CRISPR technology in bacteria. In contrast to eukaryotes, many bacteria lack the necessary repair pathways to survive genome editing by the double-stranded break induced by CRISPR-Cas9.
Applications of CRISPR-Cas9 as a sequence-specific antibiotic notwithstanding (Bikard et al., 2014;Citorik et al., 2014;Gomaa et al., 2014), the main impact of CRISPR-Cas in engineering bacteria has come from using it as a platform on which to develop new technologies that can be guided to a specific locus in a programmable fashion. CRISPRi is the simplest example of this, where the dCas protein itself serves as an effector to silence gene expression by physically blocking the procession of the RNA polymerase.
The development of CRISPRi has opened up a range of biological applications, from down-regulating individual genes for genetic studies to performing genome-wide fitness screens or engineering genetic circuits (Luo et al., 2016;Vigouroux and Bikard, 2020). As an alternative screening technology to transposon mutagenesis (Cain et al., 2020), CRISPRi has the advantage that particular genes of interest can be directly targeted, avoiding the need for large mutant libraries to achieve gene saturation.
Another area of application is engineering synthetic regulatory circuits (Jusiak et al., 2016) or metabolic networks (Cho et al., 2018a;Mougiakos et al., 2018), where collections are gRNAs are used to coordinately downregulate and upregulate associated genes and pathways. However, all of these applications critically depend on the efficiency of silencing provided by selected guides. Genetic screens already routinely employ tens of thousands of guides simultaneously, and it is impractical to individually test each guide's efficiency. This problem will only be accentuated as the scale of applications increases through the use of CRISPR array technology that allows multiplexed expression of suites of guides simultaneously (Liao et al., 2019;Reis et al., 2019) to dissect and engineer increasingly complex phenotypes. Reliable prediction of guide efficiency will therefore become increasingly important as applications of CRISPRi become increasingly ambitious.
Given the impact of CRISPR-based genome engineering in eukaryotes, significant effort has been expended in developing methods for predicting editing efficiency. The first attempts used classical machine learning methods on relatively small datasets comprising efficiency measurements for thousands of gRNAs. The applied methods include logistic regression (Doench et al., 2014), support vector machines (Labun et al., 2016;Wong et al., 2015), linear regression (Moreno-Mateos et al., 2015), and gradient-boosted decision trees (Doench et al., 2016). As the amount of Cas9 editing data has increased, deep learning approaches have become increasingly popular. These include convolutional neural networks (Chuai et al., 2018), which apply a collection of adaptive filters to automatically extract local sequence features, and long short-term memory networks (LSTM) (Wang et al., 2019), which retain a memory that potentially allows for the detection of long-range interactions between sequence features. Newer methods have put substantial effort into engineering deep learning architectures to further boost performance (Kim et al., 2019). It is important to note that many of these deep learning methods have been trained on tens of thousands of measurements of guide efficiency, and fusing datasets has played an important role in further increasing performance (Xiang et al., 2021).
So far, relatively little attention has been paid to predicting guide efficiency for bacterial CRISPRi. The only study to date developed a LASSO regression model for predicting CRISPRi guide efficiency (Calvo-Villamañán et al., 2020) with a limited sequence feature set using data from a single genome-wide CRISPRi screen in Escherichia coli . Given the trajectory of prediction methods for eukaryotic genome engineering applications towards larger datasets and more complex machine learning approaches, we asked if a similar approach could improve our ability to extract design rules for gRNAs for bacterial CRISPRi applications. Starting with an investigation of features driving guide depletion in CRISPRi screens using automated machine learning, we find that gene effects that are not modifiable in guide design dominate. Starting from this foundation, we develop a machine learning approach that separates gene and guide effects while learning from multiple independent CRISPRi screens, allowing us to arrive at a predictive model of guide efficiency that we show improves on the state-of-the-art using a saturating depletion screen of purine biosynthesis genes during growth in minimal media.

Results
Automated machine learning and feature engineering identifies gene-specific effects in CRISPRi depletion screens We set out to devise design rules for CRISPRi in bacteria by combining machine learning with large experimental datasets. The largest available datasets come from genome-wide depletion screens. However, it is currently unknown how well depletion in these screens can be predicted given known guide and genomic features. We thus began our investigation by applying automated machine learning (autoML) ( Figure 1A).
AutoML refers to a collection of techniques that attempt to automate the often labor-intensive process of model selection and optimization. Rather than sequentially fitting different types of models and individually optimizing their hyperparameters as typically done in applying ML to a new problem, autoML techniques turn model selection itself into an optimization problem. We used the Auto-Sklearn package (Feurer et al., 2015) that wraps classification and regression models implemented in the Python Scikit-learn package (Pedregosa et al., 2011) in a Bayesian optimization framework.
We first asked how well we could predict gRNA depletion log 2 fold-changes (logFCs) for essential genes as defined by the Keio collection (Baba et al., 2006), and what features would be required for accurate prediction. As essential genes should have an infinite fitness cost upon complete silencing, we assumed differences in depletion would mainly depend on gRNA silencing efficiency. We leveraged a published E. coli CRISPRi essentiality screen using dCas9 performed in rich media , which included 1,951 guides targeting 293 essential genes. To predict depletion in this data set, we engineered a series of feature sets of increasing complexity ( Figure S1; Table S1) starting with the one-hot-encoded gRNA and PAM sequence as well as the one-hot encoded dinucleotide sequence including four bases upstream of the gRNA sequence and three bases following the NGG PAM. This resulted in a poorly performing model with a median Spearman's ρ of~0.20 in 10-fold cross-validation ( Figure 1B; Table S2). We therefore iteratively added a set of additional features while monitoring changes in model performance. As targeting efficiency has been suggested to depend on distance to the transcriptional start site (Qi et al., 2013;Wang et al., 2018), the set included absolute and relative distance to the start codon. We also included a suite of thermodynamic features describing gRNA:target interactions predicted using the ViennaRNA package (Lorenz et al., 2011): minimum free energy of the folded gRNA, hybridization of two gRNAs, and hybridization of the targeted DNA and gRNA (Lorenz et al., 2012). These additional feature sets resulted in only moderate improvement in Spearman correlation (ρ~0.24) for our predictions.
Given that features describing the guide sequences themselves were inadequate to predict guide depletion, we developed a series of features associated with each targeted gene that we reasoned may have some explanatory power ( Figure 1B). First we used gene ID alone as a predictor, reasoning that incomplete silencing of essential genes may lead to different rates of depletion. While doing so improved the accuracy of predictions, the Spearman correlation between predicted and measured log-change remained below 0.4. We reasoned that additional information about each gene might improve our capacity to predict depletion, and so we constructed eight additional features describing each gene. We collected publicly available RNA expression data over growth in minimal media (Conway et al., 2014) and computed minimum and maximum expression values. We collected transcription unit (TU) information from RegulonDB (Santos-Zavaleta et al., 2019) and calculated the distance from the target site to the start of the TU, the number of downstream genes in each TU, and the presence of other essential genes in the TU. Finally we also included gene GC content.
Incorporating these additional gene features led to a major improvement in prediction accuracy, with cross-validation performance jumping to a Spearman's ρ of ~0.68 To understand the contribution of these features to the prediction of gRNA depletion, we used SHapley Additive exPlanation values (SHAP values) computed with TreeExplainer (Lundberg et al., 2020) on the best performing random forest regression model produced by Auto-Sklearn ( Figure 1C; Table S3). SHAP values are a game-theoretic approach to feature importance that capture the marginal contribution of a given feature to a prediction. Looking at average absolute SHAP values provides a measure of feature importance, while plotting individual SHAP values shows how each feature affects each individual prediction. Of all considered features, maximal RNA expression had the single largest average effect on depletion, making an average of ã 1.7 fold difference to the predictions. Unexpectedly, high target gene expression tended to be associated with higher depletion. There was also clear evidence for polar effects from CRISPRi, as the number of downstream essential genes was highly predictive of increased depletion. The most predictive effects that could actually be modified by guide design were associated with guide distance to the transcriptional start site, but on average these had fairly small effects compared to features associated with the target gene. In summary, we found that autoML can rapidly produce predictive models of CRISPRi guide depletion, and the predictions made by these models are dominated by the effects of gene features that can not be modified in guide design.
These findings outline key features that need to be accounted for to accurately infer guide efficiency from genome-wide depletion screens. cloned into an appropriate plasmid for expression. This plasmid collection is then transformed into the target bacteria, and depletion is measured as the change in guide frequency over growth determined by sequencing relative to a set of non-targeting gRNAs. The measured depletion (logFC) is then a mixture of the fitness effect of gene knockdown with the efficiency of silencing itself. (B) Comparison of Spearman correlation between actual and predicted guide depletion in 10-fold cross-validation (CV) of the best model trained with Auto-Sklearn with different feature combinations, using data from . (C) The ten most predictive features determined using TreeExplainer on the optimal random forest model trained with Auto-Sklearn and 574 guide and gene features. Mean absolute SHAP value (left) provides a global measure of feature importances, while the beeswarm plot (right) shows the effect of each feature on each individual gRNA prediction.

Data fusion improves prediction performance
As we had exhausted new features to explore, we next asked whether the size of our dataset was limiting the accuracy of our predictions. To this end, we collected data from two additional CRISPRi screens of E. coli in rich media. First, we included data from an additional screen using the same gRNA library but with dCas9 expressed from a stronger promoter, which we refer to here as E18 Cui . Second, we included data from a completely independent screen using a higher density library containing twice as many guides targeting essential genes (4197; 528 are identical to gRNAs contained in Cui/Rousset), which we refer to as Wang . We refer to the original data set as E75 Rousset. It is also worth noting that while the E18 Cui and E75 Rousset libraries were grown repeatedly to stationary phase, the Wang screen was collected in log phase. The level of depletion in each dataset exhibited qualitative differences, with Wang showing a clearer bimodal separation between depleted and non-depleted guides (Figure 2A). There was a reasonable correlation of depletion between datasets, with E18 Cui and E75 Rousset exhibiting a Spearman's ρ of~0.9. The correlation between Wang and the other two datasets was lower (ρ~0.75-0.79), but this seemed mostly attributable to a saturation effect in Wang, possibly due to the shorter growth period ( Figure S2).
To investigate the impact of fusing these datasets on model performance, we trained a series of models using Auto-Sklearn with each dataset individually or in combination including a dataset indicator as a potential predictor, and we tested them on sets of guides held out from each dataset as well as a mixed test set ( Figure 2B;  Figure S3 A-C), while tree-based methods (e.g. random forest regression, histogram-based gradient boosted trees; Figure S3 E,F) showed clear improvement.
Importantly, none of the tested models appeared to degrade in performance when trained with fused data. These findings suggest that both increased generalizability and accuracy can be achieved by integrating multiple data sources for training tree-based models for CRISPRi depletion.

Segregating guide and gene effects produces a predictive model for CRISPRi guide efficiency
Our exploration of the features most predictive of gRNA depletion in competitive screens highlighted that features describing the targeted gene often made much larger contributions to the prediction than features describing the guide sequence. This is problematic for predicting guide efficiency from depletion screens, as this large gene-to-gene variation in depletion must be removed to properly extract the contribution of guide efficiency.
We took two distinct approaches to separating guide and gene effects. The first was to explicitly model both effects jointly using Mixed-Effect Random Forest (MERF) regression (Hajjem et al., 2014). The MERF model handles data with an underlying cluster structure by defining two separate models: a linear model that captures random effects associated with the cluster, and a random forest (or other complex model) that captures fixed effects associated with each individual measurement. These models are then jointly optimized in an iterative process using the expectation-maximization algorithm. In our case, random effects correspond to features associated with each gene (e.g. gene ID, expression level) as well as dataset, while fixed effects correspond to features that could be manipulated in gRNA design (e.g. PAM and guide sequence, thermodynamic properties).
For the second approach, which we refer to as median subtracting (MS), we subtract the gene-wise median logFC from each gRNA depletion value to calculate relative "activity scores" following previous work (Calvo-Villamañán et al., 2020).
However, this leads to problems in integrating multiple datasets, as the range of depletion values varies across datasets (Figure 2A). We adapted a previously described approach used for fusing CRISPR gene deletion datasets (Xiang et al., 2021). First, we averaged the logFCs between E75 Rousset and E18 Cui which share all guides in common. We then calculated a linear scale factor for guides shared between Wang and the averaged Rousset/Cui data set to make logFCs for the unshared guides in Wang comparable to logFCs derived from Rousset/Cui ( Figure   S4A-C). For cross-validation, scaling was performed within each test fold to avoid possible leakage of information between test and training sets.
Both the fixed effect model from the MERF and activity scores in the MS method remove gene-specific effects to estimate guide efficiency, making guide-wise cross-validation difficult as the true guide efficiency is unknown. As an alternative to guide-wise cross-validation, we developed a gene-wise cross-validation scheme. We trained new models using 10-fold cross validation, this time holding out all guides targeting a set of held-out genes, evaluating the Spearman correlation between predictions and measured depletion within each gene under the assumption that rank order should reflect guide efficiency within a gene.
We trained and tested three models. Since tree-based models performed best on predicting gRNA depletion, we trained both a MERF model and a random forest trained on MS data. These were compared to both a published LASSO model based on the MS approach (Calvo-Villamañán et al., 2020) (hereafter referred to as "Pasteur") and a LASSO model we trained to evaluate the effect of our expanded feature and data sets on prediction accuracy. It should be noted that the Pasteur model was trained on the E75 Rousset data, so our benchmark results are not independent of its training data and will tend to overestimate the performance of the Pasteur model.
As we had previously observed in our evaluation of predictions of guide-wise depletion, data fusion between multiple CRISPRi screens consistently improved performance across models. In aggregate, the random forest models performed slightly better than the LASSO-based models (median ρ=0.375 (MERF) and 0.378 (MS) vs.
0.357 (Pasteur)). When we broke this down into performance on held-out genes in individual datasets ( Cui data, and worse on Wang. The MERF and MS random forest models performed generally similarly to one another, likely due to the high correlation observed between median gene-wise logFC and the MERF-predicted random effects across our datasets ( Figure S4D). In sum, we find that random forests trained on multiple datasets outperform simpler regression models on predicting guide efficiency for held-out genes, and that the MERF approach provides a straight-forward means of integrating datasets while isolating effects important for guide efficiency. correlations between predictions and measured logFC for held-out genes. Genes were held out in 10-fold cross-validation, and the reported median Spearman correlation was calculated across all held-out genes.

Model interpretation with explainable AI illustrates rational design rules for CRISPRi
To understand the features underlying model performance, we again examined SHAP values for our random forest models using TreeExplainer (Lundberg et al., 2020). We observed similar features with large impacts on predictions from both random forest models ( Figure 3A; S4E; Table S7). In the MERF model, the strongest average effects were seen for a guanine at the +1 position following the PAM, followed by distances from the start codon. In particular, we found that targeting positions further from the start codon led to reduced guide efficiency, as has been inferred previously (Qi et al., 2013).
Other top features involved the nucleotide at position 20 of the guide, directly adjacent to the PAM sequence ( Figure 3A & B). Here guanine and particularly adenine at this position negatively impacted silencing efficiency, while cytosine and thymine increased efficiency -almost the exact inverse of previous reports for Cas9 efficiency in eukaryotic genome editing applications (Doench et al., 2014). Within and following the PAM sequence, our SHAP values were qualitatively similar to previous observations in Cas9 genome editing. Cytosine was favored at the variable position of the NGG PAM, and a guanine residue immediately following the PAM had a negative impact on silencing, though we additionally observed a positive impact of cytosine at this position.
These effects within and around the PAM sequence appeared to interact with each other, as we saw additional effects for dinucleotide sequences covering the end of the guide sequence and first residue of the PAM where cytosine-cytosine and thymine-cytosine residues improved performance while guanine-guanine residues had a strong negative effect.
To further investigate potential interactions between features, we estimated SHAP interaction values that quantify situations in which the presence of one feature changes the impact of another, so that the combined SHAP value for both features together is not the simple sum of each feature's SHAP value. To provide a visualization of these interactions, we calculated expected effects using the median SHAP value for each feature from guides containing only one of the interacting features, and compared the expected sum to the actual SHAP values for guides containing both features ( Figure 3C; Table S8).

Deep learning approaches do not improve prediction performance
Given that we saw better performance with tree-based methods over linear regression, we next asked if model complexity was a limiting factor in prediction. Deep learning approaches have been applied to predicting guide efficiency for a number of CRISPR technologies (Chuai et al., 2018;Kim et al., 2018Kim et al., , 2019Wang et al., 2019;Xiang et al., 2021). Considering this, we asked if deep learning models would also improve performance in predicting gRNA efficiency for CRISPRi in bacteria. As a representative architecture, we implemented a one-dimensional convolutional neural network (CNN), which runs a series of kernel filters across the sequence to extract local features. In addition to a custom CNN architecture, we reimplemented and tested the state-of-the-art deep learning architecture used for predicting Cas9 gene editing efficiency by CRISPRon (Xiang et al., 2021), only trained using our CRISPRi data.
For our custom CNN architecture, we used the convolutional layers to extract sequence features before concatenating them to the rest of our guide feature set ( Figure S5A). This concatenated feature set was then fed through a fully connected 4-layer multilayer perceptron (MLP) for regression using MS values for guide efficiency.
These results show that conventional machine learning approaches can outperform deep learning architectures and suggest that data may currently be limiting for more complex machine learning approaches.

A saturating screen of purine biosynthesis genes independently validates performance of tree-based models and data fusion
Our previous benchmarking indicated that tree-based methods trained on multiple datasets outperformed other methods in predicting guide efficiency. However, these results were based entirely on cross-validation within our training datasets. To produce a truly independent test set, we first targeted a plasmid-encoded GFP construct with 19 gRNAs across a range of predicted guide efficiencies, and measured the reduction in cell fluorescence by flow cytometry (Figure 4A) Figure S6A).
However, when we reanalyzed the data from a Miller assay (measuring β-Galactosidase activity) previously used to validate the Pasteur model (Calvo-Villamañán et al., 2020) ( Figure 4B; Table S10), we found that the Pasteur model performed best (ρ=0.71), followed by the MERF and MS random forests (ρ=0.65, 0.59) and the LASSO model.
We also tested three tools designed for predicting Cas9 guide efficiency for genome editing in eukaryotes (Doench et al., 2016;Kim et al., 2019;Wilson et al., 2018), and all performed universally poorly on both data sets.
While the exact reasons for the discrepancies in performance between our GFP measurements and the Miller assay are unclear, one plausible explanation is that these data sets simply have sample sizes too small to discriminate between prediction methods. To resolve this, we performed a high-throughput screen targeting nine genes from the purine biosynthesis pathway of E. coli known to be essential in minimal media, spread across seven independent transcriptional units ( Figure 4C). To avoid any bias in guide selection, we saturated all potential target sites in each gene, ending with a total of 750 gRNAs, including between 35 and 223 guides per gene. Duplicate samples were then collected at three time points during growth in M9 minimal medium, and gRNA depletion was measured with reference to input samples, normalized using a set of 50 gRNAs designed not to target any E. coli sequence.
Comparing the experimentally determined depletion values to predictions from our tested models confirmed the results of our previous cross-validation ( Figure 4D; Beyond validating the performance of our models, our saturating screen of purine biosynthesis genes also revealed previously unobserved features of CRISPRi depletion screens. First, there were two genes, purE and purK, on which all methods performed poorly as measured by Spearman correlation. Upon inspection of the depletion values, it became clear that this was because there was surprisingly little variation in guide efficiency along these transcripts (Figure 4E; S6C). This meant that for these genes, differences in ranking reflected very small differences in depletion, likely within the error of our experimental measurements. We examined our initial training set to see if this might be a more widespread phenomenon, finding a substantial number of genes with low variation in their guide depletion values ( Figure S6D). This may be a factor in the overall low average Spearman correlations we report in our cross-validation.
A second unexpected feature was the overall lack of a clear relationship between guide efficiency and distance to the transcriptional start site. Of the nine tested genes, only two, purC and purM, showed a clear linear dependency of depletion on position within the gene sequence. This was particularly surprising, as distance features were clearly important to our model predictions. We attempted to train a model excluding distance features, but this substantially degraded performance on predicting depletion in our high-throughput screen ( Figure S6B). Whether this is an artifact of our training data, based on screens which used small collections of guides biased towards the 5′ end of genes, or if other guide features compensate for positional differences in guide efficiency remains unclear. In support of the latter, our analysis of feature interactions found many of the strongest effects came from interactions between distance features and sequence features in the vicinity of the PAM ( Table S8), suggesting that sequence features have larger effects on efficacy as the distance from the transcription start site increases. In sum, our screen of guides targeting purine biosynthesis genes independently validated the better performance of our random forest models compared to the state-of-the-art, while also highlighting some unexpected features of CRISPRi. Spearman correlations between the predicted scores and measured logFC across collected timepoints.
(E) Measured logFCs for each guide as a function of distance to the start codon for each gene.

Discussion
In this study, we developed a predictive model for CRISPRi guide efficiency using integrated data from three gene essentiality screens in E. coli. We extensively explored the process of model development, evaluating how feature engineering, data integration, and model selection affect performance. We have shown that this model improves on the previous state-of-the-art using both gene-wise cross-validation on our training data as well as a fully independent screen of guides targeting purine biosynthesis genes essential in minimal media. These investigations provide a blueprint for developing similar predictive models, both for other CRISPR-Cas systems (Vialetto et al., 2021) and technologies, as well as for CRISPRi in different bacteria where design rules may vary. We have made a web server for predicting CRISPRi guide efficiency using our MERF publicly available at: https://ciao.helmholtz-hiri.de.
Prediction of guide efficiency will become increasingly important with more complex applications of CRISPR technologies. In particular, the potential for multiplexing CRISPRi presents could be transformative when compared to established technologies. One example of this would be in screening for fitness interactions between genes. The current state-of-the-art is based on arrayed mating of single gene deletion libraries (Butland et al., 2008;Typas et al., 2008), which is both labor intensive and technically challenging, and becomes increasingly so when querying higher-order interactions (Kuzmin et al., 2018). A similar example is in metabolic engineering where multiplexed CRISPRi can be used to modulate biosynthetic pathways to optimize production of a particular metabolite for industrial applications (Lian et al., 2017 We have also shown that a rich, biologically relevant feature set is important for predicting CRISPRi depletion. Strikingly, we found that gene identity alone was a poor predictor of depletion when compared to including measures of gene expression and potential for polar interactions within transcriptional units. In particular, gene expression was the single largest contributor to gene depletion as measured by SHAP values, and higher expression was counterintuitively associated with higher depletion. As the availability of transcriptomics data may be lacking for some organisms, we also tested the possibility of using the codon adaptive index (CAI) as a proxy, with promising results ( Figure S6B). While these results do not necessarily imply a direct causal relationship between expression and depletion, they do suggest that caution should be taken when comparing guide depletion levels between genes in a screen as factors other than gene fitness may strongly influence the degree of depletion.
Model selection and tuning also had a large impact on prediction performance.
The standard approach to developing a machine learning model for a particular application generally involves a significant degree of trial and error. This is true both for selecting a type of model and tuning model hyperparameters, which is often critical to performance. To avoid this, we applied automated machine learning, which turns model selection and tuning into an optimization problem. Auto-Sklearn (Feurer et al., 2015) searches over a set of twelve regressors and their hyperparameters and was able to reliably select a final model comparable in performance to hand-tuning. For all of our various data and feature sets, Auto-Sklearn tended to select tree-based regressors.
This is in contrast to previous work that suggested linear regression was adequate to capture guide efficiency (Calvo-Villamañán et al., 2020). Possibly this is due to the more complex feature set we constructed, containing a wide range of features beyond simple sequence features. We also applied several deep learning approaches, including the architecture successfully used by CRISPRon to predict Cas9 genome editing efficiency (Xiang et al., 2021), but these failed to achieve similar performance to tree ensemble regressors. This again suggests that the availability of data for training may be a limiting factor, as deep learning models often require large sample sizes to achieve high performance.
Our finding that gene features that cannot be modified during guide design are dominant in determining depletion in CRISPRi screens highlighted the importance of removing these before attempting to predict guide efficiency. Whereas predictive methods for CRISPR-Cas genome engineering tools targeted at eukaryotes are often trained on large datasets with direct measurements of guide efficiency (e.g. indel rates), for bacterial CRISPRi the largest data sets come from essentiality screens, which provide only indirect measurements of guide efficiency. We took two distinct approaches to extract efficiency from CRISPRi screens -directly modeling and removing gene effects using a mixed-effect random forest (MERF) (Hajjem et al., 2014), and heuristically subtracting median values for each gene from guide depletion values as described previously (Calvo-Villamañán et al., 2020). To our surprise, both approaches produced models with roughly similar performance and appeared to identify largely similar feature sets driving guide silencing efficiency. It is possible that the richer description of gene effects enabled by the MERF may become more important when incorporating data from additional screens in other conditions, or expanding beyond using only essential genes for training. The MERF can also infer its own normalization between data sets as part of its random effect inference, which will greatly simplify training of more complex models.
While we focused here on applications of CRISPRi with dCas9 in E. coli, the techniques we have developed are in principle generic and could be extended to CRISPRi with any catalytically-dead nuclease in any bacterium of interest, or even to entirely different CRISPR systems. For instance, we recently applied the same basic methodology to investigate the features underlying autoimmune activation of Cas13 targeting cellular RNA (Vialetto et al., 2021). It is becoming increasingly clear that the performance of CRISPRi depends on both genetic background and the specific Cas protein used. For instance, Streptococcus pyogenes dCas9 expression has low silencing efficiency in some bacteria and can even be toxic (Cho et al., 2018b), forcing the adoption of alternative Cas proteins (Rock et al., 2017). Alternative Cas proteins have large differences in their PAM preferences and the stringency of the PAM requirement (Collias and Beisel, 2021); presumably, alternative dCas proteins may also respond differently to the other gene and guide features described here. The approach outlined here, applying autoML and explainable AI to rapidly arrive at a description of the design rules underlying the efficiency of CRISPRi silencing, provides a means to rapidly characterize the behavior of new dCas proteins as genome-wide screening data becomes available.

Figure S1
Figure S1: Illustration of the genomic and sequence features used, see also Table S1.

Training datasets
We collected the data from three previous CRISPRi genome-wide essentiality screens in E.coli K12 MG1655 Rousset et al., 2018;Wang et al., 2018). The sequence, targeted gene, gene position, and fitness effect of each gRNA was retrieved from the supplementary information of each study. Gene sequences and positions were updated to be consistent with the latest reference genome version (NC_000913.3). We discarded gRNAs from the Wang data set previously removed as having insufficient read counts  or sequences from the Rousset and Cui datasets that differed from the reference sequence due to differences in the genome versions. 8099 gRNAs targeting the coding-strand within the coding regions of essential genes were extracted in total from all three datasets. Genes targeted by fewer than 5 gRNAs were removed.  (Baba et al., 2006), and transcriptional unit definitions from RegulonDB (Tierrafría et al., 2022). Transcriptomic data including gene expression levels across growth at ten different ODs were obtained from a previous study (Conway et al., 2014). Minimal or maximal expression levels were calculated across the range of ODs until the growth phase when cells were collected in each CRISPRi screen: OD 1.4 for the Wang dataset, and all ODs for the Rousset and Cui datasets. The codon adaptation index (CAI) for each gene was calculated using CAIcal (Puigbò et al., 2008).
Deep learning models were trained using pytorch (version 1.8.1) (Paszke et al., 2019) and pytorch-lightning (version 1.5.10). For our custom 1D CNN model, sequence features were processed using 1D convolutional layers and later concatenated with other guide features. Concatenated features were further processed with fully connected layers. Three 1D convolutional layers were implemented sequentially with input channels 4, 64, and 64, output channel 64, 64, and 32, kernel size 5, 3, and 1, and stride 2, 2, and 1 respectively. For fully connected layers, output dimensions are 128, 64, 32 and 1 (which is predicted gRNA efficiency). The first three fully connected layers are accompanied by batch normalization (Ioffe and Szegedy, 2015), ReLU and dropout (Srivastava et al., 2014) (p=0.5). We trained the model using AdamW (Loshchilov and Hutter, 2017) optimiser with learning rate of 0.001 and batch size of 32. For CRISPROn, we followed the same architecture (CGx) with different numbers of non-sequential features concatenated to processed sequential features. We trained the model using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 and batch size of 32. Models were trained and evaluated with 10-fold cross-evaluation based on the gRNA sequences to predict gRNA depletion.
Tree-based models were interpreted using TreeExplainer from the python shap package (version 0.39.0) (Lundberg et al., 2020). SHAP value plots were generated with the 'summary_plot' function in shap.

Segregation of guide and gene effects
We removed genes with less than 5 gRNAs in each dataset to stabilize estimates of median gRNA activity scores (see below), resulting in 7400 gRNAs in total. This included 1618 gRNAs targeting 171 genes in E75 Rousset/E18 Cui and 4164 targeting 300 genes in Wang.
MERF models were trained using package merf (version 1.0). Hyperparameters for the fixed-effect random forest model were taken from auto-sklearn. 564 guide-specific features were assigned as fixed effects, while 9 gene-specific features except gene ID were assigned as random effects. 301 unique gene IDs were used as cluster IDs. The trained fixed-effect model was used to predict gRNA efficiency. To train simplified models excluding transcriptomic measurements (Figure S6B), CAI value, gene length, gene GC content, and dataset were included for the random-effect model.
For median subtracting (MS) models, logFC values were scaled to integrate the datasets, as an adaptation of a previously applied data fusion method (Xiang et al., 2021). First, the mean of logFCs of E75 Rousset and E18 Cui were calculated and used as the scaled logFC ( Figure S4A&B). Then linear regression was performed between the logFCs in Wang and scaled logFCs in E75 Rousset for 378 overlapping gRNAs. All of the logFCs from Wang were then scaled by the fitted slope and intercept ( Figure   S4C). The 378 overlapping gRNAs in Wang were excluded in the subsequent training for MS models. Activity scores were calculated by subtracting the scaled logFC of each gRNA from median scaled logFC for each gene across all 3 datasets.
MS models were trained with guide-specific features to predict activity scores for each gRNA. The hyperparameters of the MS random forest model were the same as for MERF, while those of the LASSO model were optimized using hyperopt (version 0.2.5) (Bergstra et al., 2013) with a search space for alpha ranging from 0 to 0.1. The trained models were directly used to predict activity scores.
Training and test sets were split gene-wise based on gene identifier. 10-fold cross-validation was used to evaluate model performance.
The fixed-effect model from MERF and the MS (RF) model were interpreted using the shap package (Lundberg et al., 2020).

Strains and growth conditions
All strains, plasmids, and primers are listed in Supplementary Table S14 and S15. E.
coli cells were grown in Lysogeny Broth (LB) (10 g/L NaCl, 5 g/L yeast extract, 10 g/L was calculated as the ratio between the difference in fluorescence values between the gRNA and negative control, and the difference between the positive and the negative control, followed by log transformation. The mean log fold repression across three replicates was compared to predicted values from the machine learning models (Table   S11).
For experiments with S. Typhimurium, the procedure was similar, but cells were grown until an OD 600 of~0.8 before analysis on an Accuri C6 flow cytometer. To eliminate non-cellular events, the forward scatter (cutoff of 10,000) was used and the mean green fluorescence value (FL1-H) across 30,000 events within a gate set for S. Typhimurium was used for data analysis as described above across four replicates.

Generation of the sgRNA library
For the sgRNA library targeting the purine biosynthesis pathway in E. coli MG1655, plasmid DC512 served as a backbone, following a previously established protocol (Liao et al., 2019). To generate a library with 800 sgRNAs (including 50 non-targeting sgRNAs; Table S12), 800 forward and reverse oligonucleotides each encoding one spacer and a 4-nt junction, were synthesized as an oPool (1600 oligos at 10pmol/oligo) by Integrated Device Technology (IDT). The same 5′ and 3′ assembly junction sequences were used for all spacer pairs leading to the same integration site within the backbone (5′ TAGT overhang at the 5′ end and a 3′ AAAC overhang at the 3′ end).
Supplementary Table S16 contains the specific oligonucleotides and assembly junctions used for the library generation. The oligos were phosphorylated and annealed to form dsDNA with a 5′ and 3′ overhang. The steps of phosphorylation and annealing were combined and conducted in one pot, by adding 8,000 fmol of the oPool and 1 µl T4 polynucleotide kinase (10 units) to 5 µl 10x T4 ligation buffer and then, adding water until reaching a final volume of 50 µl. After mixing briefly by pipetting the mix was incubated at 37°C for 30 minutes in a thermocycler and then incubated at 65°C for 20 minutes in a thermocycler to heat-inactivate the kinase. For the annealing of the forward and reverse oligo pairs, the following thermocycler steps were added: 95°C for 5 min, 94°C for 15 s, decrease by 1°C, and hold for 30 seconds for 79 cycles. For integrating the dsDNA inserts into DC512, 400 fmol of the dsDNA, 20 fmol of backbone plasmid, 0.5 µL of T4 ligase (1000 units), and 1.5 µL of BbsI (15 units) were added to 2 µL of 10x T4 ligation buffer, then water was added to reach a total volume of 20 µl. A thermocycler was used to perform 35 cycles of digestion and ligation (37 °C for 2 min, 16 °C for 5 min) followed by a final digestion step (60 °C for 10 min) and a heat inactivation step (80 °C for 10 min). After NdeI digestion (37°C, 1h) of the ligation mix to remove any remaining original backbone plasmids and subsequent ethanol precipitation, 10 µl of the ligation mix was transformed into electrocompetent E. coli NEB10 beta (NEB, Ipswich, MA, USA), following the manufacturer's instructions. After transformation and recovery in 1 ml SOC for 1 h at 37 °C with shaking at 250 rpm, different dilutions of the recovered cells were plated on LB agar containing the appropriate antibiotic and incubated for 16 h to check the number and color of the resulting colonies (ensuring a~58X coverage).
The rest of the recovered culture was added to 100 mL LB media containing the appropriate antibiotic and incubated at 37 °C with shaking at 250 rpm to OD 600 ≈ 1. Cells were harvested by centrifugation and subjected to plasmid extraction. Sanger sequencing was used to validate the library plasmid DNA.

Screening experiment
E. coli strain MG1655 was initially transformed with a dCas9 encoding plasmid (2.0 kV, 200 Omega, and 25 μF). The resulting strain SG332 was then transformed with the sgRNA library by electroporation and recovered in 900 µl SOC for 1.5h at 37 °C with shaking at 250 rpm. Different dilutions of the recovered cells were plated on LB agar containing the appropriate antibiotics and incubated for 16 h to check the number of the resulting colonies (~56 5 colonies). The recovered culture was back-diluted to OD 600 0.01 in LB medium with appropriate antibiotics and incubated at 37 °C with shaking for 13h.
Subsequently, 5 mL of the culture was sampled and the library was extracted by miniprep (Nucleospin Plasmid, Macherey-Nagel) to obtain the initial sgRNA distribution.
The calculated amount of culture to reach OD 600 0.01 in 50 ml M9 minimal medium, was sampled and washed twice with M9 minimal medium to remove traces of the LB medium. The culture was incubated at 37°C with shaking until it reached OD 600 1, allowing~6 replications. 5 ml of the culture was sampled at OD 600 0.2 and OD 600 0.6 and at OD 600 1 and the library was extracted by miniprep. The experiment was performed in duplicate starting from two independent transformations of MG1655 with the plasmid library.

Library sequencing
The sequencing library was generated using the KAPA HiFi HotStart Library Amplification Kit for Illumina® platforms (Roche) and the primers listed in Supplementary Table S16. The first PCR adds the first index. The second PCR adds the second index and flow cell-binding sequence. The amplicons of the first and second PCR reactions were purified using solid-phase reversible immobilization beads (AMPure XP, Beckman Coulter) following the manufacturer's instructions to remove excess primers and possible primer dimers. The sequencing library samples, with the required DNA concentrations ranging from 100 pg -200 ng in a total volume of 10 µL, were submitted to the HZI NGS sequencing facility (Braunschweig, Germany) for paired-end 2 × 50 bp deep sequencing with 800,000 reads per sample on a NovaSeq 6000 sequencer.

Sequencing data processing
Paired-end reads were merged using BBMerge (version 38.69) with parameters "qtrim2=t, ecco, trimq=20, -Xmx1g". Merged reads with perfect matches were assigned to the gRNA library using a Python script. After filtering guides for at least 1 count per million in at least 4 samples, read counts of each gRNA were normalized by factors derived from non-targeting guides using the trimmed mean of m-values method in edgeR (version 3.28.0) (Robinson et al., 2009). An extra column was added to the design matrix to capture batch effects between the two replicate experiments.
Differential abundance (logFC) of gRNAs between time points and the input library were estimated using edgeR, and a quasi-likelihood F test was used to test for significance after fitting in a generalized linear model.

Code and data availability
All code necessary to reproduce this results in the manuscript are available at: https://github.com/BarquistLab/CRISPRi_guide_efficiency_bacteria. Raw sequencing data for the CRISPRi purine screen has been deposited in GEO under accession GSE196911. A webserver implementation of the final MERF model is available at: http://ciao.helmholtz-hiri.de.