Abstract
Multiplex assays of variant effect (MAVEs) are diverse techniques that include deep mutational scanning (DMS) experiments on proteins and massively parallel reporter assays (MPRAs) on cis-regulatory sequences. MAVEs are being rapidly adopted in many areas of biology, but a general strategy for inferring quantitative models of genotype-phenotype (G-P) maps from MAVE data is lacking. Here we introduce a conceptually unified approach for learning G-P maps from MAVE datasets. Our strategy is grounded in concepts from information theory, and is based on the view of G-P maps as a form of information compression. We also introduce MAVE-NN, an easy-to-use Python package that implements this approach using a neural network backend. The ability of MAVE-NN to infer diverse G-P maps—including biophysically interpretable models—is demonstrated on DMS and MPRA data in a variety of biological contexts. MAVE-NN thus provides a unified solution to a major outstanding need in the MAVE community.
Introduction
Over the last decade, the ability to quantitatively study genotype-phenotype (G-P) maps has been revolutionized by the development of multiplex assays of variant effect (MAVEs), which can measure molecular phenotypes for thousands to millions of genotypic variants in parallel.1,2 MAVE is an umbrella term that describes a diverse set of experimental methods, three examples of which are illustrated in Fig. 1. Deep mutational scanning (DMS) experiments3 are a type of MAVE commonly used to study protein sequence-function relationships. These assays work by linking variant proteins to their coding sequences, either directly or indirectly, then using deep sequencing to assay which variants survive a process of activity-dependent selection (e.g., Fig. 1a). Massively parallel reporter assays (MPRAs) are another major class of MAVE, and are commonly used to study DNA or RNA sequences that regulate gene expression at a variety of steps, including transcription, mRNA splicing, cleavage and polyadenylation, translation, and mRNA decay.4–7 MPRAs typically rely on either an RNA-seq readout of barcode abundances (Fig. 1c) or the sorting of cells expressing a fluorescent reporter gene (Fig. 1e).
Most computational methods for analyzing MAVE data have focused on accurately quantifying the activity of individual assayed sequences.8–14 However, MAVE measurements like enrichment ratios or cellular fluorescence levels usually cannot be interpreted as providing direct quantification of biologically meaningful activities, due to the presence of experiment-specific nonlinearities and noise. Moreover, MAVE data is usually incomplete, as one often wishes to understand G-P maps over vastly larger regions of sequence space than can be exhaustively assayed. The explicit quantitative modeling of G-P maps can address both the indirectness and incompleteness of MAVE measurements.1,15 The goal here is to determine a mathematical function that, given a sequence as input, will return a quantitative value for that sequence’s molecular phenotype. Such quantitative modeling has been of great interest since the earliest MAVE methods were developed,16–18 but no general-use software has yet been described for inferring G-P maps of arbitrary functional form from MAVE data.
Here we introduce a unified conceptual framework for the quantitative modeling of MAVE data. This framework is based on the use of latent phenotype models, which assume that each assayed sequence has a well-defined latent phenotype (specified by the G-P map), of which the MAVE experiment provides an indirect readout (described by the measurement process). The quantitative forms of both the G-P map and the measurement process are then inferred from MAVE data simultaneously. We further introduce an information-theoretic approach for separately assessing the performance of the G-P map and the measurement process components of latent phenotype models. This strategy is implemented in an easy-to-use open-source Python package called MAVE-NN, which is built on a TensorFlow 2 backend.19 In what follows, we expand on this unified MAVE modeling strategy and apply it to a diverse array of DMS and MPRA datasets. Along the way we note the substantial advantages that MAVE-NN provides over other MAVE modeling methods, illustrate how the capabilities of MAVE-NN can inform experimental design going forward, and highlight new biological insights that our quantitative modeling of MAVE data reveals.
Results
Latent phenotype modeling strategy
MAVE-NN supports the analysis of MAVE data on DNA, RNA, and protein sequences, and can accommodate either continuous or discrete measurement values. Given a set of sequence-measurement pairs, MAVE-NN aims to infer a probabilistic mapping from sequence to measurement. Our primary enabling assumption, which is encoded in the structure of the latent phenotype model (Fig. 2a), is that this mapping occurs in two stages. Each sequence is first mapped to a latent phenotype by a deterministic G-P map, then this latent phenotype is mapped to possible measurement values via a stochastic measurement process. During training, the G-P map and measurement process are simultaneously learned by maximizing a regularized form of likelihood. Our initial implementation of MAVE-NN assumes that latent phenotypes are one-dimensional quantities, but multidimensional latent phenotypes are fully compatible within this conceptual framework.20,21
MAVE-NN includes four types of built-in G-P maps: additive, neighbor, pairwise, and black box. Additive G-P maps assume that each character at each position within a sequence contributes independently to the latent phenotype. Neighbor G-P maps incorporate interactions between nearest-neighbor characters, while pairwise G-P maps include interactions between all pairs of characters regardless of their position. Black box G-P maps have the form of a densely connected multilayer perceptron, the specific architecture of which can be controlled by the user. MAVE-NN also supports custom G-P maps that can be used, e.g., to represent specific biophysical hypotheses about the mechanisms of sequence function.
To handle both discrete and continuous measurement values, two different strategies for modeling measurement processes are provided. Measurement process agnostic (MPA) regression uses techniques from the biophysics literature15,16,20,22 to analyze MAVE datasets that report discrete measurements. Here the measurement process is represented by an overparameterized neural network that takes the latent phenotype value as input and outputs the probability of each possible measurement value (Fig. 2b). Global epistasis (GE) regression, by contrast, leverages ideas previously developed in the evolution literature23–26 for analyzing datasets that contain continuous measurements (Fig. 2c). Here, the latent phenotype is nonlinearly mapped to a prediction that represents the most probable measurement value. A noise model is then used to describe the distribution of likely deviations from this prediction. MAVE-NN supports both homoscedastic and heteroscedastic noise models based on three different classes of probability distribution: Gaussian, Cauchy, and skewed-t. We note that the skewed-t distribution, introduced by Jones and Faddy,27 reduces to Gaussian and Cauchy distributions in certain limits while also accommodating asymmetric experimental noise. Fig. 2d shows an example of a GE measurement process with a heteroscedastic skewed-t noise model.
Information-theoretic measures of model performance
We further propose three distinct quantities for assessing the performance of latent phenotype models (Fig. 2e). These quantities are motivated by thinking of G-P maps in terms of information compression. In information theory, a quantity called mutual information quantifies the amount of information that one variable encodes about another.28,29 Unlike standard metrics of model performance, like accuracy or R2, mutual information can be computed between any two types of variables (discrete, continuous, multi-dimensional, etc.). This property makes the information-based quantities we propose below applicable to all MAVE datasets, regardless of the specific type of experimental readout used. We note, however, that accurately estimating mutual information and related quantities from finite data is nontrivial and that MAVE-NN uses a variety of approaches to do this.
Intrinsic information, Iint, is the mutual information between the sequences and measurements contained within a MAVE dataset. This quantity provides a benchmark against which to compare the performance of inferred G-P maps. Predictive information, Ipre, is the mutual information between MAVE measurements and the latent phenotype values predicted by a G-P map of interest. This quantifies how well the G-P map preserves sequence-encoded information that is determinative of experimental measurements. When evaluated on test data, Ipre is bounded above by Iint, and equality obtains only when the latent phenotype losslessly encodes relevant sequence-encoded information. Variational information, Ivar, is a linear transformation of log likelihood that provides a variational lower bound on Ipre.30–32 The difference between Ipre and Ivar quantifies how accurately the inferred measurement process matches the observed distribution of measurements and latent phenotypes (see Supplemental Information).
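To make the definition of Ivar concrete for the discrete-measurement case, here is a minimal toy sketch (our own illustration, not MAVE-NN's estimator): Ivar equals the marginal entropy H(y) plus the mean per-datum log2 likelihood of the model.

```python
import math
from collections import Counter

def variational_info(y_obs, p_model):
    """Toy estimate of I_var = H(y) + <log2 p(y|phi)> for discrete measurements.

    y_obs:   list of observed bin indices
    p_model: list of model probabilities p(y_n | phi_n), one per observation
    """
    n = len(y_obs)
    counts = Counter(y_obs)
    # Marginal entropy H(y) in bits, estimated from the empirical distribution
    h_y = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Mean per-datum log2 likelihood under the model
    avg_ll = sum(math.log2(p) for p in p_model) / n
    return h_y + avg_ll

# A perfect deterministic model of a uniform 4-bin readout recovers H(y) = 2 bits:
y = [0, 1, 2, 3] * 25
print(variational_info(y, [1.0] * 100))  # -> 2.0
```

An imperfect model assigns p < 1 to observed measurements, lowering the second term and hence the bound, consistent with Ivar ≤ Ipre ≤ Iint.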
MAVE-NN infers model parameters by maximizing a (lightly) regularized form of likelihood. These computations are performed using the standard backpropagation-based training algorithms provided within the TensorFlow 2 backend. With certain caveats noted (see Methods), this optimization procedure maximizes Ipre while avoiding the costly estimates of mutual information at each iteration that have hindered the adoption of previous mutual-information-based modeling strategies.16
Application: deep mutational scanning assays
We now demonstrate the capabilities of MAVE-NN on three DMS datasets, starting with the study of Olson et al.33 on pairwise epistasis in protein G. Here the authors measured the effects of all single and nearly all double mutations to residues 2-56 of the IgG binding domain. This domain, called GB1, has long served as a model system for studying protein sequence-function relationships. To assay the binding of GB1 variants to IgG, the authors combined mRNA display with ultra-high-throughput DNA sequencing (Fig. 1a). The resulting dataset reports log enrichment values for all 1,045 single- and 530,737 double-mutant GB1 variants (Fig. 1b).
Inspired by the work of Otwinowski et al.,26 we used MAVE-NN to infer a latent phenotype model comprising an additive G-P map and a GE measurement process. This inference procedure required only about 3 minutes on a standard laptop computer (Supplemental Fig. S1). Fig. 3a illustrates the inferred additive G-P map via the effects that every possible single-residue mutation has on the latent phenotype. From this heatmap of additive effects, we can immediately identify all of the critical GB1 residues, including residues 27, 31, 41, 43, and 52. We also observe that missense mutations to proline throughout the GB1 domain tend to negatively impact IgG binding, as expected due to this amino acid’s exceptional conformational rigidity. Fig. 3b illustrates the corresponding GE measurement process, revealing a sigmoidal relationship between log enrichment measurements and the latent phenotype values predicted by the G-P map. Nonlinearities like this are ubiquitous in DMS data due to the presence of background and saturation effects. Unless they are explicitly accounted for in one’s quantitative modeling efforts, as they are here, these nonlinearities can greatly distort the parameters of inferred G-P maps. Fig. 3c shows that accounting for this nonlinearity yields predictions that correlate quite well with measurement values. Moreover, every latent phenotype model inferred by MAVE-NN can be used as a MAVE dataset simulator (see Methods). By analyzing simulated data generated by our inferred model for this GB1 experiment, we further observed that MAVE-NN can accurately and robustly recover the GE nonlinearity and ground-truth G-P map parameters (Supplemental Fig. S1).
Fig. 3d summarizes the values of our information-theoretic metrics for model performance. On held-out test data, we find that Ivar = 2.194 ± 0.020 bits and Ipre = 2.220 ± 0.008 bits. The similarity of these two values suggests that the inferred GE measurement process, which includes a heteroscedastic skewed-t noise model, has nearly sufficient accuracy to fully describe the distribution of residuals. We further find that 2.680 ± 0.008 bits ≤ Iint ≤ 3.213 ± 0.033 bits (see Methods), meaning that the inferred G-P map accounts for 70%-84% of the total sequence-dependent information in the dataset. While this performance is impressive, the additive G-P map evidently misses some relevant sequence features. This observation motivates the more complex biophysical model for GB1 discussed later in Results.
The ability of MAVE-NN to deconvolve experimental nonlinearities from additive G-P maps requires that some of the assayed sequences contain multiple mutations. This is because such nonlinearities are inferred by reconciling the effects of single mutations with the effects observed for combinations of two or more mutations. To investigate how many multiple-mutation variants are required, we performed GE inference on subsets of the GB1 dataset containing all 1,045 single-mutation sequences and either 50,000, 5,000, or 500 double-mutation sequences (see Methods). The shapes of the resulting GE nonlinearities are illustrated in Figs. 3e-g. Remarkably, MAVE-NN is able to recover the underlying nonlinearity using only about 500 randomly selected double mutants, which represent only ∼0.1% of all possible double mutants. The analysis of simulated data also supports the ability to accurately recover ground-truth model predictions using highly reduced datasets (Supplemental Fig. S1). These findings have important implications for the design of DMS experiments: even if one only wants to determine an additive G-P map, including a modest number of multiple-mutation sequences in the assayed library is often advisable because it may allow the removal of artifactual nonlinearities.
To test the capabilities of MAVE-NN on less complete DMS datasets, we analyzed recent experiments on amyloid beta (Aβ)34 and TDP-43,35 both of which exhibit aggregation behavior in the context of neurodegenerative diseases. Like with GB1, the variant libraries used in both experiments included a substantial number of multiple-mutation sequences: 499 single- and 15,567 double-mutation sequences for Aβ; 1,266 single- and 56,730 double-mutation sequences for TDP-43. But unlike with GB1, these datasets are highly incomplete due to the use of mutagenic PCR for variant library creation.
We used MAVE-NN to infer additive G-P maps from these two datasets, adopting the same type of latent phenotype model used for GB1. Fig. 4a illustrates the additive G-P map inferred from aggregation measurements of Aβ variants. In agreement with the original study, we see that most amino acid mutations between positions 30-40 have a negative effect on nucleation, suggesting that this region plays a major role in nucleation behavior. Fig. 4b shows the corresponding measurement process. Even though these data are much sparser than the GB1 data, the inferred model performs well on held-out test data (Ivar = 1.147 ± 0.043 bits, Ipre = 1.254 ± 0.024 bits, R2 = 0.793 ± 0.071). Similarly, Figs. 4c-d show the G-P map parameters and GE measurement process inferred from toxicity measurements of TDP-43 variants, revealing among other things the toxicity-determining hot-spot observed by Bolognesi et al.35 at positions 310-340. The resulting latent phenotype model performs well on held-out test data (Ivar = 1.806 ± 0.018 bits, Ipre = 2.011 ± 0.019 bits, R2 = 0.912 ± 0.052).
Application: a massively parallel splicing assay
Exon/intron boundaries are defined by 5’ splice sites (5’ss), which bind the U1 snRNP during the initial stages of spliceosome assembly. To investigate how 5’ss sequence quantitatively controls alternative mRNA splicing, Wong et al.36 used a massively parallel splicing assay (MPSA) to measure percent-spliced-in (PSI) values for nearly all 32,768 possible 5’ss of the form NNN/GYNNNN in three different genetic contexts (Fig. 1c,d). Applying MAVE-NN to data from the BRCA2 exon 17 context, we inferred four different types of G-P maps: additive, neighbor, pairwise, and black box. As with GB1, these G-P maps were each inferred using GE regression with a heteroscedastic skewed-t noise model. For comparison, we also inferred an additive G-P map using the epistasis package of Sailer and Harms.25
Fig. 5a compares the performance of these G-P map models on held-out test data, while Figs. 5b-d illustrate the corresponding inferred measurement processes. We observe that the additive G-P map inferred using the epistasis package25 exhibits less predictive information (Ipre = 0.220 ± 0.012 bits) than the additive G-P map found using MAVE-NN (P = 0.007, two-sided z-test). This is likely because the epistasis package estimates the parameters of the additive G-P map prior to estimating the GE nonlinearity. We also note that, while the epistasis package provides a variety of options for modeling the GE nonlinearity, none of these options appear to work as well as our mixture-of-sigmoids approach (compare Figs. 5b,c). This finding again demonstrates that the accurate inference of G-P maps requires the explicit and simultaneous modeling of experimental nonlinearities.
We also observe that increasingly complex G-P maps exhibit increased accuracy. For example, the additive G-P map gives Ipre = 0.262 ± 0.011 bits, whereas the pairwise G-P map (Figs. 5e,f) attains Ipre = 0.367 ± 0.015 bits. We note that the parameters of the pairwise G-P map appear to be very precisely determined, as MAVE-NN was able to accurately recover ground-truth parameters from simulated datasets of the same size (Supplemental Fig. S2). The black box G-P map, which comprises 5 densely connected hidden layers of 10 nodes each, performed the best of all four G-P maps, achieving Ipre = 0.489 ± 0.012 bits. Remarkably, this last predictive information value exceeds the lower bound of Iint ≥ 0.461 ± 0.007 bits, which was estimated from replicate experiments (see Methods). We thus conclude that pairwise interaction models are not flexible enough to fully account for how 5’ss sequences control splicing. More generally, these results underscore the need for software that is capable of inferring and assessing a variety of different G-P maps through a uniform interface.
Application: biophysically interpretable G-P maps
Biophysical models, unlike the phenomenological models considered thus far, have mathematical structures that reflect specific hypotheses about how sequence-dependent interactions between macromolecules mechanistically define G-P maps. Thermodynamic models, which rely on a quasi-equilibrium assumption, are the most commonly used type of biophysical model.37–39 Previous studies have shown that precise thermodynamic models can be inferred from MAVE datasets,16 but no software intended for use by the broader MAVE community has yet been developed for doing this. MAVE-NN meets this need by enabling the inference of custom G-P maps. We now demonstrate this biophysical modeling capability in the contexts of protein-ligand binding (using DMS data; Fig. 1a) and bacterial transcriptional regulation (using sort-seq MPRA data; Fig. 1e).
Otwinowski40 showed that a three-state thermodynamic G-P map (Fig. 6a), one that accounts for GB1 folding energy in addition to GB1-IgG binding energy,41 can explain the DMS data of Olson et al.33 better than a simple additive G-P map does. This biophysical model subsequently received impressive confirmation in the work of Nisthal et al.,42 who measured the thermostability of 812 single-mutation GB1 variants. We tested the ability of MAVE-NN to recover the same type of thermodynamic model that Otwinowski had inferred using custom analysis scripts. Our analysis yielded a G-P map with significantly improved performance on the data of Olson et al. (Ivar = 2.353 ± 0.012 bits, Ipre = 2.373 ± 0.009 bits, R2 = 0.948 ± 0.002) relative to the additive G-P map of Fig. 3. Fig. 6b shows the two inferred energy matrices that respectively describe the effects of every possible single-residue mutation on the Gibbs free energies of protein folding and protein-ligand binding. The folding energy predictions of our model also correlate as well with the data of Nisthal et al. (R2 = 0.548 ± 0.050) as the predictions of Otwinowski’s model do (R2 = 0.517 ± 0.058). This demonstrates that MAVE-NN can infer accurate and interpretable quantitative models of protein biophysics.
To test MAVE-NN’s ability to infer thermodynamic models of transcriptional regulation, we first re-analyzed the MPRA data of Kinney et al.,16 in which random mutations to a 75 bp region of the Escherichia coli lac promoter were assayed. This promoter region binds two regulatory proteins, σ70 RNA polymerase (RNAP) and the transcription factor CRP. As in Kinney et al.,16 we proposed a four-state thermodynamic model that quantitatively explains how promoter sequences control transcription rate (Fig. 6c). The parameters of this G-P map include the Gibbs free energy of interaction between CRP and RNAP, as well as energy matrices that describe the CRP-DNA and RNAP-DNA interaction energies. Because the sort-seq MPRA of Kinney et al. yielded discrete measurement values (Figs. 1e,f), we used an MPA measurement process in our latent phenotype model (Fig. 6d). The biophysical parameter values we thus inferred (Fig. 6e) largely match those of Kinney et al., but were obtained far more rapidly (in ∼10 min versus multiple days) thanks to the use of stochastic gradient descent rather than Metropolis Monte Carlo.
Next we analyzed sort-seq MPRA data obtained by Belliveau et al.43 for the xylE promoter, which had no regulatory annotation prior to that study and for which no biophysical model had yet been developed. Based on their MPRA data, as well as follow-up mass spectrometry experiments, Belliveau et al. proposed that xylE is regulated by RNAP, CRP, and the locus-specific regulator XylR. These findings motivated us to propose and train an eight-state thermodynamic model describing how interactions between these three regulatory proteins might control xylE expression (Fig. 6f). The resulting quantitative model includes energy matrix descriptions for RNAP, CRP, and XylR binding to DNA, as well as Gibbs free energy values for the CRP-XylR and XylR-RNAP interactions (Fig. 6g). From this model we see that XylR activates RNAP through what appears to be a class II activation mechanism,44 as energetic contributions from the -35 region of the RNAP binding site are markedly reduced in the xylE context relative to the lac context (Fig. 6e). We also see that CRP—a homodimer with dyadic symmetry—binds its site with remarkable asymmetry (again, compare to Fig. 6e). The biophysical factors that determine whether symmetric transcription factors like CRP interact with DNA in symmetric or asymmetric poses are poorly understood, and represent just one avenue of investigation opened up by the capabilities of MAVE-NN. More generally, these results provide a proof-of-principle demonstration of how MAVE-NN can be used, together with MPRA experiments, to establish biophysical models for previously uncharacterized gene regulatory sequences.
Discussion
In this work we have presented a unified strategy for inferring quantitative models of G-P maps from diverse MAVE datasets. At the core of our approach is the conceptualization of G-P maps as a form of information compression, i.e., that the G-P map first compresses an input sequence into a latent phenotype value, which the MAVE then reads out indirectly via a noisy nonlinear measurement process. By explicitly modeling this measurement process, one can remove potentially confounding effects from the G-P map, as well as accommodate diverse experimental designs. We have also introduced three information-theoretic metrics for assessing the performance of the resulting models. These capabilities have been implemented within an easy-to-use Python package called MAVE-NN.
We have demonstrated the capabilities of MAVE-NN in diverse biological contexts, including in the analysis of both DMS and MPRA data. We have also demonstrated the superior performance of MAVE-NN relative to the epistasis package of Sailer and Harms.25 Along the way, we observed that MAVE-NN can deconvolve experimental nonlinearities from additive G-P maps when a relatively small number of sequences containing multiple mutations are included in the assayed libraries. This capability provides a compelling reason for experimentalists to include such sequences in their MAVE libraries, even if they are primarily interested in the effects of single mutations. Finally, we showed how MAVE-NN can learn biophysically interpretable G-P maps from both DMS and MPRA data.
Applying MAVE-NN to the MPSA data of Wong et al.,36 we discovered that pairwise interaction models are not sufficient to describe how 5’ss sequences govern alternative mRNA splicing, and that higher-order epistatic interactions are needed to describe this critical aspect of eukaryotic biology. We also inferred the first biophysical model for transcriptional regulation by the xylE promoter. This biophysical model reveals that the well-studied transcription factor CRP binds its target site with surprising asymmetry in vivo, an intriguing phenomenon about which much remains to be learned.
MAVE-NN thus fills a critical need in the MAVE community, providing user-friendly software capable of learning quantitative models of G-P maps from diverse MAVE datasets. MAVE-NN has a streamlined user interface, is thoroughly tested, and is readily installed from PyPI by executing “pip install mavenn” at the command line. Comprehensive documentation, worked examples, and step-by-step tutorials are available at http://mavenn.readthedocs.io.
Author contributions
AT, WTI, DMM, and JBK conceived the project. AT and JBK wrote the software with assistance from AP and MK. WTI and JBK wrote a preliminary version of the software. AT, MK, and JBK performed the data analysis. AT, DMM, and JBK wrote the manuscript with contributions from MK and AP.
Conflicts of interest
The authors declare that they have no known conflicts of interest.
Online Methods
Notation
We represent each MAVE dataset as a set of N observations, {(xn, yn)}, n = 0, 1, …, N − 1, where each observation consists of a sequence xn and a measurement yn. Here, yn can be either a continuous real-valued number, or a nonnegative integer representing the “bin” in which the nth sequence was found. Note that, in this representation, the same sequence x can be observed multiple times, potentially with different values for y due to experimental noise.
G-P maps
We assume that all sequences have the same length L, and that at each of the L positions in each sequence there is one of C possible characters. MAVE-NN represents sequences using a vector of one-hot encoded features of the form xl:c, where xl:c = 1 if character c occurs at position l and xl:c = 0 otherwise. Here l = 0, 1, …, L − 1 indexes positions within the sequence, and c indexes the C distinct characters. MAVE-NN supports built-in alphabets for DNA, RNA, and protein (with or without stop codons), as well as user-defined sequence alphabets.
We assume that the latent phenotype is given by a deterministic function ϕ(x; θ) that depends on a set of G-P map parameters θ. As mentioned in the main text, MAVE-NN supports four types of G-P map models, all of which can be inferred using either GE regression or MPA regression. The additive model is given by ϕ(x; θ) = θ0 + Σl Σc θl:c xl:c, and thus each position in x contributes independently to the latent phenotype. The neighbor model is given by ϕ(x; θ) = θ0 + Σl Σc θl:c xl:c + Σl Σc,c′ θl:c,(l+1):c′ xl:c x(l+1):c′, and further accounts for potential epistatic interactions between neighboring positions. The pairwise model is given by ϕ(x; θ) = θ0 + Σl Σc θl:c xl:c + Σl<l′ Σc,c′ θl:c,l′:c′ xl:c xl′:c′, and includes interactions between all pairs of positions. Note our convention of requiring l′ > l in the pairwise parameters θl:c,l′:c′.
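As a concrete illustration of the additive model, here is a minimal pure-Python sketch (our own, not MAVE-NN code) that one-hot encodes a sequence over a hypothetical two-character alphabet and evaluates ϕ(x; θ):

```python
def one_hot(seq, alphabet):
    """One-hot encode a sequence: x[l][c] = 1 if character c occurs at position l."""
    return [[1 if ch == a else 0 for a in alphabet] for ch in seq]

def phi_additive(seq, theta0, theta, alphabet):
    """Additive latent phenotype: phi = theta0 + sum_{l,c} theta[l][c] * x[l][c]."""
    x = one_hot(seq, alphabet)
    L, C = len(seq), len(alphabet)
    return theta0 + sum(theta[l][c] * x[l][c] for l in range(L) for c in range(C))

# Toy example (all numbers hypothetical): position 0 B adds 1; position 1 B adds 2.
alphabet = "AB"
theta = [[0.0, 1.0], [0.0, 2.0]]
print(phi_additive("AB", 0.5, theta, alphabet))  # -> 2.5
```

The neighbor and pairwise models extend this by adding terms in products of one-hot features from two positions.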
Unlike these three parametric models, the black box G-P map does not have a fixed functional form. Rather, it is given by a multilayer perceptron that takes a vector of sequence features (additive, neighbor, or pairwise) as input, contains multiple fully-connected hidden layers with nonlinear activations, and has a single node output with a linear activation. Users are able to specify the number of hidden layers, the number of nodes in each hidden layer, and the activation function used by these nodes.
MAVE-NN further supports custom G-P maps that users can define by subclassing the G-P map base class. These G-P maps can have arbitrary functional form, e.g., representing specific biophysical hypotheses of sequence function. This feature of MAVE-NN is showcased in the analyses of Fig. 6.
Gauge modes and diffeomorphic modes
G-P maps typically have non-identifiable degrees of freedom that must be fixed, i.e., pinned down, before the values of individual parameters can be meaningfully interpreted or compared between models. These degrees of freedom come in two flavors: gauge modes and diffeomorphic modes. Gauge modes are changes to θ that do not alter the values of the latent phenotype ϕ. Diffeomorphic modes15,20 are changes to θ that do alter ϕ, but do so in ways that can be undone by transformations of the measurement process p(y|ϕ). As shown by Kinney and Atwal,15,20 the diffeomorphic modes of linear G-P maps like those considered here will in general correspond to affine transformations of ϕ, although additional unconstrained modes can occur in special situations.
MAVE-NN fixes both gauge modes and diffeomorphic modes of inferred models (except when using custom G-P maps). The diffeomorphic modes of G-P maps are fixed by transforming θ via θ0 → (θ0 − a)/b and then θ → θ/b for all remaining parameters, where a = mean({ϕn}) and b = std({ϕn}) are the mean and standard deviation of ϕ values computed on the training data. This produces a corresponding change in latent phenotype values ϕ → (ϕ − a)/b. To avoid altering likelihood values, MAVE-NN makes a corresponding transformation to the measurement process p(y|ϕ). In GE regression this is done by adjusting the GE nonlinearity via g(ϕ) → g(a + bϕ) while keeping the noise model p(y|ŷ) fixed. In MPA regression MAVE-NN transforms the full measurement process via p(y|ϕ) → p(y|a + bϕ).
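The effect of this standardization can be sketched as follows (our illustration, with a hypothetical tanh nonlinearity standing in for g and made-up training phenotypes): composing the nonlinearity with the inverse affine map leaves model predictions unchanged.

```python
import math
import statistics as stats

# Made-up training-set latent phenotypes
phi_train = [1.0, 3.0, 5.0, 7.0]
a = stats.mean(phi_train)    # a = mean({phi_n})
b = stats.pstdev(phi_train)  # b = std({phi_n})

def g(phi):
    """Stand-in GE nonlinearity (hypothetical)."""
    return math.tanh(phi)

def g_new(phi):
    """Adjusted nonlinearity: g(phi) -> g(a + b*phi)."""
    return g(a + b * phi)

for phi in phi_train:
    phi_new = (phi - a) / b  # standardized latent phenotype
    assert abs(g_new(phi_new) - g(phi)) < 1e-12  # predictions preserved
print("predictions unchanged")
```

Because the composition g_new((ϕ − a)/b) = g(ϕ) holds identically, likelihood values are unaffected by the reparameterization.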
For the three parametric G-P maps, gauge modes are fixed using what we call the “hierarchical gauge.” Here, the parameters θ are adjusted so that the lower-order terms in ϕ(x; θ) account for the highest possible fraction of variance in ϕ. This procedure requires a probability distribution on sequence space with respect to which these variances are computed. MAVE-NN assumes that such distributions factorize by position, and can thus be represented by a probability matrix with elements pl:c, denoting the probability of character c at position l. MAVE-NN provides three built-in choices for this distribution: uniform, empirical, or wildtype. The corresponding values of pl:c are given by pl:c = 1/C (uniform), pl:c = nl:c/N (empirical), or pl:c = xwt,l:c (wildtype), where nl:c denotes the number of sequences (out of N total) that have character c at position l, and xwt,l:c is the one-hot encoding of a user-specified wildtype sequence. In particular, the wildtype gauge was used for illustrating the additive G-P maps in Fig. 3 and Fig. 4, while the uniform gauge was used for illustrating the pairwise G-P map in Fig. 5 and the energy matrices in Fig. 6. After a sequence distribution is chosen, MAVE-NN fixes the gauge of the pairwise G-P map by transforming θ0 → θ0 + Σl Σc pl:c θl:c + Σl<l′ Σc,c′ pl:c pl′:c′ θl:c,l′:c′, θl:c → θl:c − Σc′ pl:c′ θl:c′ + Σl′>l Σc′ pl′:c′ (θl:c,l′:c′ − Σc″ pl:c″ θl:c″,l′:c′) + Σl′<l Σc′ pl′:c′ (θl′:c′,l:c − Σc″ pl:c″ θl′:c′,l:c″), and θl:c,l′:c′ → θl:c,l′:c′ − Σc″ pl:c″ θl:c″,l′:c′ − Σc″ pl′:c″ θl:c,l′:c″ + Σc″ Σc‴ pl:c″ pl′:c‴ θl:c″,l′:c‴.
This transformation is also used for the additive and neighbor G-P maps, but with θl:c,l′:c′ = 0 for all l, l′ (additive) or whenever l′ ≠ l + 1 (neighbor).
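For the additive case, the hierarchical gauge reduces to centering each position's parameters under pl:c and absorbing the p-weighted means into θ0. A minimal sketch (our illustration, with made-up numbers) verifying that ϕ is unchanged:

```python
def fix_gauge_additive(theta0, theta, p):
    """Center theta[l][c] under p[l][c], absorbing per-position means into theta0."""
    means = [sum(pc * tc for pc, tc in zip(p[l], theta[l])) for l in range(len(theta))]
    theta0_new = theta0 + sum(means)
    theta_new = [[tc - means[l] for tc in theta[l]] for l in range(len(theta))]
    return theta0_new, theta_new

def phi(seq_idx, theta0, theta):
    """Additive latent phenotype; seq_idx[l] is the character index at position l."""
    return theta0 + sum(theta[l][c] for l, c in enumerate(seq_idx))

# Hypothetical position-wise distributions and parameters (L = 2, C = 2)
p = [[0.5, 0.5], [0.25, 0.75]]
theta0, theta = 1.0, [[2.0, 4.0], [1.0, -1.0]]
t0g, tg = fix_gauge_additive(theta0, theta, p)

seq = [1, 0]  # character indices per position
assert abs(phi(seq, theta0, theta) - phi(seq, t0g, tg)) < 1e-12
# Gauge-fixed parameters have zero p-weighted mean at each position:
assert all(abs(sum(pc * tc for pc, tc in zip(p[l], tg[l]))) < 1e-12 for l in range(2))
print("phi invariant under gauge fixing")
```

The pairwise version applies the same centering to rows and columns of the pairwise parameters, pushing the resulting marginal means down into the additive and constant terms.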
GE nonlinearities
GE models assume that each measurement y is a nonlinear function of the latent phenotype, g(ϕ), plus noise. In MAVE-NN, this nonlinearity is represented as a sum of tanh sigmoids:

g(ϕ; α) = a + Σ_{k=1}^{K} b_k tanh(c_k ϕ + d_k).
Here, K specifies the number of hidden nodes contributing to the sum, and α = {a, b_k, c_k, d_k} are trainable parameters. We note that this mathematical form is an example of the bottleneck architecture previously used for modeling GE nonlinearities.21,24 By default, MAVE-NN constrains g(ϕ; α) to be monotonic in ϕ by requiring all b_k ≥ 0 and c_k ≥ 0, but this constraint can be relaxed.
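As a concrete illustration, the sum-of-tanh-sigmoids form can be written in a few lines of NumPy (a minimal sketch, not MAVE-NN's actual implementation; the parameter values below are invented):

```python
import numpy as np

def ge_nonlinearity(phi, a, b, c, d):
    """Sum-of-tanh-sigmoids GE nonlinearity g(phi; alpha).

    phi : array of latent phenotype values
    a   : scalar offset
    b, c, d : length-K arrays of trainable parameters
    Monotonicity in phi is guaranteed when all b[k] >= 0 and c[k] >= 0.
    """
    phi = np.asarray(phi, dtype=float)
    # Broadcast (n_points, 1) against (K,) -> (n_points, K), sum over K
    return a + np.sum(b * np.tanh(c * phi[:, None] + d), axis=1)

# Example: K = 2 hidden nodes, monotonic by construction
b = np.array([1.0, 0.5])
c = np.array([2.0, 1.0])
d = np.array([0.0, -1.0])
yhat = ge_nonlinearity([-1.0, 0.0, 1.0], a=0.1, b=b, c=c, d=d)
```

Because tanh is increasing and all b_k, c_k here are nonnegative, the resulting ŷ values increase with ϕ.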
GE noise models
MAVE-NN supports three types of GE noise model: Gaussian, Cauchy, and skew-t. All three support the analytic computation of quantiles and confidence intervals, as well as the rapid sampling of simulated measurement values. The Gaussian noise model is given by

p(y|ŷ) = (1/√(2πs²)) exp[ −(y − ŷ)² / (2s²) ],

where s denotes the standard deviation. Importantly, MAVE-NN allows this noise model to be heteroskedastic by representing s as an exponentiated polynomial in ŷ, i.e.,

s(ŷ) = exp[ Σ_{k=0}^{K} a_k ŷ^k ],

where K is the order of the polynomial and {a_k} are trainable parameters. The user has the option to set K, and setting K = 0 renders this noise model homoscedastic. Quantiles are computed using

y_q = ŷ + s(ŷ) √2 erf⁻¹(2q − 1)

for user-specified values of q ∈ [0,1], where erf⁻¹ denotes the inverse error function. Similarly, the Cauchy noise model is given by

p(y|ŷ) = [ πs (1 + (y − ŷ)²/s²) ]⁻¹,

where the scale parameter s is an exponentiated K’th-order polynomial in ŷ, and quantiles are computed using

y_q = ŷ + s(ŷ) tan[ π(q − ½) ].
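For instance, the heteroskedastic Gaussian noise model and its quantiles can be sketched with the Python standard library alone (the coefficient values are hypothetical; this is not MAVE-NN's API):

```python
import math
from statistics import NormalDist

def noise_scale(yhat, coeffs):
    """s(yhat) = exp(sum_k a_k * yhat**k): an exponentiated polynomial,
    which guarantees s > 0.  coeffs = [a_0, ..., a_K]."""
    return math.exp(sum(a * yhat**k for k, a in enumerate(coeffs)))

def gaussian_quantile(yhat, q, coeffs):
    """q-th quantile of p(y | yhat) under the Gaussian noise model."""
    s = noise_scale(yhat, coeffs)
    return NormalDist(mu=yhat, sigma=s).inv_cdf(q)

# Homoscedastic special case: K = 0, so s = exp(a_0) = 1 here
q16, q50, q84 = (gaussian_quantile(0.0, q, coeffs=[0.0])
                 for q in (0.16, 0.5, 0.84))
```

Because the polynomial sits inside an exponential, the noise scale stays positive for any trainable coefficients, which is presumably why this parameterization is used.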
The skew-t noise model is of the form described by Jones and Faddy,27 and is given by

p(y|ŷ) = s⁻¹ f(t; a, b), where t = t* + (y − ŷ)/s,

f(t; a, b) = [2^{a+b−1} B(a, b) √(a + b)]⁻¹ [1 + t/√(a + b + t²)]^{a+1/2} [1 − t/√(a + b + t²)]^{b+1/2},

and

t* = (a − b) √(a + b) / √((2a + 1)(2b + 1)).
Note that the t statistic here is an affine function of y, chosen so that the distribution’s mode (located at t*) is positioned at y = ŷ. The three parameters of this noise model, {s, a, b}, are each represented using K’th-order exponentiated polynomials in ŷ with trainable coefficients.
Quantiles are computed using

y_q = ŷ + s (t_q − t*), where t_q = √(a + b) (2x_q − 1) / (2 √(x_q(1 − x_q))), x_q = I⁻¹_q(a, b),

and I⁻¹ denotes the inverse of the regularized incomplete beta function I_x(a, b).
MPA measurement process
In MPA regression, MAVE-NN directly models the measurement process p(y|ϕ). At present, MAVE-NN only supports MPA regression for discrete values of y indexed by nonnegative integers. MAVE-NN supports two alternative forms of input for MPA regression. One is a set of sequence-measurement pairs {(x_n, y_n)}_{n=1}^{N}, where N is the total number of reads, {x_n} is a set of (typically) non-unique sequences, each y_n ∈ {0, 1, …, Y − 1} is a bin number, and Y is the total number of bins. The other is a set of sequence-count-vector pairs {(x_m, c_m)}_{m=1}^{M}, where M is the total number of unique sequences and c_m = (c_{m0}, c_{m1}, …, c_{m(Y−1)}) is a vector listing the number of times c_{my} that sequence x_m was observed in each bin y. MPA measurement processes are represented as a multilayer perceptron with one hidden layer (having tanh activations) and a softmax output layer. Specifically,

p(y|ϕ) = w_y(ϕ) / Σ_{y′=0}^{Y−1} w_{y′}(ϕ), where w_y(ϕ) = exp[ a_y + Σ_{k=1}^{K} b_{yk} tanh(c_{yk} ϕ + d_{yk}) ]

and K is the number of hidden nodes per value of y. The trainable parameters of this measurement process are η = {a_y, b_{yk}, c_{yk}, d_{yk}}.
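This softmax-over-tanh construction can be sketched in a few lines of pure Python (a pedagogical illustration, not MAVE-NN's implementation; the parameter values are invented):

```python
import math

def mpa_measurement_process(phi, a, b, c, d):
    """p(y | phi) for an MPA-style measurement process.

    a : length-Y list of biases a_y
    b, c, d : Y x K nested lists of parameters b_yk, c_yk, d_yk
    Returns a length-Y list of bin probabilities summing to one.
    """
    Y = len(a)
    # Unnormalized log-weights: a_y + sum_k b_yk * tanh(c_yk * phi + d_yk)
    logw = [
        a[y] + sum(b[y][k] * math.tanh(c[y][k] * phi + d[y][k])
                   for k in range(len(b[y])))
        for y in range(Y)
    ]
    # Softmax with max-subtraction for numerical stability
    m = max(logw)
    w = [math.exp(v - m) for v in logw]
    z = sum(w)
    return [v / z for v in w]

# Y = 3 bins, K = 1 hidden node per bin (illustrative parameter values)
p = mpa_measurement_process(
    phi=0.5,
    a=[0.0, 0.2, -0.1],
    b=[[1.0], [0.5], [-0.3]],
    c=[[1.0], [2.0], [0.5]],
    d=[[0.0], [0.1], [-0.2]],
)
```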
Loss function
Let θ denote the G-P map parameters and η the parameters of the measurement process. MAVE-NN optimizes these parameters using stochastic gradient descent on a loss function of the form

ℒ = ℒ_like + ℒ_reg,

where ℒ_like is the negative log likelihood of the model,

ℒ_like = − Σ_{n=1}^{N} log p(y_n|ϕ_n; η),

with ϕ_n = ϕ(x_n; θ), and ℒ_reg provides for regularization of the model parameters.
In the context of GE regression, we can write η = (α, β), where α represents the parameters of the GE nonlinearity g(ϕ; α) and β denotes the parameters of the noise model p(y|ŷ; β). The likelihood contribution from each observation n then becomes p(y_n|ϕ_n; η) = p(y_n|ŷ_n; β), where ŷ_n = g(ϕ_n; α). In the context of MPA regression with a dataset of the form {(x_m, c_m)}_{m=1}^{M}, the likelihood term simplifies to

ℒ_like = − Σ_{m=1}^{M} Σ_{y=0}^{Y−1} c_{my} log p(y|ϕ_m; η),

where ϕ_m = ϕ(x_m; θ). For the regularization term, MAVE-NN uses an L2 penalty of the form

ℒ_reg = λ_θ Σ_i θ_i² + λ_η Σ_j η_j²,

where the user-adjustable parameters λ_θ and λ_η respectively control the strength of regularization for the G-P map and measurement process parameters.
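To make the pieces concrete, here is a toy evaluation of ℒ = ℒ_like + ℒ_reg for GE regression with a homoscedastic Gaussian noise model (the helper names and λ values are our own; MAVE-NN evaluates this loss inside its neural network backend):

```python
import math

def ge_loss(y_obs, phi, g, log_s, theta, eta, lam_theta=1e-3, lam_eta=1e-4):
    """L = L_like + L_reg for GE regression with a Gaussian noise model.

    y_obs : list of measurements y_n
    phi   : list of latent phenotypes phi_n = phi(x_n; theta)
    g     : callable GE nonlinearity, yhat = g(phi)
    log_s : callable giving log of the noise scale s(yhat)
    theta, eta : flat lists of G-P map / measurement-process parameters
    """
    # Negative log likelihood under p(y | yhat) = Normal(yhat, s(yhat))
    nll = 0.0
    for y, p in zip(y_obs, phi):
        yhat = g(p)
        ls = log_s(yhat)
        s = math.exp(ls)
        nll += 0.5 * math.log(2 * math.pi) + ls + (y - yhat) ** 2 / (2 * s**2)
    # L2 regularization on both parameter sets
    reg = lam_theta * sum(t * t for t in theta) + lam_eta * sum(e * e for e in eta)
    return nll + reg

loss = ge_loss(
    y_obs=[0.1, 0.9],
    phi=[0.0, 1.0],
    g=lambda p: p,          # identity nonlinearity, for illustration only
    log_s=lambda yh: 0.0,   # homoscedastic, s = 1
    theta=[0.5], eta=[0.2],
)
```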
Predictive information
In what follows, we use pmodel(y|ϕ) to denote a measurement process inferred by MAVE-NN, whereas ptrue(y|ϕ) denotes the empirical conditional distribution of y and ϕ values that would be observed in the limit of infinite test data.
Predictive information, Ipre = I[y; ϕ], where I[⋅;⋅] denotes mutual information computed on data not used for training (i.e., a held-out test set or data from a different experiment), provides a measure of how strongly a G-P map predicts experimental measurements. Importantly, this quantity does not depend on the corresponding measurement process pmodel(y|ϕ). To estimate Ipre, we use k’th nearest neighbor (kNN) estimators of entropy and mutual information adapted from the NPEET Python package.46 Here, the user has the option of adjusting k, which controls a variance/bias tradeoff. When y is discrete (MPA regression), Ipre is computed using the classic kNN entropy estimator47,48 via the decomposition I[y; ϕ] = H[ϕ] − Σ_y p(y) H_y[ϕ], where H_y[ϕ] denotes the entropy of ptrue(ϕ|y). When y is continuous (GE regression), I[y; ϕ] is estimated using the kNN-based Kraskov–Stögbauer–Grassberger (KSG) algorithm.48 This approach optionally supports the local nonuniformity correction of Gao et al.,49 which is important when y and ϕ exhibit strong dependencies, but which also requires substantially more time to compute.
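To illustrate the discrete-y decomposition, here is a self-contained toy estimator built on the k = 1 Kozachenko–Leonenko entropy estimator in one dimension (a pedagogical sketch; MAVE-NN uses the NPEET implementations, which differ in detail):

```python
import math

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def digamma(n):
    """psi(n) for positive integers: psi(1) = -gamma, psi(n) = -gamma + H_{n-1}."""
    return -GAMMA + sum(1.0 / i for i in range(1, n))

def knn_entropy_1d(x):
    """Kozachenko-Leonenko (k = 1) differential entropy estimate, in bits."""
    x = sorted(x)
    n = len(x)
    total = 0.0
    for i in range(n):
        # Distance to nearest neighbor (sorted array -> check both sides)
        if i == 0:
            r = x[1] - x[0]
        elif i == n - 1:
            r = x[-1] - x[-2]
        else:
            r = min(x[i] - x[i - 1], x[i + 1] - x[i])
        r = max(r, 1e-12)  # guard against exact ties
        total += math.log(2.0 * r)
    return (digamma(n) - digamma(1) + total / n) / math.log(2)

def predictive_info_discrete(y, phi):
    """I[y; phi] = H[phi] - sum_y p(y) H_y[phi] for discrete y (bits)."""
    groups = {}
    for yi, pi in zip(y, phi):
        groups.setdefault(yi, []).append(pi)
    n = len(phi)
    h_phi = knn_entropy_1d(phi)
    h_cond = sum(len(g) / n * knn_entropy_1d(g) for g in groups.values())
    return h_phi - h_cond

# Two well-separated groups -> I[y; phi] should approach 1 bit
phi0 = [i * 0.001 for i in range(500)]          # group y = 0
phi1 = [10.0 + i * 0.001 for i in range(500)]   # group y = 1
mi = predictive_info_discrete([0] * 500 + [1] * 500, phi0 + phi1)
```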
Variational information
We define the variational information, Ivar, as an affine transformation of ℒ_like:

Ivar = H[y] − ℒ_like / (N ln 2).
Here, H[y] is the entropy of the data {y_n}, which is estimated using the k’th nearest neighbor (kNN) estimator from the NPEET package.46 Noting that this quantity can also be written as Ivar = H[y] − mean({Q_n}), where Q_n = −log2 p(y_n|ϕ_n), we estimate the associated uncertainty using

δIvar = √( δH[y]² + var({Q_n})/N ),

where δH[y] is the uncertainty in the kNN entropy estimate.
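A sketch of this computation, taking the H[y] estimate as given (the names are hypothetical, and the uncertainty returned here covers only the sampling variability of the Q_n term):

```python
import math

def variational_info(h_y, log2_probs):
    """I_var = H[y] - mean(Q_n), with Q_n = -log2 p(y_n | phi_n).

    h_y        : entropy estimate H[y] in bits (e.g., from a kNN estimator)
    log2_probs : list of log2 p(y_n | phi_n) values under the model
    Returns (I_var, dI_var), where dI_var is the standard error
    contributed by the finite sample of Q_n values.
    """
    q = [-lp for lp in log2_probs]
    n = len(q)
    mean_q = sum(q) / n
    var_q = sum((v - mean_q) ** 2 for v in q) / (n - 1)
    return h_y - mean_q, math.sqrt(var_q / n)

ivar, divar = variational_info(2.0, [-1.0, -1.2, -0.8, -1.0])
```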
The inference strategy used by MAVE-NN is based on the fact that Ivar provides a tight variational lower bound on Ipre.30 Indeed, in the large data limit,

Ivar = Ipre − ⟨ D_KL( ptrue(y|ϕ) ∥ pmodel(y|ϕ) ) ⟩_ϕ,   (Eq. 30)

where D_KL(⋅∥⋅) ≥ 0 is the Kullback–Leibler divergence; the gap Ipre − Ivar thus quantifies the accuracy of the inferred measurement process. From Eq. 30 one can see that, with appropriate caveats, maximizing Ivar (or, equivalently, minimizing ℒ_like) will also maximize Ipre.20 But unlike Ipre, Ivar is readily compatible with backpropagation and stochastic gradient descent. See Supplemental Information for a derivation of Eq. 30 and an expanded discussion of this key point. We note that Sharpee et al.50 cleverly showed that Ipre can, in fact, be optimized using stochastic gradient descent; computing gradients of Ipre, however, requires a time-consuming density estimation step, whereas optimizing Ivar can be done using standard per-datum backpropagation.
Intrinsic information
Intrinsic information, Iint = I[x; y], is the mutual information between the sequences x and measurements y in a dataset. This quantity is difficult to estimate due to the high-dimensional nature of sequence space. We instead used three different methods to obtain the upper and lower bounds on Iint shown in Fig. 3d and Fig. 5a. More generally, we believe the development of both computational and experimental methods for estimating Iint will be an important avenue for future research.
To compute the upper bound on Iint for the GB1 data (in Fig. 3d), we used the fact that

Iint = H[y] − ⟨H_x[y]⟩_x,   (Eq. 31)

where H[y] is the entropy of all measurements y, H_x[y] is the entropy of p(y|x) for a specific choice of sequence x, and ⟨⋅⟩_x indicates averaging over all sequences x. In this dataset, the measurement values were computed using

y = log2(c_s / c_i),

where c_i is the input read count and c_s is the selected read count. H[y] was estimated using the kNN estimator.47 We estimated the uncertainty in y by propagating the errors expected due to Poisson fluctuations in read counts, which gives

σ_y = (1/ln 2) √(1/c_i + 1/c_s).
Then, assuming p(y|x) to be approximately Gaussian, we find the corresponding conditional entropy to be

H_x[y] = ½ log2(2πe σ_y²).
These H[y] and Hx[y] values were then used in Eq. 31 to estimate Iint. This should provide an upper bound on the true value of Iint because uncertainty in y must be at least that expected under Poisson sampling of reads. We note, however, that the use of linear error propagation and the assumption that p(y|x) is approximately Gaussian complicate this conclusion. Also, when applied to MPSA data, this method yielded an upper bound of 0.96 bits. We believe this value is likely to be far higher than the true value of Iint, and that this mismatch probably resulted from read counts in the MPSA data being over-dispersed.
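The Poisson error-propagation step can be sketched as follows (a toy illustration with invented read counts, under the Gaussian approximation described above):

```python
import math

def log2_enrichment_sigma(c_in, c_sel):
    """Std. dev. of y = log2(c_sel / c_in) expected from Poisson
    fluctuations in the two read counts (linear error propagation):
    sigma_y = (1/ln 2) * sqrt(1/c_in + 1/c_sel)."""
    return math.sqrt(1.0 / c_in + 1.0 / c_sel) / math.log(2)

def gaussian_entropy_bits(sigma):
    """H = (1/2) log2(2 pi e sigma^2): entropy of a Gaussian, in bits."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma * sigma)

# Deeper sequencing -> smaller sigma -> smaller conditional entropy H_x[y]
h_shallow = gaussian_entropy_bits(log2_enrichment_sigma(10, 10))
h_deep = gaussian_entropy_bits(log2_enrichment_sigma(1000, 1000))
```

Note that differential entropy can be negative; only the difference H[y] − ⟨H_x[y]⟩_x in Eq. 31 is meaningful as an information estimate.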
To compute the lower bound on Iint for GB1 data (Fig. 3d) we used the predictive information Ipre (on test data) of a GE regression model having a blackbox G-P map. This provides a lower bound because Iint ≥ Ipre for any model (when evaluated on test data) due to the Data Processing Inequality and the Markov Chain nature of the dependencies y ← x → ϕ in Fig. 2e.20,29
To compute a lower bound on Iint for the MPSA data (Fig. 5c), we leveraged the availability of replicate data in Wong et al.36 Let y and y′ represent the original and replicate measurements obtained for a sequence x. Because y ← x → y′ forms a Markov chain, I[x; y] ≥ I[y; y′].29 We therefore used an estimate of I[y; y′], computed using the KSG method,46,48 as the lower bound on Iint.
Uncertainties in kNN estimates
MAVE-NN quantifies uncertainties in H[y] and I[y; ϕ] using multiple random samples of half the data. Let 𝒟100% denote a full dataset, and let 𝒟50%,r denote a 50% subsample (indexed by r) of this dataset. Given an estimator E(⋅) of either entropy or mutual information, as well as the number of subsamples R to use, the uncertainty in E(𝒟100%) is estimated as

δE = (1/√2) std({E(𝒟50%,r)}_{r=1}^{R}),

the factor of 1/√2 accounting for the fact that half-sized datasets yield estimates with roughly √2 times the standard error of the full dataset.
MAVE-NN uses R = 25 by default. We note that computing such uncertainty estimates substantially increases computation time, as E(⋅) needs to be evaluated R + 1 times instead of just once. We also note that bootstrap resampling51,52 is often inadvisable in this context, as it systematically underestimates H[y] and overestimates I[y; ϕ].
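A sketch of this subsampling scheme (the function names and the 1/√2 rescaling reflect our reading of the procedure above; the toy "estimator" here is just a sample mean):

```python
import math
import random

def subsample_uncertainty(data, estimator, r_subsamples=25, seed=0):
    """Estimate the uncertainty of estimator(data) by re-evaluating it
    on R random half-sized subsamples and rescaling the spread by
    1/sqrt(2), since half the data yields roughly sqrt(2)-times-larger
    standard errors than the full dataset."""
    rng = random.Random(seed)
    n_half = len(data) // 2
    vals = [estimator(rng.sample(data, n_half)) for _ in range(r_subsamples)]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / (len(vals) - 1)
    return math.sqrt(var / 2)

# Toy example: uncertainty of a sample-mean "estimator"
data = [float(i % 10) for i in range(1000)]
delta = subsample_uncertainty(data, lambda d: sum(d) / len(d))
```

Unlike bootstrap resampling, each subsample here contains no duplicated points, which avoids the entropy bias mentioned above.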
Datasets
For the GB1 DMS dataset of Olson et al.,33 measurements were computed using

y_n = log2( c_n^{out} / c_n^{in} ) − log2( c_{WT}^{out} / c_{WT}^{in} ),

where c_n^{in} and c_n^{out} respectively represent the number of reads of sequence n from the input and output samples (i.e., pre-selection and post-selection libraries), and n = WT represents the 55 aa wildtype sequence, corresponding to positions 2-56 of the GB1 domain. To infer the model in Fig. 3b and to compute the information metrics in Fig. 3c, only double-mutant sequences with sufficiently high input read counts were used; these represent 530,737 of the 536,085 possible double mutants. For the models in Figs. 3d-f, y_n values for the 1045 single-mutant sequences were also used in the inference procedure.
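This log-enrichment computation has the following shape (a hypothetical sketch with invented counts; any pseudocounts or read-depth normalization used in the actual pipeline are omitted):

```python
import math

def log2_enrichment(c_in, c_out, c_in_wt, c_out_wt):
    """y_n = log2[(c_out_n / c_in_n) / (c_out_WT / c_in_WT)]:
    log2 enrichment of sequence n relative to wildtype."""
    return math.log2((c_out / c_in) / (c_out_wt / c_in_wt))

# A variant depleted 4-fold relative to wildtype scores y = -2
y = log2_enrichment(c_in=100, c_out=25, c_in_wt=100, c_out_wt=100)
```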
For the Aβ DMS data of Seuma et al.34 and TDP-43 DMS data of Bolognesi et al.,35 yn values respectively represent nucleation scores and toxicity scores reported by the authors.
For the MPSA data of Wong et al.,36 we used data from library 1, replicate 1, obtained for the BRCA2 minigene. Measurements were computed as

y_n = log2( c_n^{inc} / c_n^{tot} ) − log2( c_{CONS}^{inc} / c_{CONS}^{tot} ),

where c_n^{inc} and c_n^{tot} respectively represent the number of barcode reads obtained from exon-inclusion isoforms and from total mRNA, and n = CONS corresponds to the consensus 5’ss sequence CAG/GUAAGU. Corresponding PSI values were computed as PSI_n = 100 × 2^{y_n}. Only sequences with sufficiently high total read counts were used, representing 30,483 of the 32,768 possible sequences of the form NNN/GYNNNN.
For the lac promoter sort-seq MPRA data of Kinney et al.,16 we used data from the “full-wt” experiment (available at https://github.com/jbkinney/09_sortseq). For the xylE promoter sort-seq MPRA data of Belliveau et al.,43 we used data kindly provided by the authors.
Acknowledgements
This work was supported by NIH grant 1R35GM133777 (awarded to JBK), NIH Grant 1R35GM133613 (awarded to DMM), an Alfred P. Sloan Research Fellowship (awarded to DMM), a grant from the CSHL/Northwell Health partnership, and funding from the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory.