Abstract
Despite recent advances in high-throughput combinatorial mutagenesis assays, the number of labeled sequences available for predicting molecular functions remains small relative to the vastness of the sequence space, a problem compounded by the ruggedness of many fitness functions. Expressive machine learning (ML) models, such as deep neural networks (DNNs), can model the nonlinearities in rugged fitness functions, which manifest as high-order epistatic interactions among the mutational sites. In the absence of an inductive bias, however, DNNs overfit the small number of labeled sequences available for training. Herein, we exploit recent biological evidence that epistatic interactions in many fitness functions are sparse and use this knowledge as an inductive bias to regularize DNNs. We have developed a method for sparse epistatic regularization of DNNs, called the epistatic net (EN), which constrains the number of nonzero coefficients in the spectral representation of DNNs. For longer sequences, where computing the spectral transform becomes intractable, we have developed a scalable extension of EN that subsamples the combinatorial sequence space uniformly, inducing a sparse-graph-code structure, and regularizes DNNs using the resulting greedy optimization method. Results on several biological landscapes, from bacterial to protein fitness functions, show that EN consistently improves the prediction accuracy of DNNs and enables them to outperform competing models that assume other forms of inductive bias. EN estimates all the higher-order epistatic interactions of DNNs trained on massive sequence spaces, a computation that would otherwise take years without leveraging the sparsity of epistasis in fitness functions.
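To make the spectral regularization concrete, the following minimal PyTorch sketch penalizes the Walsh-Hadamard spectrum of a DNN evaluated over the full combinatorial space of short binary sequences. It is an illustration under our own assumptions, not the authors' implementation: EN constrains the number of nonzero spectral coefficients (an L0 constraint), whereas this sketch substitutes the simpler L1 relaxation, and the helper names `fwht` and `en_loss` and the weight `lam` are illustrative.

```python
import itertools
import torch

def fwht(v: torch.Tensor) -> torch.Tensor:
    # Fast Walsh-Hadamard transform of a length-2^d vector in
    # O(n log n), built from differentiable ops so gradients flow
    # through the spectrum back to the DNN weights.
    n = v.numel()
    h = 1
    while h < n:
        v = v.reshape(-1, 2, h)
        v = torch.cat((v[:, 0, :] + v[:, 1, :],
                       v[:, 0, :] - v[:, 1, :]), dim=1)
        h *= 2
    return v.reshape(n)

def en_loss(model, X, y, lam=0.1):
    # Prediction loss plus a sparsity penalty on the spectrum of the
    # model evaluated over all 2^d binary sequences. Enumerating the
    # full space is feasible only for short sequences; EN's scalable
    # extension subsamples this space instead.
    d = X.shape[1]
    grid = torch.tensor(list(itertools.product([0.0, 1.0], repeat=d)))
    mse = torch.nn.functional.mse_loss(model(X).squeeze(-1), y)
    spectrum = fwht(model(grid).squeeze(-1))
    return mse + lam * spectrum.abs().mean()  # L1 stand-in for the L0 constraint
```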
Significance Statement
Predicting the properties of biomolecules (such as proteins) from their sequence is an important problem in computational biology. The main challenge lies in developing a model that can capture the nonlinearities in the function mapping a sequence to the property of interest (e.g., fluorescence) using the limited number of labeled sequences available from biological assays. In this paper, we identify a biologically plausible sparsity prior and develop a method to infuse this prior into the structure of deep neural networks (DNNs) by regularizing their spectral representation. We demonstrate that our method significantly improves the prediction accuracy of DNNs and enables an interpretable explanation of DNNs, a task that is computationally intractable without leveraging the hidden structure in biological functions.
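As a hypothetical usage example, the penalty sketched above drops into an ordinary gradient-based training loop; the architecture, toy data, and hyperparameters below are placeholders rather than the paper's experimental setup.

```python
torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(8, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

X_train = torch.randint(0, 2, (200, 8)).float()  # toy binary sequences
y_train = torch.randn(200)                       # toy fitness labels

for step in range(1000):
    opt.zero_grad()
    loss = en_loss(model, X_train, y_train, lam=0.1)  # sketch from above
    loss.backward()
    opt.step()
```

Larger values of `lam` drive more spectral coefficients toward zero, trading training fit for the sparse-epistasis prior.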
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Results in Figures 3 and S5 have been updated.