Abstract
To understand the decision process of genomic sequence-to-function models, various explainable AI algorithms have been proposed. These methods determine the importance of each nucleotide in a given input sequence to the model’s predictions and enable discovery of the cis-regulatory motif grammar of gene regulation. The most commonly applied method is in silico saturation mutagenesis (ISM), because its per-nucleotide importance scores can be intuitively understood as the computational counterpart to in vivo saturation mutagenesis experiments. While ISM is highly interpretable, it is computationally expensive, because it requires three forward passes for every nucleotide in the input sequence, one for each alternative base; these computations add up when analyzing a large number of sequences and become prohibitive as the length of the input sequences and the size of the model grow. Here, we show how to use the first-order Taylor approximation for ISM, which reduces its computation cost to a single forward pass for an input sequence, placing its scalability on equal footing with gradient-based approximation methods such as “gradient-times-input”. We show that the Taylor ISM (TISM) approximation is robust across different model ablations, random initializations, training parameters, and data set sizes. We use our theoretical derivation to connect ISM with the gradient values and show how this approximation is related to a recently suggested correction of the model’s gradients.
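The relationship between ISM and its first-order Taylor approximation can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the toy model (a softplus over a linear score, built by the hypothetical helper `make_toy_model`), its analytic gradient, and the function names `ism` and `tism`. With a real network, the gradient would typically come from an automatic differentiation framework. The sketch uses the fact that swapping the reference base for base b at position l changes the one-hot input by +1 in channel b and -1 in the reference channel, so the linearized effect is the gradient at b minus the gradient at the reference base.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of rows, one row of 4 floats per position."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]


def make_toy_model(weights, bias=0.0):
    """Hypothetical stand-in for a trained model: softplus of a linear score.

    Returns (f, grad): the scalar prediction and its gradient w.r.t. the input.
    """
    def score(x):
        return bias + sum(w * v for w_row, x_row in zip(weights, x)
                          for w, v in zip(w_row, x_row))

    def f(x):
        return math.log1p(math.exp(score(x)))

    def grad(x):
        # d softplus(s) / dx = sigmoid(s) * ds/dx, and ds/dx = weights
        sig = 1.0 / (1.0 + math.exp(-score(x)))
        return [[sig * w for w in w_row] for w_row in weights]

    return f, grad


def ism(f, x):
    """Exact ISM: one model evaluation per alternative base per position."""
    ref_pred = f(x)
    effects = []
    for pos, row in enumerate(x):
        pos_effects = []
        for b in range(len(BASES)):
            if row[b] == 1.0:
                pos_effects.append(0.0)  # reference base: no change
                continue
            mutated = [r[:] for r in x]
            mutated[pos] = [1.0 if k == b else 0.0 for k in range(len(BASES))]
            pos_effects.append(f(mutated) - ref_pred)
        effects.append(pos_effects)
    return effects


def tism(grad, x):
    """Taylor-approximated ISM: a single gradient evaluation for all variants."""
    g = grad(x)
    effects = []
    for g_row, x_row in zip(g, x):
        g_ref = sum(gv * xv for gv, xv in zip(g_row, x_row))  # gradient at reference base
        effects.append([gv - g_ref for gv in g_row])
    return effects


# Illustrative weights (arbitrary values, not fitted to any data)
toy_weights = [
    [0.20, -0.10, 0.00, 0.10],
    [-0.20, 0.25, 0.05, 0.00],
    [0.10, 0.00, -0.25, 0.15],
    [0.00, 0.10, 0.20, -0.15],
]
f, grad = make_toy_model(toy_weights)
x = one_hot("ACGT")
ism_scores = ism(f, x)      # 3 * len(seq) extra model evaluations
tism_scores = tism(grad, x)  # one gradient evaluation
```

In this sketch `ism` calls the model three times per position, while `tism` needs only the gradient at the reference sequence. The two agree closely here because the toy model is nearly linear over single-base changes; how well the approximation holds for a real model is an empirical question.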
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
$ Co-senior authorship