Inference of splicing motifs through visualization of recurrent networks

Aparajita Dutta; Aman Dalmia; R Athul; Kusum Kumari Singh; Ashish Anand

doi:10.1101/451906

Abstract

Neural models have been able to obtain state-of-the-art performances on several genome sequence-based prediction tasks. Often such models take only nucleotide sequences as input and learn relevant features on its own. However, extracting the inter-pretable motifs from the model remains a challenge. This work explores four existing visualization techniques in their ability to infer relevant sequence patterns learned by a recurrent neural network (RNN) on the task of splice junction identification. The visualization techniques have been modulated to suit the genome sequences as input. The visualizations inspect genomic regions at the level of a single nucleotide as well as a span of consecutive nucleotides. This inspection is performed from the perspective of the overall model as well as individual sequences. We infer canonical and non-canonical splicing motifs from a single neural model. We also propose a cumulative scoring function that ranks the combination of significant regions across the sequences involved in non-canonical splicing event. Results indicate that the visualization technique giving preference to k-mer patterns can extract known splicing motifs better than the techniques focusing on a single nucleotide.

Introduction

The recent trend in genome sequence analysis is the application of neural network models that learn features from the sequence de-novo^1,2. The primary motivation to let the model learn relevant features by itself is to avoid the existing knowledge bias. Several deep and shallow neural network models have been successfully applied on genome sequence-based tasks, such as, identification of transcription factor binding sites³, microRNA target pre-diction⁴, and prediction of DNA methylation states⁵. However, the inference of significant features or genomic regions from the learned models remains a challenge. In the domain of computer vision, image processing, and natural language processing (NLP), several visualization techniques have been effectively applied to analyze learned models and infer relevant features. Our work aims to apply visualization techniques to identify the significant regions of the genome sequence, in the form of sequence patterns or motifs, that contribute to the prediction performance. Towards this aim, we also ask the following two principal questions-Q1: How can various visualization techniques be adapted for identifying significant genomic regions for a particular task? Q2: Do all visualization techniques deliver similar results or are one method superior to the other?

We employ four different visualization techniques by modulating them to suit the genome sequences as input. The characteristics of the chosen visualization methods can be broadly classified into two categories based on their ways of identifying motifs. One set of the visualization techniques identifies motifs considering all the sequences present in the dataset, whereas the other set identifies motifs based on individual sequences. The four visualization techniques are as follows:

The attention mechanism to obtain significant genomic regions captured by the overall learning model.
Smooth gradient of noisy nucleotide embeddings to access the impact incurred on the classification score by a small change in any nucleotide of a given sequence.
Omission of a single nucleotide to access the importance of each nucleotide present in a genome sequence.
Occlusion of k-mers to derive fixed as well as variable length sequence patterns from a genome sequence.

Among the visualization techniques, the attention mechanism falls in the first category whereas the rest three fall in the second category.

Towards our aim, we consider the task of splice junction classification and evaluate the different visualization techniques in their ability to infer known splicing motifs. Splice junction classification is an important sub-task of genome annotation. This task involves identification of exon-intron (donor site) and intron-exon (acceptor site) boundaries which are usually characterized by canonical motif dimers GT and AG respectively (further explained in Supplementary Materials (Section 1.1)). However, there are exceptions to these consensus motifs which yield the non-canonical motifs that correspond to non-canonical or unconventional splicing events⁶. Most of the existing computational methods focus only on the identification of canonical splice junctions due to the lack of consistent non-canonical consensus. Nevertheless, the non-canonical sequence patterns are equally important in understanding the splicing phenomenon⁶, and hence this remains an interesting area to be investigated further. In summary, our additional objective is to extract relevant canonical and non-canonical splicing motifs from the same model.

Motivated by application of RNN in sequence-based bioinformatics problems^4,7,8, we have further explored its application in splice junction prediction by employing bidirectional long short-term memory (BLSTM) units⁹ in the hidden layer. Since a BLSTM network reads the context from both the ends, incorporating bidirectionality is expected to boost the performance of prediction models¹. We further superimposed the model with an attention¹⁰ layer to add interpretability to the model.

The contributions of this paper can be summarized as follows:

We explore the application of BLSTM network along with attention mechanism for the prediction of splice junctions. The proposed architecture achieves the state-of-the-art performance.
We generated two different types of negative dataset to test the consistency of the various recurrent neural models. Comprehensive analysis of splicing motifs is performed with the proposed architecture as it is the most consistent in its performance with both the dataset.
We have redesigned some of the effective visualization techniques, available in the literature, to be capable of comprehending genome sequences as inputs.
We infer splicing motifs for both canonical and non-canonical splicing events. The canonical patterns are validated with the existing knowledge from literature.
We propose a scoring function, named cumulative deviation score, to rank the most significant combinations of sequence patterns present across the sequences belonging to non-canonical splicing events.

Related work

Visualization of sequence motifs

With the objective of deciphering the reason behind the outstanding performance of the various learning models, various attempts have been made to monitor the change in model weights as learning progresses^4,11,5 extracted sequence motifs by aligning sequence fragments that maximally activated the filters of the convolutional layer for predicting single-cell methy-lation states.¹² performed one-dimensional global average pooling on the attention weighted output of an RNN to discover the parts of the sequence that are significant for identifying pre-miRNAs.

Lanchantin et al. in¹³ explored various sequence-specific as well as class-specific visualization techniques to obtain the important nucleotide positions present in a genome sequence for classification of transcription factor binding sites. Based on a similar objective, we have incorporated variable length occlusion to obtain variable length canonical as well as non-canonical splicing motifs apart from accessing the importance of each nucleotide position using attention and omission scores. We have also used the smooth gradient of noisy nu-cleotide embeddings, rather than raw gradients, to generate sharper sensitivity maps ¹⁴.

¹⁵ predicted splice sites using a deep convolutional neural network (CNN) having five convolutional layers. They have also identified significant genomic regions based on a visualization technique DeepLIFT ¹⁶ which assigns a contribution score to each nucleotide in a sequence based on its significance on the predicted result. However, their model predicts either the donor or the acceptor splice sites based on the dataset used. Also, they have considered only canonical splice junctions for visualizing the important genomic regions.

Splice junction classification

Several computational methods have been proposed for splice junction classification. In recent times, advanced sequencing technologies like RNA-seq have produced a plethora of sequenced genome. The abundance of annotated data has boosted both alignment based and machine learning based methodologies for predicting splice junctions. Alignment based methods identify splice junctions via map-assemble strategy where numerous short reads are first mapped to a reference genome after which they are assembled to identify distinct clusters representing exonic regions¹⁷. However, there is a possibility of a short read randomly matching a large reference genome containing multiple occurrences of the short read sequence¹⁸. Also, the existing alignment based methods^17,19 consider only canonical splice junctions in the prediction task¹¹.

The traditional machine learning based splice junction predictors use hand-crafted features like presence or absence of specific nucleotide patterns around the splice junctions^20–22. Since all the splicing signals are still not known, the hand-engineered features may adversely affect the accuracy of prediction models due to the inclusion of irrelevant features as well as high dimensionality. There have been attempts at handling the issue of high dimensionality by optimizing the features using feature selection techniques^23,24. Nonetheless, limited biological knowledge still ensued the inclusion of irrelevant features and hence revealed the necessity of applying learning techniques that can capture, by itself, the complex splicing signals in the form of features from the genome sequence.

Lee et al. proposed a splice junction prediction model based on deep Boltzmann machine¹¹. Zhang et al. employed a deep CNN, named DeepSplice²⁵ that predicts novel splice junctions. Lee et al. have explored different RNN units in the hidden layers of a deep network for predicting splice junctions⁷. Dutta et al. proposed distributed feature representations of splice junctions, named SpliceVec²⁶, which captures splicing features to be classified by a multilayer perceptron (MLP). These prediction models yielded promising accuracy in the prediction of splice junctions. However, most of these models fail to extract the sequence motifs that govern the splicing phenomenon due to lack of interpretability in the model.

Methods

In this section, we introduce the neural architecture employed for classification of true and decoy splice junctions. Further, we discuss the visualization techniques applied to analyze the patterns learned by the model.

Neural architecture

The proposed architecture is shown in Figure 1. The entire workflow is explained in the following subsections.

Figure 1:

An overview of the proposed architecture.

Input representation

Input is a putative splice junction sequence consisting of five nucleotide codes, A (Adenine), C (Cytosine), G (Guanine), T (Thymine) and N (representing any one of the four nucleotides), where each of the nucleotide code is represented as an integer. We use a dense vector representation for each of the five nucleotide codes. Each input sequence is passed through an embedding layer which transforms each input splice junction sequence of length n into an n × 4 dimensional dense vector that gets updated while training the neural network.

Modeling splice junctions using BLSTM network

The n × 4 dimensional dense input vectors are fed in mini-batches into both the forward and backward LSTM layers configured as a BLSTM (details on LSTM and BLSTM provided in Supplementary Materials (Section 1.2)). Both the LSTM layers learn meaningful features in a supervised manner to generate an n_l × n dimensional vector representation, where n_l is the number of hidden units in each LSTM layer. Both the vectors generated by the forward and backward LSTM layers are concatenated to generate a 2n_l × n dimensional vector representing the learned features of each splice junction.

Feature interpretation using attention layer

This layer is added to the model for obtaining a more targeted model which can capture the role each nucleotide in the input sequence plays in the classification of the splice junction. The attention mechanism is explained in Supplementary Materials (Section 1.2). The 2n_l × n representation of each sequence obtained from the BLSTM network is next fed into the attention layer. The attention weights obtained from this layer are fed into a fully connected layer and eventually to a softmax layer to obtain the classification results. We have used binary cross-entropy and Adam²⁷ as the loss function and the optimizer respectively.

Visualization techniques

The visualization techniques rely on measuring the change in performance of the learned model effected by changing either the input sequence or the embedding space of the model for a genomic region. The change can be implemented at a single nucleotide or a span of consecutive nucleotides. In some sense, these techniques mimic the site-specific mutagenesis techniques performed in a wet lab setup. The visualization techniques are applied to define a scoring function, referred to as the deviation value, which reflects the contribution of genomic regions to the classification score. We have inferred splicing motifs from the significant regions identified by the visualization techniques based on the deviation value. The various visualization techniques employed are described in the following subsections.

Smooth gradient with noisy embeddings

In image classification tasks, the gradient of the unnormalized output probabilities with respect to the input image, referred to as sensitivity maps¹⁴, indicate how much can a tiny change in that pixel affect the final output. In the case of images, minutely changing the pixel values does not change the image significantly as the image still looks the same²⁸. Whereas genome sequences, comprising of discrete nucleotides, can significantly alter the underlying biology on replacing a nucleotide by another.

Thus, to incur a slight change, we add noise to the embeddings of the nucleotides and compute the change in classification score. However, the sensitivity maps resulting from raw gradients are usually noisy. Therefore, based on the concept of smooth gradient²⁸, we average out the gradients obtained from several different noisy embeddings for each position of a sequence. The average gradient at each sequence position, named smooth gradient, is the deviation value in this case. As a result, one might expect that the resulting averaged sensitivity map would crisply highlight the key regions.

Omission of a single nucleotide

The feature vector obtained from the fully connected layer of the proposed architecture represents the complete input sequence. To measure the significance of each sequence position, we calculate its omission score²⁹. The omission score of the j^th position in a sequence s_i is given by where V_{s_i} is the feature representation obtained from the fully connected layer for the sequence s_i and is the feature representation obtained from the same layer for the same sequence with the nucleotide at position replaced by N. Cosine(V_{s_i}, ) measures the similarity of the two vectors and is calculated as

Therefore, omission score measures the deviation value of the vector representations of the sequence, with and without the omitted nucleotide. Higher deviation implies a higher significance of that sequence position.

Occlusion of k-mers

We occluded portions of a sequence to observe the variation in the predicted output. This approach has its motivation from.³⁰ For each sequence, we run a sliding window w_l of length l ¹, centered at nucleotide number (l + l)/2, and replace the characters within the particular window with N. We pass the modified sequence through the model to obtain the deviation value given by the absolute difference of the model outputs with and without the occlusion. For occlusion of a window of length l, the deviation value is stored in the center, that is, in position (l + l)/2, of the window.

We generated deviation values for test sequences in batches. For a batch of size B, the deviation values are in the form of a matrix of size B × n where each input sequence is of length n. The naive implementation (iterative) takes much time. But after some modifications (explained in Supplementary Materials (Section 2.1)), a batch implementation is performed which reduces the computation time significantly.

We propose two variations of occlusion described as follows:

Fixed length occlusion

In this case, we have considered occlusion of only 3-mers to compute the deviation value at each sequence position. Deviation values at boundary indices are computed by occluding the first two or the last two indices only. The deviation value at index j of sequence s_i is assigned to the middle index of the 3-mer. The significance of a genomic region is proportional to the corresponding deviation value.

Variable length occlusion

Fixed length occlusion has a limitation of considering only fixed length genomic regions whereas in real scenario there may be sequence patterns of variable lengths that regulate splicing. Hence, we incorporate variable length occlusion where for each index j of a sequence s_i, we have occluded a window w_l of length l ∈ {3, 5, 7, 9, 11} and computed the deviation values denoted by , for each window length l, with and without occlusion. The deviation value assigned to position (l + 1)/2 of window w_l is given by for l ∈ {3, 5, 7, 9, 11}. The window length corresponding to index j of sequence s_i is stored in the j^th column of the i^th row of a window matrix. Therefore, the value at the j^th column of the i^th row of the window matrix signifies the length of the pattern, centered at index j of sequence s_i, that contributes maximum to the prediction of the model.

Prerequisites for visualization

For each of the visualization techniques, we obtain 15 and 7 sets of randomly selected 50 true and 50 decoy splice junctions from canonical and non-canonical test sequences respectively. Going forward, the splice junction sequences comprising the canonical dimer motifs (GT – AG) are referred to as the canonical sequences and likewise for the non-canonical sequences. We execute the following steps to obtain the splicing motifs.

Generating heat maps

For each of the visualization techniques, we generate heat maps for each set. The heat maps are generated from the corresponding matrix of deviation values, named deviation matrix. Further, each set is sampled to obtain a final heat map comprising of random sequences from all the sets. Each sequence in the heat map is 164 nucleotide (nt) long. The first 82 nt of a sequence are the upstream and downstream regions of the donor site whereas the next 82 nt are the upstream and downstream regions of the acceptor site of a junction pair (dataset explained in the next section). The heat map consists of true test sequences only.

Identifying the number of significant indices in a genomic region

To reveal the splicing motifs using the visualization techniques, we need to first identify the number of sequence indices which can be considered influential in splice junction identification. For this, we arrange the indices of a sequence in non-increasing order of weights in the deviation matrix. Next, we obtain the frequency of occurrence of each index in the top T positions among all the sequences present in the sets, depicted by the occurrence-frequency graph. In other words, we consider the first T positions of the ordered deviation matrix for all the sequences across all the sets and count the number of times each index occurs in the first T positions. Here, the term ‘position’ is used relative to the length of the deviation matrix whereas ‘index’ refers to the absolute position based on the indexing explained later.

Based on the occurrence-frequency graph, we select a significant number ‘K’ which indicates the number of sequence indices that play a significant role in the predicted output of the model. To quantify the value of K, we consider it as the number of indices whose frequency of occurrence among the top T positions is more than 25% of the total number of sequences across all the sets.

Generating splicing motifs

In a particular set, the top K indices across all the sequences are calculated as follows. For each sequence s_i where i ∈ {1, 2,…, 50}, we select the top K indices , in non-increasing order of deviation values. Therefore, we obtained a 50 × K position matrix from which we select the top K indices p_k with maximum weighted occurrences across all the sequences in the set. The weighted occurrence of each index p_k is calculated as the sum of the weights of each occurrence of that index across all the sequences in the set. Here, the highest weight ‘K’ is added if the index occurs in the first column of the position matrix and the lowest weight ‘1’ is added if the index occurs in the last (K^th) column of the position matrix. This can be represented using the formula

On obtaining the top K indices of each set, we select the overall top K indices across all the sets for canonical sequences, and across all the sets for non-canonical sequences. This is again performed using Equation (3). The splicing motifs are generated for these top K indices across all the sequences present in all the sets.

Experimental setup

Positive data generation

We use GENCODE annotations³¹, based on human genome assembly version GRCh38, to train and test our model. We target to access the model’s performance on the prediction of novel splice junctions. For this, we train the model using an earlier release and test the model on only the newly added splice junctions in a later release, as described in²⁵. The two versions used are version 20 (released in August 2014) and version 26 (released in March 2017).

We extract 291,831 and 294,576 unique splice junction pairs from version 20 and version 26 respectively. The junction pairs are extracted from protein-coding genes only. Each junction pair comprises an intron with flanking upstream and downstream exonic regions. The intronic length is observed to vary from 1 to 1,240,200 nt. A previous study found that the shortest known eukaryotic intron is 30 base pairs (bp) long belonging to the human MST1L gene³². Introns shorter than 30 bp usually result from sequencing errors in genome. Therefore, we consider introns of length greater than 30 bp only. This reduces the number of junction pairs to 290,502 and 293,889 in versions 20 and 26 respectively. We exclude, from version 26, all the junction pairs present in version 20. This leaves us with 5,612 novel junction pairs in version 26, which composes our test data.

Negative data generation

Existing works adopt one of the two techniques to generate negative data. We adopt both the ways of generating the negative data to have comprehensive and unbiased analysis. The Type-1 dataset is generated similar to the standard procedure described in.³³ We extract a portion of the sequence from the center of each true intron to generate a pseudo sequence. The length of the extracted portion is kept equal to the length of an input sequence. This set of negative data captures the non-randomness of DNA sequences. We obtain 290,502 false samples for training data and 5,612 false samples for testing data using this procedure.

More than 98% of splice junctions contain the consensus dimer GT and AG at the donor and acceptor junctions respectively³⁴. However, the occurrence of the consensus dinucleotide is far more frequent compared to the number of true splice junctions in the genome. The neat exclusion of the pseudo sites by the splicing mechanism suggests the presence of other subtle splicing signals which play an important role in the process. The presence of consensus dimer in all the negative samples will result in the model learning the remaining splicing patterns present in the vicinity of the splice junctions.

Therefore, we generate the Type-2 dataset based on the procedure described in^25,26 where the negative data is randomly sampled from the human genome assembly version GRCh38. For each decoy junction pair, we randomly search for the consensus dimer GT and AG such that both lie in the same chromosome and the distance between them lies in the range of 30 and 1,240,200 nt. We obtain a huge number of such samples using this procedure, out of which we randomly select 290,502 false samples for training data and 5,612 false samples for testing data. In both the dataset, the positive junction pairs are the same. Both the scenarios are pictorially depicted in Figure 4 of Supplementary Materials (Section 2.2).

Training and Hyperparameter tuning

Each input splice junction is truncated to 40 nt upstream and downstream flanking regions of the consensus dimer GT or AG, thus obtaining an 82 nt sequence. Both the donor and acceptor junctions of a junction pair are concatenated to form a 164 nt sequence. The effect of variation in the flanking region on model accuracy is shown in Table 1 of Supplementary Materials (Section 2.3).

View this table:

Table 1:

Performance of the proposed architecture compared with the state-of-the-art models. Accuracy (Ac), Precision (Pr), Recall (Re) and F1 Score (F1) are computed in percentage.

The proposed architecture can be represented as a (1-4-100-100-2048-2) architecture where 4-dimensional embeddings are passed through a BLSTM, attention and fully connected layer with 100, 100 and 2048 units respectively. Values for batch size, dropout, recurrent dropout, and epochs are tuned to 128, 0.5, 0.2 and 50 respectively. The hyperpa-rameters are tuned by partitioning the training data from version 20 into 90% training and 10% validation data. All experiments were carried out on an NVIDIA GeForce GTX 980 Ti GPU machine with 6GB memory. We evaluate the performance of the classifier based on precision, recall, accuracy, and F1 score. Table 2 in Supplementary Materials (Section 2.4) shows the variation in the performance of the model with variation in the number of hidden layers.

View this table:

Table 2:

The top 10 significant indices obtained from the top 5 non-canonical sequences, based on the cumulative scoring function, for each visualization.

Baselines

We implemented the following state-of-the-art models as baselines and compared the results obtained by the various models on the same set of training and testing data. The hyper-parameters for the baselines (details in Supplementary Materials (Section 2.5)) were tuned using the same process mentioned in the previous subsection.

DeepSplice: This model classifies input sequences using a deep CNN comprising of two convolutional layers²⁵. Input sequences are represented in the form of a 4 × 30 matrix comprising of the upstream and downstream flanking regions of both acceptor and donor splice junctions.
SpliceVec-MLP: This model learns feature vectors of the entire intronic region along with flanking upstream and downstream exonic regions to be classified using an MLP²⁶. The input formation for SpliceVec-MLP is described in Supplementary Materials (Section 2.6).
Vanilla LSTM: This model comprises an embedding layer, two hidden LSTM layers, and a softmax output layer as proposed in⁷.
Vanilla LSTM with attention: We replaced the hidden units of the proposed architecture with LSTM units.

Existing consensus for analyzing visualizations

We compare the obtained visualization results with the existing consensus of canonical sequences. Figure 2 shows the extended donor site consensus 9-mer [AC]AGGTRAGT and the extended acceptor site consensus 15-mer Y₁₀NCAGG, known from existing studies³⁵. The acceptor site consensus mostly consists of the polypyrimidine tract (PY-tract) Y₁₀. Our dataset comprises 164 nt long junction pairs where the first 82 nt represents the donor region, and the next 82 nt represents the acceptor region. Both the donor and acceptor regions comprise the consensus dimer with 40 nt upstream and 40 nt downstream regions each. Both the consensus dimers GT and AG, at donor and acceptor sites respectively, are indexed as D1D2 and A1A2. The upstream flanking region of the donor site is indexed-ID through −40D starting from the closest to the farthest of the consensus dimer. Similarly, the downstream flanking region of the donor site is indexed ID through 40D starting from the position closest to the consensus dimer. Indexing is similar for acceptor site flanking regions with only D replaced by A.

Figure 2: Existing consensus in the canonical positive sequences of the dataset.

The consensus comprises the extended donor site consensus 9-mer [AC]AGGTRAGT ranging from indices -3D to 4D and the extended acceptor site consensus 15-mer Y₁₀NCAGG ranging from indices -12A to 1A in a 164 nt long junction pair.

Results

Quantitative analysis of prediction performance

We evaluate the performance of the proposed architecture using both Type-1 and Type-2 dataset. We compare the predictive performance with the baselines described in the previous section.

Table 1 shows the comparison of the proposed architecture with the baselines. We observe a performance improvement of the proposed architecture, over other neural network based baselines, in the range of 8%-26% for the Type-1 dataset and an improvement in the range of 4%-5% for the Type-2 dataset. The performances of other variants of RNN are comparable to the proposed architecture. The RNN based models are consistent in its performance with both the dataset. We analyze the performance of the proposed architecture on the Type-2 dataset.

Qualitative analysis of the feature representation

To access the quality of the feature embedding obtained from the dense layer of the proposed architecture, we plot the 2048-dimensional vector by reducing it to a two-dimensional vector using Stochastic Neighbor Embedding (t-SNE)³⁶. Although the dimension reduction procedure involves some loss of information, we still observe that the projected representations are distinctly separable in the two-dimensional feature space. Figure 3 shows a t-SNE plot of randomly selected 1000 true and 1000 decoy splice junctions. Four other plots of randomly selected true and decoy splice junctions are shown in Figure 5 of Supplementary Materials (Section 3.1). ²

Figure 3:

t-SNE plot of 1000 true and 1000 decoy splice junctions. Points in blue represent true whereas points in yellow represent decoy splice junctions.

Figure 4:

Occurrence-frequency graph based on the attention weights of top 20 indices among all the canonical true sequences present in the 15 sets.

Figure 5: Splicing motifs of the top K indices obtained from the occurrence-frequency graph of various visualization techniques.

The motifs on the left are obtained from the canonical sequences whereas the motifs on the right are obtained from the non-canonical sequences. The indices along the x-axis are arranged in increasing order of its appearance in the input sequences. The frequency plotted along the y-axis is the relative frequency of a nucleotide with respect to other nucleotides at a given index. The motifs are generated from (a) and (b) Attention weights with K = 10 (c) and (d) Smooth gradients with K = 10 (e) and (f) Omission scores with K = 10 (g) and (h) 3-mer occlusion weights with K = 15.