RaptGen: A variational autoencoder with profile hidden Markov model for generative aptamer discovery

Nucleic acid aptamers are generated by an in vitro molecular evolution method known as systematic evolution of ligands by exponential enrichment (SELEX). The diversity of candidates is limited to the sequences actually read out in an experiment. Here, we developed RaptGen, a variational autoencoder for in silico aptamer generation. RaptGen exploits a profile hidden Markov model decoder to represent motif sequences effectively. We showed that RaptGen embedded simulated sequence data into a low-dimensional latent space according to motif information. We also performed sequence embedding using two independent SELEX datasets. RaptGen successfully generated aptamers from the latent space even though they were not included in the high-throughput sequencing data. RaptGen could also generate truncated aptamers by shortening the model length. We demonstrated that RaptGen can be applied to activity-guided aptamer generation using Bayesian optimization. We concluded that RaptGen's generative approach and latent representation are useful for aptamer discovery. Code is available at https://github.com/hmdlab/raptgen.

1 Introduction

[…] various loop regions [26]. In addition to these approaches, AptaMut utilizes mutational information from SELEX experiments [22]. As nucleotide substitutions can increase aptamer affinity, mutational information is beneficial for candidate discovery. However, while insertions and deletions are also important factors for altering aptamer activity, in silico methods dealing with these mutations are poorly developed. Thus, a method that generates sequences from experimental data is needed to expand the exploratory space, and incorporating motif information and nucleotide mutations confers an increased opportunity for aptamer discovery.

To develop a procedure for aptamer generation and motif finding, we focused on neural networks. As reported previously, neural networks are suitable for analyzing large datasets and are compatible with high-throughput sequencing data. DeepBind adopts a convolutional neural network (CNN) to distinguish DNA motifs of transcription factors and finds sequence motifs by visualizing network parameters [27]. Recurrent neural networks can also be used for sequence discovery [28, 29]. Currently, neural network-driven generative models are being applied in a broad range of research areas. Examples include deep belief networks (DBNs) [30], variational autoencoders (VAEs) [31], and generative adversarial networks (GANs) [32]. For probabilistic generation of nucleic acid sequences, long short-term memory (LSTM) networks have been proposed to mimic sequence distributions [33]. GAN-based sequence generation methods have also been proposed [34].

In small molecule discovery, VAE-based compound design has been reported. VAEs learn a representation of the data by reconstructing the input data from a compressed vector [31]. Kusner et al. used a grammar-based VAE and SMILES sequences to generate chemical structures for activity optimization [35]. […] Here, we developed RaptGen, a VAE that exploits a profile hidden Markov model decoder to efficiently create a latent space in which sequences form clusters based on motif structure. Using the latent representation, we generated aptamers not included in the high-throughput sequencing data. Strategies for sequence truncation and activity-guided aptamer generation are also proposed.

2 Materials and Methods

The VAE proposed in this study comprises a CNN-based encoder with skip connections and a profile HMM decoder, along with several training methods. Two simulation datasets containing different types of motifs were generated to assess the interpretability of the decoder. Two independent HT-SELEX datasets were subjected to the VAE, and a Gaussian mixture model (GMM) was used for multiple candidate selection. Furthermore, Bayesian optimization (BO) was performed based on the activities of tested sequences proposed by the GMM, and sequences were truncated by shortening the model length. The process is explained in detail in the following sections. An overview is shown in Figure 1.

VAEs consist of an encoder neural network that transforms an input sequence $x$ into a latent distribution $q_\phi(z|x)$ and a decoder neural network that reconstructs the input data from the latent representation $z$ by learning $p_\theta(x|z)$. As a VAE is a generative model, it can be evaluated by its model evidence. However, given a dataset $X = \{x^{(i)}\}_{i=1}^{N}$, the model evidence $p_\theta(X)$ is not computationally tractable.
Alternatively, we can maximize the evidence lower bound (ELBO), $\mathcal{L}(\theta, \phi; X)$, to assess how well the model describes the dataset. Using Jensen's inequality,

$$\log p_\theta(X) \ge \mathcal{L}(\theta, \phi; X) = -D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right) + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right],$$

where $D_{KL}(p \| q)$ is the Kullback-Leibler divergence between distributions $p$ and $q$. The former term on the right-hand side is the regularization error, and the latter is the reconstruction error. Modeling this reconstruction error to suit the problem determines the structure of the latent space.

The RaptGen encoder network consists of a stack of convolutional layers with skip connections. Each character is first embedded into a 32-channel vector and passed through seven convolutional layers with skip connections. Max pooling and fully connected layers then transform the vector into the distribution parameters of the latent representation $q_\phi(z|x)$. The structure is shown in detail in the Supplementary Information, subsection S1.5.

A profile HMM is used as the decoder of RaptGen. The profile HMM is a model that produces output by probabilistically moving from state to state (Figure 2). It consists of match states (M), insertion states (I), and deletion states (D). Each state emits specific outputs, introduced to represent multiple sequence alignments [37]. A match state has a high probability of emitting a particular character, an insertion state emits all characters with equal probability, and a deletion state always emits a null character. These probabilities are called emission probabilities. The other probabilistic parameters are the transition probabilities, which define the likelihood of transitioning from one state to the next. In a profile HMM, the emission probability $e_S(c)$ is the probability of outputting character $c$ from state $S$, and the transition probability $a_{S,S'}$ is the probability of moving from state $S$ to state $S'$. These are defined as $e_S(c) = p(c|S)$ and $a_{S,S'} = p(S'|S)$, respectively.
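To make the training objective concrete, the following is a minimal PyTorch sketch of the negative ELBO with a diagonal-Gaussian posterior and the reparameterization trick. The function names, and the assumption that the decoder returns per-sequence log-likelihoods, are illustrative rather than RaptGen's actual API.

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def negative_elbo(mu, logvar, recon_log_prob):
    """Negative ELBO for a batch.

    mu, logvar     : encoder outputs parameterizing q_phi(z|x) = N(mu, diag(exp(logvar)))
    recon_log_prob : log p_theta(x|z) per sequence from the decoder
                     (for a profile HMM decoder, the forward log-likelihood)
    """
    # Closed-form KL(q_phi(z|x) || N(0, I)): the regularization error
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    # Negative decoder log-likelihood: the reconstruction error
    return (kl - recon_log_prob).mean()
```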

Because a profile HMM is a model in which the state transition depends only on the single previous state, the sequence probability $p(x)$ can be written using the Markov chain rule:

$$p(x) = \sum_{\pi} p(x_{0:L+1}, \pi),$$

where $\pi$ is a possible state path, $\pi_{last}$ is the last state in the path, $L$ is the length of the sequence, $x_{j:k}$ is the subsequence of $x$ from the $j$-th character to the $k$-th character inclusive, $x_0$ is a null character that indicates the start of the sequence, $x_{L+1}$ is a null character that indicates the end of the sequence, and $m$ is the number of match states in the model. It is computationally expensive to calculate the sequence probability over all possible paths. Introducing the forward algorithm lowers the computational cost to $O(Lm)$. The forward algorithm uses a forward variable defined as $f_{S_j}(i) = p(x_{0:i}, \pi_{last} = S_j)$, which can be calculated recurrently:

$$f_{M_j}(i) = e_{M_j}(x_i) \sum_{S \in \{M_{j-1}, I_{j-1}, D_{j-1}\}} a_{S, M_j}\, f_{S}(i-1),$$
$$f_{I_j}(i) = e_{I_j}(x_i) \sum_{S \in \{M_j, I_j, D_j\}} a_{S, I_j}\, f_{S}(i-1),$$
$$f_{D_j}(i) = \sum_{S \in \{M_{j-1}, I_{j-1}, D_{j-1}\}} a_{S, D_j}\, f_{S}(i).$$

The emission probability of the insertion state does not depend on the position of the motif; therefore, it is set to a constant of 1/4 for RNA sequences. We set the probability of emitting the final end-of-sequence token, $p(x_{L+1}|M_{m+1})$, to 1.
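The recursion can be implemented in log space as follows. This is a minimal NumPy sketch; the transition-table layout (a dict keyed by state kinds holding per-position log probabilities) is an illustrative assumption, not the layout used in the RaptGen codebase.

```python
import numpy as np
from scipy.special import logsumexp

LOG_QUARTER = np.log(0.25)  # insertion emissions are uniform over A, U, G, C

def forward_log_prob(x, log_a, log_e_m, m):
    """Log-space forward algorithm for a profile HMM, O(L * m).

    x       : sequence as integer indices in 0..3
    log_a   : log transition probabilities; log_a["M", "M"][j] is the log
              probability of moving from M_j to M_{j+1}, etc.
    log_e_m : (m + 1, 4) log emission probabilities of match states 1..m
    m       : number of match states
    """
    L = len(x)
    neg = -np.inf
    fM = np.full((m + 1, L + 1), neg)  # fM[j, i] = log p(x[:i], last state = M_j)
    fI = np.full((m + 1, L + 1), neg)
    fD = np.full((m + 1, L + 1), neg)
    fM[0, 0] = 0.0                     # start state M_0 emits nothing
    for i in range(L + 1):
        if i > 0:
            for j in range(1, m + 1):  # match states consume one character
                fM[j, i] = log_e_m[j, x[i - 1]] + logsumexp([
                    fM[j - 1, i - 1] + log_a["M", "M"][j - 1],
                    fI[j - 1, i - 1] + log_a["I", "M"][j - 1],
                    fD[j - 1, i - 1] + log_a["D", "M"][j - 1]])
            for j in range(m + 1):     # insertions consume one character, stay at j
                fI[j, i] = LOG_QUARTER + logsumexp([
                    fM[j, i - 1] + log_a["M", "I"][j],
                    fI[j, i - 1] + log_a["I", "I"][j],
                    fD[j, i - 1] + log_a["D", "I"][j]])
        for j in range(1, m + 1):      # deletions consume no character
            fD[j, i] = logsumexp([
                fM[j - 1, i] + log_a["M", "D"][j - 1],
                fI[j - 1, i] + log_a["I", "D"][j - 1],
                fD[j - 1, i] + log_a["D", "D"][j - 1]])
    # finish by moving into the terminal state M_{m+1}
    return logsumexp([fM[m, L] + log_a["M", "M"][m],
                      fI[m, L] + log_a["I", "M"][m],
                      fD[m, L] + log_a["D", "M"][m]])
```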
For comparison, we also implemented two simpler decoders. In the multi-categorical (MC) model, the probability of the sequence is calculated as $p(x|z) = \prod_{i=1}^{L} \mathrm{Cat}(x_i \mid f_\theta(z))$, where Cat is a categorical distribution and $f_\theta$ is a neural network. The autoregressive (AR) model outputs a probability conditioned on the preceding characters; the probability of the sequence is calculated as $p(x|z) = \prod_{i=1}^{L} \mathrm{Cat}(x_i \mid g_\theta(x_{0:i-1}, z))$, where $g_\theta$ is a recurrent neural network (RNN). The architectures of the networks $f_\theta$ and $g_\theta$ are described in the Supplementary Information, subsection S1.5.

To train RaptGen, state transition regularization was introduced. In addition, a weighted regularization loss was introduced for all VAEs, including RaptGen.

A VAE can be trained with backpropagation by treating the negative ELBO as a loss function. In addition to the ELBO, a Dirichlet prior distribution was placed on the transition probabilities to avoid unnecessary state transitions in the early rounds of RaptGen training. By penalizing transitions other than match-to-match at the beginning of the learning process, insertions and deletions are suppressed. This allows continuous motifs to be learned and lowers the probability of obtaining models with meaningless transitions that traverse deletion states.

The probability of a categorical variable $p = \{p_k\}$ sampled from a Dirichlet distribution is

$$\mathrm{Dir}(p \mid \alpha) = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k p_k^{\alpha_k - 1}.$$

The regularization term is the sum over each position $i$ of the log-odds ratio of the transition probability from the match state, defined as

$$\sum_i \log \frac{a_{M_i, M_{i+1}}}{1 - a_{M_i, M_{i+1}}}.$$

For the simulation data shown in Figure 3(a), ten different motif sequences of length ten were generated, and single-nucleotide modifications were added at a 10% error rate; that is, each position of a motif sequence had a 10/3 percent chance of deletion, insertion, or substitution. After this procedure, sequences were randomly extended to 20 nt by adding nucleotides to the left and the right. We generated 10,000 sequences in total, with no duplication.

For the simulation data shown in Figure 4(a), sequences containing paired motifs were generated. Two five-nt motifs were made, and one of the motifs was randomly deleted with a probability of 25 percent each. If both motifs remained, two to six nt were randomly inserted between the left and right motifs. Subsequently, sequences were randomly extended to 20 nt, and 5,000 such sequences were generated.
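For concreteness, the single-motif simulation described above can be sketched in Python as follows (10% per-position error split evenly among deletion, insertion, and substitution, then random padding to 20 nt). This is a reimplementation from the description, not the authors' script.

```python
import random

BASES = "AUGC"

def mutate(motif, error_rate=0.10):
    """Apply per-position errors: deletion, insertion, or substitution,
    each with probability error_rate / 3."""
    out = []
    for base in motif:
        r = random.random()
        if r < error_rate / 3:
            continue                                            # deletion
        elif r < 2 * error_rate / 3:
            out += [random.choice(BASES), base]                 # insertion before base
        elif r < error_rate:
            out.append(random.choice(BASES.replace(base, "")))  # substitution
        else:
            out.append(base)
    return "".join(out)

def pad_to(seq, length=20):
    """Randomly extend the sequence on both sides up to `length` nt."""
    left = random.randint(0, length - len(seq))
    right = length - len(seq) - left
    rand = lambda n: "".join(random.choice(BASES) for _ in range(n))
    return rand(left) + seq + rand(right)

motifs = ["".join(random.choice(BASES) for _ in range(10)) for _ in range(10)]
data = set()
while len(data) < 10000:          # sets enforce "no duplication"
    data.add(pad_to(mutate(random.choice(motifs))))
```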

The SELEX data used in this study were obtained previously [20]. The sequences are available as DRA009383 and DRA009384, which we call dataset A and dataset B, respectively. Dataset A consists of nine SELEX rounds, from 0 to 8, and dataset B consists of four rounds, from 3 to 6. The round with the smallest unique ratio $U(T)$ satisfying $U(T) > 0.5$ was used, where the unique ratio is defined as

$$U(T) = \frac{|\mathrm{unique}(D(T))|}{|D(T)|},$$

and $D(T)$ is the set of whole sequences read in round $T$. The 4th round was selected for each dataset.

We used a GMM for initial sequence selection from the obtained latent space. To efficiently select ten points to be evaluated, the GMM was run 100 times with ten components, and the mean vectors of the model with the best evidence (likelihood) were selected.
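As a sketch, the round-selection criterion and the repeated-GMM candidate picking can be written with scikit-learn as follows; the function names are illustrative, not the RaptGen codebase's API.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def unique_ratio(reads):
    # U(T) = number of unique sequences / total number of reads in round T
    return len(set(reads)) / len(reads)

def pick_gmm_centers(latent, n_components=10, n_restarts=100):
    """Fit a GMM repeatedly and return the means of the best-likelihood model."""
    best_gmm, best_ll = None, -np.inf
    for seed in range(n_restarts):
        gmm = GaussianMixture(n_components=n_components,
                              random_state=seed).fit(latent)
        ll = gmm.score(latent)   # mean log-likelihood of the data
        if ll > best_ll:
            best_gmm, best_ll = gmm, ll
    return best_gmm.means_       # ten latent points to decode and assay
```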

The SPR assays were performed using a Biacore T200 instrument (GE Healthcare) as described previously, with slight modifications [20]. The target proteins of datasets A and B were human recombinant transglutaminase 2 and […], respectively.

BO balances exploration of regions that have not yet been examined with exploitation of sequences with known affinity when selecting the next sequences to evaluate. The local penalization function is a method that determines the multi-point expected improvement of candidates by considering the smoothness of the potential function [40]. As it converges faster than qEI [41] and other simultaneous optimization methods, we used it to perform multi-point optimization. The implementation used the GPyOpt package [42].

3 Results and Discussion

We first attempted to construct a VAE with an encoder and a decoder applicable to aptamer discovery. In the aptamer representation space, sequences containing the same motif should lie in neighboring areas. Robustness against nucleotide mutations and motif positions should also be considered. To identify a desirable decoder, we investigated different types of sequence representation models. We constructed VAEs with a CNN encoder and three different probabilistic models as decoders: the MC model, the AR model, and the profile HMM. Simulation data including ten different motifs were created to assess the visualization capability of these VAEs (Figure 3a). […] The AR model achieved the lowest total loss (19.50); however, its reconstruction error was the worst (18.32). Furthermore, its classification result was not optimal. We suppose that the latent representation is dispensable in the AR model because the model itself carries context information. Additionally, we compared different encoder types: LSTM [43] and CNN-LSTM were evaluated in combination with the above three decoders. LSTM is used in character-level text modeling. The embedding spaces from the MC and AR models were still inadequate with either encoder (Supplementary Information, subsection S1.8). The profile HMM created distinguishable embeddings with LSTM, whereas a learning deficiency was observed in combination with CNN-LSTM (Supplementary Information, subsection S1.8). Collectively, we concluded that the profile HMM decoder is favorable for motif-dependent embedding. A VAE composed of a CNN encoder and a profile HMM decoder was examined in the following study.

We next tested whether our VAE model could distinguish split motifs. Subsequence co-occurrence at a distance is often observed in RNA due to intramolecular base-pairing and internal loop structures [44]. We applied simulation data with a pair of 5-nt split motifs to the VAE (Figure 4). The MC model decoder was used for comparison. […] only-motif models were (0.995, 0.000), (0.107, 0.002), and (0.000, 0.987), respectively. Interestingly, the point located between these two motifs has a high probability of including both motifs. These results show that a profile HMM decoder is also applicable to split motifs. Hereafter, we call a VAE with a profile HMM decoder RaptGen.

We further evaluated RaptGen using SELEX sequence data obtained in our previous study [20]. Because real data are more complex than simulation data, we first investigated the dimensionality of the latent space. Raw HT-SELEX data have 30- or 40-nt variable regions and fixed primer regions at both ends. In the present study, we used the variable region to create the latent space. We tested up to 12 latent dimensions and trained the model 50 times on dataset A (Figure 5). The minimum loss was obtained with four dimensions, and the second-lowest with two dimensions.
Loss tended to increase as the embedding dimension increased; however, the loss of the one-dimensional space was higher than that of the ten-dimensional space. A lower dimension is favorable for visualization and advantageous for performing BO, as described in later sections. Therefore, we adopted a two-dimensional space instead of a four-dimensional space for the analysis.

We next subjected the two independent high-throughput SELEX datasets, datasets A and B, to RaptGen. The resulting latent embeddings are shown in Table 1 and the Supplementary Information, subsection S1.4. We previously demonstrated that aptamers from datasets A and B exhibit continuous motifs and split motifs, respectively. As sequences in a SELEX experiment are amplified according to specific binding motifs, we reasoned that they would form clusters in the latent space based on their motifs. Thus, we used a GMM, which hypothesizes that the data consist of a mixture of Gaussian distributions, to classify the distributions. We chose ten different points representing the latent cluster centers of the GMM (Table 1). We observed that sequences with an ambiguous profile HMM, such as A-GMM-2, A-GMM-5, and B-GMM-0, were embedded near the center of the latent space. In contrast, the area near the edges contained sequences that emit nucleotides preferentially. We also confirmed that similar profiles were embedded in similar areas (Supplementary Information, subsection S1.4). These results support the use of RaptGen to analyze high-throughput SELEX data.

We attempted to generate the most probable sequence from the profile HMM of each GMM center for activity evaluation. We calculated the model state path with the highest probability and derived the most probable sequence according to that path. When the path included insertion states, we generated up to 256 sequences with no duplication by randomly replacing each insertion state with a single nucleotide and selected the sequence with the highest probability. The resulting reconstituted sequences and their probabilities are shown in Table 1. After being connected with their fixed primer sequences, aptamer RNAs were produced by in vitro transcription, and their binding activities were assessed by SPR. Aptamers identified in our previous study were used as positive controls.
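A compact sketch of this decoding procedure follows; `hmm.viterbi()` and `hmm.log_prob()` are hypothetical helpers standing in for the state-path and likelihood computations described above.

```python
import random

BASES = "AUGC"

def most_probable_sequence(hmm, max_candidates=256):
    """Decode the most probable state path, then resolve insertion states.

    `hmm.viterbi()` (most probable state path) and `hmm.log_prob(seq)`
    (sequence log-likelihood) are hypothetical helpers used for clarity.
    """
    path = hmm.viterbi()
    # Deletion states emit nothing; match states contribute their best base;
    # insertion states are left as placeholders (None) to be filled in
    template = [s.best_base if s.kind == "M" else None
                for s in path if s.kind != "D"]
    n_ins = template.count(None)
    if n_ins == 0:
        return "".join(template)
    # Sample up to 256 distinct fillings and keep the most probable one
    candidates = set()
    limit = min(max_candidates, len(BASES) ** n_ins)
    while len(candidates) < limit:
        fill = iter(random.choices(BASES, k=n_ins))
        candidates.add("".join(b if b is not None else next(fill)
                               for b in template))
    return max(candidates, key=hmm.log_prob)
```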

Although more than half of the candidates were found to have weak or no activity, some sequences, such as A-GMM-1, B-GMM-4, and B-GMM-8, had evident binding activity. To determine whether these aptamers exist in the original data, we calculated each sequence's edit distance from the nearest HT-SELEX sequence […] working with actual sequence data.

[…] were created in a GMM-dependent manner as described above. Figure 6 shows the relative activity of the proposed aptamers together with their lengths. For dataset A, a 28-nt candidate showed binding activity, where the initial library was 30 nt. For dataset B, a 29-nt candidate showed considerable activity compared with the original setting of 40 nt. These results suggest that RaptGen can generate aptamers shorter than the experimentally expected length. We found that sequences with low reconstitution probability tended to have low binding activity and that sequences showing binding activity had relatively high probability (Figure 6). This observation would be helpful for effective candidate selection. We observed a tendency toward sequence extension in dataset A (L20 and L25) and dataset B (L35); for instance, in dataset A, 26-nt sequences were generated from the 20-nt RaptGen setting. We speculate that the profile HMM is prone to imitating the original length in some situations. The optimal truncation length was different for each dataset, and we did not identify the cause of this difference.

Further studies should be performed to determine efficient truncation.

Figure 6. Different markers indicate different lengths of the profile HMM decoder. Colors indicate the log probability of a sequence.

In another application of RaptGen, we generated aptamers using activity information. Aptamer derivatives harboring nucleotide mutations should be distributed around the mother sequence in the latent space. To predict effective candidates in the neighborhood of an active aptamer, the binding activity distribution should be estimated. We used a BO algorithm to learn the activity distribution. Because the distribution for the BO process needs to be low-dimensional, RaptGen is suitable for this strategy. To implement BO, we first embedded activity data in the latent space. The sequences listed in Table 1 were re-converted into the space. Several locations moved from the initial GMM centers (Figure 7). We used these re-embedded positions to perform BO. The resulting predicted activity distributions are shown in Figure 7. To propose multiple candidates in parallel, we utilized the local penalization function [45]. Ten profile HMMs were proposed and evaluated for their activity. As shown in Figure 7, candidates were generated from the peripheral area of the positive clone. We confirmed that the new aptamers incorporated nucleotide substitutions (Table 3), and most of them had binding activity. Similar results were obtained for both datasets A and B. These results indicate that RaptGen can propose aptamer derivatives in an activity-guided manner.

Figure 7. The sequences listed in Table 1 were embedded into the latent space. Grey points indicate the latent embeddings shown in Table 1. The contour lines overlaid on the embeddings indicate the predicted activity level; this is the acquisition function of BO, the upper confidence bound of the posterior distribution of the Gaussian process (GP-UCB) [46]. Ten points were proposed by the BO process with a local penalization function. Circles represent the re-embedded positions of the GMM centers. Red and blue indicate high and low binding activity, respectively. Stars represent the locations proposed by BO.
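For reference, batch proposal with local penalization on an externally evaluated objective looks roughly as follows in GPyOpt [42]; the domain bounds and the placeholder data arrays are illustrative assumptions.

```python
import numpy as np
import GPyOpt

# Search domain: the 2-D latent space (bounds chosen to cover the embedding)
domain = [{"name": "z", "type": "continuous",
           "domain": (-3.5, 3.5), "dimensionality": 2}]

# Placeholder data: re-embedded latent positions of assayed sequences (X)
# and their activities; GPyOpt minimizes, so activities are negated (Y)
X = np.random.randn(10, 2)
Y = -np.random.rand(10, 1)

bo = GPyOpt.methods.BayesianOptimization(
    f=None,                               # objective evaluated externally (SPR)
    domain=domain, X=X, Y=Y,
    acquisition_type="LCB",               # lower confidence bound (GP-UCB in
                                          # the maximization view)
    evaluator_type="local_penalization",  # batch proposal method [40]
    batch_size=10)
next_points = bo.suggest_next_locations() # ten latent coordinates to decode
```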

Jolma and colleagues evaluated the binding specificities of transcription factors using SELEX data [8].

In addition, CNN-based research was conducted to classify randomly shuffled sequences and determine the motifs in a dataset [27]. We selected five transcription factors presented in these studies. We applied ten GMM distributions in this experiment and identified motifs similar to those reported by Jolma et al., where similarity was assessed by the edit distance of the most probable sequence from each motif. RaptGen produced sequence motifs consistent with these studies (Table 4). Thus, we could recover sequence motifs obtained by running a CNN. A future study could search for an appropriate number of distributions using model selection methods such as the Akaike Information Criterion (AIC) [47]. All GMM motifs and other results are shown in the Supplementary Information, Table S3.

The present version of RaptGen does not consider the secondary structure of aptamers. Secondary structure information is critical for identifying active aptamers [19, 20]. Therefore, including the secondary structure in the sequence probabilistic model would improve RaptGen's performance. An alternative model, such as profile stochastic context-free grammar (SCFG) [48], will be tested in follow-up studies.

Here, we demonstrated that RaptGen could propose candidates according to the activity distribution. With BO, sequential construction of the posterior distribution would allow us to optimize activity in the latent space. Another possible application is […] protein-protein interactions. The application of RaptGen for this purpose is promising.

While RaptGen helps visualize and understand sequence motifs, the method incurs a computational cost for sequence probability calculation. Compared with the MC model, which can evaluate each position independently, and the AR model, which only requires calculation over the preceding nucleotides, the profile HMM requires calculation over all possible state paths and previous (sub)sequences. The calculation cost for the MC, AR, and profile HMM decoders is O(1), O(l), and O(lm), respectively, where l is the number of preceding characters including the current one, and m is the model length of the profile HMM. The profile HMM also needs to evaluate the costly logsumexp function many times, leading to a longer training time. Additional studies are necessary to address these issues.

In this study, we developed a novel nucleic acid aptamer discovery method, RaptGen, based on a variational autoencoder and a profile hidden Markov model. To our knowledge, this is the first study to utilize a VAE for aptamers. RaptGen converts experimental sequence data into a low-dimensional space according to motif information. Active aptamers were generated from the representation space using a GMM. Aptamers shorter than the original SELEX aptamers were also generated. Furthermore, we conducted a BO process to derivatize aptamers based on activity. The creation of efficient embeddings allowed aptamers to be generated in silico more efficiently.

S1.1 Sequence logo generation

To create a sequence logo for a given profile HMM, we calculated the most probable path of the profile HMM. The most probable path was iteratively acquired using the following pseudocode.

Algorithm 1: Sequence logo generation […]. When calculating the probability of an insertion state I, the state's recurrency was ignored because the probability of staying in I recurrently is lower than that of immediately moving to the next state. After the most probable path was obtained, the sequence logo was drawn according to the emission probability of each state using WebLogo [49, 50]. The overall height $R_S$ at state $S$ is defined as

$$R_S = \log_2(|C|) - (H_S + e_n),$$

where $|C|$ is the number of characters (typically 4 for RNA), $H_S$ is the uncertainty at state $S$, and $e_n$ is a correction factor that adjusts the result when there are few sample sequences; in our setting, $e_n$ was set to 0. The uncertainty $H_S$ is defined as

$$H_S = -\sum_{b} e_S(b) \log_2 e_S(b),$$

where $b$ is one of the bases (A, U, G, C) and $e_S(b)$ is the probability of emitting base $b$ from state $S$. The height of a base at a certain state, $h_S(b)$, is defined as

$$h_S(b) = e_S(b) \cdot R_S.$$

The sequence logo was drawn using $h_S(b)$, placing the more probable base at the bottom and arranging the state path from left to right.
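The height computation above maps directly to a few lines of NumPy. The following sketch assumes the match-state emission matrix as input; the function name is illustrative.

```python
import numpy as np

def logo_heights(emission, e_n=0.0):
    """Base heights h_S(b) = e_S(b) * R_S for a WebLogo-style logo.

    emission : (n_states, 4) match-state emission probabilities e_S(b)
    e_n      : small-sample correction factor (0 in our setting)
    """
    p = np.clip(emission, 1e-12, 1.0)            # avoid log2(0)
    H = -(p * np.log2(p)).sum(axis=1)            # uncertainty H_S per state
    R = np.log2(emission.shape[1]) - (H + e_n)   # total column height R_S
    return emission * R[:, None]
```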

The sequences obtained in the present study are shown in Table S1. Each ID is composed of the dataset ID, the length of the trained model, the method used to select the sequence, and the index of the sequence. max seq indicates that the sequence is the most probable sequence emitted from the most probable state path of the given profile HMM. Relative activity shows the binding activity relative to the positive control (%), as assessed by the SPR experiment. The positive control sequences are as follows: A-PC is equal to Data1-11, and B-PC is equal to Data2-1 in our previous report [20].

Table S1. The sequences obtained in this study. Log probability is the log probability of obtaining the sequence from a given point in the latent space; for GMM- entries it was computed during the initial sequence selection, and for BO- entries during the subsequent candidate selection procedure.

The statistics of the datasets are shown in Table S2. The column names are as follows:

• ID : The ID named after the rule; {indicator of the dataset}-{round of the SELEX}.

• raw reads : The number of reads acquired in the sequencing procedure.

• no filter unique : The number of unique sequences with no filtering.

• adapter match : The number of sequences that match the forward and reverse adapters.

• designed length : The number of sequences that match the adapters and whose length matches the experimentally designed length.

• filtered unique : The number of unique sequences that passed both adapter filtering and design length filtering.

• > 1 : The number of filtered unique sequences that were read more than once.

• U(T) : The unique ratio defined in the main paper. The ratio was calculated using filter-passed sequences.

• ∆U(T) : The difference in unique ratio from the previous round.

Note that the data with the lowest U(T) satisfying U(T) > 0.5 were used.

We created a map of sequence logos for the two datasets using the sequence logo creation method described in subsection S1.1. Each logo visualizes the profile HMM at points spaced evenly from -2.5 to 2.5 on each axis of the two-dimensional latent space. The sequence logo map for dataset A is shown in Figure S1, and that for dataset B in Figure S2.

S1.5 Model structure

The structure of RaptGen is shown in Figure S3. RaptGen consists of a convolutional neural network (CNN)-based encoder and a profile HMM for the decoder distribution. We also tried recurrent neural networks with long short-term memory (LSTM) and CNN-LSTM, a network of a CNN followed by an LSTM. The CNN utilizes skip connections [51], which enable deeper layers to learn appropriately. The models implemented in this study are available at https://github.com/hmdlab/raptgen.

S1.6 Encoders

S1.6.1 Convolutional neural network

A CNN captures sequential motifs that are aligned at certain positions. The encoder CNN first embeds each character into a 32-dimensional vector. Six skip-connection layers then capture the interactions, and a final max-pooling layer outputs the resulting feature vector. The skip-connection layer is a combination of a convolutional layer, batch normalization [52], and a leaky rectified linear unit (leaky ReLU) [53]. The structure of the skip-connection layer is

$$x_{\mathrm{out}} = \sigma\!\left(x_{\mathrm{in}} + \mathrm{Conv}_{32}\!\left(\sigma\!\left(\mathrm{BN}\!\left(\mathrm{Conv}_{64}(x_{\mathrm{in}})\right)\right)\right)\right),$$

where $x_{\mathrm{in}}$ is the input vector, $\mathrm{BN}(\cdot)$ is the batch normalization layer, and $\sigma(\cdot)$ is the leaky ReLU layer.

S1.6.3 CNN-LSTM

The convolutional layer and recurrent layer are used in combination to consider both fixed-length and long-distance interactions. The CNN is almost the same as described in the previous section, except that the final max pooling is removed. The LSTM treats this feature vector as the sequence embedding, as described in subsubsection S1.6.2.

S1.7 Decoders

S1.7.1 Multi-categorical model

Here $z$ is the sampled vector and $\mathrm{FC}_{J,K}$ is a fully connected layer, defined as $\mathrm{FC}_{J,K}(x) = W_{J,K}^{T} x$ with the learnable parameter $W_{J,K} \in \mathbb{R}^{J \times K}$. To capture interactions along the sequence, the embedding is processed with the transposed convolution function [54], which is generally the opposite of a convolutional function. The final output $x_{\mathrm{out}}$ is

$$x_5 = \mathrm{TransConv}_{32,32}(\sigma(\mathrm{BN}(x_4)))$$
$$x_6 = \mathrm{TransConv}_{32,32}(\sigma(\mathrm{BN}(x_5)))$$
$$x_{\mathrm{out}} = \sigma(\mathrm{TransConv}_{32,32}(\sigma(\mathrm{BN}(x_6))))$$

where $\mathrm{TransConv}_{J,K}(\cdot)$ is the transposed convolutional function with trainable parameters of input dimension $J$ and output dimension $K$. All transposed convolutional layers have the same kernel size of 7 and the same padding length of 3.

S1.7.2 Autoregressive model

For the autoregressive model, we used a gated recurrent unit (GRU), a simplified version of the LSTM [55]. The GRU is calculated as follows:

$$z_t = \sigma_g(W_z x_t + U_z h_{t-1} + b_z)$$
$$r_t = \sigma_g(W_r x_t + U_r h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $t$ is the time step, $x_t$ is the input vector, $h_t$ is the output vector, $z_t$ is the update gate vector, $r_t$ is the reset (initializing) gate vector, $\odot$ is the element-wise product, and $W$, $U$, and $b$ are learnable parameters. The probability of the output sequence, $p(x \mid z) = \prod_{i=1}^{L} p(x_i \mid x_{0:i-1}, z)$, is calculated from the GRU outputs, where $\mathrm{GRU}(x, h_0)$ is defined as a GRU function that takes an input vector $x$ of length $L$ and an initial hidden vector $h_0$ and outputs $h_L$.

S1.7.3 Profile hidden Markov model

The profile HMM decoder is described in Figure S3b. The latent vector is a $D$-dimensional vector, which is transformed into 32 dimensions by a fully connected (FC) layer. After the vector is rectified by leaky ReLU, it is transformed into […]

Figure S3. (a) The skip-connection layer. The input feature vector is first passed through 64 hidden channels, then through 32 channels, and added to the original vector. It then passes through the leaky ReLU layer and produces the output. (b) Overall architecture of RaptGen. The input sequence is initially embedded into 32-dimensional feature vectors and passes through skip-connection layers. After the latent mean and log-variance are calculated with the fully connected layer, written as "Linear," the latent variable is sampled and transformed to fit the profile HMM parameter shapes.
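For illustration, a minimal PyTorch version of such a GRU-based autoregressive decoder is sketched below; the layer sizes and names are assumptions rather than the exact RaptGen implementation.

```python
import torch
import torch.nn as nn

class ARDecoder(nn.Module):
    """Minimal GRU-based autoregressive decoder: the latent vector z seeds
    the hidden state, and each step predicts the next base from the
    previous ones (layer sizes are illustrative)."""

    def __init__(self, latent_dim, hidden=32, n_bases=4):
        super().__init__()
        self.init = nn.Linear(latent_dim, hidden)       # h_0 from z
        self.embed = nn.Embedding(n_bases + 1, hidden)  # +1 for start token x_0
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_bases)

    def forward(self, z, x_prev):
        # x_prev: (B, L) tokens x_0 .. x_{L-1}; returns logits for x_1 .. x_L,
        # i.e. the parameters of Cat(x_i | g_theta(x_{0:i-1}, z))
        h0 = torch.tanh(self.init(z)).unsqueeze(0)      # (1, B, hidden)
        h, _ = self.gru(self.embed(x_prev), h0)
        return self.out(h)
```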

The embeddings are shown in Figure S4. The multi-categorical probabilistic model tends to place sequences with the same motif near each other; however, the nearest surrounding sequences are not always from the same motif. Although the autoregressive model has a lower loss, it tends to produce an unsplit latent space.

Jolma et al. evaluated the binding specificities of transcription factors using HT-SELEX experiments [8]. We utilized the data from that research, estimated the latent embeddings, and checked whether the derived motif logos were consistent. We selected five targets whose logos were presented in the DeepBind study [27].
Table S3 shows the learned embedding spaces and the sequence logos of the GMM centers trained with 10 components. Although the embeddings did not clearly split into 10 areas, the profile HMM logos were consistent with previously determined motifs. Logo images from previous research were downloaded from the CisBP database [56]. The motifs learned by DeepBind with the top three weights are also shown. In the braces, the reconstruction error is on the left and the regularization error is on the right. The color represents the motif that each sequence contains.

Table S3. Sequences taken from the SELEX experiments in a previous study [8].