Parallel Implementation of Smith-Waterman Algorithm on FPGA

In bioinformatics, alignment is an essential technique for finding similarities between biological sequences. Usually, the alignment is performed with the Smith-Waterman (SW) algorithm, a well-known sequence alignment technique of high-level precision based on dynamic programming. However, given the massive data volume in biological databases and their continuous exponential increase, high-speed data processing is necessary. Therefore, this work proposes a parallel hardware design for the SW algorithm with a systolic array structure to accelerate the Forward and Backtracking steps. For this purpose, the architecture calculates and stores the paths in the Forward stage for pre-organizing the alignment, which reduces the complexity of the Backtracking stage. The backtracking starts from the maximum score position in the matrix and generates the optimal SW sequence alignment path. The architecture was validated on Field-Programmable Gate Array (FPGA), and synthesis analyses have shown that the proposed design reaches up to 79 . 5 Giga Cell Updates per Second (GCPUS).

Thus, allowing high scalability. According to [30], the systolic array is a class of parallel 67 computing architecture that describes an array for dense linear algebraic calculations, 68 proposed by [24]. Its hardware implementation usually uses a pipeline structure, where 69 the data is propagated between Processing Elements (PEs). Besides, its main advantage 70 is to reduce the number of memory accesses throughout the data flow. Hence, systolic 71 arrays simplify the architecture and improve the system's operating frequency. [31]. 72

73
This subsection briefly discusses hardware-based approaches for the SW algorithm that 74 can be found in the literature, such as implementations in supercomputers [32], 75 GPUs [33][34][35][36], architectures based on Resistive Content Addressable Memories 76 (ReCAMs) [37] and FPGAs [34,[38][39][40][41][42][43]. 77 GPUs are well-known for their high degree of parallelism and computing intensity. 78 However, they have a high cost, significant computing latency, and low energy efficiency. 79 The high computing latency is due to the high number of cores and low cache memory 80 to control these cores. In contrast to GPUs, FPGAs are customizable according to the 81 user's needs, achieving better computing performance and lower latency [14,44,45]. 82 However, FPGA hardware development is usually complex and takes a long time. 83 In recent years, supercomputers have also become widely used for processing massive 84 data. In [32], a hybrid SW-NW algorithm deployed on the Sunway TaihuLight  Unlike the conventional platforms previously mentioned, the SW algorithm has also 96 been implemented on ResCAMs, as can be seen in [37]. ResCAMs is a storage 97 accelerator system that allows millions of processing units (PUs) to be deployed over 98 multiple silicon arrays. In [37], the implementation was used to compare the 99 homologous chromosomes between humans (GRCh37) and chimpanzees (panTro4), and 100 the only SW step performed was the building of the score matrix. As a result, their achieved regarding other SW designs implemented on FPGA and GPUs. Besides, it 118 reached a 26% reduction in power consumption compared to the GPU implementation. 119 Similarly, more FPGA approaches using systolic arrays for the NW and SW with 120 backtracking sequencing techniques have been proposed, such as [41,46]. In [41], a 121 VHDL SW implementation, using Dynamic Programming (DP) with approximation 122 correspondences for two different strategies, was proposed. It achieved 23.5GCUPS with 123 speedups between 150× to 400× compared to a 2004-era PC. Meanwhile, in [46], the 124 implementation was based on PEs to perform elementary calculations and a diagonally 125 backtracking search, also developed in VHDL. Comparisons were made with the linear 126 and affine strategies, achieving 10.5GCUPS.

127
In [42], the SW forward and backtracking processes were implemented in an FPGA. 128 The Qnet structure was adopted for communicating with the FPGA, reaching 129 25.6GCUPS, a speedup of 300× compared to a desktop computer. 130 Another FPGA alternative for implementing the SW is OpenCL and OpenMP, as 131 shown in [43]. An OpenCL conversion extension is used to synthesize the algorithm and 132 synchronize tasks in multiple Altera FPGAs through the OSWALD [34,41,42,46].

142
Our approach performs both the Forward and Backtracking stages of the algorithm. 143 Unlike the approaches in the literature, we obtain the alignment path distances during 144 the Forward Stage processing and the maximum score, reducing the complexity of the 145 Backtracking Stage processing. Memories are used to propagate the distances and 146 maximum score, allowing the Backtracking step to follow the path directly. Thus, our 147 architecture achieves good performance (short critical path), reduced memory usage and 148 high scalability, and prevents memory access overlap latency, even implementing the two 149 stages of the SW algorithm. Smith and Waterman originally proposed the SW algorithm in 1981 to performs local 152 sequence alignment of nucleotides and proteins in the biological field [11]. The sequence 153 alignment of the SW algorithm includes the Forward and Backtracking stages, which 154 are performed by the calculation results of the alignment similarity score. Besides, the 155 alignment is performed based on two input sequences called query sequence, q, and 156 dataset sequence s. The query sequence can be expressed by where q j is the j-th nucleotide or amino-acid protein and N is the length of the query 158 sequence. The dataset sequence can be expressed by where s i is the i-th nucleotide or amino-acid protein, and M is the dataset sequence 160 length. Therefore, the SW algorithm is calculated attractively for two dimensions, and 161 it has a computational complexity of O(M × N ).
The Forward Stage calculates the scoring matrix, H, where H is a two-dimensional 163 array that can only take values greater than or equal to 0 (i.e., H ∈ N 2 ). This matrix is 164 generated by comparing the elements of the sequences q and s. Usually, H is generated 165 using DP, and it is initialized with zeroes in the first row and column. Subsequently, the 166 DP process is performed to calculate the sequence scores. Based in works presented 167 in [32,34,47], the recurrence relationship can be defined as obtaining the similarity score between s i and q j , E and F are two assisted matrices 170 when calculating matrix H, ρ is the gap opening penalty and σ is gap extension penalty. 171 In the particular case of ρ = σ, a linear gap penalty model is obtained, opening and 172 extending a gap with the cost γ. P is also called a substitution matrix, where the 173 simplest version is when the diagonal receives the match value and the rest of the 174 matrix has a mismatch value. When performing all element calculations, this expression 175 is the H M,N matrix. Therefore, H(i, j) is the maximum alignment score of two 176 sub-sequences s and q. The initialization condition is The maximum score value of H(i, j) in the Forward Stage is the last sequence that 178 will be aligned. To determine the relationship, the previous neighborhood values of the 179 analyzed element are required, i.e., the values on the diagonal, horizontal, and vertical 180 positions, as illustrated in Figure 1. As can be observed, the score of w can be found respectively. This windowing step occurs throughout the process of determining all 183 scores in H. defined based on the sequences q and s. Thereby, the w score is determined as where γ, α, and β represent the linear gap, a match, and a mismatch, respectively. A x 192 when fully populated, the H matrix contains the score and path information.  The hardware architecture for the SW algorithm proposed in this work was developed 205 using systolic arrays to input two DNA sequences and increase the processing speed of 206 the local sequence alignment. An overview of the systolic array structure of the 207 proposal for N PEs is shown in Figure 2. Besides, each PE is divided into 3 modules.

208
These modules are the Forward stage, the storage process, and the Backtracking stage, 209 as seen in Section 2. Each module is illustrated in blue, green, and yellow, respectively. 210 The Forward stage has its module named as Matrix Score Module (MSM), the storage 211 process module is called as Memory Module (MM), and the Backtracking stage has its 212 module as Backtracking Stage (BS).

213
The labeled signals shown in Figure 2 are generated outside the modules.

214
Meanwhile, the non-labeled ones are generated by computations inside the modules and 215 detailed throughout this Section. The sequences q and s, defined according to 216 Equations 1 and 2, are external discrete signals used as inputs of the SW algorithm.

217
Furthermore, each signal in the sequences represents one of the four DNA nucleotides, 218 i.e., A, G, T, or C (also withstand twenty levels referring to amino acids).

219
Initially, the circuit starts when the MSM modules propagate the q and s signals. As 220 seen in Figure 2, each k-th element of the q and s sequences are shifted to each MSM 221 output to shorten and stabilize the critical path, as well as allowing the computation of 222 scores synchronously, preserving the systolic array structure. Afterward, the MSM 223 computes the score according to Equation 3, and propagates the sequence elements to 224 the next MSM; also, the computed results are sent to the respective MM in their order 225 of entry. During this process, the MM operates exclusively in writing mode while the 226 process has not reached the last computation between the two sequences.

227
The Forward stage is completed after fully computing the scores of the H matrix.  233 Figure 3 shows the block design that represents each PE of the systolic array, with a 234 detailed description of the signals between the modules within one PE. As can be 235 observed, besides the two input sequences to be compared, q and s, the MSM also 236 receives an enable signal, en. After computing the score between each k-th element of 237 the two sequences (i.e., an element of the H matrix), the MSM outputs to the next PE 238 the following signals: the calculated score, Sc j ; the maximum score, M axV al, and its 239 position, AddrRAM i∧j ; the PE index; along with the input signals q, s, and en, shifted 240 in time. In addition, the MSM also outputs signals to the MM, which are the calculated 241 path direction, Direction, and the storage address of that path wAddrDir.

242
Subsequently, after fully populating the H matrix and, consequently, the D matrix, 243 the Forward stage is finished enabling the T raceback signal, which in turn begins the BS. 244 Firstly, the BS sets signal BT Start to 1, indicating the start of the Backtracking process. 245 Therefore, the mRAM i∧j are propagated back until it reaches the BS with maximum 246 score, which is identified by the signal index. From this location match, the btcontrol 247 signal is changed to allow the reading of the memory by MM. Thus, the BS receives the 248 path value from the MM at signal d j when sending the memory address rAddrDir j 249 signal. The d j value allows the BS to calculate the next requested address and 250 propagate it to the next module through the path(j) signal, representing the memory 251 address of the request path in MM. Lastly, the alignment value enters valDir, and the 252 process continues until it reaches the complete alignment. All modules are detailed in 253 the following subsections. All signals present in this Section are shown in Table 1.

262
Each element of the matrix D needs 2 bits to be expressed, delivering a more 263 economical storage process compared to H, which can certainly need more than 2 bits 264 to represent each element.

265
As previously mentioned, the alignment process is performed based on the query and 266 dataset input sequences, q and s, respectively. Also, there can have different sizes,  The systolic array structure developed for the matrices is composed of N PEs.

271
Therefore, for each j-th element in q, there is a j-th PE. It is based on dividing the 272 construction of the H score matrix expressed by and finding the best path in which the D matrix returns the correct sequence alignment, 274 which in turn is equivalent to the directional flags that determined the alignment path. 275 Moreover, for each P E j (which represents a column of the matrix H) there is i-th s(i) 276 The number of MSMs submodules corresponds to the number of elements in q, i.e., 278 {j ∈ N | 0 ≤ j < N }, as can be observed in Figure 4. Therefore, H is formed by N 279 columns, according to Equation 6. Besides, the MSM also calculates the path, the  The SW algorithm in this work is initialized by the en(k) signal, which enables the 283 memory components in the MSM and MM modules to allocate the two sequences q(k) 284 and s(k). The en(k) is a sequence of pulses of value 1 with size equal to the s sequence. 285 Thus, the sequences are transmitted at each sampling time to the Forward Module. The 286 signals are received in MSM, and the respective q(k) is allocated according to its where α and β are arbitrary values that correspond to the match value and mismatch 296 values, respectively.

297
Subsequently, the score value, Sc j−1 (i − 1), and correspondence value, α ∧ β, are 298 added to define a portion of g j (i). The where γ is an arbitrary value that represents the chosen linear gap value. This 311 expression is equivalent to Equation 3.

312
The output of the pink blocks, called opr, are propagated to the next submodule for 313 choosing the maximum score and distance path, as shown in Figure 5. This submodule 314 is built with a set of multiplexers and relational circuits that can find the maximum 315 score value with the coded distance of the path by comparing the opr signals, as seen in 316 Figure 6.

317
Selecting path distances is based on a simple encoding of three levels representing 318 the alignment action to be adopted: 2, 1, and 3. Therefore, the levels 2, 1, and 3 319 represent a match, a gap in the target sequence q and s, respectively, as described in 320 Section 2. The encoding process of directions is performed in the Forward step, as 321 illustrated in Figure 6. During this process, the same signals used to calculate the H 322 score matrix are needed, i.e., the opr j−1 (i − 1), opr j−1 (i) and opr j (i − 1), as seen in  neighborhood. An integer value is associated with the d j corresponding to the address 331 of w, according to the maximum value determined in the neighborhood, these values are 332 assigned according to the expression where 1, 2 and 3 is the vertical, diagonal, and horizontal paths, respectively. The where (x + (α ∨ β)) = opr j−1 (i − 1), (y − γ) = opr j (i − 1) and (v − γ) = opr j−1 (i).
if y > v then score ← y; direction ← 1; else score ← v; direction ← 3; end end // Stores the score in matrix H and the direction in matrix D calculated; H(i + 1, j + 1) = score; D(i + 1, j + 1) = direction; // checking which is the highest calculated score; if maxV al < score then maxV al ← score; posM i ← i + 1; posM j ← j + 1; end end end return D, posM i, posM j; After the process of selection the score and direction, it has the choice of the 338 maximum score based on a logic of multiplexers and relational blocks. There is a 339 counter, called cntR, to determines the number of times that the selection of the score 340 and direction is carried out, i.e., the H matrix row that the process is on. This is 341 necessary to determine the AddrRAM i address. At the beginning of MSM processing, 342 index(j − 1) is added to 1, just once for each MSM, becoming index(j) and      the sequence s, which in turn, the amount of RAM memories is equal to the number of 388 PEs in the systolic array.

389
The enable signal, en, is used as write enable for each RAM in the MM. Therefore, 390 en = 1 defines the write mode, while en = 0 the read mode. In addition, the btcontrol 391 signal selects which module controls the RAM address bus. Hence, for btcontrol = 0 the 392 memory addresses are defined by the MSM module through wAddrDir signal, while 393 btcontrol = 1 selects the BS module to define the addresses via rAddrDir signal.  The backtracking process starts when the T raceback signal is enabled in the MSM by 402 counters that determine the last PE and the last processed element of s, as described in 403 Forward Stage. As previously mentioned in subsection 3.1, the MSM propagates to the 404 MM the maximum score address that is used as the starting point for alignment, as 405 shown in Figure 8. Meantime, the Figure 9 details the submodules used to create each 406 BS module. The submodules in green are circuits for controlling and synchronizing all 407 signals during the module operation, while the blue submodule performs the alignment 408 path described in this section.  mRAM i value to rAddrDir to read the memories in the MM, which in turn, returns 418 the d(i) value to the Direction Process submodule, as can be seen in Figure 9.

419
Secondly, the alignment process starts. The circuits used to build the alignment 420 submodule are shown in Figure 10. As can be observed, the input d j (i) is used as the 421 multiplexer selector to perform the Equation 10. Therefore, for d j (i) = 3, BS remains in 422 the same memory position and moves back one BS module, i.e., horizontal displacement. 423 While for d j (i) = 1, only the memory position decreases by 1, and BS is verified by the 424 Direction Process and Continue Processing submodules (i.e., vertical displacement).

425
Meanwhile, for d j (i) = 2, the memory position also decreases by 1, and it moves to the 426 previous module with the displacement in the memory position. The circuit after the 427 first multiplexer prevents negative addresses in the memory. Given that the path to align the first element is found, the Alignment Block 429 submodule receives the rAddrDir j and d j (i) signals to define the path to be followed 430 by the next BS, as seen in Figure 9. Initially, a logical circuit enables the BT Start and 431 Direction Process submodules to propagate those signals to the Alignment Block. The 432 Direction Process and Continue Processing submodules carry out checks to define which 433 BS module is active, that is, for d j (i) = 1, the data processing is held in the current BS 434 module, and for d j (i) = 1, the signal BT N ext is enabled, indicating the end of data 435 processing in the current PE to start in the next one.

436
After finding the module for the maximum score, the mRAM i and mRAM j signs 437 finish their function. Thus, from the determination of the BS with the maximum score, 438 the path(j) sign is used as a guide for locating the alignment of each module. Then, the 439 data in MM is requested and the d j value is returned for verification and establishment 440 of alignment. The verification and establishment of the alignment path is done by the 441 Memory Index, Direction Process, and Continue Processing submodules. Decisions 442 related to d j value are made in Alignment Block submodule, as illustrated in Figure 10. 443 Finally, the Finish Processing and Continue Processing submodules finish the data 444 processing in the module. Thereby, the valDir output of each submodule is used to   The development of the algorithm was carried out using the development platform 461 provided by the FPGA manufacturer, in this case, Xilinx [48]. This platform allows the 462 user to develop circuits using the block diagram strategy instead of VHDL or Verilog.

463
The architecture was deployed on the FPGA Virtex-6 XC6VLX240T and compared to 464 state-of-the-art works. Usually, hardware implementations of the SW algorithm in the 465 literature were implemented only the Forward Stage or both the Forward and 466 Backtracking Stages. In our proposal, both stages were implemented.

467
The performance for hardware implementations of the SW algorithm is usually in which a cell can be one vector or matrix element to be computed. This metric can 470 also be described based on the clock frequency, that is, 471 GCUPS = number of cells × clock frequency × 10 9 .
The latter equation is often used to compare the systolic array efficiency. Since the 472 number of cells is equivalent to the number of PEs, and the clock frequency defines the 473 operating frequency, it is unnecessary to measure the total runtime of the algorithm.

488
The data bit-width was defined by the maximum size of the input sequences, limited 489 by FPGA memory capacity. Hence, the input sequence bit-width was set to 3 while 490 constants were defined according to its value. Besides, the bit-width for the MSM buses 491 that perform mathematical operations was defined as log total−P Es ×α. Meanwhile, the 492 sequence counters for s is log s−size . 493 Figure 11 shows the architecture deployed and running on the Virtex-6 FPGA. The 494 host computer (i7-3632QM CPU and 8GB of RAM) was used to plot the results and 495 compare them to a software implementation presented in [49], as shown in Figure 12. In 496 the Figure 12, it can be seen that the y axis refers to the s sequence, while the x axis Row refers to the position in the s, whereas Column is related to the element of the q. 501 The amount of sequence alignment performed is represented by Number of Alignments. 502 The architecture parameters for the demo were set to match = 5, mismatch = −5, 503 gap = 1, and 128 PEs. Hence, the size of the sequence q is also 128. Meanwhile, the 504 size of the sequence s was set to 8,192, resulting in a total of 1, 048, 576 calculated cells. 505 Sequence q is loaded into memory at each iteration, where it can vary between 4 506 different 128 nucleotide sequences in the demonstration. The demo is available at [50]. 507

508
Analysis of the synthesis results for the SW hardware implementation were carried out 509 for two FPGAs: Virtex-6 XC6VLX240T and Virtex-7 XC7VX485T. Table 2 -T T G T C A -T C   --A -A T  -C C ---A G C -----C A T T --C  --C T A --T

Fig 12.
Illustration of the results obtained from our proposal in co-simulation. The image is the most detailed representation of the monitor in Figure 11. It can see that the y-axis refers to the s, while the x-axis refers to the q. The position at which the alignment starts is indicated by Row and Column. The maximum score value found is presented by Maximum Value. The amount of sequence alignment performed is represented by Number of Alignments.
in the Virtex-6, a total of 68% of the Slice Look-Up Tables (LUTs) were used in 518 contrast to only 7% for 64 PEs. Concerning the frequency, a slight decrease is observed 519 as the number of PE increases due to an increase in the critical path. Concerning the 520 Virtex-7, there are unused FPGA resources as less than 35% of Slices LUTs were used. 521 Therefore, it can be used to increase the number of PEs and, thus, the performance.  The works presented in Table 3 are available in [34]. The second column indicates 527 whether the backtracking stage was also developed on FPGA or only the forward.

528
Meantime, the third to fifth columns present the number of PEs, operating frequency, 529 and performance, respectively. The performance was obtained according to Equation 12. 530 As can be seen, our approach and the one proposed by [34] were the only ones to 531 implement a high number of PEs. However, in [34] only the backtracking path was 532 deployed on the FPGA, and a submatrix structure is used to load the path chosen for 533 alignment. Meanwhile, our architecture relies on a memory storage structure and the 534 definition of the maximum score to align the sequences. Furthermore, a comparison with [34] was also carried out regarding the FPGA area 537 occupation, and it is presented in Table 4. The second and third columns present the 538 FPGA and the number of PEs used, respectively. Meanwhile, the third and fourth 539 columns present the slices and memory blocks occupied, and the fifth column the 540 operating frequency.

541
As shown in Table 4, for the same number of PEs, our architecture occupied 35, 286 543 slices and 0 BRAMs in contrast to 57, 870 slices and 896 block RAMs (28 Mbits 544 memory) in [34]. Also, the total area occupation was higher than 60%, compared to 545 46% on ours, due to the substitution matrix. Therefore, our proposal has high 546 scalability due to the low resource usage (can reach up to 1, 024 PEs for the 547 XC7VX485T). Besides, our implementation proposal can be implemented in smaller 548 FPGAs, such as the Virtex-6 XC6VLX240T, with a reasonable nucleotide sequence.

549
Regarding the operation frequency, our proposal can reach up to 155 MHz. So, it is 550 observed that the proposals with the best performances have a similar structure, even 551 with different approaches to the solution. Our proposal and [34] achieving the same 552 performance for the frequency of 150 MHz. 553 Therefore, our work uses fewer hardware resources to perform the alignment process 554 due to the chosen backtracking approach. As the backtracking stage results in high 555 computational complexity, we simplified the process using the path mapping through 556 the maximum value in D and H, resulting in linear computational complexity. On the 557 other hand, the architecture proposed by [34] uses considerably more memory resources 558 due to data partitioning and prefetching for the backtracking step. Despite both works 559 achieving similar performance due to the systolic array, there are significant differences 560 in the alignment approach chosen for the FPGA implementation.

561
The hardware implementation of the alignment process through our approach, 562 developed based on a chain of directions and the maximum score address, is a key 563 contribution for the low use of memories and, thus, achieve high hardware scalability.

564
Hence, the proposed method can compress the data, using only 3 bits in a fixed-point 565 implementation. 566

567
This paper presented a parallel FPGA platform design to accelerate both the Forward 568 and Backtracking stages of the SW algorithm. The main contributions were the 569 high-speed data processing implementation and low memory usage that allowed high 570 scalability. In order to satisfy the high-throughput, ultra-low-latency and low-power 571 requirements and to alleviate the raw data processing problem in bioinformatics. From 572 the strategy of storing alignment path distances and maximum score position during 573 Forward Stage processing. It was possible to reduce the complexity of Backtracking 574 Stage processing which allowed to follow the path directly. The proposal architecture 575 achieved a satisfactory critical path, reduced memory usage and high scalability for 576 two-step SW algorithm. Synthesis results showed that the proposed method could 577 support up to 1, 024 PEs in only one FPGA, using the Xilinx Virtex-7 XC7VX485T.

578
The main advantage is the low hardware resource usage and high performance of 79.5 579 GCUPS, with an operating frequency of up to 155MHz, without using external