A Fast 3 × N Matrix Multiply Routine for Calculation of Protein RMSD

The bottleneck for the rapid calculation of the root-mean-square deviation in atomic coordinates (RMSD) between pairs of protein structures for large numbers of conformations is the evaluation of a 3 × N × N × 3 matrix product over conformation pairs. Here we describe two matrix multiply routines specialized for the 3 × N case that are able to significantly outperform (by up to 3X) off-the-shelf high-performance linear algebra libraries for this computation, reaching machine limits on performance. The routines are implemented in C and Python libraries, and are available at https://github.com/simtk/IRMSD.


Introduction
The three-dimensional biomolecule structures provided by X-ray crystallography and NMR spectroscopy have pushed biology to the resolution of individual atoms.Understanding at this level of detail has been nothing short of revolutionary, with applications ranging from the design of small molecule drugs [8] to the detailed mechanism of protein synthesis within the cell [1,13].When analyzing biological structures, the root mean square deviation (RMSD) is a fundamental computational tool-the RMSD provides a valuable metric of the similarity of two conformations of a given protein.Such a metric enables algorithms that classify [10] protein structures and predict [12] structures from previously existing models.
RMSD plays an even larger role in analyzing molecular dynamics simulations of biomolecules.
For example, Markov state model approaches [15,4,14] model the biomolecule dynamics by a Markov process on a state space that is typically determined by clustering.In such approaches, one often must cluster a simulation dataset with more than 10 6 conformations into a model containing 10 4 conformational states-clustering such a dataset is typically limited by the large number of RMSD calculations required.
Efficient, numerically stable algorithms exist to calculate both the structural RMSD as well as the optimal rotation matrix corresponding to this RMSD.Given some pre-calculated data, the RMSD can be calculated in 121 FLOPs (floating-point operations) on average [16], and the rotation matrix (as a unit quaternion) in 165 FLOPs in the worst case [11].However, the pre-calculation steps for each algorithm are far more expensive.Both methods require the structures to be zero centered and require matrix magnitudes (self-inner-products) for both structures (where the coordinates are considered as 3×N matrices), as well as the inner product between the two.In large-scale clustering methods, the centering and magnitudes may be pre-calculated for each structure; however, the pairwise computation of coordinate matrix products takes approximately 18N FLOPs per matrix multiplication.Thus, even for small structures, the matrix multiplication becomes the bottleneck in RMSD calculation.
In this article we present a fast matrix multiply routine, specialized for the 3 × N matrix multiplications used in protein RMSD calculation, that uses SSE2/3 vectorization and register blocking for Intel and AMD x86 64 processors to run (3X) faster than off-the-shelf linear algebra routines.We present two versions of the matrix multiply, corresponding to different memory layouts, and provide benchmark results on internal protein datasets.Our code is available at https://github.com/simtk/IRMSD.

RMSD Calculation with the Theobald Algorithm
An algorithm due to Theobald [16] computes the RMSD between a pair of N-atom structures A and B (represented as N ×3 matrices) as a simple function of the largest eigenvalue of a 4x4 symmetric "key" matrix K: elements of the product between A and B: Given G A , G B , and S, Theobald's algorithm, using Newton-Raphson iterations to refine the estimated eigenvalue, can compute the RMSD in 66 FLOPs (floating-point operations) to set up K, and an average of 55 FLOPs to iteratively estimate the eigenvalue.A refinement of the algorithm [11] also calculates the rotation matrix corresponding to the estimated RMSD in a worst-case 99 FLOPs after K has been computed.

Fast Matrix Multiplication Routines
Using the Theobald algorithm, the calculation of S dominates the cost of computing an RMSD.
The N × 3 matrix inner product requires 9× (N multiplies and (N − 1) adds); furthermore, it is memory-bandwidth intensive, requiring that both structures be loaded from memory at least once (three times, for a naive matrix multiply).Indeed, the matrix multiply is typically memory bandwidth bound.
In this section, we describe two versions of a register-blocked, vectorized matrix multiply to accelerate this computational bottleneck.Our target architecture is Intel/AMD x86 64 processors supporting the SSE vector intrinsics, which are currently the most common types of CPU found on standalone computers as well as large supercomputers.64-bit support is helpful for these algorithms because it exposes an additional 8 architectural vector registers; our method can be used with only slight modification on 32-bit machines but will be less efficient because of register spills.
x86 64 SSE exposes a 4-wide (for single-precision) or 2-wide (for double-precision) vector architecture with 16 architectural vector registers.Modern CPUs (since the Core 2 generation from Intel and the K10 (Phenom) architecture from AMD) are able to execute a vector multiply or add in 1 clock cycle -potentially increasing throughput 4x from scalar floating point code.
Our paper describes single-precision code.We have two different matrix multiplication routines, corresponding to the two common representations for protein structures in memory: axis-major and atom-major (Figure 1).

Axis-major Data
The simpler version of the matrix multiply requires that data be transposed from the representation used in the Theobald papers: specifically, instead of storing N × 3 matrices A and B, we store 3 × N matrices A T and B T .This "axis-major" layout has the x, y, and z axes along the slowlyvarying index component of the matrix.These structures are stored such that each axis is padded with zeros such that the length is a multiple of 4 (to pack into vector registers properly), and each axis is stored in memory aligned to a 16-byte boundary (to use fast SSE aligned load instructions).
In the description that follows, we will treat each axis of each structure as a 1 × N array Ax, Ay, etc.
The axis-major method performs L/4 iterations (where L is N rounded up to the nearest multiple of 4).Before the loop, 9 vector registers xx, xy, ..., zz corresponding to the 9 elements of matrix S are initialized to [0,0,0,0]; these will hold partial summations of the axis-wise inner products.In each iteration, the method first loads 4 elements from each of Bx, By, and Bz into registers and then copied into temporary registers tx, ty, tz.It then loads 4 elements from Ax into a register.Since SSE has no non-destructive multiplication or addition operations, a destructive multiply is used to compute tx = tx * Ax, and so on for each axis.This is then added to the accumulation register: xx = xx + tx.This set of operations then repeats for Ay and Az.In total, the loop as written uses all 16 vector registers in SSE: 9 accumulation registers, 3 to store copies of Bx, y, z, 3 temporaries, and 1 to store the loaded axis chunk from structure A (compiler optimization reduces the register count to 15).At the end of the loop, each vector accumulation register is then sum-reduced to a single element, corresponding to the required element of matrix

S.
This method streams each structure through from memory only once (a factor of 3 improvement versus a naive multiply).On a 64-bit machine, moreover, it requires no memory accesses in the inner loop except for the structure loads; all temporaries are stored in registers, something made possible by the specialization to N × 3 matrices.Furthermore, with the exception of padding elements at the end and the final horizontal reduction (negligible in number for typical structure sizes), the vector lanes are fully utilized, dramatically improving arithmetic efficiency.

Atom-major Data
If storing data atom-major (i.e., as N × 3 rather than 3 × N in memory) is preferred for other reasons, it is still possible to gain many of the advantages of the previous vectorized method.Our atom-major method loads four atoms of coordinates from memory and shuffles the contents of the registers to recover an axis-major ordering in registers.This allows the same core arithmetic routines to be used as with the axis-major method, with an extra shuffle step at each load.Consequently, it eliminates the need for an expensive transpose operation in memory, at the cost of additional logic operations during the matrix multiplication.However, because the multi-core matrix multiply is typically memory-bound, these logic operations are often effectively "free" (hidden behind the latency of memory loads).
Figure 2 shows the structure of the shuffling operations used to transpose the atomic coordinates in registers.The temporaries needed for shuffling inflate the register count in the inner loop such that one register must be spilled to the stack each iteration.However, because this will almost certainly be served by the L1 cache, the performance impact is minimal.The four atoms (twelve coordinates) are first loaded into registers X, Y, and T2; Z is initialized to the value of X and T2 to the value of Y.The shuffle operation uses six instructions to unpack the data from X, Y, and T2 into registers X, Y, and Z, such that X contains the x coordinates for each of the 4 atoms, in order, and so on for Y and Z.

Benchmarking Methodology
To benchmark the performance of our matrix multiplication code, we compare its throughput to that of the matrix multiplication code included with the Theobald RMSD routines [16], as well as that of the SGEMM (single-precision matrix-matrix multiply) routine in GotoBLAS2 version 1.13 [7], a high-performance linear algebra library.We additionally compare the effective RMSD throughput of our matrix multiply and the GotoBLAS SGEMM as applied in k-centers clustering [6,5] of protein conformations.
OpenMP, a multi-vendor language extension for parallelism in C/C++ and FORTRAN, was used to parallelize all matrix multiplication routines.The matrix multiplication benchmark first initializes 6 .CC-BY 4.0 International license peer-reviewed) is the author/funder.It is made available under a The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/008631doi: bioRxiv preprint first posted online Sep. 2, 2014;

Structure data
Figure 1: Illustration of linear layout options for a protein structure in memory.Xi/Yi/Zi are the X, Y, and Z coordinates of the i'th atom.4 GB of random single-precision floating point numbers (2 30 numbers).For each tested number of atoms N , it interprets this array as 2 30 /N structures of N atoms each.The first structure is treated as a reference, and every other structure is multiplied against this reference.For parallel runs, each thread handles a disjoint set of structures.This is run five times at each structure size and number of threads; the average of the elapsed time of the five runs is used to compute the effective arithmetic throughput of each method in GFLOP/s (billions of floating-point operations per second).
The clustering benchmark was performed using MSMBuilder [3] version 2.0.MSMBuilder contains a Python-Numpy implementation of the k-centers [6,5] clustering algorithm using the Theobald RMSD metric.We modified MSMBuilder to use either the current multiplication code or GotoBLAS.The benchmark consisted of the PDBs 1NOT (176 atoms), 1YRF (582 atoms), 2KT2 (982 atoms), 2L30 (1679 atoms), 2KTD (2588 atoms), and 3PB4 (4947 atoms).This range of sizes covers all single-domain proteins with structures in the Protein Data Bank [2].To generate nonequivalent protein conformations, we used the Gromacs [9] molecular dynamics package to simulate each protein for 5 ns.Clustering benchmarks were performed using single-threaded linear algebra.The K-Centers algorithm was used to cluster each dataset of 40,000 conformations into 100 states.The time required for clustering each dataset was measured using the Python time.time()function.Each benchmark was repeated 4 times; mean calculation times are re-The time required for system preparation and file IO was subtracted; these values are independent of the underlying linear algebra library.Clustering benchmarks were performed on a 6-core AMD Phenom II X6 1090T processor (3.2-3.4GHZ) with 8GB of DDR3-1333 memory.
The included benchmark reports single-threaded performance; as with the matrix multiplication benchmark, parallelism for both GotoBLAS and our code can be implemented using OpenMP to parallelize over structures.

Matrix Multiplication Throughput
The best way to evaluate the throughput of the methods is versus the architectural limits of our benchmark machine.The 20GB/s memory bandwidth is capable of sustaining 30 GFLOP/s in matrix multiplication if the reference structure is kept in the cache across comparisons or 15 GFLOP/s if both are streamed from memory.Each core of our Core i7-920 has a clock speed between 2.66 and 2.93 GHz (the CPU is capable of dynamically increasing its clock rate if thermal headroom is available), and can process 4 FLOP per cycle if instructions are issued one at a time ("single-issue throughput"), for an arithmetic peak of 10.64-11.72GFLOP/s per core, or 42.56-46.88GFLOP/s over all four cores.Thus, an optimal matrix multiplication code should achieve 10-12 GFLOP/s single-threaded (arithmetic-bound) and ∼ 30 GFLOP/s with four threads (memory-bound).
Figure 3 shows the performance of our code, GotoBLAS, and the original Theobald method with one and four threads.Four performance regimes are clearly visible for our code, corresponding to levels in the machine's memory hierarchy (atom counts for which the reference structure can fit in the L1, L2, and L3 caches, and where it must be streamed from memory).With one thread, Y3 Y =Y.zxyw R1 = {R1.**,R2.**} operations implemented using SSE SHUFPS instruction R = R.**** implemented using SSE2 PSHUFD.

Figure 2 :
Figure 2: Dataflow for in-register transpose of atom-major data to axis-major format (a) Single-threaded (b) Multi-threaded (4 threads)

Figure 3 :
Figure 3: Arithmetic throughput of our matrix multiply routines versus GotoBLAS and matrix multiply from Theobald code in Ref [16], using 1 or 4 threads

1
. CC-BY 4.0 International license peer-reviewed) is the author/funder.It is made available under a 1/2.G A and G B are the self-inner-products xx + S yy + S zz ) Syz − Szy S zx − S xz S xy − S yx Syz − Szy (S xx − S yy − S zz ) S xy + S yx S zx + S xz S zx − S xz S xy + S yx (−S xx + S yy − S zz ) S yz + S zy S xy − S yx S zx + S xz S yz + S zy (−S xx − S yy + S zz )

4
. CC-BY 4.0 International license peer-reviewed) is the author/funder.It is made available under a The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/008631doi: bioRxiv preprint first posted online Sep. 2, 2014;