ComPotts: Optimal alignment of coevolutionary models for protein sequences

To assign structural and functional annotations to the ever increasing amount of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models (pHMMs), which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose here to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition. Due to the presence of non-local dependencies, aligning two Potts models is computationally hard. To tackle this task, we introduce an Integer Linear Programming formulation of the problem and present ComPotts, an implementation able to compute the optimal alignment of two Potts models representing proteins in tractable time. A first experimentation on 59 low sequence identity pairwise alignments, extracted from 3 reference alignments from sisyphus and BaliBase3 databases, shows that ComPotts finds better alignments than the other tested methods in the majority of these cases.


Introduction
Thanks to sequencing technologies, the number of available protein sequences has considerably increased in the past years, but their functional and structural annotation remains a bottleneck.This task is thus classically performed in silico by scoring the alignment of new sequences to well-annotated homologs.One of the best-known method is BLAST [1], which performs pairwise sequence alignments.The main tools for homology search use now Profile Hidden Markov Models (pHMMs), which model position-specific composition, insertion and deletion probabilities of families of homologous proteins.Two well-known software packages using pHMMs are widely used today: HMMER [2] aligns sequences to pHMMs and HH-suite [3] takes it further by aligning pHMMs to pHMMs.
Despite their solid performance, pHMMs are innerly limited by their positional nature.Yet, it is well-known that residues that are distant in the sequence can interact and co-evolve, e.g.due to their spatial proximity, resulting in correlated positions (see for instance [4]).
There have been a few attempts to make use of long-distance information.Menke, Berger and Cowen introduced a Markov Random Field (MRF) approach where MRFs generalize pHMMs by allowing dependencies between paired residues in β-strands to recognize proteins that fold into βstructural motifs [5].Their MRFs are trained on multiple structure alignments.Simplified models [6] and heuristics [7] have been proposed to speed up the process.While these methods outperform HMMER [2] in propeller fold prediction, they are limited to sequence-MRF alignment on β-strand motifs with available structures.Xu et al. [8] proposed a more general method, MRFalign, which performs MRF-MRF alignments using probabilities estimated by neural networks from amino acid frequencies and mutual information.Unlike SMURF, MRFalign allows dependencies between all positions and MRFs are built on multiple sequence alignments.MRFalign showed better alignment precision and recall than HHalign and HMMER on a dataset of 60K non-redundant SCOP70 protein pairs with less than 40% identity with respect to reference structural alignments made by DeepAlign [9], showing the potential of using long-distance information in protein sequence alignment.
Meanwhile, another type of MRF led to a breakthrough in the field of contact prediction [10]: the Potts model.This model was brought forward by Direct Coupling Analysis [11], a statistical method to extract direct correlations from multiple sequence alignments.Once inferred on a MSA, a Potts model's nodes represent positional conservation, and its edges represent direct couplings between positions in the MSA.Unlike mutual information which also captures indirect correlations between positions, Potts models are global models capturing the collective effects of entire networks of correlations through their coupling parameters [12], thus tackling indirect effects and making them a relevant means of predicting interactions between residues.Beyond contact prediction, the positional and the direct coupling information captured by Potts model's parameters might also be valuable in the context of protein homology search.The idea of using Potts models for this purpose was proposed last year at the same workshop by Muntoni and Weigt [13], who propose to align sequences to Potts models, and by us [14] with the introduction of ComPotts, our method to align Potts models to Potts models.
In this paper, we fully describe ComPotts and focus on its performances in terms of alignment quality.In the following sections, we explain our choices for Potts model inference and we describe our method for aligning them, which builds on the work of Wohlers, Andonov, Malod-Dognin and Klau [15,16,17] to propose an Integer Linear Programming formulation for this problem, with an adequate scoring function.We assess the quality of ComPotts' alignments with respect to 59 reference pairwise alignments extracted from sisyphus [18] and BaliBase3 [19] databases.On these first experiments, computation time was tractable and ComPotts found better alignments than its main competitors: BLAST, HHalign (which is HHblits' alignment method) and MRFalign.

Methods
In this section, we describe our approach to align two Potts models.We start with a short summary of Potts models notations and then we explain the choices we made for the inference of Potts models.Then, we introduce our formulation of the alignment problem as an Integer Linear Programming problem, using notations from [20].

Inference of Potts models
Potts models are discrete instances of pairwise Markov Random Fields which originate from statistical physics.They generalize Ising models by describing interacting spins on a crystalline lattice with a finite alphabet.In the paper introducing Direct Coupling Analysis [11], Weigt et al. came up with the idea of applying them to proteins: inferred on a multiple sequence alignment, a Potts Model could then be used to predict contacts between residues.
A Potts model on protein sequences can be defined as follows: Let S be a multiple sequence alignment (MSA) of length L over an alphabet Σ of length q (here we use the amino acid alphabet, which is of length q = 20).A Potts model for S is a statistical model defining a probability distribution over the set Σ L of all sequences of length L which complies to the maximum-entropy principle and whose single and double marginal probabilities are the empirical frequencies of the MSA.Formally, denoting f i (a) the frequency of letter a at position i in the MSA S and f ij (a, b) the frequency of a and b together at positions i and j in S, a Potts model for S satisfies: and defines a Boltzmann distribution on Σ L with: where: -Z is a normalization constant : are positional parameters termed fields.Each v i is a real vector of length q where v i (a) is a weight related to the propensity of letter a to be found at position i.
quantifies the tendency of the letters a and b to co-occur at positions i and j.
The value In theory, one could infer a Potts model from a MSA S by likelihood maximization, i.e. by finding the positional parameters v and coupling parameters w that maximize P(S|v, w).In practice, however, this would require the computation of the normalization constant Z at each step, which is computationally intractable.Among the several approximate inference methods that have been proposed [21,22,23,24,12], we opted for pseudo-likelihood maximization since it was proven to be a consistent estimator in the limit of infinite data [25,26] within reasonable time.Furthermore, since our goal is to align Potts models, we need the inferrence to be geared towards similar models for similar MSAs, which is not what inference methods were initially designed for.In an effort towards inferring canonical Potts models, we chose to use CCMpredPy [27], a recent Python-based version of CCMpred [28] which, instead of using the standard which yields the correct probability model if no columns are coupled, i.e.P(x|v, w) = L i=1 P(x i ).Our intuition is that positional parameters should explain the MSA as much as possible and only necessary couplings should be added.

Alignment of Potts models
We introduce here our method for aligning two Potts models.We start by describing the function we designed to score a given alignment, then we add the constraints that make the alignment proper by embedding it in an Integer Linear Programming formulation, following Wohlers et al. [17], allowing us to use their efficient solver for the optimization.

Scoring an alignment
We want the alignment score of two Potts models A and B to maximize the similarity between aligned fields and aligned couplings.
Formally, we want to find the binary variables x ik and y ikjl , where x ik = 1 iff node i of Potts model A is aligned with node k of Potts Model B and y ikjl = 1 iff edge (i, j) of Potts model A is aligned with edge (k, l) of Potts model B, such that: ) and s w (w A ij , w B kl ) are similarity scores between positional parameters v A i and v B k and coupling parameters w A ij and w B kl .To score the similarity s v (v A i , v B k ) between positional parameters v A i and v B k we use the scalar product : And to score the similarity s w (w A ij , w B kl ) between coupling parameters w A ij and w B kl we use the Frobenius inner product, which is the natural extension of scalar product to matrices : This scoring function can be seen as a natural extension of the opposite of the Hamiltonian of a where : -e x i is the vector defined by ∀a ∈ [1..q], e x i (a) = δ(a, x i ) -e x i x j is the matrix defined by

2.2.2
Optimizing the score with respect to constraints Naturally, the scoring function should be maximized with respect to constraints ensuring that the alignment is proper.In that perspective, we build on the work of Wohlers et al. [17], initially dedicated to protein structure alignment, to propose an Integer Linear Programming formulation for the Potts model alignment problem.
We remind first the constraints for a proper alignment following [20].
First, we need the definition of alignment graph.Every proper alignment of two Potts model is described by a strictly increasing path in this alignment graph, which is defined as a subset {i 1 .k 1 , i 2 .k 2 , • • • , i n .kn } of alignment graph nodes that can be ordered such that each node is strictly larger than the previous one, i.e.
To specify the constraints of the ILP, they defined sets of mutually contradicting nodes, called decreasing paths.A decreasing path is a set {i The set of all decreasing paths is denoted C.
We also give notations for the left and right neighborhood of a node : let i.k be a node in the alignment graph and V + i.k (resp.V − i.k ) denote the set of couples that are strictly larger (resp.smaller) than i.k, e.g.
) denote the set of all decreasing paths in V + i.k (resp.V − i.k ).Given the notations above, with A (resp.B) a Potts Model of length L A (resp.L B ) with parameters v A and w A (resp v B and w B ), aligning A and B can be formulated as the following Integer Linear Programming problem: x ik ≥ j.l∈C i.k∈C x binary (7) As in [17], an affine gap cost function can be added to the score function to account for insertions and deletions in the sequences.

Results
We implemented this ILP formulation in a program, ComPotts, embedding the solver from [17].We assessed the performances of ComPotts in terms of alignment precision and recall with respect to a set of 59 pairwise reference alignments.For each sequence, a Potts model was inferred on a multiple sequence alignment of close homologs retrieved by HHblits.

Data
We extracted 59 reference pairwise sequence alignments from 3 reference multiple sequence alignments from sisyphus [18] and BaliBase3 [19] with a particularly low sequence identity.To focus on sequences with coevolution information, we considered only sequences with at least 1000 close homologs (see next section).We also discarded sequences with more than 200 amino acids for memory considerations with respect to CCMpredPy.Reference alignments are summed up in table 3 Tab.1. Reference multiple alignments used in our experiment and selected sequences extracted.

Potts model inference
For each sequence, we built a MSA of its close homologs by running HHblits [3] v3.0.3 on Uniclust30 [29] (version 08/2018) with parameters recommended in [30] for CCMpred: -maxfilt 100000 -realign_max 100000 -all -B 100000 -Z 100000 -n 3 -e 0.001 which we then filtered at 80% identity using HHfilter, and took the first 1000 sequences.If the MSA had less than 1000 sequences we removed it from the experiment.This MSA was then used to train a Potts model with CCMpredPy using default parameters except for the w regularization factor coefficient (we set it to 30, which we empirically found to result in v 2 2 1 2 w 2 2 , in other words making v and w scores commensurable) and the uniform pseudo-counts count on the v (we set it to 1000 to have as many pseudo-counts as the number of sequences in the alignment in order to enhance stronger conservation signal and limit excessive similarity scores due to missing the same rare residues).

Potts model alignment
We ran ComPotts with a gap open cost of 8 and no gap extend cost, which we found empirically to yield the best alignments in previous experiments.To speed up the computations, we decided to stop the optimization when 2(U B−LB) s(A,A)+s(B,B) < 0.005 with U B and LB the current upper and lower bounds of the Lagrangian optimization, since we realized in preliminary experiments that in practice it gave the same alignments as the optimal ones in significantly less time.

Alignment quality assessment
We compared each resulting alignment with the reference pairwise alignment extracted from the multiple sequence alignment, considering the alignment precision with the Modeler score [31] # correctly aligned columns # columns in test alignment and the alignment recall with the TC score [32] # correctly aligned columns # columns in ref alignment , computed using Edgar's qscore program [33] v2.1.
To compare our results, we ran HHalign v3.0.3 to align HMMs built on the MSAs outputted by HHblits, MRFalign v0.90 to align MRFs it built from the sequences, both with default options, and BLASTp v2.9.0+ without E-value cutoff.As a control, we also ran Matt v1.00 on the corresponding PDB structures.Results are summarized in figure 2. Note that Matt failed to run 3 of the alignments.BLAST is unquestionably outperformed by all other tools on this set.10 out of the 59 sequence pairs could not be aligned (not hit was found) and, on 22 of the alignments it performed, BLAST had both a recall (TC score) and a precision (Modeler score) of 0. Its average TC score is 0.2694 and its average Modeler score is 0.4357, which is about half the average scores of the other methods.BLAST has a better precision on some alignments, most of the time because its alignments are smaller, which results in a rather low recall, except for some alignments which seem to be quite easy for everyone, such as 1r21A and 2fezA.
All methods seem to struggle with the alignment of 1au7 and 1neq: HHalign's precision skyrockets to 1.0, but at the cost of a recall of 0.47, while ComPotts and MRFalign yield their worst scores, with respective recalls of 0.31 and 0.65 and respective precisions of 0.15 and 0.35.
On average, ComPotts' alignments have a better recall than all compared tools including Matt with 0.758, versus 0.670 for HHalign, 0.713 for MRFalign, and 0.749 for Matt, outperforming HHalign most of the time -in 52 out of the 59 alignments -and MRFalign in 39 alignments out of the 59, while still having a slightly better precision than all other sequence-based tools with 0.847 while HHalign's is 0.826 and MRFalign's is 0.822, outperforming HHalign in 46 alignments out of the 59 and MRFalign in 30 alignments.Matt has the best precision on average with 0.872.Overall, ComPotts has an average F 1 score ( 2×precision×recall precision+recall ) of 0.800, versus 0.740 for HHalign and 0.763 for MRFalign, yielding better alignments than HHalign in 52 cases and better than MRFalign in 39 cases.For asyet-unknown reasons though, our scores for the alignment of 1au7 and 1r69 are remarkably lower than our competitors.

Computation time considerations
We examined the computation times of ComPotts, HHalign and MRFalign, considering only the time they took to align the models and not the time needed to build the models.Not surprisingly, ComPotts is significantly slower than HHalign and MRFalign.This is explained by the fact that HHalign only performs 1D alignment, and MRFalign uses a heuristic to compute the alignment, whereas ComPotts uses an exact algorithm.Aligning two sequences took between 37 seconds (for two models with 75 and 63 positions) and 14.49 minutes (for two models with 144 and 151 positions), with an average of 3.33 minutes on a Debian9 virtual machine with 4 vCPUs and 8GB of RAM, whereas HHalign yields a solution in less than 4 seconds and MRFalign in less than 0.20 seconds.It is worth noting that, although the computation time is significantly higher than its competitors, the solver yields an exact solution in tractable time, even though this problem is NP-complete [34].In this experiment, the computation time seems to be dominated by the computation of all the s w scores, which is quadratic in the number of pairs of edges.

Conclusion
We described ComPotts, our ILP-based method for Potts model-Potts model alignment which can yield the exact solution in tractable time.We reported encouraging results on first experiments where ComPotts often yields better alignments than its two main competitors, HHalign and MRFalign, with respect to a set of 59 low sequence identity reference pairwise alignments.These initial results suggest that direct coupling information can improve protein sequence alignment and might improve sequence-based homology search as well.We still have to see whether the score yielded by ComPotts has more discriminatory power than other methods and enables to better distinguish homologous from non-homologous proteins.

Fig. 1 .
Fig. 1.Illustration of the alignment of two Potts models A and B.
For A and B two Potts Models of lengths L A and L B , the alignment graph G is a L A × L B grid graph where rows (from bottom to top) represent the nodes of A and columns (from left to right) represent the nodes of B. A node i.k in the alignment graph indicates the alignment of node i from Potts model A and node k from Potts model B.