## Abstract

Linkage effects in a multi-locus population strongly influence its evolution. Recent models based on the traveling wave approach enable us to predict the speed of evolution and the statistics of phylogeny. However, predicting the evolution of specific sites and pairs of sites in the multi-locus context remains a mathematical challenge. In particular, the effect of epistasis, the interaction of gene regions contributing to phenotype, is difficult both to predict theoretically and to detect experimentally in sequence data. A large number of false interactions arising from linkage and indirect interactions mask true interactions. Here we develop a method to clean false-positive interactions. We start by demonstrating that averaging the two-way haplotype frequencies over a hundred independent populations is not enough to clear false interactions. Then, to address this problem, we develop analytically and use a triple-way haplotype test, which isolates true interactions. Next, the fidelity of the test is confirmed on simulated genetic sequences, where the epistatic network is known in advance. Finally, we apply the test to a large database of influenza A H1N1 virus neuraminidase sequences from various geographic locations to predict the epistatic network responsible for the transition from the pre-pandemic virus to the pandemic strain. We predict a primary mutation and 15-22 secondary compensatory mutations of variable strength, as many as typically observed for drug resistance and immune escape mutations in HIV. These results present a simple and reliable method to measure epistatic interaction from sequence data.

**Author’s summary** Interaction of genomic sites creating a “fitness landscape” is very important for predicting the escape of viruses from drugs and the immune response and for passing through fitness valleys. Many efforts have been invested into measuring these interactions from DNA sequence sets. Unfortunately, reproducibility of the results remains low, due partly to the very small fraction of interacting pairs, and partly to stochastic noise intrinsic to evolution, which masks true interactions. Here we propose a method based on the analysis of genetic sequences at three genomic sites to clean stochastic linkage and apply it to influenza virus sequence data.

## Introduction

It was realized long ago that the properties of evolving populations are strongly affected by the fact that alleles at different genetic sites (loci) are co-inherited unless separated by recombination. These *linkage* effects include clonal interference (FISHER 1930; MULLER 1932), genetic background and genetic hitchhiking effects (RICE 2004), enhanced accumulation of deleterious mutations (Muller ratchet) (FELSENSTEIN 1974), and the increase of random genetic drift at one locus due to selection at another locus (HILL AND ROBERTSON 1966a). Linkage substantially slows down the adaptation of a population and creates random associations between pairs of mutations that occurred on the same branch of the ancestral tree.

These effects have been taken into account by early mathematical models considering two loci (KIMURA 1994) and, more recently, by the traveling wave approach, which can describe an arbitrarily large number of linked sites (ROUZINE *et al.* 2003; DESAI AND FISHER 2007; HALLATSCHEK 2011; GOOD *et al.* 2012). These models usually consider fitness classes of sequences subjected to selection, mutation, and random genetic drift. They show that the population represents a narrow fitness distribution traveling in fitness space in a direction that depends on initial conditions and parameters. The distribution consists of a deterministic bulk subject to selection alone and a stochastic leading edge. At the edge, the generation and establishment of new beneficial mutations limits the speed of adaptation. Alternatively, the distribution may move backwards, accumulating deleterious alleles (Muller ratchet). These models were able to predict accurately several integral parameters of evolution in terms of model parameters, including population size, mutation rate, and the distribution of selection coefficients over loci. Predicted parameters include the adaptation rate, Muller ratchet rate, equilibrium state (ROUZINE *et al.* 2003; ROUZINE *et al.* 2008), the probability of fixation and the probable selection coefficient (GOOD *et al.* 2012), and the statistics of the ancestral tree (WALCZAK *et al.* 2012; NEHER AND HALLATSCHEK 2013).

With all this progress, the evolution of separate sites in the multi-site context still remains an open question. How do allelic frequencies at each site change in time? Although they are known to follow random trajectories, what can be said about the average allelic frequency of a given site with a given fitness effect of mutation? What can we say about the evolution of site pairs, especially in the presence of epistatic interaction?

Epistasis, the interaction of genes and gene regions contributing to phenotype, is an omnipresent phenomenon (CARLBORG *et al.* 2006). Gene interactions are responsible for 70% of the organism’s genetic inheritance (ZUK *et al.* 2012) and create fitness valleys in the evolutionary path (WEISSMAN *et al.* 2009). In pathogens, epistasis accelerates the development of drug resistance and immune escape and impedes its reversion (NIJHUIS *et al.* 1999; LEVIN *et al.* 2000; PIANA *et al.* 2002; HANDEL *et al.* 2006; GONZALEZ-ORTEGA *et al.* 2011; WU *et al.* 2018). Most of HIV variation in untreated patients has been argued to arise from mutations compensating virus escape from the immune response (ROUZINE AND COFFIN 1999).

A number of approaches have been proposed to measure epistasis from genomic data (CORDELL 2009; CHEN *et al.* 2011; UEKI AND CORDELL 2012). The simplest methods are based on pairwise allelic correlations (HILL AND ROBERTSON 1966b; BARTON 1995). However, other forces, such as linkage and indirect interactions, create strong correlations even between non-interacting pairs of loci, which are much more numerous than epistatic pairs. Stochastic effects are a serious obstacle to the detection of epistatic effects (WEI *et al.* 2014). Mathematical modeling shows that, in a single asexual population with a relatively short genome, stochastic linkage overshadows the epistatic footprint, except in a narrow range of times and parameters (PEDRUZZI AND ROUZINE 2019).

A tree-based real-time technique was used to detect epistatic interactions in influenza A, including those that are known to confer resistance to the drug oseltamivir (KRYAZHIMSKIY *et al.* 2011; NEVEROV *et al.* 2015). These studies identified compensatory mutations as those closely following primary mutations and appearing in the same lineage. A method to discern directly interacting pairs from false (indirect) interactions in a highly-diverged single-time protein sequence set from different species has also been proposed (WEIGT *et al.* 2009; COCCO *et al.* 2018). The idea was to combine a covariance method based on mutual information with estimates of the direct mutual information. The direct interactions are estimated from the condition that entropy is maximal given the observed frequencies of single sites and pairs of sites. A similar technique has been used to measure the fitness landscape of Ab binding regions of HIV surface protein gp120 (LOUIE *et al.* 2018).

While the last data-based technique eliminates indirect interactions for highly-divergent sequences from different species, it cannot eliminate linkage effects present in a single population or in several recently-diverged populations. The divergence of independent populations from a common ancestor in a random direction creates high covariance (linkage disequilibrium) of random sign for most site pairs, masking the covariance from epistasis (PEDRUZZI AND ROUZINE 2019).

Recently, we proposed another, evolution-based approach to the problem. The challenge we faced was to predict analytically the co-evolution of a specific pair of loci linked to a multi-site genome and then use this information to estimate epistatic interaction (PEDRUZZI *et al.* 2018). Using a simulated population, we verified that, in a broad parameter range, the distribution of alleles across genomes and sites, within each fitness class, is controlled by the condition of maximum entropy at given fitness. We demonstrated the existence of a broad parameter and time range where each fitness class of an adapting population has sufficient time to assume the most likely and disordered state, given by the current average fitness, which changes slowly in time. This situation is referred to as “quasi-equilibrium” (not to be confused with “linkage quasi-equilibrium” or “mutation-selection balance”). In quasi-equilibrium, the entropy *S* is at its current maximum and is a function of fitness *W*, which slowly changes in time during the process of adaptation: *S* = *F*[*W*(*t*)]. The following simple derivation employs the fact, known from all the traveling-wave models cited above, that the distribution of fitness *W* is relatively narrow, std[*W*] ≪ *W*. Quasi-equilibrium is possible due to the very slow motion of the fitness distribution, limited by the slow establishment of new beneficial alleles at the front edge of the distribution (ROUZINE *et al.* 2003; DESAI AND FISHER 2007; NEHER *et al.* 2010; HALLATSCHEK 2011). This situation exists very far from mutation-selection balance, i.e., full equilibrium.

Consider a pair of loci (sites). Assume that each site can have two variants, either a better-fit allele or a less-fit allele, denoted 0 or 1, and that *s*_{1}, *s*_{2} are the respective fitness effects of mutation at the two sites. We can have 4 possible haplotypes, 00, 01, 10, 11. Let us also assume that the two sites interact epistatically, so that two deleterious alleles emerging together, denoted 11, partly compensate each other to the degree *E*, where 0 < *E* < 1. The log fitness values of the four haplotypes are
respectively (Fig. 1). If *E* is allowed to change below 0 or above 1, we can obtain any type and sign of epistasis: positive, negative, overcompensation.
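
One form of the haplotype log fitnesses (Eq. 1) that is consistent with the verbal definitions above is sketched here. It is a reconstruction, not the original display equation; it assumes that the haplotypes are listed in the order 00, 10, 01, 11, that *s*_{1} and *s*_{2} are positive costs, and that a deleterious allele lowers log fitness.

```latex
W_{00} = 0, \qquad W_{10} = -s_1, \qquad W_{01} = -s_2, \qquad W_{11} = -(s_1 + s_2)(1 - E)
```

With this convention, *E* = 0 gives additive costs and *E* = 1 gives a double mutant as fit as the wild type, in line with the description of *E* as a degree of compensation.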

The frequency *f*_{ij} of a haplotype in the population is proportional to the number of possible sequence configurations of the rest of the genome (grey box in Fig. 1a), exp(*S*_{rest}). Entropy *S*_{rest}, as we explained above, is restricted by the fitness of the rest of the genome, equal to *W*−*W*_{ij}. Hence, the entropy of each haplotype subset is *S*_{rest} = *S*(*W* − *W*_{ij}). Since the genome is long, we can safely assume that *W*_{ij} is much smaller in magnitude than *W*, so that the corresponding change in entropy is small and proportional to *W*_{ij}. Hence, we can approximate
where *β* = |*dS*/*dW*|. The frequency of each haplotype is proportional to the corresponding configuration number, exp[*S*(*W* − *W*_{ij})]. Combining Eqs. 1 and 2, we can express the haplotype frequencies as
where *β* slowly depends on *W* and, hence, on time, but is the same for all genomic sites (PEDRUZZI *et al.* 2018). This simple derivation assumes that the site pair does not interact with other sites of the genome.
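
The two steps described above can be written out as follows. This is a hedged reconstruction from the surrounding text rather than the original display equations; it assumes that entropy decreases with increasing log fitness (d*S*/d*W* < 0), so that *β* = |d*S*/d*W*| is positive, and that the frequencies are normalized over the four haplotypes.

```latex
S(W - W_{ij}) \approx S(W) - \frac{dS}{dW}\, W_{ij} = S(W) + \beta W_{ij},
\qquad \beta = \left| \frac{dS}{dW} \right|

f_{ij} = \frac{e^{\beta W_{ij}}}{\sum_{k,l \in \{0,1\}} e^{\beta W_{kl}}}
```

Under these assumptions, a haplotype with a larger fitness cost (more negative *W*_{ij}) is exponentially rarer, which is the Boltzmann-like form referred to in the text.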

This assumption can be lifted and generalized for any interaction network, and the function *S*(*W*) can be calculated explicitly from a more microscopic model (PEDRUZZI *et al.* 2018). We also do not need to restrict ourselves to the bi-allelic model, but we will keep this assumption, since it is fairly good for relatively short-term evolution. (A physicist will recognize in Eq. 2 the Boltzmann distribution derived originally for a small subsystem connected to a large thermal reservoir.)

Eqs. 1 and 2 predict how the average frequencies of single sites and pairs of sites evolve in time. It is also tempting to use them for the practical goal of estimating parameters *E*, *s*_{1}, *s*_{2} from the observed haplotype frequencies, and thus find the fitness landscape for the entire genome, at least if epistasis is pairwise, as we assumed. In reality, these formulae are not readily applicable to real-life sequences because of the strong random variation of *f*_{ij} among pairs of loci caused by stochastic linkage effects.

Of course, the problem of strong linkage noise causing poor reproducibility of various methods [see (WEI *et al.* 2014) for review] is not specific to this method of estimating epistasis. Any attempt to measure epistasis in a single population, whether by measuring covariance [*D*′, *r*^{2}, mutual entropy, (PEDRUZZI AND ROUZINE 2019)] or by tree-based methods used, for example, by (NEVEROV *et al.* 2015), faces the same problem. The epistatic pairs in a single population are not observable in principle, due to the phylogenetic relationship of sequences in the population. As shown previously by simulation, all sequences in a population closely resemble the common ancestor, which diverges from the origin in a random direction (PEDRUZZI AND ROUZINE 2019). As a result, any measure of co-variance, or even the use of the entire tree, produces only strong noise of random sign. Co-variation due to random linkage completely masks the epistasis signature in a population. The only way to resolve this issue is by averaging over many independent populations. But how many populations do we need to have a small fraction of false positives?

As we show in the present work, the number of required populations is unrealistically large even for a short genome of 50 loci, not to mention more realistic genomes. Averaging over populations helps only to enrich the observed network in true bonds, but does not solve the problem: false-positive interactions still dominate. We also note that the existing technique developed for measuring epistasis by comparing sequences from remote species (WEIGT *et al.* 2009; COCCO *et al.* 2018) does not help with this problem either, because it is designed to remove indirect interactions, not to filter out linkage effects in closely-related populations.

Here we offer a new technique, based on the application of the quasi-equilibrium argument (above) to triple-way haplotypes. We show analytically and by simulation that a triple-way haplotype test represents an effective way to eliminate residual false links, including linkage and indirect interactions. We will demonstrate its high fidelity in a broad parameter range on the mock sequences obtained by computer simulation.

After training the method on simulated sequences, we apply this technique to real virus sequences from an adapting population. Influenza virus evolving in a population was demonstrated to obey the traveling wave theory with an effective selection pressure caused by the accumulating memory cells (ROUZINE AND ROZHNOVA 2018; YAN *et al.* 2019). Therefore, it satisfies the same quasi-equilibrium criteria (the evolution is slow and the distribution of fitness is narrow) and is amenable to our method. Using 8000 influenza sequences obtained from various geographic locations, both before and after the pandemic of 2009, we detect and estimate quantitatively the epistatic network responsible for the transition of two influenza proteins (NA and HA) between the old and new strains, which share 80% homology. To avoid a possible misunderstanding, the origin and the cause of pandemics, such as antigenic shifts, are not addressed in this paper; we focus only on the epistatic network that allows the new variant to arise from the old variant.

## RESULTS

### Simulation model to generate sequences for the test

We start by simulating *in silico* the stochastic asexual evolution of a virus using a Wright-Fisher model that includes mutation, random genetic drift, and selection with known epistatic interactions (Fig. 2B). The simulation generates genome sequences demonstrating LD originating from epistatic interactions, indirect interactions, and stochastic linkage effects. We assume only two alleles per amino acid position (site): better-fit (denoted 0) or less-fit (denoted 1). The binary simplification provides a major reduction in the computational cost and is especially accurate for relatively conserved sequences (ROUZINE AND COFFIN 2005). We set some pairs of sites to interact epistatically, with a mutual degree of compensation between deleterious alleles set at *E* = 0.75. We consider the same example of double arches (Fig. 1 or 2B).

### First step: averaging over populations

Our task is to recover true epistatic interactions from simulated sequences using pair-wise association analysis (Fig. 2A). Previous work has shown that their detection is masked by stochastic linkage, which is known for its strong effects on evolution, including clonal interference, genetic background effect, and alterations in the adaptation speed and genealogy (FISHER 1930; BARTON 1995; ROUZINE *et al.* 2003; BRUNET *et al.* 2008; ROUZINE AND COFFIN 2010; GOOD *et al.* 2012; NEHER AND HALLATSCHEK 2013). To decrease the linkage effects on detection, we calculate pairwise haplotype frequencies *f*_{ij} for each pair of sites (*Methods*) and average them over multiple evolutionarily independent populations of the same size (PEDRUZZI AND ROUZINE 2019). Then, we perform a pairwise association analysis using a correlation measure, UFE (*Methods*), shown to be a direct estimator of the degree of mutual compensation of two deleterious alleles, *E* (PEDRUZZI *et al.* 2018). More traditional LD measures, including *D*′ and Pearson coefficient *r*^{2}, have similar performance but do not estimate *E* directly (PEDRUZZI AND ROUZINE 2019). We select pairs with high correlation, UFE > 0.6. For a single population, the raw cluster of predicted pairs is extremely complex and completely hides true epistatic interactions, in agreement with previous results (PEDRUZZI *et al.* 2018) (Fig. 2C). A significant reduction in complexity is obtained by averaging *f*_{ij} over 200 independent populations (Fig. 2D).

### Second step: Triple-site haplotype method

Although much simpler, the resulting network in Fig. 2D is still densely populated by residual stochastic LD and indirect links, which are as strong as direct interactions. To filter the residual false links, we use a procedure based on the quasi-equilibrium approximation (*Introduction* and *Methods, Analytic Derivation*), and then test its validity by comparing the predicted network with the network set in simulation, as follows.

We re-calculate UFE for every potential link *i,j* by considering only the sequences that contain 0 at one of the neighbor nodes of the link (3-locus haplotypes 110, 100, 010, and 000), and denote it UFE_{ij0}. A 0-node cuts a possible indirect correlation for the link even if it is very strong (*Methods*). Considering different 0 nodes, we find the minimum value of UFE_{ij0} over all possible detours. The scatter plot in Fig. 2E demonstrates that, for the false pairs, min(UFE_{ij0}) is several-fold smaller than UFE_{ij}. The reason for this effect is that a 0-node removes a detour path around the link that causes indirect correlations. Therefore, we identify and remove false links as those with a low ratio UFE_{ij0}/UFE_{ij}. The exact cutoff is not crucial, as long as we average *f*_{ij} over at least 20 populations, so that the two groups, false and true interactions, remain distinct. As a result, we obtain 100% detection. We even obtain correct estimates for the compensation strength, UFE = *E* within 15% accuracy (Fig. 2F, green links).

We can conclude that, as predicted previously [2008, 2009], the quasi-equilibrium approximation works in this parameter range, which corresponds to the traveling wave regime, and that the triple-haplotype method has a high fidelity with simulated sequences with a known epistatic network.

### Application to influenza A virus data

After training the method on simulated sequences, we calculate the interaction network for virus sequences from an evolving population. Influenza evolution in the human population was shown to be described by the traveling wave framework (ROUZINE AND ROZHNOVA 2018; YAN *et al.* 2019), which justifies the use of the quasi-equilibrium approximation during the process of adaptation (not to be confused with mutation-selection equilibrium) (PEDRUZZI *et al.* 2018). Therefore, the technique is applicable to influenza virus sequences.

We focus on the two surface proteins of Influenza A virus strain H1N1, Neuraminidase (NA) and Hemagglutinin (HA), both important targets of the immune response and drugs. Our aim is to identify the subnetwork responsible for the transition of the pre-2009 variant of H1N1 to the post-2009 variant of the same strain. To this end, we need to compare sequences of the first variant sampled worldwide to sequences of the second variant sampled worldwide.

We downloaded 8440 and 9811 sequences for NA and HA, respectively, from a public database (https://www.fludb.org). They were collected worldwide, from different geographic locations, between years 2005 and 2010, and included both pre-pandemic and post-pandemic strains. After aligning and binarizing sequences by setting consensus residues to 0 and non-consensus residues to 1, we observed a bimodal distribution of sequences in allele frequency, with two separate maxima. The bimodal distribution is related to the emergence in 2009 of a new strain with 80% homology to the old strain.
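
The binarization step described above can be sketched in a few lines. This is a minimal illustration, assuming the alignment is already available as equal-length strings; the function names `binarize_alignment` and `mutation_fraction` are hypothetical and do not come from the original pipeline, and ties in the consensus are broken by first occurrence.

```python
from collections import Counter


def binarize_alignment(seqs):
    """Map aligned sequences to 0/1 rows: 0 = consensus residue in that column, 1 = any other."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    # Consensus residue = most frequent character in each alignment column.
    consensus = [Counter(col).most_common(1)[0][0] for col in zip(*seqs)]
    return [[0 if ch == c else 1 for ch, c in zip(s, consensus)] for s in seqs]


def mutation_fraction(row):
    """Fraction of non-consensus sites in one binarized sequence."""
    return sum(row) / len(row)
```

A histogram of `mutation_fraction` over all rows is one way to look for the two maxima mentioned in the text.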

In order to compensate for unequal sampling from the early strain and the late strain, the sequences with mutation frequency per genome less than a preset value *d*_{v} were randomly sampled and down-weighted by a coefficient *D*_{w} ranging from 5% to 50%. To obtain the average pairwise haplotype frequencies *f*_{ij}, we repeated the sampling multiple times. Then, we followed the procedure described above to predict the intra-protein network of interactions.

### Sensitivity to sampling

The resulting networks are fairly robust to the variation of (*d*_{v}, *D*_{w}) (Fig. 3A-F). There exists a plateau region of (*d*_{v}, *D*_{w}) where the total number of links varies weakly (Fig. 3E). The dependence of results on (*d*_{v}, *D*_{w}), which predict between 15 and 22 compensatory sites, probably originates from the unequal representation of local subpopulations in the database. As a result, stochastic linkage effects average out less than optimally. Below, we choose the network variants shown in Fig. 3D and 3J as representative for NA and HA, respectively. We observe that site 248 in NA serves as the primary site for multiple compensatory links (Fig. 3D).

### Structural interpretation

The predicted epistatic sites are shown on the three-dimensional protein structure of NA. The active pocket of NA (purple) serves to bind sialic acid on the target cell surface (Fig. 4A). These results suggest that a mutation at residue 248 near the pocket was the primary mutation responsible for the replacement of the old variant of NA with the new variant, and that the other mutations shown on the wheel are compensatory mutations helping NA to restore and improve its fitness. Mutation 248 was shown to enhance the low-pH stability of NA (TAKAHASHI *et al.* 2013), and it is found in all influenza A H1N1 strains after the 2009 pandemic, regardless of geographic location (SEIBERT *et al.* 2012; BYARUGABA *et al.* 2016; OTTE *et al.* 2016).

## DISCUSSION

In the present work, we propose an efficient evolution-based method to distinguish the co-variance caused by epistasis from the co-variance caused by stochastic linkage effects and indirect interaction. The idea (*Methods*) is to pre-average the observed haplotype frequencies over independent (or quasi-independent) populations, select the links with high co-variance, and then apply a tri-way haplotype test to each candidate bond to eliminate the effects of linked sites due to common inheritance.

The test is designed analytically, based on the quasi-equilibrium state within the traveling fitness wave (not to be confused with quasi-linkage equilibrium or mutation-selection balance), which was tested by simulation in a broad parameter range (PEDRUZZI *et al.* 2018; BARLUKOVA *et al.* 2020). Intuitively, because the population evolves slowly, the distribution of alleles between sites has sufficient time to reach the most likely and disordered state given the current average fitness of the population.

We use a mock sequence set evolved in a Wright-Fisher population with a known epistatic network to demonstrate the high fidelity of the method in a controlled environment. The method eliminates false bonds caused by both indirect interactions and linkage, at least in the case of a relatively simple epistatic network and moderately diverse populations (< 40% diversity).

We apply this technique to identify primary and secondary mutations in the two surface proteins of influenza A H1N1 responsible for the transition between the pre-2009 variant and the post-2009 variants of these proteins. We do not address the complex origins of the 2009 pandemic in this work, which involve antigenic shifts. Influenza virus has been shown to map to the traveling wave theory (ROUZINE AND ROZHNOVA 2018; YAN *et al.* 2019), which justifies the use of the quasi-equilibrium assumption. For the NA protein, we observe a single primary site and 15-20 strong compensatory mutations, whose number is in the same range as the number of compensatory mutations observed in drug-resistant strains of HIV.

As compared to the existing technique of removal of indirect bonds developed for highly diverged species (WEIGT *et al.* 2009; COCCO *et al.* 2018), the method is designed to be used for recently diverged populations of the same species with small to moderate diversity. Furthermore, our method primarily targets stochastic linkage, which is less important when comparing sequences from different species (WEIGT *et al.* 2009; COCCO *et al.* 2018). Where both methods can potentially be applied, such as calculating the fitness landscape of HIV Ab-binding regions (LOUIE *et al.* 2018), our method is computationally faster, because it is local. Indeed, we can consider one pair of loci at a time and not worry about the simultaneous optimization of *L*^{2}/2 parameters of the full interaction matrix. It also helps to avoid the situation in which the number of fitting parameters is too large and the system is over-parameterized.

To summarize, we proposed a technique to predict evolution of pairs of sites and tease it out from stochastic linkage effects. We hope that our quasi-equilibrium approach and further development of this technique will prove useful for all researchers interested in finding fitness landscapes of various organisms from genetic samples.

## METHODS

### Wright-Fisher model for simulation

We simulate the evolution of a haploid population of *N* binary sequences (ACEVEDO *et al.*), where each genome site (nucleotide position) numbered by *i* = 1, 2, …, *L* has either *K*_{i} = 0 or *K*_{i} = 1, in discrete generations (Wright-Fisher model). The processes included are random mutation at rate μ*L* per genome, natural selection, random genetic drift, and an epistatic network with a set strength and topology. Recombination is absent. Previous work *in silico* shows that sufficient levels of recombination would enhance epistatic detection (PEDRUZZI AND ROUZINE 2019). In the presence of epistasis, the average progeny number (Darwinian fitness) of sequence [*K*_{i}] is set to *e*^{W} where
where the selection coefficients *s*_{i} and *s*_{j} represent the fitness costs of two deleterious mutations that are partially compensated by each other. By definition, *E*_{ij} is the degree of compensation of deleterious alleles at sites *i* and *j*. Values *E* = 0 and 1 represent no epistasis and full compensation, respectively.
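
A minimal sketch of such a Wright-Fisher simulation is given below for orientation. Several details are assumptions rather than the authors' implementation: the log fitness is taken as *W* = −Σ *s*_{i}*K*_{i} + *E* Σ (*s*_{i} + *s*_{j})*K*_{i}*K*_{j} summed over interacting pairs, with a single strength *E* for all pairs; the initial condition (each site deleterious with probability `f_init`) is hypothetical; and the function names are illustrative.

```python
import math
import random


def log_fitness(genome, s, pairs, E):
    """Assumed form: W = -sum_i s_i K_i + E * sum over pairs (i, j) of (s_i + s_j) K_i K_j."""
    w = -sum(s_i for s_i, k in zip(s, genome) if k)
    for i, j in pairs:
        if genome[i] and genome[j]:
            w += E * (s[i] + s[j])
    return w


def wright_fisher(N, L, mu, s, pairs, E, generations, f_init=0.5, seed=0):
    """Evolve N binary genomes of length L without recombination; return the final population."""
    rng = random.Random(seed)
    # Hypothetical initial condition: each site carries allele 1 with probability f_init.
    pop = [[int(rng.random() < f_init) for _ in range(L)] for _ in range(N)]
    for _ in range(generations):
        # Selection and drift: N offspring are drawn with probability proportional to e^W.
        weights = [math.exp(log_fitness(g, s, pairs, E)) for g in pop]
        parents = rng.choices(pop, weights=weights, k=N)
        # Mutation: every site of every offspring flips with probability mu (mu * L per genome).
        pop = [[k ^ 1 if rng.random() < mu else k for k in parent] for parent in parents]
    return pop
```

Independent populations for ensemble averaging can be generated by calling `wright_fisher` with different `seed` values.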

### Quasi-equilibrium approximation

In a broad parameter range, an adapting asexual or partly sexual population represents a slowly-moving, narrow peak in fitness space (ROUZINE *et al.* 2003; ROUZINE AND COFFIN 2005; ROUZINE *et al.* 2008; ROUZINE AND COFFIN 2010; GOOD *et al.* 2012). Its speed is restricted by the formation and fixation of rare beneficial mutations in the few best-fit genomes (the leading edge). Because the fitness distribution moves slowly, the entropy of the allele distribution over genomes has enough time to reach its current maximum, restricted by the current average fitness of the population. This situation is called “quasi-equilibrium”. We verified the validity of Eq. 3 in a broad range of parameters and initial conditions after time ~ 1/<*s*> (PEDRUZZI *et al.* 2018). When the population finally arrives at mutation-selection equilibrium, deleterious mutation events balance natural selection, creating linkage equilibrium and the mutation-selection balance. At this point, the method stops working.

### Linkage measure

In our recent work (PEDRUZZI *et al.* 2018), we introduced a binary measure of LD
where *f*_{00}, *f*_{01}, *f*_{10}, *f*_{11} are the ensemble-averaged haplotype frequencies, and 0 corresponds to the consensus sequence. As follows from Eqs. (1) and (2), UFE = *E* for an isolated epistatic pair (PEDRUZZI *et al.* 2018).
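
A form of the measure that reproduces the stated property UFE = *E* is sketched below, together with a short check. Both the expression and the Boltzmann-type frequencies used in the check are assumptions consistent with the text of the *Introduction*, with positive costs *s*_{1}, *s*_{2} and *f*_{ij} ∝ exp(*βW*_{ij}); they are not copied from the original display equations.

```latex
\mathrm{UFE} = 1 - \frac{\ln\left( f_{11}/f_{00} \right)}{\ln\left( f_{10} f_{01}/f_{00}^{2} \right)}

\ln\frac{f_{11}}{f_{00}} = -\beta (s_1 + s_2)(1 - E), \qquad
\ln\frac{f_{10} f_{01}}{f_{00}^{2}} = -\beta (s_1 + s_2)
\quad\Longrightarrow\quad
\mathrm{UFE} = 1 - (1 - E) = E
```

Both logarithms are ratios of frequencies, so the normalization constant and the unknown factor *β* cancel, which is why the measure can estimate *E* without knowing the selection coefficients.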

UFE performs similarly to more widely used measures, such as Lewontin’s *D'* and Pearson correlation coefficient, *r*^{2} (PEDRUZZI AND ROUZINE 2019) or mutual information (WEIGT *et al.* 2009; COCCO *et al.* 2018). As compared to these measures, UFE has the unique advantage of directly measuring the degree of mutual compensation of two alleles *E* (Eq. 2), provided they are epistatically isolated. If they are a part of a network, this measure overestimates *E*. We calculate UFE for every potential epistatic link in our simulated sequence set (*Results*).
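
The calculation of UFE from ensemble-averaged haplotype frequencies can be sketched as follows. The sketch assumes the form UFE = 1 − ln(*f*_{11}/*f*_{00}) / ln(*f*_{10}*f*_{01}/*f*_{00}^{2}), which satisfies UFE = *E* for an isolated pair; populations are represented as lists of 0/1 lists, and all function names are illustrative.

```python
import math

HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def pair_frequencies(pop, i, j):
    """Two-site haplotype frequencies for sites i, j in one population."""
    counts = {h: 0 for h in HAPLOTYPES}
    for genome in pop:
        counts[(genome[i], genome[j])] += 1
    return {h: c / len(pop) for h, c in counts.items()}


def averaged_frequencies(pops, i, j):
    """Average the two-site haplotype frequencies over independent populations."""
    total = {h: 0.0 for h in HAPLOTYPES}
    for pop in pops:
        for h, f in pair_frequencies(pop, i, j).items():
            total[h] += f
    return {h: t / len(pops) for h, t in total.items()}


def ufe(f):
    """Assumed form of UFE; returns None when a haplotype is absent or the ratio is undefined."""
    f00, f01, f10, f11 = (f[h] for h in HAPLOTYPES)
    if min(f00, f01, f10, f11) <= 0:
        return None
    denom = math.log(f10 * f01 / f00 ** 2)
    if denom == 0:
        return None
    return 1 - math.log(f11 / f00) / denom
```

Returning `None` for undefined cases is a design choice of this sketch; pairs with a missing haplotype simply cannot be scored with this measure.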

### Tri-site test of false bonds

To test whether the interaction is direct, and not due to linkage or an indirect detour, we calculate for each suspected pair *i, j* the value
where the third 0 selects only for the sequences with the consensus allele 0 at a chosen neighbor site of the pair. We try all possible neighbor sites and calculate the minimum value of UFE_{ij0} over all neighbors. This method cuts the most important indirect correlation by “detour” for a site pair. It works for both linkage and a modest amount of indirect correlation.
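
The tri-site test can be sketched in code as follows. The sketch assumes that UFE_{ij0} has the same form as UFE with the 3-locus haplotypes 000, 010, 100, 110 in place of the 2-locus ones, and that the list of candidate neighbor sites is supplied by the caller; function names and the data layout (populations as lists of 0/1 lists) are hypothetical.

```python
import math


def ufe_from(f00, f01, f10, f11):
    """Assumed UFE form; None if any haplotype is absent or the ratio is undefined."""
    if min(f00, f01, f10, f11) <= 0:
        return None
    denom = math.log(f10 * f01 / f00 ** 2)
    return None if denom == 0 else 1 - math.log(f11 / f00) / denom


def triple_frequencies(pops, i, j, k):
    """Population-averaged frequencies of 3-site haplotypes keyed as (K_i, K_j, K_k)."""
    total = {}
    for pop in pops:
        for g in pop:
            key = (g[i], g[j], g[k])
            total[key] = total.get(key, 0.0) + 1.0 / (len(pop) * len(pops))
    return total


def ufe_ij0(f3):
    """UFE for the pair (i, j) using only sequences with allele 0 at the third site."""
    return ufe_from(*(f3.get((a, b, 0), 0.0) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]))


def ufe_ij(f3):
    """Ordinary pairwise UFE, obtained by summing over both alleles at the third site."""
    return ufe_from(*(f3.get((a, b, 0), 0.0) + f3.get((a, b, 1), 0.0)
                      for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]))


def min_ufe_ij0(pops, i, j, neighbors):
    """Minimum of UFE_ij0 over the candidate neighbor sites; None if never defined."""
    values = [ufe_ij0(triple_frequencies(pops, i, j, k)) for k in neighbors if k not in (i, j)]
    values = [v for v in values if v is not None]
    return min(values) if values else None
```

A candidate link would then be kept or discarded by comparing `min_ufe_ij0` with the unconditioned UFE of the same pair; the cutoff on that ratio is left to the user, as the text notes that its exact value is not crucial.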

To illustrate how this method works on indirect bonds, let us consider a simple analytic example in Fig. 1. We assume that some sites carry deleterious alleles with equal selection coefficient *s*_{i} = *s*_{0} in Eq. 3, with a fixed epistatic strength *E*_{ij} = *E*. In this simple case, we can fully characterize a genome by the numbers of interacting allelic clusters of different size. To do so, let *k*_{i} define the number of clusters with *i* nodes and *b*_{i} bonds. Then, from Eq. 3, we can express fitness as a sum over clusters of different size (PEDRUZZI *et al.* 2018)
New notation *f*_{0} represents the frequency of uncompensated mutations with total fitness *W*. The number of bonds *b*_{i} depends on the topology, and *b*_{1} = 0, *b*_{2} = 1. In our example with double arches, *b*_{3} = 2 (Fig. 1).

As has been shown previously in the literature on traveling waves (see *Introduction*), the population of genomes represents a slowly moving, narrow fitness peak. Because it is in the state of quasi-equilibrium determined by its current average fitness, at each moment of time, the numbers *k*_{i} are determined by the condition that the entropy of the system is maximal given the value of fitness, Eq. 7. Entropy *S* is defined as the log number of sequence configurations
where *L*_{i} is the number of all possible locations for a cluster of size *i*, and *n*_{i} is the number of each cluster’s configurations (shapes). The values of *L*_{i} and *n*_{i} depend on the network topology.

Previously, we applied this argument for several topologies to predict cluster numbers (PEDRUZZI *et al.* 2018). We showed, for the one in Fig. 1, that the numbers of clusters of size *i* = 1, 2 and 3, normalized as *f*_{i} = *k*_{i}/*L*, are related as

The 1st and 3rd sites in each triplet in Fig. 1 do not interact directly, but only indirectly through site 2. For these two sites, the haplotype frequencies are

When epistatic interaction is sufficiently strong, as given by the condition *E* > 1/2, triplets dominate over single alleles and their doubles, as given by *f*_{1} ≪ *f*_{2} ≪ *f*_{3} (Eq. 9). In this case, we can approximate haplotype frequencies as

Using covariance measure UFE_{ij} defined in Eq. 3, we obtain . We observe that the indirect covariance between sites 1 and 3 is very strong, ~ 1. [In fact, covariance is approximately the same for directly interacting sites 1 and 2 (Fig. 1), for which we previously obtained , see (PEDRUZZI *et al.* 2018), *Supplement*, Eqs. (3.29) and (3.30)].

However, if we calculate UFE_{ij0} instead of UFE_{ij} by including only the sequences with majority allele “0” found at position 2, we obtain
instead of Eq. 9, and Eq. 6 yields UFE_{ij0} = 0. Thus, the phantom covariance disappears when we select only sequences with the majority allele at the intermediate site between the sites of interest. This result is intuitively clear: by the definition of fitness (Eq. 1), only minority alleles “1” interact with each other, while majority alleles are a neutral background. The same method turns out to be extremely effective for eliminating false bonds created by linkage (see *Results*, Fig. 3).

### Sequence preparation

We applied the three-way test to influenza virus. Evolution of the virus in a population has been mapped recently to traveling-wave models (ROUZINE AND ROZHNOVA 2018; YAN *et al.* 2019), which justifies the use of the quasi-equilibrium approximation dependent on the current average fitness (see *Introduction* for the explanation). We performed a multiple progressive alignment of amino acid sequences obtained for two proteins of Influenza A virus strain H1N1, Neuraminidase (NA) and Hemagglutinin (HA), from the public database https://www.fludb.org.

We used all the sequences found in the database for the period 2000-2010, which come from various geographic locations. Our aim was to understand the transition between the protein variants before and after the pandemic of 2009, which have 80% mutual homology. We aimed at discovering only the epistatic sub-network related to that transition and were not interested in any other epistatic interactions in these proteins. To this end, we compared worldwide samples of sequences from the two strains. We randomly sampled similar numbers of sequences from the first and second strains, and re-sampled them several hundred times (*Results*). We also checked the robustness of the results to the exact sample size (see *Results*).

Pairwise distances between sequences were computed using pairwise alignment with the Gonnet scoring matrix implemented in MATLAB. To calculate the guide tree, we used the neighbor-joining method assuming equal variance and independence of evolutionary distance estimates. The obtained consensus served as a universal reference to binarize the data sequences. Before applying the detection algorithm, the protein sequences were binarized by direct comparison of each sequence to the consensus: each amino-acid residue was set to 0 if it matched the consensus and to 1 otherwise. Pooling all mutations per site ignores the chemistry of substitutions. In exchange, this approach greatly reduces the number of haplotype combinations and increases the sensitivity by effectively increasing the haplotype frequencies. Next, we measured the mutational frequency for each sequence, for each site. The subset of low-diversity sequences with allelic frequency below a cut-off *d*_{v} was randomly sampled and down-weighted according to the coefficient *D*_{w}. Then, we determined the average pairwise and triple haplotype frequencies for pairs and triplets of sites, to be used in the method described above.
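The binarization step can be illustrated with a minimal Python sketch. It is not the MATLAB pipeline used here: the function names are hypothetical, the consensus is taken as a simple per-column majority (an assumption), gap characters are treated like any other symbol (also an assumption), and the down-weighting with *d*_{v} and *D*_{w} is not reproduced.

```python
from collections import Counter

def consensus(aligned):
    """Per-column majority residue of equal-length aligned sequences
    (assumed consensus rule; ties are not handled specially)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

def binarize(aligned, ref):
    """Map each residue to 0 if it matches the reference, 1 otherwise."""
    return [tuple(0 if a == r else 1 for a, r in zip(seq, ref))
            for seq in aligned]

def site_frequencies(binary):
    """Fraction of sequences carrying a non-consensus allele at each site."""
    n = len(binary)
    return [sum(col) / n for col in zip(*binary)]

# Toy alignment (hypothetical data, four sequences of four residues)
aligned = ["MKTA", "MKSA", "MRTA", "MKTA"]
ref = consensus(aligned)
binary = binarize(aligned, ref)
freqs = site_frequencies(binary)
```

Pooling every non-consensus residue into the single allele 1 is what keeps the number of haplotypes per site pair at four and per site triplet at eight.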

### 3D structures

To characterize the three-dimensional network of epistatic interactions, we used the ChimeraX software package (https://www.rbvi.ucsf.edu/chimerax/).

## Data availability

Influenza sequence data are from the public database https://www.fludb.org.

## Acknowledgements

We thank Martin Weigt and Alessandra Carbone for useful comments. This research has been funded by Agence Nationale de la Recherche grant J16R389 to IMR, http://www.agence-nationale-recherche.fr/.
