Deciphering the code of viral-host adaptation 1 through maximum entropy models

12 Understanding how the genome of a virus evolves depending on the host it 13 infects is an important question that challenges our knowledge about several 14 mechanisms of host-pathogen interactions, including mutational signatures, 15 innate immunity, and codon optimization. A key facet of this general topic 16 is the study of viral genome evolution after a host-jumping event, a topic 17 which has experienced a surge in interest due to the fight against emerging 18 pathogens such as SARS-CoV-2. In this work, we tackle this question by in-19 troducing a new method to learn Maximum Entropy Nucleotide Bias models 20 (MENB) reflecting single, di-and tri-nucleotide usage, which can be trained 21 from viral sequences that infect a given host. We show that both the viral 22 family and the host leave a fingerprint in nucleotide usages which MENB 23 models decode. When the task is to classify both the host and the viral fam-24 ily for a sequence of unknown viral origin MENB models outperform state 25 of the art methods based on deep neural networks. We further demonstrate 26 the generative properties of the proposed framework, presenting an example 27 where we change the nucleotide composition of the 1918 H1N1 Influenza 28 A sequence without changing its protein sequence, while manipulating the 29 nucleotide usage, by diminishing its CpG content. Finally we consider two 30

troducing a new method to learn Maximum Entropy Nucleotide Bias models (MENB) reflecting single, di-and tri-nucleotide usage, which can be trained from viral sequences that infect a given host.We show that both the viral family and the host leave a fingerprint in nucleotide usages which MENB models decode.When the task is to classify both the host and the viral family for a sequence of unknown viral origin MENB models outperform state of the art methods based on deep neural networks.We further demonstrate the generative properties of the proposed framework, presenting an example where we change the nucleotide composition of the 1918 H1N1 Influenza A sequence without changing its protein sequence, while manipulating the nucleotide usage, by diminishing its CpG content.Finally we consider two well-known cases of zoonotic jumps, for the H1N1 Influenza A and for the SARS-CoV-2 viruses, and show that our method can be used to track the adaptation to the new host and to shed light on the more relevant selective pressures which have acted on motif usage during this process.Our work has wide-ranging applications, including integration into metagenomic studies to identify hosts for diverse viruses, surveillance of emerging pathogens, prediction of synonymous mutations that effect immunogenicity during viral evolution in a new host, and the estimation of putative evolutionary ages for viral sequences in similar scenarios.Additionally, the computational framework introduced here can be used to assist vaccine design by tuning motif usage with fine-grained control.

Introduction
The recent COVID-19 pandemic inspired the scientific community to investigate zoonotic transmission of viruses [Parrish et al., 2008, Andersen et al., 2020] and the subsequent evolutionary dynamics of viral adaptation to a new host.Several experimental [Starr et al., 2020, Moulana et al., 2022] and computational [Rodriguez-Rivas et al., 2022, Tubiana et al., 2022] investigations pointed out the impact of amino-acid mutations in the spike glycoprotein and their effects on its interaction with the human ACE2 receptor, which conferred a fitness advantage and resulted in selective sweeps of new variants [Kang et al., 2021, Lee et al., 2022].
Another fundamental question is identifying Pathogen-Associated Molecular Patterns (PAMPs) in a viral sequence [Akira and Hemmi, 2003] and predicting how the virus changed those patterns to adapt to the human environment and to alter innate immune recognition and response.This topic had been previously explored for the H1N1 strain of the 1918 H1N1 influenza pandemic.In this context it has been shown that the viral genome evolved in a predictable way to lose CpG motifs (a cytosine followed by a guanine in the 5'-to-3' sense) after entering its human host from an avian reservoir [Greenbaum et al., 2008, Greenbaum et al., 2014].This observa-tion, together with the fact that most human-infecting viruses have a low abundance of CpG motifs, was followed by the identification of the CpGdependent receptor specificity of the human Zinc-finger Antiviral Protein (ZAP, coded by ZC3HAV1 gene) [Gao et al., 2002, Takata et al., 2017], implying such approaches can identify recognition sites by host anti-viral restriction factors.Similar analyses for the early evolution of SARS-CoV-2 have been carried out [Di Gioacchino et al., 2021, Kumar et al., 2022], showing a similar pressure to reduce CpG motifs in CpG-rich regions of the viral genome.Finally, understanding and controlling the impact of a foreign RNA sequence on the stimulation of the innate immune response has an important application in DNA and RNA vaccine design in order to avoid over-stimulating the host innate reaction to nucleic acids in the vaccine [Zhang et al., 2023], while also optimizing for features such as codon bias [Pardi et al., 2018].
These questions are facets of the fundamental problem of determining how the interaction of a virus with its host is imprinted upon evolving viral genomes.This topic has been considered in several contexts [Hall et al., 2013, Bloom et al., 2023], demonstrating that viruses of the same family accumulate mutations to use similar nucleotide patterns when they evolve in interaction with a specific host.This idea has been in turn the cornerstone of a fruitful series of works aimed at determining the host of a virus from its genome.Remarkably, it has been shown that methods that do not resort to sequence alignment perform, for this specific task, comparably well with alignment-based methods [Li and Sun, 2018].These methods typically rely on using machine learning based on the frequencies of k-mers (subsequences of length k) up to a given length k max , either alone [Tang et al., 2015, Brierley andFowler, 2021], together with other features such as physicalchemical properties of amino-acids [Young et al., 2020], or using a hybrid method that integrates alignment-based features [Babayan et al., 2018].Recently, techniques based on deep neural networks have been suggested to solve the task of finding the correct host of a given virus, completely bypassing the choice of the features used for a model [Mock et al., 2020].While most of these methods can give remarkable classification performances, there is a pressing need for techniques that are effective at the classification task while remaining at the same time simple to use and interpretable.The latter point is particularly important to increase our molecular understanding of the evolutionary processes that a virus undergoes after an host switch, which can then be targeted by a antiviral therapies during a zoonitic transmission.
In this work, we address all these issues by taking a novel approach: we build a maximum entropy model whose parameters are inferred to cap-ture short-range (up to 3-mers) nucleotide usage patterns in viral genome sequences.Maximum-entropy models have been already used in several contexts, such as for protein sequences [Morcos et al., 2011, Cocco et al., 2018, Mayer et al., 2022], neuronal spiking activity [Tavoni et al., 2017, Ferrari et al., 2017] and social dynamics [Bialek et al., 2012, Chen et al., 2022], demonstrating the effectiveness and flexibility of this approach.In the context of viral evolution and identification of PAMPS in RNA sequences the approach introduced here extends the selective force model previously introduced [Greenbaum et al., 2014, Tanne et al., 2015, Di Gioacchino et al., 2021] which reproduced the motif usage of a particular k-mers only: CpG dinucleotide and other individual motifs.In analogy to k-mer based methods our model does not require any alignment or annotation of the genetic sequence under analysis.We show our technique is simple but extremely effective to tackle the host classification task, resulting in performances comparable with deep neural network models or, in the more challenging setting where no phylogenetic information is available, superior in its discrimination capability.

MENB: a model for host and viral origin classification
Our unsupervised learning model, MENB, infers parameters associated for each k-mer up to k = 3 and defines a probability distribution on viral sequences (of a fixed length), in such a way that the expected k-mer frequencies from this distribution match with those observed in the training data.As shown in Methods Sec.5.1.1,this results in the following probability distribution for a viral sequence s: where S is the set of nucleotides, n m (s) is the number of times the motif m is present in s, and the parameters indicated by f are the "forces" [Greenbaum et al., 2014] to be inferred from the training data.
To train our model we collected viral sequences from the BV-BRC database [Olson et al., 2022], and filtered the data for sequences of three host classes: human, avian and swine viruses.We required at least 150 (different) viral genomes for each host class, and this left us with 4 viral families: Coronaviridae, Flaviviridae, Picornaviridae, and Orthomyxoviridae(focusing on Influenza A alone).We stress that such number of sequences is in principle not necessary to train our models: a single sequence (of sufficient length) is enough, provided that the number of motifs observed in that sequence is representative.To avoid biases in choosing this reference sequence, however, we decided to train the models on sets of 100 sequences (the remaining sequences are used as test set).We then test the model in the task of host classification from a viral sequence.We consider three strategies to assign an host to a given viral sequence.In the simplest one, called "MENB-H", for each host h we grouped together the sequences belonging to different viral families and trained a single MENB model that approximates the probability p(s|h).Given a new sequence s, we can therefore estimate the probability of it coming from host h using Bayes formula p(h|s) ∝ p(s|h) p(h), where p(h) is a prior that we will consider uniform over the host distribution.
To introduce a more complex strategy we start by training a set of MENB models p(s|h, v) at fixed viral family v and host h.As in the previous case, we can then obtain the probability of a sequence to be associated to a hostvirus, (h, v), pair as p(h, v|s) ∝ p(s|h, v) p(h, v).If we know the viral origin (v 0 ) of the test sequence we can limit ourselves to compare models trained for that family on different host, a strategy that we name "MENB-H|V", and by assuming an uniform prior p(h, v 0 ) we obtain p(h|s, v 0 ) ∝ p(s|h, v 0 ).
If, on the contrary, we ignore the viral family of the sequence we can then sum over the different viral families to have a probability a virus is associated with a given host, a strategy that we will call "MENB-H,V".By assuming again a uniform prior we obtain p(h|s) ∝ v p(s|h, v).Remarkably, for all viral genomes analyzed in this work, there is a unique term that contributes much more than all the others to the above summation.Hence we can associate to a viral sequence a specific host as the most likely origin, and likewise guess the viral family from the term that mostly contributes to the probability of that host.
The results of the host classification task on test viral sequences, after having trained the models using the three strategies ("MENB-H", "MENB-H,V", "MENB-H|V") discussed are displayed in in Fig. 1A.We first notice that the viral agnostic models, MENB-H, has a low accuracy: the accuracy averaged over the viral families is about 51%, blue dashed line), only marginally better than random guessing (33%, black dashed line), with performances comparable to random guessing for Coronaviridae and Orthomyxoviridae.Similar results have been observed elsewhere [Mock et al., 2020].
A possible explanation for the failure of this viral-agnostic host inference strategy is that viral genomes are highly constrained (for instance, they need to code for multiple, sometimes overlapping protein sequences while interacting with viral proteins for encapsulation), hence not free to evolve to change their nucleotide usage in a way that depends uniquely upon the host.
Such explanation is confirmed by the improved performances obtained when learning viral-families dependent models for each hosts ("MENB-V,H"), and marginalizing over viral families to find the most probable host."MENB-V,H" gives an average performance in classification of (85%, orange bars in Fig. 1A ).Moreover when comparing (Fig. 1B) the values of v that give the largest contribution to the sum with the real viral families.We find an average accuracy of about 97%, confirming that the "MENB-V,H" strategy is able to predict, with a very good accuracy, both the host and the viral family of a new sequence.

Comparison of MENB with other approaches
Given the performance of MENB models for the host classification task, a natural question is how it compares with other state-of-the-art approaches.
To answer this, we considered VIDHOP [Mock et al., 2020], a deep-neural network designed specifically for this task which can be obtained from a public code repository re-trained by any user.The authors in [Mock et al., 2020] noticed that their algorithm could not generalize to viruses of different families, so they designed VIDHOP to work at fixed viral family.As we demonstrated, MENB can in principle work without information about the viral family of the target sequence, but to make the comparison fairer we modified our approach to use MENB models to assess the host of viral sequences at fixed viral family: we considered as hosts directly the arg max h p(h, v|s), where the correct viral family v is used instead of summing on all possible families.We retrained VIDHOP and MENB on the same sequences, and compared their performances.As expected from the higher complexity (in terms of number of learnable parameters) of VIDHOP, its performances are better than MENB in most cases and in particular for Coronaviridae, while being very similar for Orthomyxoviridae, as shown in Fig. 1A (green and red bars).On the other hand, VIDHOP requires many more resources (in terms of time and computational power) with respect to MENB (for instance, for each viral family VIDHOP requires about 1 hour on a 56-core CPU, while MENB requires less than 5 minutes on 3 cores).
We then wanted to confirm that the host classification results we obtained with MENB models are actually related to viral adaptation to their hosts, and not caused by spurious effects such as phylogenetic correlations that lead to strong similarity of sequences in the training and test set.
We therefore designed a more difficult classification task based on out-of- In general, the performance of MENB models derive from the differences between the probability distributions over viral sequences that each model learns.In Fig. 2 we show the symmetrized Kullback-Leibler (KL) divergence (for a definition, see Methods Sec.5.1.3)between each pair of distributions.
Remarkably, models trained on viruses infecting the same host encode far more different probability distributions than models trained on viruses of the same family, suggesting that the nucleotide usage is more driven by phylogenetic correlations than by host adaptation.This is compatible with the much greater performances of the MENB models in discriminating viral families rather than hosts, and the smaller divergences within viral families ultimately justify the choice of using the "MENB-H,V" strategy.Moreover, we notice that Orthomyxoviridae viruses have smaller differences between hosts with respect to other viral families, probably because of their tendency to commonly undergo reassortments with segments of viruses adapted to different hosts.

Generative power of MENB models
In Fig. 3 we focus on the human Orthomyxoviridae viral sequences and we show that the MENB model reproduces, as expected, the 1-, 2-and 3-mer statistics of the training set (Suppl.Fig. 5).Moreover, it generalizes to new sequences in the test set, which are not used for the training, when full genomes are used (Fig. 3A).These performances are only slightly degraded when a fraction of each genome is used in the training test and the test set contains new sequences and the unseen part of the genome (Fig. 3C), further showing how nucleotide usage biases encompass the full viral sequences and can be learned from a fraction of them.
The MENB models are trained to reproduce the frequency of 1-, 2-, and 3-mers observed in the training dataset; we next investigated how well these models reproduce higher order statistics.To do so, we sampled sequences from the probability distribution encoded by MENB models (using a standard Metropolis-Hastings algorithm) and compared to the 4-mer frequencies observed in these sampled sequences with those of the training dataset.In In Fig. 4 we further show how we can leverage MENB models together with the Metropolis-Hastings sampling algorithm to change the nucleotide usage of a protein-coding sequence, while keeping fixed its amino-acid sequence.As an illustration, we considered the PB2 coding region of the 1918 H1N1 strain and wanted to reduce its number of PAMP associated CpG motifs [Greenbaum et al., 2008, Greenbaum et al., 2014].
We 2.4 Viruses adapt to their host after hosts jumps: Applications to H1N1 influenza and SARS-CoV-2 We demonstrated that our model can infer the host of viral sequences from their nucleotide statistics alone.Here we show the model describes the evolution of a viral strain after a host jump.We start with the case of 1918 H1N1 influenza pandemics: we collected all PB2 segments available in our dataset associated to the H1N1 strain up to 2008.It is commonly accepted that the pandemics originated with a jump from avian to human hosts [Taubenberger et al., 2005].To compare the two hosts we will use in our analysis the human and the avian model trained, for each host, on all the segments of influenza viruses excluding PB2 to avoid potential overfitting.Before assigning sequences to their host, we built a phylogenetic tree, on a random subsample of up to 20 sequences per year, using Nextstrain Quite remarkably, the log-probability score introduced here works as a sort of "molecular clock", by steadily increasing as the virus adapts to the new host.Similar results are obtained also by a simple model only reflecting the nucleotide usage or also including the CpG forces [Greenbaum et al., 2014] (Suppl.Fig. 4), although in these cases the difference of log-probability between the two models is less pronounced, confirming that host adaptation takes place at different order on motif's usage.
As a final application of our MENB models, we turned to the SARS-CoV-2 virus.We wanted to check if we can see hints of host adaptation as for the 1918 H1N1 virus.This case is different from H1N1 as the original host of SARS-CoV-2 is currently unknown and subject of scientific debate [Andersen et al., 2020]; we have therefore assumed that the origi-nal Wuhan sequence is representative of the (unknown) previous host and build its MEMB model from this unique sequence, while building the model for SARS-Cov-2 in human host from the sequences collected during the recent pandemic waves and collected in Nextstrain [Hadfield et al., 2018].We stress that although in principle our method could be used to investigate the most likely origin of SARS-CoV-2, this would require Coronaviridae data of other species (such as pangolins and bats), but current data is biased towards sequences similar to the human SARS-CoV-2 and hence not representative of the original host.
The log-probability difference between the two models is plotted in Fig. 6 as a function of time for the first 1100 days from the start of the 2020 pandemic.It shows a slow but steady adaptation to human nucleotide usage (black line, whose slope is significantly different from 0 with a p-value of 10 −9 ).Quite surprisingly, the slope of the fitting line is larger for sequences collected in the last 6 months (data downloaded on June 30th, 2023), suggesting an increase of the adaptation speed in the Omicron 23A variant that appeared in early January 2023 and rapidly took over the entire SARS-CoV-2 global population.In the above analysis we have taken into account a number of limitations and delicate points that we discuss here.First, the SARS-CoV-2 sequence data is heavily biased, both geographically (a large fraction of the sequences are collected in a small number of countries) and temporarily (the rate of sequence collection increased steadily in the first months of the pandemics).Second, as discussed above, we have used the single Wuhan sequence to infer the model for the unknown virus transmitting host.Third, the time over which the adaptation to the human host has been sampled is much smaller than that of the H1N1 strain, on such a short time scale adaption driven by non-synonymous mutations with clear fitness advantages could result in a confounding signal.
To address the first issue we used a curated dataset of sequences collected by Nextstrain [Hadfield et al., 2018] to build the model of SARS-CoV-2 in the human host: in this dataset sequences are subsampled to reduce biases from different geographic regions and time periods, and most of the sequences are collected in the last 6 months.As for the second issue, although a MENB model can be trained with a single sequence, in this case the motif frequencies are less representative of the virus-host pair under analysis, so additional caution must be used in this case in interpreting the results obtained.Indeed, by construction the initial log-likelihood associated to the "Wuhan" host will be higher than the one for human Coronaviridae.Moreover the "Wuhan" host is likely not to be the human [Andersen et al., 2020], but using specific viral sequences that has been collected in bats or pangolins (that have been suggested to be the reservoir of SARS-CoV-2 ancestors) due to their similarity with the Wuhan sequence would give very similar results.Regarding the third problem outlined before, there is no way to deal with it other than collecting sequences for longer times, but the questions of whether some early signals of host adaptation can be spotted with the genomes observed so far is still well-posed.).To ease the visualization of the increasing trend of the score difference in the last 180 days, daily averages of the score differences are plotted as orange points.

MENB models' parameters reflect biologically-relevant features
The MEMB models offer the advantage to have a relatively low number of learnable parameters and that each of them is related to the usage of the corresponding motif.Such models are therefore ideal candidates for interpretation, that in turn can be useful to accumulate insight into potential roles in molecular biology of motifs, for instance associated to the recognition by the host innate immune system.
To showcase this we considered two models trained on the PB2 seg-

Discussion
We demonstrate our maximum-entropy approach can successfully be used properties.This can be of direct applicability in practical cases, such as when a new viral subfamily is discovered which possesses a genome different enough from those used in the training set.This scenario is likely to become more and more relevant in the near future, as new viral sequences continue to be discovered [Tisza et al., 2020, Edgar et al., 2022].
Our framework can predict the viral genome evolution in a new host, as the log-probability difference in the new host with respect to the previous host increases in time and measures how well the sequence has adapted to its new host environment.This is clearly shown for the H1N1 Influenza for for which we have 100 years of sampled sequences at our disposal; we see a similar trend of host adaptation for the SARS-CoV-2 pandemics as well, which hass accelerated with the expansion of new variants [Di Gioacchino et al., 2021, Kumar et al., 2022].
An important open question is whether the adaptation to the host that we observe directly provides a fitness advantage to the viruses, or if it is a  Finally, thanks to their generative properties underlined here, MENB models are ideal candidate for the optimization in RNA vaccine design for efficiency and minimizing rejection due to immunogenicity [Pardi et al., 2018].By 5 Methods

The maximum entropy nucleotide bias model
In this section, we will first give a maximum-entropy derivation of the MENB model as given in Eq. (2.1).This will clarify why some of the parameters can be arbitrarily fixed as they are redundant (gauge choice) and we will discuss the specific choices in this regard made here.Finally we will describe how all the computations involving the MENB model used in this paper can be performed exactly and efficiently building on classical statistical-physics methods.

Maximum entropy justification
Consider an set of sequences observed (data), we want to find a probability distribution on the sequence space (model) such that: (i) the observed frequencies of nucleotides, 2-mers and 3-mers in the data match those ex- (1) ab and f (3) abc .Here ⟨f (s)⟩ = s p(s)f (s), and quantities with the obs superscript are averages computed on the data sequences.By taking the functional derivative with respect p(s), we obtain the functional form given in Eq. (2.1), where the Lagrange multipliers, that we also call force parameters, need to be fixed so that the observed frequencies of nucleotides, 2-mers and 3-mers in the data match those expected by sampling sequences according to the model.Follwing [Greenbaum et al., 2014], this parameter inference can be performed by computing the partition function that normalizes the probability distribution in Eq. (2.1) and using it to estimate the quantities ⟨n a (s)⟩, ⟨n ab (s)⟩, ⟨n abc (s)⟩.Finally, a root-finding algorithm such as the Newton-Raphson method can be used to find the correct values for the parameters.Optionally the observed quantities n obs a , n obs ab and n obs abc can be regularized by adding pseudocounts to avoid parameter divergences or to give less weight to the sequence details during the inference.

Gauge choices for MENB model
The MENB model specifies a probability distribution over sequences of length L. As such, any change of parameters that does not change the probability of any sequence does not have any observable effect and it is called a gauge degree of freedom.For instance, we can send f (1) a + K and, for any value of K, this modification does not impact the probability of any sequence as it can be readily showed using the fact that a∈S n a (s) = L.
As a consequence, we are free to choose a value for K so that, for instance, f T = 0, or so that a∈S f (1) The presence of gauge degrees of freedom stems from the fact that there are many ways of choosing the 84 force parameters in Eq. (2.1) so that the observed frequencies of nucleotides, 2-mers and 3-mers in the data match those expected from to the model.Indeed, although this requirement can be written as a set of 84 equations, some of them are not independent because of the following considerations: where the symbol ≃ means that the condition is respected in the large-L limit, which is the relevant case for all sequences considered in this work.
This set of equations can be used to fix the gauge degrees of freedom ("choose the gauge"), and we do so in this work by choosing a gauge where the maximum number of parameters is set to zero, that we call lattice-gas gauge (with a slight abuse of notation), or by choosing a gauge where there is no arbitrary symmetry breaking among the model parameters, that we call zero-sum gauge.
For the lattice-gas gauge, we decide to set to zero all forces of the form T x ∀x ∈ S, f xT ∀x ∈ S, f T T T , f T T x ∀x ∈ S, f T xT ∀x ∈ S, f xT T ∀x ∈ S, f T xy ∀x, y ∈ S, f xyT ∀x, y ∈ S. Therefore non-zero T -containing forces only have the form h xT y with x, y ∈ S.This means that the effective number of free parameters to be inferred goes from 84 to 48.
The lattice-gas gauge is particularly useful to speed-up the inference process and to avoid the Newton-Raphson method to fail to converge due to flat directions in the parameter space, but it is not practical when looking at the inferred parameters to interpret them.For this reason after inference we use the zero-sum gauge, that is defined by the following set of equations (2) can be computed exactly in a time that scales linearly with the length of the sequence L using the so-called transfer matrix method, well-known in statistical physics.This method has been already described for a similar problem in [Greenbaum et al., 2014] (Supporting Information), and the only difference in this case is that the matrices also contain a term that accounts for the 3-body interaction.
Once the partition function of a MENB model is computed, we have immediate access to a wealth of relevant quantities.In particular, we can compute the expected number of ℓ-mers M as which is the main quantity used to produce Fig. 6B.
Another relevant quantity is the Kullback-Leibler divergence between two models, p 1 and p 2 .It can be written as (6) log Z 1 and log Z 2 can be computed exactly with the transfer matrix method, and to compute the last term on the r.h.s. of Eq. ( 6) we define and we have From the KL divergence we can compute the attributions showed in Fig. 6C.
Following [Sundararajan et al., 2017], we consider two MENB models defined by the force parameters f 1 and f 2 .We will use the notation to denote the KL divergence between the models with parameter f 1 and f 2 .
Thanks to the fundamental theorem of calculus for line integrals, and using The individual terms of the sum in this equations are the attribution plotted, after rescaling for the total KL divergence, in Fig. 6C and Suppl.Fig. 3.As a final remark, we notice that the attributions depends on the gauge used.
In this work we always computed attributions in the zero-sum gauge, and we observe that if the parameters f 1 and f 2 are from models in the zero-sum gauge, then Eqs.(4) still hold for f 1 + t(f 1 − f 2 ), and so the path of models used in Eq. ( 9) preserve the zero-sum gauge.

Data and code availability
All sequence data has been collected from the BV-BRC database [Olson et al., 2022].
After discarding short viral sequences (length lower than 1000 bases), we selected the pairs of host and viral family so that each viral family has at least 100 sequences annotated with each host chosen.We discarded Influenza A sequences collected after 2009 as the database is dominated by strains of the H1N1 "swine flu", whose triple-reassortment origin [Garten et al., 2009] and (likely) not perfect adaptation to humans is a confounding factor during training.The resulting dataset, that is the starting point for all the results presented here, is available at https://zenodo.org/doi/10.5281/zenodo.10050076.The SARS-CoV-2 data used for Fig. 5B can be downloaded at https://nextstrain.org.Notice, however, that this data is often updated as it is focused on the last 6 months.To allow exact reproducibility of our results we uploaded the data we used (downloaded on June 30th 2023) together with the sequence data at https://zenodo.org/doi/10.5281/zenodo.10050076.
The code to infer models is written in Julia and publicly available in the GitHub repository at https://github.com/adigioacchino/MaxEntNucleotideBiases.jl.
We trained our MENB models on 100 viral sequences randomly selected from our dataset for each pair of host and viral family.We replicated this sub-sample three times, observing quite small quantitative differences based on the sequence choice (see error bars in Fig. 1).For the comparison with VIDHOP presented in Fig. 1A, C

Figure 1 :Figure 2 :
Figure1: MENB models can predict host and viral family of viral genomes.A: accuracy of MENB models trained on all viruses with the same host (blue bars) and on all virus-host pairs (orange bars) on the host classification task on the test set of viral genomes; green bars are obtained using the same models used for the orange bar, but using only the correct viral family of the target viral genome, and red bars are the accuracy of the host classification task in this same setting with the VIDHOP algorithm.B: accuracy of MENB models trained on all virus-host pairs in determining the correct viral family for the target test genome as the one that mostly contributes in the host classification.C: same as A, but the training is done on the first half of the genome (for Coronaviridae, Flaviviridae, Picornaviridae) or on all segments but PB2 (for Orthomyxoviridae), and the test is done on the remaining part of the sequence.D: same as B, with the same task as described in C.

Fig. 3B ,
Fig. 3B,D we show that MENB model almost perfectly capture the 4-mer statistics.

Figure 3 :Figure 4 :
Figure 3: MENB models generalize well to test sequences and higher-order motifs.A: Frequency of nucleotides, 2-mers and 3-mers observed in the test set of full human Orthomyxoviridae sequences versus the value obtained analytically from the inferred MENB model.B: Same as A for the MENB model trained on human Orthomyxoviridae sequences without the segment coding for PB2.C: Frequency of 4-mers observed in the training set of full human Orthomyxoviridae sequences versus the value obtained from sequenced sampled from the inferred MENB model.D: Same as C for the MENB model trained on human Orthomyxoviridae sequences without the segment coding for PB2.

[
Fig.1from the sequences sampled over time but also from the reconstructed roots along the phylogenetic tree.We observe that the maximum-entropy model is misled in the assessment of the host of the 1918 PB2 segments (left side of Fig.5A), which is wrongly classified as an avian virus, while being sampled in humans.This mislcassification is a clear signature of the host jump which had just occurred and originated the 1918 pandemic.The classification changes with time: as the virus evolve in contact with the human host, the model assigns to it higher log-probability differences, giving equal scores to human and avian origin around 1950.For more recent samples the model is more and more confident about the human classification.

Figure 5 :
Figure5: MENB models can be used to quantify host adaptation dynamics after host jumps.A: Scatter plot of loglikelihood differences of the MENB Orthomyxoviridae human and avian models versus time of H1N1 Influenza A sequences.The colored lines are the reconstructed paths of the inferred phylogenetic tree that connect the root to each leaf (observed sequence), and the score versus inferred time is plotted also for the internal node (inferred) sequences.B: Scatter plot of loglikelihood differences of the MENB Coronaviridae model versus a MENB model trained on the original Wuhan SARS-CoV-2 sequence versus time from December 26th 2019.The black line is a linear fit on the first 1089 days (slope: 9 • 10 −5 , p-value: 10 −9 ), the orange line is a linear fit of the last 180 days (slope: 3.5 • 10 −4 , p-value: 10 −7 ).To ease the visualization of the increasing trend of the score difference in the last 180 days, daily averages of the score differences are plotted as orange points.
ments of Orthomyxoviridae viruses: one ("H1N1 1918") has been trained on the sequence collected in 1918, the other ("H1N1 2007") has been trained on 26 sequences collected in 2007.In Fig.6Awe show the entire parameter profile of the two models.Parameters different form zero reflect the presence of selective forces which push up or down the number of the corresponding motif with respect to sequences generated uniformly at random.Considering the fact that the 1918 strain was likely of avian origin, the first interesting remark is an overall similarly of the force profile in the two cases, especially for nucleotides and dimers, which indicate that many of the force parameters did not significantly change during the adaptation to the human host.The two dinucleotides with the largest negative forces are the CpG, reflecting the well-known avoidance of CpG motifs, followed by UpA, another known avoided motif that is supposed to have a role in codon efficiency[Tulloch et al., 2014, Atkinson et al., 2014].Moreover, the force in UpG motif is large and positive, likely due to the C>U and A>G mutational processes on, respectively, CpG and UpA motifs.This observation points out an important concept that is commonly overlooked in k-mer analyses of genetic sequences: the lack of one or more motif is necessarily compensated by an increase in abundance of other motifs, and vice-versa.In our framework this is deeply connected to the gauge choices that have to be taken due to conservation of probabilities at single, di and tri-nucleotide levels and are discussed in more details in Methods Sec.5.1.2.The differences in the parameter profile of In Fig.6Adisclose the selective pressures on the nucleotide biases, dimers and trimers driving the evolution of the viral sequence in the adaptation to the new host.The most striking differences between the 1918 and the 2007 viruses are the further decreases in the CpG force, as well as CGU motifs decrease, from a value around zero in 1918 to a negative value in 2007.An opposite evolution is observed for the GpG force increasing from zero to a positive value and for the CGG force which relaxes from a negative value toward zero in the 2007 (see also Supp.

Fig. 9 )
Fig.9).The decrease in CpG forces confirms previous findings and what obtained with a simpler model containing only the CpG force, moreover the different behavior for the tri-nucleotide mirrors the context dependence of the CpG loss[Greenbaum et al., 2008, Greenbaum et al., 2014].A more rigorous way to study the evolution of the forces is to find the key parameters to discriminate the models inferred from the 1918 and 2007 sequences.This problem can be addressed within the framework of integrated gradients[Sundararajan et al., 2017]:We compute the symmetrized KL divergence between the two MENB models as the sum of attributions, i. e. integrated gradients with respect to each parameter (more details about the procedure are given in Methods Sec.5.1.3;see Suppl.Fig. 7 for the comparison of symmetrized versus non-symmetrized KL divergences).In Fig. 6B we show the values for the top-20 attributions to the symmetryzed KL divergence: consistently with the forces differences, we find that the largest attribution is on CpG dinucleotide, and several 3-nucleotides motifs containng CpG (CGA, CGU, CGG, CCG) are present.The GpG and GpA and UpA dinucleotides and several related trinucleotides (TGG, GGC, GGC, CGG, GAG, TAC, TAG) have a large attribution too.Once the inference of parameters is performed we can analytically compute the expected number of 1-, 2-, and 3-nucleotide motifs in a viral sequence according to the MENB models (see Methods Sec.5.1.3),which (as shown in Fig.5) should reproduce, by model construction, the motif frequencies in the data, as previously shown in Fig.8A.It is interesting to compare the force attributions in flu evolution to the relative difference in motif frequency Fig.6Cas, due to network effects, they are only marginally related.Nucleotide or dinucleotide usage can, for instance, be driven also by the di-nucleotide and tri-nucleotide forces.In agreement with the force attributions, the CpG dinucletide shows, among all dinucleotides resulting in human-adapted H1N1 strains, the largest relative decrease in 2007 with respect to 1918.Moreover we observe more UA and AA nucleotides with respect to the 1918 strain.As for 3-mers, the signal is dominated by decrease in usage of specific CpG-containing motifs, although for instance an increase of TAC motifs is observed (Fig.6B).It is important to notice that relative changes of 3-mers cannot be compared immediately with those of 2-mers, due to the fact that there are 64 different 3-mers and 16 2-mers and so individual 3-mers are in general rarer than individual 2-mers and largest changes are to be expected.We next discuss the force comparison in the context of virus and host classification, from MENB models inferred from the ensemble of sequences for a fixed viral family and host, to bring out similarities and difference in to predict from a sequence its viral origin and host based on conditional probabilities and Bayes rule.Consistently with some recent empirical observations[Mock et al., 2020], we show viral sequences adapt to the host nucleotide usage under specific viral-family depending constraints.In the host-classification task, our interpretable MENB algorithm has competitive performance with state-of-the-art approaches based on deep neural networks, despite being far simpler in terms of number of learnable parameters.As expected by classical bias-variance trade-off considerations[Posani et al., 2022], our methods is less subject to the specific details of the training data, and shows remarkable out-of-distribution generalization

Figure 6 :
Figure6: The learned parameters of MENB models can be directly visualized and interpreted.A: Plot of each of the 84 parameters (forces) learned by MENB models trained on all segments but PB2 of H1N1 Influenza A strains collected in 2007 (blue) and of the 1918 strain (orange).B: Attributions computed with the method of integrated gradients (Methods Sec.5.1.3)for the symmetrized Kullback-Leibler divergence between the MENB models used in panel A. To allow for an easier visualization only the 20 parameters with the highest contribution (in absolute value) to the symmetrized KL divergence are shown.Light blue bars denote negative attributions.C: Relative difference in expected motif frequencies between the MENB models used in panel A (Methods Sec.5.1.3).Only the 5 top differences (in absolute value) are plotted for 2-mers and 3-mers.Blue (orange) bars correspond to positive (negative) differences.
preditcing how viruses adapt to their new host we can better understand mechanisms that drive their adaptation and design intervention This work was supported by grant ANR-19 Decrypted CE30-0021-01, and by the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 101026293.
pected by sampling sequences according to the model, and (ii) the entropy − s p(s) log p(s) is maximized.Therefore the MENB model probability distribution maximizes the following quantity s) and the Lagrange multipliers f a∈S n a (s) = L ab∈S n ab (s) ≃ L a∈S n ax (s) ≃ n x , a∈S n xa (s) ≃ n x ∀x ∈ S abc∈S n abc (s) ≃ L ab∈S n abx (s) ≃ n x , ab∈S n axb (s) ≃ n x ab∈S n xab (s) ≃ n x ∀x ∈ S a∈S n xya (s) ≃ n xy , a∈S n axy (s) ≃ n xy ∀x, y ∈ S(3) of the partition function and related quantities An remarkable characteristic of the MENB model is that the partition function Z given in Eq.
Relative difference in motif usage shown for each pair of hosts at given viral family.Blue bars correspond to increases in motif usage, and orange bars to decreases.Only the 3 highest differences (in absolute value) are shown for nucleotides, 2-mers and 3-mers.Attribution to symmetrized KL divergence shown for each pair of hosts at given viral family.Blue bars correspond to positive attributions, and orange bars to negative attributions.Only the 10 highest attributions (in absolute value) are shown. .5: A: Frequency of nucleotides, 2-mers and 3-mers observed in the training set of full human Orthomyxoviridaesequences versus the value obtained analytically from the inferred MENB model.B: Same as A for the MENB model trained on human Orthomyxoviridaesequences without the segment coding for PB2. . .6: A: Plot of each of the 84 parameters (forces) learned by MENB models trained on the SARS-CoV-2 sequence collected in Wuhan in December 2019 (blue) and on sequences collected in June 2023 (orange).B:Relative difference in expected motif frequencies between the MENB models used in panel A (Methods Sec.5.1.3).Only the 5 top differences (in absolute value) are plotted for 2-mers and 3-mers.Blue (orange) bars correspond to positive (negative) differences.C: Attributions computed with the method of integrated gradients (Methods Sec.5.1.3)for the symmetrized Kullback-Leibler divergence between the MENB models used in panel A. To allow for an easier visualization only the 20 parameters with the highest contribution (in absolute value) to the symmetrized KL divergence are shown.Orange bars denote negative attributions.Comparison between attribution to the symmetrized KL divergence between Orthomyxoviridae human and avian viruses (panel A), and the two non-symmetrized KL divergences that compose it (panels B, C).
we used exactly the same sequences.A Snakemake pipeline to train and test the MENB models on the data used here is available at https://github.com/adigioacchino/MENB_snakemake.
Suppl.Fig. 8: Comparison between the number of motif observed in the 1918 H1N1 PB2 sequence and in PB2-coding sequence synthetically evolved to reduce their CpG number.A value of 1 means no change in motif abundance.CpG-containig motifs are highlighted with orange lines.Comparison between the forces inferred on the 1918 and in 2007 H1N1 sequences.Blue/orange bars correspond to increased/decreased forces of 2007 sequences with respect to the 1918 sequence.