Recurrent mutations in SARS-CoV-2 genomes isolated from mink point to rapid host-adaptation

Severe acute respiratory coronavirus 2 (SARS-CoV-2), the agent of the ongoing COVID-19 pandemic, jumped into humans from an unknown animal reservoir in late 2019. In line with other coronaviruses, SARS-CoV-2 has the potential to infect a broad range of hosts. SARS-CoV-2 genomes have now been isolated from cats, dogs, lions, tigers and minks. SARS-CoV-2 seems to transmit particularly well in mink farms with outbreaks reported in Spain, Sweden, the Netherlands, Italy, the USA and Denmark. Genomic data from SARS-CoV-2 isolated from infected minks provides a natural case study of a secondary host jump of the virus, in this case from humans to animals, and occasionally back again. We screened published SARS-CoV-2 genomes isolated from minks for the presence of recurrent mutations common in mink but infrequent in SARS-CoV-2 genomes isolated from human infections. We identify 23 recurrent mutations including three nonsynonymous mutations in the Receptor Binding Domain of the SARS-CoV-2 spike protein that independently emerged at least four times but are only rarely observed in human lineages. The repeat emergence of mutations across phylogenetically distinct lineages of the virus isolated from minks points to ongoing adaptation of SARS-CoV-2 to a new host. The rapid acquisition and spread of SARS-CoV-2 mutations in minks suggests that if a similar phenomenon of host adaptation had occurred upon its jump into humans, those human-specific mutations would likely have reached fixation already before the first SARS-CoV-2 genomes were generated. Data SummaryAll genome assemblies considered in this manuscript are openly available on registration with GISAID (https://www.gisaid.org). Information on the included assemblies, including the accessions used in the global analysis are provided in Tables S1-S2.

Upon an initial successful host jump, the ability of a pathogen to spread within its novel host population depends on its ability to transmit effectively between individuals. The four endemic HCoVs transmit readily in humans and are highly contagious particularly in children 18,19 .
SARS-CoV-1 was probably less transmissible than SARS-CoV-2 20 , which likely contributed to the 2003 SARS outbreak being comparably easy to control. MERS-CoV transmits poorly between humans and the repeat outbreaks it causes are primarily due to spillovers into humans from a reservoir it established in camelids in the Arabic Peninsula 21,22 . Finally, SARS-CoV-2 is highly transmissible in humans with an unusually high basic reproductive numbers (R0) for a respiratory virus having been reported in several setting 23 .
Transmissibility of a pathogen is expected to vary in different hosts and may often be limited in the early stages after the host jump. Such initial maladaptation may lead to a phase of intense natural selection when a pathogen acquires key mutations that maximize its transmission ability in the novel host. Efforts to identify mutations associated to SARS-CoV-2's transmissibility failed to identify obvious candidates for adaptation to its human host 24,25 . The only possible exception is the spike D614G mutation, whose role as a driver of transmission, or a neutral genetic marker of the otherwise successful SARS-CoV-2 B.1 lineage remains debated 24,[26][27][28] .
One plausible reason for the lack of mutations unambiguously associated to transmission in SARS-CoV-2 is that they may have emerged and spread before the virus was identified. Indeed, if by the time the earliest genomes were being sequenced, all key mutations associated to increased transmission in humans had already reached fixation, it would be impossible to identify those solely by analyzing SARS-CoV-2 genomes in circulation in humans.
The generation and release of genomic data from SARS-CoV-2 outbreaks in farmed European (Mustela lutreola) and American minks (Neovison vison) offer opportunities to identify mutations emerging and spreading in mink farms following the host jump(s). The conditions in intensive farming settings may provide highly suitable environments for small fitness selective differentials between strains to overcome the effect of genetic drift, allowing for natural selection to further increase transmissibility. Particularly promising candidates for host adaptation are those mutations occurring recurrently in different mink SARS-CoV-2 lineages, and that are uncommon in human viruses. The secondary host jump from humans into minks offers a glimpse into the window of early viral host adaptation of SARS-CoV-2 to a new host that has likely been missed at the start of the COVID-19 pandemic.
We screened publicly available SARS-CoV-2 genomes isolated from minks for the presence of recurrent mutations. We identified seven nonsynonymous mutations independently arising at least three times as plausible candidates for adaptation to transmission in minks. All seven candidate mutations are at low frequency in large repositories of SARS-CoV-2 strains circulating in humans. We note in particular three recurrent nonsynonymous mutations in the receptor-binding domain of the SARS-CoV-2 spike protein (S-protein), the region essential for binding to host Angiotensin-converting enzyme 2 (ACE2) receptors allowing cell entry 46,47 .
We computationally predict the role of candidates in this region on modulating the binding stability of the human and mink spike protein ACE2 complex. Our results highlight the rapid and repeat emergence of mutations in the spike protein, across phylogenetically distinct lineages of the virus isolated from minks in Denmark and the Netherlands, and point to rapid adaptation of SARS-CoV-2 to a new host.

Alignment of mink SARS-CoV-2
All publicly available genome assemblies of SARS-CoV-2 isolated from minks were downloaded from both GISAID 48,49 (https://www.gisaid.org) and NCBI (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/ ) as of 14 th October 2020, which included published genomic data associated to two studies of SARS-CoV-2 in minks in the Netherlands 37,38 . A full list of the 239 considered accessions is provided in Table S1 which includes 227 genome assemblies sampled in the Netherlands and 12 genome assemblies sampled in Denmark covering the period of the 24 th of April 2020 to 16 th of September 2020.
Some of the genomes were flagged as 'low quality' (commented in Table S1) but were included in the initial analaysis. All genome assemblies were profile aligned to the SARS- was retained to root the tree. A maximum likelihood phylogenetic tree was built across the alignment ( Figure S1) using RaxML 52 run via the Augur (https://github.com/nextstrain/augur) tree implementation (Figure 1). We informally estimated the substitution rate over the mink SARS-CoV-2 alignment by computing the root-to-tip temporal regression implemented in BactDating v1.0.1 53 ( Figure S2).

Global context of mink SARS-CoV-2 infections
To place SARS-CoV-2 genomes isolated from minks into a global context we downloaded  55 and were subsequently removed, together with two low-quality mink SARS-CoV-2 assemblies which were also considered phylogenetic outliers (EPI_ISL_577816 and EPI_ISL_577819). Trees were queried and plotted using the R packages Ape v5.4 56 and ggtree v1.16.6 57 (see Figure 1).

Mutation analysis
All variable positions in the coding regions of the genome were identified and annotated for synonymous or nonsynonymous status (Figure S3-S4). This was done by retrieving the amino acid changes corresponding to all SNPs at these positions using a custom Biopython (v1.76) script (https://github.com/cednotsed/mink_homoplasies/blob/main/dnds/snp_to_sav_parser.py) with annotations reported in Table S3. The Orf coordinates used (including the Orf1ab ribosomal frameshift site) were obtained from the associated metadata for Wuhan-Hu-1 (NC_045512.2).
The frequencies of each type of mutation in the human and mink SARS-CoV-2 alignments were calculated using the base.freq function from the R package Ape v5.4 56 ( Figure S5). We tested whether the mutational frequencies in the human and mink genomes differed using a Monte Carlo simulation of the χ2 statistic with fixed margins (2000 iterations) 58,59 . This was implemented using the chisq.test function in R with the simulate.p.value flag. CpG dinucleotide frequencies for both SARS-CoV-2 alignments of genomes isolated from human and mink were also calculated using a custom R script (https://github.com/cednotsed/mink_homoplasies/blob/main/CpG/plot_CpG.R) ( Figure S6).
A Wilcoxon rank sum test implemented using the wilcox.test function in R was used to test if the distribution of CpG dinucleotide frequencies differed between the two datasets.

Identification of recurrent mutations
We screened for the presence of recurrent mutations in the mink SARS-CoV-2 masked alignment using HomoplasyFinder v0.0.0.9 60 , as described in our previous work 61,62 .
HomoplasyFinder employs the method first described by Fitch 63 , providing, for each site, the site specific consistency index and the minimum number of independent emergences in the phylogenetic tree. All nucleotide positions with a consistency index <0.5 are considered homoplasic (Figure 2, Figure S7, Table S4). Sites identified as homoplasic were screened against the global dataset of 54,793 human SARS-CoV-2 ( Figure S8). In addition amino acid replacements were identified in human samples in the CoV-GLUE 64 repository (accessed 16 th November 2020, last update from GISAID 9 th November 2020), which provides frequently updated screens against all assemblies shared to GISAID 48,49 (Figure 2, Table S4).

Structural data used for the analysis
The structure of the SARS-CoV-2 spike protein, reference strain, bound to human ACE2 has been solved at 2.45Å resolution 65 (PDB ID 6M0J). We visualised this structure using PyMOL v2.4.1 66 . We used this as the template to model the structures of the American mink (Neovision vision) ACE2 bound to the SARS-CoV-2 reference (Wuhan-Hu-1) and the mink ACE2 protein bound to versions of SARS-CoV-2 mutated for three candidate sites in the RBD. We also built a model of the human ACE2 bound to the mutated SARS-CoV-2.
We generated query-template alignments using HH-suite 67 and predicted 3D models using MODELLER v.9.24 68 . We used the 'very_slow' schedule for model refinement to optimise the geometry of the complex and interface. We generated 10 models for each S-protein:ACE2 complex and selected the model with the lowest nDOPE 69 score, which reflects the quality of the model. Positive scores are likely to be poor models, while scores lower than -1 are likely to be native-like. The sequence similarity of the human ACE2 and the mink ACE2 is fairly high (83% amino-acid sequence identity) and all generated models were of high quality (nDOPE < -1).

Measuring changes in the stability of the S-protein:ACE2 complex following mutation
We calculated the change in stability of the S-protein:ACE2 complex using two independent methods. The first, HADDOCK 70 , is one of the top-performing protein-protein docking servers in the CAPRI competition 71 . The HADDOCK scoring function uses linear combination of various energies, van der waals, electrostatics and desolvation. We employed HADDOCK (v2.4 web server) to score the complexes ( Figure S9). We also calculated the change in the stability of the S-protein:ACE2 complex using mCSM-PPI2 72 ( Figure S10). This assigns a graph-based signature vector to each mutation, which is then used within machine learning models to predict the binding energy. The signature vector is based upon atom-distance patterns in the wild-type protein, pharmacophore information and available experimental information, evolutionary information and energetic terms. We used the mCSM-PPI2 server (http://biosig.unimelb.edu.au/mcsm_ppi2/) for the simulations. These methods were used because we found in a previous study that they reported stability changes, following mutations in the S-protein:ACE2 complex, that correlated well with the available in vivo and in vitro experimental data on susceptibility to infection 29 .
In particular we performed the following calculations: 1. Using the model of the mink ACE2:SARS-CoV-2 reference complex we mutated the target residue to those RBD candidates identified in mink SARS-CoV-2.
2. Using the model of the mink ACE2:SARS-CoV-2 mink complex we mutated the target residues to the human SARS-CoV-2 reference strain.
3. Using the structure of the human ACE2:SARS-CoV-2 reference we mutated the target residue to those RBD candidates identified in mink SARS-CoV-2.
4. Using the model of the human ACE2:SARS-CoV-2 mink complex we mutated the target residue to the human SARS-CoV-2 reference strain.
For HADDOCK, a value that is more negative than for the reference Wuhan-Hu-1 S-protein: ACE2 complex suggests stabilisation of the complex. Whilst for mCSM-PPI-2 negative G values reflect destabilisation of the complex by the mutation and positive values reflect stabilisation of the complex.
We also evaluated structural changes for all combinations of RBD mutations and receptor complexes (Figures S11-S13).

Genomic diversity in mink SARS-CoV-2
At the time of writing (6 th November 2020) 239 SARS-CoV-2 genome assemblies isolated from minks were publicly available on GISAID 48,49 and NCBI (Table S1) Table S1 and Table S2.
Across the mink SARS-CoV-2 alignment we observed a ratio of nonsynonymous to synonymous changes of 1.84, in line with the previous estimate of 1.88 for SARS-COV-2 lineages circulating in humans 24 . A full list of all identified mutations is provided in Table S3 with nonsynonymous mutations displayed in Figures S3-S4. As previously reported for human SARS-CoV-2, we observe marked compositional asymmetries in mink SARS-CoV-2 likely deriving from the mutational action of host RNA-editing mechanisms 24,74,75 . For example, 37% of all nonsynonymous mutations represent C→U changes ( Figure S5). This results in 87% of all possible nonsynonymous mutations involving a C being C→U changes, in line with C→U changes also representing by far the most common mutations for SARS-CoV-2 strains circulating in humans 24,74,75 . Though, the overall pattern of nucleotide substitutions is statistically significantly different between strains circulating in humans and minks (p < 0.001), with in particular a deficit of G→C and an excess of C→A mutations in minks ( Figure S5).
SARS-CoV-2 genomes isolated from minks also showed a stronger depletion of CpG sites (p = 2.2 x 10 -16 ) relative to their counterparts circulating in humans ( Figure S6).

Recurrent mutations in mink SARS-CoV-2
Across the masked alignment we detect 23 mutations that have appeared independently at least twice in SARS-CoV2 circulating in minks, corresponding to a consistency index (CI) <0.5 60 .
Of those 23 mutations, 16 comprise nonsynonymous and seven synonymous changes ( -N501T). Of note the latter three recurrent mutations in the spike protein correspond to residue changes in the SARS-CoV-2 spike RBD (Figure 2, Figure S4, Figure S7), which may be a particularly important region for adaptation of SARS-CoV-2 to host receptors 47 . In particular, we infer five independent emergences of Y453F across four lineages, five emergences of F486L across three lineages and four emergences of N501T across three lineages.
Mutations specifically increasing transmission in minks can be expected to have emerged only a limited number of times in SARS-CoV-2 strains circulating in humans and to have remained at low frequency in the human SARS-CoV-2 population. Using a phylogenetic dataset of >54,000 human SARS-CoV-2 genomes, we identify some overlap between sites homoplasic in minks and in humans ( Figure S8). This suggests that mutations having emerged recurrently in viruses circulating in minks may not all reflect host adaptation but could also be the result of mutations induced by vertebrate APOBEC proteins 24,74,75 . This is in line with the large proportion of homoplasies involving C→T changes (12/23) consistent with know patterns of hypermutation in vertebrates 74 (Figure 2, Table S4). As such, we expect strong candidate mutations for SARS-CoV-2 host adaptation to minks to be at low frequency in SARS-CoV-2 circulating in humans. Table S4.

Figure 2: a) 23 recurrent mutations identified in the mink SARS-CoV-2 alignment. Y-axis provides the minimum number of change points detected by HomoplasyFinder. Colours provide the genome annotations as given at top with different categories of mutation denoted by the symbols. Recurrent mutations in the spike RBD are bounded by vertical dashed lines. b) Number (y-axis) of human associated SARS-CoV-2 genomes annotated as carrying nonsynonymous mutations recurrent in mink associated SARS-CoV-2 as of 9 th November 2020. c) Pies providing the geographic distribution of human SARS-CoV-2 assemblies carrying each of the seven candidate changes (corresponding to nonsynonymous mutations with at least three emergences in minks). Legend at bottom provides the country assignments: AUS -Australia, CHE -Switzerland, CAN -Canada, CIV -Ivory Coast, DEN -Denmark, ESP -Spain, GBR -Great Britain, IND -India, NLD -Netherlands, RUS -Russia, USA -United States of America, ZAF -South Africa. A full list is provided in
Following a screen of amino acid replacements in CoV-GLUE 64 (accessed 16 th November 2020) we found that all seven nonsynonymous mutations that have independently emerged at least three times in strains circulating in minks are carried by fewer than ~0.3% of the SARS-CoV-2 genomes isolated from humans available on large publicly available sequencing repositories (Figure 2, Table S4). Of the mink homoplasies located in the spike protein, two  We used the HADDOCK 70 and mCSM-PPI2 72 protocols to analyse the change in stability of the mink SARS-CoV-2 S-protein:ACE2 complex for the three recurrent spike RBD mutations.
To analyse the changes in stability we used 3D-models of the mink ACE2 bound to SARS-CoV-2 mink strain and then mutated the target residue to that found in human-associated reference SARS-CoV-2. We used this approach as previous work showed that it gave results that correlated well with experimental data on susceptibility to infection 29 . However, we got similar results when we performed the calculation by using the complex of the mink ACE2 bound to SARS-CoV-2 Wuhan-Hu-1 and mutated the target SARS-CoV-2 residue to those observed in minks (Figure S9-S10). The results indicate marginal changes in the stability of the complex for these mutations and are somewhat conflicting between the methods, but none of the values reported is expected to significantly stabilise or destabilise the complex.

Structural analyses
In light of the small estimated changes in binding energy associated with complex stabilities, additional structural analyses were performed to gain further insights into whether the mutations were likely to affect complex stability. The first of the candidate residues 453 (nucleotide position 22920) lies close to the centre of the S-protein:ACE2 interface. In Wuhan-Hu-1 the tyrosine at this position interacts with H34 in human ACE2 enabling the formation of a polar hydrogen bond which would stabilise the complex. In the American mink ACE2, position 34 has a tyrosine residue which would lead to a clash with the polar tyrosine 453 for Wuhan-Hu-1 SARS-CoV-2. Therefore the emergence of mutation of Y453 to a phenyalanine, observed in 47 mink associated SARS-CoV-2 (Table S3), could be favourable as it reduces the clash of polar groups (see Figure S11).
The SARS-CoV-2 spike residue 486 (nt position 23018) has been identified as one of the most important locations -'the receptor binding motif' -for human ACE2 binding to the spike protein 82,83 and lies within a hydrophobic pocket ( Figure S12). It has been shown that SARS-CoV-2 exploits this pocket better than SARS-CoV-1, as the phenylalanine residue allows for pi stacking of the aromatic rings with Y83 in human ACE2 which enhances affinity 65,82,83 . Our previous analysis 29 showed the Y83 residue to be highly conserved across human and other animals including the American mink. While we note that this residue change arises from a T→C transition (Figure 2), this polymorphism has appeared at least five times and was observed in 96 mink associated SARS-CoV-2. Mutation of phenylalanine to leucine replaces the large aromatic ring with an aliphatic leucine which is not involved in any substantive interactions with the Y83 (see Figure S12).
Our final candidate, N501T (nt position 23064) is only observed in five mink SARS-CoV-2 but has independently emerged four times (Table S4). This residue change has been observed in only ten human SARS-CoV-2 assemblies to date (Figure 2). The recurrent mutation leads to the replacement of the asparagine by a threonine residue in the mink associated strain ( Figure S13). This gives rise to a hydrogen bond to lysine 353 which may contribute to the small increase in stability reported by mCSM-PPI2 (Figure S10).

Discussion
The COVID-19 pandemic is understood to have been caused by a unique host jump into humans from a single yet-undescribed zoonotic source in the latter half of 2019 most likely in China 2,3,61 . This created a situation where the entire genetic diversity of the SARS-CoV-2 population was initially negligible despite the virus having rapidly swept across the world.
Genetic diversity has been building up since the start of the pandemic through the acquisition of new mutations, but is still limited relative to other RNA viruses with broad distributions.
Conversely, the secondary host jump into minks involved multiple independent secondary transmissions into mink farms in different countries. Whilst the data we analysed represents only a subset of infected mink farms from just two countries, the Netherlands and Denmark, we already identify a minimum of seven independent host jumps from humans to minks, involving strains representative of essentially the entire SARS-CoV-2 lineage diversity in circulation (Figure 1).
The secondary transmissions of SARS-CoV-2 from humans into minks provides a set of 'natural experiments' to identify mutations involved in the adaptation of the virus to a novel host. By analysing SARS-CoV-2 isolated from minks, we recovered 23 mutations having independently emerged multiple times. By restricting this set to the nonsynonymous mutations that have appeared at least three times in mink, we identify seven variants that are strong candidates for host adaptation to minks (Figure 2, Table S4). These seven candidates comprise a recurrent change in nsp9 of Orf1ab, a region involved in mediating viral replication 84 , as well as the repeat emergence of three mutations in Orf3a, a protein thought to play an important role in triggering host inflammatory response 85,86 .
We particularly focus on the three candidates that have emerged independently in minks at least four times in the RBD of the SARS-CoV-2 spike protein, a region vital for determining host-range 47 (Figure 2, Figure S4). We detect at least five independent emergences of Y453F across four phylogenetic lineages, five emergences of F486L across three lineages and four emergences of N501T across three lineages in minks. It is noteworthy that all three mutations are not, or only marginally homoplasic, in humans and all three have also been observed at low frequency in strains circulating in human COVID-19 infections which have been intensively sampled; ~150,000 high quality genome assemblies are available on GISAID at the time of writing (Table S4, Figure 2).
Quantifying the consequences of these candidate RBD mutations on the binding affinity of the SARS-CoV-2 spike protein to human and mink ACE2 receptors we predict only subtle changes in S-protein:ACE2 complex stability for Y453F and F486L (Figure S9-S10). Consistently, characterisation of mutated variants in the RBD region suggest the region has a high tolerance to mutation in the context of receptor binding 87  We also identify structural support for a marginal increase in stability of the mink complex with the N501T mutation relative to wild type ( Figure S10, Figure S13). In our dataset we detect the presence of this mutation in five mink SARS-CoV-2 though this corresponds to at least four independent emergences, with N501T being exceptionally rare in human infections ( Figure 2). It may also be relevant that a different residue change in this position (N501Y) has been proposed as a mechanism of host adaption in mice infected with SARS-CoV-2 89 suggesting the broader role of residue changes at 501 as relevant affinity enhancing mutations 87 .
In addition to adaptation to a novel host through acquisition of new variants, RNA viruses are also under pressure of their hosts' immune system which comprises antiviral mechanisms such as APOBEC proteins inducing hypermutation at specific sites in their genome 74,75 . While C→U changes, most likely induced by proteins from the APOBEC family, represent the majority of mutations in SARS-CoV-2 circulating both in minks and humans, we also detected subtle shifts in mutational patterns likely due to differences in the innate immune system of humans and minks ( Figure S5-S6). The widespread depletion of CpG sites in RNA viruses can also be explained by another defense mechanism in the vertebrate innate immune system, in the form of Zinc-Finger Antiviral Proteins (ZAPs), which target viral RNA for exosome-mediated degradation 90,91 . There have been earlier suggestions that ZAPs from different vertebrate host taxa may exert varying levels of CpG depletion pressure 92,93 . In this context, it is an intriguing observation that SARS-CoV-2 genomes isolated from minks displayed a moderate but statistically highly significant depletion of CpG sites (p = 2.2 x 10 -16 ) relative to their counterparts circulating in humans ( Figure S6). Our work points to the frequent emergence of adaptive mutations in minks which may be retained during anthroponotic spillovers, at least transiently so. To date, the vast majority of mutations that have been observed in SARS-CoV-2 appear selectively neutral, or even deleterious in humans 24,95 and those which we identify as strong candidates for host-adaptation to minks have remained at low frequency globally in human SARS-CoV-2 to date (Figure 2).
There is no a priori reason to expect that mutations adaptive in minks will lead to any marked change in the dynamic of human COVID-19 infections. Indeed, SARS-CoV-2 lineages carrying candidate mutations adaptive to transmission in minks have been sequenced since the first outbreaks of SARS-CoV-2 on mink farms in April 2020, with the first mink SARS-CoV-2 sample carrying Y453F dating back to the 24 th April 2020.
The low prevalence of mink-adapted candidate mutations in strains in circulation in humans to date (November 2020) suggests they are not expected to increase transmission of the virus in humans. However, mutations located within the RBD of the SARS-CoV-2 spike protein do warrant careful monitoring. Genetic variation in the RBD region is common and studies of related Sarbecoviruses have previously identified signals consistent with complex selection histories in this region 2,10,96,97 . The RBD is the most immunodominant region of the SARS-CoV-2 genome and any mutation in this region therefore has potential implications for 18 antigenic response [78][79][80][81] . The spike RBD is also the target of antibody based therapies and vaccines [98][99][100][101] . In this context, it may seem concerning that Y453F has been flagged as as a possible vaccine-escape mutation 102 . Though our work suggests this mutation is an adaptive change specifically enhancing transmission in minks. Other RBD variants, already found at higher frequency in human SARS-CoV-2, represent stronger candidates for vaccine-escape mutations 103,104 . Once vaccines will be deployed, careful characterisation and tracking of the frequency of such mutations will be essential to inform if and when vaccines will need to be redesigned. This critical effort will be greatly enabled by open genomic data sharing platforms 48,49,64,105 .