Genetic Characteristics and Phylogeny of 969-bp S Gene Sequence of SARS-CoV-2 from Hawaii Reveals the Worldwide Emerging P681H Mutation

COVID-19 pandemic has ravaged the world, caused over 1.8 million deaths in the first year, and severely affected the global economy. Hawaii is not spared from the transmission of SARS-CoV-2 in the local population, including high infection rates in racial and ethnic minorities. Early in the pandemic, we described in this journal various technologies used for the detection of SARS-CoV-2. Herein we characterize a 969-bp SARS-CoV-2 segment of the S gene downstream of the receptor-binding domain. At the John A. Burns School of Medicine Biocontainment Facility, RNA was extracted from an oropharyngeal swab and a nasal swab from two patients from Hawaii who were infected with the SARS-CoV-2 in August 2020. Following PCR, the two viral strains were sequenced using Sanger sequencing, and phylogenetic trees were generated using MEGAX. Phylogenetic tree results indicate that the virus has been introduced to Hawaii from multiple sources. Further, we decoded 13 single nucleotide polymorphisms across 13 unique SARS-CoV-2 genomes within this region of the S gene, with one non-synonymous mutation (P681H) found in the two Hawaii strains. The P681H mutation has unique and emerging characteristics with a significant exponential increase in worldwide frequency when compared to the plateauing of the now universal D614G mutation. The P681H mutation is also characteristic of the new SARS-CoV-2 variants from the United Kingdom and Nigeria. Additionally, several mutations resulting in cysteine residues were detected, potentially resulting in disruption of the disulfide bridges in and around the receptor-binding domain. Targeted sequence characterization is warranted to determine the origin of multiple introductions of SARS-CoV-2 circulating in Hawaii.


Introduction:
The zoonotic virus responsible for the present Coronavirus Disease 2019  pandemic is SARS-CoV-2 (formerly nCoV). 1 SARS-CoV-2 emerged in Wuhan, China, 2 at a seafood market 3 in November 2019 4 and has been evolving ever since. 1 The COVID-19 pandemic resultant from SARS-CoV-2 has been responsible for infecting more than 84 million people worldwide and has been fatal in more than 1.8 million persons experiencing an infection. 5 The State of Hawaii has dealt with more than 22,000 cases and 280 deaths, with daily reports steady at approximately 60 new cases per day since August 2020. 6 To understand SARS-CoV-2 emergence and the disease, COVID-19, one must look at the genome, as the virus evolves and emerges through genomic alterations and adaptations. SARS-CoV-2 belongs to the family of Coronaviridae -genus Betacoronavirus -which are viruses with 26,000-32,000 nucleotide long single-stranded positive-sense RNA genomes. [7][8][9] Geneticists and virologists look at the SARS-CoV-2 genome and its adaptations to analyze the nucleotide and amino acid variations. Analysis of the SARS-CoV-2 genome will allow us to track the spread through unique genomic fingerprints, 1 determine whether these adaptations alter the viral fitness, infectious capabilities, 10 and develop potential vaccines and therapeutics. 11,12 While genes encode 20 proteins consisting of four structural and 16 non-structural proteins in the SARS-CoV-2 ~30,000-bp genome, 9 the gene looked to most is the S gene responsible for the spike protein. The spike protein is a 1,273 amino acid long 12 (YP_009724390.1)(NC_045512) surface protein that is the viral component accountable for interacting with the human ACE2 (UNIPROT ID Q9BYF1). 7,13,14 ACE2, genetically encoded on the human X chromosome (ENSG00000130234), is a component of the renin-angiotensin hormone system in humans and is ultimately a vasodilator. 15,16 This interaction between SARS-CoV-2 spike protein and human ACE2 via the RBD allows SARS-CoV-2 to enter cells and infect the human host. 12 Mutations in the spike protein can alter binding efficiency and viral fitness. 10 Indeed, some nucleotide mutations in the SARS-CoV-2 S gene change pathogenicity 12 and fitness, 10,17 reduce virulence, 1,4 and have become commonplace in tracking the spread of SARS-CoV-2. 17 These S gene and spike protein mutations can increase transmission of the virus between hosts through anatomical localization to the upper respiratory tract. 10 Therefore, to understand the SARS-CoV-2 pathogenicity, it is important to characterize the virus mutations to study viral pathogenesis and vaccine development. In this study we report analysis of a 969-bp SARS-CoV-2 S gene from two patients from Hawaii to understand the changes in the spike protein, a target for vaccines.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made Two patients (PID 00498 and PID 00708) analyzed in this report were part of the University of Hawaii at Manoa IRB approved H051 study (IRB# 2020-00367) led by Dr. Cecilia Shikuma. Patient PID 00498 and patient PID 00708 are both males, one is caucasian and the other is Japanese/Okinawan/Filipino, and their mean age is 29.5 years. Oropharyngeal swab (OS -PID 00498) and Nasal swab (NS -PID 00708) from previously identified SARS-CoV-2 positive patients were collected and stored at -80°C. Swabs from PID 00498 and from PID 00708, were collected in August 2020, 3 days after first PCR positive diagnosis. Neither of the patients traveled outside of Honolulu in the week before their first PCR positive SARS-CoV-2 diagnosis, however both identified potential sources of exposure in Honolulu.
Swabs stored in VTM at -80C were thawed in the biosafety cabinet at the John A. Burns School of Medicine high containment laboratory as part of the University of Hawaii at Manoa IBC approved study (IBC#20-04-830-05). VTM was centrifuged to separate the supernatant from debris, aliquoted and 140 µL of the VTM was used for viral RNA purification using the QIAamp® Viral RNA Mini Kit (Cat# 52906) following the manufacturer's instructions. The samples were eluted in 30 µL of the elution buffer.

Reverse-Transcriptase Polymerase Chain Reaction and Sequencing:
As per the manufacturer's instructions, purified viral RNA was transcribed into cDNA using the Takara LA Taq polymerase Kit (Cat #RR012A) with random nine-mers and extension time of 90 minutes. Primer sets were designed based on published sequences and were procured from Integrated DNA Technologies (Coralville, IA). A 1,127-bp segment of the S gene was amplified using the Takara RNA LA PCR Kit (Cat #RR012A) and primers CF and CR (Figure 1). 18 PCR was conducted according to the manufacturer's instructions and cycled on the Applied Biosystems GeneAmp® PCR System 9600. PCR products were then electrophoresed on 1.5% agarose 1x TBE gels at 50V and the amplicons of interest were purified using the Qiagen QIAquick Gel Extraction Kit (Cat# 28704).
Sanger sequencing was conducted on the amplicons using four primers (CF, CR, 18 CR2, and CR3) at the ASGPB core facility at the University of Hawai i at Mānoa (Figure 1). The resulting sequences were input into and verified using both MEGAX 19,20 and SnapGene software (Insightful Science, snapgene.com) and aligned using MUSCLE program 21 to define the contiguous sequence. The resulting 969-bp consensus sequences were uploaded to . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425497 doi: bioRxiv preprint NCBI. The S gene 969-bp region encompasses nucleotides 23,042 to 24,010, and corresponds to amino acids 494 to 816, which involves the 3' and C-terminal of the RBD that ends at nucleotide 23,185 and amino acid 541. 7

SNP Analysis:
Sixty-eight coronavirus strains representing alpha and beta lineages were manually selected from NCBI. Of these 68 strains, 55 were SARS-CoV-2 strains which represented 25 distinct geographical locations spanning the pandemic duration, at least one per month from December 2019 to September 2020. All SARS-CoV-2 sequences, including previously published Hawaii sequences, were first aligned and redundant sequences were removed from further analysis. Coronavirus sequences were aligned with the 969-bp S gene region with SnapGene using MUSCLE and the corresponding region was used for future analysis. 21 The non-SARS-CoV-2 strains were removed if the 969bp S gene of SARS-CoV-2 sequence did not align with the S gene of the non-SARS-CoV-2 strains. Based on the alignment SNPs were identified and annotated into SnapGene to analyze the amino acid substitutions.
Upon finding the P681H mutation among the two Hawaii strains in this study, the GISAID database 22,23 was used to filter worldwide SARS-CoV-2 sequences by the P681H mutation to create a ratio of sequences containing the P681H mutation to all sequences reported in the GISAID database for a given month. Inclusion criteria were for sequences providing a full month, day, and year. The D614G mutation underwent assessment in the same manner for comparison. All prevalence data converted into ratio underwent a logarithmic transformation.
Pearson's correlation tests between P681H frequency vs. month, D614G frequency vs. month, and P681H frequency vs. D614G frequency were conducted and verified using GraphPad Prism version 9.0.0 for Mac (GraphPad Software, San Diego, California USA, www.graphpad.com), JASP version 0.14, 24 and RStudio version 1. 3.1093. 25 Phylogenetic Tree: After the SNP analysis, incomplete sequences were removed prior to the construction of the phylogenetic tree. The phylogenetic tree was constructed using MEGAX. 19 The alignment was first done using the program MUSCLE. 21 The phylogenetic tree was then generated with Maximum Likelihood parameters with 1,000 bootstraps in MEGAX 19,20

Results:
Gene Amplification and Sequence Analysis: . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425497 doi: bioRxiv preprint SARS-CoV-2 genomic sequences were detected by RT-PCR in both the patients, PID 00498 and PID 00708, using various primers spanning the S gene ( Figure 1). A 1,127-bp segment was amplified and sequenced and the entire sequence of the amplicon was aligned with at least one forward and one reverse sequence (translated into reverse-complement) to span the whole 1,127-bp region. For final sequence analysis a 969-bp sequence verified by sequencing the 5' and 3' ends was used. The two Hawaii sequences were deposited in the GenBank, accession numbers MW237663 for PID 00498 and MW237664 for PID 00708.  (Table 1 and Figure 1). Two of the thirteen mutations resulted in a synonymous mutation (amino acid 541 and 790) ( Table 1 and Figure 1). The P681H mutation is unique to the Hawaii strains from this study (MW237663 and MW237664).

Phylogenetic Tree:
Of the 13 SARS-CoV-2 sequences used for SNP identification, two were incomplete due to unidentified nucleotides and were removed from the phylogenetic analysis. Similarly, of the 13 non-SARS-CoV-2 sequences, four did not align to the 969-bp S gene due to large insertions and/or deletions and were removed from further . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 7, 2021.

Discussion:
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 7, 2021. This report focuses on SARS-CoV-2 strains from two patients in Hawaii based on a 969-bp S gene. The sequence and phylogenetic analysis indicate and support that the S gene is continuously mutating as previously reported, 30 and Hawaii may be harboring a unique strain with an emerging mutation in an altered spike protein.
Analysis of the SNPs found in the 969-bp S gene region, indicates the loss of a proline residue and the gain of cysteine residues. These mutations potentially alter the spike protein monomeric and trimeric structures.
The P681H represents the loss of a proline residue and the gain instead of an imidazole-containing histidine residue. According to the GISAID database, the P681H mutation is found worldwide in 5,955 strains reported as of December 31, 2020. The Pearson's correlation test of the logarithmic transformed P681H prevalence of the mutation versus time indicates that the P681H mutation is exponentially increasing worldwide and the sequences encompassing the P681H mutation is dominating significantly when compared to other SARS-CoV-2 strains. Such a significant finding indicates that there is a selective pressure in favor of this mutation.
A study looking at the effect of the loss of a proline residue in the spike protein of the mouse hepatitis virus, a coronavirus, observed altered pathology, fusion kinetics, and enhanced infectivity. 31 The same study also suggests that prolines in regions adjacent to RBDs may not be essential for fusion but may significantly change structure and function of the spike protein. 31 Recent SARS-CoV-2 studies indicate that the P681H mutation is immediately juxtaposed to the amino acid 682-685, furin cleavage site, identified at the S1/S2 linkage site, which predicted enhance systemic infection 12,32 , and increased membrane fusion. 11 Additionally, the proline in the P681H mutation is within the epitope found to be the highest-ranking B and T cell epitope based on the in silico long-term population-scale epitope prediction for vaccine development study. 11 That is, the sequence surrounding this P681H mutation is predictably the loci where the immune response is targeted. Meaning this mutation could be the first (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425497 doi: bioRxiv preprint MW190887.1, MT344948.1, MT344949.1). All of these original strains introduced to Hawaii were collected in March 2020, and 66.6% (8/12) have the D614G mutation. The D614G mutation has become universal throughout the SARS-CoV-2 strains. 10 The D614G mutation is known to enhance infectivity, replication, and localize the virus to the upper respiratory tract to increase transmission. 10,17 Interestingly, several mutations (nucleotide position 241 C→T, position 3037 C→T, and position 14408 C→T, etc.) exist alongside the D614G mutation 10,17 Similarly, in the both Hawaii SARS-CoV-2 strains the P681H mutation is also present alongside D614G 10,13,17 . This observation indicates that, similar to D614G mutation the P681H mutation is becoming globally prevalent among SARS-CoV-2 sequences.
The R577C, S680C, and F797C mutations depicted in Table 1  The other non-synonymous SNPs found in this study -A522S, F543L, I584V, I726F, A771S, E780Q -are not as apparent in presenting drastic evolutionary change, but they too deserve further analysis.
Recently, the NERVTAG group based out of London, England, has reported on a new SARS-CoV-2 variant (VOC202012/01). 35,36 NERVTAG reports that the variant has increased transmissibility and further studies are underway to confirm their report. 37 This variant includes amino acid mutations in ORF1ab, spike, Orf8, and N. 35,37,38 The six ORF1ab mutations are T1001I, A1708D, I2230T, and  35,37,38 The Orf8 mutations are Q27stop, R52I, and Y73C. 37,38 The N protein mutations are D3L and S235F. 37,38 When comparing the SNPs encompassing the 969-bp of two strains from this study to the reference genome for VOC202012/01, EPI_ISL_601443, 39 we found two similar mutations, D614G and P681H. 37 Further, . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  40 has been defined by the P681H mutation found in the two Hawaii strains.
The two Hawaii strains analyzed in this study cluster together predictably due to the emerging P681H mutation. These two strains also cluster closely with a strain from China and a previously published Hawaii strain.
Other previously published Hawaii strains cluster with SARS-CoV-2 strains from New York, Wuhan, China, and Sweden. These analysis and resultant phylogenetic tree indicate that the virus has likely been introduced to Hawaii through several sources.
Over the past year, SARS-CoV-2 worldwide has evolved and will continue to do so. As of this report's In summary, COVID-19 in Hawaii and the pandemic originating in Wuhan in the 2019-2020 winter is still ongoing. The virus continues to mutate and the effects and outcomes of several of these mutations has yet to be elucidated. This study demonstrates a partial sequence from the first SARS-CoV-2 strain possessing the P681H nonsynonymous mutation. In Hawaii, Native Hawaiians and Pacific Islanders have significantly high prevalence of SARS-CoV-2 when compared to other ethnic minorities and Whites. Characterizing viral sequences from these minority groups is important to better understand virus transmission and pathogenicity.

Conflicts of Interest:
The authors report no conflicts of interest. . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425497 doi: bioRxiv preprint from the City and County of Honolulu. We thank Dr. Sean Cleveland, Information Technology Services, University of Hawaii (UH) for assistance with the phylogenetic tree and the University of Hawaii MANA High-Performance Cluster for use of the facility, and Dr. Vedbar Khadka for assistance with MEGAX. We thank the Nurses and staff of the Hawaii Center for AIDS for assisting with the H051 study, and the patients for participating in this study.

Author's Contributions:
• Guarantor of integrity of entire study: David P. Maison, Vivek R. Nerurkar  (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 7, 2021.   . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425497 doi: bioRxiv preprint The Tamura-Nei model and the Maximum Likelihood method were used to infer evolutionary history using 1,000 bootstraps. The tree with the highest log likelihood is shown in the figure. Next to the branches is shown the percentage of trees in which the associated taxa clustered together -only values greater than 70 are displayed.
Neighbor-Join and BioNJ algorithms were applied to a matrix of pairwise distances estimated using the Tamura-Nei model to automatically obtain the initial tree for the heuristic search, and then selecting the topology with superior log likelihood value. This analysis involves nucleotide sequences from 20 taxa. There were a total of 1,366 sites. Yellow text is used for betacoronavirus lineage A, red text denotes betacoronavirus lineage B, black text denotes betacoronavirus lineage C, blue text shows betacoronavirus lineage D, and purple text is for alphacoronavirus. * represents strains from Hawaii. ** represents strains from this study. Created with BioRender.com.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425497 doi: bioRxiv preprint  (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425497 doi: bioRxiv preprint