Population genetic analysis of Indian SARS-CoV-2 isolates reveals a unique phylogenetic cluster

The SARS-CoV-2 pandemic originated from Wuhan, China in December 2019 raised an alarming situation all over the globe. Sequencing of this novel virus provides an opportunity to evaluate the genetic polymorphism present in the viral population. Herein, we analysed 173 sequences isolated from Indian patients and performed SNP linkage, clustering and phylogenetic analysis to understand the local genetic diversity. We found that the SNP linkages that lead to the identification of some global clades, do not hold true for the local clade classification. In addition to the unique cluster, established by another Indian study, we identified a new cluster (I-20D) that encompasses 28% of the analysed sequences. This cluster is defined by two linked variations – C22444T and C28854T. A detailed study of such polymorphisms can be useful for drug and vaccine development.


Introduction
The emergence and global spread of the novel Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) causing coronavirus disease 2019  is the major point of concern for all affected countries. The origin of SARS-CoV-2 can be traced back to late 2019 in Wuhan, China 1 . Since then SARS-CoV-2 has spread across the world and become a global health emergency on March, 2020 when World Health Organisation (WHO) declared COVID-19 as global pandemic 2 . The worldwide alarming situation can be foreseen with more than 9,500,000 people infected and over 489,000 deaths reported globally (as of June 26 th 2020). After its entry into human host, the continuous divergence of the virus generated at least CoV's RBD as compared to the bat virus, RaTG13 (86.19%) 8 . A recent study highlights a recombination event between bat CoV and pangolin CoV, before its introduction in humans 9 .
Strong purifying selection pressure and recombination shapes the origin and evolution of the novel SARS-CoV-2 6,9 . After its human entry the molecular divergence of SARS-CoV-2 is continuing and it is important to track this molecular evolution to understand the epidemiology of the disease and to effectively design drugs and vaccines for a target population.
In an earlier study 10 , 103 SARS-CoV-2 genomes were segregated in two lineages 'L' and 'S' based on two linked mutations. The expansion of the viral genome sequences revealed large number of polymorphism in the population and their geo-distribution 11 . Nextstrain defines five major global clades that can classify the overall global SARS-CoV-2 genome sequences 3 . The geo-distribution of the mutations and therefore clades 11 across countries can have severe impact on the local evolution of the virus. In a previous study 12 , combination of four genetic variants were used to define a unique clade predominant in India.
The publicly available whole genome sequences of SARS-CoV-2 isolates from different countries can be utilized for understanding the evolution of the virus as of function of time and geography 11 . The information can be used to characterize the novel viral mutations in a population that can be further exploited for designing better diagnostic tests, vaccines and antiviral drugs. In the present study, we characterized the genetic variants in 173 SARS-CoV-2 genomes isolated from the Indian population, predominantly from Ahmedabad and Gujarat.
We analysed the genetic variants defining the phylogenetic clusters and established a new clade dominant in the Indian population that was not reported earlier.

Data collection and sequence analysis of SARS-CoV-2 isolated from Indian Population
173 FASTA sequence files of SARS-CoV-2 were retrieved from NCBI 13 (update: 05-06-2020) database. Only complete genomes originated from Indian population were used in this study.
Reference sequence and annotations of SARS-CoV-2 (NC_045512) were retrieved from GenBank, NCBI 13 and the reference sequence was used as parent sequence for all other sequences. The sequences were divided into two sets: initial set (update: till 17-05-2020) and final set (update: till 05-06-2020). The first set of 60 sequences (59 Indian originated sequences and 1 reference sequence) were used for initial analysis of genetic diversity, SNP linkages, clustering and phylogenetic analysis. And the final set of 174 sequences was used to test the results concluded from the analysis of initial set.
Multiple Sequence Alignment (MSA) of initial set sequences was performed using MUSCLE algorithm in MEGA 14 . Further, MEGA 14 was used to evaluate the mutation type (Synonymous or Non-synonymous) and frequencies at every single site in aligned sequences of whole genome as well as concatenated CDS (coding regions). Mutations occurring 5 or more than 5 times were considered as recurrent mutations. The protein and conserved domain information were retrieved from NCBI 15 .

Linkage Disequilibrium and linkage-based sequence clustering
The aligned file of all retrieved sequences of initial set was used to generate a HapMap file with sixteen selected mutations containing information regarding sequence, sequence position and allele frequency of both minor and major. TASSEL version 5.2.61 16 was used to generate linkage plot between the selected SNPs using HapMap file. Correlation analysis between mutations, sequences and geographical locations was performed using ClustVis 17 tool. A heat map was generated using the following settings: unit variance scaling was applied to rows, both rows and columns were clustered using correlation distance and average method, and tightest cluster first was used for tree ordering of both rows and columns. Clusters were defined based on the linked SNPs in the sequence.

Phylogenetic analysis and global distribution of clades
Phylogenetic analysis between SARS-CoV-2 sequences used in this study was performed using PhyML 18 with Smart Model Selection 19 with tree searching using Nearest Neighbour Interchange (NNI) and 100 bootstrap replicates. GTR+I substitution model was selected as best model using BIC (Bayesian Information Criterion) selection criteria.
Nextstrain nomenclature 3 for clade identification was adopted and used to classify the sequences in this study. For defining a new clade, at least 2 mutations should be away from its parent major clade.
For global analysis, sequences were retrieved from all the countries who had deposited in NCBI till June 5 th 2020. Then Multiple sequence alignment (MSA) was performed and frequency of the occurrence of the two mutations simultaneously was assessed.

Functional Characterization of amino acid change
The N -gene variant, C194T, was evaluated for the functional characterization using the Protein Variant Effect Analyser (PROVEAN) and PolyPhen-2. In PROVEANv1.1 20 , the predefined threshold score was used, i.e. -2.5. This cut off score corresponds to 80.4% sensitivity and 78.6% specificity. Polyphen-2 uses Naïve Bayes classifier for predicting functional significance of the allele replacement 21 . Default parameters has been used for calculating the Polyphen-2 score.

SARS-CoV-2 genome sequences, their origin and date of collection
The viral genome isolates of Indian origin (n=173) along with the reference genome sequence (Wuhan-Hu-1, NC_045512.2) were collected over the period of five and half months from December 24, 2019 to June 5, 2020; and most of them (94.8%) were isolated after mid-April (Fig: 1a). The most represented origin of isolated viruses was Ahmedabad, with 43.3% (75/173) share of all sequences (Fig: 1b). Out of these 173 Indian origin SARS-CoV-2 sequences, 59 SARS-CoV-2 complete genomes (update: till 17-05-2020; initial set) were used for initial evaluation of genetic diversity, SNP linkages and phylogenetic analysis. Rest of the sequences along with the initial set (n = 173) were later used to validate the results obtained from the analysis of initial sequence set.

Genetic diversity in the Indian SARS-CoV-2 sequences
A total of 136 variations were identified in the initial sequence set studied as compared to the reference sequence Wuhan-Hu-1/2019 (NC_045512.2), Additionally, two unrelated 3-nucleotide deletions were observed in two different sequences. There were 91 singleton mutations present, harbouring change only in one sequence. There were four sites, one in the 5' (1-265) and three in the 3' (29675..29903) UTR regions where mutations were observed. All mutations present in 3' UTR region were singleton. However, a C to T mutation at position 241 (compared to the reference genome) in the 5'-UTR was found in about 86% of the 173 Indian sequences.
In concatenated CDS region of all genes, total 131 variations were found, including 82 (62.6%) non-synonymous and 49 (37.4%) synonymous mutations (Fig: 2a). Out of them, 88 (67.2%) were singleton that includes 55 (62.5%) non-synonymous and 33 (37.5%) synonymous mutations (Fig: 2a). For closely related isolates, relationship between selection and dN/dS values is not monotonic. The high occurrence of non-synonymous mutations (Fig: 2b) at selected sites might suggest a rapid sweep in the population and strong selection 22 .Therefore, predicting selection pressure from dN/dS can be misleading 22 .
The analysis of the frequencies of mutated alleles in the initial set revealed the presence of 16 frequent mutations with minor allele frequency of f > 5% (the frequency that is considered significant 23 ), i.e. mutations present in more than three SARS-CoV-2 genomes were considered for this analysis. The frequencies of these 16 selected recurrent mutations for initial as well as final set are given in Table: 1.
As can be seen from the above table, the 10 non-synonymous mutations were distributed in five genes coding seven proteins with variable frequencies, in the genes coding for ORF1ab

SNP linkage and Phylogenetic analysis of SARS-CoV-2 genomes
To analyse and visualize the linkage pattern among the selected sixteen recurrent mutations, a linkage disequilibrium (LD) analysis was performed using TASSEL 5.2.61 16 . We found that most of the analysed SNPs had shown linkage, with value of D' (Fig: 3a, upper right triangle) as 1 and varying R 2 value (Fig: 3a, lower left triangle) as the allele frequency is varying for different mutation pairs. Some SNPs had shown strong linkage 24 with similar allele frequencies having R 2 value more than 0.8 along with D' value of 1 (Fig: 3b, Table: 2).
To further confirm the above finding a correlation analysis between recurrent mutations and individual sequences were performed using ClustVis 17 . As expected, we observed a clear clustering of sequences based on the linkage pattern of various mutations (Fig: 4, Table: 2).
There were two broad clusters, one minor (A) and one major (B) cluster that divide the population in two distinct groups. The minor group was nearer to reference genome in sequence similarity and the major group can be further divided into two smaller groups (C and D). Both 'C' and 'D' encompass two further smaller clusters 'E' and 'F' respectively. Likewise, 'A' cluster contains a small cluster 'G'. In our study population there was an over-representation of SARS-CoV-2 genomes isolated from Ahmedabad and most of these sequences fall in the major cluster 'B' indicating of a possible common origin of the sequences.
A phylogenetic tree was constructed using 60 SARS-CoV-2 whole genome sequences (including 1 reference genome) of the initial set. The phylogenetic tree revealed similar organization of clusters and in agreement with our previous results (Fig: 5). The lineages can be distinguished and identified based on the occurrence of the linked SNPs.

Extending analysis to 173 sequences and comparing with the global status
When we extended our analysis by comparing all 173 sequences together, we found that most of the frequencies of selected recurrent mutations were matching with the initial sequence set (Fig: 6). Frequencies of only two non-synonymous mutations were decreased by more than five percent, indicating their loss in the population. There were some mutations which were previously present in low numbers but after extending the sequence numbers they appear with higher frequencies, suggesting either selection pressure operating on these sites or introduction On analysing the sequences, we found 19B, a cluster designated as a major cluster as per global analysis, was not prominent (only 3.5% of samples) in our study population (Fig: 7). In addition to 19B, we found another clade emerging out of 19A, encompassing 8.8% of our sample population. This clade is defined by a set of four mutations; C28311T, C6312A, C13730T and C23929T. This clade was also identified in another Indian study 12 where they observed around 29% of their study population to be in this cluster and named it as Clade I/A3i (defined by a combination of same four variants). This study used viral genomes isolated from Telangana along with other available sequences to infer the phylogeny. The observed differences in the frequency of the sub-populations in these two studies in Clade I/A3i is due to the altered demographics of the virus sequences.
Most of the sequences (~ 86%) used in our study population fall in the major clade 20A (Fig:   7). Nextstrain 3 defined clade 20A by two mutations, C14408T and A23403G. We found an additional mutation (C3037T) redefining the clade (I-20A, i.e. India-20A) for Indian population (high D' and R 2 values, Table: 2). Also, C241T mutation in the 5'UTR (not shown here) is tightly linked with these variants of clade I-20A. Similar to 19B, very few sequences (1.7%) represent clade 20B in our study population.
Clade 20C emerged from 20A and was defined by Nextstrain with two mutations, C1059T and G25563T. Surprisingly, in our study population the linkage expected between these two mutations was not found. On the contrary, G25563T was found to be strongly linked with mutations C18877T and C26735T. Since this linkage was different from what was observed for 20C, we renamed this clade as I-20C (India-20C) for Indian population.
In addition to the above clades we identified a novel clade (I-20D) emerging from I-20C, representing around 28% of the sequences in the population studied. This clade was defined by a combination of C22444T and C28854T variants (Fig: 7). C2836T was also found to be linked with C22444T and C28854T (Table: 2, Fig: 5). Since the allele frequency of C2836T was lower than C22444T and C28854T (Table: 1 (Table: 3).

Functional evaluation of SNPs in Clade I-20D
One of the two mutations defining the clade I-20D is synonymous (C22444T), while the other one (C28854T) is non-synonymous therefore can cause a change in the amino acid sequence.  (Table: 4). However, further experimental studies are needed to determine the true effect of this mutation.

Discussion
In this report, we investigated the pattern of molecular evolution of 173 SARS-CoV-2 genomes isolated from Indian population. On the basis of linkages between sixteen recurrent SNPs (each found in more than 5% sequences), sequences were clustered and subsequently clades were defined. In this study population, we found two unique clades, that were not among the major global clades as defined by Nextstrain. One of the two clades that we identified was also reported by Banu et at., wherein the cluster was named, Clade I/A3i 12 . The identification of the cluster was defined by a combination of four variants, C28311T, C6312A, C13730T and C23929T. Similar to their observation, we found another linked SNP along with these variants but not all of the sequences in the cluster shared this variant, G11083T (D' and R^2 values with all other four variants were less than 0.78 and 0.49, respectively), therefore excluded from the clade identifier set. The above referred study included a large sequence population from Telangana with the geographical distribution of this cluster was found to be prevalent in Telangana, Tamil Nadu, Maharashtra and Delhi. The major contributors to our study population were the SARS-CoV-2 sequences originated mostly from Gujarat. Since this clade (I-A3i) may not be predominant in Gujarat, the number of sequences belonging to the clade was very less (8.8% of the total sequences).
In our analysis, we found another unique clade specific to the Indian population (absent in other countries/continents). This clade was predominantly present in the study population (28% of the total sequences) and was defined by two tightly linked SNPs, C22444T and C28854T. We named this cluster, I-20D (India-20D), since it is derived from the parent clade I-20C (India-20C). The global clade, 20C was redefined for our study population as the combination of two variants C1059T and G25563T, defining it, was missing in the sequences we analysed. Instead, G25563T was found to be strongly linked with C18877T and C26735T and hence we renamed this clade as I-20C. Similarly, clade 20A was renamed I-20A as C14408T and A23403G was tightly linked with C3037T. Therefore, the SNPs that determine the clades in global population (Nextstrain) have shown an altered occurrence patterns in Indian population.
Recently, a study demonstrated increased infectivity of SARS-CoV-2 virus when D (Aspartic Acid) at 614 position of surface glycoprotein converted into G (Glycine) 26 . The D614G (A23403G) is one of the mutation identifiers of 20A global clade, which was found to occur in 86% of our study population. Whether this mutation has resulted in a higher infectivity in       Fig: 3b) Scatter plot between the R 2 (x-axis) and D' (y-axis) of each SNP pair. The points in the red box had shown strong linkage with similar allele frequencies (Table: 2). The intensity of dark shadow behind the blue dot indicates the presence of multiple points.