TY - JOUR
T1 - Comprehensive annotations of the mutational spectra of SARS-CoV-2 spike protein: a fast and accurate pipeline
JF - bioRxiv
DO - 10.1101/2020.06.29.177238
SP - 2020.06.29.177238
AU - M. Shaminur Rahman
AU - M. Rafiul Islam
AU - M. Nazmul Hoque
AU - A. S. M. Rubayet Ul Alam
AU - Masuda Akther
AU - J. Akter Puspo
AU - Salma Akter
AU - Azraf Anwar
AU - Munawar Sultana
AU - M. Anwar Hossain
Y1 - 2020/01/01
UR - http://biorxiv.org/content/early/2020/06/29/2020.06.29.177238.abstract
N2 - To explore nonsynonymous mutations and deletions in the spike (S) protein of SARS-CoV-2, we comprehensively analyzed 35,750 complete S protein gene sequences from across six continents and five climate zones around the world, as documented in the GISAID database as of June 24th, 2020. Using a custom Python-based pipeline for analyzing mutations, we identified 27,801 mutated strains (77.77% of spike sequences) compared to the Wuhan-Hu-1 strain. Of these strains, 84.40% had only a single amino-acid (aa) substitution, but an outlier strain from Bosnia and Herzegovina (EPI_ISL_463893) was found to possess six aa substitutions. The D614G variant of the major G clade was found to be predominant across circulating strains in all climates. We also identified 988 unique aa substitution mutations distributed across 660 positions within the spike protein, with eleven sites showing high variability: each of these sites had four types of aa variation. In addition, 17 in-frame deletions in four major regions (three in the N-terminal domain and one just downstream of the RBD) may have an impact on attenuation. Moreover, the mutational frequency differed significantly (p = 0.003, Kruskal–Wallis test) among the SARS-CoV-2 strains worldwide. This study presents a fast and accurate pipeline for identifying nonsynonymous mutations and deletions from large datasets for any particular protein-coding sequence, and presents the S protein data as a representative analysis.
By using separate multi-sequence alignment with MAFFT, removing ambiguous sequences and in-frame stop codons, and utilizing pairwise alignment, this method can derive nonsynonymous mutations (Reference:Position:Strain). We believe this will aid in the surveillance of any proteins encoded by SARS-CoV-2, and will prove to be crucial in tracking the ever-increasing variation of many other divergent RNA viruses in the future. Competing Interest Statement: The authors have declared no competing interest.
ER -