TY - JOUR
T1 - A Pan-Coronavirus Vaccine Candidate: Nine Amino Acid Substitutions in the ORF1ab Gene Attenuate 99% of 365 Unique Coronaviruses: A Comparative Effectiveness Research Study
JF - bioRxiv
DO - 10.1101/2022.04.28.489618
SP - 2022.04.28.489618
AU - Eric Luellen
Y1 - 2022/01/01
UR - http://biorxiv.org/content/early/2022/04/28/2022.04.28.489618.abstract
N2 - Background: The COVID-19 pandemic has been a watershed event. Industry and governments have reacted, investing over US$105 billion in vaccine research [1]. The ‘Holy Grail’ is a universal, pan-coronavirus vaccine to protect humankind from future SARS-CoV-2 variants and the thousands of similar coronaviruses with pandemic potential [2]. This paper proposes a new vaccine candidate that appears to attenuate SARS-CoV-2 variants enough to render them safe to use as a vaccine. Moreover, these results indicate it may be efficacious against 99% of 365 coronaviruses. This research model is wet-dry-wet: it originated in genomic sequencing laboratories, evolved to computational modeling, and the candidate results now require validation back in a wet lab. Objectives: This study’s purpose was to test the hypothesis that machine learning applied to sequenced coronavirus genomes could identify which amino acid substitutions likely attenuate the viruses to produce a safe and effective pan-coronavirus vaccine candidate. This candidate is now eligible to be tested preclinically and then clinically. If validated, it would constitute a traditional attenuated virus vaccine to protect against hundreds of coronaviruses, including the many future variants of SARS-CoV-2 predicted to arise from continuous recombination in unvaccinated populations and to spread through modern mass travel. Methods: This was an in silico comparative effectiveness research study that used machine learning to examine trinucleotide functions in the nonstructural proteins of 365 novel coronavirus genomes.
Sequences of 7,097 codons in the ORF1ab gene came from coronaviruses that were collected at 65 global locations, infected 68 species, and were reported to the US National Institutes of Health. The data were transformed twice, using proprietary methods, to enable machine learning ingestion, mapping, and interpretation. The set of 2,590,405 data points (365 genomes × 7,097 codons) was randomly divided into three cohorts: 255 observations (70%) for training and two cohorts of 55 observations (15%) each for testing. Machine learning models were trained in the statistical programming language R and compared to identify which mixture of the 7.097 × 10^23 possible amino-acid-location combinations would attenuate SARS-CoV-2 and other coronaviruses that have infected humans. Results: Contests among machine-learning algorithms identified nine amino-acid point substitutions in the ORF1ab gene that likely attenuate 98.98% of 365 (361) novel coronaviruses. Notably, seven substitutions are for the amino acid alanine. Most of the locations (5 of 9) are in nonstructural proteins (NSPs) 2 and 3. The substitutions are (1) alanine to valine at codon 4273; (2) alanine to leucine at codon 5077; (3) alanine to phenylalanine at codon 2001; (4) alanine to leucine at codon 372; (5) alanine to proline at codon 354; (6) alanine to phenylalanine at codon 2811; (7) alanine to phenylalanine at codon 4703; (8) leucine to serine at codon 2333; and (9) threonine to alanine at codon 5131. Conclusions: The primary outcome is a new, highly promising pan-coronavirus vaccine candidate based on nine amino-acid substitutions in the ORF1ab gene.
The secondary outcome was evidence that sequences of wet-dry lab collaborations (here, machine-learning analysis of viral genomes informing codon functions) may discover new, broader, and more stable vaccine candidates more quickly and inexpensively than traditional methods. Competing Interest Statement: The authors have declared no competing interest. Abbreviations: AUROC, area under the receiver operating characteristic curve; CART, classification and regression tree; CDP, codon deoptimization; CRISPR, clustered regularly interspaced short palindromic repeats; GDP, gross domestic product; IBV, infectious bronchitis virus; OOB, out-of-bag (e.g., error rate); ORF, open reading frame (e.g., gene and polyprotein); LLM, linear logistic model; MDA, mean decrease in accuracy (from variable exclusion in the model); MDG, mean decrease in Gini; MERS, Middle East Respiratory Syndrome; NIH, US National Institutes of Health; NSP, nonstructural protein; PDB, protein databank; RNA, ribonucleic acid; ROC, receiver operating characteristic; SVM, support vector machine; TPA, third-party annotation.
ER -