COVID-19 CG: Tracking SARS-CoV-2 mutations by locations and dates of interest

COVID-19 CG is an open resource for tracking SARS-CoV-2 single-nucleotide variations (SNVs) and lineages while filtering by location, date, gene, and mutation of interest. COVID-19 CG provides significant time, labor, and cost-saving utility to diverse projects on SARS-CoV-2 transmission, evolution, emergence, immune interactions, diagnostics, therapeutics, vaccines, and intervention tracking. Here, we describe case studies in which users can interrogate (1) SNVs in the SARS-CoV-2 Spike receptor binding domain (RBD) across different geographic regions to inform the design and testing of therapeutics, (2) SNVs that may impact the sensitivity of commonly used diagnostic primers, and (3) the recent emergence of a dominant lineage harboring an S477N RBD mutation in Australia. To accelerate COVID-19 research and public health efforts, COVID-19 CG will be continually upgraded with new features for users to quickly and reliably pinpoint mutations as the virus evolves throughout the pandemic and in response to therapeutic and public health interventions.


Introduction
[...] contact tracing efforts and to inform public health decisions; these are paramount to the re-opening of countries and inter-regional travel (Collins 2020; Rockett et al. 2020; Oude Munnink et al. 2020; Gudbjartsson et al. 2020; Pybus et al. 2020). Yet, the quantity and complexity of SARS-CoV-2 genomic data (and metadata) make it challenging and costly for the majority of scientists to stay abreast of SARS-CoV-2 mutations in a way that is meaningful to their specific research goals. Currently, each group or organization has to independently expend labor, computing costs, and, most importantly, time to curate and analyze the genomic data from GISAID before they can generate specific hypotheses about SARS-CoV-2 lineages and mutations in their population(s) of interest.

Results
To address this challenge, we built COVID-19 CoV Genetics (COVID-19 CG, covidcg.org), a [...] found in only 1.05% of the Australian SARS-CoV-2 sequences before June, now constitutes more than 90% of the sequenced June through August isolates (Figure 2C). This geographical and temporal variation is important to incorporate into the design and testing of therapeutic antibodies (such as those under development by Regeneron that specifically target the SARS-CoV-2 Spike RBD), as well as mRNA or recombinant protein-based vaccines. This will help to assure developers of the efficacy of their therapeutics and vaccines against the SARS-CoV-2 variants that are present in the intended location of implementation.

In addition, COVID-19 CG can be harnessed to track changes in SARS-CoV-2 evolution post-implementation of therapeutics and vaccines. It will be crucial to watch for rare escape variants that could resist drug- or immune-based interventions and eventually become the dominant SARS-CoV-2 variant in the community. This need was particularly emphasized by a Regeneron study demonstrating that single amino acid variants could evolve rapidly in the SARS-CoV-2 Spike to ablate binding to antibodies that had been previously selected for their ability to neutralize all known RBD variants; these amino acid variations were found either inside or outside of the targeted RBD region, and some are already present at low frequency among human isolates globally (Baum et al., 2020). Baum et al. suggested that these rare escape variants could be selected under the pressure of single antibody treatment and therefore advocated for the application of cocktails of antibodies that bind to different epitopes to minimize SARS-CoV-2 mutational escape. A recent study by Greaney et al. generated high-resolution 'escape maps' delineating RBD mutations that could potentially result in virus escape from neutralization by ten different human antibodies (Greaney et al., 2020). Based on lessons learnt from the rise of multidrug resistant bacteria and cancer cells, it will be of the utmost importance to continue tracking SARS-CoV-2 evolution even when multiple vaccines and therapeutics are implemented in a given human population.

Diagnostics developers can evaluate their probe, primer, or point-of-care diagnostic according to user-defined regional and temporal SARS-CoV-2 genomic variation. More than 665 established primers/probes are built into COVID-19 CG, and new diagnostics will be continually incorporated into the browser. Users can also input custom coordinates or sequences to evaluate their own target sequences and design new diagnostics.

Case study of SNVs that could impact the sensitivity of diagnostic primers: A recent preprint alerted us to the finding that a common G29140T SNV, found in 22.3% of the study's samples from Madera County, California, was adversely affecting SARS-CoV-2 detection by the NIID_2019-nCoV_N_F2 diagnostic primer used at their sequencing center; the single SNV caused a ~30-fold drop in the quantity of amplicon produced by the NIID_2019-nCoV_N_F2/R2 primer pair (Vanaerschot et al., 2020). We used COVID-19 CG to detect other SNVs that could impact the use of this primer pair, discovering that there are SARS-CoV-2 variants in several countries with a different C29144T mutation at the very 3' end of the same NIID_2019-nCoV_N_F2 primer (Figure 3A). As Vanaerschot et al. noted, SNVs could impact assay accuracy if diagnostic primers and probes are also being used to quantify viral loads in patients. We found that at least ten other primer pairs could potentially be at risk in different geographical regions due to SNVs that appear proximal to the 3' ends of primers [...] N_Sarbarco_R1; and Institut Pasteur, Paris 12759Rv. We advocate that labs and clinics use COVID-19 CG (https://covidcg.org) to check their most commonly used primers and probes against the SARS-CoV-2 sequences that are prevalent in their geographic regions.
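The kind of check described in this case study can be illustrated with a minimal sketch that asks how far an SNV lies from the 3' end of a primer. This is not the COVID-19 CG implementation. The `Primer` class, both function names, and the `window` threshold are hypothetical. The primer start coordinate (29125) is assumed for illustration; only the 3'-terminal position 29144 and the SNV positions 29140 and 29144 come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Primer:
    """Hypothetical primer record using 1-based, inclusive genome coordinates."""
    name: str
    start: int     # lowest genome coordinate covered by the primer
    end: int       # highest genome coordinate covered by the primer
    forward: bool  # True for a forward primer, False for a reverse primer


def distance_from_3prime(primer: Primer, snv_pos: int) -> Optional[int]:
    """Return the distance of an SNV from the primer's 3' end.

    0 means the SNV hits the 3'-terminal base; None means the SNV
    lies outside the primer binding site.
    """
    if not primer.start <= snv_pos <= primer.end:
        return None
    # A forward primer's 3' end is its highest coordinate; a reverse
    # primer's 3' end is its lowest coordinate.
    return primer.end - snv_pos if primer.forward else snv_pos - primer.start


def at_risk_snvs(primer: Primer, snv_positions: Iterable[int],
                 window: int = 5) -> Dict[int, int]:
    """Map each SNV within `window` bases of the 3' end to its distance."""
    flagged = {}
    for pos in snv_positions:
        dist = distance_from_3prime(primer, pos)
        if dist is not None and dist < window:
            flagged[pos] = dist
    return flagged


# Start coordinate is an assumption; the 3' end (29144) is taken from the text.
n_f2 = Primer("NIID_2019-nCoV_N_F2", start=29125, end=29144, forward=True)
print(at_risk_snvs(n_f2, [29140, 29144]))
```

The five-base window is an arbitrary choice for the sketch; in practice, the tolerated distance depends on the assay chemistry.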

Researchers and public health professionals can use COVID-19 CG to gain insights as to how the virus is evolving in a given population over time (e.g., in which genes are mutations occurring, and do these lead to structural or phenotypic changes?). For example, users can track D614G distributions across any region of interest over time.

[...] goal that affects all of humanity, we advocate for the increased sequencing of SARS-CoV-2 isolates from patients (and infected animals) around the world, and for these data to be shared in as timely a manner as possible.
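As a rough illustration of tracking a mutation in a region over time, as with the D614G example above, the sketch below computes the monthly fraction of sequences from one location that carry a given mutation. It is not the COVID-19 CG code. The function name, the record layout, and the mutation labels such as "S:D614G" are assumptions, and the toy records are invented purely to exercise the function.

```python
from collections import defaultdict
from datetime import date


def monthly_frequency(records, mutation, location):
    """Fraction of sequences per (year, month) that carry `mutation`.

    `records` is an iterable of (location, collection_date, mutations)
    tuples, where `mutations` is a set of mutation labels.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for loc, collected, mutations in records:
        if loc != location:
            continue
        month = (collected.year, collected.month)
        totals[month] += 1
        if mutation in mutations:
            hits[month] += 1
    # Months with no sequences are omitted rather than reported as zero.
    return {month: hits[month] / totals[month] for month in sorted(totals)}


# Invented toy records, not real surveillance data.
toy_records = [
    ("Region A", date(2020, 5, 1), {"S:D614G"}),
    ("Region A", date(2020, 5, 15), set()),
    ("Region A", date(2020, 7, 2), {"S:D614G"}),
    ("Region A", date(2020, 7, 9), {"S:D614G"}),
    ("Region B", date(2020, 7, 9), set()),
]
print(monthly_frequency(toy_records, "S:D614G", "Region A"))
# {(2020, 5): 0.5, (2020, 7): 1.0}
```

Because the denominator is the number of sequenced isolates, not the number of infections, such frequencies reflect sampling as much as prevalence.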

Data Pipeline
Our data processing pipeline is written with the Snakemake scalable bioinformatics workflow engine (Koster and Rahmann, 2012).

Application Compilation
The web application is written in JavaScript and primarily uses the libraries React.js, MobX, and Vega. The code is compiled into JavaScript bundles by webpack. All sequence data is compressed and injected inline as JSON into the JavaScript bundle, so no server is needed to serve data to end users. The compiled application files can then be hosted on any static server.
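The idea of shipping data inside the bundle rather than behind a server can be sketched as a build step that serializes records to JSON, compresses them, and writes the result into a JavaScript module as a string constant. The text does not specify the compression scheme, so gzip plus base64 are assumptions here, as are all function names and the exported constant `SEQUENCE_DATA_B64`. This is one possible shape, not the project's build code.

```python
import base64
import gzip
import json


def build_inline_payload(records) -> str:
    """Serialize to compact JSON, gzip it, and base64-encode for embedding."""
    raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def render_js_module(payload: str) -> str:
    """Emit a JavaScript module that a bundler could include as-is.

    The base64 alphabet contains no quotes or backslashes, so the payload
    is safe inside a double-quoted JavaScript string literal.
    """
    return f'export const SEQUENCE_DATA_B64 = "{payload}";\n'


def decode_inline_payload(payload: str):
    """Reverse of build_inline_payload, used here to check the round trip."""
    return json.loads(gzip.decompress(base64.b64decode(payload)))


# Hypothetical record layout, for illustration only.
records = [{"id": "hypothetical/1", "snvs": [23403]}]
payload = build_inline_payload(records)
assert decode_inline_payload(payload) == records
print(render_js_module(payload)[:40])
```

In the browser, the matching decode step would need a gzip-capable decompressor; base64 inflates the compressed bytes by roughly a third, which is the cost of keeping the payload a plain string.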