SWAMPy: Simulating SARS-CoV-2 Wastewater Amplicon Metagenomes with Python

Motivation: Tracking SARS-CoV-2 variants through genomic sequencing has been an important part of the global response to the pandemic. As well as whole-genome sequencing of clinical samples, this surveillance effort has been aided by amplicon sequencing of wastewater samples, which proved effective in real case studies. Because of its relevance to public healthcare decisions, testing and benchmarking wastewater sequencing analysis methods is also crucial, which necessitates a simulator. Although metagenomic simulators exist, none are fit for the purpose of simulating the metagenomes produced through amplicon sequencing of wastewater. Results: Our new simulation tool, SWAMPy ( S imulating SARS-CoV-2 W astewater A mplicon M etagenomes with Py thon), is intended to provide realistic simulated SARS-CoV-2 wastewater sequencing datasets with which other programs that rely on this type of data can be evaluated and improved. Availability: The code for this project is available at https://github. com/goldman-gp-ebi/SWAMPy . It can be installed on any Unix-based operating system and is available under the GPL-v3 license.


Introduction
Wastewater sequencing has proven useful in the genomic surveillance of SARS-CoV-2 and can provide a less-biased picture of the variants circulating in a population than clinical surveillance (1). Amplicon sequencing is the preferred method for this purpose since it is efficient in terms of cost, labour and time, and is well-suited for heavily contaminated samples -as may be found with biological samples collected for SARS-CoV-2 sequencing -thanks to its targeted nature (2). Such sequencing has typically been done via multiplex PCR using a pre-defined primer set with paired-end reads generated by an Illumina device (1).
A number of methods and software tools for wastewater SARS-CoV-2 sequencing data analysis are available such as SAM Refiner (3), COJAC (4), LCS (5) and Freyja (6). Evaluating the effectiveness of new methods on in vivo or in vitro samples is often difficult or impossible, for example because of the lack of availability of a wide range of real or synthetic samples and the costs of repeated experiments (7). However, simulated datasets can provide an efficient way of benchmarking new methods' performances (8).
There is a specific set of features characteristic to data coming from wastewater amplicon sequencing. For example, it has been shown that there is a high variation in amplification across different amplicons of a given primer set, resulting in a variation in read depth across the genome (1,2). Moreover, wastewater data is expected to represent a mixture of different SARS-CoV-2 variants since the biological matter in the sample comes from multiple people, and will carry RNA degradation signatures resulting from the environmental exposure of the viral RNAs in sewage as well as PCR, sequencing library layout-specific and sequencing device-specific errors (9)(10)(11).
Existing standard metagenomic simulators do not attempt to capture all the characteristics seen in the amplicon sequencing protocols used for SARS-CoV-2 such as the ARTIC community protocols (12; https://artic.network/ncov-2019; https://artic.network/resources/ncov/ ncov-amplicon-v3.pdf) widely used to prepare samples for Illumina sequencing platforms. For example, InSilicoSeq (13) is intended to produce shotgun metagenomic sequences; and while Grinder (7) can simulate amplicon sequencing, it cannot be tuned to produce a bespoke amplicon distribution and does not produce realistic sequencing quality scores. The simulation tool ART (14) can also generate reads for amplicons, but only in equal proportions. Studies (5,(15)(16)(17) have demonstrated the need for a dedicated wastewater SARS-CoV-2 sequencing simulator. Each, however, performed its own simulations for its specific use case, with most simulators limited to simplified scenarios with uniform amplicon abundances and only read errors (e.g. 15,16,18), or recreating the read-depth variation of one specific real experiment (e.g. 5,17). Both cases have drawbacks, either risking overfitting to one dataset, or omitting important characteristics of real life data such as PCR errors and variable amplicon abundances.  (29)), and Sanger sequencing and ABI SOLiD (30). Illumina layout indicates single ended (SE) or paired end (PE) reads. Output file types are described by (31). Mix. indicates whether the program accepts a list of genomes to simulate a mixed sample. Amps.: can the program simulate amplicon sequencing protocols. VAA: are variable (i.e. non-uniform) amplicon abundances modeled. Ampl. bias: whether differential amplification success is modelled, based on (e.g.) reads' GC content. Regarding SNP and indel errors, HF denotes high-frequency errors, i.e. recurring position-dependent errors, possibly appearing at the same position across different amplicons, as described in Methods; whereas Reads denotes independent per-read errors, e.g. from sequencing device error. 1 VLQ is a tool for analyzing wastewater sample sequence data for SARS-CoV-2 variant quantification; it has an unnamed simulation tool available as part of its benchmarking code. 2 Output from Grinder can be FASTQ-formatted, but the quality scores serve only to indicate whether or not each position incurred a sequencing error.
shows a comparison of some of these simulation tools. Our simulator, SWAMPy (Simulating SARS-CoV-2 Wastewater Amplicon Metagenomes with Python), is intended to produce a realistic set of reads that might be generated through multiplex PCR of a wastewater sample, and sequenced by an Illumina sequencer. We model the following scenario: 1. Different viral genomes coming from a human population contaminate wastewater systems, creating a mixture of virus variants which is then captured in a wastewater sample. At this stage, viral RNAs are exposed to RNA degradation and there is a variation in variant abundance in the mixture.
2. After sample collection, PCR amplification of whole viral genomes in segments using a pre-defined primer set results in an amplicon population. At this stage, amplicons gain PCR errors and there is further variation in amplicon abundance due to differential amplification success of the primers of a given primer-set.
3. These amplicons are then sequenced on an Illumina device, creating paired-end reads of a fixed length. At this stage, sequencing errors appear.

Methods
The overall workflow of SWAMPy can be seen in Fig. 1. The four basic steps of our software pipeline are detailed as follows: 1. Create an initial amplicon population 2. Simulate the number of DNA fragments (copies) per amplicon 3. Simulate high-frequency errors by mutating amplicons in the amplicon population 4. Simulate sequencing reads using ART A. Create an initial amplicon population. The software assumes that the user has supplied a set of SARS-CoV-2 genomes, which we refer to as source genomes, and has selected a primer set among the supported primer sets in the program. The default primer set is ARTIC version 1. On the basis of this selection, amplicons are extracted from each genome as follows. First, we use Bowtie 2 (32) to align the primers (forward and reverse complement) to each virus genome to detect primer binding positions on the source genomes. Next, we slice the source genomes from the primer binding positions to obtain individual amplicons of each source genome, including the primer sequences. During the alignment step, some primers may not align well with the viral genome, and in those cases, the corresponding amplicon is not produced. While this is a strict penalty for primer misalignment, it approximates the observation that some amplicons can be dropped due to mutations in the target regions (33).

B. Simulate numbers of copies per amplicon.
To simulate numbers of copies per amplicon and genome, we offer two versions of a combined Multinomial and Dirichlet model. For either of these models, the user must supply three parameters: total target number of reads N , a vector of genome abundances p g indexed over genomes g, and a Dirichlet parameter c. The choice of c roughly equates to a measure of sample quality: a higher value of c (e.g. 200) corresponds to high quality samples (roughly uniform abundances of amplicons between simulation runs) and a lower value (e.g. 10) to low quality samples (highly variable amplicon abundances between simulations, with higher rates of amplicon dropout). For the supported primer sets SWAMPy provides an experimentally derived prior on the amplicon proportions π a indexed over amplicons a.
Model 1 (equal expected amplicon proportions across genomes): Sample amplicon proportions p a from Dirichlet(c × π a ) to be shared across all genomes 3. For each genome, sample numbers of reads per amplicon per genome as x a,g ∼ Multinomial(N g , p a ) Model 2 (different amplicon proportions across genomes): For each genome, independently sample amplicon proportions p a,g from Dirichlet(c × π a ) 3. For each genome, sample numbers of reads per amplicon per genome as x a,g ∼ Multinomial(N g , p a,g ) One subtlety in this process is that the numbers of reads do not account for amplicons dropped in the alignment step, which leads to some missing reads. For example, if the model assigns 100 reads to amplicon 1 in genome A, yet a mutation at the primer site of this amplicon causes it to drop out, then the total number of reads produced would be 100 fewer than expected. Hence the actual total number of reads may be less than the target, N .

C. Add high-frequency errors.
While we model sequencing error with standard bioinformatics tools, we developed a new model for the effects of RNA degradation, PCR mutations, and other library preparation artefacts, which we collectively refer to as high-frequency errors. We define highfrequency errors as non-naturally occurring insertions, deletions and substitutions that non-independently affect multiple reads. We further classify high-frequency errors as unique or recurrent with respect to their appearance across different source genomes in the mixture. Recurrent errors are the ones that are present in all source genomes in the simulated mixture, consistent with our observation of individual errors affecting multiple real wastewater sequencing experiments. These might originate for example from genomic positions particularly susceptible to degradation, or context-dependent PCR errors (34,35). In contrast, unique high-frequency errors are present in only one of the genomes in the mixture, corresponding for example to PCR errors that are not contextdependent, and low-rate RNA degradation errors. unique errors, the expected value of the VAF in the final mixture will be the product of the assigned VAF f and the corresponding amplicon abundance π a .
C.2. Apply sampled errors to simulated amplicons. After we compile the table that contains all simulated errors, we process each source genome g and each amplicon a in the amplicon population that we previously created. For each a, g: 1. Errors that affect genome g and amplicon a are selected from the error table.
2. Because simulated error positions are based on the Wuhan-Hu-1 reference and a variant amplicon in a wastewater sample may contain indels, the amplicon sequences are aligned to Wuhan-Hu-1 using Bowtie 2 (32) and the positions of the errors within the amplicon are determined.
3. For each error e, the number of reads in which e is present is determined by sampling a read count n e from Binomial(x a,g , f e ), where x a,g is the total read count of amplicon a for genome g as described in Section B, and f e is the VAF of the error as determined in Section C.1.
4. For each possible combination i of high-frequency errors affecting genome g and amplicon a, a read count n i is randomly sampled respecting individual read counts of the errors. We make no attempt to simulate correlations among the errors on amplicons as simulating error inheritance for each amplicon is computationally too expensive and we assume errors on an amplicon are independent.
5. Finally, for each combination i of errors affecting a and g, a new corresponding modified amplicon sequence is created.
D. Simulate read sequencing using ART. To create a set of simulated paired-end Illumina reads from each amplicon, each with a given read count, we use the program ART (14). We use ART's paired-end amplicon mode, as well as the noALN and maskN flags. These settings create 150bp paired-end reads, and faithfully transcribe any "N" characters appearing within the amplicons. We use a set of default error rates and quality score profiles tuned for the Illumina MiSeq V3 sequencer, though the ART package has options available for other platforms and read lengths. A full list of the flags used is in the supplementary material.
Finally, we use a custom script based on ubiquitous bash utilities to concatenate all of the FASTQ read files, and shuffle their order to avoid potential biases in case any downstream application software can be influenced by read ordering.

Results
The source code of our python implementation of SWAMPy, together with the program documentation and exemplar files is available under the GPL-v3 license at https:// github.com/goldman-gp-ebi/SWAMPy.
SWAMPy Implementation. SWAMPy takes as input a multi-FASTA file containing the SARS-CoV-2 variant genomes that will be present in the simulated wastewater sample, as well as a file that contains the relative abundances of these variants in the mixture. For ease of use, other input files (primer-set-specific sequence files, and primer-set-specific amplicon distribution files) were wrapped with a single --primer-set parameter which loads the corresponding input files for the specified primer set. As of December 2022, there are three supported primer sets: ARTIC V1, ARTIC V4 (12) and Nimagen V2 (37). There are many command line parameters that allows fine control of the program such as the parameter c that reflects the quality of the wastewater sample as described in Section B, target number of simulated reads, and error rates, VAF and lengths of high-frequency errors as described in Section C.1. The full list of command line interface arguments and their explanations are available on the GitHub wiki page: https://github.com/goldman-gp-ebi/ SWAMPy/wiki/CLI-arguments.
An example SWAMPy run takes 300 seconds to complete and reaches 700MB of max memory when run with default parameters (three SARS-CoV-2 variants and default error rates and 100,000 total read counts) on a single thread of an Intel Xeon Gold 6336Y 2.40GHz CPU.
SWAMPy produces five output files by default: • FASTQ files of the simulated forward and reverse reads, matching Illumina standards • A table that shows the abundance of each wild-type amplicon after the randomness in amplicon copy number sampling (as described in Section B) was applied • A VCF file that contains all the intended highfrequency errors from the error table described in Section C.1 • A log file Alignment images of simulation outputs illustrate that the major characteristics of wastewater data are present in the simulated data (Fig. 2) such as overlapping amplicons, different kinds of errors and read depth variation across the genome.
Use Case. We used SWAMPy to simulate 73 time points throughout the course of a hypothetical SARS-CoV-2 pandemic where the Alpha (B.1.1.7) variant starts out dominant before Delta (AY.4) rises in frequency and then Omicron (BA.1.1) emerges and takes over ( Fig. 3; see the supplementary material for the SWAMPy options used; for the exact abundances at each time point, see supplementary material). Then we used a downstream application program, Freyja, which is designed to detect SARS-CoV-2 variants and their relative abundances from sequencing data obtained from wastewater samples (6,25,27,28). We observe that Freyja is quite successful in demixing the simulated data overall in this relatively complex scenario, though it sometimes inferred the presence of variants that are not specified in the simulated mixture potentially with high frequencies. This probably stems from some high-frequency errors in the simulated data as described in Section C.

Conclusions
We have shown that SWAMPy is a viable simulation tool for SARS-CoV-2 wastewater metagenomes, building on the simulator ART but much better suited to the modelling challenges idiosyncratic to SARS-CoV-2 metagenomes such as high-frequency errors and irregular amplicon abundance profiles. Both of these models are based on real abundance and error data, from a large number of in vitro whole-genome amplicon sequencing experiments, detailed in the supplementary material. Compared with other metagenomic simulators, many of which support a range of complex features such as chimeric amplicons and shotgun metagenomic reads, SWAMPy aims to fill a niche created by metagenomic amplicon studies such as wastewater surveillance of SARS-CoV-2. This niche seems important given the disparity between features available in most general-purpose metagenomic simulators (Table 1) and the requirements of tools being developed for wastewater studies. Our simulator supports two versions of the ARTIC protocol, which at present is the most prevalent sequencing protocol for SARS-CoV-2 metagenomes, and the Nimagen V2 protocol. We will strive to support future iterations of these, as well as new superseding protocols as they arise in the future. There are other areas where we hope to make improvements to modelling and usability, such as supporting a greater range of sequencing platforms, ensuring that amplicon dropout rates match closely with experimental findings and accounting for some additional parameters in high-frequency error models, which requires a deeper understanding of error mechanisms through controlled experiments. We hope that this simulation tool will prove valuable by providing non-trivial test cases especially for strain-resolving SARS-CoV-2 metagenomics algorithms, and for creating control case data for researchers working on SARS-CoV-2 wastewater studies. Wastewater surveillance can provide a cheaper alternative to widespread sequencing of clinical SARS-CoV-2 samples, and it is our hope that through appropriate modelling and simulation of the processes involved in amplicon sequencing of wastewater, these data can be leveraged to their full potential in aiding public health.

Supplementary Note 1: Parameter Estimation and Simulation Experiment Parameters
A. Amplicon abundance estimation. To estimate Dirichlet parameters for amplicon abundances of the AR-TIC v3 and Nimagen v2 primer schemes, we summed amplicon counts over a number of experiments, using results from both synthetic SARS-CoV-2 sequences and real wastewater data. In either case, viral RNA was reverse transcribed before being amplified using each of these two primer schemes.
The spreadsheets at https://github.com/goldman-gp-ebi/SWAMPy/tree/main/supplementary_files provides a summary of amplicon counts through a number of experiments. We excluded amplicon counts in experiments where there was an obvious systematic bias: in some experiments one of the two primer pools failed, and we discarded these results. Other amplicons failed to amplify in the synthetic virus genomes; this was due to the synthetic genomes being produced in 5 kbp chunks. We again discarded these amplicon counts. Finally we normalised our counts' values so that they summed to 1.
For the Artic v4 primer scheme, we used an amplicon abundance profile based on a single experiment (Joshua Quick, personal communication). The coverage file for this experiment is again provided at https://github.com/goldman-gp-ebi/ SWAMPy/tree/main/supplementary_files.

B. ART Parameters.
To simulate sequencing errors we used the program art_illumina (part of the ART suite of simulation tools (14)), with the following command-line parameters: The variables SEQ_SYS, READ_LENGTH, and NUMBER_OF_READS are set by the user.
SEED, IN-PUT_AMPLICON_FASTA, and OUTPUT_FASTQ_FILENAME are changed for each amplicon of each genome that is being simulated. To define the default high-frequency error rates in SWAMPy, we performed an error characterisation analysis on 121 real wastewater sequencing datasets. Wastewater sequencing experiments were conducted by members of JBC-led Wastewater Genomics collaboration. 12 of the 121 samples are mixtures of synthetically produced SARS-CoV-2 variant genomes, which mimic wastewater samples. Their ENA accessions are: ERR10084556, ERR10084565, ERR10084558, ERR10084543, ERR10084590, ERR10084581, ERR10084585, ERR10084577, ERR10084594, ERR10084592, ERR10084549, ERR10084547, ERR10084599, ERR10084595, ERR10084586, ERR10084579, ERR10084589, ERR10084564, ERR10084588, ERR10084562. 109 of them were sampled from wastewater in the UK and the data will be publicly available soon.
For each sample of these datasets, we mapped raw reads to the Wuhan-Hu-1 (36) reference genome using Bowtie 2.4.4 (32). We then used bcftools mpileup 1.13 (39) to obtain VCF files. We did not perform a separate variant calling as we are interested in errors as opposed to SNPs and needed every discrepancy between the reference genome and our sample reads; henceforth will refer to such differences as "variants". We filtered out positions with a read depth (DP) < 10 and with < 5 reads supporting the alternative allele (AD). The remaining variants are classified into different categories as summarised in Fig. 4. First, if the variant allele frequency (VAF) of a variant is low, in particular lower than 0.02, we filtered out that variant since such low-frequency variants might be the result of standard sequencing errors (40) and are unlikely to affect downstream analyses. The remaining variants included real polymorphisms between different SARS-CoV-2 variants, as well as possible high-frequency errors, which is the group of errors of interest. To identify likely high-frequency errors, we focused on putative nonviable mutations, since nonviable mutations cannot be real polymorphisms.
C. High-frequency errors parameter estimation. We define putative nonviable mutations as nonsense (stop codon) substitutions or indels of length not a multiple of three found on ORF1ab and S open reading frames. We excluded other smaller and less characterized genes from the analysis since these can present viable nonsense mutations (41,42). To further eliminate possible real polymorphisms, we also excluded a portion from the 3 ′ ends of the ORF1ab and S open reading frames since nonsense mutations could be tolerable there. The exact positions included are 266-12000, 13465-20000 and 21563-25000. We further divided the high-frequency errors as either recurrent or unique based on if they appeared in more than one wastewater sample. We subdivided both recurrent and unique errors into insertions, deletions and substitutions subclasses, and estimated default rates of these events by counting the observed variants in each class and normalizing by the number of genome positions with sufficient coverage and correcting for the expected proportions of high-frequency errors that would not have been identified (indels with length multiple of three and non-stop codon substitutions). A more exhaustive description of the approach to infer high-frequency error rates and lengths is given in the following subsections.
C.1. Indel error length distributions. Following the length distributions observed in our putative high-frequency indel errors (see figure 5), we model high-frequency indel error lengths as: • Insertions: uniform distribution with minimum length 1 and maximum 14 (the minimum and maximum lengths found in real data) U (1, 14).
We could not use a standard distribution fitting technique to estimate the indel length geometric distribution parameter p because we cannot observe putative high-frequency error deletions with length multiple of three. Instead, to avoid potential biases due to the missing data, we estimate p using only the counts of putative high-frequency deletion errors of length one and two. Since under Geometric(p) the ratio of the probability of lengths l = 2 and l = 1 is We use the estimatorp where C 2 and C 1 are the observed counts of putative high-frequency deletion errors of length 2 and 1 respectively.
C.2. High-frequency error rate estimation. We estimated six separate high-frequency error rates, one for each of the six classes of high-frequency errors. For each high-frequency error class i, we call C i its putative error count observed in the real data, and we estimate its rate r i asr where f i is the missing data correction factor for class i, and L is the total number of genome positions, across all real datasets, that we considered when looking for putative high-frequency errors. For substitutions, f i corrects for the fact that For insertions, the correction factor f i is simply 3/2 because we assume a uniform high-frequency insertion length distribution and because we did not count insertion with length multiple of three among putative high-frequency insertion errors.
For the deletion correction factor we took a similar approach, but accounting for the assumed geometric distribution of lengths. In a geometric distribution with parameter p, the total probability of all multiples of three is given by where for the last step we used the identity x≥0 x i = (1 − x) −1 . The total probability of the lengths that we do observe is 1 minus this value, i.e. 1 − p(1 − p) 2 /(1 − (1 − p) 3 ), and therefore we use as correction factor f i of high-frequency deletions its inverse: C.3. High-frequency error frequency distributions. To model the default variant allele frequencies (VAF) of high-frequency errors of each type, we use the Beta distribution (see e.g. 43) whose default parameters were estimated from the frequencies of putative high-frequency errors using the method of moments.
D. Use-case. The exact proportions of the SARS-CoV-2 variants at the 73 time points in our simulations are shown in Table 3.