Targeted decontamination of sequencing data with CLEAN

Background Many biological and medical questions are answered based on the analysis of sequence data. However, we can find contaminations, artificial spike-ins, and overrepresented rRNA sequences in various read collections and assemblies; complicating data analysis and making interpretation difficult. In particular, spike-ins used as controls, such as those known from Illumina (PhiX phage) or Nanopore data (DNA CS lambda phage, yeast enolase ENO2), are often not considered as contaminants and also not appropriately removed during bioinformatics analyses. Findings To address this, we developed CLEAN, a pipeline to remove unwanted sequence data from both long and short read sequencing techniques from a wide range of use cases. While focusing on Illumina and Nanopore data and removing of their technology-specific control sequences, the pipeline can also be used for everyday tasks, such as host decontamination of metagenomic reads and assemblies, or the removal of rRNA from RNA-Seq data. The results are the purified sequences and the sequences identified as contaminated with statistics summarized in an HTML report. Conclusions The decontaminated output files can be used directly in subsequent analyses, resulting in faster computations and improved results. Although decontamination is a task that seems mundane, many contaminants are routinely overlooked, cleaned by steps that are not fully reproducible or difficult to trace by the user. CLEAN will facilitate reproducible, platform-independent data analysis in genomics and transcriptomics and is freely available at https://github.com/hoelzer/clean under a BSD3 license.


Background
The high-throughput sequencing of DNA and RNA has become a standard approach in molecular biology.Next-Generation Sequencing (NGS), predominantly provided by Illumina, is capable of generating high-quality data from DNA and cDNA with low costs and error rates.The relatively short reads (50-300 nt) produced by NGS are, amongst other topics, used for the reconstruction of genomes, identifying SNPs, or characterizing differentially expressed genes.One technological limitation of NGS, the short read length, was overcome in recent years with the development of long-read sequencing technologies (Third-Generation Sequencing; TGS).In particular, Oxford Nanopore Technologies (ONT) provides a small, affordable, and mobile device that can generate reads of unprecedented length from DNA, cDNA, and also native RNA (Hu et al. 2021;Quick et al. 2016).Next to Illumina, the technology was also widely used to sequence SARS-CoV-2 samples during the COVID-19 pandemic (Brandt et al. 2021).The longer reads are used to significantly improve assembly contiguity (Nurk et al. 2022), the taxonomic classification of metagenomic samples (Overholt et al. 2020), or help to characterize alternative splicing in more depth (Naftaly et al. 2021) while technological advances continue to push error rates more closely towards the level of short-read data (Sereika et al. 2022).Since NGS and TGS (or simply "sequencing technologies") are widely used, quality control of the raw sequencing data is becoming increasingly important.Most bioinformatics tools and pipelines identify and trim low-quality bases and remove remaining adapter sequences.However, one crucial step is often overlooked and still poses a challenge for sequencing technologies (Nieuwenhuis et al. 2020): the identification of DNA and/or RNA contamination where material from two or more sources is accidentally mixed or is simply a natural component of the sample, for example originating from cell line preparation (Chrisman et al. 2022) or in metagenomic samples.When contamination happens after sample collection or shipping, the preparation of the sequencing library involving multiple steps in the lab is another possible source (Porter et al. 2021).Apart from such unwanted contaminations, short and well-described control sequences are frequently spiked into sequencing runs to function as calibrations for basecalling and monitor the sequencing run's quality.Most commonly known is the PhiX phage genome, frequently used as a control in Illumina experiments.PhiX sequences were already shown to be large-scale contaminations in microbial isolate genomes because the reads were not cleaned before assembly and publication of genomes in public databases (Mukherjee et al. 2015).Also, we found that the positive control in ONT DNA sequencing (known as DCS), a 3.6 kb standard amplicon mapping the 3' end of the Lambda phage genome, is wrongly labeled as E. coli or Klebsiella quasipneumoniae subsp.similipneumoniae plasmid in the NCBI GenBank (CP077071.1,CP092122.1),see Supplemental Figure 1.For ONT native RNA sequencing, a yeast ENO2 Enolase II transcript of strain S288C, YHR174W, functions as a positive control.Spike-in steps are usually optional; however, the information if a spike-in was used, often does not reach the user working with raw reads.Besides the decontamination of such manually introduced control sequences and other accidentally introduced but known contaminations, other use cases exist, where specific biological sequences should be removed, that can be a natural part of a sample or are still remaining after experimental steps.One prominent example is the removal of ribosomal or mitochondrial RNA from Illumina RNA-Seq samples before read-count normalization and differential gene expression estimation (Wolf 2013;Zhao et al. 2018;Raz et al. 2011).Even if rRNA depletion kits are frequently used to reduce the amount of rRNA before sequencing, rRNA can still be present in a sample.This applies in particular to non-model species where no optimized kit exists (Hölzer et al. 2019).Another example is the removal of host sequences, for example, in human gut microbiome sequencing data (Almeida et al. 2019), which is becoming increasingly important with the advent of metagenomic and metatranscriptomic sequencing.In the past, several tools were developed for the fast classification of sequence data and thus also applicable for decontamination.One approach involves the taxonomic classification of all reads followed by removing unwanted sequences.Tools implementing such an approach are Kraken2/Kraken software suite (Wood et al. 2019 andLu et al. 2022), Clark (Ounit et al. 2015) and Kaiju (Menzel et al. 2016).HoCoRT (Rumbavicius et al. 2022) offers a wrapper around wellknown mapping and classification tools.SourceTracker (Knights et al. 2011), microDecon (McKnight et al. 2019) and Decontam (Davis et al. 2018) follow the metagenomics approach by analyzing the composition of the sample and finding unexpected proportions of contamination taxa.The latter focus on short-read data, while other tools focus on the ONT DNA spike-in, nanolyse (De Coster et al. 2018), or on cloned, exogenous cDNA removal from NGS data, cDNAdetector (Qi et al. 2021).However, and although decontamination of already known species is in many cases a rather easy task with potentially huge benefits, many studies still lack appropriate decontamination of their sequenced samples.One reason for that might be that the output files of many pipelines cannot be directly used for downstream steps such as assembly or annotation and additional formatting of the files and extraction of the results are needed.As a direct result, we can find contamination omnipresent in genomic resources (Steinegger and Salzberg 2020).In particular, with the rise of TGS data, specialized methods are also needed for the fast decontamination of long reads.Mapping reads to a reference genome for decontamination can be a general step while working with sequencing data.Therefore, we developed CLEAN (https://github.com/hoelzer/clean)as an easy-to-use all-in-one decontamination pipeline for short reads, long reads, and any FASTAformatted sequence file.While initially developed for the decontamination of Illumina and Nanopore positive spike-in controls and host sequences in metagenomic samples, we extended the functionality of the pipeline to clean against any provided reference sequence(s).Also, we implemented the removal of rRNA from Illumina RNA-Seq samples in a faster and easier way than current state-of-the-art software (Kopylova, Noé, and Touzet 2012).Furthermore, CLEAN includes a convenient QC report and outputs the intermediate mapping files, which can be used for further investigation.Thus, CLEAN can be easily downloaded, installed, and executed with a single command on a local laptop, a high-performance cluster, or the cloud.We especially focused on well-structured output files and formats so that the decontaminated data files can be directly used in further downstream analyses such as assembly or annotation, thus allowing direct integration of CLEAN in other workflows.We believe that by providing an easy-to-use, expandable and reproducible pipeline, the decontamination of all kinds of sequencing data in molecular biology studies and genomic resources will increase.

Findings Implementation
We implemented the pipeline in the workflow manager Nextflow v21.04.0 or higher (Di Tommaso et al. 2017).Every step is encapsulated in a software container (Docker (Boettiger 2015) or Singularity) or virtual environment (Conda (Grüning et al. 2018)).The modular structure allows updating of the containers and environments periodically.The user can deploy the software directly from GitHub.CLEAN can be easily installed -only Nextflow and one of Docker, Singularity, or Conda must be installed.We offer configurations for local execution, LSF and SLURM workload managers, and a simple cloud execution.

Workflow
CLEAN's input can be single-and paired-end Illumina FASTQ files, ONT FASTQ read files, and FASTA files, see Figure 1.The input is the only required parameter.The user can optionally add a FASTA file for a custom contamination reference.We provide common host genomes, e.g., Homo sapiens, Mus musculus, and Escherichia coli, spike-in sequences for Illumina, direct RNA ONT and DNA ONT sequencing, and an rRNA contamination reference (derived from SortMeRNA (Kopylova et al. 2012) (https://github.com/biocore/sortmerna/tree/master/data/rRNA_databases).
CLEAN concatenates all specified contaminations, e.g., to clean reads of the host and the spikein in one step.Each input file (FASTQ and/or FASTA) is mapped against the contamination reference with minimap2 v2.18 (Li 2018).For Illumina, we also offer a kmer-based option with bbduk (https://sourceforge.net/projects/bbmap/).After the mapping, we separate mapped from unmapped reads or contigs by the primary alignment with SAMtools (Li 2018;Danecek et al. 2021).For ONT data and the DSC control, we provide the parameter --dcs_strict: only reads that map to the DCS and cover at least one of the artificial DCS ends are considered as contamination.By that, we avoid removing similar phage DNA that is actually part of, e.g., a metagenomics sample.If the user sets the parameter --min_clip, mapped reads are filtered by the total length (sum of both ends) of the soft-clipped positions.If --min_clip >= 1, the total number is considered, else the fraction of soft-clipped positions to the read length.The user can optionally specify FASTA files with --keep.Input reads are separately mapped to this reference.If a read maps to the "keep"-reference but was classified as contamination before, CLEAN moves the read to the set of clean reads.Thus, the user can reduce false negatives.This can help in particular when working with closely related species or metagenomic samples.CLEAN creates for the input files as well as the clean and contamination files quality reports with FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) (Illumina reads), NanoPlot (De Coster et al. 2018) (ONT reads), or QUAST (FASTA files).MulitQC (Ewels et al. 2016) summarizes all quality reports and mapping statistics in an HTML report.Besides the MultiQC summary report and the de-and contaminated reads or FASTA files for direct downstream usage, CLEAN also emits the indexed mapping files in BAM format and the indexed contamination reference.If necessary, the user can further examine the results in a genome browser such as IGV (Robinson et al. 2011).Figure 1.Schematic overview of the CLEAN workflow.Gray/blurred elements are optional and depend on the user input.The pipeline can search multiple FASTA or FASTQ inputs against a user-defined set of reference sequences (potential contamination).CLEAN automatically combines different user-defined FASTA reference sequences, built-in spike-in controls, and downloadable host species into one mapping index for decontamination.The user can also specify FASTA files comprising sequences that should explicitly not be counted as contamination.The output is finally filtered to provide well-formatted FASTA or FASTQ files for direct downstream analyses.The icons and diagram components that make up the schematic view were originally designed by James A. Fellow Yates & nf-core under a CCO license (public domain).

External resources
The user can define a contamination reference or choose from included ones.These are the currently provided host genomes in CLEAN version v1.0.0-alpha:Homo sapiens (Ensembl release 99), Mus musculus (Ensembl release 99), Gallus gallus (Ensembl release 99), Escherichia coli (Ensembl release 45), Chlorocebus sabeus (NCBI GCF_000409795.2), and Columba livia (NCBI GCF_000337935.1).A genome is only downloaded once on-demand and can be reused.The list of automatically downloadable references can be easily extended upon request or by experienced users.However, the user can also always provide additional reference FASTAs via a parameter.As an rRNA reference, we provide the rRNA database from SortMeRNA, a tool commonly used to filter rRNA from metatranscriptomic data.The database contains representative rRNA sequences from the Rfam and SILVA databases (see https://github.com/biocore/sortmerna/blob/master/data/rRNA_databases/README.txt).Spike-in sequences for direct RNA and DNA ONT sequencing are taken from Guppy, the basecaller developed by ONT: yeast enolase ENO2/YHR174W of 1.2 kb and a Lambda Phage amplicon of 3.6 kb.By further investigating the latter, we found another resource at the ONT community for the DCS sequence (https://assets.ctfassets.net/hkzaxo8a05x5/2IX56YmF5ug0kAQYoAg2Uk/159523e326b1b791e3b842c4791420a6/DNA_CS.txt).This 3560 nt long sequence is a substring of the Guppy sequence (3587 nt), where the first 27 nucleotides are duplicated at the start, see Supplemental Figure 2 and 3.The first 65 nt (Guppy 92 nt) and the last 48 nt seem to be artificial as they show no hits in a BLAST search against the NCBI nucleotide collection (nr/nt).

Results & Discussion
In the following, we will show the application of CLEAN for three common use cases: 1) removal of DNA attributed to Chlorocebus species (Green Monkey cell line) contamination from hybridassembled Chlamydiifrater samples improves assembly quality, 2) decontamination of the yeast enolase control in Nanopore native RNA-Seq data of a Coronavirus sequencing run, and finally, 3) fast removal of rRNA from an Illumina RNA-Seq data set.

Illumina-sequenced Chlamydiaceae
The polished assemblies based on cleaned reads reveal 1.19 Mb circular genomes and the plasmids 6 kb for each of the four Chlamydiifrater isolates.Without prior decontamination of the cell line DNA, contigs belonging to Chlorocebus species can be found in the final assemblies.Using an older version of Unicycler, running the assemblies without a CLEAN step of the raw read data also yields more fragmented final assembly results, likely due to the inflated complexity of the initial short-read graph.However, this issue was resolved by using a newer version of Unicycler, but still, contigs belonging to the used cell line could be found.Thus, decontamination of DNA belonging to a host cell line can 1) improve the general assembly process and 2) results in a much cleaner assembly.
Case study II: Yeast enolase is a highly abundant spike-in control in Nanopore native RNA-Seq data Nanopore sequencing is currently the only technology that allows the sequencing of native RNA strands without a cDNA intermediate (Ergin, Kherad, and Alagoz 2022).This 'direct RNA' protocol includes the addition of a calibration strand (amplified RNA sequences of the S. cerevisiae Enolase 2 mRNA, GenBank, NP_012044.1)as a spike-in positive control.Depending on the concentration of sample input RNA, this spike-in can represent a substantial fraction of the sequenced reads.In our study of direct RNA sequencing of Human Coronavirus genomes (Viehweger et al. 2019) these sequences made up 15.8% and 10.2% of the two samples, respectively.Due to algorithmic advances, re-basecalling the raw data with version 4.0.11 of the Guppy basecaller (RNA models are unchanged since then) yields more reads and a higher fraction of spike-in reads (31.4% and 31.0%,see Figure 2).Guppy does not filter these with default parameters but has an optional parameter (--calib_detect) to enable detection and filtering calibration strand reads.However, we found that this functionality does not adequately detect spike-in reads: 35.4% and 19.8% of spike-in reads were still present using this parameter.Applying CLEAN to this dataset removes all calibration strand reads (see Figure 2).Generally, if a positive control is not needed for the experiment, we suggest skipping the addition of this spike-in.This can increase the yield of desired RNA reads by freeing up throughput capacity.For all direct RNA read data with added spike-in, we propose using CLEAN to remove these sequences reliably and quickly before downstream analyses are performed.Case study III: Speeding up an everyday task in transcriptomics -removal of rRNA from Illumina RNA-Seq data CLEAN performs equally compared to SortMeRNA in terms of selectivity: 99.99 % of simulated non-rRNA reads are detected as non-rRNA reads with CLEAN; SortMeRNA achieves slightly less with 99.94 %.Regarding sensitivity, CLEAN performs at the same level as SortMeRNA with a maximum difference of 6.17 % at Set 2, see Table 1.On the real-data sample, CLEAN runs about 1.7 fold faster than SortMeRNA, see Figure 3 1.Sensitivity comparison of CLEAN and SortMeRNA v4.3.4 for six simulated datasets consisting of 1 Mio Illumina rRNA reads each.CLEAN has a slightly decreased sensitivity than SortMeRNA; however, it is much faster (Figure 3).

Test data sets and computations
Case study I: Removal of cell cultivation contamination from Nanopore-and
Case study II: Coronavirus native RNA sequencing with Nanopore Virus generation, RNA isolation, sample preparation, and sequencing are detailed in (Viehweger et al. 2019).Briefly, Huh7 cells were infected with recombinant HCoV-229E variants, yielding two samples in cell culture (WT and SL2).Total RNA of these was isolated, and 1 µg of RNA in 9 µL was carried into the library preparation with the Oxford Nanopore direct RNA-Seq (DRS) protocol (SQK-RNA001).Sequencing ran for 48h on an R9.4 flow cell on a MinION device.For this study, the raw data was basecalled with Guppy (version 4.0.11),once with and once without the --calib_detect parameter.Assignment of reads to either HCoV-229E, S. cerevisiae Enolase 2, or human was done by mapping to a combined reference of all three with minimap2 (version 2.17, parameters: -ax splice -k14).Finally, we used CLEAN on the basecalled DRS reads with calibration strand detection and compared the results to the manual assignment.All commands and the plotting script are available from the supplement.
Case study III: rRNA removal from bulk RNA-Seq Illumina data We tested and compared CLEAN's functionality to remove ribosomal RNA in terms of sensitivity and selectivity against SortMeRNA (v4.3.4) (Kopylova, Noé, and Touzet 2012).All seven simulated datasets were downloaded from (Kopylova, Noé, and Touzet 2012).Briefly, here 1 million single-end rRNA Illumina reads with a read length of 100 bp were simulated with different identities with respect to the SILVA database, or origin from truncated sections of the bacteria phylogenetic tree.One of the seven simulated samples contains non-rRNA reads to test for selectivity.We converted the provided FASTA files into FASTQ files with seqtk (v1.3-r106, https://github.com/lh3/seqtk).To compare runtime performance, we chose a non-simulated Illumina RNA-Seq sample (GEO Accession GSM3431091) from a bat transcriptome study (Hölzer et al. 2019).For total RNA obtained from a bat (Myotis daubentonii) cell line, cDNA libraries were prepared utilizing the Illumina Ribo-Zero rRNA Removal Kit for human/mouse/rat.We used CLEAN with the --rrna parameter and SortMeRNA to remove rRNA reads from the sample.We run each tool ten times with 30 threads to compare runtime differences measured with Linux's time command on a Linux server (CPU: Opteron 6376, 64 x 2,1 GHz, RAM: 768 GB).

Conclusion
We developed CLEAN to easily screen any nucleotide sequences against reference sequences to identify and remove potential contamination.Therefore, common tasks are the removal of positive controls added during library preparation, host contamination, or ribosomal RNAs.Decontamination with CLEAN can be easily pre-connected to the actual analysis as the output needs no further processing or reformatting.The pipeline uses alignment-based approaches for short-and long-reads that subsequently also allow for inspection of the reads aligned to a potential contamination reference in more detail.Furthermore, CLEAN provides quality control reports for more insights.CLEAN is freely available at https://github.com/hoelzer/cleanand can be easily installed and executed using Nextflow.

Limitations
CLEAN cannot be used for the removal of unexpected contaminations.For such a task, DecontaMiner, a tool to remove contaminating sequences of unmapped reads (Sangiovanni et al. 2019), or QC-Blind, a tool for quality control and contamination screening without a reference genome (Xi et al. 2019) can be used.Other tools try to find unexpected compositions in metagenomics samples to identify contaminations (McKnight et al. 2019), (Davis et al. 2018).With CLEAN we did also not focus on the detection of cross-contamination where other tools such as ART-DeCo (Fiévet et al. 2019) can be used.Furthermore, CLEAN should not be used where tools with higher sensitivity are available, e.g., SortMeRNA for rRNA annotation and Kraken2 for taxonomic classification.

Availability of Supporting Source Code and Requirements
 Project name: CLEAN  Project home page: https://github.com/hoelzer/clean Operating system(s): Platform independent due to workflow managment system and container usage  Programming language: Nextflow, Bash  Other requirements: Nextflow v21.04.0 or higher (compatible on POSIX systems and Windows via WSL; requires Bash 3.2 or higher, Java 11 up to 18), Conda or Singularity or Docker  License: BSD3

Figure 2 .
Figure 2. Number of reads mapping to the human genome, HCoV-229E or S. cerevisiae Enolase 2 (from bottom to top) for two HCoV-229E samples WT (left) and SL2 (right) after Guppy (default parameters), Guppy with --calib_detect or after CLEAN usage.Only CLEAN is able to remove all reads deriving from the dRNA control sequence.WT -wild type sample, SL2 -sample with different RNA secondary structure.

Figure 3 .
Figure 3.Comparison of the runtime for ten repeated runs of CLEAN and SortMeRNA v4.3.4.Both tools were executed on a Linux server (CPU: Opteron 6376, 64 x 2,1 GHz, RAM: 768 GB) with 30 threads.Time was measured with the Linux time command.