DEN-IM: Dengue Virus identification from shotgun and targeted metagenomics

Dengue virus (DENV) represents a public health and economic burden in affected countries. The availability of genomic data is key to understand viral evolution and dynamics, supporting improved control strategies. Currently, the use of High Throughput Sequencing (HTS) technologies, which can be applied both directly to patient samples (shotgun metagenomics) and to PCR amplified viral sequences (targeted metagenomics), is the most informative approach to monitor the viral dissemination and genetic diversity. Despite many advantages, these technologies require bioinformatics expertise and appropriate infrastructure for the analysis and interpretation of the resulting data. In addition, the many software solutions available can hamper reproducibility and comparison of results. Here we present DEN-IM, a one-stop, user-friendly, containerised and reproducible workflow for the analysis of DENV sequencing data, both from shotgun and targeted metagenomics approaches. It is able to infer DENV coding sequence (CDS), identify serotype and genotype, and generate a phylogenetic tree. It can easily be run on any UNIX-like system, from local machines to high-performance computing clusters, performing a comprehensive analysis without the requirement of extensive bioinformatics expertise. Using DEN-IM, we successfully analysed two DENV datasets. The first comprised 25 shotgun metagenomic sequencing samples of varying serotype and genotype, including a spiked sample containing the existing four serotypes. The second dataset consisted of 106 targeted metagenomics samples of DENV 3 genotype III where DEN-IM allowed detection of the intra-genotype diversity. The DEN-IM workflow, parameters and execution configuration files, and documentation are freely available at https://github.com/B-UMMI/DEN-IM.

We developed DEN-IM as a ready-to-use, one-stop, reproducible bioinformatic analysis workflow Figure 1. The DEN-IM workflow separated into five different components. The raw sequencing reads are provided as input to the first block (in blue), responsible for quality control and elimination of low-quality reads and sequences. After successful pre-processing of the reads, these enter the second block (green) for retrieval of the DENV reads using the mapping database of 3830 complete DENV genomes as reference. This block also provides an initial estimate of the sequencing depth. After the de novo assembly and assembly correction block (yellow), the coding sequences (CDSs) are retrieved and are then classified with the reduced complexity DENV typing database containing 161 sequences representing the known diversity of DENV serotypes and genotypes (red).If a complete CDS fails to be assembled, the reads are mapped against the DENV typing database and a consensus sequence is obtained for classification and phylogenetic inference. All CDSs are aligned and compared in a phylogenetic analysis (purple). Lastly, a report is compiled (gray) with the results of all the blocks of the workflow. execution of complex distributed computational workflows.
If more than one complete CDS is present in a sample, each of the sequences will follow the rest 83 of the DEN-IM workflow independently. If no full CDS is assembled neither with SPAdes nor with 84 MEGAHIT, the processed reads are passed on to the next step for consensus generation by mapping, 85 effectively constituting DEN-IM's two pronged approach using both assemblers and mapping.  If a complete CDS fails to be obtained through the assembly process, the processed reads are Gigabytes of memory is advised. The disk space required for execution depends greatly on the size 126 of the input data, but for the datasets used in this article DEN-IM generates approximately 20 Gb 127 data per Gb of input data.  The execution log file for each component, as well as a log file for the execution of the workflow, are 131 also available. 132 DEN-IM creates an interactive HTML report ( Figure S1), stored in the 'pipeline results' directory, 133 containing all the information in the results divided into four sections: report overview, tables, charts 134 and phylogenetic tree. The report can be easily exchanged between collaborators by compressing 135 and sharing the "pipeline report" folder.

136
The report overview contains information about the number of samples in the analysis. It allows 137 for selection, filtering and highlighting of particular samples and tools in the workflow.

138
The table section contains the results and statistics the quality control, assembly, read mapping  Figure S1).

152
The assembled contig size distribution scatter plot is available in the chart section, showing 153 the contig size distribution for the Pilon corrected assembled CDSs.

154
Lastly, a phylogenetic tree is included, rooted at midpoint for visualisation purposes, and 155 with each tip coloured according to the genotyping results. If the option to retrieve the closest 156 typing reference is selected, these sequences are also included in the tree with respective typing 157 metadata. The tree can be displayed in several conformationsprovided by Phylocanvas JavaScript   The workflow was executed using the default parameters and directives for resources, with the 171 option to include the closest typing references in the final tree.

172
The negative control and the 92-1001 sample has no reads after trimming and filtering of low 173 complexity reads, therefore they were removed from further analysis.    The serotype and genotype was successfully determined for the 24 DENV CDSs by BLAST 186 ( Figure S1, Table 1). The most common were serotype 2 genotype III (AsianAmerican) and serotype 187 4 genotype II, with 8 samples each (33.33%), followed by serotype 3 genotype III (n=5, 20.83%),   (Table S2).

204
Of the 106 samples, 43 (40.60%) managed to assemble a complete CDS sequence (Table S1) 205 whereas a mapping approach was used for the remaining 63 samples (59.90%) and a consensus CDS   The interactive HTML reports ( Figure S1) provide an intuitive platform for data exploration, 254 allowing the user to highlight specific samples, filter and re-order the data tables, and export the  In conclusion, we provide a user-friendly workflow that makes it possible to analyse paired-end raw The burden of DENV disease is already large, but is still increasing as the risk of exposure to the 276 virus is increasing, not only through travel to endemic areas but also due to the expansion of the 277 geographic areas of the mosquito vectors and the disease [30]. genotypes, all four genotypes present in the spiked sample were accurately detected. This raises 288 the possibility that DEN-IM can play a role in the identification of these cases whose prevalence 289 is increasingly appreciated in highly endemic areas. Moreover, although being ready-to-use, the 290 DEN-IM workflow can be easily customised to optimise the data analysis.  were removed and only the sequences with sub-type data were kept. A representative of DENV 303 serotype 1 genotype III was introduced (EF457905, recovered from monkey) as no representatives 304 were available with the search criteria used. This genotype is Sylvatic and considered extinct [31] [32].

305
Additionally, any sample with IUPAC codes in the sequence provided were excluded.

306
In order to recover the maximum number of DENV reads from the input HTS data in the first 307 mapping step (Figure 1), we maintained the database with the 3830 complete DENV genomes to 308 retain as much diversity as possible. This database is referred as "DENV mapping database".  (Table S4), 63 DENV-2 (Table S5), 25 DENV-3 (Table S6) and 27 DENV-4 (Table S7). This database 313 is referred as "DENV typing database". This step is necessary to speed up the classification step for  (Table S4). In order to harmonise dengue nomenclature, the 324 system adopted uses Roman-numeric labels to identify the genotype, with the exception of Serotype 325 2 (Table S5), which used both Roman-numeric and geographic origin due to the widespread adoption 326 of the latter. The short-read paired-end data is passed as input through the "-fastq" parameter, that by default 329 is set to match all files in the "fastq" folder that match the pattern "* R1,2* ". In the process to 330 verify the integrity of the paired-end raw sequencing data, the integrity of the input files is assessed 331 by attempting to decompress and read the files. An estimation of the depth of coverage is also 332 performed. By default, the input size ("-genomeSize") is set to 0.012 Mb and the minimum coverage 333 depth ("-minCoverage") is set to 10. If any input file is found to be corrupt, its progression in the 334 workflow is aborted.

338
By default, Trimmomatic uses the default set of Illumina adapters provided with the workflow but 339 this behaviour can be overwritten with the "-adapters" parameter. The additional Trimmomatic 340 parameters "-trimSlidingWindow ", "-trimLeading", "-trimTrailing" and "-trimMinLength" can all 341 be set to different values.

342
The removal of low complexity sequences is done with PrinSeq [12] using a custom parameter 343 ("-pattern"), which by default is set to the value "A 50%; T 50%; N 50%", removing sequences whose 344 content is at least half composed of a polymeric sequence (A, T or N).

345
To retrieve the reads that map to the DENV reference database, Bowtie2 [13] is run with default parameters with the DENV mapping database as a reference. The reads and their mates that map 347 to the reference are retrieved with "samtools view -buh -F 12 " and "samtools fastq" commands.

348
The DENV mapping database can be altered with the "-reference" parameter, or alternatively, a 349 Bowtie2 index can be provided with the "-index " parameter. This allows for the workflow to work 350 with other databases obtained through public and owned DENV genomes. The coverage estimation 351 step is performed on the retrieved DENV reads with the same parameters are the first estimation 352 ("-genomeSize=0.012" and "-minCoverage=10").

353
In the assembly process, the retrieved DENV reads are firstly assembled with SPAdes Genome 354 Assembler [15] with the options "-careful -only-assembler -cov-cutoff ". The coverage cutoff if 355 dictated by the "-spadesMinCoverage" and "-spadesMinKmerCoverage" parameters, set to 2 by 356 default. If the assembly with SPAdes fails to produce a contig equal or greater than the value 357 defined in the "-minimumContigSize" parameter (default of 10000), the data is re-assembled with the 358 MEGAHIT assembler [16] with default parameters. By default the k-mers to be used in the assembly 359 in both tools ("-spadesKmers" and "-megahitKmers") are automatically determined depending on 360 the read size. If the maximum read length is equal or greater than 175 nucleotides, the assembly is 361 done with the k-mers "55, 77, 99, 113, 127", otherwise the k-mers "21, 33, 55, 67, 77" are used.

362
To correct the assemblies produced, the Pilon tool [17] is run after mapping the QC'ed reads 363 back to the assembly with Bowtie2 and "samtools sort". This process also verifies the coverage and 364 the number of contigs produced in the assembly. The behaviour can be altered with the parameters 365 "-minAssemblyCoverage", "-AMaxContigs" and "-genomeSize", set to "auto", 1000 and 0.01 Mb by 366 default. The first parameter, when set to 'auto', the minimum assembly coverage for each contig 367 required is set to the 1/3 of the assembly mean coverage or to a minimum of 10x. The ratio of contig 368 number per genome MB is calculated based on the genome size estimation for the samples.

369
The contigs larger than the value defined in the "-size" parameter (default of 10000 nucleotides) 370 are considered to be complete CDSs and follow the rest to the workflow independently. If no complete 371 CDS is recovered, the QC'ed read data is passed to the mapping to module that does the DENV 372 typing database and consensus generation.

373
The serotyping and genotyping is performed with the Seq Typing tool [18] with the command 374 "seq typing.py assembly" or "seq typing.py reads", using as reference the provided curated DENV 375 typing database. It is possible to retrieve the genomes of the closest references and include them in      Figure S1. a) DEN-IM's quality control report containing information of the number of basepairs and the number of reads for the analysed samples, the estimated coverage depth before and after mapping, and the percentage of reads in the input data that were trimmed. b)DEN-IM's typing report for 24 CDSs recovered from the metagenomic dataset. The ID contains the CDS contig name, the typing result for serotype-genotype, the values for identity and coverage, and the GenBank ID of the closest reference in the Typing Database containing 161 complete DENV genomes. Figure S2. Maximum Likelihood inference of the multiple sequence alignment of the 46 DENV-1 complete genomes in the typing dataset. 1635 complete DENV-1 genomes were clustered at 98% nucleotide identity and the representative genomes were aligned with mafft. A maximum likelihood tree was infered with RAxML. The tree is coloured according to genotype (red: genotype I; green: genotype II; blue: genotype III; purple: genotype IV). The sample JF459993, marked with a star, is currently annotated in ViPR as belonging to genotype IV but, given to the good phylogenetic support, it was re-classified as belonging to the genotype I. . Figure S3. Maximum Likelihood inference of the multiple sequence alignment of the 63 DENV-2 complete genomes in the typing dataset. 1067 complete DENV-1 genomes were clustered at 98% nucleotide identity and the representative genomes were aligned with mafft. A maximum likelihood tree was infered with RAxML. The tree is coloured according to genotype (red: genotype I; green: genotype II; blue: genotype III; purple: genotype IV). Figure S4. Maximum Likelihood inference of the multiple sequence alignment of the 25 DENV-3 complete genomes in the typing dataset. 807 complete DENV-3 genomes were clustered at 98% nucleotide identity and the representative genomes were aligned with mafft. A maximum likelihood tree was infered with RAxML. The tree is coloured according to genotype (red: genotype I; green: genotype II; blue: genotype III; purple: genotype IV). Figure S5. Maximum Likelihood inference of the multiple sequence alignment of the 27 DENV-4 complete genomes in the typing dataset. 320 complete DENV-4 genomes were clustered at 98% nucleotide identity and the representative genomes were aligned with mafft. A maximum likelihood tree was infered with RAxML. The tree is coloured according to genotype (red: genotype I; green: genotype II; blue: genotype III; purple: genotype IV).