Abstract
Summary Simulated genomes with pre-defined and random genomic variants can be very useful for benchmarking genomic and bioinformatics analyses. Here we introduce simuG, a lightweight tool for simulating the full spectrum of genomic variants. The simplicity and versatility of simuG make it a unique general-purpose genome simulator for a wide range of simulation-based applications.
Availability and implementation Perl code, along with a user manual and testing data, is available at https://github.com/yjx1217/simuG. This software is free for use under the MIT license.
1 Introduction
Along with the rapid progress of genome sequencing technologies, many bioinformatics tools have been developed for characterizing genomic variants based on genome sequencing data. While experimentally validated gold-standard genome sequencing datasets from real biological samples are increasingly available, in silico simulation remains a powerful approach for gauging and comparing the performance of bioinformatics tools. Correspondingly, many read simulators have been developed for different sequencing technologies, such as ART (Huang et al., 2012) for Illumina and 454, SimLoRD (Stöcker et al., 2016) for PacBio, and DeepSimulator (Li et al., 2018) for Oxford Nanopore. However, when it comes to tools for simulating genome sequences with embedded variants, the choices are much more limited. The currently available tools are either too simple or too specialized. For example, SInC (Pattnaik et al., 2014) can introduce random single nucleotide polymorphisms (SNPs), insertions/deletions (INDELs), and copy number variants (CNVs) into a user-provided reference genome but lacks the ability to simulate pre-defined variants, which is highly relevant in some simulation applications. Simulome (Price et al., 2017) is another random variant simulator that provides finer control options, but it is designed for prokaryotic genomes only. More sophisticated tools exist, such as VarSim (Mu et al., 2015) and Xome-Blender (Semeraro et al., 2018), but these tools are mainly tailored for human cancer genome simulation and often require additional third-party databases. Therefore, we feel there is a need for a genome simulator that strikes a balance between simplicity and versatility. With this in mind, we developed a general-purpose genome simulator, simuG, which is versatile enough to simulate both small (i.e. SNPs and INDELs) and large (i.e. CNVs, inversions, and translocations) genomic variants while staying lightweight, with no extra dependencies and minimal input requirements. These features together make simuG highly amenable to a wide range of application scenarios.
2 Description and feature highlight
simuG is a command-line tool written in Perl and supports all mainstream operating systems. It takes the user-supplied reference genome as the working template to introduce non-overlapping genomic variants of all major types (i.e. SNPs, INDELs, CNVs, inversions, and translocations). SNPs and INDELs can be introduced at the same time, whereas CNVs (implemented as segmental duplications and deletions), inversions, and translocations are each introduced in independent runs. For each variant type, simuG can simulate pre-defined or random variants depending on the specified options. For pre-defined variants, a user-supplied VCF file that specifies all desired variants is needed, based on which simuG will operate on the input reference genome to introduce the corresponding variants. For random variants, simuG provides a rich array of options for fine-grained control, such as ‘-titv_ratio’ for specifying the transition/transversion ratio of SNPs, ‘-indel_size_powerlaw_alpha’ and ‘-indel_size_powerlaw_constant’ for specifying the size distribution of INDELs, ‘-cnv_gain_loss_ratio’ for specifying the ratio of segmental duplications to segmental deletions for CNVs, and ‘-centromere_gff’ for specifying the location of centromeres so that simulated random CNVs, inversions, and translocations will not disrupt the specified centromeres. An ancillary script, vcf2model.pl, is further provided to directly calculate the best parameter combinations for the random SNP/INDEL simulation based on real data. Moreover, given the strong association between gross chromosomal rearrangement breakpoints and repetitive sequences (e.g. transposable elements) observed in empirical studies (Zhang et al., 2011; Yue et al., 2017), simuG can simulate random inversions and translocations by sampling only from user-defined breakpoints (by specifying the ‘-inversion_breakpoint_gff’ and ‘-translocation_breakpoint_gff’ options).
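To make the roles of the transition/transversion ratio and the power-law INDEL size parameters more concrete, the following Python sketch illustrates one way such sampling could work. It is not simuG's Perl implementation. The function names, the assumed density form p(size) ∝ size^(-alpha), and the `max_size` cap are illustrative assumptions, and the multiplicative constant is omitted because it cancels out once the weights are normalized.

```python
import random

# Transition partner of each base (purine<->purine, pyrimidine<->pyrimidine).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
# The two possible transversion targets of each base.
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def sample_snp_alt(ref_base, titv_ratio, rng):
    """Pick an alternative base so that transitions:transversions ~ titv_ratio.

    Assumption: a transition is drawn with probability r / (1 + r);
    otherwise one of the two transversions is drawn uniformly.
    """
    p_transition = titv_ratio / (1.0 + titv_ratio)
    if rng.random() < p_transition:
        return TRANSITION[ref_base]
    return rng.choice(TRANSVERSIONS[ref_base])


def sample_indel_size(alpha, max_size, rng):
    """Draw an INDEL size from weights proportional to size ** (-alpha).

    Assumption: sizes are restricted to 1..max_size (hypothetical cap).
    """
    sizes = list(range(1, max_size + 1))
    weights = [size ** (-alpha) for size in sizes]
    return rng.choices(sizes, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed for reproducibility
    alts = [sample_snp_alt("A", 2.0, rng) for _ in range(10)]
    sizes = [sample_indel_size(2.0, 50, rng) for _ in range(10)]
    print("alternative bases:", alts)
    print("INDEL sizes:", sizes)
```

Under these assumptions, a larger alpha concentrates the simulated INDELs at short sizes, while a larger ratio shifts the simulated SNPs towards transitions.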
The feature type and strand information of user-defined breakpoints are considered during breakpoint sampling. For example, the breakpoint pairs that can trigger an inversion should belong to the same feature type but lie on opposite strands (e.g. inverted repeats). Also, when specified, centromeres are given special consideration in random translocation simulation so that translocations leading to dicentric chromosomes will not be sampled. Finally, when needed, users can also define a list of chromosome(s) to be excluded from variant introduction. Upon completion of the simulation, three files are produced: 1) a simulated genome bearing the introduced variants in FASTA format, 2) a tabular file showing the genomic locations of all introduced variants relative to both the reference genome and the simulated genome, and 3) a VCF file showing the genomic locations of all introduced variants relative to the reference genome. Since simuG’s major input/output formats (e.g. FASTA, VCF, and GFF3) are all widely used in the field, it should be fairly straightforward to connect simuG with other computational tools both upstream and downstream in any user-specific simulation study design. Please note that when comparing the VCF outputs from simuG and other tools, all VCF files used for such comparisons should be normalized beforehand by tools like vt (Tan et al., 2015).
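The pairing rule for inversion breakpoints described above can be sketched in a few lines of Python. This is an illustrative sketch rather than simuG's own code. The helper names and the example records are hypothetical, and the additional requirements that the two features lie on the same chromosome and do not overlap are our assumptions.

```python
from collections import namedtuple
from itertools import combinations

Feature = namedtuple("Feature", "chrom type start end strand")


def parse_gff3(text):
    """Parse the standard GFF3 columns needed here (seqid, type, start, end, strand)."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.split("\t")
        features.append(
            Feature(cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6])
        )
    return features


def inversion_candidate_pairs(features):
    """Return feature pairs that could serve as inversion breakpoints.

    Rule from the text: same feature type, opposite strands.
    Assumptions: same chromosome and non-overlapping coordinates.
    """
    pairs = []
    for a, b in combinations(features, 2):
        same_chrom_and_type = a.chrom == b.chrom and a.type == b.type
        opposite_strands = {a.strand, b.strand} == {"+", "-"}
        disjoint = a.end < b.start or b.end < a.start
        if same_chrom_and_type and opposite_strands and disjoint:
            pairs.append((a, b))
    return pairs


# Hypothetical breakpoint annotation used only for illustration.
EXAMPLE_GFF = "\n".join(
    "\t".join(fields)
    for fields in [
        ("chrI", "example", "Ty1", "100", "200", ".", "+", ".", "ID=te1"),
        ("chrI", "example", "Ty1", "5000", "5100", ".", "-", ".", "ID=te2"),
        ("chrI", "example", "Ty2", "8000", "8100", ".", "-", ".", "ID=te3"),
        ("chrII", "example", "Ty1", "300", "400", ".", "-", ".", "ID=te4"),
    ]
)

pairs = inversion_candidate_pairs(parse_gff3(EXAMPLE_GFF))
for a, b in pairs:
    print(a.chrom, a.type, (a.start, a.end, a.strand), (b.start, b.end, b.strand))
```

In this toy annotation only the two Ty1 elements on chrI qualify. The Ty2 element has a different feature type and the chrII element lies on another chromosome.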
3 Application demonstration
To demonstrate the application of simuG in a real-case scenario, we ran simuG with the budding yeast Saccharomyces cerevisiae S288C (R64-2-1) reference genome to generate five simulated genomes: 1) with 1000 SNPs + 100 random INDELs, 2) with 10 random inversions, 3) with 5 random inversions triggered by breakpoints sampled from pre-specified transposable elements (TEs), 4) with 2 random translocations, and 5) with 2 random translocations triggered by breakpoints sampled from pre-specified TEs. Based on each simulated genome, 50X 150-bp Illumina paired-end reads were simulated with ART (Huang et al., 2012) and mapped to the reference genome by BWA (Li and Durbin, 2009). With this setup, we evaluated the performance of different variant calling tools for both small and large variants (Table 1 and Supplementary Note). For small variants (i.e. SNPs and INDELs), we found freebayes (Garrison and Marth, 2012) and GATK4’s HaplotypeCaller (Poplin et al., 2018) both performed well, with the latter one edged out in INDEL calling. For large variants like inversions and translocations, we found both Delly (Rausch et al., 2012) and Manta (Chen et al., 2016) were able to identify simulated events when no TEs were associated with the breakpoints, although the exact breakpoints could sometimes be slightly off, especially with Delly. In contrast, for simulated inversions and translocations with TE breakpoints, both tools failed to detect most events in our test.
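Because the simulated variants are known exactly, caller performance in such a benchmark can be summarized with precision and recall. The sketch below is a generic illustration and not the evaluation procedure used here, which is described in the Supplementary Note. It assumes variants have already been normalized and are represented as hypothetical (chromosome, position, reference allele, alternative allele) tuples compared by exact match.

```python
def evaluate_calls(truth, called):
    """Compare called variants against simulated truth by exact match.

    Both inputs are iterables of (chrom, pos, ref, alt) tuples, assumed
    to be normalized to the same representation beforehand.
    """
    truth, called = set(truth), set(called)
    tp = len(truth & called)   # simulated variants that were recovered
    fp = len(called - truth)   # calls with no simulated counterpart
    fn = len(truth - called)   # simulated variants that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data used only for illustration.
truth = [("chrI", 100, "A", "G"), ("chrI", 250, "C", "T"),
         ("chrII", 40, "GA", "G"), ("chrII", 900, "T", "TCC")]
called = [("chrI", 100, "A", "G"), ("chrI", 250, "C", "T"),
          ("chrII", 40, "GA", "G"), ("chrIII", 7, "G", "C")]

metrics = evaluate_calls(truth, called)
print(metrics)
```

For large variants such as inversions and translocations, exact matching is usually too strict because reported breakpoints can be slightly off. A positional tolerance around each simulated breakpoint would be a natural extension of this sketch.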
4 Conclusions
We developed simuG, a simple, flexible, and powerful tool to simulate genome sequences with both pre-defined and random genomic variants. Simple as it is, simuG is versatile enough to handle the full spectrum of genomic variants, which makes it useful for a wide variety of simulation studies.
Funding
This work was supported by Agence Nationale de la Recherche (ANR-16-CE12-0019). J.-X. Yue was supported by a postdoctoral fellowship from Fondation ARC pour la Recherche sur le Cancer (PDF20150602803).
Conflict of Interest: none declared.