Abstract
Background Next generation sequencing (NGS) technology has made it possible to perform high-resolution screens for structural variants. Computational methods for detection of structural variants utilize paired-end mapping information, depth of coverage, split reads, or some combination of such data. The available methods are particularly designed to detect structural variants in single genomes or multiple genomes in a pairwise manner. The aim of this study was to develop a bioinformatics pipeline for detection of large structural variants using multiple pooled populations.
Results Here we describe the method “DevRO”, developed to enable identification of structural variants using short insert paired-ends and long-range mate-pairs. DevRO uses paired-end mapping information from both types of libraries for identification of inversions, deletions and duplications followed by read depth information to screen for copy number variants. DevRO can detect structural variants in multiple populations without the need for pairwise comparisons. It uses a combined approach based on (i) paired-end mapping and (ii) depth of coverage that gives power to the study as compared to traditional methods that are based on either of these. DevRO is also designed to detect deletions in the reference assembly, which is an added functionality as compared to available methods.
Conclusion We report a bioinformatics pipeline “DevRO” for detection of structural variants using paired-end mapping and depth of coverage methods tested on sequencing reads from multiple pooled rabbit populations. This method is useful when large numbers of populations have been re-sequenced as compared to traditional methods that can detect structural variants in a pairwise manner.
Background
In human genetics a focus has been to identify rare structural variants associated with disease whereas in animal genetics as well as in evolutionary biology it is of considerable interest to identify loci under positive selection. Large structural variants (SV) are an important form of genetic variants that frequently underlie phenotypic variation (Andersson, 2013; Weischenfeldt et al., 2013). SV refers to copy or dosage changing variants called copy number variations (CNVs which include deletions, insertions and duplications (Redon et al., 2006)) or dosage neutral variants like inversions and translocations. In humans, approximately 1.2% of a single genome differs from the reference genome as regards CNV genotypes (Pang et al., 2010). The average size of SVs in the human genome is ~8kb and they range from 50bp to large structural events (Alkan et al., 2011). In the recent past, several studies indicated that SVs have been associated with a variety of human diseases (Yang et al., 2013; McCarroll & Altshuler, 2007; Stranger et al., 2007). Whereas in livestock genomes, research in genome-wide CNV identification of various domestic animals showed their importance either in disease association (Olsson et al., 2011), phenotypic changes (Jia et al., 2013; Imsland et al., 2012; Rubin et al., 2010) affecting different traits, association with breed-specific differences in adaptation, health, and production traits (Bickhart et al., 2012) and adaptation to starch rich diet in dogs after domestication (Axelsson et al., 2013).
Before the development of next generation sequencing, structural variants were discovered by either hybridization methods (aCGH) in which the relative probe hybridization intensities differ between two compared genomes (Pinkel et al., 1998) or using Hapmap data and SNP arrays measuring the intensities of probe signals at SNP loci (International HapMap et al., 2010). However, the power was limited because of the size and breakpoint resolution of the predicted SV due to the density of SNP arrays. Sanger sequencing of paired reads was used as an alternative to the above-mentioned methods to detect CNVs, inversions and translocations with high accuracy and resolution at the expense of time and cost. Today next generation sequencing technologies (NGS) can generate a large amount of sequence data for instance by whole genome sequencing at a fraction of the cost and time. Several methods have been developed to enable SV detection from NGS data but each method has some limitations. In general, there are four categories of methods to detect SVs using NGS data 1) Depth of Coverage (DoC); 2) Paired-end mapping (PEM); 3) Split read (SR) and 4) Assembly (AS) based methods (Alkan et al., 2011).
The assumption of depth of coverage based methods (e.g. CNVseq and CNVnator) is that the coverage is uniform i.e. the number of reads mapped to a region are assumed to follow a Poisson distribution, however these methods are unable to detect inversions and translocations (Abyzov et al., 2011; Xie & Tammi, 2009). Paired-end mapping methods (e.g. DELLY and Breakdancer) use the information of paired reads and their orientation. The insert size is used to detect insertions and deletions, although the size of CNVs detected is limited by insert size of the libraries used (Rausch et al., 2012; Chen et al., 2009). Split read methods (e.g. Pindel) use the information of anchored reads to identify the breakpoint locations while assembly based methods (e.g. SOAPdenovo) are based on a denovo assembly (Li et al., 2010; Ye et al., 2009). Today, several tools have been developed that use a combination of the data available (e.g GenomeSTRiP, SVDetect) (Handsaker et al., 2011; Zeitouni et al., 2010).
Most of the available methods either use data from a single sample (CNVnator) or make pairwise comparisons between samples (CNVseq). The complexity arises when there are multiple groups (e.g. group1 comprising n populations and group2 comprising m populations) then detection of SVs needs further pairwise comparisons at the downstream level, making it difficult to attribute SVs to a specific group. Another complicating problem arises if the reference genome carries a deletion of a segment or if the resequenced individual/population carries an insertion not present in the reference, because the reads from this region cannot be mapped to the reference genome. This may for instance occur when comparing domestic animals with their wild ancestors, if a DNA segment has been deleted during the domestication process. Clearly, genome sequences from more individuals are needed to define whether a certain SV allele is derived or ancestral.
The aim of this study was to develop a tool based on deviant read and read orientation (DevRO) information using both paired-end mapping and depth of coverage analysis that can detect (i) structural variants in multiple populations and (ii) detect deletions in a reference assembly compared with individuals carrying a non-deleted allele.
Results and Discussion
Detection of structural variants using whole genome resequencing data
Here we used BAM files (Li et al., 2009) as input to DevRO VariantCaller. The VariantCaller step generated raw variants calls from multiple individuals. In DevRO VariantCaller, we scan the entire genome for PEM signatures (Figure 1) in windows of 1 kb. For each locus we store the information of counts of discordant reads in each population, median mate position for the discordant reads within 2 standard deviations of the mapping distance between mate pair reads, forward and reverse median position of the anchored (singleton) reads and their counts in each population, soft clipped read counts in each population along with forward and reverse strand clipped positions and concordant read counts (Figure 2, Figure S1). This information is further processed at the VariantParser step to calculate the fraction of deviant reads in each group. This information is used to identify loci showing significant differences between groups. The CNVs detected using VariantParser are annotated and scored using the DoC information from paired-end sequencing data.
As a test case we used two data sets. 1) Mate pair (MP) data generated from two wild and two domestic rabbits (average insert size of 4.5 kb and average coverage of 3x) for PEM signatures (Figure 1). 2) Rabbit paired-end (PE) sequencing data from pooled samples of wild and domestic rabbit populations with an average coverage of 10x per population (Carneiro et al., 2014) was used for DoC information. For this particular test case, DevRO uses information of two groups (wild and domestic rabbits) to find SVs with significant frequency differences between the two groups. However, it is not limited to this scenario and can take any two groups (e.g. case and control).
This resulted in identification of candidate deletions, duplications (Table S1) and preliminary results of inversions present at a relatively high frequency in either group 1 or group 2. Figure 3 and Figure S2 shows few examples of candidate duplication and deletions predicted by DevRO. Further improvements utilizing for example base quality information of deviant reads, and mapping of breakpoint read sequences on the reference genome will be added to later versions of DevRO.
Condordance with SVDetect and Breakdancer
SVDetect (Zeitouni et al., 2010) is also based on the combined use of PEM and DoC data for detecting SVs in multiple samples. Breakdancer (Chen et al., 2009) is based on the PEM approach for detection of SVs (deletions, inversions and translocations). The main difference between DevRO and these tools is that DevRO, as opposed to SVDetect and Breakdancer, does not require pairwise comparisons of groups (Table 1). We ran SVDetect and Breakdancer using the rabbit mate pair data in order to assess concordance with DevRO results (Table 2). The number of inversions detected was 178 and 281 for SVDetect and DevRO respectively, with 248 overlapping inversions using bedtools feature “reciprocal fraction overlap” of 0.7. Table S2 shows the preliminary list of candidate inversions detected by DevRO not in SVDetect list. One possible reason why SVDetect detected fewer inversions could be it is based on pairwise comparisons as opposed to DevROs group level contrasts. The overlap between SVDetect and DevRO was 500 and 391 for deletions and duplications, respectively (Table 2). However, the overlap between Breakdancer and DevRO showed big differences 13 and 43 for inversions and deletions, respectively. One possible reason for this big difference in deletions could be due to the combined use of PEM and DoC in DevRO while only using PEM in Breakdancer for detection of SVs.
Deletions in the rabbit reference genome
A unique feature with DevRO is that none of the previously described tools are designed to search for deletions present in the reference assembly. This is of particular interest in draft assemblies as well as in domestic animals when the reference genome is based on a domesticated individual, since this makes it challenging to identify regions of the genome that may have been lost during domestication. In rabbits, is based on a domesticated individual, as is the case for all other domesticated species except the chicken. One of the aims with the development of DevRO was to identify deletions present in a reference assembly; such deletions may be due to assembly errors or because the individual used for the assembly carry one or more deletions. In our case, the pileups of singleton reads (anchored reads whose pairs are unmapped) using data from wild rabbits were used to identify candidate loci. These were further narrowed down using the median positions for forward and reverse singletons and soft clipped positions. This resulted in a preliminary candidate list of breakpoints of deletions in the reference genome. These putative breakpoints of deletions were further used to extract the hanging reads. By combining the anchored and hanging read sequences (which represents one mapped read and one unmapped read), BLAT mappings were conducted to the human reference genome in order to investigate whether the unmapped reads corresponded to a human sequence homologous to the rabbit locus as this would indicate that the reference carried a deletion of evolutionary conserved sequences. Further improvements are needed to conduct de novo assembly of the singletons and unmapped reads at breakpoints. Further work is also required to experimentally validate these putative deletions.
In order to test DevRO for detection of deletions in reference assembly, we simulated deletions in the rabbit reference assembly (Table S3). Here we deleted ten regions ranging in size from 1 kb to 70 kb on chromosome 1 with repeats (repeat information extracted from UCSC Santa Cruze Browser) flanking the breakpoints or overlapping them. DevRO successfully identified nine of the simulated deletions but was unable to detect one of the regions (Tables S3). This region showed a mappability of 0.008 at the 5’ end of the breakpoint indicating a highly repetitive region.
Conclusions
Here we report a bioinformatics pipeline “DevRO” for detection of structural variants using paired-end mapping and depth of coverage analysis and tested it on rabbit data. It has an added routine for detecting deletions present in a reference genome. This method will be useful when large numbers of populations are re-sequenced as compared to other frequenctly used methods it is designed to detect structural variants in pairwise comparisons of groups.
Methods
The input of DevRO is a set of aligned MP or PE reads in SAM/BAM format (Li et al., 2009). All input BAM files from multiple populations are analyzed jointly, to avoid pairwise comparisons in the end.
The pipeline DevRO consists of three modules 1) VariantCaller 2) VariantParser and 3) VariantAnnotate that are used to detect inversions, deletions and duplications when comparing two or more populations as well as deletions in a reference genome seq.
Variant Caller
We used BAM files (Li et al., 2009) from MP data as an input to DevRO VariantCaller. In this step, we scan the genome in windows of 1kb to search for PEM signatures using discordant reads (Figure 1). The discordant or deviant reads have (I) abnormal mapping distance in comparison to average mapping distance, the threshold for declaring a PEM as deviant is that it deviates more than three standard deviations from the mean which is a signature for deletion or insertion, (II) abnormal relative mapping orientation (in case of duplication or inversion) as shown in Figure 1. The information of the mate orientation strand for the left or right clipped reads to identify breakpoints. Together with counts and position of discordant reads we also recorded the counts of anchored or singleton reads and soft clip reads using bitwise flags and CIGAR (Li et al., 2009) given in alignment files (BAM) as the method described in SAMtools (Li et al., 2009). For duplications, we store all those loci where at least one duplication read signature is observed in any population, and record the information of deviant/discordant and normal read counts for such loci in each population, median positions are calculated using the mates (discordant reads). This step gives us the raw calls for loci with read counts in each SV category for all the populations in question. The following filters were used during this step: mapping quality for discordant reads >=10; same criteria is applied for each SV type for which the VariantCaller is run. Each analysis is done separately for duplications, inversions and deletions. This means that when we are running the script for the duplication scan and we come across inverted loci having no duplication reads then these loci will not be reported in output of duplications call but will be reported in inversion calls. In contrast, if we find inverted reads where we have duplicated reads also, the loci will be stored with the information of read counts for duplications and inversions. The same principle applies for the scan of inversions.
Variant parser
The purpose of this module is to extract only those loci where there is differentiation between the two groups analyzed (domestic and wild rabbits in the data analyzed in this study). The input is the result of Variant Caller raw calls obtained in the above step. In this step we calculate the fraction of abnormal reads in the two groups and only extract the loci where we see a significant difference between the two groups.
The formula for calculating the fraction of reads consistent with duplication in group 1 is as follows:
frac_dup = A1_dup/(A1_dup+A2_dup), where A1 and A2 represents two groups.
The same type of formula is used to calculate the fraction for all discordant or deviant reads.
Variant Annotate
The predicted CNVs from VariantParser are given as an input to Variant Annotate. We calculate chisq test for each candidate using normalized deviant read counts in each group (A1 and A2) and the total read counts. The purpose of this step is to score and annotate the loci identified in the variant parser step by using pvalue and information of depth of coverage for CNVs using paired-end data (method explained in Carneiro et al., 2014). For each group the average depth is extracted in the predicted regions and an M-value (log2 fold change) is calculated. This step gives CNVs that show significant allele frequency difference between two groups (absolute M-value >=0.7).
For visualization, log2 fold change is plotted for each candidate CNV in R. The breakpoints for inversions are shown using the positions of mate pair. We next annotate the candidates using information for genes, repeats, gaps or custom annotations. Finally, candidate SVs are scored as alpha, beta and gamma by using the following criteria for all SVs:
Alpha: SV with absolute M-values greater than 0.7 from MP and PE data and that fulfill the fraction check, where fraction check is at least 10% of deviant reads or signatures observed.
Beta: Below 10% and above 1% of deviant reads support with or without M-values. Gamma: without M-value support, with less than 1% deviant reads.
Deletion in the reference genome assembly
For detecting deletions in the reference genome or insertions in the resequenced genome, the genome was scanned as in the VariantCaller step by using pileups of singleton (anchored reads) in non-reference populations with mapping quality greater than 10. Median position is calculated for both forward and reverse singleton reads with counts for discordant reads in each window. If multiple discordant reads are observed, then it is less likely that the singletons are due to deletion in the reference assembly and we discard that window in parser step. We further narrow down the region by making use of median positions of singletons and only allow 0.05% overlap between forward and reverse singletons (if any). Together with this information, soft clips (search for right and left clipped positions at breakpoints) are used to detect breakpoints of deletions in the reference assembly. In order to know what part may have been deleted in the reference genome, we also make use of BLAT to map the unmapped pairs of singletons at breakpoints to the human genome assembly to infer whether a DNA fragment observed in human (and in other species) is homologous to those unmapped reads. The size and expected gene content within the part missing in the reference assembly is inferred.
Datasets
Four rabbits (two domestics and two wild from the Oryctolagus cuniculus algirus subspecies) were re-sequenced at 3x coverage using Illumina mate pair (2x50bp) with an average insert of 4.5kb. This dataset was used for paired-end mapping analysis. The depth of coverage analysis was done for the CNVs using the previously published paired-end sequencing data (Carneiro et al., 2014).
Acknowledgements
The work was supported by the ERC grant BATESON to LA, by POPH-QREN funds from the European Social Fund and Portuguese MCTES [postdoc grant to M.C (SFRH/BPD/72343/2010)], by FEDER funds through the COMPETE program and by Portuguese national funds through the FCT – Fundação para a Ciência e a Tecnologia – (PTDC/CVT/122943/2010), by an EU FP7 REGPOT grant [CIBIO-New-Gen][286431], and by travel grants to M.C. (COST Action TD1101) and Higher Education Commission in Pakistan (support for S.S.). Computer resources were supplied by UPPNEX at Science for Life Laboratory.