Identification and prioritisation of causal variants in human genetic disorders from exome or whole genome sequencing data

With genome sequencing entering the clinics as diagnostic tool to study genetic disorders, there is an increasing need for bioinformatics solutions that enable precise causal variant identification in a timely manner. Background Workflows for the identification of candidate disease-causing variants perform usually the following tasks: i) identification of variants; ii) filtering of variants to remove polymorphisms and technical artifacts; and iii) prioritization of the remaining variants to provide a small set of candidates for further analysis. Methods Here, we present a pipeline designed to identify variants and prioritize the variants and genes from trio sequencing or pedigree-based sequencing data into different tiers. Results We show how this pipeline was applied in a study of patients with neurodevelopmental disorders of unknown cause, where it helped to identify the causal variants in more than 35% of the cases. Conclusions Classification and prioritization of variants into different tiers helps to select a small set of variants for downstream analysis.


Introduction
Next generation sequencing (NGS) has proven to be a powerful technique to identify causal genes in rare genetic disorders [1,2]. The decline in sequencing cost has enabled the use of NGS for diagnostics in the broader clinical setting [3][4][5]. However, the lower cost of data generation resulted in a daunting task of managing, analyzing and interpreting large data sets [6]. The bioinformatics tools and pipelines need to be constantly improved to keep pace and to enable a speedy analysis of the NGS data.
There are more than 3 million variants present in individual genomes compared to the human reference genome [7]. For clinical sequencing projects this list needs to be reduced to a manageable number of candidate variants for downstream analysis. One strategy to effectively reduce the number of candidate variants is trio sequencing, meaning that the healthy parents are sequenced along with their affected children.
Generally, analysis workflows for trio-or pedigree-based analysis share three basic steps [8]. In step one, the raw sequence reads are mapped to the reference genome for each sample individually and variants are called for all samples together by identifying the differences to the reference genome and determining the genotype per sample. In step two, technical artifacts and variants which are common in the population are removed as these are highly unlikely to be the cause of rare genetic diseases. Even after this first filtering step plenty of variants remain as potential candidates, and prioritization of the variants in the candidate list using pedigree information and various variant annotations is required [8].
Here, we present a complete workflow for candidate variant identification and prioritization to analyze whole exome sequencing (WES) and whole genome sequencing (WGS) data generated from trios or from larger pedigrees.

Material and Methods
We developed a workflow that performs the read alignment, variant calling, annotation, filtering and prioritization steps. A particular focus is put on various filtering and annotation steps to prioritize variants depending on the assumed inheritance model for further analysis.

Variant filtering
Variants that passed all the Platypus internal filters were considered further.
Frequent variants were removed based on MAF threshold of 0.1% from ExAC and the 1000 genome phase III database. To remove technical artifacts specific to our pipeline, variants that were present in the local controls above the threshold of 2% were considered as artifacts and were removed.
In the trio sequencing setting variant information from the patient and the healthy parents was combined to efficiently reduce the number of candidates. Only variants fulfilling an inheritance model for a disease were considered further. For an autosomal dominant (AD) disease, only de novo variants, i.e. variants which are present in the patient but not in any of the parents were considered as candidates. In case of autosomal recessive (AR) inheritance of consanguineous parents, we expected for the genotype to be homozygous in the patients and heterozygous in both parents. Hemizygous variants had to be heterozygous in only one of the parents and homozygous in the patient. Candidates for X-linked (XL) variants had to be heterozygous in the mother and hemizygous in the male patient. Finally, to identify candidates for compound heterozygous inheritance of AR diseases, SNVs and indels were combined to find two different heterozygous mutations in the patient in the same gene that were inherited one from each of the parents.
Variants were further checked for their genotype quality in all the samples in the family and low genotype quality variants (Phred score <20) were filtered out. The remaining variants were prioritized to select a list of candidate causal variants for further downstream analysis.

Variant prioritization
Both variants and genes were prioritized by different measures and classified into different tiers, from which the final candidates were selected.
Variants were prioritized based on their effects on protein function, which were predicted from various conservation scores. To this end, various annotations from dbNSFP [16] including the GERP score [17] and CADD scores [18] were added to the variants. The GERP score [17] measures the evolutionary conservation of the sequences across species; a position with a score greater than two is considered as a highly conserved nucleotide and its disturbance is likely to have a high functional impact. CADD scores [18] integrate various annotations including sequence conservation scores and ENCODE project functional annotations to measure the deleteriousness of the variants. A CADD score of 13 in Phred scale was used as threshold, which means that prioritized variants are considered to be in the top 5% of the deleterious variants in the human genome.
Genes were prioritized based on their intolerance towards functional mutations.
Intolerance missense z scores or pLI from ExAC were added if the variant was a missense or loss of function (LoF, includes stop gain/loss, splice acceptor/donor or frameshift indels), respectively. An increasing positive z-score indicates increasing intolerance to a missense mutation and we consider a z-score above +2 as an outlier intolerant gene, and a pLI score > 0.9 is considered intolerant to a LoF variant.
Finally, variants were categorized into two tiers which each contained three levels.
Level 0 of both tiers contained the whole variant set before prioritization. In Tier 1 only LoF variants were moved to level 1. LoF variants with CADD score above the threshold were moved further into level 2. Finally, variants in level 2 which affect genes with pLI score above the threshold were moved into level 3. The missense variants in level 0 were moved into level 1 of Tier 2 and further prioritized into different levels of Tier 2, accordingly. Here, instead of the ExAC pLI score the ExAC missense Z-score is used to prioritize variants into level 3. In the downstream analysis, initially only the variants in level 3 of both tiers were considered. Only if there were no candidates found in level 3, variants in lower levels were examined.
In addition, these prioritized variants were further classified using the guidelines providing by the American College of Medical Genetics and the Association for Molecular Pathology [19] by medical geneticists during reporting. In Tier 1, 69 LoF variants were found in level 1 from all the samples, on average 1 LoF variant per sample, and 49 of those variants have CADD score greater than 13 were in level 2, and 12 variants were in genes that are intolerant to a new LoF variants in level 3 ( Figure   3). In Tier 2, there are 529 total missense variants and 296 of them have CADD score above 13, among them 67 variants are in the genes with ExAC mis_z score greater than 2 ( Figure   4).
Using this approach, the causal variants could be identified in 15 in the NDD cohort (     [20]. However, individual tools are often not in a good agreement, so that it is recommended to use a consensus vote of multiple tools or to use meta-tools that combine the results of multiple individual prediction algorithms [8] [ 16]. For these reasons, our workflow relies on the CADD score [18], which integrates multiple levels of information including conservation and functional data, for variant prioritization.
A complementary information to the deleteriousness of the variant is the tolerance of the affected gene to new functional mutations. The intolerance or constraint score of a particular gene is calculated from the deviation of the observed number of functional mutations in a gene in large populations from the expected number which is based on the total amount of variation in this gene [21]. However, also the gene intolerance score can be misleading, as it does for example predict major cancer predisposition genes to be tolerant to new functional mutations [20]. Since there are subregions of the genes which are much more intolerant to functional mutations than other regions, a region-based or exon-based intolerance score could achieve a higher resolution and hence enable better predictions.
Furthermore, as noted earlier in the results section, the gene intolerance score performs poorly for AR variants. So, a new gene intolerance score calculated solely from the homozygous variants in the gene should be used to prioritize variants in the AR inheritance model.
Current studies to identify causal variants in genetic diseases, focus usually only on exonic, functional variants and investigate only SNVs and small indels. In these studies, for up to 60% of the investigated cases no causal variants could be identified [3,4,22]. A possible explanation might be that in these cases the causal variants are focal copy number variants (CNVs) or structural variants (SVs), and might also be interested to look into +/-10 bases around the splice sites, which cannot comprehensively be identified from WES data and thus require WGS for full exploration. Alternatively, causal variants might affect non protein-coding regions of the genome. While these variants can be identified from WGS data, their interpretation and functional effect prediction is much more challenging than for coding variants. Multiple efforts have been made recently to produce more accurate tools to predict effect of non-coding variants in Mendelian diseases [23,24], so that in the future the inclusion of non-coding variants into prioritization workflows should be possible.

Conclusion:
The rapid decrease in sequencing costs opened the door for broad application of genome sequencing in research as well as in clinical settings. The resulting massive sequence data production made data analysis and interpretation a daunting task in clinical genomics. We have developed an automated variant identification and prioritization pipeline for genetic disorders. With the classification of variants and genes into different tiers, the pipeline makes it easy to focus on a small set of candidate variants for downstream analysis. The pipeline will continuously be updated by adding new better accurate tools and scores to improve its performance and will eventually also allow to include SVs, CNVs, and non-coding variants.