Targeted gene sequencing in 6994 individuals with neurodevelopmental disorder with epilepsy

We aimed to gain insight into frequencies of genetic variants in genes implicated in neurodevelopmental disorder with epilepsy (NDD+E) by investigating large cohorts of patients in a diagnostic setting. We analyzed variants in NDD+E using epilepsy gene panel sequencing performed between 2013 and 2017 by two large diagnostic companies. We compared variant frequencies in 6994 panels with another 8588 recently published panels as well as exome-wide de novo variants in 1942 individuals with NDD+E and 10,937 controls. Genes with highest frequencies of ultrarare variants in NDD+E comprised SCN1A, KCNQ2, SCN2A, CDKL5, SCN8A, and STXBP1, concordant with the two other epilepsy cohorts we investigated. In only 46% of the analyzed 262 dominant and X-linked panel genes ultrarare variants in patients were reported. Among genes with contradictory evidence of association with epilepsy, CACNB4, CLCN2, EFHC1, GABRD, MAGI2, and SRPX2 showed equal frequencies in cases and controls. We show that improvement of panel design increased diagnostic yield over time, but panels still display genes with low or no diagnostic yield. With our data, we hope to improve current diagnostic NDD+E panel design and provide a resource of ultrarare variants in individuals with NDD+E to the community.


Introduction
In recent years, genetic research has gained novel biological insights into the etiology of neurological disorders, particularly in epilepsy 1,2 . Neurodevelopmental disorders with epilepsy (NDD+E) are a rare group of disorders frequently caused by de novo events in protein-coding genes 3,4 precise genetic diagnosis influences genetic counseling but may also guide treatment decisions by enabling medication or treatment tailored to the patient's underlying genetic defect 2,4 . Examples include treatment with sodium channel blockers in SCN2A-and SCN8A-related NDD+E 5,6 , ezogabine in KCNQ2-related NDD+E 7 or a ketogenic diet in SLC2A1-related GLUT1 deficiency 8 . Up to 28 % of de novo variants (DNV) being found in NDD+E-related genes are associated with such targeted treatment approaches 4 . However, assessments of how often NDD+E-associated genes are mutated are currently insufficient due to lack of large-scale genetic analyses in NDD+E.
Targeted sequencing of specific disease-related gene panels has been part of the diagnostic workup of highly prevalent heterogeneous disorders such as breast cancer 9 , cardiomyopathy 10 and epilepsy [11][12][13] . Multiple genes are sequenced in parallel enabling lower sequencing cost, higher coverage and near-absence of secondary findings compared to exome sequencing 14 . However, high heterogeneity of epilepsy gene panel content has been observed 4,15 . This is likely due to the dramatically growing number of genes associated with epilepsy and diverse integration in the established panels, often without robust statistical evidence 4,16 . To increase yield in diagnostic sequencing panels, it is essential considering genes with proven disease association as well as a reasonable frequency of pathogenic variants among affected individuals.
Here, we report likely damaging variants in 645 epilepsy panel genes sequenced at two molecular diagnostic companies, CeGaT (Germany) and Courtagen (USA). In total, 6994 patients with NDD+E of suspected monogenic cause underwent diagnostic sequencing at the respective companies, the majority as first-tier diagnostic test. We compare this large cohort of panels in NDD+E patients with another study of similar design (n=8565) 17 , 10,937 controls as well as with a cohort of exome-wide DNV in NDD+E 4 (n=1942) investigating variant frequencies in confirmed and putative NDD+E genes in NDD+E panels.

Gene Panel Sequencing Data
We analyzed gene panel sequencing data of 6994 individuals diagnosed with NDD+E or related disorders of suspected monogenic origin. The data was generated during routine diagnostic sequencing by two different commercial companies, Courtagen (US, n = 3817 cases) and CeGaT (Germany, n = 3177 cases) with similar overall approach and design 11 .
Information on cognitive outcome was available in about 59% of cases revealing fractions of individuals with intellectual disability (ID) of 96% (2176/2266, Courtagen) and 97.8% (1833/1875, CeGaT). In the majority of cases, epilepsy was early-onset (before three years of age). Analysis was performed between 2013 and 2017, during which time up to 10 different but vastly overlapping NDD+E panel designs were used by each company.
Panels contained a median of 471 and 498 confirmed or suspected epilepsy genes at each respective company and a median 4870 individuals were sequenced per gene (Supplementary Figure S1, Supplementary Table S1, Figure 3). We decided to analyze the 645 genes (see Supplementary Table S2) that were sequenced in at least 2000 individuals. As the first systematic guideline for diagnostic variant interpretation was not published before 2015 18 , we decided to focus on functional (null variants as well as missense variants predicted to be deleterious by in silico tools, see Methods) ultra-rare variants without pathogenicity labels that are not present in the general population 19 . In this setting, functional variants in genes not ordered by the respective clinician were not consistently reported, while we cannot access genes ordered by clinicians. Consequently, we identify few genes with significantly lower variant frequencies in cases than controls (Supplementary Figure S3) suggesting underreporting of variants in these genes.

Data Processing
A more detailed description of the analysis pipeline of the Courtagen company has been published 20 . A brief overview of analysis steps is described as follows. Courtagen and CeGaT employed custom-designed Agilent Haloplex and SureSelect enrichment kits, to enrich patients' genomic DNA for target regions of epilepsy (candidate) genes. This was followed by paired-end sequencing (250 or 200bp, respectively) on Illumina platforms (miSeq and HiSeq). Adaptor sequences were then trimmed, and the sequencing reads were aligned to the human reference genome hg19 (GRCh37) with bwa-mem (biobwa.sourceforge.net). Reads that mapped equally well to more than one genomic position were discarded. Quality checks were performed ensuring adequate distributions of various quality control metrics such as insert size distribution, mismatch rates, GC bias etc. Subsequent variant calling was done with different pipelines. Variants were filtered for population frequencies < 1% (ExAC, EVS, 1000 Genomes) and platform-specific sequencing artifacts. Follow-up Sanger sequencing was then performed on most variants available to us.
In case of available parental samples, the de novo status of individual variants was tested by Sanger sequencing. For one of the companies, segregation testing was partially documented. Out of 1173 ultra-rare damaging variants, 162 (14%) were verified as de novo, 36 (3%) segregated with disease and for 975 (83%) segregation was unknown.

Reannotation and Filtering
All variants reported to patients as well as variants in controls were re-annotated with the following pipeline. Variants were annotated with Ensembl's Variant Effect Predictor 21 (=VEP) of version 82 using database 83 of GRCh37 as reference genome. Per variant the transcript with the most severe impact, as predicted by VEP, was selected for further analyses. The decreasing order of variant impacts was HIGH, MODERATE, MODIFIER, LOW. Only protein-altering variants [missense or null (premature stop codon, essential splice site, frameshift)] were included in further analyses. Variants that were present in a subset of ExAC (v0.3), an aggregation of exome sequences from adult individuals without severe childhood-onset diseases and without psychiatric diseases (n = 45,376) 19 , were excluded, as these have been shown to convey no detectable risk to disease on a group level 22 . To increase power for variants that were not tested for segregation, we filtered missense variants predicted to be damaging by PolyPhen 23 (v2.2.2) or Sift 24 (v5.2.2). In total, 42% of individuals had no, 34% had one, 15% had two and 8% had three or more ultra-rare variants (either damaging missense or null variant). We labeled ultra-rare variants for which we had no information on segregation as putative de novo variants when they had previously been reported as confirmed DNV in individuals with NDD+E 4 and/or ClinVar 25 (date 08/2017).

Population controls
We used controls as a population reference of ultra-rare variant frequencies per gene. coverage and genotype quality (GQ, estimated in GATK pipeline 27 ) > 30. On average, the number of sites with non-reference genotypes in controls that were excluded due to coverage <30X for this analysis was 1.03% (see Supplementary Figure S2). In one company, this number is on average 0.2% (personal communication). Due to the targeted approach, we expect this to be similarly low in the other company. Diagnostic panels may be at an advantage to identifying variants compared to exomes as they have higher average coverage and were subjected to initially lower GQ cutoffs. On the other hand, variants have been validated by Sanger sequencing in some of the controls, but all of the cases and variants in certain genes were systematically underreported in panels (see Figure 1). Controls are of non-Finnish European origin while cases are mostly of non-Finnish European origin, with few exceptions (personal communication). While controls and cases were not matched for more specific population structure we expect this to have no significant influence in singleton rates as these are relatively consistent in different (particularly non-Finnish European) populations in the 1000 genomes project (https://www.nature.com/articles/nature15393/figures/1) and we also show, that many singletons in cases are likely of de novo origin.

Statistical analyses
We assessed individual gene tolerance to null or missense variants in the general population by using the pLI score (probability of being loss-of-function intolerant), missense z-score (z-score of observed versus expected missense variants) 19 or shet score (selective effects for heterozygous protein null variants) 28 . We defined a gene as constrained with the cut-offs > 0.9 for pLI, > 3.09 for missense z-scores and > 0.05 for shet based on recommendations of the score developers. We compared scores using Wilcoxon rank tests pLI and shet scores as the data appeared not normally distributed upon We subset the list to genes associated with any descending HPO terms 29 of epilepsy (HP:0001250) or intellectual disability (HP:0001250) or "Brain/Cognition" and only included dominant/X-linked disease genes labelled as "confirmed" or "probable". We also annotated MPC scores (for Missense badness, PolyPhen-2, and Constraint), a pathogenicity score that leverages regional depletion of missense variants in the general population as well as amino acid deleteriousness 30 to compare ultra-rare and DNV.

Code availability
All statistical analyses were done with the R programming language (www.r-project.org).
The code will be available upon request.

Results
Genes with ultra-rare variants in NDD+E include DEE but also NDD genes We assessed frequencies of likely protein-altering (missense or null) ultra-rare variants in 6,994 individuals with NDD+E ( Figure 1). While we did not assess variant pathogenicity with all ACMG criteria 18  Reassuringly, ranks of top genes were in concordance with a recently published study of similar design (gene panel sequencing in 8565 epilepsy patients 17 , see Figure 2).

Comparing variant frequencies per gene in 6,994 panels and 1,942 trio exomes
We compared ultra-rare variant frequencies in our panel dataset to DNV frequencies in a large recent exome-wide trio study of 1,942 individuals with NDD+E 4 . Restricting our dataset to genes sequenced in 6000 to 6994 individuals we found correlation between the datasets for both missense and null variants (Missense variants: p-value = 3x10 -9 , rho= 0.63; null variants: p-value= 4x10 -6 , rho= 0.53, method: Spearman correlation, see

Confirmed and putative de novo variants
Among 6994 epilepsy cases, we revealed 333 DNV that were not in ExAC as well as either damaging missense or null DNV. 95% (317/333) of DNV were in 54 constrained genes, Figure S4). 4  The majority of dominant/X-linked panel genes (502 of 645) did either not display any ultra-rare variants in > 2000 epilepsy cases or even had lower frequencies of ultra-rare variants in cases than controls. This could be due to a low mutation rate of these genes or a phenotype rarely ascertained in our cohort. However, given the fact that these genes had no significantly different mutation rate but significantly lower constraint scores compared to all other dominant or X-linked genes in this study it is likely that many of them are not disease associated. This observation is paralleled by a study of similar design on 7,855 individuals with childhood-onset cardiomyopathy, where several genes frequently sequenced in clinical routine, could also not be convincingly associated with disease 10 . In our study, panel design originated in 2010, when multiple candidate genes for rare diseases were nominated without sufficient statistical evidence and could not be confirmed in a clinical setting 39 . This was also described specifically for epilepsy genetics 4,16 .
Of 645 panel genes in our study, 329 genes were associated with recessive inheritance. We also evaluated the frequencies of ultra-rare variants in five genes with contradictory evidence of gene-disease relationship, which thus had been classified as "disputed" by the formal criteria of the ClinGen Consortium 16 (CACNA1H, CACNB4, EFHC1, MAGI2, SRPX2) as well as two genes with contradictory susceptibility to epilepsy (CLCN2, GABRD). CACNB4, EFHC1, MAGI2, SRPX2, CLCN2 and GABRD showed identical frequencies of ultra-rare variants in cases compared to controls (Supplementary Figure S5). Thus, our findings support the evidence that CACNB4, EFHC1, MAGI2, SRPX2, CLCN2 and GABRD may not be truly associated with epilepsy. Coverage of CACNA1H was too poor in controls from ExAC to allow a valid comparison of variant frequencies between cases and controls.
In summary, our data provides evidence to further improve the design of NDD+E panels by i) including genes with highest burden of ultra-rare variants, ii) adjusting the ratio of autosomal dominant and X-linked genes with high diagnostic yield versus autosomal recessive genes with low diagnostic yield and iii) excluding genes with poor evidence for true disease-association or very few ultra-rare variants in epilepsy cases.     Ultrarare Variants (% variants in tested genes) LP/P Variants (% of positive cases) 17