Abstract
Exome sequencing of thousands of families has revealed many individual risk genes for congenital heart defects (CHD), yet most cases cannot be explained by a single causal mutation. Further, those who carry de novo and inherited mutations in known risk genes often demonstrate variable phenotypes even within the same family, indicating the presence of genetic modifiers. To explore oligogenic causes of CHD without assessing billions of variant combinations, we developed an efficient, simulation-based method to detect gene sets that carry damaging variants in probands at a higher rate than expected given parental genotypes. We implemented this approach in software called Gene Combinations in Oligogenic Disease (GCOD) and applied it to a cohort of 3382 trios with exome sequencing. This analysis detected 353 high-confidence risk genes in 202 pairs that appear together in multiple probands but rarely or never appear in combination in their unaffected parents. Stratifying analyses by specific CHD diagnosis and considering gene combinations of higher orders yielded an additional 244 gene sets. The oligogenic genes we discovered cluster in pathways specific to heart development and suggest new molecular disease mechanisms, such as arylsulfatase activity and de novo nucleotide biosynthesis. Finally, by combining CHD families with an autism spectrum disorder cohort, we were able to detect 925 oligogenic sets transmitted in renal disease, a known co-morbidity of both conditions. As genome sequencing is applied to more families and other disorders, GCOD will enable detection of increasingly large, novel gene combinations, shedding light on combinatorial causes of genetic diseases.
Introduction
Many disease phenotypes are complex and genetically heterogeneous such that damaging variants at different genetic loci can lead to similar disease manifestations (Veenstra-Vanderweele, Christian, and Cook 2004; Akhirome et al. 2017). This genetic architecture complicates our ability to identify causal variants and genes, despite increasingly large cohorts with exome or whole genome sequencing. For instance, congenital heart defects (CHD) have heritability estimates of 70-90%, but just 8% of probands have a damaging variant in a known CHD gene that is not shared by an unaffected family member (Jin et al. 2017). This indicates that although most defects can be explained by patient genetics, our current knowledge is insufficient to pinpoint the full genetic source and mechanism of most individual cases. Some risk genes remain cryptic due to incomplete penetrance, which reduces our statistical power to detect them; at the same time, previously-identified risk genes and variants may themselves exhibit incomplete penetrance such that carriers differ in the presence and severity of disease symptoms (Parker and Landstrom 2021). A potential explanation for these phenomena is oligogenic inheritance, or multifactorial inheritance in which a few variants are required in combination to cause a particular phenotype (Kousi and Katsanis 2015). For example, an unaffected father may carry a variant that is tolerated in most genetic backgrounds, and is therefore unlikely to be considered a candidate for causing disease; however, if that variant is transmitted to a child along with a de novo or maternally-derived damaging allele in another gene, this could destabilize key pathways and cause disease where the single variant would not.
Previous work has validated one instance of oligogenic inheritance in CHD (Gifford et al. 2019), as well as speculated a role for oligogenic inheritance in other developmental disorders like autism spectrum disorder (ASD) (Schaaf et al. 2011; Wenger et al. 2016). While these studies yielded mechanistic insights into specific gene and variant combinations, they relied on known risk genes and existing functional information to propose plausible hypotheses about specific gene combinations. Alternatively, one could enumerate and prioritize all damaging variant combinations in an automated and statistically rigorous manner. But this strategy is complicated by combinatorial explosion: given that each human genome typically contains 40,000 to 200,000 variants at <0.5% minor allele frequency (MAF) (1000 Genomes Project Consortium et al. 2015), an analysis of all possible rare variant pairs yields between 8 billion to nearly 20 trillion combinations per individual. Nevertheless, there are existing methods that scan the genome for possible oligogenic disease combinations. ORVAL, for example, accepts a list of variants as input, and uses gene annotations to assess the potential for oligogenic combinations (Renaux et al. 2019). However, this tool only offers qualitative hypotheses for a single individual, which limits applicability across probands. In contrast, the Digenic method reduces the combinatorial search space by aggregating rare variants at the gene level and assesses the statistical significance of each gene pair based on its prevalence in a disease cohort (Kerner et al. 2020), but this method is limited to oligogenic combinations of only two genes (“digenic”). Another recent advance in the field is RareComb (Pounraja and Girirajan 2022), an algorithm that tests for greater frequency of gene combinations in cases compared to controls. However, neither of these computational methods incorporate parental sequencing data, which are especially useful in reducing false positives in simplex families, where an affected proband is born to unaffected parents. It logically follows that the highest-effect causal variants are likely transmitted separately from each parent or include at least one de novo mutation; thus, incorporating parental transmission data is potentially a powerful tool to reduce false positives, increase confidence in the phenotypic relevance of variants, and correct for population structure.
To address this gap, we developed the tool Gene Combinations in Oligogenic Disease (GCOD), an algorithmic framework that uses simulation experiments to identify disease gene sets that carry rare damaging variants significantly more often than expected given parental genotypes. GCOD automatically assesses every digenic pair observed in multiple probands, as well as sets of three, four, or more genes. Using information from unaffected parents limits the number of gene sets tested, which increases statistical power and accelerates analysis time. Applying GCOD to exome sequencing from families with CHD, we identified potential novel risk genes and pathways in disease. Jointly analyzing these families with an ASD cohort, we demonstrate the power of GCOD for detecting gene sets associated with comorbidities. These results greatly expand candidate causal mechanisms for both disorders and establish a broad oligogenic component for developmental diseases.
Results
Computational identification of pathogenic gene set candidates
We developed an algorithm to predict pathogenic gene sets using exome or whole genome sequencing data and implemented it in software we called GCOD. The method involves four steps (Figure 1): select variants of interest based on constraint or other criteria, enumerate co-occurring genes, count their frequency across probands, and evaluate the statistical significance of these counts given the parental genotypes. An optional fifth step repeats the process for healthy siblings or pseudo-siblings (Yu and Deng 2011) as a control. GCOD can be viewed as an extension of the Transmission Disequilibrium Test (TDT) in which the unit being tested is an observed co-occurrence of variants in a group of genes, rather than observed transmissions at a single locus (Spielman, McGinnis, and Ewens 1993) or within a single gene (He et al. 2017). GCOD also differs from the TDT in its use of simulations to assess significance rather than a chi-squared statistic, increasing our power to detect rare gene combinations with very few expected observations. Details and options for each step of GCOD follow.
First, users provide a list of damaging variants meeting some criteria of interest from each individual in the dataset (Figure 1A). GCOD then enumerates all combinations of genes that contain co-occurring damaging variants in each proband (Figure 1B), retaining only those combinations observed in multiple families (referred to as “candidate sets”). By changing the definition of damaging, different scenarios can be explored, from small gene sets with very rare co-occurring variants of high impact to larger gene sets that include small-effect, possibly-damaging variants that could modify the effects of other primary disease-causing variant(s) or otherwise contribute to a polygenic signal. We designed GCOD to automatically consider three tiers of variant severity (Supplemental Table 1), defined by residue-level characteristics such as minor allele frequency (MAF) and CADD score (Rentzsch et al. 2019), gene-level cutoffs for gnomAD observed-expected z-score (Karczewski et al. 2020), as well as gene shet score (Cassa et al. 2017) and minimum expression level in relevant cell types (J. Cao et al. 2020). The variant severity tiers are nested such that more severe variants are included in each of the decreasingly strict tiers.
By default, a candidate gene set is only considered in a particular proband if the qualifying variants were derived from at least two sources (that is, “oligogenic transmission” where a healthy parent did not also carry that same combination of variants), though users can remove this requirement. The count of observed oligogenic transmissions comprise the test statistic. To reduce the number of tested hypotheses and the multiple testing correction burden where possible, we implemented an algorithm that begins with digenic pairs and merges these up to the “highest-order” set, defined as the largest unique candidate set observed in two or more probands. Finally, lower-order combinations are added back if an additional proband harbors this candidate set and not of any larger sets. For example, two probands carrying damaging variants in seven genes share 21 unique gene pairs, 35 unique gene trios and quartets, 21 unique gene pentads, and so on (Supplemental Figure 1, Methods: Highest-order gene set enumeration). If no other probands carry any of the sets, GCOD tests only the seven-gene set and none of the 119 lower-order combinations. But if a third proband were to carry one of the gene pairs but none of the higher-order candidate sets, the pair (with a count of three) is included along with the seven-gene set (with a count of two). This algorithm produces a co-occurrence count for each highest-order candidate set, recursively checking for additional transmissions at each level descending to pairs.
In the third step, GCOD performs simulations based on parental genotypes to assess the statistical significance of the candidate set counts (Figure 1C). The goal of the simulations is to estimate the distribution of counts expected by chance if there were no transmission disequilibrium. In each iteration, one hypothetical offspring is simulated for each parental pair as follows. Each damaging parental variant has a 50% chance of being passed to the simulated offspring. We treat variants independently and do not model linkage disequilibrium, because GCOD counts genes with at least one damaging variant rather than the number of variants per gene, and given that linkage disequilibrium decays by 70 kilobases (kb) in most regions of the human genome, genes are only very rarely in the same LD block (Bosch et al. 2009). Additionally, variants must be derived from at least two sources such that a candidate set will never be entirely inherited as a single LD block (i.e., a variant from the other parent or mutated de novo must have been observed). After simulating inherited variants, GCOD adds de novo variants using previous estimates of per-gene de novo mutation probabilities (Samocha et al. 2014). Henceforth, we use the term “oligogenic transmission” to describe an event in which a variant combination occurs in the offspring that was not seen in either parent, i.e. comprises at least one variant from each parent and/or a de novo mutation.
For each simulation, we tally the number of times a given gene set was transmitted with oligogenic inheritance in the simulated cohort. Repeating over many iterations generates a null distribution of oligogenic transmissions against which we compare the observed number in probands in the fourth step. We calculate the probability of no over-transmission for each candidate gene set as the proportion of iterations with a simulated count equal to or greater than the number of probands (Figure 1D). GCOD returns the full list of candidate sets along with the p-values associated with their transmission to probands (unadjusted, Benjamini-Hochberg corrected, and Bonferroni corrected).
In an optional fifth step, GCOD repeats this entire procedure for sibling controls. If healthy siblings have been sequenced, these may be used. Otherwise, GCOD generates a genotype for a pseudo-sibling of each proband, a hypothetical sibling who has inherited the parental alleles that were not transmitted to the affected individual (Yu and Deng 2011) (Figure 1E). Siblings, unlike pseudo-siblings, carry some variants inherited by the proband. They also can be assessed for the phenotype, whereas pseudo-siblings cannot. Siblings without a diagnosis and pseudo-siblings are typically presumed to be unaffected, though this is not always true; siblings must be deeply phenotyped past the typical age of onset to rule out disease, and pseudo-siblings have unknown phenotypes. Hence, it can be helpful to use both when sibling data is available and to interpret results with this caveat in mind. Collectively, the sibling controls serve several purposes. Significant candidate sets from probands can be filtered to exclude any that also appear in (pseudo-)siblings. A comparison of the number of significant sets for probands versus (pseudo-)siblings can also be used to assess overall evidence that a phenotype is likely oligogenic, where probands would be expected to carry more over-transmitted gene sets than (pseudo-)siblings. Finally, the types of genes in significant sets can be compared between probands and (pseudo-)siblings, for example, using gene set enrichment, pathway analysis, or literature review. These analyses help to refine and prioritize significant sets from step three. We refer to the resulting statistically significant gene sets as “oligogenic sets”.
Probands with a congenital heart defect carry more oligogenic gene sets than familial controls
To discover oligogenic disease genes for CHD, we applied GCOD to whole exome sequencing data from 3382 families in the Pediatric Cardiac Genomics Consortium (PCGC) (Pediatric Cardiac Genomics Consortium et al. 2013; Hoang et al. 2018; Morton et al. 2021). In each family, the PCGC generated exome variants for two asymptomatic parents and an affected child. We used parental data to compute pseudo-sibling genotypes for each trio (Yu and Deng 2011) (Figure 1E). We ran GCOD at three tiers of variant severity (Supplemental Table 1): the Strict tier which includes only very rare mutations with a high probability of impairing protein function, plus two more lenient tiers of variant severity (Base-damaging and Base). Only oligogenic transmissions of a variant set were considered (see above). To reduce false positives caused by a single driver gene with spurious co-transmitted partners, we additionally filtered oligogenic sets to those with a logistic regression interaction coefficient greater than 1.0 (Methods: Logistic regression). In short, this step aims to restrict our analysis to sets in which a higher proportion of individuals with damaging mutations in the full oligogenic set have CHD compared to individuals with damaging mutations in each gene separately.
The GCOD approach revealed 202 oligogenic gene pairs comprising 353 CHD risk genes at the Strict and Base-damaging tiers (Supplemental Table 2). At each variant tier, we identified more oligogenic gene pairs in CHD probands compared to their pseudo-siblings (Figure 2A). These differences were all statistically significant (Binomial p ≤ 1.4×10−23). At the Strict tier, for example, we found 179 gene pairs significantly over-transmitted to probands, compared to just 38 gene pairs over-transmitted to their pseudo-siblings (Bonferroni-adjusted p < 0.05) (Table 1). A similar nearly 5-fold enrichment was seen at the Base-damaging tier, and an almost 12-fold enrichment was detected at the Base tier. Oligogenic pairs are present in a substantial number of probands, with 9.2% of PCGC probands carrying at least one pair even at the Strict tier. Together these results provide evidence that CHD is broadly oligogenic.
As expected, for both probands and pseudo-siblings the number of oligogenic pairs increased with more lenient definitions of damaging variants (Table 1), with the Base tier yielding 22515 oligogenic pairs in probands (versus 2783 in pseudo-siblings). This many significant gene pairs means that 96.6% of probands carry at least one oligogenic pair at the Base tier. These gene pairs likely comprise a more diffuse polygenic risk compared to the 179 pairs at the Strict tier, which may play a more central role in the disease of the 312 probands who carry them. We therefore focus the subsequent analyses on the oligogenic sets from the Strict and Base-damaging tiers, which are expected to have larger pathogenic effects per occurrence.
We next computed the highest-order gene sets shared among PCGC probands using Strict and Base-damaging variants (Methods: Highest-order gene sets). Highest-order sets ranged from groups of two genes to groups of twelve genes for both probands and pseudo-siblings. As with gene pairs, we find a significantly greater count of over-transmitted sets in probands compared to pseudo-siblings (both binomial p < 1 × 10−150, Figure 2B). The highest-order oligogenic sets at the Base-damaging and Strict level yield an additional 1662 potential contributing risk genes.
We next investigated whether the oligogenic pairs identified in PCGC probands include any novel CHD genes that have been missed using smaller datasets and single-gene tests. First, we annotated which oligogenic sets contained one or more of the 253 curated human or mouse CHD genes reported by Jin et al. (Jin et al. 2017). While 18% of Base-damaging and Strict proband oligogenic pairs contain at least one of these CHD genes, the vast majority of risk genes captured by GCOD were not previously known (312 of 334 genes, Table 1). Next, we directly assessed the additional genes discovered by testing pairs versus single genes within the GCOD simulation framework (Methods: Single-gene transmission simulation test). This analysis is distinct from using known CHD genes, because prior studies used a variety of statistical tests, cohorts, and sample sizes. GCOD detects about a third as many genes in single-gene mode compared to testing oligogenic pairs at the Strict and Base-damaging tiers, and this difference is even more pronounced at the Base tier (Table 1). In all tiers, fewer known CHD genes are detected in single-gene mode compared to when oligogenic sets are run. These findings underscore the additional signal GCOD is able to capture by considering oligogenic transmissions.
Oligogenic combinations are found together in canonical cardiac gene sets significantly more often in CHD probands
We have so far shown that proband cohorts have a greater number of oligogenic gene sets, and that these sets are enriched for known CHD risk genes. To test whether the CHD oligogenic pairs identified by GCOD comprise phenotype-relevant genes, we performed a Gene Ontology (GO) analysis. Strict tier oligogenic genes were enriched for many GO terms related to heart development, such as “atrioventricular canal development” and “cardiac atrium morphogenesis” (Figure 2C). In contrast, no GO terms were enriched in pseudo-sibling genes at the Strict or Base-damaging tiers. We therefore conclude that genes comprising proband oligogenic pairs are relevant to CHD, and genes over-transmitted to pseudo-siblings are largely false positives unrelated to any particular phenotype, confirming their utility as controls. Interestingly, pseudo-sibling sets at the more permissive Base tier are enriched for GO:0046872 metal ion binding and GO:0005524 ATP binding, potentially indicating that related polygenic risk for CHD is present in healthy members of PCGC families (Supplemental Figure 2).
The aforementioned analysis combines oligogenic sets to find common functions among our putative risk genes. We further hypothesized that genes within the same disease-associated oligogenic set would share functional annotations and also co-occur in pathways and protein complexes. To test this, we calculated the odds of an oligogenic combination being found in a proband and co-occurring in a Molecular Signatures Database (MSigDB) ontology gene set, and found a significant association for the combined GO collections (odds ratio = 1.6, Fisher exact p = 2.2e-33) and for the Human Phenotype Ontology collections related to the heart (odds ratio = 3.8, Fisher exact p = 3.5e-12). A similar enrichment was observed for the PCNet composite database of gene-gene interaction networks (S.-Y. Cao et al. 2021) (odds ratio = 1.8, Fisher exact p = 4.8e-05). These results indicate that the genes comprising CHD oligogenic pairs in the PCGC cohort are functionally associated, perhaps indicating that the interaction of their gene products is necessary for shared functions in the context of disease pathogenesis.
In summary, we have enumerated gene sets for which multiple CHD probands inherited an oligogenic set of variants and tested whether each set was transmitted together more often than expected. We showed that the genes in the resulting oligogenic sets are enriched for heart development functions and tend to occur together in canonical gene networks. By leveraging rare variation from separate sources (two parents or inherited plus de novo), GCOD captures additional known and putative disease genes that other rare variant aggregation methods do not.
Novel risk genes and interactions in CHD
While up to this point we have analyzed oligogenic transmission frequency across the entirety of the exomes of the 3382 trios in the PCGC, these probands have been diagnosed with myriad distinct diagnoses. To explore the possibility that similar phenotypes are caused by more similar genetic etiology, we selected five diagnoses with relatively consistent phenotypic criteria (Table 2): Ebstein’s anomaly, pulmonary atresia with intact ventricular septum (PA-IVS), truncus arteriosus, tetralogy of Fallot, and hypoplastic left heart syndrome (HLHS). HLHS with mitral atresia and aortic atresia (HLHS-MA-AA) or mitral stenosis and aortic atresia (HLHS-MS-AA) were also analyzed separately. We hypothesized that these smaller analyses might have improved power to detect oligogenic sets that are specific to one or a few diagnoses but do not replicate in the entire cohort.
Counts of oligogenic sets including digenic pairs and highest-order sets in phenotypically similar sub-groups corroborated our finding from the full PCGC cohort that probands harbor significantly more oligogenic sets compared to pseudo-siblings (Figure 3A). Many of these sets were found in the broader PCGC as a whole, but we find 160, 181, and 7960 additional diagnosis-specific oligogenic sets at the Strict, Base-damaging, and Base tiers respectively (Supplemental Table 3). Many of these were found in the largest diagnostic categories, like HLHS (all subtypes) and tetralogy of Fallot. However, even at the Strict level we detect the oligogenic pair CAPN9-MGA, a rare combination which was transmitted to two of just 66 individuals with PA-IVS. Almost eight percent of patients diagnosed with HLHS-MA-AA carry oligogenically-transmitted damaging mutations in one of the following oligogenic sets: DNA2-FOXK2, MEGF6-SRRM2, SACS-TNC, or CRB1-GALNS.
We cross-referenced canonical protein complexes sourced from the CORUM database (Giurgiu et al. 2019) with our oligogenic sets from all of PCGC (Supplemental Table 2) and from separate diagnoses (Supplemental Table 3), finding 34 proband oligogenic sets in which two or more genes physically interact. For example, the products of the CREBBP and EP300 genes participate in the p300-CBP-p270-SWI/SNF complex, which regulates histone acetyltransferase activity during development (Chan and La Thangue 2001). Variants in these genes were transmitted oligogenically to two individuals in combination with the COL6A3 gene (COL6A3-CREBBP-EP300, 2 probands, p = 0.016). These three genes were also over-transmitted independently with other genes, which led to the discovery of a network of interconnected oligogenic sets that collectively harbor multiple hits to three complexes: p300-CBP-p270-SWI/SNF, AML1-HIPK2-p300, and pRb2/p130-multimolecular complex (Figure 3).
EP300 co-occurs with RUNX1 in the AML1-HIPK2-p300 complex, and although the digenic pair is observed in one unaffected parent, the highest-order oligogenic set including MYO9A from an alternate source is only observed in CHD probands, both diagnosed with pulmonary stenosis (Figure 3C, left; EP300-MYO9A-RAD54L2-RUNX1, 2 probands, p < 0.0001). RUNX1 phosphorylates EP300 and CREBBP to activate acetyltransferase activity (Aikawa et al. 2006). The co-occurrence of multiple combinations of transcriptional regulators and chromatin modifiers suggests that they interact in gene regulatory networks that are necessary for normal heart development. Genes not annotated with these functions, such as COL6A3 and MYO9A, could be target genes or have upstream or downstream regulatory functions. For example, the fibrillar collagen gene COL6A3 has a role in TGFβ signaling as previously characterized in cancers (Huang et al. 2018; Dankel et al. 2020). Given the involvement of TGFβ signaling in epithelial–mesenchymal transformations in cardiovascular development (Azhar et al. 2003) and its coactivation by the CBP/p300 complex(Janknecht, Wells, and Hunter 1998), co-occurring mutations in COL6A3, CREBBP, and EP300 are likely pathogenic due to perturbation of this pathway.
Notably, twelve probands carried predicted-damaging variants in both the MYO18B gene and the SACS gene, which encode for the myosin-18B and sacsin proteins respectively (Figure 3B, bottom; Supplemental Table 3). Ten of these observations were oligogenic transmissions in which the variants did not co-occur in an unaffected parent. The other two probands inherited MYO18B-SACS from their respective mothers, but additionally inherited mutations in KCNH6 from the fathers (Figure 3C, right). We conclude that there is an underappreciated role for the SACS gene in heart development, likely related to the organization and activity of the cytoskeleton. This is consistent with previous research indicating that sacsin regulates filament assembly, though this was determined in neurofilaments (Gentil et al. 2019). The SACS protein also acts as a chaperone in the binding of LRP1B (Marschang et al. 2004).
LRP1B was found to be significantly over-transmitted with SACS at the Base tier of variant severity (4 probands, p < 0.0001). Furthermore, oligogenic transmission of LRP1B variants with both MYH6 and SRCAP variants at the Strict tier was observed in three families. SRCAP is the Snf2-related CREBBP activator protein, and is known to be involved in histone modification and oxidative metabolism during prenatal heart development (Xu et al. 2021), and MYH6 is a major component in the thick filaments of the sarcomere (Mahdavi, Periasamy, and Nadal-Ginard 1982; Warkman et al. 2012). Oligogenic transmission of Base-damaging/Strict variants in the MYH6-SRCAP pair itself occurred in nine CHD probands (p < 0.0001). Other potential genetic modifiers of this interaction include the oxidase DUOX1 (transmitted with MYH6-SRCAP in 4 of 9 probands) and the membrane-repair gene DYSF (2 of 9 probands). This result suggests a genetic interaction of cell membrane proteins like LRP1B and its chaperone SACS with genes involved in cell motility and oxidative metabolism.
Novel risk genes and interactions in developmental comorbidities
CHD probands have an elevated risk of diagnosis with neurodevelopmental delay and ASD later in life, due to some combination of shared genetic etiology and downstream injury induced by the heart defect (Marino et al. 2012; McCarthy et al. 2015; Wernovsky and Licht 2016). ASD and CHD share 44 canonical risk genes (Supplemental Table 4), including genes implicated in chromatin remodeling like CHD7 (Kasah, Oddy, and Basson 2018; Ji et al. 2020) and cell signaling like PTPN11 (S. Wang et al. 2009). Furthermore, both of these developmental phenotypes are associated with secondary renal (San Agustin et al. 2016; Gabriel, Pazour, and Lo 2018; Clothier and Absoud 2021) and gastrointestinal (Feng et al. 2021; Niesler and Rappold 2021) comorbidities. We therefore sought to characterize the extent to which ASD and CHD probands share oligogenic gene sets and functions. We used GCOD to analyze the sequenced exomes of 1793 families from the Simons Foundation Autism Research Initiative (SFARI) (Fischbach and Lord 2010) and found 1358 Strict and 2866 Base-damaging tier oligogenic pairs. Since an unaffected sibling is sequenced in each family, we were able to further filter this set to those pairs not transmitted to any unaffected sibling (1023 Strict and 2060 Base-damaging pairs, Supplemental Table 5). Of the 1310 genes found in these ASD oligogenic pairs, 138 have been previously characterized in ASD, making the other 1172 novel risk gene candidates.
Interestingly, we found no oligogenic pairs or sets over-transmitted to both ASD and CHD probands at the Strict and Base-damaging tiers. However, the overlap of genes contained within these sets is greater than expected by chance (Figure 4A, Hypergeometric test p < 1×10−27). GO terms significantly overrepresented by the overlapping genes include calcium ion binding, cell motility, and extracellular matrix activity (Supplemental Table 6). One novel gene shared between CHD and ASD probands is ARSG, which encodes for the arylsulfatase G protein. While its known functions are limited to arylsulfatase and heparin hydrolysis, homozygous variants play a causal role in Usher syndrome, which involves congenital deafness and vision loss (Khateb et al. 2018; Abad-Morales et al. 2020). In the CHD cohort, a de novo mutation in the microtubule assembly gene MAP2 was co-transmitted with a rare inherited ARSG variant in two probands, more often than expected by chance (Figure 4B left, p<0.0001). Fifty unaffected parents carried a rare predicted-damaging variant in one of either gene, suggesting that a damaged copy of one is not sufficient to cause a defect. The probands with both MAP2 and ARSG variants were born with valvular defects (aortic stenosis or Ebstein’s anomaly, respectively), and have no known learning disability or autism spectrum diagnoses. In the SFARI autism cohort, ARSG was significantly over-transmitted to probands with five genes, namely ROBO3, BAHCC1, ZNF696, HAPLN2, and HTR1A (p<0.0001 for all, Figure 4B left). ROBO3, ZNF696, and HTR1A have been previously implicated in autism (Anitha et al. 2008; Krumm et al. 2015; Sener et al. 2016), while BAHCC1 is known to be necessary in neuronal commitment (Hezroni et al. 2020), and HAPLN2 has been associated with neurological disorders (Q. Wang et al. 2019). We suspect that ARSG participates in embryonic development of heart and brain (as well as ear and eye), possibly through a shared function like calcium ion binding, cell motility, or extracellular matrix activity; additional modifier genes in each case may determine whether this manifests in heart or brain phenotypes. Notably, genes in oligogenic sets with ARSG were found to be mutated de novo in at least one proband with the combination (Figure 4B right), demonstrating that GCOD is able to recover combinations with de novo variants as well as the more common transmission from alternating parents.
Another gene found in both CHD and ASD oligogenic sets was LRP1, a large, commonly-mutated gene that encodes for the low-density lipoprotein (LDL) receptor-related protein, involved in Wnt signaling. While LRP1 is known to be associated with ASD and CHD (Lillis et al. 2008; Torrico et al. 2019; Lin et al. 2020), we found 392 unaffected SFARI parents and siblings with one or more rare damaging LRP1 variants, and therefore predict that LRP1 is more likely to be an oligogenic modifier than a monogenic cause in any given case. Notably, 7 ASD probands had co-occurring variants in the digenic pair HERC1-LRP1, a gene combination that was never found in an unaffected sibling and observed only once in an unaffected mother (whose autistic child carries an additional HERC1 de novo mutation, Figure 4C). Many unaffected family members carry the same damaging HERC1 mutation as the familial proband (n=51 unaffected carriers), but the disease faithfully segregated seven times with the addition of a damaging LRP1 variant. We therefore conclude that LRP1 is likely to increase risk of developing ASD phenotypes, along with additional genetic modifiers determining final phenotypic outcome of HERC1 variants.
We next examined genes associated with renal comorbidities in CHD and ASD. Of the 1468 CHD probands that were explicitly examined for renal abnormalities, 185 were determined to have a kidney phenotype. In ASD, 47 of 1748 probands have a recorded diagnosis of a kidney or urinary phenotype. Based on the hypothesis that renal abnormalities in CHD and ASD have the same underlying causal genes (albeit with additional phenotype-specific modifiers), we ran a joint GCOD analysis with all 232 probands diagnosed with a kidney disease in either cohort (“kidney probands”). Candidate gene sets in kidney probands were further filtered by those with a logistic interaction coefficient greater than 1 to ensure that the combination of genes in these sets was more predictive of disease than any single constituent gene (Methods: Logistic regression). We found 925 oligogenic gene sets, consisting of 506 unique genes, which were significantly over-transmitted to renal probands and increased the odds of kidney disease when combined (Supplemental Table 7). Genes in these sets are enriched for functions in membrane transport, especially of calcium ions, and filament/motor activity (Supplemental Table 8). We recovered three CHD genes that were previously described to confer risk to kidney development, namely LRP2, TMEM67, SLIT2 (Gabriel, Pazour, and Lo 2018), as well as 12 genes annotated with the process GO:0001822 (kidney development), namely CAT, ADIPOQ, NOTCH3, WFS1, SULF1, COL4A3, FBN1, LRP4, DCHS2, COL4A4, PKHD1, and SMAD3.
Previous studies have suggested ciliopathies as an underlying cause of renal-cardiac comorbidities (San Agustin et al. 2016; Gabriel, Pazour, and Lo 2018), and separately in autism and neuropsychiatric diagnoses (Louvi and Grove 2011; Marley and von Zastrow 2012; Guo et al. 2015). For example, bi-allelic PKHD1 lesions are a known cause of polycystic kidney disease and heart defects, presumably related to its function in primary cilia in epithelial cells (Ward et al. 2002; Zhang et al. 2004; Chinali et al. 2019). However, we found that probands with rare damaging monoallelic PKHD1 variants were less likely to have been diagnosed with a kidney phenotype than probands without a rare damaging PKHD1 mutation (odds ratio = 0.55, Fisher exact p = 0.02). This signal is driven mainly by a high proportion of CHD probands with a damaging PKHD1 mutation but without a kidney diagnosis (54 of 59 PKHD1-carrying CHD probands). Combining the PCGC cohort with the SFARI autism cohort, GCOD found 55 genes significantly over-transmitted with PKHD1 specifically to patients with renal complications, 22 of which were significantly over-transmitted as a digneic pair with PKHD1 alone (Figure 4D, left). Once again we find that there are no overlapping oligogenic sets across each of the three analyses (kidney cohort, CHD-only cohort, and ASD-only cohort). We conclude that PKHD1 does indeed contribute to both cardiac and renal pathologies, but penetrance in the kidney appears to be modified by variants in other genes which largely differ between probands with comorbid kidney defects versus CHD-only or ASD-only. This is the first analysis to our knowledge that combines genetic data from two disparate phenotypes to increase detection power of a common comorbidity.
Discussion
Developmental phenotypes affecting the heart and brain have profound impacts on the lives of patients and their families, and understanding the underlying cause in each case can reveal additional disease risks to monitor as the patient ages to adulthood. Evidence of oligogenic transmission of developmental disease has increased in recent years (Priest et al. 2016; Alsemari et al. 2018; Gifford et al. 2019; Mkaouar et al. 2021), necessitating a fast and accurate method to determine gene combinations of interest from patient data. We provide a computational framework called GCOD to test for the significance of oligogenic gene set transmission and report 202, 1023, and 922 significantly over-transmitted oligogenic pairs in CHD, ASD, and kidney disease respectively. These findings collectively contribute 2966 novel potential risk genes and modifiers from 7889 unique predicted-damaging higher-order sets across all three datasets. This software is unique compared to other recent developments in the field of oligogenic combination detection (Kerner et al. 2020; Pounraja and Girirajan 2022), because it uses family genotypes to limit statistical inference to gene sets with oligogenic transmission, thereby reducing computation and increasing our power to detect true phenotype associations.
Given that ten percent of probands carry an oligogenic set of predicted-damaging variants, our method potentially explains the genetic underpinnings of a substantial minority of cases with previously cryptic causal variants (Table 1). Experimental validation is required to determine how many of these candidate variant sets are indeed sufficient to cause heart defects, or whether some level of additional background genetic risk is necessary. This is especially key as 97% of probands also carry combinations of less damaging variants that collectively confer some risk of CHD. A polygenic risk score approach (Wendt et al. 2020; Isgut et al. 2021) could be used to prioritize oligogenic sets that lead to disease even in otherwise relatively low-risk genomes.
We have established the concept of the “highest-order” gene set, which allows users to examine gene groups of three, four, and larger sets. This increases power by minimizing the number of individual tests. Such an analysis could be computationally prohibitive, but the GCOD algorithm efficiently leverages family sequencing data to filter out combinations seen in healthy parents. Parental sequencing also controls for the effects of population stratification, as gene set significance is conditional only on variant transmission from parents to offspring, and does not rely on relative frequencies in a population of potentially different ancestry.
Our method identified several protein complexes and functions associated with heart development, and further expanded these canonical sets to include potentially-interacting genes and risk modifiers. For example, oligogenic combinations including SACS hint at key protein interactions between the SACS filamental organizing protein with various myosin and membrane-related gene products. Exploring the function and interactions of such gene products could further elucidate cytoskeletal activity in the developing heart. Furthermore, genes that appear in CHD oligogenic sets are enriched for appropriate functions like atrioventricular canal development, as well as more novel associations like de novo pyrimidine synthesis and perception of light stimulus. As was recently explored by (Morton et al. 2021), individuals born with heart defects are at risk for additional complications later in life, including cancers. We similarly hypothesize a role for known bladder cancer gene COL6A3 in CHD, and our results further indicate that eye and ear function in patients may be affected based on functional analysis. We recover several retinitis pigmentosa risk genes, suggesting future studies to determine if the eye health of CHD patients should be monitored as they age to adulthood.
The overlap of genes and functions related to cell motility and ECM signaling in both CHD and ASD further corroborates previous findings that the increased incidence of autism in CHD patients is partially due to pleiotropic effects of shared risk genes and is not solely attributable to impaired cardiac function, such as hypoxia affecting brain development (Bean Jaworski et al. 2017; Sigmon et al. 2019). Furthermore we have identified modifiers in CHD- and ASD-associated renal phenotypes, though we note that this analysis likely suffers from incomplete diagnostic information. For example, some probands without renal indications at birth or at the time of the study may develop symptoms later in life due to genetic risk, reducing our power. This makes the oligogenic sets we identified a conservative list of risk genes for congenital kidney abnormalities in CHD and ASD.
Our simulation method to assess significance of oligogenic sets relies on having sequencing data from enough families to observe multiple occurrences of rare gene combinations. However, we expect that clinicians working with one or just a few family pedigrees could make use of the first two steps of GCOD alone. Variant and gene combinations segregating with the disease but not observed in healthy family members can be enumerated to develop hypotheses for disease etiology where no clear Mendelian inheritance pattern exists.
GCOD as implemented here is not an exhaustive search for oligogenic sets in these families, as it does not take into account common variants (which are expected to confer less risk, but undoubtedly play a role in some cases). It is also important to note that kinship controls, while offering advantages in a conservative analysis, are likely not entirely unaffected by the risk variants they carry and may exhibit subclinical phenotypes. This is an especially important caveat in the SFARI autism cohort, as sets were filtered to those never observed in an unaffected sibling; however, the protein products of sibling gene sets might indeed interact and collaboratively cause disease contingent on other factors like sex or differences in environmental exposure between siblings. Finally, the method is more sensitive to de novo variation (given that an inherited event has a 50% probability of occurring, while most de novo events are exceedingly rare); results are therefore biased towards discovering oligogenic sets where a non-synonymous de novo variant was observed.
Overall, we have created a framework to identify combinations of genes transmitted in CHD, ASD, and their renal comorbidities. We have provided researchers with these lists of prioritized, high-confidence gene and variant combinations for high-throughput screens or more precise mechanistic experiments, in order to move the field towards a more complete understanding of gene interactions during development, and how genetic variants combine in disease. GCOD is freely available for applications to other diseases and species.
Methods
Variant calling and filtering
Whole Exome Sequencing data from the 3382 CHD trios was processed as described in (Jin et al. 2017) and (Morton et al. 2021). Protein-coding mutations were filtered based on a Mapping Quality score > 59 and Genotype Quality > 90. De novo variants were called using the TrioDeNovo program (Wei et al. 2015) and accepted if the minor allele frequency (MAF) and read-depth criteria described in (Homsy et al. 2015) are met. Namely, the in-cohort MAF of the variant must be below 4×10−4, with a minimum of 5 alternative reads and 10 total reads in the proband, and a minimum of 10 reference reads in the parents (with a maximum alternate allele ratio of 3.5%). Variants in the SFARI autism cohort were called as described in (Krumm et al. 2015). In addition to the criteria above, we filtered this set for inherited variants with a Genotype Quality Mean > 85 and plausible inheritance pattern (e.g. if a parent is 1/1 and a child is 0/0 without a de novo variant at this locus, the variant is disqualified). De novo variants for SFARI probands and unaffected siblings were provided by (Iossifov et al. 2014). All variants were called and are reported in GRCh37 coordinates.
For both datasets, we annotated variants using ANNOVAR version date 2021-06-08 (K. Wang, Li, and Hakonarson 2010). We restricted our analysis to variants annotated as <5% MAF in either the gnomAD (Karczewski et al. 2020) or ExAC non-psychiatric (Lek et al. 2016) databases, and below an in-cohort MAF of 10%. GCOD users can change these cutoffs where desired. Variants predicted to be damaging by at least one of MetaSVM, FatHMM, and SIFT models were considered in the Strict and Base-damaging tiers of variant severity (Liu et al. 2016; Rogers et al. 2017). Filtering criteria of variant severity is specified in Supplemental Table 1. We used gnomAD observed/expected (Karczewski et al. 2020), Shet (Zhu, Zhang, and Sha 2018), CADD scores (Rentzsch et al. 2019), and a minimum expression of TPM > 0.5 in any cardiac or brain cell type (for CHD and ASD genes, respectively) included in the DESCARTES developmental gene expression database (J. Cao et al. 2020). Cell types considered are listed below.
Cardiac cell types: ‘Heart-Cardiomyocytes’, ‘Heart-CLC_IL5RA positive cells’, ‘Heart-ELF3_AGBL2 positive cells’, ‘Heart-Endocardial cells’, ‘Heart-Epicardial fat cells’, ‘Heart-Erythroblasts’, ‘Heart-Lymphatic endothelial cells’, ‘Heart-Lymphoid cells’, ‘Heart-Megakaryocytes’, ‘Heart-Myeloid cells’, ‘Heart-SATB2_LRRC7 positive cells’, ‘Heart-Schwann cells’, ‘Heart-Smooth muscle cells’, ‘Heart-Stromal cells’, ‘Heart-Vascular endothelial cells’, ‘Heart-Visceral neurons.’
Brain cell types: ‘Cerebellum-Astrocytes’, ‘Cerebellum-Granule neurons’, ‘Cerebellum-Inhibitory interneurons’, ‘Cerebellum-Microglia’, ‘Cerebellum-Oligodendrocytes’, ‘Cerebellum-Purkinje neurons’, ‘Cerebellum-SLC24A4_PEX5L positive cells’, ‘Cerebellum-Unipolar brush cells’, ‘Cerebellum-Vascular endothelial cells’, ‘Cerebrum-Astrocytes’, ‘Cerebrum-Excitatory neurons’, ‘Cerebrum-Inhibitory neurons’, ‘Cerebrum-Limbic system neurons’, ‘Cerebrum-Megakaryocytes’, ‘Cerebrum-Microglia’, ‘Cerebrum-Oligodendrocytes’, ‘Cerebrum-SKOR2_NPSR1 positive cells’, ‘Cerebrum-Vascular endothelial cells.’
Pseudo-sibling genotype generation
Pseudo-sibling genotypes were generated from the two parental alleles not transmitted to the proband as described in (Yu and Deng 2011). In the rare case in which a proband de novo mutation occurred at a locus where a parent carried a qualifying variant, the pseudo-sibling inherited the parental rare allele if the proband did not. De novo variants in pseudo-siblings were randomly assigned to genes based on previously-derived protein-coding non-synonymous (‘prot’) mutability (Samocha et al. 2014). Since all such variants are automatically included at the Strict variant tier, GCOD does not predict specific amino acid substitutions to further qualify the de novo variant’s CADD, shet, or predicted-damaging status.
Digenic pairs enumeration
In order to efficiently enumerate gene pairs in which multiple probands harbor co-occurring mutations at each tier of variant severity, we create a gene-by-family matrix where cell values indicate the presence of a variant meeting the relevant criteria. We use the following key:
0 = proband does not have a rare variant in this gene;
1 = proband harbors at least one variant at the Base tier of variant severity, and no variants in this gene at higher tiers;
2 = proband harbors at least one variant at the Base-damaging tier of variant severity, and no Strict-level variants in this gene;
3 = proband harbors at least one variant at the Strict tier of variant severity.
This can be repeated to create a separate matrix for (pseudo-)siblings. During generation of each gene-by-family matrix, we create three additional “provenance” matrices to indicate the inheritance of the variants meeting criteria at each of the three tiers. The provenance matrices indicate whether mother, father, both, or neither parents contributed a variant at or above this level of severity. These matrices are indexed during group enumeration to determine whether a given combination of genes includes an oligogenically-inherited variant set (that is, the set of offspring variants in this gene combination are not also collectively carried by one of the unaffected parents).
Gene columns in the gene-by-family matrix were dropped if fewer than two probands (or siblings) carried a mutation at or above the relevant tier of severity. For remaining genes, all possible combinations of digenic pairs were enumerated and checked against the matrix for multiple co-occurrences at each severity tier. Note that each severity tier is inclusive of more-damaging variants, such that a variant meeting Strict criteria is included in Base and Base-damaging analyses. If the provenance matrix indicated an oligogenic transmission of a given set of variants in a digenic pair, and this was true of at least two probands (or siblings), the gene pair was considered a “candidate set.”
Highest-order gene set enumeration
We defined the concept of a “highest-order” gene set, which comprises the largest unique candidate set observed in two or more probands (Supplemental Figure 1). To efficiently enumerate these sets, we begin with the previously-calculated digenic pairs and the list of offspring harboring them. A highest-order candidate set necessarily contains at least one qualifying digenic pair; therefore we reduce the computational search space by only examining pairs of offspring already known to carry a common digenic candidate. For every pair of probands sharing a digenic candidate, all additional damaging variant-harboring genes in common between the two probands comprise a highest-order set. In other words, the gene pair is expanded by adding all other shared genes. The final step is to check whether any lower-order sets for a given pair of probands is found in one or more additional probands. For example, Proband1 and Proband2 may have a highest-order set of GeneA-GeneB-GeneC-GeneD, while Proband1 and Proband3 have a highest-order set of GeneA-GeneB-GeneD-GeneE. The initial algorithmic pass only recorded the following as highest order: where the true carriers of the latter set should be: GCOD therefore sorts the candidate sets by the number of participating genes, beginning with sets of two and extending to the largest highest-order set. GCOD sequentially checks whether a given highest-order candidate is a subset of another candidate, and if so, confirms oligogenic transmission of the variants in the additional probands only captured in the higher-order candidate set. Thus, since the GeneA-GeneB-GeneD combination is a subset of both GeneA-GeneB-GeneC-GeneD and GeneA-GeneB-GeneD-GeneE in the example above, we append Proband1 to the GeneA-GeneB-GeneD entry as long as the transmission of those variants in Proband1 meet criteria for oligogenic transmission.
Logistic regression
Before simulation analysis, candidate sets are filtered based on their interaction coefficient in a logistic regression model. Disease status is predicted by n+1 variables, where n is the number of genes in an oligogenic candidate set. For all parents and probands (and sequenced siblings where applicable), the presence/absence of a qualifying variant in each individual gene, as well as whether all genes in the combination harbor a variant in that individual, are denoted by 0/1. We use the binomial glm() function in R (version 3.6.1, (R Core Team 2020)) with 50 maximum iterations to test whether the coefficient of the gene interaction is greater than 1. This indicates that the combination itself is broadly associated with disease in this dataset, and it removes spurious combinations that are driven by a single gene within the set.
Oligogenic transmission simulation test
For each family, parental genotypes at loci participating in candidate sets are aggregated and used to simulate 10,000 random offspring. One of each allele from father and mother is randomly selected to comprise the simulated offspring genotype at each locus, and additional de novo mutations are simulated based on previously-derived mutability constants (Samocha et al. 2014). A gene-by-family matrix and parental provenance matrices (as described above in “Digenic pairs enumeration”) is created for each of the 10,000 simulated cohorts. For each oligogenic candidate set, we index the constituent genes and record which simulated offspring carry that combination, contingent on oligogenic transmission. For each candidate set, this generates a distribution of co-occurrence counts based on parental genotypes under the null hypothesis of no disease association (i.e., random transmission). The raw p-value is determined by the proportion of iterations with a simulated count as large or larger than the proband test statistic. GCOD returns the test statistic (number of probands observed with the oligogenic combination) and raw, Benjamini-Hochberg FDR-corrected, and Bonferroni-corrected p-values for each candidate set. In this study, we use a significance threshold of Bonferroni-corrected p-value less than 0.05.
Single-gene transmission simulation test
From the 10,000 simulated gene-by-family matrices described in the “Oligogenic transmission simulation test” above, we additionally create marginal null distributions of the number of probands harboring variants in each gene and use these to compute p-values for individual genes. Significant p-values indicate the binary presence of any qualifying variant in a gene is significantly higher in probands than randomly simulated offspring. Note that this is distinct from a traditional Transmission Disequilibrium Test in that the count of binary presence/absence across probands is the test statistic here, instead of all variant transmissions within the given gene.
Gene Ontology analyses
We performed Gene Ontology (GO) analyses using the GOATOOLS package in Python3 (version 1.1.6, (Klopfenstein et al. 2018)), using the GOEnrichmentStudyNS() function with default settings (alpha = 0.05, correction method = fdr_bh for Benjamini-Hochberg FDR-adjusted p-values). When reporting top enriched terms, we focused on GO categories with more than five gene members because small categories are prone to false positives.
Oligogenic set network discovery and depiction
We selected specific oligogenic sets in CHD probands based on criteria detailed below. To visualize gene network nodes represent genes and are colored according to functional annotations. Edges indicate that genes carry damaging variants in the same proband, with edge width representing the number of probands with co-occurring mutations in that gene pair. Note that the edge counts can include probands in which the gene pair was inherited from one parent (non-oligogenic transmission), under the condition that a higher-order oligogenic set includes that transmission (i.e., at least one variant in another gene in the higher-order set was inherited from the other parent or mutated de novo). Edges are not drawn between nodes unless the two genes appear together in at least one significant oligogenic set. Oligogenic sets are reported in the accompanying text, as well as Supplemental Table 2 for the full cohort and Supplemental Table 4 for groups specific to a particular diagnosis. For brevity, only select significant sets of interest are included in the provenance matrices in these supplemental tables.
For Figure 3B, we discovered two oligogenic sets in which genes are known to physically interact in a canonical protein complex. We selected all other significant oligogenic sets containing at least one gene in these complexes, and visualized them using genes for nodes and co-occurrences as edges. For Figure 3D, we sought an oligogenic set with several counts of oligogenic transmissions but rarely any variant combinations seen in unaffected parents, discovering the MYO18B-SACS combination transmitted oligogenically 10 out of 12 times. We additionally incorporated genes appearing in significant oligogenic sets in any of these 12 probands. Figure 4A shows all significant sets containing ARSG in the CHD and ASD cohorts (6 pairs), and Figure 4B shows all significant sets containing HERC1 in the ASD cohort (3 pairs).
Data and software access
PCGC variants are available under dbGaP Study Accession: phs000571.v6.p2 for qualifying researchers. Access to SFARI ASD variants can be requested at https://base.sfari.org/. GCOD software, including Jupyter notebooks with detailed analyses and use cases, is provided at https://github.com/mepittman/gcod.
Competing interest statement
The authors have no competing interests to report.
Acknowledgements
We thank the participants and research clinicians of the Pediatric Cardiac Genomics Consortium (PCGC) and Simons Foundation Autism Research Initiative. We thank Dr. Sarah Morton of the Harvard Medical School Division of Newborn Medicine for providing data and processing pipelines from the PCGC. CHD genomic data were generated by the PCGC, under the auspices of the National Heart, Lung, and Blood Institute’s Bench to Bassinet Program https://benchtobassinet.com. This study was funded by NIH/NHLBI (grants P01 HL098707, P01 HL146366, and UM1 HL098179), Additional Ventures, and the L.K. Whittier Foundation. M.P and K.L were supported by NIH/NHLBI awards F31HL156439 and F32HL158263, respectively.