Genotype-phenotype association mining in bipolar disorder: market research meets complex genetics

Disentangling the etiology of common, complex diseases is a major challenge in genetic research. For bipolar disorder (BD), several genome-wide association studies (GWAS) have been performed. Similar to other complex disorders, major breakthroughs in explaining the high heritability of BD through GWAS have remained elusive. To overcome this dilemma, genetic research into BD, has embraced a variety of strategies such as the formation of large consortia to increase sample size and sequencing approaches. Here we advocate a complementary approach making use of already existing GWAS data: applying a data mining procedure to identify yet undetected genotype-phenotype relationships. We adapted association rule mining, a data mining technique traditionally used in retail market research, to identify frequent and characteristic genotype patterns showing strong associations to phenotype clusters. We applied this strategy to three independent GWAS datasets from 2,835 phenotypically characterized patients with BD. In a discovery step, 20,882 candidate association rules were extracted. Two of these - one associated with eating disorder and the other with anxiety - remained significant in an independent dataset after robust correction for multiple testing, showing considerable effect sizes (odds ratio ~ 3.4 and 3.0, respectively). Our approach may help detect novel specific genotype-phenotype relationships in BD typically not explored by analyses like GWAS. While we adapted the data mining tool within the context of BD gene discovery, it may facilitate identifying highly specific genotype-phenotype relationships in subsets of genome-wide data sets of other complex phenotype with similar epidemiological properties and challenges to gene discovery efforts.


Introduction
It is widely accepted that the high heritability of around 80 % for bipolar disorder (BD) is conferred by a polygenic component yet to be understood in its complexity [1,2]. Genomewide association studies of BD have identified several genome-wide significant variants and also hinted at the existence of many more variants which fail to achieve the rigorous threshold of genome-wide significance (p<5.0e-08) but contribute to the overall variance when considered within the context of polygenicity [5,6]. However, the number of newly identified variants is far below original expectations, with limited sample sizes being one of the explanatory factors. The largest sample for a meta-analysis of GWAS of BD to date comprised nearly 64,000 participants [7]. Although this is an impressive sample size, GWAS of other phenotypes, such as adult height, have demonstrated that samples three-times this figure are required to achieve an adequate number of significant findings [8]. Recent successes of the Psychiatric Genomics Consortium (https://pgc.unc.edu/) in schizophrenia genetics where case-control samples have already exceeded 100,000 individuals suggest that continued enlargement of sample size will also increase the yield of genome-wide significant findings for BD. Clinical heterogeneity of the BD phenotype may also have hampered success in identifying vulnerability genes. DSM [9] and ICD [10] present a list of possible symptoms, each of which must persist for a minimum period of time for the diagnosis to be assigned. Since a diagnosis of BD is based upon the presence of a minimum number of these symptoms, the diagnosis can be assigned for varying symptom constellations. Thus the nature and number of the underlying clinical symptoms, as well as the time periods over which they occur, show substantial variation between patients. Thus, the clinical presentation is diverse, and differing disease courses are observed within each diagnostic category.
We hypothesize that heterogeneity can be reduced and the number of identified variants increased by analyzing the joint effect of several genetic variants on specific subsets of clinical items identified in BD patients [11,12]. We hypothesize that systematic data mining approaches from other fields can be applied to analyses of GWAS data. Popular methods such as support vector machines, Bayesian networks, and association rule mining (ARM) have been successfully applied in industry. ARM is one of the most important and well researched techniques of data mining [13]. It aims to extract casual structures among sets of items in data bases for discovering and predicting regularities and has been applied extensively to market research [14,15] in order to analyze customer habits. Recently, it has been introduced to biological data, in particular microarray data for gene expression analysis [16,17]. We consider this approach highly appropriate for genome-wide data, since its main goal is to unravel unknown associations between source data, i.e. customer profiles in market research, and potential targets, i.e. their buying behavior, which can then be used for target prediction (Figure 1). Within the context of genome-wide data, the source data are genetic variants and the potential targets are symptom clusters. The aim of the present study is to apply this data mining approach to GWAS datasets of BD in order to identify yet undetected genotype-phenotype associations, searching for associations between frequently occurring genotype combinations and symptom clusters.
Clinical symptoms, sociodemographic and environmental features were ascertained using structured interviews (DIGS [21] for GAIIN and SCID-I for BoMa [22]. All phenotypes were retrieved from professionally curated databases [23,24]. Detailed information on the samples can be found elsewhere [18,19,25]. Descriptive statistics for both samples are provided in Table S1. The total sample for the present study comprised n=5,579 subjects (2,835 cases and 2,744 controls). The GAIN sample was used for the discovery step, and the TGEN and BoMa samples were used for the replication step. Prior to study inclusion, written informed consent was obtained from all subjects. The study was performed under a protocol approved by the ethical committee of the University of Heidelberg (Medizinische Ethikkommission II).

Selection of clinical features
In addition to the two phenotypic specifiers age at onset (AAO) and sex, we included a variety of other phenotypic feature, for the selection of which we applied the following criteria: (i) evidence of familiality and/or heritability [26]; (ii) a frequency of at least 5% across all three samples; (iii) a missing data rate of less than 10%; (iv) availability in at least two of the three data sets; and/or (v) clinical features with a high frequency among BD patients, including comorbid features not being part of the diagnosis of BD. In total, we selected 23 clinical features (Table S2), the frequency of which was similar across all three samples ( Figure S1), and ranged from <10% (e.g. eating disorder) to 80% (e.g. reckless behavior).

Selection of single markers and genetic model
The GAIN and TGEN samples were genome-wide genotyped on the Affymetrix 6.0 SNP array. For the BoMa sample, the Illumina HumanHap550 BeadChip was used. All genotypes were imputed based on 2.1 million HapMap Phase 2 markers [27]. Due to computational runtime constraints, our analysis is based on a selected number of markers. We included only those SNPs that showed an association p-value of less than 0.001 in a recent meta-analysis of 4,961 BD patients and 7,294 controls (Supplementary Notes, Methods-SNP selection). Our resulting SNP set comprised 5,487 SNPs, on which LD pruning (Supplementary Notes, Methods-Linkage disequilibrium) was performed in order to reduce redundancy within the genotype data before the discovery step and to decrease runtime. This left us with a total of 1,599 SNPs. Of these, 1,581 SNPs were available in all three samples studied. As the ARM approach requires binary variables we had to transform the genotype information into a binary format (Supplementary Notes, Methods-Genetic Models).

Algorithm for association rule mining
The basic idea for identifying genotype-phenotype data using these binary genotype data is to (i) receive frequent genotype patterns, (ii) to look for significantly associated phenotypes as candidates, or in terms of the original algorithm candidate association rules, in a discovery dataset, and (iii) to validate these candidate association rules in an independent replication dataset. Figure 2 illustrates the basic idea of combining genotypic information in order to identify frequent genotype patterns (left) and evaluate the patterns regarding interesting phenotype traits in order to receive a candidate association rule like genotype-pattern A implies phenotype-pattern B ( ) Identifying frequent genotype patterns. The frequent genotype patterns can be identified in a systematic manner. Several approaches have been developed for association rule mining [28,29]. Here we use the most common Apriori algorithm, as it can be implemented in a straightforward manner and shows a good performance for short patterns, making it an ideal choice for the present study. Replication of candidate association rules. The third stage of our rule mining approach is the replication of the candidate rules. Significance testing is rarely investigated in rule mining [30]. However, we considered this to be important as an inherent aspect of rule mining is the occurrence of false positive results. We anticipated 50,000 false positives per one million tests on the basis of the widely used type I error rate of 5%. One approach to correct for this is to assume that the association rules are independent and apply Bonferroni correction to the test statistics in the discovery data only. However, several simulations have shown that as well as reducing the rate of false discoveries, the Bonferroni approach also reduces the rate of true positive findings [30]. Alternatively, permutation tests can be performed to test whether or not the association between a genotype pattern and a phenotype cluster is random (Supplementary Notes, Methods-Permutation tests). However, when constrained to a single dataset, both methods are susceptible to overfitting. Thus, we considered the performance of a replication of all candidate rules nCR in an independent dataset a more appropriate alternative as this adjusts for potentially spurious sample effects and random associations. Using the latter approach and the Bonferroni method, we defined a primary test-wide significance level adj α for the replication as: However, as shown by our findings and those of Webb [30], when extracting a set of association rules using the ARM approach, the rules are unlikely to be independent. Thus, this significance testing remains conservative and is likely to reject true positive results. Therefore, we also report p-values adjusted using the false discovery rate (FDR). This is an alternative statistical method to adjust for multiple testing: FDR assumes sub-groups of tests to be dependent. FDR is less conservative, resulting in an increase in power at the cost of an increased likelihood of type I errors [31,32].

Analyses
In order to apply our approach to the GWAS data, we developed a software tool, termed

Results
N=20,882 candidate rules satisfied the required thresholds in the discovery data set. The strongest association rule showed a p-value of 3.457e-15 (#962) and thus reached significance after correction for multiple testing using the Bonferroni method (adjusted p-value=3.260e-04). When all candidate rules were compared, 15 disjunct phenotype clusters were observed. Of these, 11 consisted of a single clinical feature. The remaining four consisted of two clinical features (Table S3).
Replication of the n=20,882 candidate rules was then performed in our replication dataset of n=1,835 BD patients. The level of significance after adjustment using the Bonferroni method was 2.394e-06 for a default alpha of 5%. Although replication of the top finding from the discovery step (#962) failed, two rules met the significance threshold: The distribution of the p-values for all candidate rules in the replication dataset fits the expected chi-squared distribution ( Figure S2).

Association finding with eating disorder
Our top finding, rule #12978, showed a genotype pattern frequency of 5.2-7.4% in the case samples and of around 5.2-5.8% in the control populations (Table S4). Further details of the genotype pattern are shown in Table S7. In addition to the primary replication within the discovery-replication framework, two types of permutation tests (Supplementary Notes, Methods-Permutation tests) were performed to estimate: (a) the probability of finding a more significant association with the genotype pattern by re-sampling the phenotype; and (b) the probability of randomly choosing a genotype pattern that shows at least the same level of significance. Both reject the hypothesis of a random association based on the empirical pvalues observed (4.000e-06 and 7.000e-06, respectively), based on 1e+06 trials in the discovery data. In a subsequent step, we combined the data of all n=2,835 patients and reevaluated rule #12978, i.e. we compared patients with and without an eating disorder (BD_ED and BD_nonED, respectively) in terms of the genotype pattern of this rule. This combined analysis of cases showed a p-value=5.300e-14 and an OR=4.120 [.95 CI: 2.740-6.068]. Thus, within the group of patients carrying the genotype pattern, the frequency of a co-morbid eating disorder is increased on average by a factor of 4. An association analysis was then performed for each of the three SNPs of the genotype pattern to determine whether the observed association was due to the combination of the three SNPs or conferred by only one of them. Single trend tests for the phenotype 'eating disorder' was performed in each of the three datasets using PLINK [33]. No significant evidence was found to support the hypothesis that the association with the phenotype of the rule is driven by a single SNP (Table S6).
We furthermore performed an association study of the genotype pattern of rule #12978 in cases versus controls. No differential distribution of the genotype pattern was observed between (a) the BD_nonED cases and controls and (b) between all BD cases and controls: However, the genotype pattern was significantly associated with BD_ED cases compared to controls (p-value=4.937e-14, OR=4.107 [.95 CI: 2.735-6.040]) (Table S5).

Association finding with simple phobia
The second finding, rule #6221, showed an association with 'simple phobia'. The genotype frequencies were 5.4-6.9% in cases and 5.1-7.2% in controls. In the combined analysis of all cases, we observed a p-value=3.476e-13 (adjusted p-value=3.427e-02) and an OR=3.551 [.95 CI: 2.453-5.063]. Thus, within the group of patients carrying the genotype pattern, the frequency of a co-morbid simple phobia increased on average by a factor of 3.5. As was the case for the rule including eating disorder, the association was not conferred by the single SNPs taken separately but only in combination (Table S6). Likewise a case-control analysis showed: (a) no differential distribution of the genotype pattern between the BD_nonSP cases and controls nor (b) between all BD cases and controls. However, we observed (c) a significant differential distribution between BD_SP cases and controls (p-value=1.686e-11, OR=3.195 [.95 CI: 2.220-4.523]) (Table S5).

Discussion
Application of the ARM data mining approach identified significant associations between sets of candidate SNPs and BD subgroups characterized by two specific comorbid conditions: eating disorder and simple phobia.
Our top finding (rule #12978) highlights an association between the genotype pattern of rule #12978 and the subgroup of BD patients with an eating disorder. The association was conferred by the combination of three SNPs but not by the individual SNPs. While the proportion of BD patients with an eating disorder was very small (n=192 patients; i.e. 6.8% of our sample), this frequency is comparable to that reported in other studies [34,35]. Thirtyseven of these patients displayed the genotype pattern of rule #12978, which was present in 182 of all BD patients. Despite the small sample size, the association finding (p-value=4.937e-14) is rather strong with an OR=4.107 in the combined case-control analysis, an effect size typically not seen for diagnosis-based studies.
The likelihood that our findings may be due to chance is further decreased when considering the following two points: Firstly, the replication sample was comprised of two smaller samples, and in both of these samples, the effect was in the same direction (with test-wide significance being achieved in neither). Secondly, our findings fall in line with reports on the function of the genes involved. SNP rs3769745 of rule #12978 is located in the intron region of the cyclic nucleotide gated channel alpha 3 gene (CNGA3) on chromosome 2. In humans, CNGA3 is implicated in total color blindness (achromatopsia) [36,37]. Animal studies have shown that CNGA3 is required for normal vision [38], olfactory signal transduction [39], and involved in nociceptive processing [40]. Further, it is expressed in the mouse brain and is reported to influence synaptic plasticity and behaviour [41]. Research has also shown that the specialized olfactory subsystem to which CNGA3 belongs is required for the acquisition of socially transmitted food preferences (STFPs) in mice. Mice that lack this gene fail to acquire STFPs from other mice, and exhibit an absence of neuronal activation of the ventral subiculum of the hippocampus, a brain region implicated in STFP retrieval [42]. According to the KEGG Database, CNGA3 is in a common pathway, .e., olfactory transduction (KEGG ID hsa04740), with CALM1, a candidate gene for anorexia nervosa [43]. To the best of our knowledge, no association between this variant and eating disorder has been reported so far.
For the other two variants, a plausible support from biological data is not available. SNP rs6733011 is located in an intron region of the KIAA1211-like (KIAA1211L) gene on chromosome 2 that encodes the uncharacterized protein C2orf55 (chromosome 2 open reading frame 55). The location is within a 500 kb window to rs3769745, but not in the same LD block (r² = 0.027 and D' = 0.343 in the discovery dataset). Its function remains unknown. SNP rs4113925 is located on chromosome 12q24.21 in an intron of the T-box transcription factor (TBX5) gene. This T-box gene has been implicated in heart development and disease as well as specification of limb identity [44].
To investigate whether our finding identified genetic markers specific to BD with an eating disorder subphenotype or eating disorder per se, we tested a potential association of the genotype pattern of rule #12978 with an eating disorder phenotype comprising anorexia and bulimia in a population-based sample from Australia (n=1672, 12.9 % with a diagnosis of anorexia or bulimia). We did not see an association of the genotype pattern of rule #12978, suggesting that our approach has detected a genetic marker for BD with comorbid eating disorder rather than for eating disorder per se.
Our second finding, rule #6221, showed an association with simple phobia. Two of the three contributing SNPs are located within genes. SNP rs4757144 is located in an intron region of the aryl hydrocarbon receptor nuclear translocator-like (ARNTL) gene, and rs3130781 is located in an intron region of the diffuse panbronchiolitis critical region 1 (DPCR1) gene. The third SNP, rs858057, is located at an intergenic region of 20p11.21. An implication of ARNTL in the etiology disorders via its influence on the circadian system has been discussed repeatedly. [45][46][47][48][49][50], There is further report that genes homologous to ARNTL may be implicated in the etiology of anxiety. Sipilä and colleagues [50] tested several anxiety phenotypes for association with 13 circadian genes and found association between social phobia and ARNTL2. Thus the ARNTL gene family may be involved in this co-morbid phenotype. The second gene, DPCR1, is located in the major histocompatibility complex (MHC), which hosts genes that are crucial for the functioning of the immune system.
While we observed several other genotype-phenotype rules that may warrant further in-depth investigation (Table 1 and Supplementary Notes, Further results), we focused on the rules implicating BD subtypes with comorbid eating disorder and simple phobia, respectively, as only these two survived our stringent multi-tiered evaluation of potential type I error. These steps help minimize -if not eliminate-type I error rate in ARM due to the overfitting of rules in a particular dataset [28].
We would like to point out that the two reported association rules were associated with very low frequency phenotypes. This is due to the characteristics of the z-score approach applied.
Since small proportions of the data are more likely to deviate strongly from the random distribution, larger effects and thus larger z-scores are expected. As only those rules that show a z-score of greater or equal 5 are extracted as candidates, this particular rule measure is biased towards associations with low frequency phenotypes. This further explains the relatively small number, i.e. 4 out of 15 (Supplementary Notes, Methods-Phenotype cluster), of phenotype clusters that consisted of more than one phenotype. We may thus have discarded many potentially true findings. Given that our study can be considered a proof-ofconcept for the application of ARM on GWAS-derived data for a complex phenotype, we opted for statistical stringency rather than a merely exploratory pattern mining. While this approach resulted in only two findings, they are characterized by effect sizes up to four times larger than typically seen in GWAS of BD or other complex traits.
In summary, using already available GWAS data sets on BD, we have established and implemented a novel data mining process for complex genetic data. We identified genotypephenotype patterns highlighting subtypes of BD characterized by specific comorbid conditions. These two comorbid conditions, eating disorder and simple phobia, may delineate more homogenous subgroups of BD that warrant further study in genomic studies of BD.
An important limitation of our approach is that our approach was only based on 5487 SNPs that showed some evidence of association with BD. As association rule mining may detect hidden association of specific phenotypes with previously un-identified SNPs, our approach may have missed several novel associations. This restriction, however, was due to our motivation to perform genotype-phenotype dissection on SNPs that showed some evidence of association. We were further bound by some computational runtime constraints. Further extensions of the algorithm will be required to allow for a variety of assumed genetic models (here we used a dominant genetic model), to optimize computational feasibility for an increased number of SNPs (Supplementary Notes, Methods-Runtime), and to determine the optimal correction methodology for highly correlated data Our approach highlights a strategy for genotype-phenotype dissection and for the identification of genetic susceptibility variants beyond initial GWAS of heterogeneous disorders. Finally, our results emphasize the importance of thorough phenotyping, particularly with regard to comorbidity.    is traversed as long as there are unprocessed genotype patterns that cannot be pruned before.