Abstract
Heritability is a fundamental characteristic of human disease essential to the development of a biological understanding of the causes of disease. Traditionally, heritability studies are a laborious process of patient recruitment and phenotype ascertainment. Electronic health records (EHR) passively capture a wide range and depth of clinically relevant data and represent a novel resource for studying heritability of many traits and conditions that are not typically accessible. In addition to a wealth of disease phenotypes, nearly every hospital collects and stores next-of-kin information on the emergency contact forms when a patient is admitted. Until now, these data have gone completely unused for research purposes. We introduce a novel algorithm to infer familial relationships using emergency contact information while maintaining privacy. Here we show that EHR data yield accurate estimates of heritability across all available phenotypes using millions of familial relationships mined from emergency contact data at two large academic medical centers. Estimates of heritability were consistent between sites and with previously reported estimates. Inconsistencies were indicative of limitations and opportunities unique to EHR research. Critically, these analyses provide a novel validation of the utility of electronic health records in inferences about the biological basis of disease.
Introduction
Family history is one of the most important disease risk factors necessary for the implementation of precision medicine in the clinical setting1. The predictive value of family history for any given trait is directly related to the fraction of phenotypic variance attributable to genetic factors, known as heritability2. Knowledge of disease heritability combined with family history information is clinically useful for identifying risk factors, estimating risk of disease, customizing treatment, and tailoring patient care. Moreover, by quantifying genetic contribution to a trait, heritability estimation represents the first step in gene mapping efforts for any disease.
Estimating heritability has traditionally required in-depth family studies, with twin studies being the gold standard. By their nature these studies can be laborious, limiting their sample sizes and, subsequently, their power. A notable exception, and perhaps the largest single study, used 80,309 monozygotic and 123,382 same-sex dizygotic twins to conclude that there is significant familial risk for prostate, melanoma, breast, ovary, and uterine cancers3. Another study brought together 2,748 twin studies conducted since 1955 covering 14.5 million subjects. However, in such a meta-analysis individual data are not available, preventing any study of cross-sections, combinations of traits, or strata that were not analyzed in the original study4.
Electronic Health Records (EHR) are in broad use and offer an alternative to traditional phenotyping. Every day, the EHR records thousands of patient phenotypes from drug prescriptions and disease diagnoses to clinical pathology results and physician notes. Use of the EHR as an observational dataset presents a novel opportunity to conduct rapid and expansive studies of disease and phenotype heritability. In particular, EHRs enable access to traits that otherwise might not be studied. In addition, data captured by these systems represent the diversity of the patient populations they serve, and, in ethnically diverse regions like New York City, make previously unattainable cohorts available for study.5 The caveat is that these data are known to contain many biases and errors that limit their use. Issues regarding missingness and accuracy are widely cited as the primary limitations6. However, the most critical limitation for genetic studies may be the uncontrolled ascertainment bias. The probability that a particular trait is recorded in the EHR is not uniform across disease conditions or across patients. For example, a patient seen for a routine checkup with no symptoms is unlikely to undergo an MRI, regardless of whether or not they have an unruptured brain aneurysm.
The genetic relatedness between patients is not routinely captured in the EHR during clinical practice. In some hospitals, as is the case for the two we represent, a link is made between the mother’s and child’s medical records upon birth. In general, however, familial links are not present. Recent work has identified twins by comparing birth dates and surnames7, but there is a more comprehensive source of familial relationship data that is available at nearly every hospital across the country – the emergency contact information. Upon admission, each patient is asked to provide contact details to be used in case of emergency as well as how they are related to the individual provided. If accurate, this ubiquitous resource can be used to define a broad network of relatedness across a hospital’s patient population.
In this study, we demonstrate the utility of the EHR as a genetics research resource by using extracted data to estimate the heritability and familial recurrence rates of over 700 phenotypes – both quantitative and dichotomous. We performed this analysis independently at two large academic medical centers in New York City. We present our algorithm for extracting relationships, called Relationship Inference From The Electronic Health Record (RIFTEHR), and use it to infer 4.7 million familial relationships among our patients. We then computed recurrence rates and heritability estimates for every available phenotype. Our derived heritability estimates accurately reflect those previously reported and we report heritability estimates for many traits that may otherwise never have been studied.
Results
Mining familial relationships from the EHR
We obtained the data for this study from the inpatient EHR used at the hospitals of Columbia University Medical Center and Weill Cornell Medical College. These hospitals operate together as NewYork-Presbyterian Hospital and herein, we will refer to the hospitals and the data associated with them as Columbia and Cornell, respectively. The study was approved by Institutional Review Boards at both Columbia and Cornell University.
In total, 4,768,013 emergency contacts were provided by 2,388,455 patients at the two medical centers. Of these, we identified the emergency contact as a patient in 785,943 cases (488,932 and 297,011 at Columbia and Cornell, respectively). Using these next-of-kin data, we inferred an additional 2,614,657 relationships at Columbia and 1,200,977 at Cornell. Including inferences, a total of 3,103,589 unique relationships have been identified at Columbia and 1,497,988 at Cornell. Inferred relationships include first to fourth degree relatives as well as spouses and in-laws (Table 1, Supplementary Table 1). We grouped individuals into families by identifying disconnected subgraphs (Materials and Methods). We found 223,307 families at Columbia containing 2 to 134 members per family. Similarly, we found 155,883 families at Cornell, with up to 129 members per family. This includes 127 families that span four generations (i.e. families that contain great-great-grandparents and great-great-grandchildren) at Columbia and 72 families that span four generations at Cornell.
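Grouping individuals into families by identifying disconnected subgraphs amounts to finding the connected components of the relationship graph. The following is a minimal sketch of that idea, not the study's implementation: the function name `group_families`, the union-find approach, and the patient identifiers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def group_families(pairs):
    """Return families as the connected components of the relationship graph.

    `pairs` is an iterable of (patient_a, patient_b) tuples, one per
    identified or inferred relationship (hypothetical input format).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Walk to the root, shortening the path as we go (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the two endpoints of every relationship.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect members under their root.
    families = defaultdict(set)
    for x in list(parent):
        families[find(x)].add(x)
    return sorted((sorted(members) for members in families.values()),
                  key=lambda members: members[0])

# Toy example with made-up identifiers: two disconnected families.
example_pairs = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
families = group_families(example_pairs)
```

In this toy input, `p1`, `p2` and `p3` form one family and `p4` and `p5` another, because no relationship connects the two groups.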
Relationships by degree.
Demographic data of the electronic health records at the medical centers of Columbia and Cornell University.
The relationship between mother and child was explicitly documented in the EHR for babies delivered at both medical centers. This ‘EHR mother-baby linkage’ provided a reference standard for maternal relationships, allowing us to compute sensitivity and positive predictive value (PPV) of the relationship inference method. For maternal relationships, we obtained 92.9% sensitivity with 95.7% PPV at Columbia and 96.8% sensitivity with 98.3% PPV at Cornell (Figure 1A).
(A) The medical centers at both Columbia and Cornell have implemented a link between the electronic health records of mother and baby at the time of birth. We used these links as a gold standard to evaluate RIFTEHR, our algorithm for automatically inferring relationships from the EHR. Of 40,095 mother-baby links at Columbia, RIFTEHR correctly identifies 35,775, falsely identifies 1,600 and misses 2,720. Positive predictive value (PPV) is 96% and sensitivity is 93%. Of 39,691 mother-baby links at Cornell, RIFTEHR correctly identifies 37,797, falsely identifies 657, and misses 1,237. PPV is 98% and sensitivity is 97%. (B and C) Through biobanks at Columbia, 185 of the patients with identified relationships from RIFTEHR also had genetic data available and appropriately consented for use in our study. For these 185 patients, RIFTEHR predicted a total of 122 relationships: 78 parent/child relationships, 19 sibling relationships, 3 grandparent/grandchild relationships, 6 aunt/uncle/niece/nephew relationships, and one grandaunt/grandniece relationship. Genetic relatedness was determined for each pair of individuals. All 78 parent/child relationships had the expected genetic relatedness of 50% (49%±3%). Of the siblings predicted by RIFTEHR 13 were full siblings, 2 were half siblings (genetic relatedness of 25%), and 4 were identical twins. The high rate of twins in our small sample is a result of the secondary use of existing data – which was originally collected for genetic studies. Excluding these twins yields a more accurate estimate of RIFTEHR’s performance (PPV=87%). Overall the RIFTEHR relationship and the genetic relationship were significantly correlated (r=0.65, p=6.24e-14). (D) Average age differences for each relationship type. We computed the age differences for each pair of individuals at both Columbia (blue) and Cornell (red). The age differences are consistent across sites.
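The PPV and sensitivity reported for the mother-baby validation follow directly from the counts of correct, false, and missed links given in the caption. As a worked check (the function name is hypothetical; the counts are taken from the text):

```python
def ppv_and_sensitivity(true_pos, false_pos, false_neg):
    """Positive predictive value and sensitivity from raw counts."""
    ppv = true_pos / (true_pos + false_pos)
    sensitivity = true_pos / (true_pos + false_neg)
    return ppv, sensitivity

# Columbia: 35,775 correct, 1,600 false, 2,720 missed mother-baby links.
columbia_ppv, columbia_sens = ppv_and_sensitivity(35_775, 1_600, 2_720)

# Cornell: 37,797 correct, 657 false, 1,237 missed mother-baby links.
cornell_ppv, cornell_sens = ppv_and_sensitivity(37_797, 657, 1_237)
```

Rounded to one decimal place, these reproduce the 95.7% PPV and 92.9% sensitivity reported for Columbia and the 98.3% PPV and 96.8% sensitivity reported for Cornell.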
We validated the identified relationships by comparison to genetic relatedness (Figure 1). We collected a dataset of 186 patients for which we have EHR-inferred relationships and who have genetic data available that was consented for reuse. We used PLINK to estimate relatedness. All 78 predicted parent/child relationships had the expected genetic relatedness of 50% as well as the three grandparent/grandchild relationships. All 19 sibling relationships were genetically related, but four were identical twins and two were half-siblings. Overall, relationships extracted from the EHR significantly correlate with the expected genetic relatedness (r = 0.65, p = 6.26e-14, Figure 1B).
Health records-based estimates of heritability
To differentiate heritability estimates derived under uncertain ascertainment conditions, we introduce the concept of “observational h2” or h2o. h2o is an estimate of the narrow-sense heritability where the phenotypes (traits) come from observational data sources. Observational data are subject to confounding biases from physician and patient behaviors that will affect the probability that a particular trait is ascertained. These ascertainment biases can vary from patient to patient, family to family, and cases to controls. The consequence is that the estimated heritability will be highly dependent on the particular families and individuals upon which the estimate is based. To correct for this, we bootstrapped the heritability estimates. For each sampling we used SOLAR8 to estimate the heritability of the trait, in a procedure we call SOLARStrap (Materials and Methods). High sampling variance indicates the presence of heterogeneous biases. Heritability estimates are adjusted for age and sex.
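The resampling idea can be sketched as follows. This is only an illustration under stated assumptions and not the SOLARStrap procedure itself: the estimator is passed in as a placeholder callable standing in for a SOLAR run, and the number of samples, the number of families per sample, and sampling without replacement are hypothetical choices that are not specified in this section.

```python
import random
import statistics

def resample_heritability(families, estimate_h2, n_samples=200,
                          families_per_sample=100, seed=0):
    """Summarize an estimator over repeated random subsets of families.

    `estimate_h2` is a placeholder for whatever tool produces a single
    heritability estimate from a set of families (SOLAR in the paper).
    Returns the median estimate and an empirical 95% interval.
    """
    rng = random.Random(seed)
    k = min(families_per_sample, len(families))
    estimates = sorted(estimate_h2(rng.sample(families, k))
                       for _ in range(n_samples))
    lower = estimates[int(0.025 * (n_samples - 1))]
    upper = estimates[int(0.975 * (n_samples - 1))]
    return statistics.median(estimates), (lower, upper)

# Toy stand-in: each "family" is just a number and the "estimator" is a mean.
# This is NOT a heritability model; it only exercises the resampling loop.
toy_families = [0.3, 0.4, 0.5, 0.6, 0.7] * 40
toy_estimator = lambda sample: sum(sample) / len(sample)
median_h2, (ci_low, ci_high) = resample_heritability(toy_families, toy_estimator)
```

A wide interval between `ci_low` and `ci_high` corresponds to the high sampling variance that the text interprets as a sign of heterogeneous ascertainment biases.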
We mined the literature for heritability estimates and found 91 phenotypes that mapped to phenotypes we curated from the EHR. We used the Columbia data to set the quality control parameters of the SOLARStrap procedure (Materials and Methods). 10 of the traits in the Cornell data passed these QC criteria and we found that they were significantly correlated with literature estimates for these traits (r=0.73, p=0.016, Figure 2A). On average, estimates from Columbia were 20% ± 9% lower than those reported in the literature and those from Cornell were 7% ± 9% lower (Figure 2B). Heritability estimates derived from Cornell data were highly correlated with those derived from Columbia data (r=0.67, p=2.56e-12, Figure 2C). As a group, respiratory diseases had the highest average heritability for both dichotomous (Figure 2D) and quantitative (Figure 2E) traits. Genitourinary and gastrointestinal diseases had the lowest average heritability.
We designed a method, called SOLARStrap, for estimating the heritability of traits where the phenotype is derived under unknown ascertainment biases, the h2o. We trained the hyperparameters of the model on a small subset of manually defined phenotypes with available heritability estimates from the literature (Materials and Methods) using data from Columbia and tested these parameters at Cornell. (A) We found that performance was consistent across both sites and that h2o is significantly correlated with literature estimates of h2. (B) Comparison of h2o (from Columbia and Cornell EHR) and h2 (from Literature) for 28 traits used for fitting the hyperparameters. Median and 95% confidence interval are shown. (C) We evaluated the heritability of just over 1,494 traits at Columbia and 1,145 traits at Cornell (Materials and Methods). We performed the analysis independently at each site. After quality control filters we found 216 traits with significant heritability at Columbia and 160 traits at Cornell, with 85 traits falling in the intersection. These 85 traits were significantly correlated between the two sites (r=0.67, p=2.56e-12). (D) 124 dichotomous traits (from disease billing codes) grouped by disease category and sorted by heritability within each group. Disease categories are sorted by the median heritability of the diseases within the category. Respiratory disease has the highest average heritability and gastrointestinal disease has the lowest average heritability. (E) 92 quantitative traits (from clinical pathology reports) grouped by disease category and sorted by heritability within each group. Trait categories are sorted by the median heritability of the traits within the category. Respiratory disorders have the highest average heritability followed by metabolic and nutritional disorders. Gastrointestinal and genitourinary disorders have the lowest average heritability of their corresponding quantitative traits.
(F) For the 124 dichotomous traits we have both estimates of heritability and sibling recurrence rates. The median heritability and recurrence rate were computed for each category and then normalized to the overall median heritability across all groups (y axis). The same was done for recurrence rates (x axis). Each category is shown as an open circle colored according to (D). The size of the circle indicates the number of traits within that category. Categories in the top right quadrant have higher than average heritability and higher than average recurrence rates while categories in the top left quadrant have low recurrence rates and high heritability, etc. (G) Observational heritability for morbid obesity and obesity at Columbia (light blue) and Cornell (red) as well as for HDL cholesterol at Columbia (light blue) and Cornell (red).
For dichotomous traits, we explored the relative contribution of genetics and the environment to the phenotype by comparing heritability estimates to sibling recurrence rates (Figure 2F). Disease groups fell into four distinct regions: (1) those with greater than average genetic and environmental contribution (Figure 2F, top right) – respiratory and neurologic diseases fell into this quadrant; (2) those with high genetics and lower than average environment (Figure 2F, top left) – e.g. endocrine and metabolic diseases; (3) high environment and low genetics (Figure 2F, bottom right) – gastrointestinal and genitourinary diseases; and (4) low environment and low genetics (Figure 2F, bottom left) – the general category of signs and symptoms is an example here.
Using phenotypes from the EHR for heritability can provide clarity for poorly studied traits, reveal subtle differences between closely related conditions, and open up new avenues of heritability research. For example, two previous studies have shown conflicting evidence for the relative heritability of HDL cholesterol and LDL cholesterol 9,10. The larger of these two studies (N=378) found no difference in the heritability of these two traits when adjusting for age and sex, while the other found a slightly higher heritability for HDL, but was underpowered to detect significance. We present strong evidence that HDL is significantly more heritable than LDL (h2o=0.49 vs 0.36, p=5.3e-41 at Columbia; h2o=0.47 vs 0.25, p=6.2e-159 at Cornell; Figure 2G). At 96,241 patients in the Columbia cohort and 33,239 patients in the Cornell cohort, ours may be the largest heritability study of cholesterol ever conducted. In addition, subtle phenotypical variations that are routinely collected clinically can be studied. For example, we found that the heritability of “obesity” is significantly greater than for “morbid obesity” (h2o = 0.43 vs 0.36, p=2.1e-8, N=26,783 at Columbia and h2o=0.63 vs 0.51, p=3.1e-9, N=11,220 at Cornell). Finally, the EHR can identify novel traits for genetic study. The most heritable trait we found was for “victim of child abuse,” h2o=0.90 (0.73-1.00), N=1,142 (Table 2). This trait is unique in that it is not a trait of the individual with the diagnosis code, but of another individual with whom the child interacts. To account for a potential artifact introduced by several siblings abused by a single individual, we recomputed heritability excluding siblings (Materials and Methods). We found that, while the effect is mitigated, the heritability remains high at h2o=0.80 (0.68-0.96) (Table S5). The familial trend of this behavioral trait has been well documented in the psychology literature11-13. 
Our findings provide additional evidence for a genetic role as well. Scientists studying child abuse and related conditions may consider performing a more traditional genetics analysis in the future.
The median observational heritability and ranges are shown for each of the 12 dichotomous trait categories and the 12 quantitative trait categories. Within each category the trait with the highest heritability and the trait with the lowest heritability are shown.
Recurrence Rates
We estimate sibling and familial recurrence for 765 dichotomous traits at Columbia and 393 traits at Cornell. When looking at sibling and familial recurrence, perinatal conditions are the most concordant between sites (r2 = 0.94 for sibling and 0.96 for familial). The least concordant were diseases of the digestive system for sibling recurrence (r2 = 0.02) and signs and symptoms for familial recurrence (r2 = 0). Sibling recurrence estimates are highly correlated between sites (r = 0.71, p = 2.52e-40), as are familial recurrence estimates (r = 0.49, p = 1.99e-21) (Figure 3A and 3B). Sibling recurrence, on average, is greater than familial recurrence at both sites (Figure 3C). We also calculated recurrence by disease site and stratified by relationship type (sibling, cousin, first cousin once removed). We observe that disease recurrence among siblings is higher than among cousins, which in turn is higher than among first cousins once removed (Figure 3D).
(A) Correlation of familial recurrence estimates for 328 conditions (rho = 0.49, p = 1.99e-21) between Cornell and Columbia. Perinatal conditions (green) represent the most concordant (r2 = 0.96) and signs and symptoms (black) represent the least concordant (r2 = 0). (B) Correlation of sibling recurrence estimates for 250 conditions between Cornell and Columbia (rho = 0.71, p = 2.52e-40). Perinatal conditions (green) once again represent the most concordant (r2 = 0.94) and diseases of the digestive system (black) represent the least concordant (r2 = 0.02). (C) Sibling recurrence estimates versus familial recurrence estimates at Columbia (left, blue) and Cornell (right, red). Sibling recurrence and familial recurrence are significantly correlated at both sites and sibling recurrence, on average, is greater than familial recurrence at both sites. (D) Sibling recurrence by disease category stratified by relationship type (sibling, cousin, first cousin once removed) for Columbia. Circles represent sibling recurrence rates, diamonds represent cousin recurrence rates and squares represent first cousin once removed recurrence rates, each shown with the 95% confidence interval.
Data accuracy and missingness
We evaluated the effect of the two most commonly cited limitations of EHR data, errors and missingness, on our estimates of observational heritability (Figure S1). Rhinitis is highly heritable in family studies (h2=0.95 CI=0.78-0.97)14 and also has high observational heritability at both sites (h2o=0.62 CI:0.49-0.73, h2o = 0.78 CI:0.61-0.91, Figure 3B). We evaluated the effect of errors and missingness on h2o for rhinitis at Cornell. The estimates are robust to missingness (Figure S1B). When 30% of the data are removed, the estimates remain consistent. Note that as more data are missing, power will become the major limitation. Heritability estimates are consistent until 20% or more of the data are noise, at which point the confidence intervals no longer overlap (Figure S1A). Injection of 5% noise reduces the estimate 13% (from h2o = 0.77 to h2o = 0.67) and 10% noise reduces the estimate 30% (from h2o = 0.77 to h2o = 0.53). This likely explains why our estimates are 7-20% lower than what would be expected from a carefully ascertained study, corresponding to around 5% misclassification in our EHR.
SOLARStrap sensitivity analysis. (A) The effect of noise injection on the estimate of observational heritability of rhinitis. We injected noise into the data by randomly shuffling a subset of the patient diagnoses. This simulates misclassification (misdiagnosis or missed diagnosis) in the medical records. When no noise is injected the estimate is 0.77 (0.60-0.92). As noise is introduced the estimate of the heritability decreases to 0.36 (0.23-0.49) once one quarter of the data are randomized. (B) The effect of missingness injection on the estimate of observational heritability of rhinitis. We injected missingness into the data by randomly removing a subset of the patient diagnoses. This simulates data that are missed by the medical records – an event that is common, especially at tertiary medical centers. When no data are removed the observational heritability is 0.75 (0.58-0.91). The heritability estimate remains consistent until 30% of the data are removed at which time the estimate is 0.54 (0.37-0.68).
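The two perturbations described in this caption can be sketched as below. This is an illustration only: the data layout (a mapping from a made-up patient identifier to a 0/1 diagnosis status), the function names, and the choice to drop removed patients entirely rather than mark them as unknown are assumptions, not details taken from the study.

```python
import random

def inject_noise(diagnoses, fraction, seed=0):
    """Shuffle the diagnosis labels among a random subset of patients."""
    rng = random.Random(seed)
    patients = sorted(diagnoses)
    chosen = rng.sample(patients, int(round(fraction * len(patients))))
    labels = [diagnoses[p] for p in chosen]
    rng.shuffle(labels)
    noisy = dict(diagnoses)
    noisy.update(zip(chosen, labels))  # reassign shuffled labels to the subset
    return noisy

def inject_missingness(diagnoses, fraction, seed=0):
    """Remove the diagnosis of a random subset of patients."""
    rng = random.Random(seed)
    patients = sorted(diagnoses)
    dropped = set(rng.sample(patients, int(round(fraction * len(patients)))))
    return {p: d for p, d in diagnoses.items() if p not in dropped}

# Toy cohort of 100 made-up patients, half of them with the diagnosis.
toy = {f"p{i}": i % 2 for i in range(100)}
noisy = inject_noise(toy, 0.25)           # one quarter of labels shuffled
sparse = inject_missingness(toy, 0.30)    # 30% of diagnoses removed
```

Shuffling labels within the subset keeps the overall number of cases unchanged, so only the assignment of cases to individuals is randomized; the heritability estimator would then be rerun on `noisy` or `sparse`.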
Discussion
Analysis of EHR data has yielded insight into drug effectiveness and allowed precise definition of phenotypes to investigate disease processes 15-20. For the first time on a large scale, we have used EHR data to infer pedigrees from patient-provided emergency contact information. We present our novel algorithm for performing this relationship extraction, RIFTEHR, and validated its performance. This approach has significant implications for estimating heritability of disease without direct genetic testing. The EHR data used in this research are nearly ubiquitous and, if privacy is adequately protected, could allow almost any research hospital to identify related patients with high specificity and sensitivity. Finally, we used EHR-inferred relationships to evaluate the heritability of 2,089 traits and found 328 with significant heritability. The heritability of many of these traits has never before been studied.
Heritability is a key component in precision medicine, and is typically estimated based on family history. Collection of comprehensive and accurate family history is time-consuming and does not occur during the vast majority of clinical encounters. The construction of pedigrees by inference of relatedness from administrative records allows for rapid assessment of family history and heritability at scales that were previously impossible to achieve. The algorithm used in this study uncovered over 379,000 pedigrees within the medical records of two academic medical centers. We validated the inferred familial relationships against both clinical and genetic references and found PPV between 87% and 99%. One of the limitations of our method is the challenge to differentiate between direct blood relatives and adopted families. Emergency contact is not a biological construct; therefore, patients identify not only direct-blood relatives, but also adoptive family members and use familial labels for friends.
Using EHR-inferred relationships we calculated heritability, sibling recurrence, and familial recurrence estimates among individuals with defined relationships. Previous research in this area has focused on family studies of known relatives, specifically twins. Mayer and colleagues used EHR data to create a cohort of 2,000 twins/multiple births and measured concordance among identified twins for two highly heritable diseases, muscular dystrophy and fragile-X syndrome.7 Our study looked not only at twins, but entire families across several generations. We evaluated 2,089 traits and computed high confidence heritability estimates for 328 of them. Importantly, most previous studies have predominantly involved White Europeans and may not be representative of other populations. However, our results reflect the diverse, multiethnic population of New York City.
The primary and most significant challenge when using traits defined from an observational resource, like the electronic health records (EHR), is incomplete phenotype information resulting in ascertainment bias. In a heritability study, the phenotype of each study participant is, ideally, carefully evaluated and quantified. This is infeasible, however, when the cohort contains millions of patients with thousands of phenotypes. The differential probability that a given individual will be phenotyped for a study trait is the ascertainment bias. The bias may depend on many latent factors, including the trait being studied, the trait status of relatives, the proximity to the hospital, and an individual’s ethnicity and cultural identification, among others. The consequence of this uncontrolled ascertainment bias is that heritability estimates will be highly dependent on the particular individuals in the study cohort. We used repeated sub-sampling to characterize this dependence quantitatively. EHR-based heritability estimates are particularly well-suited for complex traits that require large numbers of patients (e.g., Type 2 Diabetes Mellitus and Obesity). Most importantly, using the EHR can identify new avenues for research. We report very high heritability for child abuse, indicating a potential genetic role in this well studied condition.
The unique nature of the relationships and phenotypes derived from the EHR may necessitate novel methods for estimating heritability. We used a mixed linear model implemented in SOLAR8 to estimate heritability and used repeated sampling to characterize the variance from ascertainment heterogeneities. There may be more accurate ways to estimate heritability from this unique data source. For example, in the case of child abuse, it is the victim of the abuse and not the abuser who will have the data coded. New methods designed for EHR data may be able to better control for the peculiar confounding effects of observational data.
There are significant bioethical considerations regarding the use of the RIFTEHR method, including how best to balance the competing demands of protecting patients’ privacy with clinicians’ duty to warn relatives of potential genetic risks. The method could readily be applied in EHR systems, such that clinicians could easily access the health information of a patient’s family members. In the United States, accessing a family member’s health information in this manner may be considered a violation of the 1996 Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule21. On the other hand, case law in the United States has established that healthcare providers have a responsibility to inform a patient’s relatives about heritable conditions that may reasonably put the relatives “at risk of harm” 22. These conflicts may need to be resolved before automatic relationship inference can be used clinically.
We have described and validated a novel method for identifying familial relationships in patients’ medical records, and used 4.7 million relationships inferred from the EHRs at two academic medical centers to estimate heritability of disease. We found that heritability estimates were concordant across the two centers, suggesting that the method may have broad applicability. An EHR that is linked to genetic information enables personalized disease risk prediction and facilitates heritability determination for EHR-captured phenotypes that have not been previously studied by family-based or twin studies. Identifying familial relationships is useful for all aspects of medicine, ranging from genetic research to clinical practice, making RIFTEHR a valuable tool for the advancement of precision medicine. The correspondence of our heritability estimates with family-based estimates provides a direct and novel validation of the value of electronic health records in making inferences about disease, which is now emerging as a central approach in precision medicine.
Materials and Methods
The data for this study were obtained from the inpatient EHR used at the hospitals affiliated with two large academic medical centers in New York City: Columbia University Medical Center and Weill Cornell Medical College. These hospitals operate together as NewYork-Presbyterian Hospital and herein, we will refer to the hospitals and the data associated with them as Columbia and Cornell, respectively.
1. Relationship Inference from the Electronic Health Record (RIFTEHR)
This research was approved by the institutional review boards at the two study sites. As is common practice, when patients received care at either site, they were asked to provide information about an emergency contact. This information included the person’s name, address, phone number, and their relationship to the patient (e.g., parent, sibling, friend). We used the emergency contact information to identify familial relationships in the EHR in cases where the emergency contact person had his or her own record generated by an encounter with the healthcare system. Algorithmically, we then inferred additional relationships from the connectedness of the identified individuals. This information was validated against genetic data and a separate module of the EHR which documented the linkage between mothers’ and their newborns’ medical records. Using the relationships identified, we assigned phenotypes using clinical history, and subsequently evaluated familial recurrence for all available clinical phenotypes.
1.1. Deriving familial relationships from emergency contact data
1.1.1. Matching emergency contact to medical records
Our algorithm creates for each patient a list of all reported emergency contacts. Then, for each emergency contact, it attempts to identify a medical record by matching first name, last name, primary phone number and ZIP code. First we consider all cases with first name and filter the table that contains all patients’ information to identify records that contain the same first name. We then return the identified records and perform the same comparison with last name, primary phone number and ZIP code. Subsequently, we compare the combination of two variables at a time (i.e. first name and last name, first name and primary phone number, first name and ZIP code, etc.). We then perform combinations of three variables and then of all four variables. We only consider a match successful when we identify a single patient that matches the emergency contact information given. We also capture which variables were used in the matching process for each one of the emergency contacts (i.e. first name and last name; first name, last name and phone number, etc.). The output of this algorithm contains the patient’s identifier, the relationship between the patient and the matched emergency contact, the emergency contact’s identifier, as well as a list of the variables used to perform the matching process. We use as patient identifiers the Enterprise Master Patient Index (EMPI), when available, or the medical record number (MRN). EMPIs are unique identifiers created to refer to multiple MRNs across the healthcare organization. Using EMPIs allows us to perform better in the matching process since duplicates from patients having more than one MRN are excluded.
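The matching step can be sketched as follows. This is a simplified illustration under stated assumptions, not the RIFTEHR code: the field names, the record layout, the function name `match_contact`, and the decision to report every field combination that yields a unique match are hypothetical, and the records shown are synthetic.

```python
from itertools import combinations

# Hypothetical field names for the four matching variables.
FIELDS = ("first_name", "last_name", "phone", "zip_code")

def match_contact(contact, patients):
    """Try every combination of one to four fields and keep those that
    identify exactly one patient record.

    Returns a list of (patient_id, matched_fields) tuples.
    """
    matches = []
    for size in range(1, len(FIELDS) + 1):
        for fields in combinations(FIELDS, size):
            # A combination is unusable if the contact lacks any of its fields.
            if any(not contact.get(f) for f in fields):
                continue
            hits = [p["id"] for p in patients
                    if all(p.get(f) == contact[f] for f in fields)]
            if len(hits) == 1:  # only a single matching patient counts
                matches.append((hits[0], fields))
    return matches

# Synthetic records for illustration only.
patients = [
    {"id": "E1", "first_name": "ANA", "last_name": "DIAZ",
     "phone": "5550001", "zip_code": "10032"},
    {"id": "E2", "first_name": "ANA", "last_name": "REYES",
     "phone": "5550002", "zip_code": "10032"},
]
contact = {"first_name": "ANA", "last_name": "DIAZ",
           "phone": "5550001", "zip_code": "10032"}
matches = match_contact(contact, patients)
```

Here the first name alone and the ZIP code alone each match two records and are therefore rejected, while every combination that includes the last name or the phone number identifies record `E1` uniquely. Keeping the list of matched fields alongside each hit mirrors the text's note that the variables used for matching are captured in the output.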
1.1.2. Quality Control of matches
Once the matches are identified, we exclude patients with non-biological relationships (e.g. spouse, friend). Specific relationships are mapped to relationship groups (e.g. the relationship “mother” is mapped to “parent”). We then calculate the age difference between two related patients and exclude parents that are less than 10 years older than their children, children that are less than 10 years younger than their parents, grandparents that are less than 20 years older than their grandchildren, and grandchildren that are less than 20 years younger than their grandparents. Since parents and grandparents must be older than their children and grandchildren, we also flip relationships when the age difference between a parent or grandparent and their child or grandchild is negative: the relationship “parent” becomes “child” and the relationship “grandparent” becomes “grandchild”. The same is done when the age difference for a child or grandchild is positive. We also exclude every patient that matches to 20 or more distinct emergency contacts. Finally, we generate the opposite relationship for every relationship pair. For example, if A is the parent of B, the opposite relationship is that B is the child of A.
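A minimal Python sketch of these age-based checks follows. The order of operations (flip first, then apply the minimum age gap) and the triple layout `(patient, relationship, contact)` are assumptions for illustration; the rule excluding patients with 20 or more contacts is not shown.

```python
# Minimum age gaps in years, taken from the text above.
MIN_GAP = {"parent": 10, "child": 10, "grandparent": 20, "grandchild": 20}
OPPOSITE = {"parent": "child", "child": "parent",
            "grandparent": "grandchild", "grandchild": "grandparent",
            "sibling": "sibling"}
ELDER = {"parent", "grandparent"}

def qc_relationship(rel, patient_age, contact_age):
    """Return the cleaned label for 'contact is the <rel> of patient', or None."""
    if rel not in OPPOSITE:             # spouse, friend, ...: non-biological
        return None
    if rel not in MIN_GAP:              # e.g. sibling: no age rule in this sketch
        return rel
    diff = contact_age - patient_age    # positive when the contact is older
    if diff != 0 and (diff > 0) != (rel in ELDER):
        rel = OPPOSITE[rel]             # direction reported backwards: flip it
    if abs(diff) < MIN_GAP[rel]:        # implausibly small generation gap
        return None
    return rel

def with_opposites(triples):
    """Add (contact, opposite relationship, patient) for every triple."""
    out = set(triples)
    out.update((b, OPPOSITE[rel], a) for a, rel, b in triples)
    return out
```

For instance, a contact recorded as the “parent” of a 40-year-old patient but aged 10 would be relabeled as “child”, while a “parent” only 5 years older than the patient would be dropped.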
1.1.3. Inferring familial relationships
Using the matches identified, we infer additional relationships. Inference is based on familial relationship rules. For example, if patient A is the mother of patient B and patient B is the mother of patient C, then by inference A is the grandmother of C and C is the grandchild of A. The rules used to perform these inferences are described in Supplementary Table 4.
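The rule-based inference can be sketched in Python as a composition of relationship labels. Only the two rules implied by the example above are encoded; the complete rule set in Supplementary Table 4 is not reproduced here, and the triple layout is an assumption.

```python
# Composition rules: (a is r1 of b) and (b is r2 of c) => (a is RULES[r1, r2] of c).
# Only the rules implied by the grandmother example are listed.
RULES = {
    ("parent", "parent"): "grandparent",
    ("child", "child"): "grandchild",
}

def infer(relations):
    """relations: set of (a, rel, b) triples read as 'a is the rel of b'."""
    inferred = set()
    for a, r1, b in relations:
        for b2, r2, c in relations:
            if b == b2 and a != c and (r1, r2) in RULES:
                inferred.add((a, RULES[(r1, r2)], c))
    return inferred - relations         # keep only newly derived relationships

known = {("A", "parent", "B"), ("B", "parent", "C"),
         ("B", "child", "A"), ("C", "child", "B")}
new_relations = infer(known)
```

The quadratic loop is adequate for a sketch; at the scale of millions of relationships an index keyed on the shared individual would be needed.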
1.1.4. Quality Control of inferred relationships
Once additional relationships are inferred, we remove ambiguous relationships such as “Parent/Aunt/Uncle” if the same pair contains a unique specific relationship, in this case, either “Parent” or “Aunt/Uncle”. The same is done for “Child/Nephew/Niece”, “Sibling/Cousin”, “Parent/Parent-in-law”, “Child/Child-in-law”, “Grandaunt/Granduncle/Grandaunt-in-law/Granduncle-in-law”, “Grandchild/Grandchild-in-law”, “Grandnephew/Grandniece/Grandnephew-in-law/Grandniece-in-law”, “Grandparent/Grandparent-in-law”, “Great-grandchild/Great-grandchild-in-law”, “Great-grandparent/Great-grandparent-in-law”, “Nephew/Niece/Nephew-in-law/Niece-in-law”, and “Sibling/Sibling-in-law”.
1.1.5. Identification of families
To identify families in the datasets, we exclude all non-biological relationships such as spouses and in-laws, as well as ambiguous relationships such as “Parent/Parent-in-law”. Using both provided and inferred relationships, we created a network where each node corresponds to a patient and edges represent familial relationships. To identify different families, we decomposed the network into individual connected components.
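The decomposition into connected components can be sketched with a breadth-first search in plain Python. The edge-list input format is an assumption; any graph library that provides connected components would serve equally well.

```python
from collections import defaultdict, deque

def families(edges):
    """Group patients into families: connected components of the relationship graph."""
    graph = defaultdict(set)
    for a, b in edges:                  # edges are undirected familial links
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), set()
        while queue:                    # breadth-first traversal of one family
            node = queue.popleft()
            members.add(node)
            for nxt in graph[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(members)
    return components
```

For example, `families([("A", "B"), ("B", "C"), ("D", "E")])` groups A, B and C into one family and D and E into another.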
1.1.6. Identification of twins
To identify twins we matched siblings that shared the same last name and the same date of birth. We do not have enough information to distinguish between monozygotic and dizygotic twins.
1.2. Evaluation of automatically inferred relationships
1.2.1. Evaluation using the EHR’s mother-baby linkage
We used the EHR’s mother-baby linkage as the gold standard to evaluate identified maternal relationships. We counted as true positives maternal relationships that were present in the EHR’s mother-baby linkage table and also identified by our algorithm; as false positives maternal relationships we identified that were discordant with those in the EHR’s mother-baby linkage; and as false negatives maternal relationships that were captured by the EHR’s mother-baby linkage but not by our method. Overall performance was evaluated by calculating sensitivity and positive predictive value (PPV). To assess whether matches identified by different variables perform differently, we also computed sensitivity and PPV stratified by the number of variables used to match the emergency contact to a patient in our healthcare system (Table S2), as well as by the combination of variables (e.g. last name only, first name and last name) used to perform the match (Table S3).
Table S2. Performance by number of paths.
Table S3. Performance by matched path.
1.2.2. Evaluation using genetic data with analysis for kinship
Genotype data were collected from existing sources for 186 individuals. Data were collected from three separate sources, the Institute for Genomic Medicine, the Columbia University Medical Center Pathology Department, and the Washington Heights/Inwood Informatics Infrastructure for Comparative Effectiveness Research (WICER) project, using whole exome sequencing, the Affymetrix CytoScan HD array, and the Illumina Multi-Ethnic Genotyping Array, respectively. To select SNPs for kinship analysis, we filtered for minor allele frequency >5% and genotyping rate of 99% using PLINK 21. Independent SNPs were selected using the sliding window (100 SNPs) linkage disequilibrium approach. This resulted in a total of 24,752 variants from the Institute for Genomic Medicine data, 8,544 SNPs from the WICER data, and 32,938 SNPs from the Pathology Department data. PLINK was then used to estimate identity by descent (IBD), and the proportion IBD, P(IBD=2) + 0.5 × P(IBD=1), was computed for each pair of individuals. We consider a predicted relationship correct if the proportion IBD between the two people equals the value expected for the predicted relationship, within a margin of error of 20% of the expected value. For example, if the two individuals in a predicted mother-child pair share 50% (±10%) of their genetic information, we consider the predicted relationship correct. Likewise, the two individuals in a predicted aunt-niece pair are expected to share 25% (±5%). Performance was evaluated by calculating PPV.
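The acceptance rule can be written as a short Python sketch. The expected sharing values are the two given in the text; the dictionary keys and function names are hypothetical. The quantity P(IBD=2) + 0.5 × P(IBD=1) is what PLINK reports as PI_HAT in its pairwise IBD output.

```python
# Expected proportion IBD per predicted relationship (values from the text).
EXPECTED_SHARING = {"parent": 0.50, "child": 0.50, "aunt/uncle": 0.25}

def proportion_ibd(p_ibd1, p_ibd2):
    """Proportion of the genome shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return p_ibd2 + 0.5 * p_ibd1

def is_confirmed(rel, p_ibd1, p_ibd2, margin=0.20):
    """True if observed sharing is within `margin` (relative) of the expectation."""
    expected = EXPECTED_SHARING[rel]
    return abs(proportion_ibd(p_ibd1, p_ibd2) - expected) <= margin * expected

def ppv(pairs):
    """pairs: iterable of (rel, p_ibd1, p_ibd2); fraction of predictions confirmed."""
    pairs = list(pairs)
    return sum(is_confirmed(*p) for p in pairs) / len(pairs)
```

A relative margin of 20% reproduces the tolerances quoted above: ±10 percentage points around 50% and ±5 percentage points around 25%.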
1.2.3. Evaluation using clinical data
As a qualitative validation of all relationship types, including distant relationships such as great-grandparent, we calculated age difference between all pairs of family relatives and stratified it by relationship type. We compared the identified age differences to what would be expected in a real family structure. For example, great-grandparents should be much older than their great-grandchildren.
2. Phenotyping in the EHR
We used clinical pathology reports as quantitative traits and diagnosis billing codes as dichotomous traits in our study. We extracted the top used clinical pathology reports and mapped them to LOINC codes so that they could be matched between institutions. Each patient may have multiple lab reports over time. To get a single phenotype value we collapsed all reports for each patient into a single value using the mean. This mean represents the average value for the report for the patient over all time available. For example, a patient’s mean blood glucose value over their lifetime.
For dichotomous traits we used any diagnosis billing code that was used for at least 1,000 distinct patients. Any patient with evidence of that code in their medical record history was considered a “case.” Controls were chosen as any patient that did not have that diagnosis nor any diagnosis that shared an ancestor according to the Clinical Classifications Software (CCS). This tool was developed by the Agency for Healthcare Research and Quality (AHRQ). CCS is composed of diagnoses and procedures organized in two related classification systems. In this study, we only used the diagnoses classifications. The single-level system consists of 285 mutually-exclusive diagnosis categories. It enables researchers to map any of the 3,824 ICD9-CM diagnosis codes into one of the 285 CCS categories. CCS also has a multi-level system composed of 4 levels representing a hierarchy of the 285 categories. The first level is broken into 18 categories. To define a control group, we linked the ICD9 codes associated to a phenotype of interest to their CCS categories using the top-level hierarchical categories. We also generated a table associating each patient to CCS categories they were diagnosed with. Once this mapping was done, each phenotype was associated to one or multiple distinct CCS categories. We matched these CCS categories in the multi-level system to identify the first level parent category. We considered these top level categories as our exclusion criteria: the control cohort for this phenotype should have no mention of any CCS under these categories in its medical records. For example, the controls for atrial fibrillation will exclude patients with cardiovascular diseases.
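The control definition can be sketched in Python as a set exclusion. The input mapping from patients to top-level CCS categories and the category labels are hypothetical placeholders, not actual CCS identifiers.

```python
def select_controls(patient_top_levels, phenotype_top_levels):
    """Controls: patients with no diagnosis under the phenotype's top-level CCS categories.

    patient_top_levels: {patient_id: set of top-level CCS categories ever diagnosed}
    phenotype_top_levels: set of top-level CCS categories linked to the phenotype
    """
    return {pid for pid, cats in patient_top_levels.items()
            if not (cats & phenotype_top_levels)}

# Toy data with made-up category labels.
patient_top_levels = {
    "P1": {"circulatory"},                 # excluded: shares the top-level category
    "P2": {"respiratory"},                 # eligible control
    "P3": set(),                           # eligible control
    "P4": {"circulatory", "digestive"},    # excluded
}
controls = select_controls(patient_top_levels, {"circulatory"})
```

This mirrors the atrial fibrillation example: any patient with a diagnosis anywhere in the same top-level category is removed from the control cohort, not only patients with the phenotype itself.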
We semi-manually curated a set of 85 phenotypes to use for training and testing the SOLARStrap algorithm (see Methods 3.3). For these 85 phenotypes, we grouped closely related diagnosis codes together to increase the total number of patients (Table S6).
3. Estimation of heritability from the Electronic Health Records
3.1. Rationale
The primary and most significant challenge when using traits defined from an observational resource, like the electronic health records (EHR), is the lack of ascertainment. In a heritability study, the phenotype of each study participant is, ideally, carefully evaluated and quantified. This is infeasible, however, when the cohort contains millions of patients with thousands of phenotypes. The differential probability that a given individual will be phenotyped for a study trait is the ascertainment bias. The bias may depend on many latent factors, including the trait being studied, the trait status of relatives, the proximity to the hospital, and an individual’s ethnicity and cultural identification, among others. The consequence of this uncontrolled ascertainment bias is that heritability estimates will be highly dependent on the particular individuals in the study cohort. We used repeated sub-sampling to characterize this dependency quantitatively. We define the observational heritability, or h2o, as the median of the statistically significant sample estimates. For a given trait, the procedure, which we call SOLARStrap, involves sampling families, running SOLAR to estimate sample heritability, and rejecting or accepting the estimate based on a set of quality control criteria. Each step is detailed below.
3.2. SOLARStrap Protocol
3.2.1. Building pedigree files
Of the 223,307 families at Columbia there were 6,894 that contained conflicting relationships, where two individuals were inferred to have two different relationships. At Cornell 3,258 families of 155,811 contained conflicts. These families were excluded from the heritability studies. In some cases, more than one mother or father is annotated for an individual. This could be because of duplicate patient records or errors in the EHR relationship extraction. We resolve these issues by choosing the mother or father that has more relationships in the family. The other relationship is discarded. We then constructed a master pedigree file for each site. To construct this pedigree file we iterate through each member of each family. For each individual, the mother and father are either known from the EHR-derived relationships or not. If a parent is not known, a new identifier is created to represent that parent. At this point we iterate through all other family members and record the relationships between the new individual and each family member. We repeat this process until the entire pedigree file is created. The master pedigree files contain 1,377,173 and 940,040 individuals for Columbia and Cornell, respectively.
3.2.1. Sampling Families
The number of families that are sampled combined with the prevalence of the trait defines the power of the heritability analysis. A smaller heritability can be detected with larger sample sizes. However, as the sample size increases toward the total number of families the variance in heritability that can be observed will decrease. This is because we are sampling without replacement. Since we do not know a priori what the magnitude of the heritability will be or what the variance will be we iterate through sample sizes from 100 to the total number of available families. The maximum sample size is defined by the limitations of SOLAR which can only handle a maximum of 32,000 individuals per pedigree file. For each sample size we perform 200 samplings. For each of these we build a custom pedigree and phenotype files and run SOLAR to estimate the heritability. We then aggregate the results.
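The sampling loop can be sketched in Python as follows. The function name, the input format, and the choice to skip a draw that exceeds the individual limit are assumptions for illustration; the source states only that the limit bounds the maximum sample size.

```python
import random

def sample_families(family_sizes, n_families, n_samplings=200,
                    max_individuals=32000, seed=0):
    """Yield lists of family ids, each drawn without replacement.

    family_sizes: {family_id: number of individuals in that family}
    Draws whose total size exceeds the SOLAR limit are skipped (an assumption).
    """
    rng = random.Random(seed)
    ids = sorted(family_sizes)
    for _ in range(n_samplings):
        chosen = rng.sample(ids, n_families)   # no family appears twice in a draw
        if sum(family_sizes[f] for f in chosen) <= max_individuals:
            yield chosen

# Toy example: 1,000 families of 4 individuals, 5 samplings of 100 families.
toy_sizes = {"F%d" % i: 4 for i in range(1000)}
samples = list(sample_families(toy_sizes, 100, n_samplings=5))
```

Each yielded list would then be used to build one sample pedigree file and one phenotype file before running SOLAR.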
3.2.2. Sample pedigree files
For each sampling a set of N families are selected. To construct the sample pedigree file we identify all lines from the master pedigree files that correspond to these families and create a new file from this subset.
3.2.3. Sample phenotype files
Once the pedigree file is created, we iterate over every individual in the pedigree and use the reference trait data and demographic data to enter the phenotype status and age of the patient. If no phenotype data are available for the individual we enter it as missing. For dichotomous traits the trait values are either 0 (absence), 1 (presence), or missing, and a “proband” is randomly assigned by selecting a single individual from each family that has the trait. See “Phenotyping in the EHR” for a description of how these traits are assigned. For quantitative traits we enter the quantitative value or missing.
3.2.2. Running SOLAR
We use SOLAR to estimate both quantitative and dichotomous trait heritability using a mixed linear model. In both cases sex and age are modeled as covariates. After the pedigree and phenotype files are loaded, the heritability is estimated with the `polygenic -screen` command. We used the `tdist` command in SOLAR to adjust quantitative traits that are not normally distributed. For dichotomous traits one “proband” is chosen at random for each family. SOLAR will automatically detect the presence of a dichotomous trait and convert the estimate from the observed scale to the liability scale. The heritability, the error on the heritability, and the p value are saved from each run for later analysis and aggregation.
3.2.3. Quality Control of SOLAR heritability solutions
SOLAR does not converge on a solution for heritability for all samples. Errors in the pedigree or in the ascertainment of phenotypes are the most likely causes for these failures. First, we reject any runs of SOLAR that result in no solution for the heritability. We then consider two additional criteria that must be met in order for a solution to be considered legitimate: (i) edge epsilon (∊e), any estimate within ∊e of 1 or 0 is rejected; and (ii) noise epsilon (∊n), any estimate with implausibly low error is rejected (h2 error is less than ∊n of the h2 estimate). These hyperparameters are set using a set of phenotypes for which we have observational heritability estimates and high confidence literature reported heritabilities from other studies.
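The two epsilon criteria can be expressed as a small Python filter. The default values are the fitted hyperparameters reported in section 3.3; treating "within ∊e" as an inclusive comparison and representing a failed run as `None` are assumptions.

```python
def passes_qc(h2, h2_err, edge_eps=1e-9, noise_eps=0.05):
    """Apply the basic SOLAR quality control criteria to one sample estimate."""
    if h2 is None or h2_err is None:          # SOLAR returned no solution
        return False
    if h2 <= edge_eps or h2 >= 1 - edge_eps:  # edge epsilon: too close to 0 or 1
        return False
    if h2_err < noise_eps * h2:               # noise epsilon: implausibly low error
        return False
    return True
```

For example, an estimate of 0.5 with a standard error of 0.1 is kept, whereas the same estimate with a standard error of 0.001 is rejected because the error is less than 5% of the estimate.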
POSA. After filtering the SOLAR solutions for the basic criteria, we define an additional quality control metric called the Proportion Of Significant Attempts, or POSA. POSA is defined as the number of solutions with a p value less than αPOSA divided by the total number of converged solutions (or attempts). The POSA is important because it is closely related to the power of the analysis. A fully powered analysis will have a POSA of 1, meaning that all of the converged estimates are statistically significant. A POSA of 0.5 means that only half of the converged estimates are statistically significant. When the families were sampled the observed heritability was large enough to be detected with p < αPOSA half of the time. In other words, we were powered to detect a heritability in 50% of samplings. We show that the higher the POSA, the more accurate the heritability estimates are (Figure S2). We chose a minimum POSA score, POSAlower and the αPOSA using a set of phenotypes for which we have observational heritability estimates and high confidence literature reported heritabilities.
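As a concrete illustration, POSA reduces to a simple proportion. The sketch below assumes the input is the list of p values from the converged solutions that passed the basic criteria; returning 0 for an empty list is an added convention.

```python
def posa(p_values, alpha=0.05):
    """Proportion Of Significant Attempts among converged SOLAR solutions."""
    if not p_values:
        return 0.0                      # no converged solutions: treat as unpowered
    return sum(p < alpha for p in p_values) / len(p_values)
```

With p values of 0.01, 0.2, 0.03 and 0.5 and αPOSA = 0.05, two of the four attempts are significant, giving a POSA of 0.5.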
Accuracy of heritability estimates relies on the proportion of significant attempts (POSA). The POSA score is a measure of how reliable the heritability estimate generated by SOLARStrap is. If none of the sample estimates are statistically significant then the POSA will be 0, indicating that the analysis is underpowered. As the sample size increases the power will increase and so does the POSA score. At a POSA of 0.5 or above, the correlation of the observational heritability estimates to the reference standard jumps significantly. A POSA of 0.7 or above was found to yield the maximum correlation between SOLARStrap heritability estimates and the reference standard.
3.2.4. Aggregation of sampling results (computing h2o)
For each sampling that passes quality control and meets the minimum POSA score, we compute the h2o as the median. The median h2o corresponds to a single run of SOLAR that has passed all of the quality control filters. We used the standard error reported by SOLAR for that run as the error of the h2o. We found that this error is closely related to the sampling variance (Figure S3). All raw heritability estimates that pass the initial quality control are made publicly available for reanalysis.
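The aggregation step can be sketched in Python as below. The tuple layout is hypothetical, and taking the lower middle run when the number of significant estimates is even is an assumption made so that the median always corresponds to a single SOLAR run.

```python
def observational_h2(runs, alpha=0.05, posa_lower=0.7):
    """Return (h2o, error) for one trait and sample size, or None if POSA is too low.

    runs: list of (h2, h2_err, p_value) tuples that already passed basic QC.
    """
    significant = sorted(r for r in runs if r[2] < alpha)   # sorted by h2
    if not runs or len(significant) / len(runs) < posa_lower:
        return None
    median_run = significant[(len(significant) - 1) // 2]   # lower median run
    return median_run[0], median_run[1]

runs = [(0.40, 0.10, 0.01), (0.50, 0.12, 0.02),
        (0.60, 0.11, 0.001), (0.30, 0.20, 0.30)]
h2o = observational_h2(runs)
```

In this toy input three of four runs are significant (POSA 0.75), so the median significant run supplies both the estimate and its SOLAR-reported error.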
SOLAR error versus SOLARStrap variance. The error estimate from SOLAR is significantly correlated to the sampling variance of the heritability estimates (r=0.63, p=3.3e-10).
3.3. Fit and validation of hyperparameters
Heritability estimates for 91 phenotypes were mined from the literature along with their corresponding confidence intervals, if they were available. We performed a brute force search through the parameter space. Possible values for edge epsilon (∊e) were (0, 1e-9, 1e-8, and 1e-7). Possible values for noise epsilon (∊n) were (0.01, 0.025, 0.05, 0.075, 0.1, and 0.2). Possible values for αPOSA were (0.03, 0.05, 0.1, 0.25, 0.5, and 1.0). Possible values for the POSAlower were (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, and 0.975). We evaluated each set of parameters for the correlation between the h2o and the h2. Only a single site was used to fit these parameters, leaving data from the other site available for validation. The maximum correlation was 0.558 with αPOSA = 0.05, ∊e = 1e-9, ∊n = 0.05, and POSAlower = 0.7. At these parameter settings 19 traits passed quality control. The average difference between h2o and h2 was 17.7% ± 8%.
To evaluate the generalizability of the hyperparameters, we applied them to the validation site data. Ten traits passed the quality controls and we found that they were correlated with literature estimates of heritability (r = 0.73, p = 0.016). The average difference between h2o and h2 was 7.6% ± 9%.
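The brute-force search over the hyperparameter space can be sketched in Python using the candidate values listed above. The `score` callable is a hypothetical stand-in for running the full pipeline on the fitting site and returning the correlation between h2o and literature h2.

```python
from itertools import product

# Candidate values as listed in the text.
GRID = {
    "edge_eps": (0, 1e-9, 1e-8, 1e-7),
    "noise_eps": (0.01, 0.025, 0.05, 0.075, 0.1, 0.2),
    "alpha_posa": (0.03, 0.05, 0.1, 0.25, 0.5, 1.0),
    "posa_lower": (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.975),
}

def grid_search(score):
    """Return (best correlation, parameters); score(params) may return None."""
    best = None
    for values in product(*GRID.values()):
        params = dict(zip(GRID, values))
        r = score(params)
        if r is not None and (best is None or r > best[0]):
            best = (r, params)
    return best

n_settings = sum(1 for _ in product(*GRID.values()))

# Toy score (illustrative only): peaks at posa_lower = 0.7 and noise_eps = 0.05.
toy_best = grid_search(lambda p: -abs(p["posa_lower"] - 0.7) - abs(p["noise_eps"] - 0.05))
```

The grid contains 4 × 6 × 6 × 11 = 1,584 settings, small enough to evaluate exhaustively.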
3.4. Preparation of data for analysis on external computing clusters
Due to the high number of heritability estimates that need to be computed, external computing resources are used: the Open Science Grid (OSG) and Amazon Web Services (AWS). The OSG is a massive computing resource funded by the Department of Energy and the National Science Foundation. The OSG comprises over 100 individual sites throughout the United States, primarily located at universities and national laboratories. The sites contain anywhere from hundreds to tens of thousands of CPU cores available for scientific research23,24. AWS, which makes on-demand, high-performance compute instances available, is used to supplement this resource. Per institutional requirements, no protected health information or personally identifying information can be transferred to systems outside of our institutional networks. To leverage these resources for our computing task we prepared a data subset according to the Safe Harbor guidance provided by the U.S. Department of Health and Human Services (http://www.hhs.gov/hipaa/forprofessionals/privacy/special-topics/de-identification/index.html).
Here is a point-by-point account of how we processed the data for Safe Harbor for each of the 18 identifiers: (A) we removed first, middle, and last names for all patients, (B) all patient address information is removed, (C) all dates are removed and all ages over 89 are coded as “90”, (D) telephone numbers and (E) fax numbers are removed, (F) there are no email addresses in our subset of the clinical data, (G) there are no social security numbers in our subset of the clinical data, (H) medical record numbers are mapped to a 10 digit random number and the mapping is stored on a limited access PHI-certified server within the institutional firewall and will never be made available, (I) there are no health plan beneficiary numbers in our data subset, (J) there are no account numbers in our data subset, (K) there are no certificate or license numbers, (L) there are no vehicle numbers or serial numbers in our data subset, (M) there are no device identifiers or serial numbers, (N) there are no URLs in our data subset, (O) there are no IP addresses in our data subset, (P) there are no biometric identifiers in our data subset, (Q) there are no full-face or comparable images in our data subset, (R) there are no other uniquely identifying characteristics or numbers. All data were transferred using encrypted secure file transfer protocols and were destroyed immediately after retrieval of the results. In total we used over 20,000 CPU-hours to compute heritability estimates for 2,089 traits.
3.5. Investigation of Heritability of “Victim of Child Abuse”
The trait with the highest heritability in our study was “victim of child abuse”, coded as V61.21. The heritability was 0.90 with the 95% confidence interval spanning from 0.73 to 1.00 at Columbia when sampling 600 families. There were not enough data at Cornell to estimate the heritability. At Columbia this trait was coded for 946 families, whereas at Cornell it was available for only 134 families. None of the estimates from Cornell passed our primary QC screen. At Columbia, however, estimates are available and sample sizes of 200, 300, 400, 500, and 600 all passed the second QC stage (POSA > 0.7). The heritability estimates ranged from 0.76 (0.45-0.97) to 0.90 (0.73-1.00) and are shown in Table S5. This code does not describe a trait of the individual to whom it is assigned, but rather of another individual with whom the patient interacts. We suspected that the high heritability may be an artifact of multiple siblings in a single family being abused by a single individual. To account for this, we chose only a single affected sibling for each family. All other siblings were coded as having their trait “missing.” We then ran SOLARStrap for sample sizes of 200, 300, 400, 500, and 600 for comparison (Table S5).
Supplementary Table 4. Relationship inference rules.
Table S5. Observational heritability of child abuse.
Table S6. 85 semi-manually created phenotypes.
4. Estimation of sibling and familial recurrence for dichotomous traits
We estimated sibling recurrence as the proportion of individuals with that trait given that they have a sibling with the trait. We randomized the choice of primary sibling that the probability is conditioned upon. We computed familial recurrence similarly, except that any relationship type was allowed. Both sibling and familial recurrence were only calculated for conditions with 10 or more concordant pairs. The recurrence rate is calculated by [equation omitted] and the error by [equation omitted].
To compare disease recurrence rates, we computed recurrence for each relationship type. To test if the groups were statistically different, we performed a Chi-squared test with Bonferroni correction.
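The sibling recurrence estimate described in this section can be sketched in Python. The formulas for the rate and its error are not given in the text above, so the sketch assumes the rate is the fraction of affected primary siblings whose sibling is also affected, and it assumes a binomial standard error; both are illustrative choices, not the authors' stated formulas.

```python
import math
import random

def sibling_recurrence(sibling_pairs, has_trait, min_concordant=10, seed=0):
    """Return (recurrence rate, standard error), or None if too few concordant pairs.

    sibling_pairs: iterable of (sib_a, sib_b); has_trait: {individual: bool}.
    """
    rng = random.Random(seed)
    at_risk = concordant = 0
    for a, b in sibling_pairs:
        # Randomize which sibling the probability is conditioned upon.
        primary, other = (a, b) if rng.random() < 0.5 else (b, a)
        if has_trait[primary]:
            at_risk += 1
            concordant += bool(has_trait[other])
    if concordant < min_concordant:
        return None
    rate = concordant / at_risk
    # ASSUMPTION: binomial standard error of a proportion.
    return rate, math.sqrt(rate * (1 - rate) / at_risk)
```

Familial recurrence would follow the same pattern with pairs drawn from any relationship type rather than siblings only.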
5. Preparation of clinical data for release
Due to institutional restrictions, we cannot release the exact data as it was used in our analysis. However, we are sensitive to issues regarding reproducibility and replicability. Therefore, we have modified the dataset according to the rules of Safe Harbor as provided by the U.S. Department of Health and Human Services. The processing of the data for release was performed as described in section 3.4. However, in this case we took three additional precautions beyond what is required for Safe Harbor since these data will be made completely public. We are releasing data for a single trait (rhinitis). We will continue to release more traits as the data are reviewed to protect patient privacy, with the ultimate goal of releasing all of the trait and relationship data for all phenotypes. No data are released for families containing more than five members. This will protect against identification through unique familial relationship situations. All aggregate data and their corresponding statistics are released without obfuscation. The data are available on the supporting website: http://riftehr.tatonettilab.org/.
6. Computational and statistical software
Statistical analysis, data preparation, and figure creation were performed using Python 2.7. The Python system environment is described fully in the supplemental materials. Relationship inference was implemented in Julia 0.4.3. All correlations are reported as Pearson correlation coefficients, unless otherwise noted. All code for RIFTEHR and SOLARStrap is available on the supporting website: http://riftehr.tatonettilab.org/.
7. Literature review
For validation purposes, we performed a literature review of heritability estimates for 128 traits. We started by analyzing studies that were included in the table available at http://www.snpedia.com/index.php/Heritability (accessed in March 2016). We then downloaded all papers we had access to and extracted the described trait with the respective heritability estimates as well as the confidence intervals, when available.
Acknowledgements
FP and DKV are supported by R01HS021816. KQ, RV, AY, and NPT are supported by R01GM107145. KK is supported by NIDDK R01DK105124. RV, KK, and NPT are supported by the Herbert Irving Scholars Award. GH is supported by R01LM006910. This research used resources from the Open Science Grid, which is supported by the National Science Foundation and the U.S. Department of Energy’s Office of Science. Collection of genetic samples was supported by R01HS022961.
Footnotes
† Co-senior author