ABSTRACT
Objective Major Depressive Disorder (MDD) is one of the most common mental illnesses and a leading cause of disability worldwide. Electronic Health Records (EHR) allow researchers to conduct unprecedented large-scale observational studies investigating MDD, its disease development and its interaction with other health outcomes. While there exist methods to classify patients as clear cases or controls given specific data requirements, there are presently no simple, generalizable, and validated methods to classify an entire patient population into varying groups of depression likelihood and severity.
Materials and Methods We propose an electronic phenotype algorithm that classifies patients into one of five mutually exclusive, ordinal groups, varying in depression phenotype. Using data from an integrated health system on 278,026 patients from a 10-year study period we demonstrate the convergent validity of these phenotype constructs by presenting multiple lines of evidence associated with depression.
Results Convergent validity is derived from expected patterns in health care utilization, psychiatric prescriptions, indicators of suicidality, diagnoses of serious comorbidity, mortality, symptom severity, and finally, polygenic risk scores.
Discussion The algorithm is generalizable to most EHR data sets because it requires only International Classification of Diseases (ICD) diagnostic codes and medication orders and can be used for stratification of an entire patient population.
Conclusion Careful consideration must be given to the definitions of patient cohorts when utilizing EHR data, particularly when classifying subjects with heterogeneous disorders such as MDD. This algorithm may prove useful to others who wish to study depression in entire patient populations with EHR data.
BACKGROUND
Depression is a highly prevalent mental illness that accounts for $43 billion in medical costs annually and is a leading cause of disability[1]. Depression has been linked to worse outcomes and increased healthcare utilization for numerous common medical disorders[2–6]. However, depression is a heterogeneous disorder, and its etiologies remain poorly understood [7]. There is an urgent need to better understand the causes and course of depression in order to develop more effective treatment and prevention strategies. Electronic Health Records (EHR) from large integrated health systems now offer the opportunity for researchers to conduct unprecedented, large scale studies of patients in real-world settings[8–13]. Critical to these pursuits are phenotypic algorithms that correctly distinguish who has the disorder within the patient population[14, 15]. Depression is a particularly difficult phenotype to define and studies often use heterogeneous criteria when utilizing EHR data to identify patients with depression[16–21]. As with any phenotypic algorithm, the challenge is to validly define depression with high sensitivity and specificity, limiting both false positive and false negative classification of patients [22]. An additional complexity is the fact that depression may not be a binary phenomenon, but rather it may exist on a continuum with a range of severity in a population.
There are at least four potential sources of information from the EHR for defining depression: a) International Classification of Diseases (ICD) diagnosis codes; b) depression screening measures; c) medication orders; and d) clinical notes. While some studies use only ICD diagnosis codes to identify patients[21, 23], others have demonstrated that using these alone has inferior sensitivity and precision when compared with combinatorial models that use multiple sources of information [24]. National recommendations that adults be screened for depression annually have increased the availability and use of symptom questionnaires[16–18, 25] such as the Patient Health Questionnaire (PHQ-9)[26]. However, implementation of such screening measures has been fairly recent, is not standardized, and shows limited agreement with ICD codes for depression [19, 20]. Phenotyping algorithms may also use information from medication treatment codes [27–29]. Unfortunately, there may be a long delay between when patients with depression first experience the onset of symptoms and when they ultimately receive care, including medication[30–33]. Additionally, antidepressants may be prescribed for a variety of comorbid mental [34] and non-mental health indications[35], such as tobacco use cessation[36] or chronic pain[37], which complicates their use for reliably identifying depression [38]. Lastly, the use of natural language processing (NLP) on clinical notes has great promise for classifying psychiatric disorders[24, 39–42]. However, the generalizability of these methods may be limited because many data sets do not have the number or types of notes required, or contain only deidentified data. Finally, most methods involve the exclusion of a sizeable number of patients with uncertain status, which prevents clinically relevant population-wide studies that include and classify all patients in the population.
OBJECTIVE
This study examines a novel phenotyping algorithm for defining depression along a continuum using EHR data and evaluates the construct validity of the resulting phenotype groups against other indicators of health that should correlate with depression. We focus on definitions based on ICD diagnosis codes and medication order data because these sources of data are more readily available than depression screening data and/or clinical notes and thus the resulting definitions are more widely generalizable.
MATERIALS AND METHODS
Study data and analysis
This study included de-identified Electronic Health Records (EHR) data obtained from January 1st, 2005 to September 30th, 2015 (10.75 years) for patients seen in the Geisinger Health System, an integrated health care system located in central Pennsylvania. The Geisinger system has a stable patient population whose EHRs have been collected in a central data warehouse and are available for clinical and research purposes [43–46]. The end date of the study period was chosen based on the transition from ICD-9 codes to ICD-10 codes in this hospital system. Adult patients 18 years or older at the beginning of the study, 90 years or younger at the end of the study, who had a Geisinger Primary Care Physician (PCP) at any point during the study period, and had at least one outpatient visit within the system during the study period were included in the study cohort (n=278,026). Demographic information, medication order histories, and details of all outpatient, Emergency Department (ED), and inpatient encounters were obtained on these patients. The study was approved by the Geisinger Internal Review Board as non-human subjects research.
We used domain knowledge of depression clinical care to implement an algorithm for partitioning all patients into one of five ordinal phenotype groups reflecting decreasing/increasing likelihood and/or severity of depression based on the available ICD-9 diagnosis codes and medication orders. Then, to evaluate the convergent/divergent validity of these phenotype groups, we examined how they related to other health care related characteristics thought to be associated with depression. These other characteristics included measures of health care utilization, medical history, depressive symptoms, and polygenic risk for depression. All analyses and visualizations described in further detail below were conducted in R (2017, R Core Team, Vienna, Austria) and GraphPad Prism 6 (La Jolla, CA).
Depression phenotype algorithm
To implement the phenotype algorithm (Figure 1), patients were first grouped into “Major Depressive Disorder” (MDD) if they received a diagnosis of Major Depressive Disorder (ICD-9 codes: 296.20, 296.21, 296.22, 296.23, 296.24, 296.25, 296.26, 296.30, 296.31, 296.32, 296.33, 296.34, 296.35, 296.36, 296.82) one or more times as an ED or inpatient discharge diagnosis or on their problem list, or two or more times within a 2-year period as an outpatient discharge diagnosis. Second, the remaining patients who did not meet the previous criteria were grouped into “Other Depression” (OthDep) if they received a diagnosis of a depressive disorder not elsewhere classified (ICD-9 code: 311) one or more times as an ED or inpatient discharge diagnosis or on their problem list, or two or more times within a 2-year period as an outpatient discharge diagnosis. Third, the remaining patients who did not meet any of these previous criteria were grouped into “Multiple Antidepressants No Depression” (MultiRx) if they received two or more antidepressant medication orders (RxNorm classification “Antidepressant”) during the study period that were not prescribed for common indications unrelated to depression, including tobacco use disorder (305.1) and hereditary and idiopathic neuropathy (356). Fourth, the remaining patients who did not meet any of these previous criteria were grouped into “Miscellaneous Antidepressants No Depression” (MiscRx) if they received any other antidepressant medication orders not captured in the previous grouping. Finally, all other remaining patients were grouped into “No Depression” (NoDep).
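The cascade described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code used in the study (which was conducted in R): the record layout, the function names, the setting labels, the way an indication code is attached to each medication order, and the approximation of "within a 2-year period" as 730 days are all assumptions introduced here.

```python
from datetime import date, timedelta

# ICD-9 codes listed in the text for the two diagnosis-based groups.
MDD_CODES = {
    "296.20", "296.21", "296.22", "296.23", "296.24", "296.25", "296.26",
    "296.30", "296.31", "296.32", "296.33", "296.34", "296.35", "296.36",
    "296.82",
}
OTHDEP_CODES = {"311"}

# Indications unrelated to depression named in the text
# (tobacco use disorder 305.1, hereditary and idiopathic neuropathy 356).
UNRELATED_INDICATIONS = ("305.1", "356")

# Assumption: "within a 2-year period" is approximated as 730 days.
TWO_YEARS = timedelta(days=730)

# Assumption: hypothetical setting labels for a single qualifying diagnosis.
SINGLE_HIT_SETTINGS = {"ED", "inpatient", "problem_list"}


def meets_diagnosis_rule(diagnoses, codes):
    """diagnoses: iterable of (icd9_code, setting, date) tuples (assumed layout)."""
    matching = [(s, d) for c, s, d in diagnoses if c in codes]
    # One ED/inpatient/problem-list diagnosis is sufficient.
    if any(setting in SINGLE_HIT_SETTINGS for setting, _ in matching):
        return True
    # Otherwise require two outpatient diagnoses no more than two years apart.
    dates = sorted(d for setting, d in matching if setting == "outpatient")
    return any(later - earlier <= TWO_YEARS
               for earlier, later in zip(dates, dates[1:]))


def classify(diagnoses, antidepressant_orders):
    """antidepressant_orders: one entry per order, holding the ICD-9 code of
    its indication or None when no indication is recorded (assumed layout)."""
    if meets_diagnosis_rule(diagnoses, MDD_CODES):
        return "MDD"
    if meets_diagnosis_rule(diagnoses, OTHDEP_CODES):
        return "OthDep"
    qualifying = [ind for ind in antidepressant_orders
                  if not (ind and ind.startswith(UNRELATED_INDICATIONS))]
    if len(qualifying) >= 2:
        return "MultiRx"
    if antidepressant_orders:
        return "MiscRx"
    return "NoDep"
```

Because each rule is checked only after the stricter rules have failed, every patient receives exactly one label, which is what makes the five groups mutually exclusive and ordinal.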
Healthcare utilization
Depression is associated with increased healthcare utilization[47–49]. As a measure of healthcare utilization, we calculated the average number of visits annually across different health care settings and compared these across the groups. To calculate the average number of visits in each healthcare setting per patient per year, we tallied the number of total visits in the outpatient, ED, non-psychiatric inpatient, and psychiatric inpatient settings during the entire study period for each group and divided this by the total number of patients in that group and the length of the study period (10.75 years), resulting in the average yearly visit frequency.
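The rate calculation described above amounts to dividing each group's visit total by the group size and by the 10.75-year study period. The following Python sketch illustrates that arithmetic; the function name, the dictionary layout, and the example counts are hypothetical and are not study data.

```python
STUDY_YEARS = 10.75  # length of the study period given in the text


def average_yearly_visits(total_visits_by_setting, n_patients,
                          study_years=STUDY_YEARS):
    """Return average visits per patient per year for each care setting.

    total_visits_by_setting: dict mapping a setting name to the total number
    of visits recorded for the whole group over the study period.
    """
    if n_patients <= 0 or study_years <= 0:
        raise ValueError("n_patients and study_years must be positive")
    return {setting: total / n_patients / study_years
            for setting, total in total_visits_by_setting.items()}


# Hypothetical group of 1,000 patients with made-up visit totals.
example_rates = average_yearly_visits(
    {"outpatient": 59125, "ED": 2150}, n_patients=1000
)
```

Note that this approach treats every patient as contributing the full study period, so patients who entered or left the system partway through will have their yearly rates underestimated.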
Medical features
We then evaluated several medical features that are known to be associated with depression, including medications, comorbidities and mortality. Depressed patients are often prescribed antipsychotics to augment their antidepressant medication[50], and antianxiety agents to treat anxiety, which is a well-known comorbid condition with depression[51]. We determined the percent of patients in each group that had ever received at least one order for these medications as classified in RxNorm (“Antipsychotic” and “Antianxiety Agent”, respectively). Depressed patients are known to have higher rates of suicidality[52, 53] and mortality[54, 55] than non-depressed patients. We used the discharge diagnosis codes from encounter records to determine the percent of patients with ICD-9 codes for suicide and self-inflicted injury (E950 - E958). We used the date of death to determine the percent mortality of each group during the study period. Substance abuse was captured using discharge diagnosis codes from encounter records to determine the percent of patients in each group that had ICD-9 codes for alcohol and drug abuse or dependence (303, 304, and 305, excluding tobacco use disorder 305.1). Finally, as depressed patients have a greater overall burden of medical comorbidities than non-depressed patients[56], we calculated the Charlson Comorbidity Index (CCI) score for each patient. The CCI contains 19 categories of serious comorbidities that predict the 10-year mortality of patients[57].
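As an illustration of the code-based flags described above, the sketch below marks patients with substance abuse or self-harm codes and computes the percent of a group that is flagged. It assumes ICD-9 codes are stored as dotted strings (for example "305.00" or "E950.0"); the function names and the example records are hypothetical.

```python
# E950 through E958, the range given in the text.
SELF_HARM_PREFIXES = tuple(f"E95{digit}" for digit in range(9))


def is_substance_abuse(code):
    """True for ICD-9 303, 304, and 305 codes, excluding tobacco use disorder 305.1."""
    return code.startswith(("303", "304", "305")) and not code.startswith("305.1")


def is_self_harm(code):
    """True for ICD-9 E950-E958 (suicide and self-inflicted injury)."""
    return code.startswith(SELF_HARM_PREFIXES)


def percent_with_code(codes_by_patient, predicate):
    """Percent of patients having at least one code that satisfies predicate."""
    if not codes_by_patient:
        raise ValueError("group must contain at least one patient")
    flagged = sum(
        1 for codes in codes_by_patient.values()
        if any(predicate(code) for code in codes)
    )
    return 100.0 * flagged / len(codes_by_patient)


# Made-up records for a hypothetical four-patient group.
example_group = {
    "p1": ["305.00"],
    "p2": ["305.1"],
    "p3": ["E950.0", "311"],
    "p4": [],
}
```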
Depression symptoms
In 2012, Geisinger Health System began implementing universal screening for depression with the Patient Health Questionnaire 2 (PHQ-2). If patients endorse either of the two screener questions, they are then asked the additional 7 questions (PHQ-9). The PHQ-9 is a validated instrument for assessing current depression symptom severity and has a tiered rating scale based on total score (0: no depression, 1-9: mild, 10-14: moderate, 15-19: moderately severe, 20+: severe)[58, 59]. For all patients who had one or more PHQ-2/9 results in their EHR (n=170,618; 61.4%), we identified their maximum score and determined the percent of patients in each group with maximum scores in the different PHQ-2/9 scoring categories.
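The tiered scale and the per-group tabulation of maximum scores can be sketched as follows. This is a minimal illustration assuming each patient's PHQ-2/9 total scores are available as a list of integers; the function names and data layout are hypothetical, and the upper bound of 27 reflects the maximum PHQ-9 total (nine items scored 0 to 3).

```python
from collections import Counter

CATEGORIES = ("no depression", "mild", "moderate", "moderately severe", "severe")


def phq_category(score):
    """Map a PHQ-2/9 total score to the severity tier given in the text."""
    if not 0 <= score <= 27:
        raise ValueError("PHQ total score must be between 0 and 27")
    if score == 0:
        return "no depression"
    if score <= 9:
        return "mild"
    if score <= 14:
        return "moderate"
    if score <= 19:
        return "moderately severe"
    return "severe"


def max_score_distribution(scores_by_patient):
    """Percent of screened patients whose maximum score falls in each tier.

    Patients with no recorded scores are treated as unscreened and skipped.
    """
    tiers = Counter(
        phq_category(max(scores))
        for scores in scores_by_patient.values() if scores
    )
    screened = sum(tiers.values())
    if screened == 0:
        raise ValueError("no screened patients in this group")
    return {tier: 100.0 * tiers[tier] / screened for tier in CATEGORIES}
```

Using the maximum score rather than the most recent one matches the ever/never framing of the phenotype groups, at the cost of ignoring change over time.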
Polygenic risk scores
Geisinger Health System has recruited over 90,000 patients to participate in a genetics study called MyCode[43]. A subset of the study population described above has participated in this study, allowing us to calculate polygenic risk scores (PRS) for comparison across depression groups. Using publicly available summary results from a recently published genome-wide association study (GWAS) of MDD [60], we calculated PRS for each patient in our data set that had genetic data available (n=52,775; 19%). Polygenic risk scores for MDD were calculated using PRSice-2[61] at eight predetermined p-value thresholds: 1, 0.5, 0.1, 0.05, 0.01, 0.001, 0.0001, 0.000001. As results were similar across all p-value thresholds (Supplementary Table 1), we only report results for the nominal threshold of p=0.05.
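To make the role of the p-value thresholds concrete, the following Python sketch shows the general idea behind a thresholded polygenic score: sum the GWAS effect size times the patient's effect-allele count over variants whose GWAS p-value is at or below the threshold. It is a deliberately simplified illustration and not the PRSice-2 implementation; it omits steps such as clumping and score normalization, and the variant names, data layout, and function name are hypothetical.

```python
# The eight p-value thresholds listed in the text.
P_THRESHOLDS = (1, 0.5, 0.1, 0.05, 0.01, 0.001, 0.0001, 0.000001)


def polygenic_score(dosages, summary_stats, p_threshold):
    """Simplified thresholded polygenic score for one patient.

    dosages: dict mapping variant ID to effect-allele count (0, 1, or 2).
    summary_stats: dict mapping variant ID to (effect_size, p_value) from a GWAS.
    Variants missing from the patient's genotype data are skipped.
    """
    return sum(
        effect * dosages[variant]
        for variant, (effect, p_value) in summary_stats.items()
        if p_value <= p_threshold and variant in dosages
    )


# Made-up summary statistics and genotypes for illustration only.
example_stats = {
    "variantA": (0.10, 0.001),
    "variantB": (-0.05, 0.2),
    "variantC": (0.02, 0.04),
}
example_dosages = {"variantA": 2, "variantB": 1, "variantC": 0}

scores_by_threshold = {
    threshold: polygenic_score(example_dosages, example_stats, threshold)
    for threshold in P_THRESHOLDS
}
```

Stricter thresholds include fewer variants, so comparing results across thresholds, as the study does, is a check that conclusions do not hinge on one arbitrary cutoff.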
RESULTS
We developed and implemented an algorithm for grouping patients based on ICD-9 codes and medication orders into one of five ordinal groups, varying in likelihood and severity of depression, which we named from most to least likely/severe: “MDD”, “OthDep”, “MultiRx”, “MiscRx”, and “NoDep” (Figure 1). Of the total patient population, each group accounts for 3.2%, 17.4%, 11.1%, 10.8%, and 57.5%, respectively. Summary statistics on sex, race, marital status, age at beginning of the study and length between first and last encounters are shown for the total patient population and all five groups in Table 1. The sample is 54.0% female, 95.7% white, 60.9% married, median age 45, and median length between first and last encounter is just over 7 years. As expected, depression occurs more commonly in females, reflected by an increased Female to Male ratio in each “affected” group. Patients in the MDD and OthDep groups are less likely to be “Married” compared with those in the NoDep, MultiRx, and MiscRx groups. Age at the beginning of the study does not differ substantially between groups (median: 45, third quartile: 57 years), although it is slightly less for those in the MDD group (median: 44, third quartile: 55 years). The length between first and last encounter varied across groups, with a much higher percentage of patients in one of the four “affected” groups having at least 7 years of observation than those in the NoDep group. It is also of note that the vast majority of patients in the MDD and OthDep groups received at least one antidepressant medication order, 97.2% and 93.0%, respectively.
Healthcare utilization
The average number of outpatient visits/year decreased monotonically across the five phenotype groups from a high of over 5.5 visits/year for MDD patients to a low of fewer than 2 visits/year for NoDep patients (Fig 2A). The pattern was similar albeit with lower overall averages for ED visits (ranging from 0.24 to 0.06 visits/year, respectively) and non-psychiatric inpatient visits (ranging from 0.11 to 0.03 visits/year, respectively) (Figures 2B and 2C). The most striking difference across the five phenotype groups was observed for psychiatric inpatient visits (Figure 2D). MDD patients had an average of 0.03 such visits/year, which was 10 times greater than that for OthDep patients, the next highest group, while the rates were negligible for the remaining MultiRx, MiscRx and NoDep groups.
Medication orders and comorbidities
Similar to the utilization patterns, the percent of patients that received antipsychotic (Figure 3A) or antianxiety agents (Figure 3B) decreased monotonically across the five groups from a high for MDD to a low for NoDep. Approximately 44% of MDD compared to 5% of NoDep patients were prescribed antipsychotic medications, while 31% of MDD compared to 5% of NoDep patients were prescribed antianxiety medications. A similar pattern was observed across the five phenotype groups for the percent of patients with substance use and abuse (Figure 3C) and mean Charlson Comorbidity Index scores (Figure 3D). The pattern differed, however, for suicide related diagnosis codes and overall mortality. Almost all suicide related diagnosis codes were noted in MDD and OthDep patients, while the percent of patients with such codes was negligible in the MultiRx, MiscRx, and NoDep groups (Figure 3E). As for overall mortality, interestingly, the highest rates were observed for patients in the OthDep group (Figure 3F).
Depression symptoms
A subset of the study population was screened at least once with the Patient Health Questionnaire (PHQ), a well validated depression measure (Figure 4). Patients in the MDD group were the most likely to have been screened (73.4%; n=6,421), followed by those in the OthDep (68.7%; n=33,229), MultiRx (69.1%; n=21,419), MiscRx (62.9%; n=18,865), and finally NoDep (56.7%; n=90,684). The majority of those in the NoDep group scored a maximum of 0, indicating no depression, and only 2.6% scored 10 or higher, indicating moderate or more severe depression. Those in the MiscRx and MultiRx groups were remarkably similar to one another at all possible scores. This contrasts with the more severely “affected” groups MDD and OthDep. Over a third of the MDD group and nearly a quarter of the OthDep group had a maximum score of 10 or higher.
Polygenic risk scores
As depression has a prominent heritable component [62], we used polygenic risk scores (PRS) derived from an external GWAS [60] to further validate our phenotype groups for those patients in this study who had available genome-wide genotype data. As shown in Figure 5, we found a gradient of increased PRSs across the five phenotype groups, with the highest PRSs seen in the MDD group and the greatest difference seen between the MDD and NoDep groups (P < 2.2 × 10⁻¹⁶, R² = 0.8%).
DISCUSSION
Here we present a novel electronic phenotype algorithm using ICD codes and medication orders in EHR data, resulting in five mutually exclusive and ordinal groups for categorizing patients with depression across an entire population. These phenotype groups demonstrate convergent validity as assessed by significant differences between them in a range of clinically relevant characteristics that are expected to correlate with depression [63, 64], including healthcare utilization [47], medical features (such as treatments [65], comorbidities [66, 67], and mortality), depression screening results, and polygenic risk scores. Interestingly, these clinically relevant characteristics tended to vary in a “dose-response” like fashion across the phenotype groups, consistent with the fact that the groups defined by increasingly stringent criteria identified patients with increasing likelihood and/or severity of depression. These findings demonstrate that it is possible to use diagnostic codes and medication orders in EHR data to validly categorize patients with respect to depression across the entire population.
With the widespread adoption of EHRs, there is growing interest in identifying patients with depression for subsequent research using data available from the EHRs. As a result, a number of previous studies have attempted to do this employing a variety of methods that use, in isolation or combination, ICD-9 diagnosis codes, antidepressant prescription orders, and NLP of progress notes. Many of these studies have not validated their algorithms [18–21, 23]. There are a few studies, however, that have attempted to validate their algorithms by comparing identified cases and controls against a putative gold-standard diagnosis established by expert review of 50 or 100 medical charts per group [17, 24, 39]. Our approach differs from these previous efforts in that we sought to establish convergent rather than criterion validation of our algorithm, which allowed us to characterize and compare the full range of depression in the entire sample rather than a small sub-sample of putative cases and controls which presumably represent the extremes of the phenotype in the population. In addition, other than one study which used genetic data to try to validate an algorithm for defining a bipolar disorder phenotype [40], our study is the only one as far as we know to use genetic data to validate an algorithm for defining depression.
This study has several limitations that warrant discussion. First, with regard to our examination of healthcare utilization patterns and medical features, we assumed that patients contributed follow-up over the entire study period. We did this to simplify our estimates, though it is possible that a number of patients entered or left the Geisinger service area during the study period. Second, with regard to our algorithm, we categorized patients based on the most severe criteria met over the study period and assumed they belonged to that group for the entire period. This ever/never approach does not allow for change over time and does not take full advantage of this rich longitudinal data, which may be interesting to study using longitudinal data analysis methods in the future. Our algorithm was also based on ICD-9 codes, as opposed to ICD-10 codes, which are the current version in use for diagnosis and healthcare billing. Fortunately, there are many resources available for mapping between ICD-9 and ICD-10 codes that can be leveraged to create a list of ICD-10 codes equivalent to the list of codes used here [68, 69]. Third, with regard to the sample under study, it consisted of predominantly white patients from rural communities in central Pennsylvania. As a result, there may be questions about the generalizability of our approach to other more diverse and urban populations. In addition, there are other healthcare systems in the service area and thus the data we used to categorize patients may not have completely captured all medical utilization. Claims data may provide an additional source of information about healthcare utilization that would be useful in future studies.
This study also benefits from several significant strengths. One of the major strengths of this study is the large longitudinal data set from a stable patient population in an integrated health care system. With extended and extensive data available from multiple health care settings (primary, emergency, and hospital care), we were able to validate our simple phenotype algorithm by comparison with a range of different measures. Additionally, our algorithm allows for classification of all patients in a given population across a spectrum of depression severity, as opposed to other approaches which often seek to classify only the extremes of the population to the exclusion of patients with uncertain binary case versus control status [39]. Finally, our algorithm uses only diagnosis codes and medication data to validly classify patients with regard to depression and therefore may be more generalizable to other systems that do not have ready access to data on screening measures or from natural language processing of clinician notes.
CONCLUSION
The electronic phenotype algorithm presented here provides a simple and valid model for defining patients with varying severities of depression and may be useful for researchers who wish to examine the effects of depression in entire patient populations. The five mutually exclusive, ordinal groups demonstrate differences that are expected to correlate with depression. We found correlations with utilization patterns, treatments, comorbidities, mortality, depression screening instruments for symptom severity, and polygenic risk scores. These constructs are generalizable to any data set that has both diagnosis codes and medication orders and at least two years of data. Ultimately, the definition of depression phenotypes will depend upon the goal of the research, and we present one possible method that demonstrates convergent validity, generalizability, and inclusivity.
ACKNOWLEDGEMENTS, COMPETING INTERESTS, FUNDING
Funding
Author WMI is funded by the National Institutes of Health (T32MH014592-41 Psychiatric Epidemiology Training Program).
Conflict of Interest
The authors declare that they have no conflict of interest.