Non-parametric mixture models identify trajectories of childhood immune development relevant to asthma and allergy

Events in early life contribute to subsequent risk of asthma; however, the causes and trajectories of childhood wheeze are heterogeneous and do not always result in asthma. Similarly, not all atopic individuals develop wheeze, and vice versa. The reasons for these differences are unclear. Using unsupervised model-based cluster analysis, we identified latent clusters within a prospective birth cohort with deep immunological and respiratory phenotyping. We characterised each cluster in terms of immunological profile and disease risk, and replicated our results in external cohorts from the UK and USA. We discovered three distinct trajectories, one of which is a high-risk “atopic” cluster with increased propensity for allergic diseases throughout childhood. Atopy contributes varyingly to later wheeze depending on cluster membership. Our findings demonstrate the utility of unsupervised analysis in elucidating heterogeneity in asthma pathogenesis and provide a foundation for improving management and prevention of childhood asthma.


Introduction
Asthma is a global health problem, and there is a pressing need for better understanding of its pathogenesis [1]. Both genetic and environmental factors are involved in asthma [2,3], and the "hygiene hypothesis" proposes that modern changes to hygiene, sanitation and living environment have modified human exposures to microbes, with subsequent effects on earlylife immune development [4]. However, the clinical presentation and prognosis of childhood wheeze is highly variable: some children remit; others remit but relapse in later life; and yet others have wheeze persisting into adult asthma [5]. These differences suggest that the underlying causes of disease also differ from person to person. For example, while asthma is commonly linked to allergy, not all individuals with wheeze are sensitised to allergen, and vice versa [6]. As such, childhood asthma is a heterogeneous condition [7,8], and this greatly complicates the study of its pathogenesis [9]. We postulate that there are subpopulations in early childhood, each sharing similar patterns of pathophysiology, disease susceptibility and phenotype that permit categorisation into clusters. If we can agnostically identify these clusters, then we may identify the biological mechanisms that underlie them, and find targets for early intervention that are specific for different asthma subtypes.
Older attempts at subtyping asthma susceptibility relied on supervised classification, using expert knowledge and cut-offs to define clusters. For example, specific immunoglobulin E (IgE) ≥ 0.35 kU/L; wheal diameter ≥ 3 mm in a skin prick test (SPT); or symptom score surpassing a threshold -would determine classification into a high-risk profile [10,11]. However, these cut-offs vary with age, gender or other parameters, and may not accurately reflect true attribution of risk [12]. Hence, they often continue to produce heterogeneous groups. Furthermore, previous studies tended to focus on a single "domain", for instance grouping only by immunological response [13], symptomatology or timing of disease [14,15]. Recently, researchers have turned to unsupervised approaches, such as model-based cluster analysis and latent class analysis (LCA) [16][17][18][19][20][21]. These do not require experts to supply cut-offs, but can instead "learn" boundaries from the data. They can potentially uncover patterns of similarity not immediately obvious to the human eye. Finally, these methods can cover a broader range of domains, incorporating measurements from multiple sources to determine clusters that are potentially informative of asthma risk.
Here, we use a data-driven unsupervised framework together with a comprehensivelyphenotyped birth cohort (the Childhood Asthma Study, CAS) to define developmental trajectories during preschool years, a period known to be critical to asthma pathogenesis. Specifically, we 1) discover, using non-parametric mixture models, latent clusters that define early childhood trajectories of immune function and susceptibility to respiratory infection; 2) investigate how these clusters relate to differential profiles of asthma susceptibility, and to existing definitions of atopy; 3) identify risk factors for asthma within each cluster, which may differ across said clusters; 4) summarise and simplify our findings in decision trees; and 5) externally validate the clusters in independent cohorts.

Results
Our discovery dataset was CAS, a Western Australian birth cohort (N=263) enriched for parental asthma history [22], with clinical, immunological, and respiratory measurements from the first ten years of life (Methods). Clinical variables included demographics, incidence of allergic disease and family history. Immunological variables included IgE, IgG4, and IgG antibody levels for common allergens, as well as a combination antibody assay (Phadiatop) which covers multiple allergens [23]. We measured frequency and severity of respiratory infections, and performed 16S rRNA sequencing on nasopharyngeal aspirates (NPAs) collected during respiratory infections (disease samples) or routine check-ups (healthy samples). These NPAs have been classified by Teo et al [24,25], based on clustering of microbial composition, into microbiome profile groups (MPGs) that were associated with healthy respiratory states (health-associated MPGs, e.g. Alloiococcus-, Staphylococcus-or Corynebacterium-dominated) or infectious respiratory states (infection-associated MPGs, e.g. Moraxella-, Haemophilus-, or Streptococcus-dominated).
To identify latent clusters, we applied non-parametric expectation-maximisation (EM) mixture modelling ("npEM") to CAS. We used a largely non-selective approach to the choice of features, but we did explicitly exclude variables with excessive missing data (Methods, Supplementary Methods), as well as primary outcomes such as yearly incidence of wheeze and physician-diagnosed asthma (Supplementary Table 1). By virtue of study design and exclusion criteria, most included variables were related to immunological function or respiratory infection in the first three years of life. Individuals were assigned to a cluster if the mixture model determined ≥ 90% probability of membership in that cluster (Supplementary Methods). We described each cluster in terms of key characteristics and significant cluster-specific predictors for age-five wheeze. We also built an npEM-derived classifier to cluster samples with low missing data, and classify individuals from two comparable datasets for replication -the Manchester Asthma and Allergy Study (MAAS) (N=1085) [26] from Manchester, UK, and the Childhood Origins of Asthma Study (COAST) (N=289) from Wisconsin, USA [27] (Supplementary Table 2; Supplementary Figure 1). Finally, we developed a decision tree classifier, which allowed us to simplify cluster description and determine which features best separated clusters.

CAS1: low-risk, non-atopic cluster with transient wheeze
CAS1 was a low-risk cluster with infrequent and transient respiratory wheeze. Rates of wheeze declined from 33% at age 1 to 12% by age 10 (Table 1; Figure 2). In this cluster, Th2 cytokine responses of PBMCs to allergen stimulation were minimal; and rates of allergen sensitisation (as measured by IgE or SPT) were the lowest among all groups ( Table  2, Supplementary Tables 3B-D; Figure 3). IgG and IgG4 were also low across all allergens.
Frequency of respiratory infection in CAS1 were intermediate to low (Table 3). However, in this cluster, a high frequency of lower respiratory infections (LRIs) in childhood, especially wheezy LRIs (wLRIs), was a risk factor for age-five wheeze -even after adjusting for sex, BMI and parental history of asthma as demographic covariates (Table 4; Figure 4A-B). After multiple regression analysis with stepwise backward elimination (Methods), three variables remained significant in the one model: age-three wLRI frequency (odds ratio OR 8.3 per unit increase, p=3.4×10 -2 ); age-four LRI frequency (OR 3.6, p=0.022); and proportion of infection-associated MPGs (Streptococcus, Haemophilus, Moraxella) in age-two-to-four healthy NPAs (OR 0.13 per quartile, p=0.016). Repeated-measures ANOVA confirmed that LRI and wLRI frequency in the first 3 years of life were predictors for age-five wheeze within CAS1 (Supplementary Table 4).

CAS2: low-risk cluster susceptible to atopic and non-atopic wheeze
Like CAS1, CAS2 was a low-risk cluster with infrequent allergic disease. However, compared to CAS1, Phadiatop and HDM IgE were slightly elevated at most timepoints ( Table 2, Supplementary Table 3B; Figure 3A). Conversely, peanut IgE was not significantly elevated (Wilcoxon, adjusted p=0.99 at all timepoints; Figure 3D). As for other antibody isotypes: CAS2 IgG was between CAS1 and CAS3 levels, with it being closer to CAS1. CAS2 IgG4 was also intermediate, but was much closer to CAS3 levels than CAS1 (Table 2; Figure 3). However, despite the antibody differences between CAS1 and CAS2, yearly rates of wheeze in CAS2 remained comparable to CAS1 (30% at age 1, declining to 18% at age 10; Table 1; Figure 2). Interestingly, compared to CAS1, individuals in CAS2 tended to have fewer older siblings living in the household at age 2, as well as more frequent paternal history of asthma (adjusted p=0.029 and 0.055, respectively; Table 1).
The predictive factors for wheeze at age 5 in CAS2 included: LRI, wLRI and fLRI frequency (GLM; p=2.7×10 -3 , 0.016 and 0.02 at age 3, respectively); HDM IgE (p=0.016 and 0.011 at ages 2 and 4, respectively); and Phadiatop IgE (p=0.01 at age 4) (Table 4; Figure 4). After multiple regression analysis with stepwise backward elimination (Methods), three of these remained significant: age-two fLRI (p=0.006, OR 11 per unit increase), age-four wLRI (p=0.006, OR 4.8), and age-four Phadiatop IgE (p=5.5×10 -3 , OR 4.1). Repeated-measures ANOVA showed that HDM IgE and LRI-related variables (LRI, wLRI, fLRI) from the first 3 years of life were significant predictors of age-five wheeze in CAS2 (Supplementary Table  4). But although both allergic (IgE-related) and non-allergic (infection-related) risk factors contributed to age-five wheeze, there was no significant evidence of interaction between them (p=0.36 within CAS2 alone, p=0.92 across entire cohort, for age-four wLRI frequency × Phadiatop IgE). Overall, CAS2 represented a low-risk trajectory susceptible to, but not necessarily afflicted by, wheeze due to atopic and non-atopic risk factors. In this cluster, atopic determinants of age-five wheeze (HDM and Phadiatop IgE) were only active from age 2 onwards, suggesting delayed atopic wheeze in this cluster.

CAS3: high-risk atopic cluster with persistent wheeze
CAS3 was a "high-risk" cluster, where persistent respiratory wheeze and atopic disease was seen in more than half the group throughout the first 10 years of life (Table 1; Figure 2). This cluster was dominated by males (86%, Fisher exact test, unadjusted p=6.8×10 -3 compared to CAS1, Table 1), and appeared to represent an early-and multi-sensitised atopic phenotype with persistent wheeze.
CAS3 had elevated IgE, IgG, and IgG4 responses to common allergens, especially HDM ( Table 2, Supplementary Table 3B; Figure 3). Peanut, HDM and Phadiatop IgE were significantly greater in CAS3 than in CAS1 from 6 months onwards. SPTs were also more frequently positive in CAS3, especially to HDM and food allergens. From 6 months onwards, wheal sizes in cow's milk and egg white SPTs were on average greater in CAS3 than in CAS1 (Wilcoxon, adjusted p=4.8×10 -9 for egg white SPT at age 5, Supplementary Table  3D). Age-five peanut SPTs also yielded stronger wheal responses in CAS3 compared to CAS1 (Wilcoxon, adjusted p=8.4×10 -4 ).
No strong predictors for age-five wheeze were identified within CAS3 ( Table 4): only couch grass-specific IgE at age 2 and ARI frequency at age 1 were weakly significant (both p=0.046). Neither of these reached statistical significance with stepwise backward elimination. However, the prolific IgE response, and the prevalence and severity of early-life

IgG4 and IgG
Across all clusters, allergen-specific IgG4 and IgG were positively correlated with IgE for the same allergen (especially HDM, Supplementary Figure 2). As noted previously, CAS2 and CAS3 were distinguished from CAS1 by high IgG4 against multiple allergens, and CAS3 had greater IgG4 responses than either CAS1 or CAS2 (Supplementary Table 3B; Figure 3). Although previous literature suggests possible protection conferred by IgG4 [28] or IgG [29], in this study there was no clear or consistent evidence of protection by either IgG4 or IgG against later wheeze ( Table 4). Furthermore, the protected status of CAS2 was unlikely to be driven by IgG4, given that CAS3 had higher IgG4 than CAS2.

Patterns in IgE, IgG, cytokine and SPT responses
Although they were highly-correlated, phenotypes of IgE, IgG, Th2 cytokine and SPT responses did not overlap perfectly. CAS3 was enriched for individuals with strong signals in all modalities, but there remained individuals within CAS3 and the rest of the cohort who were only responsive in some modalities but not others. Notably, IgE and SPT signals did not always coincide (Supplementary Figure 3A). Also, some individuals with IgE or SPT sensitisation against HDM did not exhibit detectable Th2 cytokine response to in vitro HDM stimulation (Supplementary Figure 3A). Finally, HDM IgG4 did not appear to be responsible for this effect: those with IgE but not Th2 cytokine responses (i.e. HDM IL-13 at limit of detection 0.01 pg/L) did not have significantly different levels of IgG4 compared to those with both IgE and Th2 responses (Wilcoxon, p=0.15, Supplementary Figure 3B).

Comparison to existing criteria for atopy
The information conveyed by the npEM-derived CAS clusters was consistent with that of traditional atopy thresholds (i.e. any specific IgE ≥ 0.35 kU/L or SPT ≥ 2mm at age 2). When we compared the CAS clusters with supervised groups created using traditional thresholds (Supplementary Table 5), we found that CAS1 most closely matched a non-atopic phenotype (58 of 84 had no specific IgE greater than 0.35 kU/L by age 2). Conversely, CAS2 and CAS3 partially matched the traditional criteria for atopy, with CAS3 being an extreme phenotype (all 22 children in CAS3 had some specific IgE ≥ 0.35kU/L by age 2). However, the CAS clusters outperformed IgE-and SPT-defined atopy in terms of predicting for age-five wheeze (likelihood ratio test for model with clusters vs. model with IgE/SPT, Chi-squared=23, p=2.0×10 -6 ). In addition, at age 2, 68% of CAS3 were "sensitised" (any specific IgE ≥ 0.35kU/L) to two or more allergens, compared to only 1% and 6% for CAS1 and CAS2 respectively. This emphasised CAS3 as an early-and multi-sensitised phenotype. Finally, many members of low-risk CAS1 and CAS2 who were IgE-or SPT-responsive prior to age 5 did not maintain atopic wheeze at age 5 (77% or 79 of 103), compared to CAS3 (24% or 5 of 21). Therefore, the association of IgE and SPT results with disease risk varied across clusters. Overall, this suggests that fixed atopy thresholds are not sufficient on their own in delineating risk profiles -instead, an unsupervised clustering approach may be superior.

Relationship with time-dependent wheeze phenotypes
We explored how the npEM-derived clusters mapped to pre-defined wheezing phenotypes ( Figure 2C): no wheeze (in the first three years of life, or at age 5), transient wheeze (only in the first three years of life), late wheeze (only at age 5), and persistent wheeze (in both first three years of life and age 5). We found that CAS3 was enriched for persistent wheeze, while individuals in CAS1 or CAS2 tended to have transient or no wheeze. There were rarely any members of the cohort with late wheeze (approximately 10% or less).

Co-associations with food sensitisation, eczema and wheeze
In addition to persistent wheeze, CAS3 was also enriched for persistent food sensitisation (peanut IgE ≥ 0.35 kU/L, or positive egg white or cow's milk SPTs) and persistent eczema: 44% of all individuals in CAS3 satisfied all three conditions (Supplementary Figure 4). Almost all individuals in CAS3 had both eczema and food sensitisation from age 6m onwards, with rates of food sensitisation and wheeze increasing with time ( Figure 2D). In contrast, CAS1 and CAS2 had low rates of food sensitisation, and declining rates of both eczema and wheeze. These trends lend credence to the hypothesis that the "atopic march" phenotype [30,31] may only be present in a minority of the population (e.g. CAS3) [19].

Relationship with microbiome
Previous studies suggest an association between asthma risk and early-life disruption of the respiratory microbiome, especially colonisation with Streptococcus spp. in the first 7 weeks of life [24]. In this study, using the same data and definitions, we found that CAS3 was overrepresented by individuals who had >20% relative abundance of Streptococcus in their first infection-naïve healthy NPA, within the first 7 weeks of life (44% versus 11% and 15% in CAS1 and CAS2, respectively; Fisher exact test, unadjusted p=0.042 and 0.065, respectively; Table 3). [24,32] previously found that transient incursions with MPGs associated with acute respiratory infections (Streptococcus, Haemophilus and Moraxella spp.) were associated with increased frequency and severity of subsequent LRIs and wheezing disease. Here, we found that the proportion of infection-associated MPGs in healthy samples from age 0 to 2 was greater in CAS3 (62% vs. 49% and 32% in CAS1 and CAS2, respectively; Fisher exact test, unadjusted p=0.2 and 5.5×10 -4 , respectively; Table 3). This finding was independent of LRI and wLRI frequency (GLM; p<0.05 for model predicting group membership, with age-two LRI and wLRI as covariates). On the contrary, there were no associations between cluster membership and health-associated MPGs (Corynebacterium, Alloiococcus, Staphylococcus spp.; Supplementary Table 3E). [25] suggested that infection-associated MPGs in early life were predictive for age-five wheeze in atopic children, while in non-atopic children they were predictive for transient wheeze (i.e. wheeze only in the first 3 years of life and not later). In this study, a similar trend was noted for infection-associated MPGs from age 0 to 2, in relation to transient wheeze in "non-atopic" CAS1 (GLM, OR 3.6 per percent, p=0.17, with demographic covariates). Surprisingly, there was evidence that infection-associated MPGs in later samples (from age 2 to 4) were protective against age-five wheeze in CAS1 (OR 0.086 per percent, 0.45 per quartile, p=0.034 and 0.035, respectively; Table 4). Infection-and health-associated MPGs were otherwise not associated with age-five wheeze within the other clusters.

Decision tree analysis
We used decision tree analysis to determine the handful of biological features that most strongly distinguish each npEM cluster. This process may allow us to simplify the clustering into a tree algorithm that can then be used clinically for screening or diagnosis. Unlike the previous GLMs, which identified variables most predictive for wheeze, the decision trees identified variables most discriminatory for age-five wheeze versus non-wheeze within each cluster.
Decision tree analysis on the CAS dataset, using all available variables from all timepoints for classification, created a "Simple Tree" with two decision nodes and three end nodes ( Figure 6). This tree had 89% accuracy in terms of retrieving the cluster memberships from the original npEM model, where accuracy is calculated as percentage overlap of tree clusters with original CAS clusters. Applying this Simple Tree to CAS, we found that membership in the CAS3-equivalent tree cluster was also a better predictor for age-five wheeze (likelihood ratio test, Chi-squared=19, p<1×10 -5 ) than traditional thresholds for atopy based on IgE and SPT measurements at age 2.
Further tree analyses using variables restricted to each timepoint identified similar trends (Supplementary Figure 5); IgG4-related variables best separated CAS1 from other clusters (from Phadiatop in early life, to HDM by age 3), while IgE-related variables (Phadiatop) best separated CAS2 and CAS3. Explicitly forcing the exclusion of Phadiatop variables from tree analysis caused these thresholds to be replaced with allergen-specific assays: cat and peanut IgG4 for Phadiatop IgG4; and peanut and HDM IgE for Phadiatop IgE ( Supplementary  Figures 6 and 7). This is consistent with correlation patterns amongst the IgE and IgG4 variables (Supplementary Table 6).
Because the causes of wheeze were likely different for different clusters, we also constructed a "Comprehensive Tree" that best split individuals into six groups, based on cluster membership crossed with age-five wheeze status (Supplementary Figure 8). For decision nodes we excluded all age-five features related to wheeze (e.g. LRIs, wheezy LRIs at age 5), because of definitional overlap with our outcome of interest. We thus identified nodes that were consistent with the predictors for wheeze found in the previous regression analyses ( Table 4), combined with nodes from the Simple Tree ( Figure 6). The Comprehensive Tree had a total accuracy of 77% in correctly recovering both cluster membership and wheeze status. In terms of identifying purely wheeze status at age 5, the accuracy of the tree was 84%, with a positive predictive value (PPV, or precision) of 72%, negative predictive value (NPV) of 88%, sensitivity (recall) of 71% and specificity of 89%. The Comprehensive Tree 9 was more successful in flagging age-five wheeze (likelihood ratio test, Chi-squared=60, p=6.1×10 -13 ), compared to the traditional atopy thresholds described previously.

External replication of clusters in MAAS and COAST
The disease trajectories described by the CAS npEM clusters were successfully replicated in both MAAS (N=1085) [26] and COAST (N=289) [27]. We applied our npEM classifier (Methods) to MAAS and COAST, and found that individuals classified into "Cluster 3" (MAAS3/COAST3) had a persistent disease phenotype extending into late adolescence, with consistently high rates of parent-reported wheeze and physician-diagnosed asthma from birth to age 16. The other two clusters (Cluster 1 = MAAS1/COAST1; Cluster 2 = MAAS2/COAST2) appeared to be relatively low-risk ( Figure 7A,B,D).
MAAS3 and COAST3 exhibited stronger IgE expression (total, HDM, cat, dog) from ages 1 to 8 ( Figure 7C,E), compared to other clusters in each dataset. Like CAS3, COAST3 demonstrated elevated PBMC expression of Th2 cytokine protein (IL-5 and IL-13) in response to HDM stimulation at age 3 ( Figure 7F). This was not replicated in MAAS3, but previous work in MAAS had identified that a strong Th2 response (IL-5, IL-13) to HDM stimulation of PBMCs at age eight was associated with increased risk of HDM sensitisation and asthma [21]. Nonetheless, MAAS3 appeared to be overrepresented in "early-sensitised" and "multiple sensitised" phenotypes discovered earlier by Lazic et al [17] from SPT and IgE data. Approximately 86% of individuals in MAAS3 belonged to either one of these two phenotypes, although only 13% of individuals in these two phenotypes were accounted for by MAAS3.
Furthermore, when we explored potential predictors of wheeze phenotypes and asthma diagnosis in later childhood, we found that the clusters in the external cohorts were very similar to those in CAS. In COAST1, LRI and wLRI frequency at age 2 were predictive of asthma diagnosis at age six (GLMs with demographic covariates, p=0.02 and 0.02, respectively), while in COAST2, HDM IgE at age 3, and LRI, wLRI and fLRI frequencies at age 2 were all predictive (GLMs, p<0.05 for all) (Supplementary Figure 9). Although the timing and magnitude of associations differed between cohorts, this reaffirmed wheeze in Cluster 1 as being primarily non-atopic in origin, while wheeze in Cluster 2 seemed to be driven by both non-atopic and atopic factors.
We attempted to validate CAS-derived decision trees in the MAAS dataset, as it contained measurements of both age-five HDM IgE and HDM IgG4, which we used as surrogates for age-three HDM IgE and HDM IgG4. These features comprised two decision-node features in the Phadiatop-free equivalent of the CAS Simple Tree (Supplementary Figure 5). COAST did not have any IgG4 measurements, so tree validation was not attempted there. The performance of the Simple Tree when applied to MAAS was poor, with only 20% accuracy in terms of overlap between tree clusters and npEM clusters, compared to 89% in CAS. Instead, when we generated a new tree from MAAS using its npEM clusters (Supplementary Figure 10), we achieved good accuracy (85% clusters correctly identified). However, the decision nodes of this tree were different to CAS, being related to family size and SPTs rather than IgE and IgG4. The MAAS1-equivalent tree cluster was distinguished by reduced SPT responsiveness (wheal size < 4-5mm) to cat, HDM and grass; while the MAAS3 equivalent was defined by strong HDM or grass SPT responsiveness. While this differs from the CAS decision trees, it is consistent with the broad distinction between nonatopic Cluster 1 and atopic Cluster 3. We also stress that CAS and COAST are both high-risk cohorts (each child having a parent with asthma or allergies), while MAAS was not.

Discussion
We have used model-based cluster analysis to uncover clusters of children with differential asthma susceptibility. Specifically, there was a high-risk group characterised by very early allergen-specific Th2 activity; early sensitization to multiple allergens including food allergens; and concurrent frequent respiratory infections -resulting in high incidence of atopic persistent wheeze. We also found a lower-risk cluster, with limited or delayed elevation in IgE -this resulted in a lower incidence of mixed (atopic and non-atopic) wheeze. Finally, there was a low-risk cluster which exhibited occasional and transient infectionrelated wheeze, with minimal allergen sensitisation. These clusters were replicated in external datasets, suggesting relevance across populations. A decision tree that accounts for cluster membership with modified thresholds for atopic sensitisation is superior to traditional definitions of atopy in predicting disease occurrence. A summary of key findings is given in Table 5, and in-depth discussion of the biological and practical significance of these findings can be found in the Supplementary discussion.
Our findings demonstrate clear and homogeneous developmental trajectories among children in multiple cohorts. The latent clusters incorporated multiple domains, including immune function and infection frequency, that reflected both endotype (pathophysiology) and phenotype. We emphasise that our approach was unsupervised and exploratory -endpoint variables describing parent-reported wheeze and atopic disease were excluded from clustering, and clusters were constructed de novo without any reference to traditional atopic thresholds. Therefore, it was encouraging to observe that the data-driven clusters differed in susceptibility and nature of subsequent wheeze, and that biologically-and clinically-relevant findings could still be derived from them. Our results build on previous findings [11,33] demonstrating that the concept of atopy, as an intrinsic or heritable predisposition to allergic disease, is more complicated than what could be described by dichotomies or thresholds. Instead, our study strongly supports the future use of predictive models with more precise, subgroup-driven representations of atopy and other relevant pathophysiological mechanisms.
The characterisation of these three clusters demonstrates how the complex phenomenon of asthma pathogenesis can be explored in depth using clustering. We have successfully provided an example where addressing inter-cluster differences have allowed the identification of intra-cluster disease predictors. The clusters may be further characterised by exploring other aspects of asthma pathophysiology, including genetics, epigenetics and others. By continuing with these approaches, we can hopefully move away from fixed thresholds or criteria for atopic risk, to more sophisticated formulations of risk, which will then improve future attempts at the targeted screening, prevention and treatment of asthma. These approaches may also be broadly applied to other heterogeneous diseases or datasets, and computerised tools may then be designed to embody the sum knowledge from these approaches. Such approaches can eventually help clinicians and scientists achieve a fuller understanding of pathophysiology, and hence better predict and manage human disease.

Patients and study design
The Childhood Asthma Study (CAS) was a prospective birth cohort (N=263) operated by the Telethon Kids Institute, Perth, Western Australia [22]. CAS was established with the goal of describing the risk factors and pathogenesis of childhood allergy and asthma. Details of CAS 11 have been reported in previous publications from our group [22,24,[34][35][36], and are summarised below.
In CAS, expectant parents were recruited from private paediatric clinics in Perth during the period spanning July 1996 to June 1998. Each child who was born and subsequently recruited had at least one parent with physician-diagnosed asthma or atopic disease (hayfever, eczema). The child was then followed from birth till age 10 at the latest, with routine medical examinations, clinical questionnaires, blood sampling at multiple time points (6-7 weeks, 6 months, 1 year, 2, 3, 4, 5, and 10 years) and collection of nasopharyngeal samples. Parents also kept a daily symptom diary and recorded the presence of symptoms of respiratory infection and other illnesses, and all medications taken.
In addition, nasopharyngeal samples were taken from each child during healthy routine visits, as well as unscheduled visits where parents were asked to present with their child at every onset of symptoms of a respiratory infection. We then screened these samples for viral and bacterial pathogens using rtPCR and 16s rRNA amplicon sequencing with Illumina MiSeq (San Diego, US), respectively [24]. Specific details are described in the supplement to Teo et al [24].
Other collected data included: sex, height and weight; paternal and maternal history of atopic disease; blood levels of basophils, plasmacytoid and myeloid dendritic cells as measured by fluorescence-assisted cell sorting (FACS); and levels of vitamin D (25hydroxycholecalciferol, 25(OH)D), the measurement of which has been described by Hollams et al [36].
The study designs and measurements performed in the replication cohorts (MAAS, COAST) have been described elsewhere [19,37]. Respiratory infection phenotypes (ARI, LRI, URI, fLRI, wLRI) were redefined in COAST based on their recorded symptom scores.

Identification of latent clusters
We used an implementation of a non-parametric mixture model (npEM) from the R package "mixtools" [38], because: 1) it was plausible to consider a population as a mixture of subpopulations each with their own unique distributions; 2) it had advantages over other unsupervised approaches [39] -unlike LCA, npEM could handle continuous variables; and 3) it lent itself to an intuitive method for supervised classification of other datasets into similar clusters (see Supplementary Methods). Further details can be found in the supplementary, and a graphical outline of methodology is given in Supplementary Figure 1.
Prior to cluster analysis, quality control measures were applied to the data. Variables ("features") and subjects that had excessive missingness (i.e. more than 30% of variables) were excluded from clustering. Also excluded were features pertaining to our outcomes of interest; namely, incidence of parent-reported wheeze, asthma diagnosis and atopic disease at all timepoints. Feature selection was otherwise exploratory and all-inclusive, in that we attempted to retain as many individuals and variables as possible. We also included frequency of wheeze in the context of respiratory infection as it represented infection severity. This left us with a "complete-case" dataset of 186 subjects and 174 variables for clustering, which essentially covered variables from the first three years of life for each child. Some highly-skewed features, such as antibody and cytokine levels, were then subjected to logarithmic (base 10) transformation. Positional standardisation scaling was then applied across all variables. The complete list of clustering features is provided in Supplementary  Table 1.
Unsupervised cluster analysis was performed on processed and scaled data using nonparametric expectation-maximisation (EM) mixture modelling (npEM) from the "mixtools" R package [40]. This method assumes that the frequency distributions of each cluster can be represented by non-parametric density estimates that are learned from the data in an iterative process. The optimal number of clusters was determined by scree plot and calculation of the Bayesian information criterion (BIC). The density functions generated by the resulting npEM model were then used to classify as many of the remaining "low-missingness" subjects as possible (31 of 36), so that the resultant groupings are a composite of unsupervised cluster analysis and supervised classification. Subjects were assigned to the cluster with ≥ 90% probability according to the model (Supplementary Methods).
Decision tree analysis was also conducted with the CAS clusters using "rpart" [41] to create classification trees that summarise inter-cluster differences and generate thresholds. We also specifically compared the classification trees with existing thresholds for atopy (any specific IgE at age 2 ≥ 0.35 kU/L, and/or any specific SPT at age 2 ≥ 2mm) [11], in terms of efficacy in predicting age-five wheeze.
The npEM clusters were then described and validated in two external datasets, MAAS (N=1085) [26] and COAST (N=289) [27]. This replication was performed by applying the density function-derived classification method used previously for the low-missingness CAS subjects. Only features that were common to both CAS and the replication cohorts (MAAS, COAST) were used for replicating the classification (Supplementary Table 1); these modified classification models were also tested in CAS, and the resulting CAS clusters compared to the pre-existing clusters in CAS. Further details can be found in the Supplementary Methods and Supplementary Results.

Statistical analyses
We performed statistical analyses comparing clusters in terms of all variables in the dataset. Of interest to us were our primary outcomes: asthma diagnosis and parent-reported wheeze at each timepoint. Comparisons were performed separately for each variable and timepoint. Statistical tests used included t-tests, Mann-Whitney-Wilcoxon tests, ANOVAs, Kruskal-Wallis tests, chi-squared and Fisher exact tests; and logistic and linear regression. For summary statistics, multiple testing adjustment was performed using the Benjamini-Yekutieli 13 (BY) method, for all across-cluster tests (Cluster × trait); and for all comparisons between clusters (CAS1 vs. 2, 1 vs. 3, and 2 vs. 3). The BY method was chosen as it accounted for positive dependency across the highly-correlated variables in the CAS dataset [42]. For variables that underwent logarithmic transformation for statistical analyses before being transformed back, we used geometric means instead of arithmetic means to describe the measure of central tendency (in this case, the geometric mean is equivalent to the exponent of the arithmetic mean of the log-transform).
We then determined the predictors for age-five wheeze within each cluster. Both simple and multiple regression models were constructed; the former were built with and without a base set of covariates (sex, family history of asthma, BMI where available). The latter were built by manually selecting variables found to be most statistically-significant (at least p<0.05) in the univariate analyses, for each timepoint, followed by step-wise backward elimination to achieve the most parsimonious model with all predictors statistically-significant (p<0.05). Repeated-measures ANOVAs were also performed for selected predictors of age-five wheeze. Finally, generalised linear models (GLMs) were generated and their likelihood ratios examined using the "lrtest" function from the R package "Epidisplay" [43], to check how much cluster membership or classification trees improved upon prediction of age-five wheeze using selected predictors.
14 Tables        5: Key findings from cluster analysis § Certain childhood populations may be broadly split into three clusters, each representing a unique trajectory of immune function and susceptibility to respiratory infections: low-risk non-atopic Cluster 1 with transient wheeze; low-risk but allergy-susceptible Cluster 2 with mixed wheeze; and strongly-atopic high-risk Cluster 3 with persistent wheeze. § Cluster 3 is consistent with an early-sensitised and multi-sensitised phenotype. § HDM hypersensitivity is an important predictor of wheeze in allergic or allergy -susceptible individuals. § In CAS, IgG4 flags for clusters with susceptibility to atopic disease (CAS2 and CAS3), while early and multiple-allergen elevation in IgE predicts frank atopic disease. The pathophysiological role of IgG4 remains unclear. § Food and peanut hypersensitivities are important contributors to membership in high-risk Cluster 3. This may be pathophysiologically related to eczema, multi-sensitisation and the atopic march. § Allergic and infective processes act in a synergistic manner to intensify airway inflammation during respiratory pathogen clearance. Some clusters (Cluster 3) may be more susceptible to this effect than others that lack strong allergic sensitisation (Cluster 1). § The microbiome also acts differently on asthma risk depending on cluster membership. In CAS, early-life asymptomatic colonisation with infection-associated MPGs is associated with risk of persistent wheeze in allergy-susceptible clusters (CAS2, CAS3), while it is potentially protective in non-atopic children (CAS1) § Tests for atopy (IgE, SPT, cytokines) do not necessarily overlap. Therefore, atopy may be better defined by the composite result from a battery of tests encapsulated in a predictive model, rather than just a single test or threshold. § Different childhood populations may share similar trajectories of asthma susceptibility, but there may be subtle differences in terms of the types of tests, allergens, or biological signals that are most informative (SPT, IgE, cytokines, etc.). Figure 1: Non-     Points indicate means; bars indicate 95% CI (t-distribution). *p<0.05 for Mann-Whitney-Wilcoxon comparison within each timepoint. #p<0.05 for repeated-measures ANOVA across timepoints from the first 3 years of life (see Table 4).

Figure 6: A "simple" decision tree generated by recursive partitioning from CAS data, with breakdown of tree clusters by actual CAS npEM-derived clusters (A); scatterplot showing separation of CAS clusters by decision split thresholds (B)
Percentages in Panel A may not sum up to 100%, because some individuals have missing values for decision node variables, hence making them impossible to classify. In Panel B, note that left-most column of points represent values of HDM IgG4 that were less than the limit-of-detection (LOD) for that assay (0.0003 µg/L), and were subsequently assigned to half the LOD (0.00015 µg/L). Most of these points belonged to individuals from CAS1. MAAS cohort (N=934) was classified using npEM model from CAS, into MAAS1 (N=199, 21%), MAAS2 (N=692, 74%) and MAAS3 (N=43, 5%); these correspond to CAS clusters CAS1, 2 and 3, respectively. COAST cohort (N=285) was similarly classified into COAST1 (N=105, 37%), COAST2 (N=151, 53%) and COAST3 (N=29, 10%).