Fungal Pathogen Gene Selection for Predicting the Onset of Infection Using a Multi-Stage Machine Learning Approach

Phytopathogenic fungi pose a serious threat to global food security. Next-generation sequencing technologies, such as transcriptomics, are increasingly used to profile infection, assess environmental adaptation and gauge host-responses. The accumulation of these large-scale data has created the opportunity to employ new computational methods to gain greater biological insights. Machine learning approaches, that learn to identify patterns in complex data sets, have recently been applied to the field of plant-pathogen interactions. Here, we apply a machine learning approach to transcriptomics data for the fungal pathogen Zymoseptoria tritici, to predict the onset of infection as measured by timing of the appearance of necrosis. We present a method for identifying the most important genes that predict infection timings, accurately classify isolates as early and late infectors and predict the timing of infection of ‘novel’ isolates using only a subset of the data. These methods and the genes identified further demonstrate the use of these tools in the field of plant-pathogen interactions and have implications for the identification of biomarkers for disease monitoring and forecasting. Fungi that infect plants pose a serious threat to global food security. Methods to study these pathogens generate vast amounts of data that create new opportunities for computational tools to analyse them. Machine learning methods can learn patterns in complex data such as when genes are turned on or off in fungal plant pathogens. In this study we use machine learning approaches to predict the onset of infection in several isolates of an important fungal pathogen. We show that these methods can identify a small group of genes that are predictive of the infection onset. We can even use these methods on ‘novel’ isolates to infer the likely timing of disease development. Our work has implications for plant disease diagnosis, monitoring and forecasting.


Introduction
Phytopathogenic fungi threaten global food security, destroying up to 30% of crop products through disease and spoilage [1,2].To cause disease and complete their life cycle, plant pathogens have to overcome the physical and chemical barriers of plant immunity, including pathogen-associated-molecular-pattern (PAMP)-triggered-immunity (PTI) [3].They achieve this by secreting effector proteins, which disrupt host signalling for threat perception and defence mechanisms [3][4][5].In turn, plants have evolved resistance (R) proteins which detect specific pathogen effectors or monitor the effector-targeted proteins and initialise effectortriggered immunity (ETI) [6].These gene-for-gene (R gene to effector gene) interactions can determine the success or failure of infection [7].ETI induces a suite of responses including a hypersensitive response (HR), localised cell death, activating multiple protein kinase cascades, as well as initiating an oxidative burst and promoting expression of many defence related genes [3,8].
The Dothidiomycete fungus Zymoseptoria tritici is the causal agent of Septoria tritici blotch (STB), a severe disease of wheat [9].STB results in significant economic losses through reduced yields of 5-10% per annum in the EU [10] and the cost of disease management through application of fungicides, which accounts for approximately 70% of the European agricultural fungicide market [11,12].Z. tritici is capable of rapid evolution due to a high rate of gene flow and recombination in large populations [13][14][15][16].This fact, coupled with management practices has driven selection in Z. tritici for resistance to all major fungicides [17][18][19][20][21][22][23].The situation is compounded by Z. tritici's highly plastic genome, which enables the fungus to overcome major wheat resistance genes [24].Z. tritici has 21 chromosomes with eight chromosomes being dispensable and absent in some isolates of the species [25].Indeed, 40% of the Z. tritici pan-genome, the entire set of genes within the species, is composed of orphan genes [26,27], giving Z. tritici the largest known accessory genome among fungi [28].Deletion of the dispensable chromosomes has confirmed that some are necessary for host specificity [29].Transcriptomic studies show flexibility of infection programs with heterogeneity in gene expres-sion throughout development between isolates [30,31].This extends to variability in development and aggressiveness when assessing specific isolate pathogen-host combinations [32].
Z. tritici is a member of the Mycosphaerellaceae family, which are often deemed hemibiotrophic due to having a long asymptomatic, biotrophic phase in the disease cycle [33].This is followed by a rapid switch to a necrotrophic phase and the development of disease symptoms, including irregular chlorotic lesions and then necrotic blotches forming on the leaf [34][35][36].Necrotrophy is associated with the hallmarks of PTI and ETI in the host, including programmed cell death (PCD) and accumulation of H 2 O 2 [37,38].This results in the collapse of mesophyll tissue and an abundance of nutrients flooding the apoplastic spaces, facilitating rapid fungal growth and finally sporulation [39][40][41][42].With limited evidence of biotrophy during the asymptomatic phase, Z. tritici is more accurately described as a latent necrotroph [43].The asexual reproductive cycle is characterised by distinct developmental stages throughout both the asymptomatic and necrotrophic phases, lasting 2-3 weeks [30,44,45].A spore will alight upon a wheat leaf, develop invasive hyphae, gain ingress through natural openings (stomata) as opposed to forming specialised structures such as an appressorium, and growth is exclusively in the intercellular spaces between the mesophyll cells of the leaf.These stages comprise the asymptomatic phase of infection and typically last 914 days [35,44,[46][47][48][49][50].The necrotrophic phase is represented by increased fungal biomass, hyphae reaching new stomata and forming new fruiting bodies called pycnidia in the sub-stomatal cavity.Mature pycnidia are packed with macropycnidiospores, which are released through the stomatal aperture under conditions of high humidity and spread by rain splash to repeat the infection cycle [11,50,51].Development is asynchronous between individual spores infecting a single leaf [45] and spores can grow epiphytically in excess of 10 days [52].
Although the stages of development are known, a complete picture for the molecular mechanisms controlling development remain to be established.Efforts to elucidate this subject have focused on characterising virulence factors, focusing on small secreted effectors [42,44,[53][54][55][56][57][58][59][60][61][62][63][64][65][66][67][68], transcription factors and kinases [69][70][71][72][73][74].A broader approach involves transcriptome profiling through the infection cycle, assessing systems-level adaptation to environmental changes and differences in host responses from cultivars with ranging susceptibilities [30,31,44,[75][76][77][78][79][80][81].Transcriptome studies generate vast amounts of data and we require intelligent methods to mine and understand them.Recently, machine learning approaches have been applied to a range of areas in plant-pathogen interactions including disease monitoring, genomic selection for resistance and the identification of potential effectors (reviewed in [82]).Although studies have looked at using a variety of transcriptomic data to identify stress conditions [83] and genes associated with complex agronomic traits in host plants [84], to the best of our knowledge no work has focused on using pathogen transcriptomic data to predict the outcome of infection and identify the key drivers, or genes that predict these outcomes.
In this work, we apply a combined Machine Learning (ML) approach with a parametric statistical model to available Z. tritici transcriptomic data captured during infection to: (i) identify a small subset of genes which may be associated with infection progression; (ii) uncover relationships between expression changes of these genes over time and the timings of necrosis appearance as a marker of infection outcome; (iii) predict how quickly a 'novel' pathogenic isolate reaches this infection milestone based on the expression of the subset of genes.Our approach relies on little to no expert information about novel isolates and is generally applicable across different experimental designs (e.g. in the timings of observations) and species.The method we propose is composed of three stages: in the first stage we use a ML approach to identify the genes which best characterise the infection phases; in the second stage, we quantify how the expression of these genes changes over time; and in the third stage, we develop a statistical model to give a "high resolution" estimate of the timing of the infection "peak".The key insight in stage one is to formulate the problem as a feature selection problem in a binary classification task, where the two classes/conditions to discriminate are the non-infectious state and the infectious state.The key insight in stage two is the definition of a robust notion of distance from the non-infectious state, to this end we employ the Mahalanobis distance between gene expressions (so to take into account the natural variability of each gene) but limiting the information to the (few) discriminative genes identified in stage one (so to achieve the largest possible signal to noise ratio).Finally, the key insight in stage three is to use the metric developed in stage two to quantitatively define specific phases in the shape of the temporal evolution of the infection, namely: an onset phase, a peak phase and a decay phase to a stable condition.
The empirical results show that our approach can identify a characteristic subset of genes that significantly change expression between the onset of infection and appearance of necrosis.Furthermore, summarising the change in expression of these genes is sufficient to differentiate between isolates showing an early or late appearance of symptoms.Finally, we show that our methods are potentially capable of predicting infection timings of 'novel' isolates, even on a subset of the available transcriptomics data.Overall, our results suggest potential utility of the proposed approach for the identification of a small subset of genes predictive of an infection outcome that can be investigated as potential candidates for disrupting the disease process as well as for forecasting the emergence of new plant disease.

Data acquisition and experimental design
We used gene expression data from two studies; (i) Haueisen et al. characterised gene expression for 3 isolates of Z. tritici (Zt05, Zt09 and Zt10) at 4 stages of infection (A, B, C and D) determined by confocal microscopy [30].Zt05 is a field isolate from Denmark, Zt09 is a derivative of the reference isolate IPO323 and Zt10 is another field isolate from Iran.Previous work has identified >500,000 and >600,000 single nucleotide polymorphisms (SNPs) with reference isolate IPO323 indicating there is significant genetic distance between these 3 isolates [85].(ii) Palma-Guerrero et al. recorded gene expression data for 4 isolates of Z. tritici (3D7, 3D1, 1A5 and 1E4) at 7, 12, 14 and 28 days post infection (dpi) [31].These isolates were all collected from two wheat fields in Switzerland, but show considerable genetic diversity with approximately 310,000 SNPs between each pair [75,79].Selection of isolates from diverse geographical regions that show significant genetic diversity allows our approach to be generalised across Z. tritici isolates.Normalising and filtering the data (see Materials and Methods) left 9,371 and 6,641 genes in the Haueisen et al. and Palma-Guerrero et al. data respectively, with 6,486 genes present in both data sets.Only genes detected to be expressed in all isolates were used in this study, this is important as identifying commonly expressed genes that are predictive of infection allows us to demonstrate the utility of these methods across isolates.Both sources of data for this study report a variety of infection outcomes including the timing of disease symptom development and the percentage leaf area covered by necrosis and pycnidia.For this study we used the most comparable measure common to both experiments, which was the timing of the appearance of necrosis measured in days post infection (dpi), which we used to classify infection by these isolates as early or late onset.This follows the findings that isolate Zt10 develops necrosis significantly later than isolates Zt05 and Zt09 [30], and that symptoms of infection, including necrosis, were also found to occur later and be less severe for isolate 3D1 compared to 3D7, 1A5 and 1E4 [31].

Machine learning to identify genes predictive of infection timings
We first used a random forest classifier to identify those genes that change their expression significantly between two distinct time points in the infection period that therefore, may represent important genes in disease progression.For the first time point, we selected the initial time point of infection.For the second, we selected the time point closest to necrosis development, the infection outcome of interest.The time points closest to necrosis development were infection stage C and 14 dpi for the Haueisen et al. [30] and Palma-Guerrero et al. [31] data, respectively.These two time points are the two "classes" the random forest seeks to distinguish between based on the gene expression values, which are the "features".This setup allows us to quantitatively assess how predictive each gene is in determining at what stage of infection the observation is made.The random forest classifier had an out-of-bag prediction error of 3.5%, which is the fraction of miss-classified samples.The 20 most important genes for determining the stage of infection are reported in Table 1, as quantified by their impurity score from the random forest classifier.

Comparing machine-learning to differential expression analysis
Traditionally, to identify genes that change in their expression significantly over time, we would use standard differential expression analysis rather than the machine learning approach described above.The rational for using a ML feature selection approach is that, due to the complexity of the interactions in biological systems, we should strive for methods that can process correlation and non linear interaction effects rather than limiting the selection process to univariate approaches based on ttests -or other statistical tests -that model one single gene at a time [86,87].To compare our results to those of the more traditional analysis, we identified significantly differentially expressed genes using edgeR [88].This analysis revealed 2,757 significantly differentially expressed genes (corrected p < 0.05), of which, all 20 genes identified in Table 1 are present (12 upregulated, 8 down-regulated; full list available in Table S1).However, our most important genes are not the most significant according to p − value ranking, nor do they show the largest changes in expression ranked by fold change.In our approach these genes have been selected on the basis of their joint capacity to inform the classifier to distinguish the two conditions, irrespective of the magnitude of their expression levels.It is not uncommon for conventional differential expression analysis to reveal very large lists of genes and, without significant further analysis, including arbitrary Table 1.The 20 most important genes as quantified by their (rounded) impurity scores in order of decreasing rank, with all isolates as inputs.

Rank Gene ID
Reference ID Gene Name Importance cutoffs, it would be difficult to narrow these lists down to a manageable number of genes for further analysis and potential experimental investigation.We believe that machine learning based methods for gene selection offer an attractive alternative to traditional differential expression as they 1) allow to rank genes using a multivariate importance measure (e.g. the reduction of the classifier discriminative power) and they often 2) yield sorted lists of genes' importance scores with a pronounced "knee", which allows the identification of a natural cutoff.
Haueisen et al. have also published lists of differentially expressed genes that represent a core transcriptional program during wheat infection [30].These genes are differentially expressed between infection stages across all 3 isolates present in the study.In total the authors identify 676 genes that show differential expression between the 4 identified stages of infection.As mentioned above, this represents a relatively large number of genes to process.In this study, 8 of the 20 most important genes are also differentially expressed between stages of infection in Haueisen et al..All of these 8 genes (Zt09_chr_1_00237, Zt09_chr_3_-00489, Zt09_chr_9_00376, Zt09_chr_3_00461, Zt09_-chr_2_00249, Zt09_chr_12_00203, Zt09_chr_6_00030 and Zt09_chr_4_00505) show differential expression at stage C (note only consecutive stages were tested i.e.B vs C and C vs D), and in all cases except Zt09_-chr_4_00505, these genes are up-regulated at stage C.This is an expected result as stage C relates to necrosis development, which is our infection outcome of interest.This is the same time point that we have used to identify the most important genes presented in Table 1.Although there is considerable overlap between the most important genes identified in this study and those genes showing differential expression in the core transcriptional response identified by Haueisen et al. [30], there are 12 genes that do not overlap.This may be because our study uses a wider range of isolates with more diversity, potentially allowing us to capture a more general set of genes that define infection progression.

Summarising change of expression differentiates isolates showing early-and late-onset necrosis
We next aimed to quantitatively summarise the gene expression changes over time for the K most important genes identified by the random forest classifier (where the value of K must be chosen).To do this we used the Mahalanobis distance from the point of infection to each subsequent time point.Figure 1 shows the logarithm of Mahalanobis distances of the K = 10 most important genes at varying times.We see that these genes change significantly over the first two weeks (relative to the initial state) before plateauing or indeed decreasing.The consistency of this structure across isolates suggests that the Mahalanobis distance is a compelling measure to summarise gene expression changes during infection.However, as seen in Figure 1, it is not immediately obvious which isolates progress more slowly or quickly, due to large gaps in the time series of observations, and because of misalignment between the timings of observations.Adopting the average timing of the 'peak' distance for each isolate as a proxy for the speed of infection, we can't be certain if a peak coincides exactly with the timing of observations or instead lies within any of the gaps.We therefore adopt a Bayesian statistical model which makes simple assumptions about the functional form of the distance trajectory, to reduce the problem to a few parameters for each strain.Specifically, the model assumes the log-distances for each isolate can be described by a piece-wise linear trend plus an error term that captures variability between replicates.One of the parameters for each strain is the timing of the peak distance, which can be directly estimated along with measures of uncertainty.
Before proceeding, we must choose a value of K which determines how many important genes we use to compute the Mahalanobis distances.Here, the value of K can be thought of as a parameter, one which will inevitably affect the predictive performance of the statistical model: if K is too low, then we may be excluding genes which are informative for differentiating between early and late onset isolates; if K is too high, then we risk diluting the signal from the most informative genes.
In this case, we believe the data available to us are insufficient to make a serious attempt to choose the optimal value of K without risking over fitting.We therefore discuss results in the context of different values of K: 5, 10, 15, and 20.
Figure 2 shows the posterior median expected logdistances and Figure S1 shows the posterior distributions of the timing of the peak distances, for each isolate and for different values of K. Across these values of K, there are some consistent patterns.First, the isolate with the longest time until necrosis development, Zt10, consistently has the latest peak distance.
The other late onset isolate, 3D1, is also among the isolates with the latest peak distance.On the opposite end of the spectrum, the isolates with the earliest necrosis development, Zt05 and Zt09, are consistently among the isolates with the earliest peak distances.Al-though these distributions show some uncertainty, the consistent differences in early-and late-onset infection isolates suggest that our results are somewhat robust to the value of K.However, for these and later results it should be considered that, if the timings of the peak distances of the 7 isolates were ordered completely at random, there would be a just-below 5% probability that the two "late-onset" isolates would have the two latest peak distances (by chance).There is therefore a small but non-trivial chance that our results are random fortune.We fully acknowledge that to properly assess the significance of these results we would need a larger number of time points and isolates.Other substantive limitations of our analysis are discussed in the Methods.

Distinguishing early-and late-onset of infection for 'novel' isolates
Next, we used leave-one-out experiments to investigate the generality of our approach when applied to novel isolates.First, we investigated the stability of the important gene ranks across 7 runs of the procedure, where each one of the isolates was left out of gene selection in turn.Table 2 shows the 20 most important genes in order of median rank across the 7 experiments, as well as the lowest, mean and highest rank each gene achieved.Notably, one gene, Zt09_chr_1_00237 (My-cgr3G78133), was the most important in all 7 experiments.Many of these genes are also present in Table 1, with three exceptions; Zt09_chr_5_00763 (My-cgr3G72602), Zt09_chr_2_01239 (Mycgr3G69186) and Zt09_chr_13_00010.Fitting the parametric model once when each isolate is treated as novel, results in 7 distinct model fits.To summarise across these different fits, Figure 3 shows the expected log-Mahalanobis dis- tance over time for each isolate, where the fit for each isolate is from the run where that isolate was treated as novel (using the K = 10 most important genes in each case).Here, we are looking to see if the patterns seen with respect to the timing of the peak distance hold, as shown in Figure 2. Indeed, we find that the peak distance for the two late onset isolates (Zt10 and 3D1) appear later in time than the early onset isolates, indicating that we can potentially distinguish between earlyand late-onset of necrosis when we treat each isolate as unseen in turn.

Predicting the timing of infection milestones for 'novel' isolates with limited biological measurements
Finally, we also investigated whether our approach could be used to predict the timing of the peak Mahalanobis distance of a novel isolate, and by extension, the timing of infection milestones without observing the whole time series for that isolate.To do this we adjusted our leave-one-out experiment so that the distances at the third and fourth time points for the novel isolate were treated as missing values.This means that the estimation of the timing of the peak distance (from the Bayesian statistical model) is only based on the first two time points for the novel isolate.This would demonstrate the potential of our methods for predicting the virulence of new pathogenic isolates with few biological measurements.First, we assessed how well our model performs when extrapolating beyond the available time points by comparing the distance values we treated as missing to their associated predicted values.Figure 4 shows posterior median predictions and 95% prediction intervals for the third and fourth time points.From this plot we can see that predictions are more accurate for the 3rd time point -which is arguably more important than the 4th, as it is closest to the peak we are trying to predict.Indeed, the majority of variability in the true novel log distances is explained by the predictions (R 2 = 0.71) for the 3rd time point.We can also see that the 95% prediction intervals cover the observed values almost every time.
There is some evidence that the model may systematically under-predict the distance at the 3rd time point, which should be taken into account when designing future versions of this approach.
Figure 5 shows the posterior median expected log-Mahalanobis distance where the fit for each isolate is taken from the model where that isolate was treated as novel (and therefore the 3rd and 4th time points were treated as missing values).Here the peak distance for the late onset isolates is noticeably later than for the early onset isolates, implying that when using this approach there is potential to predict the onset of infection for a new isolate when using only existing isolates for feature selection and when only observing the first portion of the time series for the new isolate.These results demonstrate the potential utility of our approach for predicting infection outcomes, in this case the onset of necrosis, for new isolates using only limited biological measurements.In turn, these results therefore have implications for the use of these methods for developing new diagnostics and forecasting the emergence of new disease.
Many genes have enzymatic functions, particularly oxidoreductase activity.
In order to expand on the qualitative functional descriptions of the most important genes, we next performed enrichment analysis for these genes in Z. tritici and of orthologs in other well annotated fungal species.In Z. tritici, we found enrichment for GO terms and pathways relating to lipid and sphingolipid biosynthesis processes, sphingolipid desaturase activity and oxidoreductase activity.
However, although these enrichments were significant (p-value < 0.05), none were significant after false discovery rate correction (Table S2).The first set of orthologs analysed were in another Dothideomycete, Cladosporium fulvum.We used the MycoCosm database [91] to identify orthologs and to find GO, InterPro and EuKaryotic Orthologous Groups (KOG) annotations.Although the MycoCosm database doesn't allow the identification of enrichment of annotations, we find that the annotations match closely what was observed in Z. tritici.GO terms relating to oxidoreductase activity (GO:0016651), ATP binding/ATPase activity (GO:0005524, GO:0016887) and transporter activity (GO:0005215, GO:0042626) are all associated with C. fulvum orthologs.These patterns are repeated for InterPro and KOG annotations where oxidoreductase subunits (IPR011538, IPR011537, IPR001949) and ABC transporter-like domains (IPR003439, IPR013525, IPR010929) are well represented.KOG annotations show similar patterns with transcription factors (KOG2294), oxidoreductase subunits (KOG2658) and ABC transporters, related to drug resistance (KOG0065), all present (Table S3).Next, we chose another wheat pathogen, Fusarium graminacearum, and identified orthologs and enriched GO terms and pathways using FungiDB [92].As before, we find annotations for a variety of GO terms related to transport (GO:0015168, GO:0015105, GO:0042626, GO:0015399, GO:0022804, GO:0015103), oxidoreductase activity (GO:0016655, GO:0016491, GO:0055114, GO:0072593) and sphingolipid biosynthesis (GO:0042284, GO:0030148) are enriched but not significant after false discovery rate correction.However, we find several Cellular Component GO terms are significantly enriched (corrected p < 0.05) relating to the cell membrane (GO:0005743, GO:0019866, GO:0005886, GO:0031966, GO:0005740, GO:0016020, GO:0031975, GO:0031967), supporting the suggestion that these genes may be functionally related to transport across the cell membrane (Table S4).Finally, we analysed orthologs and their functions in the well-studied, model filamentous fungi, Aspergillus fumigatus (Table S5).To support previous findings, GO terms relating to oxidoreductase activity (GO:0016491) and cell wall organisation (GO:0070871, GO:1904541, GO:0044277, GO:0071853) are significantly enriched (corrected p < 0.05).Functional pathway analysis of the most important Z. tritici genes and their orthologs supports the observation that these genes are likely to be involved in enzymatic reactions, particularly oxidation-reduction reactions, and transport, potentially across the cell membrane.

Discussion
Machine learning methods have recently been applied to the study of plant-pathogen interactions in the areas of disease monitoring, genomic selection for resistance and prediction of potential effectors (reviewed in [82]).
In this study we apply a multi-stage machine learning approach with a parametric statistical model to publicly available transcriptomic data for several isolates of the wheat pathogen Z. tritici.Our methods are capable of identifying a subset of genes that significantly change their expression between the onset of infection and an infection outcome measured as the first appearance of necrosis.These genes include those with disease-related functions and as yet uncharacterised or isolate-specific genes that might represent novel genes with roles in disease.We next find that the changes in expression of these genes, summarised by the Mahalanobis distance, differentiates between fungal isolates that show an early or late appearance of necrosis.Finally, we demonstrate potential ability to discriminate between "early" and "late" necrosis onset of novel isolates.
It has been proposed that machine learning methods will be central to analysing high-resolution omics data, such as those from multiple members of a species to understand plant-pathogen interactions [93].However, current applications of machine learning methods to transcriptomics data of plant-pathogen interactions fo-cus largely on predicting stress conditions [83], genes associated with specific traits [84] or predicting gene regulatory networks (GRN) in the host [94].For fungal pathogens, transcriptomic datasets have been used to infer a GRN for Fusarium graminearum, with the aim of identifying major regulators of disease pathways that provide candidates for experimental validation [95].
Our work represents, to the best of our knowledge, the first approach that aims to use high-resolution transcriptomics data from multiple members of a species to predict infection outcomes and, in doing so, identify those genes most predictive of that infection outcome, in this case the onset of necrosis.
Identifying the most important genes that predict the onset of necrosis may yield useful insights into the disease process and provide novel candidates for forecasting the virulence of new isolates.In general, those genes found to be important in predicting disease fall into three categories; (i) transporters, (ii) transcriptional regulators and (ii) enzymes, particularly oxidoreductases (Tables 1  & 2).These functions are supported by pathway analysis of the Z. tritici genes (Table S2) and orthologs in other relevant fungal species (Tables S3-S5).Zt09_-chr_12_00203 (MgAtr2) is a known ABC transporter that provides protection against toxic compounds including fungicides and plant metabolites [90,96], but is not essential for virulence [97].Zt09_chr_6_00030 (My-cgr3G109703) is a putative major facilitator superfamily transporter and other members of this family, such as MgMfs1, have been shown to confer protection against natural toxic compounds and fungicides in Z. tritici [98].
In addition to transporters, many of the important genes are annotated as having oxidoreductase activity and are potentially involved in the response to oxidative stress.Responding to reactive oxygen species, which plays a role in host defense, is important to both fungal pathogens of plants [99] and humans [100].For example, Zt09_chr_12_00206 (Mycgr3G111474) is annotated as being involved in the sphingolipid metabolic process (GO:0006665), where it has been speculated  that microbes containing sphingomyelin would be more resistant to damage by oxidative stress [101], with implications for fungal pathogenesis [102,103].Many of the genes identified as being most important for the transition to necrosis are uncharacterised and, as such, represent new predictions of genes with roles in disease and potential biomarkers for disease forecasting and surveillance.Previous work applying machine learning techniques to infectious disease have identified the most important mutations that predict the human transmission of avian influenza viruses with implications for disease monitoring [104].To differentiate between isolates showing early and late onset of necrosis development we summarised the changes in expression of the most important genes using the Mahalanobis distance.A Bayesian parametric model applied to the distances then correctly distinguishes early and late onset for varying numbers of most important genes (K) (Figure 2).Furthermore, we have shown that this measure has potential promise in predicting infection timings of 'novel' isolates using leave-one-out experiments (Figure 3) and with only limited timepoints from the transcriptomics series (Figures 4 & 5).We demonstrate the utility of our approaches to predicting the onset of necrosis development in new isolates of Z.tritici using only the most important genes identified in this study and limited data on infection collected for the new isolate.The results indicate that the methods presented here could be used to predict the impact of a new pathogenic isolate.Previous work in related fields has focused on assessing pre-planting factors, such as latitude and longitude, to predict the susceptibility of wheat to outbreaks of Parastagonospora nodorum [105].Studies have also used machine learning on images, fluorescence and thermography to classify leaves as infected or uninfected [106] and predict disease development stages [107].All these studies use machine learning methods for the prediction of disease, however our approach is the first to use pathogen transcriptomics as the predictor variables rather than imaging.This study adds further evidence for the potential applicability of machine learning approaches for the prediction of disease outbreaks that may inform disease management decisions.
Although we have potentially succeeded in deploying machine learning approaches to predict the onset of necrosis in an important fungal pathogen, our approach is somewhat limited by data availability and inconsistency.The publicly available RNA-Seq data used in this study has been collected based on different criteria, with one study collecting data at pre-defined timepoints [31] and another at different disease development stages [30].Consistent transcriptomic data collection, as well as more fine-grained expression data, would further increase the predictive accuracy of the proposed approach.Likewise, the infection outcomes available for this study also limited our predictions to the onset of necrosis.This highlights the need for quantitative information on disease progression, which is becoming increasingly more feasible with the development of low-cost phenotyping imaging setups and software for automated image analysis [108], though additional work is still required before these approaches can be used to quantitatively measure fungal infection.
The approach outlined in this study takes a noteworthy step towards demonstrating the promise of machine learning to identify the most important genes predictive of infection outcomes.In this application, many of the most important genes are uncharacterised and thus, this approach may be useful for the annotation of genes with roles in disease and to identify biomarkers that can be used to develop new diagnostics for disease monitoring and forecasting.Moreover, we have demonstrated that we are capable of predicting a disease outcome for novel isolates using only the first two gene expression timepoints, which has further implications for the speed and ease of data collection to deploy these approaches for disease surveillance and prediction.We can envisage a scenario where these approaches can be applied to a newly emerging isolate of a known pathogenic species, where existing data could be used to train our models and the collection of minimal new data (we have demonstrated our approach on transcriptomics data of only 2 infection timepoints) on the emerging isolate can be used to predict its infectivity.Of course, there are many potential avenues for future work including the integration of different omics data [82], reduction of the high-dimensionality of gene expression data with network biology approaches [109] and use of automated imaging data to capture infection phenotypes [108], that may reduce data collection and broaden the phenotypes that can be predicted.Together, these advancements are a significant step towards wider utility and application of machine learning methods to study plant-pathogen interactions with implications for developing new monitoring techniques, control strategies and treatments.

Data gathering and pre-processing
We used gene expression data from two studies; (i) Haueisen et al. characterised gene expression for 3 isolates of Z. tritici (Zt05, Zt09 and Zt10) at 4 stages of infection (A, B, C and D) determined by confocal microscopy on wheat cultivar Obelisk, each with 2 replicates [30].Infection stages can be converted into dpi for each isolate using the results of the confocal microscopy experiments reported by Haueisen et al.. Note, Haueisen et al. reported ranges for the timing of each infection stage, and we took the mid-point of these ranges.These data were downloaded as raw counts from the Gene Expression Omnibus [110] under accession GSE106136.(ii) Palma-Guerrero et al. recorded gene expression data for 4 isolates of Z. tritici (3D7, 3D1, 1A5 and 1E4) infecting wheat cultivar Drifter at 7, 12, 14 and 28 dpi with 3 replicates [31].These data were downloaded as raw counts provided in Supplementary Table 2 of the original publication.Importantly, sequencing data from both datasets were aligned to the Zt09 annotation [78] meaning all isolates were annotated with consistent gene identifiers.To normalise raw count data to account for gene length and library size (number of reads mapped to genes) we used Reads Per Kilobase of transcript per Million mapped reads (RPKM).Gene lengths for the Zt09 annotation were taken from Haueisen et al. [30] and library size was defined as the total number of reads assigned to genes in each sample.In each dataset, genes were removed if there were no expression values for a gene in all replicates in any single timepoint.For any timepoint with only partial data, the missing data was filled using the mean of the expression values of the other replicates for that isolate and timepoint.Filtering left 9,371 and 6,641 genes in the Haueisen et al. and Palam-Guerrero et al. data respectively.Normalised data were reported as log(RPKM + 1) and are available in Supplemental Tables S6 and S7

Infection outcomes
Both Haueisen et al. and Palma-Guerrero et al. report a variety of infection outcomes including the timing of disease symptom development such as necrosis and pycnidia and the percentage leaf area covered by necrosis (or necrosis level) and pycnidia.For this study we used the most comparable measure common to both experiments, which was the timing of the appearance of necrosis measured in dpi.We used the timing of the appearance of necrosis to broadly classify infection by these isolates as early or late onset.This follows the findings that isolate Zt10 develops necrosis significantly later than isolates Zt05 and Zt09 [30].Symptoms of infection, including necrosis were also found to occur later and be less severe for isolate 3D1 compared to 3D7, 1A5 and 1E4 [31].Supplemental Table S8 shows the infection timing and classification for each isolate used in this study.

Differential expression analysis
In order to be able to compare the results of our machine learning approach with the more traditionally used differential expression analysis, we identified significantly differentially expressed genes using edgeR [88].the method of Benjamini and Hochberg [111].Note that this set of differentially expressed genes is used only for comparison purposes, but it is not otherwise used in the proposed approach.

Gene selection
The first stage of our approach is to identify a small subset of genes which change significantly over the infection period and therefore may be candidate genes important in the disease process.To do this we apply a machine learning (random forest [112]) classifier to differentiate between observations of gene expression made close to the infection time and observations made close to the time of necrosis development.Specifically, the 'output' for time t, replicate r and isolate s, denoted by y t,r,s , is defined as a binary quantity where y t,r,s = 0 if t is the initial time point of the experiment and y t,r,s = 1 if t is the time point closest to necrosis development.This time is taken to be stage C in the Haueisen experiment and 14 days in the Palma-Guerrero experiment.For each output y t,r,s we also have a corresponding vector of gene expressions x t,r,s which serves as the input to the classifier.The main result of doing this is not the fitted classifier itself but the ability to quantitatively assess how predictive each gene is in determining at what stage of infection the observation is made.As such a ranking of genes by importance is obtained by computation of the impurity score for each gene and sorting the scores into descending order.The gene with the highest score is therefore the most 'important', and so on.It then remains to choose a value of K whereby we carry forward only the K most important genes into the subsequent stages.Random forest classifiers are inherently stochastic, meaning the ordering of gene importance can vary when the model is fit several times to the same data.To improve the stability of the importance ranking, we used an 'extra trees' version of the random forest clas-sifier (R package ranger [113]) and employed a large number of trees (100k).

Mahalanobis distance
The second stage of our approach is to quantitatively summarise the change in expression over time of a subset of relevant genes.That is, we want to characterise, with a single numerical value, the state of the biological system with respect to the infection process, so that, ideally, a state of infection is clearly distinguishable from the non-infected state.To accomplish this, we adopt all observations of these genes made at the first time point in each experiment as a baseline.Using these observations, we compute a mean expression vector µ and an empirical covariance matrix Σ.Then, for observations made at time t, of isolate s and of replicate r, denoted by x t,s,r , the distance d t,s,r > 0 from the baseline distribution (as summarised by µ and Σ) is computed using the Mahalanobis distance.For given, x, µ and Σ, this distance is defined by1 : ( The notion of Mahalanobis distance allows us to take into account the natural variability of each gene via their estimated dispersion and to treat correlated genes in an appropriate way.Rather than treating each gene independently, the Mahalanobis distance allows us to distinguish when the changed expression level of a particular gene is contributing to characterise a diverse state or when, given that other correlated genes behave in a similar fashion, its contribution should be discounted.By limiting the information to a small subset of discriminative genes identified in stage one, we significantly improve the signal to noise ratio and avoid the confounding effects of genes that are not related to the infection process (e.g.housekeeping genes).

Parametric model
We next aimed to investigate whether the timing of the 'peak' Mahalanobis distance might give an insight into the relative timings of certain infection milestones, such as the development of necrosis.There is no guarantee, however, that the timing of observations will directly coincide with the peak distance.Moreover, given observations at a limited number of fixed time points, it is not possible to pinpoint the timing of the peak precisely.Therefore, we seek to estimate the timing of the peak for each isolate, with associated uncertainty, in a way which generalises across experiments that make observations at different times post infection.To achieve this we adopt a relatively simple parametric statistical model in the Bayesian framework.This approach also offers the potential to predict the timing of a novel isolate's peak (and therefore infer its infection characteristics) without observing the whole time series.For a given isolate s observed at time t, the model assumes independent, identically distributed Normal (Gaussian) errors for the logarithm of the Mahalanobis distance d t,s : where m t,s is a piece-wise linear function of time for each isolate and σ captures variability associated with observation error and differences between replicates.The functions m t,s are defined as linear in between four 'knots' which are different for each isolate.Each knot has a corresponding x-axis (time) value, and a yaxis (log-Mahalanobis distance) value.Figure S2 is a visual aid for understanding the structure of this function.For a particular isolate, the first knot in time is fixed at 0 days post infection and shares a log-distance value with the second knot.The second knots, seen as the first break-points following flat periods in Figure S2, therefore represent the time when the expression of the most important genes begin to increase significantly.We will therefore refer to the second knots as the 'point of increase'.The third knots then represent the 'peak distance', and the final (fourth) knots, fixed at 28 days, represent the distance at the 'end point' of the experiment.For a particular isolate, we therefore have two knots ('point of increase' and 'peak distance') where the timings (x-axis coordinates) are largely free parameters.In the Bayesian framework for statistical inference, we must specify 'prior' distributions for all parameters, representing our uncertainty about those parameters in the absence of any data.To be as uninformative as possible, we define the timing of these knots as parameters with uniform prior distributions spanning 0 to 28 days, with only the natural constraint that the peak must occur after the point of increase.We then have three knots ('point of increase', 'peak distance' and 'end point') where the log-distances (y-axis coordinates) are largely free parameters, which we denote by α = α 1,s , α 2,s , α 3,s in order of increasing time.Here we specify that the log-distance (y-axis value) of these knots for different isolates come from a common distribution.This is achieved by treating the α j,s as random effects: α j,s ∼ Normal(ι j , 0.5 2 ); (3) For the point of increase, peak distance and the end point knots, ι 1 , ι 2 , ι 3 define the expected log-distance, respectively.In summary, the timing of the knots is independent across isolates, but the distance is similar.Using this specification, it is possible to fit observations made at arbitrary time points and make predictions of the distance for a particular isolate both in-between and beyond the observed time points.In the Bayesian framework, the observations update the prior distributions, resulting in posterior distributions which quantify our uncertainty in the model parameters.We employ the Markov chain Monte Carlo (MCMC) method (R package nimble [114]) to obtain samples from the joint posterior distribution of all parameters, subject to convergence criteria being met.For any given parameter, we can derive central point estimates (e.g. by computing the mean of the posterior samples) and measures of uncertainty (e.g.95% intervals can be obtained by computing the 2.5% and 97.5% quantiles of the samples).Using the Monte Carlo simulation method, we can then obtain samples from the posterior predictive distributions for missing or new data points.These predictions jointly account for both random variability as quantified by the model and uncertainty in the model parameters.
Figure S2 shows the posterior median expected log-Mahalanobis distance defined by the piecewise linear functions for each isolate, with 95% uncertainty intervals.Dashed lines are used to distinguish the two 'late' onset isolates, Zt10 and 3D1.Notably, although the model is specified as piecewise linear given the positions of the knots, treating these positions as random variables smooths the functions considerably once parametric uncertainty is taken into account.Looking at the estimated functions, the peaks for the two late onset isolates appear to occur later than all the 'early' onset isolates.This can be investigated more clearly by analysing the posterior distributions for the timing of the peak knots, shown in Figure S3.As the number of replicates and time points for each isolate is quite small, these distributions are quite uncertain.Moreover, they appear quite misshapen compared to most analysed posterior distributions, despite MCMC convergence and the number of posterior samples being relatively high.If more data were available, a potential and indeed straight-forward extension to the model would be to directly relate the parameters controlling the timing of the peaks with covariates such as the timing of necrosis, the timing of pycnidia appearance, or the necrosis level.
For seen isolates the values of these covariates would be known, but would be left as missing values for novel isolates.Using the Bayesian approach, these infection outcomes could then be predicted -assuming the relationships between the timing of the peak and these covariates turn out to be reasonably strong -to allow for a richer assessment of a novel isolate's infection capability.

Model validation
Up to this point we have explained our potential for identifying a small subset of important genes and, to uncover relationships between the expression of these genes and infection timings.However, we have so far treated all of the available data as known.To investigate whether we can also predict how quickly a novel isolate will reach infection milestones, we have employed a leave-one-out approach where each of the 7 available isolates is treated as novel in turn.For each isolate the process is as follows: Step 1: Perform gene selection with the random forest classifier using only the 6 seen isolates as data.
Step 2: Calculate Mahalanobis distances for all 7 isolates but using only the 6 seen isolates as inputs for calculating µ and Σ.
Step 3: Fit the parametric model to all 7 isolates twice, first with data for all isolates and all time points (a) and second using only the first two time points for the novel isolate (b).
Withholding the novel isolate from gene selection ensures that any conclusions drawn from the timing of the peak distance relies only on the important genes from the other 6 isolates.Meanwhile, by fitting the parametric model a second time with only the first two time points for each novel isolate, we can investigate the potential of our approach for predicting the timing of the peak distance, and by extension how quickly the infection will develop, without observing the whole time series.In the latter case (b), we treat the distances for the novel isolate at the third and fourth time points as missing values.For these values samples can then be obtained from their respective posterior predictive distributions, available from the fitted Bayesian model.In assessing these predictions, we are primarily interested in: i) the average error of predictions (bias); ii) the average absolute error of predictions (accuracy) and; iii) the reliability of prediction intervals in capturing the true novel values (coverage).For i), when predicting the log() distances at the third time point, the average of all observed values minus all predicted values (posterior means) was 0.20, indicating slight under-prediction overall.When predicting the fourth time point, the average error was -0.08, indicating little bias relative to the scale of the input data.For ii), the average absolute difference between the observed values and predicted values was 0.5 for the third time point and 0.54 for the fourth.This is only moderately larger than the expected absolute error of 0.38 from the model with K = 10 and all isolates and time points observed (Figure S2), as quantified by the posterior mean of σ.When comparing predictions for the third time point to the true novel value, a substantial R 2 = 71% of the variability in the true values is explained by the predictions.For the fourth time point, R 2 is negative, indicating that the model can not effectively predict the Mahalanobis distance of novel isolates at the end point of the experiment, relative to other isolates.For iii), the 95% prediction intervals contained the true novel values approximately 95% of the time for both time points.Taken together, these metrics suggest that predictive point estimates for novel isolates and time points are reasonably accurate (relative to the expected error in the model with all isolates and time points observed) and that 95% prediction intervals are likely to contain the unknown true log distances.

Bioinformatic analysis of selected genes
To further analyse the genes that change expression significantly and can be used to distinguish between early-and late-onset of infection, we performed pathway analysis and examined orthologs from other species.Pathway analysis was performed using FungiDB [92].Briefly, we used FungiDB's BLASTn search functionality to search the sequences of the most important genes against the Z. tritici IPO323 sequence to convert between IDs used in this study and those used by FungiDB.Next, we used the FungiDB search strategies to perform pathway analysis on these IDs to identify enriched GO terms, KEGG annotations and MetaCyc pathways.Annotations and pathways were deemed to be significantly enriched with a Benjamini and Hochberg [111] corrected p-value < 0.05.
In order to examine orthologs we chose 3 species, another Dothideomycete species Cladosporium fulvum, Fusarium graminacearum a wheat pathogen from the family Nectriaceae and Aspergillus fumigatus a wellstudied model filamentous fungi.For C. fulvum, we used the MycoCosm database [91] BLAST functionality to identify C. fulvum orthologs to the selected Z. tritici genes.We next assembled the GO, InterPro and Eukaryotic Orthologous Groups (KOG) for the identified orthologs.We used FungiDB [92] to identify orthologs and conduct pathway enrichment analysis for F. graminacearum and A. fumigatus.As before, we used the FungiDB search strategies to perform pathway analysis on F. graminacearum and A. fumigatus orthologs to identify enriched GO terms, KEGG annotations and MetaCyc pathways.Annotations and pathways were deemed to be significantly enriched with a Benjamini and Hochberg [111] corrected p-value < 0.05.

Substantive limitations of the results for distinguishing early versus late infection onset isolates
As mentioned in the results section, we acknowledge that the correct ordering of the peak distances, such that the two "late-onset" isolates have later peaks than the 5 "early-onset" isolates, could have arisen from random ordering, with a probability just under 5%.Other substantive limitations include: • For the Haueisen et al. data, we derived dpi values (days post infection) for the infection stages by taking the mid points of the ranges they reported.We did not explore the sensitivity of our results to this choice.
• We applied our methods to genetic expression data subject to a log(RPKM + 1) transformation.We did not explore the sensitivity of our results to alternative transformations.

Figure 1 .
Figure 1.The log-Mahalanobis distance of gene expression, relative to the initial time point, for each time point, replicate and isolate.This analysis uses only the K = 10 most important genes identified by the random forest classifier that identifies those genes that change their expression significantly between the initial time point and the time point closest to necrosis development.

Figure 2 .
Figure 2.Posterior median expected log-Mahalanobis distance with 95% uncertainty intervals, for each isolate.Dashed lines emphasise the two late onset isolates Zt10 (purple) and 3D1 (pink), whose peaks appear delayed compared to the early onset isolates.

Figure 3 .
Figure 3.Posterior median expected log-Mahalanobis distance with 95% uncertainty intervals, where the fit for each isolate is from the selection and model run where that isolate was treated as novel with all four time points and using the K = 10 most important genes.Different colours represent the isolates and the dashed lines emphasise the two late onset isolates Zt10 (purple) and 3D1 (pink).Noticeably the peaks in posterior median expected log-Mahalanobis distance are delayed in the late onset isolates.

Figure 4 .
Figure 4. Posterior median predictions of the log-Mahalanobis distance at the 3rd and 4th time points of the novel isolates (which were treated as missing values), with 95% uncertainty intervals, using the K = 10 most important genes.Different colours represent the isolates.

Figure 5 .
Figure 5. Posterior median expected log-Mahalanobis distance with 95% uncertainty intervals, where the fit for each isolate is from the selection and model run where that isolate was treated as novel using only the first two time points and the K = 10 most important genes.Different colours represent the isolates and the dashed lines emphasise the two late onset isolates Zt10 (purple) and 3D1 (pink).Noticeably the peaks in posterior median expected log-Mahalanobis distance are delayed in the late onset isolates.
for the data from Haueisen et al. and Palma-Guerrero et al. respectively.
Briefly, we merged the Haueisen et al. and Palma-Guerrero et al. data sets on common gene IDs and used edgeR's default methods for filtering and normalisation of count data.We defined groups to compare the first infection timepoint (defined as infection stage A and 7 dpi in the Haueisen et al. and Palam-Guerrero et al. data respectively) to the timepoint of 'peak' Mahalanobis distance (defined as infection stage C and 14 dpi in the Haueisen et al. and Palam-Guerrero et al. data respectively -see below) in those isolates classified as early infecting.The exact test was used to identify significant differential expression with p − values corrected by

Table 2 .
The 20 most important genes according to median rank across the 7 leave-one-out experiments, as well as their lowest, mean (rounded) and highest ranks.