Multi-label pathway prediction based on active dataset subsampling

Machine learning methods show great promise in predicting metabolic pathways at different levels of biological organization. However, several complications remain that can degrade prediction performance, including inadequately labeled training data, missing feature information, and inherent imbalances in the distribution of enzymes and pathways within a dataset. This class imbalance problem is commonly encountered by the machine learning community when the proportion of instances over class labels within a dataset is uneven, resulting in poor predictive performance for underrepresented classes. Here, we present leADS, a multi-label learning method based on active dataset subsampling that leverages the idea of subsampling points from a pool of data to reduce the negative impact of training loss due to class imbalance. Specifically, leADS performs an iterative process to: (i) construct an acquisition model in an ensemble framework; (ii) select informative points using an appropriate acquisition function; and (iii) train on selected samples. Multiple base learners are implemented in parallel where each is assigned a portion of labeled training data to learn pathways. We benchmark leADS using a corpus of 10 experimental datasets manifesting diverse multi-label properties used in previous pathway prediction studies, including manually curated organismal genomes, synthetic microbial communities, and low complexity microbial communities. Resulting performance metrics equaled or exceeded previously reported machine learning methods for both organismal and multi-organismal genomes while establishing an extensible framework for navigating class imbalances across diverse real world datasets.

Availability and implementation: The software package and installation instructions are published at github.com/leADS.

Contact: shallam@mail.ubc.ca


Introduction
Metabolic pathways are composed of interconnected reactions catalyzed by enzymes. The set of reactions within and between cells comprises a reactome. Pathways and reactomes can be predicted from annotated genes encoded within organismal or multi-organismal genomes. This pathway prediction problem presents a fundamental challenge in biology that connects hereditary information contained within the DNA of living things (i.e., genotype) to its expression and activity at the individual, population and community levels of organization (i.e., phenotype) ([29,17,23]). The rise of increasingly powerful sequencing technologies has motivated corresponding innovations in the methods used to predict metabolic pathways at different levels of genome complexity and completion ([1]). These encompass rule-based or heuristic methods including PathoLogic ([22]) and MinPath ([41]), and more recently, machine learning (ML) methods including PtwML ([10]), mlLGPR ([5]) and triUMPF ([4]). While ML methods overcome issues of probability and scale associated with rule-based methods, several complications remain that can degrade prediction performance, including inadequately labeled training data, missing feature information, and inherent imbalances in the distribution of pathways within a dataset.
The class imbalance problem arises when the proportion of instances over class labels within a dataset is uneven, resulting in poor predictive performance (i.e., increased training loss) for underrepresented classes. Such skewed distributions are encountered across a wide range of real world datasets, from environmental monitoring and fraud detection to medical diagnosis and facial recognition ([19]). In the case of metabolic pathways, a similar problem exists where certain pathways are more common than others because they conduct core metabolic functions conserved across the tree of life. These functions are overrepresented in labeled training data relative to more niche-defining or accessory metabolic functions. Basher and colleagues described an information hierarchy based on the BioCyc curation-tiered structure of Pathway/Genome Databases (PGDBs) ([8]) that traverses four tiers of genome completion and complexity (T1-4) in descending order of curation and functional validation ([5]). Labeled pathways associated with T1-3 genomes were incorporated into synthetic datasets and used to train supervised ML pathway prediction methods ([5,4]). During the benchmarking process, class imbalances were recognized that limited recovery of underrepresented pathways in the training data. For example, labeled T2-3 pathways follow a power law distribution (Fig. 1) where 30-35% of pathways were observed to occur in fewer than 25 PGDBs within the BioCyc collection. This class imbalance extended to closely related genotypes (e.g., E. coli) with potential implications for resolving metabolic differences between symbiotic, commensal or pathogenic strains.

Different class imbalance learning methods have been developed that take skewed distributions into account, including sampling, algorithm modification and ensemble learning ([24]). Sampling methods attempt to balance input data prior to training through random under-sampling, one-sided selection, or a combination of over-sampling less common classes while under-sampling more common ones ([11]). For PGDBs with numerous shared pathways, noisy class labels or missing pathway information (e.g. T2-4), subsampling presents a more tractable solution than oversampling. Two distinct modes of subsampling have been developed that are effective under different training scenarios: (i) incremental learning from easier to harder examples, and (ii) hard example mining. While the incremental mode may be effective when learning from noisy data by gradually removing hard examples ([6,30]), sampling hard examples directly can accelerate the learning process ([36,26]). Given that BioCyc (T2 & T3) contains more than 9000 instances (corresponding to over 1500 organismal genomes), hard example mining is expected to reduce training loss resulting from pathway class imbalance.
Here we describe leADS, multi-label learning based on active dataset subsampling, which builds on prior work in active dataset subsampling (ADS) ([9]) by incorporating an ensemble of multi-label learners ([42]) to perform hard example mining. Specifically, leADS executes, in parallel, a group of multi-label base learners (constituting an ensemble), where each is allocated to learn from a portion of randomly selected samples ([40]). Each member of the ensemble then selects data according to predefined choices of: (i) sample size and (ii) an acquisition function. Samples from all base learners are aggregated for subsequent rounds of learning.
To verify the effectiveness of leADS, we conducted three experimental studies: parameter sensitivity, scalability, and metabolic pathway prediction. Overall, leADS significantly improved pathway prediction results in relation to other inference methods including MinPath ([41]), PathoLogic ([22]), mlLGPR ([5]) and triUMPF ([4]) on a corpus of 10 organismal and multi-organismal datasets including T1 PGDBs from the BioCyc collection, symbiont genomes encoding distributed metabolic pathways for amino acid biosynthesis ([27]), genomes used in the Critical Assessment of Metagenome Interpretation (CAMI) initiative ([33]), and whole genome shotgun sequences from the Hawaii Ocean Time-series (HOTS) ([38]).

Definitions and Problem Formulation
Here the default vector is considered to be a column vector and is represented by a boldface lowercase letter (e.g., $\mathbf{x}$), while matrices are represented by boldface uppercase letters (e.g., $\mathbf{X}$). A subscript $i$ attached to a matrix, such as $\mathbf{X}_i$, indicates the $i$-th row of $\mathbf{X}$, which is a row vector, while a subscript on a vector, $x_i$, denotes the $i$-th cell of $\mathbf{x}$. An occasional superscript, $\mathbf{X}^{(i)}$, indexes a sample or the current epoch during a learning period. With these notations in mind, we introduce information integral to the problem formulation, starting by defining the multi-label data.
Definition 2.1 (Multi-label dataset). A multi-label dataset is represented as $\mathcal{S}_A = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) : 1 \leq i \leq n\}$, where $\mathbf{x}^{(i)}$ is a vector indicating the abundance information corresponding to enzymatic reactions. An enzymatic reaction is denoted by $c$, which is an element of a set $\mathcal{E} = \{c_1, c_2, \dots, c_r\}$ of $r$ possible enzymatic reactions; hence, the vector $\mathbf{x}^{(i)}$ has size $r$, and its $c$-th cell stores the abundance of enzymatic reaction $c$ for example $i$. Correspondingly, $\mathbf{y}^{(i)}$ is a pathway label vector of size $t$, representing the total number of pathways derived from a set of universal metabolic pathways $\mathcal{Y}$. The matrix forms of $\mathbf{x}^{(i)}$ and $\mathbf{y}^{(i)}$ are $\mathbf{X}$ and $\mathbf{Y}$, respectively.
Both $\mathcal{E}$ and $\mathcal{Y}$ can be retrieved from trusted sources, such as KEGG ([21]) or MetaCyc ([7]). Although the input space is assumed to be encoded as an $r$-dimensional vector, symbolized as $\mathcal{X} = \mathbb{R}^r$, through feature engineering it can be represented as $\mathcal{X} = \mathbb{R}^d$.
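For illustration, a minimal sketch of how such a dataset might be laid out in memory (the shapes use the $r = 3650$ enzymatic reactions and $t = 1512$ pathways reported later in the text; the random values are placeholders, not real data):

```python
import numpy as np

n, r, t = 100, 3650, 1512  # examples, enzymatic reactions (ECs), pathways

# X[i, c] holds the abundance of enzymatic reaction c in example i.
X = np.random.randint(0, 5, size=(n, r))

# Y[i, j] = 1 if pathway j is labeled present in example i, else 0.
Y = np.random.randint(0, 2, size=(n, t))
```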
Problem Statement. Given a multi-label dataset $\mathcal{S}_A$, the goal is to select a subset of $\mathcal{S}_A$, denoted by $\mathcal{S}_{per\%}$, where $per\%$ is a prespecified hyperparameter indicating the proportion of samples to be chosen from $\mathcal{S}_A$, such that learning on $\mathcal{S}_{per\%}$ incurs a predictive score similar to (or better than) training on the full multi-label dataset $\mathcal{S}_A$.

The leADS Method
In this section, we describe the leADS components: (i) building an acquisition model, (ii) active dataset subsampling, and (iii) learning on the reduced subsampled data. These three steps interact with each other in an iterative process, as illustrated in Fig. 2. At the very first iteration, a set $\mathcal{S}^{0}_{per\%}$ is initialized with randomly selected data. At the next iteration $q$, instead of re-initializing $\mathcal{S}^{q}_{per\%}$ with randomly selected samples, the data $\mathcal{S}^{q-1}_{per\%}$ collected from the previous iteration $q-1$ is used, constituting a build-up scheme implemented in many active learning methods ([9]). This process is repeated until the maximum number of rounds $\tau$ is reached.
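A runnable sketch of this loop, with scikit-learn's one-vs-rest logistic regression standing in for the base learners and entropy standing in for the acquisition function (both choices, and all helper names, are illustrative rather than the leADS implementation; the acquisition functions themselves are detailed in Section 3.2):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def mean_entropy(ensemble, X):
    """Acquisition stand-in: entropy of the ensemble-averaged probabilities."""
    P = np.mean([m.predict_proba(X) for m in ensemble], axis=0)
    P = np.clip(P, 1e-12, 1 - 1e-12)
    return -(P * np.log(P)).sum(axis=1)

def leads_loop(X, Y, per=0.3, tau=3, g=3, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_pick = int(per * n)
    selected = rng.choice(n, size=n_pick, replace=False)   # S^0_per%
    ensemble = [OneVsRestClassifier(LogisticRegression(max_iter=200))
                for _ in range(g)]
    for q in range(tau):
        for member in ensemble:
            # each member trains on a random portion of the current subset
            portion = rng.choice(selected, size=max(2, n_pick // 2),
                                 replace=False)
            member.fit(X[portion], Y[portion])
        # score all samples and carry the top per% into round q + 1
        selected = np.argsort(-mean_entropy(ensemble, X))[:n_pick]
    return selected
```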

Building an Acquisition Model
Given $\mathcal{S}_A$, the objective of this step is to estimate the posterior predictive uncertainty for a pathway $y_j$ given a new test point $\mathbf{x}^*$:

$$p(y_j \mid \mathbf{x}^*, \mathcal{S}_A) = \int p(y_j \mid \mathbf{x}^*, \Theta_j)\, p(\Theta_j \mid \mathcal{S}_A)\, d\Theta_j \quad (3.1)$$

where $\Theta \in \mathbb{R}^{t \times r}$ denotes the pathway parameters. Notice that Eq. 3.1 involves marginalization over the $\Theta_j$ parameters, which is hard to compute ([28]). One way to mitigate this issue is to approximate the above equation using Monte Carlo (MC) techniques ([24]) by constructing an ensemble, denoted by $\mathcal{E}$, consisting of $g \in \mathbb{Z}_{\geq 1}$ models (Fig. 2c), where each generates samples according to the following formula:

$$p(y_j \mid \mathbf{x}^*, \mathcal{S}_A) \approx \frac{1}{g} \sum_{s=1}^{g} p(y_j \mid \mathbf{x}^*, \Theta_j^s) \quad (3.2)$$

where $\Theta^s$ is sampled from $q(\Theta^s)$, which is considered to be in the same distribution family as the true posterior $p(\Theta_j^s \mid \mathcal{S}_A)$. The parameters $\Theta^s$ for the $s$-th model can be estimated according to the multi-label 1-vs-All approach ([42]).
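A worked toy example of the MC approximation in Eq. 3.2: with $g = 3$ members and $t = 4$ pathways, the posterior predictive for a test point is simply the member-wise mean (the probability values here are invented for illustration):

```python
import numpy as np

# per-member predictive probabilities p(y_j = 1 | x*, Theta^s), shape (g, t)
member_probs = np.array([[0.9, 0.2, 0.6, 0.1],
                         [0.8, 0.3, 0.4, 0.2],
                         [0.7, 0.1, 0.9, 0.1]])

# Eq. 3.2: average over the g sampled parameter sets
posterior_predictive = member_probs.mean(axis=0)
print(posterior_predictive)  # [0.8, 0.2, 0.6333..., 0.1333...]
```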
Although the computed MC error is expected to decrease by incorporating more samples and members in $\mathcal{E}$, label correlation increases computational complexity during training and pathway prediction (see Section 6.2). Moreover, a single multi-label learner (Fig. 3a) suffers from generalization error due to overfitting despite being able to exploit label correlations. In contrast, the ensemble learning method (Fig. 3b) is robust given a group of multi-label base learners that are both accurate and diverse (with regard to the allocated samples), potentially reducing overfitting.

Figure 2: A schematic diagram indicating the leADS workflow. Using a multi-label pathway dataset (a), leADS randomly selects samples at the very first iteration (b), then builds g members of an ensemble (c), where each is trained on a randomly selected portion of the training set. Next, leADS applies an acquisition function (d), based on either entropy, mutual information, variation ratios, or normalized propensity scored precision at k, to select per% subsamples. Following subsample selection, leADS performs parallel training steps (e). The process (b-e) is repeated τ times (f), where during each iteration per% samples are used in addition to another set of samples for training. If the current iteration q reaches the desired number of rounds τ, training is terminated and the final per% results are presented (g).

Subsampling Dataset
During this step, a subset of $\mathcal{S}_A$, denoted as $\mathcal{S}^{q-1}_{per\%} \subseteq \mathcal{S}_A$, is picked for each member in $\mathcal{E}$ at iteration $q-1$ using an acquisition function $f: \mathbf{x} \rightarrow \mathbb{R}$, where $per\%$ is a pre-specified threshold indicating the proportion of samples to be chosen from $\mathcal{S}_A$.

Four acquisition functions that incorporate the predictive uncertainty distribution from the previous step are described below: entropy, mutual information, variation ratios, and normalized PSP@k. For each function, we retrieve the top per% samples having high acquisition (or uncertainty) values.

1. Entropy (H) ([34]). This function measures the uncertainty of a sample given the predictive distribution of that sample:

$$H(\mathbf{x}^{(i)}) = -\sum_{j=1}^{t} p_j \log p_j \quad (3.3)$$

where $\mathbf{p}$ is a vector of predictive probabilities over $t$ pathways.

2. Mutual information (M) ([37]). This function looks for low mutual information between the $g$ models, encouraging samples with high disagreement to be selected during the data acquisition process:

$$M(\mathbf{x}^{(i)}) = H(\mathbf{x}^{(i)}) - \frac{1}{g}\sum_{s=1}^{g} H^s(\mathbf{x}^{(i)}) \quad (3.4)$$

where $H^s$ denotes the entropy obtained from an individual member of $\mathcal{E}$ for a sample before marginalization. Since entropy is always positive, the maximum possible value of $M$ is $H$. However, when the models make similar predictions, $\frac{1}{g}\sum_{s=1}^{g} H^s \rightarrow H$, so $M \rightarrow 0$, its minimum value ([9]). Note that this formula is similar to multi-label negative correlation learning ([35]), which estimates the pairwise negative correlation of each learner's error with respect to the errors of the other members in $\mathcal{E}$.

3. Variation ratios (V) ([14]). This function measures the number of members in $\mathcal{E}$ that disagree with the majority vote for a sample according to the desired pathway size $k$, where larger values indicate higher uncertainty:

$$V(\mathbf{x}^{(i)}) = 1 - \frac{f_{\text{mode}}}{g} \quad (3.5)$$

where $f_{\text{mode}}$ is the frequency of the modal prediction, so $V$ corresponds to the disagreement over $k$ pathways across the $g$ models, and $k \in \mathbb{Z}_{>0}$ is a pre-specified number of pathways considered in computing the mode operation.
4. Normalized propensity scored precision at k (nPSP@k). This is a modified version of PSP@k ([20]), which measures the average precision of the top $k$ relevant pathways for an instance $i$, where larger values indicate less uncertainty:

$$nPSP@k(\mathbf{x}^{(i)}) = \text{Norm}\left( \frac{1}{k} \sum_{j \in \text{rank}_k(\mathbf{p})} \frac{y_j^{(i)}}{ps_j} \right) \quad (3.6)$$

where Norm(.) scales the score within [0, 1], $\mathbf{p}$ is a vector of predictive probabilities over $t$ pathways, $\text{rank}_k(\mathbf{p})$ returns the indices of the $k$ largest values in $\mathbf{p}$ ranked in descending order, $k \in \mathbb{Z}_{>0}$ is a hyperparameter, and $ps_j$ is the propensity score for the $j$-th pathway, computed from $n_j$, the number of positive training instances for pathway $j$. In the context of extreme multi-label problems, PSP@k was used to derive an upper bound for missing/misclassified labels ([39]), and is reported to be a good performance metric for long-tail distributions in which a significant portion of labels are tail labels ([31,2]).
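A minimal sketch of how these four scores might be computed from stacked ensemble probabilities. The tensor layout (`probs` of shape `(g, n, t)` for g members, n samples, t pathways), the precomputed propensity vector `ps`, the min-max choice for Norm(.), and the per-pathway reading of the mode operation in V are all assumptions for illustration:

```python
import numpy as np

def entropy(p):                                   # Eq. 3.3
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p)).sum(axis=1)           # one score per sample

def mutual_information(probs):                    # Eq. 3.4
    marginal = entropy(probs.mean(axis=0))
    members = np.stack([entropy(p) for p in probs])
    return marginal - members.mean(axis=0)

def variation_ratios(probs, k):                   # Eq. 3.5
    g, n, t = probs.shape
    votes = np.zeros((g, n, t), dtype=int)        # 1 if pathway in member's top-k
    topk = np.argsort(-probs, axis=2)[:, :, :k]
    np.put_along_axis(votes, topk, 1, axis=2)
    pos = votes.sum(axis=0)                       # members voting "in top-k"
    modal = np.maximum(pos, g - pos)              # majority count per pathway
    return (1.0 - modal / g).mean(axis=1)         # mean disagreement per sample

def npsp_at_k(p_mean, Y, ps, k):                  # Eq. 3.6
    topk = np.argsort(-p_mean, axis=1)[:, :k]     # rank_k(p)
    rows = np.arange(p_mean.shape[0])[:, None]
    psp = (Y[rows, topk] / ps[topk]).mean(axis=1)
    lo, hi = psp.min(), psp.max()                 # Norm(.): min-max to [0, 1]
    return (psp - lo) / (hi - lo) if hi > lo else np.zeros_like(psp)
```

Since larger nPSP@k values indicate less uncertainty, a subsampler would keep the lowest-scoring samples under this function, whereas for H, M, and V it keeps the highest.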

Training on the Reduced Dataset
As described above, each member in $\mathcal{E}$ is assigned to train on randomly selected samples from $\mathcal{S}^{q-1}_{per\%}$, which is expected to contain hard examples that are difficult to learn and classify. The process is repeated $\tau$ times, where during each iteration the top per% samples are selected based on their acquisition values for the next round of training.

Optimization and Prediction
The objective function associated with Eq. 3.2 can be solved by decomposing it into $t$ independent binary classification problems according to the multi-label 1-vs-All approach, enabling parallel training. Consider the optimization for a member $s$:

$$\min_{\Theta^s} \; \sum_{i=1}^{n} \sum_{j=1}^{t} \log\left(1 + e^{-y_j^{(i)} (\Theta_j^s)^\top \mathbf{x}^{(i)}}\right) + \lambda \|\Theta^s\|_{2,1}^2 \quad (4.1)$$

where $\|\cdot\|_{2,1}$ is the L2,1 regularization term, i.e., the sum of the Euclidean norms of the columns of $\Theta$. The L2,1 norm imposes sparsity on the model's parameters to minimize the negative effect of label correlations, and $\lambda \in \mathbb{R}_{>0}$ governs the relative contributions of the L2,1 and log-loss terms. Although the joint formula in Eq. 4.1 is convex, the logistic log-loss function still poses a problem in that it admits no analytical solution. To address this problem, we apply mini-batch gradient descent ([25]), which begins with an initial random guess for the leADS parameters and performs iterative updates to each individual parameter $\Theta_j^s \in \Theta^s$ to minimize Eq. 4.1, following the gradient of the regularized logistic log-loss. For prediction, we apply a cut-off threshold $\xi \in \mathbb{R}_{\geq 0}$ to retain only those pathways of $\mathbf{x}^{(i)}$ having probability values higher than $\xi$.
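A minimal sketch of this mini-batch update, assuming $\Theta^s$ is stored as a $(d \times t)$ matrix, labels are in $\{0, 1\}$, and the unsquared L2,1 gradient is used for simplicity (hyperparameter defaults follow Section 5.2; this is an illustration, not the leADS implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l21_grad(Theta, eps=1e-12):
    """Gradient of the L2,1 norm: each column divided by its Euclidean norm."""
    norms = np.linalg.norm(Theta, axis=0, keepdims=True)
    return Theta / np.maximum(norms, eps)

def fit_member(X, Y, lam=10.0, lr=1e-4, batch=50, epochs=3, seed=0):
    """1-vs-All logistic regression with L2,1 regularization (Eq. 4.1 sketch).
    The t columns of Theta are independent binary problems, so they could
    equally be optimized in parallel across workers."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Theta = rng.normal(scale=0.01, size=(d, Y.shape[1]))
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            P = sigmoid(X[idx] @ Theta)                 # (b, t) predictions
            grad = X[idx].T @ (P - Y[idx]) / len(idx)   # log-loss gradient
            Theta -= lr * (grad + lam * l21_grad(Theta))
    return Theta

# prediction: keep pathways whose probability exceeds the cutoff xi = 0.5
# Y_hat = sigmoid(X_test @ Theta) > 0.5
```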

Experimental Setup
In this section, we describe the experimental framework used to demonstrate leADS pathway prediction performance across multiple datasets spanning the genomic information hierarchy ([5]). leADS was written in Python (v3). Unless otherwise specified, all tests were conducted on a Linux server using 10 cores of an Intel Xeon E5-2650 CPU.

Parameter Settings
We used pathway2vec ([3]) to obtain pathway and EC features using "crt" as the embedding method with the following settings: the number of memorized domains was 3, the explore and in-out hyperparameters were 0.55 and 0.84, respectively, the number of sampled path instances was 100, the walk length was 100, the embedding dimension size was m = 128, the neighborhood size was 5, the number of negative samples was 5, and the MetaCyc configuration used was "uec", indicating links among ECs were trimmed. The obtained features were used to leverage correlations among ECs and pathways for training leADS (see Supp. Section 4).

We then trained leADS using the following default settings (unless otherwise mentioned): the learning rate was 0.0001, the batch size was 50, the number of epochs was 3, the number of models was g = 3, the proportion of samples (per%) to be selected was 30%, the number of subsampled pathways for each member was 500, and the cutoff threshold ξ for predictions was 0.5. For the regularization hyperparameter λ, we performed 10-fold cross-validation on BioCyc T2 & T3 data and found λ = 10 to be optimum according to results obtained on golden T1 and CAMI datasets.

The impact of k on leADS performance on the CAMI dataset, varying k ∈ {5, 10, 15, 20, 30, 40, 50, 70, 90, 100} with variation ratios and nPSP as acquisition functions, is demonstrated in Fig. 4a, while the effect of the four acquisition functions and random sampling when varying sample size according to per% ∈ {30%, 50%, 70%} is shown in Fig. 4b.
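A sketch of how the 10-fold cross-validation over λ described above could be reproduced (the candidate grid is an assumption, and `fit_member` refers to the optimization sketch in Section 4):

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

def select_lambda(X, Y, candidates=(0.01, 0.1, 1.0, 10.0, 100.0), folds=10):
    best_lam, best_f1 = None, -np.inf
    for lam in candidates:
        scores = []
        kf = KFold(n_splits=folds, shuffle=True, random_state=0)
        for train_idx, val_idx in kf.split(X):
            Theta = fit_member(X[train_idx], Y[train_idx], lam=lam)
            pred = (1 / (1 + np.exp(-(X[val_idx] @ Theta))) > 0.5).astype(int)
            scores.append(f1_score(Y[val_idx], pred, average="samples",
                                   zero_division=0))
        if np.mean(scores) > best_f1:
            best_lam, best_f1 = lam, np.mean(scores)
    return best_lam
```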

Experimental Results and Discussion
To verify the effectiveness of leADS, we conducted three experimental studies: parameter sensitivity, scalability, and metabolic pathway prediction.

Parameter Sensitivity
Experimental setup. In this section, the impact of two user-defined hyperparameters (k and per%) was evaluated on the CAMI dataset using the acquisition functions described in Section 3.2. In the case of k, values in {5, 10, 15, 20, 30, 40, 50, 70, 90, 100} were tested in relation to the pathway size for variation ratios in Eq. 3.5 or the top k relevant pathways for nPSP in Eq. 3.6. In the case of per%, subsampling proportions in {30%, 50%, 70%} were tested by selecting BioCyc T2 & T3 data at random. For variation ratios and nPSP, the values of k were fixed based on the optimum results obtained from the previous experiment. All other hyperparameters were set according to the configurations described in Section 5.2, and results were reported using average F1 scores.

Experimental results. Fig. 4a shows the impact of k for both the variation ratios and nPSP acquisition functions. Although both functions have similar disagreement metrics, the optimum performance for variation ratios is at k = 15 while the optimum for nPSP is at k = 40. This discrepancy in k values likely results from the effects of subsampling pathways and examples that are allocated randomly to each member in E. After several rounds of experiments, we found k = 50 to be a suitable common setting for both variation ratios and nPSP. Next, we examined the effect of per% on leADS performance using the four acquisition functions and random sampling, fixing k = 50 for variation ratios and nPSP. From Fig. 4b, it is evident that leADS performance generally improves by including more samples for each acquisition function, although the entropy function yielded only a marginal improvement. In contrast, random sampling had no performance benefit across the sample size range tested.

Scalability to the Ensemble Size
Experimental setup. In this section, the time complexity of training was determined as the ensemble size varied over g ∈ {1, 2, 3, 5, 10, 15, 20, 50}. Performance was evaluated on the CAMI dataset as described above using the average F1 score for each configuration of g. per% was set to 30% of BioCyc T2 & T3 data for training under the four acquisition functions. In the case of random sampling, leADS was trained on 30% of randomly selected BioCyc T2 & T3 data. Performance was expected to improve proportionally to the member size in E (due to the dual effects of pathways and examples being allocated randomly to each base learner) with a concomitant increase in computational time. See Section 5.2 for configuration settings.

Experimental results. Results in Fig. 5a are consistent with expectations, with the gradual inclusion of more members in E improving leADS performance. Although random sampling reduced time complexity when compared to the four acquisition functions under all model size configurations, it resulted in the lowest performance (Fig. 5b). Among the four acquisition functions, variation ratios required an additional mode operation, contributing to increased training time. Based on these results, setting the model size between 3 and 10 while increasing the pathway subsampling size accordingly (e.g. 2000 for 10 members) is recommended to improve prediction outcomes and reduce both computational complexity (training and inference) and parameter storage needs.

Metabolic Pathway Prediction
Experimental setup. In this section, pathway prediction performance was evaluated using the parameter settings described in Section 5.2. Three training configurations were tested: (i) per% = 70% under the four acquisition functions; (ii) random sampling corresponding to 70% of BioCyc T2 & T3 selected at random; and (iii) a full configuration where all BioCyc T2 & T3 data were utilized without subsampling. After training, pathway prediction results were reported on golden T1 data using four evaluation metrics: Hamming loss, average precision, average recall, and average F1 score. leADS performance was compared to four extant pathway prediction algorithms on the T1 data: (i) MinPath v1.2 ([41]); (ii) PathoLogic v21 ([22]); (iii) mlLGPR ([5]); and (iv) triUMPF ([4]). In addition, we compared leADS performance to other methods on multi-organismal datasets including the symbiont, CAMI low complexity, and HOTS datasets. For all experiments, the number of epochs was 10, the member size was g = 3, the subsampled pathway size was 2000, and k was 50 (for variation ratios and nPSP). See Section 5.2 for additional configuration settings.

Experimental results. As shown in Table 1, leADS achieved competitive performance compared to other pathway inference algorithms based on average F1 scores. For each column in Table 1, a boldface number represents the best evaluation metric score while an underlined number indicates the best score among leADS variants. Among the four acquisition functions, leADS+nPSP resulted in the highest average F1 scores for EcoCyc (0.8874) and HumanCyc (0.8333), which are also the highest scores among all models tested. Consistent with previous sections, random sampling resulted in the poorest overall performance scores. Interestingly, leADS+Full in Table 1 was on par with random sampling, reinforcing the idea that BioCyc T2 & T3 contain noisy data that hampered proper estimation of leADS coefficients. By subsampling informative data in an ensemble-based framework, leADS was able to reduce noise and improve prediction performance on golden T1 data.
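The four evaluation metrics can be computed with standard routines; a minimal sketch, assuming the sample-averaged variants of precision, recall, and F1 (one common convention for multi-label evaluation):

```python
from sklearn.metrics import f1_score, hamming_loss, precision_score, recall_score

def evaluate(Y_true, Y_pred):
    """Hamming loss (lower is better) plus sample-averaged precision/recall/F1."""
    return {
        "hamming_loss": hamming_loss(Y_true, Y_pred),
        "avg_precision": precision_score(Y_true, Y_pred, average="samples",
                                         zero_division=0),
        "avg_recall": recall_score(Y_true, Y_pred, average="samples",
                                   zero_division=0),
        "avg_f1": f1_score(Y_true, Y_pred, average="samples", zero_division=0),
    }
```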
To evaluate leADS performance on metabolic pathways distributed between organisms, we used the reduced genomes of the mealybug symbionts Moranella (GenBank NC_015735) and Tremblaya (GenBank NC_015736) ([27]). The two symbiont genomes in combination encode intact biosynthetic pathways for 9 essential amino acids. PathoLogic, mlLGPR, triUMPF, and leADS were used to predict pathways on the individual symbiont genomes and on a composite dataset consisting of both genomes, and the resulting amino acid biosynthetic pathway distributions were determined (Supp. Fig. 1). PathoLogic, triUMPF, and leADS predicted 6 of the expected amino acid biosynthetic pathways on the composite genome while mlLGPR predicted 8 pathways. The phenylalanine biosynthesis (L-phenylalanine biosynthesis I) pathway was excluded from analysis because the associated genes were reported to be missing during the ORF prediction process. All models inferred false positive pathways for the individual symbiont genomes (Moranella and Tremblaya) despite reduced pathway coverage information (mapping enzymes onto the associated 9 amino acid biosynthetic pathways) relative to the composite genome. Although it is possible for leADS to reduce type I error by incorporating taxonomy-based prediction rules, such pruning can also increase false-negative (type II error) pathway predictions in multi-organismal datasets ([18]).
To evaluate performance on more complex multi-organismal genomes, we compared leADS to mlLGPR and triUMPF using the CAMI low complexity dataset ([33]) and to PathoLogic, mlLGPR, and triUMPF using the HOTS dataset ([38]). In the case of CAMI, leADS+nPSP outperformed the other methods, achieving an average F1 score of 0.6214 (Supp. Table 2). In the case of HOTS, leADS+Random, leADS+Full, leADS+H, leADS+M, leADS+V, and leADS+nPSP predicted a total of 60, 67, 63, 68, 67, and 68 pathways among a subset of 180 selected water column pathways ([18]), while PathoLogic, mlLGPR, and triUMPF (using BioCyc v21) inferred 54, 62 and 67 pathways, respectively. These observations indicate that leADS with subsampling improves pathway prediction outcomes by reducing training loss due to pathway class imbalance (Supp. Fig. 10).

Conclusion
In this paper we presented leADS, a novel ensemble-based ML approach for hard example mining that constructs a set of diverse multi-label base learners to jointly improve the subselection of samples and overcome class imbalance during metabolic pathway prediction from genomic sequence information at different levels of complexity and completion. leADS performs an iterative process to: (i) construct an acquisition model in an ensemble framework; (ii) select informative points using an appropriate acquisition function (entropy, mutual information, variation ratios, or normalized PSP@k); and (iii) train on the selected samples.
We evaluated leADS performance using a corpus of experimental datasets manifesting diverse multi-label properties, comparing pathway prediction outcomes to other prediction methods including MinPath ([41]), PathoLogic ([22]), mlLGPR ([5]) and triUMPF ([4]). Resulting performance metrics indicated that leADS equaled or exceeded pathway prediction outcomes on organismal and multi-organismal datasets with increased sensitivity on T1 golden data, indicating that active subsampling can overcome pathway class imbalance. At the same time, it is important to emphasize that the acquisition functions used in subsampling tend to reduce the number of pathways used in training ($\mathcal{Y}$ in Def. 2.1). For example, leADS+H, leADS+M, leADS+V, and leADS+nPSP returned 1380, 1378, 1431 and 1404 distinct pathways, respectively, out of a total of 1512 pathways in BioCyc T2 & T3. This reduction reveals a fundamental limitation of subsampling-based approaches ([11]).
Members of the ensemble in leADS have two important properties: representativeness (each member has a different set of candidate examples) and diversity (each member has different overlapping pathways across examples) ([16]). These properties imply that a member trained on a subset of examples containing a more diverse subset of pathways should be given more weight when predicting those pathways. Unfortunately, leADS does not utilize such weighting, which could be resolved in part by adopting a better voting scheme ([32,15]) or by incorporating an additional learner that integrates weights obtained from all the base learners into global weights ([12,13]). Looking forward, an integrated ensemble or meta-learning framework is needed that can estimate the confidence of multiple training methods to provide an optimal balance between sensitivity and precision when predicting pathways across different levels of genome complexity and completion.

Table 1: Predictive performance of each comparing algorithm on 6 benchmark datasets. leADS+Full: leADS with full data; leADS+Random: leADS with random sampling; leADS+H: leADS with entropy; leADS+M: leADS with mutual information; leADS+V: leADS with variation ratios; leADS+nPSP: leADS with normalized propensity scored precision. For each performance metric, '↓' indicates that a smaller score is better while '↑' indicates that a higher score is better. Values in boldface represent the best performance score while underlined scores indicate the best performance among leADS variants.

Table 2: Predictive performance of mlLGPR with elastic net penalty, triUMPF, and leADS on CAMI low complexity data. leADS+Full: leADS with full data; leADS+Random: leADS with random sampling; leADS+H: leADS with entropy; leADS+M: leADS with mutual information; leADS+V: leADS with variation ratios; leADS+nPSP: leADS with normalized propensity scored precision. Values in boldface represent the best performance score while underlined scores indicate the best performance among leADS variants.

Methods
The engineered input for an example $i$ is obtained by concatenating the original abundance vector with EC embedding features:

$$\tilde{\mathbf{x}}^{(i)} = \mathbf{x}^{(i)} \oplus \mathbf{z}^{(i)}$$

where $\oplus$ indicates the vector concatenation operation, $\mathbf{z}^{(i)} \in \mathbb{R}^m$ is aggregated from $E \in \mathbb{R}^{r \times m}$, the feature matrix of ECs, and $m = 128$. The addition of features results in a dimension of size $r + m$, where $r = 3650$. We expect that by incorporating enzymatic reaction features into the original $r$-dimensional example $\mathbf{x}^{(i)}$, the modified $\tilde{\mathbf{x}}^{(i)}$ summarizes informative characteristics that are useful for pathway prediction.
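A sketch of this concatenation step in numpy; the abundance-weighted sum used to aggregate the EC embedding rows into $\mathbf{z}^{(i)}$ is an assumption, as is the random placeholder for $E$:

```python
import numpy as np

r, m = 3650, 128
E = np.random.randn(r, m)            # placeholder for pathway2vec EC embeddings

def concat_features(x, E):
    """x (r,) abundance vector -> (r + m,) engineered feature vector."""
    z = x @ E                        # aggregate EC embeddings (assumed rule)
    return np.concatenate([x, z])

x = np.random.randint(0, 5, size=r).astype(float)
x_tilde = concat_features(x, E)      # dimension r + m = 3778
```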

Metabolic Pathway Prediction
Here, we investigate the effectiveness of leADS for the pathway prediction task on the mealybug symbiont genomes, the CAMI low complexity dataset, and the HOTS dataset.

Predicted Pathways on Symbiont data
We analyzed pathways predicted from each individual symbiont genome and from their combination. Fig. 1 shows that leADS (with all strategies), triUMPF, and PathoLogic predicted 6 of the expected amino acid biosynthetic pathways on the composite genome while mlLGPR predicted 8 pathways.

Pathway Prediction from CAMI data
In this section, we contrast leADS (using the four acquisition functions and random sampling) with triUMPF and mlLGPR (using an elastic net penalty with reaction and pathway evidence features) on the CAMI low complexity dataset. From Table 2, we observe that leADS+nPSP outperformed the other algorithms with regard to the average F1 score, achieving 0.6214.

Figure 1: Comparative study of predicted pathways for symbiont data between PathoLogic, mlLGPR, triUMPF, and leADS (with random sampling, full data, and four acquisition functions). Black circles indicate pathways predicted by the associated models while grey circles indicate pathways that were not recovered. The size of each circle corresponds to the pathway abundance information.

Predicted Pathways from HOTS data
We used leADS to infer a set of pathways from the HOTS dataset, where leADS+Random, leADS+Full, leADS+H, leADS+M, leADS+V, and leADS+nPSP were able to recover a total of 60, 67, 63, 68, 67, and 68 pathways, while triUMPF, mlLGPR, and PathoLogic detected 67, 62, and 54 pathways, respectively, from 180 previously reported pathways ([4]). The results of leADS are presented in Figs. 2, 3, 4 & 5.

Figures 2-5: Comparative study of predicted pathways for the HOTS 25m, 75m, 110m, and 500m datasets, respectively, between PathoLogic, mlLGPR, triUMPF, and leADS (with random sampling, full data, and four acquisition functions). Black circles indicate pathways predicted by the associated models while grey circles indicate pathways that were not recovered. The size of each circle corresponds to the pathway abundance information.