HiTaC: Hierarchical Taxonomic Classification of Fungal ITS Sequences

Motivation: Fungi play a key role in several important ecological functions, ranging from organic matter decomposition to symbiotic associations with plants. Moreover, fungi naturally inhabit the human microbiome and can be causative agents of human infections. An accurate and robust method for fungal ITS classification is not only desired for the purpose of better diversity estimation, but it can also help us gain a deeper insight into the dynamics of environmental communities and ultimately comprehend whether the abundance of certain species correlate with health and disease. Although many methods have been proposed for taxonomic classification, to the best of our knowledge, none of them consider the taxonomic tree hierarchy when building their models. This in turn, leads to lower generalization power and higher risk of committing classification errors. Results: Here we introduce HiTaC, a robust hierarchical machine learning model for accurate ITS classification, which requires a small amount of data for training and is able to handle imbalanced datasets. HiTaC was thoroughly evaluated with the established TAXXI benchmark, characterizing its ability to correctly classify fungal ITS sequences of varying lengths and a range of identity differences between the training and test data. HiTaC outperforms state-of-the-art methods when trained over noisy data, consistently achieving higher accuracy and sensitivity across different taxonomic ranks, improving accuracy by 5.5 and sensitivity by 6.9 percentage points over top methods in the most noisy dataset available on TAXXI


Introduction
In the last decades, there has been a surge of interest in characterizing the mycoflora of communities. This is due to the fact that fungi are key organisms in several important ecological functions, ranging from nutrient recycling to mycorrhizal associations and decomposition of wood and litter (Hawksworth, 2001). Moreover, fungi naturally inhabit the human microbiome and can be causative agents of infections that may lead to death (Brown et al., 2012;Hoggard et al., 2018), yet very little is known about the fungal microbiota of healthy individuals when compared to the bacterial microbiome (Huffnagle and Noverr, 2013). Furthermore, identifying the presence of fungi can be challenging, as they rarely form structures visible to the naked eye and similar structures are frequently composed of several distinct species (Nilsson et al., 2009). Hence, it became common practice to use DNA sequencing in addition to morphological studies in contemporary mycology (Hibbett, 2007).
One of the approaches that can be employed is whole-genome shotgun sequencing (WGS), where the purpose is to obtain the genetic material of all organisms present in a given microbiome simultaneously (Venter et al., 2004). Nonetheless, in studies where the end goal is only to estimate the diversity or discover the evolutionary distance of organisms in a microbiome, it is cheaper and more convenient to sequence only genetic markers such as 16S rRNA, which is well conserved in all bacteria (Fuhrman, 2012) and also widely used to classify and identify archaea (Kim and Chun, 2014). In mycology, the internal transcribed spacer (ITS) region was adopted as the universal marker for classifying fungi (Schoch et al., 2012). Comparatively, ITS has far more sequences available than fungal WGS, making it a viable cheaper alternative in mycology.
The main characteristic of 16S rRNA that makes it a good genetic marker is the presence of nine hypervariable regions V1-V9, which are flanked by conserved regions that can be used to amplify target sequences using universal primers (Chakravorty et al., 2007). However, 16S rRNA has fewer hypervariable domains in fungi, and consequently it is more appropriate to use the ITS regions which are composed by 18S-26S rRNAs (Schoch et al., 2012). Additionally, the ease with which ITS is amplified makes it an appealing target for sequencing samples from environments as challenging as soil and wood, where the initial quantity and quality of DNA is low (Nilsson et al., 2009).
From a computational perspective, the taxonomic classification of ITS fungal sequences can be performed by using either homology-based or prediction-based techniques. Homology-based approaches require the alignment of a query sequence against all sequences available in a reference database (Jonasson et al., 2002) such as Silva (Pruesse et al., 2007), Ribosomal Database Project (RDP) (Cole et al., 2008) and UNITE (Nilsson et al., 2018). There are several homology-based software currently available for the analysis of environmental samples, such as EzTaxon (Chun et al., 2007), MG-RAST (Meyer et al., 2008), PyNAST (Caporaso et al., 2009), UCLUST (Edgar, 2010), QIIME 2 (Bolyen et al., 2019), MEGAN (Mitra et al., 2011) and FindFungi (Donovan et al., 2018). However, a major limitation of homology-based approaches is the reliance on the availability of homologous sequences in reference databases (Gupta et al., 2014). Alternatively, prediction-based approaches might be useful to overcome this limitation.
One of the first proposed tools to perform prediction-based taxonomic classification was the RDP-Classifier, which uses 8-mer frequencies to train a Naive Bayes classification algorithm (Wang et al., 2007). Improving upon the RDP-Classifier, a novel Naive Bayes classifier with multinomial models was developed to increase the accuracy (Liu and Wong, 2013). This multinomial classifier was recently reimplemented and optimized on Microclass (Liland et al., 2017).
Other prediction-based approaches have also been proposed in the literature, such as the k-Nearest Neighbor (KNN) algorithm (Schloss et al., 2009). Given a query sequence, the KNN algorithm identifies the k-most similar sequences in a database, and uses the taxonomic information from each of those sequences to determine the consensus taxonomy. Q2_SK (Bokulich et al., 2018b) implements several supervised learning classifiers from scikit-learn (Pedregosa et al., 2011), e.g., Random Forest, Support Vector Machines (SVM) and Gradient Boosting. Q2_SK is flexible and allows the selection of features and classifiers.
Despite all these advances made in prediction-based taxonomic classification, to the best of our knowledge, all the tools available perform flat classification, which means that they predict only leaf nodes and do not fully explore the hierarchical structure of the taxonomic tree when building their models (see Fig. 3 for a depiction of the hierarchical structure of the taxonomic tree). The main disadvantage of this approach is that the information about parent-child class relationships in the hierarchical structure is not explored, which may cause flat classifiers to commit more errors than hierarchical approaches (Silla and Freitas, 2011). Therefore, we developed a new hierarchical taxonomic classifier for fungal ITS sequences, denominated HiTaC, which explores the hierarchical structure of the taxonomic tree and improves the accuracy and sensitivity of fungal ITS predictions. Furthermore, HiTaC can be easily installed and is compatible with QIIME2, facilitating its adoption in ITS sequence analysis pipelines.

Algorithm
HiTaC automatically performs feature extraction, building of the hierarchical model, taxonomic classification and reporting of the predictions. Fig. 1 illustrates the algorithmic steps of HiTaC, where it starts by extracting the 6-mer frequencies from both training and test files. Although 6-mer was chosen as the standard value, it can be changed via a command line parameter, however, we did not see much improvement for larger k-mer values that justify the increase in computational complexity. Afterwards, a logistic regression classifier is trained for each parent node in the taxonomic tree. Once the hierarchical model is trained, predictions are made based on the taxonomic tree structure and HiTaC finishes execution reporting the predictions made.

GCA CAT ATT
Our method starts by decomposing the DNA sequences into their constituent k-mers.
A matrix is built, where rows contain the k-mer frequency for the training sequences.

Fungi Ascomycota
A logistic regression classifier is trained for each parent node in the taxonomic hierarchy. The goal is to be able to predict the children labels.
Predictions are reported, based on the taxonomic hierarchy. For example, if the classifier at the root node predicts that a sample is Basidiomycota, then only that branch will be considered for the remaining levels.  we first decompose them into their constituent k-mers. In the example shown in step 01, k = 3, resulting in 5 k-mers. Afterwards, HiTaC computes a matrix containing the k-mer frequency for all training sequences, where each row contains the k-mer frequency for a given sequence and the columns are all possible k-mers for a given k, i.e., all k nucleotide permutations of A, C, G and T. Next, HiTaC trains a logistic regression classifier for each parent node in the taxonomic tree, where the features are the k-mer frequencies computed in step 2 and the labels are the taxonomic annotations of the training sequences. In step 03, only the first 2 levels are shown for brevity. Lastly, HiTaC makes predictions in a top-down approach, based on the taxonomic hierarchy.
HiTaC trains a local classifier for each parent node of the taxonomic tree, using the local hierarchical classifier per parent node implementation provided in the library HiClass (Miranda et al., 2021). For example, a classifier in the root node of the taxonomic tree decides whether a given sequence belongs to the phylum Ascomycota or Basidiomycota. If the sequence is classified as Ascomycota, then a child classifier decides whether the sequence belongs to the class Sordariomycetes, Dothideomycetes, or Leotiomycetes, and similarly for the remaining standard taxonomic ranks, namely order, family, genus and species (see Fig. 3 for an example). This hierarchical model was chosen instead of the local hierarchical classifiers per level or per node because it respects the natural relationships between the hierarchical categories, thus avoiding inconsistent predictions across different taxonomic ranks. A detailed review of different hierarchical models and applications is provided by (Silla and Freitas, 2011). for each level of the hierarchy, using the same k-mer frequency algorithm introduced previously on Fig. 1. In the example shown in step 01, there are only 3 levels in the hierarchy for simplicity. Afterwards, HiTaC computes the confidence score for all taxonomic ranks assigned to a sample by the local classifier per parent node, then on a bottom-up approach it removes labels that are below a certain threshold. In step 02, supposing that the threshold is 0.7 and that the given sample has a confidence score of 0.2 assigned for the lowest rank, then the final prediction for this sample will only contain the labels Fungi and Ascomycota.
Each parent node in our hierarchical model consists of a logistic regression classifier from the package scikit-learn v0.24.1 (Pedregosa et al., 2011). Logistic regression is a statistical model that is frequently used for classification and predictive analysis. In summary, logistic regression estimates possible outcomes using a logistic function (for a mathematical definition, we redirect the reader to scikit-learn's documentation (Scikitlearn, 2022)), and it was chosen because it is a multi-class classifier known for requiring a small amount of data for training, and it copes well with imbalanced data when corrections are applied (Maalouf, 2011).
Furthermore, a filter was implemented in HiTaC to improve the quality of the predictions (Fig. 2), reducing classification errors when dealing with novel sequences or when the algorithm has a low confidence the prediction is correct. In this filter, another logistic regression classifier is trained for each taxonomic level available in the training data, employing the local hierarchical classifier per level implementation from HiClass (Miranda et al., 2021). Afterwards, correction is performed in a bottom-up approach, i.e., correcting labels from the leaf nodes towards the root. The confidence threshold of this filter has a default value of 0.7, however, it can also be defined by the user via a command line argument.
HiTaC was implemented in Python 3 as a QIIME2 plugin in order to facilitate adoption in existing ITS sequence analysis pipelines, and it can be easily installed through conda, pip or docker, allowing users to install our hierarchical classifier along with its dependencies by running a single command, which is not possible for most of the methods available in the literature.

Evaluation
The TAXXI benchmark (Edgar, 2018) was employed to facilitate the comparison between HiTaC and seventeen other similar software previously available in the benchmark (Table 1), some of which are homology based approaches, while others are alignment free. We relied on this benchmark because it mimics more accurately real world scenarios including novel sequences, as well as its diversity in software, reference databases and use of real environmental data. Furthermore, the TAXXI benchmark assesses the accuracy using cross-validation by identity, a strategy which models the variation in distances between query sequences and the closest entry in reference databases. In this approach, a 90% identity enforces sequences to only belong to species that are absent from the training set, while a 95% identity allows for a mix of present and absent species. In other words, the cross-validation by identity enables the creation of datasets whose lowest taxonomic ranks were nonexistent in the training data while still knowing the ground truth, which allows for a better evaluation on how a classifier behaves when trying to predict novel sequences.

Datasets
We adopted five datasets containing both training and test data, which were previously available in the TAXXI benchmark (Edgar, 2018). These datasets were selected because they were the only ones with taxonomic annotations of fungal ITS sequences up to species level. Although there were other five datasets available in this benchmark possessing fungal ITS sequences, we noticed that they were copies of the same datasets we selected, except that they were missing the species level annotation in the training datasets and ground truth. Thus, there was no added value in using these other datasets in our evaluation. The fungal ITS datasets selected from the TAXXI benchmark were based on real environmental data, publicly available on NCBI (Sayers et al., 2012), RDP (Cole et al., 2014), the Warcup fungal ITS training set v2 (WITS) (Deshpande et al., 2016), UNITE (Kõljalg et al., 2005) and other in vivo samples. They were created through cross-validation by identity, hence the identity between the query sequences and the closest entry in a reference database varied on them, ranging from 90% to 100% (Edgar, 2018) (Table 2). In practice, an identity of 90% means that the training data does not contain any of the labels that are in the test dataset at the species level, but most labels are known at the genus level. The known taxa at the species level increase with identity, i.e., all species are present in the training data when the identity is 100%. It is also important to mention that the training datasets are extremely imbalanced. For example, among the seven existing phyla on dataset SP RDP ITS 90, Ascomycota and Basidiomycota contain approximately 52% and 44% of total instances, respectively (Fig. 3). A similar pattern can be observed in the other datasets with higher identities.

Evaluation Metrics
To measure the performance of our model, we adopted all metrics available in the TAXXI benchmark (Edgar, 2018), namely: • Accuracy (ACC) -Accounts for all test sequences predicted by the algorithm with names that are known and/or over-predicted, ranging from 0% (no correct predictions) to 100% (no errors); • Misclassification Rate (MCR) -The percentage of known sequences incorrectly predicted; • Over-classification Rate (OCR) -The percentage of unknown sequences assigned a label; • True Positive Rate (TPR) -The percentage of known sequences correctly predicted; • Under-classification Rate (UCR) -The percentage of known sequences that were not assigned a label.
Defining these metrics in mathematical terms, we have: where known sequences have labels in the training dataset and novel sequences do not have labels in the training dataset. A true positive is the number of correctly predicted sequences. Under-classification errors occur when too few ranks are predicted, while over-classification errors happen when too many ranks are predicted. Finally, misclassification errors occur when a known name is incorrectly predicted.

Resources benchmark
We enhanced the TAXXI benchmark by porting it to Snakemake (Köster and Rahmann, 2012) in order to make it more reproducible, and we introduced a resources benchmark. The benchmark scripts are available in our GitHub repository 1 , and it was computed on a cluster node running GNU/Linux with 512 GB physical memory and 128 cores provided by two AMD EPYC™ 7742 processors, where 8 cores were allocated for each method. The only exception was Q1 because one of its dependencies, uclust, is only provided inside a virtual disk image. Hence, Q1 was executed inside a GNU/Linux virtual machine on a laptop running Windows 10 with 16 GB physical memory and 12 cores provided by an AMD Ryzen™ 5 4600H processor, where 10 GB and 6 cores were allocated.

Results
We comprehensively evaluated the performance of HiTaC at different taxonomic ranks on five datasets containing fungal ITS sequences with identities varying between 90% and 100%. We showed that HiTaC outperformed state-of-the-art taxonomic classifiers at the genus level, challenging it with novel sequences and noisy regions in sequencing data in the datasets SP RDP ITS 90-97. We also evaluated the robustness of HiTaC by testing it at the species level, the most difficult rank that is possible to predict with fungal ITS sequences, on datasets SP RDP ITS 99-100 from the TAXXI benchmark. The results presented here and further taxonomic ranks can be found in our GitHub repository 2 . We first evaluated the accuracy at the genus and species level on the SP RDP ITS datasets, reporting the accuracy for all methods available in the TAXXI benchmark. As shown in Fig. 4, at the species level HiTaC and BTOP achieved 100% accuracy on the dataset with 100% identity. Furthermore, HiTaC obtained a significantly higher accuracy for the dataset with 90% identity at the genus level, with a 5.5 percentage points improvement over NBC50, the second best third party method for this dataset, while HiTaC's filter also promoted a slight improvement over NBC50 for the same dataset. HiTaC also achieved top accuracy for the datasets with identities varying between 95-99%. On the other hand, while HiTaC's filter did not achieve top accuracy for these datasets, it was still quite competitive, ranking just a few positions below the top performing methods and it was planned to reduce the over-classification rate as we will show later.  Fig. 4. Accuracy results for datasets SP RDP ITS 90 at genus level, SP RDP ITS 95 at genus level, SP RDP ITS 97 at genus level, SP RDP ITS 99 at species level and (e) SP RDP ITS 100 at species level. HiTaC and HiTaC_Filtered achieved the highest accuracy for dataset SP RDP ITS 90, and HiTaC outperformed the third best method (NBC50) by 5.5 percentage points. HiTaC surpassed top methods in the remaining datasets SP RDP ITS 95-99, while for dataset SP RDP ITS 100 HiTaC tied with BTOP as the best method. Missing results for NBC50 and NBC80 for dataset SP RDP ITS 100 are due to a memory limit exceeded error. 2 https://github.com/mirand863/hitac/tree/main/benchmark/results/metrics Similar conclusions can be drawn from Fig. 5, in which we evaluated the true positive rate (TPR) for the top 5 methods for each dataset in the TAXXI benchmark. The trend is that HiTaC achieved a TPR higher or equal to top methods. To illustrate this point, the best true positive rate at genus level for dataset SP RDP ITS 90 was achieved by HiTaC, which sharply increased the TPR by 6.9 percentage points when compared with BTOP. Regarding dataset SP RDP ITS 95 at genus level, HiTaC obtained 1.7 percentage points improvement over Microclass, while for dataset SP RDP ITS 97 at genus level HiTaC surpassed Microclass with a 0.5 increase in percentage points. At species level, HiTaC achieved a 0.7 percentage points gain over BTOP for dataset SP RDP ITS 99, while for dataset SP RDP ITS 100 HiTaC tied with BTOP, obtaining a perfect score of 100% TPR.
SP RDP ITS 90 at genus level SP RDP ITS 95 at genus level SP RDP ITS 97 at genus level SP RDP ITS 99 at species level SP RDP ITS 100 at species level SP RDP ITS 95 at genus level, SP RDP ITS 97 at genus level, SP RDP ITS 99 at species level and SP RDP ITS 100 at species level. The trend is that HiTaC achieved a sensitivity higher or equal to top methods. For instance, HiTaC tied with the top method for the dataset with 100% identity, obtaining a perfect score. Moreover, for the datasets with 90-99% identity HiTaC improved the true positive rates upon top methods.
On Fig 6, we show the results for under-classification and overclassification rates for dataset SP RDP ITS 90 at genus level. HiTaC, BTOP, KTOP, Microclass and TOP did not leave any known sequences unclassified. KNN, on the other hand, left 60.9% known sequences unclassified. Regarding over-classification, KNN achieved the lowest rate overall at 18.4%. Conversely, HiTaC, BTOP, KTOP, Microclass and TOP scored 100% over-classification rates for most datasets, which means that they are wrongly labeling novel sequences that do not exist in the training datasets. The introduction of the filter proved effective to reduce HiTaC's over-classification rate, but slightly increased the under-classification rate. An ideal algorithm would be able to classify all known sequences and leave novel sequences unclassified, thus achieving 100% accuracy. at genus level. The introduction of the filter in HiTaC proved to be effective by sharply decreasing the over-classification rate from 100% to 44.1%. Contrarily, the filter slightly increased the under-classification rate from 0% to 20.4%, but it still ranked among the methods with the best trade-off between over-classification and under-classification rates, which are the methods closer to the diagonal on the bottom left. An ideal method should be able to classify all known sequences and leave novel sequences unclassified, thus avoiding both over-classification and under-classification errors.
The misclassification rate results are shown in Fig. 7. The methods that predominantly attained the lowest misclassification rates were KNN, NBC80, RDP80 and SINTAX80. In general, HiTaC achieved a median misclassification rate when compared to the other methods, reaching 19.7% for the dataset SP RDP ITS 90 and lower values for the remaining datasets with higher identities, eventually reaching a misclassification rate of 0% for the dataset with 100% identity. The filter implemented in HiTaC successfully reduced the misclassification rate to 10.3% for the dataset SP RDP ITS 90 and even lower values were achieved for the remaining datasets, reaching a misclassification rate as low as 0% for the dataset SP RDP ITS 100.
Lastly, Table 3 displays the results for the resources benchmark for dataset SP RDP ITS 100, at species level. In terms of wall clock time, HiTaC ranked in the 10 th position, being slower only when compared with methods implemented in compiled languages such as C and C++. NBC presented a memory limit exceeded error, since its 32 bit version can only use at most 4 GB of RAM.

Discussion
Taxonomic classification is the first step in studies targeting the taxonomic analysis of environmental samples and its results are crucial for downstream analysis. In this work we developed a robust hierarchical taxonomic classifier for accurate fungal ITS classification. Our results showed that with the hierarchical machine learning model, HiTaC could achieve significantly better overall performance than state-of-the-art taxonomic classifiers even with imbalanced training data. We noticed that HiTaC, BTOP, KTOP, Microclass and TOP erroneously classify novel sequences that do not have labels in the training datasets, when query identities are below 100%. For this reason, the predictions made by these methods should be taken with caution. On the other hand, conservative methods such as CT2, Q2_VS, Q2_BLAST, Metaxa2 and KNN have an exceptionally high under-classification rate at the species level, when query identity is quite high in our tests, thus possibly missing the opportunity to classify many sequences already deposited in databases. Both extremes can negatively impact the taxonomic analysis of environmental samples, hence a middle ground would be preferred. Therefore, a filter was introduced in HiTaC in order to reduce the overclassification rate by reporting confidence scores for the predictions made and it can also be set to detect and remove novel or difficult lower ranks that are unlikely to be correct.
In our view, it is challenging to rely on ITS markers alone to classify fungi at the species level, given the current limitations in ITS databases such as low quality sequencing and misidentifications (Badotti et al., 2017). Nevertheless, ITS has been successfully employed to accurately catalogue bacterial communities at (sub)species resolution (Milani et al., 2020). Moreover, from an ecological perspective, it is interesting to use amplicon sequencing in addition to shotgun due to the fact that it can more accurately reflect the composition of microbiomes, as well as taxonomic abundance and richness, especially at higher ranks (Tessler et al., 2017). Table 3. Results obtained by the resources benchmark for the dataset SP RDP ITS 100. HiTaC ranked 10 th in terms of processing speed, surpassed mostly by methods based on compiled languages. The filter proved to be slow during the training stage, which placed it among the slowest methods.

Method
Time (

Conclusion
In summary, in this study we developed a novel taxonomic classifier, HiTaC, which employs a hierarchical machine learning model for accurate fungal ITS classification. Our algorithm is able to handle imbalanced data and outperforms state-of-the-art classification methods on ITS datasets available in the TAXXI benchmark, consistently achieving higher accuracy and sensitivity across different taxonomic ranks, especially on lower taxonomic levels (genus and species). HiTaC is an open-source software, available at https://github.com/mirand863/hitac.