Linked machine learning classifiers improve species classification of fungi when using 1 error-prone long-reads on extended metabarcodes 2

44 Background: The increased usage of error-prone long-read sequencing for metabarcoding of 45 fungi has not been matched with adequate public databases and concomitant analysis 46 approaches. We address this gap and present a proof-of-concept study for classifying fungal 47 taxa using linked machine learning classifiers. We demonstrate the capability of linked 48 machine learning classifiers to accurately classify species and strains using real-world and 49 simulated fungal ribosomal DNA datasets, including plant and human pathogens. We 50 benchmark our new approach in comparison to current alignment and k-mer based methods 51 based on synthetic mock communities. We also assess real world applications of species 52 identification in complex unlabelled datasets. 53 Results: Our machine learning approach assigned individual nanopore long-read amplicon 54 sequences to fungal species with high recall rates and low false positive rates. Importantly, 55 our approach successfully distinguished between closely-related species and strains when 56 individual read errors were higher than the genetic distance between individual taxa, which 57 the alignment and k-mer methods could not do. The machine learning approach showed an 58 ability to identify key species with high recall rates, even in complex samples of unknown 59 species composition.

1 2 performed very well for closely related taxa, including the cryptic species Candida 2 4 2 metapsilosis and Candida orthopsilosis and another closely related species Candida albicans.

4 3
The two cryptic Candida species (C. metapsilosis and C. orthopsilosis) had a very high 2 4 4 consensus sequence similarity, with a genetic distance of 2.74% (97.26% identity) in our 2 4 5 fungal ribosomal DNA target region representing the genetically least distinct species pair.

4 6
Our machine learning approach achieved species level recall rates of 90.1% and 89.1% for C.  This highlights the strength of our approach. The gold standard alignment approach also performed very well when compared to the were classified with recall rates in excess of 95%. Yet this approach significantly 2 5 2 underperformed when trying to differentiate taxa with low genetic distance such as those 2 5 3 from the Candida genus. As with the machine learning approach, the three Candida species were classified with the lowest recall rate at the species level, with C. albicans, C. respectively. These difficulties are also reflected in the overall mean species level recall rate 2 5 7 of 76.6 ± 25.5%, which is much lower than our machine learning approach. Next, we assessed our dataset with alignment and k-mer based analysis approaches when 2 6 0 using the publicly available NCBI database. Overall, NCBI alignment with minimap2 2 6 1 performed similarly well at higher taxonomic ranks. However, inconsistent or missing Recall rates of the alignment-based minimap2 technique, k-mer-based Kraken2 method and the machine learning decision tree across different taxonomic ranks (a-f). The minimap2 technique, as applied to the gold standard database, was successful across most taxonomic ranks, but lower recall rates were recorded for closely related species at the species level (f). Both the minimap2 and Kraken2 methods were applied to the NCBI database, and while the minimap2 NCBI alignment was more accurate across most taxonomic ranks, both showed comparable recall at the species level. The machine learning decision tree approach provided the greatest classification power for closely related species, despite lower recall rates for some distantlyrelated species than the gold standard alignment method. all other methods tested across all taxonomic ranks. A key feature of a species classification tool is its ability to identify a known target species 2 7 1 from a complex sample of unknown composition. This is especially important when 2 7 2 attempting to identify the presence of a target species, such as a specific pathogen, from a 2 7 3 patient or environmental sample.

7 4
We generated two additional sequencing datasets of unknown composition from two distinct 2 7 5 environments to test the capability of our machine learning decision tree to identify a given 2 7 6 target species. These datasets were generated with the same PCR and sequencing protocols as for the individual 44 training species focusing on the fungal ribosomal DNA region. The first 2 7 8 dataset was derived from amplicon sequencing of fungi-infected wheat leaves (wheat dataset) 2 7 9 and the second was derived from bronchoalveolar wash in a clinical setting (clinical dataset).

8 0
To each of these sequencing datasets of unknown composition, we spiked in silico a known and false positive rate of our machine learning classifiers using our in silico spiked reads, We first plotted the propagated confidence score of the species level classification for all 2 8 8 reads in each in silico spiked dataset to better understand the behaviour of our machine 2 8 9 learning decision tree on samples containing reads of unknown origin (Supporting 2 9 0 Information Figure S1). This clearly shows that the propagated confidence scores for reads of 2 9 1 unknown origin are far lower than reads of species the classifiers were trained on. We then confidence scores thresholds assuming that the original sample did not include any reads of confidence threshold reached 0.9, and the false positive rate was consistently low across both percent. For C. albicans, not using a confidence threshold at all resulted in a recall rate of Previously, models were trained using sequence data from pure fungal culture. However, this 3 0 8 is not always accessible, and limits the ability for the approach to be extended. This allows 3 0 9 We retrained all models in the decision tree path of A. flavus using these simulated reads. None of the resultant models were statistically different to those models trained using real 3 1 7 sequenced data in terms of precision or recall. Further, they showed similar ROC curves (see learning classifiers can be trained using simulated reads and still accurately distinguish between taxa across all taxonomic ranks.

2 1
We also used these modes trained using simulated data to test whether the use of simulated training data affected the ability of the system to detect target species from a complex sample.

2 3
We found no difference in the recall or false positive rates for detecting the target species 3 2 4 using these simulated-trained models, with a near identical distribution to that shown in 3 2 5 Supporting Information Figure S1, and a recall rate of 0.910 when using a confidence 3 2 6 threshold of 0.85 compared to the recall rate of 0.917 when trained on real data. The plots show the recall rate (left axis, blue dots) and false positive rate (right axis, orange or green dots) at varying propagated confidence score thresholds for Aspergillus flavus (a) and Candida albicans (b) when spiked into clinical (orange) or wheat (green) datasets. Both plots are based on 2000-read in silico spiked samples containing 1000 reads with known labels (A. flavus or C. albicans) and 1000 reads of unknown origin. For A. flavus, a confidence threshold of 0.85 maintains a recall rate of 91.7%, while reducing the false positive rate to 1% for both datasets. For C. albicans, the same confidence threshold of 0.85 has a recall rate of 72.4% and reduces the false positive rate below 2% for both datasets. Nanopore sequencing offers portable sequencing platforms and generates long-reads that can 3 3 3 cover extended metabarcodes that include more sequence information than the classic Here, we implement a novel machine learning approach for species level classification. We [41], although the objectives differ. IDTAXA also uses a tree-like approach to classify a 3 4 5 subsample of the k-mers into a given taxonomic rank using machine learning, and the user 3 4 6 can specify certain confidence thresholds to avoid over-classification when confidence is low 3 4 7 [41]. While IDTAXA's goal is to classify unknown reads to a confident taxonomic rank, our approach is more focused on the detection of one or a few target species from mixed samples.

4 9
As such, we used full-length reads rather than k-mers from the short ITS regions, as used by Vu et al. [34], for training the machine learning models. Training these models with long 3 5 1 reads is computationally less efficient than with k-mers, but is more suitable for detection- The initial comparisons were based on idealised databases derived from our sequencing 3 8 2 dataset, for which metabarcode length and online database entry length were equivalent.

8 3
Hence, is it expected that analyses based on these idealised databases will outperform other 3 8 4 approaches relying on public databases with short reference sequences of uneven length, as shown by our results (Figure 2). Interestingly, the Kraken2 approach underperformed 3 8 6 compared to the alignment-based approach in our current study. This is consistent with previous studies with nanopore data, where the classification accuracy of Kraken2 never default 35 bp k-mers [49,50]. One approach to improve the k-mer based classification 3 9 0 accuracy for error-prone reads would be to use a smaller k-mer length. Another common 3 9 1 issue of public fungal databases was that many species were not included in the database or 3 9 2 were presented with different taxonomic labels, which resulted 0% recall rates of some these online databases for classifying fungal metabarcode reads, as the nomenclature is not We also tested if our machine learning approach could accurately identify target species in sample and for Aspergillus species in the wheat sample. We naively assumed these species 4 0 8 were not present in our complex sample save for those reads that had been spiked in. Candida have been misidentified as the target species from these complex samples.

1 3
Further, we found that training the machine learning classifiers using simulated reads, rather available.

2 0
The taxa used to train the classifiers are flexible and can be changed to suit the user's need identifying all species present. While unknown taxa absent in the training dataset will be 4 2 7 assigned a label, the use of a confidence score threshold can aid in filtering these Oxford Nanopore Technologies to improve barcoding accuracy, cost-effectiveness and 4 3 2 scalability, such as the recently-released Q20+ chemistry, offers promise for expanding our 4 3 3 proof-of-concept study to more species using multiple barcodes to improve the species-level Long-read sequencing can be more accurate for species identification and detection than short We collected different fungal tissue differently for DNA extractions. The tissue collection 4 5 1 processes for each fungal taxon are summarized in Supporting Information Table T1. We used three different DNA extraction methods for all the species in the mock 4 5 3 communities. The methods for each species are listed in the Supporting Information Table   4  100 mg of leaf tissue was homogenized and cells were lysed using cetyl trimethylammonium protocol.

2 3
Processing and manipulation of fungal pathogen reads The read datasets obtained contained both the fungal ribosomal DNA sequence and that of 5 2 5 elongation factor 1 alpha. A two-step data filtration process was applied to remove the 5 2 6 elongation factor 1 alpha reads and outlier reads from the fungal ribosomal DNA read set.

2 7
Reads were mapped to an in-house database of fungal ribosomal DNA regions to select reads 5 2 8 of a similar general structure to the ITS region. This homology-based filter assumes the 5 2 9 structure of the fungal ribosomal DNA region is similar between species due to shared 5 3 0 ancestry, repeatedly shown to be true [60]. The in-house database was curated from 28 ITS kingdom. This process mapped reads using minimap2 (v2.17), using the map-ont flag. Reads that failed to map were discarded.

3 4
Remaining reads were filtered for read length. The expected read length for the fungal spread of successfully filtered reads differed between samples, a 90% confidence interval cut-5 3 7 off around the mean read length was applied. This interval was sufficient to exclude short or 5 3 8 very long reads resulting from incomplete or partial homology filtering or errors in the 5 3 9 sequencing or basecalling processes. An additional simulated dataset was later generated using NanoSim v3.0.0 for use in testing 5 4 8 the ability to train models using simulated data based on references available online, without 5 4 9 losing the ability to accurately classify reads from real-world data. A total of 35,000 5 5 0 additional reads were simulated for Aspergillus flavus to ensure the maximum number of 5 5 1 reads that could be processed using our approach was available. These reads were generated 5 5 2 using an error profile identical to the pre-existing non-simulated reads but were based on a 5 5 3 reference consensus sequence for A. flavus.

4
Generating consensus sequences for each species The consensus sequence, an aggregate sequence formed from the comparison of multiple 5 5 6 sequences that represents the 'true' sequence, was generated using 200 randomly subsampled SILVA and NCBI databases.

6 2
Determining the relationships between samples 5 6 3 Using the taxonomic information available for each sample in MycoBank and the results of a 5 6 4 BLAST search with the generated consensus sequences, a cladogram was designed to show (a node) to distinguish between samples for each read, with the cladogram functioning as a 5 6 8 decision tree with the node-situated classifiers determining the path of tree descent.

6 9
Individual strains of the same species were counted as separate species for this cladogram. learning classifier due to its ability to use the spatial relationships between data features in the 5 7 3 reads, such as the distance between ITS and other variable groups. CNNs are capable of 5 7 4 learning from both minor variation and higher-order features, which is of particular importance given the high read error of nanopore reads. While we did compare a naïve 5 7 6 Bayesian and a fully-connected multilayer perceptron model to the CNN approach, these 5 7 7 produced quite poor results in comparison, and previous studies using short-read k-mers had 5 7 8 shown CNNs to provide an advantage over the existing Bayesian Ribosomal Database Project 5 7 9 and other machine learning classifier types [34]. Given this, we selected CNNs to be the basis 5 8 0 of our decision tree moving forward.

8 1
CNNs work best when there is a balanced number of items in each classification class. As each read subsampled, the nucleotide sequence was converted to a numeric sequence, where 5 9 0 A, C, G, and T became 0, 1, 2, and 3, respectively. This integer encoding produced the same 5 9 1 result as using one-hot encoding to multiple dimensions. As not all sequences were of equal 5 9 2 length, but an equal length was required to avoid sequence length being a distinguishing 5 9 3 factor in the classifier, all sequences were padded out to a length of 5,000 bp. The padding 5 9 4 used a value of 4 to avoid the padding data from affecting the identification of key features 5 9 5 for classification. Each read was assigned a label representing the output class it would belong to in the one-hot 5 9 7 format. Labelled reads were then separated into a training set and a test set. The training set 5 9 8 contained 85% of the reads, and was used to train the machine learning classifiers, while the 5 9 9 test set contained the remaining 15% of labelled reads and was used to test the efficacy of 6 0 0 said classifiers on similar data that the classifier had not previously encountered. This was 6 0 1 done using the leave out N sequences approach rather than leaving out N taxa, which is more 6 0 2 appropriate to represent the 'dark matter' taxa, as we wanted to focus on detecting the known  Specific details for the design of the machine learning classifiers and the required software Evaluation of the machine learning classifiers 6 1 0 The test set was used to assess the accuracy of the various machine learning classifiers. As 6 1 1 the test set data was labelled, the expected outcome for each read was known, and could be 6 1 2 compared to the output of the machine learning classifier. The accuracy, or classification rate, 6 1 3 of these classifiers was the proportion of reads in the test set for whom the prediction of the outcome were true positives and matches outside this outcome were false negatives.
[ 1 ] When seeking to identify members of a specific taxon in a community, where the members 6 2 0 are not immediately obvious from the species name, it is useful to have samples classified at 6 2 1 each taxonomic rank. A singular classifier would require excessive computational power to 6 2 2 do this. As such, we chained the machine learning classifiers together into a decision tree 6 2 3 with a tree descent approach, based on the cladogram of the species present in our sample.

2 4
The most confident outcome of the machine learning classifier at one taxonomic rank would 6 2 5 be used to decide the path along the decision tree. This path could either lead into another 6 2 6 machine learning classifier, if the path diverged again, or lead all the way down to the species 6 2 7 level with the same confidence.