ProtAlign-ARG: Antibiotic Resistance Gene Characterization Integrating Protein Language Models and Alignment-Based Scoring

The evolution and spread of antibiotic resistance pose a global health challenge. Whole genome and metagenomic sequencing offer a promising approach to monitoring this spread, but typical alignment-based approaches for antibiotic resistance gene (ARG) detection are inherently limited in their ability to detect new variants. Large protein language models present a powerful alternative but are limited by the databases available for training. Here we introduce ProtAlign-ARG, a novel hybrid model combining a pre-trained protein language model and an alignment-based scoring model to expand the capacity for ARG detection from DNA sequencing data. ProtAlign-ARG learns from vast numbers of unannotated protein sequences, utilizing raw protein language model embeddings to improve the accuracy of ARG classification. In instances where the model lacks confidence, ProtAlign-ARG employs an alignment-based scoring method, incorporating bit scores and e-values to classify ARGs according to their corresponding classes of antibiotics. ProtAlign-ARG demonstrates high accuracy in identifying and classifying ARGs, particularly excelling in recall compared to existing ARG identification and classification tools. We also extended ProtAlign-ARG to predict the functionality and mobility of ARGs, highlighting the model's robustness across predictive tasks. A comprehensive comparison of ProtAlign-ARG with both the alignment-based scoring model and the pre-trained protein language model alone demonstrates the superior performance of the hybrid approach.


Introduction
Antibiotic resistance is a major threat to public health. Every year, antibiotic-resistant human pathogens claim the lives of an estimated 700,000 people worldwide, a number expected to rise to 10 million per year by 2050 [1,2]. Antibiotic-resistance genes (ARGs), i.e., genes carried by bacteria that confer resistance to antibiotics, can be shared between bacteria and further transmitted among animals, humans, and various environments [3-7]. Thus, there have been calls for the advancement of approaches for monitoring the emergence and movement of ARGs across One Health sectors [8].
With the advent of next-generation DNA sequencing technologies, databases of bacterial genomes and metagenomes originating from various One Health sources continue to grow. This presents the opportunity to tap into such data to advance ARG monitoring. Traditionally, ARGs are identified from DNA sequencing data via alignment to existing ARG databases, such as the Comprehensive Antibiotic Resistance Database (CARD) [9]. Sequence alignment entails the computational arrangement of nucleotide or amino acid sequences to determine regions of similarity, which may indicate functional, structural, or evolutionary relationships among the sequences. Unfortunately, alignment-based approaches are inherently limited, particularly in their reliance on existing databases and their inability to capture remote homologs, i.e., instances where evolutionary relationships between genes or proteins have significantly diverged over time [10]. Further, alignment-based methods are highly sensitive to the selection of similarity thresholds, resulting in false negatives if too stringent or false positives if too liberal. This ambiguity poses challenges for users and inherently limits the ability to detect novel ARGs.
Recognizing these constraints underscores the importance of exploring alternative methodologies to enhance the accuracy and precision of ARG prediction for antibiotic resistance monitoring. Despite the broad adoption of alignment-based algorithms for ARG classification, this approach alone does not provide comprehensive insights for accurate ARG class classification. Moreover, aligning against large databases is time-consuming [11], often requiring hours to days to align terabytes of sequence data to databases several gigabytes in size [12-14].
Deep learning-based tools have the potential to greatly advance One Health monitoring of ARGs, presenting a means to improve the efficiency and accuracy of ARG detection and classification from DNA sequencing data, while also enabling the detection of new or emergent ARGs that are not yet described in publicly available databases. Deep learning, particularly with protein language model embeddings, offers a more nuanced representation of protein sequences, excelling in contextualizing these sequences and uncovering complex patterns missed by conventional methods [15]. Recently, deep learning-based tools have demonstrated promise for ARG identification and classification. DeepARG [16] is one such tool that uses deep learning and considers a dissimilarity matrix for ARG identification. HMD-ARG [17] was developed more recently and offers a hierarchical multi-task classification model using a convolutional neural network (CNN). ARG-SHINE [18] utilizes a machine learning approach to ensemble three component methods for predicting ARG classes. However, protein language models have only begun to be tested for ARG detection. Recently, a transformer-based protein language model demonstrated strong generalization capabilities, meaning it could accurately learn patterns from a limited number of training sequences and apply this knowledge to effectively identify and classify new, unseen protein sequences [19]. Transformer models use a self-attention mechanism to capture long-range dependencies and contextual relationships, enhancing their performance on sequence data.
The abundance of protein sequences translated from next-generation DNA sequencing data has revealed complex molecular relationships and intricate interactions between amino acids far apart in the protein's three-dimensional structure. Protein sequences also inherently encode crucial structural features, including secondary structures and motifs, which guide protein folding and function [20]. Protein language models provide a powerful approach to deciphering such complexity. In particular, these models excel in capturing intricate patterns and motifs across diverse gene types, providing a systematic way to understand the nuanced language of protein sequences. Protein language models, trained on millions of protein sequences, have demonstrated promise for developing ARG prediction models. Our preliminary study demonstrated the efficacy of pre-trained protein language models for ARG identification and classification tasks [21]. Recently, PLM-ARG [22] has also proven capable of using protein language models to predict ARGs. We found that integrating language models and transfer learning, a technique where a model developed for one task is reused as the starting point for a model on a second task [23], could be a promising approach to ARG identification and classification while mitigating false negatives. However, a notable challenge remains: deep learning models exhibit suboptimal performance when confronted with limited training data, which hampers their ability to support generalized predictions. In such cases of insufficient training data, alignment-based scoring can outperform pre-trained protein language model (PPLM) based prediction.
Here we introduce ProtAlign-ARG, a novel model for identifying and classifying ARGs in protein sequence data that combines the strengths of PPLM-based prediction with alignment-based scoring. This hybrid solution aims to enhance predictive accuracy, particularly in scenarios with limited training samples and diverse datasets. To demonstrate the ProtAlign-ARG pipeline, we carried out a comprehensive comparison with existing pipelines, validating the superiority of our proposed model in overcoming challenges posed by inadequate training data. We further demonstrate the capability of ProtAlign-ARG to characterize ARG mobility, including differentiating intrinsic ARGs, along with improvements in ARG classification according to antibiotic resistance class.

Model Development

Data Curation
For ARG data, we utilized HMD-ARG-DB [17], as it is one of the largest repositories of ARGs and captures the most comprehensive annotations across various dimensions. HMD-ARG-DB was curated from seven widely used databases, namely AMRFinder [24], CARD [25], ResFinder [26], Resfams [27], DeepARG [16], MEGARes [28], and Antibiotic Resistance Gene-ANNOTation [29], and contains over 17,000 ARG sequences distributed among 33 antibiotic-resistance classes. As 19 classes in HMD-ARG-DB have only a few genes each, we focused on the 14 most prevalent classes for developing the prediction models.
To curate the non-ARG dataset, we downloaded the entire UniProt dataset, excluding sequences labeled as ARGs, and performed DIAMOND alignment against HMD-ARG-DB. Sequences with an e-value greater than 1e-3 and a percentage identity below 40% were classified as non-ARGs. This focused the negative set on non-ARG sequences demonstrating some level of similarity to ARG sequences, with the goal of enhancing the model's capability to distinguish ARGs even where such similarities exist [17]. Furthermore, it facilitated comparisons with other state-of-the-art tools such as DeepARG [16] and HMD-ARG [17].
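The filtering step above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: it assumes DIAMOND's standard tabular output (`--outfmt 6`), in which percent identity is column 3 and e-value is column 11, and treats any sequence with a sufficiently strong ARG hit as ARG-like, keeping the rest as non-ARG candidates.

```python
def passing_hits(diamond_rows, max_evalue=1e-3, min_identity=40.0):
    """Query IDs with at least one strong hit to an ARG (ARG-like, excluded)."""
    arg_like = set()
    for row in diamond_rows:
        # DIAMOND --outfmt 6 default columns: qseqid sseqid pident ... evalue bitscore
        query, pident, evalue = row[0], float(row[2]), float(row[10])
        if evalue <= max_evalue and pident >= min_identity:
            arg_like.add(query)
    return arg_like

def select_non_args(all_ids, diamond_rows):
    """Non-ARGs: sequences with no sufficiently close hit against HMD-ARG-DB."""
    return set(all_ids) - passing_hits(diamond_rows)
```

In practice the rows would be parsed from DIAMOND's tab-separated output file; sequences absent from the hit table (no alignment at all) also end up in the non-ARG set.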

Data Partition
We partitioned the dataset into training, validation, and testing sets based on a specific similarity threshold. This threshold represents the maximum allowed similarity between sequences in the two sets, ensuring that the training and testing data are distinct enough to prevent biased accuracy metrics and to assess the accuracy of each model on unseen data. Although CD-HIT [44] and MMseqs2 [45] are commonly used for data partitioning, they cannot guarantee that training and testing data respect the specified maximum similarity. In contrast, GraphPart [46], a recently developed tool, has proven highly effective at precise separation while retaining most sequences up to the desired similarity threshold. Here we used both CD-HIT and GraphPart for data partitioning and clustering. Using CD-HIT with a 40% similarity threshold on the HMD-ARG-DB dataset, we obtained 721 clusters, which were partitioned into training and testing data at an 80:20 ratio. However, as shown in Figure 5, a comparison of the training and testing data revealed that more than 50% of sequence pairs between the training and testing sets partitioned by CD-HIT had similarity over 40% (200 were >90%). In contrast, GraphPart provided exceptional partitioning precision and was therefore used to partition 80% of the sequences for training and validation and 20% for testing. Table 1 shows the distribution of ARG counts in the training and testing sets using GraphPart at 40% and 90% similarity thresholds.
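The cluster-level 80:20 split described above can be sketched as follows. This is a minimal illustration of the principle, assuming clusters have already been produced by CD-HIT or GraphPart: whole clusters, not individual sequences, are assigned to training or testing, so near-duplicate sequences never straddle the split.

```python
import random

def split_clusters(clusters, test_frac=0.2, seed=42):
    """clusters: dict mapping cluster_id -> list of sequence IDs.

    Returns (train_ids, test_ids) with roughly test_frac of the clusters
    held out; all members of a cluster land on the same side of the split.
    """
    ids = sorted(clusters)
    random.Random(seed).shuffle(ids)          # deterministic shuffle for reproducibility
    n_test = max(1, round(test_frac * len(ids)))
    test_clusters = set(ids[:n_test])
    train = [s for c in ids[n_test:] for s in clusters[c]]
    test = [s for c in test_clusters for s in clusters[c]]
    return train, test
```

Note that, as the section explains, this guards only against within-cluster similarity; guaranteeing a maximum *between*-cluster similarity is exactly what GraphPart adds over CD-HIT.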

ProtAlign-ARG tasks and models
ProtAlign-ARG comprises four distinct models, each dedicated to a specific task: 1) ARG identification, 2) ARG class classification, 3) ARG mobility identification, and 4) ARG resistance mechanism classification. Figure 1 presents an overview of how these four models were developed, highlighting their respective designs and interrelationships. ProtAlign-ARG utilizes both pre-trained protein language models and an alignment-based scoring model to ensure robust performance in these tasks. Detailed descriptions of these components are provided in the following sections.

Leveraging a pre-trained language model for ARG identification and classification
At the time of our work, many protein language models were available, such as ProtXL, ProtElectra, and ProtT5. We selected the ProtAlbert model due to its superior performance relative to its size: ProtAlbert outperformed its counterparts while maintaining a relatively modest 224 million parameters. This choice strategically balanced optimal performance against computational efficiency, especially compared to larger, more time-consuming models. ProtAlbert, with its 12-layer architecture, was trained on UniRef100. For our downstream prediction task, we harnessed the pre-trained ProtAlbert model following the official GitHub repository provided by ProtTrans [20]. Its training unfolds in two stages: an initial phase on sequences of up to 512 characters for 150,000 steps, followed by another 150,000 steps on longer sequences of up to 2,000 characters.
The pre-trained ProtAlbert model imparts its knowledge to the ARG classification task through transfer-learned embeddings. ProtAlbert offers two embedding methods: per-residue, which treats each amino acid separately, and per-protein, which views the entire protein sequence as one unit. For ARG identification and classification, we chose the per-protein method. This approach recognizes proteins as integrated units with specific structures and functions. It simplifies the representation, making it less computationally complex and better at handling sequence variations. The per-residue method gives detailed amino acid information but can be more complex and sensitive to sequence changes. Thus, the per-protein method balanced biological relevance with computational efficiency. As demonstrated in Figure 2, we extract embeddings from the model's final layer and consolidate them into a fixed-size representation for each protein; ProtAlbert employs mean pooling for this purpose. The resultant vector serves as the input to a single feed-forward layer with 32 neurons for the ARG identification or classification task.
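The pooling-and-classification head described above can be sketched in a few lines of NumPy. This is a shape-level illustration only: the weights are random placeholders standing in for the trained layer, and the 4096-dimensional embedding size is an assumption about ProtAlbert's hidden dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, n_classes = 4096, 2    # assumed ProtAlbert hidden size; ARG vs. non-ARG

def classify(per_residue_embeddings, W1, b1, W2, b2):
    """Mean-pool per-residue embeddings, then apply a 32-neuron feed-forward head."""
    pooled = per_residue_embeddings.mean(axis=0)   # (embed_dim,) per-protein vector
    hidden = np.maximum(0.0, pooled @ W1 + b1)     # single 32-neuron layer (ReLU)
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())            # numerically stable softmax
    return exp / exp.sum()

# Placeholder (untrained) weights, purely to demonstrate the shapes involved.
W1 = rng.normal(scale=0.01, size=(embed_dim, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.01, size=(32, n_classes)); b2 = np.zeros(n_classes)

# A hypothetical 120-residue protein's embeddings, one row per residue.
probs = classify(rng.normal(size=(120, embed_dim)), W1, b1, W2, b2)
```

The key point the sketch makes concrete is that mean pooling yields the same fixed-size input regardless of protein length, which is what allows a single small feed-forward layer to serve proteins of any length.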

Alignment-based scoring model
Deep learning models are effective at learning from large datasets but may not perform well with small amounts of data. In contrast, alignment-based models focus on sequence similarities and can perform well even with a limited number of sequences. So while deep learning excels on large amounts of sequence data, alignment-based models are often better for smaller datasets. Here we combine both approaches for ARG identification and classification, leveraging the advantages of each. For the alignment-based scoring, we utilized DIAMOND [47] for matching, following a modified version of the ARG-KNN model proposed in ARG-SHINE [18].
Initially, we align the query sequence with the training data using DIAMOND, setting an e-value threshold of < 1e-3 to identify similar sequences (homologs). If a query sequence fails to align with any sequence in the training dataset, we cannot employ alignment-based scoring to label it; we switch to PPLM prediction for those sequences. We applied this scoring method to both ARG identification and classification.
For each query sequence, we computed an alignment score for each label, and the label with the highest score is assigned to the query sequence. For a given query sequence p_q, the score for label C_i, S(C_i, p_q), is defined as:

S(C_i, p_q) = Σ_{p ∈ T_q} I(C_i, p) · B(p_q, p)

where T_q represents the set of training proteins aligned to p_q together with their bit scores, p denotes any protein in T_q, I(C_i, p) is a binary indicator of whether p belongs to label C_i, and B(p_q, p) is the bit score of the alignment between p_q and p. The label with the highest score is taken as the result of the similarity model.
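The scoring rule above reduces to summing bit scores per label and taking the argmax, as in this sketch. The hit list is hypothetical input; in the pipeline it would come from the DIAMOND alignment of the query against the training data.

```python
from collections import defaultdict

def score_labels(hits):
    """hits: list of (label_of_matched_training_sequence, bit_score) pairs.

    Implements S(C_i, p_q) = sum of bit scores of hits belonging to label C_i;
    returns the label with the highest total, or None when there are no hits
    (the case where the pipeline falls back to the PPLM prediction).
    """
    totals = defaultdict(float)
    for label, bitscore in hits:
        totals[label] += bitscore
    return max(totals, key=totals.get) if totals else None
```

For example, hits of bit score 200 and 90 to beta-lactam sequences outweigh a single 250-bit-score hit to a tetracycline sequence, so the query would be labeled beta-lactam.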

ProtAlign-ARG
ProtAlign-ARG synergizes PPLM-based scoring with alignment-based scoring to enhance the accuracy of ARG identification and classification. Deep learning models, particularly PPLMs, are known to excel in data-rich environments; however, they tend to struggle when the available data are sparse or when the model's predictive confidence is low. In our methodology, we introduce confidence thresholds as a metric for evaluating the reliability of PPLM-based predictions. These thresholds (95%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, and 20%) represent the model's certainty in its classification.
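The hybrid decision rule this section describes can be summarized as a three-way branch; the function below is an illustrative sketch (names and threshold value are ours, not the authors' code): trust the PPLM when it is confident, defer to alignment-based scoring when it is not, and fall back to the PPLM again when the query has no alignment hits at all.

```python
def hybrid_predict(pplm_label, pplm_confidence, alignment_label, threshold=0.9):
    """ProtAlign-ARG-style decision rule (sketch).

    alignment_label is None when the query found no homologs in the
    training data, so alignment-based scoring cannot be applied.
    """
    if pplm_confidence >= threshold:
        return pplm_label           # confident PPLM call: accept it
    if alignment_label is not None:
        return alignment_label      # low confidence: use alignment-based scoring
    return pplm_label               # no homologs found: PPLM decides anyway
```

The threshold itself is a tunable parameter; Figure 3 is what motivates choosing it, by showing how accuracy degrades for predictions below each candidate confidence level.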
We assessed the accuracy of predictions falling below each threshold and observed a marked decline in reliability (Figure 3). Specifically, predictions with less than 90% confidence yielded a 45% accuracy rate, indicating that over half of these predictions were incorrect. When the query sequence does not find a match within the training data, a situation expected to arise with novel or rare ARGs, the PPLM is applied for the final classification; leveraging its comprehensive understanding of protein sequence language, the PPLM serves as the decision-making tool in such cases. Figure 3 illustrates the relationship between the various confidence thresholds and the corresponding accuracy of predictions.
For ARG identification, ProtAlign-ARG did not substantially improve overall accuracy relative to the PPLM or the alignment-scoring-based model. This result was as expected, since both of the labels applied in this study, "ARG" and "non-ARG", had ample data in both the training and test sets for this portion of the study. Furthermore, to compare our model's accuracy with existing state-of-the-art models such as DeepARG [16], ARG-SHINE [18], and HMD-ARG [17], we divided the COALA dataset [18] into three groups: "No alignment," "Less than 50% similarity between training and testing sets," and "Greater than 50% similarity." The COALA90 dataset was prepared using the CD-HIT tool, which could not maintain a strict threshold on inter-cluster similarity. On this dataset, the performance of the PPLM deteriorated, especially for the ARG classes streptogramin, rifamycin, and bacitracin, which had just 26, 29, and 127 sequences, respectively; the PPLM achieved an overall F1-score of 0.67. ProtAlign-ARG outperformed both component models, achieving an F1-score of 0.81 against the alignment-based scoring model's overall F1-score of 0.71. We also divided the test dataset into three subsections based on alignment with the training sequences using the CD-HIT tool: those with no alignment, those with less than 50% alignment, and those with greater than 50% alignment. Table 6 shows the accuracies achieved by the PPLM and ProtAlign-ARG on these test sets. We observed that for cases with no alignment and very low alignment, the PPLM performed better than the alignment-scoring-based model.
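The three-way test-set grouping used above can be sketched as a simple bucketing of each test sequence by its best alignment identity against the training set. The identity values here are hypothetical stand-ins for what CD-HIT (or a DIAMOND search) would report.

```python
def bucket(best_identity):
    """best_identity: max percent identity vs. the training set, or None if no hit."""
    if best_identity is None:
        return "no alignment"
    return "<50%" if best_identity < 50.0 else ">=50%"

def group_test_set(best_identities):
    """best_identities: dict of sequence ID -> best identity (or None)."""
    groups = {"no alignment": [], "<50%": [], ">=50%": []}
    for seq_id, ident in best_identities.items():
        groups[bucket(ident)].append(seq_id)
    return groups
```

Grouping the test set this way isolates exactly the regime where the components disagree: the "no alignment" bucket can only be served by the PPLM, while the ">=50%" bucket is where alignment-based scoring is strongest.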

Independent Test Set Validation
ProtAlign-ARG, trained with GraphPart-based clustering at a 40% similarity threshold on the HMD-ARG-DB dataset, successfully identified all 76 novel beta-lactamase genes that were experimentally validated by Berglund et al. [49]. Furthermore, the ARG class classification model correctly categorized these genes as beta-lactamases. To ensure that the training set did not contain these beta-lactamase genes, we conducted DIAMOND alignments and excluded any sequences with similarity above 40% during the model's training. This performance underscores the effectiveness of our model in identifying and annotating novel ARGs compared to existing tools.

Conclusion
Our study presents a robust pipeline and benchmarking standard for the identification and classification of ARGs, and thus can contribute to advancing antibiotic resistance monitoring across One Health sectors. ProtAlign-ARG introduces a novel method for taking advantage of both alignment-scoring-based tools and deep learning predictions, and it surpassed state-of-the-art tools in both ARG identification and class classification. One limitation of the current model is that accuracy is lower for sequences with low identity to the training data. To address this, we plan to incorporate protein 3D structure information into the models; because protein structures are more conserved than sequences, this could enhance performance.

Figure 1 .
Figure 1. The pipeline for ARG identification and classification. Each number (1-4) corresponds to one of the four distinct models incorporated in the pipeline.

Figure 2 .
Figure 2. Pre-trained protein language model for ARG Identification & Classification

Figure 3 .
Figure 3. Accuracy of predictions falling below the indicated confidence thresholds for PPLM-based ARG class classification.

Figure 4
Figure 4 shows that the PPLM performed well for ARGs encoding resistance to Aminoglycoside, Beta-lactam, Fosfomycin, Glycopeptide, Tetracycline, and Trimethoprim, with high scores across all metrics. The alignment-based scoring model exhibited high precision and recall for certain classes, such as Aminoglycoside (precision 1, recall 0.9) and Trimethoprim (precision 1, recall 1), but performed poorly for Polymyxin and Rifampin (precision and recall 0 for both). The ARG class prediction model of ProtAlign-ARG performed well for classes like aminoglycoside (F1-score 0.9) and beta_lactam (F1-score

Figure 4 .
Figure 4. ARG class prediction accuracy for the PPLM, Alignment-scoring-based model, and ProtAlign-ARG based on precision, recall, and F1-score.

Figure 5 .
Figure 5. Distribution of sequence count across different similarity thresholds between the training and testing set achieved by CDHIT clustering.

Table 2 .
Accuracy for component models on ARG identification on HMD-ARG-DB [17] using 40% and 90% similarity thresholds with GraphPart [46].

Table 3 .
Accuracy for the PPLM, alignment-scoring-based model, and ProtAlign-ARG for ARG class classification on HMD-ARG-DB at the 40% threshold. Table 3 shows that the PPLM excelled in achieving better recall compared to the alignment-based scoring model, but it lagged in terms of precision. ProtAlign-ARG, on the other hand, struck a balance by achieving high precision and recall, leading to an overall improvement in accuracy relative to both alternative models.

Table 4 .
Accuracy for the different component models for ARG class classification on HMD-ARG-DB at the 90% threshold. Additionally, we conducted tests using a 90% similarity threshold. This was particularly important because the 19 rare ARG classes together contained only 219 sequences. Table 4 presents the performance of the three models in this scenario. In this case, the alignment-scoring-based model outperformed both the PPLM and ProtAlign-ARG in precision. The PPLM tends to perform poorly when the sample size is limited (e.g., the kasugamycin and quinolone antibiotic classes). When we experimented with all 33 ARG classes, including the 19 classes with only 219 sequences in total, accuracy dropped significantly for the PPLM, whereas the alignment-scoring model performed much better. However, when we considered the 90% similarity threshold only for the 14 frequent ARG classes, the PPLM outperformed the alignment-scoring-based model. Our hybrid ProtAlign-ARG achieved a 78% F1-score, consistent with the alignment-scoring-based model.

Table 5 .
Accuracy for the different component models for ARG class classification on COALA90 dataset