plASgraph - using graph neural networks to detect plasmid contigs from an assembly graph

Identification of plasmids from sequencing data is an important and challenging problem related to antimicrobial resistance spread and other One-Health issues. In our work, we provide a new architecture for identifying plasmid contigs in fragmented genome assemblies built from short-read data. Unlike previous machine-learning approaches for this problem, which classify individual contigs separately, we employ graph neural networks (GNNs) to include information from the assembly graph. Propagation of information from nearby nodes in the graph allows accurate classification of even short contigs that are difficult to classify based on sequence features or database searches alone. Our new species-agnostic software tool plASgraph outperforms recently developed PlasForest, which uses database searches to supplement sequence-based features. Since our tool does not rely on existing plasmid databases, it is more suitable for classification of contigs in novel species and discovery of previously unknown plasmid sequences. Our tool can also be trained on a specific species, and in that scenario it outperforms mlplasmids trained on the same species. On one hand, our work provides a new, accurate, and easy to use tool for plasmid classification; on the other hand, it serves as a motivation for more widespread use of GNNs in bioinformatics, such as in pangenome sequence analysis, where sequence graphs serve as a fundamental data structure. Availability https://github.com/cchauve/plASgraph

Ideally, the input assembly graph would consist of a few connected components, each 156 corresponding either to a single chromosome or a plasmid. In a fully resolved assembly, 157 each chromosome and plasmid would actually correspond to an isolated vertex. However, 158 in graphs created from short-read data, walks corresponding to individual molecules are 159 typically interconnected by spurious edges or through ambiguous contigs, creating complex 160 structures (see Figure 5 for an illustration). The main novelty of our approach is to use 161 the graph neighbourhood of a node as a source of information, under the assumption that 162 other contigs from the same molecule (chromosome or plasmid) are likely to belong to this 163 neighbourhood.

164
Given a comprehensive database of closed genomes for a given bacterial species, longer 165 contigs can likely be classified simply based sequence homology; in our work, we concentrate 166 on classifying contigs in a de novo framework that does not rely on existing reference genomes.

167
The de novo contigs classification problem is of interest for e.g. the analysis of samples from 168 poorly sampled bacterial species. 170 As an input to the classification task, each contig is characterized by five input features: activation. More precisely, if we organize the n feature vectors into the n × k matrix X, the 202 graph convolutional layer can be expressed as

204
whereÃ is the graph adjacency matrix with ones along the diagonal,D is a diagonal matrix 205 whereD ii = jÃ ij , Θ is a k × ℓ matrix of trainable weights, σ is a non-linear activation 206 function, and Z is the n × ℓ matrix of output feature vectors. A single GCL integrates 207 information from an immediate neighbourhood of a node; by employing d GCLs one integrates 208 the information from the distance of at most d for each node.
209 Figure 1 shows the architecture we have designed for our task. The five input features for 210 each node are first transformed by a fully connected layer to a vector of length 32 per node.

211
This is followed by six GCLs using the same weight matrix Θ. The last two fully connected 212 layers operate on each node separately, the first producing a vector of length 32, and the 213 second producing two output scores, loosely interpretable as probabilities of the node being 214 part of a chromosome and plasmid, respectively. Since these two outputs correspond to two 215 separate classification tasks, we do not require these two scores to sum to one. Each layer is 216 followed by the ReLU activation, except for the last layer, which uses the sigmoid activation.

217
All layers excluding the last one are followed by 10% dropout to prevent overfitting.

218
GCLs combine features of each node with features of the neighbours and over time, the 219 influence of the original features is greatly diminished. In our task, the original features can 220 be highly informative, especially for nodes corresponding to longer contigs; therefore we want 221 to maintain node identity throughout the computation. node degree relative coverage relative GC content relative contig length relative pentamer distance Figure 1 Model architecture of plASgraph. The model uses as inputs the assembly graph structure and five input features per node (contig). The core of the architecture is composed of six graph convolutional layers. The model generates two outputs per node, which facilitate the classification of plasmids and chromosomes as two separate classification tasks. constructed from short reads only. The hybrid assembly is used to derive ground truth 237 classification of short-read contigs, as explained in the next paragraph. The short-read 238 assembly graph is then used as an input for our model, both in training and testing scenarios.

239
In hybrid assemblies, the ground truth labels are determined based on the contig length.

240
In particular, all contigs longer than a species-specific threshold (  Table S1 for overview of experiments).

267
The scripts used for training and evaluation, as well as detailed results are available at 268 https://github.com/cchauve/plASgraph_WABI_2022.

269
Similarly to plASgraph, mlplasmids is a de novo tool, using only sequence-derived features 270 to classify contigs. However, it requires training on each species separately and was designed 271 for contigs longer than 1 kbp. In contrast, PlasForest is a species-agnostic tool, designed to 272 work also on short contigs (< 1 kbp), but it is dependent on the comparison of the input 273 sequences to sequence databases. Neither of these tools use the assembly graph information.

274
Since plASgraph was designed to explicitly handle ambiguous contigs by including separate   Figure 2 shows that the F1-score of plASgraph is higher than mlplasmids on most samples 296 from E. faecium and K. pneumoniae. Although mlplasmids outperforms plASgraph on many 297 E. coli samples, the overall score for plasmid classification is higher for plASgraph, whereas 298 the performance for the chromosome classification task is comparable (Tab. 2).

299
Our results also show that the main advantage gained by plASgraph in the species-specific   to the large phylogenetic distance, but plASgraph still performs significantly better than 320 mlplasmids. The distribution of per sample F1-scores is shown in Fig. S4.
plasmid chromosome Evaluation of the species-specific plASgraph models in comparison to mlplasmids. Each row compares the F1-score of plASgraph to mlplasmids on individual testing isolates for both plasmid (left) and chromosome (right) classification tasks. The isolates without plasmids are not shown in graphs on the left, as F1-score is then undefined. Rows from top to bottom correspond to E. faecium, E. coli, and K. pneumoniae data sets respectively. The models are always trained and tested on the same species.

Figure 3
Impact of the number of layers on the model performance. Different numbers of graph convolutional layers (GCLs) were tested for the E. faecium model. Each architecture was trained five times using random seeds and the F1-score was calculated on the 20% split validation set. The largest improvement in performance is visible with introduction of the first GCL; a slight additional increase in accuracy is observed for five and six GCLs.
features that are relative to the sample-wide statistics, and thus are less species dependent and 323 are able to use the overall context provided by the whole assembly. In contrast, mlplasmids 324 directly uses k-mer frequencies, which are highly specific to individual species. for ambiguous contigs is shown in Figure S6. Figure S7 shows scatter plot of all C. freundii 344 contigs based on their chromosome and plasmid score and their true label. It shows that 345 overall plASgraph classifies a large majority contigs accurately.

346
PlasForest uses features derived from querying input sequences against a reference 347 database, which is not used in our predictions. The higher average F1-score of plASgraph 348 therefore suggests that the contribution of information present in the assembly graph 349 combined with relative features of the contigs exceeds the contribution from homology search.

350
Moreover, independence from sequence databases makes our tool more suitable for application 351 to completely novel species.

352
PlASgraph not only provides a score for plasmidic and chromosomal contigs but also 353 outputs a visualization of an assembly graph labeled according to the predictions. Figure 5   354 shows parts of the assembly graph for C. freundii isolate SAMN15148288 with nodes colored

ground-truth
PlasForest prediction plASgraph prediction PlASgraph is not dependent on specific species and can therefore be used also for newly 367 sequenced bacteria for which no closed genome sequence is available yet. Our species-368 agnostic plASgraph model outperforms recently published, database-dependent PlasForest

369
[23] approach when compared across different bacterial species. De novo classification 370 (database independence) allows more accurate identification of previously unknown plasmids 371 and chromosomes which can be critical for diverse One-Health epidemiologic surveillance.

372
However, when desired, plASgraph can also be trained for a particular species. Our species-373 specific models are more accurate when compared to mlplasmids [5], although mlplasmids 374 performs better than plASgraph on longer contigs above 1 kbp. We hypothesize that 375 mlplasmids is able to learn to recognize chromosome contigs of a particular species through 376 their pentamer distributions at the cost of cross-species generalizability. Furthermore, this 377 approach is unreliable for classifying contigs of lengths in the range of 100-1000 bp. Accurate 378 classification of shorter contigs by plASgraph may enable identification of more complete 379 plasmids from incomplete assemblies and has a potential to facilitate novel plasmid discovery.

380
Another novel feature of plASgraph is the separation of plasmid and chromosome classi-381 fication tasks, recognizing that some contigs are ambiguous, being parts of both types of 382 molecules. These ambiguous contigs are an interesting subject for further study by themselves; 383 our preliminary analysis of ambiguous contigs in our datasets suggests that the majority of 384 them are related to transposons and phages. These mobile elements can integrate into both 385 plasmids and chromosomes within the cell.

386
The simplicity of the architecture of the plASgraph model makes it amenable to extensions.

387
For example, the use of additional information about plasmids, such as the presence of 388 plasmid-specific genes in a contig, could allow further increase in classification accuracy as 389 this additional information would propagate to nearby nodes thanks to the GNN architecture.

390
In addition, it will be interesting to investigate how plASgraph could be adapted for accurate 391 plasmid identification in metagenomic datasets, like wastewater samples, which play an 392 increasingly crucial role in monitoring antibiotic resistance [13].

Figure S2
Circular contig size distribution in hybrid assemblies. Above each plot, the total number of circular contigs is shown in parentheses. A vertical line marks the species-specific threshold in each plot; circular contigs shorter than the threshold are considered plasmidic, whereas longer contigs are considered chromosomal.  Figure S3 Distribution of short-read contig labels. Darker colors represent a higher percentage of the respective label in each dataset. Each small square shows the percentage, the absolute number, as well as the cumulative length of the short-read contigs with the respective label.

Figure S6
Accuracy of ambiguous contig classification for species-agnostic plASgraph model. Each data point represents one isolate; the plot combines isolates from all species used in Fig. 4. The isolates with no ambiguous contigs are not shown.

Figure S7
Results on the 96 C. freundii samples (9118 contigs). Each dot represents a contig, its radius being proportional to the contig length. Contigs are split in four panels according to their ground truth label. Each contig is shown with coordinates being its chromosome score (x-axis) and its plasmid score (y-axis).