Analysis of the limited M. tuberculosis accessory genome reveals potential pitfalls of pan-genome analysis approaches

Pan-genome analysis is a fundamental tool in the study of bacterial genome evolution. Benchmarking the accuracy of pan-genome analysis methods is challenging, because it can be significantly influenced by both the methodology used to compare genomes, as well as differences in the accuracy and representativeness of the genomes analyzed. In this work, we curated a collection of 151 Mycobacterium tuberculosis (Mtb) isolates to evaluate sources of variability in pan-genome analysis. Mtb is characterized by its clonal evolution, absence of horizontal gene transfer, and limited accessory genome, making it an ideal test case for this study. Using a state-of-the-art graph-genome approach, we found that a majority of the structural variation observed in Mtb originates from rearrangement, deletion, and duplication of redundant nucleotide sequences. In contrast, we found that pan-genome analyses that focus on comparison of coding sequences (at the amino acid level) can yield surprisingly variable results, driven by differences in assembly quality and the softwares used. Upon closer inspection, we found that coding sequence annotation discrepancies were a major contributor to inflated Mtb accessory genome estimates. To address this, we developed panqc, a software that detects annotation discrepancies and collapses nucleotide redundancy in pan-genome estimates. We characterized the effect of the panqc adjustment on both pan-genome analysis of Mtb and E. coli genomes, and highlight how different levels of genomic diversity are prone to unique biases. Overall, this study illustrates the need for careful methodological selection and quality control to accurately map the evolutionary dynamics of a bacterial species.


Summary of supplemental figures & tables
Table S1.18 bubble regions identified with at least 1 kb of novel sequence relative to H37Rv.Table S2.Annotation type summary for H37Rv reference genome annotations.

SR1 -Comparison of short and long-read characteristics
We compared the assembly quality characteristics between the short-read and long read assemblies (Table 1, Figure S1).All complete assemblies consist of a single circular contig, while the short-read assemblies are consistently more fragmented with a median number of contigs of 116 (IQR: 106 -132).The median genome size of the complete assemblies is 4,413 kb (4,407 kb -4,421 kb).The cumulative length of the short-read assemblies are slightly lower than the complete genome assemblies, with a median length of 4,314 kb (IQR: 4,289 kb -4,336 kb).Both technologies produced assemblies with a highly similar GC content, but the short-read assemblies varied slightly more: median GC content of the hybrid assemblies was 65.6%, IQR: 65.6%-65.6%,and for short-read assemblies the median GC content was 65.5%, IQR 65.5% -65.6%.We found very high conservation in terms of the # of predicted ORFs for both short and complete assemblies.The median number of ORFs predicted for complete assemblies was 4074 (IQR: 4065 -4088), while for short-read assemblies it was 4035 (IQR: 4022 -4046).In summary, hybrid and short-read assembly characteristics were similar except that short-read assemblies systematically were less continuous, had lower cumulative length, and contained less predicted CDSs than complete assemblies.

SR2 -Identifying novel sequence relative to the H37Rv reference genome
Next, we wanted to evaluate what proportion of SV nodes in the graph represented sequences not present in the H37Rv reference genome.These SVs are of interest as they represent sequences that were lost in the ancestor to H37Rv but are present in other Mtb isolates.We found that only 160/2025 of evaluated SV nodes contained novel nucleotide content relative to H37Rv, with a cumulative length of 66,621 bp.Thus, only 5.5% of evaluated SV nodes within the graph represent novel sequence content relative to H37Rv.Thus, a majority of evaluated SV nodes in the graph are composed of either reconfiguration of already existing nucleotide content or deletions of regions that are present in the H37Rv reference genome.
The 160 SV nodes with novel sequence content relative to H37Rv were spread across 65 different bubble regions in the Mtb pan-genome.To focus our attention on larger SVs with novel sequences relative to H37Rv.From this list we were motivated to quantify the number of potentially large deletions (1 > kb) that occurred in the ancestor of the H37Rv strain.We generated a short list of 18 bubble regions that contained at least 1 kb of novel sequence relative to H37Rv (Table S1).The length of cumulative sequence that was novel relative to H37Rv ranged from 1153 bp to 7,900 bp per bubble region.
Notably, Bubble 193 matched the exact coordinates of a known structural variant within the Mtb population, Tbd1.The Tbd1 region is a deletion specific to lineages 2-4 of the Mtb population, and implicated in resistance to oxidative stress and hypoxia (1,2).Our accurate detection of TbD1 serves as a positive control for the ability of our pan-genome graph approach to detect large structural differences between Mtb isolates.We further evaluated the graph representation of the Tbd1 deletion in terms of its genomic and phylogenetic context (Figure S3).Bubble 193 contains 3 different nodes and has three different observed configurations.The first observed configuration found within the graph is a complete deletion of the expected Tbd1 sequence (ΔTbD1), and this allele was exclusively found in all lineage 2-4 isolates.The second observed configuration was a completely intact TbD1 sequence, found in lineages 1,5, & 6.This distribution of ΔTbD1 and +TbD1 alleles exactly matches prior knowledge of the phylogenetic distribution of this deletion across lineages 1-6.Interestingly, a third configuration of bubble 193 was observed for the single lineage 8 isolate within our collection.The lineage 8 isolate was found to have an IS6110 insertion element disrupting the mmpL6 gene, presumably resulting in a loss of function.This supports likely convergent disruption of the mmpL6 gene in the Mtb phylogeny, in the first case through deletion and second case through transposon insertion.

SR3 -Investigating differences between two bacterial genome annotation pipelines on H37Rv
Motivated by the consistent decrease in the predicted pan-genome size when using PGAP, we explored the differences between the PGAP and Bakta annotation pipelines.To compare these softwares on an identical genome sequence, we first annotated the H37Rv reference genome with both Bakta and PGAP and then compared these annotation results with the official annotations (Table X).Although both tools predicted approximately the same number of total genes, we found that PGAP annotated 138 less protein coding sequences (CDSs) than Bakta, but annotated 150 more pseudogenes.PGAP only predicted 3,941 CDSs in the H37Rv reference, while Bakta predicted 4,079 total CDSs.
We manually inspected all pseudogenes in both annotations and found a high frequency of cases where a pseudogene identified by PGAP, is called as multiple CDSs by the Bakta software.In many cases a single pseudo will be annotated as multiple CDSs, resulting in a split CDS scenario.PGAP overall is more likely to call a sequence a pseudogene if it finds a suspect frameshift mutation is present.In contrast, Bakta is much more likely to call new CDSs in the position of a gene with a frameshift mutation.Notably, all the pan-genome analysis pipelines evaluated ignore pseudogenes, meaning any region annotated as a pseudogene will be ignored in downstream analysis and be treated as non-coding sequence.Thus, these annotation differences have consequences for pan-genome analyses that are based primarily on annotated CDS sequences.

A) PacBio Subreads Assembly & Polishing Pipeline
For all assemblies that were done with PacBio subreads, either from the RS II and Sequel II platforms, the following assembly processing was taken: All long reads (PacBio subreads) were assembled using Flye (v2.6).After assembly, Flye performed three rounds of iterative polishing of the genome assembly with long reads, producing a polished de novo long read assembly.If Flye identified the presence of a complete circular contig, Circlator (v1.5.5) was used to standardize the start of each assembly at the DnaA (Rv0001) locus.The paired-end Illumina WGS reads were trimmed with Trimmomatic (v0.39).Trimmed reads were aligned to the associated de novo long read assembly with BWA-MEM (v0.7.17).Duplicate reads were removed from the resulting alignments using PICARD (v2.22.5).Using the deduplicated alignments, Pilon (v1.23) was then used to correct SNSs and small INDELs in the de novo PacBio assembly.All relevant code, with exact run parameters, can be found in the 1.Mtb.Generate.HybridAsm.PacBioRSII.smkscript (Runnable as a Snakemake workflow).

B) ONT Assembly & Polishing Pipeline
For all assemblies made with Oxford Nanopore (Pore v9.4.1) the following assembly processing was taken: All ONT (v9.4.1) long reads ONT were assembled using Flye (v2.6).After the initial assembly, Flye performed three rounds of iterative polishing of the genome assembly with long reads, producing a polished de novo long read assembly.Next, the Medaka software was used to polish the de novo assembly with the ONT reads.If Flye identified the presence of a complete circular contig, Circlator (v1.5.5) was used to standardize the start of each assembly at the DnaA (Rv0001) locus.The paired-end Illumina WGS reads were trimmed with Trimmomatic (v0.39).Trimmed reads were aligned to the associated de novo long read assembly with BWA-MEM (v0.7.17).Duplicate reads were removed from the resulting alignments using PICARD (v2.22.5).Using the deduplicated alignments, Pilon (v1.23) was then used to correct SNSs and small INDELs in the de novo PacBio assembly.Next, the assembly was further polished with short reads using the PolyPolish (v0.5) pipeline.All relevant code, with exact run parameters, can be found in the 2.Mtb.Generate.HybridAsm.PacBioHiFi.smkscript (Runnable as a Snakemake workflow).

C) PacBio CCS (HiFi) Assembly & Polishing Pipeline:
For all assemblies generated from Sequel II CCS/HiFi reads the following assembly processing was taken: All PacBio HiFi reads were assembled using Flye (v2.9).After assembly, Flye performed three rounds of iterative polishing of the genome assembly with long reads, producing a polished de novo long read assembly.If Flye identified the presence of a complete circular contig, Circlator (v1.5.5) was used to standardize the start of each assembly at the DnaA (Rv0001) locus.The paired-end Illumina WGS reads were trimmed with Trimmomatic (v0.39).Trimmed reads were aligned to the associated de novo long read assembly with BWA-MEM (v0.7.17).Duplicate reads were removed from the resulting alignments using PICARD (v2.22.5).Using the deduplicated alignments, Pilon (v1.23) was then used to correct SNSs and small INDELs in the de novo PacBio assembly.All relevant code, with exact run parameters, can be found in the 2.Mtb.Generate.HybridAsm.PacBioHiFi.smkSnakemake workflow.

Evaluating sequence content of SV nodes within Mtb SV pan-genome graph
For the k-mer analysis of the graph, each node's sequence was broken down into all 31 bp subsequences.The mmh3 python library was used to generate the MurmurHash for all 31-mers within the dataset.Additionally, all k-mers were converted to their canonical k-mer such that a k-mer and its reverse complement would be given the same hash.For measurement of k-mer overlap between nodes and the rest of the graph the Jaccard Containment metric was used.Jaccard Containment is defined as the size of the union of two k-mer sets A and B divided by the length of set A. Put simply, the Jaccard containment is measured as the proportion of k-mers found within A that are also present in a second sequence, B.
For identification of SV nodes with unique sequence content, low k-mer overlap was defined as a k-mer Jaccard Containment of < 0.05.If a node was smaller than 31 bp it was excluded from this analysis.The ~500 SV nodes with length < 31 bp only have a cumulative length 6149 bp out of the 1,301,211 bp of SV nodes within the graph.As these short SV regions accounted for a small proportion (0.5%) of total SV node sequence, we expect that excluding them is unlikely to change our description of the Mtbc accessory genome.
To access the insertion sequence (IS) elements and phage sequences due to the fact that those classifications are the recognized MGEs in the Mtb genome.We compared the k-mer profile of each SV-node to all sequences annotated as related to "Insertion sequence elements & phages" in the H37Rv reference genome (Kapopoulou et al. 2011).Table S4.Summary of panqc nucleotide similarity clustering of Mtb pan-genome estimates.For each estimate analyzed with panqc, the following relevant statistics are shown after nucleotide similarity clustering using a minimum 31-mer overlap (maximum Jaccard Similarity observed) of 0.8: 1) The number of total nucleotide similarity (NS) clusters identified, 2) total number of genes merged into NS clusters, and 3) the mean number of genes belonging to each NS cluster.Table S5.Summary of NRC nucleotide similarity clustering of E. coli pan-genome estimates.For each estimate analyzed with panqc, the following relevant statistics are shown after nucleotide similarity clustering using a minimum 31-mer overlap (maximum Jaccard Similarity observed) of 0.8: 1) The number of total nucleotide similarity (NS) clusters identified, 2) total number of genes merged into NS clusters, and 3) the mean number of genes belonging to each NS cluster.

Figure S2 .
Figure S2.Comparison of assembly characteristics between complete assemblies and short-read de novo assemblies.

Figure S3 .
Figure S3.Evaluation of structural variation in TbD1 locus using a pan-genome graph.

Figure S4 .
Figure S4.Mtb Pan-genome estimates for all pipeline-parameter combinations evaluated.

Figure S5 .
Figure S5.Example of pseudogenes detected by PGAP in place of two CDSs by Bakta.

Figure S6 .
Figure S6.Evaluating scenarios contributing to inferred absences in short-read assemblies between low and high BUSCO completeness scores.

Figure S7 .
Figure S7.Mtb pan-genome estimates before and after NRC correction.

Figure S8 .
Figure S8.E. coli pan-genome estimates before and after NRC correction.

Figure S9 .
Figure S9.Overview of gene gain and loss events from Mtb gene ancestral state reconstruction.

Figure S1 .
Figure S1.Overview of k-mer jaccard similarity across Mtb genomes.Heatmap and histogram of Jaccard Similarity of k-mers (k = 31 bp) across all pairs of complete Mtb genomes.The median k-mer jaccard similarity between all pairs was 0.97 (IQR: 0.96 -0.98), while the most distant pair of genomes had a k-mer jaccard similarity of 0.94.

Figure S8 Figure
Figure S8

Table S3 .
Summary table of feature types annotated by Bakta & PGAP across all 151 complete Mtb

Table S4 .
Summary of panqc nucleotide similarity clustering of Mtb pan-genome estimates.

Table S5 .
Summary of panqc nucleotide similarity clustering of E. coli pan-genome estimates.

Table S1 . 18 bubble regions identified with at least 1 kb of novel sequence relative to H37Rv.
(2)h bubble identified has at least 1 kb of sequence from SV nodes that have low k-mer match to the H37Rv reference genome.Coordinates of each bubble are given relative to the H37Rv genome assembly.Bubble #193 overlaps exactly with the known coordinates of the 2.1 kb TbD1 deletion, which is implicated in increased resistance to oxidative stress and hypoxia(2).Bubble #116 overlaps with a 4.4 kb sequence identified to be exclusive to lineage 8 of the MTBC(3).