Pan-genome analysis of Mycobacterium africanum: insights to dynamics and evolution

Mycobacterium tuberculosis complex (MTBC) consists of seven major lineages with three of them reported to circulate within West Africa: lineage 5 (West African 1) and lineage 6 (West African 2) which are geographically restricted to West Africa and lineage 4 (Euro-American lineage) which is found globally. It is unclear why the West African lineages are not found elsewhere; some hypotheses suggest that it could either be harboured by an animal reservoir which is restricted to West Africa, or strain preference for hosts of West African ethnicity, or inability to compete with other lineages in other locations. We tested the hypothesis that M. africanum (MAF) might have emigrated out of West Africa but was outcompeted by more virulent strains of M. tuberculosis (MTB). Whole genome sequences of MTB from Nigeria (n=21), China (n=21) and MAF from Mali (n=24) were retrieved, and a pan-genome analysis was performed after fully annotating these genomes. The outcome of this analysis shows that Lineages 4, 5 and 6 have relatively close pan-genomes whilst lineage 2 has an open pan-genome. We also see a correlation in numbers of some multiple copy core genes and amino acid substitution with lineage specificity that may have contributed to geographical distribution of these lineages. The findings in this study provides a perspective to one of the hypotheses that M. africanum might find it difficult to compete against the more modern lineages outside West Africa hence its localization to the geographical region.


25
Abstract: Mycobacterium tuberculosis complex (MTBC) consists of seven major lineages with three of them reported to circulate within West Africa: lineage 5 (West African 1) and lineage 6 (West African 2) which are geographically restricted to West Africa and lineage 4 (Euro-American lineage) which is found globally. It is unclear why the West African lineages are not found 30 elsewhere; some hypotheses suggest that it could either be harboured by an animal reservoir which is restricted to West Africa, or strain preference for hosts of West African ethnicity, or inability to compete with other lineages in other locations.
We tested the hypothesis that M. africanum (MAF) might have emigrated out of West Africa but was outcompeted by more virulent strains of M. tuberculosis (MTB).
35 Whole genome sequences of MTB from Nigeria (n=21), China (n=21) and MAF from Mali (n=24) were retrieved, and a pan-genome analysis was performed after fully annotating these genomes. The outcome of this analysis shows that Lineages 4, 5 and 6 have relatively close pan-genomes whilst lineage 2 has an open pan-genome. We also see a correlation in numbers of some multiple copy core genes and amino acid substitution with lineage specificity that may have contributed to 40 geographical distribution of these lineages.
The findings in this study provides a perspective to one of the hypotheses that M. africanum might find it difficult to compete against the more modern lineages outside West Africa hence its localization to the geographical region.

Introduction:
45 Tuberculosis (TB) is the world leading cause of death from infectious diseases, with the latest TB report stating over 1.1 million deaths globally in 2018, with two thirds from low-to-middle income countries such as China, Indonesia, Nigeria, the Philippines, Pakistan, South Africa and Bangladesh (1). Mycobacterium tuberculosis complex (MTBC), the causative organism for tuberculosis has been studied widely and is known to be a monoclonal bacteria compared to other bacterial models 50 acquiring variation predominantly through mutations (2). MTBC is grouped in two major ancestries, the ancient lineages and the modern lineages, both of which are further grouped into lineages and sub lineages based on the geographical regions that they are found. Lineage 1 (East Africa, Philippines, in the region of the Indian Ocean), lineage 2 (East Asia), lineage 3 (East Africa, Central Asia), lineage 4 (Europe, America and Africa), lineage 5 (West Africa 1), lineage 6 (West 55 Africa 2) and lineage 7 in Ethiopia (3). The ancestry of these lineages was established by the study of 20 variable genomic regions that are caused by insertion or deletion events (4). For example, the absence or presence of an MTB specific deletion known as TbD1 and regions of difference (RD) helps to classify MTBC into modern (lineages 2,3 and 4) or ancient (lineages 1, 5. 6 and 7) strains. Different lineages of MTBC have shown to present different symptoms and immunological 60 responses as seen in several studies. Although there is limited knowledge of TB virulence, pathogenicity does not associate with classical virulence factors like toxins (5), but rather with other complex factors, such as other bacterial infections being harboured by the patient that can interact with the tubercle bacilli, the host immune system, or even environmental factors resulting in a complex cascade of events (6). Furthermore, reports have linked TB infectivity to human genetic 65 susceptibility, with certain polymorphisms in the human genome relating to certain MTBC lineages in their respective geographic distribution (7)(8)(9). Interestingly, a study found that a particular variant in a gene responsible for autophagy in humans, IRGM-261T, influences the susceptibility of TB caused by MAF but not MTB (10). Additionally, findings from research on Ghanian populations showed that variants associated with 5-lipoxygenase (ALOX5) waere associated with 70 an increased TB risk (11). This suggests that MTBC might be adapted to certain populations just as they are geographically distributed (12). With the advent of next generation sequencing and the continuous reduction in the cost to sequence the entire genome of an organism, studies have moved from the analysis of a single or few genomes to multiple or a collection of genomes. Pan-genome analysis is a product of the breakthrough of multi-genome study in molecular biology (16).
85 The pan-genome can be described as the collection of entire genes in a particular species. This comprises the core genes shared by all strains, dispensable genes shared by two or more strains, and unique genes also known as singletons that are peculiar to specific strains. This can be used in describing bacterial species as many species differ by their gene content to a large extent (17). The core genes are responsible for the major phenotype and basic biological processes of the bacteria, 90 whilst the accessory and unique genes may be involved in other metabolic pathways such as adaptation to a particular host, virulence, antibiotic resistance and other functions that confer selective benefits over other species (16). Pan-genome analysis has been performed on more than 50 bacteria species in the past decade and this has revealed interesting information relating to pathogenesis, bacteria evolution, drug resistance, host specialization, horizontal gene transfer 95 (HGT) (18,19). Pan-genome analysis has also been adopted for viruses, fungi and plants (20,21).
Studies have also used pan-genome analysis for identifying potential vaccine candidates against bacterial infections (18) The purpose of this study was to compare the entire gene set of M. africanum and the most commonly found TB sub lineage in Nigeria L4.6.2 (also known as the Cameroon sub lineage) and 100 L2.2.1 (also known as the Beijing lineage) as an out-group to understand the evolution, genome dynamics, metabolic pathways and also genes involved in biological processes.

Sample collection and filtering:
We first retrieved fully assembled genomes of M. africanum (n=24) out of 30 available genomes on 105 the NCBI Reference Sequence Database (RefSeq), selected sequences were those that were not derived from surveillance projects or contain anomalies. We retrieved an additional 30 raw sequence datasets from the Senghore et al. study (15), and selected 21 genomes after filtering and profiling them according to respective lineages using TB-Profiler version 2.3 (22). Thus, we selected only the Cameroon sub lineage of the Euro-American lineage. An out-group set of lineages 110 was used for comparison by obtaining 43 genome assemblies from NCBI RefSeq that were sequenced from China. These selected genomes were classified into subtype lineages using Biohansel (23), and only genomes that belongs to the Beijing sub lineage (n=21) were selected.

Bioinformatics analysis:
The genomes of the Cameroon sub lineages were assembled using a de novo approach with SPAdes 115 version 3.11.1 (24). The scaffolds of the MAF genomes from Mali, Cameroon sub lineage and the Beijing lineage were annotated using Prokka version 1.12 (25). The annotated genomes were analysed with the Bacterial Pan Genome Analysis Tool (BPGA) version 1.3 (16), using 95% similarity for orthologous clustering and both pan-and core genomes were calculated over 20 permutations to prevent bias. Additionally, the pan-genome functional analysis was carried out 120 utilizing KEGG, COG metabolic and functional pathways, which were all visualised with LibreOffice Calc plot functions. Panaroo (26) and PopPUNK (27) were also employed for pangenome investigation of gene profiles and clustering core and accessory genomes.

Mycobacterium africanum:
125 The identification of core genes (genes shared between all strains), accessory genes (genes shared by two or more strains but not all) and unique genes (genes peculiar to individual strains) were clustered respectively. The pan-genome analysis across 24 genomes of M. africanum showed that they have 3974 ± 20 genes (mean ± standard deviation). The number of accessory genes ranged from 194 to 222 genes and the unique set of genes varied from 1 to 38 genes (Supplementary table   130 1A). Empirical power law equations and exponential equations were used to generate pan-and core genome curves, which showed that the pan-genome curve has almost reached a plateau as the exponent parameter calculated is 0.03 (Fig 1). This argues that the 24 genomes analysed are sufficient to obtain an accurate estimate of pan-and core genome size, with additional samples yielding diminishing returns. 140 Pan-genome functional analysis was done using KEGG pathway database and KEGG IDs assigned to orthologous protein clusters in the core, accessory and singleton genes and matched against the database, as shown in Fig 2. The highest gene contents in the pan-genome of M. africanum were responsible for biological processes such as metabolism, a remarkable amount of unique genes account for environmental information processing and organismal systems. A more detailed KEGG 145 classification showed that majority of unique gene sets were responsible for signalling molecules and interaction, signal transduction, infectious diseases, digestive system and cellular community (Fig 3). The classification of core, accessory and unique genes in numbers after clustering into orthologous groups was performed. The 21 genomes from the Cameroon sub lineage had 4067 ± 18 genes (mean ± standard deviation). Accessory gene numbers ranged from 242 to 289 and singleton gene 160 numbers were from 19 to 55 genes (Supplementary table 1B). Power law equations and exponential equations were used to generate pan and core genome curves which reflects the curve has almost reached a plateau as the exponent parameter calculated is 0.06 (Fig 1).
A high percentage of genes in the core genome were linked to metabolic processes after functional classification of the pan-genome (Fig 2). A more detailed KEGG classification of the pan genome 165 showed that majority of the accessory genes are related to infectious diseases and cellular community (Fig 3).

Mycobacterium tuberculosis sub lineage 2.2.1 (Beijing sub lineage):
170 Orthologous clustering of the pan-genome showed that the analysis of 20 genomes have 4023 ± 83 genes (mean ± standard deviation). The number of accessory genes ranged from 359 to 435 genes, with 3546 genes shared between all strains (core genes) and unique gene sets varied from 0 to 338 genes per strain (Supplementary table 1C). Exponential regression was used to generate the number of core genes whilst power law regression was used to extrapolate the pan-genome curve which 175 reflects an open genome with an exponent parameter 0.12 (Fig 1).
KEGG classification of core, accessory and unique genes into functional roles by clustering genes into orthologous groups argued that the majority of core and accessory genes are responsible for metabolic functions (Fig 2) and a detailed classification of functional genes showed that a relative quantity of the core genes are associated with carbohydrate metabolism, whilst the accessory genes 180 are linked with signal transduction and infectious diseases (Fig 3).

Pan-genome comparative analysis
Following the clustering of respective TB datasets into core, accessory and unique genes, the core genes of MAF were almost equal to the number of core genes in the Cameroon sub lineage (MTB-4.6.2) but had fewer accessory genes than the Cameroon sub lineages. However, the Beijing sub 185 lineages (MTB-2.2.1) had fewer core genes than MAF and MTB-4.6.2, but more accessory genes than both (Supplementary table 1). The red, orange and cyan lines in Figure 1  Using BPGA to cluster the core, accessory and unique genes and assigning KEGG functional pathways to them, about 90% of the core genes of the three lineages are responsible for metabolism related pathways. However, accessory gene assignment of metabolic, organismal system and environmental information related pathways in the lineages are: MAF (54%, 52% and 53%), MTB-195 4.6.2 (22%, 28% and 32%) and MTB-2.2.1 (66%, 25% and 27%) as shown in Figure 2.

Evolution of M. africanum
Core genes are responsible for survival and majority of biological processes in the bacteria. Those that exist in copies, also known as "multi copy core genes" (MCG) were studied as these genes, especially rRNAs in bacteria, have been seen to influence adaptation to environmental pressures 200 and the structure of microbiomes (29). We examined conserved genes of these lineages that exhibit copy number variation (CNV) and constructed a phylogeny based on their core genomes ( Figure 4).
Eleven (  showed little evolution in the CNV and lastly, PPE proteins 40, 51 and 57 displayed lineage distinct 215 CNV across the genomes (Fig 4).
On the other set of gene families studied, fadD13, fadD15, fadD25 and fadD32 genes which are responsible for fatty acid CoA ligase, all showed CNV in the TB strains during the course of evolution except the fadD32 gene (Fig 4). Additionally, multiple sequence alignment of fadD13 genes in MAF and MTB genomes showed substitution mutations A8G in fadD13_2, S108F and 220 A123V in fadD13_3 (Fig 5).

Discussion:
Despite the clonal nature of MTBC, our findings shows that the addition of new genomes to the 230 analysis will not lead to the discovery of new phenotypes in MAF due to its close pan-genome. In addition, we also observed copy number variations of some core genes that may be related to the geographically restricted specialists (MAF) and globally distributed generalists (MTB-4.6.2 and MTB-2.2.1).
The reduction in number of core genes in MTB-2.2.1 compared to MAF and MTB-4.6.2 can be 235 linked to increased virulence as previous work have shown that the Beijing sub lineage is more pathogenic than ancient lineages. Genome reduction causes bacteria to have increased virulence compared to their counterparts with larger genomes (33,34). Varying numbers of accessory and unique genes which could also be responsible for drug resistance, virulence or preference to a particular host which plays a key role in lineage classification, geographical distribution and distinct 240 lifestyles of some TB strains as reported in earlier studies (12,16). Furthermore, investigation into some MCG show copy number variation and amino acid substitution during the evolution of MTBC, this could also attribute to host preference and the geographical peculiarity of MAF.
In the detailed KEGG classification shown in Fig 3, Fig 4). This may also be related to the geographic distribution and host specificity as housekeeping genes are responsible for survival of pathogens (37).
One of the multiple copy core genes, fadD13, which codes for long chain fatty acid COA ligase in 255 MTB and maintains appropriate mycolic acid composition and permeability of the envelope on its exposure to acidic pH (38,39), was investigated for evolution in the lineages. The fadD13 gene is encoded by operon MymA and is essential for virulence and in vivo progression of MTBC (40).
Multiple alignments of gene copies of fadD13 (Fig 5) in the TB genomes suggests that these lineage-specific mutations could have also shaped the distinctive features of the modern and ancient 260 lineages in this study, as numerous fadD13 protein variants have been seen to influence ATP binding (40).

Conclusion:
Pan-genome analysis of MAF in this work showed that a higher number of orthologous genes in the dispensable genome may have contributed to the restriction of global distribution and reduced 265 virulence compared to modern lineages MTB-4.6.2 and MTB-2.2.1. Also, substitutions in some multiple copy core genes and copy number variations might have influenced evolution of MAF and its geographic distribution.
Inasmuch as it is hard to say if MAF migrated out of West Africa and got outcompeted by modern lineages, it is almost certain that it might soon be outcompeted by modern virulent strains in West 270 Africa due to its close pan genome.

Acknowledgement:
We would like to show appreciation to our colleagues at the African Centre of Excellence for