Abstract
Little is known about the genomic diversity of Escherichia coli in healthy children from sub-Saharan Africa, even though this is pertinent to understanding bacterial evolution and ecology and their role in infection. We isolated and whole-genome sequenced up to five colonies of faecal E. coli from 66 asymptomatic children aged three-to-five years in rural Gambia (n=88 isolates from 21 positive stools). We identified 56 genotypes, with an average of 2.7 genotypes per host. These were spread over 37 seven-allele sequence types and the E. coli phylogroups A, B1, B2, C, D, E, F and Escherichia cryptic clade I. Immigration events accounted for three-quarters of the diversity within our study population, while one-quarter of variants appeared to have arisen from within-host evolution. Several study strains were closely related to isolates that caused disease in humans or originated from livestock. Our results suggest that within-host evolution plays a minor role in the generation of diversity than independent immigration and the establishment of strains among our study population. Also, this study adds significantly to the number of commensal E. coli genomes, a group that has been traditionally underrepresented in the sequencing of this species.
Introduction
Ease of culture and genetic tractability account for the unparalleled status of Escherichia coli as “the biological rock star”, driving advances in biotechnology (1), while also providing critical insights into biology and evolution (2). However, E. coli is also a widespread commensal, as well as a versatile pathogen, linked to diarrhoea (particularly in the under-fives), urinary tract infection, neonatal sepsis, bacteraemia and multi-drug resistant infection in hospitals (3-5). Yet, most of what we know about E. coli stems from the investigation of laboratory strains, which fail to capture the ecology and evolution of this key organism “in the wild” (6). What is more, most studies of non-lab strains have focused on pathogenic strains or have been hampered by low-resolution PCR methods, so we have relatively few genomic sequences from commensal isolates, particularly from low-to middle-income countries (7-13).
We have a broad understanding of the population structure of E. coli, with eight significant phylogroups loosely linked to ecological niche and pathogenic potential (B2, D and F linked to extraintestinal infection; A and B1 linked to severe intestinal infections such as haemolytic-uraemic syndrome) (14-17). All phylogroups can colonise the human gut, but it remains unclear how far commensals and pathogenic strains compete or collaborate—or engage in horizontal gene transfer—within this important niche (18, 19).
Although clinical microbiology typically relies on single-colony picks (which has the potential to underestimate species diversity and transmission events), within-host diversity of E. coli in the gut is crucial to our understanding of inter-strain competition and co-operation and also for accurate diagnosis and epidemiological analyses. Pioneering efforts using serotyping and molecular typing have shown that normal individuals typically harbour more than one strain of E. coli (20-22), with one individual carrying 24 distinct clones (22-24). More recently, whole-genome sequencing has illuminated molecular epidemiological investigations (9), adaptation during and after infection (25, 26), as well as the intra-clonal diversity in healthy hosts (27).
There are two plausible sources of within-host genomic diversity. Although a predominant strain usually colonises the host for extended periods (28), successful immigration events mean that incoming strains can replace the dominant strain or co-exist alongside it as minority populations (29). Strains originating from serial immigration events are likely to differ by hundreds or thousands of single-nucleotide polymorphisms (SNPs). Alternatively, within-host evolution can generate clouds of intra-clonal diversity, where genotypes differ by just a handful of SNPs (20).
Most relevant studies have been limited to Western countries, except for a recent report from Tanzania (21), so little is known about the genomic diversity of E. coli in sub-Saharan Africa. The Global Enteric Multicenter Study (GEMS) (30, 31) has documented a high burden of diarrhoea attributable to E. coli (including Shigella) among children from the Gambia, probably as a result of increased exposure to this organism through poor hygiene and frequent contact with animals and the environment. In also facilitating access to stool samples from healthy Gambian children, the GEMS study has given us a unique opportunity to study within-host genomic diversity of commensal E. coli in this setting.
Methods
Study population
We initially selected 76 faecal samples from three-to five-year-old asymptomatic Gambian children, who had been recruited from Basse, Upper River Region, the Gambia, into the GEMS study (30) as healthy controls from December 1, 2007, to March 3, 2011. Samples had been collected according to a previously described sampling protocol (32). Archived stool samples were retrieved from -80°C storage and allowed to thaw on ice. A 100-200 mg aliquot from each sample was transferred aseptically into 1.8ml Nunc tubes for microbiological processing below (Figure 1). Ten of the original 76 samples proved unavailable for processing in this study.
Bacterial growth and isolation
1 ml of physiological saline (0.85%) was added to each sample tube and vigorously vortexed at 4200 rpm for at least 2 minutes. Next, the homogenised sample suspensions were taken through four ten-fold dilution series. A100 µl aliquot from each dilution was then spread evenly on a plate of tryptone-bile-X-glucuronide differential and selective agar. The inoculated plates were incubated overnight at 37°C under aerobic conditions. Colony counts were performed on the overnight cultures for each serial dilution for translucent colonies with entire margins and blue-green pigmentation indicative of E. coli. Up to five representative colonies were selected from each sample and sub-cultured on MacConkey agar overnight at 37°C before storing in 20% glycerol broth at -80°C. Individual isolates were assigned a designation comprised of the subject ID followed by the colony number (“1-5”).
Genomic DNA extraction and genome sequencing
Broth cultures were prepared from pure, fresh cultures of each colony-pick in 1 ml Luria-Bertani broth and incubated overnight to attain between 109 – 1010 cfu per ml. Genomic DNA was then extracted from the overnight broth cultures using the lysate method described in (33). The eluted DNA was quantified by the Qubit high sensitivity DNA assay kit (Invitrogen, MA, USA) and sequenced on the Illumina NextSeq 500 instrument (Illumina, San Diego, CA) as described previously (34).
Following Dixit et al. (20), we sequenced a random selection of isolates twice, using DNA obtained from independent cultures, to help in the determination of clones and the analysis of within-host variants (Supplementary File 5). Bioinformatic analyses of the genome sequences were carried out on the Cloud Infrastructure for Microbial Bioinformatics (CLIMB) platform (35).
Phylogenetic analysis
The paired 150bp reads were quality checked and assembled, as previously described (34). Snippy v4.3.2 (https://github.com/tseemann/snippy) was used for variant calling, using the complete genome sequence of commensal E. coli str. K12 substr. MG1655 as a reference strain (NCBI accession: NC_000913.3) and to generate a core-genome alignment, from which a maximum-likelihood phylogeny with 1000 bootstrap replicates was reconstructed using RAxML v8.2.4 (36), based on a general time-reversible nucleotide substitution model. The phylogenetic tree was rooted using the genomic sequence of E. fergusonii as an outgroup (NCBI accession: GCA_000026225.1). The phylogenetic tree was visualised in FigTree v1.4.3 (https://github.com/rambaut/figtree/) and annotated in RStudio v3.5.1 and Adobe Illustrator v 23.0.3 (Adobe Inc., San Jose, California). For visualisation, a single colony was chosen to represent replicate colonies of the same strain (ST) with identical virulence, plasmid and antimicrobial resistance profiles and a de-replicated phylogenetic tree reconstructed using the representative isolates.
Multi-locus sequence typing, Clermont typing and SNPs
The merged reads were uploaded to EnteroBase (37), where de novo assembly and genome annotation were carried out, and in-silico multi-locus sequence types (STs) assigned based on the Achtman scheme, allocating new sequence types (ST) if necessary. EnteroBase assigns phylogroups using ClermontTyper and EzClermont (38, 39) and unique core-genome MLST types based on 2, 513 core loci in E. coli. Publicly available E. coli sequences in EnteroBase (http://enterobase.warwick.ac.uk/species/index/ecoli) (37) were included for comparative analysis, including 23 previously sequenced isolates obtained from diarrhoeal cases recruited in the GEMS study in the Gambia (Supplementary File 1).
We computed pairwise single nucleotide polymorphism (SNP) distances between genomes from the core-genome alignment using snp-dists v0.6 (https://github.com/tseemann/snp-dists). For the duplicate sequence reads of the same strains, we used SPAdes v3.13.2 (40) to assemble each set of reads and map the raw sequences from one sequencing run to the assembly of the other run and vice versa, as described previously (20). SNPs were detected using the CSIPhylogeny tool (https://cge.cbs.dtu.dk/services/CSIPhylogeny/) and compared between the two steps, counting only those SNPs that were detected in both sets of reads as accurate.
Accessory gene content
We used ABRicate v0.9.8 (https://github.com/tseemann/abricate) to predict virulence factors, acquired antimicrobial resistance (AMR) genes and plasmid replicons by scanning the contigs against the VFDB, ResFinder and PlasmidFinder databases respectively, using an identity threshold of ≥ 90% and a coverage of ≥ 70%. Virulence factors and AMR genes were plotted next to the phylogenetic tree using the ggtree, ggplot2 and phangorn packages in RStudio v3.5.1. We calculated co-occurrence of AMR genes among study isolates and visualised this as a heat map using RStudio v 3.5.1.
Population structure and comparison of commensal and pathogenic strains
We assessed the population structure using the hierarchical clustering algorithm in EnteroBase. Briefly, the isolates were assigned stable population clusters at eleven levels (from HC0 to HC 2350) based on pairwise cgMLST allelic differences. Hierarchical clustering at 1100 alleles differences (HC1100) resolves populations into cgST complexes, the equivalent of clonal complexes achieved with the legacy MLST clustering approaches (37). We reconstructed neighbour-joining phylogenetic trees using NINJA (41), based on clustering at HC1100 to display the population sub-clusters at this level as an indicator of the genomic diversity within our study population and to infer the evolutionary relationship among our strains and others in the public domain.
Next, we interrogated the HC1100 clusters that included both pathogenic and commensal E. coli strains recovered from the GEMS study. For the clusters that encompassed commensal and pathogenic strains belonging to the same ST, we reconstructed both neighbour-joining and SNP phylogenetic trees to display the genetic relationships among these strains. We visualised the accessory genomes for the overlapping STs mentioned above to determine genes associated with phages, virulence factors and AMR. The resulting phylogenetic trees were annotated in Adobe Illustrator v 23.0.3 (Adobe Inc., San Jose, California).
Ethical statement
The study was approved by the joint Medical Research Council Unit The Gambia-Gambian Government ethical review board.
Results
Population structure
The study population included 27 females and 39 males (Table 1). All but one reported the presence of a domestic animal within the household. Twenty-one samples proved positive for the growth of E. coli, yielding 88 isolates. We detected 37 seven-allele sequence types (STs) among the isolates, with a fairly even distribution (Figure 2). Five STs were completely novel (ST9274, ST9277, ST9278, ST9279 and ST9281). These study strains were scattered over all the eight main phylogroups of E. coli (Table 2). Hierarchical clustering of core genomic STs revealed twenty-seven cgST clonal complexes (Supplementary File 2).
Within-host diversity
Just a single ST colonised nine individuals, six carried two STs, four carried four STs, and two carried six STs. We found 56 distinct genotypes, which equates to an average of 2.7 genotypes per host. Two individuals (H-18 and H-2) shared an identical strain belonging to ST9274 (zero SNP difference) (Supplementary File 4, yellow highlight), suggesting recent transfer from one child to another or recent acquisition from a common source.
We observed thirteen cases where a single host harboured two or more variants within the same SNP cloud (Table 2). Such within-host evolution accounted for around a quarter of the observed variation, with immigration explaining the remaining three quarters. 22% of within-host mutations represented synonymous changes. 43% were non-synonymous mutations, while 31% occurred in non-coding regions, and 4% represented stop-gained mutations (Supplementary File 6). The average number of SNPs among variants within such SNP clouds was 5 (range 0-18) (Table 3). However, in two subjects (H36 and H37), pairwise distances between genomes from the same ST (ST59 and ST5148) were as large as 14 and 18 SNPs respectively (Supplementary File 4, grey highlight).
Accessory gene content and relationships with other strains
A quarter of our isolates were most closely related to commensal strains from humans, with smaller numbers most closely related to human pathogenic strains or strains from livestock, poultry or the environment (Table 4). One isolate was most closely related to a canine isolate from the UK. Three STs (ST38, ST10 and ST58) were shared by our study isolates and diarrhoeal isolate from the GEMS study (Supplementary Figure 2), with just eight alleles separating our commensal ST38 strain from a diarrhoeal isolate from the GEMS study (Figure 5).
We detected 130 genes encoding putative virulence factors across the 88 study isolates (Figure 2; Supplementary File 3). More than half of the isolates encoded resistance to three or more clinically relevant classes of antibiotics (Figure 3; Supplementary Figure 1). The most common resistance gene network was -aph(6)-Id_1-sul2 (41% of the isolates), followed by aph(3”)-Ib_5-sul2 (27%) and bla-TEM-aph(3”)-Ib_5 (24%). Most isolates (67%) harboured two or more plasmid types (Figure 4). Of the 24 plasmid types detected, IncFIB was the most common (41%), followed by col156 (19%) and IncI_1-Alpha (15%). Nearly three-quarters of the multi-drug resistant isolates carried IncFIB (AP001918) plasmids, suggesting that these large plasmids disseminate resistance genes within our study population.
Discussion
This study provides an overview of the within-host genomic diversity of E. coli in healthy children from a rural setting in the Gambia, West Africa. Surprisingly, we recovered a low rate of colonisation than reported elsewhere among children of similar age groups (42), with only a third of our study samples yielding growth of E. coli. This may reflect geographical variation but might also be some hard-to-identify effect of the way the samples were handled, even though they were kept frozen and thawed only just before culture.
Several studies have shown that sampling a single colony is insufficient to capture E. coli strain diversity in stools (20, 21, 23). Lidin-Janson et al. (43) claim that sampling five colonies provides a >99% chance of recovering dominant genotypes from single stool specimens, while Schlager et al. (24) calculate that sampling twenty-eight colonies provides a >90% chance of recovering minor genotypes. Our results confirm the importance of multiple-colony picks in faecal surveillance studies, as over half (57%) of our strains would have been missed by picking a single colony.
Although our strains encompassed all eight major phylotypes of E. coli, the majority fell into the A and B1 phylogenetic groups, in line with previous reports that these phylogroups dominate in stools from people in low- and middle-income countries (44, 45). The prevalence of putative virulence genes in most of our isolates highlights the pathogenic potential of commensal intestinal strains—regardless of their phylogroup—should they gain access to the appropriate tissues, for example, the urinary tract. Our results complement previous studies reporting genomic similarities between faecal E. coli isolates and those recovered from urinary tract infection (25, 46).
We found that within-host evolution plays a minor role in the generation of diversity, in line with Dixit et al. (20), who reported that 83% of diversity originates from immigration events, and with epidemiological data suggesting that the recurrent immigration events account for the high faecal diversity of E. coli in the tropics (47). Co-colonising variants belonging to the same ST tended to share an identical virulence, AMR and plasmid profile, signalling similarities in their accessory gene content. The estimated mutation rate for E. coli lineages is around one SNP per genome per year (48), so that two genomes with a most recent common ancestor in the last five years would be expected to be around ten SNPs apart. However, in two subjects, pairwise distances between genomes from the same ST (ST59 and ST5148) were large enough (14 and 18 respectively) to suggest that they might have arisen from independent immigration events, as insufficient time had elapsed in the child’s life for such divergence to occur within the host. However, it remains possible that the mutation rate was higher than expected in these lineages, although we found no evidence of damage to DNA repair genes. More than half of our isolates encode resistance to three or more classes of antimicrobials echoing the high rate of MDR (65%; confirmed by phenotypic testing) in the GEMS study. IncFIB (AP001918) was the most common plasmid Inc type from our study, in line with the observation that IncF plasmids are frequently associated with the dissemination of resistance (49). However, a limitation of our study is that we did not perform phenotypic antimicrobial resistance testing, although Doyle et al. (50) reported that only a small proportion of genotypic AMR predictions are discordant with phenotypic results.
Comparative analyses confirm the heterogeneous origins of the strains reported here, documenting links to other human commensal strains or isolates sourced from livestock or the environment. This is not surprising, as almost all study participants reported that animals are kept in their homes and children in rural Gambia are often left to play on the ground, close to domestic animals such as pets and poultry (51).
Our results show that the commensal E. coli population in the gut of healthy children in rural Gambia is richly diverse, with the independent immigration and establishment of strains contributing to the bulk of the observed diversity. Besides, this work has added significantly to the number of commensal E. coli genomes, which are underrepresented in public repositories. Although solely observational, our study paves the way for future studies aimed at a mechanistic understanding of the factors driving the diversification of E. coli in the human gut and what it takes to make a strain of E. coli successful in this habitat.
Data summary
All genomic assemblies for the strains included in this study are freely available from EnteroBase (http://enterobase.warwick.ac.uk/species/index/ecoli). The EnteroBase genome assembly barcodes are provided in Supplementary Files 1 and 2.
The raw genomic sequences have been deposited in the NCBI SRA, under the BioProject ID PRJNA658685 and accession numbers SAMN15880286 to SAMN15880281.
Conflicts of interest
We declare no conflicts of interest.
Author contributions
Conceptualization: MA, MP; data curation, MP, NFA; formal analysis: EFN; analytical support: MD; funding: MA and MP; sample collection and storage: MJH, UNI, PET, CO; data management: SG; laboratory experiments, EFN, supervision, NFA, MP, JO, MA; manuscript preparation – original draft, EFN; review and editing, NFA, MP; review of the final manuscript, all authors.
Funding information
MA, MJH, UNI, SG, CO, PET and MP were supported by the Medical Research Council Unit, The Gambia at London School of Hygiene and Tropical Medicine. The BBSRC Institute Strategic Programme, Microbes in the Food Chain (BB/R012504/1 and its constituent projects 44414000A and 4408000A) supported EFN and MP. NFA was supported by the Quadram Institute Bioscience BBSRC funded Core Capability Grant (project number BB/ CCG1860/1). The funders played no role in the study design, data collection and analysis, the decision to publish, or the preparation of the manuscript.
Supplementary material
Supplementary Figure 1
A co-occurrence matrix of acquired antimicrobial resistance genes detected in the study isolates. The diagonal values show how many isolates each individual gene was found in, while the intersections between the columns represent the number of isolates in which the corresponding antimicrobial resistance genes co-occurred.
Supplementary Figure 2
A Neighbour-joining phylogenetic tree depicting the genetic relationships among twenty-four strains isolated from diarrhoeal cases in the GEMS study 4. The Sequence types identified in these isolates are shown in the legend, with the genome count displayed in square brackets next to the respective sequence types. Three STs (ST38, ST58 and ST10) overlapped with what was found among commensal strains from this study (see Figure 2).
Supplementary File 1
Sequencing statistics and characteristics of twenty-four previously sequenced GEMS cases included in this study.
Supplementary File 2
A summary of the sequencing statistics of the study isolates reported in this study.
Supplementary File 3
A summary of the virulence factors detected among the study isolates and their known functions.
Supplementary File 4
A pairwise single nucleotide polymorphism matrix showing the SNP distances between the study genomes.
Supplementary File 5
List of the sample clones for which two independent cultures were obtained and sequenced, to find the SNPs between the same clones.
Supplementary File 6
Mutations in variants inferred to have been derived from within-host evolution.
Acknowledgements
We gratefully acknowledge the study participants in GEMS and all clinicians, field workers and the laboratory staff of the Medical Research Council Unit The Gambia at London School of Hygiene and Tropical Medicine involved in the collection and storage of stools in the GEMS study in Basse Field Station and Fajara.
Footnotes
Minor edits (typographical).