Human reference gut microbiome comprising 5,414 prokaryotic species, including newly assembled genomes from under-represented Asian metagenomes

Metagenome sampling bias for geographical location and lifestyle is partially responsible for the incomplete catalog of reference genomes of gut microbial species. Here, we present a substantially expanded microbiome catalog, the Human Reference Gut Microbiome (HRGM). Incorporating newly assembled 29,082 genomes from 845 fecal samples collected from three under-represented Asian countries—Korea, India, and Japan—the HRGM contains 232,098 non-redundant genomes of 5,414 representative prokaryotic species, >103 million unique proteins, and >274 million single-nucleotide variants. This is an over 10% increase from the largest reference database. The newly assembled genomes were enriched for members of the Bacteroidaceae family, including species associated with high-fiber and seaweed-rich diet. Single-nucleotide variant density was positively associated with the speciation rate of gut commensals. Ultra-deep sequencing facilitated the assembly of genomes of low-abundance taxa, and deep sequencing (>20 million read pairs) was needed for the profiling of low-abundance taxa. Importantly, the HRGM greatly improved the taxonomic and functional classification of sequencing reads from fecal samples. Finally, mapping homologous sequences for human auto-antigens onto the HRGM genomes revealed the association of commensal bacteria with high cross-reactivity potential with autoimmunity. The HRGM (www.mbiomenet.org/HRGM/) will facilitate the identification and functional analysis of disease-associated gut microbiota.


Introduction 4 1
Human gut microbiome is considered the "second human genome" and plays a crucial role in ( Supplementary Fig. 1a, Methods), which is more exhaustive than similar approaches 8-11 8 0 (Supplementary Table 1). For instance, we adopted an ensemble method for binning 8 1 assembled contigs, as it showed better performance than individual binning tools 12,13 . We  Therefore, we performed de novo genome assembly of fecal samples from three Asian 8 5 countries: Korea, India, and Japan (referred to here as KIJ samples, Supplementary Table 2).

6
At the start of the current study, WMS data for 805 and 110 fecal samples from Japan and 8 7 India, respectively, were publicly available but not included in the UHGG 14,15 . To 8 8 complement these data, we generated WMS data for fecal samples collected from 90 donors 8 9 recruited in Korea. We set the minimum completeness at 50% and the maximum 9 0 contamination at 5% for genomes of minimum quality. We divided the genome bins into two substantially higher than that of the standard database ( Fig. 4a,b, P < 0.001, two-sided 2 5 3 Wilcoxon signed-rank test). In addition, the variance of the read classification rate of custom 2 5 4 databases was significantly smaller than that of the standard database, except for the 2 5 5 Cameroon population (Fig. 4a, P < 0.001, Brown-Forsythe test). Importantly, the 2 5 6 classification efficacy of the HRGM-based database was significantly improved compared 2 5 7 with that of the UHGG-based database for the four test samples (Fig. 4a,c, P < 0.001, two-2 5 8 sided Wilcoxon signed-rank test), which suggests that the updated reference genome database 2 5 9 improves taxonomic classification of the gut metagenomic sequencing data. HRGM-95 protein catalogs (Methods). The number of aligned reads was 1.31% higher, on 2 6 5 average, with HRGM-95 in all tested samples than with UHGP-95 (Fig. 4d), although 2 6 6 HRGM-95 contains 0.4% more proteins than UHGP-95. Taken together, the newly assembled genomes from under-represented Asian countries 2 6 8 significantly improve the genome and protein databases for metagenomic analysis of both, 2 6 9 taxonomic and functional profiling. Taxonomic profiles obtained by shallow sequencing (0.5-2 million reads) highly correlate 2 7 3 with those obtained by ultra-deep sequencing (2.5 billion reads) 31 . However, this evaluation 2 7 4 is based on entire taxa, in which highly abundant or core taxa govern the correlation measure. Further, low-abundance taxa likely play important, as yet unknown, biological roles in the gut 2 7 6 microbial communities 32,33 . We therefore evaluated the impact of sequencing depth on the 2 7 7 reliability of taxonomic profiling for different ranges of taxon abundance. We generated a included in the HRGM. We then stratified the taxonomic features into eight different groups, 2 8 0 according to the mean relative abundance (Fig. 5a,b). We calculated the mean Pearson the taxonomic profiles at different sequencing depths for different mean relative abundances 2 8 3 (Methods). The taxonomic profile similarity between two groups showed increasing PCC 2 8 4 and SCC with an increasing sequencing depth. For example, >10 million read pairs (3 Gbp) 2 8 5 may need to have taxonomic profiles that highly correlate (PCC > 0.9) with those based on 2 8 6 80 million read pairs (25 Gbp) to account for the features with lowest 13.92% of relative 2 8 7 abundance (relative abundance < 1e-06) ( Fig. 5c and Supplementary Fig. 10a). For SCC > 2 8 8 0.9, the required sequencing depth increased to 20 million read pairs (6 Gbp) for taxonomic 2 8 9 features with a similar level of relative abundance ( Fig. 5b and Supplementary Fig. 10b).

9 0
Overall, these observations suggest that deep sequencing (>20 million read pairs) may be 2 9 1 required to obtain reliable taxonomic profiles of low-abundance taxa. Sequencing 30 Gbp is optimal for functional profiling of the human gut microbiome 2 9 4 Next, using the protein catalog, we investigated the optimal sequencing depth for functional 2 9 5 profiling of the human gut microbiome. Since the detection of gene content generally requires 2 9 6 a much deeper sequencing depth than that for the detection of genomes, we analyzed the (60 Gbp) (Methods). The number of the detected coding genes initially grew rapidly as the 2 9 9 sequencing depth increased, but later approached the estimated maximum count 3 0 0 ( Supplementary Fig. 11a). The curves fitted well (R 2 > 0.99) two-site saturation models 34 , 3 0 1 and we hence estimated the maximum number of coding genes for each sample using the 3 0 2 regression model. Interestingly, the estimated maximum gene counts in the samples differed, showed very similar normalized maximum gene count curves, with over 80% of the gut samples ( Supplementary Fig. 11b). Sequencing another 30 Gbp would fail to detect 90% of 3 0 7 the maximum gene count. Therefore, 100 million read pairs is the optimal sequencing depth 3 0 8 for the best trade-off between the sequencing cost and the gain-of-functional information for WMS-based studies of the human gut microbiome.  Microbial peptides homologous to the host auto-antigens may stimulate host immune cells   3  1  3 and, hence, the hypothesis of molecular mimicry has emerged as a mechanism underlying 3 1 4 autoimmune diseases 35 . To systematically evaluate this hypothesis, we mapped microbial 3 1 5 peptide sequences homologous to the human self-antigens involved in autoimmune diseases 3 1 6 onto the genomes of HRGM representative species. We first compiled autoimmune disease-3 1 7 related antigen set from the Immune Epitope Database (IEDB) 36 , and then used it for 3 1 8 homology-searches of microbial peptide sequences from 5,414 representative species. We thus identified species with a high cross-reactivity potential based on the density of the increased as the number of coding genes increased (Fig. 6a), we divided the ECG count by 3 2 2 the total number of genes for each species. Some human gut commensals had a relatively 3 2 3 high cross-reactivity potential (Fig. 6b, high cross-reactivity potential (Fig. 6d). Indeed, many of them are associated with 3 2 6 autoimmune diseases. For example, Akkermansia muciniphila is abundant in the enthesitis- increased levels of interleukin 6 39 , a pro-inflammatory cytokine that can disrupt the immune genomes is predictive for human gut microbiota associated with autoimmune diseases. In the present study, we constructed an improved catalog of the human reference gut  We also demonstrated that the analysis of microbial DNA and peptide sequences facilitates WMS data for fecal samples from India and Japan were obtained from published studies 14,15 .

7 8
Fecal WMS data for India were generated for 110 healthy donors in North-Central and 3 7 9 Southern India 14 . Although the sequencing depth was relatively low (1.2 Gbp on average), it consent was obtained before the study. The UHGG does not contain any MAGs from Korea. The libraries were prepared as described in the TruSeq Nano DNA Library Prep Reference Guide (Illumina #15041110). Briefly, 100 ng DNA was fragmented using LE220 Focused fragments were obtained after size selection. After adapter ligation, eight PCR cycles were 3 9 2 performed. Library quantification was performed as described in the Kapa Illumina Library The adapter sequences were trimmed, and low-quality bases and short reads were removed 4 0 0 from WMS data using Trimmomatic v0.39 42 . Next, the reads were aligned with the human 4 0 1 genome GRCh38.p7 using Bowtie2 v2.3.5 43 , and the aligned reads were then removed. The 4 0 2 majority of quality-controlled reads were assembled as contigs using metaSPAdes 44 , which is  Table 2).

0 6
Genome bins were generated using the ensemble approach and three binning tools: in total). The genome bins were divided into two groups: HQ, bins with over 90% 4 1 9 completeness and less than 5% contamination; and MQ, bins with 50-90% completeness and 4 2 0 less than 5% contamination. Groups of genomes that corresponded to species were generated using a two-step iterative 4 2 4 procedure. Preliminary clustering was performed using Mash v2.2 50 algorithm. Mash 4 2 5 distances were calculated for all possible pairs of genomes using the "-s 10,000" parameter. Next, the average-linkage-based hierarchical clustering was performed, at a cutoff of 0.2.

6 9
To render the HRGM useful for the 16S rRNA sequencing-based metagenomic analysis, the 4 7 0 16S rRNA regions for 5,414 representative species genomes were predicted using barrnap 4 7 1 v0.9 27 tool and the "--evalue 1e-05" parameter, and "--kingdom bac" and "--kingdom arc"  HRGM (Supplementary Fig. 4). For the species bins with more than three genomes, SNVs were identified using the codes Overall, 64,661,728 CDS were identified in 29,082 genomes from the KIJ set using Prodigal 4 9 6 v2.6.3 22 and "-c -m -p single" parameters. Since many proteins were derived from conspecific tutorial was adopted. For instance, the CD-HIT-95 protein catalog (a 95% similarity level protein catalog construction is depicted in Supplementary Fig. 7.