Influence of sequencing depth on bacterial classification and abundance in bacterial communities

Microbial diversity is the most abundant form of life. Next Generation Sequencing technologies provide the capacity to study complex bacterial communities, in which the depth and the bioinformatic tools can influence the results. In this work we explored two different protocols for bacterial classification and abundance evaluation, using 10 bacterial genomes in a simulated sample at different sequencing. Protocol A consisted of metagenome assembly with Megahit and Ray Meta and taxonomic classification with Kraken2 and Centrifuge. In protocol B only taxonomic classification. In both protocols, rarefaction, relative abundance and beta diversity were analyzed. In the protocol A, Megahit had a mean contig length of 1,128 and Ray Meta de 8,893 nucleotides. The number of species correctly classified in all depth assays were 6 out of 10 for protocol A, and 9 out of 10 using protocol B. The rarefaction analysis showed an overestimation of the number of species in almost all assays regardless of the protocol, and the beta diversity analysis results indicated significant differences in all comparisons. Protocol A was more efficient for diversity analysis, while protocol B estimated a more precise relative abundance. Our results do not allow us to suggest an optimal sequencing depth at specie level.

Microbial diversity is composed of a great variety of unicellular organisms (prokaryota, 2 archaea, protozoa, fungi and viruses) and is the most abundant life forms present on the 3 planet [1]. Advances in next-generation sequencing (NGS) technologies have allowed us 4 to reach unprecedented levels of genomic analysis [2], [3], and have provided us with 5 the possibility to analyze non-culturable communities, whether they are animal tissue, 6 air or soil samples [4]. 7 8 Two sequencing methods are usually used in metagenomic studies, 16S and shotgun. 9 The first focused in the sequencing of hypervariable regions of the 16S rRNA gene and 10 the second consists of the sequencing of complete genomes from a sample [5]. Both prokaryote kingdom, its resolution is limited and it has a lower sensitivity in the 14 identification of genus and species [7], [8]. The shotgun method increases resolution, 15 sensitivity at the genus, species or bacterial strain level and bacterial co-abundance in 16 microbiome studies [9]. However, generation of precise results depends not only on the 17 sequencing platform, but also on the depth and data analysis employed [10]. It is 18 important to highlight that there is not a consensus about the right protocol for 19 processing and analysis of the sequencing data, due to the existence of different 20 approaches and to the great variety of bioinformatics tools [8], [11]. 21 22 In microbiome studies, the main bioinformatics toolsets are metagenome assemblers 23 (MetasSPAdes, Megahit, Ray Meta) and taxonomic classifiers (Kraken and Centrifuge) 24 that allow us to estimate the diversity and relative abundance of the microorganisms 25 present in a sample [12]- [14].

27
The knowledge of the precise abundance at the genus and species level is important 28 to better understand the ecological composition, the possible interactions, as well as the 29 associations between pathology and specific microorganisms [15]. The aim of this work 30 was to explore two different approaches for bacterial classification and abundance 31 evaluation, using a simulated bacterial community sample at different sequencing 32 depths.

34
Computational resources 35 The present work was carried out on a server with Linux CentOS operating system, 35 36 processors and 62 GB RAM.

38
For the "in-silico" analysis, we used a simulated reference sample composed by 10 39 different bacterial genomes belonging to the main phyla that conform the human gut 40 microbiota. Bacterial genomes were downloaded from the National Center for 41 Biotechnology Information database (NCBI). To constitute the reference sample, we 42 provided an arbitrary abundance for each one of the selected species, considering a total 43 of 25 genomes which were distributed among the 10 species. In this context, the 25 44 genomes copies microbiome corresponds to the 1X depth for the reference sample. Table 1 shows the list of selected bacterial species, the number of copies of each genome, 46 their genome size, abundance and the corresponded phylum.  [4] and Ray Meta V 2.3.1 [16] programs, while bacterial classification and estimation of bacterial abundance was evaluated with Kraken2 V 2.1.2 [16] and Centrifuge V 1.0 [13]. Microbial-RefSeq and Bacteria-Archea databases were used respectively.

48
Using the reference sample, we simulated a shot-gun sequencing in silico for the 49 Illumina HiSeq2500 platform and by means of the ART program [17] we obtained Taxonomic classification, rarefaction and bacterial abundance analyses were assessed by 55 means of two sequence analysis protocols. Protocol A is divided into 4 stages: assembly, 56 taxonomic classification and rarefaction, bacterial abundance, and statistical analysis, 57 while protocol B includes all steps, but it does not consider the assembly step ( Figure 1). 58 Each de novo assembly was performed using a k-mer of 31, 51, 79 and 109 nucleotides. 59

60
The classification results were obtained after filtering the information from each 61 classifier considering only the information related to the 10 species that constitute the 62 reference sample in each depth assay (Table 1). In contrast, for the analysis of total 63 diversity, we included the whole number of species identified without any filtering step. 64 The estimated number of copies for each genome in the different depth assays was 65 calculated by multiplying 25 (number of initial copies) by the depth value on each test. 66 The relative abundance represents the number of copies of each species in relation to 67 the total number of readings and expressed in percentage, for each one of the depth 68 assays. Finally, the alpha diversity present in the reference and in each test was used for 69 the calculation of beta diversity using the Sorensen index [18]. The statistical  (Table 2). The taxonomic classification in protocol A was equally efficient when Centrifuge or  In all assays, 9 out of the 10 species from the reference sample were consistently well classified in all assays (Akkermansia muciniphila, Alistipes f inegoldii, Bacteroides vulgatus, Bif idobacterium bif idum, Desulf ovibrio vulgaris, Escherichia coli, F usobacterium nucleatum, Lactobacillus reuteri and Shigella f lexneri). The remaining specie was not classified (Dialister invisus). a: Centrifuge classifier. b: Kraken2 classifier.

78
species compared to the Ray Meta assembler, and this result was ascending in each 100 depth assay. In both cases, the richness was greater than that present in the reference 101 sample (Table 3 and Figure 4).   Table 3.

102
On the other hand, when the data are analyzed with protocol B without including a 103 previous assembly process, the total number of species identified, considerably exceeds 104 the theoretical value of the reference sample in each depth assay, regardless of the 105 classifier used. (Table 3 and Figure 5). Rarefaction curves for total number of identified species using the protocol B. The number of species identified with protocol B is much higher than the number of species reported for protocol A. This growth pattern is maintained for each one of the depth assays and holds the same trend with any of the two classifiers. For the specific number of species identified see Table 3. Megahit assemblies differed than that present in the reference sample, these results were 111 observed in all depth assays ( Figure 6). On the other hand, classification of Ray Meta 112 assemblies generated the most imprecise relative abundances regardless of taxonomic 113 classifier ( Figure 7).  Beta diversity analysis 120

Relative abundance
The beta diversity analysis at the species level between the different depth assays and 121 the reference in protocol A, showed significant differences (p<0.005). (Figures 6 and 7). 122 The same result was obtained for the protocol B (p<0.005). (Figure 8). When 123 analyzing the beta diversity between each one of the depth assays with all the others we 124 observed a significant difference (p<0.005).

126
In the protocol A, at the genus level, the use of the centrifuge classifier with Megahit 127 or Ray Meta assemblers, showed no significant difference from the 50X depth assay and 128 the results did not improve after this point (p>0.005). In contrast, using protocol B, 129 there was no significant differences from the 25X assay with Centrifuge (p>0.005).

131
This work has focused on exploring some of the tools commonly used in the 132 characterization of microbial communities using shotgun sequencing and that do not 133 require extensive knowledge or abilities in bioinformatics [19], [20]. Considering this, in 134 this work we did not explore in depth the mathematical basis or logarithms of these 135 tools. The use of a simulated sample in this work helps to eliminate the negative 136 influence of errors or low sequencing quality of a real sample [21]. In this study, we 137 determined the diversity and abundance in each assay, this gives greater certainty in the 138 results obtained.
The results of the metagenome assemblies showed that about 90 of the contigs had a 141 size below 10,000 nucleotides in the protocol A. This size is smaller than the genome of 142 Dialister invisus (1.8 Mb), specie in the sample with the smaller genome size. In this 143 work we used Megahit and Ray Meta, these are "de novo" assemblers that are both 144 based on Bruijn graphs [4], [16]. This type of graphs allows the efficient assembly of 145 short reads [22], but when divided into the length of the k-mer defined in the assembly, 146 these can be susceptible to errors [12]. During the assembly process of a mixed genome, 147 two types of errors often occur. The first is the presence of k-mer in different regions of 148 the same genome, giving rise to chimeric connections between the nodes in the Bruijn 149 graph, which are different to the real sequence. This can result in erroneous assemblies 150 and short contigs. This error increases when assembly a metagenome because a k-mer 151 can be present in different genomes. The second error occurs in regions with low 152 coverage. However, this error was excluded when a simulated sample was studied [23]. 153 The presence of short contigs in this study is the result of erroneous assemblies and the 154 use of k-mer of different lengths did not prevent the generation of these errors.

155
Currently, the use of next-generation sequencing platforms such as PacBio and 156 Nanopore, which can sequence fragments of 30-50 Kb and 100 Kb respectively [24].

157
This may be favorable in the study of complex bacterial communities, because larger 158 read sizes can generate more precise assemblies, even with shared genomic regions. with respect to other assemblers, and they gave greater relevance to avoidance of the 164 assembly of repeated reads, ramifications and the generation of longer assemblies. 165 However, these parameters are not a guaranty of their efficiency, which we observed in 166 the assemblies performed in this study [4], [12], [16]. In addition, the performance 167 evaluation was not carried with a sample that has a known composition.

169
The classification of the assemblages showed a lower number of species than the 170 number of contigs obtained in each assay, this was due to the absence of genomic 171 information necessary for their classification. In contrast, the number of species was 172 overestimated in each of the assays, and this influenced the beta diversity and relative 173 abundance, which were different from the reference. The results obtained in this work 174 related to the assemblages, show the need to discriminate the short assemblages, since 175 there may be misclassified and influence the results of rarefaction, relative abundance, 176 and beta diversity.

178
The omission of metagenome assembly and the exclusive use of taxonomic classifiers 179 is common in microbiota studies where shotgun sequencing is used [10], [25], this can be 180 explained by the high demand of computational requirements and processing time of the 181 assemblers [25]. Our results from protocol B reflect a relative abundance different to the 182 reference and overestimation of species much higher than protocol A in each of the 183 depth assays. This is of utmost importance since bacterial richness and relative 184 abundance are two of the most important results in microbiota studies.

186
The results obtained with both protocols indicate that neither of the bioinformatics 187 tools used in this study is completely accurate, but Ray Meta generated the larger size 188 assemblies and avoided the generation of short contigs. While Centrifuge had the lowest 189 number of errors, but this was influenced by the type of database, which includes 190 genomic information characteristic of bacterial species and omits redundancies [13].

191
This makes the classification of species with shared genomic regions more efficient.

192
Regarding the influence of depth, we are not able to obtain information on the 194 optimal metagenome sequencing depth to obtain reliable taxonomic results. However, at 195 the genus level, we did not observe an improvement in the results from the 50X depth 196 and using Ray Meta with Centrifuge.

198
It is important to note that the sample analyzed in this study is minimal and cannot 199 be compared to the composition of a real complex bacterial community sample. For metagenome assembly and diversity analysis, protocol A (Ray Meta and Centrifuge) 205 was more efficient. While in relative abundance, protocol B (Centrifuge) was better.

206
Our results do not allow us to suggest an optimal sequencing depth.