One is not enough: on the effects of reference genome for the mapping and subsequent analyses of short-reads

Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of a reference may represent a source of errors that may affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legionella pneumophila, Neisseria gonorrhoeae, Pseudomonas aeruginosa and Serratia marcescens). Publicly available whole-genome assemblies encompassing the genomic diversity of these species were selected as reference sequences, and read alignment statistics, SNP calling, recombination rates, dN/dS ratios, and phylogenetic trees were evaluated depending on the mapping reference. The choice of different reference genomes proved to have an impact on almost all the parameters considered in the five species. In addition, these biases had potential epidemiological implications such as including/excluding isolates of particular clades and the estimation of genetic distances. These findings suggest that the single reference approach might introduce systematic errors during mapping that affect subsequent analyses, particularly for data sets with isolates from genetically diverse backgrounds. In any case, exploring the effects of different references on the final conclusions is highly recommended. Author summary Mapping consists in the alignment of reads (i.e., DNA fragments) obtained through high-throughput genome sequencing to a previously assembled reference sequence. It is a common practice in genomic studies to use a single reference for mapping, usually the ‘reference genome’ of a species —a high-quality assembly. However, the selection of an optimal reference is hindered by intrinsic intra-species genetic variability, particularly in bacteria. Biases/errors due to reference choice for mapping in bacteria have been identified. These are mainly originated in alignment errors due to genetic differences between the reference genome and the read sequences. Eventually, they could lead to misidentification of variants and biased reconstruction of phylogenetic trees (which reflect ancestry between different bacterial lineages). However, a systematic work on the effects of reference choice in different bacterial species is still missing, particularly regarding its impact on phylogenies. This work intended to fill that gap. The impact of reference choice has proved to be pervasive in the five bacterial species that we have studied and, in some cases, alterations in phylogenetic trees could lead to incorrect epidemiological inferences. Hence, the use of different reference genomes may be prescriptive to assess the potential biases of mapping.


151
152 The average coverage depth (i.e., mean number of reads covering each position of the reference 153 genome) was only slightly affected by reference choice (Fig 3, S4 Table). Its effect was noticeable 154 when reads were mapped to the most divergent reference genomes of the different species, as in the 155 previous parameters. However, the average depth seemed to be more dependent on other factors such 156 as the total number of reads (sequencing coverage) of the isolates rather than on the genetic distance 157 to the reference genome. One such example is isolate NG-VH-50 (N. gonorrhoeae), which had a low 158 total number of reads and also showed low average depth values regardless the reference selected for 159 mapping (S5 Table). Differences in this parameter depending on the reference used for mapping were 160 found to be non-significant in all the species, according to Kruskal-Wallis tests.   (Fig 7D).
239 These differences were also observed when only the core genome was used to obtain the phylogenetic 240 tree (S3 File). Additional species-specific differences are described next.   Table) and the corresponding distributions were compared. Those regions that were not present in all 294 the sequences of a species were removed from the alignments for these analyses.
295 Overall, the distributions of recombination rates were very similar regardless the reference genome 296 used in each case. However, relevant differences in some peaks were found in different MSAs from 297 the same species. For example, the MSAs built with 32867 or NCTC13798 (N. gonorrhoeae) as 298 reference sequences showed at least two clearly observable peaks that were absent when FA 1090 was 299 the reference (Fig 11).  323 Table).
324 In all cases, the dN/dS values computed for each gene were <1. Differences in dN/dS depending on the 325 reference used (Fig 12) were significant (Kruskal-Wallis, P < 0.05) for all the species. The proportion 326 of significant pairwise comparisons (Wilcoxon, P < 0.05) depended on the species, ranging from 327 47.7% (L. pneumophila) to 83.3% (S. marcescens) ( Table 1). In contrast with the results obtained in 328 the parameters discussed previously, some of the comparisons involving the most genetically distant 329 reference genomes (e.g., 342 strain of K. pneumoniae) as mapping references were not significant.
330 Therefore, in this case it is difficult to explain the variability of ω based on the genetic distances 331 between reference sequences for most species. N. gonorrhoeae could be treated as an exception, 332 because the comparisons involving the reference strain FA 1090 (the most genetically distinct one) 333 were the only significant ones. These differences were also observed when only the core genome was 334 used to compute ω.    412 clades and even to the association of specific isolates to transmission clusters, thus potentially 413 affecting epidemiological inferences. This has been observed even when using phylogenetically 414 related strains from the same non-clonal species as a reference, in contrast with previous studies in 415 clonal bacteria [25] where differences in phylogenetic inference appeared when using reference 416 genomes from close but different species. This is most obvious in the L. pneumophila data set, in 417 which two isolates changed their positions and were placed in the same cluster of the reference 418 sequence used for mapping, while the overall topology remained practically unchanged.

419
420 Differences between trees were quantified by topological distance metrics, reflecting, in most cases, 421 lack of correlation between tree distances and genetic distances of the corresponding reference 422 genomes. As suggested previously [22,27], when working with a genetically diverse set of isolates, it 423 is impossible to select a single reference close to all of them, and single-reference mapping biases are