Evaluation of saliva as a source of accurate whole-genome and microbiome sequencing data

This study sets out to establish the suitability of saliva-based whole-genome sequencing (WGS) through a comparison against blood-based WGS. To fully appraise the observed differences, we developed a novel technique of pseudo-replicates. We also investigated the potential of characterising individual salivary microbiomes from non-human DNA fragments found in saliva. We observed the majority of discordant genotype calls between blood and saliva calls fell into known regions of the human genome that are typically sequenced with low confidence; and could be identified by quality control measures. Pseudo-replication demonstrated that the levels of discordance between blood- and saliva-derived WGS data were entirely similar to what one would expect between technical replicates if an individual’s blood or saliva had been sequenced twice. Finally, we successfully sequenced salivary microbiomes in parallel to human genomes as demonstrated by a comparison against the Human Microbiome Project. Author Summary DNA is usually collected from blood for the analysis of human genomes. In France, a new and very large genetic dataset will be created where selected participants will be sent saliva-collection kits in the post as this data collection method presents numerous logistical benefits. It has been previously shown that good quality genetic data can be created from saliva, though existing studies have often not considered the latest technologies or have only analysed a very small number of individuals. In this study, we have analysed genetic data derived from saliva for 39 individuals to give a firm conclusion that the proposed genome sequencing approach of the new French dataset will be capable of provided high quality data by making a comparison to pre-existing genetic data derived from blood for these 39 individuals. In order to do so, we developed a novel method (presented here) to establish the similarity between two sets of genetic data for the same individual that are generated from separate DNA samples. Finally, we have also demonstrated an added bonus of colleting saliva samples: that it is possible to gather both human genetic data and potentially interesting salivary microbiome data at the same time by separating and analysing in parallel human and non-human DNA fragments.


Introduction
Whole-genome sequencing in humans has become widespread in the exploration of genetic variation between and within populations as well as in genetic epidemiology. Reduced costs have made such sequencing far more attainable. For the majority of large panels of human genetic variation, for example those that have been gathered to for the Haplotype Reference Consortium [1], genomic data has been generated through whole-genome sequencing (WGS) of DNA extracted from blood.
Quoted error rates for most WGS technologies are generally very low [2], yet even apparently tiny error rates can represent hundreds and thousands of erroneous calls across entire genomes. Errors may well congregate in regions that are particularly complicated to sequence as the accuracy of WGS has been shown to vary in different regions of the genome [3][4][5]. Hence, when planning the creation of a new large whole-genome sequencing dataset using a methodology that is in some sense alternative, it is vital to establish that it does not introduce unexpected and problematic patterns into the data that would render the new dataset incomparable to existing WGS datasets. Saliva has many advantages over blood in terms of the logistics of data collection and individual participation levels [6,7]. The POPGEN project in France envisages the creation of a French genomic reference panel by gathering DNA samples using Oragene OG-600 saliva kits. These are to be sent out, completed by participants and returned through the post. A similar approach was used in a recent study in the USA which took advantage of both social media and the use of saliva kits to increase the engagement with vast numbers of individuals [8], and to facilitate the creation of a large population based study with individuals spread evenly across a large region. This last point is of great importance if a full representation of a population's genetics is to be recorded and also to avoid potential geographical selection biases that have recently been highlighted by Haworth et al. [9] and by Munafò et al. [10] in the ALSPAC cohort [11] and in the UK Biobank [12], respectively.
DNA extraction from saliva is by no means a new concept, and has been demonstrated to be a successful technique for genotyping and sequencing studies. The quality and quantity of attained DNA had been shown to be superior in saliva as compared to buccal swabs [6,13,14], likely due to the higher prevalence of leukocytes in saliva [15]. Limiting factors for saliva for sequencing were however presented in [16] where the presence of non-human DNA and subsequent low quantities of attained human DNA was postulated as resulting in large observed differences in the number of markers that could be called (significantly less in saliva than in blood); this being an observation that also made by Herráez and Stoneking [17]. Indeed, even in a recent study that compared WGS data between blood and saliva [18], the most problematic characteristics regarding data from saliva included a large number of samples failing an initial agarose gel quality control test (suggesting very poor quality DNA) and the potential for the presence of a large proportion of non-human DNA. The possibility for low quality [19] and low quantity [20] of human DNA in certain saliva samples has also been presented in the sequencing data generated via exome-capture kits.
At time of writing, comparisons of genomic data between samples of DNA from blood and saliva have largely concluded that saliva is an adequate substitute for blood; producing data with high but not perfect concordance to genotype calls from blood [6,[21][22][23][24][25][26][27][28]. If concordance rates of the order of 95-99% between genotypes obtained from blood and saliva samples are to be considered satisfactory, it is important to realize that at the scale of a whole genome sequence where ~3 million genetic variants are expected per individual, this could represent as many as 150,000 differences.
Such reported statistics, we felt, are worrisome considering they could predict very high error rates in saliva-based sequencing data and could predict unwanted artefacts that could significantly inhibit the utility of a large population-wide WGS dataset. Furthermore, many of the current publications that compare blood against saliva have employed very small sample sizes (<10) and few have examined WGS data.
In this study, we compare WGS data derived from blood and saliva DNA from 39 individuals. Without gold-standard genomes to evaluate the quality of the sequencing in GAZEL-ADN, we have leveraged information on intrinsic difficulties in sequencing certain regions of the genome set out in a collection of studies from the Genome in a Bottle (GiaB) project [3,29,30]. Furthermore, we introduce a novel technique of pseudo-replication. This is a data-driven method to generate, in the absence of gold standard genomes, specific baselines for sequencing reproducibility from within a study. We also show that whilst the treatment of our samples followed the protocol of sequencing human DNA rather than microbiome DNA, we were able to sequence and identify large numbers of non-human DNA fragments in order to give a characterisation of individual salivary-microbiomes in our sample of 39 individuals. This presents an added bonus of the experimental design of using saliva-kits for data collection as one can sequence both good quality human and microbiome data in parallel.

Comparison of blood and saliva derived WGS data at the individual level
Details of the recruitment of individuals, the DNA extraction, sequencing, alignment, and genotype calling for blood and saliva are given in the Methods. We examined similarities between single sample WGS variant calls from blood and saliva for each of the 39 individuals. In Supplementary   Figure 1, the number of called variants, the mean individual read depths, and the mean genotype quality scores are presented. We also calculated per-individual statistics from BAM files; Supplementary Figure 1 includes results for mean insert length of reads, estimated base error rates, and mean read quality. We saw similar numbers of variants being called (mean of 4,619,812 for saliva against a mean of 4,615,935 for blood) but higher statistics regarding the depth and quality in saliva. In saliva the mean read depth was 40.2× (range of 28.2× to 54.2×) compared to 35 Figure 2).
These summary statistics indicate that the quality of the data derived from saliva is here above that of blood. This is not a generalizable observation, simply a characteristic of our particular study likely related to the fact that different genome sequencers were used for blood and saliva (see Methods). Discordances between blood and saliva genotypes can be assumed to indicate genotyping errors in at least one of the two datasets as we should not anticipate any significant contribution from somatic mosaicism between saliva and blood at this depth of sequencing [33].

Comparison of blood and saliva derived WGS data at the whole study level
In Figure 1(a), the distribution of the F-measures is displayed for all 12,085,848 variants. 93.61% of all variants show perfect agreement between blood and saliva. Subsequently in Figure 1(a), we split these variants into two genomic region-sets: high-confidence regions indicated by the GiaB project and the complementing set of low-confidence regions. Subsequently, we focused only on the variants that pass our quality control (QC) thresholds (see Methods). The GiaB high-confidence region list was accessed from ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release. We observed that a high proportion of the variants with disagreements between blood and saliva fall within the lowconfidence regions of the genome. It is also clear that quality control (described fully in the Methods) was successful at removing discordant variants in both high and low-confidence regions. In Figure   1(b), we plot the F-measure along chromosome 2 as an example of how discordant variants cluster into the low-confidence GiaB regions and how quality control improves concordance. Similar

Pseudo-replication
We have ascertained that the majority of the discordances observed between saliva and blood could be segregated using quality control and the GiaB low-confidence region list. This gave a strong suggestion that the differences between blood and saliva that we observed were in the majority due to readily identifiable sequencing errors. This suggested the saliva-based sequencing produced high quality genotype calls (as demonstrated by the high concordance with blood-based sequencing in the high-confidence regions). The success in removing a large proportion of the erroneous calls using quality control also demonstrated that it should not be problematic to analyse saliva-and bloodbased WGS data side-by-side.
Yet even after quality control, discordances remain between blood and saliva across our 39 individuals. The question remains as to whether the remaining discordance was linked to the differences between saliva and blood and the respective sequencing pipelines, or falls within the range that would be expected were an individual's DNA to be sequenced twice on the same platform.
To answer this question, ideally, we would re-sequence each individual's blood DNA sample in order to compare discrepancies of blood against blood with blood against saliva. However, such resequencing was not possible in this study and could well have led to further batch effects given that there would have been a significant length of time between sequencing runs. Not having access to repeated sequencing data, we decided on an innovative in silico approximation of such a round of re-sequencing and to create specific baselines for the comparisons of variant calling between saliva and blood. This pseudo-replication process is described fully in the Methods. The approach involved returning to the raw FASTQ files which contain lists of each individual's raw read data from the sequencer. These lists of reads for blood and saliva were each divided into two non-overlapping groups to give four separate lists of raw reads for each individual. These four lists were then processed separately and identically to produce four variant call sets for each individual; two from blood and two from saliva.
Having four variant call sets for each individual allowed us to make six pairwise comparisons; two comparisons between pairs of pseudo replicates derived from either both from blood or saliva ('blood -blood' and 'saliva -saliva', respectively), and four comparisons between blood and saliva.
The same quality control criteria were applied to the 156 (39×4) pseudo-replicates as had been  For all 39 individuals with paired blood-and saliva-based WGS data, four variant call sets were generated by creating two pseudo-replicates for both blood and saliva. For each pair of blood-and saliva-based WGS data, three comparisons were made, one between the two blood pseudo-replicates (red), one between a randomly chosen blood pseudo-replicate and a randomly chosen saliva pseudo-replicates (gold), and one between the two saliva pseudo-replicates (blue). The F-measures for these comparisons were calculated for variants in high-and low-confidence regions of the genome after quality control had been applied to all 78 pseudo replicates (left and right, respectively).

Salivary Microbiome
We constructed a pipeline to investigate the possibility of sequencing salivary microbiomes from the raw sequencing data derived from saliva. This was necessitated as our raw data had been generated using technology designed for sequencing human DNA while studies of microbiome data typically involve different molecular sequencing approaches [34]. Our goal was to identify reads that were not of human origin and to map such reads to known bacterial reference genomes. The pipeline we used is described fully in the Methods. Applying this to the 39 individuals of the pilot study, we successfully captured an average of 40 million reads that passed quality control measures and could be identified as non-human (Supplementary Figure 4).
These groups of non-human reads were then aligned to known bacterial 16S rRNA gene reference libraries. Across all 39 individuals we qualified a specific taxon to be present in GAZEL-ADN if more than 50 reads aligned to that taxon's reference 16s ribosomal gene. The OTU groups in GAZEL-ADN were compared to 290 publicly available salivary microbiome samples of the Human Microbiome Project (HMP) [35] (Figure 3a). The majority of salivary phyla that are strongly represented in HMP (nodes close to the center of the taxonomic tree) were found in GAZEL-ADN and only a small proportion of rare genera present in the HMP data were not detectable in GAZEL-ADN reads (grey nodes). These results show that the salivary microbiomes characterized in this study have realistic profiles as no major taxonomy groups from the HMP are missing. Furthermore, many phylogenetic families and genera that were detected in GAZEL-ADN were in fact not present in the HMP data (taxonomic tree edges in blue). This might suggest that the deep shotgun sequencing performed in GAZEL-ADN was potentially more sensitive than the 16S analysis of the HMP. However, it is not possible to compare the relative proportions of reads in each OTU group between GAZEL-ADN and the HMP due to the differences in DNA amplification methods and the fact that meta-genomic microbiome studies cannot be assumed to reflect a linear transformation of true microbial population sizes in the body [36,37]. Hence, these results should be interpreted with caution. The proportions of reads that match to the main phyla presented in Figure 3a (11 branches that lead out from the center) for each individual of GAZEL-ADN are presented in Figure 3b where notable difference between individuals can be observed.  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38 39 Sample Analysis of the salivary microbiome.

Discussion
In this study, we have compared WGS variant calls derived from 39 paired saliva and blood samples.
We have demonstrated that a majority of the discordances that are observed between blood and saliva occur in regions of the genome where sequencing is known to be least accurate. We could observe such discordances congregating in these regions due to the relatively large sample size that we had access to for this study. Furthermore, it was clear that recommended quality control measures successfully indicated and excluded a very high proportion of the genotypes that displayed disagreement between blood and saliva.
To fully establish whether the remaining differences after quality control between saliva and blood were remarkable or not, we used a novel method of splitting FASTQ files and creating sequencing pseudo-replicates. This enabled us to confidently establish that the differences between blood and saliva were entirely similar to the differences that one should expect if either an individual's blood or saliva had been sequenced twice. Thus, while high concordance between sequencing data in blood and saliva has previously been reported [18][19][20]28,38], we have been able to better contextualise such a result. This has afforded the conclusions that the WGS data generated from the saliva collection kits is of high quality; and that it will be possible to analyse the prospective POPGEN dataset alongside existing datasets (for analyses such as common or rare variant association studies) without fear of harmful batch effects. Technical bias due to differences is technology being used within a study is a non-negligible problem. The large scale of many recent genetic studies and the time required to assemble genomic datasets may result in different sequencing techniques being used at the start and at the end due to evolving guidelines and technology. The herein proposed novel method of pseudo-replication represents a valuable tool for assessing the potential impact of technological bias within a study as it can provide good approximations for benchmarks of wholegenome sequencing reproducibility.
In our study, we have concentrated on appraising the suitability of saliva for WGS in humans for the calling of SNPs and INDELs. In both Yao et al. [18] and Trost et al. [28] the question of the accuracy of the calling copy-number variations (CNVs) from saliva is also discussed where greater discordance (than for SNPs or INDELs) between blood and saliva has been presented. Prospectively, the 39 paired datasets of GAZEL-ADN will provide valuable opportunities for evaluating of the calling of CNVs from saliva and also for similar investigations for structural variants, transposable elements, and mitochondrial variation.
Here, we have also demonstrated that the sets of non-human reads found in saliva carry sufficient information for taxonomical descriptions of the salivary microbiome even when samples were treated for human genome sequencing. We were able to show that the principal groups of bacterial taxa found in our study matched those found in a large public microbial reference dataset. We could also highlight differences between individuals in terms of presence or absence and relative abundance of certain taxa in their salivary microbiome populations. These findings merit further investigation and, as a prospective of this study, we plan to continue to determine a best strategy for extracting salivary microbiome information from WGS data from saliva and to further develop a bioinformatics pipeline for the generation of microbiome data from read data initially generated for human whole-genome sequencing.
This pilot study enables us to have confidence for the use of home-use saliva kits for the proposed new French genetic reference panel POPGEN of the French medical genomics initiative [39,40]. The WGS data generated from saliva will be an equally good approximation of the true genomes of the participants had the project been based on blood collection. As using saliva will also come with considerable benefits in terms of the participation rates and logistics of such a study, it therefore represents a highly attractive data-collection method. We also envisage that it will be possible to study patterns of microbial populations across France in parallel to studies of human genetics via the collection of saliva for sequencing in the POPGEN project.

Data Collection
Sixty individuals from the GAZEL cohort [41] from whom WGS data was already available were selected at random and asked to donate saliva for sequencing. WGS data from saliva of 39 of these individuals were investigated in this study. We refer to this pilot study of 39 individuals as GAZEL- (PROMEGA) instrument and using magnetic bead technology. DNA extraction for blood samples was performed at the CEPH in Paris, France on an Autopure (Qiagen) automated system using a salting out method. All libraries for whole-genome sequencing were subsequently prepared using the Illumina TruSeq PCR-free protocol, with an input of 1µg of DNA at the CNRGH in Evry. However, library preparation and sequencing for saliva was performed over two years later than that of blood.
The saliva samples were processed using an Illumina NovaSeq 6000 sequencer, while blood samples had previously been processed using an Illumina HiSeqX5 sequencer. These two sequencers have a common underlying technology, and have recently been shown to perform similarly [42]; the main difference being the increased speed and throughput of NovaSeq. Generation of sequencing data was performed using the same bioinformatics pipeline for blood and saliva. Alignment was performed using the software bwa (v.0.7.15) [43] to the human reference genome hs37d5; a human reference genome from genomic build 37 and including dummy contigs for mapping well-known non-human contaminates. This alignment was followed by genotype calling using the GATK Haplotype Caller (v.3.8) [44].

Comparing blood and saliva WGS call sets
To compare the genotype calls between saliva and blood, the F-measure was used as in [45,46]. The

Pseudo-replication
Trost et al., [28] gave a baseline for levels of concordance between blood and saliva from a single external repeated sequencing study for comparisons of sequencing from different biological sources (blood, saliva, and buccal cells). We developed this idea and introduced a novel technique of pseudoreplication. We went back to the original FASTQ files for each individual and split these unordered lists of raw read data in two (retaining reads with their paired ends). This was simply achieved by selecting odd and even groups of four from the line numbers of the two FASTQ files for each individual (one file containing the front end-reads, one file containing the corresponding back-end reads). Naturally, this reduced the depth by 50% to approximately 20× in our study; this lower depth will lead to lower quality in our data, but we should still expect to be able to have well called genotypes at this depth [48]. Hence, each sequencing run provided two pseudo-replicates that are an approximation of results from a replication study at half the depth for each individual. To create our pseudo-replicates, we first applied the quality control program BBTools (v. 36 [49] to estimate the relative abundance of each sequence in the respective dataset. Sequences with a coverage statistic of less than 1 were regarded as not present in a sample and removed from further analysis. Taxa abundances were summed at the phylum level and the genus level for specific genera. Figures were realised using the R-packages 'phyloseq' [56], 'taxa' [57], and 'metacoder' [58]. Publically available salivary microbiomes and relevant metadata were downloaded from the HMP (https://www.hmpdacc.org/) in order to provide a comparison dataset and to check quality. Operational taxonomic unit (OTU) representative sequences, OTU tables, and mapping files for 16S rRNA V3-V5 sequencing were downloaded from publically available HMP QIIME Community Profiling datasets (http://downloads.hmpdacc.org/data/HMQCP/otu_table_psn_v35.txt.gz).

Acknowledgments
We would like to thank all of the members of the GAZEL cohort for their participation in this study.

Data Availability
The complete WGS data generated in this study will be submitted to the French Centralized Data Center of the France Medicine Genomic Plan that is under construction. A synthetic dataset has been generated that allows the replication of our principal results but without a full disclosure of individual level sequencing data. This dataset is currently available on our website http://lysine.univbrest.fr/~aherzig/BloodSaliva/.
Author Contributions AFH carried out analyses pertaining to the comparison of data quality between blood and saliva, and wrote the manuscript. LVS carried out the analyses of salivary microbiomes. EG planned the study and wrote the protocol with CD. All analyses in the study were discussed and decided upon by AFH, LVS, and EG. GL-F acted as project manager regarding the logistics of individual recruitment for this pilot study in conjunction with EG, CD, MG, FL, and MZ. HB managed the blood DNA extractions, LL-R managed the saliva DNA extractions, and AB, RO and JFD contributed to WGS data production. All authors participated in the final redaction of the manuscript.

Declarations
The authors have no conflicts of interest to declare.