Deaminase associated single nucleotide variants in blood and saliva-derived exomes from healthy subjects

Background: Deaminases play an important role in shaping inherited and somatic variants. Disease-related SNVs are associated with deaminase mutagenesis and genome instability. Here, we investigate the reproducibility and variance of whole exome SNV calls in blood and saliva of healthy subjects and analyze variants associated with AID, ADAR, APOBEC3G and APOBEC3B deaminase sequence motifs.
Methods: Samples from twenty-four healthy Caucasian volunteers, allocated into two groups, underwent whole exome sequencing. Group 1 (n=12) analysis involved one blood and four saliva replicates. A single saliva sample was sequenced for Group 2 subjects (n=12). Overall, a total of 72 whole exome datasets were analyzed. Biological (Group 1 & 2) and technical (Group 1) variance of SNV calls and deaminase metrics were calculated and analyzed using intraclass correlation coefficients. Candidate somatic SNVs were identified and evaluated.
Results: We report high blood-saliva concordance in germline SNVs from whole exome sequencing. Concordant SNVs, found in all subject replicates, accounted for 97% of SNVs located within the protein coding sequence of genes. Discordant SNVs have a 30% overlap with variants that fail gnomAD quality filters and are less likely to be found in dbSNP. SNV calls and deaminase-associated metrics were found to be reproducible and robust (intraclass correlation coefficients >0.95). No somatic SNVs were conclusively identified when comparing blood and saliva samples.
Conclusions: Saliva and blood both provide high quality sources of DNA for whole exome sequencing, with no difference in ability to resolve SNVs and deaminase-associated metrics. We did not identify somatic SNVs when comparing blood and saliva of healthy individuals, and we conclude that more specialized investigative methods are required to comprehensively assess the impact of deaminase activity on genome stability in healthy individuals.

Background
APOBEC/AID deaminases are a recognized endogenous source of genome instability [1][2][3][4][5]. Somatic mutations caused by deamination events have been identified in cancer in vitro and in vivo [6][7][8][9], and evidence is emerging of deaminase-associated mutations in non-cancerous conditions such as various viral infections and neurodegenerative diseases [10,11]. Deaminases have also recently been implicated in the accumulation of pre-cancerous mutations [12], and as a causative driver of many human SNPs [13].

Deaminases predominantly drive C-to-U(T) and A-to-I(G) transition mutations; however, DNA repair mechanisms typically prevent deamination from compromising genome integrity and causing somatic mutation [14,15]. Pathophysiological processes can disrupt normal DNA repair, resulting in mosaic manifestation of deaminase-associated single nucleotide variants (SNVs) in affected tissues [16]. Although deaminases employ similar biochemical mechanisms, each has a unique binding domain associated with one or more DNA motifs [17,18]. Deaminase motifs can be identified and quantified in Next-Generation Sequencing (NGS) data, facilitating diagnosis of the specific cause of the mutation. For example, AID targets C-sites in the context of WRC motifs (W = A or T; R = A or G; reverse complements as GYW, with Y = T or C), APOBEC3G deaminates CC sites (or GG), APOBEC3B deaminates TCW (or WGA) motifs, and ADARs deaminate WA sites [2,19,20].
Establishing reproducible and robust deaminase-associated SNV profiles in healthy people will improve the utility of mutation profiling techniques for monitoring progression of diseases such as cancer, and for understanding patient response to treatment.

Here, we report profiles for SNVs associated with deaminase motifs for a cohort of 24 healthy human subjects using whole exome sequencing (WES). For twelve of these subjects (Group 1), we compare blood with biological and technical saliva replicates from Caucasian volunteers of different age groups and sexes. We hypothesize that deaminase-associated SNV profiles of a cohort of healthy individuals will show high concordance between saliva and whole blood DNA in a reproducible and robust manner.
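The motif definitions given above (WRC/GYW for AID, CC/GG for APOBEC3G, TCW/WGA for APOBEC3B, WA for ADAR) can be illustrated with a minimal Python sketch that tests whether a variant site falls within each motif. This is not the pipeline used in this study. The function names are hypothetical, and the position of the deaminated base within each motif (the `offset` values) is an assumption made for illustration; only the forward WA motif is included for ADAR because no reverse complement is stated in the text.

```python
import re

# IUPAC codes as defined in the text: W = A or T; R = A or G; Y = T or C.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "R": "[AG]", "Y": "[CT]"}

# (motif, offset of the deaminated base within the motif).
# ASSUMPTION: the offsets below are illustrative and are not taken from the study.
MOTIFS = {
    "AID": [("WRC", 2), ("GYW", 0)],
    "APOBEC3G": [("CC", 1), ("GG", 0)],
    "APOBEC3B": [("TCW", 1), ("WGA", 1)],
    "ADAR": [("WA", 1)],
}


def matches(seq, pos, motif, offset):
    """Return True if the motif covers seq[pos] with its target base at `offset`."""
    start = pos - offset
    if start < 0 or start + len(motif) > len(seq):
        return False  # motif window would run off the end of the sequence
    pattern = "".join(IUPAC[code] for code in motif)
    return re.fullmatch(pattern, seq[start:start + len(motif)]) is not None


def classify_site(seq, pos):
    """List the deaminases whose motif contains the variant site at seq[pos]."""
    seq = seq.upper()
    return sorted(name for name, patterns in MOTIFS.items()
                  if any(matches(seq, pos, m, off) for m, off in patterns))


# Toy reference contexts with the variant at 0-based index 2.
print(classify_site("AGCT", 2))   # C preceded by A,G: WRC context
print(classify_site("TTCAG", 2))  # C in a TCA context: TCW
print(classify_site("GTAC", 2))   # A preceded by T: WA context
```

Counting the sites returned for each deaminase over all coding-region SNVs, and dividing by the total number of coding-region SNVs, would give motif percentages of the kind reported per sample.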

Healthy subject selection
In total, 24 healthy Caucasian subjects were recruited for this study. Volunteers were considered healthy if they had blood pressure and heart rate within normal ranges, had never smoked, were only light drinkers (<14 units of alcohol weekly), had no major viral infections or immune-related diseases, and did not take any regular medication. Eight subjects were recruited into each of the three age groups 18-19, 30-39, and 50-59, with an equal ratio of males to females in each group. These subjects were randomly allocated into two groups balanced for sex and age group. Group 1 (n=12) involved analysis of blood and saliva sample replicates. Group 2 (n=12) involved analysis of the Saliva-1 sample only.

Five WES datasets were generated for each of the twelve Group 1 subjects, comprising two males and two females from each of the three age categories (18-19, 30-39 and 50-59). As described in Figure 1

Blood and saliva whole exome sequencing
Saliva and blood samples from 12 healthy volunteers, Group 1 subjects, underwent sequencing and analysis according to the workflow illustrated in Figure 1. In addition, 12 exomes were obtained from Saliva-1 samples from the remaining 12 recruited healthy volunteers, Group 2 subjects (Table 2). For all exomes sequenced (n=72), an average of 136 million high-quality 100bp paired-end reads were obtained. The total number of reads, mapping rate and coverage statistics for all sequencing runs are described in Supplementary Table 1. Mapping rates were between 94.2% and 99.9% with a median of 98.9%. A median of 97.2% of the exome was covered at >30x and 70.0% at >100x. Sample HP_4_1 produced the lowest number of reads and consequently had the lowest sequencing depth, with 91.5% of the exome covered at >30x. Age group, sex, and counts for total SNVs, SNVs within a coding region (referred to as CDS), and percentages of variants within a coding sequence region that correspond to known motifs for AID, ADAR, APOBEC3G and APOBEC3B are presented in Tables 1 and 2.

SNV concordance between and within sample types
For Group 1 subjects (n=12), SNVs called in each sample were analyzed following the workflow described in Figure 1.
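As an illustration of the concordance definition used in this study (a concordant SNV is one found in all replicates of a subject), the following minimal Python sketch splits a subject's SNV calls into concordant and discordant sets. It is a sketch under stated assumptions, not the authors' workflow: the function name, the sample labels, and the representation of an SNV as a `(chrom, pos, ref, alt)` tuple are all hypothetical, and the variants shown are toy values.

```python
from collections import Counter


def split_concordant(replicate_calls):
    """Split SNVs into those called in every replicate and those that are not.

    replicate_calls maps a sample label to a set of (chrom, pos, ref, alt) tuples.
    """
    n_replicates = len(replicate_calls)
    # Count in how many replicates each distinct SNV was called.
    counts = Counter(snv for calls in replicate_calls.values() for snv in calls)
    concordant = {snv for snv, n in counts.items() if n == n_replicates}
    discordant = set(counts) - concordant
    return concordant, discordant


# Toy data for one hypothetical subject (labels and variants are invented).
a = ("chr1", 1000, "C", "T")
b = ("chr2", 2000, "A", "G")
c = ("chr3", 3000, "G", "A")
calls = {
    "blood": {a, b},
    "saliva_1": {a, b, c},
    "saliva_2": {a},
}

concordant, discordant = split_concordant(calls)
fraction = len(concordant) / (len(concordant) + len(discordant))
print(f"{len(concordant)} concordant, {len(discordant)} discordant, "
      f"concordance fraction = {fraction:.2f}")
```

Using sets keeps the comparison independent of call order within each sample; restricting the input to coding-region SNVs would give a CDS-level concordance fraction.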

Analysis of unmapped reads
The average number of unmapped reads was larger in saliva (60 WES datasets, mean=2,372,300, 98.4% mapping rate) than in blood (12 WES datasets, mean=334,182, 99.7% mapping rate), corresponding to a six-fold higher unmapped rate in saliva (1.63% unmapped) compared to blood (0.27% unmapped). Overall, the average mapping rate across all 72 samples and replicates was 98.6% (Supplementary Table 1). Quality statistics for unmapped reads are summarized at https://jpmam1.github.io/MultiQC/. Unmapped reads for volunteers HP_1 and HP_2 were extracted and aligned to the NCBI nr protein database, with a higher read alignment rate in saliva (41%) than in blood (33%). Reads that failed to align to NCBI nr were typically low quality. Unmapped reads derived from saliva, but not blood, predominantly aligned to metagenomic species (Supplementary Figure 6).

Sources of discordant SNVs were investigated and were found to be associated with low read depth, high strand bias, and low genotype quality. Analysis of putative somatic variants showed no conclusive evidence of somatic mutation when comparing blood and saliva samples. On average, approximately 2% of reads failed to align to the human genome, with reads derived from saliva samples primarily related to metagenomic taxa associated with the oral microbiome [41,42]. Here, we establish that saliva and blood are both appropriate sources of DNA for WES analyses, with no detected difference in ability to resolve SNVs and deaminase-associated signatures and metrics.

The data are not publicly available due to information that could compromise research participant privacy and consent. The data that support the findings of this study are available on reasonable request from the corresponding author NEH.