Characterizing the quality metric in genotype imputation

Abstract

We modeled the relationship between the imputed allelic dosage ( = ( ! , … , " ), # ∈ 134 Because the WGS dataset is unphased diploid data, dosage r 2 calculated from that will 173 be slightly different from EmpRsq. Note S1 provides the detailed methods to calculate obtained using Equations 6-7 and calculated from the imputed dosage in both haploid 176 and diploid cases. 177 178 Imputation using the simulated 1KGP reference panels with different θ values 179 The 1KGP vcf files were downloaded from the Minimac3 website (see above). 180 Parameters were estimated using Minimac3. The θ value was extracted (denoted as 181 "Recom" in the m3vcf file), 3 scaled by 21 folds manually (0.01-100), and replaced in the 182 original file. A brief explanation of the θ value is provided in Note S2. The array data of 183 WGS993, a subset of BBJ-180k, was used as the target sample. The imputation 184 procedure was the same as described above. 185 186 Evaluation of the imputation performance using the simulated reference panels 187 We obtained the imputed allelic dosage (LooDosage) from Minimac4's empiricalDose 188 file (by turning the "--meta" option on). 35 It was derived from the leave-one-out (LOO) 189 method by hiding the markers on the array during the imputation. Rsq, EmpRsq, MARE,190 and βimp were calculated from the LooDosage and array data. MAF was obtained from 191 the array data. 192 193 We sampled subsets from the 1KGP and JEWEL3k, shuffled the sample order 10 times, 195 and created new vcf files using bcftools v1.14 (https://samtools.github.io/bcftools/). 36 We 196 estimated the parameters using Minimac3. Method S4 provides the detailed methods. 197 The total θ value along chr19 (by summing the θ values between adjacent markers) was 198 used to evaluate the reference panel size and ancestral diversity's impact. A discussion 199 of the θ value qualification is provided in Note S3. 200

Number of confident alleles and high-Rsq variants 219
Haploid dosage (HDS) is the imputed alternative allele's allelic dosage at the haploid 220 level, which could be obtained by the "--format HDS" option in Minimac4. Briefly, it is in 221 a per-variant and per-individual manner and a higher value represents an imputed 222 alternative allele is more certain. We quantified the θ value's impact on the imputed 223 dosages by the number of confident alleles (HDS > 0.9). Rsq was used to judge how 224 many high-Rsq (Rsq > 0.7) variants could be passed to the downstream analyses. 225 226 Furthermore, we determined how many confident alleles and high-Rsq variants could be 227 obtained only from the additional distant ancestries in the multi-ancestry reference 228 panel. We defined ancestry-specific variants as follows: EUR-only variants only existed 229 in the 1KGP-EUR403, and it has no non-EUR variants. JPT3256-only and 1KGP-EAS-only 230 variants only existed in the JPT3256 and 1KGP-EAS, respectively. Non-EAS variants 231 were not found in the JPT3256 or 1KGP-EAS. 232

233
We followed the TOPMed imputation pipeline (Accessed on Oct. 15, 2022, 235 https://topmedimpute.readthedocs.io/) and used the 1KGP and JEWEL3k reference 236 panels. Specifically, the HapMap2 genetic map was used as a reference for the θ value 237 instead of estimating it using Minimac3. WGS993 was used as the target. As the θ value 238 was not outputted by default, we modified the Minimac4's source code to obtain the 239 transformed θ value (Method S5). 240 241

BBJ imputation using the four reference panels 243
The BBJ-180k was imputed using the TOPMed, 1KGP, BBJ1k, and JEWEL3k reference 244 panels ( Figure 1A). Characteristics of each panel and the target sample are listed in 245 Rsq threshold (Figure 2A and Table S1). In addition to the absolute number, unique 248 variants were imputed from each panel ( Figure 2B). These results reproduced the 249 benefits of using large and different reference panels. 19 250 251 We then used WGS993 to empirically evaluate the imputation performance. MAF and 252 13 5%), low-frequency (5% > MAF ≥ 1% and 1% > MAF ≥ 0.5%), rare (0.5% > MAF > 254 0.1%), doubleton (MAC = 2), and singleton (MAC = 1). For common single nucleotide 255 polymorphisms (SNVs), the TOPMed imputation result showed the highest coverage 256 (0.950), whereas BBJ1k (0.890) and JEWEL3k (0.907) were lower than 1KGP (0.948). 257 This loss of variants was possibly due to the additional QC steps in combining the 258 1KGP with Japanese WGS (Method S2). As MAF decreased, the coverage also 259 decreased; however, more EAS samples in the reference panel mitigated this decrease, 260 as expected ( Figure 2C). genotypes were more certain (defined as the imputed dosage closer to 0, 1, or 2) 37 291 ( Figure 3B), compared to the other three panels (Figure 3C-E). The high certainty 292 increased Var(y) and Rsq. In Note S4, the positive relationship between imputed-293 genotype certainty and Rsq is demonstrated. As discussed further below, high certainty 294 or Rsq did not mean the imputation is more accurate. As shown in Figure 3B, many 295 TOPMed imputation, causing higher MARE and Rsq, and an even low dosage r 2 . 297 298 Variants with dosage r 2 < Rsq showed a lower βimp ( Figure 3F). We used rs671 as 299 another example (Figure 3G   We also showed that low θ values increased the imputed-allele certainty using a 336 randomly selected target haplotype ( Figure 4D and Figure S11). High certainty might explanation is provided in Note S5. Such imputed-dosage properties were also revealed 339 by the mean MARE and βimp across all variants on chr19 ( Figure S13). Taken together, 340 the simulated imputation using an extremely low θ value mimics the high certainty and 341 Rsq overestimation in the TOPMed imputation result in EAS (Figure 3). 342 343

Template switching rate impacts the imputation performance 344
Using the modified 1KGP panels, although the mean EmpRsq was insensitive to the 345 scaling of the θ value (maximum difference of 0.034 for the θ value scaling between 346 0.01 and 2-fold; Figure 5), low θ value increased Rsq ( Figure 5). We evaluated the 347 number of confident alleles (HDS > 0.9) and high-Rsq variants (Rsq > 0.7) (Methods). 348 Downscaling the θ value increased the number of confident alleles and high-Rsq 349 variants (Table S2). When the θ value was 0.5-fold, the number of confident alleles and 350 high-Rsq variants increased by 5.46% and 8.89%, respectively, and if the θ value was 351 2-fold, these numbers decreased by 11.8% and 13.8%, respectively. These results suggesting that the θ estimates were in a trade-off between the panel size and ancestral 364 diversity ( Figure 6E). Note S6 explains these effects. 365 366

Fitness of the θ value, deviation, and imputation performance 367
To elucidate the multi-ancestry reference panel's impact on the imputation result, we 368 simulated two scenarios using the multi-ancestry reference panels: the target sample 369 was from the (1) minor and (2) major ancestry ( Figure 1C). 370 371 Scenario 1: The target sample was from the minor ancestry 372 We simulated 8 EUR-EAS reference panels and used 100 EUR samples as the target 373 (Methods). As the panel size increased and θ value decreased (  (Table S3). These results revealed that a lower θ value increases the number of 384 confident alleles and high-Rsq variants, without improving the EmpRsq (Table S3). We simulated 13 JPT-1KGP reference panels and used WGS993 as the target 392 (Methods). When only JPT samples were in the panel, the mean EmpRsq and Rsq 393 increased with the panel size, as expected ( Figure S17). When combined with the 394 1KGP subsets, the mean EmpRsq was highest when using JPT3256+1KGP-EAS for 395 variants with MAF ≥ 1% and JPT3256+1KGP-JPT for variants with 1% > MAF ≥ 0.5% difference < 0.01 for all MAF categories). The mean Rsq was the highest when using 398 JPT3256 and decreased with the addition of more ancestries, with maximum differences 399 of 0.007, 0.024, and 0.048 for variants with MAF ≥ 5%, 5% > MAF ≥ 1%, and 1% > MAF 400 ≥ 0.5%, respectively ( Table 3) We used the TOPMed imputation pipeline (Methods) and observed an upward 421 deviation of Rsq, particularly when using the 1KGP panel ( Figure S19). The θ value 422 transformed from the genetic map was 0.15-and 0.27-fold of that estimated by 423 Minimac3 when using the 1KGP and JEWEL3k panels, respectively ( Table 4). change the imputation accuracy, as verified in a Finnish study. 40 We had also shown 458 that dosage r 2 was insensitive to the θ value. Second, the θ value was only 459 underestimated for the EAS (size = 1,184), but the sizes of EUR, AFR, and AMR (size = imputation result was sometimes misleadingly high in AFR 38 and even EUR. 41 A recent 462 paper reported that the TOPMed panel was more robust to the low-density genotyping 463 array than the HRC and 1KGP panels. 18 These results also suggested that a fixed low θ 464 Our study has several limitations. First, our simulations of the reference panel were 477 study-specific. As discussed, the θ estimates decreased with the panel size and 478 increased with the ancestral diversity. However, a larger panel typically increases the 479 diversity simultaneously. Therefore, the θ value and imputation result need to be 480 discussed case by case. This limitation also implies that the general experience may not 481 work well for a new reference panel and target dataset. Second, in the simulations of 482 multi-ancestry reference panels (scenarios 1 and 2), WGS datasets were merged using IMPUTE2. 9 We did not investigate the impact of IMPUTE2 but treated the merged 484 dataset similarly to the WGS dataset. It might underestimate the number of high-Rsq 485 variants from the distant ancestry (Note S7). Our results using the 1KGP (not modified 486 by IMPUTE2) also revealed that only a limited number of confident alleles and high-Rsq 487 variants in the imputation result were from the distant ancestry (Note S7 and Table S2). In summary, we explained that the HMM parameter could be a potential reason for 494 inaccurate Rsq and inferior dosage r 2 when using large multi-ancestry reference panels. 495 This is also the first study of the relationship between the template switching rate, 496 imputed-genotype certainty, Rsq, and dosage r 2 . We envision that our methods and 497 conclusions could provide insights into benchmarking studies, construction of reference 498 panels, and development of imputation algorithms and pipelines in the future. We thank the staff of the BBJ for collecting and managing samples and clinical 507 information. We acknowledged the Human Genome Center, the Institute of Medical 508  hum0014 and hum0311, respectively).