Genotype imputation accuracy and the quality metrics of the minor ancestry in multi-ancestry reference panels

Large-scale imputation reference panels are now available and have contributed to efficient genome-wide association studies through genotype imputation. However, it is still under debate whether large multi-ancestry or small population-specific reference panels are the optimal choice for under-represented populations. We imputed genotypes of East Asian (EAS; 180k Japanese) subjects using the Trans-Omics for Precision Medicine (TOPMed) reference panel and found that the standard imputation quality metric (Rsq) substantially overestimated the dosage r² (squared correlation between imputed dosage and true genotype). Variance component analysis of Rsq revealed that increased imputed-genotype certainty (dosages closer to 0, 1, or 2) caused the upward bias, indicating a systematic bias in the imputation. Through systematic simulations using different template switching rates (θ value) in the hidden Markov model, we found that a lower θ value increased the imputed-genotype certainty and Rsq; however, dosage r² was insensitive to the θ value, thereby causing a deviation. In simulated reference panels with different sizes and ancestral diversities, the θ value estimates from Minimac decreased with the size of a single ancestry and increased with the ancestral diversity. Thus, Rsq could overestimate or underestimate dosage r² for a subpopulation in the multi-ancestry panel, and the deviation reflects different imputed-dosage distributions. Finally, despite the impact of θ value, distant


Rsq is the ratio between Var(y) and p(1 − p), where p is the alternative allele frequency (AAF) in the imputed dataset [3,7]. Then,

Rsq = SStot / [n p(1 − p)] = (SSreg + SSres) / [n p(1 − p)],

where SSres is the residual sum of squares that follows SStot = SSreg + SSres, and n is the number of imputed haplotypes. Hence, Rsq comprises two parts: regression-related and residual-related. We define [...]

The BBJ-180k was imputed using the TOPMed, 1KGP, BBJ1k, and JEWEL3k reference panels (Figure 1A). Characteristics of each panel and the target sample are listed in Table 1. We categorized the imputed variants by MAF and Rsq in each imputed dataset. With more EAS samples in the panel, more variants with low MAF passed each Rsq threshold (Figure 2A and Supplementary Table 1). In addition to the absolute number, unique variants were imputed from each panel (Figure 2B). These results reproduced the benefits of using large and different reference panels [19].

We then used WGS993 to empirically evaluate the imputation performance. MAF and MAC of WGS993 were used to categorize the variants into six bins: common (MAF ≥ 5%), low-frequency (5% > MAF ≥ 1% and 1% > MAF ≥ 0.5%), rare (0.5% > MAF >
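
The split of Rsq into a regression-related and a residual-related part, described in the Methods excerpt above, can be illustrated with a minimal Python sketch. Several choices here are assumptions rather than details from the text: dosages are treated as haploid values in [0, 1], Var(y) is taken as the population variance SStot / n, the regression is ordinary least squares of imputed dosage on the true allele, and the function name `rsq_decomposition` and the toy data are hypothetical. The residual-related part is shown only as SSres / [n p(1 − p)]; the sketch does not claim that this equals the paper's MARE, whose definition is not part of this excerpt.

```python
import numpy as np

def rsq_decomposition(y, g):
    """Split Rsq into a regression-related and a residual-related part.

    y : imputed haploid dosages in [0, 1], one value per haplotype
    g : true alleles (0 or 1) for the same haplotypes; must not be monomorphic
    """
    y = np.asarray(y, dtype=float)
    g = np.asarray(g, dtype=float)
    n = y.size
    p = y.mean()                   # alternative allele frequency in the imputed data
    denom = n * p * (1.0 - p)      # n * p(1 - p), the expected total sum of squares

    # Assumption: ordinary least squares of imputed dosage on true allele.
    slope, intercept = np.polyfit(g, y, 1)
    y_hat = intercept + slope * g

    ss_tot = np.sum((y - p) ** 2)       # total sum of squares
    ss_reg = np.sum((y_hat - p) ** 2)   # explained by the true allele
    ss_res = np.sum((y - y_hat) ** 2)   # residual sum of squares
    return {
        "rsq": ss_tot / denom,
        "reg_part": ss_reg / denom,
        "res_part": ss_res / denom,
        "slope": float(slope),
    }

# Hypothetical toy data: eight haplotypes, two of which carry the alternative allele.
y = [0.1, 0.0, 0.2, 0.0, 0.9, 0.6, 0.1, 0.1]
g = [0, 0, 0, 0, 1, 1, 0, 0]
parts = rsq_decomposition(y, g)
print(parts)
```

Because the fit includes an intercept, the two parts add up to Rsq exactly (up to floating-point error), which is the identity SStot = SSreg + SSres divided by n p(1 − p).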

Quantifying the deviation between Rsq and dosage r²
The deviation between Rsq and dosage r² was persistent in all MAF bins only when using the TOPMed panel (Supplementary Figure 2), which indicated a potential systematic bias from the imputation pipeline or the reference panel. To explain this observation, we analytically derived the relationship between Rsq and dosage r² (Methods). Two novel metrics, MARE and βimp, were introduced to quantify the deviation. MARE, a MAF-adjusted form of residual error, takes a value between 0 and 1 and increases with SSres. βimp describes the distinguishability between the mean imputed dosage of each true genotype group. Rsq is the ratio between observed and expected variance, whereas dosage r² measures the correlation. Under the assumption of "well-calibration" (the posterior allele probability from imputation equals the expected true allele dose), Rsq equals dosage r² [10]. Our analytical derivation of the relationship between Rsq and dosage r² did not assume "well-calibration" (Supplementary [...] (Figure 3A and 3F). It clearly showed that the overestimated Rsq was accompanied by a higher MARE (Figure 3A). We used rs142572000 as an example (Figure 3B-E). In the TOPMed imputation result, imputed genotypes were more certain (defined as the imputed dosage being closer to 0, 1, or 2) (Figure 3B) [37] than in the results of the other three panels (Figure 3C-E). The high certainty increased Var(y) and Rsq. The positive relationship between imputed-genotype certainty and Rsq is demonstrated in Supplementary Note 4. As discussed further below, high certainty or high Rsq did not mean that the imputation was more accurate. As shown in Figure 3B, many heterozygotes were incorrectly imputed with a dosage of approximately 0 in the TOPMed imputation, causing higher MARE and Rsq and an even lower dosage r².

Variants with dosage r² < Rsq showed a lower βimp (Figure 3F).
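
The point that higher certainty raises Var(y) and Rsq without making the imputation more accurate can be checked on a small hand-made example. The following Python sketch is illustrative only: it assumes haploid dosages in [0, 1], uses the population variance for Var(y), and mimics "more certain" dosages by rounding each value to 0 or 1, which is a deliberately extreme stand-in and not what any imputation software does. The helper names `rsq` and `dosage_r2` and the data are hypothetical.

```python
import numpy as np

def rsq(y):
    """Rsq as Var(y) / [p(1 - p)], with p the mean imputed haploid dosage."""
    y = np.asarray(y, dtype=float)
    p = y.mean()
    return float(y.var() / (p * (1.0 - p)))

def dosage_r2(y, g):
    """Squared Pearson correlation between imputed dosage and true allele."""
    return float(np.corrcoef(y, g)[0, 1] ** 2)

# Hypothetical data: ten haplotypes, the first two carry the alternative allele.
g = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
y_soft = np.array([0.6, 0.4, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])

# Extreme "certain" version: every dosage pushed to the nearest of 0 or 1.
# The second carrier (dosage 0.4) is now confidently and wrongly called 0.
y_certain = np.round(y_soft)

print("soft:    Rsq =", rsq(y_soft), " dosage r2 =", dosage_r2(y_soft, g))
print("certain: Rsq =", rsq(y_certain), " dosage r2 =", dosage_r2(y_certain, g))
```

In this toy case Rsq rises from about 0.18 to 1, while dosage r² falls from about 0.80 to about 0.44. The Rsq of 1 is not specific to these numbers: whenever every dosage is exactly 0 or 1, Var(y) equals p(1 − p), so Rsq reaches its maximum regardless of how many calls are wrong.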
We used rs671 as another example (Figure 3G [...]

We categorized MARE into Rsq bins and βimp into dosage r² bins to compare them between reference panels and with the values expected when Rsq equals dosage r². The mean MARE of the TOPMed result was above the expected value for Rsq bins 0.35-0.9, and the mean βimp of the 1KGP result was below those of the other panels (Supplementary Figure 8). These results suggested that the TOPMed result was more certain and might contain more wrongly imputed genotypes (Figure 3B and 3G), while the 1KGP result might possess higher shrinkage in the imputed dosage, as shown above (Figure 3C and 3H).

[...] Table 2). When the θ value was 0.5-fold, the number of confident alleles and high-Rsq variants increased by 5.46% and 8.89%, respectively, and if the θ value was 2-fold, these numbers decreased by 11.8% and 13.8%,
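
The direction of the θ effect reported above (a lower template switching rate giving more certain imputed alleles) can be reproduced in a toy haploid copying model. The sketch below is a simplified Li-Stephens-style HMM written for illustration, not Minimac's implementation: it assumes a single constant θ between all adjacent markers, a uniform jump to any template on switching, a fixed mismatch probability `eps`, and a hypothetical four-haplotype panel. The function name `impute_untyped` and all data are made up for this example.

```python
import numpy as np

def impute_untyped(ref, target, untyped, theta, eps=0.01):
    """Posterior alternative-allele dosage at one untyped marker (haploid target).

    ref     : (K, M) array of 0/1 reference haplotypes
    target  : length-M sequence of 0/1 alleles; the entry at `untyped` is ignored
    theta   : template switching probability between adjacent markers (toy constant)
    eps     : probability that the target allele mismatches the copied template
    """
    ref = np.asarray(ref)
    K, M = ref.shape
    # Stay on the same template with prob (1 - theta), else jump uniformly.
    T = (1.0 - theta) * np.eye(K) + theta / K
    # Emission probabilities; the untyped marker carries no information.
    E = np.ones((M, K))
    for m in range(M):
        if m != untyped:
            E[m] = np.where(ref[:, m] == target[m], 1.0 - eps, eps)

    # Forward pass (normalised at each marker for numerical stability).
    alpha = np.zeros((M, K))
    alpha[0] = E[0] / K
    alpha[0] /= alpha[0].sum()
    for m in range(1, M):
        alpha[m] = (alpha[m - 1] @ T) * E[m]
        alpha[m] /= alpha[m].sum()

    # Backward pass.
    beta = np.ones((M, K))
    for m in range(M - 2, -1, -1):
        beta[m] = T @ (E[m + 1] * beta[m + 1])
        beta[m] /= beta[m].sum()

    post = alpha[untyped] * beta[untyped]
    post /= post.sum()
    # Dosage = posterior weight of templates carrying the alternative allele.
    return float(post @ ref[:, untyped])

# Hypothetical panel: template 0 matches the target at every typed marker and
# carries the alternative allele at the untyped marker (index 3).
ref = [[0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 1, 0, 0]]
target = [0, 1, 0, -1, 1, 0, 1]   # -1 marks the untyped position

for theta in (0.001, 0.05, 0.5):
    print(theta, impute_untyped(ref, target, untyped=3, theta=theta))
```

With a small θ the single long-matching template dominates the posterior and the dosage sits close to 1; with a large θ the templates that match only at the flanking markers regain weight and the dosage moves toward an intermediate value. Whether the more certain call is also the more accurate one depends on the true allele, which is the distinction the text draws between Rsq and dosage r².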

Fitness of the θ value, deviation, and imputation performance
To elucidate the multi-ancestry reference panel's impact on the imputation result, we simulated two scenarios using the multi-ancestry reference panels: the target sample was from the (1) minor and (2) major ancestry (Figure 1C).

Scenario 1: The target sample was from the minor ancestry

We simulated 8 EUR-EAS reference panels and used 100 EUR samples as the target (Methods). As the panel size increased and θ value decreased (Table 2) [...] Figure 17). When combined with the 1KGP subsets, the mean EmpRsq was highest when using JPT3256+1KGP-EAS for variants with MAF ≥ 1% and JPT3256+1KGP-JPT for variants with 1% > MAF ≥ 0.5% (Table 3). Adding other ancestries decreased the mean EmpRsq marginally (maximum difference < 0.01 for all MAF categories). The mean Rsq was the highest when using JPT3256 and decreased with the addition of more ancestries, with maximum differences of 0.007, 0.024, and 0.048 for variants with MAF ≥ 5%, 5% > MAF ≥ 1%, and 1% > MAF ≥ 0.5%, respectively (Table 3)

EAS imputation using the public reference panels
We used the TOPMed imputation pipeline (Methods) and observed an upward deviation of Rsq, particularly when using the 1KGP panel (Supplementary Figure 19). The θ value transformed from the genetic map was 0.15- and 0.27-fold of that estimated by Minimac3 when using the 1KGP and JEWEL3k panels, respectively (Supplementary Table 5 [...] change the imputation accuracy, as verified in a Finnish study [40]. We had also shown that dosage r² was insensitive to the θ value. Second, the θ value was only [...]

In summary, we explained that the HMM parameter could be a potential reason for inaccurate Rsq and inferior dosage r² when using large multi-ancestry reference panels. This is also the first study of the relationship between the template switching rate, imputed-genotype certainty, Rsq, and dosage r². We envision that our methods and conclusions could provide insights into benchmarking studies, construction of reference panels, and development of imputation algorithms and pipelines in the future.

Data and code availability
We provide a tool to replicate the metrics used in this work (https://github.com/shimaomao26/impumetric). Other scripts are also deposited there. Genotype data for BBJ and the TOPMed imputation results were deposited at the NBDC Human Database (research ID: hum0014 and hum0311, respectively).