SNVstory: A dockerized algorithm for rapid and accurate inference of sub-continental


[...] any issues with sequencing quality and sample contamination. We ran Picard CollectMultipleMetrics on the aligned BAM files to collect alignment summary, quality score, and GC bias metrics (Table S1). Sequencing read allocation was calculated using samtools. Coverage information was collected using mosdepth [31]. The average coverage for all realigned samples was 40X (ranging from 31X to 77X).
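As an illustration of how a cohort-level coverage figure of this kind can be derived, the following minimal sketch reads the `total` row of a mosdepth-style summary for each sample and reports the average, minimum, and maximum of the per-sample means. The tab-separated layout (`chrom`, `length`, `bases`, `mean`, `min`, `max`) is assumed here, and the sample names and numbers are toy values, not data from this study.

```python
import csv
import io


def total_mean_coverage(summary_text):
    """Return the 'mean' column of the 'total' row in a mosdepth-style summary."""
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    for row in reader:
        if row["chrom"] == "total":
            return float(row["mean"])
    raise ValueError("no 'total' row found in summary")


def cohort_coverage(summaries):
    """Summarise per-sample mean coverage as (average, minimum, maximum)."""
    means = [total_mean_coverage(text) for text in summaries.values()]
    return sum(means) / len(means), min(means), max(means)


# Toy summaries (hypothetical samples; assumed column layout).
HEADER = "chrom\tlength\tbases\tmean\tmin\tmax\n"
toy_summaries = {
    "sampleA": HEADER + "chr1\t1000\t30000\t30.00\t0\t90\ntotal\t1000\t30000\t30.00\t0\t90\n",
    "sampleB": HEADER + "chr1\t1000\t40000\t40.00\t0\t95\ntotal\t1000\t40000\t40.00\t0\t95\n",
    "sampleC": HEADER + "chr1\t1000\t50000\t50.00\t0\t99\ntotal\t1000\t50000\t50.00\t0\t99\n",
}

average, lowest, highest = cohort_coverage(toy_summaries)
print(f"average {average:.0f}X (range {lowest:.0f}X to {highest:.0f}X)")
```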

Removal of Related Samples
Related samples of the third degree (e.g., first cousins, great-grandparents, or great-grandchildren) or closer were identified by the relationship inference tool KING [32]. Data from the 1kGP and SGDP were preprocessed using PLINK2 with the following parameters: "--new-id-max-allele-len 10000 --max-alleles 2" [33]. KING recommends performing as little filtering as possible. However, an additional filtering step was needed to prevent the computation from running out of memory, so the analysis was restricted to variants shared by at least two individuals: "--maf 0.0007" in the case of the 1kGP and "--maf 0.007" for SGDP. After removing the variants present in only one sample, KING was executed on the resulting bed file, with the "--kinship" option set to report pairwise relatedness inference. Samples were flagged that had a third-degree kinship coefficient cutoff [...]

Because some samples from the 1kGP are related to more than one other individual in the cohort, the following procedure was implemented to remove the fewest number of samples. Considering only the relationships with coefficients exceeding the third-degree cutoff, a graph-based method was [...]
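The description of the graph-based method is cut off in this excerpt, so the following is only a sketch of one common way to approach the problem, not the procedure used here. Related pairs above a cutoff become edges in a graph, and the sample with the most remaining relatives is removed repeatedly until no related pair is left. This greedy heuristic tends to remove few samples but does not guarantee the minimum. The function name, the toy pairs, and the cutoff value are assumptions: 0.0442 is the lower bound KING documents for third-degree relationships, and the value actually applied is not shown in the text above.

```python
from collections import defaultdict


def prune_related(pairs, cutoff):
    """Greedily pick samples to remove so that no related pair remains.

    pairs: iterable of (sample_a, sample_b, kinship_coefficient).
    Returns the removed sample IDs in the order they were dropped.
    """
    graph = defaultdict(set)
    for a, b, kinship in pairs:
        if kinship > cutoff:  # keep only relationships exceeding the cutoff
            graph[a].add(b)
            graph[b].add(a)

    removed = []
    while True:
        # Samples that still have at least one related partner in the graph.
        candidates = [(len(nbrs), name) for name, nbrs in graph.items() if nbrs]
        if not candidates:
            break
        # Highest degree first; ties broken alphabetically for determinism.
        _, worst = min(candidates, key=lambda t: (-t[0], t[1]))
        for nbr in graph.pop(worst):
            graph[nbr].discard(worst)
        removed.append(worst)
    return removed


# Toy input (hypothetical sample IDs and coefficients).
ASSUMED_CUTOFF = 0.0442  # assumption, see the paragraph above
toy_pairs = [
    ("S1", "S2", 0.25),
    ("S2", "S3", 0.06),
    ("S4", "S5", 0.05),
    ("S6", "S7", 0.01),  # below the cutoff, ignored
]
to_remove = prune_related(toy_pairs, ASSUMED_CUTOFF)
print(to_remove)
```

In the toy input, S2 is related to both S1 and S3, so dropping S2 alone resolves two relationships, which is the motivation for treating the problem as a graph rather than pair by pair.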

gnomAD: Because our gnomAD algorithm uses synthetic data, we must consider two parameters: a population size that balances the model's accuracy with training time and resources, and a p-value cutoff for a Chi-Square test that removes uninformative SNVs. These parameters were tuned using a nested for loop to iterate over all combinations of population sizes and p-values for SNV removal (Figure S1).
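The nested loop over population sizes and p-value cutoffs, combined with repeated 80/20 training/validation splits, can be sketched as follows. This is a structural illustration only: the `evaluate` callback, which would simulate data, apply the Chi-Square filter, train a model, and return a validation score, is hypothetical and supplied by the caller, and the number of splits and the candidate values are placeholders rather than the settings used in this work.

```python
import random


def split_80_20(items, seed):
    """Shuffle items reproducibly and split them 80/20."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = (len(shuffled) * 4) // 5
    return shuffled[:cut], shuffled[cut:]


def grid_search(samples, population_sizes, p_value_cutoffs, evaluate, n_splits=3):
    """Score every (population size, p-value cutoff) pair; return the best pair and all scores."""
    results = {}
    for pop_size in population_sizes:          # outer loop: population size
        for p_cutoff in p_value_cutoffs:       # inner loop: p-value cutoff
            scores = []
            for seed in range(n_splits):       # several 80/20 splits per combination
                train, valid = split_80_20(samples, seed)
                scores.append(evaluate(train, valid, pop_size, p_cutoff))
            results[(pop_size, p_cutoff)] = sum(scores) / len(scores)
    best = max(results, key=results.get)
    return best, results


# Placeholder scoring function standing in for model training (hypothetical).
def toy_evaluate(train, valid, pop_size, p_cutoff):
    return -abs(pop_size - 100) - p_cutoff


best, scores = grid_search(list(range(10)), [50, 100, 200], [0.01, 0.05], toy_evaluate)
print(best)
```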
For each combination, we generated a set of 80/20 training/validation splits of the data. A Chi-Square [...]

[...] confusion matrix (Figure S4). The accuracies for the 1kGP subcontinental models are as follows: [...]

[...] Table Browser using assembly GRCh37 to get the genomic interval for each gene. If a region contains multiple genes, we combine the genes to form a non-overlapping genomic interval (e.g., ANKRD45, [...]

[...] (Table S2). Self-reported race is derived from the paternal/maternal ethnic background.
Ethnicity is categorized into one of three groups: Non-Hispanic or Latino, Hispanic or Latino, and Unknown/Not Reported Ethnicity. Race is classified into one of five groups: White, Asian, Bi-racial/Multi-racial, Black or African American, and Unknown/Unspecified. Due to the broadness of these categories, we compare them with predicted genetic ancestry for the continental models only (Table 1).
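The step described earlier, in which multiple genes in one region are combined into a non-overlapping genomic interval, can be illustrated with a standard interval merge. This is a sketch under the assumption that genes are combined when their coordinates overlap on the same chromosome; the excerpt does not state the exact rule. Gene names and coordinates below are hypothetical.

```python
def merge_gene_intervals(genes):
    """Merge overlapping gene intervals per chromosome.

    genes: list of (name, chrom, start, end).
    Returns a list of (chrom, start, end, [gene names]) with no overlaps.
    """
    merged = []
    for name, chrom, start, end in sorted(genes, key=lambda g: (g[1], g[2], g[3])):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous interval: extend it and record the gene.
            prev_chrom, prev_start, prev_end, names = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end), names + [name])
        else:
            merged.append((chrom, start, end, [name]))
    return merged


# Hypothetical genes and coordinates.
toy_genes = [
    ("GENE_B", "chr1", 150, 300),
    ("GENE_A", "chr1", 100, 200),
    ("GENE_C", "chr1", 400, 500),
    ("GENE_D", "chr2", 120, 180),
]
intervals = merge_gene_intervals(toy_genes)
for interval in intervals:
    print(interval)
```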

For most individuals, genetic ancestry agrees with self-reported ethnicity/race. For example, among those predicted to be European, a match of White / Non-Hispanic or Latino for race/ethnicity occurs in 92.5%, 96.7%, and 89.1% of individuals by the gnomAD (Table 1A), 1kGP (Table 1B), and SGDP (Table 1C) models, respectively. However, several cases exist where individuals are self-reported as White while having a different genetic ancestry across multiple models, and vice versa. Additionally, 13 of our cases have either Unknown/Not Reported Ethnicity or Unknown/Unspecified Race. As discussed in the Introduction, the ability to refine or add genetic ancestry information in these cases is helpful for added diagnostic precision in variant filtering/prioritization.
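An agreement rate of the kind reported above can be computed from per-individual records as in the following sketch. The record fields (`predicted`, `race`, `ethnicity`) and the toy records are hypothetical and do not reproduce the cohort or the values in Table 1.

```python
from collections import Counter


def agreement_rate(records, ancestry, race, ethnicity):
    """Fraction of individuals predicted as `ancestry` whose self-report is (race, ethnicity)."""
    predicted = [r for r in records if r["predicted"] == ancestry]
    if not predicted:
        return None  # no individuals with this predicted ancestry
    hits = sum(1 for r in predicted if r["race"] == race and r["ethnicity"] == ethnicity)
    return hits / len(predicted)


def crosstab(records):
    """Count individuals by (predicted ancestry, race, ethnicity)."""
    return Counter((r["predicted"], r["race"], r["ethnicity"]) for r in records)


# Toy records (hypothetical).
toy_records = [
    {"predicted": "European", "race": "White", "ethnicity": "Non-Hispanic or Latino"},
    {"predicted": "European", "race": "White", "ethnicity": "Non-Hispanic or Latino"},
    {"predicted": "European", "race": "White", "ethnicity": "Non-Hispanic or Latino"},
    {"predicted": "European", "race": "Unknown/Unspecified", "ethnicity": "Unknown/Not Reported Ethnicity"},
    {"predicted": "East Asian", "race": "Asian", "ethnicity": "Non-Hispanic or Latino"},
]
rate = agreement_rate(toy_records, "European", "White", "Non-Hispanic or Latino")
print(f"{rate:.1%}")
```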

Model Interpretation for Indeterminate Samples
[...] However, there is room for improvement, as our most diverse dataset (SGDP) includes the fewest samples. We could not build subcontinental models as granular as the labels provided because there [...]

[...] The inferred distinctiveness of Latino copies of KRTAP19-8 suggests that rare founder mutations in this gene may contribute to increased rates of thyroid cancer among women of Hispanic ancestry. The ability to target variants in genes inherited from specific populations adds a new tool to the diagnostician's toolkit and could lead to improved patient outcomes.

Finally, our approach allows users to reliably execute our models given a single-sample or multi-sample VCF, with results tailored toward ancestry assignment for an individual sample. This provides immediately useful ancestry information in the clinical setting, where ancestry can be used to inform diagnostic or therapeutic decisions. Specifically, a subject's ancestry can be used to help prioritize variants that may be rare in one population but not another. In the clinical setting, it may be essential to recognize the difference between ethnicity, race, and genetic ancestry in determining the optimal therapy or drug dosage.
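To make the idea of ancestry-aware prioritization concrete, the sketch below keeps only the variants whose allele frequency is below a threshold in the population matching the subject's predicted ancestry. It is an illustration under stated assumptions, not part of SNVstory: the population labels, the 1% threshold, the record layout, and the choice to treat a missing frequency as zero are all hypothetical.

```python
def prioritize_variants(variants, subject_population, max_af=0.01):
    """Return IDs of variants that are rare in the subject's predicted population.

    variants: list of {"id": str, "af": {population: allele_frequency}}.
    A variant absent from the population's frequency table is treated as
    frequency 0.0 (an assumption made for this sketch).
    """
    kept = []
    for variant in variants:
        af = variant["af"].get(subject_population, 0.0)
        if af < max_af:
            kept.append(variant["id"])
    return kept


# Toy variants: each is common in one population and rare in the other.
toy_variants = [
    {"id": "var1", "af": {"POP_A": 0.20, "POP_B": 0.001}},
    {"id": "var2", "af": {"POP_A": 0.0005, "POP_B": 0.15}},
]
print(prioritize_variants(toy_variants, "POP_A"))
print(prioritize_variants(toy_variants, "POP_B"))
```

The same variant list yields a different shortlist depending on the predicted ancestry, which is why an inaccurate or missing ancestry label can change which variants are reviewed.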

Given the widespread availability of genome sequencing data and models like SNVstory that can accurately predict ancestry, we advocate for genetic ancestry to become the standard classification [...]

bioRxiv preprint doi: https://doi.org/10.1101/2023.06.02.543369; this version posted June 5, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.