Abstract
For a variety of human malignancies, incidence, treatment efficacy and overall prognosis vary considerably between populations and ethnic groups. Disentangling the effects of particular population backgrounds can help both in understanding cancer biology and in tailoring therapeutic interventions. Because self-reported or inferred patient data can be incomplete or misleading due to migration and genomic admixture, data-driven ancestry estimation should be preferred. While algorithms to analyze ancestry structure in healthy individuals have been developed, an easy-to-use tool to assign population groups based on genotyping data from SNP profiles is still missing, and the validity of population assignment strategies for aberrant cancer genomes has not been benchmarked.
We benchmarked the consistency and accuracy of cross-platform population assignment and demonstrated its high accuracy on unaltered as well as cancer genomes. Despite widespread and extensive somatic mutations in cancer profiling data, population assignment consistency between germline and highly mutated samples from cancer patients reached 97% and 92% for assignment into 5 and 26 populations, respectively. Comparison of our benchmarked results with self-reported metadata yielded a matching rate between 88% and 92%. Despite this relatively high matching rate, the ethnicity labels in the metadata are vague compared with the standardized output from our tool.
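The consistency figures above compare, per patient, the population assigned from the germline sample with the one assigned from the tumor sample. A minimal sketch of such a concordance calculation follows; the function name, the patient identifiers and the toy labels are hypothetical illustrations and are not taken from the tool itself.

```python
from typing import Mapping


def assignment_concordance(germline: Mapping[str, str],
                           tumor: Mapping[str, str]) -> float:
    """Fraction of patients whose tumor-derived population label
    equals their germline-derived label.

    Only patients present in both mappings are counted.
    """
    shared = germline.keys() & tumor.keys()
    if not shared:
        raise ValueError("no patient has both a germline and a tumor assignment")
    matches = sum(germline[p] == tumor[p] for p in shared)
    return matches / len(shared)


# Hypothetical toy data: patient ID -> assigned population group.
germline = {"P1": "EUR", "P2": "EAS", "P3": "AFR", "P4": "SAS"}
tumor = {"P1": "EUR", "P2": "EAS", "P3": "AMR", "P4": "SAS"}

print(assignment_concordance(germline, tumor))  # 0.75
```

The same function applies unchanged at either resolution (5 groups or 26 populations), since it only compares labels for equality.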
We have developed a bioinformatics tool to assign populations from genome profiling data and validated its performance in healthy as well as aberrant cancer genomes. It is ready to use with genotyping data from nine commercial SNP array platforms or with sequencing data. The tool is effective for scrutinizing the population structure in cancer genomes and provides a better measure for integrating genotyping data from various platforms than self-reported information. It will facilitate research on the interplay between ethnicity-related genetic background and molecular patterns in cancer entities, and help disentangle possible hereditary contributions.
The docker image of the tool is provided in DockerHub as “baudisgroup/snp2pop”.
Abbreviations
- ACB: African Caribbeans in Barbados
- AFR: African
- ALDH2: Aldehyde Dehydrogenase 2 Family (Mitochondrial)
- AMR: Admixed American
- ASW: Americans of African Ancestry in SW USA
- BAF: B allele frequency
- BEB: Bengali from Bangladesh
- CDX: Chinese Dai in Xishuangbanna, China
- CEU: Utah Residents (CEPH) with Northern and Western European Ancestry
- CHB: Han Chinese in Beijing, China
- CHS: Southern Han Chinese
- CLM: Colombians from Medellin, Colombia
- EAS: East Asian
- ESN: Esan in Nigeria
- EUR: European
- FIN: Finnish in Finland
- Fst: Fixation Index
- GBR: British in England and Scotland
- GEO: Gene Expression Omnibus
- GIH: Gujarati Indian from Houston, Texas
- GWD: Gambian in Western Divisions in the Gambia
- IBS: Iberian Population in Spain
- ITU: Indian Telugu from the UK
- JPT: Japanese in Tokyo, Japan
- KHV: Kinh in Ho Chi Minh City, Vietnam
- LD: Linkage disequilibrium
- LWK: Luhya in Webuye, Kenya
- MAF: Minor allele frequency
- MSL: Mende in Sierra Leone
- MXL: Mexican Ancestry from Los Angeles USA
- PCA: Principal Component Analysis
- PEL: Peruvians from Lima, Peru
- PJL: Punjabi from Lahore, Pakistan
- PUR: Puerto Ricans from Puerto Rico
- SAS: South Asian
- SNP: Single nucleotide polymorphism
- STU: Sri Lankan Tamil from the UK
- TCGA: The Cancer Genome Atlas
- TSI: Toscani in Italia
- YRI: Yoruba in Ibadan, Nigeria
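
The 26 population codes listed here fall into the five continental groups (AFR, AMR, EAS, EUR, SAS), which correspond to the two assignment resolutions reported in the abstract. The lookup below is a minimal sketch that assumes the standard 1000 Genomes Project grouping of these codes; the names `SUPERPOPULATIONS`, `POP_TO_SUPER` and `to_superpopulation` are hypothetical and not part of the tool.

```python
# Continental group -> population codes, assuming the standard
# 1000 Genomes Project grouping (not taken from the tool itself).
SUPERPOPULATIONS = {
    "AFR": ["ACB", "ASW", "ESN", "GWD", "LWK", "MSL", "YRI"],
    "AMR": ["CLM", "MXL", "PEL", "PUR"],
    "EAS": ["CDX", "CHB", "CHS", "JPT", "KHV"],
    "EUR": ["CEU", "FIN", "GBR", "IBS", "TSI"],
    "SAS": ["BEB", "GIH", "ITU", "PJL", "STU"],
}

# Inverted lookup: population code -> continental group.
POP_TO_SUPER = {
    pop: group
    for group, pops in SUPERPOPULATIONS.items()
    for pop in pops
}


def to_superpopulation(code: str) -> str:
    """Collapse a 26-population code into its 5-group label."""
    return POP_TO_SUPER[code.upper()]


print(len(POP_TO_SUPER))          # 26
print(to_superpopulation("CHB"))  # EAS
```

Collapsing fine-grained labels this way lets assignments at the 26-population level be compared directly with assignments made at the 5-group level.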