Sensitive detection of DNA contamination in tumor samples via microhaplotypes

Low levels of sample contamination with other human DNAs can have disastrous effects on the accurate identification of somatic variation in tumor samples. Detection of sample contamination in DNA is often based on low frequency variants that indicate if more than a single source of DNA is present. This strategy works with standard DNA samples but can be problematic in solid tumor FFPE samples because there are often huge variations in allele frequency (AF) due to copy number changes arising from gains and losses across the genome. The variable AFs make detection of contamination challenging. To avoid this, we counted microhaplotypes to assess sample contamination. Microhaplotypes are sets of variants on the same sequencing read that can be unambiguously phased. Instead of measuring AF, the number of microhaplotypes is determined. Contamination detection becomes based on fundamental genomic properties, linkage disequilibrium (LD) and the diploid nature of human DNA, rather than variant frequencies. We optimized microhaplotype panel content and selected 164 SNV sets located in regions already being sequenced within a cancer panel. Thus, contamination detection uses existing sequence data. LD data from the 1000 Genomes Project is used to make the panel ancestry agnostic, providing the same sensitivity for contamination detection with samples from individuals of African, East Asian, and European ancestry. Detection of 1% contamination with no matching normal sample is possible. The methods described here can also be extended to other DNA mixtures such as forensic and non-invasive prenatal testing samples where DNA mixes can be similarly detected. The microhaplotype method allows sensitive detection of DNA contamination in FFPE tumor and other samples when deep coverage with Illumina or other high accuracy NGS is used.


Introduction
ancestral groups described by the 1000 Genomes Project (African, East Asian, European). DNA quality was highly variable, which primarily impacted mixing studies when two samples of much different quality were mixed.

Prior to using MHs, we had assessed DNA contamination based on finding low level germline SNVs that were assumed to come from contaminating DNA. However, FFPE tumor samples can have high levels of copy number variation that can have a serious impact on germline SNV AFs.

Having an assigned dbSNP number was used as a surrogate for variation that is most likely real since it has been observed previously. Variants were sorted by MAF into 10% bins. Most samples had the expected distribution where >80% of the variants had 0-10% MAF, with this low-level variation arising from sequencing errors, FFPE artifacts or somatic variants. In addition, there are also many with 40-50% MAF arising from heterozygous germline variants. However, three samples had much more widely distributed MAFs, including one sample (AATF748T) where >20% of the variants had MAFs between 20 and 30%. These are germline variants, as confirmed by matching normal DNA samples that were also sequenced for these samples. When we used a method for detecting DNA contamination based on low frequency germline variants, these samples were discarded based on supposed high contamination levels when, in fact, they are close to 100% pure based on MH analysis. The "contamination" signal is actually copy number variation.

Because of this, variation in European ancestry individuals had a larger impact on the initial choice of SNVs than variation in other ancestries. Thus, there is a possibility that samples from different ancestral backgrounds could respond differently when not filtered based on ancestry.
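As an illustration of the 10% MAF binning described above, the following Python sketch reports the fraction of variants falling into each bin. The function name and the example values are hypothetical and are not taken from the study; the sketch assumes MAFs are given as fractions between 0 and 0.5.

```python
def maf_bin_fractions(mafs, bin_width=0.10, max_maf=0.50):
    """Return the fraction of variants in each MAF bin (0-10%, 10-20%, ...).

    `mafs` is a list of minor allele frequencies expressed as fractions.
    A value equal to `max_maf` is placed in the last bin.
    """
    n_bins = round(max_maf / bin_width)
    counts = [0] * n_bins
    for maf in mafs:
        if not 0.0 <= maf <= max_maf:
            raise ValueError(f"MAF out of range: {maf}")
        index = min(int(maf / bin_width), n_bins - 1)
        counts[index] += 1
    total = len(mafs)
    if total == 0:
        return [0.0] * n_bins
    return [count / total for count in counts]


# Hypothetical example: two low-frequency variants, one at 25%, one at 45%.
fractions = maf_bin_fractions([0.02, 0.05, 0.25, 0.45])
```

In a sample with the expected distribution, most of the mass would sit in the first and last bins; a large fraction in the middle bins is the pattern that the text attributes to copy number variation rather than contamination.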

Once the candidate MH sets were identified, they were balanced for ancestry so that detection of contamination in all groups would behave similarly.

Variants not in a segmental duplication or a low confidence region as defined by gnomAD were

Even at the lowest tested contamination level, 0.1%, the mean number of 3rd/4th MHs is over 30 compared to 13 (range 2-39) for the pure samples without mixing ( MHs before a contamination level is calculated. Based on these data, 25 is set as the minimum number before a sample is considered potentially contaminated because it is less than all samples mixed at 1% and high enough to minimize the impact of individual outliers.
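The thresholding step can be sketched in Python as follows. This is a minimal illustration, not the pipeline used in the study: the input layout (a dictionary mapping each MH locus to its per-haplotype read counts), the function names, and the `min_reads` filter are assumptions introduced here. Only the cutoff of 25 comes from the text.

```python
MIN_EXTRA_MHS = 25  # minimum number of 3rd/4th MHs, as stated in the text


def count_extra_mhs(locus_counts, min_reads=1):
    """Count 3rd and 4th MHs across all loci.

    `locus_counts` maps a locus name to a dict of haplotype -> read count.
    `min_reads` is a hypothetical filter on the reads supporting a haplotype.
    """
    extra = 0
    for hap_counts in locus_counts.values():
        ranked = sorted(hap_counts.values(), reverse=True)
        # The two best-supported haplotypes are taken as the sample's own;
        # positions 2 and 3 in the ranking are the 3rd and 4th MHs.
        extra += sum(1 for reads in ranked[2:4] if reads >= min_reads)
    return extra


def is_potentially_contaminated(locus_counts, threshold=MIN_EXTRA_MHS, min_reads=1):
    """Flag a sample whose 3rd/4th MH count reaches the threshold."""
    return count_extra_mhs(locus_counts, min_reads) >= threshold


# Hypothetical data: 30 loci that each show one extra haplotype.
mixed = {f"mh{i}": {"AC": 500, "GT": 480, "AT": 12} for i in range(30)}
pure = {f"mh{i}": {"AC": 500, "GT": 480} for i in range(30)}
```

With these made-up counts, `mixed` carries 30 extra MHs and is flagged, while `pure` carries none.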

While the number of 3rd MHs is relevant for assessing whether a sample is potentially contaminated, it is less useful for estimating the level of contamination, as it reaches a maximum around 2% contamination with these coverage levels. In contrast, median 3rd MH frequency changes as a function of contamination level, so it is more useful in this regard.
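A median 3rd MH frequency could be computed along the following lines. This Python sketch assumes the same hypothetical input layout as a per-locus dictionary of haplotype read counts, and it defines the 3rd MH frequency at a locus as the read count of the third-ranked haplotype divided by all reads at that locus; the study may define or filter these values differently.

```python
from statistics import median


def third_mh_frequencies(locus_counts):
    """Return the 3rd MH frequency for every locus that has one.

    `locus_counts` maps a locus name to a dict of haplotype -> read count.
    """
    freqs = []
    for hap_counts in locus_counts.values():
        ranked = sorted(hap_counts.values(), reverse=True)
        total = sum(ranked)
        if len(ranked) >= 3 and total > 0:
            freqs.append(ranked[2] / total)
    return freqs


def median_third_mh_frequency(locus_counts):
    """Median 3rd MH frequency, or None if no locus shows a 3rd MH."""
    freqs = third_mh_frequencies(locus_counts)
    return median(freqs) if freqs else None


# Hypothetical read counts for three loci; the last has no 3rd MH.
example = {
    "mh1": {"AC": 490, "GT": 500, "AT": 10},
    "mh2": {"CC": 480, "TT": 500, "CT": 20},
    "mh3": {"GG": 1000},
}
median_freq = median_third_mh_frequency(example)
```

Loci without a 3rd MH are excluded rather than counted as zero, so the median reflects only loci where a candidate contaminating haplotype was observed.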

The in silico mixing studies provide an empirical calibration, but it is also useful to understand the theoretical basis for those findings. As shown in Fig 3, the expected frequency of the contaminating 3rd MH depends on the nature of the starting and incoming genotypes. Some combinations will generate no 3rd MH, others will generate a 3rd MH with a frequency half the level of incoming contaminant, and others will generate a genotype that is the same frequency as the contaminant. Since most variant combinations will be at the 50% level, examination of the median value for the 3rd MH should yield a value that is half of the true contamination level.

is possible to employ various error correction methods to improve sensitivity, but the need for that is dependent on the sensitivity required for the application. In addition, there is no value in sequencing a sample more deeply than the starting number of input molecules, which can be a limitation in some situations. Based on these considerations, we have aimed for detection of contamination at least as low as 1% in these samples.
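The theoretical relationship between genotype combinations and 3rd MH frequency, described earlier in this section, can be illustrated with a small Python sketch. It rests on simplifying assumptions that are not stated in the text: both genomes are diploid with equal copy number at the locus, each contaminant haplotype contributes half of the contaminant reads, and there is no sequencing error. The function names and haplotype labels are hypothetical.

```python
def contaminant_only_frequencies(host, contaminant, level):
    """Expected frequency of each haplotype seen only in the contaminant.

    `host` and `contaminant` are 2-tuples of haplotype labels (diploid
    genotypes); `level` is the contamination level as a fraction.
    """
    freqs = {}
    for hap in contaminant:
        if hap not in host:
            # Each contaminant chromosome contributes half of `level`.
            freqs[hap] = freqs.get(hap, 0.0) + level / 2
    return freqs


def estimate_contamination(median_third_mh_freq):
    """Rough estimate: twice the median 3rd MH frequency (see text)."""
    return 2 * median_third_mh_freq


host = ("A", "B")  # heterozygous sample genotype
level = 0.02       # 2% contamination

no_new = contaminant_only_frequencies(host, ("A", "B"), level)  # no 3rd MH
half = contaminant_only_frequencies(host, ("A", "C"), level)    # half the level
full = contaminant_only_frequencies(host, ("C", "C"), level)    # full level
```

The three calls correspond to the three cases in the text: no 3rd MH, a 3rd MH at half the contamination level, and a 3rd MH at the full contamination level. The doubling in `estimate_contamination` is only as good as the assumption that the median locus falls into the half-level case.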