Abstract
To better understand the composition of the heterogeneous tissue samples used to generate large genomic datasets, we developed a method for estimating the abundance of T cells within a cellular population. Somatic recombination of chromosomal DNA in T cells creates a vast repertoire of structurally divergent T cell receptors (TCRs) that recognize an array of non-self proteins. It also leaves a genomic signature by which T cells can be distinguished from other cell types in non-targeted next-generation sequencing (NGS) data. Here we leverage this signature to extract reads containing rearranged TCR sequences from non-targeted datasets, such as whole genome sequencing (WGS) or whole exome sequencing (WES) data. We isolate T cell rearranged reads from the remainder of the genome (99.9%), confirm them, accurately estimate relative T cell abundance within a cellular population, and provide a snapshot of the TCR repertoire. This approach differs from available TCR software, which focuses on examining the overall diversity of the TCR repertoire and requires amplification or selection of this region before sequencing. It has particular utility for immunoscoring clinical patient samples in situations where genomic data exist and other approaches are unavailable.