RT Journal Article
SR Electronic
T1 Tissue heterogeneity is prevalent in gene expression studies
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2020.12.02.407809
DO 10.1101/2020.12.02.407809
A1 Gregor Sturm
A1 Markus List
A1 Jitao David Zhang
YR 2020
UL http://biorxiv.org/content/early/2020/12/03/2020.12.02.407809.abstract
AB Background: Lack of reproducibility in gene expression studies has recently attracted much attention in and beyond the biomedical research community. Previous efforts have identified many underlying factors, such as batch effects and incorrect sample annotations. Recently, tissue heterogeneity, a consequence of unintended profiling of cells of other origins than the tissue of interest, was proposed as a source of variance that exacerbates irreproducibility and is commonly ignored. Results: Here, we systematically analyzed 2,692 publicly available gene expression datasets including 78,332 samples for tissue heterogeneity. We found a prevalence of tissue heterogeneity in gene expression data that affects on average 5-15% of the samples, depending on the tissue type. We distinguish cases of severe heterogeneity, which may be caused by mistakes in annotation or sample handling, from cases of moderate heterogeneity, which are more likely caused by tissue infiltration or sample contamination. Conclusions: Tissue heterogeneity is a widespread issue in publicly available gene expression datasets and thus an important source of variance that should not be ignored. We advocate the application of quality control methods such as BioQC to detect tissue heterogeneity prior to mining or analysing gene expression data. Competing Interest Statement: Both GS and JDZ are former or current employees of F. Hoffmann-La Roche Ltd, Switzerland. GS receives consulting fees from Pieris Pharmaceuticals GmbH outside this work.