Abstract
The benefit of integrating batches of genomic data to increase statistical power in differential expression analysis is often hindered by batch effects, or unwanted variation in data caused by differences in technical factors across batches. It is therefore critical to effectively address batch effects in genomic data. Many existing methods for batch effect adjustment assume that the data follow a continuous, bell-shaped Gaussian distribution. However, in RNA-Seq studies, where the data are skewed, over-dispersed counts, this assumption is not appropriate and may lead to erroneous results. Negative binomial regression models have been used to better capture the properties of counts. We developed a batch correction method, ComBat-Seq, using negative binomial regression. ComBat-Seq retains the integer nature of count data in RNA-Seq studies, making the batch-adjusted data compatible with common differential expression software packages that require integer counts. We show in realistic simulations that ComBat-Seq adjusted data result in better statistical power and control of false positives in differential expression, compared to data adjusted by other available methods. We further demonstrate in a real data example that ComBat-Seq successfully removes batch effects and recovers the biological signal in the data.