PT  - JOURNAL ARTICLE
AU  - Sudhir Varma
TI  - Blind estimation and correction of microarray batch effect
AID  - 10.1101/202952
DP  - 2017 Jan 01
TA  - bioRxiv
PG  - 202952
4099  - http://biorxiv.org/content/early/2017/10/20/202952.short
4100  - http://biorxiv.org/content/early/2017/10/20/202952.full
AB  - Microarray batch effect (BE) has been the primary bottleneck for large-scale integration of data from multiple experiments. Current BE correction methods either need known batch identities (ComBat)or have the potential to overcorrect, by removing true but unknown biological differences (SVA).Even though the effects of technical differences on measured expression have been published, there are no BE correction algorithms that take the approach of predicting technical effects from parameters computed from a fixed reference sample set. We show that a set of signatures, each of which is a vector the length of the number of probes, calculated on a Reference set of microarray samples can predict much of the batch effect in other Validation sets. We present a rationale of selecting a Reference set of samples designed to estimate technical differences without removing biological differences. Putting both together, we introduce the Batch Effect Signature Correction (BESC) algorithm that uses the BES calculated on the Reference set to efficiently predict and remove BE. Using two independent Validation sets, we show that BESC is capable of removing batch effect without removing unknown but true biological differences. Much of the variations due to batch effect is shared between different microarray datasets. That shared information can be used to predict signatures (i.e. directions of perturbation) due to batch effect in new datasets. The correction is blind (without needing to re-compute the parameters on new samples to be corrected), single sample, (each sample is corrected independently of each other) and conservative (only those perturbations known to be likely to be due to technical differences are removed ensuring that unknown but important biological differences are maintained). Those three characteristics make it ideal for high-throughput correction of samples for a microarray data repository. An R Package besc implementing the algorithm is available from http://explainbio.com.