Abstract
Summary Variability in datasets is not only the product of biological processes: it is also the product of technical biases. ComBat is one of the most widely used tools for correcting those technical biases, called batch effects, in microarray expression data.
In this technical note, we present a new Python implementation of ComBat. While the mathematical framework is strictly the same, we show here that our implementation: (i) yields similar results in terms of batch effect correction; (ii) is as fast as or faster than the R implementation of ComBat; and (iii) offers new tools for the bioinformatics community to participate in its development.
Availability and Implementation pyComBat is implemented in Python and is available under the GPL-3.0 license (https://www.gnu.org/licenses/gpl-3.0.en.html) at https://github.com/epigenelabs/pyComBat.
1. Introduction
Batch effects are the product of technical biases, such as variations in the experimental design or even atmospheric conditions (Lander, 1999; Fare et al., 2003). They become particularly apparent when merging different datasets, which have likely been built under different conditions. If not corrected, these batch effects may lead to incorrect biological insight, since the variability can be wrongly interpreted as the product of a biological process.
Multiple methods exist that address this problem. They include approaches related to frequentist statistics, such as simple normalization (Yang et al., 2002; Irizarry et al., 2012) or principal component analysis (Nielsen et al., 2002); and machine learning, such as support-vector machines (Benito et al., 2004). One of their main flaws, however, is their inability to handle small sample sizes or more than two batches at the same time (Chen et al., 2011).
ComBat, originally implemented in the R library sva (Leek et al., 2012), is based on the mathematical framework defined by Johnson et al. (2007). This tool offers parametric and non-parametric empirical Bayes approaches for correcting batch effects, and works for small sample sizes or in the presence of outliers. Note that the parametric method requires strong assumptions but is much faster than the non-parametric approach.
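The framework of Johnson et al. (2007) adjusts the location (mean) and scale (variance) of each gene within each batch. The following minimal sketch illustrates only that location-and-scale idea with NumPy. It is a deliberate simplification and not the ComBat algorithm: it omits the empirical Bayes shrinkage of the batch parameters and any handling of covariates. The function name `naive_location_scale_adjust` is hypothetical, and the sketch assumes at least two samples per batch.

```python
import numpy as np


def naive_location_scale_adjust(data, batch):
    """Align per-gene mean and spread across batches (no empirical Bayes).

    data  : array of shape (n_genes, n_samples)
    batch : sequence of n_samples batch labels
    Assumes every batch contains at least two samples.
    """
    data = np.asarray(data, dtype=float)
    batch = np.asarray(batch)

    # Per-gene targets computed over all samples, ignoring batch membership.
    overall_mean = data.mean(axis=1, keepdims=True)
    overall_sd = data.std(axis=1, ddof=1, keepdims=True)

    adjusted = np.empty_like(data)
    for label in np.unique(batch):
        cols = batch == label
        block = data[:, cols]
        # Location (mean) and scale (standard deviation) of this batch, per gene.
        batch_mean = block.mean(axis=1, keepdims=True)
        batch_sd = block.std(axis=1, ddof=1, keepdims=True)
        batch_sd[batch_sd == 0] = 1.0  # avoid dividing by zero for constant genes
        # Standardize within the batch, then map back to the overall scale.
        adjusted[:, cols] = (block - batch_mean) / batch_sd * overall_sd + overall_mean
    return adjusted
```

After this adjustment every batch has the same per-gene mean. The empirical Bayes step of ComBat, which pools information across genes to stabilize the batch estimates for small sample sizes, is what this sketch leaves out.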
In this article, we introduce pyComBat, a new Python implementation of ComBat that follows the same mathematical framework. We show that it yields comparable results when adjusting for batch effects, but is generally faster, in particular for the usually slow, but more general, non-parametric method.
2. pyComBat
pyComBat is a Python 3 implementation of ComBat. It mostly uses generic libraries such as pandas (McKinney, 2010) and NumPy (Van Der Walt et al., 2011) to mimic ComBat, following the exact same mathematical framework.
Two important features are not directly related to the performance of the software but are of utmost importance. First, pyComBat is available as open source software under a GPL-3.0 license, which means anyone can use, modify, distribute and share it. Opening pyComBat to the Python bioinformatics community is the best way to maintain and improve it, while increasing its robustness. Second, the reliability of pyComBat has been thoroughly checked using unit testing (with the pytest library, coverage of 83%) to assess the proper functioning of each sub-module and to ensure easy maintenance, in particular after modifications.
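As an illustration of the kind of unit test that pytest can collect, the following sketch checks one property of a toy batch-centering helper. Both `center_by_batch` and the test function are hypothetical and written for this example only; they are not taken from the pyComBat test suite.

```python
import numpy as np


def center_by_batch(data, batch):
    """Hypothetical helper: subtract each batch's per-gene mean."""
    data = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    out = data.copy()
    for label in np.unique(batch):
        cols = batch == label
        out[:, cols] -= data[:, cols].mean(axis=1, keepdims=True)
    return out


def test_center_by_batch_zeroes_batch_means():
    # Two genes (rows), four samples (columns), two batches.
    data = np.array([[1.0, 3.0, 10.0, 14.0],
                     [2.0, 2.0, 5.0, 7.0]])
    batch = np.array(["a", "a", "b", "b"])
    centered = center_by_batch(data, batch)

    # The output keeps the input shape.
    assert centered.shape == data.shape
    # Within every batch, each gene now has mean zero.
    for label in ("a", "b"):
        assert np.allclose(centered[:, batch == label].mean(axis=1), 0.0)
```

Saved in a file whose name starts with `test_`, such a function is discovered and run by the `pytest` command without further configuration.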
3. Comparison with ComBat
a. Dataset used
For software validation, we used the package bladderbatch version 1.22.0 (Leek, 2019), which contains microarray gene expression data on 57 samples from 5 batches and is the reference example dataset for the sva package. We then compared ComBat and pyComBat on the same dataset (corresponding to the first 20,000 genes of bladderbatch) for (i) power for batch effect correction; and (ii) computation time.
b. Batch effect correction
As an implementation of the ComBat algorithm, pyComBat is expected to have similar, if not identical, power in terms of batch effect correction. This is confirmed in Fig. 1A, which shows the distribution of differences between the outputs of ComBat and pyComBat. As expected, the differences are distributed closely around zero (mean = −9.8 × 10^−5, 95% CI = [−0.03, 0.027]). The slight variability can be explained by the different ways R and Python (in particular NumPy) handle matrices and matrix calculation.
To further validate pyComBat, we used Principal Variance Component Analysis (PVCA) (Li et al., 2009), implemented in R in the library of the same name, to estimate the batch effect before and after applying pyComBat. Fig. 1B and Fig. 1C show that the batch effects are completely removed. We still observe variability due to the interaction between batches and cancer, which is however related to the design of the sampling and cannot be corrected through the same means.
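A comparison of this kind can be sketched as follows: given two corrected matrices of the same shape, compute the element-wise differences and summarize their distribution. The function name `summarize_differences` is hypothetical, and the use of the central 95% percentile range is an assumption made for this sketch; the text does not state how the reported interval was computed.

```python
import numpy as np


def summarize_differences(reference, candidate):
    """Summarize element-wise differences between two corrected matrices."""
    reference = np.asarray(reference, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    if reference.shape != candidate.shape:
        raise ValueError("both matrices must have the same shape")

    diff = (candidate - reference).ravel()
    # Assumption: central 95% range of the differences (2.5th to 97.5th percentile).
    low, high = np.percentile(diff, [2.5, 97.5])
    return {
        "mean": float(diff.mean()),
        "low": float(low),
        "high": float(high),
        "max_abs": float(np.abs(diff).max()),
    }
```

Here `reference` would hold the output of one implementation and `candidate` the output of the other, with genes and samples in the same order.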
c. Computation time
Computation time is evaluated by running ComBat and pyComBat 100 times each on the bladderbatch dataset presented in section 3a, with both the parametric and the non-parametric approaches.
Owing to Python's efficiency in handling matrix operations and manipulations, as well as thorough optimization of our code, pyComBat is as fast as or even faster than ComBat. The parametric version of the software indeed appears twice as fast as ComBat in terms of computation time (Fig. 1D), with less variability.
The most striking result concerns the non-parametric version (Fig. 1E), which is more time-consuming, but also less dependent on the distribution of the data. In this case pyComBat is approximately 15 times faster than ComBat, going from around 100 minutes to less than 10 minutes.
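The repeated-run protocol described in this section can be sketched with the Python standard library. The helper names `time_repeated` and `summarize_durations` are hypothetical, and the sketch is not the benchmarking script used to produce the figures; it only shows one way to collect and summarize wall-clock times over repeated calls.

```python
import statistics
import time


def time_repeated(func, *args, repeats=100, **kwargs):
    """Call func(*args, **kwargs) `repeats` times; return durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations


def summarize_durations(durations):
    """Mean, median and spread of the timings (needs at least two values)."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "stdev": statistics.stdev(durations),
        "min": min(durations),
        "max": max(durations),
    }
```

For example, `summarize_durations(time_repeated(correct, data, batch))` would time a batch-correction callable `correct` on the same inputs 100 times, where `correct`, `data` and `batch` stand for whatever function and inputs are being benchmarked.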
4. Discussion and conclusion
We have presented a new Python implementation of ComBat, the most commonly used software for batch effect correction on high-throughput molecular data. Our implementation offers the same correcting power, with similar computation time for the parametric method, and significantly shorter time for the slower non-parametric version. This reduced computing time opens the way to a more general use of the non-parametric approach on a larger range of datasets.
While developed and tested on microarray gene expression data, ComBat has also been used to correct batch effects for a wider range of high-throughput molecular profiling platforms, such as RNA sequencing (Gandal et al., 2018). However, a prior log-transformation of the data is necessary to use ComBat. Similar tools have recently been developed to avoid this additional transformation (Zhang et al., 2020).
We have attached importance to making the software open source and as documented as possible while providing tools for testing modifications to the code. We believe that this will benefit the Python bioinformatics community and open the way towards the translation of other widely used software from R to Python.
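A log-transformation of count data can be sketched as follows. The base-2 logarithm and the pseudocount of 1 are assumptions chosen for illustration, since the text does not specify which transformation should be applied; the function name `log2_transform` is hypothetical.

```python
import numpy as np


def log2_transform(counts, pseudocount=1.0):
    """Return log2(counts + pseudocount) for a matrix of non-negative counts."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts < 0):
        raise ValueError("counts must be non-negative")
    # The pseudocount keeps zero counts finite after the logarithm.
    return np.log2(counts + pseudocount)
```

The transformed matrix, rather than the raw counts, would then be passed to the batch-correction step.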
Acknowledgements
The authors thank Phuong Pham for his advice about the estimation of the efficiency of the adjustments.