Abstract
Background In high-throughput sequencing studies, sequencing depth, which quantifies the total number of reads, varies across samples. Unequal sequencing depth can obscure true biological signals of interest and prevent direct comparisons between samples. To remove variability due to differential sequencing depth, taxa counts are usually normalized before downstream analysis. However, most existing normalization methods scale counts using size factors that are sample-specific but not taxon-specific, which can result in over- or under-correction for some taxa.
Results We developed TaxaNorm, a novel normalization method based on a zero-inflated negative binomial model. This method assumes that the effects of sequencing depth on the mean and dispersion vary across taxa. Incorporating the zero-inflation component better captures the nature of microbiome data. We also propose two corresponding diagnostic tests for the varying sequencing depth effect, which serve as validation. We find that, in downstream analysis, TaxaNorm achieves performance comparable to existing methods in most simulation scenarios and reaches higher power in some cases. Specifically, it achieves a good balance between power and false discovery control. When applied to a real dataset, TaxaNorm shows improved performance in correcting technical bias.
Conclusion TaxaNorm corrects both sample- and taxon-specific bias by introducing an appropriate regression framework for microbiome data, which aids in data interpretation and visualization. The ‘TaxaNorm’ R package is freely available through the CRAN repository at https://CRAN.R-project.org/package=TaxaNorm, and the source code can be downloaded at https://github.com/wangziyue57/TaxaNorm.
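As a rough illustration of the limitation raised in the Background, the sketch below applies total-sum scaling (TSS), one of the sample-specific size-factor methods, to a toy count table. The counts, sample labels, and the function name `total_sum_scale` are hypothetical and chosen only for illustration; this is not TaxaNorm's own procedure. It shows that every taxon in a sample is divided by the same factor, so any taxon-specific depth effect is left uncorrected.

```python
# Hypothetical toy counts: rows are samples, columns are taxa.
counts = [
    [10, 0, 90],     # sample A, sequencing depth 100
    [100, 0, 900],   # sample B, sequencing depth 1000
]


def total_sum_scale(table):
    """Total-sum scaling: divide each count by its sample's total reads."""
    normalized = []
    for row in table:
        depth = sum(row)  # sequencing depth = total number of reads in the sample
        if depth == 0:
            # An empty sample has no reads to scale; keep it as zeros.
            normalized.append([0.0 for _ in row])
            continue
        # The same size factor (depth) is applied to every taxon in the sample.
        normalized.append([count / depth for count in row])
    return normalized


tss = total_sum_scale(counts)
for label, row in zip(["A", "B"], tss):
    print(label, row)
```

Because the size factor depends only on the sample, a taxon whose counts grow faster or slower than the overall depth would be under- or over-corrected by this kind of scaling, which is the gap a taxon-specific model is meant to address.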
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
† Co-senior author
Section on simulation results updated; Section on real data application updated to include more analysis and results; Figure 3 revised; Figure 6 added; Tables are updated; Supplemental files updated.
List of abbreviations
- HTS: high-throughput sequencing
- rRNA: ribosomal RNA
- DA: differential abundance
- TSS: total-sum scaling
- MED: median-by-ratio
- UQ: upper quartile
- TMM: trimmed mean of M-values
- CSS: cumulative sum scaling
- ANCOM-BC: analysis of compositions of microbiomes with bias correction
- ZINB: zero-inflated negative binomial
- NB: negative binomial
- EM: expectation-maximization
- MLE: maximum likelihood estimation
- PMF: probability mass function
- CDF: cumulative distribution function
- LRT: likelihood ratio test
- FDR: false discovery rate
- HMP: Human Microbiome Project
- 16S: 16S rRNA gene amplicon sequencing
- WGS: whole-genome shotgun sequencing
- NMDS: non-metric multidimensional scaling
- BSS: between-group sum of squares