Abstract
Background Analysis of viral diversity using modern sequencing technologies offers extraordinary opportunities for discovery. However, these analyses present a number of bioinformatic challenges due to viral genetic diversity and virome complexity. Because viruses lack conserved marker sequences, metagenomic detection of viral sequences requires a non-targeted, random (shotgun) approach. Annotation and enumeration of viral sequences rely on rigorous quality control and effective search strategies against appropriate reference databases. Virome analysis also benefits from the analysis of both individual metagenomic sequences and assembled contigs. Combined, these analyses produce large amounts of data that require sophisticated visualization and statistical tools.
Results Here we introduce Hecatomb, a bioinformatics platform enabling both read- and contig-based analysis. Hecatomb integrates query information from both amino acid and nucleotide reference sequence databases. It also integrates data collected throughout the workflow, enabling analyst-driven virome analysis and discovery. Hecatomb is available on GitHub at https://github.com/shandley/hecatomb.
Conclusions Hecatomb provides a single, modular software solution to the complex tasks required of many virome analyses. We demonstrate the value of the approach by applying Hecatomb to both a host-associated (enteric) and an environmental (marine) virome data set. Hecatomb provided the data needed to distinguish true-positive from false-positive viral sequences in both data sets and revealed complex virome structure at distinct marine reef sites.
- virome
- virus discovery
- bioinformatic workflow
- viral metagenomics
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
This revision includes many changes to the figures to better represent the Hecatomb workflow and outputs. In addition, the environmental (marine) sample analysis was updated.
List of abbreviations
- AIDS: acquired immunodeficiency syndrome
- SIV: simian immunodeficiency virus
- HPC: high-performance computing
- NCBI: National Center for Biotechnology Information
- RPKM: reads per kilobase million
- FPKM: fragments per kilobase million
- SPM: sequences per million
- LCA: lowest common ancestor
- ICTV: International Committee on Taxonomy of Viruses
- PERMANOVA: permutational analysis of variance
- PCoA: principal coordinate analysis
- ANOVA: analysis of variance
- SIMPER: similarity percentage