Abstract
The microbiomes of tropical corals are actively studied using 16S rRNA gene amplicons to understand microbial roles in coral health, metabolism, and disease resistance. However, primers targeting bacterial and archaeal 16S rRNA genes may also amplify organelle rRNA genes from the coral, associated microbial eukaryotes, and encrusting organisms. In this manuscript, we demonstrate that standard workflows for annotating microbial taxonomy severely under-annotate mitochondrial sequences in 1272 coral microbiomes from the Earth Microbiome Project. This issue prevents annotation of >95% of reads in some samples and persists when using either Greengenes or SILVA taxonomies. Worse, mitochondrial under-annotation varies between species and across anatomy, biasing comparisons of α- and β-diversity. By supplementing existing taxonomies with diverse mitochondrial rRNA sequences, we resolve ~97% of unique unclassified sequences as mitochondrial, without increasing misannotation in mock communities. We recommend using these extended taxonomies for coral microbiome analysis and encourage vigilance regarding similar issues in other hosts.
Introduction
Corals are animals that exist in intimate symbiosis with a wide variety of microscopic symbionts including microbial eukaryotes, bacteria, archaea, viruses, and phages (1–3). Collectively, these coral associates and the cnidarian host animal are known as the coral holobiont. The interactions of coral-associated microbes with their host and one another are complex, but of great interest due to their potential to modulate coral disease susceptibility (e.g. (1)) and responses to environmental stressors. Marker gene studies using small-subunit ribosomal RNA gene amplicons (SSU rRNA) have played a key role in describing the coral holobiont. Yet there are older symbioses that complicate these studies.
Traces of the evolutionary history of organelles as formerly free-living bacteria can be found in their genomes. For example, animal mitochondria carry their own small subunit rRNA gene, known as the 12S rRNA gene. These 12S rRNA genes are often amplified by the same PCR primers used for 16S rRNA analysis of bacteria and archaea. This can pose problems for microbiome studies (2,3) if organelle SSU rRNA gene sequences are not removed in silico, or excluded using special laboratory procedures like peptide nucleic acid clamps (2) or CRISPR-Cas9 cleavage (3). Because laboratory methods are relatively laborious and taxon-specific, a common approach is to identify and filter out organelle rRNA sequences in silico using standard taxonomy annotation pipelines such as the naive-Bayesian RDP classifier (4), alignment-based algorithms such as USEARCH (5) and VSEARCH (6), or machine learning approaches (7). If this process is accurate and unbiased across categories of samples, then removal of mitochondrial SSU rRNA sequences reduces effective sequencing depth but does not otherwise compromise microbiome analysis.
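As a minimal sketch of the in silico filtering step described above, the following Python drops features whose taxonomy string names an organelle. The function names, lineage strings, and read counts are hypothetical illustrations, not the pipeline used in this study.

```python
# Illustrative only: feature IDs, lineage strings, and counts are made up.
ORGANELLE_TERMS = ("mitochondria", "chloroplast")


def is_organelle(lineage, terms=ORGANELLE_TERMS):
    """Return True if a taxonomy string names an organelle (case-insensitive)."""
    lowered = lineage.lower()
    return any(term in lowered for term in terms)


def filter_organelles(feature_counts, taxonomy, terms=ORGANELLE_TERMS):
    """Split one sample's counts into kept features and removed organelle reads.

    feature_counts: dict mapping feature ID -> read count
    taxonomy: dict mapping feature ID -> taxonomy string
    """
    kept = {}
    removed_reads = 0
    for feature_id, n_reads in feature_counts.items():
        if is_organelle(taxonomy.get(feature_id, ""), terms):
            removed_reads += n_reads
        else:
            kept[feature_id] = n_reads
    return kept, removed_reads


toy_counts = {"asv_a": 500, "asv_b": 300, "asv_c": 200}
toy_taxonomy = {
    "asv_a": "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "asv_b": "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
             "o__Rickettsiales; f__mitochondria",
    "asv_c": "Unassigned",  # e.g. a divergent mitochondrial read with no annotation
}

kept_counts, removed = filter_organelles(toy_counts, toy_taxonomy)
print(kept_counts, removed)
```

A filter of this kind can only remove what the classifier has labelled: the unannotated feature `asv_c` is retained even if it is in fact mitochondrial.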
Application of in silico methods to coral microbiome libraries typically does identify some mitochondrial SSU rRNA gene sequences. However, variation in coral mitochondrial 12S rRNA genes has long been known to exist. Indeed, the deeply divergent ‘robust’ and ‘complex’ clades of the phylogenetic tree of scleractinian corals were initially named and characterized based on their ‘short’ or ‘long’ 12S rRNA PCR products (8,9). The existing literature does not establish whether known taxon-specific variation in coral mitochondrial 12S rRNA gene sequences might impede their in silico removal from coral microbiome SSU rRNA marker gene libraries in a host-specific manner.
In this manuscript we report widespread, severe, and host-specific under-annotation of mitochondrial sequences in short-read coral microbiome SSU rRNA amplicon libraries; demonstrate that differences in mitochondrial under-annotation between coral taxa bias comparisons of microbiome diversity across coral families; and propose a simple extension to existing microbial taxonomies that appears to mostly resolve this issue.
SILVA and Greengenes under-annotate mitochondrial ribosomal RNAs
The Global Coral Microbiome Project (GCMP) dataset includes a collection of 1272 16S rRNA gene amplicon libraries from the mucus, tissue, and skeleton of phylogenetically diverse coral taxa sequenced on Illumina HiSeq using the Earth Microbiome Project protocol ((10); Supplemental Table 1a). During analysis of this dataset we noticed many sequences that were not annotated by standard workflows (e.g. those receiving ‘Unknown’ annotations). Indeed, ‘Unknown’ sequences represented 38% of total reads according to vsearch annotation with SILVA and 41% using Greengenes. Such ‘Unknown’ sequences accounted for >95% of microbial relative abundance in 51 coral samples using SILVA or 59 coral samples using Greengenes. Many of these unannotated sequences appeared to be mitochondrial in origin based on ad hoc BLAST searches. Troublingly, BLAST even identified possible cryptic mitochondrial sequences in samples where other coral mitochondrial 12S rRNA genes were successfully annotated using vsearch. We reasoned that apparent under-annotation of mitochondrial SSU rRNA sequences in coral microbiomes could be explained by some combination of species-specific mitochondrial 12S rRNA gene length and sequence variation (e.g. (8,9)); coral heteroplasmy (11); mitochondrial sequences from encrusting or ingested organisms (12); and incomplete representation of mitochondrial sequences from diverse hosts in SILVA (13) and Greengenes (14).
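The per-sample summary described above (the fraction of reads left unannotated, and the number of samples in which that fraction exceeds 95%) can be sketched as follows. The data layout, the labels treated as unannotated, and the toy values are assumptions for illustration only; they are not GCMP data.

```python
def unknown_fraction_per_sample(counts, taxonomy,
                                unknown_labels=("Unassigned", "Unknown")):
    """Fraction of each sample's reads whose taxonomy is unannotated.

    counts: dict mapping sample ID -> {feature ID: read count}
    taxonomy: dict mapping feature ID -> taxonomy string
    unknown_labels: strings treated as 'no annotation' (an assumption here)
    """
    fractions = {}
    for sample, feature_counts in counts.items():
        total = sum(feature_counts.values())
        if total == 0:
            continue  # skip empty samples rather than divide by zero
        unknown = sum(
            n for feature_id, n in feature_counts.items()
            if taxonomy.get(feature_id, "Unassigned") in unknown_labels
        )
        fractions[sample] = unknown / total
    return fractions


def samples_above(fractions, threshold=0.95):
    """Sorted IDs of samples whose unannotated fraction exceeds the threshold."""
    return sorted(s for s, f in fractions.items() if f > threshold)


# Hypothetical toy data: two samples, two features.
example_counts = {
    "tissue_1": {"asv1": 970, "asv2": 30},
    "mucus_1": {"asv1": 100, "asv2": 900},
}
example_taxonomy = {
    "asv1": "Unassigned",
    "asv2": "k__Bacteria; p__Proteobacteria",
}

example_fractions = unknown_fraction_per_sample(example_counts, example_taxonomy)
print(example_fractions)
print(samples_above(example_fractions))
```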
Expanding taxonomic references improves detection of mitochondrial RNA genes
To address this problem more formally, we developed a workflow for expanding the SILVA 132 (13) and Greengenes 13_8 (14) reference taxonomies with diverse mitochondrial sequences from the Metaxa2 (15) project (Supplementary Methods). We then used vsearch classification to re-annotate GCMP sequences with one of several standard taxonomies: SILVA 132, Greengenes 13_8, or the expanded versions of the same, which we refer to as silva_metaxa2 and Greengenes_metaxa2. We expected that if the mitochondrial references in existing taxonomies were already sufficiently diverse, then adding further references would not alter taxonomic annotations. Conversely, if these existing taxonomies lacked representation of diverse mitochondrial SSU rRNA gene sequences, adding those sequences might ameliorate mitochondrial under-annotation.
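The general idea of extending a reference with supplemental sequences can be sketched in a few lines of Python. This is not the workflow from the Supplementary Methods: the function names, the `extra_` ID prefix, the de-duplication rule, and the lineage string are all assumptions chosen for illustration, and real reference files would be read from disk rather than from inline strings.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {sequence ID: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current_id = line[1:].split()[0]  # ID is the first token of the header
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def extend_reference(base_seqs, base_tax, extra_seqs, extra_lineage,
                     prefix="extra_"):
    """Return copies of the reference sequences and taxonomy with extras added.

    Every added sequence receives the same lineage string (extra_lineage) and a
    prefixed ID so that it cannot collide with an existing reference ID.
    Sequences identical to one already present are skipped.
    """
    merged_seqs = dict(base_seqs)
    merged_tax = dict(base_tax)
    seen = set(base_seqs.values())
    for seq_id, seq in extra_seqs.items():
        if seq in seen:
            continue
        new_id = prefix + seq_id
        merged_seqs[new_id] = seq
        merged_tax[new_id] = extra_lineage
        seen.add(seq)
    return merged_seqs, merged_tax


# Hypothetical toy inputs (sequences are far shorter than real rRNA genes).
base_fasta = ">ref1\nACGTACGT\n>ref2\nGGGGCCCC\n"
extra_fasta = ">mito1\nTTTTAAAA\n>mito2\nACGTACGT\n"  # mito2 duplicates ref1
base_taxonomy = {"ref1": "k__Bacteria", "ref2": "k__Archaea"}
MITO_LINEAGE = "k__Bacteria; o__Rickettsiales; f__mitochondria"  # placeholder

new_seqs, new_tax = extend_reference(
    parse_fasta(base_fasta), base_taxonomy,
    parse_fasta(extra_fasta), MITO_LINEAGE,
)
print(sorted(new_seqs))
```

Returning new dictionaries rather than modifying the inputs keeps the base reference intact, so base and extended annotations can be compared side by side.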
Taxonomic annotations of the GCMP dataset using the expanded silva_metaxa2 or Greengenes_metaxa2 taxonomies had 97% fewer ‘unannotated’ sequences, and roughly proportional increases in annotated mitochondria (Fig. 1a). This resolved more than 99% of fully unannotated sequences as coral mitochondria. This suggests that many sequences of unknown taxonomy in coral microbiomes are divergent under-annotated mitochondrial 12S rRNA genes.
When coral mucus, tissue and skeleton were analyzed separately using the expanded taxonomies, samples from coral tissue, which we expect to be richest in coral mitochondria, had higher proportions of under-annotated coral mitochondria (Fig. 1b). Finally, BLAST searches of all sequences that were differentially annotated when using the expanded taxonomies identified the most commonly differentially annotated sequences (among those with BLAST hits) as cnidarian mitochondria (26%), primarily from coral families Pocilloporidae, Merulinidae, Poritidae, Acroporidae and Lobophylliidae (Supplementary Tables S2a and S2b).
Mitochondrial under-annotation biases diversity estimates
We next explored how under-annotation of coral mitochondria might influence statistical analysis of coral microbiomes. Coral mitochondrial under-annotation was strongly biased across coral families (Supplementary Fig. 1). Failure to annotate these mitochondria was sufficient to alter the outcome of cross-family comparisons of microbial richness and evenness in the GCMP dataset (Supplementary Table S1e). For example, using standard SILVA or Greengenes taxonomies, coral families appear to exhibit significant differences in microbiome richness and evenness in mucus, tissue, and skeleton (e.g. for the ‘observed species’ metric, Kruskal-Wallis p = 0.003 with standard SILVA). Yet improved annotation of mitochondria renders cross-family differences in mucus microbiome richness and evenness not significant (p = 0.078 with SILVA + Metaxa2; full results in Supplementary Table S1e). At the same time, improved annotation of mitochondria increased the significance of cross-family differences in the microbiome richness of tissue or skeleton by up to 5 orders of magnitude, and evenness by up to 10 orders of magnitude (Supplementary Table S1e). Although the significance of β-diversity comparisons of microbiome differences between coral families did not change when mitochondrial under-annotations were resolved, the effect sizes of these differences (i.e. Kruskal-Wallis H statistics) were cut to only 58% of their prior values when using the expanded taxonomies. This suggests that cryptic coral mitochondria can dramatically alter estimates of cross-family differences in coral microbiomes.
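For readers unfamiliar with the Kruskal-Wallis H statistic used as the effect size above, the following is a minimal pure-Python sketch of how it is computed from grouped values, including the standard correction for ties. The richness values and family groupings are hypothetical. A p-value would normally be obtained by comparing H with a chi-square distribution with k - 1 degrees of freedom (k being the number of groups); statistical packages such as SciPy (`scipy.stats.kruskal`) return both.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of groups of numbers."""
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)

    # Assign each distinct value the mean of the 1-based ranks it occupies.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    # Sum of (rank sum squared / group size) over all groups.
    weighted = sum(
        sum(rank_of[value] for value in group) ** 2 / len(group)
        for group in groups
    )
    h = 12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    tie_counts = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_counts) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


# Hypothetical observed-richness values for three coral families.
richness_by_family = [
    [1, 2, 3],  # family A
    [4, 5, 6],  # family B
    [7, 8, 9],  # family C
]
h_stat = kruskal_wallis_h(richness_by_family)
print(round(h_stat, 3))
```

Running the same function on richness values computed before and after mitochondrial re-annotation would give the kind of H comparison reported in the text.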
Longer read lengths reduce but do not eliminate under-annotation
To test whether mitochondrial under-annotation was peculiar to short-read Illumina HiSeq libraries used in the Earth Microbiome Project, we also reanalyzed microbiomes from corals affected by chronic Montipora White Syndrome (cMWS; Brown et al., in revision). Brown et al. used Ion Torrent sequencing and obtained longer read lengths than the Earth Microbiome Project. In the Brown et al. dataset, we found fewer unclassified sequences, but still observed a 16-fold increase in mitochondrial annotations (~31 million vs. 1.9 million) with the silva_metaxa2 expanded reference set relative to the standard SILVA reference (Supplemental Data Table S3; Supplementary Methods). Thus, mitochondrial under-annotation does not appear to be unique to Earth Microbiome Project protocols. Together, these findings suggested that inclusion of diverse mitochondrial reference sequences greatly increased annotation of mitochondrial 12S rRNA sequences in coral microbiomes.
Expanding reference taxonomies does not lead to mitochondrial over-annotation
One potential concern with expanding reference taxonomies is that it might increase mis-annotation of certain bacteria as mitochondria. We tested whether increased mitochondrial annotations might lead to false positives by applying our expanded taxonomies to mock communities of known composition that did not contain mitochondria. In all tested mock communities from the mockrobiota project (16), expanding the mitochondrial reference set did not increase false positive annotations of mitochondria, and did not affect overall accuracy (maximum change in f-measure < 10⁻⁵; Supplementary Figure S2).
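The kind of accuracy check described above can be illustrated with a simple set-based precision, recall, and F-measure calculation. This is a sketch under stated assumptions: the evaluation software used in the study may define these quantities differently (for example, per taxonomic level or weighted by abundance), and the taxon names below are placeholders rather than a real mock community.

```python
def precision_recall_f(expected, observed):
    """Set-based precision, recall and F-measure for taxon annotations."""
    expected, observed = set(expected), set(observed)
    true_positives = len(expected & observed)
    precision = true_positives / len(observed) if observed else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def false_positive_mitochondria(expected, observed):
    """Observed taxa labelled mitochondria that the mock community lacks."""
    return sorted(
        taxon for taxon in set(observed) - set(expected)
        if "mitochondria" in taxon.lower()
    )


# Placeholder mock community with no mitochondria (taxon names are hypothetical).
mock_expected = ["taxon_A", "taxon_B", "taxon_C", "taxon_D"]
observed_base = ["taxon_A", "taxon_B", "taxon_C", "taxon_E"]
observed_extended = ["taxon_A", "taxon_B", "taxon_C", "taxon_E"]

_, _, f_base = precision_recall_f(mock_expected, observed_base)
_, _, f_extended = precision_recall_f(mock_expected, observed_extended)
print(f_base, f_extended, abs(f_extended - f_base))
print(false_positive_mitochondria(mock_expected, observed_extended))
```

If the extended taxonomy introduced spurious mitochondrial calls, they would appear in the second list and lower the F-measure relative to the base taxonomy.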
Recent studies have identified bacteria in the genus Aquarickettsia as of particular interest as mediators of coral health (1,17). However, as Aquarickettsia and other members of Midichloriaceae are, in relative terms, somewhat closely related to mitochondria, we wanted to test whether our expanded taxonomies might increase mis-annotation of these important coral symbionts as mitochondria. Reassuringly, annotations for all coral- or placozoan-associated Aquarickettsia from recent studies (1,17) did not change using the updated taxonomies (Supplementary Table S2d).
Conclusion
These results demonstrate that extension of the SILVA or Greengenes taxonomies with diverse mitochondrial sequences can improve taxonomic annotations of coral mitochondria. While we explore mitochondrial misannotation in corals, it seems likely that similar issues may occur in any sufficiently diverse and deeply divergent set of eukaryotic hosts (e.g. marine sponges (18)). To address this issue, we recommend that investigators studying host-associated microbiomes ensure that diverse host mitochondrial reference sequences are included in their reference database by either using the pre-calculated QIIME2-compatible taxonomies supplied here, or, in the future, by updating the most recent SILVA or Greengenes taxonomies with diverse mitochondrial sequences.
Competing Interests
The authors declare no conflicts of interest.
Supplementary Data Tables
Supplementary Data Table 1. Results of testing annotation accuracy for the Global Coral Microbiome Project (GCMP) dataset. a. Metadata for the GCMP project samples used in this study (e.g. coral species, temperature, depth, etc). b. Annotation statistics by sample and taxonomy. Sample data and annotation results by reference taxonomy. This table includes counts for the number of ASVs that could not be annotated and the number annotated as mitochondria for each taxonomic resource. c. Comparison of proportions of Unknown ASVs assigned by taxonomy. This table holds the results of pairwise Kruskal-Wallis tests comparing compartment-specific proportions of sequences which were not identified at the domain level across different reference taxonomies. d. Comparison of proportions of mitochondrial ASVs assigned by reference taxonomy. This table holds the results of pairwise Kruskal-Wallis tests comparing compartment-specific proportions of sequences identified as mitochondria based on different reference taxonomies. e. Comparison of α- and β-diversity results by annotation scheme. Differences in α-diversity between coral families based on mitochondrial annotation method. Statistics reflect the results of Kruskal-Wallis tests for several α-diversity metrics across coral families. f. Comparison of β-diversity results by annotation scheme. Statistics reflect the results of Kruskal-Wallis tests for several β-diversity metrics across coral families.
Supplementary Data Table 2. Performance of reference taxonomies. a. NCBI BLAST lineages of differentially annotated sequences (SILVA). NCBI Taxonomy lineages of BLASTed sequences annotated differently by the silva_metaxa2 and SILVA reference taxonomies. b. NCBI BLAST lineages of differentially annotated sequences (Greengenes). NCBI Taxonomy lineages of BLASTed sequences annotated differently by the Greengenes_metaxa2 and Greengenes reference taxonomies. c. Mock community accuracy comparisons. Accuracy statistics of annotations of Mockrobiota mock communities generated by comparing extended reference taxonomies to their base taxonomy using the qiime2 evaluate-taxonomy method. Base reference annotations were considered perfectly accurate, for purposes of comparison. d. Annotations of Aquarickettsiales sequences by reference taxonomy. Full annotations for 38 Aquarickettsiales sequences from (1,17) using each reference taxonomy.
Supplementary Data Table 3. Results of testing annotation accuracy for the Brown et al. chronic Montipora White Syndrome (cMWS) dataset. This table reports the per-sample sequence counts of non-organelle sequences from Brown et al. when annotated with either silva or silva_metaxa2.
Acknowledgements
J.R.Z. and T.B. are supported by NSF IOS CAREER grant 1942647. T.B. and J.L.P.G. are supported by NSF IOS grant 1655682. A static copy of code and relevant data files for this project can be found at https://zenodo.org/record/4551201. Additionally, a live copy is maintained on GitHub as part of the GCMP Global Disease Project at https://github.com/zaneveld/GCMP_Global_Disease/tree/master/analysis/organelle_removal. The authors thank Nicholas Bokulich and Greg Caporaso for suggestions regarding software to compare mock communities, Grace Klinges for sharing Aquarickettsia sequences, Elizabeth Traylor for initial exploration of this issue, and Ayo Akinrinade for comments that improved the manuscript.