ABSTRACT
In this work, we exploit recent advances in metagenomic assembly and bacteriophage identification to describe the phage content of saliva from 5 mother-baby pairs sampled twice 7 - 11 months apart during the first year of the babies’ lives. We identify 25 phage genomes that are comprised of one to 71 contigs, with 16 having a single contig. At the detectable level, phage were sparsely distributed with the most common one being present in 4 of the 20 samples, derived from two mothers and one baby. However, if they were present in the early time point sample from an individual, they were also present in the later sample from the same person more frequently than expected by chance. The nucleotide diversity (π) in phage from the same sample or the same person was much lower than between different individuals, indicating dominance of one strain in each person. This was different from bacterial genomes, which had higher diversity indicating the presence of multiple strains within an individual. We identify likely bacterial hosts for 16 of the 25 phage, including an apparent inovirus that is capable of integrating in the dif site of Haemophilus species. It appears that phage in the oral cavity are sparsely distributed, but can be maintained for months once acquired.
INTRODUCTION
The microbiota of the oral cavity has great importance in oral health. The mechanism of dental caries involves acid production by specific bacterial populations in response to dietary sugar (1). Chronic periodontitis is likewise associated with altered bacterial communities (2, 3). Bacteriophages are likely to be an important influence on oral bacterial communities, both through predation of bacteria and aiding horizontal gene transfers. Additionally, it is possible that bacteriophages or lysins could be used as therapy for oral diseases (4).
In recent years, the availability of high throughput sequencing of the 16S rRNA gene together with other techniques has produced a wealth of information on the bacterial component of the oral microbiome (5). Less is known about oral bacteriophages, partially due to their lack of a universally conserved marker gene. However some phage that infect oral bacteria have been cultivated (4) and oral phage have also been identified in metagenomic studies (6). Most such metagenomic studies involved isolation of particles and sequencing of the particle-associated DNA, often following amplification (7–10). It has also been found that oral bacteria respond to phage with Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR)-based adaptive immunity (11, 12). Recently methods have been developed to aid the identification of phage sequences from whole shotgun metagenomes without the need for particle isolation or amplification (13, 14). Paez-Espino and co-workers applied their method to whole metagenomes that were generated by the NIH Human Microbiome Project, including sequences from various human oral environments, most frequently tongue dorsum, buccal mucosa, and supragingival plaque (15).
A technique that has been developed recently to assemble bacterial genomes from metagenomes is to assemble metagenomes from many samples and then cluster contigs based on sequence coverage per sample and nucleotide kmer frequencies (16). We set out to combine the contig clustering with phage identification to identify phage genomes from whole shotgun metagenomes from the oral cavity. In the present work, we apply this approach to a data set of whole metagenomic DNA from saliva of 5 mothers and their 5 babies during the first year of the babies’ lives.
METHODS
Sample Collection
DNA samples were part of a larger set collected for 16S rRNA gene amplicon sequencing (Sulyanto et al., in preparation). There were 20 samples, 2 per individual from 5 mother child pairs taken during the first year of the child’s life. Demographic information on the subjects is shown in Table 1. For each person we used one sample taken between 0-3 months after the birth and one sample taken 10-12 months after. Informed consent was obtained and the study was approved by The Ohio State University IRB. Saliva samples were collected from infants by saturating a flocked swab (Copan Diagnostics, www.copanusa.com) for 30 seconds and from mothers by expectoration of unstimulated saliva. Samples were stored in ATL lysis buffer and frozen until processing.
DNA and library preparation
DNA was isolated with QIAamp DNA mini kits (QIAgen, www.qiagen.com) using the included protocol augmented by the inclusion of a bead-beating step to increase bacterial cell lysis. 100 ng of DNA in a volume of 50 µl was subjected to fragmentation in a Covaris S2 instrument with Intensity 5, Duty cycle 10%, Cycles per burst 200 and Treatment time 50s. 40 ng of the fragmented DNA was used to make Illumina sequencing libraries with the NEB Next DNA library kit for Illumina (New England Biolabs, www.neb.com). For 3 samples (family 1 mother 10 month, family 2 baby 3 month, and family 2 baby 10 month) we performed enrichment with the NEB Next Microbiome Enrichment Kit (New England Biolabs, www.neb.com) and sequenced both the pre-enrichment and post-enrichment samples. For one sample (family 1 mother 2 month), we sequenced only an enriched sample. Comparisons of the pre- and post-enrichment samples with Metaphlan2 (17) showed only minor differences in the bacterial content of the two samples so we pooled sequence data from the three repeated samples and used only the enriched data for the one sample.
We sequenced the pooled libraries in one lane of the HiSeq 4000 (Illumina, www.illumina.com) with 150 PE chemistry.
Sequence filtering and assembly
The sequencing reads were trimmed to remove adapter sequences and low quality regions with Trimmomatic v. 0.32 (18), using the parameters ILLUMINACLIP:<TruSeq3-PE- 2.fa>:2:30:10:1:true and MINLEN:70. Human reads were removed by mapping against the human genome (human_g1k_v37.fasta from ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/) with BWA-MEM 0.7.8 (19), and processing with FilterSamReads and SamToFastq from Picard tools 1.112 (github.com/broadinstitute/picard). The reads from all samples were co-assembled using MEGAHIT v1.0.6-3-gfb1e59b (20). The assembly was examined with metaquast (21). The human-depleted sequences are deposited in the NCBI SRA associated with BioProject accession PRJNA448135.
Clustering of contigs and identification of phage-encoding contigs
We processed the contigs through the CONCOCT clustering pipeline (16), which incorporates the following steps: (1) Co-assembly of all samples using MEGAHIT (20); (2) Discarding contigs shorter than 1 kb; (3) Fragmenting contigs larger than 20 kb into pieces of 10 kb; (4) Mapping of reads to contigs using bwa mem (19); (5) Calculation of coverage per sample for each contig; (6) Clustering of contigs into bins based on tetranucleotide frequencies and coverage per sample; and (7) Assigning taxonomy by predicting encoded protein sequences and searching against the NCBI nr non-redundant protein database. A product of the CONCOCT pipeline was a table of average coverage per sample for all assembled contigs, which we were able to use to drive decisions grouping contigs into clusters representing likely genomes.
We identified potential phage-derived contigs with VirSorter (13), and by examining the taxonomy of encoded proteins (performed with DIAMOND v. 0.8.14 (22)) and MEGAN v. 6.5.10 (23) to find bacteriophage-related genes. We further relied on the coverage per sample patterns to link phage contigs as we found that identified phage were nearly always > 2x coverage in a very small number of samples (between 1 - 4 of the 20) and < 0.1x coverage in the others. We noticed that many contigs found by VirSorter were ones that had been fragmented because they were originally over 20 kbp. We included these after checking that the coverage per sample patterns were consistent for all the fragmented pieces. We included a contig that was identified as viral and circular by VirSorter, and was 8218 bp in length (Oral phage 4), theorizing that it might be a short circular phage genome. If long contigs contained phage-related genes such as terminase or capsid proteins by blastx, we included them. We next examined CONCOCT bins that were enriched in potential phage contigs as identified by VirSorter, and included contigs from these bins if their coverage per sample levels were consistent. We split up these bins or discarded contigs if they appeared to show more than one coverage pattern. Finally we found contigs through the DIAMOND/MEGAN analysis that were related to cultivated phages. The coverage per sample for contigs in these groups were manually inspected and the contig sets were subdivided if multiple coverage patterns were found. The phage contigs are deposited at IMG/M with the genome ID 3300019854, and Supplemental Table 1 contains a mapping of the scaffold IDs in IMG/M to oral phage numbers used here and contig numbers assigned by MEGAHIT.
Analysis of single nucleotide polymorphisms
For phage that were found in more than one sample at greater than 1 x average coverage, we analyzed the frequency of single nucleotide polymorphisms within and between samples. A step of the CONCOCT pipeline had been to align all the reads to all assembled contigs using bwa mem v. 0.7.12-r1039 (19). We used samtools v. 0.1.19 (24) to subset bam files for each sample and phage combination that exceeded the coverage threshold and generate pileup files from them. We then compared the pileups between samples at positions where each had been sequenced at least four times, and counted nucleotide differences if the position was sequenced as one base four times in one sample and a different base at least four times in the second sample. Because there is substantial overlap between the forward and reverse reads in our data (average insert size from alignment ∼140 bp), this criteria ensures the base is read in at least two independent paired end reads. We used the python script pileup-analyze.py to count nucleotides that were the same between samples or ones that were different.
Nucleotide diversity and comparison of phage with bacterial genomes
To perform comparisons between bacterial genomes and phages, we selected eight contigs that were representative of four bacterial genomes, with 2 contigs corresponding to each genome from a set of abundant salivary bacteria: Rothia mucilaginosa, Prevotella melaninogenica, Neisseria mucosa (closely related to Neisseria sicca and Morococcus cerebrosus), and Veillonella atypica. We chose contigs that were over 2 kbp in length and had megablast alignments to reference genomes with 100% query coverage and over 93% identity. We then used a combination of samtools (24), BioPython (25), and command line utilities to generate bam files containing mapped reads from all samples that had over 1 fold average coverage for those contigs. We used freeBayes v1.1.0-54-g49413aa (26) to predict variants in frequency-based mode, with filters of 4 observations and 1% of total observations (parameter settings: ‘--min-alternate-fraction 0.01 --min-alternate-count 4 --pooled-continuous --haplotype-length 0’). We calculated the nucleotide diversity π as described (27), within each sample and between pairs of samples with greater than 1 fold average coverage using a python script calculate_pis.py.
Phage annotation and taxonomy
Phage genes and proteins were predicted on the PhAnToMe annotation server (www.phantome.org). Functional annotation was carried out on the IMG/MER server (28)(available at https://img.jgi.doe.gov under IMG genome ID 3300019854).
Oral phage taxonomic relationships were analyzed with the vConTACT program (29), part of the iVirus suite on CyVerse. Briefly, the encoded proteins of the oral phage together with with the encoded proteins of all prokaryotic phage in RefSeq are used in an all versus all blastp search, the results are used to define protein clusters, and the degree of sharing of members of the protein clusters is used to define a network of relationships between phage genomes.
Host inference by tRNA
We identified tRNA genes within the phage contigs with tRNAscan-SE 1.3.1 (30) and used the predicted sequences in blastn (31) searches of the nt and RefSeq genome databases.
Host inference by CRISPR spacer analysis
We used two methods to attempt to identify bacterial hosts through CRISPR spacer matches with the phage genomes. The first was to search the CRISPR spacer database derived from bacterial genomes at the IMG/VR site (32) with the phage contigs. The implementation used the web interface accessed through a python script, img-crispr-blast.py, that invokes the Selenium web scraping library to control the Firefox browser. In the second method, we assembled CRISPR repeats from the sequence reads used in the current study using crass v. 0.3.12 (33). We then performed a blastn search with the predicted spacers as queries, the phage genomes as database, and settings ‘-task blastn-short -perc_identity 95 -qcov_hsp_perc 95 -evalue 0.1’. Because crass only assembles the CRISPRs but not flanking sequences we extracted the read ids for reads that contained the matched spacer. We then determined which MEGAHIT assembled contig they mapped to, and examined the Diamond search results of proteins encoded by the contig to assess the taxonomy.
Host analysis - prophages
We searched for related prophages in sequenced bacterial genomes by BLAST. The refseq_genomes database was downloaded from the NCBI ftp site in December 2017 and the genetic identifier numbers (gi numbers) for bacterial sequences were accessed by querying the NCBI Entrez nucleotide site with the search term taxid2[ORGN]. We used the blastdb_aliastool program to create a binary form of the gi list and used blast 2.6.0+ to align the phage contigs against the database with the modifiers -gilist <gi list file> -task blastn -evalue .00001. Phage genomes where over 50% of the genome aligned with the bacterial genomes at over 70% nucleotide identity were considered as possible matches. The search also identified two phage that had highly similar regions (> 90% identity) to bacterial isolate genomes confined to the ends of contigs, suggesting that the metagenomic assembled contig might represent a prophage sequence.
Clustering with metagenomic phage contigs from the “Earth’s virome” study
To determine the relationship of phage identified in this study to those found in a recent work by others (15), we applied a clustering approach that the same group developed (14). Briefly, the pipeline uses blastn to align phage genomes to the previously described metagenomic phage contigs, parses the blast results and performs single linkage clustering on the results. This allows assignment of the contigs from this study to viral clusters (vc) that they found in the earlier work and describes relationships between our contigs and singleton contigs that did not previously belong to a cluster. Since some viral clusters and singletons found in the earlier study had putative bacterial hosts, it was possible to extend this host inference to the present phage contigs.
RESULTS
Sequence assembly and identification of phage genomes
The sequencing produced 231 million paired end reads, 68 million of which did not map to the human genome (29%). The original assembly of those non-human reads was 221 Mbp in 200,796 contigs with an N50 value of 1189 bp. After excluding contigs under 1000 bp, there were 64,294 contigs and an assembled length of 129 Mbp. To be consistent with the recommended input for the CONCOCT program, we fragmented contigs longer than 20 kb to 10 kb pieces plus the residual.
We used the contigs greater than 1000 bp as input to the CONCOCT metagenomic clustering program (16) and it assigned them to 144 clusters. CONCOCT clusters contigs with Gaussian mixture models based on nucleotide kmer frequencies and coverage per sample. One of the outputs is a table of average coverage per sample for all contigs determined by mapping reads back to contigs with bowtie2 (34). As expected, many of these clusters represented partial to nearly complete bacterial genomes, many of them known oral bacteria. This was inferred through the CONCOCT annotation step that uses the Diamond program (22) to align predicted protein sequences against the NCBI nr protein database.
We also ran VirSorter (13) on the assembled contigs. The program identified 294 contigs using its RefSeq database and 683 contigs using the extended Virome database. Nearly all of the identified contigs were predicted to be phage-only (VirSorter categories 1-3) with only 4 contigs from the RefSeq prediction and 6 from the Virome prediction being marked as potential prophage (categories 4-6).
We had fragmented contigs that were over 20 kbp in the CONCOCT analysis pipeline to avoid possible chimeras derived from different genomes, and followed the same procedure for VirSorter. However, we noticed that a large number of the contigs identified as possible phage by VirSorter were these subfragments. We also found that the coverage per sample patterns of the subfragments were consistent with each other, indicating that they were non-chimeric. We found a total of 16 such contigs. For 14 of them, we did not find additional contigs that clustered with them in coverage per sample and they were identified as potential phage. We designated these as oral phage numbers 1-3, 5-13, 17, and 19. Another contig of 8218 bp was predicted by VirSorter to be circular by end identity. The circularity was further supported by the observation that forward and reverse ends of the same paired reads mapped at the 5’ and 3’ ends of this contig. We designated this oral phage 4. Two other >20 kb VirSorter contigs clustered with additional short contigs in phages 16 and 18.
To find additional phage, we examined CONCOCT clusters that contained higher than average numbers of predicted phage contigs by VirSorter. We found five such clusters, one of which had three distinct coverage per sample patterns that we subdivided into phage 16-18. As mentioned this resulted in additional contigs added to phages 16 and 18, while 17 remained as a single contig of 120,359 bp. For the other four CONCOCT clusters, we excluded some contigs with inconsistent coverage per sample, forming phage genomes 19-22.
We lastly examined the Diamond (22) alignments to the NCBI nr protein database to find predicted proteins that were closely related to known phage. We identified eight contigs that encoded proteins similar to Actinomyces phage Av-1 (35) (which we subdivided by coverage per sample into phage genomes 14 and 15), 9 contigs similar to Streptococcus phage Cp-1 (36) (which we subdivided into phages 24 and 25), and 4 contigs related to Streptococcus phage EJ-1 (37) (phage 23). The properties of the identified phage genomes are shown in Table 2. The collection of phages show a wide range of properties, with sizes from phage 4 at 8.2 kb to phage 20 at 231 kb and GC content from phage 22 at 23% to phage 11 at 68%.
Distribution of phage in samples
We summarized the read mapping data as reads mapped per genome and calculated the percentage of total reads that represented each phage. The results are plotted as a heatmap in Figure 1. In analyzing the distribution, we used a threshold of > 0.01% of reads mapping to a genome to indicate presence of a phage in a sample. Two aspects of the distribution of the phages are notable. Firstly phage are sparse with each phage only present over the threshold in a few samples. The most prevalent phage is oral phage 20, which is present in 4 of the 20 samples, from one baby and two mothers, including both early and late samples from one of the two moms. Secondly, if a phage is present in the earlier of the two samples from an individual, it has a significant tendency to be present in the other sample collected 7-11 months later (Fisher’s exact test p = 0.00012). We only found one phage that was present in both a mother and her child, i.e. phage 20 in family 2, in the 10 month child and 12 month mother samples.
Nucleotide diversity and single nucleotide variant (SNV) analysis
We calculated nucleotide diversity π for the phage genomes and as a comparison a set of 8 bacterial genomic contigs from four species, selected as described in methods. Nucleotide diversity is the probability that a given nucleotide is different for any two members of a population. The heuristics used to distinguish possible sequencing or PCR errors from definite variants and the method of calculating π are also described in the methods and derived from Schloissnig et al. (27). We calculated the diversity both within samples and between pairs of two samples, with the pairs subdivided into pairs from unrelated people, pairs from mother and baby the same family (as mentioned only one pair of samples for one phage fit this criterion), and pairs from the same person at different time points. Figure 2A shows the results of the analysis. Between unrelated people there was a broad distribution of nucleotide diversity for both phage and bacterial genomes. The mean π was 0.0055 for bacteria and 0.0048 for phage. The distribution of bacterial nucleotide diversity between family members was not significantly different from unrelated persons (mean π = 0.0058), likewise the one phage (phage 20) that was found in mother and baby had high between sample diversity. For the pairs of samples from the same person at different times, the nucleotide diversity for bacteria (mean π = 0.0029) was significantly lower than for bacteria from family members or unrelated, which agrees with similar measurements in gut (27) and skin (38) microbiomes. The diversity within samples for the bacterial contigs was lower still, though still measurable (mean π = 0.0014), indicating the presence of multiple strains for bacterial species in the oral cavity. The presence of multiple strains of oral bacterial species in single subjects has been observed in culture (39) and single cell sequencing (40, 41) experiments. Conversely the phage genomes had very low nucleotide diversity in samples from the same person, even when compared over time (mean π = 0.00034 from the same sample, = 0.00054 from different samples). Because of these nucleotide diversity patterns, in Figure 2B we treated the phage samples as single strains and analyzed the differences as percent nucleotide difference as detected by consistent variants compared to the reference assembly (which was chimeric by SNV patterns). In Figure 2B, we note that there are differences between phage species; phage 4 and phage 12 show very low divergence between subjects, while others show > 1% nucleotide difference. This includes the phage 20 mother and baby samples from family 2, which have 1.75% difference. All the within subject differences are very small, including the phage 20 samples from mother 6, which were 0.0024% different.
Phage Taxonomy by vConTACT
We used vConTACT (29) to examine the relationship of the oral phages to cultured phages present in the NCBI RefSeq database. The program uses an all versus all protein BLAST to compare proteins and assign them to protein clusters, then determines the relationship of phage genomes by the sharing of those protein clusters. The program generates a network diagram of phage relationships based on protein clusters including the oral phage and all known phage in the RefSeq sequence database, which is presented in Figure 3. Twenty four of the twenty five oral phages had a connection to at least one RefSeq phage, the exception being oral phage 17. A group of eleven oral phages (1, 5, 6, 7, 8, 9, 12, 13, 18, 19, and 23) were well-connected to a large and interconnected group of the RefSeq phages. Phage 16 was only connected to Oral Phage 18 and a single other phage, Arthrobacter phage vB_ArS-ArV2 (42). This phage is part of a group of related phages that mainly infect Actinobacteria, including many infecting Mycobacterium and Propionibacterium. Another member of this cluster, Mycobacterium phage DS6A (43) connected to Oral Phage 18. Two other phage, Oral Phages 2 and 11, connected to a different member of this cluster, Mycobacterium phage PegLeg (44). However, for these four oral phages, the similarity as quantified by vConTACT between the two oral phages is stronger than the similarity between the oral phages and RefSeq phages.
A group of four oral phages, numbers 14, 15, 24, and 25, are related to a cluster of phage including Bacillus phage phi29 (45). Of the cluster, oral phages 14 and 15 are most closely related to Actinomyces phage Av-1 (35), while oral phages 24 and 25 are most closely related to Streptococcus phage Cp-1 (46), agreeing with similarities we noted when originally identifying the phage contigs. Oral phage 20, that by genome size (> 200 kb) is considered a jumbo phage, is related to a group of jumbo phage (47) that includes Pseudomonas phage phiKZ. A number of phage in this group are about equally similar to oral phage 20. Oral Phage 22 is similar to a group of phage that includes the well-studied Enterobacteria phage T4 but has highest similarity to Campylobacter phage Cpt10 (48). Oral phages 3, 10, and 21 are similar to smaller clusters including less well-studied phage.
Finally, Oral Phage 4 is the only genome that shows similarity outside of the Siphoviridae (tailed phage) family. It shows similarities with members of the Inoviridae family of filamentous single stranded DNA phage, including the well known Enterobacteria phage M13.
Phage Host Inference
The potential bacterial hosts of oral phage were inferred by four approaches: examining phage-encoded tRNA genes for potential bacterial sources, searching for CRISPR spacers that match the oral phage genomes, finding related prophage sequences in bacterial isolate genomes, and by association with previously characterized viral clusters. Table 3 summarizes the results.
The tRNAscan-SE program identified 12 potential tRNA encoding genes in the phage contigs. However, only one was a high identity match to a known bacterial tRNA. This was encoded by phage 8, and was 100% identical to tRNA-Lys genes of both Haemophilus influenzae and Aggregatibacter actinomycetemcomitans. Other tRNAs that were found were less than 95% matches to either Genbank nr/nt or RefSeq genomic databases. There were four such sequences in Oral Phage 10, three in Oral Phage 17, three in Oral Phage 20, and one in Oral Phage 23. The tRNA-containing phage that did not give host information are marked with an asterisk in Table 3.
The IMG/VR web resource (32) contains a database of CRISPR spacers derived from sequenced bacterial genomes that is searchable by BLAST. The phage contigs in this study were used to search that database and 28 hits were found to 8 of the phage. As seen in Supplemental Table 2, the BLAST matches were all to oral bacterial species. Where multiple hits were found to the same phage, the bacteria involved were taxonomically related: at genus level for phages 16, 18, 24, and 25 and at family level for phages 7 and 19.
The crass CRISPR assembly program (33) assembles CRISPR repeats from metagenomic data. The CRISPRs assembled from the metagenomic reads contained 1232 spacers, 65 of which had BLAST matches to 12 of the oral phage. Since crass only assembles the repeat regions, it was necessary to track read mappings of the repeat containing reads to attempt to identify the host by flanking genes but this was only possible for four phage. For phages 7, 24, and 25, this confirmed the identity from searching IMG/VR, while phage 23 was identified as a Streptococcus phage.
The fourth approach used was to search the known bacterial genomes from the RefSeq database for related sequences to the oral phage. Two kinds of sequences were found by this approach. Alignments of the entire phage genomes were found for nine of the phage genomes. We considered matches if the alignments covered at least 50% of the total length of the phage genomes, and were on average at least 70% identical. As seen in Supplemental Table 3, the percent coverage for the nine varied from 57.7% to 100% and the percent identity from 72.2% to 99.95% (rounded up in the table). Particularly striking was the high degree of similarity between Oral Phage 4 and a number of Haemophilus influenzae and H. parainfluenzae genomes. The single contig that comprises Oral Phage 4 is 8218 bp and has a direct repeat of 13bp between it’s 5’ and 3’ ends, possibly indicating a circular structure in the sample. Three of the six isolate contigs are very close to the size of the Phage 4 contig, 777_HPAR is 8259 bp, 718_HINF is 8265 bp, and HMSC068C11 is 8230 bp (Supplemental Table 3), with the difference in length reflecting longer end repeats in the isolate contigs. The other three isolates have the phage regions as part of longer contigs. In each case the contig represents a phage insertion that occurs at a position corresponding to the dif site of Haemophilus influenzae (49), a form of prophage that has been observed for filamentous phage of the Inoviridae family (50).
Other related prophage were found for oral phages 1, 5, 6, 7, 8, 13, 16, and 23 (Supplemental Table 3). For phages 7, 8, 16, and 23 the host inference agreed with other methods while for the other phage host predictions were not available. In the case of phage 23, the nucleotide identity was the lowest, and different contigs of the phage were most highly identical to different isolate genomes, although all similarities were to Streptococcus strains related to Streptococcus mitis.
Finally oral phages 12 and 19 showed nucleic acid identities to bacterial isolates only at the ends of the metagenomic phage contigs. The regions involved are shown in Supplemental Table 4. In the case of oral phage 12, the regions at the ends represent contiguous regions from the similar genomes, indicating that the phage may have integrated in the genome. The putative integration point is within a gene that is predicted to encode a YebC/PmpR family DNA-binding transcriptional regulator. The insertion is predicted to change the amino acid sequence at the C-terminus of the protein from AIMDEEE to SILINE. The insertion point is upstream of sequences encoding the TPP riboswitch and the ThiC gene involved in thiamine biosynthesis. Very near the 5’ end of the contig for oral phage 19 there is a sequence highly similar to sequences from Lachnoanaerobaculum sp. ICM7 that are predicted to encode an IS110 family transposase while the 3’ end consists of a 94 bp repeated sequence from upstream of the transposase. The predicted host of this phage by CRISPR spacer analysis was Lachnospiraceae, the family containing Lachnoanaerobaculum.
We searched the annotations generated by the IMG-MER system to identify possible integrase related genes, including transposases and resolvases, as these might be expected in temperate phage capable of integration. The results are presented in Supplemental Table 5. Five of the nine phage that had related prophages contained integrase or related genes (5, 6, 7, 8, and 16), as did both of the phage with ends similar to isolate genomes (12 and 20). The lack of integrase for phage 4 is not surprising since many related filamentous phage utilize cellular recombination proteins XerC and XerD to catalyze integration at the dif site rather than encoding their own integrase (50). Integrases were not identified for phages 1, 13, and 23 despite the existence of (somewhat distantly) related prophages.
Viral cluster analysis
A recent publication identified phage-encoding contigs from a variety of sources, including metagenomic sequences from oral samples generated by the Human Microbiome Project (15). In that publication the authors described clustering of viral contigs and the method for clustering was published separately (14). Contigs are clustered based on 90% average nucleotide identity over 75% overlap. This method was applied to the phage genomes found in the current study, combining them with viral clusters and singletons found in the previous study (Supplemental Table 6). Twenty of the twenty five oral phage clustered with phage contigs found in that work. Phage 24 and 25 clustered with each other and the same viral clusters and singletons. Although those authors identified >100,000 viral clusters from numerous host-associated and other habitats, all of the contigs that clustered with phage from the current study were derived from the human oral environment.
The case of phage 4 was somewhat anomalous. The contig to which it showed high identity, metagenomic contig SRS015921_WUGC_scaffold_307 is 55,406 in length and appeared to be a segment of a Haemophilus parainfluenzae genome, with phage 4 inserted at the dif site as described above, and an unrelated prophage inserted about 1 kb proximal to the dif site downstream of the genomic Ferritin-1 and Ferritin-2 genes, which are over 98% identical to ones from H. parainfluenzae. The viral cluster that was observed in the previous study appears to be due to the presence of this unrelated phage sequence, which is likely a member of the Siphoviridae as it contains conserved tail fiber protein genes. Finally host inference was also possible through the viral clustering in some cases (Table 3), and allowed the prediction of oral phage 3 as a Streptococcus phage.
DISCUSSION
In this work, we describe a set of oral bacteriophage that were found solely by analysis of whole metagenome shotgun sequences, without isolation of viral particles or DNA amplification. This method has some advantages and disadvantages. The advantages include that it was possible to make inferences by the coverage per sample of the assembled contigs, and associate contigs that derive from the same phage. Therefore it was possible to study phage that assembled as multiple contigs. Methods using coverage per sample that have been useful in generating bacterial genomes from metagenomes (51) can be applied to phage in this way. This was particularly true for oral samples in this study because each phage had significant coverage in only a few samples and appeared to be absent in other ones. A second advantage to the approach is that filtration methods that are sometimes used to isolate viral particles often exclude jumbo phage (47), one of which was found in this study.
The depth of sequencing of this study is somewhat limited by sequencing costs, the desire to study multiple samples including longitudinal ones, and the presence of 71% human DNA in saliva. A possibility is that phages are not as sparse as they appear, but are normally present at low levels and undergo blooms to detectable levels given specific environmental conditions. Two observations seem to speak against this possibility. One is that the same phages are very often found at detectable levels in samples from the same individual 7-11 months apart. A second is that phages within an individual at high levels have low nucleotide diversity, while between individuals most phages have higher diversity with the exception of phages 12 and 4. Phage 12 was only found in the babies from families 2 and 4 and is not as highly identical to related phage contigs in IMG/VR (90-98% nucleotide identity). Phage 4 was also found to be have high identity to various Haemophilus isolate genomes and a metagenomic contig assembled from the Human Microbiome Project data (32).
It is also possible that the phages studied here reflect a subpopulation of oral phages that are sparsely distributed. It may be that there are more common phages that are present in more slightly divergent forms, but the presence of multiple related genomes may interfere with genome assembly into long enough contigs to observe and reliably classify as phage.
In this study although we frequently observed maintenance of phage for months with very low nucleotide divergence between samples from the same patient we failed to find any cases of phage transfer from mothers to children. In the single case of phage 20 where a very similar phage was present in a mother and baby from the same family, the strain was as different (1.75% nucleotide divergence) as many phage from unrelated subjects, suggesting that it was acquired from a different source. Other studies have shown transmission of phage from mothers to babies (52–54) so it is likely there was not a large enough sample size of phage and subjects to observe it here. Transmission does seem to be less frequent than maintenance within an individual.
A drawback of using the whole metagenome approach is that it is not possible to know for certain if the phage sequences identified are prophage inserted in bacterial genomes, some other intracellular form, or phage particles. The evidence suggests there may be some of all types. Phage 4 is an interesting case. The method that was used to create sequencing libraries involving repairing DNA ends, tailing, and ligating adapters. Such a technique would not be expected to work with single stranded DNA. Phage 4 is related to the Inoviridae family of filamentous phage where the packaged genome is a single stranded DNA circle. The lifecycle of these phages however involves an intracellular double stranded circular replicative form, and it may be this form is observed here. Some contigs from bacterial isolates and a metagenomic contig strongly suggest that an integrated form of phage 4 in the dif site of Haemophilus species is possible, though we did not directly observe it. Phages 12 and 19 had ends that were highly identical to bacterial genomes, suggesting that the contigs might be derived from prophage sequences, especially in the case of 12 where contiguous sequences from Veillonella genomes that contained metabolic genes were present at either end. The case for phage 19 was not quite so obvious because the sequence at one end was a repeat of part of the sequence from the other, the similarity was only with a single isolate genome, and the region involved contained a transposase gene. It is possible a transposon inserted into a phage genome and was packaged. However the species of bacteria agreed with the predicted host of the phage, indicating the fusion did not result from a misassembly.
Several approaches were used to identify the bacterial hosts for these oral phage. Altogether we were able to identify putative hosts for 16 of the 25 phage through tRNA analysis, CRISPR spacer analysis, similarity to prophages, and clustering with previously described phage (Table 3). Where it was possible to identify bacterial hosts by multiple methods, there was agreement between the methods though in some cases one method giving a higher taxonomic rank that subsumed the host prediction by the other method. This gives us confidence in the predictions.
The viral clustering indicates that this study represents a fairly small subset of possible oral phage. From the IMG/VR site (32), it is reported that they found 48,904 viral contigs and associated those with 5246 viral clusters and had remaining 7467 singleton contigs. The phage genomes from our study associated with 19 of those clusters and 31 singletons. Our oral phage also would have united some clusters and singletons from the earlier study. In all cases the phage identified here associated with human oral phage. Given that the criterion for clustering is fairly stringent, the fact that 18 of 25 of the phage in this study clustered with the IMG/VR phage indicates that collection may be fairly comprehensive (note that phage 4 may have been included serendipitously).
It would be of interest to identify phage that could be utilized in phage therapy to treat oral diseases or opportunistic infections caused by oral bacteria (4). It would require additional work to cultivate such phage but the current work could be used to develop rapid PCR tests for phage presence in samples and in some cases such as phage 4 identify bacterial strains containing prophage. We identified multiple phages that appear to infect Streptococcus, though none that were shown to directly target S. mutans or S. sobrinus, which have significant roles in dental caries. Also we did not find phage that obviously infect bacteria strongly associated with periodontitis such as Porphyromonas gingivalis, Treponema denticola, Tannerella forsythia, Filifactor alocis, or others (2). It might be possible to more efficiently find such phage by examining DNA isolated from supragingival plaque for caries or subgingival plaque for periodontitis. However bacteria including species from Streptococcus, Haemophilus, and Actinomyces can cause opportunistic infections and we discovered phage infecting each of those genera.
In conclusion, we have shown that bacteriophage can be readily discovered by analysis of whole metagenomic sequences from the oral cavity. We find that phage if present seem to have low nucleotide diversity, indicating that they are dominated by a single strain, a striking difference from oral bacteria which have higher nucleotide diversity indicating the presence of multiple strains in most cases. Phage appear to be sparsely distributed but if present are frequently maintained over periods of months.
SOFTWARE AND DATA SHARING
Software scripts used in the work are available as a git repository at https://code.osu.edu/beall-3/salivary-virome. Raw sequences are available at NCBI SRA under BioProject PRJNA448135. Assembled phage contigs are available at https://img.jgi.doe.gov/ under IMG genome ID 3300019854.
ACKNOWLEDGEMENTS
This work was supported by NIH/NIDCR grant DE024327. We thank Pearlly Yan for DNA sequencing, Haella Holmes and Steve Spiritoso for technical assistance, Ben Bolduc for help with vConTACT and Ann Gregory and Matthew Sullivan for useful discussions.