Diagnostic Potential & the Interactive Dynamics of the 1 Colorectal Cancer Virome

18 Viruses are associated with many human cancers, largely due to their mutagenic and functionally 19 manipulative abilities. Despite this, cancer microbiome studies have almost exclusively focused on 20 bacteria instead of viruses. We began evaluating the cancer virome by focusing on colorectal cancer, 21 a primary cause of morbidity and mortality throughout the world, and a cancer linked to altered colonic 22 bacterial community compositions while the virome role remains unknown. We used 16S rRNA gene, whole 23 shotgun metagenomic, and purified virus metagenomic sequencing of stool to evaluate the differences in 24 human colorectal cancer virus and bacterial community composition. Through random forest modeling we 25 identified differences in the healthy and colorectal cancer virome. The cancer-associated virome consisted 26 primarily of temperate bacteriophages that were also predicted to be bacteria-virus community network hubs. 27 These results provide foundational evidence that bacteriophage communities are associated with colorectal 28 cancer and likely impact cancer progression by altering the bacterial host communities. 29


Introduction to adenomatous and adenomatous to cancerous stages (Figure S7 C-E).
After evaluating our ability to classify samples between two disease states, we performed a three-class 163 random forest model including all disease states. The 16S rRNA gene model yielded a mean AUC of 0.774 164 and outperformed the viral community model, which yielded a mean AUC of 0.672 (p-value < 0.001, Figure   165 S8 A-C). The microbes important for the healthy versus cancer and healthy versus adenoma models were 166 also important for the three-class model ( Figure S8 D-E). The most important bacterium in the two and three  Figure S8). Overall, we were able to identify 78.8% of the OVUs as known viruses, and 93.8% of those 176 viral OVUs aligned to bacteriophage reference genomes. It is important to note that this could have been 177 influenced by our methodological biases against enveloped viruses (more common of eukaryotic viruses than 178 bacteriophage), due to chloroform and DNase treatment for purification. 179 We evaluated whether the phages in the community were primarily lytic (i.e. obligately lyse their hosts after 180 replication) or temperate (i.e. able to integrate into their host's genome to form a lysogen, and subsequently 181 transition to a lytic mode). We accomplished this by identifying three markers for temperate phages in the 182 OVU representative sequences: 1) presence of phage integrase genes, 2) presence of known prophage 183 genes, according the the ACLAME (A CLAssification of Mobile genetic Elements) database, and 3) nucleotide 184 similarity to regions of bacterial genomes (29,31,32). We found that the majority of the phages were 185 temperate and that the overall fraction of temperate phages remained consistent throughout the healthy, 186 8 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. ;https://doi.org/10.1101/152868 doi: bioRxiv preprint suggesting the gut virome is primarily composed of temperate phages (13,18,31,33).

189
Because the link between colorectal cancer and the virome was driven by bacteriophages, we hypothesized 190 that the influential phages were primarily predators of the influential bacteria, and thus influenced their relative 191 abundance through predation. If this hypothesis were true, we would expect a correlation between the relative 192 abundances of influential bacteria and phages. Instead, we observed a strikingly low correlation between 193 bacterial and phage relative abundances (Figure 3 A,C). Overall, there was an absence of correlation 194 between the most influential OVUs and bacterial OTUs (Figure 3 B). This evidence supported our null 195 hypothesis that the influential phages were not primarily predators of influential bacteria.

196
Given these findings, we hypothesized that the most influential phages were acting by infecting a wide range 197 of bacteria in the overall community, instead of just the influential bacteria. In other words, we hypothesized 198 that the influential bacteriophages were community hubs (i.e. central members) within the bacteria and 199 phage interactive network. We investigated the potential host ranges of all phage OVUs using a previously as community hubs. Next, the centrality of each OVU was compared to its importance in the colorectal cancer 205 classification model. Phage OVU centrality was significantly and positively correlated with importance to the 206 disease model (p-value = 0.02, R = 0.14), suggesting that phages important in driving colorectal cancer also 207 were more likely to be community hubs (Figure S9 B). Together these findings supported our hypothesis that 208 influential phages were hubs within their microbial communities and had broad host ranges. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. ; https://doi.org/10.1101/152868 doi: bioRxiv preprint Because of their propensity for mutagenesis and capacity for modulating their host functionality, many viruses 211 are oncogenic (1-4). Some bacteria also have oncogenic properties, suggesting that bacteriophages may 212 play an indirect role in promoting carcinogenesis by influencing bacterial community composition and 213 dynamics (8-10). Despite their carcinogenic potential and the strong association between bacteria and 214 colorectal cancer, a mechanistic link between virus colorectal communities and colorectal cancer has yet to 215 be evaluated. Here we show that, like colonic bacterial communities, the colon virome was altered in patients 216 with colorectal cancer relative to those with healthy colons. Our findings support a working hypothesis for 217 oncogenesis by phage-modulated bacterial community composition.

218
Using our findings here, we have begun to delineate the role the colonic virome plays in colorectal cancer in 219 the form of a working hypothesis that will guide our future studies (Figure 5 A). We found that basic diversity 220 metrics of alpha diversity (richness and Shannon diversity) and beta diversity (Bray-Curtis dissimilarity) 221 were insufficient for identifying virome community differences between healthy and cancerous states. By  CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. ; https://doi.org/10.1101/152868 doi: bioRxiv preprint colorectal cancer, we also used our data and existing knowledge of phage biology to develop a working 237 hypothesis for the mechanisms by which this may occur. This was done by incorporating our findings into the 238 current model for colorectal cancer development (Figure 5 B) (22). We hypothesize that the process begins 239 with broadly infectious phages in the colon lysing and thereby disrupting the existing bacterial communities.

240
This shift opens novel niche space that enabled opportunistic bacteria (such as Fusobacterium nucleatum) 241 to colonize. Once the initial influential founder bacteria establish themselves in the epithelium, secondary 242 opportunistic bacteria are able to adhere to the founders, colonize, and establish a biofilm. Phages may 243 play a role in biofilm dispersal and growth by lysing bacteria within the biofilm, a process important for 244 effective biofilm growth (37). The oncogenic bacteria may then be able to transform the epithelial cells 245 and disrupt tight junctions to infiltrate the epithelium, thereby initiating an inflammatory immune response.

246
As the adenomatous polyps developed and progressed towards carcinogenesis, we observed a shift in the 247 phages and bacteria whose relative abundances were most influential. As the bacteria enter their oncogenic 248 synergy with the epithelium, we conjecture that the phages continue mediating biofilm dispersal. This process 249 would thereby support the colonized oncogenic bacteria by lysing competing cells and releasing nutrients to  Bacteriophage and bacterial communities cannot maintain stability and co-evolution without one another (6, 259 38). Not only is the human virome an important element to consider in human health and disease (12-18), 260 but our findings support that it is likely to have a significant impact on cancer etiology and progression.

11
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. ; https://doi.org/10.1101/152868 doi: bioRxiv preprint  (pre-cancer), and carcinoma (cancer) patients. Stool samples were split into two aliquots, the first of which was used for bacterial sequencing and the second which was used for virus sequencing. Bacterial sequencing was done using both 16S rRNA amplicon and whole metagenomic shotgun sequencing techniques. Virus samples were purified for viruses using filtration and a combination of chloroform (bacterial lysis) and DNase (exposed genomic DNA degradation). The resulting encapsulated virus DNA was sequenced using whole metagenomic shotgun sequencing.

25
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. ; https://doi.org/10.1101/152868 doi: bioRxiv preprint 26 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. ; https://doi.org/10.1101/152868 doi: bioRxiv preprint

Figure 3: Relative abundance correlations between bacterial OTUs and virome OVUs. A) Pearson correlation coefficient values between all bacterial OTUs (x-axis) and viral OVUs (y-axis) with blue being positively correlated and red being negatively correlated. Bar plots indicate the viral (left) and bacterial (bottom) operational unit importance in their colorectal cancer classification models, such that the most important units are in the top left corner. B) Magnification of the boxed region in panel (A), highlighting the correlation between the most important bacterial OTUs and virome OVUs. The most important operational units are in the top left corner of the heatmap, and the correlation scale is the same as panel (A). C) Histogram quantifying the frequencies of Pearson correlation coefficients between all bacterial OTUs and virome OVUs.
27 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. ; https://doi.org/10.1101/152868 doi: bioRxiv preprint Figure 4: Lysogenic phage relative abundance in disease states. Phage OVUs were predicted to be either lytic or lysogenic, and the relative abundance of lysogenic phages was quantified and represented as a boxplot. No disease groups were statistically significant.
28 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. ; https://doi.org/10.1101/152868 doi: bioRxiv preprint

Figure 5: Final model and working hypothesis from this study. A) Basic model illustrating the connections between the virome, bacterial communities, and colorectal cancer. B) Working hypothesis of how the bacteriophage community is associated with colorectal cancer and the associated bacterial community.
29 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. ; https://doi.org/10.1101/152868 doi: bioRxiv preprint Figure S1: Basic Quality Control Metrics. A) VLP genomic DNA yield from all sequenced samples. Each bar represents a sample which is grouped and colored by its associated disease group. B) Sequence yield following quality control including quality score filtering and human decontamination.

30
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. ; https://doi.org/10.1101/152868 doi: bioRxiv preprint Figure S2: Length and coverage statistics. A) Heated scatter plot demonstrating the distribution of contig coverage (number of sequences mapping to each contig) and contig length for the virus metagenomic sample set. B) Scatter plot illustrating the distribution of operational viral unit (OVU) length and sequence coverage for the virus metagenomic sample set. C) Heated scatter plot demonstrating the distribution of contig coverage and length for the whole metagenomic sample set. D) Scatter plot illustrating the distribution of operational genomic unit (OGU) length and sequence coverage for the whole metagenomic sample set.

31
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. ; https://doi.org/10.1101/152868 doi: bioRxiv preprint Figure S3: Operational genomic unit composition stats. A) Strip chart demonstrating the length and frequency of contigs within each operational genomic unit of the virome sample set. The y-axis is the operational genomic unit identifier, and x-axis is the length of each contig, and each dot represents a contig found within the specified operational genomic unit. B) Density plot (analogous to histogram) of the number of virome operational genomic units containing the specific number of contigs, as indicated by the x-axis. C-D) Sample plots as panels C and D, but for the whole metagenomic sample set.

Figure S4: Diversity calculations comparing cancer states of the colorectal virome, based on relative abundance of operational genomic units in each sample. A) NMDS ordination of community samples, colored for cancerous (green), pre-cancerous (red), and healthy (yellow). B) Differences in means between disease
group centroids with 95% confidence intervals based on an ANOSIM test with a post hoc multivariate Tukey test. Comparisons (indicated on y-axis) in which the intervals cross the zero mean difference line (dashed line) were not significantly different. C) Shannon diversity and D) richness alpha diversity quantification comparing pre-cancerous (grey), cancerous (red), and healthy (tan) states.

33
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017.

Figure S5: Beta-diversity comparing disease states and the study negative controls. Differences in means between disease group centroids with 95% confidence intervals based on an ANOSIM test with a post hoc multivariate Tukey test. Comparisons in which the intervals cross the zero mean difference line (dashed line)
were not significantly different.

34
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. ; https://doi.org/10.1101/152868 doi: bioRxiv preprint Figure S6: Comparison of bacterial 16S rRNA classification models with and without OTUs whose median relative abundance are greater than zero. A) Classification model performance (measured as area under the curve) for bacteria models using 16S rRNA data both with and without filtering of samples whose median was zero. Significance was calculated using a Wilcoxon rank sum test, and the resulting p-value is shown. The random area under the curve (0.5) is marked with a dashed line. B) Relative abundance of the six bacterial OTUs removed when filtered for OTUs with median relative abundance of zero. OTU relative abundance is separated by healthy (red) and cancerous (grey) samples. Relative abundance of 1% is marked by the dashed line.

37
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. ; https://doi.org/10.1101/152868 doi: bioRxiv preprint Figure S9: Community network analysis utilizing predicted interactions between bacteria and phage operational genomic units. A) Visualization of the community network for our colorectal cancer cohort. B) Scatter plot illustrating the correlation between importance (mean decrease in accuracy) and the degree of centrality for each OVU. A linear regression line was fit to illustrate the correlation (blue) which was found to be statistically significantly and weakly correlated (p-value = 0.0173, R = 0.14).

38
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 26, 2017. Cluster 320 Cluster 310 Cluster 62 Cluster 18 Cluster 95 Cluster 162 Cluster 315 Cluster 115