The Diagnostic Potential & Interactive Dynamics of the Colorectal Cancer Virome

Human viruses (those that infect human cells) have been associated with many cancers, largely due to their mutagenic and functionally manipulative abilities. Despite this, cancer microbiome studies have almost exclusively focused on bacteria instead of viruses. We began evaluating the cancer virome by focusing on colorectal cancer, a primary cause of morbidity and mortality throughout the world, and a cancer linked to altered colonic bacterial community compositions but with an unknown association with the gut virome. We used 16S rRNA gene, whole shotgun metagenomic, and purified virus metagenomic sequencing of stool to evaluate the differences in human colorectal cancer virus and bacterial community composition. Through random forest modeling we identified differences in the healthy and colorectal cancer virome. The cancer-associated virome consisted primarily of temperate bacteriophages that were also predicted to be bacteria-virus community network hubs. These results provide foundational evidence that bacteriophage communities are associated with colorectal cancer and potentially impact cancer progression by altering the bacterial host communities.


Introduction
Because previous work has identified shifts in which bacteria were most important at different stages 163 of colorectal cancer (8, 20, 22), we explored whether shifts in the relative influence of phages could be 164 detected between healthy, adenomatous, and cancerous colons. We evaluated community shifts between 165 the disease stage transitions (healthy to adenomatous and adenomatous to cancerous) by building random 166 forest models to compare only the diagnosis groups around the transitions. While bacterial OTU models  Figure S8). Overall, we 183 were able to identify 78.8% of the OVUs as known viruses, and 93.8% of those viral OVUs aligned to 184 bacteriophage reference genomes. It is important to note that this could have been influenced by our 185 methodological biases against enveloped viruses (more common of eukaryotic viruses than bacteriophage), 186 due to chloroform and DNase treatment for purification. 187 We evaluated whether the phages in the community were primarily lytic (i.e. obligately lyse their hosts after 188 replication) or temperate (i.e. able to integrate into their host's genome to form a lysogen, and subsequently 189 transition to a lytic mode). We accomplished this by identifying three markers for temperate phages in the 190 OVU representative sequences: 1) presence of phage integrase genes, 2) presence of known prophage 191 genes, according the the ACLAME (A CLAssification of Mobile genetic Elements) database, and 3) nucleotide 192 similarity to regions of bacterial genomes (29, 31, 32). We found that the majority of the phages were 193 temperate and that the overall fraction of temperate phages remained consistent throughout the healthy, 194 adenomatous, and cancerous stages (Figure 3). These findings were consistent with previous reports 195 suggesting the gut virome is primarily composed of temperate phages (13, 18, 31, 33).

197
Because the link between colorectal cancer and the virome was driven by bacteriophages (as opposed 198 to non-bacterial viruses), we tested a potential hypothesis that the virome signal was a mere reflection of 199 the bacterial signal, and thus highly correlated with the bacterial signal. If this hypothesis were true, we 200 would expect a correlation between the relative abundances of influential bacterial OTUs and virome OVUs.

201
Instead, we observed a strikingly low correlation between bacterial and viral relative abundances (Figure 4 202 A,C). Overall, there was an absence of correlation between the most influential OVUs and bacterial OTUs 203 (Figure 4 B). This evidence supported our null hypothesis that the influential viral OVUs were not primarily 204 reflections of influential bacteria.

205
Given these findings, we posited that the most influential phages were acting by infecting a wide range of 206 bacteria in the overall community, instead of just the influential bacteria. In other words, we hypothesized 207 that the influential bacteriophages were community hubs (i.e. central members) within the bacteria and 208 phage interactive network. We investigated the potential host ranges of all phage OVUs using a previously 209 developed random forest model that relies on sequence features to predict which phages infected which 210 bacteria in the community (Figure 5 A) (34). The predicted interactions were then used to identify phage 211 community hubs. We calculated the alpha centrality (i.e. measure of importance in the ecological network) of 212 each phage OVU's connection to the rest of the network. The phages with high centrality values were defined 213 as community hubs. Next, the centrality of each OVU was compared to its importance in the colorectal cancer 214 classification model. Phage OVU centrality was significantly and positively correlated with importance to the 215 disease model (p-value = 0.004, R = 0.176), suggesting that phages that were important in driving colorectal 216 cancer also were more likely to be community hubs (Figure 5 B). Together these findings supported our 217 hypothesis that influential phages were hubs within their microbial communities and had broad host ranges.

219
Because of their propensity for mutagenesis and capacity for modulating their host functionality, many 220 human viruses are oncogenic (1)(2)(3)(4). Some bacteria also have oncogenic properties, suggesting that 221 bacteriophages, a component of the human virome in addition to human-specific viruses, may play an 222 indirect role in promoting carcinogenesis by influencing bacterial community composition and dynamics 223 (8-10). Despite their carcinogenic potential and the strong association between bacteria and colorectal 224 cancer, a link between virus colorectal communities and colorectal cancer has yet to be evaluated. Here 225 we show that, like colonic bacterial communities, the colon virome was altered in patients with colorectal 226 cancer relative to those with healthy colons. Our findings support a working hypothesis for oncogenesis by 227 phage-modulated bacterial community composition.

228
Based on our findings, we have developed a conceptual model to be tested in our future studies aimed at 229 elucidating the role the colonic virome plays in colorectal cancer (Figure 6 A). We found that basic diversity 230 metrics of alpha diversity (richness and Shannon diversity) and beta diversity (Bray-Curtis dissimilarity) 231 were insufficient for identifying virome community differences between healthy and cancerous states. By 232 implementing a machine learning approach (random forest classification) to leverage inherent, complex  are also many other alternative hypotheses by which the system could be operating. We hypothesize that 251 the process begins with broadly infectious phages in the colon lysing, and thereby disrupting, the existing 252 bacterial communities. This shift opens novel niche space that enabled opportunistic bacteria (such as 253 Fusobacterium nucleatum) to colonize. Once the initial influential founder bacteria establish themselves in 254 the epithelium, secondary opportunistic bacteria are able to adhere to the founders, colonize, and establish 255 a biofilm. Phages may play a role in biofilm dispersal and growth by lysing bacteria within the biofilm, a 256 process important for effective biofilm growth (37). The oncogenic bacteria may then be able to transform the 257 epithelial cells and disrupt tight junctions to infiltrate the epithelium, thereby initiating an inflammatory immune 258 response. As the adenomatous polyps developed and progressed towards carcinogenesis, we observed a 259 shift in the phages and bacteria whose relative abundances were most influential. As the bacteria enter their 260 oncogenic synergy with the epithelium, we conjecture that the phages continue mediating biofilm dispersal.

261
This process would thereby support the colonized oncogenic bacteria by lysing competing cells and releasing 262 nutrients to other bacteria in the form of cellular lysates. In addition to highlighting the likely mechanisms by 263 which the colorectal cancer virome is interacting with the bacterial communities this model will guide future 264 research investigations of the role the virome plays colorectal cancer.

265
Our working hypothesis represents a conceptualization of areas for future work, which will be required 266 to characterize the colorectal cancer microbiome at the functional, mechanistic level. There are many 267 different ways in which this system may operating, and our working hypothesis is one. For example, it is The 16S rRNA gene sequences were analyzed as described previously, relying on the mothur software 304 package (v1.37.0) (39, 40). Briefly, the sequences were de-replicated, aligned to the SILVA database (41), 305 screened for chimeras using UCHIME (42), and binned into operational taxonomic units (OTUs) using a 97% 306 similarity threshold. Abundances were normalized for uneven sequencing depth by randomly sub-sampling 307 to 10,000 sequences, as previously reported (23). to remove smaller contaminants. The filtered supernatant was treated with chloroform for ten minutes 321 with gentle shaking, so as to lyse contaminating cells including bacteria, human, fungi, etc. The exposed 322 genomic DNA from the lysed cells was degraded by treating the samples with 5U of DNase for one hour 323 at 37C. DNase was deactivated by incubating the sample at 75C for ten minutes. The DNA was extracted 324 from the purified virus-like particles (VLPs) using the Wizard PCR Purification Preparation Kit (Promega).

325
Disease classes were staggered across purification runs to prevent run variation as a confounding factor. 326 using the Illumina Nextera XT preparation kit according to the standard kit protocol. The tagmentation time operating characteristic Curve (AUC) and the three-class random forest model was trained using the 372 mean AUC. Both were validated using five-fold nested cross validation to prevent over-fitting on the tuning 373 paramters. Each training set was repeated five times, and the model was tuned for mtry values. For 374 consistency and accurate comparison between feature groups (e.g., bacteria, viruses), the sample model 375 parameters were used for each group. The maximum AUC during training was recorded across twenty 376 iterations of each group model to test the significance of the differences between feature set performance.

377
Statistical significance was evaluated using a Wilcoxon test between two categories, or a pairwise Wilcoxon 378 test with Bonferroni corrected p-values when comparing more than two categories. Phage OVU replication mode was predicted using methods described previously (29, 31, 32). Briefly, we 398 identified temperate OVUs as representative contigs containing at least one of three genomic markers: 1) 399 phage integrase genes, 2) prophage genes from the ACLAME database, or 3) genomic similarity to bacterial 400 reference genomes. Integrase genes were identified in phage OVU representative contigs by aligning the 401 contigs to a reference database of all known phage integrase genes from the Uniprot database (Uniprot 402 search term: "organism:phage gene:int NOT putative"). Prophage genes were identified in the same way, 403 using the ACLAME set of reference prophage genes. In both cases, the blastx algorithm was used with an   The authors declare no competing interests.     (pre-cancer), and carcinoma (cancer) patients. Stool samples were split into two aliquots, the first of which was used for bacterial sequencing and the second which was used for virus sequencing. Bacterial sequencing was done using both 16S rRNA amplicon and whole metagenomic shotgun sequencing techniques. Virus samples were purified for viruses using filtration and a combination of chloroform (bacterial lysis) and DNase (exposed genomic DNA degradation). The resulting encapsulated virus DNA was sequenced using whole metagenomic shotgun sequencing.

Figure 5: Community network analysis utilizing predicted interactions between bacteria and phage operational genomic units. A) Visualization of the community network for our colorectal cancer cohort. B)
Scatter plot illustrating the correlation between importance (mean decrease in accuracy) and the degree of centrality for each OVU. A linear regression line was fit to illustrate the correlation (blue) which was found to be statistically significantly and weakly correlated (p-value = 0.00409, R = 0.176). Figure 6: Final working hypothesis from this study. These panels summarize our thoughts on our results and represent interesting future directions that we predict will build on the presented work. A) Basic model illustrating the connections between the virome, bacterial communities, and colorectal cancer. B) Working hypothesis of how the bacteriophage community is associated with colorectal cancer and the associated bacterial community.

Figure S4: Diversity calculations comparing cancer states of the colorectal virome, based on relative abundance of operational genomic units in each sample. A) NMDS ordination of community samples, colored for cancerous (green), pre-cancerous (red), and healthy (yellow). B) Differences in means between disease
group centroids with 95% confidence intervals based on an ANOSIM test with a post hoc multivariate Tukey test. Comparisons (indicated on y-axis) in which the intervals cross the zero mean difference line (dashed line) were not significantly different. C) Shannon diversity and D) richness alpha diversity quantification comparing pre-cancerous (grey), cancerous (red), and healthy (tan) states. Figure S5: Beta-diversity analysis comparing Bray-Curtis dissimilarity between disease state and negative control community structures that were captured following sequence rarefaction. Differences in means between disease group centroids with 95% confidence intervals based on an ANOSIM test with a post hoc multivariate Tukey test. Comparisons in which the intervals cross the zero mean difference line (dashed line) were not significantly different. Figure S6: Comparison of bacterial 16S rRNA classification models with and without OTUs whose median relative abundance are greater than zero. A) Classification model performance (measured as area under the curve) for bacteria models using 16S rRNA data both with and without filtering of samples whose median was zero. Significance was calculated using a Wilcoxon rank sum test, and the resulting p-value is shown. The random area under the curve (0.5) is marked with a dashed line. B) Relative abundance of the six bacterial OTUs removed when filtered for OTUs with median relative abundance of zero. OTU relative abundance is separated by healthy (red) and cancerous (grey) samples. Relative abundance of 1% is marked by the dashed line.