Towards a consensus microRNA signature of primary and metastatic colorectal cancer

Background Although microRNAs (miRNAs) are involved in all hallmarks of cancer, miRNA dysregulation in the metastatic process remains poorly understood. We investigated the role of miRNAs in metastatic evolution of colorectal cancer (CRC) by analyzing smallRNA-seq datasets from primary CRC, metastatic locations (liver, lung and peritoneum), and corresponding adjacent tissues. Addressing main challenges of miRNA analysis, a bioinformatics pipeline was developed that contains ​bona fide miRNA annotations from MirGeneDB, utilizes the quality control software miRTrace, applies physiologically meaningful cutoffs and accounts for contribution of cell-type specific miRNAs and host tissue effects. Results Two hundred-and-seventy-five miRNA sequencing datasets were analyzed, and after adjusting for the contribution of heterogeneity in cellular composition, strong signatures for primary and metastatic CRC were identified. The signature for primary CRC contained many previously known miRNAs with known functions. Deregulation of specific miRNAs was associated with individual metastatic sites, but the metastatic signatures contained overlapping miRNAs involved in key elements of the metastatic process, such as epithelial-to-mesenchymal transition and hypoxia. Notably, four of these miRNAs (MIR-8, MIR-10, MIR-375, MIR-210) belong to deeply conserved families present in many other organisms, triggering questions about their evolutionary functions and opportunities for experimental validation. Conclusion: Applying a meticulous pipeline for the analysis of smallRNA-seq data, miRNA signatures for primary and metastatic CRC were identified, contributing novel insights into miRNA involvement in CRC metastatic evolution and site-specific metastatic adaptations. New datasets can easily be included in this publicly available pipeline to continuously improve the knowledge in the field.


Background
Colorectal cancer (CRC) is a heterogeneous disease and a leading cause of cancer-related deaths worldwide [1] . CRC evolution is an only partially understood process that starts with the formation of primary tumors from epithelial cells in the colorectum and commonly leads to metastatic progression, which is the major cause of mortality. The main metastatic sites are the liver, lungs and peritoneal surface, emphasizing the substantial changes these cells undergo during their evolution that enable them to travel and establish in distant organs with very different microenvironments [2,3] . Although the molecular landscape of primary CRC (pCRC) has been extensively characterized and several mechanisms for the onset of tumorigenesis are known [4] , the genomic and transcriptomic changes in metastatic CRC (mCRC) are less well understood [5][6][7] . A better understanding of these changes is warranted, because the adaptation to a metastatic phenotype not only represents a pivotal step in cancer evolution, but could be exploited as potential diagnostic, prognostic, and predictive biomarkers of clinical relevance.
MicroRNAs (miRNAs) are evolutionary ancient post-transcriptional gene regulators involved in cell-type specification, tissue identity and development in animals [8] .
With more than 11,000 annual publications in 2019 alone they are the most studied RNA molecules to date [9] . Because their dysregulation correlates with all hallmarks of cancer [10] , due to their remarkable chemical stability [11] , and the availability of sensitive detection methods, miRNAs have repeatedly been suggested as promising cancer biomarkers [12,13] . However, although numerous candidate biomarkers and miRNA signatures have been suggested, there is a striking lack of overlap between published results from different groups, and to date, no clinically implemented miRNA biomarker is available for pCRC [6,[14][15][16][17][18][19] . In mCRC, the absence of consistent data is even more apparent, and there is no consensus regarding which miRNAs are up-and down-regulated in mCRC, representing a major obstacle to understanding the role of miRNAs in metastatic progression as well as to the identification of relevant biomarkers [6] .
This lack of consensus is surprising and clearly is not reflecting biology or evolution of CRC but might rather be due to known challenges in miRNA research. First, varying definitions of human miRNA genes have previously resulted in substantially different sets of miRNAs, and non-miRNAs, being profiled between different studies. Second, a uniform bioinformatics workflow tailored for smallRNA-seq analysis of bulk samples from different organs, tissues and pathological states was missing. Third, for differential expression analysis, commonly reported miRNAs did not meet physiologically relevant expression levels in either of the studied samples [20] .
Finally, a strategy to account for cellular composition effects when analyzing bulk tissue cancer specimens has usually not been applied [21] .
To overcome these challenges, our group previously showed that the majority of predicted human miRNAs in the most commonly used miRNA reference database, miRBase, were not derived from miRNA genes, and that including such sequences in miRNA studies could affect the results [22] . For instance, in a previous study by Neerincx et al [14] , the reported "miRNA" signature based on miRBase annotations for mCRC consisted solely of genes that were all no bona fide miRNAs (Mir-320 -miRtron [23] , Mir-3117 -overlaps with protein coding gene, Mir-1246 -a U2 derived degradation product [24] , and Mir-663 -a rRNA fragment [25] . To enable more meaningful and comparative studies, we set up MirGeneDB, a publicly available, curated miRNA gene database [26] . The challenge of ensuring quality and reproducibility when comparing smallRNA-seq data across studies was resolved by Kang et al who developed the miRTrace software as a universal quality control pipeline specifically for smallRNA-seq data [27] . To ensure physiologically meaningful levels of miRNAs in differential expression analysis between two sample groups such as pCRC and mCRC, we applied a strict cutoff of 100 reads per million (RPM) as the minimum expression level, which also implied lower dependence on technical factors such as, for instance, sequencing depth [14] .
The final challenge relates to the analysis of datasets generated from bulk tissue samples of varying cellular composition, such as normal tissue and primary tumor tissue biopsies. Although the majority of miRNAs are ubiquitously expressed in various tissues, many miRNAs are exclusively expressed in particular cell types, tissues, organs, or at specific developmental time points [21,[28][29][30] . Such (cell-) specific miRNAs may confound the interpretation of results when comparing miRNA expression between different tissue types, such as primary and metastatic tumors and tumors derived from different metastatic sites. To address this we took advantage of previous studies on cell-type specific miRNA expression and incorporated the information of cell specificity of miRNAs into our analysis [28][29][30][31] .
Taking all these aspects into account and addressing each of them in a comprehensive, reproducible and flexible pipeline, we analyzed novel smallRNA-seq data from pCRC and mCRC, combined with normal adjacent tissues and existing smallRNA-seq datasets, altogether totaling 193 datasets. We derived a robust consensus miRNA signature of pCRC and mCRC that partially overlap with previous reports but which also includes novel miRNAs. Specifically, differential expression analysis revealed a set of deregulated miRNAs when comparing pCRC and multiple metastatic sites, including Mir-10-P1a, Mir-375, Mir-8-P1b and the hypoxamiR Mir-210 that was validated in independent samples and point to an evolutionary adaptation of metastatic tumor cells to hypoxic conditions in distant sites.

Consensus datasets
Four of the six tissue-based datasets fulfilled the quality control criteria, yielding 193 total NGS datasets (Additional file 1: Supplementary Methods), while two studies were excluded. One had low quality reads in the majority of samples [15] [14,17,32] , and these datasets were included in further analyses ( Figure   1A).

Global miRNA expression
Global miRNA expression was analyzed using the dimensionality reducing technique UMAP ( Figure 1A) and unsupervised clustering ( Figure 1B). UMAP revealed that the consensus datasets clustered according to the tissue of origin ( Figure 1A left), and not according to the originating study ( Figure 1A right), indicating that the consensus datasets are representative of their respective tissue. All normal tissues (nLi, nLu and nCR) formed distinct clusters distant from each other and from the tumor tissues. Of note, the UMAP plot showed that the CLM and mLu tissues formed distinct clusters.
Unsupervised clustering ( Figure 1B) showed a clear difference between normal and tumor tissues. The normal tissues (nLi, nLu and nCR) formed distinct clusters ( Figure   1B), while nCR and the tumor samples (pCRC, CLM, mLu and PM-CRC), were part of a major cluster separate from nLi and nLu ( Figure 1B).

Cell specific miRNA expression
McCall et al 2017 [28] reported a set of 45 miRNAs with cell-type specific expression.
These miRNAs were used to analyze the contribution of cell specific miRNAs on expression data derived from bulk tissue (see Additional  Comparing cell specific miRNA expression in the different tissues using a heatmap ( Figure 2B), scaled by a z-score of Reads Per Million (RPM) values for each miRNA, cell specific miRNAs tended to be expressed at lower levels in malignant tissues compared to the normal tissues ( Figure 2B). Not surprisingly, the hepatocyte specific Mir-122_5p had a much higher expression in nLi compared to nCR. While Mir-145_5p appeared to have a higher expression level in nCR compared to pCRC,  could not be visually observed to be differentially expressed.

A miRNA signature for pCRC compared to nCR
To test which individual miRNAs are driving the observed global differences between nCR and pCRC, expression levels of all miRNA genes were compared between the two groups, annotating cell specific miRNAs, as well as miRNAs reported to be expressed in plasma of healthy individuals. Thirty-seven miRNAs were up-regulated and 37 miRNAs were down-regulated; and of these, 15 up-and down-regulated miRNAs were cell specific miRNAs and 20 up-and down-regulated miRNAs were reported to be expressed at a high level in peripheral blood cells [37] . Global differential expression results are shown as a volcano plot in Figure 3A, and the signature miRNAs found to be differentially expressed above the defined thresholds are shown in a bar plot in Figure 3B. Of the initial signature, 15 miRNAs were cell-type specific, (lower expression in pCRC) are reported to be exclusively expressed in mesenchymal cells, specifically fibroblasts and smooth muscle cells [38] . Mir-150 and Mir-342 are predominantly expressed by lymphocytes, and we observed a strong reduction in expression levels in the tumor samples compared to nCR. Mir-223, which is specific to dendritic cells and macrophages [28] , was up-regulated in pCRC. Signatures for mCRC compared to pCRC, accounting for normal tissue expression Common and independent miRNA signatures were identified for CLM, mLu and PM-CRC, compared to pCRC. Altogether, 27 miRNAs were up-regulated and 7 miRNAs were down-regulated in the metastatic lesions when compared to pCRC (Figure 4).
To assess miRNAs that are differentially expressed when comparing pCRC and mCRC, an important consideration is the confounding effect of miRNAs derived from the normal tissue surrounding the metastasis, which inevitably will be present in the bulk metastasis tissue samples. This issue was solved by omitting miRNAs that were differentially expressed between nCR and nLi or nLu, as well as between pCRC and nLi or nLu from the signature. Five miRNAs were differentially expressed across multiple metastatic sites ( Figure 4 and Figure 5). Mir-210_3p showed strong up-regulation in both CLM and mLu, with more than twice as high expression levels compared to pCRC, whereas in PM-CRC, the increase was more modest.

Discussion
We here present a rigorous pipeline for the analysis of bulk tissue smallRNA-seq data.
The pipeline was developed to address prevailing challenges in the miRNA field to obtain robust insights into the evolution of CRC and in particular to understand the role of miRNAs in the evolution of mCRC. Our analysis of clinical smallRNA-seq datasets produced in this study, as well as publicly available clinical smallRNA-seq data sets represents the most comprehensive analysis of miRNAs in CRC to date.
Combining meticulous quality control with open science practices allows for increased power to detect meaningful miRNAs, along with high confidence that the results will be reproducible. In addition, this allows for the inclusion of additional datasets to the analysis as they become available in the future. Importantly, our analytical pipeline is not restricted to CRC and could easily be applied to other cancer types.

pCRC signature miRNAs
More than 40 deregulated miRNAs were identified in the signature for pCRC. This

MIR-8 and MIR-375
Mir-8-P1b_3p ( miR-141 ), which is a member of the MIR-8 family , was 1 down-regulated in pCRC compared to nCR, yet up-regulated at all metastatic sites compared to pCRC ( Figure 5). MIR-8 family members have been reported to suppress metastasis by targeting transcription factors ZEB1 and ZEB2, suppressing E-cadherin expression [48,49] . The finding therefore makes sense in terms of contributing to MET in the established metastasis at the distant site and is in line with previous reports in CRC [50,51] . Increased expression of Mir-8-P1b_3p suppresses ZEB1 and ZEB2 expression, and thus reactivates E-cadherin expression and the epithelial phenotype. Further experimental evidence from an in vivo study of mouse breast cancer cell lines suggested that MIR-8 family miRNA expression was necessary for liver and lung colony formation [52] , supporting the significance of the MIR-8 family in the evolution of metastases not only for mCRC, but metastases in general.
Two previous studies have also suggested circulating Mir-8-P1b_3p ( miR-141 ) to be a predictive biomarker of metastasis in CRC. In a study of 182 pCRC cases (stage I-IV) and 76 healthy controls, plasma Mir-8-P1b_3p ( miR_141 ) levels were associated with stage IV CRC and poor survival outcomes [53] , while no differences in expression levels were detected in the corresponding pCRC samples. In a more recent study, Meltzer et al 2019 studied plasma exosomal miRNA levels in 83 locally advanced rectal cancer cases collected at the time of diagnosis [54] . Exosomal Mir-8-P1b_3p ( miR-141 ) was elevated in patients with synchronous CLM compared to patients with no metastases, and Mir-8-P1b_3p ( miR-141 ) was associated with higher risk of CLM development. The study also reported that high Mir-375_3p levels were associated with synchronous CLM. This is interesting in light of the tissue expression levels detected in our study, as Mir-375_3p was expressed at higher levels in both CLM and PM-CRC compared to pCRC. Mir-375_3p is the only member of the MIR-375 family . 2 Both miRNAs have previously been reported to be present in circulation in other cancer types, including breast cancer [55] , suggesting a possible association with the metastatic process. It is worthwhile mentioning that both the MIR-8 and the MIR-375 family are evolutionarily deeply conserved miRNA families, ranging back more than 650 million years to the last common ancestor of all bilaterian animals. Consequently, family members are also found in most other animals [26] , and could be studied, for instance, in emerging cancer model systems such as the fruit fly Drosophila [56] .

MIR-10
Another miRNA family that has been associated with EMT is the MIR-10 family [57] and in particular its Mir-10-P1 paralogues . Levels of Mir-10-P1b_5p were elevated in 3 breast cancer cell lines with metastatic capability, compared to both human mammary epithelial cells, or other human breast cancer cell lines [58] . Furthermore, in vivo silencing of Mir-10-P1b_5p using antagomirs suppressed lung metastasis, but did not slow primary tumor growth [57] . Mir-10-P1b_5p expression is induced by Twist , a hypoxia inducible transcription factor involved in EMT, and it targets Hoxd10, which is reported to be a suppressor of metastasis [57] . T hese findings were further expanded to CRC showing that elevated Mir-10-P1 levels increase invasion and migration in CRC tumors [59] .

MIR-210
Hypoxia is an important feature of metastatic evolution, influencing processes such as metabolism, angiogenesis, cell proliferation and differentiation [64] . In this context, our finding of up-regulation of the "hypoxamir" Mir-210_3p at all mCRC sites is of particular interest and we further validated the up-regulation of this miRNA by qPCR.
The Mir-210_3p promoter has a binding site for HIF-1 and HIF-2 , transcription α α factors involved in maintaining cellular oxygen homeostasis, and Mir-210_3p is therefore up-regulated under hypoxic conditions [65] . However, the impact of Mir-210_3p on the metastatic process is controversial, with studies providing contradictory statements about its role in different cancer types, some reporting functions as an oncoMir and inducer of metastatic progression, while others supports a role as a tumor suppressor [66] . In pCRC Mir-210_3p was shown to be up-regulated in a study of 193 cases comparing paired pCRC and nCR tissue samples, and high Mir-210_3p levels were associated with inferior prognosis [67] . Higher  levels were also reported to enhance invasion and metastasis in cell lines (Mudduluru et al. 2015) . We did not observe up-regulation of  in pCRC compared to nCR; rather, our results show up-regulation at the metastatic sites, which is in accordance with several non-NGS studies in CRC [68][69][70] . Our finding of increased Mir-210_3p expression at all mCRC sites, suggest that hypoxic stress is a common occurrence at the metastatic sites, and that up-regulation of Mir-210_3p may be a consequence of this hypoxic microenvironment. Curiously, MIR-210 also belongs to a group of ancient miRNA, present in most animals living today, and therefore could be tested in a range of previously mentioned models such as the fruit fly or zebrafish.

the normal tissue effect and cell-type specific miRNAs
When considering miRNA expression at the global level, our results confirmed the importance of analyzing mCRC sites as separate entities. The UMAP analysis ( Figure   1A) showed that the metastatic sites, in terms of global miRNA expression, were clearly distinct. In particular, the mLu datasets formed more distinct clusters from the pCRC, compared to the other mCRC samples. This may reflect the presence of host tissue cells in the bulk metastasis tissue, but could also represent adaptive expressional responses in tumor cells. Both this, and the unsupervised clustering analysis ( Figure 1B), showed that at the global miRNA expression level of the adjacent tissues, nCR, nLi and nLu were clearly distinct, stressing the importance of controlling for adjacent tissues in the differential expression analysis to moderate the initial apparent large differences between nCR and pCRC, or pCRC and mCRC.
Important in this perspective, exclusion of cell-type specific miRNAs (Figure 2A), from the analysis did not appear to influence the clustering analyses. In particular, mLu samples still clustered close to nLu, and not to the other malignant tissues, pointing to the presence of additional global miRNA expression differences in these organs and tissues. Nevertheless, when comparing tissues, it is important to account for miRNAs that are exclusively expressed in specific cell types. For instance, Mir-143/145 was previously reported to be down-regulated in pCRC compared to nCR, and was therefore suggested as a biomarker of CRC [71] . However, it has since been conclusively shown that Mir-143/145 are expressed in mesenchymal cells only, and not in epithelial cells [72,73] . Therefore, the observed "down-regulation" of Mir-143/145 must be caused by differences in cellular composition of the tissues, and not in expression changes in epithelial cancer cells. It is therefore necessary to account for the cellular composition of tissues when analyzing bulk tissue samples.
Similarly, the down-regulation of platelet and red blood cell specific Mir-486 was previously associated with tumor stage in pCRC [74] , but is probably not directly related to tumor progression. Another example is the hepatocyte specific Mir-122 [75] , where a very strong up-regulation in CLM likely reflects presence of hepatocytes in the bulk CLM tissue. Not surprisingly, therefore, these cell-type specific miRNAs appear to have lower expression in malignant tissues compared to the normal tissues ( Figure 2B). This is an important consideration for future research as the biomarker potential of a miRNA abnormally expressed in diseased tissue must be tightly controlled.

Conclusion
In conclusion, using a consistent pipeline for analysis of bulk tissue smallRNA-seq data across datasets from multiple studies, a set of miRNAs implicated in mCRC evolution was revealed. Interestingly, three of the miRNAs up-regulated in mCRC are among the evolutionary most deeply conserved miRNAs which could be interpreted as a reactivation of early gene regulatory programs and as a support of the previously suggested devolution hypothesis of cancer [76][77][78] .
A limitation of the study is that the datasets originated from heterogeneous studies, trials and time points and included few paired pCRC and mCRC samples. This may have resulted in under reporting of some relevant miRNAs, but the consensus signatures still represent a solid foundation for future research. By setting up a publicly available bioinformatics pipeline for processing and analyzing smallRNA-seq data, we have ensured that it will be straightforward to add new datasets when they become available.
Accounting for known cell-type specific miRNAs is a novel effort in the field, which has contributed to remove false signals and identify a robust signature. However, the effort was limited to available knowledge from cell type profiling in vitro and in bulk [28][29][30]37] , and does not account for all cell-type specific expression differences, particularly where the differences in miRNA expression levels are more subtle or unknown. Cell deconvolution of bulk tissue sequencing data based on gene expression is in itself a large and rapidly developing field [79][80][81] .
As of yet, development of rigorous miRNA expression profiles on the cellular level is still in its infancy [82] . The use of bulk tissue data does not allow for direct analysis of tumor heterogeneity, and tumors are not homogenous entities, but made up of distinct sub clonotypes and cell types [83] . Therefore, single cell approaches, which have led to a number of encouraging findings in CRC [84,85] , hold promise for future progress in the miRNA field [86,87] .

Patient samples
Tumor

RNA extraction and next generation sequencing
RNA was extracted using Qiagen Allprep DNA/RNA/miRNA universal kit, which simultaneously isolates genomic DNA and total RNA. RNA concentration was evaluated using ThermoFisher NanoDrop spectrophotometer and RNA integrity was evaluated by Agilent Technologies Bioanalyzer RNA 6000 Nano kit. Samples with sufficient quality were then used to prepare small RNA NGS libraries (RIN > 6), using TruSeq Small RNA Library preparation protocol. Successfully prepared libraries were sequenced using Illumina HiSeq 2500 High Throughput Sequencer using single end sequencing (50bp).

Collection of available NGS datasets
A literature search was conducted for the terms "microRNA + CRC + next generation sequencing" in different variations, and reviews were studied [18,39] . Publicly available data were downloaded from European Genome-phenome Archive (EGA), the Sequence Read Archive (SRA) and the Gene Expression Omnibus (GEO). For download, the latest version of sRNAbench was used [89] . Corresponding SRA data files were automatically downloaded and converted into FASTQ files.

Data processing
All datasets were processed using the same pipeline. miRTrace [27] was used for preprocessing and quality control of raw data (FASTQ files). In brief, low quality reads, defined as reads where less than 50% of nucleotides had a Phread quality score greater than 20 were discarded. 3p adapter sequences were trimmed, and reads made up of repetitive elements and reads that are shorter than 18nt were removed. The output of miRTrace quality control statistics included read length distribution, percentage sequences where adapter was detected, reads < 18 nt, low complexity or low Phred score, percentage of reads miRTrace determined to be miRNA, rRNA, tRNA, artifacts or unknown, number of unique miRNA genes detected in each sample, and degree of contamination of miRNAs derived from other species than human. Samples where either less than 25% of reads were between 20 and 25 nt, or where more than 75% of reads were discarded, or where less than 10% of reads were determined to be miRNA, were excluded. Studies where more than half of datasets failed the QC criteria, or where significant contamination was detected, were excluded from the study.

Read alignment
Reads were mapped to MirGeneDB2.0 (mirgenedb.org) precursor sequences consisting of the pre-miRNA sequence as well as 30 nt upstream and downstream of the Drosha cut site [26] using bowtie1.2 [90] , requiring an 18 nucleotide seed sequence of zero mismatches to avoid cross-mapping, and up to five mismatches outside the seed. Mapped reads were counted using the featureCounts 'summarizeOverlaps()' function from the 'GenomicAlignments' Bioconductor package [91,92] .

Data exploration
For data exploration, read counts were normalized using the varianceStabilizingTransformation() (VST) function from the DESeq2 package [93] .
Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP), and hierarchical clustering, was used to visualize the similarity of datasets on the global miRNA expression level. UMAP is a dimensionality-reducing algorithm used to reduce the numbers of dimensions in each dataset, here 537 miRNA genes, onto a more easily interpretable two dimensions, while retaining the relationship between datasets [94] . The umap R package was used, and datasets were annotated by tissue and by study ( Figure 1A, left and right panels, respectively). Hierarchical clustering was performed with the hclust() R function using the complete linkage clustering method on a distance matrix of all datasets computed with the dist() R function using maximum distance measure. Datasets were annotated according to tissue origin.

Analysis of influence of cell specific miRNA
Principal another UMAP analysis was done, discarding the 45 cell specific miRNAs from the analysis, in order to assess clustering when these miRNAs are not present in the datasets.

Differential expression analysis
The differential expression analysis can be viewed in the supplementary R-markdown file, (Additional file 7: Supplementary Methods) Differential expression analysis was performed using DESeq2 [93] . Multiple testing was corrected using Benjamin-Hochberg, with a false discovery rate (FDR), threshold of 0.05. The Log 2 Fold Change (LFC) shrinkage function in DESeq2, lfcShrink(), was enabled. This shrinks fold changes for miRNAs with higher variance. In addition, miRNAs were filtered if they had a LFC > 0.58 or LFC < -0.58, and where at least one tissue had mean expression >100 Reads Per Million (RPM) a level previously suggested as cutoff for physiological activity [20] , and MiRNAs known to be cell-type specific [28] were labelled in each signature. Also, miRNAs that have been reported to be present at high levels in one or more blood cell types were annotated as such, [37] . When comparing metastatic tissue samples, normal adjacent tissues were used to control for the confounding effect of surrounding normal tissue.

qPCR validation
To validate the NGS finding of increased expression of Mir-

Declarations Ethics approval and consent to participate
All patients whose material has been used in this study have provided consent to participate in the study. The Norwegian Regional Committees for Medical and Health Research Ethics

Consent for publication
Not Applicable

Availability of data and materials
The datasets generated and/or analysed during the current study will be made available in the European Genome-phenome Archive repository. Bioinformatics pipeline can be found at (https://github.com/eirikhoye/mirna_pipeline)