Subnetwork-based prognostic biomarkers exhibit performance and robustness superior to gene-based biomarkers in breast cancer

Background Effective classification of cancer patients into groups with differential survival remains an important and unsolved challenge. Biomarkers have been developed based on mRNA abundance data, but their replicability and clinical utility is modest. Integrating functional information, such as pathway data, has been suggested to improve biomarker performance. To date, however, the advantages of subnetwork-based biomarkers have not been quantified. Results We deeply sampled the population of prognostic gene-based and subnetwork-based biomarkers in a breast cancer meta-dataset of 4,960 patients. Analysing the performance and robustness of 22,000,000 gene biomarkers and 6,250,000 subnetwork biomarkers across twenty different training:testing cohort partitions of the meta-dataset revealed that subnetwork biomarkers exhibit superior overall performance and higher concordance across partitions. We find evidence of an upper bound for optimal biomarker size of ∼200 genes or ∼100 subnetworks. Additionally, with both biomarker feature types, larger biomarkers tend to show less consistency in performance across partitions, suggestive of over-fitting. Finally, an evaluation of varying training cohort sizes quantifies the effects of training cohort size. Conclusions Many groups are developing techniques for exploiting network-based representations of biological pathways to characterize cancer and other diseases. By considering the distribution of gene- and subnetwork-based biomarkers, we show that pathway data improves performance and replicability, and that smaller biomarkers are more robust across patient cohorts. These insights may facilitate development of clinically useful biomarkers.

biological pathways to characterize cancer and other diseases. By considering the distribution 38 of gene-and subnetwork-based biomarkers, we show that pathway data improves improvements in biomarker performance, then identifying informative groups of genes is 83 clearly a more efficient approach to generating clinically relevant prognostic tools. 84 To resolve this issue, we examined whether using subnetworks as the building blocks of 85 prognostic biomarkers confers any advantage over orthodox gene-based approaches. The 86 SIMMS model was generalized to score gene biomarkers as well as subnetwork biomarkers 87 and then used to evaluate and compare the performance of 22,000,000 gene biomarkers and  Results and discussion 95 Using SIMMS to measure the performance of gene and subnetwork biomarkers 96 The prognostic performance of subnetwork biomarkers was evaluated on a 4,960 breast 97 cancer patient meta-dataset using the SIMMS model as fully detailed in Methods [28] . 98 Briefly, a univariate Cox proportional hazards model is used to estimate the association 99 between mRNA abundance levels and survival information of the training cohort for each 100 gene separately. Next, for each patient, the per-gene coefficients from the model are 101 aggregated into subnetwork risk scores. Then, for each subnetwork biomarker, a multivariate 102 Cox model is fit to the patient risk scores of the subnetworks comprising the biomarker.
To study the characteristics of the population of gene biomarkers and subnetwork biomarkers, 114 we randomly sampled 22,000,000 gene biomarkers and 6,250,000 subnetwork biomarkers by 115 utilizing jackknifing, a resampling method in which samples are drawn from a population 116 without replacement [34] . For this purpose, we only used genes for which mRNA abundance 117 scores were available in all eighteen datasets comprising our breast cancer meta-dataset 118 ( Table 1). To ensure that our set of gene biomarkers contained the same underlying mRNA 119 abundance data as our set of subnetwork biomarkers, all gene biomarkers were drawn from a 120 pool of the 1,500 genes which were included in at least one of the 500 subnetworks from 121 which the subnetwork biomarkers were created. Since gene biomarkers were drawn from a 122 set of 1,500 features while subnetwork biomarkers were drawn from a set of 500 features, the 123 population of all possible gene biomarkers is substantially larger than that of all possible subnetwork biomarkers. This necessitated a larger test set of gene biomarkers to fully capture 125 the null distribution of gene biomarker performance. The prognostic capability of each gene 126 and subnetwork biomarker was estimated using SIMMS on twenty different partitions of the 127 breast cancer meta-dataset, each of which was composed of varying training and testing 128 cohort sizes (Tables 2-3). For full details on how the meta-dataset was partitioned, see 129 Methods.

130
Considering only a single meta-dataset partition of equally-sized training and testing cohorts 131 (the 'default' partition in Table 2), we found that the performance of gene and subnetwork 132 biomarkers is highly sensitive to biomarker size (Figure 1). Adding more features to gene 133 biomarkers results in substantial improvements in performance up to ~100 genes, at which 134 point improvement in performance starts levelling off before peaking at ~175 genes. After this, 135 larger gene biomarkers show a sharp and consistent decline in prognostic capability, 136 particularly after biomarker size exceeds ~200 features. On the other hand, the performance 137 of subnetwork biomarkers improves much more gradually with increasing biomarker size, 138 plateauing at around 100 subnetworks. 140 To better contrast the performance of gene and subnetwork biomarkers, we first normalized 141 the sizes of our subnetwork biomarkers to make them more directly comparable to our gene 142 biomarkers. Since a subnetwork biomarker is essentially a set of gene lists, the most 143 straightforward way to perform this normalization is to simply sum the sizes of all these gene 144 lists to get the total number of genes in the biomarker. However, there is gene content overlap in our set of 500 subnetworks so a single subnetwork biomarker may include multiple 146 subnetworks that contain the same gene. Therefore, we additionally considered the number of 147 unique genes in each biomarker. At each gene count, we compared the 95th to 99th 148 percentiles of subnetwork biomarker performance against the 95th to 99th percentiles of gene 149 biomarker performance (Figure 2). 150 The difference in performance between the two gene count normalization techniques is 151 negligible as both gene and subnetwork biomarkers exhibit similar peak performance.

152
Although gene biomarkers require fewer genes to reach optimal prognostic efficiency, they 153 also show a higher propensity for over-fitting at higher gene counts.

154
Randomly sampling training:testing meta-dataset partitions reveals effect of training 155 cohort size on biomarker performance and stability 156 To test the robustness of our finding that biomarker size affects performance, we re-evaluated 157 our biomarker sets on a partition of the meta-dataset with a substantially smaller training 158 cohort at only 36.5% of the total patient number. As seen in Additional file 1: 159 Supplementary Figure 1, the general characteristics of biomarker performance with regard 160 to biomarker size remain the same even though the performance of both gene and 161 subnetwork biomarkers decreases slightly overall in this setting. However, subnetwork 162 biomarkers do reach higher peak performance than gene biomarkers, suggesting that they 163 are less sensitive to a suboptimal training environment. To verify these observations, we went 164 on to create eighteen more meta-dataset partitions using random sampling for a total of ten 165 partitions with larger training cohorts and ten partitions with smaller training cohorts.
As first evidenced by our initial finding, both gene and subnetwork biomarkers perform 167 routinely better when trained on the larger patient cohorts (Figure 3). In only one dataset 168 partition containing a smaller training cohort do biomarkers consistently outperform those 169 trained on the larger cohorts. For most of the additional eighteen dataset partitions, the 170 performance patterns resemble our initial two partitions. This is particularly true for gene 171 biomarker performance which peaks rapidly at smaller biomarker sizes then drops with 172 increasing biomarker size. Subnetwork biomarker performance peaks slowly, as seen 173 previously, and then either plateaus or undergoes a more gradual decline.

174
To quantify biomarker performance stability across partitions we used Lin's concordance   (Figure 4). This reveals that optimal subnetwork biomarker 199 performance tends to be greater than optimal gene biomarker performance, an effect that is  Although the stability of gene biomarker performance is significantly higher than the stability of 208 subnetwork biomarker performance when considering biomarkers of all feature sizes, at comparable gene counts this relationship is inverted ( Figure 5). The instability of subnetwork 210 biomarkers that contain many large subnetworks obscures the fact that subnetwork 211 biomarkers consisting of a small or moderate number of modules perform relatively 212 consistently across different training:testing cohort partitions. As discussed above, subnetwork 213 biomarkers tend to outperform even optimal gene biomarkers at fairly small sizes.

214
Subnetwork biomarkers without an exorbitant number of features are, therefore, not only 215 more stable than gene biomarkers but also have greater prognostic capabilities.  The breadth of our analysis also allows us to make several other important observations 229 about prognostic biomarker performance and robustness. Small training cohort sizes have a greater effect on overall biomarker performance than on the consistency of said performance.

231
Furthermore, the performance of large biomarkers is substantially more unstable than that of 232 small biomarkers. Therefore, to optimize biomarker development, it is critical to maximize 233 training cohort size and to minimize biomarker size.

234
Nevertheless, we recognize there are several caveats to be made regarding our findings that 235 subnetwork biomarkers are superior to gene biomarkers. Because generally gene biomarkers 236 will contain fewer genes than subnetwork biomarkers, and biomarkers with fewer genes are 237 more robust, it is easier to find gene biomarkers with consistently high performance especially 238 since gene biomarker performance peaks at a smaller number of features. Moreover, our 239 study considers only the null distribution of biomarker performance. It is thus possible, albeit 240 highly unlikely, that among the "rare" biomarkers, whose performance falls well above that of 241 those that are randomly sampled, gene biomarkers are preferable to subnetwork biomarkers.

242
Finally, there is a potential limitation in considering only 500 subnetworks and, in particular, 243 only 1500 genes as these may not necessarily be representative of the actual human genome 244 and the approximately 20,000 genes it contains. As information on PPI and gene pathways 245 continues to accumulate, this matter can be thoroughly investigated in the future with access 246 to a more complete and biologically accurate set of data.

247
Finally, deeply sampling the total biomarker space allows for a comprehensive evaluation of 248 biomarker characteristics that is not attainable through analyses on a smaller number of 249 biomarkers acquired using unsubstantiated feature selection strategies. Since experiment 250 replicability has proven to be a great challenge in the development of clinically useful 251 biomarkers, it is vital to study the performance of biomarkers in a variety of settings, especially using different training and testing cohorts. Jackknifing and meta-dataset 253 permutation are, therefore, two key tools for prognostic biomarker research and can be 254 utilized to great effect in studies hindered by limited sample sizes and data access.
where mRNA g,P is the mRNA abundance in patient P of gene g contained within subnetwork The median risk score is calculated for the training cohort and is then used as the break-point 282 at which to dichotomize the testing cohort. A univariate Cox model is fit to this testing cohort 283 and the risk classification is assessed by the resulting hazard ratio.

284
The SIMMS model was extended to gene biomarkers by treating each gene as a subnetwork 285 of size one. We refer to this as the geneSIMMS model. The risk score for patient P and gene 286 g is thus given by 287 risk g , P = log 2 ( HR g )× mRNA g ,P with HR g and mRNA g,P defined as in (1 This is followed by patient dichotomization of the testing cohort and then biomarker 293 performance evaluation, same as described above for SIMMS.

318
As seen in