## Abstract

It is assumed that cancers develop upon acquiring a particular number of (epi)mutations in driver genes, but the law governing the kinetics of this process is not known. I have recently shown that the age distribution of incidence for the 20 most prevalent cancers of old age is best approximated by the Erlang probability distribution. The Erlang distribution describes the probability of several successive random events occurring by a given time according to the Poisson process, which allowed me to predict the number of critical driver events for these cancer types. Here I show that the Erlang distribution is the only classical probability distribution that can, in addition, adequately model the age distribution of incidence for all studied childhood and young adulthood cancers. This validates the Poisson process as the universal law describing cancer development at any age and the Erlang distribution as a useful tool to predict the number of driver events for any cancer type. The Poisson process signifies the fundamentally random timing of driver events and their constant average rate. As waiting times for the occurrence of the required number of driver events are counted in decades, this suggests that driver mutations accumulate silently in the longest-living dividing cells in the body: the stem cells.

## Introduction

Since the discovery of the connection between cancer and mutations in DNA, in the middle of the twentieth century, there have been multiple attempts to deduce the law of driver mutation accumulation from the age distribution of cancer incidence or mortality (1). The proposed models, however, suffer from several serious drawbacks. For example, early models assume that cancer mortality increases with age according to the power law (2-4), which at some advanced age would necessarily lead to the mortality surpassing 100 000 people per 100 000 population. Moreover, already at that time it was known that many cancers display deceleration of mortality growth at advanced age, which is to be expected if the probability of death at a given age is to remain under 100%. Finally, once large-scale incidence data had accumulated, it became clear that cancer incidence not only ceases to increase with age but, for at least some cancers, starts to decrease (5, 6). More recent models of cancer progression are based on multiple biological assumptions, consist of complicated equations that incorporate many predetermined empirical parameters, and still have not been shown to describe the decrease in cancer incidence at an advanced age (7-12). It is also clear that an infinite number of such mechanistic models can be created and custom tailored to fit any set of data, which calls their explanatory and predictive value into question.

Recently I have proposed that the age distribution of cancer incidence is, in fact, a statistical distribution of probabilities that a required number of driver events occurs precisely by the given age, i.e. a probability density function (PDF) (13). I then tested the PDFs of 16 well-known continuous probability distributions for fits with the CDC WONDER data on the age distribution of incidence for the 20 most prevalent cancers of old age. The best fits were observed for the gamma distribution and its special case, the Erlang distribution, with an average R^{2} of 0.995 (13). Notably, these two distributions describe the probability of several independent random events occurring precisely by the given time, according to the Poisson process. This allowed me to estimate the number of driver events, the average time interval between them and the maximal populational susceptibility for each cancer type. The results showed high heterogeneity in all three parameters amongst the cancer types but high reproducibility between the years of observation (13).

However, four other probability distributions (the extreme value, normal, logistic and Weibull) also showed good fits to the data, although inferior to the gamma and Erlang distributions. This leaves some uncertainty regarding the exceptionality of the gamma/Erlang distribution for the description of cancer incidence. Here I test these shortlisted distributions on the CDC WONDER data on childhood and young adulthood cancers. I show that the gamma and Erlang distributions are the only distributions that converge for all tested cancers and provide close fits. This result validates the Poisson process as the fundamental law describing the age distribution of cancer incidence for any cancer type, which also allows important parameters of cancer development, including the number of driver events, to be predicted.

## Results

To test the universality of the gamma/Erlang distribution, the publicly available USA incidence data on childhood and young adulthood cancers were downloaded from the CDC WONDER database (see Materials and Methods). The PDFs for the general forms of the following continuous probability distributions were tested for fit with least squares non-weighted nonlinear regression analysis: extreme value, gamma, logistic, normal and Weibull (see Materials and Methods). Only the gamma distribution converged for all tested cancer types and provided good fits (Fig. 1, Table 1).

Importantly, the gamma distribution and the Erlang distribution derived from it are the only classical continuous probability distributions that describe the cumulative waiting time for *k* successive random events, with the Erlang distribution differing only in counting events as integer numbers. Because these properties are well suited to describing the waiting time for real discrete random events such as driver mutations, the Erlang distribution provides the opportunity to gain unique insights into the carcinogenesis process. I have previously proposed that the shape parameter *k* of the Erlang distribution indicates the average number of driver events that need to occur in order for a cancer to develop to a stage that can be detected during clinical screening; the scale parameter *b* indicates the average time interval (in years) between such events; and the amplitude parameter *A* divided by 1000 estimates the maximal susceptibility (in percent) of a given population to a given type of cancer (13).

To obtain these parameter values, the Erlang distribution was fitted individually to the incidence of each of 10 childhood/young adulthood cancer types (Fig. 2, Table 2). The goodness of fit varied from 0.6263 (due to an outlier), for intracranial and intraspinal germ cell tumours, to 0.9999, for extracranial and extragonadal germ cell tumours of childhood, with the average of 0.9476. The predicted number of driver events varied from 1, for extracranial and extragonadal germ cell tumours of childhood, neuroblastoma and ganglioneuroblastoma, retinoblastoma, and intracranial and intraspinal embryonal tumours, to 9, for malignant gonadal germ cell tumours. The predicted average time between the events varied from 1 year, for extracranial and extragonadal germ cell tumours of childhood and hepatoblastoma, to 15 years, for intracranial and intraspinal embryonal tumours. The predicted maximal populational susceptibility varied from 0.02%, for extracranial and extragonadal germ cell tumours of childhood, to 2%, for malignant gonadal germ cell tumours. Overall, the data confirm the high heterogeneity in carcinogenesis patterns revealed in the previous study (13).
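The reading of the three fitted parameters can be illustrated with a minimal Python sketch. The function names are hypothetical. The values *k* = 9 and *A* = 2000 roughly mirror the 9 driver events and 2% susceptibility quoted above for malignant gonadal germ cell tumours, whereas *b* = 3 years is a made-up value used only for illustration and is not taken from Table 2.

```python
import math

def erlang_incidence(age, A, k, b):
    """Erlang PDF scaled by amplitude A: modelled incidence per 100 000 at a given age."""
    return A * age ** (k - 1) * math.exp(-age / b) / (b ** k * math.factorial(k - 1))

def interpret(A, k, b):
    """Translate fitted parameters into the quantities discussed in the text."""
    return {
        "driver_events": k,                    # shape parameter
        "years_between_events": b,             # scale parameter
        "mean_waiting_time_years": k * b,      # mean of the Erlang distribution
        # The area under the curve equals A cases per 100 000,
        # i.e. A / 1000 percent of the population.
        "max_susceptibility_percent": A / 1000,
    }

# Illustrative values only; b is hypothetical.
summary = interpret(A=2000.0, k=9, b=3.0)
print(summary)
```

Because the unscaled Erlang PDF integrates to 1, the scaled curve integrates to *A*, which is why *A* per 100 000 can be read as a cumulative, lifetime figure.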

## Discussion

I have previously shown that 5 probability distributions (the extreme value, gamma/Erlang, normal, logistic and Weibull) approximate the age distribution of incidence for the 20 most prevalent cancers of old age (13). The shape of those incidence distributions resembles the bell shape of the normal distribution, with some asymmetry, or at least the left part of it. However, many cancers of childhood have a radically different shape of the incidence distribution, the shape of the exponential distribution (Fig. 2). Of the 5 shortlisted distributions, only the gamma/Erlang and Weibull distributions can assume that shape, i.e. reduce to the exponential distribution when the parameter *k* equals 1. Of these 2 distributions, gamma/Erlang provides a superior fit compared to Weibull. In fact, for cancers of old age, the average R^{2} for the Weibull distribution is 0.9938, whereas for the gamma/Erlang distribution it is 0.9954 (13). For cancers of childhood and young adulthood, the average R^{2} for the Weibull distribution is 0.6576, whereas for the gamma distribution it is 0.9490 (Table 1). Most importantly, the Weibull distribution failed to converge for extracranial and extragonadal germ cell tumours of childhood and for retinoblastoma, whereas the gamma/Erlang distribution provided a perfect fit (R^{2}=1.000). Thus, it appears that the gamma/Erlang distribution is the only classical probability distribution that fits universally well to cancers of childhood, young adulthood and old age.
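The reduction to the exponential shape at *k* = 1 can be checked numerically. The following is a minimal Python sketch that uses the unit-area forms of the gamma and Weibull PDFs given in Materials and Methods (with *A* = 1). The scale value and the ages are hypothetical and chosen only for illustration.

```python
import math

def gamma_pdf(x, k, b):
    return x ** (k - 1) * math.exp(-x / b) / (b ** k * math.gamma(k))

def weibull_pdf(x, k, b):
    return (k / b ** k) * x ** (k - 1) * math.exp(-((x / b) ** k))

def exponential_pdf(x, b):
    return math.exp(-x / b) / b

b = 5.0                          # hypothetical scale, in years
ages = [2.5, 7.5, 12.5, 17.5]    # midpoints of 5-year age groups

# Largest discrepancy between either distribution at k = 1 and the exponential PDF.
max_diff = max(
    max(abs(gamma_pdf(x, 1, b) - exponential_pdf(x, b)),
        abs(weibull_pdf(x, 1, b) - exponential_pdf(x, b)))
    for x in ages
)

# With k = 2 the gamma curve first rises, so incidence no longer peaks at birth.
rises_first = gamma_pdf(2.5, 2, b) < gamma_pdf(5.0, 2, b)
print(max_diff, rises_first)
```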

I have proposed that the parameter *k* of the Erlang distribution indicates the average number of driver events that need to occur in order for a cancer to develop to a stage that can be detected during clinical screening (13). As mentioned above, the Erlang distribution reduces to the exponential distribution when *k* equals 1, because the exponential distribution describes the waiting time for a single random event. It would thus mean that cancers with the exponential shape of the age distribution of incidence require only a single driver event with a random time of occurrence, most likely a somatic driver mutation (14) or epimutation (15). This explains their maximal prevalence in early childhood.

In his seminal paper (16), Alfred Knudson proposed that hereditary retinoblastoma, a childhood cancer with the exponential age distribution of incidence, is caused by a single somatic mutation in addition to one heritable mutation. He also proposed that in the nonhereditary form of the disease, both mutations should occur in somatic cells. As the hereditary form is estimated to represent about 45% of all cases (16, 17), the number of driver mutations predicted from combined incidence data should be around 1.55 (0.45 × 1 + 0.55 × 2). Interestingly, whilst the gamma distribution fits the incidence data excellently, with R^{2}=1.0, it predicts 1.325 driver events. This yields an estimate of the hereditary form prevalence of 67.5%. This higher value may point to a general underestimation of the hereditary component in unilateral retinoblastoma, perhaps due to limitations of routine genetic screening and the influence of genetic mosaicism (18). In contrast to retinoblastoma, the hereditary form of neuroblastoma is estimated to comprise only 1-2% of all cases (19), hence the exponential age distribution of incidence would mean that only one somatic mutation is required. Indeed, the gamma distribution predicts 0.9816±0.0295 driver events (R^{2}=0.9998).
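The retinoblastoma arithmetic above can be reproduced with a short Python sketch. It assumes, as the text does implicitly, that the number of driver events predicted from combined incidence data is a weighted average of one somatic event (hereditary form) and two somatic events (nonhereditary form). The function names are hypothetical.

```python
def expected_events(hereditary_fraction, events_hereditary=1, events_sporadic=2):
    """Weighted-average number of somatic driver events in a mixed population."""
    h = hereditary_fraction
    return h * events_hereditary + (1 - h) * events_sporadic

def hereditary_fraction(mean_events, events_hereditary=1, events_sporadic=2):
    """Invert the weighted average to recover the hereditary fraction."""
    return (events_sporadic - mean_events) / (events_sporadic - events_hereditary)

predicted = expected_events(0.45)        # about 1.55 events for 45% hereditary cases
implied = hereditary_fraction(1.325)     # about 0.675, i.e. 67.5% hereditary cases
print(round(predicted, 3), round(implied, 3))
```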

The prediction of a single driver event in cancers with the exponential age distribution of incidence does not mean that only a single driver gene can be discovered in such cancer types. In fact, many driver genes are redundant or even mutually exclusive, e.g. when the corresponding proteins are components of the same signalling pathway (20). Thus, each tumour in such cancer types is expected to have a mutation in one driver gene out of a set of several possible ones, in which all genes most likely encode members of the same pathway or are responsible for the same cellular function. For example, in each neuroblastoma tumour sample, a mutation was present in only one out of 5 putative driver genes: *ALK*, *ATRX*, *PTPN11*, *MYCN* or *NRAS* (21).

Another aspect to consider is that while one mutation is usually sufficient to activate an oncogene, two mutations are typically required to inactivate both alleles of a tumour suppressor gene. Therefore, cancers with the exponential age distribution of incidence are predicted to have either a single somatic mutation in an oncogene, or a single somatic mutation in a tumour suppressor gene plus an inherited mutation in the same gene. The former is the case for neuroblastoma, where an amplification or an activating point mutation in *ALK* is often present (22-24). The latter is the case for retinoblastoma, where an inactivating mutation in one allele of *RB1* is usually inherited, whereas an inactivating mutation in the other *RB1* allele occurs in a somatic cell (25).

Finally, the number of driver events predicted by the Erlang distribution refers exclusively to rate-limiting events responsible for cancer progression. For example, it was shown that inactivation of both alleles of *RB1* leads first to retinoma, a benign tumour with genomic instability that easily transforms to retinoblastoma upon acquiring additional mutations (26). In this case, two mutations in *RB1* are rate-limiting, whereas mutations in other genes are not, because genomic instability allows them to occur very quickly. In neuroblastoma, frequent *MYCN* amplification and chromosome 17q gain are found only in advanced stages of the disease (27, 28), so they are unlikely to be the initiating rate-limiting events.

Overall, application of the gamma/Erlang distribution to childhood and young adulthood cancers showed its exceptionality amongst other probability distributions. The fact that it can successfully describe the radically different age distributions of incidence for cancers of any age and any type allows the underlying Poisson process to be called the universal law of cancer development. The Poisson process signifies the fundamentally random timing of driver events and their constant average rate (13). Multiplying the number of driver events by the average time interval between them, as estimated with the Erlang distribution, shows that an average person needs from 73 to 324 years to accumulate the required number of driver alterations, depending on the cancer type (13). This finding is consistent with the silent accumulation of driver mutations in stem cells before the terminal clonal expansion (29-31), because this is the only type of dividing cells surviving for so long in the body, and mutations require cellular division to be fixed in the DNA. For childhood and young adulthood cancers, these estimates range from 1 to 35 years (see Table 2), but the mechanism is likely the same. Finally, as the Erlang distribution allows the number and rate of driver events to be predicted for any cancer subtype for which data on the age distribution of incidence are available, it may help to optimize the algorithms for distinguishing between driver and passenger mutations (32), leading to the development of more effective targeted therapies.
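The link between a Poisson process and the Erlang waiting time can be illustrated by simulation. The sketch below is a minimal Python example, assuming events that arrive independently at a constant average rate of 1/*b* per year, so that the age at the *k*-th event is the sum of *k* exponential intervals. The values of *k* and *b* are hypothetical and are not estimates from Table 2.

```python
import random
import statistics

def waiting_time(k, b, rng):
    """Age at which the k-th event occurs when events arrive at constant rate 1/b."""
    return sum(rng.expovariate(1.0 / b) for _ in range(k))

rng = random.Random(0)    # fixed seed so the run is reproducible
k, b = 3, 10.0            # hypothetical: 3 driver events, 10 years apart on average

samples = [waiting_time(k, b, rng) for _ in range(20000)]
mean_wait = statistics.fmean(samples)      # Erlang mean is k * b
var_wait = statistics.pvariance(samples)   # Erlang variance is k * b**2
print(mean_wait, var_wait)
```

The simulated mean should lie close to *k* × *b*, which is the product used above to estimate the time needed to accumulate the required number of driver events.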

## Materials and Methods

### Data acquisition

United States Cancer Statistics Public Information Data: Incidence 1999 - 2012 were downloaded via the Centers for Disease Control and Prevention Wide-ranging OnLine Data for Epidemiologic Research (CDC WONDER) online database (http://wonder.cdc.gov/cancer-v2012.HTML). The United States Cancer Statistics (USCS) are the official federal statistics on cancer incidence from registries having high-quality data for 50 states and the District of Columbia. Data are provided by the Centers for Disease Control and Prevention National Program of Cancer Registries (NPCR) and the National Cancer Institute Surveillance, Epidemiology and End Results (SEER) program. Results were grouped by 5-year Age Groups, Crude Rates were selected as output and All Ages were selected in the Age Group box. All other parameters were left at default settings. Crude Rates are expressed as the number of cases reported each calendar year per 100 000 population. A single person with more than one primary cancer verified by a medical doctor is counted as a case report for each type of primary cancer reported. The population estimates for the denominators of incidence rates are a slight modification of the annual time series of July 1 county population estimates (by age, sex, race, and Hispanic origin) aggregated to the state or metropolitan area level and produced by the Population Estimates Program of the U.S. Bureau of the Census (Census Bureau) with support from the National Cancer Institute (NCI) through an interagency agreement. These estimates are considered to reflect the average population of a defined geographic area for a calendar year. The data were downloaded separately for each specific cancer type, upon its selection in the Childhood Cancers tab.

### Data selection and analysis

For analysis, the data were imported into GraphPad Prism 5. Only cancers that show childhood/young adulthood incidence peaks and do not show middle/old age incidence peaks were analysed further. The midpoint of each age group was used as the x value, e.g. 17.5 for the “15-19 years” age group. Data were analysed with Nonlinear regression. The following User-defined equations were created for the statistical distributions:

Gamma: Y=A*(x^(k-1))*(exp(-x/b))/((b^k)*gamma(k))

Extreme value: Y=A*(exp(-((x-t)/b)))*(exp(-exp(-((x-t)/b))))/b

Logistic: Y=A*(exp((x-t)/b))/(b*((1+exp((x-t)/b))^2))

Normal: Y=A*(exp(-0.5*(((x-t)/b)^2)))/(b*((2*pi)^0.5))

Weibull: Y=A*(k/(b^k))*(x^(k-1))*exp(-((x/b)^k))

The parameter *A* was constrained to “Must be between zero and 100 000.0”, parameter *t* to “Must be between zero and 150.0”, and parameters *b* and *k* to “Must be greater than 0.0”. “Initial values, to be fit” for all parameters were set to 1.0. All other settings were left at their defaults, e.g. Least squares fit and No weighting.

For the Erlang distribution, the parameter *k* for each cancer type was first estimated by fitting the Gamma distribution, then rounded to the nearest integer and used as “Constant equal to” in a second round of Gamma distribution fitting, which provided the final results.
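The two-round procedure can be sketched in Python for readers without access to Prism. This is a minimal illustration under several assumptions, not a reproduction of the Prism analysis: an exhaustive grid search stands in for Prism's nonlinear regression, R^{2} is computed as one minus the ratio of residual to total sum of squares, the incidence values are synthetic rather than CDC WONDER data, and all function names are hypothetical.

```python
import math

def unit_gamma_pdf(x, k, b):
    """Gamma PDF with unit area, evaluated in log space for numerical stability."""
    return math.exp((k - 1) * math.log(x) - x / b - k * math.log(b) - math.lgamma(k))

def profile_fit(xs, ys, k, b):
    """For fixed k and b the model is linear in A, so the best A has a closed form."""
    f = [unit_gamma_pdf(x, k, b) for x in xs]
    denom = sum(fi * fi for fi in f)
    A = sum(y * fi for y, fi in zip(ys, f)) / denom if denom > 0 else 0.0
    A = min(max(A, 0.0), 100000.0)   # mirrors the constraint placed on A
    sse = sum((y - A * fi) ** 2 for y, fi in zip(ys, f))
    return sse, A

def grid_fit(xs, ys, k_values, b_values):
    """Unweighted least squares over a grid; returns (sse, A, k, b) of the best point."""
    best = None
    for k in k_values:
        for b in b_values:
            sse, A = profile_fit(xs, ys, k, b)
            if best is None or sse < best[0]:
                best = (sse, A, k, b)
    return best

def r_squared(ys, sse):
    mean_y = sum(ys) / len(ys)
    return 1.0 - sse / sum((y - mean_y) ** 2 for y in ys)

# Synthetic incidence per 100 000 at age-group midpoints (not real data).
xs = [2.5 + 5 * i for i in range(17)]                     # 2.5, 7.5, ..., 82.5
ys = [30.0 * unit_gamma_pdf(x, 2.83, 4.27) for x in xs]

k_grid = [0.5 + 0.1 * i for i in range(116)]              # 0.5 to 12.0
b_grid = [0.5 + 0.1 * i for i in range(196)]              # 0.5 to 20.0

# Round 1: gamma fit with k free.
_, A1, k1, b1 = grid_fit(xs, ys, k_grid, b_grid)

# Round 2: Erlang fit with k held constant at the nearest integer.
k_int = max(1, round(k1))
sse2, A2, _, b2 = grid_fit(xs, ys, [k_int], b_grid)
r2 = r_squared(ys, sse2)
print(k1, k_int, b2, A2, r2)
```

Profiling out *A* reduces the search to two parameters in the first round and one in the second, which keeps the exhaustive search fast. A gradient-based least-squares routine would be the more usual choice in practice.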

## Footnotes

The author declares no conflict of interest.