## Abstract

Phenotypic variation in the copy number of gene products expressed by cells or tissues has been the focus of intense investigation. To what extent the observed differences in cellular expression levels are persistent or transient is an intriguing question. Here, we develop a quantitative framework that resolves the expression variation into stable and unstable components. The difference between the expression means in two cohorts isolated from any cell population is shown to converge to an asymptotic value, with a characteristic time, *τ*_{T}, that measures the timescale of the unstable dynamics. The asymptotic difference in the means, relative to the initial value, measures the stable proportion of the original population variance . Empowered by this insight, we analysed the T-cell receptor (TCR) expression variation in CD4 T cells. About 70% of TCR expression variance is stable in a diverse polyclonal population, while over 80% of the variance in an isogenic TCR transgenic population is volatile. In both populations the TCR levels fluctuate with a characteristic time of 32 hours. This systematic characterisation of the expression variation dynamics, relying on time series of cohorts’ means, can be combined with technologies that measure gene or protein expression in single cells or in bulk.

## 1 Introduction

The phenotypic variation among organisms or cells is a theme of growing importance in biology. Macroscopic phenotypes, such as body structures or physiologic responses, have been studied for ages, but one phenotype particularly suitable for quantification that has received attention in the last decades is the amount of specific mRNAs and proteins expressed by single cells. Advances in genomics have allowed the analysis of genetic contributions to variation in gene expression, in terms of so-called expression quantitative trait loci (eQTL) [1, 2]. In this case, expression levels, typically assessed via mRNA levels, are treated as quantitative traits, and one is interested in the specific loci underlying variation in expression levels among different individuals. The increasing availability of single-cell resolution genomics, proteomics and metabolomics technologies has enabled molecular biologists to analyse cell lineages and tissues showing that what were previously perceived as homogeneous cell populations are in fact a complex mixture of often transient and interchangeable cellular types and cellular states (see discussion in [3]). In parallel to these studies linking phenotypes to genotype, the literature on stochastic gene expression [4, 5, 6, 7, 8], reviewed in [9], has brought to light the variation in expression levels in isogenic cells, even when these are in the same cellular state and in the same environment. The variation is typically attributed to the “noise” resulting from the small copy number of molecules involved in the process.

Several studies addressed the fluctuation dynamics of gene expression levels [10, 11] revealing a complex picture of the variation in isogenic cell populations. The fluctuation timescales range from hours [7, 12], to days [13, 14, 15] or weeks [16, 17, 18], depending on the cells and on the degree of multimodality of the expression distribution under study. The distinct timescales can be associated with the different mechanisms that may cause the variation in the expression levels of a molecular component of interest in some cell population. However, most quantitative approaches developed up to this date have focused on noise in gene expression as the predominant mechanism explaining the variation observed (for example, [6, 7, 19, 20, 21, 22, 23, 24]). It remains unclear to which degree less volatile dynamic processes or even persistent differences contribute to the observed variation in a isogenic cell population. This is particularly relevant in the case of cells from multicellular organisms, due to the robust epigenetic processes that underlie differentiation stages, cell lineages or cell states, but also the intraclonal structure of apparently homogenous populations [25, 26].

A case study of particular interest is the expression of Sca1 in a hematopoietic cell line [16, 17] since it reveals the inherent complexity of variation dynamics and also the difficulties in characterising it experimentally. Chang et al. [16] reported that that biased cohorts of cells tend to restore the histogram of expression levels of Sca1 of the starting population, albeit with very slow dynamics. In principle, complete restoration would be consistent with a lack of stable variants in the population, at least in terms of expression levels of Sca1. However, [17] have shown that, even after 2 weeks, the reconstitution is not fully complete. More importantly, these authors [17] showed that some cells in this population express markers indicative of terminal differentiation, and have limited capacity for cell division. This points to an inherent heterogeneity in the population that may persist in time. An important limitation of these approaches was relying mainly on the juxtaposition of histograms of expression levels in order to compare cell populations, without a rigorous quantification. It is not clear how to analyse such data and because of this the degree to which the original distribution is restored remains uncertain. A quantitative approach that overcomes this impasse is necessary and also important to provide formal concepts on which to ground subsequent studies on the expression levels in cell populations.

Our work lumps the molecular mechanisms regulating expression levels in a cell population into two components, one stable and another unstable. The stable component leads to permanent differences between the expression levels of any two cohorts of cells. The unstable component, on the other hand, represents transient differences in the expression levels of the cohorts, which will eventually vanish in time. Starting from these definitions, a general model is derived to describe protein expression levels in a population with both the stable and unstable components. The relative contribution of the stable component to the expression variation is then defined as a single parameter termed . We show theoretically that this parameter can be estimated in an unbiased way by following over time the mean expression in cohorts isolated from the population of interest. This dynamical characterisation of the expression variation is completed by concomitantly estimating the characteristic timescale *τ*_{T}.

This theoretical result is used to characterise the contributions to variation in expression levels of the T-cell receptor (TCR) in two biologically relevant CD4 T cell populations. The first population, purified from wild type mice, is composed of clones emerging from the process of V(D)J recombination, each carrying genetically distinct TCR loci. The second is a genetically uniform population isolated from Marilyn TCR-transgenic mice, in which all T cells express the same recombined receptor genes [27]. We find that the stable component is the main contribution in the polyclonal population , while the unstable component predominates in the Marilyn population . This suggests that genetic heterogeneity contributes to stable differences in TCR expression levels in T cells, but that there are other mechanisms contributing to persistent expression variation in isogenic populations.

## 2. A general model for protein expression levels in a cell population

### 2.1 Partitioning the contributions to variation in expression levels

We assume that any biological cell population, hereafter referred to as full population, is a mixture of sub-populations. Each cell belongs to and remains in one of these sub-populations all the time. Using a mixture model formulation, each sub-population is indexed by *i* = 1, 2, …, *N*, and described by three parameters (*µ*_{i}, *v*_{i}, *w*_{i}): the mean *µ*_{i} and variance *v*_{i} of expression levels, and the relative frequency *w*_{i} of cells in the full population that belong to this sub-population. The latter is given by:
where *n*_{i} is the number of cells in the *i*-th sub-population and is the total number of cells in the full population. A related approach has been used by Gianola *et al*. [28] to study genetic parameters in the context of the quantitative genetics of mixture characters.

In the limit of large *N*, the parameters (*µ*_{i}, *v*_{i}, *w*_{i}) describing a sub-population are taken as random variables (** µ**,

**,**

*v***) (see methods for details of the notation used) following a particular multivariate distribution. Then, one can relate the mean**

*w**µ*

_{F}and variance

*v*

_{F}of expression levels of the full population to the properties of the sub-populations, as detailed in section A of the supplementary information. Provided that there is no correlation between the frequencies (

**) and either the means (**

*w***), the squared means (**

*µ*

*µ*^{2}) and the variances (

**) of the sub-population, it follows that (supplementary information, section A): where the subscript**

*v**F*is used to highlight that these are properties of the full population. Therefore, under these conditions, the mean of the full population is simply the expected value of the means of the sub-populations (𝔼 [

**]), while the variance of the full population is the sum of the variance within each subpopulation (𝔼 [**

*µ***]) and the variance among the sub-populations (variance in the means, 𝕍 [**

*v***]).**

*µ*It is important to highlight that equations 2 and 3 are general and independent of the precise definition of a sub-population. However, the two terms in equation suggest a specific definition, in which only the unstable component is present in each sub-population. In this way, the term of variation within any sub-population 𝔼 [** v**] becomes the contribution of the unstable component to the variance of the full population, while the variation among the means of the sub-populations 𝕍 [

**] is the contribution of the stable component. In the next section 2.2, expression levels within each sub-population will be described by a stochastic model, while the different sub-populations will have different means controlled by one of the parameters of this stochastic model.**

*µ*### 2.2 An explicit model of protein expression in a cell population

#### 2.2.1 Describing variation within a sub-population

The stochastic model of protein expression considered here is based on previous work [29], and is defined by the following two equations:
where *x*_{t} is the amount of protein expressed at time *t*, and *y*_{t} is a stochastic variable following the Ornstein-Uhlenbeck process. In equation 5, *W*_{t} is the Wiener process [30]. The parameters for the model are presented in Table 1, along with their respective dimensions.

The equation governing *dx*_{t} has two terms. The first term, , is the rate of production which depends on the stochastic process *y*_{t}, and the second term, *x*_{t}/*β*, is the degradation rate following first-order kinetics with mean *β* protein lifetime. A model with a similar overall structure was reported before [31], in which mRNA transcription and degradation have also been explicitly incorporated. Equation 4 can be re-written as:
where *z*_{t}, defined as:
denotes the instantaneous rate of protein production. This rate is normalised, to have unit expected value. All processes governing protein production (promoter transitions, transcription and translation, among others) are lumped together into the average rate *α* and the instantaneous rate given by *z*_{t}. The representation in equation 6, which highlights the contribution of lumped upstream factors, has been applied before in the analysis of models of stochastic gene expression (for example, [6, 7]). Equation 6 denotes that, in a single cell, the instantaneous rate of protein production is proportional to the instantaneous levels of these lumped upstream factors, and fluctuates as a function of time, with auto-correlation time approximately equal to *τ* [29]. These fluctuations are then propagated downstream, resulting in fluctuations in protein levels, with dynamics dictated by *τ* (through *z*_{t}) and *β*. For simplicity, protein degradation is assumed to be deterministic, with the same rate 1/*β* for all cells. The temporal evolution of the protein expression levels *x*_{t} in two cells with distinct characteristic times *τ* is illustrated in figure 1 (top).

It follows from equation 7 that:
and therefore the stationary rate of protein production follows a lognormal distribution in cells of a sub-population, consistent with a report of lognormal rates of protein expression [32]. Equations 4 and 5 are a simple model that generates, for a wide range of parameter values, a lognormal-like distribution of protein levels (figure 1,bottom), compatible with the widespread observation of the lognormal distribution in cell populations. In this scenario, in terms of the log-transformed protein levels (section B of the supplementary information), the mean and variance of a stationary sub-population are given by equations 9 and 10, respectively:
where the subscript *W* will be used hereafter to denote that the variation is due to the stochastic process influencing the instantaneous rate of protein production. In equation 10, *g*(*·, ·*) is an arbitrary function, which can be estimated via simulation.

#### 2.2.2 Combining variation within and among sub-populations

As formulated in section 2.1, the stable component arises due to variation in the means of the sub-populations. Therefore, we assume that parameter *α* in equation 4 is distributed in the full population, becoming a random variable, denoted by ** α**. Consequently, each sub-population is described by one value of

*α*, resulting in different average rates of production, and hence different mean expression levels.

For simplicity, we consider the case that *α**∼ℒ 𝒩 (µ*_{α}, *σ*_{α}). For the *i*-th sub-population, with parameter *α* _{i}, the mean and variance follow from equations 9 and 10:
where is assumed to be the same for all sub-populations. In terms of log-transformed values, plugging equations 9 and 10 into equation 3, one obtains the variance of the full population:

An important property of equation 13, which is based on log-transformed values, is that the parameters that represent the variances due to the stable and unstable components ( and , respectively) remain separate. This is a key feature, greatly simplifying the process of analysis and inference through-out this work. As detailed in the supplementary information (section C), the equivalent of equation 13 considering protein levels without any transformation has an additional term, dependent on and . This additional term arises since the variance of each sub-population in this case depends on the value of *α*. Therefore, we consider, from this point on, the analysis based on log-transformed values only.

## 3 Isolating cells to quantify the contributions to the variation in a cell population

### 3.1 Defining the relative contribution of the stable component

The previous section 2.2.2 showed that the variance of log-transformed expression levels of the full population is simply the sum of variances due to the stable and unstable components (equation 13). In this context, in analogy with the *R*^{2} quantification of the variance explained by a linear regression model, we define as:
to denote the proportion of the observed variance that is explained by the stable component.

Hence, formalizes and quantifies the relative contribution of the stable component to the total variance of the full population, reducing the problem of quantifying the contributions to the estimation of a single parameter. In the case of , variation in expression levels arises exclusively due to the unstable component; conversely, the stable component explains all the observed variation if . Finally, in the intermediate case , a combination of the two components is at play.

### 3.2 Estimating the relative contribution of the stable component by isolating cells

After defining , a setup for its estimation is considered. Since the original population is assumed to be heterogeneous, being composed of several subpopulations, a natural approach for estimation is to isolate a cohort of cells and to follow the temporal evolution of some property of this cohort. The isolation of cells according to the expression levels of some protein has been described in previous experimental works [16, 10, 14, 17, 18], usually employing fluorescenceactivated cell sorting (FACS). To simplify the presentation, it is assumed that the property is always quantified based on a sufficiently large number of cells, such that sampling effects are negligible.

Hereafter, a time reference *t* is defined beginning from the instant of isolation in a hypothetical experiment. Let an isolated cell cohort correspond to cells between percentiles *p*_{1} and *p*_{2} of expression levels of the original population. Without loss of generality, it is assumed hereafter that *p*_{1} *< p*_{2}. Therefore, the two percentiles should satisfy 0% ≤ *p*_{1} *< p*_{2} *<* 100% or 0% *< p*_{1} *< p*_{2} ≤ 100%. This ensures that at least one of the isolated cohorts to be used for inference is not identical to the original population at time *t* = 0. Hence, isolating cells corresponds, indirectly, to selecting some of the sub-populations, if any compose compose the original population. Upon isolation, the expression levels of cells in a given sub-population will relax to the stationary distribution of that subpopulation. Therefore, at the level of the isolated cohorts being tracked, changes in the property of expression levels are related to the dynamics of the unstable component, as expression levels of the sub-populations that have been isolated relax to their stationary values. The time for this relaxation to take place will be hereafter referred to as the characteristic time of the variation.

In a given experiment, three outcomes are possible (figure 2). If only the unstable component is present , after waiting a sufficiently long amount of time, the distribution of protein expression in the isolated cohort will converge to that of the original population (figure 2, top). In contrast, if the observed variation is explained by the stable component only , the distribution of the isolated cohort will not change as a function of time, remaining identical to that just after being isolated; it will always differ from that of the original population (figure 2, middle). Finally, if both the stable and unstable components are present in the original population , the isolated cohort will evolve in time, but without ever restoring the distribution of the original population (figure 2, bottom).

The key question now is what properties of the isolated cohort can be used to infer . The next section shows that can be accurately inferred from the dynamics of the means of the cohorts and examines the choice of a specific approach for isolation in term of the percentiles *p*_{1} and *p*_{2}. The additional features that can be extracted from the variance of the isolated cohort are addressed in supplemental information section E.

## 4 Estimating the relative contribution of the stable component

This section uses simulation to identify which property of the isolated cohorts can lead to a good estimate of , when followed in time. In the simulations, protein expression levels are described by the model derived in section 2, neglecting cell division for simplicity. Since all derivations are based on equation 13, the analysis herein relies on log-transformed values of protein levels.

The isolated cohort considered at first for inference here is composed of the 10% of cells with the highest (respectively lowest) expression levels in the original population hereafter referred to as “high expressors” (respectively “low expressors”). Following the notation of section 3.2, we have *p*_{1} = 90% and *p*_{2} = 100% (respectively, *p*_{1} = 0% and *p*_{2} = 10%). The choice of 10% is arbitrary, and is deemed to represent, at least in principle, a good compromise between resolution and number of cells obtained. Moreover, a random sample of the original population will serve as reference.

We first address how the dynamics of the mean of log-transformed protein levels in isolated cohorts, shown in figure 3 (left) for the high expressors. Briefly, the mean of an isolated cohort will evolve smoothly until it reaches an asymptotic limit.

It turns out that the asymptotic mean of log protein levels of an isolated cohort is a linear function of the of the original population from which it was obtained, as illustrated in figure 3(right) in the cases of the isolation of the 10% high and low expressors in populations with different . This linear relationship allows one to define a straightforward approach for estimating . Defining Δ_{A,B}(*t*) as the difference between the means of log-transformed values of two isolated cohort *A* and *B*, respectively *µ*_{A}(*t*) and *µ*_{B}(*t*), at time instant *t*:
then, can be estimated via:

The condition Δ_{A,B}(0) ≠ 0 for using equation 16 implies that the two isolated cohorts being compared must have different means just after isolation (*t* = 0). From the inequality in equation 14, an additional relationship for Δ_{A,B}(*t*) holds:

Therefore, the stationary difference between the means of log-transformed expression levels of the isolated cohorts *A* and *B* is expected to be, under the present formulation, lower than or equal to the difference immediately after isolation. Therefore, a key result is that, to estimate , one may simply calculate the ratio between the asymptotic value of difference between the means of logtransformed protein levels in two isolated cohorts relative to its initial value after isolation.

An important consequence for experimental design is that one can improve the resolution in the estimation of by maximising the value of Δ_{A,B}(0). For any given percentage of cells to be isolated (the difference *p*_{2} -*p*_{1}; see section 3.2), the maximal initial difference is obtained by isolating the extreme high and low expressors. Consequently, the remainder of this work focuses on this case, by always relying on the function Δ_{H,L}(*t*) for estimation, where H and L denoted respectively the high and low expressors.

Equation 16 has an important advantage from an experimental point of view: the fact that it depends only on the differences between the means of the sorted and reference populations. This is particularly important given that there are typically day-to-day variations in the absolute values read by a flow cytometer, to which equation 16 is robust.

The asymptotic analysis just presented does not allow to consider the dynamics of the expression levels. To address these dynamics we introduce the time-dependent function Ω_{H,L}(*t*) given by:

Being based on the means of log-transformed values of two populations that have been isolated, Δ_{H,L}(*t*) follows an approximately exponential decay (figure 4; see section D of the supplementary material for a rationale). Using the approximation of exponential decay, and defining the effective characteristic time as *τ*_{T}, Ω_{H,L}(*t*) takes the following form:

It follows that the effective characteristic time *τ*_{T} is undefined in the case of , since Δ_{H,L}(*t*) does not change as a function of time after isolation. Since *τ*_{T} is a measure of the time needed for the initial difference Δ_{H,L}(0) to reach the asymptotic value lim_{t→∞} *{*Δ_{H,L}(*t*)}, it provides a formal characterisation of the timescale of the variation.

An exhaustive simulation study (figure 4, b-e) led to the conclusion that *τ*_{T} can be approximated, with a typical bias of at most 5–10% of the true value, as:

Therefore, the auto-correlation time of the stochastic rate of protein production (*τ*) and the mean lifetime of the protein (*β*) determine the timescale of the variation in expression levels (*τ*_{T}).

The relative contribution of the stable component and the effective characteristic of the variation (*τ*_{T}) can be visualized in a single plot, derived from equation 19. As shown in figure 5, corresponds to the asymptotic value of Ω_{H,L}(*t*), while *τ*_{T} corresponds to the instant of time that satisfies:

Since equation 19 features an exponential decay, it follows that the plateau is reached in practice after an amount of time approximately equal to 5 *τ*_{T}… Furthermore, the inequality in equation 17 becomes:
since function Δ_{H,L}(*t*) is monotonically decreasing with time.

Although this section has focused on the case in which high and low expressors are used, all the properties derived also hold for any two isolated cohorts *A* and *B*. The only requirement is that the condition Δ_{A,B}(0) *≠* 0 is satisfied.

## 5 Quantification of the components shaping the variation in T-cell receptor expression levels

The theoretical framework developed in the previous section is used here in the analysis of the variation in the expression of the TCR in mouse CD4+ T lymphocytes. The TCR is a heterodimeric membrane receptor that elicits signal transduction upon interaction with MHC-peptide complexes on the membrane of antigen-presenting cells [33]. In wild-type animals, the T cell populations are genetically heterogeneous at the level of the TCR. The genetic diversity of the receptor is brought about by the somatic recombination at the loci encoding the receptor chains in thymocytes (reviewed in [34]). In contrast, genetically manipulated mouse strains are available in which all the T cells express the same TCR (for example, [27]). In these mouse strains, the somatic recombination is ablated (*Rag2*^{−/−} background) and a single functional TCR is expressed in all cells driven by transgenes encoding the two chains of the TCR.

To quantify the origin and timescale of the variation in the context of the TCR, we used a polyclonal population from a wild-type imbred strain and a surrogate monoclonal population from the Marilyn TCR-transgenic strain [27]. In this setup, we are interested in comparing the values of and *τ*_{T} estimated for the polyclonal and the Marilyn monoclonal populations. These two populations show comparable mean expression values (see figure 6, top graphs) but the expression is more variable in the polyclonal population than in the monoclonal one [35], presumably reflecting the genetic diversity [36].

In the model developed in section 2.2, the stable component arises from different mean protein production rates. In polyclonal populations, the stable variation in average TCR production may be caused by the differential regulation of expression of the receptor sub-units, depending on the specificity of the particular TCR, or by the differential ability of the specific sub-units to pair and be expressed [37]. In either case, genetic heterogeneity would ultimately explain some of the variation observed at the level of a polyclonal population. If so, this would imply in for a polyclonal population. By analysing a TCR-transgenic population, we address whether genetic variation is the only factor explaining the stable component. In the affirmative case, one would obtain for a TCR-transgenic population. If one obtains , non-genetic mechanisms must be evoked.

We adopted an experimental design in which high and low expressors, defined to contain 10% of the mass of the starting population distribution, were sorted (figure 6, top) and then maintained *in vitro* without any stimulation. As described before [38], there was no cell division under these conditions, and cells slowly died off such that after 3 to 4 days no live cells were left. Since in the Marilyn transgenic strain, all T cells have a naive phenotype [27], we restricted the analysis of the wildtype polyclonal populations to those cells that express high levels of the CD45RB marker, indicative of a naive phenotype [39]. By restricting the analysis to naive cells, the distribution of cell size as measured by Foreward Scatter was similar in the high and low expressor cohorts when sorted from the monoclonal Marilyn population as well as from polyclonal population (supplemental section F).

The dynamics of the frequency distribution of the TCR expression levels in cohorts of high and low expressors sorted from polyclonal and monoclonal animals and subsequently cultured *in vitro* for up to 72h is illustrated in figure 6 for a representative experiment. The distributions of the TCR expression levels in the high and low expressors sorted from wildtype polyclonal population remain clearly different. In contrast the high and low expressors from the monoclonal Marilyn TCR-transgenic population become very similar as a function of time after sorting. In fact, this particular data set portrays an extreme case of convergence of the histograms of high and low expressors and it was chosen to better illustrate the difference between the monoclonal and polyclonal populations. In the two other independent sets, the histograms of Marilyn high and low cohorts did not converge (see figure 7 for the quantification including all experiments).

The values of Ω_{H,L}(*t*) in three experimental data sets are shown in figure 7. The normalisation (equation 18) masks the fact that the value of Δ_{H,L}(0) was conspicuously greater for the polyclonal population, in accordance with the observations [35], that the variance is larger in polyclonal populations than in TCR transgenic populations (figure 8).

In order to estimate and *τ*_{T} by fitting the model to the experimental data equation 19 must be refined as follows:
where *δ*_{0} represents an estimate, obtained via fitting, of the “true” initial value Δ_{H,L}(0). Equation 23 has the important property of preserving the statistical independence between data points used as input for the fitting, a key requirement for proper statistical analyses.

The analysis was based on fitting the three-parameter exponential model (equation 23) to the ensemble of the data, composed of the multiple experiments done for each biological population. The different modelling scenarios being tested are defined by specifying each of the three parameters, , *τ*_{T} and *δ*_{0}, for each biological population as being shared or not between the polyclonal and monoclonal populations. Small variations in defining the percentages for sorting high and low expressors in different experiments are expected to sporadically affect the value of *δ*_{0} and therefore this parameter was always fitted separately for each experiment. Therefore, the modelling scenarios are obtained by specifying how parameters and *τ*_{T} are shared between the biological populations. The complete description of the modelling scenarios considered is presented in table 2 (column 2). Modelling Scenario 1 represents the null model, according to which the polyclonal and monoclonal populations are described by the same values of and *τ*_{T}. This is the scenario with the smallest number of parameters considered. Scenario 2 represents the plausible situation in which these two populations may be described by different values of , but equal *τ*_{T}, while in Scenario 3 parameter *τ*_{T} is also allowed to be different in the two populations. Finally, Scenario 4 represents a lower bound in terms of the error in the fitting, where data from each experiment is fitted independently, and has the largest number of parameters. The Akaike Information Criterion (AIC) [40] is used to compare the different modelling scenarios in their capacity to fit to the ensemble of the data. The AIC has a solid foundation on information theory [40], representing a compromise between the error in fitting the data and the number of parameters in the model. The results are presented in terms of the difference ΔAIC_{c} between the AIC for each Scenario and that of Scenario 1. In comparing different modelling scenarios, the one with the smallest value of the AIC (and therefore, the smallest value of ΔAIC_{c}) provides the best and most parsimonious description of the data.

The results of the model fitting, including the sum of squared residuals, point estimates of the parameters, and the value of ΔAIC_{c}, are shown for each scenario in table 3. It follows that Scenarios 2 and 3 have the lowest values of ΔAIC_{c}, with Scenario 3 having a slightly higher value. Since the latter has one extra parameter, this suggests that a scenario where only is allowed to be different constitutes the most parsimonious explanation for the data. Hence, altogether Scenario 2 is favoured, according to which the two populations differ only in . In this case, we obtain an effective timescale of 32 hours, and values of of 71% for the polyclonal and 17% for the monoclonal population, with 95% confidence intervals of [57%, 79%] and [0%, 31%], respectively. Finally, the function Ω_{H,L}(*t*) resulting from scenario 2 is shown in figure 7, highlighting the values of estimated for each population.

## 6 Discussion

In this article, we developed a new approach to analyse the variation in protein expression levels in a cell population, which enables measuring the characteristic dynamics of the fluctuations in cellular levels and estimating the magnitude of stable and unstable contributions to the variation across cells. The crux of the proposed analysis is based on the realisation that the difference between the means of log-transformed expression levels in two selected cohorts isolated from a population of interest converges with aproximmate exponential dynamics to an asymptotic value. By normalising this asymptotic value by the difference in cohorts’ means immediately after their isolation one obtains an unbiased estimation of the proportion of population variance that is explained by the stable component , while the mean convergence time *τ*_{T} measures the timescale of unstable component dynamics. This is the key insight of the analysis. This insight stems from perceiving any cell population as a mixture of many independent subpopulations, each with a characteristic mean expression level, that is fixed yet distributed among the subpopulations. Under these assumptions, the population variance is equated to the sum of the variance of the subpopulations means, which embodies the stable component of variation, and the variance of the expression level within the subpopulations, which represents the unstable component.

At first sight, the stable and unstable components of expression variation as formulated here are analogous to what Huang [10] referred to as population and temporal noise, respectively. However, this analogy is not straightforward. Huang’s definition of population noise precludes, by construction, any underlying genetic and stable epigenetic mechanisms. In contrast, the stable component, as defined here, is a statement about the dynamical aspects of variation and is silent about mechanism. We believe that the terms stable and unstable components are not only intuitive but convey a more precise description of variation in terms of its temporal dynamics. The mechanistic basis of these components remains a matter for further analysis. Putative mechanisms underlying the stable component include genetic variation and non-volatile epigenetic traits [41, 10]. In turn, the unstable component may be explained by noise in gene expression [9, 10] as wells as transient metastable epigenetic variants [41, 10]. Stable gene expression variants, which would be part of stable component of variation, are expected to be pervasive, since differentiation stages, cell lineages and cell types are hallmarks of multicellular organisms [25, 26]. In spite of this expectation, many quantitative approaches to expression variation in cells in the past have focused on noise in gene expression [4, 19, 7, 21, 23, 24].

Measuring the extent to which selected cohorts of cells can restore the full complexity of population from which they were sorted is an intuitive approach to analyse the heterogeneity of a population. This basic intuition motivated the experimental design used in several reports [16, 42, 14, 17, 18], in which the stable and unstable components of variation were evoked and utilised in an informal way. The capacity of cohorts to restore totally or partially the distribution of the original population has often been interpreted and discussed qualitatively, based on the visual inspection of raw flow cytometry histograms or of summary data time series. The present report advanced beyond such ‘half-full / half-empty glass” interpretations of data by contributing a rigorous quantitative method to analyse these kind of sorted cohort experiments based on estimating two parameters, and *τ*_{T}, that capture the essence of heterogeneity and dynamics of the expression variation.

The analysis method is grounded on a stochastic modelling framework. The protein expression levels in a single cell are described as very simple stochastic processes, based on [29], in which the instantaneous protein production rate (captured by variable *z*_{t}) fluctuates generating a stationary log-normal distribution of expression levels in each subpopulation. Protein expression has been modelled by others considering transcriptional burst dynamics that can be shown to generate discrete numbers of transcripts following a negative binomial distribution. Although transcript copy number distributions are generally assumed to be described by a negative binomial distribution at any range of expression levels, they are well-approximated by a log-normal distribution at high copy numbers per cell [43]. Therefore the log-normal approximation underlying the dynamics of *z*_{t} is justified by the observation that the transcripts encoding the TCR *α* and *β* chains are among the most abundant in the cell [44]. This model of the single cell expression dynamics was used to simulate a population formalised as a large mixture of independent subpopulations. By applying the equation for partitioning the variance to this mixture model, the analysis based on log-transformed values emerged as the best approach, since in this case the contributions due to the stable and unstable components are additive, greatly simplifying inference. This is particularly relevant for flow cytometry data, which is typically analysed in a logarithmic scale. It is interesting to note that [18] also relied on log-transformed values for quantification, based on the analysis of time-series of expression levels in individual cells.

An important result here is that the rigorous unbiased estimation of can be done based on a time series of normalised measurements of the difference between the means of log-transformed expression levels in isolated cohorts. The normalisation of these measurements by the value immediately after sorting is a critical part of the inference procedure. A similar normalisation by the initial value was used by Singh [31] to analyse the temporal evolution of the squared coefficient of variation of a single population, under a model that assumed that the observed variation was completely due to noise in gene expression. Using this type of analysis in settings of transcription inhibition, these authors [31] assessed whether noise in mRNA production and degradation or promoter activity fluctuations contribute to noise in protein expression. The normalisation of the differences by their initial value (*t* = 0) in the present work formalises the definition of how much of the initial difference, introduced by the process of sorting by design, remains at later times (function Ω(*t*)). Hence, a key requirement is that the isolated cohorts being compared have different means just after sorting. This strongly argues to using high and low expressors as the basis for quantification, in order to maximise the measurement resolution. In practice, one has to manage a tradeoff between how extreme are the expression levels (to increase resolution and dynamic range of the readout) and how many cells are contained in the cohorts (sample size). The estimation based on the mean expression levels offers great advantages in terms of the range of applications. It is often argued that the standard experimental techniques that measure bulk expression are obsolete in the context of the studies of gene expression noise, because their populationaveraged readouts mask cell heterogeneity (see, for example, [10, 45]). The present analysis framework enables to use these techniques, as one may combine the isolation of cells (the only step requiring the analysis at the single-cell level), with population-averaged readouts to quantify and *τ*_{T}. The function Δ(*t*), which is at the core of the estimation process, can be approximated as the log-arithm of the fold-ratio between the raw mean values of the two populations. In theory, by measuring and the total variance of the full population , one could estimate the actual values of and in equation 13. This would allow one to compare the values of and in different biological populations Furthermore, we also showed that the variances of the isolated cohorts can be further informative, allowing the estimation of the ratio between the absolute values of the contribution of the stable component in the isolated cohort and in the starting population. However, the estimate obtained in this way is biased, under-estimating the true value by up to 20%. Consequently, if an estimate of this ratio is needed, we suggest a simulation-based approach.

Empowered by the quantitative framework, we analysed the variation in the expression levels of the T-cell receptor (TCR) in mouse CD4^{+} T cells. T cells are an interesting case study. It has been described that some molecules in these cells have a component of stability in their expression levels, such as CD5 [46, 47, 48]. This has also been suggested for CD127 [49], a sub-unit of the IL-7 receptor. Also studies centred on chromatin regulation of expression of cytokines in T cells [13, 15], found that the expression of IL-4 and IL-10 is unstable at the level of single cells. With the increasing availability of single cell genomics, proteomics and metabolomics techniques there is accumulating evidence that cell populations that hitherto were perceived as homogeneous are in fact complex mixture of cell types and cell states, that may be reversible and transient, raising the issue of stability and dynamics. From a practical perspective, different mouse models are available with different genetic diversity in the TCR loci, which gives a handle to tease apart genetic and non-genetic components of variation. Hence, in our analysis of TCR expression levels we studied a genetically heterogeneous polyclonal population, and also a particular isogenic population, Marilyn TCR transgenic [27] in a *Rag2*-deficient background. These two populations display distinct variances of the TCR levels that are, not surprisingly, positively associated with the genetic TCR diversity. We asked whether these two populations could be described by equal or different values of and *τ*_{T} and found that the overall most parsimonious explanation for the data was a model where only one of these parameters is allowed to be different. The model with different *τ*_{T} and the same value of performed marginally better based on the AIC. However, the point estimate of the characteristic time of the polyclonal population in this scenario was about 20 days, requiring an extrapolation beyond the experimental observation time of 96 hours and involving an unreasonable large uncertainty. Although such long time scales have been described for the restoration of a bimodal population distribution from selected unimodal cohorts (e.g. over 30 days in [18, 42]), the scenario of very distinct *τ*_{T} values for wildtype and transgenic populations is biologically unsound. This scenario requires that the expression of transgenic TCR would differ from natural TCR expression in terms of the protein turnover rate *β* as well as noise in gene expression, such that the same could be present in populations with markedly different variances (figure 8). Therefore, based on these statistical and biological considerations we rejected this scenario. We concluded in favour of the scenario in which the TCR expression fluctuates with a characteristic time of 32 hours in the both populations, which differ in the values of , the polyclonal having and monoclonal Marilyn having . The relatively small yet not negligible value of obtained for the latter population may be particular and not necessarily generalisable to other TCR-transgenic populations. It is worth mentioning that the analysis of another such TCR transgenic population led to a higher value [50], suggesting that transgenic populations, which are known to display different variance of the TCR levels (e.g. [35]), may also differ in the extent of the stable component of variance.

The capacity to analyse the variation in TCR expression validates experimentally the theoretically-designed methodology. We could quantify the two key parameters in the two cell populations, implying that the methodology has enough power to resolve the stable and unstable components of variance even when the distribution of interest is remarkably unimodal and narrow.

Beyond this key methodological result what do the actual estimates of and *τ*_{T} tell about phenotypic variation in TCR expression?

It is important to highlight from the outset that the stable proportion of variance and the timescale may be constrained by the experimental setup on which we intentionally relied. High and low expressors were maintained *in vitro* for as long as possible in the absence of any stimulation in conditions that precluded cell division. Using this setup, we focused on cell-intrinsic components only, avoiding complications arising from cell division. As a consequence of this choice, the present data do not exclude the possibility that signals arising from the intermittent stimulus from the antigen-presenting cells in the *in vivo* environment may change the values of both or *τ*_{T} for the populations tested. Also, cell division is expected to decrease the timescale of the fluctuations in a twofold manner. First, protein dilution into the daughter cells may effectively reduce the value of *β*, even if yeast studies indicate that protein levels are remarkably constant if corrected for cellular volume [51]. Second, cell division may affect the stability of epigenetic modifications facilitating the transitions between chromatin states or bistable transcriptional switches that affect quantitatively TCR expression in this way reducing the effective *τ*_{T}. A similar point was made in a study [52] of induced pluripotent stem cells. These cells maintained a memory of transcriptional and epigenetic signatures indicative of the cell of origin that vanished with sequential passages. Hanna et al. [53] reported a similar impact of cell division itself. Furthermore, cell division and generation time variability may introduce cell-extrinsic deformations of the expression distribution by differential selection of lineages (see [54] for a theoretical analysis). These potential peculiarities of the experimental design notwithstanding, the estimates of and *τ*_{T} are to our knowledge the first reported values and therefore interpreting the meaning of these values requires indirect comparison with other estimates.

The characteristic time of the variation in protein expression represents a transient memory of expression levels [7]. Various studies have quantified the dynamics of fluctuations in expression levels of various molecules, reporting characteristic times that range from hours [7, 12] to days and weeks [16, 14, 17, 18, 42]. In studies quantifying the dynamics of the percentage of T cells inducing the expression of cytokines [13, 15], the effective timescale was estimated to be about 70 hours for the cytokines IL-10 [13] and IL-4 [15], under conditions in which the cells divide. In terms of molecular mechanism, this has been linked to slow dynamics of chromatin remodeling [13, 15]. The longer time scales were systematically obtained in scenarios with cell division and that involved the restoration of a multimodal distribution of expression levels from biased cohorts. The dynamics of multimodal distributions, in which cells switch between overtly distinct subpopulations may correspond to transitions between cellular states. For the unimodal TCR expression, we estimated an effective timescale of 32 hours, in the absence of cell division. This timescale is shorter than that necessary to restore a full multimodal distribution from extremely biased cohorts [18, 42]. The TCR protein complex is arguably one of the most complex receptors in terms of its composition, trafficking and regulation. In quiescent cells, such as the naive cells analysed here, it is continuously recycled between the plasma membrane and intracellular membranes with a fast rate of less than an hour. The TCR in the ensemble of these two pools has a slow turnover rate. The treatment with protein synthesis inhibitor up to 12 hours led only to modest changes in expression levels [55], suggesting that *β* might be greater than 12 hours. However, this estimate is potentially problematic, since this treatment may alter the regulation of the TCR complex levels, as it may up-regulate the expression of the mRNA encoding its *ζ* chain sub-unit [56]. Sousa and Carneiro [57] estimated the baseline TCR turnover in an human T cell line by fitting the dynamics of the mean upon short-term stimulation, and found a value for *β* of 15 hours. Both values [55, 57] are compatible with the effective timescale estimated here, which lumps protein stability and the auto-correlation time of the rate of protein production, and suggest that *β* is of the same order of magnitude as *τ*_{T} in the case of the TCR.

The different components that may underly the stable variation in TCR expression levels are systematically addressed in figure 9. The mean TCR level has been shown to be distributed among the V_{β}-family subsets in CD4 as well as CD8 human T cell populations [36], under conditions in which there was a strong correlation with cell size. Despite the fact that the high and low expressors cohorts in our experiments had virtually the same size distribution as assessed by the respective forward scatter signal, it is likely that part of the stable variance in TCR expression is in fact due to the genetic diversity at the TCR locus. The question is whether genetic diversity explains all stable variation. The estimate of a positive value for in the nominally monoclonal TCR transgenic population suggests that non-genetic variation may contribute to stable differences in expression levels of the TCR among cells. This might be a particular feature of TCR-transgenic populations, as their relationship to the actual clones in a polyclonal T cell population is not trivial. The specific mechanism that would mediate such non-genetic variation is unclear at present. We speculate that the stable component in this system may arise from a myriad chromatin modifications in the form of “molecular switches”, which would affect, directly or indirectly, the expression of the TCR. This speculation is inspired on theoretical studies [58, 13], which have predicted these marks to be stable once fully established, but also potentially variable among cells. In terms of the full range of modifications affecting expression of the TCR in *cis* and in *trans*, some could be present, while others could be absent in each individual cell in a stochastic yet stable pattern of modifications [58, 13]. In those cells in which the balance of modifications happens to be tilted towards those inducing expression, levels of the TCR would be higher than average, while in cells with lower TCR levels this balance would be shifted in the direction of those leading to decreased expression. Similar considerations could be made to any epigenetic variants in any of the vast number of transcription factors and regulatory proteins that control the TCR complex expression. Finally, in this enumeration of the causes of stable variation in TCR levels, it is worth noting that despite we have been referring to the population of T cells in Marilyn transgenic mice as “monoclonal” throughout this article, the cells are not a T cell clone derived from a mature T cell. Instead, they are continuously differentiating in the thymus and being exported to circulation. One cannot rule out that these T cells or their bone marrow and thymic precursors underwent sporadic somatic mutations in any of those genes affecting TCR complex expression. The existence of genetic mosaics in somatic tissues has been well documented (reviewed in [59]) following the advent of single cell sequencing, and thus, one must envisage the possibility that part of the stable variation observed in transgenic TCR expression is due to *bone fide* genetic variation. Simple back of the envelop calculations suggest that epigenetic variants and/or mutational mosaicism outside the TCR locus may represent more than 1/6 of the stable variance in the wild type CD4 T cells (assuming that epigenetics and/or mosaicism explain the same amount variance in both monoclonal and polyclonal populations, that all the stable variance in the former (which corresponds to 17% relative to monoclonal and 8% relative to polyclonal variances) is explained by these two processes and that the stable variance in latter is explained by these two processes and by TCR diversity, we have that in the polyclonal population 8/71 of the stable variance is explained by epigenetics/mosaicism and 63/71 is explained by genetic TCR diversity; see figure 9).

This quantitative framework makes a connection between systems biology, in particular works focused on noise in gene expression [9], and quantitative genetics [60]. In both domains, approaches for decomposing the variance, or another measure of variation, have been instrumental in studying the properties of different biological systems (see, for example, [41]). In the studies of noise in gene expression, the notion of intrinsic and extrinsic noise put forward by Elowitz and colleagues [4, 19] is based on decomposing the coefficient of variation of expression levels into these two noise sources. Other works have focused on either generalising this distinction or developing new decompositions [21, 61, 24, 23]. In fact, these approaches may be combined with the framework developed here, to further partition the unstable component, for example, into intrinsic and extrinsic noise. Likewise, in connection with quantitative genetics, parameter can be interpreted as the “heritability” in expression levels of a population. This arises from an analogy with the decomposition of phenotypic variation into a contribution from additive genetic variation and another due to environment (see, for example, [28, 60]), neglecting non-additive genetic variation. At present, this constitutes a mere analogy, since is defined even in the absence of cell division.

Hitherto, the studies on the phenotypic variation in gene expression levels at the individual cell level have relied experimentally on timelapse imaging of single cells or on population snapshots using single-cell resolution techniques such as flow cytometry, qPCR, RNAseq or Cytof. The quantitative framework and methodology proposed here, relying on estimates of the mean of the expression levels in apropriately selected cohorts, enables studying the sources and dynamics of the variation in cellular expression levels by conjugating a single step of sorting with the full gamut of transcriptomic, proteomic, and metabolomic technologies available to measure bulk expression. This opens new prospects for studying quantitative traits and responses in heterogeneous cell populations. In particular, by providing a rigorous approach to quantify the relative contribution of the stable component, this work allows the determination of the extent this component contributes to variation in different biological and experimental systems [16, 14, 17]. Furthermore, the model for expression levels considered here can be further extended to incorporate, for example, more elaborated formulations, such as those with positive and feedback loops, as in the case of gene regulatory networks regulating cell differentiation (for example, [62]). Finally, the sophistication of DNA recording-based methods to account for past fluctuations in transcript levels of a cell or its lineage will generate massive sets of single-cell transcriptomics data points suitable for decomposition into stable and unstable components [63]. In summary, we have put forward a solid theoretical framework to dissect the components of variation in expression levels according to their stability and dynamics, which enables the further analysis of how different molecular mechanisms may modulate each component.

## Materials and Methods

### Notation

The function log(.) denotes the natural logarithm, and random variables are represented as bold symbols, as in ** x**. We use 𝔼 [

**] to denote the expected value of a random variable**

*x***, and 𝕍 [**

*x***] the variance. The notation**

*x*

*z**∼ℒ 𝒩 (µ, σ*) represents a random variable

**following a lognormal distribution with parameters**

*z**µ*and

*σ*, having therefore the probability density function:

### Numerical simulations

Simulations of the model based on stochastic differential equations were performed using custom software written in C++, based on the GNU Scientific Library (http://www.gnu.org/software/gsl/). For a given value of the parameters *τ* and *β*, the stochastic model (equations 4 and 5) was simulated, using the Brent-Dekker method (GNU Scientific Library) to adjust the value of *σ* so as to obtain the desired value of *σ*_{W}.

Simulations of cell sorting experiments to isolate appropriate cohorts were done using an initial population with *σ*_{T} = 0.3, having 1.2 × 10^{6} cells and 2 × 10^{4} sub-populations, with the number of cells per sub-population following a multinomial distribution. From the starting population, 10% of cells were isolated. As a simple approximation of an experimental setting, each isolated cohort was divided into 3 replicates, and simulated for a given period of time, with snapshots of each replicate being collected at equally spaced times.

### Data analysis, fitting and model selection

Numerical analysis was conducted using MATLAB (Mathworks). The exponential model was fitted to the data by non-linear least squares. To study the relationship between *τ*_{T} and parameters *β* and *τ*, simulations were ran for several combinations of values of . The values of *τ*_{T} were estimated by fitting the exponential model. Fitting the ensemble of the experimental data was done by equally weighting each experiment, based on the number of data points per experiment. Values of the Akaike Information Criterion (AIC) were corrected for small sample size, as highlighted in section 2.4 of ref. [40], and include the residual variance as an additional effective parameter being estimated for each model. Confidence intervals (95%) were obtained by bootstrapping each experiment separately, then fitting the ensemble of the data.

### Mice

C57BL6/J and B6.*Rag2*^{−/−} mice were obtained from the Jackson Laboratory. Marilyn mice [27] were kindly provided by Olivier Lantz (Institut Curie, France), and bred with B6.*Rag2*^{−/−} to produce Marilyn.*Rag2*^{−/−}. Mice were bred and maintained under specific pathogen free conditions at the animal house of the Instituto Gulbenkian de Ciência, and used for experiments with ages between 8 and 12 weeks. All animal procedures were approved by the ethics committee of the Instituto Gulbenkian de Ciência.

### Antibodies and Flow Cytometry

Flow cytometry was performed using a Beckman-Coulter CyAN ADP. Fc receptors were always blocked prior to staining, by incubation with FcBlock (2.4G2, produced in-house). Cells were stained at 4*°C*, in ice-cold buffer with PBS, 5% fetal bovine serum (PAA), and, except in the case of sorting, with 0.1% sodium azide.

Monoclonal antibodies produced in-house used were: anti-TCR-C*β* (H57597), anti-CD4 (GK1.5), anti-CD8 (YTS169.4), anti-CD25 (PC61), anti-CD45RB (16A), anti-CD62L (MEL-14), anti-B220 (RA3-6B2), anti-MHC-II (M5/114),anti-Mac1 M1/70), anti-CD3*ϵ* (2C11), anti-CD3*ϵ* (2C11). Commercial antibodies were: anti-CD49b (pan-NK, DX5, BD), anti-CD4 (RM4-5, BD), anti-CD44 (MEL-14, eBioscience), anti-TCR*γδ* (GL3; BD). Biotinylated antibodies were further labeled with PE-Streptavidin (BD).

### Cell sorting and in vitro cultures

Single-cell suspensions were prepared from lymph nodes, and also spleens in the case of Marilyn.*Rag2*^{−/−} animals (due to limited number of cells), by passing cells through a nylon mesh. Cohorts of cells were sorted according to the TCR levels on a FACSAria (BD), using a strategy based on negative selection of CD4^{+} T cells. Briefly, cells were stained for TCR (anti-TCR-C*β*) and lineage markers not expressed by naive CD4^{+} T cells, and then lineagecells falling within the desired TCR gates (illustrated in figure 6 and F) were sorted. A polyclonal naive population was sorted as CD45RB^{high}, lineage(CD8, pan-NK, B220, TCR*γδ* and CD25) cells, while Marilyn cells were sorted as CD62L^{+}, lineage(B220, CD11c, pan-NK, Mac1, MHC-II). The use of CD62L as an alternative marker of naive cells allows for a more efficient sorting (due to a slow loss in the CD45RB signal throughout the sorting), given the limited number of cells, based on the fact that the vast majority of Marilyn cells retain a naive phenotype [27]. Before each sorting for Marilyn cells, the gating for CD62L^{+} Lineagecells, when analysed in a control sample also labeled for CD4, includes more than 80% of TCR^{+}CD4^{+} Marilyn cells. Purities of the sorted populations were assessed by staining aliquots of the sorted populations for CD4 expression, were typically greater than 96%.

After sorting, T cells were cultured in flat-bottom 96-well plates (50 × 10^{3} cells per well), in RPMI (Invitrogen), 10% fetal bovine serum (PAA), 1% Sodium Pyruvate (Gibco), penicillin/streptomycin (Gibco), gentamycin (Sigma), 50*µ*M 2-ME (Gibco), in an incubator at 37*°C*, with 5% CO_{2}.

TCR levels were quantified by staining, under optimal, saturating conditions, with anti-TCRC*β* antibody, which binds to the constant region of one of the sub-units of the TCR (see, for example, [64]). Cells maintained in culture were analysed at different time-points by re-staining the TCR, using the same antibody anti-TCRC*β* (clone and fluorochrome) as used for the sorting. In each time-point, 3 replicates (wells) of each sorted population were analysed. In each experiment, an additional population (control) was sorted in parallel, keeping the same gates used for all expressors, but without staining for the TCR, as a control for the impact of this staining. In each time-point, TCR levels of the control population were compared against those of “all expressors”, confirming that the staining for the sorting does not induce massive changes in TCR expression levels.

Data were analysed using FlowJo 8.8.7 (Tree Star Inc.). Cells gated on forward-scatter and side-scatter, live cells (propidium iodide negative) and CD4^{+} cells. For the analysis of TCR levels, cells were further gated on CD62L^{+} cells, to reduce experimental variation in TCR levels. Percentages of CD62Lcells were always lower than 20% in early time-points (up to 48 hours), and similar to those from control cells, arguing against impact of staining for the TCR in order to sort cells. Gated data was exported as text files and analysed in MATLAB (Mathworks) using custom code.

## Author contributions

This work was done as a collaboration between all authors. TSG, VMB and JC designed and planned research. TSG developed the theoretical framework, performed simulations, established the experimental setup, performed experiments, analysed data and drafted the manuscript. VMB and JC jointly supervised the work, contributed to both the theoretical and experimental parts, and wrote the manuscript together with TSG. All authors read and approved the final paper.

## Acknowledgements

We are grateful to Jocelyne Demengeot and Henrique Teotonio for the support during the development of this work and to Alberto Darszon and Vera Martins for reading an earlier version of this manuscript. We thank Rui Gardner, Telma Lopes and Claudia Bispo for assistance on flow cytometric analysis and cell sorting. This work was supported by a grant from the Fundacao para a Ciencia e Tecnologia (FCT) (PTDC/BIA-BCM/108020/2008). TSG was supported by a fellowship from FCT (SFRH/BD/33572/2008).

## Footnotes

Current email address: tguzella{at}tguzella.org, vasco.barreto{at}fcm.unl.pt, jcarneir{at}igc.gulbenkian.pt

## References

- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].↵