## ABSTRACT

The causes of many complex human diseases are still largely unknown. Genetics plays an important role in uncovering the molecular mechanisms of complex human diseases. A key step to characterize the genetics of a complex human disease is to unbiasedly identify disease-associated gene transcripts in whole-genome scale. Confounding factors could cause false positives. Paired design, such as measuring gene expression before and after treatment for the same subject, can reduce the effect of known confounding factors. Model-based clustering, such as mixtures of Bayesian hierarchical models, has been proposed to detect gene transcripts differentially expressed between paired samples. However, not all known confounding factors can be controlled in a paired/match design. To the best of our knowledge, no clustering methods have the capacity to adjust for the effects of covariates yet. In this article, we proposed a novel mixture of Bayesian hierarchical models with covariate adjustment in identifying differentially expressed transcripts using high-throughput whole-genome data from paired design. Both simulation study and real data analysis show the good performance of the proposed method

## Introduction

Genome-wide differential gene expression analysis is widely used for the elucidation of the molecular mechanisms of complex human diseases. One popular and powerful approach to detect differentially expressed genes is the probe-wise linear regression analysis combined with the control of multiple testing, such as limma^{1}. That is, we first perform linear regression for each probe and then adjust p-values for controlling multiple testing. One advantage of this approach is its capacity to adjust for potential confounding factors.

Another approach for detecting differentially expressed genes is the model-based clustering via mixture of Bayesian hierarchical model (MBHM)^{2–7}, which can borrow information across genes to cluster genes. Probe clustering based on MBHMs treats gene transcripts as “samples” and samples as “variables”. Therefore, transcript clustering based on MBHMs has large number of “samples” and relatively small number of “variables”, hence does not have the curse-of dimensionality problem. In addition, unlike transcript-specific tests that have several parameters per transcript, transcript clustering based on MBHMs has only a few hyperparameters per cluster to be estimated and could borrow information across transcripts to estimate model hyperparameters. These approaches generally assume that samples under two groups are obtained independently.^{8} proposed a constrained MBHM to identify genetic outcomes measured from paired/matched designs.

Paired design is commonly used in study design for its homogeneous external environment for comparing measurements under different conditions. However, not all known confounding factors can be controlled in a paired/match design. Hence, we might still need to adjust the effects of confounding factors for data from a paired/matched design. To best of our knowledge, no existing clustering methods, including the constrained MBHM^{8}, have the capacity to adjust for potential confounding factors, i.e., age, sex, and/or batch effect.

In this article, we proposed a novel mixture of Bayesian hierarchical models with covariate adjustment in identifying differentially expressed transcripts using high-throughput whole genome data from paired design.

## Materials and Methods

We assumed that gene transcripts can be roughly classified into 3 clusters based on their expression levels in subjects after treatment (denoted as condition 1) relative to those before treatment (denoted as condition 2):

transcripts after treatment have higher expression levels than those before treatment, i.e., over-expressed (

**OE**) in condition 1;transcripts after treatment have lower expression levels than those before treatment, i.e., under-expressed (

**UE**) in condition 1;transcripts after treatment have same expression levels than those before treatment, i.e., non-differentially expressed (

**NE**) between condition 1 and matched condition 2.

We followed^{8} to directly model the marginal distributions of gene transcripts in the 3 clusters. In^{8}, they proposed a mixture of three-component Bayesian hierarchical distributions to characterize the within-pair difference of gene expression. We extended their model by incorporating potential confounding factors (such as *Age* and *Sex*) in the mixture of Bayesian hierarchical models, which might affect the response of gene expression to drug treatment. We assumed that data have been processed so that the distributions of mRNA expression levels are close to normal distributions. For RNAseq data, we can apply VOOM transformation^{9} or countTransformers^{10}.

### A mixture of Bayesian hierarchical models

For the *g*^{th} gene transcript, let *x*_{gl} and *y*_{gl} denote the expression levels of the *l*^{th} subject under two different conditions, e.g., before and after treatment, *g* = 1, …, *G, l* = 1, …, *n*, where *G* is the number of transcripts and *n* is the number of subjects (i.e., the number of pairs). Let *d*_{gl} = log_{2}(*y*_{gl}) − log_{2}(*x*_{gl}) be the log2 differences for the *l*^{th} subject. Denote *d*_{g} = (*d*_{g1}, …, *d*_{gn})^{T}. We assumed that *d*_{gl} is conditionally normally distributed given mean vector and covariance matrix. Let *X*^{T} be the *n*× (*p* + 1) design matrix, where *p* is the number of covariates. The first column of *X*^{T} is the vector of ones, indicating intercept. Let *η* be the (*p* + 1) ×1 vector of coefficients for the intercept and covariate effects. We assume following mixture of three-component Bayesian hierarchical models:

For the OE gene transcripts,
where *k*_{1} > 0, *α*_{1} > 0 and *β*_{1} > 0. Γ(*α*_{1}, *β*_{1}) denotes the Gamma distribution with shape parameter *α*_{1} and rate parameter *β*_{1}. That is, we assume that (1) the mean vectors *μ*_{g}, *g* = 1, …, *G*, given the variance follow a multivariate normal distribution with mean vector exp [*X*^{T} *η*_{1}] and covariance matrix ; and (2) the variances , follow a Gamma distribution with shap parameter *α*_{1} and rate parameter *β*_{1}.

For the UE gene transcripts,
where *k*_{2} > 0, *α*_{2} > 0, *β*_{2} > 0, and *X*^{T} is the design matrix.

For the NE gene transcripts,
where *k*_{3} > 0, *α*_{3} > 0 and *β*_{3} > 0. *U*^{T} is the design matrix without intercept column.

The hyper-parameters *α*_{c} and *β*_{c} are shape and rate parameters for the Gamma distribution, respectively, *c* = 1, 2, 3. As for *k*_{1}, *k*_{2} and *k*_{3}, the variation of the mean vector *μ*_{g} should be smaller than that of the observations *d*_{g}. So we expect 0 *< k*_{c} *<* 1, *c* = 1, 2, 3.

Note that the marginal distribution for each component of the mixture is a multivariate t distribution with special structure of mean vector and covariance matrix [11, Section 3.7.6].

For continuous covariates, we require that they are standardized so that they have mean zero and variance one. Standardizing continuous covariates would make exp (*X*^{T} *η*_{1}) and exp (*X*^{T} *η*_{2}) be numerically finite.

Ideally, we should require *μ*_{g} > 0(*μ*_{g} *<* 0) for all transcripts in cluster 1 (cluster 2). To do so, we can assume a log normal prior distribution for *μ*_{g} in cluster 1, for instance. However, a log normal distribution could not be a conjugate prior for the mean of a normal distribution. It would increase the computational burden if non-conjugate priors were used. As an alternative, we require the mean *E*(*μ*_{g}) > 0 (*E*(*μ*_{g}) *<* 0) for cluster 1 (cluster 2) by assuming *E*(*μ*_{g}) for cluster 1 (cluster 2) to be exp[*X*^{T} *η*_{1}] (−exp[*X*^{T} *η*_{2}]). Denote .

### Marginal density functions

Let *f*_{1}(**d**_{g} |*ψ*), *f*_{2}(**d**_{g} |*ψ*), *f*_{3}(**d**_{g|} *ψ*) be the marginal densities of the 3 Bayesian hierarchical models, and *π* = (*π*_{1}, *π*_{2}, *π*_{3}) is the vector of cluster mixture proportions. Then the marginal density of **d**_{g} is :

### Determining transcript cluster membership

The transcript-cluster membership is determined based on the posterior probabilities, *ζ*_{gc} = *Pr*(*g*^{th} gene transcript in cluster *c* |*d*_{g}). We can get

We determine a transcript’s cluster membership as follows: If the maximum value among *ζ*_{gi}, *i* = 1, 2, 3 is *ζ*_{gc}, then the transcript *g* belongs to cluster *c*.

The true values of *π*_{1}, *π*_{2}, *π*_{3}, and *ψ* are unknown. We use estimated values to determine transcripts’ cluster membership.

### Parameter estimation via EM algorithm

We used expectation–maximization (EM) algorithm^{12} to estimate the model parameters *π* = (*π*_{1}, *π*_{2}, *π*_{3})^{T} and *ψ*. Detailed derivations of the expectation *Q* of the log complete-data likelihood function and its first-order derivatives are shown in the Appendix.

We stop the expectation-maximization iterations if the sum of squared differences of model parameter estimates between current iteration and previous iteration is smaller than a small constant (e.g. 1.0 ×10^{−3}).

More details about the EM algorithm are shown in Appendix.

### A real data study

We downloaded the dataset GSE24742^{13} from the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24742) to evaluate the performance of the proposed model-based clustering method (denoted as *eLNNpairedCov*).

The dataset is from a study that investigated the gene expression before and after administrating rituximab, a drug for treating anti-TNF resistant rheumatoid arthritis (RA). There are 12 subjects, each having 2 samples (one sample is before treatment and the other is after treatment). Age and sex are also available. Expression levels of 54,675 gene probes were measured for each of the 24 samples by using Affymetrix HUman Genome U133 Plus 2.0 array. The dataset has been preprocessed by the dataset contributor. We further kept only 43,505 gene probes in the autosomal chromosomes (i.e., chromosomes 1 to 22). We then performed log2 transformation for gene expression levels. We next obtained the within-subject difference of the log2 transformed expression levels (log2 expression after-treatment minus log2 expression before-treatment). By examining the histogram (Figure 5) of the estimated standard deviations of log2 differences of within-subject gene expression for the 43,505 gene probes, we found a bimodal distribution. We excluded gene probes having standard deviation *<* 1. Finally, 23,948 gene probes kept in the down-stream analysis.

### A simulation study

We performed a simulation study to compare the performance of the proposed method *eLNNpairedCov* with transcript-wise test *limma* and Li et. al.’s^{8} method (denoted as *eLNNpaired*). Both *eLNNpairedCov* and *limma* adjust covariate effects, while *eLNNpaired* does not.

The *limma* approach first performs an empirical-Bayes-based linear regression for each transcript. In this linear regression, the within-subject log2 difference of transcript expression is the outcome and intercept indicating if the transcript is over- expressed (intercept>0), under-expressed (intercept*<*0), or non-differentially expressed (intercept = 0), adjusting for potential confounding factors. Here condition 1 are observations after treatment and condition 2 are observations before treatment. A transcript is claimed as OE if its intercept estimate is positive and corresponding FDR-adjusted p-value *<* 0.05, where FDR stands for false discovery rate. A transcript is claimed as UE if its intercept estimate is negative and corresponding FDR-adjusted p-value *<* 0.05. Other transcripts are claimed as NE.

In this simulation study, we considered two scenarios. In the first scenario (Scenario1), the number of subjects is equal to 30. In the second scenario (Scenario2), the number of subjects is equal to 100.

For each scenario, we generated 100 datasets. Each simulated dataset contains *G* = 20, 000 gene transcripts, 0.0376% of which are OE and 0.346% of which are UE.

There are two covariates: standardized age (denoted as *Age*.*s*) and *Sex. Age*.*s* follows normal distribution with mean 0 and standard deviation 1. Seventy five percent (75%) of subjects are women.

The model hyperparameters are set as *α*_{1} = 3.53, *β*_{1} = 3.45, *k*_{1} = 0.26, *η*_{1} = (0.18, 0.00, − 1.05)^{T}, *α*_{2} = 3.53, *β*_{2} = 3.45, *k*_{2} = 0.26, *η*_{2} = (0.18, 0.00, −1.05)^{T}, *α*_{3} = 2.86, *β*_{3} = 2.20, *k*_{3} = 0.72, and *η*_{3} = (− 0.01, 0.00)^{T}.

The parameter values (*π, ψ*, and proportion of women) in the simulation study are estimated from the analysis of the pre-processed real dataset GSE24742 described in Section.

### Evaluation criteria

Two agreement indices and two error rates are used to compare the predicted cluster membership and true cluster membership of all genes. The two agreement indices are accuracy (i.e., proportion of predicted cluster membership equal to the true cluster membership) and Jaccard index^{14}. For perfect agreement, these indices have a value of one. If an index takes a value close to zero, then the agreement between the true transcript cluster membership and the estimated transcript cluster membership is likely due to chance. The two error rates are false positive rate (FPR) and false negative rate (FNR). FPR is the percentage of detected DE transcripts among truly NE transcripts. FNR is the percentage of detected NE transcripts among truly DE transcripts. We also examined the user time and number of EM iterations for running each simulated dataset.

## Results

### Results of the real data analysis

For the real dataset, we adjusted standardized age and sex for *eLNNpairedCov* and *limma*. We standardized age so that it has mean zero and variance one. For each transcript, we also scaled its expression across subjects so that its variance is equal one.

The *limma* method detected 6 under-expressed gene transcripts (Figure 1 and Table 1), while *eLNNpaired* did not find any positive signals (i.e., ). The proposed method *eLNNpairedCov* detected 9 OE transcripts (Upper panel of Figure 2 and Table 2) and 83 UE transcripts (Lower panel of Figure 2 and Table 3). The 83 UE transcripts detected by *eLNNpairedCov* include the 6 UE transcripts detected by *limma*.

It is assuring that several genes corresponding to the 92 DE transcripts identified by *eLNNpairedCov* have been associated to rheumatoid arthritis (RA) in literature. For example, Humby et. al. (2019)^{15} reported that genes *ZNF365* (OE), *IL36RN* (OE), *MRVI1-AS1* (OE), *WFDC6* (UE), *UBE2H* (UE), are associated with RA.

### Results of the simulation study

The results for Scenario 1 (*n* = 30) are shown in Figures 3 and 4, which are jittered scatter plots of the performance indices and the difference of performance indices versus methods, respectively.

The differences of performance indices are between *limma* and the other 2 methods (*eLNNpaired* and *eLNNpairedCov*). A positive difference indicates that the performance indices of the other 2 methods are larger than that of *limma*. A negative difference indicates that the performance indices of the other 2 methods are smaller than that of *limma*.

The upper panel of Figures 3 and 4 show that *eLNNpairedCov* has higher agreement indices (Jaccard and accuracy) than *limma*, which in turn has higher agreement indices than *eLNNpaired*.

The middle panel of Figures 3 and 4 show that the three methods have similar FPR, except that *eLNNpaired* has larger FPR than *eLNNpairedCov* and *limma* for a few simulated datasets. The middle panel also show that *eLNNpairedCov* has smaller FNR than *limma*, which in turn has smaller FNR than *eLNNpaired*.

The bottom panel of Figures 3 and 4 show that *limma* runs very fast, while both *eLNNpaired* and *eLNNpairedCov* run in reasonable time (i.e., less than 30 seconds per dataset that has *G* = 20, 000 genes and *n* = 30 subjects). On average *eLNNpairedCov* spends a little more time than *eLNNpaired*. The bottom panel of Figures 3 and 4 also show that *eLNNpairedCov*uses only 2 EM iterations, while *eLNNpaired* tends to use more EM iterations. Note that the EM iteration number for *limma* is set to be one, which does not use EM algorithm to obtain parameter estimates.

The simulation results for Scenario 2 (*n* = 100) are shown in Figures 6 and 7 in the Appendix, which have similar patterns to those for Scenario 1 (*n* = 30), except that both *eLNNpaired* and *eLNNpairedCov* have smaller FPR than *limma*. Note that *eLNNpairedCov* tends to have smaller FNR than *limma*, while *eLNNpaired* has huge FNR (close 1).

## 1 Discussion

In this article, we extended Li et al.’s (2017) mixture of Bayesian hierarchical models to allow the adjustment for potential confounding factors. This is novel in that to the best of our knowledge, all existing clustering methods do not yet have the capacity to adjust for covariates.

The proposed approach is different from transcript-wise test with controlling multiple testing in that it does not involve hypothesis testing. Hence, no multiple test controlling is needed. The simulation study showed that if the difference of gene expression between samples before treatment and samples after treatment follows the mixture of Bayesian hierarchical models in Section, then the proposed method can outperform *limma*, which currently is a fast and powerful transcript-wise test method. The real data analysis also showed the proposed method *eLNNpairedCov* can detect more differentially expressed gene transcripts, which include the transcripts detected by *limma*.

Although we classify genes to three distinct clusters, the transitions between these clusters could be smooth. This would be reflected by a gene’s posterior probability that might be large in two of three clusters, e.g., 0.49 for cluster 1, 0.01 for cluster 2, and 0.5 for cluster 3. On the other hand, expression changes could be split up into more than 3 clusters, e.g., groups behaving differently. In this article, we are only interested in identifying three clusters of genes: over-expressed in condition 1, under-expressed in condition 1, and non-differentially expressed.

There are other model-based clustering methods in literature, such as^{16}. However, they were not designed to detect differentially expressed genes. For example, we can set the number *K* of clusters as 3 for their model. However, there is no constraints that the intercepts for the three clusters have to be positive, negative, and zero. That is, the three clusters identified might not correspond to over-expressed, under-expressed, and non-differentially expressed genes.

RNAseq and single-cell RNAseq data are cutting-edge tools to investigate molecular mechanisms of complex human diseases. However, it is quite challenging to analyze these count data with inflated zero counts. In future, we will evaluate if eLNNpairedCov can be used to analyze single-cell RNAseq data by first transforming counts to continuous scale (e.g., via VOOM^{9} or countTransformers^{10}) and then to apply eLNNpairedCov to the transformed data.

We implemented the proposed methods to an R package *eLNNpairedCov*, which will be freely available to researchers.

## Author contributions statement

Conceptualization, W.L. and W.Q.; methodology, Y.Z., W.L, and W.Q.; software, Y.Z. and W.Q.; validation, Y.Z., W.L. and W.Q.; formal analysis, Y.Z. and W.Q.; investigation, W.L.; resources, W.L.; data curation, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, W.L. and W.Q.; visualization, Y.Z. and W.Q.; supervision, W.L. and W.Q.; project administration, W.L.; funding acquisition, W.L. All authors have read and agreed to the published version of the manuscript.

## Funding

This article was supported by Canada Natural Sciences and Engineering Research Council (NSERC) grants 198662. W.Q. is a Sanofi employee and may hold shares and/or stock options in the company.

## Data availability statement

The data presented in this study are openly available in [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24742]

## Additional information

The authors declare no potential conflict of interests.

## Appendix

### Histogram of estimated standard deviations of log2 differences of within-subject gene expression for GSE24742

#### Marginal density functions

The marginal density function of **d**_{g}is :
where *f*_{1}(**d**_{g}| *ψ*), *f*_{2}(**d**_{g} |*ψ*), *f*_{3}(**d**_{g} |*ψ*) are the marginal densities of the 3 clusters, and *π* = (*π*_{1}, *π*_{2}, *π*_{3}) is the cluster proportion.

Then we have the marginal distribution for each cluster:
where *X*^{T} is the *n* × (*p* + 1) design matrix.
where *U*^{T} is the *n × p* design matrix without intercept column.

Here

Note that each marginal density is the density of a multivariate t distribution with special structures of mean vector and covariance matrix.

#### Log marginal densities

Here

#### EM algorithm

Let *z*_{g} = (*z*_{g1}, *z*_{g2}, *z*_{g3}) to be the indicator vector indicating if gene transcript *g* belongs to a cluster or not. To stablize the estimate of *π* when *π*_{c} is very small, we assume that the cluster mixture proportions *π* follows a symmetric Dirichlet *D*(**b**) distribution, i.e., . Therefore, the likelihood function for the complete data (**d, z**, *π*) is

Then the log complete-data likelihood function is:

The EM algorithm is used to estimate parameters *π* and *ψ*. Since *z* is unknown random vector, we integrate it out from the log complete-data likelhood function. Here, **z**_{g} = (*z*_{g1}, *z*_{g2}, *z*_{g3}).

#### E-step

Denote *Q*^{(t)} (*π, ψ*|*d, z*^{(t)}, *π*^{(t)})as the expected log complete-data likelihood function at *t*-th iteration of the EM algorithm, we have
where

#### M-step

Maximize *Q*^{(t)} (*π, ψ*|*d, z*^{(t)}, *π*^{(t)}) to find the optimal values of *π* and *ψ*, and use these optimal values as estimates for the parameters *π* and *ψ*.

To maximize *Q*^{(t)} (*π, ψ*|*d, z*^{(t)}, *π*^{(t)}), we use the “L-BFGS-B” method developed by Byrd et al. (1995)^{17}, which utilizes the first partial derivatives of *Q*^{(t)} (*π, ψ*|*d, z*^{(t)}, *π*^{(t)}) and allows box constraints, that is each variable can be given a lower and/or upper bound.

#### First Derivatives

For *π*_{1}, *π*_{2}, *π*_{3}, we have:

Since , and letting *b*_{1} = *b*_{2} = *b*_{3} = 2, *∂Q*^{(t)}*/∂π*_{1} = *∂Q*^{(t)}*/∂π*_{2} = 0, we have:

##### For OE genes

For *α*_{1}, *β*_{1}, *k*_{1} and *η*_{1}, we have

Now, we would like to take partial derivatives of log *f*_{1} with respect to *α*_{1}, *β*_{1}, *k*_{1} and *η*_{1}. And we have:

Denote where

Then we can rewrite *M*_{1} as

As for *k*_{1}, we have

##### For UE gene probes

For *α*_{2}, *β*_{2}, *k*_{2} and *η*_{2}, we have

Take partial derivatives of log *f*_{2} with respect to *α*_{2}, *β*_{2}, *k*_{2} and *η*_{2}. And we have:

Denote where

Then we can rewrite *M*_{2} as

For *k*_{2}, we have

##### For NE gene probes

For *α*_{3}, *β*_{3}, we have

Here

Therefore, take the derivatives of log *f*_{3}, we have

For , we consider the following two components,

And we have (Matrix Cookbook^{18}, equation(85)), given **W** is symmetric,

Then,

For *k*_{3}, we have

#### Initial parameter estimates

We first run *limma* to get false discovery rate (FDR) adjusted p-value of moderated t-test for each gene transcript. We then partition the gene transcripts to 3 clusters based on the FDR-adjusted p-values and signs of the test statistics: (1) gene transcripts with positive test statistics and FDR-adjusted p-values *<* 0.05 (denoted as OE); (2) gene transcripts with negative test statistics and FDR-adjusted p-values *<* 0.05 (denoted as UE); and (3) the remaining gene transcripts (denoted as NE). If numbers of transcripts in OE and/or UE are smaller than 5, we then use 100 transcripts having the smallest un-adjusted p-values to find initial OE and UE transcripts. We next estimate the cluster mixture proportions (*π*_{1}, *π*_{2}, *π*_{3}) by the 3 cluster sizes dividing the total number of gene transcripts.

Within each cluster, we use moment estimator to get the initial estimates of model parameters. In our case, we use sample median (med) and sample median absolute deviation (mad) to get robust estimates of mean and standard deviation, based on which we get the initial estimate of *ψ*.

##### OE gene transcripts

where

##### UE gene transcripts

where

Here, we use to make sure it is positive.

##### NE gene transcripts

The initial values are

As for *η*_{3}, we first need to get With linear regression, we have

Then we can get

And where

#### Bounds for hyper-parameters

Although we can re-parameterize hyper-parameters to make sure *α*_{c} > 0, *β*_{c} > 0, 0 *< k*_{c} *<* 1, *c* = 1, 2, 3, and to transform constrained optimization problem to un-constrained optimization problem, the results are usually not good due to non-linear optimization in the M-step of the EM algorithm. To avoid un-expected large values of estimated *α*_{c}, *β*_{c}, and/or *k*_{c} in numerical optimization, we set constraints 0.001 *< α*_{c} *<* 6, 0.001 *< β*_{c} *<* 6, 0.001 *< k*_{c} *<* 0.9999.

If initial estimates , or are outside their ranges, then we randomly choose a number between the range as the initial parameter estimates.

In the M-step of the EM iterations, we also require each element of *η*_{c}, *c* = 1, 2, 3, be within the interval [−10, 10].

If at least one element of *ψ*^{(t)} are outside their ranges in the *t*-th iteration of the EM algorithm, we then set final estimate of *ψ* as *ψ*^{(t−1)}, and then only update , until convergence criterion satisfied.

The M-step in the EM algorithm is to maximize *Q* function, which is a non-linear function of model parameters *π*_{1}, *π*_{2}, *π*_{3}, *ψ*. It is difficult to optimize non-linear functions. The above constraints and procedure produced good results in the real and simulation data analyses in this article.

#### Information for differentially expressed transcripts for GSE24742

The information about the 6 UE transcripts detected by *limma* is shown in Table 1.

The information about the 9 OE transcripts detected by *eLNNpairedCov* is shown in Table 2.

The information about the 83 UE transcripts detected by *eLNNpairedCov* is shown in Table 3.