## Abstract

**Motivation** Deep sequencing based ribosome footprint profiling can provide novel insights into the regulatory mechanisms of protein translation. However, the observed ribosome profile is fundamentally confounded by transcriptional activity. In order to decipher principles of translation regulation, tools that can reliably detect changes in translation efficiency in case-control studies are needed.

**Results** We present a statistical framework and analysis tool, *RiboDiff*, to detect genes with changes in translation efficiency across experimental treatments. *RiboDiff* uses generalized linear models to estimate the over-dispersion of RNA-Seq and ribosome profiling measurements separately, and performs a statistical test for differential translation efficiency using both mRNA abundance and ribosome occupancy.

**Availability** Source code and documentation are available at http://github.com/ratschlab/ribodiff. Supplementary Material can be found at http://bioweb.me/ribodiff.

**Contact** zhongy{at}cbio.mskcc.org and raetsch{at}cbio.mskcc.org.

## 1 Introduction

The recently described ribosome profiling technology [6] allows the identification of RNA fragments that were bound by the ribosome complex. It provides valuable information on ribosome occupancy and, thereby indirectly, on protein synthesis activity. This technology can be leveraged by combining its measurements with RNA-Seq expression estimates in order to determine a gene's translation efficiency (TE): $TE = A_{\mathrm{RF}} / A_{\mathrm{mRNA}}$, where $A_{\mathrm{mRNA}}$ and $A_{\mathrm{RF}}$ are the mRNA and ribosome footprint (RF) read counts, respectively [7, 5, 13]. The normalization by mRNA abundance is designed to remove transcriptional activity as a confounder of RF abundance. The TEs in treatment/control experiments can then be compared to identify the genes whose translation efficiency is most affected; for instance, [13] considered the ratio (i.e., the fold change) of the TEs of treatment and control. However, these initial approaches and analyses only partially account for the fact that one typically obtains only uncertain estimates of mRNA and ribosome abundance. In particular, for lowly expressed genes, the error bars for the ratio of two TE values can be very large. As in proper RNA-Seq analyses, one should consider the uncertainty in these abundance measurements when making statements about differential effects. For RNA-Seq, this has been addressed in various ways, often based on generalized linear models that take advantage of dispersion information from biological replicates (for instance, [11, 2, 3]). In [14, 16], a way to adapt an RNA-Seq analysis approach to this problem was described, which had several conceptual and practical limitations. Here, we describe a novel statistical framework that also uses a generalized linear model to detect effects of a treatment on RNA translation. Additionally, our approach accounts for the fact that two different sequencing protocols with distinct statistical characteristics are used. We compare it to a recently published tool, *Babel* [10].

## 2 Methods

In sequencing-based ribosome footprint profiling, the RF read count is naturally confounded by mRNA abundance (Fig. 1A). We seek a strategy to compare RF measurements taking mRNA abundance into account, in order to accurately discern the translation effect in case-control experiments. We model the vectors of mRNA and RF read counts, $y^i_{\mathrm{mRNA}}$ and $y^i_{\mathrm{RF}}$, respectively, for gene $i$ with Negative Binomial (NB) distributions, as described before (for instance, [11, 8, 3]): $y^i \sim \mathrm{NB}(\mu^i, \kappa^i)$, where $\mu^i$ is the expected count and $\kappa^i$ is the estimated dispersion across (biological) replicates. Formulating the problem as a generalized linear model (GLM) with the logarithm as link function, we can express expectations on read counts as a function of latent quantities related to mRNA abundance $\beta^i_C$ in the two conditions ($C \in \{0, 1\}$), a quantity $\beta_{\mathrm{mRNA}}$ that relates mRNA abundance to RNA-Seq read counts, a quantity $\beta_{\mathrm{RF}}$ that relates mRNA abundance to RF read counts and a quantity $\beta_{\Delta,C}$ that captures the effect of the treatment on translation. In particular, the expected mRNA read count $\mu^i_{\mathrm{mRNA},C}$ is given by the equation $\beta^i_C + \beta_{\mathrm{mRNA}} = \log(\mu^i_{\mathrm{mRNA},C})$.

We assume that transcription and translation are successive cellular processing steps and that abundances are linearly related. The expected RF read count, $\mu^i_{\mathrm{RF},C}$, is given by $\beta^i_C + \beta_{\mathrm{RF}} + \beta_{\Delta,C} = \log(\mu^i_{\mathrm{RF},C})$. A key point to note is that $\beta^i_C$ is a shared parameter between the expressions governing the expected mRNA and RF counts. It can be considered a proxy for shared transcriptional/translational activity under condition $C$. Then, $\beta_{\Delta,C}$ indicates the deviation from that activity under condition $C$, with $\beta_{\Delta,0} = 0$ for $C = 0$ and $\beta_{\Delta,1}$ free otherwise.^{‡}
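As a minimal sketch of this log-linear model, the snippet below evaluates the expected counts of one gene, assuming $\log \mu_{\mathrm{mRNA},C} = \beta_C + \beta_{\mathrm{mRNA}}$ and $\log \mu_{\mathrm{RF},C} = \beta_C + \beta_{\mathrm{RF}} + \beta_{\Delta,C}$. It then checks that the log ratio of the two translation efficiencies equals the treatment effect $\beta_{\Delta,1}$. The function name `expected_counts` and all numeric values are hypothetical and chosen only for illustration; this is not the *RiboDiff* implementation.

```python
import math

def expected_counts(beta_c, beta_mrna, beta_rf, beta_delta):
    """Return (mu_mRNA, mu_RF) for one condition under the log link."""
    mu_mrna = math.exp(beta_c + beta_mrna)
    mu_rf = math.exp(beta_c + beta_rf + beta_delta)
    return mu_mrna, mu_rf

# Hypothetical parameter values for a single gene.
beta_c = {0: 4.0, 1: 4.5}       # shared activity in control (0) and treatment (1)
beta_mrna, beta_rf = 1.0, 0.2   # protocol-specific offsets
beta_delta = {0: 0.0, 1: -0.7}  # fixed to 0 in control, free in treatment

te = {}
for c in (0, 1):
    mu_mrna, mu_rf = expected_counts(beta_c[c], beta_mrna, beta_rf, beta_delta[c])
    te[c] = mu_rf / mu_mrna     # translation efficiency of condition c

# The shared term beta_c cancels, so only the treatment effect remains.
log_te_ratio = math.log(te[1] / te[0])
print(round(log_te_ratio, 6), beta_delta[1])
```

Because the shared term cancels in the ratio, a change in transcription alone (a different `beta_c[1]`) leaves `log_te_ratio` unchanged, which is the sense in which mRNA abundance is removed as a confounder.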

Fitting the GLM consists of learning the parameters $\beta^i$ and dispersions $\kappa^i$ given mRNA and RF counts for the two conditions $C \in \{0, 1\}$. We perform alternating optimization of the parameters $\beta^i$ given the dispersions $\kappa^i$ and of the dispersion parameters $\kappa^i$ given $\beta^i$, similar to the EM algorithm:

$$\beta^i \leftarrow \arg\max_{\beta} \, \ell_{\mathrm{NB}}(y^i \mid \beta, \kappa^i), \qquad \kappa^i \leftarrow \arg\max_{\kappa} \, \ell_{\mathrm{NB}}(y^i \mid \beta^i, \kappa),$$

where $\ell_{\mathrm{NB}}$ denotes the NB log-likelihood.

As experimental procedures for measuring mRNA counts and RF counts differ, we enable the estimation of separate dispersion parameters for the RNA-Seq and RF profiling data sources to account for their different characteristics. As in [2], we use the mean-dispersion relationship $\kappa = f(\mu) = \lambda_1/\mu + \lambda_0$ and a Gamma distribution to obtain the function $f(\mu)$. We perform empirical Bayes shrinkage [8] to shrink $\kappa^i$ towards $f(\mu)$ to stabilize the estimates. See Section D in Supplementary Material for details.

In a treatment/control setting, we can then evaluate whether a treatment ($C = 1$) has a significant differential effect on translation efficiency compared to control ($C = 0$), which is equivalent to determining whether the inferred parameter $\beta_{\Delta,1}$ differs significantly from 0. In other words, we test whether the relationship denoted by the dashed line in Fig. 1A is needed. We compute significance levels based on the $\chi^2$ distribution by analyzing log-likelihood ratios of the null model ($\beta_{\Delta,1} = 0$) and the alternative model ($\beta_{\Delta,1} \neq 0$).
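The likelihood ratio test described here can be sketched in a few lines of Python. The sketch assumes that the alternative model has exactly one additional free parameter, so the test statistic is compared against a $\chi^2$ distribution with one degree of freedom. The function name `lrt_pvalue` and the log-likelihood values are hypothetical.

```python
import math

def lrt_pvalue(loglik_null, loglik_alt):
    """P-value of a likelihood ratio test with one degree of freedom."""
    stat = 2.0 * (loglik_alt - loglik_null)
    stat = max(stat, 0.0)  # guard against tiny negative values from numerics
    # Survival function of chi-square with 1 d.o.f.: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical maximized log-likelihoods of the two fitted models for one gene.
p_value = lrt_pvalue(loglik_null=-105.20, loglik_alt=-103.28)
print(p_value)
```

The closed form via `math.erfc` holds only for one degree of freedom; a test with more degrees of freedom would need a general $\chi^2$ survival function.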

## 3 Results and Discussion

We simulated data to illustrate the performance of *RiboDiff* and to compare it with a recently published tool, *Babel*. For details on data simulation, see Section F in Supplementary Material. Fig. 1B shows the receiver operating characteristic curves of *RiboDiff* and *Babel*, indicating superior quantitative performance of *RiboDiff*. We also re-analyzed previously released ribosome footprint data (GEO accession GSE56887). After multiple testing correction, *RiboDiff* detected 601 TE down-regulated genes and 541 up-regulated ones with FDR < 0.05, which is about twice as many as reported in [14]. The new TE down set includes 92.4% of the genes identified in the previous study, whereas the TE up set contains 94.7% of the previously identified ones. The result of *RiboDiff* is also compared to TE fold change analysis, which classifies genes with the most extreme $\Delta_{\mathrm{TE}}$ as candidates (Fig. 1C). We ran *RiboDiff* on a machine with a 1.7 GHz CPU and 4 GB RAM; it took 23 minutes of computing time to finish (10,474 genes having both RF and mRNA counts).

In summary, we propose a new statistical model and analysis tool to analyze the effect of a treatment on RNA translation. It assumes a rich model of data generation and can be used for accurate differential testing. A major advantage of this method is that it facilitates comparisons of RF abundance by treating mRNA abundance variability as a confounding factor. Moreover, *RiboDiff* is specifically tailored to produce robust dispersion estimates for different sequencing protocols measuring gene expression and ribosome occupancy that have different statistical properties. The described approach is statistically sound and identifies a set of genes similar to that found by the less developed method used in [14]. The release of this tool is expected to enable proper analyses of data from many future RF profiling experiments.

### A Sequencing library size and normalization

The sequencing library sizes of RNA-Seq and ribosome footprinting (RF) counts are normalized separately. We calculate the library size $S$ similar to [8] with modifications:

$$S_T^{j} = \underset{i}{\mathrm{median}} \; \frac{y_T^{i,j} + 1}{\left( \prod_{k=1}^{m_T} \left( y_T^{i,k} + 1 \right) \right)^{1/m_T}} \qquad (1)$$

where $T$ denotes the data type (mRNA or RF); $j$ indexes the replicates (or samples); $m_T$ is the number of replicates of type $T$; and $y_T^{i,j}$ is the observed count of type $T$ for gene $i$ in replicate $j$. For all genes in all replicates, we add one to the count value to prevent the geometric mean across all replicates in the denominator from being zero. The ratios of the gene read counts in a given replicate to the geometric means are calculated, and we take the median of those ratios whose count is greater than one as the library size. The read counts are normalized by the library size before being used in the next step.
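The normalization steps above can be sketched as follows for a single data type. The function name `library_sizes` and the toy counts are hypothetical. The filter "count greater than one" is interpreted here as applying to the count after adding one, which is an assumption about the intended rule.

```python
import math
from statistics import median

def library_sizes(counts):
    """Median-of-ratios library sizes for one data type (mRNA or RF).

    counts: list of per-gene rows, each row holding the counts of all replicates.
    Returns one library size per replicate.
    """
    n_rep = len(counts[0])
    # Add one to every count so that the geometric mean is never zero.
    pseudo = [[y + 1 for y in row] for row in counts]
    # Geometric mean of each gene across all replicates.
    geo = [math.exp(sum(math.log(v) for v in row) / n_rep) for row in pseudo]
    sizes = []
    for j in range(n_rep):
        # ASSUMPTION: keep ratios whose pseudo-counted value exceeds one.
        ratios = [row[j] / g for row, g in zip(pseudo, geo) if row[j] > 1]
        sizes.append(median(ratios))
    return sizes

# Toy example: replicate 2 is sequenced about twice as deeply as replicate 1.
toy_counts = [[10, 21], [100, 201], [50, 101], [0, 3]]
sizes = library_sizes(toy_counts)
normalized = [[y / s for y, s in zip(row, sizes)] for row in toy_counts]
print(sizes)
```

Using the median rather than the mean keeps a handful of strongly changing genes from dominating the size estimate.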

### B The explanatory matrix of GLM

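To control how the observed read counts enter the GLM described in the main text, we design an explanatory matrix $X$. Here we show it in the context of the linear predictor $\eta$ of the GLM:

$$\eta = X\beta = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} \beta^i_{C=0} \\ \beta^i_{C=1} \\ \beta_{\mathrm{mRNA}} \\ \beta_{\mathrm{RF}} \\ \beta_{\Delta,1} \end{pmatrix} \qquad (2)$$

In this example of $X$, the first four rows correspond to the mRNA counts, with two replicates for each condition. The last six rows correspond to the RF counts, with three replicates for each condition. Note that the first and second columns in $X$ are shared between mRNA and RF counts; this is where we couple the two data sets. The linear predictor is then linked to the mean $\mu$ of the negative binomial distribution through the logarithm as link function.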
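For other replicate numbers, an explanatory matrix of this kind can be generated programmatically. The sketch below is an illustration under assumptions: the column order (two shared condition columns, then the mRNA and RF columns, then the treatment effect) and the function name `design_matrix` are hypothetical choices, not taken from the *RiboDiff* source code.

```python
def design_matrix(n_mrna, n_rf, with_delta=True):
    """Build the explanatory matrix row by row.

    n_mrna, n_rf: number of replicates per condition for mRNA and RF data.
    Assumed column order: beta_C0, beta_C1, beta_mRNA, beta_RF[, beta_delta_1].
    with_delta=False gives the null model (4 columns), True the alternative (5).
    """
    rows = []
    for is_rf, n_rep in ((0, n_mrna), (1, n_rf)):
        for cond in (0, 1):
            row = [1 - cond, cond, 1 - is_rf, is_rf]
            if with_delta:
                # The treatment effect enters only RF rows of the treated condition.
                row.append(is_rf * cond)
            rows.extend(list(row) for _ in range(n_rep))
    return rows

X = design_matrix(n_mrna=2, n_rf=3)
for row in X:
    print(row)
```

With two mRNA and three RF replicates per condition this yields ten rows: four for mRNA followed by six for RF, matching the layout described in the text.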

### C Negative binomial likelihood function

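The probability mass function of the negative binomial distribution is given by

$$f_{\mathrm{NB}}\left(y^{i,j}; \mu^{i,j}, \kappa^{i,j}\right) = \frac{\Gamma\left(y^{i,j} + 1/\kappa^{i,j}\right)}{\Gamma\left(1/\kappa^{i,j}\right)\,\Gamma\left(y^{i,j} + 1\right)} \left( \frac{1}{1 + \mu^{i,j}\kappa^{i,j}} \right)^{1/\kappa^{i,j}} \left( \frac{\mu^{i,j}\kappa^{i,j}}{1 + \mu^{i,j}\kappa^{i,j}} \right)^{y^{i,j}} \qquad (3)$$

where $y^{i,j}$ is the observed RF or mRNA read count of the $j$th replicate of gene $i$; $\kappa^{i,j}$ is the dispersion parameter of the NB distribution from which $y^{i,j}$ is drawn; and $\mu^{i,j}$ is the estimated count of the $j$th replicate. Thus, the logarithmic likelihood of the negative binomial for gene $i$ is given by

$$\ell(\kappa^i) = \sum_{j} \log f_{\mathrm{NB}}\left(y^{i,j}; \mu^{i,j}, \kappa^{i,j}\right) - \frac{1}{2} \log \det\left( X^{\top} \, \mathrm{diag}\left( \frac{1}{1/\mu^{i} + \kappa^{i}} \right) X \right) \qquad (4)$$

Note that the likelihood function is adjusted by the Cox-Reid term, as suggested by Robinson *et al.* [12], to compensate for the bias from estimating the coefficients when fitting the GLM. Here, $X$ is the explanatory matrix with dimension $n \times 4$ or $n \times 5$, depending on $H_0$ or $H_1$, where $n$ is the total number of replicates of RF and mRNA data; $\mu^i$ is the vector of estimated counts; $\kappa^i$ is the dispersion vector.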
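As a minimal sketch, the negative binomial log-likelihood in the mean/dispersion parameterization can be evaluated with the log-gamma function. The sketch below omits the Cox-Reid adjustment and uses the standard NB probability mass function with mean $\mu$ and variance $\mu + \kappa\mu^2$; the function names are hypothetical.

```python
import math

def nb_logpmf(y, mu, kappa):
    """Log-probability of count y under NB with mean mu and dispersion kappa."""
    r = 1.0 / kappa  # "size" parameter of the NB distribution
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu))    # equals (1/kappa) * log(1 / (1 + mu*kappa))
            + y * math.log(mu / (r + mu)))  # equals y * log(mu*kappa / (1 + mu*kappa))

def nb_loglik(ys, mus, kappas):
    """Unadjusted log-likelihood of one gene: sum over its replicates."""
    return sum(nb_logpmf(y, m, k) for y, m, k in zip(ys, mus, kappas))

# Sanity check: the probabilities sum to one and the mean equals mu.
total = sum(math.exp(nb_logpmf(y, mu=5.0, kappa=0.5)) for y in range(2000))
mean = sum(y * math.exp(nb_logpmf(y, mu=5.0, kappa=0.5)) for y in range(2000))
print(total, mean)
```

Working on the log scale with `math.lgamma` avoids overflow of the Gamma function for large counts.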

### D Empirical Bayes shrinkage for obtaining final dispersion

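We follow the recently published approach [8] to obtain the final dispersion $\kappa^i$. The approach assumes, based on observation, that the dispersion follows a log-normal prior distribution [15] centered at the fitted dispersion $\kappa_F^i$, which is obtained from the dispersion-mean relationship $\kappa = f(\mu) = \lambda_1/\mu + \lambda_0$ (see the main text). The final dispersion $\kappa^i$ can be estimated by maximizing the following equation:

$$\kappa^i = \arg\max_{\kappa} \left( \ell(\kappa) - \frac{\left( \log \kappa - \log \kappa_F^i \right)^2}{2 \sigma_P^2} \right) \qquad (5)$$

where $\ell(\kappa)$ is the adjusted logarithmic likelihood of Section C and $\sigma_P^2$ is the variance of the logarithmic residual between the prior and the fitted dispersion $\kappa_F^i$. Moreover, the variance ($\sigma_R^2$) of the logarithmic residual between the raw dispersion $\kappa_R^i$ and $\kappa_F^i$ comprises 1) the variance of the sampling distribution of the logarithmic dispersion, $\sigma_S^2$, and 2) $\sigma_P^2$. The $\sigma_S^2$ can be approximately obtained from a trigamma function:

$$\sigma_S^2 = \psi_1\left( \frac{m - d}{2} \right) \qquad (6)$$

where $m$ is the number of samples and $d$ is the number of coefficients. The $\sigma_R$, in turn, is calculated as the median absolute deviation (mad) of the logarithmic residuals between pairs of $\kappa_R^i$ and $\kappa_F^i$:

$$\sigma_R = \underset{i}{\mathrm{mad}} \left( \log \kappa_R^i - \log \kappa_F^i \right) \qquad (7)$$

Therefore, we can get $\sigma_P^2$ by $\sigma_P^2 = \sigma_R^2 - \sigma_S^2$ and obtain the final dispersion by maximizing the posterior in equation 5.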
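The variance decomposition above can be sketched with the standard library alone. Several details are assumptions made for illustration and are not taken from the text: the trigamma function is evaluated at $(m - d)/2$ as in [8], the mad is scaled by 1.4826 for consistency with a normal standard deviation, and the result is floored at a small positive value. All function names are hypothetical.

```python
import math
from statistics import median

def trigamma(x):
    """Trigamma function via upward recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:  # psi1(x) = psi1(x + 1) + 1 / x^2
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7)
    result += inv + inv2 / 2.0 + inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return result

def mad(values):
    """Median absolute deviation with the usual normal-consistency scaling."""
    values = list(values)
    med = median(values)
    return 1.4826 * median([abs(v - med) for v in values])

def prior_variance(raw_disp, fitted_disp, n_samples, n_coef, floor=1e-2):
    """Variance of the log-normal prior: residual variance minus sampling variance."""
    residuals = [math.log(r) - math.log(f) for r, f in zip(raw_disp, fitted_disp)]
    var_residual = mad(residuals) ** 2
    var_sampling = trigamma((n_samples - n_coef) / 2.0)
    # ASSUMPTION: floor the difference so that the prior variance stays positive.
    return max(var_residual - var_sampling, floor)

# Hypothetical raw and fitted dispersions for five genes, 10 samples, 5 coefficients.
raw = [0.02, 0.30, 0.004, 1.50, 0.08]
fit = [0.05, 0.05, 0.050, 0.05, 0.05]
print(prior_variance(raw, fit, n_samples=10, n_coef=5))
```

A small prior variance pulls the final dispersions strongly towards the fitted trend, whereas a large one leaves the raw estimates almost unchanged.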

### E Estimating dispersion for different sequencing protocols separately

As experimental procedures for measuring mRNA and RF abundances can vary, for example when the samples are sequenced on different platforms, *RiboDiff* can use separate dispersion parameters $\kappa$ for the different data sources. Here we show an example in which estimating $\kappa$ separately is needed. The example data are from a recent publication [4].

The empirical dispersions for RNA-Seq and RF counts are calculated from the following equation [8, 11, 1, 9]:

Fig. 2 shows the mean-dispersion relationship. It demonstrates the deviation of the empirical dispersion between RNA-Seq and ribosome footprint data in this experimental setting. The deviation between these two data sets becomes smaller as the count increases.
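As an illustration of how an empirical dispersion can be computed per gene, the sketch below uses a method-of-moments estimator derived from the NB variance relation $\sigma^2 = \mu + \kappa\mu^2$. This estimator is an assumption made for the example and may differ from the exact equation used for Fig. 2; the function name and the counts are hypothetical.

```python
from statistics import mean, variance

def empirical_dispersion(replicate_counts):
    """Method-of-moments dispersion from normalized replicate counts of one gene."""
    mu = mean(replicate_counts)
    if mu <= 0:
        return float("nan")  # dispersion is undefined without any reads
    var = variance(replicate_counts)  # unbiased sample variance
    # Solve var = mu + kappa * mu^2 for kappa; clip at zero for under-dispersed data.
    return max((var - mu) / (mu * mu), 0.0)

# Hypothetical normalized counts of the same gene in the two data sources.
kappa_mrna = empirical_dispersion([950, 1000, 1050])
kappa_rf = empirical_dispersion([50, 100, 150])
print(kappa_mrna, kappa_rf)
```

Computing the estimate separately for RNA-Seq and RF counts makes it possible to plot both against the mean count and compare the two trends.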

### F Data simulation

We simulated the RF and mRNA read counts for 2,000 genes, with 500 genes showing down-regulated translational efficiency and 500 genes showing up-regulated translational efficiency. There are three replicates for each of the two treatments in both the “ribosome profiling” and “RNA-Seq” counts. Therefore, the dimension of the count matrix is 2,000 × 12.

We first generated the mean counts for the two treatments of both RF and mRNA across all 2,000 genes, assuming they are randomly drawn from a negative binomial distribution with parameters $n$ and $p$, where $n = 1/\kappa$ and $p = n/(n + \mu)$. Then, for each mean count $\mu^i$, we generated three count values as three replicates from a negative binomial distribution with parameters $\mu^i$ and $\kappa^i$, where $\kappa^i$ is calculated as $\kappa^i = f(\mu^i) = \lambda_1/\mu^i + \lambda_0$. To simulate the genes with TE changes between the two treatments, we add a fold difference to the mean count of the target genes, assuming the fold changes follow a gamma distribution as observed in real data (GEO accession GSE56887). The gamma distribution has a shape parameter $\alpha$ and a scale parameter $s$, and its mean is $\mu_G = \alpha \cdot s$. In the following simulation, we fix $s$ and only specify different $\alpha$ to give genes different fold changes of their means. The fold increase $F_I$ and the fold decrease $F_D$ are both derived from $X_G$, a random vector containing 500 elements generated from a gamma density function.

Here, we simulated five groups of count data; in every group, 1,000 out of 2,000 genes show TE changes:

1. the mean count has a fold change only for the RF count, with $\alpha = 0.8$;
2. the mean count has a fold change only for the mRNA count, with $\alpha = 0.6$;
3. the mean count has a fold change only for the RF count, with $\alpha = 1.5$;
4. the mean count has a fold change only for the mRNA count, with $\alpha = 1.5$;
5. the mean count has a fold change for RF with $\alpha = 0.8$ and for mRNA with $\alpha = 0.6$, referred to as “combined” in Fig. 3.

Note that in the last group, if a gene has a fold increase in RF, it must have a fold decrease in mRNA. In this way, the effect at the mRNA level adds to the TE change instead of offsetting the effect caused by RF. Other simulation parameters are as follows: for all RF and mRNA, $n = 1$, $\beta_1 = 0.1$, $\beta_2 = 0.0001$, $s = 0.5$. The parameter $p$ controls the scale of the count; we use 0.008 for RF and 0.0002 for mRNA. We ran *RiboDiff* on the five groups of data sets to estimate the sensitivity and specificity (Fig. 3). We also compared the performance of *RiboDiff* with that of *Babel* [10] using the simulated data of the combined setting.
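Two ingredients of the simulation can be sketched with the standard library: the conversion between the (mean, dispersion) and $(n, p)$ parameterizations, and the gamma-distributed fold changes. The mapping from the gamma draws to fold changes used below ($1 + X_G$ for an increase and its reciprocal for a decrease) is an assumption made only for illustration, as are the function names, the random seed and the trend parameters.

```python
import random

def trend_dispersion(mu, lam1, lam0):
    """Mean-dispersion relationship kappa = lam1 / mu + lam0."""
    return lam1 / mu + lam0

def nb_n_p(mu, kappa):
    """Convert (mean, dispersion) to the (n, p) parameterization of the NB."""
    n = 1.0 / kappa
    return n, n / (n + mu)

rng = random.Random(0)  # fixed seed, hypothetical
alpha, s = 0.8, 0.5     # gamma shape and scale (values of the first simulated group)

# 500 gamma draws; random.gammavariate takes (shape, scale), so the mean is alpha * s.
x_g = [rng.gammavariate(alpha, s) for _ in range(500)]
fold_increase = [1.0 + x for x in x_g]          # ASSUMPTION: F_I = 1 + X_G
fold_decrease = [1.0 / (1.0 + x) for x in x_g]  # ASSUMPTION: F_D = 1 / (1 + X_G)

# Dispersion and (n, p) for a gene with mean count 100 under a hypothetical trend.
kappa = trend_dispersion(100.0, lam1=0.1, lam0=0.0001)
n, p = nb_n_p(100.0, kappa)
print(kappa, n, p)
```

Under this parameterization the NB mean is $n(1 - p)/p$, which recovers the requested mean count and provides a quick consistency check on the conversion.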

## Acknowledgements

This work was funded by the Marie Curie ITN framework (Grant # PITN-GA-2012-316861), MSKCC, the National Cancer Institute (R01-CA142798-01 to H.-G.W.) and the Experimental Therapeutics Center (H.-G.W.).