Dataset-specific thresholds significantly improve detection of low transcribed regulatory genes in polysome profiling experiments

Motivation Polysome profiling is novel, and yet has proved to be an effective approach to detect mRNAs with differential ribosomal load and explore the regulatory mechanisms driving efficient translation. Genes encoding regulatory proteins, having a great influence of the organism, usually reveal moderate to low transcriptional levels, compared, for example, to genes of house-keeping machinery. This complicates the reliable detection of such genes in the presence of technical and/or biological noise. Results In this work we investigate how cleaning of polysome profiling data on Arabidopsis thaliana influences the ability to detect genes with low level of total mRNA, but with a highly differential ribosomal load, i.e. genes translationally active. Suggested data modelling approach to identify a background level of mRNA counts individually for each dataset, shows higher power in detection of low transcribed genes, compared to the use of thresholds for the minimal required mRNA counts or the use of raw data. The significant increase in detected number of regulation–related genes was demonstrated. The described approach is applicable to a wide variety of RNA-seq data. All identified and classified mRNAs with high and low translation status are made available in supplementary material.


34
Investigation of the mechanisms underlying differential gene expression is one of the fundamental tasks basic idea behind all of these techniques is to separate mRNA in a quiet state (monosomal fraction) and 45 active state, i.e. mRNA heavily loaded with ribosomes (polysomal fraction), followed by sequencing or 46 hybridizing on chips [3]. The resulting quantitative measure of translational state allows a better correlation of the number of mRNA transcripts and the observed protein levels [5]. Additionally, such 48 data can be used to investigate regulatory mechanisms of the observed differential translation.

49
There are a number of programs used for analysis of ribosome sequencing data, most of which were 50 originally developed for the analysis of gene transcription [6][7][8]. The major problem of the mathematical 51 methods behind these programs is the estimation of the variance, that is the key point for the represented as a sum of two independent random variables, one following negative binomial distribution 143 and the other exponential:

146
In other words, it is assumed that every measured mRNA count value contains real and random parts.

147
It is not possible to decompose each value of mRNA count into two components due to the random nature 148 of the process, but one can estimate the maximum contribution of the exponential part and then subtract 149 it from the raw value. It is possible, because the contribution of the binomial part with its peak around 150 3000 is negligible at low values, therefore it will be assumed that points with very low values are of pure 151 random nature.

152
The exponent distribution has one parameter and can be found by fitting the exponential model into 153 data below ten counts (first several points on the red curve, fig. 2). Having built the exponential model 154 (grey dashed curve, fig.2), one can extrapolate the curve to the point where the exponent drops to some 155 acceptably low value, or in other words, solve for m the equation e -αm =10 -3 , where α is the estimated 156 decay parameter. For example, the exponent equals 10 -3 when mRNA count equals 24 for total mRNA 157 fraction. That means, that one mRNA out of thousand with the count value of 24 is expected to appear by 158 chance. The value of 24 can be used as a threshold for the minimal required counts instead of pre-defined 8 159 threshold [7,12,13]. But following our logic, that the observed counts consist of two independent 160 components, this value should be subtracted from all raw mRNA count values to maximally exclude 161 possible random effect. If the resulting value is negative, a zero value is assigned:

208
It is evident from the table, that the data cleaning step is essential for detection of genes with regulatory 209 function. The suggested cleaning via subtraction of the "noisy counts" results in detection of more 210 genes, moreover, the percentage of regulation-related genes has also slightly increased. The results also 211 support our hypothesis, that regulatory genes tend to show only moderate levels of transcription, but 212 the most significant overrepresentation is observed for the data cleaned by subtraction (table 1).

213
Comparison of the identified gene sets revealed 122 genes found only using the data cleaned by 214 subtraction, 72 genes found only by raw data and 155 genes found by both (gene lists are available in 215 supplementary material). Focusing on genes annotated with "signal" term the corresponding numbers 216 will be 39, 18, 56 (cleaned, raw and both datasets). This demonstrates, that the data cleaning procedure 217 objectively extends the number of identified genes of interest. For example, there are such genes like 218 root meristem growth factor (RGF3, AT2G04025), embryo-specific protein (ATS3, AT5G62210), 219 transmembrane protein (DUF1191, AT4G23720) and many others directly related to gene regulation and 220 signal transduction, all found exclusively after the suggested data cleaning.

221
It is interesting to note, that the commonly accepted approach to remove mRNA with counts below 222 some pre-defined threshold leads to significantly fewer genes even compared to the raw data (table 1) 223 and therefore, it was not used in the above comparisons. We also do not apply conventional pre-224 selected thresholds for the counts for the following reasons. First, the variation of those is quite 225 significant and ranges from just a few in most studies [7,9]  The use of functional classification of genes like Gene Ontology is practical to give a quick overview on 251 underlying differences in functionality of the investigated genes. Here the resource PANTHER v.14.0 [17] 252 was used to classify the mRNAs in four datasets. These datasets were compiled using "symmetrical" 253 criteria to the criterion defined above. Particularly, mRNA are classified according to the level of 254 transcription into low and high (N total,i ≤300 and N total,i ≥1200, respectively) and expression and the significant influence of experimental and biological noise. One way to overcome this 300 is to investigate target genes with strong expression and apply reverse engineering or use databases of 301 regulatory pathways to find the regulators. Direct methods utilize complex mathematical models to 302 discern weak signals of regulation.

303
The data cleaning procedure suggested here is assumed not to further complexify the methods, but to 304 "personalize" parameters, used to dissect noise and real values. The idea consists in defining a 305 maximum contribution, which could originate from technical or biological noise, with a subsequent 306 subtraction of that value from the raw measurements. This is different to other approaches, where only 307 values below some noise threshold are removed and the rest is left intact. As shown in the results, the 308 suggested cleaning procedure increases the number of detected genes with differential expression.

309
Moreover, the ratio of genes with regulatory functions is also increased after suggested data cleaning.

310
We believe that data modelling should be used to define dataset-specific thresholds and the use of