Abstract
It is common to measure a large number of features in parallel to identify those that differ between two experimental conditions, for example in the search for differentially expressed genes using microarrays or RNA-Seq. Ranking features by p-value allows for control of the type I error, but p-values are not reliable when there are very few replicates, and investigators typically require that features be ranked by “fold change” in conjunction with p-values. At first glance the fold change appears to be a natural quantity on which to compare the differential behavior of features. However, it is highly sensitive to small values in the denominator, and it is problematic in that it equates changes in small numbers with changes in large numbers, such as a change from 1 to 2 versus a change from 100 to 200. Adding one to all values is a widely used heuristic for mitigating these problems with fold change, but it can be far from optimal. A systematic strategy for determining an optimal value (pseudocount) to add is employed using both real and simulated benchmark data. In RNA-Seq a value of 20 appears to be close to optimal in all cases. Another strategy is to sort by difference, but this is problematic when comparing measurements across a wide spectrum, because proportionally large differences between small values rank below proportionally smaller differences between large values. An abstract mathematical framework is introduced to describe the problem of ranking by differential effect size, enabling us to study the ranking problem in general rather than in specific contexts such as fold change or difference. From this framework we discovered a remarkable property of pseudocounts: they strike a balance between sorting by fold change and sorting by difference. Lastly, a fundamentally different type of application is presented, which is to rank di-codons by their differential abundance in the ORFeome of different species.
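The balance between fold change and difference can be illustrated with a minimal sketch. It assumes the pseudocount-adjusted score for a feature with values a and b is (b + c) / (a + c), which equals 1 + (b - a) / (a + c): with c = 0 this is the plain fold change, and as c grows the ordering approaches the ordering by difference b - a. The function names and the toy values below are hypothetical and are not taken from the study; the value 20 is used only because the abstract reports it as near-optimal for RNA-Seq.

```python
import math


def adjusted_log_ratio(a, b, pseudocount):
    """Log2 of the pseudocount-adjusted ratio (b + c) / (a + c).

    Assumed form of the adjustment. With pseudocount 0 and a == 0 this
    raises ZeroDivisionError, which is the small-denominator problem.
    """
    return math.log2((b + pseudocount) / (a + pseudocount))


def rank_features(features, pseudocount):
    """Return feature names sorted by decreasing absolute adjusted log ratio.

    `features` maps a name to a pair (a, b) of values in the two conditions.
    """
    return sorted(
        features,
        key=lambda name: abs(adjusted_log_ratio(*features[name], pseudocount)),
        reverse=True,
    )


# Hypothetical toy values, chosen so the three rankings disagree.
toy = {
    "low": (1, 4),         # fold change 4.0, difference 3
    "mid": (20, 50),       # fold change 2.5, difference 30
    "high": (1000, 1100),  # fold change 1.1, difference 100
}

by_fold_change = rank_features(toy, 0)        # plain fold change
by_pseudocount = rank_features(toy, 20)       # intermediate behavior
by_large_count = rank_features(toy, 10 ** 6)  # approaches ranking by difference

print(by_fold_change)
print(by_pseudocount)
print(by_large_count)
```

In this toy case the plain fold change favors the low-valued feature, a very large pseudocount favors the high-valued feature with the largest difference, and the pseudocount of 20 places the mid-range feature first.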
Footnotes
E-mail address: nsoum{at}upenn.edu