## Abstract

Existing methods for quantifying transcript abundance require a fundamental compromise: either use high quality read alignments and experiment-specific models or sacrifice them for speed. We introduce Salmon, a quantification method that overcomes this restriction by combining a novel ‘lightweight’ alignment procedure with a streaming parallel inference algorithm and a feature-rich bias model. These innovations yield both exceptional accuracy and order-of-magnitude speed benefits over traditional alignment-based methods.

Estimating transcript abundance across cell types, species, and conditions is a fundamental task in genomics. For example, these estimates are used for the classification of diseases and their subtypes [1], for understanding expression changes during development [2], and tracking the progression of cancer [3]. Efficient quantification of transcript abundance from RNA-seq data is an especially pressing problem due to the exponentially increasing number of experiments and the growing adoption of expression data for medical diagnosis [4]. However, various methods that address this problem achieve accurate results at the cost of requiring significant computational resources and do not scale well with the rate at which data is produced [5]. The recently developed quantification tool Sailfish [6] achieves an order of magnitude speed improvement over previous approaches, but Sailfish can sometimes produce slightly less accurate estimates for paired-end data or for stranded protocols and does not take advantage of high quality alignment information and experiment-specific models.

We introduce a quantification procedure, called Salmon (**Supplementary Fig. 1**), that achieves best-in-class accuracy, takes advantage of high quality alignment information and experiment-specific models and provides the same order-of-magnitude speed benefits as Sailfish. Using synthetic data from both the RSEM simulator [7] and the Flux Simulator [8] as well as experimental quantitative PCR data [9], we show that Salmon generally outperforms Sailfish and eXpress [10] with respect to accuracy (**Fig. 1a-b,e**; **Supplementary Tables 1&2**) and is also faster than Sailfish (**Fig. 1c**). The transcript abundance estimation problem is particularly difficult for genes with many isoforms since reads derived from these genes can map to many more transcripts, and we find that Salmon is also generally more accurate in this case (**Fig. 1d**). Salmon is designed to run in parallel so that the procedure scales better with the number of reads in an experiment. Salmon can quantify abundance either via a lightweight alignment procedure (**Online methods, Lightweight alignment** and **Supplementary Fig. 2**), or using pre-computed alignments provided in `SAM` or `BAM` format — we find that the quantification accuracy is robust to this choice of input (**Supplementary Fig. 3**). Salmon is also typically more accurate than a recent unpublished procedure Kallisto (**Supplementary Figs. 4&5, Supplementary Table 1**).

An innovation contributing to Salmon’s speed and accuracy is its novel lightweight alignment procedure. Salmon attempts to find a chain of super-maximal exact matches (SMEMs) and maximal exact matches (MEMs) to the transcriptome that cover a read. A maximal exact match is a substring that is shared by the read and reference transcript that cannot be extended in either direction without introducing a mismatch, and a super-maximal exact match [11] is a MEM that is not contained within any other MEM on the query. Salmon’s lightweight alignment procedure finds co-linear chains of SMEMs. The SMEMs in these chains need only be approximately consistent, in the sense that the sizes of the gaps between SMEMs in the read and the transcript need not be identical (**Online methods, Lightweight alignment** and **Supplementary Fig. 2**). Using a Burrows-Wheeler-based index, this approach computes much more accurate alignments than k-mer matching at a speed much faster than full alignment (**Fig. 1c**). This approach overcomes potential inaccuracies of using k-mers as in Sailfish while providing some of the benefits of a full alignment. If errors or mutations are uniformly distributed in a read, very few k-mers may map to a transcript even if the read and the transcript share a high-quality alignment. Salmon’s improvement in overall accuracy may be due in large part to lightweight alignment, since a modification of Sailfish that incorporates this type of efficient alignment starts to approach Salmon’s accuracy (**Supplementary Figs. 6&7, Supplementary Table 1**). The primary insight behind lightweight alignment is that achieving accurate quantification of transcript abundance from RNA-seq data does not require knowing the optimal alignment between the sequenced fragment and the transcript for every potential locus of origin. Rather, it is sufficient to identify the transcripts and positions within them that match the fragments reasonably well.

Salmon also incorporates a rich model of experimental biases, which allows it to account for the effects of experiment-specific parameters and biases including non-uniform read mapping at transcript start sites, strand-specific protocols, and the fragment length distribution. These biases are automatically learned in the online phase of the algorithm, and are encoded in a fragment-transcript agreement model (**Online methods, Fragment-transcript agreement model**). In this model, fragment-transcript assignment scores are defined as proportional to (1) the chance of observing a fragment length given a particular transcript/isoform of a gene, (2) the chance that a fragment starts at a particular position on the transcript, (3) the concordance of the fragment alignment with a user-defined sequencing library format (e.g. a paired-end, stranded protocol), and (4) the chance that the fragment came from the transcript based on a score obtained from the lightweight alignment procedure. Salmon additionally incorporates these biases and experimental parameters by maintaining ‘rich equivalence classes’ of fragments (**Online methods, Equivalence classes**) that contain the information in these models and speed up the process of estimating transcript abundances.

Salmon’s two-phase parallel inference procedure (**Online methods, Online phase** and **Offline phase**; Illustration of method in **Supplementary Fig. 1**) allows it to scale well with the number of reads in an experiment and make use of large multicore machines that are already commonly used to run bioinformatics pipelines. For example, Salmon can quantify a data set of approximately 200 million reads in approximately 5 minutes using 64 cores. Unlike the Sailfish k-mer-based index, the parameters for lightweight alignment (e.g. the fraction of the read required to be covered, or the minimum length MEMs considered in chains) can be modified without re-building the index, allowing for rapid experimentation of quantification parameters. As an alternative to computing lightweight alignments, Salmon’s design also allows the user to provide alignments that have already been computed and uses an alternative alignment scoring model in this case (**Online methods, Alignment model**).

The insight behind Salmon’s lightweight alignment approach and sophisticated inference model allows for the use of more sequence information in the read and produces some of the most accurate expression estimates to date. Salmon’s ability to compute high quality estimates of transcript abundances at the previously prohibitive scale of thousands of experiments will also enable individual expression experiments to be interpreted in the context of many rapidly growing sequence expression databases. This will allow for a more comprehensive comparison of the similarity of experiments across large populations of individuals across different environmental conditions and cell types.

## Author Contributions

R.P. and C.K. designed the method, which was implemented by R.P. R.P., G.D., and C.K. designed the experiments, and R.P. and G.D. conducted the experiments. R.P., G.D., and C.K. wrote the manuscript.

## Online methods

### Objectives and models for abundance estimation

Assume that, for a particular sequencing experiment, the underlying true transcriptome is given as $\mathcal{T} = \{(t_1, \ldots, t_M), (c_1, \ldots, c_M)\}$, where each $t_i$ is the nucleotide sequence of some transcript (an isoform of some gene) and each $c_i$ is the corresponding number of copies of $t_i$ in the sample. Further, we denote by $\ell(t_i)$ the length of transcript $t_i$.

The model of the sequencing experiment dictates that, in the absence of experimental bias, library fragments are sampled proportional to $c_i \cdot \ell(t_i)$. That is, the probability of drawing a sequencing fragment from some position on a particular transcript $t_i$ is proportional to the total fraction of all nucleotides in the sample that originate from a copy of $t_i$. This quantity is called the nucleotide fraction [12]:

$$\eta_i = \frac{c_i \cdot \ell(t_i)}{\sum_{j=1}^{M} c_j \cdot \ell(t_j)}$$

The true nucleotide fractions, $\boldsymbol{\eta}$, though not directly observable, would provide us with a way to measure the true relative abundance of each transcript in our sample. Specifically, if we normalize the $\eta_i$ by the transcript length $\ell(t_i)$, we obtain a quantity called the transcript fraction [12]:

$$\tau_i = \frac{\eta_i / \ell(t_i)}{\sum_{j=1}^{M} \eta_j / \ell(t_j)}$$

These $\tau_i$ can be used to immediately compute common measures of relative transcript abundance like transcripts per million (TPM). The TPM measure for a particular transcript is the number of copies of this transcript we would expect to exist in a collection of one million transcripts, assuming this collection had exactly the same distribution of abundances as our sample. The TPM for transcript $t_i$ is given by

$$\mathrm{TPM}_i = \tau_i \cdot 10^{6}$$

Of course, in a real sequencing experiment, there are numerous biases, confounding factors, and sampling effects that may alter the above assumptions. Accounting for them is important for accurate inference, as we discuss below.

Given a collection of observations (raw sequenced fragments or alignments thereof), and a model similar to the one described above, there are numerous approaches to inferring the relative abundance of the transcripts in the target transcriptome, $\mathcal{T}$. Here we describe two basic inference schemes, both available in Salmon, which are commonly used to perform inference in models similar to the one defined above.
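The relationship between copy numbers, nucleotide fractions, transcript fractions, and TPM can be checked numerically. The following is a minimal sketch, not part of Salmon; the function names and the three-transcript example values are hypothetical and chosen only for illustration.

```python
def nucleotide_fractions(copies, lengths):
    """eta_i = c_i * l(t_i) / sum_j c_j * l(t_j)."""
    weighted = [c * l for c, l in zip(copies, lengths)]
    total = sum(weighted)
    return [w / total for w in weighted]


def transcript_fractions(eta, lengths):
    """tau_i: eta_i normalized by transcript length, rescaled to sum to 1."""
    per_base = [e / l for e, l in zip(eta, lengths)]
    total = sum(per_base)
    return [p / total for p in per_base]


def tpm(tau):
    """Transcripts per million: tau_i * 10^6."""
    return [t * 1e6 for t in tau]


# Hypothetical sample: copy numbers and lengths for three transcripts.
copies = [10, 30, 60]
lengths = [1000, 2000, 500]

eta = nucleotide_fractions(copies, lengths)
tau = transcript_fractions(eta, lengths)
tpm_values = tpm(tau)
```

In this example the longest transcript has the largest nucleotide fraction, while the transcript fractions recover the relative copy numbers (10%, 30%, 60%) because the length effect has been divided out.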

### Maximum likelihood objective

The first scheme takes a maximum likelihood approach to solving for the quantities of interest. Specifically, if we assume that all fragments are generated independently and we are given a vector of known nucleotide fractions $\boldsymbol{\eta}$, a binary matrix of transcript-fragment assignments $\mathbf{Z}$ where $z_{ji} = 1$ if fragment $j$ is derived from transcript $i$, and the set of transcripts $\mathcal{T}$, we can write the probability of observing a set of $N$ sequenced fragments $\mathcal{F}$ as:

$$\Pr\{\mathcal{F} \mid \boldsymbol{\eta}, \mathbf{Z}, \mathcal{T}\} = \prod_{j=1}^{N} \Pr\{f_j \mid \boldsymbol{\eta}, \mathbf{Z}, \mathcal{T}\} = \prod_{j=1}^{N} \sum_{i=1}^{M} \Pr\{t_i \mid \boldsymbol{\eta}\} \cdot \Pr\{f_j \mid t_i, z_{ji} = 1\}$$

Here $\Pr\{f_j \mid t_i, z_{ji} = 1\}$ is the probability of generating fragment $j$ given that it came from transcript $i$. We will use $\Pr\{f_j \mid t_i\}$ as shorthand for $\Pr\{f_j \mid t_i, z_{ji} = 1\}$ since $\Pr\{f_j \mid t_i, z_{ji} = 0\}$ is uniformly 0. The determination of $\Pr\{f_j \mid t_i\}$ is defined in further detail in **Fragment-transcript agreement model**. The likelihood associated with this objective can be optimized using the EM algorithm as in [12].
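To make the maximum likelihood objective concrete, the sketch below runs a plain fragment-level EM on the likelihood above, taking $\Pr\{t_i \mid \boldsymbol{\eta}\} = \eta_i$. It is an illustrative toy under stated assumptions, not Salmon's implementation: the function name, the fixed iteration count, and the six-fragment example are all hypothetical, and the conditional probabilities are assumed to be given.

```python
def em_nucleotide_fractions(cond_probs, n_transcripts, n_iters=200):
    """Fragment-level EM for eta.

    cond_probs[j] maps transcript index i -> Pr{f_j | t_i}; transcripts
    that a fragment does not map to are simply absent (probability 0).
    """
    eta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iters):
        counts = [0.0] * n_transcripts
        for frag in cond_probs:
            # E-step: posterior responsibility of each transcript for f_j.
            denom = sum(eta[i] * p for i, p in frag.items())
            for i, p in frag.items():
                counts[i] += eta[i] * p / denom
        # M-step: eta is the normalized expected fragment count.
        total = sum(counts)
        eta = [c / total for c in counts]
    return eta


# Hypothetical data: 3 fragments unique to t_0, 1 unique to t_1, and
# 2 fragments that map equally well to both transcripts.
fragments = [{0: 1.0}] * 3 + [{1: 1.0}] + [{0: 0.5, 1: 0.5}] * 2
eta_hat = em_nucleotide_fractions(fragments, n_transcripts=2)
```

The ambiguous fragments are split in proportion to the current estimates, so the uniquely mapping fragments determine where the iteration settles (here at 0.75 and 0.25).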

### Bayesian objective

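One can also take a Bayesian approach to transcript abundance inference as done in [13, 14]. In this approach, rather than directly seeking maximum likelihood estimates of the parameters of interest, we want to infer the posterior distribution of $\boldsymbol{\eta}$. In the notation of [13], we wish to infer $\Pr\{\boldsymbol{\eta} \mid \mathcal{F}, \mathcal{T}\}$, the posterior distribution of nucleotide fractions given the transcriptome $\mathcal{T}$ and the observed fragments $\mathcal{F}$. This distribution can be written as:

$$\Pr\{\boldsymbol{\eta} \mid \mathcal{F}, \mathcal{T}\} \propto \sum_{\mathbf{Z} \in \mathcal{Z}} \Pr\{\mathcal{F} \mid \mathcal{T}, \mathbf{Z}\} \, \Pr\{\mathbf{Z} \mid \boldsymbol{\eta}\} \, \Pr\{\boldsymbol{\eta}\}$$

where

$$\Pr\{\mathbf{Z} \mid \boldsymbol{\eta}\} = \prod_{j=1}^{N} \prod_{i=1}^{M} \eta_i^{z_{ji}}$$

and

$$\Pr\{\mathcal{F} \mid \mathcal{T}, \mathbf{Z}\} = \prod_{j=1}^{N} \prod_{i=1}^{M} \Pr\{f_j \mid t_i\}^{z_{ji}}$$

Unfortunately, direct inference on the distribution $\Pr\{\boldsymbol{\eta} \mid \mathcal{F}, \mathcal{T}\}$ is intractable because its evaluation requires the summation over the exponentially large latent variable configuration space $\mathcal{Z}$. Since the posterior distribution cannot be directly estimated, we must rely on some form of approximate inference. One particularly attractive approach is to apply variational Bayesian (VB) inference in which some tractable approximation to the posterior distribution is assumed.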

Subsequently, one seeks the parameters for the approximate posterior under which it best matches the true posterior. Essentially, this turns the inference problem into an optimization problem — finding the optimal set of parameters — which can be efficiently solved by a number of different algorithms. In particular, variational inference seeks to find the parameters for the approximate posterior that minimizes the Kullback-Leibler (KL) divergence between the approximate and true posterior distribution. Though the true posterior may be intractable, this minimization can be achieved by maximizing a lower-bound on the marginal likelihood of the posterior distribution [15], written in terms of the approximate posterior. Salmon optimizes the collapsed variational Bayesian objective [13] in its online phase and the full variational Bayesian objective [14] in the variational Bayesian mode of its offline phase (see **Offline phase**).

### Fragment-transcript agreement model

We model the conditional probability $\Pr\{f_j \mid t_i\}$ for generating $f_j$ given $t_i$ using a number of auxiliary terms. These terms come from auxiliary models whose parameters do not explicitly depend upon the current estimates of transcript abundances. Thus, once the parameters of these models have been learned and are fixed, these terms do not change even when the estimate for $\Pr\{t_i \mid \boldsymbol{\eta}\} = \eta_i$ needs to be updated. Salmon uses the following auxiliary terms:

$$\Pr\{f_j \mid t_i\} = \Pr\{\ell \mid t_i\} \cdot \Pr\{p \mid t_i, \ell\} \cdot \Pr\{o \mid t_i\} \cdot \Pr\{a \mid f_j, t_i, p, o, \ell\}$$

where $\Pr\{\ell \mid t_i\}$ is the probability of drawing a fragment of the inferred length given $t_i$, and is evaluated based on an observed empirical fragment length distribution. $\Pr\{p \mid t_i, \ell\}$ is the probability of the fragment starting at position $p$ on $t_i$, computed using an empirical fragment start position distribution as defined in [12]. $\Pr\{o \mid t_i\}$ is the probability of obtaining a fragment aligning with the given orientation to $t_i$. This is determined by the concordance of the fragment with the user-specified library format. It is 1 if the alignment agrees with the library format and a user-defined prior value $p_{\bar{o}}$ otherwise. Finally, $\Pr\{a \mid f_j, t_i, p, o, \ell\}$ is the probability of generating alignment $a$ of fragment $f_j$, given that it is drawn from $t_i$, with orientation $o$, starting at position $p$, and of length $\ell$; this term is defined as the coverage score (see Algorithms, Lightweight Alignment) for lightweight alignments, and is given by equation (6) for traditional alignments.

The parameters for all auxiliary models are learned during the streaming phase of the inference algorithm from the first $N'$ observations (5,000,000 by default). These auxiliary terms can then be applied to all subsequent observations.
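As an illustration of how the four auxiliary terms combine, the sketch below multiplies them for a single fragment-transcript pair. It is not Salmon's code: the function names, the default value of the orientation prior, the tiny fragment length distribution, and the uniform start-position model (used here as a stand-in for the empirical start position distribution) are all assumptions made for the example.

```python
def uniform_start(transcript_len):
    """Hypothetical stand-in for the empirical start-position model:
    every valid start position is equally likely."""
    def prob(start, frag_len):
        n_valid = transcript_len - frag_len + 1
        if n_valid <= 0 or not 0 <= start < n_valid:
            return 0.0
        return 1.0 / n_valid
    return prob


def agreement_score(frag_len, start, orientation_ok, aln_prob,
                    frag_len_pmf, start_pos_prob, mismatch_prior=1e-5):
    """Pr{f_j | t_i} as the product of the four auxiliary terms.

    mismatch_prior plays the role of the user-defined prior applied when
    the alignment disagrees with the library format (value assumed here).
    """
    p_len = frag_len_pmf.get(frag_len, 0.0)           # Pr{l | t_i}
    p_start = start_pos_prob(start, frag_len)         # Pr{p | t_i, l}
    p_orient = 1.0 if orientation_ok else mismatch_prior  # Pr{o | t_i}
    return p_len * p_start * p_orient * aln_prob      # times Pr{a | ...}


# Hypothetical inputs: a 200 bp fragment at position 10 of a 1,000 nt
# transcript, with an alignment (or coverage) score of 0.9.
frag_len_pmf = {180: 0.2, 200: 0.5, 220: 0.3}
score = agreement_score(200, 10, True, 0.9, frag_len_pmf, uniform_start(1000))
score_wrong_strand = agreement_score(200, 10, False, 0.9, frag_len_pmf,
                                     uniform_start(1000))
```

Because none of these terms depends on the abundance estimates, a score of this kind can be computed once per fragment-transcript pair and reused across iterations.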

### Alignment model

When Salmon is given read alignments as input, it can learn and apply a model of read alignments to help assess the probability that a fragment originated from a particular locus. Specifically, Salmon’s alignment model is a spatially varying first-order Markov model over the set of `CIGAR` symbols and nucleotides. To account for the fact that substitution and indel rates can vary spatially over the length of a read, we partition each read into a fixed number of bins (4 by default) and learn a separate model for each of these bins. This allows us to learn spatially varying effects without making the model itself too large (as would happen if, for example, we had attempted to learn a separate model for each position in the read). Given the `CIGAR` string $s = s_0, \ldots, s_{|s|}$ for an alignment $a$, we compute the probability of $a$ as:

$$\Pr\{a \mid f_j, t_i, p, o, \ell\} = \Pr\{s_0\} \prod_{k=1}^{|s|} \Pr\nolimits_{\mathcal{M}_k}\{s_{k-1} \rightarrow s_k\} \tag{6}$$

where $\Pr\{s_0\}$ is the start probability and $\Pr_{\mathcal{M}_k}\{\cdot\}$ is the transition probability under the model at the $k^{\text{th}}$ position of the read (i.e., in the bin corresponding to position $k$). To compute these probabilities, Salmon parses the `CIGAR` string $s$ and moves appropriately along both the fragment $f_j$ and the reference transcript $t_i$, and computes the probability of transitioning to the next observed state in the alignment (a tuple consisting of the `CIGAR` operation, and the nucleotides in the fragment and reference) given the current state of the model. The parameters of this Markov model are learned from sampled alignments in the online phase of the algorithm (see **Algorithm 1**). When lightweight alignments are used instead of user-provided alignments, $\Pr\{a \mid f_j, t_i, p, o, \ell\}$ is taken to be proportional to the normalized coverage of fragment $f_j$ on transcript $t_i$: $\mathrm{coverage}(f_j, t_i) / \max_k \mathrm{coverage}(f_j, t_k)$.

### Algorithms

Salmon consists of three components: a lightweight-alignment model, an online phase that estimates initial expression levels and model parameters and constructs equivalence classes over the input fragments, and an offline phase that refines the expression estimates. The online and offline phases together optimize the estimates of $\boldsymbol{\alpha}$, which is a vector of weighted estimates of read counts. Each method can compute $\boldsymbol{\eta}$ directly from these parameters.

The online phase uses a variant of stochastic, collapsed variational Bayesian inference [16]. The offline phase applies the variational Bayesian EM algorithm [15] over a reduced representation of the data represented by the equivalence classes until a data-dependent convergence criterion is satisfied. An overview of our method is given in **Supplementary Fig. 1**, and we describe each component in more detail below.

### Lightweight alignment

A key computational challenge in inferring relative transcript abundances is to determine the potential loci-of-origin for a sequenced fragment. To make the optimization tractable, not all positions can be considered. However, if the sequence of a fragment is substantially different from the sequence of a given transcript at a particular position, it is very unlikely that the fragment originated from this transcript and position; these positions will have their probability truncated to 0 and will be omitted from the optimization. Determining a set of potential loci-of-origin for a sequenced fragment is typically done by aligning the reads to the genome or transcriptome using tools like Bowtie2 [17], STAR [18], or HISAT [19]. While Salmon can process the alignments generated by such tools (when they are given with respect to the transcriptome), it provides another method to determine the potential loci-of-origin of the fragments directly, using a procedure that we call *lightweight alignment*.

The main motivation behind lightweight alignment is that achieving accurate quantification of transcript abundance from RNA-seq data does not require knowing the optimal alignment between the sequenced fragment and the transcript for every potential locus of origin. Rather, simply knowing which transcripts (and positions within these transcripts) match the fragments reasonably well is sufficient. Formally, we define lightweight-alignment as a procedure that, given the transcripts $\mathcal{T}$ and a fragment $f_i$, returns a set of 3-tuples $\{(t_{i'}, p_{i'}, s_{i'}), \ldots\}$. Each tuple consists of 3 elements: a transcript $t_{i'}$, a position $p_{i'}$ within this transcript, and a score $s_{i'}$ that summarizes the quality of the match between $f_i$ and $t_{i'}$ at position $p_{i'}$.

We describe, here, the lightweight-alignment approach for a single read (it extends naturally to paired-end reads by looking for lightweight-alignments for read pairs that are appropriately positioned on the same transcript). Salmon attempts to find a chain of super-maximal exact matches (SMEMs) and maximal exact matches (MEMs) that cover a read. Recall, a maximal exact match is a substring that is shared by the query (read) and reference (transcript) that cannot be extended in either direction without introducing a mismatch. A super-maximal exact match [11] is a MEM that is not contained within any other MEM on the query.

Salmon attempts to cover the read using SMEMs. Differences, whether due to read errors or true variation of the sample being sequenced from the reference, will often prevent SMEMs from spanning an entire read. However, one will often be able to find *approximately* consistent, co-linear chains of SMEMs that are shared between the read and target transcripts. A chain of SMEMs is a collection of 3-tuples $c = \{(q_1, t_1, \ell_1), \ldots\}$ where each $q_i$ is a position on the query (read), $t_i$ is a position on the reference (transcript), and $\ell_i$ is the length of the SMEM. If $\sum_i \left| (q_{i+1} - q_i) - (t_{i+1} - t_i) \right| = 0$, then we say that the chain is consistent: the spacing between the locations of SMEMs on the query and on the reference is the same. If, instead, we require that $\sum_i \left| (q_{i+1} - q_i) - (t_{i+1} - t_i) \right| \le \delta$, then we say that the chain is *approximately* consistent, or $\delta$-consistent. Consistent chains can deal only with substitution errors and mutations, while $\delta$-consistent chains can also account for indels. **Supplementary Fig. 2** shows an example.

While the discussion above is in terms of SMEMs, the chains constructed by Salmon typically consist of a mix of SMEMs and MEMs. This is because, like BWA-mem [11], Salmon breaks SMEMs that are too large (by default, greater than 1.5 times the minimum required MEM length), to prevent them from masking potentially high-scoring MEM chains. In order for Salmon to consider a read to match a transcript locus sufficiently well, there must be a *δ*-consistent chain between the read and the transcript sequence, beginning at the locus, that covers a user-specified fraction of the read (65% by default).
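The consistency and coverage conditions on a chain can be sketched directly from the definitions. The code below is illustrative only and does not reproduce Salmon's chaining algorithm: the function names are hypothetical, the chain is assumed to be already sorted by query position, and the example coordinates are invented.

```python
def gap_discrepancy(chain):
    """Sum over consecutive matches of |(q_{i+1} - q_i) - (t_{i+1} - t_i)|.

    chain: list of (q, t, length) tuples sorted by query position q.
    """
    return sum(abs((q2 - q1) - (t2 - t1))
               for (q1, t1, _), (q2, t2, _) in zip(chain, chain[1:]))


def is_delta_consistent(chain, delta):
    """A chain is delta-consistent if its gap discrepancy is at most delta."""
    return gap_discrepancy(chain) <= delta


def covered_fraction(chain, read_len):
    """Fraction of read positions covered by at least one match in the chain."""
    covered = set()
    for q, _, length in chain:
        covered.update(range(q, min(q + length, read_len)))
    return len(covered) / read_len


# Hypothetical chain on a 100 bp read. The last match is shifted by one
# base on the transcript relative to the read, as a 1 bp indel would cause.
chain = [(0, 100, 20), (25, 125, 30), (60, 161, 15)]
discrepancy = gap_discrepancy(chain)
fraction = covered_fraction(chain, read_len=100)
```

Here the chain is not strictly consistent but is 1-consistent, and it covers 65% of the read, which is exactly the default coverage threshold mentioned above.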

Using this procedure, Salmon implements lightweight alignment by finding, for a fragment $f_i$, all transcript position pairs $(t_{i'}, p_{i'})$ that share a $\delta$-consistent chain with $f_i$ covering at least fraction $c$ of the fragment. The score, $s_{i'}$, of this lightweight alignment is simply the fraction of the fragment covered by the chain.

Salmon searches for SMEMs using the FMD-index [20]. Specifically, Salmon uses a slightly-modified version of the BWA [20] index, replacing the default sparse sampling with a dense sampling to improve speed. When Salmon is run in lightweight alignment mode, one must have first prepared an index for the target transcriptome against which lightweight alignment is to be performed. The Salmon index is built using the `index` command of Salmon. Unlike *k*-mer-based indices (e.g. as used in Sailfish [6] or Kallisto [7]), the parameters for lightweight-alignment (e.g. the fraction of the read required to be covered, or the minimum length MEMs considered in chains) can be modified without re-building the index. This allows one to easily modify the sensitivity and specificity of the lightweight-alignment procedure without the need to re-create the index (which often takes longer than quantification).

### Online phase

The online phase of Salmon attempts to solve the variational Bayesian inference problem described in **Objectives and models for abundance estimation**, and optimizes a collapsed variational objective function [13] using a variant of stochastic collapsed variational Bayesian inference [16]. The inference procedure is a streaming algorithm that updates estimated read counts $\boldsymbol{\alpha}$ after every small group $B^{\tau}$ (called a mini-batch) of observations. The pseudo-code for the algorithm is given in **Algorithm 1**.

The observation weight for mini-batch $B^{\tau}$, $v^{\tau}$, in line 15 of **Algorithm 1** is an increasing sequence in $\tau$, and is set, as in [10], to adhere to the Robbins-Monro conditions. Here, the $\boldsymbol{\alpha}$ represent the (weighted) estimated counts of fragments originating from each transcript. Using this method, the expected value of $\boldsymbol{\eta}$ can be computed directly from $\boldsymbol{\alpha}$ using equation (16). We employ a *weak* Dirichlet conjugate-prior. As outlined in [16], the SCVB0 inference algorithm is similar to variants of the online-EM [21] algorithm with a modified prior.

The procedure in **Algorithm 1** is run independently by as many worker threads as the user has specified. The threads share a single work-queue upon which a parsing thread places mini-batches of alignment groups. An alignment group is simply the collection of all alignments (i.e. all multi-mapping locations) for a particular read. The mini-batch itself consists of a collection of some small, fixed number of alignment groups (1,000 by default). Each worker thread processes one alignment group at a time, using the current weights of each transcript and the current auxiliary parameters to estimate the probability that a read came from each potential transcript of origin. The processing of mini-batches occurs in parallel, so that very little synchronization is required: only an atomic compare-and-swap loop to update the global transcript weights at the end of processing of each mini-batch, hence the moniker laissez-faire. This lack of synchronization means that when estimating $x_y$, we cannot be certain that the most up-to-date values of $\boldsymbol{\alpha}$ are being used. However, due to the stochastic and additive nature of the updates, this has little-to-no detrimental effect [22]. The inference procedure itself is generic over the type of alignments being processed; they may be either regular alignments (e.g. coming from a `bam` file), or lightweight-alignments generated as described in **Lightweight alignment** above. After the entire mini-batch has been processed, the global weights for each transcript $\boldsymbol{\alpha}$ are updated. These updates are *sparse*; i.e. only transcripts which appeared in some alignment in mini-batch $B^{\tau}$ will have their global weight updated after $B^{\tau}$ has been processed. This ensures, as in [10], that updates to the parameters $\boldsymbol{\alpha}$ can be performed efficiently.

### Equivalence classes

During its online phase, in addition to performing streaming inference of transcript abundances, Salmon also constructs a highly-reduced representation of the sequencing experiment. Specifically, Salmon constructs “rich” equivalence classes over all of the sequenced fragments. We define an equivalence relation $\sim$ over fragments. Let $M(f_x)$ be the set of transcripts to which $f_x$ maps according to alignments $A$. We say $f_x \sim f_y$ if and only if $M(f_x) = M(f_y)$. Related, but distinct, notions of alignment-based equivalence classes have been introduced previously (e.g. [23]), and shown to greatly reduce the time required to perform iterative optimization such as that described in **Offline phase**. Fragments which are equivalent can be grouped together for the purpose of inference.

Salmon builds up a set of fragment-level equivalence classes by maintaining an efficient concurrent cuckoo hash map [24]. To construct this map, we associate each fragment $f_x$ with $\mathbf{t}^x = M(f_x)$, which we will call the label of the fragment. Then, we query the hash map for $\mathbf{t}^x$. If this key is not in the map, we create a new equivalence class with this label, and set its count to 1. Otherwise, we increment the count of the equivalence class with this label that we find in the map. The efficient, concurrent nature of the data structure means that many threads can simultaneously query and write to the map while encountering very little contention.

Each key in the hash map is associated with a value that we call a “rich” equivalence class. For each equivalence class $C^j$, we retain a count $d^j = |C^j|$, which is the total number of fragments contained within this class. We also maintain, for each class, a weight vector $\mathbf{w}^j$. The entries of this vector are in one-to-one correspondence with the transcripts $i$ in the label of this equivalence class: $w_i^j$ is the average conditional probability of observing a fragment from $C^j$ given $t_i$ over all fragments in this equivalence class. Since the fragments in $C^j$ are all exchangeable, the pairing between the conditional probability for a particular fragment and a particular transcript need not be maintained. Thus, the aggregate weights stored in the “rich” equivalence classes give us the power of considering the conditional probabilities specified in the full model, without having to continuously reconsider each of the fragments in $\mathcal{F}$.
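The construction of rich equivalence classes can be sketched with an ordinary dictionary. This is an illustration under simplifying assumptions, not Salmon's data structure: a single-threaded Python `dict` stands in for the concurrent cuckoo hash map, the function name and transcript identifiers are hypothetical, and each weight is taken to be the plain average of the conditional probabilities, following the verbal description above.

```python
def build_equivalence_classes(fragments):
    """Group fragments by the set of transcripts they map to.

    fragments: iterable of dicts mapping transcript id -> Pr{f | t_i}
    Returns a dict: label (sorted tuple of transcript ids) ->
    (count d^j, list of average conditional probabilities w^j).
    """
    counts = {}
    sums = {}
    for cond in fragments:
        label = tuple(sorted(cond))            # M(f_x), used as the key
        counts[label] = counts.get(label, 0) + 1
        acc = sums.setdefault(label, [0.0] * len(label))
        for k, transcript in enumerate(label):
            acc[k] += cond[transcript]
    return {label: (counts[label],
                    [s / counts[label] for s in sums[label]])
            for label in counts}


# Hypothetical fragments: two map to {t1, t2}, one maps only to t3.
fragments = [
    {"t1": 0.2, "t2": 0.4},
    {"t2": 0.6, "t1": 0.4},
    {"t3": 1.0},
]
classes = build_equivalence_classes(fragments)
```

After this step the three fragments are summarized by two classes, and later optimization passes iterate over classes rather than over individual fragments.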

### Offline phase

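In its offline phase, which follows the online phase, Salmon uses the “rich” equivalence classes learned during the online phase to refine the inference. Given the set $\mathcal{C}$ of rich equivalence classes of fragments, we can use an expectation maximization (EM) algorithm to optimize the likelihood of the parameters given the data. The abundances $\boldsymbol{\eta}$ can be computed directly from $\boldsymbol{\alpha}$, and we compute maximum likelihood estimates of these parameters, which represent the estimated counts (i.e. number of fragments) deriving from each transcript, where:

$$\mathcal{L}\{\boldsymbol{\alpha} \mid \mathcal{F}, \mathbf{Z}, \mathcal{T}\} = \prod_{j=1}^{N} \sum_{i=1}^{M} \hat{\eta}_i \Pr\{f_j \mid t_i\} \tag{9}$$

and $\hat{\eta}_i = \alpha_i / \sum_{j} \alpha_j$. If we write this same likelihood in terms of the equivalence classes $\mathcal{C}$, we have:

$$\mathcal{L}\{\boldsymbol{\alpha} \mid \mathcal{F}, \mathbf{Z}, \mathcal{T}\} \approx \prod_{C^j \in \mathcal{C}} \left( \sum_{t_i \in \mathbf{t}^j} \hat{\eta}_i w_i^j \right)^{d^j}$$

#### EM update rule

This likelihood, and hence that represented in equation (9), can then be optimized by applying the following update equation iteratively:

$$\alpha_i^{u+1} = \sum_{C^j \in \mathcal{C}} d^j \left( \frac{\alpha_i^u w_i^j}{\sum_{t_k \in \mathbf{t}^j} \alpha_k^u w_k^j} \right) \tag{11}$$

We apply this update equation until the maximum relative difference in the $\boldsymbol{\alpha}$ parameters between successive iterations falls below a convergence threshold. Let $\boldsymbol{\alpha}'$ be the estimates after having achieved convergence. We can then approximate $\eta_i$ by $\hat{\eta}_i$, where:

$$\hat{\eta}_i = \frac{\alpha_i'}{\sum_{j} \alpha_j'}$$

#### Variational Bayes optimization

Instead of the standard EM updates of equation (11), we can, optionally, perform variational Bayesian optimization by applying VBEM updates as in [14], but adapted to be with respect to the equivalence classes:

$$\alpha_i^{u+1} = \sum_{C^j \in \mathcal{C}} d^j \left( \frac{e^{\gamma_i^u} w_i^j}{\sum_{t_k \in \mathbf{t}^j} e^{\gamma_k^u} w_k^j} \right)$$

where:

$$\gamma_i^u = \Psi\!\left(\alpha_i^0 + \alpha_i^u\right) - \Psi\!\left(\sum_{k} \left(\alpha_k^0 + \alpha_k^u\right)\right)$$

Here, $\Psi(\cdot)$ is the digamma function, and, upon convergence of the parameters, we can obtain an estimate of the expected value of the posterior nucleotide fractions as:

$$\mathrm{E}\{\eta_i\} = \frac{\alpha_i^0 + \alpha_i'}{\sum_{j} \left(\alpha_j^0 + \alpha_j'\right)} = \frac{\alpha_i^0 + \alpha_i'}{\hat{\alpha}^0 + N} \tag{16}$$

where $\hat{\alpha}^0 = \sum_{i=1}^{M} \alpha_i^0$. Variational Bayesian optimization in the offline phase of Salmon is selected by passing the `--useVBOpt` flag to the Salmon `quant` command.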
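The EM iteration over equivalence classes can be sketched in a few lines. This is a minimal illustration, not Salmon's offline phase: the function name is hypothetical, the update assumed is the standard one in which each class distributes its count $d^j$ among its transcripts in proportion to $\alpha_i w_i^j$, and the stopping rule (a maximum absolute change with an arbitrary tolerance) is a simplification of the relative-difference criterion described above.

```python
def em_over_classes(classes, n_transcripts, tol=1e-8, max_iters=10000):
    """EM on equivalence classes.

    classes: list of (label, count, weights), where label is a tuple of
             transcript indices and weights[k] is w for label[k]
    Returns (alpha, eta): estimated counts and normalized abundances.
    """
    total = sum(count for _, count, _ in classes)
    alpha = [total / n_transcripts] * n_transcripts   # uniform start
    for _ in range(max_iters):
        new_alpha = [0.0] * n_transcripts
        for label, count, weights in classes:
            denom = sum(alpha[i] * w for i, w in zip(label, weights))
            if denom == 0.0:
                continue
            # Split this class's fragments in proportion to alpha_i * w_i.
            for i, w in zip(label, weights):
                new_alpha[i] += count * alpha[i] * w / denom
        change = max(abs(a - b) for a, b in zip(alpha, new_alpha))
        alpha = new_alpha
        if change < tol:   # simplified, assumed stopping rule
            break
    norm = sum(alpha)
    return alpha, [a / norm for a in alpha]


# Hypothetical classes: 3 fragments unique to t_0, 1 unique to t_1,
# and 2 fragments shared between them with equal weights.
classes = [((0,), 3, [1.0]), ((1,), 1, [1.0]), ((0, 1), 2, [0.5, 0.5])]
alpha_hat, eta_hat = em_over_classes(classes, n_transcripts=2)
```

Each iteration costs time proportional to the total size of the class labels rather than to the number of fragments, which is why the reduced representation makes the offline phase fast.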

### Sampling from the posterior

After the convergence of the parameter estimates has been achieved in the offline phase, it is possible to draw samples from the posterior distribution using collapsed, blockwise Gibbs sampling over the equivalence classes. Samples can be drawn by iterating over the equivalence classes, and re-sampling assignments for some fraction of fragments in each class according to the multinomial distribution defined by holding the assignments for all other fragments fixed. Many samples can be drawn quickly, since many Gibbs chains can be run in parallel. Further, due to the accuracy of the preceding inference, the chains begin sampling from a good position in the latent variable space almost immediately. These posterior samples can be used to obtain estimates for quantities of interest about the posterior distribution, such as its variance, or to produce confidence intervals. When Salmon is passed the `--useGSOpt` parameter, it will draw a number of posterior samples that can be specified with the `--numGibbsSamples` parameter.
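One sweep of blockwise Gibbs sampling over equivalence classes might look like the sketch below. It is a toy illustration under stated assumptions, not Salmon's sampler: the function name, the prior value, the resampled fraction, and the way fragments are chosen for resampling are all hypothetical, and the resampling probabilities are assumed to be proportional to (prior + current count) times the class weight.

```python
import random


def gibbs_sweep(classes, assign, totals, prior, frac, rng):
    """Resample a fraction of each class's fragment assignments in place.

    classes: list of (label, count, weights) as in the offline phase
    assign:  assign[j][k] = fragments of class j assigned to label[k]
    totals:  totals[i] = fragments currently assigned to transcript i
    """
    for j, (label, count, weights) in enumerate(classes):
        n_resample = max(1, int(frac * count))
        # Pick which currently assigned fragments to withdraw.
        pool = [k for k, n in enumerate(assign[j]) for _ in range(n)]
        for k in rng.sample(pool, n_resample):
            assign[j][k] -= 1
            totals[label[k]] -= 1
        # Redraw the block jointly, holding all other assignments fixed.
        probs = [(prior + totals[i]) * w for i, w in zip(label, weights)]
        for k in rng.choices(range(len(label)), weights=probs, k=n_resample):
            assign[j][k] += 1
            totals[label[k]] += 1
    return totals


# Hypothetical state: 10 fragments shared by t_0 and t_1, 5 unique to t_1.
classes = [((0, 1), 10, [0.5, 0.5]), ((1,), 5, [1.0])]
assign = [[5, 5], [5]]
totals = [5, 10]
rng = random.Random(0)
samples = []
for _ in range(20):
    gibbs_sweep(classes, assign, totals, prior=0.01, frac=0.5, rng=rng)
    samples.append(list(totals))
```

Each entry of `samples` is one draw of per-transcript counts; the spread across draws gives a rough picture of how uncertain the split of the shared fragments is.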

## Validation

### Metrics for accuracy

We compute three different metrics that summarize the agreement of the predicted number of reads originating from each transcript with the known (simulated) read counts. While these different measures generally give consistent results in our testing, they measure different properties of the underlying estimates. We choose to evaluate these error measures on the estimated read counts to minimize the effect of differences in the manner in which different methods normalize expression estimates by the transcript length (e.g. differences in *effective* length calculations).

The first measure is the mean absolute relative difference (MARD), which is computed using the absolute relative difference $\mathrm{ARD}_i$ for each transcript $i$:

$$
\mathrm{ARD}_i =
\begin{cases}
0 & \text{if } x_i = y_i = 0 \\
\dfrac{|x_i - y_i|}{0.5\,|x_i + y_i|} & \text{otherwise}
\end{cases}
$$

where $x_i$ is the true value of the number of reads, and $y_i$ is the predicted value. The relative difference is bounded above by 2, and takes on a value of 0 whenever the prediction perfectly matches the truth. To compute the mean absolute relative difference, we simply take $\mathrm{MARD} = \frac{1}{M}\sum_{i=1}^{M}\mathrm{ARD}_i$, where $M$ is the number of transcripts.

The second measure is the proportionality correlation, which Lovell et al. [25] argue is a good measure for relative quantities like mRNA expression. The proportionality correlation is defined as:

$$
\rho_p = \frac{2\,\mathrm{cov}(\log x, \log y)}{\mathrm{var}(\log x) + \mathrm{var}(\log y)}
$$

As $\rho_p$ is undefined when either true or estimated measurements take on values of 0, we choose to add a small, positive constant ($1 \times 10^{-2}$) to all values when computing the proportionality correlation. The $\rho_p$ measure varies from −1 to 1, with a value of 1 being representative of perfect proportional correlation.

Finally, we also compute the Spearman correlation coefficient between the true number of reads deriving from each transcript and the number of reads estimated by each quantification method.

Salmon and Kallisto, by default, truncate very tiny expression values to 0. For example, any transcript estimated to produce $< 1 \times 10^{-8}$ reads is assigned an estimated read count of 0. However, eXpress does not perform such a truncation, and very small, non-zero values may have a negative effect on some of the accuracy metrics we compute. To mitigate such effects, in all of our experiments, we first truncate to 0, in the output of eXpress, all values smaller than the minimum non-zero prediction observed in the output of the other methods.
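As a concrete illustration of the three metrics and the truncation step, the following sketch computes them with the Python standard library only. It is not the evaluation code used for the paper. It assumes the usual forms ARD = |x − y| / (0.5 |x + y|) (with 0 when both values are 0) and ρ_p = 2 cov(log x, log y) / (var(log x) + var(log y)), uses average ranks for ties in the Spearman correlation, and all function names are hypothetical.

```python
import math


def ard(x, y):
    """Absolute relative difference between a true count x and a prediction y."""
    if x == 0 and y == 0:
        return 0.0
    return abs(x - y) / (0.5 * abs(x + y))


def mard(truth, pred):
    """Mean of the per-transcript absolute relative differences."""
    return sum(ard(x, y) for x, y in zip(truth, pred)) / len(truth)


def _cov(a, b):
    """Population covariance; _cov(a, a) is the population variance."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


def proportionality_correlation(truth, pred, pseudocount=1e-2):
    """rho_p on log-transformed values, with a pseudocount to handle zeros."""
    lx = [math.log(x + pseudocount) for x in truth]
    ly = [math.log(y + pseudocount) for y in pred]
    return 2.0 * _cov(lx, ly) / (_cov(lx, lx) + _cov(ly, ly))


def _average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(truth, pred):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = _average_ranks(truth), _average_ranks(pred)
    return _cov(rx, ry) / math.sqrt(_cov(rx, rx) * _cov(ry, ry))


def truncate_small(values, threshold):
    """Set every value below threshold to 0, as done for the eXpress output."""
    return [0.0 if v < threshold else v for v in values]
```

In this sketch, `threshold` would be the minimum non-zero prediction observed across the other methods, and the truncated values would then be passed to `mard`, `proportionality_correlation`, and `spearman`. Both correlation functions divide by a variance term, so they are undefined for constant input.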

### Ground truth simulated data

To assess accuracy in a situation where the true expression levels are known, we generate synthetic data sets using both the Flux Simulator [8] and the RSEM-sim procedure used in [7]. The Flux Simulator attempts to model the different stages of an RNA-seq experiment (e.g. amplification and fragmentation), and it adopts various mathematical models for different stages of the simulation. However, it does not assume the same generative model used by any of the quantification tools tested here, and thus may provide a less biased simulation. The Flux Simulator data consisted of 75 million 76 bp paired-end reads on a transcript population of 5 million molecules for two separate species: *Homo sapiens* and *Zea mays*. To generate data with RSEM-sim, we follow the procedure used in [7]: RSEM was run on sample `NA12716_7` of the Geuvadis RNA-seq data [26] to learn model parameters and estimate true expression, and the learned model was then used to generate 20 different simulated datasets, each consisting of 30 million 75 bp paired-end reads. All tests were performed with eXpress v1.5.1, Kallisto v0.42.1, Salmon v0.4.2 and STAR v2.41d. The flag `--useErrorModel` was passed to alignment-based Salmon. Reads were aligned with STAR using the parameters `--outFilterMultimapNmax 200 --outFilterMismatchNmax 99999 --outFilterMismatchNoverLmax 0.2 --alignIntronMin 1000 --alignIntronMax 0 --outSAMtype BAM Unsorted`. Otherwise, default parameters were used unless noted.

### qPCR data

We compared the quantification performance of the methods using qPCR data from the SEQC consortium [9]. We obtained normalized Prime PCR estimates for genes from http://abrf.masonlab.net/Files.html and compared the qPCR abundance estimates for Sample A (Universal Human Reference RNA) with abundance estimates computed from RNA-seq data for Sample A obtained at the BGI site (SEQCJLM_BGI_A_1, GEO ID: GSE47792). While all tested methods for quantifying abundance appear to produce high concordance with the qPCR-based estimates, we find that Salmon performs better than most other methods (**Supplementary Table 2**).

### Comparison with Stringtie

We also performed accuracy analyses using a recent transcript assembly and quantification program, Stringtie [27]. After quantifying with Stringtie, we noticed that many transcripts that are highly expressed in the ground truth, and according to the other quantifiers, are reported as unexpressed in the Stringtie output, resulting in low overall correlation with the ground truth. This may be due to Stringtie’s conservative approach: it requires that each exon-intron-exon junction be supported by at least one spliced read for a transcript to be considered in the pool of expressed transcripts. For longer genes with many introns, it may therefore be more likely that transcripts associated with the gene are discarded. We chose not to include these results here, so as not to penalize methods like Stringtie that attempt to reconstruct transcripts rather than just quantify them.

## Acknowledgements

This research is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to C.K. It is partially funded by the US National Science Foundation (CCF-1256087, CCF-1319998) and the US National Institutes of Health (R21HG006913, R01HG007104). C.K. received support as an Alfred P. Sloan Research Fellow.