IsoBayes: a Bayesian approach for single-isoform proteomics inference

Studying protein isoforms is an essential step in biomedical research. At present, the main approach for analyzing proteins is bottom-up mass spectrometry proteomics, which returns peptide identifications that are indirectly used to infer the presence of protein isoforms. However, the detection and quantification processes are noisy; in particular, peptides may be erroneously detected, and most peptides, known as shared peptides, are associated with multiple protein isoforms. As a consequence, studying individual protein isoforms is challenging, and inferred protein results are often abstracted to the gene level or to groups of protein isoforms. Here, we introduce IsoBayes, a novel statistical method to perform inference at the isoform level. Our method enhances the information available by integrating mass spectrometry proteomics and transcriptomics data in a Bayesian probabilistic framework. To account for the uncertainty in the measurement process, we propose a two-layer latent variable approach: first, we sample whether a peptide has been correctly detected (or, alternatively, filter peptides); second, we allocate the abundance of the selected peptides across the protein(s) they are compatible with. This enables us, starting from peptide-level data, to recover protein-level data; in particular, we: i) infer the presence/absence of each protein isoform (via a posterior probability), ii) estimate its abundance (and credible interval), and iii) target isoforms whose transcript and protein relative abundances significantly differ. We benchmarked our approach in simulations and on two multi-protease real datasets: our method displays good sensitivity and specificity when detecting protein isoforms, its estimated abundances correlate highly with the ground truth, and it can detect changes between protein and transcript relative abundances. IsoBayes is freely distributed as a Bioconductor R package, and is accompanied by an example usage vignette.
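To make the two-layer latent variable idea concrete, the following is a minimal toy sketch in Python of a single sampling pass. It is not the IsoBayes implementation: the data, the function name `allocate_once`, the per-peptide error probabilities `pep`, and the fixed isoform proportions `pi` are all hypothetical, and a full Bayesian sampler would presumably update the isoform proportions across iterations rather than hold them fixed.

```python
import numpy as np

# Hypothetical toy data: 4 peptides, 3 isoforms.
# compat[i, j] is True if peptide i is compatible with isoform j.
compat = np.array([
    [True,  False, False],   # unique to isoform 0
    [True,  True,  False],   # shared by isoforms 0 and 1
    [False, True,  True],    # shared by isoforms 1 and 2
    [True,  True,  True],    # shared by all three isoforms
])
abundance = np.array([10, 6, 8, 4])       # e.g., PSM counts per peptide
pep = np.array([0.01, 0.05, 0.60, 0.10])  # assumed error probability per peptide
pi = np.array([0.5, 0.3, 0.2])            # assumed isoform relative abundances


def allocate_once(rng, compat, abundance, pep, pi):
    """One pass of the two layers (illustrative only)."""
    # Layer 1: sample whether each peptide was correctly detected.
    kept = rng.random(len(pep)) > pep
    iso_counts = np.zeros(compat.shape[1], dtype=np.int64)
    for i in np.flatnonzero(kept):
        # Layer 2: split the peptide's abundance across its compatible
        # isoforms, with weights proportional to the isoform proportions.
        w = pi * compat[i]
        w = w / w.sum()
        iso_counts += rng.multinomial(abundance[i], w)
    return kept, iso_counts


rng = np.random.default_rng(0)
kept, iso_counts = allocate_once(rng, compat, abundance, pep, pi)
print("peptides kept:", kept)
print("abundance allocated to each isoform:", iso_counts)
```

In this sketch, a unique peptide always contributes its whole abundance to its single isoform, whereas a shared peptide is split at random among the isoforms it is compatible with; repeating the pass many times would give a distribution over isoform-level abundances rather than a single point value.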

1 Supplementary Tables

Supplementary Table 6: Area under the curve (AUC) for the detection of protein isoforms, for every method in each real dataset, computed on the subset of protein isoforms from multi-isoform genes (i.e., genes with more than one expressed isoform).

Supplementary Table 7: Area under the curve (AUC) for the detection of protein isoforms, and correlation between log10 estimated protein isoform abundances (i.e., log10(abundance + 1)) and those found in the validation set. Values represent averages across the proteases of the jurkat and WTC-11 datasets. Results refer to IsoBayes and IsoBayes_mRNA, computed on three types of input data: i) PSM counts from OpenMS' Percolator ("OpenMS" column); ii) PSM counts from MetaMorpheus ("MM PSM" column); iii) peptide intensities from MetaMorpheus ("MM int" column).

Real data - Robustness to input data
Note that numbers differ slightly from those in other tables; this is because, here, we focus on the isoforms that are in common across the three data types, i.e., with at least 1 detected (shared or unique) peptide. This ensures a fair comparison across inputs.

Supplementary Figure 4: Boxplot of the stabilized log2-FCs between protein and transcript relative abundances, identified in the validated set, stratified based on the probability, estimated by IsoBayes_mRNA, that isoform relative abundances are higher at the protein level than at the transcript level. In each cell line, we considered results from all proteases. Left: jurkat dataset; right: WTC-11 dataset.

Supplementary Figure 5: Boxplot of the stabilized log2-FCs between protein and transcript relative abundances, identified in the validated set, stratified based on the probability, estimated by IsoBayes, that isoform relative abundances are higher at the protein level than at the transcript level. In each cell line, we considered results from all proteases. Left: jurkat dataset; right: WTC-11 dataset.

Table 1: Summary results, from the simulation study, for IsoBayes fit without mRNA abundances. "Abundance present iso" and "Abundance absent iso" indicate the estimated average abundance for protein isoforms that were actually simulated to be present and absent, respectively.
Supplementary Table 3: Area under the curve (AUC) for the detection of protein isoforms, for every method in each real dataset.

Supplementary Table 4: Correlation between log10 estimated gene-level protein abundances (i.e., log10(abundance + 1)) and those found in the validation set. In each cell line, we considered results from all proteases.

1.3 Real data - Isoforms without unique peptides

Table 5: Area under the curve (AUC) for the detection of protein isoforms, for every method in each real dataset, computed on the subset of protein isoforms solely associated with shared peptides.
1.4 Real data - Isoforms from multi-isoform genes

1.6 Real data - PEP vs. FDR mode

Table 8: Area under the curve (AUC) for the detection of protein isoforms, and correlation between log10 estimated protein isoform abundances (i.e., log10(abundance + 1)) and those found in the validation set. Values represent averages across the proteases of the jurkat and WTC-11 datasets. Results refer to IsoBayes and IsoBayes_mRNA, based on the PEP and FDR modes. Methods were fit on both PSM counts ("MM PSM" rows) and peptide intensities ("MM int" rows), computed via MetaMorpheus.

Table 9: Runtime (in minutes) and memory (in gigabytes) for IsoBayes and IsoBayes_mRNA, based on the PEP and FDR modes. Values represent averages across the proteases of the jurkat and WTC-11 datasets. Methods were fit on both PSM counts ("MM PSM" rows) and peptide intensities ("MM int" rows), computed via MetaMorpheus.