Abstract
Transcripts are frequently modified by structural variations, which leads to either a fused transcript of two genes (known as a fusion gene) or an insertion of intergenic sequence into a transcript. These modifications, called transcriptomic structural variants (TSVs), can lead to drastic changes in the downstream product. Detecting TSVs, especially in tumor sequencing where they are known to occur frequently, is an important and challenging computational problem. The problem is made even more challenging because often only RNA-seq measurements are available. We introduce SQUID, a novel algorithm and its implementation, to accurately predict both fusion-gene and non-fusion-gene TSVs from RNA-seq alignments. SQUID takes the unique approach of attempting to reconstruct an underlying genome sequence that best explains the observed RNA-seq reads. By unifying concordant and discordant read alignments into one model, SQUID achieves high sensitivity with many fewer false positives than other approaches. We detect TSVs on TCGA tumor samples using SQUID and observe that non-fusion-gene TSVs are more likely to be intra-chromosomal than fusion-gene TSVs. We also quantify the propensity for breakpoint partners to be reused. We identify several novel TSVs involving tumor suppressor genes, which may lead to loss of function in the corresponding genes and play a role in tumorigenesis.