bioRxiv
New Results

ClusTrast: a short read de novo transcript isoform assembler guided by clustered contigs

Karl Johan Westrin, Warren W. Kretzschmar, Olof Emanuelsson
doi: https://doi.org/10.1101/2022.01.02.473666
Karl Johan Westrin
1Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, SE-171 65, Solna, Sweden
Warren W. Kretzschmar
1Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, SE-171 65, Solna, Sweden
2Center for Hematology and Regenerative Medicine (HERM), Department of Medicine Huddinge, Karolinska Institute, SE-141 52, Flemingsberg, Sweden
Olof Emanuelsson
1Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, SE-171 65, Solna, Sweden
  • For correspondence: olofem@kth.se

Abstract

Background Transcriptome assembly from RNA-sequencing data in species without a reliable reference genome has to be performed de novo, but studies have shown that de novo methods often have inadequate ability to reconstruct transcript isoforms. We address this issue by constructing an assembly pipeline whose main purpose is to produce a comprehensive set of transcript isoforms.

Results We present the de novo transcript isoform assembler ClusTrast, which clusters a set of guiding contigs by similarity, aligns short reads to the guiding contigs, and assembles each clustered set of short reads individually. We tested ClusTrast on datasets from six eukaryotic species and showed that it reconstructed more expressed known isoforms than any of the other tested de novo assemblers, at a moderate reduction in precision. For recall, ClusTrast ranked highest at the lower end of expression levels (below the 15th percentile) for all tested datasets, and over the entire range for almost all datasets. Reference transcripts were often (35–69% across the six datasets) reconstructed to at least 95% of their length by ClusTrast, and more than half of reference transcripts (58–81%) were reconstructed with contigs that exhibited polymorphism, as measured on a subset of reliably predicted contigs.
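To make the three steps named above more concrete (cluster guiding contigs, assign reads to contigs, group reads per cluster for separate assembly), the following toy sketch may help. It is not ClusTrast's implementation: the k-mer Jaccard similarity, the single-linkage clustering, the shared-k-mer read assignment used as a stand-in for alignment, the threshold values, and all function names and sequences are assumptions chosen only for illustration. The per-cluster assembly step itself is left out.

```python
from collections import defaultdict


def kmers(seq, k=5):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_contigs(contigs, k=5, min_jaccard=0.3):
    """Single-linkage clustering by k-mer Jaccard similarity (assumed metric).

    Returns a dict mapping contig name -> cluster representative.
    """
    names = list(contigs)
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    profiles = {n: kmers(contigs[n], k) for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            union = profiles[a] | profiles[b]
            shared = profiles[a] & profiles[b]
            if union and len(shared) / len(union) >= min_jaccard:
                parent[find(a)] = find(b)
    return {n: find(n) for n in names}


def assign_reads(reads, contigs, clusters, k=5):
    """Group reads by the cluster of the contig sharing the most k-mers.

    Shared k-mer counting is a crude stand-in for read alignment;
    reads sharing no k-mer with any contig are dropped.
    """
    profiles = {n: kmers(s, k) for n, s in contigs.items()}
    groups = defaultdict(list)
    for read in reads:
        rk = kmers(read, k)
        best = max(profiles, key=lambda n: len(rk & profiles[n]))
        if rk & profiles[best]:
            groups[clusters[best]].append(read)
    return dict(groups)


# Hypothetical guiding contigs: two isoform-like sequences and one unrelated.
contigs = {
    "iso_a": "ACGTACGGTTCAGGCTAAGCTTGA",
    "iso_b": "ACGTACGGTTCAGGCTAAGCCCATG",
    "other": "TTTTGGGGCCCCAAAATTTGGGCC",
}
reads = ["GGTTCAGGCT", "AAGCCCATG", "GGGGCCCCAAAA", "CACACACACA"]

clusters = cluster_contigs(contigs)
groups = assign_reads(reads, contigs, clusters)

# Each read group would then be handed to an assembler on its own.
for cluster_id, cluster_reads in groups.items():
    print(cluster_id, cluster_reads)
```

In this toy input the two isoform-like contigs fall into one cluster, so reads from their shared and isoform-specific regions end up in the same group, while the unrelated contig forms its own group and the unmatched read is discarded.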

Conclusion We suggest that ClusTrast can be a useful tool for studying isoforms in species without a reliable reference genome, in particular when the goal is to produce a comprehensive transcriptome set with polymorphic variants.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • westrin@kth.se, wk@warrenwk.com, olofem@kth.se

  • We have tried to make it clearer that the purpose of ClusTrast is to produce a comprehensive set of transcripts, and that it is not a suitable tool when specificity is the focus. The revision includes more analyses of the performance of the assemblers, e.g. the difference between true positive sets as defined by SQANTI and CRBB (Section 3.1.3) and an investigation of polymorphisms and alternative splicing in the reconstructed transcriptomes (new Section 3.1.5). The description of our method, ClusTrast, has been moved from the Results chapter to the Implementation chapter (formerly Materials and Methods). A new chapter, Conclusion, has been introduced. One of the datasets in our original submission, poplar, turned out to be a mixed dataset that also included a fungus. We have therefore removed that dataset from the main manuscript and use another poplar dataset instead.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted November 22, 2022.
Subject Area

  • Bioinformatics