bioRxiv
New Results

A topic modelling analysis of TCGA breast and lung cancer transcriptomic data

Filippo Valle, Matteo Osella, Michele Caselle
doi: https://doi.org/10.1101/2020.10.19.345694
Filippo Valle
1Physics Department, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
  • For correspondence: filippo.valle@unito.it
Matteo Osella
1Physics Department, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
Michele Caselle
1Physics Department, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy

Abstract

Topic modelling is a widely used technique for extracting relevant information from large arrays of data. The problem of finding a topic structure in a dataset was recently recognized to be analogous to the community detection problem in network theory. Leveraging this analogy, a new class of topic modelling strategies has been introduced to overcome some of the limitations of classical methods. This paper applies these recent ideas to TCGA transcriptomic data on breast and lung cancer. The inferred latent topic structure reconstructs the established cancer subtype organization well. Moreover, we identify specific topics that are enriched in genes known to play a role in the corresponding disease and that are strongly related to the survival probability of patients. Finally, we show that a simple neural network classifier operating in the low-dimensional topic space can predict the cancer subtype of a test expression sample with high accuracy.

Competing Interest Statement

The authors have declared no competing interest.

  • Abbreviations

    The following abbreviations are used in this manuscript:
    hSBM: hierarchical stochastic block model
    TP, FP, TN, FN: True Positives, False Positives, True Negatives, False Negatives
    FDR: False Discovery Rate
    FPKM: Fragments Per Kilobase of transcript per Million mapped reads
    GSEA: Gene Set Enrichment Analysis
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
    Posted December 12, 2020.

    Subject Area

    • Cancer Biology