bioRxiv
New Results

Probe Efficient Feature Representation of Gapped K-mer Frequency Vectors from Sequences using Deep Neural Networks

Zhen Cao, Shihua Zhang
doi: https://doi.org/10.1101/170761
Zhen Cao
Academy of Mathematics and Systems Science, Chinese Academy of Sciences
Shihua Zhang
Academy of Mathematics and Systems Science, Chinese Academy of Sciences
  • For correspondence: zsh@amss.ac.cn

Abstract

Extracting informative features from genome sequences is a challenging problem. Gapped k-mer frequency vectors (gkm-fvs) have been introduced as a new type of feature in the last few years. Coupled with a support vector machine (gkm-SVM), gkm-fvs have been used to achieve effective sequence-based predictions. However, the heavy computation required for a large kernel matrix prevents gkm-SVM from using large amounts of data, and it is unclear how to combine gkm-fvs with other data sources in the context of a string kernel. On the other hand, the high dimensionality, collinearity and sparsity of gkm-fvs hinder the use of many traditional machine learning methods without a kernel trick. Therefore, we proposed gkm-DNN, a flexible and scalable framework that learns feature representations from high-dimensional gkm-fvs using deep neural networks (DNNs). We first proposed a more concise version of gkm-fvs, which significantly reduces their dimension. Then, for the first time, we implemented an efficient method to calculate the gkm-fv of a given sequence. Finally, we adopted a DNN model with gkm-fvs as inputs to achieve efficient feature representation and to perform a prediction task. Here, we took transcription factor binding site prediction as an illustrative application. We applied gkm-DNN to 467 small and 69 big human ENCODE ChIP-seq datasets to demonstrate its performance and compared it with the state-of-the-art method gkm-SVM. We demonstrated that gkm-DNN not only overcomes the limitations imposed by the high dimensionality, collinearity and sparsity of gkm-fvs, but also achieves overall performance comparable to that of gkm-SVM using the same gkm-fvs. In addition, we used gkm-DNN to explore the representation power of gkm-fvs and provided further explanation of how gkm-fvs work.
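
To make the notion of a gapped k-mer frequency vector concrete, the following is a minimal sketch of the standard, uncompressed definition: every window of length l contributes one count for each way of keeping k informative positions and masking the rest. It is not the concise gkm-fv variant or the efficient algorithm described in the abstract, whose details are not given here. The function names `gapped_kmer_counts` and `gkm_dimension`, the gap symbol `-`, and the default values of `l` and `k` are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb


def gapped_kmer_counts(seq, l=4, k=2):
    """Naively count gapped k-mers in seq.

    Each length-l window yields one pattern per choice of k kept
    positions; the remaining l - k positions are masked with '-'.
    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    masks = list(combinations(range(l), k))  # all choices of kept positions
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        if any(base not in "ACGT" for base in window):
            continue
        for kept in masks:
            pattern = "".join(
                window[i] if i in kept else "-" for i in range(l)
            )
            counts[pattern] += 1
    return counts


def gkm_dimension(l, k):
    """Length of the full (uncompressed) gapped k-mer frequency vector."""
    return comb(l, k) * 4 ** k


if __name__ == "__main__":
    counts = gapped_kmer_counts("ACGTACGT", l=4, k=2)
    for pattern, n in sorted(counts.items()):
        print(pattern, n)
    print("full vector length:", gkm_dimension(4, 2))
```

The vector length grows as C(l, k) · 4^k while any single sequence populates only a small number of entries, which illustrates the high dimensionality and sparsity that the abstract refers to.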

Copyright 
The copyright holder for this preprint is the author/funder. All rights reserved. No reuse allowed without permission.
  • Posted February 4, 2018.

Subject Area

  • Bioinformatics