Low-dimensional representation of genomic sequences

Richard C Tillquist; Manuel E Lladser

doi:10.1007/s00285-019-01348-1

J Math Biol. 2019 Jul;79(1):1-29. doi: 10.1007/s00285-019-01348-1. Epub 2019 Mar 30.

Authors

Richard C Tillquist¹, Manuel E Lladser²

Affiliations

¹ Department of Computer Science, The University of Colorado, Boulder, CO, 80309-0526, USA.
² Department of Applied Mathematics, The University of Colorado, Boulder, CO, 80309-0526, USA. manuel.lladser@colorado.edu.

PMID: 30929047

Abstract

Numerous data analysis and data mining techniques require that data be embedded in a Euclidean space. When faced with symbolic datasets, particularly biological sequence data produced by high-throughput sequencing assays, conventional embedding approaches like binary and k-mer count vectors may be too high-dimensional or too coarse-grained to learn from the data effectively. Other representation techniques such as Multidimensional Scaling (MDS) and Node2Vec may be inadequate for large datasets as they require recomputing the full embedding from scratch when faced with new, unclassified data. To overcome these issues, we amend the graph-theoretic notion of "metric dimension" to that of "multilateration." Much as trilateration can be used to represent points in the Euclidean plane by their distances to three non-collinear points, multilateration allows us to represent any node in a graph by its distances to a subset of nodes. Unfortunately, the problem of determining a minimal subset, and hence the lowest-dimensional embedding, is NP-complete for general graphs. However, by specializing to Hamming graphs, which are particularly well suited to representing biological sequences, we can readily generate low-dimensional embeddings to map sequences of arbitrary length to a real space. As proof of concept, we use MDS, Node2Vec, and multilateration-based embeddings to classify DNA 20-mers centered at intron-exon boundaries. Although these different techniques perform comparably, MDS and Node2Vec potentially suffer from scalability issues with increasing sequence length, whereas multilateration provides an efficient means of mapping long genomic sequences.
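
As a rough illustration of the multilateration idea in the abstract, the Python sketch below maps each DNA k-mer to its vector of Hamming distances to a small set of landmark sequences, then checks by brute force whether those vectors are pairwise distinct (that is, whether the landmarks form a resolving set). The names hamming, embed, is_resolving, and LANDMARKS are hypothetical and chosen for illustration only; the landmark set is a hand-picked example for k = 2, and the exhaustive check is practical only for small k. This is not the construction used by the authors.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def embed(seq: str, landmarks: list[str]) -> tuple[int, ...]:
    """Represent seq by its Hamming distances to each landmark sequence."""
    return tuple(hamming(seq, r) for r in landmarks)


def is_resolving(landmarks: list[str], k: int, alphabet: str = "ACGT") -> bool:
    """Brute-force check that all k-mers receive distinct distance vectors.

    Enumerates len(alphabet)**k sequences, so it only suits small k.
    """
    seen = set()
    for kmer in map("".join, product(alphabet, repeat=k)):
        vector = embed(kmer, landmarks)
        if vector in seen:
            return False  # two k-mers share a vector: not resolving
        seen.add(vector)
    return True


# Illustrative landmark set for 2-mers (an assumption, not taken from the paper).
LANDMARKS = ["CA", "GA", "AC", "AG"]

if __name__ == "__main__":
    print(embed("CT", LANDMARKS))                     # 4-dimensional embedding of "CT"
    print(is_resolving(LANDMARKS, k=2))               # distinct vectors for all 16 2-mers?
    print(is_resolving(["AA", "CC", "GG", "TT"], 2))  # a set where "AC" and "CA" collide
```

In this sketch the embedding dimension equals the number of landmarks, so a new sequence is embedded by computing a handful of Hamming distances, with no need to recompute vectors for sequences already embedded.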

Keywords: Feature extraction; Graph embeddings; Hamming graph; Metric dimension; Reads; Resolving set.

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Computer Simulation
Data Analysis
Data Mining / methods
Genomics / methods*
Genomics / statistics & numerical data
High-Throughput Nucleotide Sequencing*
Principal Component Analysis
Proof of Concept Study
Sequence Analysis, DNA*

Grants and funding

S10 OD012300/OD/NIH HHS/United States