New Results

Machine Learning for Population Genetics: A New Paradigm

Daniel R. Schrider*†, Andrew D. Kern*†
doi: https://doi.org/10.1101/206482
*Department of Genetics, Rutgers University, Piscataway, New Jersey 08554
†Human Genetics Institute of New Jersey, Rutgers University, Piscataway, New Jersey 08554
For correspondence: dan.schrider@rutgers.edu kern@biology.rutgers.edu

Abstract

As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning. We review the fundamentals of machine learning, discuss recent applications of supervised machine learning to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised machine learning is an important and underutilized tool that has considerable potential for the world of evolutionary genomics.

  • Glossary

    Feature vector
    A multidimensional representation of a data point made up of measurements (or features) taken from it (e.g. a vector of population genetic summary statistics measured in a genomic region).
    Training
    The process of algorithmically generating from a training set a function that seeks to correctly predict a datum's response variable by examining its feature vector.
    Labeled data
    Data examples for which the true response value (or label) is known.
    Training set
    A set of labeled examples for use during training.
    Test set
    A set of labeled examples for use during testing that is independent of the training set.
    Loss function
    A measure of how correctly an example's response variable was predicted.
    Risk function
    A measure of aggregated loss across an entire training set (e.g. the expected value of the loss function). We wish to minimize the value of the risk function during training.
    Regression
    A machine learning task where the value to be predicted for each example is a continuous number.
    Classification
    A machine learning task where the value to be predicted for each example is a categorical label.
    Binary classification
    A classification task in which there are two possible class labels, often termed positives and negatives.
    Precision
    In binary classification, the fraction of all examples classified as positives that are true positives (i.e. the number of true positives divided by the sum of the number of true positives and number of false positives). Also known as the positive predictive value.
    Recall
    In binary classification, the fraction of all positives that are correctly predicted as such (i.e. the number of true positives divided by the sum of the number of true positives and number of false negatives). Also known as sensitivity.
    Confusion matrix
    A table for visualizing accuracy in multi-class classification, which is simply the contingency table of the true and predicted classes for each example in a test set (see Figure Box 1 for an example).
    Overfitting
    When a model has achieved excellent accuracy on the training set but does not generalize well, i.e. the model has been tuned to precisely recognize patterns of noise in this set that are unlikely to be present in an independent test set. Sometimes referred to as overtraining.
    n-fold cross validation
    When only a small set of labeled data are available, cross validation can be used to measure accuracy. This process partitions the labeled data into n non-overlapping equally sized sets, and trains the predictor on the union of n-1 of these before testing on the remaining set. This is repeated n times, so that each of the n sets is used as the test set exactly once, and the average accuracy is recorded.
    Boosting
    A class of machine learning techniques that seek to iteratively construct a set of predictors, weighting each predictor's influence on the final prediction according to its individual accuracy. Additionally, in most algorithms the new predictor to be added to the set focuses on examples that the current set of predictors has struggled with.
    Support vector machine (SVM)
    A machine learning approach that seeks to find the hyperplane that optimally separates two classes of training data. These data are often mapped to a high-dimensional space using a kernel function. Variations of this approach can be performed to accomplish multi-class classification or regression.
    Decision tree
    A hierarchical structure that predicts an example's response variable by examining a feature, and branching to the right subtree if the value of that feature is greater than some threshold, and branching to the left otherwise. At the next level of the tree, another feature is examined. The predicted value is determined by which leaf of the tree is reached at the end of this process.
    Random forest
    An ensemble of semi-randomly generated decision trees. An example is run through each tree in the forest, and these trees then vote to determine the predicted value. Random forests can perform both classification and regression.
    Artificial neural network (ANN)
    A network of layers of one or more “neurons”, each of which receives inputs from every neuron in the previous layer and computes a linear combination of these inputs, which is then passed through an activation function. The first layer is the input layer (i.e. the feature vector) and the last layer is the output layer, yielding the predicted responses. Intervening layers are referred to as “hidden” layers.
    Deep learning
    Learning using ANNs or similarly networked algorithmic models that contain multiple “hidden” layers between the input and output layers.
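The glossary definitions of the confusion matrix, precision, and recall can be checked numerically. The following is a minimal sketch in plain Python; the function names and the toy labels ("sweep" as the positive class, "neutral" as the negative class) are hypothetical and chosen only for illustration.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Contingency table: counts of each (true, predicted) label pair."""
    return Counter(zip(true_labels, predicted_labels))

def precision_recall(true_labels, predicted_labels, positive):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    counts = confusion_matrix(true_labels, predicted_labels)
    tp = counts[(positive, positive)]
    fp = sum(n for (t, p), n in counts.items() if p == positive and t != positive)
    fn = sum(n for (t, p), n in counts.items() if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

# Hypothetical test set of eight genomic windows (illustrative labels only).
true_labels = ["sweep", "sweep", "sweep", "neutral",
               "neutral", "neutral", "neutral", "neutral"]
predicted = ["sweep", "sweep", "neutral", "sweep",
             "neutral", "neutral", "neutral", "neutral"]

counts = confusion_matrix(true_labels, predicted)
precision, recall = precision_recall(true_labels, predicted, positive="sweep")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here two of the three windows called "sweep" are true positives and two of the three true sweeps are recovered, so both quantities equal 2/3.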
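The n-fold cross validation procedure described in the glossary can be sketched as follows. This is an illustration under stated assumptions, not a recommended implementation: the "predictor" is a hypothetical one-feature threshold classifier that assumes class 1 has larger feature values, the data are invented, and when the number of examples is not divisible by n the folds differ in size by at most one.

```python
def n_fold_splits(num_examples, n):
    """Yield (train_indices, test_indices) for n non-overlapping folds."""
    indices = list(range(num_examples))
    folds = [indices[i::n] for i in range(n)]
    for i, test in enumerate(folds):
        # Train on the union of the other n-1 folds.
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

def train_threshold(features, labels):
    """Toy training step: midpoint between the two class means."""
    class0 = [x for x, y in zip(features, labels) if y == 0]
    class1 = [x for x, y in zip(features, labels) if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def cross_validate(features, labels, n):
    """Average test accuracy over the n folds."""
    accuracies = []
    for train, test in n_fold_splits(len(features), n):
        threshold = train_threshold([features[j] for j in train],
                                    [labels[j] for j in train])
        correct = sum((features[j] > threshold) == (labels[j] == 1)
                      for j in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# Hypothetical, well-separated one-dimensional data (illustration only).
features = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75]
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
mean_accuracy = cross_validate(features, labels, n=5)
print(mean_accuracy)
```

Each example is used for testing exactly once and is never in the training set of the fold that tests it, which is what keeps the accuracy estimate independent of the training data.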
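The decision tree and random forest entries describe prediction as threshold-based branching followed by a vote. A minimal sketch of the prediction step only is given below. The trees are hand-written and hypothetical (a real random forest would learn semi-randomized trees from a training set), and the tuple encoding of a tree is an assumption made for brevity.

```python
from collections import Counter

# Assumed encoding: a tree is either a leaf (the predicted label) or a tuple
# (feature_index, threshold, left_subtree, right_subtree).

def predict_tree(tree, feature_vector):
    """Branch right if the examined feature exceeds the threshold, else left."""
    while isinstance(tree, tuple):
        feature_index, threshold, left, right = tree
        tree = right if feature_vector[feature_index] > threshold else left
    return tree  # the leaf reached determines the prediction

def predict_forest(trees, feature_vector):
    """Each tree votes; the most common label is the forest's prediction."""
    votes = Counter(predict_tree(tree, feature_vector) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hypothetical trees over a two-feature vector (illustration only).
tree_a = (0, 0.5, "sweep", "neutral")
tree_b = (1, -1.0, "sweep", (0, 0.2, "sweep", "neutral"))
tree_c = (1, 0.0, "sweep", "neutral")
forest = [tree_a, tree_b, tree_c]

example = [0.3, -0.5]
print([predict_tree(tree, example) for tree in forest])
print(predict_forest(forest, example))
```

For this example the trees disagree (two vote "sweep", one votes "neutral"), so the majority vote decides the outcome. For regression, the vote would typically be replaced by an average of the trees' numeric predictions.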
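The forward pass of an artificial neural network, as defined in the glossary, can be sketched in a few lines. The weights below are arbitrary hypothetical values rather than trained parameters, the sigmoid is only one possible activation function, and the per-neuron bias term is a common convention that the glossary definition does not mention.

```python
import math

def sigmoid(x):
    """One common choice of activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def dense_layer(inputs, weights, biases, activation):
    """Each neuron takes a linear combination of all inputs, then activates."""
    return [activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
            for neuron_weights, b in zip(weights, biases)]

def forward(feature_vector, layers):
    """Pass the feature vector (input layer) through each subsequent layer."""
    activations = feature_vector
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases, sigmoid)
    return activations  # output layer: the predicted response(s)

# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
network = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                               # output layer
]
output = forward([0.2, 0.4, 0.6], network)
print(output)
```

A model with several such hidden layers stacked between input and output is what the glossary calls deep learning; training, which adjusts the weights to minimize a risk function, is not shown here.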
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
    Posted October 20, 2017.
    Subject Area

    • Evolutionary Biology