Machine Learning for Population Genetics: A New Paradigm

Daniel R. Schrider; Andrew D. Kern

doi:10.1101/206482

Abstract

As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning. We review the fundamentals of machine learning, discuss recent applications of supervised machine learning to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised machine learning is an important and underutilized tool that has considerable potential for the world of evolutionary genomics.

Glossary

Feature vector: A multidimensional representation of a data point made up of measurements (or features) taken from it (e.g. a vector of population genetic summary statistics measured in a genomic region).
Training: The process of algorithmically generating from a training set a function that seeks to correctly predict a datum's response variable by examining its feature vector.
Labeled data: Data examples for which the true response value (or label) is known.
Training set: A set of labeled examples for use during training.
Test set: A set of labeled examples for use during testing that is independent of the training set.
Loss function: A measure of how correctly an example's response variable was predicted.
Risk function: A measure of aggregated loss across an entire training set (e.g. the expected value of the loss function). We wish to minimize the value of the risk function during training.
Regression: A machine learning task where the value to be predicted for each example is a continuous number.
Classification: A machine learning task where the value to be predicted for each example is a categorical label.
Binary classification: A classification task in which there are two possible class labels, often termed positives and negatives.
Precision: In binary classification, the fraction of all examples classified as positives that are true positives (i.e. the number of true positives divided by the sum of the number of true positives and number of false positives). Also known as the positive predictive value.
Recall: In binary classification, the fraction of all positives that are correctly predicted as such (i.e. the number of true positives divided by the sum of the number of true positives and number of false negatives). Also known as sensitivity.
Confusion matrix: A table for visualizing accuracy in multi-class classification, which is simply the contingency table of the true and predicted classes for each example in a test set (see the figure in Box 1 for an example).
Overfitting: When a model has achieved excellent accuracy on the training set but does not generalize well; i.e. the model has been tuned to precisely recognize patterns of noise in this set that are unlikely to be present in an independent test set. Sometimes referred to as overtraining.
n-fold cross validation: When only a small set of labeled data is available, cross validation can be used to measure accuracy. This process partitions the labeled data into n non-overlapping equally sized sets, and trains the predictor on the union of n-1 of these before testing on the remaining set. This is repeated n times, so that each of the n sets is used as the test set exactly once, and the average accuracy is recorded.
Boosting: A class of machine learning techniques that seek to iteratively construct a set of predictors, weighting each predictor's influence on the final prediction according to its individual accuracy. Additionally, in most algorithms the new predictor to be added to the set focuses on examples that the current set of predictors has struggled with.
Support vector machine (SVM): A machine learning approach that seeks to find the hyperplane that optimally separates two classes of training data. These data are often mapped to high-dimensional space using a kernel function. Variations of this approach can be performed to accomplish multi-class classification or regression.
Decision tree: A hierarchical structure that predicts an example's response variable by examining a feature, and branching to the right subtree if the value of that feature is greater than some threshold, and branching to the left otherwise. At the next level of the tree, another feature is examined. The predicted value is determined by which leaf of the tree is reached at the end of this process.
Random forest: An ensemble of semi-randomly generated decision trees. An example is run through each tree in the forest, and these trees then vote to determine the predicted value. Random forests can perform both classification and regression.
Artificial neural network (ANN): A network of layers of one or more “neurons” which receive inputs from each neuron in the previous layer, and perform a linear combination on these inputs which is then passed through an activation function. The first layer is the input layer (i.e. the feature vector) and the last layer is the output layer, yielding the predicted responses. Intervening layers are referred to as “hidden” layers.
Deep learning: Learning using ANNs or similarly networked algorithmic models that contain multiple "hidden" layers between the input and output layers.
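
The glossary definitions of the confusion matrix, precision, and recall can be illustrated with a minimal Python sketch. The function names, the class labels "sweep" and "neutral", and the toy label lists are hypothetical choices made for illustration only; they are not data or code from the review.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Contingency table: rows are true classes, columns are predicted classes."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def precision_recall(true_labels, predicted_labels, positive):
    """Precision and recall for a binary task, treating `positive` as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision = TP / (TP + FP); recall = TP / (TP + FN).
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Hypothetical test-set labels (illustrative values only).
true_labels = ["sweep", "sweep", "sweep", "neutral",
               "neutral", "neutral", "neutral", "sweep"]
predicted = ["sweep", "sweep", "neutral", "neutral",
             "sweep", "neutral", "sweep", "sweep"]

matrix = confusion_matrix(true_labels, predicted, ["sweep", "neutral"])
precision, recall = precision_recall(true_labels, predicted, positive="sweep")
print(matrix)             # [[3, 1], [2, 2]]
print(precision, recall)  # 0.6 0.75
```

In this toy example, three of the five examples classified as "sweep" are true positives (precision 0.6), and three of the four true "sweep" examples are recovered (recall 0.75).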

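The n-fold cross validation procedure defined in the glossary can be sketched in plain Python as follows. The nearest-centroid predictor, the function names, and the toy one-feature data set are assumptions introduced purely for illustration; any training and prediction functions could be substituted.

```python
import random


def n_fold_splits(num_examples, n, seed=0):
    """Yield (train_indices, test_indices) for each of n non-overlapping folds.

    Fold sizes differ by at most one when num_examples is not divisible by n.
    """
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n] for i in range(n)]
    for i in range(n):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, folds[i]


def cross_validate(features, labels, n, train_fn, predict_fn, seed=0):
    """Train on n-1 folds, test on the held-out fold, and average the accuracy."""
    accuracies = []
    for train_idx, test_idx in n_fold_splits(len(features), n, seed):
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, features[i]) == labels[i]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / n


# Hypothetical predictor: assign the class whose training mean is closest.
def train_centroids(xs, ys):
    sums = {}
    for x, y in zip(xs, ys):
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}


def predict_centroid(model, x):
    return min(model, key=lambda label: abs(model[label] - x))


# Toy labeled data: one feature per example, two well-separated classes.
features = [0.1, 0.2, 0.3, 0.4, 0.5, 1.6, 1.7, 1.8, 1.9, 2.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

mean_accuracy = cross_validate(features, labels, 5,
                               train_centroids, predict_centroid)
print(mean_accuracy)  # 1.0 for this well-separated toy data
```

Because each example appears in exactly one test fold, every labeled example contributes to the accuracy estimate while never being used to train the model that predicts it.
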
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.