STPGA: Selection of training populations with a genetic algorithm

Optimal subset selection is an important task with numerous algorithms designed for it and many areas of application. STPGA contains a special genetic algorithm for generic optimal subset selection problems, supplemented with a tabu memory property (which keeps track of recently tried solutions and their fitness for a number of iterations) and with a regression of the fitness of the solutions on their coding, used to form an estimated ideal solution (a look-ahead property). I initially developed the programs for the specific problem of selecting training populations for genomic prediction or association problems; I therefore discuss the theory behind optimal design of experiments to explain the default optimization criteria in STPGA, and illustrate the use of the programs in this setting. Nevertheless, I have picked a few other areas of application: supervised and unsupervised variable selection based on kernel alignment, supervised variable selection with design criteria, identification of influential observations in regression, solving mixed integer quadratic optimization problems, and balancing gains and inbreeding in a breeding population. Some of these illustrations involve new statistical approaches.


Introduction
This paper introduces the R (R Core Team 2016) package STPGA, which provides a genetic algorithm for subset selection. The package is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=STPGA. Some of the underlying motivations, methodology and results were presented in Isidro et al. (2015), Crossa et al. (2016) and Akdemir and Sánchez (2016), along with some innovations that will be detailed in subsequent articles. This document describes version 4.0, which includes major upgrades and bug fixes compared to previous versions.
Numerous other algorithms have been proposed for the optimal subset selection problem; many of them are heuristic exchange-type algorithms (Fedorov 1972; Mitchell 1974; Nguyen and Miller 1992; Rincent et al. 2012; Isidro et al. 2015). In exchange-type algorithms, new solutions are obtained by adding one point and removing another at a time (some variants allow exchanging more than one design point at once). These algorithms are greedy and are proven to find the best subset only for certain types of design criteria; in general, they are prone to getting stuck in locally optimal solutions. Branch and bound (BB) (Furnival and Wilson 1974) is a global exhaustive search method that has proven reasonably efficient on practical problems. BB searches the design region by iteratively dividing it and searching each piece for an optimal solution; it is often more efficient than straight enumeration because it can eliminate regions that provably do not contain an optimal solution. Welch (1982) uses a BB algorithm to find the globally D-optimal design for a given set of candidate points. Another method that has been applied to the subset selection problem is simulated annealing (Haines 1987).
Both branch and bound and simulated annealing algorithms require appreciable computation time even for moderately sized problems.
Genetic algorithms (GAs) are a class of evolutionary algorithms made popular by John Holland and his colleagues (Goldberg and Holland 1988;Holland 1992a,b), and have been applied to find exact or approximate solutions to optimization and search problems (Goldberg and Holland 1988;Sivanandam and Deepa 2007;Akdemir and Sánchez 2016). There are numerous packages that implement GAs or similar evolutionary algorithms.
The packages gafit (Tendys 2002), galts (Satman 2013), genalg (Willighagen 2005), rgenoud (Mebane Jr et al. 2015), DEoptim (Ardia et al. 2016) and the GA (Scrucca et al. 2013) offer many options for using optimization routines based on evolutionary algorithms. The optimization algorithm that is used in STPGA (LA-GA-T algorithm) is a modified genetic algorithm with tabu search and look ahead property and it is specialized for solving subset selection problems.
Today's trends in computation are towards architectures that integrate many, less complex processors and exploit thread-level and data-level parallelism. This makes such computers perfect ground for the implementation of evolutionary algorithms for solving complex optimization problems, since these algorithms can easily be run in parallel. To make the point clear, recall Amdahl's law (Amdahl 1967), which puts a limit on the speedup that can be gained by parallelizing a process:

speedup = 1 / ((1 - f_par) + f_par / N_p),

where f_par is the fraction of the code that can be parallelized, 1 - f_par is the serial fraction, and N_p is the number of processors. Figure 1 was obtained by applying this formula for values of f_par between 0 and 1 and values of N_p in {1, 2, 4, 8, 16, 32, 64, 128, 256}; under the assumption that the number of processors doubles each year, with the processing capabilities of each parallel node and everything else held identical, it shows that taking full advantage of parallelization requires methods whose parallelizable fraction is close to one. From the same figure we can read that a 1% change in the parallelizable fraction can yield up to a 3.5-fold speedup when there are 256 processors, and that an ideally parallelizable procedure can be more than 50 times faster than an 80% parallelizable one. Evolutionary algorithms, like GAs, fall at the very right end of these figures.
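As a quick numerical check of these figures, Amdahl's law can be evaluated directly; the short function below is an illustrative sketch, not part of STPGA.

```r
# Amdahl's law: speedup attainable with Np processors when a
# fraction f_par of the work can be parallelized.
amdahl <- function(f_par, Np) {
  1 / ((1 - f_par) + f_par / Np)
}

# With 256 processors, a perfectly parallelizable procedure vs an 80% one:
amdahl(1.00, 256)  # 256
amdahl(0.80, 256)  # about 4.92, i.e., more than 50 times slower
```

The ratio 256 / 4.92 is roughly 52, which is the source of the "more than 50 times" comparison above.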
At Michigan State University, at the beginning of 2017, there were hundreds of nodes available to researchers through the high performance computing cluster (HPCC) system. It is not science fiction to claim that clusters with millions of nodes will become available to researchers through technologies such as cloud and grid computing. We are faced with the challenge of matching these parallel resources with methods that can use them efficiently. In my view, the easy adaptation of GAs to parallel computation is a major advantage of GAs over other subset selection algorithms. GAs scale well with large and/or computationally expensive problems and can achieve reasonable running times on parallel computing architectures with either shared or distributed memory, and such systems are becoming increasingly available to researchers. The scientific community tends to prefer algorithms that run fast in serial; however, improvements in computer technology seem to be coming more from parallelization than from faster processors.
Some advantages of GAs over other subset selection algorithms include the following: • GAs can be applied to many different problems and do not need to be reinvented for each new problem.
• GAs are very flexible optimization algorithms; the evolutionary mechanisms involved can be modified at different stages of the algorithm, and various selection strategies and various penalizations of the objective function can be explored simultaneously or serially.
• Adapting GAs to a parallel computing environment is easy. Parallelism might involve, for example, evaluating the fitness function for the current GA population in parallel, or running many GAs in parallel to provide the initial generation of solutions for a final GA.
In addition, the LA-GA-T algorithm in STPGA adds two more properties: • Inferior solutions that were visited recently will not be revisited. This property is akin to memory in an intelligent system.
• The current population of solutions is used to predict an ideal solution, giving the algorithm its look-ahead property. This property is akin to inference in an intelligent system.
Nature solves problems through evolutionary processes: it works with communities of solutions and exploits their communal information content to create new solutions; it is this information that persists, not the individual solutions. In addition, we have long-standing theories explaining how and why such evolutionary processes work. There is also a vast amount of principled study of evolutionary mechanisms, both natural and artificial: the subject of evolutionary genetics, the methodology, theories and practices of breeding, and the theoretical and practical study of evolutionary algorithms and computation all allow us to understand and manage these systems.
In the next section, I briefly review the basic ideas behind simple GAs and the LA-GA-T algorithm used in STPGA. I then present the details of the interface to the STPGA package in Section 3, followed by an examples section and the conclusions. The examples section has two main parts: STPGA for selection of training populations, and STPGA in other subset selection problems. Both parts are rather long and detailed, especially the part relating to optimal design, which introduces some concepts and ideas of optimal design of experiments with a focus on predictive learning using regression models. Some of the design criteria discussed in this section are implemented in STPGA; a table listing these criteria is provided. I also demonstrate how to write user-defined criteria.

Optimizer in STPGA
The optimization algorithm used in STPGA is a modified genetic algorithm with tabu search and a look-ahead property. Genetic algorithms are stochastic search algorithms that solve optimization problems using evolutionary strategies inspired by the basic principles of biological evolution. They use a population of candidate solutions, represented as binary strings of 0's and 1's, that evolves toward better solutions.
At each iteration of the algorithm, a fitness function is used to evaluate and select the elite individuals and subsequently the next population is formed from the elites by genetically motivated operations such as crossover and mutation. The properties and prospects of genetic algorithms were first laid out in the cornerstone book of Holland (Holland 1992a).
GAs have an implicit parallelism property (Holland 1992a). In addition, since a GA uses a set of solutions at each iteration, it couples well with today's advanced computers (workstations with many processors and large memory) and computer systems (high performance computing clusters, cloud computing technologies), allowing it to be applied to very large scale optimization problems. In my opinion, with the advent of new technologies like DNA computing (Liu et al. 2000; Paun et al. 2005), which uses programmable molecular computing machines, and quantum computers (Gruska 1999; Leuenberger and Loss 2001), which operate on "qubits", parallelizable algorithms such as GAs will play an increasingly important role in large scale optimization problems. The GA in STPGA is supplemented with two additional principles, tabu (memory) and inference through prediction based on the current population of solutions. I refer to it as the LA-GA-T (look ahead genetic algorithm with tabu) algorithm.
Tabu search avoids the most recently visited solutions by keeping track of previously tried solutions. This saves many function evaluations and decreases the number of iterations until convergence; it is especially useful for generating new solutions around local optima.
The LA-GA-T algorithm in STPGA also uses the binary coding of the current population of solutions and their fitness values to fit a linear ridge regression model, from which the effects of the individual digits in the binary code are estimated, under the assumption that the contribution of an individual to the criterion value does not change much across different subsets. The predicted ideal solution based on this model is constructed and included in the elite population of solutions. This gives the algorithm a look-ahead property and improves the speed of convergence, especially in the initial steps of the optimization. I should note that the idea of regressing the fitnesses of solutions on their codings was inspired by the genomic selection methodology recently put into use in plant and animal breeding, with the promise of increasing genetic gains from selection per unit of time.
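The look-ahead step can be sketched as follows. This is an illustrative reconstruction, not STPGA's internal code: the toy fitness function and all variable names are my own, and the "criterion" is simply the sum of per-individual values, so the linear model is exact.

```r
set.seed(1)
N <- 20; n <- 5; npop <- 30
# Toy problem: each of N candidates has an (unknown) value; the
# criterion of a subset is the sum of its members' values (minimized).
vals <- rnorm(N)
fitness <- function(sol) sum(vals[sol == 1])

# Current population of solutions coded as binary strings of length N
pop <- t(sapply(1:npop, function(i) {
  s <- rep(0, N); s[sample(N, n)] <- 1; s
}))
crit <- apply(pop, 1, fitness)

# Ridge regression of the fitnesses on the binary codings
lambda <- 1e-6
Xc   <- scale(pop, scale = FALSE)                 # center columns
beta <- solve(crossprod(Xc) + lambda * diag(N),
              crossprod(Xc, crit - mean(crit)))   # estimated digit effects

# Predicted ideal solution: the n positions with the most favorable
# (most negative) estimated effects
ideal <- rep(0, N)
ideal[order(beta)[1:n]] <- 1
fitness(ideal) <= min(crit)   # the look-ahead solution beats the population
```

In this toy setting the regression recovers the per-individual values (up to a constant shift), so the predicted ideal solution is at least as good as any solution already in the population; in STPGA this constructed solution is simply added to the elites.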
As can be seen from Figures 2 and 3, LA-GA-T converges in many fewer iterations than a simple GA. However, I have to note that the per-iteration computation time of the LA-GA-T algorithm is slightly higher than that of a simple GA.
Procedure 1: The LA-GA-T genetic algorithm (main steps)

repeat
  Evaluation - For each solution in S_{t-1}, calculate the criterion value.
  Look ahead - Use the binary coding of the solutions in S_{t-1} and their fitness values to fit a linear ridge regression model from which the effects of the individual digits in the binary code are estimated; construct the predicted ideal solution and put it in S_t.
  Selection - Identify the best solutions by the ordering of the criterion values; denote these by E_t.
  Elitism - Let the best solution in E_t be s_t. Put s_t in S_t.
  Tabu - Update the tabu memory by letting MemTabu_t = S_{t-1}.
  repeat
    Crossover - Randomly pick two solutions in E_t. A recombination of these two solutions is obtained by summing the frequency distributions of the two solutions and sampling new solutions using probabilities corresponding to this combined frequency distribution.
    Mutation - With a given probability, decrease the frequency of a mate that has positive frequency by some integer value less than its current frequency and increase the frequency of some other mate by the same amount. Insert the solution into S_t.
  until S_t has N_pop solutions
until convergence is reached
Evaluation - For each solution in S_t, calculate the criterion value.
Selection - Identify the best solutions by the ordering of the criterion values; denote these by E_t.
Elitism - Let the best solution in E_t be s_t. Return s_t.
The solutions obtained by any single run of a GA may be sub-optimal, and different solutions can be obtained from different starting populations. An additional layer of safety is obtained by starting the algorithm from multiple initial populations and using an island model of evolution, in which separate populations are evolved independently for several steps and the best solutions from these runs then become the initial solutions of a final evolutionary algorithm. Since the functions in STPGA can start from user-provided initial values, island models and other strategies can be combined when using the algorithm. I give example code for doing this in the Appendix.

Software interface, computational considerations
There are two main functions in STPGA: GenAlgForSubsetSelection and GenAlgForSubsetSelectionNoTest. The function GenAlgForSubsetSelection uses the genetic algorithm to identify a training set of a specified size from a larger set of candidates, minimizing an optimization criterion with respect to a known test set. The function GenAlgForSubsetSelectionNoTest identifies a training set of a specified size from a larger set of candidates, minimizing an optimization criterion when no test set is specified. The two functions share most of their parameters, except that GenAlgForSubsetSelection requires an additional input specifying the target set of individuals, and the data matrix should be supplemented to include the observed values of the variables for these target individuals. The inputs for these functions are described below: • P depending on the criterion, this is either a numeric data matrix or a symmetric similarity matrix. When it is a data matrix, the union of the identifiers of the candidate (and test) individuals should be used as row names (and also as column names in the case of a similarity matrix). For methods using relationships, this is the inverse of the relationship matrix, with row and column names given by the identifiers of the candidate (and test) individuals.
• Candidates vector of identifiers for the individuals in the candidate set.
• Test vector of identifiers for the individuals in the test set.
• ntoselect (n_Train): number of individuals to select for the training set.
• npop genetic algorithm parameter, number of solutions at each iteration.
• nelite genetic algorithm parameter, number of solutions selected as elite parents which will generate the next set of solutions.
• keepbest genetic algorithm parameter, TRUE or FALSE. If TRUE then the best solution is always kept in the next generation of solutions (elitism).
• tabu genetic algorithm parameter, TRUE or FALSE. If TRUE then the solutions that are saved in tabu memory will not be retried.
• tabumemsize genetic algorithm parameter, integer>0. Number of generations to hold in tabu memory.
• mutprob genetic algorithm parameter, probability of mutation for each generated solution.
• mutintensity mean of the Poisson variable that is used to decide the number of mutations for each cross.
• niterations genetic algorithm parameter, number of iterations.
• minitbefstop genetic algorithm parameter, number of iterations before stopping if no change is observed in criterion value.
• plotiters plot the convergence: TRUE or FALSE. Default is FALSE.
• errorstat the optimization criterion to be minimized; the default is "PEVMEAN". It is possible to use user defined functions as shown in the examples.
• mc.cores number of cores to use.
• InitPop a list of initial solutions.
• tolconv if the algorithm cannot improve the errorstat by more than tolconv over the last minitbefstop iterations, it will stop.
• C contrast matrix.
• Vg covariance matrix between traits generated by the relationship K (only for multitrait version of PEVMEANMM).
• Ve residual covariance matrix for the traits (only for multi-trait version of PEVMEANMM).
All of these inputs except P and ntoselect (and also Candidates and Test for GenAlgForSubsetSelection) have default values of NULL, meaning that they are internally assigned the suggested default settings. These settings are as follows: npop = 100, nelite = 5, keepbest = TRUE, tabu = FALSE, tabumemsize = 1, mutprob = .8, mutintensity = 1, niterations = 500, minitbefstop = 100, niterreg = 5, lambda = 1e-6, plotiters = FALSE, plottype = 1, errorstat = "PEVMEAN", C = NULL, mc.cores = 1, InitPop = NULL, tolconv = 1e-7, Vg = NULL, Ve = NULL. In a specific application of STPGA, I recommend that users adjust these options until they are satisfied with the final results. In particular, when STPGA is used with large data sets (many columns or rows), the parameters npop, niterations and minitbefstop should be increased.
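A minimal call using these arguments might look like the following sketch. The data are simulated, the argument values are only illustrative, and I assume here that the first element of the returned list is the best solution found (the exact structure of the return value is documented in the package).

```r
library(STPGA)
set.seed(42)

# Simulated candidate set: 50 individuals scored on 10 markers,
# with identifiers as row names (required by the package).
P <- matrix(rnorm(500), nrow = 50, ncol = 10)
rownames(P) <- paste0("ind", 1:50)

# Select a training set of 10 individuals minimizing the default
# criterion (PEVMEAN); no test set is specified.
out <- GenAlgForSubsetSelectionNoTest(P = P, ntoselect = 10,
                                      npop = 100, nelite = 5,
                                      mutprob = 0.8, niterations = 200,
                                      plotiters = FALSE)
out[[1]]  # identifiers of the selected training individuals
```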
The subset selection algorithms in GenAlgForSubsetSelectionNoTest and GenAlgForSubsetSelection have somewhat different inner workings. GenAlgForSubsetSelectionNoTest splits the individuals given in the row names of the input matrix P into two parts: a set called "Train" of size ntoselect, and its complement. GenAlgForSubsetSelection starts from an input defining a split of the individuals given in the row names of P into three parts: a set called "Candidates", a set called "Test", and their complement; the algorithm then splits the set "Candidates" into a set called "Train" and its complement.
These two functions can be used with any user-defined fitness function, and in the examples section I illustrate how these mechanisms can be used for general subset selection problems. When using a design criterion that involves the design matrix of the target individuals along with that of the candidates, the GenAlgForSubsetSelection function uses the individuals listed in "Test" to extract the design matrix of these individuals from the design matrix P of all individuals. If a similar criterion is used with the GenAlgForSubsetSelectionNoTest function, the design of the target set is implicitly taken to be the rows of P not in "Train".
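To illustrate the shape such a user-defined criterion can take, the sketch below defines a D-optimality-style function of my own (not part of the package). Assuming its signature matches what the package expects for user-defined criteria (see the package documentation), it could then be passed by name via the errorstat argument.

```r
# D-optimality-style criterion: minimize the negative log-determinant
# of X'X for the selected training design (STPGA minimizes criteria).
# The small ridge term lambda keeps the determinant computable when
# X'X is singular. Function and argument names are illustrative.
DOPT <- function(Train, P, lambda = 1e-6) {
  X <- P[rownames(P) %in% Train, , drop = FALSE]
  -determinant(crossprod(X) + lambda * diag(ncol(X)),
               logarithm = TRUE)$modulus
}

set.seed(3)
P <- matrix(rnorm(200), nrow = 20, ncol = 10)
rownames(P) <- paste0("ind", 1:20)
DOPT(sample(rownames(P), 8), P)  # criterion value for a random subset
```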
Many modern statistical learning problems involve the analysis of high dimensional data. For example, in genomic prediction problems, phenotypes are regressed on large numbers of genome-wide markers. STPGA was initially written for working with the high dimensional data arising in such whole-genome regression (Meuwissen et al. 2001; Daetwyler et al. 2013) and association (Risch et al. 1996; Burton et al. 2007; Rietveld et al. 2013; Tian et al. 2011) problems. The STPGA software is written purely in R; however, computationally demanding criteria can be programmed by the user in C, C++ or Fortran. STPGA also benefits from multi-threaded computing: the computational performance of the algorithm can be greatly improved if R is linked against a tuned BLAS implementation with multi-thread support, such as OpenBLAS, ATLAS or the Intel MKL.

Illustrations
In this section, I illustrate the use of STPGA.

STPGA for selection of training populations
Experiments provide useful information to scientists as long as they are properly designed and analysed; theoretical work on the design problem goes back a long way. Fisher (Fisher 1992, 1960) gives a mathematical treatment of the determination of designs for some models. Early extended presentations of the ideas of optimum experimental design appear in Yates (1935) and Kiefer (1959). A brief history of statistical work on optimum experimental design is given by Wynn (1984), and the subject continues to develop, recently at an increasing rate. Box et al. (1978), Draper and Pukelsheim (1996), Atkinson and Donev (1992) and Pukelsheim (2006) are a few of the authoritative texts on the subject. The alphabetical naming of designs is due to Kiefer et al. (1985). For a detailed discussion of the standard criteria, the reader is referred to these references.
In this paper we focus only on the narrow design problem of selecting an optimal set of n design points, X_Train = {x_1, ..., x_n}, from a set of candidate design points X_C = {x_1, ..., x_N}; the design defined by these n points can be viewed as a measure on the design region. When dealing with supervised learning problems where the resulting model will be used to make inferences about a known set of individuals, we can distinguish between the candidate set and a target set X_Target = {x_1, ..., x_t}: X_Target describes the focused design region for which predictions of the dependent variables, based on models trained on X_Train, are required. We assume that 1 ≤ n ≤ N and 1 ≤ t are fixed integers and that the x_i are p-vectors; we denote the matrix form of X_Train by X, called the design matrix, and that of X_Target by X*, called the design matrix for the target space.
The first component of a design optimization problem is the objective function; for example, it might be a theoretically or numerically obtained sampling variance of a prespecified estimator of a population quantity of interest. The second component is the set of decision variables and the constraints on their values. Once the objective function and the constraints are known, the next step is to use a method that searches for solutions optimizing the objective function while adhering to the constraints.
Parametric design criteria usually depend on a function of the information matrix of the model parameters, which gives an indication of the sampling variances and covariances of the estimated parameters. Let I_θ(ζ) denote the information matrix of the parameters θ for a given design ζ. To obtain a criterion that orders designs with respect to their information matrices, a scalar function of the information matrix is usually used. These design criteria have alphabetical names; the designs obtained by optimizing them are referred to as A-, D-, E-, G-, etc. optimal designs. The design criteria implemented in STPGA are listed in Table 1, with references to the equations in this manuscript from which they were inspired.
Many practical and theoretical problems in science treat relationships of the type y = g(x, θ), where the response y is thought of as a particular value of a real-valued model or response function g, evaluated at the pair of arguments (x, θ). The parameter value θ, unknown to the experimenter, is assumed to lie in a parameter domain Θ. This is called the regression of y on x. STPGA is not confined to regression, but regression analysis is used for most of the explanations and demonstrations.

Linear models
The choice of the function g is central to the regression model building process. One of the simplest regression models is the linear model. Let X_{n×p} be the design matrix, β_{p×1} the vector of regression parameters, y_{n×1} the vector of observations, and ε_{n×1} = (ε_1, ε_2, ..., ε_n)' the error vector, giving y = Xβ + ε. With I_n the n × n identity matrix, the model is represented by the expectation vector and covariance matrix of y, E(y) = Xβ and cov(y) = σ²I_n, and is termed the classical linear model with moment assumptions. We assume the ε_i, i = 1, 2, ..., n, are iid with mean zero and cov(ε) = σ²I_n. Under the additional normality assumption we write y ∼ N(Xβ, σ²I_n).
We now concentrate on determining the optimal estimator of c'β in the linear regression model. If X is not of full rank, it is not possible to estimate β uniquely. However, Xβ is uniquely estimable, and so is c'β for any vector c in the row space of X. If estimability holds, the Gauss-Markov Theorem gives the optimal estimator c'β̂ = c'(X'X)⁻X'y. The variance of this estimator depends only on the matrix X'X: up to the common factor σ²/n, the optimal estimator has variance c'(X'X)⁻X'X(X'X)⁻c.
Assuming estimability, the optimal estimator of the linear function of the coefficients γ = Cβ is also given by the Gauss-Markov Theorem: γ̂ = C(X'X)⁻X'y. The covariance matrix of the estimator γ̂ is C(X'X)⁻X'X(X'X)⁻C'. The covariance matrix is invertible provided the coefficient matrix C has full row rank.
A closely related task is that of prediction. Suppose we wish to predict additional responses y* = X*β + ε*. If we take the random vector γ̂ from above as a predictor of the random vector y*, then to obtain precise estimators of X*β we would like to choose the design so as to maximize the relevant information matrix (i.e., minimize the covariance matrix). For example, G-optimal designs are obtained by minimizing the maximum variance of the predicted values, i.e., the maximum diagonal entry of the matrix X*(X'X)⁻X*'.
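The G-criterion for a given training design X and target points X* can be computed directly; the snippet below is a generic sketch (a small ridge term stands in for the generalized inverse).

```r
# Maximum prediction variance (up to sigma^2) over the target points
# for a given training design X: max diag of X* (X'X)^- X*'.
gcrit <- function(X, Xstar, eps = 1e-8) {
  XtXinv <- solve(crossprod(X) + eps * diag(ncol(X)))  # (X'X)^- surrogate
  max(diag(Xstar %*% XtXinv %*% t(Xstar)))
}

set.seed(7)
X     <- matrix(rnorm(60), 20, 3)   # candidate training design
Xstar <- matrix(rnorm(15), 5, 3)    # target design points
gcrit(X, Xstar)   # value to be minimized over choices of X
```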

Ridge Regression
Note the following: replacing the generalized inverse (X'X)⁻ by (X'X + λ²I)⁻¹ for a small λ > 0 gives the estimator β̂ = (X'X + λ²I)⁻¹X'y; indeed, lim_{λ→0}(X'X + λ²I)⁻¹X' is a generalized inverse of X. This estimator is called the ridge estimator. The coefficients have covariance matrix approximately proportional to (X'X + λ²I)⁻¹X'X(X'X + λ²I)⁻¹, and the prediction error variance for estimating CX*β with ridge regression is approximately proportional to CX*(X'X + λ²I)⁻¹X'X(X'X + λ²I)⁻¹X*'C'. Ridge estimators have smaller variance than BLUEs, but they are biased since the estimates are "shrunk" towards zero.
Ridge estimators are especially useful when X'X is singular. In some cases, ridge estimation is applied only to a subset of the explanatory variables in X; for example, it is customary not to shrink the mean term.
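The ridge estimator and its shrinkage behavior can be illustrated directly; this is a generic sketch with simulated data, not STPGA code.

```r
set.seed(11)
n <- 30; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1, 0.5, 0, 0)
y <- X %*% beta + rnorm(n)

# Ridge estimator: (X'X + lambda^2 I)^{-1} X'y
ridge <- function(X, y, lambda) {
  solve(crossprod(X) + lambda^2 * diag(ncol(X)), crossprod(X, y))
}

# As lambda grows, the coefficients shrink toward zero
# (the bias-variance trade-off in action):
sum(ridge(X, y, 0.01)^2) > sum(ridge(X, y, 10)^2)  # TRUE
```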
Splitting the design matrix X as X = (X_F, X_R), where X_F contains the effects modeled without the ridge penalty and X_R contains the terms modeled with the ridge penalty, a design criterion concerning the estimation of the shrunk coefficients can be written analogously to the expressions above, with the penalty applied only to the coefficients of X_R.

RKHS
Using the matrix identity (X'X + λ²I)⁻¹X' = X'(XX' + λ²I)⁻¹, the ridge regression solution for γ can be written as γ̂ = CX'(XX' + λ²I)⁻¹y. The important message here is that we only need access to the inner products collected in the matrix XX' (and the corresponding partitions involving the target points); the variance-covariance matrix of γ̂ is likewise expressible in terms of these inner products. Reproducing Kernel Hilbert Space (RKHS) regression methods replace the inner products by kernels: it is as if we were performing ridge regression on transformed data φ(x), where φ is a feature map associated with the chosen kernel function and the associated kernel matrix. The resulting predictor is nonlinear in x and agrees with the predictor derived from the RKHS perspective (Schölkopf and Smola 2002). RKHS regression thus extends ridge regression to a wide variety of kernel matrices, not necessarily additive in the input variables, calculated using a variety of kernel functions. A kernel function k(., .) maps a pair of input points x and x' to a real number; it is by definition symmetric (k(x, x') = k(x', x)) and positive semi-definite.
Given the inputs x_1, ..., x_n for the n individuals, we can compute a kernel matrix K whose entries are K_ij = k(x_i, x_j).
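For instance, a Gaussian (RBF) kernel matrix can be computed as follows; this is a generic sketch, with the bandwidth h as an illustrative parameter.

```r
# Gaussian kernel: k(x, x') = exp(-||x - x'||^2 / (2 h^2))
gauss_kernel <- function(X, h = 1) {
  D2 <- as.matrix(dist(X))^2       # squared Euclidean distances
  exp(-D2 / (2 * h^2))
}

set.seed(5)
X <- matrix(rnorm(20), 10, 2)      # 10 individuals, 2 input variables
K <- gauss_kernel(X)
# K is symmetric with unit diagonal and entries in (0, 1]
all(diag(K) == 1)                  # TRUE
```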

Gaussian Linear Mixed Models
The linear mixed model methodology was first developed within the context of animal genetics and breeding research by Henderson (1975), Kempthorne et al. (1957) and Henderson et al. (1959). Many important statistical models can be expressed as mixed effects models, and the mixed model is the most widely used model for prediction of quantitative traits and for genome-wide association studies.
In studies of linear mixed models it is usual to consider the estimation of the fixed effects β and the variance components, as well as the prediction of the random effects u. For a given data vector y, the vector of random effects u is a realization of random variables that are not observed; these effects must therefore be predicted from the data (Henderson 1953).
In the linear mixed-effects model, the observations are assumed to result from a hierarchical linear model: y = Wβ + Zu + e, with u ∼ N(0, G) and e ∼ N(0, R), where W and Z are the design matrices of the fixed and random effects. The similarity between mixed models and RKHS regression models has been stressed many times; however, the mixed modeling approach provides a formalized framework, since the inferences are based on a probabilistic model and therefore allow legitimate inferences about the parameters and predictions. Henderson et al. show that maximizing the joint density of y and u yields the MLEs of the parameters β and the EBLUPs (estimated BLUPs) û that solve Henderson's mixed-model equations:

[ W'R⁻¹W      W'R⁻¹Z        ] [ β̂ ]   [ W'R⁻¹y ]
[ Z'R⁻¹W      Z'R⁻¹Z + G⁻¹ ] [ û ] = [ Z'R⁻¹y ].

These equations can also be used to estimate the standard errors of the fixed and random effects. For a given design, the inverse of the coefficient matrix is written as

[ H11   H12 ]
[ H12'  H22 ],

where H11, H12 and H22 are, respectively, p×p, p×q and q×q sub-matrices. (Referring to this as the coefficient matrix is an abuse of notation, since the parameters of the mixed effects model do not include the vector u.) Using this notation, the sampling covariance matrix of the BLUE (best linear unbiased estimator) of β is given by σ(β̂) = H11 = (W'V⁻¹W)⁻ with V = ZGZ' + R, the sampling covariance matrix of the prediction errors (û − u) is given by H22, and the sampling covariance of the estimated effects and the prediction errors is given by H12. (We consider û − u rather than û because the latter includes variance from both the prediction error and the random effects u themselves.) The standard errors of the fixed and random effects are obtained, respectively, as the square roots of the diagonal elements of H11 and H22. Optimal design of experiments with mixed models involves the determination of the design matrices W and Z; however, in many applications an estimate of only one of β or u is needed. For example, a design criterion is obtained by considering the variance-covariance matrix of C'(û − u), given by C'H22C, which is named the prediction error variance (PEV).
A more recent design criterion is the generalized coefficient of determination (Laloë 1993; Laloë and Phocas 2003; Rincent et al. 2012) for the random terms c_i'(û − u), i = 1, ..., l. If the mixed model is simplified so that e ∼ N(0, R = σ²I) and u ∼ N(0, G = σ_u²A), and the rows of C have zeros corresponding to the fixed effects, the formula for the prediction error variance becomes

PEV(C) = σ² C'(Z'MZ + λA⁻¹)⁻¹C,

and the corresponding formula for the coefficient of determination becomes

CD(C) = diag( C'(A − λ(Z'MZ + λA⁻¹)⁻¹)C ) / diag( C'AC ),

where λ = σ²/σ_u². Furthermore, when we assume u ∼ N(0, G = σ_u²I), the above formulas simplify further, with A replaced by the identity matrix. Here, M = I − W(W'W)⁻W' is a projection matrix orthogonal to the vector subspace spanned by the columns of W, so that MW = 0.
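Under the simplified mixed model with R = σ²I and G = σ_u²I, the prediction error variance of random-effect contrasts can be computed as in this sketch (simulated designs; all names are illustrative).

```r
# PEV (up to sigma^2) for random-effect contrasts C under R = s2*I,
# G = s2u*I:  PEV(C) = C' (Z'MZ + lambda I)^{-1} C,
# with M = I - W (W'W)^- W' the projector orthogonal to col(W).
pev <- function(W, Z, C, lambda) {
  M <- diag(nrow(W)) - W %*% solve(crossprod(W), t(W))
  t(C) %*% solve(crossprod(Z, M %*% Z) + lambda * diag(ncol(Z)), C)
}

set.seed(9)
n <- 15; q <- 6
W <- cbind(1, rnorm(n))                   # fixed effects: intercept + covariate
Z <- matrix(rbinom(n * q, 1, 0.5), n, q)  # random-effect incidence matrix
C <- diag(q)                              # one contrast per random effect
diag(pev(W, Z, C, lambda = 1))            # per-effect prediction error variances
```

Minimizing the mean (or maximum) of these diagonal entries over candidate training sets is exactly the kind of criterion STPGA is designed to optimize.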

Some generalizations and extensions of parametric design criteria
A generalization of the D-, A-, G-optimal criteria is provided by Kiefer's φ_p family of criteria, given by

φ_p(ζ) = [ (1/k) tr( (I(ζ)⁻¹)^p ) ]^{1/p},

where k is the number of parameters and particular choices (and limits) of p recover the classical criteria, e.g., p = 1 gives A-optimality and the limit p → 0 gives D-optimality. Another extension of D-optimality deals with minimizing the compound criterion

Σ_{j=1}^{h} (α_j / S_j) log | A_j' I_j(ζ)⁻¹ A_j |.

This criterion permits designs for h different models which may be fitted to the data, for the jth of which the information matrix is I_j(ζ). The matrix A_j defines S_j linear combinations of the p_j parameters in model j which are of experimental importance, and the non-negative weights α_j express the relative importance of the different aspects of the design. Examples of compound D-optimum designs for linear models are given by Atkinson and Donev (1992). Pritchard and Bacon (1978) proposed a new criterion, an alternative to the traditional D-optimal design, which takes a measure of the overall correlation among the parameter estimates directly as the objective function to be minimized, i.e., the square root of the sum of the squared correlations between pairs of parameters.

Non-parametric design criteria
The design criteria of the previous sections start from a parametric model. There are some optimal design approaches that do not make any parametric assumptions, leading to non-parametric design criteria. Most of these methods are based on a distance matrix.
A design criterion that aims to achieve a high spread among its support points within the design region, i.e., to make the smallest distance between neighboring points in the design as large as possible, is called the maximin-distance criterion. Let D = (d_ij), i, j = 1, . . . , N, denote the distance matrix among the possible design points. The maximin-distance criterion seeks the subset of n points for which the minimum pairwise distance among these n points is maximized. Another possibility is to pick the n design points so that the maximum distance from the points in a target set X* to their nearest design point is as small as possible. These designs are called space-filling designs; some performance benchmarking for various space-filling designs can be found in Pronzato (2008) and Pronzato and Müller (2012).
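The maximin score of a candidate subset is a one-liner; this hedged sketch (names are mine) could serve as a user-supplied criterion, negated because a minimizing search would otherwise work against it.

```r
# Hedged sketch: maximin-distance score of a candidate subset.
# D: a "dist" object or distance matrix over all N candidate points;
# subset: indices of the n chosen points. Larger is better.
maximin <- function(subset, D) {
  min(as.dist(as.matrix(D)[subset, subset]))
}
```

For example, for three points at coordinates 0, 1, 3 on a line, the subset {1, 3} scores 3, while the full set scores 1.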
Example 1: In this example, we want to find the best 13-point D-optimal design for a second-order regression model over a grid design region defined by two variables, both with possible values in the set {−2, −1, 0, 1, 2}. Naming these variables x1 and x2 and the generic response y, we can write this model as y = β0 + β1 x1 + β2 x2 + β11 x1² + β22 x2² + β12 x1 x2 + e. First, we create the design matrix for the design space.
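The code box for this step is not reproduced above; a minimal sketch of building the candidate design matrix for the 5 × 5 grid (variable names are illustrative) is:

```r
# Candidate design matrix for the second-order model over the 5 x 5 grid:
# columns are 1, x1, x2, x1^2, x2^2, x1*x2.
grid <- expand.grid(x1 = -2:2, x2 = -2:2)
X <- model.matrix(~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = grid)
rownames(X) <- seq_len(nrow(X))
dim(X)  # 25 candidate points, 6 model terms
```

The GA would then search for the 13 rows of X maximizing log |X_s' X_s|.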
Optimal design of phenotypic experiments based on prior genotypic information can lead to high information value at low cost. To illustrate these points, let's use the wheat data set, first in a prediction and then in an association setting. We begin by loading the data and doing some preprocessing necessary to run the experiment:

Box 3: Loading and preprocessing the wheat data set included in STPGA
> data(WheatData)
> svdWheat<-svd(Wheat.K, nu=50, nv=50)
> PC50WHeat<-Wheat.K%*%svdWheat$v
> rownames(PC50WHeat)<-rownames(Wheat.K)
> DistWheat<-dist(PC50WHeat)
> TreeWheat<-cutree(hclust(DistWheat), k=4)

"TreeWheat" partitions the data into four sets; let's observe this grouping using a plot of the first two principal components. We will consider a scenario where the final aim is to accurately predict the genotypic values in group 2, and we want to accomplish this by conducting a phenotypic experiment on 50 genotypes selected from the remaining groups. However, it is important to be able to specify the GA parameters:

Box 7: Optimization of training populations with STPGA, specifying algorithm parameters
> Train3<-GenAlgForSubsetSelection(P=PC50WHeat,Candidates=Candidates,
+ Test=Test,ntoselect=50,
+ InitPop=NULL,npop=200,
+ nelite=10, mutprob=.5, mutintensity = 1,niterations=200,
+ minitbefstop=50, tabumemsize = 1,plotiters=FALSE,
+ lambda=1e-9,errorstat="PEVMEAN", mc.cores=4)
> Train4<-GenAlgForSubsetSelectionNoTest(
+ P=PC50WHeat[rownames(PC50WHeat)%in%Candidates,],ntoselect=50,
+ InitPop=NULL,npop=200,
+ nelite=10, mutprob=.5, mutintensity = 1,niterations=200,
+ minitbefstop=50, tabumemsize = 1,plotiters=FALSE,
+ lambda=1e-9,errorstat="PEVMEAN", mc.cores=4)

We finally want to compare the prediction accuracy for the target data with the average accuracy that would be obtained from a random sample of the same size.
I will only use "Train3" and "Train4" below; we will also need the package EMMREML (Akdemir and Godfrey 2015) for fitting the mixed-effects model. We will repeat the estimation with a random sample 300 times to obtain the mean performance. These results are as expected: optimally designed phenotypic experiments are more informative, resulting in higher accuracies than a random sample of the same size. If the researcher also has access to the genotypic information for the individuals in the target set, then this information, when properly used, can increase the information obtained per unit cost of a phenotypic experiment.
I also expect the association (GWAS) results from an optimized sample to be better than those from a random sample. I cannot verify this with a simple example; however, here is a comparison of the marker effects estimated from the full data set with those estimated from a random sample and from an optimized sample of the same size.

> orderfull<-order(abs(modelrrblupfull$uhat), decreasing=T)
> orderopt<-order(abs(modelrrblupopt$uhat), decreasing=T)
> orderrs<-order(abs(modelrrbluprs$uhat), decreasing=T)
> mean(abs(orderrs-orderfull))
[1] 1580.187
> mean(abs(orderopt-orderfull))
[1] 1567.686

As noted before, the subset selection optimization problem is a combinatorial one. We need to check whether the algorithm got close to convergence; we can do this by plotting the criterion values over the iterations. These values are stored in the output of STPGA under the name 'Best criterion values over iterations'.
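A self-contained illustration of such a convergence check follows; the mock vector below stands in for Train3$`Best criterion values over iterations`, which is not recreated here, and the helper name is mine.

```r
# Hedged helper: has the best criterion value stopped improving over the
# last "window" iterations? A long flat tail suggests near-convergence.
converged <- function(vals, window = 50, tol = 1e-10) {
  n <- length(vals)
  n > window && abs(vals[n] - vals[n - window]) < tol
}

vals <- c(seq(5, 1, length.out = 100), rep(1, 60))  # mock criterion trace
converged(vals)  # the trace has flattened
```

With a real run one would call converged(Train3$`Best criterion values over iterations`) and, ideally, also plot the trace.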

STPGA in other subset selection problems
The optimal subset selection algorithm in STPGA can be used with a user-supplied optimization criterion, and therefore it has a wide area of application. I am going to give a few examples: supervised and unsupervised variable selection, balancing gain against inbreeding, mixed integer programming, and influential observation selection. These examples can easily be extended.
The following example illustrates how users can define their own criteria and use them with STPGA for a variable selection problem. It involves aligning kernels to select variables; as far as I know, this has not been done before.
Selecting the most representative marker set (markers that represent most of the variability in a given marker data set)
A genetic relationship matrix measures how genetically similar the individuals in a certain group are. Genetic relationship matrices can be constructed using pedigrees, or using genome-wide markers. In this example, we will try to find a fixed-size subset of the available genome-wide markers that results in a genetic relationship matrix as close as possible to the one obtained from all markers. These selected markers can be called genetic anchor markers, since they explain most of the properties of the genome-wide relationship matrix.
The main question is whether there is a subset of markers that can explain a large part of the variation captured by all the markers (or even the genome sequence), since this relates to many important genetic concepts such as effective population size, effective number of independent chromosome segments, and population structure and its effects on predictability within and between sub-populations. Note that the selection of the anchor markers does not involve any phenotypic observations; therefore this is an unsupervised marker selection approach, similar to some recent approaches in sparse principal components analysis (Witten et al.). We can see how these optimally selected markers compare with randomly selected marker sets of the same size.

> layout(matrix(c(1,2,3,4,5,6),2,3, byrow=TRUE), widths=c(2,2,2),
+ heights=c(2,2), respect=TRUE)
> par(mar=c(3,2,2,1))
> # turn off the axes
> image(Ars, axes=FALSE, main="Random Markers")
> image(optA, axes=FALSE, main="Optimal Markers")
> image(Afull, axes=FALSE, main="All Markers")
> image((Ars-Afull)^2, axes=FALSE,
+ main="Squared errors for random")
> image((optA-Afull)^2, axes=FALSE,
+ main="Squared errors for optimal")
> par(mfrow=c(1,1))
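A hedged sketch of a criterion in this spirit follows: the squared Frobenius distance between the relationship matrix built from a marker subset and the full-marker matrix. The simple tcrossprod scaling and all names are illustrative assumptions, not the exact code used in the package example.

```r
# Hedged sketch of an unsupervised kernel-alignment criterion.
# M: centered marker matrix (rows = individuals, columns = markers);
# markers: candidate column subset; Afull: full-data relationship matrix.
anchor_crit <- function(markers, M, Afull) {
  Ms <- M[, markers, drop = FALSE]
  Asub <- tcrossprod(Ms) / ncol(Ms)  # subset-based relationship matrix
  sum((Asub - Afull)^2)              # smaller is better
}
```

A minimizing GA over fixed-size marker subsets then yields the anchor markers.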

Minimize inbreeding while maximizing gain
Many authors (Goddard 2009;Jannink 2010;Sun et al. 2013;Akdemir and Sánchez 2016) have expressed the importance of reducing inbreeding in PS and GS for long-term success of breeding programs. They argued that GS is likely to lead to a more rapid decline in the selection response unless new alleles are continuously added to the calculation of GEBVs, stressing the importance of balancing short and long term gains by controlling inbreeding in selection.
A quadratic program (QP) minimizes an objective of the form f(x) = (1/2) x'Dx + d'x + c. Here, x is a vector in Rⁿ, D is an n × n symmetric positive definite matrix, d is a constant vector in Rⁿ and c is a scalar constant. QPs usually come with a system of linear constraints on the vector x ∈ Rⁿ, which can be written as Ax = f, Bx ≥ g. Here A is an m₁ × n matrix with m₁ ≤ n and B is an m₂ × n matrix; the vectors f and g have lengths m₁ and m₂, respectively. The QP can then be stated compactly as: minimize (1/2) x'Dx + d'x subject to Ax = f and Bx ≥ g. There are many efficient algorithms that solve QPs, so in practice there is little difficulty in calculating the optimal solution for any particular data set. In this example, I will be using the package quadprog (Turlach and Weingessel 2007). Now, let A be the matrix of pedigree-based numerator relationships or the additive genetic relationships between the individuals in a breeding population (this matrix can be obtained from a pedigree or from genome-wide markers for the individuals) and let c be the vector of proportional contributions of the individuals to the next generation under a random mating scheme.
The average inbreeding and co-ancestry for a given choice of c can be defined as r = (1/2) c'Ac. Let b be the vector of GEBVs, i.e., the vector of BLUP (best linear unbiased predictor) estimated breeding values of the candidates for selection; the expected gain is then defined as g = c'b. Without loss of generality, we will assume that the breeder's long-term goal is to increase the value of g.
In Wray and Goddard (1994), Brisbane and Gibson (1995), and Meuwissen (1997), an approach is proposed that minimizes the average inbreeding and co-ancestry while restricting the genetic gain. The optimization problem can be stated as: minimize (1/2) c'Ac subject to c'b = ρ, where ρ is the desired level of gain.
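This problem (with the practically important additions of non-negative contributions and contributions summing to one) can be sketched with quadprog's solve.QP; the function name and the choice of a "≥ ρ" gain constraint are my own illustrative assumptions.

```r
library(quadprog)

# Hedged sketch: minimize (1/2) c'Ac subject to sum(c) = 1 (equality),
# c'b >= rho (gain constraint), and c >= 0.
# A: positive definite relationship matrix, b: GEBVs, rho: desired gain.
min_coancestry <- function(A, b, rho) {
  n <- length(b)
  Amat <- cbind(rep(1, n), b, diag(n))  # constraints, one per column
  bvec <- c(1, rho, rep(0, n))
  solve.QP(Dmat = A, dvec = rep(0, n),
           Amat = Amat, bvec = bvec, meq = 1)$solution
}
```

For example, with A = I, b = (1, 2) and ρ = 1.5, the optimal contributions are (0.5, 0.5).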
This problem is easily recognized as a QP. Recently, several parental percentage allocation strategies were tested using QPs (Goddard 2009; Pryce et al. 2012; Schierenbeck et al. 2011). By solving this problem for varying values of ρ, or by using the similar criteria in the mate selection approaches, we can trace out an efficient frontier curve: a smooth nondecreasing curve that gives the best possible trade-off of genetic variance against gain. This curve represents the set of optimal allocations, and it is called the efficiency frontier (EF) curve in the finance (Markowitz 1952) and breeding literatures.
Many practical applications require additional constraints, and it is possible to extend these formulations with constraints such as positiveness, minimum-maximum bounds on the proportions, or minimum-maximum bounds on the number of lines (cardinality constraints). It is not too difficult to use STPGA to solve a version of this problem.
Suppose the breeder wants to select a subset of the individuals to become the parents of the next generation, increasing gain while controlling co-ancestry and inbreeding. Note that the breeder only wants to select a subset, and therefore we can assume that the parental contributions will be the same for each of the selected individuals. The following code illustrates how to select 10 individuals from the wheat data set for changing values of λ. Suppose now the breeder wants us to pick 10 individuals, but also asks for the best parental contribution proportions. This is a mixed integer quadratic programming problem. Let ε_i be the minimum proportion that must be allocated to line i, and δ_i be the maximum proportion that may be allocated to line i if line i is selected at all, where we must have 0 ≤ ε_i ≤ δ_i ≤ 1.
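For the equal-contribution subset version, the criterion can be sketched as follows; the function name, the 1/n contributions and the sign convention (STPGA minimizes, so gain enters negatively) are illustrative assumptions rather than the exact code from the text.

```r
# Hedged sketch: score a candidate parent subset with equal contributions
# 1/n, trading expected gain c'b against coancestry (1/2) c'Ac.
# sel: indices of selected parents; A: relationship matrix; b: GEBVs.
gain_inbreeding <- function(sel, A, b, lam = 1) {
  cvec <- rep(1 / length(sel), length(sel))
  gain <- sum(cvec * b[sel])
  coan <- 0.5 * drop(crossprod(cvec, A[sel, sel] %*% cvec))
  -gain + lam * coan  # smaller is better for a minimizing GA
}
```

Sweeping lam traces out an approximation to the efficient frontier over subsets.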

Variable selection in regression using model selection criteria
One of the standard uses of GAs is in variable selection, and STPGA can be used for this purpose.
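As a concrete illustration, a classical information criterion can serve as the subset-selection objective; this is a hedged sketch with names of my choosing, not a criterion shipped with the package.

```r
# Hedged sketch: BIC of a least-squares fit on a candidate column subset
# of X; a minimizing GA over subsets performs variable selection.
bic_subset <- function(vars, y, X) {
  fit <- lm(y ~ X[, vars, drop = FALSE])
  BIC(fit)
}
```

On simulated data where only the first column of X drives y, the subset containing that column scores far lower (better) than one that omits it.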
Another very popular method for variable selection and penalized (shrunken) parameter estimation in high-dimensional regression is the lasso, which can be summarized as finding the regression coefficients that minimize the regression loss function plus a multiple of the ℓ1 norm of the coefficients; the relative stress on these two complementary goals is expressed through a linear combination. For example, with the squared error loss, the lasso optimization criterion is

min_β ||y − Xβ||² + λ||β||₁.

The influence of this idea on many different fields of science cannot be overlooked, and the same goes for the enormous number of methods developed to solve this problem and its variations.
However, there is a major statistical problem: there is not much principle behind its objective criterion, so it is no big magic. It does not have certain properties we would like from a good regression. For example, the residuals of a model using the coefficients obtained by the lasso are not orthogonal to the design of the variables selected by the lasso. This last issue can be circumvented by using the lasso only as a selection operator and then fitting the coefficients with least squares, but that does not answer the main question: why this criterion and not another one? Parsimony, yes, but why this form of it? Is this the one that gives the best generalization error (minimizes the prediction error)? The only good arguments for the defense of this criterion I can think of are a Bayesian one and the so-called "oracle" property, which holds only under restrictive assumptions such as no dependency among the explanatory variables. We should be careful not to confuse the actual problem with the method, like the carpenter who sees everything as nails because he or she is good with a hammer.
In my opinion, the whole area of norm-penalized estimation neglects the many model selection criteria (AIC (Akaike 1974), BIC (Schwarz et al. 1978), ICOMP(IFIM) (Bozdogan 1987), etc.) introduced from the 1970s onward by many prominent statisticians, whose derivations depend on the solid theory of likelihoods, divergence measures, and so on. I am not sure why we should develop elegant theories and then ignore them for less elegant ones.
So while the lasso (Tibshirani 1996) is great and can be solved at great speed in serial (after the investment of huge amounts of energy and resources over the last ten or more years), the methods that find its solutions can only be partially parallelized. I am not sure the shrinkage approach can take on the next challenge of solving extremely large and complex problems. In addition to not adapting well to parallel computer architectures, these methods are limited in the problems they can address, and addressing new problems with the same techniques requires inventing newly crafted methodologies, which consumes the greatest amount of time and resources.
It is easy to make up criteria like the one above. Here is another one, and yet another, both built around the subset least-squares estimator β̂(k) = (X(k)'X(k))⁻¹X(k)'y; both have nice interpretations as well. In fact, any of the optimal design criteria of Section 4.1 could be relevant in this context, with β̂(k) defined as above. I encourage users to program these criteria, or their own; perhaps the one you invent will make more sense for the problem you are trying to solve than this or that established criterion.

Identifying influential observations in regression (and keeping the most consistent regression data)
When a regression model is fitted to data, if a few of the observations differ in some way from the bulk of the data, then using all observations in the model might not be appropriate.
The inclusion or exclusion of a few such observations can make a significant change in the parameter estimates or predictions; these are referred to as influential observations. One statistic used to identify the influence of an observation is the so-called DFBETAS, which measures the effect of deleting a single observation on the estimated model coefficients.
This measure can be generalized to the deletion of a group of observations as

D(−) = (β̂ − β̂(−))'(X(−)'X(−))(β̂ − β̂(−)) / σ̂²(−),

where β̂ is the vector of estimated regression coefficients from the full data, β̂(−) is the same vector estimated from a part of the data (leaving its complement out of the analysis), σ̂²(−) is the residual variance estimate obtained using only the partial data and X(−) is the design matrix for the partial data. Here is the application of this to the "iris" data set (available with base R) to locate the influential observations, if there are any:

> data(iris)
> P<-as.matrix(scale(iris[,1:4], center=TRUE, scale=TRUE))
> rownames(P)<-rownames(iris)
> STPGAoutlist<-vector(mode="list", length=20)
> ii=1
> for (i in 135:149)

The following plot shows that the criterion value has a sharp increase when we go from 144 to 145 observations. There are a total of 150 observations in the data set; the six observations that are not included in the set of 144 cause a large change in the value of the estimated coefficients. We observe that these are the observations we have manipulated.
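The user-defined criterion behind this search can be sketched as below; it mirrors the fragmentary function shown above (a ridge-stabilized distance between full-data and subset coefficients, returned negated), but the function and argument names are illustrative.

```r
# Hedged sketch of a group-deletion influence criterion.
# keep: row indices of the retained observations; y, X: full data;
# lambda: small ridge term stabilizing the inverses.
infl_crit <- function(keep, y, X, lambda = 1e-6) {
  k <- ncol(X)
  Bfull <- solve(crossprod(X) + lambda * diag(k), crossprod(X, y))
  Xk <- X[keep, , drop = FALSE]
  Bk <- solve(crossprod(Xk) + lambda * diag(k), crossprod(Xk, y[keep]))
  -mean((Bfull - Bk)^2)  # negated, following the sign convention above
}
```

Retaining all observations gives a score of zero, since the two coefficient vectors coincide.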
In the STPGA package I have implemented a GA for subset selection, inspired by recent developments in the genomic breeding of plants, animals and other organisms.
The main advantage of this algorithm is that it can be run in parallel to solve the subset selection problem for any given objective function.
I also did not compare the relative speed of this particular implementation of the LA-GA-T algorithm to any other software. I admit that this software could be written to run faster on a single processor and, no matter what I do, it will never be as fast as some algorithms that are specialized to their tasks. However, I note that the LA-GA-T algorithm can be run in parallel on many computers, solving problems as fast as some problem-specific algorithms that have to run in serial.
Prediction accuracy from genomic models can be improved by targeting more informative individuals in the data set used to generate the predictors; this result has been exemplified by several papers in the area (e.g., Yu et al. 2016). Nevertheless, the subject deserves more attention.
Some of the criteria I have mentioned in the selection of training populations section of this paper are not among the default criteria listed in Table 1; however, I have included them as user-defined criteria in the help files provided with the package.
I have plans to migrate this program to a more efficient programming language for performance improvements, and there is room for improvement in redesigning parts of the algorithm. However, there are some advantages to using R, including its accessibility, its availability, and the ability to define functions on the fly; therefore, I will keep supporting this pure R version. As a statistician, I feel that writing efficient software is partially outside my expertise and immediate interests, and it can surely be done much more easily in collaboration with people with the right skills. As of now, I can use STPGA with moderately complex problems, either by allowing the algorithm to take its time or by using a highly parallelizable machine. I am open to any suggestions and collaborations in this respect.

Author contributions
Every idea and all code in this manuscript were conceived by Deniz Akdemir, unless acknowledged by the references within the article. Any claim by a person or group of having contributed to this work in any way should not be taken seriously.