## 1 Abstract

Whether or not genetic divergence on the short-term of tens to hundreds of generations is compatible with phenotypic stasis remains a relatively unexplored problem. We evolved predominantly outcrossing, genetically diverse populations of the nematode *Caenorhabditis elegans* under a constant and homogeneous environment for 240 generations, and followed individual locomotion behavior. We find that although founders of lab populations show highly diverse locomotion behavior, during lab evolution the component traits of locomotion behavior – defined as the transition rates in activity and direction – did not show divergence from the ancestral population. In contrast, the genetic (co)variance structure of transition rates showed marked divergence from the ancestral state and differentiation among replicate populations during the final 100 generations and after most adaptation had been achieved. We observe that genetic differentiation is a transient pattern during the loss of genetic variance along phenotypic dimensions under drift during the last 100 generations of lab evolution. However, loss of genetic variances present in the founders may be due to directional selection. These results suggest that once adaptation has occurred, and on the short-term of tens of generations, stasis of locomotion behaviour is repeatable because of effective stabilizing selection at a large phenotypic scale, while the genetic structuring of component traits is contingent upon drift history at a local phenotypic scale.

## 2 Introduction

Stasis, the lack of directional change in the average values of a trait over time, is the most common phenotypic pattern observed over timespans reaching one million years (Arnold, 2014; Gingerich, 2019; Uyeda et al., 2011). Theory predicts phenotypic stasis when stabilizing selection, or when directional and other forms of selection cancel out over the period examined, acts upon standing genetic variation reflecting the phenotypic effects of mutational input (Charlesworth et al., 1982; Estes and Arnold, 2007; Hansen and Martins, 1996; Lande, 1986; Morrissey and Hadfield, 2012). When considering mutation-selection balance on the long-term (as scaled by the effective population sizes), theory has been successfully applied to explain, for example, fly wing evolution over a period of 40 million years (Houle et al., 2017), or nematode embryogenesis over 100 million years (Farhadifar et al., 2015). On the short-term of a few tens to hundreds of generations, however, many natural populations depend on standing genetic variation for adaptation or rescue from extinction, when mutation should be of little influence and founder effects, demographic stochasticity and genetic drift important (Chelo et al., 2013; Hill, 1982; Mallard et al., 2022b; Matuszewski et al., 2015).

On the short-term, before mutation-selection balance is reached, phenotypic stasis in extant natural populations is also commonly observed, often despite significant trait heritability and selection (Merilä et al., 2001; Pujol et al., 2018). Explanations for short-term phenotypic stasis have relied on showing that in many cases there were no changes in the traits' breeding values, that is, no genetic divergence, either because of selection on unmeasured traits that are genetically correlated with observed ones or because of correlated selection due to unknown environmental covariation between observed and unobserved traits with fitness (e.g., Czorlich et al., 2022; Kruuk et al., 2002), both instances of “indirect” selection. Short-term phenotypic stasis without genetic divergence has also been explained by phenotypic plasticity allowing the tracking of environmental fluctuations (e.g., Biquet et al., 2022; de Villemereuil et al., 2020). These studies indicate that phenotypic evolution cannot be understood when considering each trait independently of others and that a multivariate description of selection and standing genetic variation is needed. Selection on multiple traits should be seen as a surface with potentially several orthogonal dimensions (Phillips and Arnold, 1989), each with particular gradients depicting selection strength and direction on each trait and between traits (Arnold et al., 2001; Lande and Arnold, 1983). Responses to selection in turn will depend on the size and shape of the **G**-matrix, the additive genetic variance-covariance matrix of multiple traits (Lande, 1979). For example, phenotypic dimensions with more genetic variation are expected to facilitate adaptation, as selection will be more efficient (Lande, 1976, 1979; Schluter, 1996), even if indirect selection can confound predictions about phenotypic evolution (Mallard et al., 2022a; Morrissey and Bonnet, 2019; Stinchcombe et al., 2014).

The extent to which phenotypic stasis is compatible with the expected divergence of the **G** matrix in the short-term remains little explored (cf. Bohren et al., 1966; Gromko, 1995; Simões et al., 2019; Teotónio et al., 2004; Teotónio and Rose, 2000). Studies in natural populations cannot usually control environmental variation and estimates of **G** matrix dynamics are nearly impossible to obtain, while experiments employing truncation selection do not easily model the complexity of the selection surface. Under drift, and assuming an infinitesimal model of trait inheritance, the **G** matrix size (i.e., the total genetic variance) is reduced and diverges from ancestral states by a factor proportional to the effective population size (Lande, 1976; Lynch and Hill, 1986; Phillips et al., 2001). However, theory that includes the effects of finite population sizes, multivariate selection, and the pleiotropic effects of mutation remains out of reach for changes in genetic covariances between traits and thus **G** matrix shape (Barton and Turelli, 1987; Burger, 2000; Lande, 1980; Lynch and Walsh, 1998; Simons et al., 2018). We do expect, however, that once most adaptation has occurred divergence of the **G** matrix shape is caused by drift, and also know that different forms of selection might lead to further genetic divergence in the relatively local phenotypic space occupied after adaptation (Doroszuk et al., 2008; Haller and Hendry, 2014). Whether or not genetic divergence will also lead to phenotypic divergence should then depend on the distribution of pleiotropic effects of quantitative trait loci (QTL) alleles, and linkage disequilibrium between them, created by past selection and drift, and ultimately on the developmental and physiological mapping of genetic onto phenotypic variation (Chebib and Guillaume, 2017; Hansen and Wagner, 2001; Morrissey, 2015; Riska, 1989).

Here we analyse the evolution of locomotion behavior in the hermaphroditic nematode *Caenorhabditis elegans*, spanning 240 generations of lab evolution in a constant and homogeneous environment, thus maximizing the chances of imposing and detecting stabilizing selection. We could obtain an accurate characterization of the fitness effects of component trait variation of locomotion behavior (transition rates between movement states and direction), by measuring essentially all individuals at the time of reproduction. We characterized the evolution of the broad-sense **G** matrix for hermaphrodite locomotion behavior, obtained by phenotyping inbred lines derived from the domesticated ancestral population at generation 140, and from three replicate populations after a further 50 and 100 generations in the same environment. After domestication, selection gradients were estimated by regressing fertility onto transition rates. We ask whether short-term evolution of the **G** matrix follows the directions of selection or whether genetic variance is lost by drift alone, and how genetic divergence is compatible with phenotypic stasis once most adaptation has been achieved.

## 3 Methods

### 3.1 Archiving

Data, R code scripts, and modeling results (including **G** matrix estimates) can be found in our github repository, and will be archived in *Dryad.org* upon publication.

### 3.2 Laboratory culture

We analyzed the lab evolution of locomotion behavior during 273 generations (Figure 1A), the first 223 of which have been previously detailed (Noble et al., 2017; Teotónio et al., 2012; Theologidis et al., 2014). Briefly, 16 inbred founders were intercrossed in a 33 generation funnel to obtain a single hybrid population (named A0), from which six population replicates (A[1-6]) were domesticated for 100-140 generations. Based on the evolution of several life-history traits such as hermaphrodite self and outcross fertility, male mating ability or viability until reproduction, we have previously shown that most adaptation to lab conditions had occurred by generation 100 (Carvalho et al., 2014a,b; Poullet et al., 2016; Teotónio et al., 2012; Theologidis et al., 2014). From population A6 at generation 140 (A6140) we derived 6 replicate populations and maintained them in the same environment for another 100 generations (CA[1-6]). Inbred lines were generated by selfing hermaphrodites from A6140 (for at least 10 generations), and from CA populations 1-3 at generation 50 and 100 (CA[1-3]50 and CA[1-3]100; Noble et al., 2019). We refer to these last 100 generations as the focal stage. During the domestication and focal stages, populations were cultured at constant census sizes of *N* = 10^{4} and expected average effective population sizes of *N_{e}* = 10^{3} (Chelo et al., 2013; Chelo and Teotónio, 2013). Non-overlapping 4-day life-cycles were defined by extraction of embryos from plates, followed by seeding of starvation-synchronised L1 larvae to fresh food (Teotónio et al., 2012). Periodic storage of samples (> 10^{3} individuals) was done by freezing (Stiernagle, 1999). Revival of ancestral and derived population samples allows us to control for transgenerational environmental effects under “common garden” phenotypic assays (Teotónio et al., 2017).

### 3.3 Worm tracking assays

#### 3.3.1 Sampling and design

Population samples were thawed from frozen stocks on 9cm Petri dishes and grown until exhaustion of food (*Escherichia coli* HT115). This occurred 2-3 generations after thawing, after which individuals were washed from plates in M9 buffer. Adults were removed by centrifugation, and three plates per line were seeded with around 1000 larvae. Samples were maintained for one to two complete generations in the controlled environment of lab evolution. At the assay generation (4-6 generations post-thaw), adults were phenotyped for locomotion behaviour at their usual time of reproduction during lab evolution (72h post L1 stage seeding) in single 9cm plates. At the beginning of each assay we measured ambient temperature and humidity in the imaging room to control for their effects on locomotion.

Inbred lines from the experimental populations were phenotyped over 7 years in two different lab locations (Lisbon and Paris) by three different experimenters. In total, there were 197 independent thaws, each defining a statistical block containing 2-22 samples. 188 inbred lines from the A6140 population were phenotyped, with 52 CA150, 52 CA250, 51 CA350, 51 CA1100, 53 CA2100 and 68 from CA3100. Each line was phenotyped in at least two blocks (technical replicates). CA[1-3]50 and CA[1-3]100 lines were phenotyped within a year. A6140 lines were phenotyped over 2 consecutive years. We further phenotyped the outbred populations and the 16 founders. For these there were 9 independent thaws, of which 5 also contained founders. All founders and populations were phenotyped twice with the exception of A6140, included in six blocks.

In order to improve estimation of the selection surface in our lab evolution environment (see below), we also assayed locomotion bias in 56 inbred lines derived from populations evolved in a high-salt environment (GA[1,2,4]50) for which fertility data was available (Noble et al., 2017). These lines were phenotyped in the same blocks as the A6140 lines included in the gamma matrix analysis (Lisbon, single experimenter). Removing these lines from the analysis did not affect the mode of the posterior distribution estimates of our coefficients, and only led to the loss of statistical power reflected by wider credible intervals (analysis not shown).

#### 3.3.2 Imaging

To measure locomotion behavior we imaged adults 72h post-L1 seeding using the Multi-Worm Tracker [MWT version 1.3.0; Swierczek et al. (2011)]. Movies were obtained with a Dalsa Falcon 4M30 CCD camera and National Instruments PCIe-1427 CameraLink card, imaging through a 0.13-0.16 mm cover glass placed in the plate lid, illuminated by a Schott A08926 backlight. Plates were imaged for approximately 20-25 minutes, with default MWT acquisition parameters. Choreography was used to filter and extract the number and persistence of tracked objects, and assign movement states across consecutive frames as forward, still or backwards, assuming that the dominant direction of movement in each track is forward (Swierczek et al., 2011).

MWT detects and loses objects over time as individual worms enter and leave the field of view or collide with each other. Each individual track is a period of continuous observation for a single object (the mapping between individual worms and tracks is not 1:1). We ignored the first 5 minutes of recording, as worms are perturbed by plate handling. Each movie contains around 1000 tracks with a mean duration of about 1 minute. The MWT directly exports measurements at a frequency that can vary over time (depending on tracked object density and computer resource availability), so data were standardized by subsampling to a common frame rate of 4 Hz. Worm density, taken as the mean number of tracks recorded at each time point averaged over the total movie duration, was used as a covariate in the estimation of genetic variance-covariances below.

#### 3.3.3 Differentiating males from hermaphrodites

A6140 and all CA populations are androdioecious, with hermaphrodites and males segregating at intermediate frequencies (Teotónio et al., 2012; Theologidis et al., 2014). We were able to reliably (97% accuracy) differentiate between the sexes based on behavioural and morphological traits extracted from MWT data.

We first evaluated a set of simple descriptions of individual size, shape and movement to find a subset of metrics that maximized the difference in preference for a two-component model between negative and positive controls: respectively, inbred founders and two monoecious (M) populations which contained no, or very few, males; and three dioecious (D) populations with approximately 50% males [M and D populations were derived from A6140, see Theologidis et al. (2014) and Guzella et al. (2018)]. Starting with worm area, length, width, curvature, velocity, acceleration, and movement run length as parent traits from Choreography output, derived descendant traits were defined by first splitting parents by individual movement state (forward, backward, still) and calculating the median and variance of the distribution for each track. Traits with more than 1% missing data were excluded, and values were log-transformed where strongly non-normal (a difference in Shapiro-Wilk -log_{10}(p) > 10). Fixed effects of block and log plate density were removed by linear regression before model fitting. Two-component Gaussian mixture models were fit to tracks for each line/population [R package *mclust*, Scrucca et al. (2016); *VII* spherical model with varying volume], orienting labels by area (assuming males are smaller than hermaphrodites). We sampled over sets of three traits (requiring three different parent trait classes, at least one related to size), and took the set maximizing the difference in median Integrated Complete-data Likelihood (ICL) between control groups (log area, log width, and velocity, all in the forward state). By this ranking, the 16 inbred founders and two monoecious populations fell within the lower 19 samples (of 77), while the three dioecious populations fell within the top 15 samples.

To build a more sensitive classifier robust to male variation beyond the range seen in control data, we then trained an extreme gradient boosting model using the full set of 30 derived traits on the top/bottom 20 samples ranked by ICL in the three-trait mixture model [R package *xgboost*, Chen and Guestrin (2016)]. Negative control samples were assumed to be 100% hermaphrodite, while tracks in positive controls were assigned based on mclust model prediction (excluding those with classification uncertainty in the top decile). Tracks were classified by logistic regression, weighting samples inversely by size, with the best cross-validated model achieving an area under the precision-recall curve of 99.75% and a test classification error of 3.1% (*max_depth* = 4, *eta* = 0.3, *subsample* = 0.8, *eval_metric* = *“error”*). Prediction probabilities were discretized at 0.5.

Males tend to move much faster than hermaphrodites (Lipton et al., 2004), and since collisions between individuals lead to loss of tracking, sex is strongly confounded with track length and number. To estimate male frequencies at the sample level, tracks were sampled at 1s slices every 30s over each movie in the interval 400-1200 seconds, and line/population estimates were obtained from a binomial generalized linear model (Venables and Ripley, 2002). Estimates appear to saturate at around 45%, presumably due to density-dependent aggregation of multiple males attempting to copulate.

### 3.4 Locomotion behavior

#### 3.4.1 Definition of transition rates

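In a one-dimensional space individual locomotion behavior can be described by the transition rates of activity and direction. We modelled the expected sex-specific transition rates between forward, still and backward movement states with a continuous time Markov process. We consider a system having $d = 3$ states with $\mathbf{P}(t_1, t_2) \in \mathbb{R}^{d \times d}$, $t_2 > t_1$, denoting the transition probability matrix (Jackson, 2011; Kalbfleisch and Lawless, 1985):

$$\mathbf{P}(t_1, t_2) = \left[\, p_{i,j}(t_1, t_2) \,\right] \tag{1}$$

where $p_{i,j}(t_1, t_2) = \Pr\left[ s(t_2) = j \mid s(t_1) = i \right]$, with $s(t) \in \{\text{forward}, \text{still}, \text{backward}\}$ being the movement state occupied in instant $t$. We consider a time-homogeneous process described by the transition rate matrix:

$$\mathbf{Q} = \begin{bmatrix} q_{1,1} & q_{1,2} & q_{1,3} \\ q_{2,1} & q_{2,2} & q_{2,3} \\ q_{3,1} & q_{3,2} & q_{3,3} \end{bmatrix} \tag{2}$$

where $q_{i,j} \geq 0 \;\; \forall \; i \neq j$, subject to the constraint:

$$q_{i,i} = -\sum_{j \neq i} q_{i,j} \tag{3}$$

Hence, six of the nine possible transitions are independent. Let $\theta$ denote the parameters to be estimated, containing the off-diagonal elements from equation 2:

$$\theta = \left( q_{1,2},\; q_{1,3},\; q_{2,1},\; q_{2,3},\; q_{3,1},\; q_{3,2} \right) \tag{4}$$

In this model, the time an object remains in a given state is on average $1/q_i$, with $q_i = -q_{i,i}$ the total rate of leaving state $i$. Since the process is stationary, the probability of transition is a function of the time difference $\Delta t = t_2 - t_1$, such that $\mathbf{P}(t_1, t_2) = \mathbf{P}(\Delta t)$, and the elements of the $\mathbf{P}(\Delta t)$ matrix:

$$p_{i,j}(\Delta t) = \Pr\left[ s(t + \Delta t) = j \mid s(t) = i \right] \tag{5}$$

It then follows that:

$$\mathbf{P}(\Delta t) = \exp(\mathbf{Q}\, \Delta t) \tag{6}$$

where exp(·) denotes the matrix exponential. The constraint in equation 3 ensures that:

$$\lim_{\Delta t \to \infty} \mathbf{P}(\Delta t) = \begin{bmatrix} f_1 & f_2 & f_3 \\ f_1 & f_2 & f_3 \\ f_1 & f_2 & f_3 \end{bmatrix} \tag{7}$$

where $f_i$ is the relative frequency of state $i$ that no longer depends on the previous state (all three rows of the $\mathbf{P}(\infty)$ matrix converge). We find that the state frequencies from $\mathbf{P}(\infty)$ are a monotonic and mostly linear function of the observed frequencies of movement states (Figure S3), showing that violations of the Markov assumption of the model do not induce a large bias in the long-term predictions of our model.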
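As an illustration of the model just described, the following minimal Python sketch builds a rate matrix from six off-diagonal rates, evaluates the matrix exponential for one frame interval, and checks that the rows converge for a long interval. The rate values, the state ordering, the function names and the Taylor-series implementation of the matrix exponential are all assumptions made for this example; they are not the estimates or the code used in the study.

```python
def rate_matrix(theta):
    """Build Q from theta = [q12, q13, q21, q23, q31, q32]; each diagonal
    element is minus the sum of the other rates in its row."""
    q12, q13, q21, q23, q31, q32 = theta
    return [
        [-(q12 + q13), q12, q13],
        [q21, -(q21 + q23), q23],
        [q31, q32, -(q31 + q32)],
    ]


def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probabilities(q, dt, squarings=10, terms=20):
    """Approximate P(dt) = exp(Q * dt) by scaling and squaring a Taylor series."""
    n = len(q)
    scale = dt / 2 ** squarings
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # After this update, term holds (Q * scale)^k / k!
        term = [[x / k for x in row] for row in matmul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)  # undo the scaling by repeated squaring
    return result


# Hypothetical rates (per second); states ordered forward, still, backward.
theta = [0.5, 0.1, 1.0, 0.2, 0.3, 0.8]
Q = rate_matrix(theta)
P_frame = transition_probabilities(Q, 0.25)  # one 4 Hz frame interval
P_long = transition_probabilities(Q, 500.0)  # long interval: rows converge
mean_forward_run = 1.0 / -Q[0][0]            # mean duration of a forward bout

print(P_frame)
print(P_long[0])  # long-run state frequencies f_1, f_2, f_3
```

In practice a library routine for the matrix exponential would normally be preferred; the hand-written version only keeps the sketch free of dependencies.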

#### 3.4.2 Estimation of transition rates

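To estimate transition rates, we have $N$ objects (individual tracks) from each technical replicate (Petri plate), with the data on the $k$-th object denoted as:

$$X_k = \left\{ (s_{k,1}, t_{k,1}),\; (s_{k,2}, t_{k,2}),\; \ldots,\; (s_{k,m_k}, t_{k,m_k}) \right\}$$

where $s_{k,l}$ is the state of the $k$-th object in the $l$-th time-point in which it was observed, $t_{k,l}$ is the instant of time in which this observation was made, and $m_k$ is the number of observations of the $k$-th object. Then, given data $X = \{X_1, \ldots, X_N\}$, the log-likelihood for the model for analysis is (Bladt and Sorensen, 2005; Kalbfleisch and Lawless, 1985):

$$\ln L(\theta) = \sum_{k=1}^{N} \sum_{l=2}^{m_k} \ln p_{s_{k,l-1},\, s_{k,l}}\left( t_{k,l} - t_{k,l-1} \right)$$

where $p_{i,j}(\Delta t)$, defined in equation 5, is calculated as a function of the parameters $\theta$ (equation 4). Therefore, the data on the $N$ objects can be represented as the number of observations of $x = (i, j, \Delta t)$, which we denote as $n_x$:

$$n_x = \sum_{k=1}^{N} \sum_{l=2}^{m_k} I\left( s_{k,l-1} = i,\; s_{k,l} = j,\; t_{k,l} - t_{k,l-1} = \Delta t \right)$$

and where $I(\cdot)$ is the indicator function:

$$I(a) = \begin{cases} 1 & \text{if } a \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$

The input data can then be compressed by considering only the data:

$$\left\{ (\Delta t_k, \mathbf{N}_k) \right\}$$

where $k$ now indexes the distinct time differences observed and $\mathbf{N}_k$ is the $d \times d$ matrix of counts $n_x$ for $\Delta t = \Delta t_k$.

The log-likelihood to estimate transition rates can be finally rewritten as:

$$\ln L(\theta) = \sum_{k} \mathbf{1}^{\top} \left( \mathbf{N}_k \odot \ln \mathbf{P}_k \right) \mathbf{1}$$

where $\mathbf{1}$ is a $d$-dimensional vector of 1s, $\odot$ denotes the Hadamard product, and $\ln \mathbf{P}_k$ is the matrix obtained by taking the logarithm of each value in matrix $\mathbf{P}_k = \mathbf{P}(\Delta t_k)$.

These models were specified using RStan (Stan Development Team (2018), R version 3.3.2, RStan version 2.15.1), which performs Bayesian inference using Hamiltonian Monte Carlo sampling to calculate the posterior probability of the parameters given the observed data. We used multi-log normal prior distributions parameterized by a mean transition rate and a coefficient of variation.

Throughout, we denote as non-self transition rates $q_k$ the six off-diagonal elements of the **Q** matrix estimated by the above model.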
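The compression step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tracks, the integer state coding, the function names and the transition probability matrix are hypothetical, and the probabilities are supplied directly rather than computed from estimated rates. The sketch counts how often each combination of previous state, next state and time difference occurs, and confirms that the log-likelihood computed from the counts equals the one computed transition by transition.

```python
import math
from collections import Counter


def compress_tracks(tracks):
    """Count observed transitions x = (i, j, dt) over all tracks.

    Each track is a time-ordered list of (state, time) pairs.
    """
    counts = Counter()
    for track in tracks:
        for (s_prev, t_prev), (s_next, t_next) in zip(track, track[1:]):
            dt = round(t_next - t_prev, 6)  # guard against float noise in dt
            counts[(s_prev, s_next, dt)] += 1
    return counts


def log_likelihood(counts, prob_by_dt):
    """Sum of count * ln p_ij(dt) over the distinct (i, j, dt) combinations."""
    return sum(n * math.log(prob_by_dt[dt][i][j])
               for (i, j, dt), n in counts.items())


# Toy tracks: states coded 0 = forward, 1 = still, 2 = backward; times in seconds.
tracks = [
    [(0, 0.0), (0, 0.25), (1, 0.5), (1, 0.75), (0, 1.0)],
    [(2, 10.0), (2, 10.25), (0, 10.5)],
    [(0, 0.0), (0, 0.25), (0, 0.5)],
]

# Hypothetical transition probability matrix for dt = 0.25 s (rows sum to 1).
prob_by_dt = {0.25: [[0.90, 0.08, 0.02],
                     [0.20, 0.75, 0.05],
                     [0.10, 0.30, 0.60]]}

counts = compress_tracks(tracks)
ll_compressed = log_likelihood(counts, prob_by_dt)

# Same quantity computed one transition at a time, for comparison.
ll_direct = sum(math.log(prob_by_dt[0.25][a[0]][b[0]])
                for track in tracks for a, b in zip(track, track[1:]))

print(counts)
print(ll_compressed, ll_direct)
```

Because the likelihood depends on the data only through these counts, the number of terms to evaluate scales with the number of distinct state pairs and time differences rather than with the number of frames.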

#### 3.4.3 Male and inbreeding effects

Using the transition rates measured in populations and inbred lines, we fit a series of linear mixed-effects models to test for phenotypic evolution in the outbred populations (see equation 19), for effects of male frequency on hermaphrodite transition rates in the outbred populations (equation 20), and for inbreeding effects in the inbred lines (equation 21). Given sparse temporal sampling, we make the conservative assumption of independence of observations within domestication and focal stages. For transition rate $q_k$:

$$q_k = \alpha + \beta_{gen}\, t + \gamma_{anc} + \delta_{anc}\, t + \zeta_b + \epsilon \tag{19}$$

with $\alpha$ the trait mean, $\beta_{gen}$ a fixed effect of generation number $t$, $\gamma_{anc}$ and $\delta_{anc}$ random effects accounting for intercept and slope differences between the domestication and focal periods of lab evolution, $\zeta_b$ a random effect of block $b$, and $\epsilon$ the residual error. For male frequency effects:

$$q_k = \alpha + \beta F + \gamma_{pop} + \zeta_b + \epsilon \tag{20}$$

with $\beta$ a fixed effect of male frequency $F$, $\gamma_{pop}$ a random effect accounting for differences between populations, $\zeta_b$ a random effect of block $b$, and $\epsilon$ the residual error.

As we estimate the **G**-matrix from the line differences (see next section), it is likely that it does not reflect the true additive genetic (co)variance matrix (**G**-matrix) unless the mean trait values among lines are similar to the mean trait values of the outbred population from which the lines were derived (Kearsey and Pooni, 1996; Lynch and Walsh, 1998). Because the lines and the populations were phenotyped at different times, we included environmental covariates:

$$q_k = \alpha + \mathbf{T} + \mathbf{H} + \mathbf{D} + \beta + \gamma + \eta_{line} + \epsilon \tag{21}$$

where the environmental covariates temperature (**T**), relative humidity (**H**) and density (**D**) are fitted as fixed effects, $\beta$ is a two-level categorical fixed effect (inbred lines or population), $\gamma$ is a two-level categorical fixed effect accounting for differences between the years of phenotyping measurements of the A6140 lineages, $\eta_{line}$ is a random effect accounting for line identity within populations, and $\epsilon$ is the residual error.

Both male and inbreeding models were fit using the *lmer* function in R package *lme4*, and non-zero values of fixed effects were tested against null models without fixed effects with likelihood ratio tests. Marginal *r*^{2} for the male frequencies were computed using the *r.squaredGLMM* function of the package MuMIn (Bartoń, 2020).

### 3.5 Transition rate genetics

#### 3.5.1 **G**-matrix estimation

Genetic (co)variances of transition rates per population are estimated as half the between inbred line differences for lines separately derived from the evolving outbred populations. In the absence of selection during inbreeding and cancelling of directional non-additive gene action, this broad-sense **G**-matrix obtained from inbred lines is an adequate surrogate for the additive **G**-matrix of outbreeding populations (Kearsey and Pooni, 1996; Lynch and Walsh, 1998). We test these assumptions (see below).

**G**-matrices for the six non-self transition rates $q_k$ were estimated from trait values for the inbred lines derived from focal populations. We estimated **G**-matrices separately for each of the seven populations (A6140, CA[1-3]50, CA[1-3]100). The 6 transition rates $q_k$ were fitted as a multivariate response variable $\mathbf{y}$ in the model:

$$\mathbf{y} = \boldsymbol{\mu} + \mathbf{T} + \mathbf{H} + \mathbf{D} + \mathbf{B} + \mathbf{L} + \mathbf{e}$$

where the intercept ($\boldsymbol{\mu}$) and the environmental covariates temperature (**T**), relative humidity (**H**) and density (**D**) were fitted as fixed effects. Environmental covariates were fitted individually and with all possible interactions. Each covariate was standardized to a mean of 0 and standard deviation of 1. Block effects (**B**) and line identities (**L**) were modeled as random effects and **e** was the residual variance. We then estimated a matrix of genetic (co)variance as half the line covariance matrix (**L**). An additional two-level categorical effect was included when estimating the A6140 matrix that accounts for differences between the 2012 and 2013 phenotyping blocks.

Models were fit with the R package *MCMCglmm* (Hadfield, 2010). We constructed priors as the matrix of phenotypic variances for each trait. Model convergence was verified by visual inspection of the posterior distributions and by ensuring that the autocorrelation remained below 0.05. We used 100,000 burn-in iterations, a thinning interval of 2,000 and a total of 2,100,000 MCMC iterations.
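For orientation, the chain settings above imply the number of posterior draws that each model retains. A small Python sketch of that arithmetic follows; the function name is hypothetical, and the calculation assumes that draws are stored once per thinning interval after the burn-in has been discarded.

```python
def retained_samples(total_iterations, burn_in, thinning):
    """Number of posterior draws kept after discarding burn-in and thinning."""
    return (total_iterations - burn_in) // thinning


n_kept = retained_samples(total_iterations=2_100_000, burn_in=100_000, thinning=2_000)
print(n_kept)  # 1000
```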

#### 3.5.2 **G**-matrices under random sampling

For each of our seven populations (A6140, CA[1-3]50, CA[1-3]100), we constructed 1,000 randomised **G**-matrices to generate a null distribution against which to compare the observed estimates. We randomly shuffled both the inbred line and block identities and refit the **G**-matrix model described above. We then computed the posterior means of these 1,000 models to construct a null distribution.

We additionally generated 1000 matrices for the A6140 population using the same procedure on random subsets of 60 (of 188 total) inbred lines to determine the effects of sampling the same number of lines as those for CA[1-3]50 and CA[1-3]100 populations.

#### 3.5.3 **G**-matrix divergence and differentiation

To compare the overall variance of the **G**-matrices during experimental evolution, we first computed the trace of the matrices. We then performed spectral analyses on the posterior ancestral **G**. This decomposition of **G** describes the overall shape of the multivariate variance for each matrix, with the relative sizes of the six eigenvalues indicating whether the matrix is elliptical (a few large eigenvalues) or round (homogeneous eigenvalues); the first eigenvector (defined as *g_{max}*) is the linear combination of traits where the genetic variance is maximized.

We used spectral analysis to explore differences between the seven **G**-matrices, following Aguirre et al. (2014) and Hine et al. (2009). Genetic (co)variance tensors (**Σ**) are fourth-order objects describing how phenotypic dimensions between transition rates maximize differences between all the **G**-matrices. The genetic variation among multiple **G**-matrices can be described by **Σ** decomposition into orthogonal eigentensors (*E_{i}*, with *i* being the orthogonal dimensions), each associated with an eigenvalue quantifying its contribution to variation in **Σ** (*α_{i}*). In turn, eigentensors can be decomposed into eigenvectors (*e_{ii}*) each with associated eigenvalues (*λ_{i}*). Aguirre et al. (2014) implemented this approach in a Bayesian framework using *MCMCglmm*, and Morrissey and Bonnet (2019) made an important modification to account for sampling where the amount of variance in *α_{i}* is compared to an expected distribution under sampling a finite number of lines.

### 3.6 Selection on transition rates

#### 3.6.1 Selection surface

The log-transformed, covariate-adjusted fertility values (best linear unbiased estimates) for each inbred line were downloaded from Noble et al. (2017), exponentiated, and divided by the mean to obtain a relative fitness measure (*w_{l}*). See Results and Discussion for a justification of fertility as a fitness proxy.
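The transformation from log fertility to relative fitness can be written as a short Python sketch. The function name and the input values are hypothetical and serve only to show the two steps: back-transforming from the log scale and scaling by the mean across lines.

```python
import math


def relative_fitness(log_fertility):
    """Exponentiate log-scale fertility estimates and divide by their mean."""
    fertility = [math.exp(value) for value in log_fertility]
    mean_fertility = sum(fertility) / len(fertility)
    return [f / mean_fertility for f in fertility]


# Hypothetical line estimates on the log scale (fertilities of 10, 20 and 30).
log_estimates = [math.log(10.0), math.log(20.0), math.log(30.0)]
w = relative_fitness(log_estimates)
print(w)
```

By construction the relative fitness values average to 1 across lines, so regression coefficients on this scale can be read as selection gradients.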

Since we did not observe any directional change in locomotion behavior or component transition rates during lab evolution, and because the inbred lines were derived after domestication, most adaptation to the lab environments has occurred and we do not expect linear (directional) selection to be significant. We estimated quadratic selection gradients using partial regression, following Lande and Arnold (1983), with the *MCMCglmm* R package. The model regressed relative fitness on the squared value of each transition rate and on the products of pairs of transition rates, with *α* being the mean relative fitness among all lines and *γ* the partial coefficients estimating quadratic selection on each transition rate *k*, or between pairs of transition rates *k*_{1} and *k*_{2}. Environmental covariates (temperature, humidity, density) were defined and normalized as for the **G**-matrices estimation described above. Model residuals were normal and homoscedastic (not shown).

We compared the results of this model (equation 21) with those of linear mixed effect models including as a random effect the additive genetic similarity matrix **A** between inbred lines, as defined in Noble et al. (2017) and Noble et al. (2019). We have also compared results from equation 21 to models including coefficients for linear selection on each transition rate. Under both circumstances parameter estimates are similar to those presented, albeit with changing credible intervals (not shown). Including other measured traits by the worm tracker, such as body size [a trait related to developmental time that is known to affect fertility in our populations (Theologidis et al., 2014)] similarly does not affect the qualitative conclusions we reach.

#### 3.6.2 **G**-matrix alignment with the selection surface

We used canonical analysis (Phillips and Arnold, 1989) to visualize the selection surface as:

$$\boldsymbol{\gamma} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{\top}$$

with **U** being the matrix of eigenvectors of *γ*, and Λ the diagonal matrix of eigenvalues (denoted *λ*_{[1–6]}). **G** matrices were rotated to visualize them as:

$$\mathbf{U}^{\top} \mathbf{G}\, \mathbf{U}$$

To sample a null distribution of the *γ* eigenvalues along the rotated dimensions, we fit the same model after permuting the relative fitness values of the lines. We then extracted the diagonal elements of these permuted *γ* after rotation using the estimated **U**.
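The canonical rotation used in this section can be illustrated with a self-contained Python sketch. Everything here is an assumption made for the example: the matrices are hypothetical 3 x 3 toys rather than the 6 x 6 estimates, and the eigen-decomposition uses a simple Jacobi rotation scheme instead of the routines in R. The sketch decomposes a symmetric matrix of quadratic gradients into eigenvectors and eigenvalues, then expresses a genetic (co)variance matrix in the same rotated axes.

```python
import math


def matmul(a, b):
    """Matrix product for nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def jacobi_eigen(sym, max_sweeps=50, tol=1e-22):
    """Eigen-decomposition of a symmetric matrix by cyclic Jacobi rotations.

    Returns (eigenvalues, U) with eigenvectors in the columns of U, unsorted.
    """
    n = len(sym)
    a = [row[:] for row in sym]
    u = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                if a[p][p] == a[q][q]:
                    angle = math.pi / 4
                else:
                    angle = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(angle), math.sin(angle)
                for m in (a, u):  # right-multiply a and u by the rotation
                    for k in range(n):
                        mkp, mkq = m[k][p], m[k][q]
                        m[k][p], m[k][q] = c * mkp - s * mkq, s * mkp + c * mkq
                for k in range(n):  # left-multiply a by the transposed rotation
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)], u


# Hypothetical quadratic selection gradients and genetic (co)variances.
gamma = [[-0.8, 0.3, 0.1], [0.3, -0.2, 0.05], [0.1, 0.05, 0.4]]
G = [[1.0, 0.2, 0.1], [0.2, 0.5, 0.0], [0.1, 0.0, 0.3]]

eigenvalues, U = jacobi_eigen(gamma)
Lambda = [[eigenvalues[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
gamma_rebuilt = matmul(matmul(U, Lambda), transpose(U))  # U Lambda U^T
gamma_rotated = matmul(matmul(transpose(U), gamma), U)   # diagonal
G_rotated = matmul(matmul(transpose(U), G), U)           # G in canonical axes

print(eigenvalues)
print([G_rotated[i][i] for i in range(3)])  # genetic variance along each axis
```

Negative eigenvalues correspond to canonical axes under stabilizing selection and positive ones to axes under disruptive selection; the diagonal of the rotated **G** gives the genetic variance available along each of those axes.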

To see the evolution of the **G** matrix in the selection surface, we calculated the Pearson product moment correlations between the eigentensor vectors explaining most of the genetic differences between the 7 matrices (*e*_{11}, *e*_{12}) with the canonical selection dimensions (*y*_{1}-*y*_{6}). We estimated uncertainty in these values by sampling from the posterior distribution of *γ* 1000 times.

### 3.7 Inference of effects

Most of our analysis relies on Bayesian inference of genetic or phenotypic effects. As discussed in Walter et al. (2018), the “significance” of effects can be inferred when there is no overlap between the posterior null sampling distribution and the posterior empirical estimate of the expected value. Thus, we compare expected value estimates such as a mean or mode with the 95% credible intervals under random sampling of the expected value. However, when we compare empirical posterior distributions with each other, e.g. genetic (co)variance estimates or null distributions, we follow Austin and Hux (2002) and infer “significant” differences between them when their 80% credible intervals do not overlap (strictly 83%).
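The interval-overlap criterion can be sketched in Python as follows. This is a simplified illustration with made-up posterior samples and hypothetical function names; it uses equal-tailed intervals with linear interpolation between order statistics, which is an assumption of the example and may differ from the interval type used in the analysis (for instance highest posterior density intervals).

```python
def credible_interval(samples, level=0.83):
    """Equal-tailed credible interval from posterior samples."""
    ordered = sorted(samples)

    def quantile(p):
        # Linear interpolation between neighbouring order statistics.
        position = p * (len(ordered) - 1)
        lower = int(position)
        upper = min(lower + 1, len(ordered) - 1)
        return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])

    tail = (1.0 - level) / 2.0
    return quantile(tail), quantile(1.0 - tail)


def intervals_overlap(first, second):
    """True when two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


# Made-up posterior samples for two estimates.
posterior_a = [i / 100 for i in range(101)]        # spread over 0 to 1
posterior_b = [2.0 + i / 100 for i in range(101)]  # spread over 2 to 3

interval_a = credible_interval(posterior_a)
interval_b = credible_interval(posterior_b)
print(interval_a, interval_b, intervals_overlap(interval_a, interval_b))
```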

## 4 Results

### 4.1 Laboratory culture

Our lab evolution system is based on a hybrid population derived from 16 founder strains (Figure 1A). Replicate samples from the hybrid population were domesticated for 140 non-overlapping generations at census size N=10^{4} to an environment in part characterized by constant density, temperature and relative humidity, and little spatial structure during the life-cycle (see Methods). The dynamics of several life-history traits during domestication indicate that most adaptation to lab conditions occurred by generation 100 (Carvalho et al., 2014a,b; Poullet et al., 2016; Teotónio et al., 2012; Theologidis et al., 2014). From a single domesticated population we derived replicate populations and evolved them for another 100 generations in the same environmental conditions. Although we measured locomotion behavior throughout lab evolution, we only follow the **G** matrix of its component traits during the last 100 generations, after adaptation, a stage that we call here the focal stage of lab evolution (Figure 1A).

*C. elegans* reproduces mostly by selfing in nature, though there is considerable variance in male mating performance among the founders (Teotónio et al., 2006). By training a model on a suite of size- and locomotion-related metrics, we found that hermaphrodites could be clearly differentiated from males (see Methods), and estimated male frequencies were high during the entire experiment (Figure S1). Because *C. elegans* is androdioecious, and hermaphrodites cannot mate with each other, average expected selfing rates at a generation are 1 minus twice the male frequency at the previous generation (Teotónio et al., 2012), and we can conclude that outcrossing was the predominant reproduction mode during lab evolution. Previously, we showed that effective population sizes during domestication were about *N*_{e} = 10^{3} (Chelo and Teotónio, 2013).
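
The relation between male frequency and expected selfing stated above is simple enough to write down directly. The short Python sketch below does so; the function name and the input check are illustrative additions, not part of the original analysis.

```python
def expected_selfing_rate(male_frequency):
    """Expected selfing rate, given the male frequency in the previous generation.

    Implements: selfing = 1 - 2 * male frequency. Male frequency is
    assumed to lie between 0 and 0.5 under androdioecy.
    """
    if not 0.0 <= male_frequency <= 0.5:
        raise ValueError("male frequency is expected to lie in [0, 0.5]")
    return 1.0 - 2.0 * male_frequency
```

For instance, a male frequency of 0.25 gives an expected selfing rate of 0.5, and a male frequency of 0.5 corresponds to complete outcrossing.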

### 4.2 Evolution of locomotion behavior

We measured locomotion behavior at the time of reproduction for each outbred population and the inbred founders using worm video tracking (Swierczek et al., 2011). The output, after quality control and initial analysis, consists of individual worm tracks categorized at a given point in time by activity (moving, or not) and direction (forwards or backwards). We model a three-state memoryless (Markov) process with homogeneous spatial and temporal dynamics (see Methods, Figure S2). We view this as an obviously false but useful approximation of worm locomotion behavior under our conditions, which is only partially violated (worms tend to resume forward movement more often than expected; Figure S3). Component traits of locomotion behavior are the (sex-specific) six non-self transition rates between forward movement, backward movement, and immobility.
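
To make the six component traits concrete, the sketch below estimates non-self transition rates from a single track, taken as a sequence of states sampled at a fixed interval. It is a simplified Python illustration and not the estimation procedure described in the Methods: the state coding, the function name and the estimator (number of transitions divided by time spent in the origin state, reasonable only when the sampling interval is short relative to dwell times) are assumptions.

```python
import numpy as np

# Hypothetical state coding for illustration.
STATES = ("still", "forward", "backward")

def transition_rates(states, dt):
    """Estimate a 3 x 3 matrix of transition rates from a sampled track.

    states : sequence of state indices (0, 1, 2), one per time step.
    dt     : time between consecutive observations.
    Entry [i, j] is the rate of i -> j transitions per unit time spent in i;
    the diagonal (self transitions) is set to zero.
    """
    states = np.asarray(states, dtype=int)
    n = len(STATES)
    counts = np.zeros((n, n))
    # Count every observed pair (state at t, state at t + dt).
    np.add.at(counts, (states[:-1], states[1:]), 1)
    time_in_state = counts.sum(axis=1) * dt
    rates = np.zeros((n, n))
    for i in range(n):
        if time_in_state[i] > 0:
            rates[i] = counts[i] / time_in_state[i]
    np.fill_diagonal(rates, 0.0)
    return rates
```

The six off-diagonal entries of the returned matrix correspond to the six non-self transition rates treated as traits.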

We find that while the founders of lab evolution show great diversity in locomotion behavior under lab conditions, evolved populations rapidly attained, and maintained, a stable level after hybridization for 240 generations. For example, considering the proportion of time individual worms are stationary (Figure 1B), we observe values of around 40% for hermaphrodites - much higher than most founders - while males are much more vagile (stationary around 10%). Neither hermaphrodite nor male transition rates showed directional change from the hybrid ancestral state over the full 240 generation period (Table 1, Supplementary Figures S4 and S5). Differences between replicate populations can be explained by sampling error.

### 4.3 Broad-sense **G** matrix

To estimate **G** matrices, we used approximately 200 lines from the generation 140 domesticated population (A6140), and approximately 50 lines from each of three replicate populations derived from A6140 and sampled at generations 50 (CA[1-3]50) and 100 (CA[1-3]100) of the focal lab evolution. We use these broad-sense **G**-matrices as a surrogate for the narrow-sense (additive) **G**-matrices of the outbred populations (see Methods). These two kinds of matrices might not be identical because of selection during inbreeding or because of differential expression of non-additive genetic effects in inbred and outbred individuals. Such differences, if present, manifest as differences in the mean values of inbred and outbred samples as directional effects will statistically average out for polygenic traits (Kearsey and Pooni, 1996; Lynch and Walsh, 1998). We used the inbred lines and the focal A6140 ancestor to compare means for all transitions and we did not find any evidence of directional non-additive genetic effects (Table 2).

Our **G** matrices could also differ from the **G** matrices of outbred populations due to the absence of males in the inbred lines, whereas males were abundant in the outbred populations. This is because males are known to disturb hermaphrodite locomotion behavior (Lipton et al., 2004). We tested for effects of male frequency on transition rates in outbred populations with univariate linear models and found that they were weak at best (Figure S6).

### 4.4 **G** matrix evolution

For the domesticated 140 population (A6140), ancestral to all CA populations during further 100 generations in the same environment after adaptation, there is significant genetic variance in all hermaphrodite transition rates, relative to a null distribution from permutations of line and technical replicate identity (Figure 2A). Likewise, the posterior distributions of most (12 of 15) covariance estimates between transition rates do not overlap 0, and differ from the null distribution of posterior means. The A6140 **G**-matrix is structured in two main behavioral modules, with the transitions from still to forward or backward (i.e. leaving the still state) showing positive covariances with each other and negative covariances with other transition rates.

When looking at the evolved CA populations, we see that their **G**-matrices are reduced, particularly after 100 generations of evolution (Figure S7). The **G** matrices of CA populations after 100 generations do not differ from the permutation null. Reduced genetic (co)variance in generation 50 and 100 is particularly obvious when calculating the trace of the **G** matrices, although all populations contain more genetic variance than expected under sampling (Figure 2B). The loss of genetic (co)variances during focal evolution could be due to differences in statistical power or the result of continued lab evolution. Sub-sampling A6140 to the sample sizes of the CA[1-3]50 and CA[1-3]100 populations increased the credible intervals but did not affect the estimated modes, with many of them remaining different from the null (Figure S8).

Spectral decomposition of the A6140 **G** matrix further shows that, for the phenotypic dimension encompassing most genetic variation in this population (*g*_{max}), the projected variance of CA populations in this dimension is much reduced (Figure 2C). In this analysis we only find a difference between empirical variances with null expectations for A6140 and CA populations at generation 50, not CA populations at generation 100.
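
The two summaries used here, the leading eigenvector of the ancestral matrix and the variance of another matrix projected onto it, can be sketched in a few lines. This is an illustrative Python sketch operating on point estimates; the function names are hypothetical, and the actual analysis works with posterior distributions rather than single matrices.

```python
import numpy as np

def g_max(G):
    """Leading eigenvector of a symmetric (co)variance matrix G."""
    eigvals, eigvecs = np.linalg.eigh(G)  # eigenvalues in ascending order
    return eigvecs[:, -1]

def projected_variance(G, direction):
    """Genetic variance of G along a given direction (normalized internally)."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return float(d @ G @ d)
```

Calling `projected_variance(G_derived, g_max(G_ancestral))` gives the variance that a derived population retains along the ancestral *g*_{max}, while `np.trace(G)` gives the total genetic variance.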

### 4.5 Genetic divergence and differentiation

We tested for divergence of the **G**-matrices from the ancestral state, and for differentiation between derived replicate populations, using spectral analysis and comparing the seven matrices simultaneously (see Methods). This analysis identifies the phenotypic dimensions along which there are most differences between matrices. The first two eigentensors, **E**_{1} and **E**_{2}, explain more variation than the null expectation (Figure 3A), 54% and 21%, respectively. **G** matrix coordinates in the space of **E**_{1} and **E**_{2} (Figure 3B), show that the A6140 population drives most significant differences between all matrices, and thus encompasses most of the genetic divergence. Along the first eigenvector of **E**_{1} (called *e*_{11}; Figure 3C) divergence is due to loss of genetic variance in the CA populations. We further find that *e*_{11} is highly collinear with the *g*_{max} of the A6140 population, the phenotypic dimension encompassing most ancestral genetic variation (not shown).

The second eigentensor **E**_{2} suggests differentiation between replicate populations at generation 50 (Figure 3C) and thus we further tested for differentiation between replicate populations during focal evolution by restricting the spectral analysis to only the three CA[1-3]50 **G**-matrices. We observe that a single eigentensor was different from the null expectations, explaining 53% of the differences between the matrices (Figure S9). The coordinates of these matrices in the space of the eigentensor indicate that CA150 and the remaining two populations contributed in opposite directions to the difference observed. Most of this difference is expressed along the first two eigenvectors (50% and 37%): CA[2-3]50 lost variance along the first eigenvector and CA150 along the second one.
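
For readers unfamiliar with eigentensors, the sketch below shows one common way to obtain them from a set of point-estimate matrices: the unique elements of each matrix are vectorized (off-diagonals scaled by √2), their covariance across matrices is eigendecomposed, and each eigenvector is folded back into a symmetric matrix. This is a simplified Python illustration under stated assumptions. It ignores posterior uncertainty and null distributions, the function names are hypothetical, and it should not be read as the exact procedure described in the Methods.

```python
import numpy as np

def eigentensors(G_list):
    """Eigenvalues and eigentensors of the variation among symmetric matrices."""
    n = G_list[0].shape[0]
    iu = np.triu_indices(n)  # upper triangle, diagonal included
    # Scale off-diagonals by sqrt(2) so vector norm equals Frobenius norm.
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    X = np.array([G[iu] * scale for G in G_list])
    S = np.cov(X, rowvar=False)  # covariance of matrix elements across matrices
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]  # largest eigenvalue first
    tensors = []
    for k in order:
        E = np.zeros((n, n))
        E[iu] = vecs[:, k] / scale
        E = E + E.T - np.diag(np.diag(E))  # symmetrize without doubling diagonal
        tensors.append(E)
    return vals[order], tensors

def coordinate(G, E):
    """Coordinate of matrix G on eigentensor E (Frobenius inner product)."""
    return float(np.sum(G * E))
```

The share of among-matrix variation explained by each eigentensor is `vals / vals.sum()`, and the eigenvectors of a given eigentensor (such as *e*_{11}) follow from `np.linalg.eigh(E)`.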

### 4.6 Selection on locomotion behavior

In Noble et al. (2017) we reported the fertility of many of the inbred lines used to estimate the **G** matrices. This data encompasses hermaphrodite self-fecundity and progeny viability until early larvae, measured in an environment that closely mimicked that of lab evolution. With this data at hand we can estimate the selection surface of locomotion in our lab environment by applying equation 21, with relative fertility being partially-regressed onto the transition rates (see Methods).

We find that the 95% credible intervals for several coefficients for correlated selection between pairs of transition rates do not overlap zero: negative between still-forward (SF) and forward-still (FS) and positive between SB and FS, and FS and BS (Figure S10). To visualize the selection surface, we rotated the *γ*-matrix with canonical analysis (see Methods). The resulting selection surface suggests a saddle with three unstable equilibria in three canonical dimensions *y*_{1}-*y*_{3}, indicating disruptive selection, and three stable equilibria in three dimensions (*y*_{4}-*y*_{6}), indicating stabilizing selection (Figure 4). We only find, however, evidence of weak and strong stabilizing selection on *y*_{5} and *y*_{6}, respectively, because only these empirical estimates are unlikely under the null distribution.
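
The canonical rotation amounts to an eigendecomposition of the symmetric *γ*-matrix, with the sign of each eigenvalue indicating the form of selection along the corresponding axis. The Python sketch below illustrates this on a point estimate; the function names and the tolerance argument are hypothetical, and whether an eigenvalue differs from zero must still be judged against a null distribution, as done in the text.

```python
import numpy as np

def canonical_analysis(gamma):
    """Rotate a quadratic selection matrix onto its canonical axes.

    Returns eigenvalues sorted from most positive to most negative, and
    the matching eigenvectors as columns (y_1, y_2, ...).
    """
    gamma = 0.5 * (gamma + gamma.T)  # enforce symmetry
    eigvals, eigvecs = np.linalg.eigh(gamma)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def selection_form(eigvals, tol=0.0):
    """Label each canonical axis by the sign of its eigenvalue."""
    labels = []
    for lam in eigvals:
        if lam > tol:
            labels.append("disruptive")
        elif lam < -tol:
            labels.append("stabilizing")
        else:
            labels.append("flat")
    return labels
```

A mix of positive and negative eigenvalues, as reported here, corresponds to a saddle-shaped surface.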

### 4.7 **G** matrix evolution in the selection surface

Projection of the **G**-matrices onto the canonical selection dimensions shows that most genetic variance is concentrated in dimensions under neutrality (*y*_{1}-*y*_{4}), while the dimensions under stabilizing selection (*y*_{5} and *y*_{6}) do not show much genetic variance (Figure 5). Loss of genetic variance along *y*_{2}-*y*_{5} is clearly consistent with drift when assuming an infinitesimal model of trait inheritance (Barton et al., 2017) and effective population sizes of *N*_{e} = 10^{3} (Chelo and Teotónio, 2013). For neutral *y*_{1} and selected *y*_{5} and *y*_{6} dimensions it appears that replicate populations maintain more variation than expected under drift during the focal stage of lab evolution, and when compared with the founders before their hybridization.
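
As a point of reference for what "expected under drift" can mean, the sketch below encodes the textbook neutral expectation for additive genetic variance in a randomly mating diploid population, which declines by a factor 1 − 1/(2*N*_{e}) each generation. It is a Python sketch under that assumption only; the function name is hypothetical, and the drift expectations used in the analysis may rest on a different or more detailed calculation.

```python
def expected_variance_fraction(effective_size, generations):
    """Fraction of initial additive genetic variance expected to remain.

    Assumes neutrality, random mating, no new mutation, and the standard
    per-generation decay factor (1 - 1 / (2 * Ne)).
    """
    if effective_size <= 0 or generations < 0:
        raise ValueError("effective_size must be > 0 and generations >= 0")
    return (1.0 - 1.0 / (2.0 * effective_size)) ** generations
```

Multiplying the ancestral variance along a given dimension by this fraction gives a simple neutral benchmark against which the observed variance in derived populations can be compared.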

To assess if **G**-matrix evolution aligned with the selection surface, we calculated the correlation between the directions of genetic divergence and differentiation (eigenvectors *e*_{11}, *e*_{12}; Figure 3) and the canonical selection dimensions (Figure 4). There is no alignment of the **G** matrix with the selection surface as no correlations were detected, except perhaps between *y*_{3} and *e*_{11}, and between *y*_{5} and *e*_{12} (Figure S11).

## 5 Discussion

The evolution of *C. elegans* locomotion behavior during 240 generations in a fairly constant and homogeneous lab environment is characterized by stasis, following a genetically and phenotypically dynamic 33 generation period of hybridizing the founder strains. Most of the genetic variance along the several phenotypic dimensions under stabilizing selection or drift was lost, suggesting directional selection during hybridization of founders and domestication until generation 140. Despite phenotypic stasis, genetic divergence and differentiation continued during further 50 to 100 generations of evolution in the same environment, a result that is sufficiently explained by drift. Because we previously showed that most adaptation had occurred by the start of the focal stage (Carvalho et al., 2014a,b; Poullet et al., 2016; Teotónio et al., 2012; Theologidis et al., 2014), our findings provide evidence for what Haller and Hendry (2014) termed “squashed” stabilizing selection: once populations adapt to their environment, most individuals will show close to optimal trait values and it becomes difficult to detect the effects of selection within the local phenotypic space occupied. Taken together, these findings suggest that directional selection outside a local phenotypic space after adaptation ensures phenotypic stasis, and that over a larger phenotypic space there was *effective* stabilizing selection.

The loss of genetic variance from the 16 founders to the domesticated population in the selection surface dimensions *y*_{1}, *y*_{5} and *y*_{6} is notable because it suggests that the rapid phenotypic evolution during intercrossing of founders to form the hybrid population was due to initially strong directional selection, which subsequently weakened during lab domestication. It can be argued that, with only 16 founders, we have little power to reject the hypothesis that there was no loss, and that the genetic (co)variances we found after domestication simply reflect natural standing genetic variation. At mutation-drift balance, the **G** matrix should reflect the patterns of mutational effects described by the **M** matrix, the equivalent measure of trait mutational variances, and covariances between them due to pleiotropy (Lande, 1979; Lynch and Hill, 1986). Elsewhere, we have estimated the **M** matrix in two of the founders of lab evolution, which show locomotion values divergent from those of lab evolution populations, by phenotyping a set of about 120 lines that accumulated mutations in a nearly-neutral fashion for 250 generations (Baer et al., 2005; Yeh et al., 2017). We found that the **M** matrices from these founders have similar sizes and are well aligned with each other, but not with the genetic **G** matrix of our A6140 domesticated population (Mallard et al., 2022b). Loss of genetic variances from the founders during hybridization and lab domestication was therefore at least partly due to directional selection. Future work should nonetheless try to understand if mutation-selection balance is responsible for the maintenance of genetic variation in locomotion behavior in nature by comparing **G** matrices from natural populations, as they can be obtained from a large collection of wild isolates now available (Cook et al., 2017; Lee et al., 2021), with **M** matrices (Houle et al., 1996; Johnson and Barton, 2005).

Along the phenotypic dimensions of stabilizing selection (*y*_{5} and *y*_{6}), more genetic variance might have been maintained than that expected under drift during the final 100 generations of lab evolution. If true, these findings are similar to those from many studies in nature and from artificially selected populations (Johnson and Barton, 2005; Turelli and Barton, 1994; Walsh and Blows, 2009): most quantitative traits have significant heritable genetic variation despite being under stabilizing selection. For example, when Sztepanacz and Blows (2017) applied six generations of disruptive truncation selection in *Drosophila serrata* to a multivariate dimension of maximal standing genetic variation (and low mutational variance) in male hydrocarbon traits, they observed a decrease of phenotypic variation, while disruptive selection in a dimension with little standing genetic variation (and high mutational variance, thus inferred to be under strong stabilizing selection in the base population), led to an increase in phenotypic variance. Assuming that there was little change in environmental variance during the experiment (Pelabon et al., 2010; Whitlock and Fowler, 1999), disruptive (stabilizing) selection is not only expected to increase (diminish) phenotypic variation but also to maintain (deplete) genetic variation relative to drift, though population genetic equilibria timescales differ between disruptive and stabilizing selection (Barton, 1990; Turelli, 1988; Vladar and Barton, 2014; Walsh and Lynch, 2018).

There are several explanations for potentially more than expected genetic variance with drift. Genetic variances in *y*_{5} and *y*_{6} are very low, and thus could be simply the result of measurement error and insufficient sampling. All three replicate populations showed similar posterior distributions, however, and no estimate overlaps zero. Another explanation is that we might have overestimated the strength of selection or have biased estimates about the form of selection because environmental covariances with unmeasured traits could have caused correlated selection with transition rates (Blows and Brooks, 2003; Hunt et al., 2007; Mallard et al., 2022a). This seems an unlikely explanation. First, the strength of stabilizing selection on *y*_{6} in particular is of the same order of magnitude estimated in studies of natural populations (Johnson and Barton, 2005; Kingsolver et al., 2001). The regression equation 21 using standardized trait values returns an eigenvalue of −0.64 for *y*_{6} when compared to the eigenvalue using unstandardized trait regression of about −10 shown in Figure 4. Second, stabilizing selection was estimated after domestication and adaptation to our lab environment and the regression approach employed is robust and unbiased to the inclusion of linear terms or other traits: when including directional selection coefficients or body size, a trait known to be genetically correlated with fertility (Noble et al., 2017; Theologidis et al., 2014), similar estimates are obtained (not shown).

A more serious concern is the role of unmeasured traits under potential selection that might be genetically correlated with the observed transition rates (Barton and Turelli, 1987; Lande, 1979; Mallard et al., 2022a; Shaw et al., 1995). Modelling additive genetic similarity among the inbred lines used in the regression (Noble et al., 2017, 2019) does not qualitatively change the inference about selection on transition rates (not shown), though we did not model the genetic architecture of locomotion behavior with multiple traits as the dependent variables. Several models have proposed that pleiotropic effects on unmeasured traits, which we did not model, could explain maintenance of genetic variation in quantitative traits, though under weak selection and close to linkage equilibrium between QTL alleles in the long term of mutation-selection balance (Barton, 1990; Johnson and Porter, 2007; Simons et al., 2018; Zhang and Hill, 2005). While in our case transition rates between movement states should define the overall locomotion behavior of *C. elegans*, we cannot rule out that genetic covariation with unmeasured traits will better predict the evolution of genetic variances along the selected phenotypic dimensions.

Finally, potentially more genetic variance than expected by drift on *y*_{5} and *y*_{6} could also be explained by balancing selection generated by environmental or sex-specific effects (Johnson and Barton, 2005). In support of a role for balancing selection, we have previously shown that excess heterozygosity was maintained relative to that expected under drift and linked selection on deleterious recessives during the domestication stage of lab evolution (Chelo et al., 2019; Chelo and Teotónio, 2013). Further supporting balancing selection, QTL mapping of selection trait *y*_{6}, using the sequence data of Noble et al. (2019) for a subset of the inbred lines used here, detects two QTL with high minor allele frequencies (> 30%, unpublished analysis).

One of the major findings here is that of divergence and transient differentiation of the **G** matrix during the last focal 100 generations of lab evolution. The phenotypic dimensions of genetic divergence and differentiation among all populations were not obviously aligned with the phenotypic dimensions under selection, and most if not all of the genetic variance lost during this focal 100 generation period was expected with drift. Some of the phenotypic dimensions of genetic divergence were perhaps aligned with dimensions that were under weak stabilizing selection (*y*_{5}) or no selection (*y*_{3}), but there is no explanation for these observations, except as false positives, as drift is not known to predictably change the shape of the **G** matrix (Phillips and McGuigan, 2006). In line with these observations are those of classic experiments in *D. melanogaster* by Fowler and colleagues where, after bottlenecking an outbred population, there was a reduction in the size of the **G** matrix for wing morphology in the derived bottlenecked populations, and size divergence among them, as expected under drift (Fowler and Whitlock, 1999; Phillips et al., 2001). Genetic divergence also occurred because the shape of the **G** matrix changed as derived populations showed different genetic covariances between traits. Unlike our finding of phenotypic stasis, however, variable drift history was consequential to the future phenotypic divergence of particular bottlenecked populations (Whitlock et al., 2002).

Most of our analyses and the underlying theoretical predictions are predicated on the assumption that the infinitesimal model of trait inheritance is a good approximation of the truth. However, that assumption may be violated, inasmuch as the genetic variances and covariances of locomotion behavior will not on the short-term of our hybridization and lab evolution be independent of allele frequency changes and linkage disequilibrium between smaller effect quantitative trait loci (QTL). QTL allele frequency independence is expected only in the long-term, when approaching strong recombination and weak selection, mutation and drift, steady-states (Barton, 1990; Barton et al., 2017; Vladar and Barton, 2014). Our findings pose the question of how genetic drift, together with effective stabilizing selection, generates variable allele frequency changes at QTL so that pleiotropy or linkage disequilibrium between them eventually results in genetic covariances that diverge from the ancestral states and are not common among replicate populations. In our case, recombination during the focal stage should have remained much weaker than selection between 0.5-1 cM regions (Chelo and Teotónio, 2013; Noble et al., 2017, 2019), for a total genome size of 300 cM. If after domestication several QTL alleles within these linked regions segregate at low frequency, it is possible that selection and drift were such that each replicate population during divergence fixed alleles with differently signed phenotypic effects that would not average out when comparing across populations (Bernstein et al., 2019; Cohan, 1984; Gromko, 1995). Inflation of the effects of drift is further expected because there is a correlation across generations between the traits’ breeding values of successful parents and their offspring that results in a reduction in effective population sizes (Robertson, 1961; Santiago and Caballero, 1998). For intermediate frequency QTL alleles, linked selection could have dampened among replicate differentiation (Hill and Robertson, 1966; Zhang and Hill, 2005), but the strong polygenic sign epistasis observed previously in a subset of the same inbred lines used here for fitness-related traits could be at work in the opposite direction (Noble et al., 2017). Only investigation of allele and linkage disequilibrium frequency dynamics at exemplar QTL in our populations will shed light on these questions.

Short-term phenotypic stasis without genetic divergence in natural populations has been explained by indirect selection or phenotypic plasticity, despite heritability and direct selection on the traits that were followed. Our study suggests that phenotypic stasis can also occur with simultaneous genetic divergence, and we conclude that the adaptive landscape in our lab environment is best understood as a table-top mountain, where a saddled plateau with different optima is of little consequence to genetic or phenotypic divergence. Outside the plateau, directional selection explains phenotypic stasis and loss of genetic variation; within the plateau, drift appears to be the main driver of evolution.

## 7 Author contributions

Conceptualization FM, LN, CB, HT; hardware and software implementation BA, TG; data acquisition and analysis BA, FM, LN, TG; funding acquisition LN, CB, HT; project administration HT; resources CB, HT; writing, original draft FM, HT; writing, review and editing LN, CB; correspondence FM (mallard@bio.ens.psl.eu) and HT (Teotonio@bio.ens.psl.eu).

## 9 Supplementary Figures

## 6 Acknowledgments

We thank A. Crist, J. Garcia, H. Gendrot, C. Goy, V. Pereira, F. Melo, and A. Silva for help with worm handling and data acquisition; R. Costa, R. Kerr, and N. Swierczek for help with hardware and software implementation; N. Barton, C. Dillmann, A. Futschik, L. Kollar, S. McDaniel, P. Phillips, A. Le Rouzic, and A. Veber for discussion. We also thank the anonymous reviewers who greatly improved the presentation of this work. This work was supported by the European Research Council (ERC-St-243285) and the Agence Nationale pour la Recherche (ANR-14-ACHN-0032-01, ANR-17-CE02-0017-01) to HT, the National Institutes of Health (R01GM107227) to CB, and a Marie Curie fellowship (H2020-MSCA-IF-2017-798083) to LN.