Inference of sexual selection from pairwise contests

Sexual selection, whether mediated by male-male competition, female choice, or male choice, is often realized through a series of pairwise contests. Experimental evolutionists have long applied male-choice and female-choice experiments to quantify the relative mating abilities of genotypes, an important component of fitness. Here, we consider Drosophila mate-choice experiments and apply two mathematical models, the Bradley-Terry and the Élő models, which have been explicitly designed to quantify the merits of contestants and establish a ranking based on results from pairwise contests. Methods such as the Bradley-Terry model and Élő ratings also provide a way to predict the outcome of contests that have not yet taken place. Leave-one-out validation approaches allow us to assess the utility of these models to quantify relative fitnesses, how well the models fit the data, and how well they perform at prediction. After applying these methods to Drosophila mate-choice experiments, we interpret the results in terms of sexual selection, and discuss the implications of departures of the data from good fits to the models.


Introduction
Sexual selection is a potent driver of evolutionary change and a powerful motivating force in animal behavior (Andersson 1994). First introduced in The Origin of Species and then developed in The Descent of Man and Selection in Relation to Sex, the theory of sexual selection explains how certain characteristics impact mate choice and in turn, how the application of those mate choices has driven the evolution of phenotypic traits (Darwin 1859(Darwin , 1896. Both males (Darwin 1896) and females (Clutton-Brock 2009) engage in intrasexual competition for mates and reproductive opportunities. In general, the genetic variance for mating success appears in many organisms to exceed the genetic variance for other classical components of fitness, such as viability (Bundgaard and Christiansen 1972), especially under some environmental conditions (Yun et al. 2018). This, coupled with the challenges in measuring sexual selection, motivate the need to consider statistical paradigms that can provide the best estimates of mating success.
A variety of empirical approaches have been applied in an effort to quantify sexual selection. One of the most common experimental protocols is to present virgin individuals of one sex with a pair of individuals of the opposite sex and observe the first mating that occurs (Arbuthnott et al. 2017;Dechaume-Moncharmont et al. 2013;Herdman et al. 2004). Such pairwise contests are somewhat contrived and rarely match a natural situation, but in aggregate, many such pairwise trials might be able to produce reasonable inferences about the ranked mating abilities of the genotypes in the study. The estimates typically have high variance, but with adequate replication they can produce reliably repeatable results.
The models of Bradley and Terry (1952) and Élő (1978) are two of the many probability models that estimate player ratings and fit a model that also allows estimation of the probability that one player (or team) beats another (Cattelan 2012;Király and Qian 2017). Player ratings are an estimate of "ability" of each individual, and are calculated based on outcomes of past pairwise contests. Bradley and Terry (1952) derives and develops a formal statistical model for inference of ranking when the data consist of pairwise contests, and the Élő ranking system was developed later and independently to provide a means for ranking the performance of chess players. The Bradley-Terry and Élő models can be used to test whether there is a simple ranking of players that display strict stochastic transitivity. The models can be used to make probabilistic predictions of outcomes of matches that have not yet occurred. So if A has beaten B two-thirds of the time, and B has beaten C two-thirds of the time, then not only do we expect A to beat C (simple transitivity), but in fact the model predicts that A will beat C with probability 0.8 (simply solve for C/(A+C) when A/(A+B) = 2/3, and B/(B+C) = 2/3). The question of transitivity of behaviors arises frequently in population biology (Shafir 1994), and while it seems like a rational approach, there are situations both theoretical and empirical that produce nontransitive behaviors (McNamara et al. 2014). Application of the Bradley-Terry model is less common, but it too has appeared in the population biology literature. Shev et al. (2014) applied the Bradley-Terry model to obtain fits to the dominance hierarchy among rhesus macaques engaging in pairwise contests. Muller et al. (2015) obtained estimates of host preference of saproxylic beetles with the Bradley-Terry model, and they assessed the effect of covariates on preferences with a method of recursive partitioning.
We would like to answer whether modeling approaches like Bradley-Terry, which estimate intrinsic abilities or ranks of each genotype, provides a useful means for estimating fitness parameters when the experimental data are in the form of pairwise contests (mate-choice tests). Such a quantitative approach would have value in assessing how many pairwise tests are needed to obtain reliable inferences of relative mating abilities. Parameters of the model assign a value or "ability" to each genotype and these in turn can be given confidence limits. These estimates are readily interpreted as relative mating abilities, and the differences in mating abilities can easily be used to infer allele frequency dynamics in a population (assuming mating ability is the only component of fitness that varies). After we apply the Bradley-Terry and Élő models to mate-choice data from Arbuthnott et al. (2017), we determine how well each model fits. Such tests of goodness-of-fit assess whether sexual selection occurs such that there is an intrinsic ranking of genotypes, as opposed to the outcome of a contest being dependent on specific interactions among the pairs of genotypes. Caveats to this approach are outlined in the Discussion.

Fly mate-choice data
Many mate-choice experiments have been published using Drosophila, but we needed a study that considered all pairwise tests among a collection of multiple lines. Drosophila matechoice data was generously provided by Devin Arbuthnott from the paper Arbuthnott et al. (2017). That study was designed as a male-choice test and sought to determine whether there was transitivity to the choosing by males among pairs of female genotypes. Generally such experiments with Drosophila are done as female-choice tests (Andersson 1994), and part of the idea of this study was to consider whether male-choice is an important phenomenon in flies. We use the data from pairwise contests with Oregon-R males and females from eleven lines from the Drosophila Genetic Reference Panel (DGRP): 138, 313, 362, 399, 517, 555, 707, 714, 738, 774, and 786. Note that, because each line is a distinct inbred genotype, we use the terms line and genotype interchangeably. In each trial, two females from eleven DGRP lines were presented to a male from an Oregon-R stock. Altogether, there were 55 distinct pairwise tests, and a total of 550 mating trials were conducted. The line number of the female fly that successfully mated within a 2 hour observational period was recorded. Among the 550 mating trials, 392 concluded with a successful mating.

Data analysis
We used R version 3.5.2, Eggshell Igloo, for the data analysis (R Core Team 2018). The R package readxl (Wickham & Bryan 2019) was used to input the data, and R packages

Model equations
The basic modeling challenge is to estimate, based on the results of a series of pairwise contests among our 11 genetic lines of flies, a vector of scores, α i , for i = 1 to 11. With these α i estimates, the formula for the probability of line i being chosen for mating over line j is These α i scores, which apply to both the Bradley-Terry and Élő models, could in principle be used as a metric for fitness, in this case reflecting only the relative fitnesses with respect to sexual selection. In the Bradley-Terry model, where λ i is the "ability" parameter of line i (Turner & Firth 2012). Given the data matrix of observed counts of wins and losses for each pairwise contest, Bradley and Terry (1952) derived the maximum likelihood estimators for the λ i 's.
In what essentially amounts to a rediscovery of the Bradley-Terry model, the Élő rating system in chess, makes use of sequential pairwise chess match results and provides players with a global rating. In this Élő ranking system, where R i is the "rating" of line i (Coulom 2007). The Élő system provides a convenient way to update these ratings with each successive contest.

Transitivity
One important assumption of the Bradley-Terry model is that the contests have the feature of transitivity. Transitivity, an indication of "rational" behavior (Arbuthnott et al. 2017), occurs when, if A beats B, and B beats C, then A beats C. Weak stochastic transitivity occurs when, given Pr(A beats B) ≥ 0.5 and Pr(B beats C) ≥ 0.5 then Pr(A beats C) ≥ 0.5 (Houston 1991). Strong stochastic transitivity occurs when, given Pr(A beats B) ≥ 0.5 and Pr(B beats C) ≥

then Pr (A beats C) ≥ max[Pr(A beats B), Pr(B beats C)] (Houston 1991). Equations 2a and
2b hold true only if there is also strong stochastic transitivity. Note that an assessment of fit to the Bradley-Terry model implies not only transitivity, but a quantitative fit to the mating probabilities. If all that we knew was that the results are transitive, this would not by itself produce an ability to quantitatively predict outcomes of mate choices.

Tests of goodness-of-fit
Deviance statistics: The R packages generate a null deviance and a residual deviance statistic to measure how well the model fits. A smaller deviance value indicates a better fitting model. The null deviance is the deviance corresponding to the model containing no fitted parameters, while the residual deviance corresponds to the model containing all rating parameters of fly lines. The difference between the residual and null deviances approximates a χ 2 distribution with 10 degrees of freedom. We conduct a chi-square test to find P-values and determine significance (in this case, indicating that the model produces a significantly better fit to the data than does the null model).

Leave-one-out tests:
We assessed the performance of the Bradley-Terry model by systematically excluding the results of pairwise contests between two randomly chosen lines from the data. We fit the Bradley-Terry model to the incomplete data and use it to predict the probability of the contest between the lines in the excluded contest. We then compare with a chisquare test the predicted probability with the actual counts of mating successes, giving us an idea of how well the model predicts the results of Drosophila mating assays. There are 55 pairs among the 11 lines, and so this yields 55 chi-square tests, each with one degree of freedom.
Transitivity tests: Let A, B, and C be three different lines. We find all possible trios and order them such that A > B and B > C (A engages in more matings than B, and B engages in more matings than C). After temporarily excluding the result of A vs. C from the data, we generate a Bradley-Terry model and use it to predict the probability that the male will choose the female of strain A over strain C. We then compare this to the actual percentage of trials in which A is chosen over C. We conduct a binomial test to generate a P-value for each of the 165 tests (the number of distinct trios of lines possible from a set of 11 lines). Note that these tests are not independent of one another.
We use a Q-Q plot of P-values to assess the goodness-of-fit of the model. P-values are obtained from the transitivity tests described above. We then compare obtained P-values from the binomial test with a uniform distribution of P-values. Using this graph, we can observe if and by how much the P-values differ from the expected null distribution.

Élő ratings
The Élő rating method is a system used to find a numerical rank-order of individuals in a group (Albers and de Vries 2001). Originally created to rank chess players, Élő ratings can be and are used in a variety of contexts. Here, we apply Élő ratings to eleven DGRP fly lines. The equation for calculating a line's rating is , and E i is the predicted probability that line i mates. E i is calculated using Equations 1 and 2a for the Bradley-Terry model and using Equations 1 and 2b for the Élő model. The k-factor determines how much the rating changes in response to competition results, and is chosen somewhat arbitrarily based on the expected total number of contests. Here we choose k = 24.

Predicted allele-frequency dynamics
Sexual selection results in changes in population allele frequencies, and so to interpret the meaning of different models of sexual selection, it might be useful to consider how the models differ with respect to their predicted allele frequency dynamics. In this case, we wanted to know whether the allele frequency dynamics differed between a simple model where the fitnesses were inferred to be a property of each line, in a simple ranked order, compared to the model where specific pairwise contests imposed altered frequency dynamics. We do this by simulating genotype frequencies across multiple generations. Note that all lines in the experiments are inbred and are therefore homozygous. Thus, we ignore the complexities of the real genetics, and model them as if they are haploid clones competing.
In our first simulation, we use Élő abilities as a direct proxy for fitness, substituting into the haploid frequency recursion: where f i is the frequency of genotype i in the current generation, f' i is the frequency of genotype i in the next generation, and α i is the Élő ability of genotype i. This is thus equivalent to a onelocus, multiple allele haploid selection model.
In the next simulation, we use Bradley-Terry abilities to compute the probabilities of line i being chosen for mating over line j. These abilities comprise a fitness matrix p ij , specifying the probability that i wins in a contest between i and j. To calculate the frequencies over successive generations, we use the following recursion: where f i is the frequency of genotype i in the current generation, f' i is the frequency of genotype i in the next generation. This is still a haploid model, but the genotypes are considered in all possible pairs, and mating abilities are a property of mating pairs rather than of individual genotypes.

Fitting Élő and Bradley-Terry models
We calculate the Élő ratings of the eleven DGRP lines by assigning each line an initial rating of 400 (as is customary in the common use of Élő ratings in international chess) and inputting each pairwise competition as a "match" between two lines. As the number of matches increases, the ratings should tend to stabilize, and plotting the ratings is one means to assess whether sufficiently many matches have been run to obtain stable results. Ratings were calculated using Equation 3 and the match order was randomized. Élő ratings of different lines diverge significantly, indicating a difference in "ability" between genotypes ( Figure 1).
However, the rank ordering of the ratings is still changing after nearly 400 trials. We conclude that the model fits would be improved by collecting even more mating trials, and when the ratings stabilize, we would arrive at the best estimates for the "ability" of each line. Next, we generate fits to the Bradley-Terry and Élő models using the data. Among the fitted parameters is the "ability" of each line, which is a measure of how well a line is expected to perform relative to other lines in the group. Higher abilities are directly related to a higher probability of winning the pairwise contest (see Equation 1). Thus, it seems logical to presume that these estimates of ability might be a useful quantitative measure of relative mating success.
Estimates for the abilities of each line are found in Table 1. A key conclusion is that both models produce the same rank ordering of lines' mating abilities ( Figure S1). We are interested in exploring whether these are useful proxies for estimates of reproductive fitness.

Tests of goodness-of-fit
We report the null and residual deviances for the Bradley-Terry and Élő models in Table   2. The difference between the residual and null deviance is a measure of how well the complete model explains the data compared to the model that assumes that all genotypes have equal mating ability. The highly significant P-values for both the Bradley-Terry and Élő models indicates that the models allowing fly lines to have distinct mating success probabilities greatly improves the fit of the model to the data compared to the null model, which allows no differences among lines. This further implies that the models have improved predictive capability over the null. Table S1 shows the tests of significance of each predicted mating trial compared to the null hypothesis of equal mating probability.  Next we conducted the transitivity tests described in the "Materials and Methods" section using all possible trios of lines. We compared the predicted probability of line A beating line C (given A beats B and B beats C) with the actual percentage of A beating C in the data. Although many of the points have similar observed and expected values, some also display some deviation.

Bradley-Terry Élő
We therefore performed binomial tests to assess the significance of these transitivity tests. We use a Q-Q plot to assess overall goodness-of-fit of these 165 tests by plotting the expected Pvalues against observed P-values taken from the binomial tests ( Figure 2).
10 of the 165 (6.06%) tests were significant at α = 0.05, a magnitude of deviation that is quite likely to occur by chance. We see from the plot that the points fit along the y = x line with only minor deviation, indicating that the observed P-values conform reasonably well to the expected values. Another test of the predictions is to calculate the probabilities of outcomes in the case where all data except for one of the 55 contests is included, and contrast this to the observed mating success of that trial. Figures S2 and Table S2 show again that the model predictions fit the data well, and in fact only one of the 55 tests is rejected at P < 0.05. , and plotted the allele frequency dynamics for an arbitrary starting frequency. In this case, the line that ultimately wins depends on both the mating abilities and the initial frequencies. When initial frequencies are equal for all lines, the line with the highest ability outcompetes the others. But when initial frequency for the highest ability line is sufficiently low, another line may be the ultimate winner, as seen in Figure   3B. contests, the behavior is now frequency dependent, so if line 362 is not sufficiently common, it can be outcompeted. Initial frequencies were randomly generated.

Discussion
Mate choice tests have been widely used by evolutionary biologists to assess differences among genotypes in mating ability, an essential component of fitness. For the most part, inference of parameters of sexual selection from an aggregation of pairwise tests has been done A B by simple heuristics, without any formal statistical models. Here, we have shown that models, such as the Bradley-Terry model can provide a rigorous way to infer ratings for each genotype that serve as excellent proxies for fitness estimates. These models do make specific assumptions about the data and the situation to which they are applied (Shev et al. 2014). First, there is assumed to be a true ranking of the participants, and that there is a transitivity in the winners of contests. Contests are assumed to be independent of each other, and there is assumed to be no context dependence to the contest outcomes. Given the excellent fit that Drosophila mate choice data show to the Bradley-Terry model, these assumptions appear to be satisfied, at least in the constrained laboratory environment.
There are several advantages to the application of the Bradley-Terry model to pairwise contest data in the context of evolutionary biology. A formal parametric model allows the assessment of the overall goodness-of-fit of the model to the data. This allows the identification of lines or assays that deviate from the model. Additionally, the quality of the data can be ascertained, particularly whether the experiment is sufficiently large and consistently executed, to determine whether stable and robust estimates of parameters have been achieved (Figure 1).
Finally, the parameter estimates from the model explicitly present the researcher with the ability to predict the outcomes of other pairwise tests, allowing a powerful form of cross validation.
This study also raises some interesting questions related to the old "unit of selection" debate (Lewontin 1970). Sexual selection can be modeled with fitness parameters for each genotype, and the outcome of particular mating trial depends in some simple way on the parameters of each (e.g. Equation 1). In this case, the unit of selection is the single genotype, and the model has as many parameters as there are genotypes. Alternatively, the outcome of a mating trial could be modeled with a matrix of parameters whose entries represent all possible pairs of genotypes, and the unit of selection is now the genotype-pair. This matrix of 55 parameters can also be calculated from the Bradley-Terry model. Interestingly the resulting allele frequency dynamics of these two models (Equations 4a and 4b) can be quite different, as is demonstrated in Figures 3a and 3b (Arbuthnott et al. 2017;Edward and Chapman 2012). Finally, we note that the Bradley-Terry model is over 65 years old, and that inference and prediction in the context of pairwise tests is an area of active development, including Bayesian methods (Gilbert and Wells 2019) and machine learning approaches (Menke and Martinez 2008). Some of these methods even accommodate multi-dimensional attributes of the competing individuals, making them even more useful in a broad collection of evolutionary applications. "ability" and Élő "ability," respectively (see Table 1). Each point represents the expected probability of one of the 55 pairwise contests. contests, and the y-axis is the predicted probability for the same (i, j) pair based on the parameter estimates with the data from the (i, j) pairwise mating tests removed. There were 55 pairwise contests, and so the plot has 55 points. The sample was relatively modest, so there is evident scatter, but only one of these 55 tests significantly rejects a goodness-of-fit binomial test (Table S2).

With
Incomplete Data Table S1 -Matrix of expected probabilities from the Bradley-Terry model. P ij is the probability that line i will be selected over line j. Probability ± 1 standard error. z-test for H 0 : P ij = 0.5(ns: P > 0.05, *: P < 0.05, **: P < 0.01, ***: P < 0.001) Significance codes appear in parentheses.