Differential retention of Pfam domains creates long-term evolutionary trends

Protein domains that emerged more recently in evolution have higher structural disorder and greater clustering of hydrophobic residues along the primary sequence. It is hard to explain how selection acting via descent with modification could act so slowly as not to saturate over the extraordinarily long timescales over which these trends persist. Here we hypothesize that the trends were created by a higher level of selection that differentially affects the retention probabilities of protein domains with different properties. This hypothesis predicts that loss rates should depend on disorder and clustering trait values. To test this, we inferred loss rates via maximum likelihood for animal Pfam domains, after first performing a set of stringent quality control methods to reduce annotation errors. Intermediate trait values, matching those of ancient domains, are associated with the lowest loss rates, making our results difficult to explain with reference to previously described homology detection biases. Simulations confirm that effect sizes are of the right magnitude to produce the observed long-term trends. Our results support the hypothesis that differential domain loss slowly weeds out those protein domains that have non-optimal levels of disorder and clustering. The same preferences also shape differential diversification of Pfam domains, further impacting proteome composition.


28
The possibility that evolution could result in long-term directional change over time has enduring 29 appeal, but there are few well-documented examples. It has proven difficult to rigorously assess 30 purported universal tendencies toward increases in complexity and size over time (Gregory 2008; 31 McShea 1991). Perhaps the best documented example of a long-term trend is that some taxonomic 32 groups show a tendency toward increasing body size over time, aka 'Cope's rule' ( Payne et al. 2009). Interestingly, this trend does not seem to be caused by directional 34 selection slowly exerting a preference for larger individuals, but rather by the differential 35 diversification of larger body size clades (Heim et al. 2015). This is unsurprising, because directional 36 selection among individuals of the same species works quickly, making it an unlikely explanation for 37 such a slow directional trend.

39
New examples of long-term trends were recently observed in the properties of protein domains (e.g.

69
What we are proposing is that the relevant level of selection may not be the one that selects among 70 alternative alleles of the same gene (Fig. 1, top), but instead the level that selects among alternative

90
We focus on protein domains rather than on complete genes. We take domain annotations from the 91 Pfam database (El-gebali et al. 2019), and refer to domains simply as Pfams. Protein-coding domains 92 are sometimes thought of as functional units of protein sequence that are able to fold independently 93 (Ponting and Russell 2002). However, Pfams are annotated by HMMER3 on the basis of evolutionary 94 rather than functional relatedness (Finn, Clements, and Eddy 2011), and as such are fundamental units 95 of protein sequence homology. This makes them easier to work with than genes, which commonly 96 contain multiple different Pfams of a variety of ages (Bagowski, Bruins, and Velthuis 2010; 97 Bornberg-Bauer and Albà 2013).

99
We infer loss rates by maximizing the likelihood of observed presence / absence of at least one Pfam 100 instance across all species in a phylogeny. As well as this central focus on total loss rates, leading to 101 differential retention over time (Fig. 1, bottom), we perform a complementary analysis of differential 102 diversification (Fig. 1, middle)     implausible. Second, we infer total loss rates jointly with a rate of false positive hits. Finally, we 113 consider the hypothesis of homology detection bias while interpreting results.

115
We ask whether ISD and clustering scores predict the rate of total loss of a Pfam in a lineage, and also 116 the number of Pfam instances per species. We analyse animal lineages only, because the ISD trends as 117 a function of Pfam age found by (James et al. 2021) were specific to animals.

121
We used the phylostratigraphy dataset of James et al., restricted to the 6841 Pfams annotated in at 122 least two out of our of 343 animal species, and already subject to a number of filters used by James et 123 al. We scanned all six reading frames of intergenic regions to find species that contained unannotated 124 instances of these Pfams. We also performed additional quality controls to remove likely viral or other 125 contaminants on the basis of incoherent distribution across the tree (see Methods), resulting in a 126 dataset of 6700 Pfams. We then estimated the rate of total loss of each Pfam over the animal species 127 in our dataset (see Methods). During this maximum likelihood procedure, we iteratively removed 128 Pfam presence data for clades (mostly single species) where they are more likely to be false positive 129 hits or horizontal gene transfer events.

131
The median inferred loss rate of a Pfam in the animal phylogeny was 0.0009/MY (1 st and 3 rd quartiles 132 of 0.00021 and 0.0035), i.e. 0.9 losses per 100,000 years. Note that this is not the rate at which a 133 single instance of a Pfam is lost, but rather the rate at which all instances of a Pfam are lost from a 134 lineage. Unsurprisingly, Pfams with a higher mean number of instances per genome tend to have a 135 lower rate of total loss (Supplementary fig. 1, adjusted R 2 = 0.30, p < 1e-16).

149
Pfams with a mean ISD of 0.18 are least often lost 150 Pfams with high ISD have high loss rates (adjusted R 2 = 0.023, slope = 1.35, p = 4e-36; data and loess 151 regression shown in Fig. 2A). This is the predicted direction in order for their relationship to explain 152 the animal phylostratigraphy trend (Fig. 3A). Note that the Box-Cox transformation of loss rates 153 makes the units of the slope in these linear models hard to interpret. Interestingly, this relationship is 154 not just non-linear, but also non-monotonic (loess regression in Fig. 2A). A regression model that 155 includes a breakpoint at ISD = 0.18 (95% CI 0.15-0.21), representing the minimum loss rate, explains 156 substantially more of the variance in loss rate among Pfams (adjusted R 2 = 0.030, slopes -2.00 and 157 3.86), and fits the data significantly better than a model without a breakpoint (p = 6e-12), better than a 158 model with zero slope for ISD values higher than the breakpoint (p = 1e-45), and better than a model 159 with zero slope for ISD values lower than the breakpoint (p = 0.0002).

161
The existence of a breakpoint rather than a monotonic relationship contradicts predictions from the 162 homology detection bias hypothesis. That hypothesis expects overestimation of loss rates for high ISD

163
Pfams whose homologs are more difficult to detect. While this could drive the overall relationship we 164 found in which Pfams with high ISD have high loss rates, it cannot explain why loss rates also rise 165 with low ISD <0.18.

167
Our results are instead consistent with selection acting to preferentially remove Pfams at either end of indicate that the trend of falling ISD with age (shown in Fig. 4A) has not yet fully saturated.

178
Pfams with moderately interspersed hydrophobic residues are least often lost

179
The hydrophobic residues of younger domains are more clustered (i.e. have higher clustering) than  respectively. The fact that this is higher than the upper end of our 95% confidence interval for the 196 breakpoint suggests that the trend in clustering values with age has not yet saturated. The y-intercept 197 of Fig. 3B shows that Pfams are born with clustering scores a little above 1. Because Pfams tend to be 198 born with relatively random amino acid placement corresponding to a clustering score of ~1 that is so 199 much higher than the optimum value of 0.81, differential Pfam retention can explain the progressive 200 drop in clustering scores with Pfam age. In further support of the hypothesis that differential loss is 201 responsible for the trend in clustering with age, there is more variability in clustering within younger 202 Pfam cohorts (Fig. 4B, slope of standard deviation in clustering with age = -5.5 × 10 -5 , adjusted R 2 = 203 0.60, p = 2e-10, using linear regression weighted by number of Pfams per age cohort).

205
Results are robust to Pfam age as a confounding factor 206 When ascertaining the causal effects of ISD and clustering on loss rates, Pfam age has the potential to 207 be a confounding factor in our analyses, for two reasons. Firstly, if older Pfams also have lower ISD 208 and less clustered hydrophobic amino acids (James et al. 2021) for reasons other than differential loss 209 due to these properties, this could potentially drive a spurious relationship between Pfam properties 210 and loss rates. As expected, we observe lower loss rates for older Pfams in our data (Adjusted R 2 = 211 0.050, p = 2e-76, supplementary figure 2). Secondly and more subtly, Pfams of identical age share the 212 same speciation events, which enable Pfam losses to be observed. This results in phylogenetic 213 structure in the data. A further complication is that the fact that older Pfams have survived to be 214 observed might create ascertainment bias toward estimated loss rates that are lower than the true loss 215 propensity.

217
To control for both indirect correlations and phylogenetic structure, we used mixed-effect linear 218 regression models, including age as a random rather than a quantitative effect. Note that each branch 219 of the tree corresponds to a possible Pfam age, making age a discrete quantitative variable.

220
Empirically, we found that random effects models are better supported model choices: for example, 221 for models of loss rate and ISD, R 2 = 0.33 if we include age as a random effect, and R 2 = 0.058 if we 222 include age as a linear predictor, where R 2 here is the conditional R 2 describing the variance explained 223 by the entire model. AIC scores for the respective models are 23580 and 24432. Similarly, for models of loss rate and clustering, a model that includes age as a random effect model has R 2 = 0.34, while a 225 fixed effect model has R 2 = 0.05. AIC values are 23603 and 24476, respectively.

227
The non-monotonic relationship between loss rate and ISD remains statistically supported in the

233
While controlling for Pfam age causes marginal R 2 values to drop dramatically, the causal direction is 234 unclear. I.e., differences in ISD and clustering might drive changes in loss rates as a function of Pfam 235 age, in which case controlling for age would remove a good portion of the genuine biological effect.

236
However, the statistical significance we find indicates that Pfam age alone is insufficient to be the sole 237 driver of the dependence of loss rates on ISD and clustering.

239
Differences in loss rates are small enough to produce slow change 240 Even without controlling for phylogenetic structure/age, the effect sizes for loss and clustering on ISD 241 are clearly small. In most contexts, this would cast doubt on their biological relevance. However, our 242 motivating question is to discover what mechanism might have a small enough effect size such that its 243 steady action could continue for billions of years before saturating. We therefore next use simulations

247
We simulate a set of Pfams initialized with the distribution of ISD or clustering values observed in 248 young animal Pfams, and subject them to differential loss at rates given by our fitted regression 249 models with breakpoints. This produces directional trends in ISD and clustering on the pertinent 250 timescale (Fig. 3C, 3D). The relationship between ISD (

289
Unfortunately, mean number of instances per genome is expected to depend on genome annotation 290 quality, which is likely to vary across the animal species in our dataset. We have at least partly 291 addressed this problem by restricting analyses to genomes deemed "Complete".

292
In agreement with our loss rate results, Pfams with a higher mean number of instances per genome

318
This study implements high-throughput phylostratigraphy; while we implemented a number of novel 319 measures to reduce the frequency of false positive and false negative homologs, our dataset will not 320 be free of error. We focus on Pfam domains, for which homology detection is performed using the

324
and Eddy 2020). However, Pfam age is not a central focus of our analysis. The potential issue here is 325 rather that homology detection bias might explain why Pfams with high ISD appear to be lost more 326 often and have fewer instances per genome (a similar bias for clustering has not been documented).

327
However, the known difficulty in detecting homology given higher ISD cannot explain our observed 328 non-monotonic dependence of loss rates and instance number on ISD.

330
To detect as many true homologs as possible, instead of relying on existing genome annotations, we 331 scan all six reading frames for hits. Each Pfam was as a result included in a mean of 7.35 and a 332 median of 4 new animal species, out of 343. However, HMMER can produce false positives (Pearson,

333
Li, and Lopez 2017); while use of curated Pfam seeds might reduce this problem, it will not 334 completely eliminate it. We minimized impact by co-inferring false positive (or horizontally 335 transferred) hits, with a higher rate for those found in our 6-frame scan, at the time of our inference of 336 loss rate. Overall, these data quality measures resulted in a substantial decrease in the mean loss rate 337 of approximately 14%. Another key quality control was to remove some Pfams altogether as potential 338 contaminants not from the genomes under study. Our dataset and methodology represent a new 339 standard for work on protein domain evolution, which, while imperfect, has resulted in a higher 340 quality dataset for further analysis. Nevertheless, given the residual likely presence of false positive 341 and false negative Pfam hits, we expect our results on total loss of a Pfam to be more robust to genome annotation issues than the number of Pfam instances per genome. However, we note that our 343 instance number and loss rate results qualitatively agree.

345
While protein domains no doubt experience idiosyncratic selection pressures associated with their 346 specific structures and functions within organisms, our findings suggest a more universal tendency in 347 which Pfams with ISD and clustering values close to the 'optimum' are differentially retained in the 348 long term. A candidate for a universal cause is that the same biophysical properties that promote 349 correct folding also promote toxic aggregation, posing a universal dilemma for proteins (Bucciantini

361
The current study does not directly rule out directionality in descent with modification, which might

373
However, while selection among alternative alleles of the same domain (i.e. descent with 374 modification) is often the default explanation for molecular evolution trends ( fig. 1 top), we provide 375 evidence for selection acting on a higher level, among protein domains, highlighting the importance 376 of differential retention and diversification of domains as an evolutionary force in protein sequence 377 evolution ( fig. 1 middle and bottom). Selection acting on multiple different levels to affect a trait does 378 not necessarily result in conflict between the levels (Lewontin 1970). This higher level is analogous to 379 the dependence of speciation and extinction rates on the traits of species (see, for example (Aguilée et 380 al. 2018)). An interesting parallel to our findings is the work of Heim et al. (2015), who found that 381 long-term trends do not seem to be caused by directional selection, but rather by the differential 382 diversification of certain clades. Here we find that the same broad category of explanation applies to 383 long-term trends in protein properties as a function of age since they originated.

385
The study of long-term directional change in evolution, and the level of selection at which it occurs, 386 has historically focused on species-level traits such as morphology. But there are several advantages 387 of conducting such studies on proteins. Firstly, because Pfam domains are easily identified using 388 HMMER, they are highly tractable for use in the study of higher levels of selection. In particular, it is 389 easier to count how many instances there are of a Pfam in a genome than it is to identify all species in 390 a clade.

392
Secondly, it is impossible to distinguish between e.g. high extinction rates vs. low speciation rates 393 from observations of extant species distributions alone (Louca and Pennell 2020). An inherent 394 difficulty is that we cannot observe extinct species once they are gone; fossil data is required to 395 provide additional information (Mongiardino Koch, Garwood, and Parry 2021). However, when a 396 domain undergoes complete loss in one lineage, we can directly infer that loss through comparison to 397 other lineages in which it was not lost. Sophisticated methods developed to study speciation and 398 extinction rates (Fitzjohn 2012) could be adapted to take advantage of this during the study of domain 399 duplication and loss rates. Future studies could examine not just how loss rates depend on trait values, 400 but also account for systematic variation in duplication and loss rates among clades, or over time. The 401 study of protein domain diversification and loss is interesting not only to advance our understanding 402 of the evolution of the proteome, but also as a powerful new study system for those interested more 403 broadly in the causes of long-term directional trends in evolution, and in levels of selection.

409
We begin with the Pfam phylostratigraphy dataset of James et al. (2021). Briefly, for a species to be 410 included in this dataset, a "Complete" annotated genome had to be available from NCBI, and the 411 species had to be present in Timetree (Hedges, Dudley, and Kumar 2006). We further restricted this 412 dataset to those Pfams that were present in a minimum of 2 animal species, and for which both ISD    branch j of length " equal to1 − (− ! " ). The likelihood that Pfam i is completely lost on 462 branches j and retained in at least one copy on branches k, given total loss rate ! , is:

463
(1) 464 We first provisionally assign losses by assuming a Pfam arose on the branch basal to the 465 monophyletic clade that includes all species with the Pfam, and is lost on the branches that explain the 466 current distribution with fewest number of losses, i.e. we use Dollo parsimony (see figure 5). We then 467 use Newton's method to find the loss rate that maximizes the log likelihood, initialized at the loss rate

483
(1). We then refine this coarse Dollo-parsimony based estimate through maximum likelihood 484 techniques that allow for false positive/horizontally transferred instances (Equation 2).

486
Next, to remove false positive / horizontally transferred Pfam presence data, we use a recursive 487 empirical Bayesian procedure. We calculated two prior probabilities: # , that the annotated presence 488 of a Pfam in a species or clade is a false positive, and ! , that a Pfam presence discovered only by our 489 six reading frame scan of intergenic sequences is a false positive. These were initialized at 10 -6 and 490 0.2, respectively.

492
If $ is the set of species annotated as having the Pfam from CDS observations, and ! is the set of 493 species annotated as having the Pfam only by us from the InterProScan intergenic search, the 494 likelihood of observing the data if the species sets $ and ! are false positives is:

495
(2) 496 where ! is re-estimated by maximum likelihood for each species set $ and ! , and absolute 497 magnitude notation indicates the number of species in the set.

499
To find the sets $ and ! that maximize likelihood, we first group Pfam presence observations into

514
Our procedures to include presence data when Pfams were previously unannotated, and to exclude 515 presence data as described above, had a noticeable quantitative impact on our data. The combined 516 effect of these quality control measures caused the mean and median loss rate to fall from 0.0075 and 517 0.00091 to 0.0065 and 0.00090. The mean and median number of Pfam instances per species rose 518 from 5.9 and 2.2 to 6.2 and 2.4.

520
For some young Pfams (appearing after divergence with plants), our procedures to reduce false 521 negative and false positive errors in presence annotation within the animal species tree lead us to re-522 estimate Pfam age. All of the Pfam loss rate and age data is available online at: https://github.com/j-e-  distribution of hydrophobic residues, with higher values indicating that hydrophobic residues tend to 540 be arranged in clusters along the sequence, and lower values indicating that hydrophobic residues are 541 more evenly distributed along the sequence than would be expected by chance. As for ISD, per Pfam 542 scores were then calculated as the mean over all instances of each Pfam in our animal dataset.

544
Statistics All plotting and statistical analysis was performed in R. Linear regression models that incorporate 546 breakpoints were implemented using the segmented package. We used the packages lme4, MuMIn, 547 and segmented to conduct models, and ggplot2 to create plots. Prior to linear regression, we 548 transformed loss rate using an optimized two parameter Box-Cox transform (λ = 0.0054, λ2 = 6.1e-06) 549 calculated using the R package geoR. We transformed mean instance number using a Box-Cox 550 transform (λ= -0.61). We transform after rather than before taking the mean because the natural units 551 for scoring impact on the proteome (Fig. 1 middle)

555
Throughout this work, we report the adjusted R 2 values for our linear regression model results, as 556 implemented in the programming language R, which uses the Wherry formula (Wherry 1931). This 557 R 2 accounts for the number of explanatory variables included in the model. Nested models were 558 compared using the anova function in base R. We conducted mixed effects models in R using the lmer 559 package, and calculated corresponding pseudo R 2 values using the R package MumIn. This package 560 returns marginal R 2 values, which can be interpreted as the variance explained by fixed effects in the 561 model, and conditional R 2 values, which can be interpreted as the variance explained by both the fixed 562 and random effects in the model (Nakagawa and Schielzeth 2013). R scripts and data required to 563 replicate our analysis are available at: https://github.com/j-e-james/DomainLossRate.

566
We performed simulations to assess the biological relevance of our inferred loss rates, using custom 567 python code available at https://github.com/j-e-james/DomainLossRate. We initialize these 568 simulations by sampling, with replacement, the Pfams that we estimate to be younger than 100 MY 569 old, to produce an initial set of 5000 Pfams, each represented by its ISD and clustering values. We 570 then assign each Pfam the loss rate predicted from either its ISD value, or its clustering value, on the 571 basis of our corresponding regression model. For each Pfam, we then sample the time to a single loss 572 from an exponential distribution, with scale parameter = 1/loss rate. We then track the mean and 573 standard deviation of the population of survivors over time.

577
We thank Cecile Ané for helpful discussions that prompted a more complex project that we 578 abandoned in favor of this simpler approach. We thank Nathan Aviles for helpful discussions about 579 fancier statistical approaches that we didn't use. We thank all of the participants at the 2019