Adaptive protein evolution through length variation in short tandem repeats

Intrinsically disordered protein regions are of high importance for temperature sensing and immune responses in plants. Tracts of identical amino acids accumulate in these regions and can vary in length over generations due to expansions and retractions of short tandem repeats at the genomic level. However, little attention has been paid to what extent length variation is shaped by natural selection. By environmental association analysis on 2,514 length variable tracts in 770 whole-genome sequenced wild Arabidopsis thaliana we show that length variation in glutamine and asparagine amino acid homopolymers, as well as in interaction hotspots, correlate with local bio-climatic habitat. We determined experimentally that the promoter activity of a light-stress gene depended on polyglutamine length variants in a disordered transcription factor. Our results show that length variations impact protein function and are substantially shaped by natural selection. Length variants modulating protein function at a global genomic scale has implications for understanding protein evolution and eco-evolutionary biology.


ELIP1 promoter activation depends on the length of poly-Q tracts in TCP14
102 Several members of the TCP transcription factor family had STR-encoded homopolymers 103 predicted to be in IDRs and/or in disordered interaction regions (S. Table 3 cotyledons. This repression decreases upon illumination when the nuclear localization of OR is 112 diminished and accumulation of TCP14 in the nucleus derepresses chloroplast biogenesis during 113 hypocotyl de-etiolation by increased ELIP1 transcription (15). We found TCP14 to contain poly-114 Q tracts that overlapped predicted disordered PPI regions (S. Figure 1) and that varied in length 115 in different Arabidopsis accessions. We experimentally tested the functional relevance of poly-Q 116 tracts in TCP14 by quantifying the ability of the different homopolymer length variants to activate 117 transcription of the ELIP1 gene. The two poly-Q tracts we focused on overlap with predicted 118 disordered PPI sites (Figure 2A), and we envisioned that length variation in these poly-Q tracts 119 would influence the ability of TCP14 to activate ELIP1. We tested our hypothesis using TCP14 120 variants from accessions that only differed in these two tracts in the TCP14 protein sequence. The   Qn-Qn-GFP variants to achieve comparable expression levels for the fusion proteins. TCP14-7Q-133 3Q-GFP, TCP14-4Q-6Q-GFP, TCP14-7Q-6Q-GFP and TCP1-4Q-3Q-GFP localized to the 134 nucleus in N. benthamiana leaf cells ( Figure 2B). As negative controls, we measured the luciferase 135 luminescence signals in assays containing the constructs without adding estradiol, as well as assays 136 without ELIP1p:LUC ( Figure 2C). Our results show that a simultaneous length change in both 137 tracts produced significantly different luciferase outputs (7Q-6Q vs. 4Q-3Q: Welsch T-test P 138 value: 0.007 and 4Q-6Q vs. 7Q-3Q: Welsch T-test P value: 0.03), and that the last tract yielded a 139 significant difference only when the first tract was 7Q (7Q-6Q vs. 7Q-3Q: Welsch T-test P value: 140 0.0008). Changes in only the first tract did not produce any significant differences ( Figure 2C). 141 We conclude that both tracts are involved with ELIP1 promoter activation, but that the last tract is 142 more important. These results show that natural allelic variations in the poly-Q tracts alters ELIP1 143 promoter activation, and given the important role of ELIP1 in light stress toleration, this most 144 likely reflects the ability of different accessions to respond to light stress.

146
Homopolymer tract lengths vary along temperature-related gradients 147 Next, we explored the relationship between natural allelic variation in protein coding STRs and  File 4). As many of these variables were highly correlated, we decomposed the information to 152 seven principal components (PCs) that together explained 80% of the total variation (S. Figure 2). 153 The top correlations between the individual environmental variables and PC axes 1-7 are shown 154 in S. Table 5, and the accessions' values along the axes are available in S. shown in S. Figure 3). We regressed the seven environmental PC axes on our estimates of allelic 156 variation in 2,514 protein coding STRs, one site and one environmental axis at a time (S. File 5). 157 The accessions' original sampling location colored by their position along PC1 and PC2 are shown 158 in Figure 3A. As a control, we used mock STR genotypes (S. File 6). We separately treated STR 159 alleles as continuous or categorical values to assess if linear or non-linear effects best explained 160 the environmental variation. In 24% of tests with multiallelic STRs, non-linear effects better 161 explained the environmental variation compared to linear effects. For further analysis, we kept the  Table 5). At the highest confidence levels (i.e.

165
lowest P-values), associations with PC axis 2 were more common (Figure 3B, S. File 5). PC axis 166 2 best represents variation in net primary production and temperature seasonality (S. Table 5). 167 None of the mock genotypes produced significant associations with any PC axes after multiple test 168 (Bonferroni) correction (S. File 6). With regard to specific homopolymer tract types, poly-N, poly-169 Q, and poly-E tracts were overrepresented in producing higher confidence associations with PC 170 axis 1, and poly-E and poly-S tracts were driving the higher confidence associations with PC axis 171 2 ( Figure 3C). Notably, the top candidate proteins containing poly-N, poly-Q, poly-E or poly-S 172 tracts with environmentally associated length variation included proteins known to be involved in 173 regulating responses to abiotic and biotic stress (S. Figure 4). Note that we cannot completely rule 174 out that the associations are confounded with the population structuring along the environmental 175 gradients (S. Figure 5). However, when controlling for the pairwise genetic similarity with 176 putative neutral markers (S. Figure 6), P-value distributions were right-skewed (S.   protein-protein binding propensities, making it likely that differential downstream interacting 216 ELIP1 promoter activation was caused by an altered protein-protein interaction between TCP14 217 and OR, as OR is known to bind TPC14 and represses its ability to activate the ELIP1 218 promoter(15). Further experiments on TCP14 would be necessary to fully test this hypothesis, and 219 follow-up experiments on additional candidate tracts, similar to Jung et. al (2020), will be key to 220 understanding the extent of the phenotypic consequences of protein coding STR length variation.

221
Here, we have demonstrated that length variation in coding STRs is associated with various

254
Disorder and disordered binding site predictions 255 We downloaded A. thaliana reference (TAIR10) proteome disorder consensus predictions from   278 We previously scored the number of units in each (diploid) STR allele across the Arabidopsis 279 accessions(9). An STR-specific variant caller, Haplotype inference and phasing for Short Tandem 280 Repeats (42) was used to call the variants. We used the same STR unit counts in this study, but for 281 an expanded set of accessions (770), and only STRs that encoded homopolymer tracts. We 282 constructed the putative neutral SNP matrix for the 770 accessions as in our previous study (9). In 283 brief, we drew random, common (major allele frequency < 0.9) intergenic SNPs that we pruned 284 for linkage disequilibrium using the 'locate_unlinked' function of the Python package scikit.allel 285 (parameters: size = 1000, step = 20, threshold = 0.1)(43). From these SNPs, we calculated the 286 pairwise correlation between accessions and standardized the output matrix.

288
Bioclimatic variables 289 We retrieved the full set of environmental variables retrieved by Ferrero-Serrano and Assmann     instance between STRs and predicted IDRs), we would not expect a large difference between 325 ratios. As such, we constructed contingency tables and used the 'fisher_exact' function of the 326 Python module 'statsmodels' to calculate odds ratios and P values from contingency tables (49).

327
The odds ratio would be  Table 7). Here, we were only interested in positive dependence and performed one-sided 335 Fisher's exact test (with the option 'greater' in the 'fisher_exact' function). We also used one-336 sided Fisher's exact tests to address if there was dependence between IDRs or predicted IDR 337 binding sites and associations with gene expression (see legend S. Table 4).

339
We used ordinary linear square regression to test if the environment of Arabidopsis accessions

369
The luciferase experiment was performed 12 times per construct on one day. We used a Welch T-370 test to test if the differences in mean luciferase outputs was statistically significant.                                    ratios (y-axis) resulting from one-sided Fisher's exact test of dependence between the type of amino acid and its 614 association with PC axis 1 (left panel) and PC axis 2 (right panel). The x-axis show the confidence of the association 615 (-log10 P value) with the PC axis. The odds ratios of association with the other PC axes are shown in S. Table 7. 616