The nature of intraspecific genome size variation in taxonomically complex eyebrights

Genome size (GS) is a key trait related to morphology, life history, and evolvability. Although GS is, by definition, affected by presence/absence variants (PAVs), which are ubiquitous in population sequencing studies, GS is often treated as an intrinsic property of a species. Here, we studied intra- and interspecific GS variation in taxonomically complex British eyebrights (Euphrasia). We generated GS data for 192 individuals of diploid and tetraploid Euphrasia and analysed GS variation in relation to ploidy, taxonomy, population affiliation, and geography. We further compared the genomic repeat content of 30 samples. We found considerable genuine intraspecific GS variation, and observed isolation-by-distance for GS in outcrossing diploids. Tetraploid Euphrasia showed contrasting patterns, with GS increasing with latitude in outcrossing Euphrasia arctica, but little GS variation in the highly selfing Euphrasia micrantha. Interspecific differences in GS genomic repeat percentages were small. We show the utility of treating GS as the outcome of polygenic variation. Like other types of genetic variation, such as single nucleotide polymorphisms, GS variation may be increased through hybridisation and population subdivision. In addition to selection on associated traits, GS is predicted to be affected indirectly by selection due to pleiotropy of the underlying PAVs.


Summary 27
• Genome size (GS) is a key trait related to morphology, life history, and evolvability. 28 Although GS is, by definition, affected by presence/absence variants (PAVs), which are 29 ubiquitous in population sequencing studies, GS is often treated as an intrinsic property 30 of a species. Here, we studied intra-and interspecific GS variation in taxonomically 31 complex British eyebrights (Euphrasia). 32 • We generated GS data for 192 individuals of diploid and tetraploid Euphrasia and 33 analysed GS variation in relation to ploidy, taxonomy, population affiliation, and 34 geography. We further compared the genomic repeat content of 30 samples. 35 • We found considerable genuine intraspecific GS variation, and observed isolation-by-36 distance for GS in outcrossing diploids. Tetraploid Euphrasia showed contrasting 37 patterns, with GS increasing with latitude in outcrossing Euphrasia arctica, but little GS 38 variation in the highly selfing Euphrasia micrantha. Interspecific differences in GS 39 genomic repeat percentages were small. 40 48 INTRODUCTION dispensable chromosome. The above types of genetic variation can be subsumed under the 82 term presence/absence variants (PAVs), a type of structural genomic variation, and may be 83 detectable by methods for estimating GS, such as flow cytometry. Modern protocols using flow 84 cytometry with appropriate reference standards, and following best practice approaches, can be 85 accurate and highly precise (Greilhuber et al., 2007;Pellicer et al., 2021) and reveal genuine 86 intraspecific variation that can be confirmed by genome sequencing. Such sequencing has also 87 been used to reveal that repeat differences can be useful genetic markers, including Our study considers such variation as polygenic, meaning heritable, and with a value affected 92 by multiple independent loci in the genome (Figure 1). Many polygenic traits are known, with the 93 most renowned example being human height (Fisher, 1919). Loci underpinning polygenic 94 variation need not be protein-coding genes, but may also involve non-coding sequences 95 including introns, promotors, trans elements, or genomic repeats. Loci underpinning a polygenic 96 trait may differ in their effect sizes, as shown by Koornneef et al. (1991)

for flowering time in 97
Arabidopsis thaliana (see also Napp-Zinn, 1955   mercaptoethanol. An additional 1 mL of buffer was added to the homogenate, and then this was 156 filtered through a 30 μ m nylon mesh to discard debris. Finally, the sample was stained with 157 100 μ l of PI (1 mg/mL, Sigma) and incubated for 20 min on ice. For each accession analysed, 158 one sample was prepared, and this was run three times on the flow cytometer. The nuclear 159 DNA content of each sample run was estimated by recording at least 5,000 particles (c.1,000 160 nuclei per fluorescence peak) using a Cyflow SL3 flow cytometer (Sysmex-Partec GmbH, 161 Munster, Germany) fitted with a 100-mW green solid-state laser (Cobolt Samba). Resulting 162 output histograms were analysed using the FlowMax software (v. 2.9, Sysmex-Partec GmbH) 163 for statistical calculations. We report only GS estimates for samples where the coefficients of 164 variation (CV) of the sample and standard peaks in the flow histogram were less than 5% (see 165 Supporting Information Figure S1a and b for illustrative histograms of each ploidy level). 166 Where differences in GS were detected within a species, combined samples containing at least 167 two accessions were prepared following the same procedure as for individual runs. Genuine 168 intraspecific variation was confirmed where multiple fluorescence peaks were identified from the 169 combined run. 170 Throughout the paper we give 1C values in pg, where necessary converting published GS 171 values reported in Gbp to pg using a conversion factor of 0.978 following Doležel et al. (2003). 172

Repeat content variation 173
Sequence data generation. We used a combination of existing and newly generated genomic 174 sequencing data to investigate repeat variation in 31 samples comprising seven diploids and 23 175 tetraploids of Euphrasia plus Bartsia alpina as an outgroup. We downloaded short-read Illumina 176 data from the sequence read archive (SRA, see Supplementary Information Table S2). These 177 included 18 samples in total, including 12 tetraploid samples from the isolated island of Fair Isle 178 (Shetland, Scotland) generated for the study of Becher et al. (2020), which allowed us to study 179 genomic repeat profiles in sympatric populations. This dataset also included a total of six 180 representative diploid and tetraploid species from elsewhere in Britain. 181 We supplemented this previous data with newly generated sequence data from eleven 182 additional UK samples representing a wider range of species and geographic locations, will always be more of the shared sub-genome than of the sub-genome restricted to tetraploids. 200 To minimise mate overlaps of short insert sizes, each read was trimmed to 100 nucleotides. 201 Further, we only used reads where at least 90 nucleotides had phred quality scores > 30. To 202 analyse the genomic repeat content, we excluded clusters annotated by RE as plastid DNA or 203 Illumina process controls. Our numbers thus deviate slightly from RE's automatic annotation. 204

Statistical analyses. Most GS analyses were conducted across all individuals or populations. 205
However, for E. arctica, E. anglica, and E. micrantha, where sampling covered most of their 206 large geographical range in Britain, we also analysed data from each species separately. All 207 analyses were done using R version 3.6.1 (R Core Team, 2019). For analyses of variance 208 (ANOVAs) we used the function aov(). To test whether sample means of GS were significantly 209 different, we used the function t.test(), with Bonferroni correction in cases of multiple testing. To 210 analyse how GS variation was partitioned by ploidy, taxon, and population we used ANOVA. To 211 test the effect of 'species', we then re-ran the ANOVAs without hybrids (Table 1). To test the 212 significance of GS variance differences between species pairs, we divided the population mean 213 genome sizes by each species' grand mean (centring) and then applied an F test (R function 214 var.test()). 215 We tested the association between GS and latitude using a mixed effect model (R package 216 nlme, function lme()). For species analysed separately, we used linear models. We carried out 217 Mantel tests to assess the relationship between geographic distance and GS difference across 218 individuals. We used linear models to assess the differences in relative abundance of individual 232 repeat types between ploidy levels. 233 To investigate a possible association of individual repeat clusters with GS, we used nine 234 tetraploid samples for which we had both an estimate of the population average GS and repeat 235 data (samples marked with asterisks in Supporting Information Table S2). We used the function 236 cor.test() to assess the significance level of any associations between the genome proportion of 237 each individual repeat cluster and population average GS. apart), we also found considerable GS variation between populations in close proximity in E. 258 foulaensis (0.99 pg and 1.25 pg, 2.5 km apart) and in E. confusa (1.14 pg and 1.32 pg, same 259 population). In all cases, tests to distinguish genuine intraspecific variation from technical 260 artefacts confirmed the GS differences reported between individuals (see Methods and 261 Supporting Information Figure S1c and d). Generally, we found wider GS ranges in taxa with 262 more populations sampled. A notable exception was E. micrantha (GS range 1.14-1.21 pg from 263 17 individuals analysed from 9 populations, up to 962 km apart), which is discussed below. 264 In ANOVAs, the vast majority of the overall GS variation was explained by 'ploidy', while 'taxon' 265 and 'population' accounted for smaller significant fractions ( Table 1). 'Population' accounted for 266 considerably more variation than 'taxon' -3 or 8 times, depending on whether hybrids were 267 included in the analysis or not. This difference is due to the few data available for most hybrids 268  We confirmed a strong relationship between ploidy and latitude (ANOVA F 1,190 =18.79, 305 p=2.4⨉10 -5 ), with diploids generally limited to lower latitudes (being particularly abundant in 306 southern England, Supporting Figure S2) while tetraploids extend to the very north of Britain. 307 However, there was no significant association between GS and latitude within ploidy levels 308 (treating taxon as a random effect, t=0.63, p=0.53). We then analysed the data for each of the 309 three widely sampled species individually using linear models (Figure 2c) To assess how well genomic repeat profiles in samples from different populations correspond 333 with species identity based on morphology, we used two unsupervised machine learning 334 techniques: hierarchical clustering and principal component analysis (PCA). We focussed our 335 analyses on the largest 100 repeat clusters, which together account for approximately 50% of 336 each genome, no matter if diploid or tetraploid. Each smaller repeat cluster had a genomic 337 proportion of < 0.7% in each sample. Hierarchical clustering resulted in a tree that grouped 338 samples largely by ploidy, rather than species identity, with the exception of (i) a sample of the 339 Austrian alpine E. cuspidata (CU), a species considered diploid, which grouped as sister to the 340 tetraploids, and (ii) tetraploid E. arctica from Cornwall (AR5), which grouped as sister to all other 341 Euphrasia samples (Figure 3a  repeats within the genome. CL22, in turn, had been annotated as CRM, which is a type of 376 Ty3/Gypsy chromovirus retrotransposon that commonly targets centromeric sequences (Nagaki 377 et al., 2003;Neumann et al., 2011). 378 Among all 17 broad repeat types identified by RE (see Supplementary Information Table S2),  379 we found significant differences between ploidy levels for two. Diploid genomes contained 380 higher average proportions of 45S rDNA (4.9%) than tetraploids (2.0%, F 1, 28 =20.4, p corr <0.001), 381 with the genomic proportion ranging from 1.7% to 5.7% in diploids and from 0.8% to 3.4% in 382 tetraploids. Tetraploids contained, on average, more Ty1/copia Ale elements (0.15%) than 383 diploids (0.09%, F 1,28 =11.18, p corr =0.018). While our PCA approach had identified some 384 satellites as highly differentiated in copy number (see above), differences over all satellites were 385 not significant. This is because there was differential enrichment in the ploidy levels for CL1 386 versus CL2 and CL5 (Figure 3c). Overall, there is comparatively little differentiation in genomic 387 repeats between the ploidy levels. In this study, we investigated the nature of GS variation across taxonomically complex diploid 404 and tetraploid British Euphrasia. We complemented an extensive population survey of GS 405 variation with an analysis of genomic repeat composition from seven diploids and 23 tetraploid 406 Euphrasia. Overall, we find notable GS variation between populations of the same species, 407 representing a wide range of genuine intraspecific GS variation. Within ploidy levels there is a 408 continuum of GS variation, though ploidy levels have discrete GS ranges. These differences 409 within and between ploidy levels are not attributable to large copy number changes of an 410 individual DNA repeat, but rather to multiple segregating PAVs. Here, we first discuss the link 411 between GS variation and population dynamics and speciation history, highlighting how GS is 412 shaped by many similar processes as population-level sequence variation. We then consider 413 the landscape of repeat dynamics and the potential association with Euphrasia polyploid 414 genome history. Finally, we consider the wider implications of framing GS variation in a 415 population genetic framework. high frequency of hybridisation in Euphrasia may lead to increased levels of structural 479 rearrangements due to ectopic recombination, which may be more common between 480 heterozygous genomic repeats (Morgan, 2001). 481 Between ploidy levels of Euphrasia, we found allotetraploids had an 11% lower mean GS 482 compared with the value predicted from doubling the mean GS of diploids. This discrepancy 483 may have originated from genome downsizing, commonly seen during re-diploidisation. It may 484 also be explained by the fusion of two diverged diploid genomes of different size, as seen in 485 allopolyploid Gossypium (Hendrix & Stewart, 2005) and such as E. arctica, did not reveal a signal of divergence in repeat content. This is surprising not 502 just because of their morphological distinctiveness, but their difference in outcrossing rate, with 503 theory predicting that the copy-number and equilibrium frequency of transposable elements 504 depends on the level of selfing in a population (Morgan, 2001;Dolgin & Charlesworth, 2006). A 505 likely explanation is that the shift to high-selfing in E. micrantha is relatively recent compared to 506 the time it takes for the genomic repeat content to reach equilibrium level. 507 508

Evolution of genome size variation 509
The continuous GS variation within and between Euphrasia species, coupled with these 510 differences likely being a product of segregating PAV across the genome, underlines the 511 polygenic nature of GS variation. Regarding GS differences as the result of segregating (i.e. 512 genetic) variants blurs the classic distinction between genotype and nucleotype, where 513 "nucleotype" refers to "conditions of the nucleus that affect the phenotype independently of the 514 informational content of the DNA", essentially identical to GS (Bennett, 1971(Bennett, , 1977. Because 515 GS has been shown to be correlated with many traits including cell size, stomatal pore size, the 516 duration of cell division, and life-history differences (e.g. Šímová  Research on GS is somewhat decoupled from studies on sequence-based variation in 531 populations. We suggest future research into GS evolution should consider both patterns of total 532 GS and the population processes underlying this variation. In addition to furthering our 533 understanding of intraspecific GS diversity in Euphrasia and other plant groups, answers to 534 these questions will also improve our understanding of GS evolution between species and 535 across phylogenies, which starts at the population level.  Charlesworth for extensive comments on an earlier version of the manuscript. 549 550 551 552

Author Contributions 553
• HB analysed the data with input from MRB and ADT 554 • ADT, CM, HB, and MRB collected samples. 555 • CM confirmed species identifications. 556 • ADT and IJL designed the study. 557 • RFP, JP, and IJL generated the GS data. 558 • HB and ADT wrote the manuscript. 559 • All authors read and commented on the manuscript. 560 561

Data Availability 562
The whole genome-sequencing data are available from the sequence read archive, Bioprojects 563 PRJNA624746 and PRJNA678958. The scripts and data required to replicate our results are 564 available from GitHub, repository: zzzzzzz (to be added upon acceptance).