Evidence for natural selection and barrier leakage in candidate 1 loci underlying speciation in wood ants 2 3

loci underlying speciation in wood ants 2 3 Kulmuni J1,2*, Nouhaud P1, Pluckrose L1#, Satokangas I1#, Dhaygude K1, Butlin RK3,4 4 5 1) Centre of Excellence in Biological Interactions, Department of Biosciences, 6 POB 65, FI-00014, University of Helsinki, Finland 7 2) Tvärminne Zoological Station, University of Helsinki, J.A. Palménin tie 260, 8 FI-10900 Hanko, Finland 9 3) Department of Animal and Plant Sciences, Alfred Denny Building, University 10 of Sheffield, Western Bank, Sheffield, S10 2TN, UK 11 4) Department of Marine Science, University of Gothenburg, S-405 30 12 Gothenburg, Sweden 13 #) Authors contributed equally. 14

However, loci fixed differently in males, despite shared hybrid history and variable in 198 females are expected to experience the strongest antagonistic selection in the current 199 population and offer the best opportunity to test for selection acting on barrier loci. 200 We assumed that the allele fixed in F. aquilonia-like males represents the parental F. The mean FST estimate between the F. aquilonia-like and F. polyctena-like lineages 205 was 0.14 (s.e. = 4.66´10 -4 ) between the males, and 0.07 (s.e. = 2.27´10 -4 ) between the 206 female pools. We found 711 SNPs (0.43 % of the total) alternatively fixed between 207 7 males of the F. aquilonia-like and F. polyctena-like lineages, but no fixed differences 208 between the female pools. As expected under antagonistic selection, SNPs fixed 209 differently in males were polymorphic and had mean FST of 0.34 (s.e. = 5.83´10 -3 , 210 Supplementary Figure 1) in the females. The alternatively fixed SNPs between the 211 male pools represent genomic regions of high differentiation between the two male 212 lineages despite their shared hybrid history, and thus mark candidate barrier loci that 213 are predicted to contribute to reproductive isolation and speciation. These 711 SNPs 214 fell into 610 assembled contigs (cumulative size: 2.09 Mb, 0.94% of the genome), 215 whose lengths vary between 515 -23,815 bp (mean = 3430 bp). Seventy-eight 216 contigs had more than two barrier SNPs (maximal number of barrier SNPs on a single 217 contig was 5). 218 219 We were interested in how these candidate barrier loci locate in the genome, because 220 this can provide information about the origin of reproductive isolation and play a role 221 in how efficient the barrier loci are in maintaining divergence despite hybridization 222 and gene flow. If candidate barrier loci are co-localized in the genome their origin and 223 maintenance could be explained by selection on a few genomic regions that could, for 224 example, harbor structural re-arrangements. Alternatively, candidate barrier loci 225 scattered across the genome could effectively reduce gene flow over much of the 226 genome. We used the genome of the closely-related Formica exsecta [17] (277 Mb,227 14,617 scaffolds,N50 = 997,654), which is better assembled compared to the poolseq 228 genome assembly, to map the location of candidate barrier loci. F. exsecta is 229 estimated to have diverged from the parental species of our hybrid population 5 Mya 230 [18]. We used the 610 contigs containing candidate barrier SNPs as queries and 231 searched the F. exsecta genome assembly using BlastN. If our query scaffold was 232 larger than 5, 000 bp we used +/-2, 500 bp surrounding the candidate barrier SNP as 233 query. Our candidate barrier loci map to a total of 134 F. exsecta contigs, with a 234 cumulative coverage of 56.2 % of the F. exsecta genome, both of which are 235 significantly less than for a similar number of random SNPs in our data set 236 (Supplementary Figure 2). Haploid chromosome number varies between 26 and 28 in 237 Formica [19]. Assuming evenly-sized chromosomes our candidate barrier loci are 238 likely to be situated in over 10 chromosomes. 239 240 8 Candidate barrier loci are predicted to interact at the protein level 241 Next, we were interested in the possible functions of our candidate barrier loci, as 242 they can be informative about the putative selection pressures and mechanisms of 243 hybrid male breakdown. We annotated each of the 610 contigs containing candidate 244 barrier SNPs, first extracting the transcripts of the genes located in those regions from 245 the closely related F. exsecta. Out of 571 candidate barrier regions (see Material and 246 Methods; each region was defined as 5, 000 bp around the candidate barrier SNP) that 247 we were able to anchor in F. exsecta (all blastn e-values < 10 -141 ), we recovered 590 248 unique genes. Sixty-three of the candidate genomic regions did not contain any 249 annotated gene. The mean number of genes per candidate genomic region was 1.4, 250 and the maximum was six. For each gene, its F. exsecta transcript sequence was 251 recovered and blasted (blastx) against Drosophila melanogaster r6.15 proteins 252 recovering 317 genes, whose homologs reside in candidate barrier regions 253 (Supplementary Table 1). 254 255 The genetic model for hybrid breakdown suggests inviability and sterility are caused 256 by dysfunction and negative epistatic interactions between diverged genes from the 257 parental species. One possible mechanism that can result in epistasis are molecular 258 interactions. We thus asked if there was any evidence for molecular interactions in the 259 form of protein-protein interactions between the gene products in candidate barrier 260 regions. For this we used the recovered D. melanogaster proteins and the STRING 261 database [20], which contains information on known and predicted protein-protein 262 interactions. We found that candidate barrier regions harbor proteins that show 263 significantly more interactions with each other than a random set of proteins drawn 264 from the D. melanogaster genome (PPI enrichment p-value = 9.65´10 -13 , Figure 2). 265 Out of 268 genes annotated and having information in the STRING database, 167 266 genes fall into a single interaction network, with evidence for protein-protein 267 interactions in D. melanogaster or other species. Seventy-four of the network genes 268 had evidence for interactions which were experimentally determined or obtained from 269 curated database (PPI enrichment p-value = 0.03). However, randomly drawing 711 270 SNPs from our pooled sequencing data 1000 times showed that similar enrichment in 271 protein-protein interactions (i.e. network) could be retrieved for a random set of SNPs 272 as well. This suggests either a bias towards functional and interacting sets of genes in 273 9 our total SNP dataset or a bias in our method towards interacting sets of genes arising 274 due to conservation (only genes conserved between ants and D. melanogaster can be 275 included in the analysis). However, this does not mean the interactions which were 276 documented would be unreliable. In summary, our results suggest interactions among 277 the genes in candidate barrier regions, but these interactions are not significantly more 278 abundant than for random sets of SNPs in our data.

289
Next, we were interested in the functions of genes within candidate barrier regions. 290 For this we used the D. melanogaster gene IDs. The top three biological processes 291 characterizing our candidate barrier genes are "developmental process" (p = 3.5´10 -292 18 ), "system development" (p = 3.5´10 -18 ), and "single-organism developmental 293 process" (p = 4.03´10 -18 ). Similarly, the top three molecular functions are "ion 294 binding" (p = 8.55´10 -6 ), "protein binding" (p = 3.67´10 -5 ) and "calcium ion binding" 295 (p = 8.74´10 -5 ). These biological processes and molecular functions were highly 296 significantly enriched when compared to the genome average of Drosophila 297 melanogaster. However, they were not significantly enriched if compared to a random 298 set of SNPs drawn from our data, as all of them are found in over 5 % of our 299 simulations. 300

301
Testing for signs of selection at candidate barrier loci in a natural population 302 We tested for selection during development by genotyping altogether 96 individuals 303 at early (larva) and 100 individuals at late (adult or late stage pupa) developmental 304 stages (Supplementary Table 2 Hybrid F . aqui l oni alike males Allele frequency estimates (poolseq) Hybrid F . pol yctenalike females

374
Our previous microsatellite-based study showed that introgressed alleles were favored 375 in females when heterozygous. This selection acted throughout the female lifetime. 376 Thus, we searched for signs of similar selection testing for increase in heterozygote 377 excess between early and late stages using FIS estimates. Indeed, there was a tendency 378 for more negative mean FIS values (i.e. greater excess of heterozygosity) in candidate 379 Hybrid F . pol yctenalike females Then, we tested for allele frequency changes in F. aquilonia-like males, expecting the 405 introgressed alleles to decrease in frequency during development as shown in our 406 previous microsatellite based study [12]. Allele frequencies were significantly 407 different between larval and adult developmental stages in candidate barrier SNPs 408 compared to the random SNPs (glmer, z-value= -2.98, p = 0.00286), the candidate 409 barrier SNPs showing more frequency change during development ( Figure 5). 410 However, the frequencies of the introgressed alleles increased from larval to adult 411 developmental stages in F. aquilonia-like males (Figure 4) Figures 3 and 4). Increase of 432 introgressed alleles in males is not likely to be caused by errors in male poolseq-based 433 allele frequency estimates either. All but one of the candidate barrier SNPs (33 out of 434 34 SNPs) that were initially fixed in both adult males and females of the F. with strongest evidence of this in the F. polyctena-like lineage [11,12]. Our current 552 dataset shows a trend of increasing heterozygosity excess at candidate barrier loci 553 during female development, but this is not statistically significant after removal of 554 two outlier individuals likely to be early generation hybrids. However, we find a 555 significant excess of heterozygosity at the candidate barrier loci compared to the 556 random loci. This excess heterozygosity is likely to be caused by selection either in 557 the current generation or earlier in population history. First, strong allele frequency 558 differences between mothers and fathers would result in high heterozygosity in 559 diploid females without the action of selection in the sampled generation. However, in 560 this scenario initial differentiation between sexes at candidate barrier loci was likely 561 caused by selection acting in earlier generations (see below). Second, selection that 562 favors heterozygotes may have already acted on the candidate loci before the larval 563 stage we sampled. Our previous study documented selection during development 564 using eggs and few days old larvae and compared them to adults, whereas here we use 565 1-2 week old larvae and adults. Thus, in the current study we are likely to miss any 566 signal of selection acting earlier in development. 567

568
We assumed the candidate barrier loci to represent recessive incompatibilities and loci 569 linked to them, thus expecting to find evidence of negative selection in the haploid 570 males [11,12]. However, we find the opposite with introgressed alleles increasing in 571 frequency at the candidate barrier loci in the F. aquilonia-like males during 572 development. This is consistent with natural selection favoring introgression in these 573 19 males. The magnitude of change is comparable to that observed within a generation in 574 other systems, e.g. in stick insects for the color pattern locus [32]. 575 576 Selection for introgression within a generation in males may be connected to patterns 577 we observed between years: A proportion of the barrier loci showed barrier leakage 578 resulting in increase of introgressed alleles in males sampled 10 years apart. These 579 rising frequencies of introgressed alleles in F. aquilonia-like haploid males between 580 years can be explained by at least three hypotheses, which are not mutually exclusive. 581 First, the barriers to gene flow can be gradually breaking down in F. aquilonia-like 582 males, which could explain both within-and between-year patterns. Barrier breakage 583 could be due to recombination and subsequent formation of compatible combinations 584 of barrier loci in males. Barriers for gene flow could also break down due to relaxed 585 selection. Under this model the haplodiploid systems (and X-chromosomes) can show 586 fluctuating allele frequencies at candidate barrier loci, comparable to those observed 587 in our data, if sexes had different allele frequencies to begin with. However, gradual 588 breakdown of barriers for gene flow can only explain changes observed between 589 years and not selection for introgressed alleles within a generation in males. Also, 590 allele frequency differences between sexes are not expected to be maintained without 591 selection over multiple generations. Second, a proportion of marker loci may have 592 been hitchhiking due to close genomic location with barrier loci and so missing from 593 males in 2004, but these hitchhikers are now becoming dissociated from the barrier 594 loci via recombination and consequently found in males in 2014. Again, this 595 hypothesis cannot explain the significant increase of introgressed alleles within a 596 generation in males. A third possibility is that the incompatibilities could be 597 environment-dependent, which could explain both within-generation and between-598 years effect. Environment-dependence means that the negative effects of the 599 incompatibilities would be expressed only in certain environments. This situation is 600 comparable to genotype ´ genotype ´ environment interaction, where fitness effects 601 of a genetic variant are dependent on the genomic background it is found in, as well 602 as the external environment. Previous empirical evidence for environment-dependent 603 incompatibilities comes e.g. from yeast [33] and Drosophila