Speciation in sympatry with ongoing secondary gene flow and an olfactory trigger in a radiation of Cameroon cichlids

Whether speciation can happen in the absence of geographical barriers and if so, under which conditions, is a fundamental question in our understanding of the evolution of new species. Among candidates for sympatric speciation, Cameroon crater lake cichlid radiations have been considered the most compelling. However, it was recently shown that a more complex scenario than a single colonization followed by isolation underlies these radiations. Here, we perform a detailed investigation of the speciation history of a radiation of Coptodon cichlids from Lake Ejagham using whole-genome sequencing data. The existence of the Lake Ejagham Coptodon radiation is remarkable since this 0.5 km2 lake offers limited scope for divergence across a shallow depth gradient, disruptive selection is weak, the species are sexually monochromatic, yet assortative mating is strong. We infer that Lake Ejagham was colonized by Coptodon cichlids almost as soon as it came into existence 9,000 years ago, yet speciation events occurred only in the last 1,000-2,000 years. We show that secondary gene flow from a nearby riverine species has been ongoing, into ancestral as well as extant Lake Ejagham lineages, and we identify and date river-to-lake admixture blocks. One of these contains a cluster of olfactory receptor genes that introgressed close to the time of the first speciation event and coincides with a higher overall rate of admixture into the recipient lineages. Olfactory signaling is a key component of mate choice and species recognition in cichlids. A functional role for this introgression event is consistent with previous findings that assortative mating appears much stronger than ecological divergence in Ejagham Coptodon. We conclude that speciation in this radiation took place in sympatry, yet may have benefited from ongoing riverine gene flow. Author Summary Despite an active search for empirical examples and much theoretical work, sympatric speciation remains one of the most controversial ideas in evolutionary biology. While a host of examples have been described in the last few decades, more recent results have shown that several of the most convincing systems have not evolved in complete isolation from allopatric populations after all. By itself, documenting the occurrence of secondary gene flow is not sufficient to reject the hypothesis of sympatric speciation, since speciation can still be considered sympatric if gene flow did not contribute significantly to the build-up of reproductive isolation. One way forward is to use genomic data to infer where, when and into which lineages gene flow occurred, and identify the regions of the genome that experienced admixture. In this study, we use whole-genome sequencing to examine one of the cichlid radiations from a small isolated Cameroon lake, which have long been the flagship example of sympatric speciation. We show that gene flow from a riverine species into the lake has been ongoing during the history of the radiation. In line with this, we infer that the lake was colonized very soon after it was formed, and argue that Lake Ejagham is not as isolated as previously assumed. The magnitude of secondary gene flow was relatively even across Lake Ejagham lineages, yet with some evidence for differential admixture, most notably before the first speciation event into the C. deckerti and C. ejagham lineage. Among the sequences that were introgressed into this lineage is a cluster of olfactory receptor genes, which may have facilitated speciation by promoting sexual isolation between incipient species, consistent with previous findings that sexual isolation appears to be stronger than ecological isolation in Ejagham Coptodon. We conclude that speciation in this radiation took place in sympatry, yet may have benefited from ongoing riverine gene flow.


151
Phylogeny of the Lake Ejagham Coptodon radiation 152 As a first step in revealing the speciation history of the Lake Ejagham Coptodon radiation 153 (hereafter: Ejagham radiation), we took several approaches to infer the phylogenetic relationships 154 among the three Lake Ejagham species C. deckerti, C. ejagham and C. fusiforme, as well as two 155 closely related riverine species from the neighboring Cross River drainage, C. guineensis and C. 156 sp. Mamfé (Fig 1), C. kottae, a Cameroon crater lake endemic that did not diversify in situ, and the 157 much more distantly related Sarotherodon galilaeus. 158 Maximum likelihood (ML) trees based on concatenated genome-wide SNPs using RaxML 159 with any of three outgroup configurations (only C. kottae / only S. galilaeus / both species) resulted 160 in monophyly of Lake Ejagham species and a sister relationship between C. deckerti and C. 161 ejagham with 100% bootstrap support (Fig 2A). However, inferences on whether one of the two 162 riverine species is more closely related to Ejagham Coptodon, or the two are sister species, 163 differed among outgroup configurations ( Fig S1). To further investigate the relationships among the 164 two riverine species relative to Ejagham Coptodon, we constructed species trees based on 100 kb 165 gene trees. Species trees based on rooted gene trees using ML and the Minimize Deep 166 Coalescence (MDC) criterion in Phylonet, as well as a species tree based on unrooted gene trees 167 using ASTRAL, all indicated monophyly of the Ejagham radiation, and a sister relationship between 168 C. deckerti and C. ejagham (Fig S2). 169 We used two methods to more explicitly examine the prevalence of discordant phylogenetic 170 patterns. In keeping with the results from phylogenetic trees, a phylogenetic network based on 171 genome-wide SNPs produced by Splitstree showed limited discordance along the branch to the 172 Ejagham Coptodon ancestor, with higher levels of discordance along the branch to the C. deckerti -173 C. ejagham ancestor and especially near the divergence of the riverine species ( Fig 2B). Second,174 phylogenetic relationships along local segments of the genome grouped by the machine-learning 175 approach Saguaro into 30 unrooted trees ("cacti") indicate that in 90.02% of the genome, Ejagham predominantly taken place prior to diversification within Lake Ejagham.
We tested this interpretation using five-taxon D FOIL statistics ( Fig 3B). D FOIL statistics take 188 advantage of derived allele frequency patterns in a symmetric phylogeny across two pairs of 189 populations with dissimilar coalescence times. The combination of signs (positive, negative, or 190 zero) across four D FOIL statistics can distinguish between admixture along terminal branches and 191 admixture with the ancestral population of the most recently diverged population pair. Here,we 192 repeated the test with each of three possible pairs of Lake Ejagham species as P1 and P2, and as 193 P3 and P4 the pair of riverine species, which diverged prior to the Ejagham species (see next 194 section). D FOIL statistics using both pairs of Lake Ejagham taxa that involve C. fusiforme indicated a 195 pattern of admixture between C. sp. Mamfé and the Lake Ejagham ancestor (Fig 3B, left). D FOIL 196 statistics are designed to uncover a single admixture pattern, such that multiple instances of gene 197 flow may lead to a combination of signs across D FOIL statistics without a straightforward 198 interpretation, which may explain the pattern observed for the comparison with C. deckerti and C. 199 ejagham as P1 and P2 (Fig 3B,right). 200 Consistent with more complex patterns of admixture, D-statistics for comparisons that 201 explicitly test for differential admixture between Ejagham species with C. sp. Mamfé indicate that 202 C. ejagham and C. deckerti experienced slightly higher levels of admixture than C. fusiforme after 203 their divergence (Fig 3A, bottom bars). Furthermore, an f 4 -ratio test suggests that 4.7% of C. 204 ejagham ancestry derives from admixture with C. sp. Mamfé during or after its divergence from C. 205 deckerti (Fig 3C), but it should be noted that D-statistics did not indicate differential admixture for 206 this comparison (Fig 3A, bottom bar). Overall, we infer that differential gene flow from C. sp. 207 Mamfé into the three Ejagham species has been relatively minor in comparison to gene flow 208 shared among the species. This difference in magnitude can be clearly seen in Fig 3A, where the 209 upper three bars represent shared gene flow and the lower three bars differential gene flow to 210 Ejagham species. 211 models with single migration bands may be prone to overestimation of that specific migration rate. 223 We therefore also ran models with an intermediate number of migration bands (either to all three 224 extant Ejagham species or to both ancestral lineages), and present results for all of these different 225 models in Fig 4, and Table 1. Divergence times and population sizes mentioned below represent  226 only those from models with all significant migration bands. 227 Divergence between the ancestral riverine and Lake Ejagham lineages was estimated to 228 have occurred around 9.76 ka ago (95% HPD: 8.27-11.23, Fig 4A), which we consider an estimate 229 of the timing of the colonization of Lake Ejagham. Encouragingly, this coincides with the age of the 230 lake estimated from core samples (9 kya: Stager et al. 2017). In contrast to rapid colonization of 231 the new lake, we estimated that the first speciation event in Lake Ejagham lineage only occurred 232  Fig 4E-F), which is in line with field observations of its rarity (Martin, 2013) and with its 239 piscivorous ecology (Dunz and Schliewen, 2010). 240 In agreement with the results from genome-wide admixture statistics, we infer that 241 secondary gene flow from riverine species has taken place mostly or only from C. sp. Mamfé 242 relative to C. guineensis. In models with single migration bands, significant gene flow was inferred 243 from C. sp. Mamfé into all Ejagham lineages (Fig 4D-F). Rates of gene flow to ancestral 244 populations dropped relative to extant lineages in models with all migration bands, in particular for 245 gene flow to the lineage ancestral to all three species (Fig 4D-F). 246 Overall, G-PhoCS inferred similar rates of gene flow from C. sp. Mamfé to extant species 247 guineensis or from the riverine ancestor (prior to the split between C. sp. Mamfé and C. guineensis) had 95% HPD intervals that overlapped with zero, and all except two had means very 260 close to zero (Table 1,   We also did not find clear evidence for gene flow among Ejagham Coptodon lineages using G-269 PhoCS. We evaluated models with each one of all possible migration bands in both directions, and 270 95% HPD for all migration rates overlapped with zero (S4C Fig). The mean inferred population 271 migration rate was higher than 0.01 only for C. fusiforme to C. deckerti (0.27) and to C. ejagham 272 (0.02). Such limited evidence for post-divergence gene flow within the radiation is surprising, given 273 that these species are in the earliest stages of speciation (Martin, 2013). However, caution is 274 warranted given that the very recent divergence of these lineages may render it difficult to identify 275 ongoing gene flow. Furthermore, representative breeding pairs at the tail ends of the phenotype 276 distribution were selectively chosen for sequencing (Martin, 2012), while excluding ambiguous 277 individuals that could not be assigned to a particular species. 278 Admixture blocks support ongoing gene flow from C. sp. Mamfé

279
To identify genomic blocks of admixture between riverine and Lake Ejagham species, we first 280 defined putative blocks as contiguous sliding windows that were outliers for f d , a four-population 281 introgression statistic related to D that is suitable for application to small genomic regions, and 282 subsequently applied HybridCheck (Ward and van Oosterhout, 2016) to validate and age these 283 blocks. We used all combinations of ingroup triplets that could differentiate between admixture from 284 secondary gene flow into Ejagham. In total, high-confidence admixture blocks spanned across only 296 0.64% of the queried part of the genome (5.7 Mb). 297 In accordance with the much stronger evidence for Lake Ejagham admixture with C. sp. 298 Mamfé than with C. guineensis, the majority of likely (68.3%) and high-confidence (80.1%) 299 admixture blocks involved C. sp. Mamfé as the riverine species, and likely and high-confidence 300 admixture blocks with C. sp. Mamfé were, on average, younger (2.94 and 1.37 ka, respectively) 301 than those with C. guineensis (4.55 and 1.97 ka, respectively, Fig 5A). 302 Because f d and HybridCheck detect admixture only between species pairs, we took two 303 approaches to investigate at which point along the Lake Ejagham phylogeny admixture took place 304 for likely admixture blocks. First, we intersected admixture blocks involving different Lake Ejagham 305 species but the same riverine species, and detected 76 likely (and 38 high-confidence) blocks 306 involving a single Lake Ejagham species, 88 (50) blocks shared among two Lake Ejagham 307 species, and 95 (87) blocks shared among all three Lake Ejagham species ( Fig 5B). Thus, 29.3% 308 of likely blocks (and 26.0% of high-confidence blocks) were unique to a single lake species, but 309 this may be an overestimate, since such blocks may have been present but escaped statistical 310 detection in other species, for instance due to recombination within the block. This possibility is 311 underscored by the age distribution of admixture blocks: admixture blocks detected in one species 312 were not younger than those detected in multiple species (S5 Fig). In line with results from 313 genome-wide admixture statistics and G-PhoCS, we found more admixture blocks into C. deckerti, 314 C. ejagham, and their ancestor, compared to C. fusiforme ( Fig 5B). 315 Second, we used D FOIL statistics to distinguish between admixture involving the ancestral 316 Lake Ejagham lineage ("DEF"), the C. deckerti -C. ejagham ancestor ("DE"), and the terminal 317 branches. We were able to categorize 23 likely (and 13 high-confidence) admixture blocks with 318 D FOIL statistics, showing a pattern of decreasing occurrence of admixture blocks through time, with 319 only a single likely (and 0 high-confidence) block involving a terminal Lake Ejagham branch (Fig  320   5C). For cases where admixture is with an ancestral (lake) clade, D FOIL statistics cannot infer the 321 direction of introgression, but the single classified admixture block with an extant lake taxon is, as 322 expected, inferred to have been into the lake. 323

Admixture of olfactory genes into C. deckerti and C. ejagham
324 Among all high-confidence blocks, 11 gene ontology terms were enriched (Table 2). Eight genes in 325 a single admixture block on scaffold NC_022214.1 were responsible for the three most enriched 326 categories; seven of these genes are characterized as olfactory receptors and the eighth as 327 "olfactory receptor-like protein" (none have a gene name and only one has 1-to-1 orthologues in 328 other species on Ensembl Release 90 (S10 Table)). The admixture block containing this cluster of 329 genes, which is shown in Fig 5D, was estimated to have introgressed from C. sp. Mamfé into both confidence admixture blocks, this block was the largest, had the highest summed f d score, and had 333 the second lowest HybridCheck p-value. 334 When performing GO analyses separately for blocks involving each Lake Ejagham species, 335 no additional terms were found to be enriched. With respect to admixture blocks involving each 336 Lake Ejagham species, the same 11 terms were enriched for C. ejagham, nine of these terms were 337 enriched for C. deckerti, and none were enriched for C. fusiforme (Table 2). Blocks unique to one 338 Lake Ejagham species (either taken together, or separately by species), were not enriched for any 339 terms, while blocks shared between two species were enriched for nine terms and blocks shared 340 between all three species for two terms (

344
In the context of an isolated lake, a classic case of fully sympatric speciation would involve i) 345 colonization of the lake by a single lineage, effectively in a single event, and ii) no subsequent 346 gene flow with populations outside of the lake prior to or during speciation. Our results suggest that 347 for the Lake Ejagham Coptodon radiation, the former is true but the latter is not. In contrast to the 348 original paradigm of a highly isolated lake colonized only once by a single cichlid pair (Schliewen et 349 al. 2001), we found ongoing gene flow from one of the riverine species into all three species in the 350 lake throughout their speciation histories. Interestingly, one of the clearest signals of introgression 351 came from a cluster of olfactory receptor genes that introgressed into the ancestral population at xx 352 kya just prior to the first speciation event, suggesting that gene flow may have facilitated 353 speciation. 354 identical to the estimated date of the origin of the lake itself (9 ka years ago, Stager et al., 2017), 357 suggesting that the lake was rapidly colonized by the ancestral lineage. It should be noted that this 358 estimate in turn relies on an estimate of the mutation rate. We here use estimates from 359 sticklebacks (Guo et al., 2013) as previous studies on cichlids have done (Kautt et al., 2016a, 360 2016b), but it cannot be excluded that our focal species have substantially different mutation rates 361 Recknagel et al., 2013). 362 Martin et al. (2015a) argued that the Cameroon lakes containing cichlid radiations may not be as 363 isolated as has previously been suggested, based on the inference of secondary gene flow in all 364 radiations, and the fact that each lake has been colonized by several fish lineages (five in the case ongoing gene flow are both in support of this view. In this light, it is worth pointing out that lake

Ejagham (i) has an outflow in the wet season (S6 Fig) which may be connected to the Munaya 368
River (itself part of the Cross River system), (ii) does not have a waterfall that could prevent fish 369 from entering the lake (C. H. Martin, pers. obs.) contrary to claims elsewhere (Bolnick and 370 Fitzpatrick, 2007), and (iii) is at an elevation of only 141 meters, about 60 meters higher than the 371 closest river drainage (the other two Cameroon lakes containing sympatric radiations, Lake 372 Barombi Mbo and Lake Beme, are at altitudes of 314 and 472 meters, respectively). 373 No major secondary colonizations 374 Our data suggest that the initial colonization of the lake established the large majority of the 375 lineage that has since diversified within Lake Ejagham, and we find no evidence for major 376 secondary colonizations that either established a new lineage or resulted in a hybrid swarm. 377 Several lines of evidence indicate that such events are unlikely to have taken place. First, 378 considerable phylogenetic conflict would be expected if diversification happened rapidly after a 379 secondary colonization event, while we found particularly widespread monophyly across the 380 genome (89.34%, S7 Table). Second, we inferred a long time lag between colonization and the first 381 speciation event within the lake (9.76 ka and 1.20 ka ago, respectively, Fig 4A, Table 1). Third, we 382 estimated gene flow into the ancestral lake lineage to be relatively low (Fig 4B), and in line with 383 this, models with and without post-divergence gene flow between riverine and lake lineages 384 resulted in similar (9.76 and 8.80 ka ago, respectively, Table 1) estimates of the divergence time of 385 the ancestral lake lineage. 386

Continuous low levels of gene flow from one of two Cross River Coptodon species
387 Even though we found that Ejagham Coptodon was established by a single major colonization, our 388 results are not consistent with subsequent isolation of the lake lineages. We found evidence for 389 ongoing secondary gene flow from the source population, which diverged into C. guineensis and C. 390 sp. Mamfé after the split with the Ejagham lineage. Results from all three types of approaches that 391 we used to identify secondary gene flow (demographic analysis with G-PhoCS, genome-wide 392 admixture statistics, and the identification of admixture blocks) show that gene flow originated 393 predominantly from one of these riverine lineages, C. sp. Mamfé (Fig 3A, 4B, 5). Little is known 394 about the precise geographic distribution of C. sp. Mamfé, yet this asymmetry is consistent with the 395 sampling location of this species (37 km from Lake Ejagham to the Cross River at Mamfe) relative 396 to that of C. guineensis (65 km from Lake Ejagham to a tributary of the Cross River at Nguti; see 397 also Fig 1 that depicts all major rivers). Both Coptodon lineages are known to coexist within the guineensis and C. sp. Mamfé. However, the clearest evidence of gene flow from C. guineensis 404 comes from admixture blocks, where an inference of differential ancestry from the two riverine 405 species was required. Since we were only able to include a single C. sp. Mamfé individual, it is 406 nevertheless possible that we missed genetic variation in that species, which may have led to 407 incorrect assignment as the riverine source lineage as C. guineensis. 408 Differential admixture of Ejagham radiation with riverine Coptodon 409 We found some evidence for differential riverine admixture, from C. sp. Mamfé, among the three 410 Ejagham species. While the admixture proportion of C. ejagham may be slightly higher than that of 411 C. deckerti (f 4 -ratio test: Fig  However, no riverine Coptodon have been confirmed beyond a posted sign reporting introduced 425 river fishes. In this study, we found no evidence that any of our individuals were recent hybrids (Fig  426   S4), but our limited sample sizes preclude us from any strong inferences on their potential 427 occurrence in the lake. 428 genes ( Table 2). While in mammals, the Olfactory Receptor (OR) gene family is the largest gene family with around 1,000 genes, mostly due to the expansion of a single group of genes, fish 440 species examined so far have much fewer (69-158 complete genes) yet a more diverse set of OR 441 genes (Azzouzi et al., 2014;Niimura and Nei, 2005). Unfortunately, little additional information is 442 known about the eight admixed olfactory receptor genes. 443 This cluster of OR genes was contained in the largest and arguably most striking of all high-444 confidence admixture blocks (Fig 5D), which is estimated to have introgressed from C. sp. Mamfé 445 into C. deckerti and C. ejagham, but not C. fusiforme, just prior to the estimated divergence time of 446 C. fusiforme and the ancestor of C. deckerti and C. ejagham. Thus, the timing, source and target of 447 introgression all correspond with the inference of elevated levels of gene flow from C. sp. Mamfé to 448 the C. deckerti -C. ejagham ancestor relative to C. fusiforme (Fig 3A, Fig 4B, Fig 5). These 449 patterns may suggest a role for the introgression of this block in initiating speciation in Ejagham 450

Coptodon. 451
Chemosensory signaling in general, and olfactory receptors specifically, have often been linked to 452 speciation, especially with respect to sexual isolation (Smadja and Butlin, 2008). A host of studies 453 has shown the importance of olfactory signaling in conspecific mate recognition in fish (Crapon de  454 Caprona and Ryan, 1990;Kodric-Brown and Strecker, 2001;McLennan, 2004;McLennan and 455 Ryan, 1999), and in a pair of closely related Lake Malawi cichlids, female preference for 456 conspecific males was shown to rely predominantly if not exclusively on olfactory cues 457 (Plenderleith et al., 2005). Not surprisingly, it has repeatedly been suggested that olfactory signals 458 may help explain explosive speciation in cichlids (Azzouzi et al., 2014;Blais et al., 2009;Keller-459 Costa et al., 2015). 460 Olfactory signaling seems particularly relevant to mate choice and speciation in Ejagham 461 Coptodon, since three species occur syntopically, assortative mating by species appears to 462 represent the strongest isolating barrier (Martin, 2012(Martin, , 2013, and sexual dichromatism is absent. 463 Important next steps will be to examine the importance of olfactory cues in mate recognition in 464 Lake Ejagham Coptodon, specifically between C. fusiforme and the other two species, and to 465 characterize these genes and their patterns of divergence and admixture in more detail. 466 Waiting time for sympatric speciation 467 While we inferred that colonization of Lake Ejagham took place more than 9 ka years ago, the first 468 branching event among Ejagham Coptodon was estimated to have occurred as recently as 1.20 ka 469 years ago ( Fig 4A, Table 1). We did not include the fourth nominal Coptodon species in the lake, C. 470 nigrans, but extreme phenotypic similarity to C. deckerti (Dunz and Schliewen, 2010) and our 471 inability to identify or distinguish these individuals in field collections and observations (Martin earlier speciation events did occur, but were followed by extinction. While we cannot fully exclude 476 this scenario, there are no indications for environmental disruptions such as major changes in 477 water chemistry or depth during the history of Lake Ejagham (Stager et al., 2017). 478 Assuming that the divergence of C. fusiforme was the first within this radiation, a striking 479 difference emerges between the waiting time to the first (7.74 ka) and the next two speciation 480 events, which both occurred within 1.20 ka. The opposite pattern, a slowing speciation rate, would 481 be expected if speciation followed a niche-filling model of ecological opportunity in the lake. At least 482 two non-mutually exclusive explanations may account for this counterintuitive result. 483 First, an initial lack of ecological opportunity in young Lake Ejagham may have prevented a rapid 484 first speciation event. Our results are reminiscent of those for sympatrically speciating Tristan da 485 Cunha buntings (Ryan et al., 2007), where, as discussed by Grant and Grant (2009), the ancestral 486 branch is considerably longer than those of the extant species. Grant and Grant (2009) (Cornen et al., 1992;Green and Kling, 1988). 492 Second, genetic variation for traits underlying sexual and ecological selection and the associated 493 genetic architecture may initially not have been conducive to speciation. If ecological and mate 494 preference traits are distinct (i.e. not magic traits: Servedio et al., 2011) and independently 495 segregating within the ancestral colonizing population, sympatric speciation models predict that 496 there will be a waiting time associated with the initial buildup of linkage disequilibrium between 497 these traits before sympatric divergence can proceed (Dieckmann and Doebeli, 1999;Kondrashov 498 and Kondrashov, 1999). Furthermore, Bolnick (2004) demonstrated that under conditions where 499 genetic variation for stringent assortative mating is limiting and females are penalized for 500 assortative mating, sympatric speciation may require a long time. In this light, it is particularly 501 intriguing that introgression of a block containing eight olfactory receptor genes from C. sp. Mamfé,502 which are likely to be highly relevant for mate choice, were introgressed shortly prior to the first 503 speciation event. Therefore, genetic variation brought in by riverine gene flow may have been 504 necessary to initiate speciation among Lake Ejagham Coptodon. 505

506
We showed that Lake Ejagham was rapidly colonized by ancestors of the extant Coptodon 507 radiation in a single major colonization, while also inferring low levels of ongoing and continuous 508 secondary gene flow from riverine species into ancestral as well as extant lake species. Speciation 509 can still be considered sympatric if secondary gene flow was present but did not play a causal role in speciation, and the pattern of ongoing gene flow is consistent with this. However, introgression 511 of a cluster of olfactory receptor genes into a pair of sister species (but not the third species) just 512 prior to their divergence, indicates that secondary gene flow may have been important to 513 speciation. The introgression of olfactory genes is particularly salient given that Ejagham Coptodon 514 species exhibit strong assortative mating, but currently weak disruptive selection, syntopic 515 breeding territories, and no sexual dichromatism within a tiny, shallow lake. 516 Sampling 518 Sampling efforts and procedures have been described previously in Martin et al. (2015a). Here, we 519 sampled breeding individuals displaying reproductive coloration from three species of Coptodon 520 (formerly Tilapia) that are endemic to Lake Ejagham in Cameroon: Coptodon fusiforme (n = 3), C. 521 deckerti (n = 2), and C. ejagham (n = 2). We additionally used samples from closely related riverine 522 species from the nearby Cross River whose ancestors likely colonized Lake Ejagham: C. 523 guineensis (n = 2) at Nguti, 65 km from Lake Ejagham, and an undescribed taxon, C. sp. "Mamfé" 524 (Keijman, 2010)  0.99" in vcftools) and SNPs with more than two alleles (using "-m2 -M2" flags in bcftools, version 554 1.5 (Li, 2011)). Genotypes with a genotype quality below 20 and depth below 5 were set to missing 555 (using "--minGQ" and "--minDP" flags in vcftools, respectively), and sites with more than 50% 556 missing data were removed (using "--max-missing" flag in vcftools). Our final dataset consisted of 557 15,523,738 SNPs with a mean sequencing depth of 11.82 (range: 7.20 -16.83) per individual. 558 Phylogenetic trees, networks, and genetic structure 559 We employed several approaches to estimate relationships among the three species in the 560 Coptodon Ejagham radiation and the two riverine Coptodon taxa. These analyses were repeated 561 for four outgroup configurations: no outgroup (unrooted trees), using only C. kottae, only S. 562 galilaeus, and both C. kottae and S. galilaeus as outgroups. Only sites with less than 10% missing 563 data were used for phylogenetic reconstruction. 564 Using the GTR-CAT maximum likelihood model without rate heterogeneity, as implemented in 565 RaxML (version 8.2.10, Stamatakis, 2014), we inferred phylogenies for all SNPs concatenated, as 566 well as separately for each 100kb window with at least 250 variable sites ("gene trees"). This 567 resulted in sets of 1,532 -2,559 trees, depending on the outgroup configuration. 568 Next, rooted gene trees were used, first, to compute Internode Confidence All (ICA) scores 569 (Salichos et al., 2014), using the "-L MR" flag in RaXML) for each of the nodes of the whole-570 genome trees. Rooted gene trees were also used to construct species trees in Phylonet (version 571 3.6.1, Than et al., 2008) using the Minimize Deep Coalescence criterion (Than and Nakhleh, 2009, 572 "Infer_ST_MDC" command) and maximum likelihood ("Infer_Network_ML" command with zero 573 reticulations), and using a maximum pseudo-likelihood method implemented in MP-EST (version 574 1.5, Liu et al., 2010). Finally, we used ASTRAL (version 2.5.5, Mirarab et al., 2014) to infer species 575 trees from unrooted gene trees. 576 To visualize patterns of genealogical concordance and discordance, we computed a phylogenetic 577 network using the NeighborNet method (Bryant and Moulton, 2004) implemented in Splitstree 578 (version 4.14.4, Huson and Bryant, 2006), using all SNPs. 579 We used the machine learning program Saguaro (Zamani et al., 2013) to determine the dominant 580 topology across the genome and calculate the percentages of the genome that supported specific 581 relationships, such as monophyly of the Ejagham Coptodon radiation. Saguaro combines a hidden 582 Markov model with a self-organizing map to characterize local phylogenetic relationships among 583 individuals without requiring a priori hypotheses about the relationships. This method infers local 584 relationships among individuals in the form of genetic distance matrices and assigns segments 585 across the genomes to these topologies. These genetic distance matrices can then be transformed 586 into neighborhood joining trees to visualize patterns of evolutionary relatedness across the genome, but otherwise applied default parameters. We investigated the effect of the number of proposed topologies on the proportion of genomes assigned to our two categories, and found that 590 the percentages were robust after 20 proposed topologies, with increasingly smaller percentages 591 of the genome being assigned to new additional topologies. 592 Genome-wide tests for admixture 593 We tested for admixture between the two riverine species and the three Lake Ejagham species 594 using several statistics based on patterns of derived allele sharing among these species. We used 595 the ADMIXTOOLS (version 4.1, Patterson et al., 2012)  any Lake Ejagham species (P3). Given that all three of these comparisons indicated admixture 604 between C. sp. Mamfé and Lake Ejagham species (Fig 3A), we next tested whether there was 605 evidence for differential admixture from C. sp. Mamfé among the three Ejagham Coptodon 606 species, using the three possible pairs of Lake Ejagham species as P1 and P2, and C. sp. Mamfé 607 as P3. 608 Another way to test for differential C. sp. Mamfé admixture among Ejagham Coptodon species is 609 by using f 4 -ratio tests, wherein taxon "X" is considered putatively admixed, containing ancestry 610 proportion α from the branch leading to P2 (after its divergence from taxon P1), and ancestry 611 proportion α -1 from the branch leading to taxon P3. Given the constraints imposed by the 612 topology of our phylogeny, we could only test for admixed ancestry of either C. deckerti or C. lengths across taxa. When this assumption is violated, the dfoil program can be run in "dfoilalt" 632 mode, thereby excluding single derived-allele counts (Pease and Hahn, 2015). Since we observed 633 significantly fewer single derived-allele sites for C. sp. Mamfé than for C. guineensis, we ran the 634 dfoil program in "dfoilalt" mode at a significance level of 0.001. 635 As input, G-PhoCS expects full sequence data for any number of loci. Since G-PhoCS models the 645 coalescent process without incorporating recombination, it assumes no recombination within loci, 646 and free recombination between loci. Following several other studies (Choi et al., 2017;Gronau et 647 al., 2011;Hung et al., 2014;McManus et al., 2015), we picked 1 kb loci separated by at least 50 648 kb. Following (Gronau et al., 2011), loci were selected not to contain the following classes of sites 649 within the O. niloticus reference genome -that is, rather than being simply masked, these sites 650

Inference of demographic history with G-PhoCS
were not allowed to occur in input loci: (1) hard-masked (N) or soft-masked (lowercase bases) sites 651 in the publicly available genome assembly; (2) sites that were identified to be prone to ambiguous 652 read mapping using the program SNPable (Li, 2009, using k=50 and r=0.5 and excluding rankings 653 0 and 1); and (3) any site within an exon or less than 500bp from an exon boundary. Furthermore, 654 loci were chosen to contain no more than 25% missing data (uncalled and masked genotypes). 655 Using these selection procedures, a total of 2,618 loci were chosen using custom scripts and a 656 VCF to Fasta conversion tool (Bergey, 2012). 657 Prior distributions for demographic parameters are specified in G-PhoCS using α and β parameters 658 of a gamma distribution. We determined the mean of the prior distribution (α / β ) for each 659 parameter using a number of preliminary runs, while keeping the variance (α / β MCMC runs converged on similar posterior distributions. 663 For each combination of migration bands (see below), we performed four replicate runs. Each G-664 PhoCS run was allowed to continue for a week on 8-12 cores on a single 2.93 GHz compute node 665 of the UNC Killdevil computing cluster, resulting in runs with 1-1.5 million iterations. The first 666 250,000 iterations were discarded as burn-in, and the remaining iterations were sampled 1 in every 667 50 iterations. Convergence, stationarity, and mixing of MCMC chains was assessed using Tracer 668 (version 1.6.0, Rambaut et al., 2014). 669 Because the total number of possible migration bands in a six-taxon phylogeny is prohibitively high 670 for effective parameter inference, we took the following strategy. Our primary focus was on testing 671 migration bands from C. sp. Mamfé ("Mam") and C. guineensis ("Gui") to the Lake Ejagham 672 Coptodon species and their ancestors: C. deckerti ("Dec"), C. ejagham ("Eja"), C. fusiforme ("Fus"), 673 "DE" (the ancestor to Dec and Eja), and "DEF" (the ancestor to DE and Fus). We first performed 674 runs each with a single one of these migration bands. Since all migration bands from C. sp. Mamfé 675 had non-zero migration rates, we next performed runs with all of these migration bands at once. 676 However, in those runs we observed failures to converge, higher variance in parameter estimates, 677 and the dropping to zero of rates of migration to the ancestral Lake Ejagham lineage (see Fig 4). 678 The latter is surprising given that for single-band runs, this migration rate was the highest inferred, 679 and is also in sharp contrast to other analyses that show much stronger support for migration to the 680 ancestral lineage than to extant species. While we suspect that runs with all migration bands have 681 poor performance due to the number of parameters, runs with single migration bands may be 682 prone to overestimation of the migration rate. We therefore also performed runs with migration 683 bands either to all three extant species or to both ancestral lineages, and report results for all of 684 these run types, separately. 685 Finally, we performed runs with no migration bands. We did not examine models with migration 686 from the Ejagham radiation to neighboring rivers because this is not relevant to sympatric 687 2011) can be interpreted as the probability that a locus in the target population has experienced 700 migration from the source population, and is calculated by multiplying m by the time span of the 701 migration band, which is the time window during which both focal populations existed (M s→t = m s→t 702 * τ s,t ). 703 Local admixture tests 704 To identify genomic regions with evidence for admixture between one of the riverine species and 705 one or more of the Lake Ejagham species, we first computed the f d statistic (Martin et al., 2015b) 706 along sliding windows of 50 kb with a step size of 5 kb, using ABBABABA.py (Martin, 2015). The f d 707 statistic is a modified version of the Green et al. (2010) estimator of the proportion of introgression 708 (f), and has been shown to outperform D for the detection of introgression in small genomic 709 windows (Martin et al., 2015b). 710 In the topology ((P1, P2), P3), O), f d tests for introgression between P2 and P3. For each window, 711 f d was calculated for two types of configurations. First, those that can identify the source of any 712 riverine admixture, using the two riverine species as P1 and P2 and a Lake Ejagham species as 713 P3 (for example: P1 = C. guineensis, P2 = C. sp. Mamfé, P3 = C. ejagham). Second, those that 714 can identify differential admixture from a riverine species among two Lake Ejagham species (for 715 example: P1 = C. deckerti, P2 = C. ejagham, P3 = C. sp. Mamfé). Since f d only detects 716 introgression between P2 and P3, f d was also computed for every triplet with P1 and P2 swapped. 717 P-values for f d were estimated by Z-transforming single-window f d values based on a standard 718 normal distribution, followed by multiple testing correction using the false discovery rate method 719 (FDR, Benjamni and Hochberg, 1995), using a significance level of 0.05. Next, putative admixture 720 blocks were defined by combining runs of significant f d values that were consecutive or separated 721 by at most three non-significant (FDR > 0.05) windows. Because any secondary admixture must 722 have occurred within the last ~10k years, after colonization of Lake Ejagham, true admixture 723 blocks are expected to be large, and blocks of less than five total windows or with maximum f d 724 values below 0.5 were excluded from consideration. Therefore, only genomic scaffolds of at least 725 70 kb (i.e., 557 scaffolds or 97.40% of the assembled genome) can harbor a putative admixture 726 block. Blocks indicating differential admixture with a riverine species among two Lake Ejagham 727 species (in ingroup triplets with a pair of Lake Ejagham species as P1 and P2, and a riverine 728 species as P3) were retained only when the riverine source of admixture could be distinguished in 729 a direct comparison, by intersection with blocks indicating differential admixture with a Lake 730 Ejagham species among the two riverine species. For instance, a block indicating admixture 731 between C. deckerti (P2) and C. sp. Mamfé (P3) in an ingroup triplet with C. ejagham as P1 (i.e., 732 identifying differential admixture among two lake species) was only retained if it overlapped with an (Ward and van Oosterhout, 2016), using the same mutation rate as for our G-PhoCS analysis. 737 HybridCheck identifies blocks that may have admixed between two sequences by comparing 738 sequence similarity between triplets of individuals along sliding windows, and next estimates, for 739 each block, the coalescent time between the two potentially admixed sequences. While 740 HybridCheck can also discover admixture blocks ab initio, we employed it to test user defined 741 blocks with the "addUserBlock" method. Given that HybridCheck accepts triplets of individuals, and 742 f d blocks detected in a given species triplet were tested twice in HybridCheck for that species 743 triplet, each using a different individual of the admixed Lake Ejagham species. Blocks were 744 retained when HybridCheck reported admixture between the same pair of individuals as the f d 745 statistic, and with a p-value smaller than 0.001 for both triplets of individuals. Our final set of "likely 746 blocks" consisted of those with an estimated age smaller than the G-PhoCS point estimate (in runs 747 with all possible migration bands from C. sp. Mamfé) of the divergence time between the Lake 748 Ejagham ancestor ("DEF") and the riverine ancestor ("AU"), while "high confidence blocks" were 749 defined as those with the upper bound of the 95% confidence interval of the age estimate smaller 750 than the lower bound of the 95% HPD of the divergence time estimate between DEF and AU (for 751 whichever set of G-PhoCS runs, either with no, some, or all migration bands from C. sp. Mamfé, 752 had the lowest value for this parameter). 753 In order to characterize the patterns of admixture for these pairwise admixture blocks further, we 754 calculated localized D FOIL statistics for each. Since these statistics depend on the occurrence of 755 sufficient numbers of all possible four-taxon derived allele frequency occurrence patterns among 756 five taxa, these only produced results for a subset of blocks (for the same reason, we were not 757 able to use these statistics for ab initio admixture block discovery along sliding windows). Since we 758 already established the presence of admixture for these blocks, and performed these analyses to 759 determine the pattern of admixture, we did not require significance for each D FOIL statistic, but also 760 considered it to be positive or negative if the statistic was more than half its maximum value and 761 had at least 10 informative sites.    The focal species in this study are shown: three species of Lake Ejagham Coptodon and two closely 817 related riverine species. As outgroups, we used C. kottae, a Cameroon crater lake endemic species that 818 did not diversify, and Sarotherodon galilaeus.   Estimates of divergence times (B), population sizes (C), and migration rates (D-F) across runs with varying migration bands from C. sp. Mamfé to lake 848 lineages: "none", "single", "ancestral/current", and "all" indicate that individual runs estimated zero, one, several (either to the two ancestral lineages, DE and 849 DEF, or to the three extant species), or all possible migration bands, respectively.