A Time-calibrated Firefly (Coleoptera: Lampyridae) Phylogeny: Using Genomic Data for Divergence Time Estimation

Fireflies (Coleoptera: Lampyridae) consist of over 2,000 described extant species. A well-resolved phylogeny of fireflies is important for the study of their population genetics, bioluminescence, evolution, and conservation. We used a recently published anchored hybrid enrichment dataset (AHE; 436 loci for 88 Lampyridae species and 10 outgroup species) and state-of-the-art statistical methods (the fossilized birth-death-range process implemented in a Bayesian framework) to estimate a time-calibrated phylogeny of Lampyridae. Unfortunately, estimating calibrated phylogenies using AHE and the latest and most robust time-calibration strategies is not possible because of computational constraints. As a solution, we subset the full dataset by applying three different strategies: (i) using the most complete loci, (ii) using the most homogeneous loci, and (iii) using the loci with the highest accuracy to infer the well established Photinus clade. The estimated topology using the three data subsets agreed on almost all major clades and only showed minor discordance within less supported nodes. The estimated divergence times overlapped for all nodes that are shared between the topologies. Thus, divergence time estimation is robust as long as the topology inference is robust and any well selected data subset suffices. Additionally, we observed an un-expected amount of gene tree discordance between the 436 AHE loci. Our assessment of model adequacy showed that standard phylogenetic substitution models are not adequate for any of the 436 AHE loci which is likely to bias phylogenetic inferences. We performed a simulation study to explore the impact of (a) incomplete lineage sorting, (b) uniformly distributed and systematic missing data, and (c) systematic bias in the position of highly variable and conserved sites. For our simulated data, we observed less gene tree variation which shows that the empirically observed amount of gene tree discordance for the AHE dataset is unexpected and needs further investigation.


Introduction
Fireflies (Coleoptera: Lampyridae) consist of more than 2,000 globally distributed described species renowned can be biased (Brown 2014). For example, if our model predicts much lower variation in GC content among 165 sequences, then our inference might wrongly group taxa with low (or high) GC content together (Betancur-R 166 et al. 2013). 167 Posterior predictive distributions were simulated using parameters values (e.g., phylogeny and substitu-168 tion rates) drawn from the posterior distribution which were inferred in the previous step. We conservatively 169 discarded the initial 50% of samples as burnin and used the remaining 100,000 samples for the simulations 170 (four replicates with originally 50,000 samples each). Finally, we computed the posterior predictive p-values 171 as the frequency how often the summary statistic of the observed data was larger or equal to the summary 172 statistic computed using the simulated data (midpoint p-values; Höhna et al. (2018)). That is, if we obtain a 173 very low p-value, then most or all of our simulated datasets have a larger summary statistic. For example, if 174 our empirical alignment had very few variable sites and most simulated datasets had more variable sites, then 175 the p-value would be close to zero. Conversely, a high posterior predictive p-value depicts larger summary 176 statistics from the observed data compared with the simulated data. 177 Simulation Study 178 We performed a simulation study as a benchmark and reference for our single locus phylogenetic analyses. 179 Specifically, we focused on (1) the discordance between gene trees and species trees under the multispecies  Then, if the branch leading to the next speciation event is shorter than the coalescent time between two 187 individuals, then we could observe deep coalescent events with incomplete lineage sorting. Thus, to observe 188 incomplete lineage sorting the population size needs to be sufficiently large and/or the internal branch length 189 needs to be sufficiently short. In our simulations, we simulated 436 gene trees within the fixed species tree  Second, the AHE dataset -as most phylogenomic datasets-are far from complete and missing sequence 195 data is heterogeneously distributed (see Figure S1). On the one hand, missing sequence data can impact 196 phylogeny inference, specifically if some taxa have a high fraction of missing sequence data (Sanderson et al. 197 1998; Lemmon et al. 2009). These taxa are often rogue and cannot be placed with certainty or correctly is then unclear if this site is invariant or not. Thus, missing sites can impact our calculation of summary 204 statistics, and thus our evaluation of model adequacy. Here, we explore the impact of missing data with a 205 specific focus on how missing data is distributed in AHE datasets. 206 We simulated sequence alignments for each of the three sets of 436 gene trees as follows. We simulated 207 branch rates from a uncorrelated lognormal relaxed clock model with mean 1.836 × 10 −3 (in million years) 208 and standard deviation of 0.58. Then, we simulated sequence data under a GTR+Γ model with base  of standard phylogenetic models. The among site rate variation model (+Γ) allows for rate variation using 222 four discrete rate categories but each site evolves independently and identically distributed. That is, each site 223 has a probability of 0.25 to be in any of the four rate categories regardless of the position in the alignment 224 (center vs. beginning/end). This model violation might not be a problem for many phylogenetic analyses. 225 However, the combination of missing data that is more prevalent at the same positions as highly variable sites 226 could induce a systematic bias. We explored this potential systematic bias by repeating the above simulation 227 with rate categories drawn deterministically depending on the position in the alignment. Specifically, we 228 divided the alignment in eight equal-sized regions where the outer regions received the highest of the four 229 rate categories and the middle regions the lowest rate categories respectively.

230
In total, we simulated three sets of 436 gene trees and four alignments per gene trees (436 loci x 3 231 population sizes x 2 levels of missing data x 2 modes of rate variation = 5232 simulated alignments). We 232 analyzed each simulated alignment with the same inference pipeline as the empirical AHE dataset. We  Divergence time estimation of Lampyridae phylogeny 236 We used the 436 ultra-conserved elements (AHEs) recently published by Martin  for the high sequence coverage, Photinus + Ellychnia monophyly, and low GC variance criteria respectively.

252
These data subsets share some loci but the majority of loci are private for each data subset ( Figure 1).

253
Thus, our divergence time analyses using these three different data subsets are mainly independent. If all 254 three datasets produce the same or highly similar divergence time estimates, then we are confident that the 255 divergence time estimates are robust to our choice of data subsets.

256
For each data subset we employed a partitioned GTR+Γ substitution model (Tavaré 1986) where among 257 site rate variation was modeled by 4 discrete categories obtained from a gamma distribution (Yang 1994   We omitted using the fossil †Electrotreta rasnitsyni because our preliminary analyses showed that Dri- Burmite amber ( ! 99 million year old), it provides a mimum age for Lampyridae of at least 99Ma. 291 We used the fossilized birth-death-range process (Stadler et al. 2018) to time-calibrate the Lampyridae phylogeny. The fossilized birth-death range process requires assignment of fossils to clades (see Table 1) and 293 integrates over both the actual placement within the clade (e.g., stem vs crown) and the actual time of fossil 294 within the specified stratigraphic range. That is, we provided both minimum and maximum ages (Table 1) 295 for each fossil taxon. Then, the fossilized birth-death range process gives equal probability that the true age 296 of the fossil was within the specified range. In principle, we could omit the monophyletic constraints if we had 297 morphological data for both fossil and extant taxa using tip-dating approaches (

311
Properties of the AHE loci 312 We obtained a minimum of 218 variables sites and a maximum of 2,071 variable sites with a mean of 696 313 variable sites ( Figure S5). Similarly, we obtained a minimum of 23 invariable sites, a maximum of of 1067 314 invariable sites and a mean of 225 invariable sites. We used the number of variable sites as a proxy for how 315 informative a locus is (Townsend 2007). Overall, the distribution of the number of variable sites appeared 316 unimodal without extremely low outliers. Thus, we did not see any indication that specific loci should be 317 particularly poor for phylogenetic inference.

318
The minimum and maximum pairwise distance showed interesting patterns. The majority of loci had a 319 minimum pairwise distance of zero ( Figure S5), which means that the alignments contained two sequences 320 without substitutions among them. Hence, there is no phylogenetic signal to distinguish between the se-321 quences. In itself, this low pairwise distance does not imply a problem for phylogenetic inference because 322 the two taxa will be placed as sister taxa. However, this distribution could indicate that there are several 323 taxa that cannot be resolved.

324
The maximum pairwise distance showed a skewed distribution with some larger outliers. This could 325 indeed be problematic. First, the high maximum pairwise distance will most likely lead to long branches in 326 the phylogeny. Second, the high distance could occur due to non-homologous sequences. The sequences, for 327 example, could be contaminated, mis-aligned and/or represent paralogs.

328
The distribution of GC content showed some slightly multi-modal and skewed mean GC content and 329 variance in GC content ( Figure S5). The mode with lower mean GC content and higher variance in GC 330 content could represent loci which are problematic for phylogenetic inference.  Figure S6). The monophyletic support increased on the genus level; eight of the fourteen 340 genera were recovered as monophyletic, five genera were rejected as being monophyletic and Australoluciola 341 was ambiguously recovered as either monophyletic or not ( Figure S6). . For each AHE locus, we computed the posterior probability that the clade is monophyletic. A histogram with most loci having a high posterior probability (e.g., Ellychnia) depicts strong support by the majority of AHE loci. Conversely, a histogram with most loci having a low posterior probability (e.g., Ototretinae) depicts strong support against monophyly by the majority of AHE loci. Other clades (e.g., Photinus) received contradicting support with some loci strongly supporting and other loci strongly rejecting monophyly. Note that we assumed Photinus being paraphyletic and instead considered Photinus+Ellychnia based on previous results. The histograms for all clades is shown in the supplementary material Figure S6.
Correlation between missing data and posterior support-Given the poor support on higher taxonomic 343 levels, and the ambiguous support for some of the named clades, we investigated whether there is a cor-  Figure S7. 350 We observed that there is no correlation between sequence coverage and the posterior probability of Photinus+Ellychnia being monophyletic ( Figures S7 and S8). This results is actually expected because we 352 removed taxa which had 50% or more sites in the sequence missing. The overall sequence coverage instead 353 represents the completeness of the entire alignment and therefore loci with higher average sequence coverage 354 are loci that contain more taxa after pruning. The same trend and correlation between sequence coverage and 355 phylogenetic accuracy can be seen for all other tested clades ( Figure S7). Thus, our pruning of incomplete 356 sequences from the alignment makes filtering loci based on overall sequence coverage futile.

357
Correlation between summary statistics of the data and posterior support-Next to the sequence complete-     The second row shows the summary statistics computed for the simulated dataset with complete sequences and homogeneous (i.e., independent and identically distributed, IID) highly variables sites. The third row shows the summary statistics computed for the simulated dataset with missing sequences and homogeneous highly variable sites. The fourth row shows the summary statistics computed for the simulated dataset with complete sequences and systematically distributed (i.e., akin to the empirical AHE dataset) highly variables sites. The bottom row shows the summary statistics computed for the simulated dataset with missing sequences and systematically distributed highly variable sites. The distribution of highly variables versus conserved sites had little to no impact on the summary statistics.
First, we observed that simulated complete alignments had different distributions of summary statistics 401 compared to the empirical data ( Figure 4)  with the combination of systematically ordered highly variables sites at the borders and missing sequence 420 data, then our model adequacy tests for the simulated failed. It remains surprising that our phylogenetic 421 substitution model was not adequate for even a single empirical locus. 422 We observed an unexpected amount of gene tree discordance between our species tree and the gene trees 423 ( Figure 6). The RF-distance computed for the empirical dataset are more centered at intermediate values 424 and never close to 0. That means, not a single gene tree was equal to or close to any of our reference trees.

425
Using the simulated data as a reference predicts that we should observe more often gene trees that are similar 426 to the species tree. Only if we assumed an unrealistic large population size of 10,000,000 diploid individuals 427 did we observe gene tree discordance for the simulated data close to the empirical data ( Figure S12). For reason is causing this unexpected gene tree discordance. 431 We observed that missing data has an impact on the distribution of posterior probabilities of certain 432 clades being monophyletic (Figures S13-S16). Some clades, for example, Phosphaeni, Luciolinae and Curtos, 433 are always rejected if complete sequences were used in the simulations (Figures S13 and S15). When we 434 used sequences with missing data in our simulations, and pruned sequences with less than 50% non-missing 435 sites, then we obtain high posterior probabilities for some loci that these clades were indeed monophyletic  ( Figures S14 and S16). The latter results agree more with our empirical analysis.

437
Note that we used the phylogeny which we inferred for the concatenated alignment to simulate dataset 438 (Figure 7). This phylogeny has some clades, for example Curtos, as monophyletic. Single loci were the possi-

451
The support for the named clades using the three concatenated data subsets largely matches the support 452 from the single gene tree analyses (Figure 2 and S17). The concatenated analyses inferred trees with subsets also recovered Photinus+Ellychnia to be monophyletic. It is therefore most probable that Photi-461 nus+Ellychnia is indeed monophyletic. In all our analysis we recovered both Ellychnia itself being mono-462 phyletic and Photinus + Ellychnia being monophyletic, indicating that Photinus is paraphyletic (Figure 7).  Table S1). Our inferred results show    and up to 37 loci in this study), (ii) the chosen fossil calibrations, (iii) the fossil calibration approach (node 486 dating vs the fossilized-birth-death process), and (iv) the software BEAST vs RevBayes. It is very reassuring 487 to see the large overlap in divergence time estimates despite these differences in the performed analysis.

488
Our divergence times are robust for the majority of clades when comparing the three different data subsets 489 ( Figure S18). If the clades were identified as being monophyletic for all three data subsets ( Figure S17), then 490 also the estimated crown ages were identical ( Figure S18). However, when the clades were not found to be 491 monophyletic, then the clade ages also differed (e.g., Ototretinae). This result is not surprising as the crown 492 age is defined as the most recent common ancestor for the selected taxa, and if this most recent common 493 ancestor includes different species then the interpretation of this ancestor and its age should be different.   In our study with used the same data source but different data subset. Thus, the difference in results of 516 phylogenomic studies could originate rather from the data source than the amount of data.

517
Second, divergence time estimates seem robust to the chosen data subset if the inferred topology agreed.

518
It is not surprising that a clade, which was inferred to be monophyletic for one data subset but not for 519 another data subset, obtained a different crown age estimate (e.g., Ototretinae). Therefore, we conclude 520 that it is more important to focus first on robust estimation of phylogeny using different data subsets or 521 methods. Once we understand how to select data subsets to produce reliable phylogenies, then we can safely 522 use the same data subsets for divergence time estimation. Our study raises some important aspects where 523 we need to improve our inference of phylogeny.

524
Unrealistic gene tree discordance 525 Our analyses of the 436 AHE loci revealed strong gene tree discordance. Such large amounts of gene tree 526 discordance are not expected even when allowing for incomplete lineage sorting. Richards et al. (2018) showed 527 that discordance between mitochondrial loci and nuclear loci is equally large, suggesting that much of the 528 apparent gene tree discordance originates from methodical factors and biological factors. Our simulation 529 study corroborates these findings: we cannot explain the observed gene tree discordance with incomplete 530 lineage sorting or phylogenetic uncertainty.

531
In our analyses, we did not clean the AHE dataset but took the data as given. There are several possible 532 sources of error which we did not check. For example, we did not assess orthology and did not check for 533 alignment errors. However, the published AHE dataset (Martin et al. 2019) was curated following current 534 best practices and therefore should reflect the state of the field.

535
In the last decade, we have seen several debates about using concatenation or coalescent-based species tree approaches (for a recent review see Bravo et al. 2019). Our observed higher accuracy of the concatenated 537 analyses over the single gene trees is surprising and provides empirical evidence against standard theoret-538 ical expectations. We expect that there is recombination between loci and therefore that concatenation 539 approaches can be misleading (Degnan and Rosenberg 2009). Instead, we find that the information within 540 single loci is misleading and, when concatenated, the noise is canceled out to leave an improved phylogenetic 541 signal (e.g., , recovery of Photinus + Ellychnia as monophyletic). We do not advocate that concatenation 542 approaches are philosophically or theoretically superior, but instead we notice that current single gene tree Previous studies have shown that loci can be filtered to increase phylogenetic accuracy (e.g., Alda et al.

553
2019). However, previous studies mainly focused on the resulting species tree and not on single genes (e.g., 554 Leite et al. 2021). Overall, we could not identify a single summary statistics as a reliable predictor for 555 phylogenetic accuracy (Figure 3). Thus, we could not use single summary statistics as data filtering criteria 556 for robust phylogenetic inference. It is possible that some phylogenetic relationships of Lampyridae require 557 revision and thus our proxies of phylogenetic accuracy are misleading. Clearly more work is needed if these 558 summary statistics of the data are used for selecting data subsets. Our results show some promise and single 559 summary statistics could be used to detect outliers which are then removed to clean the dataset (Figure 3).

560
Additionally, a combination of summary statistics might provide a fruitful future approach.  Lemmon et al. 2009). Imagine that we would add another column to our data 580 matrix but this column consists only of missing sites. In that case, we have not added any information to our 581 data and in fact the likelihood score remains the same after adding this column of missing data (Felsenstein 582 2004). Hence, missing data in theory is not problematic.

583
Previous studies have investigated the impact of missing data using simulation studies (Philippe et al. is distributed. If we randomly place missing data in the data matrix, then this has only a minor impact 586 on our ability to correctly infer the true phylogeny. Here we used an empirically informed approach and 587 removed sites using the same positions as in the original data matrix (see Supplementary Figures S2-S4). In 588 our simulations it occurred that complete sequences were missing and certain regions (the boundary of the 589 sequences) have higher prevalence of missing data. This uneven distribution of missing data has the effect 590 that some taxa cannot be placed accurately and the entire gene tree is erroneous. Therefore, it was necessary 591 to remove taxa for a locus with too few non-missing sites, e.g., we removed taxa with fewer than 50% sites.

592
Our simulations and results confirm theoretical expectations that missing sites do not bias phylogenetic 593 inference ( Figure 6). However, we observed that missing sites do bias distributions of summary statistics

606
The primary aim of this study was to estimate a time-calibrated phylogeny of Lampyridae. We used the In the process of selecting robust data subsets, we investigated the phylogenetic accuracy of single AHE 621 loci. We found an unexpected amount of gene tree discordance ( Figure 6). We explored the impact of 622 incomplete lineage sorting, missing sequence data and systematic distribution of highly variable sites using 623 simulations. The observed gene tree discordance cannot be explained due to incomplete lineage sorting.

624
Instead, the gene tree discordance most likely originates from data errors (e.g., paralogs and poor align-625 ments) or model inadequacy ( Figure 5). Surprisingly, our standard phylogenetic substitution models are not 626 adequate for even a single AHE locus. We showed that this model inadequacy is not due to missing data