A supermatrix phylogeny of the world’s bees (Hymenoptera: Anthophila)

The increasing availability of large phylogenies has provided new opportunities to study the evolution of species traits, their origins and diversification, and biogeography; yet, with the exception of butterflies, taxonomically well-curated phylogenies are currently lacking for major insect groups. Bees (Hymenoptera: Anthophila) are a large group of insect pollinators that have a worldwide distribution, and a wide variation in ecology, morphology, and life-history traits, including sociality. For these reasons, as well as their major economic importance as pollinators, numerous molecular phylogenetic studies of relationships between and/or within families or genera for this group have been published. We used publicly available sequence data, a family-level phylogenomic backbone, and ultra-conserved element (UCE) data, reconciled to a taxonomic database, to produce a dated phylogeny for bees. The phylogeny comprises 4651 bee species, representing 23% of species and 86% of genera. At family, subfamily, and tribe levels, the data were robust, but between and within some genera relationships remain uncertain. In addition, most of the species with available sequence data are geographically distributed in North America and Europe, highlighting gaps that should be considered in future research to improve our understanding of bee evolution and phylogeography. We provide a summary of the current state of molecular data available and its gaps, and discuss the advantages and limitations of this bee supermatrix phylogeny (available online at beetreeoflife.org), which may enable new insights into long standing questions about evolutionary drivers in bees, and potentially insects. Highlights Bee supermatrix phylogeny constructed with public and published sequence data. Includes 23% of currently recognised species and covers 86% of genera. Provides a summary of remaining gaps in bee phylogenetics. Available online at beetreeoflife.org, with subsetting tool to facilitate comparative analyses.


Introduction
Phylogenies are a key tool, not only for evolutionary biology, but also ecology subdisciplines such as 23 conservation biology, community ecology and macroecology (Beck et al., 2012; Harvey and Pagel, scripts and Microsoft Excel (see Supplementary file 2 and Supplementary file 3) (Microsoft data. For protein coding genes and ribosomal RNA genes we used tblastn and blastn respectively. In 134 total, we obtained data for seven nuclear protein-coding genes (ArgK, CAD, NaK, Pol II, Wnt-1, LW Rh, 135 and EF-1α), two ribosomal genes (16S and 28S), and two mitochondrial coding genes (Cytb and COI) 136 ( longest species accession per locus, we retained all accessions to first assess the gene tree monophyly of nominate species as well as genera. Due to the great number of accessions, EF-1α, LW Rh and COI 147 genes were split into two sets: Apidae and the rest. 148 with the taxonomic list (Supplementary file 1). We refined protein coding gene alignments by hand in 152 BioEdit (Hall, 1999) using the original references as a guide, to have all coding regions strictly in-frame. 153 Despite being time consuming, aligning protein coding genes this way makes adding taxa simpler. For 154 the ribosomal RNA genes, we used the MAFFT alignment as is. The mitochondrial COI was processed slightly differently because of the extensive number of 166 accessions that are publicly available, its nominal intra-specific diversity, and because it is often 167 sequenced in separate fragments. We reduced this gene alignment to a single consensus sequence 168 per species, based on the most common base per site (with ties scored as ambiguous). This approach 169 tends towards the most commonly sequenced sub-lineage and is a simple way to combine data, 170 discount rarer aberrant sequences and rationalize choice of intra-specific lineage complexity. The 171 consensus alignment was then subjected to the same procedure of gene tree and genera monophyly 172 assessment as described for the other genes included in this work. Additionally, we included COI data 173 data-matrices with BLAST v2.13.0, to identify the corresponding orthologous sections. We then 202 selected a subset of suitable UCE loci using several criteria: maximum e-value of 1e-50, >150 sites, and 203 >50% sites matched to the consensus reference. We aligned these loci with MAFFT as previously 204 described for gene fragments, concatenated them into a UCE 'supermatrix', and reconciled species 205 names with the taxonomic database. This supermatrix (57.9 kb in 94 loci for 777 samples) was then 206 compacted down to an amount adequate to reconstruct key tree shape, without excess computational 207 burden, by deleting sites that were <74% complete and then further reduced by one third jack-knifing, 208 resulting in a dataset of 13.3 kb; 85% complete. We then produced a tree to compare to the original 209 published UCE phylogenies, and to assess for non-monophyletic genera and aberrant sequences as 210 previously described. The resulting tree had 71% of nodes with bootstrap support >=90%. Details of 211 the primary transcriptome and UCE data sources are provided in Supplementary file 3. merely use the topology of phylogenomic datasets, with branch length/ages applied post-hoc. 217 218

Supermatrix assembly and analysis 219
Once all supermatrix data components were reconciled to the taxonomic database, aligned and 220 assessed for aberrant elements, they were used to produce a concatenated supermatrix. We did this 221 by taking the best (longest accepted) single exemplar sequence per 'gene' per species, for all species 222 with a total of least 400 sites of any data. The supermatrix was then compacted by removing regions 223 with little or no data (<10% taxa with any data per 'gene') and ambiguous alignment regions with 224 . For genera that contained species that did not share any data with any other member (i.e., the 231 only species in that rank with that genetic data), we reassigned the data to the member of the genus 232 with the most data and removed the source species. In total 40 substitutions were made (0.4% of the 233 matrix). 234

235
The outgroup included five taxa representing the Crabronidae subfamilies Astatinae, Bembicinae, 236 Crabroninae and Philanthinae, based on the phylogenomic 'stub' plus a small amount of gene 237 fragment data. As gene data was patchy across outgroups, taxonomic rank-substitution was also used 238 to create composite family representatives for some taxa. The gene composites included the genera 239 rather than enforce a range of putative calibrations, we simply calibrated the bee root to a broad 278 normal distribution of 100 million years ago (mya) SD 4. This was achieved by drawing a root age from key higher clades. 282 283

Results and discussion 284
Our intention was to produce a useful resource, by collating molecular data for bees, summarizing the 285 current state of the data, and providing a serviceable large-scale supermatrix phylogeny. Molecular 286 data is continuously being generated, and in this process higher taxonomy is also being reconciled. As 287 a result, most of the higher ranks are consistent and well supported: family, subfamily, tribe; but to a 288 lesser extent, genera. We kept the tree inference to an efficient approximation, with appropriate 289 computational effort to match the level of precision inherent in the data. This phylogeny is not a test 290 of bee systematics but a reflection of current information and ideas about bee relationships. In the 291 following sections we provide information regarding the current state of the data, the estimated 292 phylogeny, and the uncertainty for the set of phylogenetic trees produced in this work. 293

294
The phylogeny includes 4651 species and 427 genera, which represent 23% of currently described 295 species and 82% of genera, respectively (Fig. 1). In addition, the genera included contain 96% of all 296 described species. This comprises 36327 sites with a total of 18.6 million base characters from 14543 297 data elements, median 11% complete (3981 sites per species). Supplementary file 3 provides complete 298 information on genetic data for the entire supermatrix. We took a more hands-on approach to 299 curating suitable data rather than an all-in automation to build our dataset (Beaulieu and O'Meara, 300 2018), with several iterations of collation and summary inference guiding its development. Our 301 dataset is not exhaustive but a substantial representation of the total possible molecular data, as of 302 April 2023, providing the largest supermatrix dataset and tree of bees to date (Chesters, 2020; Hedtke 303 et al., 2013). Tables 1, 2 and Supplementary Table S2 show brief summaries of our sampling by family 304 and by gene. A more detailed breakdown by subfamily is provided in Supplementary Table S3.  305 Between families, sampling coverage of named genera ranged from 73-100% but at the species level, 306 coverage was lower ranging between 19-41% (Table 2). Taxonomic sampling is key for phylogenetic 307 inference accuracy (Nabhan and Sarkar, 2012), and while family, subfamily and tribe level 308 relationships are robust, further work is necessary to fill in the gaps and reduce uncertainties in 309 relationships between and within genera.  *All subfamilies are included in this work and total numbers for each rank are based on our revised taxonomic list. 325

Gene sampling 326
Overall, the data were highly skewed with many species (38%) having one gene and only a small 327 number (16%) having more than five (Fig. 3A-B). However, data were well distributed across linages 328 and a small number of species with many genes were phylogenetically spread across the tree (Fig. 3A-329 C). In addition, the use of shortened subsets of the very large phylogenomic data (transcriptome and 330 UCE) provided sufficient information to ensure the tree was consistent with published higher-level 331 phylogenies in topology and relative branch length. Our phylogenomic 'stub' recreated published 332 phylogenomic trees with high bootstrap support (>95%), and the UCE ensured that most of the key 333 nodes matched previously published results (see Similarity to previous studies section). Thus, the 334 addition of thousands of taxa with little data to the core backbone had limited effect on the underlying 335 backbone of the tree. This can be seen in the similarity of the genus representative tree and the 336 equivalent subtree pruned from the full all-species tree where 89% of nodes were recovered; 337 remaining nodes had low support in either or both trees (Fig. S1). Branch lengths were also very similar 338 as are the inferred clade ages (see section below). 339 data structure limitations by identifying groups that are most affected by scarce data overlap, and 350 would therefore profit from targeted data gap filling (see Freyman, 2015). In a cladistic analysis, taxa 351 with no data in common cannot be recovered as sister taxa (Sanderson et al., 2010). In principle, a 352 supermatrix would have at least one GIC to all species (typically mtDNA) with the remaining genes 353 (typically nuclear) limited to key representatives of the major lineages. However, in practice 354 supermatrices tend to be patchy with much missing data, potentially leaving many true sister taxa 355 with little or no data in common. Nonetheless, our most common gene COI covers 3892 species (84%, 356 Table 1). The nuclear genes EF-1a and LW Rh are also widely sampled (44 and 40%, respectively). 357 358 At family and subfamily levels, data was highly robust and structured with a minimum of seven GIC 359 (median of eight) among all subfamilies (Supplementary Table S5, Fig. 3C). Within genera the overlap 360 of data varied (Supplementary Table S5, Fig. S2A-B), with numerous cases of zero shared data between 361 individuals within a genus (see Supplementary Table S6 for details). The generally increasing GIC 362 deeper into the tree and variation in GIC towards the tips is apparent from GIC mapped onto the tree 363 ( Figure S2C; note that in the tree GIC should be >0). There are two approaches when dealing with 364 limited GIC, either remove species or fill in the existing gaps. Excluding rogue taxa may also help but 365 this can be a complicated process (Smith, 2022). Until these gaps are filled in, species relationships 366 within poorly structured genera should be treated with caution, and in the cases where analyses are 367 sensitive to tree topology, these genera could be pruned back to single genus representatives. 368 families (Orr et al., 2022). Our all-species tree agrees with the latter but with weak support (ufbs 73%) 385 while our genus tree has yet another result (Fig.S1). A similar case applies with the mono-or paraphyly 386 of the Centridini. We summarise key differences between this phylogeny and previously published 387 work in Supplementary file 5, which also mentions some genera that are potentially not monophyletic. 388 Danforth (2013) (Fig. S3B). Our dating falls within the range for most taxa (Fig. S3A) (CI) or above their minimum age constraint (Fig. S3B). These comparisons highlight the limited use previous results as secondary calibrations, meaning there is even less known than it appears. estimating a time tree. The unconstrained tree is sufficiently clock-like that apparent rate variation 411 can be accommodated by the PLRS correlated model (e.g., Fig. S1). 412

413
The fossil constraints are either much younger than the lineage appears to be, or they are close. As 414 they are essentially minimum age constraints (and would only be applied as such in PLRS), they would

Melittidae 33
Our date for this family is older but considering it is sister to all other families there is scope for a deeper age.

Root (crown age) of the extant Anthophila 17
Cardinal et al. (2018) based this on the fossil Melittosphex burmensis but a recent paper suggests it might not be a bee (Rosa and Melo, 2021). Previous estimates, including large hymenopteran phylogenomic studies, returned a wide range of ages. Thus, we kept to a broad consensus normal distribution age (mean 100 mya, SD 4) for the extant Anthophila, minimizing the disparity.