A New Paralog Removal Pipeline Resolves Conflict between RAD-seq and Enrichment

Target enrichment and RAD-seq are well-established high throughput sequencing technologies that have been increasingly used for phylogenomic studies, and the choice between methods is a practical issue for plant systematists studying the evolutionary histories of biodiversity of relatively recent origins. However, few studies have compared the congruence and conflict between results from the two methods within the same group of organisms, especially in plants, where extensive genome duplication events may complicate phylogenomic analyses. Unfortunately, currently widely used pipelines for target enrichment data analysis do not have a vigorous procedure for remove paralogs in Hyb-Seq data. In this study, we employed RAD-seq and Hyb-Seq of Angiosperm 353 genes in phylogenomic and biogeographic studies of Hamamelis (the witch-hazels) and Castanea (chestnuts), two classic examples exhibiting the well-known eastern Asian-eastern North American disjunct distribution. We compared these two methods side by side and developed a new pipeline (PPD) with a more vigorous removal of putative paralogs from Hyb-Seq data. The new pipeline considers both sequence similarity and heterozygous sites at each locus in identification of paralogous. We used our pipeline to construct robust datasets for comparison between methods and downstream analyses on the two genera. Our results demonstrated that the PPD identified many more putative paralogs than the popular method HybPiper. Comparisons of tree topologies and divergence times showed significant differences between data from HybPiper and data from our new PPD pipeline, likely due to the error signals from the paralogous genes undetected by HybPiper, but trimmed by PPD. We found that phylogenies and divergence times estimated from our RAD-seq and Hyb-Seq-PPD were largely congruent. We highlight the importance of removal paralogs in enrichment data, and discuss the merits of RAD-seq and Hyb-Seq. Finally, phylogenetic analyses of RAD-seq and Hyb-Seq resulted in well-resolved species relationships, and revealed ancient introgression in both genera. Biogeographic analyses including fossil data revealed a complicated history of each genus involving multiple intercontinental dispersals and local extinctions in areas outside of the taxa’s modern ranges in both the Paleogene and Neogene. Our study demonstrates the value of additional steps for filtering paralogous gene content from Angiosperm 353 data, such as our new PPD pipeline described in this study. [RAD-seq, Hyb-Seq, paralogs, Castanea, Hamamelis, eastern Asia-eastern North America disjunction, biogeography, ancient introgression]

heterozygous sites at each locus in identification of paralogous. We used our pipeline to construct 23 robust datasets for comparison between methods and downstream analyses on the two genera. Our 24 results demonstrated that the PPD identified many more putative paralogs than the popular method 25 HybPiper. Comparisons of tree topologies and divergence times showed significant differences 26 between data from HybPiper and data from our new PPD pipeline, likely due to the error signals 27 from the paralogous genes undetected by HybPiper, but trimmed by PPD. We found that 28 phylogenies and divergence times estimated from our RAD-seq and Hyb-Seq-PPD were largely 29 congruent. We highlight the importance of removal paralogs in enrichment data, and discuss the 30 merits of RAD-seq and Hyb-Seq. Finally, phylogenetic analyses of RAD-seq and Hyb- Seq 31 resulted in well-resolved species relationships, and revealed ancient introgression in both genera. 32 Biogeographic analyses including fossil data revealed a complicated history of each genus 33 involving multiple intercontinental dispersals and local extinctions in areas outside of the taxa's 34 modern ranges in both the Paleogene and Neogene. Our study demonstrates the value of additional 35 steps for filtering paralogous gene content from Angiosperm 353 data, such as our new PPD 36 pipeline described in this study. shallow phylogenetic divergence because the data contain both conserved coding sequences and 77 their flanking variable sequences, whereas RAD-seq data is traditionally thought to be more 78 appropriate for studies of shallow phylogenetic divergence (e.g., studies within a genus or species, 79 and sometimes in relatively young families) due to allele dropout effects (Arnold et al. 2013; 80 Gautier et al. 2013;Eaton 2014, and reviewed in Leaché and Oaks 2017). However, recent studies 81 have shown that the allele-drop out (missing data) in RAD-seq data is usually systematic (or 82 hierarchical) and phylogenetically informative (DaCosta and Sorenson 2016). Recent studies using 83 RAD-seq have resolved phylogenies of lineages diverging during the Eocene and even the late 84 of paralogs (in red), followed by removal of detected paralogs. All three methods detect putative 145 paralogs by observation of more than one unique contigs in an individual mapped to the same 146 reference gene with >85% (HybPiper) or >80% (PHYLUCE and SECAPR) sequence identity or 147 the same contig mapped to more than one reference genes. 148 149 A sequence similarity-based approach for calling paralogous genes, such as those 150 implemented in the three popular pipelines mentioned above, may be sufficient for phylogenetic 151 studies using custom designed probes based on orthologous sequences encompassing the study 152 group. However, for studies leveraging probes built from evolutionarily distant taxa from the focal 153 group of investigation, especially in groups where gene and genome duplication are thought to be 154 common such as plants, sequence similarity between contig and target genes alone may not be 155 sufficient for removing all paralogs. Additional analyses of the sequence data may be needed to 156 remove the potential paralogous sequences before performing phylogenetic analyses. In this study 157 we propose a supplementary criterion to the sequence similarity for detecting and removing the 158 problematic paralogous gene data from Hyb-Seq by examining heterozygous sites within and 159 among individuals in the aligned sequences (Fig. 2). Low heterozygosity in a species-level dataset 160 is expected under the assumption that polymorphisms among species are more likely to be fixed 161 differences between paralogs over deep divergences (Eaton 2014). Even at the shallow level of 162 phylogenetic divergence (e.g. population genetics), high heterozygosity within a locus may also 163 be attributed to paralogs (Hohenlohe et  Most enrichment pipelines do not by default consider the sequence variants of contigs/loci (details 167 in Supplementary Information, available on Dryad). In contrast, the methods developed for 168 analyses of RAD-seq data are usually stricter and use more parameters to filter putative paralogs 169 (Harvey et al. 2016). For example, PyRAD (Eaton 2014) not only uses the parameter of clustering 170 based on sequence similarity within each locus, but also uses two other parameters to filter 171 sequences with high heterozygous sites and with high percentage of shared heterozygous sites 172 (Eaton 2014;McKinney et al. 2017). Therefore, we modified HybPiper (Fig. 2) to include analyses 173 of the heterozygosity parameters for further filtering of putative paralogs from enrichment data. 174 shows the basic PPD functions, including two major steps: first, generating a "degenerated" 177 sequences; second, generating well-trimmed matrix and detecting paralogous genes. The "well-178 trimmed" process (s1 to s7) include filtering out sequences with high percentage of heterozygous 179 gain insights into the congruence between Angiosperm 353 and RAD-seq data. These two genera 192 are common elements of the deciduous forests of the northern hemisphere and discontinuously 193 distributed in eastern Asia (EA) and eastern North America (ENA); the study of which allow us to 194 gain insights into the origin and evolution of this well-known floristic disjunction, one of the first 195 order biodiversity patterns in the Northern Hemisphere that has been extensively studied (Wolfe 196 1980;Boufford and Spongberg 1983;Tiffney 1985a;Xiang et al. 1998bXiang et al. , 2000Wen 1999;197 Johnson (1988) and Nixon (1997), 207 C. pumila in the southeastern United States and C. pumila var. ozarkensis (Ashe) A.E. Murray 208 limited to the Ozark mountains. Phylogenetic studies of Castanea were previously conducted using 209 data from six chloroplast regions in Lang et al. (2006Lang et al. ( , 2007. The studies found that sect. 210 Eucastanon is paraphyletic and sect. Balanocastanon and sect. Hypocastanon did not form a sister 211 group. Authors of these studies also hypothesized an eastern Asia origin of Castanea in the Eocene, 212 followed by range expansion to Europe and North America in the mid Eocene, as an example of 213 the "Out of Asia" pattern (Donoghue and Smith 2004  of 50% samples sharing heterozygosity at a site (i.e., 0.5 shared heterozygosities in ipyrad) of a 304 locus as the cut-off of orthology. The resulting read clusters were aligned across samples using 305 MUSCLE (Edgar 2004). The loci were finally trimmed (5 bases on 3' end) and filtered if they 306 contained over 30 SNPs or 8 indels. To compare across different levels of missing data, we output 307 matrices containing full-length loci in different minimum thresholds of samples using the parallel 308 branching assembly function. We selected thresholds from at least 20% of samples present at a 309 locus, to 80%, increasing in increments of 10%. We named these matrices based on this threshold, 310 e.g. M20, which contained all loci present in at least 20% of samples. All matrices have been 311 uploaded to Dryad. intronerate.py module to recover "intron" and "supercontig" (CDS + intron fragments) sequences. 322 Finally, we generated multiple Hyb-Seq data matrices to assess the phylogenetic effects of 323 character coding, data trimming, and paralogous genes in order to find the best dataset for 324 comparison with the RAD-seq data and for assessing their impacts on divergence time. Specifically, 325 we identified loci containing putative paralogs using our new pipeline, named putative paralog 326 detection (PPD) (available on Github: https://github.com/Bean061/putative_paralog) and built the 327 following matrices for comparisons: exon vs. supercontig matrices of orthologous loci, paralogous 328 loci, or all genes (including the paralogous genes), respectively. The exon matrices contained the 329 sequences of the coding regions only while the supercontig matrices contained sequences of both 330 coding and their flanking regions of the three respective gene groups. The pipeline includes two 331 major steps that occur after generating alignment matrices in HybPiper or other enrichment 332 pipelines: first, "degenerated" sequences are generated from mapping reads back to contigs, which applies the IUPAC ambiguity codes to capture the heterozygous sites in generating the exon and 334 supercontig matrices of the three sets of aforementioned genes; and second, the matrix is trimmed 335 of highly heterozygous sites, misaligned regions, and particularly gappy columns, as one trimming 336 step which is shown in Figure 2 (see details in Supplementary Information available on Dryad). 337 Thresholds for trimming are configurable by the user. To directly compare the impact of character 338 coding, we compared the "degenerated" orthologous matrix generated through both steps of PPD 339 with the orthologous data matrix built from HybPiper without degenerate character coding in the 340 matrix (referred to as "consensus" matrices) and also trimmed these matrices with the second step 341 of PPD (referred to as the "well-trimmed consensus" matrices). We also compared data from the 342 Angiosperm 353 genes with the plastid genes captured by off-target in the Hyb-Seq experiment, 343 and the combined genes from RAD-seq and Hyb-Seq (excluding plastid genes). In addition, we 344 compared data from exon and supercontig, paralogs and orthologous, and applied three different 345 trimming approaches (no trimming, automated trimming in Trimal, and our "well-trimmed" 346 method in PPD) in producing these matrices (see details in Supplementary Information, available 347 on Dryad). For all the Hyb-Seq data matrices below, if no trimming description is directly indicated, 348 the data matrices were "well-trimmed" through PPD (PPD Part 2: s1-s7 in Fig. 2). All Hyb-Seq 349 comparisons are listed in vs. supercontig, orthologs vs. paralogs, and matrices with different trimming parameters (untrimmed, "auto-trimmed" trimming in 357 Trimal, and our "well-trimmed" method in our PPD Pipeline) and mapping parameters (BWA -k). "consensus" matrix contains 358 sequences with heterozygous sites represented by the most common code; "degenerated" matrix contains sequences with 359 heterozygous sites represented by the IUPAC ambiguity codes. Numbers in species trees and gene trees indicate different topologies. 360 In Castanea, three major topologies are included. Topology 1 indicates the topology with ((EA), (ENA, Europe)), topology 2 indicates 361 the topology with ((EA, Europe), (ENA)), and topology 3 indicates the (C. crenata, (Europe, ENA, EA)). In Hamamelis, three major 362 topologies are included. Topology 1 indicates the topology with ((H. mollis, (H. japonica, ENA)), topology 2 indicates (H. mollis, H. 363 japonica), ENA)), and topology 3 indicates (H. japonica, (H. mollis, ENA)). MSA represents the multiple sequence alignment. 364

Phylogenetic Analyses 365
Phylogeny of RAD-seq data. with Hyb-Seq data in the analyses so that the two types of data included identical samples (Table  430 1). The only exception was that RAD-seq data were missing for Castanea sativa due to the failure 431 of herbarium samples in the RAD-seq experiment and lack of silica dried leaf samples at the time 432 the experiment was conducted. Among the different Hyb-Seq data matrices, we performed 433 divergence time dating analyses for several matrices for each genus, including the "degenerated" 434 and "consensus" supercontig matrices for orthologous genes, the "degenerated" exon matrix for 435 orthologous genes, the "degenerated" supercontig matrix for paralogous genes, the combined 436 RAD-Hyb-Seq data, and matrices using different trimming processes to allow comparisons of 437 supercontig and exon and the effect of coding methods, paralogous genes, and trimming methods. Hamamelis to the phylogeny to provide a more accurate evolutionary framework for the study.  Table S1, available on Dryad). The molecular matrix used in the analysis was the 474 "degenerated" supercontig matrix of orthologous genes from Hyb-Seq, as this data set contains 475 the most informative characters of orthologous genes and supported phylogenetic relationships 476 consistent with the completely resolved species tree and the RAD-seq trees (see Results below). 477 The molecular characters were recorded as missing in the fossil taxa. Therefore, we modified the analysis to constrain the fossils with their most closely related clades 485 or species based on paleobotany literature, with dates for fossils set as the value in between the 486 age range of each fossil (Table 3). We also enforced the species tree backbone derived from the 487 molecular data in the analysis to reduce the run time and ensure the recovery of relationships of 488 extant species depicted in the species tree. We ran the analysis for 200 million generations with 489 sampling of trees every 10,000 generations. Quality of the runs and parameter convergence were 490 assessed using Tracer v.1.7.1. The maximum credibility tree of median heights was constructed 491 using TreeAnnotator (Bouckaert et al. 2014) after discarding 20% trees as burn-in. As a 492 comparison, we also constructed a dated total evidence tree using a matrix containing the combined 493 data from RAD-seq and Hyb-Seq and the morphological characters using the same processes 494 described above. The total evidence dating analysis was also repeated with the combined RAD-495 Hyb-Seq data.   Table S2, available on Dryad) based on largely the physical 515 connections between areas, according to (Graham 1999a(Graham , 1999b(Graham , 2018Tiffney and Manchester 516 2001;Mann 2007). The rates were arbitrarily set using the following criteria: as 0.001 for the time 517 slices when the two areas were physically disconnected or isolated by barriers (such as water or 518 dry areas); as 1.0 when the two areas were physically and ecologically connected, and as values 519 between 0.001 -1.0 (e.g., 0.2, 0.5, or 0.75) when the two areas were partially connected by islands 520 that were separated at different levels (Supplementary Table S2 happened between P3 and P2 or P3 and P1, it will result in an excess of shared derived alleles 541 between the taxa that experienced introgression (D > 0 or D < 0). The Dsuite is a tool for 542 calculating D-statistics based on VCF files, assessing the significance via the jackknife test, and 543 generating two-tailed P-values based on the Z-scores. To explain conflicts between the 544 concatenated-gene tree and species tree, we performed two D-statistic tests for Castanea and 545 Hamamelis to assess ILS vs. introgression among specific lineages involved in the phylogenetic 546 conflicts. Specifically, for Castanea, we performed the test with the Hyb-Seq data (primary Hyb-547 Seq-PPD, see Results) (as it contains complete species samples while the RAD-seq data had C. 548 sativa missing) and combined RAD-Hyb-Seq data. We tested for introgression events among ENA 549 Castanea, EA Castanea, and European Castanea (C. sativa) due to the weak support for the sister 550 relationship between C. sativa and the ENA clade in the species tree, but strong support in the  Table S4, available on Dryad). 583 After filtering, the length of each locus from RAD-seq is approximately 105 bp for both Castanea 584 and Hamamelis (Table 2). 585 For Hyb-Seq datasets, the number of loci varied depending on gene region, filtering 586 strategy, and categorization of genes in the alignment (e.g. ortholog vs. paralog or exon vs. 587 supercontig). For complete the loci recovery and sequence length information in different datasets, 588 see Table 2. The average length per locus in exon region of well-trimmed "degenerated" matrix of 589 orthologs was approximately 594 bp in Castanea and 615 bp in Hamamelis, while the average 590 length per locus in the well-trimmed "degenerated" supercontig matrix was much longer, approximately 2448 bp in Castanea and 1643 bp in Hamamelis. The information on loci number, 592 alignment length, average length per locus, number of segregating sites, and number of parsimony 593 informative sites of all matrices derived from different trimming methods are available in Table 2. 594 Overall, many more loci were obtained from RAD-seq compared to Hyb-Seq (344 genes) but the 595 average length per locus is much longer in the Hyb-Seq data, resulting in longer alignment lengths 596 in Hyb-Seq matrices (Table 2). 597 For the combined RAD-Hyb-Seq data, we first checked for duplication of loci present in 598 both data sets. We found 12 loci (out of 1290 in Castanea) and 2 loci (out of 1595 in Hamamelis) 599 in the RAD-seq data of the two genera that could be mapped to the Hyb-Seq data. After removing 600 these loci from the RAD-seq M50, we obtained 1574 loci for Castanea and 1912 loci for 601 Hamamelis (Table 2) for the combined RAD-Hyb-Seq data. The chloroplast DNA data matrices 602 extracted from the Hyb-Seq data due to off-targeting contained 12,647 bp for Castanea and 603 107,068 bp for Hamamelis (Table 2).  Table 2, we have chosen to present results in the main text primarily for the well-614 trimmed "degenerated BWA100" supercontig matrix of orthologs generated from our PPD 615 pipeline. We will refer to the Castanea and Hamamelis alignments generated through this pipeline 616 as primary Hyb-Seq-PPD datasets. Results of additional analyses performed on alternative 617 matrices were presented in Supplementary Information (available on Dryad).  Table S4, available on Dryad). 626 The concatenated gene trees from both ML and BI methods showed a split of the EA and 627 ENA species in two monophyletic sister clades with full support (Fig. 3a (1)  sativa is placed as the sister to the American clade, with high support (Fig. 3a (2) and 657

Supplementary Figs. S8b and S9b available on Dryad). 658
The species trees constructed from this data matrix using ASTRAL-III and SVDQuartets 659 have identical topologies that depict the same species relationships as the concatenated gene tree, 660 with strong support for all nodes except one. The node uniting C. sativa and the ENA species in 661 the species tree from SVDQuartets is relatively weakly supported (72% -see Figure 3c). 662 The analysis of the combined RAD-seq and primary Hyb-Seq-PPD data (the RAD-Hyb-663 Seq data) matrix resulted in identical ML and species tree topologies, which were also identical to 664 the results of both RAD-seq and the primary Hyb-Seq-PPD datasets (see above). The combined data based-species tree is also highly supported except for the node uniting C. sativa and the ENA 666 clade (SVDQuartets 68%) ( Supplementary Fig. S10, available on Dryad). 667 668 Castanea -chloroplast phylogeny.-Analyses of the chloroplast data using RAxML and IQ-tree 669 resulted in an incompletely resolved phylogeny with lower support (Fig. 3b). The trees placed one 670 of the EA species C. crenata as the sister of a well-supported clade consisting of the remaining 671 species (93% bootstrap value and 95% ultrafast bootstrap value), different from the nuclear data 672 from the RAD-seq and Hyb-Seq (Figs. 3a and 3c). Within this large clade, the remaining EA 673 species form a well-supported monophyletic group (94% bootstrap value and 94% ultrafast 674 bootstrap value). However, the relationships among this EA subclade, C. sativa from Europe, and 675 the ENA species (C. pumila and C. dentata) remain unresolved (Fig. 3b). The species tree from RAD-seq data with SVDQuartets method shows a different species tree 710 and is presented in Supplementary Fig. S20, available on Dryad. Parrotiopsis jacquemontiana 711 and Fothergilla gardenii were used as outgroups. All trees were drawn by ggtree in R. 712

714
Hamamelis -Hyb-Seq-The primary Hyb-Seq-PPD data for Hamamelis contains only 160 genes. 715 We increased the gene sampling to 319 genes by adding the additional genes with exon data. The 716 new primary data consisted of 514,351 bp with 5,693 sites that are parsimony informative (Table  717 2). The trees resulting from analyses of this matrix using IQ-Tree, RAxML, and MrBayes are 718 identical in topology (Fig. 4a (2) and Supplementary Figs. S18b and S19b, available on Dryad) 719 and show species relationships identical to those resolved by the concatenated RAD-seq data (Fig. 720 4a (1)), with strong support. The two species trees reconstructed from ASTRAL and SVDQuartets 721 are identical in topology (Fig. 4c), which is also identical to the concatenated Hyb-Seq gene tree 722 and species tree inferred from the RAD-seq data with ASTRAL-III. Similarly, the node connecting 723 the ENA clade and H. japonica is not well supported (0.59 in ASTRAL and 90 in SVDQuartets -724 see Figure 4c). The species trees inferred from the combined RAD-Hyb-Seq data were identical in 725 topologies to trees derived from separate analyses of the RAD-seq and the primary Hyb-Seq-PPD 726 dataset, but with higher support for the sister relationship between the ENA clade and H. japonica SVDQuartets that recognized the monophyly of the EA species, but different from the nuclear 732 phylogeny from other analyses as mentioned above that supported H. japonica and the ENA clade 733 being sister. Relationships among the four ENA species remain unresolved in the plastid gene tree 734 (Fig. 4b). Within the genus, other divergence occurred in the mid-Miocene and late Miocene (Fig. 5a). The 740 divergences times estimated with the primary Hyb-Seq-PPD data are slightly younger (~0.6 to 741 ~5.3 Ma) on the mean values for all of the nodes than those estimated from the RAD-seq data (Fig. 742 5c; Supplementary Table S5, available on Dryad). The European chestnut (C. sativa), which was 743 not missing in the RAD-seq data, diverged from the two ENA species in the mid-Miocene (13.6 Ma, 95% HPD: 10.9-16.7 Ma) based on the primary Hyb-Seq-PPD data (Fig. 5c). The divergence 745 times estimated from the combined RAD-Hyb-Seq data were slightly older for the mean values 746 than those based on the primary Hyb-Seq-PPD data ( Supplementary Fig. S22a, available on Dryad) 747 and slightly younger than the values from the RAD-seq data. 748  Table 2). The results showed that the "consensus" matrix from the HybPiper pipeline and 767 the various "degenerated" matrices from our PPD pipeline resulted in trees with the same topology 768 (Table 2; Supplementary Fig. S23, available on Dryad), but the trees derived from the "consensus" 769 matrices had longer branch lengths than the trees derived from the "degenerated" matrices 770 ( Supplementary Fig. S23, available on Dryad), probably due to the greater number of parsimony 771 informative sites in the "consensus" matrices ( Table 2). The results also showed that the 772 phylogenies inferred from the putative paralogs differed from the orthologous gene phylogenies in 773 terms of topology and branch lengths (Table 2; Supplementary Fig. S24, available on Dryad). Our 774 results further showed congruence between the exon and supercontig trees in Castanea but minor 775 incongruence between the exon and supercontig trees in Hamamelis. The exon tree grouped H. 776 mollis and H. japonica as the sisters with low support (Supplementary Fig. S24, available on 777 Dryad), a relationship not supported by analyses of the supercontig matrix (Figs. 4a (2) and 4c). 778 Finally, our results showed that different levels of trimming of ortholog matrices did not influence 779 topology, but did influence branch lengths ( Supplementary Fig. S24,  primary Hyb-Seq-PPD data ("degenerated"), and "consensus" Hyb-Seq data) resulted in some 785 notable difference between matrices, but with mean estimates that were largely within the range 786 of the HPDs of other estimates. The divergence times estimated from the "consensus" Hyb-Seq 787 data were slightly older (~2 -4 Ma) at every node than those estimated from the primary Hyb-Seq-788 PPD data (Fig. 5; Supplementary Table S5, available on Dryad). The estimates based on the 789 paralogs data identified by PPD were also older than those based on the orthologs and the 95% 790 HPDs from the paralogs is also much larger in Hamamelis ( Supplementary Fig. S25 Table S5, available on Dryad), while the "untrimmed  798 consensus" matrix (i.e., data directly from HybPiper) produced much older estimates than the two 799 trimmed matrices ( Supplementary Fig. S26 and Fig. 5; Supplementary Table S5, available on  800   Dryad; Table 2 (Table 4). In Hamamelis, the analyses found significant signals 809 supporting introgression between H. japonica (as P3 or P2), and the American clade (as P2 or P3) 810 in the RAD-seq data (P = 0.0015), in the primary Hyb-Seq-PPD data (P = 0.0009), and in the 811 combined RAD-Hyb-Seq data (P < 0.0001; Table 4). However, the test did not find any significant 812 signals supporting introgression among H. mexicana, H. vernalis, and H. ovalis in these three data 813 sets (Table 4) Notes: a O represents a closely related outgroup taxon and P1 and P2 are taxa tested for signatures 818 of introgression with P3. b when D significantly greater than 0, it indicates introgression events 819 happened in P2 and P3. "ENA" means eastern North American clade, "EA" means eastern Asian 820 clade, "EA1" means all eastern Asian species except C. crenata. "Primary Hyb-Seq-PPD" means 821 the orthologous dataset generated by our pipeline. "Combined RAD-Hyb-Seq" means the dataset 822 containing both RAD-seq and Primary Hyb-Seq-PPD data. 823

824
Phylogenetic Network.-The analyses with PhyloNet detected one reticulation event in Castanea 825 between C. sativa and the stem of the EA clade excluding C. henryi (Fig. 6a). The inheritance 826 probabilities show that C. sativa has 24% genomic contribution from the ancestor of EA clade 827 excluding C. henryi (Fig. 6a). The analyses also detected one ancient reticulation event in 828 Castanea species in the early Miocene resulted in the EA ("A") and Eur-ENA ("BE") lineages. 854 The EA lineage later diversified further into at least seven species in the mid and late Miocene 855 with one of them spread into western North America. At least three of these species including the 856 one that expanded its range into western North America became extinct in the late Miocene (Fig.  857   7a). The lineage is now represented by the four EA species. The Eur-ENA lineage ("BE") became 858 isolated between the two areas in the early-mid Miocene. The ENA part speciated into two extant 859 ENA species C. pumila and C. dentata in the mid-late Miocene, while the European ("E") part 860 spread back to Asia and speciated further, followed by extinction in Asia in the early-mid Miocene. 861 One of them survived as C. sativa in Europe and the other is represented by a fossil species C. with some notable advantages as discussed in the Introduction. We evaluated the two methods 899 through side-by-side comparisons of phylogenomic results from data generated with RAD-seq and 900 Hyb-Seq of Angiosperm 353 genes in the study of two EA-ENA disjunct genera, Castanea and 901 Hamamelis. Our results showed, although there were fewer loci in the Angiosperm 353 gene data 902 (e.g., 292 and 319 orthologs in well-trimmed "BWA100" datasets vs. 3,597 and 3,935 RAD-seq 903 loci in the M20 matrices that contained the most RAD-seq loci of Castanea and Hamamelis, 904 respectively) ( Table 2 and Supplementary Table S4, available on Dryad), the total alignment length 905 was much longer than that of the RAD-seq data owed to the longer average length of enrichment 906 loci (Table 2). However, the proportion of the segregating sites in the Hyb-Seq data are similar to 907 that of the RAD-seq data (Table 2). A similar pattern of differences in loci number, alignment 908 length, and segregating sites between the RAD-seq and enrichment data was also reported in 909 Harvey et al. (2016), but with lower values for these variables (the segregating sites: 4.07/site and 910 0.69% in the UCE matrices vs. 1.35/site and 1.41% in the RAD-seq matrices). However, the 911 aforementioned differences in sequence data did not affect the topology recovered by phylogenetic 912 methods in either genera (Figs. 3 and 4). Both of our RAD-seq and Hyb-Seq data well resolved 913 the phylogenetic relationships within Castanea and Hamamelis with strong support (Figs. 3 and  914  4). This finding suggests that both approaches can generate a large amount of phylogenetic data 915 for reliable inferences of species relationships within a genus. This is congruent with previous 916 studies showing these two data sets performed similarly in phylogenetic inferences (e.g. Harvey 917 et al. 2016;Manthey et al. 2016). However, the RAD-seq data may be unable to detect gene flow 918 or introgression due to shortage of segregating sites per locus which decreases the chances of 919 finding shared alleles on a very short phylogenetic branch (Harvey et al. 2016). We note that 920 although both data supported the same species tree topology, the species tree may not be the correct 921 species history, as found in our study of Hamamelis (see discussion below on introgression). 922 Although RAD-seq and Hyb-Seq data agreed in species relationships recovered from 923 phylogenetic analyses, they appeared to slightly disagree in divergence times estimated using the 924 same method. This result is unsurprising given the differences in alignment content and branch 925 lengths in likelihood analyses. However, these differences had inconsistent effects across the two 926 genera. The divergence times estimated from the Hyb-Seq data were younger (~1-5 Ma) in 927 Castanea but older (~2-3 Ma) in Hamamelis at all nodes than those estimated from the RAD-seq 928 data. Overall, the divergence time dating results are very similar between the RAD-seq and Hyb-929 Seq data (Fig. 5 and Supplementary Table S5, available on Dryad). 930 The phylogenetic congruence between data from RAD-seq and Hyb-Seq are encouraging 931 and suggest that one may choose one or the other approach for a given phylogenomic study at the 932 RAD-seq would be a better choice over Hyb- Seq. 940 It is noteworthy that our RAD-seq data captured a small number of the Angiosperm 353 941 genes included in the Hyb-Seq data (25 loci of RAD-seq matching to 21 loci of Hyb-Seq in 942 Castanea and 13 loci of RAD-seq matching to 12 loci of Hyb-Seq in Hamamelis), indicating some 943 level of overlap between the two datasets. This demonstrates that the RAD-seq data, as acquired 944 using the experimental methods and analytical tools adopted in this study, likely represents a 945 diverse array of genes and genomic regions, including the conserved low copy nuclear genes of 946 Our results also showed that our new pipeline (PPD) identified many more putative 951 paralogs than HybPiper. Although the "consensus" sequence data generated from HybPiper 952 produced the phylogenetic tree with the same topology as the tree from the "degenerated" 953 sequence data derived from PPD, the HybPiper data contained many more "false" phylogenetic 954 informative sites (due to the presence of paralogous genes), resulting in longer branches affecting 955 divergence time estimation (Fig. 5 and Supplementary Fig. S26 and Table S5, available on Dryad). 956 The sequence data with better cleaning of paralogs and coded with the "degenerated" method are 957 advantageous for phylogenomic studies, as they contain more accurate information for 958 phylogenetic and divergence time estimations. Our analyses of the loci containing paralogous 959 genes from the Hyb-Seq data often resulted in phylogenies different from those inferred from data 960 of the orthologous genes ( Supplementary Fig. S24, available on Dryad). The divergence times 961 estimated from data including potential paralogous genes (i.e. the "consensus" data matrices from 962 HypPiper) are likely older due to the additional variable sites introduced by gene paralogy 963 (Supplementary Table S5, available on Dryad). Thus, our results clearly highlighted that 964 paralogous gene content can negatively impact phylogenetic analyses. Our comparisons indicate 965 that the PPD pipeline can effectively identify paralogs in the Hyb-Seq data and produce data with 966 better quality for phylogenetic and divergence time dating analyses. 967 The advantage of "degenerated" (IUPAC consensus) coding of sequences from Hyb- Seq 968 in phylogenetic studies has also been shown in previous studies. In Kates

Influence of Gene Regions and Trimming Methods in Hyb-Seq Data 993
It is well-recognized that the rate and model of molecular evolution of genes and gene 994 regions affect phylogenetic inferences. The sequence data from Hyb-Seq contain both exon and 995 intron regions that evolve at strikingly different rates. How the exon and intron data would affect 996 phylogenetic influences is worthy of exploration. Gene specific models of molecular evolution 997 chosen from modeltest based on partitioned supercontig sequences may be less accurate, as they 998 must account for the evolution of both the exon and intron regions in the same partition, which 999 often evolve under different rates and models, and thus may not be accurate for either the exon or 1000 the intron regions within the same partition. The amount of phylogenetic information in the exon 1001 and intron regions also differs. How congruent are the analyses using the supercontig data with the 1002 approximate molecular models for the entire gene regions versus analyses using exon data alone 1003 with the more accurate models for exon regions remains to be explored. To gain insights into the 1004 question, we compared results from supercontig and exon data matrices constructed using the well-1005 trimmed procedure and "degenerated" coding method.
Our results showed that in Castanea, the tree topologies inferred from the exon and with larger k value (100) in BWA and "degenerated" coding to obtain Hyb-Seq data in better on the question. Previous studies supported that the Bering land bridge (BLB) and the North and gene flow across the NALB may be possible after the early Eocene until the early Miocene past, whereas the modern species are descendants of Mesophytic Forests developed after the global 1098 climatic cooling in the early Oligocene (Tiffney 1985b). Miocene, lending evidence in agreement with the suggested introgression (Fig. 7b). Given the 1114 biogeographic history and ancestral range of the common ancestor of the two lineages ("ABC"), 1115 the divergence and subsequent introgression of these lineages likely occurred at the high latitudes 1116 of "ABC" along the BLB area where the lineages leading to H. japonica and the ENA clade were 1117 overlapping in range and hybridized before they became extinct in the BLB and western North 1118 America. A hybrid origin of H. japonica between the ancestor of H. mollis and the ancestor of the 1119 ENA lineage is also possible. 1120 (Doye 1908;Johnson 1988). Our result indicated that Sect. Eucastanon that included C. dentata, 1144 C. sativa, C. mollissima, C. seguinii, and C. crenata is paraphyletic and the character of one nut 1145 per cupule in C. pumila (ENA) and C. henryi (EA) is homoplasious. The taxonomic status of the 1146 Allegheny chinkapin (C. pumila) and the Ozark chinkapin has been disputed (Johnson 1988;Nixon 1147Nixon 1997. Johnson (1988) considered the Ozark chinkapin as a variety of C. pumila, while Nixon 1148 (1997) regarded it as a separate species C. ozarkensis. In our study, all individuals representing C. 1149 pumila including the Ozark chinkapins formed a monophyletic group sister to C. dentata with 1150 strong support. Therefore, our phylogenomic study do not support the recognition of C. ozarkensis 1151 as a distinct species. However, the hypothesis should be further tested with population level 1152 sampling of related taxa. 1153 It is noteworthy that the species relationship revealed from PhyloNet is congruent with the 1154 chloroplast gene tree in grouping H. japonica and H. mollis as sisters (Fig. 4b). This relationship 1155 was revealed in the exon orthologous data ( Supplementary Fig. S24, available on Dryad) and one 1156 of the species trees that was reconstructed by SVDquartets with the RAD-seq data (Supplementary 1157 Fig. S20, available on Dryad). A simulation study by Chou et al (2015) showed that when ILS is 1158 very high, ASTRAL performed better than SVDquartets in constructing species history, but when 1159 ILS is low SVD quartet performed better than ASTRAL. In a more recent simulation study, Long 1160 and Kubatko (2018) showed that when gene flow is present, ASTRAL is not consistent while 1161 SVDquartets performed well. Our results appears to agree with the finding of Long and Kubatko 1162 (2018) suggesting SVDquartets is more likely to recover the true species history with data from 1163 numerous gene loci (like the RAD-seq data) when there is high level of inter-species introgression 1164 as observed in Hamamelis. Nonetheless, in all species trees, the support for the sister taxon of the 1165 ENA clade was low, clearly indicating conflicts among gene trees. Our results further indicate that 1166 in addition to species tree methods, data type also affects phylogenetic inferences in the presence 1167 of introgression. regions. The columns show the results from untrimmed "consensus", well-trimmed "consensus",