Abstract
American chestnut (Castanea dentata) was once the most economically and ecologically important hardwood species in the United States. In the first half of the 20th century, an exotic fungal pathogen – Cryphonectria parasitica – decimated the species, killing approximately four billion trees. Two approaches to developing blight resistant American chestnut populations show promise, but both will require introduction of adaptive genomic diversity from wild germplasm to produce diverse, locally adapted reforestation populations. Here we characterize population structure, demographic history, and genomic diversity in a range-wide sample of 384 wild American chestnuts to inform conservation and breeding with blight resistant varieties. Population structure analyses with DAPC and ADMIXTURE suggest that the chestnut range can be roughly divided into northeast, central, and southwest populations. Within-population genomic diversity estimates revealed a clinal pattern with the highest diversity in the southwest, which likely reflects bottleneck events associated with Quaternary glaciation. Finally, we identified genomic regions under positive selection within each population, which suggests that defense against fungal pathogens is a common target of selection across all populations. Taken together, these results show that American chestnut underwent a postglacial expansion from the southern portion of its range leading to three extant populations. These populations will serve as management units for breeding adaptive genetic variation into the blight-resistant tree populations for targeted reintroduction efforts.
1 Introduction
The American chestnut (Castanea dentata (Marsh.) Borkh) is a deciduous tree with a widespread historical range in the eastern United States and southeastern Canada (1, 2). Historical records describe C. dentata as a canopy hardwood tree that typically grew 60 – 90 feet high, but that could exceed 120 feet, with a diameter of 3-5 feet (3, 4). The rapid growth of American chestnut, coupled with its decay resistant wood, previously made it the single most valuable hardwood species in the United States (5, 6). Moreover, the prodigious and reliable seed crop was an important source of food and feed throughout its native range (7).
In the first half of the 20th century, an exotic fungal pathogen, Cryphonectria parasitica, decimated the American chestnut, killing billions of trees. While >400 million American chestnuts survive in the forests of the eastern United States (8), the vast majority of these are collar sprouts from trees that germinated before arrival of the blight (9, 10). Although some of these sprouts are able to flower before being killed by blight, most are reinfected before they are able to produce nuts. The American chestnut is thus considered functionally extinct.
Successful restoration of the American chestnut will rely on an accurate understanding of extant genomic diversity in the wild (11), which can be leveraged to increase effective population size and adaptability of blight-resistant populations (12). The ultimate goal of this project is to prioritize geographic areas for ex situ conservation through propagation of wild trees. These trees will then be used to introgress natural genetic variation into blight resistant populations. The first step in this process is to define broad management units on the basis of population structure and postglacial history for the species (13).
Previous population genetic studies used microsatellite and ddRAD-seq data to characterize the population structure and genetic diversity of American chestnut (14–17). These studies suggest that the southern portions of the American chestnut range are the most genetically diverse and that two populations exist: either a distinct northeastern population (15, 16) or a distinct southern population (17). Furthermore, these studies suggest that the American chestnut underwent a postglacial migration northward from a southern glacial refugium (15, 17). While these studies provide our first glimpse of the extant patterns of genetic diversity in chestnut, they are limited by relatively low geographic sampling density and genotyping methods that may lack power to comprehensively characterize genome-wide diversity and signatures of selection (18, 19).
In this study, we used ≈ 17X whole-genome resequencing (WGS) data for each of 384 American chestnut genotypes, sampled from across the entire historical species range, to (i) estimate the population structure; (ii) describe the demographic history and contemporary barriers to gene flow; (iii) evaluate genomic diversity within each population; and (iv) identify signatures of selection within this iconic species. To our knowledge, our use of WGS is the first for American chestnut and allows the use of modern demographic inference techniques that were inaccessible to previous studies. Whole-genome resequencing captures many times more variant sites and offers greater statistical accuracy than lower-density genotyping methods (20). Working in collaboration with The American Chestnut Foundation (TACF), we will use the results of this study to identify management units for germplasm conservation and to aid the overall effort to restore the American chestnut population to its pre-blight abundance in the eastern US.
2 Materials and Methods
2.1. Leaf sample collection
TACF staff and volunteers sampled leaves from ≈1000 American chestnut trees throughout the species range. The youngest leaves were sampled from each tree in May through July over three growing seasons (2018 – 2020). During and immediately after collection, leaves were kept cool with ice and refrigeration or were desiccated and stored with silica gel. Leaves were shipped to Virginia Tech within 3 weeks of collection and stored at -80°C. Coordinates of the sampled trees were retrieved from TACF’s dentataBase (www.acf.herokuapp.com) and TreeSnap (21)(https://treesnap.org/).
2.2. DNA isolation and sequencing
From a cohort of ≈1,000 available leaf samples, we selected 384 genotypes for WGS (Fig. 1) on the basis of DNA quality and geography, with the goal of including as much of the historical range as possible. Leaves were ground to a fine powder using a Spex 2000 Geno/Grinder and ceramic beads with three 45-second intervals of grinding interspersed by submersion in liquid nitrogen. DNA was extracted with Qiagen’s DNeasy Plant DNA extraction kit. For samples 1-96, a modified phenol-chloroform cleanup step was performed to remove organic contaminants from older leaves. For samples 97 through 384, an additional 100% ethanol wash step was performed to remove organic contaminants that carried over from the previous steps and to dry the spin column membrane. DNA quality and concentration were measured with a NanoDrop OneC and a Qubit 3.0 Fluorometer, respectively. When DNA concentration was low, a secondary CTAB-based extraction was used. DNA was stored in 100-200 µl of AE solution in a -20°C freezer. PCR-free library preparation and genomic sequencing were conducted at the HudsonAlpha Institute for Biotechnology. Libraries were sequenced in batches of 48 on an Illumina NovaSeq 6000 instrument in 2×150bp paired-end mode with a target depth of 20x.
We also performed WGS on a Castanea species reference panel of 95 individuals to detect potential hybrid ancestry in the putative C. dentata samples. The reference panel included 19 C. sativa, 15 C. pumila var. pumila, 10 C. pumila var. ozarkensis, six C. pumila var. alabamensis, four C. dentata, one C. seguinii, two C. henryi, 18 C. crenata, and 20 C. mollissima. Leaf samples from Asian Castanea species (C. mollissima, C. crenata, and C. henryi) were collected from trees planted in the U.S. DNA from C. sativa was provided by R. Costa from trees in Portugal. C. pumila samples were collected from native locations (non-planted) in the U.S. DNA was extracted as above. Sequencing was performed at the Duke University Center for Genomic and Computational Biology, where libraries were prepared with an Illumina Nextera kit and sequenced in an Illumina NovaSeq 6000 S4 flowcell (48 samples per lane) in 2×150bp paired-end mode.
2.3. Bioinformatics
Bioinformatic analyses were performed on Virginia Tech’s Advanced Research Computing (ARC) system. SNPs were called using a custom pipeline adapted from the Broad Institute’s Genome Analysis Toolkit (GATK v3.8) best practices (22). Individual fastq files were aligned using the Burrows-Wheeler Aligner (BWA) mem algorithm with the -M and -R flags and the C. dentata genome as a reference (Castanea dentata v1.1; http://phytozome-next.jgi.doe.gov/). The resulting SAM files were converted to BAM format, sorted, and indexed using SAMtools (23, 24). If samples were sequenced on multiple lanes, the individual lanes for each sample were then combined into a single BAM file using samtools merge. For the Castanea species reference, where PCR was performed in the library preparation, PCR duplicates were removed from the BAM files using MarkDuplicates in picard.
The GATK HaplotypeCaller algorithm (25, 26) was used to call SNPs and INDELs by chromosome. Chromosome GVCF files for each individual were combined with GatherVcfs and individual GVCFs were merged with GenotypeGVCFs. Polymorphisms were quality filtered using the GATK VariantFiltration algorithm following GATK best practices. The VCF file was further filtered in VCFtools v0.1.17 (27) to include only biallelic SNPs (--max-alleles 2) and SNPs with <10% missing data per site (--max-missing 0.9). Overall missing data per individual was checked in VCFtools (--missing-indv), and individuals with >10% missing data were removed. Unless otherwise noted, a minor allele frequency (MAF) filter was applied (MAF = 1/(2n), where n = number of samples).
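The site-level quantities behind these filters can be illustrated with a short Python sketch. It is not the pipeline used here (filtering was done in VCFtools); the function names, the toy genotype encoding (allele-index tuples, with None for a missing call), and the example sample size are assumptions made for illustration only.

```python
def maf_threshold(n_samples):
    """Return 1/(2n): the frequency of one allele copy among 2n diploid chromosomes."""
    return 1.0 / (2 * n_samples)


def missing_fraction(genotypes):
    """Fraction of samples with no genotype call (encoded as None) at a site."""
    return sum(g is None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF at a biallelic site; genotypes are (0/1, 0/1) tuples or None."""
    alleles = [a for g in genotypes if g is not None for a in g]
    alt_freq = sum(alleles) / len(alleles)
    return min(alt_freq, 1.0 - alt_freq)


# Toy site with 10 samples: one missing call and a single alternate allele copy.
site = [(0, 0)] * 8 + [(0, 1), None]

print(maf_threshold(356))            # threshold for a hypothetical 356-sample dataset
print(missing_fraction(site))        # 0.1
print(minor_allele_frequency(site))  # 1 alternate copy among 18 called alleles
```

How the threshold is compared against each site (strictly or inclusively) depends on the filtering tool and is deliberately left out of the sketch.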
VCFtools was used to calculate the sequencing coverage and depth for the filtered VCF file. Genome-wide scans for SNP density were performed using the --SNPdensity option with 50 kb windows. The mean sequencing depth per variant was calculated using the --site-mean-depth option and the mean sequencing depth per sample was determined using the --depth option. A smoothed line of the SNP density results for each chromosome was visualized in ggplot2 using the default ‘gam’ parameter in geom_smooth.
2.4. Species identification and estimation of hybridization in wild American chestnut populations
We used the Castanea species reference dataset to test for evidence of introgression in our wild C. dentata samples. The C. seguinii and C. henryi samples were excluded as there were only one and two samples of these species, respectively. The VCF files from the C. dentata and species reference datasets were combined using bcftools merge and subsequently filtered with bcftools to retain biallelic SNPs and remove singleton SNPs (MAF = 1/(2n), where n = number of samples) (28). Both SNPs and INDELs were retained for these analyses. ADMIXTURE was run with the --cv flag enabled to perform a five-fold cross-validation for K = 1-9 (29). The value of K was chosen as the most likely number of clusters when each Castanea species first separated into at least one distinct group (C. pumila varieties were considered as a single cluster). Putative C. dentata samples were classified as hybrids or misidentified if they had cluster membership >10% with a different species.
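The 10% classification rule can be sketched in Python as follows. This is an illustrative reading of the rule, not the code used in the study: the function names are hypothetical, the ancestry proportions are invented, and summing cluster proportions per species before applying the cutoff is an assumption.

```python
def species_ancestry(q_row, cluster_species):
    """Sum ADMIXTURE cluster proportions by the species each cluster represents."""
    totals = {}
    for proportion, species in zip(q_row, cluster_species):
        totals[species] = totals.get(species, 0.0) + proportion
    return totals


def is_hybrid_or_misidentified(q_row, cluster_species,
                               focal="C. dentata", cutoff=0.10):
    """True if any non-focal species accounts for more than `cutoff` of ancestry."""
    totals = species_ancestry(q_row, cluster_species)
    return any(p > cutoff for sp, p in totals.items() if sp != focal)


# Hypothetical cluster-to-species mapping (several clusters may map to one species).
clusters = ["C. dentata", "C. dentata", "C. dentata",
            "C. pumila", "C. sativa", "C. crenata", "C. mollissima"]

pure_tree = [0.60, 0.30, 0.09, 0.01, 0.00, 0.00, 0.00]    # invented values
hybrid_tree = [0.40, 0.10, 0.05, 0.00, 0.00, 0.00, 0.45]  # invented values

print(is_hybrid_or_misidentified(pure_tree, clusters))
print(is_hybrid_or_misidentified(hybrid_tree, clusters))
```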
2.5. Population structure
Population structure in C. dentata was estimated using ADMIXTURE and a Discriminant Analysis of Principal Components (DAPC) (30). ADMIXTURE uses a model-based approach, like STRUCTURE, to estimate ancestry, but uses maximum likelihood rather than MCMC and is more computationally efficient (29). DAPC combines PCA and a Discriminant Analysis (DA) to identify demographically independent clusters by minimizing within-group variance and maximizing between-group variance (30). The VCF file was first converted to a BED file and linkage-disequilibrium (LD) pruned to include SNPs with R2 values < 0.1 within 50 SNP sliding windows (step size 10 SNPs) in PLINK v1.9 (31). ADMIXTURE was performed using the pruned BED file for K values 1-9. The --cv and -j120 options were enabled to allow for a five-fold cross-validation and for the analysis to run in multithreaded mode using 120 threads, respectively. The lowest CV error score was used to determine the most likely value of K. The populations identified by ADMIXTURE were used for all subsequent analyses unless otherwise noted. Population membership for each sample was determined by highest ancestry proportion from the ADMIXTURE results.
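The two decision rules described above (lowest CV error for K, highest ancestry proportion for membership) are simple enough to sketch in Python. The function names and all numeric values below are made up for illustration and are not results from this study.

```python
def best_k(cv_errors):
    """Return the K with the lowest cross-validation error (dict: K -> error)."""
    return min(cv_errors, key=cv_errors.get)


def assign_populations(q_rows, labels):
    """Assign each sample to the cluster with its highest ancestry proportion."""
    assignments = []
    for row in q_rows:
        top_cluster = max(range(len(row)), key=lambda i: row[i])
        assignments.append(labels[top_cluster])
    return assignments


# Invented CV errors and ancestry proportions for three toy samples.
cv = {1: 0.52, 2: 0.47, 3: 0.45, 4: 0.46}
q_matrix = [
    [0.80, 0.15, 0.05],
    [0.20, 0.55, 0.25],
    [0.05, 0.30, 0.65],
]
labels = ["southwest", "central", "northeast"]

print(best_k(cv))
print(assign_populations(q_matrix, labels))
```

Under this rule, admixed individuals are still placed in a single population, which is worth keeping in mind when interpreting per-population statistics near contact zones.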
The DAPC analysis was performed in R using the adegenet package (32). To create the input file for DAPC, the pruned BED file was converted back to a VCF file in PLINK v1.9 using the --recode vcf option and the --ref-from-fa option. The pruned VCF file was converted to genlight format using the vcfR package (33). The optimal number of clusters was determined with the find.clusters function in adegenet, and the Bayesian Information Criterion (BIC) statistic was calculated for a maximum of five clusters and 355 principal components (PCs) (PCs = n-1, where n is the number of samples). The cluster number with the lowest BIC score was assumed to be the most likely value. The optim.a.score function was used for each DAPC run to determine the optimal number of PCs to retain.
2.6. Demographic history
The sequential Markovian coalescent approach in SMC++ was used to estimate the historical effective population sizes (Ne) of C. dentata from whole genome data (34). SMC++ requires an input VCF file without LD or MAF filtering, so the input VCF file of 356 C. dentata was filtered only to remove sites with high missingness (>10%) and to retain biallelic SNPs. Sites missing in the C. dentata reference genome, which could be erroneously interpreted as long runs of homozygosity, were masked with a BED file generated with a conversion script (https://www.danielecook.com/generate-a-bedfile-of-masked-ranges-a-fasta-file/). The SMC files were generated for each of the 12 chromosomes using the vcf2smc function in SMC++. The estimate function was then used with a per-generation mutation rate of 5.2 × 10⁻⁸, taken from estimates for pedunculate oak (Quercus robur) (35). A 30-year generation time was assumed to convert coalescent events to years. A CSV file of the results was generated with the plot function and -c option for plotting in R (36).
2.7. Migration rates
Migration rates for C. dentata were estimated using Estimated Effective Migration Surfaces (EEMS) (37). EEMS estimates and visualizes effective migration and diversity across a given geographic range using genetic data from known locations (37). The method assumes a stepping-stone model and that isolation by distance contributes to the genetic structure of the populations. The EEMS program requires three input files: an average pairwise differences matrix (DIFFS), a list of sample geographic locations (COORD), and a list of habitat boundary coordinates (OUTER). To reduce computational load, the LD-pruned C. dentata dataset from the population structure analyses was used. The pruned BED file was converted to the EEMS DIFFS file using the EEMS program bed2diffs. The habitat boundary coordinates were mapped with an online tool (http://www.birdtheme.org/useful/v3tool.html).
An initial run was performed with 8 million iterations, 1 million burn-in, 600 nDemes, 9999 thinning, and the hyperparameters in their default setting. EEMS relies on properly adjusted proposal variances, which influence the predicted migration and diversity rates. During the analysis, each of the output proposal acceptance frequencies should be between 20%-30%, but it is sufficient for them to range from 10%-40% (https://github.com/dipetkov/eems/blob/master/Documentation/EEMS-doc.pdf). Following the initial EEMS run, there were two proposal frequencies that were either less than 10% or greater than 40%. To account for this, the mEffectProposalS2 parameter was increased to 0.2 and mrateMuProposalS2 was decreased to 0.002. Four values of demes, 200, 350, 500, and 650 were used to evaluate the number of demes for best fit. Three chains, each with 20 million iterations, four-million burn-in, and 9999 thinning were performed for each deme and their outputs were combined in R using the rEEMSplots2 program. The posterior trace plot was evaluated to determine if the MCMC chain converged. Any EEMS run that did not converge was restarted from the final parameter state in the previous run. The results were mapped in R using rEEMSplots2. A linear regression line was fitted to the default ‘Dissimilarities between pairs of sampled demes’ plot for each deme value, and the deme value with the highest R2 value was determined to be the best fit. Appalachian Mountain peak locations were obtained from Wikipedia (https://en.wikipedia.org/wiki/List_of_mountains_of_the_Appalachians) and a smoothed loess line of these locations was added to the EEMS figure using ggplot2.
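The model-selection step (fit a line to the dissimilarity plot for each deme value, keep the deme value with the highest R2) can be sketched in plain Python. The function names and the toy fitted/observed values are assumptions for illustration; the actual regressions were fitted to rEEMSplots2 output in R.

```python
def r_squared(x, y):
    """Coefficient of determination for an ordinary least-squares line y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def best_deme_value(fits):
    """fits: dict mapping a deme value to (fitted, observed) dissimilarity lists."""
    scores = {demes: r_squared(x, y) for demes, (x, y) in fits.items()}
    return max(scores, key=scores.get), scores


# Invented fitted vs. observed dissimilarities for two hypothetical grids.
toy_fits = {
    200: ([1, 2, 3, 4], [1.0, 2.5, 2.5, 4.0]),
    500: ([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.9]),
}
best, scores = best_deme_value(toy_fits)
print(best, scores)
```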
2.8. Tests for neutrality and nucleotide diversity
We used ANGSD to perform the population statistical analyses since its use of genotype likelihoods has been found to provide less biased estimates than previous methods that require genotype calling (38, 39). The sorted bam files generated from the SNP calling methods were used as the initial input for ANGSD to generate the SAF file (site allele frequency likelihood) for each population. Due to computational considerations, a SAF file was generated for each chromosome within a population using -doSaf 1 and then merged using realSFS cat. The filtering parameters used for each SAF file were adapted from ANGSD recommendations and other hardwood tree studies (40, 41). These parameters were: adjust mapping quality for excessive mismatches (-C 50), minimum base quality score 20 (-minQ 20), minimum mapping quality 30 (-minMapQ 30), discard reads that do not uniquely map (-uniqueOnly 1), only retain sites where the pair could be mapped (-only_proper_pairs 1), and remove ‘bad’ reads (-remove_bads 1). Additionally, to polarize the alleles in the site-frequency-spectrum (SFS), an ancestral reference fasta was generated in ANGSD using 11 C. mollissima BAM files with the following parameters: -doFasta 2 and -doCounts 1. Once the individual chromosomes were merged to generate a SAF file for each population, the SFS was calculated using realSFS.
Population statistics were computed for nucleotide diversity and tests of neutrality in ANGSD. Thetas were first calculated from the SFS in ANGSD using realSFS saf2theta. The statistics for each chromosome were then determined using thetaStat do_stat. Nucleotide diversity was calculated from the thetaStat do_stat output by dividing the pairwise theta (tP) by the number of sites (-nSites) evaluated by ANGSD for that genomic region. A sliding window analysis was also performed on each chromosome using a 50 kb window and a 10 kb slide.
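As a rough sketch of the two calculations in this paragraph, the Python below computes per-site nucleotide diversity from a summed pairwise theta and enumerates 50 kb windows with a 10 kb slide. The function names, the half-open coordinate convention, the choice to emit only full-length windows, and the toy numbers are assumptions; ANGSD's own windowing may differ in these details.

```python
def nucleotide_diversity(pairwise_theta, n_sites):
    """Per-site nucleotide diversity: summed pairwise theta (tP) / sites evaluated."""
    if n_sites == 0:
        return None  # no usable sites in this region
    return pairwise_theta / n_sites


def sliding_windows(chrom_length, window=50_000, step=10_000):
    """Yield 0-based, half-open (start, end) windows that fit fully on the chromosome."""
    start = 0
    while start + window <= chrom_length:
        yield (start, start + window)
        start += step


# Toy values: tP summed over a window in which 50,000 sites were evaluated.
print(nucleotide_diversity(345.0, 50_000))
print(list(sliding_windows(70_000)))
```

Dividing by the number of sites actually evaluated, rather than by the nominal window length, keeps windows with many filtered or masked positions from appearing artificially low in diversity.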
Pairwise FST was evaluated using ANGSD for each of the populations to estimate the population differentiation of C. dentata and identify candidate regions of the genome undergoing selection. Using the per chromosome SAF files as input, each pair of populations per chromosome were used to generate separate pairwise 2D-SFS files. The FST index was then performed on each pairwise 2D-SFS file using realSFS fst index. A sliding window analysis was then performed for the FST calculation with a 50 kb window and a 10 kb slide. The estimated pairwise FST for each chromosome was averaged for all 12 chromosomes to get the mean pairwise FST between populations.
VCFtools was used to calculate the observed heterozygosity for each sample. The filtered SNP dataset was used as input with the --het option. The output file provides the observed and expected homozygosity, F statistic, and number of nucleotide sites analyzed for each sample. To calculate the observed heterozygosity for each sample, we first subtracted the observed homozygosity from the number of sites, and then divided that difference by the number of sites. This provides the proportion of heterozygous sites for each sample. To calculate the average observed heterozygosity for each population, the sample memberships from the ADMIXTURE analysis were used. The observed heterozygosity was averaged over all samples within a population to calculate the average observed heterozygosity for that population. To determine the expected heterozygosity for each sample and each population, the previous steps were performed using the expected homozygosity values in place of observed homozygosity. Differences in observed heterozygosity among populations were tested for significance using a one-way ANOVA in R with the function aov(). Further comparisons between each pair of populations were completed using Tukey multiple pairwise comparisons with the R function TukeyHSD(). For the one-way ANOVA and Tukey tests, a P-value of 0.05 was used as the significance threshold.
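The arithmetic described above can be sketched in Python, assuming the usual tab-delimited layout of a VCFtools .het file (columns INDV, O(HOM), E(HOM), N_SITES, F). The sample names, counts, and population labels below are invented, and the function names are hypothetical.

```python
import csv
import io


def heterozygosity_per_sample(het_text):
    """Parse VCFtools --het style text into {sample: (Ho, He)}."""
    reader = csv.DictReader(io.StringIO(het_text), delimiter="\t")
    result = {}
    for row in reader:
        n_sites = int(row["N_SITES"])
        obs_hom = float(row["O(HOM)"])
        exp_hom = float(row["E(HOM)"])
        # Heterozygous sites = total sites minus homozygous sites.
        result[row["INDV"]] = ((n_sites - obs_hom) / n_sites,
                               (n_sites - exp_hom) / n_sites)
    return result


def population_mean_ho(per_sample, membership):
    """Average observed heterozygosity over the samples assigned to each population."""
    grouped = {}
    for sample, (ho, _he) in per_sample.items():
        grouped.setdefault(membership[sample], []).append(ho)
    return {pop: sum(vals) / len(vals) for pop, vals in grouped.items()}


toy_het = (
    "INDV\tO(HOM)\tE(HOM)\tN_SITES\tF\n"
    "T1\t800\t750.0\t1000\t0.2\n"
    "T2\t700\t750.0\t1000\t-0.2\n"
)
per_sample = heterozygosity_per_sample(toy_het)
means = population_mean_ho(per_sample, {"T1": "southwest", "T2": "southwest"})
print(per_sample)
print(means)
```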
2.9. Detecting signatures of positive selection
We used RAiSD to identify genomic regions undergoing positive selection in each population (42). RAiSD estimates a composite statistic, µ, which evaluates each genomic region based on multiple neutrality and diversity metrics. RAiSD has been shown to excel at identifying regions that are undergoing selective sweeps, while being more computationally efficient than other leading methods (42). The filtered C. dentata VCF file was subset for each of the three populations identified by ADMIXTURE and used as input. Regions that were missing data in the C. dentata reference genome, denoted with N, were masked with the -X flag and a BED file of missing locations. The default parameters were used for each run, except that we assigned a seed for the random number generator. The RAiSD filtering parameters for each population’s dataset retained 8,833,026 SNPs for the northeast population, 7,545,294 SNPs for the central population, and 11,737,005 SNPs for the southwest population for analysis. We used the quantile function in R (36) to identify the 0.1% outlier regions in each of the RAiSD site reports for each population. Results of this analysis were displayed using ggplot2 (43).
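The outlier cutoff can be illustrated with a small Python sketch. It assumes that the 0.1% outliers are the upper tail of the µ distribution (the 99.9th percentile and above) and uses linear interpolation between order statistics, the same rule as the default in R's quantile function; the function names and toy scores are hypothetical.

```python
import math


def quantile(values, p):
    """Quantile by linear interpolation between order statistics (R type 7)."""
    ordered = sorted(values)
    h = (len(ordered) - 1) * p
    lower = math.floor(h)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (h - lower) * (ordered[upper] - ordered[lower])


def upper_tail_outliers(scores, tail=0.001):
    """Return the cutoff and the scores strictly above the (1 - tail) quantile."""
    cutoff = quantile(scores, 1.0 - tail)
    return cutoff, [s for s in scores if s > cutoff]


# Toy example with a 5% tail so the result is easy to follow by hand;
# the analysis described above would correspond to tail=0.001.
toy_scores = list(range(11))
cutoff, outliers = upper_tail_outliers(toy_scores, tail=0.05)
print(cutoff, outliers)
```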
To identify the genes associated with the 0.1% outlier regions for each population, we obtained gene names and location information from the American chestnut genome feature file (Cdentata_673_v1.1.gene.gff3.gz; Castanea dentata v1.1; http://phytozome-next.jgi.doe.gov/). Genes were determined to be associated with the RAiSD outlier regions if they resided within 1 kb of the region. Only unique genes were retained. To determine gene function, we identified the orthologs for each of the outlier genes in Arabidopsis thaliana. The list of outlier C. dentata genes for each population was entered into BioMart on Phytozome (44, 45), and the A. thaliana TAIR10 genome was used to output a list of corresponding orthologs (46). The list of A. thaliana orthologs was entered into the online TAIR gene ontology tool to obtain the function category for each ortholog (47) (https://www.arabidopsis.org/tools/bulk/go/index.jsp). Finally, we performed GO enrichment analyses for biological function on four gene sets. These sets were the unique genes belonging to the southwest, central, and northeast populations, and the set of genes that are shared between all three populations. The A. thaliana orthologs were retrieved for each gene set using the previously described methods, and were submitted to the TAIR GO Term Enrichment for Plants tool (https://www.arabidopsis.org/tools/go_term_enrichment.jsp), which sends the data to the PANTHER Classification System (48). The summary for each gene set analysis using the PANTHER Classification System was as follows. Analysis type: PANTHER Overrepresentation Test (Released 20210224), Annotation Version and Release Date: GO Ontology database DOI: 10.5281/zenodo.5228828 Released 2021-08-18, Reference list: Arabidopsis thaliana (all genes in database), Annotation Data Set: GO biological process complete, Test Type: Fisher’s Exact, Correction: Bonferroni correction for multiple testing for p<0.05.
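The gene-to-region association rule (a gene counts if it lies within 1 kb of an outlier region) amounts to an interval overlap test with padding, sketched below in Python. The function name, tuple layout, gene identifiers, and coordinates are all hypothetical; a real analysis would read coordinates from the GFF3 file and would likely use an indexed interval structure instead of this quadratic loop.

```python
def genes_near_regions(genes, regions, max_distance=1000):
    """Return sorted unique gene IDs overlapping or within max_distance bp of a region.

    genes:   list of (gene_id, chrom, start, end)
    regions: list of (chrom, start, end)
    """
    hits = set()
    for gene_id, g_chrom, g_start, g_end in genes:
        for r_chrom, r_start, r_end in regions:
            if g_chrom != r_chrom:
                continue
            # Pad the region by max_distance on both sides, then test for overlap.
            if g_start <= r_end + max_distance and g_end >= r_start - max_distance:
                hits.add(gene_id)
                break
    return sorted(hits)


# Invented coordinates for illustration.
toy_genes = [
    ("geneA", "Chr01", 10_000, 12_000),  # overlaps the first region
    ("geneB", "Chr01", 15_900, 17_000),  # starts 900 bp past the first region
    ("geneC", "Chr01", 20_000, 21_000),  # too far from any region
    ("geneD", "Chr02", 11_000, 11_500),  # right coordinates, wrong chromosome
]
toy_regions = [("Chr01", 11_000, 15_000)]

print(genes_near_regions(toy_genes, toy_regions))
```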
3 Results
3.1. Genomic datasets
Of the 384 samples sequenced, 86 had greater than 20x coverage, 242 had 10-20x coverage, and 56 had less than 10x coverage. The 384 sample C. dentata VCF file contained 23,720,251 SNPs and INDELs. Eighteen samples with greater than 10% missing data were removed, and 10 additional samples were removed that had > 10% cluster membership with one or more of the Castanea species reference samples in ADMIXTURE analysis. The final C. dentata dataset contained 356 individuals with an average coverage of ≈ 17x and 21,136,994 high quality SNPs (Fig. S1). The pruned C. dentata dataset contained 3,539,550 SNPs. The Castanea species reference dataset contained 92 samples and 62,647,079 SNPs that passed the filtering criteria. The combined and filtered C. dentata and Castanea species reference dataset contained 76,378,648 SNPs and INDELs.
3.2. Hybridization
ADMIXTURE analysis with the combined C. dentata and Castanea species reference panel suggested seven clusters best explained the data, with each Castanea species as an individual cluster in addition to three clusters within C. dentata (Fig. S2). Of the 384 putative American chestnut individuals, 340 had >99% ancestry assigned to the C. dentata clusters. However, ten individuals showed a significant level of ancestry from other Castanea species (>10%) and were removed from further analyses. Three samples were identified as C. pumila, four samples were C. sativa x C. dentata hybrids, and three were C. mollissima x C. dentata hybrids. Overall, the American chestnut samples sequenced did not reveal widespread patterns of significant introgression with other Castanea species.
3.3. Population structure in Castanea dentata
Population structure within C. dentata was best explained by a two or three population model as identified by the DAPC BIC plot and ADMIXTURE CV error plot, respectively (Fig. 2). The three population ADMIXTURE model was characterized by a southwest, central, and northeast cluster (Fig. 2). The southwest and central populations separated in northern Georgia and eastern Tennessee, while the central and northeast populations have an area of admixture in Pennsylvania before becoming more distinctly separated in southern New York. The two population DAPC model included the same southern population and boundary as ADMIXTURE, but the central and northeastern populations were merged. The two analyses largely agreed on population memberships at the same K values.
3.4. Migration rates
The R2 values for the dissimilarity plots for the 650 deme model (R2 = 0.92) and 500 deme model (R2 = 0.93) were similar, and we present the 650 deme model due to its increased resolution. EEMS analysis suggests that the Appalachian Mountains form a barrier to gene flow along their length, with migration running from southeast to northeast on either side of the mountain range (Fig. 3a). A single region of above average effective migration rate was shown in southern West Virginia that crosses the Appalachian Mountain range (Fig. 3a). The ADMIXTURE K=4 model agrees with the EEMS estimates and suggests a further subdivision of the central population on either side of the Appalachian Mountains (Fig. 3b). EEMS also estimates a diversity parameter (q), which reflects genetic dissimilarity between individuals within the same deme, and can thus be thought of as the within-population component of genetic variance. This diversity parameter was generally high throughout the range, though more so in the central and southern portion of the range, and somewhat lower in the northeast and northwest (Fig. S3). Pockets of lower diversity, and thus higher intra-deme genetic similarity, on the outer edges of the species native range may reflect areas of more recent expansion-associated bottlenecks (Fig. S3).
3.5. Demographic history
SMC++ estimates of Ne over time suggest that each population underwent contractions and expansions in Ne, beginning approximately two million years ago. All populations followed a similar pattern of demographic history; however, events in the southwest population lagged those in the central and northeastern populations. Ne rapidly increased for all three populations approximately 6,700-11,700 years ago, after which the central population underwent an additional contraction within the past 7,000 years (Fig. 4). The southwest population had the highest contemporary Ne (Ne(southwest) = 20,306, Ne(central) = 8,347, Ne(northeast) = 13,078).
3.6. Genomic diversity and tests of neutrality
The southwest population had the greatest nucleotide diversity, followed by the central and northeast populations (πsouthwest = 0.0069; πcentral = 0.0064; πnortheast = 0.0058). All populations had negative average Tajima’s D, which was similarly clinal (Dsouthwest = -1.083; Dcentral = -1.016; Dnortheast = -0.335). Consistent with these negative values for Tajima’s D, the SFS plot revealed that for each population, there was a deficiency of rare variants and an excess of high frequency variants (Fig. S4). Sliding window analyses revealed heterogeneous genome-wide Tajima’s D, nucleotide diversity, and FST (Fig. 5). Throughout the genome, the southwest population had the most negative Tajima’s D values, followed by the central and northeast populations (Fig. 5). Conversely, the southwest population had the highest nucleotide diversity values throughout the genome, with decreasing values for the central and northeast populations (Fig. 5). The highest FST values were attributed to the southwest-northeast population pair (Fig. 5).
Consistent with the pattern for nucleotide diversity, heterozygosity was highest in the southwest (Table 1). Northeast Ho was significantly lower than the central and the southwest population (p<0.001, p<0.001) (Fig. S5a). FST estimates between population pairs were relatively low for all comparisons, with the highest divergence between the southwest and northeast populations (FST(southwest-northeast) =0.1076, FST(southwest-central) = 0.0705, FST(central-northeast) = 0.0268).
3.7. Genomic regions and associated genes undergoing positive selection
Every chromosome for each population contained outlier genomic regions identified by RAiSD (Fig. 5). The southwest population had the greatest number of significant regions with 11,733, followed by the northeast population with 8,832, and the central population with 7,534. Within the significant regions, the southwestern population contained 617 outlier genes, which was the most out of the three populations (Fig. 6). Among these outlier genes, 402, 387, and 323 were unique to the southwestern, central, and northeastern populations, respectively, while 49 genes were shared between all three populations (Fig. 6).
3.8. GO enrichment analysis
GO enrichment analysis revealed that “response to stress” was within the top four most annotated gene families for the GO biological process for each population (Fig. S6). Among the unique genes within each population, the southwest population had the most overrepresented GO terms, with 25 (Table S1), while the northeast population had 10 (Table S2), and the central population contained one (Table S3). The GO term “defense response to fungus” was significantly overrepresented in the shared gene set for all three populations (Fold enrichment = 16.5, p = 0.0373). The overrepresented genes for the shared gene set were Caden.02G006600 and Caden.03G057600, which are both in the apoptotic ATPase eukaryotic orthologous group (KOG).
4 Discussion
The American chestnut was an economically and ecologically important tree species that was decimated by an invasive fungal blight approximately 100 years ago. Blight-resistant American chestnut populations are being developed, and these populations will need sufficient genomic diversity to thrive across the diverse and rapidly changing climatic gradient of the species’ historical range. Our goal is to prioritize areas for ex situ conservation through propagation of wild trees. These trees will then be used to introgress adaptive genetic variation into chestnut blight-resistant populations. The first step in this process is to define broad management units on the basis of population structure and postglacial history for the species (13).
To address this goal, we re-sequenced 384 wild chestnut genotypes, which yielded over 23 million variants. These data revealed that population structure in chestnut can be best described by a two or three population model, with a southwestern population being present in both models. The southwestern population was the most genetically diverse, and that diversity decreases as latitude increases. These contemporary patterns of genomic diversity in American chestnut are most likely the result of past population size reductions, and recent expansions, associated with Quaternary glaciation of North America.
4.1. Identification of hybridization in American chestnut
Our primary goal in developing a WGS dataset that included most Castanea species was to exclude hybrids between our wild C. dentata samples and congeners, a significant concern given the history of planting and naturalization of non-native Castanea species in the eastern U.S. (51). We did not detect widespread introgression with other chestnut species, which is consistent with previous assessments of North American Castanea (17). However, our C. dentata sampling focused on trees with morphological characteristics representative of the species. Thus, our sample was biased against genotypes displaying intermediate phenotypes, and hybrids with naturalized non-native Castanea species, as well as the native C. pumila, may nevertheless be present in the wild. A systematic exploration of the relationships among worldwide Castanea species is beyond the scope of this paper, and future studies will use these data to resolve the phylogenetic relationships between Castanea species and within the Castanea pumila species complex.
4.2. Population structure
Our range-wide sample of 356 American chestnut genotypes suggests either two or three genetically distinct populations, separated along a north-south gradient. Both the DAPC and ADMIXTURE analyses agreed with the presence of a southwestern population, but differed on the number of populations in the remainder of the range. Nevertheless, both DAPC and ADMIXTURE gave very similar cluster memberships and patterns at each K value. Previous studies primarily used microsatellite data to describe the population structure and genetic diversity of the American chestnut. Kubisiak and Roberds (14) sampled several locations from North Carolina through Massachusetts and found that a single, genetically diverse metapopulation best fit the data. Two additional studies used sampling sites from Kubisiak and Roberds (14) but analyzed a different set of microsatellite markers. Gailing and Nelson (15) and Müller et al. (16) found that the data best supported two populations, a northeastern and a southwestern population, with Pennsylvania and Maryland being a transition zone between these two groups. These studies did not sample south of North Carolina, which may explain their two-population estimates. More recently, Spriggs and Fertakos (17) used ddRAD-seq data and identified the southern portion of the range as a separate population after a more extensive sampling of the southern region. Their results agreed with our DAPC analysis, whereas our ADMIXTURE analysis suggests a further division between the central and northern portions of the range. This may be due to the increased genomic coverage of our study, our larger sample size and more uniform range-wide sampling, or some combination of these factors.
While the scarcity of viable seed produced by wild American chestnuts makes common gardens difficult, one such study of 13 American chestnut provenances, collected from throughout the native range of C. dentata and planted in Vermont, found that seedlings from warmer and moderate climates grew faster but had greater winter injury than those from northern seed sources (52–54). The same study showed that nuts collected from the north have greater winter cold hardiness than those from the south (53). Thus, like most temperate and boreal tree species, American chestnut exhibits local adaptation to climate. The three populations we identified appear to be differentiated at latitudinal temperature and precipitation breaks along the Appalachian Mountain range. Pennsylvania marks the region of admixture between the central and northeastern populations, and is also where mean winter temperatures transition from above freezing to below freezing (2015 PRISM climate group, https://prism.oregonstate.edu/normals/). In addition, the separation between the central and the southwestern populations occurs in Tennessee, which is the northern boundary for an area of the southern U.S. that receives 250 mm higher annual precipitation on average compared with eastern and northern portions of the historical chestnut range (https://prism.oregonstate.edu/normals/). These three populations may thus comprise broad chestnut ecoregions, reflecting both isolation-by-distance and, potentially, isolation-by-adaptation (55–57).
4.3. Patterns of genomic diversity
Genome-wide nucleotide diversity in American chestnut was comparable to other widely distributed forest tree species (58). The southern portion of the range had the highest levels of nucleotide diversity, which decreased as latitude increased, consistent with previous studies (14–17). Mean heterozygosity estimates across all populations were much lower, and the inbreeding coefficient higher, than previous estimates from microsatellites (15). FST between populations was relatively low, with the greatest divergence between the southwest and northeast populations. Tajima’s D was negative in each population, driven by an excess of rare variants.
The combination of negative Tajima’s D, low levels of heterozygosity, and an excess of rare variants suggests that all populations are undergoing expansion following recolonization bottlenecks, the timing of which varied by population. The more negative values of Tajima’s D for the southwest population, followed by the central and northeastern populations, suggest that the southwest population underwent a more ancient bottleneck. For nucleotide diversity, we observed the inverse. The southwest population had the highest levels of nucleotide diversity, which decreased as latitude increased, suggesting that the southwestern population has a higher long-term effective population size, and was most likely a glacial refugium from which postglacial expansion occurred. This same pattern of inverse relationships between neutrality and diversity tests can be found in Sitka spruce (Picea sitchensis), which underwent recolonization of its northern range from a southern glacial refugium (59). The patterns in genome-wide diversity estimates among populations may also be due to different pressures of natural selection (60). Additionally, we found that the more geographically separated populations were more genetically divergent, which is similar to other temperate species (40, 61). However, the overall genetic distinctness between populations was relatively low, with most of the genomic diversity being accounted for within populations, findings that are consistent with a multi-species assessment of northeastern North American tree species (62).
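To illustrate why an excess of rare variants yields a negative Tajima’s D, the sketch below computes the statistic from an unfolded site frequency spectrum (SFS) using the standard Tajima (1989) constants. It is a textbook calculation on made-up spectra, not the genotype-likelihood-based ANGSD procedure used for the estimates reported here; the function name and the example spectra are assumptions for illustration.

```python
from math import sqrt


def tajimas_d(sfs):
    """Tajima's D from an unfolded site frequency spectrum.

    sfs[i - 1] is the number of segregating sites at which the derived
    allele is carried by i of the n sampled chromosomes (i = 1..n-1).
    """
    n = len(sfs) + 1
    S = sum(sfs)  # number of segregating sites
    if S == 0:
        raise ValueError("no segregating sites")
    # Mean number of pairwise differences (pi)
    pairs = n * (n - 1) / 2
    pi = sum(x * i * (n - i) for i, x in enumerate(sfs, start=1)) / pairs
    # Constants from Tajima (1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Difference between pi and Watterson's estimator, scaled by its std. dev.
    return (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))


# Made-up spectra for n = 10 chromosomes
neutral = [2520 // i for i in range(1, 10)]      # counts proportional to 1/i
rare_excess = [200, 20, 5, 0, 0, 0, 0, 0, 0]     # mostly singletons
intermediate = [0, 0, 0, 0, 50, 0, 0, 0, 0]      # mid-frequency variants only

for name, sfs in [("neutral", neutral), ("rare_excess", rare_excess),
                  ("intermediate", intermediate)]:
    print(name, round(tajimas_d(sfs), 3))
```

Under the neutral expectation the two diversity estimators agree and D is approximately zero. A spectrum dominated by low-frequency variants, as expected after a population expansion, inflates the segregating-site estimator relative to pairwise diversity and pushes D below zero.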
4.4. Demographic history and migration patterns
Ne for each of our populations declined in the distant past and subsequently each experienced repeated size changes before a rapid expansion within the past 11,000 years. In the distant past, changes in Ne for the northeastern population generally paralleled those of the central population, while the southwest population lagged, likely due to more muted impacts of climate change in this area. Curiously, following an initial increase after the last glacial maximum, Ne for the central population again decreased in the recent past. This decline may be due to a recent bottleneck event or to uncertainty in the SMC++ analysis; however, it was not due to the chestnut blight. The decline of the American chestnut populations from the blight occurred within the last century, and most of the trees we sampled are stump sprouts derived from surviving root stock that predates the blight. Thus, the recent dramatic reduction of American chestnut census population size would not influence our demographic history analyses.
The population declines and subsequent expansion follow the Quaternary glaciation events, which began approximately 2.7 million years ago (63). Previous biogeographical assessments for the eastern United States (64), as well as genetic studies, suggest that tree species in eastern North America migrated from southwestern refugia (15, 17). Fossil pollen evidence indicates the Gulf coastal forests in Florida and southern Alabama were a glacial refuge for C. dentata during the Wisconsinan glaciation approximately 25,000–31,000 years ago (51, 65, 66). As the glaciers retreated, pollen evidence indicates that C. dentata migrated north into Tennessee approximately 15,000 years ago (67, 68), and continued northeastward along the Appalachian Mountains (68) at a rate of approximately 100 meters per year, eventually arriving in Connecticut approximately 2,000 years ago (64). However, other glacial refugia may have existed more northeastward and the rate of migration northward may have been slower than what fossil pollen suggests (69). As populations migrated northeastward and diverged, bottlenecks likely occurred, leading to a northeastern population with less genomic diversity than those further south.
When we allowed for four clusters in the ADMIXTURE analysis, the central population separated on either side of the Appalachian Mountain range. The Appalachian Mountain range may thus serve as a barrier to gene flow throughout the American chestnut range, with migration paths northeastward on either side. With four clusters, we observed a region of mixing between eastern and western clusters in southwest Virginia and southeast West Virginia, which the EEMS analysis showed was also an area of enhanced gene flow. A migration route may have existed in this area, which may explain the observation by Gailing and Nelson (15) that chestnuts in Ontario, Canada were more similar to the North Carolina population than their northeastern neighbors.
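As a rough consistency check on the pollen-based timeline above, the short calculation below derives the distance implied by the stated arrival dates and migration rate. The dates and rate are taken from the text; the implied distance is a derived quantity that is not stated in the cited sources, and the variable names are illustrative.

```python
# Values taken from the paragraph above
arrival_tennessee_ya = 15_000    # years ago
arrival_connecticut_ya = 2_000   # years ago
rate_m_per_year = 100            # approximate migration rate

# Time spent migrating from Tennessee to Connecticut
elapsed_years = arrival_tennessee_ya - arrival_connecticut_ya

# Distance implied by a constant rate over that interval, in kilometers
implied_distance_km = elapsed_years * rate_m_per_year / 1000

print(elapsed_years, implied_distance_km)  # 13000 1300.0
```

A constant rate is a simplifying assumption; as noted above, additional northern refugia or slower migration would change this estimate.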
4.5. Signatures of selection in American chestnut populations
The ANGSD sliding window analyses revealed several broad genomic regions of negative Tajima’s D and reduced nucleotide diversity. Some of these regions likely reflect reduced recombination and associated decreased nucleotide diversity near centromeres. However, several intervals, such as on chromosome six and chromosome ten, also had elevated FST for the southwest-central and southwest-northeast population pairs, suggesting that these regions may have experienced recent selection related to environmental adaptation to more northern climates. RAiSD identified several thousand outlier regions that may be targets of selection that were enriched for genes related to “response to stress” and “response to chemical”. Further, the GO enrichment analysis identified “defense response to fungus” as the only overrepresented biological pathway for the set of shared genes between all populations. This suggests that selection related to pathogen pressure is a key feature of global adaptation in American chestnut.
A few of the American chestnuts sampled in this study were large surviving trees that may have low levels of blight resistance; however, most were blight-killed resprouts. Furthermore, American chestnuts from across the species range are highly susceptible to Phytophthora cinnamomi (the oomycete responsible for Phytophthora root rot), a contemporary agent of American chestnut decline (70, 71). To date, the only known source of P. cinnamomi resistance for C. dentata has been introgression from Asian Castanea species, such as C. mollissima and C. crenata (71–73). Thus, the enrichment for defense genes among selection targets we observed is unlikely to be related to these two contemporary pathogens that threaten the species. Rather, it likely reflects historical interactions with, and adaptations to, as yet unknown native fungal pathogens.
This is not to say, however, that the “defense response to fungus” selection targets could have no role in responding to chestnut blight or Phytophthora root rot. In response to blight inoculation, both resistant C. mollissima and susceptible C. dentata show transcriptional responses, though more genes potentially related to blight resistance were upregulated in C. mollissima (74). Additionally, blight resistance in American chestnut backcross populations is polygenic, with multiple loci contributing to resistance (75). The lack of blight resistance in American chestnut may be due to an inadequate or inappropriate transcriptional response to infection. As such, further evaluation of these genes is necessary to determine the cause of their overrepresentation and their possible relationship to blight resistance in wild populations.
4.6. Conclusion
We developed two high-quality WGS datasets that will further population genomics studies of American chestnut and other Castanea species. These data revealed that American chestnut underwent a postglacial migration northward that most likely influenced its current genetic structure. Three populations were identified, separated along a latitudinal gradient, with the southern population having the highest levels of genetic diversity, which suggests it is most likely the oldest population and the refugium from which postglacial expansion occurred. Subtle population structure also revealed a separation of the central population on either side of the Appalachian Mountains, which suggests that these mountains represent a barrier to gene flow. We identified genomic targets of selection that were both unique to each population and shared among all populations, which reflect adaptation to both the abiotic and biotic environments. Future breeding and conservation plans will need to consider these separate populations to preserve unique areas of American chestnut genetic diversity.
While this study describes the patterns of genetic structure that exist within wild American chestnut, it stops short of identifying the genomic signatures of local climate adaptation within each population. Failure to account for traits related to climate within breeding populations could lead to the reintroduction of maladapted blight-resistant trees that fail to compete with other native species (76). Identifying the genomic targets of climate-related selection is a key step in understanding the underlying genomic basis of local adaptation. Our future goal is to characterize the genomic architecture of local adaptation across the species range, and combine those results with this study’s findings to develop strategies for germplasm conservation and breeding to explicitly account for local adaptation in restoration populations.
DATA ACCESSIBILITY AND BENEFIT-SHARING SECTION
Data Accessibility Statement
Genomic Data
The genomic sequences reported in this paper have been deposited in the NCBI SRA (BioProject PRJNA804196)
Sample metadata
Metadata are also stored in the SRA (BioProject PRJNA804196)
Scripts
The scripts used for the SNP calling pipeline and other mentioned scripts can be found at https://github.com/alex-sandercock/American_chestnut_WGS
Benefit-Sharing Statement
Benefits Generated: Benefits from this research include the sharing of our genomic data as listed in the Data Accessibility Statement.
AUTHOR CONTRIBUTIONS
A.M.S., J.G., J.A.H., J.S., and J.W.W. designed the study. Q.Z., H.A.J., T.M.S., J.A.S., S.F.F., K.C., J.S., and J.G. contributed to data collection. A.M.S. analyzed the data. A.M.S., J.A.H., and J.W.W. wrote the manuscript with input from all authors.
ACKNOWLEDGEMENTS
We would like to thank TACF volunteers who participated in sampling of wild trees for this project, as well as Dr. Rita Costa (Instituto Nacional de Investigação Agrária e Veterinária, Portugal) for providing Castanea sativa samples. We thank the HudsonAlpha Institute for Biotechnology and TACF for prepublication use of the genome of Castanea dentata funded by the Colcom Foundation. We thank Advanced Research Computing at Virginia Tech for providing computational resources, and Dr. Robert Settlage for technical support related to the analyses described here. This research was supported by the United States National Institute for Food and Agriculture Projects 1018599 and 1005394, and by a Graduate Fellowship to A.M.S. from the Virginia Tech Institute for Critical Technology and Applied Science.
Footnotes
Author affiliations updated. Supplemental gene annotation figure was altered to reflect percent gene count on the x-axis. Additional VCF file quality analyses for variant depth and density were performed and a new supplementary figure was added to reflect those results. The main findings were not impacted by these revisions.