Abstract
Genomic data can be a powerful tool for inferring ecology, behaviour and conservation needs of highly elusive species, particularly when other sources of information are hard to come by. Here we focus on the dryas monkey, an endangered primate endemic to the Congo Basin with cryptic behaviour and possibly fewer than 250 remaining individuals. Using whole genome data we show that the dryas monkey represents a sister lineage to the vervet monkeys and diverged from them at least 1 million years ago, with additional bi-directional gene flow 590,000 – 360,000 years ago. After bonobo-chimpanzee admixture, this is the second reported case of gene flow that most likely involved crossing the Congo River, a strong dispersal barrier. As the demographic histories of bonobos and the dryas monkey show similar patterns of population increase during this time period, we hypothesise that the fluvial topology of the Congo River might have been more dynamic than previously recognised. As a result of dryas monkey - vervet admixture, genes involved in resistance to the simian immunodeficiency virus (SIV) have been exchanged, possibly indicating adaptive introgression. Despite the presence of several homozygous loss-of-function mutations in genes associated with reduced sperm motility and immunity, we find high genetic diversity and low levels of inbreeding and genetic load in the studied dryas monkey individual. This suggests that the current population carries sufficient genetic variability for the long-term survival of this species. We thus provide an example of how genomic data can directly improve our understanding of elusive species.
Introduction
The dryas monkey (Cercopithecus dryas) is a little-known species of guenon endemic to the Congo Basin, previously only recorded from a single location (Figure 1). The recent discovery of a second geographically distinct population has led to the elevation of its conservation status from critically endangered to endangered; however, little is known about its true population size, distribution range, behaviour, ecology and evolutionary history (IUCN 2019). Based on pelage coloration, the dryas monkey was first classified as the central African representative of the Diana monkey (Cercopithecus diana) (Schwarz 1932). Later examinations of the few available specimens suggested that the dryas monkey should be classified as a unique Cercopithecus species (Colyn et al. 1991). More recently, Guschanski et al. 2013 described the mitochondrial genome of the dryas monkey type specimen, preserved at the Royal Museum for Central Africa (Tervuren, Belgium), providing the first genetic data for this species. The mitochondrial genome-based phylogeny places the dryas monkey within the vervet (Chlorocebus) genus, supporting the previously suggested grouping based on similarities in feeding behavior, locomotion and cranial morphology (Kuroda et al. 1985; Kingdon et al. 2013).
(A) Distribution ranges of vervets. Cercopithecus dryas is currently known from only two isolated populations: The Kokolopori-Wamba and the Lomami National Park. The sequenced dryas monkey individual was sampled from the Lomami population as indicated by the asterisk. (B) The autosomal consensus phylogeny of the dryas monkey and the vervets. The tree topology is supported by multiple analyses (see Methods, Figure S1). Red dotted lines depict previously reported admixture events between the different vervet species (Svardal et al. 2017), with the line width corresponding to the relative strength of admixture. Green arrows show the best-supported timing of admixture between the dryas monkey and the vervets. Sabaeus individuals from Ghana have a higher proportion of shared derived alleles with the dryas monkey than sabaeus individuals from Gambia, which is the result of a strong secondary admixture pulse between tantalus and the Ghana sabaeus population (Svardal et al. 2017). (C) Maximum likelihood Y-chromosome phylogeny, calibrated using the rhesus macaque Y-chromosome as outgroup. (D) Mitochondrial phylogeny based on de-novo assembled and the previously published dryas monkey type-specimen mitochondrial genomes, rooted with the rhesus macaque (not shown). Solid line depicts the consensus tree, whereas thin purple lines show the posterior distribution of species-trees recovered in BEAST.
The vervets currently consist of six recognised species: sabaeus, aethiops, tantalus, hilgerti, pygerythrus and cynosuros (Zinner et al. 2013). They are common in savannahs and riverine forests throughout sub-Saharan Africa, as well as on several Caribbean islands, where they were introduced during colonial times (Figure 1A). However, the dryas monkey is geographically isolated from all vervets and has numerous highly distinct morphological characteristics (Zinner et al. 2013). As vervets are characterized by a dynamic demographic history with extensive hybridisation (Svardal et al. 2017), including female-mediated gene flow, transfer of mitochondrial haplotypes between species can result in discordance between a mitochondrial tree and the true species history. Thus, the phylogenetic placement of the dryas monkey remains uncertain.
With only two known populations and possibly fewer than 250 individuals, the dryas monkey is endangered and of significant conservation concern (Hart et al. 2019). The goal of this study was thus twofold: first, to reconstruct the evolutionary and demographic history of the dryas monkey in relation to that of the vervets and second, to assess the long-term genetic viability of this species. To this end, we sequenced the genome of a male dryas monkey at high coverage (33X), which represents the first genome-wide information available for this cryptic and little-known species.
2.0 Methods
2.1 Sample collection and sequencing
A dryas monkey tissue sample was obtained from an individual from the Lomami population in eastern DRC (Figure 1A) and exported to the United States under approved country-specific permits, where DNA was extracted using the Qiagen DNeasy Blood & Tissue kit, following the manufacturer's protocol. For library preparation and sequencing the DNA was sent to Uppsala University, Sweden. The Illumina TruSeq DNA PCR-free kit was used for standard library preparation and the sample was sequenced on one lane of the Illumina HiSeq X platform (2x 150bp). In addition, we obtained previously published FASTQ data for all currently recognised vervet species from mainland Africa (Warren et al. 2015; Svardal et al. 2017) and divided them into two subsets: 23 sabaeus, 16 aethiops, 11 tantalus, 6 hilgerti, 16 cynosuros and 51 pygerythrus individuals sequenced on the HiSeq 2000 platform at low coverage (the low-coverage dataset), and one individual of each vervet species sequenced on the Genome-Analyzer II platform at medium coverage (the medium-coverage dataset). We also obtained high coverage genomes of two rhesus macaques, to be used as outgroup in our analysis (Xue et al. 2016).
2.2 Alignment, variant detection and filtering
All FASTQ data were adapter and quality trimmed using Trimmomatic on recommended settings (Bolger et al. 2014) and then aligned against the Chlorocebus sabaeus reference genome (ChlSab1.1) (Warren et al. 2015) using bwa-mem (Li 2013) with default settings. Samtools was used to filter out reads below a mapping quality of 30 (phred-scale) (Li et al. 2009). Next, reads were realigned around indels using GATK IndelRealigner (DePristo et al. 2011; McKenna et al. 2010) and duplicates marked using Picard2.10.3 (https://broadinstitute.github.io/picard/). We obtained a genome wide coverage of 33X for the dryas monkey, 27X and 31X for the rhesus macaques, 1.9-6.8X for the low-coverage, and 7.4-9.8X for the medium coverage vervet genomes. Next, we called single nucleotide polymorphisms (SNPs) with GATK UnifiedGenotyper, outputting all sites (DePristo et al. 2011; McKenna et al. 2010). Raw variant calls were then hard filtered following the GATK best practices (Van der Auwera et al. 2013). Additionally, using vcftools (Danecek et al. 2011) we removed all sites below quality 30 (phred-scale), those with more than three times average genome-wide coverage across the dataset, sites for which more than 75% of samples had a non-reference allele in a heterozygous state, indels, and sites within highly repetitive regions as identified from the repeatmask-track for ChlSab1.1.
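The additional site filters described above can be sketched as a single predicate. This is a minimal illustration, not the pipeline used in the study: the `Site` record, its field names and the `passes_filters` helper are hypothetical, and the heterozygosity fraction is assumed here to be computed over samples with a called genotype.

```python
from dataclasses import dataclass, field

@dataclass
class Site:
    """Hypothetical per-site record; field names are illustrative only."""
    qual: float                 # phred-scaled site quality
    total_depth: int            # read depth summed across all samples
    genotypes: list = field(default_factory=list)  # e.g. "0/0", "0/1", "1/1", "./."
    is_indel: bool = False
    in_repeat: bool = False     # overlaps the repeat-mask track

def passes_filters(site, mean_depth, min_qual=30.0,
                   max_depth_factor=3.0, max_het_fraction=0.75):
    """Return True if the site survives the filters listed in the text."""
    if site.is_indel or site.in_repeat:
        return False
    if site.qual < min_qual:
        return False
    # Excess coverage often indicates collapsed repeats or paralogous mapping.
    if site.total_depth > max_depth_factor * mean_depth:
        return False
    # Assumption: the 75% threshold is evaluated over called genotypes only.
    called = [g for g in site.genotypes if g != "./."]
    if called:
        het = sum(1 for g in called if g in ("0/1", "1/0"))
        if het / len(called) > max_het_fraction:
            return False
    return True
```

Sites that are heterozygous in nearly every sample are unlikely to be true polymorphisms and more plausibly reflect mismapped paralogous sequence, which is why such a filter is combined with the depth cut-off.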
2.3 Autosomal phylogeny
We assessed the phylogeny of the dryas monkey in relation to the vervets using genome-wide methods on either the medium or low coverage dataset. We obtained pseudo-haploid consensus sequences from the medium-coverage dataset for each autosome and each species (including one rhesus macaque as outgroup) using GATK-FastaAlternateReferenceMaker (Van der Auwera et al. 2013). We then concatenated autosomal sequences into a multi-species alignment file. Next, non-overlapping 350kb genomic windows were extracted from the alignment and filtered for missing sites using PHAST v1.4 (Hubisz et al. 2011). After filtering we excluded all windows with a length below 200kb, resulting in a final data set consisting of 3703 genomic windows (mean sequence length of 298kb ±SD 18.5kb). Individual approximately-maximum-likelihood gene trees were then generated for the set of windows using FastTree2 v2.1.10 (Price et al. 2010) and the GTR model of sequence evolution. Next, we constructed a coalescent species tree from the obtained gene trees with ASTRAL v5.6.2 (Zhang et al. 2018) on default parameters. ASTRAL estimates branch length in coalescent units and uses local posterior probabilities to compute branch support for the species tree topology, which gives a more reliable measure of support than the standard multi-locus bootstrapping (Sayyari & Mirarab 2016). Additionally, we obtained an extended majority rule consensus tree using the CONSENSE algorithm in PHYLIP v3.695 (Felsenstein 2005). Finally, to explore phylogenetic conflict among the different gene trees, we created consensus networks with SplitsTree v4 using different threshold values (15%, 20% and 25%) (Huson & Bryant 2006).
Next, we used the low-coverage dataset (n = 123) to assess the genetic similarity between all individuals and the dryas monkey by running a Principal Components Analysis (PCA) on all filtered autosomal bi-allelic sites called in all individuals using PLINK1.9 with default settings (Purcell et al. 2007). We estimated (sub)species divergence times using population level data by applying a clustering algorithm as in (Warren et al. 2015). Next, we obtained pseudo-haploid aligned consensus sequences for all 123 low-coverage vervets, the rhesus and the dryas monkey using GATK-FastaAlternateReferenceMaker (Van der Auwera et al. 2013). The R package ape was then used to calculate pairwise distances among all individuals using the Tamura and Nei 1993 model, which allows for different rates of transitions and transversions, heterogeneous base frequencies, and between-site variation of the substitution rate (Paradis et al. 2004; Tamura & Nei 1993). We used the R package phangorn to construct a UPGMA phylogeny from the resulting pairwise distance matrix (Schliep 2011). We obtained estimates of divergence under the assumption of no gene flow as in (Warren et al. 2015), calibrating the tree to the rhesus outgroup, setting the divergence time between the rhesus (Papionini) and vervets (Cercopithecini) at 13.7 million years ago (Hedges et al. 2015).
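The outgroup calibration can be illustrated with a short sketch. It assumes a strict molecular clock and no gene flow, so that divergence time scales linearly with pairwise distance; the function name and the example distances are hypothetical and are not values from this study.

```python
ROOT_AGE_MYA = 13.7  # rhesus (Papionini) - vervet (Cercopithecini) split, Hedges et al. 2015

def scale_to_time(distance, outgroup_distance, root_age=ROOT_AGE_MYA):
    """Convert a pairwise genetic distance into a divergence time (Mya).

    Assumes a strict clock: the distance to the outgroup corresponds to
    root_age, and all other distances are rescaled proportionally.
    """
    if outgroup_distance <= 0:
        raise ValueError("outgroup distance must be positive")
    return root_age * distance / outgroup_distance

# Hypothetical example: a clade whose mean distance is half the outgroup distance
half_way = scale_to_time(0.5, 1.0)
```

On an ultrametric UPGMA tree the same rescaling applies to node heights, so every internal node can be dated relative to the calibrated root.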
2.4 PSMC
To infer long-term demographic history of the studied species, we used a pairwise sequentially Markovian coalescent model (PSMC) (Li & Durbin 2011) based on all medium-coverage vervet genomes and the high coverage dryas monkey genome. We excluded sex chromosomes, repetitive regions and all sites for which read depth was less than five or higher than 100. We scaled the PSMC output using a generation time of 8.5 years (Warren et al. 2015) and a mutation rate of 0.94×10⁻⁸ per site per generation (Pfeifer 2017). Bootstrap replicates (n=100) were performed for the high-coverage dryas monkey genome by splitting all chromosomal sequences into smaller segments using the splitfa implementation in the PSMC software and then randomly sampling with replacement from these fragments (Li & Durbin 2011). Our dryas monkey genome coverage (33X) differed strongly from that of the medium-coverage vervets (7.4-9.8X). As limited coverage is known to bias PSMC results (Nadachowska-Brzyska et al. 2016), we down-sampled the dryas monkey genome to a similar coverage (8X) as the medium-coverage vervet genomes and repeated the PSMC analysis, allowing for qualitative comparisons between species. Demographic estimates from PSMC can also be biased by admixture events between divergent populations, giving a false signal of population size change (Hawks 2017). We thus removed all putatively introgressed regions (see below) from the dryas monkey genome and re-ran the PSMC analysis.
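The scaling step can be sketched as follows. This is an illustration under the conventional PSMC scaling as we understand it, in which the scaled mutation rate theta0 is reported per bin of 100 bp; the function name `scale_psmc` and the default bin size are assumptions, while the mutation rate and generation time are those stated above.

```python
def scale_psmc(t_k, lambda_k, theta0, mu=0.94e-8, generation_time=8.5, bin_size=100):
    """Convert scaled PSMC output to years and effective population size.

    t_k      : scaled time of interval k (in units of 2*N0 generations)
    lambda_k : relative population size of interval k
    theta0   : scaled mutation rate per bin (4*N0*mu*bin_size)
    """
    n0 = theta0 / (4.0 * mu * bin_size)        # reference effective size N0
    years = 2.0 * n0 * t_k * generation_time   # time before present in years
    ne = n0 * lambda_k                         # effective size in interval k
    return years, ne
```

Because both axes depend on the assumed mutation rate and generation time, changing either parameter shifts and stretches the inferred curve without altering its shape, which is why comparisons between species are best treated as qualitative.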
2.6 Mitochondrial phylogeny
We de-novo assembled the mitochondrial genomes for the dryas monkey, the medium-coverage vervets, and rhesus macaque with NOVOplasty (Dierckxsens et al. 2016) at recommended settings, using a K-mer size of 39 and the Chlorocebus sabaeus (ChlSab1.1) mitochondrial reference genome as seed sequence. We also included the previously published mitochondrial genome sequence of the dryas monkey type specimen (Guschanski et al. 2013) and the olive baboon mitochondrial genome (Panu_3.0) used for divergence calibration. A mitochondrial phylogeny was then obtained as in van der Valk et al. 2018. Briefly, mitochondrial genomes were aligned using Clustal Omega with default settings (Sievers et al. 2011). We partitioned the alignment into coding genes (splitting triplets into 1st + 2nd and 3rd position), rRNAs, tRNAs, and non-coding regions using the ChlSab1.1 annotation in Geneious 10.1.2 (https://www.geneious.com). Phylogenetic tree reconstruction and divergence dating was carried out with BEAST2.4.6 (Bouckaert et al. 2014), using the best fitting substitution model for each partition as identified with PartitionFinder2 (Lanfear et al. 2017) and enforcing a strict molecular clock. Birth-rate and clock-rate priors were set as gamma distribution with α = 0.001 and β = 1000 as recommended for mitochondrial phylogenetic analysis (Bouckaert et al. 2014). To calibrate the tree, priors for the split time between rhesus and baboon were set as a log-normal distribution with M = 2.44 and S = 0.095, which corresponds to a divergence time of 9.98–13.2 Mya (95% CI) (Hedges et al. 2015). The Bayesian model was run for an MCMC length of 500 million and we used Tracer 1.6 (Rambaut et al. 2014) to confirm run convergence and obtain probability distributions. The consensus and locus-specific trees were plotted in DensiTree (Bouckaert 2010).
2.7 Y-chromosome phylogeny
The mammalian Y chromosome sequence is enriched for repeats and palindromes, and thus accurate assembly from short-read data is challenging (Tomaszkiewicz et al. 2017; Kuderna et al. 2019). We therefore obtained partial Y-chromosome consensus sequences using the filtered SNP calls. First, we identified all male individuals in our low-coverage dataset using the ratio of X-chromosome to autosomal coverage (Figure S2). Next, GATK FastaAlternateReferenceMaker was used to obtain a Y-chromosome consensus sequence for each male individual using the filtered variant calls as input. We masked all sites for which at least one individual showed a heterozygous call, as these represent SNP-calling errors. Additionally, we masked all repetitive regions and all sites for which one or more female individuals also showed a variant call, as these regions are likely enriched for SNP-errors due to mismappings. A maximum-likelihood tree was then constructed in MEGAX (Kumar et al. 2018) using the Tamura-Nei 1993 model (Tamura & Nei 1993), running 1000 bootstrap replicates and only including sites called in all male individuals. The tree was time calibrated using the rhesus Y-chromosome as outgroup, a rhesus (Papionini) - vervets (Cercopithecini) divergence time of 13.7 million years (Hedges et al. 2015) and assuming a uniform mutation rate.
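The coverage-based sex assignment can be illustrated with a minimal sketch. Males carry a single X chromosome, so their X-to-autosome coverage ratio is expected near 0.5, whereas females are expected near 1.0. The function name and the 0.75 cut-off are assumptions chosen for illustration, not the threshold used in this study.

```python
def infer_sex(x_coverage, autosomal_coverage, male_max_ratio=0.75):
    """Assign sex from the X-chromosome to autosome coverage ratio.

    male_max_ratio is a hypothetical cut-off placed midway between the
    expected male (~0.5) and female (~1.0) ratios.
    """
    if autosomal_coverage <= 0:
        raise ValueError("autosomal coverage must be positive")
    ratio = x_coverage / autosomal_coverage
    return "male" if ratio < male_max_ratio else "female"

# Hypothetical low-coverage samples
sexes = [infer_sex(2.1, 4.0), infer_sex(3.9, 4.0)]
```

In practice the ratios form two well separated clusters, so the exact position of the cut-off matters little as long as it falls between them.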
2.8 Gene flow
We performed a model-free test of unbalanced allele sharing between the vervet individuals and the dryas monkey (D-statistic) (Green et al. 2010) using all autosomal bi-allelic sites called in all the low-coverage individuals and dryas monkey. Pairwise D-statistics were run for all combinations as H1,H2,H3,H4 with dryas monkey as the third ingroup (H3) and the rhesus macaque as the representative of the ancestral variant (H4) (we excluded sites for which the two rhesus-macaque genomes were not identical). Additionally, we calculated frequency-stratified D-statistics on population level as in (de Manuel et al. 2016) to obtain estimates of the direction of gene flow, again using the dryas monkey genome as H3 and rhesus as H4.
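For readers unfamiliar with the test, the D-statistic can be sketched from pseudo-haploid alleles as below. This is a simplified illustration: the function name and input layout are hypothetical, each site is a tuple of single alleles for (H1, H2, H3, H4), and no block-jackknife for significance testing is included.

```python
def d_statistic(sites):
    """Compute D = (ABBA - BABA) / (ABBA + BABA).

    sites: iterable of (h1, h2, h3, h4) alleles, where h4 (outgroup)
    defines the ancestral state and h3 is the dryas monkey.
    """
    abba = baba = 0
    for h1, h2, h3, h4 in sites:
        if h3 == h4:
            continue                      # H3 must carry the derived allele
        if len({h1, h2, h3, h4}) != 2:
            continue                      # keep bi-allelic sites only
        if h1 == h4 and h2 == h3:
            abba += 1                     # H2 shares the derived allele with H3
        elif h1 == h3 and h2 == h4:
            baba += 1                     # H1 shares the derived allele with H3
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba)

# Toy data: three ABBA sites, one BABA site, one uninformative site
toy = [tuple(s) for s in ["ACCA", "GTTG", "ACCA", "CACA", "AAAA"]]
d = d_statistic(toy)
```

A positive value indicates that H2 shares more derived alleles with H3 than H1 does; under incomplete lineage sorting alone, ABBA and BABA counts are expected to be equal and D to be zero.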
A model-based estimate of gene flow was obtained by constructing a maximum likelihood (ML) tree using TreeMix v. 1.12 (Pickrell & Pritchard 2012), accounting for linkage disequilibrium (LD) by grouping sites in blocks of 1,000 SNPs (-k 1000). Based on the previous phylogenetic inferences, the dryas monkey was set as root and a round of rearrangements was performed after all populations were added to the tree (-global). Standard errors (-se) and bootstrap replicates (-bootstrap) were used to evaluate the confidence in the inferred tree topology and the weight of migration events. After constructing a maximum likelihood tree, migration events were added (-m) and iterated 50 times for each value of 'm' (1-10) to check for convergence in terms of the likelihood of the model as well as the explained variance following each addition of a migration event. The inferred maximum likelihood trees were visualised with the built-in TreeMix R script plotting functions.
Additionally, we performed maximum likelihood estimation of individual ancestries using ADMIXTURE (Alexander et al. 2009) based on all autosomal bi-allelic SNPs called in all individuals, filtered for MAF >5% and LD pruned using plink (--indep 50 10 2). The optimal number of clusters (K) in the admixture graph was identified by running 5-fold cross-validations for K 1-10.
2.9 Identifying introgressed regions
To identify putatively introgressed regions in all vervet individuals, we performed a screen for such segments following a strategy outlined in (Martin et al. 2015). Briefly, in sliding windows of 10kb we calculated the Fd statistic (which is related to the D-statistic but not subject to the same biases as D when calculated in sliding windows (Martin et al. 2015)), using all sabaeus individuals from Gambia as ingroup (H1) (as these showed the least amount of shared derived alleles with the dryas monkey), together with Dxy(dryas-X) and Dxy(sabaeus(Gambia)-X), where X refers to the focal individual. Next we calculated the average ratio and standard deviation for Dxy(dryas-X) / Dxy(sabaeus(Gambia)-X) across all windows. As Dxy is a measure of sequence divergence, introgressed windows between the dryas monkey and non-sabaeus vervets are expected to have a relatively low Dxy(dryas-X) and relatively high Dxy(sabaeus(Gambia)-X). Windows showing an excess of shared derived alleles with the dryas monkey (ZFd score > 2) and unusually low divergence towards the dryas monkey (a Dxy(dryas-X) / Dxy(sabaeus(Gambia)-X) ratio deviating by more than 2·SD from the genome-wide mean ratio) were flagged as putatively introgressed.
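The window-flagging logic can be sketched as follows. This is a minimal illustration rather than the script used in the study: the dictionary keys and function name are hypothetical, and the divergence criterion is interpreted here as a ratio more than two standard deviations below the genome-wide mean, which is an assumption about the direction of the threshold.

```python
from statistics import mean, stdev

def flag_introgressed(windows, z_cutoff=2.0, sd_cutoff=2.0):
    """Return indices of windows flagged as putatively introgressed.

    windows: list of dicts with hypothetical keys
      'fd'          : Fd statistic of the window
      'dxy_dryas'   : Dxy between focal individual and dryas monkey
      'dxy_sabaeus' : Dxy between focal individual and Gambian sabaeus
    """
    fds = [w["fd"] for w in windows]
    ratios = [w["dxy_dryas"] / w["dxy_sabaeus"] for w in windows]
    fd_mean, fd_sd = mean(fds), stdev(fds)
    ratio_mean, ratio_sd = mean(ratios), stdev(ratios)
    if fd_sd == 0:
        return []                              # no variation, nothing stands out
    flagged = []
    for i, (fd, ratio) in enumerate(zip(fds, ratios)):
        z_fd = (fd - fd_mean) / fd_sd          # standardised Fd (ZFd)
        low_divergence = ratio < ratio_mean - sd_cutoff * ratio_sd
        if z_fd > z_cutoff and low_divergence:
            flagged.append(i)
    return flagged
```

Requiring both criteria reduces false positives: a high Fd alone can arise in regions of low diversity, whereas true introgression should also reduce divergence to the donor lineage relative to the unadmixed ingroup.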
2.10 Gene ontology enrichment of introgressed genes
Using the Chlorocebus sabaeus genome annotation (Warren et al. 2015) we obtained for each individual all genes within putatively introgressed windows. A gene ontology enrichment was run for all putatively introgressed genes fixed in all non-sabaeus individuals in Blast2GO using Fisher’s exact test (Gotz et al. 2008). Next, for all genes we obtained selection scores from Svardal et al. 2017, which were calculated using a multilocus test of allele frequency differentiation (identifying regions in the genome where the change in allele frequency at the locus occurred too quickly to be explained by drift: XP-CLR selection scores) (Chen et al. 2010). Candidate genes for adaptive introgression were then identified as those with high gene-selection scores and high frequency in the recipient population.
2.11 Heterozygosity and inbreeding
We measured genome-wide autosomal heterozygosity for all individuals with average genome coverage > 3X using realSFS as implemented in ANGSD, considering only uniquely mapping reads (-uniqueOnly 1) and bases with quality score above 19 (-minQ 20) (Fumagalli et al. 2013; Korneliussen et al. 2014). ANGSD uses genotype-likelihoods, rather than variant calls, allowing for the incorporation of statistical uncertainty in low-coverage data and shows high accuracy in estimating heterozygosity for genomes above 3X coverage (Fumagalli 2013; van der Valk et al. 2019). Next, we used PLINK1.9 (Purcell et al. 2007) to identify stretches of the genome in complete homozygosity (runs of homozygosity: ROH) for all individuals with average genome coverage > 3X. To this end, we ran sliding windows of 50 SNPs on the VCF files of all included genomes, requiring at least one SNP per 50kb. In each individual genome, we allowed for a maximum of one heterozygous and five missing calls per window before we considered the ROH to be broken.
2.12 Genetic load
We used the variant effect predictor tool (McLaren et al. 2016) to identify loss-of-function mutations (transcript ablation, splice donor variant, splice acceptor variant, stop gained, frameshift variant, inframe insertion, inframe deletion, and splice region variant), missense and synonymous mutations on the filtered SNP calls. As an indication of mutational load, for each individual we counted the number of genes containing one or more loss-of-function mutations and the total number of missense mutations divided by the number of synonymous mutations (Fay et al. 2001). We excluded all missense mutations within genes containing a loss-of-function mutation, as these are expected to behave effectively neutrally. Dividing by the number of synonymous mutations mitigates species-specific biases, such as mapping bias due to the fact that the reference genome was derived from a sabaeus individual, coverage differences, and mutation rate (Xue et al. 2015).
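The two load summaries can be sketched as below. This is an illustration only: the input is assumed to be a list of (gene, consequence) pairs for one individual, using the consequence terms listed above, and the function name is hypothetical.

```python
# Consequence terms treated as loss-of-function in the text
LOF_TERMS = {
    "transcript_ablation", "splice_donor_variant", "splice_acceptor_variant",
    "stop_gained", "frameshift_variant", "inframe_insertion",
    "inframe_deletion", "splice_region_variant",
}

def load_summary(variants, lof_terms=LOF_TERMS):
    """Return (number of LoF genes, missense/synonymous ratio).

    variants: iterable of (gene, consequence) pairs for one individual.
    Missense variants in genes carrying a LoF mutation are excluded.
    """
    variants = list(variants)
    lof_genes = {gene for gene, cons in variants if cons in lof_terms}
    missense = sum(1 for gene, cons in variants
                   if cons == "missense_variant" and gene not in lof_genes)
    synonymous = sum(1 for _, cons in variants if cons == "synonymous_variant")
    ratio = missense / synonymous if synonymous else float("nan")
    return len(lof_genes), ratio

# Hypothetical toy annotation
toy = [("G1", "stop_gained"), ("G1", "missense_variant"),
       ("G2", "missense_variant"), ("G2", "synonymous_variant"),
       ("G3", "synonymous_variant")]
n_lof, ratio = load_summary(toy)
```

Normalising by synonymous counts means that an individual with uniformly fewer called variants, for instance because of lower coverage, is not mistaken for one with lower load.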
3.0 Results and discussion
3.1 The dynamic demographic history of the dryas monkey and vervets
In this study we present the first genome sequence of the dryas monkey, and show that it is a sister lineage to the vervets (Figure 1B, S1, S4B). Multiple phylogenomic approaches (MSC, pairwise differences, FastTree, SplitsTree, and TreeMix) unambiguously support the same tree topology, thus contradicting the suggested placement within the Chlorocebus genus as inferred from the mitochondrial data (Guschanski et al. 2013). After analysing 3703 gene trees from autosomal genomic windows, we obtained a multi-species-coalescent tree with maximum support values (lpp=1.0) for all nodes (Figure S1A). Our topology is consistent with the vervet phylogeny previously reported by Warren et al. 2015. Although a majority-rule consensus tree (Figure S1B) and network analyses at different threshold values (Figure S1C) showed some poorly resolved nodes within the vervet clade, the position of the dryas monkey as sister lineage to all vervets remains unambiguous (Figure S1).
The divergence times within the vervet genus inferred by us are generally more ancient (+~10% for the oldest nodes, Figure 1B) than reported by Warren et al. 2015, which is likely explained by our use of population level data, whereas one representative genome per species was used by Warren et al. 2015. We estimate that the dryas monkey diverged from the common ancestor of all vervets around 1 million years ago, long before the vervet radiation started ca. 590 thousand years ago (Figure 1B). As we used an ingroup reference for read mapping (ChlSab1.1), these estimates could be slightly biased due to decreased mapping efficiency of reads showing the non-reference allele, thereby decreasing the observed genetic distance between the dryas monkey and the vervets (Günther & Nettelblad 2018). Thus the ~1 million year divergence time likely represents the lower boundary of the true divergence time. The Y-chromosome-based phylogeny shows the same topology and divergence time estimates as the autosomal trees (Figure 1C) and the Principal Components Analysis further supports that the dryas monkey is genetically distinct from all currently recognised vervets (Figure S3). This stands in stark contrast to the inferences based on the mitochondrial genomes, which show the dryas monkey (both our dryas monkey sample and the dryas monkey type specimen) to be nested within the vervet genus Chlorocebus (Guschanski et al. 2013) (Figure 1D).
The discrepancies in tree topologies derived from genomic regions with different inheritance modes (autosomal, Y-chromosomal, and mitochondrial) and the known history of introgression among the vervets (Svardal et al. 2017) suggest a possible role of gene flow in shaping the evolutionary history of the dryas monkey. Therefore, we explored whether ancient admixture events can resolve the observed phylogenetic discordance. We found that sabaeus individuals from Gambia share significantly fewer derived alleles with the dryas monkey (D-statistic) than all other vervets (Figure 2A). As derived alleles should be approximately equally frequent in all species under the scenario of incomplete lineage sorting without additional gene flow (Green et al. 2010), such a pattern strongly suggests that alleles were exchanged between the dryas monkey and the vervets after their separation from Chlorocebus sabaeus (~590kya, Figure 1B). Aethiops individuals share fewer derived alleles with the dryas monkey than tantalus, hilgerti, cynosuros, and pygerythrus (Figure 2A), suggesting that gene flow likely occurred for an extended period of time, at least until after the separation of aethiops from the common ancestor of the other vervets (~490kya, Figure 1B). As tantalus, hilgerti, cynosuros and pygerythrus vervets share a similar amount of derived alleles with the dryas monkey, gene flow most likely ended before the speciation of this group (~360kya, Figure 1B). The small observed differences in the D-statistic between tantalus, hilgerti, cynosuros and pygerythrus (±14%, Figure 2A) are likely the result of drift and selection due to population size differences among these vervet species (Svardal et al. 2017). The inferred history of gene flow is also concordant with the mitochondrial phylogeny and suggests that the dryas monkey mitochondrial genome was introgressed from the common ancestor of all non-sabaeus vervets (Figure 1B and 1C).
Gene flow between the common ancestor of the non-sabaeus vervets and the dryas monkey is also supported by TreeMix and ADMIXTURE analyses, but we note that these model-based estimates rely on accurate allele frequency estimates, which are absent for the dryas monkey population, as it is represented by a single individual (Figure S4).
(A) Pairwise D-statistics for all individuals, using the Chlorocebus sabaeus (Gambia) individual with the least amount of shared derived alleles with the dryas monkey as ingroup. (B) D-statistics stratified by derived allele frequency for each species, using all Chlorocebus sabaeus (Gambia) individuals as ingroup.
In contrast to the sabaeus individuals from Gambia, sabaeus vervets from Ghana also carry an excess of shared derived alleles with the dryas monkey (Figure 2A). The Ghanaian sabaeus population recently hybridised with tantalus, so that a large proportion of their genome (~15%) is of recent tantalus ancestry (Figure 1B) (Svardal et al. 2017). As tantalus individuals carry many shared derived alleles with the dryas monkey, this secondary introgression event likely led to the introduction of dryas monkey alleles into the Ghanaian sabaeus, explaining the high D-statistics in this population.
Next, we obtained approximations of the directionality of gene flow using frequency-stratified D-statistics as in (de Manuel et al. 2016). We found that the vervet populations carry derived alleles shared with the dryas monkey at either low or high frequency, but few such alleles are found at intermediate frequencies (Figure 2B). High frequency alleles in the donor population are more likely to be introgressed during gene flow and are subsequently present at low frequency in the recipient population (Kuhlwilm et al. 2016). Therefore, our observation is consistent with bi-directional gene flow between the dryas monkey and the non-sabaeus vervets. The general direction of the gene flow appears to have been dominated by the introgression from the dryas monkey into the non-sabaeus vervets, as we observe an overall higher proportion of low-frequency shared derived alleles in the vervets (Figure 2B). This difference is particularly pronounced in aethiops, suggesting that the gene flow was primarily from the dryas monkey into the vervets before aethiops separated from the common ancestor of tantalus, hilgerti, cynosuros and pygerythrus. After this split, gene flow likely became more bi-directional with increased introgression events into the dryas monkey, as evidenced by the presence of high frequency putatively introgressed alleles in tantalus, hilgerti, cynosuros and pygerythrus. An alternative explanation is that the observed allele frequencies are driven by selection, as introgressed alleles might be on average selected against (Juric et al. 2016). It is also noteworthy that the dryas monkey carries a vervet mitochondrial genome, which must have been introduced into the population through female-mediated gene flow and eventually became fixed, replacing the ancestral dryas monkey mitochondrial sequence.
This is supported by the clustering of the type specimen and study dryas monkey individual mitochondrial genomes (Figure 1D) and their placement as sister to all non-sabaeus vervets.
The putatively introgressed alleles in the sabaeus population from Ghana are found at intermediate frequency (> 0.25 and < 0.50) (Figure 2B), which is in agreement with the indirect introduction of these alleles through recent introgression from tantalus vervets. As the tantalus population carries derived alleles at high and low frequency (Figure 2B), the Ghanaian sabaeus population received a mixture of both high and low frequency alleles, resulting in the observed intermediate frequency.
Using approaches that are relatively insensitive to demographic processes (e.g. genetic drift and changes in effective population size), we obtained strong support for the presence of gene flow between the dryas monkey and the vervets, but we caution that they may incorrectly infer gene flow in situations with ancestral subdivision (Slatkin & Pollack 2008). However, such ancestral population structure would have to persist over an extended period of time, encompassing multiple speciation events. Furthermore, our inferences of gene flow are supported by the discordance between the nuclear and mitochondrial phylogenies. Thus, gene flow seems to be the most parsimonious explanation.
3.2 Identifying introgressed regions and inferring their functional significance
Using Dxy and Fd statistic (Martin et al. 2015) we identified putatively introgressed regions in sliding windows of 10.000 bp for each individual. As expected under the scenario of secondary gene flow, windows containing an excess of shared derived alleles with the dryas monkey have low genetic divergence (Dxy) to the dryas monkey and high genetic divergence to sabaeus (Figure S5) (de Manuel et al. 2016). Summing over all putatively introgressed windows, we roughly estimate that 0.4% - 0.9% in Gambia sabaeus, 1.6 – 2.4% in aethiops and 2.7% - 4.8% of the genome in the other vervets shows as signature of introgression with the dryas monkey. We estimate that putatively introgressed haplotypes average below 10,000bp (note that we could not detect any haplotypes below the length of 10,000bp with our used method) (Figure S6). The similar length of putatively introgressed haplotypes in all vervet species strongly supports that gene flow occurred in the common ancestor of all non-sabaeus species. The putatively introgressed haplotypes into the Ghanese and (to a lesser extent) Gambian sabaeus population were later introduced during secondary gene flow with tantalus, as an independent recent gene flow event from the dryas monkey into these sabaeus individuals would have resulted in significantly longer haplotypes. However, we caution that our ability to accurately identify the haplotype lengths is low given the short length of the introgressed haplotypes and a single available dryas monkey genome.
The average length of putatively introgressed haplotypes reflects the approximate timing of admixture, as haplotypes are broken apart over time by recombination (Liang & Nielsen 2014). The average length of introgressed haplotypes in the dryas monkey and the vervets is considerably shorter than that of putatively introgressed Neanderthal haplotypes in modern Homo sapiens (~120 kb) (Prüfer et al. 2014), which hybridized 47,000 to 65,000 years ago (Sankararaman et al. 2012). This supports our inferred ancient timing of gene flow (590,000 – 360,000 years ago) between the dryas monkey and the vervets. Interestingly, the proposed gene flow between bonobos (Pan paniscus) and non-western chimpanzees (Pan troglodytes), which have overlapping distribution ranges with the dryas monkey and the vervets respectively, occurred around the same time period, 200,000 – 550,000 years ago. The average length of the introgressed haplotypes in chimpanzees is longer (~25,000 bp), likely due to the longer generation time of chimpanzees compared to vervets (24.6 versus 8.5 years; Langergraber et al. 2012; Warren et al. 2015; de Manuel et al. 2016), leading to fewer recombination events in chimpanzees since the admixture event.
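The relationship between generation time and haplotype length can be made concrete with a rough sketch. It uses the common simplification that the mean length of an introgressed tract is about 1 / (r × t), where r is the recombination rate per base pair per generation and t is the number of generations since admixture. The value r = 1e-8 is an assumption applied to both lineages purely for illustration, so the absolute lengths should not be read as estimates; only the direction of the difference is of interest.

```python
def generations_since(years: float, generation_time_years: float) -> float:
    """Convert a time in years to a number of generations."""
    return years / generation_time_years


def expected_tract_length_bp(
    years_since_admixture: float,
    generation_time_years: float,
    recomb_rate_per_bp: float = 1e-8,  # assumed value, not taken from the study
) -> float:
    """Approximate mean introgressed tract length as 1 / (r * t).

    This ignores the admixture proportion and any selection on introgressed
    tracts, so it is a first-order approximation only.
    """
    t = generations_since(years_since_admixture, generation_time_years)
    return 1.0 / (recomb_rate_per_bp * t)


# Generation times and admixture time ranges as quoted in the text.
for label, gen_time, (recent, old) in [
    ("vervet / dryas monkey", 8.5, (360_000, 590_000)),
    ("chimpanzee / bonobo", 24.6, (200_000, 550_000)),
]:
    longest = expected_tract_length_bp(recent, gen_time)
    shortest = expected_tract_length_bp(old, gen_time)
    print(f"{label}: {shortest:,.0f} to {longest:,.0f} bp")
```

Under these assumptions the expected tracts in the vervet lineage fall well below 10,000 bp, while a longer generation time yields longer tracts for a comparable admixture time, in line with the argument above.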
It is noteworthy that both bonobo-chimpanzee and dryas monkey-vervet gene flow most likely involved the crossing of the Congo River (Figure 1), previously thought to be an impenetrable barrier for mammals (Colyn 1987; Colyn et al. 1991; Colyn & Deleporte 2004; Eriksson et al. 2004; Kennis et al. 2011). As the timing of these two introgression events is highly congruent, the fluvial topology of the Congo River and the geology within the Congo Basin might have been more dynamic 200,000-500,000 years ago than previously recognised (Beadle 1981; Stankiewicz & de Wit 2006).
Having identified introgressed regions, we explored whether they carry functional significance. We find that genes previously identified to be under strong selection in vervets (top 10% of XP-CLR selection scores (Svardal et al. 2017)) are less often introgressed (average introgression frequency 0.021) than genes that did not experience strong selection (bottom 10% of XP-CLR scores; average introgression frequency 0.029). This may be explained by weak selection, on average, against introgressed genes, which may be deleterious in the non-host background, a pattern also observed for Neanderthal genes in Homo sapiens and bonobo genes in the chimpanzee genetic background (Nye et al. 2018; Juric et al. 2016). To identify genes with adaptive functions, we focused on 109 putatively introgressed genes that are fixed in all non-sabaeus vervets. Gene ontology analysis revealed enrichment for genes related to cell junction assembly and cell projection organization (Figure S7). Vervets are the natural host of the simian immunodeficiency virus (SIV) and the genes under strongest selection in the vervets are related to immunity against this virus (Svardal et al. 2017). Accordingly, POU2F1, AEBP2, and PDCD6IP are among the fixed putatively introgressed genes in all non-sabaeus individuals. POU2F1 is a member of the pathway involved in the formation of the HIV-1 elongation complex in humans (Sturm et al. 1993), AEBP2 is an RNA polymerase II repressor (Kim et al. 2009) known to interact with viral transcription (Zhou & Rana 2002; Debaisieux et al. 2012), and PDCD6IP is involved in virus budding of the human immunodeficiency virus and other lentiviruses (Strack et al. 2003; von Schwedler et al. 2003). However, the SIV resistance related genes that experienced the strongest selection in vervets (e.g. RANBP3, NFIX, CD68, FXR2 and KDM6B) do not show a signal of introgression between vervets and the dryas monkey.
Therefore, adaptive importance is plausible for some of the introgressed loci, but it does not appear to be a strong driver for retaining particular gene classes. An adaptive function for some introgressed genes in a species-specific background cannot be excluded and would warrant dedicated investigation.
3.3 Genomic view on conservation of the endangered dryas monkey
The dryas monkey is considered the only representative of the dryas species group (Grubb et al. 2003) and is listed as endangered in the IUCN Red List due to its small population size of ca. 250 individuals (Hart et al. 2019). We therefore used demographic modelling and genome-wide measures of heterozygosity and inbreeding to assess the long- and short-term population history of the dryas monkey. Pairwise Sequential Markovian Coalescent (PSMC) analysis of the dryas monkey genome revealed a dynamic evolutionary history, with a marked increase in effective population size starting ca. 500,000 years ago, followed by continuous decline in the last ~200,000 years (Figure 3A). The date of the population size increase coincides with our estimated onset of gene flow. To eliminate the possibility that our PSMC inferences are driven by increased heterozygosity due to gene flow, we removed all putatively introgressed regions and repeated the PSMC analysis, which produced the same results (Figure 3A). We therefore suggest that the population size increase in the dryas monkey and the associated likely range expansion facilitated secondary contact between the dryas monkey and the vervets.
(A) PSMC analyses for the repeat-filtered dryas monkey genome (red), after removing putatively introgressed regions (black), and after downsampling to 8X coverage (dotted line). As the curve is strongly shifted at lower coverage, the downsampled genome was used for between-species comparison in (B). (B) PSMC analysis of medium coverage genomes (7.4-9.8X) for all vervets and the downsampled dryas monkey genome (8X).
As previously reported, low genomic coverage shifts the PSMC trajectory and makes inference less reliable, particularly for more recent time periods (Figure 3A; Nadachowska-Brzyska et al. 2016). Therefore, to allow for demographic comparisons to the vervets, we re-ran the PSMC on the dryas monkey genome down-sampled to a coverage similar to that of the genomic data available for the vervets. This analysis strongly suggests that 100,000 – 300,000 years ago the dryas monkey population was the largest among all vervets, possibly numbering in the tens of thousands of individuals (Figure 3B).
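PSMC reports time and population size in scaled units, so trajectories such as those in Figure 3 depend on the mutation rate and generation time used for rescaling. The sketch below shows the standard conversion (N0 = θ0 / (4μs), time in years = 2·N0·t·g, Ne = N0·λ, with s the bin size in base pairs). The numerical inputs in the example, including the mutation rate, are hypothetical placeholders and not the parameters used in this study.

```python
from typing import List, Sequence, Tuple


def scale_psmc(
    theta0: float,
    scaled_times: Sequence[float],
    scaled_sizes: Sequence[float],
    mutation_rate: float,    # per site per generation (assumed by the user)
    generation_time: float,  # years per generation
    bin_size: int = 100,     # bp per bin in the PSMC input
) -> Tuple[List[float], List[float]]:
    """Convert scaled PSMC output to years and effective population size.

    theta0 is the per-bin scaled mutation rate from the PSMC output;
    scaled_times (t_k) are in units of 2*N0 generations and
    scaled_sizes (lambda_k) are in units of N0.
    """
    n0 = theta0 / (4 * mutation_rate * bin_size)
    years = [2 * n0 * t * generation_time for t in scaled_times]
    ne = [n0 * lam for lam in scaled_sizes]
    return years, ne


# Hypothetical values for illustration only.
years, ne = scale_psmc(
    theta0=0.04,
    scaled_times=[0.5, 1.0],
    scaled_sizes=[2.0, 1.0],
    mutation_rate=1e-8,
    generation_time=8.5,
)
print([round(y) for y in years])  # [85000, 170000]
print([round(n) for n in ne])     # [20000, 10000]
```

Because both axes scale with the assumed mutation rate, and the time axis also with generation time, comparisons between species are most meaningful when the same rescaling assumptions and similar coverage are used throughout.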
The genetic diversity of the dryas monkey (measured as between chromosome-pair differences), a proxy for the adaptive potential of a population (Lande & Shannon 1996), is high compared to that of the much more abundant vervets (Figure 4A). The dryas monkey individual also shows no signs of excessive recent inbreeding, which would manifest itself in a high fraction of the genome contained in long tracts of homozygosity (> 2.5 Mb) (Figure 4B). To estimate genetic load, we identified all genes in the dryas monkey genome containing one or more loss-of-function (LoF) mutations and identified all missense mutations in genes other than those already containing LoF mutation(s) (as missense mutations in genes that already carry a LoF mutation likely behave neutrally). We find multiple genes in the dryas monkey containing a homozygous loss-of-function mutation associated with a disease phenotype in humans (n = 27), including SEPT12, associated with reduced sperm mobility (Kuo et al. 2015), and SLAMF9, associated with reduced immunity to tapeworm infections (Cárdenas et al. 2014). However, genome-wide measures of genetic load, calculated as the ratio between LoF or missense and synonymous mutations, do not show an increased genomic burden of deleterious mutations in the dryas monkey compared to the much more abundant and widely distributed vervets (Figure 4C-D). The demographic history and the genome-wide measures of genetic diversity and genetic load of the dryas monkey thus suggest that the population of this endangered primate might be larger than currently recognised and that the dryas monkey population has good chances for long-term survival, if appropriate conservation measures are implemented.
(A) Average number of heterozygous sites per 1000 base pairs. (B) Fraction of the genome in runs of homozygosity (ROH) above 100 kb (open bars) and fraction of the genome in ROH > 2.5 Mb (solid bars). (C) Ratio of missense to synonymous mutations, excluding all missense mutations within genes containing one or more LoF mutations. (D) Ratio of LoF to synonymous mutations, counting genes with more than one LoF mutation only once.
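The summary statistics described in the four panels can be sketched as follows. This is a minimal illustration under simple assumptions: heterozygous sites, ROH lengths and annotated variants are taken as already called, and the function names, the consequence labels and all numbers in the example are hypothetical.

```python
from typing import List, Sequence, Tuple


def het_per_kb(n_het_sites: int, n_callable_sites: int) -> float:
    """Heterozygous sites per 1000 callable base pairs."""
    return 1000 * n_het_sites / n_callable_sites


def roh_fraction(roh_lengths_bp: Sequence[int], genome_length_bp: int, min_length_bp: int) -> float:
    """Fraction of the genome contained in ROH longer than min_length_bp."""
    return sum(l for l in roh_lengths_bp if l > min_length_bp) / genome_length_bp


def load_ratios(variants: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (LoF/synonymous, missense/synonymous) ratios.

    variants is a list of (gene, consequence) pairs with consequence in
    {"LoF", "missense", "synonymous"}. A gene with several LoF mutations is
    counted once, and missense mutations in genes that carry a LoF mutation
    are excluded.
    """
    lof_genes = {gene for gene, cons in variants if cons == "LoF"}
    n_syn = sum(1 for _, cons in variants if cons == "synonymous")
    n_mis = sum(1 for gene, cons in variants if cons == "missense" and gene not in lof_genes)
    return len(lof_genes) / n_syn, n_mis / n_syn


# Hypothetical toy data.
print(het_per_kb(2_500, 1_000_000))  # 2.5
rohs = [150_000, 3_000_000, 50_000]
print(roh_fraction(rohs, 100_000_000, 100_000))    # ROH above 100 kb
print(roh_fraction(rohs, 100_000_000, 2_500_000))  # ROH above 2.5 Mb
toy = [("G1", "LoF"), ("G1", "LoF"), ("G1", "missense"),
       ("G2", "missense"), ("G2", "synonymous"), ("G3", "synonymous")]
print(load_ratios(toy))  # (0.5, 0.5)
```

Normalising by the number of synonymous mutations makes the load measures comparable between genomes that differ in the number of callable sites.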
Supplementary figures
(A) MSC-based species trees generated by ASTRAL using 3703 autosomal genomic windows. The tree was rooted with the rhesus macaque. Branch lengths are given in coalescent units and are an indicator of gene-tree discordance. The normalized quartet score of this topology is 0.87. Asterisk symbols at nodes indicate maximum local quartet support posterior probabilities (lpp=1.0) and q1 values displayed at each node show the percentage of quartets in all gene trees that agree with that branch topology. (B) Majority-rule consensus tree obtained from 3703 autosomal gene-trees using CONSENSE in PHYLIP v3.695. The number above each branch shows the total number of trees out of all 3703 gene trees supporting the given branch and the number below corresponds to its percentage. (C) SplitsTree consensus networks using 3703 gene trees at median thresholds of 25%, 20% and 15%.
Sexing a subset of individuals based on autosomal versus X chromosome coverage.
Genome wide Principal Components Analysis on all quality filtered bi-allelic sites.
(A) Whole genome ADMIXTURE analysis for 9 predefined clusters based on all quality filtered autosomal bi-allelic SNPs. Note the magnified upper part of the bar for the dryas monkey, which shows similar proportions of all non-sabaeus vervets in the dryas monkey genome. (B) TreeMix analysis for five migration events (m=5). Population allele frequencies are separated by species and geographic origin. (C) Likelihood support values for TreeMix models with 0 to 10 migration events. After modelling five migration events, the model likelihood does not increase any further.
Error bars show ±3SE. Window size = 10 kb. Windows with a high Fd statistic (the excess of shared derived alleles with the dryas monkey over that with sabaeus) have on average an unusually large genetic distance (Dxy) to sabaeus and an unusually low genetic distance to the dryas monkey. Note that the vast majority of windows are at Fd ~ 0 (>90%). Dxy-sabaeus at windows with Fd = 0 (putatively non-introgressed windows) is generally around 0.007, whereas Dxy-dryas at these windows is ~0.012. This is in agreement with the obtained divergence time estimates (Figure 1B) (e.g. the genetic distance of the non-sabaeus vervets to the dryas monkey is on average 1.75 - 2.25 times larger than that of the non-sabaeus vervets to sabaeus). At high Fd, Dxy-dryas is around ½-⅓ that of Fd = 0, suggesting that these windows diverged on average half to one third as long ago as the other windows (thus ~500,000 to 330,000 years ago as the most likely time period of introgression, if we assume the divergence of the dryas monkey from the vervets to be ~1 million years ago, as indicated by our analyses).
Windows are short on average for all species. Sabaeus individuals from Gambia have the lowest number of putatively introgressed windows, which were likely introduced into this species via secondary gene-flow with tantalus. Ghana individuals have a much higher number of these windows, as a result of the recent strong secondary admixture with tantalus.
Top panel shows enriched GO-terms and the genes associated with these terms. Bottom panel shows the enrichment network.
Acknowledgments
We thank the DRC government and ICCN for facilitating sample collection. Sequencing was performed by the SNP&SEQ Technology Platform in Uppsala. The facility is part of the National Genomics Infrastructure (NGI) Sweden and Science for Life Laboratory. The SNP&SEQ Platform is also supported by the Swedish Research Council and the Knut and Alice Wallenberg Foundation. The authors acknowledge support from the Uppsala Multidisciplinary Centre for Advanced Computational Science for assistance with massively parallel sequencing and access to the UPPMAX computational infrastructure. This work was supported by FORMAS (2016-00835) to K.G. Sequence data generated in this study are available in the European Nucleotide Archive under accession number PRJEB32105.