Abstract
Genital divergence contributes to reproductive barriers between species. A novel accessory structure, the baculum, has independently evolved and been lost throughout mammalian evolution, purportedly driven by sexual selection. In primates, the longest recorded baculum belongs to Macaca arctoides, the bear macaque. This species has been proposed to be of homoploid hybrid origin via ancient hybridization between representatives from the fascicularis and sinica species groups. To investigate the evolutionary origins of the bear macaque and its unique morphology, we used whole genome sequences to quantify gene flow and phylogenetic relationships in 10 individuals from 5 species, including the bear macaque (n=3), and two species each from the sinica (n=3) and fascicularis (n=4) species groups. The results of these analyses were concordant, and identified 608 genes in the bear macaque that supported both clustering between M. arctoides and the sinica group (topo2) and had shared derived alleles between species from the two groups. Similarly, 363 genes supported both clustering between M. arctoides and the fascicularis group (topo3) and had shared derived alleles between both groups. Further, sliding window analysis of phylogenetic relationships revealed 53% of the genomic regions supported placement of M. arctoides in the sinica species group (topo2), 16% supported placement in the fascicularis species group (topo3), and 11% supported M. arctoides in a grouping distinct from the sinica and fascicularis groups (topo1). Genomic regions with topo1 were intersected with previously identified QTL for mouse baculum morphology, and 47 genes were found, including five of sixteen major candidate loci that govern mouse baculum variation (KIF14, KIAA0586, RHOJ, TGM2, and DACT1). Although baculum morphology in the bear macaque is diverged from its parent taxa, it most closely resembles that of the fascicularis group.
Outliers of shared ancestry from the fascicularis species group located within these same QTL regions overlap with the gene BMP4, which is an important component of the hedgehog signaling pathway that controls gonadogenesis. Two additional outlier genes (one shared with each species group) outside of the baculum QTL are known to interact with BMP4, suggesting this pathway may be involved in baculum morphology in primates. These results highlight how the mosaic ancestry of the bear macaque could explain its unique baculum evolution and collectively contribute to reproductive isolation.
Introductory Paragraph
In mammals, the baculum has extreme morphological variability, a dynamic evolutionary history characterized by repeated gain and loss, and is often used in species identification. The bear macaque has divergent genital morphology, including the longest baculum among all primates, and is proposed to have evolved via ancient hybrid speciation. Here, population genetic and phylogenomic approaches were used to examine how ancient hybridization in the bear macaque may have shaped this important component of genital morphology. Results demonstrate extensive mosaicism across the genome, which is consistent with ancient genetic contributions from both putative parental taxa. Genetic regions associated with baculum morphology also had mosaic ancestry for several genes, including KIF14 and KIAA0586, major candidate genes for baculum morphology in mice, and BMP4, a developmental gene involved in gonadogenesis. These results have important implications for how hybridization may have shaped the evolution of reproductive isolation in this unusual species with complex speciation.
Introduction
For species with internal fertilization, male genital morphology has complex evolution and diverges rapidly1–3. The baculum, or penis bone, is a novel male genital structure in placental mammals that evolved ∼145 million years ago4. Since that time, it has been the subject of recurrent adaptation having been gained or lost multiple times in various mammals, including recent loss in Homo sapiens5. Despite its remarkable interspecific variation, the penis bone generally does not vary extensively within species6, and may be the only physical trait that differs consistently between closely related species, making it a key component in species identification5. However, the genetics of morphological variation in this novel trait are not well known. In 2016, a morphometric analysis using micro-CT scans of bacula in mice identified three major quantitative trait loci (QTLs), two that explained 50.6% of the variation in baculum size, and one that explained 23.4% of the variation in baculum shape7. These regions were further narrowed by differential gene expression in early development of male mice and potential involvement in bone or genital morphogenesis, revealing 16 major candidate loci (see Table S3 in ref7). Sequence divergence to rats reduced this list even further to four major candidate genes, all of which have orthologs in primates (PTPRC, KIAA0586, ASPM, and KIF14)7.
Among primates, the bear macaque, Macaca arctoides has the longest baculum (4-6 cm8), even when corrected for body weight8,9. This species is perhaps the most unusual among macaques, distinguished not only by its extreme baculum length, but also by divergent genital morphology in both males and females (Figure S1), a unique single-mount ejaculatory mating behavior, whitish pelage in newborns, and a bald forehead and cheeks8. One possible explanation for the origin of the bear macaque baculum is that natural selection favored the evolution of prezygotic barriers to hybridization via a mechanical “lock and key” mechanism10. Interestingly, however, this dramatic derived genital morphology nonetheless may have been influenced by, or evolved despite, extensive genetic exchange with other macaque species. Early molecular work revealed phylogenetic incongruence between mitochondrial and Y-chromosomal genealogies for M. arctoides10,11. Mitochondrial markers place M. arctoides as sister to the fascicularis species group, whereas autosomal markers place it as sister to the sinica species group12–15. However, both morphological and Y-chromosomal trees support placement in the sinica group11,15,16 (see Table S1 in ref14 for review). This incongruence was inferred to be consistent with ancient hybridization between the ancestor of the fascicularis species group and a sinica species group member (an ancestor of either M. assamensis or M. thibetana)10. This hypothesized hybrid origin is diagrammed in Figure 1B.
In contrast to allopolyploid speciation, homoploid hybrid speciation (HHS) conserves chromosome number and is expected to have similar occurrence frequencies in plants and animals17,18. Though expected to be a rare speciation mechanism relative to “regular” bifurcating speciation, diverse taxa such as birds (Passer italiae19), butterflies (Heliconius heurippa20) and yeast (Saccharomyces paradoxus21) are thought to have evolved via HHS. Recent mitochondrial phylogenetics have shown that M. arctoides is more closely related to M. mulatta as compared to M. fascicularis14,22, indicating introgression may have occurred after the split of these two species, which is estimated to be 1.68 mya14,15,23 (Figure 1B). This suggests an alternative evolutionary scenario to the previously proposed model for the origin of M. arctoides by hybrid speciation; this alternative is also presented in Figure 1B.
Recent analysis of macaque genomes supports introgression between M. arctoides and members of the fascicularis and sinica species groups14, both of which it overlaps with geographically (Figure 1C). The baculum in M. arctoides is more than double the average length of sinica species group taxa and nearly four times as long as fascicularis species group taxa9,24. These sister species groups are estimated to have split ∼3 million years ago, with hybridization to produce M. arctoides purportedly occurring ∼1.8 million years ago14, when Pleistocene glacial periods were associated with the contraction of Southeast Asian forests into multiple refugia. Ecological isolation is a common feature of other scenarios of HHS25, and may have contributed to the extensive interspecific hybridization that gave rise to the unique macaque species, M. arctoides. Y-chromosomal data suggest that interbreeding between ancestors of the fascicularis and sinica groups was driven by sinica males. Available evidence suggests that this interbreeding subsided ∼1.5 mya10, followed by little interbreeding until the present8.
Here, we investigated the role hybridization played in shaping the unusual genital morphology of M. arctoides. Because a major criterion for HHS26,27 is that reproductive isolation (RI) be a direct byproduct of hybridization (but see28), we focused especially on genetic regions that are associated with morphological variation and candidate genes in mice5. Phylogenomic and population genetic approaches were used on several extant macaque species (M. arctoides; the sinica group species M. thibetana and M. assamensis; and the fascicularis group species M. fascicularis and M. mulatta). Also, outlier regions were examined within the bear macaque genome with alternate alleles from the two parental genomes and regions where the bear macaque differs from both parents. Despite its divergent morphology8,9,29, the baculum of M. arctoides is more similar to fascicularis species8, making these outliers of particular interest.
Results
Genomic analysis of five macaque species revealed a mosaic of evolutionary ancestry with respect to both putative parental species; this generated novel allelic combinations at multiple loci throughout the genome, including several genes associated with baculum morphology. These analyses used newly sequenced whole genome samples from M. arctoides (20X coverage) and M. assamensis (13X), and 8 publicly available genomes ranging in coverage from 4-49X (median 32X; Table 1). While not directly investigated here, these data will be useful for addressing questions related to the putative hybrid origin of this species in combination with targeted simulations (see Discussion).
Of the heterozygous sites, the majority are unique to each species group (Figure 1D). M. arctoides shares the most variants with the sinica species group, M. assamensis and M. thibetana, consistent with taxonomic placement in this species group30. However, the two-way intersection between M. arctoides and M. thibetana had the fewest number of shared heterozygous sites, which is likely due to the low overall heterozygosity in the M. thibetana sample (Figure 1D; Table 1). M. assamensis had the largest number of heterozygous sites (∼4.7 million; Table 1; Figure 1D). This high variability is consistent with the demographic results from smc++, which indicated population size expansion around 10,000 years ago in M. assamensis but not M. thibetana and M. arctoides (Figure 1E). Consistent with this, PSMC analysis in a previous study showed M. assamensis has a higher overall effective population size beginning ∼250 kya compared to M. arctoides and other sinica species (see Figure 3B in14).
Extensive Mosaicism of the M. arctoides Genome
The four-taxon test31–33 results supported that the M. arctoides genome has a mosaic of shared ancestry with both the sinica and fascicularis species groups (Figure 2), which is consistent with extensive introgression from both species groups, although windows with shared ancestry could also be due to incomplete lineage sorting between these taxa23. The genome-wide average fdM34,35 value was −0.113, ranging from −0.69 (supporting shared ancestry with the sinica group) to 0.55 (supporting shared ancestry with the fascicularis group) in 50kb sliding windows. This sliding window analysis was repeated with several window sizes (Figure S2), and all had a similar genome-wide mean fdM. In a previous study, estimates of D were similar to the fdM estimates here (see Table S8 in14).
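To make the sign convention of the windowed statistic concrete, the following is a minimal Python sketch of an fdM calculation for a single window, written from our understanding of the published definition of fdM rather than from the software used in this study. The function name `fdm_window` and the input layout are hypothetical. The sketch assumes P1 = sinica group, P2 = fascicularis group and P3 = M. arctoides (an assignment inferred from the reported signs), and that the outgroup is fixed for the ancestral allele at every site.

```python
import math


def fdm_window(sites):
    """Return fdM for one window.

    `sites` is an iterable of (p1, p2, p3) derived-allele frequencies.
    Assumed (hypothetical) assignment: P1 = sinica group,
    P2 = fascicularis group, P3 = M. arctoides, with the outgroup
    fixed for the ancestral allele at every site.
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2, p3 in sites:
        # ABBA minus BABA site pattern frequencies
        numerator += (1 - p1) * p2 * p3 - p1 * (1 - p2) * p3
        if p2 >= p1:
            # Maximal sharing between P2 and P3
            pd = max(p2, p3)
            denominator += (1 - p1) * pd * pd - p1 * (1 - pd) * pd
        else:
            # Maximal sharing between P1 and P3; sign flipped so the
            # statistic is symmetric around zero
            pd = max(p1, p3)
            denominator -= (1 - pd) * p2 * pd - pd * (1 - p2) * pd
    if denominator == 0:
        return math.nan  # window has no informative sites
    return numerator / denominator
```

Under these assumptions, a window in which M. arctoides shares derived alleles only with the fascicularis group returns a positive value, and a window in which it shares them only with the sinica group returns a negative value.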
Comparison of fdM in windows on the autosomes and on the X-chromosome revealed that the X-chromosome has a significantly lower mean fdM estimate (−0.15), with a statistical difference between the two distributions (Figure S3). There was a non-significant negative relationship between chromosome length and fdM, but the X-chromosome has an intermediate length (149 Mb) compared to the autosomes (mean=134 Mb). Thus, the lower fdM estimate on the X is not likely attributable to the length of the chromosome.
Phylogenetic relationships among the species groups were evaluated via the software Twisst36 (topology weighting by iterative sampling of subtrees). This method quantifies phylogenetic relationships across the genome and returns weights in sliding windows (Figure 1F). The topology weights were used to define a “majority topology” as having two-thirds or more of the sum of weights in any region. Windows that did not meet this criterion were re-classified as “unresolved”. This analysis showed the proportion of the genome that supported topo2, which groups M. arctoides with the sinica group, as 52.64% (95% bootstrap CI [52.02%, 53.28%]). However, a significant portion, 15.70% (95% bootstrap CI [15.40%, 16.01%]), of the genome supported topo3, which groups M. arctoides with the fascicularis species group. Interestingly, 11.07% (95% bootstrap CI [10.84%, 11.31%]) of the genome supported M. arctoides clustering in a group by itself (topo1). The remaining 20.59% (95% bootstrap CI [20.18%, 21.01%]) of the genome was categorized as unresolved. Still, the 15.70% of the genome that supports topo3 is a rather large fraction of the genome considering this is an ancient hybridization event. Similar to our four-taxon results, this distribution of support for each of the three possible topologies is different on the X chromosome, with an excess of topo2 on the X (70.89%, 95% bootstrap CI [67.49%, 74.18%]). One possible explanation for this difference relates to incomplete lineage sorting which is expected to be lower on the X-chromosome as compared to the autosomes, owing to the lower effective population size of the X.
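The two-thirds rule used to assign a majority topology can be sketched as follows. This is an illustrative Python example only, not the scripts used in this study: the function names and the input layout (one triple of weights per window, ordered topo1, topo2, topo3) are assumptions, and the proportions are computed per window rather than per base pair.

```python
from collections import Counter

TOPOLOGIES = ("topo1", "topo2", "topo3")


def majority_topology(weights, threshold=2 / 3):
    """Classify one window from its (topo1, topo2, topo3) weights.

    A window is assigned to a topology only if that topology holds at
    least `threshold` of the summed weights; otherwise it is
    "unresolved".
    """
    total = sum(weights)
    if total <= 0:
        return "unresolved"
    best = max(range(len(weights)), key=lambda i: weights[i])
    if weights[best] / total >= threshold:
        return TOPOLOGIES[best]
    return "unresolved"


def topology_proportions(windows, threshold=2 / 3):
    """Return the fraction of windows assigned to each class."""
    counts = Counter(majority_topology(w, threshold) for w in windows)
    n = sum(counts.values())
    labels = (*TOPOLOGIES, "unresolved")
    if n == 0:
        return {label: 0.0 for label in labels}
    return {label: counts[label] / n for label in labels}
```

Setting `threshold=1.0` corresponds to the stricter criterion of requiring all of the weight to fall on a single topology.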
In addition to examining broader patterns of shared ancestry in this species, outliers of shared ancestry with each species group and the M. arctoides genome were specifically scrutinized (see below). This set of regions was intersected with the Ensembl gene predictions downloaded from University of California Santa Cruz (UCSC) Genome Browser, resulting in 371 genes with the highest ancestry from the fascicularis group, and 608 genes with the highest ancestry from the sinica group. These results were highly concordant with the majority topology for each of the two possible groupings between M. arctoides and the parental taxa. The intersection resulted in no reduction of the 608 sinica-shared genes, whereas the fascicularis-shared genes were reduced to 363 from 371. This set of 971 genes was uploaded to the Panther Database. Enrichment analysis of GO Biological Processes revealed three GO terms enriched for the fascicularis species group and 24 for the sinica species group (Table S5). A surprising result was that there were three neuronal-related gene categories for ancestry shared between sinica and M. arctoides (see below and Discussion).
Intersection of Regions Associated with Baculum Morphology with Topo1 Regions
Next, the phylogenomic results were intersected with the mouse baculum QTL to investigate the genetic ancestry of the divergent baculum in M. arctoides. First, the 11.07% of topo1 regions that support placement of M. arctoides into its own lineage that does not share recent ancestry with either the fascicularis or sinica groups were investigated further. To accomplish this, the stringency in defining the majority topology was increased to require 100% of the weights into topo1, which reduced this to 4.5% of the genome. The regions from the Twisst analysis were extracted and intersected with the known Ensembl gene annotations, resulting in 4175 genes. A GO Analysis of these genes revealed enrichment of gene categories involved in developmental processes, immune system processes, and metabolic processes (Figure 4). There were also gene categories involved in reproduction, reproductive processes, behavior, and pigmentation, which have particular importance in M. arctoides given its phenotypic changes in these traits.
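The intersection of genomic regions with gene annotations can be illustrated with a small Python sketch. This is not the tool used in this study; the function name `overlapping_genes`, the tuple layouts, and the choice of 0-based half-open coordinates (as in BED files) are assumptions made for illustration.

```python
def overlapping_genes(regions, genes):
    """Return names of genes that overlap at least one region.

    regions: list of (chrom, start, end)
    genes:   list of (chrom, start, end, name)
    Coordinates are assumed to be 0-based and half-open.
    """
    # Group regions by chromosome so each gene is only compared
    # against regions on its own chromosome.
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))

    hits = []
    for chrom, g_start, g_end, name in genes:
        for r_start, r_end in by_chrom.get(chrom, []):
            # Two half-open intervals overlap when each one starts
            # before the other ends.
            if g_start < r_end and r_start < g_end:
                hits.append(name)
                break  # report each gene once
    return hits
```

Applying the same function twice, first to the topology regions and then to the regions homologous to the mouse QTL, would give the kind of successive filtering described in this section.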
This gene list was intersected with homologous chromosome regions in rheMac8 to the three major mouse baculum QTL, leaving 47 genes. Five of these genes (TGM2, KIF14, RHOJ, DACT1, and KIAA0586) were listed in Table S3 of ref7 among the 16 major candidate loci that overlap QTL regions for baculum morphology in mice. Two of the five genes, KIF14 and KIAA0586, also had a high sequence divergence between mouse and rat, thus being considered as two of four major candidate loci in explaining the genetic basis of mouse baculum size variation7. Similarly, divergence between M. arctoides and the sinica and fascicularis species group members was also high for KIF14 (Sin-Arc FST = 0.45; Fas-Arc FST = 0.61) and KIAA0586 (Sin-Arc FST = 0.38; Fas-Arc FST = 0.52) compared to the genome-wide averages (Sin-Arc FST = 0.33; Fas-Arc FST = 0.43). Ten of the 47 genes also had missense mutations between the mouse strains used for the QTL mapping (see Table S2 in ref7). Two of the 10 genes with missense mutations also had high FST: DDX59 (Sin-Arc FST = 0.40; Fas-Arc FST = 0.52) and SOGA1 (Sin-Arc FST = 0.38; Fas-Arc FST = 0.51).
For the two top candidate genes, KIF14 and KIAA0586, M. arctoides differed from all other lineages at 15 and 13 nucleotide positions, respectively. M. arctoides had two fixed and six heterozygous non-synonymous mutations in the protein-coding region of KIAA0586 and one fixed and one heterozygous non-synonymous mutation in KIF14. None of the fixed differences were located in any known functional domains. One of the other five genes (DACT1) had a premature stop codon fixed in M. arctoides; however, the impact on overall function is unclear (but see Discussion).
For fixed non-coding variants and variants immediately upstream or downstream of each gene, 100bp regions surrounding variants for each allele were input into the online tool TomTom38 to identify potential binding sites for transcription factors (Table S8). One of these SNPs suggested that CTCF binds to the M. arctoides allele of KIF14, but not any of the others (Figure 3A). The ENCODE Transcription Factor Target dataset as accessed through Harmonizome39 suggests that CTCF is a known transcription factor for KIF14.
Gene Flow in Regions Associated with Baculum Morphology
To understand how the mosaic ancestry of the M. arctoides genome contributed to its extreme baculum morphology and subsequent RI, the outliers of shared ancestry were also intersected with regions in rheMac8 homologous to the mouse baculum QTLs. ANOVA indicated that the distribution of fdM scores in these regions was significantly different from the rest of the genome. Surprisingly, fdM values in these regions were more negative, indicating more shared ancestry with sinica member species than fascicularis group species (Figure S4), which is the opposite of the expectation based on morphology.
Because the baculum of M. arctoides is more similar to members of the fascicularis species group than the sinica group, the fascicularis-shared introgression outlier regions (shown in blue in Figure S4) were further examined. Of 363 genes identified genome-wide, three genes were located within one of the three QTL regions on chromosome 7 (Figure S5; Supplementary Methods and Results). One of these genes, Bone Morphogenetic Protein 4 (BMP4), is a key gene in the hedgehog signaling pathway40, and has a role in gonadal development41,42. As expected for a major developmental gene, no coding differences were observed. However, there were two variants in the 3’ UTR of BMP4, one of which matched topo3, where M. arctoides and fascicularis species had a shared allele (Figure 3C; Figure S6A). Using TomTom as described above, differential binding was predicted for variants at each SNP which identified three genes known to interact with BMP4: Six1, Tead1, and Tead3. Six1 deficiencies lead to reduced BMP4 expression43 and affect spatial expression patterns44. Similarly, Tead proteins are a family of transcription factors that mediate transcription and cell migration of BMP4 via the TAZ component of the Hippo pathway45. Based on the alleles present, the lack of Six1 binding and the added binding of Tead1 and Tead3 would predict an increase in expression of BMP4 in M. arctoides and fascicularis group species as compared to sinica group species.
Characterization of Outlier Genes Explain Other Phenotypes Unique to M. arctoides
As mentioned earlier, M. arctoides has other traits (e.g., ontogenetic coloration, prolonged intromission with a postejaculatory “pair-sit”, etc.8) that make it unusual among macaques. Of the 971 genes identified as shared ancestry outliers, a stringent set of 146 genes (51 of fascicularis origin and 95 of sinica origin) was identified (see Supplementary Methods and Results) and manually curated. Five of the seven manually curated categories had the following numbers of genes: 11 bone, 12 reproductive, 7 sensory, 8 immune and 6 skin/fur. The two most highly represented categories were cellular functions (55 genes) and neural functions (55 genes) (Table S6). Some genes were in more than one category. The two highly represented groups corresponded well with the prevalence of neuronal-related gene categories from the GO analysis for sinica outliers. One possibility for this neural enrichment is that interactions between the fascicularis-shared mtDNA and the mostly sinica-shared nucleus may have led to mitonuclear incompatibilities between these species (see Discussion).
A moderate number of bone, reproductive, and skin/fur genes is unsurprising based on the unique biology of M. arctoides. Notably, two known interactors with BMP4 (DYNC2LI1 and CTDSPL2) were included in this stringent gene list. DYNC2LI1 exhibited shared ancestry between M. arctoides and fascicularis (fdM = 0.79) whereas CTDSPL2 exhibited shared ancestry between M. arctoides and sinica (fdM = −0.82). In humans, DYNC2LI1 mutations are associated with lethal skeletal diseases and have been found to impair the Hedgehog pathway46. CTDSPL2 dephosphorylates various Smad proteins, which are transcriptional regulators directly involved with the BMP pathway. CTDSPL2 knock-down experiments in mice lead to increased expression of BMP targeted genes47.
TomTom was used to examine putative binding motifs associated with the differing alleles at SNPs within CTDSPL2 and DYNC2LI1. Within CTDSPL2, M. arctoides/sinica and the fascicularis/outgroup had 16 fixed differences between them and 5 heterozygous sites. Of these, only one transcription factor seemed to be directly related to regulation of CTDSPL2. Specifically, overexpression of XBP1 in mouse cells up-regulates CTDSPL2448. TomTom results suggested that XBP1 binds to the shared M. arctoides/sinica allele of CTDSPL2 but not the shared fascicularis/outgroup allele (Figure 3B). The DYNC2LI1 gene had 12 fixed differences between M. arctoides/fascicularis and the sinica/outgroup, one resulting in a non-synonymous mutation (c. 422G>A; p. Arg>Lys141). TomTom results suggest that this SNP changes the binding affinity of RBFOX2 protein, which regulates splicing events49. TomTom predicted binding of RBFOX2 to the allele shared by M. arctoides and fascicularis species (Figure 3C). See Table S8 for detailed characterization of TomTom results for each SNP.
Discussion
A patchwork of ancestral origins for the baculum in the bear macaque
The gene analysis of alleles unique to the bear macaque lineage revealed 47 genes that overlapped three QTL regions associated with baculum morphology in mice. Two of these genes are major candidate genes for baculum size, KIF14 and KIAA05867, the latter of which has several nonsynonymous mutations. Three other genes were on a longer list of 16 candidate genes, one of which (DACT1) had a nonsense mutation c. 1317G>A; p. Trp439* fixed in M. arctoides. DACT1 is part of the Wnt signaling pathway, interacts with dishevelled (DVL), and is associated with Townes-Brocks syndrome 2 (TBS2; MIM# 617466). Interestingly, a heterozygous nonsense mutation c. 1256G>A; p. Trp419* in DACT1 in humans has been shown to have a variety of phenotypes including genitourinary malformations. These malformations include hypospadias50, a birth defect where the urethral opening is on the underside of the penis, which is also characteristic of M. arctoides male genital morphology8.
While there is a mosaic of ancestries across the genome of M. arctoides, there is more support for sinica group ancestry, consistent with previous results10,14,15. Interestingly, baculum morphology, which is divergent in both size and shape in M. arctoides9,29, does not support clustering of the bear macaque with the sinica species group, and instead morphologically resembles members of the fascicularis species group8. Consistent with these morphological similarities, one of 363 fascicularis-sourced introgression outliers overlaps with a QTL associated with variation in baculum size in mice. This gene, BMP4, is a major regulator of bone formation and gonad development40,42. Two SNPs in the 3’ UTR have alternative binding motif alleles that suggest regulatory changes in this gene that would result in upregulation of BMP4 in M. arctoides as compared to sinica species (Figure 6C). Still, functional validation is needed to ascertain whether this genomic variation is coupled with differences in expression intensity, location, or timing during development.
Additionally, among the 47 genes overlapping the mouse QTLs from topo1, one gene with notably high FST was the gene BMP/Retinoic Acid Inducible Neural Specific 3 (BRINP3; Sin-Arc FST = 0.48, Fas-Arc FST = 0.60). There were two other genes in the outlier analysis that are known to interact with BMP4 and bone development, pointing to possible rewiring of this important developmental pathway in the M. arctoides genome. Although this pathway was not identified in the mouse baculum morphology study, it was mentioned as a likely candidate worthy of investigation. It is also possible that this pathway is involved in baculum morphology in primates, but not in mice. Future work should focus on these genes to understand the evolution of baculum morphology in M. arctoides.
Complex speciation of the bear macaque
Macaques are a diverse primate genus with several examples of purported complex speciation10,11,50. Although mtDNA suggests that introgression with the fascicularis group should be driven by M. mulatta14,22, when considering these two species separately, the autosomal estimates of fdM are largely similar (Table S3). One possible explanation for this is that the sample used for M. fascicularis is Vietnamese in origin and therefore has extensive gene flow from M. mulatta51. It is worth noting, however, that there are 1.6 times more shared heterozygous sites between M. arctoides and M. mulatta than between M. arctoides and M. fascicularis (Figure 1D). We also found that 11.07% of genomic regions support a clustering of M. arctoides separate from other macaque species. Although several taxonomic surveys have suggested the placement of M. arctoides into its own species group30,52,53, recent genome surveys suggest that sequencing of additional sinica species members will be useful in making this determination54,55.
While genomic mosaicism is a key feature of hybrid species, it is also a ubiquitous feature of admixture more generally. For example, in swordtail fish, there is genomic mosaicism that resembles hybrid speciation, although it is not consistent with other information from this species26,27,56. Similarly, approximately 16% of the genome supports clustering of M. arctoides with fascicularis species. This finding, though compelling, is also not uncommon in other taxa that do not have hybrid speciation. For example, in human-chimp-gorilla comparisons, ∼30% of the human genome does not support the species tree57. In principle, an even contribution from both parents to a hybrid genome is possible, but this is often not the case in HHS (e.g., Heliconius20). Here, we observed a significant deviation from a 50-50 split of ancestry from the two groups, which could be due to a combination of proposed recent admixture with sinica group taxa10 and differences in ancestral population sizes of the putative parental taxa.
Here, we also found higher shared ancestry between M. arctoides and sinica species group taxa on the X-chromosome as compared to the autosomes, which is consistent with either (1) higher introgression from the sinica group on this chromosome compared to the autosomes, (2) lower effective population size of the X-chromosome compared to the autosomes and/or (3) increased selection against introgression on the X chromosome (which is qualitatively consistent with Haldane’s rule). The latter possibility is interesting in light of a recently proposed mitochondrial capture scenario via reduced fitness of male hybrids14. Mitochondrial capture could either have followed M. arctoides split from sinica, or it could have been simultaneous with an HHS origin (Figure 1B). While there are recent statistics that explicitly test for HHS58, these have a restrictive set of assumptions that may not apply to this system (e.g., subsequent gene flow with parental taxa). Therefore, a more explicit set of simulations of various speciation scenarios would need to be conducted to determine the timing of ancient introgression from the two parental genomes.
Unanticipated enrichment of introgression outliers with neurological function
GO analysis of the introgression outliers shared between the sinica group and M. arctoides revealed a large number of genes involved in neurological function that were noticeably absent in the fascicularis outliers. A possible explanation is that the mismatch in sinica-sourced mitonuclear genes could lead to mitochondrial dysfunction, which could contribute to neural malfunction and disease. For example, disease-causing mutations in mitochondrial aminoacyl tRNA-synthetases in humans predominantly result in neurological disorders59. This is believed to be due to the generally high-energy demand of primate brains and may necessitate coevolution of mitonuclear and mitochondrial genes60,61. Moreover, at least two of the sinica-sourced neurological genes characterized here, GFM2 and SOD2, are known to be involved in mitochondrial disorders exhibiting neurological effects62,63. Additionally, hybridization scenarios exhibiting mismatch between mitochondrial and mitonuclear genes have been discovered in a variety of systems64–66. A comprehensive analysis of the potential for mitonuclear incompatibilities in the evolution of M. arctoides could also shed light on the evolution of the bear macaque.
Future Directions
This study has focused on the baculum based on its extreme length in M. arctoides, which is proposed to play a role in reproductive isolation among species. However, we recognize that the baculum is certainly not the only barrier to reproduction between M. arctoides and its parental taxa. We anticipate there are other regions of the genome that have contributed to speciation67. For example, there have been compensatory changes in female reproductive morphology29 that are equally important in the mechanical barrier between these taxa (Figure S1). Additionally, changes in their mating behavior also contribute to differences between M. arctoides and its parent taxa9. Therefore, as the genetic basis of such traits becomes known, they can be further investigated.
In addition to this study, there has been extensive interest in the evolution of this species. However, due to the limited number of publicly available whole genome samples (now three with the present study), the conservation status of this species, and the ethical and logistical barriers to experimental research with primates, it is challenging to make functional insights into its evolution. Additional WGS data from more individuals of this species would allow for a better characterization of selection/adaptation within this lineage. Additionally, more WGS samples would be useful in examining the distribution of introgression haplotype lengths to pinpoint the timing of events in this system that would better clarify when hybridization from each parent took place.
Methods
Samples and Genome Sequencing
One female each from M. arctoides (Malaya) and M. assamensis (A20) was sequenced. These two samples were multiplexed and sequenced on one lane for initial quality assessment. Each was then run on one additional lane, for a final total of 1.5 lanes per sample. Sequencing was done at the UCSF Medical Center Sequencing Facility on an Illumina HiSeq 2000 machine. Each run was assigned a different read group so that runs were treated as independent in the variant calling workflow. The raw read data have been deposited in the NCBI Sequence Read Archive and are available via the accession PRJNA622565.
Publicly available sequences
Raw fastq files were downloaded for eight public genome samples for this project (Table 1). Additional sample details can be found in the corresponding publications. Each was downloaded from NCBI SRA using sratoolkit version 2.8.1, with options to split the data into read pairs and to retain the original format68. Because M. thibetana was sequenced on an older platform, we used seqtk69 to convert its quality scores from Q64 to Q33. Additionally, SRA files containing multiple lanes, as indicated in the fastq header, were split by lane for independent alignment to the reference genome. This allowed separate read groups to be defined and treated as independent in the variant calling workflow.
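The Q64 to Q33 conversion is a fixed shift of the quality encoding: a score Q is stored as the character with code Q + 64 in the older scheme and Q + 33 in the current one. The following minimal Python sketch illustrates that shift only. It is not the seqtk implementation, the function name phred64_to_phred33 is hypothetical, and it assumes non-negative scores (it would reject the negative scores permitted by the early Solexa encoding).

```python
def phred64_to_phred33(quality: str) -> str:
    """Re-encode a FASTQ quality string from Phred+64 to Phred+33.

    Assumption: every score is >= 0 (Illumina 1.3+ style Phred+64).
    """
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # quality score under the Phred+64 encoding
        if score < 0:
            raise ValueError(f"{ch!r} is not a valid Phred+64 character")
        converted.append(chr(score + 33))  # same score, Phred+33 encoding
    return "".join(converted)
```

For example, the Phred+64 character "h" (score 40) becomes "I" in Phred+33.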
Alignment, genome analysis, and variant calling
Raw fastq files from all samples were aligned to the reference genome rheMac8 (NCBI Accession: PRJNA214746). The masked reference genome was downloaded from UCSC. GATK best practices were followed to obtain high quality variant sites (see Supplementary Methods). Briefly, reads were aligned to the reference genome, duplicates were marked, indels were realigned, base quality scores were recalibrated, variants were called, and variant quality scores were recalibrated. Samples from each species were processed independently and merged into a final callset. The baboon reference genome was added to the variant files via two-way genome alignment and custom scripts. These final filtered files for each chromosome with the baboon information are freely available and were used for all subsequent analyses.
Four-taxon test for introgression
To analyze patterns of introgression in sliding windows, a modified four-taxon test was used (Figure 2). Rather than separately analyzing gene flow between M. arctoides and each parental species group, as has been done previously14, we relied on the modified statistic fdM, which is better suited to comparing introgression from both groups35. Specifically, hybridization was tested between M. arctoides (P3) and members of both the sinica species group (P1) and the fascicularis species group (P2), using baboon as the outgroup. Calculations were done using scripts available on github33,34. To select the appropriate window size, a series of analyses with different window sizes was conducted. fdM was computed for overlapping windows of 5kb, 25kb, 50kb, 100kb, 500kb and 1000kb, with a step size of 20% of each window size. The minimum number of sites required per window was based on the distribution of sites available in the callset at each window size. Specifically, minimums of 10, 50, 100, 200, 1000, and 2000 sites, respectively, were used. Because at least 100 sites per window are recommended34, 50kb sliding windows were selected and are shown in Figure 2 as a heatmap across the genome. Results from the other window sizes are shown in Figure S2. This analysis was repeated with individual taxa from each species group (Table S3).
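The windowing scheme and the statistic can be sketched in a few lines of Python. This is an illustrative sketch, not the github scripts used in the study. It follows the published definition of fdM as we understand it, and it assumes that the outgroup is fixed for the ancestral allele and that each site is summarized by derived allele frequencies (p1, p2, p3) for P1, P2 and P3. The function names window_starts and fdm are hypothetical.

```python
def window_starts(chrom_length, window_size, step_percent=20):
    """0-based start coordinates of overlapping windows along one chromosome."""
    step = window_size * step_percent // 100
    return list(range(0, max(chrom_length - window_size, 0) + 1, step))


def fdm(sites, min_sites=100):
    """fdM for one window.

    sites: list of (p1, p2, p3) derived allele frequencies.
    Assumption: the outgroup carries only the ancestral allele.
    Returns None if the window has too few sites or a zero denominator.
    """
    if len(sites) < min_sites:
        return None
    numerator = 0.0
    denominator = 0.0
    for p1, p2, p3 in sites:
        abba = (1 - p1) * p2 * p3
        baba = p1 * (1 - p2) * p3
        numerator += abba - baba
        if p2 >= p1:
            # Maximal sharing between P2 and P3: S(P1, PD, PD, O)
            pd = max(p2, p3)
            denominator += (1 - p1) * pd * pd - p1 * (1 - pd) * pd
        else:
            # Maximal sharing between P1 and P3: -S(PD, P2, PD, O)
            pd = max(p1, p3)
            denominator -= (1 - pd) * p2 * pd - pd * (1 - p2) * pd
    return numerator / denominator if denominator else None
```

In this sketch, positive values indicate excess allele sharing between P2 and P3, negative values indicate excess sharing between P1 and P3, and the statistic is symmetric around zero. With 50kb windows the step is 10kb, so window_starts(100000, 50000) yields starts at 0, 10000, 20000, 30000, 40000 and 50000.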
Topology weighting
To compare the weights of different species topologies across the genome, the software Twisst was used36,70. First, the data used for the introgression analysis were filtered so that each group had at least one member represented in the genotype data. Next, neighbor joining trees were constructed in sliding windows of 50 SNPs each across the genome using PhyML, run via scripts available on github34. This generated newick formatted tree files that were used as input for Twisst.
The topology weighting output was further explored by collapsing adjacent intervals that supported the same majority topology. Since there were three possible topologies, the majority topology was defined as any one topology that had at least two-thirds of the total weight. Any interval without a majority topology was labeled as “unresolved”. Adjacent intervals that supported the same majority topology were collapsed, and the collapsed intervals were then split by topology into three sets, one for each of the three major topologies. These sets were intersected with gene regions from the Ensembl annotations for rheMac8 downloaded from UCSC.
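The majority rule and the collapsing step can be illustrated with a short Python sketch. It is not the code used in the study. The input layout (chromosome, start, end, and a tuple of three weights per interval) and the function names are assumptions made for illustration, and the sketch also collapses adjacent unresolved intervals for convenience.

```python
from itertools import groupby

TOPOLOGIES = ("topo1", "topo2", "topo3")


def majority_topology(weights):
    """Name of the topology holding at least two-thirds of the weight, else 'unresolved'."""
    total = sum(weights)
    for name, weight in zip(TOPOLOGIES, weights):
        # 3 * weight >= 2 * total is the two-thirds rule without division
        if total > 0 and 3 * weight >= 2 * total:
            return name
    return "unresolved"


def collapse_intervals(intervals):
    """Merge runs of adjacent intervals that share a chromosome and a label.

    intervals: list of (chrom, start, end, (w1, w2, w3)), sorted by position.
    Returns a list of (chrom, start, end, label).
    """
    labeled = [(c, s, e, majority_topology(w)) for c, s, e, w in intervals]
    merged = []
    for (chrom, label), run in groupby(labeled, key=lambda r: (r[0], r[3])):
        run = list(run)
        merged.append((chrom, run[0][1], run[-1][2], label))
    return merged
```

Because at most one topology can hold two-thirds of the total weight, the order in which the topologies are checked does not affect the label.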
Bootstrap confidence intervals for topo1, topo2, topo3, and unresolved proportions were computed by resampling 2,792 1Mbp intervals along the genome (149 when considering only chromosome X) for each of 10,000 replicates and taking the 2.5% and 97.5% quantiles.
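A block bootstrap of this kind can be sketched as follows. This is a minimal illustration rather than the code used in the study. It assumes that each 1Mbp interval is summarized as a pair (base pairs assigned to the topology of interest, total base pairs), that the proportion is the ratio of the summed values, and that quantiles are taken by a simple rank rule. The function name bootstrap_ci is hypothetical.

```python
import random


def bootstrap_ci(blocks, n_replicates=10_000, seed=0):
    """95% bootstrap interval for a genome-wide topology proportion.

    blocks: list of (topology_bp, total_bp), one entry per 1 Mbp interval.
    Assumption: total_bp sums to a positive value in every resample.
    """
    rng = random.Random(seed)
    n_blocks = len(blocks)
    estimates = []
    for _ in range(n_replicates):
        # Resample whole intervals with replacement
        sample = rng.choices(blocks, k=n_blocks)
        topo_bp = sum(block[0] for block in sample)
        total_bp = sum(block[1] for block in sample)
        estimates.append(topo_bp / total_bp)
    estimates.sort()
    lower = estimates[int(0.025 * (n_replicates - 1))]
    upper = estimates[int(0.975 * (n_replicates - 1))]
    return lower, upper
```

Resampling whole intervals rather than individual windows preserves the correlation between neighboring windows within each interval.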
Demography Analysis using smc++
Demographic history was inferred with smc++71, which uses whole genome sequencing data from multiple samples and does not require phased data. All sequences were masked from the start of each chromosome up to the first genotype call, from the last genotype call to the end of each chromosome, and across all rheMac8 assembly gaps. Then, the estimate module was used to estimate population size history with an assumed mutation rate of 2.5e-8 (ref. 14) and automatic selection of inference timepoints. When generating population size trajectory plots, a generation time of 6 years (ref. 14) was used.
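The masking rule can be made concrete with a small Python sketch that builds the masked intervals for one chromosome. It is illustrative only and is not part of smc++. It assumes 1-based genotype positions, gap intervals given as 0-based half-open coordinates (BED style), and output in the same BED style. The function name mask_intervals is hypothetical.

```python
def mask_intervals(chrom_length, first_call, last_call, gaps):
    """Merged 0-based, half-open intervals to mask on one chromosome.

    first_call, last_call: 1-based positions of the first and last genotype call.
    gaps: list of (start, end) assembly gaps, 0-based and half-open.
    """
    raw = [(0, first_call - 1), (last_call, chrom_length)] + list(gaps)
    # Drop empty intervals, then sort by start coordinate
    raw = sorted(iv for iv in raw if iv[0] < iv[1])
    merged = []
    for start, end in raw:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching intervals are fused into one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For a 1000 bp chromosome with calls spanning positions 101 to 900 and gaps at (50, 150) and (400, 500), the masked intervals are (0, 150), (400, 500) and (900, 1000).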
Functional Analysis of Targeted and Outlier Regions
A major criterion of homoploid hybrid speciation is evidence that reproductive isolation (RI) evolved as a by-product of hybridization26. To investigate this directly, regions associated with morphological variation in the baculum identified in mice7 were targeted. The UCSC tool liftover72 was used to convert the coordinates of the three QTL from the mouse study from mm10 to rheMac8.
To further identify outlier regions, loci with the highest shared ancestry from each parent taxon were extracted from the sliding window results (see Supplementary Methods). The genes in these regions were then intersected with the gene lists from the Twisst analysis. Specifically, sinica outliers of shared ancestry were intersected with regions where the majority topology supported M. arctoides clustering with sinica species. Similarly, fascicularis outliers of shared ancestry were intersected with regions where the majority topology supported M. arctoides clustering with fascicularis species. These intersected gene lists (Table S4) were used to conduct a GO analysis using Panther73 (Table S5).
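The two-way intersection can be sketched in Python as follows. This is an illustrative sketch, not the pipeline used in the study. It assumes all coordinates are 0-based and half-open, and the function names and gene names in the usage example are hypothetical.

```python
def overlapping_genes(regions, genes):
    """Names of genes that overlap at least one region.

    regions: list of (chrom, start, end)
    genes:   list of (chrom, start, end, name)
    """
    hits = set()
    for g_chrom, g_start, g_end, name in genes:
        for r_chrom, r_start, r_end in regions:
            if g_chrom == r_chrom and g_start < r_end and r_start < g_end:
                hits.add(name)
                break
    return hits


def shared_candidates(outlier_regions, topology_regions, genes):
    """Genes found both in introgression outliers and in matching topology regions."""
    in_outliers = overlapping_genes(outlier_regions, genes)
    in_topology = overlapping_genes(topology_regions, genes)
    return sorted(in_outliers & in_topology)


# Usage with made-up coordinates and gene names
genes = [("chr1", 100, 200, "GENE_A"), ("chr1", 500, 600, "GENE_B"),
         ("chr2", 100, 200, "GENE_C")]
candidates = shared_candidates([("chr1", 150, 550)],
                               [("chr1", 0, 120), ("chr2", 0, 1000)], genes)
```

Here only GENE_A overlaps both an outlier region and a topology region, so it is the only gene retained.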
Author Contributions
LSS, JDW, DJM, and BJE conceived of the project. DJM supplied the samples and JDW paid for sequencing. LSS executed methods/analyses with substantial input from JDW and BJE, and wrote the draft manuscript. BJE and DJM significantly aided the interpretation of results. ZAS conducted VQSR and SMC++ analyses. NPB and TEN conducted manual curation of gene outliers and TomTom on baculum alleles. All authors were involved in editing and revision of the final draft.
Acknowledgements
This work was supported by research start-up funds from the Department of Biological Sciences at Auburn University (LSS). Sequencing costs were supported in part by NIH grant R01 GM115433 (to JDW). This work was made possible in part by a grant of high performance computing resources and technical support from the Alabama Supercomputer Authority. BJE was supported by a grant from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2017-05770). We thank members of the Stevison lab for helpful discussions regarding the analysis and results, with particular thanks to Stephen Sefick. We thank Simon Martin both for maintaining a high-quality public github repository and for his responsiveness to inquiries about the use of his code. We thank several researchers in attendance at the 2016 Speciation GRC, Evolution 2017, and PEQG 2018 for helpful discussions regarding this work. We thank Wendy Hood, Ken Halanych, and Geoff Hill for feedback on early drafts of the manuscript.
Footnotes
† Deceased April 18, 2019