Summary
Genomes can vary within individual organisms. Programmed DNA elimination leads to dramatic changes in genome organisation during the germline–soma differentiation of ciliates1, lampreys2, nematodes3,4, and various other eukaryotes5. A particularly remarkable example of tissue-specific genome differentiation is the germline-restricted chromosome (GRC) of the zebra finch, which is consistently absent from somatic cells6. Although the zebra finch is an important animal model system7, molecular evidence from its large GRC (>150 megabases) is limited to a short intergenic region8 and a single mRNA9. Here, we combined cytogenetic, genomic, transcriptomic, and proteomic evidence to resolve the evolutionary origin and functional significance of the GRC. First, by generating tissue-specific de-novo linked-read genome assemblies and re-sequencing two additional germline and soma samples, we found that the GRC contains at least 115 genes which are paralogous to single-copy genes on 18 autosomes and the Z chromosome. We detected an amplification of ≥38 GRC-linked genes into high copy numbers (up to 308 copies) but, surprisingly, no enrichment of transposable elements on the GRC. Second, transcriptome and proteome data provided evidence for functional expression of GRC genes at the RNA and protein levels in testes and ovaries. Interestingly, the GRC is enriched for genes with highly expressed orthologs in chicken gonads and gene ontologies involved in female gonad development. Third, we detected evolutionary strata of GRC-linked genes. Developmental genes such as bicc1 and trim71 have resided on the GRC for tens of millions of years, whereas dozens of other genes have become GRC-linked very recently. The GRC is thus likely widespread in songbirds (half of all bird species) and its rapid evolution may have contributed to their diversification.
Together, our results demonstrate a highly dynamic evolutionary history of the songbird GRC leading to dramatic germline–soma genome differences as a novel mechanism to minimise genetic conflict between germline and soma.
Text
Not all cells of an organism must contain the same genome. Some eukaryotes exhibit dramatic differences between their germline and somatic genomes, resulting from programmed DNA elimination of chromosomes or fragments thereof during germline–soma differentiation5. Here we present the first comprehensive analyses of a germline-restricted chromosome (GRC). The zebra finch (Taeniopygia guttata) GRC is the largest chromosome of this songbird6 and likely comprises >10% of the genome (>150 megabases)7,10. Cytogenetic evidence suggests the GRC is inherited through the female germline, expelled late during spermatogenesis, and eliminated from the soma during early embryo development6,11. Previous analyses of a 19-kb intergenic region suggested that the GRC contains sequences with high similarity to regular chromosomes (‘A chromosomes’)8.
In order to reliably identify sequences as GRC-linked, we used a single-molecule sequencing technology not applied previously in birds that permits reconstruction of long haplotypes through linked reads12. We generated separate haplotype-resolved de-novo genome assemblies for the germline and soma of a male zebra finch (testis and liver; ‘Seewiesen’; Supplementary Table 1). We further used the linked-read data to compare read coverage and haplotype barcode data in relation to the zebra finch somatic reference genome (‘taeGut2’)7, allowing us to identify sequences that are shared, amplified, or unique to the germline genome in a fashion similar to recent studies on cancer aneuploidies13. We also re-sequenced the germline and soma from two unrelated male zebra finches (‘Spain’; testis and muscle; Extended Data Fig. 1) using short reads.
We first established the presence of the GRC in the three germline samples. Cytogenetic analysis using fluorescence in-situ hybridisation (FISH) with a new GRC probe showed that the GRC is present exclusively in the germline and eliminated during spermatogenesis as hypothesised (Fig. 1a-b, Extended Data Fig. 2)6,11. We compared germline/soma sequencing coverage by mapping reads from all three sampled zebra finches onto the reference genome assembly (regular ‘A chromosomes’), revealing consistently germline-increased coverage for single-copy regions, reminiscent of programmed DNA elimination of short genome fragments in lampreys2 (Fig. 1c-d). A total of 92 regions (41 of them longer than 10 kb) on 13 chromosomes exhibit >4-fold increased germline coverage in ‘Seewiesen’ relative to the soma (Fig. 1e, Supplementary Table 2). Such a conservative coverage cut-off provides high confidence that these are true GRC-amplified regions. We obtained nearly identical confirmatory results using another library preparation method for the ‘Spain’ birds (Fig. 1f). Notably, the largest block of testis-increased coverage spans nearly 1 Mb on chromosome 1 and overlaps with the intergenic region 27L4 previously verified by FISH8 (Fig. 1e-f).
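The coverage comparison can be illustrated with a minimal Python sketch. It assumes per-window read depths are already available for one germline and one soma sample, and it normalises each sample by its median depth before taking the ratio; the median normalisation, the window representation, and the names `fold_change` and `amplified` are illustrative assumptions, not the pipeline used in this study. Only the >4-fold cut-off is taken from the text.

```python
from statistics import median
from typing import Dict, List, Tuple

Window = Tuple[str, int, int]  # (chromosome, start, end); hypothetical layout


def fold_change(germline: Dict[Window, float],
                soma: Dict[Window, float],
                min_soma_depth: float = 1.0) -> Dict[Window, float]:
    """Median-normalised germline/soma depth ratio for each shared window."""
    g_med = median(germline.values())
    s_med = median(soma.values())
    ratios = {}
    for window, s_depth in soma.items():
        # Skip windows with too little somatic coverage for a stable ratio.
        if s_depth < min_soma_depth or window not in germline:
            continue
        ratios[window] = (germline[window] / g_med) / (s_depth / s_med)
    return ratios


def amplified(ratios: Dict[Window, float], min_fold: float = 4.0) -> List[Window]:
    """Windows whose germline coverage exceeds the soma by more than min_fold."""
    return sorted(w for w, r in ratios.items() if r > min_fold)
```

With toy depths of 30, 150 and 30 in the germline and 30 throughout the soma, only the middle window (5-fold) passes the cut-off.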
Our linked-read and re-sequencing approach allowed us to determine the sequence content of the GRC. The GRC is effectively a non-recombining chromosome as it recombines with itself after duplication, probably to ensure stable inheritance during female meiosis8. We predicted that the GRC would be highly enriched in repetitive elements, similar to the female-specific avian W chromosome (repeat density >50%, compared to <10% genome-wide)14. Surprisingly, neither assembly-based nor read-based repeat quantifications detected a significant enrichment in transposable elements or satellite repeats in the germline samples relative to the soma samples (Extended Data Fig. 3, Supplementary Table 3). Instead, most germline coverage peaks lie in single-copy regions of the reference genome overlapping 38 genes (Fig. 1e-f, Table 1, Supplementary Table 4), suggesting that these peaks stem from very similar GRC-amplified paralogs with high copy numbers (up to 308 copies per gene; Supplementary Table 5). GRC linkage of these regions is further supported by sharing of linked-read barcodes between different amplified chromosomal regions in germline but not soma (Fig. 1g-h), suggesting that these regions reside on the same haplotype (Extended Data Fig. 4). We additionally identified 245 GRC-linked genes through germline-specific single-nucleotide variants (SNVs) present in read mapping of all three germline samples onto zebra finch reference genes (up to 402 SNVs per gene; Supplementary Table 4). As a control, we used the same methodology to screen for soma-specific SNVs and found no such genes. We conservatively consider the 38 GRC-amplified genes and those with at least 5 germline-specific SNVs as our highest-confidence set (Table 1). We also identified GRC-linked genes using germline–soma assembly subtraction (Fig. 1i); however, all were already found via coverage or SNV evidence (Table 1). Together with the napa gene recently identified in transcriptomes (Fig. 1j)9, our complementary approaches yielded 115 high-confidence GRC-linked genes with paralogs located on 18 autosomes and the Z chromosome (Table 1; all 267 GRC genes in Supplementary Table 4).
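The SNV-based selection rule can be sketched in Python as follows. The sketch treats a variant as germline-specific when it is called in every germline sample and in no soma sample, then combines the coverage-amplified genes with genes carrying at least 5 such variants. The variant tuple layout and the function names are hypothetical, and the actual variant calling and filtering used in the study are not reproduced here.

```python
from typing import Dict, Iterable, List, Set, Tuple

Variant = Tuple[str, int, str]  # (gene, position, alternative allele); hypothetical layout


def germline_specific(germline_calls: List[Set[Variant]],
                      soma_calls: List[Set[Variant]]) -> Set[Variant]:
    """Variants present in every germline sample and absent from every soma sample."""
    shared = set.intersection(*germline_calls)
    seen_in_soma = set().union(*soma_calls)
    return shared - seen_in_soma


def snvs_per_gene(variants: Iterable[Variant]) -> Dict[str, int]:
    """Count germline-specific variants per gene."""
    counts: Dict[str, int] = {}
    for gene, _position, _allele in variants:
        counts[gene] = counts.get(gene, 0) + 1
    return counts


def high_confidence(amplified_genes: Set[str],
                    counts: Dict[str, int],
                    min_snvs: int = 5) -> Set[str]:
    """Union of coverage-amplified genes and genes with at least min_snvs variants."""
    return set(amplified_genes) | {g for g, n in counts.items() if n >= min_snvs}
```

Swapping the two argument lists of `germline_specific` gives the soma-specific control screen described above.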
We next tested whether the GRC is functional and thus probably physiologically important using transcriptomics and proteomics. We sequenced RNA from the same tissues of the two Spanish birds used for genome re-sequencing and combined these with published testis and ovary RNA-seq data from North American domesticated zebra finches9,15. Among the 115 high-confidence genes, 6 and 32 were transcribed in testes and ovaries, respectively (Table 1). Note that these are only the genes for which we could reliably separate GRC-linked and A-chromosomal paralogs using GRC-specific SNVs in the transcripts (Fig. 2a-b, Extended Data Fig. 5, Supplementary Table 6). We next verified translation of GRC-linked genes through protein mass spectrometry data for 7 testes and 2 ovaries from another population (‘Sheffield’). From 83 genes with GRC-specific amino acid changes, we identified peptides from 5 GRC-linked genes in testes and ovaries (Fig. 2c-d, Extended Data Fig. 6, Table 1). We therefore established that many GRC-linked genes are transcribed and translated in adult male and female gonads, extending previous RNA evidence for a single gene9 and questioning the hypothesis from cytogenetic studies that the GRC is silenced in the male germline16,17. Instead, we propose that the GRC has important functions during germline development, which is supported by a significant enrichment in gene ontology terms related to reproductive developmental processes among GRC-linked genes (Fig. 2e, Supplementary Table 7). We further found that the GRC is significantly enriched in genes that are also germline-expressed in GRC-lacking species with RNA expression data available from many tissues18 (Fig. 2f, Supplementary Table 8). Specifically, out of 65 chicken orthologs of high-confidence GRC-linked genes, 22 and 6 are most strongly expressed in chicken testis and ovary, respectively.
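One common way to ask whether such a count exceeds chance is a one-sided hypergeometric test, sketched below in Python. This is offered only as an illustration of the reasoning; the text does not state which enrichment test was used, and the background counts in the usage lines are placeholders that would have to come from the chicken expression data.

```python
from math import comb


def hypergeom_upper_tail(population: int, successes: int,
                         draws: int, observed: int) -> float:
    """P(X >= observed) when drawing `draws` genes without replacement from
    `population` genes, of which `successes` have the property of interest."""
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# Usage with the counts from the text and PLACEHOLDER background values:
# 65 orthologs drawn, 22 of them testis-biased; the genome-wide totals
# below are hypothetical and must be replaced with real numbers.
background_genes = 15000    # hypothetical
testis_biased_genes = 2000  # hypothetical
p_value = hypergeom_upper_tail(background_genes, testis_biased_genes, 65, 22)
```

A small p-value under the real background counts would indicate that testis-biased genes are over-represented among the GRC-linked orthologs.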
The observation that all identified GRC-linked genes have A-chromosomal paralogs allowed us to decipher the evolutionary origins of the GRC. We utilised phylogenies of GRC-linked genes and their A-chromosomal paralogs to infer when these genes copied to the GRC, similarly to the inference of evolutionary strata of sex chromosome differentiation19. First, the phylogeny of the intergenic 27L4 locus of our germline samples and a previous GRC sequence8 demonstrated stable inheritance among the sampled zebra finch populations (Fig. 3a). Second, 37 gene trees of GRC-linked genes with germline-specific SNVs and available somatic genome data from other birds identify at least five evolutionary strata (Fig. 3b-f, Extended Data Fig. 7, Table 1), with all but stratum 3 containing expressed genes (cf. Fig. 2a-d). Stratum 1 emerged during early songbird diversification, stratum 2 before the diversification of estrildid finches, and stratum 3 within estrildid finches (Fig. 3g). The presence of at least 7 genes in these three strata implies that the GRC is tens of millions of years old and likely present across songbirds (Extended Data Fig. 7), consistent with a recent cytogenetics preprint20. Notably, stratum 4 is specific to the zebra finch species and stratum 5 to the Australian zebra finch subspecies (Fig. 3g), suggesting piecemeal addition of genes from 18 autosomes and the Z chromosome over millions of years of GRC evolution (Fig. 3h). The long-term residence of expressed genes on the GRC implies that they have been under selection, such as bicc1 and trim71 on GRC stratum 1 whose human orthologs are important for embryonic cell differentiation21. Using ratios of non-synonymous to synonymous substitutions (dN/dS) for GRC-linked genes with >50 GRC-specific SNVs, we found 17 genes evolving faster than their A-chromosomal paralogs (Supplementary Table 9). 
However, we also detected long-term purifying selection on 9 GRC-linked genes, including bicc1 and trim71, as well as evidence for positive selection on puf60, again implying that the GRC is an important chromosome with a long evolutionary history.
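The comparison of substitution rate ratios can be sketched as below, assuming dN and dS have already been estimated for each GRC-linked copy and its A-chromosomal paralog. The function names and input layout are hypothetical, and the sketch compares point estimates only, whereas a full analysis would assess the difference within a likelihood framework.

```python
from typing import Dict, List, Optional, Tuple

Rates = Tuple[float, float]  # (dN, dS)


def omega(dn: float, ds: float) -> Optional[float]:
    """dN/dS ratio, or None when dS is zero and the ratio is undefined."""
    return dn / ds if ds > 0 else None


def faster_on_grc(rates: Dict[str, Tuple[Rates, Rates]]) -> List[str]:
    """Genes whose GRC copy has a higher dN/dS than its A-chromosomal paralog.

    `rates` maps gene name -> ((dN, dS) of GRC copy, (dN, dS) of paralog).
    """
    faster = []
    for gene, (grc, a_chr) in rates.items():
        w_grc, w_a = omega(*grc), omega(*a_chr)
        if w_grc is not None and w_a is not None and w_grc > w_a:
            faster.append(gene)
    return sorted(faster)
```

Under the usual interpretation, a ratio well below 1 is consistent with purifying selection and a ratio above 1 with positive selection.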
Here we provided the first evidence for the origin and functional significance of a GRC. Notably, our analyses suggest that the GRC emerged during early songbird evolution and we predict it to be present in half of all bird species. The species-specific addition of dozens of genes on stratum 5 implies that the rapidly evolving GRC likely contributed to reproductive isolation during the massive diversification of songbirds22. It was previously hypothesised that GRCs are formerly parasitic B chromosomes that became stably inherited23,24. Our evidence for an enrichment of germline-expressed genes on the zebra finch GRC is reminiscent of nematodes and lampreys where short genome fragments containing similar genes are eliminated during germline–soma differentiation2–4. All these cases constitute extreme mechanisms of gene regulation through germline–soma gene removal rather than transcriptional repression3,5,10. Remarkably, the GRC harbours several genes involved in the control of cell division and germline determination, including prdm1, a key regulator of primordial germ cell differentiation in mice25,26. Consequently, we hypothesise that the GRC became indispensable for its host by the acquisition of germline development genes and probably acts as a germline-determining chromosome. The aggregation of developmental genes on a single eliminated chromosome constitutes a novel mechanism to ensure germline-specific gene expression in multicellular organisms. This may allow adaptation to germline-specific functions free of detrimental effects on the soma which would otherwise arise from antagonistic pleiotropy.
Author Contributions
Conceptualisation: W.F., A.S., J.P.M.C., F.J.R.R., C.M.K., A.M.D.C., T.I.G.; cytogenetics analyses and interpretation: J.P.M.C., F.J.R.R., J.C.; genomic analyses and interpretation: A.S., C.M.K., F.J.R.R., A.M.D.C.; transcriptomic analyses and interpretation: F.J.R.R.; proteomic analyses and interpretation: T.I.G., A.J.C., D.K., M.J.P.S., N.H.; gene enrichment analyses and interpretation: C.M.K., W.F., A.S.; phylogenetic analyses and interpretation: F.J.R.R., A.S., C.M.K., T.I.G.; manuscript writing: A.S. with input from all authors; methods and supplements writing: C.M.K. with input from all authors; supervision: A.S., J.P.M.C., T.I.G., M.J.P.S. All authors read and approved the manuscript.
Author Information
The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to F.J.R.R. (fjruizruano@ugr.es) and A.S. (alexander.suh@ebc.uu.se).
Acknowledgements
We thank Peter Ellis, Moritz Hertel, Martin Irestedt, Regine Jahn, Max Käller, Bart Kempenaers, Ulrich Knief, Pedro Lanzas, Juan Gabriel Martínez, Julio Mendo-Hernández, Beatriz Navarro-Domínguez, Remi-André Olsen, Mattias Ormestad, Yifan Pei, Douglas Scofield, Linnéa Smeds, Venkat Talla, and members of the Barbash lab and the Suh lab for support and discussions. Mozes Blom, Jesper Boman, Nazeefa Fatima, James Galbraith, Octavio Palacios, and Matthias Weissensteiner provided helpful comments on an earlier version of this manuscript. A.S. was supported by grants from the Swedish Research Council Formas (2017-01597), the Swedish Research Council Vetenskapsrådet (2016-05139), and the SciLifeLab Swedish Biodiversity Program (2015-R14). The Swedish Biodiversity Program has been made available by support from the Knut and Alice Wallenberg Foundation. F.J.R.R., J.C., and J.P.M.C. were supported by the Spanish Secretaría de Estado de Investigación, Desarrollo e Innovación (CGL2015-70750-P), including FEDER funds, and F.J.R.R. was also supported by a Junta de Andalucía fellowship. A.M.D.C. was supported by a postdoc fellowship from Sven och Lilly Lawskis fond. T.I.G. was supported by a Leverhulme Early Career Fellowship Grant (ECF-2015-453). T.I.G., A.J.C. (CABM DTP), and M.S. were supported by a NERC grant (NE/N013832/1). N.H. was supported by a Patrick & Irwin-Packington Fellowship from the University of Sheffield and a Royal Society Dorothy Hodgkin Fellowship. D.K. was supported by the National Research Foundation Singapore and the Singapore Ministry of Education under its Research Centres of Excellence initiative. W.F. was supported by the Max Planck Society. Some of the computations were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) through Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX).
The authors acknowledge support from the National Genomics Infrastructure in Stockholm funded by Science for Life Laboratory, the Knut and Alice Wallenberg Foundation and the Swedish Research Council.