Evolution of gene-rich germline restricted chromosomes in black-winged fungus gnats through introgression (Diptera: Sciaridae)

Germline restricted DNA has evolved in diverse animal taxa, and is found in several vertebrate clades, nematodes, and flies. In these lineages, either portions of chromosomes or entire chromosomes are eliminated from somatic cells early in development, restricting portions of the genome to the germline. Little is known about why germline restricted DNA has evolved, especially in flies, in which three diverse families, Chironomidae, Cecidomyiidae, and Sciaridae exhibit germline restricted chromosomes (GRCs). We conducted a genomic analysis of germline restricted chromosomes in the fungus gnat Bradysia (Sciara) coprophila (Diptera: Sciaridae), which carries two large germline restricted “L” chromosomes. We sequenced and assembled the genome of B. coprophila, and used differences in sequence coverage and k-mer frequency between somatic and germ tissues to identify GRC sequence and compare it to the other chromosomes in the genome. We found that the GRCs in B. coprophila are large, gene-rich, and have many genes with paralogs on other chromosomes in the genome. We also found that the GRC genes are extraordinarily divergent from their paralogs, and have sequence similarity to another Dipteran family (Cecidomyiidae) in phylogenetic analyses, suggesting that these chromosomes have arisen in Sciaridae through introgression from a related lineage. These results suggest that the GRCs may have evolved through an ancient hybridization event, raising questions about how this may have occurred, how these chromosomes became restricted to the germline after introgression, and why they were retained over time.


Soma Germ
Genome Assembly

A-II A-III A-IV X GRC
A. Coverage difference (log 2 ) germ/soma 12 sequencing libraries. We identified scaffolds that have a higher coverage in germ tissue than 240 the somatic tissue (log2 germ/soma coverage difference > 0.5) (Fig 2B) and a high 241 proportion (>80%) of GRC-specific 27-mers on the scaffold (Fig 2C, see Materials and both methods agreed on the assignment. Through this method we were also able to identify 244 regions that belonged to the X chromosomes or autosomes. Through both the coverage and 245 k-mer assignment of chromosomes, we identified 162.4 Mb of sequence as autosomal, 52.9 246

Histogram of Ahist
Mb of sequence that belong to the X chromosome, and 154 Mb of sequence that belong to 247 the GRC ( Table 1). The 20.2 Mb of sequence that we were unable to classify ( Table 1) 248 represent cases when the two methods (coverage and kmer-based) did not support the 249 assignment with high confidence, indicating overall high agreement of the two approaches. 250 With the exception of the GRC size, which is approximately double the size that we would 251 expect given flow cytometry estimates of chromosome size in B. coprophila (Rasch, 2006), 252 our chromosome size estimates are comparable to chromosome size estimates for this 253 species. The size of the GRCs in our genome assembly indicates that the two GRCs may 254 have been at least partially assembled separately. We explore this possibility below. 255 256 We annotated 41,418 genes in our B. coprophila genome assembly: 17,802 on the 257 autosomes, 4,277 on the X chromosome, and 15,812 attributed to the GRCs ( GRCs have paralogs throughout genome searches with the annotated genes to infer paralogs within our genome assembly. We also 268 conducted a collinearity analysis to identify larger homologous blocks in the genome in 269 which we identified collinear blocks of five or more genes anchored to the reference 270 assembly (for autosomal and X-linked genes) [31] or an assembly we generated with long-271 read data from male germ tissue (for GRC genes-see Supplementary Text 2 for methods). 272 This allowed us to increase the continuity of our assembly and to anchor genes within our 273 assembly to known chromosomes (autosomes A-II, A-III, A-IV, and the X chromosome) in 274 the reference genome. From these analyses we wanted to determine 1. Whether the GRC 275 genes have paralogs on other chromosomes in the genome and whether paralogs were 276 mostly on one chromosome, which would allow us to determine the origin of the GRCs, 2. 277 Whether there is evidence for strata on the GRC with different genes having different 278 divergence levels (i.e. some genes older than others) and 3. Whether GRC-GRC reciprocal 279 blast hits are prevalent in the genome assembly, which would give further evidence that the 280 two GRC chromosomes were assembled separately (i.e. that the same gene on homologous 281 GRCs were assembled on separate scaffolds). For convenience, we will call the GRC-GRC 282 reciprocal hits paralogs too, even though the circumstances under which they diverged are 283 not clear. 284

285
We found that the GRCs carry many paralogous genes to both autosomes and the X 286 chromosome (Fig 3A). Additionally, there is a substantial number of paralogs in which both 287 copies are on the GRC (GRC-GRC paralogs). Overall, 71.4% of the paralogs we identified 288 contained at least one GRC gene. The sequence identity between paralogs showed a 289 unimodal distribution without striking differences between specific paralog groups (Fig 3B), 290 suggesting that divergence between paralogs is not dependent on the genomic location of 291 the genes in the paralogs. A collinearity analysis revealed 88 collinear blocks between the 292 14 autosome or the X chromosome. We anchored 42 blocks to individual chromosomes in the 295 reference assembly and found that the GRCs are homologous to all four chromosomes of B. 296 coprophila (Fig 3C; Supplementary Table 2), suggesting that the GRCs are not derived 297 from a single chromosome nor from a simple chromosomal rearrangement (e.g. fusion of a 298 chromosomal arm and X chromosome). 299 B.

Number of Paralogs
Nucleotide Identity

C.
Autosomes GRC scaffolds X chromosome Frequency levels. C. Collinear blocks found between GRC scaffolds (orange) and scaffolds anchored to 305 the X chromosome (blue) or individual autosomal chromosomes (A-II, A-III, or A-IV; shades 306 of green). Note that there is variation in the reference assembly in the proportion of scaffolds 307 that are anchored to each chromosome (Supplementary Table 2). 308

309
The GRC chromosomes in Sciarids were hypothesized to be derived from the X 310 chromosome [23], therefore, one of our aims for the paralogy and collinearity analyses was 311 to test if there is a clear homology between the X chromosome and GRC. Contrary to 312 theoretical expectations, the GRCs carry many paralogous genes to both the autosomes 313 and X chromosome (Fig 3A), the divergence between the GRC and X chromosome 314 paralogs was similar to the divergence between the GRC and autosomal paralogs (Fig 3B)  315 and we identified collinear blocks between the three autosomes and the GRCs as well as 316 the X chromosome and the GRCs (Fig 3C). Therefore, we found no evidence that the GRCs 317 were derived from the X chromosome. Rather, it seems that the GRCs show no clear 318 homology to any specific chromosome, but have homologous regions to all chromosomes in 319 roughly equal proportions. This is similar to recent findings on the GRC in zebra finches, in 320 which it was found that the genes on the GRC in this species also had paralogs located 321 throughout the genome, so there was no clear chromosomal origin for this chromosome. 322 However, in contrast to the zebra finch GRC, where some GRC genes were found to be 323 older than others, the unimodality of divergences of GRC genes to their paralogs in B. 324 coprophila suggest the GRC were acquired in a short evolutionary time frame, perhaps 325 during a single event (further explored in the phylogenetic analysis). 326 327 In B. coprophila, the two GRCs are a homologous pair of approximately 88Mbp 328 (Table 1). They form bivalents during female meiosis, but it remains unclear whether the two 329 chromosomes recombine [32]. If the recombination is suppressed, the two GRCs could 330 diverge over time to the extent that the two homologous GRC chromosomes assembled on 331 separate scaffolds. We found that the total size of the GRC scaffolds was about twice as 332 large as we expected given the estimated size of one GRC chromosome (154 Mbp vs. 333 88Mbp; Table 1). This result, in addition to the large number of GRC-GRC paralogs we 334 identified, suggests that the two GRCs indeed are divergent. This suggests that the 335 reciprocal blast hits in which both gene copies were on the GRCs are likely alleles of the 336 same loci on the two homologous GRC chromosomes. However, the GRC-GRC paralogs 337 also show similar divergence distribution as the GRC-autosomal and GRC-X paralogs, 338 suggesting that the two GRCs diverged from each other over extended periods of time (i.e. 339 some genes stopped recombining close to the origin of the GRCs) (Fig 3B). 340

341
The two GRCs are heteromorphic and show different sequencing coverage 342 To further investigate whether the two GRCs are homologous but deeply divergent, 343 we analysed the sequencing coverage of all GRC genes, paralogs in which both copies are 344 on the GRCs, and collinear blocks where both blocks are located on the GRCs. We found 345 the sequencing coverage of GRC genes is bimodal, with two modes at 25x coverage and 346 30x coverage (Supplementary Fig 3). We tested if the two histogram peaks represent 347 genes on the two GRC chromosomes by comparing the coverage of GRC genes in our 348 paralogy analysis in which both genes in the reciprocal blast hit were on the GRC (see 349 above). Indeed, most of these genes have one paralog in the low coverage peak (coverage 350 18-33x) and the second paralog in the high coverage peak (coverage 23-38x; Fig 4A), containing genes with close to the higher coverage peak and the other genes with a 357 coverage close to the lower coverage peak (Fig 4B), however there were a few exceptions 358 to this rule as well (See Supplementary Fig 4). were made from pools of 95 male testes. Therefore, the two GRCs may have been at 372 slightly different frequencies in the flies we sequenced. The differences in GRC frequency in 373 male testes suggests that the variation of GRCs in sperm may not be purely stochastic with 374 respect to the two differentiated GRCs (i.e. one is more likely to be present than the other). 375  indicate that the GRCs evolved in the ancestor of Cecidomyiidae and Sciaridae) (Fig 5B). 428

Histogram of highcov_L
This raises questions about how these chromosomes evolved. The most parsimonious 429 explanation from our phylogenetic data is that the GRCs in Sciaridae arose through a 430 hybridisation event between early Sciarids and Cecidomyiids, as the B. coprophila GRC 431 branch falls within Cecidomyiidae, but is longer than the root of Sciaridae family, suggesting 432 the hybridisation event has probably happened prior to diversification of the Sciaridae family. 433 To explore the hypothesis of GRC origin through hybridisation in Sciaridae, we examined all 434 gene trees in which one B. coprophila gene was located either on an autosome or the X 435 chromosome and one or two genes were located on the GRC. We found that most of the 436 gene trees support the hybrid origin hypothesis (Fig 5C). In 410 of 424 (97%) gene trees, 437 the autosomal/ X linked gene copy fell within the Sciaridae clade, as expected. For single 438 copy GRC paralogs, 71.8% (244), were identified as members of Cecidomyiidae family and 439 in a minority of these trees the GRC gene fell within the Sciaridae (25.3%; 86) (Fig 5B). The the jewel wasp Nasonia vitripennis which causes female-to-male conversion; a transcript 524 from a gene on the PSR chromosome has been identified which causes this effect [46]. 525 However, the link between GRCs and sex determination is not air-tight, as Sciarid species chromosomes rather than whole chromosomes are restricted to the germline) suggest that 563 this system evolves to resolve germ/ soma conflict over gene expression. However, our 564 results strongly suggest that the GRCs in B. coprophila evolved not as a means to resolve 565 germ/soma conflict, but likely instead to resolve conflict between chromosomes which were 566 introgressed into Sciaridae from Cecidomyiidae. The GRCs in zebra finches, as well, are not 567 suggested to have evolved as a means to resolve germ/ soma conflict, but are instead 568 proposed to have evolved from a selfish B chromosome [13]. Investigating the evolution of 569 GRCs in more lineages will help to settle this question, but it seems that the origin of GRCs 570 are likely to be different than the origins of germline restricted DNA in systems with 571 chromatin diminution, and it may be useful to consider the evolutionary pressures which lead 572 to these two systems separately. However, after germline restricted DNA evolves, it might 573 follow a similar evolutionary trajectory in both chromatin diminution and chromosome 574 elimination systems, given that in both systems researchers have found that germline 575 restricted DNA are enriched for genes that function in germline maturation/ function [8-10].  596 We sequenced genomic DNA from somatic (heads) and germ (testes and sperm) 597 tissue of 1-2 day old adult males. We generated Illumina short read data from somatic and 598 germ tissue. We dissected males which had been put on ice in a vial (to slow down males) 599 on a clean slide in a dish of ice under a dissecting scope. For the dissections, we used 600 jewellers forceps to separate the head from the body and then placed the head in a 1.5ml 601 microcentrifuge vial on dry ice. We then placed a drop of sterile 1X PBS on the body of the male and used forceps and insect pins to slowly pull the claspers away from the body until 603 the claspers and male reproductive tissue separated from the body. We then severed the 604 ejaculatory duct and placed the testes in a separate microcentrifuge tube. We collected 605 males over several days and stored the samples at -80°C until DNA extractions, sequencing 606 a pooled sample from the tissue from 95 males. 607 608 The DNA extraction protocol we used was a modified version of the Qiagen DNeasy 609 Blood and tissue kit extraction procedure (see Supplementary text 1 for full protocol). We 610 quantified DNA on a qubit fluorometer (v3). We sequenced the samples on the Illumina 611 Novaseq S1 platform, generating PE data with 150bp reads and 350bp inserts through 612 Edinburgh Genomics. 613 614 Genome assembly and annotation 615 We generated a genome assembly with both the somatic and germ tissue short read 616 libraries (Supplementary Table 1). We also generated a genome assembly from long read 617 sequence data from germ tissue, but the short-read assembly produced a more complete 618 genome assembly according to BUSCO gene assessments, so this assembly was used for 619 gene annotation. We used the long read assembly for the collinearity analysis to increase 620 the continuity of GRC scaffolds (See Supplementary Text 1 for details).  Identification of GRC scaffolds 641 We used a combination of two techniques to identify scaffolds belonging to the GRC 642 in our assembly. One technique employs coverage differences between the germ and 643 somatic tissues to identify which chromosome a scaffold belongs to. Since the number and 644 type of chromosomes differs between the somatic and germ tissues (Fig 2A), we expect 645 autosomal scaffolds to have a log2 coverage difference (germ/soma) of approximately -1 646 (i.e. at 2X the frequency in somatic tissue compared to germ), X-linked scaffolds to have a 647 coverage difference of approximately 1, and GRC scaffolds to have very few reads mapping 648 to them from the somatic library but a diploid coverage level in the germ tissue library. We 649 mapped the germ and somatic reads to the genome assembly with bwa mem (v0.7.17) 650 using default settings and counted the number of reads from each library mapping to each 651 scaffold [60]. Due to somatic contamination in the germ library, the coverage differences 652 displayed the pattern we expected and we were able to distinguish autosomal and X linked 653 scaffolds but the autosomal and X chromosome scaffolds had slightly different coverage coverage less than 45x, as the majority of GRC linked genes had a coverage between 15x 710 and 45x (see Supplementary Fig 3). We identified how well each of the collinear blocks met 711 the expected coverage patterns (i.e. one block having a higher coverage than the other) by 712 computing a statistic comparing the number of genes that meet the expected coverage 713 patterns by the total number of genes in the collinear block (Supplementary Fig 4). 714 715 Phylogenetic analysis of the GRCs origin 716 We utilized draft genome assemblies for 14 Sciaroidea species and 2 species 717 outside the Sciaroidea, most of which we obtained from Anderson et al. [27] with the 718 exception of Mayetiola destructor, which we obtained from NCBI (accession: 719 GCA_000149195.1). We conducted a BUSCO analysis (version 4.0.2) [55] using the insecta 720 database (insecta_odb10) on each genome assembly, along with our B. coprophila 721 assembly, to identify single copy orthologs in each genome. We excluded the Exechia fusca 722 genome from further analyses as this genome had a low proportion of complete BUSCO 723 genes identified, indicating that the genome was likely of poor quality. We identified the 724 chromosomal locations of each BUSCO gene identified in the B. coprophila assembly and 725 extracted the BUSCO IDs for all genes which were duplicated and had one gene copy on an 726 autosome or the X chromosome and either one or two gene copy on the GRCs 727 (Supplementary Table 3). We took the amino acid sequence of these BUSCO genes for B. 728 coprophila (all copies) and the longest amino acid sequence for each BUSCO ID per species 729 as the gene sequence in the genome assemblies from all other species (although note that 730 most of the other Sciaroidea species had relatively low rates of gene duplication--See