Genome and transcriptome analysis of the mealybug Maconellicoccus hirsutus: A model for genomic Imprinting

In mealybugs, transcriptional inactivation of the entire paternal genome in males, due to genomic imprinting, is closely correlated with sex determination. The sequencing, de-novo assembly and annotation of the mealybug, Maconellicoccus hirsutus genome and its comparison with Planococcus citri genome strengthened our gene identification. The expanded gene classes, in both genomes relate to the high pesticide and radiation resistance; the phenotypes correlating with increased gene copy number rather than the acquisition of novel genes. The complete repertoire of genes for epigenetic regulation and multiple copies of genes for the core members of polycomb and trithorax complexes and the canonical chromatin remodelling complexes are present in both the genomes. Phylogenetic analysis with Drosophila shows high conservation of most genes, while a few have diverged outside the functional domain. The proteins involved in mammalian X-chromosome inactivation are identified in mealybugs, thus demonstrating the evolutionary conservation of factors for facultative heterochromatization. The transcriptome analysis of adult male and female M.hirsutus indicates the expression of the epigenetic regulators and the differential expression of metabolic pathway genes and the genes for sexual dimorphism. The depletion of endosymbionts in males during development is reflected in the significantly lower expression of endosymbiont genes in them. Author summary The mealybug system offers a unique model for genomic imprinting and differential regulation of homologous chromosomes that pre-dates the discovery of dosage compensation of X chromosomes in female mammals. In the absence of robust genetics for mealybugs, we generated and analysed the genome and transcriptome profile as primary resources for effective exploration. The expanded gene classes in the mealybugs relate to their unique biology; the expansion of pesticide genes, trehalose transporter, SETMAR and retrotransposons correlate with pesticide, desiccation and radiation resistance, respectively. The similarity in the genomic profile of two species of mealybugs strengthens our gene prediction. All the known epigenetic modifiers and proteins of the primary complexes like the PRC1,2 and the trithorax are conserved in mealybugs, so also the homologues of mammalian proteins involved in X chromosome inactivation. The high copy number of genes for many partners in these complexes could facilitate the inactivation of a large part of the genome and raise the possibility of formation of additional non-canonical complexes for sex specific chromosome inactivation. In adult males and females, the status of epigenetic regulation is likely to be in a maintenance state; therefore, it is of interest to analyze the expression of epigenetic regulators during development.


83
The mealybugs (Hemiptera:Pseudococcidae), such as Maconellicoccus hirsutus (Mhir)and 84 Planococcus citri (Pcit), commonly feed on plant sap. They are considered as invasive species  (Fig 2). The amplicons were sequenced and mapped back to the scaffold in the assembly, 178 to confirm their organization. The gene organization was also confirmed by long PCR (Fig 2   179 I). between different species as opposed to direct descent, is very common in sap feeding insects. 184 In addition, sap-feeders have obligate endosymbionts and this tripartite nested arrangement of 185 obligate endosymbionts in mealybug, provide nutritional benefits to the host [17]. This may 186 lead to lateral transfer of genes more frequently in those insects. 187 We analysed the HGTs in Mhir genome as well as reanalysed HGTs in M. hirsutus genome 188 assembly provided by Husnik [18] . We identified 98 HGTs after applying all the QC criteria.  Gene classes expanded and contracted in the mealybug genome 219 The evolutionary gain and loss of genes contribute to adaptation of organisms to their habitat. 220 In insects like the mealybugs, the expansion and contraction of genes may be linked to their 221 widespread geographical distribution and broad host range, thus it is interesting to analyse the 222 gene classes that have expanded or contracted. To identify such gene classes, we compared the 223 proteome of M. hirsutusand P.citri with five other insect species, namely,D.melanogaster, A. 224 pisum, R.prolixus, C.lectularius and B.mori using OrthoFinder [22]. In this analysis, we identified orthogroups (cluster of orthologous genes) which were further classified as - 226 Expanded, contracted and mealybug-specific based on the gene counts and a consolidated list 227 of all the three clases is given in S3Table.

228
Expanded gene classes; pesticide and desiccation resistance genes 229 Gene orthogroups involved in biological processes related to insecticide resistance, desiccation 230 resistance, radiation resistance and hormone signalingare expanded in both Mhir and Pcit (Fig   231   4). 232 The entire carboxylesterase gene family is expanded in the mealybugs, having the highest  Table). Cytochrome P450 family also shows similar expansion, with mealybugs having the insecticide resistance known and hence their infestation in cultivated plants [25,26]. In addition to the expansion of genes for insecticide metabolism, fatty acid metabolism enzymes, fatty acid 250 synthase and acylglycerol-o-acyltransferase are also expanded in the mealybugs.These genes 251 could be associated with the production of the waxy coating in mealybugs, majorly composed 252 of trialkyl glycerols and wax esters. The waxy covering poses serious impediment for 253 permeability of pesticides and protects mealybugs from desiccation and also predators [26,27]. 254 Thus, expansion of these genes could also contribute to resistance against insecticides. all other hox genes are present in 2 or more copies compared to the other insect species (Fig   334   5). The hox genes in Mhir lack the cluster-like arrangement seen in Drosophila or Tribolium, 335 and are scattered throughout the genome with some genes present in pairs or triplets on the 336 same scaffold, while some are on the same scaffold but interspersed with other gene classes 337 (Fig 5). This kind of arrangement of hox genes is called the "atomized or no clustering" which 338 is also observed in flatworm Schistosoma mansoni and nematodes [44]. Anopheles gambiae, 339 Tribolium castaneum and Cimex lectularius have a single large cluster of all hox genes,   The genes coding for histones, the primary substrates of epigenetic modification, and their 361 variants were identified in the mealybug genome along with other genomes for comparison 362 (Fig 6A, B). Mhir contains only a single complete quintet cluster of histone genes, while in 363 two of its other quintets, the histone H1 is absent ( Fig 6C). The remaining histone genes are 364 present in scaffold either singly or with some of the other histone genes, but not all the histones.

365
For instance, H2A may be present as an isolated gene in one scaffold or in combination with 366 some of the core histone genes in another scaffold ( Fig 6C). This organization has some 367 similarity with the other hemipteran, Acyrthosiphon pisum [47]. The number of alleles for core 368 histones vary with only two copies of histone H1 gene present in Mhir while 9 copies in Pcit.  methyltransferase, METTL4 gene is present as a single copy in both. We also generated 393 phylogenetic trees of DNMTs of all insects to assess their evolutionary conservation with 394 human proteins (Fig 7). 395 We observed that DNMT proteins of M. hirsutus, P. citri and C. lectularius and A. pisum 396 clustered with DNMT1, while only one methyltransferase of A. pisum clusters withDNMT3. 397 P. citri proteins (g3941, g42301), remained outliers, as they are partial sequences lacking     In-house PERL script was used to fetch genes containing high priority domains in Mhir and 481 also in the other genomes that were used for comparative analysis. The gene classes were also 482 mined using BLASTp (www.blast.ncbi.nlm.nih.gov).After manual curation, these genes were 483 divided into three groups-(i) Interproscan only -genes which harbor the HPD but has not been 484 annotated/assigned any function in BLASTp (ii) BLASTp only-those genes that lack HPD 485 even though a function is assigned in BLASTp (iii) Concordant-those that are annotated by 486 BLASTp and contain the HPD as well (common to both InterProScan and BLASTp). A  S9Table). Besides this, many genes with HPD are recognized (Mhir-13, Pcit-5, Apis-9), but 507 were not annotated in BLASTp( Fig 9A). These are potentially novel methyltransferases, to be 508 investigated further.The number of genes for specific modification as well as activating and 509 repressive modifications is not high in mealybugs relative to that in others though a large part 510 of genome is under epigenetic regulation (Fig 9B and 9C). This is not unexpected considering 511 that these are catalytic functions. 512 We used phylogenetic clustering with Drosophila HMTs, to assign specificity of the coded 513 enzymes (Fig 10). The low confidence annotations (identified by BLASTp only) are due to 514 partial sequences, but the identity of some of them could be deciphered based on their clustering 515 pattern. For example between the two E(z)-like genes, Mhir_g5597 is complete and Mhir_g18633 is fragmented, similarly additional copies of genes for trr detected (Mhir_g13137 517 & Mhir_g20142) are also partial sequences (Fig 10, S3A, B, C, D Fig). The key-word based 518 identification of BLASTp annotated genes also led to error as in the case of trl and Ash2, which 519 was detected by phylogenetic clustering (Fig 10). These proteins are not catalytic histone 520 modifiers in Drosophila, but are associated with epigenetic regulatory complexes, like  Table). SETMAR genes, identified as an expanded class of genes in  Pcit have Smyd genes and Smyd5 is identified as a differentially expressed gene (Fig 10). 566 A single ortholog of gpp/DOT-1 (H3K79 specific) is present both in Mhir and Pcit (Fig 10). Arginine methyltransferases are also present in Mhir (10 genes) and Pcit (9 genes, S9 Table).    The Gcn5-related HAT1 forms a distinct cluster with Mhir (g10554) and Pcit (g4576) proteins 611 (Fig 11). The Dmel Gcn5 clusters also include Mhir (g20555) and Pcit (g6490) proteins. HAT1  Table) . Since, HATs participate in dosage compensation and  Table). We used phylogenetic clustering with Drosophila 649 HDMs and found multiple genes coding for the almost all demethylases in both Mhir and Pcit.  of Dmel (Fig 13). The HDAC1 and HDAC3 clusters form a part of a larger cluster that includes 694 other Mhir (g5630) and Pcit (g31540) proteins (Fig 13). HDAC3 binds to putative enhancers  (Fig 13). HDAC11, the lone member of Class IV, is an unusual type, which is present  Pcit, complexes (Fig 16). Ash2 is missing in the Pcit in Trithorax dCOMPASS-like complexes.

783
The absence of the genes in both Pcit and Mhir genome may indicate true absence in mealybug 784 genome, while absence in only one of the two, like absence of Sh2, suggests sequence gaps.  The InterProScan and the BLASTp analysis of Mhir genome led to the identification of 30 801 chromatin remodelling proteins, including 2 putative novel chromatin remodelers (S11 Table). 802 We carried out a comparative analysis of all the genes in Mhir with other insect genomes (S11 Table). D.melanogaster has well defined candidates for CRM genes along with some 804 predictedchromatinremodelers that contain the HPD (S11 Table). It is known that the A. pisum by various chromatin remodeling complexes is given (S12 Table). The writers for all the  (Fig 19). Out of the 1183 genes, 652 genes have higher expression in males 856 and 531 genes show higher expression in females, these will be referred to as male enriched 857 and female enriched genes hereafter in this manuscript. 858 We used a combination of approaches to find functionally enriched pathways and processes in 859 males and females. We found genes related to metabolic and oxidative phosphorylation annotation. Biological processes show enrichment of genes related to "translation" and 864 "response to oxidative stress" specifically present in females while genes related to 865 "cytoskeletal organization" and "cellular protein modification" were present in males (S11Fig).

866
In addition, we performed manual curation of all DE genes (S12 Fig, Fig20&21) which  genes is moderate to low as seen in terms of relative TPM values (S14 Table). This low   The process of X inactivation in mammals is random as opposed to that in the mealybugs,  Based on the available literature on X chromosome inactivation, we considered the protein 961 factors that interact with the long non-coding RNA as assembly factors to establish 962 inactivation. The presence of homologues of genes coding for these proteins in the mealybug 963 genome are identified (S15 Table and Fig 23). We have considered the molecular process 964 leading to chromosome condensation and transcriptional inactivation in three different phases, 965 however this does not correlate to the temporal sequence of these events. their status in mealybug genome is given in S15 Table. 991 The detection of most of the protein factors interacting with Xist RNA in the mealybug genome   Genome sequencing and annotation 1023 The genomic DNA was extracted from the pool of embryos using phenol-chloroform method. The assembly of the genome was validated by PCR using tiling primers, followed by Sanger 1055 sequencing. The primers were designed considering gene architecture as in assembled scaffold 1056 and the sequence obtained was compared with the same scaffold. The scaffold containing 1057 histone genes was selected for this analysis. The primer sequences are given in S16 Table. 1058 Long PCRs (5-7 kb amplicons) were also carried out for validation of these scaffolds.

1118
The library prepared using the Illumina TruSeq stranded RNA library preparation kit following 1119 the manufacturer's instructions. Ribo-Zero was used to deplete ribosomal RNA followed by 1120 fragmentation and priming for cDNA synthesis. The cDNA first strand was synthesized 1121 followed by the second strand synthesis. Adenylation of the 3' ends was performed to prevent 1122 them from ligating to one another during the adapter ligation process. Following PCR 1123 enrichment, the concentration was estimated using Bioanalyser 2100 (Agilent Technologies) 1124 and sequenced on Illumina HiSeq 2500 producing 100X100-nt paired-end reads. Two