ABSTRACT
Female mammals are functional mosaics of their parental X-linked gene expression due to X chromosome inactivation (XCI). This process inactivates one copy of the X chromosome in each cell during embryogenesis and that state is maintained clonally through mitosis. In mice, the choice of which parental X chromosome remains active is determined by the X chromosome controlling element (Xce), which has been mapped to a 176 kb candidate interval. A series of functional Xce alleles has been characterized or inferred for classical inbred strains based on biased, or skewed, inactivation of the parental X chromosomes in crosses between strains. To further explore the function-structure basis and location of the Xce, we measured allele-specific expression of X-linked genes in a large population of F1 females generated from Collaborative Cross strains. Using published sequence data and applying a Bayesian “Pólya urn” model of XCI skew, we report two major findings. First, inter-individual variability in XCI suggests mouse epiblasts on average contain 20-30 cells contributing to brain. Second, NOD/ShiLtJ has a novel and unique functional allele, Xcef, that is the weakest in the Xce allelic series. Despite phylogenetic analysis confirming that NOD/ShiLtJ carries a haplotype almost identical to the well-characterized C57BL/6J (Xceb), we observed unexpected patterns of XCI skewing in females carrying the NOD/ShiLtJ haplotype within the Xce. Copy number variation is common at the Xce locus and we conclude that the observed allelic series is a product of independent and recurring duplications shared between weak Xce alleles.
Introduction
Although X chromosome inactivation (XCI) was first described in the early 1960s (Lyon 1961; Beutler et al. 1962), the genetic influences and molecular mechanisms underlying this phenomenon are still incompletely understood. Embryonic stem cells of female placental mammals undergo random XCI, a process that transcriptionally inactivates one of the two X chromosomes early in development (Avner and Heard 2001; Disteche and Berletch 2015). Subsequent daughter cells carry on the initial decision, forming clusters of cells in which either the maternal or paternal X is actively transcribed. Consequently, female mammals are unique mosaics of parental X chromosome activity. XCI ensures that expression of genes on the X chromosome is functionally equalized with those of males as a form of dosage compensation.
At the epiblast stage, each embryonic cell randomly and independently inactivates one of the parental X chromosomes and locks in its cellular fate (Nesterova et al. 2001; Okamoto et al. 2004). This random selection occurs at around embryonic day E5.5 (Takagi et al. 1982; Rastan 1982), prior to differentiation into the three major embryonic germ layers and when there are 120-250 cells comprising the epiblast (Snow 1977). The inactivated X chromosome (Xi) undergoes major reorganization and becomes condensed and heterochromatic, stabilizing gene repression in subsequent somatic cells (Wutz 2011; Nora et al. 2012). Regulation of XCI is carried out in part by Xist, a cis-acting long noncoding RNA (lncRNA) that is transcribed only from the inactivated X (Xi) (Brown et al. 1991). The major X inactivation center (Xic) extends across a 450 kb multi-function region containing many elements responsible for the complex molecular cascade orchestrating XCI, including Xist and other cis elements such as Tsix and Xite (Lee et al. 1996; Cattanach et al. 1970; Ogawa and Lee 2003).
The role played by Xist is necessary but not sufficient to fully explain XCI, leading researchers to explore the larger landscape of cis and trans regulators, chromatin modifiers, and protein complexes that may comprise the Xist interactome (Dossin et al. 2020; Penny et al. 1996; Brockdorff et al. 1991; Giorgetti et al. 2016; Minajigi et al. 2015). Control of XCI is inherently genetic and thus heterogeneity in the genetic architecture of these elements may affect the expression of Xist and its antisense counterpart, Tsix, leading to disruption of the machinery controlling the counting, choice, and silencing of the inherited X chromosomes. Xite is one such example of a region harboring both allelic heterogeneity and intergenic transcription start sites resulting in differential regulation of Tsix expression (Ogawa and Lee 2003). In turn, Tsix is monoallelically expressed from the active X (Xa) and blocks Xist accumulation, thus ensuring the future Xa (Lee et al. 1999a,b).
XCI is ostensibly random, so the a priori distribution of maternal and paternal Xa is expected to be 50:50. Nevertheless, non-random biases between mouse lines have been observed for decades (Cattanach and Isaacson 1967; Cattanach 1970; Cattanach et al. 1970), leading researchers to postulate that beyond the wholesale control of inactivation, preferential skewing for one parental set of X chromosomes over the other may also be under genetic control. Skewing can take two forms. Primary skewing is when the parental chromosomes are inactivated in unequal proportion from the outset (Percec et al. 2002). Secondary skewing arises as a form of selection, where paternal and maternal chromosomes are initially inactivated at random but the embryonic cells carrying them undergo unbalanced rates of replication or death (Minks et al. 2008; Takagi 1980). The hallmarks of secondary skewing also differ, in that it can be tissue-specific and occur at any point during development. In the event of a beneficial or deleterious mutation being carried on the chromosome inherited from one parent, secondary skewing could be advantageous.
Primary skewing in mice has been associated with an allelic series on the X chromosome named the X chromosome controlling element (Xce). Five known functional Xce alleles have been described, ordered from weakest to strongest as Xcea < Xcee < Xceb < Xcec < Xced (Cattanach et al. 1972; Cattanach and Papworth 1981). Under this paradigm, X chromosomes with Xcea are the least likely to remain active, and when found in female heterozygotes alongside the Xcec allele, skewing as extreme as 20:80 is expected (Figure 1). These allelic designations are well-recognized and have been consistently observed in inbred mouse strains exhibiting replicable skews in X inactivation ratio. Among the remaining unknown features of the Xce, however, are the exact size and location of the region and the nature of genetic variation that leads to the phenomenon.
A natural starting place to search for the Xce would be within the Xic. Control of XCI was initially mapped to a genomic region which overlaps the Xic, and Xist was an early candidate for the Xce. Using translocated coat color genes, Cattanach and collaborators placed the control region between two markers for Tabby (Ta) and Mottled (Mo) coat colors (Figure 2). Upon discovery of Tsix and Xite, allelic heterogeneity across the Xic was suggested as a candidate for Xce and as an explanation for the phenotypic breadth of skewing observed in mice (Ogawa and Lee 2003). However, work over the last two decades demonstrates that the Xce does not overlap the Xic, suggesting that a separate region also participates in XCI. Further refinements over the decades (Cattanach and Papworth 1981; Simmler et al. 1993; Chadwick et al. 2006; Calaway et al. 2013) have narrowed the region to a 176 kb minimum interval about 500 kb proximal to the Xic, rich with structural variants including duplications and inversions.
Researchers have thus generally coalesced around the theory that the Xce, as first defined by Cattanach (1970), describes a region on the X chromosome proximal to the Xic that influences XCI and skewing phenotypes with a number of functional alleles. Nevertheless, there remains uncertainty about the nature, function, and mechanisms of how Xce influences XCI, or if XCI is entirely controlled by a single locus. A complete characterization of Xce and its influence on XCI remains elusive.
More recent work incorporating genetic diversity in F1 mouse crosses confirmed the broad patterns of the Xce alleles and supported the importance and functionality of the region in XCI, despite difficulty in pinpointing the functional variation and specific gene(s) causing the skew. The narrowest putative Xce region to date was reported by our group in Calaway et al. (2013) using F1 crosses of classical inbred mouse strains, wild-derived strains, and other Mus species. Those results showed that the Xce region, localized to a candidate region of at least 176 kb consistent with previously described intervals, confers skewed XCI in patterns compatible with the known paradigm (Chadwick et al. 2006). The minimum Xce interval comprises a series of duplications and inversions, and Calaway et al. (2013) proposed that copy number variations (CNVs) may play a role in XCI skewing (Figure 2). Increased genetic diversity made possible the discovery of another functional allele in the series, Xcee, observed in inbred PWK/PhJ mice (Crowley et al. 2015; Calaway et al. 2013; Lenarcic et al. 2018).
In addition, Thorvaldsen et al. (2012) demonstrated that mice with recombinant breakpoints across Xce had significantly different XCI ratios than either control populations of homozygous or non-backcrossed F1 mice. This finding highlights the difficulty of identifying one single region that explains the entire phenomenon and confers the totality of XCI control.
In this study we take advantage of the fairly narrow putative Xce interval to explore XCI in a genetically heterogeneous mouse population. We further define and characterize the role of Xce, and in particular of CNVs, in XCI skewing using 266 female mice from 28 F1 crosses of the Collaborative Cross (CC) multiparental mouse population (Collaborative Cross Consortium 2012; Srivastava et al. 2017). The CC are a panel of replicable and genetically diverse inbred mouse strains, each derived from an independent cross of eight inbred strains representing the three major Mus musculus subspecies: domesticus (A/J, C57BL/6J, 129S1/SvImJ, NOD/ShiLtJ, NZO/HlLtJ, WSB/EiJ), castaneus (CAST/EiJ) and musculus (PWK/PhJ). Each CC strain possesses genome-wide contributions from the founder strains due to mixing that occurred during rounds of breeding, leading to functional genetic variation and phenotypic breadth. Generations of sib-pair mating resulted in inbred haplotype blocks, allowing for replicates of each CC strain.
Most of the previous studies quantifying XCI make use of either 1) F1 hybrids of classical inbred mouse strains, or 2) back-crossed mouse populations on an inbred background with specific and deliberate introductions of one other strain to probe the boundaries of Xce. With increased heterozygosity in the genetic background of our CC-derived sample population, we can tease apart the effects of Xce independent from the genetic background. As a result, any observed XCI skew will be directly attributable to primary skewing due to Xce because other loci on the X chromosome will be shuffled among the crosses. Increased genetic heterogeneity in our sample population also allows us to describe further phenotypic heterogeneity in XCI ratios beyond the known Xce alleles (Figure 3).
Two of the inbred laboratory strains used in generating the CC, NOD/ShiLtJ and NZO/HlLtJ (henceforth referred to as NOD and NZO, respectively), have not had their Xce alleles characterized through crosses; however, both were predicted to be Xceb due to haplotype similarity with C57BL/6J based on dense genotyping (Calaway et al. 2013). Our results interrogate the accuracy of these predictions based on observed XCI skew in F1 females with sequence derived from NOD or NZO spanning the Xce.
Our estimation of XCI skewing is more precise and generalizable compared with much of the XCI literature for two reasons. First, we incorporate X chromosome-wide expression data by quantifying from global RNA-seq. Previous work, by contrast, has generally quantified XCI using allele-specific expression (ASE) measured at a few known genes, which may present biases and inaccurate ratios due to inactivation escape, cis regulatory elements, or various confounding variables that are not due to XCI itself. Second, we report precise measures of uncertainty about our estimates using a Bayesian hierarchical statistical model that accounts for multiple sources of information. Chromosome-wide ASE data presents more opportunities for sophisticated statistical modeling to assess XCI, and there are relatively few examples of XCI proportion modeled hierarchically as a beta-distributed random variable (Larson et al. 2017; Lenarcic et al. 2018). This allows us to largely account for other subtle factors that are known to play a role, such as parent-of-origin effects (POE) in XCI whereby the paternal X (Xp) is predisposed to slightly lower levels of activation regardless of Xce allele (Wang et al. 2010; Calaway et al. 2013; Lenarcic et al. 2018). The model also, in accounting for variability in XCI among genetically identical individuals, estimates the effective number of epiblast cells at the point of X inactivation that contribute to the organ on which the RNA-seq is collected.
Another key resource we take advantage of is recently published high-coverage whole genome sequences of the CC strains (Srivastava et al. 2017; Shorter et al. 2019), which we used to specifically and accurately quantify CNVs across the Xce. By quantifying targeted short reads, we confirm that this region hosts highly recurring sequences which appear to have implications for Xce function and, consequently, for skewed XCI proportions in mouse crosses. Our characterization of the Xce region utilizes the most genetically diverse mouse population used to estimate XCI to date and incorporates data from next-generation sequencing to determine ASE, providing a comprehensive quantification of chromosome-wide skewing.
Materials and Methods
Notation
Throughout this article, we denote each F1 sample by Strain 1/Strain 2, where counts from Strain 1 comprise the numerator of the XCI proportion, i.e., $\text{Strain 1} / (\text{Strain 1} + \text{Strain 2})$. Reciprocal crosses are denoted a or b, for example CC001♀ × CC011♂ and CC011♀ × CC001♂, respectively. These designations were made arbitrarily, but remain consistent throughout the study. Table S1 provides a summary of the CC strains and the F1 crosses.
Mouse breeding populations and sample collection
The process of generating CC strains has been previously described in detail by Collaborative Cross Consortium (2012). CC mice were purchased from the Systems Genetics Core Facility (SGCF) at the University of North Carolina (UNC). This study includes data from 266 samples derived from a total of 29 CC strains (Figure 4) used to produce 28 F1 recombinant inbred intercross lines (CC-RIX). Data for this study were generated from two CC-RIX sample populations (SP). Heterozygosity present in the RIX lines allows us to precisely measure ASE by comparing the expression of transcripts with allele A versus transcripts with allele B from mice that inherit the genotype AB.
SP1
This population was developed to identify strain, POE and perinatal maternal diet effects on gene expression and behavioral phenotypes in adulthood by utilizing F1 crosses of CC-RIX and has been described in detail (Schoenrock et al. 2018). Nine genetically distinct reciprocal CC-RIX were bred from 18 nonoverlapping CC strains such that samples from CC1♀ x CC2♂ and CC2♀ x CC1♂ are each represented (Figure 4a). Strain-pair selection aimed to maximize several criteria, namely the number of known brain-imprinted loci, as defined from Crowley et al. (2015) and Williamson et al. (2013) that are heterozygous between haplotypes that are identical by descent with NOD and C57BL/6J (Oreper et al. 2018).
Females from the 18 CC strains were exposed to one of four experimental diets (vitamin D deficient, protein deficient, methyl donor enriched, or standard control chow; Dyets Inc., Bethlehem, PA) during the perinatal period from 5 weeks prior to mating until their pups were weaned 3 weeks after birth. Whole brain tissue was collected from 188 female CC-RIX mice at approximately 60 days of age (mean ± SD: 65.1 ± 4.8 days). Mice used for gene expression studies were behaviorally naïve. Tissue was collected in 26 batches with a minimum of 2 RIX/diet combinations in a batch. Mice were euthanized and whole brain was immediately extracted. A sagittal cut was made to separate the left and right hemispheres, and tissue was immediately flash frozen in liquid nitrogen and stored at −80 °C until pulverization. Right brain hemispheres of all samples were pulverized using a BioPulverizer unit (BioSpec Products, Bartlesville, OK).
SP2
In the second population, 21 CC strains, 10 of which overlap with the strains in SP1, produced 19 non-reciprocal RIX. These mice were part of a study to elucidate the genetic basis of antipsychotic-induced adverse drug reactions, which has been previously described (Giusti-Rodríguez et al. 2020). The larger study comprised 840 mice, representing 62 CC strains and 73 RIX lines. The design of the RIX crosses formed a quasi-loop such that each maternal line was also the paternal line for another cross (see Figure 4b). Only 85 female samples with RNA-seq data were relevant to our analysis, so the number of replicates from SP2 is smaller than from SP1, with a median of four samples per CC-RIX (range: 2-7).
Starting at 8 weeks of age, the mice were subjected to a 30-day treatment protocol in which half were implanted with slow-release haloperidol (antipsychotic drug) pellets (3.0 mg/kg/day) and the other half received placebo. Treated and untreated mice were matched between sexes, RIX cross, cage, and batch. After 30 days of exposure to drug or vehicle, at 12 weeks of age, mice were sacrificed by cervical dislocation without anesthesia to avoid effects on gene expression. A complete description of this experiment is provided in an independent manuscript (FPMV, unpublished).
RNA-seq preparation
SP1
For 188 mice, total RNA was extracted from ~25 mg of powdered right brain hemisphere tissue using the Maxwell 16 Tissue LEV Total RNA Purification Kit (AS1220, Promega, Madison, WI). The UNC HTSF core performed RNA concentration and quality checks using fluorometry (Qubit 2.0 Fluorometer, Life Technologies Corp., Carlsbad, CA) and a microfluidics platform (Bioanalyzer, Agilent Technologies, Santa Clara, CA). RNA-sequencing was performed in three sequencing batches spread out over the course of the two-year collection of brain tissue once 96 samples from F1 CC-RIX offspring were obtained. There was a median of 20 samples per CC-RIX (range: 12-32), with 3-4 samples per diet and reciprocal direction.
RNA was prepped with the Illumina TruSeq Stranded mRNA protocol for 100 base pair, stranded, single-end reads at the UNC sequencing core. An initial round of RNA-seq was conducted in December 2014 and June 2015 on HiSeq 2500 machines, and quality control (QC) was conducted on the first few batches of RNA-seq output with fastqc/0.11.8. Samples whose reads had low “Per base sequence quality” and “Per sequence quality scores” were prioritized for a second library prep. This first round of RNA-seq was followed up with more sequencing in June 2019 on a HiSeq 4000 machine to boost average read depth for each sample. The final data for each sample were subjected to the same QC criteria and combined, for an average of 24.6 million (M) reads per sample (median 17.9 M, range 10.6-109 M). Seven samples were removed due to missing X chromosomes or low read count.
SP2
Detailed methods for RNA-seq sample preparation and processing are described in an independent manuscript (FPMV, unpublished). Briefly, RNA was extracted from striatum using the Total RNA Purification 96-Well Kit (Norgen Biotek, Thorold, ON, Canada) and prepared with the Illumina (San Diego, CA) TruSeq Stranded mRNA Library Preparation Kit v2 with polyA selection using 1 μg total RNA as input. Equal amounts of all barcoded samples were pooled, to account for lane and machine effects. Each of the three pools was sequenced on eight lanes of the Illumina HiSeq 2000 for 100 base pair, stranded, single-end reads.
Quality control filtered out lanes with significant issues in terms of duplication level, fraction of mapped reads (using TopHat2) and, after summarizing reads at a gene level, fraction of mapped reads among the reads that were mapped to an exon. We only considered samples that passed 3 cutoffs: filtering by duplication (at most 40% duplication), percentage of mapped reads (at most 25% reads not mapping) and percentage of mapped reads being mapped to a gene (at most 35% not being mapped to a gene). QC procedures also resulted in corrections or discarded samples due to mismatches in labeling for strain and sex. Principal component analysis identified an outlier that was also removed. Another sample was removed due to a missing X chromosome.
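The three sample-level cutoffs can be expressed as a single predicate. The sketch below is illustrative only: the function name `passes_qc` and its argument names are hypothetical and are not taken from the pipeline used in the study; only the three thresholds come from the text.

```python
def passes_qc(duplication_frac, unmapped_frac, mapped_not_in_gene_frac):
    """Return True if a sample passes all three cutoffs stated in the text.

    All arguments are fractions in [0, 1]. The names are illustrative
    assumptions; the thresholds (40%, 25%, 35%) are from the Methods.
    """
    return (
        duplication_frac <= 0.40              # at most 40% duplication
        and unmapped_frac <= 0.25             # at most 25% of reads not mapping
        and mapped_not_in_gene_frac <= 0.35   # at most 35% of mapped reads outside genes
    )
```

A sample failing any single cutoff is excluded, so the three conditions are combined with a logical AND.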
Demographic details about the 266 CC-RIX samples across study populations are compiled in File S1.
Genotyping in CC-RIX and haplotype reconstruction
To ensure accurate phasing of variants, each sample in SP1 was genotyped on the MiniMUGA platform (Sigmon et al. 2020). MiniMUGA is an array-based genetic QC platform with over 11,000 probes designed to perform robust discrimination between most classical and wild-derived laboratory mouse strains. Three X0 females from SP1 that were removed from subsequent analysis were confirmed using the MiniMUGA platform, serving as a useful negative control for our ASE quantification methods. Haplotypes corresponding to each CC founder strain were reconstructed using R/qtl2 v0.20 (Broman et al. 2019). Genotype- and allele-probabilities for SP2 were inferred from previous genotyping conducted on CC strains and two to four additional animals per strain known to be their most recent common ancestors using the MegaMUGA platform. MegaMUGA comprises up to 77,800 single nucleotide polymorphism (SNP) markers that were optimized for detecting heterozygous regions and discriminating between haplotypes in homozygous regions, with a special emphasis for markers that are informative in the CC (Morgan et al. 2016). Genotyping for MiniMUGA and MegaMUGA was performed at Neogen (Lincoln, NE). Cross-referencing RIX haplotype regions with known CC and CC founder variants for consistency was particularly important at heterozygous loci where the correct parental inheritance would be critical for determining ASE.
We defined the Xce in the data based on previously published intervals because all 8 CC founder strains are represented in every sample, instead of each mouse representing one single strain. In iterative stages we defined Xce, first, based on the interval described in Chadwick et al. (2006) from 101.6-103.6 Mb, and then, refined to the minimum interval described in Calaway et al. (2013) roughly from 102.75-102.92 Mb because the narrower interval was still consistent with both our results from the broader interval and previously observed XCI skews between strains. All base pair positions throughout the manuscript are derived from the Genome Reference Consortium Mouse Build 38 (GRCm38).
Measuring allele-specific expression (ASE) in F1 females
To detect allele- and chromosome-specific expression, we developed a novel approach using direct k-mer matching to capitalize on known variants in the sequenced CC and founder mice. Key to this method is a set of virtual 25-base genotyping probe sequences created from the forward and reverse complement sequences centered on both reference and alternate variants. The reference sequence was provided by the GRCm38 reference mouse genome, based on C57BL/6J, and alternate alleles were collected from sequence data of the other 7 CC founder strains obtained from the Sanger Institute’s Mouse Genomes Project (Keane et al. 2011).
The variant set was filtered to remove unusually high and low probe-sequence counts occurring in any of the sequenced samples. An initial set of approximately 866,000 genome-wide variants was verified across CC and founder strains and became the anchors for matched pairs of k-mers with either the reference or variant allele in the center base. Roughly 590,000 of these k-mers are present in sequences with the highest transcript support level (TSL1), and of those about 414,000 are unique. We filtered k-mers to exclude those that (1) contain multiple variants, or that match to (2) duplicated sequences, (3) patterns that are missing from multiple founder strains, (4) loci close to exon start sites, and critically, (5) multiple genomic locations in any CC strain. Taking these criteria into account, between 40% and 60% of the remaining variants were usable per chromosome. The remaining 7,957 k-mers on the X chromosome comprise a set of paired 25-mers designed to uniquely identify whether a sample contains the reference or alternate allele (File S2). We used the tool msbwt v0.3.0 (run on python/2.7.11) to transform our RIX RNA-seq reads into multi-string Burrows-Wheeler Transform (BWT) formatted files to perform efficient, exact k-mer searches to count instances of each k-mer in the RNA-seq reads, thereby quantifying gene expression corresponding to each CC parent in an allele-specific fashion (Holt and McMillan 2014).
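The paired-probe idea can be illustrated with a minimal sketch. It builds a reference/alternate 25-mer pair centered on a variant and counts exact matches to each probe or its reverse complement in a set of reads. This is not the msbwt implementation: a naive linear scan stands in for the BWT index, and the function names, the toy sequence, and the toy reads are all hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def make_probe_pair(ref_context, alt_base):
    """Build (reference, alternate) probes from an odd-length reference
    sequence whose center base is the variant position."""
    if len(ref_context) % 2 != 1:
        raise ValueError("probe length must be odd so the variant is centered")
    mid = len(ref_context) // 2
    alt_probe = ref_context[:mid] + alt_base + ref_context[mid + 1:]
    return ref_context, alt_probe


def count_probe(reads, probe):
    """Count exact occurrences of a probe or its reverse complement.

    Naive scan used only for illustration; the study used BWT-based search.
    """
    targets = {probe, reverse_complement(probe)}
    k = len(probe)
    return sum(
        1
        for read in reads
        for i in range(len(read) - k + 1)
        if read[i:i + k] in targets
    )


# Toy example: a 25-base context with a C>T variant at the center base.
ref_probe, alt_probe = make_probe_pair("ACGTTGCAAGCTCGATTACGGATCC", "T")
toy_reads = [
    "GG" + ref_probe + "AA",          # carries the reference allele
    reverse_complement(alt_probe),    # alternate allele, opposite strand
    "TT" + alt_probe,                 # alternate allele, forward strand
]
ref_count = count_probe(toy_reads, ref_probe)
alt_count = count_probe(toy_reads, alt_probe)
```

In this toy case the reference-allele proportion would be `ref_count / (ref_count + alt_count)`, which mirrors how paired counts feed into the allele-specific expression estimates.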
Statistical modeling of X chromosome inactivation
We designed a Bayesian hierarchical model to estimate X inactivation proportion at the level of the gene, individual, and RIX, based on the RNA data above. The model also, as a byproduct of its use of beta distributions and their connection to Pólya urns, estimates the number of brain precursor cells in the epiblast at the point of X inactivation choice, at around E5.5 (Rastan 1982; Lenarcic et al. 2018). This section describes first the model for estimating the XCI proportion associated with a given RIX, and then the estimation of the number of brain precursor cells (hereafter, the day 5 brain precursor count) based both on a given RIX and on all RIXs combined. The main components of the model are summarized in Figure 5, with more detail in Figure S1.
Model for RIX-specific XCI proportion
The average XCI proportion inherent to a RIX is reflected by the XCI proportions of mice from that RIX. These mouse-level XCI proportions are in turn reflected by ASE at X chromosome genes. Our model estimates mouse-level XCI proportions for genes by counting k-mers from the allele of one parent vs that of the other and treating these as outputs from a binomial distribution controlled by overall XCI proportions at the gene, mouse, and RIX levels.
Consider a given RIX of CC strains u and v, where strain u is the one expected to have the weaker Xce allele or, in the case where both are of the same strength, the maternal strain. For counts associated with Xist, which is expressed from the Xi and should therefore have the opposite XCI proportion, the assignments of u and v were reversed. For mouse i = 1, …, n, let Nkgi be the total number of counts for k-mer k of gene g and let ykgi be the number of these counts specifically from strain u. Then, ykgi is distributed as
$$y_{kgi} \sim \mathrm{Bin}(N_{kgi}, \mu_{gi}),$$
where μgi is the expected proportion expressed from strain u vs strain v for gene g in mouse i. Different genes g = 1,2, … can have different proportions μ1i, μ2i, …, but we require these to be centered around a common individual-level proportion μi as
$$\mu_{gi} \sim \mathrm{Beta}(\mu_i, \alpha),$$
written in terms of mean μi and precision α, where this corresponds to the conventional parameterization, Beta(μiα, (1 – μi)α). The individual-level proportion μi is modeled as
$$\mu_i \sim \mathrm{Beta}(\mu_{c[i]}, \alpha_0), \tag{1}$$
where c[i] denotes the combination of experimental factors c that are relevant to mouse i, μc is the XCI proportion predicted for that combination, and α0 models the day 5 brain precursor count (described later). The proportion μc is modeled through a logit link as the outcome of a linear predictor,
$$\mathrm{logit}(\mu_c) = \beta_0 + \theta_c,$$
where intercept β0 models an overall value for the RIX, and θc incorporates the effects of experimental covariates.
The set of experimental covariates in θc was different for SP1 and SP2. For SP1, these were perinatal diet (diet), POE (recip), and their interaction,
$$\theta_c = \beta_D[\mathrm{diet}_c] + \mathrm{recip}_c\,\beta_R + \mathrm{recip}_c\,\beta_{DR}[\mathrm{diet}_c],$$
where dietc is a categorical predictor indicating the perinatal diet to which mice in condition c were exposed, βD is an ndiet-vector of diet effects constrained to sum to zero, recipc is a two-level code for the reciprocal direction (distinguishing whether the dam was u or v), βR is the POE, and βDR is an ndiet-vector of treatment-by-POE effects, also constrained to sum to zero. Across the RIXs in SP1, ndiet ranged from 2-4, corresponding to a maximum of 4, 6, or 8 conditions per RIX. For RIXs where any condition level c contained only one sample, we set θc = 0.
For SP2, which did not include reciprocal crosses, we initially considered using
$$\theta_c = \mathrm{trt}_c\,\beta_T,$$
where trtc is a two-level code for the drug treatment assignment (haloperidol or placebo) of condition level c and βT is the treatment effect. Treatment assignment was missing for 9 mice, and in these cases we used model-based imputation, with γc ~ Bin(1,0.5). The treatment effect, however, was observed to be zero (see File S3), which serves as a negative control for the model given the timing of the drug dose at 8 weeks after birth, well after XCI is established. Because of the zero effect, the lack of a strong biological rationale for its inclusion, and the relative instability of its estimation for some RIX, the final model for SP2 was θc = 0, i.e., with the treatment effect excluded.
Our primary target quantity for each RIX, regardless of its population, was the overall XCI proportion, μ, given by the inverse logit of β0, i.e., $\mu = \mathrm{logit}^{-1}(\beta_0) = 1 / (1 + e^{-\beta_0})$.
We additionally report XCI proportions for each mouse, μi for i = 1, …, n.
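To make the layered structure concrete, the following sketch simulates data forward through the hierarchy for a single RIX with θc = 0: a RIX-level proportion from the inverse logit of β0, mouse-level proportions drawn from a beta distribution with precision α0, gene-level proportions drawn with precision α, and binomial counts. It is a generative illustration only, not the JAGS model used for inference; the function names and all parameter values are hypothetical, and each gene is given a single count rather than several k-mers.

```python
import math
import random


def inv_logit(x):
    """Inverse logit, mapping the linear predictor to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def beta_mean_precision(rng, mean, precision):
    """Draw from Beta(mean * precision, (1 - mean) * precision)."""
    return rng.betavariate(mean * precision, (1.0 - mean) * precision)


def simulate_rix(rng, beta0, alpha0, alpha, n_mice, n_genes, depth):
    """Simulate (y, N) count pairs for one RIX with no covariate effects."""
    mu = inv_logit(beta0)                                   # RIX-level proportion
    data = []
    for _ in range(n_mice):
        mu_i = beta_mean_precision(rng, mu, alpha0)         # mouse level
        for _ in range(n_genes):
            mu_gi = beta_mean_precision(rng, mu_i, alpha)   # gene level
            # Binomial(depth, mu_gi) as a sum of Bernoulli trials.
            y = sum(rng.random() < mu_gi for _ in range(depth))
            data.append((y, depth))
    return mu, data


rng = random.Random(1)
# Hypothetical values chosen only for illustration.
mu, data = simulate_rix(rng, beta0=-0.4, alpha0=25.0, alpha=50.0,
                        n_mice=200, n_genes=10, depth=30)
pooled = sum(y for y, _ in data) / sum(n for _, n in data)
```

With many mice, the pooled proportion of strain-u counts should approach μ, while a smaller α0 would spread the mouse-level proportions more widely around it.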
Prior distributions for parameters were specified as follows. For parameters modeling RIX-wide XCI, we set β0 ~ Logistic(0,1) such that μ ~ Unif(0,1), i.e., a flat prior on overall XCI proportion. The prior set on α0 ~ Uniform(0,1000) reflected a reasonable number of cells in the whole embryo at around E5-6 (Snow 1977). Other parameters were modeled with weakly informative priors: βR, βT ~ N(0, $10^{4}$); βD, βDR ~ Nstz(0, $10^{4}$ × I), where Nstz() is the multivariate normal distribution constrained so that its variates sum to zero [after Crowley et al. (2014), Appendix A]; and α ~ Ga(0.01,0.01).
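The statement that a standard logistic prior on β0 induces a flat prior on μ follows from a short change of variables. The derivation below is a sketch added for clarity; F denotes the standard logistic distribution function, a symbol introduced here only for this purpose.

```latex
\begin{align*}
\beta_0 \sim \mathrm{Logistic}(0,1)
  &\;\Rightarrow\; F(x) = \Pr(\beta_0 \le x) = \frac{1}{1 + e^{-x}} = \mathrm{logit}^{-1}(x), \\
\Pr(\mu \le m)
  &= \Pr\!\left(\mathrm{logit}^{-1}(\beta_0) \le m\right)
   = \Pr\!\left(\beta_0 \le \mathrm{logit}(m)\right)
   = F\!\left(\mathrm{logit}(m)\right)
   = m, \qquad 0 < m < 1 .
\end{align*}
```

Because the distribution function of μ is the identity on (0, 1), μ is uniform on that interval.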
Posterior distributions for parameters were obtained using Markov chain Monte Carlo (MCMC). MCMC was performed over two separate chains, each run with $5 \times 10^{4}$ (SP1) or $10^{5}$ (SP2) iterations, discarding the initial 10% of the iterations as burn-in and thinning every 5, thus providing $1.8 \times 10^{4}$ or $3.6 \times 10^{4}$ posterior samples in total. Estimates are reported as posterior means (modes and medians are supplied in Tables S1–2) with 95% highest posterior density (HPD) intervals. All models were written and implemented in JAGS 4.3.0 (Plummer 2003) and R version 3.5.2 (R Core Team 2017). Code to run the statistical model is available at https://github.com/kathiesun/XCI_analysis.
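The reported posterior sample totals follow directly from the chain settings. The short check below reproduces that arithmetic; the helper name `retained_samples` is hypothetical and is not part of the published code.

```python
def retained_samples(n_chains, n_iter, burnin_percent, thin):
    """Posterior draws kept after discarding burn-in and thinning."""
    kept_per_chain = (n_iter - n_iter * burnin_percent // 100) // thin
    return n_chains * kept_per_chain


sp1_total = retained_samples(n_chains=2, n_iter=50_000, burnin_percent=10, thin=5)
sp2_total = retained_samples(n_chains=2, n_iter=100_000, burnin_percent=10, thin=5)
```

For SP1 this is 2 × (50,000 − 5,000) / 5 = 18,000 draws, and for SP2 it is 2 × (100,000 − 10,000) / 5 = 36,000 draws.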
Pólya urn-based estimation of the day 5 brain precursor count
In our model for X inactivation, the individual-specific XCI proportion μi is modeled as a beta distribution with precision α0 (Equation 1). This use of the beta distribution can be directly related to an idealized model of cell proliferation based on a Pólya urn (Lenarcic et al. 2018) (Figure S1). The Pólya urn is a hypothetical random process that begins with an urn containing a red balls and b blue balls. A ball is drawn at random and replaced by two balls of the same color. This is repeated an infinite number of times, after which the proportion of red balls pred in the urn will be distributed as
$$p_{\mathrm{red}} \sim \mathrm{Beta}(a, b),$$
where the precision a + b is the total number of balls at the point the process began. To the extent that proliferation of embryonic cells in alternate XCI states is analogous to the proliferation of alternate color balls in the Pólya urn, our precision parameter α0 models the (effective) number of brain-relevant cells at the point of the E5.5 XCI decision.
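The link between the urn and the beta distribution can be checked by simulation. The sketch below runs many independent urns for a finite number of draws and compares the mean and variance of the final red proportion with those of Beta(a, b). The starting counts, number of draws, and function names are hypothetical choices for illustration; a finite number of draws only approximates the limiting distribution.

```python
import random


def polya_urn_proportion(rng, a, b, n_draws):
    """Run one urn: each draw adds one ball of the drawn color."""
    red, blue = a, b
    for _ in range(n_draws):
        if rng.random() < red / (red + blue):
            red += 1
        else:
            blue += 1
    return red / (red + blue)


def simulate_urns(seed, a, b, n_draws, n_urns):
    """Mean and sample variance of the final red proportion across urns."""
    rng = random.Random(seed)
    props = [polya_urn_proportion(rng, a, b, n_draws) for _ in range(n_urns)]
    mean = sum(props) / n_urns
    var = sum((p - mean) ** 2 for p in props) / (n_urns - 1)
    return mean, var


a, b = 10, 15  # hypothetical starting cells in each XCI state
sim_mean, sim_var = simulate_urns(seed=7, a=a, b=b, n_draws=1000, n_urns=1000)
beta_mean = a / (a + b)
beta_var = a * b / ((a + b) ** 2 * (a + b + 1))
```

The variance of the final proportion shrinks as a + b grows, which is why the spread of XCI proportions among genetically identical mice is informative about the number of precursor cells.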
We estimated 1) an α0 for each RIX, and 2) a global α0, based on all RIX data. Posterior distributions of α0 for each RIX were obtained using MCMC as described above. These were similar to each other but individually somewhat vague (see Results). To obtain a more precise estimate, we assumed that α0 was the same across RIXs and calculated a posterior given all RIX data as the normalized product of the individual posteriors,
$$p(\alpha_0 \mid \mathbf{y}_1, \ldots, \mathbf{y}_R) \propto \prod_{r=1}^{R} p(\alpha_0 \mid \mathbf{y}_r),$$
where $p(\alpha_0 \mid \mathbf{y}_r)$ denotes the posterior for RIX r = 1, …, R given RIX data $\mathbf{y}_r$, and the above relation holds only because the priors on α0 are identical and uniform such that $p(\alpha_0 \mid \mathbf{y}_r) \propto p(\mathbf{y}_r \mid \alpha_0)$. In practice, this involved parametrically approximating each RIX posterior, $p(\alpha_0 \mid \mathbf{y}_r)$, as a gamma distribution with shape $\kappa_r$ and rate $\lambda_r$ using the fitdistr() function from the R package MASS v7.3-51.4 (Venables and Ripley 2002), and then calculating their renormalized product, which is equivalent to a gamma distribution with shape $\sum_{r} \kappa_r - (R - 1)$ and rate $\sum_{r} \lambda_r$.
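The closed form for the product of gamma densities can be verified numerically. In the sketch below, the per-RIX shape and rate pairs are hypothetical stand-ins for values that fitdistr() would return; the check confirms that the summed log densities differ from the combined gamma log density only by a constant, which is what renormalization removes.

```python
import math


def gamma_logpdf(x, shape, rate):
    """Log density of a gamma distribution in the shape/rate form."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)


def combine_gamma_posteriors(params):
    """Shape and rate of the renormalized product of gamma densities.

    Valid as a combined posterior only when every per-RIX prior is the
    same flat prior, as assumed in the text.
    """
    n = len(params)
    shape = sum(s for s, _ in params) - (n - 1)
    rate = sum(r for _, r in params)
    return shape, rate


# Hypothetical (shape, rate) fits for three RIXs.
per_rix = [(6.0, 0.20), (4.5, 0.15), (8.0, 0.35)]
shape_all, rate_all = combine_gamma_posteriors(per_rix)


def log_ratio(x):
    """Sum of per-RIX log densities minus the combined log density."""
    return (sum(gamma_logpdf(x, s, r) for s, r in per_rix)
            - gamma_logpdf(x, shape_all, rate_all))
```

Because `log_ratio` takes the same value at every x, the product and the combined gamma are proportional, and the combined distribution is narrower than any single per-RIX posterior.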
Point and interval estimates from the aggregate posterior approach above were comparable to those from traditional random-effects meta-analysis on the per-RIX estimates, the latter conducted with the R package meta v4.14-0 (Balduzzi et al. 2019) using both inverse variance and DerSimonian-Laird estimators (DerSimonian and Laird 1986).
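For readers unfamiliar with the comparison method, the DerSimonian-Laird random-effects estimator can be sketched as follows. This is a generic textbook version in Python, not the implementation in the R package meta, and the input estimates and variances are hypothetical.

```python
import math

def dersimonian_laird(estimates, variances):
    """Random-effects pooled estimate using the DerSimonian-Laird tau^2."""
    k = len(estimates)
    w = [1.0 / v for v in variances]                 # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)               # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]   # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2

# Hypothetical per-RIX estimates of alpha_0 and their variances.
pooled, se, tau2 = dersimonian_laird([22.0, 26.0, 25.0], [16.0, 9.0, 25.0])
```

When the per-RIX estimates agree as closely as in this toy input, the between-study variance is truncated at zero and the result coincides with the inverse-variance (fixed-effect) estimate.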
Whole genome sequences of CC strains
Over the last few years, high-coverage sequences of the CC strains have been made available to the research community. These whole genome sequences (WGS) improved upon the resolution of recombination breakpoints and haplotype assignment in 75 CC strains by sequencing paired-end short reads (150 bp) at 30× coverage for a single male per strain (Srivastava et al. 2017; Shorter et al. 2019). Deeper sequencing led to improved haplotype reconstruction in samples bred from CC strains, and allowed for identification of unique mutations private to a particular strain. We incorporated additional WGS of the CC founder strains from other previously published sources (Keane et al. 2011) and from the GRCm38 mouse reference genome.
The WGS described above for 75 CC strains, along with 24 replicates of C57BL/6J mice and one replicate each of the other seven CC founders, have been made publicly available as BWT-formatted DNA-seq reads at http://csbio.unc.edu/CEGSseq/index.py. These multi-string BWTs were built using the msBWT Python tool (Holt and McMillan 2014) from all lanes and paired ends of the Illumina read sets for these genome sequences. Resources making use of the BWT dataset for efficient k-mer searches have been previously described (Srivastava et al. 2017).
Haplotype analyses based on WGS
The resulting WGS from the CC strains were used to assemble 8 intervals totalling 8,215 bp across the Calaway et al. (2013) minimum Xce locus in each of the 8 CC founders. Each founder was represented as follows: the reference genome for C57BL/6J; CC055 for NOD; CC020 for A/J; CC024 for 129S1/SvImJ; CC051 for WSB/EiJ; CC032 for CAST/EiJ; CC003 for PWK/PhJ; and CC002 for NZO. We first identified intervals between 0.4 and 3 kb in length, composed of contiguous 45-mers that are present only once in the reference genome. We used the most proximal of these 45-mers as a seed and assembled the sequence in the CC strains using the consensus of the read pileups. All bases used in the consensus were supported by at least two independent reads and, within each strain, lacked any evidence of SNPs or copy number differences. Once assembled, the sequences were aligned using the EMBL-EBI tool Clustal Omega (Madeira et al. 2019), and alignments were optimized by manual inspection to reduce the number of variants. The location, length, and CC strains used for the assembly are shown in Table S3.
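The first step above, finding stretches of the reference covered by k-mers that occur only once, can be illustrated on a toy sequence. This Python sketch is a simplification under stated assumptions: it scans overlapping k-mers in a plain string rather than querying a BWT index, and the function name, the toy sequence, and k = 3 are hypothetical.

```python
from collections import Counter

def unique_kmer_runs(reference, k, min_len):
    """Return (start, end) spans covered by consecutive k-mers that occur once."""
    kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    counts = Counter(kmers)
    runs, run_start = [], None
    for i, kmer in enumerate(kmers + [None]):        # sentinel closes last run
        is_unique = kmer is not None and counts[kmer] == 1
        if is_unique and run_start is None:
            run_start = i
        elif not is_unique and run_start is not None:
            end = i - 1 + k                          # exclusive end of last k-mer
            if end - run_start >= min_len:
                runs.append((run_start, end))
            run_start = None
    return runs

# Toy reference: a low-complexity repeat followed by a unique stretch.
toy_runs = unique_kmer_runs("AAAAAACGTTGCATGA", k=3, min_len=5)
```

In the study the analogous filter would use k = 45 and a length window of 0.4 to 3 kb; the repeated prefix in the toy input plays the role of a multi-copy region that is excluded.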
Phylogenetic analysis of CC founder strains
The 8 assembled intervals spanning the Xce region were used to estimate the phylogenetic relationship based on X chromosome sequence similarity among the 8 CC founders using BEAST 2.6.3, which performs Bayesian evolutionary analysis by sampling trees (Bouckaert et al. 2019). The tree model was based on a coalescent prior for a constant population and was simplified with linked site, clock, and tree parameters among the intervals. We assumed a strict clock and the HKY substitution model (Hasegawa et al. 1985). We generated 10^7 MCMC samples from the posterior of coalescent trees, thinning every 10^3 samples, in each of three separate runs with different starting seeds for a total of 3 × 10^4 recorded posterior samples. We visualized the resulting tree set using DensiTree v2.2.7, which shows different topologies with varying levels of support.
Quantifying copy number variations
The set of 106 WGS with BWT-formatted data described above was also previously used to develop an occurrence-count matrix of every sequential, non-overlapping 45-mer from the standard mouse reference (GRCm38). We used this count matrix to query 45-mers across CC strains containing different functional alleles in the putative Xce interval defined in Calaway et al. (2013). By comparing and quantifying differential k-mer counts between strains, we generated discrete evidence of CNVs in regions along the X chromosome. Samples were classified into eight groups corresponding to the CC founder strains at the Xce interval, roughly between 102.65-102.95 Mb when translated to GRCm38 coordinate space. The 24 C57BL/6J replicates comprised the baseline “reference” group and the remaining CC-derived samples that were homozygous for C57BL/6J across the Xce interval were separated into another group to provide a negative control.
Strain-wide copy numbers for each k-mer were first normalized per sample and then averaged across samples in each group. Segmental duplications (SD) and inversions (I) were defined as regions where the mean difference, Δ, between 45-mer counts in the comparison strain and the inbred C57BL/6J mean differed from 0 after k-means clustering of Δ into clusters centered at 0, > 0, and < 0. K-mers that have an average of one copy in the reference group and zero copies in the comparison group were deemed to contain nucleotide polymorphisms in the non-reference strain. The relevant 45-mers spanning the Xce are compiled in File S4, along with the X chromosome positions, the number of copies present in the reference genome, and any SD or I assignments. Alignment boundaries for each SD were determined and visualized using Gepard v1.40 with a word size of 45 (Krumsiek et al. 2007).
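The normalize, average, and compare steps can be sketched on toy counts. This Python example rests on two labeled assumptions: per-sample normalization is done by dividing by the sample median (the text does not specify the method), and assignment to the nearest of three fixed centres replaces the k-means clustering used in the study. All names and counts are hypothetical.

```python
from statistics import mean, median

def normalize(counts):
    """Scale one sample's k-mer counts so that its median count is 1."""
    m = median(counts)
    return [c / m for c in counts]

def group_mean(samples):
    """Average normalized counts across samples, k-mer by k-mer."""
    normed = [normalize(s) for s in samples]
    return [mean(col) for col in zip(*normed)]

def classify_delta(comparison, reference):
    """Label each k-mer by the nearest fixed centre (-1, 0, +1) of its delta."""
    labels = {-1: "lower", 0: "equal", 1: "higher"}
    comp, ref = group_mean(comparison), group_mean(reference)
    deltas = [c - r for c, r in zip(comp, ref)]
    return [labels[min(labels, key=lambda ctr: abs(d - ctr))] for d in deltas]

# Toy raw counts for six k-mers: two reference samples, one comparison sample.
reference_group = [[30, 31, 29, 30, 30, 30], [20, 20, 21, 19, 20, 20]]
comparison_group = [[25, 25, 50, 51, 25, 0]]
calls = classify_delta(comparison_group, reference_group)
```

In this toy input the third and fourth k-mers carry one extra copy, and the last k-mer is absent from the comparison sample, which in the study's scheme would point to a nucleotide polymorphism rather than a copy number change.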
Data availability
The processed data and code to support the results reported here are available at Figshare [link]. These data include: R scripts to re-generate figures in this manuscript and intermediate data files; full demographic data for SP1 and SP2; curated lists of 25-mers used to detect reference and variant alleles in RNA-seq data from the X chromosome along with code to generate this list; k-mer counts of the curated 25-mers for both populations; k-mer counts of 45-mers from DNA-seq using CC and CC founder strains; haplotype probabilities based on genotyping data for SP1 and based on MRCA genotypes for SP2; full MCMC sampling outputs for the statistical model. The processed incident count matrices of contiguous 45-mers for the CC strains noted above, and BWT-formatted files of all RNA-seq data are available publicly at http://csbio.unc.edu/CEGSseq/index.py. Genotyping data for the CC MRCAs are available at http://csbio.unc.edu/CCstatus/index.py?run=FounderProbs and genotyping data for SP1 have been deposited in a UNC Dataverse repository (https://dataverse.unc.edu/dataverse/MiniMUGA) under DOI number 10.15139/S3/UYURKF. All R scripts to run the statistical model, and process and generate datasets are available at https://github.com/kathiesun/XCI_analysis.
Results
XCI ratio estimated for each mouse and RIX from RNA-seq allele-specific expression
The CC-RIX females comprising this study were genetically heterogeneous mosaics of the 8 CC founder strains with one copy of each chromosome inherited in its entirety from each CC parent. In order to quantify ASE, we relied on efficient multi-string BWT searching of k-mers to identify reference and alternate alleles in the RNA-seq reads. This is akin to a microarray-based quantification strategy where each k-mer represents a probe designed based on prior knowledge, allowing us to precisely target known SNPs to measure ASE.
Counts of k-mers containing reference and alternate alleles of variants were attributed to a particular CC parent according to the haplotype reconstruction derived from genotyping data. The relative frequency of summed reads across a gene originating from one of the CC parents, e.g. CC001 in a CC001/CC011 RIX, was modeled analogously to the frequency of heads when flipping a potentially biased coin, as a binomial count that depends on an underlying long-run proportion that may deviate from 0.5. This proportion was estimated for each gene; the proportions across genes were used to estimate an underlying XCI proportion for each mouse; and the XCI proportions across mice were used to estimate a proportion specific to the RIX. These estimations were performed simultaneously using a Bayesian hierarchical model, which also 1) incorporated, and thereby corrected for, potential effects of experimental or breeding-related factors, and 2) connected the variability of mouse-specific XCI proportions about their RIX-wide mean to the subset of epiblast cells at the point of the initial XCI decision contributing to the assayed tissue, in this case the brain.
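The nesting of RIX, mouse, and gene levels can be made concrete with a forward simulation. This Python sketch is a deliberately reduced version of the hierarchy: it assumes the mouse-level proportion is drawn from a beta distribution with mean equal to the RIX proportion and precision α0, it gives every gene the same proportion as its mouse, and it omits covariates. The function names and parameter values are hypothetical, and no Bayesian fitting is performed.

```python
import random

def simulate_rix(rix_prop, alpha0, n_mice, n_genes, reads_per_gene, seed=0):
    """Simulate allele-specific read counts for the mice of one RIX."""
    rng = random.Random(seed)
    mice = []
    for _ in range(n_mice):
        # Mouse-level XCI proportion scatters around the RIX-wide proportion;
        # a larger alpha0 means tighter scatter.
        mu = rng.betavariate(alpha0 * rix_prop, alpha0 * (1 - rix_prop))
        # Each gene contributes a binomial count of reads from one parent.
        counts = [sum(rng.random() < mu for _ in range(reads_per_gene))
                  for _ in range(n_genes)]
        mice.append({"mu": mu, "counts": counts})
    return mice

def naive_mouse_estimate(mouse, reads_per_gene):
    """Pooled fraction of reads from the chosen parent (not a Bayesian fit)."""
    return sum(mouse["counts"]) / (reads_per_gene * len(mouse["counts"]))

mice = simulate_rix(rix_prop=0.3, alpha0=25, n_mice=5, n_genes=20, reads_per_gene=50)
```

Pooling reads within a mouse recovers its simulated proportion closely, while the proportions differ noticeably between mice of the same RIX; that between-mouse spread is the quantity the precision parameter describes.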
XCI is relatively consistent across genes within an individual
Across an individual mouse, gene-level estimates of XCI proportion are stable, suggesting that our quantification methodology is reliable. Figure 6 shows XCI proportion estimates for one mouse from each of three CC-RIXs (all 266 samples are shown in File S5). Our Bayesian model estimates posterior distributions for XCI proportions at the gene and RIX level; we report both the means of those distributions and their 95% highest posterior density (HPD) intervals. Gaps in the X chromosome position reflect the patchwork heterozygosity and homozygosity of the CC-RIX samples. In this example, the HPD intervals are fairly narrow around the means, indicating the precision of these estimates, and for two of the mice, the XCI proportion is far from 0.5, indicating strong XCI skew (File S5).
These three example mice demonstrate the consistency of estimates for each sample at genes across the X chromosome, supporting our estimates of even fairly extreme XCI skews such as those shown in Figure 6b-c. At the mouse level, this consistency is representative of all of the samples in the experiment.
Pattern of XCI skew in RIXs with known Xce allele is consistent with previous studies
Our results for XCI skew were largely consistent with previously published research, given our knowledge about the underlying haplotype structure of the CC strains and the known Xce subtypes corresponding to major Mus musculus strains (Figure 3). Estimates of XCI proportions for each sample and each CC-RIX are compiled in Table S1 and File S1.
Figure 7 shows the XCI proportion at the individual- and RIX-level for every cross in the study, divided as a) crosses between strains with previously phenotyped Xce alleles, b) crosses between strains with inferred alleles. Crosses with both CC strains sharing the same Xce allele had XCI ratios at roughly 50:50. The crosses demonstrate that Xcea is weaker than any other known allele, as only roughly 30-35% of the cells have active chromosomes bearing Xcea (Figure 7a). Xcee and Xceb are approximately of equal strength, which corroborates the similar pattern seen in Calaway et al. (2013).
Unlike the narrow HPD intervals seen at the gene and individual level (Figure 6), there is greater variability across individuals within a RIX. Some RIX from SP2 have wide HPD intervals reflecting their smaller replicate groups overall and perhaps a smaller starting amount of cells relative to SP1 due to RNA-seq sample collection for SP2 that took tissue from the striatum as opposed to whole brain tissue for SP1. An additional caveat is that haplotype reconstruction for SP2 relied upon genotyping data from the CC resource and not the specific individual mouse, which may have led to errors in assigning haplotypes, particularly near segregation points. Therefore, some RIX from SP2 have HPD intervals that are less informative, e.g. CC015/CC005, CC015/CC011, and CC021/CC002 (Figure 7).
The width of the HPD intervals at the RIX level derives from the precision, α0, of the overall RIX-wide XCI proportion. Though we described some legitimate experimental artifacts that may contribute to lower precision in certain RIX crosses, there are also true underlying biological reasons for this variation among samples in a cross. Inter-individual variability among the samples in a RIX can be interpreted as different amounts of starting cells that correspond to our precision estimate, α0, as described next.
Estimated number of cells in pre-brain epiblast tissue ranges from 20 to 30
Our statistical model for sample-specific XCI proportion implies a Pólya-urn model for cell proliferation in which one of the estimated parameters, α0, relates to the number of brain precursor cells in the epiblast at the onset of random X inactivation. Our estimates of α0 were strikingly concordant between the two sample populations (Figure 8 and Table S2), and so we combined them to give a single, overall value. The combined posterior distribution for α0 followed a gamma distribution with shape parameter 100.36 and rate parameter 4.10. This translates to a point estimate (posterior mean) for α0 of 24.48 with standard error (posterior standard deviation) of 2.44 and a 95% HPD interval of 19.93 to 29.50. Our model thus suggests that the number of initial pluripotent cells in the epiblast that eventually form brain tissue in mature mice may be around 20-30. This is a reasonable figure given the number of total cells in the epiblast ranges from around 120 on E5.5 to 660 on E6.5 (Snow 1977).
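The point estimate and standard error quoted above follow directly from the gamma shape and rate, as this short Python check shows. The function name is illustrative; the HPD interval is not recomputed here because it requires the gamma quantile function.

```python
import math

def gamma_summary(shape, rate):
    """Mean and standard deviation of a gamma distribution (shape, rate)."""
    return shape / rate, math.sqrt(shape) / rate

mean_alpha0, sd_alpha0 = gamma_summary(100.36, 4.10)
```

The mean is shape divided by rate and the standard deviation is the square root of the shape divided by the rate, which agree with the reported 24.48 and 2.44 to two decimal places.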
Unexpected XCI skewing in RIX females with the NOD Xce allele
As well as corroborating earlier studies, the CC strain data also characterized the XCI (and thus Xce subtype) for two founder strains that had not been previously evaluated. Both founder strains, NOD and NZO, had been previously assigned to Xceb due to sequence similarity with the reference genome.
We found a striking pattern of skewed XCI in crosses containing haplotypes derived from NOD at the Xce interval from 102.65-102.95 Mb. Crosses heterozygous at this locus between NOD and any other founder exhibited profoundly skewed XCI proportions, despite the expectation that skewing would behave similarly with other strains carrying Xceb. Our results indicate that NOD harbors a novel Xce allele conferring a lower tendency to remain active, weaker even than Xcea. Figure 6 shows examples of gene- and sample-wide estimates of XCI proportion in three different CC-RIX crosses, each with NOD contributing the Xce region for at least one of its inherited X chromosomes.
Chromosomes bearing the Xce derived from NOD were consistently more likely to be inactivated than any other Xce allele (Figure 7b). This consistency suggests that this observed skewing is due to underlying variation that is inherent to the NOD Xce haplotype and not CC strain-specific factors, leading us to establish Xcef from NOD as a new allele in the functional series. Unexpected skews were observed in 11 out of 12 CC-RIX where one parental chromosome inherits the NOD Xce allele. This concordance was irrespective of different CC and founder strains carrying the Xcef, and transcended different Xce pairings, suggesting that this result is genuinely due to primary skewing.
Interestingly, chromosomes from NZO behave as if they carry Xceb, which matches our a priori assumptions. This narrows our focus of inquiry because both NZO and NOD are identical-by-descent in this region and harbor few SNPs compared with the mouse reference genome. As a result, we investigated whether 1) the observed XCI skewing phenomenon in NOD (and by extension, other Xce functional alleles) may be driven by chromosomal rearrangements and not necessarily sequence variation; or 2) NOD and/or NZO were improperly categorized as Xceb based on haplotype similarity.
SNP analysis in the Xce interval shows that NOD and C57BL/6J have almost identical haplotypes
Given the unexpected patterns of XCI skewing in RIX females that carry the NOD Xce haplotype in heterozygosity, we decided to use recently released WGS from 75 CC strains to determine the extent of haplotype sharing between the CC founders. To ensure that we only compare orthologous sequences we limited this analysis to genomic regions spanning the Xce candidate interval that have copy number one in the reference genome and C57BL/6J, and likely copy number one in each of the other CC founders. For each region, we assembled the CC founder sequence using the CC strain with the corresponding haplotype and deepest sequence coverage. After aligning each region, we used standard phylogenetic analysis to determine the relationships between the founder haplotypes (Figure 9). The results were fully consistent with the previously published haplotype sharing based on microarray genotyping (Calaway et al. 2013). Briefly, the eight founders are distributed in four well supported haplotypes: one represented by CAST/EiJ, the second by PWK/PhJ, a third that includes 129S1/SvImJ and A/J; and the fourth and last comprises C57BL/6J, NOD, NZO and WSB/EiJ. We conclude that the expectation that NOD should be Xceb is supported by haplotype sharing.
Copy number variations distinguish weaker Xce alleles from stronger ones
The minimum Xce interval between 102,747,920-102,924,411 bp identified in Calaway et al. (2013) contains a series of recurring chromosomal rearrangements. These CNVs, comprising segmental duplications (SD) and inversions (I), were also verified in C57BL/6J with molecular assays by Sheedy (2012). We further corroborated the chromosomal architecture of this region in the mouse reference sequence with local nucleotide comparisons (Figure 10a) and optical mapping data (Figure S2). These rearrangements have been posited as a potential explanation for the effect of the Xce functional allele series.
Using direct searches of non-overlapping 45-mers, we discovered an additional copy of the X chromosome sequence from approximately 102,802,400-102,839,400 bp, forming a continuous, 37 kb repeat spanning SD3b, SD4, and the bridge sequence between these recurring regions that we denote SD6. As shown in Figure 10b, the pertinent duplicated region marked with a magenta band clearly spans 45-mers with a consistent increase of counts centered at one extra copy. Henceforth we refer to this novel CNV as R1. Furthermore, we demonstrate that A/J and 129S1/SvImJ, which both carry Xcea, share the same duplicated region, R1, as NOD, with copy number increased by roughly one (see Figures S4-S5). Replicated experiments over decades (Johnston and Cattanach 1981; Simmler et al. 1993; Calaway et al. 2013) have demonstrated that Xcea was the weakest known Xce allele prior to our finding in NOD.
This strong molecular evidence establishes a distinction between the reference genome and strains with weak Xce alleles, supporting the idea that variation in copy number within the putative Xce region contributes to the functional allele. We hypothesize that R1 is associated with a weak Xce allele, and that the chromosomal organization of CNVs in NOD, A/J, and 129S1/SvImJ may be described with the schematic shown in Figure 10c.
Compared with NOD, both A/J and 129S1/SvImJ have a markedly higher number of nucleotide variations relative to the reference (Figures S4-S5). Although all three strains share a similar pattern of repeats with R1, NOD has a weaker phenotype still compared with Xcea. Both XCI skewing and genetic differences still remain between NOD and the two strains confirmed to possess Xcea, leading us to establish NOD as its own allele in the functional series, Xcef.
Strikingly, the CNV pattern seen in NZO contains notable departures from those in other strains. NZO appears to have a more complex series of nested repeats such that different portions of the “weak repeat,” i.e. R1, are replicated at different frequencies (Figure 11). It carries three additional copies of SD4, two additional copies of SD3b, and one additional copy of a sequence segment distal to SD4 that we denote SD7.
We confirm that NZO has unique breakpoints between SDs that NOD and the reference sequence lack by querying matches of 45-mers at the SD boundaries. Neither the reference nor NOD contains repeats of SD7, so there is only one set of sequences flanking each side of SD7, i.e. between SD4-SD7 and SD7-I5b. NZO, on the other hand, contains two distinct sets of k-mers on both the proximal and distal ends of SD7 (see Figure S9). This provides evidence that there are two copies of SD7 in NZO, one of which is a repeat flanked by sequences that form a pattern observed in neither the reference nor NOD. Although we are not able to verify the exact locations and pattern of the NZO duplications, shown as a hypothesized schematic in Figure S9B, we do see molecular evidence supporting the quantity of repeats in NZO and the presence of unique breakpoints between duplications and inversions. This suggests that NZO has a different chromosomal architecture in this region compared with other strains, though one that does not manifest in differences of XCI pattern compared to the reference strain.
Discussion
In a previous study by our group (Calaway et al. 2013), we used a diverse set of inbred strains and allele-specific gene expression to characterize a new Xce phenotype and to narrow the putative Xce interval. That study identified a set of recurrent duplications within the Xce and suggested that variation in their copy numbers may in fact be the functional variation driving the allelic series. In the present study, we examined that hypothesis and quantified the skewing phenotypes of two CC founder strains with inferred Xce alleles based on sequence similarity with C57BL/6J across the Xce locus.
Leveraging increased genetic diversity in CC-RIX identifies novel XCI patterns
Two important features of our methods are worth noting: 1) increased heterogeneity in the genetic composition of our F1 crosses of well-described CC strains, and 2) improved mapping resolution across the X chromosome from a novel method of quantifying ASE in CC-RIX mice and modeling the resulting counts in a hierarchical Bayesian manner. The animals represented in our study are each mosaics of 8 inbred mouse strains, with one X chromosome inherited entirely from each parent. Haplotype estimates across the genome in the CC strains are stable and replicable, thus allowing us to leverage previously collected genotyping and sequencing data to inform ASE estimates in our dataset. The complete haplotype reconstruction across the X chromosome for every cross used in this study is depicted in Figure 12.
Our methods relied upon a novel way to quantify ASE across the X chromosome by querying a set of curated 25-mers among the RNA-seq reads from each of the 266 mice in our study population. The 25-mers specifically targeted reference and alternate alleles at known polymorphisms in coding regions, and fed into a hierarchical Bayesian model to quantify XCI proportion for each cross and sample. Among the Xce alleles that have previously been characterized, our estimated XCI proportions matched what we would expect based on data from the literature (see Figure 7a). This finding serves to corroborate historical observations and to provide validation for our Xce imputation method and statistical model.
We observed highly variable proportions in some crosses, potentially owing to multiple sources of variation. Some CC strains have segregating boundaries at or near the Xce interval, making the assignment of CC strain from which the haplotype derives more uncertain, such as near 102.5 Mb in CC062. As shown in Figure 12, CC062/CC035 defines the lower boundary of the maximum Xce interval because the data is consistent with the XCI ratio being 50:50, i.e. between two Xcea functional alleles of equal strength. In reality, CC062 has a large recombination interval between 129S1/SvImJ and NOD near this proximal boundary. The broad range of proportions we actually observe suggests that Xcea / Xcea may not be an appropriate designation for every sample in this cross and that some may indeed be Xcea / Xcef.
In addition, we have few samples and crosses with Xcee derived from PWK/PhJ. Our findings suggest that it is similar in strength to Xceb and not demonstrably weaker, consistent with previous findings (Calaway et al. 2013). As noted in the Methods, samples from SP2 had fewer replicates because only the females were relevant to this study, potentially leading to more RIX-wide variability. Lastly, RNA-seq tissue collection for SP2 used the striatum as opposed to whole brain tissue, resulting in a smaller starting amount of cells relative to SP1. XCI proportions for individuals in SP2 thus had higher variance, leading to less stable estimates and larger HPD intervals.
Pólya urn-based approximation to the number of cells in pre-brain epiblast tissue
Inter-individual variability in XCI skew among genetically identical samples within a RIX cross can be partitioned into experimental and biological variation. Although the two cannot be easily disentangled, we surmise that the biological variation derives, in part, from the precision of the beta distributed parameter for each estimate of mouse-specific XCI proportion. At the point of inactivation choice, the cells in the epiblast are akin to balls in a Pólya urn. The Pólya urn describes a random process in which an initial number of red and blue balls undergo successive rounds of randomly assigned duplications; after infinite rounds, the final proportion of red vs blue balls is a random number whose variability is a function of the total starting number. Urns that start with a greater number of balls are more stable against random fluctuations in the proportions of the red to blue balls, and have proportions more closely gathered around the starting proportion; urns starting with a smaller number lead to a final proportion that is more variable.
Analogously, the urn represents a RIX and α0 represents the starting number of pluripotent cells that are involved in the decision to activate either the maternal or paternal chromosome at around E5.5 and will eventually form brain tissue (or whichever tissue undergoes an ASE assay) in the mature mouse. Though we first estimate α0 in each RIX individually, we assume that the parameter should be similar in each individual cross, given the stability of biology underlying the XCI process.
Though we are unable to verify this quantity of 20-30 pre-brain epiblast cells, it does seem reasonable given that the total number of cells in the epiblast ranges from around 120 on E5.5 to 660 on E6.5 (Snow 1977).
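The relationship between the starting number of cells and the spread of XCI proportions described in this section follows from the variance of a beta distribution. The Python sketch below assumes a balanced cross (mean proportion 0.5) and treats the starting cell count as the beta precision a + b; the function name is hypothetical.

```python
import math

def xci_proportion_sd(mean_prop, n_start_cells):
    """SD of a beta-distributed proportion with given mean and precision a + b."""
    return math.sqrt(mean_prop * (1 - mean_prop) / (n_start_cells + 1))

sd_20 = xci_proportion_sd(0.5, 20)     # fewer precursor cells: wider spread
sd_30 = xci_proportion_sd(0.5, 30)     # more precursor cells: narrower spread
```

Under these assumptions, 20 to 30 starting cells correspond to a between-mouse standard deviation of roughly 0.09 to 0.11 in XCI proportion, so sizeable skews can arise in individual mice by chance alone.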
Copy number of recurrent duplications may explain the weakness of Xcea and novel Xcef, found in NOD
Both NOD and NZO were previously predicted to express the Xceb functional allele based on haplotype similarities to C57BL/6J. Our results do not support this conclusion in NOD. We characterize the Xce locus derived from NOD as a separate functional allele in the series, Xcef, because we find it to be consistently weaker than all other known Xce haplotypes. Crosses involving 6 CC strains (CC012, CC023, CC026, CC028, CC041, and CC065) that contain the NOD-derived Xce region corroborate the weakness of the novel Xcef (Figure 7). This continuity leads us to conclude that chromosomes carrying the NOD Xce allele contain sequence-level variation in this interval, manifesting in primary inactivation bias against keeping that parental copy of the X chromosome active.
We confirm that both NOD and NZO share sequence similarity in the Xce interval with the reference genome (Figure 9) using haplotype assembly from CC WGS and phylogenetic analysis. As a result, we conclude that CNV structure may be the causative factor for this phenomenon. CNV analysis (Figure 10b) reveals a large interval in which the normalized counts for all k-mers are consistent with the presence of an extra copy in NOD. We tentatively conclude that the repeat, R1, represents a genuine copy number increase of a contiguous 37 kb-long segment in NOD. R1 includes the entire SD3b and SD4, as well as the bridge sequence connecting them that is not duplicated in the reference (Figure 10b). The novel R1 appears to be recent; the last duplication found in the reference genome is that of SD3a-b inverting and inserting distally to form I5a-b, and is demonstrated by the sequence similarity between these two sets of sequences in both k-mer identity over sliding windows and optical mapping data. This general rearrangement structure is similar between two weak alleles, Xcea and Xcef. A/J and 129S1/SvImJ express Xcea, and both strains share with NOD evidence of the same SD3a-b to I5a-b inversion alongside the novel repeat, R1.
NZO expresses Xceb despite complicated CNV organization
XCI estimates from crosses containing an Xce region derived from NZO do not deviate from our hypothesized ratios based on the strain carrying Xceb. Whether the XCI proportions seen in NZO indicate that its Xce interval is the same molecular species with genuinely identical function as Xceb, or whether the two phenotypes have converged to appear similar, is unclear. We would expect NZO to have a duplication structure akin to that of the reference mouse genome, or at least a different structure from that of NOD, A/J, and 129S1/SvImJ. Our analysis of NZO is hampered by the lack of CC-RIX in our data with NZO in the putative Xce region. One of our main study populations, SP1, was designed to maximize heterozygous loci between C57BL/6J and NOD, which explains the predominance of both strains in our downstream analysis. Nevertheless, the three RIX that contain NZO are consistent with the strain bearing Xceb or at least a functional allele of the same strength.
In NZO, we find a more complex pattern of SDs and Is than seen in other strains. As shown in Figure 11, NZO appears to harbor one increased copy of SD7, two increased copies of SD3b and SD6, and three increased copies of SD4. We confirm the increased copy number of these elements by observing novel sets of boundaries between SDs that are not present in C57BL/6J, NOD, or other strains. For example, NZO contains two distinct sets of sequences on the distal end of SD7, suggesting that there are two real copies of the segment in the NZO sequence: one of which leads into I5b and is present in the C57BL/6J sequence, and the other of which is novel (see Figure S9).
Thus, the copy number pattern observed in NZO is indeed different from what we observe in NOD, A/J, 129S1/SvImJ, and the reference genome. NOD and NZO were predicted to share the same skewing phenotype as the reference based on sequence similarity at the SNP level. Our data demonstrate that the NOD Xce haplotype has a novel functional allele, distinct from NZO and any known Xce allele. CNVs can explain the difference between the functional Xce alleles present in NOD and C57BL/6J but they are not able to discriminate between NOD and strains with the Xcea allele. This is not particularly surprising given that this simplified approach ignores the potential effect of variation outside of the recent NOD duplication and does not consider higher-order factors associated with duplications such as location and orientation of the duplicated segment.
CNV abundance and organization, along with sequence variation, may all play a role in Xce strength and XCI skewing
We do not capture a full portrait of how duplicated segments are organized in this complicated region. Copy number may well play a role, but there seem to be additional factors distinguishing NOD from strains in Xcea, as all of these strains appear to contain the novel R1. In addition, the duplication structure found in NZO is more complex than what we observe in other strains yet this does not translate to a detectably different phenotype compared with C57BL/6J. This suggests that alternate recurrent duplication structures, each containing variations relative to the reference mouse genome, may present technically different Xce species that converge in similar phenotypes. This is supported by phylogenetic analysis showing that A/J and 129S1/SvImJ are more similar to each other, while NOD and C57BL/6J evolved separately along another branch. The larger region surrounding the Xce contains many other recurrent duplications and repeats, indicating that it is a potential “hotspot” of copy number changes (Sheedy 2012). We provide evidence that CNVs are important to the Xce phenotype and that simply looking at sequence variation is insufficient, but the exact cause and nature of those aberrations remain elusive.
Based on genotyping information collected on the animals in SP1 and from ancestral CC samples, we can map the control element associated with our observed results to a chromosomal location that is consistent with the historical interval as set forth by Calaway et al. (2013), Chadwick et al. (2006), Simmler et al. (1993), and others. In our CNV analysis, we focus on a small portion of this interval near the proximal end, from 102.7-102.9 Mb. Our analysis does not preclude the possibility that more distal genomic elements (including sequence variations, CNVs, and chromosomal structure) may also contribute to the function of the element. The immediate region surrounding our putative Xce interval is highly duplicated and carries convoluted patterns of CNVs, only a few of which we examined in this study. Further work on the nature of the Xce may explore the patterns and inheritance of those rearrangements. For example, some molecular evidence from Sheedy (2012) suggests that a distal duplication distinguishes CAST/EiJ from C57BL/6J. Broader molecular characterization of the extent to which CNVs play a role in enacting this control will be required to fully understand the function of Xce.
Supporting information
File S1 List of all 266 mouse samples (CSV).
File S2 Complete data on 7,957 25-mers used to quantify gene expression on X chromosome (CSV).
File S3 Posterior mode, mean, median, and 95% highest posterior densities determined by the Bayesian hierarchical model for all covariates in GLM performed for SP1 and SP2 (CSV).
File S4 Complete data on non-overlapping genomic 45-mers used to determine CNV in Xce interval (CSV).
File S5 Individual XCI proportion estimates across X chromosome for all 266 samples in study, akin to Figure 6 (PDF).
Supplemental Figures and Tables
The following pages provide supplemental figures and tables for the manuscript “Skewed X inactivation in genetically diverse mice is associated with recurrent copy number changes at the Xce locus” by Sun et al.
Acknowledgements
This work was funded by a National Institute of Mental Health (NIMH) grant R01-MH100241 to WV and LMT, NIMH and National Human Genome Research Institute (NHGRI) grants (P50-MH090338, P50-HG006582, and U24-HG010100) to FPMV, and a National Institute of General Medical Sciences (NIGMS) grant R35-GM127000 to WV. KYS also received partial support from the UNC-CH Caroline H. and Thomas S. Royster Fellowship and an NIGMS training grant, T32-GM067553. VZ is funded via a National Institute of Environmental Health Sciences training grant, T32-ES007018. MiniMUGA was developed under a service contract to FPMV and other investigators at UNC-CH from Neogen Inc., Lincoln, NE. None of the authors have a financial relationship with Neogen Inc. apart from the service contract listed above. The authors have no other conflict of interest to declare. Members of Leonard McMillan’s lab at UNC-CH, including Maya Najarian and Sebastian Sigmon, provided guidance with the WGS data and building k-mers. James Xenakis was instrumental in accessing data from SP2. We thank Greg Keele and other members of the Valdar laboratory for helpful discussions and thoughtful comments on the manuscript.