Summary
Mouse substrains are an invaluable model for understanding disease. We compared C57BL/6J, which is the most commonly used inbred mouse strain, with several closely related substrains. We performed whole genome sequencing and RNA-sequencing analysis on 9 C57BL/6 and 5 C57BL/10 substrains. We identified 352,631 SNPs, 109,096 INDELs, 150,344 short tandem repeats (STRs), 3,425 structural variants (SVs) and 2,826 differentially expressed genes (DEGenes) among these 14 strains. 312,981 SNPs (89%) distinguished the B6 and B10 lineages. These SNPs were clustered into 28 short segments that are likely due to introgressed haplotypes rather than new mutations. Outside of these introgressed regions, we identified 53 SVs, protein-truncating SNPs and frameshifting INDELs that were associated with DEGenes. Our results can be used for both forward and reverse genetic approaches, and illustrate how introgression and mutational processes give rise to differences among substrains.
1. Introduction
Since Clarence C. Little generated the C57BL/6 inbred strain a century ago, the C57BL/6J has become the most commonly used inbred mouse strain. Closely-related C57BL/10 substrains1,2, which were separated from C57BL/6 in about 1937, are also commonly used in specific fields such as immunology3 and muscular dystrophy4. The popularity of C57BL strains has led to the establishment of many substrains (defined as >20 generations of separation from the parent colony). Among the C57BL/6 branches, the two predominant lineages are based on C57BL/6J (from The Jackson Laboratory; JAX) and C57BL/6N (from the National Institutes of Health; NIH5,6). Subsequently, several additional substrains have been derived from the JAX and the NIH branches.
Genetic differences between closely-related laboratory strains have been assumed to be the result of accumulated spontaneous mutations7. For those that are selectively neutral, genetic drift dictates that some new mutations will be lost, others will maintain an intermediate frequency, and others will become fixed, replacing the ancestral allele8. Because of historical bottlenecks and small breeding populations over many generations, fixation of new mutations can be relatively rapid.
Numerous studies have reported phenotypic differences among various C57BL/6- and C57BL/10-derived substrains, which are likely attributable to genetic variation. For C57BL/6 substrains, these differences include learning behavior9, prepulse inhibition10, anxiety and depression11, fear conditioning12–14, glucose tolerance15, alcohol-related behaviors16,17, and responses to various other drugs18–21. For C57BL/10 substrains, these differences include seizure traits22 and responses to drugs23. Crosses between two phenotypically divergent strains can be used for quantitative trait mapping. Because crosses among closely related substrains segregate fewer variants than crosses of more divergent strains, identification of causal alleles is greatly simplified21. Such crosses have been referred to as reduced complexity crosses (RCCs)24 and have been further simplified by the recent development of an inexpensive microarray explicitly designed for mapping studies that use RCCs25.
Whole Genome Sequencing (WGS) technology provides a deep characterization of Single Nucleotide Polymorphisms (SNPs), small insertions and deletions (INDELs), Short Tandem Repeats (STRs), and Structural Variations (SVs). SNPs that differentiate a few of the C57BL/6 substrains have been previously reported21,26. While most SNPs are expected to have no functional consequences, a subset will; for example, SNPs in regulatory and coding regions can profoundly alter gene expression and function. STRs have never been systematically studied in C57BL substrains. STRs are highly variable elements that play a pivotal role in multiple genetic diseases, population genetics applications, and forensic casework. STRs exhibit rapid mutation rates of ~10⁻⁵ mutations per locus per generation27, orders of magnitude higher than that of point mutations (~10⁻⁸)28, and are known to play a key role in more than 30 Mendelian disorders29; recent evidence has underscored their profound regulatory role, suggesting widespread involvement in complex traits30. SVs include deletions, duplications, insertions, inversions, and translocations. SVs are individually less abundant than SNPs and STRs, but collectively account for a similar proportion of overall sequence difference between genomes31. In addition, SVs can have greater functional consequences because they can result in large changes to protein coding exons or regulatory elements32. Large SVs among C57BL/6 (but not C57BL/10) substrains were identified using array comparative genomic hybridization7, and have also been identified in more diverse panels of inbred strains using WGS33. Although some genetic variants that differ between closely related C57BL substrains have been previously reported7,34–36, a comprehensive, genome-wide map of SNPs, INDELs, STRs, SVs and gene expression differences among C57BL/6 and C57BL/10 substrains does not exist.
In an effort to create such a resource, we performed whole genome sequencing in a single male individual from 9 C57BL/6 and 5 C57BL/10 substrains (~30x per substrain) and called SNPs, INDELs, STRs and SVs. In addition, to identify functional consequences of these polymorphisms, we performed RNA-sequencing of the hippocampal transcriptome in 6-11 male mice from each substrain, which allowed us to identify genes that were differentially expressed (Figure 1A). This approach has two advantages. First, it provides a large number of molecular phenotypes that may be caused by substrain-specific polymorphisms. Second, we assumed that the gene expression differences would often reflect the action of cis-regulatory variants, making it possible to narrow the number of potentially causal mutations without requiring the creation of intercrosses.
2. Results
Processing WGS data, we identified 352,631 SNPs, 109,096 INDELs, 150,344 STRs and 3,425 SVs in nine C57BL/6 and five C57BL/10 substrains. 5.6% of SNPs and 17.2% of INDELs are singletons (they occur in only one substrain). 89% of SNPs and 58% of INDELs separated the C57BL/6 and C57BL/10 branches. The fraction of variants in each category observed in different numbers of substrains is plotted in Figure S1. RNA-sequencing analysis on 106 hippocampal samples identified 16,400 expressed genes and 2,826 DEGenes (17.2%) in C57BL/6 and C57BL/10 substrains (FDR<0.05). These data are available in the Supplementary material.
2.1 Genetic evidence for origin of C57BL/6 and C57BL/10 substrain differences
Figure 1B shows the relationships among C57BL/6 and C57BL/10 substrains based on historical records37–39. Figure 1C shows a dendrogram that was produced using SNPs, STRs and bi-allelic SVs. Comparison of these two figures shows that the records about the relationships among C57BL/6 and C57BL/10 substrains are consistent with our sequencing results.
2.2 Distribution of genomic variants across the genome
The distribution of variants across the genome is shown in Figure 1D. Several dense clusters of variants, common to all categories (SNPs, INDELs, STRs and SVs), are evident (e.g. on chromosomes 4, 8, 11 and 13). The non-uniformity of these polymorphisms was inconsistent with our expectation that polymorphisms were due to new mutations and genetic drift. To further explore this observation, we examined the distribution of SNPs for each of the 14 different substrains (Figure 1E). This figure demonstrates that these clusters consist of a series of highly divergent haplotypes that differentiated the C57BL/6 and C57BL/10 lineages. In total, 312,981 SNPs (89% of SNP variants detected in this study and 99.6% of the C57BL/6 vs C57BL/10 SNPs) reside in several C57BL/10-specific clusters that represent just 5% of the genome (28 segments on 11 chromosomes: 1, 2, 4, 6, 8, 9, 11, 13, 14, 15, 18) with a SNP density of ~1/425 bp. Across the remaining 95% of the genome, SNPs do not appear to be clustered and have a density of ~1/67,000 bp (more than 100-fold less dense). We found that many of the SNPs in these intervals were present in the Mouse Genome Informatics (MGI) database (URL), which suggested they were not due to new mutations in the C57BL/10 lineage. We used the MGI database to identify strains that were similar to these 28 segments. No single strain matched all 28 segments. However, Figure S2 shows that for 24 of the 28 segments, at least one strain in the database had greater than 90% concordance (we only considered strains for which a minimum of 300 SNPs was available in that segment). Based on these data, we hypothesize that one or more inbred or outbred mice were accidentally introduced into either the C57BL/6 or C57BL/10 lineage. Another possibility is that their last common ancestor was not fully inbred, and that these regions were differentially fixed after their separation.
Most of the concordant strains have domesticus origin; however, two large segments on chromosomes 4 and 11 showed apparent musculus origin. Additionally, 9,218 SNPs (2.6% of all SNPs) did not match the C57BL/6J reference genome (mm10) for all substrains, even for the sample we obtained from C57BL/6J. This may be because the cryopreserved embryo stock used for the modern C57BL/6J mice at JAX is more than 20 generations separated from the samples used to generate the mouse reference genome, GRCm3840. The remaining SNPs appeared to be uniformly distributed and are likely due to new mutations. The distributions of other variants (INDELs, STRs and SVs) mirrored SNPs and are plotted in Figure S3.
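The proportions and densities quoted in this section can be cross-checked with simple arithmetic. The short sketch below uses only figures stated in the text; the variable names are ours and purely illustrative.

```python
# Figures quoted in the text (Results, section 2.2).
total_snps = 352_631
clustered_snps = 312_981        # SNPs inside the 28 C57BL/10-specific segments
bp_per_snp_clustered = 425      # ~1 SNP per 425 bp inside the segments
bp_per_snp_background = 67_000  # ~1 SNP per 67,000 bp across the rest of the genome

# Share of all SNPs that fall inside the clustered segments.
clustered_fraction = clustered_snps / total_snps

# Fold difference in SNP density between the segments and the background.
density_fold = bp_per_snp_background / bp_per_snp_clustered

print(f"fraction of SNPs in clustered segments: {clustered_fraction:.1%}")
print(f"density fold difference: {density_fold:.0f}x")
```

Both values are consistent with the rounded figures given above (89% of SNPs, and a more than 100-fold difference in density).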
2.3 Identification of candidate genomic variants causing differential gene expression
We found that 2,826 of 16,400 genes (17.2%) were differentially expressed among the 14 substrains (FDR<0.05); we call these DEGenes. We assumed that many of the DEGenes were due to local (cis) polymorphisms.
To identify genomic variants that might be causally related to differential gene expression, we tested all identified variants (SNPs, INDELs, STRs and SVs) in the cis-window (1Mb upstream of gene start and 1Mb downstream of gene end) for association with the corresponding DEGene. Specifically, we tested the association between the cis-variants and the median of DEGene expression by a linear regression test, using Limix41. The resulting p-values are reported in the Supplementary material.
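The cis-window definition above can be illustrated with a minimal sketch. This is not the pipeline used in the study; the `Gene` and `Variant` types, the function names and the coordinates in the example are hypothetical, and only the 1 Mb flank comes from the text.

```python
from typing import List, NamedTuple

CIS_FLANK_BP = 1_000_000  # 1 Mb on each side of the gene, as stated in the text


class Gene(NamedTuple):
    chrom: str
    start: int
    end: int


class Variant(NamedTuple):
    chrom: str
    pos: int
    kind: str  # e.g. "SNP", "INDEL", "STR" or "SV"


def in_cis_window(gene: Gene, variant: Variant, flank: int = CIS_FLANK_BP) -> bool:
    """Return True if the variant lies within `flank` bp of the gene body."""
    if variant.chrom != gene.chrom:
        return False
    return gene.start - flank <= variant.pos <= gene.end + flank


def cis_variants(gene: Gene, variants: List[Variant], flank: int = CIS_FLANK_BP) -> List[Variant]:
    """Keep only the variants that fall inside the gene's cis-window."""
    return [v for v in variants if in_cis_window(gene, v, flank)]


# Made-up coordinates, for illustration only.
example_gene = Gene("chr10", 5_000_000, 5_100_000)
example_variants = [
    Variant("chr10", 4_200_000, "SNP"),    # upstream, inside the window
    Variant("chr10", 6_100_000, "INDEL"),  # exactly at the downstream boundary
    Variant("chr10", 6_100_001, "STR"),    # 1 bp beyond the window
    Variant("chr11", 5_050_000, "SV"),     # different chromosome
]
selected = cis_variants(example_gene, example_variants)
print([v.kind for v in selected])
```

Because the window extends the same distance on both sides of the gene, the sketch does not need to consider gene strand.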
Our cohort consists of 14 substrains. As expected, all variants with the same strain distribution pattern have identical p-values in the association tests. For example, for the gene Kcnc2, which had significantly reduced expression in C57BL/6JEiJ (Figure 2A), there was an equally strong correlation with four SNPs, one INDEL and one STR in the cis-region (Figure 2B), and with many more variants outside the cis-region (Figure S4). The INDEL was annotated as a frameshift loss-of-function variant by Variant Effect Predictor (VEP)42; therefore, it had a strong prior to be the causal variant. Thus, even in this small cohort, we found a number of examples in which a variant within the cis-window had a strong prior and therefore appeared likely to explain a DEGene. We describe several such examples in the next section. However, for the majority of DEGenes, there were no polymorphisms that had strong priors, meaning that any of the variants with the smallest p-values in the cis-window, or a combination of them, or trans-acting variants elsewhere in the genome, could be causal.
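To make the point about strain distribution patterns concrete, the sketch below groups variants whose genotype vectors across substrains are identical. Variants in the same group are indistinguishable to any association test on these strains. The variant identifiers and genotype vectors are invented for illustration and are not taken from the study's callset.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_sdp(genotypes: Dict[str, Tuple[int, ...]]) -> Dict[Tuple[int, ...], List[str]]:
    """Group variant IDs by their strain distribution pattern (SDP).

    `genotypes` maps a variant ID to a tuple of homozygous genotype codes
    (0 = reference, 1 = alternate), one per substrain, in a fixed order.
    """
    groups: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for variant_id, calls in genotypes.items():
        groups[tuple(calls)].append(variant_id)
    return dict(groups)


# Hypothetical calls across four substrains.
toy_genotypes = {
    "snp_1": (0, 0, 1, 0),
    "indel_1": (0, 0, 1, 0),
    "str_1": (0, 0, 1, 0),
    "snp_2": (1, 0, 0, 0),
}
sdp_groups = group_by_sdp(toy_genotypes)
for pattern, members in sdp_groups.items():
    print(pattern, members)
```

In this toy input, three variants share one pattern, so only a functional annotation (such as a predicted frameshift) could prioritize one of them over the others.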
2.4 Differential expression of genes is associated with multiple categories of functional variants
Genomic variants that disrupt protein coding exons or nearby cis-regulatory elements have strong potential to cause differential gene expression. We investigated the causal role of variants in the cis-window by quantifying the strength of effects for multiple functional categories of variants. SNPs and INDELs were annotated using VEP42, which identified 555 loss-of-function variants (frameshift, stopgain or splice variant). SVs were annotated by intersecting with the gene features including exons, Transcription Start Site (TSS), Untranslated Regions (UTRs), promoters, enhancers, and introns. When an SV intersected with multiple types of functional elements, it was categorized according to the order mentioned above. The same gene annotations were applied to STRs. Intergenic SVs and STRs, which are defined as those that did not intersect with any gene features, were paired with the gene that had the nearest TSS. In addition, we assessed multiallelic copy number variation of genes by quantifying sequence coverage of all Segmental Duplications mapped to the mm10 reference genome (URL) that intersected with genes.
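The priority rule for variants that overlap several feature types can be sketched as follows. The feature labels and the function name are our own shorthand; only the ordering (exons, TSS, UTRs, promoters, enhancers, introns) and the intergenic fallback are taken from the text.

```python
from typing import Iterable

# Order of precedence as listed in the text.
FEATURE_PRIORITY = ("exon", "TSS", "UTR", "promoter", "enhancer", "intron")


def categorize(overlapping_features: Iterable[str]) -> str:
    """Return the highest-priority feature a variant overlaps.

    Variants that overlap no gene feature are labeled "intergenic";
    in the text these are paired with the gene that has the nearest TSS.
    """
    hits = set(overlapping_features)
    for feature in FEATURE_PRIORITY:
        if feature in hits:
            return feature
    return "intergenic"


print(categorize(["intron", "UTR"]))      # UTR outranks intron
print(categorize(["enhancer", "exon"]))   # exon outranks everything else
print(categorize([]))                     # no overlap with any gene feature
```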
Genomic variants of the above categories that intersected with DEGenes were tested for association with gene expression by a Linear Mixed Model (LMM) using Limix41. We controlled for the complex relationships among inbred strains (a form of population structure) by using a Genomic Relatedness Matrix (GRM), derived from SNP genotypes, as a random effect, and parent strain (C57BL/6 or C57BL/10) as a fixed effect. Figure 2C shows the QQ plot for the p-values obtained from the data versus the uniform distribution. The black dots show the deciles of the data in each category. SegDups that intersected genes were strongly correlated with the expression of those genes, as would be expected for gene copy number variation. Loss-of-Function SNPs and INDELs also showed a significant inflation of correlated DEGenes, followed by the genic SVs. The genic STRs showed a slight inflation, which was not as significant as other variant types. The missense SNPs, intergenic SV and intergenic STR p-values followed the uniform distributions.
For each category, the p-values obtained by the LMM were corrected by the Benjamini-Hochberg procedure to obtain an FDR. We identified 53 significant (FDR < 0.05) associations between DEGenes and features, which are reported in Tables S2 and S3. The majority of associations (41 of 53 genes) reflected segmental duplications. In Table S3 we report the genotype pattern in the substrains for each variant; notably, there are several clusters of significant associations with the same genotype pattern. For example, one extensive region on Chromosome 2 that clearly distinguishes C57BL/6NJ from all other substrains accounts for 18 of the 53 genes identified. Another cluster with a more complex genotype pattern on Chromosome 4 accounts for 11 of the 53 identified genes.
2.5 Distinct mechanisms of differential gene expression caused by SVs
SVs can affect gene function by (1) varying the dosage of a full-length gene, (2) deletion or insertion of exons producing alternative isoforms of a gene, or (3) rearrangement of the cis-regulatory elements of genes. For example, there are three copies of the gene Srp54 in the mouse reference genome, but we found significant variability in the number of copies across the substrains; the number of copies was strongly associated with expression of Srp54 (Figure 3A). Thus, in this example, copy number variation in highly mutable segmental duplication regions is the likely cause of differential gene expression. An example of an SV that likely impacts expression is found in the Lpp gene. The Lpp gene has a tandem duplication of the first two exons in two substrains (C57BL/10ScCr and C57BL/10ScNHsd) that creates two copies of the TSS, which probably accounts for its ~2-fold increased expression (Figure 3B).
An intriguing example of altered expression caused by an SV is the Wdfy1 gene. This gene has a tandem duplication of exons 4-6 in the C57BL/6J substrain, which is also present in the mm10 reference genome. We found this duplication is associated with a paradoxical decrease in Wdfy1 gene expression (Figure 3C). Given that a frameshift is incurred by this tandem duplication, we hypothesized that reduced Wdfy1 expression is caused by the nonsense-mediated RNA decay (NMD) pathway. This highly conserved RNA-turnover pathway promotes the turnover of mRNAs harboring premature termination codons, including those generated by frameshifts43. We reasoned that if a major spliced isoform of Wdfy1 contains this tandem duplication and associated frameshift, its decay by NMD would be detectable by examining expression in NMD-deficient cells. In support, RNA-seq analysis showed that NMD-deficient (Upf2-/-) C57BL/6J ES cells expressed ~2-fold higher levels of Wdfy1 than control (sibling) C57BL/6J ES cells (Figure 3F). Analysis of splice junctions from RNA-seq analysis confirmed the existence of an aberrant isoform in all C57BL/6J lines that includes splicing of exon 6 to the downstream (duplicated) exon 4 (Figure 3D), which we refer to as the “6a->4b” junction. This splice junction was unique to C57BL/6J strains, and the ratio of 6a->4b to all splice junctions was increased by ~2-fold in the Upf2-/- ES cells relative to control ES cells (Figure 3E). These results demonstrate that major isoforms transcribed from Wdfy1 in C57BL/6J mice are degraded by NMD, and the transcripts that are retained are alternative splice forms that exclude the 6a->4b junction.
3. Discussion
We have performed a large-scale multi-omics analysis of 14 C57BL substrains. We identified 352,631 SNPs, 109,096 INDELs, 150,344 STRs and 3,425 SVs; furthermore, of the 16,400 genes that were expressed in the hippocampus, 2,826 were significantly differentially expressed (FDR<0.05). Unexpectedly, many of the polymorphisms that differentiated the C57BL/6 and C57BL/10 substrains were concentrated in a few haplotypes, comprising just 5% of the genome. These polymorphisms appear to be due to either introgression of an unrelated individual, or incomplete inbreeding at the time that the C57BL/6 and C57BL/10 lineages diverged. Setting these introgressed regions aside, we tried to identify variants that were causally related to differential gene expression by focusing on the cis regions around DEGenes. This allowed us to identify 53 genes in which a variant with high prior probability to be causal was significantly associated with gene expression. While the majority of these 53 instances were caused by segmental duplications, several of which spanned many adjacent genes, a smaller proportion were due to SVs and INDELs (see Supplemental Tables S2 and S3). Inflation of test statistics for these categories of variants further underscores their likely causal roles and highlights the fact that a relaxed FDR threshold would have identified more than 53 variant/DEGene associations.
An unexpectedly large subset of variants (89% of all SNPs) were concentrated in 28 highly diverged haplotypes that were present in all C57BL/10 strains and represented just 5% of the genome. These dense clusters of genetic variation (1 SNP/425 bp) perfectly differentiated C57BL/10 from C57BL/6, and likely reflect introgression from another strain.
Intriguingly, the smaller haplotypes appeared to be of domesticus origin, and were similar to haplotypes found in multiple non-C57BL inbred strains. The two largest haplotypes appeared to be of musculus origin and were also similar to multiple non-C57BL inbred strains. The exact sequence of events that led to this situation is impossible to deduce, but these patterns are clearly due to breeding errors rather than spontaneous mutations; this conclusion is based on several observations: 1) the density of the polymorphisms, 2) the abrupt boundaries of the regions/haplotypes and 3) the fact that the SNPs in these introgressed regions are found in other inbred strains, which would not be the case if they were due to spontaneous mutations. A previous microarray study performed on 198 inbred mouse strains also identified SNP differences between C57BL/6J and three C57BL/10 substrains (C57BL/10J, C57BL/10ScNJ and C57BL/10ScSnJ) for all the 28 introgressed segments that we identified44,45; however, that study did not highlight the significance of that finding, and did not have sufficiently dense coverage to define the boundaries of the introgressed regions. While a majority of C57BL/10-specific genetic variants lie within these introgressed regions, the regions contained only a small fraction (~13%) of DEGenes; however, given that the introgressed regions represent only 5% of the genome, this is still about a 2.6-fold greater density of DEGenes than would be expected if they were randomly distributed across the genome.
Outside of these apparently introgressed regions we identified 37,745 SNPs that were distributed throughout the genome in a Poisson fashion with more than 100-fold lower density (~1 SNP/67,000 bp). These SNPs are apparently due to the accumulation of new mutations and their identification was the original goal of our study. Dendrograms based on these SNPs recapitulated the historically recorded relationships among the substrains (Figure 1B). For the relatively large number of DEGenes (>2,000) that were located outside of the introgressed regions, we considered the association between different categories of nearby (cis) variants, and expression of DEGenes. Variable copy number segmental duplicated regions were shown to be highly enriched for significant associations as were genic SVs and loss of function SNP/INDELs (Figure 2C).
We presented several examples to highlight how different classes of variants underlie DEGenes. For example, variable copy number segmental duplications led to both increased and decreased expression of Srp54 (Fig 3A). In another example, duplication of transcription start sites led to increased expression of Lpp (Fig 3B). In the case of Wdfy1, duplication of several exons led to down-regulation of expression (Fig 3C), which we showed was due to NMD-mediated mRNA decay (Fig 3D). Wdfy1 was previously reported to be differentially expressed between C57BL/6J and C57BL/6NCrl, and was identified as one of the candidate genes for reduced alcohol preference in C57BL/6NCrl46. This gene is also within the QTL named Emo4 (location: Chr1:68,032,186-86,307,305 bp, URL); mice that are homozygous for the C57BL/6J allele are more active in the open field test. Whether Wdfy1 is actually the cause of either association cannot be resolved by our study.
Despite the numerous examples in which likely causal variants were identified, a majority of the causal variants underlying DEGenes remain unknown. Many are likely to be due to variants in regulatory regions that have not been distinguished from other nearby variants with the same strain distribution pattern (and thus identical p-values). Although we focused on the possibility that DEGenes were due to nearby variants (cis-eQTLs), the large fraction of differentially expressed genes (17.2% of all expressed genes) could indicate that many DEGenes are due to trans-eQTLs. Producing crosses between pairs of strains will be necessary to address the relative importance of cis-versus trans-eQTLs in the observed DEGenes; it is possible that such crosses could identify one or more major trans-regulatory hot spots.
Our results create a resource for future efforts to identify genes and causal polymorphisms that give rise to phenotypic differences among C57BL strains using the increasingly popular reduced complexity cross (RCC) approach, in which two phenotypically divergent, nearly isogenic inbred substrains are crossed to produce an F2 population24. Because of the low density of polymorphisms, identifying the causal allele is much more tractable. For example, the gene Cyfip2 was identified as the cause of differential sensitivity to cocaine and methamphetamine in a cross between C57BL/6J and C57BL/6N21. In the Supplementary material we have provided genomic variants (SNPs, INDELs, STRs and SVs), differentially expressed genes in the hippocampus, as well as association tests between DEGenes and nearby variants. In addition, we have provided the VEP-annotated SNPs/INDELs, which distinguish loss of function, missense and synonymous mutations. Our data also identify some regions that have a high density of polymorphisms that may complicate the RCC approach. For example, phenotypic differences between C57BL/6 and C57BL/10 strains might frequently map to the introgressed regions, which have a high density of polymorphisms that would significantly hinder gene identification and negate the advantage of RCCs. Furthermore, phenotypic differences in crosses between two C57BL/6 or between two C57BL/10 strains may map to large segmental duplication regions such as those on Chromosomes 2 and 4 (see Figure 1E and Supplemental Table S3), which would again hinder gene identification. Thus, one key observation from this study is that genetic differences among C57BL/6 and C57BL/10 strains are not uniformly distributed. Furthermore, our study used a single individual to represent each strain for whole genome sequencing. Therefore, we did not explore the extent to which the polymorphic regions we identified may be segregating versus fixed within each inbred strain. If some of these polymorphic regions are not fixed, it would further complicate the analysis of RCCs.
Whereas the RCCs represent a forward genetic approach (starting with a phenotypic difference, searching for the genetic cause), another novel application of our dataset would be to select two strains that are divergent for a coding or expression difference and to use that cross to study gene function. This reverse genetic approach (starting with a genetic difference, searching for the phenotypic consequences) has not been attempted using closely related substrains, but is conceptually similar to characterization of a knockout mouse. This approach is limited by the available polymorphisms. Although it would be necessary to account for the impact of linked polymorphisms, most of the polymorphisms would be unlinked and would not confound the interpretation of results.
In summary, we have created a dataset that elucidates the differences among C57BL strains and can be used for both forward genetic (RCC) and reverse genetic approaches. We identify previously unknown introgressed segments that differentiate the C57BL/6 and C57BL/10 lineages. Our results can also be used to explore mutational processes and highlight the tendency of inbred strains to change over time due to mutational processes.
Author Contributions
A.P. designed the study. C.L.S.P. performed the animal breeding, dissection and the preparation of WGS and RNASeq libraries. Y.R. and A.W. performed initial analyses of the WGS and RNASeq data. M.M. carried out all statistical genetic and functional genomic analyses. M.G. and S.S. performed STR calling. M.M. and J.S. performed SV calling and analysis of SV eQTLs. M.W. developed the Upf2 mouse model, A.S. derived the ES cells and performed the corresponding RNA-Seq. M.M., J.S., A.P. wrote the paper.
Declaration of interest
The authors declare no competing interests.
4. Methods
4.1 Mice
We obtained a panel of 14 C57BL substrains from four vendors. The panel included 9 C57BL/6 substrains: C57BL/6J, C57BL/6NJ, C57BL/6ByJ, C57BL/6NTac, C57BL/6JBomTac, B6N-TyrC/BrdCrCrl, C57BL/6NCrl, C57BL/6NHsd, C57BL/6JEiJ, and 5 C57BL/10 substrains: C57BL/10J, C57BL/10ScCr, C57BL/10ScSnJ, C57BL/10SnJ, C57BL/10ScNHsd (Table 1). All of the substrains were bred for one generation at the University of Chicago before tissue was collected for whole genome sequencing and RNA-sequencing; this avoided gene expression differences that were secondary to environmental differences among the four vendors. All procedures were approved by the University of Chicago IACUC. One hundred and ten male mice in total, with six to eleven mice per substrain, were chosen for RNA-sequencing from hippocampus, and one male mouse per substrain was chosen for whole genome sequencing from spleen (Figure 1A).
4.2 Whole-genome sequencing (WGS)
DNA from one male animal per substrain (n=14) was extracted from spleens using a standard “salting-out” protocol. Sequencing libraries were prepared using a TruSeq DNA LT kit, as per the manufacturer’s instructions. Subsequently, sequencing data was generated by Novogene at an average depth of ~30X coverage on an Illumina HiSeq X Ten (paired-end 150bp) (Table 1).
4.3 RNA-sequencing and data processing
Total RNA was extracted from 110 hippocampal samples using Trizol reagent (Invitrogen, Carlsbad, CA). RNA was treated with DNase (Invitrogen) and purified using RNeasy columns (Qiagen, Hilden, Germany). RNA-sequencing library prep and sequencing were performed by the University of California San Diego Sequencing Core using Illumina TruSeq prep and an Illumina HiSeq 4000 machine (single-end 50bp; Table 1). Reads were mapped to the mouse reference transcriptome (mm10) using the splice-aware alignment software HiSat247, and counts were normalized using HTSeq48. Only genes that had at least one Count Per Million (CPM) in at least two samples were included in our analysis. We further removed four outlier samples identified by PCA. This left us with gene expression data for 16,400 genes across 106 samples in 14 substrains.
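The expression filter described above (at least one CPM in at least two samples) can be sketched in a few lines. This is a stdlib-only illustration rather than the code used in the study; the function names, sample labels, gene labels and counts are all hypothetical.

```python
from typing import Dict, Set

Counts = Dict[str, Dict[str, int]]  # sample -> gene -> raw read count


def counts_per_million(counts: Counts) -> Dict[str, Dict[str, float]]:
    """Scale each sample's raw counts by its library size, per million reads."""
    scaled = {}
    for sample, gene_counts in counts.items():
        library_size = sum(gene_counts.values())
        if library_size == 0:
            scaled[sample] = {gene: 0.0 for gene in gene_counts}
        else:
            scaled[sample] = {
                gene: count * 1_000_000 / library_size
                for gene, count in gene_counts.items()
            }
    return scaled


def expressed_genes(counts: Counts, min_cpm: float = 1.0, min_samples: int = 2) -> Set[str]:
    """Genes with CPM >= min_cpm in at least min_samples samples."""
    cpm = counts_per_million(counts)
    all_genes = set()
    for gene_counts in counts.values():
        all_genes.update(gene_counts)
    kept = set()
    for gene in all_genes:
        n_passing = sum(1 for values in cpm.values() if values.get(gene, 0.0) >= min_cpm)
        if n_passing >= min_samples:
            kept.add(gene)
    return kept


# Toy counts: each sample has a library size of exactly 2,000,000 reads.
toy_counts = {
    "s1": {"Bulk": 1_999_984, "GeneA": 10, "GeneB": 1, "GeneC": 5},
    "s2": {"Bulk": 1_999_989, "GeneA": 10, "GeneB": 1, "GeneC": 0},
    "s3": {"Bulk": 1_999_990, "GeneA": 9, "GeneB": 1, "GeneC": 0},
}
kept_genes = expressed_genes(toy_counts)
print(sorted(kept_genes))
```

In the toy data, GeneB never reaches 1 CPM and GeneC reaches it in only one sample, so both are dropped.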
To identify Differentially Expressed Genes (DEGenes) we performed analysis of variance using the anova function in R, and adjusted the p-values by computing the false-discovery rate (FDR) using the p.adjust function in R, with the Benjamini-Hochberg procedure. We obtained 2,826 DEGenes among C57BL/6 and C57BL/10 substrains combined, 1,210 DEGenes within C57BL/6, and 104 DEGenes within C57BL/10 substrains with FDR<0.05.
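For readers unfamiliar with the Benjamini-Hochberg adjustment, the following stdlib-only sketch reproduces the standard procedure that R's p.adjust applies with method "BH". It is an illustration of the technique, not the analysis code used here, and the input p-values are made up.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


toy_pvalues = [0.01, 0.04, 0.03, 0.005]
toy_fdr = benjamini_hochberg(toy_pvalues)
significant = [p for p, q in zip(toy_pvalues, toy_fdr) if q < 0.05]
print(toy_fdr)
print(significant)
```

Each adjusted value is the smallest FDR threshold at which that test would be called significant, so filtering on `q < 0.05` corresponds to the FDR<0.05 cutoff used in the text.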
4.3.1 Nonsense mediated decay assay
To determine whether SVs of the Wdfy1 gene in C57BL/6J create novel mRNA isoforms that are degraded by the Nonsense-Mediated Decay (NMD) pathway, we performed RNA-seq on mouse embryonic stem cells (mESCs) from a Upf2-/- strain of C57BL/6J that has impaired NMD and on control mESCs from C57BL/6J. Samples with an RNA integrity index of >8 (as determined by a BioAnalyzer) were used for RNA-seq analysis. The University of California San Diego Sequencing Core performed library preparation using a ribosomal RNA depletion protocol followed by paired-end sequencing (100 cycles) using a HiSeq4000. Reads from three replicates of Upf2-/- samples and three controls were mapped to the mouse reference genome (mm10) by HiSat247, and counts were normalized using HTSeq48. We kept all genes with CPM>1 and normalized the counts with edgeR in R; however, we only analyzed Wdfy1 expression in an effort to detect differences in NMD between Upf2-/- and control samples.
4.4 SNPs and INDELs
We used SpeedSeq49 to process the WGS paired-end reads. SpeedSeq uses BWA-mem (v.0.7.8) to map the reads to the mm10 reference genome, SAMBLASTER50 to mark duplicates, Sambamba51 to sort the BAM files, and FreeBayes52 to jointly call SNPs and INDELs. INDELs are defined as insertions or deletions that are relatively short in length. The detected INDELs in our study range from one to 64 base pairs in length, which is approximately the lower bound for SV length scales. We restricted our analysis to variants that were fixed within individual substrains by including homozygous SNPs and INDELs only, resulting in a callset consisting of 352,631 SNPs and 109,096 INDELs. These variants are provided in the Supplementary material.
When computing the identity-by-state (IBS) matrix for dendrograms, we LD-pruned the SNP panel with Plink53 (--indep-pairwise 50 5 0.5) yielding 16,739 SNPs. This pruned SNP set was augmented by STRs and all bi-allelic SVs, followed by computing the distance matrix with dist and plotting the dendrograms with hclust in R v3.6.1.
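As an illustration of the distance step that precedes clustering, the sketch below computes pairwise Euclidean distances between genotype vectors, which is the default metric of R's dist function. We assume the default metric here because the text does not name one. The strain labels and genotypes are invented, and the clustering itself (hclust) is not reimplemented.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def pairwise_distances(genotypes: Dict[str, Sequence[float]]) -> Dict[Tuple[str, str], float]:
    """Euclidean distance between every pair of strains.

    `genotypes` maps a strain label to its vector of numeric genotype codes;
    all vectors must cover the same variants in the same order.
    """
    distances = {}
    for a, b in combinations(sorted(genotypes), 2):
        squared = sum((x - y) ** 2 for x, y in zip(genotypes[a], genotypes[b]))
        distances[(a, b)] = math.sqrt(squared)
    return distances


# Hypothetical homozygous calls (0 = reference, 1 = alternate) at four variants.
toy_strains = {
    "strain_A": [0, 0, 0, 0],
    "strain_B": [0, 0, 0, 1],
    "strain_C": [1, 1, 1, 1],
}
toy_distances = pairwise_distances(toy_strains)
closest_pair = min(toy_distances, key=toy_distances.get)
print(toy_distances)
print(closest_pair)
```

The closest pair is the one that an agglomerative method would merge first; repeating that merge step on the updated distances is what yields a dendrogram.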
4.5 Short Tandem Repeat (STR)
We used HipSTR v0.6 with default parameters54 to call STRs from mapped reads using the mm10 reference STR set available from the HipSTR website (URL). The reference STR set was generated using Tandem Repeats Finder55 allowing a maximum repeat unit length of 6bp. STRs for the substrains were jointly genotyped on a single node of a local server in batches of 500 STRs. Resulting VCF files from each batch were merged to create a genome-wide callset in VCF format. We filtered out calls with missing genotypes, as well as calls with reference alleles for all substrains, resulting in a total of 150,344 polymorphic STRs. The STR calls are available in the Supplementary material.
4.6 Structural Variations (SV)
SVs were detected using a combination of approaches. First, we called SVs with LUMPY56 and CNVnator57, two complementary methods that rely on discordant and split read signals or coverage respectively. Second, because SV calling accuracy by the above methods is low in regions that are dense in segmental duplications, copy number variation within annotated segmental duplications was quantified directly from coverage, and these coverage values were used for the correlation of gene copy numbers with gene expression.
We filtered out SV calls that overlapped 50% or more with the gap regions of the mouse reference genome, as well as calls shorter than 50 bp or longer than 1 Mbp. A more stringent >1000 bp length filter was applied to CNVnator calls. We then filtered out non-homozygous calls and calls that were homozygous for the alternative allele in all substrains.
Concordant calls from LUMPY and CNVnator with 50% or greater reciprocal overlap and the same genotypes were merged and the breakpoints reported by LUMPY were used. Consensus calls that overlapped with annotated segmental duplications (SegDup) in the reference genome were excluded, and instead SegDup copy number was assessed directly from read depth signal using mosdepth v0.2.658 with window size 100 bp. SegDup annotations from the mm10 genome with at least 98% similarity were intersected with gene annotations, and the median read coverage across SegDups which intersect with genes was normalized by the median coverage of the corresponding chromosome. These normalized coverage values were used to correlate gene copy numbers with gene expression. The final set of SVs included 3,425 deletions, duplications and inversions in nine C57BL/6 and five C57BL/10 substrains. The distribution of SVs in each category and substrain is summarized in Table S1. The VCF file of the SV calls, and the read coverage data for the SegDup regions are provided in the Supplementary material.
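The 50% reciprocal overlap criterion used for merging calls can be sketched as below. This is a minimal illustration under our own assumptions: intervals are treated as half-open, calls are plain dictionaries with hypothetical keys, and a matching chromosome is required in addition to the overlap and genotype conditions stated in the text.

```python
from typing import Any, Dict


def reciprocal_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Smaller of the two overlap fractions for half-open intervals [start, end)."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def is_concordant(call_a: Dict[str, Any], call_b: Dict[str, Any], min_overlap: float = 0.5) -> bool:
    """True if two calls share a chromosome, genotypes and enough reciprocal overlap."""
    if call_a["chrom"] != call_b["chrom"]:
        return False
    if call_a["genotypes"] != call_b["genotypes"]:
        return False
    fraction = reciprocal_overlap(
        call_a["start"], call_a["end"], call_b["start"], call_b["end"]
    )
    return fraction >= min_overlap


# Invented calls: one from a breakpoint-based caller, one from a depth-based caller.
breakpoint_call = {"chrom": "chr2", "start": 10_000, "end": 20_000, "genotypes": (0, 1, 1)}
depth_call = {"chrom": "chr2", "start": 12_000, "end": 21_000, "genotypes": (0, 1, 1)}
print(reciprocal_overlap(10_000, 20_000, 12_000, 21_000))
print(is_concordant(breakpoint_call, depth_call))
```

Taking the minimum of the two fractions is what makes the overlap reciprocal: a small call nested inside a much larger one does not pass, even though the small call is fully covered.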
4.7 Resource availability
4.7.1 Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, Milad Mortazavi (miladm{at}alumni.stanford.edu).
4.7.2 Material availability
This study does not generate new unique reagents.
4.7.3 Data and code availability
The datasets (supplementary material) generated during this study are available at Mendeley Data, DOI: 10.17632/k6tkmm6m5h.15.
5. Supplemental information
Acknowledgements
M.M., Y.R., C.L.S.P., A.W. and A.A.P. were supported by P50DA037844. Additionally, Y.R. was supported by T32MH018399 and A.W. was supported by T32MH020065.