Expression analysis of Huntington disease mouse models reveals robust striatum disease signatures

Huntington’s disease is caused by expanded trinucleotide repeats in the huntingtin gene (HTT), and a higher number of repeats is associated with an earlier age of disease onset. Although the causative gene has been identified, its connections to the observed disease phenotypes is still unclear. Mouse models engineered to contain increasing numbers of trinucleotide repeats were sacrificed at different ages to detect RNA-level and protein-level changes specific to each repeat length and age in order to examine the transcriptional and translational characteristics of the disease. RNA-seq and quantitative proteomics data were collected on 14 types of tissues at up to 8 repeat lengths and in up to 3 different ages, and differential gene and protein expression were examined. A modified method of imputing missing proteomics data was employed and is described here. The most dysregulated tissue at both the RNA and protein levels was the striatum, and a strong gender effect was observed in all of the liver experiments. The full differential expression results are available to the research community on the HDinHD.org website. The results of multiple expression tests in the striatum were combined to generate an RNA disease signature and a protein disease signature, and the signatures were validated in external data sets. These signatures represent molecular readouts of disease progression in HD mice, which further characterizes their HD-related phenotype and can be useful in the preclinical evaluation of candidate therapeutic interventions. Author Summary Mouse models of Huntington’s disease were engineered to allow a detailed examination of how the disease causes changes in gene activity in a variety of tissues. Among the 14 tissues studied, the one most affected by the disease in our experiments was the striatum, a brain region involved in voluntary movement. The liver results showed large differences in gene activity between the male and female mice. In our analysis, we propose a minor change in how proteomics data is typically analyzed in order to improve the ranking of significant results. Using the striatum data in this study and in others, we identified robust genetic signatures of disease at both the RNA and protein levels.

were sacrificed at different ages to detect RNA-level and protein-level changes specific to each 23 repeat length and age in order to examine the transcriptional and translational characteristics 24 of the disease. RNA-seq and quantitative proteomics data were collected on 14 types of tissues 25 at up to 8 repeat lengths and in up to 3 different ages, and differential gene and protein 26 expression were examined. A modified method of imputing missing proteomics data was 27 employed and is described here. The most dysregulated tissue at both the RNA and protein 28 levels was the striatum, and a strong gender effect was observed in all of the liver experiments. 29 The full differential expression results are available to the research community on the 30 HDinHD.org website. The results of multiple expression tests in the striatum were combined to 31 generate an RNA disease signature and a protein disease signature, and the signatures were 32 validated in external data sets. These signatures represent molecular readouts of disease 33 progression in HD mice, which further characterizes their HD-related phenotype and can be 34 useful in the preclinical evaluation of candidate therapeutic interventions. Huntington's disease (HD), a fatal neurological disorder characterized by progressive motor, 49 cognitive, and behavioral impairments (1). The CAG repeats in the DNA are translated to a 50 polyglutamine (polyQ) region in the HTT protein, but the mechanisms connecting the polyQ 51 mutation to the disease phenotypes are still being explored. Marcy MacDonald, Vanessa 52 Wheeler, Scott Zeitlin, and their respective collaborators engineered mice with a human exon 1 53 of HTT knocked into the endogenous locus. From that, they engineered a series of HTT alleles 54 having different CAG repeat lengths, which we will refer to as the mouse allelic series (2). The 55 range includes wild type (WT) mice containing the natural Q7 repeat length, as well as a Q20 56 knock-in that has been characterized as being behaviorally (2) and transcriptionally (3) similar 57 to WT. The disease alleles Q50, Q80, Q92, Q111, Q140, and Q175 were compared to the Q20 58 knock-in as a control whenever possible, in order to remove possible effects of the knock-in 59 construct. In the experiments lacking Q20 samples, the WT mice were used as controls. All the 60 mice are heterozygous for the knock-in allele. The mice were sacrificed at 2, 6, or 10 months of 61 age, and tissues extracted from the mice were studied using RNA-seq and liquid 62 chromatography tandem mass spectrometry (LC/MS/MS) label-free proteomics. 63 64 Table 1 summarizes all of the RNA-seq and proteomics data sets that were analyzed and lists 65 their GEO, SRA, and PRIDE accession numbers. Most of the RNA-seq experiments used a broad 66 range of HTT alleles (WT, Q20, Q80, Q92, Q111, Q140, and Q175) and a range of ages (2, 6, and 67 10 months), and these are called the "Full Series" in Table 1 and in the text. The ones labeled 68 "Miniseries" had fewer alleles, namely WT, Q20, Q50, Q92, and Q140, and only two ages, 6 and 69 10 months. The "Tissue Survey" experiment (GSE65775) had only Q175 and WT mice aged 6 70 months. The wild type Q length was Q7. All of the proteomics experiments used the alleles in 71 the RNA-seq "Full Series" plus Q50. Some of the proteomics experiments lacked a Q50 sample 72 at the 2-month age, as noted in Table 1 of the striatum proteomics data (3) and identified correlated expression modules using 78 weighted gene coexpression network analysis (WGCNA). We extended Langfelder's 79 differential expression analysis of the striatum, cortex, and liver RNA to include the cerebellum, 80 hippocampus, and white adipose tissue near gonads, and added full proteomics analyses of the 81 striatum, cortex, cerebellum, hippocampus, liver, heart, and muscle. Our proteomics analysis 82 includes imputation of missing values in selected cases, as is done in the Bioconductor package 83 DEP (4), except that in our case the imputed values were calculated deterministically to make 84 the ranking of significant results more consistent. A striatal RNA disease signature is presented 85 here and contains genes overlapping two of the WGCNA modules in Langfelder's study. A 86 striatal protein disease signature is also presented here. These RNA and protein signatures will 87 be useful in understanding perturbations to transcription and translation in HD and, as 88 molecular biomarkers, can aid in evaluating the potential of candidate therapeutics to revert 89 the HD-related phenotype in mouse models. 90 91

Results 92
Gender effect in liver 93 While examining PCA plots during outlier detection, a striking difference was seen between the 94 male and female samples in all the liver experiments, including the RNA Full Series, RNA 95 Miniseries, and protein Full Series. This gender effect was not seen in any other tissue. Figure  96 1 shows several PCA plots with female samples in blue and male samples in green, and with the 97 different Q lengths indicated by different shapes. Figure 1A shows the striatum RNA Full Series, 98 with the green and blue genders intermingled, and this intermingling was typical of the other 99 tissues and experiments. Figures 1B, 1C, and 1D show the liver RNA Full Series, RNA Miniseries, 100 and protein Full Series respectively, and blue female samples are clearly separated from the 101 green male samples. Gender differences in liver gene expression have been previously 102 reported in mice (5) and the differences have been observed to begin at puberty, around days 103 30 to 35 of age. Interestingly, gender differences have been observed in human HD. While no 104 gender-specific differences in disease burden or age of onset are observed, the progression rate 105 in women appears to be faster (reviewed in (6)). Since the high-Q and wild type samples are 106 within each gender cluster, this means the gender effect in liver is larger than the disease 107 effect. Because of this large difference, all the liver data differential expression was analyzed 108 three ways: combined liver (LIV), female liver (LVF), and male liver (LVM).      The striatum is the most affected tissue, accounting for 42% of all differentially expressed genes 133 in the RNA Full Series and 25% of all significant genes in the protein Full Series. The numbers of 134 significant genes increase with Q length at both 6 months and 10 months, and at both the RNA 135 and protein levels. This trend is not observed for any other tissue. At the 2-month age, the 136 striatum shows more changes at the protein level than at the RNA level, and this is also true for 137 the cortex. The cerebellum and hippocampus are more affected (both 11%) than the cortex 138 (7%) at the RNA level, but less than the cortex at the protein level (cortex 11%, cerebellum 3%, 139 hippocampus 1%). The cerebellum shows similar numbers of changes in the Full Series RNA 140 (Q175 vs. Q20 at 6 months, 2,592 changes) and in the tissue survey (Q175 vs. WT at 6 months, 141 2,565 changes), but the hippocampus shows more changes in the Full Series (4,630) than in the 142 tissue survey (1,122). The cerebellum and hippocampus both show high counts at the Q175 Q 143 length in the RNA at 6 months and 10 months, but few counts at any other Q length and age. 144 These high Q175 counts for the cerebellum and hippocampus RNA are not seen in the 145 proteomics data. 146

147
The female liver is the second most affected tissue in the RNA Full Series experiments (18%), 148 while the male liver is the least affected (4%). The high counts in the female liver are primarily 149 restricted to the 6-month age, while the male liver has similar gene counts at all ages. At the 150 protein level, the female liver is again more affected than the male (13% vs. 6%), but the 151 combined liver has even more changes (22%), making it second only to the striatum among the 152 proteomics tissues. A likely explanation for the higher counts in the combined liver analysis is 153 that it has twice as many samples, increasing the statistical significance for some genes that 154 would not be significant with fewer samples. The most extreme result in the combined liver is 155 the Q175 vs. Q20 comparison at 6 months, with 1,138 significant genes, and this is the largest 156 count of significant changes among all the proteomics experiments. 1,084 of these 1,138 157 changes are negative. All of the proteomics results for the liver samples are dominated by 158 negative genes (down-regulated in the disease condition), while changes in both directions are 159 seen in the RNA. 160

161
The most affected tissue in the RNA tissue survey is white adipose around gonad, with 3,102 162 significant changes. This same tissue was investigated separately in experiment GSE76752 at a 163 range of Q lengths, and the Q175 vs. Q20 comparison still shows thousands of changes (6,802), 164 but all other Q lengths show much smaller changes (13 at Q140, 2 at Q111, 141 at Q92, and 22 165 at Q80). In contrast, the white adipose sample near intestine is the least affected tissue, with 166 only one significant gene change. Two types of muscle tissue were included in both the RNA 167 tissue survey and the proteomics Full Series, namely gastrocnemius (calf) muscle and heart. 168 The gastrocnemius muscle was more affected than the heart in the proteomics data (15% vs. 169 5%). The least affected brain tissue in the survey was the corpus callosum, with only 14 170 differentially expressed genes. were combined with five other striatum data sets and were grouped by Q length and age: Q175 199 10 months (10M), Q140 10M, Q140 6M, and R6/2 3M. 37,633 genes that were defined in all 10 200 experiments were compared within these four groups and across all groups. The same 201 significance criteria were used for all 10 data sets: genes must have an adjusted p-value less 202 than 0.05 and a fold change of at least 20% in either direction. The counts of significant genes 203 in each experiment, genes shared within each group, and genes common to all four groups are 204 shown in Table 3  In that study, modules M2 and M20 are the top two most CAG length-dependent modules, 231 where M2 correlated negatively with CAG length and M20 correlated positively. M2 had the 232 largest number of dysregulated genes, and M2's hub genes include the striatal medium spiny 233 neuron marker genes Ppp1r1b, Drd1, Drd2, and Gpr6, which are also in Str266R. M20 had the 234 strongest positive correlation with CAG length and was enriched for p53 signaling, cell division, 235 and protocadherin genes. The Str266R genes were tested for GO term enrichment, and the 236 most significant biological processes were responses to alkaloid and amphetamine, regulation 237 of glutamatergic synaptic transmission and membrane potential, synapse assembly, and protein 238 dephosphorylation. The full list of enriched GO terms is in Supplementary Table 4.  239   240 For cases where a smaller disease signature is desirable, the top 10 increasing and top 10 241 decreasing genes in Str266R were selected, except that the 8 genes that were not significant in 242 one of the four validation data sets were disqualified from this smaller signature. This small 243 bidirectional signature is called Str20R and is shown in Table 4 The overlapping significance method was also used to find a cortex RNA signature. Ten cortex 248 experiments that had HD samples (either Q175, Q140, or R6/2) and WT samples at several ages 249 were tested for differential expression, assembled into similar groups (Q175, Q140 10M, Q140 250 6M, and R6/2), and the overlaps in their significant genes were calculated. The genes 251 significant in each experiment are listed in Supplementary Table 5 and the counts of  252 overlapping genes are summarized in Table 5. For the R6/2 mice, there is a strong cortex 253 signature that is reproducible at ages 6 weeks and 3 months, with 2,352 overlapping genes. For 254 the Q175 mice, there are 110 genes that overlap at ages 6 and 10 months. The Q140 overlaps 255 are very low at both 6 months and 10 months. In the two Q140 10M studies, only 9 genes 256 overlap: Il33, Slc45a3, Chdh, Gpr153, Enpp6, Gm5067, Apod, Flnc, and 5031410I06Rik. In the 257 three Q140 6M studies, only 2 genes overlap, Scn4b and Gm5067. Unlike the striatum analysis, 258 no genes pass significance tests in the cortex in all 10 experiments. It was not possible to 259 determine a consistent cortex disease signature using this method and these data sets. We 260 speculate that the active disease genes differ by age and by region within the cortex during the 261 progression of HD. 262 We next sought to determine a striatum protein disease signature but had access to only four 265 data sets outside of the allelic series (compared to nine for the RNA signature), so we used the 266 recurrence ranking method described in Materials and Methods and illustrated in Figure 4.  Supplementary Table 6. These proteins were tested for GO term enrichment, and the most 288 significant biological processes were the regulation of synaptic plasticity, cognition, locomotory 289 behavior, learning or memory, visual behavior, and second-messenger-mediated signaling. The 290 full GO term enrichment results are included in Supplementary Table 4.  291   292 As we did with the RNA signature, we extracted a smaller protein signature by taking the top 10 293 increasing proteins and top 10 decreasing proteins from the Str115P and excluded proteins that 294 were not significant in all 4 validation sets. We call this smaller signature Str20P, shown in 295 Table 6. Note that the fold changes in Table 6 look smaller than the Str20R values in Table 4, 296 but the proteomics calculations are done in log10 intensities instead of log2. 297 The Str266R and Str115P signatures have 40 gene symbols in common, and all 40 change with 301 disease in the same direction at the RNA and protein levels. We define these 40 genes as a 302 combined RNA/protein signature, Str40RP. The 40 genes are shown in Table 7, ranked from 303 most positive to most negative RNA log2 fold change, and are included in Supplementary Table  304 6. Only the first four genes (Acy3, Chdh, Nagk, and Psme1) increase in disease; all the rest 305 decrease. This is not surprising because both the Str266R and Str115P signatures are 306 dominated by genes that decrease in disease. The most enriched biological processes among 307 these 40 genes are negative regulation of protein dephosphorylation, cAMP or cGMP metabolic 308 process, response to amphetamine, visual learning, visual behavior, and positive regulation of 309 long-term synaptic potentiation. The full GO term results for Str40RP are included in 310 Supplementary

RNA-seq and proteomics data sets 318
All of the mouse experiments used in this study were downloaded from NCBI's Gene Expression 319 Omnibus (GEO) (7) and CHDI's Huntington's Disease in High Definition website (HDinHD) (3). 320 The raw FASTQ files are also available on NCBI's Sequence Read Archive (SRA) (8) and the mass 321 spectrometry data is on EBI's Proteomics Identifications Database (PRIDE) (9). The GEO data 322 sets were provided as read counts per gene, and the HDinHD data sets were provided as log10 323 label-free quantitation (LFQ) intensities. The RNA-seq read counts were produced from the raw 324 FASTQ files using OmicSoft ArrayStudio (10). The proteomics signal intensities were produced 325 from the mass spectrometry data by Evotec SE. The sample information for all experiments 326 was downloaded from HDinHD 327 (http://repository.hdinhd.org/data/allelic_series/Allelic_Series_Decoder_Ring-1.5.xlsx.gz). The 328 sample information file was edited to remove "LFQIntensity" from the sample names for the 329 liver PXD005641 proteomics samples in order to match the sample names in the intensities file. each Q length, the higher ages (6 and 10 months) were compared to the 2-month age, and we 353 call these the age comparisons. These age comparisons will include healthy aging genes in 354 addition to disease genes. The healthy aging genes were identified using the Q20 10-month vs. 355 Q20 2-month and Q20 6-month vs. Q20 2-month comparisons within each tissue. The presence 356 of healthy aging genes in the age comparison results could be a problem for some intended 357 uses of these gene lists, so the age comparisons are provided with the healthy aging genes 358 retained in Supplementary File 9 but with them removed in Supplementary File 10. Ensembl 359 identifiers in the differential expression results were converted to gene symbols using GFF3 360 annotation files from Ensembl (13) A simple way to identify a disease signature when many data sets are available is to examine 395 the genes that overlap each experiment's lists of significant genes and add the requirement 396 that their changes need to be in the same direction in every experiment. This was used to 397 determine the Str266R signature. In the selection of candidate genes from 10 RNA-seq data 398 sets, the significance criteria included both a fold change threshold (20% or higher) and an 399 adjusted p-value threshold (less than 0.05). When testing the candidate genes in the validation 400 data sets, only the adjusted p-value criterion was used. 401 402 Disease signatures, recurrence ranking method 403 When fewer data sets are available, a method based on ranking the recurring significance of 404 genes in multiple comparisons was used, and this was used to determine the Str115P signature. 405 Mice with Q lengths of Q111 or higher and ages 6 months or older (Q111+ 6M+) were 406 considered as having the disease phenotype base on the RNA and protein differential 407 expression results. The 6 Q-length comparisons and 6 age comparisons meeting these criteria 408 were used as the 12 comparisons to look for recurrent significance in. An adjusted p-value of 409 less than 0.1, with no fold change threshold, was used as the significance criterion. was the requirement to be incorporated into the final Str115P signature. 421

Validation data sets 423
The striatum and cortex RNA signatures were validated using untreated HD and WT control 424 mice from other experiments, some of which have already been published and others which 425 are in pre-publication. Pre-publication data sets are referred to generically in the text as 426 "CohortX" until those data sets are public. Some cohorts have multiple time points, and those 427 will be identified as "CohortXTimeY". The public studies are HDAC (GSE104086), which studied 428 the effects of HDAC inhibitors, and KMO (GSE105158), which studied the effects of a KMO 429 inhibitor. All of the HD mice were either Q175 or Q140 mice aged 6 to 12 months 430 (Cohort1Time1, Cohort1Time2, Cohort2Time1, Cohort2Time2, Cohort3Time1, Cohort3Time2, 431 and Cohort3Time3) or R6/2 mice aged 6 weeks to 3 months (HDAC, Cohort4, and KMO). The 432 Cohort1, HDAC, Cohort4, KMO, and Cohort2 validation sets each had 7 to 10 HD mice and 7 to 433 10 WT mice. The Cohort3 validation sets each had 23 to 31 HD mice and 23 to 29 WT mice. The 434 Cohort1 and Cohort3 data sets had only striatum samples, while the HDAC, Cohort4, KMO, 435 Cohort2 sets had both striatum and cortex samples. 436