A study of allelic series using transcriptomic phenotypes in a metazoan

Expression profiling holds great promise for genetics because of its ability to measure thou-sands of genes quantitatively in parallel. Although transcriptomes have recently been used to perform epistasis analyses for pathway reconstruction, there has not been a systematic effort to understand how expression profiles will vary among various mutants of the same gene. Here, we study an allelic series in C. elegans consisting of one wild type and two mutant alleles of mdt-12, a highly pleiotropic gene whose gene product is a subunit of Mediator complex, which is essential for transcriptional initiation in eukaryotes. We developed a false hit analysis to identify which populations of genes commonly differentially expressed with respect to the wild type are likely the result of statistical artifact. We concluded that expression perturbations caused by these alleles split into four distinct modules called phenotypic classes. To understand the dominance relationship between the two mutant alleles, we developed a dominance analysis for transcriptional data. Dominance analysis of these phenotypic classes support a model where mdt-12 has multiple functional units that function independently to target the Mediator complex to specific genetic loci. Author Summary Expression profiling is a way to quickly and quantitatively measure the expression level of every gene in an organism. As a result, these profiles could be used as phenotypes with which to perform genetic analyses (i.e., to figure out what genes interact with each other) as well as to dissect the molecular properties of each gene. Before we can perform these analyses, we have to figure out the rules that apply to these measurements. In this paper, we develop new concepts and methods with which to study an allelic series. Briefly, allelic series are an important aspect of genetics because different alleles encode different versions of a gene. By studying these different versions, we can make statements about how function is encoded within the sequence of a gene. We apply our methods to the mdt-12 gene, which encodes a subunit of the Mediator complex. Though we know it is essential for all transcriptional activity in eukaryotes, we understand very little about how the Mediator complex functions to generate both general and specific phenotypes. The reason for this is the genes that encode these subunits are associated with general sickness and multiple phenotypes when mutated, which makes them challenging to study genetically. We show that transcriptomic phenotypes renders the study of general factors such as mdt-12 feasible. Supplementary Data The website for the Supplementary Data for this project is still under construction and will be available shortly. All code, data and figures are available upon request.


Introduction
As a proof of principle, we selected a subunit of 48 the Mediator complex in C. elegans, mdt-12 (previ-49 ously known as dpy-22 12 ), for genetic analysis. We 50 explored three alleles, including the wild-type allele, 51 of this highly pleiotropic gene because its biologi-52 cal roles are poorly understood. The mutant alle-53 les were generated in previous screens 13, 14 , where 54 they were associated with specific phenotypes in the 55 male tail and in the vulva. Mediator is a macro-56 molecular complex that contains approximately 25 57 subunits 15 and which globally regulates RNA poly-58 merase II (Pol II) 16,17 . Mediator is a versatile regu-59 lator, a quality often associated with its variable sub-60 unit composition 16 , and it can promote transcription 61 as well as inhibit it. The Mediator complex consists 62 of four modules: the Head, Middle and Tail modules 63 and a CDK-8-associated Kinase Module (CKM). The 64 CKM can associate reversibly with Mediator. Cer-65 tain models propose that the CKM functions as a 66 molecular switch, which inhibits Pol II activity by 67 sterically preventing its interaction with the other 68 Mediator modules 18,19 . Other models propose that 69 the CKM negatively modulates interactions between 70 Mediator and enhancers 20 . In C. elegans, the CKM 71 consists of CDK-8, MDT-13, CIC-1 and DPY-22 21 . 72 Since dpy-22 is orthologous to the human Mediator 73 subunits MED-12 and MED-12L 13 , we will hence-74 forth refer to this gene as mdt-12 . mdt-12 has been 75 studied in the context of the male tail 13 , where it 76 was found to interact with the Wnt pathway. It 77 has also been studied in the context of vulval for-78 mation 22 , where it was found to be an inhibitor of 79 the Ras pathway. Loss of mdt-12 is lethal in XO an-80 imals 23,24 , and developmental studies have relied on 81 reduction-of-function alleles to understand the role 82 of this gene in development. Studies of the male 83 tail were carried out using an allele, dpy-22(bx93), 84 that generates a truncated DPY-22 protein miss-85 ing its C-terminal 949 amino acids as a result of a 86 premature stop codon, Q2549STOP 13 . In spite of 87 the premature truncation, animals carrying this al-88 lele grossly appear phenotypically wild-type. In con-89 trast, the allele used to study the role of mdt-12 in 90 the vulva, dpy-22(sy622), is a premature stop codon, 91 Q1698STOP, that predicted to remove 1,800 amino 92 acids from the C-terminus 14 (see Fig. 1). Animals 93 carrying this mutation are severely dumpy (Dpy), 94 have egg-laying defects (Egl) and have a low pene-95 trance multivulva (Muv) phenotype. These alleles 96 could form a single quantitative series, affecting the 97 same sets of target genes but to different degrees, 98 in which case the trans-heterozygote would exhibit 99 a single dosage-dependent phenotype intermediate to 100 the two homozygotes. Alternatively, they could form 101 a single qualitative series, in which case the trans-102 heterozygote should have the same phenotype as the 103 homozygote of the bx93 allele, since this allele en-104 codes the longer protein. These alleles could also 105 form a mixed series, in which case multiple separa-106 ble phenotypes would appear that have qualitative or 107 2/14 showing that bx93 has a dominant mutant phenotype 131 for this subset. For each class, we were able to quan-132 titatively measure the dominance level of each allele. Differential expression with respect to the wild-type 146 control for each transcript i in a genotype g is mea-147 sured via a coefficient β g,i , which can be loosely in-148 terpreted as the natural logarithm of the fold-change. 149 Positive β coefficients indicate up-regulation with re-150 spect to the wild-type, whereas negative coefficients 151 indicate down-regulation. Transcripts were tested for 152 differential expression using a Wald test, and the re-153 sulting p-values were transformed into q-values that 154 are correcteed for multiple hypothesis testing. Tran-155 scripts were considered to have differential expression 156 between wild-type and a mutant if the associated q-157 value of the β coefficient was less than 0.1. At this 158 threshold, 10% of all differentially expressed genes are 159 expected to be false positive hits.

172
False hit analysis identifies four pheno-173 typic classes 174 Overlapping three sets of differentially expressed 175 genes from different genotypes can generate at most 176 seven categories. Each of these seven categories could 177 be interpreted biologically if the population is be-178 lieved to arise from real effects. If these populations 179 are small, however, there is a real chance that they 180 represent statistical noise, and are not biologically 181 meaningful. If that is the case, these populations 182 may consist largely of genes that are mis-classified 183 and belong to a different cluster, in which case they 184 should be re-classified into the most likely cluster, if 185 it can be determined. 186 We identified three categories of genes that were 187 most likely to be influenced by statistical noise due 188 to their small size. These populations were those that 189 encompassed genes differentially expressed in bx93 190 homozygotes and one other genotype, as well as genes 191 that were differentially expressed specifically in bx93 192 homozygotes.   To assess the extent to which statistical artifacts could affect the interpretation of certain intersections, we first idealized the Venn diagram and asked whether false positive and false negative results could distort the diagram back to its original shape. We estimated the false negative rate at 15% and used a false positive rate of 10%. For simplicity, only false hit analysis for bx93 groups is shown. False hits can explain the existence of a groups of genes that are differentially expressed in bx93 homozygotes only, in bx93 homozygotes and trans-heterozygotes, and in bx93 homozygotes and sy622 homozygotes. Genes that are solely expressed in bx93 homozygotes are unlikely to exist, whereas genes that are differentially expressed in bx93 homozygotes and one other genotype are probably misclassified and should be differentially expressed in all genotypes. The trans-heterozygote specific class cannot be explained by statistical artifacts.  We also identified a sy622 -specific phenotypic class  Different phenotypic classes behave dif-278 ferently in an sy622 homozygote 279 We asked whether these classes had perturbation dis-280 tributions distinct from each other within a single 281 homozygote. Specifically, we wanted to test whether 282 these sets behaved as randomly selected sets. If this 283 were the case, then within a single genotype, each 284 class would be expected to have the same distribution 285 of perturbations (see Fig. 4). We found that that the 286 β coefficients of isoforms within the bx93 -associated 287 phenotype on average had the largest absolute value 288 (mean: 1.2). The sy622 -associated phenotype had 289 a smaller range of perturbations compared to the 290 bx93 -associated phenotype (95th percentiles of the 291 two distributions: 2.9 versus 3.2, respectively), and a 292 statistically smaller median (0.91 vs 1.2, respectively, 293 p < 10 −6 , non-parametric boostrap). The medians 294 of the sy622 -specific and -associated classes were the 295 same (p = 0.15). There are systematic differences 296 between the behaviors of each class. This rejects the 297 null hypothesis that the transcripts in each class were 298 randomly selected.

299
Dominance can be quantified in tran-300 scriptomic phenotypes 301 Dominance relationships between alleles are 302 phenotype-specific. In other words, an allele 303 can be dominant over another for one phenotype, 304 yet not for others. An example is the let-23 allelic 305 series-nulls of let-23 are recessive lethal (Let) and 306 presumably also recessive vulvaless (Vul) relative to 307 the wild-type allele. The sy1 allele of let-23 is dom-308 inant viable relative to null alleles, but is recessive 309 Vul 3 to the wild-type allele. Above, we postulated 310 that there are four phenotypic classes, three of 311 which are composed of genes whose expression is 312 significantly perturbed in the sy622 homozygote. If 313 these classes are indeed modular phenotypes, then 314 the dominance relationships within each class should 315 be the same from gene to gene. In other words, a 316 single dominance coefficient should be sufficient to 317 explain the gene expression in the trans-heterozygote 318 for every gene within a class.

319
To quantify this dominance, we implemented and 320 maximized a Bayesian model (see Methods). Briefly, 321 we asked what the linear combination of β coefficients 322 from each homozygote would best predict the ob-323 served β values of the heterozygote, subject to the 324 constraint that the coefficients added up to 1 (see 325 Dominance analysis). We reasoned that if this was a 326 modular phenotype controlled by a single functional 327 unit encoded within the gene of interest, then a plot 328 5/14  The bx93 allele is dominant over the sy622 for 354 the bx93 -associated phenotype 355 We explored how expression levels of transcripts 356 within the bx93 -associated phenotypic class were con-357 trolled by these two alleles. Transcripts in this class 358 are differentially expressed in homozygotes of either 359 allele. Moreover, transcripts in this class are more 360 perturbed in sy622 homozygotes than in bx93 ho-361 mozygotes. This is consistent with a single functional 362 unit that is impaired in the bx93 allele, and even more 363 impaired in the sy622 allele (see Fig. 5).

364
If a single functional unit is being impaired, then 365 we would expect these alleles to form a quantitative 366 allelic series for this phenotypic class. In a quantita-367 tive series, alleles exhibit semidominance. We quanti-368 fied the dominance coefficient for this class and found 369 that the bx93 allele is largely but not completely 370 dominant over the sy622 allele (d bx93 = 0.81; see 371 Fig. 5). Dominance in the context of an allelic series 372 indicates a qualitative allelic series, which is evidence 373 that MDT-12 protein produced from the bx93 allele 374 has an intact functional unit that is deleted in pro-375 tein product from the sy622 allele. Mixed evidence 376 for quantitative and qualitative allelic series at this 377 phenotypic class precludes a definitive conclusion. In bx93 homozygotes, transcripts within the bx93associated class are less perturbed than in sy622 homozygotes. The line of best fit (green) is β bx93/bx93 = 0.56·β sy622/sy622 . B In a trans-heterozygote, the bx93 allele is largely dominant over the sy622 allele for the expression levels of transcripts in the bx93 -associated class. In the graphs above, densely packed points are colored yellow as a visual aid. The size of the point is inversely proportional to the standard error of the β coefficients.
The bx93 allele is semidominant with sy622 379 for the sy622 -associated phenotypic class 380 We quantified the relative dominance of bx93 and 381 sy622 on the expression level of transcripts that be-382 longed to the sy622 -associated class. We found that 383 both alleles are semidominant (d bx93 = 0.51). This 384 suggests that there is a structure distributed evenly 385 throughout the gene body starting the first amino 386 acid position and ending before position 2,549. Since 387 the two alleles are semidominant for transcript ex-388 pression in this class, the functionality encoded in 389 this gene must be dosage-dependent for this model 390 to hold.

391
The sy622 -specific class is strongly en-392 riched for a Dpy transcriptional signa-393 ture 394 bx93 homozygotic animals are almost wild-type, but 395 careful measurements show that they have a slight 396 body length defect causing them to be slightly Dpy, 397 and sy622 homozygotic animals are known to be 398 severely Dpy 14 , but this phenotype is complemented 399 almost to bx93 levels when this allele is placed in 400 trans to the sy622 allele. The only class that is 401 fully complemented to wild-type levels is the sy622 -402 specific class. Therefore, we hypothesized that the 403 sy622 -specific class should show a strong transcrip-404 tional Dpy signature.

405
To test this hypothesis, we derived a Dpy signa-406 ture from two Dpy mutants (dpy-7 and dpy-10 , DAA, 407 CPR and PWS unpublished) consisting of 628 genes. 408 We used this gene set to look for a transcriptional 409 Dpy signature in each phenotypic class using a hy-410 pergeometric probabilistic model (see Methods). We 411 found that the sy622 -specific and -associated classes 412 were enriched in genes that are transcriptionally as-413 sociated with a Dpy phenotype. The bx93 -associated 414 class also showed significant enrichment (fold-change 415 = 2.2, p = 4 · 10 −10 , 68 genes observed). The en-416 richment was of considerably greater magnitude in 417 the sy622 -specific class (fold-change enrichment = 418 3, p = 2 · 10 −40 , 167 genes observed) than the en-419 richment in the sy622 -associated class (fold-change 420 = 1.9, p = 9 · 10 −9 , 82 genes observed) or in the 421 bx93 -associated class. Correlation analysis showed 422 that a majority of the genes in the sy622 -specific class 423 were strongly correlated between the expression lev-424 els in the Dpy signature and the expression levels 425 in sy622 homozygotes, while 25% of the genes were 426 anti-correlated (Spearman R = 0.42, p = 6 · 10 −15 , 427 see Fig. 6). If the anti-correlated values are excluded 428 from the Spearman regression, the statistical value of 429 7/14 the regression improves significantly (Spearman R = 0.94, p = 2 · 10 −108 ). Taken together, this suggests 431 that the sy622 -specific phenotypic class contains a 432 transcriptional signature that can be associated with 433 the morphological Dpy phenotype. 434 We also tested a hypoxia dataset 11 , since mdt-12 435 is not known to be upstream of the hif-1 -dependent 436 hypoxia response in C. elegans. Enrichment tests re-    A We obtained a set of transcripts associated with the Dpy phenotype from dpy-7 and dpy-10 mutants. We identified the transcripts that were differentially expressed in sy622 homozygotes. Next, we plotted the β values of each transcript in sy622 homozygotes against the β values in a dpy-7 mutant. A significant portion of the genes are correlated between the two genotypes, showing that the signature is largely intact. 25% of the genes are anti-correlated. B We performed the same analysis using a set of transcripts associated with the hif-1 -dependent hypoxia response as a negative control. Although sy622 is enriched for the transcripts that make up this response, there is no correlation between the β values in sy622 homozygotes and the β values in egl-9 homozygotes. In the plots, a colormap is used to represent the density of points. The standard error of the mean is inversely proportional to the standard error of β mdt−12(sy622) . Transcriptomic phenotypes generate large amounts of 536 information that can be used to determine functional 537 units. However, due to the large number of tests per-538 formed, false positive and false negative events oc-539 cur frequently enough to create populations of tran-540 scripts that have anomalous behaviors. It is necessary 541 to identify what modules or populations are most at 542 risk of these events and to what extent these mod-543 ules may be polluted by false signals to prevent over-544 interpretation. In our experiment, we can estimate 545 statistical noise in each population. There is a rich 546 literature in genomics devoted to controlling and es-547 timating false positive rates 25,26 , but false negative 548 rates have largely been ignored because they do not 549 create spurious signal in simple experimental designs 550 and because there is ample signal in most RNA-seq 551 experiments. For allelic series experiments to be suc-552 cessful, systematic algorithms to estimate and con-553 trol false negative rates, and to identify the popula-554 tions most at risk for enrichment of false hits, must 555 9/14 be developed, because false negative hits can create populations of genes that have fantastical biological 557 behaviors (such as contrived examples of intragenic 558 complementation or dosage models). 559 We performed a false hit analysis, estimating the 560 false negative rate at 15%, to identify the clusters 561 or classes of genes most at risk for statistical noise.

607
A dosage model could explain the trans-608 heterozygote specific class if the dosage curve is bell-609 shaped. In this model, a switch is only activated 610 at a very specific mdt-12 gene dosage. Beyond this 611 dosage, the switch remains off. Although such a 612 model explains the data, mechanisms that could gen-613 erate such a dosage curve are not immediately ob-614 vious. One possibility is that this switch is en-615 acted at the level of cell specification or cell divi-616 sion, and that at the appropriate dosage of mdt-12 , 617 two cells that would typically collaborate to form a 618 phenotype now act antagonistically, pushing trans-619 heterozygotes into a different state from the homozy-620 gotes. If this is the case, whole-organism RNA-seq 621 may have limited resolution to identify what tissues 622 or cells are being perturbed. Single-cell sequencing 623 of C. elegans has recently been reported. As this 624 technique becomes more widely adopted, and with 625 decreasing cost, single-cell profiling of these geno-626 types may provide information that complements the 627 whole-organism expression phenotypes, perhaps ex-628 plaining the mysterious origin of this phenotype.

629
Analysis of allelic series using 630 transcriptome-wide measurements 631 The potential of transcriptomes to perform epistasis 632 analyses has been amply demonstrated 10,8 , but their 633 potential to perform allelic series analyses has been 634 less studied. Though similar in some respects, epista-635 sis analyses and allelic series studies call for different 636 methods to solve different problems. To successfully 637 perform an allelic series analysis, we must be able to 638 identify the number and identity of the phenotypic 639 classes, and a dominance analysis must be performed 640 for each class to determine whether the alleles inter-641 act qualitatively or quantitatively with each other. 642 Additionally, if an allelic series includes more than 643 two alleles, the number of experimental outcomes 644 and the number of possible outcomes rapidly become 645 large.

646
The general problem of partitioning a set of genes 647 into phenotypic classes is a common problem in bioin-648 formatics. This problem has been tackled through 649 clustering, matrix-based methods such as PCA or 650 non-negative matrix factorization, or through q-651 value-based classification (as we have done here). 652 Although these methods can classify genes or tran-653 scripts into clusters, by themselves they cannot ascer-654 tain the probability that any one cluster is real.  We should push to sequence allelic diversity to more 701 fully understand genotype-genotype variation. Differential expression analysis was performed us-731 ing Sleuth 36 . Briefly, we used a general linear model 732 to identify genes that were differentially expressed be-733 tween wild-type and mutant libraries. To increase 734 our statistical power, we pooled wild-type replicates 735 from other published and unpublished analysis. All 736 wild-type replicates were collected at the same stage 737 (young adult). In total, we had 10 wild-type repli-738 cates from 4 different batches, which heightened 739 our statistical power. Batch effects were smaller 740 than between-genotype effects, as assessed by princi-741 pal component analysis (PCA), except when switch-742 ing between samples constructed by different library 743 methods. Wild-type samples constructed using the 744 same library method clustered together and away 745 from all other mutant samples. However, clustering 746 wild-type samples by themselves revealed that the 747 samples clusters correlated with the person who col-748 lected them. Therefore, we added batch correction 749 terms to our model to account for batch effects from 750 library construction as well as from the person who 751 collected the samples. Non-parametric bootstrap 753 We performed non-parametric bootstrap testing to 754 identify whether two distributions had the same 755 11/14 mean. Briefly, the two datasets were mixed, and samples were selected at random with replacement 757 from the mixed population into two new datasets. 758 We calculated the difference in the means of these 759 new datasets. We iterated this process 10 6 times. To  Dominance analysis 773 We modeled allelic dominance as a weighted average of allelic activity. Briefly, our model proposed that β coefficients of the heterozygote, β a/b,i,Pred , could be modeled as a linear combination of the coefficients of each homozygote: where β k/k,i refers to the β value of the ith isoform in 774 a genotype k/k, and d a is the dominance coefficient 775 for allele a. 776 To find the parameters d a that maximized the probability of observing the data, we found the parameter, d a , that maximized the equation: