Analysis of allelic series with transcriptomic phenotypes

Although transcriptomes have recently been used to perform epistasis analyses, they are not yet used to study intragenic function/structure relationships. We developed a theoretical framework to study allelic series using transcriptomic phenotypes. As a proof-of-concept, we apply our methods to an allelic series of dpy-22 , a highly pleiotropic Caenorhabditis elegans gene orthologous to the human gene MED12 , which is a subunit of the Mediator complex. Our methods identify functional regions within dpy-22 that modulate Mediator activity upon various genetic modules.

Introduction been used to characterize allelic series. 23 We have devised methods for characterizing al-24 lelic series with RNA-seq. To test these methods, 25 we selected three alleles 11,12 of a C. elegans Medi-  The CKM associates reversibly with other modules, 33 and appears to inhibit transcription 16,17 . In C. el-34 egans development, the CKM promotes both male 35 tail formation 11 (through interactions with the Wnt 36 pathway), and vulval formation 18 (through inhibi-37 tion of the Ras pathway). Homozygotes of allele 38 dpy- 22(bx93), which encodes a premature stop codon 39 Q2549Amber 11 , appear grossly wild-type. In con-40 trast, animals homozygous for a more severe al-41 lele, dpy-22(sy622) encoding another premature stop 42 codon, Q1698Amber 12 , are dumpy (Dpy), have egg-43 laying defects (Egl), and have multiple vulvae (Muv). 44 (see Fig. 1A). In spite of its causative role in a num-45 ber of neurodevelopmental disorders 19 , the structural 46 and functional features of this gene are poorly un-47 derstood. In humans, MED12 is known to have a 48 proline-, glutamine-and leucine-rich domain that in-49 teracts with the WNT pathway 20 . However, many 50 disease-causing variants fall outside of this domain 21 . 51 To study these variants and how they interfere with 52 the functionality of MED12 , quantitative and effi-53 cient methods are necessary. 54 RNA-seq phenotypes have the potential to reveal 55 functional regions within genes, but their phenotypic 56 complexity makes this difficult. We developed a 57 method for determining allelic series from transcrip-58 tomic phenotypes and used the C. elegans dpy-22 59 gene as a test case. Our analysis revealed functional 60 regions that act to modulate Mediator activity at 61 thousands of genetic loci. wild type. 95 We used a false hit analysis to identify four non-96 overlapping phenotypic classes. We use the term 97 genotype-specific to refer to groups of transcripts 98 that were perturbed in one mutant. We use the 99 term genotype-associated to refer to those groups of 100 transcripts whose expression was significantly altered 101 in two or more mutants with respect to the wild 102 type control. The dpy- 22 Because the mutations we used are truncations, 130 our results suggest the existence of various func-131 tional regions in dpy-22/MDT12 (see Fig. 2). The 132 dpy-22(sy622)-specific phenotypic class is likely con-133 trolled by a single functional region, functional region 134 1 (FC1), and the dpy-22(sy622)-associated pheno-135 typic class is likely controlled by a second functional 136 region, functional region 2 (FC2). It is unlikely that 137 these regions are identical because their dominance 138 behaviors are very different. The dpy-22(bx93) allele 139 was largely dominant over the dpy-22(sy622) allele for 140 the dpy-22(bx93)-associated class, but gene expres-141 sion in this class was perturbed in both homozygotes. 142 The perturbations were greater for dpy-22(sy622) 143 homozygotes than for dpy-22(bx93) homozygotes. 144 This behavior can be explained if the dpy-22(bx93)-145 associated class is controlled jointly by two distinct 146 effectors, functional regions 3 and 4 (FC3, FR4, 147 see Fig. 2). A rigorous examination of this model 148 will require studying alleles that mutate the region 149 between Q1689 and Q2549 using homozygotes and 150 trans-heterozygotes. 151 We also found a class of transcripts that had per-152 turbed levels in trans-heterozygotes only; its bio-153 logical significance is unclear. Phenotypes unique 154 to trans-heterozygotes are often the result of physi-155 cal interactions such as homodimerization, or dosage 156 reduction of a toxic product 24 . In the case of 157 2/9 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 29, 2018.  Flowchart for an analysis of arbitrary allelic series. A set of alleles is selected, and the corresponding genotypes are sequenced. Independent phenotypic classes are then identified. For each phenotypic class, the alleles are ordered in a dominance/complementation hierarchy, which can then be used to infer functional regions within the genes in question.

3/9
. CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 29, 2018.  The number of genes associated with each class is shown. The dpy-22(bx93)-associated class may be controlled by two functional regions. FR2 and FR3 could be redundant if FR4 is a modifier of FR2 functionality at dpy-22(bx93)-associated loci. Note that the dpy-22(bx93)-associated phenotypic class is actually three classes merged together. Two of these classes are DE in dpy-22(bx93) homozygotes and one other genotype. Our analyses suggested that these two classes are likely the result of false negative hits and genes in these classes should be differentially expressed in all three genotypes, so they we merged all classes together (see Methods).
dpy-22/MDT12 orthologs, how either mechanism 158 could operate is not obvious, since DPY-22 is ex-159 pected to assemble in a monomeric manner into the 160 CKM. Massive single-cell RNA-seq of C. elegans has 161 recently been reported 25 . When this technique be-162 comes cost-efficient, single-cell profiling of these geno-163 types may provide information that complements the 164 whole-organism expression phenotypes, perhaps ex-165 plaining the origin of this phenotype.
195 dpy-22 is not known to be upstream of the hif-1 -196 dependent hypoxia response in C. elegans. Enrich-197 ment tests revealed that the hypoxia response was 198 significantly enriched in the bx93 -associated (fold-199 change = 2.1, p = 10 −8 , 63 genes observed), the 200 sy622 -associated (fold-change = 1.9, p = 4 · 10 −8 , 78 201 genes observed) and the sy622 -specific classes (fold-202 change = 2.4, p = 9 · 10 −55 , 186 genes observed). 203 However, there was no correlation between the ex-204 pression levels of these genes in dpy-22 genotypes and 205 the expression levels expected from the hypoxia re-206 sponse. Although the hypoxia gene battery can be 207 found in dpy-22 mutants, these genes are not used to 208 deploy a hif-1 -dependent hypoxia phenotype. Taken 209 together, our results suggest that transcriptomic sig-210 natures can be used to understand the biological func-211 4/9  Ranked β egl−9 (Hypoxia) Figure 3. sy622 homozygotes show a transcriptional response associated with the Dpy phenotype. A We obtained a set of transcripts associated with the Dpy phenotype from dpy-7 and dpy-10 mutants. We identified the transcripts that were differentially expressed in sy622 homozygotes. We ranked the β values of each transcript in sy622 homozygotes and plotted them against the ranked β values in dpy-7 mutants. A significant portion of the genes are correlated between the two genotypes, showing that the signature is largely intact. 25% of the genes are anti-correlated. B We performed the same analysis using a set of transcripts associated with the hif-1dependent hypoxia response as a negative control.
Although sy622 is enriched for the transcripts that make up this response, there is no correlation between the β values in sy622 homozygotes and the β values in egl-9 homozygotes.

5/9
. CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 29, 2018.  identify non-empty genotype-associated phenotypic 308 classes, but identifying genotype-specific classes be-309 comes difficult if the experimental false positive rate 310 is high. On the other hand, even moderate false 311 negative rates (> 10%) rapidly degrade signal from 312 genotype-associated classes. For classes that are as-313 sociated with three genotypes, an experimental false 314 negative rate of 30% is enough on average to prevents 315 this class from being observed. 316 We selected λ = 3 because classification using this 317 threshold was high across a range of false positive and 318 false negative combinations. A challenge to applying 319 this algorithm to our data is the fact that the false 320 negative rate for our experiment is unknown. Al-321 though there has been significant progress in control-322 ling and estimating false positive rates, we know of no 323 such attempts for false negative rates. It is unlikely 324 that the false negative rate for our study is lower than 325 the false positive rate, because all genotypes except 326 the controls are likely underpowered. We used false 327 negative rates between 10-20% for false hit analy-328 sis. When the false negative rate was set at 15% 329 or higher, the algorithm converged on the same five 330 classes shown above. For false negative rates between 331 10-15%, the algorithm output the same five classes, 332 but also accepted the (dpy-22(sy622),dpy-22(bx93))-333 associated class. We selected the model correspond-334 ing to false negative rates of 15-20% because this 335 model had lower χ 2 values than the model selected 336 with a false negative rate of 10-15% (4,212 versus 337 100,650). 338 We asked whether re-classification of some classes 339 into others could improve model fit. We manually 340 re-classified the (dpy-22(sy622),dpy-22(bx93))-341 associated and the (dpy-22(bx93), trans-342 heterozygote)-associated classes into the bx93 -343 associated class (which is associated with all 344 genotypes), and we compared χ 2 statistics between 345 a re-classified reduced model and a reduced model. 346 The re-classified model had a lower χ 2 (181). Thus, 347 we concluded that the re-classified reduced model is 348 6/9 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 29, 2018. . https://doi.org/10.1101/210724 doi: bioRxiv preprint the most likely model to give rise to our data.

Data: M obs = {N l }, an observed set of classes,
where each class is labelled by l ∈ L and is of size N l . f p , f n , the false positive and negative rates respectively. α, the signal-to-noise threshold for acceptance of a class. Result: M reduced , a reduced model that fits the data.
begin Define a minimal set to initialize the reduced model K = {min l∈L N l } Refine the model until the model converges or iterations max out Calculate signal to noise for each labelled class False negatives can result in λ < 0 λ l ← M obs,l /F l if (λ > α) | (λ < 0) then K l ← M obs,l end end i + + end end Return the reduced model M reduced = K return M reduced Algorithm 1: False Hit Algorithm. Briefly, the algorithm initializes a reduced model with the phenotypic class or classes labelled by the largest number of genotypes. This reduced model is used to estimate noise fluxes, which in turn can be used to estimate a signal-to-noise metric between observed and modelled classes. Classes that exhibit a high signalto-noise are incorporated into the reduced model.

350
Dominance analysis 351 We modeled allelic dominance as a weighted average of allelic activity: where β k/k,i refers to the β value of the ith isoform in 352 a genotype k/k, and d a is the dominance coefficient 353 for allele a. 354 To find the parameters d a that maximized the probability of observing the data, we found the parameter, d a , that maximized the equation: (2) where β a/b,i,Obs was the coefficient associated with 355 the ith isoform in the trans-het a/b and σ i was 356 the standard error of the ith isoform in the trans-357 heterozygote samples as output by Kallisto. S is the 358 set of isoforms that participate in the regression (see 359 main text). This equation describes a linear regres-360 sion which was solved numerically.

362
Code was written in Jupyter notebooks 31 using the 363 Python programming language. The Numpy, pandas 364 and scipy libraries were used for computation 32,33,34 365 and the matplotlib and seaborn libraries were used 366 for data visualization 35,36 . Enrichment analyses were 367 performed using the WormBase Enrichment Suite 37 . 368 For all enrichment analyses, a q-value of less than 369 10 −3 was considered statistically significant. For gene 370 ontology enrichment analysis, terms were considered 371 statistically significant only if they also showed an 372 enrichment fold-change greater than 2.