Abstract
Heritable variation in a gene’s expression arises from mutations impacting cis- and trans-acting components of its regulatory network, with expression variation often derived from trans-regulatory mutations within species. Here, we investigate how trans-regulatory mutations are distributed within the genome and within a gene regulatory network by identifying and characterizing 69 mutations with trans-regulatory effects on expression of the same focal gene in Saccharomyces cerevisiae. Relative to 1766 mutations without effects on expression of this focal gene, we found that these trans-regulatory mutations were enriched in coding sequences of transcription factors previously predicted to regulate expression of the focal gene. However, over 90% of the trans-regulatory mutations identified mapped to other types of genes involved in diverse biological processes including chromatin state, metabolism and signal transduction. Finally, we find that the genomic distribution of these new regulatory mutations significantly overlaps with the genomic distribution of trans-regulatory variants segregating within S. cerevisiae.
Introduction
The regulation of gene expression is a complex process, essential for cellular function, that impacts development, physiology, and evolution. Expression of each gene is regulated by its cis-regulatory DNA sequences (e.g., promoters, enhancers) interacting either directly or indirectly with trans-acting factors (e.g. transcription factors, signaling pathways) encoded by genes throughout the genome. Genetic variants affecting both cis- and trans-acting components of regulatory networks contribute to expression differences within and between species (Albert & Kruglyak, 2015; Barbeira et al., 2018; Ferraro et al., 2020; Gamazon et al., 2018; Oliver et al., 2005). This regulatory variation arises the same way as genetic variation affecting any other quantitative trait: new mutations generate variation in gene expression and selection favors the transmission of some genetic variants over others, giving rise to polymorphism within a species and divergence between species. Because new mutations are the raw material for this polymorphism and divergence, knowing how new mutations impact gene expression is essential for understanding how gene regulation evolves (reviewed in Hill et al., 2020). Targetted mutagenesis has been used to systematically examine the effects of individual mutations in cis-regulatory sequences for a variety of elements in a variety of species (Hornung et al., 2012; Kwasnieski et al., 2012; Maricque et al., 2017; Melnikov et al., 2012; Metzger et al., 2015; Patwardhan et al., 2009; Sharon et al., 2012), but such targeted approaches are not well-suited for surveying the effects of new trans-regulatory mutations because trans-regulatory mutations can be located virtually anywhere within the genome. Consequently, we know comparatively little about the genomic sources, molecular mechanisms of action and evolutionary contributions of individual trans-regulatory mutations.
Despite the vast potential target size for trans-regulatory mutations, regions of the genome most likely to harbor mutations affecting a particular gene’s expression might be predictable from knowledge of the regulatory network controlling a gene’s expression. Among eukaryotes, the set of genes and interactions regulating gene expression in trans is perhaps best understood in the baker’s yeast Saccharomyces cerevisiae (Hughes & Boer, 2013): networks of regulatory connections (Teixeira et al., 2018) have been inferred from experiments that profile the transcriptional effects of gene deletions (Hughes et al., 2000; Jackson et al., 2020; Kemmeren et al., 2014), map binding sites for transcription factors (Rhee & Pugh, 2011; Zheng et al., 2010; Zhu et al., 2009), identify protein-protein interactions (Gavin et al., 2002; Liu et al., 2020; Tarassov et al., 2008), and test pairs of genes for genetic interactions (Costanzo et al., 2016; Leeuwen et al., 2016). However, the extent to which the genomic sources of trans-regulatory mutations can be predicted from such networks is generally unknown (Flint & Ideker, 2019). Recently, a chemical mutagen was used to induce mutations throughout the genome of S. cerevisiae, and hundreds of mutant genotypes were collected that all altered expression of the same gene, providing the biological resources needed to systematically characterize properties of new trans-regulatory mutations and to test the predictive power of inferred regulatory networks.
Here, we use genetic mapping, candidate gene sequencing and functional validation to identify 69 trans-regulatory mutations that alter expression of the focal gene from this set of mutants and contrast their properties with a comparable set of 1766 mutations that did not affect expression of the focal gene. Using these data, we determined how these trans-regulatory mutations were distributed within the genome and within regulatory networks. For example, we asked how frequently trans-regulatory mutations were located in coding or non-coding sequences because trans-regulatory variants are often predicted to affect coding sequences (Hill et al., 2020) but some non-coding variants have been shown to be associated with trans-regulatory effects on gene expression (Consortium, 2020; Yao et al., 2017; Yvert et al., 2003). We also asked whether genes encoding transcription factors were the primary source of trans-regulatory variation, which is often assumed (Albert et al., 2018; Lewis et al., 2014) despite case studies identifying trans-regulatory variants in genes encoding proteins with other functions (Lutz et al., 2019; Mehrabian et al., 2005; Schadt et al., 2005; Yvert et al., 2003). To determine how well an inferred regulatory network can predict genomic sources of expression changes, we mapped the trans-regulatory mutations to a network of transcription factors predicted by functional genomic data to regulate expression of the focal gene and examined the molecular functions and biological processes impacted by trans-regulatory mutations that did not map to genes in this network. By systematically examining the properties and identity of new trans-regulatory mutations, this work fills a key gap in our understanding of how expression differences arise and may help predict sources of trans-regulatory variation segregating in natural populations. Indeed, we found that the genomic distribution of new trans-regulatory mutations overlaps significantly with the genomic distribution of trans-regulatory variants segregating among wild isolates of S. cerevisiae that affect expression of the same gene (Metzger & Wittkopp, 2019), suggesting that the mutational process generating new trans-regulatory variation significantly shaped the regulatory variation we see in the wild.
Results and Discussion
Genetic mapping of trans-regulatory mutations
To characterize properties of new trans-regulatory mutations affecting expression of a focal gene, we took advantage of three previously collected sets of haploid mutants that all showed altered expression of the same reporter gene (Figure 1A, Gruber et al., 2012; Metzger et al., 2016). This reporter gene (PTDH3-YFP) encodes a yellow fluorescent protein whose expression is regulated by the S. cerevisiae TDH3 promoter, which natively drives constitutive expression of a glyceraldehyde-3-phosphate dehydrogenase involved in glycolysis and gluconeogenesis (McAlister & Holland, 1985). Mutations in these mutants were caused by exposure to the chemical mutagen ethyl methanesulfonate (EMS), which induces primarily G:C to A:T point mutations randomly throughout the genome (Shiwa et al., 2012). Together, these collections contain ∼1500 mutants isolated irrespective of their fluorescence levels (“unenriched” mutants) and ∼1200 mutants isolated after enriching for cells with the largest changes in fluorescence (Figure 1A). When we started this work, expression level of PTDH3-YFP in these mutant genotypes had been described (Gruber et al., 2012; Metzger et al., 2016), but the specific mutations present within each mutant as well as which mutation(s) alter(s) PTDH3-YFP expression in each genotype were unknown.
From these collections, we selected 82 EMS-treated mutants for genetic mapping to identify individual causal mutations (Figure 1A). Sanger sequencing of the reporter gene in these mutants showed that none had mutations in the TDH3 promoter or any other part of the reporter gene, indicating that they harbored mutations affecting PTDH3-YFP expression in trans. 39 of these mutants were selected based on previously published fluorescence data, with 11 mutants selected from the collections enriched for large effects (red points in Figure 1B,C) and 28 mutants selected from the unenriched collection (red points in Figure 1D). Each selected mutant showed changes in average YFP fluorescence greater than 1% relative to the un-mutagenized progenitor strain. Another 197 mutants from the unenriched collection (blue points in Figure 1D) were subjected to a secondary fluorescence screen, from which an additional 43 mutants with a change in fluorescence greater than 1% (red points in Figure 1E) were chosen. A 1% change in YFP fluorescence has previously been shown to correspond to a ∼3% change in YFP mRNA abundance (see Methods and Duveau et al., 2018), although changes in fluorescence caused by trans-regulatory mutations in these mutants could affect either transcription driven by the TDH3 promoter or post-transcriptional regulation of YFP synthesis or stability.
To identify mutations within the 82 selected EMS mutants, and to determine which of these mutation(s) were most likely to affect YFP expression in each mutant, we performed bulk-segregant analysis followed by whole-genome sequencing (BSA-Seq) as described in Duveau et al. (2014) with minor modifications (see Methods). Briefly, each mutant strain was crossed to a common mapping strain expressing the PTDH3-YFP reporter gene, and large populations of random haploid spores were isolated after inducing meiosis in the resulting diploids (Figure 2A). For each of the 82 segregant populations, a low fluorescent bulk and a high fluorescent bulk of ∼1.5 x 105 cells each were isolated using fluorescence-activated cell sorting (FACS) (Figure 2B). Genomic DNA extracted from each bulk was then sequenced to an average coverage of ∼105x (ranging from 75x to 134x among samples, Supplementary File 1) to identify the mutations present within each mutant genotype and to quantify the frequency of mutant and non-mutant alleles in both bulks (Figure 2C). A mutation causing a change in fluorescence is expected to be found at different frequencies in the two populations of segregant cells. Conversely, a mutation with no effect on fluorescence that is not genetically linked to a mutation affecting fluorescence is expected to be found at similar frequencies in these two populations. Prior work suggested this protocol should have 95% power to identify mutations altering fluorescence by 1% or more (Duveau et al., 2014).
Using a stringent approach for calling sequence variants (see Methods), we identified a total of 1819 mutations (Supplementary File 2), among which 1768 mutations (97.2%) were single nucleotide changes (Figure 2D). Of these single nucleotide changes, 96.3% were one of the two types of point mutations (G:C to A:T transitions) known to be primarily induced by EMS (Shiwa et al., 2012). 48 small indels and 3 aneuploidies, which could have arisen spontaneously or been introduced by EMS, were also identified. Of these 3 mutants with aneuploidies, 2 were found to have an extra copy of chromosome I and 1 was found to have an extra copy of chromosome V based on ∼1.5 fold higher sequencing coverage of these chromosomes relative to the rest of the genome in the BSA-seq data from segregant populations (shown in Supplementary Table 1). We identified an average of 23.9 mutations per strain, which is within the 95% confidence interval of 21 to 45 mutations per strain estimated previously from the frequency of canavanine resistant mutants (Metzger et al., 2016). Surprisingly, the number of mutations per strain did not follow a Poisson distribution: we observed more strains with a number of mutations far from the average than expected for a Poisson process (P-value < 10-5, resampling test; Figure 2 - figure supplement 1), which could be explained by cell-to-cell heterogeneity in DNA repair after exposure to the mutagen (Liu et al., 2019; Uphoff et al., 2016).
At least one mutation was significantly associated with fluorescence in 46 of the mutants analyzed based on likelihood ratio tests (G-tests described in Methods, Supplementary File 2), with a total of 67 mutations associated with fluorescence identified among these mutants, including all 3 aneuploidies (Supplementary Table 1). 29 mutants had a single mutation associated with fluorescence, 13 mutants had two associated mutations, and 4 mutants had three associated mutations. However, 8 of the 13 mutants with two associated mutations and all 4 mutants with three associated mutations showed linkage (genetic distance below 25 cM) between at least two of the mutations associated with fluorescence, suggesting that only one of the linked mutations might impact fluorescence in each of these mutants. To determine whether one linked mutation was more likely to impact fluorescence than the others, we compared the magnitude of allele-frequency difference between the high and low fluorescence pools (estimated by the G-value) for each mutation. For 9 of the 12 mutants with linked mutations, we found that the mutation with the highest G-value was significantly more strongly associated with fluorescence than the linked mutation(s) (resampling test: P < 0.05, Supplementary File 3), suggesting that this mutation was responsible for the fluorescence change. For the other 3 mutants, none of the linked mutations showed stronger evidence of impacting fluorescence than the others (resampling test: P > 0.05, Supplementary File 3).
The remaining 36 mutants did not have any mutations significantly associated with fluorescence (Supplementary File 2). These mutants tended to show smaller changes in fluorescence than mutants with one or more associated mutations (Figure 2 - figure supplement 2), suggesting that our power to map mutations causing 1% changes in fluorescence might have been lower than anticipated. These 36 mutants might also harbor multiple mutations with small effects on expression, each of which was below our detection threshold. Consistent with this possibility, we observed a small but significant correlation (r2 = 0.127, P = 0.03) between the total number of mutations in these 36 EMS mutants and their expression level (Figure 2 - figure supplement 3). It is also possible that we failed to find associated mutations in some of these mutants because their change in fluorescence was initially overestimated by the “winner’s curse” (Xiao & Boehnke, 2009). Accordingly, 71% of mutants selected for mapping after two independent fluorescence screens had at least one mutation significantly associated with fluorescence compared to only 30% of mutants selected after a single fluorescence screen. Some changes in fluorescence observed in these 36 mutants might also have been caused by non-genetic variation and/or undetected mutations.
Additional trans-regulatory mutations identified by sequencing candidate genes
We noticed in the BSA-seq data that three mutations increasing fluorescence more than 5% relative to the un-mutagenized progenitor strain mapped to two genes (ADE4 and ADE5) in the same biochemical pathway (de novo purine biosynthesis) (Supplementary File 2). We therefore used Sanger sequencing to test whether these genes or other genes in this pathway were also mutated in 15 additional EMS mutants with fluorescence at least 5% higher than the progenitor strain. We first looked for mutations in ADE4, then ADE5 if no mutation was found in ADE4, and then ADE6 if no mutation was found in the other genes. At least one nonsynonymous mutation was identified by Sanger sequencing in one of these three genes in 14 of the 15 EMS mutants (green points in Figure 1C,E; Supplementary File 4). For the remaining mutant (brown point in Figure 1E), we sequenced a fourth purine biosynthesis gene, ADE8, but again found no mutation. In two additional EMS mutants with smaller increases in fluorescence (2.1% and 4.6%, purple points in Figure 1D,E) and a reddish color characteristic of ADE2 loss of function mutants (Roman, 1956), we found nonsynonymous mutations in ADE2 by Sanger sequencing (Supplementary File 4). Follow-up experiments showed that mutations in ADE2, ADE5, and ADE6 did not increase YFP fluorescence driven by two other promoters (PRNR1 and PSTM1), suggesting that mutations in the purine biosynthesis pathway affected expression of PTDH3-YFP through mechanisms mediated by the TDH3 promoter rather than YFP (Figure 2 - figure supplement 4). Taken together, these data suggest that genes in the purine biosynthesis pathway might be the predominant mutational source of large increases in TDH3 expression.
Functional testing confirms effects of trans-regulatory mutations identified by genetic mapping and candidate gene sequencing
To determine whether mutations statistically associated with fluorescence in the BSA-seq data actually affected expression of PTDH3-YFP, we introduced 34 of the 67 associated mutations individually into the fluorescent progenitor strain using scarless genetic engineering approaches (Supplementary File 5). We also used scarless genome editing to create single-site mutants for 11 of the 17 additional mutations identified in purine biosynthesis genes by Sanger sequencing (Supplementary File 4, Supplementary File 5). Fluorescence of these engineered strains (called “single-site mutants” hereafter) was then quantified by flow cytometry in parallel with fluorescence of the EMS mutant carrying the same associated mutation as well as the un-mutagenized progenitor strain, with four replicate populations analyzed for each genotype. Fluorescence values were then transformed into estimates of YFP abundance as described in the Methods.
Of the 24 mutations without linked variants in EMS mutants that were tested in single-site mutants, 23 (96%) caused a significant change in expression (P < 0.05, permutation test, Supplementary File 5), suggesting a ∼4% false positive rate in our BSA-Seq experiment. In addition, all 11 single-site mutants with mutations in purine biosynthesis genes identified by Sanger sequencing showed statistically significant effects on fluorescence relative to the un-mutagenized progenitor strain (all increased fluorescence, P < 0.05, permutation test, Supplementary File 5). The remaining 10 mutations tested in single-site mutants were from 5 of the EMS mutants with two linked mutations associated with fluorescence. Each of these mutations was introduced separately into a single-site mutant to independently measure its effect on expression. For 4 of these 5 pairs of linked mutations, only one of the two single-site mutants showed a significant change in expression relative to the progenitor strain (Figure 2E). In each case, the single-site mutant and the EMS mutant showed changes in expression in the same direction relative to the progenitor strain (Figure 2E). The mutation affecting expression was always the mutation with the larger G-value in the BSA-Seq data, consistent with the results of the statistical tests described above (Supplementary File 3). In the last case (YPW54 in Figure 2E), both mutations affected expression in the single-site mutants, consistent with our inability to statistically predict which mutation was more likely to impact expression from the BSA-Seq data for this mutant as well as both mutations being nonsynonymous changes in the same gene (CHD1) (Supplementary File 3). The BSA-seq data also accurately predicted whether a mutation increased or decreased fluorescence for 27 (93%) of the 29 mutations with significant effects on fluorescence in single-site mutants (Figure 2F). For the other two mutations, effects on expression in the same direction were observed in the single-site mutants and the corresponding EMS mutants (Supplementary File 5), suggesting that the different growth conditions used for the mapping experiment (see Methods) might have modified the effects of these mutations.
Comparing PTDH3-YFP expression in the 40 single-site mutants that significantly altered fluorescence to that in the 40 EMS mutants from which these mutations were identified showed that expression was very similar overall between single-site and EMS mutants sharing the same mutation (Figure 2G, linear regression: r2 = 0.944, P = 2.4 x 10-25), although significant differences in expression were observed for some pairs (Figure 2G, Figure 2 - figure supplement 5). These data suggest that (1) the vast majority of the mutations we identified by genetic mapping and candidate gene sequencing do indeed have trans-regulatory effects on expression of PTDH3-YFP and (2) the majority of EMS mutants analyzed had a single mutation that was primarily, if not solely, responsible for the observed change in PTDH3-YFP expression.
Properties of trans-regulatory mutations affecting expression driven by the TDH3 promoter
In all, 69 mutations showed evidence of affecting PTDH3-YFP expression in trans, including 3 aneuploidies and 66 point mutations. 52 of these mutations were identified by genetic mapping (Supplementary File 6) and 17 were identified by sequencing candidate genes (Supplementary File 4). 12 of the mutations identified by genetic mapping were genetically linked to one or more other mutations but showed stronger evidence of affecting PTDH3-YFP expression than the linked mutation(s) in statistical and/or functional tests described above (Supplementary File 3). To identify trends in the properties of these 69 trans-regulatory mutations, we compared them to 1766 mutations considered non-regulatory regarding PTDH3-YFP expression because they showed no significant association with expression of the reporter gene in the BSA-Seq experiment.
First, we asked whether the mutational spectra of trans-regulatory mutations differed from non-regulatory mutations (Figure 3A). We found that G:C to A:T transitions most commonly introduced by EMS occurred at similar frequencies in the two groups (G-test, P = 0.84). No indels were associated with expression in the BSA-seq data (Supplementary File 6), which was not statistically different from the frequency of indels among non-regulatory mutations (0% vs 2.7%, G-test, P = 0.056). By contrast, aneuploidies were highly over-represented in the set of trans-regulatory mutations since all three extra copies of a chromosome observed in the BSA-Seq data were found to be associated with fluorescence (G-test, P = 8.6 x 10-6). We also found a significant difference in the genomic distribution of the two sets of mutations (G-test, P = 2.4 x 10-3), with non-regulatory mutations appearing to be randomly distributed throughout the genome but trans-regulatory mutations enriched on chromosomes VII and XIII (Figure 3B, Figure 3 - figure supplement 1). However, these two chromosomes contain the purine biosynthesis genes in which multiple trans-regulatory mutations were identified, and there was no significant difference in genomic distributions between trans-regulatory and non-regulatory mutations when mutations in purine biosynthesis genes were excluded (G-test, P = 0.35).
Trans-regulatory mutations are often assumed to be located in coding sequences, but they can also be located in non-coding, presumably cis-regulatory, sequences of trans-acting genes (Hill et al., 2020). We therefore asked whether trans-regulatory mutations affecting PTDH3-YFP expression were more often found in coding or non-coding regions of the genome than expected by chance. Of the 1766 non-regulatory mutations, 1257 (71.3%) were coding mutations located in exons, and 506 (28.7%) were non-coding mutations located in intergenic (n = 500) or intronic (n = 6) regions (Figure 3C). This paucity of mutations in introns is consistent with the rarity of introns in S. cerevisiae, and the overall frequency of non-coding mutations (28.7%) is similar to the fraction of the S. cerevisiae genome (30.6% of 12.1 Mb) considered non-coding (www.yeastgenome.org). By contrast, of the 66 trans-regulatory point mutations, only one was located in a non-coding sequence. This non-coding mutation was located in the intergenic sequence between IOC2 and KIN2, presumably affecting expression of one or both genes with a downstream effect on PTDH3-YFP expression. The 3 aneuploidies were excluded from this and subsequent analyses because they affected both coding and non-coding sequences of a large number of genes. The underrepresentation of non-coding changes among regulatory mutations was statistically significant (1.5% of regulatory mutations are non-coding vs 28.4% of non-regulatory mutations; G-test, P = 4.3 x 10-9), suggesting that new mutations affecting PTDH3-YFP expression in trans are more likely to alter coding than non-coding sequences. This enrichment in coding sequences might be because coding sequences tend to have a higher density of functional sites than non-coding sequences.
Finally, we examined how trans-regulatory mutations located in coding sequences impacted the amino acid sequences of the corresponding proteins. Among mutations identified in coding sequences, we found that all trans-regulatory mutations changed the amino acid sequence of proteins whereas only 70% of non-regulatory mutations did (Figure 3D; G-test, P = 1.4 x 10-4). This difference was primarily driven by mutations that introduced stop codons (nonsense mutations) rather than mutations that substituted one amino acid for another (nonsynonymous mutations): 20% of trans-regulatory mutations in coding sequences were nonsense mutations versus 3% of non-regulatory mutations (Figure 3D; G-test, P = 4.8 x 10-6), and 80% of trans-regulatory mutations were nonsynonymous versus 67% of non-regulatory mutations (Figure 3D; G-test, P = 0.07). Nonsense mutations always altered an arginine, glutamine, or tryptophan codon (Figure 3E), consistent with the structure of the genetic code and the types of mutations induced by EMS (figure S8 in Metzger et al., 2016). For nonsynonymous mutations, two types of amino acid changes were particularly enriched among trans-regulatory mutations (Figure 3E; Figure 3 - figure supplement 2): 26.2% of trans-regulatory mutations changed glycine to aspartic acid versus 5.2% of non-regulatory mutations (permutation test, P < 10-4), and 10.8% of trans-regulatory mutations changed glycine to glutamic acid versus 2.7% of non-regulatory mutations (permutation test, P = 0.0042). As a consequence, mutations altering glycine codons were strongly over-represented in general among trans-regulatory mutations (49.2% of trans-regulatory mutations vs 14.5% of non-regulatory mutations in coding sequences; permutation test, P < 10-4), perhaps because glycine is the smallest amino acid, making its substitution likely to modify protein structure (Bhate et al., 2002; Miller, 2007). Indeed, glycine is one of the three amino acids with the lowest experimental exchangeability (Yampolsky & Stoltzfus, 2005) and mutations affecting glycine codons are enriched among mutations causing human diseases (Khan & Vihinen, 2007; Molnár et al., 2016; Vitkup et al., 2003).
Regulatory mutations are enriched in a predicted TDH3 regulatory network
Because of the key role transcription factors play in the regulation of gene expression, and because transcription factors have been shown to be a source of trans-regulatory variation in natural populations (Albert et al., 2018; Lewis et al., 2014), we asked whether trans-regulatory mutations affecting PTDH3-YFP expression were enriched in genes encoding transcription factors. We found that 5 (7.7%) of the 65 trans-regulatory coding mutations mapped to the coding sequence of one of the 212 genes predicted to encode a transcription factor in the YEASTRACT database (Teixeira et al., 2018), but this was not significantly more than the 5.6% of non-regulatory coding mutations mapping to these genes (G-test: P = 0.52). Not all transcription factors are expected to regulate expression of TDH3, however, so we also tested for enrichment of trans-regulatory mutations among transcription factors specifically predicted to regulate TDH3.
Using information consolidated in the YEASTRACT database (Teixeira et al., 2018) that supports evidence of a transcription factor binding to a gene’s promoter and regulating its expression, we constructed a network (Figure 4) of potential direct regulators of TDH3 as well as potential direct regulators of these direct regulators (1st and 2nd level regulators of TDH3) and asked how often the trans-regulatory mutations we identified mapped to these genes. We found that 4 trans-regulatory mutations mapped to three genes in this network, with 2 mutations affecting the 1st level regulator TYE7, 1 mutation affecting the 1st level regulator GCR2, and 1 mutation affecting the 2nd level regulator TUP1 (Supplementary File 6). This number of mutations mapping to genes in the predicted TDH3 regulatory network was 12-fold greater than expected by chance (6% for trans-regulatory vs 0.5% for non-regulatory mutations; G-test, P = 0.0037), thus the inferred regulatory network had predictive power even though the vast majority of trans-regulatory coding mutations (61 of 65, or 94%) mapped to other genes. Only one of these other trans-regulatory mutations mapped to a transcription factor. This mutation was a nonsynonymous substitution affecting ROX1, which is predicted in the YEASTRACT database to directly regulate expression of the indirect TDH3 regulator TUP1. In other words, ROX1 is predicted by existing functional genomic data to be a 3rd level regulator of TDH3 (Figure 4). With no other transcription factors harboring a trans-regulatory mutation in our dataset, this result suggests that mutations in transcription factors located more than three levels away from TDH3 in its transcriptional regulatory network are unlikely to be sources of new expression changes driven by the TDH3 promoter.
Deleterious effects of mutations in two direct regulators of TDH3
Transcription factors encoded by the TYE7 and GCR2 genes found to harbor trans-regulatory mutations affecting expression of PTDH3-YFP are known to regulate the expression of glycolytic genes (including TDH3) by forming a complex with transcription factors encoded by the RAP1 and GCR1 genes (Shively et al., 2019). Rap1p (Yagi et al., 1994) and Gcr1p (Huie et al., 1992) are both known to bind directly to the TDH3 promoter (Figure 5A), and mutations in these binding sites cause large decreases in TDH3 expression (Metzger et al., 2015). These observations strongly suggest that mutations in RAP1 and GCR1 should also cause detectable changes in TDH3 expression, yet no mutations were observed in these genes in our set of trans-regulatory mutations. To investigate why we did not recover trans-regulatory mutations in RAP1 or GCR1, we used error-prone PCR to generate mutant alleles of these genes with mutations in either the promoter or coding sequence of RAP1 or the second exon of GCR1, which includes 99.7% of the GCR1 coding sequence (Figure 5B). Hundreds of these RAP1 and GCR1 mutant alleles were then introduced individually into the un-mutagenized strain carrying the PTDH3-YFP reporter gene using CRISPR/Cas9-guided allelic replacement. Sequencing the mutated regions of RAP1 and GCR1 in a random subset of transformants showed that each strain harbored an average of 1.8 mutations in the RAP1 gene (Figure 5C) or 2.4 mutations in the GCR1 gene (Figure 5D). As expected for PCR-based mutagenesis, the number of mutations per strain appeared to follow a Poisson distribution both for RAP1 mutants (Figure 5C, Chi-square goodness of fit, P = 0.14) and GCR1 mutants (Figure 5D, Chi-square goodness of fit, P = 0.79).
Among the RAP1 mutant strains, only 9.1% (43 of 470 strains) showed a significant change in PTDH3-YFP expression greater than 3% (corresponding to a ∼1% change in fluorescence) relative to the un-mutagenized progenitor strain (Figure 5E), suggesting that most EMS mutants harboring coding mutations in RAP1 would have been excluded from our mapping study. In addition, the strongest decrease in PTDH3-YFP expression observed among RAP1 mutants (17%) was substantially smaller than the strongest decrease in expression caused by mutating the RAP1 binding site in the TDH3 promoter (57.5% reported in Duveau et al., 2018), suggesting that even this most severe phenotype was not caused by a null allele of RAP1. To test this hypothesis, we used site-directed mutagenesis to alter 5 amino acids (one at a time) in Rap1p expected to disrupt DNA binding based on the crystal structure of Rap1p complexed with DNA (Konig et al., 1996). In each case, we obtained by PCR a DNA fragment containing either a synonymous mutation in the codon corresponding to the amino acid (which should not affect the DNA binding of Rap1p) or one of two nonsynonymous mutations, with one nonsynonymous mutation more likely to alter protein function than the other (Yampolsky & Stoltzfus, 2005). We then used CRISPR/Cas9 allele replacement to introduce each mutation into the yeast genome and sequenced 10 independent clones from each transformation to determine if the mutation was introduced in the RAP1 coding sequence as intended. All five synonymous mutations were observed in several of the clones sequenced, but 7 of the 10 nonsynonymous mutations were never recovered (Supplementary File 7). This outcome suggests that nonsynonymous mutations altering the DNA binding of Rap1p are lethal or nearly lethal, making them unlikely to have been recovered in a mutagenesis screen. Indeed, Rap1p1 is known to be an essential, pleiotropic transcription factor playing critical roles in regulating expression of glycolytic genes like TDH3 as well as ribosomal proteins and genes required for mating (reviewed in Piña et al., 2003). Taken together, these data indicate that RAP1 mutations are unlikely to be common sources of variation in expression driven by the TDH3 promoter.
For the GCR1 mutant strains, 37.7% showed a significant change in PTDH3-YFP expression greater than 3% relative to the un-mutagenized progenitor strain (Figure 5F). Several of these mutant alleles decreased the expression driven by the TDH3 promoter by ∼80%, which is similar to the previously reported effects of mutations in the Gcr1p binding sites of the TDH3 promoter (Metzger et al., 2015), suggesting that they were null alleles. Indeed, resequencing these large effect alleles revealed that one of them had a single nucleotide insertion in the 28th codon of the GCR1 ORF, which led to a frame shift eliminating 96% of amino acids (757 of 785) from Gcr1p. Because Gcr1p regulates expression of many glycolytic genes (Uemura et al., 1997) and GCR1 deletion has been reported to cause severe growth defects in fermentable carbon source environments (Clifton et al., 1978; Hossain et al., 2016; López & Baker, 2000), we hypothesized that the fitness effects of mutations in GCR1 might also have caused them to be underrepresented in the population from which the EMS mutants analyzed were derived. To test this hypothesis, we measured the relative fitness of 62 of the 220 GCR1 mutants, including all mutants with decreased PTDH3-YFP expression. GCR1 mutants causing the largest changes in PTDH3-YFP expression showed strong defects in growth rate; however, several GCR1 mutants with changes in PTDH3-YFP expression greater than 3% did not strongly affect fitness (Figure 5G). This observation suggests that some of the coding mutations in GCR1 decreasing PTDH3-YFP expression could have been sampled among the EMS mutants used for mapping. We therefore conclude that mutations in GCR1 were most likely not recovered in our set of regulatory mutations because of the wide diversity of mutations that can affect TDH3 expression and the limited number of EMS mutants included in the mapping experiment.
Properties of genes harboring regulatory mutations
With only 5 of the 65 trans-regulatory point mutations in coding sequences mapping to transcription factors, we used gene ontology (GO) analysis to examine the types of genes harboring trans-regulatory mutations affecting PTDH3-YFP expression more systematically. In all, these 65 mutations mapped to 42 different genes, with 9 genes affected by more than one mutation, 4 of which were genes involved in the de novo purine biosynthesis pathway (Figure 6A). Several gene ontology terms were significantly enriched among genes affected by trans-regulatory mutations relative to genes affected by non-regulatory mutations. Supplementary File 8 includes all enriched GO terms, whereas Figure 6B only includes enriched GO terms that are not parent to other GO terms in the GO hierarchy. Of the 33 GO terms enriched for trans-regulatory mutations shown in Figure 6B, 11 terms (including 13 of the 42 genes with trans-regulatory mutations) were related to chromatin structure (Figure 6B), which is known to play an important role in the regulation of gene expression (Li et al., 2007). An additional 5 GO terms (including 6 genes with trans-regulatory mutations) were related to metabolism, and 4 terms (including 9 genes with trans-regulatory mutations) were related to transcriptional regulation (Figure 6B). Three GO terms related to glucose signaling, including regulation of transcription by glucose, carbohydrate transmembrane transport and glucose metabolic process, were also significantly enriched for genes affected by trans-regulatory mutations (Figure 6B). When we broadened this category of genes based on a review of glucose signaling (Santangelo, 2006), the enrichment included 5 genes implicated in glucose signaling (Supplementary File 9; 12.2% of genes affected by trans-regulatory mutations were involved in glucose signaling vs 2.7% of genes affected by non-regulatory mutations; Fisher’s exact test: P = 6.2 x 10-3).
At the pathway level, we found that genes involved in glycolysis and de novo purine biosynthesis were also significantly enriched for trans-regulatory mutations (Figure 6B), with the latter driven by the mutations in ADE2, ADE4, ADE5 and ADE6 genes described above (Supplementary File 10). Genes involved in iron homeostasis also emerged as an over-represented group, with 5 GO terms (including 7 genes) being related to the regulation of intracellular iron concentration (Figure 6B). Diverse cellular processes implicated in iron homeostasis were represented among genes harboring trans-regulatory mutations, such as iron transport (FTR1, CCC2), iron trafficking and maturation of iron-sulfur proteins (CIA2, NAR1), transcriptional regulation of the iron regulon (FRA1) and post-transcriptional regulation of iron homeostasis (TIS11). Remarkably, nearly half of all trans-regulatory point mutations in coding sequences (31 of 65) were located in genes involved either in purine biosynthesis or iron homeostasis. Moreover, 6 of the 8 genes harboring more than one trans-regulatory mutation (Figure 6A) were involved in one of these two processes. Mutations in purine biosynthesis genes tended to cause large increases in expression, whereas mutations in iron homeostasis genes tended to cause large decreases in expression (Supplementary File 10). Although the mechanistic relationship between these pathways and TDH3 expression is not known, changing cellular conditions, including concentrations of metabolites (Pinson et al., 2009) or iron within the cell (reviewed in Outten & Albetel, 2013), can affect the regulation of gene expression. Ultimately, our data suggest that although mutations affecting PTDH3-YFP expression map to genes with diverse functions, genes involved in a small number of well-defined biological processes are particularly likely to harbor such trans-regulatory mutations.
Trans-regulatory mutations are enriched in genomic regions harboring natural variation affecting TDH3 expression
Because new mutations affecting gene expression provide the raw material for regulatory variation segregating within a species, we asked whether the trans-regulatory mutations we observed were enriched in genomic regions associated with naturally occurring trans-regulatory variation affecting expression driven by the TDH3 promoter. Specifically, we compared the genomic locations of trans-regulatory mutations identified in the current study to the locations of ∼100 trans-acting quantitative trait loci (QTL) affecting expression of PTDH3-YFP identified from crosses between the progenitor strain of the EMS mutants (BY) and 3 other S. cerevisiae strains (SK1, YPS1000, M22) (Metzger & Wittkopp, 2019) (Figure 7A).
Non-regulatory mutations were observed in eQTL regions as often as expected by chance (66.7% of non-regulatory mutations vs 65.1% of the whole genome in eQTL regions; G-test: P = 0.15), but the 66 trans-regulatory mutations were significantly enriched in eQTL regions (Figure 7B; 88% of trans-regulatory mutations vs 66.7% of non-regulatory mutations in eQTL regions; G-test: P = 9.6 x 10-5). The overrepresentation of trans-regulatory mutations in eQTL regions remained statistically significant when we considered only the 44 trans-regulatory mutations identified from the collection of EMS mutants not enriched for large effects (Figure 7B; G-test: P = 0.027) or only the 22 trans-regulatory mutations identified from EMS mutants enriched for large effects (Figure 7B; G-test: P = 3.3 x 10-5). The enrichment of trans-regulatory mutations in eQTL regions was thus not only driven by the effect size of these mutations or by the fact that several of the trans-regulatory mutations with large effects were located in the same genes. When we considered eQTL regions identified from each cross separately, we observed a significant enrichment of trans-regulatory mutations in eQTL regions identified in SK1 x BY and YPS1000 x BY crosses, but not in eQTL regions identified in the M22 x BY cross (Figure 7B; G-tests: P = 0.016 for SK1 x BY, P = 6.5 x 10-3 for YPS1000 x BY, P = 0.70 for M22 x BY). This pattern might be explained by the close genetic relatedness between BY and M22 (Metzger & Wittkopp, 2019) or by the specific ecological niche of M22 isolated from an Italian vineyard (Capece et al., 2012). Overall, the enrichment of trans-regulatory mutations in eQTL regions suggests that biases in the mutational sources of regulatory variation have shaped genetic sources of expression variation segregating in wild populations.
Conclusions
By systematically isolating and characterizing 69 trans-regulatory mutations that all affect expression of the same focal gene, this study reveals how trans-regulatory mutations are distributed within a genome and within a regulatory network. For example, we found that these trans-regulatory mutations were widely spread throughout the genome, with all except one located in coding sequences. These data also allowed us to determine how well a regulatory network inferred from integrating functional genomic and genetic data can predict sources of trans-regulatory variation. Like many biological networks, transcriptional regulatory networks have been inferred with the promise of explaining relationships between genetic variants and the higher order trait of gene expression, but the predictive power of such networks remains sparsely tested (Flint & Ideker, 2019).
We found that although the trans-regulatory mutations in coding regions were not enriched in transcription factors generally, they were overrepresented among transcription factors inferred to be regulators of TDH3. None of these transcription factors are known to directly bind to the TDH3 promoter, however, and mutations in RAP1 and GCR1, which have well characterized binding sites in the TDH3 promoter, were notably missing from our set of trans-regulatory mutations affecting PTDH3-YFP expression. Targeted mutagenesis of RAP1 and GCR1 suggested that most mutations in these genes (particularly RAP1) cause severe growth defects that might have prevented their recovery in mutagenesis screens. Over 90% of the trans-regulatory mutations examined were located in genes outside of this network encoding proteins with diverse molecular functions involved in chromatin remodeling, nonsense-mediated mRNA decay, translation regulation, purine biosynthesis, iron homeostasis, and glucose sensing. Surprisingly, nearly half of the trans-regulatory mutations mapped to genes involved in either the purine biosynthesis or iron homeostasis pathways. Although not anticipated, finding so many trans-regulatory mutations in genes that are not transcription factors is consistent with the transcriptomic effects of gene deletions showing that transcription factors tend not to affect expression of more genes than other types of proteins (Featherstone & Broadie, 2002).
Consequently, it seems that regulatory networks describing the relationships between transcription factors and target genes might capture only a small fraction of the potential sources of trans-regulatory variation. Understanding the properties of trans-regulatory mutations is important because these mutations provide the raw material for natural trans-regulatory variation. We found that mutations affecting PTDH3-YFP expression were enriched in genomic regions associated with expression variation among wild isolates of S. cerevisiae, suggesting that mutational sources of regulatory variation have helped shape the sources of genetic variation affecting gene expression segregating in natural populations. Differences in the genomic distribution of new regulatory mutations and polymorphisms are presumably due to natural selection, which influences the evolutionary fate of new regulatory mutations based on their fitness consequences. The fitness consequences of trans-regulatory mutations include not only changes in growth rate caused by altering expression of the focal gene, but also their pleiotropic effects on activity of other genes. Ultimately, explaining the variation in gene expression we see in natural populations will require studies like this elucidating the mutational input as well as studies describing the fitness and pleiotropic effects of these mutations in native environments.
Materials and methods
Mutant strains selected for mapping
To identify mutations associated with expression changes, we selected 82 haploid mutant strains for bulk segregant analysis (Figure 1A) from three collections of mutants obtained in Gruber et al. (2012) and Metzger et al. (2016) via ethyl methanesulfonate (EMS) mutagenesis of two progenitor strains expressing a YFP reporter gene (Yellow Fluorescent Protein) under control of the TDH3 promoter (PTDH3-YFP). 71 mutants were selected from a collection of 1498 lines founded from cells isolated randomly (unenriched) after mutagenesis in Metzger et al. (2016), 5 mutants were selected from 211 lines founded from cells enriched for fluorescence changes after mutagenesis in Metzger et al. (2016) and the last 6 mutants were selected from 1064 lines founded from cells enriched for fluorescence changes in Gruber et al. (2012). Mutants from Metzger et al. (2016) were obtained by mutagenesis of the progenitor strain YPW1139 (MATα ura3d0), while mutants from Gruber et al. (2012) were obtained by mutagenesis of the progenitor strain YPW1 (MATa ura3d0 lys2d0). Both progenitors were derived from S288c genetic background (see Metzger et al. 2016 and Gruber et al. 2012 for details on construction of YPW1139 and YPW1 strains). In YPW1139, PTDH3-YFP is inserted at the ho locus with a KanMX drug resistance marker. In YPW1, PTDH3-YFP is inserted at position 199270 on chromosome I near a pseudogene. YPW1139 harbors RME1(ins-308A) and TAO3(1493Q) alleles (Deutschbauer & Davis, 2005) that increase sporulation frequency relative to YPW1 alleles, as well as SAL1, CAT5 and MIP1 alleles that decrease the frequency of the petite phenotype (Dimitrov et al., 2009). We previously showed that the few genetic differences between YPW1 and YPW1139 did not affect the magnitude of effects of TDH3 promoter mutations on fluorescence (Metzger et al., 2016). Fluorescence levels of the three collections were measured in Gruber et al. (2012) and in Metzger et al. (2016). From these data, we selected 39 mutants for BSA-Seq that showed statistically significant fluorescence changes greater than 1% relative to the progenitor strain. Among these mutants, 6 were selected from the Gruber et al. (2012) collection (Z-score > 2.58, P < 0.01), 5 were selected from mutants enriched for large effects in Metzger et al. (2016) (permutation test, P < 0.05) and 28 were selected from unenriched mutants in Metzger et al. (2016) (permutation test, P < 0.05). The remaining 43 mutants included in BSA-Seq experiments were selected from mutants in Metzger et al. (2016) for which we collected new fluorescence measures using flow cytometry. This second fluorescence screen included 197 lines from the unenriched collection that were chosen because they showed statistically significant fluorescence changes (permutation test, P < 0.05) greater than 1% relative to the progenitor strain in the initial screen published in Metzger et al. (2016). The 43 mutants selected from this 2nd screen showed statistically significant fluorescence changes (permutation test, P < 0.05) greater than 1% relative to the progenitor strain.
Measuring YFP expression by flow cytometry
Fluorescence levels of mutant strains were quantified by flow cytometry using the same approach as described in Metzger et al. (2016) and Duveau et al. (2018). For assays involving strains stored in individual tubes at -80°C, all strains were thawed in parallel on YPG plates (10 g yeast extract, 20 g peptone, 50 ml glycerol, 20 g agar per liter) and grown for 2 days at 30°C. Strains were then arrayed using pipette tips in 96 deep well plates containing 0.5 ml of YPD medium (10 g yeast extract, 20 g peptone, 20 g D-glucose per liter) per well at positions defined in Supplementary File 11. The reference strain YPW1139 was inoculated at 20 fixed positions on each plate to correct for plate and position effects on fluorescence. The non-fluorescent strain YPW978 was inoculated in one well per plate to quantify the autofluorescence of yeast cells. Plates were incubated at 30°C for 20 hours with 250 rpm orbital shaking (each well contained a sterile 3 mm glass bead to maintain cells in suspension). Samples from each plate were then transferred to omnitrays containing YPG-agar using a V & P Scientific pin tool. For assays involving strains already arrayed in 96-well plates at -80°C (i.e. RAP1 and GCR1 mutants), strains were directly transferred on YPG omnitrays after thawing. After 48 hours of incubation at 30°C, samples from each omnitray were inoculated using the pin tool in four replicate 96-well plates containing 0.5 ml of YPD per well and cultivated at 30°C with 250 rpm shaking for 22 hours. Then, 15 µl of cell cultures were transferred to a 96-well plate with 0.5 ml of PBS per well (phosphate-buffered saline) and samples were immediately analyzed on a BD Accuri C6 flow cytometer connected to a HyperCyt autosampler (IntelliCyt Corp). A 488 nm laser was used for excitation and the YFP signal was acquired with a 530/30 optical filter. Each well was sampled for 2 seconds, yielding fluorescence and cell size measurements for at least 5000 events per well. Flow cytometry data were analyzed using custom R scripts (Supplementary File 12) as described in Duveau et al. (2018). First, events that did not correspond to single cells were filtered out using flowClust clustering functions. Second, fluorescence intensity was scaled by cell size in several steps. For Figures 1B-D, these values of fluorescence relative to cell size were directly used for subsequent steps of the analysis. For other figures, these values were transformed using a log-linear function to be linearly related with YFP abundance. Transformations of fluorescence values were performed using the relationship between fluorescence levels and YFP mRNA levels established in Duveau et al. (2018) from five strains carrying mutations in the promoter of the PTDH3-YFP reporter gene. The YFP mRNA levels quantified in these five strains are expected to be linearly related with YFP protein abundance based on a previous study that compared mRNA and protein levels for a similar fluorescent protein (GFP) across a broad range of expression levels (Kafri et al., 2016). For this reason and because mutations recovered in this study may alter YFP expression at the post-transcriptional level, the transformed values of fluorescence were considered to provide estimates of YFP abundance instead of mRNA levels. The median expression among all cells of each sample was then corrected to account for positional effects estimated from a linear model applied to the median expression of the 20 control samples on each plate. To correct for autofluorescence, the mean of median expression measured among all replicate populations of the non-fluorescent strain was then subtracted from the median expression of each sample. Finally, a relative measure of expression was calculated by dividing the median expression of each sample by the mean of the median expression among replicates of the reference strain. Figures show the mean relative expression among the four replicate populations of each genotype. Permutation tests used to compare the expression level of each single site mutant to the expression level of the EMS mutant carrying the same mutation are described in the legend of Figure 2 - figure supplement 5A.
Two-level permutation tests
We developed a permutation-based approach to determine which EMS mutant strains from Metzger et al. (2016) showed a significant change in YFP expression relative to their progenitor strain. This permutation approach was motivated by the fact that Student tests and Mann-Whitney-Wilcoxon tests applied to these data appeared to be overpowered. Indeed, the flow cytometry assay from Metzger et al. (2016) included 146 instances of the progenitor strain YPW1139 that were placed at random plate positions and with fluorescence measured in four replicate populations for each position. When comparing the mean expression of the four replicate populations of YPW1139 grown at a given plate position to the mean expression of all other replicate populations of YPW1139, the P-value was below 0.05 in 25.3% of cases when using Student tests and in 13.7% of cases when using Mann-Whitney-Wilcoxon tests. The fact that more than 5% of P-values were below 0.05 indicated that the tests were overpowered, which was because expression differences between YPW1139 populations grown at different plate positions were in average larger than expression differences between replicate populations grown at the same position. For this reason, we compared the expression of each mutant strain to the expression of the 146 x 4 populations of the YPW1139 progenitor strain using permutation tests with two levels of resampling as described below. In these tests, we compared 10,000 times the expression levels of each tested strain measured in quadruplicates to the expression levels of YPW1139 measured in quadruplicates at a randomly selected plate position among the 146 available positions (a new position was picked at each iteration). For each iteration of the comparison, we calculated the difference D between 1) the absolute difference observed between the mean expression of the tested strain and the mean expression of YPW1139 and 2) a randomized absolute difference of mean expression between two sets of 4 expression values obtained by random permutation of the 4 expression values measured for the tested strain and of the 4 expression values measured for YPW1139 at the selected plate position. Finally, for each tested strain the proportion of D values that were negative (after excluding D values equal to zero) corresponded to the P-value of the permutation test. When we applied this test to YPW1139 as a tested strain, we found that the P-value was below 0.05 for 6.1% of the 146 plate positions containing YPW1139, indicating that the permutation test was not overpowered.
BSA-Seq procedure
To identify mutations associated with fluorescence levels in EMS-treated mutants, we used bulk-segregant analysis followed by Illumina sequencing (BSA-Seq). BSA-Seq data corresponding to the 6 mutants from Gruber et al. (2012) were collected together with the BSA-Seq dataset published in Duveau et al. (2014). For the other 76 mutants (from Metzger et al., 2016), BSA-Seq data were collected in this study in several batches (see Supplementary File 13) using the experimental approach described in Duveau et al. (2014) (with few modifications). First, each EMS-treated mutant (MATα ura3d0 ho::PTDH3-YFP ho::KanMX) was crossed to the mapping strain YPW1240 (MATa ura3d0 ho::PTDH3-YFP ho::NatMX4 mata2::yEmRFP-HygMX) that contained the FASTER MT system from Chin et al. (2012) used to tag diploid and MATa cells with a fluorescent reporter. Crosses were performed on YPD agar plates and replica-plated on YPD + G418 + Nat medium (YPD agar with 350 mg/L geneticin (G418) and 100 mg/L Nourseothricin) to select diploid hybrids. After growth, cells were streaked on another YPD + G418 + Nat agar plate, one colony was patched on YPG agar for each mutant and the diploid strain was kept frozen at -80°C. Bulk segregant populations were then collected for batches of 8 mutants in parallel as follows. Diploid strains were thawed and revived on YPG plates, grown for 12 hours at 30°C on GNA plates (50 g D-glucose, 30 g Difco nutrient broth, 10 g yeast extract and 20 g agar per liter) and sporulation was induced for 4 days at room temperature on KAc plates (10 g potassium acetate and 20 g agar per liter). For each mutant, we then isolated a large population of random spores (> 108 spores) by digesting tetrads with zymolyase, vortexing and sonicating samples in 0.02% triton-X (exactly as described in Duveau et al., 2014). ∼3 x 105 MATα spores were sorted by FACS (BD FACSAria II) based on the absence of RFP fluorescence signal measured using a 561 nm laser and 582/15 optical filter. Spores were then resuspended in 2 ml of YPD medium. After 24 hours of growth at 30°C, 0.4 ml of cell culture was transferred to a 5 ml tube containing 2 ml of PBS. Three populations of 1.5 x 105 segregant cells were then collected by FACS: 1) a low fluorescence population of cells sorted among the 2.5% of cells with lowest fluorescence levels (“low bulk”), 2) a high fluorescence population of cells sorted among the 2.5% of cells with highest fluorescence levels (“high bulk”) and 3) a control population of cells sorted regardless of their fluorescence levels. YFP signal was measured using a 488 nm laser and a 530/30 optical filter. To exclude budding cells and enrich for single cells, ∼70% of all events were filtered out based on the area and width of the forward scatter signal prior to sorting. In addition, the median FSC.A (area of forward scatter, a proxy for cell size) was maintained to similar values in the low fluorescence bulk and in the high fluorescence bulk by drawing sorting gates that were parallel to the linear relationship between FSC.A and fluorescence intensity in the FACSDiva software. After sorting, cells were resuspended in 1.6 ml of YPD medium and grown for 30 hours at 30°C. Each sample was then stored at -80°C in 15% glycerol in two separate tubes: one tube containing 1 ml of culture (for DNA extraction) and one tube containing 0.5 ml of culture (for long-term storage). Extraction of genomic DNA was performed for 24 samples in parallel using a Gentra Puregene Yeast/Bact kit (Qiagen). Then, DNA libraries were prepared from 1 ng of genomic DNA using Nextera XT DNA Library Prep kits (Illumina) for low fluorescence bulks and for high fluorescence bulks (control populations were not sequenced). Tagmentation was carried out at 55°C for 5 minutes. Dual indexing of the libraries was achieved using index adapters provided in the Nextera XT Index kit (index sequences used for each library are indicated in Supplementary File 14). Final library purification and size selection was achieved using Agencourt AMPure XP beads (30 µl of beads added to 50 µl of PCR-amplified libraries followed by ethanol washes and resuspension in 50 µl of Tris-EDTA buffer). The average size of DNA fragments in the final libraries was 650 bp, as quantified from a subset of samples using high sensitivity assays on a 2100 Bioanalyzer (Agilent). The concentration of all libraries was quantified with a Qubit 2.0 Fluorometer (Thermo Fisher Scientific) using dsDNA high sensitivity assays. Libraries to be sequenced in the same flow lane were pooled to equal concentration in a single tube and sequenced on a HiSeq4000 instrument (Illumina) at the University of Michigan Sequencing Core Facility (150-bp paired-end sequencing). The 2 x 76 libraries were sequenced in 4 distinct sequencing runs (45300, 45301, 54374 and 54375) that included 36 to 54 samples (libraries sequenced in each run are indicated in Supplementary File 14). In addition, 4 control libraries were sequenced in run 45300, corresponding to genomic DNA from 1) YPW1139 progenitor strain, 2) YPW1240 mapping strain, 3) a bulk of low fluorescence segregants from YPW1139 x YPW1240 cross and 4) a bulk of high fluorescence segregants from YPW1139 x YPW1240 cross. 18 libraries sequenced in run 54374 were not analyzed in this study.
Analysis of BSA-Seq data
Demultiplexing of sequencing reads and generation of FASTQ files were performed using Illumina bcl2fastq v1.8.4 for sequencing runs 45300 and 45301 and bcl2fastq2 v2.17 for runs 54374 and 54375. The next steps of the analysis were processed on the Flux cluster administered by the Advanced Research Computing Technology Services of the University of Michigan (script available in Supplementary File 15). First, low quality ends of reads were trimmed with sickle (https://github.com/najoshi/sickle) and adapter sequences were removed with cutadapt (Martin, 2011). Reads were then aligned to the S288c reference genome (https://www.yeastgenome.org/, R64-1-1 release to which we added the sequences corresponding to PTDH3-YFP, KanMX and NatMX4 transgenes, available in Supplementary File 11) using bowtie2 (Langmead & Salzberg, 2012) and overlaps between paired reads were clipped using clipOverlap in bamUtil (https://github.com/statgen/bamUtil). The sequencing depth at each position in the genome was determined using bedtools genomecov (https://github.com/arq5x/bedtools2). For variant calling, BAM files corresponding to the low fluorescence bulk and to the high fluorescence bulk of each mutant were processed together using freebayes (https://github.com/ekg/freebayes; Garrison & Marth, 2012) with options -- pooled-discrete --pooled-continuous. That way, sequencing data from both bulks were pooled to increase the sensitivity of variant calling and allele counts were reported separately for each bulk. To obtain a list of mutations present in each mutant strain, false positive calls in the VCF files generated by freebayes were then filtered out with the Bioconductor package VariantAnnotation in R (Supplementary File 16). Filtering was based on the values of several parameters such as quality of genotype inference (QUAL > 200), mapping quality (MQM > 27), sequencing depth (DP > 20), counts of reference and alternate alleles (AO > 3 and RO > 3), frequency of the reference allele (FREQ.REF > 0.1), proportion of reference and alternate alleles supported by properly paired reads (PAIRED > 0.8 and PAIREDR > 0.8), probability to observe the alternate allele on both strands (SAP < 100) and at different positions of the reads (EPP < 50 and RPP < 50). The values of these parameters were chosen to filter out a maximum number of calls while retaining 28 variants previously confirmed by Sanger sequencing. We then used likelihood ratio tests (G-tests) in R to determine for each variant site whether the frequency of the alternate allele (i.e. the mutation) was statistically different between the low fluorescence bulk and the high fluorescence bulk (Supplementary File 16). A point mutation was considered to be associated with fluorescence (directly or by linkage) if the P-value of the G-test was below 0.001, corresponding to a G value above 10.828. Since this G-test was performed for a total of 1819 mutations, we expected that 1.82 mutations would be associated with fluorescence due to type I error (false positives) at a P-value threshold of 0.001. This expected number of false positives was considered acceptable since it represented only 2.7% of all mutations that were associated with fluorescence. To determine if an aneuploidy was associated with fluorescence level, we compared the sequencing coverage of the aneuploid chromosome to genome-wide sequencing coverage in the low and high fluorescence bulks using G-tests. The G statistics was computed from the number of reads mapping to the aneuploid chromosome and the number of reads mapping to the rest of the genome in the low and high fluorescence bulks. Aneuploidies with G > 10.828, which corresponds to P-value < 0.001, were considered to be present at statistically different frequencies in both bulks. A custom R script was used to annotate all mutations identified in BSA-Seq data (Supplementary File 17), retrieving information about the location of mutations in intergenic, intronic or exonic regions, the name of genes affected by coding mutations or the name of neighboring genes in case of intergenic mutations and the expected impact on amino acid sequences (synonymous, nonsynonymous or nonsense mutation and identity of the new amino acid in case of a nonsynonymous mutation).
Sanger sequencing of candidate genes
As an alternative approach to BSA-Seq, additional mutations were identified by directly sequencing candidate genes in a subset of EMS-treated mutants (Supplementary File 4). More specifically, we sequenced the PTDH3-YFP transgene in 95 mutant strains from Metzger et al. (2016) that showed decreased fluorescence by more than 10% relative to the progenitor strain. We sequenced the ADE4 coding sequence in 14 mutants from Metzger et al. (2016) that were not included in the BSA-Seq assays and that showed increased fluorescence by more than 5% relative to the progenitor strain. Two of the sequenced mutants had a mutation in the ADE4 coding sequence. We then sequenced the ADE5 coding sequence in the remaining 12 mutants and found a mutation in five of the sequenced mutants. We continued by sequencing the ADE6 coding sequence in the remaining seven mutants. Five of the sequenced mutants had a single mutation and one mutant had two mutations in the ADE6 coding sequence. We sequenced the ADE8 coding sequence in the last mutant but we found no candidate mutation in this mutant. Finally, we sequenced the ADE2 coding sequence in two mutants that showed a reddish color when growing on YPD plates. For all genes, the sequenced region was amplified by PCR from cell lysates, PCR products were cleaned up using Exo-AP treatment (7.5 µl PCR product mixed with 0.5 µl Exonuclease-I (NEB), 0.5 µl Antarctic Phosphatase (NEB), 1 µl Antarctic Phosphastase buffer and 0.5 µl H2O incubated at 37°C for 15 minutes followed by 80°C for 15 minutes) and Sanger sequencing was performed by the University of Michigan Sequencing Core Facility. Oligonucleotides used for PCR amplification and sequencing are indicated in Supplementary File 14.
Site-directed mutagenesis
34 mutations identified by BSA-Seq and 11 mutations identified by sequencing candidate genes were introduced individually in the genome of the progenitor strain YPW1139 to quantify the effect of these mutations on fluorescence level. “Scarless” genome editing (i.e. without insertion of a selection marker) was achieved using either the delitto perfetto approach from Stuckey et al. (2011) (for 19 mutations) or CRISPR-Cas9 approaches derived from Laughery et al. (2015) (for 26 mutations). Compared to delitto perfetto, CRISPR-Cas9 is more efficient and it can be used to introduce mutations in essential genes. However, it requires specific sequences in the vicinity of the target mutation (see below). The technique used for the insertion of each mutation is indicated in Supplementary File 18. The sequences of oligonucleotides used for the insertion and the validation of each mutation can be found in Supplementary File 14.
In the delitto perfetto approach, the target site was first replaced by a cassette containing the Ura3 and hphMX4 selection markers (pop-in) and then this cassette was swapped with the target mutation (pop-out). The Ura3-hphMX4 cassette was amplified from pCORE-UH plasmid using two oligonucleotides that contained at their 5’ end 20 nucleotides for PCR priming in pCORE-UH and at their 3’ end 40 nucleotides corresponding to the sequences flanking the target site in the yeast genome (for homologous recombination). The amplicon was transformed into YPW1139 cells using a classic LiAc/polyethylene glycol heat shock protocol (Gietz & Schiestl, 2007). Cells were then plated on synthetic complete medium lacking uracil (SC-Ura) and incubated for two days at 30°C. Colonies were replica-plated on YPD + Hygromycin B (300 mg/l) plates. A dozen [Ura+ Hyg+] colonies were streaked on SC-Ura plates to remove residual parental cells and the resulting colonies were patched on YPG plates to counterselect petite cells. Cell patches were then screened by PCR to confirm the proper insertion of Ura3-hphMX4 at the target site. One positive clone was grown in YPD and stored at -80°C in 15% glycerol. For the pop-out step, a genomic region of ∼240 bp centered on the mutation was amplified from the EMS-treated mutant containing the desired mutation. The amplicon was transformed into the strain with Ura3-hphMX4 inserted at the target site. Cells were plated on synthetic complete medium containing 0.9 g/l of 5-fluoroorotic acid (SC + 5-FOA) to counterselect cells expressing Ura3. After growth, a dozen [Ura-] colonies were streaked on SC + 5-FOA plates and one colony from each streak was patched on a YPG plate. Cell patches were screened by PCR using oligonucleotides that flanked the sequence of the transformed region and amplicons of expected size (∼350 bp) were sequenced to confirm the insertion of the desired mutation and the absence of PCR-induced mutations. When possible two independent clones were stored at - 80°C in 15% glycerol, but in some cases only one positive clone could be retrieved and stored.
A “one-step” CRISPR-Cas9 approach was used to insert mutations impairing a NGG or CCN motif in the genome (22 mutations), which corresponds to the protospacer adjacent motif (PAM) targeted by Cas9. First, a DNA fragment containing the 20 bp sequence upstream of the target PAM in the yeast genome was cloned between SwaI and BclI restriction sites in the pML104 plasmid. This DNA fragment was obtained by hybridizing two oligonucleotides designed as described in Laughery et al. 2015. The resulting plasmid contained cassettes for expression of Ura3, Cas9 and a guide RNA targeted to the mutation site in yeast cells. In parallel, a repair fragment containing the mutation was obtained either by PCR amplification of a ∼240 bp genomic region centered on the mutation in the EMS-treated mutant or by hybridization of two complementary 70 mer oligonucleotides containing the mutation and its flanking genomic sequences. The Cas9/sgRNA plasmid and the repair fragments were transformed together (∼150 nmol of plasmid + 20 µmol of repair fragment) into the progenitor strain YPW1139 using LiAc/polyethylene glycol heat shock protocol (Gietz & Schiestl, 2007). Cells were then plated on SC-Ura medium and incubated at 30°C for 48 hours. This medium selected cells that both internalized the plasmid and integrated the desired mutation in their genome. Indeed, cells with the Cas9/sgRNA plasmid stop growing as long as their genomic DNA is cleaved by Cas9 but their growth can resume once the PAM sequence is impaired by the mutation, which is integrated into the genome via homologous recombination with the repair fragment (Laughery et al., 2015). A dozen [Ura+] colonies were then streaked on SC-Ura plates and one colony from each streak was patched on a YPG plate. Cell patches were screened by PCR using oligonucleotides that flanked the mutation site and amplicons of expected size (∼350 bp) were sequenced to confirm the insertion of the desired mutation and the absence of secondary mutations. Then, one or two positive clones were patched on SC + 5-FOA to counterselect the Cas9/sgRNA plasmid, grown in YPD and stored at -80°C in 15% glycerol.
A “two-steps” CRISPR-Cas9 approach was used to insert mutations located near but outside a PAM sequence (4 mutations). Each step was performed as described above for the “one-step” CRISPR-Cas9 approach. In the first step, Cas9 was targeted by the sgRNA to a PAM sequence (the initial PAM) located close to the mutation site (up to 20 bp). The repair fragment contained two synonymous mutations that were not the target mutation: one mutation that impaired the initial PAM and one mutation that introduced a new PAM as close as possible to the target site. This repair fragment was obtained by hybridization of two complementary 90 mer oligonucleotides and transformed into YPW1139. In the second step, Cas9 was targeted to the new PAM. The repair fragment contained three mutations: two mutations that reverted the mutations introduced in the first step and the target mutation. This repair fragment was obtained by hybridization of two complementary 90 mer oligonucleotides and transformed into the strain obtained in the first step. Positive clones were sequenced to confirm the insertion of the target mutation and the absence of other mutations.
We used CRISPR/Cas9-guided allele replacement to introduce individual mutations in five codons of the RAP1 coding sequence that encode for amino acids predicted to make direct contact with DNA when RAP1 binds to DNA (Konig et al., 1996). For each codon, we tried to insert one synonymous mutation, one nonsynonymous mutation predicted to have a weak impact on RAP1 protein structure and one nonsynonymous mutation predicted to have a strong impact on RAP1 protein structure based on amino acid exchangeability scores from Yampolsky & Stoltzfus (2005) (see Supplementary File 7 for the list of mutations). Each mutation was introduced in the genome of strain YPW2706. This strain is derived from YPW1139 and contains two identical sgRNA target sites upstream and downstream of the RAP1 gene (see below for details on YPW2706 construction). Therefore, we could use a single Cas9/sgRNA plasmid to excise the entire RAP1 gene in YPW2706 by targeting Cas9 to both ends of the gene. We used gene SOEing (Splicing by Overlap Extension) to generate repair fragments corresponding to the RAP1 gene (promoter and coding sequence) with each target mutation. First, a left fragment of RAP1 was amplified from YPW1139 genomic DNA using a forward 20 mer oligonucleotide priming upstream of the RAP1 promoter and a reverse 60 mer oligonucleotide containing the target mutation and the surrounding RAP1 sequence. In parallel, a right fragment of RAP1 overlapping with the right fragment was amplified from YPW1139 genomic DNA using a forward 60 mer oligonucleotide complementary to the reverse oligonucleotide used to amplify the left fragment and a reverse 20 mer oligonucleotide priming in RAP1 5’UTR sequence. Then, equimolar amounts of the left and right fragments were mixed in a PCR reaction and 25 cycles of PCR were performed to fuse both fragments. Finally, the resulting product was further amplified using two 90 mer oligonucleotides with homology to the sequence upstream of RAP1 promoter and to the RAP1 5’UTR but without the sgRNA target sequences. Consequently, transformation of the repair fragment together with the Cas9/sgRNA plasmid in YPW2706 cells was expected to replace the wild type allele of RAP1 by an allele containing the target mutation in RAP1 coding sequence and without the two flanking sgRNA target sites. For each of the 15 target mutations, we sequenced the RAP1 promoter and coding sequence in 10 independent clones obtained after transformation. All synonymous mutations were retrieved in several clones, while several of the nonsynonymous mutations were not found in any clone, suggesting they were lethal (Supplementary File 7).
RAP1 and GCR1 mutagenesis using error-prone PCR
We used a mutagenic PCR approach to efficiently generate hundreds of mutants with random mutations in the RAP1 gene (promoter and coding sequence) or in the second exon of GCR1 (representing 99.7% of GCR1 coding sequence). DNA fragments obtained from the mutagenic PCR were introduced in the yeast genome using CRISPR/Cas9-guided allele replacement as described above. The sequences of all oligonucleotides used for RAP1 and GCR1 mutagenesis can be found in Supplementary File 14.
First, we constructed two yeast strains for which the RAP1 gene (strain YPW2706) or the second exon of GCR1 (strain YPW3082) were flanked by identical sgRNA target sites and PAM sequences. To generate strain YPW2706, we first identified a sgRNA target site located downstream of the RAP1 coding sequence (41 bp after the stop codon in the 5’UTR) in the S288c genome. Then, we inserted the 23 bp sequence corresponding to this sgRNA target site and PAM upstream of the RAP1 promoter (immediately after PPN2 stop codon) in strain YPW1139 using the delitto perfetto approach (as described above). To generate strain YPW3082, we first identified a sgRNA target site located at the end of the GCR1 intron (22 bp upstream of exon 2) in the S288c genome. Then, we inserted the 23 bp sequence corresponding to this sgRNA target site and PAM immediately after the GCR1 stop codon in strain YPW1139 using the delitto perfetto approach (as described above).
Second, we constructed plasmid pPW437 by cloning the 20mer guide sequence directed to RAP1 in pML104 as described in Laughery et al. 2015 and we constructed plasmid pPW438 by cloning the 20mer guide sequence directed to GCR1 in pML104 as described in Laughery et al. 2015. These two sgRNA/Cas9 plasmids can be used, respectively, to excise the RAP1 gene or GCR1 exon 2 from the genomes of YPW2706 and YPW3082.
Third, we generated repair fragments with random mutations in RAP1 or GCR1 genes using error-prone PCR. We first amplified each gene from 2 ng of YPW1139 genomic DNA using a high-fidelity polymerase (KAPA HiFi DNA polymerase) and 30 cycles of PCR. PCR products were purified with the Wizard SV Gel and PCR Clean-Up System (Promega) and quantified with a Qubit 2.0 Fluorometer (Thermo Fisher Scientific) using dsDNA broad range assays. 2 ng of purified PCR products were used as template for a first round of mutagenic PCR and mixed with 25 µl of DreamTaq Master Mix 2x (ThermoFisher Scientific), 2.5 µl of forward and reverse primers at 10 µM, 5 µl of 1 mM dATP and 5 µl of 1 mM dTTP in a final volume of 50 µl. The imbalance of dNTP concentrations (0.3 µM dATP, 0.2 µM dCTP, 0.2 µM dGTP and 0.3 µM dTTP) was done to bias the mutagenesis toward misincorporation of dATP and dTTP. For RAP1 mutagenesis, the forward oligonucleotide primed upstream of the RAP1 promoter (in PPN2 coding sequence) and the reverse oligonucleotide primed in the RAP1 terminator and contained a mutation in the PAM adjacent to the sgRNA target site. For GCR1 mutagenesis, the forward oligonucleotide primed at the end of the GCR1 intron and contained a mutation in the PAM adjacent to the sgRNA target site and the reverse primer primed in the GCR1 terminator. The PCR program was 95°C for 3 minutes followed by 32 cycles with 95°C for 30 seconds, 52°C for 30 seconds, 72°C for 2 minutes and a final extension at 72°C for 5 minutes. For RAP1 mutagenesis, the product of the first mutagenic PCR was diluted by a factor of 33 and used as template for a second round of mutagenic PCR (1.5 µl of product in a 50 µl reaction) similar to the first round but with only 10 cycles of amplification. For GCR1 mutagenesis, the product of the first mutagenic PCR was diluted by a factor of 23 and used as template for a second round of mutagenic PCR (2.2 µl of product in a 50 µl reaction) with 35 cycles of amplification. Using this protocol, we expected to obtain on average 1.6 mutations per fragment for RAP1 mutagenesis and 1.8 mutations per fragment for GCR1 mutagenesis (see below for calculations of these estimates).
pPW437 was transformed with RAP1 repair fragments into YPW2706 and pPW438 was transformed with GCR1 repair fragments into YPW3082 as described above for CRISPR/Cas9 site directed mutagenesis. To select cells that replaced the wild type alleles with alleles containing random mutations, transformed cells were plated on SC-Ura and incubated at 30°C for 48 hours. To confirm the success of each mutagenesis and to estimate actual mutation rates, we then sequenced the RAP1 genes in 27 random colonies from the RAP1 mutagenesis and we sequenced the second exon of GCR1 in 18 random colonies from the GCR1 mutagenesis. Next, 500 colonies from RAP1 mutagenesis and 300 colonies from GCR1 mutagenesis were streaked onto SC-Ura plates. After growth, one colony from each streak was patched on YPG and grown four days at 30°C. Then, patches were replica-plated with velvets onto SC + 5-FOA to eliminate sgRNA/Cas9 plasmids. Finally, 488 clones from RAP1 mutagenesis and 355 clones from GCR1 mutagenesis were arrayed in 96-well plates containing 0.5 ml of YPD (same plate design as used for the flow cytometry assays) and grown overnight at 30°C. 0.2 ml of cell culture from each well was then mixed with 46 µl of 80% glycerol in 96-well plates and stored at -80°C. The fluorescence of these strains was quantified by flow cytometry as described above to assess the impact of RAP1 and GCR1 mutations on PTDH3-YFP expression (expression data for each mutant can be found in Supplementary File 19).
In our mutagenesis approach, we introduced a mutation that impaired the target PAM sequence in all RAP1 and GCR1 mutants. To determine the effect of this mutation alone, we generated strains YPW2701 and YPW2732 that carried the PAM mutation in the RAP1 terminator or in the GCR1 intron, respectively, without any other mutation in RAP1 or GCR1. The fluorescence level of these two strains was not significantly different from the fluorescence level of the progenitor strain YPW1139 in flow cytometry assays.
Estimation of RAP1 and GCR1 mutation rates
The expected number of mutations per PCR amplicon (Nmut) depends on the error rate of the Taq polymerase (μ), on the number of DNA duplications (D) and on the length of the amplicon (L): Nmut = μ·D·L. The published error rate for a classic polymerase similar to DreamTaq is ∼3 x 10-5 errors per nucleotide per duplication (McInerney et al. 2014). Amplicon length was 3057 bp for RAP1 mutagenesis and 2520 pb for GCR1 mutagenesis. The number of duplications of PCR templates was calculated from the amounts of double stranded DNA quantified using Qubit 2.0 dsDNA assays before (I) and after (O) each mutagenic PCR reaction as follows: . For the first round of RAP1 mutagenesis, 11.7. For the second round of RAP1 mutagenesis, . Therefore, the total number of duplications was 17.2 and the expected number of mutations per amplicon Nmut was 1.6 on average. For the first round of GCR1 mutagenesis, . For the second round of GCR1 mutagenesis, . Therefore, the total number of duplications was 23.6 and the expected number of mutations per amplicon Nmut was 1.8 on average.
Effects of mutations in purine biosynthesis genes on expression from different promoters
We compared the individual effects of three mutations in the purine biosynthesis pathway (ADE2-C1477a, ADE5-G1715a and ADE6-G3327a) on YFP expression driven by four different yeast promoters (PTDH3, PRNR1, PSTM1 and PGPD1). Each mutation was introduced individually in the genomes of four parental strains described in Hodgins-Davis et al. (2019) carrying either PTDH3-YFP (YPW1139), PRNR1-YFP (YPW3758), PSTM1-YFP (YPW3764) or PGPD1-YFP (YPW3757) reporter gene at the ho locus. Site-directed mutagenesis was performed as described in the corresponding section (see above). The fluorescence of the four parental strains, of a non-fluorescent strain (YPW978) and of the 12 mutant strains (4 reporter genes x 3 mutations) was quantified using a Sony MA-900 flow cytometer (the BD Accuri C6 instrument used for other fluorescence assays was not available due to Covid-19 shutdown) in three replicate experiments performed on different days. For each experiment, all strains were grown in parallel in culture tubes containing 5 ml of YPD and incubated at 30°C for 16 hours. Each sample was diluted to 1-2 x 107 cells/mL in PBS prior to measurement. At least 5 x 104 events were recorded for each sample using a 488 nm laser for YFP excitation and a 525/50 optical filter for the acquisition of fluorescence. At least 5 x 104 events were recorded for each sample. Flow cytometry data were then processed in R using functions from the FlowCore package and custom scripts available in Supplementary File 12. After log-transformation of flow data, events considered to correspond to single cells were selected on the basis of their forward scatter height and width (FSC-H and FSC-W). Fluorescence values of single cells were then normalized to account for differences in cell size. Finally, the median fluorescence among cells was computed for each sample and averaged across replicates of each genotype.
Statistical comparisons of trans-regulatory and nonregulatory mutations
We established a set of 69 trans-regulatory mutations that included 52 mutations with a P-value below 0.01 in the G-tests comparing the frequencies of mutant and reference alleles in low and high fluorescence bulks (see above) as well as 17 mutations identified by Sanger sequencing in the coding sequence of purine biosynthesis genes. In parallel, we established a set of 1766 nonregulatory mutations regarding PTDH3-YFP expression that included mutations with a P-value above 0.01 in the G-tests comparing the frequencies of mutant and reference alleles in low and high fluorescence bulks (see above) and mutations that did not affect PTDH3-YFP expression in single-site mutants. We performed statistical analysis to compare properties of trans-regulatory and nonregulatory mutations using RStudio v1.2.5019 (R scripts are in Supplementary File 16). We used G-tests (likelihood.ratio function in Deducer package) to compare the following properties between trans-regulatory and nonregulatory mutations: i) the frequency of G.C to A:T transitions, ii) the frequency of indels, iii) the frequency of aneuploidies, iv) the distribution of mutations among chromosomes, v) the frequency of mutations in coding, intronic and intergenic sequences, vi) the frequency of synonymous, nonsynonymous and nonsense changes among coding mutations, vii) the frequency of coding mutations in transcription factors, viii) the frequency of coding mutations in the predicted TDH3 regulatory network (see below), ix) the proportion of mutations in eQTL regions (see below). We used resampling tests to compare the frequencies of different amino acid changes caused by trans-regulatory and nonregulatory mutations in coding sequences. We computed for each possible amino acid change the observed absolute difference between i) the proportion of coding trans-regulatory mutations causing the amino acid change and ii) the proportion of nonregulatory mutations causing the amino acid change. Then, we computed similar absolute differences for 10,000 randomly permuted sets of trans-regulatory and nonregulatory mutations. The P-value for each amino acid change was calculated as the proportion of resampled absolute differences greater or equal to the observed absolute difference.
TDH3 regulatory network
The network of potential TDH3 regulators shown on Figure 4 was established using data available in July 2019 on the YEASTRACT (www.yeastract.com) repository of regulatory associations between transcription factors and target genes in Saccharomyces cerevisiae (Teixeira et al., 2018). We used the tool “Regulation Matrix” to obtain three matrices in which rows corresponded to the 220 transcription factor genes in YEASTRACT and columns corresponded to the 6886 yeast target genes included in the database. In the first matrix obtained using the option “Only DNA binding evidence”, an element had a value of 1 if the transcription factor at the corresponding row was reported in the literature to bind to the promoter of the target gene at the corresponding column and a value of 0 otherwise. The two other matrices were obtained using the option “Only Expression evidence” with either “TF acting as activator” or “TF acting as inhibitor”. An element had a value of 1 only in the “TF acting as activator” matrix if perturbation of the transcription factor at the corresponding row was reported to increase expression of the target gene at the corresponding column. An element had a value of 1 only in the “TF acting as inhibitor” matrix if perturbation of the transcription factor at the corresponding row was reported to decrease expression of the target gene at the corresponding column. An element had a value of 1 in both matrices if perturbation of the transcription factor at the corresponding row was reported to affect expression of the target gene at the corresponding column in an undetermined direction. Finally, an element had a value of 0 in both matrices if perturbation of the transcription factor at the corresponding row was not reported to alter expression of the target gene at the corresponding column in the literature. We then used a custom R script (Supplementary File 16) to generate a smaller matrix that only contained first level and second level regulators of TDH3 and TDH3 itself. A transcription factor was considered to be a first level regulator of TDH3 if a regulatory association with TDH3 was supported both by DNA binding evidence and expression evidence. A transcription factor was considered to be a second level regulator of TDH3 if a regulatory association with a first level regulator of TDH3 was supported both by DNA binding evidence and expression evidence. The network shown on Figure 4 was drawn using Adobe Illustrator based on regulatory interactions included in the matrix of TDH3 regulators (in Supplementary File 11). To determine whether mutations in the TDH3 regulatory network constituted a significant mutational source of regulatory variation affecting PTDH3 activity, we compared the proportions of trans-regulatory and non-regulatory mutations that were located in a TDH3 regulator gene (first or second level) using a G-test (likelihood.ratio function in R package Deducer).
Competitive fitness assays
We performed competitive growth assays to quantify the fitness of 62 strains with random mutations in the second exon of GCR1. These 62 strains corresponded to all GCR1 mutants that showed a significant decrease of PTDH3-YFP expression as quantified by flow cytometry as well as GCR1 mutants for which GCR1 exon 2 was sequenced and the location of mutations was known. The 62 strains were thawed on YPG plates as well as reference strains YPW1139 and YPW2732 and strain YPW1182 that expressed a GFP (Green Fluorescent Protein) reporter instead of YFP. After three days of incubation at 30°C, strains were arrayed in four replicate 96-well plates containing 0.5 ml of YPG per well. In parallel, the [GFP+] strain YPW1182 was also arrayed in four replicate 96-well plates. The eight plates were incubated on a wheel at 30°C for 32 hours. We then measured the optical density at 620 nm of all samples using a Sunrise plate reader (Tecan) and calculated the average cell density for each plate. Samples were then transferred to 1.2 ml of YPD in 96-well plates to reach an average cell density of 106 cells/ml for each plate. 21.25 µl of samples from plates containing [YFP+] strains were mixed with 3.75 µl of [GFP+] samples in four 96-well plates containing 0.45 ml of YPD per well. The reason why [YFP+] and [GFP+] strains were mixed to a 17:3 ratio is because we anticipated that some of the GCR1 mutants may grow slower than the [GFP+] competitor in YPD. Samples were then grown on a wheel at 30°C for 10 hours and the optical density was measured again after growth to estimate the average number of generations for each plate. The ratio of [YFP+] and [GFP+] cells in each sample was quantified by flow cytometry before and after the 10 hours of growth. Samples were analyzed on a BD Accuri C6 flow cytometer with a 488 nm laser used for excitation and two different optical filters (510/10 and 585/40) used to separate YFP and GFP signals. FCS data were analyzed with custom R scripts using flowCore and flowClust packages (Supplementary File 12) as described in Duveau et al. (2018). First, we filtered out artifactual events with extreme values of forward scatter or fluorescence intensity. Then, for each sample we identified two clusters of events corresponding to [YFP+] and [GFP+] cells using a principal component analysis on the logarithms of FL1.H and FL2.H (height of the fluorescence signal captured through the 510/10 and 585/40 filters, respectively). Indeed, [YFP+] cells tend to have lower FL1.H value and higher FL2.H value than [GFP+] cells and these two parameters are positively correlated. The competitive fitness of [YFP+] cells relative to [GFP+] cells was calculated as the exponential of the slope of the linear regression of on the number of generations of growth (where YFP corresponds to the number of [YFP+] cells and GFP corresponds to the number of [GFP+] cells). We then divided the fitness of each sample by the mean fitness among all replicates of the reference strain YPW1139 to obtain a fitness value relative to YPW1139. The fitness of each strain was calculated as the mean relative fitness among the four replicate populations for that strain. These fitness data can be found in Supplementary File 19.
Gene ontology (GO) analysis
GO term analyses were performed on www.pantherdb.org website in June 2020 (Mi et al., 2019). In “Gene List Analysis”, we used “Statistical overrepresentation test” on a query list corresponding to the 53 genes affected by trans-regulatory coding mutations. GO enrichment was determined based on a reference list of the 1251 genes affected by non-regulatory coding mutations using Fisher’s exact tests. Four separate analyses were performed for GO biological processes, GO molecular functions, GO cellular components and PANTHER pathways. GO terms that are significantly enriched in the list of trans-regulatory mutations (mutations associated with fluorescence level) relative to non-regulatory mutations (mutations not associated with fluorescence level) at P < 0.05 are listed in Supplementary File 8.
Enrichment of mutations in eQTL regions
Genomic regions containing expression quantitative trait loci (eQTL) associated with PTDH3-YFP expression variation in three different crosses (BYxYPS1000, BYxSK1 and BYxM22) were obtained from Table S11 in Metzger & Wittkopp (2019). A custom R script was used to determine the number of trans-regulatory and non-regulatory mutations located inside and outside these eQTL intervals (Supplementary File 16). G-tests were performed to determine whether the proportion of trans-regulatory mutations in eQTL intervals was statistically different from the proportion of non-regulatory mutations in the same eQTL intervals.
Data archiving
Deposition of data (FASTQ files and FCS files) on public repositories is in progress. Sequencing data (FASTQ files) will be available on NCBI SRA (https://www.ncbi.nlm.nih.gov/sra). Flow cytometry data (FCS files) will be available on the Flow Repository (https://flowrepository.org/).
Supplementary data
Supplementary Table 1. Statistical associations between aneuploidies and fluorescence level.
Supplementary File 1. Sequencing depth in BSA-seq data
Supplementary File 2. List of all mutations identified by BSA-Seq or Sanger sequencing in this study
Supplementary File 3. Linked mutations associated with fluorescence level in BSA-Seq experiments
Supplementary File 4. Mutations identified by Sanger sequencing of candidate genes
Supplementary File 5. Mutations tested in single-site mutants
Supplementary File 6. Mutations associated with fluorescence level in BSA-Seq experiments
Supplementary File 7. Targeted mutagenesis of RAP1 residues making direct contact with DNA
Supplementary File 8. List of GO terms overrepresented in genes hit by causative mutations relative to genes hit by neutral mutations
Supplementary File 9. Mutations located in the coding sequence of glucose signaling genes
Supplementary File 10. Trans-regulatory effects of mutations in purine biosynthesis genes or iron homeostasis genes
Supplementary File 11. Files used as inputs for analyses performed with the PBS script (Supplementary File 15) and R scripts (Supplementary Files 12, 16 and 17).
Supplementary File 12. R scripts used for the analysis of flow cytometry data
Supplementary File 13. List of DNA libraries grouped by sequencing runs
Supplementary File 14. List of oligonucleotides used in this study
Supplementary File 15. PBS script used to process FASTQ files
Supplementary File 16. R scripts used for the analysis of BSA-Seq data and for comparing the properties of trans-regulatory and non-regulatory mutations.
Supplementary File 17. R script used to annotate variants identified in BSA-Seq data
Supplementary File 18. Construction of single-site mutant strains
Supplementary File 19. Phenotypes of RAP1 mutants (expression) and GCR1 mutants (expression and fitness)
Acknowledgments
We thank Gaël Yvert and Mark Hill for helpful comments on the manuscript, the University of Michigan sequencing core and University of Michigan flow cytometry core for research support, and the National Institutes of Health (R01GM108826 and R35GM118073 to P.J.W.), European Molecular Biology Organization (EMBO ALTF 1114-2012 to F.D.), National Science Foundation (MCB-1929737 to P.J.W.), NIH Genetics Training grant (T32GM007544 to P.V.Z.), NIH Genome Sciences Training Grant (T32HG000040 to B.P.H.M and M.A.S.), and the Michigan Life Sciences Fellow program (M.A.S.) for funding.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵