Mining novel cis-regulatory elements from the emergent host Rhodosporidium toruloides using transcriptomic data

The demand for robust microbial cell factories that produce valuable biomaterials while resisting stresses imposed by current bioprocesses is rapidly growing. Rhodosporidium toruloides is an emerging host that presents desirable features for bioproduction, since it can grow in a wide range of substrates and tolerate a variety of toxic compounds. To explore R. toruloides suitability for application as a cell factory in biorefineries, we sought to understand the transcriptional responses of this yeast when growing under experimental settings that simulated those used in biofuels-related industries. Thus, we performed RNA sequencing of the oleaginous, carotenogenic yeast in different contexts. The first ones were stress-related: two conditions of high temperature (37 and 42°C) and two ethanol concentrations (2 and 4%), while the other used the inexpensive and abundant sugarcane juice as substrate. Differential expression and functional analysis were implemented using transcriptomic data to select differentially expressed genes and enriched pathways from each set-up. A reproducible bioinformatics workflow was developed for mining new regulatory elements. We then predicted, for the first time in this yeast, binding motifs for several transcription factors, including HAC1, ARG80, RPN4, ADR1, and DAL81. Most putative transcription factors uncovered here were involved in stress responses and found in the yeast genome. Our method for motif discovery provides a new realm of possibilities in studying gene regulatory networks, not only for the emerging host R. toruloides, but for other organisms of biotechnological importance.


Introduction
Microbial cell factories are in ever-growing demand due to the pursuit of more sustainable products supporting the concepts of green chemistry and biorefineries (Cherubini, 2010;Martins et al., 2021;Tan and Lamers, 2021). The benefits go beyond obtaining valuable products from low-cost raw materials, but include the independence from oil, a finite and extremely polluting material (Borodina and Nielsen, 2014;Liu et al., 2021). Innovations in obtaining products from renewable substrates are based on the genetic improvement and metabolic engineering of microorganisms that can degrade and ferment different biomasses (Martins-Santana et al., 2018). The main challenges the currently available cell factories face are not only breaking complex compounds into simpler chains, but also tolerating adverse conditions resulting from bioprocesses (Geiselman et al., 2020;Devi et al., 2022).
Rhodosporidium toruloides is emerging as a potential, robust host for completing both tasks. This oleaginous, carotenogenic yeast can flourish on a wide range of substrates, such as sugarcane juice and bagasse, crude glycerol, wheat straw, cassava starch, among others (Bonturi et al., 2017;Park et al., 2018;Lopes, 2020). It can also store high amounts of lipids-up to 60% of its total cell weight-and tolerate compounds commonly toxic to other yeast species (Saini et al., 2020). It performs better than the widely used baker's yeast, Saccharomyces cerevisiae, for example, as it can degrade C5 sugars and lignin-derived compounds, which the latter cannot . The advantages go over other lipid-producing strains, such as Yarrowia lipolytica, since it can grow in more complex carbon sources and presents higher tolerance to several inhibitors like 5-hydroxy methyl ester, furfural, acetic acid and vanillin (Saini et al., 2020). R. toruloides has already been shown to produce several different bioproducts in a variety of conditions. Examples of these bioproducts include enzymes, carotenoids, sugar alcohols, such as arabitol and galactitol, and lipid-based compounds, such as linoleic acid, oleic acid and cocoa butter substitute (Jagtap and Rao, 2018;Park et al., 2018;Jagtap et al., 2019;Bonturi et al., 2022). Recently, terpenes that can be used as precursors of biodiesel and jet fuel alternatives were produced on a pilot scale, derived from corn stover hydrolysates (Yaegashi et al., 2017;Geiselman et al., 2020;Kirby et al., 2021). Hence, all these distinctive characteristics point to R. toruloides as a great candidate for improving the productivity of biorefineries.
There is great interest in using unconventional, more robust yeasts in detriment to the use of S. cerevisiae for ethanol production in bioreactors Wu et al., 2020). Ethanol tolerance and thermotolerance, for instance, are desirable features. First, ethanol accumulation during fermentation can become a significant stress factor during the process (Stanley et al., 2010). Secondly, higher fermentation temperatures can lower the cost of cooling bioreactors during the process and reduce bacterial contamination, decreasing the need for antibiotics (Abdel-Banat et al., 2010;Huang et al., 2018). Thus, to explore if R. toruloides could tolerate and thrive in these stressful environments, we decided to investigate the transcriptional response of this yeast growing in high temperatures and in the presence of ethanol.
Furthermore, Soccol et al. (2017) developed a process for producing biofuels using R. toruloides grown in a simple media containing only sugarcane juice and urea. Sugarcane juice is an abundant and low-cost substrate, containing up to 15% of fermentable sugars. Based on their work, we additionally sought to understand the changes in transcripts of R. toruloides grown in media containing sugarcane juice and urea (Soccol et al., 2017).
Transcription factors (TF) are proteins that bind to specific DNA sequences modulating the rate of transcription of DNA to mRNA. These particular DNA sequences are collectively called transcription factor binding sites (TFBS) or binding motifs, and are present in the cis-regulatory region of DNA surrounding the transcription start sites. The activity of TFs can directly or indirectly transform physiological and environmental signals into gene expression patterns. Thus, identifying both the proteins and the position of their binding motifs is essential for understanding gene regulation in the cell and allowing better transcription control, which can serve as a basis for new bioengineering applications (Martins-Santana et al., 2018). However, finding new regulatory sequence motifs is a challenging procedure, especially in eukaryotic organisms, even when we have a large amount of available experimental data (Ambrosini et al., 2020). One of the reasons is that the primary nucleotide sequence is one of many characteristics that specify a target. Additional factors, such as genomic context, DNA binding, modification, and shape, can change nucleotide preferences (Inukai et al., 2017). Besides, it is only possible to test some environmental conditions to which a natural regulatory network can respond. Nevertheless, new tools and methodological approaches are developing to predict regulatory elements in different datasets. In the future, they might help overcome motif discovery limitations in non-model organisms (Ksouri et al., 2021;Novakovsky et al., 2021;Wetzel et al., 2022).
Here, we developed a bottom-up bioinformatic pipeline for cis-regulatory element discovery combining tools for motif prediction and sequence analysis. We applied it to the emergent microbial cell factory R. toruloides, whereas this workflow can be well suitable to other biotechnologically relevant organisms. To do that, we cultivated R. toruloides in the settings described above and extracted RNA samples from early time points of growth. Then, we performed RNA sequencing, functional analysis of the transcriptional responses, and discovery of new motifs from genes of the main pathways found to be enriched in each specific condition. In this sense, we are providing new insights regarding the regulation of R. toruloides genes in a transcriptomic approach and describing, for the first time, putative TFBS that can be used

Strains, media and growth conditions
For RNA sequencing experiments using R. toruloides grown in a medium with sugarcane juice (SCJ), cultures were first grown in LB medium (1% yeast extract, 1% tryptone, and 0.5% NaCl) for 24 h at 30°C and 200 rpm. LB medium was used in the inoculum so that no other sugar influenced the yeast metabolism. The inoculum was centrifuged and transferred to media containing sugarcane juice. Considering the concentration of 120 g of sucrose per liter of juice, we standardized the final concentration of the media as 4% sucrose. In addition, 1% urea was added as a nitrogen source. The filtered sugarcane juice was pasteurized at 65°C for 30 min before being added to the autoclaved medium. SCJ comprises about 75% water, 13-15% of sucrose, and 10-15% of fibers, and pasteurization should not change overall composition. The cultures were grown for 8 h and then collected for RNA extraction. Aliquots of the inoculum in LB grown for 24 h were also collected to serve as internal control, and hereafter will be called time 0 h.
The experiments for the stress conditions were carried out as follows: the inoculum was grown in YPD medium (1% yeast extract, 2% peptone, and 2% glucose) at 200 rpm at 30°C for 24 h. These inoculants were centrifuged and transferred to the respective media, containing YPD and one of the respective stress conditions: 2% ethanol and 4% ethanol, which were grown in a shaker at 30°C, and the high-temperature conditions that were grown at 37°C and 42°C. The control condition was determined as the inoculum grown in YPD for 24 h. After being transferred to the respective media, the cultures were grown for another 16 h and collected for RNA extraction and sequencing. All inoculations were performed in a 1: 10 ratio, and all experiments were performed in triplicates. The strain R. toruloides IFO0880 was used in all experiments (Lin et al., 2014;Coradetti et al., 2018).

RNA extraction and sequencing
Cell lysis was performed by pelleting yeast cells from the culture by centrifugation and resuspending it with Trizol ® (Thermo Fisher Scientific). Samples were transferred to tubes containing zirconium beads and lysed using a cell homogenizer.
RNA extraction using Trizol ® (Thermo Fisher Scientific) followed the manufacturer's instructions. The resulting RNA samples were immediately purified using the Qiagen ® Rneasy mini kit. Samples were submitted to analysis by Agilent Bioanalyzer RNA 6000 Nano kit for RNA integrity check, aiming for a RIN number higher than 8. After being extracted and purified, all RNA samples were kept in a -80°C freezer. The samples were sent for sequencing at the Genomics Center of the University of São Paulo in Piracicaba, São Paulo, Brazil. The facility prepared the sequencing library using Illumina TruSeq Stranded mRNA Sample Prep LT Protocol. RNA sequencing was performed using the HiSeq SBS v4 kit in Illumina HiSeq 2500, with paired reads of 100 bp (2 × 101). Raw data is available as Sequencing Read Archives (SRA) on the NCBI website under accession number PRJNA883675.

RNA-seq differential gene expression analysis
Read quality check was performed using FastQC (Andrews, 2018), and the trimming was made using the Trimmomatic software (Bolger et al., 2014). Henceforth, we used only the reads assigned by Trimmomatic as paired. Gene expression was quantified with kallisto (Bray et al., 2016) used for counting the R. toruloides IFO0880 JGI reference transcriptome (Coradetti et al., 2018) available at: https://mycocosm.jgi.doe.gov/Rhoto_ IFO0880_4/Rhoto_IFO0880_4.home.html. The differential expression analysis was done using DESeq2 (Love et al., 2014) in the R platform (R Core Team, 2019). Genes whose adjusted value of ps (adjpval) were lower than 0.05 and whose log2-fold change values were lower than −1 or higher than 1 were selected as differentially expressed genes (DEGs).

Functional and pathway analysis
To analyze the DEGs function, we used the gene function annotation file created by KOG (also provided by JGI, as above). For pathway enrichment analysis, we used the GAGE R package (Luo et al., 2009). As input in this step, we used the DEGs from each condition with their function annotated by the KEGG Automatic Annotation Server, 1 as this tool provides a compatible file for GAGE analysis. The enriched pathways were filtered by a value of p lower than 0.05. Bubble maps were plotted using the R package ggplot2 (Wickham, 2016). Path view R package (Luo and Brouwer, 2013) was used to visualize enriched pathways whenever needed.

Putative cis-regulatory elements discovery
For each condition, we selected the DEGs that belong to the KOG classes. We extracted the promoter sequence of these genes from the reference database. We applied the HOMER software (Heinz et al., 2010) in each promoter sequence set, with the following parameters: -len 10, −stat hypergeo, −olen 3, −strand both, -maxBack 0.3. The background file was the R. toruloides Frontiers in Microbiology 04 frontiersin.org IFO0880 JGI reference promoter sequences. We performed a pairwise comparison between all motifs by applying the universal motif (Tremblay, 2020) function 'compare_motifs' using Pearson correlation coefficient (PCC) as the method and average mean as the scoring strategy. We also performed pairwise alignment using TOMTOM from MEME suite (Gupta et al., 2007), providing as input our motif file as both target and source, with the following arguments: -min-overlap 0, −evalue, −thresh 10.0 and -dist Pearson. Finally, we filtered the results by a value of p lower than 0.05, an e-value lower than 1 and a PCC higher than 0.8. All motifs from our experimental settings were found using HOMER and were aligned with the ones from JASPAR2018_CORE_fungi_ non-redundant database using TOMTOM. Subsequently, they were plotted and analyzed as a graph, using Gephi (Bastian et al., 2009), applying Circle Pack Layout for representation with the variables 'predicted or annotated, ' experimental condition and node degree as hierarchy.

Finding TF candidates
We got the amino acid sequence of the corresponding transcription factor protein for the overrepresented motifs found on the JASPAR alignment. Then we performed a BLAST on the JGI platform to search for similar proteins on the R. toruloides reference assembled genome (provided by JGI, as above). Amino acid sequences and GO functions of each predictive TF were extracted from UniProt. 2

Bioinformatics workflow
All analyses performed on the R environment were organized in an easy-to-use Jupyter Notebook. The customized code and instructions to install and run the workflow are publicly available at: https://github.com/computational-chemical-biology/cis_reg.

Results
3.1. A new pipeline for the discovery of regulatory elements from transcriptomic data Figure 1 summarizes the workflow developed in this work. The experimental design consisted of growing cells and extracting their RNA, as described in the Methods section. Then, the RNA sequencing data was used as input for the motif discovery pipeline. First, differential expression analysis was performed to select the differentially expressed genes (DEGs), and functional analysis was executed to find enriched pathways for each condition. We then 2 http://uniprot.org extracted the promoter sequences from the genes of the specific enriched pathways and applied the HOMER software for motif prediction. Table 1 contains the number of promoter sequences and motifs found by HOMER for each case. We performed an internal pairwise comparison between all motifs by applying the universal motif package and the TOMTOM tool. Subsequently, we performed an external comparison against the Jaspar database using TOMTOM. The entire workflow will be described in detail throughout the "Results" section.

Differential expression analysis shows a significant transcriptional change in sugarcane and high-temperature conditions
The RNA sequencing data were analyzed through the RNA-seq differential analysis using DESeq2. For all analyses, there were two different condition settings: the first one was the sugarcane juice condition (SCJ), and the second one was the stress conditions (comprising 37 and 42°C, and ethanol 2 and 4%). The differential expression analysis for R. toruloides grown for 8 h on the SCJ condition was measured relative to its own control, which was the inoculum grown in LB for 24 h (also called time 0 h). The 8-h timepoint was selected for this condition because cells are still in exponential phase and we wanted to grasp gene expression as soon as transcription is adapting to the new medium. The differential expression analysis for R. toruloides grown for 16 h in stress conditions was measured using the inoculum grown in YPD for 24 h as internal control. The 16-h timepoint was chosen for a similar reason: when cells are still in exponential phase but already had time to grow accustomed to the each stress. Each experimental setting was always analyzed separately since they had different internal controls. Growth curves of R. toruloides in all aforementioned conditions can be visualized in Supplementary Figure S1.
The principal component analysis (PCA) was performed to check the quality of the sequencing samples, where we could observe the consistency of our sample replicates in a summarized representation. As shown in the PCA graphs in Supplementary Figure S2A, where the two first components explain 84% of the variance, the replicates of each condition are close to each other and distant from their respective controls, which indicates that the gene expression differences are consistent inside each condition, allowing the comparison between them. Nevertheless, in Supplementary Figure S2B, we can see that the replicates of the culture grown at 37°C are very close to the control condition. While the 42°C is separated by the first principal component, explaining 50.1% of the variance alone.
To further investigate the quality of the RNA-seq experiments, the distance between samples, based on the transcript expressions, was measured (Supplementary Figure S3). In the heatmap analysis, Euclidean distance was applied to the expression pattern of each sample, where darker blue means more proximity and Frontiers in Microbiology 05 frontiersin.org lighter blue means the greater distance between samples. We can see the same trend in which the replicates of each condition are grouped, which confirms the PCA analysis. Similar to PCA, in Supplementary Figure S3A, we can see that the ethanol 2% and ethanol 4% conditions were not distant from each other. This shows that the changes in transcriptional behavior in R. toruloides were not divergent between the two conditions, albeit they are both distant from the control. In Supplementary Figure S3B, it is possible to see a similar trend to PCA, where the SCJ condition is distant from its internal control. The Venn diagram depicted in Figure 2 summarizes the DEGs found for R. toruloides grown in each experimental condition. The conditions in which more genes presented transcriptional change were the SCJ and the temperature 42°C. More than 3,000 DEGs were found for those experimental settings, while there were around 1,000 DEGs for each of the other conditions: 37°C and ethanol 2 and 4%. These results show that the latter conditions, although causing a significant change in the transcriptional behavior of R. toruloides, did not cause as much impact as the change to a rich substrate caused by the addition of sugarcane in the media nor the increase in temperature by growing at 42°C. R. toruloides grown in SCJ had a more significant number of down-regulated genes versus up-regulated genes, while growing at 42°C resulted in the opposite: more genes were up-regulated (Supplementary Figure S4). The outermost numbers in the Venn diagram represent the number of genes exclusively differentially expressed for each condition. The exclusive DEGs can lead to a new inducible promoter library, given that the promoters will likely be induced by the same condition in which their genes were differentially expressed. Although still under development, the current promoter toolbox for R. toruloides is limited (Johns et al., 2016;Liu et al., 2016;Wang et al., 2016;. Our Workflow for the discovery of cis-regulatory elements using transcriptomic data.
Frontiers in Microbiology 06 frontiersin.org dataset can be useful for developing inducible promoters in the future.

Functional analyses uncover KEGG pathway enrichment in each condition
We performed a pathway enrichment analysis to analyze the DEGs' function. To do that, we first had the transcript function annotated by KEGG Orthology. Functional annotation by KEGG was the input required for the R package used for the analysis, called GAGE. Both the annotation and GAGE analysis are described in the methods section. The resulting pathways were filtered to a value of p of less than 0.05. Figure 3 shows the biochemical pathways of R. toruloides that are enriched when using SCJ as a substrate. The main up-regulated pathways enriched in this condition are those related to cell growth and metabolism, such as DNA metabolism and replication, meiosis and amino acid biosynthesis. Pathways Frontiers in Microbiology 07 frontiersin.org associated with the biosynthesis of amino acids, ribosomes and cell cycle showed the greatest numbers of up-regulated genes.
For the 42°C condition, although there was a higher number of up-regulated genes (Supplementary Figure S4), most of the enriched pathways were found to be down-regulated (Figure 4). Included on that list are several metabolism pathways, showing the exact opposite of what happens in SCJ-metabolism genes are being turned off. In contrast, the condition of 37°C resulted in a reduced number of differentially expressed transcripts compared to the control grown at 30°C (Supplementary Figure S5). In Supplementary Figure S6, the routes found to be enriched by GAGE by the presence of ethanol in the media can be seen. Most of the pathways are metabolism-related and down-regulated in both conditions.

Identification of putative cis-regulatory elements and transcription factor candidates
Subsequently, we performed a new functional analysis using the function annotation file of the reference gene created by KOG (provided by the JGI website as described in "Materials and methods"). The new functional annotation using the KOG database was chosen due to a better correspondence with our transcript IDs resulting in a greater number of categorized genes and less redundancy in gene classes. The most significant classes Venn diagram of differentially expressed genes (DEGs) for all conditions. The Venn diagram shows how many DEGs are shared between all conditions and how many are exclusive for each condition. Numbers outside of the Venn diagram represent the total of DEGs when compared to each respective control. Enriched kyoto encyclopedia of genes and genomes (KEGG) pathways for Rhodosporidium toruloides grown on sugarcane juice (SCJ). Bubble map showing the biochemical pathways of R. toruloides noted by KEGG enriched in the SCJ condition, as obtained by the GAGE package. Pathways that have an enrichment value greater than 0 are up-regulated, while those that have a value less than 0 are down-regulated. Blue scale inside the bubbles represents the decreasing value of ps. The different sizes of the bubbles define the approximate number of DEGs in each biochemical pathway.
Frontiers in Microbiology 08 frontiersin.org of genes were selected for further analyses, which can be seen in Figure 5. All the remaining classes in which the DEGs of all conditions tested were classified according to KOG annotation are presented in Supplementary Figure S7. In both cases, data is shown as the percentage of DEGs in relation to the total number of genes in the R. toruloides genome. We used the transcriptomic data generated in this work to search for putative TFs and putative TFBS that might play a role in regulating both the stress response and the response to complex carbon sources. The pursuit of TF and TFBS was carried out in several stages to select suitable candidates. First, we extracted the sequence from the promoters corresponding to the genes of the KOG classes from Figure 5. By doing this, we both restricted the number of sequences to search motifs, which may reduce background noise, and we attempted to guarantee a biological relationship between these promoter sequences. Then, we used the HOMER software to obtain the putative motifs from those sequences. The motifs obtained were read on the R platform using the universal motif library. A pairwise comparison was performed between them, with Pearson's correlation coefficient (PCC) as a method. The pairwise comparison was made following the rationale that if the motif is found in more than one gene, the chances of being a real motif increase. We also performed pairwise alignment using TOMTOM to complement the analysis. Finally, we filtered Enriched kyoto encyclopedia of genes and genomes (KEGG) pathways for R. toruloides grown at 42°C. Bubble map showing the biochemical pathways of R. toruloides noted by KEGG that are enriched in the 42°C condition, as obtained by the GAGE package. Pathways that have an enrichment value greater than 0 are up-regulated, while those that have a value less than 0 are down-regulated. Blue scale inside the bubbles represents the decreasing value of ps. The different sizes of the bubbles define the approximate number of DEGs in each biochemical pathway. the results by a value of p less than 0.05, an e-value less than 1 and a PCC greater than 0.8. The resulting motifs were then compared to the Jaspar fungi 2018 (non-redundant DNA) database using TOMTOM, and some of these alignments can be seen in Figure 6. The motif candidates and motifs from JASPAR, aligned by TOMTOM, were plotted in a graph so that we could analyze the similarities between them (Figure 7). The main TFs corresponding to the putative TFBS found using TOMTOM are listed in Table 2. The percentage of identity of TF proteins found by this method against proteins from R. toruloides genome is also listed in Table 2, along with GO functions of the TF proteins.

Discussion
In this work, we developed a reproducible bioinformatics workflow that can be used to discover regulatory elements using transcriptomic data as input. We applied to the emergent microbial cell factory R. toruloides. First, we analyzed the transcriptomic data and showed that R. toruloides presented different transcriptional responses to all the conditions tested. When using media containing SCJ and urea, functional analysis showed a predominance of metabolism pathways being up-regulated, while degradation pathways such as autophagy and peroxisome are down-regulated. This pattern is reasonable since the yeast adapted its metabolism from the inoculum containing no sugar to the rich and complex environment provided by SCJ.
Interestingly, one of the pathways found to be up-regulated by GAGE was the terpenoid backbone biosynthesis. This is a good indicator that we can use the abundant and inexpensive media composed of sugarcane and urea as a start point substrate to produce terpenes in R. toruloides, like previously done using corn stover hydrolysates (Kirby et al., 2021). Moreover, when growing the yeast at 42°C, cytosolic DNA sensing pathways and mRNA surveillance pathways were up-regulated and represent cell stress responses (Jamar et al., 2017;Meng et al., 2021). Similarly, in this condition, the homologous recombination pathway is up-regulated-possibly for dealing with cell damage (Kawale and Sung, 2020). We also hypothesized that the transcriptional responses to the two high-temperature conditions could be classified into two distinct types of stress responses. The 42°C condition caused the heat-shock response, when cells repress protein biosynthesis-as we can see by the decrease of the biosynthesis of amino acids-and protein misfolding and oxidative stress occur (Verghese et al., 2012). In contrast, the condition of 37°C might have reached the other type of response: thermotolerance-given that most transcripts were not differentially expressed compared to the control grown at 30°C. The very few pathways found to be enriched in this condition, like cell cycle and oxidative phosphorylation, might suggest that R. toruloides is enduring the increase in temperature (Shui et al., 2015;Huang et al., 2018). Nevertheless, another study showed that cultivating R. toruloides at 37°C significantly impaired its growth (Wu et al., 2020).
Yet, the ethanol response in our yeast remarkably corresponds with what is described in the literature for ethanol stress and tolerance in S. cerevisiae (Stanley et al., 2010;Yoshida et al., 2021). It is interesting to note that cell cycle and DNA replication pathways are down-regulated, showing inhibition of cell growth, as expected for ethanol stress (Stanley et al., 2010). Inhibition of endocytosis was also reported as an ethanol stress response (Lucero et al., 2000). We can see that both ethanol concentrations decreased the expression of R. toruloides endocytosis genes. Additionally, the proteasome genes are up-regulated, demonstrating that cells are dealing with misfolded proteins, which is described in the literature as a sign of ethanol stress (Bubis et al., 2020). Hence, we hypothesized that the response to the two different concentrations of ethanol is still a stress response and that R. toruloides has not yet been able to tolerate the compound (Lucero et al., 2000;Stanley et al., 2010;Bubis et al., 2020).
Furthermore, we explored the cis-regulatory elements related to the DEGs in the transcriptomic analysis. The TFBS predicted for R. toruloides presented similarity mainly with binding sites for the following TFs: ARG80, RPN4, ADR1, and DAL81. Those TFs can be seen in the graph (Figure 7) as having a more significant number of similar putative motifs. As demonstrated in Table 2, HAC1 protein is involved in ER-unfolded protein response, and RPN4 protein is involved in response to stress. This could explain the enrichment of both their binding motifs in our industrial stress conditions. Studies showed that mutations in the RPN4 protein conferred ethanol resistance to S. cerevisiae (Bubis et al., 2020). However, no homologs of these proteins were found in our carotenogenic yeast genome. Some TFs were responsive to Graph of putative transcription factors and their putative binding sites. Graph showing interaction between putative transcription factors for R. toruloides with the transcription factors (TF) binding motifs predicted using HOMER. Green nodes represent the sugarcane juice condition, orange nodes represent the 42°C condition, dark green nodes are the 2% ethanol, and blue nodes are the 4% ethanol condition. Written inside nodes are the number of the TFBS and the name of the gene class annotated by eukaryotic orthologous groups (KOG) in which that motif was found. Pink nodes are the putative transcription FTs found using TOMTOM. Microbiology  11 frontiersin.org oxidative stress, such as AFT2 and HCM1, whereas only HCM1 was found in the R. toruloides genome. Interestingly, the DAL81 protein was found to be mutated in response to high-temperature stress in a study by Huang et al. (2018). This could explain the enrichment of its binding motifs in our heat shock transcripts, although a homolog of this protein was not found. SWI6 binding motifs may be an exciting source for future studies, since this protein was found in R. toruloides genome with 39% identity, which regulates the heat stress response. Interestingly, binding motifs for SPT23, a protein that regulates cold response, were also found. This protein was found in the genome of R. toruloides with 45% identity.

Frontiers in
Nevertheless, the binding of the TF proteins to their respective motifs discovered by our pipeline needs to be further tested to confirm their activity. Alternatively, in the case of the ones whose corresponding binding protein homologs were not found, the identified motifs could be recognized by yet-unknown TFs. As shown by Antonieto et al. (2019), a transcriptional regulator with low homology to AZF1 recognizes a well-conserved motif regulating the expression of cellulases genes.
The motifs found can be used as starting points for new transcriptional modulation studies. The ultimate goal would be to define a core promoter sequence for R. toruloides and then change the sequences of TFBS according to the desired behavior. For instance, one can study ethanol tolerance in this organism by adding the RPN4 consensus sequence to their respective constructs. Additionally, one can use our pipeline for other organisms with experimentally-defined functional promoter sequences, where scientists can compose motif sequences and their locations to engineer new transcriptional behaviors (D' Ambrosio and Jensen, 2017; Monteiro et al., 2020;Feng and Marchisio, 2021;Tominaga et al., 2022). Also, as mentioned in the Results section, our transcriptomic data can be used to create new promoter libraries for

Conclusion
Here, we characterized the transcriptional responses of R. toruloides when growing in industry-like conditions and discovered new regulatory elements enriched in each context. We showed how differential gene expression, followed by a custom, reproducible bioinformatics motif discovery workflow, could predict putative motifs for binding transcription factors in our strain, most of which involved stress-related responses. DNA motifs similar to the binding sites for ADR1, ARG80, DAL81, and RPN4 transcription factors were the most abundantly enriched in our dataset. While ADR1 and ARG80 proteins were found in the R. toruloides genome with 54 and 60% of identity, respectively, DAL81 and RPN4 were not found. The novel putative cisregulatory elements described here offer a great initial point to optimize gene regulation in industrial conditions expanding the current knowledge on regulatory networks for this yeast. Furthermore, our pipeline for motif discovery can be easily applied to other unexplored hosts.

Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://www.ncbi.nlm.nih. gov/bioproject/PRJNA883675/, PRJNA883675.

Author contributions
RS-R, M-EG, LN, and MC designed the project. LN and ÍS performed RNA experiments. MC, LN, and ÍS performed the differential expression analysis. MC developed the motif discovery pipeline. RS contributed substantially to the analysis, interpretation of data, and availability of the developed workflow. LN wrote the manuscript. All authors contributed to the article and approved the submitted version.