ABSTRACT
Transcriptional regulation is key in bacteria for providing an adequate response in time and space to changing environmental conditions. However, despite decades of research, the binding sites and therefore the target genes and the function of most transcription factors (TFs) remain unknown. Filling this gap in knowledge through conventional methods represents a colossal task, which we demonstrate here can be significantly facilitated by a widespread feature in transcriptional control: the autoregulation of TFs, which implies that the yet unknown transcription factor binding site (TFBS) neighbours the TF gene itself. In this work, we describe the “AURTHO” methodology (AUtoregulation of oRTHOlogous transcription factors), which consists of analyzing upstream regions of orthologous TFs in order to uncover their associated TFBSs. AURTHO enabled the de novo identification of novel TFBSs with an unprecedented improvement in terms of quantity and reliability. DNA-protein interaction studies on a selection of candidate cis-acting elements yielded a >90% success rate, demonstrating the efficacy of AURTHO at highlighting true TF-TFBS couples and supporting the prospect that a plethora of TFBSs will be identified across all bacterial species in the near future.
Key points
Transcription factor (TF) autoregulation implies that their binding site (TFBS) is in their close vicinity
We developed and assessed the AURTHO methodology (AUtoregulation of oRTHOlogous TFs) for TFBS discovery
Our results show that AURTHO greatly facilitates the identification of highly reliable novel TFBSs
INTRODUCTION
In the prokaryotic world, subtle changes in the environment can have limited or instead widespread effects on the expression of genes, permitting an efficient response to new conditions. This modulation of gene expression is mediated by different mechanisms, the best-known of which involves transcription factors (TFs) that, through binding of specific DNA sequences, activate or inhibit the transcription of target genes. Regulators often control the expression of multiple genes by binding to similar transcription factor binding sites (TFBSs) upstream of each of their targeted genes or transcription units (Browning et al., 2019; Browning & Busby, 2016; Mejía-Almonte et al., 2020; Van Hijum et al., 2009). Despite having been in the spotlight the longest among all regulation mechanisms, transcriptional networks remain largely mysterious, even in well-studied microorganisms like Escherichia coli (Baumgart et al., 2021; Santos-Zavaleta et al., 2019). In fact, in most bacteria, only a handful of TFs have been studied, revealing just an inkling of the regulatory networks they use to control cellular processes and adapt to their environment rapidly and efficiently.
Using a wet lab approach, unveiling novel TF-TFBS couples and their regulatory network can take years, but high throughput approaches such as RNA-seq, ChIP-Seq, and DAP-seq have been game changers in regulation data acquisition (Bartlett et al., 2017; Baumgart et al., 2021; Ishihama et al., 2016; Liu et al., 2018; Park, 2009; Wang et al., 2009). These approaches largely facilitate the assessment of the transcriptional output in response to a specific set of signals. However, researchers are often limited and biased when testing a set of laboratory culture conditions, which rarely reflect the bacteria’s natural environment. Indeed, the transcriptional response, and therefore the binding of TFs, is also a dynamic process that highly varies in time and space according to the state of growth or the step of the life cycle for bacteria that undergo extensive physiological and morphological differentiations (Świątek-Połatyńska et al., 2015). Therefore, the fraction of TFs which are only expressed and needed in very specific conditions are unlikely to be highlighted via these studies.
Completely different approaches starting from in silico analyses have also been used. Usually, these approaches first acquire the knowledge of the TFBS, from which the regulon can be inferred, after which its function can be deduced through the analysis of the target genes’ functions (Dwarakanath et al., 2012; Liao et al., 2014; Rigali et al., 2004; Rodionov, 2007; Van Hijum et al., 2009; Yao et al., 2014). With the advent of genome sequencing technologies, researchers have been working to exploit these data to uncover conserved regulatory elements and link them to a TF. As early as 2002, the genomes of three model micro-organisms, E. coli, Bacillus subtilis, and Streptomyces coelicolor, had been studied with the aim of uncovering over-represented dyad-type motifs in intergenic regions of the genome, where cis-acting elements are expected to be found (Li et al., 2002; Mwangi & Siggia, 2003; Studholme et al., 2004). During that same period, we used an in silico-based approach to show that refining the classification of TFs into sub-families beyond the sequence of their helix-turn-helix motif facilitates the discovery of their binding sites. In addition, this work demonstrated that using the autoregulatory property of bacterial regulators in an in silico approach was an effective way to assign a discovered TFBS to its cognate TF (Rigali et al., 2002, 2004). Now that the number of available genomes has significantly grown, approaches based on comparative genomics, and more specifically on phylogenetic footprinting, have become possible (Janky & van Helden, 2008; Rodionov, 2007; Wasserman & Sandelin, 2004). Phylogenetic footprinting is a method that aims at discovering conserved regulatory sequences in orthologous untranslated regions (UTRs) of different genomes, as it is believed that functional features are encoded in evolutionarily conserved DNA sequences. Thus, the traits that are targeted are regulatory DNA sequences (TFBSs) and their associated TF. 
The research group of Prof. Rodionov has indeed shown through their “regulon propagation and reconstruction” approach that certain orthologous TFs and their cognate TFBSs are conserved across an extensive variety of taxa (Kazanov et al., 2013; Leyn et al., 2016; Novichkov et al., 2010, 2013; Ravcheev et al., 2014; Rodionov, 2007).
We predict that this type of approach, when used on a more closely related taxonomic group, will prove to be even more prolific in terms of the quantity of discovered cis-trans relationships. Indeed, numerous TF-TFBS couples are only conserved between closely related species, and this focused approach will likely point out taxon-specific regulatory interactions. With this in mind, we developed a de novo approach and assessed the extent to which it could accelerate the discovery of DNA sequences recognized by TFs. In contrast to previously used comparative genomics in silico approaches, our methodology draws on a widespread property of TFs, i.e., they often control their own expression, which implies that the searched TFBS is located in the close vicinity of the TF gene itself. Combined with the conservation of the TFBS between orthologous TFs, this guided the development of the AURTHO methodology (AUtoregulation of oRTHOlogous transcription factors), which consists of analyzing upstream regions of orthologous TFs in order to uncover their associated TFBSs.
As a case study to test the AURTHO methodology, we focused our attention on one family of TFs, the LacI family, and selected a closely related taxon, the Streptomyces genus, as the latter has been shown to encode large numbers of TFs (12.3% of the model species’ genome is dedicated to encoding regulatory genes) (Bentley et al., 2002). The AURTHO strategy proved to be extremely efficient at providing reliable candidate TFBSs, as the presented work not only confirmed the TFBS of the five LacI-TFs already studied in streptomycetes but also proposed a cognate TFBS for 90 additional and yet uncharacterized LacI-TFs, thereby largely filling the gap in knowledge about cis-acting elements. As autoregulation is a feature of many different TF families, our results suggest that the application of the AURTHO approach across all bacterial species will greatly facilitate the discovery of novel TF-TFBS couples.
RESULTS AND DISCUSSION
Starting hypotheses and the AURTHO methodology
The de novo approach used to unveil the TFBSs of LacI-family TFs is based on three main assumptions: (i) orthologous TFs bind to identical motifs on DNA, (ii) LacI TFs often (70% according to Ravcheev et al., 2014) regulate their own expression (autoregulation), meaning their binding site can be found in the upstream region of the gene encoding them, and (iii) the primary target gene(s) of a TF is (are) usually found adjacent to or in the same transcriptional unit as the TF gene, reinforcing the probability of finding its binding site in close vicinity to the TF gene. Additionally, for members of the LacI family of TFs (used in this work as a case study), the binding sites are easily spotted as they are usually characterised by palindromic sequences of even length which contain a typical CG-pair at the centre of the motif (Ravcheev et al., 2014). Nonetheless, some atypical binding sites have been identified, showcasing uneven lengths, the absence of a CG-pair in the centre (Tsujibo et al., 2004), direct repeats (Schlösser et al., 2001) and/or a stretch of less conserved nucleotides of variable length between the two inverted repeats (though for a single TF and its orthologs, the length is usually conserved) (Ravcheev et al., 2014).
The methodology that guided our approach is detailed in the flowchart presented in Figure 1. First, genomes from the genus Streptomyces were downloaded from the NCBI database and filtered to retrieve only the ones annotated as “Complete” in their assembly status (assembly.info on GitHub). Proteinortho (Lechner et al., 2011) was used to create clusters of orthologous genes (COGs) by performing diamond blast in an all-versus-all manner, and clustering genes using a reciprocal best alignment heuristic (RBAH). Simultaneously, an hmmscan using HMMER3 was performed on all genomes against the Pfam-A profile database (Eddy, 2011; El-Gebali et al., 2019; Mistry et al., 2021), and TF genes were classified into families through signature domain combinations, as described in the P2TF database (Ortet et al., 2012). For the LacI family of TFs, the signature domain combination consists of a LacI DNA-binding domain (PF00356) and a periplasmic binding protein domain (PF00532, PF13377 or PF13407) (Ravcheev et al., 2014). However, according to the P2TF database, the presence of a LacI-HTH motif inside the DNA-binding domain is a sufficient predictor of a protein belonging to this family of TFs (Ortet et al., 2012). For every gene identified as a LacI TF, we extracted the COG it belonged to, and “manually” checked for functional coherence inside the COG based on the gene annotations. Only COGs in which most annotations were coherent with a regulatory function were retained for further analysis. For each of them, the upstream sequences of the LacI-TF genes were extracted. Their length is variable, as the extraction halted as soon as the translational start/stop codon of an upstream gene was encountered. Different maximum lengths of search regions were tested (500 bp, 300 bp, 100 bp with an additional 50 bp inside the coding region). 
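The upstream-region extraction step can be illustrated with a minimal Python sketch. It is not the code used in this work: the `Gene` record, the 0-based half-open coordinates, and the function name `upstream_region` are assumptions made for illustration, the chromosome is treated as linear, and the variant that adds 50 bp of coding sequence is not covered. The sketch only shows the rule described above: take at most `max_len` bp upstream of the TF gene and halt at the boundary of the nearest upstream gene.

```python
from dataclasses import dataclass
from typing import Sequence

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


@dataclass
class Gene:
    start: int   # 0-based, inclusive, forward-strand coordinate
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


def upstream_region(genome: str, genes: Sequence[Gene], target: Gene,
                    max_len: int = 300) -> str:
    """Sequence upstream of `target`, read 5'->3' on the target's strand.

    Extraction stops at the nearest neighbouring gene or after `max_len` bp,
    whichever comes first. An overlapping neighbour yields an empty string.
    """
    others = [g for g in genes if g is not target]
    if target.strand == "+":
        # Upstream lies to the left; clip at the end of the closest gene.
        bound = max((min(g.end, target.start) for g in others
                     if g.start < target.start), default=0)
        begin = max(target.start - max_len, bound)
        return genome[begin:target.start]
    # Minus strand: upstream lies to the right on the forward strand.
    bound = min((max(g.start, target.end) for g in others
                 if g.end > target.end), default=len(genome))
    stop = min(target.end + max_len, bound)
    return reverse_complement(genome[target.end:stop])
```

Calling `upstream_region(genome, genes, tf_gene, max_len=500)` or `max_len=100` would reproduce the other search lengths mentioned above. A TF gene located inside an operon typically returns a very short or empty region under this rule, because the preceding gene of the transcription unit lies immediately upstream.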
For each LacI-COG, these sequences were aligned with the MEME software (Bailey & Elkan, 1994) using two different search parameters termed ZOOPS (Zero or One Occurrence Per Sequence) and ANR (Any Number of Repetitions), and three different search lengths (small = 10 nucleotides (nt), medium = 20 nt, and long = 30 nt). MEME produced four motifs per search, and the results for each combination of parameters were manually curated to identify sites that were most consistent with characteristics of known LacI binding sites, namely the palindromic property of the site and the central CG-pair (Ravcheev et al., 2014). The FASTA-format matrices of putative binding sites were then used to create sequence logos with WebLogo (Crooks et al., 2004) and to design Cy5-marked DNA probes containing the consensus binding site for each LacI-COG. A series of LacI-family TFs were selected to experimentally validate the predicted DNA-protein interaction through Electrophoretic Mobility Shift Assays (EMSAs). Finally, an additional round of analysis was performed for COGs that required manual inspection of the gene locus organization in order to extract the appropriate upstream region (see step 8 in Figure 1 and Figure 2).
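The six MEME searches per COG (two occurrence models times three search lengths) can be enumerated as in the following sketch. The options `-dna`, `-oc`, `-mod`, `-nmotifs` and `-w` are standard MEME command-line options, but mapping each "search length" to a fixed motif width with `-w` is an assumption (the exact width options used are not stated here), and the function name, file name and output layout are hypothetical.

```python
from itertools import product
from typing import List

MODELS = ("zoops", "anr")                          # occurrence models named in the text
WIDTHS = {"small": 10, "medium": 20, "long": 30}   # search lengths in nt


def meme_commands(fasta_path: str, out_root: str,
                  n_motifs: int = 4) -> List[List[str]]:
    """Build one MEME command line per (model, width) combination."""
    commands = []
    for model, (label, width) in product(MODELS, WIDTHS.items()):
        commands.append([
            "meme", fasta_path, "-dna",
            "-oc", f"{out_root}/{model}_{label}",  # one output directory per search
            "-mod", model,
            "-nmotifs", str(n_motifs),             # four motifs per search
            "-w", str(width),                      # assumption: fixed motif width
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where MEME is installed. The 24 resulting motifs per COG (6 searches of 4 motifs) are the candidates that were then curated manually.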
de novo identification of binding sites of LacI-family TFs in streptomycetes
LacI-family transcription factor identification
LacI-family TFs were identified by the presence of a typical LacI helix-turn-helix motif (PF00356) in the N-terminal DNA-binding domain of the protein sequence. As expected for the Streptomyces genus, in which sugar catabolism regulation is essential for adaptation to diverse environments (Hodgson, 2000; van der Meij et al., 2017), LacI TFs were identified in all 182 studied complete genomes (supplementary Figure S1). However, there was a great disparity in the number of LacI regulators identified. Streptomyces bingchenggensis (BCW-1) possesses 69 LacI TFs, while Streptomyces olivoreticuli (subsp. olivoreticuli strain=ATCC 31159) only encodes 6 LacI genes (Figure S1). This difference cannot be explained by genome size alone, as there is no correlation between the length of the chromosome and the relative abundance of LacI TFs (11.9 Mb/9692 genes and 8.8 Mb/7102 genes for S. bingchenggensis and S. olivoreticuli, respectively).
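The family assignment rule can be sketched in a few lines of Python. This is an illustration and not the pipeline used in this work: the function name and the `require_ligand_domain` switch are hypothetical, and the Pfam accessions are written in their full form (PF00356 for the LacI helix-turn-helix domain; PF00532, PF13377 and PF13407 for the periplasmic binding protein domains). The input is assumed to be the set of Pfam accessions reported for one protein by hmmscan, possibly carrying a version suffix such as `PF00356.24`.

```python
from typing import Iterable

LACI_HTH = "PF00356"                                 # LacI helix-turn-helix DNA-binding domain
PERIPLASMIC_BP = {"PF00532", "PF13377", "PF13407"}   # periplasmic binding protein domains


def is_laci_tf(pfam_hits: Iterable[str],
               require_ligand_domain: bool = False) -> bool:
    """Decide whether a protein belongs to the LacI family.

    Relaxed rule (default): the LacI HTH domain alone is sufficient.
    Strict rule: the HTH domain must be combined with a periplasmic
    binding protein domain.
    """
    # Drop version suffixes such as ".24" before comparing accessions.
    domains = {hit.split(".")[0] for hit in pfam_hits}
    if LACI_HTH not in domains:
        return False
    if require_ligand_domain:
        return bool(domains & PERIPLASMIC_BP)
    return True
```

The default corresponds to the relaxed criterion applied in this section, whereas `require_ligand_domain=True` corresponds to the full signature domain combination.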
In total, in 182 Streptomyces strains, 4403 LacI TFs were identified, grouped into 167 COGs. Among these, only 5 (~3% of all LacI TFs) have been subject to studies in Streptomyces species, i.e., (i) the galactomannan/mannobiose/mannose utilization repressor ManR (LacI003 in Table 1, conserved in 177/182 species) (Ohashi et al., 2021), (ii) the maltose/maltodextrin catabolism pathway regulator MalR (LacI005 in Table 1, conserved in 176/182 species) (Nguyen, 1999; Nguyen et al., 1997; Schlösser et al., 2001; van Wezel, White, Bibb, et al., 1997; van Wezel, White, Young, et al., 1997), (iii) the cellulose/cello-oligosaccharide utilisation regulator CebR (LacI006 in Table 1, conserved in 153/182 species) (Book et al., 2016; Francis et al., 2015; Jourdan et al., 2016; Marushima et al., 2009; Schlösser et al., 2000), (iv) the xylan/xylo-oligosaccharide utilization repressor BxlR (LacI015 in Table 1, conserved in 88/182 species) (Giannotta et al., 1996, 2003; Tsujibo et al., 2004), and (v) the agar-utilisation regulator DagR (LacI139 in Table 1) (Tsevelkhoroloo et al., 2021), the latter being one of the rarest LacI TFs, only conserved in two Streptomyces species. Strikingly, the function of the two most conserved LacI TFs (LacI001 and LacI002 in Table 1) is unknown, further illustrating the lack of knowledge about transcriptional regulation in this well-studied bacterial genus. Remarkably, 25 LacI TFs were only present in one single species, meaning they were part of “orphan” COGs containing only that single gene. In these cases, it is inherently impossible to perform a comparative genomics approach, which requires the comparison of two or more sequences.
Identification of TF binding sites
For each of the 167 LacI-family COGs, a set of upstream regions was extracted with varying lengths as described in the Methodology section. This resulted in 138 sets of two or more upstream regions. In the remaining cases, the COG was either orphan (one gene), or it contained only one gene, or none, for which an upstream region was present. This happens when the TF is co-transcribed with other genes in its transcription unit. As explained above, three maximum lengths of upstream sequences were tested for the MEME analysis, but overall, a maximum length of 300 bp (halted whenever an upstream coding region was encountered) yielded the best results in terms of number of discovered motifs and their resolution. This was supported by the previous observation of Ravcheev et al. (2014) that LacI binding sites are rarely found beyond 300 nucleotides upstream of the target gene, or after the beginning of the coding region.
In order to first assess the reliability of our de novo approach, we singled out the studied LacI regulators (ManR, MalR, CebR, BxlR, and DagR), and checked if the motifs we generated using our in-silico approach correspond to their experimentally determined cis-acting sequences. As presented in Table 1, for ManR (LacI003), CebR (LacI006), and BxlR (LacI015), the de novo identified motifs were identical to their experimentally identified consensus sequences, i.e., GACAACGTTGTC (Ohashi et al., 2021), TGGGAGCGCTCCCA (Schlösser et al., 2000), and CGAA-Nx-TTCG (Giannotta et al., 1996, 2003; Tsujibo et al., 2004), respectively. For MalR (LacI005), the two binding sites deduced by DNase footprinting assays (Schlösser et al., 2001) were also found (see Table 1), further confirming that our approach is appropriate for deducing over-represented motifs that closely relate to the ones that were experimentally identified. In the case of DagR, its DNA-binding site was not identified during our first manual inspection of MEME-generated motifs. Indeed, this TF is only present in two strains (S. coelicolor and S. bingchenggensis), meaning there were only two upstream regions to align. In this case, MEME is often not able to distinguish motifs found by chance from potentially biologically significant ones, causing proposed motifs to have very high E-values. Hence, it was only upon re-examination of the four motifs proposed by MEME that we identified the one that corresponded to one of the validated binding sites of DagR (LacI139), AACCGGTT (Tsevelkhoroloo et al., 2021).
Of the 133 unstudied LacI-COGs for which two or more upstream sequences could be extracted, one or two putative binding site(s) in their upstream region were found for 82 (~62%) of them (Table 1). In addition, 9 motifs were further identified (6) or improved (3) by extracting the upstream region of the first gene of a transcriptional unit (operon) that contains the TF gene (see below in the next section), bringing the total number of COGs with a predicted TFBS to 88 (~66%). Based on the previously defined characteristics of LacI TFBSs (central CG pair and inverted repeat sequence), we defined different “reliability groups” for the predicted motifs (categories A, B, and C in Table 1) that we think reflect the probability of the site being bound by its cognate TF. For example, the TGTGACCGGTCACA conserved motif found upstream of LacI059 orthologs presents a 14 bp perfect inverted repeat centred on a CG pair. For over 70% of LacI-COGs, the predicted motif is considered to be highly reliable (assigned A in Table 1), as it possesses both characteristics. TF-TFBS couples have a lower predicted reliability if one of these two characteristics is missing, which was the case for 11 LacI-COGs (assigned B in Table 1). This is for instance the case of the predicted motifs of LacI 001 and LacI 002. The first, although containing an inverted repeat (GAGCC-N8-GGCTC), lacks the typical central CG-pair, while the second possesses the central CG pair but the left part of the motif shows no symmetry with the right part. For the remaining 9 LacI-COGs, the best motif does not possess either of these two sequence features and, consequently, they have a much lower confidence score (motifs assigned C in Table 1).
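The two criteria behind the reliability groups can be expressed as a short Python sketch. In this work the motifs were curated manually from MEME output and sequence logos, so the functions below are only a simplified illustration applied to a consensus string. The function names are hypothetical, spacer positions are written as `N` and accepted as matching any base, and a perfect inverted repeat is required, which is stricter than a visual inspection of a logo.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_inverted_repeat(site: str) -> bool:
    """True if the site equals its reverse complement ('N' matches anything)."""
    site = site.upper()
    n = len(site)
    for i in range(n // 2):
        left, right = site[i], site[n - 1 - i]
        if left == "N" or right == "N":
            continue
        if PAIR.get(left) != right:
            return False
    return True


def has_central_cg(site: str) -> bool:
    """True if an even-length site carries a CG pair at its exact centre."""
    site = site.upper()
    n = len(site)
    return n % 2 == 0 and site[n // 2 - 1:n // 2 + 1] == "CG"


def reliability_class(site: str) -> str:
    """A: both hallmarks present, B: exactly one, C: none."""
    score = int(is_inverted_repeat(site)) + int(has_central_cg(site))
    return {2: "A", 1: "B", 0: "C"}[score]
```

With these definitions, `reliability_class("TGTGACCGGTCACA")` returns `"A"` for the LacI059 motif, and `reliability_class("GAGCCNNNNNNNNGGCTC")` returns `"B"` for the LacI 001 motif written with its eight-nucleotide spacer, in agreement with the categories described above.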
Improvement round by inspection of TF genetic locus organization
Around 40% of LacI-COGs did not yield any potential binding site using our approach (see Figure 2A). In most cases (LacI120 – LacI142 in Figure 2A) the size of the COG was so small (2 or 3 members in the COG) that, as demonstrated with the DagR example discussed above, MEME likely could not distinguish motifs occurring by chance from biologically significant ones. Indeed, when the number of representatives of one COG is too small, the entire region upstream of the TF is usually conserved, which prevents the identification of the functional conserved cis-acting elements. Nonetheless, there are a number of COGs for which we unexpectedly did not find an over-represented motif. Although this could simply be due to the lack of autoregulation for these COGs, further investigation revealed that in some cases, the average length of the region upstream of these COGs was smaller than for COGs for which we could find a conserved motif (Figure 2A). Indeed, LacI-TFs are typically encoded in the divergent direction of the genes of the operon they regulate, and through binding to the cis-acting element in the intergenic region between their own gene and the upstream gene, they can control both transcription units in concert. However, the genetic organization is not always as such, and the TF can sometimes be found in between other genes belonging to the same transcription unit or even in the last position of the latter. Figure 2B illustrates the LacI COGs where the operon organization clearly prevented the identification of a binding site in the upstream region of the TF-encoding gene. In these cases, the TF is still likely to bind to the region upstream of the set of genes that constitutes the whole transcription unit to which the TF-encoding gene belongs. Therefore, the correct search region is not the upstream region of the TF gene, but the upstream region of the transcription unit.
With this in mind, we selected the COG that is mostly present in the first position of the transcription unit in order to repeat the upstream region extraction and the MEME analysis. The selected examples where this additional round allowed the identification of a conserved motif or the modification of the motif originally found are presented in Figure 2B. Notably, for six of the selected examples, this additional round of manual inspection allowed the identification of five class A motifs (022, 029, 042, 043, and 054) and one class B motif (056) (Figure 2B and Table 1). The remaining three examples involve COGs for which a motif was discovered through the direct extraction of the TF gene upstream region (012, 037 and 038). However, this additional round enabled the improvement of two of the motifs (for 012 and 038), and the identification of a second binding site for LacI 037. Although this second site contains a well-conserved CG-pair in the centre, the left part of the palindrome can only be guessed from the sequence logo, which places this motif in the B category. In this case, the additional round brought more ambiguity to the predicted TFBS, and which one is the true binding site for LacI 037 remains to be determined. For LacI 012 and LacI 038, the motifs that MEME proposed were very similar to the ones uncovered the first time. Hence, this further strengthens our confidence in the palindromic sequence that was initially found. Finally, among the other COGs that were selected, LacI 013 represents a very peculiar case as it is part of the malEFG operon divergently transcribed from the gene encoding MalR (belonging to the COG LacI 005). As a consequence, the examination of this operon’s regulatory region only highlighted the MalR binding site again, with LacI 013 possibly competing for the same site or targeting a site residing elsewhere in the chromosome. 
Nonetheless, this additional manual check remains essential in cases where the operon’s organization deviates from the “typical” topology. This enabled us to predict 7 additional binding sites for LacI TFs, and to strengthen our confidence in two of the previously identified binding sites.
Experimental validation of new TF-TFBS couples
In total, 41 LacI-TFs were selected for protein-DNA interaction study by EMSAs. Proteins were assessed for their production levels in different culture conditions (temperature, incubation time post induction) in order to choose one where a majority of them were produced. Their solubility, degree of purification, and stability as pure proteins after mid- or long-term storage at −20°C were also assessed after purification. According to these criteria, 16 6His-tagged LacI-TFs were retained for EMSAs (Figure 3). DNA probes containing the MEME predicted binding site and tagged with Cy5 were incubated with increasing concentrations of their respective purified LacI-TFs as described previously (Francis et al., 2015; Tenconi et al., 2015). DNA-protein interactions were observed using an ImageQuant™ LAS 4000, by detecting the fluorescence emission of the Cy5-tag using a 670 nm detection filter. ManR (LacI 003, Figure 3 second panel) was used as a positive control for the EMSA method, as its cognate palindromic motif GACAACGTTGTC has been recently confirmed experimentally (Ohashi et al., 2021). Interestingly, no retardation could be observed for LacI 001 (panel 1 in Figure 3), whose binding site is classified in the B category because of the lack of a central CG-pair. For the remaining 14 tested TF-TFBS couples, a retardation band could be observed. The high success rate of the DNA-protein interaction assays (15 of the 16 tested couples, positive control included) demonstrates that the AURTHO approach is an appropriate way of discovering highly reliable TFBSs for unstudied TFs.
CONCLUSIONS AND PERSPECTIVES
Identifying the DNA sequence bound by a TF is key to unveiling novel regulatory pathways and attributing novel biological functions to genes/proteins that belong to a regulon. In this work, we assessed the extent to which a de novo methodology based on the assumption that a large proportion of TFs control their own expression would be able to provide a reliable candidate TFBS for a TF with unknown function. The AURTHO approach drastically narrows down the searched regions for TFBSs in the bacterial chromosome, mainly focusing the DNA motif enrichment analysis within the upstream region of the TF of interest. Using TFs that are members of the LacI family in the Streptomyces genus as a case study, we identified 88 highly reliable TFBSs that possess the hallmarks of most LacI family regulator binding sites, i.e., a CG pair centered in a symmetric dyad. All the DNA probes tested containing a motif with these sequence characteristics showed positive and specific interaction through EMSAs with their associated pure LacI TF, thereby demonstrating the high reliability of the predicted TFBSs. Hence, our approach showcases a very high potential for revealing the DNA sequences bound by a transcriptional regulator, as before our work, about four decades of study managed to reveal the TFBS of only 5 LacI-family TFs in Streptomyces species. This represents a potential improvement of 18-fold compared to the current state of knowledge. The main limitation resides in the number of members within a COG, which directly affects the number of upstream regions to align for finding a conserved motif. When we initiated this work in 2018, 90 Streptomyces complete genomes were available and from these data, 53 motifs were predicted from 172 COGs (orphan COGs included). Little more than a year later (October 2019), the number of complete genomes from this genus had roughly doubled (182 genomes, this work), and the AURTHO methodology yielded 90 motifs for 167 COGs (orphan COGs included). 
As the number of COGs negligibly changed (~3%) between both analyses while the number of motifs found almost doubled, this considerable improvement has to be imputed to the substantial portion of COGs that were not orphan anymore which allowed our methodology to be applicable. This reflects that the successive rounds of the AURTHO approach will become more and more successful at predicting putative cis-elements as the number of available genomes of one taxon increases.
One crucial question when applying phylogenetic footprinting is the choice of the phylogenetic distance between the taxa selected for analysis. Indeed, the analyzed species can be neither too closely related (too much conservation in the regulatory region, alignment uninformative), nor too distant (the regulatory element will not be conserved). We show that, when a study is focused on a specific bacterial genus, the AURTHO approach is very potent at highlighting taxon-specific regulatory interactions, compared to the ones available in the RegPrecise database. In the latter, only 11 LacI TFs in the Streptomyces genus have been highlighted through regulon reconstruction and propagation, all of which are highly conserved and probably have orthologs in other genera. As shown in Figure 4, for 10 of them, the motifs proposed by both approaches were either identical or highly similar. The remarkable exception relates to the second most conserved Streptomyces LacI-COG (002, with SCO4158 as the representative member). Indeed, the RegPrecise motif for this regulator is a palindromic and CG-centred sequence (TCTACGCGCGTAGA), while our predicted motif (CGCGTAGACT) partially corresponds to the right half of the palindrome, the other half being degenerated and not conserved (Figure 4). The possible lack of autoregulation of LacI002 raises the question of whether this regulator is a global one (high number of target genes whose functions pertain to different cellular processes), as it has been suggested that global LacI TFs are less likely (~50%) to use an autoregulatory mechanism, compared to local regulators (~75%) (Ravcheev et al., 2014). And indeed, preliminary regulon identification revealed that the scope of the regulatory action of SCO4158 is extensive. Therefore, identifying the TFBS of global regulators via an analysis of their upstream region could be relatively less successful at providing reliable candidate motifs. 
This result suggests that both “AURTHO” and “regulon propagation and reconstruction” approaches are complementary, the latter being more adequate when focusing on global regulators with conserved regulatory interactions across a more phylogenetically diverse group.
Past studies on the conservation of TF-TFBS couples in distant bacterial groups suggest that the AURTHO approach will also generate a similar rate of success/reliability when applied to orthologues that do not belong to a single genus (Bertram et al., 2011; Urem et al., 2016). Additionally, autoregulation has also been frequently observed for TFs belonging to other families, such as GntR, MerR, MarR, and IclR, among many others. Some binding sites from these families have also been characterized, hence using the hallmarks of these TFBSs combined with the AURTHO approach is expected to increase the discovery rate of novel conserved motifs in these families as well. Overall, the results presented in this work suggest that the AURTHO approach will greatly facilitate the discovery of a plethora of cis-acting elements in all bacterial genera.
MATERIALS AND METHODS
Bioinformatics
Genome assemblies belonging to the genus Streptomyces were downloaded from the NCBI database and filtered based on the “Complete Genome” (assembly_level) and “latest” (version_status) tags in the assembly summary file. Proteinortho (v6.0.8) was used to compare all protein sequences and cluster them into orthologous groups (COGs). This version uses diamond (v0.9.36) as a default sequence aligner, and clusters groups based on the reciprocal best alignment heuristic (RBAH) (Lechner et al., 2011). HMMER3 was used to perform hmmscan on all proteins and identify protein domains by comparing them to the domain profiles in the Pfam-A database (Eddy, 2011; El-Gebali et al., 2019; Mistry et al., 2021). The P2TF database was used as a guide for TF identification based on the proteins’ domain combinations (Ortet et al., 2012). The MEME software (Multiple EM for Motif Elicitation, v5.1.0) was used to align upstream regions of identified LacI TF genes and identify putative transcription factor binding sites (Bailey et al., 2015; Bailey & Elkan, 1994). Based on the previously described LacI TFBS hallmarks, the most probable motif(s) were selected and downloaded in FASTA format, and a sequence logo was created with WebLogo3 (WebLogo v3.5.0) (Crooks et al., 2004). Position Weight Matrices (PWMs) were calculated in R using the Biostrings package (https://bioconductor.org/packages/Biostrings) and expressed as a log-likelihood (Wasserman & Sandelin, 2004). The PWMs calculated with different background nucleotide probabilities (reflecting either a 50% or a 71.3% GC content, the latter being the average GC content in the Streptomyces genus) are available in Supplementary Files (pwm50.tar.gz and pwm71.tar.gz).
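The effect of the background nucleotide probabilities on a log-likelihood PWM can be illustrated with the following Python sketch. It is not the Biostrings-based calculation used in this work: the function names are hypothetical, the pseudocount of 1 distributed according to the background is an assumption made to avoid taking the logarithm of zero, and each entry is computed as the base-2 logarithm of the corrected position frequency divided by the background frequency.

```python
import math
from typing import Dict, List, Sequence


def background(gc_content: float) -> Dict[str, float]:
    """Background base probabilities for a given GC fraction (e.g. 0.713)."""
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    return {"A": at, "C": gc, "G": gc, "T": at}


def log_odds_pwm(sites: Sequence[str], gc_content: float = 0.713,
                 pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Log2-odds PWM, one dict per motif position, from aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    bg = background(gc_content)
    n = len(sites)
    pwm = []
    for j in range(width):
        column = [s[j].upper() for s in sites]
        row = {}
        for base in "ACGT":
            # Pseudocount weighted by the background (an assumption).
            freq = (column.count(base) + pseudocount * bg[base]) / (n + pseudocount)
            row[base] = math.log2(freq / bg[base])
        pwm.append(row)
    return pwm


def score_site(pwm: List[Dict[str, float]], site: str) -> float:
    """Sum of the log-odds weights of `site` along the matrix."""
    return sum(col[base] for col, base in zip(pwm, site.upper()))
```

With a 71.3% GC background, a conserved G or C contributes less to the score than the same position evaluated against a 50% GC background, while a conserved A or T contributes more. This is why a GC-adjusted matrix is better suited to scanning GC-rich Streptomyces genomes.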
Heterologous production and purification of His-tagged proteins
The 41 LacI-TF genes selected for DNA-protein interaction studies are listed in Table S1. Forty of them were ordered from Twist Bioscience as codon-optimized sequences cloned between the NdeI and XhoI restriction sites of pET-28a for heterologous production in E. coli BL21(DE3). In addition, we used pSIN002, in which the original sequence of SCO1078 was cloned into pET-22b (between the NdeI and HindIII restriction sites), for heterologous production in the BL21 Rosetta™ (DE3) strain of E. coli. All proteins were 6His-tagged at their C-terminal extremity, enabling Immobilised Metal Affinity Chromatography (IMAC) purification on a Ni-NTA column from Cytiva (HisTrap™ HP). Transformed E. coli strains were inoculated in TB (Terrific Broth) supplemented with the appropriate antibiotics for plasmid selection (kanamycin for pET-28a, ampicillin and chloramphenicol for pET-22b and pLysS-containing E. coli Rosetta™ strains). Production was induced with 1 mM IPTG when the culture reached an optical density of 0.8 (at 600 nm), and the culture was left overnight at 37 °C. The next day, pelleted cells (10,000 rpm, 30 min, 4 °C) were resuspended in 50 mL of equilibration buffer (see below for composition) and lysed using a high-pressure homogeniser (Avestin Emulsiflex C3). After another round of centrifugation (18,000 rpm, 30 min, 4 °C), the supernatant, corresponding to the soluble intracellular fraction of the lysis mixture, was filtered (0.22 µm) before IMAC purification. The buffers used for protein purification had the following composition: (i) equilibration buffer (50 mM phosphate buffer, 20 mM imidazole, 1 M NaCl, pH 7.5), (ii) wash buffer (50 mM phosphate buffer, 20 mM imidazole, 2 M NaCl, pH 7.5), (iii) elution buffer (50 mM phosphate buffer, 500 mM imidazole, 150 mM NaCl, pH 7.5). Protein purification was performed on NGC Quest 10 and NGC Quest 100 chromatography systems (Bio-Rad) at the Protein Factory platform (InBioS-CIP, ULiège).
Fractions selected on the basis of the absorbance at 280 nm in the elution profile were loaded on SDS-PAGE gels (Mini-PROTEAN® TGX™ Precast Gels, Bio-Rad) to assess their purity, and the most concentrated ones were desalted using a HiPrep™ 26/10 desalting column (packed with Sephadex® G-25 Fine) from Cytiva. The resulting desalted fractions in EMSA buffer (10 mM Tris pH 7.5, 50 mM KCl, 1 mM DTT, 2% glycerol, 0.25 mM CaCl2, 0.5 mM MgCl2) were analysed by SDS-PAGE (Mini-PROTEAN® TGX™ Precast Gels, Bio-Rad) for purity, and only the most concentrated and pure fractions were collected and used for DNA-protein interaction studies.
Electrophoretic mobility shift assays
DNA probes were designed using the predicted binding sites for each of the selected LacI COGs. For each COG, a matrix of possible binding sites (in FASTA format) was downloaded from MEME and used to create a WebLogo, from which we deduced the consensus sequence for designing the probe. In cases where no nucleotide was overrepresented at a specific position, we chose the nucleotide complementary to the nucleotide conserved in the other half of the motif in order to bring the probe closer to a dyad symmetry. The primers (Eurogentec, Seraing, Belgium) used to generate the DNA probes are listed in supplementary Table S2. The interaction reactions between pure 6His-tagged proteins and the Cy5-labelled DNA probe containing their predicted binding site were performed in EMSA buffer (10 mM Tris pH 7.5, 50 mM KCl, 1 mM DTT, 2% glycerol, 0.25 mM CaCl2, 0.5 mM MgCl2), as described previously (Francis et al., 2015; Tenconi et al., 2015). The final EMSA samples were incubated at room temperature for 15 min and contained 12.5 nM of hybridized probe, 1.5 mM of non-specific protein (bovine serum albumin, BSA), 10 mg of non-specific DNA (sheared salmon sperm DNA, Invitrogen™), representing a 400-fold excess compared to the probe, and increasing concentrations of protein (obtained by performing two-fold serial dilutions of the fraction with the highest concentration of protein). After migration in a 1% agarose gel, the free and retarded bands were visualized using a fluorescence imager (GE Healthcare), detecting the Cy5-labelled DNA probes at a wavelength of 670 nm.
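The consensus rule used for probe design can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure actually used, which relied on visual inspection of the WebLogo: the function name, the example sites and the 50% threshold that decides whether a nucleotide counts as overrepresented are all hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def dyad_consensus(sites, min_fraction=0.5):
    """Derive a consensus from aligned sites, favouring dyad symmetry.

    A position is called 'conserved' when its most frequent nucleotide
    occurs in more than min_fraction of the sites (assumed threshold).
    A non-conserved position takes the complement of the nucleotide at the
    mirrored position, provided that mirrored position is conserved.
    """
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")

    # First pass: most frequent nucleotide per column and whether it is conserved.
    calls = []
    for pos in range(width):
        counts = Counter(site[pos].upper() for site in sites)
        base, n = counts.most_common(1)[0]
        calls.append((base, n / len(sites) > min_fraction))

    # Second pass: fill weak positions from the other half of the motif.
    consensus = []
    for pos, (base, conserved) in enumerate(calls):
        mirror_base, mirror_conserved = calls[width - 1 - pos]
        if not conserved and mirror_conserved:
            base = COMPLEMENT[mirror_base]
        consensus.append(base)
    return "".join(consensus)


def reverse_complement(sequence):
    """Reverse complement of an A/C/G/T sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence.upper()))


# Hypothetical sites: position 4 is split between C and G, position 3 is always T.
example_sites = ["TGTCCA", "TGTCCA", "TGTGCA", "TGTGCA"]
probe_core = dyad_consensus(example_sites)
```

In this example the fourth position has no overrepresented nucleotide, so it receives A, the complement of the conserved T at the mirrored third position, and the resulting core is identical to its own reverse complement.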
DATA AVAILABILITY
All in-house scripts that were used to generate the data (genome download, COG creation, TF family identification, upstream sequence extraction, MEME analysis) are available on GitHub (https://github.com/SinaedaA/AURTHO), as well as a markdown file retracing all steps of the AURTHO methodology.
FUNDING
This work was supported by the ‘Fonds De La Recherche Scientifique – FNRS’ [FRIA 1.E.031.18-20 to SA and SR, R.FNRS.5240 to SR]; and the ‘Gouvernement Wallon’ [1510530 to AN and SR].
SUPPLEMENTARY DATA
ACKNOWLEDGMENTS
We are grateful to the teams working at the Protein Factory platform (https://www.proteinfactory.uliege.be/cms/c_14301576/en/proteinfactory) and the Robotein platform (https://www.robotein.uliege.be/cms/c_14301428/en/robotein) for productive discussions and assistance in experimental design. S.R. is a Fonds de la Recherche Scientifique (FRS-FNRS) senior research associate.