ABSTRACT
Genome-wide transcriptomic analyses have revealed abundant expressed short open reading frames (ORFs) in bacteria. Whether these short ORFs, or the small proteins they encode, are functional remains an open question. One quarter of mycobacterial mRNAs are leaderless, meaning the RNAs begin with a 5’-AUG or GUG initiation codon. Leaderless mRNAs often encode an unannotated short ORF as the first gene of a polycistronic transcript. Consecutive cysteine codons are highly overrepresented in mycobacterial leaderless short ORFs. Here we show that polycysteine-encoding leaderless short ORFs function as cysteine-responsive attenuators of operonic gene expression. Through detailed mutational analysis, we show that one such polycysteine-encoding short ORF controls expression of the downstream genes by causing ribosome stalling under conditions of low cysteine. Ribosome stalling in turn blocks mRNA secondary structures that otherwise sequester the Shine-Dalgarno ribosome-binding site of the 3’ gene. This translational attenuation does not require competing transcriptional terminator formation, a mechanism that underlies traditional amino acid attenuation systems. We further assessed cysteine attenuation in Mycobacterium smegmatis using mass spectrometry to evaluate endogenous proteomic responses. Notably, six cysteine metabolic loci that have unannotated polycysteine-encoding leaderless short ORF architectures responded to cysteine supplementation/limitation, indicating that cysteine-responsive attenuation is widespread in mycobacteria. Individual leaderless short ORFs confer independent operon-level control, while their shared dependence on cysteine ensures a collective response. Bottom-up regulon coordination is the antithesis of traditional top-down master regulator regulons and illustrates one utility of the many unannotated short ORFs expressed in bacterial genomes.
INTRODUCTION
Short open reading frames (sORFs) are extremely difficult to computationally identify in genome sequences; their shortened gene length approaches statistical random ORF background frequencies and their amino acid sequences have limited bioinformatic value (Frith et al. 2006; Hemm et al. 2008; Hobbs et al. 2011; Crappe et al. 2013).
Conventional mass spectrometry proteomic studies also systematically underrepresent the small proteins (sproteins) encoded by sORFs, as they can be lost during sample preparation or provide too few detectable peptides (Hemm et al. 2010). These limitations have contributed to a lag in both (i) recognition of sORFs and (ii) assessment of their functional potential. This knowledge gap has been underscored by recent descriptions of many potential novel sORFs in Escherichia coli and in bacteria found in the human microbiome (Meydan et al. 2019; Sberro et al. 2019; Weaver et al. 2019). Previous work applying complementary transcriptomic approaches to mycobacteria, in both slow-growing Mycobacterium tuberculosis and fast-growing Mycobacterium smegmatis and Mycobacterium abscessus, suggested that their genomes contained hundreds of sORFs actively producing sproteins (Shell et al. 2015; Miranda-CasoLuengo et al. 2016). Ribosome profiling (Ribo-seq), together with RNA-seq and transcription start site (TSS) mapping, provided genome-wide empirical evidence for the location and ribosome occupancy of mycobacterial mRNAs that express sproteins (Cortes et al. 2013; Shell et al. 2015; Miranda-CasoLuengo et al. 2016).
Many ORFs initiated by leaderless mRNAs (LL-mRNAs) in mycobacteria are short (defined here as less than 150 nt) and unannotated. Importantly, the ORFs initiated at the combined transcription/translation initiation start sites of LL-mRNAs are readily identifiable from transcriptomic data sets and thus provide a high-confidence list of expressed novel mycobacterial sORFs. Insights into the mechanistic and functional attributes of LL-mRNAs have lagged, as they are rare or poorly expressed in E. coli but are abundant in archaea, Actinobacteria and extremophiles (Beck and Moll 2018). LL-sORFs represent the first (5’-most) ORF of a transcript, which positions them for a role in cis-regulation of the downstream operonic genes. There is ample precedent in eukaryotes for regulation of downstream genes by such “upstream ORFs” (uORFs) (Hinnebusch et al. 2016; Couso and Patraquim 2017). In prokaryotes, uORFs have likewise been shown to regulate expression of downstream genes through a process known as attenuation (Oppenheim and Yanofsky 1980; Bechhofer 1990; Henkin and Yanofsky 2002).
Attenuation is a cis-regulatory mechanism often mediated by short uORFs enriched in codons for the amino acid product of that biosynthetic operon. Attenuation occurs when abundant charged tRNA levels allow translating ribosomes to quickly clear the modulating uORF, promoting the formation of an intrinsic terminator that aborts transcription of the operon. Low levels of charged tRNA cause ribosome stalling in the uORF at codons for the end-product amino acid, facilitating the formation of a competing anti-terminator structure, thereby releasing attenuation to allow transcription to extend into the biosynthetic operon (Turnbough 2019). uORF-mediated attenuation mechanisms for cysteine have not been described. A subset of predicted mycobacterial LL-sORFs conspicuously encode consecutive cysteine residues, and these were found upstream of genes annotated to be involved in cysteine biosynthesis (Shell et al. 2015). We hypothesized that these LL-sORFs function as cysteine-sensitive attenuators.
RESULTS
We identified 304 putative LL-sORFs in Mycobacterium smegmatis (Supplementary Information Table 1; Shell et al. 2015; Martini et al. 2019). We compared amino acid content of the encoded sproteins and found that consecutive cysteines were overrepresented relative to cysteine content (chi-square p < 0.01, Extended Data Table 1) and relative to consecutive cysteine frequency in annotated genes.
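The overrepresentation of consecutive cysteines described above could be tested along the following lines. This is a minimal Python sketch, not the analysis used for Extended Data Table 1: it assumes a two-category goodness-of-fit design (adjacent Cys-Cys pairs versus all other adjacent pairs) with the expected Cys-Cys count derived from the overall cysteine frequency, and the sequences are hypothetical toy data.

```python
import math


def count_cc_pairs(proteins):
    """Return (observed Cys-Cys pairs, total adjacent pairs, cysteine frequency)."""
    cc = pairs = cys = residues = 0
    for seq in proteins:
        residues += len(seq)
        cys += seq.count("C")
        for a, b in zip(seq, seq[1:]):
            pairs += 1
            if a == "C" and b == "C":
                cc += 1
    return cc, pairs, cys / residues


def chi_square_cc(proteins):
    """One-degree-of-freedom chi-square test for excess adjacent Cys-Cys pairs.

    Assumption: under the null, residues are independent, so the expected
    number of Cys-Cys pairs is (adjacent pairs) x (cysteine frequency)^2.
    """
    cc, pairs, f_cys = count_cc_pairs(proteins)
    expected_cc = pairs * f_cys ** 2
    if expected_cc == 0:
        raise ValueError("no cysteine present; test is undefined")
    expected_other = pairs - expected_cc
    observed_other = pairs - cc
    chi2 = ((cc - expected_cc) ** 2 / expected_cc
            + (observed_other - expected_other) ** 2 / expected_other)
    # Survival function of chi-square with 1 df, written with erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, cc, expected_cc


# Hypothetical small-protein sequences, for illustration only.
toy_sproteins = ["MTSRAACCCCCCCC", "MKLAVRCC", "MSTPLAGR", "MARLVTCA"]
chi2, p_value, observed, expected = chi_square_cc(toy_sproteins)
print(f"observed CC pairs={observed}, expected={expected:.2f}, "
      f"chi2={chi2:.2f}, p={p_value:.4f}")
```

With real data, the same comparison would be repeated against annotated genes to obtain the second reference frequency mentioned above.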
Ms5788 is regulated in response to cysteine abundance
One predicted LL-sORF (here denoted Ms5788A) encodes eight consecutive cysteines in its C-terminus, and is followed by operonic genes, including a putative thiosulfate sulfurtransferase, cysA2 (Fig 1A). RNA-seq and Ribo-seq profiles indicate that this unannotated LL-sORF is abundantly transcribed and translated in M. smegmatis (Extended Data Fig 1 A). Moreover, transcription start site mapping strongly suggests that a homologous LL-sORF is expressed in M. tuberculosis (Extended Data Fig 1B). To determine whether the genes located immediately downstream of Ms5788A are regulated in response to changes in cysteine levels, we generated a luciferase translational reporter in which a constitutive promoter drives a leaderless transcript that begins at the native GUG initiation codon of Ms5788A and continues to the initiation codon of the annotated gene downstream, Ms5788 (Fig 1B). We then measured expression of the reporter in M. smegmatis cells in the presence or absence of cysteine in the growth medium. Luciferase activity decreased for cells grown with cysteine supplementation (Fig 1B i). Hence, we hypothesized that expression of Ms5788 is regulated in response to cysteine levels by an attenuation mechanism involving the upstream LL-sORF Ms5788A.
We next tested whether cysteine-dependent regulation of Ms5788 requires translation of the upstream Ms5788A LL-sORF. We mutated the GUG translation initiation codon of Ms5788A to ACC in the context of the luciferase reporter, to prevent ribosome loading. This non-start mutation reduced luciferase expression, and abolished attenuation (Fig 1B compare i vs ii). These data indicate that translation of Ms5788A is required for cysteine-dependent regulation of Ms5788, and that in the absence of Ms5788A translation, expression of Ms5788 is locked in an attenuated state. We speculated that ribosome occupancy of the C-terminal polycysteine tract of Ms5788A is particularly important for this regulation. We created a nonsense mutant, Ser8Stop, to truncate Ms5788A ten amino acids before the first Cys codon. This truncating mutation also dramatically reduced luciferase expression and attenuation by cysteine (Fig 1B iii). The residual cysteine response of this nonsense mutant may result from translation reinitiation after the stop codon.
Since ribosome occupancy of the LL-sORF appeared to be required for cysteine-dependent regulation of the luciferase reporter, we postulated that ribosomes stalled in the Ms5788A polycysteine tract due to limiting levels of charged tRNAcys would increase luciferase expression by relieving attenuation. We hypothesized that recoding the polycysteine tract should impair the observed cysteine response. We created an out-of-frame (OoF) Ms5788A mutant luciferase reporter to replace the eight consecutive cysteine codons with eight consecutive leucine codons, leaving a single Cys18 codon (Fig 1B iv). This OoF mutant was not affected by cysteine supplementation, indicating the importance of the polycysteine tract in relieving attenuation. Interestingly, expression of this reporter appears to be in the active state, suggesting that limiting leucine may functionally substitute for limiting cysteine.
Ribosome stalling in the Ms5788A sORF modulates RNA structure in the Ms5788 5’ UTR
Ribosome occupancy of Ms5788A could affect the formation of mRNA secondary structures in the Ms5788 mRNA leader, as occurs in previously described attenuation mechanisms (Turnbough 2019). The predicted RNA secondary structure for the mRNA through the Ms5788A LL-sORF and up to Ms5788 indicates the potential for an energetically stable structure over most of its length (Fig 2). Importantly, nucleotides in the polycysteine tract of Ms5788A are predicted to base pair with complementary sequences near the Shine-Dalgarno sequence of Ms5788. This structure suggests a mechanism in which stalled ribosomes in Ms5788A free the Shine-Dalgarno sequence to recruit and position ribosomes for canonical translation initiation of Ms5788.
Whereas the Ms5788A OoF mutant was predicted to have no effect on mRNA structure, we created mutants intended to selectively disrupt predicted duplex pairing near the Ms5788 Shine-Dalgarno sequence (Fig 2). We first changed the invariant guanine in the 2nd position of cysteine codons (UGY) to cytosine (UCY) in the last five codons of Ms5788A (Fig 2, recoded red series bottom strand). The nucleotide changes should reduce the stability of base pair interactions with the Shine-Dalgarno region, while recoding polyserine for the final five codons of Ms5788A: Cys(26-30)Ser. This reporter exhibited constitutively elevated expression that was insensitive to cysteine (Fig 2 ii), consistent with full Shine-Dalgarno sequence accessibility.
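The UGY-to-UCY recoding strategy can be illustrated with a short Python sketch. The ORF below is a hypothetical toy sequence, not the Ms5788A sequence, and the function name is ours; the sketch only shows how changing the second-position guanine of the final cysteine codons converts them to serine codons.

```python
CYS_CODONS = ("UGU", "UGC")


def recode_last_cys_to_ser(orf_rna, n=5):
    """Change G->C at position 2 of the last n cysteine codons (UGY -> UCY)."""
    usable = len(orf_rna) - len(orf_rna) % 3
    codons = [orf_rna[i:i + 3] for i in range(0, usable, 3)]
    cys_positions = [i for i, codon in enumerate(codons) if codon in CYS_CODONS]
    for i in cys_positions[-n:]:
        codons[i] = codons[i][0] + "C" + codons[i][2]
    # Re-attach any trailing nucleotides that did not form a full codon.
    return "".join(codons) + orf_rna[usable:]


# Hypothetical leaderless ORF: GUG start, one Ser codon, eight Cys codons, stop.
toy_orf = "GUGAGC" + "UGCUGU" * 4 + "UGA"
print(recode_last_cys_to_ser(toy_orf))
```

Because UCU and UCC both encode serine, the recoded tract changes the encoded amino acids and the base-pairing potential at the same time, which is why the separate top-strand and compensatory mutants are needed to distinguish the two effects.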
To differentiate the effect of RNA duplex formation from the effect of Ms5788A amino acid recoding, we created a mutant that only disrupted the base pairing, by changing the predicted five cognate nucleotides near the Shine-Dalgarno sequence (Fig 2, recoded red top strand). Even though the polycysteine tract remained intact, luciferase expression from this reporter was insensitive to cysteine supplementation (Fig 2 iii), consistent with the base-paired structure corresponding to the attenuated state. We next constructed a mutant that combined the recoded Ms5788A and cognate non-coding mutants, such that base pairing should be restored. As expected, combining the complementary mutations reduced expression of the luciferase reporter for cells grown in the presence of cysteine, consistent with restored base pairing (Fig 2 iv). Interestingly, the restored response to cysteine indicated that the residual four-cysteine codon content of the LL-sORF was sufficient to confer sensitivity to cysteine.
We constructed a silent Ms5788A mutant in which the nucleotide sequences of six cysteine codons were switched so that polycysteine coding was retained while base pairing with the Shine-Dalgarno region was predicted to be disrupted (Fig 2 polycysteine purple bottom strand). The elevated luciferase activity of this mutant reporter was not attenuated under cysteine-replete conditions, separating the roles of nucleotide and encoded amino acid sequence of Ms5788A (Fig 2 v). We also created a mutant of the predicted complementary bases near the Shine-Dalgarno sequence (Fig 2, polycysteine purple top strand). Surprisingly, this mutant exhibited an attenuation response similar to wild type (Fig 2, compare i and vi). We reassessed the base pairing potential of the mRNA produced from this mutant and found that it was predicted to fold into a stable, wild-type-like structure that would also reduce Shine-Dalgarno sequence availability (Extended Data Fig 2). Combining the Ms5788A and peri-Shine-Dalgarno nucleotide changes in this series resulted in an expression pattern similar to that of the wild-type construct, consistent with the model (Fig 2 vii).
Cysteine-dependent regulation involving Ms5788A affects expression of downstream operonic genes
Classical models of ribosome-mediated attenuation invoke competing mRNA stem loops that form an intrinsic terminator structure when ribosomes rapidly translate the sORF, resulting in regulation at the level of transcription (Yanofsky 1981; Turnbough 2019). In the case of Ms5788 attenuation, regulation appears to occur at the level of translation. To test this, we created a reporter to assess mRNA extension beyond the start of Ms5788. To maintain the predicted RNA structure of the 5’ leader region, an independent Shine-Dalgarno sequence was added to efficiently initiate translation of transcripts that extend into the luciferase gene (Fig 3 ii). Luciferase activity of this reporter was insensitive to cysteine, and expression levels were consistently high. These data indicate that Ms5788 attenuation does not result from transcription termination. Translational repression can indirectly affect transcription over longer distances due to polarity that is caused by Rho-dependent transcription termination and/or by enhanced RNase processing of the untranslated RNA (Deana and Belasco 2005; Martini et al. 2019). To determine if translational repression of Ms5788 by attenuation leads to polar effects on downstream genes, we constructed a translational fusion that extends to Ms5789 (cysA2). This more distal site, 516 nt 3’ of the Ms5788 start, exhibited cysteine responsiveness (Fig 3 iii). Taken together, our data support a model in which Ms5788 is regulated by translational attenuation through Ms5788A-controlled Shine-Dalgarno availability, while Ms5789 is likely regulated by polarity, due to the absence of elongating ribosomes in Ms5788.
Cysteine-dependent regulation of Ms5789 and Ms5790 by Ms5788A occurs in a chromosomal context
To assess whether our reporter-supported model is valid in the native locus context, we performed quantitative mass spectrometry-based proteomics (LFQ) to determine differences in protein expression for cells grown with or without cysteine supplementation. Whole cell extracts were prepared from cultures of M. smegmatis grown in minimal media +/- cysteine supplementation. Tryptic digests of whole cell lysates were subjected to nanoUHPLC-MS/MS, and proteins were identified and quantitated using label-free peak integration (Cox and Mann 2008; Bosserman et al. 2017). As expected, the abundance of most proteins is unchanged between the two conditions (+/- cysteine), reflected in the linear correlation on the diagonal of the scatter plot (Fig 4A). Proteins below the diagonal were more abundant in cells grown without cysteine supplementation. The small (159 AA) and hydrophobic Ms5788 was not detected in these experiments. However, Ms5789 and Ms5790 are predicted to be co-transcribed in an operon with Ms5788 (Martini et al. 2019), and expression of the encoded proteins Ms5789 and Ms5790 was higher in cells grown without cysteine (Fig 4A, yellow diamonds), consistent with the attenuation model.
To test the hypothesis that polycysteines in Ms5788A control Ms5789 and Ms5790 expression, we generated an out-of-frame (OoF) mutation in Ms5788A by deletion of a single nucleotide from the chromosomal locus. This deletion shifts the reading frame to encode valines and alanines in place of the polycysteine tract, and adds six additional amino acids (ERSRAL) prior to encountering a stop codon (Fig 4B), while preserving the potential for nearly wild-type RNA duplex formation. This mutant exhibited low Ms5789 and Ms5790 expression levels consistent with a stable mRNA leader structure sequestering the Shine-Dalgarno of Ms5788. Expression of Ms5789 and Ms5790 was unaffected by cysteine supplementation (Fig 4B, yellow diamonds in diagonal), highlighting the role of cysteine codons in relieving attenuation.
We reasoned that if Ms5788A directs attenuation, its deletion would elevate operon expression of the downstream genes. We created a precise deletion of Ms5788A and subjected this mutant to LFQ proteomics. As predicted, Ms5789 and Ms5790 were no longer responsive to ambient cysteine supplementation, appearing in the diagonal of unresponsive genes (Fig 4C, yellow diamonds in diagonal). Moreover, absolute levels of Ms5789 and Ms5790 were higher in the Δ5788A strain than in either the wild-type or the OoF mutant, regardless of cysteine supplementation. This indicates that the Ms5788A deletion leaves the Ms5788 Shine-Dalgarno sequence fully available for canonical translation initiation, and results in an elevated expression of the operonic Ms5789 and Ms5790 (Fig 4B, C yellow diamonds). Thus, data from wild-type and targeted chromosomal mutant M. smegmatis are in agreement with our reporter-generated data, and strengthen the conclusion that the Ms5788A LL-sORF modulates operonic gene expression through an attenuation mechanism.
Widespread cysteine-dependent regulation in M. smegmatis associated with cysteine-rich LL-sORFs
Given that LL-sORFs in M. smegmatis are enriched for polycysteine, we hypothesized that additional cysteine-responsive genes are regulated by an associated polycysteine LL-sORF. We identified six more LL-sORFs that contain at least two consecutive cysteine codons and are located upstream of annotated genes. We then examined our proteomic data for the protein levels encoded by putative operonic genes downstream of these LL-sORFs for cells grown with/without cysteine supplementation. Remarkably, all of the detectable proteins that were encoded by polycysteine LL-sORF-led operons were upregulated, falling below the diagonal (Fig 4A, black circles), consistent with attenuation release. In cysteine-replete medium, expression of some of these proteins (Ms0113, Ms0934, Ms4527, Ms5279, and Ms5280) was below the threshold of detection for reliable quantification [<10^5 LFQ intensity (a.u.)], indicative of tight attenuation. The polycysteine LL-sORFs exhibit RNA-seq and Ribo-seq expression profiles consistent with their robust expression during cysteine-replete growth (Extended Data Figs 1 and 3). None of these expressed LL-sORFs were identified by genome annotation pipelines. Each of these cysteine-responsive operons contains genes annotated for cysteine-associated activities (Table 1). Collectively, these data reveal a cysteine-metabolic regulon, whose concerted response is controlled independently at each locus by an expressed LL-sORF that includes consecutive cysteine codons. This bottom-up coordinated regulation contrasts with the conventional top-down, master regulator-driven mechanism of transcriptional regulons.
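The search criterion used above (a leaderless ORF that is short and contains at least two consecutive cysteine codons) can be expressed as a small Python sketch. This is an illustration under stated assumptions, not the pipeline used in this study: transcripts are assumed to start at the mapped transcription start site, the 150 nt limit is assumed to include the stop codon, and the example sequences are hypothetical.

```python
CYS_CODONS = {"UGU", "UGC"}
STOP_CODONS = {"UAA", "UAG", "UGA"}
START_CODONS = {"AUG", "GUG"}  # leaderless starts considered in this sketch


def leaderless_orf_codons(transcript):
    """Return sense codons of an ORF starting at the transcript 5' end, or None."""
    rna = transcript.upper().replace("T", "U")
    if rna[:3] not in START_CODONS:
        return None
    codons = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if codon in STOP_CODONS:
            return codons
        codons.append(codon)
    return None  # no in-frame stop codon found


def longest_cys_run(codons):
    """Length of the longest run of consecutive cysteine codons."""
    best = run = 0
    for codon in codons:
        run = run + 1 if codon in CYS_CODONS else 0
        best = max(best, run)
    return best


def is_polycysteine_ll_sorf(transcript, max_nt=150, min_run=2):
    """True if the 5'-terminal ORF is short and has >= min_run consecutive Cys codons."""
    codons = leaderless_orf_codons(transcript)
    if codons is None:
        return False
    orf_nt = 3 * (len(codons) + 1)  # assumption: length includes the stop codon
    return orf_nt < max_nt and longest_cys_run(codons) >= min_run


# Hypothetical transcripts, for illustration only.
print(is_polycysteine_ll_sorf("GTGAGCTGCTGCTGATT"))  # GUG start, two adjacent Cys codons
print(is_polycysteine_ll_sorf("ATGAGCTGCAGCTGA"))    # Cys codons not adjacent
```

Candidates passing such a filter would still need to be checked for expression (RNA-seq and Ribo-seq) and for annotated genes downstream, as was done for the loci in Table 1.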
LFQ proteomics does not comprehensively identify all proteins in a proteome. We looked for additional loci with the same hallmarks of responsive attenuation: an expressed polycysteine LL-sORF upstream of annotated operonic genes. Ms4536A was identified upstream of a single gene (Table 1, Extended Data Fig 3 E). Without corroborating indicators of cysteine response or function of the encoded operon protein, we only speculate at its membership in the M. smegmatis cysteine LL-sORF regulon. The independent evolution of LL-sORFs makes it unlikely that our experimental reference species, M. smegmatis, contains all of the LL-sORF operons in the mycobacterial pangenome. In one example, transcriptomic profile data for M. tuberculosis clearly identified Rv2334A as an unannotated gene meeting all of the criteria demonstrated in the M. smegmatis regulon: an expressed LL-sORF with a C-terminal polycysteine tract, followed by genes annotated as cysK1 and cysE (Table 1, Extended Data Fig 3 G).
The availability of complete genome sequence information on diverse mycobacteria provided an opportunity to track the evolution of these polycysteine sORFs. We searched the genomes of 41 species using complementary approaches predicated on the sequence of each M. smegmatis or M. tuberculosis LL-sORF, or on the position of the flanking annotated orthologous genes as landmarks. We identified genomic sequences consistent with conservation of the LL-sORFs expressed in M. smegmatis and M. tuberculosis (Extended Data Table 2). The distribution of the conserved LL-sORFs is summarized as barcodes adjacent to each species on the Mycobacterium genus phylogenetic tree (Fig 5). Ms4527A, Ms4533A, and Ms5788A are deeply rooted, indicating both an emergence that predates mycobacterial radiation and a selective advantage to retaining these sequences. The presence/absence of others (e.g., Ms0932A or Ms5280A) is consistent with horizontal gene transfer or sporadic gene loss by deletion or degradation. Sequence logos derived from multiple alignments of the amino acid sequences encoded by the LL-sORFs further support the evolutionary selection of their consecutive cysteine tracts (Fig 5). The similarity between Ms4536A and Rv2334A strongly suggests homology by common origin, yet the context, including operon genes, is not homologous. The occurrence of Ms4536A—TQXA or Rv2334A—cysK1-cysE is mutually exclusive, suggesting origin by rearrangement rather than a merodiploid duplication or horizontal gene transfer event. Non-cysteine amino acids are also conserved in some LL-sORFs, suggesting that they provide function through ribosomal interaction, or support activities as an independent small protein product, or are encoded by codons that are constrained at the nucleotide level.
DISCUSSION
An attenuation mechanism that controls translation in response to amino acid availability
In the work presented here, we demonstrate attenuation of a cysteine biosynthesis pathway locus. Attenuation is a recurring theme in biosynthetic pathways for nucleosides and amino acids, in which the end product of the pathway interacts with the 5’ leader of the biosynthesis-encoding transcript to reduce (attenuate) expression of the operonic ORFs downstream (Turnbough 2019). However, these models typically involve the formation of competing alternate hairpin structures that function as an intrinsic terminator when the end product is plentiful. By contrast, the Ms5788A LL-sORF featured here defines a class of translational attenuator that indirectly assesses charged tRNAcys availability to modulate expression of the downstream operonic genes.
The need for cysteine regulation in mycobacteria
Why might mycobacteria need a multi-locus cysteine regulon? Mycobacteria do not produce glutathione, which modulates redox balance in most bacteria. Instead, they rely on mycothiol (MSH), a cysteine derivative (Xu et al. 2011; Loi et al. 2015). The mshA gene (Ms0933) encodes the first enzyme in MSH biosynthesis, and it resides in an operon with the hallmarks of cysteine attenuation (Extended Data Fig 3 B). The multiple roles of cysteine in mycobacterial protein synthesis, redox and sulfur metabolism likely require subtle, independent fine-tuning of the enzymes involved in these respective pathways. The slight differences in cysteine composition and architectures of the polycysteine LL-sORFs could impart varying mechanisms as well as customized levels of operon gene expression.
Attenuating sORF requirements
What are the cysteine codon requirements of an effective small ORF attenuator? Our criterion that LL-sORFs encode two consecutive cysteine codons may seem to be a low threshold, yet the cis-encoded proteins only detectable in cysteine-limiting conditions (baseline of Fig 4A) demonstrate the effectiveness of two consecutive cysteines in LL-sORFs in relieving attenuation. Our Ms5788A LL-sORF analysis demonstrated that base pairing was needed to impose attenuation, but it is premature to speculate that it is required at all responsive loci. Transcription activation was not a regulatory factor at the Ms5788 locus; the promoter remained intact in the Ms5788A mutants in Fig 4 B, C, indicating that all of the cysteine response at this locus is directed by the attenuation mechanism we detailed and not via transcriptional activation. RNA-seq profiles of the LL-sORFs presented here are also consistent with robust constitutive transcription in cysteine-replete medium.
Evolution of a cysteine attenuation regulon
The independent evolution of similar polycysteine LL-sORF architecture associated with cysteine attenuation at multiple loci in mycobacteria indicates that coordinated expression is important, and that LL-sORF directed attenuation is effective. The co-regulation of individual operons functionally defines a regulon, akin to regulons controlled by dedicated DNA-binding transcription factors. We speculate that the evolution of regulation by attenuation is simplified in mycobacteria by the robust nature of leaderless translation, and that the evolution of a dedicated transcription factor and its cognate binding sites in the promoters of target operons is more problematic than exploiting LL-sORFs in a genus that exhibits frequent and robust LL-mRNA expression. As the first genes in their transcripts, LL-mRNAs are ideally positioned to cis-regulate expression of downstream genes. Additionally, transcription start sites are preferentially associated with purines (R), and the +2 transcript position is preferentially associated with pyrimidines (Y) (Martini et al. 2019). Thus, transcription often begins at RYN trinucleotide sequences. Our previous study showed that a 5’ RUG trinucleotide is both necessary and sufficient for robust leaderless translation initiation (Shell et al. 2015), so many transcription start sites are predicted to already initiate leaderless translation, and others are only one or two changes from initiating translation. It is not yet clear whether leaderless architecture per se offers advantages for attenuation and may have been selected over canonical Shine-Dalgarno translation initiation, or whether the sole criterion of an RUG sequence at the transcription start site is simply a relatively modest requirement. It is clear, however, that in mycobacteria, polycysteine LL-sORFs integrate two levels of coordination: locally by modulating polycistronic operon gene expression, and globally by synchronizing operon response.
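The argument that RYN transcription start sites are at most two substitutions away from a leaderless RUG start can be checked by enumeration. The following Python sketch is illustrative only; it treats AUG and GUG as the two RUG targets and counts single-nucleotide differences, without modeling mutation rates or promoter context.

```python
from itertools import product

RUG_TARGETS = ("AUG", "GUG")


def changes_to_rug(trinucleotide):
    """Minimum number of substitutions needed to convert a 5' trinucleotide to RUG."""
    tri = trinucleotide.upper().replace("T", "U")
    if len(tri) != 3:
        raise ValueError("expected a trinucleotide")
    return min(sum(a != b for a, b in zip(tri, target)) for target in RUG_TARGETS)


# Enumerate all 16 RYN trinucleotides (R = A/G, Y = C/U, N = any base).
ryn_starts = ["".join(p) for p in product("AG", "CU", "ACGU")]
distribution = {d: sum(changes_to_rug(t) == d for t in ryn_starts) for d in range(3)}
print(distribution)
```

Of the 16 RYN trinucleotides, the enumeration shows how many already match RUG, how many are one change away, and how many are two changes away; none requires three changes because the purine at position 1 is already satisfied.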
Concluding remarks
Given the prevalence of LL-sORFs in mycobacteria, we speculate that other translational regulons have evolved as an alternative to transcriptional regulons controlled by DNA-binding transcription factors. LL translation is considered to be the ancestral form of ribosome delivery (Nakamoto 2009; Zheng et al. 2011; Duval et al. 2013), indicating that LL-sORF regulons may be ancient and widespread. LL-sORFs define a functional subclass of small de novo genes that effectively decode translational stress into broad regulatory effects.
METHODS
Bacterial strains and culture
M. smegmatis wild-type mc²155 and its derivatives were grown in tryptic soy broth + 0.05% Tween 80 (TSBT) or on TSA plates, and cultured at 37°C. Antibiotic selection for reporter maintenance or mutation selection strategies included apramycin (12.5 µg/ml on agar, 10 µg/ml in broth), hygromycin (100 µg/ml and 25 µg/ml), kanamycin (50 µg/ml and 10 µg/ml), and zeocin (50 µg/ml and 25 µg/ml).
For the cysteine attenuation study, bacteria were cultured in minimal media. Base medium per liter: 6 g Na2HPO4 (anhydrous), 3 g KH2PO4, 0.5 g NaCl, 1 g NH4Cl, 0.05% (v/v) Tween-80. After autoclaving, 0.2% glucose and micronutrients were added to final concentrations: MgSO4 to 1 mM, CaCl2 to 100 μM, H3BO3 to 4×10^−7 M, CoCl2.6H2O to 3×10^−8 M, CuSO4.5H2O to 1×10^−8 M, MnCl2.4H2O to 8×10^−8 M, ZnSO4.7H2O to 1×10^−8 M, FeSO4.7H2O to 1×10^−6 M. L-cysteine or L-cystine (exogenously stable dimeric cysteine) was supplemented at 200 µg/ml as noted.
Luciferase assays
Reporters were generated by long-primer-dimer PCR to recreate the LL-sORF and leader sequence of M. smegmatis msmeg_5788. The products were cloned by Infusion and verified by DNA sequence analysis. The NanoLuc (Promega) luciferase ORF is carried on a plasmid that confers apramycin resistance and integrates at the L5 attB site in mycobacteria. Mycobacterial cultures were grown in minimal media for luciferase assays. Luciferase activity in a culture was assessed by the addition of NanoGlo (Promega) substrate and then measuring luminescence normalized to culture density.
Chromosomal mutants of M. smegmatis
A precise deletion of Ms5788A was created using a targeting plasmid that integrates via a single cross-over, allowing selection of a hygromycin-resistant intermediate, and is then resolved by a second homology-driven event in which sacB/galK counter-selection allows enrichment for the deleted recombinant (Barkan et al. 2011). The OoF point mutant was created by a recombineering approach that used a single-stranded oligonucleotide template to introduce an additional adenine in Ms5788A (van Kessel and Hatfull 2008). Co-electroporation of 2 × 10^−10 mol of 60-mer oligo with 500 ng of an episomal zeocin-resistance plasmid (pGE324, Zeo-sacB) allowed zeocin selection and isolation of the electrocompetent population of M. smegmatis. Mismatch-sensitized PCR (MAMA-PCR) screening (Cha et al. 1992; Swaminathan et al. 2001) of isolates identified clones that integrated the additional adenine. Deletion and point mutants were verified by genomic DNA PCR and sequencing.
Mass spectrometry
Wild-type and mutant derivative M. smegmatis were cultured in minimal media with or without cysteine supplementation, harvested by centrifugation and cryo-milled (Retsch MM400, Haan, Germany) for mass spectrometry analysis. Reagents were of LC-MS quality or higher and obtained from Sigma Aldrich unless indicated. Milled cell pellets were digested with trypsin using commercial S-Traps (Protifi, NY). Briefly, 50 μg of protein from each milled cell pellet was resuspended in 6% SDS and 10 mM tris(2-carboxyethyl)phosphine in 100 mM triethylammonium bicarbonate (TEAB), heated at 95 °C for 3 minutes, then alkylated with 10 mM iodoacetamide in the dark for 15 min. Samples were acidified by addition of H3PO4 to 1.2% final (v/v) and flocculated by 7-fold addition of 95/5 MeOH:100 mM TEAB prior to collection on the S-trap column (Zougman et al. 2014). Washing and conditioning were performed three times by addition of 150 μl of MeOH buffer as above, after which 1 μg of sequencing-grade trypsin (Promega, WI) was added to each sample and digestion proceeded at 37 °C for 8 hours. Peptides were isolated, acidified and desalted using Stage tips packed into P100 pipette tips, and dried using a MiVac (Genevac, UK) prior to LC-MS/MS analysis (Rappsilber et al. 2003).
NanoUHPLC-MS/MS was performed essentially as described (Bosserman et al. 2017; Bosserman et al. 2019). One microgram of each digest was analyzed in technical triplicate and biological duplicate on an Orbitrap instrument running a TOP15 data-dependent acquisition (Q-Exactive, Thermo, San Jose, CA). Protein spectral matching and label-free quantification were performed using MaxQuant (Cox and Mann 2008) against the M. smegmatis FASTA from the UniProt database (UP000000757; 6,595 entries), combined with contaminants. LFQ parameters were set to default, and quantification was restricted to proteins with >2 peptides (missing peaks enabled). Target-decoy was used to determine False Discovery Rates (Elias and Gygi 2007) and proteins at a global 1% FDR were used for quantification. Data reduction and significance testing were performed using a modified LIMMA methodology (Efstathiou et al. 2017). Protein search and RAW data files are accessible at the Center for Computational Mass Spectrometry via ftp://MSV000084381@massive.ucsd.edu (Deutsch et al. 2017). Proteins quantitatively replicated in one condition but absent in another were given an arbitrary intensity of 5×10^4 (below detection threshold) for ease in visualization.
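The placeholder-intensity convention described above can be sketched in Python as follows. This is a minimal illustration, not the MaxQuant or LIMMA workflow: the function name and the example intensities are hypothetical, and a missing value is represented by None.

```python
import math

PLACEHOLDER_INTENSITY = 5e4  # arbitrary value below the detection threshold


def log2_ratio(minus_cys, plus_cys, placeholder=PLACEHOLDER_INTENSITY):
    """log2(intensity without cysteine / intensity with cysteine).

    A protein missing in exactly one condition receives the placeholder
    intensity there; a protein missing in both conditions returns None.
    """
    if minus_cys is None and plus_cys is None:
        return None
    numerator = placeholder if minus_cys is None else minus_cys
    denominator = placeholder if plus_cys is None else plus_cys
    return math.log2(numerator / denominator)


# Hypothetical LFQ intensities (a.u.), for illustration only.
print(log2_ratio(4e5, None))   # detected only without cysteine
print(log2_ratio(1e6, 1e6))    # unchanged between conditions
```

A positive ratio corresponds to a protein that is more abundant without cysteine supplementation. Ratios involving the placeholder are lower bounds on the true change and are suitable for visualization rather than for quantitative comparison.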
Bioinformatics
Phylogeny
The genomes and proteomes for Mycobacterium abscessus subsp. massiliense (NC_018150.2), M. africanum strain 25 (CP010334.1), M. litorale strain F4 (CP019882.1), M. avium 104 (NC_008595.1), M. ulcerans Agy99 (NC_008611.1), M. vanbaalenii PYR-1 (NC_008726.1), M. marinum M (NC_010612.1), M. liflandii 128FXT (NC_020133.1), M. kansasii ATCC 12478 (NC_022663.1), M. gilvum Spyr1 (NC_014814.1), M. bovis AF2122/97 (NC_002945.4), M. tuberculosis H37Rv (NC_000963.3), M. sinense strain JDM601 (NC_015576.1), M. canettii CIPT 140010059 (NC_015848.1), M. chubuense NBB4 (NC_018027.1), M. intracellulare MOTT-64 (NC_016948.1), M. intracellulare ATCC 13950 (NC_016946.1), M. neoaurum VKM Ac-1815D (NC_023036.2), M. haemophilum ATCC 29548 (NZ_CP011883.2), M. simiae ATCC 25275 (NZ_HG315953.1), M. goodii strain X7B (NZ_CP012150.1), M. fortuitum strain CT6 (NZ_CP011269.1), M. phlei strain CCUG 21000 (NZ_CP014475.1), M. immunogenum strain CCUG 47286 (NZ_CP011530.1), M. chelonae CCUG 47445 (NZ_CP007220.1), M. vaccae 95051 (NZ_CP011491.1), M. chimaera strain AH16 (NZ_CP012885.2), M. caprae strain Allgaeu (NZ_CP016401.1), M. colombiense CECT 3035 (NZ_CP020821.1), M. dioxanotrophicus strain PH-06 (NZ_CP020809.1), M. marseillense strain FLAC0026 (NZ_CP023147.1), M. lepraemurium strain Hawaii (NZ_CP021238.1), M. shigaense strain UN-152 (NZ_AP018164.1), M. stephanolepidis (NZ_AP018165.1), M. pseudoshottsii JCM 15466 (NZ_AP018410.1), M. paragordonae 49061 (NZ_CP025546.1), M. rutilum strain DSM 45405 (NZ_LT629971.1), M. thermoresistibile strain NCTC10409 (NZ_LT906483.1), M. hassiacum DSM 44199 (NZ_LR026975.1), and M. microti strain 12 (CP010333.1) were downloaded from NCBI. Orthologs of M. smegmatis proteins were identified by a reciprocal best BLAST hit approach and aligned in MAFFT v.7.058b (Katoh and Standley 2013). A supermatrix was created from the concatenation of 1560 of these protein alignments with an in-house Python script.
Alignments included in the supermatrix were required to have at least 40 of the 41 mycobacterial species present. IQ-TREE v.1.6.9 (Nguyen et al. 2015) generated a maximum likelihood (ML) phylogeny under an LG+R4 model of evolution (Soubrier et al. 2012; Yang 1995; Le and Gascuel 2008). Support values were generated by the ultrafast bootstrap method with 1000 replicates (Minh et al. 2013). The ML tree was visualized in FigTree v.1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/).
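The ortholog selection and supermatrix steps can be sketched in Python as follows. This is a minimal illustration, not the in-house script referred to above: the function names, the dictionary-based inputs, and the choice to pad a missing species with gap characters are assumptions. Best-hit tables are assumed to have been parsed from BLAST output beforehand.

```python
# Illustrative sketch only; function names and input layout are hypothetical.

def reciprocal_best_hits(forward, reverse):
    """Return pairs that are each other's best hit.

    forward: dict mapping each reference protein ID -> its best hit in genome B.
    reverse: dict mapping each genome B protein ID -> its best hit in the reference.
    """
    return {a: b for a, b in forward.items() if reverse.get(b) == a}


def build_supermatrix(alignments, taxa, min_taxa):
    """Concatenate per-protein alignments that pass a taxon-occupancy filter.

    alignments: list of dicts mapping taxon -> aligned sequence; all sequences
                within one alignment must have the same length.
    taxa: ordered list of all taxa expected in the supermatrix.
    min_taxa: minimum number of taxa that must be present in an alignment
              (for example, 40 of 41 as in the text).
    Returns (dict taxon -> concatenated sequence, number of alignments kept).
    """
    parts = {t: [] for t in taxa}
    kept = 0
    for aln in alignments:
        present = [t for t in taxa if t in aln]
        if len(present) < min_taxa:
            continue  # too few species represented; skip this protein
        length = len(aln[present[0]])
        for t in taxa:
            # Assumption: a missing taxon is padded with gaps of equal length.
            parts[t].append(aln.get(t, "-" * length))
        kept += 1
    return {t: "".join(p) for t, p in parts.items()}, kept


rbh = reciprocal_best_hits({"a1": "b1", "a2": "b2"}, {"b1": "a1", "b2": "a3"})
matrix, n_kept = build_supermatrix(
    [{"A": "MK-", "B": "MKL", "C": "MRL"}, {"A": "GG", "B": "GA"}, {"A": "T"}],
    taxa=["A", "B", "C"],
    min_taxa=2,
)
print(rbh, matrix, n_kept)
```

Padding with gaps keeps every row of the supermatrix the same length, which phylogenetic software generally requires of a concatenated alignment.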
Identification of ssORFs in additional mycobacterial genomes
The orthologs of all M. smegmatis ORFs with an upstream ssORF were identified in other mycobacteria via a reciprocal best BLAST hit approach, and the regions 1000 nucleotides upstream of these genes were extracted. M. smegmatis ssORF proteins were queried against these upstream regions via TBLASTN searches to identify orthologous sequences. The coordinates for these sequences were obtained from the BLAST results and extended to match the length of the corresponding M. smegmatis ssORF. To ensure that we captured the start and stop codons, we extended these coordinates by an additional 5-10 amino acids on both the 5’ and 3’ ends. These coordinates and the strand information captured from the BLAST results were used to extract and translate ssORFs from the 1 Kb upstream regions with an in-house Python script. All ssORFs were visually inspected and initially aligned in MEGA v. 7.0.26 (Kumar et al. 2016). In some cases, additional ssORF orthologs were identified by predicting proteins (anything between two stop codons) in upstream 1 Kb regions with the getorf subroutine from EMBOSS v. 6.3.1 (Rice et al. 2000). Multiple alignments of deduced LL-sORF amino acid sequences were performed using a web-based tool with the default Clustal algorithm (Madeira et al. 2019). The aligned sequences were compiled into a sequence logo infographic (weblogo.berkeley.edu).
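The coordinate extension, extraction, and translation steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the in-house script: the function names are hypothetical, coordinates are assumed to be 0-based, half-open, and given on the plus strand of the upstream region, and the hit length is assumed to be a multiple of three. Padding is clipped in whole codons at the region boundaries so that the reading frame is preserved.

```python
# Illustrative sketch only; names and coordinate conventions are hypothetical.
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def revcomp(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate in frame 1; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def extract_orf(region, start, end, strand, pad_codons=5):
    """Extract and translate a TBLASTN hit, padded on both sides.

    region: upstream nucleotide sequence (e.g. 1 Kb), plus strand.
    start, end: 0-based half-open hit coordinates on the plus strand.
    strand: '+' or '-' as reported for the hit.
    pad_codons: codons added on each side (the text uses 5-10 amino acids).
    """
    left = min(pad_codons, start // 3)                   # whole codons available upstream
    right = min(pad_codons, (len(region) - end) // 3)    # whole codons available downstream
    sub = region[start - 3 * left:end + 3 * right]
    if strand == "-":
        sub = revcomp(sub)
    return translate(sub)


region = "ACCCATGTGCTGCTAAGGGT"  # toy sequence containing ATG TGC TGC TAA
print(extract_orf(region, 7, 13, "+", pad_codons=2))
```

Note that this table translates a GUG initiation codon as valine, so a leading V in the output may mark a genuine start; the padded translation is intended for visual inspection, as described above, rather than automated ORF calling.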
Supplementary Information Table 1. LL-sORFs predicted by transcriptomic data: Ribo-seq, RNA-seq, and RUG transcription start sites (Shell et al. 2015; Martini et al. 2019).