Genotype-phenotype map of an RNA-ligand complex

RNA-ligand interactions play important roles in biology and biotechnology, but they often involve complex three-dimensional folding of RNA and are difficult to predict. To systematically explore the phenotypic landscape of an RNA-ligand complex, we used microarrays to investigate all possible single and double mutants of the 49-nt RNA aptamer Broccoli bound to the fluorophore DFHBI-1T. We collected more than seven million fluorescence measurements in varying conditions, and inferred dissociation rate constants, spectral shifts, and intragenic epistasis. Our results reveal an unexpectedly complex phenotypic landscape, in which mutations near the fluorophore binding pocket modulated magnesium-, potassium- and fluorophore-binding and fluorescence spectra, while distal mutations influenced structural stability and fluorescence intensity. We trained a machine learning model that accurately predicted RNA secondary structure from local epistatic interactions, despite the presence of G-quadruplexes and other noncanonical structures. Our experimental platform will facilitate the discovery and analysis of new RNA-ligand interactions.


Introduction
In addition to its role as messenger in protein synthesis, RNA performs functions that depend on its three-dimensional folding and binding to other molecules. Such interactions are important in biology (eg, the specific recognition of tRNA sequence and structure by aminoacyl tRNA synthetases (Ibba and Soll, 2000) and biotechnology (eg, the inhibition of SARS-CoV ribosomal frameshifting by a ligand of the FSE pseudoknot (Park et al., 2011;Warner et al., 2018)). However, RNA-ligand interactions are difficult to measure, and computational methods are poor at predicting noncanonical RNA structures, such as G-quadruplexes, that often establish RNA-ligand specificity.
The key to accurate prediction of RNA structures and interactions is experimental data that connects RNA sequence to structural properties. Such data has been instrumental in developing commonly used secondary structure predictors, such as UNAfold and RNAfold (Lorenz et al., 2011;Markham and Zuker, 2008), and 3D structure predictors (Boniecki et al., 2016;Das and Baker, 2007). Unfortunately, the data used to train existing predictors has low coverage of noncanonical structures or RNA-ligand complexes, and little information about the effects of external conditions on structural stability. To address these gaps, here we performed a massively parallel assay of the effects of RNA sequence variation and external conditions on the binding of the 49-nt RNA aptamer Broccoli to the fluorophore DFHBI- 1T (Filonov et al., 2014).

Single mutation effects
We designed an oligonucleotide library that encodes wild-type Broccoli and all its 10,731 (3x49 + (3x49) x (3x48) / 2) single and double mutants. Each variant was attached to a stabilizing scaffold (Filonov et al., 2015) and to three or more unique 60-nt probes, allowing specific hybridisation to spots on an Agilent cDNA microarray ( Fig. 1A, B). We transcribed the library with T7 polymerase and labeled a small fraction of RNA with Cyanine-3 (Cy3) as loading control. To evaluate the effects of mutations on fluorescence, we measured the green signal (aptamer-DFHBI-1T complex) and red signal (Cy3-labeled aptamer) using an automated fluorescence microscope. To guide the analysis of structure-function relationships, we predicted the 3D structure of Broccoli by homology modeling using the known crystal structure of Spinach (Warner et al., 2014).
The microarray measurements were consistent between experimental replicates and with published data (Fig. S1). In most positions, mutations had mild-to-moderate negative effects on fluorescence, but some variants were brighter than the wild-type ( Fig. 1). Although the sequence of the apical loop, UUCG, is known as a stabilizing tetraloop (Heus and Pardi, 1991), most mutations in this region increased fluorescence, as seen before (Ageely et al., 2016). Mutations in canonically (Watson-Crick C-G, A-U and G-U) paired regions typically reduced fluorescence, consistent with their predicted destabilizing effect. Unlike in previous reports (Ageely et al., 2016) mutations in the terminal stem were tolerated, presumably because our design comprised a stabilizing scaffold. Nucleotide G12, thought to be essential because it forms hydrogen bonds with the fluorophore (Warner et al., 2014), tolerated mutation to A, but not to C or U. Mutations in the two G-quadruplexes (GQ) that form the fluorophore binding platform, and in the U13-A33-U31 Hoogsteen base triple which seals the binding pocket from the top, reduced fluorescence by more than 90%, confirming the importance of these structural elements for fluorophore binding.

Epistasis
Previous large-scale studies showed that the effects of mutations in a gene usually depend on the presence of other mutations in the same gene, a phenomenon known as intragenic epistasis (Domingo et al., 2018;Li et al., 2016;Puchta et al., 2016;Sarkisyan et al., 2016). To visualise epistasis, we compared the effects of single mutations in the wild-type background, and in each of the 147 (3 x 49) single-mutant backgrounds. Although mutational profiles showed similarities across multiple backgrounds, there were also clear differences, indicating the existence of intragenic epistatic effects (Fig. S2).
Epistasis can be partitioned into nonspecific (global) epistasis, where the phenotypes of double mutants can be predicted from the effect sizes of single mutations, and specific (local) epistasis, where phenotypes of double mutants also depend on the precise identity of both mutations (Domingo et al., 2019;Kondrashov and Kondrashov, 2001;Otwinowski et al., 2018). To calculate nonspecific epistasis, we fitted a locally weighted polynomial regression (LOESS) model to estimate the typical fluorescence of double mutants given the observed fluorescence of single mutants ( Fig. 2A). For each double mutant, we then defined specific epistasis as the deviation between observed fluorescence and fluorescence predicted by the nonspecific epistasis model (Fig. S3, top right). Nonspecific epistasis explained 63% of the variation in fluorescence, compared to 42% of variation explained by an additive model (Fig. S4). Unlike the additive model, the nonspecific epistasis model correctly predicted non-negative fluorescence and lack of positive epistasis between pairs of large-effect mutations.
Local epistatic interactions measured by deep mutational scanning have been used to predict structures of proteins and RNAs (Rollins et al., 2019;Schmiedel and Lehner, 2019;Zhang et al., 2020). To evaluate the relation with RNA structure, we analysed local epistasis between pairs of nucleotides, depending on their position within the secondary and tertiary structures of Broccoli. Base-paired nucleotides showed strong positive epistasis, consistent with the restoration of function by compensatory mutations, whereas non-paired nucleotides located within the same stem showed negative epistasis (Fig. 2B,C, Fig. S3). Pairs in which one or both partners were part of a triplet or quadruplet structure typically showed weak epistasis, whereas loop nucleotides showed both positive and negative epistasis ( Fig. S5A). Unexpectedly, proximity in the 3D structure was associated with lower absolute values of epistasis ( Fig. S5B,C).
The association between epistasis and structural elements suggests that it may be possible to predict RNA structure from epistasis signals. To test this principle, we employed supervised machine learning to predict base pairings of the wild-type Broccoli. We trained support vector machine (SVM) classifiers on three features for each candidate base pair. The chosen features were drawn from local epistasis signals associated with each candidate pair (details in Methods) on the basis of their ability to discriminate between paired and non-paired candidates in the feature space ( Fig S6). We labelled all pairs (N=990) as paired or non-paired according to the known wild-type structure. We employed stratified sampling on a 70-30 split between training and test data, to account for the class imbalance between paired and nonpaired candidates (10:980 ratio). We trained an ensemble of 1000 classifiers that achieved average sensitivity of 76% and specificity of 98%.
Most pairs were accurately predicted by the SVM, with the exception of pair 15:30, which possibly indicates constraints on nucleotide identity (Fig. 2D). The minimum folding energy (MFE) structure comprises five base pairs not found in our 3D structural model, and none of these extra pairs were recovered by the SVM, consistent with the expectation that these pairs are not compatible with fluorescence ( Fig. S7). Some candidate pairs (eg 4:46 and 5:45) were not present in the MFE structure, but were consistently called as paired by the SVM (Fig. S3). These base pairs, which have been reported as paired in some previous studies (Ageely et al., 2016), may represent alternative functional conformations of Broccoli. Consistently, pairs 4:46 and 5:45 also showed some propensity for base-pairing in 3D folding simulations.

Gene-environment interactions
Our approach allows the estimation of multiple molecular phenotypes through deep mutational scanning across experimental conditions. We performed 168 experiments with varying fluorophore concentrations, Mg 2+ and K + ion concentrations, pH, temperature, and excitation/emission wavelengths. While the red signal, corresponding to the amount of RNA hybridized to each spot, was similar across experiments, the intensity of green signal depended on the experiment. Assays performed in similar conditions yielded highly correlated fluorescence profiles (Fig.   S8). As expected, high concentrations of fluorophore, Mg 2+ or K + in the buffer were associated with increased brightness of most variants (Fig. 3, S9). with Hill coefficient equal to 1 to represent a single binding site. Mutations with the largest effect on fluorophore affinity were found in the fluorophore binding pocket, in positions C32 and G33 (Fig. 3). Residue A69 in Spinach influences fluorophore access into its binding site (Warner et al., 2014); consistently, mutations of the homologous residue A39 in Broccoli decreased affinity. F_max correlated well with fluorescence measured in individual experiments, but weakly with Kd (Fig. S10).
Some mutations (such as C32U) increased DFHBI-1T-Broccoli affinity and decreased fluorescence, while others (eg. C32A and C32G) had an opposite effect ( Fig. 3D). This counterintuitive observation may be explained by postulating that weak binding reduces photobleaching and increases the proportion of RNAfluorophore complexes found in an active state. Indeed, a previous study described fast fluorescence decay and fluorophore-dependent recovery of RNA aptamers, which has been attributed to accelerated photoconversion of the fluorophore when bound to the RNA (Han et al. 2013). Wild-type Broccoli had one of the highest affinities to the fluorophore, but many mutants were brighter than wild-type (Fig. S10, see also (Ketterer et al., 2015), consistent with the original selection strategy which relied primarily on DFHBI binding rather than fluorescence (Filonov et al., 2014).
Similar to DFHBI-1T titration, Mg 2+ and K + titration showed monotonically increasing relationships between ligand concentration and fluorescence (Fig. 3, S9, S10, S11), and we used the Hill equation to calculate EC50(Mg 2+ ) and EC50(K + ), the ion concentrations at which half of maximum fluorescence was observed. Broccoli was initially selected in low-magnesium conditions to improve in vivo fluorescence (Filonov et al., 2014), and wild-type Broccoli showed low magnesium sensitivity. On average, mutants required ~1.8 times as much magnesium as the wild-type to achieve half-maximum fluorescence, but variants mutated in position 36 required ~5.3 times as much magnesium (Fig. 3C, D). In the structural model, nucleotide U36 is close to a coordinated Mg 2+ ion, suggesting a role in binding (Fig. 3E). Mutations in positions 43 and 46 increased sensitivity to K + concentration, suggesting a role in the formation of the G-quadruplex structures, which are stabilized by K + ions. While changes of magnesium and potassium concentration affected the fluorescence signal differently in different mutants, changes of temperature and pH showed less interaction with individual mutations (Fig. S9). Position A39, previously reported as a spectral tuning spot for DFHO-Broccoli_29-1 red/orange complex (Filonov et al., 2019), had no effect on the fluorescence spectrum of DFHBI-1T.

Fluorescence spectra
To analyse the spectral shifts in more detail, we measured the fluorescence spectra of selected variants in a spectrofluorometer. The C15G mutation, which destabilised the stem above the fluorophore binding pocket, shifted the excitation maximum to 435 nm with almost no change to the emission spectrum (Fig. 4D). The G12A mutation, in the presence or absence of other mutations, caused a yellow-shift in both the emission and excitation spectra (Fig. 4D). In a previous study, trifluoromethyl groups in DFHBI derivatives changed fluorescence spectra by altering dipole moments across the fluorophore, and the G12A mutation presumably mimics this effect by disrupting the hydrogen bonding with the carbonyl oxygen of DFHBI-1T ( Fig. 4E). Measurement of fluorescence spectra in solvents with different dielectric constants confirms that changes of polarity may induce blue and yellow shifts in DFHBI-1T (Fig. S12).  ((Kellenberger et al., 2013;Paige et al., 2012), reviewed in (Su and Hammond, 2020)). In vivo applications of such aptamers require the optimization of molecular phenotypes, such as the fluorescence spectrum, brightness, affinity to fluorophore and other ligands, sensitivity to pH and metal ions, photobleaching, thermal stability and resistance to helicases and nucleases. Many of these properties can be adjusted by mutations in the RNA (Ageely et al., 2016;Filonov et al., 2014;Warner et al., 2014). However, currently there are no established methods to predict how individual mutations will affect most phenotypes of interest, and, as a result, aptamer construction is typically done by random mutagenesis followed by functional screening (Filonov et al., 2014;Han et al., 2013;Song et al., 2017;Strack et al., 2013). Our dataset provides a systematic overview of how mutations in the Broccoli RNA aptamer influence parameters relevant to in vivo function, paving the way to the development of predictive machine learning models that will allow the design of RNA molecules with required properties. The recent success of AlphaFold in the prediction of protein structures highlights the benefits of large datasets in understanding molecular function (AlQuraishi, 2019). We anticipate that our microarray platform, which uses microscopy equipment readily available in molecular biology labs, will facilitating both the development of RNA tools and a mechanistic understanding of their function.

Acknowledgments
We thank Toby Hurd for the inspiration to start this project; Duncan Sproul for suggesting to use a microarray platform; members of the Advanced Imaging Facility

DNA library design and amplification.
We designed a pool of oligonucleotides composed as follows: truncated T7 polymerase promoter "ACGACTCACTATAGGGAGA" (19nt) -unique probe sequence (60nt) -"AAAAAAAAAAAAAAAA" (16nt) -upstream part of F30 "TTGCCATGTGTATGTGG" (17nt) -Broccoli variant (49nt) -downstream part of F30 "CCACATACTCTGATGATCCTTCGGGATCATTCATGGCAA" (39nt). As the oligo length was limited to 200 nucleotides, part of the T7 polymerase promoter was added during amplification of the library. The 60-mer probe sequences were derived from the GPL10787-9758.txt file containing information for Agilent SurePrint G3 Mouse GE 8x60K Microarray (Glass slide formatted with eight high-definition 60K arrays), that includes probes for mRNAs and lincRNAs. Sequences of Broccoli variants were generated by a custom script that changed the original nucleotide in each position of an input sequence to each of the 3 remaining nucleotides, producing 3 x "length of the sequence" of mutated variants. The first round was run on wild type Broccoli ("GAGACGGTCGGGTCCAGATATTCGTATCTGTCGAGTAGAGTGTGGGCTC"), to generate single mutants. This output was use as input for a second round -to generate double mutants. Taq DNA polymerase -10U) and the PCR thermocycler program was set as follows.
Initial denaturation at 95ºC for 3 min. was followed by 9 -12 three-step cycles of: denaturation at 95ºC for 15 s.; annealling at 55ºC for 30 s.; extension at 68ºC for 90 s. The PCR product was purified with MinElute PCR Purification Kit (Qiagen, Cat No.

28004) following manufacturer's protocol using a double elution with 20 µl of Elution
Buffer each. The final DNA concentration was determined using NanoDrop 8000 spectrophotometer (ThermoFisher, Cat No. ND-8000-GL).

In vitro transcription.
All of the purified DNA product from the library amplification PCR was used as a template for in vitro transcription reactions using MEGAshortscript T7 Transcription Kit (ThermoFisher, Ambion, Cat No. AM1354). The transcription reactions were performed following the manufacturer's protocol and using 200 nM final concentration of template DNA (~550 ng of template DNA per 20 µl reaction).
Reactions were incubated at 37ºC overnight (18-24 h) in ThermoMixer F1.5 (Eppendorf, Cat No. 5384000039) with ThermoTop (Eppendorf, Cat No. 5308000003) followed by TURBO DNase treatment at 37ºC for 20 min. The RNA product from each reaction was purified with RNeasy MinElute Cleanup Kit (Qiagen, Cat No. 74204) following a manufacturer's protocol using a double elution with 40 µl of RNase-free water each. The final purified samples were mixed together and the RNA concentration was determined using NanoDrop 8000 spectrophotometer. A sample of the final RNA product was run on an agarose gel to confirm the correct size and check for RNA quality and possible degradation.

RNA labelling with Cy3.
Ten percent of the final RNA amount was used for chemical labelling with Cy3 fluorescent dye using Arcturus Turbo Labeling Cy3 Kit (ThermoFisher, Applied Biosystems, Cat No. KIT0609). The chemical labelling procedure and subsequent removal of free Cy3 dye was performed following manufacturer's protocol with 15 µg of RNA per reaction. The labelled and purified RNA from all labelling reactions was mixed together and the final RNA and Cy3 concentrations were determined using NanoDrop 8000 spectrophotometer.

Hybridisation to microarrays.
The complete RNA library was hybridised to SurePrint G3 Mouse GE 8×60K Microarrays (Agilent, Cat No. G4852A) using a partially modified version of manufacturer's protocol as described in Version 6.9.1 of 'One-Color Microarray-Based Gene Expression Analysis Protocol'. The total amount of RNA used for hybridisation on each microarray was between 20 µg and 45 µg and contained between 1% and 7% of Cy3 labelled RNA. The most robust imaging data were obtained after hybridisation with 40 µg of total RNA containing 2.5% of Cy3 labelled RNA (39 µg of unlabelled RNA mixed with 1 µg of Cy3 labelled RNA). The concentration used for other gradients was 140 mM. A pH gradient was made using 20 mM HEPES with pH -5; 5.5; 6; 7; 8; 9; pH values used for other gradients were 5.5 and 7. A temperature gradient was generated by adjusting the temperature in the microscopy room using air conditioning and by using heated chamber with a thermostat fitted on the microscope. The temperatures tested were: 17ºC; 23ºC (using A/C setting); 30ºC; 37ºC and 42ºC (using heated chamber). The standard temperature for other gradients was 23ºC.

Imaging chamber assembly.
A major technical challenge was to keep microarrays in a constant concentration of imaging buffer components, avoiding any evaporation throughout the whole duration of an imaging session that lasted about one hour. For that purpose specially designed single-use imaging chambers were made using 18 mm x 18 mm cover glasses #1 (VWR, Cat No. 631-1331) and silicon seals cut out of ultra thin silicone film with 0.3 mm thickness (Silex, Ultra Thin Silicone Film -0.3 mm -30 Shore).
Silicone seals were cut out using precision CNC cutting machine with inner dimensions of 10 mm x 13.5 mm and 1.5 mm width at each side. First the cover glasses were gently but thoroughly washed with 80% ethanol and lens cleaning tissues, then gently dried with lens cleaning tissues and dusted with manual air duster. Using forceps, the silicone seals were very carefully placed with the more sticky side facing the cleaned cover glasses leaving about 5 mm space on one side of the cover glasses for handling. Such prepared imaging chambers were stored in an empty cover glass box held by sponge at the sides to prevent touching each other. Imaging chambers were handled with care to avoid ever touching the imaging area and were air dusted again before use. Just before starting an imaging session 25 µl of imaging buffer was carefully placed in the centre of the microarray and the imaging chamber was precisely positioned on this microarray using forceps and was gently pushed along the edges using a tip of the forceps to seal the chamber. Any droplets of the imaging buffer outside of the chamber were immediately dried using a corner of a tissue paper. A clean hybridisation gasket slide and an old microarray slide stained with ink were placed precisely underneath the fresh microarray slide to indicate the position of the transparent microarrays and guide placing imaging buffer and chamber on the microarray. A well positioned flat silicone seal would prevent any evaporation of the buffer from the imaging chamber. After the imaging was complete a buffer change was performed by swiftly and carefully detaching the imaging chamber from the microarray slide using forceps and then briefly drying the microarray slide by gently tapping one side on the bench with clean tissue paper underneath to absorb the buffer. Another sample of imaging buffer and fresh imaging chamber were quickly placed on the same microarray. The buffer change had to be performed relatively quickly to avoid a complete drying out of the microarray as this would often lead to deterioration of the spots on the microarray. Usually three to four buffer changes were made but sometimes even up to ten buffer changes were performed on one microarray still giving good quality data, however this varied depending on the hybridisation and imaging conditions and had to be assessed individually for each microarray.
Imaging of one microarray required 204 tiles using a grid of 17x12 tiles with 40% overlap between the tiles. Each tile was imaged 3 or 4 times in each channel as separate Z stacks with virtually identical focal plane as the 0.015 µm step between the stacks was below the stage motor resolution. These separate images for each tile were later averaged during processing to reduce the background noise.

Feature extraction.
The individual image files were combined into a single image of the entire chip using a bespoke LabView (National Instruments) program which corrected for illumination heterogeneity before merging the images together. Areas of overlap were added together using linear opacity ramps to avoid hard edges between images. Multiple acquisitions of the same colour and settings were averaged to improve signal to noise ratio.
The arrangement of the array of features was identified by matching the fixed patterns of spots in the four corners of the array using the green channel, and this information was used to identify the approximate position and index number of each spot. These positions were refined by searching for a circular pattern of bright pixels starting from the initial approximate location. The average pixel value was extracted for each feature in all the colours recorded and saved. Each feature was then classified by its index number into groups for easier processing.

Calculating the fluorescence of Broccoli variants.
The microarray contained 40371 spots expected to bind Broccoli variants (78 spots for wild-type Broccoli, 8100 single or double mutants with 4 spots per variant, and 2631 variants with 3 spots per variant), as well as 9925 "empty spots" where nothing was designed to hybridise, but where we could still read signal. We concluded that the signal might be coming from non-specific cross-hybridisation so the average signal of these empty spots was subtracted from all values in each channel. To account for the differences in the amount of Cy3-labeled RNA used in each experiment, we divided the red signal of each spot by the mean red signal of all spots on a given array. Spots with red signal less than 0.2 (i.e. less than 20% of average red signal on the array) were excluded from further analysis. All spots with negative green signal had the green signal set to 0 and the normalized Broccoli  (20,21,22,23,24,25), which had little effect on fluorescence. We found that using a subset of double mutants in the estimation of single mutation effects increased the amount of available data, reduced noise, and did not bias the results.
We calculated global epistasis using the "loess" function in R.
To calculate local epistasis values for all double mutants, we subtracted the measured fluorescence of each double mutant from the fluorescence predicted using the global epistasis model described above.

Training of machine learning models.
We considered all pairs of positions (i,j) and labelled each pair as either paired (10) or non-paired (980) based on the secondary structure model of wild-type Broccoli.
We excluded pairs in which |i-j| < 4 because such pairs are not commonly found in RNA structures. We trained an ensemble of Support Vector Machine (SVM) classifiers on the labelled data, using the scikit-learn library in Python with a 70:30 split between training and test datasets. For model training, we employed as features the mean local epistasis signals for the A:U, G:C, and G:U pairs: 1. (E(i A , j U ) + E(i U , j A ))/2; 2. (E(i G , j C ) + E(i C , j G ))/2; 3. (E(i G , j U ) + E(i U , j G ))/2 where E(i X , j Y ) is the value of local epistasis of a double mutant with mutation X in position i, and mutation Y in position j. We randomly shuffled the data a total of 1000 times, from where we built an ensemble of 1000 classifiers and computed the frequency of each possible pair being called as paired in the test data.

Curve fitting.
We used the open source tool gnuplot to calculate the dissociation constant (Kd) and We excluded variants for which the estimated EC50 value was greater than the respective F_max, or where the estimated EC50 was negative, or F_max was greater than 1000 fluorescence units.

Fluorescence measurements of individual variants.
To

Measuring of fluorescence spectra in solvents with different dielectric constants.
The RNA was aliquoted and dissolved in DMSO (10 mM: 1 mg per 0.312 mL).
NA, data not available.