An Improved Search Algorithm to Find G-Quadruplexes in Genome Sequences

A growing body of data suggests that the secondary structures adopted by G-rich polynucleotides may be more diverse than previously thought and that the definition of G-quadruplex-forming sequences should be broadened. We studied solution structures of a series of naturally occurring and model single-stranded DNA fragments defying the G3+NL1G3+NL2G3+NL3G3+ formula, which is used in most of the current GQ-search algorithms. The results confirm the GQ-forming potential of such sequences and suggest the existence of new types of GQs. We developed an improved (broadened) GQ-search algorithm (http://niifhm.ru/nauchnye-issledovanija/otdel-molekuljarnoj-biologii-i-genetiki/laboratorija-iskusstvennogo-antitelogeneza/497-2/) that accounts for the recently reported new types of GQs.


INTRODUCTION
Non-canonical polynucleotide structures play an important role in biogenesis processes, such as transcription, DNA repair, replication, translocation and RNA splicing (Saini et al. 2013). A clear view of DNA/RNA secondary structures and dynamics is necessary to understand the mechanisms of genomic regulation and to identify new biomarkers of pathology and drug targets. A growing body of data suggests that the secondary structures adopted by G-rich polynucleotides may be more diverse than previously thought (Kaluzhny et al. 2009;Tomasko et al. 2009;Guedin et al. 2010;Amrane et al. 2012;Beaudoin et al. 2013;Mukundan and Phan 2013). For instance, two new types of G-quadruplexes (GQs) have recently been reported: GQs with mismatches (Tomasko et al. 2009) and GQs with bulges (Mukundan and Phan 2013) ( Figure 1). GQs with bulges (bGQs) are GQs in which two stacked tetrad-forming guanosines in one column are separated by a projecting nucleoside. GQs with mismatches (mGQs) contain one or more substitutions of G for other nucleotides in the tetrads. (The mismatching nucleosides may participate in stacking). Both types of structures appeared to be stable under physiological conditions. These findings have led the researchers to the conclusion that the definition of GQforming sequences should be broadened. All currently available online search tools for GQs (Quad finder (Scaria et al. 2006), QGRS Mapper (Kikin et al. 2006) andQGRS predictor (Menendez et al. 2012)) employ the G 3 +N L1 G 3 +N L2 G 3 +N L3 G 3 + formula, which only defines canonical ('perfect') GQs.
We present here the first GQ-search tool, imGQfinder, that accounts for noncanonical ('imperfect') quadruplex structures (imGQs; i.e., bGQs and mGQs) in addition to canonical GQs. The ImGQfinder tool is freely accessible at the URL http://niifhm.ru/nauchnyeissledovanija/otdel-molekuljarnoj-biologii-i-genetiki/laboratorija-iskusstvennogoantitelogeneza/497-2/. Structural studies of a series of ( G 3+ N L1 G 3+ N L2 G 3+ N L3 G 3+ )-defying oligonucleotides (ONs), whose imGQ-forming potential was predicted by imGQfinder, were performed to verify our improved GQ-search algorithm. We also utilize ImGQfinder for statistical analysis of the imGQ-motif distribution in the human genome. In our broadened algorithm, we search for G-runs, determine the distance between them and select fragments that comply with the predetermined conditions for the maximum length of GQ loops and the minimum number of nucleotides in a G-run (i.e., the number of tetrads). The imGQ motif definition for imGQs with single defects is presented in Table 1. ImGQs with multiple defects can also be analyzed, but the relative set of formulas is not shown. Apparently, most imGQ motifs can be interpreted as both putative bGQs and mGQs. Some imGQs may also turn out to be 'perfect' GQs with fewer tetrads, e.g., a putative 3-tetrad imGQ with a bulge or a mismatch in the external tetrad can theoretically adopt the canonical 2-tetrad GQ conformation. ImGQfinder searches for all GQ and imGQ motifs, including overlapping ones. The program is implemented in Perl. The graphical user interface was developed using the Tk library. The inputs include the queried nucleotide sequence in fasta format, the number of tetrads and defects and the maximum loop length. The hits are displayed in a table. The coordinates of each G-run start (G1, G2, G3 and G4) and the positions of the defects (DEFECT) in each imGQ-forming fragment are shown. The user can also see the full sequences of the putative GQs or imGQs (the 'Add sequence to output' option).

RESULTS
In addition, we offer an application that can determine overlapping GQ/imGQ sites in the form of a single lengthy fragment with GQ/imGQ-forming potential (the 'Add intersected output' option). This feature is useful for estimating the maximum number of quadruplexes that can exist simultaneously.

Structural studies (algorithm verification)
Although several recent publications contain direct evidence for the existence of imGQs that are stable under physiological conditions, such structures are still relatively new, and there are few examples of well-characterized imGQs. To complement the studies on imGQ structures and to verify our search algorithm for imGQs, we synthesized a set of naturally occurring and model single-stranded DNA fragments that were defined by ImGQfinder to be putative imGQs and GQs, and we analyzed their conformations in solution using physicochemical methods. The ONs are listed in Table 2.
The sequences Bcl, Ct1 and PSTP were taken from the human genome. Bcl is located in the BCL2 promoter region 42 nucleotides upstream of the translation start site (NCBI Reference Sequence: NC_000018.9, chr18: -60985942 to -60985966). Ct1 is located in the intron of the CTIF gene (NCBI Reference Sequence: NC_000018.9, chr18: +46379322 to +46379344). PSTP is located at the PSTPIP2 intron/boundary (NCBI Reference Sequence: NC_000018.9, chr18: -43572049 to -43572072). BclG, BclA, BclT, Bcl-tr (truncated), Ct2, Ct3, Ct4, CtA, CtC and CTG are mutants of Bcl and Ct1. G3, G3A, G4, G4A and G4AA are model sequences. The solution structures of the ONs were investigated using UV-melting experiments, CD spectroscopy and NMR spectroscopy. The rotational relaxation times of EtBr in complex with the ONs are proportional to the hydrodynamic volumes of the molecules, and these times were estimated to distinguish between monomolecular and intermolecular quadruplexes. The melting temperatures of monomolecular GQs/imGQs and the GQ characteristics determined from the CD data (parallel, antiparallel or mixed GQ folding) are given in Table 1. Fragments of the 1 H-NMR spectra and CD spectra of the ONs Bcl, Ct1 and their mutants are shown in Figure 2. For the UV-melting profiles, molecularity analysis, CD spectra of the model ONs G3, G4 and their mutants and all the corresponding experimental procedures, see the supporting information.
The ONs G4, G3, CtG and BclG can fold into perfect 4-tetrad (G4, CtG and BclG) and 3-tetrad (G3) GQs according to the conventional GQ definition. Indeed, all of them formed highly stable parallel (G3 1 , G4 and CTG) or mixed (BclG) GQs in the presence potassium salt, as evidenced by the CD spectra and the UV-melting profiles. The imino region of the BclG 1 H-NMR spectrum ( Figure 2) contains 16 signals, which is consistent with 4 G-tetrads.
The ONs BclT, G4A, G4AA, Ct1-Ct4, CtA, CtC and PSTP defy the conventional G 3+ N L1 G 3+ N L2 G 3+ N L3 G 3+ formula and would be omitted by the currently existing GQ-search algorithms. ImGQfinder defines all of these sequences to be putative imGQs. Indeed, all these ONs form stable monomolecular quadruplexes in the presence of potassium salt. Fourteen signals in the imino regions of the 1 H-NMR spectra of the ONs Ct1 and PSTP are consistent with 4-tetrad mGQ structures with one imperfect tetrad 2 . Twelve signals in the imino-spectrum region of ribo-Ct1 most likely suggest a 3-tetrad bGQ structure (Molecular modeling studies were performed to clarify the Ct1 structure. For more information, see the supporting data). Importantly, the ONs BclT and G4aa cannot even form 2-tetrad GQs according to the conventional GQ definition. However, these ONs appear to fold into rather stable GQ-like structures under physiological conditions. These results confirm that ImGQfinder can be used to predict the possibility of bGQ and mGQ formation.

ImGQs in the human genome: statistical analysis (algorithm application)
ImGQfinder was utilized to reassess the abundance and to analyze the distribution of putative quadruplex sites in the human genome. We only considered 4-tetrad GQs and imGQs, which are generally more stable than 2-and 3-tetrad GQs according to the literature and our own physicochemical data. The sequences representing overlapping GQ/imGQ sites were counted only once (this was performed using an application feature of ImGQfinder). Sites with both GQand imGQ-folding potentials were regarded as putative GQs because the latter are generally more stable. As expected, imGQs are substantially more abundant than GQs (Table 3). Thus, the maximum overall number of quadruplex-like structures realized in vivo may be significantly higher than previously thought. The distribution of putative imGQs and GQs within RefSeq genes (genomic sequences used as reference standards for well-characterized genes; http://www.ncbi.nlm.nih.gov/refseq/rsg/about/) is shown in Figure 3 (for a more detailed analysis, see the supporting information). As expected, imGQs are substantially more abundant than GQs.
To additionally validate the ImGQfinder algorithm, we also calculated the number of all putative non-overlapping 'perfect' 3-tetrad guadruplexes in the human genome and compared it with the literature data. The obtained value (359 k) is close to the previous estimations (376 k) (Huppert and Balasubramanian 2005).

DISCUSSION
A new GQ-search algorithm, which is based on a broadened definition of quadruplex-forming sequences, and the user-friendly online tool ImGQfinder were developed. The algorithm was verified by structural studies of a series of ONs whose imGQ-forming potential was predicted by 1 G3 demonstrated extreme stability in potassium, and the stability was even superior to that of G4. We attribute this stability to the fact that single-nucleotide fragments separating G runs fit perfectly well in the diagonal loops of 3tetrad GQs but may be slightly too short for 4-tetrad diagonal loops. 2 One imperfect tetrad contains three Hoogsteen-bound Gs with two imino G protons that participate in H-bonding, which results in two additional signals in the relative region of the 1 H-NMR spectrum.
imGQfinder. Importantly, the physicochemical properties of ImGQs and GQ, such as the thermal stability under physiological conditions, appear to be rather similar.
Reassessment of the abundance of putative quadruplex sites in the human genome with imGQfinder revealed that the maximum number of G4 structures that could be simultaneously realized has been underestimated. As is evident from Figure  In conclusion, the broadened GQ-search algorithm opens up new opportunities in the prediction of DNA/RNA structure and allows thorough analysis of all possible conformations adopted by polynucleotides.

METHODS
The ON synthesis and purification, the MS analysis and the UV-melting, CD and rotational relaxation time measurements were performed as previously described (Varizhuk et al. 2013).

NMR studies
NMR samples were prepared at a concentration of ~0.1 mM in 0.6 ml H2O+D2O (10%) buffer solution containing 20 mM Tris-HCl (pH 7.5) and 100 mM KCl and annealed (heated to 90 C for 3 minutes, then cooled quickly on ice) prior to spectral measurements to ensure unimolecular quadruplex folding. 1 H-NMR spectra were recorded with Bruker AVANCE II 300 (300.1 MHz), Bruker AMX III (400.1 MHz) and Bruker AVANCE II 600 (600.1 MHz) spectrometers. The 1 H chemical shifts were referenced relative to an external standard -sodium 2,2-dimethyl-2silapentane-5-sulfonate (DSS). The spectra were recorded using presaturation or pulsed-field gradient WATERGATE W5 pulse sequences (zgprsp and zggpw5 from the Bruker library, respectively) for H 2 O suppression.

Molecular modeling
Ct1 GQ models 1 and 2 were created as follows. The starting positions of the GQ core atoms were obtained from the PDB (139D and 2KQH). The core of every GQ was created using Swiss-PDB Viewer. Then, loops were added step by step as described further by utilizing the SYBYL 8.0 molecular modeling package. To remove unfavorable van der Waals interactions, the models were reoptimized after attaching each loop using SYBYL 8.0 and the Powell method with the following parameters: Gasteiger-Hückel charges, TRIPOS force field, non-bonded cut-off distance of 8 Ǻ , distance-dependent dielectric function, 1000 iterations, the simplex method in an initial optimization and the energy gradient convergence criterion with a threshold of 0.05 kcal*mol -1 *Å -1 . The GQ core was frozen during the above reoptimizations.
The molecular dynamics simulations (MD) were performed with the Amber 9 suite with ff99SB and parmbsc0 force fields as described previously (Varizhuk et al. 2013). The trajectory length was 35 ns. Snapshot visualization and hydrogen bond analysis were performed using VMD (http://www.ks.uiuc.edu/Research/vmd/) with a donor-acceptor distance of 3 Å and an angle cutoff of 20 degrees. Snapshots were taken every 0.1 ns. The MM-GBSA method was used to calculate the Ct1 GQ free energies for models 1 and 2. In this approach, the free energy is calculated according to the formula , where E MM , G sol and TS are the total mechanical energy of the molecule in gas phase, the free energy of hydration and the entropic contribution, respectively. E MM was calculated as the sum of the electrostatic energies, van der Waals energies and the energies of internal strain (bonds, angles and dihedrals) by using a molecular-mechanics approach. G sol was calculated as the sum of the polar (G polar ) and nonpolar (G nonpolar ) terms. The electrostatic contribution to the hydration energy G polar was computed using the Generalized Born (GB) method (Onufriev et al. 2000) using the algorithm developed by Onufriev et al. (Weiser et al. 1999;Onufriev et al. 2002) for calculating the effective Born radii. The non-polar component of the hydration energy G nonpolar , which includes solute-solvent van der Waals interactions and the free energy of cavity formation in solvent, was calculated using the following formula: , where SASA is the solvent accessible surface area. SASA was computed using the LCPO method (Srinivasan et al. 1998) with α = 0.00542 kcal/mol -1 Å -2 . The entropic term was not calculated explicitly, but it was accounted for implicitly via the GQ conformational mobility. Snapshots taken from a single trajectory of the MD simulation of the complex were used for the calculations of the binding free energy.

TABLES
Position of a mismatch/bulge Formula 5' -terminal G-run G i-1 NG k-i (L xA G k ) 3 , A=1,2,3 The second G-run from the 5'terminus G k L x1 G i-1 NG k-i (L xA G k ) 2 , A=2,3 for mGQs G k L x1 G i-1 NG k-i+1 (L xA G k ) 2 , A=2,3 for bGQs The third G-run from the 5'terminus (G k L xA ) 2 G i-1 NG k-i L x3 G k , A=1,2 for mGQs (G k L xA ) 2 G i-1 NG k-i+1 L x3 G k , A=1,2 for bGQs 3'-terminal G-run (G k L xA ) 3 G i-1 NG k-i , A=1,2,3 L denotes any nucleotide in a loop; N denotes any mismatching (bulging) nucleotide (if N = G, the formula describes a canonical, 'perfect' GQ); A is the loop index; х A is the number of nucleotides in loop A; k is the number of guanosines in a G-run (k≥3); and i is the position of a mismatch/bulge in the G-run (1<i<k).   n  e  c  k  l  a  c  e  m  o  n  o  m  o  r  p  h  i  c  G  -q  u  a  d  r  u  p  l  e  x  e  s  i  n  t  h  e  h  u  m  a  n  C  E  B  2  5  m  i  n  i  s  a  t  e  l  l  i  t  e  .   J  A  m  C  h  e  m  S  o  c   1  3  4   (  1  3  )  :  5  8  0  7  -5  8  1  6  .  B  e  a  u  d  o  i  n  J  D  ,  J  o  d  o  i  n  R  ,  P  e  r  r  e  a  u  l  t  J  P  .  2  0  1  3  .  N  e  w  s  c  o  r  i  n  g  s  y  s  t  e  m  t  o  i  d  e  n  t  i  f  y  R  N  A  G  -q  u  a  d  r  u  p  l  e  x  f  o  l  d  i  n  g  .   N  u  c  l  e  i  c  A  c  i  d  s  R  e  s   .  F  i  s  e  t  t  e  J  F  ,  M  o  n  t  a  g  n  a  D  R  ,  M  i  h  a  i  l  e  s  c  u  M  R  ,  W  o  l  f  e  M  S  .  2  0  1  2  .  A  G  -r  i  c  h  e  l  e  m  e  n  t  f  o  r  m  s  a  G  -q  u  a  d  r  u  p  l  e  x  a  n  d  r  e  g  u  l  a  t  e  s