The parallel-stranded d(CGA) duplex is a highly predictable structural motif with two conformationally distinct strands

DNA can adopt non-canonical structures that have important biological functions while also providing structural diversity for nanotechnology applications. Here, we describe the crystal structures of two oligonucleotides composed of d(CGA) triplet repeats in the parallel-stranded duplex form. The structure determination of four unique d(CGA)-based parallel-stranded duplexes across two crystal structures has allowed us to characterize and establish structural parameters of d(CGA) triplets in the parallel-stranded duplex form. Our results show that d(CGA) units are highly uniform, but that each strand in the duplex is structurally unique and has a distinct role in accommodating structural asymmetries induced by the C-CH+ base pair.


INTRODUCTION
DNA is a polymorphic biopolymer that can adopt an array of conformations beyond the traditional B-form double helix. Watson-Crick base paired duplexes can access multiple helical forms (A or Z-form) depending on sequence and environmental conditions (1,2). The hydrogen bonding and base stacking interactions that stabilize anti-parallel duplexes also allow DNA to access other non-canonical conformations, some of which have known biological implications, including G-quadruplexes (3), i-motifs (4), and triplexes (5). Further, the formation of many non-canonical structures can be controlled by nucleotide sequence composition and several environmental factors including pH, the presence and concentration of cations, and temperature (6-11). The alternative structures formed by genomic DNA triplet repeat sequences (12-14) have been implicated in their ability to expand and cause genetic instabilities (15,16). Understanding these alternative structures and the conditions that may lead to their formation provides a fundamental basis for understanding the associated disease states. Additionally, the ability to control DNA conformations has utility in DNA nanotechnology applications where non-canonical motifs expand structural and functional diversity of nanostructures while retaining inherent programmability and predictability.
The d(CGA) triplet repeat motif is another such environmentally sensitive motif that can adopt different structural forms in a pH-dependent manner at near-physiological temperature and salt concentration (31). Neutral pH favors a unimolecular anti-parallel hairpin stabilized by canonical G-C base pairs (12,32), while acidic pH favors a non-canonical homo-base paired parallel-stranded duplex (ps-duplex) (33,34). Although the non-canonical d(CGA) motif can adopt distinct structural conformations, the ps-duplex is the predominantly studied form (33,35-38). Originally described as Π-DNA, the d(CGA)n ps-duplex is stabilized by homo-base pair interactions (C-CH+, G-G, and A-A) and inter-strand base stacking interactions (34,38). The C-CH+ homo-base pair requires hemi-protonation at the N3 position to form three hydrogen bonds along the Watson-Crick face (34). N2-N3 sugar-edge hydrogen bonds stabilize G-G homo-base pairs, while A-A homo-base pairs are formed through N6-N7 Hoogsteen face hydrogen bonds. Importantly, the GpA dinucleotide step provides significant stabilization to the ps-duplex by the formation of inter-strand G/A base stacking interactions.
The structure and stability of the ps-duplex are highly influenced by the 5ʹ-nucleotide of each triplet (31). Similar G/A-stacking interactions have been observed in ps-duplex structures containing d(GGA) or d(TGA) triplets (35,37,39-41), though contiguous repeats of these sequences are unable to form ps-duplexes (31). The ps-duplex region of an intercalation-locked tetraplex containing d(TGA) triplets forms a perfectly symmetrical duplex (37), while the same ps-duplex region containing d(CGA) triplets resulted in structural asymmetry and duplex bending (36). The asymmetry is associated with a displacement from the helical axis at the C-CH+ base pair. Further, thermodynamic studies indicated that ps-duplex structures containing six tandem d(YGA) triplet repeats undergo a significant destabilization when d(CGA) triplets are replaced with d(TGA) triplets (31). Beyond the additional hydrogen bond interaction within each C-CH+ base pair, the structural details as to why asymmetric d(CGA) duplexes are significantly more stable than symmetric d(TGA) duplexes remain unclear.
The ability of d(CGA) to form distinct structural states appears to be a trait shared by several other triplet repeat motifs, though d(CGA) is the only triplet known to form purely ps-duplex structures (12-14,31). Interestingly, genomic analysis of all possible triplet repeat sequences has shown that d(CAG) triplets are over-represented in the human genome and implicated in disease pathologies, while d(CGA) triplet repeat sequences are the least frequently observed, occurring only 16 times (42). A similarly low frequency and coverage of d(CGA) triplets was seen when a comparable genomic analysis was performed in other eukaryotic organisms (43). The formation of alternative structures and the relative stabilities of such structures are thought to be important factors contributing to the expansion of repeat sequences (13,44,45).
Due to the challenges they present to replication machinery, repeat sequences that form alternative structures could influence pathological or evolutionary outcomes (32,46). Therefore, it is important to characterize the structural diversity of triplet repeat sequences that have the ability to form such noncanonical structures.
In this work we have determined the crystal structure of two oligonucleotides containing multiple tandem d(CGA) triplet repeats in the ps-duplex form. These structures are the longest ps-duplexes to be solved solely comprised of such triplets. The crystals grew from different solution conditions and resulted in distinct crystal packing arrangements. The structure determination of four ps-duplexes across these two different crystal structures has allowed us to thoroughly characterize and define the structural features of d(CGA) triplets and the ps-duplexes they form. Despite crystallization and molecular packing differences, the resulting ps-duplex structures have strikingly low RMSD values, demonstrating the robust structural uniformity of the d(CGA) triplet repeat motif in the ps-duplex form. Additionally, each ps-duplex contains two conformationally distinct d(CGA) triplets based on hydrogen bonding and base stacking interactions.
Surprisingly, each strand contains only one triplet conformation. Thus, ps-duplexes containing d(CGA) repeats are not structurally symmetrical and the apparent structural asymmetry is propagated discretely throughout each strand.

Data collection, processing, and structure determination
Crystals were removed from drops with nylon cryo-loops, immediately dipped in the respective crystallization condition supplemented with 30% MPD or PEG400 and plunged into liquid nitrogen.
Diffraction data were collected at the Advanced Photon Source (APS), Argonne National Laboratory.
(CGA)5TGA data were collected on the 24-ID-E beamline and GA(CGA)5 data were collected on the 24-ID-C beamline.
Data for (CGA)5TGA were processed with iMosflm (47), and data for GA(CGA)5 were processed with XDS (48) and Aimless (49). Initial phases were obtained by molecular replacement using Phaser (50). The parallel-stranded homo-duplex d(CGA) triplet region from PDB ID 1IXJ (35) was used as the search model for (CGA)5TGA, and two tandem d(CGA) units from the refined (CGA)5TGA structure were used as the search model for GA(CGA)5. Refinement and model building were carried out in Phenix (51) and Coot (52), respectively, for both datasets.

Circular Dichroism
CD spectra were obtained using a Jasco J-810 spectropolarimeter fitted with a thermostatted cell holder (Jasco, Easton, MD). Samples were prepared using 10 μM DNA in 20 mM MES, 100 mM sodium chloride (pH 5.5), or 20 mM sodium cacodylate, 100 mM sodium chloride (pH 7.0). Samples were incubated at 4 °C overnight prior to data collection. Data were collected at room temperature at wavelengths from 220-300 nm.

Overview
We determined the crystal structures of (CGA)5TGA and GA(CGA)5 in the ps-duplex form at 2.1 Å and 1.32 Å resolution, respectively (Table 1). (CGA)5TGA was crystallized at pH 5.5 to preferentially stabilize the ps-duplex form, while GA(CGA)5 was crystallized at pH 7.4 to characterize the anti-parallel hairpin form. Despite being at a pH that strongly favors the hairpin form (Figure 1A,B), GA(CGA)5 also crystallized as a ps-duplex.
Several other structures that rely on the C-CH+ hemi-protonation have also crystallized as ps-duplexes at above-neutral pH, suggesting that factors beyond pH influence this structural preference (37,54). The high local concentration of DNA and the presence of crowding agents have been demonstrated to increase the observed pH of the structural transition in C-CH+ mediated structures (32,55,56). CD measurements of d(CGA) repeat sequences are consistent with these observations; the presence of crowding agents shifts the favorability range of the ps-duplex to higher pH (Figure 1C). Specifically, the addition of 30% PEG2000 increased the pH of the structural transition by 0.33 ± 0.06 and 0.32 ± 0.13 pH units for (CGA)5TGA and (CGA)5, respectively (Figure 1D). Also, previous thermodynamic measurements have demonstrated a significantly greater stability of the ps-duplex over the anti-parallel hairpin form (31). Therefore, it is not surprising that the significantly more stable ps-duplex form is dominant in crowded crystallization conditions where structural stability is advantageous. It may therefore also be possible for d(CGA) ps-duplexes to form in crowded cellular environments, similar to other C-CH+ mediated DNA structures (57-59). Despite testing multiple constructs of d(CGA)-derived oligonucleotides, we were unable to determine a structure in the hairpin form.

Crystal Packing
In the GA(CGA)5 crystal structure, six strands form three parallel-stranded homo-duplexes (Duplex 1, 2, and 3) in the asymmetric unit (Figure 2A). Duplex 2 is coaxially stacked between duplexes 1 and 3 through 3ʹ to 5ʹ end stacking of the terminal G1-G1 and A17-A17 base pairs. This arrangement results in a junction of three tandem sets of inter-strand G/A stacking interactions at each duplex intersection to stabilize the crystal lattice (Figure 2B). This packing arrangement forms columns of alternating ps-duplexes propagating throughout the crystal along the c-axis. Interestingly, this is the first instance of 3ʹ to 5ʹ end stacking in this class of ps-duplexes; other ps-duplexes containing the d(CGA) motif stack in the 3ʹ-3ʹ or 5ʹ-5ʹ orientation (36). This difference is likely due to the lack of a 5ʹ-C in GA(CGA)5. The exposed 5ʹ-G allows for preferential formation of inter-duplex G/A stacking interactions with the 3ʹ-A of another duplex that directly mimic the internal inter-strand G/A stacking interactions.
In the (CGA)5TGA structure, two strands form one homo-duplex (Duplex 4) in the asymmetric unit (Figure 2C). The duplex is stacked with crystallographically identical duplexes via 5ʹ-5ʹ stacking of C1-C1 base pairs and 3ʹ-3ʹ stacking of A18-A18 base pairs.
Both crystals grew in the presence of divalent cations, which primarily mediate inter-duplex crystal packing interactions (Figure S1). When possible, anomalous difference maps and coordination distances were used to verify cation identity and placement (Figure S2). GA(CGA)5 (duplexes 1-3) crystallized in the presence of hexamminecobalt(III) (NCO) and Sr2+. Specifically, NCO is positioned in multiple conformations between the Hoogsteen faces of guanosines from two duplexes, with GN7-GN7 and GO6-GO6 distances of 8.7 ± 0.3 Å and 7.6 ± 0.1 Å, respectively (Figure S1A). Sr2+ mediates the remaining inter-duplex guanosine positions in two distinct modes.
The first set of Sr2+ mediated interactions is similar to NCO but has shorter GN7-GN7 and GO6-GO6 distances (7.9 ± 0.1 Å and 6.8 ± 0.1 Å, respectively) (Figure S1B). The remaining Sr2+ cations are similarly positioned between two guanosines from separate ps-duplexes but the major groove faces are positioned such that GN7-GO6 are oriented together (9.31 ± 0.03 Å) (Figure S1C). In the (CGA)5TGA structure, Ba2+ mediates inter-duplex packing in two distinct environments. One mode is almost identical to the first set of Sr2+ mediated interactions, where GN7-GN7 and GO6-GO6 distances are 7.8 ± 0.1 Å and 6.9 ± 0.0 Å, respectively (Figure S1D). The remaining Ba2+ cations are positioned between the major groove face of one guanosine and the phosphate oxygen of the opposing duplex guanosine, where the GN7-PO2 and GO6-PO2 distances are 9.4 ± 1.3 Å and 8.9 ± 0.9 Å, respectively (Figure S1E). Interestingly, despite the different cations with unique packing interactions, the resulting ps-duplex structures were highly uniform. This indicates that each strand within the ps-duplex is conformationally unique.

Structural asymmetry is induced by the C-CH+ base pair
We assessed the overall symmetry of the ps-duplex form to determine how structural asymmetries are propagated through the ps-duplex. The linearity of each duplex along base pair units was measured by connecting the mid-point of the hydrogen bonding partners of each homo-base pair (Figure 4A). Each d(CGA) triplet exhibits a similar bending pattern centered around the largest deviation from linearity (25.0° ± 3.9°) at the C-CH+ base pair (Figure 4B). The G-G centric angle does not propagate significant deviations from linearity, while the magnitude of the A-A centric deviation is highly dependent upon the identity of the following nucleotide (C or T). When a C-CH+ base pair is present, the adjacent 5ʹ-A-A centric angle adopts a deviation (20.8° ± 4.3°) similar to the C-CH+ centric angle. Alternatively, the A-A centric deviation is smaller (9.0°) when followed by a T-T base pair (Figure 4B). Structural overlays of the A-(C/T) step indicate that this deviation coincides with the extension of one cytosine from the helical axis to align the Watson-Crick faces for the formation of the hemi-protonated C-CH+ base pair (Figure 4C), as previously noted (36).
There is also a slight displacement of the adjacent adenosine on the same strand which could be required to accommodate the cytosine deviation. This contrasts with the T-T base pair which makes interactions in a perfectly symmetrical manner; therefore, the adjacent adenosine also remains unbent. We conclude that the A-A base pair provides structural flexibility to accommodate deviations from linearity induced by the C-CH+ base pair.
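The linearity measurement described above reduces to the angle between consecutive segments that join base-pair midpoints. The following is a minimal Python sketch of that geometry, not the procedure actually used for Figure 4. It assumes that one midpoint per homo-base pair has already been derived from the coordinates of its hydrogen-bonding atoms; the text does not state which atoms define each midpoint, and the function names `pair_midpoint` and `bend_angles` are hypothetical.

```python
import numpy as np


def pair_midpoint(atom_a, atom_b):
    """Midpoint between two hydrogen-bonding partner atoms (x, y, z in Å)."""
    return (np.asarray(atom_a, dtype=float) + np.asarray(atom_b, dtype=float)) / 2.0


def bend_angles(midpoints):
    """Deviation from linearity (degrees) at each interior base pair.

    midpoints: sequence of (x, y, z) points, one per base pair, ordered
    along the duplex. The angle at base pair i is measured between the
    segment (i-1 -> i) and the segment (i -> i+1); 0 means perfectly linear.
    """
    pts = np.asarray(midpoints, dtype=float)
    segments = np.diff(pts, axis=0)  # vectors between consecutive midpoints
    segments /= np.linalg.norm(segments, axis=1, keepdims=True)
    cosines = np.einsum("ij,ij->i", segments[:-1], segments[1:])
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
```

For a duplex with N base pairs the function returns N - 2 angles, because the two terminal base pairs have only one adjacent segment each.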

Each strand within the ps-duplex has unique structural character
Backbone torsion angle analysis and the duplex asymmetry suggested that the two strands of the ps-duplex have unique structural characteristics. These differences are correlated with two distinct hydrogen bond interactions that form within the A/C step between d(CGA) triplets (Figure 5A). The first hydrogen bond is between cytosine N4 (C-N4) and a non-bridging phosphate oxygen (O2P) of the previous adenosine within the same strand. There is no bond equivalent to the C-N4 to O2P bond in the T-T base pair, further suggesting that this bond could be influential in controlling the relative position of the C-CH+ and A-A base pairs. The second hydrogen bond is between the same non-bridging phosphate oxygen and the adenosine N6 (A-N6) of the opposing strand. Interestingly, depending on the strand within each duplex, there are unique differences in A-N6 to O2P and C-N4 to O2P bond lengths (Figure 5B). In one strand, herein referred to as "rigid," all A-N6 to O2P and C-N4 to O2P bond distances remain within 2.8 to 3.1 Å.
However, within the opposite strand (referred to as "loose") the same bond lengths increase beyond hydrogen bond distance. The average A-N6 to O2P and C-N4 to O2P distances within the loose strand are 4.1 ± 0.4 Å and 3.5 ± 0.1 Å, respectively. The cytosine that is displaced from the helical axis is always on the loose strand, where the increased bond lengths and wider range of torsion angles within the loose strand coincide with this displacement.
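The rigid/loose distinction above can be expressed as a simple distance test. The sketch below is illustrative only: the 3.3 Å cutoff is an assumed value chosen to fall between the reported rigid range (2.8 to 3.1 Å) and the reported loose-strand averages (3.5 Å and 4.1 Å). It is not a threshold stated in the text, and the function names are hypothetical.

```python
import numpy as np

# Assumed cutoff (Å) separating hydrogen-bonded from non-bonded contacts.
# Hypothetical value placed between the rigid and loose distances reported.
HBOND_CUTOFF = 3.3


def distance(atom_a, atom_b):
    """Euclidean distance (Å) between two atoms given as (x, y, z)."""
    return float(np.linalg.norm(np.asarray(atom_a, float) - np.asarray(atom_b, float)))


def classify_strand(distances, cutoff=HBOND_CUTOFF):
    """Label a strand from its A-N6 to O2P and C-N4 to O2P distances.

    Returns "rigid" if every distance is within the cutoff, "loose" if
    every distance exceeds it, and "mixed" otherwise.
    """
    within = [d <= cutoff for d in distances]
    if all(within):
        return "rigid"
    if not any(within):
        return "loose"
    return "mixed"
```

A "mixed" result would correspond to asymmetry propagated per triplet rather than per strand, which is the alternative the structures described here do not show.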
Accompanying the differences in hydrogen bonding are distinct differences in base stacking interactions between loose and rigid strands (Figure 5C). Base pair overlap areas (excluding exo-cyclic groups) calculated for each duplex using 3DNA v2.4 (60) indicate that intra-rigid strand A/C and C/G steps maintain similar overlap areas of 2.4 ± 0.5 Å² and 2.3 ± 0.4 Å², respectively (Figure 5D). The inter-strand G/A stacking interaction adjacent to the A/C step on the rigid strand also has a similar overlap area of 2.8 ± 0.9 Å² (Figure 5D). However, the stacking areas of the loose strand are more variable. The A/C step on the loose strand has the lowest base overlap area (0.4 ± 0.1 Å²), while the G/A (inter-strand) and C/G (intra-strand) stacking interactions surrounding the A/C step have the highest stacking overlap areas (4.6 ± 0.4 Å² and 4.0 ± 0.2 Å², respectively; Figure 5D). The large stacking interactions surrounding the bent A/C step within the loose strand contribute additional stabilization that may compensate for the increased base-to-phosphate hydrogen bond distances.
The overall structural asymmetry and accompanying differences in hydrogen bonding and base stacking interactions among strands are observed throughout each ps-duplex studied. Though it would be conceivable to expect the structural asymmetry to be propagated on a per-triplet basis, we observed the propagation on a per-strand basis over the entire length of the d(CGA) repeats. Thus, each ps-duplex is composed of two structurally unique strands where all triplets within a strand adopt either the loose or rigid character. The structural homogeneity of triplets within strands implies that duplexation of tandem d(CGA) triplets could occur in a cooperative manner. Further, the distinct conformations of each strand could play separate roles in accommodating the structural asymmetry. The rigid strand is the structural scaffold strand that maintains consistent hydrogen bonding and stacking interactions, while the loose strand provides structural flexibility to stabilize and accommodate deviations from linearity induced by the C-CH+ base pair.
... the rigid strand C-N4 to O2P and A-N6 to O2P bond distances increase from an average of 3.0 Å to 4.4 Å and 4.9 Å, respectively, while the loose strand A-N6 to O2P distance decreases from 4.0 Å to 3.5 Å (Figure S8A). The C1ʹ-C1ʹ distance for the T-T homo-base pair is 1.4 Å wider than that of the C-CH+ homo-base pair; therefore, upstream swelling of the rigid strand could be required to accommodate the wider T-T homo-base pair (Figure S8B). Increased base overlap areas of the G/A steps adjacent to the d(TGA) triplet could also contribute additional stabilization to compensate for the extended rigid strand bond distances (Figure S8C,D). Enthalpic destabilization observed in sequences containing d(TGA) triplets was previously attributed to the loss of one hydrogen bond from replacing the hemi-protonated C-CH+ with a T-T base pair (31).
The structure described here further suggests this destabilization could also be due to the loss of the C-N4 to O2P hydrogen bond and swelling of adjacent d(CGA) triplets that coincide with the addition of a T-T base pair. The incorporation of a d(TGA) triplet at the 3ʹ-end of a long stretch of d(CGA) triplets does not disrupt the overall ps-duplex structure but induces slight structural changes in the adjacent d(CGA) triplet.
Though d(CGA) and d(TGA) triplets are not structurally identical within the ps-duplex, they could be used to control rigid and loose strands. Interestingly, UBrGA triplets have been shown to offer increased stability to the ps-duplex via the formation of a halogen bond with the phosphate oxygen of an adjacent adenosine (37). This indicates the potential value of the UBrGA triplet in the rational design of d(CGA)-containing ps-duplexes. To fully evaluate their potential use as discriminator triplets, structural analysis of d(CGA) repeat sequences containing internal d(TGA) triplets (and UBrGA triplets) is needed to understand the effect of d(TGA) incorporation on downstream d(CGA) triplets.

Prospects for d(CGA)-based ps-duplexes in technology development and biology
The crystal structures described here have allowed us to characterize the d(CGA) triplet repeat motif in the ps-duplex form and establish structural features for its use as a building block in DNA nanotechnology applications. The generalized helical and base parameters established by these structures will serve as constraints for the incorporation of d(CGA)-based triplets into rational structure design.
In particular, having an integer number of base pairs per turn (9.0 ± 0.1 base pairs) simplifies its use from a design perspective, as incorporation of three d(CGA) repeats completes exactly one helical turn. Though the formation of stable alternative DNA structures may be desirable for the rational design of DNA-based architectures, they may be unfavorable or selected against in biological systems. Specifically, repeat sequences that form alternative structures could undergo greater instability due to the challenges they present to replication machinery (46). However, there is also the possibility that readily formed, thermodynamically stable structures would be selected against evolutionarily, as they may have an even greater impact on endogenous replication systems. Interestingly, d(CGA) triplet repeat expansions have not been implicated in human disease and are found least frequently in eukaryotic genomes (42,43). This raises the interesting possibility that their propensity to form highly stable ps-duplex structures could be a ...

** Correlation coefficient between reflection intensities from the data set randomly split into two halves.

Figure 1. ... (31,32,34). The anti-parallel form has a weak positive band at 280 nm and weak negative band at 260 nm (31,32,34). A. CD spectrum for GA(CGA)5 at pH 5.5 (blue) or pH 7.0 (black). B. CD spectrum for (CGA)5TGA at pH 5.5 (blue) or pH 7.0 (black). C. (CGA)5 forms a ps-duplex at pH 5.0 (light blue) and anti-parallel hairpin at pH 6.6 (dark blue) in the same buffer conditions as A/B. The formation of the ps-duplex form at pH 6.6 is favored in the presence of 30% PEG400 (dark gray), PEG2000 (medium gray), and PEG4000 (light gray). D. Crowding agents increase the pH of the structural transition from ps-duplex to anti-parallel hairpin form. The transition was measured as the loss of characteristic ps-duplex signal at 265 nm in native conditions (solid lines) or in the presence of 30% PEG2000 (dashed lines) for (CGA)5 (gray) and (CGA)5TGA (pink).

Figure 4. ... The deviation from linearity of a base pair is measured as the angle between the two cylinders adjacent to the base pair of interest. Individual cylinders were created by connecting points placed at the midpoint of the hydrogen bonding partners of each base pair. The resulting cylinders are colored based on the identity of the base pairs they connect: G-A (gray), A-C (black), C-G (blue). B. Deviation from linearity of each base pair along the (CGA)5TGA (purple) or GA(CGA)5 (red) sequence. Colored bars along the sequence correspond to the same cylinders connecting base pairs from A. The angles measured for GA(CGA)5 are represented as the average of duplexes 1-3 and (CGA)5TGA is from duplex 4. C. Overlay of d(YGA) triplets within (CGA)5TGA, rotated 180° to highlight A-A and C-C base pairs. Five A/C steps (gray and teal) overlaid with one A/T step (black). Compared to the black strand, the teal strand does not show significant structural deviation, while the nucleotides within the gray strand are extended out of the helical axis.

Figure 5. Hydrogen bond distances and base overlap areas are used to distinguish loose and rigid strands within the d(CGA) ps-duplex. A. The A/C step highlighting the A-N6 to O2P and C-N4 to O2P interactions within loose (gray) and rigid (teal) strands. Chain A (duplexes 1 and 4), chain C (duplex 2), and chain E (duplex 3) have been characterized as loose strands. Chain B (duplexes 1 and 4), chain D (duplex 2), and chain F (duplex 3) have been characterized as rigid strands. B. Loose (gray) and rigid (teal) strand bond distances represented along the GA(CGA)5 sequence. A-N6 to O2P distances are plotted as circles and C-N4 to O2P distances are plotted as diamonds. Each data point represents the average distance measured from duplexes 1-3. Loose strand bond distances cycle between 3.5 ± 0.1 Å and 4.1 ± 0.1 Å depending on the identity of the nucleotide involved in the interaction, while rigid strand bond distances remain between 2.8 and 3.1 Å, regardless of the interaction. C. Base overlap areas are different for loose and rigid strands. View of all unique base stacking interactions (inter-strand G/A, intra-strand A/C, and intra-strand C/G) that contribute to d(CGA) triplet stabilization. A 90° rotation illustrates the difference in stacking overlap area between strands. The rigid strand (teal) maintains consistent stacking overlap areas, while the loose strand (gray) is highly variable. The star denotes the cytosine that is extended from the helical axis. D. Base stack overlap areas are represented as averages of overlap areas from d(CGA) triplets from duplexes 1-4 and are shown for the respective base steps.