Abstract
Photoisomerization of retinoids inside a confined protein pocket represents a critical chemical event in many important biological processes, from animal vision and non-visual light effects to bacterial light sensing and harvesting. Light-driven proton pumping in bacteriorhodopsin entails exquisite electronic and conformational reconfigurations during its photocycle. However, it has been a major challenge to delineate transient molecular events preceding and following the photoisomerization of the retinal from noisy electron density maps when varying populations of intermediates coexist and evolve as a function of time. Here I report several distinct early photoproducts deconvoluted from the recently observed mixtures in time-resolved serial crystallography. This deconvolution substantially improves the quality of the electron density maps and thereby demonstrates that the all-trans retinal undergoes extensive isomerization sampling before it proceeds to the productive 13-cis configuration. Upon light absorption, the chromophore attempts to perform trans-to-cis isomerization at every double bond, coupled with stalled anti-to-syn rotations at multiple single bonds along its polyene chain. Such isomerization sampling pushes all seven transmembrane helices to bend outward, resulting in a transient expansion of the retinal binding pocket and, later, a contraction due to recoiling. These ultrafast responses observed at atomic resolution support the view that the productive photoreaction in bacteriorhodopsin is initiated by light-induced charge separation in the prosthetic chromophore yet governed by the stereoselectivity of its protein pocket. The method of numerically resolving concurrent events from mixed observations is also generally applicable.
Significance Statement
Photoisomerization of retinal is a critical rearrangement reaction in many important biological processes, from animal vision and non-visual light effects to bacterial light sensing and harvesting. It has been a major challenge to visualize rapid molecular events preceding and following photoisomerization, so many protein functions that depend on this reaction remain vaguely understood. Here I report a direct observation of the stereoselectivity of bacteriorhodopsin and thereby delineate the structural mechanism of isomerization. Upon a light-induced charge separation, the retinal in a straight conformation attempts to perform double bond isomerization and single bond rotation everywhere along its polyene chain before it proceeds to the productive configuration. This observation improves our understanding of how a nonspecific attraction force could drive a specific isomerization.
Introduction
Bacteriorhodopsin (bR) pumps protons outward from the cytoplasm (CP) against the concentration gradient via photoisomerization of its retinal chromophore. The trimeric bR on the native purple membrane shares the seven transmembrane helical fold and the same prosthetic group (Fig. S1) with large families of microbial and animal rhodopsins (Ernst et al., 2014; Kandori, 2015). An all-trans retinal in the resting state is covalently linked to Lys216 of helix G through a Schiff base (SB), of which the double bond C15=Nζ is also in trans. Upon absorption of a visible photon, the all-trans retinal in bR isomerizes efficiently and selectively to adopt the 13-cis configuration (Govindjee et al., 1990). In contrast, an all-trans free retinal in organic solvents could isomerize about various double bonds, but with poor quantum yields (Freedman and Becker, 1986; Koyama et al., 1991).
A broad consensus is that the isomerization event takes place around 450-500 fs during the transition from a blue-shifted species I to form a red-shifted intermediate J (Herbst, 2002; Mathies et al., 1988). Various molecular events prior to the isomerization have also been detected. Vibrational spectroscopy showed a variety of possible motions, such as torsions about C13=C14 and C15=Nζ, H-out-of-plane wagging at C14, and even protein responses (Diller et al., 1995; Kobayashi et al., 2001). Nevertheless, the species I or a collection of species detected before 30 fs remain in a good trans configuration about C13=C14 instead of a near 90° configuration (Zhong et al., 1996). Recently, deep-UV stimulated Raman spectroscopy revealed strong signals of Trp and Tyr motions in the protein throughout the I and J intermediates (Tahara et al., 2019). Despite extensive studies, fundamental questions on the photoisomerization of retinal remain unanswered at the atomic resolution. What is the quantum mechanical force that causes the all-trans retinal to isomerize specifically to 13-cis after absorbing a photon? Why not isomerize elsewhere in bR? How is the quantum yield of this specific isomerization enhanced by the protein compared to those of free retinal in solution? Does any isomerization sampling occur? This work addresses these questions by solving a series of structures of the early intermediates based on the electron density maps unscrambled from the published serial crystallography datasets using singular value decomposition (SVD). These structures of “pure” photoproducts at atomic resolution reveal widespread conformational changes in all seven helices prior to the all-trans to 13-cis isomerization and after its completion, suggesting that isomerization sampling takes place in bR, where rapid photoisomerizations and single bond rotations are attempted everywhere along the polyene chain of the retinal before the only successful one flips the SB at ∼500 fs. 
The implications of these findings for proton pumping and directional conductance are presented in a companion paper (Ren, 2021).
Several international consortiums carried out large operations of serial crystallography at X-ray free electron lasers (XFELs). It is now possible to capture transient structural species at room temperature in the bR photocycle as short-lived as femtoseconds (Brändén and Neutze, 2021). Compared to cryo-trapping, authentic structural signals from these XFEL data are expected to be greater in both amplitude and scope. However, the signals reported so far do not appear to surpass those obtained by cryo-trapping methods, suggesting much needed improvements in experimental protocols and data analysis methods. Two major sources of data are used in this study (Table S1). Nogly et al. captured retinal isomerization to 13-cis by the time of 10 ps and attributed the specificity to the H-bond breaking between the SB and a water (Nogly et al., 2018). Kovacs et al. contributed datasets at many short time delays (Kovacs et al., 2019). Those sub-ps datasets demonstrate oscillatory signals at frequencies around 100 cm⁻¹. The essence of this work is a numerical resolution of structural heterogeneity, a common difficulty often encountered in cryo-trapping and time-resolved serial crystallography. To what extent a specific structural species can be enriched in crystals depends on the reaction kinetics governed by many experimental parameters including, but not limited to, the fluence, wavelength, and temperature of the light illumination. While it is possible to reach higher fractional concentrations at specific time points for more stable species such as K or M due to the ratio between the rates going into and exiting from that species, transient species such as I and J are often poorly populated. If such structural heterogeneity is not resolved, it is very difficult, if not impossible, to interpret the electron density maps and to refine the intermediate structures (Ren et al., 2013). 
An assumption in nearly all previous studies has been that each dataset, at a cryo temperature or at a time delay, is derived from a mixture of a single photoinduced species and the ground state. Under this assumption, the difference map would reveal a pure intermediate structure. This assumption is far from reality and thus often leads to misinterpretation of the observed electron density maps. This work is yet another case study to demonstrate the application of our analytical protocol based on SVD (Methods), which makes no assumption on how many excited intermediates contribute to the captured signals at each time point (Ren, 2019; Ren et al., 2013; Yang et al., 2011). More importantly, this work showcases that our resolution of structural heterogeneity enables new mechanistic insights into highly dynamic chemical or biochemical processes.
Results and Discussion
A total of 24 datasets and 18 time points up to 10 ps are analyzed in this study (Table S1). Difference Fourier maps at different time points and with respect to their corresponding dark datasets are calculated according to the protocols previously described (Methods). A collection of 126 difference maps at short delays ≤ 10 ps are subjected to singular value decomposition (SVD; Methods) followed by a numerical deconvolution using the previously established Ren rotation in a multi-dimensional Euclidean space (Ren, 2016, 2019). Such resolution of electron density changes from mixed photoexcited species in the time-resolved datasets results in four distinct intermediate structures in the early photocycle, which are then refined against the reconstituted structure factor amplitudes (Table S2; Methods).
Low frequency oscillations observed upon photoexcitation
Ten out of 17 major components derived from the sub-ps delays of Kovacs et al. (Fig. S2) describe five two-dimensional oscillatory behaviors at frequencies ranging from 60 to 400 cm⁻¹ (Fig. S3). Compared to a bond stretching frequency commonly observed in vibrational spectroscopy, these oscillations are at much lower frequencies. The lowest frequency is 61±2 cm⁻¹, that is, a period of 550±20 fs (Fig. S3a), which matches the oscillation detected in transient absorption changes in visual rhodopsin (Wang et al., 1994). Although these ten components follow the oscillatory time dependencies, they do not show any association with the chromophore or the secondary structure of the protein (Fig. S4). Similar oscillatory components were also present in the XFEL datasets of MbCO (Ren, 2019). Therefore, the same conclusion stands that these low frequency vibrations induced by short laser pulses, often detected by ultrafast spectroscopy, are an intrinsic property of a solvated protein molecule, here specifically bacteriorhodopsin (bR) (Johnson et al., 2014; Liebel et al., 2014). Interestingly, the isomerization sampling and productive photoisomerization observed in this study occur within the first oscillatory period at the lowest frequency. While such coincidence raises the question of whether the protein oscillation is required for isomerization (see below), direct evidence is lacking in these XFEL data to support any functional relevance of these oscillatory signals.
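The correspondence between the quoted wavenumber and period can be checked with a short calculation. The sketch below is only an illustration of the unit conversion T = 1/(c·ν̃) with first-order error propagation; the function names are hypothetical and not part of any software used in this work.

```python
# Convert a vibrational wavenumber (cm^-1) to an oscillation period (fs).
C_CM_PER_S = 2.99792458e10  # speed of light in cm/s


def wavenumber_to_period_fs(wavenumber_cm: float) -> float:
    """Period T = 1 / (c * wavenumber), returned in femtoseconds."""
    return 1.0e15 / (C_CM_PER_S * wavenumber_cm)


def period_uncertainty_fs(wavenumber_cm: float, sigma_cm: float) -> float:
    """First-order error propagation: dT = T * (sigma / wavenumber)."""
    return wavenumber_to_period_fs(wavenumber_cm) * sigma_cm / wavenumber_cm


# Lowest frequency quoted in the text: 61 +/- 2 cm^-1.
period = wavenumber_to_period_fs(61.0)
sigma = period_uncertainty_fs(61.0, 2.0)
print(f"61 cm^-1 corresponds to {period:.0f} +/- {sigma:.0f} fs")

# Bounds of the quoted frequency range, 60 to 400 cm^-1.
print(f"60 cm^-1: {wavenumber_to_period_fs(60.0):.0f} fs")
print(f"400 cm^-1: {wavenumber_to_period_fs(400.0):.0f} fs")
```

The result for 61±2 cm⁻¹ is about 547±18 fs, consistent with the rounded value of 550±20 fs given above.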
Intermediates I’, I, and expansion of retinal binding pocket
In contrast to the oscillating signals, three components U10, U14, and U17 reveal strong light-induced structural signals in terms of both extensiveness and quality (Figs. 1ab and S5). These signals originate exclusively from a few time points of Nogly et al., too few to fit the time dependency with exponentials. Instead, a spline fitting through these time points gives rise to the estimated coefficients c10, c14, and c17 in the linear combination of c10U10 + c14U14 + c17U17 for reconstructing the electron density maps of the states I, J, and their respective precursors I’, J’ (Fig. 2a). A reconstituted difference map of I’ – bR (Fig. 1c) is located on the spline trajectory from the origin, that is, bR at the time point of 0-, to the first time point of 49-406 fs (PDB entry 6g7i). This state is denoted I’ as a precursor leading to the I state judged by the time point at ∼30 fs.
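The reconstitution of a map as a linear combination of selected SVD components can be sketched with synthetic data. This is a minimal illustration, assuming a toy data matrix whose columns are masked difference maps; the matrix sizes, the random "pure" maps, and the helper `reconstitute` are hypothetical, and the spline fitting and the rotation step used in this work are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 "pure" difference maps of 500 voxels each,
# mixed at varying fractions into 12 observed maps with a little noise.
n_voxels, n_maps, n_species = 500, 12, 3
pure = rng.normal(size=(n_voxels, n_species))
fractions = rng.uniform(0.2, 1.0, size=(n_species, n_maps))
A = pure @ fractions + 0.01 * rng.normal(size=(n_voxels, n_maps))

# SVD of the data matrix: columns of U are map-like components,
# w holds the singular values, rows of Vt hold the per-map coefficients.
U, w, Vt = np.linalg.svd(A, full_matrices=False)


def reconstitute(U, coefficients):
    """Return sum_k c_k * U_k for a dict {component index: coefficient}."""
    rho = np.zeros(U.shape[0])
    for k, c in coefficients.items():
        rho += c * U[:, k]
    return rho


# Coefficients of observed map j in the component basis are w[k] * Vt[k, j].
# Keeping only the leading components drops the noise-dominated ones.
j = 0
coeffs = {k: w[k] * Vt[k, j] for k in range(n_species)}
approx = reconstitute(U, coeffs)
rel_err = np.linalg.norm(approx - A[:, j]) / np.linalg.norm(A[:, j])
print("singular values:", np.round(w[:5], 2))
print("relative error of 3-component reconstitution:", rel_err)
```

In the analysis described above, the coefficients are not taken from a single observed map but are read off a fitted trajectory in the space of the selected components, so that a reconstituted map need not coincide with any one time point.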
However, this is not to say that a single species I’ exists around 30 fs. Quite the opposite, the population of the time-independent conformational species I’ rises and falls and peaks around 30 fs, while many other species during isomerization sampling coexist with I’ at the same time (see below). The reconstituted difference map is used to calculate a set of structure factor amplitudes that would produce this difference map of I’ – bR (Methods). And the structure of I’ is refined against this reconstituted dataset (beige; Figs. 1cd and S6). The same protocol is used to refine the structure of I state (purple; Fig. S7) with a reconstituted difference map I – bR (Figs. 1a, 2ab, 3a, and S5). This SVD-dependent refinement strategy extends the commonly used method based on an extrapolated map to another level. This newly developed method is able to refine a structure against any linear combination of signal components while eliminating noise and systematic error components, and components identified as other intermediate species mixed in the data. Therefore, this method enables the refinement of an unscrambled, hence pure, structural species (Methods).
The all-trans retinal chromophore in the ground state of bR is largely flat except the last atom C15 (Fig. 2c 2nd panel). In contrast, the side chain of Lys216 is highly twisted forming two near-90° single bonds (Fig. 2c 4th panel), which results in a corner at Cε that deviates dramatically from the plane of the all-trans retinal (Fig. 2c 2nd panel). The refined geometry of the retinal in I’ retains a near perfect all-trans configuration, including the Schiff base (SB) double bond C15=Nζ, while various single bonds along the polyene chain deviate from the standard anti conformation significantly (Fig. 2c 4th panel). The torsional deviations from anti are in a descending order from the β-ionone ring to the SB. These torsional changes result in an S-shaped retinal shortened by ∼4% (Fig. 2c 3rd panel). The distal segment C6-C12 moves inboard up to 0.9 Å and the proximal segment C13-Cε, including the SB, moves outboard up to 1.6 Å (Fig. 2c 1st and 2nd panels; see Fig. S1 for orientations in bR). This creased retinal observed here at around 30 fs (Fig. 1d) is attributed to the direct consequence of a compression under an attraction force between the β-ionone ring and the SB (see below).
The refined structure of the I state (Fig. S7) shows that the retinal remains in near perfect all-trans, including the SB, and as creased as its precursor I’ (Fig. 3c). The torsional deviations from anti single bonds become even more severe compared to the I’ state and remain in a descending order from the β-ionone ring to the SB (Fig. 2c 4th panel). The major difference from its precursor is that the single bond Nζ-Cε now adopts a perfect syn conformation (Figs. 2c 4th panel and 3c), and the anchor Lys216 has largely returned to its resting conformation. Such a lack of substantial change between the ground state and the intermediate I was previously noted by a comparison of a chemically locked C13=C14 with the native retinal (Zhong et al., 1996).
Remarkably, the major component U10 reconstituted into the difference map of I – bR contains widespread signal associated with all seven helices (Fig. 2b). The reconstituted map clearly shows collective outward motions from the center (Fig. 3a) suggesting an expansion of the retinal binding pocket at hundreds of fs, which is confirmed by the refined structure of the I state (Fig. 3d top panel). For example, the distance between the Cα atoms increases by 0.8 Å between Arg82 and Phe208 and by 0.7 Å between Tyr83 and Trp182. It is noteworthy that similar protein signals are present in the raw difference map calculated from the time point of 457-646 fs from Nogly et al. (6g7j) prior to an SVD analysis (Fig. S8).
Transient bleaching at near UV of 265-280 nm was observed before 200 fs and attributed to structural changes in the retinal skeleton and the surrounding Trp residues (Schenkl et al., 2005). Recent deep-UV stimulated Raman spectroscopy also demonstrated that motions of Trp and Tyr residues start to emerge at 200 fs and remain steady until the isomerization is over at 30 ps (Tahara et al., 2019). Here the refined structure of the I state with displaced helices and an expanded retinal binding pocket offers an explanation for the stimulated Raman gain change at hundreds of fs. However, it is unclear why and how such extensive protein responses take place even before the retinal isomerization. According to the broadly accepted concept of proteinquake, initial motions are generated at the epicenter where the chromophore absorbs a photon and then propagated throughout the protein matrix (Ansari et al., 1985). It is plausible that these ultrafast protein responses are the direct consequence of isomerization sampling in a confined protein pocket. It was observed in organic solvents using high-pressure liquid chromatography (HPLC) that all-trans retinal could isomerize at various double bonds along the polyene chain to adopt 9-, 11-, and 13-cis configurations, but with rather poor quantum yields (Freedman and Becker, 1986; Koyama et al., 1991). This intrinsic property of the all-trans retinal would behave the same even when it is incorporated in the protein except that the protein matrix herds the chromophores on the right track of the productive photocycle and keeps the concentrations of the attempted byproducts low. These byproduct conformations of the retinal during isomerization sampling are too numerous and too minor to be observed experimentally. Nevertheless, they cause a common effect, an expansion of its binding pocket, since the all-trans retinal in the resting state is tightly boxed by massive side chains all around (Fig. 3e). 
Any attempt to isomerize would push against this box one way or another. For instance, triple attempts to isomerize simultaneously at 11, 13, and 15 positions were suggested by a quantum mechanics/molecular mechanics simulation (Altoè et al., 2010). When the retinal binding pocket is altered in mutants, the quantum yield of each isomerization byproduct is expected to increase resulting in an impaired productive pathway (see below).
Intermediates J’, J and productive isomerization of retinal
The time point of 10 ps of Nogly et al. (6g7k) differs from the previous time point of 457-646 fs (6g7j) by negating the component of U10 (Fig. 2ab), which leads to a restoration of the normal retinal binding pocket in J’ from an expanded one in the I state followed by a contraction in J (Fig. 3d bottom panel). Two time-independent structures of J’ (green; Fig. S9) and J (gray; Fig. S10) are refined based on the respective reconstituted difference maps with the same protocol (Methods). Their populations peak at the approximate time of ∼700 fs and ∼20 ps, respectively. The observed contraction of the retinal binding pocket is likely due to an elastic recoiling of the seven helical bundle following its transient expansion caused by the isomerization sampling.
The creased retinal persists in both the J’ and J structures (Fig. 2c 2nd panel and Fig. 3c). The difference map of J’ – bR clearly shows the 13-cis configuration (Fig. 3b). Indeed, near perfect 13-cis is successfully refined in both structures (Fig. 2c 4th panel). While the SB double bond C15=Nζ is momentarily distorted from the trans configuration in J’ with a torsion angle of 133°, a perfect trans configuration at C15=Nζ is promptly restored in J (Fig. 2c 4th panel). The refined structures of this series of early intermediates show that the SB Nζ is rotating clockwise in the entire process of the isomerization of I’ ➔ I ➔ J’ ➔ J, if the retinal is viewed from the proximal to distal direction (Fig. 2c). It seems that the isomerization starts in an expanded retinal binding pocket and finishes in a tighter one. Whether the pocket expansion and contraction are required for the productive isomerization and what role the low frequency oscillations play in isomerization will need more time points at short delays to further isolate the molecular events temporally.
Coulomb attraction as driving force of isomerization sampling
The fundamental questions remain: What is the driving force that causes the all-trans retinal to isomerize after a photon absorption, at several double bonds if not restrained but exclusively at C13=C14 in bR? How does the protein environment enhance the quantum yield of the isomerization to 13-cis? Here I hypothesize that a Coulomb attraction between the β-ionone ring and the SB at the Franck-Condon point, the 0+ time point, provides the initial driving force upon a photon absorption. The electric field spectral measurements (Mathies and Stryer, 1976) and the quantum mechanics simulation (Nogly et al., 2018) suggested that a charge separation occurs along the polyene chain at the excited state of bR. Such a dipole moment was also detected through a transient bleaching signal at near UV region (Schenkl et al., 2005). It can be shown that a plausible charge separation of ±0.1e between the β-ionone ring and the SB would cause an attraction force > 1 pN. If calibrated with the measured range of dipole moment of 10-16 D (Mathies and Stryer, 1976), the charge separation could reach the level of ±0.16e to ±0.26e, giving rise to an attraction force of 3.5-9 pN between the β-ionone ring and the SB. This attraction force is evidently sufficient to crease the flat all-trans retinal into an S-shape and to compress it slightly within tens of fs as observed in I’ and I states (Figs. 1d, 2c 2nd and 3rd panels, and 3c). Meanwhile, this very attraction force also triggers simultaneous attempts of double bond isomerizations and single bond rotations along the polyene chain that cause the expansion of the retinal binding pocket as observed at hundreds of fs. Following the only successful isomerization at C13=C14, the chromophore segment from C15 to Cδ is attracted to the β-ionone ring; and these two parts become significantly closer (Fig. 2c 3rd panel). None of the single bond rotations can complete under the restraints of the protein. 
In particular, the segment closer to the midpoint of the retinal is more confined due to the steric hindrance of Thr90 and Tyr185 from the inboard and outboard sides, respectively (Fig. 3e). Therefore, the single bonds deviate from anti less and less towards the midpoint (Fig. 2c 4th panel). The effect of charge separation seems eased gradually as the reaction proceeds beyond the J state as indicated by the slow restoration of the anti conformation (Fig. 2c 4th panel).
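The force estimates quoted above can be reproduced with point-charge electrostatics. The sketch below is a back-of-envelope check only: the charge separation distance of 13 Å and the relative permittivity of 1 are assumptions that are not stated in the text, chosen here because they are consistent with the quoted charges and forces. The function names are hypothetical.

```python
K_E = 8.9875517923e9         # Coulomb constant, N m^2 C^-2
E_CHARGE = 1.602176634e-19   # elementary charge, C
DEBYE = 3.33564e-30          # 1 debye in C m

SEPARATION_A = 13.0          # ASSUMED ring-to-SB distance in angstroms
EPS_R = 1.0                  # ASSUMED relative permittivity (vacuum)


def charge_from_dipole(dipole_debye, separation_angstrom=SEPARATION_A):
    """Partial charge in units of e that gives a dipole mu = q * d."""
    d = separation_angstrom * 1e-10
    return dipole_debye * DEBYE / (d * E_CHARGE)


def coulomb_force_pn(q_e, separation_angstrom=SEPARATION_A, eps_r=EPS_R):
    """Attraction between point charges +q and -q, in piconewtons."""
    q = q_e * E_CHARGE
    d = separation_angstrom * 1e-10
    return K_E * q * q / (eps_r * d * d) * 1e12


print(f"+/-0.10e: {coulomb_force_pn(0.10):.1f} pN")
for mu in (10.0, 16.0):
    q = charge_from_dipole(mu)
    print(f"{mu:.0f} D: +/-{q:.2f}e, {coulomb_force_pn(q):.1f} pN")
```

Under these assumptions, ±0.1e gives about 1.4 pN, and dipole moments of 10 and 16 D correspond to about ±0.16e and ±0.26e with forces of about 3.5 and 9 pN. A larger effective permittivity inside the protein would reduce the forces proportionally.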
Apparently, the same charge separation and the attraction force upon photon absorption also take place in a solution sample of free retinal. Compared to the retinal embedded in protein, photoisomerization in solution is nonspecific, resulting in a range of byproducts, since an isomerization at any position would bring the SB significantly closer to the β-ionone ring. It is understandable that each of the byproducts could only achieve a poor quantum yield (Freedman and Becker, 1986; Koyama et al., 1991) as rotations at multiple single bonds driven by the same attraction force and achieving a similar folding of the polyene chain would further sidetrack the double bond isomerizations thus diminishing their quantum yields. However, these byproducts due to single bond rotations are short-lived beyond detection by HPLC as they spontaneously revert back in solution. The protein environment in bR plays a major role in enhancing the quantum yield of the isomerization to 13-cis by shutting down all other reaction pathways triggered by the charge separation. This is further elucidated by the mutant functions below.
Isomerization byproducts permitted by mutant protein environments
The structure of a double mutant T90A/D115A (3cod) showed little difference from the wildtype (Joh et al., 2008), while the single mutants T90V and T90A retain < 70% and < 20% of the proton pumping activity, respectively (Marti et al., 1991; Perálvarez et al., 2001). These observations illustrate that some nonproductive pathways of the isomerization sampling succeed more often in the altered retinal binding pocket. In the wildtype structure, Thr90 in helix C points towards the C11=C12-C13-C20 segment of the retinal from the inboard with its Cγ atom 3.7 Å from the retinal plane. Given the van der Waals radius rC of 1.7 Å, only 0.3 Å is spared for the H atoms of the Cγ methyl group, thereby effectively shutting down the nonproductive pathways of the isomerization sampling. Any motion of the retinal would have to push helix C toward inboard, causing an expansion of its binding pocket. Missing this close contact in T90A increases the room to 1.9 Å for isomerization byproducts, which would greatly reduce the quantum yield of the 13-cis productive isomerization so that < 20% of the activity is retained.
In addition to 13-cis, the retinal in the light adapted T90V mutant showed 9- and 11-cis configurations at the occupancies of 3% and 18%, respectively, while these configurations were not detected in light adapted wildtype (Marti et al., 1991). Then why would a Val residue at this position with an equivalent Cγ atom permit the formation of some isomerization byproducts? In wildtype bR, the side chain of Thr90 engages two strong H-bonds Trp86O-Thr90Oγ-D115Oδ so that its Cγ methyl group is aligned toward the retinal. Without these H-bonds in T90V, the isopropyl group of Val90 is free to adopt other rotameric positions so that neither of the Cγ methyl groups has to point directly to the retinal, which increases the available room for the formation of some isomerization byproducts. Compared to the light adapted state, these isomerization byproducts could reach even higher percentages during active photocycles and thus reduce the proton pumping activity below 70%.
From the outboard, the side chain of Tyr185 in helix F is nearly parallel to the retinal plane with a distance of 3.5 Å. This close contact of a flat area from C8 to C14 of the retinal prevents any significant motion of the retinal toward the outboard. Even slight motions would push helix F away as observed here in the expansion of the retinal binding pocket. The mutant Y185F largely retains the flat contact so that its proton pumping activity does not reduce much (Hackett et al., 1987; Mogi et al., 1987). However, it is predictable that various single mutants at this position with smaller and smaller side chains would promote more and more isomerization byproducts and eventually shut down proton pumping.
Two massive side chains of Trp86 and Trp182 from the EC and CP sides, respectively, do not seem to play a significant role in suppressing byproduct formation, as shown by the mutant W182F that retains most of the wildtype activity (Hackett et al., 1987), since the motions involved in isomerization sampling are oriented more laterally. The transient expansion and contraction of the retinal binding pocket (Fig. 3d) indicate that the tight box surrounding the mid-segment of the retinal (Fig. 3e) is not completely rigid. Rather, its plasticity must carry sufficient strength to prevent isomerization byproducts. Presumably, this strength originates from the mechanical property of the helical bundle.
In summary, this work reveals the transient structural responses to many unsuccessful attempts of double bond isomerization and single bond rotation by a numerical resolution from the concurrent pathways, which are otherwise difficult to observe. These findings underscore an important implication, that is, a nonspecific Coulomb attraction provides the same driving force for the isomerization sampling with and without a protein matrix. A productive isomerization at a specific double bond is guided by the incorporation of the chromophore in a specific protein environment. The productive pathway is selected from numerous possibilities via stereochemical hindrance. Nevertheless, this nonspecific Coulomb attraction force may not be directly applicable to the photoisomerization of retinal from 11-cis to all-trans in the activation of visual rhodopsins. The key difference is bR as an energy convertor versus a visual rhodopsin as a quantum detector (Lewis, 1978).
Competing interests
ZR is the founder of Renz Research, Inc. that currently holds the copyright of the computer software dynamiX™.
Methods
From the outset, the key presumption is that every crystallographic dataset, at a given temperature and a given time delay after the triggering of a photochemical reaction, captures a mixture of unknown number of intermediate species at unknown fractions. Needless to say, all structures of the intermediates are also unknown except the structure at the ground state that has been determined and well refined by static crystallography. A simultaneous solution of all these unknowns requires multiple datasets that are collected at various temperatures or time delays so that a common set of intermediate structures are present in these datasets with variable ratios. If the number of available datasets is far greater than the number of unknowns, a linear system can be established to overdetermine the unknowns with the necessary stereochemical restraints (Ren et al., 2013). The analytical methods used in this work to achieve such overdetermination have been incrementally developed in the past years and recently applied to another joint analysis of the datasets of carbonmonoxy myoglobin (Ren, 2019). Time-resolved datasets collected with ultrashort pulses from an X-ray free electron laser were successfully analyzed by these methods to visualize electron density components that reveal transient heating, 3d electrons of the heme iron, and global vibrational motions. This analytical strategy is recapped below.
The methodological advance in this work is the refinement of each pure intermediate structure that has been deconvoluted from multiple mixtures. Structure factor amplitudes of a single conformation free of heterogeneity are overdetermined. Given the deconvoluted structure factor amplitude set of a pure state, standard structural refinement software with built-in stereochemical constraints, e.g. PHENIX (Adams et al., 2010; Liebschner et al., 2019), can be used to full advantage. If the computed deconvolution has not achieved a single pure structural species, the structural refinement is expected to indicate so.
Difference Fourier maps
A difference Fourier map is synthesized from a Fourier coefficient set of F_light − F_reference with the best available phase set, often from the ground state structure. Before Fourier synthesis, F_light and F_reference must be properly scaled to the same level so that the distribution of difference values is centered at zero and not skewed either way. A weighting scheme proven effective assumes that a greater amplitude of a difference Fourier coefficient F_light − F_reference is more likely caused by noise than by signal (Ren et al., 2001, 2013; Šrajer et al., 2001; Ursby and Bourgeois, 1997). Both the dark and light datasets can serve as a reference in difference maps. If a light dataset at a certain delay is chosen as a reference, the difference map shows the changes since that delay time but not the changes prior to that delay. However, both the dark and light datasets must be collected in the same experiment. A cross reference from a different experimental setting usually causes large systematic errors in the difference map that would swamp the desired signals. Each difference map is masked 3.5 Å around the entire molecule of bacteriorhodopsin (bR). No lipid density is analyzed.
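The scaling and weighting of difference amplitudes described above can be sketched as follows. The text does not give the weight formula; the form used here, w = 1 / (1 + ΔF²/⟨ΔF²⟩ + σ²/⟨σ²⟩), is one commonly used expression of the same idea and is an assumption, as are the single overall least-squares scale factor, the function names, and the toy amplitudes.

```python
import numpy as np


def scale_to_reference(f_light, f_ref):
    """Overall least-squares scale k minimizing sum (k*f_light - f_ref)^2."""
    return np.sum(f_light * f_ref) / np.sum(f_light * f_light)


def weighted_differences(f_light, f_ref, sig_light, sig_ref):
    """Return (dF, w): scaled differences and their down-weighting factors.

    ASSUMED weight: w = 1 / (1 + dF^2/<dF^2> + sig^2/<sig^2>), so that
    unusually large or poorly measured differences contribute less.
    """
    k = scale_to_reference(f_light, f_ref)
    d_f = k * f_light - f_ref
    sig2 = (k * sig_light) ** 2 + sig_ref ** 2
    w = 1.0 / (1.0 + d_f ** 2 / np.mean(d_f ** 2) + sig2 / np.mean(sig2))
    return d_f, w


# Hypothetical toy amplitudes: the light data are on a different scale
# and the last reflection carries an outlying difference.
f_ref = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
f_light = 2.0 * (f_ref + np.array([1.0, -2.0, 3.0, -1.0, 40.0]))
sig = np.full(5, 5.0)

dF, w = weighted_differences(f_light, f_ref, sig, sig)
print("scaled differences:", np.round(dF, 2))
print("weights:", np.round(w, 3))
# The Fourier coefficients for map synthesis would be w * dF combined
# with phases from the reference (ground state) structure.
```

With equal measurement errors, the reflection with the largest difference receives the smallest weight, which is the behavior the weighting scheme is meant to have.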
Meta-analysis of protein structures
Structural meta-analysis based on singular value decomposition (SVD) has been conducted in two forms. In one of them, an interatomic distance matrix is calculated from each protein structure in a related collection. SVD of a data matrix consists of these distance matrices enables a large-scale joint structural comparison but requires no structural alignment (Ren, 2013a, 2013b, 2016). In the second form, SVD is performed on a data matrix of electron densities of related protein structures (Ren, 2019; Ren et al., 2013; Schmidt et al., 2003, 2010). Both difference electron density maps that require a reference dataset from an isomorphous crystal form and simulated annealing omit maps that do not require the same unit cell and space group of the crystals are possible choices in a structural meta-analysis (Ren, 2019; Ren et al., 2013). The interatomic distances or the electron densities that SVD is performed on are called core data. Each distance matrix or electron density map is associated with some metadata that describe the experimental conditions under which the core data are obtained, such as temperature, pH, light illumination, time delay, mutation, etc. These metadata do not enter the SVD procedure. However, they play important role in the subsequent interpretation of the SVD result. This computational method of structural analysis takes advantage of a mathematical, yet practical, definition of conformational space with limited dimensionality (Ren, 2013a). Each experimentally determined structure is a snapshot of the protein structure. A large number of such snapshots taken under a variety of experimental conditions, the metadata, would collectively provide a survey of the accessible conformational space of the protein structure and reveal its rection trajectory. Such joint analytical strategy would not be effective in early years when far fewer protein structures were determined to atomic resolution. 
Recent rapid growth in protein crystallography, such as in structural genomics (Berman et al., 2012; Bonvin, 2021; Chandonia and Brenner, 2006) and in serial crystallography (Glynn and Rodriguez, 2019; Schaffer et al., 2021), has supplied the necessarily wide sampling of protein structures for a joint analytical strategy to come of age. The vacancies or gaps in a conformational space between well-populated conformational clusters often correspond to less stable transient states whose conformations are difficult, if not impossible, to capture. These conformations are often key to mechanistic understanding and could be explored by a back calculation based on molecular distance geometry (Ren, 2013a, 2016), the chief computational algorithm in nuclear magnetic resonance (NMR) spectroscopy, and by a structure refinement based on a reconstituted dataset, a major methodological advance in this work (see below). Structures refined to atomic resolution against reconstituted datasets may reveal short-lived intermediate conformations that are hard to capture experimentally. Unfortunately, a protein structure refined against a reconstituted dataset currently cannot be recognized by the Protein Data Bank (PDB), because crystallographic refinement of a macromolecular structure is narrowly defined as a correspondence from one dataset to one structure. A never-observed dataset reconstituted from a collection of experimental datasets does not match the well-established crystallographic template of PDB, let alone a refinement of a crystal structure with the NMR algorithm.
A distance matrix contains the M pairwise interatomic distances calculated from the Cartesian coordinates of all observed atoms in a structure. An everyday example of a distance matrix is the intercity mileage chart appended to a road atlas. Because a distance matrix is symmetric, only its lower triangle is necessary. Differences in the molecular orientation, choice of origin, and crystal lattice among all experimentally determined structures have no contribution to the distance matrices. A far more intimate examination of protein structures in PDB is a direct analysis of their electron density maps instead of the atomic coordinates. M such (difference) electron densities, often called voxels in computer graphics, are selected by a mask of interest. In the case of difference maps, only the best refined protein structure in the entire collection supplies a phase set for Fourier synthesis of electron density maps. This best structure is often the ground state structure determined by static crystallography. Other refined atomic coordinates from the PDB entries are not considered in the meta-analysis. That is to say, a meta-analysis of difference electron density maps starts from the X-ray diffraction data archived in PDB rather than the atomic coordinates interpreted from the diffraction data, which removes any potential model bias.
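The independence of a distance matrix from molecular orientation and choice of origin can be illustrated with a minimal sketch. The function name `distance_vector` and the random toy coordinates are assumptions made for illustration only; they are not part of the software used in this work.

```python
import numpy as np

def distance_vector(xyz):
    """Lower-triangle interatomic distances of one structure as a 1-D vector."""
    diff = xyz[:, None, :] - xyz[None, :, :]
    D = np.linalg.norm(diff, axis=-1)          # full symmetric distance matrix
    i, j = np.tril_indices(len(xyz), k=-1)     # strictly lower triangle
    return D[i, j]

rng = np.random.default_rng(8)
xyz = rng.normal(size=(5, 3))                  # toy coordinates of 5 atoms

# Hypothetical change of orientation and origin: an orthogonal matrix Q
# (from a QR factorization) followed by a translation.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
moved = xyz @ Q.T + np.array([10.0, -3.0, 7.0])

d0 = distance_vector(xyz)
d1 = distance_vector(moved)
```

For 5 atoms the vector holds 5 × 4 / 2 = 10 distances, and `d0` and `d1` agree to numerical precision although the two coordinate sets differ.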
Singular value decomposition of (difference) electron density maps
An electron density map, particularly a difference map as emphasized here, consists of density values on an array of grid points within a mask of interest. All M grid points in a three-dimensional map can be serialized into a one-dimensional sequence of density values according to a specific protocol. It is not important what the protocol is as long as a consistent protocol is used to serialize all maps of the same grid setting and size, and a reverse protocol is available to erect a three-dimensional map from a sequence of M densities. Therefore, a set of N serialized maps, also known as vectors in linear algebra, can fill the columns of a data matrix A in no specific order, so that the width of A is N columns, and the length is M rows. Often, M >> N, thus A is an elongated matrix. If a consistent protocol of serialization is used, the corresponding voxel in all N maps occupies a single row of matrix A. This strict correspondence in a row of matrix A is important. Changes of the density values in a row from one structure to another are due to signals, systematic errors, or noise. Although the order of columns in matrix A is unimportant, the metadata associated with each column must be kept in good bookkeeping.
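A minimal sketch of one possible serialization protocol and its reverse is given here. The helper names `serialize_maps` and `erect_map`, the toy grid, and the random mask are hypothetical; any consistent ordering of the grid points would serve equally well.

```python
import numpy as np

def serialize_maps(maps, mask):
    """Stack N masked 3-D maps as the columns of an M x N data matrix A."""
    # Boolean indexing walks the grid in one fixed order, so the same voxel
    # lands in the same row of A for every map.
    return np.column_stack([m[mask] for m in maps])

def erect_map(vector, mask, fill=0.0):
    """Reverse protocol: put M serialized values back onto the 3-D grid."""
    grid = np.full(mask.shape, fill, dtype=float)
    grid[mask] = vector
    return grid

rng = np.random.default_rng(0)
shape = (6, 5, 4)                                   # toy grid, not a real map size
mask = rng.random(shape) < 0.5                      # hypothetical mask of interest
maps = [rng.normal(size=shape) for _ in range(3)]   # N = 3 toy maps
A = serialize_maps(maps, mask)                      # M rows, N columns
restored = erect_map(A[:, 1], mask)                 # second map, zero outside the mask
```

The number of rows M equals the number of grid points inside the mask, and `restored` reproduces the second map at every masked grid point.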
SVD of the data matrix A results in A = UWVT, also known as matrix factorization. Matrix U has the same shape as A, that is, N columns and M rows. The N columns contain decomposed basis components Uk, known as left singular vectors of M items, where k = 1, 2, …, N. Therefore, each component Uk can be erected using the reverse protocol to form a three-dimensional map. This decomposed elemental map can be presented in the same way as the original maps, for example, rendered in molecular graphics software such as Coot and PyMOL. It is worth noting that these decomposed elemental maps or map components Uk are independent of any metadata. That is to say, these components remain constant when the metadata vary. Since each left singular vector Uk has a unit length due to the orthonormal property of SVD (see below), that is, |Uk| = 1, the root mean square (rms) of the items in a left singular vector, which measures their quadratic mean, is 1/√M.
The second matrix W is a square matrix that contains all zeros except for N positive values on its major diagonal, known as singular values wk. The magnitude of wk is considered as a weight or significance of its corresponding component Uk. The third matrix V is also a square matrix of N × N. Each column of V or row of its transpose VT, known as a right singular vector Vk, contains the relative compositions of Uk in each of the N original maps. Therefore, each right singular vector Vk can be considered as a function of the metadata. Right singular vectors also have the same unit length, that is, |Vk| = 1. Effectively, SVD separates the constant components independent of the metadata from the compositions that depend on the metadata.
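The shapes and unit lengths described above can be checked numerically. The following is a sketch only, using the thin SVD of NumPy on a random stand-in for the data matrix; the sizes M and N are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 500, 8                              # toy sizes with M >> N
A = rng.normal(size=(M, N))                # stand-in for N serialized maps

# Thin SVD: U is M x N, w holds the N singular values (the diagonal of W),
# and Vt is the transpose of the N x N matrix V.
U, w, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

len_U = np.linalg.norm(U, axis=0)          # |Uk| for each left singular vector
len_V = np.linalg.norm(V, axis=0)          # |Vk| for each right singular vector
rms_U = np.sqrt(np.mean(U**2, axis=0))     # rms of the M items in each Uk
A_back = U @ np.diag(w) @ Vt               # A = U W V^T
```

NumPy returns the singular values already sorted in descending order, which matches the usual ordering of singular triplets.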
A singular triplet denotes 1) a decomposed component Uk, 2) its singular value wk, and 3) the composition function Vk. Singular triplets are often sorted in a descending order of their singular values wk. Only a small number of n significant singular triplets identified by the greatest singular values w1 through wn can be used in a linear combination to reconstitute a set of composite maps that closely resemble the original ones in matrix A, where n < N. For example, the original map in the ith column of matrix A under a certain experimental condition can be closely represented by the ith composite map w1v1iU1 + w2v2iU2 + … + wnvniUn, where (v1i, v2i, …) is from the ith row of matrix V. The coefficient set for the linear combination is redefined here as cki = wkvki/√M. The rms of the values in a map component, or the average magnitude measured by the quadratic mean, acts as a constant scale factor that resets the modified coefficients cki back to the original scale of the core data, such as Å for distance matrices and e-/Å3 for electron density maps if these units are used in the original matrix A. Practically, an electron density value usually carries an arbitrary unit without a calibration, which makes this scale factor unnecessary. In the linear combination c1iU1 + c2iU2 + … + cniUn, each component Uk is independent of the metadata while how much of each component is required for the approximation, that is, cki, depends on the metadata.
Excluding the components after Un in this approximation is based on an assumption that the singular values after wn are very small relative to those from w1 through wn. As a result, the structural information distributed among all N original maps is effectively concentrated into far fewer, n, significant components, known as information concentration or dimension reduction. On the other hand, the trailing components in matrix U contain inconsistent fluctuations and random noise. Excluding these components effectively rejects noise (Schmidt et al., 2003). The least-squares property of SVD guarantees that the rejected trailing components sum up to the least squares of the discrepancies between the original core data and the approximation using the accepted components.
However, no clear boundary is guaranteed between signals, systematic errors, and noise. Systematic errors could be more significant than the desired signals. Therefore, excluding some components from 1 through n is also possible. If systematic errors are correctly identified, the reconstituted map without these significant components would no longer carry the systematic errors.
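The reconstitution from the n most significant singular triplets, and the least-squares property of the rejected ones, can be sketched as follows. The toy data are an assumption: two consistent components plus weak random noise, with no relation to the actual datasets analyzed here.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, n = 400, 10, 2
# Toy data: two consistent "signal" components plus weak random noise.
signal = rng.normal(size=(M, n)) @ rng.normal(size=(n, N))
A = signal + 0.01 * rng.normal(size=(M, N))

U, w, Vt = np.linalg.svd(A, full_matrices=False)

# Composite maps from the n most significant singular triplets only.
A_n = U[:, :n] @ np.diag(w[:n]) @ Vt[:n, :]

# The ith composite map written out as w1*v1i*U1 + ... + wn*vni*Un.
i = 3
composite_i = sum(w[k] * Vt[k, i] * U[:, k] for k in range(n))

residual = np.linalg.norm(A - A_n)          # root of the summed squared discrepancies
rejected = np.sqrt(np.sum(w[n:] ** 2))      # same quantity from the rejected wk
```

Here `residual` and `rejected` agree, which is the numerical content of the least-squares statement: the squared discrepancy equals the sum of the squared singular values of the excluded triplets.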
The orthonormal property of SVD
The solution set of SVD must guarantee that the columns in U and V, the left and right singular vectors Uk and Vk, are orthonormal, that is, Uh•Uk = Vh•Vk = 0 (ortho) and Uk•Uk = Vk•Vk = 1 (normal), where h ≠ k and both run from 1 to N. For the square matrix V, the orthonormal property also holds for the row vectors. As a result, each component Uk is independent of the other components. In other words, a component cannot be represented by a linear combination of any other components. However, two physical or chemical parameters in the metadata, such as temperature and pH, may cause different changes to a structure. These changes are not necessarily orthogonal. They could exhibit some correlation. Therefore, the decomposed components Uk do not necessarily represent any physically or chemically meaningful changes (see below).
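The orthonormal property can be verified directly on a small example. This sketch uses a random stand-in matrix; note that with the thin SVD of an elongated matrix (M > N), the rows of the square matrix V are orthonormal as well, whereas the rows of the M × N matrix U are not.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 300, 6
A = rng.normal(size=(M, N))                # stand-in data matrix
U, w, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

I_N = np.eye(N)
cols_U_orthonormal = np.allclose(U.T @ U, I_N)         # Uh.Uk = 0, Uk.Uk = 1
cols_V_orthonormal = np.allclose(V.T @ V, I_N)         # Vh.Vk = 0, Vk.Vk = 1
rows_V_orthonormal = np.allclose(V @ V.T, I_N)         # V is square
rows_U_orthonormal = np.allclose(U @ U.T, np.eye(M))   # fails when M > N
```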
Due to the orthonormal property of SVD, an N-dimensional Euclidean space is established, and the first n dimensions define its most significant subspace. Each coefficient set ci = (c1i, c2i, …, cni) of the ith composite map is located in this n-dimensional subspace. All coefficient sets for i = 1, 2, …, N in different linear combinations to approximate the N original maps in a least-squares sense can be represented by N points or vectors c1, c2, …, cN in the Euclidean subspace. This n-dimensional subspace is essentially the conformational space as surveyed by the jointly analyzed core data. The conformational space is presented as scatter plots with each captured structure represented as a dot located at a position determined by the coefficient set ci of the ith observed map. When the subspace has more than two dimensions, multiple two-dimensional orthographic projections of the subspace are presented, such as Fig. 2a. These scatter plots are highly informative in revealing the relationship between the (difference) electron density maps and their metadata.
If two coefficient sets ci ≈ cj, they are located close to each other in the conformational space. Therefore, the two structures i and j share similar conformations. Two structures located far apart from each other in the conformational space are dissimilar in their conformations, and distinct in the compositions of the map components. A reaction trajectory emerges in this conformational space if the temporal order of the core data is experimentally determined (Fig. 2a). Otherwise, an order could be assigned to these structures based on an assumed smoothness of conformational changes along a reaction trajectory (Ren, 2013a, 2013b, 2016). Cause and consequence of structural motions could be revealed from the order of the structures in a series, which may further lead to a structural mechanism. In addition, an off-trajectory location in the conformational space or a location between two clusters of observed structures represents a structure in a unique conformation that has never been experimentally captured. Such a hypothetical structure can be refined against a reconstituted distance matrix using molecular distance geometry (Ren, 2013a, 2013b, 2016) or a reconstituted electron density map with the method proposed below.
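How coefficient sets place structures in the conformational space can be sketched with toy data. The two patterns `p` and `q`, the cluster sizes, and the noise level below are assumptions for illustration; the coefficients follow the definition cki = wkvki/√M given earlier.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, n = 200, 6, 2
# Toy maps: columns 0-2 share one pattern, columns 3-5 share another.
p, q = rng.normal(size=M), rng.normal(size=M)
A = np.column_stack([p] * 3 + [q] * 3) + 0.05 * rng.normal(size=(M, N))

U, w, Vt = np.linalg.svd(A, full_matrices=False)

# Row i of C holds the coefficient set ci = (c1i, ..., cni).
C = (Vt[:n, :] * w[:n, None]).T / np.sqrt(M)

# Pairwise Euclidean distances between the N points in the n-dimensional subspace.
D = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=-1)
```

Maps within the same toy cluster, such as columns 0 and 1, end up much closer to each other in `D` than maps from different clusters, such as columns 0 and 3. The rows of `C` are the positions that would be drawn as dots in a scatter plot.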
Rotation in SVD space
Dimension reduction is indeed effective in meta-analysis of protein structures when many datasets are evaluated at the same time. However, the default solution set of SVD carries complicated physical and chemical meanings that are not immediately obvious. The interpretation of a basis component Uk, that is, “what-does-it-mean”, requires a clear demonstration of the relationship between the core data and their metadata. The outcome of SVD does not guarantee any physical meaning in a basis component. Therefore, SVD alone provides no direct answer to “what-does-it-mean”, and its usefulness is limited to that of a mathematical construction. However, the factorized set of matrices U, W, and V from SVD is not a unique solution. That is to say, they are not the only solution to factorize matrix A. Therefore, it is very important to find one or more alternative solution sets that are physically meaningful to elucidate a structural interpretation. The concept of a rotation after SVD was introduced by Henry and Hofrichter (1992). However, the protocol they suggested fails to preserve the orthonormal and least-squares properties of SVD. The rotation protocol suggested by Ren incorporates the metadata into the analysis and combines them with SVD of the core data. This rotation achieves a numerical deconvolution of multiple physical and chemical factors after a pure mathematical decomposition, and therefore, provides a route to answer the question of “what-does-it-mean” (Ren, 2019). This rotation shall not be confused with a rotation in the three-dimensional real space, in which a molecular structure resides.
A rotation in the n-dimensional Euclidean subspace is necessary to change the perspective before a clear relationship emerges to elucidate scientific findings. It is shown below that two linear combinations are identical before and after a rotation applied to both the basis components and their coefficients in a two-dimensional subspace of h and k. That is,
$$c_h U_h + c_k U_k = f_h R_h + f_k R_k, \tag{1}$$
where ch and ck are the coefficients of the basis components Uh and Uk before the rotation; and fh and fk are the coefficients of the rotated basis components Rh and Rk, respectively. The same Givens rotation of an angle θ (written here with a positive θ taken as counterclockwise) is applied to both the components and their coefficients:
$$R_h = U_h\cos\theta - U_k\sin\theta, \qquad R_k = U_h\sin\theta + U_k\cos\theta; \tag{2}$$
$$f_h = c_h\cos\theta - c_k\sin\theta, \qquad f_k = c_h\sin\theta + c_k\cos\theta. \tag{3}$$
Obviously, the rotated components Rh and Rk remain mutually orthonormal and orthonormal to other components.
After the rotation, sh and sk are the singular values that replace wh and wk, respectively. They may increase or decrease compared to the original singular values so that the descending order of the singular values no longer holds. Th|k = (th|k1, th|k2, …, th|kN) = (fh|k1, fh|k2, …, fh|kN)/sh|k are the right singular vectors that replace Vh and Vk, respectively. Th and Tk remain mutually orthonormal after the rotation and orthonormal to other right singular vectors that are not involved in the rotation.
Eq. 1 holds because the dot product of two vectors does not change after both vectors rotate by the same angle. To prove Eq. 1 in more detail, Eqs. 2 and 3 are combined and expanded. All cross terms of sine and cosine cancel:
$$\begin{aligned} f_h R_h + f_k R_k &= (c_h\cos\theta - c_k\sin\theta)(U_h\cos\theta - U_k\sin\theta) + (c_h\sin\theta + c_k\cos\theta)(U_h\sin\theta + U_k\cos\theta) \\ &= (c_h U_h + c_k U_k)(\cos^2\theta + \sin^2\theta) \\ &= c_h U_h + c_k U_k. \end{aligned}$$
A rotation in two-dimensional subspace of h and k has no effect in other dimensions, as the orthonormal property of SVD guarantees. Multiple steps of rotations can be carried out in many two-dimensional subspaces consecutively to achieve a multi-dimensional rotation. A new solution set derived from a rotation retains the orthonormal property of SVD. The rotation in the Euclidean subspace established by SVD does not change the comparison among the core data of protein structures. Rather it converts one solution set A = UWVT to other alternative solutions A = RSTT so that an appropriate perspective can be found to elucidate the relationship between the core data and metadata clearly and concisely.
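A numerical sketch of a single Givens rotation in the subspace of two dimensions h and k follows. Several choices here are assumptions rather than the protocol used in this work: the function name `givens_rotate`, the sign convention Rh = Uh cos θ − Uk sin θ and Rk = Uh sin θ + Uk cos θ (with the same convention for the coefficients), the angle of 30°, and the omission of the constant 1/√M scale from the coefficients. The sketch checks only that the reconstituted data matrix is unchanged, that the rotated components stay orthonormal, and that the normalized coefficient vectors have unit length.

```python
import numpy as np

def givens_rotate(U, C, h, k, theta):
    """Rotate columns h and k of the components U and of the coefficients C by theta."""
    G = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    R, F = U.copy(), C.copy()
    R[:, [h, k]] = U[:, [h, k]] @ G        # Rh = Uh cos - Uk sin, Rk = Uh sin + Uk cos
    F[:, [h, k]] = C[:, [h, k]] @ G        # fh = ch cos - ck sin, fk = ch sin + ck cos
    return R, F

rng = np.random.default_rng(5)
M, N = 250, 5
A = rng.normal(size=(M, N))                # stand-in data matrix
U, w, Vt = np.linalg.svd(A, full_matrices=False)
C = Vt.T * w                               # C[i, k] = wk * vki, so that A = U C^T

R, F = givens_rotate(U, C, h=0, k=1, theta=np.deg2rad(30.0))
s = np.linalg.norm(F, axis=0)              # values that replace the singular values
T = F / s                                  # unit-length vectors, so that A = R S T^T
```

In this toy case `s[0]` and `s[1]` differ from `w[0]` and `w[1]`, while the sum of their squares is preserved, and all dimensions other than h and k are left untouched.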
For example, if one physical parameter could be reoriented by a rotation along a single dimension k without involving other dimensions, it can be convincingly shown that the left singular vector Uk of this dimension illustrates the structural impact of this physical parameter. Before this rotation, the same physical parameter may appear to cause structural variations along several dimensions, which leads to a difficult interpretation. Would a proper rotation establish a one-to-one correspondence from all physical or chemical parameters to all the dimensions? It depends on whether each parameter induces an orthogonal structural change, that is, whether structural responses to different parameters are independent or correlated among one another. If structural changes are indeed orthogonal, it should be possible to find a proper rotation to cleanly separate them in different dimensions. Otherwise, two different rotations are necessary to isolate two correlated responses, but one at a time.
For another example, if the observed core datasets form two clusters in the conformational space, a rotation would be desirable to separate these clusters along a single dimension k but to align these clusters along other dimensions. Therefore, the component Uk is clearly due to the structural transition from one cluster to the other. Without a proper rotation, the difference between these clusters could be complicated with multiple dimensions involved. A deterministic solution depends on whether a clear correlation exists between the core data and metadata. A proper rotation may require a user decision. A wrong choice of rotation may select a viewpoint that hinders a concise conclusion. However, it would not alter the shape of the reaction trajectory, nor create or eliminate an intrinsic structural feature. A wrong choice of rotation cannot eliminate the fact that a large gap exists between two clusters of observed core datasets, although these clusters may not be obvious from that viewpoint. A different rotation may reorient the perspective along another direction. But the structural conclusion would be equivalent. See Ren (2016) for an example before and after a rotation.
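One simple way to pick a rotation angle for the two-cluster case is sketched below. This is an illustrative assumption, not the selection procedure used in this work: the angle is chosen so that the vector between the two cluster centroids falls entirely along dimension h. The function names, the toy cluster positions, and the cluster labels (which would come from the metadata) are hypothetical.

```python
import numpy as np

def separating_angle(c_h, c_k, labels):
    """Angle that puts the centroid-to-centroid vector of two clusters along dimension h."""
    labels = np.asarray(labels, dtype=bool)
    d_h = c_h[labels].mean() - c_h[~labels].mean()
    d_k = c_k[labels].mean() - c_k[~labels].mean()
    return -np.arctan2(d_k, d_h)

def rotate_coefficients(c_h, c_k, theta):
    """Givens rotation of the coefficients in the subspace of h and k."""
    f_h = c_h * np.cos(theta) - c_k * np.sin(theta)
    f_k = c_h * np.sin(theta) + c_k * np.cos(theta)
    return f_h, f_k

rng = np.random.default_rng(6)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=bool)    # hypothetical cluster labels
centers = np.where(labels[:, None], [2.0, 1.5], [-1.0, -0.5])
pts = centers + 0.05 * rng.normal(size=(8, 2))              # toy coefficient sets
c_h, c_k = pts[:, 0], pts[:, 1]

theta = separating_angle(c_h, c_k, labels)
f_h, f_k = rotate_coefficients(c_h, c_k, theta)
gap_h = f_h[labels].mean() - f_h[~labels].mean()            # separation along h
gap_k = f_k[labels].mean() - f_k[~labels].mean()            # separation along k
```

After the rotation the clusters are separated along h only, and `gap_k` vanishes to numerical precision. Distances between any two points are unchanged, which is consistent with the statement that a rotation changes the viewpoint but not the shape of the trajectory or the gap between clusters.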
This rotation procedure finally connects the core crystallographic datasets to the metadata of experimental conditions and accomplishes the deconvolution of physical or chemical factors that are not always orthogonal to one another after a mathematical decomposition. The SVD analysis presented in this paper employs rotations extensively; however, no distinction is made in the symbols of components and coefficients before and after a rotation outside this section. This method is widely applicable in large-scale structural comparisons. Furthermore, Ren rotation after SVD is not limited to crystallography and may impact other fields wherever SVD is used. For example, SVD is frequently applied to spectroscopic data, images, and genetic sequence data.
Structural refinement against reconstituted dataset
The linear combination Δρ(t) = f1(t)R1 + f2(t)R2 + … + fn(t)Rn after a rotation reconstitutes one of the observed difference maps at a specific time point t. This time-dependent difference map depicts an ever-evolving mixture of many excited species. A reconstituted difference map Δρ(E) for a time-independent, pure, excited species E = intermediate I’, I, J’, and J deconvoluted from many mixtures would take the same form except that only one or very few coefficients remain nonzero if a proper rotation has been found (Table S2). In order to take advantage of the mature refinement software for macromolecular structures with extensive stereochemical restraints, a set of structure factor amplitudes is needed. Therefore, it is necessary to reconstitute a set of structure factor amplitudes that would produce the target difference map Δρ(E) based on a known structure at the ground state. First, an electron density map of the structure at the ground state is calculated. This calculated map is used as a base map. Second, this base map of the ground state is combined with the positive and negative densities in the target difference map Δρ(E) so that the electron densities at the ground state are skewed toward the intermediate state. Third, structure factors are calculated from the combined map. Finally, the phase set of the calculated structure factors is discarded, and the amplitudes are used to refine a single conformation of the intermediate species E that Δρ(E) represents.
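The four steps of this reconstitution can be sketched on a toy grid in space group P1. This is only an illustration of the data flow under strong assumptions: the function name `reconstitute_amplitudes` and its `scale` parameter are hypothetical, the maps are random numbers, and a plain discrete Fourier transform stands in for a crystallographic structure factor calculation, which in practice would be done with established crystallographic software.

```python
import numpy as np

def reconstitute_amplitudes(base_map, diff_map, scale=1.0):
    """Combine a calculated ground-state map with a target difference map
    and return structure factor amplitudes only."""
    combined = base_map + scale * diff_map      # step 2: skew toward the intermediate
    F = np.fft.fftn(combined)                   # step 3: Fourier coefficients of the combined map
    return np.abs(F)                            # step 4: keep amplitudes, discard phases

rng = np.random.default_rng(7)
shape = (8, 8, 8)                               # toy grid
base_map = rng.random(shape)                    # step 1 stand-in: calculated ground-state map
diff_map = 0.1 * rng.normal(size=shape)         # stand-in for the target difference map
amplitudes = reconstitute_amplitudes(base_map, diff_map)

# Skipping step 2 returns the amplitudes of the ground-state map itself.
ground_only = reconstitute_amplitudes(base_map, np.zeros(shape))
```

Because the combined map is real, the amplitudes obey Friedel symmetry, and with a zero difference map the result reduces to the amplitudes calculated from the ground-state map alone.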
This protocol following the SVD and Ren rotation of components achieves a refinement of a pure structural species without the need of alternative conformations. Several points are noteworthy. First, the minimization protocol in this refinement is performed against a numerically reconstituted amplitude set that has never been directly measured from a crystal. This reconstituted dataset could be considered as an extrapolated dataset “on steroids” if compared to the traditional extrapolation of small differences, such as the Fourier coefficient set used to calculate a 3Fo-2Fc map, a technique often used to overcome a partial occupancy of an intermediate structure. An extrapolation of small differences is not directly observed either but computed by an exaggeration of the observed difference based on an assumption that the intermediate state is partially occupied, such as the doubling of the observed difference in 3Fo-2Fc = Fo + 2(Fo-Fc). In contrast to the conventional technique of extrapolation, the deconvolution method applied here is an interpolation among many experimental datasets rather than an extrapolation. In addition, the deconvolution is a simultaneous solution of multiple intermediate states mixed together instead of solving a single excited state.
Second, a map calculated from the ground state structure is chosen as the base map instead of an experimental map such as an Fo or 2Fo-Fc map. If the second step of the protocol is skipped, that is, no difference map is combined with the ground state map, the refinement would result in an R factor of nearly zero, since the refinement is essentially against the calculated structure factors (bR in Table S2). That is to say, the residuals of the refinement are solely due to the difference component instead of the base map. This is desirable since errors in the static structure of the ground state are gauged during its own refinement. On the other hand, if an experimental map is chosen as a base map, the refinement R factors would reflect errors in both the base map and the difference map, which makes an objective evaluation of this refinement protocol difficult.
Third, the combination of the base map and a difference map is intended to represent a pure intermediate species. Therefore, alternative conformations in structural refinement that model a mixture of species would defeat this purpose. However, this combined map could be very noisy and may not represent a single species without a proper rotation. This is particularly the case if the target difference map Δρ is not derived from an SVD analysis and Ren rotation. The SVD analysis identifies many density components that are inconsistent among all observed difference maps and excludes them, which greatly reduces the noise content. Therefore, this refinement protocol may not be very successful without an SVD analysis. Another source of noise originates from the phase set of the structure factors. Prior to the refinement of the intermediate structure, the phase set remains identical to that of the ground state. This is far from the reality when an intermediate structure involves widespread changes, such as those refined in this study. If the rotation after SVD is not properly selected, the target difference map would remain as a mixture minus the ground state. Therefore, the refinement of a single conformation would encounter difficulty or significant residuals, as judged by the R factors, the residual map, and the refined structure. A proper solution to this problem is a better SVD solution by Ren rotation rather than alternative conformations. A successful refinement of near-perfect trans or cis double bonds is a good sign that the reconstituted amplitude set after a rotation reflects a relatively homogeneous structure. If a double bond could not be refined well to a near-perfect trans or cis configuration, the dataset of structure factor amplitudes is likely from a mixture of heterogeneous configurations, which occurred frequently in previous studies of bR and photoactive yellow protein (Jung et al., 2013; Lanyi and Schobert, 2007; Nogly et al., 2018).
A general difficulty in crystallographic refinement is that a heterogeneous mixture of conformations cannot be unambiguously refined, even with alternative conformations. This difficulty becomes more severe when a mixture involves more than two conformations or when some conformations are very minor.
Lastly, the refinement protocol proposed here could be carried out in the original unit cell and space group of the crystal at the ground state. However, this is not always applicable as the original goal of the meta-analysis is a joint examination of all available structures from a variety of crystal forms. It would be highly desirable to evaluate difference maps of the same or similar proteins from non-isomorphous crystals together by SVD. Alternatively, the refinement protocol could also be performed in the space group of P1 with a virtual unit cell large enough to hold the structure, which is the option in this study (Table S2). That is to say, the entire analysis of SVD-rotation-refinement presented here could be extracted and isolated from the original crystal lattices, which paves the way to future applications to structural data acquired by experimental techniques beyond crystallography, most attractively, to single particle reconstruction in cryo-electron microscopy.
Supplementary Tables
Supplementary Figures and Legends
Acknowledgements
This work is supported in part by the grant R01EY024363 from National Institutes of Health. The following database and software are used in this work: CCP4 (ccp4.ac.uk), Coot (www2.mrc-lmb.cam.ac.uk/Personal/pemsley/coot), dynamiX™ (Renz Research, Inc.), gnuplot (gnuplot.info), PDB (rcsb.org), PHENIX (phenix-online.org), PyMOL (pymol.org), Python (python.org), and SciPy (scipy.org).