## Abstract

X-ray crystallography is an invaluable technique for studying the atomic structure of macromolecules. Much of crystallography’s success is due to the software packages developed to enable the automated processing of diffraction data. However, the analysis of unconventional diffraction experiments can still pose significant challenges: many existing programs are closed-source, sparsely documented, or difficult to integrate with modern libraries for scientific computing and machine learning. Here we describe `reciprocalspaceship`, a Python library for exploring reciprocal space. It provides a tabular representation for reflection data from diffraction experiments that extends the widely used pandas library with built-in methods for handling space group, unit cell, and symmetry-based operations. As we illustrate, this library facilitates new modes of exploratory data analysis while supporting the prototyping, development, and release of new methods.

## 1 Introduction

The analysis of most diffraction experiments begins with processing diffraction images and ends with refining an atomic model that is consistent with the observed data. Numerous software suites and command-line applications address different stages of the processing pipeline, and these diverse programs are typically combined in order to address the challenges of a particular data set [1, 2, 3, 4, 5, 6, 7]. However, many unconventional diffraction experiments do not fit easily into the processing pipelines established within existing crystallography software. Such experiments often require custom scripts and programs to analyze the resulting data. Recent examples of such experiments include time-resolved pump-probe experiments that investigate the structural dynamics within room-temperature crystals [8, 9]. New software is needed to support custom analyses to improve the development, reproducibility, and adoption of less routine diffraction experiments.

A software library to support such experiments must provide built-in methods to handle space group, unit cell, and symmetry-based operations. This requirement is already met by several general-purpose libraries, such as the `Computational Crystallography Toolbox` (`CCTBX`) and `GEMMI` [4, 10]. However, it is also desirable to facilitate the exploratory inspection of reflection data and to support seamless integration with existing scientific computing software. These additional requirements lower the barrier to implement and test new methods while minimizing the duplication of code and effort.

Due to Bragg’s law, crystallography data is inherently tabular with each observed reflection described by a Miller index. This property underlies many of the file formats for storing diffraction data; integrated intensities and any reflection-specific metadata are stored with the associated Miller index (see Fig. 1). For data analysis in Python, tabular data is commonly represented using the pandas software library [11]. `pandas.DataFrame` objects provide support for the arbitrary manipulation of tabular data, storage of heterogeneous data types, and easy integration with any scientific computing or machine learning library that supports NumPy arrays [12].
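As a minimal illustration of this tabular view, the sketch below stores a handful of reflections in a plain `pandas.DataFrame` indexed by Miller index. The column labels (`I`, `SIGI`) and the numerical values are hypothetical and chosen only for illustration; the example uses pandas and NumPy alone and does not depend on `reciprocalspaceship`.

```python
import numpy as np
import pandas as pd

# Hypothetical reflections: one row per Miller index (H, K, L), with an
# integrated intensity and its estimated standard deviation.
reflections = pd.DataFrame(
    {
        "H": [1, 0, 0, 1],
        "K": [0, 1, 0, 1],
        "L": [0, 0, 2, 1],
        "I": [120.5, 80.2, -3.1, 45.0],
        "SIGI": [4.0, 3.5, 2.9, 3.0],
    }
).set_index(["H", "K", "L"])

# Column arithmetic is vectorized over all reflections.
reflections["I/SIGI"] = reflections["I"] / reflections["SIGI"]

# Boolean indexing selects reflections by any per-reflection criterion.
strong = reflections[reflections["I/SIGI"] > 3.0]

# Any numeric subset can be handed to NumPy-based libraries as an array.
as_array = reflections[["I", "SIGI"]].to_numpy()
```

Individual reflections can then be looked up by Miller index, for example `reflections.loc[(0, 0, 2), "I"]`.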

Due to the tabular nature of reflection data and the widespread use of pandas in data science, we sought to develop a library that extended the `DataFrame` for crystallographic data by providing built-in support for space groups, unit cells, and symmetry operations. This library, `reciprocalspaceship`, can be used to inspect reflection data, develop new crystallographic methods, and release reproducible analysis pipelines for X-ray diffraction experiments.

## 2 `reciprocalspaceship` Library

### 2.1 Mission Statement

`reciprocalspaceship` is a free and open-source software library with the primary goal of simplifying the analysis of crystallography data in Python. To achieve this goal, we sought to design a software library that is intuitive for both crystallographers and Python programmers. This requires full support for common crystallographic operations, as well as easy integration with the scientific computing and machine learning libraries that are developed and maintained by the Python community.

### 2.2 Design

The `DataFrame` is the core abstraction in pandas. `reciprocalspaceship` provides a `DataSet` class that extends the `DataFrame`, augmenting it to represent reflection data from X-ray diffraction experiments. `DataSet` objects store reflection data, along with the associated space group and unit cell, and can be initialized from common reflection file formats such as MTZ files (Fig. 1). By extending the pandas `DataFrame`, it is possible to preserve its core functionality while adding built-in methods to support common crystallographic operations. These operations use the GEMMI library to represent space groups and unit cells [10], and have been vectorized to increase performance.

To support compatibility with MTZ files, `reciprocalspaceship` provides custom datatypes to represent different crystallographic observables. To ensure maximum compatibility with other Python libraries, these datatypes are all represented internally using NumPy arrays of either 32-bit integers or floating-point values. Methods are also provided for inferring relevant datatypes based on the column labels used to describe the data. `DataSet` objects can contain any datatype supported by pandas, including generic Python objects.

### 2.3 Features

The primary capabilities of `reciprocalspaceship` are provided through the `DataSet` object, which builds on the core features of the pandas `DataFrame` to provide crystallographic support. These objects can represent both merged and unmerged reflection data, and provide attributes and methods that enable crystallographic data analysis. These features are summarized in Table 1.
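To give a sense of the kind of vectorized unit-cell operation involved, the following NumPy sketch computes the resolution (d-spacing) of a set of reflections. It is not the library's implementation: the function name `d_spacing_orthogonal` is hypothetical, and the formula assumes a cell with all angles equal to 90°, whereas a general-purpose routine must handle arbitrary cell angles.

```python
import numpy as np

def d_spacing_orthogonal(hkl, a, b, c):
    """Resolution in angstroms for Miller indices in an orthogonal cell.

    Assumes alpha = beta = gamma = 90 degrees, so that
    1/d^2 = (h/a)^2 + (k/b)^2 + (l/c)^2.
    """
    hkl = np.asarray(hkl, dtype=float)
    inv_d2 = (hkl[:, 0] / a) ** 2 + (hkl[:, 1] / b) ** 2 + (hkl[:, 2] / c) ** 2
    return 1.0 / np.sqrt(inv_d2)

# Hypothetical cell edges, chosen only so the arithmetic is easy to follow.
hkl = np.array([[1, 0, 0], [0, 0, 2], [1, 1, 0]])
d = d_spacing_orthogonal(hkl, a=10.0, b=10.0, c=20.0)
```

Because the calculation operates on whole arrays of Miller indices, the result can be stored as a new column and used directly for resolution binning.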

In addition to the `DataSet` object, `reciprocalspaceship` provides several algorithms that can be used for analysis. These include `merge()`, which implements the averaging of unmerged reflection data using maximum-likelihood weights, and `scale_merged_intensities()`, which implements French-Wilson scaling to account for negative merged intensities [14]. These implementations can serve as templates for the development of new analysis methods using `reciprocalspaceship`. The set of algorithms offered through this library will continue to expand as users implement new analyses intended for broader adoption.

### 2.4 Development and Documentation

`reciprocalspaceship` is maintained on GitHub to foster community involvement in its maintenance, testing, and documentation. Every change to the source code is tested using an automated suite in order to support continuous integration [15]. `reciprocalspaceship` is available through the Python Package Index (PyPI), and can be installed on most systems using `pip`. Documentation is automatically generated from the `reciprocalspaceship` GitHub repository to ensure up-to-date information is available for users. The website also includes a User Guide section describing the design and features of `reciprocalspaceship`, and examples that use the library for crystallographic applications. By committing to an open-source development model, it will be possible to maintain this library to meet the needs of crystallographers.

## 3 Examples

The following examples demonstrate the use of `reciprocalspaceship` in the analysis of crystallographic data. These examples cover the merging of scaled observed intensities, analyzing anomalous differences from a single-wavelength anomalous dispersion (SAD) experiment, and applying weights to a time-resolved difference map. These examples are intended to illustrate the breadth of crystallographic problems that can be addressed using this library, as well as its seamless integration with common scientific computing libraries. The examples are available as interactive Jupyter notebooks in the `reciprocalspaceship` documentation [13].

### 3.1 Assessing Uncertainty in Merging Statistics

Merging statistics are useful for assessing the internal consistency of a data set, and many different metrics have been proposed over the years [16, 17]. Although merging statistics are commonly reported by data reduction pipelines, they are often not reported with uncertainties and do not always give access to their underlying parameters, such as the number of resolution bins or the type of correlation coefficients to report. By facilitating inspection of the underlying reflection data, `reciprocalspaceship` can be used to write quality control scripts for automating analysis pipelines, or, as shown here, in the exploratory analysis of the properties of a single data set. By enabling crystallographers to try new statistical routines, `reciprocalspaceship` may help in the development of more robust indicators of data quality.

To illustrate this, we computed *CC*_{1/2} and *CC*_{anom} for scaled, unmerged reflection data. The data were collected on a tetragonal crystal of hen egg-white lysozyme at ambient temperature and 6.5 keV. The integrated intensities were scaled in AIMLESS, and the data contain sufficient anomalous signal from the native sulfur atoms to determine experimental phases by the SAD method [18, 1, 19, 20]. Using `reciprocalspaceship`, it is possible to implement a function that merges redundant observations using inverse-variance weights in about 10 lines of code (Fig. 2a). This code takes advantage of the `groupby()` functionality inherited from pandas in order to efficiently perform calculations on a per-reflection basis [11]. By randomly splitting the observed reflections by image, this function can be used to independently merge different sets of observations for computing *CC*_{1/2} and *CC*_{anom}. Due to the modularity of this workflow, it is possible to repeat the random partitioning of observations to generate uncertainty estimates, and to repeat these calculations using both Pearson and Spearman correlation coefficients.

As shown in Fig. 2b, high *CC*_{1/2} values indicate that the data were significantly edge-limited, which is common for data collected at low energy on strongly diffracting crystals. The *CC*_{anom} values show that significant anomalous signal was obtained up to the highest resolution bin. Furthermore, the Spearman correlation coefficients are systematically higher and have smaller uncertainties in the low and intermediate resolution range, suggesting the presence of outliers in the data.
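The workflow described in this section can be sketched with plain pandas. This is not the code from Fig. 2a: the function names, the column labels (`H`, `K`, `L`, `I`, `SIGI`, `BATCH`), and the synthetic observations are all assumptions made for illustration, and the example computes a single overall *CC*_{1/2} rather than values in resolution bins.

```python
import numpy as np
import pandas as pd

def merge_inverse_variance(df):
    """Merge redundant observations of each Miller index with 1/sigma^2 weights."""
    work = df[["H", "K", "L"]].copy()
    work["w"] = 1.0 / df["SIGI"] ** 2
    work["wI"] = work["w"] * df["I"]
    sums = work.groupby(["H", "K", "L"])[["w", "wI"]].sum()
    return pd.DataFrame(
        {"I": sums["wI"] / sums["w"], "SIGI": np.sqrt(1.0 / sums["w"])}
    )

def half_dataset_cc(df, rng, method="pearson"):
    """Split observations by image at random, merge each half, and correlate."""
    images = df["BATCH"].unique()
    half1 = rng.permutation(images)[: len(images) // 2]
    in_half1 = df["BATCH"].isin(half1)
    m1 = merge_inverse_variance(df[in_half1])
    m2 = merge_inverse_variance(df[~in_half1])
    both = m1.join(m2, how="inner", lsuffix="_1", rsuffix="_2")
    x, y = both["I_1"], both["I_2"]
    if method == "spearman":
        # Spearman correlation is the Pearson correlation of the ranks.
        x, y = x.rank(), y.rank()
    return x.corr(y)

# Synthetic, hypothetical data: 50 reflections, each observed on 4 images.
rng = np.random.default_rng(0)
n_refl, n_img = 50, 4
true_I = rng.exponential(100.0, n_refl)
obs = pd.DataFrame(
    {
        "H": np.repeat(np.arange(1, n_refl + 1), n_img),
        "K": 0,
        "L": 0,
        "BATCH": np.tile(np.arange(n_img), n_refl),
        "SIGI": 0.1,
    }
)
obs["I"] = np.repeat(true_I, n_img) + rng.normal(0.0, 0.1, len(obs))

# Repeating the random split gives a spread that serves as an uncertainty estimate.
ccs = [half_dataset_cc(obs, rng) for _ in range(10)]
cc_mean, cc_std = np.mean(ccs), np.std(ccs)
```

Passing `method="spearman"` to `half_dataset_cc` repeats the same calculation with rank correlation, which is less sensitive to outliers.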

### 3.2 Merging Observations with a Robust Error Model

The difference observed for *CC*_{anom} between the Pearson and Spearman correlation coefficients in Fig. 2b suggests the presence of outlier observations despite the outlier rejection applied by AIMLESS [19]. Since AIMLESS assumes a normally distributed error model for its observations, such outliers can have a large impact on the estimate of the true merged intensity. We can evaluate whether a normally distributed error model is appropriate based on the distribution of residuals between the observed intensities and the estimate of the true mean. This histogram can be made in just a few lines of Python by taking advantage of pandas indexing (Fig. 2c). Compared with the expected distribution of residuals for normally distributed observations, this data set has significantly heavier tails, with many observations several standard deviations away from the merged intensity.
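A residual calculation of this kind might look like the sketch below, which is not the code from Fig. 2c. The function name and column labels (`H`, `K`, `L`, `I`, `SIGI`) are assumptions, and the toy observations are invented so that the arithmetic is easy to check by hand.

```python
import numpy as np
import pandas as pd

def normalized_residuals(df):
    """Return (I - I_merged) / SIGI for every observation.

    I_merged is the inverse-variance weighted mean of all observations
    that share a Miller index, broadcast back to the observation rows.
    """
    w = 1.0 / df["SIGI"] ** 2
    keys = [df["H"], df["K"], df["L"]]
    sum_w = w.groupby(keys).transform("sum")
    sum_wI = (w * df["I"]).groupby(keys).transform("sum")
    merged = sum_wI / sum_w
    return (df["I"] - merged) / df["SIGI"]

# Hypothetical observations: two of reflection (1, 0, 0), one of (0, 1, 0).
obs = pd.DataFrame(
    {
        "H": [1, 1, 0],
        "K": [0, 0, 1],
        "L": [0, 0, 0],
        "I": [10.0, 20.0, 5.0],
        "SIGI": [1.0, 1.0, 2.0],
    }
)
z = normalized_residuals(obs)

# For normally distributed errors, roughly 0.27% of residuals should
# exceed three standard deviations in magnitude; heavier tails give more.
tail_fraction = np.mean(np.abs(z) > 3.0)
```

A histogram of `z` can then be compared against a standard normal density to judge how heavy the tails are.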

The residuals in Fig. 2c suggest that merging may be improved by a more robust error model that can tolerate outliers. One popular choice of robust error model is the Student’s *t*-distribution. This distribution is parameterized by a location, scale, and number of degrees of freedom, *ν*, which controls the robustness of the distribution to outliers. Importantly, the distribution approaches the normal distribution as *ν* approaches infinity. Unlike the normal distribution, there is no analytical expression for the maximum-likelihood estimator of the true mean given a set of observations under a Student’s *t*-distributed error model. However, we can construct a simple optimization problem to recover maximum-likelihood estimates of the merged intensity for each Miller index. To begin with, we write the likelihood function, which is the probability of the data as a function of the mean intensity for each Miller index:

$$\mathcal{L}(\mu) = \prod_{h} \prod_{i} P\left(I_{h,i} \mid \mu_h, \sigma_{I_{h,i}}\right) \tag{1}$$

The error model *P* can be any suitable location-scale family distribution. This likelihood function asserts that the observed intensity is drawn from a distribution centered at the merged intensity, *μ*_{h}, with a scale determined by the empirical standard deviation of the observation, *σ*_{Ih,i}:

$$I_{h,i} \sim P\left(\mu_h, \sigma_{I_{h,i}}\right) \tag{2}$$

In order to recover maximum-likelihood estimates, we need only maximize equation 1 with respect to the merged intensities, which are the optimization variables in this problem. Equivalently, we may minimize the negative logarithm of the likelihood:

$$\hat{\mu} = \underset{\mu}{\operatorname{argmin}} \; -\sum_{h} \sum_{i} \log P\left(I_{h,i} \mid \mu_h, \sigma_{I_{h,i}}\right) \tag{3}$$

which has the advantage of converting a numerically unstable product into a sum.

This optimization was implemented in `PyTorch` in a general form that could flexibly accept a location-scale family distribution to use as an error model [21]. The data were merged using Student’s *t*-distributions with varying degrees of freedom as error models, and the resulting *CC*_{anom} values were compared with the normally distributed error model. The error models for fewer degrees of freedom outperformed the error models for more degrees of freedom (larger *ν*), with their performance trending towards that of the normally distributed error model (Fig. 2d).
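The behavior of the robust estimator can be explored for a single Miller index without a machine learning library. The sketch below deliberately swaps the gradient-based `PyTorch` optimization described above for a simpler technique: an EM-style, iteratively reweighted average, whose fixed point is a stationary point of the Student’s *t* likelihood with fixed scales. The function name and the toy observations are hypothetical.

```python
import numpy as np

def student_t_location(I, sigI, nu, n_iter=100, tol=1e-10):
    """Estimate the location of a Student's t error model for one reflection.

    Each observation I[i] has a fixed scale sigI[i]. Setting the derivative
    of the log-likelihood to zero gives a weighted mean with weights
    (nu + 1) / (nu + z_i^2) / sigI[i]^2, where z_i = (I[i] - mu) / sigI[i],
    so the estimate is refined by repeated reweighting.
    """
    I = np.asarray(I, dtype=float)
    sigI = np.asarray(sigI, dtype=float)
    mu = np.median(I)  # robust starting point
    for _ in range(n_iter):
        z2 = ((I - mu) / sigI) ** 2
        w = (nu + 1.0) / (nu + z2) / sigI**2
        mu_new = np.sum(w * I) / np.sum(w)
        if abs(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu

# Hypothetical observations of one reflection, including a gross outlier.
I_obs = [10.0, 10.5, 9.5, 100.0]
sig_obs = [1.0, 1.0, 1.0, 1.0]

mu_robust = student_t_location(I_obs, sig_obs, nu=4)      # few degrees of freedom
mu_normal_like = student_t_location(I_obs, sig_obs, nu=1e8)  # nearly normal model
```

With few degrees of freedom the outlier is strongly downweighted, whereas for very large *ν* the estimate approaches the ordinary inverse-variance weighted mean, mirroring the limiting behavior of the distribution.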

This example demonstrates the use of `reciprocalspaceship` to construct a flexible merging function using a machine learning library. This greatly reduces the overhead required to prototype a new analysis method by making it easy to use existing and well-supported libraries. Furthermore, the benefits of using robust statistical estimators, as demonstrated by the improved *CC*_{anom} values in Figs. 2b and 2d, suggest new avenues for improving the existing crystallographic analysis infrastructure. One such project, `careless`, is combining `reciprocalspaceship` and `TensorFlow` to use approximate Bayesian inference in order to develop new scaling and merging routines [22, 23].

### 3.3 Revisiting French-Wilson Scaling

In the previous example, we identified anomalous differences from a room-temperature sulfur SAD experiment. Here, we will examine this anomalous signal in real space by making an anomalous difference map. Before we can make a map, it is necessary to scale the merged intensities to account for any negative values that may result from background subtraction during integration. This is commonly handled using a Bayesian approach first proposed by French and Wilson [14]. Briefly, this algorithm works by solving an integral:

$$\mathbb{E}\left[J_h \mid I_h\right] = \frac{\int_0^\infty J_h \, P(I_h \mid J_h) \, P(J_h) \, dJ_h}{\int_0^\infty P(I_h \mid J_h) \, P(J_h) \, dJ_h} \tag{4}$$

where the likelihood, *P*(*I*_{h} | *J*_{h}), is taken to be normally distributed with the empirical standard deviation. The prior distribution, *P*(*J*_{h}), is the Wilson distribution:

$$P(J_h) = \begin{cases} \Sigma^{-1} \exp\left(-J_h / \Sigma\right) & \text{acentric} \\ \left(2 \pi \Sigma J_h\right)^{-1/2} \exp\left(-J_h / 2\Sigma\right) & \text{centric} \end{cases} \tag{5}$$

which is parameterized by Σ, the mean intensity of reflections at the appropriate resolution. In order to estimate Σ for each reflection, the classic French-Wilson scaling algorithm computes the mean intensity of reflections in resolution shells, and interpolates the mean values from shells adjacent to the particular reflection. Since the functional form of the prior distribution has strictly positive support [24], the expectations computed from equation 4 are necessarily positive. Furthermore, the posterior structure factor amplitudes can be estimated as part of the same subroutine using the following integral:

$$\mathbb{E}\left[F_h \mid I_h\right] = \frac{\int_0^\infty \sqrt{J_h} \, P(I_h \mid J_h) \, P(J_h) \, dJ_h}{\int_0^\infty P(I_h \mid J_h) \, P(J_h) \, dJ_h} \tag{6}$$

This scaling method is implemented in `reciprocalspaceship` as `scale_merged_intensities()`, though this implementation differs from the classical one in several regards. Notably, rather than computing mean values in shells, we use a Gaussian smoother [25, chapter 14.7.4-5] to regress the mean of the intensity distributions, Σ, against resolution. This regression model is quite flexible and offers an anisotropic mode which estimates the mean intensity locally as a function of the Miller indices. Whereas the original paper computed the posterior by interpolating a table of cached results [14], our implementation uses Chebyshev-Gauss quadrature to evaluate the integrals on the fly. We generate quadrature points and weights with `NumPy` [12] and compute the relevant log probabilities using the distribution classes implemented in SciPy [26]. Our implementation is tested for consistency with the original paper [14] and with `CCTBX` [4].
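The quadrature step can be illustrated for a single acentric reflection. The sketch below is not the library's `scale_merged_intensities()` routine: the function name, the choice of integration window, and the number of quadrature points are assumptions, the log probabilities are written out by hand rather than taken from SciPy, and Σ is supplied directly instead of being regressed against resolution. It assumes a normal likelihood and the acentric Wilson prior, for which the posterior is a normal distribution truncated at zero.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebgauss

def french_wilson_acentric(I, sigI, Sigma, npoints=200):
    """Posterior means of intensity J and amplitude F = sqrt(J) for one reflection.

    Likelihood: I ~ Normal(J, sigI). Prior: P(J) = exp(-J / Sigma) / Sigma, J >= 0.
    """
    # Finite window containing essentially all posterior mass (an assumption).
    lo = max(0.0, I - 10.0 * sigI)
    hi = max(I, 0.0) + 10.0 * sigI

    # Chebyshev-Gauss nodes and weights integrate f(x) / sqrt(1 - x^2) on [-1, 1];
    # multiplying the weights by sqrt(1 - x^2) gives a rule for plain integrals.
    x, w = chebgauss(npoints)
    w = w * np.sqrt(1.0 - x**2)
    J = lo + 0.5 * (hi - lo) * (x + 1.0)  # map nodes from [-1, 1] to [lo, hi]

    # Unnormalized log posterior; subtracting the maximum avoids underflow.
    logp = -0.5 * ((I - J) / sigI) ** 2 - J / Sigma
    p = np.exp(logp - logp.max())

    # The Jacobian of the change of variables cancels in these ratios.
    Z = np.sum(w * p)
    mean_J = np.sum(w * J * p) / Z
    mean_F = np.sum(w * np.sqrt(J) * p) / Z
    return mean_J, mean_F

# Hypothetical inputs: a weak reflection with a negative measured intensity,
# and a strong reflection that should be left essentially unchanged.
weak_J, weak_F = french_wilson_acentric(I=-1.0, sigI=1.0, Sigma=1e6)
strong_J, strong_F = french_wilson_acentric(I=1000.0, sigI=10.0, Sigma=1e6)
```

The negative intensity is mapped to a strictly positive posterior mean, while the strong reflection is returned nearly unchanged, which is the qualitative behavior expected of French-Wilson scaling.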

The merged intensities from the sulfur SAD experiment were rescaled and converted to structure factors using `scale_merged_intensities()`. This operation leaves large intensities relatively unchanged, while ensuring that any negative values become strictly positive (Fig. 3a). Anomalous differences of the structure factor amplitudes were computed between Friedel pairs. The anomalous difference map shown in Fig. 3b was then constructed using phases derived from the refined model (PDB: 7L84). The map shows significant anomalous peaks at a 5*σ* contour, with the density localized to each of the 10 sulfur atoms in the lysozyme structure.

### 3.4 Identifying Anomalous Scattering Atoms in Real Space

The anomalous difference map shown in Fig. 3b was rendered in PyMOL (Schrödinger, LLC) from the anomalous difference amplitudes and phases. It is also possible to compute a real-space map using `reciprocalspaceship` and NumPy, which enables one to use image processing software to automate the identification of anomalous scattering atoms. This process is illustrated in Fig. 3c. This code snippet arranges the complex anomalous structure factors on a reciprocal space grid, and then computes the real-space anomalous difference map using the Fast Fourier transform [27] function in NumPy [12]. `scikit-image`, an image processing library [28], can be used to identify peaks in the map. This procedure successfully identifies the 80 sulfur sites in the tetragonal lysozyme unit cell (10 sulfurs per copy, 8 copies). The automatically identified sites are overlaid with the anomalous difference map in Fig. 3d.
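The core of this procedure, placing structure factors on a reciprocal-space grid and transforming to real space, can be sketched with NumPy alone. This is not the code from Fig. 3c: it uses a toy P1 cell containing a single point scatterer at a hypothetical position, assumes the convention F(h) = Σ f exp(2πi h·x), and replaces the `scikit-image` peak search with a simple `argmax`.

```python
import numpy as np

# Toy structure: one point scatterer at a hypothetical fractional position.
x0 = np.array([0.25, 0.5, 0.125])
n_grid = 16   # real-space grid points along each cell edge
h_max = 5     # include all Miller indices with |h|, |k|, |l| <= h_max

# Every Miller index in the cube; Friedel mates are included automatically.
r = np.arange(-h_max, h_max + 1)
H, K, L = (a.ravel() for a in np.meshgrid(r, r, r, indexing="ij"))

# Structure factors under the assumed convention F(h) = exp(2*pi*i * h.x0).
F = np.exp(2j * np.pi * (H * x0[0] + K * x0[1] + L * x0[2]))

# Place the structure factors on the grid; negative indices wrap around.
grid = np.zeros((n_grid, n_grid, n_grid), dtype=complex)
grid[H % n_grid, K % n_grid, L % n_grid] = F

# With this sign convention, NumPy's forward FFT performs the synthesis
# rho(x) = sum_h F(h) exp(-2*pi*i * h.x), up to a constant scale factor.
density = np.fft.fftn(grid)

# Locate the strongest peak and convert it to fractional coordinates.
peak = tuple(int(i) for i in np.unravel_index(np.argmax(density.real), density.shape))
peak_fractional = np.array(peak) / n_grid
```

For a real anomalous difference map, the same gridding step would be applied to the anomalous difference amplitudes and phases, and a dedicated peak-finding routine would be used to locate multiple sites.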

This example illustrates the use of `reciprocalspaceship` to produce real-space maps from structure factors. Importantly, due to the seamless integration with NumPy, it is possible to take advantage of Python image processing libraries for identifying peaks in the real-space density. Due to the wealth of libraries and tools written by the Python community, this feature of `reciprocalspaceship` can provide the opportunity to develop and test new algorithms rapidly. In this manner, the use of `reciprocalspaceship` could simplify existing data processing pipelines, and perhaps be useful in the development of new methods in crystallographic data analysis or structural bioinformatics.

### 3.5 Applying Weights to a Time-Resolved Difference Map

Time-resolved crystallography experiments make use of X-ray diffraction to monitor structural changes in a crystalline sample. Commonly, structural changes are initially evaluated on the basis of isomorphous difference maps. Such maps are computed by estimating the difference in structure factor amplitudes of the sample before and after a perturbation, such as a laser pulse. Combining these |*F*_{on}| – |*F*_{off}| differences with ground state phases from a reference structure yields an estimate of the differences between the electron density of the sample before and after the perturbation. Difference maps are often noisy due to systematic errors or scaling artifacts, and are frequently weighted by the magnitude of the difference signal and/or the error estimates associated with the empirical differences in structure factor amplitudes. In this example we will visualize the effects of applying weights to a time-resolved difference map of photoactive yellow protein (PYP). PYP is a model system in time-resolved crystallography due to the trans-to-cis isomerization of its 4-hydroxycinnamyl chromophore, which occurs upon absorption of blue light [29]. This data set was collected at the BioCARS Laue beamline APS-14-ID, and is composed of matched images collected in the dark and 2 ms after illumination with blue light. This data was collected and provided by Marius Schmidt and Vukica Šrajer.

Several schemes have been used to apply weights to time-resolved difference maps. Many of them take the form of Equation 7, involving a term based on the uncertainty in the difference structure factor amplitude (*σ*_{ΔF}) and, optionally, a scale term based on the magnitude of the observed difference structure factor amplitude (|Δ*F*|):

$$w = \left(1 + \frac{\sigma_{\Delta F}^{2}}{\overline{\sigma_{\Delta F}^{2}}} + \alpha \frac{|\Delta F|^{2}}{\overline{|\Delta F|^{2}}}\right)^{-1} \tag{7}$$

where the overbars denote averages over all reflections.

With *α* = 0, these weights take the form derived by Ursby and Bourgeois [30]. The Δ*F*-dependent term downweights the influence of outliers in the data set resulting from poorly measured differences by assigning lower weights to their map coefficients. The degree of skepticism about large differences is controlled by the *α* parameter. *α* values of 1.0 [31] and 0.05 [8] have been reported in the literature.

The weighting function given by Equation 7 can be expressed in a few lines of Python that apply weights based on the values of |Δ*F*| and *σ*_{ΔF} in an `rs.DataSet` object (Fig. 4a). The weights computed for the PYP data set are illustrated in Fig. 4b. Difference structure factors with low signal-to-noise ratios (large *σ*_{ΔF} relative to |Δ*F*|) or large difference structure factor amplitudes are assigned lower weight. The unweighted and weighted difference maps were then made using phases derived from the ground-state model (PDB: 2PHY). The side-by-side comparison of these difference maps shows that the weights greatly improve the interpretation of the structural changes—emphasizing the trans-to-cis isomerization of the chromophore as well as concerted changes in the nearby Arg52 and Phe96 sidechains (Fig. 4c and 4d).
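A weighting function of this kind might be written as follows. This sketch is not the code from Fig. 4a: it assumes that Equation 7 has the commonly used form w = (1 + σ²_{ΔF}/⟨σ²_{ΔF}⟩ + α|ΔF|²/⟨|ΔF|²⟩)⁻¹, with averages taken over all reflections, and the function name, argument names, and toy values are hypothetical. It operates on NumPy arrays, so it can be applied directly to the columns of any pandas-style table.

```python
import numpy as np

def difference_map_weights(dF, sig_dF, alpha=0.0):
    """Weights for difference structure factors under the assumed Equation 7.

    dF     : difference structure factor amplitudes (signed or absolute)
    sig_dF : uncertainties of the differences
    alpha  : strength of the |dF|-dependent term (0 disables it)
    """
    dF = np.asarray(dF, dtype=float)
    sig_dF = np.asarray(sig_dF, dtype=float)
    inv_w = (
        1.0
        + sig_dF**2 / np.mean(sig_dF**2)
        + alpha * dF**2 / np.mean(dF**2)
    )
    return 1.0 / inv_w

# Hypothetical differences: the last one is unusually large.
dF = np.array([1.0, -2.0, 1.5, 10.0])
sig_dF = np.array([0.5, 0.5, 2.0, 0.5])

w_sigma_only = difference_map_weights(dF, sig_dF, alpha=0.0)
w_with_scale = difference_map_weights(dF, sig_dF, alpha=0.05)
weighted_dF = w_with_scale * dF
```

Storing `weighted_dF` alongside the unweighted differences makes it straightforward to compare maps computed with several values of *α*.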

This example illustrates the use of `reciprocalspaceship` for creating custom maps. Importantly, it demonstrates both the exploratory analysis of different weighting schemes, as well as writing MTZ files including different weight columns. These can be used to visualize the impact of the different weights in a molecular visualization suite.

## 4 Discussion

`reciprocalspaceship` is a Python library that can form the foundation for the development of new methods in crystallographic data analysis. This library provides a `DataSet` object that can conveniently represent tabular reflection data while adhering to common practices in Python data analysis. This empowers crystallographers to write idiomatic Python code to analyze their experiments while having full support for the necessary features of crystallographic analysis, such as symmetry operations, unit cells, and space groups. Example applications were presented which use this library for merging scaled reflections, analyzing anomalous differences from a SAD experiment, and observing the impact of weights on a time-resolved difference map. These examples illustrate how `reciprocalspaceship` could be used in several different contexts, producing useful analyses with relatively short scripts and functions that can take full advantage of the existing Python ecosystem.

`reciprocalspaceship` can be used for exploratory data analysis, allowing one to inspect interesting properties of an important data set. It can also be used to prototype, develop, and ship new methods and algorithms for analyzing data sets [22]. Furthermore, this library can be useful in teaching crystallography by allowing students to familiarize themselves with reflection data, space groups and symmetry, and the implementation of commonly used algorithms. This library lowers the barrier to entry for crystallographic software development by using a framework familiar to Python data scientists.

## 5 Data and Code Availability

`reciprocalspaceship` and worked-out examples are available on GitHub at https://github.com/Hekstra-Lab/reciprocalspaceship, and can be installed directly from the Python Package Index (PyPI). The code used in these examples is available in the `reciprocalspaceship` documentation, and the interactive Jupyter notebooks and all supporting data can be downloaded directly from the Examples directory of the GitHub repository.

## 6 Acknowledgements

We thank the staff at the Northeastern Collaborative Access Team (NE-CAT), beamline 24-ID-C of the Advanced Photon Source, for supporting our room-temperature crystallography experiments, with special thanks to Igor Kourinov. NE-CAT beamlines are supported by the National Institute of General Medical Sciences, NIH (P30 GM124165), using resources of the Advanced Photon Source, a U.S. Department of Energy (DOE) Office of Science User Facility operated for the DOE Office of Science by Argonne National Laboratory under Contract No. DE-AC02-06CH11357. We also thank Marius Schmidt and Vukica Šrajer for the time-resolved Laue diffraction data of photoactive yellow protein. This work was supported by the Searle Scholarship Program (SSP-2018-3240) and a fellowship from the George W. Merck Fund of the New York Community Trust (338034). J.B.G. was supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE1745303.