Abstract
pyCapsid is a Python package developed to facilitate the characterization of the dynamics and mechanical units of protein shells and other protein complexes. The package was developed in response to the rapid increase of high-resolution structures, particularly capsids of viruses, requiring multiscale biophysical analyses. Given a protein shell, pyCapsid generates the collective vibrations of its amino-acid residues, identifies quasi-rigid mechanical regions, and maps the results back to the input proteins for interpretation. pyCapsid’s source code is available under the MIT License on GitHub (https://github.com/luquelab/pycapsid). It has also been deployed in the two leading Python package-management systems, PIP (https://pypi.org/project/pyCapsid/) and Conda (https://anaconda.org/luque_lab/pycapsid). Installation instructions and tutorials are available in the GitHub Pages-style online documentation (https://luquelab.github.io/pyCapsid). In addition, users can post issues regarding pyCapsid in the GitHub repository (https://github.com/luquelab/pyCapsid/issues).
Contact Antoni Luque (aluque{at}sdsu.edu).
Supplementary information (SI) Further details and figures on the performance results reported in this article are available in the GitHub repository (https://github.com/luquelab/pyCapsid/tree/main/results/performance). A gallery displaying the application of pyCapsid to various protein shells is available in the online documentation (https://luquelab.github.io/pyCapsid/gallery/).
1 Introduction
Viruses protect their infective genomes in protein shells called capsids (Twarock and Luque 2019). The number of capsids resolved structurally has increased exponentially in the last two decades, partly thanks to cryo-electron microscopy advances (Callaway 2020; Johnson and Olson 2021; Montiel-Garcia et al. 2021). These three-dimensional reconstructions combined with computational algorithms and complementary experimental techniques are leading to a mechanistic characterization of the assembly, dynamics, and stability of viral capsids, opening the doors to new antiviral strategies (Johnson et al. 2021; Twarock and Stockley 2019; Organtini et al. 2017; Kizziah, Rodenburg, and Dokland 2020; Yeager et al. 1990; Qazi et al. 2018; Li et al. 2008; Mata et al. 2020; de Pablo and San Martín 2022; X. Zhang et al. 2013; Bayfield, Steven, and Antson 2020; Montiel-Garcia et al. 2021; Podgorski et al. 2020; Hua et al. 2017; Mohajerani et al. 2022; Wilson and Roof 2021; Bruinsma, Wuite, and Roos 2021; Luque and Reguera 2013; Lee et al. 2022; Grime et al. 2016; Plavec et al. 2021). Among the computational methods, molecular dynamics algorithms have improved dramatically in recent decades and can infer the dynamics of large protein complexes. However, they resolve relatively short timescales (⪝ 1 μs) and require specialized computational resources (Jana and May 2021; Hadden et al. 2018; Perilla and Schulten 2017; Bryer et al. 2022). The fact that capsids are assembled from 60 to more than 60,000 proteins further limits the application of molecular dynamics (Twarock and Luque 2019; Luque et al. 2020; Berg and Roux 2021). Alternatively, the combination of normal mode analysis (NMA), molecular coarse-graining, and elastic network models (ENM) offers a more scalable solution (Bahar et al. 2010; Romo and Grossfield 2011a). This approach has successfully estimated the collective motion of proteins in complexes (Bahar et al. 2010) and identified structural conformational changes in capsids (Tama and Brooks 2005). Nonetheless, no easy-to-use computational packages are currently available to characterize the dynamics and mechanical properties of protein shells. The bioinformatics software presented here, pyCapsid, aims to address this issue.
pyCapsid is inspired by prior publications that applied ENM, NMA, and clustering methods to three-dimensional reconstructions to extract the quasi-rigid regions associated with mechanically relevant units in protein shells (Ponzoni et al. 2015; Polles et al. 2013). These methods are very insightful, but their implementation in packages that generate the dynamics of protein complexes, such as NRGTEN (Mailhot and Najmanovich 2021), ClustENMD (Kaynak et al. 2021), WebPSN (Seeber et al. 2015), or the popular ProDy (S. Zhang et al. 2021), is not trivial. To address this issue, we introduce pyCapsid, an accessible Python package that identifies the dominant dynamics and quasi-rigid regions of protein shells. Moreover, since the methods are generic, pyCapsid can also be applied to other protein complexes. Yet, in this first release of pyCapsid, we have focused on the characterization of protein shells, such as viral capsids, cellular protein compartments like encapsulins, and gene-transfer agents (Giessen et al. 2019; Bárdy et al. 2020; Johnson and Olson 2021; Montiel-Garcia et al. 2021).
2 Methods and Features
The Python package pyCapsid is divided into five independent modules: the PDB (Protein Data Bank) module, the CG (coarse-graining) module, the NMA (normal mode analysis) module, the QRC (quasi-rigid clustering) module, and the VIS (visualization) module (Figure 1). The role and technical aspects of each module are briefly described below.
PDB module
This module retrieves and loads structural data from the Protein Data Bank (using the PDB ID) or a local file in PDB or PDBx/mmCIF formats (Westbrook et al. 2022; Berman et al. 2000). The PDB module builds on functions from the Python package Biotite (Kunzmann and Hamacher 2018).
CG module
This module coarse-grains the proteins at the amino-acid level and establishes an elastic force field between amino acids, offering four standard elastic models: the anisotropic network model (ANM), the Gaussian network model (GNM), the unified anisotropic and Gaussian network model (U-ENM), which is the default, and the backbone-enhanced elastic network model (bb-ENM). Each amino acid is coarse-grained as a point mass on the alpha-carbon. The links in the network connect amino acids that are closer than a certain threshold distance. The default threshold is 15 Å for ANM and 7.5 Å for the other models. These values are based on prior studies of elastic models reproducing empirical molecular thermal fluctuations (B-factors) (Eyal, Yang, and Bahar 2006; Zheng 2008; Micheletti, Carloni, and Maritan 2004; Romo and Grossfield 2011b). The small threshold distance leads to a sparse network. This network, combined with the elastic strength values of the elastic model, defines the Hessian matrix. Given that the network is sparse, the calculations to build the matrix are accelerated using Numba (Lam, Pitrou, and Seibert 2015).
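To make the construction of the Hessian concrete, the following is a minimal sketch of a textbook ANM Hessian for a handful of alpha-carbon positions. It is not pyCapsid's implementation: the function name `anm_hessian`, the uniform spring constant `gamma`, the dense NumPy storage, and the five toy coordinates are assumptions chosen for illustration, whereas pyCapsid works with sparse matrices and Numba and defaults to the U-ENM model.

```python
import numpy as np

def anm_hessian(coords, cutoff=15.0, gamma=1.0):
    """Dense ANM Hessian (3N x 3N) for N alpha-carbon positions in angstroms.

    Hypothetical helper for illustration: every pair closer than `cutoff`
    is linked by a spring of uniform strength `gamma`.
    """
    n = len(coords)
    hessian = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            dist2 = d @ d
            if dist2 > cutoff**2:
                continue  # no spring between distant residues
            # Off-diagonal 3x3 block for the pair (i, j).
            block = -gamma * np.outer(d, d) / dist2
            hessian[3*i:3*i+3, 3*j:3*j+3] = block
            hessian[3*j:3*j+3, 3*i:3*i+3] = block
            # Diagonal blocks accumulate minus the off-diagonal blocks.
            hessian[3*i:3*i+3, 3*i:3*i+3] -= block
            hessian[3*j:3*j+3, 3*j:3*j+3] -= block
    return hessian

# Toy coordinates (made up, not taken from any PDB entry).
toy_coords = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [0.0, 3.8, 0.0],
    [0.0, 0.0, 3.8],
    [2.5, 2.5, 2.5],
])
hessian_toy = anm_hessian(toy_coords)
eigvals_toy = np.linalg.eigvalsh(hessian_toy)
n_zero_modes = int(np.sum(np.abs(eigvals_toy) < 1e-8))
```

For a connected, rigid three-dimensional network such as this toy example, the Hessian has six zero eigenvalues corresponding to the rigid-body translations and rotations; the remaining eigenvalues describe internal vibrations.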
NMA module
This module obtains the motions of the macromolecular complex by decomposing the dynamics into independent sinusoidal motions called normal modes (Goldstein, Poole, and Safko 2002). The normal modes and associated frequencies are obtained from the Hessian matrix derived in the CG module. It is well established that only low-frequency modes are relevant to the global dynamics of macromolecules (Bahar et al. 2010). The default number of modes calculated in pyCapsid is 200. This number was selected by comparing the results with simulations using a larger number of modes (as many modes as one-hundredth of the number of residues, that is, 1000 modes for a structure containing 100,000 residues). pyCapsid also provides an optional dependency to accelerate the calculations further on GPUs using CUDA via solvers in the CuPy package (Okuta et al. 2017).
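As an illustration of extracting only the lowest-frequency modes from a sparse stiffness matrix, the sketch below applies SciPy's sparse eigensolver in shift-invert mode to a toy system. The toy system is an assumption made for the example: a one-dimensional chain of beads joined by unit springs with both ends fixed, chosen because its eigenvalues are known analytically. It does not reproduce pyCapsid's own solver settings.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

# Toy stiffness matrix: chain of beads with unit springs and fixed ends.
n_beads = 200
stiffness = diags(
    [-1.0, 2.0, -1.0], offsets=[-1, 0, 1],
    shape=(n_beads, n_beads), format="csc",
)

# Shift-invert around sigma = 0 targets the eigenvalues closest to zero,
# that is, the lowest-frequency modes.
n_modes = 3
eigvals, eigvecs = eigsh(stiffness, k=n_modes, sigma=0.0, which="LM")
order = np.argsort(eigvals)
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# With unit masses, each frequency is the square root of its eigenvalue.
frequencies = np.sqrt(eigvals)

# Analytic eigenvalues of this tridiagonal matrix, for comparison.
mode_index = np.arange(1, n_modes + 1)
analytic = 2.0 - 2.0 * np.cos(mode_index * np.pi / (n_beads + 1))
```

Fixing the chain ends keeps the toy matrix non-singular, so a shift of exactly zero is valid here. A free elastic network has zero-frequency rigid-body modes, which makes the matrix singular and would call for a different shift.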
QRC module
This module estimates the amino acids that tend to fluctuate as a single mechanical unit (quasi-rigid cluster) using the Spectrus algorithm (Ponzoni et al. 2015). Each cluster groups residues so as to minimize the distance fluctuations between them. The default clustering method in pyCapsid is the discretize method from scikit-learn, and k-means clustering is offered as an alternative (Pedregosa et al. 2012; Yu and Shi 2003). pyCapsid explores from four to n_cluster_max clusters, where n_cluster_max is set by the user.
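The idea of grouping residues by low distance fluctuations can be sketched on a toy system. The example below is a deliberately simplified stand-in, not the Spectrus algorithm as implemented in pyCapsid: it estimates distance fluctuations from a made-up ensemble of snapshots (rather than from normal modes), converts them to a Gaussian similarity whose width is the mean fluctuation (an assumption), and splits the residues into two groups by the sign of the second spectral eigenvector instead of using the discretize or k-means steps. All function names and data are hypothetical.

```python
import numpy as np

def distance_fluctuations(frames):
    """Standard deviation of every pairwise distance over the snapshots.

    frames: array of shape (n_frames, n_residues, 3).
    """
    diffs = frames[:, :, None, :] - frames[:, None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)  # (n_frames, n, n)
    return dists.std(axis=0)

def spectral_bipartition(fluct):
    """Split residues into two groups from a distance-fluctuation matrix."""
    off_diagonal = fluct[~np.eye(len(fluct), dtype=bool)]
    sigma = off_diagonal.mean()  # similarity width (illustrative choice)
    similarity = np.exp(-fluct**2 / (2.0 * sigma**2))
    inv_sqrt_degree = 1.0 / np.sqrt(similarity.sum(axis=1))
    normalized = similarity * inv_sqrt_degree[:, None] * inv_sqrt_degree[None, :]
    _, vecs = np.linalg.eigh(normalized)  # eigenvalues in ascending order
    return (vecs[:, -2] > 0).astype(int)  # sign of the second-largest mode

# Toy ensemble: two rigid three-residue blocks; block B slides along x.
rng = np.random.default_rng(0)
block_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
block_b = block_a + np.array([6.0, 0.0, 0.0])
shifts = rng.normal(0.0, 0.5, size=50)
frames = np.stack([
    np.vstack([block_a, block_b + np.array([s, 0.0, 0.0])]) for s in shifts
])

labels = spectral_bipartition(distance_fluctuations(frames))
```

Residues within each rigid block keep constant mutual distances, so they receive the same label, while the two blocks end up in different clusters. Which block is labeled 0 or 1 is arbitrary.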
VIS module
The results obtained from pyCapsid are stored as data files, figures, and movies in the same running folder. pyCapsid’s online tutorial (https://luquelab.github.io/pyCapsid/tutorial/) provides instructions and scripts to visualize the results in two popular molecular visualization tools, NGLview and ChimeraX (Nguyen, Case, and Rose 2018; Pettersen et al. 2021).
3 Applications
pyCapsid’s performance and accuracy were obtained for 25 protein shells on an HPC cluster core with Intel Xeon CPU E5-2650 v4 (2.20 GHz) and 128 GB of RAM. The size of the protein shells ranged from 16,000 to over 400,000 amino acid residues. The peak memory usage ranged from 800 MB to 90 GB and increased with the number of residues following a power law (exponent = 1.46 ± 0.06 and R² = 0.97). The runtime ranged from 2 minutes to 36 hours and increased with the number of residues following a power law (exponent = 2.20 ± 0.10 and R² = 0.95). The accuracy was estimated from the correlation coefficient between the simulated and empirical thermal motions (B-factors) of the amino acids. pyCapsid’s accuracy ranged from 0.10 to 0.88 (out of 1) and decreased linearly for structures with lower experimental resolution (slope = −0.20 ± 0.05 1/Å and R² = 0.40). The regression projected a perfect accuracy for structures with an ideal experimental resolution of 0 Å (intercept = 1.23 ± 0.18). The accuracy was independent of the number of residues (Spearman’s coefficient = −0.09 and p-value = 0.66). Four additional small capsids (PDB IDs 2ms2, 1za7, 1a34, and 3nap) were analyzed, obtaining quasi-rigid domain decompositions consistent with similar analyses of capsids published previously (Polles et al. 2013; Ponzoni et al. 2015). A subset of 10 of the 25 protein shells in the original performance analysis was used to compare the speed of pyCapsid with ProDy, from loading the PDB to generating the normal mode analysis (NMA). pyCapsid displayed an average speed increase of 3.0 ± 1.5 that was independent of capsid size (Spearman’s coefficient = 0.11 and p-value = 0.76). This increase in speed was due to the use of Numba and the shift-invert mode in SciPy. However, this led to a similar increase in memory usage. The PDB entries, statistical methods, and figures are available in the Supplementary Information.
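The reported power laws can be turned into a quick rule of thumb for how the cost grows with shell size. The short sketch below uses only the central values of the fitted exponents quoted above and ignores their uncertainties; the helper name `scale_factor` is hypothetical.

```python
def scale_factor(size_ratio, exponent):
    """Multiplicative change in cost when the number of residues grows by size_ratio."""
    return size_ratio ** exponent

RUNTIME_EXPONENT = 2.20  # fitted runtime exponent reported in the text
MEMORY_EXPONENT = 1.46   # fitted peak-memory exponent reported in the text

runtime_growth = scale_factor(2.0, RUNTIME_EXPONENT)
memory_growth = scale_factor(2.0, MEMORY_EXPONENT)
print(f"Doubling the residues: runtime x{runtime_growth:.2f}, memory x{memory_growth:.2f}")
# prints "Doubling the residues: runtime x4.59, memory x2.75"
```

Under these fits, doubling the number of residues multiplies the runtime by roughly 4.6 and the peak memory by roughly 2.8, so memory grows more slowly than runtime as shells get larger.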
4 Concluding Remarks
pyCapsid can generate the collective motion and extract the quasi-rigid functional regions of protein shells and other protein complexes. The underlying algorithm of pyCapsid generates the dynamical modes faster than popular protein dynamics packages, like ProDy, which facilitates the quasi-rigid domain decomposition of large complexes like protein shells in the range of minutes to over a day for standard computers. The computational efficiency of pyCapsid, combined with its accessibility via Python distribution packages and online tutorials, aims to facilitate its adoption among researchers interested in physical virology, structural bioinformatics, and related fields.
Funding information
The authors’ research was supported by the National Science Foundation (Award #1951678) and the Gordon and Betty Moore Foundation (Award #GBMF9871, https://doi.org/10.37807/GBMF9871). The HPC facilities were supported by the NSF Office of Advanced Cyberinfrastructure grant 1659169.
Acknowledgments
The authors thank the SDSU Biomath Group, particularly professors Arlette Baljon and Parag Katira, for their insights. The authors also thank the SDSU Computational Research Science Center for the HPC used to test and benchmark pyCapsid.