MDverse: Shedding Light on the Dark Matter of Molecular Dynamics Simulations

The rise of open science and the absence of a global dedicated data repository for molecular dynamics (MD) simulations has led to the accumulation of MD files in generalist data repositories, constituting the dark matter of MD - data that is technically accessible, but neither indexed, curated, or easily searchable. Leveraging an original search strategy, we found and indexed about 250,000 files and 2,000 datasets from Zenodo, Figshare and Open Science Framework. With a focus on files produced by the Gromacs MD software, we illustrate the potential offered by the mining of publicly available MD data. We identified systems with specific molecular composition and were able to characterize essential parameters of MD simulation, such as temperature and simulation length, and identify model resolution, such as all-atom and coarse-grain. Based on this analysis, we inferred metadata to propose a search engine prototype to explore collected MD data. To continue in this direction, we call on the community to pursue the effort of sharing MD data, and increase populating and standardizing metadata to reuse this valuable matter.


31
The volume of data available in biology has increased tremendously (Marx, 2013; Stephens et al., 32 revealed the high value of these data and highlighted the different categories of the simulated 92 molecules, as well as the biophysical conditions applied to these systems. Based on these results 93 and our annotations, we proposed a search engine prototype to easily explore this dark matter of 94 MD. Finally, building on this experience, we provide simple guidelines for data sharing to gradually 95 improve the FAIRness of MD data.  (2022)). 104 One immediate strategy to index MD simulation files available in data repositories is to per-105 form a text-based Google-like search. For that, one queries these repositories with keywords such 106 as 'molecular dynamics' or 'Gromacs'. Unfortunately, we experienced many false positives with 107 this search strategy. This could be explained by the strong discrepancy we observed in the quan-108 tity and quality of metadata (title, description) accompanying datasets and queried in text-based 109 search. For instance, a description text could be composed of a couple of words to more than 1,200 110 words. Metadata is provided by the user depositing the data, with no incentive to issue relevant 111 details to support the understanding of the simulation. For the three data repositories studied, no 112 human curation other by that of the providers is performed when submitting data. It is also worth 113 mentioning that title and description are provided as free-text and do not abide to any controlled 114 vocabulary such as a specific MD ontology. 115 To circumvent this issue, we developed an original and specific search strategy that we called 116 Explore and Expand ( 2 ) (see Fig. 1-A and Materials and Methods section) and that relies on a com-117 bination of file types and keywords queries. In the Explore phase, we searched for files based on 118 their file types (for instance: .xtc, .gro, etc) with MD-related keywords (for instance: 'molecular dy-119 namics', 'Gromacs', 'Martini', etc). Each of these hit files belonged to a dataset, which we further 120 screened in the Expand phase. There, we indexed all files found in a dataset identified in the previ-121 ous Explore phase with, this time, no restriction to the collected file types (see Fig. 1-A and details 122 on the data scraping procedure in the Materials and Methods section). 123 Globally, we indexed about 250,000 files and 2,000 datasets that represented 14 TB of data de-124 posited between August 2012 and March 2023 (see Table 1). One major difficulty were the numer-125 ous files stored in zipped archives, about seven times more than files steadily available in datasets 126 (see Table 1). While this choice is very convenient for depositing the files (as one just needs to pro-127 vide one big zip file to upload to the data repository server), it hinders the analysis of MD files as 128 data repositories only provide a limited preview of the content of the zip archives and completely 129 inhibits, for example, data streaming for remote analysis and visualization. Files within zip files 130 are not indexed and cannot be searched individually. The use of zip archives also hampers the 131 reusability of MD data, since a specific file cannot be downloaded individually. One has to down-132 load the entire zip archive (sometimes with a size up to several gigabytes) to extract the one file of 133 interest. 134 The first dataset we found related to MD data that has been deposited in August 2012 in 135 Figshare and corresponds to the work of Fuller et al. (Fuller et al., 2012) (see Table 1) but we 136 may consider the start of more substantial deposition of the MD data to be 2016 with more than 137 Figure 1. (A) Explore and Expand ( 2 ) strategy used to index and collect MD-related files. Within the explore phase, we search in the respective data repositories for datasets that contain specific keywords (e.g. "molecular dynamics", "md simulation", "namd", "martini"...) in conjunction with specific file extensions (e.g. "mdp", "psf", "parm7"...), depending on their uniqueness and level of trust to not report false-positives (.i.e not MD related). In the expand phase, the content of the identified datasets is fully cataloged, including files that individually could result in false positives (such as e.g. ".log" files). (B) Number of deposited files in generalist data repositories, identified by our 2 strategy. a few thousands files in 2018 to almost 50,000 files in 2022 (see Fig. 1 a steep increase in 2022 ( Fig. 1-B). We believe that this trend will continue in future years, which 147 will lead to a greater amount of MD data available. It is thus urgent to deploy a strategy to index 148 this vast amount of data, and to allow the MD community to easily explore and reuse such gigantic  to compare the availability of all data related to each MD program but to give a snapshot of the 166 type of data available at a given time (i.e. March 2023) in generalist data repositories. Interest-167 ingly, many files (> 133,000) were not directly associated to any MD program (see Fig. 2

-A label
168 'Unknown'). We categorized these files based on their extensions (see Fig. 2 of great value to share. We therefore quantified the types of files generated by Gromacs (Fig. 3-A).

178
The most represented file type is the .xtc file (28,559 files, representing 8.6 TB). This compressed 179 (binary) file is used to store the trajectory of an MD simulation and is an important source of in-180 formation to characterize the evolution of the simulated molecular system as a function of time.

181
It is thus logical to mainly find this type of file shared in data repositories, as it is of great value 182 for reusage and new analyses. Nevertheless, it is not directly readable but needs to be read by a . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted May 2, 2023. ; https://doi.org/10.1101/2023.05.02.538537 doi: bioRxiv preprint from less interesting trajectories such as minimization or equilibration runs.

202
Given the large volume of data represented by .xtc files (see above), we could only scratch the 203 surface of the information stored in these trajectory files by analyzing a subset of 779 .xtc files -204 one per dataset in which this type of file was found. We were able to get the size of the molecular 205 systems and the number of frames available in these files ( Fig. 3-B). Of note, the frames do not 206 directly translate to the simulation runtime -more information deposited in other files (e.g. .mdp 207 files) is needed to determine the complete runtime of the simulation. The system size was up 208 to more than one million atoms for a simulation of the TonB protein (Virtanen et al., 2020). The 209 cumulative distribution of the number of frames showed that half of the files contain more than 210 10,000 frames. This conformational sampling can be very useful for other research fields besides 211 the MD community that study, for instance, protein flexibility or protein engineering where diverse 212 backbones can be of value. We found an .xtc file containing more than 5 million frames, where the 213 authors probe the picosecond-nanosecond dynamics of T4 lysozyme and guide the MD simulation 214 with NMR relaxation data (Kümmerer et al., 2021). Extending this analysis to all 28,559 .xtc files 215 detected would be of great interest for a more holistic view, but this would require an initial step 216 of careful checking and cleaning to be sure that these files are analyzable. (779 .xtc files selected from 28,559 .xtc files indexed) could explain this discrepancy, an alternate 237 hypothesis is that the size of an .xtc file also depends on the number of frames stored. To reduce 238 the size of .xtc files deposited in data repositories, besides removing some frames, researchers 239 might also remove parts of the system, such as water molecules. As a consequence for reusability, 240 this solvent removal could limit the number of suitable datasets available for researchers inter-241 ested in re-analysing the simulation with respect to, in this case, water diffusion. While the size of 242 systems extracted from .gro files was homogeneously spread, we observed a clear bump around  We then analyzed residues in .gro files and inferred different types of molecular systems (see was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted May 2, 2023. ; . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted May 2, 2023. ; https://doi.org/10.1101/2023.05.02.538537 doi: bioRxiv preprint tems, especially membrane proteins, were deposited. This highlights the vitality of this research 253 field, and the will of this community to share their data. We also found numerous systems contain-254 ing solvated proteins. This type of data, combined with .xtc trajectory files (see above), could be 255 invaluable to describe protein dynamics and potentially train new artificial intelligence models to 256 go beyond the current representation of the static protein structure (Lane, 2023). There was also 257 a good proportion of systems containing nucleic acids alone or in interaction with proteins (1237 258 systems). At this time, we found only few systems containing carbohydrates that also contained  Finally, we found 1,029 .gro files which did not belong to the categories previously described.

266
These files were mostly related to models of small molecules, or molecules used in organic chem-  the mindset to reach that ending time. We restricted this analysis to the 4,623 .mdp files that used 311 the md or sd integrator, and that have a simulation time above 1 ns. We found that the majority of 312 the .mdp files were used for simulations of 50 ns or less (see Fig. 4-A). Further, 697 .mdp files with 313 simulations times set-up between 50 ns and 1 µs and 585 .mdp files with simulation time above 314 1 µs were identified. As analyzing .gro files showed a good proportion of coarse-grained models 315 (Fig. 3-B,C), we discriminated simulations setups for these two types of models using the time step 316 as a simple cutoff. We considered that a time step greater than 10 fs (i.e. dt=0.01) corresponded 317 to MD setups for coarse grained models (Ingólfsson et al., 2014). Globally, we found that over all 318 simulations, the setups for atomistic simulations were largely dominant. However, for simulations 319 with a simulation time above 1 µs specifically, coarse-grain simulations represented 86 % of all. 320 We then looked into the combinations of thermostat and barostat (see Fig. 4 (Berendsen et al., 1984). In a few cases, we observed the use of the V-324 rescale thermostat with the very recently developed C-rescale barostat (Bernetti and Bussi, 2020). 325 A total of 2,021 .mdp files presented neither thermostat nor barostat, which means they would 326 not be used in production runs. This could correspond to setups used for energy minimization,  , 2015). 330 Finally, we analyzed the range of starting temperatures used to perform simulations (see Fig. 4-331 C). We found a clear peak around the temperatures 298 K -310 K which corresponds to the range 332 between ambient room (298 K -25°C) and physiological (310 K -37°C) temperatures. Nevertheless, 333 we also observed lower temperatures, which often relate to studies of specific organic systems or 334 simulations of Lennard-Jones models (Jeon et al., 2016). Interestingly, we noticed the appearance 335 of several pikes at 400 K, 600 K, and 800 K, which were not present before the end of the year 2022.  To encourage further analysis of the collected files, we shared our data collection with the com- as .gro and .mdp files. Furthermore, when available, a description of the found data is provided 346 and searchable for keywords (Fig. 5-A, on the left sidebar). The sets of data found can then be 347 exported as a tab-separated values (.tsv) file for further analysis (Fig. 5-B).

371
Without a community-approved methodology for depositing MD simulation files in data reposito-372 ries, and based on the current experience we described here, we propose a few simple guidelines 373 when sharing MD data to make them more FAIR (Findable, Accessible, Interoperable and Reusable):

374
• Avoid zip or tar archives whose content cannot be properly indexed by data repositories. As  . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted May 2, 2023. ; https://doi.org/10.1101/2023.05.02.538537 doi: bioRxiv preprint . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted May 2, 2023. ; https://doi.org/10.1101/2023.05.02.538537 doi: bioRxiv preprint submit a revised version, and providing a link to the related research paper in updated 402 metadata of the MD dataset will ease the reference to the original publication upon data 403 reuse.

404
-The code used to analyze the data, ideally deposited in the repository to guarantee avail-405 ability, or in a GitHub or GitLab repository.

406
-Any other datasets that belong to the same study.  (Commun Biol, 2023). Eventually, they could be implemented in machine 415 actionable Data Management Plan (maDMP) (Miksa et al., 2019). So far, MD metadata is formalized 416 as free text. We advocate for the creation of a standardized and controlled vocabulary to describe 417 artifacts and properties of MD simulations. Normalized metadata will, in turn, enable scientific  While indexing about 2,000 MD datasets, we found that title and description accompanying these 422 datasets were very heterogeneous in terms of quality and quantity and were difficult for machines 423 to process automatically. It was sometimes impossible to find even basic information such as the 424 identity of the molecular system simulated, the temperature or the length of the simulation. With-425 out appropriate metadata, sharing data is pointless, and its reuse is doomed to fail (Musen, 2022). 426 It is thus important to close the gap between the availability of MD data and its discoverability and 427 description through appropriate metadata. We could gradually improve the metadata by following 428 two strategies. First, since MD engines produce normalized and well-documented files, we could 429 extract parameters of the simulation by parsing specific files. We already explored this path with 430 Gromacs, by extracting the molecular size and composition from .gro files and the simulation time 431 (with some limitations), thermostat and barostat from .mdp files. We could go even further, by 432 extracting for instance Gromacs version from .log file (if provided) or by identifying the simulated 433 system from its atomic topology stored in .gro files. This strategy can in principle be applied to 434 files produced by other MD engines. A second approach that we are currently exploring uses data 435 mining and named entity recognition (NER) methods (Perera et al., 2020) to automatically identify 436 the molecular system, the temperature, and the simulation length from existing textual metadata  . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted May 2, 2023. ; https://doi.org/10.1101/2023.05.02.538537 doi: bioRxiv preprint could also provide valuable insights to understand how a given MD analysis was performed. Fi-450 nally, GitHub repositories might also be an entry point to find other datasets by linking to simu-451 lation data, such as institutional repositories (see for instance (Pesce and Lindorff-Larsen, 2023)). 452 However, one potential point of concern is that repositories like GitHub or GitLab do not make any 453 promises about long-term availability of repositories, in particular ones not under active develop-454 ment. Archiving of these repositories could be achieved in Zenodo (for data-centric repositories) 455 or Software Heritage (Di Cosmo and Zacchiroli, 2017) (for source-code-centric repositories).

456
An obvious next step is the enrichment of metadata with the hope to render open MD data 457 more findable, accessible and ultimately reusable. Possible strategies have already been detailed 458 previously in this paper. We could also go further by connecting MD data in the research ecosystem.

459
For this, two apparent resources need to be linked to MD datasets: their associated research pa-460 pers to mine more information and to establish a connection with the scientific context, and their 461 simulated biomolecular systems, which ultimately could cross-reference MD datasets to reference 462 databases such as UniProt (Consortium, 2022), the PDB (Berman et al., 2000) or Lipid Maps (Sud In this work, we showed that sharing data generated from MD simulations is now a common prac-472 tice. From Zenodo, Figshare and OSF alone, we indexed about 250,000 files from 2,000 datasets, 473 and we showed that this trend is increasing. This data brings incentive and opportunities at differ- repositories were different, all implementations were performed in dedicated Python(van Rossum, 492 actually identified as MD simulation files. This is the core of the Explore and Expand strategy ( 2 ) 498 we applied in this work and illustrated in Fig 1. 499 When a zip file was found in a dataset, its content was extracted from a preview provided by 500 Zenodo and Figshare. This preview was not provided through APIs, but as HTML code, which we 501 parsed using the Beautiful Soup library (version 4.11.2). Note that the zip file preview for Zenodo 502 was limited to the first 1,000 files. To avoid false-positive files collected from zip archives, a final 503 cleaning step was performed to remove all datasets that did not share at least one file type with 504 the file type list mentioned above. In the case of OSF, there was no preview for zip files, so their 505 content has not been retrieved.

506
Gromacs files 507 After the initial data collection, Gromacs .mdp and .gro files were downloaded with the Pooch 508 library (version 1.6.0). When a .mdp or .gro file was found to be in a zip archive, the latter was 509 downloaded and the targeted .mdp or .gro file was selectively extracted from the archive. The same  The .gro files were parsed with the MDAnalysis library (Michaud-Agrawal et al., 2011) to ex-520 tract the number of particles of the system. Values found in the residue name column were also 521 extracted and compared to a list of residues we manually associated to the following categories: 522 protein, lipid, nucleic acid, glucid and water or ions (https://github.com/MDverse/mdws/blob/main/ 523 params/residue_names.yml).

524
The .xtc files were analyzed using the gmxcheck command (https://manual.gromacs.org/current/ 525 onlinehelp/gmx-check.html) to extract the number of particles and the number of frames.

526
MDverse data explorer web app 527 The MDverse data explorer web application was built in Python with the Streamlit library. Data was 528 downloaded from Zenodo (see the Data and code availability section).

530
Molecular graphics were performed with VMD (Humphrey et al., 1996) and Chimera (Pettersen 531 et al., 2004). For all visualizations, .gro files containing molecular structure were used. In the case 532 of the two structures in Fig. 3 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted May 2, 2023.