Abstract
Applications of next-generation sequencing (NGS) technologies require availability and access to an information technology (IT) infrastructure and bioinformatics tools for large amounts of data storage and analyses. The U.S. Food and Drug Administration (FDA) anticipates that the use of NGS data to support regulatory submissions will continue to increase as the scientific and clinical communities become more familiar with the technologies and identify more ways to apply these advanced methods to support development and evaluation of new biomedical products. FDA laboratories are conducting research on different NGS platforms and developing the IT infrastructure and bioinformatics tools needed to enable regulatory evaluation of the technologies and the data sponsors will submit. A High-performance Integrated Virtual Environment, or HIVE, has been launched, and development and refinement continues as a collaborative effort between the FDA and George Washington University to provide the tools to support these needs. The use of a highly parallelized environment facilitated by use of distributed cloud storage and computation has resulted in a platform that is both rapid and responsive to changing scientific needs. The FDA plans to further develop in-house capacity in this area, while also supporting engagement by the external community, by sponsoring, in September 2014, an open public workshop to discuss NGS technologies and data format standardization and to promote the adoption of interoperability protocols.
LAY ABSTRACT: Next-generation sequencing (NGS) technologies are enabling breakthroughs in how the biomedical community is developing and evaluating medical products. One example is the potential application of this method to the detection and identification of microbial contaminants in biologic products. In order for the U.S. Food and Drug Administration (FDA) to be able to evaluate the utility of this technology, we need to have the information technology infrastructure and bioinformatics tools to be able to store and analyze large amounts of data. To address this need, we have developed the High-performance Integrated Virtual Environment, or HIVE. HIVE uses a combination of distributed cloud storage and distributed cloud computations to provide a platform that is both rapid and responsive to support the growing and increasingly diverse scientific and regulatory needs of FDA scientists in their evaluation of NGS in research and ultimately for evaluation of NGS data in regulatory submissions.
Introduction
Next-generation sequencing (“Next Gen Sequencing”, NGS) has driven a revolution in biomedical research and product development. The potential and realized applications of NGS in medical product development are beginning to be understood, and they will continue to grow. In 2013, the Center for Drug Evaluation and Research received the first two New Drug Applications (NDAs) that included NGS data as part of the formal regulatory submission (http://www.fda.gov/AdvisoryCommittees/CommitteesMeetingMaterials/Drugs/AntiviralDrugsAdvisoryCommittee/ucm368547.htm), and the Center for Devices and Radiological Health cleared the first device that relies on generation and use of NGS data for diagnosis of human disease (press release: http://www.fda.gov/newsevents/newsroom/pressannouncements/ucm375742.htm). Additionally, NGS data have been submitted in support of recommended virus detection methods for evaluation of novel vaccine cell substrates. Furthermore, NGS is being evaluated by scientists within the U.S. Food and Drug Administration (FDA), in academia, and in biological and biotechnological industries for its applicability to adventitious agent testing of cell substrates and other biologic product manufacturing intermediates, as well as for safety evaluation of human tissues and blood, or analysis of genetic stability of vaccine strains during manufacture or for lot release. These are just a few examples related to medical product development, and the list is likely to continue to grow exponentially as familiarity increases and costs of using this technology continue to decrease.
In January 2013, under the auspices of the FDA's Office of the Chief Scientist, the FDA convened the first meeting of the FDA Genomics Working Group (GWG). This cross-center working group was formed in response to the challenges that NGS poses, and in recognition of the many ongoing efforts across the FDA to address the bioinformatics and information technology (IT) challenges that NGS presents in a regulatory setting. The FDA recognized that a coordinated approach was needed to develop the complex IT and bioinformatics solutions required to be prepared for receiving and reviewing regulatory applications of NGS, while also recognizing that the significant monetary and human investments would need to be made in a manner that optimizes efficient and effective use of FDA appropriated funds.
The scope of the FDA GWG is to prepare the FDA to address IT and scientific challenges to facilitate FDA readiness for NGS data, including, but not limited to, the following:
Determine how to develop processes to store, transfer, and perform efficient computation on large and complex NGS data sets
Identify whether tailor-made bioinformatics analytics need to be developed to support FDA evaluation of NGS; and if so, advise how to develop the informatics expertise and resources to meet this need effectively and with efficient use of funding
Develop criteria and approaches for evaluating data quality in the context of using these data and the interpretations of these data to support regulatory decision-making
Within this scope, the goals of the FDA GWG include ensuring that the required resources and expertise are identified, implemented, and developed in a coordinated manner, both internally, and where feasible, with other government agencies; and to work collaboratively with relevant external partners, such as the National Institutes of Health (NIH), the Centers for Disease Control and Prevention (CDC), the National Institute of Standards and Technology (NIST), academia, and industry to develop the IT infrastructure, data standards, and analytic approaches that are required to support use and applications of NGS in order to solve relevant scientific questions and support regulatory decision-making.
Within the Center for Biologics Evaluation and Research (CBER), additional activities are ongoing to support development and use of NGS data for biologics products. For example, CBER scientists have been performing studies in-house to evaluate the feasibility of applying NGS tools to detect adventitious virus in biologic products and intermediates used in manufacture (e.g., master cell banks, etc.). To support this work, CBER has been actively engaged with scientists in industry, academia, and other government agencies through a PDA-coordinated Advanced Virus Detection Technologies Users Group since October 2012. In addition, FDA scientists are using NGS to evaluate stability of live virus vaccines and stem cell–derived products, as well as in genome-wide association studies to identify the genetic basis of rare vaccine-related adverse events.
High-Performance Integrated Virtual Environment (HIVE)
One tool that CBER has developed for intramural research use is based on a collaborative project between CBER and George Washington University. The High-performance Integrated Virtual Environment, or HIVE, provides the IT infrastructure and bioinformatics capability to support data acquisition, transfer, analytics, and secure data storage needs of computing-intensive technologies, such as NGS. HIVE uses distributed cloud-based storage and computational capacity, accessed through a web-based interface, to support data storage, retrieval, and analysis, as well as genomics data transfer between the wet lab instrument generating the NGS data and the FDA high-performance computing environment. By using a web-based interface to support these functionalities, and having in-house programmers to develop analytic algorithms and software to support specific biological questions, we have created a platform that is “user-friendly” to scientists who want to apply these tools but may not have the bioinformatics expertise to do so themselves. The following sections provide more details about HIVE capabilities and structure.
Data Loading
Data may be submitted directly from sources including scientific instrumentation, local disks, open or public databases, or from secure remote locations through web and file transfer protocols. A range of public databases are available for searching and uploading data, including the following: ncbi.nlm.nih.gov, uniprot.org, pir.georgetown.edu, rcsb.org, lanl.gov, ebi.ac.uk, and ddbj.nig.ac.jp. As data are received from these various sources, they are validated and distributed by the cloud control server.
Metadata associated with the data is also harvested automatically at the time of data submission/uploading. Users additionally have the option to provide metadata manually when initiating the session through the web browser interface.
Data formats are recognized and converted to internal data standards for optimal efficiency within the system. Data is compressed, encrypted, indexed, and archived in the distributed storage cloud. Metadata is likewise archived in a metadata database.
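HIVE's internal formats and ingestion code are not public, so the following Python sketch only illustrates the general pattern described above (format recognition, compression, content-hash indexing, and metadata capture); all names and the sharding scheme are invented for this example:

```python
import gzip
import hashlib
import json

def detect_format(first_line: str) -> str:
    """Crude sequence-file format recognition from the first line."""
    if first_line.startswith("@"):
        return "fastq"
    if first_line.startswith(">"):
        return "fasta"
    return "unknown"

def ingest(path: str, text: str) -> dict:
    """Compress the payload, index it by content hash, and return a
    metadata record of the kind an ingestion step might archive."""
    fmt = detect_format(text.splitlines()[0])
    blob = gzip.compress(text.encode("utf-8"))
    digest = hashlib.sha256(blob).hexdigest()
    archive_path = f"{digest[:2]}/{digest}.gz"  # shard by hash prefix
    return {
        "source": path,
        "format": fmt,
        "bytes_raw": len(text),
        "bytes_stored": len(blob),
        "sha256": digest,
        "archive_path": archive_path,
    }

record = ingest("sample.fastq", "@read1\nACGT\n+\nIIII\n")
print(json.dumps(record, indent=2))
```

In a real pipeline the record would go to the metadata database while the compressed blob is written to the distributed storage cloud; indexing by content hash also makes duplicate uploads easy to detect.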
Data Storage
Conventional data storage systems are based on relational databases with tens of tables, typically requiring 3–4 months to develop and implement, along with separate search engines, security models, and interface designs that require support by database administration specialists. In contrast, HIVE data storage is based on a “HIVE-honeycomb” model, in which data storage systems are constructed with only a few (4–5) tables, so implementation of a new database requires only 1–2 days. Because a single search engine, a unified security model, and one interface design serve all the databases, the entire system can be maintained by 1–2 support specialists. This more integrated model of database development makes HIVE a more flexible tool for responding to new scientific needs.
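The actual HIVE-honeycomb schema is not published, but the general idea of a small, generic schema can be sketched with a minimal object/property store: a new kind of "database" is just new rows, not new tables, which is why setup takes days rather than months. Table and column names here are invented for illustration (two tables suffice for the sketch; the honeycomb model uses a few more):

```python
import sqlite3

# Generic object/property store: any new data type is just new rows,
# so "creating a database" needs no schema change.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE object (id INTEGER PRIMARY KEY, type TEXT NOT NULL);
CREATE TABLE property (
    object_id INTEGER REFERENCES object(id),
    name TEXT NOT NULL,
    value TEXT
);
CREATE INDEX idx_prop ON property(name, value);
""")

def create_object(obj_type, **props):
    """Register a new typed object with arbitrary properties."""
    cur = conn.execute("INSERT INTO object(type) VALUES (?)", (obj_type,))
    oid = cur.lastrowid
    conn.executemany(
        "INSERT INTO property(object_id, name, value) VALUES (?, ?, ?)",
        [(oid, k, v) for k, v in props.items()])
    return oid

# Two unrelated "databases" share the same two tables.
create_object("sequence_run", platform="Illumina", reads="1000000")
create_object("reference_genome", organism="Homo sapiens", build="GRCh38")

rows = conn.execute(
    "SELECT o.id FROM object o JOIN property p ON p.object_id = o.id "
    "WHERE p.name = 'platform' AND p.value = 'Illumina'").fetchall()
print(rows)
```

A single generic query path like the final SELECT is what allows one search engine and one security model to cover every database in the system.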
Data Security
HIVE has been built to meet tailor-made security needs: through the web browser interface, users can set the access privileges to their data, indicating whether other individuals have read permission, write permission, or no permission. By implementing hierarchically inherited “up” or “down” permissions, HIVE allows definition of complex and granular security rules without performance degradation.
To ensure data security, HIVE has been built with the ability to monitor and audit user access to ensure that his or her actions on the data are consistent with the corresponding permissions (e.g., only permitted to read or write, etc.). Importantly, HIVE has also been designed with hardware segregation, so that public data and secure data are stored in distinct areas of the distributed cloud storage. This prepares the way for using HIVE for analysis of data submitted in regulatory files.
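As a rough illustration of hierarchically inherited permissions, the toy sketch below resolves a user's permission on a node by walking up the tree until an explicit rule is found ("down" inheritance); the real HIVE model also supports "up" rules and auditing, and all paths and names here are invented:

```python
# Each node maps to its parent; the root's parent is None.
TREE = {
    "/projects": None,
    "/projects/flu": "/projects",
    "/projects/flu/run1": "/projects/flu",
}
# Explicit rules: (path, user) -> permission.
ACL = {
    ("/projects", "alice"): "read",
    ("/projects/flu", "alice"): "write",
}

def permission(path, user):
    """Walk toward the root until an explicit rule is found;
    absent any rule, deny access."""
    while path is not None:
        if (path, user) in ACL:
            return ACL[(path, user)]
        path = TREE.get(path)
    return "none"

print(permission("/projects/flu/run1", "alice"))  # inherits "write"
print(permission("/projects", "bob"))             # no rule: "none"
```

Because only explicit rules are stored and lookups are a short walk up the tree, granular policies stay cheap even over large object hierarchies, which is the performance property the text describes.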
Data Computation
Using the web browser interface, users can configure the analytical tools by selecting the pre-loaded data, choosing the desired analytic algorithm, and specifying parameter values. The request is submitted through the web portal and promptly parallelized. Parallel “chunks” are executed in the distributed computational cloud, retrieving the needed data from the distributed cloud storage. Computations are monitored and parallel outputs are collated by the cloud control server. When results are complete, a summary and visualizations are sent back to the web browser, and a copy of the results is returned to the distributed storage cloud for temporary archiving.
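The split/execute/collate pattern just described can be sketched in a few lines of Python. Here threads stand in for distributed cloud workers, and every function name is illustrative rather than HIVE's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def split(reads, n_chunks):
    """Cut the input into roughly equal chunks for parallel work."""
    size = max(1, len(reads) // n_chunks)
    return [reads[i:i + size] for i in range(0, len(reads), size)]

def align_chunk(chunk):
    """Stand-in for an aligner: count reads containing 'ACGT'."""
    return sum(1 for read in chunk if "ACGT" in read)

def run_parallel(reads, n_workers=4):
    """Parallelize over chunks, then collate the partial outputs."""
    chunks = split(reads, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(align_chunk, chunks))
    return sum(partial)  # collation step

reads = ["ACGTACGT", "TTTT", "GGACGTA", "CCCC"] * 10
hits = run_parallel(reads)
print(hits)  # 20
```

In HIVE the chunks run on separate cloud nodes and the cloud control server performs the monitoring and collation; the structure of the computation, however, is the same map-then-combine pattern.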
Depending on the required analysis, multiple algorithms can be joined in a workflow; that is, the results of one analysis can feed into the subsequent analysis. Initiation of this process is identical to the primary analysis, except that the input is the prior result rather than primary data.
By combining the distributed storage cloud and the distributed computational cloud over high-performance InfiniBand networking hardware, HIVE provides a high-throughput data exchange highway.
Data Visualization
The HIVE scientific visualization library follows the Data-Driven Documents paradigm [http://vis.stanford.edu/papers/d3] and is developed using JavaScript/HTML5 (Scalable Vector Graphics), which is supported by most browsers; hence there is no need to install or maintain additional client software to use HIVE. The toolkit has a wide variety of graphical visualization utilities linked seamlessly to parallelized backend processes, efficiently providing content for complex, data-intensive diagrams as needed. From the viewpoint of research scientists, HIVE provides the capability to construct workflows from existing tools, with an easy launch point from the menu of available applications, graphical interfaces for analysis, and/or export functionalities in multiple formats that can be used to transfer results to other user tools for visualization and additional analysis.
Workflow
HIVE applies the principles of software reuse to allow rapid development of analytic modules. The modular infrastructure allows for flexible and rapid development of new analytic approaches to answer a variety of biological questions.
Current implementation of HIVE has many integrated and adopted tools for large-scale data analytics. Examples of such tools include but are not limited to:
Short-read DNA-seq alignment arsenal: native HIVE-Hexagon aligner, BLAST, BWA, Bowtie
RNA-seq and expression analysis: native Hexagon aligners, TopHat, Ace-Magic
Multiple sequence aligners: MAFFT, Clustal
De novo assemblers: Velvet, ABySS, Oases, HIVE native contig assembler
Variation calling tools arsenal: HIVE-Heptagon native variant caller, SAMtools, Cuffdiff, Cufflinks, Cuffmerge
Concurrent data retrieval modules from external sources
Sequence manipulation arsenal: quality control and validator tools, primer cutters, sequence trimmers, adapter filters, complexity and quality filtration engine, random read simulators, reference set assembly
Comparative genomics: HIVE-Octagon reference clusterization tools, discriminant classification analysis methods, multiple phylogenetic linkage methods of hierarchical clustering
Mass spectrometry: multiple MS data processing utilities and molecular library validation and peak detection tools
Table Query Analyzer (an Excel-like utility that works with hundreds of millions of rows)
Taxonomic and metagenomic identification tools: recombinant analysis, gap discovery tools, coding region discovery tool
The variety of tools available, and their inter-compatibility, allows the user to pick from a menu and use a “plug-and-play” approach to easily perform any number of analyses to support evaluation of the data.
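The "plug-and-play" idea amounts to giving every tool a common interface (data in, data out) so that tools chosen from the menu can be chained into an ad hoc pipeline. The sketch below illustrates this with a few invented stand-in tools; none of these functions are HIVE's real modules:

```python
# Each "tool" takes a list of reads and returns a list of reads,
# so any sequence of tools composes into a pipeline.
def quality_filter(reads, min_len=4):
    """Drop reads shorter than min_len."""
    return [r for r in reads if len(r) >= min_len]

def trim_adapter(reads, adapter="AAA"):
    """Remove a trailing adapter sequence, if present."""
    return [r.removesuffix(adapter) for r in reads]

def dedupe(reads):
    """Collapse duplicate reads (sorted for a stable result)."""
    return sorted(set(reads))

def run_pipeline(data, steps):
    """Feed the output of each step into the next."""
    for step in steps:
        data = step(data)
    return data

result = run_pipeline(
    ["ACGTAAA", "ACGTAAA", "CG", "TTGCA"],
    [quality_filter, trim_adapter, dedupe])
print(result)  # ['ACGT', 'TTGCA']
```

Because every step honors the same contract, reordering or swapping tools requires no glue code, which is what makes a menu-driven, plug-and-play analysis practical.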
Performance
The use of novel software architecture in conjunction with InfiniBand technology, distributed cloud storage, and the cloud control server provides an environment supporting real-time data processing that outperforms other data analytic platforms available in the public domain. HIVE's integrated platform and concurrent data-loading and processing pipelines allow for dramatic increases in data transfer and analysis speed. Comparison of a state-of-the-art multicore computer with HIVE has demonstrated that the multicore computer takes approximately 2–10 days to run a single human genome mapping experiment, whereas HIVE performs these analyses in approximately 1.5–3 h. Viral and bacterial sample classification and detection using conventional technology can take from half a day up to 2 days, whereas similar analyses on HIVE may take 1–2 min for viruses and 5–15 min for bacteria.
These more rapid tools are critical for the FDA as we use them to support evaluation of regulatory submissions, respond to food-borne outbreaks, and perform other activities central to our regulatory and public health mission.
Next Steps
At the FDA, we anticipate that the application of NGS technology will continue to grow exponentially. The use of NGS is expected to facilitate direct development of new biomedical products (i.e., for diagnostics) as well as support development of new biomedical products (e.g., detection of adventitious agents in cell substrates for vaccine manufacture). In order to understand and evaluate the potential of NGS technology and the requisite bioinformatics approaches required to interpret these data, the FDA will continue to use a combination of intramural research combined with outreach and interfacing with the external stakeholder community of scientists, clinicians, software/hardware developers, and NGS technology developers. To facilitate external communication and information exchange, the FDA sponsored an open, public workshop on NGS technology, data format standardization, and promotion of interoperability protocols on September 24–25, 2014 (http://www.fda.gov/scienceresearch/specialtopics/regulatoryscience/ucm389561.htm). One goal was to review a proposal for data standards for NGS, but the forum also provided an opportunity to hear from the community on a range of other issues, so that the FDA can fully understand how NGS is going to be applied in the future and how we can best be prepared to address the challenges of using these data in a regulatory environment today and going forward.
Footnotes
The authors declare that they have no competing interests.
CONFERENCE PROCEEDING: Proceedings of the PDA/FDA Advanced Technologies for Virus Detection in the Evaluation of Biologicals Conference: Applications and Challenges Workshop in Bethesda, MD, USA; November 13-14, 2013
Guest Editors: Arifa S. Khan (Rockville, MD), Dominick Vacante (Malvern, PA)
- © PDA, Inc. 2014