Genomics in healthcare: GA4GH looks to 2022

The Global Alliance for Genomics and Health (GA4GH), the standards-setting body in genomics for healthcare, aims to accelerate biomedical advancement globally. We describe the differences between healthcare- and research-driven genomics, discuss the implications of global, population-scale collections of human data for research, and outline mission-critical considerations in ethics, regulation, technology, data protection, and society. We present a crude model for estimating the rate of healthcare-funded genomes worldwide that accounts for the preparedness of each country for genomics, and infers a progression of cancer-related sequencing over time. We estimate that over 60 million patients will have their genome sequenced in a healthcare context by 2025. This represents a large technical challenge for healthcare systems, and a huge opportunity for research. We identify eight major practical, principled arguments to support the position that virtual cohorts of 100 million people or more would have tangible research benefits.

The Global Alliance for Genomics and Health (GA4GH), the standards-setting body in genomics for healthcare, aims to accelerate biomedical advancement globally. We describe the differences between healthcare-and research-driven genomics, discuss the implications of global, population-scale collections of human data for research, and outline missioncritical considerations in ethics, regulation, technology, data protection, and society. We present a crude model for estimating the rate of healthcare-funded genomes worldwide that accounts for the preparedness of each country for genomics, and infers a progression of cancer-related sequencing over time. We estimate that over 60 million patients will have their genome sequenced in a healthcare context by 2025. This represents a large technical challenge for healthcare systems, and a huge opportunity for research. We identify eight major practical, principled arguments to support the position that virtual cohorts of 100 million people or more would have tangible research benefits.
Human genomics 1 is undergoing a shift from a predominantly research-driven activity to one that is driven-and funded-by health care. Human datasets collected by healthcare providers and made available to researchers now offer unprecedented opportunities for rapid advancement of biological research. Humans have always been a focus for research, both in the clinic and in probing fundamental life processes, but the scale and scope of these efforts have been relatively modest. Healthcare funding for genomic sequencing of humans on a population scale is providing remarkable, transformative opportunities for both clinical practise and basic research. This is the premise of the Global Alliance for Genomics and Health (GA4GH), which seeks to facilitate the coalescence of clinical practise and biomedical research for mutual benefit to clinicians, researchers, and human health. It is an exciting time for genomics researchers because results from our work can now have a direct impact on healthcare-a major goal for many researchers. This change is driven by the growing effectiveness of genomics in practising medicine and by the increasing affordability of genomics technology. Rolling out genomics in healthcare will open up opportunities to study the genetic and molecular components of health and disease on an unprecedented scale. But the challenge of transferring expertise effectively from the research domain to healthcare is massive, rivalled only by that of establishing data-access mechanisms that are both appropriate to research applications and respectful of the rights of the individuals to whom the data pertain. We believe these challenges can be met, but only if the genomics community is committed to broad-based advocacy and coordinated efforts worldwide.
This is not a new position. The community has anticipated this shift since the Human Genome Project began in the late 1980s, and the trends we summarise here have been the subject of keen observation for many years. The current transition is a significant moment for genomics and healthcare; but there have been step changes along the way, and many more are still to come. Here, we simply pause to reflect on the landscape of genomics-to consider the future paths before us and the role GA4GH can play in accelerating biomedical advancement.
GA4GH is a worldwide alliance of genomics and clinical researchers, data scientists, healthcare practitioners, and many others working together to establish frameworks and standards for responsible, international genomic and health-related data sharing. Our ultimate goal is to leverage the detailed knowledge of the genomics research community to advance healthcare and, conversely, to enable the enrichment of biomedical research through the responsible sharing and use of clinical genomic data in research. It is clear that without such a consortium, the uptake of genomics into clinical practise will be slower, more expensive and riskier, and will differ country by country with little harmonisation. This would reduce the benefit to patients worldwide substantially, and increase costs to healthcare systems. In such a landscape, we would not be able to tap into the opportunities presented by millions of human genomic sequence and phenotypic data for basic research.
Through this lens, we describe the differences between healthcare-and research-driven genomics, discuss the implications of global, population-scale collections of human data for research, and outline mission-critical considerations in ethics, regulation, technology, data protection, and society. Healthcare has an entirely different financial, legal, and social landscape (see Table 1). It is the largest component of the economies of the G20 countries, and one of the most regulated industries. Its structure and regulation are very diverse from country to country, covering the full spectrum from state-run to private provision. In each system, the cost of an assay in healthcare-genomics included-is always considered in light of its benefits to the health of an individual. In theory, if a genome assay becomes cost-effective for a specific application within a healthcare system, the only limit to its deployment is the number of patients who will potentially benefit via that application. In practise, there are many logistical and regulatory hurdles to overcome before an assay can be incorporated into a healthcare system's regular offerings.
The current case for implementing genomics in healthcare can be presented in at least four broad categories: infectious disease, rare disease, cancer, and common or chronic disease. These categories map to clear clinical use cases, though there is-as always in human biology-huge overlap between them. For example, there is a continuum between rare and common disease, and rare disease and severe cancer susceptibility can both be diagnosed and managed following similar clinical genetic pathways. Infectious disease contributes to both cancer and chronic disease, often as a key trigger.
Despite the inevitable crossover, these four categories map well to the funding and management approaches of many different healthcare systems. In the following sections, we outline the case for genomics in each category.
Genomics can be used to identify the infectious agent of disease with more confidence and precision than ever before, and at increasing speed (1,2). The main challenges to deployment in healthcare are managing cost and logistics, achieving precise phenotypic prediction (e.g. antibiotic resistance), and aligning historical records of non-genomic-based assays with contemporary genomics tests. Rare diseases (i.e. those with a person frequency of 1 in 2,000 or lower (7)) often have a clear genetic component, often of high penetrance with only a few genes involved in each disease. Clinical geneticists have used single-gene tests since the early 1990s to support diagnosis and some treatment decisions for many of these diseases.
Genomics provides a confident diagnosis for many rare diseases, which enables families and healthcare systems to manage the disease appropriately and brings patient retesting (a.k.a. their "diagnostic odyssey") to a close. Genomic diagnosis of a child can inform parents about the odds of the disease affecting a future child (i.e. distinguishing de novo mutations from recessive alleles), empowering them to make informed family-planning decisions. In exceptional cases, a successful diagnosis can lead to profound changes in a patient's medical treatment, prognosis, and quality of life. The cost of assaying broader genomic regions, including whole-genome sequencing, has fallen considerably, which has had a substantial impact on rare-disease diagnosis and research (8,9). Arguably, the rare-disease space has seen the most successful deployment of genomics in healthcare, with multiple systems reporting diagnostic rates of between 20% and 30%, and specific health economics papers providing multiple lines of evidence of costeffectiveness (10). C a n c e r One of the consistent hallmarks of cancer is altered somatic genomes, often with specific mutations that 'drive' the cancer. Characterising a cancer by sequencing a tumour's genome alongside the patient's unaffected genome has returned profound insights for cancer research.
Applying cancer genomics in the clinic is more complicated. The time window for care decisions is measured in weeks rather than months, and gaining genomic information on such a short timescale is logistically challenging, to say the least. In addition, while straightforward diagnosis has immediate benefits for rare-disease patients, cancer genomic information is only considered useful if it changes treatment options. For this to occur, carefully planned clinical trials are necessary to chart change of treatment impact on the basis of genomics.
The heterogeneity of cancer as a disease-of each individual cancer and of any subsequent metastasis-adds many layers of complexity to genomic analysis. However, oncologists are increasingly confident that genomic information will be useful in cancer care decisionmaking. Indeed, it is being deployed effectively in a growing number of cases, and we are confident that the roll-out of genomics into practicing cancer care will increase steadily in the coming years. 'Common disease' is a catchall phrase describing a vast spectrum of diseases that have distinct 'environmental' and 'genetic' components. This area of genomics is firmly rooted in research, and is the healthcare category where genomic information is most removed from practise, even though there is considerable effort on the research use of genomics.
The focus of genomic research in common disease is large-scale cohorts, for example biobanks in the UK (UK BioBank, Generation Scotland), China (Kadoorie), and the US (the nascent All of Us Research Program); megabanks such as Tohoku in Japan; and the wholepopulation cohorts in Iceland, Estonia, Finland (the Sisu project), and Denmark. Genotyping has become a regular feature of large-scale cohorts, and whole-genome sequencing is bound to become routine here as well.
Currently, the value of genomics for common diseases such as cardiovascular disease and diabetes is mainly in research. However, some clinical practise work is being explored, for example adapting screening protocols (11 Infectious-disease genomics currently focuses on sequencing infectious agents, and as such contributes huge amounts of genomic information on bacteria and viruses, but little on humans. Rare diseases affect up to 1 in 7 (14%) people in G20 countries. Restricting this figure to the numbers projected in countries with active genomic rare disease diagnosis processes, the percentage of people affected is closer to 1% to 2% of live births (depending on the healthcare system). Both parents of the patient are sequenced whenever possible, so in countries where rare-disease genomics is deployed in healthcare, between 2% and 5% of the population could be participating actively in this category of genomic testing. Given the robust benefit to healthcare, logistical simplicity, and cost-effectiveness compared to current approaches, at the minimum prevalence of 1% of the population, 20 million people could well be sequenced for rare-disease diagnosis by 2025 (see Table 2).
Half the population of G20 countries will, at some point in their lives, be diagnosed with a cancer event. Depending on the utility of genomics for informing the treatment of different cancer types, healthcare systems could be deploying cancer genomics for over 40 million people in the developed world by 2025 (see Table 3).
Common diseases affect nearly everyone. When genomic information can be used clinically for broad, common diseases, it will be feasible to justify sequencing entire populations. Whole-population-scale sequencing is feasible currently for smaller countries (and is largely in place already in some countries, e.g. Iceland), and is likely to be in place at some point in the next two decades.
We developed a crude model for estimating the rate of healthcare-funded genomes worldwide. This accounts for the preparedness of each country for genomics, and infers a progression of cancer-related sequencing over time (conservative estimate: genomics is used in 70% of cancer diagnoses by 2027). Figure 1 shows the growth in the number of genomes based on this model, with an expected number of 47.5 million genomes sequenced for rare disease diagnoses (patients + family members) and 83 million (cancer + matched normal) genome sequences for cancer by 2025.
This model is crude in its treatment of the complex, policy heavy area of genomic medicine, and undoubtedly some countries will have a faster or slower uptake than we predict. Furthermore, if the cost of sequencing falls radically, full population-scale sequencing may be deployed quite quickly, and so complete population sequencing could occur in this time frame.
Whatever the details of the country-by-country roll out, we are confident that over 60 million patients will have their genome sequenced in a healthcare context by 2025. This represents a large technical challenge for healthcare systems, and a huge opportunity for research.
The availability of large cohorts that represent individuals from different populations throughout the world would be transformative for human research. The advantages of large cohorts are considerable, especially when the costs of raw sequencing and first-line analysis will be borne primarily by healthcare systems all over the world. This could be seen as having spill-over benefits for research, which generally functions within a narrow economy. The use of healthcare-driven genomic data being also repositioned for research is part of the broad trend of breaking down barriers between research and healthcare, providing for a more permeable, 'learning' healthcare system with deep, sustained ties to research.
The research benefits that accrue from assembling cohorts of over 1 million people (currently the largest cohorts available) are an open question. Cohorts of 0.5 to 1 million people might be adequate for research, and tracking more than that may not be necessary. However, experience has shown us that larger sample sizes allow for more fine-grained analysis (12) and more robust findings in genetics, regardless of the system used to collect them.
We have identified eight major practical, principled arguments to support our position that virtual cohorts of 100 million people or more would have tangible research benefits.
Many rare diseases occur at person frequencies of 1 every 10,000 to 1 every 1,000,000 live births. To associate a particular mutation to a particular phenotype robustly, at least two observations are required, often more. This argues for pooling rare disease diagnoses in populations of 10 million to 500 million, where all the suspected rare disease individuals (between 1% to 2%) and their parents (where available) are sequenced.
The Matchmaker Exchange (MME) project (13) is building a platform for matching patients with similar phenotypic and genotypic profiles through standardized application programming interfaces (APIs) and procedural conventions. Its participating genomic groups are located in multiple countries, and pool possible diagnoses with the goal of finding 'a second match'. MME has produced nearly 30 matches as of October 2017, predominantly matching patients who live in different countries (i.e. not enrolled in the same healthcare system).
The penetrance of potential disease-causing alleles is best assessed using broadly ascertained (e.g. population-scale) cohorts. Genomics of rare disease (patients and their parents) and the 'healthy' genomes of cancer patients can be deployed effectively to understand the frequency of putative alleles, irrespective of the specific disease of interest.  heterogeneity more feasible, as they allow robust statistical analysis and rigorous testing of the resulting models in data not used to train models.

5
. D  i  s  c  o  v  e  r  i  n  g  a  n  d  c  h  a  r  a  c  t  e  r  i  s  i  n  g  e  p  i  s  t  a  s  i  s  i  s  m  o  r  e  f  e  a  s  i  b  l  e  w  i  t  h  l  a  r  g  e  s  a  m  p  l  e  s  i  z  e  s Epistasis (i.e. non-linear interactions between alleles) is commonplace in model-organism genetics. It is difficult to utilise in human genetics, principally because interesting alleles are low frequency and the interaction occurs as the product of these frequencies; however, onemillion-person or more cohorts overcomes these frequency concerns and provides a starting point for probing these interactions. Each order of magnitude higher will improve discoverability of rarer epistatic events. Such discoveries need not account for wholepopulation-level variance to be interesting; they are inherently useful for understanding the molecular biology of a disease or trait of interest. When we reach the '10 million to 100 million genomes sequenced' milestone, we will be in the final approach to the 1.1e-8 mutation rate for each base in the human genome. Even at the current cohort size of around 100,000 in the ExAC and gnomAD databases, we can derive a reasonably robust model for mutation rates and gene-scale selection. With higher sample numbers, these models can become more sophisticated in terms of the heterogeneity in mutation rates they model, and narrower in terms of the genomic region they probe-ultimately to single-base-pair resolution. Germline genetic information has two unique properties that make it invaluable for observational (epidemiological) studies: stability of germline variation and randomisation of allele distribution in a population. Germline variation rarely changes over the lifetime of an individual. If it does, it is readily detected in the form of mosaicism or cancer. Variants are randomly assigned to chromosomes during meiosis, and this randomisation is present in the resulting distribution of alleles inside a population.
Genetic variation can be used as a subtle 'natural experiment' to clarify the relationships between variable measures (e.g. weight gain and Type II diabetes, or alcohol usage during pregnancy and outcome of offspring). The genetic variables act as unbiased statistical instruments to provide insight on causality. These techniques work with very-low-impact variants, but require large sample sizes (millions of individuals) to work well. The precise allele frequencies present in the germline vary between populations, mainly due to population drift accentuated by bottlenecks. Different physical and social environments influence environmental exposures and developmental trajectories, and different healthcare systems vary in their ability to measure specific diseases and traits.
At scales below 10,000 individuals, these differences hinder joint use of datasets and force researchers to focus only on the larger subsets present in their data. At larger scales, this diversity can be leveraged to find more conclusive evidence behind specific alleles and environmental differences. It also opens the door to serendipitous discoveries based on traits that are measured well by one system and not another. Ethical consideration of patients and populations together with responsible regulation are critical to biomedical research in all its forms. This is particularly true for healthcare-funded genomics, which involves deeply complex national regulation and legislation. Keeping the UN Universal Declaration of Human Rights at the heart of these conversations will serve to activate the right 'to share in scientific advancement and its benefits' (14), firmly rooted in concerns shared by people of all nations. It also ensures a common human-rights approach to addressing the benefits and potential risks in a balanced manner.
Each society has its own, unique perspective on the sharing of personal information, with more open or restrictive regulatory norms and systems on data collection, access, and sharing. The commonality is that all of these systems have some mechanism by which researchers are able to access both research and clinical data, specifically to use it in ways that enable patients and research participants to share in scientific advancement and its benefits.
Population-scale sequencing schemes-wherein regulated healthcare providers share clinical genomic data for research-are unlikely to allow large-scale aggregation of data to migrate beyond national boundaries, but it is feasible to imagine that federated analysis without data movement is possible. Accountability principles (15) and sanctions for misuse are being developed in order to respect and maintain the trust of participants.
Based on the universal principles of benefitting from scientific research and respecting national regulatory architecture, we believe that most healthcare systems can ultimately participate in responsible, worldwide data sharing while remaining compliant with applicable jurisdictional law and institutional policies. To realise the goal of deploying single, agreed methods in multiple locations offering harmonised datasets, each location must present genomic and associated phenotypic information in a consistent, standard manner. These standards should, wherever possible, be based on a service architecture in which information is retrieved using web-based calls (e.g. REST protocol). This is the de facto standard for large-scale data delivery in science and technology.
Currently, genomics data analysis is tied to bespoke, file-based systems, with a mixture of domain-specific standard file types tied together in institutional (and sometimes individual researcher) configurations. This scheme is incompatible with a global, federated architecture. A major goal of GA4GH is to provide diverse, service-based standards to enable a global federation of identity and data, within a robust, security-assured cloud environment that enforces jurisdictional constraints and local service agreements.
Researcher access to genomic data refers specifically to access via analytic machinery created for research users. Over the past decade there has been a significant shift towards virtualisation in computing. This involves packaging and distributing the computing components (e.g. processors, storage units, platforms) or entire computing infrastructures and allowing them to run in an environment that simulates a specific computational architecture as if it were physically co-located. The value of virtualisation is that it enables a third-party service provider ("cloud" provider) to provision computing services and storage capacity "on demand" as they are needed. "Cloud" services are particularly well suited to computing problems involving large volumes of storage and complex computations (e.g. genomic research). Virtualisation is now deployed on a large scale in cloud environments, enabling researchers to use virtual software tools to analyse data distributed across multiple physical data stores, rather than downloading large amounts of data for use with local software tools. This pattern of moving the analysis to the data rather than data to the analysis allows for more appropriate scaling and control of access via national schemes, while also enabling consistency of analysis by a researcher "visiting" each of these schemes.
Another technology trend that holds great potential for genomics research is federationboth identity federation and data federation. Identity federation enables multiple organisations to rely on each other or a third party to authenticate researchers' identities and to share security attributes. This enables each data holder to enforce its own security policies, using attributes passed along with the authenticated identity. Data federation enables genomic analysis to be performed across multiple data stores.
S e c u r i t y c h a l l e n g e s GA4GH, as an international consortium federating large volumes of sensitive clinical and genomic data across virtual computing environments, is presented with formidable challenges in assuring data confidentiality, data integrity, service availability, and individual privacy. Some of these challenges call for innovative application of well-established security standards, frameworks, and protocols, such as identity federation on a global scale. Other challenges require solutions still emerging from security research, such as privacypreserving data linkage and homomorphic encryption. GA4GH seeks to apply current and emerging technology solutions, standards, and best practices to help protect the confidentiality, integrity, and availability of sensitive, high-value data. Healthcare data are a leading target for cyber-attackers; as such, GA4GH and its partners must implement a layered and proactive scheme to identify potential threats and vulnerabilities, continuously monitor the use of data and services, detect potential attacks, and collectively respond to potential breaches. Risk management is central to GA4GH's standards-development process, while seeking to leverage industry standards and best practices wherever possible. Ethical, regulatory, and technical challenges all feed into the wider societal challenges of bringing genomics into healthcare, which we must meet as a single community. The three strands of this challenge are: maintaining public trust in healthcare systems, overcoming differences in objectives and methods between research and healthcare, and breaking down unproductive divides between disciplines. We envision clinical and basic researchers collaborating seamlessly in the context of practising healthcare. The virtuous cycle of research and healthcare will be celebrated as an example of harmonised human endeavour. The track record of data science makes these ambitions realistic. Genomic datasets from healthcare will be among the largest generated over the next decade, and to make the most of them we need to harness the best of computational biology: both academic (i.e. electrical engineering, computer science, statistics, physics) and commercial (i.e. technology, pharmaceutical, biotech, health informatics).
Communication and respect for all players will maintain a steady focus on shared objectives and outcomes. A sceptic might wonder whether this is really needed. Perhaps large healthcare systems have enough internal drivers to ensure good delivery of genomic medicine without any global coordination, and smaller healthcare systems will naturally align themselves to their nearest scheme (culturally or physically). Perhaps researchers are better off negotiating individually, system by system, hospital by hospital, thus allowing more local innovation. Do we really need to coordinate worldwide to deliver dividends on genomics?
We have already established that we need to coordinate between large healthcare systems in rare-disease diagnosis and discovery. As a practical example, Matchmaker Exchange illustrates the power of bringing practicing clinicians and researchers together. This is also mirrored in the Cancer field where large international projects have switched to using more routine healthcare generated data sources.
Furthermore, as nascent genomic medicine schemes are being delivered in a variety of countries, it is clear that the federated approach enabled by GA4GH is the only scheme that can satisfy both research and healthcare goals. In addition, many commercial and public organizations want to minimize the costs and risks of the complex technical software needed to either contribute to genomic medicine or deliver genomic tools. A complex, multistakeholder ecosystem requires neutral and technically competent standards.
Nevertheless, standards and frameworks must be fit for purpose and useful for the broad set of users: clinical, academic, commercial, and public. These standards must also enable progress-not stifle it. To these ends, GA4GH is focused on the creation and management of genomics standards and not on their implementations, though we expect and encourage many implementing groups to actively participate.
It is not difficult to imagine a world in which genomic and molecular data have been collected for over a billion humans. In such a future, genomics will deliver both personalised and population-based benefits to patients-in acute healthcare, preventative medicine and probably unforeseen ways. When researchers everywhere can access secure data responsibly and use the data to benefit human health, discoveries will accelerate in both health and fundamental biology, in areas on which future benefits may unexpectedly rely. These goals are as ambitious in scope as the coordinated, international response to infectious disease outbreaks, and the international network of seismic sensors for early detection of earthquakes and tsunamis-both of which are demonstrably successful. To achieve this ambition we must have ongoing, sustained public support, from individuals, organisations, and governments.
Just as the W3C consortium provides the framework for setting the standards of the World Wide Web, GA4GH has taken responsibility for building the technical standards-rooted in a robust regulatory, ethical, and security-assured framework-to enable healthcare to benefit directly from scientific progress. It represents an open culture that spans technical standards development and implementation, and discussions that inform and are informed by public policy.
Because the future of healthcare belongs to all of us, we warmly welcome the participation, on every level, of people in all disciplines, in every country.  F  i  g  u  r  e  1  .  E  x  p  o  n  e  n  t  i  a  l  g  r  o  w  t  h  i  n  t  h  e  n  u  m  b  e  r  o  f  g  e  n  o  m  e  s  e  q  u  e  n  c  e  s  a  r  i  s  i  n  g  f  r  o  m  r  a  r  e  d  i  s  e  a  s  e  a  n  d  c  a  n  c  e  r  g  e  n  o  m  i