Abstract
Antibodies are immune system proteins that recognize noxious molecules for elimination. Their sequence diversity and binding versatility have made antibodies the primary class of biopharmaceuticals. Recently it has become possible to query their immense natural diversity using next-generation sequencing of immunoglobulin gene repertoires (Ig-seq). However, Ig-seq outputs are currently fragmented across repositories and tend to be presented as raw nucleotide reads, which means nontrivial effort is required to reuse the data for analysis. To address this issue, we have collected Ig-seq outputs from 53 studies, covering more than half a billion antibody sequences across diverse immune states, organisms and individuals. We have sorted, cleaned, annotated, translated and numbered these sequences and make the data available via our Observed Antibody Space (OAS) resource at antibodymap.org. The data within OAS will be regularly updated with newly released Ig-seq datasets. We believe OAS will facilitate data mining of immune repertoires for improved understanding of the immune system and development of better biotherapeutics.
1. Introduction
Antibodies (or B-cell receptors) are protein products of B-cells and primary actors of adaptive immunity in jawed vertebrates 1. They are highly malleable molecules that can bind to virtually any antigen. An organism holds a great variety of these molecules increasing the probability that an arbitrary antigen can be recognized by an antibody, initiating an immune response 2. Owing to their binding malleability they are the most prominent class of reagents and biotherapeutics 3,4. Continued successful exploitation of these molecules relies on our ability to discern the functional diversity of antibody repertoires 5–7.
Next-generation sequencing of immunoglobulin gene repertoires (Ig-seq) has enabled researchers to take snapshots of millions of sequences at a time across individuals, diverse organisms and different immune states 8,9. The ability to analyze millions of antibody sequences has the potential to uncover the mechanics of the immune response to any antigen10,11 and dysfunctions of the immune system itself12.
Many previous studies have addressed the issue of antibody diversity, contributing invaluable evidence to the dynamics of human immune systems 13. Numerous analyses have focused on the frequencies of V(D)J gene usages, which can offer insights into creating biased antibody libraries for therapeutic applications 14–16. Another therapeutic application of antibody repertoire analysis is advancing vaccine design by comparative longitudinal studies of pre- and post-antigen challenge experiments10,11,17–22. Such comparative studies have shown that different individuals can converge on the same antibody sequence against a given vaccine 11,19. Due to sequencing limitations, these analyses have focused on heavy or light chains separately, whereas one ought to study the paired repertoire to obtain a deeper insight of the antibody diversity 23.
Technical advances in sequencing technology have outpaced storage and analysis pipelines 24,25. This has meant that the outputs of Ig-seq studies are fragmented across repositories making it difficult to perform large-scale data mining of antibody repertoires 25. Metadata such as isotype, age or subject identifiers are not typically standardized, therefore extraction of specific subsets of antibody repertoires for comparative analyses is challenging. Furthermore, the data are typically deposited as raw nucleotide reads. It requires non-trivial ad hoc effort to convert such raw reads to amino acid sequences that ultimately dictate the molecular structure and antigen-recognition. Some of these issues are addressed by services that provide Ig-seq-specific data deposition and analysis pipelines such as Immport (http://immport.org) 26,27, or ImmunoseqAnalyzer (http://clients.adaptivebiotech.com/), IReceptor (http://ireceptor.irmacs.sfu.ca/) or VDJServer (http://vdjserver.org) 28. The IReceptor and the VDJServer are the main resources that fall under the umbrella of the organized effort by the Adaptive Immune Receptor Repertoire (AIRR) Community to provide standardized deposition and analysis pipelines for the Ig-seq outputs 24. These services chiefly focus on facilitating bulk deposition of raw data to perform standardized sequencing analyses. Ultimately, because immunoinformatics is not the chief focus of such services, bulk data download from such websites is limited and converting the raw nucleotide data obtained into format suitable for analysis still requires non-trivial effort.
To address these issues, we have created the Observed Antibody Space (OAS) resource that allows large-scale data mining of antibody repertoires. We have collected the raw outputs of 53 Ig-seq experiments covering over half a billion sequences. We have organized the sequences by metadata such as organism, isotype, B-cell type and source, and the immune status of B-cell donors to facilitate bulk retrieval of specific subsets for comparative analyses. We have converted all of the Ig-seq sequences to amino acids and numbered them using the IMGT scheme. The data is available for querying or bulk download at http://antibodymap.org. We believe that OAS will facilitate data mining antibody repertoires for improved understanding of the dynamics of the immune system and thus better engineering of biotherapeutics.
2. Materials and Methods
A list of study accession codes of publically available Ig-seq datasets were obtained via a literature review. The majority of raw reads were downloaded from the European Nucleotide Archive (ENA) 29 and the National Center for Biotechnology Information (NCBI) websites 30. In a small number of cases, another public Ig-seq repository was specified e.g. 14,31–33. Metadata were manually extracted from the deposited datasets and arranged in a reproducible format.
The downloaded FASTQ files were processed depending on the sequencing platform. Paired raw Illumina reads were assembled with FLASH 34. The assembled antibody sequences were converted to the FASTA format using FASTX-toolkit 35. As raw reads from Roche 454 are not paired, these FASTQ files were directly converted to the FASTA format with the FASTX-toolkit.
The heavy chain sequences were automatically annotated with isotype information unless such data was given in the corresponding publication. Automatic isotype annotation was performed by aligning the constant heavy domain 1 (CH1) of any given antibody sequence against the IMGT isotype reference 36 of the respective species using the Smith-Waterman algorithm 37. We assigned score of two for a nucleotide match, and a score of minus one for a nucleotide mismatch or a gap in our Smith-Waterman scoring function. The IMGT isotype references comprised 21 nucleotide-long fragments of the CH1 domain of the antibody isotypes. To ensure a high confidence of correct isotype identification, we employed a conservative threshold of 30 in the Smith-Waterman algorithm scoring function. Sequences whose Smith-Waterman algorithm score was below the threshold for all isotypes were assigned as 'bulk'. The robustness of this protocol was confirmed on the author-annotated Ig-seq datasets 38–40 where it resulted in 99% accurate annotations. Around 1% of the Ig-seq data had a very short (or missing) sequence of CH1 domain. Such sequences were also assigned as 'bulk'.
IgBlastn 41 was used to convert the FASTA files of antibody nucleotide sequences to amino acids. The amino acid sequences were then numbered with ANARCI 42 using the IMGT scheme 43. ANARCI does not number a sequence if it does not align to a suitable Hidden Markov Model 44. ANARCI therefore ensures that antibody sequences do not harbor unusual indels or stop codons in the antibody regions, that V and J genes align to respective species amino acid IMGT germlines 36, and that the length of CDR-H3 is not greater than 37 residues in human, mouse, rat, rabbit, alpaca and rhesus antibodies. Due to technical limitations of sequencing platforms, certain reads were missing significant portions of the variable region (e.g. portions of CDR1), sequences that did not have all three CDRs were discarded as incomplete. The V and J genes are identified during the ANARCI numbering step.
Using the protocol above we annotated Ig-seq results of 53 independent studies. In order to streamline updating OAS with new data, we have generated a procedure to automatically identify Ig-seq datasets from raw sequence read archives. We apply our antibody annotation protocol to each raw nucleotide dataset deposited in the NCBI/ENA repositories, if we find more than 10,000 antibody sequences in any given dataset, it is set aside for manual inspection. It is still necessary to manually inspect raw datasets to efficiently assign metadata as these are currently deposited in a non-standardized manner. This procedure allows for automatic identification of new Ig-seq datasets, which will enable us to semi automatically update OAS.
3. Results
We collected raw sequencing outputs from 53 Ig-seq studies. All raw nucleotide reads were converted into amino acids using IgBlastn 41. The full amino acid sequences were then IMGT numbered using ANARCI 42. As well as providing IMGT and gene annotations, ANARCI acts as a broad-brush filter of antibody sequences that are likely to be erroneous (see Materials and Methods). Applying the same retrieval, amino acid conversion, gene annotation and numbering protocol to all sequences assures the same point of reference across the 53 heterogeneous Ig-seq datasets 45. This protocol produces the full IMGT-numbered sequences together with gene annotations for each of the 53 datasets.
The numbered amino acid sequences in each dataset are sorted by metadata e.g. individuals, age, vaccination regime, B-cell type and source etc. (Figure 1). Deposition of such metadata is currently not standardized and requires ad hoc manual curation for each dataset. In an effort to organize the antibody sequences using such metadata, we have grouped the sequences within each dataset into Data Units. Each Data Unit represents a group of sequences within a given dataset with a unique combination of metadata values. The metadata values are summarized in Table 1.
As of April 29th 2018, 53 Ig-seq studies are included in OAS totaling 608,651,423 sequences (552,824,460 VH and 55,826,963 VL sequences). The majority of these sequences are murine (~50.4%) and human (~47.4%). Twenty-two of the Ig-seq studies interrogate the immune system of diseased individuals, the most common ailment being HIV (13 studies). The database also contains 22 Ig-seq studies of the naive antibody gene repertoires (the collection of B-cells from donors who are healthy and not purposefully vaccinated). The main source of B-cells in the OAS database is peripheral blood (~231m of sequences) followed by spleen/splenocytes (~198m) and bone marrow (~124m). The database holds isotype information for each individual heavy sequence and the two most common isotypes are IgM (~312m) and IgG (~139m). For ~65m sequences we were not able to assign isotypes with high confidence. The median total redundant size of the Ig-seq studies in the OSA database is 2,006,196 sequences, while the largest Ig-seq study was that by Greiff et al., (246,449,189 redundant sequences) 14. Detailed statistics on each dataset are given in Table 2. All the data may be bulk downloaded or individual Data Units queried, at http://antibodymap.org.
We organized the datasets among studies related to a given Ig-seq experiment. Each study in the OAS database is subdivided into Data Units. Each Data Unit is a collection of IMGT-numbered amino acid sequences uniquely identified by the metadata descriptors given in Table 1, five of which (species, disease, vaccine, B-cell source and B-cell type) are given in this Table. The 'total ANARCI parsed sequences' field indicates the total number of redundant sequences in our database, with the non-redundant numbers in parentheses. Abbreviations: PBMC, peripheral blood mononuclear cell; CLL, chronic lymphocytic leukemia; NP-CGG, chicken gamma globulin; ASC, antibody secreting cell; OVA, ovalbumin; NP-HEL, hen egg lysozyme; CSF, cerebrospinal fluid; MS, multiple sclerosis; SLE, systemic lupus erythematosus; MG, myasthenia gravis; DNP, dinitrophenyl; HuD, paraneoplastic encephalomyelitis antigen.
4. Discussion
Here, we describe the Observed Antibody Space (OAS) database, a unified repository to facilitate large-scale data mining of antibody repertoires in both their amino acid and nucleotide forms. Absence of well established repositories in Ig-seq deposition space required us to perform a combination of literature search and manual curation of the datasets in order to organize the data into OAS. The current lack of widely-adopted deposition standards hampers automatic updating of OAS, as datasets where we find large number of antibodies still require manual curation to perform metadata annotation correctly. Hopefully, efforts such as that by the AIRR community will result in standardization of Ig-seq outputs and will further streamline deposition procedures facilitating large-scale data mining of antibody repertoires 24. Devising unified antibody repertoire repositories is challenging due to both the size of the datasets as well as the diverse data descriptors and analytical pipelines desired by bioinformaticians, wetlab scientists and clinicians 33.
OAS is the first organized collection of a large body of Ig-seq outputs. In order to allow comparative bioinformatics analyses across different subsets of the antibody repertoires, we have annotated the datasets by commonly used metadata descriptions such as organism, isotype, B-cell type and source, and the immune state of B-cell donors. To facilitate research about particular antibody sequences or regions, we make full IMGT-numbered high-quality amino acid sequences available together with gene annotations, as well as raw nucleotide data.
This data should aid in-depth comparative analyses across different studies to discern the commonalities observed between independent samples as well as directing Ig-seq experiments on not yet interrogated antibody repertoires. Revealing shared preferences can be invaluable in identifying the portions of the theoretically allowed antibody space that are strategically used to start an immune response 6. Furthermore, such comparative studies can offer a way of deconvoluting the various degrees of freedom of immune repertoires such as differences between diversity of isotypes 67 or organisms 87. Charting the differences between repertoires of human/mouse is of particular interest for engineering better humanized biotherapeutics 88.
Beyond identifying broad commonalities across repertoires, data mining Ig-seq outputs provides novel avenues for designing better antibody-based therapeutics. The plethora of currently available Ig-seq data offers a glimpse at a set of sequences that should be able to fold and function in an organism. Aligning therapeutic candidates to sequences in Ig-seq repertoires can reveal mutational choices that might be naturally acceptable hence providing shortcuts for antibody engineering such as humanization 89. Furthermore, contrasting the naturally observed antibodies with therapeutic ones can offer insight as to the naturally favored biophysical properties of these molecules 4-90-91. All such future applications rely on the availability of well-structured datasets that can offer a unified point of reference for bioinformatics analyses. We hope that OAS will aid data mining antibody repertoires, help identify strategic preferences of our immune systems and will ultimately improve how we engineer antibodies into better therapeutics.
Funding
This work was supported by funding from Biotechnology and Biological Sciences Research Council (BBSRC) [BB/M011224/1] and UCB Pharma Ltd awarded to AK.
Acknowledgment
We would like to thank all members of Oxford Protein Informatics Group for testing our OAS resource. In particular, we are grateful to Garret M. Morris and Matthew Raybould for their comments, which significantly improved the quality of our work.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.
- 16.↵
- 17.↵
- 18.
- 19.↵
- 20.
- 21.
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.↵
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.
- 82.
- 83.
- 84.
- 85.
- 86.
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵