Introduction

Next-generation sequencing (NGS) has transformed many areas of biological and translational research1,2,3,4. Recently, the scope of NGS applications has expanded to the analysis of the antibody repertoire encoded by B cells5,6,7. Following proof-of-concept studies in animal models8,9, NGS-based antibody repertoire analysis has been applied to human samples10, particularly in the study of human immunodeficiency virus type-1 (HIV-1)-infected individuals with broadly neutralizing antibodies (bnAbs)11,12,13,14,15,16. For these bnAbs, specialized bioinformatics tools have been developed to identify somatic variants and maturation pathways from NGS-derived repertoires11,12,13,14,15,16.

Unlike other NGS applications, antibody repertoire analysis faces unique challenges in both sequencing and data analysis due to the complexity of B-cell development, in which antigen-driven affinity maturation selects for somatic mutations throughout the variable region of immunoglobulin genes. It is therefore critical to sequence entire antibody variable domains (~450 bp) for a meaningful repertoire analysis and to recover functional antibodies from the NGS data. Long reads are particularly critical for the study of HIV-1 bnAbs, which often show 20–35% sequence divergence compared to their germline precursors17,18,19,20. As a result, most studies have been carried out using the 454 platform11,12,13,14,15,16, as this technology typically has a read length of around 400 bp, but with a relatively low throughput. Another critical factor in NGS-based repertoire analysis is sequencing error, which is platform-specific and thus requires different algorithms for correction13,21,22,23. The 454 and PGM platforms suffer from homopolymer errors, which can be corrected using germline genes as a template14, whereas the MiSeq platform generates substitution errors, which can be corrected by calculating a consensus7. Irrespective of the NGS platform, experimental details in sample preparation, such as the choice of polymerase chain reaction (PCR) primers, also play a critical role in producing a reliable repertoire22. Although 5′-RACE PCR has been proposed as a solution for unbiased repertoire analysis, the long PCR products (~600 bp) pose a significant challenge to current NGS platforms. Recently, a multiplex PCR method with minimized bias was reported for T-cell repertoire analysis24. Meanwhile, PCR-based library amplification produces redundant cDNA molecules, which, when combined with sequencing errors, may lead to artificial antibody clones and artificial diversity in repertoire analysis22.
Although the basic strategies for antibody repertoire analysis have just been established and not yet optimized, the current research focus has begun to shift from cross-sectional studies to longitudinal analyses25, which require high-precision dissection of repertoire properties to establish meaningful biological conclusions. Therefore, it remains unclear whether current antibody sequencing technologies will suffice for these new applications.

In this study, we adapted the Ion Torrent Personal Genome Machine (PGM) for high-throughput sequencing of full-length antibody variable domains. We validated the platform with samples from an HIV-1-infected donor (IAVI donor 17), the source of bnAb PGT121 and its siblings19, and from two HIV-1-uninfected donors. The greater depth of PGM sequencing allowed us to identify a more complete somatic population of the PGT121-class antibodies. We then introduced 5′-RACE PCR into template preparation in order to capture antibody repertoires in an unbiased manner. We compared the overall properties of the unbiased repertoires to those obtained using multiplex primers. We also developed a random barcoding strategy to track individual antibody cDNA molecules and to reduce amplification noise and sequencing error. With a side-by-side comparison, we demonstrated that the new template amplification methods and sequencing chemistry could significantly improve the repertoire quality compared to the current methods based on this platform.

Results

PGT121 class of broadly neutralizing antibodies

Identified from African donor 17 of the IAVI Protocol G cohort (IAVI donor 17)19, the PGT121 class of antibodies was originally described as consisting of six members: PGT121–124 and PGT133–134. These antibodies potently neutralize 65–70% of HIV-1 isolates (median IC50 < 0.05 μg ml−1) by recognizing a high-mannose patch in the gp120 V3 region26. Further structural analysis revealed that PGT121–123 share a similar mode of recognition by recruiting multiple structural elements in the HIV-1 envelope variable regions27. In the 454 analysis of IAVI donor 17, a novel phylogenetic method was devised to infer putative intermediates, which potently neutralized diverse strains on a 74-virus panel with half the mutation level of the mature parent antibodies16. Recently, Barouch et al demonstrated the therapeutic value of PGT121 in simian-human immunodeficiency virus (SHIV)-infected rhesus monkeys28. With the extensive information available for the PGT121 class of antibodies, the sample from IAVI donor 17 provides a unique test case for investigating various aspects of antibody repertoire analysis.

We first reanalyzed the 454 sequencing data of IAVI donor 17 (ref. 16) with the Antibodyomics 1.0 pipeline11,12,13,14,15 (Figs. S1 and S2). Of 966,935 raw reads, 122,079 originated from the IgHV4-59 gene and 527,100 from the IgLV3-21 gene, the heavy and light chain germline precursors of the PGT121 class, respectively. Using a sequence identity cutoff of 90%, closely related somatic variants were identified for both PGT121 heavy and light chains, but not for other members of the class (Fig. 1A). Through intra-donor phylogenetic analysis13,14, we identified 81 heavy chains and 470 light chains that were somatically related to the PGT121 class of antibodies (Fig. 1B).

Figure 1

Analysis of reported 454 sequencing data of PGT121 class of antibodies.

(A) Identity/divergence analysis of 454-derived sequence population for IAVI donor 17. Heavy and light chains of the representative PGT121-class antibodies (PGT121–124 and PGT133–134) are used as template in the sequence identity calculation. The heavy chains of IgHV4-59 origin (left) and the light chains of IgLV3-21 origin (right) are plotted as a function of sequence identity to the template and the sequence divergence from the inferred germline gene. (B) Intra-donor phylogenetic trees calculated for heavy (left) and light chains (right). Iterative intra-donor phylogenetic analysis was performed to identify sequences that are somatically related to the PGT121 class of antibodies. After three iterations, the analysis converged to 81 heavy chains and 470 light chains, respectively.

PGM sequencing of IAVI donor 17 with IgVH4 and IgVL3 primers

We performed deep sequencing of antibody transcripts from memory and plasma B cells using PCR to amplify heavy chains from the IgHV4 family and light chains from the IgLV3 family (Fig. S3A). Briefly, mRNA from an estimated 20 million PBMCs was used for reverse transcription (RT) to produce template cDNA (Table S1). PGM sequencing was performed using an Ion 316 chip and a modified protocol, in which the default 3′-trimming option was turned off in order to obtain longer reads. The Antibodyomics 1.0 pipeline was used for data processing and sequencing error correction. The pipeline output and the following PGM sequencing experiments are briefly summarized in Table 1.

Table 1 Antibody repertoire analysis of an HIV-1-infected individual and two uninfected individuals

PGM sequencing provided 3,610,144 raw reads, of which 172,732 sequences were assigned to the IgHV4-59 gene and 1,615,722 to the IgLV3-21 gene, the heavy and light chain germline precursors of the PGT121 class, respectively. After pipeline processing, the sequences of IgHV4-59 and IgLV3-21 origins were compared to the PGT121 class of antibodies. The identity/divergence plots revealed sequences with over 90% identity to all 6 PGT121-class heavy chains as well as to the PGT121 and PGT123 light chains (Fig. 2A). These results indicate that the PGM platform, when used in conjunction with germline gene-specific primers, could effectively capture the closely related somatic variants for this antibody class with greater coverage than the 454 platform (Fig. 1A).

Figure 2

Analysis of PGM sequencing data generated from IAVI donor 17 using VH4- and VL3-specific primers.

(A) Identity/divergence analysis. Heavy chain sequences of the IgHV4-59 origin and light chain sequences of the IgLV3-21 origin are plotted as a function of sequence identity to the sequences of 6 PGT121-class antibodies and of sequence divergence from putative germline genes. (B) Iterative intra-donor phylogenetic analysis. Heavy and light chains with the same germline origin as the PGT121 class are subjected to multiple rounds of intra-donor phylogenetic analysis. In each round, the input sequences are divided into a number of subsets, each with the germline gene and 6 PGT121-class sequences included. After phylogenetic calculation, sequences that are clustered with the PGT121-class antibodies are extracted and used as input for the next round of analysis. The analysis converged to a fixed number of sequences after 10 iterations. (C) Intra-donor phylogenetic trees. Maximum-likelihood trees of VH and VL sequences from IAVI donor 17, along with 6 representative PGT121-class antibody sequences, are rooted by the respective germline gene sequences. Each bar represents a change of 0.1 per nucleotide site.

Using the 6 PGT121-class antibodies as a template, iterative intra-donor phylogenetic analysis13,14 was performed to identify all the heavy and light chains that were somatically related to this class. After 10 iterations, the analysis converged to 1,022 heavy chains and 2,282 light chains (Fig. 2B), which compared to the 454-derived somatic variants show a ~12- and ~5-fold increase in the number of sequences, respectively16. The greater number of somatic variants can be attributed to the greater sequencing depth, although some of the new clones might have resulted from errors in RT, PCR and sequencing as stressed in recent reviews22. Nevertheless, with more somatic variants identified, the intra-donor phylogenetic trees (Fig. 2C) appeared to be more complete and should be more suitable for inferring intermediate sequences16.
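The control flow of an iterative selection of this kind can be sketched as a fixed-point loop: reads are split into subsets, each subset is analyzed together with the templates and the germline gene, the reads that cluster with the templates are kept, and the procedure repeats until the number of reads no longer changes. In the sketch below the phylogenetic step is a caller-supplied function (`select_clustered`, a hypothetical name); the subset size and the round limit are also assumptions, and no real tree-building software is invoked.

```python
from typing import Callable, List, Sequence, Tuple


def iterative_selection(
    reads: Sequence[str],
    templates: Sequence[str],
    germline: str,
    select_clustered: Callable[[List[str], Sequence[str], str], List[str]],
    subset_size: int = 1000,
    max_rounds: int = 50,
) -> Tuple[List[str], int]:
    """Repeat subset-wise selection until the retained set stops shrinking.

    Returns the retained reads and the number of rounds that were run.
    """
    current = list(reads)
    for round_no in range(1, max_rounds + 1):
        kept: List[str] = []
        # Divide the input into subsets; each is analyzed with templates + germline.
        for start in range(0, len(current), subset_size):
            subset = current[start:start + subset_size]
            kept.extend(select_clustered(subset, templates, germline))
        if len(kept) == len(current):  # converged: nothing was removed this round
            return kept, round_no
        current = kept
    return current, max_rounds


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def toy_selector(subset, templates, germline):
    """Toy stand-in for tree-based clustering: keep reads closer to a template than to germline."""
    return [r for r in subset
            if min(hamming(r, t) for t in templates) < hamming(r, germline)]
```

With real data, `select_clustered` would wrap tree building and the extraction of the clade that contains the template sequences; the toy selector exists only to make the loop executable.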

Our results thus illustrated the importance of sequencing depth in antibody repertoire analysis, especially in the identification of rare bnAb clones and lineage intermediates, which will further advance our understanding of bnAb development and provide new antibody targets for B-cell precursor and lineage-based immunogen design29,30.

Functional validation of PGM-derived PGT121 somatic variants

We validated the neutralization of newly identified PGT121 variants on a panel of 6 cross-clade viruses16. For each member of the PGT121 class, heavy and light chains alike, we extracted the sequences with an identity of 90% or greater with respect to the template and grouped the sequences using an identity cutoff of 100%. We then manually selected sequences that either represented closely related somatic variants (>95% identity) or had diverged further during maturation (<95% identity). This analysis resulted in 15 heavy chains and 8 light chains, which were synthesized and paired with their respective native partner chains (Table S5A). Nineteen antibodies were expressed and tested in neutralization assays, with PGT121 and PGT133 included for comparison (Table 2A). For HIV-1 isolates 92BR020, 92RW020 and IAVI C22, 70% of the reconstituted antibodies neutralized with an IC50 ranging from 0.001 to 1 μg ml−1, which is characteristic of the PGT121 class. The heavy chain somatic variants of PGT121, 122 and 124, when paired with their native light chains, showed comparable if not higher potency than the native antibodies. In the case of PGT123, where the light chain somatic variants were paired with the native heavy chain, neutralizing activity appeared to be similar among the somatic variants and higher than that of the heavy-chain chimeric antibodies. Taken together, the neutralization results confirmed that PGM can be used to derive functional antibodies with a success rate (~70%) similar to that of 454 (refs 11,12,13,14,15,16), providing a biologically relevant assessment of the data quality generated by this platform.

Table 2 Neutralization titers of 26 chimeric antibodies derived from IAVI donor 17 against 6 HIV-1 Env-pseudoviruses

Utility of 5′-RACE PCR for an unbiased repertoire analysis

The NGS analysis of HIV-1-infected donors has been focused primarily on specific germline gene families that give rise to the bnAbs of interest11,12,13,14,15,16, leaving a majority of the antibody repertoire uncharacterized. Recently, Choi et al reported that 5′-RACE PCR offered an unbiased view of the murine Igh repertoire, which allows for an in-depth analysis of the V-gene rearrangement frequency31. With the PGM platform, we investigated the utility of 5′-RACE PCR in antibody repertoire analysis for IAVI donor 17 and two HIV-1-uninfected donors. Briefly, we adapted the murine procedure of Choi et al31 for human samples by designing a new reverse primer that binds just downstream of the variable domain (Fig. S3B). We then sequenced the 5′-RACE PCR products to capture the entire heavy or light chain repertoire from the donor's memory and plasma B cells using this reverse primer (Table S4). PGM sequencing was performed on an Ion 316 chip without using 3′-trimming in raw data processing to extend the read length.

For IAVI donor 17, the combination of 5′-RACE PCR and PGM sequencing provided 1,098,334 heavy chains and 1,424,744 light chains (Table 1B), of which 71,689 sequences were assigned to IgHV4-59 and 121,652 sequences to IgLV3-21. For the two uninfected donors, 5′-RACE PCR and PGM sequencing provided over 3 million raw reads (Table 1B). Notably, for all three donors the average read length from 5′-RACE PCR was over 500 bp, compared to an average of 420–430 bp from gene-specific primers, highlighting the importance of long-read capability for unbiased antibody repertoire analysis.

For IAVI donor 17, IgHV4-59 accounted for 8.5% of the unbiased heavy chain repertoire (Fig. 3A). Surprisingly, IgHV5-51, which has not been associated with any HIV-1 bnAb, was the most prevalent germline gene, accounting for 22.3% of the repertoire. The heavy chain repertoires from the two uninfected donors presented two extremes (Fig. 3A). Uninfected donor #1 yielded a rather even distribution, whereas uninfected donor #2 showed a skewed usage of germline genes IgHV1-69 and IgHV4-34. We then characterized the heavy chain complementarity determining region 3 (CDR H3) (Fig. 3B). IAVI donor 17 and uninfected donor #1 showed somewhat similar distributions, with IAVI donor 17 having slightly longer CDR H3 regions. Surprisingly, uninfected donor #2 showed a two-peak distribution with the second peak centered at ~22 aa, suggesting that a large portion of antibodies in this repertoire (~26%) possess unusually long CDR H3 loops. We further analyzed the heavy chains with long CDR H3s (22–26 aa) and found that 71.2% of the sequences were of IgHV4-34 origin with a preferred usage of J6, the longest J gene segment, suggesting that the long CDR H3s resulted from V-D-J rearrangement rather than sequencing error. We also characterized the distribution of germline divergence to determine the degree of somatic hypermutation (Fig. 3C). IAVI donor 17 showed a higher divergence than the two uninfected donors, which is expected given the prolonged maturation process in HIV-1-infected individuals. The PGT121-class heavy chains showed an average divergence of 22.0%, which is 9% higher than that of the IgHV4-59 family, indicating that these bnAbs require a longer maturation process than non-HIV-1-specific antibodies of the same germline origin. However, uninfected donor #2 showed a substantially lower germline divergence than IAVI donor 17 and uninfected donor #1.
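Repertoire properties of this kind reduce to simple tallies once each read carries a germline gene assignment and a CDR H3 length. The sketch below is a minimal illustration under that assumption; the input lists and function names are hypothetical and stand in for the annotated output of a processing pipeline.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def percent_distribution(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Percentage of reads per category (germline gene, CDR3 length, ...)."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: 100.0 * n / total for key, n in counts.items()}


def fraction_in_length_range(cdr3_lengths: Iterable[int], low: int, high: int) -> float:
    """Fraction of reads whose CDR3 length (aa) lies in [low, high], inclusive."""
    lengths = list(cdr3_lengths)
    if not lengths:
        return 0.0
    return sum(low <= n <= high for n in lengths) / len(lengths)


# Toy annotations, not data from the study.
genes = ["IgHV4-59", "IgHV5-51", "IgHV5-51", "IgHV1-69"]
cdr_h3_lengths = [10, 12, 22, 24]
usage = percent_distribution(genes)
long_fraction = fraction_in_length_range(cdr_h3_lengths, 22, 26)
```

The same `percent_distribution` call applied to the CDR H3 lengths gives a length histogram of the type shown in Fig. 3B.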

Figure 3

Unbiased repertoires of IAVI donor 17 and two HIV-1-uninfected donors.

Unbiased heavy and light chain repertoires were obtained using 5′-RACE PCR. PGM sequencing was performed using an Ion 316 chip and sequencing data was processed with the Antibodyomics 1.0 pipeline. The processed sequences were used to calculate heavy (A–C) and light chain (D–F) repertoire properties such as germline gene usage (A and D), complementarity determining region 3 (CDR3) length (B and E) and germline gene divergence (C and F).

For IAVI donor 17 (Fig. 3D), IgLV3-21 accounted for 9.2% of the unbiased light chain repertoire, whereas IgLV3-1 appeared to be used predominantly (25.7%) in the repertoire. All three donors showed similar CDR L3 length distributions, with a clear preference for a 9–11 aa loop length (Fig. 3E), as opposed to a more spread and diverse CDR H3 length distribution. The light chains also showed a similar divergence pattern to the heavy chains, with more near-germline sequences (with a divergence of 0–1%) in the repertoire from uninfected donor #1 (Fig. 3F). The PGT121-class light chains gave an average germline divergence of 23.2%, compared to a lower value of ~13% calculated for the IgLV3-21 family.

The unbiased analysis revealed intriguing features of IAVI donor 17 repertoire. More specifically, it allowed for the PGT121 class of antibodies to be analyzed in the context of the entire repertoire, which showed that these bnAbs were not the most prevalent family in this donor's repertoire. The large population of near-germline, IgHV4-34-originated antibodies with long CDR H3 loops found in the unbiased repertoire of uninfected donor #2 can be potentially explained by different causes such as an ongoing immune response, an autoimmune condition or a unique genetic background. This finding highlights the potential utility of unbiased repertoire analysis in identifying transient antibody responses and unusual patterns of antibody maturation.

Identification and validation of somatic variants from unbiased repertoire analysis

With 5′-RACE PCR, individual germline gene families were diluted due to non-specific amplification of all germline genes (Figs. 3A and 3D). As a result, the IgHV4-59 and IgLV3-21 families each accounted for less than 10% of the total repertoire. The identity/divergence plots identified sequences with >90% identity for both the PGT122 heavy and light chains and for the PGT123 light chain, but not for other members of this class (Fig. S4). We identified 6 sequences with sequence identities ranging from 88.6% to 100% with respect to the PGT122 heavy chain. We also selected 3 and 2 somatic variants of the PGT122 and PGT123 light chains, respectively. When paired with their native partner chains, 7 chimeric antibodies were expressed (Table S5B) and in most cases neutralized diverse HIV-1 isolates (Table 2B). Taken together, although 5′-RACE PCR can provide a more accurate view of the overall antibody repertoire, gene-specific primers are still more advantageous in the identification of somatic variants and temporal analysis of antibody maturation for an antibody family with a well-defined germline origin25. Our analyses of IAVI donor 17 thus highlight the separate advantages of 5′-RACE and standard PCR methods in the current applications of antibody repertoire analysis.

Comparison of repertoires derived from 5′-RACE PCR and multiplex PCR

Due to differential primer efficiencies and primer cross-reactions, multiplex PCR can cause biases in the sequencing library, thus hampering the reliability of antibody repertoire analysis22. Primer bias of multiplex PCR has been previously investigated for T cell receptors (TCRs) using a synthetic repertoire24. Here, we examined the primer bias in the context of antibody repertoire analysis by comparing the IAVI donor 17 repertoires generated by 5′-RACE PCR and different sets of gene-specific primers (Table 1C). We redesigned the fusion primers such that the sequencing starts from the 3′-end of the variable domain (Tables S2 and S3 and Fig. S3A) similar to 5′-RACE PCR, with the exception that the 5′-end is now anchored by a VH or VL primer.

We first tested two sets of heavy chain primers: (1) primers that overlapped the end of the V-gene leader sequence and the start of the V region and (2) upstream primers that annealed to the start of the V-gene leader sequence and have been optimized to capture highly mutated sequences32. The first set of primers (GP-H1) were derived from previous 454 studies of HIV-1 bnAbs11,12,13,14,15 with the addition of two IgHV5 primers designed based on the same principles. The second set of primers (GP-H2) were previously reported by Scheid et al32. We postulated that a less biased primer set would better represent the germline gene usage of the unbiased repertoire. Remarkably, GP-H1 produced a close match to the unbiased repertoire, with a correlation coefficient of 0.96 (Fig. 4A), suggesting that GP-H1 can serve as the first-level approximation to 5′-RACE PCR for heavy chain repertoire analysis. However, GP-H2 showed an extremely skewed usage of IgHV1-69 (79.8%) with a correlation coefficient of 0.17 (Fig. 4B), with IgHV4-59 and IgHV5-51 accounting for only 0.039 and 0.001% of the entire repertoire. We then tested a set of forward λ-chain primers (GP-L1), which yielded a biased germline gene usage with a correlation coefficient of 0.37 (Fig. 4C). In this repertoire, IgLV2-14, rather than IgLV3-1, was the most prevalent germline gene and accounted for ~25.7% of the population. IgLV3-21, the germline precursor of PGT121-class light chains, only accounted for 2.0% of the repertoire, as opposed to a 9.2% in the unbiased repertoire (Fig. 3D).
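A comparison of this type can be sketched as a correlation between two germline gene usage profiles. The text does not state which correlation coefficient was used; the sketch below assumes the Pearson coefficient and assumes that a gene absent from one repertoire counts as 0% usage there. The function name and toy values are illustrative only.

```python
import math
from typing import Dict


def usage_correlation(usage_a: Dict[str, float], usage_b: Dict[str, float]) -> float:
    """Pearson correlation between two gene-usage profiles (percent per gene).

    Genes missing from one profile are treated as 0% usage (an assumption).
    """
    genes = sorted(set(usage_a) | set(usage_b))
    if len(genes) < 2:
        raise ValueError("need at least two genes to compute a correlation")
    xs = [usage_a.get(g, 0.0) for g in genes]
    ys = [usage_b.get(g, 0.0) for g in genes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("usage is constant in at least one profile")
    return cov / math.sqrt(var_x * var_y)


# Toy profiles, not data from the study.
unbiased = {"V1": 10.0, "V2": 20.0, "V3": 70.0}
skewed = {"V1": 70.0, "V2": 20.0, "V3": 10.0}
r_same = usage_correlation(unbiased, unbiased)
r_skewed = usage_correlation(unbiased, skewed)
```

A primer set that reproduces the reference usage gives a coefficient near 1, whereas a set that over-amplifies a single gene pulls the coefficient toward 0 or below.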

Figure 4

Comparison of IAVI donor 17 repertoires generated by 5′-RACE PCR and multiplex PCR.

(A) Heavy chain germline gene usage (%) from GP-H1 primer set and corresponding correlation with the unbiased repertoire. (B) Heavy chain germline gene usage (%) from GP-H2 primer set and corresponding correlation with the unbiased repertoire. (C) Light chain germline gene usage (%) from GP-L1 primer set and corresponding correlation with the unbiased repertoire. GP-H1 and GP-L1 primers overlap the end of the V-gene leader sequence and the start of the V region, whereas GP-H2 primers anneal more upstream to the start of the V-gene leader sequence and have been optimized to capture highly mutated antibody sequences.

The stark difference in germline gene usage between GP-H1 and GP-H2 exemplifies the influence of primer selection upon basic repertoire properties. This comparison further emphasizes the necessity of using 5′-RACE PCR to eliminate primer bias, although there appears to be value in optimizing gene-specific primers and multiplex PCR to minimize bias24.

Reducing amplification noise and sequencing errors by barcoding cDNA molecules

Antibody repertoire sequencing has been widely used to identify somatic variants and maturation pathways of HIV-1 bnAbs11,12,13,14,15,25,33. However, the noise from PCR library amplification combined with sequencing errors can complicate the interpretation of sequence diversity22 and thus undermine the reliability of putative intermediates inferred from phylogenetic analysis11,16. Recently, two template tagging strategies were proposed to reduce amplification noise in transcriptome34 and viral RNA35 sequencing. In this study, we developed a “random barcoding” strategy for antibody sequencing in which 10 degenerate nucleotides (N10) were included in the cDNA synthesis primer such that each template is labeled with a unique identifier (ID) (Fig. S3C). In theory, such random barcodes can create 1,048,576 (4^10) distinct sequence IDs, a number comparable to the number of heavy or light chains generated in a typical PGM sequencing run. We sequenced the IgHV4 and IgLV3 gene families of IAVI donor 17 using this random barcoding strategy and observed comparable data quality (Table 1D).
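The size of the barcode space follows directly from the four possible nucleotides at each degenerate position. The sketch below computes it and, under the added assumption that barcodes are drawn uniformly and independently (real primer synthesis may deviate from this), estimates the fraction of templates expected to share a barcode with at least one other template. Function names are illustrative.

```python
def barcode_space(length: int) -> int:
    """Number of distinct barcodes of the given length over A, C, G, T."""
    return 4 ** length


def expected_shared_fraction(n_templates: int, length: int) -> float:
    """Expected fraction of templates whose barcode is also carried by another template.

    Assumes uniform, independent barcode assignment (an idealization).
    """
    if n_templates < 1:
        raise ValueError("need at least one template")
    n_codes = barcode_space(length)
    # Probability that none of the other n-1 templates drew the same barcode.
    p_unique = (1.0 - 1.0 / n_codes) ** (n_templates - 1)
    return 1.0 - p_unique


n10 = barcode_space(10)  # the N10 barcode described in the text
n20 = barcode_space(20)  # a 20-nucleotide barcode
```

Because the shared fraction grows with the number of templates, a barcode space of roughly the same size as the read count leaves room for occasional collisions, which is one reason to consider a secondary identifier or a longer barcode.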

Of ~3.1 million raw reads, 88.1% possessed a random barcode of 10 nucleotides. After pipeline processing, 84.0% of the IgHV4-59 family and 96.3% of the IgLV3-21 family contained barcodes of the correct length (Fig. 5A). Between 3 and 12% of the sequences contained an extra nucleotide in the barcode, which was likely caused by errors in cDNA synthesis, PCR amplification or PGM sequencing. We then examined the amplification noise within the germline gene families of the PGT121 class (Fig. 5B). An all-to-all comparison identified 70,235 (42.1%) uniquely barcoded heavy chains from 166,703 IgHV4-59 sequences, with the copy number ranging from 1 to 99. In contrast, only 106,984 (14.2%) out of 754,085 IgLV3-21 sequences were found to be uniquely barcoded, with a maximum copy number of 853. The distribution of identical barcodes did not fit a normal distribution curve (Fig. 5B), suggesting that the templates were not amplified equally35. In particular, the light chains appeared to be amplified more frequently than the heavy chains, as indicated by a peak population (~19%) of 11–50 copies (Fig. 5B, right panel). Despite the possibility of 4^10 unique barcodes, different cDNA templates can be labeled by the same barcode sequence. We examined this possibility by using the CDR3 as a secondary sequence ID in the determination of unique templates; namely, two sequences need to have the same barcode and the same CDR3 length (with an error of ±1.5 aa) to be considered “identical”. Indeed, 1–3% of the sequences were amplified from different templates but assigned the same barcodes, as indicated by a slight increase in single-copy reads (Fig. 5B). This was confirmed by visual inspection of sequences with identical barcodes but different CDR H3/L3 lengths (Fig. S5).
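The two ways of defining a unique template described above (barcode alone, or barcode plus CDR3 length) can be sketched as a grouping step. The sketch assumes each read is reduced to a (barcode, CDR3 length in nucleotides) pair and that, within one barcode, reads are split greedily whenever the CDR3 length differs from the start of the current group by more than the tolerance; the greedy rule and the names are assumptions, not the published procedure.

```python
from collections import defaultdict
from typing import Iterable, List, Optional, Tuple


def template_copy_numbers(
    reads: Iterable[Tuple[str, int]],
    tolerance_aa: Optional[float] = 1.5,
) -> List[int]:
    """Return one copy number per inferred cDNA template.

    reads: (barcode, CDR3 length in nt) pairs.
    tolerance_aa: allowed CDR3 length difference in aa within one template;
                  None means the barcode alone defines a template.
    """
    by_barcode = defaultdict(list)
    for barcode, cdr3_nt in reads:
        by_barcode[barcode].append(cdr3_nt / 3.0)  # convert nt to aa

    copy_numbers: List[int] = []
    for lengths in by_barcode.values():
        if tolerance_aa is None:
            copy_numbers.append(len(lengths))
            continue
        lengths.sort()
        anchor, size = lengths[0], 0
        for length in lengths:
            if length - anchor > tolerance_aa:  # too different: start a new template
                copy_numbers.append(size)
                anchor, size = length, 0
            size += 1
        copy_numbers.append(size)
    return copy_numbers


# Toy reads with shortened barcodes, not data from the study.
toy_reads = [("AAAC", 45), ("AAAC", 45), ("AAAC", 46), ("AAAC", 72), ("GGGT", 45)]
barcode_only = sorted(template_copy_numbers(toy_reads, tolerance_aa=None))
barcode_and_cdr3 = sorted(template_copy_numbers(toy_reads))
```

In the toy input, the read with a 24-aa CDR3 shares a barcode with three reads carrying a ~15-aa CDR3, so adding the CDR3 criterion splits one four-copy template into a three-copy template and a single-copy template.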

Figure 5

Random barcoding strategy in the repertoire sequencing of PGT121 class of antibodies.

(A) Barcode length distribution for the raw sequencing data (red) and the pipeline-processed IgHV4-59 family (blue) and IgLV3-21 family (green), the germline genes of the PGT121 class of antibodies. Plotted in the distribution are 3,109,512 raw reads, 198,345 IgHV4-59 originated sequences and 782,720 IgLV3-21 originated sequences. (B) Distribution of copy number for the heavy chains of IgHV4-59 origin (left panel) and light chains of IgLV3-21 origin (right panel) with a correct 10-nt barcode length. Identical cDNA templates were identified using either random barcode alone (blue) or a combination of random barcode and CDR3 length (red). (C) Sequence length variation of PGT121-class heavy chains plotted as a function of copy number. A total of 166,703 heavy chains of IgHV4-59 origin with the correct barcode length were subjected to an iterative intra-donor phylogenetic analysis using 6 PGT121-class heavy chains as a template. After 5 iterations, the analysis converged to 2,011 sequences, which were subjected to further calculation of copy number and length variation. (D) Identity/divergence analysis of PGT121-class antibodies before (blue) and after (red) random barcode-based reduction of sequence redundancy, for heavy (left panel) and light chains (right panel).

Next, we investigated the utility of random barcodes for correcting sequence errors in the PGM-derived PGT121 class of antibodies. Five iterations of intra-donor phylogenetic analysis13,14 converged to 2,011 PGT121-class antibody heavy chains. Using random barcodes and the CDR H3 length, 1,105 unique heavy chains were identified. The copy number distribution of the PGT121 heavy chain somatic variants resembled that of the whole IgHV4-59 family (Fig. 5B, left panel), with 62.9% of the sequences having a single copy. We then calculated the “consensus” sequences for all heavy chains with more than two copies. As expected, the variation of sequence length as a result of PCR or sequencing error decreased significantly as the copy number increased (Fig. 5C). The average sequence length decreased from 397.2 to 396 bp, which is the correct sequence length of PGT121-class heavy chains. The barcode-corrected PGT121-like sequences showed reduced diversity on the identity/divergence plots (Fig. 5D).
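One simple way to compute a consensus for reads that share a template is a column-wise majority vote. The sketch below assumes the reads of one template have already been multiply aligned to equal length with "-" for gaps, and it removes columns whose majority character is a gap; the text does not specify how the consensus was calculated, so this is an illustration rather than the method used in the study.

```python
from collections import Counter
from typing import List, Sequence


def consensus(aligned_reads: Sequence[str]) -> str:
    """Column-wise majority-vote consensus of equal-length aligned reads.

    Columns where the most common character is a gap are dropped, which removes
    insertions present in only a minority of reads. Ties go to the character
    seen first in that column.
    """
    if not aligned_reads:
        raise ValueError("no reads supplied")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to equal length")
    out: List[str] = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            out.append(base)
    return "".join(out)


def consensus_if_supported(aligned_reads: Sequence[str], min_copies: int = 3) -> List[str]:
    """Collapse to one consensus only when there are more than two copies."""
    if len(aligned_reads) >= min_copies:
        return [consensus(aligned_reads)]
    return list(aligned_reads)
```

A homopolymer insertion carried by one of three copies is voted out, which is consistent with the shrinking length variation at higher copy numbers described above.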

Taken together, the random barcoding strategy can quantify the amplification bias in antibody repertoire sequencing and thus provide an effective means to reduce potential artifacts resulting from PCR-based amplification noise and sequencing errors. This strategy is general and can be applied to longer barcodes as demonstrated for a 20-nucleotide barcode (Fig. S6). We also demonstrated that the CDR3, the most conserved antibody signature, can be used as a “natural barcode” to assist in the analysis of sequence redundancy. It should be noted that each B cell can carry multiple copies of mRNA, which can be labeled with different barcodes in RT and treated as non-redundant cDNA molecules. Therefore, the current barcoding strategy can only eliminate the redundancy of the expressed antibody repertoire rather than that of the B-cell repertoire.

Improving antibody repertoire quality with new NGS technologies

Pyrosequencing has not been favored for antibody repertoire analysis primarily due to homopolymer errors7. Therefore, it is important to assess whether the recent technical advances for PGM can improve sequencing accuracy. Here, we have evaluated three new PGM technologies that were made available to academic users through the Early Access program. These include two template preparation methods – an improved version of the emulsion-based method and an emulsion-free method called isothermal amplification (IA) – and the Hi-Q sequencing enzyme. We tested various combinations of the template preparation methods, Hi-Q enzyme and data processing methods using the 5′-RACE PCR products from uninfected donor #2 (Table 1E).

The combined use of the improved emulsion-based method and Hi-Q enzyme (new OT2 + Hi-Q) showed a remarkable improvement consisting of a 16% increase in the number of raw reads, a 50–60 bp increase in read length and a 15–20% increase in the sequence population without gaps in the V-gene alignment (from 8.3 to 28.4% and 18.1 to 33.7% for the heavy and light chains, respectively) (Table 1E, #8). The change in error profile can be visualized by the distribution of gaps in the V-gene alignment (Fig. 6A). The “error-free” sequences, along with those with only one gap in the V gene, have shifted the repertoire towards a lower error rate. We then investigated whether 3′-trimming, which was turned off in previous PGM sequencing, and bioinformatics filtering could further improve the accuracy. Indeed, 3′-trimming increased the V-gene error-free population by 10%, but with a tradeoff of a 20% decrease in sequence reads (Table 1E, #9). After bioinformatics filtering, 40.4% of the heavy chains and 45.8% of the light chains contained no indel errors in the V gene segment (Fig. 6A).
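The gap-based quality metric can be sketched as follows, assuming each read is supplied as a pairwise alignment against its assigned germline V gene and that one gap corresponds to one contiguous run of "-" in either aligned sequence. Both the definition of a gap and the function names are assumptions made for illustration.

```python
import re
from typing import Iterable, Tuple


def count_gaps(aligned_read: str, aligned_germline: str) -> int:
    """Number of gap runs (indel events) in a read-to-germline V-gene alignment."""
    if len(aligned_read) != len(aligned_germline):
        raise ValueError("aligned sequences must have equal length")
    deletions = re.findall(r"-+", aligned_read)       # bases missing from the read
    insertions = re.findall(r"-+", aligned_germline)  # extra bases in the read
    return len(deletions) + len(insertions)


def gap_free_percent(alignments: Iterable[Tuple[str, str]]) -> float:
    """Percentage of reads whose V-gene alignment contains no gaps."""
    alignments = list(alignments)
    if not alignments:
        return 0.0
    clean = sum(count_gaps(read, germ) == 0 for read, germ in alignments)
    return 100.0 * clean / len(alignments)


# Toy alignments, not data from the study.
toy_alignments = [
    ("ACGTTA", "ACGTTA"),    # no gaps
    ("ACG-TA", "ACGTTA"),    # one deletion
    ("ACGTTTA", "ACGT-TA"),  # one insertion
    ("A-GT-A", "ACGTTA"),    # two deletions
]
```

Tallying `count_gaps` over all reads yields a distribution of the kind plotted in Fig. 6, and `gap_free_percent` corresponds to the indel-free fraction quoted in the text.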

Figure 6

Improved antibody sequence quality from new PGM technologies.

Number of gaps in VH- and VL-gene alignment is plotted for (A) the combined use of improved emulsion-based template preparation method and Hi-Q enzyme (new OT2 + Hi-Q) and (B) the combined use of emulsion-free isothermal amplification (IA) and Hi-Q enzyme (IA + Hi-Q). The template library from uninfected donor #2 was sequenced as a test case. Plotted are the standard PGM protocol without 3′-trimming (blue), the combined use of a new template preparation method and Hi-Q enzyme without 3′-trimming (green), the same combination with 3′-trimming (magenta) and the same combination with 3′-trimming and bioinformatics filtering (red).

The combined use of IA and Hi-Q (IA + Hi-Q) showed further improvement on all the metrics examined. The sequencing output was increased by 26% with respect to the standard PGM protocol, with the heavy and light chains showing identical values in read length (560 bp) and sequence quality (33.3% V-gene error-free) (Table 1E, #10). These results confirmed the high fidelity of sequencing templates generated by IA. With 3′-trimming, although there was a 20% decrease in sequence reads, ~41% of the entire repertoire was error-free (Table 1E, #11). Bioinformatics filtering further increased this error-free population to ~47% of the entire repertoire (Fig. 6B).

The significant improvements in sequence quality demonstrate the crucial role of NGS technology in antibody repertoire analysis. The reduced homopolymer errors from the combined use of IA and Hi-Q will further increase the accuracy of antibody lineage analysis and intermediate inference based on the PGM platform.

Discussion

The extraordinary ability of antibodies to recognize the plethora of foreign pathogens relies on their sequence diversity generated by gene rearrangement and affinity maturation36,37. NGS-based repertoire analysis is poised to further our understanding of humoral immunity5 and to accelerate antibody discovery and vaccine design26,30,38,39,40,41. The promises and challenges in this emerging field have been reviewed at length5,6,7,22. Using samples from a unique HIV-1-infected donor and two uninfected donors, we examined several critical issues in antibody repertoire analysis. (1) Longer reads. Although sequencing the CDR3 may be sufficient for characterizing the antibody response for some pathogens (e.g. human dengue virus42), sequencing the entire V(D)J-coding region has become a prerequisite for antibody repertoire analysis. Here we demonstrate that the PGM platform can sequence the entire variable domain with an average read length of 550 bp at an estimated 1% of the cost of the 454 platform. The sequencing protocols, heavy and light chain primer sets and bioinformatics pipeline validated in this study provide a set of practical solutions for antibody repertoire analysis based on this platform. (2) Biased vs. unbiased. Gene-specific primers may cause significant bias and thus are not optimal for tracing dynamic antibody responses de novo during natural infection or vaccination. We addressed this issue by adopting 5′-RACE PCR in template preparation, which allowed us to analyze HIV-1 bnAbs from a unique patient sample in the context of the entire repertoire. Using the unbiased repertoire as a reference, we quantified the bias generated by various primer sets currently used in antibody repertoire analysis. (3) Artifacts caused by PCR-based amplification. In a recent review, amplification noise was noted as a major problem in repertoire analysis22.
Molecular tagging has been used to deal with such noise in transcriptome34 and viral RNA35 sequencing but has not yet been extended to immune repertoire sequencing. In this study, we devised a random barcoding strategy to quantify the amplification bias in the analysis of the PGT121 class of antibodies. We also examined the utility of this strategy to correct sequencing errors. Such a strategy will benefit the in-depth analysis of HIV-1-infected donor samples with experimentally isolated bnAbs11,12,13,14,15,16,33. Since this strategy can only remove redundancy at the cDNA level, genomic sequencing of the immunoglobulin gene loci after isotype-specific B cell purification by flow cytometry or other approaches may be required to detect redundancy at the mRNA level10,22. (4) Improved NGS technologies. As NGS technologies continue to mature, new advances based on the available platforms will likely have a direct impact on repertoire analysis. For the PGM platform, we have demonstrated improved throughput, read length and sequence quality from the combined use of new template preparation methods and sequencing chemistry. Together, the technology assessment and development described in this study will help establish a more rigorous foundation for antibody repertoire analysis in biomedical research.

Methods

Human specimens

Peripheral blood mononuclear cells (PBMCs) were obtained from donor 17, an HIV-1 infected donor from the IAVI Protocol G cohort43. All human samples were collected with written informed consent under clinical protocols approved by the Republic of Rwanda National Ethics Committee, the Emory University Institutional Review Board, the University of Zambia Research Ethics Committee, the Charing Cross Research Ethics Committee, the UVRI Science and Ethics Committee, the University of New South Wales Research Ethics Committee, St. Vincent's Hospital and Eastern Sydney Area Health Service, Kenyatta National Hospital Ethics and Research Committee, University of Cape Town Research Ethics Committee, the International Institutional Review Board, the Mahidol University Ethics Committee, the Walter Reed Army Institute of Research (WRAIR) Institutional Review Board and the Ivory Coast Comité National d'Ethique des Sciences de la Vie et de la Santé (CNESVS). The sample from IAVI donor 17 was the source of broadly neutralizing antibodies PGT121–124 and PGT133–13419. The PBMCs of two HIV-1-uninfected donors were obtained from the California Blood Bank according to the Institutional Review Board (IRB) at The Scripps Research Institute. The blood samples were collected with written informed consent from the donors.

Sample preparation using gene-specific primers

Total RNA was extracted from 20 million PBMCs into 30 μl of water with TRIzol Reagent (Life Technologies). The reverse transcription (RT) was performed with SuperScript III (Life Technologies) and oligo(dT)12–18. The cDNA was purified and eluted in 20 μl of elution buffer (NucleoSpin PCR Clean-up Kit, Clontech). The immunoglobulin gene-specific PCRs were performed with Platinum Taq High-Fidelity DNA Polymerase (Life Technologies) in a total volume of 50 μl, with 5 μl of cDNA as template, 1 μl of gene-specific primers and 1 μl of 10 μM reverse primer. The primers each contained an appropriate adaptor sequence (A or trP1) for subsequent PGM sequencing. Two sequencing directions (Fig. S3A) and their respective primer sets were designed (Tables S1–S3). 25 cycles of PCRs were performed and the expected PCR products (~500 bp) were gel purified (Qiagen).

Sample preparation using 5′-RACE PCR

After total RNA extraction, 5′-RACE was performed with FirstChoice RLM-RACE Kit (Life Technologies) and oligo(dT)12–18. The immunoglobulin PCRs were set up in a total volume of 50 μl, with 5 μl of cDNA as template, 1 μl of 5′-RACE primer and 1 μl of 10 μM reverse primer. The 5′-RACE primer contained PGM trP1 or P1 adaptor (P1 is required for isothermal amplification [IA]), while the reverse primer contained a PGM A adaptor (Fig. S3B and Table S4). 25 cycles of PCRs were performed and the expected PCR products (~600 bp) were gel purified (Qiagen).

Sample preparation using gene-specific primers with random barcodes

Total RNA extraction was performed using the same protocol as above. A random barcode of ten degenerate nucleotides was inserted between PGM A adaptor and the reverse primer in the constant domain. RT was performed with SuperScript III (Life Technologies) and the barcoded primers. After cDNA purification, the immunoglobulin gene-specific PCRs were set up in a total volume of 50 μl, with 5 μl of cDNA as template, 1 μl of forward gene-specific primers and 1 μl of 10 μM PGM A adaptor. The forward gene-specific primers each contained a PGM trP1 adaptor for subsequent PGM sequencing (Fig. S3C and Table S2). 25 cycles of PCRs were performed and the expected PCR products (~500 bp) were gel purified (Qiagen).
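
As a rough illustration of how random barcodes can expose amplification redundancy, the sketch below groups reads by their ten-nucleotide barcode and reports how many reads share each tag. The read layout (barcode occupying the first ten bases of an adaptor-trimmed read) and the function names are assumptions made for illustration; this is not the analysis code used in this study.

```python
from collections import Counter

BARCODE_LEN = 10  # ten degenerate nucleotides, as described above

# Theoretical barcode space for ten random positions: 4**10 = 1,048,576.
BARCODE_SPACE = 4 ** BARCODE_LEN


def count_barcodes(reads, barcode_len=BARCODE_LEN):
    """Count reads per barcode.

    Assumption: the barcode is the first `barcode_len` bases of each read.
    """
    counts = Counter()
    for read in reads:
        barcode = read[:barcode_len].upper()
        # Skip reads that are too short or contain ambiguous base calls.
        if len(barcode) == barcode_len and set(barcode) <= set("ACGT"):
            counts[barcode] += 1
    return counts


def redundancy(counts):
    """Return (unique barcodes, total reads, mean reads per barcode)."""
    unique = len(counts)
    total = sum(counts.values())
    mean = total / unique if unique else 0.0
    return unique, total, mean
```

Reads sharing a barcode are candidates for having been amplified from the same cDNA molecule, so the mean number of reads per barcode gives a simple estimate of amplification redundancy. Sequencing errors within the barcode itself would inflate the count of unique barcodes; this sketch does not correct for that.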

Ion Torrent PGM sequencing of antibody libraries

The antibody heavy- and light-chain libraries were quantitated using the Qubit 2.0 Fluorometer with the Qubit dsDNA HS Assay Kit and then used at a ratio of 1:1 except for the first sequencing experiment, in which a ratio of 1:2 was used. The dilution factor required for Ion Torrent PGM template preparation was determined such that the final concentration was 30 pM. The template preparation was performed with either the Ion PGM Template OT2 400 Kit on the Ion OneTouch 2 Instrument overnight or the IA Kit. Template enrichment was performed on the Ion OneTouch ES Instrument the following day. Prior to PGM sequencing, quality control of the template was performed on the Qubit 2.0 Fluorometer with the Ion Sphere™ Quality Control Kit. Sequencing was performed on the Ion PGM System with the Ion PGM™ Sequencing 400 Kit or PGM™ Hi-Q 400 Kit using either an Ion 316 or 318 v2 chip for a total of 850 nucleotide flows (1,100 flows when IA was used). Raw data processing with and without the 3′-end trimming in base calling was compared when evaluating new PGM technologies.
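
The dilution factor for a 30 pM template can be derived from the fluorometric mass concentration and the amplicon length. The sketch below shows one way to do this, assuming the commonly used average mass of 660 g/mol per base pair of double-stranded DNA; the function names and the example values are illustrative and are not taken from the protocol.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (common approximation, an assumption here)


def ng_per_ul_to_pM(conc_ng_per_ul, length_bp):
    """Convert a dsDNA mass concentration (ng/ul) to a molar concentration (pM)."""
    # 1 ng/ul equals 1e-3 g/l; dividing by the molar mass (g/mol) gives mol/l.
    molar = conc_ng_per_ul * 1e-3 / (AVG_BP_MASS * length_bp)
    return molar * 1e12  # mol/l to pmol/l


def dilution_factor(conc_ng_per_ul, length_bp, target_pM=30.0):
    """Fold dilution needed to bring the library to the target concentration."""
    return ng_per_ul_to_pM(conc_ng_per_ul, length_bp) / target_pM
```

For example, under these assumptions a 500 bp library measured at 1 ng/μl corresponds to roughly 3,030 pM, so dilution_factor(1.0, 500) returns approximately 101.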

Bioinformatics analysis of antibody sequencing data

The Antibodyomics 1.0 pipeline described in our previous studies11,12,13,14,15 was used to process all NGS data. After full-length variable domain sequences were obtained, a new filter was used to detect and remove erroneous sequences that may contain swapped gene segments from PCR errors. Specifically, a full-length read was removed from the data set if the V-gene alignment was less than 250 bp (220 bp in the case of IAVI donor 17 light chains). In this study, a modified procedure for the intra-donor phylogenetic analysis13,14 was used to analyze IAVI donor 17 sequences. Two changes were made to improve the accuracy and computational efficiency. Firstly, the neighbor-joining (NJ) method used previously was replaced with the maximum likelihood (ML) method. Secondly, the extraction of somatic variants was automated by using a program to recognize evolutionarily related sequences that reside on the same phylogenetic branch as the input template sequences (e.g. PGT121 class of antibodies).
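
The length-based filter described above can be sketched as follows. The record layout (a dictionary with a "v_align_len" field), the donor and chain labels and the function names are hypothetical and serve only to illustrate the thresholds stated in the text; the Antibodyomics pipeline itself is not reproduced here.

```python
DEFAULT_MIN_V_LEN = 250        # bp, general threshold stated in the text
DONOR17_LIGHT_MIN_V_LEN = 220  # bp, threshold for IAVI donor 17 light chains


def min_v_length(donor, chain):
    """Return the minimum V-gene alignment length for a donor/chain pair."""
    # "IAVI17" and "light" are illustrative labels, not identifiers from the pipeline.
    if donor == "IAVI17" and chain == "light":
        return DONOR17_LIGHT_MIN_V_LEN
    return DEFAULT_MIN_V_LEN


def filter_reads(records, donor, chain):
    """Keep full-length reads whose V-gene alignment is not shorter than the cutoff."""
    cutoff = min_v_length(donor, chain)
    return [r for r in records if r["v_align_len"] >= cutoff]
```

A read is discarded only when its V-gene alignment is strictly shorter than the cutoff, matching the "less than 250 bp" criterion.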

Antibody expression

Antibody production was performed as previously described11,12,13,14,15. Briefly, the bioinformatically selected antibody chain sequences were synthesized (GenScript, Inc) and cloned into the CMV/R expression vector containing the constant regions of IgG. The heavy and light chains identified from IAVI donor 17 PGM sequencing data were paired with their respective partner chain DNAs from the PGT121-class antibodies. Full-length IgGs were expressed by transient transfection of 293F cells and purified using a recombinant protein-A column (Pierce). The expression and sequence information of PGM-derived antibodies are summarized in Table S5.

HIV-1 neutralization assays

Neutralization assays were performed on TZM-bl reporter cells using a six-virus panel as previously described44,45,46. Neutralization curves were fit by nonlinear regression using a five-parameter Hill slope equation. The 50% inhibitory concentration (IC50) is defined as the antibody concentration required to inhibit HIV-1 infection by 50%.
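
To make the IC50 definition concrete, the sketch below evaluates one common parameterization of a five-parameter dose-response curve and solves it for the concentration giving 50% neutralization. The exact functional form and the parameter names (bottom, top, c, hill, asym) are assumptions, since the fitting software and parameterization are not specified here; the curve fitting step itself is not shown.

```python
def five_pl(x, bottom, top, c, hill, asym):
    """Percent neutralization at antibody concentration x (x > 0).

    Assumed form: bottom + (top - bottom) / (1 + (c / x)**hill)**asym
    """
    return bottom + (top - bottom) / (1.0 + (c / x) ** hill) ** asym


def ic50(bottom, top, c, hill, asym, level=50.0):
    """Concentration at which the fitted curve reaches `level` percent neutralization."""
    if not bottom < level < top:
        raise ValueError("curve does not cross the requested level")
    # Invert the curve: (1 + (c / x)**hill)**asym = (top - bottom) / (level - bottom)
    ratio = (top - bottom) / (level - bottom)
    return c / (ratio ** (1.0 / asym) - 1.0) ** (1.0 / hill)
```

In this form the parameter c equals the IC50 only when the curve runs from 0 to 100% and the asymmetry parameter is 1; otherwise the IC50 has to be computed from all five fitted parameters, as done in ic50 above.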