Template-Based Assembly of Proteomic Short Reads For De Novo Antibody Sequencing and Repertoire Profiling

Antibodies can target a vast molecular diversity of antigens. This is achieved by generating a complementary diversity of antibody sequences through somatic recombination and hypermutation. A full understanding of the antibody repertoire in health and disease therefore requires dedicated de novo sequencing methods. Next-generation cDNA sequencing methods have laid the foundation of our current understanding of the antibody repertoire, but these methods share one major limitation in that they target the antibody-producing B-cells, rather than the functional secreted product in bodily fluids. Mass spectrometry-based methods offer an opportunity to bridge this gap between antibody repertoire profiling and bulk serological assays, as they can access antibody sequence information straight from the secreted polypeptide products. As a step toward meeting the challenge of mass spectrometry (MS)-based antibody sequencing, we present a fast and simple software tool (Stitch) to map proteomic short reads to user-defined templates with dedicated features for both monoclonal antibody sequencing and profiling of polyclonal antibody repertoires. We demonstrate the use of Stitch by fully reconstructing two monoclonal antibody sequences with >98% accuracy (including I/L assignment); sequencing a Fab from patient serum isolated by reversed-phase liquid chromatography (LC) fractionation against a high background of homologous antibody sequences; sequencing antibody light chains from the urine of multiple-myeloma patients; and profiling the IgG repertoire in sera from patients hospitalized with COVID-19. We demonstrate that Stitch assembles a comprehensive overview of the antibody sequences that are represented in the dataset and provides an important first step toward analyzing polyclonal antibodies and repertoire profiling.


Introduction
Antibodies bind a wide variety of antigens with high affinity and specificity, playing a major role in the adaptive immune response to infections, but can also target self-antigens to mediate autoimmune diseases [1-4]. Antibodies can mediate immunity by blocking essential steps of a pathogen's replication cycle (e.g. receptor binding and cell entry), triggering the complement system, or activating a specific cell-mediated immune response known as antibody-dependent cellular cytotoxicity. Antibodies elicited in response to infection may persist in circulation for several months and regenerate quickly upon subsequent exposure to the same antigen through a memory B-cell response [5,6]. All this has made antibodies into popular serological markers of pathogen exposure and vaccine efficacy, therapeutic leads for the treatment of cancer and infectious disease, and invaluable research tools for specific labelling and detection of molecular targets.

The large diversity of antigens that antibodies can target comes from a complementary diversity of available antibody sequences and compositions [2,7-11]. Antibodies of most classes consist of a combination of two unique, paired, homologous polypeptides: the heavy chain and the light chain, each consisting of a series of the characteristic immunoglobulin (Ig) domains. Both chains can be subdivided into a variable region, involved in antigen binding, and a constant region, which plays a structural role in oligomerization, complement activation and receptor binding on immune cells. Disulfide bonds covalently link the heavy and light chains, and two copies of this covalent heterodimer are in turn disulfide-linked on the heavy chains to form the characteristic Y-shaped [...]

[...] cDNA sequencing of the antibody-producing B-cell, unlike true de novo protein sequencing methods. Direct protein sequencing methods have focused especially on de novo sequencing of monoclonal antibodies, based on bottom-up analysis of digested peptides [22-31]. With the aid of specialized software packages like DiPS, Supernovo, or ALPS, full and accurate sequences of the heavy and light chains can be reconstructed from the MS/MS spectra of the digested peptides [28,29,31]. The use of multiple complementary proteases and novel hybrid fragmentation techniques provides large benefits in sequence coverage and accuracy in these methods [26]. The obtained sequences are complete and accurate enough to reverse-engineer functional synthetic recombinant antibodies, for instance of monoclonal antibodies from lost hybridoma cell lines. Recently we also demonstrated complete sequencing of a monoclonal antibody isolated from patient serum by reversed-phase LC fractionation and integrated bottom-up and top-down analysis [32]. Plasma proteomics methods to profile polyclonal IgG mixtures and other heterogeneous variant proteins based on de novo methods (SpotLight and LAX) have also recently been described [33-36].

Characterization of polyclonal mixtures and a move toward full profiling of the circulating antibody repertoire remain major outstanding challenges for MS-based antibody sequencing. As a step toward meeting these challenges, we present a fast and simple software tool (Stitch) to map proteomic short reads to user-defined templates with dedicated features for both monoclonal antibody sequencing and profiling of polyclonal antibody repertoires. We demonstrate the use of Stitch by fully reconstructing two monoclonal antibody sequences with >98% accuracy (including I/L assignment); sequencing a Fab from patient serum isolated by reversed-phase LC fractionation against a high background of homologous antibody sequences; sequencing antibody light chains from urine of multiple-myeloma patients; and profiling the IgG repertoire in sera from patients hospitalized with COVID-19.

The experimental de novo antibody sequence reads obtained from a typical LC-MS/MS experiment are 5-40 amino acids in length. Although these reads are relatively short for completely de novo assembly, the rates of somatic hypermutation are typically low enough (1-10%) that the translated germline sequences contained in IMGT are of sufficient homology to accurately place all peptide reads in the correct framework of the heavy and light chains. Based on this notion, we developed Stitch to perform template-based assembly of antibody-derived de novo sequence reads using local Smith-Waterman alignment [37]. Although the program can also perform this task on any user-defined set of templates, using plain FASTA sequences as input, we developed dedicated procedures for both mono- and polyclonal antibody sequencing using de novo reads from PEAKS as input [38]. With input from PEAKS the program can use metadata of individual reads as filtering criteria and determine weighted consensus sequences from overlapping reads, based on global and local quality scores, as well as MS1 peak area. As output Stitch generates an interactive HTML report that contains a quantitative overview of matched reads, alignment scores and a combined peak area for every template. In addition, it generates the final consensus sequences for all matched templates together with a sequence logo, depth of coverage profiles, and a detailed overview of all assembled reads in the context of their templates (see Figure 1). Finally, the output report also contains a complete overview of all reads assigned to the CDRs.
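To make the alignment step concrete, the following is a minimal sketch of local Smith-Waterman scoring of one peptide read against one template. It is not Stitch's implementation: the function name `smith_waterman`, the flat match/mismatch/gap values, and the choice to score I against L as a match are all illustrative assumptions, and a real scoring scheme would normally use an amino acid substitution matrix.

```python
def smith_waterman(read, template, match=2, mismatch=-1, gap=-2):
    """Return (best_score, end) for the best local alignment.

    `end` is the 1-based template position where the best alignment ends.
    Scoring values are placeholders chosen for illustration only.
    """
    def score(a, b):
        # Assumption: I and L are isobaric, so treat them as equivalent.
        if a == b or {a, b} == {"I", "L"}:
            return match
        return mismatch

    rows, cols = len(read) + 1, len(template) + 1
    h = [[0] * cols for _ in range(rows)]  # dynamic-programming matrix
    best, best_end = 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            h[i][j] = max(
                0,  # local alignment: a path may restart anywhere
                h[i - 1][j - 1] + score(read[i - 1], template[j - 1]),
                h[i - 1][j] + gap,  # read residue aligned to a gap
                h[i][j - 1] + gap,  # template residue aligned to a gap
            )
            if h[i][j] > best:
                best, best_end = h[i][j], j
    return best, best_end


# Illustrative template fragment and read (not taken from the datasets here).
template = "EVQLVESGGGLVQPGGSLRLSCAAS"
print(smith_waterman("LVESGGGLVQ", template))  # (20, 13)
```

A read would then be kept for a template only when its best score reaches the user-defined cutoff discussed in the text.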

In its most basic implementation, Stitch can simply match peptide reads to any homologous template in a user-defined database. Peptide reads are placed based on a user-defined cutoff score of the local alignment. When the database contains multiple templates, individual reads may match multiple entries with scores above this cutoff. This scenario is particularly relevant to antibody sequences as the multitude of available V/J/C alleles share a high degree of homology. The program can be set to place reads within all templates above the cutoff score or to place reads only on their single-highest scoring template. With this latter setting, reads with equal scores on multiple templates will be placed at all entries simultaneously. Reads with a single-highest scoring template are thereby defined as 'unique' for the program to track the total 'unique' alignment score and area of every template. Furthermore, Stitch explicitly considers the ambiguity of read placement across multiple homologous template sequences. A multiple sequence alignment is performed on each segment of the user-defined templates to generate a cladogram that represents the homology between the template sequences. Unique reads are placed at the tips of the branches, whereas shared reads are placed at the corresponding branching points of the tree (see Figure 1D). Stitch outputs the consensus sequence of every matched template based on all overlapping reads, accounting for frequency, global quality score and MS peak area when PEAKS data are used as input. The generated consensus sequence defaults to the template sequence in regions without coverage. Positions corresponding to I/L residues are defaulted to L in PEAKS data, as the two residues have identical masses and are therefore indistinguishable in most MS experiments.
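The placement rule and the consensus step can be sketched as follows. This is a simplified illustration under stated assumptions rather than Stitch's actual code: the names `place_reads` and `consensus`, the input layouts, and the example scores are hypothetical. Reads are treated as ungapped, and each read carries a single precomputed weight standing in for whatever combination of frequency, quality score and peak area is used in practice. A score equal to the cutoff is assumed to pass, and a consensus L is changed to I where the template has I.

```python
from collections import defaultdict


def place_reads(scores, cutoff, best_only=True):
    """Assign reads to templates from precomputed alignment scores.

    scores: {read_id: {template_id: score}} (hypothetical input layout).
    Returns {read_id: (sorted template ids, is_unique)}.
    """
    placements = {}
    for read_id, per_template in scores.items():
        # Assumption: a score equal to the cutoff still counts as a match.
        passing = {t: s for t, s in per_template.items() if s >= cutoff}
        if not passing:
            continue  # read matches no template well enough
        top = max(passing.values())
        best = sorted(t for t, s in passing.items() if s == top)
        chosen = best if best_only else sorted(passing)
        # 'Unique' = exactly one template holds the highest score.
        placements[read_id] = (chosen, len(best) == 1)
    return placements


def consensus(template, placed_reads):
    """Weighted per-position vote over ungapped reads placed on a template.

    placed_reads: iterable of (start, sequence, weight); start is 0-based.
    """
    votes = [defaultdict(float) for _ in template]
    for start, sequence, weight in placed_reads:
        for offset, residue in enumerate(sequence):
            position = start + offset
            if 0 <= position < len(template):
                votes[position][residue] += weight
    result = []
    for template_residue, tally in zip(template, votes):
        if not tally:
            result.append(template_residue)  # no coverage: keep template
            continue
        residue = max(tally, key=tally.get)
        # Reads carry L for both I and L; follow the template where it has I.
        if residue == "L" and template_residue == "I":
            residue = "I"
        result.append(residue)
    return "".join(result)


# Hypothetical scores: r1 ties on two templates, r2 has a single best match.
scores = {
    "r1": {"IGHV3-23": 30, "IGHV3-30": 30, "IGHV1-2": 12},
    "r2": {"IGHV3-23": 25, "IGHV3-30": 18},
}
print(place_reads(scores, cutoff=10))
print(consensus("DIQMTQSPSS", [(1, "LQMTQ", 2.0), (3, "MSQSP", 5.0)]))
# DIQMSQSPSS
```

In the example, the tied read `r1` is placed on both top-scoring templates and is not counted as unique. The second read outweighs the first at the position where they disagree, and the uncovered positions at both ends are taken from the template.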
The consensus sequences in the output follow the matched template in these instances, changing the position to isoleucine when suggested by the template sequence.

Stitch allows templates to be defined in multiple separate groups, such that for antibody sequences we can sort peptide reads from heavy and light chains, and distinguish peptides from the V-, J- and C-segments of either chain. We have defined separate template databases for IGHV-IGHJ-IGHC, as well as IGLV-IGLJ-IGLC (with all kappa and lambda sequences combined in the same databases). The templates correspond to the germline sequences included in IMGT but filtered to create a reduced and non-redundant set of amino acid sequences (templates for human, mouse and rabbit antibodies are currently provided and the clean-up procedure to generate the non-redundant databases from additional species is included in the program). Templates for the D-segment are not taken from IMGT as they are too short and variable for any meaningful read-matching. In addition to the Ig segments, a separate decoy database for common contaminants of cell culture medium, plasma/serum, and proteomics sample preparation can be defined. The output report includes consensus sequences for all matched germline templates with annotation of the CDRs, as well as a quantitative overview of how each germline template is [...]

To demonstrate the use of Stitch, we assembled de novo peptide reads to reconstruct the full heavy and light chain sequences of three different monoclonal antibodies. First, we used the human-mouse chimeric therapeutic antibody Herceptin (also known as Trastuzumab). Herceptin is composed of mouse CDR sequences placed within a human IgG1 framework and targets the Her2 receptor in treatment of a variety of cancers [26,40,41]. Second, we reconstructed the sequence of the anti-FLAG-M2 antibody, which is a mouse antibody targeting the DYKDDDDK epitope used to label and purify recombinant proteins [26,42]. Alignment of the assembled output sequences reveals an overall accuracy of 98% and 99% (including I/L assignments) for Herceptin and anti-FLAG-M2, respectively (see Figure 2). A close-up view of the CDRH3 reconstructions demonstrates how the missing D-segment in the heavy chain is obtained through the two-step [...]

[...] we aim to implement in the future. It is currently also limited to plain FASTA and PEAKS data as input reads, but we aim to adapt it to data formats from additional de novo sequencing software in the future. Current limitations regarding polyclonal antibody profiling will have to be solved with improved experimental approaches: obtaining longer sequence reads will reduce ambiguity in the correlation of sequence variants against the database of homologous templates, and top-down MS/MS of intact Fabs/antibodies or additional cross-linking MS workflows will have to elucidate the heavy-light chain pairings in the antibody mixture.

By enabling antibody sequencing and profiling from the purified secreted product, the development of Stitch contributes to an emerging new serology, in which bulk measures of antigen binding and neutralization can be directly related to the composition and sequence of a polyclonal antibody mixture. Direct MS-based sequencing and profiling of secreted antibodies thereby bridges the gap between bulk serological assays and B-cell sequencing approaches. These developments promise to provide a better understanding of antibody-mediated immunity in natural infection, vaccination and autoimmune disorders.

Monoclonal antibodies and COVID-19 serum IgG - sample preparation
Herceptin and anti-FLAG-M2 were obtained as described in reference [26]; F59 was purified from patient serum as described in reference [32]. Convalescent serum from COVID-19 patients was obtained under the Radboud UMC Biobank protocol; IgG was purified with Protein G affinity resin (Millipore). Samples were denatured in 2% sodium deoxycholate (SDC), 200 mM Tris-HCl, 10 mM Tris(2-carboxyethyl)phosphine (TCEP), pH 8.0 at 95 °C for 10 min, followed by 30 min incubation at 37 °C for reduction. The samples were then alkylated by adding iodoacetic acid to a final concentration of 40 mM and incubated in the dark at room temperature for 45 min. For Herceptin and anti-FLAG-M2 a 3 μg sample was then digested by one of the following proteases: trypsin (Promega), chymotrypsin (Roche), lysN (Thermo Fisher Scientific), lysC (FUJIFILM Wako Pure Chemical Corporation), gluC (Roche), aspN (Roche), aLP (Sigma-Aldrich), thermolysin (Promega), and elastase (Sigma-Aldrich) in a 1:50 ratio (w/w) in a total volume of 100 μL of 50 mM ammonium bicarbonate at 37 °C for 4 h. After digestion, SDC was removed by adding 2 μL of formic acid (FA) and centrifugation at 14 000 g for 20 min. Following centrifugation, the supernatant containing the peptides was collected for desalting on a 30 μm Oasis HLB 96-well plate (Waters). The F59 monoclonal isolated from patient serum was digested in parallel by four proteases: trypsin, chymotrypsin, thermolysin and pepsin. Digestion with trypsin, chymotrypsin and thermolysin was done with 0.1 μg protease following the SDC protocol described above. For pepsin digestion, a urea buffer was added to a total volume of 80 μL, 2 M urea, 10 mM TCEP. The sample was denatured for 10 min at 95 °C followed by reduction for 20 min at 37 °C. Next, iodoacetic acid was added to a final concentration of 40 mM and incubated in the dark for 45 min at room temperature for alkylation of free cysteines. For pepsin digestion 1 M HCl