KA-Search: Rapid and exhaustive sequence identity search of known antibodies

Antibodies with similar amino acid sequences, especially across their complementarity-determining regions, often share properties. Finding that an antibody of interest has a similar sequence to naturally expressed antibodies in healthy or diseased repertoires is a powerful approach for the prediction of antibody properties, such as immunogenicity or antigen specificity. However, as the number of available antibody sequences is now in the billions and continuing to grow, repertoire mining for similar sequences has become increasingly computationally expensive. Existing approaches are limited by being low-throughput, non-exhaustive, not antibody-specific, or only able to search against entire chain sequences. Therefore, there is a need for a specialized tool, optimized for a rapid and exhaustive search of any antibody region against all known antibodies, to better utilize the full breadth of available repertoire sequences. We introduce Known Antibody Search (KA-Search), a tool that allows for the rapid search of billions of antibody sequences by sequence identity across either the whole chain, the complementarity-determining regions, or a user-defined antibody region. We show KA-Search in operation on the ∼2.4 billion antibody sequences available in the OAS database. KA-Search can be used to find the most similar sequences from OAS within 30 minutes using 5 CPUs. We give examples of how KA-Search can be used to obtain new insights about an antibody of interest. KA-Search is freely available at https://github.com/oxpig/kasearch.

against natural antibody repertoires to find identical or highly similar antibodies. This is useful, as similar antibodies often share properties, and it can therefore be a powerful method for finding antibodies in nature which have improved properties, such as a better developability profile, reduced immunogenicity or increased affinity (3, 4, 5, 6, 7) (see Figure 1A).

Similarity between antibodies can be measured in different ways. The most common are sequence identity and structural similarity (8, 9). With a protein's function being preserved in the structure, structural similarity is often superior for finding proteins with analogous functions, such as antibodies binding the same epitope (7, 10). However, with orders of magnitude more sequence data available than structural data, a sequence identity search enables the exploration of a much larger space. Sequence data is also more diverse, as next generation sequencing of B-cell receptors (BCR) is routinely being applied to study adaptive immunity, generating sequences from a range of species (11, 12, 13) and from individuals with differing disease states (14, 15). Furthermore, continuous improvements in high-throughput sequencing methods and increased adoption by research labs mean that the amount and diversity of sequence data is rapidly increasing (16, 17).

Searching all this immune repertoire data for similar sequences is useful for a wide range of applications, such as finding the most similar human antibody sequence to an antibody isolated from an animal model during therapeutic development. While the data is freely available, searching it requires extensive post-processing of each source, and a database providing a single point of entry to antibody data to search against is therefore advantageous. One such effort is the Observed Antibody Space (OAS) (18, 19) database, which collates data from publicly available BCR sequencing studies and as of September 2022 contains ∼2.4 billion unpaired heavy and light antibody chains. While the size of OAS is promising from a scientific perspective, its scale and continuous growth, visualized in Figure 1B, make effectively mining it a challenge.

Calculating the sequence identity between antibodies is simple, but without software specially optimised for the task, the computational cost of exhaustively searching OAS and other large antibody sequence databases is becoming prohibitive. There is therefore a need for specialized tools to search this space now and in the future.

There exist many tools for searching large datasets of protein sequences for similar sequences, for example BLASTp (20) and CD-Hit-2d (21), and newer methods such as MMseqs2 (22). However, these tools are designed around searching a diverse set of proteins and often exploit the low similarity between most sequences. To increase speed, both CD-Hit and MMseqs2 prefilter the target sequences to discard low-identity sequences and so reduce the number of pairwise alignments to make, as this is a computationally expensive step (22). However, this is not as effective for closely related sequences such as antibodies, as the prefiltering can remove good hits. Further, each tool uses an alignment method designed for general protein sequences, which can result in unreliable antibody alignments, especially in the highly variable complementarity-determining regions (CDRs). Within the immunoinformatics field this alignment problem is often overcome by using antibody-specific numbering schemes, like the ImMunoGeneTics (IMGT) scheme (23). Another issue with non-antibody-specific tools is the lack of flexibility in their searches.

These tools can only readily be used on the whole antibody chain and not for finding similar sequences based on subregions. Searching for specific, identical regions within antibodies, especially the CDRs, is often used when looking for similar binders (6). With the majority of the residues involved in binding being located in the CDRs, the sequence identity over this region is often more relevant than that of the whole antibody. For some applications, the exact set of residues involved in binding (paratope) may be known. In these cases, searching based on the sequence identity of the paratope may be even more informative (see Figure 1C). An antibody specific tool utilizing antibody numbering schemes for better searches, without prefiltering for an exhaustive search, and with the ability to search user-defined regions would improve our ability to make best use of the antibody sequence data available.
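
To make the idea of a region-restricted identity concrete, the following is a minimal Python sketch, not KA-Search's own code. It assumes the two sequences have already been placed on a common numbering scheme, so that equal column indices refer to the same antibody position and `-` marks an unoccupied position. The function name `region_identity`, the gap symbol, and the choice of denominator (the longer of the two regions) are illustrative assumptions.

```python
import numpy as np

GAP = "-"  # assumed symbol for an unoccupied position in the alignment


def region_identity(query, target, positions=None):
    """Identity of two pre-aligned, equal-length sequences over chosen columns.

    Assumption (not taken from the paper): identity is the number of matching
    residues divided by the number of residues in the longer of the two regions.
    `positions` is a list of column indices; None means the whole chain.
    """
    if len(query) != len(target):
        raise ValueError("sequences must be aligned to the same length")
    q = np.array(list(query))
    t = np.array(list(target))
    if positions is not None:
        idx = np.asarray(positions, dtype=int)
        q, t = q[idx], t[idx]
    # A column counts as a match only if both sequences hold the same residue.
    matches = np.sum((q == t) & (q != GAP))
    longest = max(np.sum(q != GAP), np.sum(t != GAP))
    return float(matches / longest) if longest else 0.0


# Toy aligned fragments (hypothetical, not real antibody alignments).
query = "EVQL-VESGG"
target = "EVQLQVESGA"
whole_chain = region_identity(query, target)
sub_region = region_identity(query, target, positions=[0, 1, 2, 3])
```

Passing the column indices of the CDRs, or of a known paratope, as `positions` restricts the comparison to that region without re-aligning anything.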

Recent efforts to create antibody-specific searching tools include iReceptor (24) and AbDiver (25). iReceptor only allows for a V-, D-, or J-gene search or an exact CDR3 match search. AbDiver uses an antibody numbering scheme to align sequences and allows for both CDR3 and whole chain searches.

AbDiver restricts CDR3 searches to CDR3s with a specified V gene and species of origin, and whole chain searches to sequences with the same length CDR1 and CDR2 and a CDR3 length within ±1. These restrictions [...] Table S1 and cover ∼99.8% of sequences in OAS. The 0.2% of antibody sequences that contain a rare insertion cannot be aligned and are hence compared using a slower method. Every aligned sequence is accompanied by two index values which can be used to retrieve its metadata.

All sequences in OAS (September 2022) are pre-aligned using this method to generate a dataset ready to be used by KA-Search. This results in over 2,070 million heavy and 355 million light chain sequences.
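
The benefit of pre-aligning can be illustrated with a short sketch. It assumes every target has been stored as a fixed-length row of residue codes: the query can then be compared against all rows in one vectorised pass, with no pairwise alignment and no prefiltering. This is a hedged illustration in Python with NumPy rather than the KA-Search implementation. The helper names `encode` and `top_matches`, the `-` gap code, and the identity denominator (the longer of the two sequences) are assumptions made for the example.

```python
import numpy as np

GAP = ord("-")  # assumed code for an unoccupied alignment position


def encode(aligned_seqs):
    """Encode equal-length aligned sequences as a 2D uint8 array, one row each."""
    return np.array([list(s.encode("ascii")) for s in aligned_seqs], dtype=np.uint8)


def top_matches(query, targets, keep=3):
    """Exhaustively score every target row against the query.

    query: 1D uint8 array; targets: 2D uint8 array with the same number of columns.
    Returns (row indices, identities) of the `keep` best hits, best first.
    Assumption: identity = matching residues / length of the longer sequence.
    """
    same = (targets == query) & (targets != GAP)  # broadcast over all rows
    matches = same.sum(axis=1)
    q_len = int((query != GAP).sum())
    t_len = (targets != GAP).sum(axis=1)
    identity = matches / np.maximum(np.maximum(q_len, t_len), 1)
    order = np.argsort(-identity, kind="stable")[:keep]
    return order, identity[order]


# Toy data set of three aligned fragments (hypothetical sequences).
targets = encode(["EVQL-VESGG", "EVQLQVESGA", "QVQLQQSGAE"])
query = encode(["EVQL-VESGG"])[0]
best_rows, best_identities = top_matches(query, targets, keep=2)
```

Because every row is scored, the closest sequence cannot be lost to a prefilter. The returned row indices play the role of a lookup key, much as the index values stored with each aligned sequence allow its metadata to be retrieved.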

Sequences are split into heavy and light chains, and by species information, e.g. human, mouse, rabbit, rat, rhesus, camel and humanized, allowing for faster specific searches. We call this dataset of heavy and light chains OAS-aligned. We also built a subset of the heavy chain dataset, OAS-aligned-small, that contains 144 million human heavy chain sequences. This was generated by removing sequences containing ambiguous residues or seen fewer than three times. OAS-aligned and OAS-aligned-small, and the code to update the datasets or expand them with an in-house dataset, are made freely available with KA-Search (https://github.com/oxpig/kasearch).

[...] The identity between the query sequence and a target sequence is computed using the method as described [...] (Figure 3).

KA-Search takes ∼8 seconds, which is far faster than BLASTp and CD-Hit-2d, ∼103 and ∼82 seconds respectively, but slower than MMseqs2 at ∼3 seconds. In terms of sensitivity, that is, identifying the closest antibody in OAS-test, the conventional tools struggle to find the exact closest match: BLASTp ranks the exact match highest for 19 of the 100 sequences, CD-Hit-2d for three sequences and MMseqs2 for none. When looking for the closest match within the top-100 highest ranked sequences, BLASTp and CD-Hit-2d find the closest match for 65 and 14 sequences, respectively, while MMseqs2 finds none. The average difference in identity between the highest ranked and closest match was also calculated. For BLASTp, CD-Hit-2d and MMseqs2 this difference was on average 3.93%, 11.34% and 7.15% identity, respectively. The difference between the closest match within the 10 million sequences and the best within the top-100 highest ranked sequences was 0.27%, 4.53% and 4.0%, respectively.
While CD-Hit-2d is better than MMseqs2 at finding the closest [...]

Figure 4A shows the disease of the patient the antibody sequence found in OAS comes from and in Figure 4B [...]

[...] and generates better alignments than tools using generic methods. The increased speed allows KA-Search to avoid prefiltering and be exhaustive while still retaining a competitive speed. Avoiding prefiltering is crucial, as current prefiltering techniques greatly reduce sensitivity when searching highly related proteins, such as antibodies, where a single mutation can be of great importance. While pre-aligning the antibody sequences increases search speed, the initial pre-alignment is slow. We therefore provide a pre-aligned dataset of the current OAS, ready to use for searching. This dataset can be extended with future OAS updates or in-house data without the need to re-align the existing sequences.

[...] Returned antibodies with over 90% identity across four different regions were visualised based on A) the disease state of the patient and B) which V- and J-genes the antibody sequence is derived from. C) The variable region of COVOX-253's heavy (purple) and light (grey) chain with the bound spike glycoprotein (beige), derived from PDB 7BEN. The paratope of the heavy chain, which was used to search with KA-Search, is shown in red.