Data mining patented antibody sequences

Konrad Krawczyk; Andrew Buchanan; Paolo Marcatili

doi:10.1101/2020.11.26.389866

Abstract

Patent literature should be a reflection of thirty years of engineering efforts in developing monoclonal antibody therapeutics. Such information is potentially valuable for rational antibody design. Patents however are not designed to convey scientific knowledge, but rather legal protection. It is unclear whether antibody information from patent documents, such as antibody sequences could be useful for the therapeutic antibody sphere in conveying engineering know-how rather than act as legal reference only. To assess the utility of patent data for therapeutic antibody engineering, we quantified the amount of antibody sequences in patents destined for medicinal purposes and how well they reflect the primary sequences of therapeutic antibodies in clinical use. We identified 16,526 patent families from major jurisdictions (e.g. USPTO and WIPO) that contained antibody sequences. These families held 245,109 unique antibody chains (135,397 heavy chains and 109,712 light chains) that we compiled in our Patented Antibody Database (PAD, http://naturalantibody.com/pad). We find that antibodies make up a non-trivial proportion of all patent amino acid sequence depositions (e.g. 10.95% of USPTO Full Text database). Our analysis of the 16,526 families demonstrates that the volume of patent documents with antibody sequences is growing with the majority of documents classified as containing antibodies for medicinal purposes. We further studied the 245,109 antibody chains from patent literature to reveal that they very well reflect the primary sequences of antibody therapeutics in clinical use. This suggests that patent literature could serve as a reference of previous engineering efforts to improve rational antibody design.

Introduction

The binding versatility of antibodies has been used for medicinal purposes making them the most successful group of biotherapeutics¹. Typical timelines involved in bringing these molecules to the market are slow, however more and more molecules are approved in the US and EU each year¹. Successful exploitation of antibodies by either experimental^2,3 or computational techniques⁴ relies on our ability to understand what makes a successful antibody-based therapeutic^5,6.

Therapeutic antibodies on the market and in late stage clinical trials have been previously studied by experimental^2,7 and computational⁶ approaches to identify properties that make a successful biotherapeutic. Such studies² however only focused on 137 approved or post-phase-I antibodies (Clinical Stage Therapeutics, or CSTs), which is a small dataset in the light of the mutational space available to antibodies⁸. CSTs, are high-quality data-points that are end results of a long engineering process of selecting a molecule from a number of viable candidates. The single successful therapeutic molecule is therefore only partially representative of the engineering process. Full public disclosure of the efforts involved in developing a therapeutic antibody constituting intermediate sequences and selection decisions is not desirable because of the commercial value of such know-how, which needs to be legally protected.

Because of the need to protect the know-how involved in engineering therapeutic antibodies, relevant information needs to be disclosed in patent documents. Previous approaches to extract information on patent antibody landscape⁹ or specific antibody formats¹⁰ focused on keyword and patent classification searches. One can broadly discern between patents on antibody techniques (e.g. phage display, humanization) and novel antibody molecules. It is the patents on novel molecules that could be of particular engineering interest as these reflect the constructs that might find their way into the clinic. The disclosure of antibody sequence and target information¹¹ in such patents reveals to a certain extent the engineering choices as such molecules have been subjected to myriad prior tests to be suitable candidates for expensive legal protection and further clinical trials.

The purpose of patent literature is not conveying scientific knowledge, but legal protection. In this work we assessed the utility of patent data for therapeutic antibody engineering efforts by establishing the extent to which antibodies from patents reflect therapeutics in clinical use. For this purpose, we identified patent documents that contained antibody sequences, to quantify how many of these were destined for medicinal purposes and how well they reflect advanced stage therapeutics.

Results

Antibodies account for a non-trivial proportion of sequences deposited in patent documents

We identified documents with antibody sequences by downloading data from four data sources: USPTO (http://uspto.gov), WIPO (http://wipo.int), DDBJ¹², and EBI¹³. Choice of the data sources was motivated by the availability of biological sequences and coverage of patent documents worldwide. Biological sequence information is not universally available in patent documents in all jurisdictions¹⁴. In certain cases, the data is not freely available, but rather accessible for a fee (e.g. European Patent Office). Primary access to biological sequences in machine-readable format is freely available from the USPTO and WIPO. USA is the largest pharmaceutical market¹⁵, compelling pharmaceutical companies developing a novel antibody therapeutic to seek patent protection within the jurisdiction of USPTO. Similarly, it is common to seek protection under the auspices of WIPO PCT system in order to spread the coverage of the patent documents across many jurisdictions worldwide. Furthermore, data from certain major jurisdictions, such as EPO, JPO and KPO are available via third parties such as DDBJ¹² and EBI¹³. Therefore, we argue that datasets made available via USPTO, DDBJ, WIPO and EBI provide a reasonable coverage of the worldwide antibody sequence patents.

We extracted raw sequence data from USPTO, WIPO, DDBJ and EBI on Jan 30^th 2020, with the particulars of parsing the heterogenous sources described in Methods. From each dataset we extracted raw, redundant amino acid and nucleic acids sequences. Sequences containing exclusively nucleotides were translated to amino acids using IgBlast¹⁶ as described previously¹⁷. Raw amino acid sequences were analyzed using ANARCI¹⁸ to identify antibody variable region chains (VH, VL, including scFvs). We report the number of raw sequences analyzed and the resulting identified antibodies in Table 1.

View this table:

Table 1.

Published biological sequences and proportion thereof identified as antibody chains. We extracted raw sequences from USPTO (divided between the full text, FT, and long listing repository PSIPS), DDBJ, WIPO and EBI. The total number of raw sequences is given in column Total Raw. Of these we show how many were identified by ANARCI as containing an antibody chain (column Ab-identified). In the column “% Total” we report the proportion of identified antibody sequences out of the total of raw sequences. Both Total Raw and Ab-identified columns report the redundant number of sequences so as to exemplify the volume of antibody depositions in patent sequences – we report the number of unique heavy (H) and light (L) chains in the parentheses in column “Ab-identified”.

We find a higher proportion of sequences identified as antibodies in amino acid depositions which account for as many as 10.95% and 12.09% of USPTO-FT and DDBJ datasets respectively. In fact, large portion of sequences deposited in patents are very short; for instance in USPTO-FT only 1,811,694 (32.73%) amino acid sequences are longer than 50 amino acids, and antibodies make up 30.50% of these. This stands to show that antibodies make up a non-trivial volume of all the sequences deposited in patent documents.

Antibody sequence data in patents is however redundant to a large extent when one considers a unique sequence to be defined by its variable region. Combining all the non-redundant VH and VL sequences from our datasets we count 245,109 unique antibody domains (135,397 heavy chains and 109,712 light chains). This suggests that many antibody variable region sequences are listed as part of multiple patent documents. Not all of these sequences however are guaranteed to have been developed for medical applications, which can be determined by analyzing the text content of patent documents.

Patent landscape of documents containing antibody sequences

We analyzed the text content of patents containing antibody sequences so as to establish what proportion of these list molecules for medicinal purposes. We connected all the redundant antibody sequences to their patent documents and identified a patent family for each. A patent family can be regarded as identifying documents with the same subject matter across several jurisdictions. Altogether our 245,109 sequences are distributed among 16,526 patent families. We extracted the metadata from the patent documents, such as titles, abstracts, inventors and classifications. We used this information to determine the proportion of patents destined for medicinal applications by analyzing their classifications, and whether the inventors and listed targets resemble entities and molecules associated with development of monoclonal antibody therapies.

Most patent documents citing antibody sequences are destined for medicinal applications

We analyzed the patent classifications of the 16,526 patent families that indicate the purpose of the invention described in each document. We extracted the Cooperative Patent Classification (CPC, developed by USPTO and EPO, https://www.cooperativepatentclassification.org/) designations from the documents as this was the most common listed scheme, covering 15,951 (96.52%) out of 16,526 families. Patent classifications according to CPC have a section, class, subgroup, main group and a subgroup (e.g. classification C07K16/2866 has section C, class 07, subclass K, main group 16 and subgroup 2866). We divided the 15,951 families according to their CPC classifications excluding the subgroup (e.g. C07K16/2866 becomes C07K16) to reveal the general categories the documents fall into and present results in Table 2.

View this table:

Table 2.

Subclasses of the patent classifications. Most common subclasses associated with patents including antibody sequences according to the Cooperative Patent Classification (CPC, https://www.cooperativepatentclassification.org/). There were 15,951 patents containing antibodies with CPC classification and the percentage of families in each class is expressed as a proportion of this number.

Subgroup C07K16 that indicates immunoglobulins, is the most common classification, present in 13,790 (86.45%) of the 15,951 patent families. Families listing antibodies for medicinal purposes (A61K39) account for 9,459 (59.30%) of the 15,951 families. Furthermore the more general medicinal categorization A61K (preparations for medical, dental or toilet purposes) accounts for 11,398 (71.45%) of the 15,951 patent families. This indicates that the majority of documents citing antibody sequences are developed for medicinal purposes, such as novel treatments or diagnostics. This is well reflected by the organizations that submit such patent applications, where 9 out of top 10 and 69 out of top 100 are pharmaceutical companies associated with development of monoclonal antibody therapies for a range of targets and disease indications (see Supplementary Section 1).

Targets of antibodies in patent documents correspond to known therapeutic targets

We checked to what extent antibody targets reported in patent literature reflect those of known therapeutic antibodies. Each patent family in PAD was scanned for antibody target (see Methods). Therapeutic antibodies in clinical use together with their associated targets were compiled from the WHO lists of International Nonproprietary Names¹⁹ (INNs, e.g. list 122²⁰) IMGT ²¹, Antibody Society (http://www.antibodysociety.org) and Thera-SAbDab^22,23, resulting in 563 unique INNs. We grouped the targets by number of patent families and therapeutic antibodies they were associated with. We present the results for top 30 targets sorted by the highest number of patent families in Table 3.

View this table:

Table 3.

Top 30 targets in patent documents. We extracted the targets of the antibodies in patent documents and present top 30 ranked by the number of families where they were mentioned. For each target, we show the number of patent families mentioning the target (#Families), the number of therapeutics on the market/in the clinic against it (#Therapeutics) and the cumulative number of therapeutics covered by the top targets (#Therapeutics cumulative).

The number of patent families associated with a target appears to correspond to a larger number of therapeutic antibodies against the same target. Top 10 targets sorted by number of their patent families account for 114 (20.24%), top 30 account for 223 (39.60%) and top 100 account for 369 (65.54%) out of 563 therapeutics. Therefore, targets from patents listing antibody sequences provide a reasonable reflection of the targets of currently available therapeutic antibodies. In fact the greater number of patent families per target can be associated with an earlier date of the said target being mentioned in a patent document (Figure 1A). It does not mean however that the patent space for monoclonal antibodies is saturated as the number of new targets mentioned is increasing (Figure 1B). This suggests that studying patent documents including antibody sequences could provide an early indication of their targets and thus activity in the field of therapeutic antibodies.

Figure 1.

Target usage in patent documents reporting antibody sequences. A) Relationship between number of patent families per target and the earliest mention of the target in patent documents containing antibodies. For each target, we noted the earliest date among patent documents citing it and grouped these into 4-year intervals. Within each interval we noted the total number of patent families for a given target and plotted the aggregate for each time interval. B) For each 4-year interval, we plot the number of new target names that were first introduced in a patent document at that time.

There is a growing number of patent documents associated with antibody sequences

We analyzed the timestamps associated with patents in order to check whether there is a growing trend in releasing documents with antibody sequences and what proportion thereof is made up of molecules for medicinal indications. Each patent family lists several dates corresponding to the activity associated with the patent. We noted the earliest and most recent dates for each patent family to reflect the original submission dates and the most up-to-date activity respectively.

We plotted the earliest dates for each patent family in our dataset which indicates that the number of patent documents containing antibody sequences is steadily rising (Figure 2). The most recent dates associated with the same patent documents (Figure 2) shows a more acute rise since 2016 which indicates strong activity within the earlier submitted patents. Since not all patent families are explicitly destined for medicinal applications, we have plotted the corresponding earliest and most recent dates for the 9,459 documents classified as medicinal preparations containing antibodies (Supplementary Figure 1) which recapitulates the increasing number of patent documents being released.

Figure 2.

The volume of patent family documents listing antibody sequences per year. For each patent family we noted the earliest and most recent dates of any documents associated with it and the aggregate numbers of these are given by red and blue bars respectively. The apparent low activity in 2020 can be attributed to the fact that data contributed in 2020 only account for January that year.

Increasing patent activity in documents listing antibody sequences for medicinal indications is in line with the rising approval rates for antibody-based biologics^1,24. Given that the patents are an early sign of approvals to come, it suggests that we can expect more biologics in the clinics in the foreseeable future. Since majority of such patent documents are indeed listing antibodies for medicinal purposes, the broad characteristics of the molecules listed in patent documents could provide an indication of the engineering choices in their design.

The sequence landscape of patented antibodies

Antibody sequences found in patent documents could reflect the broad decisions taken by engineers shaping these molecules before they arrive in the clinic. However, not all antibody sequences found in patent documents are destined for medicinal applications. For this reason we analyzed the broad sequence characteristics of antibodies from patent documents to establish to what extent they are a reflection of therapeutic antibodies in clinical use and vice-versa. We performed this analysis by looking at all of our antibodies from all patents (AllPatAb) and just the subset associated with documents classified as containing antibodies for medicinal applications (MedPatAb). Altogether AllPatAb consisted of 135,397 heavy chains and 109,712 light chains whereas MedPatAb consisted of 93,067 heavy chains (68.73% of all heavy chains) and 67,667 light chains (67.67% of all light chains).

Most antibody sequences from patents align to human and mouse germline V region genes

We checked the patterns of organism-specific germline gene usage in antibody sequences originating from patent documents. Since organism reporting is not consistent in patent documents, we aligned the sequences in PAD to HMMs created from IMGT germline sequences for fifteen organisms: human, mouse, alpaca, rhesus, rabbit, rat, pig, cow, macaque, zebrafish, trout, salmon, dog, horse and chicken. For each organism and germline, we noted the total number of patent antibody sequences aligning to a given germline as well as the number of families they originated from.

We show the number of MedPatAb sequences that aligned to one of our fifteen organisms in Table 4 with the corresponding distribution for AllPatAb sequences in supplementary Table 2. Majority of the unique heavy sequences from patents for medicinal indications align to human germlines (72.80% of unique sequences), followed by mouse (15.39% of unique sequences). The same holds true for light chains with 67.72% of MedPatAb sequences aligning to human and 19.68% to mouse germlines. Antibodies aligning to either mouse or human germlines are most frequently found within protein families. Human-aligned heavy and light chains can be identified in 75.69% and 69.76% patent families respectively. Mouse-aligned heavy and light chains can be found in 52.18% and 53.93% patent families respectively. This broad proportion is also reflected in all the antibody sequences from patents (AllPatAb), indicating that the medicinal patent classification does not skew the broad trend of majority of patented sequences aligning to human and mouse germlines. The alignment to those two organisms reasonably reflects the human focus of antibody development and the rodent antibodies that often server as a basis for humanized therapeutics²⁵.

View this table:

Table 4.

Most common V-region gene species antibodies from patents aligned to. Antibodies from patent documents destined for medicinal indications (MedPatAb) were aligned to fifteen IMGT-derived²⁸ V region germlines from human, mouse, alpaca, rhesus, rabbit, rat, pig, cow, macaque, zebrafish, trout, salmon, dog, horse and chicken. We noted the number of patent sequences that aligned to the given species germline (#Unique Sequences) and the number of patent families (#Patent Families) these originated from.

Germline V gene usage of antibodies from patent documents corresponds to a large extent with germline V gene usage of therapeutic antibodies

Given that majority of antibodies from patents align to human germlines, we stratified these by the particular human V-region genes. In Table 4 and 5 we report the most common V-region genes medicinal patent sequences align to (corresponding numbers for all patents can be found in Supplementary Table 3). We compare the distribution of germline genes in patents to the germline usage in therapeutic antibodies to show to what extent patent submissions reflect current therapeutics.

View this table:

Table 5.

Top-20 most common human V-region genes antibodies from patents aligned to. For each patent antibody sequence for medicinal applications (MedPatAb) that aligned to human germline V-regions, we noted the IMGT V-region gene. We show the number of unique sequences that aligned to a given human V-region gene (Per Sequence) and number of patent families these originated from (Per Family). We also show the number of therapeutic antibody sequences in clinical use that align to the given V-region gene (Per Therapeutic).

The top heavy and light V region genes are identical among medicinal patented sequences, medicinal patents and therapeutics. The most used human heavy chain V-gene by sequence, family and therapeutic usage is IGHV3-23, accounting for 25.29% of all patented medicinal sequences, occurs in 15.56% of all medicinal families and accounts for 16.38% of therapeutics. The most frequently observed human light chain germline usage is IGKV1-39, accounting for 14.63% of all patented medicinal sequences, 12.84% of all medicinal patent families and 18.42% of therapeutic antibodies. Some of the most commonly observed genes might be the result of specific platform choices²⁶ that might attempt to recapitulate naturally observed frequencies⁸ or focus on a small set of scaffolds²⁷. The most frequently used germlines are broadly corresponding between patented sequences, medicinal patents and therapeutics, even though the ordering might not be the same. This indicates that the patent literature well reflects the choices of V-region genes of therapeutic antibodies in clinical use.

Antibodies from patent documents well reflect therapeutic antibody sequences, with the exception of CDR-H3 lengths

The germline gene distribution of antibody sequences from patents appears to reflect the germline gene distribution of therapeutic sequences, though such comparison is not fit to indicate the actual sequence discrepancies between the two datasets. We checked to what extent patented sequences are a reflection of therapeutics by pairwise sequence comparisons between the two datasets.

For each of the 563 therapeutics we checked if we can find a perfect length-matched hit in PAD. For 546 (96.98%) out of 563 therapeutics we found a perfect length-matched hit in PAD. For the remaining 17 therapeutics without perfect matches, we found that the PAD version used for this study (Jan 2020) was out of date or there existed only high sequence identity matches as compared to Lens.org but not perfect ones (Supplementary Table 4).

For each antibody sequence from a patent, we noted the highest IMGT sequence identity to any therapeutic and present the results stratified by AllPatAb and MedPatAb sequences in Figure 3. Large proportion of PAD sequences align with high sequence identity to one of the 563 therapeutics. Total of 21,772 (16.08%) of heavy chain AllPatAb sequences and 17,378 (18,67%) of heavy chain MedPatAb sequences have matches of 90% sequence identity or better to a therapeutic sequence. Total of 44,919 (40,94%) of light chain AllPatAb sequences and 31,241 (46,16%) of light chain MedPatAb sequences have matches of 90% sequence identity or better to a therapeutic sequence. Altogether this illustrates that many sequences in patent documents well reflect the therapeutic antibody sequences currently in the clinical use. However there is also a large number of sequences with matches below 90% sequence identity to either heavy or light therapeutic heavy chain. This could reflect sequences that are only currently in development or never found their way to the clinic as a result of failure, abandonment or otherwise.

Figure 3.

Closest matches of antibody sequences from patents to therapeutic antibodies. For each sequence in AllPatAb and AllPatMed we noted the closest IMGT sequence identity to a therapeutic antibody. A) Distribution of heavy chain sequence identities to closest therapeutic heavy chain. B) Distribution of light chain sequence identities to closest therapeutic light chain.

Perfect matches between full variable region PAD sequences and therapeutics implicitly indicates good correspondence in the CDR region. Arguably, the most diverse and thus the most engineered portion of an antibody is its heavy chain CDR3 region, CDR-H3^29,30. The length of CDR-H3 has been previously shown to be a good estimator of overall developability of an antibody, with therapeutic antibodies having shorter CDR-H3⁶. We contrasted the CDR-H3 lengths found in PAD to those in therapeutic, structural and natural human antibodies. We extracted CDR-H3s from antibody structures found in the Protein Data Bank³¹ that are regularly collected by the Structural Antibody Database²² (SAbDab). The natural human antibodies were sourced from a deep Next Generation Sequencing (NGS) study by Briney et al.⁸ downloaded from the Observed Antibody Space database¹⁷. We found a total of 58,383 unique CDR-H3s in all PAD sequences (AllPatAb), 37,247 unique CDR-H3s in antibodies from medicinal patents (MedPatAb), 422 unique CDR-H3s in therapeutics, 2021 unique CDR-H3s in structures and 73,217,582 unique CDR-H3s from natural human antibodies. We plotted the distribution of lengths for each of these datasets in Figure 4.

Figure 4.

Distribution of CDR-H3 lengths. We plotted the distribution of CDR-H3 lengths from therapeutic antibodies (Antibody Therapeutics), antibodies from patents in PAD (Patents, stratified between AllPatAb and MedPatAb), structures of antibodies from the Protein Data Bank (Antibody Structures) and natural human antibodies from a deep Next Generation Sequencing study (Natural Human Antibodies).

The distribution of CDR-H3 lengths from patent sequences does not appear to be different between AllPatAb and MedPatAb sequences. Therapeutic CDR-H3s have the shortest median lengths, followed by structures, patents and natural human antibodies. The shorter lengths in structures might be reflective of large number of artificial/therapeutic antibodies that can be found in SAbDab²³. Lengths of CDR-H3s from patent sequences appear to be mid-range between therapeutic and natural antibodies. This suggests that patent antibody sequences might reflect certain amount of engineering of these molecules as they do not follow the natural distribution, normally favoring longer lengths. Nevertheless patent antibody CDR-H3 do not recapitulate the therapeutic CDR-H3 length distribution. Since vast majority of therapeutic CDRs can be found in sequences from patent documents, the discrepancy with the therapeutic length distribution can suggest certain engineering choices faced by those molecules not revealed in this study.

Patent landscape of single domain antibodies

Our earlier results revealed that majority of antibodies from patents align well to human or mouse germline V region genes, which recapitulates the widespread use of ‘traditional’ antibody format containing both heavy and light chains. The third most commonly identified organism was alpaca (Table 4), which suggests the single domain antibody (sdAb) format. The single domain antibodies are found naturally in camelids (camels, lamas, alpacas) and because of the lack of light chain are believed to have more favorable biophysical properties than antibodies, without detriment to their antigen recognition ability^32,33. They have been commercialized as therapeutics by Ablynx under the protected name Nanobody^® with first single domain antibody drug, Caplacizumab, recently approved³⁴. Allowing the first sdAb drug in clinical use holds the promise of more molecules in this format in the near future³⁵, which can be reflected by patents.

We identified the total number of patent families in PAD having sdAbs to quantify the possible number of molecules in this format in development, providing an orthogonal view to currently known therapeutic candidates³⁵. Patent families were classified as containing sdAbs if they were classified as C07K2317/569 (Single domain, e.g. dAb, sdAb, VHH, VNAR or nanobody^®) or C07K2317/22 (from camelids, e.g. camel, llama or dromedary) or if they contained sequences aligning to alpaca sdAb germlines. Using the classification method we identified 845 families and using the alpaca germline method we found 867 families. There was an overlap between the two, resulting in total of 1,176 families identified as containing sdAbs or 7.11% of all of our 16,526 families in PAD. Of the 1,176 families 586 (49.82%) were classified as containing antibodies for therapeutic purposes.

The top 30 organizations sorted by the number of families containing sdAb sequences (Supplementary Table 5) well reflect the companies developing biotherapeutics in this format³⁵. The list however contains more organizations than those currently reported as developing sdAb therapies, suggesting that the field might be more nuanced, notwithstanding wide use of sdAbs for imaging and diagnostic purposes³⁶. From the list of known sdAb therapeutics, our list does not contain AdAlta and Ossianix that report shark single domain antibodies, sequences of which we do not identify. In fact, not all sdAbs that we identified follow the natural camelid format, as there exist sequences of single domain human antibodies (e.g. US2011097339).

We checked the total number of sequences in PAD that could be identified as sdAbs. The 1,176 patent families that we identified as containing sdAbs hold a total of 48,849 unique heavy chain sequences. Not all of such sequences are sdAbs as the patent document might have included traditional antibodies as well. Therefore we calculated the number of sequences that were identified as alpaca sdAb germlines and sequences found in one of the 1,176 families but containing only heavy chains. We found a total of 12,914 unique sequences aligning to sdAb alpaca germlines and 13,368 unique sequences found in 1,176 sdAb families containing heavy chains only. There was an overlap between the two sequence sets and combining them resulted in a total of 15,792 possible sdAb sequences, which makes up 11,66% of all the 135,397 heavy chain sequences in PAD. Of the 15,792 possible sdAb sequences, 8,342 (52.82%) were found in patent documents classified as containing antibodies for medicinal purposes. Therefore, single domain antibody sequences appear to make up a non-trivial proportion of antibody sequences found in patents which could be indicative of upcoming sdAb clinical trials and approvals.

In order to provide an indication of the possible activity to come in the field of single domain antibodies, we plotted the earliest and most recent dates associated with any of the 586 sdAb patent families classified as having antibodies for medicinal applications (Figure 5). There appears to be a steady increase in the number of patent documents including sdAb sequences for medicinal purposes (same holds true for all 1,176 patent families containing sdAb sequences, Supplementary Figure 2). Given the steady rise in the number of patents containing sdAbs and recent approval of Caplacizumab, one might expect more molecules in this format in clinical use in the future.

Figure 5.

Patents including single domain antibody sequences over time. For each of the 586 patent families in PAD identified as having sdAbs and classified as containing antibodies for medicinal purposes, we noted the earliest and most recent dates, given as red and blue bars respectively.

Discussion

Successful exploitation of antibodies as therapeutics relies on ever deeper understanding of the biology of these molecules. Many features of therapeutic antibodies can be found in naturally sourced sequences⁵, however effective biotherapeutic requires bespoke engineering for clinical safety and developability². We proposed that such biotherapeutic engineering knowledge could be reflected in patent documents containing antibody sequences.

Our analysis of patents containing antibody sequences revealed that majority of such documents are explicitly developed as containing antibodies for medicinal purposes. Vast majority of therapeutic antibody sequences can be found in patent documents. Further to that, many sequences from patents are within close sequence identity of therapeutic antibodies that are approved or undergoing clinical trials. This suggests that thousands of antibodies from patents could provide a reflection of engineering choices that were made during development therapeutic molecules. Such data could offer an integrated collection of insights into the features that were designed into antibodies to make them successful therapeutics.

This information could be readily exploited by computational methods⁴. It was previously demonstrated that only 137 Clinical Stage Therapeutic (CST) antibodies can provide insights into developability of these molecules⁶. As demonstrated by our analysis, there is an order of magnitude more patented sequences that are close sequence matches to such CSTs. These could indicate different variants and possible features of biotherapeutics, creating a more wholesome picture of what makes a successful biotherapeutic.

Employing antibody sequences from patents however is not without its caveats. Unlike academic literature, patent documents are not designed to convey knowledge but rather offer legal protection. This might result in wide claims on sequence identities to proposed antibody variants that could obfuscate the resulting therapeutic sequence. As we demonstrated, antibodies in patents provide a good reflection of the therapeutics either approved or in clinical use. This would suggest that even though claims could be quite wide on sequence space, many of them appear to fall within the sequence identity orbit of currently available therapeutics. Therefore, certain antibody sequences from patents could broadly reflect the engineering choices in the design of these molecules.

The already large amount of antibodies from patent documents will most likely keep rising, as we demonstrated by the growth in the number of such documents in the recent years. In fact studying such patents could provide an early indication of approvals to come³⁷. This might be specifically true in the sphere of single domain antibodies. There is just one such approved therapeutic on the market³⁴ and ten in clinical trials³⁵ (in 2019). We find a great number of sdAb patents suggesting that the field might further develop in the near future, providing an alternative to traditional monoclonal antibody therapy.

The ongoing increase of patents containing antibodies for medicinal indications will keep contributing to an already ample body of knowledge of antibody engineering. This data could be used to offer insights into the engineering choices in designing these molecules, accelerating delivery of biotherapeutics to the clinic.

Methods

Identifying antibody sequences in patent documents

Raw biological sequence data associated with patent documents was downloaded from four freely available accessible services: the United States Patent and Trademark Office (USPTO, https://www.uspto.gov/), the DNA Data Bank of Japan (DDBJ)¹², European Bioinformatics Institute (EBI)¹³ and World Intellectual Property Organization (WIPO, https://www.wipo.int/). The USPTO data were divided between the full text submissions (https://bulkdata.uspto.gov/) and lengthy sequence listings (http://seqdata.uspto.gov/). Using a custom Python script, the USPTO full text submissions were scanned for nucleotide or amino acid sequences and listings containing these, whereas USPTO PSIPS contained sequence listings only. Using a custom Python script the WIPO FTP documents (ftp://tp.wipo.int/pub/published_pct_sequences) were scanned for nucleotide and amino acid sequences. In both cases of USPTO and WIPO, differences in sequence listing formats from different time periods was accounted for by developing a custom Python parser for each case, transferring all the raw sequences and their associated patent numbers into FASTA format. Data from DDBJ and EBI are available through their ftp services (ftp://ftp.ddbj.nig.ac.jp/ddbj_database and ftp://ftp.ebi.ac.uk/pub/databases respectively) and were readily available in FASTA format.

The nucleotide entries were scanned for antibody sequence by using IGBLAST¹⁶ as described previously¹⁷, and their amino acid translations were noted. The raw amino acid sequences were scanned for presence of antibodies using ANARCI¹⁸. We only kept those amino acid sequences where all three CDR regions and all four framework regions could be identified and that contained only 20 canonical amino acids. This resulted in a dataset of IMGT-numbered amino acid sequences, associated with their patent numbers.

Patent metadata acquisition and antibody target identification

Different patent numbers can point to the same document, submitted across several jurisdictions, termed ‘patent family’. For each patent number associated with a sequence we identified the patent family by using the Open Patent Services API v. 3.2 (developers.epo.org/ops-v3-2). Using the Open Patent Services API, we downloaded the metadata associated with each family which included: family identifier, title, description, patent numbers with associated dates and applicants.

The patent metadata was used for antibody target identification. Even though there exist certain CPC classifications indicating what the antibody should bind to, we noted that they were not universally present. Therefore we performed manual target annotation, supported by Named Entity Recognition (NER). We applied the GENIA NER³⁸ parser to the titles and abstracts of patent families. As with scientific publications titles and abstracts can be expected to reflect the most important content of the document³⁹, in particular pertaining to the binding mode of the reported antibody. The resulting annotations accelerated the manual process of annotating each of our patent families with possible targets.

Web Service

We make the data accessible for academic non-commercial use via web service accessible at http://naturalantibody.com/pad. Users can search for antibody sequences by pasting the amino acids of the variable domains. The input sequence is IMGT-numbered. The sequences in PAD are IMGT-aligned to the input sequence and the top 50 best sequence identity matches are displayed.

Disclosure Statement

The authors declare no conflict of interest

Footnotes

http://naturalantibody.com/pad

References

1.↵
Kaplon H, Muralidharan M, Schneider Z, Reichert JM. Antibodies to watch in 2020. MAbs 2020; doi: 10.1080/19420862.2019.1703531
OpenUrl CrossRef
2.↵
Jain T, Sun T, Durand S, Hall A, Houston NR, Nett JH, Sharkey B, Bobrowicz B, Caffry I, Yu Y, et al. Biophysical properties of the clinical-stage antibody landscape. Proc Natl Acad Sci U S A 2017; doi: 10.1073/pnas.1616408114
OpenUrl Abstract/FREE Full Text
3.↵
Mason DM, Friedensohn S, Weber CR, Jordi C, Wagner B, Meng S, Reddy ST. Deep learning enables therapeutic antibody optimization in mammalian cells. bioRxiv 2019; doi: 10.1101/617860
OpenUrl Abstract/FREE Full Text
4.↵
Norman RA, Ambrosetti F, Bonvin AMJJ, Colwell LJ, Kelm S, Kumar S, Krawczyk K. Computational approaches to therapeutic antibody design: established methods and emerging trends. Brief Bioinform 2019; doi: 10.1093/bib/bbz095
OpenUrl CrossRef
5.↵
Krawczyk K, Raybould MIJ, Kovaltsuk A, Deane CM. Looking for therapeutic antibodies in next-generation sequencing repositories. MAbs 2019; doi: 10.1080/19420862.2019.1633884
OpenUrl CrossRef PubMed
6.↵
Raybould MIJ, Marks C, Krawczyk K, Taddese B, Nowak J, Lewis AP, Bujotzek A, Shi J, Deane CM. Five computational developability guidelines for therapeutic antibody profiling. Proc Natl Acad Sci U S A 2019; doi: 10.1073/pnas.1810576116
OpenUrl Abstract/FREE Full Text
7.↵
Koenig P, Lee CV, Walters BT, Janakiraman V, Stinson J, Patapoff TW, Fuh G, Wilson IA. Mutational landscape of antibody variable domains reveals a switch modulating the interdomain conformational dynamics and antigen binding. Proc Natl Acad Sci U S A. 2017 doi: 10.1073/pnas.1613231114
OpenUrl Abstract/FREE Full Text
8.↵
Briney B, Inderbitzin A, Joyce C, Burton DR. Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature 2019; doi: 10.1038/s41586-019-0879-y
OpenUrl CrossRef PubMed
9.↵
Petering J, McManamny P, Honeyman J. Antibody therapeutics - the evolving patent landscape. N. Biotechnol. 2011; doi: 10.1016/j.nbt.2011.03.023
OpenUrl CrossRef
10.↵
Dumet C, Pottier J, Gouilleux-Gruart V, Watier H. Insights into the IgG heavy chain engineering patent landscape as applied to IgG4 antibody development. MAbs 2019; doi: 10.1080/19420862.2019.1664365
OpenUrl CrossRef
11.↵
Cole P. Patentability of genes: A European union perspective. Cold Spring Harb Perspect Med 2015; doi: 10.1101/cshperspect.a020891
OpenUrl Abstract/FREE Full Text
12.↵
Tateno Y. DNA Data Bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Res 2002; doi: 10.1093/nar/gks1195
OpenUrl CrossRef PubMed Web of Science
13.↵
Li W, Mcwilliam H, de la Torre AR, Grodowski A, Benediktovich I, Goujon M, Nauche S, Lopez R. Non-redundant patent sequence databases with value-added annotations at two levels. Nucleic Acids Res 2009; doi: 10.1093/nar/gkp960
OpenUrl CrossRef PubMed
14.↵
Jefferson OA, Köllhofer D, Ehrich TH, Jefferson RA. Transparency tools in gene patenting for informing policy and practice. Nat Biotechnol 2013; doi: 10.1038/nbt.2755
OpenUrl CrossRef
15.↵
Shibata S, Uemura R, Suzuki T. Comparative Analysis Between the Top-Selling Drugs in the Japanese Pharmaceutical Market and Those in the United States, the United Kingdom, France, and Germany. Ther Innov Regul Sci 2016; doi: 10.1177/2168479015604182
OpenUrl CrossRef
16.↵
Ye J, Ma N, Madden TL, Ostell JM. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res 2013; doi: 10.1093/nar/gkt382
OpenUrl CrossRef PubMed Web of Science
17.↵
Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. J Immunol 2018; doi: 10.4049/jimmunol.1800708
OpenUrl Abstract/FREE Full Text
18.↵
Dunbar J, Deane CM. ANARCI: Antigen receptor numbering and receptor classification. Bioinformatics 2015; doi: 10.1093/bioinformatics/btv552
OpenUrl CrossRef PubMed
19.↵
Jones TD, Carter PJ, Plückthun A, Vásquez M, Holgate RGE, Hötzel I, Popplewell AG, Parren PWHI, Enzelberger M, Rademaker HJ, et al. The INNs and outs of antibody nonproprietary names. MAbs 2016; doi: 10.1080/19420862.2015.1114320
OpenUrl CrossRef PubMed
20.↵
International nonproprietary names for pharmaceutical substances (INN): proposed INN: list 122. WHO Drug Inf 2019; Available From https://www.who.int/medicines/publications/druginformation/innlists/PL122.pdf
21.↵
Poiron C, Wu Y, Ginestoux C, Ehrenmann F, Duroux P, Lefranc M-P. IMGT/mAb-DB: the basis of IMGT data of therapeutic monoclonal antibodies. Bull Cancer 2010; Available from http://www.imgt.org/mAb-DB/
22.↵
Dunbar J, Krawczyk K, Leem J, Baker T, Fuchs A, Georges G, Shi J, Deane CM. SAbDab: the structural antibody database. Nucleic Acids Res 2013; doi: 10.1093/nar/gkt1043
OpenUrl CrossRef PubMed Web of Science
23.↵
Raybould MIJ, Marks C, Lewis AP, Shi J, Bujotzek A, Taddese B, Deane CM. Thera-SAbDab: the Therapeutic Structural Antibody Database. Nucleic Acids Res 2019; doi: 10.1093/nar/gkz827
OpenUrl CrossRef PubMed
24.↵
Walsh G. Biopharmaceutical benchmarks 2010. Nat Biotechnol 2010; doi: 10.1038/nbt0910-917
OpenUrl CrossRef PubMed
25.↵
Almagro JC, Fransson J. Humanization of antibodies. Front Biosci 2008; Available from https://www.bioscience.org/2008/v13/af/2786/fulltext.htm
26.↵
Almagro JC, Pedraza-Escalona M, Arrieta HI, Pérez-Tapia SM. Phage Display Libraries for Antibody Therapeutic Discovery and Development. Antibodies 2019; doi: 10.3390/antib8030044
OpenUrl CrossRef
27.↵
Lee C V., Liang WC, Dennis MS, Eigenbrot C, Sidhu SS, Fuh G. High-affinity human antibodies from phage-displayed synthetic Fab libraries with a single framework scaffold. J Mol Biol 2004; doi: 10.1016/j.jmb.2004.05.051
OpenUrl CrossRef PubMed Web of Science
28.↵
Lefranc M paule, Giudicelli V, Regnier L, Duroux P. IMGT, a system and an ontology that bridge biological and computational spheres in bioinformatics. Brief Bioinform 2008; doi: 10.1093/bib/bbn014
OpenUrl CrossRef PubMed Web of Science
29.↵
Xu JL, Davis MM. Diversity in the CDR3 region of V(H) is sufficient for most antibody specificities. Immunity 2000; 13:37–45. doi: 10.1016/s1074-7613(00)00006-6
OpenUrl CrossRef PubMed Web of Science
30.↵
Knappik A, Ge L, Honegger A, Pack P, Fischer M, Wellnhofer G, Hoess A, Wölle J, Plückthun A, Virnekäs B. Fully synthetic human combinatorial antibody libraries (HuCAL) based on modular consensus frameworks and CDRs randomized with trinucleotides. J Mol Biol 2000; doi: 10.1006/jmbi.1999.3444
OpenUrl CrossRef PubMed Web of Science
31.↵
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic Acids Res 2000; doi: 10.1093/nar/28.1.235
OpenUrl CrossRef PubMed Web of Science
32.↵
Muyldermans S, Cambillau C, Wyns L. Recognition of antigens by single-domain antibody fragments: The superfluous luxury of paired domains. Trends Biochem. Sci.2001; doi: 10.1016/s0968-0004(01)01790-x
OpenUrl CrossRef
33.↵
Bannas P, Hambach J, Koch-Nolte F. Nanobodies and nanobody-based human heavy chain antibodies as antitumor therapeutics. Front. Immunol.2017; doi: 10.3389/fimmu.2017.01603
OpenUrl CrossRef
34.↵
Duggan S. Caplacizumab: First Global Approval. Drugs 2018; doi: 10.1007/s40265-018-0989-0
OpenUrl CrossRef
35.↵
Morrison C. Nanobody approval gives domain antibodies a boost. Nat. Rev. Drug Discov.2019; doi: 10.1038/d41573-019-00104-w
OpenUrl CrossRef
36.↵
De Meyer T, Muyldermans S, Depicker A. Nanobody-based products as research and diagnostic tools. Trends Biotechnol 2014; doi: 10.1016/j.tibtech.2014.03.001
OpenUrl CrossRef PubMed
37.↵
Pereira CG, Lavoie JR, Garces E, Basso F, Dabić M, Porto GS, Daim T. Forecasting of emerging therapeutic monoclonal antibodies patents based on a decision model. Technol Forecast Soc Change 2019; doi: 10.1016/j.techfore.2018.11.002
OpenUrl CrossRef
38.↵
Tsuruoka Y, Tsujii J. Bidirectional inference with the easiest-first strategy for tagging sequence data. In: HLT/EMNLP 2005 - Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. 2005; Available from https://www.aclweb.org/anthology/H05-1059.pdf
39.↵
Volanakis A, Krawczyk K. SciRide Finder: A citation-based paradigm in biomedical literature search. Sci Rep 2018; doi: 10.1038/s41598-018-24571-0
OpenUrl CrossRef

View the discussion thread.

Posted December 03, 2020.

Download PDF

Supplementary Material

Data/Code

Citation Tools

Subject Area

Bioinformatics

Subject Areas

All Articles

Animal Behavior and Cognition (5217)
Biochemistry (11755)
Bioengineering (8757)
Bioinformatics (29209)
Biophysics (14980)
Cancer Biology (12103)
Cell Biology (17416)
Clinical Trials (138)
Developmental Biology (9425)
Ecology (14187)
Epidemiology (2067)
Evolutionary Biology (18314)
Genetics (12246)
Genomics (16807)
Immunology (11870)
Microbiology (28101)
Molecular Biology (11599)
Neuroscience (60995)
Paleontology (452)
Pathology (1872)
Pharmacology and Toxicology (3238)
Physiology (4961)
Plant Biology (10429)
Scientific Communication and Education (1683)
Synthetic Biology (2887)
Systems Biology (7341)
Zoology (1651)

[1] 1.↵
Kaplon H, Muralidharan M, Schneider Z, Reichert JM. Antibodies to watch in 2020. MAbs 2020; doi: 10.1080/19420862.2019.1703531
OpenUrl CrossRef

[2] 2.↵
Jain T, Sun T, Durand S, Hall A, Houston NR, Nett JH, Sharkey B, Bobrowicz B, Caffry I, Yu Y, et al. Biophysical properties of the clinical-stage antibody landscape. Proc Natl Acad Sci U S A 2017; doi: 10.1073/pnas.1616408114
OpenUrl Abstract/FREE Full Text

[3] 3.↵
Mason DM, Friedensohn S, Weber CR, Jordi C, Wagner B, Meng S, Reddy ST. Deep learning enables therapeutic antibody optimization in mammalian cells. bioRxiv 2019; doi: 10.1101/617860
OpenUrl Abstract/FREE Full Text

[4] 4.↵
Norman RA, Ambrosetti F, Bonvin AMJJ, Colwell LJ, Kelm S, Kumar S, Krawczyk K. Computational approaches to therapeutic antibody design: established methods and emerging trends. Brief Bioinform 2019; doi: 10.1093/bib/bbz095
OpenUrl CrossRef

[5] 5.↵
Krawczyk K, Raybould MIJ, Kovaltsuk A, Deane CM. Looking for therapeutic antibodies in next-generation sequencing repositories. MAbs 2019; doi: 10.1080/19420862.2019.1633884
OpenUrl CrossRef PubMed

[6] 6.↵
Raybould MIJ, Marks C, Krawczyk K, Taddese B, Nowak J, Lewis AP, Bujotzek A, Shi J, Deane CM. Five computational developability guidelines for therapeutic antibody profiling. Proc Natl Acad Sci U S A 2019; doi: 10.1073/pnas.1810576116
OpenUrl Abstract/FREE Full Text

[7] 7.↵
Koenig P, Lee CV, Walters BT, Janakiraman V, Stinson J, Patapoff TW, Fuh G, Wilson IA. Mutational landscape of antibody variable domains reveals a switch modulating the interdomain conformational dynamics and antigen binding. Proc Natl Acad Sci U S A. 2017 doi: 10.1073/pnas.1613231114
OpenUrl Abstract/FREE Full Text

[8] 8.↵
Briney B, Inderbitzin A, Joyce C, Burton DR. Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature 2019; doi: 10.1038/s41586-019-0879-y
OpenUrl CrossRef PubMed

[9] 9.↵
Petering J, McManamny P, Honeyman J. Antibody therapeutics - the evolving patent landscape. N. Biotechnol. 2011; doi: 10.1016/j.nbt.2011.03.023
OpenUrl CrossRef

[10] 10.↵
Dumet C, Pottier J, Gouilleux-Gruart V, Watier H. Insights into the IgG heavy chain engineering patent landscape as applied to IgG4 antibody development. MAbs 2019; doi: 10.1080/19420862.2019.1664365
OpenUrl CrossRef

[11] 11.↵
Cole P. Patentability of genes: A European union perspective. Cold Spring Harb Perspect Med 2015; doi: 10.1101/cshperspect.a020891
OpenUrl Abstract/FREE Full Text

[12] 12.↵
Tateno Y. DNA Data Bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Res 2002; doi: 10.1093/nar/gks1195
OpenUrl CrossRef PubMed Web of Science

[13] 13.↵
Li W, Mcwilliam H, de la Torre AR, Grodowski A, Benediktovich I, Goujon M, Nauche S, Lopez R. Non-redundant patent sequence databases with value-added annotations at two levels. Nucleic Acids Res 2009; doi: 10.1093/nar/gkp960
OpenUrl CrossRef PubMed

[14] 14.↵
Jefferson OA, Köllhofer D, Ehrich TH, Jefferson RA. Transparency tools in gene patenting for informing policy and practice. Nat Biotechnol 2013; doi: 10.1038/nbt.2755
OpenUrl CrossRef

[15] 15.↵
Shibata S, Uemura R, Suzuki T. Comparative Analysis Between the Top-Selling Drugs in the Japanese Pharmaceutical Market and Those in the United States, the United Kingdom, France, and Germany. Ther Innov Regul Sci 2016; doi: 10.1177/2168479015604182
OpenUrl CrossRef

[16] 16.↵
Ye J, Ma N, Madden TL, Ostell JM. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res 2013; doi: 10.1093/nar/gkt382
OpenUrl CrossRef PubMed Web of Science

[17] 17.↵
Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. J Immunol 2018; doi: 10.4049/jimmunol.1800708
OpenUrl Abstract/FREE Full Text

[18] 18.↵
Dunbar J, Deane CM. ANARCI: Antigen receptor numbering and receptor classification. Bioinformatics 2015; doi: 10.1093/bioinformatics/btv552
OpenUrl CrossRef PubMed

[19] 19.↵
Jones TD, Carter PJ, Plückthun A, Vásquez M, Holgate RGE, Hötzel I, Popplewell AG, Parren PWHI, Enzelberger M, Rademaker HJ, et al. The INNs and outs of antibody nonproprietary names. MAbs 2016; doi: 10.1080/19420862.2015.1114320
OpenUrl CrossRef PubMed

[20] 20.↵
International nonproprietary names for pharmaceutical substances (INN): proposed INN: list 122. WHO Drug Inf 2019; Available From https://www.who.int/medicines/publications/druginformation/innlists/PL122.pdf

[21] 21.↵
Poiron C, Wu Y, Ginestoux C, Ehrenmann F, Duroux P, Lefranc M-P. IMGT/mAb-DB: the basis of IMGT data of therapeutic monoclonal antibodies. Bull Cancer 2010; Available from http://www.imgt.org/mAb-DB/

[22] 22.↵
Dunbar J, Krawczyk K, Leem J, Baker T, Fuchs A, Georges G, Shi J, Deane CM. SAbDab: the structural antibody database. Nucleic Acids Res 2013; doi: 10.1093/nar/gkt1043
OpenUrl CrossRef PubMed Web of Science

[23] 23.↵
Raybould MIJ, Marks C, Lewis AP, Shi J, Bujotzek A, Taddese B, Deane CM. Thera-SAbDab: the Therapeutic Structural Antibody Database. Nucleic Acids Res 2019; doi: 10.1093/nar/gkz827
OpenUrl CrossRef PubMed

[24] 24.↵
Walsh G. Biopharmaceutical benchmarks 2010. Nat Biotechnol 2010; doi: 10.1038/nbt0910-917
OpenUrl CrossRef PubMed

[25] 25.↵
Almagro JC, Fransson J. Humanization of antibodies. Front Biosci 2008; Available from https://www.bioscience.org/2008/v13/af/2786/fulltext.htm

[26] 26.↵
Almagro JC, Pedraza-Escalona M, Arrieta HI, Pérez-Tapia SM. Phage Display Libraries for Antibody Therapeutic Discovery and Development. Antibodies 2019; doi: 10.3390/antib8030044
OpenUrl CrossRef

[27] 27.↵
Lee C V., Liang WC, Dennis MS, Eigenbrot C, Sidhu SS, Fuh G. High-affinity human antibodies from phage-displayed synthetic Fab libraries with a single framework scaffold. J Mol Biol 2004; doi: 10.1016/j.jmb.2004.05.051
OpenUrl CrossRef PubMed Web of Science

[28] 28.↵
Lefranc M paule, Giudicelli V, Regnier L, Duroux P. IMGT, a system and an ontology that bridge biological and computational spheres in bioinformatics. Brief Bioinform 2008; doi: 10.1093/bib/bbn014
OpenUrl CrossRef PubMed Web of Science

[29] 29.↵
Xu JL, Davis MM. Diversity in the CDR3 region of V(H) is sufficient for most antibody specificities. Immunity 2000; 13:37–45. doi: 10.1016/s1074-7613(00)00006-6
OpenUrl CrossRef PubMed Web of Science

[30] 30.↵
Knappik A, Ge L, Honegger A, Pack P, Fischer M, Wellnhofer G, Hoess A, Wölle J, Plückthun A, Virnekäs B. Fully synthetic human combinatorial antibody libraries (HuCAL) based on modular consensus frameworks and CDRs randomized with trinucleotides. J Mol Biol 2000; doi: 10.1006/jmbi.1999.3444
OpenUrl CrossRef PubMed Web of Science

[31] 31.↵
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic Acids Res 2000; doi: 10.1093/nar/28.1.235
OpenUrl CrossRef PubMed Web of Science

[32] 32.↵
Muyldermans S, Cambillau C, Wyns L. Recognition of antigens by single-domain antibody fragments: The superfluous luxury of paired domains. Trends Biochem. Sci.2001; doi: 10.1016/s0968-0004(01)01790-x
OpenUrl CrossRef

[33] 33.↵
Bannas P, Hambach J, Koch-Nolte F. Nanobodies and nanobody-based human heavy chain antibodies as antitumor therapeutics. Front. Immunol.2017; doi: 10.3389/fimmu.2017.01603
OpenUrl CrossRef

[34] 34.↵
Duggan S. Caplacizumab: First Global Approval. Drugs 2018; doi: 10.1007/s40265-018-0989-0
OpenUrl CrossRef

[35] 35.↵
Morrison C. Nanobody approval gives domain antibodies a boost. Nat. Rev. Drug Discov.2019; doi: 10.1038/d41573-019-00104-w
OpenUrl CrossRef

[36] 36.↵
De Meyer T, Muyldermans S, Depicker A. Nanobody-based products as research and diagnostic tools. Trends Biotechnol 2014; doi: 10.1016/j.tibtech.2014.03.001
OpenUrl CrossRef PubMed

[37] 37.↵
Pereira CG, Lavoie JR, Garces E, Basso F, Dabić M, Porto GS, Daim T. Forecasting of emerging therapeutic monoclonal antibodies patents based on a decision model. Technol Forecast Soc Change 2019; doi: 10.1016/j.techfore.2018.11.002
OpenUrl CrossRef

[38] 38.↵
Tsuruoka Y, Tsujii J. Bidirectional inference with the easiest-first strategy for tagging sequence data. In: HLT/EMNLP 2005 - Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. 2005; Available from https://www.aclweb.org/anthology/H05-1059.pdf

[39] 39.↵
Volanakis A, Krawczyk K. SciRide Finder: A citation-based paradigm in biomedical literature search. Sci Rep 2018; doi: 10.1038/s41598-018-24571-0
OpenUrl CrossRef

Data mining patented antibody sequences

Abstract

Introduction

Results

Antibodies account for a non-trivial proportion of sequences deposited in patent documents

Patent landscape of documents containing antibody sequences

Most patent documents citing antibody sequences are destined for medicinal applications

Targets of antibodies in patent documents correspond to known therapeutic targets

There is a growing number of patent documents associated with antibody sequences

The sequence landscape of patented antibodies

Most antibody sequences from patents align to human and mouse germline V region genes

Germline V gene usage of antibodies from patent documents corresponds to a large extent with germline V gene usage of therapeutic antibodies

Antibodies from patent documents well reflect therapeutic antibody sequences, with the exception of CDR-H3 lengths

Patent landscape of single domain antibodies

Discussion

Methods

Identifying antibody sequences in patent documents

Patent metadata acquisition and antibody target identification

Web Service

Disclosure Statement

Footnotes

References

Citation Manager Formats

Subject Area