PAbFold: Linear Antibody Epitope Prediction using AlphaFold2

Defining the binding epitopes of antibodies is essential for understanding how they bind to their antigens and perform their molecular functions. However, while determining linear epitopes of monoclonal antibodies can be accomplished utilizing well-established empirical procedures, these approaches are generally labor- and time-intensive and costly. To take advantage of the recent advances in protein structure prediction algorithms available to the scientific community, we developed a calculation pipeline based on the localColabFold implementation of AlphaFold2 that can predict linear antibody epitopes by predicting the structure of the complex between antibody heavy and light chains and target peptide sequences derived from antigens. We found that this AlphaFold2 pipeline, which we call PAbFold, was able to accurately flag known epitope sequences for several well-known antibody targets (HA / Myc) when the target sequence was broken into small overlapping linear peptides and antibody complementarity determining regions (CDRs) were grafted onto several different antibody framework regions in the single-chain antibody fragment (scFv) format. To determine if this pipeline was able to identify the epitope of a novel antibody with no structural information publicly available, we determined the epitope of a novel anti-SARS-CoV-2 nucleocapsid targeted antibody using our method and then experimentally validated our computational results using peptide competition ELISA assays. These results indicate that the AlphaFold2-based PAbFold pipeline we developed is capable of accurately identifying linear antibody epitopes in a short time using just antibody and target protein sequences. This emergent capability of the method is sensitive to methodological details such as peptide length, AlphaFold2 neural network versions, and multiple-sequence alignment database. PAbFold is available at https://github.com/jbderoo/PAbFold.


Supporting Information Contents:
Table 1A: Full sequence information for all scFv and antigen proteins
Table 1B: MSA for the scFv chimera variants, with loop and linker region annotation
Figure 1: Comparison of AlphaFold2 Myc scFv predictions to Fab crystal structure
Figure 2: AlphaFold2 predictions for scFv interacting with full-length antigen proteins
Figure 3: Illustration of AlphaFold2 predicted peptide placements and confidence thereof
Figure 4: Structure superposition analysis for Myc and HA scFv variants relative to reference crystal structures
Figure 5: In the context of Myc, testing prediction performance versus sliding peptide window parameters
Figure 6: Testing detection of Myc epitope inserted into three locations in an unrelated third-party protein
Figure 7: HA epitope prediction for three anti-HA scFvs
Figure 8: Comparing prediction performance for mBG17 using multimer-v2 and multimer-v3
Figure 9: Comparing prediction performance for Myc using multimer-v2 and multimer-v3 and the new MSA
Figure 10: Comparing prediction performance for HA using multimer-v2 and multimer-v3 and the new MSA
Figure 11: Comparison of 9 major systems after recreating MSAs locally with downloaded databases
Figure 12: Comparison of 9 major systems after recreating MSAs with ColabFold after MMSEQS rebuilt the old databases for our use
Figure 13: Comparison of the 9 major systems without using any MSA, using only the single sequence
Figure 14: Comparison of the contents of the MSA for Myc-2E2 after being generated by the 4 major methods: old generation, new generation, local generation, and MMSEQS rebuilt specialty server.
Figure 15: Overview table of whether or not each MSA generation type could accurately detect the experimentally determined epitope in each of the 9 major systems.

Supplemental Figure 3 (continued): Although AlphaFold2 frequently placed peptides on the opposite side of the CDRH3 from the Myc epitope (grey), it was not confident in these peptide placements (low, small, blue pLDDT spheres). In contrast, some of the peptides placed around the CDRH3, and in positions similar to the native epitope (grey), were placed with higher pLDDT confidence (increasingly large spheres trending from green to yellow to orange and red). D) The top-ranked peptide as predicted by PAbFold, with sequence QKLISEEDLL (red), and the crystal structure solution of the Myc epitope (grey).

Supplemental Figure 4: RMSD comparison (all numbers have units of Å) for AlphaFold2 predicted scFv structures compared to reference crystal structures, A) 2or9 (Myc) and B) 1frg (HA), respectively. The loops of the scFv more closely mimic the crystal structure when the epitope peptide is present. The backbone also undergoes subtle changes during docking that make it slightly more similar to the crystal structure. These structures were aligned by identifying the framework residues in all structures, then aligning the framework region Cα atoms with the Kabsch algorithm (49, 50). The heavy- and light-chain CDR loops were specifically excluded from this alignment, as was the flexible linker that connects the heavy and light chains, because of the inherently floppy, unstructured nature of this region. After aligning the framework regions of the AlphaFold2 predicted structures and the crystal structures (2or9 and 1frg, respectively), the RMSD over these Cα atoms was calculated and is reported in the first column, 'BB Cα RMSD'.
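The framework superposition described above can be illustrated with a minimal NumPy sketch of the Kabsch algorithm. This is not the authors' analysis script (those are in the Zenodo deposition); the function names `kabsch`, `transform`, and `rmsd` are hypothetical, and the coordinates are synthetic random points rather than data from this work. It shows how a rotation and translation fitted on framework atoms only can then be applied to other atoms without further alignment.

```python
import numpy as np

def kabsch(mobile, target):
    """Least-squares rotation R and translation t mapping mobile onto target.

    Both inputs are (N, 3) arrays of matched atoms (e.g. framework C-alpha).
    """
    mobile_center = mobile.mean(axis=0)
    target_center = target.mean(axis=0)
    P = mobile - mobile_center
    Q = target - target_center
    H = P.T @ Q                                  # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = target_center - R @ mobile_center
    return R, t

def transform(coords, R, t):
    """Apply a fitted rotation and translation to any (M, 3) coordinate set."""
    return coords @ R.T + t

def rmsd(a, b):
    """Root-mean-square deviation between two matched (N, 3) coordinate sets."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

# Synthetic demonstration (assumption: random points stand in for real atoms).
rng = np.random.default_rng(0)
framework_ref = rng.normal(size=(50, 3))
loops_ref = rng.normal(size=(12, 3))

angle = 0.7
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0,            0.0,           1.0]])
shift = np.array([5.0, -3.0, 2.0])

framework_pred = framework_ref @ rot_z.T + shift
loops_pred = loops_ref @ rot_z.T + shift + 0.5   # loops deliberately displaced

# Fit on the framework only, then reuse the same transform for the loops.
R, t = kabsch(framework_pred, framework_ref)
framework_rmsd = rmsd(transform(framework_pred, R, t), framework_ref)
loop_rmsd = rmsd(transform(loops_pred, R, t), loops_ref)
print(f"framework RMSD: {framework_rmsd:.3f}  loop RMSD: {loop_rmsd:.3f}")
```

Fitting on the framework and evaluating elsewhere keeps loop or peptide deviations from being absorbed into the superposition itself.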

Without further alignment, loop placement was analyzed with an all-backbone RMSD, calculated between the C, Cα, N, and O backbone atoms of all residues in the scFv that were not used for the framework superimposition. This RMSD is reported in the second column as 'Loop all backbone RMSD'. Finally, to investigate predicted peptide placement and potential scFv:epitope interactions, an all-atom RMSD was calculated between the crystal structure and the AF2 predicted peptide structure (no additional alignment). Because the apo structure lacks a peptide position, this is only reported in the 'Docked' category, in the third column labeled 'Epitope all atom RMSD'. One script was written for each scFv (Myc and HA); because this analysis is not a key part of PAbFold, the scripts can be found in the Zenodo deposition of our data (https://zenodo.org/records/10884181). Briefly, this analysis reveals that all three HA scFv variants have predicted framework regions and loop regions in the apo structures that closely match the reference structure (0.56-0.58 Å and 1.21-1.39 Å, respectively). Accordingly, when the cognate epitope peptide is present, it can be placed with relatively high accuracy for all three scFvs (3.1-3.2 Å), with only small changes in the loops (1.39 Å to 1.25 Å, 1.32 Å to 1.26 Å, and 1.21 Å to 1.27 Å). In contrast, the apo structures for the three Myc scFvs have a much higher deviation in the loop regions (2.87 to 3.06 Å).
When the epitope peptide is added, there is significant motion in the loops consistent with an "induced fit" description. In the two chimeric Myc scFvs (Myc-15F11 and Myc-2E2) the final loop RMSD is reduced to 1.51-1.61 Å, and the epitope peptide is successfully predicted (2.45-2.68 Å). However, despite a lower apo-state loop RMSD (2.87 Å), the loop RMSD for the wild-type Myc scFv only drops to 1.75 Å, and the epitope peptide placement does not match the experimental structure (6.69 Å). This is consistent with the failure of the wild-type Myc scFv AlphaFold2 predictions in main text Figure 2.

Supplemental Figure 5 (continued): Similarly, with a fixed peptide length of 10 and a sliding window step size of 1 (F), 2 (G), and 5 (H), the practical epitope detection outcome was similar for step sizes of 1 and 2, but resolution and accuracy were reduced for a step size of 5. To more fully illustrate the strong learned bias that AlphaFold2 has for placing any peptide among the CDR loops, we predicted the structure of Myc-2E2 in complex with several control peptides. These negative control peptides bind to the generally expected antibody binding site, but with poor pLDDT. I) GSx5 in magenta (GSGSGSGSGS) had a score (mean peptide pLDDT from the Simple Max method) of 29.5. (GGGGS)2 in orange (GGGGSGGGGS) had a score of 31.9. G10 in red (GGGGGGGGGG) had a score of 33.

Lastly, J) A10 in cyan (AAAAAAAAAA) had a score of 41 and is the only negative control peptide to have an alpha-helical secondary structure (presumably due to the increased alpha-helical propensity of alanine).

Supplemental Figure 14 (continued): ... and PDB70 (220313)) (blue).

Supplemental Figure 1: Alignment of AlphaFold2 predicted scFv structures to an anti-c-Myc Fab crystal structure. A) Alignments of AlphaFold2-derived wild-type Myc scFv, Myc-2E2 scFv, and Myc-15F11 scFv structures with a Myc Fab crystal structure (PDB: 2orb). Predicted scFv structures are shown in dark blue; 2orb Myc Fab structures are shown in light blue. B) RMSD values comparing the wild-type Myc scFv, Myc-2E2 scFv, and Myc-15F11 scFv structures with the Myc Fab crystal structure (PDB: 2orb), computed with the PyMOL align command.

Supplemental Figure 2: AlphaFold2's best attempt to dock each full-length antigen sequence with its respective scFv. A) The whole HA protein structure and scFv complex as predicted by AF2, with the correct epitope sequence highlighted in magenta. B) The same structure highlighted by AF2 confidence (pLDDT). Similarly, the entire Myc protein-scFv complex is shown with C) the correct epitope highlighted in magenta and D) the confidence of the structure, and again for the mBG17 N protein-scFv complex in E) and F).

Supplemental Figure 3: AlphaFold2 places all peptides near the CDR loops. The predicted Cα coordinates for all scFvs (excluding the flexible linker) were extracted, and all were aligned together using the Kabsch algorithm (49, 50). With the scFvs structurally aligned, an all-against-all RMSD was calculated for the epitope peptides. To visually represent each peptide as a single point, the coordinates for all epitope atoms were averaged. The "central" exemplar epitope (cyan) is the peptide with the smallest sum of RMSD to all other peptides. A) A box-and-whisker plot of the average and quartiles of peptide placement relative to the central peptide reveals that AlphaFold2 largely places all epitopes in the same area. The Myc CDRH3 runs through the middle of a traditional paratope pocket; it is not a "cradle" for the epitope to sit on. AlphaFold2 places peptides on both sides of the CDRH3, causing significant spread in the peptide placement. B) The exemplar, most-central predicted peptide structure (peptide PKSCASQDSS, cyan) bound to the Myc-2E2 scFv (green), shown with a distant example outlier peptide (magenta, peptide PHSPLVLKRC, center-to-center distance 14.8 Å). All peptide placements are still in contact with CDRH3, consistent with a strong AlphaFold2 bias to place peptides in a typical antibody binding site. C) The Myc-2E2 scFv (pale green) and the average epitope placement peptide (cyan) alongside the crystal structure solution of the Myc epitope (grey). Remaining peptide placements are represented as a cloud of spheres at the mean peptide position. Each peptide sphere is colored and sized by epitope pLDDT (ranging from 20 to 90).
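The exemplar selection described for Supplemental Figure 3 can be sketched as a medoid search over an all-against-all RMSD matrix. This is an illustrative sketch, not the authors' script: it assumes every peptide contributes the same number of matched atoms (for example, 10 Cα atoms) already expressed in a common frame after scFv alignment, the function names are hypothetical, and the coordinates are synthetic.

```python
import numpy as np

def pairwise_rmsd(peptides):
    """All-against-all RMSD for an (n_peptides, n_atoms, 3) array.

    Assumes the peptides are already in a common frame (no refitting here).
    """
    diff = peptides[:, None, :, :] - peptides[None, :, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1).mean(axis=-1))

def exemplar_index(rmsd_matrix):
    """Index of the peptide with the smallest summed RMSD to all others."""
    return int(np.argmin(rmsd_matrix.sum(axis=1)))

def centroids(peptides):
    """Mean atom position of each peptide, one point per peptide."""
    return peptides.mean(axis=1)

# Synthetic demonstration: five copies of one 10-atom shape, shifted along x.
rng = np.random.default_rng(1)
base = rng.normal(size=(10, 3))
offsets = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
peptides = base[None, :, :] + offsets[:, None, None] * np.array([1.0, 0.0, 0.0])

matrix = pairwise_rmsd(peptides)
exemplar = exemplar_index(matrix)
points = centroids(peptides)

# Center-to-center distance from the exemplar to the most distant peptide.
distances = np.linalg.norm(points - points[exemplar], axis=1)
outlier = int(np.argmax(distances))
print(exemplar, outlier, round(float(distances[outlier]), 2))
```

Using the summed RMSD rather than the mean coordinate guarantees that the exemplar is an actual predicted peptide, not an averaged structure.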

Supplemental Figure 5: Assessment of peptide size and sliding window sizes on epitope prediction efficacy. Myc-2E2 scFv:peptide structures were predicted with peptides of 8 (A), 9 (B), 10 (C), 11 (D), and 12 (E) amino acid lengths derived from the Myc protein with a sliding window of 2 amino acids, and pLDDT scores from each predicted structure were plotted against the Myc amino acid position and sliding window length target. F) Negative control peptides bind to antibody binding sites, but with poor pLDDT scores.
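The peptide length and sliding window parameters discussed above can be made concrete with a short sketch of how an antigen sequence might be split into overlapping peptides. This is an assumption-laden illustration, not the PAbFold implementation: the function name `sliding_peptides` is hypothetical, and the toy antigen is the Myc epitope EQKLISEEDL flanked by invented glycine residues.

```python
def sliding_peptides(sequence, length=10, step=1):
    """Return (start_index, peptide) pairs for every full-length window.

    Windows that would run past the end of the sequence are dropped, so a few
    C-terminal residues can go uncovered when step > 1. How PAbFold handles
    that edge case is not stated here.
    """
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    last_start = len(sequence) - length
    return [(i, sequence[i:i + length]) for i in range(0, last_start + 1, step)]

# Toy antigen (assumption): poly-glycine flanks around the Myc epitope.
antigen = "GGGGG" + "EQKLISEEDL" + "GGGGG"

step1 = sliding_peptides(antigen, length=10, step=1)
step2 = sliding_peptides(antigen, length=10, step=2)
step5 = sliding_peptides(antigen, length=10, step=5)

for name, windows in (("step 1", step1), ("step 2", step2), ("step 5", step5)):
    print(name, len(windows), "peptides")
```

Larger steps reduce the number of structure predictions but sample fewer registers of the epitope, which is one plausible reason resolution drops at a step size of 5.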

Supplemental Figure 6: PAbFold epitope detection is independent of position within the target sequence. The Myc epitope (EQKLISEEDL) was added into the beginning, middle, or end of the 99-a.a. HIV protease sequence (GenBank Accession: NP_705926.1) prior to epitope scanning structure prediction. The Myc epitope sequence was added at the A) N-terminus, B) middle, and C) C-terminus of the HIV protease sequence. D) The ranked sequences recovered from each experiment in A, B, and C.

Supplemental Figure 7: AlphaFold2 can accurately predict the HA linear epitope in different scFv backbones. The anti-HA VH and VL antibody sequences were used to generate either A) the wild-type scFv or CDR loops grafted onto the B) 15F11 or C) 2E2 antibody backbones. The Influenza A virus hemagglutinin protein sequence (GenBank AUT17530.1) was used as the target antigen and processed into 10 amino acid overlapping peptides with a 1 amino acid sliding window. The structures for each scFv:peptide pair were predicted with AlphaFold2, and pLDDT values for each scFv:peptide pair are shown. D) The top-ranking epitope sequences by pLDDT score, as reported by the consensus method. Sequence underlining represents overlap with the known HA epitope (HA a.a. 114-125: YDVPDYASL). E) The top-ranking epitope sequences by pLDDT score, as reported by the simple max method.

Supplemental Figure 8: A comparison of AlphaFold2 multimer version 3 and multimer version 2 applied to the mBG17 system. The experimental epitope, DDFSKQLQQS, is still easily identified with all three scFv backbones (wild-type, 15F11, and 2E2).

Supplemental Figure 9: Myc comparison of epitope identification accuracy, comparing model types. Performance variation with AlphaFold2 model (multimer versions 2 and 3) and MSA version: the new MSA (the most up-to-date version of the ColabFold MSA server uses UniRef30 (2302) and PDB100 (220517)) vs the old MSA (when these data were initially generated, the ColabFold MSA server used UniRef30 (2202) and PDB70 (220313)). The left column is the WT scFv, the middle column is the CDR loops spliced onto the 15F11 backbone, and the right column is the CDR loops spliced onto the 2E2 backbone. Performance was ablated when using MM3 and the new MSA, and significantly degraded when using MM2 with the new MSA. For AF2-MM2 Old MSA, see Figure 2.

Supplemental Figure 10: HA comparison of epitope identification accuracy, comparing model types. A comparison of the differing AlphaFold2 models with the HA system (multimer versions 3 and 2), along with a comparison of the new MSA (the most up-to-date version of the ColabFold MSA server uses UniRef30 (2302) and PDB100 (220517)) vs the old MSA (when these data were initially generated, the ColabFold MSA server used UniRef30 (2202) and PDB70 (220313)). The left column is the WT scFv, the middle column is the CDR loops spliced onto the 15F11 backbone, and the right column is the CDR loops spliced onto the 2E2 backbone. For AF2-MM2 Old MSA, see Supplemental Figure 7.

Supplemental Figure 11: Local remake of the databases used by the MMSEQS server. Databases were downloaded (UniRef30 (2202) and PDB70 (220313)) and were queried locally to produce MSAs for testing. These runs were all done with the multimer version 2 model of AlphaFold2. The left column is the WT scFv, the middle column is the CDR loops spliced onto the 15F11 backbone, and the right column is the CDR loops spliced onto the 2E2 backbone. The first row is the HA system, the second row is the Myc system, and the final row is the mBG17 system.

Supplemental Figure 12: Server remake of the MMSEQS databases. The databases (UniRef30 (2202) and PDB70 (220313)) were rebuilt by the MMSEQS team on the ColabFold MSA server and were queried to produce MSAs for testing. These runs were all done with the multimer version 2 model of AlphaFold2.
The left column is the WT scFv, the middle column is the CDR loops spliced onto the 15F11 backbone, and the right column is the CDR loops spliced onto the 2E2 backbone. The first row is the HA system, the second row is the Myc system, and the final row is the mBG17 system.

Supplemental Figure 13: Single sequence mode (no MSAs) of epitope prediction with AF2. These runs were all done with the multimer version 2 model of AlphaFold2 in single sequence mode (i.e., no MSA was used) as a negative control, to highlight the importance of a quality MSA. The left column is the WT scFv, the middle column is the CDR loops spliced onto the 15F11 backbone, and the right column is the CDR loops spliced onto the 2E2 backbone. The first row is the HA system, the second row is the Myc system, and the final row is the mBG17 system.

Supplemental Figure 14: MSA overlap between the 4 generation methods. Here we highlight the number of unique entries that are shared amongst all of the MSA methods, those being: 1) using the current databases via ColabFold (UniRef30 (2302) and PDB100 (230517)) (green), 2) the databases after they had been accessed via ColabFold and cached for repeated use (UniRef30 (2202) and PDB70 (220313)) (yellow), 3) downloading the databases locally (UniRef30 (2202) and PDB70 (220313)) and attempting to create the MSAs ourselves (red), and 4) querying the databases after the MMSEQS team rebuilt them for our use via ColabFold (UniRef30 (2202)

Table 1A
[Table contents not recovered; surviving entry: EQKLISEEDL]

Supplemental Table
[Table contents not recovered; surviving column headers, repeated for two panels: BB Cα RMSD, Loop all backbone RMSD, Epitope all atom RMSD]