CBP60-DB: An AlphaFold-predicted plant kingdom-wide database of the CALMODULIN-BINDING PROTEIN 60 (CBP60) protein family with a novel structural clustering algorithm

Molecular genetic analyses in the model species Arabidopsis thaliana have demonstrated the major roles of different CAM-BINDING PROTEIN 60 (CBP60) proteins in growth, stress signaling, and immune responses. Prominently, CBP60g and SARD1 are paralogous CBP60 transcription factors that regulate numerous components of the immune system, such as cell surface and intracellular immune receptors, MAP kinases, WRKY transcription factors, and biosynthetic enzymes for the immunity-activating metabolites salicylic acid (SA) and N-hydroxypipecolic acid (NHP). However, their function, regulation and diversification in most species remain unclear. Here we have created CBP60-DB, a structural and bioinformatic database that comprehensively characterizes 1052 CBP60 gene homologs (encoding 2376 unique transcripts and 1996 unique proteins) across 62 phylogenetically diverse genomes in the plant kingdom. We have employed deep learning-predicted structural analyses using AlphaFold2 and then generated dedicated web pages for all plant CBP60 proteins. Importantly, we have generated a novel clustering visualization algorithm to interrogate kingdom-wide structural similarities for more efficient inference of conserved functions across various plant taxa. Because well-characterized CBP60 proteins in Arabidopsis are known to be transcription factors with putative calmodulin-binding domains, we have integrated external bioinformatic resources to analyze protein domains and motifs. Collectively, we present a plant kingdom-wide identification of this important protein family in a user-friendly AlphaFold-anchored database, representing a novel and significant resource for the broader plant biology community.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. bioRxiv preprint doi: https://doi.org/10.1101/2022.07.07.499200; this version posted July 8, 2022.

[...] to produce viable results for complexes, membrane-bound proteins, or proteins that are unable to crystallize (Tugarinov et al., 2004; Shi, 2014; Nogales and Scheres, 2015). A major advance toward solving this grand challenge occurred with the launch of AlphaFold2, a novel deep learning approach for accurately predicting the three-dimensional structure of a protein from its amino acid sequence (Jumper et al., 2021). However, base AlphaFold2 also has a few drawbacks: certain internal settings (e.g., the number of recycling steps) are not exposed, it is slightly unoptimized, and the default MSA generation algorithms can be slow [...]
ColabFold can produce more predictions within a shorter period. [...] Colaboratory notebook is provided, as well as a minimal implementation for executing locally. We have showcased a visualization for this algorithm on the index page of the CBP60-DB web application.

[...] sequences, transcript/cDNA sequences, and protein sequence data. Each protein entry's amino acid sequence was used as an input to ColabFold for structural predictions. ColabFold (https://github.com/sokrypton/ColabFold) was used instead of the original AlphaFold2 since the former produces a higher number of predictions within a shorter time, while also improving prediction quality compared to base AlphaFold2. This improvement is primarily due to [...]

The CBP60-DB user interface was designed to be easy to navigate, with an emphasis on several intuitive visualization options that are assembled for best user accessibility.

Additionally, the application was written in the Go programming language without third-party dependencies, making it straightforward to re-deploy on any modern system. All database contents are either stored within the assets directory of the application, which is freely accessible via HTTP(S), or stored within an internal JSON file that is loaded into memory as a hash table whose keys are the MD5 hashes of the unique transcript names. The advantage of an internal hash table over a traditional database management system (DBMS) is that it is faster for accessing and serving data and requires no additional dependencies. Furthermore, since the contents of the database are static and the memory required to load the JSON file is modest (6.9 MB), there is little need for a DBMS. However, should we decide to scale the database to include vastly more entries, a dedicated DBMS would be the preferable solution.
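The in-memory lookup described above can be illustrated with a minimal Go sketch that uses only the standard library. It is not the CBP60-DB source code: the Entry type, its field names, the assumption that the JSON file is an array of records, and the example transcript name are all hypothetical and serve only to show how a hash table keyed by the MD5 hash of a transcript name might be built and queried.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Entry is a hypothetical record; the field names are assumptions,
// not the actual CBP60-DB schema.
type Entry struct {
	Transcript string `json:"transcript"`
	Species    string `json:"species"`
}

// keyFor returns the hex-encoded MD5 hash of a transcript name.
func keyFor(transcript string) string {
	sum := md5.Sum([]byte(transcript))
	return hex.EncodeToString(sum[:])
}

// loadEntries parses a JSON array of entries (an assumed layout) and
// indexes them in a map keyed by the MD5 hash of the transcript name.
func loadEntries(raw []byte) (map[string]Entry, error) {
	var entries []Entry
	if err := json.Unmarshal(raw, &entries); err != nil {
		return nil, err
	}
	table := make(map[string]Entry, len(entries))
	for _, e := range entries {
		table[keyFor(e.Transcript)] = e
	}
	return table, nil
}

func main() {
	// Hypothetical data standing in for the internal JSON file.
	raw := []byte(`[{"transcript":"EXAMPLE_TRANSCRIPT.1","species":"Example species"}]`)
	table, err := loadEntries(raw)
	if err != nil {
		panic(err)
	}
	// A lookup is a single map access; no database round trip is involved.
	entry, found := table[keyFor("EXAMPLE_TRANSCRIPT.1")]
	fmt.Println(entry.Species, found)
}
```

Because the map is built once at start-up and the data are static, each request reduces to hashing the transcript name and performing one map access.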
Data archival

CBP60-DB archives and provides access to the following data. Note that protein structures that have been updated, replaced, or removed will not be archived.

• Predicted protein structure in PDB and mmCIF file formats.
• Protein metadata and AlphaFold2 metadata in JSON format.
• Generated thumbnails of the predicted structure in PNG format.
• AlphaFold2 scoring metrics in JSON format (pLDDT, PAE, and pTM score).

[...] The feature tensor was then flattened to produce a [...] × ([...] × [...]) matrix, which was used as an input to UMAP. UMAP is a powerful dimensionality reduction algorithm that can generally [...] (Fig 1).

[...] viewer, navigable cluster map (Fig 1), interactive plots for prediction metrics, as well as the top five most structurally similar proteins (if available).
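The flattening step and the list of most similar proteins mentioned above can be illustrated with a small Go sketch. The dimensions of the feature tensor are not recoverable from the text, so the sketch assumes one two-dimensional feature array per protein. UMAP itself is not reproduced here, and the source does not state how structural similarity is ranked: the Euclidean distance between flattened rows is a stand-in chosen purely for illustration, and all function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// flatten converts one protein's 2-D feature array into a single row in
// row-major order. Applied to N proteins, this yields an N x (A*B) matrix.
func flatten(feature [][]float64) []float64 {
	var row []float64
	for _, r := range feature {
		row = append(row, r...)
	}
	return row
}

// euclidean returns the Euclidean distance between two equal-length rows.
func euclidean(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// nearest returns the indices of the k rows closest to rows[query],
// excluding the query itself. Fewer than k indices are returned when
// fewer are available. Euclidean distance is an assumed similarity measure.
func nearest(rows [][]float64, query, k int) []int {
	idx := make([]int, 0, len(rows))
	for i := range rows {
		if i != query {
			idx = append(idx, i)
		}
	}
	sort.SliceStable(idx, func(x, y int) bool {
		return euclidean(rows[idx[x]], rows[query]) < euclidean(rows[idx[y]], rows[query])
	})
	if len(idx) > k {
		idx = idx[:k]
	}
	return idx
}

func main() {
	// Three hypothetical proteins, each with a 2 x 2 feature array.
	tensor := [][][]float64{
		{{0, 0}, {0, 0}},
		{{1, 0}, {0, 0}},
		{{5, 5}, {5, 5}},
	}
	rows := make([][]float64, len(tensor))
	for i, feature := range tensor {
		rows[i] = flatten(feature)
	}
	// Request the top five neighbours of protein 0; only two exist here.
	fmt.Println(nearest(rows, 0, 5))
}
```

In this sketch a request for five neighbours returns only as many as exist, which matches the "if available" qualification above.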

Search page:
The search page (Fig 3) displays the results of queries made via the search bar on any page within CBP60-DB. Once a query is submitted through the search bar, users are redirected to this page. If no query is provided, all database entries are displayed instead.

The protein information page (Fig 4) is arguably the most useful page within CBP60-DB, [...]
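The search page behavior described above (an empty query lists every entry) can be sketched in Go as follows. The matching rule is an assumption: the source does not say how queries are matched, so a case-insensitive substring match over entry names is used here for illustration, and the search function and the sample names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// search returns every name that contains the query, ignoring case.
// An empty or whitespace-only query returns a copy of all names.
// Substring matching is an assumed rule, not the documented one.
func search(names []string, query string) []string {
	q := strings.ToLower(strings.TrimSpace(query))
	if q == "" {
		return append([]string(nil), names...)
	}
	var hits []string
	for _, n := range names {
		if strings.Contains(strings.ToLower(n), q) {
			hits = append(hits, n)
		}
	}
	return hits
}

func main() {
	// Hypothetical entry names used only for demonstration.
	names := []string{"CBP60g", "SARD1", "CBP60a"}
	fmt.Println(search(names, "cbp60"))
	fmt.Println(search(names, ""))
}
```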