Curation of the Deep Green list of unannotated green lineage proteins to enable structural and functional characterization

An explosion of sequenced genomes and predicted proteomes enabled by low cost deep sequencing has revolutionized biology. Unfortunately, protein functional annotation is more complex, and has not kept pace with the sequencing revolution. We identified unannotated proteins in three model organisms representing distinct parts of the green lineage (Viridiplantae); Arabidopsis thaliana (dicot), Setaria viridis (monocot), and Chlamydomonas reinhardtii (Chlorophyte alga). Using similarity searching we found the subset of unannotated proteins that were conserved between these species and defined them as Deep Green proteins. Informatic, genomic, and structural predictions were leveraged to begin inferring functional information about Deep Green genes and proteins. The Deep Green set was enriched for proteins with predicted chloroplast targeting signals that are predictive of photosynthetic or plastid functions. Strikingly, structural predictions using AlphaFold and comparisons to known structures show that a significant proportion of Deep Green proteins may possess novel protein tertiary structures. The Deep Green genes and proteins provide a starting resource of high value targets for further investigation of potentially new protein structures and functions that are conserved in the green lineage.


Introduction
The genome sequencing revolution of the past two decades has removed a major barrier to identifying and describing the genetic toolkits used by green lineage organisms for growth, photosynthesis, and adaptation to diverse conditions. The number of sequenced plant genomes is growing exponentially, but the resources for comprehensive, experimental structural and functional protein annotation are lagging behind (Ellens et al., 2017). Homology-based annotations are a simple and rapid means of predicting protein function in the absence of any other functional data. In homology-based annotations, new sequences are searched for similarity to proteins in other species, some of which may already have known functions that may be assigned to the newly predicted protein "by proxy". It is generally assumed that conservation at the sequence level implies conservation of function, but this correlation is not perfect (Blaby-Hass et al., 2011). New sequences can also be searched for the presence of conserved domains using sensitive Hidden Markov Models (HMM) and/or multiple sequence alignments (Soding, 2005). The Protein ANalysis THrough Evolutionary Relationships (PANTHER), Protein Families (Pfam), and Clusters of Orthologous Groups of proteins (COG) and the eukaryote-specific version EuKaryotic Orthologous Groups (KOG) are powerful classification tools that leverage growing sets of data to improve annotation based on sequence similarity (Tatusov et al., 2003, Koonin et al., 2004, Mi et al., 2012, Finn et al., 2016, Mi et al., 2016, Bolger et al., 2017, El-Gebali et al., 2018. The above approaches will identify functions and/or structural domains for around half of all proteins in newly sequenced plant genomes, with some variation dependent on genome size and complexity as well as taxonomic position (Hanson et al., 2010). Even well annotated genomes from budding yeast and humans still contain approximately 30% unannotated proteins, and 30-40% of these unannotated proteins (~10% total) are likely to have an uncharacterized catalytic function [1]. Likewise, every genome contains predicted proteins that are unannotated because they either have no similarity to characterized proteins or have limited information available beyond the possible presence or absence of domains. Unannotated proteins have typically accounted for approximately 40-60% in plants and algae genomes (Berardini et al., 2015, Niehaus et al., 2015, Blaby-Haas et al., 2019 with functional assignments slowly increasing.
The potential for discovery of new structures, catalytic activities, and biological functions among unknown proteins is enormous, but these proteins also represent a huge challenge due to their overwhelming numbers which increase whenever a new genome is sequenced (Fox et al., 2008, Hanson et al., 2010. Having improved functional information for plant proteins is especially important as the need for faster deployment of biotechnology and breeding resources is essential to continue the pace of development of agriculture and bioenergy crops to maintain food and energy security in a challenging global environment with an increasing human population and growing climate instability. An integral part of overcoming challenge will involve crops with improved tolerance for abiotic stresses such as temperature, light, salt, and nutrient deficiency (Ahanger et al., 2017, Maggio et al., 2018, Varshney et al., 2018.
Functional annotations of plant proteins lags behind that for animals, fungi, and prokaryotes due to fewer resources devoted to plant research and incomplete plant genome annotation, though this situation is improving with dedicated portals and tool development on platforms such as Phytozome, PLAZA, and Gramene (Goodstein et al., 2012, Proost et al., 2015, Tello-Ruiz et al., 2018. Currently, approximately 12% of predicted proteins have an experimentally determined structure or structural data derived from related proteins in the Protein Data Bank (PDB) for Arabidopsis thaliana, compared to more than 25% in Saccharomyces cerevisiae and more than 30% in Homo sapiens (Callaway, 2022). One approach for prioritizing unknown/unannotated genes and proteins for further functional analyses is to demand sequence conservation across one or more taxa (aka phylogenomics). While this does not directly help with annotation, it ensures that information obtained about that gene or associated protein from one species will be impactful as it can be leveraged across species. Indeed, phylogenomics approaches have been used successfully to obtain new biological information in multiple contexts. One particularly useful deployment of phylogenomics defined GreenCut proteins, found in multiple photosynthetic taxa, but not in non-photosynthetic eukaryotes. The GreenCut set of unknown/uncharacterized proteins was first defined when annotating the genome of the green alga Chlamydomonas reinhardtii (Merchant et al., 2007), and has since been expanded and updated (Karpowicz et al., 2011). Validating its usefulness, many of the original unknowns in the GreenCut list were found to have key functions in photosynthetic processes (Arthur et al., 2019, Wakao et al., 2021. A limitation of the GreenCut list is that it spans a wide taxonomic breadth including non-green-lineage species such as diatoms, and red and brown algae and demands high similarity scores between candidates. This stringency ensures higher confidence of an important photosynthetic function for candidates but may exclude proteins that diverged too rapidly to detect similarities across large phylogenetic distances. Here we took a comprehensive approach using inclusive criteria to create the Deep Green list of unannotated green lineage proteins. This list is based on identification and curation of conserved unknown proteins in three green lineages (i.e. Viridiplantae) species: Chlamydomonas reinhardtii, Arabidopsis thaliana, and Setaria viridis. Preliminary characterization of Deep Green proteins using in silico methods and published data sets is reported here and has revealed similarities among these proteins including enrichment in predicted chloroplast, photosynthetic, and stress related functions, and identified multiple predicted families of novel protein structures.

Results and Discussion
Our objective was to define and begin characterizing a set of poorly annotated or unannotated conserved green lineage proteins. To do so we chose three focal species: Arabidopsis thaliana (Arabidopsis), Setaria viridis (Setaria), and Chlamydomonas reinhardtii (Chlamydomonas). Arabidopsis has the best studied and well annotated genome of any angiosperm species (Berardini et al., 2015), while Setaria (green foxtail millet) is an emerging model C4 grass and bioenergy feedstock model that has a small stature, short generation time, and is genetically tractable (Huang et al., 2016, Zhu et al., 2017, Hu et al., 2018. Together, Arabidopsis and Setaria represent two major branches of angiosperms, dicots and monocots, respectively. The unicellular green alga Chlamydomonas is a well-established model for investigating cellular processes in photosynthetic eukaryotes due to its fast growth rates, haploid genome, low levels of gene duplication, and availability of high-throughput genetics and genomics tools (Sasso et al., 2018).

Identification of conserved protein families
To reduce search complexity and to help identify unannotated proteins, we developed a downselection strategy ( Figure 1A). The first step involved grouping the predicted proteins in each species into families with highly similar sequences (>30% sequence identity, >50% overlap), while at the same time merging annotations among family members (Tables S1-S3). This paralog grouping reduced sequence search space from 27654, 38334, and 17741 predicted primary proteins to 13058, 19951, and 14076 in Arabidopsis, Setaria, and Chlamydomonas respectively (Figure 1B), and ensured that unannotated members of a paralog family were not included when one of their family members was already annotated.
The h3-cd-hit algorithm which was used for grouping into families also selected a lead protein in each cluster which served as a representative for similarity searching between proteomes of the three focal species (Tables S4-S6). For each protein family, Phytozome annotations were collected, merged, and used to identify unannotated or poorly annotated families in subsequent steps. Because our goal was to identify conserved unannotated proteins, the next step in the process was to find protein families that were shared between at least two of the three species. Using the basic local alignment search tool for protein (BLASTP) among lead proteins from each species, the top high-scoring segment pair (hsp) was identified and those which passed a similarity threshold (e-value cutoff threshold of 10 -3 ; >40% of the sequence length aligned for at least one of the two proteins) were used to further group proteins into superfamilies with shared annotations (Figure 2, Table S7).

Identification of unannotated/unknown proteins
The families (Tables S4-S6) were filtered to select poorly annotated or unannotated members based on keyword searching of aggregated annotations ( Figure 1A). If a family in one species was well annotated or characterized, its conserved counterparts in the other two species were also considered annotated and removed from consideration. If the definition line (defline) annotation was blank, or contained the terms "unknown", "undefined", "uncharacterized", "hypothetical", "domain unknown function", "expressed protein", "transmembrane", "function unknown", "predicted protein" and "conserved in plant or green lineage", "anykrin", "fbox", "tetra-or penta-tricopeptide", the family was retained. In Arabidopsis, a list of uncharacterized/unannotated proteins was recently released and

Characterization of Unknown and Deep Green proteins
We interrogated the Deep Green list in several ways to provide preliminary information on potential functions. Deep Green proteins in all the three species were strongly enriched for chloroplast targeting and the plant members of this set were also de-enriched for nuclear targeting (Figure 3).
Importantly, the strong chloroplast localization enrichment for predicted proteins in each species was not seen in the set of all unknown proteins for each species where there was just a slight enrichment. These findings suggest that there is a significant number of conserved proteins with chloroplast functions that have yet to be characterized. Deep Green and unknown proteins were also significantly enriched in proteins predicted to contain one or more transmembrane domain(s) as compared to all predicted protein families ( Figure S1).
We next performed co-expression analysis for Chlamydomonas Deep Green genes using published transcriptome data sets. An important resource was a previously described high-resolution diurnal data set for synchronized Chlamydomonas cultures (Zones et al., 2015). In that study, around 80% of genes with detectable expression (~12,000) showed strong periodic diurnal or cell-cycle-controlled expression patterns ( Figure 4A). Genes coding for the Deep Green proteins showed higher overall average expression levels compared with all expressed genes of 3.63 versus 1.91 log2RPKM respectively ( Figure   4B). We also investigated the distribution and enrichment of Deep Green genes in 18 diurnal clustered and unclustered expression groups and found them to be significantly over-represented in clusters 2, 4, 6, and 7, which all have peak expression in the light phase when most of the chloroplast or photosynthesis related genes are also expressed ( Figure 4C, Figure S2).
Deep Green genes were also significantly over-represented in the non-differentially expressed cluster (i.e., constitutively expressed genes) and are de-enriched in the un-expressed group. Finally, Deep Green genes were also significantly de-enriched in the dark phase clusters 13 and 15, cell motility and   Figure 8A). In addition, 9 of the Deep Green proteins with a potentially novel fold showed a high structural similarity among the three interspecific homologs as exemplified in (Table 1, Figure 8B). In addition, approximately 60% of the Arabidopsis proteome currently has either experimentally determined structures or structures through association with related proteins in the PDB (12%), with the remaining majority (48%) having been predicted using AlphaFold (Callaway, 2022). Our predicted structures of the Deep Green and unknown protein sets will likely build on this database of predicted structures.
To better understand the structural complexity of the Deep Green protein set, two measurements describing order versus disorder were used. IUPred3 predicts the likelihood of individual residues being in a structured region ( Figure 9A)  represented by GC rich triplets. Chlamydomonas has the greatest amount of disorder in its predicted proteome and the highest genomic GC content (66%), while Arabidopsis has the lowest predicted disorder and the lowest GC content (36%) with Setaria in between (46% GC). Nonetheless, within each species the Deep Green predicted proteins were predicted to be more structured than average proteins.

Summary and Perspectives
The motivation behind this study was that among unknown proteins, those which are conserved in diverse members of the green lineage are likely to play important roles in photosynthetic biology. Our identification and preliminary characterization of Deep Green proteins presented here supported this hypothesis. There was a strong over-representation of Deep Green proteins for predicted chloroplast localization, suggesting their direct participation in plastid biogenesis or photosynthetic functions. In agreement with this finding, Deep Green genes in Chlamydomonas were shown to be over-represented for having light phase expression patterns that are also a characteristic of photosynthesis related proteins.
Interestingly, there was also higher than expected representation of Deep Green genes among those that are induced during heat stress or heat stress recovery in Chlamydomonas, possibly reflecting the important role of chloroplast function in heat stress tolerance (Hu et al., 2020, Luo et al., 2021. The concentration of Deep Green genes in several ChlamyNet sub-networks was harder to interpret as the sub-networks contain hundreds or thousands of genes, but it again suggests some functional coherence for the Deep Green gene list whose members were not distributed uniformly within ChlamyNet. A second goal of identifying Deep Green proteins was to investigate possible new protein folds and structures as these could also have novel catalytic or other properties. Using the powerful new version of Alphafold v2.1, we predicted stable structures for a large fraction of Deep Green proteins; and excitingly, many of them had no good matches in PDB meaning that they likely represent new families of structural folds. Also encouraging was the agreement found between predicted structures of Deep Green homologs between the three species supporting the hypothesis that there will also be structure-based functional conservation across the green lineage. In summary, the Deep Green gene/protein list that has been created and characterized here will be an impactful starting point for applying functional genomics and structural studies that will help shed light on unexplored areas of biology in photosynthetic eukaryotes. Homologs: The proteins in each organism were searched against each other organism using BLASTP to identify those proteins having homologs using both BLOSUM 45 and 62 matrices with an evalue cutoff of 10 -3 and a qcovs score ≥40% (Altschul et al., 1990).

Datasets
Clustering: To reduce the overall number of proteins and to generate a non-redundant protein set, the three-step hierarchical clustering algorithm h3-cd-hit (http://weizhonglab.ucsd.edu/webMGA/server/) was used on the protein list derived from the primary transcript only lists from each of the three organisms to identify those proteins that cluster together with ≥30% primary sequence identity (Huang et al., 2010). consists of a variant of the force-directed layout. Nodes produce repulsive forces whereas edges induce attractive forces. Nodes are then placed such that the sum of these forces are minimized. The organic layout has the effect of exposing the clustering structure of a network. In particular, this layout tends to locate tightly connected nodes with many interactions or hub nodes together in central areas of the network. The over-representation hypher.test analysis was performed using the R programming language with a significant level of 0.05.

Structural Predictions:
We used AlphaFold v2.1 to predict tertiary structures of the Deep Green proteins from their amino acid sequences. Five structural models were generated per protein and the models were ranked using the predicted local-distance difference test (pLDDT) scores (Jumper et al., 2021). The model with the highest pLDDT score was accepted as the most accurate structural prediction.
Computations were carried out on NREL's Eagle High-Performance Computing (HPC) cluster. Structural predictions for protein sequences with less than 1100 amino acid residues were run on graphics processing unit (GPU) nodes (with 16 GB Tesla V100 accelerators), while longer sequences were performed on graphics processing unit (CPU) nodes due to memory limitations. Twenty Arabidopsis, 26 Setaria, and 18 Chlamydomonas Deep Green protein sequences were predicted by SignalP-5.0 to contain signal peptides. For these sequences, the signal peptides were truncated in-silico prior to structure prediction. AlphaFold v2.1 structural prediction ran successfully on 457 out of 458 Arabidopsis, all 504 Setaria, and 382 out of 384 Chlamydomonas Deep Green proteins. AlphaFold v2.1 runtime errors occurred for the Arabidopsis protein AT1G21650.3 (1806 aa) and for two Chlamydomonas proteins, Cre04.g216050.t1.1 (3691 aa after removal of predicted signal peptide) and Cre07.g314900.t1.1 (732 aa).
These structures could not be predicted due to runtime errors of the HHBlits software that AlphaFold v2.1 uses for fast iterative protein sequence searching by HMM-HMM alignment. AlphaFold v2.1 has been documented to fold proteins that are at least 16 and at most 2700 amino acid residues long. To perform structural homology analysis on the Deep Green proteins, proteins with a predicted tertiary structure having a confidence score higher than 0.5 were selected and FoldSeek (van Kempen et al., 2022) was used to generate structural alignments with proteins in the Protein Data Bank (PDB, version on 2021-06-01) using the parameters: --alignment-type 1 --tmscore-threshold 0 --max-seqs 2000.
Protein disorder predictions and analyses: Protein order versus disorder based on overall secondary structure was quantified using a standalone version of the Intrinsically Unstructured Prediction (IUPred3) (Erdos et al., 2021) tool that was run on the NREL high-performance computing (HPC) cluster.
In the current work, the long disorder prediction mode of IUPred3 was used along with the medium smoothing option that involves the Savitzky-Golay filter with parameters 19 and 5. IUPred3 returns a score, between 0 and 1, for each amino acid residue in the input protein sequence, that represents the probability of the given residue being part of a disordered region. Residues with scores equal to or exceeding 0.5 were considered to be disordered. Next, the percentage disorder (percentage of the total number of amino acid residues in a protein that are disordered) was quantified for each of the lead proteins from the entire proteomes of Arabidopsis, Setaria, and Chlamydomonas.