FunOrder 2.0 – a fully automated method for the identification of co-evolved genes

Coevolution is an important biological process that shapes interacting species or even proteins – may it be physically interacting proteins or consecutive enzymes in a metabolic pathway. The detection of co-evolved proteins will contribute to a better understanding of biological systems. Previously, we developed a semi-automated method, termed FunOrder, for the detection of co-evolved genes from an input gene or protein set. We demonstrated the usability and applicability of FunOrder by identifying essential genes in biosynthetic gene clusters from different ascomycetes. A major drawback of this original method was the need for a manual assessment, which may create a user bias and prevents a high-throughput application. Here we present a fully automated version of this method termed FunOrder 2.0. To fully automatize the method, we used several mathematical indices to determine the optimal number of clusters in the FunOrder output, and a subsequent k-means clustering based on the first three principal components of a principal component analysis of the FunOrder output. Further, we replaced the BLAST with the DIAMOND tool, which enhanced speed and allows the future integration of larger proteome databases. The introduced changes slightly decreased the sensitivity of this method, which is outweighed by enhanced overall speed and specificity. Additionally, the changes lay the foundation for future high-throughput applications of FunOrder 2.0 in different phyla to solve different biological problems. AUTHOR SUMMARY Coevolution is a process which arises between different species or even different proteins that interact with each other. Any change occurring in one partner must be met by a corresponding change in the other partner to maintain the interaction throughout evolution. These interactions may occur in symbiotic relationships or between rivaling species. Within an organism, consecutive enzymes of metabolic pathways are also subjected to coevolution. We developed a fully automated method, FunOrder 2.0, for the detection of co-evolved proteins, which will contribute to a better understanding of protein interactions within an organism. We demonstrate that this method can be used to identify essential genes of the secondary metabolism of fungi, but FunOrder 2.0 may also be used to detect pathogenicity factors or remains of horizontal gene transfer next to many other biological systems that were shaped by coevolution.

other partner to maintain the interaction throughout evolution. These interactions may occur in symbiotic 29 relationships or between rivaling species. Within an organism, consecutive enzymes of metabolic

38
Every form of life known to humankind is subjected to evolution. This process shapes and forms all 39 biological systems on macroscopic and molecular level. Thus, understanding and detecting evolutionary 40 processes substantially contributes to understanding life forms and life itself. An important evolutionary 41 process is the so-called coevolution. This is defined as a "process of reciprocal evolutionary change that 42 occurs between pairs of species or among groups of species as they interact with one another" (1). This 43 definition can be extended to interacting proteins (2), may it be physical interactions or may it be 44 consecutive actions in a metabolic pathway. In this regard, coevolution describes a similar evolutionary 45 process with a similar evolutionary history among interacting proteins and the corresponding genes.

46
In a previous study, we described a semi-automated method for the identification of coevolutionary linked genes, named FunOrder (3). Therein, the protein sequences of an input set of proteins are blasted 48 against an empirically optimized proteome database. The Top 20 results of each search are then 49 compared in a multisequence alignment and a phylogenetic tree is calculated for each input protein. Next, 50 the phylogenetic trees of all proteins are compared pairwise using the treeKO tool. This tool calculates 51 how similar two trees are, and in thus how similar the evolutionary history of two proteins is. The treeKO 52 tool calculates two distances, the strict distance and the speciation distance. Notably, the strict distance 53 had previously been suggested to be more suitable for the detection of coevolution in protein families 54 than the speciation (or evolutionary) distance (4). However, we combined the two distance values to a 55 third measure, the combined distance, in order to consider also the speciation history in the FunOrder method. The strict and the combined distances of all pairwise comparisons were then compiled in two 57 matrices and visualized as heatmaps, dendrograms and two principal component analyses (PCA) were 58 performed. In the final step of this method, the user needed to assess these different visualizations of the 59 underlying data to detect co-evolved proteins (Fig 1A). Please also refer to the original study for a 60 detailed description of this method (3).

61
Previously, we demonstrated the functionality and applicability of this method by identifying essential 62 genes in biosynthetic gene clusters (BGCs) of ascomycetes (3). Fungal BGCs contain genes whose 63 corresponding enzymes catalyze the biosynthesis of secondary metabolites (SMs) (5). SMs are a vast 64 group of compounds with different structures and properties that are not necessary for the normal growth of an organism but can be beneficial under certain conditions (6). Notably, many SMs also have medicinal 66 or other useful purposes, such as dyes, food additives, and as monomers for novel plastics (7). However,

67
we can classify the genes in a BGC into biosynthetic genes, further essential genes, and gap genes. The 68 biosynthetic genes encode for enzymes that are directly involved in the biosynthesis of the SM, while the 69 further essential genes encode for transporters (8), transcription factors (9), or resistance genes (10). In 70 contrast, gap genes are not involved in the biosynthesis of the SM despite being co-localized in the BGC 71 (11). Both, the biosynthetic genes and the further essential genes are necessary for the biosynthesis of a 72 SM in the native organisms (12). We could use FunOrder to detect theses essential genes, because they 73 share a similar evolutionary background in many fungal BGCs (3). The FunOrder method contributes to a 74 better understanding of fungal BGCs by adding an additional layer of information. This will support users 75 in the decision which genes should be considered for detailed studies in the laboratory. Importantly, the 76 application of FunOrder is not limited to BGCs from ascomycetes but may be useful to answer any 77 biological problem related to molecular coevolution of genes or proteins. Notably, this requires the 78 compilation and evaluation of a suitable proteome database.

79
The obvious major shortcoming of the original FunOrder method is the final manual assessment which 80 prevents full automation and high-throughput analyses. Further, the very sensitive but slow BLAST for the analysis of coevolution in plants or mammals, larger databases will be needed. In this study we 83 describe an improved version of the method, termed FunOrder 2.0 which overcomes the two mentioned limitations. For an automated detection of co-evolving genes, we determine the optimal number of gene groups in the FunOrder output and then use k-means clustering based on the first three principal 86 components of a PCA. Further, we replace BLAST with the recently published and upgraded DIAMOND 87 tool (14) to enable searching of larger databases and lay the foundation for future different applications of 88 FunOrder 2.0.

91
Integration of the DIAMOND algorithm

92
The first major improvement of the FunOrder method was the integration of the DIAMOND algorithm (14, 93 15) for searching the proteome database instead of the previously used BLAST algorithm (13) (Fig 1).

94
This change will allow the usage of larger databases in FunOrder 2.0, since DIAMOND is as sensitive as

100
To test, whether the integration of DIAMOND might have altered the ability of FunOrder to detect 101 coevolution, we analyzed the same control gene clusters (GCs) we had previously used to evaluate the 102 original FunOrder method (3) and calculated the internal coevolution quotient (ICQ). The ICQ expresses how many genes in a gene cluster are detected as coevolutionary linked and is calculated subsequently 104 to the treeKO comparison ( Fig. 1). Since no other changes have been introduced until this point in the 105 workflow, the ICQ values are a feasible way to compare BLAST and the DIAMOND software. We found only marginal differences between the original FunOrder method (using BLAST) and FunOrder 2.0 (using 107 DIAMOND) (Table S1). For visualization, we compared the ICQ results in a kernel density plot (Fig 2).

108
Therein, the curve for the ICQs of the positive control GCs (BioPath in Fig 2) slightly shifted to the left (higher internal coevolution) compared to the original method, while the curve for the negative control that DIAMOND might be better suited than BLAST within the FunOrder method, as the usage of DIAMOND resulted in a better distinction of the positive and negative control GCs. The curve for the sequential GCs was flattened and broadened compared to the original curve (Fig 2), which can also be 114 explained by the assumed better performance of DIAMOND in this workflow. As the sequential GCs are 115 random loci from different ascomycetes (3), they contain random numbers of co-evolved and 116 independently evolved genes. Consequently, the usage of DIAMOND lowers the ICQ for GCs containing 117 many co-evolved genes and raises the ICQ for GCs with many independently evolved genes compared to  As mentioned, a major limitation of the original FunOrder method was the need for a manual assessment 126 of the output, during which the proteins are grouped into clusters based on different data visualizations 127 ( Fig 1A). Please refer to our previous method for a detailed description of the procedure (3). To solve this 128 problem, we integrated two R scripts for automatic definition of co-evolved protein groups (or clusters) 129 ( Fig 1B). The two R scripts use the first three principal components of the PCA of the strict and the 130 combined distance matrices as input ( Fig 1B) and group the proteins by k-means clustering. In the 131 original FunOrder method, only the first two components were considered.

132
The first R-script for automated protein (or gene) clustering initially determines the optimal number of 133 gene clusters within the first three principal components of the PCAs using the R Package NbClust (17).

134
This package uses different indices and varies the number of clusters, distance measures and clustering 135 methods to determine the optimal number of clusters in a data set based on the majority rule. If the 136 prediction of the optimal number of clusters fails, the second (a back-up) script with a predefined number of clusters is called. The prediction of the optimal number of clusters might fail for instance if the majority BGCS, we predefined the number of clusters to 3. Regardless of the script used, the final output is an definition, we analyzed the same 30 BGCs as in our previous study. To observe only the influence of the 143 automated cluster definition, we kept the BLAST tool for the initial database search still in place (Fig. 1).

144
Then, we compared the obtained results to those of the previously performed manual analyses (3) ( Table   145 1 and Table S3). In only 5 out of the tested 30 BGCs, the exact same results were obtained (Table S3). In 146 15 BGCs, the automated cluster definition missed at least one biosynthetic or further essential gene in 147 comparison to the manual assessment, but it could detect more of these essential genes in 5 BGCs.

148
Regarding the gap and extra genes, the automated cluster definition returned less false positives than the 149 manual assessment in 12 BGCs but found more in 4 BGCs. In summary, the automated cluster detection 150 appeared to be more stringent than the manual assessment method, which led to slightly reduced 151 sensitivity but enhanced selectivity (see Table S3 for a detailed statistical analysis). comparative analysis of the benchmark BGCs as described above. The results were very similar to the automated analysis using the BLAST analysis (Table 1 and Table S3). In a few cases, the usage of essential genes were detected in 13 of the 30 BGCs by FunOrder 2.0, but also fewer gap or extra genes 164 in 12 BGCs (Table 1 and Table S3). Yet, FunOrder 2.0 clustered more genes together than the original 165 method in other BGCs -to be precise, more essential genes were detected in in 5 BGCs and more gap or 166 extra genes in 4 BGCs compared to the original method (Table S3). The overall enhanced stringency 167 reduced the sensitivity slightly (

174
and that FunOrder 2.0 can be used to identify essential genes in fungal BGCs.

188
Changes of the workflow

189
Within the previously developed work flow (3) (Fig. 1A), we replaced BLAST (13) with DIAMOND (14) for the database search ( Fig 1B). The previous and subsequent steps up to the integrated R-scripts remained the same as described in the original FunOrder method (3). Notably, the BLAST algorithm was 192 kept in the software bundle to extract the sequences from the local database and can be used for an 193 optional remote search of the NCBI database (18). The distances measure obtained after the treeKO 194 algorithm were compiled in matrices, which were used as input for three alternative R-scripts ( Fig 1B). All functions was rearranged in a manner, that all calculations were performed on a single matrix and then on 201 the next. The first matrix to be analyzed is the strict distance matrix followed by the combined distance assessed manually as described previously (3).

205
The second and third R-scripts aim to determine the co-evolved genes automatically. To this end, a PCA 206 is calculated for the strict and the combined distance matrices each. The first three principal components 207 are then considered for defining the clusters by a k-means clustering approach. The difference between 208 the second and the third R-script is the number of clusters used in the k-means clustering approach. In 209 the second R-script, the optimal number of clusters in the first three principal components of the PCAs is 210 determined by NbClust (17)

Performance evaluation
Similar as in the original method we analyzed 30 empirically characterized BGCs to evaluate the ability of distinguish them from so-called gap genes and genes outside of the defined BGC borders. The genes 249 clustering with the core enzyme(s) were considered as "detected". As previously described "we counted 250 the total number of (1a) detected essential genes or (1b) Table S1. ICQ values of protein sets of conserved metabolic pathways of the primary metabolism 402 (BioPath), sequential GCs and random GCs used in this study.  proteomes; sequential GCs, co-localized genes from random loci of different ascomycetes.