Phylogenetic profiling in eukaryotes: The effect of species, orthologous group, and interactome selection on protein interaction prediction

Phylogenetic profiling in eukaryotes is of continued interest to study and predict the functional relationships between proteins. This interest is likely driven by the increased number of available diverse genomes and computational methods to infer orthologies. The evaluation of phylogenetic profiles has mainly focussed on reference genome selection in prokaryotes. However, it has proven challenging to obtain high prediction accuracies in eukaryotes. As part of our recent comparison of orthology inference methods for eukaryotic genomes, we observed a surprisingly high performance for predicting interacting orthologous groups. This high performance, in turn, prompted the question of what factors influence the success of phylogenetic profiling when applied to eukaryotic genomes. Here we analyse the effect of species, orthologous group and interactome selection on protein interaction prediction using phylogenetic profiles. We select species based on the diversity and quality of the genomes and compare this supervised selection with randomly generated genome subsets. We also compare the performance of orthologous groups inferred to be present in the last eukaryotic common ancestor (LECA) with that of orthologous groups that are not. Finally, we consider the effects of reference interactome set filtering and reference interactome species. In agreement with other studies, we find an effect of genome selection based on quality, less of an effect based on genome diversity, but a more notable effect based on the amount of information contained within the genomes. Most importantly, we find that selecting the correct genomes is not the only factor that matters for high prediction performance. Other choices in meta-parameters, such as orthologous group selection, the reference species of the interaction set, and the quality of the interaction set, have a much larger impact on the performance when predicting protein interactions using phylogenetic profiles.
These findings shed light on the differences in reported performance amongst phylogenetic profiling approaches, and reveal, on a more fundamental level, for which types of protein interactions this method holds the most promise when applied to eukaryotes.


The BUSCO metric assesses genome completeness based on the (in our case) absence of single-copy orthologs that are highly conserved among eukaryotic species. The absence of these orthologs can result from incomplete draft genomes or from false negatives in gene prediction, both of which lead to false absences of orthologs. We selected 50 high-quality genomes with the lowest BUSCO values, i.e., genomes with the fewest unexpected absences. We also selected 50 lower-quality genomes with the highest BUSCO values, i.e., genomes with the most unexpected absences (Fig 1.A.). We compared the quality-filtered genome sets with 1000 randomly generated genome sets of 50 genomes each to see if quality-based selection differs from any random sampling of genomes. [...] 1.B.). This suggests that it is more beneficial to filter out lesser-quality genomes than it is to select for high-quality genomes. This result is consistent between two independent scores of genome quality (S1 Fig).
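The selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the input mapping `missing_busco` (genome name to number of absent BUSCO orthologs), the function names, and the toy values are all assumptions made for the example.

```python
import random

def select_by_quality(missing_busco, k=50):
    """Return (high_quality, low_quality) lists of k genome names each.

    missing_busco maps a genome name to its number of absent BUSCO
    orthologs; fewer absences is treated as higher quality.
    """
    # Sort by number of absences, breaking ties by name for determinism.
    ranked = sorted(missing_busco, key=lambda g: (missing_busco[g], g))
    return ranked[:k], ranked[-k:]

def random_genome_sets(genomes, k=50, n_sets=1000, seed=0):
    """Draw n_sets random genome subsets of size k (without replacement)."""
    rng = random.Random(seed)
    pool = sorted(genomes)
    return [rng.sample(pool, k) for _ in range(n_sets)]

# Toy input (hypothetical values): 167 genomes with made-up absence counts.
missing = {f"genome_{i:03d}": i % 40 for i in range(167)}
high, low = select_by_quality(missing)
random_sets = random_genome_sets(missing)
```

The prediction performance of `high` and `low` would then be compared with the distribution of performances over `random_sets`.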

With these results, it seems prudent to select genomes only based on quality when applying phylogenetic profiles. However, there is an inherent bias between genome quality and phylogenetic distribution (Fig 1.A

We analysed the impact of eukaryotic diversity by selecting two sets of 50 genomes, one containing the most similar species (Fig 2.A.) and the other the most diverse species (Fig 2.B.) from our initial species set. The (dis)similarity was measured using an iterative all-vs-all comparison using the cosine distance between genomes and their orthologous group content. We started with the most diverse or similar species pairs and iteratively added to this set the species with the highest (dis)similarity until we obtained 50 genomes (Materials and Methods). We recalculated the protein-interaction prediction performance for both these sets. The prediction performance of both sets is lower than that of the initial set, but not worse than that of any randomly selected genome set (AUC: 0.760 for the dissimilar set and AUC: 0.764 for the similar set) (Fig 2. C. inset). [...]

Diversity and quality both impact performance and we expect it to have a combined [...] measure for each of these two criteria, we can also objectively evaluate the prediction performance by removing genomes from the initial species set one-by-one (Fig 3. A.). Genomes that decrease prediction performance when removed from the initial set can be considered advantageous to phylogenetic profiling, while genomes that increase prediction performance when removed from the initial set can be considered disadvantageous to phylogenetic profiling. We selected the top 50 advantageous and top 50 disadvantageous genomes to see whether these genomes together in their respective sets also influence the prediction performance.
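The one-by-one removal procedure can be sketched as follows. This is an illustration under stated assumptions: `performance` stands in for whatever function recomputes the prediction performance (e.g. an AUC) for a genome set, and the toy scoring function and genome names below are hypothetical.

```python
def leave_one_out_effects(genomes, performance):
    """Change in performance when each genome is removed from the full set.

    performance is any callable that maps a list of genomes to a score.
    A negative effect means removing the genome lowers performance.
    """
    baseline = performance(genomes)
    effects = {}
    for genome in genomes:
        reduced = [g for g in genomes if g != genome]
        effects[genome] = performance(reduced) - baseline
    return effects

def split_by_effect(effects, k=50):
    """Return the k most advantageous and k most disadvantageous genomes."""
    ranked = sorted(effects, key=lambda g: (effects[g], g))
    advantageous = ranked[:k]            # removal decreases performance most
    disadvantageous = ranked[-k:][::-1]  # removal increases performance most
    return advantageous, disadvantageous

# Toy stand-in for the real performance calculation (hypothetical weights).
weights = {"A": 3, "B": -2, "C": 1, "D": 0}
toy_performance = lambda gs: sum(weights[g] for g in gs)

effects = leave_one_out_effects(list(weights), toy_performance)
adv, disadv = split_by_effect(effects, k=2)
```

In this toy case genome `A` is the most advantageous (removing it lowers the score the most) and `B` the most disadvantageous.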

We can also directly relate the differences between these genome sets to how close the genomes in the sets are to the human genome. The cosine distance and the shared orthologous groups of the genomes with the human genome ( [...] A surprising finding is that the advantageous set contains numerous parasitic organisms (S1 Table).
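The cosine distance on orthologous group content underlies both the comparison with the human genome and the selection of the most diverse genome set described earlier. The sketch below assumes each genome is represented as a presence/absence vector over orthologous groups. The greedy criterion (highest mean distance to the genomes already selected) is an assumption, since the exact rule is not spelled out here, and the toy vectors are invented.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two orthologous group content vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 1.0  # convention chosen here for an all-zero vector
    return 1.0 - dot / norm

def shared_groups(u, v):
    """Number of orthologous groups present in both genomes."""
    return sum(1 for a, b in zip(u, v) if a and b)

def greedy_diverse(content, k):
    """Greedily pick k genomes, seeded with the most distant pair."""
    names = sorted(content)
    dist = {(a, b): cosine_distance(content[a], content[b])
            for a in names for b in names}
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    selected = list(max(pairs, key=lambda p: dist[p]))
    while len(selected) < min(k, len(names)):
        rest = [g for g in names if g not in selected]
        # Assumed criterion: highest mean distance to the selected genomes.
        best = max(rest, key=lambda g: sum(dist[(g, s)] for s in selected)
                   / len(selected))
        selected.append(best)
    return selected

# Toy presence/absence vectors over four orthologous groups (hypothetical).
content = {
    "human":    [1, 1, 1, 0],
    "yeast":    [1, 1, 0, 0],
    "parasite": [1, 0, 0, 1],
    "plant":    [0, 0, 1, 1],
}
to_human = {g: (cosine_distance(v, content["human"]),
                shared_groups(v, content["human"]))
            for g, v in content.items()}
diverse = greedy_diverse(content, 3)
```

Replacing `max` with `min` in both places would give the analogous selection of the most similar genomes.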

The results in section 3 (Fig 3.C.) reinforce the notion that the reference interaction set plays a role in the performance of predicting interacting proteins. For these reasons, we analysed how the filtering and choice of reference interactome influences protein interaction prediction performance in eukaryotes. Using an unfiltered human protein interaction dataset reduces the prediction performance from an AUC of 0.779 to an AUC of 0.638 (Fig 5. A.). This performance is also lower than any set of randomly selected LECA orthologous groups (inset). The quality of the interaction data used clearly plays a role in prediction performance, i.e., if we take a noisy "ground truth" it turns out to be difficult to predict this truth. It is difficult to predict interactions with a set littered with false, virtually random, pairs.
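For readers less familiar with the AUC values quoted here, the sketch below computes the area under the ROC curve from its standard pairwise definition: the probability that a randomly chosen interacting pair receives a higher score than a randomly chosen non-interacting pair. How the scores and labels are obtained in this study is not shown in this excerpt, so the inputs below are purely illustrative assumptions.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: profile similarity per protein pair (higher = more similar).
    labels: 1 for a reference interaction, 0 for a negative pair.
    Ties between a positive and a negative score count as half.
    """
    positives = [s for s, l in zip(scores, labels) if l]
    negatives = [s for s, l in zip(scores, labels) if not l]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example (made-up scores): one positive is ranked below one negative.
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

This quadratic-time form is only meant to make the definition explicit; rank-based implementations give the same value more efficiently.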

We indeed find evidence of multiple genes belonging to ancestral complexes enriched in the human interaction set (Fig 5. B.), including enrichment in more straightforward GO terms related to mitochondria and respiration (e.g., GO:0005747, GO:0006120, GO:0032981 and GO:0070469), cilium (e.g., GO:0005929) and spliceosomal components (e.g., SMN complex GO:0032797). We also find evidence in higher-level GO terms that at lower levels reflect complexes known to be present in human and absent in yeast (S2 Table), such as chromatin modification (e.g., [...] Table).

Note, both studies show very strong signals for complexes as well as pathways, which we excluded due to the problem of defining a quality negative interaction set.

In conclusion, we find that for eukaryotes more genomes and better-quality genomes are not necessarily better. Instead, it is the type of information in the genomes that matters. The information in these genomes is not directly related to larger genomes; for instance, parasites increase prediction performance. Instead, the information is related to the interactions of the reference species present in a given genome. Genome selection has a minor influence compared to orthologous group selection and interactome selection, which both greatly improve the performance when predicting protein interactions. Interactome and orthologous group selection are likely the major source of the large variance in reported performances. Ancestral complexes that are repeatedly lost are responsible for the strong performance of phylogenetic profiles in eukaryotes, and it is these hidden choices in orthologous group selection that we should consider when we find large differences in performance between studies.
Material and Methods

1. Initial datasets and methods

We started our investigation from the analysis done in our previous work [19], to investigate the influence of different parameters on the performance of predicting protein-protein interactions using phylogenetic profiles. We showed a relatively high prediction performance using a large set of diverse eukaryotes and orthologous groups inferred to be in the Last Eukaryotic Common Ancestor (LECA). This reference set is called the initial set. Any changes that we made are changes in this initial set. In the sections below, we will briefly describe the composition of this initial set and the methods we used to obtain it.

We inferred orthologous groups on a diverse genome set of 167 eukaryotes using different orthologous group inference methods in our previous work. For this analysis, we chose the best performing method regarding protein interaction prediction, Sonicparanoid [...]

We removed genomes one-by-one from the initial species set of 167 eukaryotes to see how the different genomes influence the performance of protein interaction prediction with phylogenetic profiling. We recalculated the performance for each of these 167 sets. The 50 genomes that increased the performance the most, compared to the initial species set, when removed from the initial set were labelled as disadvantageous. The 50 genomes that decreased the performance the most when removed from the initial set were labelled advantageous. For both the disadvantageous and advantageous sets we recalculated the protein interaction prediction performance.

3. Gene and interactome selection procedures

We compared the results of the orthologous group selection procedures to randomly selected LECA orthologous groups to show that the differences in prediction accuracy are not due to random variations in orthologous group composition. We made a thousand LECA orthologous group sets, each containing a random selection of 63% of the orthologous groups. We calculated the protein interaction prediction performance of each of these sets.

3.1. Selecting orthologous groups

In our initial species set, we used orthologous groups estimated to be in LECA (Methods section 1.1.). We took the raw output of the orthology inference methods and filtered out the LECA orthologous groups to get a set that contains post-LECA orthologous groups. We also recalculated the prediction performance with the raw output of the orthology prediction methods, which is all inferred orthologous groups.

We compared the human BioGRID set filtered at five PubMed IDs with the unfiltered human BioGRID dataset. In the unfiltered dataset, every interaction with fewer than five PubMed IDs is included as well. Removing the five PubMed ID filter should indicate how quality filtering of reference interactions influences prediction performance.
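The random background of LECA orthologous group sets described at the start of this section can be sketched as below. The group identifiers, function names, and the toy scores used for the comparison are assumptions for illustration; in practice each random subset would be scored with the same performance measure as the selected set.

```python
import random

def random_group_subsets(groups, fraction=0.63, n_sets=1000, seed=0):
    """Draw n_sets random subsets, each holding `fraction` of the groups."""
    rng = random.Random(seed)
    pool = sorted(groups)
    k = round(fraction * len(pool))
    return [rng.sample(pool, k) for _ in range(n_sets)]

def fraction_at_or_below(observed, random_scores):
    """Share of random sets whose score is at or below the observed score.

    A value of 0.0 means the observed score is lower than that of every
    random set.
    """
    return sum(1 for s in random_scores if s <= observed) / len(random_scores)

# Toy input: 200 hypothetical orthologous group identifiers.
groups = [f"OG{i:04d}" for i in range(200)]
subsets = random_group_subsets(groups)

# Made-up scores, only to show how an observed value is compared.
share = fraction_at_or_below(0.5, [0.6, 0.7, 0.8])
```

Fixing the seed keeps the thousand subsets reproducible between runs.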

In addition to the human interactions, we selected the Saccharomyces cerevisiae BioGRID interaction database (version 3.5.175, July 2019) [39] to analyse the influence of the reference interactome. We filtered the interactions to keep only the interaction pairs found in at least five publications (PubMed IDs). We followed the same procedure as with the human interaction set (Methods section 1.2.).
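The publication filter can be illustrated with a short sketch. The record layout used here, one `(gene_a, gene_b, pubmed_id)` tuple per reported interaction, is an assumption for the example and not the BioGRID file format. Treating pairs as unordered is likewise an assumption.

```python
from collections import defaultdict

def filter_by_publications(records, min_pubs=5):
    """Keep interaction pairs supported by at least min_pubs PubMed IDs.

    records: iterable of (gene_a, gene_b, pubmed_id) tuples.
    Pairs are treated as unordered, so (A, B) and (B, A) are pooled.
    """
    pubs = defaultdict(set)
    for gene_a, gene_b, pubmed_id in records:
        pubs[frozenset((gene_a, gene_b))].add(pubmed_id)
    return {pair for pair, ids in pubs.items() if len(ids) >= min_pubs}

# Toy records with hypothetical gene names and PubMed IDs.
records = [
    ("A", "B", 101), ("B", "A", 102), ("A", "B", 103),
    ("A", "B", 104), ("B", "A", 105),
    ("A", "C", 101), ("A", "C", 101), ("A", "C", 102),
    ("C", "A", 103), ("A", "C", 104),
]
filtered = filter_by_publications(records)
unfiltered = filter_by_publications(records, min_pubs=1)
```

Here the pair A and B is reported in five distinct publications and is kept, while A and C has only four distinct PubMed IDs (one is duplicated) and is dropped by the filter.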

Following this analysis, we hypothesized that the drop in prediction performance for yeast is caused by the loss of ancestral protein complexes in yeast. To test this, we chose interacting LECA orthologous groups that contained only human genes (sample set) and calculated the enrichment relative to the set with interacting LECA orthologous groups containing human and yeast genes (population set). We calculated the enrichment using the following equation: ÷ , where n is the total number of genes associated with a GO