Human mitochondrial protein complexes revealed by large-scale coevolution analysis and deep learning-based structure modeling

Recent development of deep-learning methods has led to a breakthrough in the prediction accuracy of 3-dimensional protein structures. Extending these methods to protein pairs is expected to allow large-scale detection of protein-protein interactions and modeling protein complexes at the proteome level. We applied RoseTTAFold and AlphaFold2, two of the latest deep-learning methods for structure predictions, to analyze coevolution of human proteins residing in mitochondria, an organelle of vital importance in many cellular processes including energy production, metabolism, cell death, and antiviral response. Variations in mitochondrial proteins have been linked to a plethora of human diseases and genetic conditions. RoseTTAFold, with high computational speed, was used to predict the coevolution of about 95% of mitochondrial protein pairs. Top-ranked pairs were further subject to the modeling of the complex structures by AlphaFold2, which also produced contact probability with high precision and in many cases consistent with RoseTTAFold. Most of the top ranked pairs with high contact probability were supported by known protein-protein interactions and/or similarities to experimental structural complexes. For high-scoring pairs without experimental complex structures, our coevolution analyses and structural models shed light on the details of their interfaces, including CHCHD4-AIFM1, MTERF3-TRUB2, FMC1-ATPAF2, ECSIT-NDUFAF1 and COQ7-COQ9, among others. We also identified novel PPIs (PYURF-NDUFAF5, LYRM1-MTRF1L and COA8-COX10) for several proteins without experimentally characterized interaction partners, leading to predictions of their molecular functions and the biological processes they are involved in.


23
Recent advances in Deep-Learning (DL) techniques for structure prediction based on protein-sequence 24 alignments have led to a breakthrough in structural genomics. These state-of-the-art methods, which 25 are now sensitive enough to work with shallow alignments, can predict protein 3D structure to atomic 26 accuracy [1][2][3]. The methods can be applied on the whole-proteome scale [4], with the 3D structures for 27 almost all human proteins being recently "determined" by DeepMind using Alphafold2. While the 28 quality of these 3D structures remains to be validated by the scientific community, they unquestionably 29 facilitate functional characterization of these proteins and interpretation of disease-causing mutations in 30 them [4]. 31 One of the next research directions where these DL methods can transform is to determine 32 protein-protein interactomes. Using statistical analyses of coevolutionary signal between residues from 33 different proteins, we have shown that the interface between interacting proteins in Bacteria can be 34 accurately predicted from alignments of diverse sequences on the proteome scale [5]. This type of in 35 silico protein-protein interaction screen appears to be more accurate than large-scale experimental 36 screens such as yeast-two-hybrid and affinity-purification-mass-spectrometry [5]. The application of 37 similar methods to human sequences were hindered by limitations in sensitivity for shallower Eukaryotic 38 sequence alignments. However, we expect the increased performance of current DL methods will 39 enable proteome-wide screen of protein-protein interaction (PPI) in human. 40 Mitochondria are dynamic membrane-bound organelles found in almost all eukaryotic cells. 41 They carry a variety of essential cellular functions ranging from energy production and metabolism to 42 regulation of cell death and immune response [6][7][8][9][10]. Mitochondria evolved from an endosymbiotic 43 relationship between ancestral bacteria and eukaryotes [11]. While these organelles maintain their own 44 genetic materials and transcriptional and translational machineries, most proteins targeted to 45 mitochondria are now encoded in the nuclear genome. Technology advances in genomics, 46 transcriptomics, proteomics, and metabolomics have extended the repertoire of mitochondrion-47 associated proteins, identified their interaction partners, and expanded the knowledge about their 48 functions [12][13][14][15][16]. The MitoCarta3.0 database has summarized 1136 mitochondrial proteins in human 49 and classified them according to their pathways [17]. These mitochondrial proteins are located in four 50 compartments: the mitochondrial outer membrane (MOM), the intermembrane space (IMS), the 51 mitochondrial inner membrane (MIM), and the matrix. Characterizations of mitochondrial proteins and 52 their complexes are crucial to our understanding of how mitochondria regulate various cellular activities 53 and the mechanisms of a diverse range of mitochondrion-related diseases [18]. 54 As a pilot study, we applied RoseTTAFold [1] and Alphafold2 [2] to identify interacting protein 55 pairs and predict their interfaces among human mitochondrial proteins. We showed that a combination 56 of these methods can identify interacting partners and model protein complexes on a large-scale with 57 high precision. With these methods, we captured many known PPIs that were supported by multiple 58 experiments and/or protein complex structures. In addition, we obtained detailed information about the 59 interfaces of known PPIs without experimentally determined complex structures and identified novel 60 PPIs for several proteins without experimentally characterized interaction partners. These findings 61 provided structural insight into the mechanisms of some disease-causing mutations and functional 62 predictions of several poorly characterized proteins. 63 64 Results and discussion 65 In silico screen can identify interacting proteins and predict their interface at high precision 66 We designed a computational procedure (flowchart in Figure 1A, see Materials and methods) to perform 67 in silico PPI screen for 605099 pairs of 1118 mitochondrial proteins from the MitoCarta3.0 database [17] 68 using RoseTTAFold [1] and AphaFold2 [2]. About 5% of mitochondrial protein pairs were excluded from 69 this study because their combined lengths were beyond our GPU's compute capability. We used the best 70 inter-residue score (see Materials and methods) between a pair of proteins to indicate their interaction 71 confidence. Precision-recall curves based on true PPIs (positive controls) and random protein pairs 72 (negative controls, see Materials and methods) evaluate the performance of these methods. The 73 RoseTTAFold 2-track model, a fast version of RoseTTAFold without 3D representation of the proteins 74 inside the model, is remarkably better than Direct Coupling Analysis (DCA), a statistical method we used 75 previously in coevolution-based Bacteria PPI screens [5] ( Figure 1B). The area under the precision-recall 76 curve scores (AUC) for DCA (0.014) is much lower than the AUC for RoseTTAFold (0.113). In addition, we 77 observed that residues in the mitochondria-targeting peptides of different proteins can show high 78 contact probability, although they are not expected to be involved in protein-protein interface as they 79 are cleaved after being transferred to mitochondria. Excluding the mitochondrion-targeting peptides 80 5 before computing the highest inter-residue contact probability for a pair of proteins further improved 81 the accuracy of the screen ( Figure. 1C) (AUC improved from 0.113 to 0.132). 82 Figure 1. A. A flowchart of the procedure used in this study to perform PPI prediction and structural modeling. B. The precision-recall curves of three methods in detecting positive PPIs of mitochondrial protein pairs. DCA -Direct Coupling Analysis method, RoseTTAFold (no transit peptide) is the RoseTTAFold method used on residue pairs excluding those involving any residue in mitochondrial transit peptides. C. Counts of high-scoring interacting partners (contact probability score >0.9) (blue bars) for proteins with the highest number of such partners. The red bars (if any) show the number of true positives among the top-scoring pairs. D. Precision-recall curves for RoseTTFold and AlphaFold2 on the 401 pairs of selected PPI predictions. E. Distribution of the fraction of predicted contacts that are mapped to experimentally determined complex interfaces. F. The left panel shows the AlphaFold2 model of MT-CO2 (magenta) and MT-CO3 (rainbow) complex, with the top 20 RoseTTAFold-predicted contact pairs shown as yellow lines connecting their C- atoms. The right panel shows the AlphaFold2 model of the MT-CO1 (gray), MT-CO2 (magenta) and MT-CO3 (rainbow) complex. Top contact pairs are shown in yellow, cyan, and red lines for MT-CO2/MT-CO3, MT-CO1/MT-CO2 and MT-CO1/MT-CO3 pairs, respectively. Upon apoptotic stimuli, AIFM1 is cleaved to produce a C-terminal soluble fragment that is translocated 158 to the nucleus to trigger downstream apoptotic events. AIFM1 is also essential for cell viability as its 159 downregulation results in the loss of CI subunits. The pro-survival activity of AIFM1 is tied to its 160 interaction with CHCHD4, a CHCH domain-containing protein located in the IMS [34]. CHCHD4 (the 161 ortholog of yeast protein Mia40) is a crucial component of the redox-regulated MIA machinery 162 responsible for the import of a set of cysteine-containing protein substrates, such as COX17, COX19,163 MICU1 and COA7 [35]. CHCHD4 functions as a chaperone and catalyzes the formation of disulfide bonds 164 in these substrate proteins, including some respiratory chain subunits and proteins with a variety of 165 other functions such as redox regulation and antioxidant response [35]. 166

Coevolution analysis revealed protein interaction interfaces for known protein complexes
We identified CHCHD4-AIFM1 as one of the top scoring PPI pairs with both AlphfaFold2 and 167 RoseTTAFold contact probabilities above 0.99. The AlphaFold2-predicted complex structure between 168 them is shown in Figure 2A. Their interaction is primarily mediated by the N-terminus of CHCHD4, which 169 is missing in the solution structure of CHCHD4 and could be disordered on its own [36]. The predicted 170 interacting residues of CHCHD4 are not from the CHCH domain (residue range: 45 to 109) with two 171 (CX 9 C) 2 motifs [37,38]. Instead, our predicted interaction site is consistent with experimental studies 172 describing the N-terminal CHCHD4 27-residue segment being sufficient for its interaction with AIFM1 173 [34]. Our prediction also mapped the interaction surface on AIFM1, which has eluded experimental 174 studies thus far. The AIFM1 surface is mainly on the edge -strand (residue 504-508) in the C-terminal 175 Reductase_C domain (Figure 2A). The N-terminal segment of CHCHD4 forms a -hairpin that interacts 176 with this edge -strand ( Figure 2A). More than 20 missense mutations in AIFM1 (magenta positions in 177 Figure 2A) have been reported in association with several diseases and have been classified as 178 pathogenic/likely pathogenic [39][40][41][42]. Interestingly, one likely pathogenic mutation from AIFM1 (Y560H) 179 is mapped near the interaction site (magenta sidechain shown in Figure 2A), and it is predicted to 180 interact with two hydrophobic residues (I12 and F14) on CHCHD4 (sidechains shown in green spheres in 181 Figure 2A). Interactions between these hydrophobic residues could contribute to the binding energy 182 between CHCHD4 and AIFM1 and could explain potential deleterious effects for this mutation. ATPAF2 (ATP synthase F1 complex assembly factor 2) was recently predicted by a computational tool 213 (CLustering by Inferred Co-expression, CLIC) [49]. The interaction between these two proteins was also 214 experimentally validated [49]. FMC1 belongs to the LYRM family of proteins [50,51], and ATPAF2 is one 215 of the assembly factors of CV [52]. This interaction was also found in two high-throughput PPI studies 216 (reported in BioGRID) [14,16]. Our PPI screen ranked the FMC1 and ATPAF2 pair among the top hits 217 with both AlphaFold2 and RoseTTAFold contact probabilities above 0.99. The FMC1 protein is predicted 218 to adopt a three-helix bundle fold, typical of the LYRM family proteins. The interaction site on FMC1 is 219 mapped to the third core -helix of such a fold ( Figure 2C). The structure model of ATPAF2 contains a 220 small 3-stranded -sheet at the N-terminus, while the rest of the protein is mainly -helical. The 221 interaction site on ATPAF2 is mapped to the last -helix and includes one residue (magenta sidechain 222 shown in spheres in Figure  Our analysis revealed strong AlphaFold2 and RoseTTAFold contact probability scores (>0.99) 235 between COX19 and COX11. Predicted interacting residues on COX19 are mainly mapped to the -236 helical hairpin of the CHCH domain with the twin CX9C motifs ( Figure 2D). The CX9C motifs in COX19 237 proteins are characterized by two conserved YL diads, giving rise to the refined motif signatures of 238 Cx 6 YLxC [54]. In the yeast system, these YL diads were proved to be crucial for the binding to COX11 239 based on mutagenesis studies [54]. This binding mode is consistent with our model, which showed high 240 contact probabilities for residues corresponding to yeast YL diads in the human COX19 protein, where 241 the first YL diad is changed to FM and the second YL is maintained (magenta residues in Figure 2D). The 242 Cx 6 YLxC -hairpin of COX19 mainly interacts with the second, third, and sixth core -strands of the 243 immunoglobulin domain of COX11 [56] ( Figure 2D). The COX19 structural model also includes two small 244 -helices (colored in dark green in Figure 2D) C-terminal to the Cx 6 YLxC helix hairpin (colored in lighter 245 green). Residues in the turn between these two C-terminal -helices are also predicted to interact with 246 COX11 residues from the loop before the first core -strand and the loop between the second and third 247 -strands ( Figure 2D). 248 249 ECSIT and NDUFAF1 250 ECSIT (Evolutionarily Conserved Signaling Intermediate in Toll pathways) was originally identified as a 251 protein involved in the Toll and bone morphogenetic protein signaling pathways [57,58]. Later studies 252 showed that ECSIT is a CI assembly factor that interacts with other assembly factors such as NDUFAF1 253 and ACAD9 [59,60]. The interaction between ECSIT and NDUFAF1 is strongly supported by AlphaFold2 254 and RoseTTAFold (contact probability >0.99). The AlphaFold2-predicted ECSIT structure has an N-255 terminal domain consisting of -helical repeats and a C-terminal alpha+beta domain with -helices 256 sandwiching a mainly anti-parallel -sheet of six -strands (cyan structure in Figure 2E). The AlphaFold2 257 model of NDUFAF1 has an N-terminal -helix with mainly hydrophilic residues and a C-terminal -258 sandwich with a jelly roll topology (green structure in Figure 2E). Multiple NDUFAF1-interacting sites of 259 were predicted in ECSIT, including the -hairpin formed by the third and fourth core -strands of the 260 second domain, the C-terminal end of the first -helix of the second domain, the last -helix of the first 261 domain, and the loop in between the two domains. The interaction surface on NDUFAF1 is mainly 262 mapped to the loop regions at one end between the two -sheets of the jelly roll domain. Two 263 pathogenic mutations responsible for Mitochondrial complex I deficiency, nuclear type 11 (R217C and 264 G245R, colored in magenta in Figure 2E  Residues involving missense disease-causing mutations in COQ7 are colored in red, and sidechain of L111 near the interaction interface is shown in spheres. Sidechains are shown in magenta spheres for COQ9 residues whose mutagenesis affected the binding to COQ7.
Such experimental results are consistent with our predictions, which mapped the interacting residues 305 from COQ9 to this surface patch ( Figure 3B). Several COQ9 residues, such as D237, Y240, and W241 306 (sidechains shown as magenta spheres in Figure 3B

Identification of PYURF as a potential subunit of the NDUFAF5 hydroxylase complex 338
Affinity enrichment mass spectrometry (AE-MS) was used in a previous study to identify PPI of 50 select 339 mitochondrial uncharacterized proteins (MXPs) [14]. One MXP, formerly named C17orf89, was found to 340 interact with NDUFAF5 and play an important role in CI assembly [14]. This protein has since been 341 renamed NDUFAF8. We identified the NDUFAF8-NDUFAF5 complex in our screen with the contact 342 probability of 0.911 by RoseTTAFold and the contact probability of 0.444 by AlphaFold2. We also 343 identified another MXP named PYURF that likely interact with NDUFAF5. The PYURF-NDUFAF5 pair has a 344 AlphaFold2 score of 0.998 and a RoseTTAFold score of 0.959. Predicted interaction with NDUFAF5 345 suggests that the function of the small uncharacterized protein PYURF could be related to CI assembly as 346 well. 347 NDUFAF5 adopts a Rossmann-like fold with seven -strands and belongs to the family of S-348 adenosylmethionine-dependent methyltransferases [81]. However, instead of being a methyltransferase, 349 it catalyzes the hydroxylation of an arginine residue in of the CI subunit NDUFS7, and this 350 posttranslational modification is crucial in the early stage of CI assembly [81]. As an essential assembly 351 factor, NDUFAF5 has mutations causing the disease of mitochondrial complex I deficiency (nuclear type 352 16, OMIM: 618238) [82]. Mapping of the predicted interface residues onto the structure models of 353 NDUFAF8, NDUFAF5 and PYURF suggests that NDUFAF8 and PYURF occupy distinct interaction sites on 354 NDUFAF5 ( Figure 5A) that are distal from each other. NDUFAF5 (cyan in Figure 5A) residues coevolving 355 with NDUFAF8 (orange in Figure 5A) is mapped to an -helix before the last -strand and is far from the 356 active site. A likely pathogenic mutation (F55L, sidechain of F55 shown in blue spheres in Figure 5A) in 357 NDUFAF8 [83] was mapped near its interaction site with NDUFAF5. 358 NDUFAF5 residues coevolving with PYURF (green in Figure 5A) are mapped near the active site 359 to an edge -strand, the -helix before it and the Rossmann crossover loop after it. PYURF is a small 360 protein predicted to adopt a Trm112p-like fold [84]. Remote homologs of PYURF include Rieske-like 361 Our coevolution analysis as well as the remote homology of PYURF to Trm112 proteins suggests that 366 PYURF is a subunit that forms a complex with the NDUFAF5 hydroxylase and could play an essential role 367 in the catalytic activity. The involvement of PYURF in CI assembly was supported by a recent genome-368 wide CRISPR death screen that shows the phenotype of CI deficiency when the PYURF gene was 369 disrupted [88]. 370 371

Identification of potential interaction partners and interfaces of LYRM proteins 372
The LYRM (leucine/tyrosine/arginine motif) family proteins (also called LYR proteins) are small proteins 373 with diverse functions. LYRM proteins are found to associate with OXPHOS complexes as CI subunits 374 (NDUFB9 (LYRM3) and NDUFA6 (LYRM6) The functions of LYRM1 and LYRM9, two LYR-domain containing proteins, remain to be 387 elucidated [94,97]. In our PPI screen LYRM1 is predicted to interact with MTRF1L (mitochondrial 388 translational release factor 1-like) (AlphaFold2 score: 0.902, RoseTTAFold score: 0.955). MTRF1L is 389 responsible for translation termination upon detection of certain termination codons [93]. In the 390 structural model of the LYRM1-MTRF1L complex, the interface residues in LYRM1 are mapped to the 391 first -helix of the LYR domain with a three-helix bundle fold. The interface residues from MTRF1L are 392 mapped to a N-terminal -helical domain [98] ( Figure 5B). RoseTTAFold also reported modestly high 393 contact probability scores for the LYRM1-LYRM9 pair (score: 0.862) and the LYRM9-MRPL57 394 (mitochondrial ribosomal protein 63) pair (score: 0.907), albeit the AlphaFold2 scores for them are low 395 (below 0.2). AltMIEF1, a small protein encoded by an open reading frame in the 5′ region of the MIEF1 396 gene, was the only LYRM protein previously found to function in translation and ribosomal assembly 397 [92]. Our study suggests that the uncharacterized LYRM1 and LYRM9 could also play important roles in 398 regulation of translation and other functions related to mitochondrial ribosomes. 399 HHpred sequence similarity searches and Alphafold2 structural models also suggest the 400 presence of a divergent LYR domain in UQCC2, a small protein that interacts with UQCC1 and functions 401 in the assembly of CIII [99]. UQCC1 and UQCC2 are orthologs to fungal CIII assembly factors Cbp3p and 402 Cbp6p, respectively. AlphaFold2 structural model predicts that UQCC1 adopts an -helical bundle fold 403 with six major -helices, consistent with the experimental structure of its bacterial homolog [100]. Our 404 analysis predicted high probability for interaction between UQCC1 and UQCC2. The predicted UQCC1-405 UQCC2 interface in UQCC2 is mapped to the N-terminal LYR domain (colored in lighter green in Figure  406 5C) that adopts a three-helix bundle fold, as well as regions C-terminal (colored in darker green in Figure  407 5C) to the LYR domain. Our findings suggest UQCC2 as a new member of the LYRM family that functions 408 in CIII assembly [101]. 409 410 Identification of COA8 as a potential subunit of the COX10 enzyme complex 411 COA8, previously named APOPT1 (apoptogenic-1), was recently discovered to function in CIV assembly. levels. The functional mechanism of COA8 in CIV assembly has not been revealed. Our analysis identified 416 high contact probability scores (>0.99 by both AlphaFold2 and RoseTTAFold) between COA8 and COX10. 417 As an essential CIV assembly factor, COX10 catalyzes the farnesylation of the vinyl group of heme B to 418 produce heme O, the first step of the mitochondrial heme A biosynthesis required for CIV assembly. 419 The predicted interaction between COA8 and COX10 suggests that COA8 could be a subunit of 420 the heme farnesyltransferase complex. The AlphFold2 model of the COA8-COX10 complex revealed that 421 COA8 forms two long -helices, and COX10 is a multi-pass transmembrane protein with 9 422 transmembrane segments, consistent with its homology to proteins in the UbiA prenyltransferase family 423 (Pfam: UbiA) [106, 107] ( Figure 5D). The predicted interface residues are mainly mapped to the middle 424 of the second -helix in COA8 and in between the second and third transmembrane segments of COX10 425 ( Figure 5D), which is part of the cap domains (colored in blue in Figure 5D)  For each organism we identified up to one protein (best hit, if available) that shows the highest 450 sequence similarity to the target protein from the combined BLAST hits found by multiple representative 451 sequences. These best hits together with the target protein are considered as the expanded orthologous 452 group for the target protein. This group of sequences were subject to multiple sequence alignment by 453 MAFFT (with the --auto option) [113]. To construct the joint alignment for any two target proteins, we 454 identified the intersection of the organisms for their orthologous groups and merged the sub-alignments 455 containing only these organisms for the two proteins. Positions that are gaps in the human proteins 456 were removed. Predictions of coevolving residues by deep learning neural network were carried out for 457 protein pairs that have a combined length of less than 1500 amino acid residues. 458 459 460 We applied the RoseTTAFold 2-track model, a faster but inferior model than the 3-track one, to each 461 concatenated alignment [ref]. RoseTTAFold predicts the probability density for the distance between 462 each residue pair. We summed up the probability for distance bins below 10Å and used it as the contact 463 probability (ranging from 0 to 1) of a residue pair. Residue pairs involving the last 10 residues in the first 464 sequence and the first 10 residues of the second sequence were discarded as they represent artificially 465 continuous segment in the merged alignment and could have biased scores. To further improve the 466 performance, we also removed residue pairs involving the mitochondrial target sequences (defined in 467 the UniProt Feature fields) in either protein. The best contact probability of the remaining residue pairs 468 was reported as the contact probability score for the protein pair. 469

Selection of top predictions
We observed that some proteins enriched in top hits with high contact probability scores could 470 be false positives. To downgrade the influence of these proteins, we designed a score to penalize 471 proteins involving many high-scoring pairs. For each protein, we recorded up to 20 contact probability 472 scores of its highest scoring pairs with contact probability scores more than 0.5. If the number of pairs 473 ( ) with scores more than 0.5 for a protein is less than 20, we extended the list of its top scores by 474 adding 20 − pseudo scores of 0.5 at the end, such that the list consists of 20 numbers. The average of 475 the 20 numbers in this score list is then used a penalty weight, and its minimum value is 0.5. The new 476 adjusted score for any protein pair is calculated as the original score divided by the maximum of the two 477 penalty weights of the two proteins. 478 We selected the top 200 pairs ranked by the original contact probability score and the top 350 479 pairs ranked by the adjusted score. Their union consists of 401 pairs of proteins. AlphaFold2 was used to 480 build structural models for these 401 pairs. AlphaFold2 also produces probability distribution for 481 residue-residue distances, and we computed contact probability as the sum of probability for distance 482 bins below 10 Å. The highest contact probability between residues of two proteins was used to indicate 483 the contact probability for this protein pair.    interface is shown in spheres. Sidechains are shown in magenta spheres for COQ9 residues whose 551 mutagenesis affected the binding to COQ7. 552 CAP domains of COX10 are colored in blue. COX10 residues involving missense disease-causing 561 mutations are colored in magenta, and those near the interaction interface have their sidechains shown 562 in spheres. D215, E211, and K278 of COX10 (with carbon atoms colored in yellow) are conserved 563 charged residues likely involved in catalysis. The residue (F118) involving a missense disease-causing 564 mutation in COA8 has its sidechain shown in red spheres. 565 566