A pan-plant protein complex map reveals deep conservation and novel assemblies

Plants are foundational to global ecological and economic systems, yet most plant proteins remain uncharacterized. Protein interaction networks often suggest protein functions and open new avenues to characterize genes and proteins. We therefore systematically determined protein complexes from 13 plant species of scientific and agricultural importance, greatly expanding the known repertoire of stable protein complexes in plants. Using co-fractionation mass spectrometry, we recovered known complexes, confirmed complexes predicted to occur in plants, and identified novel interactions conserved over 1.1 billion years of green plant evolution. Several novel complexes are involved in vernalization and pathogen defense, traits critical to agriculture. We also uncovered plant analogs of animal complexes with distinct molecular assemblies, including a megadalton-scale tRNA multi-synthetase complex. The resulting map offers the first cross-species view of conserved, stable protein assemblies shared across plant cells and provides a mechanistic, biochemical framework for interpreting plant genetics and mutant phenotypes.


INTRODUCTION
Plants make up most of the planet's biomass and sustain global economic and environmental systems. Despite increasing numbers of sequenced plant genomes, biochemical characterization of encoded plant proteins captures a relatively small portion of the expected breadth of biological functions. The best characterized dicotyledonous plant, Arabidopsis thaliana , encodes approximately 35,000 protein-coding genes, but the functions of the majority of these proteins remain uncharacterized, even by homology (Rhee and Mutwil, 2014) . This trend is similar for Oryza sativa (rice), a critical food crop, and the best characterized monocotyledonous plant (Kawahara et al., 2013) . In marked contrast to other model organisms, it is estimated that only 5% of Arabidopsis proteins and considerably fewer proteins in other plants have had biochemical activity, localization, and biological roles determined by direct experimentation (Niehaus et al., 2015;Rhee and Mutwil, 2014;Swarbreck et al., 2008) . Extraordinary experimental efforts are needed to de ine the core expressed proteome and molecular machinery of plants.
The active multiprotein assemblies in plant cells have not yet been systematically de ined, in stark contrast to recent progress in humans (Havugimana et al., 2012;Hein et al., 2015;Huttlin et al., 2015Huttlin et al., , 2017Kirkwood et al., 2013;Kristensen et al., 2012;Malovannaya et al., 2011;Wan et al., 2015) . Determining protein-protein interactions is a key step for discovering gene and protein function and frequently opens new avenues to study and manipulate critical cellular processes (Eisenberg et al., 2000;Hartwell et al., 1999;Schwikowski et al., 2000;Walhout and Vidal, 2001) . In model organisms, such as yeast or Drosophila , systematic mapping of protein complexes led to critical functional insights, facilitated understanding of conserved and disease-related pathways (Vidal et al., 2011) , and helped characterize uncharacterized proteins via guilt by association (Hu et al., 2009;Marcotte et al., 1999;Oliver, 2000) . Revealing this class of biochemical information for plants will dramatically advance efforts in fundamental plant research, as well as guide practical applications such as improvements in crop yields, disease/stress resistance, and biofuel production.
Unfortunately, many of the techniques that have been used to discover protein complexes at scale in animals and yeast (e.g., high-throughput af inity puri ication (AP-MS)) are prohibitively expensive or even intractable at large-scale in plants due to issues of complex genomes, polyploidy, and transformation ef iciency. Consequently, AP-MS experiments in plants have been limited to targeted protein families, primarily in Arabidopsis and rice (Bassel et al., 2012;Rohila et al., 2009;Van Leene et al., 2008, 2011Zhang et al., 2010) . Such approaches for studying protein-protein interactions are disproportionately dif icult in less well-characterized plant species necessitating new strategies for comparative studies.
Co-fractionation mass spectrometry (CF-MS) is a high-throughput method to detect interacting proteins, applicable to any organism, without the need for antibodies or transgenic epitope tagging of individual proteins. CF-MS involves chromatographically separating a native protein extract, then identifying proteins in each biochemical fraction using mass spectrometry. Co-elution (co-fractionation) of proteins in a separation serves as evidence of physical association, which, when measured over multiple distinct separations is a rigorous signal of protein interaction (Wan et al., 2015) . Observation of repeated co-elution in multiple different species reduces species-speci ic artifacts and adds power to discover conserved-and thus more likely functional-interactions. Furthermore, the use of machine learning methods allows for strong control over false discovery rates of protein interactions.
Several papers have classi ied candidate Arabidopsis protein complexes using native (non-denaturing) separations such as gels (Takabayashi et al., 2017) or co-fractionation (Aryal et al., 2017;Gilbert and Schulze, 2019;McBride et al., 2019) . However, in a single or small set of experiments, two proteins can coincidentally co-elute without a physical association. By comparison, repeated co-elution using multiple distinct species, tissues, and biochemical separations provides increasing con idence in a protein interaction (Wan et al., 2015) . As with genome-wide association studies (GWAS) or recombination mapping of a single mutation by phenotype, large amounts of data are required to build a statistically strong observation.
In this largest survey to date of expressed plant proteins as well as their physical assemblies, we collected a massive and diverse plant co-elution dataset to de ine high con idence protein-protein interactions in a statistical computational framework, recovering known complexes and identifying novel complexes conserved across plants for over a billion years. Multiple newly discovered complexes have direct relevance to agronomically important traits. Collectively, the resulting set of protein abundances and map of stable protein interactions will help interpret plant gene functions in a mechanistic, evolutionary, and biochemical framework.

A massive data set of protein abundances and co-puri ication from 13 plant species
We generated a large, diverse, and representative proteomics dataset from 13 species spanning 1.1 billion years of green plant evolution ( Figure 1A ). Our compendium incorporates proteomic data from Arabidopsis thaliana, Brassica oleracea (broccoli), Glycine max (soy), Cannabis sativa (hemp), Solanum lycopersicum (tomato), Chenopodium quinoa (quinoa), Zea mays (maize), Oryza sativa ssp. japonica (rice), Triticum aestivum (wheat), Cocos nucifera (coconut), Ceratopteris richardii (fern), Selaginella moellendorf ii (spikemoss) , and the green algae Chlamydomonas reinhardtii . We chose crop and model plants important to the research community, including species that overcome speci ic technical challenges (e.g. high yields of nuclei from broccoli and coconut; embryonic tissue from wheat and hemp). We included primitive vascular plants (fern, spikemoss) and single-celled green algae ( Chlamydomonas ) for their ancestral characteristics. Our dataset's combination of multiple species and diverse cell types gives a broad view of expressed proteins across Viridiplantae .
We separated each native, non-denatured protein extract by some biophysical property, either size exclusion chromatography (SEC), ion exchange chromatography (IEX), or isoelectric focusing (IEF) ( Figure 1B-C ). Each chromatographic fraction was analyzed using high resolution, high sensitivity liquid chromatography/mass spectrometry (LC/MS). In all, we collected 14,520,970 interpretable peptide mass spectra from 2,111 individual chromatographic fractions, each capturing distinct subsets of native plant proteins and protein assemblies. These protein abundance measurements will be broadly valuable for addressing diverse questions of plant biochemistry and function, including protein modi ications, expression, and determining which species or tissue contains high abundance of any particular protein.

An evolution-informed strategy improves proteomics coverage and enables comparisons across species with different ploidy levels
Integrating protein observations from different organisms is complicated due to orthology mapping. This long-standing problem is even more extreme in plants due to their often complex and polyploid genomes, as well as past whole-genome duplications (Jiao et al., 2011) . For example, most farmed wheat is hexaploid and contains over 100,000 genes, which complicates comparison to model diploids such as Arabidopsis, with its approximately 35,000 genes. The existence of multiple near-identical proteins also reduces the number of peptides that uniquely match a single protein, reducing protein recovery by standard proteomics methods. Current protein-grouping statistical methods for assigning peptides to proteins tend to perform erratically for highly redundant genomes, in practice allocating shared peptides semi-randomly across similar proteins. We thus developed a novel evolution-informed protein-grouping approach which is generally applicable to proteomic data from any arbitrary number of different species.
Our strategy was to interpret mass spectral observations in terms of protein orthogroups, rather than individual proteins. An orthogroup (OG) is a set of genes in modern organisms that all derive from the same original gene in those organisms' last common ancestor. We began by assigning all protein-coding genes in each plant species to predetermined orthogroups (Huerta-Cepas et al., 2017) , thereby organizing highly related protein sequences into homologous protein families ( Figure 1D ). We then considered mass spectra from any orthogroup member protein as evidence for the abundance of their orthogroup ( Figure 1E ) . This analysis markedly increased the number of observed peptides because peptides shared by multiple members of an orthogroup (but not unique to a single protein) could now contribute to quanti ication. Importantly, orthogroup identi iers, unlike individual protein identi iers, are consistent across species and could be used as a key for integrating data from multiple species. Thus, we collapse the three proteins in the Ribosomal Protein L36 orthogroup in Arabidopsi s to a set, directly comparable to the behavior of that orthogroup in other species, for example the set of seven proteins in this orthogroup in wheat. (We refer to orthogroups by available Arabidopsis common gene names throughout; Table S1 provides additional identi iers.) Figure 1F visually summarizes our over 2 million protein abundance measurements across plant species integrated into this comparative phylogenetic framework and Figure 1G highlights speci ic examples of complexes from the data in Figure 1F (described in File S1, data in File S2 ).
Consistent with ploidy levels ( Figure 2A ), diploid organisms such as Arabidopsis and Chlamydomonas exhibited a peak of one protein per orthogroup. In contrast, tetraploid quinoa and soy each showed a peak of two proteins per orthogroup (one from each of two subgenomes) and in the same manner, hexaploid wheat showed a peak of three proteins. Our data are consistent with the hypothesis that the current spikemoss ( Selaginella moellendor ii ) genome was a greenhouse hybrid (Banks et al., 2011) , with a peak of two proteins per orthogroup, or one protein per parent genome.
This orthogroup-based proteomics interpretation strategy increased the recovery of unique spectral counts for the highly redundant proteome of hexaploid wheat by over 300%, while not strongly impacting organisms with small diploid genomes, e.g., Chlamydomonas ( Figure 2B ). Similarly, coverage of observed orthogroups is less variable than coverage of observed proteins across species. ( Figure 2C ). Collapsing sets of evolutionarily-related proteins into orthogroups is thus a lexible and broadly applicable solution for cross-species proteomics analyses, especially across plants with varying ploidy levels.

Characteristics of expressed plant proteomes
Our data represent over 141,910 unique proteins and 23,896 orthogroups from diverse species and extracts, the largest proteomic survey of plants performed to date, covering broad areas of function ( Figure S1) . We suf iciently capture the proteome observable by native fractionation mass spectrometry, as evidenced by saturating the number of new orthogroups observed per additional experiment ( Figure 2D ), suggesting that more samples are unlikely to signi icantly improve conserved proteome coverage. In all, we observed 96.7% of the 11,339 most conserved Viridiplantae orthogroups (details in Methods). Our dataset, therefore, provides a meaningful snapshot of the conserved and expressed proteome of green plants.
Collapsing sets of proteins to orthogroups masks individual protein behavior, so we examined the behavior of paralogs within orthogroups. We found that about half of orthogroups with three or more member proteins contain one dominant protein that is highly expressed relative to other members ( Figure 2E ). An approximately equal set have protein expression more evenly shared across members. These trends were consistent across multiple species with widely varying genome sizes ( Figure 2E ). Thus, some degree of functional divergence, as measured by differential expression, was evident in about 50% of multi-gene orthogroups.
The proteins we observed also tended to be products of more abundant mRNAs ( Figure 2F ). Work in other organisms has shown that there is an imperfect correlation between protein abundances and RNA transcript levels, largely attributable to post-transcriptional, translational, and degradation rate effects on steady-state protein levels (Vogel and Marcotte, 2012) . Our data here similarly illustrate this point (Spearman r = 0.45, Figure 2G ). One notable outlier is the enzyme RuBisCo, the most abundant protein on earth (Feller et al., 2007) and the most abundant protein in 16/35 of our experiments.
RuBisCo is substantially more abundant at the protein level than would be expected based on its transcript levels, e.g. in Chlamydomonas ( Figure 2G ) or in Arabidopsis tissues ( Figure  S1 ).

Systematic identi ication and scoring of stable protein-protein interactions
In many cases, subunits of known complexes show co-elution patterns easily detected by eye, as for subunits of the 20S proteasome, prefoldin, and TSET complexes that respectively coelute with distinct, complex-speci ic, elution patterns ( Figure 1G ). However, a computational framework is necessary to identify co-eluting proteins systematically and at high-throughput.
To quantitatively score co-elution behavior indicative of stably interacting proteins, we employed a supervised machine learning approach based upon observed data for known complexes. Protein-protein interactions were derived solely from the coordinated separation behavior of proteins over multiple, orthogonal biochemical fractionation experiments. We required co-fractionation evidence in at least three species so as to capture well-supported and evolutionarily conserved interactions. We trained an Extra Trees classi ier (with 5-fold cross-validation) to distinguish between characteristics of pairs of proteins known to stably interact and random pairs of observed proteins (details in Methods). The classi ier assigned a probabilistic CF-MS score (Co-Fractionation-Mass Spectrometry score) between 0 and 1 to each potential protein-protein interaction ( File S3 ), with 1 indicating a high likelihood of physical association based on observing strongly coordinated protein elution pro iles, and 0 indicating no evidence for interaction.
We next aimed to rigorously assess statistical con idence for physical protein-protein interactions, evaluating against a fully withheld test set of 886 known protein interactions that were not used at any point during model training. Ranking protein interactions by their CF-MS scores accurately recapitulated this withheld test set ( Figure  3A ) and allowed us to measure classi ier error rates: For interactions with CF-MS scores over 0.50, we observed 90% precision ( i.e. ≤10% false positive interactions) and 23% recall. Interactions with a CF-MS score above 0.2 exhibited 50% precision at 51% recall, and thus were still informative in many cases ( Figure 3B ).

Con irming CF-MS interactions by independent assays and chemical cross-linking
Our measured high-con idence protein interactions agree with independent protein interaction observations. As a direct test of our method, we asked if independent CF-MS data from maize, a plant species that was not used to train the model, showed the same trends. Ortholog pairs with a high CF-MS score (>0.5, derived from non-maize plants) co-eluted strongly in a fractionation experiment of maize shoots ( Figure 3C ). Thus, CF-MS-supported interactions are highly concordant with biochemical separation behavior of proteins in an unassayed plant species. We also compared our observed interactions to other independent plant protein interaction datasets, inding that protein pairs with higher CF-MS scores were more likely to af inity purify together, to interact by yeast-2-hybrid, and to exhibit coordinated mRNA expression both in Arabidopsis and in rice ( Figure 3D) . Our high con idence scores align with and provide orthogonal support for biologically interesting interactions previously discovered by AP-MS and yeast-2-hybrid assays. Figure  3E illustrates three such cases for the TPLATE complex proteins (Gadeyne et al., 2014) , the interactions between SWI/SNF components with LEAF AND FLOWER RELATED (LFR) and an uncharacterized protein AT5G17510 (Vercruyssen et al., 2014) , and the interaction between two uncharacterized pentatricopeptide repeat proteins PPR59 and PPR351 (Arabidopsis Interactome Mapping, 2011) .
To further independently validate our derived protein complexes using unbiased orthogonal methods (and thus avoid "cherry-picking" promising test cases), we implemented two untargeted, large-scale biochemical approaches. For the irst method, we plotted expected monomeric mass against observed mass in a representative Arabidopsis size exclusion experiment and demonstrated that a substantial proportion of proteins eluted with a mass considerably larger than their monomeric masses, supporting that endogenous complexes remained intact in our experimental conditions ( Figure 4A, Methods).
For the second validation method, we performed global chemical cross-linking on fractionated protein extracts from soy sprout and Chlamydomonas , covalently tethering pairs of interacting proteins within each chromatographic fraction. We identi ied 194 heteromeric protein-protein interactions from soy and 228 from Chlamydomonas , at a false discovery rate of 1% and ∆XlinkX score ≥70 ( File S4 ). Cross-linking recovered 31 withheld test set positive interactions and one negative interaction, empirically estimating the false discovery rate at 3%. We observed high concordance between the cross-linked protein pairs and our CF-MS determined interacting proteins, and protein pairs with high CF-MS scores were considerably more likely to be cross-linked ( Figure 4B ). Furthermore, our observed cross-links were consistent with physical constraints of known protein interactions ( Figure 4C-F ). For example, intersubunit cross-links in the soy CCT chaperonin complex were exclusively identi ied in the same fractions as co-eluting CCT subunits ( Figure 4C ) and occurred between residues appropriate to the cross-linker length (<30Å Cɑ-Cɑ) ( Figure 4E ). While this complex is well documented in animals, a 22S complex containing CCT subunits was only reported in 2019 in Arabidopsis (Ahn et al., 2019) . Our data con irm that the CCT complex is conserved across plants and indicate that the speci ic 3D subunit organization in plants resembles that in animals. Similarly, co-fractionation and cross-linking of the Photosystem II complex was observed in Chlamydomonas , with the observed inter-protein cross-links located at appropriate adjacent solvent-accessible subunit interfaces ( Figure 4D, F ) as expected for the native conformation of this complex. The recovery of multiple structurally coherent inter-subunit cross-links between co-fractionating subunits provides experimental evidence that CF-MS faithfully captured protein assemblies.
Thus, a combination of direct experimental validation using independent biochemical methods (co-fractionation in an independent organism, calibrated size exclusion chromatography, and chemical cross-linking) and comparison to independently determined protein interactions from the literature all indicate that the CF-MS measurements provide strong evidence for physical assemblies.

Identi ication of multiprotein complexes con irms those inferred by gene content and reveals additional novel assemblies
As our CF-MS datasets faithfully captured many large multiprotein assemblies ( Figures 1G, 4A, C-F ), we next sought to systematically de ine higher-order plant protein complexes by clustering the proteins based on the measured pairwise interactions at FDR < 10% ( Figure 5 ). Instead of choosing a single clustering cutoff to de ine discrete complexes, we selected multiple cutoffs to re lect hierarchies of interacting proteins. For example, one cutoff de ined the 80S ribosome, while a iner cutoff differentiated its 40S and 60S subcomplexes ( Figure 5 ). Orthogroups in well-characterized complexes are annotated dark green; however, we identi ied several previously unreported subunits and interactors with the potential to enrich our understanding of how these known complexes function in plants. Excitingly, we observed many complexes comprised of novel associations ( Figure 5 , yellow) as well as proteins that remain uncharacterized in plants ( Figure 5 , bold circles). Our clustered and annotated set of high con idence protein complexes is available as File S5.
As internal positive controls, we identi ied 117 complexes that are reported in the literature. Some of these eukaryotic complexes, such as the Conserved Oligomeric Golgi (COG) complex, the SRP68/72 heterodimer, and the TRAPP and BRISC complexes have to our knowledge only previously been inferred in plants by gene content. Similarly, we ind orthologs of complex members that have only been reported in non-plant species, such as a MAA3 (a plant ortholog of the yeast protein Sen1 and human Senataxin) interacting with RNA polymerase III, an interaction recently found in yeast to regulate RNA polymerase III termination (Rivosecchi et al., 2019) .
Likewise, while homologs to the yeast and mammalian oligosaccharyltransferase (OST) complex subunits exist in plants, the full complex has not been biochemically isolated (Strasser, 2016) . We observe a plant OST complex with overlapping membership to the yeast and mammalian OST complex, and detect cross-links between HAP6 and OST48 in soy, suggesting the plant OST complex resembles that of other eukaryotes. We identify potential novel OST components Stomatin-like protein 1 (SLP1), SPC25, and EMC1. We also identi ied components of the initiation factor (eIF)2B complex in which the eIF2Bγ/ε and eIF2Bβ/δ dimers were clearly co-purifying with each other ( Figure 1G ), but the eIF2Bα subunit appeared to be less stably associated. The existence of the eIF2B complex in plants has been speculative as this complex has not been characterized and isolated from plants, in contrast to its well-established observation and characterization in other eukaryotes (Browning and Bailey-Serres, 2015) .
Protein interactions provide a means to characterize, corroborate, and predict protein functions via guilt-by-association. We found several instances where a top-scoring interaction to an uncharacterized protein could be con irmed in the literature, serving as additional positive controls. For example, the most con ident interactor with the recently characterized Arabidopsis protein AT5G14910 is the chloroplast ribosome protein RPS1; the interaction was independently con irmed in a recent study (Pulido et al., 2018) . We also observed instances of interactions between proteins catalyzing consecutive enzymatic reactions, such as a novel interaction between OXP1 and GEP, enzymes catalyzing the last two steps of glutathione degradation. In cases where complexes identi ied at a stringent FDR lack known members, using the core subunits to query scored interactions often recovers the expected subunits and potential novel interactors. Because protein interactions can lead to functional insights, we provide a public tool for the community to query the CF-MS interactions at http://plants.proteincomplexes.org .

Alternative multiprotein assemblies are apparent in the plant lineage
Analysis of protein complexes has revealed that homologous gene products are not always assembled in the same way (Marsh and Teichmann, 2015) . We found many cases in which plants appear to have alternate arrangements of interacting proteins relative to other lineages, including the adoption of plant-speci ic subunits. Furthermore, we identi ied cases in which plants exhibited analogous assemblies to those in other lineages, achieving similar architectures or functions, but through distinct molecular interactions. In both cases, protein sequence homology alone was inadequate to predict protein complex structure or function without knowledge of the assemblies themselves.
One prominent example is a conserved tRNA multi-synthetase complex (MSC). While functional assemblies of essential aminoacyl tRNA synthetases seem ubiquitous across all life forms, distinct architectures, memberships, and accessory proteins have been cataloged in diverse organisms including animals, yeast, archaea, and bacteria (Laporte et al., 2014) . They are frequently loosely associated and condition-dependent, making complete composition dif icult to de ine. A candidate MSC was recently reported in Arabidopsis (McBride et al., 2019) . We observed a conserved megadalton-scale MSC with architecture and accessory subunits distinct from those in animals, fungi, and bacteria but with notable parallels ( Figure 6A ). Our plant MSC lacks the p38, p43, and p18 scaffolding subunits critical for assembly of the human multi-synthetase. Instead, it contains the plant ortholog of ARC1, the central organizing cofactor of the yeast multi-synthetase. Conservatively, the plant MSC complex appears to contain the ortholog of ARC1, Ybak (a member of the trypanosome MSC (Cestari et al., 2013) ), Clustered Mitochondria Homolog (CLU), the WD40 scaffold protein VIP3/SKI8, and a distinct set of tRNA synthetases, among them glutamate, isoleucine, and tryptophan-tRNA ligases (E, I, and W). Peripheral members may include valine, tyrosine, histidine, aspartate, proline, threonine, leucine, glutamine, and methionine-tRNA ligases (V, Y, H, D, P, T, L, Q, and M) ( Figure 6A ). Of the 20 eukaryotic tRNA ligases, 8/9 observed in the mammalian MSC (all but asparagine) appear to be in the plant MSC complex, suggesting that there could be an advantage to assembling these particular tRNA-ligases.
Just as the presence of orthologs could not predict the plant MSC complex, the genetic absence of orthologs does not predict the absence of functionally similar complexes. One example is the complex of proteasome assembly chaperones. In humans, a stable heterodimer of PAC1 and PAC2 aids the assembly of the proteasome alpha subunit ring (Hirano et al., 2005) . While plants lack a PAC1 gene, we found a plant-speci ic PAC2-like protein associated with PAC2 ( Figure 6B ). No proteasome assembly chaperone complex has been previously described in plants, suggesting that the PAC2/PAC2-like complex likely performs this function.
We also found several instances where plants utilize lineage-speci ic subunits to co-opt known molecular modules to serve plant-speci ic functions. One example is a heterodimer of a conserved eukaryotic transcription factor. In humans, RBMX interacts with SAFB to bind the promoter of the SREBP1 gene to regulate sterols in the liver (Omura et al., 2009) . We found the plant ortholog of the RBMX transcription factor (RZ1B/C) interacting with the plant-speci ic protein VERNALIZATION1 (VRN1) ( Figure 6C ), both of which are known to regulate the FLOWERING LOCUS C ( FLC ) gene (Levy et al., 2002; and together control a crucial plant-speci ic event: rapid lowering at the appropriate seasonal time. Finally, the chloroplast NADH dehydrogenase-like (NDH) complex provides a more intricate example of how plants have adapted conserved architectural modules with plant-speci ic proteins to serve plant-speci ic purposes. NDH is a chloroplast complex that shares architecture with the respiratory Complex I of mitochondria, with both complexes functioning in electron low (Shikanai, 2016) . We identi ied known NDH subunits and found three additional subunits, EGY1, EGY2, and STR4A ( Figure 6D, left ). These new subunits have high CF-MS scores for interactions with multiple subunits of the NDH complex, scoring highly with NDH subcomplex B and L members, and in some cases scoring higher than the interactions between known members of the NDH complex itself ( Figure  6D, network ). EGY1 and EGY2 are chloroplast-localized intramembrane metalloproteases, but their speci ic functions remain unknown (Adamiec et al., 2017) . The speci ic enrichment of EGY1, EGY2, and the NDH complex in Bundle Sheath chloroplasts of C4 plants supports the association of these metalloproteases with NDH in vivo (Majeran et al., 2008) . The third new subunit, STR4A, is a rhodanese-like protein of unknown function. While mitochondrial Complex I is known to interact with an accessory rhodanese domain sulfurtransferase, no such subunit has yet been reported for the architecturally similar plant NDH complex. STR4A is one of 6 rhodanese-like domain proteins in Arabidopsis ; however, the CF-MS scores indicate that the association of STR4A with the NDH complex is unique among those six ( Figure 6D, bottom right ). One clue to the role of STR4A may come from the related protein, STR4, required for localization of the photosynthetic FNR complex to the thylakoid membrane of chloroplasts (Lintala et al., 2014) . However, because NDH, FNR, and Complex I are all electron transport complexes, it is also possible that sulfurtransferase activity is important for some shared function in electron transport.
These observations highlight how shared features such as orthology and protein complex architecture can be informative, but as plants have alternatively purposed many proteins and complexes, directly measuring speci ic protein interactions and assemblies is necessary to understand the functional roles of plant proteins.

Interaction-to-phenotype: Discovering protein functions and phenotypes from protein interactions
Ultimately, the utility of large-scale datasets is their ability to drive biological discovery. Interacting proteins are more likely to share phenotypes (Fraser and Plotkin, 2007;Lage et al., 2007;McGary et al., 2007) ( Table S3 ) and thus our data should provide a basis for linking genotype to phenotype and gaining biochemical insights into shared phenotypes. We therefore interrogated our dataset for cases where the derived interactions might suggest new hypotheses regarding protein function and phenotypes.
We found several links between pathogen defense and immunity genes through protein interactions. Plant pathogen resistance mechanisms are an area of intense research, as pathogens cause billions of dollars in crop losses annually (Savary et al., 2019) . We discovered two novel complexes related to plant-pathogen interactions. The irst is comprised of Basic endochitinase B (CHIB) and Osmotin-like protein 34 (OSM34) ( Figure  7A ), representing two different protein families, Pathogenesis-Related protein group 3 (PR3) and Pathogenesis-Related protein group 5 (PR5) respectively. Each protein has been individually reported to target fungal cell walls, and both are highly co-expressed following infection by the gray mold fungus B. cinerea (Dhawan et al., 2009) . To our knowledge, this is the irst report of a stable protein complex among members of different pathogenesis-related protein classes. Mechanistic characterization of this protein complex could aid the development of strategies to prevent devastating crop losses from fungal infection.
A second novel pathogen-related protein complex contains proline iminopeptidase (PIP) and Nudix hydrolase 3 (NUDT3) ( Figure 7B ). The native role of these proteins in plants is unclear, but intriguingly, bacterial versions of both PIP (Kan et al., 2018) and Nudix hydrolase (Dong and Wang, 2016;Mukaihara et al., 2004) are injected by bacterial type III secretion systems into plant cells in order to suppress plant immunity. While work remains to determine the native role of plant PIP and NUDT3, the information that these proteins form a stable endogenous complex will guide future efforts. The direct observation of known pathogen resistance proteins in complexes creates a concrete framework for interpreting previous results and examining the mechanism by which they impact plant health.
We also considered a new interaction between the DOMINO1 and LA1 proteins which served to con irm a joint role for the interaction partners in embryonic development. Loss of function of either of these genes is known to produce a nucleolar hypertrophy phenotype and nonviable embryos (Fleurdépine et al., 2007;Lahmy et al., 2004) . While LA1 is an RNA-binding protein found in diverse eukaryotes, DOMINO1 is a plant-speci ic protein with mutations linked to ribosome biogenesis defects and slow embryonic growth (Lahmy et al., 2004) . We observed a stable DOMINO1/LA1 complex in multiple plants and tissues ( Figure 7C ) and con irmed that both subunits affect the same biological process by comparing the phenotypes of individual domino1 or la1 insertional mutant lines of Arabidopsis . Heterozygous domino1 or la1 mutant plant lines produced siliques with many abnormal clear embryos ( Figure 7C ). These phenotypic similarities support a functional complex of the DOMINO1 and LA1 proteins in vivo with a proposed role in ribosome biogenesis.
Protein interactions also provide a basis to predict entirely new phenotypes. We exploited this trend for a new interaction between the mitochondrial outer membrane porin VDAC2/5 and 3β-hydroxysteroid dehydrogenase/C-4 decarboxylase (3βHSD/D). As vdac2 mutants have a well-characterized late lowering and sterility phenotype (Tateda et al., 2011) , we predicted by interaction that 3βhsd/d mutants would show similar defects. We directly compared the phenotypes of Arabidopsis bearing loss of function T-DNA insertions in VDAC2 or 3βHSD/D2. Disruption of either gene delayed lowering, induced wavy leaves, and reduced fertility. ( Figure 7D ). The underlying defect of these shared phenotypes is likely related to transport and modi ication of plant sterols, as 3βHSD/D is a sterol modifying enzyme (Kriechbaumer et al., 2018) , and VDAC2 is essential in mice for steroidogenesis (Prasad et al., 2015) .

CONCLUSIONS
By using mass spectrometry proteomics to de ine the major protein complexes shared across plants, we have constructed a reference map of the basic biochemical 'wiring diagram' of a plant cell. The data capture over two million protein abundance measurements from multiple tissues and diverse species, revealing stable protein complexes conserved over more than a billion years of plant evolution. The resulting map thus provides a global snapshot of protein organization in plants. While we have presented examples of paths to connect gene products with phenotypes and to test speci ic functional hypotheses, this rich dataset can be mined in myriad additional ways, laying the foundations for extensive basic and applied research across the vast landscape of plant biology.

ACKNOWLEDGMENTS
The authors gratefully acknowledge Angel Syrett for plant illustrations, Benjamin Liebeskind for assistance with fern transcriptomics and feature calculation, Anna Battenhouse for data deposition assistance, Claire Palmer for early orthology calculations, Hong Qiao, Alan Lloyd, and the University of Texas UTEX Culture Collection of Algae for providing plant and algal samples, Dan Boutz for mass spectrometry assistance, Riddhiman Garge and Momo Wisath Sae-Lee for assistance with coconuts, Andrew Emili for early discussions and assistance, and John Wallingford and Dannie Durand for constructive feedback and editing. The research was funded by grants from the Welch Foundation

DECLARATION OF INTERESTS
The authors declare no con licts of interest.

Figure 1. Integrative Co-fractionation Mass Spectrometry (CF-MS) Work low Used to Determine Stable Plant Protein Complexes. See also Table S1
(A) The selected thirteen species represent a broad range of evolutionary time (MYA, million years ago (Kumar et al., 2017) ).
(B) Native extracts from each plant are chromatographically separated and the proteins in each fraction identi ied by mass spectrometry. (C) Co-fractionation of proteins is evidence of physical association. (D) Proteins from each species' proteome are irst assigned to orthologous groups (OGs). (E) Peptides that match more than one orthogroup (light gray text) are not used, however peptides that uniquely match a single orthogroup are used to quantify the abundance of an orthogroup in individual chromatographic fractions. Elution pro iles for each orthogroup are graphically represented here as ridgelines or heat maps (blue showing abundance) across multiple separation experiments.  Table S1) of six well-known protein complexes (names at right). Color intensity (blue is positive signal) depicts measured abundances for each orthogroup (labeled at left) in two distinct chromatographic separations (labeled at top) out of the 35 total separations.

Figure 2. Proteomics in High-Ploidy Species Enhanced by Assignment of Proteins to
Orthogroups. See also Figure S1 (A) Number of proteins assigned per orthogroup for each plant species in our study colored by ploidy. Shaded ovals at left represent subgenome organization. (B) Fold increase (x-axis) in peptide spectral matches that identify unique orthogroups vs. unique proteins. Each bar represents a single fractionation experiment conducted on the species named at left and color-coded by ploidy as in (A) . (C) The number of observed proteins (left plot) or orthogroups (right plot) experimentally observed (in blue) compared to the possible total in the proteome (gray). Note that relative coverage per species is a function of the amount of data collected from that species in this study.
(D) Our data set of 35 fractionations and 13 species is suf icient to identify the majority of orthogroups possible by this method. Each dot represents the number of orthogroups identi ied (y-axis) in a subsample of n experiments (x-axis), with sampling repeated ten times per each n . (E) Orthogroups with more than two proteins were approximately equally likely to be represented by a single dominant protein as not, regardless of ploidy. (F) Orthogroups observed by mass spectra (green) represent those with higher mRNA abundances (TPM, transcripts per million, log scale; data from (Panchy et al., 2014) ), as shown for Chlamydomonas . Gray represents orthogroups not observed in our study. (G) Log-scale protein abundances (y-axis) show expected correlation with RNA abundances (x-axis, Transcripts per million; same as in (F) ) in Chlamydomonas , however with numerous outliers, notably, RuBisCo (green dot). (C) PPIs with high CF-MS scores (FDR < 10%) are highly correlated in a species withheld from training (maize). (D) Protein interactions with higher CF-MS scores were more likely to have been identi ied by af inity puri ication, 2-hybrid in Arabidopsis , and were more likely to be co-expressed in Arabidopsis and rice. (E) Agreement of the CF-MS protein-protein interactions (yellow) with af inity puri ication (blue) and 2-hybrid interactions (red) for three protein complexes.  Thin concentric circles show the clustering hierarchy of protein-protein interactions into complexes for each of four clustering thresholds (see File S5 for complex membership and annotations). Protein orthogroups ( illed circles) are colored green for associations previously reported in any species or yellow for those irst reported in this study. Bold outlines denote proteins uncharacterized in plants, de ined as uncharacterized if all proteins in the orthogroup lack an Arabidopsis gene symbol and a Uniprot Function annotation. Bold complex labels are discussed in the text.
Unless otherwise stated, samples were quick-frozen and ground to a ine powder in liquid nitrogen using a chilled mortar and pestle. Powder was thawed to 4°C, and resuspended in approximately an equal volume of the speci ied lysis buffer plus protease and phosphatase inhibitors followed by nutation 30 minutes at 4°C. All subsequent steps were at 4°C. The crude homogenate was clari ied by a low-speed spin (~3000 x g, 10 minutes), and the supernatant further clari ied by a high-speed spin (~14,000 x g, 10 minutes). Protein concentration was determined by Bio-Rad Protein Assay or Bio-Rad DC Protein Assay, and 1-4 mg of extract was 0.45µ iltered (Ultrafree-MC HV Durapore PVDF, Millipore) and fractionated by HPLC chromatography. In general, detergent and salt were minimized where possible to avoid perturbing protein assemblies. Because the highly abundant plant protein RuBisCo dominates green tissues, we included etiolated seedlings (depleted for photosynthetic proteins) and two non-green tissues: germ tissues (enriched in proteins related to core cell biology), and isolated nuclei (enriched for DNA/transcription-related proteins) ( Figure S1 ).
Embryonic plant tissue/seed extracts. Liquid nitrogen powders of quinoa, hemp seed, or Arabidopsis thaliana mucilage-less mutant ttg1-1 seeds were lysed in Wheat Germ Lysis Buffer: 20 mM HEPES pH 7.6, 130 mM K-acetate, 5 mM Mg acetate, 0.1 mM EDTA, 10 % glycerol, 1 mM DTT, protease and phosphatase inhibitors as above. In some cases, the crude homogenate was pre iltered through Miracloth (Millipore) to remove debris, or 14,000 x g supernatant required additional clarifying centrifugation of 10 minutes at 40,000 x g.
Nuclear extracts. Nuclei were prepared from fresh broccoli using the CelLytic PN kit (Sigma-Aldrich). Brie ly, fresh loret tips were shaved using a sharp knife into a beaker on ice and homogenized in a chilled mortar and pestle on ice in the Nuclear Isolation Buffer provided. The homogenate was further processed by brief treatment on ice with a Tissue Tearor (BioSpec Products, Inc.). Manufacturer's instructions were followed for lysis using 1.0 % Triton X-100 and the "Semi-pure Preparation of Nuclei" protocol with 1.5 M sucrose. The resultant nuclear pellet was extracted twice with Nuclear Extraction Buffer, and the two extraction supernatants were pooled.
Coconut ( Cocos nucifera ) nuclei were puri ied using a composite method based on (Cutter et al., 1952;Matzke et al., 1992;Mondal et al., 1972) . A young (5 month old, ~ 1.5 kg) green coconut supplied fresh from Coconut Fields Forever (Davie, FL) was opened, the liquid was drained, and the gelatinous endosperm removed by gentle scraping. Four volumes of ice-cold 4% sucrose were added to the endosperm and the mixture was homogenized 15 strokes with a loose pestle. All subsequent steps were on ice or at 4°C. The homogenate was iltered through cheesecloth to remove debris, and the nuclei were recovered from the iltrate by centrifugation at 250 x g, 10 min, 4°C. Nuclei were washed once with 15 ml Nuclear Isolation Buffer from the CelLytic PN kit (Sigma-Aldrich) and re-pelleted. After a second wash in 0.5 ml Nuclear Isolation Buffer, the recovered nuclear pellet was extracted as for broccoli nuclei using Nuclear Extraction Buffer (CelLytic PN kit, Sigma-Aldrich).
Chlamydomonas reinhardtii culture (2.5 liters of UTEX 90) was pelleted by centrifugation 10 minutes at 1,100 x g with no brake and washed with 900 ml 10 mM HEPES pH 7.6 at room temperature. Washed cells were collected by centrifugation 10 minutes 1,800 x g room temperature with no brake and the pellet was frozen in liquid nitrogen, thawed, refrozen and ground in a liquid nitrogen-chilled mortar and pestle. Powder was thawed and refrozen for a second grinding. After resuspension in Plant Lysis Buffer, cells still appeared largely intact so the homogenate was sonicated on ice with a probe tip 6 x 10 sec, amplitude 28%, and 6 x 10 sec amplitude 70%. This lysate was incubated 30 minutes 4°C rotating end-over-end with subsequent steps as for other green tissues.

Fern transcriptome sequencing
As Ceratopteris richardii currently lacks an annotated genome, we de novo assembled the fern transcriptome from mRNA sequencing data collected from fronds, mature gametophytes, and spores as follows: RNA sequencing Spores were collected from adult plants and cultured on agar to gametophyte stage. Fronds were cut from healthy adult plants and lash-frozen in liquid nitrogen. Both fronds and gametophytes were ground to a ine powder prior to RNA extraction. Total RNA was extracted from spores by phenol-chloroform extraction, as previously described (Salmi and Roux, 2008) . Total RNA was extracted from gametophytes, and fronds with the Spectrum Plant Total RNA Kit (Sigma-Aldrich) according to manufacturer's instructions. Total RNA from each sample was diluted to 10-100 ng/ µl and further prepared and sequenced by the Genome Sequence and Analysis Facility at UT Austin as follows: Total RNA was Poly-A-selected using the Poly(A)Purist Magnetic Kit (Life Technologies AM1919) and fragmented to ~200 bp. First strand synthesis was performed using the NEBNext First Strand Synthesis Module (NEB E7525L) with random primers and reverse transcriptase in the presence of Actinomycin D and Murine RNase Inhibitor. Second strand synthesis was performed using the NEBNExt Ultra Directional RNA Second Strand Synthesis Module (NEB E7750L) for 1 hour 16°C, and DNA was puri ied on AMPURE XP Beads (Beckman Coulter A63881). Double-stranded DNA was subjected to end repair and dA-tailing using the NEBNext End Repair and dA-Tailing Modules (NEB E6050L and E6053L) followed by ligation of barcode adapters (IDT) with T4Ligase (NEB). Following PCR ampli ication, the quality of the inal library was con irmed by Bioanalyzer (Agilent). We collected 2x150 bp RNA-seq reads from fern spore, mature gametophyte, and adult frond on an Illumina HiSeq 4000 with a target of 30 million reads per tissue.

Fern transcriptome assembly
After removing low-quality reads (reads lacking all four nucleotides or with a no-call), we assembled transcripts with Velvet (version 1.2.06) and Oases (version 0.2.06) using each of ive k-mer values (k=45, 55, 65, 75, 85). Also, we converted .fastq ile to a non-redundant .fasta ile and performed separate de novo transcriptome assembly with k=35, 45, 55, 65, and 75. Assembled transcripts were combined for each tissue, then redundant or fragmented sequences were removed based on BLASTN analysis. We determined the translational reading frame and corresponding peptide sequences from each assembled transcript based on BLASTP mapping results (after 6-frame translation in silico ) to four plant reference proteome databases (Creinhardtii_169, Osativa_193_pep, Smoellendorf ii_91_pep, TAIR10). Sequences lacking signi icant BLASTP scores to the reference proteomes were considered to be non-coding and omitted from the resulting fern proteome database. The resulting protein sequences derived from the three tissues were combined and a non-redundant protein sequence set computed based on clustering with UCLUST (version 4.2.66), requiring >97% amino acid identity. The supporting code is available from the NuevoTx repository ( https://github.com/taejoonlab/NuevoTx ). Fern transcriptome and proteome assembly statistics are summarized in Table S4 .

HPLC chromatography
Lysates were fractionated on a Dionex UltiMate3000 HPLC system consisting of an RS pump, Diode Array Detector, PCM-3000 pH and Conductivity Monitor, Automated Fraction Collector (Thermo Scienti ic, CA, USA) and a Rheodyne MXII Valve (IDEX Health & Science LLC, Rohnert Park, CA) using biocompatible PEEK tubing and either size exclusion chromatography or one of two ion exchange separations (mixed bed, or triple-phase WAXWAXCAT, see below). The sample loaded was 1-4 mg protein as measured by the BioRad Protein Assay or DC Protein Assay as appropriate to the sample buffer. Fractions were collected into 96-deep well plates. Select support ribs in the base were notched with a single-edged razor blade prior to fraction collection to accommodate subsequent use of the Life Technologies magnetic plate for bead-based mass spec preparation of samples.
Triple phase ion exchange : Three columns, each 200 x 4.6 mm ID, particle diameter 5 µm, pore diameter 100 A , were connected in series in the following order: two PolyWAX LP columns followed by a single PolyCAT A (PolyLC, Inc, Columbia, MD). Loading, buffers, and fraction collection were as for mixed bed ion exchange above with slight modi ications in low rate and elution from the methods of (Havugimana et al., 2012) . The low rate was either 0.25 ml/min with a 120 min gradient from 5-100 %B. For the separation of nuclear extracts, the gradient was modi ied to a 115-minute multiphasic elution from 5-100% Buffer B.

Isoelectric focusing
Isoelectric focusing was performed using the Agilent 3100 OFFGEL Fractionator (Agilent Technologies, Santa Clara, CA) according to the manufacturer's application note for native protein separation (Babu CV and Palaniswamy, 2014) . Samples were separated into 24 fractions using the 24 cm IPG strips pH 3-10 NL (GE HealthCare, Chicago, IL). To achieve the required low salt concentration 2.5 mg wheat germ extract was diluted with water and centrifuged 10 min., 10,000 x g 4°C prior to dilution with OFFGEL Stock Solution. Broccoli nuclear extract was dialyzed in a Slide-A-Lyzer Cassette G2 with a 10,000 MWCO (Thermo Scienti ic, CA, USA) against 2 changes, 2 hours each, of 5.0 mM HEPES-KOH pH 7.5, 25 mM NaCl, 0.2 mM DTT followed by one change of 0.5 mM HEPES-KOH pH 7.9 overnight. A substantial precipitate formed so the solution was clari ied by centrifugation 12,000 x g 10 min, 10°C, and the soluble portion further concentrated in an Amicon Ultra ilter unit with a 10 kD MWCO (MilliporeSigma, Burlington, MA). BioRad Protein Assay determined that this soluble portion retained 1.5 mg (~ 35% of the predialysis amount) which was diluted with OFFGEL Stock Solution for focusing.

Mass spectrometry sample preparation
Samples were prepared for mass spectrometry in 96-well plate format using either ultra iltration and in-solution digestion or SP3 bead-binding of proteins and on-bead digestion ( (Hughes et al., 2014) with modi ications as described below). Plates were sealed with transparent ilm during incubation steps.
Ultra iltration was performed with an AcroPrep Advance 96-ilter plate,3k MWCO (Pall) using a vacuum manifold (QIAvac 96 or Multiwell, Qiagen) at -0.75 Bar. Prior to the application of the samples, preservatives were removed from the membranes by sequential iltration of 100 µl LC/MS quality water and 100µl trypsin digestion buffer (50 mM Tris-HCl, pH 8.0, 2 mM CaCl 2 ). Samples were concentrated to 100 µl, diluted 2 fold with trypsin digestion buffer, and concentrated again to a inal volume of 50-100 µl before being transferred back to a 96-deep well plate. 50 µl 2,2,2,-tri luoroethanol (TFE) was added and samples were reduced with TCEP (Bond-Breaker, Thermo,) at a inal concentration of 5 mM for 30 min. 37°C. Iodoacetamide was added to 15 mM and plates were incubated in the dark 30 min. at room temperature. Alkylation was quenched by the addition of DTT to 7.5 mM. TFE was diluted to <5% by the addition of trypsin digestion buffer. 1 µg of trypsin was added to each fraction and the sealed plate was incubated 37°C overnight. Digestion was stopped by bringing samples to 0.1% formic acid, and peptides were desalted using a 5-7 µl C18 Filter Plate (Glygen Corp) with a vacuum manifold, dried, and resuspended for mass spectrometry in 3% acetonitrile, 0.1% formic acid.
Bead-based sample preparation was limited primarily to SEC fractionation samples due to interference of salt concentrations greater than 300 mM. Beads used were an equal mixture (vol:vol ) of SpeedBead Magnetic Carboxylate Modi ied Particles #45152105050250 and #65152105050250 (GE Healthcare, UK). Fractions were irst adjusted to 20% TFE and reduced with 5mM TCEP, 45 min, 37°C. Samples were alkylated by the addition of Iodoacetamide to 25mM inal concentration, 30 minutes in the dark at room temperature, and then the reaction was quenched with 15 mM DTT. 4 µl of the mixed bead suspension (5µg/µl of each bead type) was added to each fraction. Protein binding was initiated by the addition of a premix of formic acid (to 20% of the alkylated sample) and acetonitrile (to equal 50% of inal bead incubation volume. After 30 min. at room temperature with gentle rocking, the beads were pelleted by centrifugation (1000 x g, 5 min., room temperature). The deep well plate was placed on a magnetic plate (Life technologies) and all but 300 µl of the supernatant removed and discarded. After removing the sample plate from the magnet the beads were resuspended in the remaining 300 µl and transferred to a conical bottomed 450 µl 96-well plate and placed on the magnetic plate for all wash steps.
After beads were collected, the supernatant was removed and discarded and beads were washed rapidly and sequentially with 2 x 100 ml aliquots of 70% ethanol and 2 x 100 µl aliquots of acetonitrile. The plate was removed from the magnet and beads were allowed to air dry brie ly before resuspension in 25 ml of 10% TFE/90% trypsin digestion buffer. 25 µl of trypsin digestion buffer containing 0.25 µg trypsin was then added to each resuspended sample for digestion overnight, 37°C. The digestion plate was placed on the magnet and the supernatant containing digested peptides was transferred to a fresh 450 µl 96-well plate and digestion stopped by bringing samples to 1% formic acid. Further peptides were recovered from the beads with two successive elution steps using 50 µl 2% DMSO and pooled with the original peptide supernatant. During the irst DMSO elution, the plate was sealed and sonicated in a bath for 10 min. Peptides in the 150 µl total eluates were desalted as above.

Chemical cross-linking
Extract (2.5 mg soy sprouts, or 2.1 mg Chlamydomonas ) was fractionated by size exclusion as above but with PBS pH 7.2 (Gibco) as the mobile phase. DSSO (Thermo Scienti ic) was dissolved immediately before use in dry dimethylformamide (stored under nitrogen) at a concentration of 50 mM and then further diluted in PBS for a working stock. Immediately after fractionation working stock DSSO was added to fractions to a inal concentration of 0.5 mM and samples were incubated 1 hour at room temperature. Cross-linking was quenched by the addition of 1 M Tris pH 8.0 to a concentration of 24 mM. Cross-linked fractions were immediately frozen and stored -80°C until prepared for mass spectrometry using the ultra iltration and in-solution digestion method. Soy samples were digested as usual with trypsin, while Chlamydomonas samples were split after reduction/alkylation and treated as follows. One set was digested with trypsin under standard conditions while the other set was diluted 8 fold with 50 mM Tris-HCl pH 8.0, 10 mM CaCl 2 and digested with 0.5 µg chymotrypsin overnight 37°C. Both digests were stopped by adjusting samples to 0.1% formic acid. Peptides were desalted as above and dissolved in 20µl 5% acetonitrile, 0.1% formic acid for mass spectrometry.

Mass spectrometry data acquisition and processing
Acquisition Mass spectra were acquired using one of three Thermo mass spectrometers: Orbitrap Elite, Orbitrap Fusion, or Orbitrap Fusion Lumos. In all cases, peptides were separated using reverse phase chromatography on a Dionex Ultimate 3000 RSLCnano UHPLC system (Thermo Scienti ic) with a C18 trap to Acclaim C18 PepMap RSLC column (Dionex; Thermo Scienti ic) con iguration. Peptides were eluted using a 5-40% acetonitrile gradient in 0.1% formic acid over 120 min. (for Orbitrap Elite), or a 3-45% gradient over 60 min. (for Fusion and Lumos) and directly injected into the mass spectrometer using nano-electrospray for data-dependent tandem mass spectrometry.
Data acquisition methods were as described below for each machine: Orbitrap Elite : top 20 CID with full precursor ion scans (MS1) collected at 60,000 m/z resolution. Monoisotopic precursor selection and charge-state screening were enabled, with ions of charge > + 1 selected for collision-induced dissociation (CID). Up to 20 fragmentation scans (MS2) were collected per MS1. Dynamic exclusion was active with 45 s exclusion for ions selected twice within a 30 s window.
Orbitrap Fusion : top speed CID with full precursor ion scans (MS1) collected at 120,000 m/z resolution and a cycle time of 3 sec. Monoisotopic precursor selection and charge-state screening were enabled, with ions of charge > + 1 selected for collision-induced dissociation (CID). Dynamic exclusion was active with 60 s exclusion for ions selected once within a 60 s window. For some experiments, a similar top speed method was used with dynamic exclusion of 30 s for ions selected once within a 30 s window and high energy-induced dissociation (HCD) collision energy 31% stepped +/-4%. All MS2 scans were centroid and done in rapid mode.
Orbitrap Lumos : top speed HCD with full precursor ion scans (MS1) collected at 120,000 m/z resolution. Monoisotopic precursor selection and charge-state screening were enabled using Advanced Peak Determination (APD), with ions of charge > + 1 selected for high energy-induced dissociation (HCD) with collision energy 30% stepped +/-3%. Dynamic exclusion was active with 20 s exclusion for ions selected twice within a 20s window. All MS2 scans were centroid and done in rapid mode.
For identi ication of DSSO cross-linked peptides, peptides were resolved using a reverse phase nano low chromatography system with a 115 min 3-42% acetonitrile gradient in 0.1% formic acid. The top speed method collected full precursor ion scans (MS1) in the Orbitrap at 120,000 m/z resolution for peptides of charge 4-8 and with dynamic exclusion of 60 sec after selecting once, and a cycle time of 5 sec. CID dissociation (25% energy 10 msec) of the cross-linker was followed by MS2 scans collected in the orbitrap at 30,000 m/z resolution for charge states 2-6 using an isolation window of 1.6. Peptide pairs with a targeted mass difference of 31.9721 were selected for HCD (30% energy) and collection of rapid scan rate centroid MS3 spectra in the Ion Trap.
Orthogroup assignment Each plant proteome was searched against EggNOG viridiplantae (virNOG) orthogroup HMMs using eggNOG -mapper v1 (Huerta-Cepas et al., 2017) . Additionally, human and Arabidopsis proteomes were searched with the eggNOG-mapper against eukaryote (euNOG) orthogroup HMMs to allow annotation transfer of known human protein complexes. The set of proteins from all species assigned to the same orthogroup HMM were considered to belong to the same orthogroup.
Analysis of peptide-spectral matches at the level of orthogroups All proteomes were in silico trypsin digested to peptides, allowing for two missed cleavages and a missed cleavage when lysine or arginine are followed by proline. Within each proteome, only peptides from proteins within the same orthogroup were retained, and peptides matching proteins in multiple orthogroups discarded. This gave a list of orthogroup-unique peptides sometimes deriving from multiple proteins. Experimentally-observed peptides were compared to these orthogroup unique peptides with allowance for leucine/isoleucine ambiguity to identify orthogroups present in each biochemical fraction. Peptide spectral matches (PSMS) of peptides in the same orthogroup were summed to get orthogroup PSM counts.

Analysis of overall trends in protein recovery
We observed in at least one experiment 96.7% of the 11,339 Viridiplantae orthogroups that are most highly conserved i.e. conserved across at least half (7) of our 13 plant species. Of the 378 unobserved yet conserved orthogroups, 170 are membrane proteins suggesting their lack of detection may stem from non-ideal solubilization conditions for this class of proteins. The high coverage of conserved proteins was in marked contrast to the remaining 22,144 less conserved orthogroups, of which we only observed 58.4%. Observed proteins also tend to be longer than unobserved proteins, with a median length of 383 vs. 196 amino acids. Our observations mirror those for the extensively studied human proteome where notably 1,482/20,055 proteins have thus far evaded detection by all available proteomics technologies (Baker et al., 2017) . We observed consistent tissue-speci ic functional trends among the observed proteins ( Figure S1 ), supporting the idea that the not-yet-observed proteins, whether in plants or humans, may tend to exhibit strongly tissue or temporal speci ic expression or universally low abundance. Orthogroup abundances were highly reproducible across experiments ( Figure S2 ).
Assembly of features for scoring putative protein interactions For each experiment, we assembled an elution matrix of orthogroups by fractions with PSMs normalized to 1 within each orthogroup. We additionally concatenated our training set of 32 experiments into sets by species and taxonomic group, i.e. all Arabidopsis experiments joined into one matrix and likewise for all eudicot, monocot, angiosperm, vascular plant, and green plant experiments. Maize was withheld from these concatenations as a holdout species. An elution matrix of all experiments was also made ( File S3 ). Cross-linked soy and cross-linked Chlamydomonas experiments were not used in training. We retained only orthogroups with at least 32 total PSMs observed across the 32 combined training experiments.
We next calculated a series of all-by-all pairwise scores between orthogroup elution pro iles for all 46 training matrices -Pearson's r, Spearman's rho, Euclidean distance, Bray-Curtis similarity, and stationary cross-correlation, all with added Poisson noise. Euclidean distance and Bray-Curtis similarity scores were inverted and normalized to a max score of 1. We calculated a hypergeometric score for the co-occurrence of proteins in fractions with repeated sampling of fractions (Drew et al., 2017) . Prior to building a feature matrix of these scores for machine learning, we removed orthogroup pairs that did not correlate with at least a Pearson r > 0.3 in at least three experiments in at least two species. All scores were joined to a 3,076,998 row feature matrix of orthogroup-orthogroup similarity scores, and missing values illed with zeros.
Construction of the gold standard protein complex training and test sets We used known human protein complexes from the CORUM database (Giurgiu et al., 2019) as a gold standard set of positive stable protein-protein interactions. Human protein identi iers were converted to virNOG orthogroup identi iers via orthology to Arabidopsis proteins. 397 CORUM protein complexes were supplemented with 6 well-characterized plant protein complexes. Known complexes were divided into positive training and test complexes according to the scheme from (Drew et al., 2017) and complexes with over 30 members removed. We additionally supplemented 8,662 CORUM pairwise interactions with 2,562 pairs of plant proteins with direct evidence of stable protein-protein interaction, e.g. co-crystallization, co-immunoprecipitation, collected from the TAIR (Swarbreck et al., 2008) protein-protein interactions repository (ftp://ftp.arabidopsis.org/home/tair/Proteins). 20,000 negative training and test interactions were drawn from feature matrix rows, removing any interactions present in the positive gold standard set. 1240 positive training interactions and 884 positive test interactions were present in the feature matrix.
Identi ication of interacting proteins by supervised machine learning We irst used the scikit-learn ExtraTreesClassi ier feature selection to reduce the dimensionality of the feature matrix to the top 100 features based on declining feature importance ( Figure S3 ). We used the TPOT (Olson and Moore, 2016) AutoML wrapper of scikit-learn machine learning functions for all subsequent training steps. We discovered optimal hyperparameters for an ExtraTreesClassi ier with 5-fold cross-validation, with an area under the precision-recall curve of 0.64. We then trained an Extra Trees Classi ier with TPOT discovered hyperparameters. This model was applied to the entire feature matrix to give a Co-Fractionation Mass Spectrometry (CF-MS) score to each pair of orthogroups, with higher scores corresponding to higher co-elution. Precision, recall, and false discovery rates were calculated from 886 positive and 20,000 negative test set interactions.
Clustering of interacting protein pairs to de ine multiprotein assemblies Interaction scores above a 10% false discovery rate threshold (CF-MS score ≥ 0.509) were input into R igraph cluster_walktrap to de ine coherent protein complexes. Walktrap reweighted edges between orthogroups were reformatted to a dendrogram and cut at intervals to obtain a nested hierarchy of complexes. Cuts closer to the root of the dendrogram result in larger complexes and cuts closer to the tips further de ine subcomplexes. A portion of these clusters are homodimers or heterodimers of closely related proteins, such as two inactive GDSL proteins, two ferredoxins, or two NTF2 proteins ( Table S5 ).
Size calibration Based on known molecular weight size standards spiked into one soy size exclusion experiment, we it a linear model of log10 molecular weight ~ fraction number. To transfer this calibration to other size exclusion experiments, we selected a series of internal standards complexes and proteins with known native molecular weights and consistent single peak elution patterns. A model derived from molecular weight standard spike-ins of 667, 443, 200 and 66 kDa was able to predict molecular weight values for our internal standards close to their known elution positions ( Figure S4 ). We it a new linear model for derived weights of internal standards and applied this model to all size exclusion experiments to obtain a molecular weight for each fraction.

Protein Parts Per Million (PPM) calculation
We calculated protein parts per million (PPM) following (Weiss et al., 2010) , with scripts stored at github.com/marcottelab/MS_grouped_lookup/ppm_utils. Brie ly, unique tryptic peptides were iltered to peptides between 7 and 40 amino acids long, and a correction factor calculated from the sum of the total length of peptides in this range per orthogroup. Observed peptide PSMs were multiplied by the peptide length, summed by orthogroup, divided by the orthogroup correction factor, multiplied by 1,000,000 and divided by the experiment total to get parts per million.
3D homology modeling Precomputed 3D homology models from SWISS-MODEL (Bienert et al., 2017) were used to evaluate consistency with cross-linking data as follows: SWISS-MODEL 3D models of Arabidopsis CCT subunits were aligned to the known experimental Saccharomyces cerevisiae CCT molecular assembly (PDB: 4V94), then soy CCT subunit cross-links positioned as guided by sequence alignment to the corresponding Arabidopsis subunits. Chlamydomonas cross-links were evaluated using SWISS-MODEL 3D models of Chlamydomonas Photosystem II subunits that were aligned to the Arabidopsis Photosystem II (PDB: 5MDX). 3D models were visualized using Chimera (Pettersen et al., 2004) .
50 Arabidopsis seeds per genotype were sterilized for 10 minutes in 20% Clorox bleach and washed ive times with 1mL di H 2 O . Sterilized seeds were plated on 0.5X Murashige and Skoog agar plates, sealed with para ilm, wrapped in aluminum foil and incubated at 4°C for 3-5 days, followed by incubation in a lighted growth chamber for 7 days. Plates were germinated and plants grown under an illumination cycle of 16 hours 22°C days and 8 hour 20°C nights. Seedlings were planted in a soil mix of 75% Pro-Mix Biofungicide with wetting agent/25% Pro ile Field and Fairway calcined clay supplemented with 1g Miracle-Gro Plant Food/gallon water, and one teaspoon Gnatrol Biological Larvicide (Valent Biosciences LLC). Bonide copper sulfate was sprayed weekly to prevent fungus. All mutant lines except CS832348 were grown an additional generation prior to phenotyping and quantitation. Days to lower were marked from the introduction of plates to light to the irst visible petal, and siliques were counted at the end of lowering.
Genotyping A portion of rosette leaf from each plant was lash-frozen in Eppendorf tubes in liquid nitrogen and ground to a ine powder with sterile rods. 400 µl extraction buffer (200mM Tris-Cl pH 7.5, 250mM NaCl, 25mM ETDA, 0.5% SDS) was added to each tube and mixed brie ly. Tubes were centrifuged at 3000 rcf for 7 minutes, then 350 µl supernatant transferred to a 96 well 2mL plate pre illed with 350 µl isopropanol per well, mixed by pipette, and incubated at room temperature for 10 minutes. Following centrifuging for 35 minutes at 3000 rcf, supernatant was removed by quickly inverting the plate. 150 µl ethanol was added to each well, plates pulse spun, and supernatant again poured off with quick inversion of the plate. After air-drying for at least 15 minutes DNA was resuspended in 150 µl H 2 O . PCR reactions to con irm the genotyping were set up as two separate reactions per mutant, one for wild type with a gene speci ic left primer (LP) and right primer (RP), and a second with the gene speci ic right primer (RP) and a left border primer for the T-DNA insert (L3 or LBb1.  Proportion of total spectral counts for each tissue annotated with each high-level COG category. Nuclear samples show enrichment for information and depletion for metabolism. 15-20% of spectral counts from all tissues derive from orthogroups with no COG annotation. (Bottom) Arabidopsis mRNA vs protein abundances. RuBisCo is observed to have relatively moderate mRNA transcript levels across Arabidopsis tissues but is frequently the most abundant protein in our experiments. The median protein abundance of RuBisCo protein in Arabidopsis leaves was plotted against RNA abundance for three separate green Arabidopsis tissues (Gan et al., 2011;Liu et al., 2012Liu et al., , 2016 . Titles indicate accession numbers from the ExpressionAtlas database (Petryszak et al., 2016) . Figure S2. See Methods. Reproducibility of protein abundance measurements per CF-MS fractionation experiment. Hierarchically clustered all-by-all Pearson correlation of total orthogroup abundance vectors for each experiment. Experiments performed on the same tissue have highly repeatable orthogroup abundances, e.g. wheat germ and Chlamydomonas experiments. Figure S3. See Methods. Feature selection for protein-protein interaction scoring, as measured by scikit-learn Feature Importance. The top-ranked 100 features were selected for cross-validated model training and testing. Figure S4. See Methods. Calibration of molecular weight estimation by size exclusion chromatography. Proteins previously known to elute at particular molecular weights in size exclusion chromatography were selected as internal reference standards, for example, members of the COP9 signalosome complex, which are known to elute at a position corresponding to approximately 350 kDa. These literature-annotated elution positions (x-axis) were compared to observed elution positions (y-axis). We observe generally excellent agreement between expected and observed molecular weights ( Spearman r = 0.97 ), with one signi icant outlier (an Hsp70) consistently observed above 70kDa, showing that this protein participates in higher-order assemblies in plants.  Table S1. Related to Figure 1. Translation of common names referenced in manuscript to OrthogroupID, Arabidopsis Uniprot Accession, and Arabidopsis locus identi iers. One orthogroup may contain multiple proteins due to duplication since the last common ancestor of plants.  Figure 7. Interacting partners share mutational phenotypes (here, from the plant PhenomeNET ontology (Oellrich et al., 2015) ) at a signi icantly higher rate than random protein pairs ( p < 10 -16 , chi-squared test).