Abstract
Elucidating the wiring diagram of the human cell is one of the central goals of the post-genomic era. Here, we integrate genome engineering, confocal imaging, mass spectrometry and data science to systematically map protein localization in live cells and protein interactions under endogenous expression conditions. For this, we generated a library of 1,311 CRISPR-edited cell lines harboring fluorescent tags that also serve as handles for affinity capture, and applied a new machine learning framework to encode the interaction and localization profiles of each protein. Our approach provides a data-driven description of the molecular and spatial networks that organize the human proteome. We show that unsupervised clustering of these networks delineates functional groups and facilitates biological discovery, while hierarchical analyses uncover the core features that template cellular architecture. Furthermore, we discover that localization signatures are remarkably predictive of protein function, and often contain enough information to identify molecular interactions. Paired with a fully interactive website (opencell.czbiohub.org), OpenCell is a resource for the quantitative cartography of human cellular organization.
The sequencing of the human genome has transformed cell biology by defining the protein parts list that forms the canvas of cellular operation (1, 2). This paves the way for elucidating how the ∼20,000 proteins encoded in the genome organize in space and time to define the cell’s functional architecture (3, 4). Where does each protein localize within the cell? Can we comprehensively map how proteins assemble into larger functional communities? A main challenge to answering these fundamental questions is that cellular architecture is organized along multiple scales, so that several approaches need to be combined for its elucidation (5). In a series of pioneering studies, human protein-protein interactions have been mapped using ectopic expression strategies with yeast two-hybrid (Y2H) (6) or epitope tagging coupled to immunoprecipitation-mass spectrometry (IP-MS) (7, 8), while protein localization has been charted using immuno-fluorescence in fixed samples (9). However, these approaches do not measure protein interactions under native expression conditions (which can be more precise than using ectopic methods (10)), nor define protein localization in live and unperturbed cells. Furthermore, protein interactions and spatial localization have so far mostly been addressed separately. Therefore, the description of human cellular organization remains incomplete. By contrast, seminal work in the budding yeast S. cerevisiae has demonstrated how libraries of endogenously tagged strains can enable the comprehensive mapping of localization and interactions in a eukaryotic proteome (11–13). These libraries were made possible by the relative simplicity of homologous recombination in yeast (14) and enable the functional characterization of proteins in their native cellular context. Excitingly, recent advances in CRISPR-mediated genome engineering now allow for similar strategies to be applied for the interrogation of the human cell (15, 16).
Here, we combine experimental and analytical innovations to create OpenCell, a systematic proteomic map of human cellular architecture. We generated a library of 1,311 CRISPR-edited HEK293T cell lines expressing fluorescent protein fusions from endogenous genomic loci, which we characterized by pairing confocal microscopy and mass spectrometry. Our dataset constitutes the most comprehensive image collection of live-cell protein localization to date, while integration with IP-MS using the fluorescent tags for capture enables measurement of localization and interactions from the same samples. For a quantitative description of cellular architecture, we introduce a data-driven framework to quantitatively represent protein features, supported by a new machine learning approach for image encoding. This analysis allows us to delineate communities of functionally related proteins by unsupervised clustering and provides mechanistic insights on specific proteins or pathways, including many proteins that had so far remained uncharacterized. This approach further enables a hierarchical description of the human proteome’s organization, and highlights in particular that intrinsically disordered proteins – and notably RNA-binding proteins – exhibit very unique functional signatures that shape the proteome’s network. Finally, a direct comparison of imaging and mass spectrometry data establishes that localization patterns measured by light microscopy often contain enough information to predict interactions at the molecular scale.
Results
The OpenCell library
Fluorescent protein (FP) fusions are versatile tools that enable both the study of protein localization by microscopy and that of protein-protein interactions by acting as affinity handles for IP-MS (15, 17) (Fig. S1A). Here, we constructed a library of fluorescently tagged HEK293T cell lines by targeting human genes with the split-mNeonGreen2 system (18) (Fig. 1A). Split-FPs greatly simplify CRISPR-based genome engineering (15), which allowed us to generate fusions directly into endogenous genomic loci and to preserve native expression regulation (Fig. 1B). A full description of our pipeline is available in the Methods section (summarized in Figs. 1C-E). In brief, FP insertions sites (N-or C-terminus) were informed by literature curation or structural analysis. For each tagged target we isolated a polyclonal pool of CRISPR-edited cells, which was then characterized by live-cell 3D confocal microscopy, IP-MS, and genotyping of tagged alleles by next-generation sequencing. Open-source software development and advances in instrumentation supported scalability (Fig. 1C). In particular, we developed crispycrunch, a software for CRISPR-based integration experiments enabling guide RNA selection and homology donor sequence design (github.com/czbiohub/crispycrunch). We also fully automated microscopy acquisition in Python to enable on-the-fly computer vision and selection of desirable field of views imaged in 96-well plates (github.com/czbiohub/opencell-microscopy-automation). Furthermore, our mass-spectrometry protocols take advantage of the high sensitivity of timsTOF instruments (19) which allowed miniaturization of IP-MS down to 0.8×106 cells of starting material (Fig. S1B; 12-well plate culture, a >10-fold reduction compared to previous approaches (7, 8)).
In total, we targeted 1728 genes, of which 1311 (76%) could be successfully detected by fluorescence and form our current dataset (full library details in Suppl. Table 1). From these, we obtained paired IP/MS measurements for 1261 targets (96% of the publication set, Fig. 1D). Expression level is the main limitation to the successful detection of endogenous fluorescent fusions. Indeed, RNA-Seq analysis revealed that “unsuccessful” targets are expressed at significantly lower levels than successful ones (Fig. 1D, bottom left panel), and a correlation between transcript abundance and protein fluorescence identified the expression threshold corresponding to our fluorescence detection limit (log[RNA tpm] = 1.5, Fig. 1D, bottom right panel). This threshold corresponds to the median expression level in the HEK293T line (Fig. S1C), meaning that the top ∼50% of expressed genes are detectable at endogenous level using current FPs.
To maximize throughput, we used a polyclonal strategy to select genome-edited cells by fluorescent cell sorting. These polyclonal pools contain cells with different genotypes. On the one hand, HEK293T are pseudo-triploid (20) and a single edited allele is sufficient to confer fluorescence. On the other hand, different DNA repair mechanisms compete with homologous recombination for the resolution of CRISPR-induced genomic breaks (21) so that alleles containing non-functional mutations can be present in addition to the desired fusion alleles. However, such alleles do not support fluorescence and are therefore unlikely to impact downstream measurements, especially in the context of a polyclonal pool. In addition, we derived a stringent selection scheme to significantly enrich for fluorescent fusion alleles (Fig. S1D). Our final cell library has a median 61% of mNeonGreen-integrated alleles, 5% wild-type and 26% other non-functional alleles (Fig. S1E, full CRISPR design genotype information in Suppl. Table 1).
Finally, we verified that our engineering approach maintained the endogenous expression level of the tagged targets. For this, we quantified protein expression by Western blotting using antibodies specific to proteins targeted in 12 different cell pools. This analysis revealed that the median abundance of a target protein in engineered lines was 79% of its abundance in wt HEK293T cells (Fig. S1F). Thus, our gene-editing strategy preserves near-endogenous abundances and circumvents the limitations of ectopic overexpression (16, 22, 23), which include aberrant localization, changes in organellar morphology or masking effects (see the respective examples of SPTLC1, TOMM20 and MAP1LC3B in Fig. S1G). Therefore, OpenCell supports the functional profiling of tagged proteins in their native cellular context.
The OpenCell interactome
Affinity enrichment coupled to mass spectrometry is an efficient and sensitive method for the systematic mapping of protein interactions networks (24). We isolated tagged proteins (“baits”) from cell lysates solubilized in digitonin, a mild non-ionic detergent known to preserve the native structure and properties of membrane proteins (25). Specific protein interactors (“preys”) were identified from biological triplicate experiments using label-free bottom-up proteomics on a timsTOF instrument (19) (see Figure S2A-B for a detailed description of our statistical analysis, which builds upon established methods (7)). In total, the full interactome from our 1261 OpenCell baits includes 30,293 interactions between 5271 proteins (baits and preys, Fig. 2A, full interactome data in Suppl. Table 2).
To assess the quality of our interactome, we estimated its precision and recall using reference data (Fig. S2B). For recall analysis, we quantified the coverage in our data of interactions included in CORUM(26), a compendium of protein interactions manually curated from the literature. To estimate precision, we quantified how many of our interactions involved protein pairs expected to localize to the same broad cellular compartment (27) (Fig. S2B). Benchmarking OpenCell against other large-scale interactomes, we compared its precision and recall to Bioplex (overexpression of HA-tagged baits (8, 28)), HuRI (Y2H (6)) and our own previous data (GFP fusions expressed from bacterial artificial chromosomes (7)) (Fig. S2C-E). We also calculated compression rates for each dataset as a measure of the overall inter-connectivity in the interaction networks (29) (Fig. S2F). Across all metrics, OpenCell outperformed previous approaches. Together, these findings establish the high quality of our interactome, which likely reflects both our preservation of near-endogenous protein expression and the sensitivity of our mass spectrometry analyses.
Analyzing the distribution of the number of interactors per tagged target allowed us to investigate the global properties of the human interaction network (Fig. 2B). While this distribution approximates a power-law for moderate interaction counts (i.e., a linear relationship in log-log scale, Fig. 2B), our data shows significantly more targets with a large number of interactors (Ninteraction≥100 in Fig. 2B) than expected from a “scale-free” model of protein interaction networks(30), as has been noted in other analyses (31). We discovered that highly interacting proteins (i.e., Ninteraction≥100) are not simply the most highly expressed (Fig. 2C), but rather exhibit specific biophysical signatures. A sequence-base analysis showed that highly interacting proteins are significantly less a-helical and less hydrophobic than other proteins, but more likely to contain disordered domains (Fig. 2D, 2E). Ontology analysis also revealed that the group is enriched for RNA-binding proteins (Fig. 2F). The propensity of highly interacting proteins for high intrinsic disorder has been recognized in Y2H datasets (32). Intrinsic disorder is also a common property of proteins that form bio-molecular condensates, which include a large number of RNA-binding proteins (33, 34). Interestingly, disorder has been proposed to be under positive selection in viral proteomes as a means for the relatively small number of proteins encoded in viral genomes to be able to interact pleiotropically with the host machinery (35).
Stoichiometry-driven clustering of interaction signatures
A powerful way to interpret interactomes is to identify communities of interactors. These communities highlight complexes or functional pathways and also facilitate the assignment of protein function via “guilt-by-association” (8, 12). To this end, we applied unsupervised Markov clustering (MCL) (36) to the graph of interactions defined by our data (5148 baits and preys). We first measured the stoichiometry of each interaction as a proxy for interaction strength, using a quantitative approach we previously established (7), and used it to weigh the edges in the interaction graph (Fig. 2G). A first round of stoichiometry-weighted Markov clustering delineated inter-connected protein communities and outperformed clustering on the basis of connectivity alone (Fig. S2G). To further refine annotations, we subjected each individual MCL community to another round of clustering in which low-stoichiometry interactions were removed. The resulting sub-clusters outline core interactions within existing communities (Fig. 2G). An illustrative example of how this unsupervised approach allows us to delineate functionally related proteins is shown in Figure 2H: all subunits of the machinery responsible for the translocation of newly translated proteins at the ER membrane (SEC61/62/63) and of the EMC (ER Membrane Complex) are grouped within respective core interaction clusters, but both are part of the same larger MCL community. This mirrors the recently appreciated co-translational role of EMC for insertion of transmembrane domains at the ER (37). Interestingly, additional proteins, which have only been recently shown to have co-translational role, are found clustering with translocon or EMC subunits. These including ERN1 (IRE1), a folding sensor (38), and CCDC47, a poorly characterized translocon interactor which, like EMC, regulates the biogenesis of membrane proteins at the ER (39, 40). This highlights how clustering can facilitate mechanistic exploration by grouping together proteins involved in related pathways. Overall, we identified 300 communities including a total of 2154 baits and preys (full details in Suppl. Table 2). A graph of interactions between communities reveals a richly inter-connected network (Fig. 2I), the structure of which outlines the global architecture of the human interactome (discussed further below).
The OpenCell image dataset
A key advantage of our cell engineering approach is to enable the characterization of each tagged protein in live, unperturbed cells. To profile localization, we performed fluorescence microscopy on a spinning-disk confocal microscope equipped with a 63x 1.47NA objective under environmental control (37°C, 5% CO2), and imaged the full 3D distribution of proteins in consecutive z-slices. Microscopy acquisition was fully automated in Python to enable scalability (Fig. S3A-B). In particular, we trained a computer vision model to identify fields of view (FOVs) with homogeneous cell density on-the-fly, significantly reducing uncontrolled experimental variation between images. Our resulting dataset contains a collection of 5176 3D stacks (4-5 different FOVs for each target) and includes paired imaging of nuclear morphology with Hoechst 33342, a cell-permeable DNA dye compatible with live-cell measurements.
We manually annotated localization patterns by assigning each protein to one or more of 15 separate cellular compartments such as nucleolus, centrosome or Golgi apparatus (see Figure 3A for the full list and example images). Because proteins often populate multiple compartments at steady-state (9), we annotated localizations using a three-tier grade system: grade 3 identifies the most prominent localization compartment(s), grade 2 represents clearly detectable but minor localizations, and grade 1 annotates weak localization patterns nearing our limit of detection (see Figure S4A for two representative examples, full annotations in Suppl. Table 3). Ignoring grade 1 annotations which are inherently less precise, 55% of proteins in our library are multi-localizing and outline known functional relationships with, for example, clear connections between secretory compartments (ER, Golgi, vesicles, plasma membrane), or between cytoskeleton and plasma membrane (Fig. 3B). The most common pattern of multilocalization involves proteins found in both nucleus and cytoplasm (21% of our whole library), highlighting the importance of the nucleo-cytoplasmic import/export machinery in shaping global cellular function(41, 42). Importantly, because our split-FP system does not enable the detection of proteins in the lumen of organelles, multi-localization involving translocation across an organellar membrane (which is rare but does happen for mitochondrial or peroxisomal proteins) will not be detected in our data.
Quantitative localization encoding with self-supervised machine learning
Extracting functional insights directly from cellular images is a major goal of modern cell biology and data science (43). Because the function of a protein is tightly linked to its localization, we explored whether a quantitative comparison of localization signatures would allow us to delineate groups of co-functioning proteins. For this, we developed a deep learning model, which is fully described in a companion study (44). Briefly, our model is a variant of an autoencoder (Fig. 3C): a form of neural network that learns to vectorize an image through paired tasks of encoding (from an input image to a vector in a latent space) and decoding (from the latent space vector to a new output image). After training, a consensus representation for a given protein can be obtained from the average of the encodings from all its associated images. This generates a “localization encoding” (Fig. 3C) that captures the complex set of features that define the localization of a protein across the full cell population. One of the main advantages of this approach is that it is self-supervised. Therefore, as opposed to supervised machine learning strategies that are trained to recognize pre-annotated patterns (for example, manual annotations of protein localization (45)), our method extracts localization signatures from raw images without any a priori assumptions, manually assigned labels, or other additional information. This allows us to objectively compare proteins by measuring the similarity between their localization signatures.
A UMAP representation of the localization encodings for the entire OpenCell library is shown in Figure 3D. This map is organized in distinct territories that closely match manual annotations (Fig. 3D, highlighting mono-localizing proteins). This validates that our approach yields a quantitative representation of the biologically relevant information in our microscopy data. We then asked what degree of functional relationship could be inferred between proteins solely on the basis of their localization patterns. For this, we employed an unsupervised Leiden clustering strategy classically used to identify cell types in single-cell RNA sequencing datasets (46). Strikingly, applying this data-driven approach identified groups of proteins that are closely related mechanistically (182 clusters in total, full list in Suppl. Table 3). For example (Figure 3E), our analysis not only separated P-body proteins (cluster #83) from other forms of punctated cytoplasmic structures, but also unambiguously differentiated vesicular trafficking pathways despite very similar localization patterns: the endosomal machinery (cluster #40), plasma membrane endocytic pits (cluster #117) or COP-II vesicles (cluster #143) were all delineated with high precision. Proteins involved in closely inter-related cellular functions were also found to cluster together: for example, the ER translocon clusters with the SRP receptor, EMC subunits and the OST glycosylation complex, all responsible for co-translational operations (cluster #9). Clustering performance was also high for non-organellar proteins, as shown in Figure S4B (cytoplasmic clusters) and Fig S4C (nuclear clusters). Altogether, our results show that the localization pattern of a given protein reflects its function with high specificity, down to specific pathways, and that this information can be captured by self-supervised deep learning algorithms. While closely related proteins cluster tightly together, the existence of many separated groups also underlines the fascinating diversity of localization patterns across the full proteome. Images from nuclear proteins offer compelling illustrative examples of this diversity (Fig. S4D).
Interactive data sharing at opencellczbiohuborg
To enable widespread access to the OpenCell datasets, we built an interactive web app that provides side-by-side visualizations of the confocal images and of the interaction network for each tagged protein (Fig. 4A-B). The app is organized around a ‘target profile’ page that displays all of the metadata, images, and interactions for a selected mNG11-tagged target (Fig. 4B). Confocal fluorescent images can be visualized either in 2D as a scrollable stack of z-slices, or in 3D via an interactive volume rendering module we developed (Fig. 4C). Our interface also allows the user to toggle between fluorescence channels (tagged protein and DNA stain) and to adjust image brightness and contrast. The interaction network, which consists of the target, its direct interactors, and the interactions between them, is organized by the communities and core clusters identified by Markov clustering and is positioned directly adjacent to the image viewer (Fig. 4B). This side-by-side juxtaposition of images and interactions encourages the comparison between subcellular localizations and interaction signatures. The interactome visualization can be switched to reveal quantitative data for each pull-down in the form of volcano and stoichiometry plots (Fig. 4D). To enable the interactive navigation of the full interaction network, interactors in the network that are themselves OpenCell targets are hyperlinked (from any visualization mode) to their corresponding target pages. Interacting proteins in the network that are not OpenCell targets are likewise hyperlinked to a distinct ‘interactor profile’ page (Fig. 4A, middle panel) that displays information about all the pull-downs in which the interacting protein appears. Finally, to explore the dataset by localization pattern, a separate ‘gallery’ page displays a grid of thumbnail microscopy images for all targets in the library, filtered according to a user-defined set of subcellular localizations (Figure 4A, right panel). The app itself is supported by a relational database and a REST API that also allows for programmatic access to the underlying raw data.
Comparing interactions and spatial relationships
IP-MS and microscopy examine the architecture of the cellular proteome at very different scales (molecular for IP-MS, pan-cellular for microscopy). Localization and interactions are linked: proteins must co-localize to interact. But while the localization patterns of each interactor must overlap to some extent, depending on the nature of these interactions (stable or transient), they do not need to match completely. Therefore, the degree to which interacting proteins also share the same overall spatial signature represents an organizing feature of the cellular protein network. To quantify the similarity of localization of any two targets in OpenCell, we measured the Pearson correlation between their localization encodings; this gives a distance measure of similarity in “localization space” between two proteins (Fig. 5A, upper panel). In parallel, similarities can be measured in “interaction space” by comparing the interaction profiles between any two proteins based on stoichiometry information (Fig. 5A, lower panel). As expected, the distribution of localization vs. interaction similarities between all pairs of OpenCell targets that were found to interact shows a positive correlation between the two parameters (Fig. 5B); however, it also highlights that the vast majority of interacting proteins are not particularly similar by either measure (large cloud around the origin of the graph). Examining the relationship between localization similarity and interaction stoichiometry further reveals two separate groups of interacting pairs (Fig. 5C): 1) those that interact with low stoichiometry and whose spatial signatures do not specifically overlap (i.e., have low localization similarity, solid line in Fig. 5C), which represents the largest proportion and 2) a smaller but well-delineated group of stoichiometric interactors that share very similar localization patterns (dashed line). This second result makes intuitive sense: interactions that are stoichiometric should be stable in space and time, and a high overlap between localization patterns is expected. Indeed, different subunits of known stable complexes share extremely similar localization signatures and form tight clusters in our image UMAP (Fig. 5D). Overall, our analysis underscores the organization of the cell’s proteome along two modes of interactions: small communities of high-stoichiometry protein groups, whose functions are intertwined to the point that their steady-state localization patterns are very similar, and a much larger set of low-stoichiometry, and presumably more dynamic interactions with lower spatial overlap.
This intuitive result also has an important correlate: that highly similar localization patterns between two proteins can be used to infer close molecular interaction. In fact, looking at the entire set of OpenCell target pairs (predicted to interact or not), proteins that share high localization similarities are also very likely to interact (Fig 5E). For example, target pairs with a localization similarity greater than 0.85 have a 58% chance of being direct interactors, and a 68% chance of being second-neighbors (i.e., sharing a direct interactor in common). Therefore, our analysis demonstrates that a quantitative comparison of localization patterns can also make predictions about the molecular-level architecture of the proteome. This also reveals that the localization pattern of a given protein contains highly specific information from which precise functional attributes can be extracted, in particular by modern machine learning algorithms.
Biological discovery using interactomes or images
As demonstrated above, unsupervised clustering of both localization and interaction signatures can be used to derive functional relationships between proteins (for example, unambiguously linking together different forms of co-translational processes). A direct application of this result is to help elucidate the cellular roles of the many human proteins that remain poorly characterized (47). We first mined our interactome for poorly studied proteins. We focused on the set of baits and preys found within MCL communities, reasoning that belonging to a community was a strong signal to map function via “guilt by association”. As a simple metric for how well characterized a protein is, we quantified its occurrence in article titles and abstracts from Pub-Med. Empirically, we determined that proteins in the bottom 10th percentile of publication count (corresponding to less than 10 publications) were very poorly annotated (Fig. 6A). This set encompasses a total of 251 proteins for which our dataset offers specific mechanistic insights. For example, the poorly characterized NHSL1, NHSL2 and KIAA1522 are all found as part of an interaction community centered around SCAR/WAVE, a large multi-subunit complex nucleating actin polymerization (Fig. 6B). Interestingly, all three proteins share sequence homology and are all homologous to NHS (Fig. S5A), a protein mutated in patients with Nance-Horan syndrome and that interacts with SCAR/WAVE components to coordinate actin remodeling (48). This suggests that NHSL1, NHSL2 and KIAA1522 also act to regulate actin assembly. A recent mechanistic study made public after our initial prediction supports this hypothesis: NHSL1 was found to localize at the cell’s leading edge and to directly bind SCAR/WAVE to negatively regulate its activity, reducing F-actin content in lamellipodia and inhibiting cell migration (49). Importantly, the authors identified NHSL1’s SCAR/WAVE binding sites, and we find these sequences to be conserved in NSHL2 and KIA1522 (Fig. 6B). Therefore, we propose that both NHSL2 and KIAA1522 are also direct SCAR/WAVE binders and new modulators of the actin cytoskeleton.
Our data also uncovers a specific function for ROGDI, whose variants cause Kohlschuetter-Toenz syndrome (a recessive developmental disease characterized by epilepsy and psychomotor regression (50)). ROGDI appears in the literature because of its association with disease, but no study, to our knowledge, specifically determines its molecular function. To delineate the function of ROGDI, we first observed that its interaction pattern matched very closely that of three other proteins in our dataset: DMXL1, DMXL2 and WDR7 (Fig. 6C). This set exhibited a specific interaction signature related to v-ATPase, the lysosomal proton pump. All four proteins interact with soluble v-ATPase subunits (ATP6-V1), but not its intra-membrane machinery (ATP6-V0). Interestingly, DMXL1 and WDR7 have been shown to interact with V1 v-ATPase, and their knock-down compromises lysosomal re-acidification (51). In fact, sequence analysis showed that DMXL1/2, WDR7 and ROGDI are homologous to proteins from yeast or Drosophila that have been involved in the regulation of assembly of the soluble V1 subunits onto the V0 transmembrane ATPase core (52, 53) (Fig. S5). In yeast, Rav1 and Rav2 (homologous to DMXL1/2 and ROGDI, respectively) form the stoichiometric RAVE complex, a soluble chaperone that regulates v-ATPase assembly (53). To further characterize the function of these proteins, we generated new tagged cell lines for DMXL1/2, WDR7 and ROGDI. Because of the low expression level of these proteins, imaging analysis proved difficult, and fluorescent fusions of DMXL2 and ROGDI did not lead to detectable fluorescence. However, pull-downs of DMXL1 and WDR7 confirmed a stoichiometric interaction between DMXL1/2, WDR7 and ROGDI (Fig. 6C, right panels). Interestingly, no direct interaction between DXML1 and DMXL2 was detected, suggesting that they might nucleate two separate sub-complexes. Therefore, our data uncovers a human RAVE complex comprising DMXL1/2, WDR7 and ROGDI, which likely acts as a chaperone for v-ATPase assembly. Altogether, NHSL1/2-KIAA1552 and DMXL1/2-WDR7-ROGDI illustrate how OpenCell catalyzes new biological insights by combining quantitative analysis, literature curation and new functional data – including a direct mechanistic role for ROGDI, shedding light on the biology of Kohlschuetter-Toenz syndrome.
These examples underscore the power of mining interactome data for mechanistic predictions. Having established that quantitative localization encoding on its own could help elucidate the molecular function of a given protein, we next asked to which extent the function of an orphan protein could be characterized by imaging alone. FAM241A (or C4orf32) is a human protein without any functional annotation. Our initial interactome dataset placed FAM241A as part of a community centered around the OST complex, the transferase responsible for co-translational glycosylation (Fig. 6D). We generated an endogenous FAM241A fluorescent fusion and separately used imaging and mass-spectrometry to elucidate its function. Importantly, the deep learning model we used to generate its localization encoding was not trained with images of this new target (“naïve” model). Strikingly, the quantitative distances between FAM241A and other OpenCell targets measured using either localization or interaction signatures both identified FAM241A as a new OST subunit (Fig. 6D). Moreover, separate unsupervised clustering analyses using the two signatures both placed FAM241A in well-defined clusters with other OST subunits (Fig. 6D, right panels). Thus, the function of FAM241A could have been predicted with the same degree of precision by using either its interaction signature or its localization encoding. This proof-of-concept example establishes the potential of live-cell imaging as a specific readout of protein function, including for the characterization of poorly studied human proteins.
Hierarchical structure(s) of proteome organization
Finally, we explored the global structure of our datasets by using hierarchical clustering to highlight the signatures patterning the proteome. Starting with the 300 interactome communities or the 182 localization clusters outlined above, we mapped the full hierarchical relationships underlying both our interactome and imaging sets. Specifically, we implemented an agglomerative clustering strategy based on node pair sampling (54), which we applied separately to the graph of interactions between interactome communities and to the graph connecting localization clusters derived from the corresponding UMAP adjacency matrix (55). This resulted in fully connected hierarchical trees, shown in Figure 7. Isolating groups of proteins at separate hierarchical layers reveals different levels of the proteome’s organization. At an intermediate layer, the proteome can be delineated into sets of 18-19 separate “modules”, each including a median of 99 proteins for the interactome (modules M1-M18, Fig. 7A, see composition in Suppl. Table 4A) and of 66 proteins for the imaging dataset (modules N1-N19, Fig, 7B, see composition in Suppl. Table 4B), respectively. At a higher layer, each dataset can be divided into three “branches”, which separately represent the core features that shape the proteome’s global architecture from a molecular or spatial perspective. Performing gene ontology analysis underlined that modules and branches are enriched for specific cellular functions or compartments, which define unique functional signatures (labeled in Fig 7A,7B – details in Figure S6 for branches; Suppl. Tables 4A, 4B contain the full gene ontology analysis for both modules and branches).
Overall, the hierarchy of localization encodings reveals that, as expected, localization patterns are organized along the three foundational compartments of the eukaryotic cell: nucleus, cytoplasm and membrane-bound organelles (Fig. 7B, Fig. S6G). Each localization branch is sub-divided into modules that correspond to discrete sub-cellular territories. For example, the organellar branch separates into the different components of the secretory pathway: ER, Golgi, endosome, lysosome or plasma membrane, mirroring the known spatial segregation between these compartments. The fact that this unsupervised clustering strategy broadly recapitulates known cellular compartments validates our approach. By contrast, the hierarchical analysis of the interactome graph reveals how the proteome is organized at the molecular scale (Fig. 7A). The 18 modules separate the interactome into clear cellular functions such as transcription, splicing or vesicular transport. This reflects that functional pathways are templated by groups of proteins that physically interact, a principle also underscored by the overlap between genetic interactions and protein complexes in eukaryotes (56, 57). More interestingly, analysis of the high-level branches reveals a separation between three groups of proteins that differ in their functional and biophysical properties. Ontology enrichment highlights that proteins related in membrane processes (branch B) and RNA-binding (branch C) clearly segregate from the rest of the proteome in term of their interaction profiles (Fig. S6A-E). This functional segregation is correlated with different biophysical properties: branch B proteins are significantly more structured, more hydrophobic and richer in aromatic residues than the rest of our dataset; conversely, branch C proteins have higher intrinsic disorder and higher isoelectric points (Fig. S6B-C). Over-all, our data reveals that RNA-binding proteins form a specific molecular sub-group that shapes the global organization of the cell, similar to how association with membranes is a molecular feature that sets apart a large fraction of the proteome. Strikingly, RNA-binding proteins are known to form membrane-less organelles under stress conditions, especially through phase transition processes supported by intrinsic disorder (33, 34). This suggests that the biophysical properties underlying the formation of biomolecular condensates might also be a global driving force shaping the cellular proteome network under normal conditions.
Discussion
OpenCell combines innovations at four separate levels to augment the elucidation of human cellular architecture. First, we describe an integrated experimental pipeline for high-throughput cell biology, fueled by scalable methods for genome engineering, live-cell microscopy and IP-MS. Second, we provide an open-source resource of well-curated localization and interactome measurements, easily accessible through an interactive web interface at opencell.czbiohub.org. Third, we pioneer new analytical strategies for the representation and comparison of interaction or localization signatures (including a fully self-supervised machine learning approach for image encoding). And fourth, we demonstrate how our dataset can be used both for fine-grained mechanistic exploration (by elucidating the function of multiple proteins that were previously uncharacterized), as well as for investigating the core organizational principles that wire the proteome. In particular, we uncover two global features that shape cellular architecture. First, we show that most proteins interact with low stoichiometry and distribute unequally within the cell, whereas high-stoichiometry interactors share very similar localization patterns. This reinforces the importance of low-stoichiometry interactions for defining the overall structure of the cellular network, not only providing the “glue” that holds the interactome together (7) but also connecting different cellular compartments. Second, we reveal that two separate groups of interacting proteins segregate from the global proteome: membrane-related and RNA-binding, both of which exhibit specific biophysical signatures (in particular hydrophobicity and high intrinsic disorder, respectively). That membrane-related proteins form a specific interaction group is perhaps not surprising as the two-dimensional membrane surface drives their sequestration within the three-dimensional cell. By contrast, the discovery of RNA-binding proteins as a separate sub-group is significant given the growing appreciation that RNA-binding proteins, together with RNAs themselves, can form condensates in the cytoplasm and nucleoplasm. Interestingly, intrinsic disorder is an important modulator of partition into biomolecular condensates, as are specific protein-protein interaction domains (33, 34, 58, 59). Therefore, RNA-binding proteins might have evolved to multiply molecular interactions between themselves to create compartments that can concentrate a large functional variety of proteins (for example RNA processing factors, translation machinery or silencing complexes (60, 61)) and enable the spatial specialization of cellular processes. In this context, it is interesting to consider a role for RNA itself as an organizer of the cellular proteome, as is for example the case for some non-coding RNAs whose function is to template molecular interactions with proteins to form nuclear bodies (62).
While OpenCell joins a growing number of large-scale datasets (6, 8, 9, 16, 27, 63–66) that contribute to the systematic and quantitative dissection of the human cell, it is unique in its analysis of both quantitative interactomes and live-cell localization of genome-edited proteins. However, the full description of human cellular architecture remains a formidable challenge, especially considering the vast diversity of cell types and cell states that shape human physiology. Mirroring the advances in genomics following the sequencing the human genome (2), open-source systematic datasets will likely play an important role in how the growth of cell biology measurements can be transformed into fundamental discoveries by an entire community (67). To date, OpenCell includes functional information for 1311 targets (∼7% of the human proteome) and a total of 5148 proteins found in our interactome (baits and preys, ∼26% of the proteome). Our approach that combines split-FP systems and HEK293T – a cell line that is heavily transformed but easily manipulatable – is mostly constrained by scalability considerations. But given the current pace of technological advances, the large scale of measurements required to match the full extent of cellular complexity might soon be within reach. In particular, advances in stem cell technologies enable the generation of libraries that can be differentiated in multiple cell types (16), while innovations in genome engineering (for example, by modulating DNA repair (68)) pave the way for the scalable insertion of gene-sized payload (e.g., fluorescent proteins, HaloTag, degrons), for the combination of multiple edits in the same cell (e.g., dual-tagged libraries for co-localization studies) or for increased homozygosity in polyclonal pools. In addition, our live-cell imaging approach also paves the way for the systematic description of 4D intracellular dynamics (64), which is being transformed by recent developments in high-throughput light-sheet microscopy (69).
Finally, OpenCell provides a large set of opensource, quantitative and curated data that can be used as a proving ground for data science and algorithm development. Our own innovation in machine learning to encode localization signatures was made possible by the availability of a critical mass of high-quality live-cell images taken under uniform experimental conditions – itself facilitated by our collection of genome-edited lines that can be characterized repeatedly and in a native cellular context. Our results also demonstrate the power of self-supervised deep learning models to identify complex but deterministic signatures from light microscopy images. In particular, we show that remarkably detailed functional relationships can be inferred on the sole basis of similarities between localization patterns, including the prediction of molecular interactions. This opens exciting avenues for the use of imaging as a high-throughput, information-rich method for deep phenotyping and functional genomics (70). Because light microscopy is easily scalable, can be performed in live, unperturbed samples and enables measurements at the single-cell level, this offers rich opportunities for the full quantitative description of cellular diversity in normal physiology and disease.
*
* *
Competing interests
J.S.W. declares outside interest in Chroma Therapeutics, KSQ Therapeutics, Maze Therapeutics, Amgen, Tessera Therapeutics and 5 AM Ventures. M. M. is an indirect shareholder in EvoSep Biosystems.
Materials and Methods are available as a supplementary file.
Acknowledgements
We thank N. Neff, M. Tan, R. Sit and A. Detweiler for help with high-throughput sequencing; G. Margulis, A. Sellas, E. Ho and J. Mann for operational support; A. McGeever for help with web application architecture and deployment; and S. Schmid for critical feedback. M.D.L. thanks C. L. Tan for continuous discussions. H.K. was supported by an International Research Fellowship from the Japan Society for the Promotion of Science. R.M.B. was supported by a NIH pre-doctoral fellowship (F31 HL143882). B.H. was supported by NIH (R01GM131641) and is a Chan Zuckerberg Biohub Investigator. J.S.W. was supported by NIH (1RM1HG009490) and is an investigator with the Howard Hughes Medical Institute.
Footnotes
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.
- 82.
- 83.
- 84.
- 85.