## Abstract

The circuits of olfactory signaling are reminiscent of complex computational devices. The olfactory receptor code, which represents the responses of receptors elicited by olfactory stimuli, is effectively an input code for the neural computation of odor sensing. Here, analyzing a recent dataset of the odorant-dependent responses of human olfactory receptors (ORs), we show that the space of human olfactory receptor codes is partitioned into a modular structure where groups of receptors are “labeled” for key olfactory features. Our analysis reveals a low-dimensional structure in the space of human odor perception, with the receptor groups as the bases to represent major features in the perceptual odor space. These findings provide a novel evidence that some fundamental olfactory features are already hard-coded at the level of ORs, separately from the higher-level neural circuits.

## INTRODUCTION

Olfaction is a sensory process that captures environmental signals by detecting molecular stimuli through a repertoire of olfactory receptors (ORs). Although the full processing of olfactory information is realized in the complicated neural circuits [1], the first step of olfactory sensing involves a selective binding of odorants to the cognate ORs, which is biochemical in nature [2]. The binding elicits an array of downstream responses in the corresponding olfactory receptor neurons (ORNs) [3]. The response pattern encoded into ORs or ORNs, termed the olfactory *receptor code* [4], provides the first neural representation of an odor and is essential in the early stage of olfactory sensing.

Lying at a crucial juncture in the flow of olfactory information, the receptor code bridges between the molecular and neural spaces. An olfactory stimulus in the *molecular space*, representing the physiochemical properties of the odorants, is translated into the *receptor code space* to form an “input code”. The information is then processed through the higher-order *neural spaces*, and eventually evokes the sense of smell that constitutes the *perceptual odor space* (see Fig. 1). Recent insights into the processing of olfactory information were gained mostly based on the olfactory sensing of non-human species [5–7] or through theoretical studies [8, 9]. A separate branch of research attempts to relate the molecular space directly with the perceptual odor space [10, 11]. However, the goal of understanding the principles of olfactory computation in humans faces a fundamental difficulty due to our limited access to the neural circuitry in the human brain.

Here, we show that a critical progress can still be made by a systematic analysis of the receptor code space. Whereas the OR repertoire is known to encode the input signal from odorants like a piano keyboard that combinatorially produces a variety of musical chords [4], it may be further “formatted” to facilitate the processing of relevant information, perhaps by appropriately reflecting the structure of the perceptual odor space. Specifically, we analyze a set of receptor codes for different odorants, extracted from the biochemical measurement of downstream responses of the human ORs [12–14]. Because the receptor code is implemented by the many-to-many pairwise interaction between odorants and ORs [4, 15], we treat the interaction of each odorant-receptor pair as either “on” (responding) or “off” (not responding), and analyze the binary pattern of pairwise interaction, considering it a zeroth-order representation of the receptor code. Our analysis of overlap between distinct receptor codes (receptor code redundancy) reveals the intrinsic structure of the human receptor code space, which is also correlated with the set of coarse-grained features in the perceptual odor space. This suggests that the processing of olfactory information is already at work in the receptor space.

## RESULTS

We present a quantitative analysis of the pattern of odorant-receptor interactions, employing a dataset that reports the responses of 303 receptors against 89 odorants [14]. The state of all *N* = 303 receptors can be represented in terms of an *N*-dimensional binary vector **y**, where *y _{i}* = 1 if the

*i*-th receptor is “on”, and

*y*= 0 if it is “off”. For a given odor

_{i}**x**, the corresponding

*receptor code*

**y**represents the odor

**x**in the

*N*-dimensional binary receptor code space.

We collected all 535 pairwise interactions reported in the dataset, including 60 de-activations [16], and visualized the receptor codes in the form of an interaction network (Fig. 2a). In this interaction network, each node is either an odorant or a receptor, and each edge connects an interacting odorant-receptor pair. For a more detailed version of the interaction network with varying parameters, see Fig. S1.

### Interaction network reveals odorant hubs with non-redundant receptor codes

In the odorant-OR interaction network (Fig. 2a), the number of edges attached to an odorant node indicates the number of receptors that recognize this odorant. We call this number the *degree* of the odorant node, and denote it by *k*_{λ}, where λ is the index for the odorant (also see Methods).

We observe two properties from the statistics of odorant degrees. First, the receptor-space representation of single odorants is sparse: On average, an odorant is recognized by only 6 (〈*k*〉) ≃ 6) out of *N* = 303 receptors, which amounts to 2% of the receptor space. The sparsity observed in the network is consistent with previous reports from the ORN responses [17, 18]. Second, the degrees of the odorants in the dataset are non-uniformly distributed with a heavy tail, which can be approximately fitted to *P*(*k*) ~ *k*^{−α} with *α* ≈ 0.9 (Fig. 2b). Such distribution allows us to identify a small number of high-degree odorant “hubs”. The six highest-degree hub odorants are labeled in Fig. 2a. It is worth noting that the receptor codes for the hub odorants are highly non-redundant with one another (Fig. 2c); among the six highest-degree odorants, 10 out of all 15 pairs have completely disjoint sets of receptor codes. In particular, despite significant similarity in chemical structures, there is no co-activated OR between eugenol and eugenol acetate, which differ in a single functional group (hydroxy group versus acetate in the aromatic ring). The overlap of receptor codes among the hub odorants is significantly smaller than what is expected from a random mixture of odorants whose receptor codes cover the same number of receptors (*p* < 0.02; also see Methods and Fig. S3).

We introduce an idea of *receptor code redundancy*, a quantitative measure of (dis-)similarity between the olfactory responses of two distinct odorants. If two odors have exactly the same receptor codes, they are perceived as the same olfactory signal. We hypothesize that two odors are poorly distinguishable when their receptor codes have a higher redundancy. For example, consider a discriminatory task where the goal is to detect a target odorant in the presence of a constant background odor [19, 20]. When there was no background odor, the signal would elicit responses in |**y**_{targ}| receptors, where **y**_{targ} is the receptor code for the target odorant. But when some receptors also respond to the background, **y**_{bg}, the states of the overlapping receptors would not change upon the addition of the target odorant; the receptor code overlap |**y**_{targ} ∩ **y**_{bg} | reduces the effective size of the response (also see Fig. 2d-e). Therefore we define the receptor code redundancy in terms of the fraction χ = |**y**_{targ} ∩ **y**_{bg} | / |**y**_{targ}|. If this fraction is small, as in Fig. 2d, the target odorant is deemed unambiguously detected even in the presence of the background odorant.

Taken together, the single-odorant representation in the receptor space is sparse and non-uniform, giving rise to high-degree “hub” odorants. The receptor code is almost non-redundant among these hub odorants, whereas the odorants that have a greater receptor code redundancy with a hub odorant are grouped around it. As presented below, analysis of the receptor code redundancy enables us to decipher the structure of the receptor code space more quantitatively.

### The receptor code space is naturally partitioned to show a modular structure

The grouped structure is better manifested when the interaction network is projected to the space of receptors. Here we consider the *co-activation graph* of receptors, which inherits all receptor nodes in the original interaction network; two receptor nodes are connected by an edge if they share a common odorant in the original network (Fig. 2d-e, last column). Of particular interest in the coactivation graph are the “receptor *cliques”*, or the groups of receptors that are co-activated by the same odorants. Because each pair of receptors interacting with a shared odorant λ is connected in the co-activation graph of receptors, the set of all receptors that interact with a given odorant always form a clique. Given the particular structure of the interaction network, with hub odorants with largely non-redundant receptor codes (Fig. 3a), the co-activation graph of receptors is bound to have large and mostly non-overlapping cliques that are associated to the hub odorants (Fig. 3b). We use the receptor cliques to *partition* the receptor space into non-overlapping groups, such that each receptor group is associated to a shared odorant; when a receptor is a part of more than one receptor clique, it is assigned to the larger clique (Fig. 3c).

The receptor groups can be used to group odorants based on their receptor codes. The idea is to construct odorant groups such that an odorant λ belongs to an odorant group Λ if its receptor code redundancy with respect to the group Λ, χ_{λ|Λ} = |**y**_{λ} ∩ **y**_{Λ}| / |**y**_{λ}|, is large (see Methods). First, we used the receptor groups to determine the reference receptor codes for odorant groups (**y**_{Λ}’s). Specifically, for each receptor group Λ (note that the group index is shared between the receptors and the odorants), we make a binary vector **y**_{Λ} such that [**y**_{Λ}]_{i} = 1 if and only if receptor *i* belongs to the group Λ. Then we assigned each odorant to the group Λ where the receptor code redundancy χ_{λ|Λ} is maximized. This results in a simultaneous partitioning of odorants and receptors (Fig. 3d, local groups).

The local groups emerge naturally from the particular statistics of the receptor codes with hub odorants, and capture the local coactivation pattern; receptors are grouped together if they are coactivated by the same odorant. If we were to characterize the overall structure of the receptor code space in terms of the groups, however, the pairs that are *not* grouped together also carry important information. In our case, a grouping is “good” if receptors in the same group respond to the same sets of odorants, *and* receptors in different groups to different odorants. Now we step further to obtain an optimal grouping of the receptor code space based on the co-activation patterns: we carry out a *secondary* grouping of the receptors by merging the *primary* groups obtained above by identifying receptor cliques. We merge a pair of primary groups when their overlap in the co-activation matrix is larger than a threshold (see Methods for details), where the threshold for merging is determined to maximize the goodness of grouping (Fig. S5). Because we partition the odorants and the receptors simultaneously, a merging of receptor groups automatically leads to a merging of the corresponding odorant groups (Fig. 3d).

Application of the grouping procedure to the human olfactory receptor codes results in the sorted interaction matrix in Fig. 4, where each row represents the receptor code for a given odorant. The rows and columns of the interaction matrix are sorted according to the orders in the respective odorant/receptor groups (Fig. 4a-b). The interaction matrix has a roughly block-diagonal form (Fig. 4c), with much less significant contributions from the off-diagonal elements. Notice that we have already arranged the interaction network layout to represent a grouped structure, which is visualized more clearly in Fig. 5, where colored territories (Groups 1–6) represent the best partitioning of the receptor code space. The feasibility of grouping reflects a special structure of the receptor code; such clear grouping is not obtained from a random network with the same degree statistics. Specifically, we randomize the interaction matrix while preserving (i) the number of interactions, (ii) the number of interactions for each odorant (degree) and (iii) the primary grouping structure. In all three cases, it was highly unlikely to find a grouping as good as or better than what we observed from the human receptor code data (*p* ≤ 0.002; see Methods and Fig. S6).

### Odorants in the same group tend to carry similar olfactory features

Our analysis so far only concerned the receptor codes for the test odorants; we did not use any other information about the odorants. On the other hand, because the dataset was constructed to include those odorants that were considered in previous psychology studies [13, 14], each test odorant has a known perceptual quality; i.e., how it smells to human.

It turns out that the two hub odorants in the largest receptor-code-based group (Group 1) represent the two sub-classes of plant-related odors: the characteristic “green” odor of fresh-cut grass (cis-3-hexen-1-ol) and the characteristic “floral” odor that is encountered in many natural essential oils (geranyl acetate). Moreover, there are many other odorants in Group 1 that belong to similar perceptual categories, namely green (phenyl acetaldehyde, sandalwood, terpineol); floral (linalool, lilial, lyral, anisaldehyde); and minty (methyl salicylate, spearmint, and the carvones). In Group 2, all three hub odorants are related to the body odor. Androstenone and androstadienone (the “human pheromone”) are contained in sweat or saliva; butyric acid with its characteristic sweaty smell is naturally formed in the human body by fermentation, and is also found in milk and butter. Among the smaller-degree odorants, heptanoic acid carries a rancid sweaty odor, and 2-methyl-1-propanethiol is associated with meaty or beer-like flavors. In Groups 3, 4 and 5, we find the ingredients of culinary spices. The two hub odorants in Groups 3 and 4 are the major constituents of clove oil (eugenol and eugenol acetate); the two hub odorants in Group 5 are common food-flavoring agents (ethyl vanillin and cinnamon). Group 3 includes more spice- or food-related odorants: nutmeg and banana have their well-known flavors, guaiacol is known to give the characteristic flavor to whiskey and roasted coffee, and 2-heptanone is related to the typical smell of gorgonzola cheese.

These observations lead to a hypothesis that, whereas our analysis is based only on the receptor codes and their redundancy, it results in a grouping of perceptually similar odorants together. In order to make the idea more quantitative, we labeled each odorant with a coarse-grained *olfactory feature* out of five categories (plantlike, animal-like, culinary, citrus, and others). More specifically, we first labeled the odorants using the eight perceptual odor categories identified from a previous work [21] (the primary odor category), then merged some of the categories that show similar co-occurrence patterns as described below (the secondary odor category); also see Fig. S7c and Table S1. Note that throughout this paper, we will use the term “olfactory features” specifically to refer to the perceptual odor qualities that are labeled by a discrete set of categories. Details of the labeling procedures are described in Methods. As a result, each odorant is assigned two independently determined labels: one for the receptor-code group (the “group”) and another for the perceptual odor category (the olfactory “feature”). Similarly as with the receptor-code groups, in this analysis we will consider the secondary odor categories which provide a more parsimonious representation of the perceptual odor space.

We sampled the joint distribution of receptor-code groups and olfactory features in the 89 test odorants, where each odorant is weighted by its degree so that its impact on the receptor space is taken into account (see Methods). We can say that an olfactory feature is *enriched* in a given group if the pair of two labels is observed more frequently in the joint distribution than expected from the marginal distributions of the individual labels. In particular, we quantified their joint enrichment in terms of the weighted log likelihood ratio (Fig. 6a), which sum up to give the mutual information between the two labels (see Methods). We find strong enrichments of plant-like odor in Group 1, animal-like odor in Groups 2 and 6, and culinary (food-related) odor in Groups 3, 4 and 5, supporting the hypothesis stated above. We also find that citrus is enriched in the union of all smaller groups, with the size-ranked group indices 7 and above. We performed the G-test and the chi-squared test to evaluate the statistical significance of the claim, and obtained *p* < 0.001 from both methods (*G* = 57.5 and χ^{2} = 62.2 for 24 degrees of freedom; see Methods for details).

### Groups in the odor space explain the “olfactory white”

Lastly, we show that our results can be connected to the reported phenomenon of “olfactory white” [22], in which human subjects cannot distinguish odor mixtures if more than 30–40 odorants are randomly mixed while spanning the odor space. However, the idea of “spanning the odor space” remains at the level of a qualitative analogy to lower-dimensional sensory modalities such as color vision or tonal audition. In this study, we showed that the olfactory space can be understood in terms of a small number of (six) groups, both at the level of receptor codes and of perceptual features. Now we explore the idea of olfactory white more quantitatively, by asking when the odor space, partitioned in terms of the six major groups identified from the receptor codes, is fully covered by a mixture of odorants.

We sample a random set of *m* odorants from our test dataset, and count the number of major groups (out of six) with at least one odorant in the sampled set. Repeating the sampling while varying the mixture size *m*, we examine the average number of groups covered by the mixture (Fig. 6b). It turns out that a random sampling of *m* ~ 33 odorants is just enough to sample each of the 6 major groups at least once, in the sense that the average group coverage is > 5.5. On the other hand, the average receptor coverage (the number of receptors that recognize at least one odorant in the mixture) by a mixture of *m* ~ 34 random odorants is equivalent to the coverage by the 6 hub odorants (Fig. S3b). Therefore, we can say that a random mixture of 30–40 odorants is equivalent to a complete spanning of the odor space (up to the six major groups), in terms of the group coverage (number of bases spanned) as well as in terms of the receptor coverage. Our characterization of the human olfactory space, in terms of the receptor-code groups, offers a quantitative account for the number of random odorants needed to construct perceptually convergent odor mixtures (“olfactory whites”).

## DISCUSSION

We discovered a natural partitioning of the receptor code, by identifying groups of olfactory receptors (ORs) with correlated response patterns. The same procedure also identifies groups of odorants with similar receptor codes. Whereas the grouping was performed without any external label other than the receptor codes, we find that odorants in the same group tend to carry similar olfactory features, which indicates the existence of “labeled groups” in the receptor code space. Below we will elaborate this idea and discuss its implications from a broader perspective of human odor perception.

### Receptor groups as the bases of perceptual odor space

The presence of perceptually labeled receptor groups implies a lowdimensional nature of olfactory processing. Because the odor space is represented by the response pattern of *N* ≈ 400 functional ORs, there are in principle 2^{400} possible OR states, even in the binary regime. But when a group of receptors respond in a correlated way, not all 2^{400} states are equally likely. Therefore, a significant amount of dimensionality reduction is made in the receptor code space. This aligns with the previous ideas involving the effective sparsity of odor space that, despite the apparent high-dimensionality implicated by physiochemical properties of odorants, olfaction is working in a much lower dimension in effect [23–25]; or that the odor space is intrinsically clustered rather than uniform [21, 22].

The effective sparsity of the receptor code space, which may once again propagate to the perceptual odor space, has an analogy to vision. When one says that the natural visual scene is sparse [26], it means that the scene consists of a small number of features, such as lines or edges [27]. Here we take a viewpoint that the coarsegrained *olfactory features* (labeled by a small number of odor categories) are the “lines and edges” of the perceptual odor space, and that receptor groups represent the receptive fields for these features, with correlated response patterns. Whereas feature-extracting coding was thought to be the realm of higher-level neurons [28, 29], here we found a concrete evidence that it already starts at the level of receptors. It remains an open question to study how the encoded information is further organized through the downstream interaction between the OR neurons [30, 31].

### Hard-coded labels in the receptor space

Our findings of labeled groups support the idea that primary features of olfactory perception may already be “hard-coded” in the receptor code space [32, 33], enabling innate as well as learned behavioral responses. Indeed, if there exist some basic olfactory features relevant to the core functions of the organism, there is a clear advantage in keeping hard-coded labels for such features: fewer layers would enable a faster information processing [34] that does not require higher-level neural computations. It is also likely that such important features are the most strongly conserved over the course of evolution of the olfactory genome [35]. In fact, most examples of labeled-line receptors in other organisms are coarse-grained discriminators for certain groups of odorants [36, 37], or simply for good versus bad [33, 38], which is arguably the most important axis of odor perception for humans as well [24, 39, 40]. This extends the idea of *labeled lines*, in the sense that a group of receptors maintains a set of dedicated channels for a distinct perceptual feature.

However, we also note that the labeled-lines picture is not a perfect explanation of the data. The grouping of receptor codes is itself noisy (there are false positives and false negatives in the co-activation matrix of receptors), and the perceptual labeling of groups is noisy as well (e.g., there are odorants that belong to Group 2 yet carry plant-like odors). This implies that the receptor code also has signatures of *combinatorial coding*, a flexible point of view that assumes that the collective activities of receptors (rather than the activation of any specific receptor) encodes the odor information. For example, odorants that belong to the same receptor-code groups may carry different perceptual meanings (e.g., whereas all receptors that respond to the odorant 1-octanethiol also respond to cinnamaldehyde, 1-octanethiol smells like sulfur while cinnamaldehyde carries a pleasant, cinnamon-like scent). Our results therefore represent a hybrid between the two coding strategies. This adds to the growing evidence that olfactory coding in most organisms, including humans, are clearly more complicated than the dichotomy between labeled lines and combinatorial coding [41–43], although there are some cases in which one strategy seems to dominate [4, 38].

### More on the importance of hub odorants

At the core of our analysis are the hub odorants that elicit responses in the large population of receptors. The hub odorants create an important structure in the receptor space. According to our statistical analysis, the receptor codes for the largest hubs are almost non-redundant, and they “span” the receptor code space. If we take seriously the proposal that structures in the receptor code space represent the features in the perceptual odor space, we may further hypothesize that the hub odorants correspond to the important stimuli in the natural olfactory environment.

On the other hand, the six hub odorants in the dataset are described as the “typical” or “characteristic” odor of certain category, which are also commonly encountered in a human’s life. For example, cis-3-hexen-1-ol is recognized as the characteristic grassy smell; geranyl acetate is found in many natural essential oils and carries a floral aroma; butyl acetate and androstenone are found in the human and animal body; eugenol (clove), cinnamon, and ethyl vanillin (vanilla) each represents a characteristic and widely used ingredient for food flavoring.

Taking the observations together, we hypothesize that the receptor code is designed to assign a larger number of receptors to a more *salient* stimulus, whose presence is more strongly correlated with a particular olfactory feature. The existence of salient odorants in the natural odor space is in fact plausible from the biological viewpoint. For example, because the types of odorants produced by the flower is an outcome of evolutionary selection, most flowers have a few common odorants, with their physiochemical properties attracting the pollinating insects [44]. When more salient stimulus is encoded by the response of more ORs, it has the advantage that important olfactory signals (carried by the salient odorants) are detected reliably: i.e., responses are not lost under perturbations in the receptor space, such as the functional loss of certain ORs due to genetic mutation [45, 46]. If the hypothesis was correct, one would be able to predict the saliency of odorants in the natural odor space based on the number of receptors that respond. This opens up a series of problems related to the natural odor statistics. In particular, an interesting problem would be to explore whether the statistical salience has any relationship with the statistics of co-occurrence between odors [47].

### Limitations and open problems

Here we briefly discuss the limitations of our current approach, and the related problems for future research.

First of all, our analysis considers a binary approximation of the receptor code space, marking an odorant-receptor pair as “interacting” if there is a response at all. The ultimate understanding of the receptor codes should also incorporate the higher-order effects, such as the dependence on odor concentration [17, 48]; the effect of inverse agonists [49]; or the interaction between different odorants [50]. Only then will it be possible to fully address how the information in the receptor code is collectively processed, by the inter-glomerular interactions [30, 31]; with the temporal effects [51] involving sniffs [52]; or under different abundances of the receptors [53]. As for the concentration dependence, we show that the current framework can be extended by constructing a family of binary interaction networks at varying odor concentrations (Fig. S2). The effect of odorant concentration on receptor code is in principle straightforward to analyze under our odorant-receptor based framework even for a mixture of odorants at heterogeneous concentrations; however, meaningful investigation of such effect on human olfactory sensing would require a more systematic normalization with respect to the human detection thresholds that are different for the individual odorants [40, 54]. Our work lays down a general framework on which more specific questions like this can be addressed.

Second, in light of the three levels of olfactory processing that we laid out at the beginning (Fig. 1), it is natural to expect that the receptor code space also reflects the chemical properties of the odorants that comprise the molecular space. Although we did not extend our analysis to the molecular space in this work, previous works suggest a correlation between the molecular similarity between pairs of odorants and the similarity of their olfactory responses at the level of OR neurons [12, 17, 48, 55]. Now that we have a low-dimensional characterization of the receptor code space in terms of the groups, an interesting problem would be to investigate the correlation between the receptor-code groups and the chemical properties of the odorants.

Third, whereas our analysis achieved an unprecedented link between the receptor code space and the perceptual odor space, further progress can be made a better characterization of the perceptual space. In particular, we relied on a specific mapping of the test descriptors to the set of reference words that describe the eight perceptual odor categories [21]. This process can be replaced by a more systematic quantification of similarity between the descriptor words, which would require the understanding of the underlying semantic space for olfactory descriptors.

### Concluding remarks

A set of concrete messages can be drawn from our study: (i) The existence of odorant groups, or the correlated patterns between the receptor codes, greatly reduces the dimensionality of the receptor code space and allows us to identify coarse-grained olfactory features in the perceptual odor space. (ii) The odorant and receptor groups revealed from our study clarify the balance between the labeled lines and combinatorial coding strategies. The labeled receptor groups on top of non-uniformly distributed receptor codes imply a hybrid strategy, which can be used to leverage both the fidelity of the labeled-line design and discriminative capacity of the combinatorial coding. (iii) Each large group of odorants is described by the coarse-grained olfactory feature (e.g. plant-related, animal-related odor).

In summary, by analyzing the dataset of OR responses, we identified a modular structure inherent in the receptor code space of human olfaction. Insights from these findings could be extended to the broader study of human odor perception and help understand exceptionally diverse data on olfaction in a more systematic manner.

## METHODS

### Modeling

#### Network representation of the receptor code space

We analyze the dose-response curves of 535 interacting odorant-receptor pairs, involving 89 odorants and 303 olfactory receptors (ORs), from the dataset of [14]. The interacting odorant-OR pairs were screened from a library of 511 human OR genes [13, 14]; the final number of 303 ORs include only those that respond to at least one odorant in the dataset. The panel of odorants was selected to span the 20-dimensional physiochemical space that explains the variance in mammalian OR responses [12]. The receptors in the dataset, which cover a near-full (~ 3/4) space of known human ORs, can be viewed as the receptor-space representation of the odorant. On the other hand, the set of cognate odorants found for each receptor, which is only a small sampling of the olfactory environment (humans can detect at least ≳ 10^{4} odorants), is not guaranteed to be complete. We note that the dataset also includes several natural odors that are not monomolecular odorants (for example “banana”), which do not exactly fit in our formulation of pairwise odorant-OR interactions. Nevertheless, these are only a minor fraction of the dataset, and we included all reported odors in our analysis.

A graph can be formally written as *G* = (*V, E*), where *V* is the set of all nodes (“vertices”) in the graph, and *E* is the set of all edges. In the binary regime, each edge in *E* is an unordered pair of nodes in *V*; this gives a binary graph, which can also be represented in terms of an *adjacency matrix A*, whose (*i, j*)-th element *A _{ij}* represents whether there is an edge between the two nodes

*i*and

*j*in the graph. Specifically, the odorant-receptor network is a

*bipartite graph*, where each node belongs to either of two node types (either odorants or receptors) and every edge connects one odorant node to one receptor node. If

*V*and

_{O}*V*are the sets of all odorants nodes and the set of all receptors nodes, respectively, each edge in

_{R}*E*connects exactly one node in

*V*and the other node in

_{O}*V*. This bipartite structure is cast into the

_{R}*M*×

*N*sub-matrix of the full adjacency matrix,

*A*

_{λi}, where

*N*= |

*V*| is the number of receptor nodes in the network, and

_{R}*M*= |

*V*| the number of odorant nodes. Each element of the matrix is defined as

_{O}*A*

_{λi}= 1 if and only if there is an edge between odorant

*λ*and receptor

*i*. We call

*A*

_{λi}the

*interaction matrix*.

#### Binary vector representation of a receptor code

The receptor code space representation of an odorant, written in a binary vector **y**, is equivalent to the corresponding row of the interaction matrix. That is, [**y**_{λ}]_{i} = *A*_{λi}, where [*y*]_{i} is the *i*-th element of vector **y**.

The union of two binary vectors is defined element-wise as [**y**_{1} ∪ **y**_{2}]_{i} = [**y**_{1}]_{i} ∨ [**y**_{2}]_{i}, where ∨ is the logical “or” operation between two binary variables. Similarly, the intersection is defined as [**y**_{1} ∩ **y**_{2}]_{i} = [**y**_{1}]_{i} ∧ [**y**_{2}]_{i}, where Λ is the logical “and” operation. We also define the difference of two binary vectors as [**y**_{1} \ **y**_{2}]_{i} = [**y**_{1}]_{i} ∧ ¬[**y**_{2}]_{i}, where ¬ is the logical “not” operation. For the union and the intersection, the definitions easily extend to more than two binary vectors.

#### Degree of an odorant

The *degree* of a node is the number of edges attached to it, or (when there is no loop) the number of other nodes that are connected to it. It is useful to define the *size* of a binary vector as the number of its non-zero elements: .

The degree of an odorant node is the number of receptor nodes connected to it:

In the current study, we primarily consider the odorant degree, or how broadly (or narrowly) an odorant is “targeting” the receptors. This breadth of interaction for the odorant is well-defined, because there is a finite number of olfactory receptors. The dataset in [14] covers a majority of the known (functional) human olfactory receptor repertoire: 303 out of ~ 400 ORs. Our approach has advantage over other studies that considered whether a given receptor is “narrowly tuned” or “broadly tuned” across different odors, which is hard to quantify without a good metric of the odor space.

The degree distribution of odorants, *P _{O}*(

*k*), is measured by counting the number of odorants at each degree

*k*.

*P*(

_{O}*k*) appears approximately straight in the log-log scale, and is fitted to the curve,

*P*(

*k*) ~

*k*

^{−α}. However, the fit does not necessarily indicate an underlying distribution that is power-law in a strict sense; we are simply using the power-law dependence to describe the heavy-tailed shape of the degree distribution.

#### Co-activation graph

The *co-activation graph* is a projection of the original bipartite graph to one of the two node types. For example, the co-activation graph of receptors, *G _{R}* = (

*V*), inherits all receptor nodes

_{R}, E_{R;O}*V*from the original bipartite graph. Two receptor nodes in the co-activation graph

_{R}*G*are connected by an edge if there is a common odorant node in the original graph

_{R}*G*that is connected simultaneously to both receptors (see Fig. S4a,f). To define the edges,

*E*, it is sufficient to define the corresponding adjacency matrix

_{R;O}*A*, where (

_{R}*A*)

_{R}_{ij}= 1 indicates an edge between two receptors

*i, j*in the co-activation graph. We call

*A*the

_{R}*co-activation matrix*of receptors. Each element of the co-activation matrix is given as (

*A*)

_{R}_{ij}= ∨

_{λ∈VO}(

*A*

_{λi}Λ

*A*

_{λj}), which indicates whether there is at least one odorant λ for which

*A*

_{λi}and

*A*

_{λj}are both 1, where

*A*

_{λi}is the interaction matrix for the original bipartite graph.

### Receptor code redundancy

We define the receptor code redundancy as χ = (*n*_{full} − *n*_{eff})/*n*_{full}, where *n*_{full} is the full impact of a signal in the absence of redundancy, and *n*_{eff} is the net effect of the signal under redundancy.

#### Receptor code redundancy of a target with respect to the background

For a discrimination task of a target odorant λ in the presence of a background odorant *μ*, the effect of redundancy reduces the net effect of adding λ from *n*_{full} = |**y**_{λ}| to *n*_{eff} = |**y**_{λ} \ **y**_{μ}|. In this case, the receptor code redundancy is χ_{λ}|_{μ} = (|**y**_{λ}| − |**y**_{λ} \ **y**_{μ}|)/ |**y**_{λ}| = |**y**_{λ} ∩ **y**_{μ}| / |**y**_{λ}|.

#### Receptor coverage by a mixture of odorants

We can also consider an odor signal that consists of a mixture of odorants. We define its *receptor coverage* as the number of receptors that respond to at least one odorant in the mixture, *n*_{union} = |**y**_{1} ∪ ⋯ ∪ **y**_{m}|. We can also count the sum of individual odorant degrees, *n*_{sum} = |**y**_{1}| + ⋯ + |**y**_{m}|. Note that the net receptor coverage *n*_{union} corresponds to *n*_{eff}, and the sum of individual coverages *n*_{sum} to *n*_{full}, in the above construction (also see Fig. S3a).

#### Receptor code redundancy in a mixture of odorants

The two numbers, *n*_{union} and *n*_{sum}, are different when there are overlaps between the receptor codes for the odorants in the mixture. Therefore, we characterize the net amount of receptor code overlap as *n*_{sum} − *n*_{union}, and define the receptor code redundancy in a mixture as χ_{mix} = (*n*_{sum} *n*_{union})/*n*_{sum}.

#### Odorant mixture test

We selected mixtures of *m* odorants from the interaction network, either randomly (random mixture), or selecting the highest-degree odorants first (mixture of hubs). For each mixture, we counted the receptor coverage *n*_{union} and the sum of individual coverages *n*_{sum}. For the random mixtures, we drew 500 independent samplings at each *m*; because we tried 90 different values of 0 ≤ *m* ≤ 89, we have 45,000 random mixture samples.

The receptor coverage by the six hub odorants is *n*_{union} = 155, about half of the receptor space (Fig. S3b). However, its receptor code redundancy is only χ_{mix} = 0.08. This is much smaller than what is expected from a random-odorant mixture at the same receptor coverage (χ_{mix} ≈ 0.23 using the average counts); none of 105 random mixtures at the fixed receptor coverage *n*_{union} = 155 had a smaller redundancy (*p* < 0.01). Considering all mixtures with *n*_{union} ≤ 155, we got *p* < 0.02 (317 out of 17340). If we consider all random mixtures and count the samples that are below the red curve in Fig. S3c, we get *p* = 0.05 (2420 out of 45,000), where we note that most of the extreme samples come from the top end of the curves, where the newly added odorants are no longer “hubs”. Besides, the average redundancy curves show that the level of redundancy with the six hubs (χ_{mix} ≈ 0.1) is reached already at a receptor coverage of *n*_{union} ≈ 70 in an average random mixture, which corresponds to *m* ≈ 13 odorants (Fig. S3c).

We also note that the redundancy curve for the mixture of hubs (Fig. S3c, red) deviates most greatly from the “chance curve” obtained from the random mixtures (gray) when the six hub odorants were selected. This adds support to our focus on the six hub odorants.

### Grouping in the receptor code space

We will provide details of the grouping process in three steps: node ranking, primary grouping and secondary grouping. To start, we rank the odorants and receptors in the dataset.

#### Ranking odorants and receptors by the degrees

Odorants are ranked by their degree of interaction, *k* (Eq. 1). When there are multiple odorants with the same degree, they are ranked according to the order they are indexed in the original dataset [14]. We call this ordering *h*_{0}, such that each odorant λ (where λ may be an arbitrary index) is assigned a unique rank *h*_{0}(λ) ∈ {1, ⋯, *M*}.

Receptors are ranked by their degree in the co-activation graph. The degree in the co-activation graph represents the number of other receptors that share at least one odorant partner with the given receptor. When there are multiple receptors with the same degree of co-activation, they are ranked according to the order they are indexed in the original dataset [14]. Similarly as for the odorants, we call this ordering *g*_{0}, such that each receptor *i* is assigned a unique rank *g*_{0}(*i*) ∈ {1, ⋯, *N*}.

#### Primary grouping

We first partition the receptor space into a set of non-overlapping groups, and subsequently use these to classify the odorants based on the receptor code overlap.

In the primary receptor grouping (*g*_{1}), receptors are grouped such that each receptor “signs up” to the largest receptor clique it belongs to. Specifically, we assign to each receptor the rank *h*_{0} of the highest-degree odorant it interacts with; in case of ties, we choose odorants of higher ranks (smaller *h*_{0}). We re-rank the receptor group index while preserving the order of the associated rank *h*_{0}, by assigning the primary group index *I* = 1 to the largest receptor group, and the index *I* = 2 to the next largest group, and so on (*I* =1, 2, ⋯ *I*_{max}). This results in a grouping *g*_{1}: *i* ↦ *I* that maps each receptor *i* to a group index *g*_{1}(*i*) = *I*.

In the primary odorant grouping (*h*_{1}), each odorant is assigned to the above-determined receptor group (*g*_{1}) to which its receptor code overlaps the most. The primary odorant grouping *h*_{1}: λ ↦ *I* (or *h*_{1}(λ) = *I*) is defined by choosing the receptor group of index *I*, with which the odorant’s receptor code (**y**_{λ}) displays the maximum overlap, namely *I* = argmax_{J} χ_{λ|J}. Here, the receptor code overlap of an odorant λ to a receptor group index *I* is quantified as χ_{λ|I} = |**y**_{λ} ∩ **y**_{I}| / |**y**_{λ}|, where the binary vector **y**_{I} represents the receptor code for the group *I* by having each element (**y**_{I})_{i} = 1 if *g*_{1}(*i*) = *I*, and (**y**_{I})_{i} = 0 otherwise. Note that the resulting odorant group indices span through *I* =1,2, ⋯, *I*_{max}, shared with the receptor group indices.

#### Secondary grouping

We merge the primary groups based on the amount of co-activation in the receptor code space.

In the secondary receptor grouping (*g*_{2}), pairs of primary receptor groups are merged if there is a strong co-activation. For two primary receptor groups *I*, *J*, we calculate the amount of coactivation between the two groups, simply in terms of the density of off-diagonal-block contribution in the co-activation matrix *A _{R}*:
where

*i*∈

*I*is a shorthand to indicate that the sum runs over all receptors

*i*such that

*g*

_{1}(

*i*) =

*I*. We merge each pair of primary groups

*I*and

*J*whenever their co-activation is stronger than a threshold, i.e., if . Here we used an optimal threshold value of , where the grouping was most informative (Fig. S5; also see Statistical validation). Note that this is a greedy and transitive grouping. For example, if there are three primary groups

*I*

_{1},

*I*

_{2},

*I*

_{3}such that but , then all three are merged together in the secondary grouping; the two groups

*I*

_{1},

*I*

_{2}are first combined, then

*I*

_{3}is also pulled into the group that

*I*

_{2}belongs to. After performing all pairwise merges, we once again re-rank the resulting group indices while preserving the order. This defines a grouping

*G*

_{12}:

*I*↦

*I*′ that maps each primary group

*I*to a (re-numbered) secondary group

*I*′. Each receptor

*i*is then assigned a secondary group index

*g*

_{2}(

*i*) =

*G*

_{12}(

*g*

_{1}(

*i*)).

In the secondary odorant grouping (*h*_{2}), pairs of primary odorant groups are merged if there is a strong co-activation. Because the primary grouping was a simultaneous partitioning of odorants and receptors, the above merging operation of receptor groups from *g*_{1}(*i*) to *g*_{2}(*i*) automatically propagates to the odorant groups. Specifically, each odorant λ is assigned a secondary group index *h*_{2}(λ) = *G*_{12}(*h*_{1}(λ)).

### Labeling with the perceptual odor categories

#### Descriptor space

The perceptual odor quality of each odorant is characterized by a list of odor descriptors (Table S2). We collected the odor descriptors for the 89 test odorants in the dataset [14], from multiple sources including public databases, academic works and industrial reports (see Table S5 for data sources). In particular, we obtained a majority of odor descriptors from the Good Scents Company Information System [56], using custom scripts for web-scraping.

Let *W* be the set of all descriptors that appear at least once in our collection. We have *N _{W}* = |

*W*| = 194 unique descriptor words for the

*M*= 89 test odorants in the dataset [14]. We can think of an

*M*×

*N*binary matrix, , such that if odorant λ is described by the word

_{W}*w*, and 0 otherwise.

#### Perceptual odor categories

We adopt the 8 odor categories identified in a previous work [21]. Each category *s* ∈ {1,2, ⋯, 8} is defined by a list of reference descriptors, , provided in Figure 7 in the original paper (also shown in Table S3). For convenience, we also nickname each category with a single representative descriptor: (1) citrus, (2) chemical, (3) minty, (4) floral, (5) fruity, (6) green, (7) spicy, and (8) animal. The representative nickname for each category was chosen according to our own semantic senses; its only role is to provide an intuitive name by which we can easily refer to the category, and it does not play any further role in the analysis.

#### Labeling odor descriptors

We first label each descriptor *w* ∈ *W* with a unique perceptual category *s* ∈ {1, 2, ⋯, 8, 9}, where the additional 9th category stands for “unsorted”. This defines a map . Here we only consider a deterministic categorization of descriptors, such that . The mapping is straightforward when the descriptor is directly contained in the reference vocabulary; because are non-overlapping by construction, we can find a unique category *s* for the descriptor in this case. However, some descriptors do not exactly match with any of the reference words, in which case the mapping requires a sense of similarity between the words in the underlying semantic space.

Here we attempt to make a projection of each descriptor onto the space of reference words. Specifically, for each test descriptor *w*, we choose a proxy from the reference vocabulary that can approximate the descriptor *w*, and identify the category s for which . The extra category “unsorted” is assigned to abstract or ambiguous descriptors [57]. Null descriptors like “odorless” or “weak” are also labeled as unsorted. The proxies were determined manually by the authors using the human sense of the semantic space, and the entire mapping is made available as a supplementary table (Table S4) so that the reader can reproduce and/or judge the procedure.

#### Labeling odorants

Next, we label each odorant λ with a perceptual category *s*. To summarize how each odorant is associated with a set of descriptors, we write a conditional probability

This allows us to write *p*(*s*|λ) = Σ_{w∈W} *p*(*s*′|*w*)*p*(*w*|λ). For simplicity, we consider a deterministic map that involves a “vote” by all descriptors of each odorant:

In case of ties, we put more weight to the descriptor that was marked as the representative “odor type” in the GoodScents database [56]. Now we have a map that assigns a perceptual odor category *s* to each odorant λ.

#### Joint distribution of receptor-code groups and perceptual odor categories

Treating the receptor-code group index *s _{R}* and the perceptual odor category index

*s*as random variables, we can formally write down their joint probability distribution as:

_{W}This is to say that the two group/category indices are *sampled* by the test odorants, where the two grouping maps *p*(*s _{R}*|λ) and

*p*(

*s*|λ) are conditionally independent for a given odorant λ.

_{W}Because we already defined the two deterministic maps, we can write down the two conditional distributions as *p*(*s _{R}*|λ) =

*δ*

_{sR,h(λ)}and . The third term,

*p*(λ), represents the weight to each odorant. If we were to count all odorants equally, we would have used a uniform

_{O}*p*(λ) = 1/

_{O}*M*. However, we would like a sampling that is aware of the odorant’s “weight’ in the receptor space, such that an odorant that affects more receptors (a higher-degree odorant) counts more. To this end, we write where is the degree of the odorant λ in the interaction network. Note that this approach assumes a uniform

*p*(

_{R}*i*) ∝ 1 for the receptors, rather than the odorants. In this case,

*p*(λ) can be interpreted as the probability that a randomly chosen receptor recognizes the odorant λ. Summarizing, the joint distribution is sampled as

_{O}We can also consider the two marginal distributions, *p*(*s _{R}*) = Σ

_{sW}

*p*(

*s*) and

_{R}, s_{W}*p*(

*s*) = Σ

_{W}*(*

_{sR}p*s*), and consequently the two conditional distributions,

_{R}, s_{W}*p*(

*s*) ≡

_{R}|s_{W}*p*(

*s*)/

_{R}, s_{W}*p*(

*s*) and

_{W}*p*(

*s*) ≡

_{W}|s_{R}*p*(

*s*)/

_{R}, s_{W}*p*(

*s*).

_{R}### Statistical validation

#### Optimal threshold for secondary receptor grouping

In the secondary receptor grouping, our goal was to obtain the best partitioning of the receptors so that the resulting structure captures the co-activation pattern. Because we start from the primary receptor groups as the “units”, and work by merging them based on the overlaps, the grouping depends on the threshold . In the limit where the threshold is too small, most units are merged together and a giant cluster is formed (Fig. S5a). On the other hand, when the threshold is too large, most units stay un-merged, failing to capture the global structure (Fig. S5c). The optimal threshold is where the grouping is most *informative* (see below), in the sense that more strongly co-activating primary groups are merged together in the same secondary group, and non-co-activating groups stay apart (Fig. S5b).

Here we consider our grouping as a binary classifier, where each receptor pair has one of the two real states (either co-activated by an odorant or not), and one of the two predicted states (either grouped together or not). We count the number of receptor pairs that belong to each of the four outcomes: co-activated and grouped (true positive; TP), co-activated and ungrouped (false negative; FN), non-co-activated and grouped (false positive; FP), and non-co-activated and ungrouped (true negative; TN). We also define the real positive (RP = TP + FN) and the real negative (RN = TN + FP) in the data, and similarly the predictive positive (PP = TP + FP) and the predicted negative (PN = TN + FN) according to the grouping. There are multiple ways to quantify the informativeness of the classifier, but the basic idea is to understand the tradeoff between the type I (FP) and type II (FN) errors. Here we take two common approaches, considering the sensitivity-specificity and the precision-recall tradeoff respectively. The *sensitivity* is defined as the true positive rate TPR = TP/RP, which is also called the *recall*. The *specificity* is defined as the true negative rate TNR = TP/RN. It is also common to write the specificity in terms of the *fall-out*, or the false positive rate FPR = TP/RN with a simple relationship FPR =1 - TNR. The *precision* is defined as the positive predicted value PPV = TP/PP. Also see Fig. S5d.

To find the optimal grouping in terms of the sensitivity and the specificity, we minimize the sum of squared error, SSE = FNR^{2} + FPR^{2} = ((1 - sensitivity)^{2} + (1 − specificity)^{2}). This is also graphically equivalent to finding the point that is closest to the upper left corner of the receiver operating characteristic (ROC) plane, which plots the sensitivity versus the 1-specificity at varying threshold (Fig. S5e-f).

Alternatively, we can use the *F _{β}* score, another popular quantity defined as the weighted harmonic average of precision and recall [58]:

A larger *β* puts more weight to the recall (pays more attention to the false negatives), whereas a smaller *β* puts more weight to precision (more attention to the false positives). Because the actual dataset is strongly imbalanced such that RN > RP, we choose a value of *β* that penalizes false negatives more heavily [59]. Specifically, we use , so that the weighted combination becomes . We can also plot the precision-recall curve as the threshold is varied (Fig. S5h-g). In our case, the two optimization conditions (minimum SSE and maximum *F _{β}*) are satisfied at the same threshold , which defines the best secondary receptor grouping (Fig. S5f,h).

#### Shuffling test for receptor-code groups

To test the statistical significance of the observed receptor groups, we compare the result to three null models constructed by shuffling the odorant-receptor interactions. We consider three null models, with shuffling rules that preserve different marginal statistics.

First, we make a full shuffling of the activation matrix, *A*_{λi}, such that the number of non-zero elements (the number of interactions in the network) is preserved (Fig. S6a). This is also equivalent to fixing the average degree of the odorants.

Second, we construct a more constrained null model where the degree distribution is preserved. Specifically, we shuffle each odorant-row of the activation matrix *A*_{λi} independently, such that each odorant re-connects to the receptors randomly, while the degree is kept fixed (Fig. S6b).

Third, to test the effect of the secondary grouping separately from the primary grouping, we consider a null model that preserves the primary grouping structure (Fig. S6c). For each odorant λ in a primary group *g* = *h*_{1}(λ), We shuffle the the set of receptors that belong to the same primary group, {*i* ∈ {1, ⋯, *N*} | *g*_{1}(*i*) = *g*}, and separately shuffle the rest of receptors that belong to all other primary groups, {*i* | *g*_{1} (*i*) ≠ *g*} (also see Fig. S6d). But as a result of this shuffling, the goodness of the original primary grouping becomes exaggerated, because shuffling destroys all residual correlation that was not captured by the primary groups. So we add extra randomness to restore the precision of primary grouping as it appears in the original data. We do this by making an appropriate number of *degree-preserving swaps*, as suggested by [60]. The degree-preserving swaps are made in the following way. We randomly choose 2 columns and 2 rows from the activation matrix, which defines a 2 × 2 binary sub-matrix. If the sub-matrix is either diagonal or anti-diagonal, we replace the sub-matrix with the other form, and count this as one swap. It is clear that such a swap preserves both the column-sum and the row-sum of the matrix. If the chosen sub-matrix has neither of the two forms, no action is made, and we proceed to sample a new sub-matrix. The two goodness measures are simultaneously restored to the levels that correspond to the original primary grouping, with 40 extra swaps (Fig. S6d).

Under each null model, we draw multiple shuffles to sample the distributions of the test statistics, in this case the two goodness measures SSE and *F _{β}*. For each shuffle of the interaction data, we constructed the corresponding receptor co-activation matrix, and applied the grouping algorithm at varying overlap threshold . We chose the best grouping results as the test statistics for the sample, either at minimum SSE (Fig. S6e-f) or at maximum

*F*(Fig. S6g-h). The two measures were optimized separately, and may correspond to different optimal thresholds for the same shuffled sample. This procedure allows us to estimate the (one-tailed) p-value, which indicates the probability to have a grouping result from a shuffled dataset that is as good as or better than what we observe from real data. Because there was only up to 1 out of 500 shuffled samples that clusters better than the original dataset, we can say that

_{β}*p*≤ 0.002 (Fig. S6i-j).

#### Joint enrichment between receptor-code groups and perceptual odor categories

To quantify the correlation between the receptor-code groups *s _{R}* and the perceptual odor categories

*s*, we consider the likelihood ratio

_{W}*p*(

*s*)/

_{R}, s_{W}*p*(

*s*)

_{R}*p*(

*s*). If this ratio is larger than 1 for a given (

_{W}*s*) pair, it indicates an over-representation (enrichment) of the pair in the dataset, compared to the expectation from the null hypothesis that the two labels are completely uncorrelated (Fig. S7a). Note that the likelihood ratio can also be interpreted as the enrichment of one label given another, if re-written as

_{R}, s_{W}*p*(

*s*|

_{R}*s*)/

_{W}*p*(

*s*) or

_{R}*p*(

*s*)/

_{W}|s_{R}*p*(

*s*).

_{W}The weighted sum of the log likelihood ratios is the mutual information:
it is also equivalent to the Kullback-Leibler (KL) divergence, *D _{KL}*[

*p*(

*s*)|||

_{R}, s_{W}*p*(

*s*)

_{R}*p*(

*s*)]. This sum quantifies the amount of information that is lost by incorrectly describing the observed joint distribution

_{W}*p*(

*s*) with the null hypothesis

_{R}, s_{W}*p*(

*s*)

_{R}*p*(

*s*).

_{W}It is also useful to look at each pair’s contribution to the mutual information, which has the form of a weighted log likelihood ratio *ρ*(*s _{R}, s_{W}*) =

*p*(

*s*)log[

_{R}, s_{W}*p*(

*s*)/

_{R}, s_{W}*p*(

*s*)

_{R}*p*(

*s*)]. For the purpose of understanding which pairs are more significantly enriched, we find it more informative to consider the weighted log likelihood ratio (Fig. S7b) than the log likelihood ratio alone (Fig. S7a). The weighting properly takes into account that the log likelihood adds up with independent observations. We will call

_{W}*ρ*(

*s*) the

_{R}, s_{W}*joint enrichment*between the receptor-code groups

*s*and the perceptual odor categories

_{R}*s*.

_{W}#### Merging the categories for dimensionality reduction

Because the amount of information that can be drawn from data is fixed, it is important that we keep only as many groups/categories as relevant to our conclusion. In the receptor code space, we considered the six largest groups and treated the rest as a single extra group (6 degrees of freedom). In the perceptual odor space, we looked at which of the 8 odor categories have similar enrichment patterns (Fig. S7b), and also used the common-sense semantic knowledge, to merge the plant-like (green, floral, minty, chemical) and the culinary (spicy, fruity) labels together (Fig. S7c). The grouping of “chemical” with the plant-like category is based on the correlations in the enrichment pattern, although we should note that there is only one odorant in our dataset (quinoline) that is labeled as “chemical” according to our procedure. This results in a set of reduced perceptual odor categories with 4 degrees of freedom, labeled as plant-like, animal-like, culinary, citrus and others, which gives a more parsimonious classification of the perceptual odor space while preserving the enrichment pattern. For convenience, we distinguish the two set of labels with the names “Category 1” (the primary categories adopted from [21]) and “Category 2’ (the secondary categories after dimensionality reduction). The secondary category for the perceptual quality of an odorant is determined by its primary category (Table S1).

#### Evaluating the significance of group labeling

We used the G-test to evaluate the statistical significance of our finding that the receptor-code groups can be labeled with perceptual odor categories. The G-test statistic is defined as , where *O _{sR,sW}* and

*E*are the observed and the expected number of odorants with the pair of labels (

_{sR,sW}*s*). By construction, we have

_{R}, s_{W}*O*=

_{sR,sW}*M*(

_{p}*s*) and

_{R}, s_{W}*E*=

_{sR,sW}*M*(

_{p}*s*)

_{R}*p*(

*s*), where

_{W}*M*is the number of test odorants (observations). This gives

*G*= 2

*M*·

*D*. We can calculate the p-value for this joint classification, by assuming that

_{KL}*G*approximately follows the chi-squared distribution with the appropriate degrees of freedom. In this case, the number of degrees of freedom for the chi-squared distribution is d.o.f. = (|

*S*| − 1)(|

_{R}*S*| − 1), where |

_{W}*S*| is the number of distinct receptor-code groups, and |

_{R}*S*| is the number of distinct odor categories. Using the primary odor categories, we get

_{W}*G*= 68.9, which translates to

*p*= 0.03 with 48 degrees of freedom (Fig. S7b). Using the secondary odor categories, we get

*G*= 57.5 and

*p*< 0.001 with 24 degrees of freedom (Fig. S7d).

To further validate our use of the G-test, we also compared the result with the chi-squared test, for which the test statistics is definedas . (Note that this χ^{2} should not be confused with the overlap measure χ or the overlap threshold in our analysis of the receptor code space.) We get χ^{2} = 71.4 and *p* = 0.02 using the primary odor categories (Fig. S7e), and χ^{2} = 62.2 and *p* < 10^{−4} using the secondary categoreis (Fig. S7f). Both tests give small and consistent p-values, especially with the secondary odor categories with reduced degrees of freedom. This confirms the statistical significance of our findings, in which each major group in the receptor code space is associated with a characteristic label in the perceptual odor space.

## Data availability

The list of test odorants are provided in two sets of supplementary tables, with receptor-code groups and perceptual odor categories (Table S1) and with raw odor descriptors (Table S2) respectively. The list of reference words for each perceptual category [21] is provided as Table S3. Our labeling of each perceptual odor descriptor is shown in Table S4. Data sources are listed in Table S5.

Formatted data files to reproduce the network representation of odorant-receptor interaction with full attributes for each node and edge (as .cyjs and .cys files, to open in Cytoscape) are freely available at https://github.com/jihyunbak/ORnetwork.

## Code availability

The Matlab code for the grouping analysis, along with associated documentation, is available under the MIT license at https://github.com/jihyunbak/ORnetwork.

## AUTHOR CONTRIBUTIONS

J.H.B., S.J.J. and C.H. designed the projects. J.H.B. and C.H. performed research and analyzed the data. J.H.B., S.J.J. and C.H. wrote the paper.

## COMPETING INTERESTS

The authors declare that they have no competing interests.

## ACKNOWLEDGMENTS

We thank the KIAS Center for Advanced Computation for providing computing resources. This study was partly supported by the grants from the National Research Foundation of Korea (NRF-2018R1A2B3001690) (C.H.), and the National Science Foundation (CHE-1362926) (S.J.J).

## References

- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].
- [65].