Abstract
Advances in cognitive neuroscience are often accompanied by an increased complexity in the methods we use to uncover new aspects of brain function. Recently, many studies have started to use large feature sets to predict and interpret brain activity patterns. Of crucial importance in this paradigm is the mapping model, which defines the space of possible relationships between the features and neural data. Until recently, most encoding and decoding studies have used linear mapping models. However, some researchers have argued that the space of linear mappings is overly constrained and advocated for the use of more flexible nonlinear mapping models. Here, we discuss the choice of a mapping model in the context of three overarching goals: predictive accuracy, interpretability, and biological plausibility. We show that, contrary to popular intuition, these goals do not map cleanly onto the linear/nonlinear divide. Moreover, we argue that, instead of categorically treating mapping models as linear or nonlinear, we should aim to estimate their complexity. We show that, in most cases, complexity provides a more accurate reflection of restrictions imposed by various research goals and outline several complexity metrics that can be used to effectively evaluate mapping models.
1. Introduction
In recent decades, neuroscientists have witnessed a massive increase in the amount of available data, as well as in the computational power of tools we can apply to the data. As a result, today we can leverage huge datasets to build powerful models of brain activity. In this era of new opportunities, it is important to be mindful of conceptual choices we have to make before modeling our data. The goal of this paper is to discuss one such choice: a mapping model that relates features of interest to brain responses.
When studying a brain circuit, area, or network, it is often useful to formulate and test hypotheses about features that elicit a response in the relevant neural units (a single cell, a population of neurons, a brain area, etc.). The features can be stimulus-based (Figure 1A), behavior-based (Figure 1B), or based on responses in other neural units, within the same brain or a brain of another individual (Figure 1C). The exact source of the features may vary: common sources include human annotations (e.g., “faces”), empirical measurements (e.g., behavioral responses), and outputs of a computational model (e.g., a vector of responses to each image in layer 4 of a deep neural network (DNN); Figure 1D).
Figure 1. The encoding/decoding modeling framework in cognitive neuroscience. (A) Studies investigating the effect of external stimuli on brain activity start with the stimulus, extract its features of interest, and use a mapping model to establish the mapping between these features and a neural variable extracted from the data recorded during/after stimulus presentation. (B) In other studies, researchers extract features associated with participants’ behavior and map those onto the neural variable recorded before/during this behavior. (C) Yet another class of studies describes the mapping between activity in different brain regions. (D) In recent years, more and more studies replace hand-crafted features, like those shown in (A), with large feature vectors derived from computational models, such as neural networks.
To relate a set of features to brain data, we need to establish a mapping between them. A perfect feature set would fully explain neural activity, such that there would be a 1:1 mapping between the feature set and neural activity in a given neuron/electrode/voxel1. However, cognitive neuroscience today is far from producing perfect features. One limitation is our incomplete understanding of the computational mechanisms underlying brain function. Another limitation is the data we use, which typically provide an indirect and/or noisy measure of neural activity. These limitations mean that most feature sets cannot be perfectly mapped onto the neural data; instead, some level of fitting is required.
A mapping model is a model trained to map between the features and neural data. In this paper, we discuss both encoding mapping models, i.e., models that map from the feature model to the neural variable, and decoding mapping models, i.e., models that map from the neural variable to the feature model (see Figure 1). Others have discussed the relative merits of the two approaches (Holdgraf et al., 2017; King et al., 2018; Kriegeskorte & Douglas, 2019; Naselaris et al., 2011); our arguments in this paper apply to both mapping directions, unless specified otherwise. Note also the difference between a mapping model, which is trained directly on brain data, and a computational model that aims to mimic brain function but does not necessarily use neural data (Figure 2).
Figure 2. The distinction between a model of brain function and a mapping model. A model of brain function aims to mimic the brain, but is not fit directly to neural data. Our focus is on mapping models, which directly link a feature set to a neural variable. The mapping model depicted here uses features derived from a model of brain function (like in Figure 1D).
Mapping models have many properties, but the most common distinction is drawn between (A) a linear mapping model (such as linear regression) and (B) a nonlinear mapping model (such as a neural network). The key question we consider here is how to decide which mapping model is most appropriate for a particular research goal.
2. The controversy
Recent advances in machine learning (ML) have had a large effect on cognitive neuroscience (Bzdok et al., 2017). Whereas traditional research approaches use relatively simple mappings to relate a handful of features to a single neural variable (e.g., “animate/inanimate” vs. the average BOLD response in a brain region of interest), ML-based mapping models can operate on high-dimensional feature vectors and can learn the mapping between features and data without presupposing the exact nature of that mapping. The concurrent increase in the size of available datasets (e.g., Chang et al., 2019; Majaj et al., 2015; Schoffelen et al., 2019) has enabled researchers to train large-scale mapping models without overfitting them.
A number of studies have leveraged the power of ML methods to build flexible nonlinear mapping models and use them to identify neural correlates of brain disorders (e.g., Hasanzadeh et al., 2019; Kazemi & Houghten, 2018; Kim et al., 2016; Leming et al., 2020) and behavioral traits (e.g., Kumar et al., 2019; Morioka et al., 2020; Xiao et al., 2019). Yet the vast majority of cognitive neuroscience studies use linear mapping models (such as linear regression), resulting in a gap between different neuroscience subfields.
Three primary justifications have been proposed for using linear mapping models:
1. Linear mapping models facilitate comparison of predictive accuracy across feature sets (e.g., Caucheteux & King, 2020; Jain & Huth, 2018; Schrimpf et al., 2018; Yamins et al., 2014).
2. Linear mapping models estimate weights for individual features, making the mapping more interpretable (e.g., Anderson et al., 2017; Lee Masson & Isik, 2021; Naselaris et al., 2011; Sudre et al., 2012; cf. Haufe et al., 2014; Kriegeskorte & Douglas, 2019).
3. Linear mapping models are more biologically plausible: they approximate readout by a downstream area and can therefore indicate what information is available to the rest of the brain (e.g., Kamitani & Tong, 2005; Kriegeskorte, 2011).
In the following section, we critically review arguments for linear vs. nonlinear mapping models in the context of these and other research goals, with the aim of establishing a general framework for specifying the mapping model in various research scenarios. We show that each of the three broad criteria used to evaluate mapping models — predictive accuracy, interpretability and biological plausibility — can be broken down into several different research goals, each of which places its own set of constraints on the mapping model.
3. What do we want from our mapping models?
To pick the best model, we first need to specify the goal that we are trying to achieve (Kording et al., 2020). This goal can be described as a set of particular desiderata for the model. The most common desiderata for models in cognitive neuroscience are predictive accuracy, interpretability, and biological plausibility. Here, we discuss what each of these desiderata might mean in the context of mapping models.
3.1. Predictive accuracy
A necessary condition for a successful scientific model is its ability to explain past results, as well as to predict future (or held-out) data. In neuroscience, as in other fields, scientific progress is driven by researchers generating new hypotheses, testing their predictions against experimental results, and focusing on the most accurate hypotheses. In the encoding/decoding framework, a hypothesis can be operationalized as a set of features derived from a computational model (e.g., a neural network layer; Yamins et al., 2014) or from behavioral ratings (e.g., Anderson et al., 2017).
A common way to measure the predictive accuracy of a set of features is to fit a mapping model that estimates the best link between the features and a training set of neural data. The mapping — fit only on the training set — can then be used to predict responses in a held-out test set, after which we can evaluate those predictions by, e.g., correlating them with the actual data. These correlations are often normalized by an estimate of the reliability of the data (a “ceiling”) to yield an estimate of explained variance (e.g., Cadieu et al., 2014; Kell et al., 2018; Schrimpf et al., 2018; Yamins et al., 2014).
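As a minimal sketch of this procedure on synthetic data (the ridge estimator and the split-half ceiling convention below are our own choices, and ceiling conventions vary across studies), one can fit the mapping on a training split, correlate its predictions with held-out responses, and normalize by a reliability estimate:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)

# Synthetic data: 200 stimuli x 50 features, plus two repetitions of one
# neural variable (e.g., a voxel) so that a split-half ceiling can be estimated.
X = rng.normal(size=(200, 50))
signal = X @ rng.normal(size=50)
y_rep1 = signal + rng.normal(scale=2.0, size=200)
y_rep2 = signal + rng.normal(scale=2.0, size=200)
y = (y_rep1 + y_rep2) / 2

train, test = np.arange(150), np.arange(150, 200)

# Fit the mapping model on the training set only, then predict held-out data.
mapping = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X[train], y[train])
r_pred, _ = pearsonr(mapping.predict(X[test]), y[test])

# A rough noise ceiling: Spearman-Brown-corrected split-half reliability
# of the test data (the two-repetition average).
r_split, _ = pearsonr(y_rep1[test], y_rep2[test])
ceiling = np.sqrt(2 * r_split / (1 + r_split))

print(f"raw r = {r_pred:.2f}, ceiling = {ceiling:.2f}, "
      f"ceiling-normalized r = {r_pred / ceiling:.2f}")
```

Ridge regularization and the Spearman-Brown correction here are implementation choices rather than requirements of the framework.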
This prediction-oriented framework can be used to achieve multiple goals, only some of which require imposing specific constraints on the mapping model.
3.1.1. Compare feature sets
Model accuracy (e.g., explained variance) can be used to compare competing feature sets to figure out which of them best reflects neural responses (Schrimpf et al., 2020). Such comparison-oriented studies tend to use linear mappings in order to minimize the number of additional computations performed on the features. Indeed, using a powerful non-linear mapping model could wash out important differences across feature sets. For example, if the goal is to determine whether activity in inferior temporal cortex is better predicted by an early or a late layer of a convolutional neural network, we should use a mapping model with a limited expressive power; otherwise, the mapping model will be able to transform features from an early layer into features from a late layer, eliminating meaningful differences between them. Thus, feature comparison studies often benefit from linear mapping models.
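The washing-out concern can be illustrated with a toy simulation (entirely synthetic; the “early” and “late” feature sets below are stand-ins, not real DNN activations). When the neural variable is a nonlinear function of the early features, a flexible nonlinear mapping largely closes the gap between the two feature sets, whereas a linear mapping preserves it:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Toy stand-ins: "late" features are a fixed nonlinear transform of "early" ones,
# and the neural variable is a linear function of the late features.
n = 1000
early = rng.normal(size=(n, 20))
late = np.maximum(early @ rng.normal(size=(20, 20)), 0)         # ReLU of a random projection
y = late @ rng.normal(size=20) + rng.normal(scale=0.5, size=n)  # simulated neural variable

linear = RidgeCV()
nonlinear = MLPRegressor(hidden_layer_sizes=(64,), max_iter=3000, random_state=0)

for name, X in [("early features", early), ("late features", late)]:
    r2_lin = cross_val_score(linear, X, y, cv=5, scoring="r2").mean()
    r2_mlp = cross_val_score(nonlinear, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: linear R^2 = {r2_lin:.2f}, nonlinear R^2 = {r2_mlp:.2f}")
```

In such a scenario, only the constrained (here, linear) mapping reveals which feature set is closer to the neural representation.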
3.1.2. Test feature decodability
Another research question a neuroscientist might ask is “do neural data Y contain information about features X?” A more applied version of this question is “can I predict features X based on neural data Y?” In this scenario, the goal is to find a mapping that allows us to reach above-chance decoding accuracy2. If this is the primary goal, then we should not put restrictions on the space of possible mappings — all that matters is the mapping model’s performance on held-out data. For instance, if a study aims to determine whether certain behavioral traits (X) can be predicted from neural data (Y), it does not need to limit its scope of possible mappings as long as the resulting mapping model performs successfully on unseen data. Similarly, a study that finds information about an imagined visual scene (X) in primary visual cortex (Y) may provide a valuable contribution to the field even if it uses highly unconstrained nonlinear mappings. All in all, for studies in this category, the main objective is for the mapping model to achieve the highest possible predictive accuracy (or, for applied research, to reach a certain accuracy threshold) regardless of (non)linearity.
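As an illustration (a toy sketch with simulated data; the MLP decoder and the baseline below are our own choices), one could ask whether a binary behavioral trait is decodable from neural features using an unconstrained nonlinear classifier, comparing cross-validated accuracy against an empirical chance level:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)

# Simulated data: 300 participants x 20 neural features (e.g., functional
# connectivity values); the trait depends on the features in a nonlinear way.
Y_neural = rng.normal(size=(300, 20))
trait = (Y_neural[:, 0] ** 2 > 1.0).astype(int)          # nonlinear decision rule

# An unconstrained nonlinear decoder: all that matters is held-out accuracy.
decoder = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
acc = cross_val_score(decoder, Y_neural, trait, cv=5).mean()

# Empirical chance level from a baseline that ignores the neural data.
chance = cross_val_score(DummyClassifier(strategy="most_frequent"),
                         Y_neural, trait, cv=5).mean()
print(f"decoding accuracy = {acc:.2f} vs. chance = {chance:.2f}")
```

In practice, chance performance is usually established more rigorously, e.g., with permutation tests.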
3.1.3. Build accurate models of brain data
Finally, some researchers are trying to build accurate models of the brain that can replace experimental data or, at least, reduce the need for experiments by running studies in silico (e.g., Jain et al., 2020; Kell et al., 2018; Yamins et al., 2014). The main criterion for such models is their predictive accuracy, but they need to clear a very high accuracy bar (ideally at the level of the noise ceiling). The best way to build these in silico brains might be to train large powerful mapping models on large amounts of neural data. In this scenario, there is no theoretical justification for a linear mapping constraint because the primary goal is maximizing predictive accuracy.
3.2. Interpretability
Once we find a mapping that achieves sufficiently high predictive accuracy, we often want to interpret it. Which features contribute the most to neural activity? Do neurons/electrodes/voxels respond to single features or exhibit mixed selectivity? How does the mapping relate to other models or theories of brain function?
The traditional view is that linear mappings are easier to interpret than non-linear mappings (Naselaris et al., 2011). However, the goal of building interpretable models is ultimately complicated by the fact that there is no clear-cut definition for interpretability. Below, we discuss three definitions of interpretability, ranging from strictest to loosest, and show that interpretability does not always require a linear mapping model. Importantly, in each of these cases, interpretability places restrictions not only on the mapping model, but also on the features that can be used to yield meaningful interpretations.
3.2.1. Examine individual features
Traditionally, many cognitive neuroscientists have believed that interpreting a neural signal requires identifying a set of words to describe its function (e.g., Desimone et al., 1984; Kanwisher et al., 1997). In this scenario, a useful model of brain activity has features that can be described using one or a few words (“faces”, “vertical lines”, etc.) and a linear mapping from these features to neural data. We consider this to be the strictest definition of interpretability because it places the strongest constraints on both the features (which have to be nameable) and the mapping models (which have to be linear). With a linear mapping model, the regression weights can be interpreted as a relative measure of contribution to the neural activity (though this is not always straightforward in cases where features span different values or suffer from multicollinearity).
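As a small illustration with made-up data (the feature names and the sparse estimator are our own choices): standardizing features makes linear weights comparable across features with different scales, and an L1 penalty can shrink irrelevant features to exactly zero, easing interpretation:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 400

# Simulated nameable features driving the response of one neuron.
faces = rng.normal(size=n)
vertical_lines = rng.normal(size=n)
luminance = rng.normal(size=n) * 10.0                    # deliberately on a larger scale
X = np.column_stack([faces, vertical_lines, luminance])
y = 2.0 * faces + 0.1 * luminance + rng.normal(scale=0.5, size=n)

# Standardize so that weights reflect contributions per standard deviation,
# then fit a sparse (L1-penalized) linear mapping.
X_z = StandardScaler().fit_transform(X)
mapping = LassoCV(cv=5).fit(X_z, y)

for name, w in zip(["faces", "vertical_lines", "luminance"], mapping.coef_):
    print(f"{name:15s} standardized weight = {w:+.2f}")
```

On the raw scale, the luminance weight (0.1) looks twenty times smaller than the faces weight (2.0), even though its contribution per standard deviation (about 1.0) is only half as large.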
Interpretable features have played a crucial role in understanding brain function (Kanwisher, 2010). However, nameability of features may be an overly restrictive metric of interpretability as it limits our understanding to a vocabulary that is heavily biased by a priori hypotheses and may not include words for the concepts we actually need (Buzsáki, 2019). For instance, recent work has shown that neurons typically described as “face-responsive” respond more strongly to artificial images produced by DNNs than to natural images described by the word “face” (Ponce et al., 2019), suggesting that simple verbal features cannot provide a full account of neural activity. As a result, many researchers have started to use higher-dimensional sets of features, a move that has introduced new definitions of what it means for a mapping to be interpretable.
3.2.2. Test representational geometry
A looser definition of interpretability that has become popular in the last decade is the use of high-dimensional feature vectors that are linearly mapped to a neural variable. These features may be produced by humans, such as rating properties of words (Binder et al., 2016), or by computational models, such as semantic embeddings (e.g., Mitchell et al., 2008; Pereira et al., 2018) or computer vision features (e.g., Kay et al., 2008; Yamins et al., 2014). When using large-scale feature sets, we cannot always interpret the weights of a linear mapping model in the same way as we did with nameable features. If individual features within a set cannot be labeled (e.g., in the case of DNN layer activations), examining them one by one has a limited potential to inform our intuition (Kay, 2018). However, we can examine the feature set as a whole, asking: do features X, generated by a known process, accurately describe the space of neural responses Y? Thus, the feature set becomes a new unit of interpretation, and the linearity restriction is placed primarily to preserve the overall geometry of the feature space. For instance, the finding that convolutional neural networks and the ventral visual stream produce similar representational spaces (Yamins et al., 2014) allows us to infer that both processes are subject to similar optimization constraints (Richards et al., 2019). That said, mapping models that probe the representational geometry of the neural response space do not have to be linear, as long as they correspond to a well-specified hypothesis about the relationship between features and data.
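One widely used way of probing whether a feature set captures the representational geometry of a neural response space (not the only option, and not prescribed by the studies cited above) is to compare pairwise dissimilarity structures directly, as in representational similarity analysis. A toy sketch with synthetic data:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(4)

# Synthetic data: 60 stimuli represented in a feature space (e.g., a DNN layer)
# and in a neural response space (e.g., voxels in a region of interest).
features = rng.normal(size=(60, 100))
neural = features[:, :40] @ rng.normal(size=(40, 25)) + rng.normal(size=(60, 25))

# Pairwise dissimilarities between stimuli in each space.
rdm_features = pdist(features, metric="correlation")
rdm_neural = pdist(neural, metric="correlation")

# If the two spaces share their geometry, their dissimilarity structures correlate.
rho, _ = spearmanr(rdm_features, rdm_neural)
print(f"representational similarity (Spearman rho) = {rho:.2f}")
```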
3.2.3. Describe the feature set
The loosest definition of interpretability is the ability to describe the set of features that was used to train the mapping model (e.g. “phonological features”). In this scenario, we make no assumptions about a particular representational geometry of these features (such as linear separability). The lack of specific assumptions about the form of the feature-data mapping means that constraints on the mapping model are not strictly necessary — all we need is an epistemologically satisfying description of the features. If a mapping model achieves good predictivity, we can say that a given set of features is reflected in the neural signal. In contrast, if a powerful mapping model trained on a large set of data achieves poor predictivity, it provides strong evidence that a given feature set is not represented in the neural data. Under this definition, any mapping model is interpretable as long as we can describe the set of features that it uses.
3.3. Biological plausibility
In addition to prediction accuracy and interpretability-related considerations, biological plausibility can also be a factor in deciding on the space of acceptable feature-brain mappings. We discuss two goals related to biological plausibility: simulating linear readout and accounting for physiological mechanisms affecting measurement.
3.3.1. Simulate linear readout
One of the main arguments put forth in favor of linear mappings is the claim that they approximate the linear readout performed by a putative downstream brain area (Kamitani & Tong, 2005; Kriegeskorte, 2011). Under this view, the mapping model approximates the transmission of the features to a hypothetical information consumer. The linear readout requirement often serves as a proxy for feature usability: if the features can be extracted with a linear mapping model, it means that they do not require extensive computations in order to be used downstream.
The ability to use features of interest in downstream computations is indeed an important consideration. However, there are reasons to be cautious about the linear readout requirement. First, some models operate on neural data collected from multiple recording sites rather than a single neural population/region, making subsequent linear readout biologically implausible. For instance, decoding models that use whole-brain data, such as M/EEG, have no downstream region that could ‘read out’ information from all over the brain — the only entity performing readout is the observer. Second, linear readout might not be an accurate characterization of the decoding mechanisms used by downstream areas to extract information from the brain region of interest. In fact, unlike linear models that can pool across all measured neurons or voxels in the region of interest, readout in biological neural systems is likely to be both sparse (e.g., Barak et al., 2013; Barlow, 1969; Olshausen & Field, 2004; Vinje & Gallant, 2000) and nonlinear (e.g., Ghazanfar & Nicolelis, 1997; Gidon et al., 2020; Hodgkin & Huxley, 1939; Shamir & Sompolinsky, 2004). Third, linear regression is a fairly arbitrary threshold to draw for mechanistic plausibility. Even a relatively ‘constrained’ linear classifier can read out many features from the data, many of them biologically implausible (e.g., voxel-level ‘biases’ that allow orientation decoding in V1 using fMRI; Ritchie et al., 2019). In sum, unconstrained linear mapping models (or linear mapping models constrained by weight distribution among many features, like ridge regression) may be both overly limiting because they do not account for possible nonlinear computations and overly greedy because they might leverage information in a way that real neurons do not.
Is there a better mapping model that accounts for possible nonlinear computations during readout without being overly broad? One possible approach is to introduce parsimony constraints on the feature space of the models (Kukačka et al., 2017). Introducing the sparsity constraint (i.e., allowing the mapping model to access only a limited number of neurons) could increase the biological plausibility of putative readout (Yoshida & Ohki, 2020). However, in the context of measurements that collapse across large numbers of neurons (i.e. most measurements in cognitive neuroscience), the sparsity constraint might be impossible to enforce, as a single voxel or electrode already combines signal from a large number of neurons. More broadly, evaluating the biological plausibility of decoding is difficult, as readout might differ across brain regions of interest (Anzellotti & Coutanche, 2018), and the current understanding of the details of readout mechanisms is still limited. Future progress in research on readout mechanisms will be key to evaluate the plausibility of different assumptions about readout in a more principled manner.
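As a rough sketch of what such a constraint could look like in practice (simulated data; selecting the k most informative units is just one crude way to enforce sparsity and is not a claim about biological readout), compare a decoder that may pool across all recorded units with one restricted to a handful of them:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)

# Simulated data: 200 trials x 500 units; a binary stimulus feature is carried
# by only a few of them.
n_trials, n_units = 200, 500
stim = rng.integers(0, 2, size=n_trials)
units = rng.normal(size=(n_trials, n_units))
units[:, :5] += stim[:, None] * 1.5                      # 5 informative units

# Dense readout: the decoder may pool across all recorded units.
dense = LogisticRegression(max_iter=5000)

# "Sparse" readout: the decoder may only access k = 10 units
# (selection is refit within each cross-validation fold).
sparse = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression(max_iter=5000))

for name, model in [("dense readout", dense), ("sparse readout", sparse)]:
    acc = cross_val_score(model, units, stim, cv=5).mean()
    print(f"{name}: accuracy = {acc:.2f}")
```

A hard top-k selection is only one way to operationalize sparsity; L1 penalties or anatomically informed restrictions are alternatives, and none of them guarantees a biologically realistic readout.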
3.3.2. Incorporate physiological mechanisms affecting measurement
When brain recordings are known to be nonlinear transformations of underlying neural activity (e.g. fMRI, in which BOLD responses are related to neural responses via the HRF; Friston et al., 2000), knowledge about the nonlinear relationship between the neural responses and the measurements can (and often should) be explicitly incorporated into the mapping. Failing to do so might privilege feature sets that incorporate properties of the measurement over feature sets that more accurately reflect the neural representations encoded in a brain region.
In some cases, the nonlinearity can be incorporated when deriving the neural variable before the model is fitted (e.g. the beta weights for fMRI or the power in a given frequency range for EEG/MEG/ECoG). However, this is not always the best approach. For fMRI, the shape of the relationship between neural activity and BOLD responses varies across different subjects and brain regions, and even across different voxels (Ekstrom, 2021; Handwerker et al., 2004). This implies that the frequently used strategy of convolving feature responses with a standard HRF that is fixed across voxels and participants has its limitations, and that mapping models might benefit from integrating nonlinear estimation of the HRF shape within a family of functions motivated by physiological data (see, for instance, Pedregosa et al., 2015; Shain et al., 2020). Region- and voxel-specific HRFs can be set by estimating a relatively small number of parameters; thus, they would require only a small increase in model complexity. For M/EEG, the recorded signal is a combination of both inhibitory and excitatory signals; thus, treating it as a straightforward linear combination is not always possible (Hansen et al., 2010). Linear mapping models often overlook the complexities of neuroimaging signals, sacrificing biological plausibility as a result.
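A minimal sketch of the frequently used baseline strategy mentioned above (synthetic data; the double-gamma shape and its parameters are common defaults): convolve a feature time course with a fixed canonical HRF before fitting a linear mapping. Region- or voxel-specific variants would replace the fixed parameters with a few estimated ones.

```python
import numpy as np
from scipy.stats import gamma
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
TR = 2.0                                       # repetition time in seconds
n_vols = 300

# A canonical double-gamma HRF (peak around 5-6 s, undershoot around 15-16 s).
t = np.arange(0, 32, TR)
hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6
hrf /= hrf.sum()

# A feature time course (one value per volume), convolved with the HRF
# before fitting the linear mapping to a simulated BOLD signal.
feature = rng.normal(size=n_vols)
feature_conv = np.convolve(feature, hrf)[:n_vols]
bold = 0.8 * feature_conv + rng.normal(scale=0.5, size=n_vols)

fit = LinearRegression().fit(feature_conv.reshape(-1, 1), bold)
print(f"estimated weight = {fit.coef_[0]:.2f}")
```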
To summarize thus far, different research goals place very different constraints on the mapping model. A particular goal might require choosing a linear mapping model, adding additional restrictions to that model, using a particular class of nonlinear models, or imposing no a priori restrictions whatsoever.
4. Practical considerations
The criteria outlined above are primarily based on theoretical considerations: which mapping model has the properties that allow us to achieve a particular goal? However, another important consideration is practical feasibility: do we have enough data to accurately estimate the mapping? Will the noise in our data lead certain mapping models to fail?
Determining how much data is required for fitting a particular mapping model has critical implications for experimental design (the number of trials/data points per participant, the number of repetitions per stimulus, etc.). In general, the fewer constraints are placed on the mapping model, the more data will be needed to converge on a good mapping. This relationship can be estimated using standard validation methods by, for instance, taking a large dataset and evaluating the mapping model’s predictive accuracy on left-out test data while gradually increasing the size of the training dataset. However, few studies report such analyses (and in some cases large-enough datasets may still be lacking). One exception is a line of fMRI studies that aim to determine the best mapping model for linking interregional functional correlations and behavioral/demographic traits. The results of these studies are mixed: some report a marked advantage of nonlinear mapping models over linear ones (Bertolero & Bassett, 2020), whereas others report that linear mapping models perform equally well even when the training set includes several thousand brain images (He et al., 2020; Schulz et al., 2020). Thus, the field would greatly benefit from further systematic examinations of the influence of dataset size (and other experimental design properties) on the performance of a particular mapping model type.
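A sketch of this validation procedure on synthetic data (the two mapping models and the ground-truth function below are arbitrary choices, intended only to illustrate the analysis):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)

# Synthetic dataset: 2000 samples x 80 features, with a mildly nonlinear target.
X = rng.normal(size=(2000, 80))
y = X @ rng.normal(size=80) + 2.0 * np.tanh(X[:, 0] * X[:, 1]) + rng.normal(size=2000)
X_test, y_test = X[1500:], y[1500:]                      # fixed held-out test set

models = {
    "linear (ridge)": lambda: RidgeCV(),
    "nonlinear (MLP)": lambda: MLPRegressor(hidden_layer_sizes=(64,),
                                            max_iter=3000, random_state=0),
}

# Learning curve: held-out predictivity as a function of training-set size.
for n_train in [100, 300, 600, 1200]:
    scores = []
    for name, make_model in models.items():
        model = make_model().fit(X[:n_train], y[:n_train])
        r, _ = pearsonr(model.predict(X_test), y_test)
        scores.append(f"{name}: r = {r:.2f}")
    print(f"n_train = {n_train:4d}  " + "  ".join(scores))
```

Such curves show whether, and at what dataset size, a more flexible mapping overtakes a simpler one.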
Even with infinite data, certain measurement properties might force us to use a particular mapping class. For instance, Nozari et al. (2020) show that fMRI resting state dynamics are best modeled with linear mappings and suggest that fMRI’s inevitable spatiotemporal signal averaging might be to blame (although see Anzellotti et al., 2017, for contrary evidence). In sum, even after establishing theoretical desiderata for the mapping model, we need to conduct rigorous empirical tests to determine which model class will achieve good predictive accuracy given the amount and quality of available data.
5. Moving forward: estimating model complexity
Instead of focusing exclusively on the linear/nonlinear dichotomy, we propose to view the choice of the mapping model in the context of a broader notion of model complexity. Complexity lies at the heart of most desiderata discussed above. Arbitrarily complex models are less likely to generalize, leading to decreased predictive accuracy on test data; they can be harder to interpret; and they are less likely to match computations in biological circuits. All these considerations support the idea of Occam’s razor, whereby one should strive for the simplest mapping that will achieve the desired goal.
5.1 The role of complexity in mapping model selection
We argue that, most of the time, the debate over linear/nonlinear models is, in reality, a debate over how complex the mapping model should be. For instance, the use of linear mappings when comparing feature sets is dictated by the desire to reduce the amount of computation performed on the features — in other words, the desire to reduce mapping model complexity. Similarly, the linear readout requirement is introduced to ensure that the features are usable by downstream brain regions. However, a simple nonlinear mapping might make the features equally usable (given the nonlinear nature of most brain computations). Thus, we suggest reframing the linear/nonlinear debate in terms of model complexity.
Figure 3 shows the research goals discussed above together with the mapping model types that are traditionally used to achieve these goals, as well as our proposal to shift from the linear/nonlinear dichotomy to explicit estimates of model complexity. Note that this diagram depicts theoretical, a priori criteria for restricting mapping model complexity; practical considerations might impose additional constraints to achieve better predictivity (see Section 4).
Figure 3. Different research goals are currently being collapsed into the “linear/nonlinear” dichotomy (L - linear, NL - nonlinear), but in fact correspond to different degrees of mapping model complexity. Note that the ordering of research goals along the complexity continuum is approximate and shown primarily for illustration purposes.
It turns out that the goals within each of the three broad categories — predictive accuracy, interpretability, and biological plausibility — impose very different constraints on the complexity of the mapping model. Further, these constraints are often more graded than the linear/nonlinear distinction:
- Interpreting individual features is easier when the mapping is not only linear, but also sparse, so that each neuron can be described with only a few features. Reframing the mapping model choice in terms of complexity allows us to pick out simple mappings within the class of linear mapping models, thus facilitating interpretation.
- Satisfying biological constraints, such as accounting for physiological properties of the measurement or simulating neural readout, may require a certain degree of nonlinearity but these nonlinearities are often well-defined and can keep overall model complexity relatively low.
- Testing whether a feature set accurately captures the representational geometry of neural responses requires the mapping model to reflect that geometry. Thus, the complexity of the mapping model depends primarily on the hypothesis being tested.
- Comparing and/or interpreting feature sets is possible even when the mapping is nonlinear, as long as we can compare the mappings using a metric that incorporates both predictive accuracy and model complexity.
- Decoding features from neural data and building accurate encoding models of the brain does not require placing any theory-based restrictions on the mapping model (although such restrictions might improve performance in practice).
5.2 Complexity measures
How can we estimate the complexity of our mapping models? To date, many studies have focused primarily on a binary distinction in which linear models are “simple” and nonlinear models are “complex”. However, as discussed above, this distinction is often overly simplistic. Here, we review several measures of mapping model complexity that are commonly used in the ML literature and may serve as an alternative to the linear/nonlinear dichotomy.
5.2.1. Number of free parameters
A very common approach to measuring complexity is to count the number of free parameters in the model. In this approach, each model class receives a penalty that corresponds to its number of parameters, such that classes with more parameters have a larger penalty. In order to justify the use of additional parameters, the model needs to achieve a substantial performance improvement compared to simpler models. This tradeoff is often implemented using Akaike’s Information Criterion (AIC) or the Bayesian Information Criterion (BIC), which reward models for good predictive performance but penalize them for the number of parameters. Note, however, that this approach often fails to capture distinctions that seem intuitively relevant. For instance, a linear and a nonlinear model with the same number of parameters would have equal complexity in this view, even though the latter often has greater expressive power. Another example is a sparse mapping model that only has non-zero weights for 1-2 features vs. a dense model that places non-zero weights on, say, 500 features: if the initial feature vector size is the same, then these models will have the same number of parameters and therefore equivalent complexity under this metric.
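A minimal sketch of how AIC and BIC trade off fit against parameter count for a Gaussian linear mapping model (synthetic data; additive constants shared across models are dropped):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def aic_bic(y, y_hat, n_params):
    """AIC/BIC for a Gaussian model, up to an additive constant shared across models."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    aic = n * np.log(rss / n) + 2 * n_params
    bic = n * np.log(rss / n) + n_params * np.log(n)
    return aic, bic

rng = np.random.default_rng(8)
n = 500
X_small = rng.normal(size=(n, 5))                                  # 5-feature set
X_large = np.hstack([X_small, rng.normal(size=(n, 95))])           # same + 95 irrelevant features
y = X_small @ rng.normal(size=5) + rng.normal(size=n)

for name, X in [("5 features", X_small), ("100 features", X_large)]:
    y_hat = LinearRegression().fit(X, y).predict(X)                # in-sample fit, as AIC/BIC require
    aic, bic = aic_bic(y, y_hat, n_params=X.shape[1] + 1)          # +1 for the intercept
    print(f"{name}: AIC = {aic:7.1f}, BIC = {bic:7.1f}")
```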
Benefits: easy to estimate.
Limitations: does not always reflect relevant complexity distinctions; sensitive to model architecture.
Sample use case: choosing among multiple well-performing mapping models with the goal of maximizing predictive accuracy on a new (untested) dataset.
5.2.2. Minimum description length
Another common approach to measuring model complexity is based on the idea of minimum description length (MDL; Rissanen, 1978). This approach typically assumes an encoding function over a class of models, and the complexity of each model within the class is determined by the length of the model’s encoding. The encoding function essentially serves as a prior over the model class: more probable mapping models would be assigned shorter descriptions (see also Diedrichsen & Kriegeskorte, 2017; Wu et al., 2006, for a discussion of the relationship between priors and regularization constraints). The MDL approach can overcome some limitations of complexity measures based solely on the number of free parameters by exploiting correlations between parameters to achieve a shorter description length. For instance, under this scheme, sparse models would have a shorter description length and would therefore be considered less complex. The main limitation of an MDL-based metric is that it requires specifying a mapping model class, as well as an encoding scheme for mappings within that class. Thus, if there is no natural prior over the set of mappings we wish to compare, an architecture-free complexity measure may be preferred3.
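In its simplest two-part form (one standard formulation; the choice of encoding scheme, i.e., of the prior over models, is itself a modeling decision), the description length of a mapping model M fit to data D can be written as

DL(M, D) = L(M) + L(D | M),  with  L(M) = −log2 p(M)  and  L(D | M) = −log2 p(D | M),

where the first term is the length of the model’s encoding under the prior and the second is the length of the data encoded with the model’s help. The preferred mapping minimizes this sum; a sparse mapping, for example, is cheap to describe under a prior that favors few non-zero weights.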
Benefits: captures many relevant distinctions; less arbitrary than the number of parameters.
Limitations: requires specifying a mapping model class and a prior over mappings in that class.
Sample use case: comparing competing feature sets without presupposing linear separability of these features.
5.2.3. Sample complexity
Finally, a more practice-oriented metric is sample complexity. Loosely speaking, the sample complexity of a model class is a function that determines the minimal number of training samples required to achieve a desired level of model performance. It is not always straightforward to compute this function a priori; however, it can be assessed empirically by computing learning curves, i.e., the achieved level of predictive accuracy on a test set as a function of the number of training samples (see Section 4). Estimates of sample complexity are vital for understanding whether a given model failed because the underlying hypothesis was wrong or because the dataset was too small to achieve a proper fit.
Benefits: immediate practical application.
Limitations: an indirect measure of model complexity.
Sample use case: estimating the amount of data required to achieve good predictivity for a given mapping model.
Overall, instead of defaulting to linear models, we propose incorporating the estimate of mapping model complexity into the general evaluation framework of encoding/decoding models. This estimate can be used in different ways depending on the research goal. For instance, for feature comparison, if two feature sets produce equally accurate mapping models, the feature set corresponding to a simpler mapping model represents a better fit to neural data. For estimates of potential downstream readout, instead of limiting ourselves to linear functions, we can consider a range of possible mappings, where simpler mappings reflect a higher probability that these features are used downstream. Thus, estimates of model complexity can serve as a powerful complement to predictive accuracy when evaluating mapping model performance.
6. Conclusion
The encoding/decoding framework in modern cognitive neuroscience has provided many valuable insights. However, in some cases, the field has been held back by its excessive reliance on linear mappings between features and brain activity. Here, we have described various research goals that should be taken into account when specifying a mapping model. Contrary to popular belief, few of these goals require the use of linear mapping models. Instead, some do not require placing any constraints on the mapping model, some require placing specific nonlinear constraints, and some use linearity simply as a proxy for reducing model complexity. We therefore propose to explicitly include estimates of model complexity when evaluating mapping models. Incorporating such estimates can help the field overcome its reliance on linear mappings and discover a richer space of accurate, simple, biologically plausible predictors of brain activity.
Acknowledgements
This paper is part of the Generative Adversarial Collaboration (GAC) initiative organized by the Computational Cognitive Conference board. We thank the GAC organizers, especially Megan Peters and Gunnar Blohm, for their invaluable help with this initiative. Many of the ideas discussed in this work arose during the GAC workshop in October 2020 (https://www.youtube.com/watch?v=UI5KclR71IE&list=PLNWftEg2R4s5iObSUvPXhnDJvyNbs4PnM&index=2). We thank the invited workshop speakers — Kohitij Kar, Mariya Toneva, Laura Gwilliams, Jean-Rémi King, Martin Hebart, and Anna Schapiro — as well as workshop participants, for their ideas, comments, questions, and suggestions. We also thank the reviewers who provided comments on our GAC proposal in August 2020 (the reviews are available at https://openreview.net/forum?id=-o0dOwashib). MS was supported by a Takeda Fellowship and the SRC Semiconductor Research Corporation. NZ was supported by a BCS Fellowship in Computation. EF was supported by R01 awards DC016607 and DC016950.
Footnotes
↵1 Note that the neural data being fitted is not necessarily the neural recording itself: researchers may choose to predict the average firing rate, power in a particular frequency band, or beta coefficients from the general linear model (GLM) of fMRI responses (King et al., 2018).
↵2 Some authors refer to predictive accuracy as a measure of mutual information between features X and neural data Y (e.g., Kriegeskorte, 2011), but this relationship is not always straightforward. In general, estimating mutual information is a hard problem (Paninski, 2003), although some ML-based approaches can provide a useful approximation (e.g., Belghazi et al., 2018; Xu et al., 2020).
↵3 For example, certain informational measures (e.g., Bialek et al., 2001; Gilad-Bachrach et al., 2003) can be used to measure the complexity of the statistical relationship between the inputs and outputs of the mapping model (e.g., features and predicted neural data) regardless of a particular architecture or model class, and in some cases may also capture the complexity of non-parametric generative models.