Abstract
Our daily life is realized by the complex orchestrations of diverse brain functions including perception, decision, and action. One of the central issues in cognitive neuroscience is to reveal the complete representations underlying such diverse functions. Recent studies have revealed representations of natural perceptual experiences using encoding models1–5. However, there has been little attempt to build a quantitative model describing the cortical organization of multiple active, cognitive processes. Here, we measured brain activity using functional MRI while subjects performed over 100 cognitive tasks, and examined cortical representations with two voxel-wise encoding models6. A sparse task-type encoding model revealed a hierarchical organization of cognitive tasks, their representation in cognitive space, and their mapping onto the cortex. A cognitive factor encoding model utilizing continuous intermediate features by using metadata-based inferences7 predicted brain activation patterns for more than 80 % of the cerebral cortex and decoded more than 95 % of tasks, even under novel task conditions. This study demonstrates the usability of quantitative models of natural cognitive processes and provides a framework for the comprehensive cortical organization of human cognition.
Introduction
The cortical basis of daily cognitive processes has been studied using a voxel-wise encoding and decoding model approach6 where multivariate regression analysis is used to determine how brain activity in each voxel is modelled by target features, such as visual features1,2, object or scene categories3,8,9, sound features5,10,11, and linguistic information4,12,13. Some studies have further described the cortical (e.g. semantic) representational space that elucidates important categorical dimensions in the brain (e.g. mobile vs. nonmobile, animate vs. inanimate) and how such representations are mapped onto the cortex3,14. However, all previous attempts have used brain activity recorded during passive listening or viewing tasks. No study has so far been able to clarify the comprehensive cortical representations underlying active cognitive processes.
Here, we combined data-driven encoding modelling and metadata-based reverse inference to reveal such representations. Six subjects underwent functional MRI experiments to measure whole-brain blood-oxygen-level-dependent (BOLD) responses during 103 naturalistic tasks (Fig. 1a), including as many cognitive varieties as possible and ranging from simple visual detection to complex cognitive tasks such as memorization, language comprehension, and calculation (see Supp. Info for the task list and descriptions). This experimental setup aimed to extend the previous efforts at describing the semantic space3,14 by estimating the cognitive space that depicts the relative relationships among diverse cognitive processes. Each task was thus regarded as a sample taken from the entire cognitive space. To obtain a comprehensive representation of the cognitive space, we modelled voxel-wise responses using regularized linear regression6 based on two sets of features (Fig. 1b-c). First, using a task-type encoding model where tasks were represented as binary labels (Fig. 1b), we evaluated representational relationships among cognitive tasks across the cerebral cortex. Second, to further examine the generalizability of the modelling approach to any cognitive tasks, we constructed an additional cognitive factor encoding model, where each task was transformed into the 715-dimensional continuous feature space using metadata references7 (Fig. 1c). This allowed us to use a latent feature space for each task6,15 and thereby predict and decode activity for novel tasks that were not used during model training (Fig. 1d).
Results
Hierarchical organization of cognitive tasks
To examine how the cortical representations of over 100 tasks are related, we calculated a representational similarity matrix (RSM) using the estimated weights of the task-type model, concatenated across subjects (Fig. 2a). The RSM suggests that tasks form six clusters based on their representational patterns in the cerebral cortex. Task clusters were then visualized by the dendrogram obtained using hierarchical clustering analysis (HCA). The largest clusters contained tasks based on sensory modalities, such as visual (‘AnimalPhoto’, ‘MapSymbol’), auditory (‘RateNoisy’, ‘EmotionVoice’), and motor (‘PressLeft’, ‘EyeBlink’) tasks. Some clusters contained higher cognitive components, such as language (‘WordMeaning’, ‘RatePoem’), introspection (‘ImagineFuture’, ‘RecallPast’), and memory (‘MemoryLetter’, ‘RecallTaskEasy’). Tasks were further represented in sub-clusters of specific cognitive properties (Fig. 2b-d). For example, in the visual cluster, tasks with food pictures (‘RateDeliciousPic’, ‘DecideFood’) were closely located, whereas tasks with negative pictures (‘RateDisgustPic’, ‘RatePainfulPic’) formed a separate cluster, memory (Fig. 2b) tasks involving calculations (‘CalcEasy’, ‘CalcHard’) were close while those involving simple digit matching (‘MemoryDigit’, ‘MatchDigit’) formed a separate cluster, and in the introspection cluster (Fig. 2d), tasks involving imagining future and recalling past events were more closely located than tasks involving the imagination of places or faces. These results indicate hierarchically organized brain representations of cognitive tasks.
Spatial visualization of cognitive space and its cortical mapping
The HCA reveals the relative relationships between task samples taken from the entire cognitive space. To further determine the structure and cortical organization of the cognitive space, we performed principal component analysis (PCA) with the estimated weight matrix of the task-type model, concatenated across subjects (Fig. 3 and Supplementary Fig. 1). Figure 3a shows the distributions of the tasks according to their PC coefficients, where task position is determined by the first and second PC and task colour by the first, second, and third PC (corresponding to red, green, and blue, respectively; see Fig. 3b inset). Tasks with similar representations were assigned similar colours and were closely located in the 2-dimensional space (Fig. 3a). Tasks involving movie processing are clustered on the left at the top. Tasks dedicated to image and auditory processing are located more centrally on both the left and right side, gradually shifting towards complex cognitive tasks involving language, memory, logic, and calculation at the bottom of the distribution. To further visualize cortical distributions of cognitive space representations, the voxel-wise PCs were projected to the cortical sheet of each subject (Fig. 3b and Supplementary Fig. 2, 3), using the same RGB colour scheme as in Figure 3a. For example, the occipital areas are mostly green, showing that voxels in these areas represent movie and image-related tasks (Fig. 3a). The adjacent temporal parietal junction (TPJ) tends to be coloured in red, corresponding to internal cognitive tasks involving memory and calculations. Frontal areas show intricate patterns, including language-related representations (blue) in the left lateral regions. This topographical organization was consistent across subjects (Supplementary Fig. 3), indicating that our analyses provide a broad representation of the cognitive space in the human cerebral cortex.
Prediction and decoding of brain activity with the cognitive factor model
Although the task-type model can reveal distinctive relationships among tasks, it is too sparse to encompass latent and continuous features and is not generalizable to novel tasks. To tackle these issues, we transformed over 100 tasks into the 715-dimensional latent feature space using the Neurosynth database7 and constructed a voxel-wise cognitive factor model (Fig. 1c). To examine the generalizability of this model under novel task conditions (i.e. on a task that was not used to train the model), we trained the cognitive factor model with four fifths of the tasks (82 or 83 tasks), and predicted brain activity for the 20 or 21 remaining tasks (Fig. 1d). The model achieved significant prediction accuracy throughout the entire cortex (Fig. 4a and Supplementary Fig. 4; mean ± SD, 0.322 ± 0.042; 86.2 ± 5.1 % of voxels were significant; p < 0.05, FDR-corrected). To show that this cannot merely be explained by sensorimotor effects, we performed an additional encoding model analysis that regressed out visual, auditory, and motor components (see Methods). This analysis again revealed significant prediction accuracy across the cerebral cortex (mean ± SD, 0.285 ± 0.035; 82.4 ± 4.9 %; Supplementary Fig. 5), indicating that the generalizability of the cognitive factor-model stems from higher-order (i.e. not sensory) cognitive components.
To further test the generalizability and task specificity of cognitive factors, we performed a task decoding analysis with novel tasks. We trained a decoding model with four fifths of the tasks and decoded the cognitive factors related to the remaining target tasks at each time point. We tested whether the decoded features were more similar to the target task than to each of the remaining 102 tasks in the cognitive space. We obtained significant decoding accuracy for novel tasks (mean ± SD, 96.5 ± 0.9 %; 98.9 ± 0.4 % of tasks were significant; sign tests, p < 0.05, FDR-corrected; Fig. 4b, Supplementary Fig. 6), indicating that brain activity patterns were task-specific, and that the portion of the human cognitive space our model covers is sufficient to also decode novel tasks.
Discussion
Most previous studies using encoding or decoding model approaches have used passive viewing or listening tasks2–4,13, and standard neuroimaging studies using active tasks usually focus on a few conditions and examine effects of pre-assumed cognitive factors by comparing induced brain activations. While the latter strategy is a powerful way to test the plausibility of certain hypotheses, outcomes from such specialized studies could so far not elucidate the representational relationships among diverse tasks and cannot be generalized to naturalistic tasks where cognitive factors cannot be inferred in advance. Here, using over 100 naturalistic tasks that broadly sample the human cognitive space, the prediction accuracy we find for our model throughout the entire cortex is in clear contrast to the results of previous studies. While, for example, our earlier modelling attempt using a passive viewing paradigm3 provided significant predictions for 22 % of cortical voxels, largely restricted to the occipital and temporal areas, the cognitive factor model in the current study achieved significant predictions for about 86 % of all cortical voxels. The metadata-based inference technique used here further demonstrates the contribution of cognitive factors to these tasks7 and the applicability of such a data-driven approach to elucidate the brain organization of diverse cognitive functions.
While several clusters and components found in the current study have also been identified in previous multi-task studies16–21, we reveal a gradual shift in cognitive space, from perceptual to more complex cognitive tasks, that can only be elucidated by using our broad sampling paradigm. The subject-wise modeling also allowed the examination of the generalizability of the cognitive space, of task hierarchy, and of the representations in each subject’s brain to novel tasks; the latter may form the quantitative basis for elucidating personal traits in cognitive functions22. The fact that our model achieved unprecedentedly wide generalizability regarding cortical coverage and multi-task decodability indicates that our task battery represents a sufficient number of samples to probe a major proportion of the human cognitive space. Although the tasks used here do not cover the entire domain of human perception and cognition (e.g. they do not cover odour perception, speech, social interaction, etc.), our method is applicable to any arbitrary task that could be performed in a scanner, and our framework provides a powerful step forward to the complete modelling of the representations underlying human cognition.
Methods
Subjects
Six healthy subjects (aged 22-33 years, two females; referred to as ID01-06) with normal vision and normal hearing participated in the current experiment. Subjects were all right-handed (laterality quotient = 70-100), as assessed using the Edinburgh inventory23. Prior to their participation in the study, written informed consent was obtained from all subjects. This experiment was approved by the ethics and safety committee of the National Institute of Information and Communications Technology in Osaka, Japan.
Stimuli and procedure
We prepared 103 naturalistic tasks that can be performed without any pre-experimental training (see Supplementary Data for the detailed description of each task and Supplementary Fig. 7 for the behavioural results). Tasks were selected to include as many cognitive domains as possible. Each task had 12 instances; eight instances were used in the training runs, and four instances were used in the test runs. Stimuli were presented on a projector screen inside the scanner (21.0 × 15.8 degrees of visual angle at 30 Hz). The root-mean square of auditory stimuli was normalized. During scanning, subjects wore MR-compatible ear tips. The experiment was executed in 3 days, with six runs performed on each day.
The experiment was composed of 18 runs, 12 training runs and six test runs. Each run contained 77-83 trials with a duration of 6-12 s per trial. To keep subjects attentive and engaged and to ensure all runs had the same length, a 2-s feedback for the preceding task (correct or incorrect) was presented 9-13 times per run. In addition to the task, 6 s of imaging without a task were inserted at the beginning and at the end of each run; the former was discarded in the analysis. The duration of a single run was 556 s. In the training runs, task order was pseudo-randomized, as some tasks depend on each other and were therefore presented close to each other in time (e.g. the tasks ‘MemoryDigit’ and ‘MatchDigit’). In the test runs, 103 tasks were presented four times in the same order across all six runs (but with different instances for each repetition). There was no overlap between instances in the training runs and the test runs. No explanation of tasks was given to the subjects prior to the experiment. Subjects only underwent a short training session on how to use the buttons to respond.
MRI data acquisition
The experiment was conducted on a 3.0 T scanner (TIM Trio; Siemens, Erlangen, Germany) with a 32-channel head coil. We scanned 72 interleaved axial slices that were 2.0 mm thick, without a gap, parallel to the anterior and posterior commissure line, using a T2*-weighted gradient-echo multiband echo-planar imaging (MB-EPI) sequence24 [repetition time (TR) = 2000 ms, echo time (TE) = 30 ms, flip angle (FA) = 62°, field of view (FOV) = 192 × 192 mm2, resolution = 2 × 2 mm2, MB factor = 3]. We obtained 275 volumes in each run, each following three dummy images. For anatomical reference, high-resolution T1-weighted images of the whole brain were also acquired from all subjects with a magnetization-prepared rapid acquisition gradient echo sequence (MPRAGE, TR = 2530 ms, TE = 3.26 ms, FA = 9°, FOV = 256 × 256 mm2, voxel size = 1 × 1 × 1 mm3).
fMRI data preprocessing
Motion correction in each run was performed using the statistical parametric mapping toolbox (SPM8). All volumes were aligned to the first EPI image for each subject. Low-frequency drift was removed using a median filter with a 240-s window. The response for each voxel was then normalized by subtracting the mean response and scaling it to the unit variance. We used FreeSurfer25,26 to identify cortical surfaces from anatomical data, and to register them to the voxels of functional data. For each subject, the voxels identified in the cerebral cortex were used in the analysis (53,345~∼66,695 voxels per subject).
Task-type model
The task-type model was composed of one-hot vectors which were assigned 1 or 0 for each time bin, indicating whether one of the 103 tasks was performed in that period. The total number of task-type model features was thus 103.
Encoding model fitting
In the encoding model, cortical activation in each voxel was fitted with a set of linear temporal filters that capture the slow hemodynamic response and its coupling with brain activity2. The feature matrix FE [T × 3N] was modelled by concatenating sets of [T × N] feature matrices with three temporal delays of 2, 4, and 6 s (T = # of samples; N = # of features). The cortical response RE [T × V] was then modelled by multiplying the feature matrix F with the weight matrix WE [3N × V] (V = # of voxels):
We used an L2-regularized linear regression using the training dataset to obtain the weight matrix WE. The training dataset consisted of 3336 samples (6672 s). The optimal regularization parameter was assessed using 10-fold cross validation, with the 18 different regularization parameters ranging from 100 to 100 × 217.
The test dataset consisted of 412 samples (824 s, repeated four times). To reshape the data spanning over six test runs into the four times-repeated dataset, we discarded 6 s of the no-task period at the end of each run, as well as the 2-s feedback periods at the end of the 3rd and 6th test runs. Four repetitions of the test dataset were averaged to increase the signal-to-noise ratio. Prediction accuracy was calculated using Pearson’s correlation coefficient between predicted signal and measured signal in the test dataset. The statistical threshold was set at p < 0.05, and corrected for multiple comparisons using the false discovery rate (FDR) procedure27.
Evaluation of optimal regularization parameter
To keep the scale of weight values consistent across subjects, we performed a bootstrapping procedure to assess the optimal regularization parameter used for the group HCA and PCA4. For each subject, we randomly divided the training dataset into training samples (80 %) and validation samples (20 %) and performed model fitting using an L2-regularized linear regression. This procedure was repeated 50 times, with the 18 different regularization parameters ranging from 100 to 100 × 217. The resultant prediction accuracies were averaged across six subjects for each parameter. We selected the optimal regularization parameter which provided the highest mean prediction accuracy across subjects. This regularization parameter was used for model fitting in the group HCA and PCA.
Hierarchical cluster analysis
For the HCA, we used the weight matrix of the task-type model concatenated across six subjects. For each subject, we selected voxels which showed a significant prediction accuracy with p < 0.05 (with FDR correction, 39,485~56,634 voxels per subject) and averaged three time delays for each task. RSM was then obtained by calculating Pearson’s correlation coefficients between mean brain activations of all task pairs. A dendrogram of 103 tasks was described using the task dissimilarity (1 – correlation coefficient) as a distance metric, with the minimum distance as a linkage criterion. Each cluster was labelled based on the included cognitive tasks. To obtain an objective interpretation of cluster labelling, we also performed a metadata-based inference of cluster-related cognitive factors (Supplementary Fig. 8 and Table 3).
Principal component analysis of task-type weights
For each subject, we performed a PCA on the weight matrix of the task-type model concatenated across six subjects. We selected voxels which showed a significant prediction accuracy with p < 0.05 (with FDR correction, 39,485~56,634 voxels per subject) and averaged three time delays for each task. The number of meaningful PCs was determined based on the prediction accuracy with the reconstructed weight matrix (Supplementary Fig. 1). To interpret each PC, we quantified the relative contribution of each task using the PCA loadings; tasks with higher PCA loading values were regarded to contribute more to the target PC (Supplementary Table 1). Each PC was thus labelled based on these cognitive tasks. To obtain an objective interpretation of PC labelling, we also performed a metadata-based inference of PC-related cognitive factors (Supplementary Table 2). PCA loadings were also used to evaluate the representational correspondence between task clusters and PCs (Supplementary Fig. 9). To show the structure of the cognitive space, 103 tasks were mapped onto the 2-dimensional space using the loadings of PC1 (1st PC) and PC2 as the x- and y-axis. The tasks were further coloured in red, green, and blue, based on the relative PCA loadings in PC1, PC2, and PC3, respectively.
To represent the cortical organization of the cognitive space for each subject, we extracted and normalized PCA scores from each subject’s voxels. The resultant cortical map indicates the relative contribution of each cortical voxel to the target PC (denoted as PCA score map, Supplementary Fig. 2). By combining the PCA score maps of the top three PCs of each subject, we visualized how each cortical voxel is represented by cognitive clusters. Each cortical voxel was coloured based on the relative PCA scores of PC1, PC2, and PC3, corresponding to the colour of the tasks in the 2-dimensional space.
Cognitive factor model
To obtain a task representation using the continuous features in the human cognitive space, we transformed sparse task-type features into the latent cognitive factor feature space (Fig.1c). We used Neurosynth (http://neurosynth.org; accessed 26th January 2018) as a metadata reference of the past neuroimaging literature7. From the approximately 3,000 terms in the database, we manually selected 715 terms that cover the comprehensive cognitive factors while avoiding redundancy. Specifically, we removed several plural terms which had their singular counterpart (e.g. ‘concept’ and ‘concepts’) and past tense verbs which had their present counterpart (‘judge’ and ‘judged’) in the dataset. We also excluded terms which indicated anatomical regions (e.g. ‘parietal’) (see Supplementary Data for the complete set of 715 terms). We used the reverse-inference image of the Neurosynth database for each of the selected terms. The reverse-inference image indicates the likelihood of a given term being used in a study if activation is observed at a particular voxel. Each reverse-inference image in MNI152 space was registered to the subjects’ reference EPI data using FreeSurfer25,26.
We calculated correlation coefficients between the weight map of each task in the task-type model and the registered reverse-inference maps. This resulted in the [103 × 715] coefficient matrix. We obtained a cognitive transform function (CTF) of each subject, by averaging the coefficient matrices of the other five subjects. The CTF is a function that transforms the feature values of 103 tasks into the 715-dimensional latent feature space. The feature matrix of the cognitive factor model was then obtained by multiplying the CTF with the feature matrix of the task-type model. Note that the CTF (and the resultant feature matrix) of each target subject was independent of their own data. The total number of cognitive factor model features was 715.
Encoding model fitting with sensorimotor regressors
To evaluate a possible effect of low-level sensorimotor features on the model predictions, we performed an additional encoding model fitting while regressing out sensorimotor components. We concatenated motion-energy (ME) model features (visual), modulation transfer function (MTF) model features (auditory), and button response (BR) model features (motor) with the original feature matrix during the model training (see the Supplementary Methods for details). ME model features were obtained by applying 3-dimensional spatio-temporal Gabor wavelet filters to the visual stimuli2. MTF model features were obtained by applying spectro-temporal modulation-selective filters to the cochleogram of the auditory stimuli28. BR model features were obtained based on the number of button responses made by each subject. The model testing excluded the sensorimotor regressors from the concatenated feature matrix and the corresponding weight matrix. This analysis revealed that model prediction accuracy is independent of low-level sensorimotor features.
Motion-energy model (regressor of non-interest for visual features)
The details of the ME model design have been described elsewhere2. First, movie frames and pictures were spatially down-sampled to 96 × 96 pixels. The RGB pixel values were then converted into the Commission International de l’Eclairage (CIE) LAB colour space, and colour information was discarded. The luminance (L*) pattern was passed through a bank of 3-dimensional spatio-temporal Gabor wavelet filters. The outputs of two filters with orthogonal phases (quadrature pairs) were squared and summed to yield local motion-energy. Motion-energy was compressed with a log-transform and temporally down-sampled to 0.5 Hz. Filters were tuned to six spatial frequencies (0, 1.5, 3.0, 6.0, 12.0, 24.0 cycles/image) and three temporal frequencies (0, 4.0, 8.0 Hz), without directional parameters. Filters were positioned on a square grid that covered the screen. The adjacent filters were separated by 3.5 standard deviations of their spatial Gaussian envelopes. The total number of ME model features was 1395.
Modulation transfer function model (regressor of non-interest for auditory features)
A sound cochleogram was generated using a bank of 128 overlapping bandpass filters ranging from 20 to 10,000 Hz29. The window size was set to 25 ms, and the hop size to 10 ms. Filter output was averaged across 2 s (TR). We further extracted features from the MTF model28. For each cochleogram, a convolution with modulation-selective filters was calculated. The outputs of two filters with orthogonal phases (quadrature pairs) were squared and summed to yield local modulation energy2. Modulation energy was log-transformed, averaged across 2 s, and further averaged within each of the 10 non-overlapping frequency ranges logarithmically spaced along the frequency axis. The filter outputs of upward and downward sweep directions were used. Modulation-selective filters were tuned to five spectral modulation scales (Ω = 0.50, 1.0, 2.0, 4.0, 8.0 cyc/oct) and five temporal modulation rates (ω = 4.0, 8.0, 16.0, 32.0, 64.0 Hz). The total number of MTF model features was 1000.
Button response model (regressor of non-interest)
The BR model was constructed based on the number of button responses within 1 s for each of the four buttons, with the right two buttons pressed by the right thumb and the left two buttons pressed by the left thumb. The total number of BR model features was four.
Decoding model fitting
In the decoding model, the cortical response matrix RD [T × 3V] was modelled by concatenating sets of [T × V] matrices with temporal delays of 2, 4, and 6 s. The feature matrix FD [T × N] was modelled by multiplying the cortical response matrix RD with the weight matrix WD [3V × N]:
The weight matrix WD was estimated using an L2-regularized linear regression with the training dataset, following the same procedure as for the encoding model fitting.
Encoding and decoding with novel tasks
In order to examine the generalizability of our models, we performed encoding and decoding analyses with novel tasks which were not used in the model training (Fig. 1d). We randomly divided the 103 tasks into five task groups. A single task group contained 20-21 tasks. We performed five independent model fittings, each with a different task group as the target. From the training dataset, we excluded the time points during which the target tasks were performed, and those within 6 s after the presentation of the target tasks. In the test dataset, we used only the time points during which the target tasks were performed, and those within 6 s after the presentation of the target tasks. This setting allowed us to assume that the activations induced by the target task group and those induced by the other four task groups (training task groups) did not overlap, and it enabled us to investigate the prediction and decoding accuracy for the novel tasks. We performed the encoding and decoding model fitting with the training task groups composed of 82-83 tasks. For the model testing, we concatenated the predicted responses or decoded features of the five task groups. Responses or features for the time points that were duplicated were averaged across the five task groups. Note that encoding and decoding with novel tasks was only possible with the cognitive factor model, because the original tasks needed to be transformed into the latent feature space.
For the decoding analysis with novel tasks, we measured the similarity between the CTF of each task and each decoded cognitive factor vector using Pearson’s correlation coefficient for each time point. We refer to the correlation coefficient as the task score12. We then calculated the time-averaged task scores for each task using the one-vs.-one method. For each target task, a series of binary classification was performed between the target task and each of the remaining 102 tasks. Decoding accuracy was then calculated as a percentage that the target task had a higher task score in this procedure. Statistical significance of decoding accuracy was tested for each task using the sign test (p < 0.05, with FDR correction).
Code and data availability
The MATLAB code used in the current study and the datasets generated during and/or analysed during the current study are available from the corresponding author upon reasonable request.
Author contribution
T.N. and S.N. designed the study; T.N. collected and analysed the data; T.N. and S.N. wrote the manuscript.
Author Information
The authors declared no competing interests. Correspondence and requests for materials should be addressed to S.N (nishimoto{at}nict.go.jp)
Acknowledgments
We thank MEXT/JSPS KAKENHI (grant numbers 17K13083 and JP18H05091 in #4903 (Evolinguistics) for T.N., and JP15H05311 for S.N.) as well as JST CREST JPMJCR18A5 and ERATO JPMJER1801 (for S.N.) for the partial financial support of this study. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.