An enhanced inverted encoding model for neural reconstructions

Here we present a more interpretable and versatile approach for reconstructing the contents of perception, attention, and memory from neuroimaging data. Our enhanced inverted encoding model (eIEM) incorporates theoretical and methodological improvements, including proper accounting of population-level tuning functions and a trial-by-trial prediction-error-based metric in which reconstruction quality is measured in meaningful units. Added functionality and improved flexibility are further gained via eIEM's novel goodness-of-fit feature: for trial-by-trial reconstructions, goodness-of-fits are obtained independently of (non-circularly from) prediction error and can be applied to any IEM procedure or decoding metric, enabling improved reconstruction quality, stronger brain-behavior correlations, and more creative applications. We validate eIEM using methodological principles, simulated neuroimaging datasets, and three pre-existing fMRI datasets spanning perception, attention, and working memory. Notably, eIEM is easy to apply and broadly accessible: our Python package (https://pypi.org/project/inverted-encoding) implements eIEM in one line of code, and it is easily modifiable to compare performance metrics and/or scale up to more complex models.

Introduction
A mental representation can be defined as the "systematic relationship between features of the natural world and the activity of neurons in the brain"¹. An increasingly common approach to study mental representations using neuroimaging data is to employ encoding models, which describe this relationship computationally, typically by reducing the complexity of the input data with a set of functions that, when combined, roughly approximate the neural signal.

Encoding and decoding models (aka voxelwise modeling or stimulus-model based modeling) have become a standard method for investigating neural representational spaces and predicting stimulus-specific information from brain activity²⁻⁴. The key advantages of such models over other computational approaches such as multivariate pattern classification or representational similarity analysis are typically touted as the following: (1) Encoding models can take inspiration from single-unit physiology by consisting of tuning functions in stimulus space (aka feature space), allowing both the maximally receptive feature and the precision/sensitivity of the tuning to be estimated across a population of neurons. (2) Encoding models (which transform presented stimuli into predicted brain activity) are easily inverted into decoding models (which transform observed brain activity into predicted stimuli), and this process can be applied to a wide range of stimuli, including oriented Gabor patches⁵.

The inverted encoding model (IEM), sometimes also referred to as forward modeling, is one example of an encoding and decoding model that has quickly risen to prominence in the cognitive neuroscience community (see Supplemental Figure 1)¹⁰⁻³⁶. The basic idea behind IEMs is illustrated in Figure 1A. First, an encoding model is trained to associate patterns of brain activity with specific stimuli.
The model uses simple linear regression and a basis set representing the hypothesized population-level tuning functions, consisting of several channels that are modeled as cosines (or von Mises functions) equally separated across stimulus space (e.g., orientation, color, spatial location). Each channel in the basis set can be assigned a weight per voxel (we adopt fMRI nomenclature here, but IEMs can be applied to any modality, e.g., EEG or MEG), and hence a model can be trained to predict the activity of each voxel using the weights of each channel as predictors (i.e., the regressors in a linear regression). Then, this trained encoder is inverted such that it becomes a decoder capable of reconstructing a trial's stimulus when provided with a novel set of voxel activations (see Methods for mathematical implementation).

IEM is popular due to its simplicity, robust performance, and grounding in single-unit physiology principles. Over recent years, several variations, improvements, and additional features have been proposed for IEMs, resulting in various approaches and evaluation metrics (Supplemental Figure 1). This malleability is both an advantage and a challenge of IEM, as there is sometimes ambiguity over which metric is best, particularly for less experienced users. Moreover, as discussed more below, certain implementations of IEM can produce misleading or difficult-to-interpret results, and do not account for the shape of the basis channels (as recently pointed out in some high-profile debates in the literature³⁷⁻⁴⁰). While this is not to suggest that previous papers have made incorrect scientific interpretations, there are several aspects of IEM that can be improved upon.
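The encode-then-invert logic described above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the implementation in the inverted-encoding package: the function names, the choice of eight raised-cosine channels, and the exponent of 7 are all our own assumptions.

```python
import numpy as np

def make_basis(n_channels=8, space=360, exponent=7):
    """Basis set of raised-cosine channels evenly tiling a circular stimulus
    space. The channel count and exponent here are illustrative choices."""
    centers = np.arange(n_channels) * (space // n_channels)
    x = np.arange(space)
    basis = np.stack([np.clip(np.cos(np.deg2rad(x - c)), 0, None) ** exponent
                      for c in centers])              # (n_channels, space)
    return basis, centers

def train_encoder(train_stims, train_voxels, basis):
    """Estimate each voxel's channel weights via least-squares regression."""
    C = basis[:, train_stims].T                       # (n_trials, n_channels)
    W, *_ = np.linalg.lstsq(C, train_voxels, rcond=None)  # (n_channels, n_voxels)
    return W

def invert_decoder(test_voxels, W):
    """Invert the encoder: estimate channel responses for held-out patterns."""
    return test_voxels @ np.linalg.pinv(W)            # (n_trials, n_channels)
```

In practice the encoder would be trained and inverted inside a cross-validation loop so that every trial can be decoded without circularity.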
Here we propose a variation of IEM which improves the interpretability of stimulus reconstructions, addresses some key issues inherent in the IEM procedure as commonly applied, and provides trial-by-trial stimulus predictions and goodness-of-fit estimates. Our proposed approach serves two main functions: first, it can be thought of as a set of recommended best practices based on theoretical principles, and second, it offers some novel enhancements to the IEM procedure, including a focus on trial-by-trial predictions that includes a built-in measure of uncertainty. For simplicity and ease of communication, we refer to this approach as "enhanced inverted encoding modeling" (eIEM). In this manuscript we provide theoretical justification for each of the recommended aspects of eIEM, along with real and simulated data confirming that eIEM produces results at least as good as, and in several cases better than, a commonly implemented alternative. We also share a publicly available Python package that can apply eIEM to neuroimaging data in a single line of code. While advanced researchers may have reason for adopting some aspects of eIEM without others (using either our code or their own), more novice users may find this pre-packaged set of recommendations and enhanced functionality particularly useful.

Figure 1 depicts the eIEM procedure. In essence, eIEM is a combination of several modifications and improvements (some previously employed, some novel). First, with eIEM, the core encoder and decoder steps described above (i.e., least-squares linear regression for estimating channel weights and responses, using a basis set of hypothesized population-level tuning functions) remain the same, but the encoding model is repeatedly fitted with slightly shifted basis sets such that subsequent reconstructions are in stimulus space rather than channel space.
This "iterative shifting" of the basis set has been employed in a few previous papers⁴¹⁻⁴³, but it is not yet common practice. Iterative shifting allows for more accurate reconstructions spanning the full stimulus space (allowing the remaining eIEM steps to be optimally applied), and aids more generally in reducing potential interpretation flaws and bias associated with impoverished basis sets, as illustrated and explained in Supplemental Figure 2. We also include a Jupyter notebook on OSF (https://osf.io/vf3n6) that illustrates the advantages of iterative shifting across various simulated neuroimaging data.

More critically, the core difference of eIEM is that we analyze reconstructions at the trial-by-trial level, using a procedure that can obtain stimulus predictions, prediction error, and goodness-of-fits for each trial. In contrast, most existing implementations of IEM typically employ an align-and-average approach (Figure 1B), where each trial's reconstruction is circularly shifted along the x-axis such that it is centered on the "correct" channel (closest to the ground-truth stimulus feature), and then these aligned reconstructions are averaged together to result in a single reconstruction, typically quantified using metrics such as slope, amplitude, etc. As we show in the Results, this align-and-average approach is susceptible to information loss, outlier bias, and interpretation issues.

Figure 1 (caption, beginning truncated): (a) ... a stimulus feature value per trial (here 8 trials of colored squares), a trial-by-voxel matrix of brain activations for every voxel per trial (here simulated beta weights from a six-voxel brain region), and a basis set representing hypothesized population-level tuning functions (see Methods for details). The encoder models each voxel's response as the weighted sum of the channels. Given that the trial-by-voxel matrix and the basis set are already given, the weights matrix can be estimated via least-squares linear regression.
Once the weights matrix is estimated from the training dataset, it can be inverted such that the encoder becomes a decoder for the test dataset. Now, instead of estimating the weights matrix via least-squares linear regression, the weights matrix and the trial-by-voxel matrix are given and the channel responses (i.e., reconstructions) are estimated.

The resulting estimated channel responses, or simply reconstructions, form a trial-by-channel matrix where each trial has its own reconstruction composed of weighted cosines. For simplicity, this example shows the encoder trained on the first half of trials and the decoder used to predict the color of the remaining trials, but in most applications cross-validation should be used such that every trial may be decoded while avoiding circularity/double-dipping. (b) The commonly applied align-and-average procedure, which involves aligning and averaging reconstructions across trials and measuring the result according to a variety of possible metrics (e.g., amplitude, slope; see Supplemental Figure 1). (c) eIEM deviates from this step by evaluating trial-by-trial reconstructions rather than an averaged reconstruction. We use iterative shifting of the basis set to allow channel space to equal stimulus space, correlate the reconstructions with the full set of basis channels to estimate each trial's predicted stimulus (and then average prediction error), and also estimate goodness-of-fits for each trial's reconstruction.

With eIEM, these aligned-and-averaged reconstructions can still be obtained for visualization purposes, but eIEM offers several novel advantages for evaluating, quantifying, and interpreting reconstructions. eIEM obtains trial-by-trial stimulus predictions using a correlation table approachᵃ (previously used in a small number of papers⁷,⁴²) that adapts to the shape of the basis channel (Figure 1C). For each trial, a set of correlation coefficients is computed, each reflecting the correlation between that trial's reconstruction and a basis channel (i.e., a "perfect reconstruction") centered at every integer in stimulus space (e.g., resulting in 360 correlation coefficients for a stimulus space ranging from 0-359°). The highest of these correlation coefficients is determined to be the best fit for that trial, and the predicted stimulus feature for that trial is simply the center of that best-fitting basis channel. As described below in the Results, evaluating reconstructions using stimulus predictions obtained via this approach is more methodologically sound than, say, taking the location of highest amplitude as the stimulus prediction²⁵,⁴⁴, because the shape of the basis functions is automatically taken into account.

ᵃ There are other techniques besides the correlation table approach that have been used to obtain trial-by-trial stimulus predictions within the IEM framework, which we refer to as "non-standard approaches" in Supplemental Figure 1. We directly compare mean absolute error calculated via these different non-standard approaches using real fMRI datasets in Supplemental Figure 3. No approach was conclusively better than the others, and we picked the correlation table approach because it accounts for the shape of the encoder's basis channels and is parsimonious with the goodness-of-fit calculations, which rely on the correlation table approach, as described below.
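The correlation table step described above can be sketched as follows, assuming raised-cosine channels in a 360-point circular space (the channel shape, exponent, and function name are our own illustrative choices, not necessarily the package's internals):

```python
import numpy as np

def correlation_table_predict(recon, space=360, exponent=7):
    """Correlate one trial's reconstruction with a basis channel ('perfect
    reconstruction') centered at every integer in stimulus space; the center
    of the best-correlated channel is the predicted stimulus, and its Pearson
    r is the goodness-of-fit."""
    x = np.arange(space)
    rs = np.empty(space)
    for c in range(space):
        ideal = np.clip(np.cos(np.deg2rad(x - c)), 0, None) ** exponent
        rs[c] = np.corrcoef(recon, ideal)[0, 1]
    pred = int(np.argmax(rs))
    return pred, rs[pred]
```

Because the candidate curves are the basis channel itself, the prediction automatically adapts to whatever channel shape the encoder used.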
To quantify reconstruction quality, eIEM naturally lends itself to prediction-error-based summary metrics. The difference between the predicted and actual stimulus feature is that trial's prediction error; this error can be averaged across trials to obtain mean absolute error (MAE), mean signed error, etc. Here we adopted MAE (specifically circular mean absolute error when using circular feature spaces) as a default metric for evaluating reconstruction quality because of its simplicity and grounding in meaningful units (average prediction error), particularly compared to other IEM metrics such as slope, amplitude, and reconstruction fidelity, which use arbitrary units. However, MAE itself is not a necessary component of eIEM: trial-by-trial prediction errors can be evaluated in any manner the researcher sees fit.

The final and most novel advantage that eIEM offers is the automatic calculation of trial-by-trial goodness-of-fit values. Recall that each trial's stimulus prediction is determined as the center of the best-fitting basis function. Because this is calculated independently of the actual correct stimulus, the goodness-of-fit values themselves (correlation coefficients) can also be recorded and optionally leveraged to estimate trial-by-trial confidence of predictions. Individual trials in a neuroimaging study can vary substantially in signal quality (driven by attentional fluctuations, alertness, head motion, scanner noise, etc.), but typically the IEM procedure does not incorporate uncertainty into decoding performance. The lack of uncertainty information has been noted in other contexts, with some recent alternatives to IEM proposed to incorporate uncertainty⁴⁵,⁴⁶. eIEM has the advantage of easily and automatically producing a trial-by-trial measure of prediction uncertainty within the IEM framework itself. This enhancement adds substantial flexibility to the IEM procedure.
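Circular mean absolute error, the default metric adopted above, can be computed by wrapping signed errors into the half-space before averaging. A minimal sketch (the function name is ours; pass space=180 for orientation data):

```python
import numpy as np

def circular_mae(predicted, actual, space=360):
    """Circular mean absolute error: wrap signed errors into
    (-space/2, space/2] before taking absolute values."""
    err = (np.asarray(predicted, float) - np.asarray(actual, float)) % space
    err = np.where(err > space / 2, err - space, err)
    return float(np.mean(np.abs(err)))
```

For example, a prediction of 350° against a true value of 10° counts as a 20° error, not a 340° error.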
For example, as depicted in Figure 1C, goodness-of-fit can be used to set thresholds in a non-circular manner, such that less reliable trials are excluded from reconstructions. As we demonstrate in the Results using real and simulated fMRI datasets, goodness-of-fit thresholding can substantially increase statistical power and performance in versatile ways. A caveat to this increased flexibility is that a researcher cannot try varying cutoff thresholds and then cherry-pick the threshold that provides desirable results; thresholds need to be determined before data collection or according to some a priori, independent criterion.

Here we define eIEM as an encoding model that uses the combination of iterative shifting (or any technique which allows reconstructions to reside in stimulus space), evaluating reconstructions using the basis channel defined for the encoder (via the correlation table approach or other constrained model fitting), calculating trial-by-trial prediction errors, and calculating goodness-of-fits independently from stimulus predictions. Researchers can easily implement eIEM on their own through our publicly available Python package (https://pypi.org/project/inverted-encoding; see Methods for more information). Of course, researchers can choose to implement eIEM without our Python package or choose to implement only some aspects of eIEM. The value of eIEM is primarily in terms of evaluating IEM results: using both real and simulated data, we show below that eIEM can be beneficial in terms of interpretability, flexibility, functionality, and robustness to methodological concerns.

First, as initial validation, we confirmed that across all three datasets eIEM replicated the overall pattern of results obtained previously with IEM (Figure 2).
To do this in a rigorous and consistent way, we reanalyzed all datasets ourselves using a "standard" align-and-average IEM approach with slope as the decoding metric, even if the original paper did not employ this exact same procedure. We chose slope as it is a commonly used decoding metric across IEM papers (see Supplemental Figure 1). For the purposes of testing eIEM, we selected a single primary analysis from each of the three datasets, as detailed in the Methods.

In the Perception dataset⁴⁸, we used both techniques to decode the horizontal position of a stimulus in V1, V4, and IPS. The "standard" align-and-average procedure revealed significant decoding performance in all 3 ROIs, with the strongest decoding (greatest slope) in V1, followed by V4, and then IPS. eIEM replicated this pattern. In the Attention dataset⁴⁷, we used both techniques to decode the attended orientation within a multi-item, multi-feature stimulus array, in the same three ROIs. The "standard" IEM procedure revealed significant decoding in V1 and V4, but not IPS. eIEM again replicated this pattern. Finally, in the Memory dataset⁴³, we used both techniques to decode the remembered orientation of a stimulus over two types of working memory delays: blank delay and distractor delay. With the "standard" procedure, the remembered orientation was successfully decoded in V1, V4, and IPS, with significantly greater decoding in V1 and V4 during the blank delay compared to the distractor delay.

Figure 2 (caption, beginning truncated): ... procedure without thresholding (purple boxes, 'Validation' plots), and eIEM using increasingly stringent cutoffs based on goodness-of-fit (purple boxes, 'Improved Flexibility' plots). Bar plots depict the reconstruction quality, using average slope for the "standard" procedure (higher values are better) and MAE for eIEM (lower values are better) across subjects, with individual subjects overlaid as colored dots, for each of 3 ROIs (V1, V4, IPS).
For the Attention dataset, the 'Brain-behavior correlation' plot additionally plots the trial-by-trial correlation between absolute behavioral error and absolute decoding error for each ROI and goodness-of-fit threshold. Error bars depict standard error of the mean, dotted black lines represent chance decoding, and asterisks represent statistically significant decoding (p<.05, no corrections for multiple comparisons). Overall results show that conclusions are similar between the "standard" align-and-average IEM approach and eIEM, but MAE is more interpretable (not based in arbitrary units) and not prone to the methodological concerns we discuss later in the Results. In addition, each dataset showed that MAE consistently improved with increasing exclusion thresholds, demonstrating the flexibility of goodness-of-fit to exclude noisy trials. Increasing exclusion thresholds also appeared to strengthen brain-behavior correlations in the Attention dataset.

Demonstrating the improved interpretability and versatility of eIEM
Having validated our method across three diverse fMRI datasets, we next use these same datasets to illustrate the practical advantages and additional features of eIEM.
Improved interpretability. eIEM lends itself to metrics that are easily interpretable and comparable across datasets due to decoding performance being measured in terms of prediction error. In contrast, the arbitrary units typical of the more standard aligned-and-averaged reconstructions are not easily interpretable. For example, in the Memory dataset, the "standard" procedure results in an average slope of .006 for the blank delay condition and .004 for the distractor delay condition in V1 (or cosine fidelity values of .100 and .098 as reported in the original paper⁴³); eIEM replicates this pattern, but now with a more interpretable and meaningful metric: orientation can be decoded with an average error of 27.9 degrees in the blank delay and 32.5 degrees in the distractor condition. Note that the eIEM procedure can evaluate trial-by-trial reconstructions using any metric, not just MAE, if a different metric better aligns with the researcher's goals.

Incorporating the additional trial-by-trial uncertainty feature. Next, we tested the flexibility of eIEM to make use of the trial-by-trial goodness-of-fit information. The eIEM approach produces a best-fitting stimulus prediction and associated goodness-of-fit value (correlation coefficient) for each trial. It is important to emphasize that the correlation coefficient reflects the degree to which the reconstruction matches the best-fitting basis channel, not the basis channel centered on the correct stimulus. In other words, this goodness-of-fit information is obtained without circularity. It is obtained independently of, and prior to, any calculation of prediction error.

To test the impact of using goodness-of-fit information on decoding performance, we performed an analysis where we excluded trials with the lowest 5%, 10%, 25%, and 50% of goodness-of-fit values ("Improved flexibility" subplots of Figure 2).
This resulted in visible improvements in MAE (i.e., smaller decoding error) with increasing exclusion thresholds for all three datasets (linear regression revealed a significant negative slope in all cases except for IPS in the Attention dataset). Notably, in the Attention dataset, MAE improved with increasing thresholds in V1 and V4 (where decoding was significant in the unthresholded analysis) but not in IPS (where decoding was at chance in the unthresholded analysis). Thus, the goodness-of-fit information can be used to improve decoding performance when a brain region contains reliable information about a stimulus, but does not produce false positives in the absence of observable stimulus-specific brain activity.

This trial-by-trial prediction uncertainty information could be used in several ways. One suggestion we put forth is that goodness-of-fit can be used to threshold reconstructions, such that worse-fitting trials may be excluded from analysis. This principle is analogous to the phase-encoded retinotopic mapping and population receptive field modeling techniques, where a set of models spanning the full stimulus space is evaluated for every voxel, and the parameters of the best-fitting model are selected as that voxel's preferred stimulus, with the goodness-of-fit values then used to threshold the results⁴⁹⁻⁵¹. Note that although r-squared is the more commonly used goodness-of-fit statistic in regression, squaring the correlation coefficient loses potentially critical information about its sign (e.g., a perfectly inverted reconstruction should not be assigned equal confidence as a perfect reconstruction in most cases), so we recommend the use of the r-statistic.
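Thresholding of the kind used above can be sketched as follows; note that signed r values are used directly (not r-squared), and the exclusion fraction must be fixed a priori. The helper name is our own:

```python
import numpy as np

def mae_after_gof_threshold(abs_errors, gofs, exclude_frac):
    """Mean absolute decoding error after excluding the lowest-goodness-of-fit
    fraction of trials. Signed r values are used directly, and exclude_frac
    must be chosen a priori, never tuned to the results."""
    abs_errors = np.asarray(abs_errors, float)
    gofs = np.asarray(gofs, float)
    cutoff = np.quantile(gofs, exclude_frac)
    keep = gofs >= cutoff            # drop trials below the cutoff
    return float(np.mean(abs_errors[keep]))
```

For instance, excluding the worst-fitting half of four hypothetical trials with errors [10, 20, 30, 40] and fits [0.9, 0.8, 0.2, 0.1] reduces MAE from 25° to 15°.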
For data scrutiny purposes, or for experiments where an inverted reconstruction is theoretically informative, researchers are free to adopt r-squared values or to forgo goodness-of-fit calculations entirely. Alternative approaches to trial thresholding also exist, including weighting trials based on their goodness-of-fit (see Discussion).

Goodness-of-fits obtained using eIEM can also be applied more broadly, beyond the MAE decoding metric. That is, one could use eIEM goodness-of-fits to threshold trials while using any of the commonly employed metrics of slope, amplitude, cosine fidelity, etc. Goodness-of-fit thresholding can also be applied directly to (aligned-and-averaged) reconstruction curves. This could be particularly useful when one wants to use eIEM and quantify results with MAE, but also visualize reconstruction curves. We illustrate this point in Figure 3 using the above 3 fMRI datasets along with a simulated dataset. As increasingly strict goodness-of-fit thresholds were applied, the reconstruction curves were visibly improved. All of the quantification metrics similarly showed improvements with increasing goodness-of-fit thresholds. For the real fMRI datasets, these improvements generalized across all datasets, subjects (within-subjects and across-subjects), brain regions, and experimental conditions (mirroring the performance benefits shown in the "Improved flexibility" subplots of Figure 2).

Brain-behavior correlations. Finally, having trial-by-trial prediction error and goodness-of-fit values lends itself to analyses correlating neural measures with behavior. We demonstrate this in the Attention dataset (the Perception dataset did not collect behavioral responses, and in the Memory dataset behavioral performance was too close to ceiling [~3° avg. error]). In V1 and V4, we observed a significant correlation between a trial's behavioral error magnitude and neural prediction error.
Moreover, the strength of these correlations increased with higher goodness-of-fit thresholds, as depicted in the "Brain-behavior" subplot of Figure 2 (Fisher z-transformed correlation test: V1: t(6)=6.85, p<.001; V4: t(6)=5.31, p<.001). We note that behavioral error itself did not noticeably change across these thresholds, suggesting that the goodness-of-fit information was reflecting noise at the level of the fMRI signal, not simply fluctuations in behavior or cognitive focus.

Altogether, these findings demonstrate several tangible benefits of using eIEM. Not only does eIEM produce robust, reliable results, but it can do so using a prediction error framework and metric that is more interpretable and intuitive. Moreover, the novel goodness-of-fit feature adds flexibility and functionality beyond what standard approaches provide.

Additional methodological concerns addressed by eIEM
The three real fMRI datasets analyzed above are useful validation cases because they contain robust findings (as may be more likely with published, publicly available datasets), allowing us to convey the improved interpretability and flexibility of eIEM in cases where the overall pattern of decoding is consistent with a "standard" align-and-average IEM approach. Crucially, however, there are also cases where we would expect eIEM results to diverge from the "standard" IEM results due to various methodological concerns inherent in the align-and-average approach. Below we illustrate some hypothetical cases susceptible to specific methodological concerns and limitations that are addressed by eIEM. These hypothetical cases are based on theoretical principles and illustrated by cherry-picked visualizations to showcase some ways that eIEM could meaningfully diverge from standard align-and-average steps; while it remains unclear how pragmatically relevant such divergences would be in real neuroimaging experiments, even less extreme versions of these patterns could produce misleading data via other IEM approaches.

Figure 4 (caption, partial): In (c) each individual trial's reconstruction is essentially noise, such that the averaged reconstruction results in a false peak around the aligned point; the "standard" procedure using align-and-average metrics results in spuriously superior decoding performance than both (a) and (b), with (c) having a higher amplitude, steeper slope, improved cosine fidelity, and narrower standard deviation when fit with a Gaussian distribution. The eIEM procedure, calculating MAE from trial-wise prediction error, correctly concludes that case (a) shows the best decoding performance.

Another advantage of eIEM is that MAE is also less prone to bias from outlier trials compared to any of the align-and-average metrics. When using the align-and-average IEM approach, a single outlier reconstruction can disproportionately bias the shape of the averaged reconstruction, potentially completely flipping the averaged reconstruction in the most extreme cases. In contrast, the influence of an outlier is naturally capped for eIEM: consider a hypothetical experiment composed of 300 trials where 299 trials predict the correct stimulus and one trial predicts the stimulus 180° away (assuming 360° stimulus space); because prediction error is calculated at the trial level and then averaged, an outlier trial could only increase MAE by a maximum of 0.6°.
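The 0.6° cap in this hypothetical experiment can be checked directly:

```python
import numpy as np

# 299 perfectly decoded trials plus one maximally wrong trial (180 deg off in
# a 360-deg circular space): averaging trial-level errors caps the outlier's
# influence on the summary metric.
errors = np.array([0.0] * 299 + [180.0])
mae = float(np.mean(np.abs(errors)))  # 180 / 300 = 0.6 deg
```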

Falsely assuming a monotonic relationship between decoding performance and neural signal. In addition to providing preferred solutions to the potential concerns of outlier bias and information loss, there are other advantages to evaluating trialwise reconstructions using eIEM. As described earlier, our trialwise goodness-of-fit metric allows the researcher to predict which trials will likely produce better reconstructions (and hence better stimulus predictions). While other complementary approaches exist, for example using trial-by-trial decoding error to help extract top-down spatial signals⁴⁴ or to correlate trial-by-trial decoding with memory-guided behavior across trials²⁴, critically, eIEM's approach to calculating trialwise predictions avoids another known issue.
As has been recently highlighted in the literature, IEMs produce reconstructions that depend on the choice of basis set³⁸,³⁹. If the IEM procedure does not explicitly take this observation into account, it can produce faulty interpretations. Supplemental Figure 2 illustrates a rudimentary example of this issue that can be addressed with iterative shifting, but a more fundamental problem remains. That is, intuition, and standard practice, wrongly assume that a monotonic relationship exists between reconstruction quality measured by metrics such as slope, amplitude, and bandwidth, and the amount of stimulus-specific information in the brain signal. However, if the basis set consists of equally spaced and identical basis channels (which is typically the case), then a "perfect" reconstruction returns the exact shape of the basis channelᵇ. Yet, the standard align-and-average IEM steps can produce reconstructions from noisy neural data that appear better than this "perfect" benchmark. We argue that it makes more sense to directly compare the shape of the reconstruction to the shape of the basis channel to make predictions and evaluations. The correlation table step we implement in eIEM automatically adjusts to the shape of the basis channel because it is the basis channel itself that is being used to obtain predictions, providing a more direct relationship between IEM performance and stimulus-specific brain signal.

Simply put, amplitude, slope, bandwidth, cosine fidelity, etc. are inferior metrics compared to the correlation table metric because they do not adapt to the choice of basis set. This is true even if a researcher were to skip the align-and-average step by evaluating reconstructions at the trial-by-trial level. For instance, using the amplitude metric, a higher amplitude at the aligned point is thought to reflect improved performance.
If the basis channel ranges from 0 to 1, a perfect reconstruction should 23 have an amplitude of exactly 1 at the aligned point, but reconstructions can feasibly 24 have amplitudes far greater than 1. Likewise, the cosine fidelity (aka vector mean) 25 approach of evaluating reconstructions by taking the dot product of the reconstruction 26 b There are some cases where the basis channel is not representative of the perfect reconstruction. For instance, if training a model with intermixed experimental conditions, or if using basis channels that are not identical or are unequally spaced. In these cases, one can simulate what a perfect reconstruction looks like at each stimulus feature and use these tuning functions in the same manner as the correlation table approach. We also note that there may be cases where visual comparison of reconstructions is the goal and numerical reconstruction evaluations are unnecessary. and a cosine function could favor reconstructions with a wider width than expected 1 under ideal conditions. Figure 4C illustrates an extreme example of this problem, where 2 standard align-and-average IEM metrics would produce spuriously high values, 3 potentially leading one to falsely interpret the neural signal from Figure 4C as producing 4 a better reconstruction than the signal from Figure 4A. 5 Note that to avoid this pitfall, it is important to preserve the exact shape of the  decoding performance compared to Figure 4a). Meanwhile, eIEM evaluates the 14 reconstructions at a trial-by-trial level, using a fixed basis channel shape (i.e., the only 15 parameter that changes is the mean location in stimulus space). 
Our eIEM package 16 implements this using the correlation table approach, which can be considered as a 17 constrained version of curve-fitting, in the sense of optimizing the best location of the 18 basis channel, but any approach that fits the trial-by-trial reconstructions via comparison 19 to the exact basis channel shape used in the original encoder would also be 20 appropriate. 21
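To make the correlation table step concrete, here is a minimal sketch (ours, for illustration only; the function names are not from the eIEM package) that slides the basis channel across stimulus space, correlates each shifted copy with a trial's reconstruction, and takes the best-correlating center as the predicted stimulus, with that best correlation serving as the trial's goodness-of-fit:

```python
import math

def basis_channel(theta, mu, stim_range=360, exponent=8):
    # Raised-cosine channel centered at mu, over a circular stimulus space
    d = (theta - mu + stim_range / 2) % stim_range - stim_range / 2
    return math.cos(math.pi * d / stim_range) ** exponent

def pearson(x, y):
    # Pearson correlation between two equal-length sequences
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_table_predict(reconstruction, stim_range=360):
    # Correlate the reconstruction (one value per stimulus feature) with the
    # basis channel shifted to every candidate center; the center with the
    # highest correlation is the prediction, and that correlation is the
    # goodness-of-fit (computed without reference to the true stimulus)
    best_mu, best_r = 0, -2.0
    for mu in range(stim_range):
        channel = [basis_channel(t, mu, stim_range) for t in range(stim_range)]
        r = pearson(reconstruction, channel)
        if r > best_r:
            best_mu, best_r = mu, r
    return best_mu, best_r

# Sanity check: a noiseless reconstruction equal to a channel centered at 90
recon = [basis_channel(t, 90) for t in range(360)]
pred, fit = correlation_table_predict(recon)
```

Because the prediction is the location of the best-fitting copy of the basis channel itself, a taller or wider reconstruction cannot spuriously inflate the metric; only similarity in shape to the channel matters.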

Conclusions
Inverted encoding modeling has become a popular method for predicting stimuli and investigating neural representations because of its robust performance, simplicity of linear modeling, ability to predict untrained classes, and grounding in single-unit […] concerns surrounding some standard IEM procedures, and offering increased functionality, flexibility, and interpretability of results. In other words, when it comes to whether significant stimulus-specific information can be reconstructed from a brain region, eIEM performs comparably to (i.e., at least as well as) other IEM approaches. The core advantage of eIEM is that it can also do more, increasing the types of analyses that can be conducted as well as enhancing the interpretability of results.

Namely, eIEM's benefits are due to the following modifications: (1) iterative shifting of the basis channel returns reconstructions in stimulus space rather than impoverished channel space; (2) evaluating reconstruction quality using the same tuning function used to build the encoder is more theoretically sound than assuming a monotonic relationship between reconstruction shape and neural signal; (3) calculating trial-by-trial prediction errors allows decoded stimulus features to be evaluated in meaningful units, improving interpretability (compared to an align-and-averaged reconstruction) and allowing comparison of decoding performance across different model specifications; and (4) calculating goodness-of-fits independently from stimulus predictions provides a trial-by-trial uncertainty measure that can be used to flexibly improve reconstruction quality and applications.

It is worth noting that while some prior papers have presented alternative IEM procedures that implement some of the modifications above, eIEM is unique in terms of its comprehensiveness and accessibility/simplicity.
Moreover, the eIEM procedure is […]. We demonstrated the above advantages of eIEM using three fMRI datasets. Our validations demonstrated how our eIEM approach can be applied to both circular and non-circular stimulus spaces, is sensitive to variations in decoding performance across brain regions and experimental conditions, and can be used to accurately decode the contents of perception, attention, and working memory. Our modifications further allowed decoding performance across the datasets to be directly compared in intuitive units, allowed us to meaningfully link neural reconstructions with behavior, and demonstrated how uncertainty, measured via goodness-of-fit, can be leveraged to increase statistical power. Note that just because these three datasets produced consistent overall results (in terms of significance testing) across procedures does not ensure this will always be the case, as illustrated by the cases in Figure 4 where eIEM carries less of a risk of misinterpretation than other implementations.

One of the more novel aspects of eIEM is that researchers have the flexibility to exclude trials with noisier reconstructions based on our goodness-of-fit scores, which reflect how similar in shape each reconstruction is to the basis channel at the predicted stimulus. This flexibility can be used alongside any computational model or evaluation metric because researchers could first threshold trials using goodness-of-fits obtained from our eIEM approach before proceeding to implement non-eIEM steps.
A core difference between goodness-of-fits obtained using eIEM and other IEM metrics that evaluate the shape of a reconstruction is that eIEM goodness-of-fits are obtained independently of prediction error: they are obtained without reference to the correct, ground truth stimulus (other metrics first center reconstructions at the correct stimulus and then evaluate the result, which is circular if used for trial thresholding). Note that we do not prescribe a specific cutoff for determining goodness-of-fit thresholds in this paper; rather, we simply offer that such an approach is possible for improving IEM performance. For example, a researcher could a priori decide to more heavily weight trials with higher confidences or simply exclude the noisiest 20% of trials. We depict examples of varying goodness-of-fit thresholds across a variety of simulation parameters in Supplemental Figure 3. We also stress that goodness-of-fit thresholds need to be determined before data collection or according to some sort of a priori, independent criteria.

Our method also improves interpretability by evaluating reconstructions in terms of prediction error. For example, "V1 showed 10° average prediction error and V4 showed 20° average prediction error" is more interpretable than "V1 showed .02 amplitude and V4 showed .01 amplitude" because the latter is in arbitrary units, whereas MAE is in meaningful units. Further, unlike amplitude or slope, the magnitude of prediction error can be directly compared to other experiments using the same stimulus space. This allows researchers to compare the quality of decoding across experiments or experimental conditions with differing model specifications.

In this paper we have referred to IEMs as a specific kind of encoding and decoding model that involves simple linear regression with population-level tuning functions.
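As a concrete sketch of the thresholding idea (illustrative code with made-up example values, not part of the package), one can a priori decide to exclude the worst-fitting 20% of trials and then compute circular MAE on the remainder:

```python
def circ_error(pred, actual, stim_range=360):
    # Absolute circular difference between two stimulus features (degrees)
    d = abs(pred - actual) % stim_range
    return min(d, stim_range - d)

def thresholded_mae(preds, actuals, fits, drop_frac=0.20, stim_range=360):
    # Sort trials by goodness-of-fit and drop the worst drop_frac of them.
    # The fits were computed without reference to `actuals`, so this
    # exclusion is non-circular.
    trials = sorted(zip(fits, preds, actuals), reverse=True)
    keep = trials[: len(trials) - int(len(trials) * drop_frac)]
    errors = [circ_error(p, a, stim_range) for _, p, a in keep]
    return sum(errors) / len(errors)

preds = [10, 100, 200, 355, 180]   # predicted features (degrees)
actuals = [20, 90, 210, 5, 0]      # true features; the last trial is way off
fits = [0.9, 0.8, 0.85, 0.7, 0.1]  # the noisy trial also fits worst
mae_all = thresholded_mae(preds, actuals, fits, drop_frac=0.0)     # 44.0
mae_thresh = thresholded_mae(preds, actuals, fits, drop_frac=0.2)  # 10.0
```

Here the single chance-level trial inflates the unthresholded MAE from 10° to 44°; dropping the one trial with the lowest goodness-of-fit recovers the 10° estimate, in meaningful units throughout.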
An advantage of eIEM is that improvements are accomplished without […] noise and which produce trial-by-trial probability distributions such that uncertainty can be obtained similarly to our procedure (although the researchers discuss this in terms of testing Bayesian theories of neural computation rather than trial thresholding). Further work will be necessary to compare all these approaches, and we acknowledge that these more complex modeling approaches may be more useful than eIEM depending on the research question and the extent to which the researcher prefers model complexity over simplicity. For now, given the ample and broad interest in the cognitive neuroscience community towards implementing relatively simple IEM procedures, we offer eIEM as an accessible means to improve reconstructions within the same familiar framework already popularized by IEM.

In summary, inverted encoding modeling has become increasingly popular in recent years, and yet the proper method for evaluating and interpreting IEMs has become increasingly uncertain. Our eIEM method is advantageous for both theoretical and practical reasons. From theoretical principles, we argue that eIEM is less susceptible to methodological and interpretational pitfalls than other typical implementations of IEM. We also demonstrate clear and practical advantages for evaluating reconstructions according to our eIEM method: as validated with real fMRI datasets, researchers can use eIEM to obtain improved reconstructions via thresholding with goodness-of-fits, compare decoding performance across experiments and varying basis sets using a metric grounded in meaningful units, evaluate performance in stimulus space rather than impoverished channel space, and obtain concrete stimulus predictions (with corresponding goodness-of-fits) for every trial rather than rely on a summary statistic based in arbitrary units.
While there already exist approaches capable of addressing some of these concerns, our approach represents a suite of best practices that can be adopted by researchers in future work. Our procedure can be easily implemented with just one line of code using our accessible Python package (https://pypi.org/project/inverted-encoding; see Methods), and can be applied to investigate a vast range of research questions that involve reconstructing the contents of perception, attention, and memory from neuroimaging data.

Inverted encoding model procedures
For all datasets, we performed a set of analyses using both a standard align-and-average IEM approach and the eIEM approach (as depicted in Figure 1, with the exception that we used iterative shifting for both approaches to facilitate more direct comparison). For both procedures, we used the same basis set. Basis channels can be modeled as

basis_channel(θ) = cos(π(θ − μ) / stimulus_range)^(num_channels − 1),

where θ is degrees in stimulus space, μ is the center of each channel, and stimulus_range is the range of stimulus space (e.g., 360° hues on a color wheel). The reasoning behind raising cosines to the num_channels − 1 is to make the tuning curves narrower and more comparable to physiological findings 54. Our eIEM Python package actually approximates this cosine with a gaussian function; this allows the user to experiment with different widths for their tuning functions via the optional parameter channel_sd. For the encoder, each voxel's response was modeled as the weighted sum of the channels, such that the observed trial-by-voxel fMRI activation matrix B is equal to the dot product of the basis set and the weight matrix W,

B = C · W, with C = basis_set[trial_features],

where trial_features is the feature (e.g., orientation) of each stimulus and basis_set is the matrix of channels with shape (stimulus_range, num_channels). Once the weights matrix is estimated from the training dataset, it is inverted such that the weights matrix and the trial-by-voxel matrix are given and the channel responses (i.e., reconstructions) are estimated.

For the "standard" align-and-average approach, we aligned and averaged the single-trial reconstructions into an average reconstruction and calculated slope as a traditional decoding metric. For the eIEM procedure, we calculated absolute prediction error for each trial via the correlation table metric and then calculated MAE. We performed these steps for each subject, ROI, and condition, and then we calculated the average slopes and MAEs across subjects.
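The encoder and its inversion can be sketched with a small NumPy simulation (an illustrative sketch, not the package's implementation; the simulated trial, voxel, and noise parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
stim_range, num_channels = 360, 9
centers = np.arange(0, stim_range, stim_range // num_channels)

def basis_set(theta):
    # Raised-cosine channels evaluated at stimulus feature(s) theta
    d = (np.subtract.outer(theta, centers) + stim_range / 2) % stim_range - stim_range / 2
    return np.cos(np.pi * d / stim_range) ** (num_channels - 1)

# Simulated experiment: the forward (encoding) model says the trial-by-voxel
# matrix B is the dot product of the channel responses C (the basis set
# evaluated at each trial's feature) and the weight matrix W
n_trials, n_voxels = 200, 50
trial_features = rng.integers(0, stim_range, n_trials)
C = basis_set(trial_features)                   # trials x channels
W_true = rng.random((num_channels, n_voxels))   # channels x voxels
B = C @ W_true + rng.normal(0, 0.05, (n_trials, n_voxels))

# Estimate the weights via least squares ...
W_hat = np.linalg.pinv(C) @ B                   # channels x voxels
# ... then invert the model to estimate channel responses (reconstructions)
C_hat = B @ np.linalg.pinv(W_hat)               # trials x channels
```

In practice the weights would be estimated on held-out training folds; here the inversion is applied to the same simulated trials purely to show the algebra of the two steps.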
For each condition and ROI, we assessed significance via permutation testing. Significance tests were one-sided and uncorrected, calculated by comparing the t-statistic calculated from the actual data against the permuted null distribution of t-statistics (one t-statistic per each of 5,000 permutations). For eIEMs, we also repeated this analysis pipeline using varying levels of goodness-of-fit thresholds. That is, we discarded a certain percent of trials based on the worst goodness-of-fits and then calculated MAE using the remaining trials. The full list of t-statistics and corresponding p-values for the tests performed in Figure 2 are displayed in Supplementary Table 1.

Data were previously collected in our lab for another study, Chen et al. (2021) 47. In this experiment, seven participants completed a visual attention task. Each trial started with a central fixation cross. After 700 ms, three circle outlines were displayed at equidistant locations surrounding the fixation cross for 200 ms. One outline was thicker than the others, representing the spatial cue. Participants were instructed to covertly attend to the spatial cue location while maintaining fixation at the fixation cross. After 1100 ms, three colored and oriented gratings were briefly displayed for 100 ms, followed by a 200 ms mask and a continuous orientation report. Participants were instructed to report the orientation of the grating that appeared at the location of the spatial cue. There were also trials where participants were asked to shift attention to a different spatial location before the onset of the gratings, and entire runs where participants were asked to attend and report the color of the grating (instead of orientation), but we did not include these in the present analyses.

Data were obtained by downloading post-processed fMRI data publicly available on OSF (https://osf.io/dkx6y) 43. We reanalyzed Experiment 1, where six participants underwent a visual working memory task.
For each trial, a cue indicating the distractor condition was shown for 1.4 s, followed by a target grating shown for .5 s whose orientation participants were instructed to memorize, followed by a 1 s blank delay, and then an 11 s delay containing one of 3 possible distractor conditions: blank delay, Fourier-filtered noise, or a distractor grating of a pseudo-random orientation. Following an additional 1 s blank delay, participants had 3 s to report the orientation of the target grating, followed finally by a variable intertrial interval (3/5/8 s). Each participant completed 108 trials per distractor condition. We reconstructed the remembered orientation based on the average activation patterns 5.6-13.6 s after target onset, as in Figure 1c of their paper, but only used the blank delay and distractor grating conditions for simplicity. To maintain consistency across procedures and datasets, we used the 10-fold cross-validation described earlier, with the blank delay and distractor grating conditions trained/modeled/evaluated separately from each other. We analyzed V1, V4, and IPS regions of interest, which were defined via retinotopic mapping protocols where participants viewed rotating wedges and bowtie stimuli 55. We applied IEMs to the post-processed data provided by the authors of the original paper: single-trial activation estimates consisted of averaged BOLD signal between 5.6-13.6 s (7-17 TRs) after target onset. For more methods information, please refer to the original paper 43.

Simulated datasets
We used Python to simulate fMRI data from hypothetical brain regions of interest with arbitrarily chosen numbers of voxels, where each voxel's ground truth tuning function was the same shape as the basis channel. For Figure 1 and Figure 3, we injected random noise from a gaussian distribution into the trial by voxel matrix to produce noisier data as would be expected from a real fMRI experiment. Simulated fMRI data were subjected to the IEM procedures described above to depict the contents of Figure […].

We have released the Python 3 package "inverted-encoding" on PyPI (https://pypi.org/project/inverted-encoding/) and GitHub (https://github.com/paulscotti/inverted_encoding) for easy implementation of our eIEM procedure. The package contains two main functions, "IEM" and "permutation."

For the "IEM" function, the only necessary inputs are an array of the stimulus features for every trial, the trial by voxel activations matrix (note: inputs other than voxels may be used for other modalities), the given stimulus range (e.g., stim_max=180 for stimulus values ranging from 0°-179°), and the number of folds to use for cross-validation. The basis set can be specified as an optional parameter and will otherwise default to a basis set composed of nine equidistant channels, each modeled using gaussian functions that approximate cos(π(θ − μ) / stimulus_range)^(num_channels − 1). We use the gaussian approximation because this allows the user additional flexibility to vary the width of the basis channel tuning functions via the standard deviation parameter. The stimulus space defaults to a circular space but can optionally be set to be non-circular via the Boolean parameter "is_circular." We advise researchers to be careful regarding cross-validation, such that training and test data do not come from the same functional runs due to timepoints not being independent (autocorrelation, z-scoring, etc.).
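The gaussian approximation of the raised-cosine channel can be checked numerically. In this sketch (ours; the matching standard deviation below is derived from a small-angle approximation and is not necessarily the package's default channel_sd), a gaussian with sd of roughly 40° closely tracks the raised cosine for nine channels over a 360° space:

```python
import math

stim_range, num_channels = 360, 9

def raised_cosine(d):
    # Basis channel shape as a function of distance d from the channel center
    return math.cos(math.pi * d / stim_range) ** (num_channels - 1)

# Matching curvature at the peak (cos(u) ~ exp(-u**2 / 2) for small u)
# gives sd = stim_range / (pi * sqrt(num_channels - 1)), about 40.5 degrees
sd = stim_range / (math.pi * math.sqrt(num_channels - 1))

def gaussian(d):
    return math.exp(-d ** 2 / (2 * sd ** 2))

# Largest pointwise discrepancy across the whole stimulus space
max_diff = max(abs(raised_cosine(d) - gaussian(d)) for d in range(-180, 181))
```

The two curves agree to within a few percent everywhere, which is why varying the gaussian's standard deviation is a convenient way to explore narrower or wider tuning functions.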
The final outputs are an array of each trial's predicted stimulus, an array of each trial's corresponding goodness-of-fit, each trial's unaligned stimulus reconstruction, and each trial's aligned stimulus reconstruction (aligned such that the correct stimulus is at zero in channel space). The user can then compute MAE (or any decoding metric they prefer). MAE can be computed by averaging the (circular) absolute error between the predicted stimulus features and the actual stimulus features. We provide a convenience function, "circ_diff", for computing circular error. The user can decide whether they want to threshold any trials using the provided goodness-of-fit values prior to calculating decoding performance.

For the "permutation" function, the only necessary input is an array of the actual stimulus features. For each iteration, the stimulus features are randomly shuffled and used as the predicted stimuli to compute the MAE. The function outputs a null distribution of MAE values for the user to compare against the MAE obtained from the "IEM" function. A more exact and computationally intensive method would be to rerun the entire IEM pipeline with shuffled stimulus labels on every iteration to obtain the null distribution. This can also be performed using our package by simply repeating the IEM function with a different shuffling of the stimulus features for every iteration. Our exploratory comparisons of null distributions obtained using both approaches across the three fMRI datasets discussed in the main text yielded no obvious differences.
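The shuffling logic of the "permutation" function can be sketched as follows (a standalone illustration; the package's implementation may differ in details such as the number of iterations):

```python
import random

def circ_abs_error(pred, actual, stim_max=360):
    # Absolute circular error between a predicted and an actual feature
    d = abs(pred - actual) % stim_max
    return min(d, stim_max - d)

def permutation_null_mae(stim_features, n_iter=1000, stim_max=360, seed=0):
    # Null distribution of MAE: each iteration shuffles the actual stimulus
    # features and scores the shuffled values as if they were predictions
    rng = random.Random(seed)
    null = []
    for _ in range(n_iter):
        shuffled = stim_features[:]
        rng.shuffle(shuffled)
        mae = sum(circ_abs_error(p, a, stim_max)
                  for p, a in zip(shuffled, stim_features)) / len(stim_features)
        null.append(mae)
    return null

# For a 360-degree circular space, chance-level MAE is ~90 degrees
features = list(range(0, 360, 5))
null = permutation_null_mae(features)
mean_null = sum(null) / len(null)
```

An observed MAE well below the bulk of this null distribution indicates above-chance decoding; note that the null mean sits near 90° for a uniform 360° circular feature space.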

Data availability
The Perception dataset 48 is publicly available on OSF (https://osf.io/j7tpf/). The Memory dataset 43 is publicly available on OSF (https://osf.io/dkx6y). The Attention dataset 47 will be made publicly available upon final publication of the original study; researchers may contact the authors to obtain this dataset before then.

Code availability
Code to implement eIEM is available as a Python package (https://pypi.org/project/inverted-encoding). Code to reproduce Figures 1 and 3 […].

Supplemental Figure 3. For each subplot, as in Figure 3 in the main text, colored lines reflect different goodness-of-fit thresholds (red: 50%, green: 30%, orange: 15%, blue: 0%), the dotted curved line depicts the basis channel (i.e., "perfect" reconstruction), and the vertical dotted line denotes the aligned point in channel space. Text on the left side of each plot denotes the ground truth simulation parameters. "v" refers to the number of voxels, which was a random integer between 50 and 1000. "c" refers to the number of equally spaced basis channels, which was a random integer between 5 and 30. "vsd" and "bsd" refer to the standard deviation of the underlying voxel tuning functions (vsd) and basis channels (bsd), which were random integers between 10 and 120. "σ" was a random floating-point number between 0 and 5 and refers to the standard deviation of a normal distribution (mu = 0) from which random noise was injected into the trial by voxel matrix. Colored text denotes the MAE for each thresholded reconstruction. Subplots were automatically removed and replaced if the reconstructions showed near-chance performance or near-perfect performance, where differences from thresholding would not be expected to be visible. Code to reproduce this figure is available on OSF (https://osf.io/et7m2/).

Supplemental Table 1. Statistics from tests depicted in Figure 2 from the main text.