Meet me in the middle: brain-behavior mediation analysis for fMRI experiments

Functional outcomes (e.g., subjective percepts, emotions, memory retrievals, decisions, etc…) are partly determined by external stimuli and/or cues. But they may also be strongly influenced by (trial-by-trial) uncontrolled variations in brain responses to incoming information. In turn, this variability provides information regarding how stimuli and/or cues are processed by the brain to shape behavioral responses. This can be exploited by brain-behavior mediation analysis to make specific claims regarding the contribution of brain regions to functionally-relevant input-output transformations. In this work, we address four challenges of this type of approach, when applied in the context of mass-univariate fMRI data analysis: (i) we quantify the specificity and sensitivity profiles of different variants of mediation statistical tests, (ii) we evaluate their robustness to hemo-dynamic and other confounds, (iii) we identify the sorts of brain mediators that one can expect to detect, and (iv) we disclose possible interpretational issues and address them using complementary information-theoretic approaches. En passant, we propose a computationally efficient algorithmic implementation of the approach that is amenable to whole-brain exploratory analysis. We also demonstrate the strengths and weaknesses of brain-behavior mediation analysis in the context of an fMRI study of decision under risk. Finally, we discuss the limitations and possible extensions of the approach.


Introduction
Functional outcomes (e.g., subjective percepts, emotions, memory retrievals, decisions, etc...) are partly determined by external stimuli and/or contextual cues. But they may also be strongly influenced by irreducible variability in brain responses to incoming information (Ferster, 1996;Shadlen and Newsome, 1994). In particular, neural noise may be a critical determinant of illusory percepts, aberrant emotions, erroneous memory retrievals, biased decisions, etc... (Bays, 2014;Faisal et al., 2008;Hong and Rebec, 2012). For most existing statistical data analyses of neurophysiological data, neural noise is typically treated as a statistical nuisance, since it compromises the identification of relationships between measured brain activity and experimental variables (Doi and Lewicki, 2011;Naselaris et al., 2011). This perspective is unfortunate however, since neural noise can provide complementary information regarding how incoming information is processed and/or distorted by the brain to yield functional outcomes (Dinstein et al., 2015;McDonnell and Ward, 2011;Stein et al., 2005). The critical point is that a brain system may encode functionally-relevant information that is not used by the brain when producing a functional outcome. This has been repeatedly demonstrated in neurological patients who do not exhibit significant behavioral impairments despite being lesioned in brain regions that are known to encode behaviorally-relevant information (Aerts et al., 2016;Alstott et al., 2009). But what if one can show that neural noise contributes to -otherwise unexplained-behavioral variability? This is the essence of brain-behavior mediation analysis, which aims at detecting neural systems that both respond to behaviorally-relevant cues or stimuli and eventually impact overt behavior (MacKinnon et al., 2007).
Recall that any cognitive function can be seen as some form of -potentially complex, context-dependent, redundant, partially unconscious, etc-neural transformation of relevant stimuli into adaptive behavioural outcomes (Robbins, 2011). By adaptive, we simply mean that cognitive functions serve a specific purpose, which can be abstracted and put to a (behavioural) test. At the limit, one could argue that understanding cognitive functions reduces to assessing input-output relationships, where inputs are experimentally controlled stimuli and/or task instructions, and outputs are overt behavioural outcomes. In this view, neuroimaging in healthy subjects should serve to identify how brain networks contribute to the input-output transformation (Palestro et al., 2018;Rigoux and Daunizeau, 2015;Turner et al., 2019). A reasonable strategy here is to identify intermediary neural states that mediate the impact of incoming information onto overt behavior and/or subjective reports. In its simplest form, brain-behavior mediation analysis reduces to a twofold regression analysis that aims at detecting uncontrolled variability in brain responses that significantly improves behavioral predictability. The ensuing statistical tests typically reason as follows: if region M responds to experimental factor X, and explains behaviour Y above and beyond the effect of X, then M mediates the effect of X onto Y. For example, brain-behavior mediation analysis was used to identify the prefrontal and/or subcortical systems that mediate successful emotional regulation , threat response (Wager et al., 2009a(Wager et al., , 2009b or risk avoidance (Yamamoto et al., 2015). More recently, the anterior cingulate cortex, the anterior insula, the thalamus and some brain stem nuclei were shown to mediate various aspects of pain perception (Atlas et al., 2010(Atlas et al., , 2014Geuter et al., 2018;Koban et al., 2017Koban et al., , 2019Woo et al., 2015). Most of these studies were performed using the multilevel mediation/moderation or M3 toolbox (Wager, 2008), which was first derived for probing effective connectivity from fMRI signals. Since then, a few multivariate extensions of brain-behavior mediation analysis were proposed, aiming at improving either spatial or temporal resolution (Chén et al., 2018;Lindquist, 2012;Zhao and Luo, 2017). But these approaches neither lay out nor address the specific methodological and interpretational challenges posed by brain-behavior analysis, when applied to typical fMRI experiments. In our view, progress in brain-behavior mediation analysis requires answering at least four important (and related) questions:  (Q1) Which test statistics should be used? Not only should the test statistics be valid (i.e. yield controlled false positive rate), but they also should be maximally powerful. The latter is a pressing issue because fMRI induces a massive multiple comparison problem, which can only be solved by using more stringent significance thresholds (Lindquist and Mejia, 2015;Worsley and Friston, 1995).
We will summarize and compare the statistical properties of the most established test statistics of mediation analysis.
 (Q2) How robust is brain-behavior mediation analysis to assumptions regarding the hemodynamic response function (HRF) and other confounds? Recall that virtually all forms of fMRI time-series analyses rely on HRF models to assess effects of interest (Deshpande et al., 2010;Gitelman et al., 2003;Liao et al., 2002;Pedregosa et al., 2015). Although brain-behavior mediation analysis involves similar assumptions, different modelling strategies may be employed that yield distinct bias-variance tradeoffs. We will compare the statistical properties of these candidate approaches in the presence of deviations to modelling assumptions.  (Q3) What sort of brain mediators can we expect to detect? Consider the bottomup chain of neural information processing stages that eventually yield behavioral outcomes (from low-level sensory processing to high-level cognitive treatment of stimuli and/or cues). It turns out that these stages do not have the same chance of being detected. As we will see, this is a corollary consequence of the nontrivial 6 (and yet undisclosed) impact of neural noise onto the statistical properties of mediation analysis.  (Q4) Does mediation analysis induce potential interpretational issues? As we will see, some interpretational issues are specific to the chosen statistical testing approach, but others are generic to any brain-behavior mediation analysis. In particular, significant mediated effects are compatible with two distinct scenarios regarding the causal relationship between brain activity and behavioral responses. We discuss the importance of this and related issues and identify ways to address them.
In this work, we address these four questions from a user-oriented statistical perspective. Our aim here is to set a methodological standard for brain-behavior mediation analysis. The Methods section serves as the statistical and conceptual basis for addressing the four questions (Q1-Q4) above. It starts with a description of the brainbehavior mediation model and its associated null-hypothesis testing alternatives.
Specific issues that arise in the context of typical fMRI experiments (factorial designs and condition contrasts, group-level random effects analysis, etc) are shortly discussed.
We then consider the critical role of neural noise in brain-behavior mediation analyses, and present alternative solutions to the issue of HRF deconvolution. We close this section with a note on causality and its accompanying interpretational issue. We address the latter using a complementary information-theoretic approach (so-called I/O test). En passant, we show how to exploit the underlying mathematical degeneracy to drastically reduce the computational cost of whole-brain mediation analysis. In the Results section, we use numerical Monte-Carlo simulations to answer questions Q1-Q4. We compare the specificity and sensitivity of candidate mediation tests, as a function of neural noise, and in the presence of hemodynamic confounds. We also evaluate the utility and robustness of our I/O test. We then strengthen our in-sillico conclusions with an application to an 7 experimental fMRI dataset acquired when people make decisions under risk. We exemplify the use of brain-behavior mediation analysis to ask questions regarding intraand between-subjects variations in behavioral responses and attitudes. Finally, we discuss our results in the light of the existing literature and highlight potential weaknesses and perspectives (Discussion section).

Methods
In what follows, we will consider behavioral paradigms akin to decision tasks, whereby subjects need to process some (experimentally-controlled) information X to provide a (measured) behavioral response Y . Brain-behavior mediation analysis then aims at identifying whether some (anatomically-specific) feature of their observed brain activity M mediates the effect of X onto Y . In our example fMRI application (see Results section), we will focus on a value-based decision making task, whereby participants have to accept or reject (response Y ) a risky gamble composed of a 50% chance of winning a gain G and a 50% chance of loosing L (input information X ). But more generally, X is an experimental manipulation of some sort, M is a measure of neural activity at the time of processing the stimulus, and Y is some overt expression of the stimulus-induced covert mental state of interest. The brain-behavior mediation model Let n be the number of trials in a typical experimental session. Let X , M and Y be 1 n  column vectors encoding the trial-by-trial experimental manipulation, the brain's response to the experimental manipulation (e.g., the magnitude of the fMRI BOLD response to the stimulus at each trial, in some voxel or region of interest) and the behavioral response to the experimental manipulation, respectively. For the sake of mathematical simplicity, and without loss of generality, we will assume that X , M and Y have all been z-scored.
From the perspective of identifying the determinants of behavior, one may first ask whether X has an effect on Y or not. In its simplest mathematical form, this question reduces to considering the following simple linear regression model: where c is an unknown regression coefficient that measures the strength of the statistical relationship between the independent ( X ) and dependent ( Y ) variables, and Now, one may also ask whether M mediates the effect of X onto Y . In its simplest mathematical form, this question relies on the following pair of linear regression models: where the first equation expresses the fact that M responds to X (with some unknown susceptibility a ), and the second equation states that Y depends upon both M (with some unknown susceptibility b ) and X (with some unknown susceptibility ' c ). On may think of residuals 0 M  in terms of some form of neural noise, because they capture trialby-trial variations in M that are independent of X . As we will see, they play a pivotal role in brain-behavior mediation analysis.
Although simple, Equation 2 does not explicitly quantify the size of a mediated effect. But this can be done by noting that Equation 2 can be rewritten as follows: where M has simply been replaced by its expression from Equation 2. Equation 3 is helpful in realizing that the total effect of X onto Y is partitioned into a direct effect (whose size is ' c ) and an indirect effect (whose size is ab ). This distinction is important because the latter is the effect of X onto Y that is mediated by M . This is why established mediation tests rely on assessing the statistical significance of the indirect effect (MacKinnon et al., 2007). Note that so-called full mediation occurs when '0 c  (no direct path), and one speaks of partial mediation whenever '0 c  .
Importantly, when we perform mass-univariate mediation analysis, we effectively consider each voxel or region of interest in isolation, and ask whether the local indirect effect is statistically significant. If mediation tests are repeated over voxels, then they form a statistical mediation map, which can localize which brain structure(s) mediate(s) the effect of X onto Y . In this context, Equations 1-3 have two interesting implications, which we will highlight now.
To begin with, recall that the incoming information X is processed by a distributed brain system, whose elements (sampled across large voxel sets) concurrently contribute to the behavioural response Y . The structure of this distributed brain system is likely to involve multiple processing pathways that work both in series and in parallel, as in Figure 1 below.

10
In simple bottom-up hierarchical architectures such as this one, lower levels would correspond to e.g., occipital low-level visual processes, whereas higher levels would map to, e.g., prefrontal decision making processes. Clearly, Figure 1 is already an oversimplification because it ignores reciprocal connections, branching processes and/or context-dependent gating mechanisms (Friston, 2011;He and Evans, 2010;Rubinov and Sporns, 2010). But when we perform mass-univariate brain-behavior mediation analysis, we reduce the complexity even further by considering each voxel or region of interest in isolation, effectively ignoring any hierarchical structure of this sort. Here, X and Y encode the experimental manipulation and the ensuing behavioral response, respectively. Variables i j M are activity within brain regions that act as intermediary processing steps. In this oriented graphical model, arrows represent causal relationships. Although processing pathways operate both in series and in parallel, mass-univariate brain-behavior mediation analysis ignore this and treat each region/voxel independently of each other (red dotted arrows).
First, given the likely parallel nature of processing pathways, one would not expect that any isolated voxel or region of interest may ever fully mediate the impact of X onto Y .
The implicit assumption of mass-univariate brain-behavior mediation analysis is that, in each voxel, the direct path ' c effectively captures, in a non specific manner, mediated effects that go through other (parallel) pathways. This, however, places a very heavy load on the statistical sensitivity of mediation tests, which need to be able to detect potentially small indirect effect sizes, even when correcting for multiple comparisons (e.g., across voxels).
Second, nothing prevents different processing pathways to have strong but opposing impacts on the behavioral response. An example here would be opponent brain systems that yield strong but ambivalent (e.g., appetitive-aversive) cognitive states, whose idiosyncratic balance may explain one's specific ability to suppress e.g., impulsive behavioral responses (Seymour et al., 2005;Zhang et al., 2017). In particular, if the impact of different pathways balance out, then the total effect of X onto Y may become undetectable ( 0 c  ). It follows that brain-behavior mediation analyses may be required for faithfully identifying the determinants of behavior. Alternatively, the relative contribution of different pathways may vary across individuals, which may drive interindividual behavioral differences. We will see an example of this in the Results section below.

Statistical tests of mediation
In what follows, we recall the most established approaches to null-hypothesis testing of mediated effects. We start with the premise that if M mediates the effect of X onto Y , then the corresponding indirect effect has to be different from 0 ( 0 ab  ). In what 12 follows, we summarize two kinds of statistical testing approaches (namely: the "indirect" and the "conjunctive" approaches) that differ in terms of how they frame the corresponding null hypothesis.
The indirect approach follows from noting that the null hypothesis of mediation analysis can be framed as follows: Under the simple brain-behavior mediation model in Equations 1-2, the indirect effect equates the difference between total and direct effects, i.e.
' ab c c  . This is why early approaches to mediation testing were assessing the statistical significance of the difference ' cc  (Baron and Kenny, 1986). However, theoretical work demonstrated that this equivalence may not always hold (Pearl, 2012), which would render the ensuing test invalid. This applies to typical fMRI experiments, because of the effect of confounding variables on path coefficient estimates.
Another, more valid, approach is to compare estimates of the indirect effect to their distribution under the null. This is the principle of Sobel's test (Sobel, 1982). Recall that all parameters are identifiable from Equation 2, given X , Y and M . In particular, the ordinary least-squares (OLS) estimates â and b of unknown path coefficients a and b are given by (all X , Y and M variables are z-scored, see Appendix A for a mathematical derivation):

ˆ0
b  when M has an effect on Y above and beyond the effect of X . In addition, the variance of these OLS estimates are given by: Under the assumption that model residuals Y where  is Student's cumulative density function with appropriate degrees of freedom.
Later improvements over Sobel's test (Hayes and Scharkow, 2013) derived from theoretical statistical works on the distribution of the product of two normal random variables, which essentially include an additional 22â b   term to the denominator of Sobel's pseudo-zscore (Aroian, 1947;Goodman, 1960).
We refer to these extensions as Aroian's and Goodman's tests, respectively.
Alternatively, non-parametric approaches have been proposed to derive the distribution of indirect effect size estimates under the null (MacKinnon et al., 2002(MacKinnon et al., , 2004. Here, we will use the same bias-corrected bootstrap approach as the one proposed in the M3 toolbox (Wager, 2008).
The conjunctive approach follows from noticing that the null hypothesis of mediation analysis is composite (Moran, 1970), i.e.: null hypotheses are exactly equivalent, but the composite null highlights the fact there is no mediated effect as long as one path coefficient is null (which breaks the causal cascade). In turn, one may test for the conjunction of both effects, i.e. test for the statistical significance of both a and b path coefficients. In practice, conjunctive testing relies on the "maximum p-value" approach (here, two-tailed test):  (Friston et al., 2005a;Nichols et al., 2005). This is important, because conjunctive testing cannot be invalid but may have low sensitivity.
However, it is trivial to show that Sobel's pseudo z-score is always smaller than the conjunctive test statistics, i.e.: . This means that one would expect conjunctive testing to be systematically more efficient than Sobel's approach. At this point, we note that the sensitivity profile of indirect and conjunctive approaches actually depends upon neural noise strength and model misspecifications (see next sections).
We will address this and related issues in the Results section, using extensive numerical Monte-Carlo simulations.
We refer the reader interested in extending these statistical approaches to experimental designs including multiple conditions (cf., e.g., factorial designs) and/or multiples subjects (cf. group-level random effects analysis) to Appendix C and D, respectively.
The non-trivial impact of neural noise Although indirect and conjunctive null hypotheses are formally equivalent to each other, the latter is helpful to disclose the subtle tension behind mediation testing. In brief, two conditions must be satisfied for detecting a mediated effect: (i) strong evidence for 0 a  and (ii) strong evidence for 0 b  . The former means that X partly explains the trial-bytrial variability of M . And the latter means that M partly explains the variability of Y that is unexplained by X . The critical point here is to realize that these two conditions are in conflict with each other. This is because they have opposing demands on neural zt  , which tends also towards 0 when In other words, if neural noise strength is very high, then evidence for 0 a  is weak.
Only for intermediary levels of neural noise can evidence for both 0 a  and 0 b  reach statistical significance. We note that this observation generalizes to any mediation test, irrespective of the mathematical form of the brain-mediation model. We refer the interested reader to Appendix E.
We will quantify the impact of neural noise on the statistical efficiency of candidate mediation testing approaches in the Results section below. But this property of mediation analysis has an important implication, which we now highlight.
Recall the structure of the processing hierarchy in Figure 2. Within a given processing pathway, each hierarchical level responds to its (lower-level) parents, eventually changing the information content in an incremental manner, e.g.: In turn, one would expect that mass-univariate mediation analysis can only detect those neural information processing steps that are positioned at an intermediary hierarchical level, i.e. sufficiently far away from either end of the hierarchy. We will exemplify this in the Results section below.

Dealing with hemodynamic confounds
Clearly, the brain-behavior mediation model in Equation 2 cannot directly be applied to fMRI time series. The reason is twofold. First, behavioral and neural variables are not sampled in the same manner. In brief, the former is collected at each "trial" of the behavioral task, while the latter is typically sampled at a sub-trial temporal resolution.
Second, fMRI BOLD dynamics effectively result from the convolution of neural activity with the hemodynamic response function or HRF (Logothetis et al., 2001;Martin et al., 2006). This implies that the event-related BOLD response is delayed in time, when compared to trial onsets. In addition, if the inter-trial interval is smaller than the HRF duration (which is typically the case), BOLD signals measured during a trial may derive from the additive contributions of multiple neural responses (to the current and preceding trials). For the purpose of brain-behavior mediation analysis, there are essentially two ways of dealing with such hemodynamic confounds.
On the one hand, one may deconvolve BOLD signals from the HRF, as follows. Let k  (resp., k  ) be the onset time (resp., duration) of the k th trial in the experimental design.
One first construct "trial" regressors that span the duration of the fMRI session (at the sampling resolution of fMRI ; typically: TR=1-2secs), which are zero everywhere except during the time interval defined as   , k k k  . Each of these is then convolved with the canonical HRF and its temporal derivatives, to account for potential mismatches in hemodynamic delays (Liao et al., 2002). One then augment the resulting GLM with fMRI confounds (e.g., motion regressors and slow drifts), and fit it to fMRI time series. Fitted regressor weights at each voxel thus provide an estimate M of the local neural response to each trial, which is deconvolved from the HRF and corrected for typical fMRI confounds, and can then enter a mediation analysis. We call this the deconvolution approach.
On the other hand, one may reframe the brain-behavior mediation model in the HRFconvolved space. One first resample the explanatory and dependent variables at the fMRI temporal resolution by reweighing each "trial'" regressor above with its corresponding X and Y entries and them summing over trials. One then convolves the resulting regressors with the canonical HRF (and its temporal derivatives) and augment the resulting GLM with fMRI confounds prior to entering a mediation analysis. We call this the convolution approach.
Both approaches can, in principle, deal with hemodynamic and other fMRI confounds, but they differ in terms of their respective bias-variance tradeoff. The convolution approach effectively yields reliable neural response estimates, under the implicit assumption that the HRF is identical across trials. In contrast, the deconvolution approach allows for trial-by-trial variations in HRF, at the cost of compromising the reliability of neural response estimates.
In the Results section below, we evaluate the robustness of these two strategies w.r.t.
deviations to canonical HRF models.
A note on causality    This has two important consequences.
First, one may rely on Equations 11-12 to improve the computational efficiency of brainbehavior mediation analysis by several orders of magnitude. Recall that in the context of whole-brain fMRI, working with regression models where fMRI signals only enter as dependant variables is computationally very advantageous. This is because many algebraic operations that are required for parameter estimation (e.g., here, matrix multiplications and inversions, etc) can be computed once and for all. In brief, the computational gain of performing brain-behavior mediation analysis using Equations 9-10, when compared to Equation 2, is of the order of

22
This alternative causal interpretation ( YM  ) is not as nonsensical as it may first sound. For example, somatosensory cortices will respond to variations in motor actions, eventually enabling proprioceptive sensations. More generally, a given brain system may be collecting and/or processing information regarding overt behavior (which may have been produced elsewhere in the brain) for the purpose of, e.g., learning, memory, metacognition, etc... In any case, this interpretational issue is important, because the implicit intention behind brain-behavior mediation analysis is clearly to provide statistical evidence for the "native" causal scenario ( X M Y  ). We will comment on this and related issues in the Discussion section of this manuscript.
One may think that affording evidence for the "native" causal claim of brain-behavior mediation analysis may require non observational studies, e.g., causal perturbations of neural activity (lesion studies, transcranial magnetic stimulation, etc). Nevertheless, we argue that one may perform complementary data analyses that may partially address the interpretational issue above. For example, having assessed the significance of a mediated effect, one may exploit locally multivariate information to provide statistical evidence for or against candidate causal claims. In fact, when considering the set of mediator variables within a significant cluster together, "native" ( MY  ) and "swapped" ( YM  ) causal interpretations induce a many-to-one and a one-to-many M-Y mapping, respectively (see Figure 3 below). Because "native" and "swapped" causal scenarios differ in terms of whether Y is viewed as an input or as an output of local neural information processing, we refer to the ensuing test statistics as an I/O test statistics.

Figure 3: Candidate multivariate input/output M-Y mappings.
Left panel: "native" causal interpretation (many-to-one input/output mapping, Y as an output). Right panel: "swapped" causal interpretation (one-to-many input/output mapping, Y as an input).  is the correlation between X and Y (Marrelec et al., 2005). In turn, Equation 12 can be rewritten as follows: It turns out that the sign of  provides evidence in favor or against the native causal interpretation of the brain-behavior mediation model.

Results
In what follows, we will be comparing five testing approaches: Sobel's test, Aorian's test, Goodman's test, the M3 bootstrap test, and the conjunctive approach, in terms of their statistical sensitivity and specificity. Using numerical simulations, we will assess the impact of neural noise and deviations to HRF assumptions. Taken together, these insilico experiments will serve to address questions Q1 to Q3. Using further numerical simulations, we will demonstrate the utility of our I/O test statistics for addressing the main interpretational issue of brain-behavior mediation analysis (Q4). Finally, we will report the results of a brain-behavior mediation analysis in the context of an fMRI experiment on decision making under risk.

26
Comparing the statistical specificity and sensitivity of testing approaches First, we ask whether candidate testing approaches yield valid inferences, i.e. whether they allow for a faithful control of false positive rate. To address this question, we simulated data under three different variants of the null hypothesis. More precisely, we Second, we asked how sensitive are candidate testing approaches under moderate mediated effect sizes. Here, we simulated 40,000 datasets with Equation 2, using 12 ab  , and measured the (true positive) detection rate of each candidate testing approach, as one varies the significance threshold  .
The results of these analyses are summarized on Figure 4 below.

Figure 4: Statistical specificity and sensitivity of variants of mediation significance testing approaches.
Left panel: The sensitivity of mediation tests (y-axis) is plotted against the significance threshold α (x-axis), for each candidate testing approach (dark blue: conjunctive testing, blue: M3 bootstrap indirect approach, light blue: Goodman's indirect approach, green: Sobel's indirect approach, yellow: Aorian's indirect approach). Upper right panel: The specificity of mediation tests (y-axis) is plotted against the significance threshold, for H0: a=0 and b=1/2 (same format as left panel). Middle right panel: Same format as upper right panel for H0: a=1/2 and b=0. Lower right panel: Same format as upper right panel for H0: a=b=0.
As expected, the conjunctive test is more sensitive than both Sobel and Aorian tests.
Slightly more surprising maybe is the fact that the conjunctive test turns out to also be more sensitive than Goodmans test and the M3 bootstrap test, though the latter reach similar sensitivity levels for significance thresholds higher than 0.001. We will refine our evaluation of statistical sensitivity when assessing the impact of neural noise below.
In addition, all approaches except Goodman and the M3 bootstrap tests are valid, i.e. they yield a false positive rate that is equal or smaller than the significance threshold  .
Goodman's test always yield invalid inference if the significance threshold is small enough, whereas the M3 bootstrap test only yields invalid inference when 0 b  . Note 28 that the conjunctive approach is the least conservative of all tests, and this difference grows when the significance threshold decreases.

Assessing the impact of neural noise
Recall that the magnitude of neural noise is expected to play a critical role for the statistical sensitivity of mediation analysis. To demonstrate this effect, we simulated 10,000 datasets using the same parameter settings as above, except for neural noise magnitude, which we varied from  Left panel: The sensitivity of mediation tests (y-axis) is plotted against the variance of neural noise (x-axis), for each candidate testing approach (same format as Figure 4), when a=b=1/2. Chance level is indicted using a black dotted line. Right panel: The sensitivity of conjunctive mediation tests (y-axis) is plotted against the variance of neural noise (x-axis), when varying path coefficients (dark blue: a=2 and b=1/2, blue: a=1 and b=1/2, cyan: a=b=1/2, orange: a=1/2 and b=1, yellow: a=1/2 and b=2).
All testing approaches have a similar sensitivity profile, which follows a bell-shaped function of neural noise magnitude, with an apex around 0 1 M Var 

  
. This corresponds to a situation in which about 20% of the trial-by-trial variance in M is explained by X . As the amount of explained variance in M departs from this nominal level, the sensitivity of mediation analysis effectively tends towards chance level. Now, everything else being equal, increasing a or the variance of X eventually inflates sensitivity on the right tail of the sensitivity profile, while increasing b rather boosts its left tail. This moves the position of sensitivity apex towards smaller and stronger noise variance, respectively (see Figure 5, right panel). Now, in the Methods section above, we reasoned that the expected sensitivity profile of mediation analysis should eventually favor the detection of neural information processing steps that are positioned away from either end of the processing hierarchy. In what follows, we compare candidate testing approaches w.r.t. their ability to detect levels in a simple feed-forward hierarchy. In brief, we simulated 1,000 datasets under Equation 10, using 100 intermediary network nodes. In all simulations, initial and final path coefficients were set to 0 12 ab  and all intermediary path coefficients were set to 1 i ai  . In addition, the variance of all independent variables were set to unity except for the local neural noise increments, whose standard deviation was set to 0.3. Following the principle of mass-univariate mediation analysis, a mediation test was then performed on each node in isolation (significance threshold:

 
). For each network node, the ensuing 30 (true positive) detection rate was then measured across the 1,000 simulations. The result of the ensuing detection profile is shown on Figure 6 below. The sensitivity of mediation tests (y-axis) is plotted against the hierarchical level of candidate mediators along the serial processing pathway (x-axis), for each candidate testing approach (same format as Figure 4).
As expected, local neural noise increments accumulate along the hierarchy, effectively increasing the neural noise level estimate as the hierarchical level increases. In turn, the detection profile also follows a bell-shaped function of hierarchical level, such that lower and higher hierarchical levels are less easy to detect. Interestingly, one can also see that different testing approaches have different sensitivity profiles. In particular, one can see that the conjunctive approach exhibits a higher sensitivity than all other approaches, irrespective of the hierarchical level of interest. Note that the M3 bootstrap test is better than other indirect approaches for intermediate hierarchical levels, but eventually loses its competitive advantage for higher hierarchical levels.
Assessing the robustness to deviations from hemodynamic assumptions Despite the inclusion of HRF derivatives in the mediation model, deviations to the canonical HRF can impair test sensitivity. In this section, we compare the robustness of convolution and deconvolution approaches to unanticipated delays in HRF. We thus simulated 100 datasets using the same parameter settings as above, except that we varied systematically the HRF delay, effectively inducing a shift with the canonical HRF ranging from -5 second to 5 seconds. Each dataset was then analyzed using all (indirect and conjunctive) testing approaches, under both convolution and deconvolution strategies (with the canonical HRF and its delay derivative). For each HRF delay shift, we then  The sensitivity of mediation tests (y-axis) is plotted against the HRF shift (x-axis), for each candidate testing approach (same format as Figure 4, thick lines: convolution approach, thin lines: deconvolution approaches).
All mediation analysis strategies exhibit a bell-shaped sensitivity profile, eventually peaking when there is no deviation to the canonical HRF (i.e. when the HRF delay shift is null). Also, when there is no deviation to the canonical HRF, deconvolution and convolution strategies yield similar test sensitivity. However, when the deviation to the canonical HRF increases, the loss of statistical sensitivity is much stronger for deconvolution than for convolution approaches. For example, with a (realistic) delay shift of 3 seconds, most deconvolution approaches lose about 10% to 15% sensitivity on average. In comparison, convolution approaches only lose about 2% sensitivity. In addition, the conjunctive approach always exhibit higher sensitivity than indirect approaches, irrespective of whether one chooses a convolution or deconvolution strategy.
We note that, with a significance threshold of 0.05   , deviations to the canonical HRF has no adverse effect on the validity of statistical tests, i.e. all mediation test approaches yield 5% or less false positive rates under the null.

Addressing the interpretational issue of brain-behavior mediation analysis with the I/O test statistics
Recall that a significant mediated effect may have two distinct causal interpretations: the behavioral variable may either be an input ( YM  ) or an output ( MY  ) of the brain region where the null has been rejected. To address this issue, we proposed a simple I/O test statistics, whose sign is expected to discriminate between these two scenarios.
Here, we evaluate the utility of the I/O test statistics  , in conditions similar to our fMRI data analysis below, using numerical Monte-Carlo simulations.
First, we simulated data under three scenarios:  H1 (native causal scenario MY  ): the independent variable X is sampled under a normal distribution, each multivariate mediator unit i M is set to a noisy affine transformation of X (with random weights), and the dependent variable Y is set as a noisy mixture of X and all mediator units (with random weights).  H2 ("swapped" causal scenario YM  ): the independent variable X is sampled under a normal distribution, the dependent variable Y is set as a noisy affine transformation of X (with a random weight) and each multivariate mediator unit i M is set to a noisy mixture of X and Y (with random weights).  On can see that the three scenarios are very well discriminated. In particular, the distribution of the  under the null is centered on zero, and lies in between its distribution under H1 and under H2. Moreover, and as expected, These simulations however, do not account for the limitations that arise in realistic settings. In particular, the number of neural units or voxels that compose the multivariate set of mediators may largely exceed the number of trials. Here, a pragmatic solution is to perform a PCA decomposition, and keep the K first principal components to summarize the within-region variability. We now ask whether the ensuing I/O test statistic is robust to this dimension reduction. In brief, we performed the same set of simulations as above, this time simulating 100 mediating units and deriving the test statistics from the ensuing K=20 first principal components. The resulting Monte-Carlo distributions are shown on the right panel of Figure 8.
One can see that the dimension reduction strongly reduces the range of variation of the I/O test statistics, when compared to the situation above, where all the relevant variation is available. Furthermore, the magnitude of  under scenario H1 and H2 is asymmetrical. More precisely, one can see that In other terms, when relevant information is lacking, the average evidence in favor of H1 is weaker than the average evidence in favor of H2. Nevertheless, the sign of the I/O test statistics can still be interpreted as evidence for or against the native causal interpretation of the brainbehavior mediation model, i.e. Here, we perform a brain-behavior mediation analysis of previously acquired fMRI data (Chen, 2014), which is openly available as part of the OpenFMRI project (Poldrack et al., 2013). In this study, 60 participants made a series of 64 accept/reject decisions on risky gambles. On each trial, a gamble was presented, entailing a 50/50 chance of gaining an amount G of money or losing an amount L (so-called "baseline" condition). Participants were told that, at the end of the experiment, four trials would be selected at random: for those trials in which they had accepted the corresponding gamble, the outcome would be decided with a coin toss, and for the other ones -if any-, the gamble would not be played. All 64 possible combinations of G/L pairs (10$<G<40$, 5$<L<20$) were presented across trials, which were separated by 7 seconds on average (min 6, max 10).
MRI scanning was performed on a 3T Siemens Prisma scanner. High-resolution T1w structural images were acquired using a magnetization prepared rapid gradient echo (MPRAGE) pulse sequence with the following parameters: TR = 2530 ms, TE = 2.99 ms, FA = 7, FOV = 224 × 224 mm, resolution = 1 × 1 × 1 mm. Whole-brain fMRI data were acquired using echo-planar imaging with multi-band acceleration factor of 4 and parallel imaging factor (iPAT) of 2, TR = 1000 ms, TE = 30 ms, flip angle = 68 degrees, in-plane resolution of 2X2 mm 30 degrees of the anterior commissure-posterior commissure line to reduce the frontal signal dropout, with a slice thickness of 2 mm and a gap of 0.4 mm between slices to cover the entire brain. See https://openneuro.org/datasets/ds000053/versions/00001 for more details.
Data preprocessing included standard realignment and movement correction steps. Note that we excluded 2 participants, either due to missing information or because the misalignment between functional and anatomical scans could not be corrected.
We first regressed, for each participant, the observed choices against gains and losses (Equation 1). This yielded estimates of the total effects ˆG c and ˆL c of gains and losses, respectively. This also provided an estimate ˆY X  of the behavioral residuals' standard deviation. The results of this analysis are shown on Figure 9 below. In brief, both gain and loss factors have a significant effect on decisions under risk (gain factor: p<10 -5 , loss factor: p<10 -5 ). We note that, together, gain and loss factors explain on average 44.6% (std: 24.2%) of the trial-by-trial variance on participants' decisions (average balanced accuracy: 84.84%, std: 9%).
For each participant, we also derived a loss-aversion index:

 log
LG cc   , which is positive when losses have a stronger weight on accept/reject decisions than gains. One can see that the average loss-aversion index is significant (p<10 -5 ), i.e. losses have more weight on participants' decisions than gains.
Then, we analyzed fMRI time series (at the within-subject level) using both convolution and deconvolution approaches.
The convolution strategy relied upon the following two GLMs: The deconvolution strategy was implemented as follows. First, we fitted GLM3, which included "trial" regressors (temporally aligned with the gamble presentation) as well as basic fMRI confounds. Regression weight estimates yielded local trial-by-trial neural re-sponses M . Maps of path coefficients estimates â and b were obtained using Equation 4, given local neural responses M . Random-effect group-level inference on the mediation of gain and loss factors was then performed by reporting group averages of path coefficients â and b , after 8mm FWHM smoothing. We applied all indirect and conjunctive approaches except for the M3 bootstrap method (because of its limited statistical gain, when compared to its computational cost). For all approaches, we used unsigned (two-tailed) tests with standard random field theory (RFT) correction for whole-brain multiple comparisons correction.
In brief, no mediation testing approach based upon the deconvolution strategy reached statistical significance. This was the case even when using more lenient corrections for multiple comparisons (e.g., FDR). This was however not the case for mediation analyses based upon the convolution strategy. Here, indirect approaches yielded group-level significant clusters at low-set inducing thresholds (p=0.01 or p=0.05, uncorrected). In what follows, we discard these results as these thresholds are known to violate RFT assumptions (Flandin and Friston, 2019).  Upper panel: the six significant clusters of brain-behavior mediation analysis (conjunctive/convolution approach) are shown on axial (up), sagittal (middle) and coronal views (bottom). All maps used a default set-inducing threshold of correction p=0.001 uncorrected (red areas) for the RFT correction, except the bilateral dmPFC's map where with p=0.0002 uncorrected (yellow areas) in order to separate the two hemispheres. Lower panel: The ensuing between-subject empirical distribution of the I/O test statistics λ (y-axis, group-level mean ±standard deviation) is shown for each significant clusters (x-axis).
We note that regions contralateral to significant unilateral mediators of the gain effect were all close to statistical significance: c.f. left SMG (p=0.094, RFT-corrected) and left anterior vlPFC or BA45 (p=0.177, RFT-corrected).
At the very least, these analyses demonstrate the superior statistical efficiency of conjunctive/convolution approaches. In brief, no other candidate variant of mediation analysis yields positive results on this dataset. Now, the significant mediated effects above may have two distinct causal interpretations.
To afford evidence in favor or against the "native" causal claim of brain-behavior mediation analysis, we derived, for each participant and each significant cluster, our I/O test statistics. Note that, prior to the analysis, we summarized the trial-by-trial variance in each cluster using the 20 first principal components from the within-cluster PCA decomposition (on average cross clusters and participants, these preserve 89% ±2% of the trial-by-trial variance). The group-level empirical distribution of  is shown on the lower panel of Figure 10, for each of the 6 significant clusters. Reassuringly, all clusters exhibit strong evidence in favor of the "native" causal interpretation of brain-behavior mediation analysis, i.e. 0   for all subjects and all clusters. We will comment on these results in the Discussion section. Now, the effect of experimental factors seems to be mediated by a set of anatomically segregated regions in the brain. These regions are likely to be organized into a functional network (cf. Figure 1 above), eventually exerting competitive and/or cooperative influences on behavioral responses. The analysis above is agnostic about the functional architecture of this network. However, the extent to which each of these network nodes actually mediates the effect of gains and losses onto choices varies across subjects.
Thus, a given individual may have an idiosyncratic structure of brain pathways for pro-42 cessing gain and loss information. In turn, inter-individual differences in the pattern of mediated effect sizes may have behavioral consequences in terms of how strongly gains and/or losses impact decisions under risk.
Recall that the balance between the behavioral effects of gains and losses is measured using the loss aversion index

 log
LG cc   (cf. Figure 9). One may thus ask whether the pattern of mediated effect sizes predicts loss aversion. We thus extracted, in each voxel of the 6 significant mediating clusters above, the indirect effect size Ĝ ab , and average these within each cluster. This resulted in 6 region-specific indirect sizes per participant. We then regressed loss aversion indices against (log-transformed) indirect effect sizes, across participants. The results of this analysis are summarized on Figure 11 below.

Figure 11: Inter-individual differences in loss-aversion.
Left panel: regression coefficients of the analysis of inter-subject differences of loss aversion (grey: constant term, blue: weight of inter-individual differences in cluster-averages of indirect effect size). Errorbars depict standard errors of the mean. Right panel: Observed (y-axis) versus predicted (x-axis) loss aversion. Each dot is a participant.
First, an omnibus F-test shows that the pattern of indirect effect sizes significantly predicts loss aversion (F=5.13, dof=[7,51], R 2 =12.8%, p=2x10 -4 ). This is important, since this means that one can think of loss aversion in terms of a trait that is partly determined by the relative contribution of processing pathways that mediate the effect of gains onto decisions under risk. In addition, one can see that loss aversion increases when the indirect effect size in the left posterior dmPFC increases (t=2.66, dof=51, p=0.01). No indirect effect size in any other region has a significant effect on loss aversion (all p>0.29).

Discussion
In this work, we identified the specific challenges of brain-behavior mediation analysis. In particular, we evaluated the specificity and sensitivity of five statistical tests, including so-called indirect and conjunctive approach. In brief, the conjunctive approach systematically shows higher sensitivity, while yielding valid inference. In addition, we disclosed the non-trivial impact of neural noise, and assessed the robustness to deviations from fMRI modelling assumptions. The former implies that brain-behavior mediation analysis cannot detect mediators that are too close from either end of the neural information processing hierarchy. In-silico investigations of the latter eventually favor the convolution approach to HRF modelling. We also disclosed some interpretational issues of mediated effects, in particular: significant mediated effects have two distinct causal interpretations.
Importantly, this causal degeneracy may be partially addressed using complementary multivariate I/O test statistics. In addition, it has unexpected favorable computational consequences for whole-brain mediation analysis. Lastly, brain-behavior mediation analysis of fMRI data acquired in the context of decisions under risk further demonstrated the importance of methodological choices regarding brain-behavior mediation analysis.
Eventually, the conjunctive/convolution test approach showed that the right SMG, bilateral posterior dmPFC, right anterior vlPFC and bilateral posterior dlPFC mediate the effect of prospective gains on decisions under risk. Group-level I/O test statistics provided evidence that these regions are contributing to shaping behavioral responses (in a feedforward, causal, manner), rather than collecting and/or processing information about it (cf. interpretational issue). Finally, we showed that inter-individual differences in loss aversion is partly determined by the relative contribution of these six regions to behavioral control.
Taken together, our numerical simulations and analyses of experimental fMRI data demonstrated that conjunctive testing has higher statistical sensitivity than indirect approaches. This is true even for the bias-corrected M3 bootstrap test, despite its huge computational cost. We note that the sensitivity of the M3 bootstrap test may, in principle, be improved by increasing the number of permutations used to approximate the null distribution (here: 1,000). This however, would render whole-brain analysis excessively slow. Note that M3 bootstrap and conjunctive tests had already been compared at the standard 5% significance threshold outside the context of fMRI (Hayes and Scharkow, 2013). Although authors noted that bias-corrected bootstrap tests were slightly invalid (false positive rate greater than 5%), they recommended them because they eventually yielded more reliable confidence interval estimations. We extended these simulations, eventually showing that the invalidity of bias-corrected bootstrap tests increases as one relies on more stringent significance threshold (cf. Figure 3), which is required when correcting for multiple comparisons. For all these reasons (test validity, statistical sensitivity and computational cost), we would rather favor conjunctive testing for mass-univariate brain-behavior mediation analysis.
Although computationally expedient, mass-univariate brain-behavior mediation analysis essentially relies upon an incomplete model. Not only is it agnostic about the structure of the distributed brain system that process the incoming information (cf. Figure 1)  would be partly artefactual, they would still be very informative to predict behavioral responses Y above and beyond the linear effect of X . Now, let us assume that a neurocognitive model was available, that would describe how incoming information X would be distorted, transformed and integrated with other (potentially incidental) processes, along the processing hierarchy. For example, such model may derive from recent work in theoretical neuroscience regarding population coding (Averbeck et al., 2006;Georgopoulos et al., 1986), predictive coding (Bastos et al., 2012;Hosoya et al., 2005;Rao and Ballard, 1999) or efficient coding (Barlow, 1961;Doi and Lewicki, 2011). Or it could rely on agnostic multivariate and/or nonlinear decompositions that, when properly parameterized, would account for all sorts of complex relationships between X and M . In any case, if the model was complete enough, then observed neural activity would not strongly deviate from its predictions. This would preclude the statistical detection of mediated effects. Ironically speaking then, progress in modelling neural information processing may eventually hinder the statistical efficiency of brain-behavior mediation analysis. More practically, this means that statistical brain-mediation analysis may be used in an exploratory manner, to identify brain regions that contribute to behavioral control. Further, complementary, model-based approaches to neural information processing would then help reducing one's epistemic uncertainty regarding neural noise.
For example, artificial neural network modelling may be useful to identify either the structure of processing pathways (Rigoux and Daunizeau, 2015) and/or the impact of incidental biological constraints that may distort local neural information processing (Brochard and Daunizeau, 2020). This is not to say, however, that statistical brain-behavior mediation analysis cannot be improved.
For example, one may aim at providing more informative inferences regarding the structure of the underlying processing hierarchy. A possibility here is to merge mediation analysis with existing graph analysis techniques that were developed for assessing effective connectivity in the brain (Alstott et al., 2009;Smith et al., 2011;Sporns, 2013).
Another, less exhaustive but simpler, solution is to work iteratively: having identified a brain region that significantly mediates the XY  effect, one may then look for other brain regions that would mediate both XM  and MY  relationships, and repeat on subsequent mediators. Note that this would require additional corrections for the natural dependencies between brain regions. We refer the interested reader to Van Kesteren & Oberski (2019) for an interesting first step in this direction.
We also think that progress can be made regarding the main interpretational issue of brain-mediation analysis. In this context, let us highlight two extensions of linear massunivariate approaches that sound promising.
First, one may rely on more stringent inferences regarding the causality of the MY  relationship (Preacher, 2015). For example, temporal precedence may be accounted for, and inserted in the brain-mediation model using variants of Granger causality (Zhao and Luo, 2017). Note that special care must be taken regarding hemodynamic delays, whose variations across brain regions may confound temporal precedence. In particular, established fMRI applications of Granger causality are known to be prone to such confounds (David et al., 2008;Deshpande et al., 2010;Zhao and Luo, 2017). Nevertheless, constraining the brain-behavior mediation model with temporal precedence would likely reduce spurious inferences.
Second, one may exploit locally multivariate information to discriminate between manyto-one ( MY  ) and one-to-many ( YM  ) input/output mappings. In this work, we proposed a first step in this direction: namely, the I/O test statistics  . Numerical simulations demonstrated the utility of this information-theoretic measure, and its robustness to partial information losses that result from data dimensionality reduction. However, this work falls short of an exhaustive analytical treatment of I/O test statistics. For example, neither did we investigate whether and how nonlinearities in causal relationships confound and/or bias  estimates, nor did we derive a formal statistical test of the significance of  estimates. We note that the difficulty here, is that the null hypothesis may not be the most useful reference point for I/O test statistics. Rather, one aims at comparing two alternative non-nested models. Therefore, an optimal statistical treatment of I/O tests statistics may be best approached using a bayesian approach (Kass and Raftery, 1995;Liu and Aitkin, 2008). We will pursue this and related extensions of I/O test statistics in subsequent publications.
Finally, let us discuss the results of our fMRI analysis. Recall that we identified six candidate mediators of the effect of gain onto decisions under risk. Among these, the posterior dmPFC was previously shown to regulate speed-accuracy tradeoffs (Forstmann et al., 2008) and its anatomical lesion is known to impair inhibitory control in the presence of response conflict (Nachev et al., 2007). Also, decades of neuroimaging, stimulation and lesions studies have evidenced the role of posterior dlPFC and vlPFC cortices in cognitive control (Gbadeyan et al., 2016;Levy and Wagner, 2011;MacDonald et al., 2000;Miller and Cohen, 2001;Nee and D'Esposito, 2017;Soutschek and Tobler, 2020).
In addition, functional and anatomical studies report convergent evidence that the right SMG is crucial for regulating emotional responses (Adolphs, 2002;Makovac et al., 2016;Silani et al., 2013). Now, in the context of decisions under risk, automatic fear responses may induce a default tendency to reject risky gambles, eventually yielding loss aversion (Martino et al., 2006(Martino et al., , 2010. This default emotional bias may enter in conflict with the appetitive effect of prospective gains. Whether the appetitive dimension of gambles eventually dominates automatic fear responses may then depend on the potentiation of emotional responses and on the efficiency of downstream cognitive control, which would explain why the SMG, dmPFC, dlPFC and vlPFC cortices mediate the effect of gain on decisions. This is also in line with our analysis of inter-individual differences of loss aversion, which shows that peoples' loss aversion increases when the indirect effect size (of gains on accept decisions) in the left dlPFC pathway increases. This is because a strong involvement of the dlPFC pathway may signal inefficient cognitive control (Braver et al., 2010;Poldrack, 2015), which would result in loss aversion worsening.
Although quite self-consistent and elegant, this interpretation really relies on the "native" causal interpretation of brain-behavior mediation analysis. So what if we had not found support for this causal scenario with our I/O test statistics? In fact, the existing literature may also be queried to find past evidence that may be more compatible with the alternative causal interpretation of brain-behavior mediation analysis. For example, beyond its well-known implication in language processing, the right SMG has been shown to be involved in somatosensory perception (Ben-Shabat et al., 2015;Tunik et al., 2008). Under this perspective, evidence for 0 b  (or, equivalently, 0 d  ) may be interpreted in terms of low-level perceptual representations of (motor?) action plans. We note that the experimental design is compatible with this interpretation because the spatial arrangement of accept/reject responses is not randomized over trials (Chen, 2014). This highlights the need for developing approaches that reduce the causal ambiguity of simple brain-behavior mediation analyses.