Power, positive predictive value, and sample size calculations for random field theory-based fMRI inference

Recent discussions on the reproducibility of task-related functional magnetic resonance imaging (fMRI) studies have emphasized the importance of power and sample size calculations in fMRI study planning. In general, statistical power and sample size calculations are dependent on the statistical inference framework that is used to test hypotheses. Bibliometric analyses suggest that random field theory (RFT)-based voxel- and cluster-level fMRI inference are the most commonly used approaches for the statistical evaluation of task-related fMRI data. However, general power and sample size calculations for these inference approaches remain elusive. Based on the mathematical theory of RFT-based inference, we here develop power and positive predictive value (PPV) functions for voxel- and cluster-level inference in both uncorrected single test and corrected multiple testing scenarios. Moreover, we apply the theoretical results to evaluate the sample size necessary to achieve desired power and PPV levels based on an fMRI pilot study.

Introduction
A fundamental goal of task-related functional magnetic resonance imaging (fMRI) is to identify the cortical correlates of cognition. An approach routinely used to achieve this goal is mass-univariate null hypothesis significance testing in the framework of the general linear model (Friston et al., 1994; Poline and Brett, 2012; Cohen et al., 2017). In the recent debate on the reproducibility of research findings in the life sciences, the statistical practices of fMRI research have once again taken centre stage in the community discourse (e.g., Eklund et al., 2016; Mumford et al., 2016; Poldrack et al., 2017; Eklund et al., 2019; Flandin and Friston, 2019). Here, a particular emphasis has been on statistical power and its relation to typical sample sizes in fMRI group studies (Button et al., 2013; Guo et al., 2014; Szucs and Ioannidis, 2016; Cremers et al., 2017; Geuter et al., 2018; Turner et al., 2018). In task-related fMRI, statistical power is broadly defined as the probability of detecting cortical activation, if this activation is indeed present. In general, statistical power and, consequently, methods for computing the sample sizes necessary to achieve desired levels of power depend on both the statistical inference framework used and assumptions about the expected cortical activation.
A prominent statistical inference framework for null hypothesis significance testing in fMRI research is based on random field theory (RFT) (Worsley, 2007; Friston, 2007; Nichols, 2012; Ostwald et al., 2018). RFT-based fMRI inference is a parametric framework that allows for controlling the multiple testing problem inherent in the mass-univariate approach. Technically, this framework rests on analytical approximations to the exceedance probabilities of topological features of data roughness-adapted random field null models. RFT-based fMRI inference is implemented in the two major data analysis software packages used by the neuroimaging community, namely, Statistical Parametric Mapping (SPM) and the Functional Magnetic Resonance Imaging of the Brain (FMRIB) Software Library (FSL). It encompasses up to five forms of statistical testing: uncorrected and corrected voxel-level inference, uncorrected and corrected cluster-level inference, and set-level inference (Friston et al., 1996). With the exception of set-level inference, all forms are routinely reported in the functional neuroimaging literature. More specifically, bibliometric analyses suggest that RFT-based fMRI inference, especially corrected cluster-level inference, accounts for approximately 70% of published task-related human fMRI studies (Supplement S.1).
In light of the widespread use of RFT-based inference, previously proposed approaches for the calculation of power and sample sizes in fMRI research have a number of shortcomings.
First and foremost, most previously proposed frameworks are not well aligned with the theory of RFT-based fMRI inference (e.g., Desmond and Glover, 2002; Mumford and Nichols, 2008; Durnez et al., 2016), rendering them inapplicable to the most commonly employed forms of fMRI inference. Second, the framework previously proposed by Hayasaka et al. (2007) and Joyce and Hayasaka (2012), which is aligned with the theory of RFT-based fMRI inference, only addresses voxel-level and not cluster-level inference. Moreover, this framework does not address the variety of power types that arise in multiple testing scenarios and thus remains imprecise with respect to the interpretation of its ensuing power and sample size values. Third, all previous frameworks assume that under the alternative hypothesis, cortical activation is expressed either in a known region of interest or over the entire cortex. Notably, neither of these assumptions necessarily reflects common intuitions of neuroimaging researchers. Finally, no previous framework allows for the necessary sample sizes to be derived based on a desired positive predictive value (PPV), a novel statistical marker for the quality of empirical research that has risen to prominence over the last decade (Wacholder et al., 2004; Ioannidis, 2005; Heston and King, 2017; Colquhoun, 2019).
With the current work, we address these shortcomings and report on a novel framework for power, PPV, and sample size calculations in RFT-based fMRI inference. We first consider the framework's theoretical foundations by briefly reviewing the notion of power in single test scenarios, the concepts of minimal and maximal power in multiple testing scenarios, the foundations of the PPV, and the notion of partial alternative hypotheses. We then discuss the RFT-based power and PPV functions at both the voxel and cluster level in both the uncorrected single test and corrected multiple testing scenario and discuss their parametric dependencies. In a third step, we apply the proposed framework in a prospective power analysis based on a pilot fMRI data set and evaluate the sample sizes necessary to obtain desired power and PPV levels. We close with a discussion of some commonalities and differences between the proposed framework and previously proposed approaches and some potential avenues for future research. Throughout, we limit our scope to the evaluation of contrasts of first-level GLM parameter estimates (COPEs) at the group level using T-statistics, the approach most commonly used for group-level fMRI analyses. The technical foundations of our framework are detailed in Supplement S.2 and the Methods section. All data and software used are available from https://osf.io/xjcg4/.

Theoretical foundations

Power functions
In single test scenarios, such as testing for the activation of a single voxel, two types of errors can occur: the test may reject the null hypothesis when it is in fact true, referred to as a Type I error, and the test may not reject the null hypothesis when in fact the alternative hypothesis is true, referred to as a Type II error. From a frequentist perspective, Type I and Type II errors are associated with their probabilities of occurrence, denoted α and 1 − β, respectively, and commonly referred to as Type I and Type II error rates. The complementary probability of a Type II error, i.e., the probability of rejecting the null hypothesis if the alternative hypothesis is true, is referred to as the power β of a test. A fundamental aim of test construction is to maintain low Type I and Type II error rates. To this end, a desired Type I error rate is usually selected first by defining a test significance level α, ensuring a Type I error rate of at most α. For many commonly used tests, the power at a fixed significance level α can then be shown to be a function β(n, d) of an effect size measure d and the sample size n. An often recommended approach in research study design is calculating the necessary sample size n for which, under the assumption of a fixed effect size d, the power reaches a desirable level, such as β(n, d) = 0.8.
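The sample size search described above can be sketched in a few lines. The following is a minimal illustration only: it assumes a one-sided test and replaces the exact power function by the normal approximation β(n, d) ≈ Φ(d√n − z_{1−α}), which is not the power function developed in this paper. The function names `approx_power` and `min_sample_size` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def approx_power(n: int, d: float, alpha: float = 0.05) -> float:
    """Approximate one-sided power beta(n, d) at significance level alpha.

    Assumption: normal approximation to the test statistic, i.e.
    beta(n, d) ~ Phi(d * sqrt(n) - z_{1 - alpha}).
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha)
    return _STD_NORMAL.cdf(d * sqrt(n) - z_crit)


def min_sample_size(d: float, alpha: float = 0.05,
                    target: float = 0.8, n_max: int = 10_000) -> int:
    """Smallest n with approx_power(n, d, alpha) >= target (linear search)."""
    for n in range(2, n_max + 1):
        if approx_power(n, d, alpha) >= target:
            return n
    raise ValueError("target power not reached for n <= n_max")


if __name__ == "__main__":
    for d in (0.4, 0.5, 0.6):
        print(f"d = {d}: n = {min_sample_size(d)}")
```

Because the normal approximation ignores the heavier tails of the T-distribution, the sample sizes it returns tend to be slightly smaller than those of an exact T-test power analysis.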

Minimal and maximal power functions
In multiple testing scenarios, such as simultaneously testing for cortical activation over many voxels, a Type I or a Type II error may occur for each of the individual tests involved, inducing a variety of Type I and Type II error rates. For example, commonly considered Type I error rates in fMRI research are the family-wise error rate (FWER), defined as the probability of one or more false rejections of the null hypothesis, and the false discovery rate (FDR), defined as the expected proportion of Type I errors among the rejected null hypotheses. Classically, the FWER has been the prime target for Type I error rate control in fMRI research. The prevalence of FWER control derives from the fact that the FWER can be efficiently controlled using maximum statistic-based procedures (e.g., Roy, 1953; Roy and Bose, 1953), which were at the centre of the early developments of RFT-based fMRI inference (Friston et al., 1991; Worsley et al., 1992; Friston et al., 1994). Maximum statistic-based multiple testing procedures allow the FWER to be controlled using a family-wise error significance level $\alpha_{\mathrm{FWE}}$. Just as the multiplicity of statistical tests in multiple testing scenarios induces a variety of Type I error rates, it also induces a variety of Type II error rates and hence power types. Power types commonly considered in multiple testing are minimal power, defined as the probability of one or more correct rejections of the null hypothesis, and maximal power, defined as the probability of correctly rejecting all false null hypotheses (e.g., Dudoit et al., 2003). When calculating the sample sizes necessary for desired power levels in Type I error rate-controlled multiple testing scenarios, it is hence essential to explicate the power type of interest. As RFT-based fMRI inference naturally lends itself to the evaluation of the minimal and maximal power functions $\beta_{\min}(n, d)$ and $\beta_{\max}(n, d)$, respectively, we focus on these power types in the current work.
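To make the definitions of the FWER and of minimal and maximal power concrete, the following toy sketch evaluates them for a family of independent tests. Independence is an assumption made purely for illustration; RFT-based inference concerns spatially correlated statistics, and the functions below are not the RFT-based power functions of this paper. All function names are hypothetical.

```python
def fwer(alpha: float, m0: int) -> float:
    """P(one or more false rejections) among m0 independent true nulls,
    each tested at level alpha."""
    return 1.0 - (1.0 - alpha) ** m0


def minimal_power(beta: float, m1: int) -> float:
    """P(one or more correct rejections) among m1 independent false nulls,
    each tested with single-test power beta."""
    return 1.0 - (1.0 - beta) ** m1


def maximal_power(beta: float, m1: int) -> float:
    """P(all m1 independent false nulls are rejected)."""
    return beta ** m1


if __name__ == "__main__":
    beta, m1 = 0.5, 3
    print("minimal power:", minimal_power(beta, m1))
    print("maximal power:", maximal_power(beta, m1))
    print("FWER for 100 nulls at alpha = 0.05:", round(fwer(0.05, 100), 3))
```

Even in this simplified setting, minimal power grows and maximal power shrinks as the number of false null hypotheses increases, which is why the power type has to be stated explicitly.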

PPV functions
In recent discussions, studies with low power have been related to high probabilities of the claimed effects being false positives (cf. Ioannidis, 2005; Button et al., 2013). This relationship is not inherent in classical frequentist test theory, in which Type I and Type II error rates are conceived independently. Instead, the dependency of Type I error rates on Type II error rates, and hence power, arises in the context of a probabilistic model that assigns probabilities to the null hypothesis of being either true or false and the ensuing concept of a test's PPV (Wacholder et al., 2004) (for an equivalent formulation in terms of false positive risk, see, e.g., Colquhoun, 2017, 2019). A test's PPV, denoted here by ψ, is defined as the probability of the null hypothesis being false given that the test rejects the null hypothesis. As discussed in Supplement S.2, the PPV depends on both the Type I error rate and the prior hypothesis parameter π ∈ [0, 1], which represents the prior probability of the alternative hypothesis being true. For a constant Type I error rate and prior hypothesis parameter, the PPV is a function of the test's power and, similar to power, a function ψ(n, d) of the effect and sample sizes. Moreover, in multiple testing scenarios, such PPV functions can be generalized to minimal and maximal PPV functions $\psi_{\min}(n, d)$ and $\psi_{\max}(n, d)$ by substitution of the respective minimal and maximal power functions. Similar to power functions, single test and multiple testing PPV functions allow finding the sample size n for which, at a given effect size d, the PPV function reaches a desirable level, such as ψ(n, d) = 0.8.
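The dependence of the PPV on power, Type I error rate, and prior hypothesis parameter can be illustrated with a short sketch. It assumes the standard Bayes-rule form ψ = πβ / (πβ + (1 − π)α); the paper's own PPV definition (eq. (39)) is not reproduced in this section and may be parameterized differently, so the function `ppv` below is a hypothetical stand-in.

```python
def ppv(power: float, alpha: float, prior: float) -> float:
    """Positive predictive value under the assumed Bayes-rule form.

    power : probability of rejecting H0 when H1 is true (beta)
    alpha : Type I error rate
    prior : prior probability pi of H1 being true
    """
    for name, value in (("power", power), ("alpha", alpha), ("prior", prior)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    true_pos = prior * power            # P(reject and H1 true)
    false_pos = (1.0 - prior) * alpha   # P(reject and H0 true)
    if true_pos + false_pos == 0.0:
        raise ValueError("PPV undefined: the test never rejects")
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # PPV as a function of power for a fixed alpha and several priors.
    for prior in (0.1, 0.2, 0.5):
        row = [round(ppv(b, 0.05, prior), 3) for b in (0.2, 0.5, 0.8)]
        print(f"pi = {prior}: {row}")
```

For fixed α and π, this form is monotonically increasing in power, so any power function β(n, d) can be plugged in to obtain a PPV function ψ(n, d).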

Partial alternative hypothesis scenarios
Previous approaches to the evaluation of power in fMRI inference have typically relied on the assumption that the experimental effect of interest is expressed in a known cortical region of interest, i.e., single test scenarios (e.g., Desmond and Glover, 2002; Mumford and Nichols, 2008), or, in multiple testing scenarios, across the entire cortical volume (e.g., Hayasaka et al., 2007; Joyce and Hayasaka, 2012). While there are situations in which prospective power analyses are reasonable under these assumptions, we here suggest that the evaluation of necessary sample sizes may often be desired although neither the precise location of an expected activation nor the activation of the entire cortical sheet is reasonably assumed. To this end, we propose to parameterize the power, PPV, and sample size calculations in multiple testing scenarios with a partial alternative hypothesis parameter λ ∈ [0, 1], which describes the assumed proportion of activated brain volume. Intuitively, for example, λ = 0.1 corresponds to the assumption that 10% of the cortex is truly activated. Formally, λ corresponds to the continuous spatial generalization of the alternative hypotheses ratio of multiple testing scenarios, as discussed in Supplement S.2.
Note that if λ = 0, the minimal and maximal power are necessarily identically zero, as there are no true activations. Analogously, if λ = 1, the FWER is necessarily zero, as there are no null activations.

RFT-based fMRI inference power and PPV functions
Based on the theoretical considerations above and the mathematical theory of RFT-based fMRI inference, it is possible to develop a set of power and PPV functions that are well-aligned with the RFT-based inference framework (Methods). In the following, we first discuss the power and PPV functions β(n, d) and ψ(n, d) for voxel- and cluster-level inference in single test scenarios for fixed significance levels α. We then consider the power and PPV functions $\beta^{\lambda}_{\min}(n, d)$, $\beta^{\lambda}_{\max}(n, d)$, $\psi^{\lambda}_{\min}(n, d)$, and $\psi^{\lambda}_{\max}(n, d)$ for voxel- and cluster-level inference for fixed family-wise error significance levels $\alpha_{\mathrm{FWE}}$ and for fixed partial alternative hypothesis parameters λ. Note that these functions form the essential prerequisites for calculating the sample sizes necessary to achieve desired levels of power or PPV.
The single test scenario: uncorrected voxel- and cluster-level inference
Figure 1A depicts the power functions β(n, d) for voxel- and cluster-level inference in the uncorrected single test scenario at a significance level of α = 0.05. For voxel-level inference and medium effect sizes of d = 0.4 to d = 0.6, sample sizes of n = 20 to n = 40 are required to achieve power levels of β(n, d) = 0.8. For cluster-level inference and similar effect sizes, slightly larger sample sizes of n = 25 to n = 40 are required to achieve similar power levels. Note that in contrast to voxel-level inference, cluster-level inference depends on the value of a cluster-defining threshold (CDT). For the cluster-level power function depicted in Figure 1A, the CDT was set to u = 4.3, corresponding to a p-value of 0.001 at ν = 9 degrees of freedom.
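The correspondence between the CDT value u = 4.3 and a p-value of 0.001 at ν = 9 degrees of freedom can be checked numerically. The sketch below is an illustration using only the Python standard library: it integrates the Student T density with a composite Simpson rule, assuming a one-sided (upper-tail) p-value. The function names are hypothetical.

```python
from math import gamma, pi, sqrt


def t_pdf(t: float, nu: float) -> float:
    """Density of the Student T-distribution with nu degrees of freedom."""
    c = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))
    return c * (1 + t * t / nu) ** (-(nu + 1) / 2)


def t_upper_tail(u: float, nu: float, steps: int = 2000) -> float:
    """P(T > u) for u >= 0, via composite Simpson integration on [0, u]."""
    if u < 0:
        raise ValueError("u must be non-negative")
    if steps % 2:
        raise ValueError("steps must be even for Simpson's rule")
    h = u / steps
    total = t_pdf(0.0, nu) + t_pdf(u, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    # The density is symmetric about zero, so P(0 < T < u) = 0.5 - P(T > u).
    return 0.5 - total * h / 3


if __name__ == "__main__":
    print(f"P(T > 4.3 | nu = 9) = {t_upper_tail(4.3, 9):.5f}")
```

The same function can be used to translate other candidate CDT values into their uncorrected p-values for a given sample size.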
Naturally, varying the CDT impacts power: as shown in the right panel of Figure 1B, increasing the CDT at a constant sample size decreases power. This relationship is intuitive as, all else being equal, increasing the CDT will mask out an increasing number of voxels and hence reduce the chance of detecting a truly activated cluster. Similarly, and more fundamentally, the significance level impacts power for both voxel- and cluster-level inference: as depicted for voxel-level inference in the left panel of Figure 1B, decreasing the significance level decreases power. For all power curves shown in Figure 1B, the effect size was set to d = 0.5. For this medium effect size, a sample size of approximately n = 70 is required to achieve a power of β(n, d) = 0.8 at the uncorrected voxel-level significance level of α = 0.001, which is sometimes used for inference in empirical studies. Notably, neither uncorrected voxel-level inference nor cluster-level inference is affected by the search space's resel volumes that relate to the statistical map's roughness: the RFT-based power function of the voxel-level height statistic (cf. eq. (26)) is identical to the power function of a one-sample T-test and is hence independent of the search space's resel volumes per se. The power function of the cluster-extent statistic (cf. eq. (29)), however, is dependent on the expected cluster extent and hence potentially susceptible to variations in the statistical map's roughness. However, as the third-order resel volume affects both the expected volume of clusters and the expected number of clusters, and for the evaluation of the expected cluster extent, RFT-based fMRI inference assumes the independence of these expectations (cf. eq. (12)), resel volume independence, and hence roughness independence, ensues.
Figure 1C depicts the PPV functions for voxel- and cluster-level inference in the uncorrected single test scenario as a function of effect size d and sample size n and for a prior hypothesis parameter of π = 0.2. Here, medium effect sizes similar to those of the power functions require sample sizes on the order of n = 10 to n = 30 and n = 15 to n = 35 to achieve PPV levels of ψ(n, d) = 0.8 for voxel- and cluster-level inference, respectively. From the definition of the PPV function ψ(n, d) as a monotonic transformation of a power function β(n, d) (cf. eq. (39)), it follows that the parameter dependencies of the voxel- and cluster-level power functions carry over to the respective PPV functions. Naturally, PPV functions are additionally strongly dependent on the value of the prior hypothesis parameter π: as shown in Figure 1D, low prior hypothesis parameter values result in much larger sample sizes necessary to achieve desired PPV levels, while higher prior hypothesis parameter values have the opposite effect.

Figure 1. (A) Power functions for uncorrected voxel-level and cluster-level inference for a given sample size n and effect size d. For the cluster-level power function, a CDT parameter of u = 4.3 (p = 0.001 for ν = 9 degrees of freedom) was used. (B) Power dependency on the significance level α and the CDT value u for voxel- and cluster-level inference, respectively. (C) PPV functions for uncorrected voxel-level and cluster-level inference for a given effect size d and sample size n for the prior hypothesis parameter set to π = 0.2. (D) Prior parameter dependencies of the voxel- and cluster-level PPV functions for a fixed effect size of d = 0.5. Dots represent the evaluated sample sizes. For implementational details, please see rftp_figure_1.m.

Figure 2. Minimal and maximal power and PPV functions for voxel- and cluster-level inference in the corrected multiple testing scenario. (A) Minimal and maximal power and PPV functions for corrected voxel-level inference for a given sample size n, effect size d, and partial alternative hypothesis parameter λ (first three columns). The fourth column depicts the corrected voxel-level minimal and maximal PPV functions for a prior hypothesis parameter of π = 0.2. (B) Minimal and maximal power and PPV functions for corrected cluster-level inference for a given sample size n, effect size d, and partial alternative hypothesis parameter λ (first three columns). The fourth column depicts the corrected cluster-level minimal and maximal PPV functions for a prior hypothesis parameter of π = 0.2. All cluster-level power functions were evaluated for a CDT of u = 4.3, and all voxel- and cluster-level power and PPV functions were evaluated for an exemplary resel volume set of R0 = 6, R1 = 33, R2 = 354, and R3 = 705. For further implementational details, please see rftp_figure_2.m.
The multiple testing scenario: corrected voxel- and cluster-level inference
Figure 2A depicts maximal and minimal power and PPV functions for corrected voxel-level inference at a significance level of $\alpha_{\mathrm{FWE}} = 0.05$. Specifically, the two leftmost panels of Figure 2A depict the minimal and maximal power functions $\beta^{\lambda}_{\min}(n, d)$ and $\beta^{\lambda}_{\max}(n, d)$ for corrected voxel-level inference and a partial alternative hypothesis parameter of λ = 0.1. Achieving a minimal power level of $\beta^{\lambda}_{\min}(n, d) = 0.8$ for a medium effect size of d = 0.5 requires sample sizes in the range of n = 15 to n = 30. To achieve similar levels of maximal power $\beta^{\lambda}_{\max}(n, d)$, the same effect size requires sample sizes of n = 200 to n = 500. As shown in the upper three panels of Figure 2A, increasing the partial alternative hypothesis parameter to λ = 0.2 and λ = 0.3 decreases the sample sizes necessary to achieve a minimal power of $\beta^{\lambda}_{\min}(n, d) = 0.8$. For maximal power, such a decrease is not observed. Intuitively, this relationship can be understood as follows: increasing the proportion of cortical activation increases the chances of detecting activation at a single cortical location (minimal power) but not of detecting activations at all locations (maximal power). Finally, for a prior hypothesis parameter of π = 0.2, PPV levels of $\psi^{\lambda}_{\min}(n, d) = \psi^{\lambda}_{\max}(n, d) = 0.8$ can be achieved with effect and sample sizes largely similar to those for minimal and maximal power, as depicted for λ = 0.3 in the rightmost column of Figure 2A. Figure 2B depicts maximal and minimal power and PPV functions for corrected cluster-level inference at a significance level of $\alpha_{\mathrm{FWE}} = 0.05$. As for voxel-level inference, the leftmost panels of Figure 2B depict the minimal and maximal power functions for a partial alternative hypothesis parameter of λ = 0.1.
Here, achieving a minimal power of $\beta^{\lambda}_{\min}(n, d) = 0.8$ for a medium effect size of d = 0.5 requires sample sizes in the range of n = 10 to n = 20, while achieving a maximal power of $\beta^{\lambda}_{\max}(n, d) = 0.8$ at the cluster level requires sample sizes of n = 30 to n = 50. As for corrected voxel-level inference, increasing the partial alternative hypothesis parameter to λ = 0.2 and λ = 0.3 decreases the necessary sample sizes for minimal power but not for maximal power. Finally, for a prior parameter of π = 0.2, $\psi^{\lambda}_{\min}(n, d) = \psi^{\lambda}_{\max}(n, d) = 0.8$ can also be achieved at the cluster level with effect and sample sizes largely similar to those for power (Figure 2B, rightmost column).
Naturally, the minimal and maximal power and PPV functions of corrected voxel- and cluster-level inference exhibit a number of additional parametric dependencies (Figure 3). First, as shown in Figure 3A, similar to the patterns observed for their uncorrected counterparts, the minimal and maximal power functions of corrected voxel- and cluster-level inference are affected by the desired significance level $\alpha_{\mathrm{FWE}}$, with lower values of $\alpha_{\mathrm{FWE}}$ implying lower power. Second, and in contrast to the patterns observed for their uncorrected counterparts, the power functions in the corrected scenario are dependent on the data roughness, as expressed by a statistical map's resel volumes. Figure 3B visualizes this influence as parameterized by a roughness parameter r, where for r = 1, the resel volumes are set as in Figure 2, while for r = 0.5 and r = 2 to r = 5, they are decreased or increased by the respective factor. Notably, for both voxel- and cluster-level inference, changes in the data roughness have opposite effects on minimal and maximal power: for minimal power, an increase in roughness r results in an increase of $\beta^{\lambda}_{\min}(n, d)$, while for maximal power, an increase in roughness r results in a decrease of $\beta^{\lambda}_{\max}(n, d)$. The effect of increased roughness on minimal power is familiar from the FWER-controlling features of the expected Euler characteristic (EC) (Adler, 1981; Worsley et al., 1996): the higher the roughness of the statistical field, the higher the probability for the maximum of the statistical field to exceed a given value, and hence the lower the statistical significance of an isolated peak. Because this relationship is a property of the maximum statistic $T_{\max}$, it is also evident in the case of minimal power. Intuitively, as the roughness of the statistical field can be viewed as a measure of the voxel height statistics' spatial independence, detecting a single true alternative hypothesis is easier if it is not correlated with neighbouring height statistics.
In contrast, maximal power increases with decreasing roughness and hence increasing smoothness. This association is intuitive: the smoother the statistical field is, the stronger the spatial covariation of the statistics. Thus, if a true alternative hypothesis is detected at one location, the other true alternative hypotheses are also likely to be detected (if, as in the current case, it is assumed that the area of activation corresponds to a contiguous set). As in the uncorrected cluster-level scenario, increasing the value of the CDT decreases power at a constant effect size for both minimal and maximal power (Figure 3C) because the probability of detecting one or all locations at which the alternative hypothesis is true decreases with the masking of an increasing number of voxels. Finally, the prior hypothesis parameter π also strongly affects PPV levels in the multiple testing scenario, as exemplified in Figure 3D for the cluster-level minimal and maximal PPV functions.
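The link between roughness and the exceedance probability of the field maximum can be illustrated with the expected Euler characteristic. The sketch below makes several simplifying assumptions that differ from the paper's setting: it uses the EC densities of a Gaussian (Z) field in three dimensions rather than those of a T-field, it scales all four resel volumes of Figure 2 by the roughness factor r, and it approximates P(max > u) by the expected EC. The function names are hypothetical.

```python
from math import exp, log, pi, sqrt
from statistics import NormalDist


def gaussian_ec_densities(u: float) -> tuple:
    """EC densities rho_0..rho_3 of a Gaussian field (resel units)."""
    a = 4 * log(2)
    e = exp(-u * u / 2)
    return (
        1 - NormalDist().cdf(u),
        sqrt(a) / (2 * pi) * e,
        a / (2 * pi) ** 1.5 * u * e,
        a ** 1.5 / (2 * pi) ** 2 * (u * u - 1) * e,
    )


def expected_ec(u: float, resels) -> float:
    """Expected EC at threshold u, used as an approximation to P(max > u)."""
    return sum(r * rho for r, rho in zip(resels, gaussian_ec_densities(u)))


def fwe_threshold(resels, alpha_fwe: float = 0.05,
                  lo: float = 2.0, hi: float = 10.0, tol: float = 1e-8) -> float:
    """Bisection for the u at which the expected EC equals alpha_fwe.

    The expected EC is decreasing in u on [2, 10] for positive resel volumes.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_ec(mid, resels) > alpha_fwe:
            lo = mid
        else:
            hi = mid
    return hi


RESELS = (6, 33, 354, 705)  # exemplary resel volumes R0..R3 from Figure 2

if __name__ == "__main__":
    for r in (0.5, 1, 2, 5):
        scaled = tuple(r * R for R in RESELS)  # assumption: scale all R_d by r
        print(f"r = {r}: threshold = {fwe_threshold(scaled):.2f}")
```

Under these assumptions, larger r yields a larger expected EC at any fixed threshold and therefore a higher FWE-corrected threshold, in line with the statement that an isolated peak becomes less significant in rougher fields.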

Exemplary application
The power and PPV functions presented above imply the sample sizes necessary to achieve desired power and PPV levels over a broad range of possible effect sizes. To demonstrate the practical value of these functions, we finally consider their application in the concrete scenario of determining the sample size necessary to achieve power and PPV levels of 0.8 for a single effect size estimate. To this end, we re-analysed fMRI data from the first 10 participants in a previously reported perceptual decision-making study in which the amount of visual evidence for a presented stimulus to depict a face or a car was varied (Ostwald et al., 2012; Georgie et al., 2018). At the group level, contrasting fMRI activity levels between high and low visual evidence [...] Figure 4A (for further details about the experimental and data-analytical procedures, please see Supplement S.5). Our aim was to use the effect size estimate derived from this cluster to calculate the sample sizes necessary to achieve minimal and maximal power and PPV levels of 0.8 for corrected voxel- and cluster-level inference at a significance level of $\alpha_{\mathrm{FWE}} = 0.05$, a partial alternative hypothesis parameter of λ = 0.1, and a prior hypothesis parameter of π = 0.2. To this end, we evaluated the average T-values of the cluster, yielding T = 4.65, which translates into an effect size estimate of $\hat{d} = 4.65/\sqrt{10} \approx 1.47$. [...] Figure 4A, this effect size bias is most severe for small data subsets and decreases with increasing data subset size. For a data subset of n = 10, the effect size bias amounts to approximately $\Delta d = 1$. We thus used this empirically validated bias estimate to correct our effect size estimate to $\hat{d}_c = \hat{d} - \Delta d = 0.47$.
Using the power and PPV functions discussed in the previous section, and the sample size calculation algorithms Algorithm A2 and Algorithm A3 documented in Supplement S.6, we then obtained the following results: at the voxel level, sample sizes of n = 19 and n = 374 are required to achieve minimal and maximal power levels of 0.8, respectively (Figure 4B). At the cluster level, sample sizes of n = 12 and n = 48 are required to achieve minimal and maximal power levels of 0.8, respectively (Figure 4C). For all testing scenarios considered and for the current parameter settings, slightly smaller sample sizes are required to achieve PPV levels of 0.8.
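The effect size arithmetic of this example can be retraced in a few lines. The sketch assumes the usual relation between a one-sample T-statistic and the standardized effect size, d = T/√n, and takes the bias estimate Δd = 1 for n = 10 from the text above; the function name `effect_size_from_t` is hypothetical and not part of the accompanying software.

```python
from math import sqrt


def effect_size_from_t(t_value: float, n: int) -> float:
    """Standardized effect size implied by a one-sample T-statistic,
    assuming d = T / sqrt(n)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return t_value / sqrt(n)


T_CLUSTER = 4.65   # average T-value of the pilot cluster (from the text)
N_PILOT = 10       # number of pilot participants (from the text)
BIAS = 1.0         # effect size bias estimate for n = 10 (from the text)

d_hat = effect_size_from_t(T_CLUSTER, N_PILOT)
d_corrected = d_hat - BIAS

if __name__ == "__main__":
    print(f"d_hat = {d_hat:.2f}, bias-corrected d = {d_corrected:.2f}")
```

The bias-corrected value is the effect size that enters the power and PPV functions when searching for the necessary sample size.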

Discussion
In summary, we have developed power and PPV functions for RFT-based fMRI inference, which represents one of the mainstays of task-related fMRI data analysis. Further, we have demonstrated how these functions can be used to determine the minimal sample sizes necessary to achieve desired power and PPV levels in study planning. Based on our example and its implementation in the MATLAB function rftp_figure_4.m, interested users may readily adapt the procedures described herein for performing power, PPV, and sample size calculations in fMRI study planning. In the following, we briefly sketch the relation of the current framework to related approaches in the literature, discuss some potential avenues for future refinements of the approach, and close with some general remarks about statistical testing and power calculations in fMRI research.
The current framework can be thought of as a direct extension of the work by Hayasaka et al. (2007) and Joyce and Hayasaka (2012), generalizing the results presented therein to the cluster level and carefully distinguishing between uncorrected and corrected scenarios and the multiple power types thereby induced. As such, the current framework comprises the region of interest-based approaches proposed by Desmond and Glover (2002) and Mumford and Nichols (2008) and implied in the discussions by Friston (2012) and Lindquist et al. (2013) as special cases.
Specifically, in terms of its power function, a region of interest-based approach corresponds to uncorrected inference at the voxel level, i.e., a power evaluation for a one-sample T-test, with the difference that in typical region of interest-based approaches, voxel height statistics are spatially averaged over a set of voxels. Another power calculation framework that has recently been popularized is the approach of Durnez et al. (2016). This framework rests on a testing procedure that considers local maxima of voxel height statistics above a threshold. Under the model by Durnez et al. (2016), these local maxima are thought to be the outcome of a mixture distribution, comprising realizations of a null hypothesis exponential distribution and an alternative hypothesis Gaussian distribution. While the test procedure itself is not explicitly described, the apparent idea is to reject the null hypothesis of no activation at the location of the local maximum based on a set of arbitrarily selected critical values (Durnez et al., 2016, Section 3.3). Based on parameter estimates for the alternative hypothesis mixture component and the selected critical value, Durnez et al. (2016) calculate power and sample sizes. While an interesting approach in its own right, the method by Durnez et al. (2016) relies on statistical models and testing procedures that are specific to that power calculation approach and that are not routinely used in fMRI data analysis.
The current work implies some potential avenues for further research with the aim of improving power, PPV, and sample size calculations for fMRI inference. First, RFT-based fMRI inference itself may be further refined, thus entailing an optimization of the power and PPV framework discussed herein. For example, the approximations to the cluster-level test statistic distributions are still based on the Gaussian random field approximations by Friston et al. (1994), while newer results for T- and F-fields are available (e.g., Cao, 1999). Similarly, the notion of resel volumes has been largely superseded by the concept of Lipschitz-Killing curvatures (e.g., Taylor and Worsley, 2007), a theoretical development that has yet to be considered in standard discussions of RFT-based fMRI inference. Second, it has been observed previously as well as by us that some of the power functions of the RFT-based inference framework can behave non-monotonically outside of practically relevant parameter regimes (Hayasaka et al., 2007). Therefore, it may be desirable to further pursue mathematical analysis of the RFT-based exceedance probability function approximations and to study their analytic behaviour across parameter regimes. Finally, with respect to the PPV, it may be desirable to diminish the degree of subjectivity involved in selecting the prior hypothesis parameter. Potential avenues with which to achieve this goal include basing PPV calculations on empirical priors estimated from fMRI pilot data and considering the PPV in the more general setting of the false positive risk (e.g., Colquhoun, 2017, 2019). As emphasized throughout, statistical power and PPVs are rooted in statistical testing, i.e., the dichotomization of the uncertainty-imbued results of statistical inference.
As such, statistical testing, power and PPV calculations, as well as deriving the sample sizes necessary to achieve desired power and PPV levels, always generate simplified answers to complex scientific questions (e.g., Wasserstein et al., 2019). Such simplified answers may not always be desired in a scientific context, as indicated by recent initiatives to share unthresholded statistical parametric maps (Gorgolewski et al., 2015). Stated differently, while many researchers have argued that abandoning statistical testing based on arbitrary significance thresholds may be a promising avenue for improving scientific inference, few have argued that the entailing abandonment of power analyses may have similar effects. While we share the hope that the fMRI community will abandon statistical testing in the long run, we here have provided power, PPV, and sample size calculations applicable to the widely used RFT-based fMRI inference procedures that can be adopted in the meantime.

Methods
Here, we develop the power and PPV functions reported in the Results section. For a comprehensive review of RFT-based fMRI inference from first principles and with a particular focus on its SPM implementation, please refer to Ostwald et al. (2018). For a comprehensive review of the underlying test theory, please refer to Supplement S.2.

Probabilistic model
Standard fMRI group analysis in the framework of the GLM is based on a two-level summary statistics approach. At the first level, participant-specific MRI time series are analysed using voxel-wise convolution-based GLMs. The resulting participant- and voxel-specific COPEs are the data used for the second-level, continuous-space, discrete-data point model of RFT-based fMRI inference, where $Y_i(x)$ denotes the random variable that models the COPE of the $i$th of $n$ study participants at location $x$ in the continuous three-dimensional search space $S$. In its structural form, the joint distribution of these random variables is defined by

$$Y_i(x) = \mu(x) + \sigma Z_i(x), \quad i = 1, ..., n, \; x \in S,$$

where $\mu(x)$ is an unknown value of a space-dependent parameter function $\mu : \mathbb{R}^3 \to \mathbb{R}$, $\sigma > 0$ is an unknown standard deviation parameter, and $Z_i(x)$ is a Z-field modelling observation error. The $Z_i(x)$, $i = 1, ..., n$ are assumed to be independent and of identical smoothness. Observed COPE data sets are assumed to represent a lattice approximation to eq. (2) and can be represented by the discrete-space, discrete-data point model

$$Y_{iv} = \mu_v + \sigma Z_{iv}, \quad i = 1, ..., n, \; v = 1, ..., m,$$

where $\mu_v := \mu(x_v)$ denotes the value of the parameter function $\mu$ at voxel location $x_v$, and $Z_{iv} := Z_i(x_v)$ denotes the $i$th Z-field random variable located at voxel location $x_v$. In the following, we denote the family of random variables $Y_{iv}$, $i = 1, ..., n$, $v = 1, ..., m$ by $Y := (Y_{iv})_{i=1,...,n,\,v=1,...,m}$, we summarize the values of the space-dependent effect size parameter function in a vector $\mu := (\mu_1, ..., \mu_m)^T \in \mathbb{R}^m$, and we denote the ensuing cardinality of the discretized second-level random field model and, equivalently, the dimensionality of an observed COPE data set, by $k := nm$.
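The discrete-space model can be illustrated with a small simulation. The following is a minimal sketch under simplifying assumptions that are not part of the original work: the lattice is one-dimensional rather than three-dimensional, the Z-field is approximated by Gaussian-kernel smoothing of white noise rescaled to unit variance, and the function name `simulate_cope_data` and its `fwhm_voxels` parameter are hypothetical.

```python
import math
import random

def simulate_cope_data(mu, sigma, n, fwhm_voxels=3.0, seed=0):
    """Draw n data sets Y[i][v] = mu[v] + sigma * Z[i][v] on a 1-D lattice (toy model)."""
    rng = random.Random(seed)
    m = len(mu)
    # Gaussian smoothing kernel; FWHM = sqrt(8 ln 2) * standard deviation.
    sd = fwhm_voxels / math.sqrt(8.0 * math.log(2.0))
    half = int(math.ceil(3.0 * sd))
    kernel = [math.exp(-0.5 * (j / sd) ** 2) for j in range(-half, half + 1)]
    # Rescale so that the squared weights sum to one: smoothed noise keeps unit variance.
    norm = math.sqrt(sum(w * w for w in kernel))
    kernel = [w / norm for w in kernel]
    data = []
    for _ in range(n):
        # Independent white noise per participant, padded so every voxel sees a full kernel.
        white = [rng.gauss(0.0, 1.0) for _ in range(m + 2 * half)]
        z = [sum(kernel[j] * white[v + j] for j in range(len(kernel))) for v in range(m)]
        data.append([mu[v] + sigma * z[v] for v in range(m)])
    return data

# Example: 12 participants, 50 voxels, activation of size 0.8 in voxels 20 to 29.
mu = [0.8 if 20 <= v < 30 else 0.0 for v in range(50)]
Y = simulate_cope_data(mu, sigma=1.0, n=12)
```

The resulting nested list `Y` plays the role of an observed COPE data set of dimensionality k = nm.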

Statistics
RFT-based fMRI inference is based on a set of statistics that map k-dimensional COPE data sets onto lower-dimensional outcome spaces. Evaluating the probability of observed values of these statistics under the random field model of eq. (1) then allows for testing null hypotheses at desired levels of significance. To this end, RFT-based fMRI inference distinguishes single test scenarios, commonly referred to as uncorrected inference, based on uncorrected p-values, and multiple testing scenarios, commonly referred to as corrected inference, based on corrected p-values. Depending on the test scenario and the type of statistic, a specific form of inference ensues.
In the single test scenario and at the voxel level, the statistics of interest are the voxel height statistics

$$T_v := \sqrt{n}\,\frac{\bar{y}_v}{s_v}, \quad v = 1, ..., m,$$

where $\bar{y}_v$ and $s_v$ denote the sample mean and sample standard deviation of the $v$th voxel data, respectively. The voxel height statistics thus correspond to standard T-statistics.

In the single test scenario and at the cluster level, the statistics correspond to the cluster extent statistics $K_j(Y)$, $j = 1, ..., c$, where $K_j(Y)$ denotes the extent of the $j$th of $c$ clusters within an excursion set defined by a CDT $u \in \mathbb{R}$. The test statistics $K_j(Y)$, $j = 1, ..., c$ subsume all data-analytical steps that project a COPE data set onto the extents of clusters within the excursion set of a statistical parametric map. These steps comprise, but are not limited to, thresholding a statistical parametric map at level $u$, identifying the resulting clusters using a numerical connectivity scheme, and measuring the extent of each cluster. Given the complexity of these computational subprocesses, closed-form expressions for the evaluation of $K_j$ are not easily provided. Nevertheless, an approximation to the distribution of the test statistics $K_j(Y)$, $j = 1, ..., c$ is routinely used in RFT-based fMRI inference, as will be discussed below.
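The data-analytical steps behind the voxel height and cluster extent statistics can be sketched in a few lines. The example below is only an illustration: it assumes a one-dimensional lattice, so that the connectivity scheme reduces to contiguous runs of supra-threshold voxels, and the function names `voxel_heights` and `cluster_extents` are hypothetical.

```python
import math
import statistics

def voxel_heights(data):
    """One-sample T-statistic per voxel: sqrt(n) * sample mean / sample standard deviation."""
    n = len(data)
    m = len(data[0])
    t_map = []
    for v in range(m):
        column = [data[i][v] for i in range(n)]
        t_map.append(math.sqrt(n) * statistics.mean(column) / statistics.stdev(column))
    return t_map

def cluster_extents(t_map, u):
    """Extents (voxel counts) of contiguous runs with t >= u on a 1-D lattice."""
    extents, run = [], 0
    for t in t_map:
        if t >= u:
            run += 1          # voxel belongs to the current cluster
        elif run > 0:
            extents.append(run)  # cluster ended at the previous voxel
            run = 0
    if run > 0:
        extents.append(run)      # cluster touching the end of the lattice
    return extents

# Three participants, two voxels (toy data).
t_map = voxel_heights([[1.0, 2.0], [3.0, 2.5], [2.0, 3.5]])
# Thresholding a toy statistical map at a cluster-defining threshold of 2.5.
extents = cluster_extents([0.1, 3.2, 3.5, 0.4, 2.9, 0.0], u=2.5)
```

In this toy setting, the maximum and minimum statistics considered in multiple testing scenarios are simply `max(t_map)`, `min(t_map)`, `max(extents)`, and `min(extents)`.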
In the multiple testing scenario and at the voxel level, the statistics of interest are the maximum and minimum of the voxel height statistics

$$T_{\max} := \max_{v \in \mathbb{N}_m} T_v \quad \text{and} \quad T_{\min} := \min_{v \in \mathbb{N}_m} T_v,$$

respectively. Similarly, in the multiple testing scenario and at the cluster level, the statistics of interest are the maximum and minimum of the cluster extent statistics

$$K_{\max} := \max_{j \in \mathbb{N}_c} K_j \quad \text{and} \quad K_{\min} := \min_{j \in \mathbb{N}_c} K_j,$$

respectively. Consideration of the maximum statistics is warranted by their inherent property of enabling FWER control and the evaluation of minimal power in multiple testing scenarios.
Consideration of the minimum statistics, in contrast, is warranted by their property of enabling the evaluation of maximal power in multiple testing scenarios. In the following, we detail the distributions of the statistics of eqs. (4) - (7) under the probabilistic model of eq. (1) that forms the core of RFT-based fMRI inference and the power evaluation framework proposed here. The distributions of the statistics will be provided in terms of exceedance probability functions (EPFs). EPFs are the probabilistic complements of cumulative probability functions and formulate the probability that a given statistic exceeds (rather than falls below, as in the case of cumulative probability functions) a given value. The use of EPFs is conventional in RFT-based fMRI inference and is useful in the contexts of false positive control and statistical power, both of which correspond to probabilities that statistics exceed critical values.
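The relation between an EPF and a cumulative probability function can be made concrete with a deliberately simple stand-in. The sketch below uses a standard normal statistic instead of the T-field statistics considered here, and the function names are hypothetical; it only illustrates that critical values are obtained by inverting the EPF at the desired exceedance probability.

```python
from statistics import NormalDist

def epf_standard_normal(x):
    """Exceedance probability P(Z >= x) = 1 - CDF(x) for a standard normal Z."""
    return 1.0 - NormalDist().cdf(x)

def critical_value(alpha):
    """Value u with P(Z >= u) = alpha, i.e., the EPF inverted at alpha."""
    return NormalDist().inv_cdf(1.0 - alpha)

u = critical_value(0.05)                 # critical value for a 5% exceedance probability
false_positive_rate = epf_standard_normal(u)
```

Evaluating the same EPF under a shifted (non-null) distribution at the same critical value would yield the power of the corresponding test.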

EPFs of RFT-based fMRI inference statistics
The EPFs of the test statistics (4) - (7) are based on (1) the T-field's search space resel volumes, (2) the T-field's EC densities, and (3) three topological feature expectations. We discuss each of these in turn.
For the current work, the non-central T-field EC densities (9) are evaluated by the function r_fun.m. This function computes $f(t; \delta, \nu)$ using MATLAB's nctpdf.m function, computes the integral of the zero-order non-central T-field EC density using MATLAB's nctcdf.m function, and approximates the series of eq. (11) by a numerically converging finite sum.
(3) Finally, the EPFs of the test statistics (4) - (7) are based on the following three topological feature expectations of T-fields: the expected volume of an excursion set, the expected number of clusters within an excursion set, and the expected volume of clusters within an excursion set.
For the non-central T-field EC densities of eq. (9) with non-centrality parameter $\sqrt{n}d$ and $n - 1$ degrees of freedom, and for a CDT $u$, these expected values are given by respectively.
With these preliminaries, the following EPFs for the statistics of eqs. (4) - (7) ensue:
• The EPF of the voxel height statistics $T_v$ follows from the standard theory of T-statistics. Moreover, because the zero-order non-central T-field EC density is identical to the cumulative density function of a non-central T-distribution, the EPF of the $T_v$ for a non-central T-field with non-centrality parameter $\sqrt{n}d$ and $n - 1$ degrees of freedom takes the form Note that for $d = 0$, the EPF of $T_v$ equals the EPF of Student's T-distribution with $n - 1$ degrees of freedom.
• The EPF of the cluster extent test statistics $K_j$ derives from an approximation for Gaussian random fields originally proposed by Friston et al. (1994). For a non-central T-field with non-centrality parameter $\sqrt{n}d$ and $n - 1$ degrees of freedom, and for a CDT $u$, this approximation generalizes to
• An approximation to the EPF of the maximum voxel height statistic $T_{\max}$ was originally proposed by Worsley et al. (1996) and was generalized to non-central T-fields by Hayasaka (2007). For a non-central T-field with non-centrality parameter $\sqrt{n}d$ and $n - 1$ degrees of freedom, the approximation is given by Similarly, as shown in Supplement S.3, an approximation to the EPF of the minimum voxel height statistic $T_{\min}$ can be given as
• Finally, an approximation to the EPF of the maximum cluster extent statistic $K_{\max}$ was proposed by Friston et al. (1994). Based on Hayasaka et al. (2007), this approximation can be generalized to non-central T-fields with non-centrality parameter $\sqrt{n}d$, $n - 1$ degrees of freedom, and a CDT $u$ Similarly, as shown in Supplement S.3, an approximation to the EPF of the minimum cluster extent statistic $K_{\min}$ can be given as

Test-relevant aspects of the EPFs in eqs. (15) of the search space's resel volumes $R_d(S)$, $d = 0, 1, 2, 3$ into resel volumes $R^0_d(S)$, $d = 0, 1, 2, 3$ for which the null hypothesis of zero activation holds and resel volumes $R^1_d(S)$, $d = 0, 1, 2, 3$ for which the alternative hypothesis of non-zero activation and with effect size parameter $\delta \neq 0$ holds. Note that for $\lambda = 0$, the partial alternative hypothesis scenario (21) corresponds to the complete null hypothesis of standard RFT-based fMRI inference, whereas for $\lambda = 1$, it corresponds to the complete alternative hypothesis scenario of Hayasaka et al. (2007) and Joyce and Hayasaka (2012). Intuitively, the value of $\lambda$ thus corresponds to the proportion of the brain that is assumed to be activated for a given COPE.
Formally, this proportion can be considered equivalent to the alternative hypotheses ratio in discrete multiple testing developed in Supplement S.2, eq. (S2.21). Specifically, for a partial alternative hypothesis parameter $\lambda$ and a set of resel volumes $R_d$, $d = 0, 1, 2, 3$, the expected Euler characteristic that combines resel volumes and EC densities in the EPFs of the RFT-based fMRI test statistics (4) - (7) takes the form For eqs. (12) and (14), only the respective zero- and third-order terms are considered.

Tests and power functions
With the test statistics and hypotheses in place, we next formalize the single test and multiple testing scenarios for voxel- and cluster-level inference and document the power functions that result from the EPFs (15) - (20).

(1) Single test (uncorrected) voxel- and cluster-level inference

The aim of voxel-level inference in the single test scenario is to evaluate the null hypothesis of zero activation at the $v$th voxel location using the voxel height statistic $T_v$ for the test where $1_{\{\cdot\}}$ denotes the indicator function and $c$ denotes the test's critical value. The Type I error rate of this test is controlled by choosing a critical value $t_\alpha$ such that and the test obtains a significance level $\alpha'$. With the EPF of $T_v$, it then follows that the power function for voxel-level inference in the single test scenario is given by This power function corresponds to the standard power function for one-sample T-tests and is visualized in Figure 1A and Figure 1B. Note that the dependency of eq. (26) on the critical value $t_\alpha$ is commonly expressed indirectly in terms of the dependency of $t_\alpha$ on $\alpha'$ (cf. Figure 1B).

The aim of cluster-level inference in the single test scenario is to evaluate the null hypothesis of zero activation over the extent of the $j$th cluster using the cluster extent statistic $K_j$ for the test where $k$ denotes the test's critical value. The Type I error rate of this test is controlled by choosing a critical value $k_\alpha$ such that and the test obtains a significance level $\alpha'$. With the EPF of $K_j$, it then follows that the power function for cluster-level inference in the single test scenario is given by where denotes the expected volume of a cluster in an excursion set at level $u$. This power function is visualized in Figure 1B.
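Because the single-test voxel-level power function corresponds to the standard power function of a one-sample T-test, its behaviour can be checked by simulation. The sketch below is a Monte Carlo stand-in, not the analytic non-central T EPF used in this work: it assumes unit-variance Gaussian data, so that the mean equals the effect size d, a hypothetical function name, and a tabulated one-sided 5% critical value of Student's T-distribution with 19 degrees of freedom (approximately 1.729).

```python
import math
import random

def simulated_rejection_rate(n, d, t_crit, n_sim=10000, seed=1):
    """Fraction of simulated one-sample T-tests with T >= t_crit for effect size d."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        y = [rng.gauss(d, 1.0) for _ in range(n)]   # unit variance: mean equals d
        mean = sum(y) / n
        sd = math.sqrt(sum((value - mean) ** 2 for value in y) / (n - 1))
        t = math.sqrt(n) * mean / sd
        if t >= t_crit:
            rejections += 1
    return rejections / n_sim

T_CRIT_DF19 = 1.729  # assumed tabulated value: one-sided 5% level, 19 degrees of freedom
size_estimate = simulated_rejection_rate(n=20, d=0.0, t_crit=T_CRIT_DF19)
power_estimate = simulated_rejection_rate(n=20, d=0.5, t_crit=T_CRIT_DF19)
```

Under these assumptions, `size_estimate` should lie close to 0.05 and `power_estimate` should be considerably larger; repeating the second call over a grid of n values gives a simple simulation-based sample size search.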
(2) Multiple testing (corrected) voxel- and cluster-level inference

The aim of voxel-level inference in the multiple testing scenario is to evaluate the null hypothesis of zero activation at the $v$th voxel location while accounting for the multiplicity of tests over voxels using the multiple test The FWER of this test is controlled based on the EPF of the maximum voxel height statistic $T_{\max}$ (17) by choosing a common critical value $t_{\alpha'_{\mathrm{FWE}}}$ such that for a desired significance level $\alpha'_{\mathrm{FWE}}$. From the EPF of the maximum voxel height statistic (17), it then follows that the minimal power function of voxel-level inference in the multiple testing scenario under the assumption of a partial alternative hypothesis with parameter $\lambda$ is given by Similarly, from the EPF (18) of the minimum voxel height statistic $T_{\min}$, it follows that the maximal power function for voxel-level inference in the multiple testing scenario under the assumption of a partial alternative hypothesis with parameter $\lambda$ is given by The ensuing minimal and maximal power functions for corrected voxel-level inference for $\lambda = 0.1, 0.2, 0.3$ are visualized in Figure 2A.
Finally, the aim of cluster-level inference in the multiple testing scenario is to evaluate the null hypothesis of zero activation over the extent of the $j$th cluster while accounting for the multiplicity of cluster tests using the multiple test The FWER of this test is controlled based on the EPF of the maximum cluster extent statistic $K_{\max}$ (cf. (19)) by choosing a common critical value $k_{\alpha'_{\mathrm{FWE}}}$ such that for a desired significance level $\alpha'_{\mathrm{FWE}}$. From the EPF (19) of $K_{\max}$, it then follows that the minimal power function of cluster-level inference in the multiple testing scenario under the assumption of a partial alternative hypothesis with parameter $\lambda$ is given by where $P_{\Theta_1}(K_j \geq k_{\alpha'_{\mathrm{FWE}}})$ is evaluated according to (16) for resel volumes $\lambda R_d$, $d = 0, 1, 2, 3$. Similarly, from the EPF (20) of the minimum cluster extent statistic $K_{\min}$, it follows that the maximal power function for cluster-level inference in the multiple testing scenario under the assumption of a partial alternative hypothesis with parameter $\lambda$ is given by where $P_{\Theta_1}(K_j \geq k_{\alpha'_{\mathrm{FWE}}})$ is evaluated as for $\beta^{\lambda}_{\min}$ above. The ensuing power functions are visualized in Figure 2B.

PPV functions
As discussed in Supplement S.2, PPV functions for the five test scenarios of interest herein can be specified by means of the respective test's (partial alternative hypothesis parameter-dependent) power function for sample size and effect size $\beta(n, d)$, the test's desired Type I error rate $\alpha'$, and the prior hypothesis parameter $\pi$ as where the dependencies on $\pi$ and $\alpha'$ are left implicit. The PPV functions depicted in Figure 1 - Figure

Over the past seven years, at least four studies have used bibliometric methods and one study has used survey methods to assess the use of data analysis software packages and statistical testing procedures in the functional neuroimaging literature (Carp, 2012; Woo et al., 2014; Poldrack et al., 2017; Borghi and Van Gulick, 2018; Yeung, 2018). The most recent and most comprehensive account is provided by Yeung (2018). In Table S

S.2. Test theory
In this section, we review the formal foundations of test theory. We first develop the single hypothesis test scenario and its associated error rates and power function. We then consider the multiple testing scenario with a particular emphasis on the notions of partial alternative hypothesis scenarios as well as minimal and maximal power functions. In a third step, we discuss the probabilistic foundations of the positive predictive value. We close our review by discussing a single-observation z-test in the context of the single test and the multiple testing scenario.

S.2.1 The single test scenario
Probabilistic model. To introduce the notion of a single test, we consider a parametric probabilistic model $P_\theta(Y)$ that describes the probability distribution of a random entity (i.e., a random variable or a random vector) $Y$ and that is governed by a parameter $\theta \in \Theta$. The random entity $Y$ models data and is assumed to take on values $y \in \mathbb{R}^n$, $n \geq 1$. Note that we do not consider the parameter $\theta$ to be a random entity and thus develop the following theory against the background of the classical frequentist scenario.
Test hypotheses. In test scenarios, the parameter space $\Theta$ is partitioned into two disjoint subsets, denoted by $\Theta_0$ and $\Theta_1$, such that $\Theta = \Theta_0 \cup \Theta_1$ and $\Theta_0 \cap \Theta_1 = \emptyset$. A test hypothesis is a statement about the parameter governing $P_\theta(Y)$ in relation to these parameter space subsets.
Specifically, the statement $H = 0 :\Leftrightarrow \theta \in \Theta_0$ (S2.1) is referred to as the null hypothesis and the statement $H = 1 :\Leftrightarrow \theta \in \Theta_1$ (S2.2) is referred to as the alternative hypothesis. Note that we are concerned with the Neyman-Pearson hypothesis testing framework and thus assume that null and alternative hypotheses always exist in an explicitly defined manner. A number of things are noteworthy. First, a statistical hypothesis is a statement about the parameter of a probabilistic model. In the following, we will use the subscript notations $P_{\Theta_0}$ and $P_{\Theta_1}$ to indicate that the parameter $\theta$ of the probabilistic model $P_\theta$ is an element of $\Theta_0$ or $\Theta_1$, respectively. Second, the term null hypothesis is not necessarily the statement that some parameter assumes the value zero, even if this is often the case in practice.
Rather, the null hypothesis in a statistical testing problem is the statement about the parameter one is willing to nullify, i.e., reject. Finally, the expressions $H = 0$ and $H = 1$ are not conceived as realizations of a random variable and hence hypothesis-conditional probability statements are not meaningful. The statements $H = 0$ and $H = 1$ are merely equivalent expressions for $\theta \in \Theta_0$ and $\theta \in \Theta_1$, respectively: $H = 0$ refers to the true, but unknown, state of the world that the null hypothesis is true and the alternative hypothesis is false ($\theta \in \Theta_0$), and $H = 1$ refers to the true, but unknown, state of the world that the alternative hypothesis is true and the null hypothesis is false ($\theta \in \Theta_1$). In general, hypotheses can be classified as simple or composite. A simple hypothesis refers to a subset of parameter space which contains a single element, for example $\Theta_0 := \{\theta_0\}$. A composite hypothesis refers to a subset of parameter space which contains more than one element, for example $\Theta_0 := \mathbb{R}_{\leq 0}$. The commonly encountered null hypothesis $\Theta_0 = \{0\}$, also referred to as the nil hypothesis, is an example of a simple hypothesis.
Tests. Given the test hypotheses scenario introduced above, a test is defined as a mapping from the data outcome space to the set $\{0, 1\}$, formally $\phi : \mathbb{R}^n \to \{0, 1\}$, $y \mapsto \phi(y)$. Here, the test value $\phi(Y = y) = 0$ represents the act of not rejecting the null hypothesis, while the test value $\phi(Y = y) = 1$ represents the act of rejecting the null hypothesis. Rejecting the null hypothesis is equivalent to accepting the alternative hypothesis, and accepting the null hypothesis is equivalent to rejecting the alternative hypothesis. In the following and in the main text, we suppress the notational dependence of $\phi(Y = \cdot)$ on $y$ and write $\phi(Y)$ instead. Because $Y$ is a random entity, the expression $\phi(Y)$ is also a random entity. All tests $\phi(Y)$ considered in the current study involve the composition of a test statistic $\gamma$, where R models the test statistic's outcome space, and a subsequent decision rule $\delta$, such that the test can be written as $\phi(Y) = \delta(\gamma(Y))$. Note that, as for the test, we suppress the dependencies of $\gamma(Y)$ and $\delta(\gamma(Y))$ on $y \in \mathbb{R}^n$, such that both $\gamma(Y)$ and $\delta(\gamma(Y))$ should be read as random entities. The subset of the test statistic's outcome space for which the test assumes the value 1 is referred to as the rejection region of the test. Formally, the rejection region is defined as The random events $\phi(Y) = 1$ and $\gamma(Y) \in R$ are thus equivalent and associated with the same probability under $P_\theta(Y)$. In a concrete test scenario, it is hence usually the probability distribution of the test statistic that is of principal concern for assessing the test's outcome behaviour.
Finally, all test decision rules considered in the context of the current study are based on the test statistic exceeding a critical value $u \in \mathbb{R}$. By means of the indicator function, the tests considered here can thus be written as $\phi(Y) = 1_{\{\gamma(Y) \geq u\}}$ (S2.8). Note that (S2.8) describes the situation of one-sided tests. The one-sided one-sample T-test is a familiar example of the general test structure described by expression (S2.8): using the sample mean and sample standard deviation, a realization of the random entity $Y$ is first transformed into the value of the t-statistic, whose size is then compared to a critical value in order to decide whether or not to reject the null hypothesis.
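The structure of a test as a test statistic followed by a critical-value decision rule can be written out directly. The following minimal sketch uses the one-sample t-statistic as the test statistic; the function names mirror the symbols in the text, while the critical values in the usage lines are arbitrary illustrative numbers rather than values derived from a T-distribution.

```python
import math
import statistics

def gamma(y):
    """Test statistic: one-sample t-statistic of the data vector y."""
    return math.sqrt(len(y)) * statistics.mean(y) / statistics.stdev(y)

def delta(statistic_value, u):
    """Decision rule: indicator that the statistic reaches the critical value u."""
    return 1 if statistic_value >= u else 0

def phi(y, u):
    """Test as the composition of test statistic and decision rule."""
    return delta(gamma(y), u)

y = [1.0, 3.0, 2.0]              # toy data: mean 2, standard deviation 1
reject_low = phi(y, u=2.0)       # t is about 3.46, so the null hypothesis is rejected
reject_high = phi(y, u=4.0)      # same data, stricter critical value: not rejected
```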
Test error probabilities. When conducting a hypothesis test as just described, two kinds of errors can occur. First, the null hypothesis can be rejected ($\phi(Y) = 1$) when it is in fact true ($\theta \in \Theta_0$). This error is referred to as the Type I error. Second, the null hypothesis may not be rejected ($\phi(Y) = 0$) when it is in fact false ($\theta \in \Theta_1$). The latter error is known as the Type II error. The probabilities of Type I and Type II errors under a given probabilistic model are central to the quality of a test: the probability of a Type I error is called the size of the test and is commonly denoted by $\alpha \in [0, 1]$. It is defined as $\alpha := P_{\Theta_0}(\phi(Y) = 1)$ (S2.9), and is also routinely referred to as the Type I error rate of the test. Its complementary probability, $1 - \alpha = P_{\Theta_0}(\phi(Y) = 0)$, is known as the specificity of a test. The probability of a Type II error, $P_{\Theta_1}(\phi(Y) = 0)$, lacks a common denomination. Its complementary probability, $P_{\Theta_1}(\phi(Y) = 1)$ (S2.12), is referred to as the power of a test. In words, the power of a test is the probability of accepting the alternative hypothesis (rejecting the null hypothesis) if $\theta \in \Theta_1$, i.e., if the alternative hypothesis is true. Note that basic introductions to test error probabilities often denote the probability of a Type II error by $\beta \in [0, 1]$ and thus define power by $1 - \beta$. For our current purposes, we prefer the definition of eq. (S2.12), because it keeps the notation concise and is more coherent with common notations of test quality functions.
Significance level. It is important to distinguish between the size and the significance level of a test: a test is said to be of significance level $\alpha' \in [0, 1]$ if its size $\alpha$ is smaller than or equal to $\alpha'$, i.e., if $\alpha \leq \alpha'$ (S2.13). If for a test of significance level $\alpha'$ it holds that $\alpha < \alpha'$, the test is referred to as a conservative test. If for a test of significance level $\alpha'$ it holds that $\alpha = \alpha'$, the test is referred to as an exact test. Tests with an associated significance level $\alpha'$ for which $\alpha > \alpha'$ are sometimes referred to as liberal tests. Note, however, that such tests are, strictly speaking, not of significance level $\alpha'$.
The test quality function. The size and the power of a test are summarized in the test's quality function. For a test $\phi(Y)$, the test quality function is defined as $q : \Theta \to [0, 1]$, $\theta \mapsto q(\theta) := E_{P_\theta(Y)}(\phi(Y))$ (S2.14). In words, the test quality function is a function of the probabilistic model parameter $\theta$ and assigns to each value of this parameter a value in the interval $[0, 1]$. This value is given by the expectation of the test $\phi$ under the probabilistic model $P_\theta(Y)$. The definition of the test quality function is motivated by the value it assumes for $\theta \in \Theta_0$ and $\theta \in \Theta_1$: because the random variable $\phi(Y)$ only takes on values in $\{0, 1\}$, the expected value $E_{P_\theta(Y)}(\phi(Y))$ is identical to the probability of the event $\phi(Y) = 1$ under $P_\theta(Y)$. Thus, for $\theta \in \Theta_0$, the test quality function returns the size of the test (eq. (S2.9)) and for $\theta \in \Theta_1$, the test quality function returns the power of the test (eq. (S2.12)).
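For a single-observation z-test, the test quality function has a closed form that makes the relation between size and power explicit. The sketch below assumes a model Y ~ N(θ, 1) with null hypothesis θ = 0 and a one-sided test that rejects when Y reaches the critical value; these modelling choices and the function name are illustrative assumptions.

```python
from statistics import NormalDist

def quality_function(theta, alpha_level=0.05):
    """q(theta) = P_theta(Y >= u) for Y ~ N(theta, 1) and the test 1{Y >= u}."""
    # Critical value of the exact one-sided test at significance level alpha_level.
    u = NormalDist().inv_cdf(1.0 - alpha_level)
    # Probability of rejection when the true parameter value is theta.
    return 1.0 - NormalDist(mu=theta, sigma=1.0).cdf(u)

size = quality_function(0.0)     # theta in the null hypothesis: size of the test
power = quality_function(2.0)    # theta in the alternative hypothesis: power
```

Evaluated at θ = 0 the function returns the size of the test, and evaluated at any θ > 0 it returns the power at that parameter value.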
The test power function. For $\theta \in \Theta_1$, the test quality function is also referred to as the test's power function and is denoted by $\beta$.

The multiple testing scenario is likewise developed against the background of a parametric probabilistic model $P_\theta(Y)$ that describes the probability distribution of a random entity $Y$ which models observed data taking on values in $\mathbb{R}^n$. The parameter $\theta$ of the model is assumed to take values in a parameter space $\Theta$.
Multiple test hypotheses. In multiple testing scenarios comprising $m \in \mathbb{N}$ tests, the parameter space is partitioned $m$ times into disjoint subsets $\Theta_0^{(i)}$ and $\Theta_1^{(i)}$, $i \in I$. The statements $\theta \in \Theta_0^{(i)}$ and $\theta \in \Theta_1^{(i)}$ about the true, but unknown, value of the parameter $\theta$ are referred to as the $i$th null and alternative hypothesis, respectively. Collectively, the $m$ null hypotheses and their associated alternative hypotheses are referred to as a hypotheses family and the set $I$ is referred to as the hypotheses index set. In the following, we will be concerned with the following situations:
• all null hypotheses of the hypotheses family are true and all alternative hypotheses are false,
• some null hypotheses of the hypotheses family are true and, for the remaining members of the family, the alternative hypotheses are true,
• all null hypotheses of the hypotheses family are false and all alternative hypotheses are true.
For convenience, we will refer to these scenarios as the complete null hypothesis, the partial alternative hypothesis, and the complete alternative hypothesis, respectively. The following notation is helpful to formally express the complete null hypothesis and complete alternative hypothesis scenarios, respectively:

$$\theta \in \Theta_0 := \cap_{i \in I} \Theta_0^{(i)} \quad \text{(S2.17)} \qquad \text{and} \qquad \theta \in \Theta_1 := \cap_{i \in I} \Theta_1^{(i)} \quad \text{(S2.18)}.$$

Note that despite the identical notation, the difference between the single test scenario null and alternative hypotheses (S2.1) and (S2.2), and the multiple testing scenario complete null and complete alternative hypotheses (S2.17) and (S2.18) should in general be clear from the context.
As above, we will use the subscript notations $P_{\Theta_0}$ and $P_{\Theta_1}$ to indicate that the parameter $\theta$ of the probabilistic model $P_\theta$ is an element of the complete null or alternative hypotheses $\Theta_0$ or $\Theta_1$, respectively. In light of expressions (S2.17) and (S2.18), we denote the partial alternative hypothesis by

$$\theta \in \cap_{i \in I_1} \Theta_1^{(i)} \text{ for } I_1 \subset I \text{ with } m_1 := |I_1|, \quad \text{(S2.19)}$$

and refer to $I_1$ as the alternative hypotheses index set. Given the binary nature of the $i$th null and alternative hypothesis, it follows immediately that in the case of (S2.19) it holds that

$$\theta \in \cap_{i \in I_0} \Theta_0^{(i)} \text{ for } I_0 := I \setminus I_1 \text{ with } |I_0| = m - m_1 =: m_0. \quad \text{(S2.20)}$$

We refer to $I_0$ as the null hypotheses index set. The ratio of the cardinality of the alternative hypotheses index set and the cardinality of the hypotheses index set will be denoted by

$$\lambda := \frac{m_1}{m}, \quad \text{(S2.21)}$$

and will be referred to as the alternative hypotheses ratio. Note that $\lambda = 0$ corresponds to the complete null hypothesis, whereas $\lambda = 1$ corresponds to the complete alternative hypothesis. Finally, for $\lambda \in ]0, 1[$, we use the subscript notation $P_{\lambda\Theta_1}$ to indicate that the parameter $\theta$ of the probabilistic model $P_\theta$ is an element of a partial alternative hypothesis with alternative hypotheses ratio $\lambda$.
Multiple test. For the multiple testing scenario, let $\phi_i : \mathbb{R}^n \to \{0, 1\}$, $i \in I$, denote a test, such that $\phi_i(Y = y) = 0$ represents the act of accepting the $i$th null hypothesis and rejecting the $i$th alternative hypothesis, while $\phi_i(Y = y) = 1$ represents the act of rejecting the $i$th null hypothesis and accepting the $i$th alternative hypothesis. Then a multiple test is a mapping $\Phi : \mathbb{R}^n \to \{0, 1\}^m$, $y \mapsto \Phi(y) := (\phi_i(y))_{i \in I}$. A multiple test can thus be conceived as an $m$-dimensional vector of single tests $\phi_i(Y = \cdot)$, the probability distribution of which is governed by the parametric probabilistic model $P_\theta(Y)$. As in the single test scenario, we will suppress the notational dependence of $\Phi(Y = \cdot)$ on $y$ and write $\Phi(Y)$ instead. Again, because the data $Y$ is modelled as a random entity, the expression $\Phi(Y)$ should be read as a random vector. Similarly, as in the single test scenario, we are only concerned with scenarios for which each constituent test takes the form $\phi_i(Y) = 1_{\{\gamma_i(Y) \geq u_i\}}$, where $\gamma_i$ denotes the $i$th test statistic with $i$th rejection region and $u_i \in \mathbb{R}$ denotes the $i$th critical value. The multiple one-sided one-sample T-tests commonly performed for group-level fMRI analyses are a familiar example of the general multiple test structure described by eqs. (S2.23) - (S2.26): using voxel-specific sample means and sample standard deviations, the data $Y$, usually comprising voxel-wise participant-specific beta parameter estimate contrasts derived from first-level GLM analyses, is projected onto a set of $m$ T-statistics.
The values of these $m$ T-statistics are then individually evaluated with respect to appropriately defined critical values, and for each of the $m$ voxels, the null hypothesis of zero activation is either rejected or not.

Table S.2. The multiple testing scenario. The numbers $m_0$ and $m_1$ of true null and alternative hypotheses $\theta \in \Theta_0^{(i)}$ and $\theta \in \Theta_1^{(i)}$ are assumed to be fixed and unknown. The outcome of the $i$th test $\phi_i(Y)$, and hence also the aggregate numbers of tests that assume either the value 0 or 1, $M_{ij}$, $i = 0, 1$, $j = 0, 1$, as well as their sums, $M_{00} + M_{10}$ and $M_{01} + M_{11}$, are random entities, all of which are governed by the parametric probabilistic model $P_\theta(Y)$ and the functional forms of the test statistics $\gamma_i$, $i = 1, ..., m$.
Multiple test error probabilities. The multiple testing scenario induces a variety of test error scenarios. While for the single test scenario there exist four possible constellations of true hypotheses and test outcomes ($\theta \in \Theta_j$ and $\phi(Y) = k$ for $j = 0, 1$ and $k = 0, 1$), there exist $4^m$ such constellations in the multiple testing scenario ($\theta \in \Theta_j^{(i)}$ and $\phi_i(Y) = k$ for $i = 1, ..., m$, $j = 0, 1$ and $k = 0, 1$). In other words, while a single test $\phi(Y)$ may result in either a Type I or a Type II error (or a correct result), a multiple test $\Phi(Y)$ may result in the simultaneous occurrence of Type I errors in some of its constituent single tests and Type II errors in others (and correct results in the remaining single tests). This induces probabilities for the occurrence of a variety of test error scenarios and hence a variety of Type I and Type II error rates. As Type II error rates are complementary probabilities of correct rejections of null hypotheses, different Type II error rates correspond to different notions of power. In the following, we first review the most commonly considered Type I and Type II error rates in multiple testing scenarios. In later sections, we then consider the family-wise error rate, minimal and maximal power, and their control and evaluation by means of maximum and minimum statistics in further detail.
The test error rates of multiple testing scenarios can be developed quantitatively as follows: as above, let $I_0$ and $I_1$ denote the null and alternative hypotheses index sets, respectively (cf. eqs. (S2.20) and (S2.19)). Note again that the binary single test scenario implies that $I = I_0 \cup I_1$ and $I_0 \cap I_1 = \emptyset$ and that it is assumed that the sets $I_0$ and $I_1$ and their respective cardinalities $m_0$ and $m_1$ are true, but unknown, entities. Based on the probabilistic binary outcome of each test constituent $\phi_i(Y)$, the following quantities are induced at an aggregate level:
• the number $M_{00}$ of tests for which $\theta \in \Theta_0^{(i)}$ and $\phi_i(Y) = 0$, i.e., the number of correctly retained null hypotheses,
• the number $M_{01}$ of tests for which $\theta \in \Theta_0^{(i)}$ and $\phi_i(Y) = 1$, i.e., the number of Type I errors,
• the number $M_{10}$ of tests for which $\theta \in \Theta_1^{(i)}$ and $\phi_i(Y) = 0$, i.e., the number of Type II errors, and
• the number $M_{11}$ of tests for which $\theta \in \Theta_1^{(i)}$ and $\phi_i(Y) = 1$, i.e., the number of correctly rejected null hypotheses.

The situation is summarized in Table S.2. Note that the values $m$, $m_0$ and $m_1$ correspond to true, but unknown, quantities, the four quantities $M_{jk}$, $j = 0, 1$, $k = 0, 1$ correspond to unobservable random variables, and the quantities $M_{00} + M_{10}$ and $M_{01} + M_{11}$, i.e., the total number of accepted and rejected null hypotheses, correspond to observable random variables. Commonly considered Type I error rates in this scenario are
• the family-wise error rate, defined as the probability of the event $M_{01} \geq 1$, i.e., of one or more Type I errors,
• the per-family error rate, defined as the expectation of the unobservable random variable $M_{01}$, i.e., the expected number of Type I errors,
• the per-comparison error rate, defined as the per-family error rate divided by the number of hypotheses $m$, and
• the false-discovery rate, defined as the expectation of the random variable $M_{01}/(M_{01} + M_{11})$ if $M_{01} + M_{11} > 0$ and 0 if $M_{01} + M_{11} = 0$, i.e., the expected proportion of Type I errors among the rejected null hypotheses, or 0 if no hypotheses are rejected.
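The aggregate counts can be tabulated for any given vector of true hypothesis states and test outcomes, as is possible in simulations where the truth is known. In the minimal sketch below, the first index of each count codes the true hypothesis (0 for a true null hypothesis, 1 for a true alternative hypothesis) and the second index codes the test outcome (0 for not rejected, 1 for rejected); the function names are hypothetical.

```python
def outcome_counts(alternative_true, rejected):
    """Return {(j, k): M_jk}, with j the true hypothesis state and k the test outcome."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for truth, decision in zip(alternative_true, rejected):
        counts[(int(truth), int(decision))] += 1
    return counts

def false_discovery_proportion(counts):
    """Type I errors among all rejections, or 0 if nothing was rejected."""
    rejections = counts[(0, 1)] + counts[(1, 1)]
    return counts[(0, 1)] / rejections if rejections > 0 else 0.0

# Five tests: the last two alternative hypotheses are true.
truth = [0, 0, 0, 1, 1]
decisions = [0, 1, 0, 1, 0]
M = outcome_counts(truth, decisions)
family_wise_error = M[(0, 1)] >= 1          # at least one Type I error occurred
fdp = false_discovery_proportion(M)
```

Averaging `family_wise_error` and `fdp` over many simulated data sets would give Monte Carlo estimates of the family-wise error rate and the false-discovery rate, respectively.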
Notably, in contrast to the Type I error rate in the single test scenario (i.e., the size of a test), the Type I error rates in the multiple testing scenario refer to either probabilities (such as the family-wise error rate) or expectations of the counting random variables $M_{ij}$, $i = 0, 1$, $j = 0, 1$.
In a concrete multiple testing scenario, these probabilities and expectations have to be derived based on the nature of the probabilistic model and the definition of the multiple test.
As for the generalization of the notion of a Type I error to the multiple testing scenario, the multiple testing scenario induces a variety of Type II error rates and their respective complementary probabilities, i.e., power types. Commonly considered power types in the multiple testing scenario are
• minimal power, defined as the probability of the event $M_{11} \geq 1$, i.e., of one or more correct rejections of the null hypothesis,
• average power, defined as the expectation of the random variable $M_{11}$ divided by $m_1$, i.e., the expected proportion of false null hypotheses that are rejected, and
• maximal power, defined as the probability of the event $M_{11} = m_1$, i.e., of correctly rejecting all false null hypotheses.
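The three power types can be compared in closed form in a toy setting. The sketch below assumes, purely for illustration, that the m1 tests with true alternative hypotheses are statistically independent and share the same single-test power; neither assumption holds for spatially smooth fMRI data, and the function name is hypothetical.

```python
def power_types(single_test_power, m1):
    """Minimal, average, and maximal power for m1 independent tests of equal power."""
    minimal = 1.0 - (1.0 - single_test_power) ** m1   # P(M11 >= 1)
    average = single_test_power                        # E(M11) / m1
    maximal = single_test_power ** m1                  # P(M11 = m1)
    return minimal, average, maximal

minimal, average, maximal = power_types(single_test_power=0.5, m1=2)
```

Even in this simple setting, minimal power is never smaller than average power, which in turn is never smaller than maximal power, and the gap widens quickly as m1 grows.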
Multiple test construction. As in the single test scenario, multiple tests are usually constructed to first and foremost control a chosen Type I error rate at a desired significance level α. In a second step, additional test construction measures may then be taken to achieve a desired level of a chosen power type. The random field theory-based fMRI inference framework has traditionally focussed on the family-wise error rate (FWER) as the target for Type I error rate control. In the following, we shall thus further elaborate on the definition of the FWER and establish how the distribution of the maximum statistic can be utilized for its control. Furthermore, we formally develop the notions of minimal and maximal power and their relation to maximum and minimum statistics, respectively.
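Maximum statistic-based FWER control. As introduced above, the FWER of a multiple test is defined as the probability of one or more Type I errors. More formally, let $\Phi(Y) = (\phi_i(Y))_{i \in I}$ denote a multiple test with hypotheses index set $I$ and null hypotheses index set $I_0 \subseteq I$, $I_0 \neq \emptyset$. Then the FWER is defined as the probability

$$P_{\cap_{i \in I_0} \Theta_0^{(i)}}\left(\cup_{i \in I_0} \{\phi_i(Y) = 1\}\right). \tag{S2.27}$$

This expression is to be understood as follows: clearly, the FWER refers to the probability of events $\phi_i(Y) = 1$ under the probabilistic model for the case that at least one null hypothesis holds true, i.e., $I_0 \neq \emptyset$. More specifically, the intersection subscript $\cap_{i \in I_0} \Theta_0^{(i)}$ qualifies that the parameter of the probabilistic model is such that all null hypotheses with indices in the set $I_0 \subseteq I$, $I_0 \neq \emptyset$ hold. Complementarily, the union statement $\cup_{i \in I_0} \{\phi_i(Y) = 1\}$ implies that the event $\phi_{i_1}(Y) = 1$ and/or the event $\phi_{i_2}(Y) = 1$, ..., and/or the event $\phi_{i_{m_0}}(Y) = 1$ with $i_j \in I_0$ for $j = 1, 2, ..., m_0$ occurs, i.e., that at least one, but possibly more, of the events $\phi_i(Y) = 1$ with $i \in I_0$ occur. This is equivalent to the probability of the event $M_{01} \geq 1$ as considered above. In analogy to the significance level in the single test scenario, a multiple test $\Phi(Y)$ is then said to be of family-wise significance level $\alpha_{FWE}$, if its FWER is equal to or smaller than $\alpha_{FWE}$, i.e., if

$$P_{\cap_{i \in I_0} \Theta_0^{(i)}}\left(\cup_{i \in I_0} \{\phi_i(Y) = 1\}\right) \leq \alpha_{FWE}.$$

Equivalently, such a test is said to control the FWER at level $\alpha_{FWE}$. If for a test $\Phi(Y)$ the FWER is equal to $\alpha_{FWE}$, we say that $\Phi(Y)$ offers exact control of the FWER at level $\alpha_{FWE}$. A general method to establish FWER control for a multiple test of the form (S2.23)-(S2.26) at a level $\alpha_{FWE}$ is afforded by consideration of the distribution of the maximum test statistic

$$\gamma^0_{\max}(Y) := \max_{i \in I_0} \gamma_i(Y). \tag{S2.29}$$

The method rests on identifying a common critical value $u^{FWE}_\alpha \in \mathbb{R}$ for all constituent tests $\phi_i(Y)$ of the form (S2.24) that satisfies

$$P_{\cap_{i \in I_0} \Theta_0^{(i)}}\left(\cup_{i \in I_0} \{\phi_i(Y) = 1\}\right) \leq \alpha_{FWE}. \tag{S2.31}$$

To see how the maximum statistic affords such a critical value, let the constituent tests reject whenever their test statistic is equal to or exceeds the critical value, i.e., let $\phi_i(Y) = 1$ if $\gamma_i(Y) \geq u^{FWE}_\alpha$ and $\phi_i(Y) = 0$ otherwise, for all $i \in I_0$. Further, let the critical value $u^{FWE}_\alpha$ be such that with the definition of the maximum statistic $\gamma^0_{\max}(Y)$ in eq. (S2.29) it holds that

$$P_{\cap_{i \in I_0} \Theta_0^{(i)}}\left(\gamma^0_{\max}(Y) \geq u^{FWE}_\alpha\right) \leq \alpha_{FWE}. \tag{S2.32}$$

Then, with the definition of the FWER in eq. (S2.27), and writing $P$ for $P_{\cap_{i \in I_0} \Theta_0^{(i)}}$, it follows that

$$\begin{aligned}
P\left(\cup_{i \in I_0} \{\phi_i(Y) = 1\}\right)
&= 1 - P\left(\cap_{i \in I_0} \{\phi_i(Y) = 0\}\right) \\
&= 1 - P\left(\gamma_{i_1}(Y) < u^{FWE}_\alpha, ..., \gamma_{i_{m_0}}(Y) < u^{FWE}_\alpha\right) \\
&= 1 - P\left(\gamma^0_{\max}(Y) < u^{FWE}_\alpha\right) \\
&= P\left(\gamma^0_{\max}(Y) \geq u^{FWE}_\alpha\right) \leq \alpha_{FWE},
\end{aligned}$$

where on the right-hand side of the second equation $i_j \in I_0$ for $j = 1, 2, ..., m_0$. In verbose form: the probability of the event that one or more of the component tests $\phi_i(Y)$, $i \in I_0$ of the multiple test $\Phi(Y)$ evaluate to 1 over the set of true null hypotheses $I_0$ is equal to the complementary probability of the event that all component tests evaluate to 0 over the set of true null hypotheses $I_0$. Given the form of the multiple test $\Phi(Y)$, this probability in turn corresponds to the probability that all relevant component test statistics assume values smaller than the critical value $u^{FWE}_\alpha$. The latter event is identical to the event that the maximum statistic $\gamma^0_{\max}(Y)$ over the set of true null hypotheses is smaller than $u^{FWE}_\alpha$. The complementary probability of this event then implies the validity of eq. (S2.31).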
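The maximum statistic principle does not require independent test statistics. As a hedged illustration, the following sketch estimates a common critical value as the empirical $(1 - \alpha_{FWE})$ quantile of simulated maximum statistics and then checks the resulting FWER on fresh draws. The equicorrelated Gaussian null model, the correlation parameter `rho`, and all function names are assumptions made for this example only; they are not the random field theory-based approximations used in the main text.

```python
import math
import random


def max_null_statistic(rng, m0, rho):
    """Draw the maximum of m0 equicorrelated standard Gaussian null statistics.

    Assumed toy model: gamma_i = sqrt(rho) * shared + sqrt(1 - rho) * own_i.
    """
    shared = rng.gauss(0.0, 1.0)
    return max(
        math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        for _ in range(m0)
    )


def fwe_critical_value(m0, rho, alpha_fwe, n_sim=20000, seed=1):
    """Empirical (1 - alpha_fwe) quantile of the simulated maximum statistic."""
    rng = random.Random(seed)
    maxima = sorted(max_null_statistic(rng, m0, rho) for _ in range(n_sim))
    return maxima[math.ceil((1.0 - alpha_fwe) * n_sim) - 1]


def empirical_fwer(u, m0, rho, n_sim=20000, seed=2):
    """Fraction of fresh simulations in which at least one null statistic is >= u."""
    rng = random.Random(seed)
    hits = sum(max_null_statistic(rng, m0, rho) >= u for _ in range(n_sim))
    return hits / n_sim


if __name__ == "__main__":
    u = fwe_critical_value(m0=20, rho=0.5, alpha_fwe=0.05)
    print("critical value:", round(u, 3))
    print("empirical FWER:", empirical_fwer(u, m0=20, rho=0.5))
```

The event "at least one null statistic is at least `u`" is evaluated directly through the maximum, which mirrors the chain of equalities in the derivation above.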
A note on the usage of the terms uncorrected single test and corrected multiple testing inference in the main text may be appropriate here: de facto, FWER control in multiple testing is not based on some form of correction procedure that turns an uncorrected p-value into a corrected p-value. Instead, the two p-values of uncorrected and corrected inference refer to different statistics. Because the notion of correcting for the multiple testing problem using corrected p-values is deeply ingrained in the fMRI literature, however, we refrain from abandoning this terminology.
Maximum statistic-based minimal power evaluation. Minimal power can be conceived of as the mirror analogue of the FWER. As defined above, minimal power is the probability of one or more correct rejections of the null hypothesis. In analogy to the FWER, minimal power of a multiple test $\Phi(Y) = (\phi_i(Y))_{i \in I}$ with hypotheses index set $I$ and alternative hypotheses index set $I_1 \subseteq I$, $I_1 \neq \emptyset$ is formally given as

$$P_{\cap_{i \in I_1} \Theta_1^{(i)}}\left(\cup_{i \in I_1} \{\phi_i(Y) = 1\}\right). \tag{S2.34}$$

As for the formal expression of the FWER, the intersection subscript $\cap_{i \in I_1} \Theta_1^{(i)}$ qualifies that the parameter of the probabilistic model is such that all alternative hypotheses with indices in the set $I_1 \subseteq I$, $I_1 \neq \emptyset$ hold true, while the union statement $\cup_{i \in I_1} \{\phi_i(Y) = 1\}$ implies that the event $\phi_{i_1}(Y) = 1$ and/or the event $\phi_{i_2}(Y) = 1$, ..., and/or the event $\phi_{i_{m_1}}(Y) = 1$ with $i_j \in I_1$ for $j = 1, 2, ..., m_1$ occurs, i.e., that at least one, but possibly more, of the events $\phi_i(Y) = 1$ with $i \in I_1$ occur. This is equivalent to the event $M_{11} \geq 1$ as considered above. Minimal power can be evaluated in a straightforward fashion for multiple testing procedures that employ a common critical value and for which the distribution of the maximum statistic is known. Specifically, given a test of the form (S2.23)-(S2.26), the definition of the maximum statistic

$$\gamma^1_{\max}(Y) := \max_{i \in I_1} \gamma_i(Y),$$

and a critical value $u \in \mathbb{R}$, it holds that

$$P_{\cap_{i \in I_1} \Theta_1^{(i)}}\left(\cup_{i \in I_1} \{\phi_i(Y) = 1\}\right) = P_{\cap_{i \in I_1} \Theta_1^{(i)}}\left(\gamma^1_{\max}(Y) \geq u\right). \tag{S2.36}$$

In the applied context of the current study, eq. (S2.36) implies that minimal power can be evaluated by considering the appropriate maximum statistic distributions of the random field theory-based fMRI inference framework.
Minimum statistic-based maximal power evaluation. In analogy to the formal definition of minimal power in eq. (S2.34), maximal power can be defined as

$$P_{\cap_{i \in I_1} \Theta_1^{(i)}}\left(\cap_{i \in I_1} \{\phi_i(Y) = 1\}\right).$$

In analogy to the formal FWER and minimal power definitions, the intersection subscript $\cap_{i \in I_1} \Theta_1^{(i)}$ qualifies that the parameter of the probabilistic model is such that all alternative hypotheses with indices in the set $I_1 \subseteq I$, $I_1 \neq \emptyset$ hold, while the intersection statement $\cap_{i \in I_1} \{\phi_i(Y) = 1\}$ implies that the event $\phi_{i_1}(Y) = 1$ and the event $\phi_{i_2}(Y) = 1$, ..., and the event $\phi_{i_{m_1}}(Y) = 1$ with $i_j \in I_1$ for $j = 1, 2, ..., m_1$ occur, i.e., that all events $\phi_i(Y) = 1$ with $i \in I_1$ occur. This is equivalent to the probability of the event $M_{11} = m_1$ as considered above. Moreover, given a test of the form (S2.23)-(S2.26), the definition of the minimum test statistic

$$\gamma^1_{\min}(Y) := \min_{i \in I_1} \gamma_i(Y),$$

and a critical value $u \in \mathbb{R}$, it holds that

$$P_{\cap_{i \in I_1} \Theta_1^{(i)}}\left(\cap_{i \in I_1} \{\phi_i(Y) = 1\}\right) = P_{\cap_{i \in I_1} \Theta_1^{(i)}}\left(\gamma^1_{\min}(Y) \geq u\right). \tag{S2.40}$$

In the context of the current study, eq. (S2.40) implies that the evaluation of maximal power necessitates the availability of the minimum statistic distributions of the random field theory-based fMRI inference framework under the appropriate alternative hypotheses scenarios.
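The relations between minimal power and the maximum statistic and between maximal power and the minimum statistic can be verified in a few steps. The following is a brief derivation sketch under the assumption, as in the FWER argument, that each constituent test evaluates to $\phi_i(Y) = 1$ exactly when $\gamma_i(Y) \geq u$. Here $P$ abbreviates the probability under a parameter for which all alternative hypotheses with indices in $I_1$ hold, and $\gamma^1_{\max}(Y)$ and $\gamma^1_{\min}(Y)$ denote the maximum and minimum of the statistics $\gamma_i(Y)$ over $i \in I_1$.

$$\begin{aligned}
P\left(\cup_{i \in I_1} \{\phi_i(Y) = 1\}\right)
&= 1 - P\left(\cap_{i \in I_1} \{\gamma_i(Y) < u\}\right)
 = 1 - P\left(\gamma^1_{\max}(Y) < u\right)
 = P\left(\gamma^1_{\max}(Y) \geq u\right), \\
P\left(\cap_{i \in I_1} \{\phi_i(Y) = 1\}\right)
&= P\left(\cap_{i \in I_1} \{\gamma_i(Y) \geq u\}\right)
 = P\left(\gamma^1_{\min}(Y) \geq u\right).
\end{aligned}$$

The first line uses that all statistics fall below $u$ if and only if their maximum does; the second line uses that all statistics reach $u$ if and only if their minimum does.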
Power functions. Based on the definitions of the partial alternative hypothesis ratio in eq. [...]
Note that the assumption $I_0 \subset I$, $I_0 \neq \emptyset$ implies that neither $I_0$ nor $I_1$ are empty sets and that hence $\lambda \in \,]0, 1[$.

S.2.3 Positive predictive value functions
The concept of a positive predictive value (PPV) descends from a framework originally presented by Wacholder et al. (2004). Specifically, it arises in the context of probabilistic models, in which, in contrast to the classical frequentist test theory discussed thus far, both the test outcomes and the hypotheses states are modelled by random variables. In the following, we first consider the notion of a PPV in the context of the single test scenario discussed in Section S.2.1 and develop the notion of a PPV function. In a second step, we then consider the notion of a PPV in the multiple testing scenario of Section S.2.2 and define the ensuing minimal and maximal PPV functions.
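The single test scenario. To establish the formal background of the PPV, we consider the parametric probabilistic model

$$P_\theta(H, \phi(Y)) = P(H)\, P_\theta(\phi(Y) \mid H),$$

where the random variable $H$ models the hypothesis state and the random variable $\phi(Y)$ models the test state. As in Section S.2.1, $H = 0$ models the case that the null hypothesis is true and the alternative hypothesis is false, and $H = 1$ models the case that the null hypothesis is false and the alternative hypothesis is true (cf. eqs. (S2.1) and (S2.2)). Similarly, $\phi(Y) = 0$ represents the act of not rejecting the null hypothesis and $\phi(Y) = 1$ represents the act of rejecting the null hypothesis (cf. eq. (S2.3)). Note that the distribution of the data is considered only implicitly in the current probabilistic model, which is justified as we again consider only deterministic test procedures. For the development of the PPV, the joint distribution of $H$ and $\phi(Y)$ is constructed by (1) defining the marginal distribution of $H$ in terms of the hypothesis prior probability $P(H = 1) := \pi$, and (2) defining the conditional distribution of $\phi(Y)$ given $H$ by

$$P_\theta(\phi(Y) = 1 \mid H = 0) := \alpha \quad \text{and} \quad P_\theta(\phi(Y) = 1 \mid H = 1) := \beta,$$

where $\alpha$ and $\beta$ refer to the size and the power of the test $\phi(Y)$ (cf. eqs. (S2.9) and (S2.12)), respectively. Based on the thus defined joint distribution, the conditional probability of $H$ to assume the value 1 given that $\phi(Y)$ assumes the value 1 evaluates to

$$P_\theta(H = 1 \mid \phi(Y) = 1) = \frac{\pi\beta}{\pi\beta + (1 - \pi)\alpha}. \tag{S2.48}$$

Proof of eq. (S2.48) (1) For the joint distribution, we have

$$\begin{aligned}
P_\theta(H = 0, \phi(Y) = 0) &= (1 - \pi)(1 - \alpha), & P_\theta(H = 0, \phi(Y) = 1) &= (1 - \pi)\alpha, \\
P_\theta(H = 1, \phi(Y) = 0) &= \pi(1 - \beta), & P_\theta(H = 1, \phi(Y) = 1) &= \pi\beta.
\end{aligned}$$

(2) For the marginal distribution $P_\theta(\phi(Y))$, we thus have

$$P_\theta(\phi(Y) = 0) = (1 - \pi)(1 - \alpha) + \pi(1 - \beta) \quad \text{and} \quad P_\theta(\phi(Y) = 1) = (1 - \pi)\alpha + \pi\beta.$$

(3) Finally, for the conditional distribution of the alternative hypothesis being true ($H = 1$) given a positive test outcome $\phi(Y) = 1$, we have

$$P_\theta(H = 1 \mid \phi(Y) = 1) = \frac{P_\theta(H = 1, \phi(Y) = 1)}{P_\theta(\phi(Y) = 1)} = \frac{\pi\beta}{\pi\beta + (1 - \pi)\alpha},$$

which completes the proof.

Figure S.1. [...] Optimal test properties of α = 0 and β = 1 result in an optimal PPV. For a test size of α = 1, optimal test power of β = 1 yields a PPV corresponding to the hypothesis prior probability π = 0.5. (C) Panel C visualizes the PPV for the commonly desired power of β = 0.8 as a function of the hypothesis prior probability and the test size. Note that the hypothesis prior probability dominates both test size and power. For implementational details, please see rftp_figure_S1.m.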
For the probability $P_\theta(H = 1 \mid \phi(Y) = 1)$, Ioannidis (2005) coined the term positive predictive value. Intuitively, the PPV is thus the probability of the alternative hypothesis being true, given a positive test outcome. We visualize the dependency of the PPV on the hypothesis prior probability π, the test power β, and the test size α in Figure S.1. Fixing one of the three parameters of the PPV at a conventional level (α = 0.05, π = 0.5, and β = 0.8) demonstrates that the PPV of a test increases with the hypothesis prior probability and the test power, and decreases with increasing test size. Note that Ioannidis (2005) and Button et al. (2013) prefer the formulation of the PPV in terms of the pre-study odds

$$\omega := \frac{\pi}{1 - \pi}, \tag{S2.52}$$

rather than the hypothesis prior probability. In terms of the pre-study odds, the PPV can be re-expressed as

$$P_\theta(H = 1 \mid \phi(Y) = 1) = \frac{\omega\beta}{\omega\beta + \alpha}. \tag{S2.53}$$

Proof of eq. (S2.53) With the expression for the conditional probability of $H = 1$ given $\phi(Y) = 1$ and the definition of the pre-study odds of eq. (S2.52), we have

$$P_\theta(H = 1 \mid \phi(Y) = 1) = \frac{\pi\beta}{\pi\beta + (1 - \pi)\alpha} = \frac{\frac{\pi}{1 - \pi}\beta}{\frac{\pi}{1 - \pi}\beta + \alpha} = \frac{\omega\beta}{\omega\beta + \alpha},$$

where the second equality follows by dividing numerator and denominator by $1 - \pi$.

For a prefixed test size, the notions of a single test's power function (cf. eq. (S2.15)) and the functional form of the PPV (cf. eq. (S2.48)) induce the positive predictive value function (PPV function)

$$\psi : \Theta \to [0, 1], \quad \theta \mapsto \psi(\theta) := \frac{\pi\beta(\theta)}{\pi\beta(\theta) + (1 - \pi)\alpha},$$

where $\beta(\theta)$ denotes the value of the test power function for $\theta$. Note that the values of the PPV function depend on the prefixed test size α, the parameters of the probabilistic model by means of the single test power function β, and the hypothesis prior probability π.
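The functional form of the PPV and its pre-study odds reformulation can be checked numerically. The following minimal sketch evaluates both forms; the function names `ppv` and `ppv_from_odds` are hypothetical and chosen for this illustration only.

```python
def ppv(prior, power, size):
    """PPV as pi * beta / (pi * beta + (1 - pi) * alpha).

    prior: hypothesis prior probability pi = P(H = 1)
    power: test power beta = P(phi(Y) = 1 | H = 1)
    size:  test size alpha = P(phi(Y) = 1 | H = 0)
    The expression is undefined if the test never rejects (zero denominator).
    """
    return prior * power / (prior * power + (1.0 - prior) * size)


def ppv_from_odds(odds, power, size):
    """PPV in terms of the pre-study odds omega = pi / (1 - pi)."""
    return odds * power / (odds * power + size)


if __name__ == "__main__":
    prior, power, size = 0.5, 0.8, 0.05  # conventional levels named in the text
    print(ppv(prior, power, size))
    print(ppv_from_odds(prior / (1.0 - prior), power, size))
```

Both functions return the same value for any admissible inputs, which is the content of the equivalence between the prior probability and pre-study odds formulations.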
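The multiple testing scenario. To generalize the notion of a PPV function to the multiple testing scenario, we define the minimal and maximal positive predictive value functions

$$\psi^\lambda_{\min}(\theta) := \frac{\pi\beta^\lambda_{\min}(\theta)}{\pi\beta^\lambda_{\min}(\theta) + (1 - \pi)\alpha_{FWE}}$$

and

$$\psi^\lambda_{\max}(\theta) := \frac{\pi\beta^\lambda_{\max}(\theta)}{\pi\beta^\lambda_{\max}(\theta) + (1 - \pi)\alpha_{FWE}}, \tag{S2.57}$$

respectively, where $\beta^\lambda_{\min}$ and $\beta^\lambda_{\max}$ denote the minimal and maximal power functions as defined in (S2.42) and (S2.43). Note that in this scenario, the marginal hypothesis parameter π represents the prior probability of the partial alternative hypothesis scenario with partial alternative hypothesis parameter $\lambda \in \,]0, 1[$.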

S.2.4 Examples
To illustrate the theoretical concepts of Section S.2.1 to Section S.2.3 and as a conceptual reference point for the random field theory-based fMRI inference scenarios discussed in the main text, we next discuss two examples. The first example concerns a single test scenario, the second example concerns the extension of the first example to the multiple testing scenario. In both scenarios, we make repeated use of the probability density function of the Gaussian distribution, which we abbreviate by

$$N(y; \mu, \Sigma) := (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu)\right)$$

for expectation parameter $\mu \in \mathbb{R}^n$ and positive-definite covariance matrix parameter $\Sigma \in \mathbb{R}^{n \times n}$.
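A single-observation z-test

Probabilistic model. As a first example, we consider a probabilistic model $P_\theta(Y)$ that governs the distribution of a data random variable $Y$ taking values in $\mathbb{R}$. For $\mu \in \mathbb{R}$ and $\sigma^2 > 0$, the model is assumed to be defined in terms of the probability density function

$$p_\theta(y) := N(y; \mu, \sigma^2). \tag{S2.59}$$

Intuitively, a single data point $Y = y$ is thus assumed to have been sampled from a univariate Gaussian distribution of unknown expectation and known variance. For this model, we assume that the parameter space of interest is of the form $\Theta := \mathbb{R}_{\geq 0}$.

Test hypotheses, statistic, and definition. We consider the null hypothesis $\Theta_0 := \{0\}$ and the alternative hypothesis $\Theta_1 := \mathbb{R}_{>0}$, the test statistic $Z(Y) := Y$, and, for a critical value $u \in \mathbb{R}$, the test $\phi(Y) := 1$ if $Z(Y) \geq u$ and $\phi(Y) := 0$ otherwise.
In words, the null hypothesis $\mu \in \Theta_0$ is rejected if the data realization is equal to or exceeds a given critical value $u \in \mathbb{R}$; otherwise it is not rejected.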
Distributions of the test statistic. As discussed in Section S.2.1, to afford Type I error rate control and to evaluate the power of a thus controlled test, the distributions of the test statistic under the null and alternative hypotheses are central. The former distribution allows for identifying a critical value such that the size of the test maximally assumes a certain probability.
The latter distribution allows for evaluating the probability of rejecting the null hypothesis under the scenario of the alternative hypothesis being true. In the current test scenario, the distribution of the test statistic under the null hypothesis $\theta \in \Theta_0$, and hence also the probabilities for the equivalent events $Z(Y) \in [u, \infty[$ and $\phi(Y) = 1$, can be readily inferred: because the test statistic conforms to the identity mapping, its distribution for $\mu \in \Theta_0$ is given by the probability density function

$$p_{\Theta_0}(z) = N(z; 0, \sigma^2). \tag{S2.63}$$

Likewise, the test statistic distribution for $\theta \in \Theta_1$ and its associated events $Z(Y) \in [u, \infty[$ and $\phi(Y) = 1$ is given by the probability density function

$$p_{\Theta_1}(z) = N(z; \mu, \sigma^2) \text{ with } \mu > 0. \tag{S2.64}$$

Type I error rate control. An exact test of significance level α requires a critical value $u_\alpha$ such that

$$P_{\Theta_0}(Z(Y) \geq u_\alpha) = \int_{u_\alpha}^{\infty} N(z; 0, \sigma^2)\, dz = \alpha.$$

Note that the required integral can be expressed in terms of the cumulative distribution function of the univariate Gaussian distribution, for which well-known and widely implemented approximations exist. A numerical approach for the evaluation of $u_\alpha$ based on the probability $P_{\Theta_0}(Z(Y) \geq u_\alpha)$ is discussed in Section S.6.
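For the Gaussian model of this example, both the critical value of an exact test and the probability of rejecting the null hypothesis for a given expectation parameter are available through the Gaussian cumulative distribution function and its inverse. The following sketch uses Python's standard library `statistics.NormalDist`; the function names are hypothetical, and the closed-form route is an illustrative alternative to the numerical approach of Section S.6, not the authors' implementation.

```python
from statistics import NormalDist


def critical_value(alpha, sigma=1.0):
    """Critical value u_alpha with P(Z >= u_alpha) = alpha for Z ~ N(0, sigma^2)."""
    return NormalDist(0.0, sigma).inv_cdf(1.0 - alpha)


def rejection_probability(mu, alpha, sigma=1.0):
    """P(Z >= u_alpha) for Z ~ N(mu, sigma^2), i.e., the power at expectation mu."""
    u_alpha = critical_value(alpha, sigma)
    return 1.0 - NormalDist(mu, sigma).cdf(u_alpha)


if __name__ == "__main__":
    print("u_alpha:", critical_value(0.05))
    for mu in (0.0, 1.0, 2.0, 3.0):
        print("mu =", mu, "rejection probability =", rejection_probability(mu, 0.05))
```

At `mu = 0` the rejection probability equals the test size, and it increases monotonically with `mu`.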
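Power and positive predictive value function. Given a critical value $u_\alpha$ and the distribution of the test statistic under the alternative hypothesis scenario as specified by (S2.64), the probability of the event $\phi(Y) = 1$ evaluates to

$$P_\mu(\phi(Y) = 1) = \int_{u_\alpha}^{\infty} N(z; \mu, \sigma^2)\, dz.$$

The power function of the test thus takes the form

$$\beta : \Theta \to [0, 1], \quad \mu \mapsto \beta(\mu) := \int_{u_\alpha}^{\infty} N(z; \mu, \sigma^2)\, dz.$$

Figure S.2. [...] Note that for a hypothesis prior of π = 0, the PPV of the test does not exceed 0.5, while for a hypothesis prior of π = 1 the PPV of the test is equal to one, regardless of the effect size. For implementational details, please see rftp_figure_S2.m.

Probabilistic model. For the extension to the multiple testing scenario, the model is assumed to be defined in terms of the probability density function

$$p_\theta(y) := N(y; \mu, \sigma^2 I_m),$$

where $\mu \in \mathbb{R}^m$ and $I_m$ denotes the $m \times m$ identity matrix. In other words, $Y$ is distributed according to a multivariate Gaussian distribution with expectation parameter µ and spherical covariance matrix parameter $\sigma^2 I_m$. The parameter space of the model is assumed to be given by $\Theta := \mathbb{R}^m_{\geq 0}$ and concerns the expectation parameter µ.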
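Test hypotheses, statistics, and definition. For an index set $I := \{1, 2, ..., m\}$ and a value $\theta \in \mathbb{R}_{>0}$ we consider the family of hypotheses

$$\Theta_0^{(i)} := \{\mu \in \Theta \mid \mu_i = 0\} \quad \text{and} \quad \Theta_1^{(i)} := \{\mu \in \Theta \mid \mu_i = \theta\} \quad \text{for } i \in I,$$

the test statistics $Z_i(Y) := Y_i$, and, for a common critical value $u \in \mathbb{R}$, the multiple test

$$\Phi(Y) := (\phi_i(Y))_{i \in I} \quad \text{with } \phi_i(Y) := 1 \text{ if } Z_i(Y) \geq u \text{ and } \phi_i(Y) := 0 \text{ otherwise}. \tag{S2.75}$$

The $i$th test statistic thus corresponds to the projection of $Y$ onto its $i$th coordinate. Note that for the current scenario the dimension of the data outcome space and the number of tested hypotheses are identical, but this does not necessarily have to be the case.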
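Distributions of the maximum statistic. We next assume that we aim for the maximum statistic-based control of the FWER of the multiple test defined in eq. (S2.75). As discussed in Section S.2.2, this entails the evaluation of the distribution of the maximum statistic

$$Z^0_{\max}(Y) := \max_{i \in I_0} Z_i(Y).$$

For the current example this distribution can be expressed in terms of the exceedance probability function (EPF)

$$P_{\cap_{i \in I_0} \Theta_0^{(i)}}\left(Z^0_{\max}(Y) \geq z\right) = 1 - \left(\int_{-\infty}^{z} N(x; 0, \sigma^2)\, dx\right)^{m_0}. \tag{S2.77}$$

Proof of eq. (S2.77) Writing $P$ for $P_{\cap_{i \in I_0} \Theta_0^{(i)}}$, we have

$$\begin{aligned}
P\left(Z^0_{\max}(Y) \geq z\right)
&= 1 - P\left(Z^0_{\max}(Y) < z\right) \\
&= 1 - P\left(Z_{i_1}(Y) < z, ..., Z_{i_{m_0}}(Y) < z\right) \\
&= 1 - \prod_{j=1}^{m_0} P\left(Z_{i_j}(Y) < z\right) \\
&= 1 - \prod_{j=1}^{m_0} \int_{-\infty}^{z} N(x; 0, \sigma^2)\, dx
 = 1 - \left(\int_{-\infty}^{z} N(x; 0, \sigma^2)\, dx\right)^{m_0},
\end{aligned}$$

where $i_j \in I_0$ for $j = 1, 2, ..., m_0$. The factorization of the joint distribution of the relevant test statistics implied by the third equation follows from the assumption of a spherical covariance matrix for the probabilistic model, which for the multivariate Gaussian distribution implies the independence of its component random variables.
The fourth equation follows with the well-known form of the marginal distributions of the multivariate Gaussian distribution.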
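Furthermore, we aim for the evaluation of minimal and maximal power functions and their associated PPV functions. As discussed in Section S.2.2, this entails the evaluation of the distributions of

$$Z^1_{\max}(Y) := \max_{i \in I_1} Z_i(Y) \quad \text{and} \quad Z^1_{\min}(Y) := \min_{i \in I_1} Z_i(Y).$$

These distributions can be expressed in terms of the EPFs

$$P_{\cap_{i \in I_1} \Theta_1^{(i)}}\left(Z^1_{\max}(Y) \geq z\right) = 1 - \left(\int_{-\infty}^{z} N(x; \theta, \sigma^2)\, dx\right)^{m_1}$$

and

$$P_{\cap_{i \in I_1} \Theta_1^{(i)}}\left(Z^1_{\min}(Y) \geq z\right) = \left(\int_{z}^{\infty} N(x; \theta, \sigma^2)\, dx\right)^{m_1}, \tag{S2.82}$$

respectively. The EPF of $Z^1_{\max}$ follows in analogy to the proof of eq. (S2.77) with $I_1$, $m_1$, and expectation parameter θ in place of $I_0$, $m_0$, and 0. Similarly, the EPF (S2.82) follows from

$$P\left(Z^1_{\min}(Y) \geq z\right) = P\left(Z_{i_1}(Y) \geq z, ..., Z_{i_{m_1}}(Y) \geq z\right) = \prod_{j=1}^{m_1} P\left(Z_{i_j}(Y) \geq z\right) = \left(\int_{z}^{\infty} N(x; \theta, \sigma^2)\, dx\right)^{m_1},$$

where $i_j \in I_1$ for $j = 1, 2, ..., m_1$.

Figure S.3. [...] Note that to achieve similar levels, maximal power requires much larger effect sizes than minimal power. (C) Minimal and maximal PPVs as functions of the prior partial alternative hypothesis parameter and the effect size parameter d for a partial alternative hypothesis parameter of λ = 0.1. As in the single test scenario, extreme prior alternative hypothesis parameters render the PPV less dependent on the effect size than medium sized prior alternative hypothesis parameters. For implementational details, please see rftp_figure_S3.m.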
Type I error rate control. As discussed in Section S.2.2, exact FWER control at significance level $\alpha_{FWE}$ is afforded by identifying a common critical value $u^{FWE}_\alpha$ such that

$$P_{\cap_{i \in I_0} \Theta_0^{(i)}}\left(Z^0_{\max}(Y) \geq u^{FWE}_\alpha\right) = \alpha_{FWE}.$$

With this critical value, the minimal and maximal power functions $\beta^\lambda_{\min}$ and $\beta^\lambda_{\max}$ follow by evaluating the EPFs of the maximum and minimum statistics over $I_1$ at $u^{FWE}_\alpha$. Finally, the introduction of a partial alternative hypothesis prior π induces the minimal and maximal PPV functions

$$\psi^\lambda_{\min} := \frac{\pi\beta^\lambda_{\min}}{\pi\beta^\lambda_{\min} + (1 - \pi)\alpha_{FWE}} \quad \text{and} \quad \psi^\lambda_{\max} := \frac{\pi\beta^\lambda_{\max}}{\pi\beta^\lambda_{\max} + (1 - \pi)\alpha_{FWE}}. \tag{S2.90}$$

We visualize the multiple testing scenario of the single-observation z-test in Figure S.3 for the case of m := 100 simultaneously tested hypotheses. The upper and lower subpanels of Figure S.3A visualize the exceedance probabilities $P_\theta(Z_{\max} \geq z)$ and $P_\theta(Z_{\min} \geq z)$. Note that in comparison with the Z statistic of the single test scenario in Figure S.2A, the maximum statistic exceedance probabilities are shifted to larger values of z, i.e., the maximum statistic $Z_{\max}$ has a higher probability to exceed a given z value than the Z statistic, and decay faster. Similarly, the minimum statistic exceedance probability mass is shifted to lower values of z, with the same decay as the maximum statistic. The upper and lower panels of Figure S.3B visualize the ensuing minimal and maximal power functions $\beta^\lambda_{\min}$ and $\beta^\lambda_{\max}$, respectively, as a function of the effect size d and the partial alternative hypothesis parameter $\lambda \in \,]0, 1[$. Note that for both power types, a high level of λ implies a small value of $m_0$, which in turn results in a lower critical value $u^{FWE}_\alpha$, which, for constant significance level and effect size, implies a higher value of the respective power function. This effect is particularly prominent in the case of maximal power, which for comparable power levels requires much higher effect size values when compared to minimal power, and exhibits a symmetry about a partial alternative hypothesis parameter of λ = 0.5. Finally, the upper and lower subpanels of Figure S.3C visualize the minimal and maximal PPV functions $\psi^\lambda_{\min}$ and $\psi^\lambda_{\max}$ for λ = 0.1, respectively.
As in the single test scenario, the introduction of a partial alternative hypothesis scenario prior probability results in a modulation of the respective power functions, which for prior parameter values towards the boundaries π = 0 and π = 1 of the prior parameter space renders the PPV function less dependent on effect sizes than in the center of the prior parameter space around π = 0.5.
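For independent Gaussian test statistics, the FWER-controlling critical value and the minimal and maximal power values are available in closed form, which makes the qualitative statements about Figure S.3 easy to check. The sketch below rests on assumptions that are not stated in this form in the text: the effect size is taken to be `d = mu / sigma` for all alternative hypotheses, the number of alternative hypotheses is taken to be `m1 = round(lam * m)`, and all function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def fwe_critical_value(m0, alpha_fwe, sigma=1.0):
    """Solve 1 - Phi(u / sigma) ** m0 = alpha_fwe for the critical value u."""
    return sigma * STD_NORMAL.inv_cdf((1.0 - alpha_fwe) ** (1.0 / m0))


def power_values(d, m, lam, alpha_fwe=0.05, sigma=1.0):
    """Minimal and maximal power for effect size d = mu / sigma (assumption)."""
    m1 = round(lam * m)          # number of true alternative hypotheses
    m0 = m - m1                  # number of true null hypotheses
    if not 0 < m1 < m:
        raise ValueError("lam must yield 0 < m1 < m")
    u = fwe_critical_value(m0, alpha_fwe, sigma)
    p_below = STD_NORMAL.cdf((u - d * sigma) / sigma)  # P(Z_i < u), i in I1
    minimal = 1.0 - p_below ** m1        # P(max over I1 >= u)
    maximal = (1.0 - p_below) ** m1      # P(min over I1 >= u)
    return minimal, maximal


def ppv(prior, power, alpha_fwe=0.05):
    """PPV for a given prior, power value, and family-wise significance level."""
    return prior * power / (prior * power + (1.0 - prior) * alpha_fwe)


if __name__ == "__main__":
    for d in (1.0, 2.0, 3.0, 4.0, 5.0):
        b_min, b_max = power_values(d, m=100, lam=0.1)
        print(d, round(b_min, 4), round(b_max, 4),
              round(ppv(0.5, b_min), 4), round(ppv(0.5, b_max), 4))
```

In this toy setting, maximal power stays close to zero for effect sizes at which minimal power is already near one, in line with the observation that maximal power requires much larger effect sizes.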

S.3. Minimum statistics EPFs
Minimum voxel height statistic EPF

The approximation of the EPF of the minimum voxel height statistic is based on the assumption that for sufficiently high degrees of freedom the parametric expression of the expected Euler characteristic of a non-central T-field (cf. Hayasaka et al. (2007, eq. (4)), Worsley et al. (1996, eq. (3.1))) can serve both as an approximation for the probability of the maximum voxel height statistic to exceed a value t > 0 and as an approximation for the probability of the minimum voxel height statistic to fall below −t, i.e., [...] We then have for t > 0 [...] With the transformation x ≈ 1 − exp(−x) for small x ∈ R that is used in the SPM implementation of RFT-based fMRI inference (cf. Friston et al. (1996, eq. (5)), Hayasaka et al. (2007, eq. (4)), Ostwald et al. (2018, eq. (110))), we then have [...]

Minimum cluster extent statistic EPF

Let $E(M_u)$ denote the expected Euler characteristic approximation of the number of local maxima within an excursion set at level u that also serves as the approximation to the expected number of clusters within an excursion set at level u under RFT-based fMRI inference. Further, let $C_{<k,u}$ denote a random variable that models the number of clusters within an excursion set at level u that have an extent smaller than some constant k. As for its complement $C_{\geq k,u}$, which forms the basis for the approximation of the EPF of the maximum cluster extent statistic, we assume that

$$C_{<k,u} \sim \text{Poiss}\left(\lambda_{C_{<k,u}}\right), \tag{S3.6}$$

where

$$\lambda_{C_{<k,u}} := E(M_u) P(K_j < k) = E(M_u)\left(1 - P(K_j \geq k)\right). \tag{S3.7}$$

That is, like the random variable $C_{\geq k,u}$, the random variable $C_{<k,u}$ is assumed to be distributed according to a Poisson distribution, the expectation parameter of which is given by the product of the expected number of clusters and, in contrast to $C_{\geq k,u}$, the probability of a cluster volume to take on a value smaller than some constant k. Next, let $K_j$, $j = 1, ..., c$ denote the volumes of clusters $j = 1, ..., c$ within an excursion set at level u, and let $K_{\min}$ denote the minimum cluster extent statistic defined in eq. (7). Then, with the definition of the random variable $C_{<k,u}$, we have for $k \in \mathbb{R}$

$$P(K_{\min} \geq k) = P\left(C_{<k,u} = 0\right). \tag{S3.8}$$

In words, the probability that the minimum of the cluster extent statistics $K_j$ within an excursion set at level u is larger than or equal to k is identical to the probability that the number of clusters within the excursion set that have a volume smaller than k is zero. With the Poisson form of the distribution of $C_{<k,u}$, it then follows that

$$P(K_{\min} \geq k) = \exp\left(-\lambda_{C_{<k,u}}\right) = \exp\left(-E(M_u)\left(1 - P(K_j \geq k)\right)\right),$$

and with (S3.5), we obtain [...] (S3.10)
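The Poisson argument reduces the EPF of the minimum cluster extent statistic to two ingredients: the expected number of clusters $E(M_u)$ and the single-cluster exceedance probability $P(K_j \geq k)$. The following minimal sketch takes both as plain numeric inputs rather than computing them from resel volumes; the function names and the example values are hypothetical. The maximum counterpart is included for comparison and follows from the same Poisson assumption applied to $C_{\geq k,u}$.

```python
import math


def min_cluster_extent_epf(expected_clusters, cluster_epf_at_k):
    """P(K_min >= k) = exp(-E(M_u) * (1 - P(K_j >= k)))."""
    if not 0.0 <= cluster_epf_at_k <= 1.0:
        raise ValueError("cluster_epf_at_k must be a probability")
    rate = expected_clusters * (1.0 - cluster_epf_at_k)  # Poisson rate of C_{<k,u}
    return math.exp(-rate)                               # P(C_{<k,u} = 0)


def max_cluster_extent_epf(expected_clusters, cluster_epf_at_k):
    """P(K_max >= k) = 1 - exp(-E(M_u) * P(K_j >= k)), i.e., P(C_{>=k,u} >= 1)."""
    if not 0.0 <= cluster_epf_at_k <= 1.0:
        raise ValueError("cluster_epf_at_k must be a probability")
    return 1.0 - math.exp(-expected_clusters * cluster_epf_at_k)


if __name__ == "__main__":
    # Hypothetical inputs: 3 expected clusters, single-cluster EPF values at k.
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(p, min_cluster_extent_epf(3.0, p), max_cluster_extent_epf(3.0, p))
```

Both functions are increasing in the single-cluster exceedance probability, and for the same inputs the minimum statistic EPF exceeds one minus the maximum statistic EPF only through the different Poisson rates involved.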

S.4. EPFs visualizations
In this section, we visualize EPFs of the six test statistics of eqs. (4)-(7) that underlie RFT-based fMRI inference and the power, PPV, and sample size calculations reported in the current work. These visualizations allow for readily relating the respective power functions to the distributional properties of the test statistics and in this way make the power functions discussed and documented in the main text accessible.

Single test scenario statistics
As discussed in Methods, the single test statistics of interest in RFT-based fMRI inference are the voxel height statistics $T_v$ (cf. eq. (4)) and the cluster extent statistics $K_j$ (cf. eq. (5)) with EPFs provided in eqs. (15) and (16) [...] Figure 1A of the main text.
Similarly, the test-relevant aspects of the EPF of $K_j$ are visualized in Figure S.4B for a cluster-defining threshold of u = 4.3. As for the voxel height statistic, the upper panel of Figure S.4B depicts the critical value $k_\alpha$ measured in number of voxels for exact tests with significance level α = 0.05 as a function of the sample size n implied by eq. (16) and evaluated using Algorithm 1. As sample size increases, the critical cluster extent value decreases as expected. In the lower panel of Figure S.4B, we visualize the EPF of $K_j$ as a function of effect size d and sample sizes n = 10, 15, ..., 40. As in the lower panel of Figure S.4A, each sample size-specific stack reflects a variation of the effect size d between 0.2 (bottom) and 0.8 (top), and the critical values for an exact test with significance level α = 0.05 at a given sample size are indicated by red vertical bars. As for the voxel height statistic, the value of the EPF at the location of the critical values corresponds to the power of the uncorrected cluster-level test as visualized in Figure 1B of the main text.

Multiple testing statistics
As discussed in Methods, the multiple testing statistics of interest are the maximum and minimum voxel height statistics $T_{\max}$ and $T_{\min}$ (cf. eq. (6)) with EPFs provided in eqs. (17) and (18), respectively, as well as the maximum and minimum cluster extent statistics $K_{\max}$ and $K_{\min}$ (cf. eq. (7)), with EPFs provided in eqs. (19) and (20), respectively. As in the main text, we visualize test-relevant aspects of the EPFs in Figure S.5 based on the resel volumes $R_0(S) = 6$, $R_1(S) = 33$, $R_2(S) = 354$, and $R_3(S) = 705$, and, for the minimum and maximum cluster extent statistics, a cluster-defining threshold of u = 4.3.
The left upper panel of Figure [...] corresponds to the complete null hypothesis, and the critical values increase with decreasing sample size. Compared to the single test scenario, the critical values are five to ten times as large. Increasing λ and thus reducing the multiplicity of the multiple testing problem results in a decrease of the critical values, which accelerates as λ approaches 1. In the right upper panel of Figure S.5A, we visualize the associated exceedance probabilities, which remain constant around 0.05. The middle panel of Figure S.5A visualizes the EPF of $T_{\max}$. As for $T_v$ and $K_j$, the EPFs are visualized as sample size-specific stacks for n = 10, 15, ..., 40, and within each stack, the effect size d varies from 0.2 (bottom) to 0.8 (top). For all stacks, λ is set to 1, [...] the location of these bars corresponds to the sample size- and effect size-specific minimal power of the FWER-controlled voxel-level test in the complete alternative hypothesis scenario. The identical way of portrayal is used in Figure S.5B for the test-relevant aspects of the EPFs of $K_{\min}$ and $K_{\max}$. Note that in comparison to the single test scenario, the critical values $k^{FWE}_\alpha$ depicted in the upper left panel of Figure S.5B are up to three times as large.

Figure S.4. [...] eqs. (15) and (16) as a function of sample size and effect size d. Specifically, for each sample size-specific EPF stack, the effect size d varies between 0.2 at the bottom of the stack and 0.8 at the top of the stack. Additionally, the figure depicts the sample size-specific critical values for exact tests with significance level α = 0.05 as red vertical bars. The exceedance probabilities at the location of the red bars in the respective statistics outcome space correspond to the effect size- and sample size-dependent power values visualized in Figure 1. Note that, as discussed in the main text, neither EPF depends on the resel volumes of the search space. For implementational details, please see rftp_figure_S4.m.

Figure S.5. Critical values and EPFs for the multiple testing voxel- and cluster-level statistics. The left uppermost panels of (A) and (B) visualize the critical values $t^{FWE}_\alpha$ and $k^{FWE}_\alpha$ for FWER-controlled voxel- and cluster-level tests of significance level $\alpha_{FWE} = 0.05$ as a function of sample size n and partial alternative hypothesis parameter λ ∈ [0, 1], respectively. Their associated exceedance probabilities are visualized in the right upper panels of (A) and (B). The central panels visualize the EPFs of eqs. (17) and (19) as functions of sample size n and effect size d. Specifically, for each sample size-specific exceedance probability stack, the effect size d varies between 0.2 at the bottom of the stack and 0.8 at the top of the stack. In addition, each panel depicts the sample size-specific critical values for exact tests with significance level $\alpha_{FWE} = 0.05$ as red vertical bars. The color scale depicted for $P(T_{\max} \geq t)$ applies to all four lower subpanels of the figure. The lowermost subpanels of (A) and (B) visualize the EPFs of eqs. (18) and (20) as functions of sample size n and effect size d as for their maximum counterparts. For implementational details, please see rftp_figure_S5.m.

S.5. Exemplary data set
The exemplary fMRI data set is part of a perceptual decision making simultaneous EEG/fMRI data set that has been previously documented and made generally accessible in the standardized BIDS format (Ostwald et al., 2012; Georgie et al., 2018). In the following, we briefly sketch the experimental procedures and fMRI data analyses that form the basis for the statistical parametric map depicted in Figure 4.

Experimental procedure
Participants performed a visual perceptual decision task in a 2 × 2 factorial within-participant design with experimental factors stimulus coherence (with levels low and high) and spatial prioritization (with levels yes and no). On each trial, a visual stimulus depicting either a face or a car was presented in one visual hemifield. Individual stimuli were presented for 200 ms and the participant was asked to indicate via a button press whether the stimulus depicted a face or a car. For the button presses, participants used their right index and middle finger for the two stimulus categories, and the mapping from stimulus category to response button was counterbalanced across participants. The informativeness of the visual stimulus was manipulated by altering the phase coherence of its spatial frequency spectrum, resulting in low and high stimulus coherence trials. On half of the trials, a cueing arrow shown continuously for 1 s prior to the stimulus indicated in which visual hemifield the stimulus would be presented.
Participants were asked to allocate their spatial attention to the respective visual hemifield, while maintaining steady central fixation (spatial prioritization condition). On the other half of the trials, the two-headed cueing arrow was uninformative and the stimulus was presented randomly in either visual hemifield (no spatial prioritization condition). Face and car stimuli were equally distributed across the four experimental conditions. The stimulus presentation order was randomized. Participants were asked to respond as quickly and as accurately as possible with an emphasis on responding as quickly as possible and to maintain stable fixation on the central fixation cross throughout the experiment. For fMRI data acquisition, data from 90 trials for each of the four conditions (half of them face stimuli) were recorded with an inter-trial interval discretely randomized between 10 and 12 s. The 90 trials per condition were split into five experimental runs, each lasting approximately 14 minutes.

fMRI data acquisition and analysis

fMRI data were acquired simultaneously with EEG at the Birmingham University Imaging Centre using a 3T Philips Achieva MRI scanner. T2*-weighted functional data were collected with an eight-channel phased-array SENSE head coil. EPI data (gradient echo-pulse sequence) were acquired from 32 slices (3 × 3 × 4 mm resolution, TR 2,000 ms, TE 35 ms, SENSE factor 2, flip angle 80 deg). Slices were oriented parallel to the AC-PC axis of the participant's brain and positioned to cover the entire brain space. A mass-univariate summary-statistics GLM analysis was performed to assess condition-induced effects at the group-level. SPM12 (V6906) was used for both fMRI data preprocessing and statistical modelling.
Prior to GLM parameter estimation at the participant-level, fMRI data were motion-corrected by realigning EPI volumes to the first volume of the first run of a given participant, normalized to MNI space using the SPM MNI-EPI template, re-interpolated to 2 mm isotropic voxel size, and smoothed using an 8 mm isotropic Gaussian kernel. The first-level GLM design matrix for each participant was then specified in run-wise, block-diagonal form. Here, each block comprised the four condition-specific stimulus onset functions, convolved with the canonical haemodynamic response function, in the columnwise order: high stimulus coherence/spatial prioritization, high stimulus coherence/no spatial prioritization, low stimulus coherence/spatial prioritization, low stimulus coherence/no spatial prioritization. Per SPM defaults, the design matrices additionally comprised a constant run offset and a cosine basis function set implementing a temporal high-pass filter with a cut-off frequency of 1/128 Hz. High-frequency residual error correlations were accounted for by SPM's default of approximating a first-order autoregressive process with white noise using parameterized covariance basis functions. GLM beta and covariance component parameters were then estimated using SPM's restricted maximum likelihood estimation scheme. Finally, ten participant-specific COPE images were evaluated for the high stimulus coherence > low stimulus coherence contrast weight vector (1, 1, −1, −1) replicated over sessions and padded with zeros for regressors of no interest. The resulting COPE images are available from the 'Contrast Images' folder of the accompanying OSF project. The COPE images were then evaluated at the group-level using voxel-wise one-sample T-tests, implemented in rftp_figure_4.m, the resulting statistical parametric map of which is available from the 'One Sample T Test' folder in the accompanying OSF project.

S.6. Algorithms
Critical value algorithm for a desired significance level

For a given test statistic γ(Y) and desired significance level α, we use Algorithm 1 to numerically evaluate the required critical value $c_\alpha$ for a test that controls the Type I error rate at significance level α.
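One generic way to evaluate such a critical value numerically is bisection on the EPF of the test statistic under the null hypothesis. The sketch below is an illustration under stated assumptions and is not a reproduction of Algorithm 1: it assumes that the EPF is continuous and non-increasing in the critical value and that the search interval brackets the solution. The names `critical_value` and `gaussian_epf` are hypothetical, and the Gaussian EPF only serves as a stand-in for the RFT-based EPFs.

```python
from statistics import NormalDist


def critical_value(epf, alpha, lower, upper, tol=1e-10, max_iter=200):
    """Bisection search for c with epf(c) = alpha.

    Assumes epf is continuous and non-increasing on [lower, upper] and that
    epf(lower) >= alpha >= epf(upper). Returns the upper end of the final
    bracket, so that epf(result) <= alpha (a slightly conservative choice).
    """
    if not epf(lower) >= alpha >= epf(upper):
        raise ValueError("search interval does not bracket the critical value")
    for _ in range(max_iter):
        mid = 0.5 * (lower + upper)
        if epf(mid) > alpha:
            lower = mid   # exceedance probability still too large: move right
        else:
            upper = mid   # exceedance probability small enough: move left
        if upper - lower < tol:
            break
    return upper


def gaussian_epf(c):
    """Stand-in EPF: P(Z >= c) for a standard Gaussian test statistic."""
    return 1.0 - NormalDist().cdf(c)


if __name__ == "__main__":
    print(critical_value(gaussian_epf, alpha=0.05, lower=0.0, upper=10.0))
```

Returning the upper end of the bracket keeps the realized exceedance probability at or below α, at the cost of a deviation from exactness that is bounded by the tolerance.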
Necessary sample size algorithm for a desired power level

For a given test statistic γ(Y), desired significance level α, effect size d and, in the case of a multiple testing scenario, partial alternative hypothesis parameter λ, we use Algorithm 2 to numerically evaluate the required minimal sample size to achieve a desired power level β.

Necessary sample size algorithm for a desired PPV level

For a given test statistic γ(Y), desired significance level α, effect size d, prior hypothesis parameter π and, in the case of a multiple testing scenario, partial alternative hypothesis parameter λ, we use Algorithm 3 to numerically evaluate the required minimal sample size to achieve a desired PPV level ψ.
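Both sample size problems share one structure: increase n until a criterion function of n (power or PPV) reaches the desired level. The sketch below illustrates this structure and is not a reproduction of Algorithms 2 and 3. The function `necessary_sample_size` is hypothetical, and `z_test_power` is a deliberately simple stand-in power function (a one-sided z-test on the mean of n unit-variance observations with effect size d), not one of the RFT-based power functions of the main text.

```python
from statistics import NormalDist


def necessary_sample_size(criterion, target, n_min=2, n_max=10000):
    """Smallest n in [n_min, n_max] with criterion(n) >= target, else None."""
    for n in range(n_min, n_max + 1):
        if criterion(n) >= target:
            return n
    return None


def z_test_power(n, d=0.5, alpha=0.05):
    """Stand-in power function: one-sided z-test with effect size d (assumption)."""
    std = NormalDist()
    return 1.0 - std.cdf(std.inv_cdf(1.0 - alpha) - d * n ** 0.5)


def z_test_ppv(n, prior=0.2, d=0.5, alpha=0.05):
    """PPV induced by the stand-in power function and a hypothesis prior."""
    beta = z_test_power(n, d, alpha)
    return prior * beta / (prior * beta + (1.0 - prior) * alpha)


if __name__ == "__main__":
    print("n for power >= 0.8:", necessary_sample_size(z_test_power, 0.8))
    print("n for PPV >= 0.75:", necessary_sample_size(z_test_ppv, 0.75))
```

A linear search is sufficient here because sample sizes are small integers; if the criterion is known to be monotone in n, a bisection over n would reduce the number of criterion evaluations. Note that a desired PPV level can be unattainable: the PPV is bounded above by π / (π + (1 − π)α), in which case the search returns `None`.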