Abstract
Scientists use high-dimensional measurement assays to detect and prioritize regions of strong signal in a spatially organized domain. Examples include finding methylation enriched genomic regions using microarrays and identifying active cortical areas using brain-imaging. The most common procedure for detecting potential regions is to group together neighboring sites where the signal passed a threshold. However, one needs to account for the selection bias induced by this opportunistic procedure to avoid diminishing effects when generalizing to a population. In this paper, we present a model and a method that permit population inference for these detected regions. In particular, we provide non-asymptotic point and confidence interval estimates for mean effect in the region, which account for the local selection bias and the non-stationary covariance that is typical of these data. Such summaries allow researchers to better compare regions of different sizes and different correlation structures. Inference is provided within a conditional one-parameter exponential family for each region, with truncations that match the constraints of selection. A secondary screening-and-adjustment step allows pruning the set of detected regions, while controlling the false-coverage rate for the set of regions that are reported. We illustrate the benefits of the method by applying it to detected genomic regions with differing DNA-methylation rates across tissue types. Our method is shown to provide superior power compared to non-parametric approaches.
1 Introduction
Due to the advent of modern measurement technologies, several scientific fields are increasingly relying on data-driven discovery. A prominent example is the success of high-throughput assays such as microarrays and next-generation sequencing in biology. While the original applications of these technologies depended on predefined biologically relevant measurement units, such as genes or single nucleotide polymorphisms (SNPs), more recent applications attempt to use data to identify and locate genomic regions of interest. Examples of such applications include detection of copy number aberrations (Sebat et al., 2004), transcription binding sites (Zhang et al., 2008), differentially methylated regions (Jaffe et al., 2012c), and active gene regulation elements (Song and Crawford, 2010). A data-driven approach is also common among neuroscientists, who search imaging and functional imaging data for regions that are affected by variations in cognitive tasks (Friston et al., 1994, Hagler et al., 2006, Woo et al., 2014).
As these technologies mature, the focus of statistical inference shifts from individuals to populations. Instead of searching for regions different from baseline in an individual sample, we instead want to compare two or more populations (cancer versus normal, for example) to locate regions of difference. For population-level inference, region detection methodology needs to account for between-individual biological variability, which is often non-stationary, as well as for technical measurement noise. Furthermore, sample size is usually small due to high technology and recruitment costs, so variability in the estimates remains considerable. Searching for small regions of consistent signal within such large noisy assays is therefore prone to false detections if selection within the noise is not properly accounted for. At the same time, the lost-opportunity cost that can result from being overly conservative is equally concerning. Statistical inference therefore provides a rigorous approach that can help practitioners make informed decisions regarding resource allocation.
In the context of population inference, the most commonly applied approach is to compute marginal p-values at each site, correct for multiplicity, and then combine contiguous significant sites into regions. For example, publications in high-profile biology journals have implemented such homegrown ad-hoc analysis pipelines (Kundaje et al., 2015, Becker et al., 2011, Pacis et al., 2015, Lister et al., 2013). However, there is no theoretical justification for extrapolating inferences from single sites to the region. In particular, using the average of the observed values at the selected sites as an estimate for the region will result in a biased estimate; Kriegeskorte et al. (2009) coined the term circular inference for such practices in neuroscience, highlighting the reuse of the same information in the search and in the estimation. Furthermore, the power of such methods to detect regions is limited by the power to detect at individual sites, which are noisier and require stronger a-priori multiplicity corrections than regions. For example, a region consisting of several almost-significant sites will be overlooked by such algorithms. Finally, there is no clear way to prioritize regions of different sizes and different correlation structures without more refined statistical methods. Here we describe a general framework that permits statistical inference for region detection in this context.
Although high-profile genomic publications have mostly ignored it, the statistical literature includes several publications relevant to the challenge of providing valid inferential statements for regions of interest within a large continuous map of statistics. Published methods differ in how they summarize the initial information into a continuous map of statistics: smoothing or convolving the measurements with a pre-specified kernel, forming point-wise Z maps, p-value maps (Pedersen et al., 2012), or likelihood maps (Siegmund et al., 2011, Hansen et al., 2012). Regions of interest can be identified from local maxima, by thresholding (Jaffe et al., 2012c), or by segmentation based on hidden Markov models (Kuan and Chiang, 2012) or likelihood (Zhang and Siegmund, 2012). When the shape of the signal is known, for example in data related to transcription binding sites, a map of scan statistics can be surveyed for maxima (see Cai and Yuan, 2014, for asymptotic analysis). Due to the large number of measurements, this approach assumes that true signal locations are well separated, which permits multiplicity corrections for individual detected regions (Schwartzman et al., 2011, 2013, Sun et al., 2014). Non-parametric methods can use sample-assignment permutation to simulate a non-parametric null distribution even for non-stationary data from an unknown distribution (Jaffe et al., 2012b, Hayasaka and Nichols, 2003).
However, current methods fail to address two important characteristics of the practical challenge. The first is forming inferential statements regarding the effect size, such as estimates and confidence intervals. In high-throughput genomic data it is well known that unknown confounders can often create many small differences between the groups (Leek et al., 2010). These are not regions of biological interest, yet they can reach strong statistical significance in highly powered studies. Estimates of effect size, rather than just p-values, permit the practitioner to discern biological significance. Technically, the problem of non-null inference requires stronger modeling assumptions and more sophisticated methods for treating nuisance parameters compared to significance testing for a fully specified null. Work on confidence intervals in large processes includes Weinstein et al. (2013) for individual points rather than regions, and Sommerfeld et al. (2015) for globally bounding the size of the non-null set. We consider providing estimates of the effect in biologically meaningful units, and quantification of the uncertainty in these estimates, an important contribution.
The second challenge is non-stationarity: in most genomic signals, both the variance and the auto-correlation change considerably along the genome. This behavior is due to differing DNA sequence composition (Bock et al., 2008, Benjamini and Speed, 2012), uneven marker coverage, and other biochemical properties affecting the DNA amplification procedures employed by the relevant experimental protocols (Jaffe et al., 2012a). Non-stationarity makes it harder to compare the observed properties of the signal across regions: a region of k adjacent positive sites is more likely to signal a true population difference if probe correlation is small, but more likely to be caused by chance variation if probe correlation is very high.
In this paper, we introduce a comprehensive approach for inference for region detection (IRD) that produces selection-corrected p-values, estimators and confidence intervals for the population effect size. We focus on a popular set of selection algorithms that, after preprocessing, apply a threshold-and-merge approach to region detection: the map of statistics is thresholded at a given level, and neighboring sites that pass the threshold are merged together. The key idea is to identify for each potential region its selection event – the necessary and sufficient set of conditions on the observed estimate vector that lead to detection of the region. For threshold-and-merge algorithms, the selection event can be described as a set of coordinate-wise truncations. Basing inference on the distribution of a test statistic conditional on the selection event corrects the selection bias (Lee et al., 2013, Fithian et al., 2014). Inference for each region is tailored to the local covariance in the region. When further selection is needed downstream, standard family-wise corrections can be applied to the list of detected regions.
The paper is structured as follows. The rest of the introduction describes in more detail the specific genomic signal – DNA-methylation – that will be used to demonstrate our method. In Section 2 we present a model for the data generation and define threshold-and-merge selection. In Section 3 we review the conditional approach to selective-inference, which enables the researcher to analyze each selected region separately by adjusting the tests to the conditional distribution. When the data is approximately multivariate normal, the conditional distribution follows a truncated multivariate normal (TMN, Section 3.3). For an individual region, we describe sampling based tests and interval estimates in Section 4. In Section 5 we return our focus to the set of selected regions in the experiment, discussing a secondary adjustment that is needed if only a subset of the originally detected regions are reported. Sections 6 and 7 evaluate the performance of the method on simulated and measured DNA-methylation data.
1.1 Motivating example: differentially methylated regions (DMRs)
DNA methylation is a biochemical modification of DNA that does not change the actual sequence and is inherited during mitosis. The process is widely studied because it is thought to play an important role in cell development (Razin and Riggs, 1980) and cancer (Feinberg and Tycko, 2004). The great majority of methylation events occur when a methyl group attaches to a CpG site (a cytosine base followed by a guanine base along the chromosome). These potential methylation sites are depleted in the genome (less than 1%, compared to about 4% for GpCs), and the density in which they appear varies widely (Takai and Jones, 2002). The attachment and detachment processes of the methyl group are stochastic within each cell and can vary between the cells of a biological sample. Current high-throughput technologies measure the proportion of cells in a biological specimen that are methylated, giving a value between 0 and 1 for each measured CpG site. For example, the most widely used product, the Illumina Infinium array, produces these proportion measurements at about 450,000 sites (Bibikova et al., 2011).
Unlike the genomic sequence, methylation differs across tissues of the same individual, and changes with age, environmental exposures, and disease (Robertson, 2005). Of current interest is associating changes in methylation with biological outcomes such as development and disease. Reported functionally relevant findings have generally been associated with genomic regions rather than single CpG sites (Jaenisch and Bird, 2003, Lister et al., 2009, Aryee et al., 2014), hence our focus on differentially methylated regions (DMRs). Because DNA methylation is also susceptible to several levels of stochastic variability (Hansen et al., 2012), statistical inference is necessary. Figure 1 shows two examples of regions of difference found by comparing samples from human colon and lung. Our method estimates confidence intervals for the difference in each of the regions.
Inference for DMRs needs to take into account the unknown but variable and often strong local correlation between nearby methylation sites. The propensity for methylation is governed by many local processes, and therefore neighboring methylation sites tend to have similar methylation proportions. These local influences are thought to be moderated through other local properties such as the attachment of the DNA to packing proteins, its local chemical properties (Kuan et al., 2010), and the local sequence composition (e.g. the proportion of G and C bases). Therefore, simple correlation models based only on the genomic distance between sites are not sufficient. One of our contributions is an approach that accounts for the non-stationary nature of the data.
2 Model for the effects of selection
In the following we introduce a model for the effects of selection on regional descriptors. We begin our description with a population model for individual probes and for the parameter describing the regions, and only then discuss the observed statistic for each region. We then specify the popular family of threshold-and-merge selection procedures, which we will analyze in this manuscript. Keep in mind that practitioners typically think about these in the opposite direction: observed summaries such as “average observed difference in the region” are natural descriptors for the region effect. Selection bias would typically be framed in terms of the expected decay of this observed effect toward zero when the analysis is repeated using the same region but a new set of samples. The population model allows us to estimate and probabilistically bound this decay.
2.1 Population model
Suppose the collected data consist of n samples of D measurements each, Y_1, …, Y_n ∈ R^D. We model the i’th sample as a random process composed of a mean effect that is linear in known covariates and an additive random individual effect. Each sample is annotated by the covariate of interest X_i ∈ R, and by a vector of nuisance covariates W_i ∈ R^p. Then the i’th observed vector is:

Y_i = Θ X_i + Γ W_i + ε_i.

Here Θ ∈ R^D is the fixed process of interest, the columns of Γ ∈ R^{D×p} are fixed nuisance processes, and ε_i ∈ R^D captures both the individual sample effect and any measurement noise. ε_i can further be characterized by a positive-definite covariance matrix C = Cov(ε_i) ∈ R^{D×D}. In matrix notation, let Y ∈ R^{n×D} be the matrix of measurements and X ∈ R^{n×(p+1)} be the design matrix, organized so that the covariate of interest is in the first column.
For a concrete example of our notation, consider a two-group design comparing samples from two tissue types. Each sample would be coded as a vector Y_i ∈ R^D. X_i would code the tissue type, 1 for group A and 0 for group B. W_i can encode demographic variables such as age and gender, as well as an intercept term. Then θ_j would code the mean difference between groups at the j’th site. We expect the Θ process to be almost zero at most sites, and to deviate from zero in short connected regions.
Regions of interest
We are interested in detecting and prioritizing regions of the measurement space that are short relative to the size of the genome and for which the values of Θ are large in absolute value. A region of interest (ROI) corresponds to a range a : b = (a, a + 1, …, b) on which Θ_{a:b} is large (or negatively large) compared to zero. A sufficient representation for an ROI is a triplet r = (a, b, d), where a ≤ b ∈ 1, …, D are indices and d ∈ {–, +} represents the direction of deviation from 0. Depending on context, we would usually restrict our analysis to indices a, b that are not too far apart. We code these restrictions into the set of potential ROIs. Note that this set would usually still be very redundant, with many regions that are almost identical; in any run of the algorithm, only a small subset of the potential ROIs will be selected for estimation.
Vector of estimators Z
Our inference procedures focus on a vector of point-wise unbiased, approximately normal estimators Z ∈ R^D. The assumptions for Z are:
Unbiasedness: E[Z] = Θ.
Estimable covariance: Σ := Cov(Z) is estimable, and the estimate is independent of Z.
Approximate local normality: For indices a ≤ b that define a potential ROI, meaning (a, b, +) or (a, b, −) is in the set of potential ROIs, the vector Z_{a−1:b+1} is approximately multivariate normal,

Z_{a−1:b+1} ∼ N(Θ_{a−1:b+1}, Σ_{a−1:b+1}).

Here, Θ_{a−1:b+1} and Σ_{a−1:b+1} are the corresponding vector and matrix subsets of Θ and Σ, respectively.
Specifically, we can take Z to be the least-squares estimator for Θ, i.e. the first row of

B̂ = (X′X)^{−1} X′ Y.

Then, with enough samples compared to covariates (n > p), the local covariance C = Cov(ε_i) is estimable from the linear-model residuals, resulting in Ĉ. Furthermore, E[Z] = Θ and Cov(Z) = [(X′X)^{−1}]_{11} · C. Hence, for each potential ROI,

Z_{a−1:b+1} ∼ N(Θ_{a−1:b+1}, Σ_{a−1:b+1}), with Σ = [(X′X)^{−1}]_{11} · C.
This is extendable to a two-group design where each group is allowed a different covariance.
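To make the construction concrete, the following R sketch computes the point-wise estimates Z and a plug-in covariance from the least-squares fit, under the reconstruction given above (covariate of interest in the first design column, Cov(Z) = [(X′X)^{−1}]_{11} · Ĉ). The function and argument names are illustrative, not part of any package.

```r
## Minimal sketch (R), not the paper's implementation: point-wise least-squares
## estimates Z and a plug-in covariance, assuming the linear model above.
estimate_Z <- function(Y, x, W = NULL) {
  # Y: n x D matrix of measurements; x: length-n covariate of interest;
  # W: optional n x p matrix of nuisance covariates (an intercept is added).
  n <- nrow(Y)
  X <- cbind(x, 1, W)                      # covariate of interest in column 1
  XtX_inv <- solve(crossprod(X))
  B <- XtX_inv %*% crossprod(X, Y)         # coefficient matrix, one column per site
  Z <- B[1, ]                              # point-wise estimates of Theta
  E <- Y - X %*% B                         # residuals
  C_hat <- crossprod(E) / (n - ncol(X))    # D x D residual covariance
  list(Z = Z, Sigma = XtX_inv[1, 1] * C_hat)   # plug-in Cov(Z) under the model
}
```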
Effect size and population effect size
A commonly used summary for the effect size in a region is the area under the curve (AUC) of the observed process, which is the sum of the estimated effects in the region (Jaffe et al., 2012b). To decouple the region length and the magnitude of the difference, we define the observed effect size as the average rather than the sum of the effect:

The observed effect size of region a : b is t(Z_{a:b}) = (1 / (b − a + 1)) Σ_{j=a}^{b} Z_j.

We will associate the observed effect size t(Z_{a:b}) for each potential region a : b with the population parameter representing its unconditional mean:

The effect size of region a : b is E[t(Z_{a:b})] = (1 / (b − a + 1)) Σ_{j=a}^{b} θ_j.

We prefer this parameter to the AUC because it decouples two different sources of information: the region length and the effect magnitude. Because correlation varies across different regions, the region length does not correspond monotonically to the amount of independent information. The size of the region, together with the correlation structure, would nevertheless affect the uncertainty of the estimates. Finally, note that since the AUC for (a, b, +) is (b − a + 1) · E[t(Z_{a:b})], any inference for the mean effect can easily be converted into inference for the AUC. The methods described here are easily extendible to other linear statistics.
2.2 Selection
A common way to identify potential ROIs is to screen the map of statistics at some threshold, and then merge sites that passed the screening into regions (Jaffe et al., 2012b, Siegmund et al., 2011, Schwartzman et al., 2013, Woo et al., 2014). This procedure ensures a minimal biological effect size at each site, while increasing the statistical power to detect regions and controlling the computational effort. After preprocessing and smoothing the responses, these algorithms run a version of the following steps:
Produce an unbiased vector of linear estimates Z.
Identify the set of indices exceeding a fixed threshold c, {j : Zj > c}. This is the excursion set.
Merge adjacent features that pass the threshold into regions.
Filter (or split) regions that are too large.
The procedure is illustrated in Figure 2. The output of such algorithms is a random set of (positive) detected ROIs, consisting of triplets (a, b, +) corresponding to the beginning and end indices of positive excursion regions. Step 4 ensures that every detected ROI belongs to the set of potential ROIs.
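As an illustration, the R sketch below implements the four threshold-and-merge steps above for the positive direction. It is a simplified stand-in, not the bumphunter implementation; the function name and the max_gap/max_len arguments are illustrative.

```r
## Sketch (R): threshold-and-merge detection of positive ROIs (steps 1-4 above).
detect_rois <- function(Z, pos, c, max_gap = 5000, max_len = Inf) {
  above <- which(Z > c)                                   # step 2: excursion set
  if (length(above) == 0) return(NULL)
  gap <- diff(above) > 1 | diff(pos[above]) > max_gap     # breaks between regions
  breaks <- c(0, which(gap), length(above))
  regions <- t(sapply(seq_len(length(breaks) - 1), function(k) {
    idx <- above[(breaks[k] + 1):breaks[k + 1]]           # step 3: merge neighbors
    c(a = min(idx), b = max(idx))
  }))
  regions[regions[, "b"] - regions[, "a"] + 1 <= max_len, , drop = FALSE]  # step 4
}
```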
Comments:
Typically, we are also interested in the similarly defined set of negative ROIs. For ease of notation we deal only with the positive set, understanding that the negative set can be analyzed in the same way. It is important to note that each region is selected either into the positive set or into the negative set, and this choice will determine the hypothesis test.
Smoothing of the features and local adjustment of the thresholds can both be incorporated within this framework – implicitly changing Θ into a smoothed version.
Distribution after selection
It is important to distinguish between inference for predefined regions based on previous biological knowledge, for example exons or transcription start sites, and inference for regions detected with the same data from which we will construct inferential statements. For a predefined region a : b, the linear estimator t(Z_{a:b}) is an unbiased estimator for its population mean, and its uncertainty can be assessed using classical methods. Moreover, t(Z_{a:b}) would be approximately normal, so its sampling distribution would depend only on a single variance parameter. In particular, the correlation between measurements would affect the distribution of t(Z_{a:b}) only through this variance parameter, and can be accounted for in studentized intervals.
In contrast, when the ROI (a, b, +) is detected from the data, t(Z_{a:b}) becomes biased (Berk et al., 2013, Fithian et al., 2014, Kriegeskorte et al., 2009). We will tend to observe extreme effect sizes compared to the true population mean. Indeed, because every site of a selected region exceeds the threshold, the observed effect size would always be greater than the threshold c. Furthermore, the distribution of t(Z_{a:b}) would be right-skewed, so normal-based inference methods are no longer valid.
For non-stationary processes, evaluating the observed regions poses an even greater challenge. Due to variation in the dependence structure, it is no longer straightforward to compare different regions found on the same map. The bias and skewness depend not only on a single index of variance, but rather on the local inter-dependence of Z in a neighborhood of a : b. In Figure 3, we study the effect of correlation on the bias and skewness in a simulated example; changes in correlation affect the bias, the spread and the skewness of the conditional distribution.
2.3 Inference goals
We can now restate our goal using the notation. For the set of K detected regions, we would like to make valid inferential statements about each detected region that account for the selection. These include:

a hypothesis test for the region effect size against the null hypothesis H_0 : E[t(Z_{a:b})] ≤ 0, and a high-precision p-value for downstream multiplicity corrections;

an estimate of the effect size E[t(Z_{a:b})];

a confidence interval for E[t(Z_{a:b})].
We would also like to be able to prune the detected set based on the hypothesis tests, while continuing to control the false-coverage rate over the set of regions that are reported.
3 Conditional approach to selective inference
Because we are only interested in inference for the selected ROIs, a correction for the selection procedure is needed. Most such corrections are based on evaluating, ahead of selection, the potential family of inferences. In hypothesis testing, this essentially requires evaluating all potential ROIs and transforming them onto a common scale (usually by calculating the p-value). Instead, we adapt here a solution proposed for meta-analysis and more recently in the context of regression model selection problems (Lee et al., 2013, Fithian et al., 2014): adjust inference to hold for the conditional distribution of the data given the selection event. First we review the premise of selective inference, and then specify the selection event and selective distribution for region detection.
3.1 Selective inference framework
Recall that we defined a set of potential ROIs, from which a random subset of ROIs is selected by the algorithm. Assume that with each potential ROI r we associate a null hypothesis that will be evaluated if and only if r is selected. A hypothesis test controls the selective error if it controls the probability of error given that the test was conducted. Formally, denote by A_r = A_{(a,b,+)} the event that r was selected. Then:

The hypothesis test φ_r(Z) ∈ {0, 1}, which returns 1 if the null is rejected and 0 otherwise, is said to control the selective type 1 error at level α if

P_{H_0}( φ_r(Z) = 1 | A_r ) ≤ α.
In a frequentist interpretation, the long-term relative frequency of errors among those tests of r that are actually carried out should be controlled at level α.
Selective confidence intervals can be defined in a similar manner. Denote by F the true distribution of Z, so that F belongs to a model of candidate distributions. Associate with each r a functional η_r(F) of the true distribution of Z, and again let A_r be the event that the confidence interval I_r(Z) for η_r(F) is formed. Then I_r(Z) is a selective 1 − α confidence interval if:

P_F( η_r(F) ∈ I_r(Z) | A_r ) ≥ 1 − α.
Correcting the tests and intervals to hold under the selection criteria removes most biases that are associated with hypothesis selection. If no additional selection is performed, the selective tests are not susceptible to the “winner’s curse”, whereby the estimates of the selected parameters tend to display over-optimistic results. If each individual test φ_r controls the selective error at level α, then the ratio of the expected number of errors to the expected number of selected tests is also at most α. In a similar manner, the proportion of intervals not covering their parameter (the false coverage rate, FCR) is controlled at α (Fithian et al., 2014, Weinstein et al., 2013). This strong individual criterion allows us to ignore the complicated dependencies between the selection events; as long as the individual inference for each selected ROI is selection-controlled, error is also controlled over the family.
Note however that when the researcher decides to report only a subset of the selected intervals or hypotheses, a secondary multiplicity correction would be required. A likely scenario is reporting only selective intervals that do not cover 0 (Benjamini and Yekutieli, 2005). The set of selective p-values (or selective intervals) behaves like a standard hypothesis family, so usual family-wise or false discovery corrections can be used. We discuss this further in Section 5.
3.2 Selection event in region detection
For region detection, we can identify the selection event A_r = A_{(a,b,+)}(Z) as a coordinate-wise truncation on coordinates of Z. Selection of (a, b, +) occurs only if all estimates within the region exceed the threshold. These are the internal conditions:

Z_j > c for all j = a, …, b.

Furthermore, unless Z_a or Z_b is on a boundary, the selection of (a, b, +) further requires external conditions that do not allow the selection of a larger region, meaning:

Z_{a−1} ≤ c and Z_{b+1} ≤ c.
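In code, the selection event is simply a box constraint on the coordinates Z_{a−1:b+1}. The small R helper below (our naming) builds the corresponding lower and upper truncation bounds; its output can be passed to the sampling sketches later on.

```r
## Sketch (R): the selection event A_(a,b,+) as coordinate-wise box constraints
## on Z_{a-1 : b+1}.
selection_box <- function(a, b, c, D) {
  idx <- max(a - 1, 1):min(b + 1, D)
  internal <- idx >= a & idx <= b
  list(idx   = idx,
       lower = ifelse(internal, c, -Inf),   # internal: Z_j > c
       upper = ifelse(internal, Inf, c))    # external: Z_{a-1}, Z_{b+1} <= c
}
```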
3.3 Truncated multivariate normal distribution
Assume the observations are distributed approximately as a multivariate normal distribution Z ~ N(Θ,Σ) with Σ known. According to (2), to form selective tests or intervals based on a statistic t(a,b,+)(Z) we need to characterize the conditional distribution of t(a,b,+)(Z) given the selection event A(a,b,+). We will sample the conditional vector to approximate the distribution of the functional.
Conditioning the multivariate normal vector Z on the selection event results in a coordinate-wise truncated multivariate normal (TMN) vector with density

f_Θ(z) ∝ φ_{Θ,Σ}(z) · 1{z ∈ A_{(a,b,+)}},

where φ_{Θ,Σ} is the N(Θ, Σ) density and A_{(a,b,+)} is seen as a subset of R^D. The TMN distribution has been studied in many contexts, including constructing instrumental variables (Lee, 1981), Bayesian inference (Pakman and Paninski, 2014), and lately post-selection inference in regression (Lee et al., 2013). In contrast to the usual multivariate normal, the distributions of linear functionals of the truncated normal cannot be described analytically (Horrace, 2005). We will therefore resort to Monte Carlo methods for sampling TMN vectors, and empirically estimate the functional distribution from this sample. Naively, one can sample from this distribution using a rejection sampling algorithm: produce samples from the unconditional multivariate normal density of Z, and reject samples that do not meet the criteria of A_{(a,b,+)}.
In the rest of this paper we assume we have efficient samplers for the TMN distribution. There are many publicly available samplers for this distribution, for example for the R statistical environment, including rejection samplers, Gibbs samplers (Geweke, 1991) and Hamiltonian samplers (Pakman and Paninski, 2014). We use a Gibbs sampler included in the selective-inference package, because it handles extreme values well. Sampling strategies are discussed in Appendix A.
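For intuition, here is a naive rejection sampler in R implementing the strategy described above. It is only a sketch: it becomes impractically slow when the selection probability is small, which is exactly why Gibbs or Hamiltonian samplers are used in practice.

```r
## Sketch (R): naive rejection sampler for the TMN restricted to a box A.
rtmn_reject <- function(n, mu, Sigma, lower, upper, max_tries = 1e6) {
  R <- chol(Sigma)                                       # Sigma = t(R) %*% R
  out <- matrix(NA_real_, n, length(mu))
  kept <- 0; tries <- 0
  while (kept < n && tries < max_tries) {
    z <- as.numeric(mu + t(R) %*% rnorm(length(mu)))     # one N(mu, Sigma) draw
    tries <- tries + 1
    if (all(z > lower) && all(z < upper)) {              # keep only draws in A
      kept <- kept + 1
      out[kept, ] <- z
    }
  }
  if (kept < n) warning("low acceptance rate; use a Gibbs or Hamiltonian sampler")
  out[seq_len(kept), , drop = FALSE]
}
```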
For inference on a specific index range, it is sufficient to focus on the coordinates of Z in the vicinity of a : b. Coordinates outside the selection range a − 1 : b + 1 are neither restricted nor influence t(Z), and can be ignored in the modeling process assuming the detected bumps are sufficiently separated. Instead of modeling and sampling a D-dimensional TMN, for each selected ROI of length l we sample an (l + 2)-dimensional vector.
4 Inference for the effect size
Here we describe sampling-based inference procedures. First, we review the case where the null hypothesis is specified by the full mean vector. Next, we propose a mapping from our parameter of interest into the mean vector Θ_{a−1:b+1}. For the internal mean, we use a linear mapping with the profile vector s. (In simulations, we find that the conditional distribution of the statistic is not sensitive to small changes in s.) For the external means θ_{a−1}, θ_{b+1}, we propose a conservative choice of plugging in the observed values Z_{a−1}, Z_{b+1}. With these choices, the conditional distribution for each value of θ is fully specified and sampling-based tests can be run. We defer the discussion of efficient sampling strategies to Appendix A.
Notational remarks:
Throughout the section we focus on a single selected region r = (a, b, +). In 4.2 we model only the internal region, so Θ = Θ_{a:b} and Σ = Σ_{a:b}. In 4.1 and 4.3, we model both the external and internal coordinates, so Θ = Θ_{a−1:b+1} and Σ = Σ_{a−1:b+1}. We assume here that Σ is known, though in practice we will plug in estimates of Σ.
When analyzing the positively detected regions , a region a : b is associated with a single selection event (a,b,+). We therefore adopt the shorthand Aa:b = A(a,b,+). Negatively detected regions would be analyzed separately in a similar manner.
Because we are interested primarily in the effects of different mean vectors Θ on the distribution of Z and t(Z) = t(Z_{a:b}), we use the shorthand f_Θ for the TMN density with mean vector Θ, and consider the univariate distribution of t(Z) induced by Z ∼ f_Θ.
4.1 Test for a full mean vector
Consider a vector Z that follows a TMN density f_Θ, and suppose we want to test a strong (fully specified) hypothesis H_0 : Θ = Θ_0 for some Θ_0 against an alternative of larger means. For the test to be powerful against a shift in multiple coordinates, we consider the test that rejects when t(z_obs) exceeds the 1 − α quantile of the null distribution of t(Z), where t is a non-negative linear combination of the coordinates. A case in point is the strong null Θ_0 = 0. When Σ is known, the null distribution of t(Z) is fully specified even though its analytic form is unknown. We can therefore sample from f_{Θ_0}, and use this sample to empirically estimate the null distribution of t(Z), and specifically its 1 − α quantile, as accurately as we need.
Explicitly, the algorithm would
Use a TMN sampler to generate a Monte Carlo sample Z^{(1)}, …, Z^{(M)} from f_{Θ_0}.

Compute the statistic t_i = t(Z^{(i)}) for each sampled vector.

Estimate the 1 − α quantile of t(Z) under H_0 from the order statistics of t_1, …, t_M.

Furthermore, the p-value of the observed vector z_obs can be estimated from the sample as

p̂ = (1/M) Σ_{i=1}^{M} 1{ t_i ≥ t(z_obs) }.
In order to form intervals, we will also need two-sided tests. A two-sided test can be implemented by rejecting when t(z_obs) falls outside the empirical α and 1 − α quantiles estimated from the Monte Carlo sample. Note that we match α-level one-sided tests with 1 − 2α level intervals so that coverage agrees for the null.
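Putting the pieces together, the sketch below runs the Monte Carlo test of a fully specified null using the naive sampler from Section 3.3. The helper names (rtmn_reject, selection_box) are ours, and the statistic is hard-coded as the mean of the internal coordinates.

```r
## Sketch (R): Monte Carlo test of the strong null Theta = Theta0 given selection.
selective_mc_test <- function(t_obs, Theta0, Sigma, lower, upper,
                              M = 2000, alpha = 0.05) {
  Zs <- rtmn_reject(M, Theta0, Sigma, lower, upper)      # sample Z | A under H0
  ts <- rowMeans(Zs[, 2:(ncol(Zs) - 1), drop = FALSE])   # t(Z): mean of internal sites
  quant <- quantile(ts, c(alpha, 1 - alpha))             # two-sided acceptance region
  list(p.value = mean(ts >= t_obs),                      # one-sided Monte Carlo p-value
       reject_two_sided = t_obs < quant[[1]] || t_obs > quant[[2]])
}
```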
4.2 A single parameter family for the mean
Confidence intervals for the effect size require more care than the null tests, because even after specifying E[t(Z_{a:b})] the model has b − a ancillary degrees of freedom. Approaches to non-parametric inference include plugging in the maximum-likelihood values of the ancillary parameters for every value of the parameter of interest (profile likelihood, see the review in DiCiccio and Romano, 1988), or conditioning on the ancillary directions in the data as in Lockhart et al. (2014), Lee et al. (2013). We take an approach similar to the least favorable one-dimensional exponential family (Efron, 1985), proposing a linear trajectory mapping a scalar parameter θ to the mean vector Θ. Figure 4 shows the main steps of our approach.
We form the confidence interval by inverting a set of tests for the average parameter θ. Recall that a random interval I(Z) is a 1 − 2α level confidence interval for θ if P_Θ( θ ∈ I(Z) ) ≥ 1 − 2α for any Θ. Given a family of 2α-level two-sided tests for each value of θ, the set of non-rejected parameter values forms a 1 − 2α confidence set (inversion lemma, Lehmann and Romano, 2005).
The family of tests we use is based on a one-dimensional sub-family of the TMN, produced by the linear (or affine) mapping Θ_s(θ) = θ · s, where the vector s represents the profile of the mean that is scaled linearly by θ. For identifiability, we set t(s_{a:b}) = 1. For any value of θ, we can define the distribution as follows:

Z ∼ f_{Θ_s(θ)}, that is, Z is TMN with mean vector Θ_s(θ) = θ · s, covariance Σ, and truncation set A_{(a,b,+)}.

At θ = 0, we get the strong null hypothesis Θ = 0. For other values of θ we get different TMN distributions. Note that the condition t(s_{a:b}) = 1 ascertains that t(Θ_s(θ)_{a:b}) = θ, because t is linear.
We derive confidence intervals and point estimators for θ based on this one-parameter family. Each value of θ identifies a specific mean vector (panel A of Figure 4), and a conditional distribution for the vector Z and the statistic t(Z) (panels B, C). Therefore, for each value θ_1, we can construct the two-sided test of H_0 : θ = θ_1 by following the recipe in 4.1. That is, we can sample from f_{Θ_s(θ_1)}, estimate empirically the distribution and quantiles of t(Z) under this null, and reject if t(z_obs) does not fall between the α and 1 − α quantiles. By repeating this process for a fine grid of θ values, we can invert the sequence of tests (panel D) and get a high-resolution confidence set or interval I.
As a point estimator, we propose using

θ̂ = the value of θ for which the conditional median of t(Z) under f_{Θ_s(θ)} equals t_obs.

This corresponds to the intersection of the median function with t_obs, and does not require extra calculations when computing the confidence intervals. In Appendix A we show that tests for any value of θ can be computed using a single Monte Carlo sample.
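A minimal R sketch of this inversion is given below. For clarity it draws a fresh Monte Carlo sample for every grid value and scales the whole profile by θ; Section 4.3 instead plugs in the observed values for the two external coordinates, and Appendix A describes a tilting shortcut that reuses a single sample. Helper names are ours.

```r
## Sketch (R): confidence interval by inverting Monte Carlo tests on a theta grid,
## plus the median-matching point estimate.
selective_ci <- function(t_obs, s, Sigma, lower, upper,
                         theta_grid, M = 2000, alpha = 0.05) {
  internal <- 2:(length(s) - 1)                # s, Sigma cover a-1 : b+1
  stats <- sapply(theta_grid, function(theta) {
    Zs <- rtmn_reject(M, theta * s, Sigma, lower, upper)
    ts <- rowMeans(Zs[, internal, drop = FALSE])
    c(lo = unname(quantile(ts, alpha)), med = median(ts),
      hi = unname(quantile(ts, 1 - alpha)))
  })
  accepted  <- theta_grid[t_obs >= stats["lo", ] & t_obs <= stats["hi", ]]
  ci        <- if (length(accepted) > 0) range(accepted) else c(NA, NA)
  theta_hat <- theta_grid[which.min(abs(stats["med", ] - t_obs))]
  list(estimate = theta_hat, interval = ci)
}
```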
For the statistic t(Z) = t(Z_{a:b}), we propose to use the profile s_Σ ∝ Σ · (1, …, 1)′, normalized so that t(s_{a:b}) = 1, which varies the mean in each coordinate proportionally to the sum of the corresponding column of the covariance matrix. For the unconditional family parametrized by this profile, t(Z) is the natural statistic of the exponential family, as well as the maximum likelihood estimator. After truncation, this is still an exponential family with the same natural statistic, though the statistic is no longer the MLE. An alternative profile is s_u = (1, …, 1), which induces a uniform increase in the coordinates of the region. In practice, we find that our methods are not sensitive to the particular choice between these two profiles, as seen in Figure 5. The choice of the profile is discussed further in 4.2.1; some readers may prefer to skip directly to 4.3.
4.2.1 Monotonicity and choice of profile
Although naively we would expect functionals of the distribution of t(Z) to increase as Θ_s(θ) increases, this monotonicity property does not always hold. Monotonicity is an important prerequisite for the method: measuring θ using t(Z) makes sense only if t(Z) indeed increases with θ. Furthermore, monotonicity guarantees that the acceptance region of the family of tests would be an interval, simplifying the parameter searches. Indeed, for a non-truncated multivariate normal with mean θ · s and s positive, the distribution of t(Z) is monotone increasing in every coordinate of Θ (and does not otherwise depend on s). Unfortunately, after conditioning, monotonicity is no longer guaranteed for a general profile vector s; for example, it is not always true that the mean or quantile functions will be increasing in θ. See Figure 5. In the rest of this section we identify conditions for monotonicity: in Lemma 1 we show that, for every non-negative covariance, using the profile s_Σ ensures that E[t(Z)] increases in θ. This result is tailored to the statistic t(Z) = t(Z_{a:b}). In Lemma 2, we identify a subset of the non-negative covariances and, for each such covariance, a set of profiles that ensure monotonicity for any non-negative statistic.
We denote by g_{θ;s,t} the family of densities, parametrized by θ, of t(Z) | A, where Z ∼ N(θ · s, Σ). The following lemma proves that for s = s_Σ, g_{θ;s,t} is a monotone likelihood ratio family in θ. This is a consequence of identifying t(Z) as the natural statistic of the family, as in Fithian et al. (2014).
Lemma 1 If g_{θ;s,t} denotes the family of densities for t(Z) with the scalar parameter θ as before, then

1. g_{θ;s,t} is a monotone likelihood ratio family.

2. E[t(Z)] is an increasing function of θ.

3. The confidence set for θ obtained by inverting two-sided tests is an interval.
Properties 2,3 are direct consequences of the monotone likelihood ratio family property. We leave the proof of 1 to the appendix.
The lemma further allows a classical approach to the problem of choosing the statistic after the model. If we choose a specific deviation from 0, e.g. setting the profile to be uniform s_u = (1, …, 1)′, then the most efficient statistic would be the exponential-family natural statistic t*(Z) = s_u′ Σ^{−1} Z. Lemma 1 implies that the distribution of t*(Z) would be monotone in θ.
A stronger result, monotonicity for any non-negative statistic, can be derived from properties of the multivariate distribution f_Θ if the covariance of Z_{a:b} belongs to a restrictive class of positive-covariance matrices.
An M-matrix is a positive-definite matrix in which all off-diagonal elements are non-positive.
Lemma 2 Suppose Σ^{−1} is an M-matrix and the profile s can be written as a non-negative combination of the columns of Σ. Then g_{θ;s,t} is a monotone likelihood ratio family.
The condition on Σ is sufficient to guarantee a strong enough association property of Z (the second-order multivariate total positivity property, MTP2) that continues to hold after conditioning. The lemma is based on theory developed by Rinott and Scarsini (2006). The details of the proof are in the appendix. Figure 5 illustrates these monotonicity properties for different profiles and covariance structures.
4.3 Choice of mean for the external constraints
For the distribution f_Θ to be fully specified, we need to set values for the external mean parameters θ_{a−1}, θ_{b+1}. Although it is possible to arbitrarily assume θ_{a−1} = θ_{b+1} = 0, when the variances of Z_{a−1} and Z_{b+1} (Σ_{a−1,a−1}, Σ_{b+1,b+1}) are small this could lead to a poor fit to the data and produce ill-behaved intervals. Scaling the external mean with θ also produces artifacts when Z_{a−1} or Z_{b+1} is strongly correlated with (Z_a, …, Z_b).

We recommend using plug-in estimators for θ_{a−1}, θ_{b+1} based on the observed values Z_{a−1}, Z_{b+1}:

θ̂_{a−1} = Z_{a−1} and θ̂_{b+1} = Z_{b+1}.

Because selection is often weak for the external parameters, the plug-in estimators seem to do a good job even as the variances decrease.

To summarize, the affine model for the mean is:

Θ_s(θ) = (Z_{a−1}, θ s_a, …, θ s_b, Z_{b+1}).
4.4 Estimated covariance
When the number of samples is small, the estimation may be sensitive to the covariance that is used. We therefore propose using an inflated estimate of the sample covariance in order to reduce the probability of underestimating the variance of individual sites and of overestimating the correlation between internal and external variables.
Call Σ̂ the sample estimator for Cov(Z). Then the (1 + λ) diagonally inflated covariance Σ̂_λ is defined as follows:

Σ̂_λ = Σ̂ + λ · diag(Σ̂),

so that the diagonal entries are inflated by a factor of 1 + λ while the off-diagonal entries are unchanged.
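The inflation itself is a one-line operation; the following R sketch (our naming) makes the effect explicit: variances grow by a factor of 1 + λ while off-diagonal covariances are unchanged, so all correlations shrink by 1/(1 + λ).

```r
## Sketch (R): (1 + lambda) diagonal inflation of the sample covariance.
inflate_cov <- function(Sigma_hat, lambda = 0.1) {
  Sigma_hat + lambda * diag(diag(Sigma_hat))   # add lambda * variances to the diagonal
}
```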
5 Controlling false coverage on the set of intervals
When the threshold is selected liberally, the result of running the threshold-and-merge algorithm is a large set of detected regions. If only a subset of these regions is reported, this selection could lead to low coverage properties over the selected set. In our framework, because the confidence intervals are conditioned on the initial selection, the problem does not arise (Fithian et al., 2014). However, if we use the new p-values (or intervals) to screen regions that are not separated from 0, a secondary adjustment is needed.
In that case, an iterative algorithm described in Benjamini and Yekutieli (2005) can be applied to choose a second-level inflation factor for the intervals, so that all reported intervals are separated from 0 and the FCR is maintained. At each iteration, the set of intervals is pruned so that only intervals separated from 0 are kept in the set. Then, the BH procedure is run on the corresponding subset of p-values, selecting the q-value threshold that controls the false discovery rate at α. Setting new confidence intervals at the 1 − q level controls the rate of false coverage for the family of intervals. If any of the intervals cover 0 after the adjustment, the set can be further pruned, and another BH iteration run on the smaller set. The requirements for the usual FDR assumptions to hold over a set of regions are discussed in Schwartzman et al. (2013).
Computationally, we can reuse the sample for the p-values to fit any size of confidence interval. Here is a review of the full algorithm:
Run a threshold-and-cluster algorithm to generate the bump candidates.
For each bump, test the selective hypothesis that the effect is smaller than ℓ (typically, ℓ = 0).
Find the p-value for this selective test.
Run the BH procedure on the (sorted) p-value list. Find a value q that controls the FDR at level α.
For bump candidates that pass the criterion, form marginal confidence intervals with coverage 1 − q. The sample from stage 2 can be reused for this purpose.
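The sketch below illustrates one pass of this screening-and-adjustment step in R: BH selection on the selective p-values, followed by widening the reported intervals to the FCR-adjusted level. The interval-forming function ci_fun is a hypothetical placeholder, and the further iterative pruning described above is omitted.

```r
## Sketch (R): one pass of BH screening plus FCR-adjusted interval levels,
## in the spirit of Benjamini and Yekutieli (2005).
fcr_adjust <- function(pvals, ci_fun, alpha = 0.05) {
  m    <- length(pvals)
  keep <- which(p.adjust(pvals, method = "BH") <= alpha)   # BH selection
  if (length(keep) == 0) return(NULL)
  q <- length(keep) * alpha / m                            # adjusted coverage level
  list(selected  = keep,
       intervals = lapply(keep, function(i) ci_fun(i, level = 1 - q)))
}
```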
6 Simulation
We conducted two sets of simulations to verify that the coverage properties of the confidence intervals are robust to different scenarios, and to investigate power. In both simulations, data were generated from a two-group model. Detailed descriptions of both simulations are found in Appendix C.
In the first set, we repeatedly sampled data for the same region to study coverage and power. We sampled a D = 5 multivariate normal vector and selected for the region a = 2, b = 4. We varied the number of samples, the covariance shapes, the true effect size, and the shape of the mean vector. We used either the true covariance, or a covariance that was estimated from the data. Note that increasing the number of samples reduces the variance of the Z vector, as well as improves the estimation of the covariance.
In the second set, we sampled from longer (D = 50) random non-stationary continuous processes, with a random non-null difference between the groups. Marginals followed either a normal or a logistic-transformed normal distribution. For each simulation, we ran the threshold-and-merge procedure and randomly selected one of the detected ROIs. We measured coverage properties for these randomly selected bump locations. The variance level of the samples and the threshold were selected to be similar to those in samples from the DNA-methylation arrays shown in Section 7.
Results
Representative results for both known and estimated covariances are shown in Figure 6, and the coverage rate for C_cor is summarized in Figure 7 (left). For the known covariance, coverage rates are approximately nominal under both covariance regimes, different group sizes and effect sizes. For the estimated covariance, coverage is less than the nominal rate, but the error decreases as the number of samples increases. Note that although for n = 16 with estimated Σ the two-sided coverage is almost correct, for θ = 0 the lower bound is too liberal. Figure 7 (right) plots the power of the selective intervals, as the probability of not covering 0 for increasing true effects θ. We use only the known Σ, which gives accurate coverage statements.
The results from the continuous simulation are displayed in Table 6. For Normal data with the true covariance, we get the expected coverage of 0.9 even though the true mean vector was misspecified.
For the logistic-transformed normal data, we get conservative intervals, perhaps due to the short tails. For the estimated Σ, the accuracy of coverage depends on the number of samples; increased sample size allows better estimates of Σ. Still, for 10 + 10 samples the algorithm seems to give approximately correct coverage.
7 DNA methylation data
We ran the inference on regions detected by comparing healthy human samples across two different tissues. Our dataset consists of methylation measurements for 36 tissue samples using the Illumina 450K array. Data were acquired from The Cancer Genome Atlas (TCGA).
We run two analyses of the data:
The two-tissue analysis compares 19 lung samples with 17 colon samples. We expect many true differences between the two groups.
For the one-tissue data, we randomly partitioned the 19 lung samples into two groups of 9 and 10 samples. Group differences found in the one-tissue data are considered false positives.
On each dataset, we detected ROIs using a fixed threshold (c = 0.1) to produce a list of candidate regions. For detection, we used the bumphunter (v1.10.0) package (Irizarry et al.). For each detected ROI, we produced a selective p-value and formed a 90% selective confidence interval for the mean between-group difference. The sample estimator of Σ was used for inference. Regions whose intervals overlapped 0 were pruned and the remaining intervals readjusted using (a) the BH procedure to control FCR, as discussed in Section 5, or (b) the Bonferroni procedure to control the family-wise probability of non-coverage. The samples were not smoothed or preprocessed. We did not allow regions to include sites separated by more than 5000 bps. The analysis was implemented in R. The TMN distribution was sampled using a C-compiled version of the selective-inference sampler, accelerated by tilting. For comparison, we produced p-values for the same set of detected ROIs using the permutation-based family-wise error correction (Jaffe et al., 2012b) as implemented in bumphunter: all candidate regions were ranked by area and compared to the strongest regions found in a null distribution that assumes random assignment to groups. The FWE-corrected p-value was set to the proportion of permuted datasets in which a superior region was found.
Results
The results for the two-tissue and the one-tissue data are summarized in Table 2. The selective inference method displays much greater sensitivity than the permutation method, while giving up little in terms of specificity. For the two-tissue design, selective inference calls 89% of the found regions significantly different from 0 at the 0.05 BH-adjusted level, and 63% at an FWE level of 0.05. This compares to 0.07% for the permutation-based FWE at 0.05. Examples of regions detected by selective inference and not by permutation FWE are shown in Figure 8.
For the one-tissue design, the nominal coverage of the intervals is conservative (3% of regions are rejected at the 0.05 level). No region is significant after multiplicity corrections with either method. If no covariance inflation is used (λ = 0), 5.5% of the regions are rejected at the 0.05 level, and 11 regions pass the BH procedure.
8 Discussion
In this paper we present a method for generating selection-corrected p-values, estimates and confidence intervals for the effect size of individual regions detected from the same data. The method allows for non-stationary individual processes, as each region is evaluated according to its own covariance. For a two-group design under non-negative correlation, the nominal coverage of the tests and lower bounds of the intervals is shown to hold when the covariance is known or the number of samples per group is moderate (group size ≥ 16). For genomic data sets, we show that the method has considerably better power than non-parametric alternatives, and the resulting intervals are often short enough to aid decision making. In the following we discuss further considerations and extensions.
Setting the threshold
The threshold c has considerable sway over the size and the number of regions detected. For the method to be effective, a reasonable threshold is required: setting it too high will “condition away” all the information at the selection stage, and very little information will be left for the inference. Setting the threshold too low will allow many regions to pass, requiring stronger multiplicity corrections to control the family-wise error rates. Exploring this tradeoff via a higher-criticism (Donoho and Jin, 2008) type approach may be an interesting extension.
More generally, the assumption that c is determined before the analysis can probably be relaxed, as the threshold is only very weakly dependent on any individual region. In particular, data-dependent thresholds computed far from the selected region (e.g. on different chromosomes) would have similar properties; a possible algorithm is then to compute a threshold for each chromosome based on data from all other chromosomes. Robust functions of the data, such as the median or other non-extremal quantile-based thresholds, should also give good results. See Weinstein et al. (2013) for an adaptive threshold in univariate selection.
Region size information
A related question is how to integrate the information regarding the size of the region. The methods we propose do not directly use information regarding the size of the detected region, as this information is conditioned away. Instead, inference is based only on the distance between the observations and the threshold: if the observed values are sufficiently close to the threshold, the p-values will be large; if they are farther away, the p-values will be small. Discarding the size of the region is perhaps counterintuitive. Hypothetically, we may detect a region large enough (with P(A_{a:b}) small enough) to be significant regardless of the effects of selection, but still get selective p-values that are large. In practice, however, this is unlikely; both the probability of the event A_{a:b} and the selective p-value become smaller as the size of the unconditional mean vector Θ_{a:b} increases. Long regions would usually be detected because the mean was larger than 0 in most of the region. This would usually also manifest in smaller p-values and less uncertainty in the confidence interval.
Reintegrating the probability of detection can increase power, but would require a different mechanism to control for selection. We may be able to recover the probability of selection from the Monte Carlo sample. It is tempting to reintegrate this information into the inference: under a strong null (Θa:b = 0), the likelihood of the data is the product of these two probabilities. The caveat, of course, is that the p-values associated with region size – P(Aa:b) – are not corrected for selection. Hence, we are back to the problem we wanted to initially solve.
Difference between our approach and pivot-based methods
The methods we propose are different from the exact pivot-based inference advocated by, for example, Lee et al. (2013) and Lockhart et al. (2014). These approaches condition not only on the selection event, but also on the subspace orthogonal to the statistic of interest. Essentially, all but the one-dimensional statistic t(Z) is conditioned away, leaving a well-specified single-parameter conditional distribution to evaluate. The model is elegant, in that it produces an exact p-value and intervals, without requiring sampling or plugging in nuisance parameters. However, we found that the fully conditional approach had very little power to separate true effects from nulls, and resulted in very large intervals. Specifically, after the conditioning, inference is conducted within a single segment {Z_{a:b} + α(1, …, 1)}_α; if the estimate at any of the points in the region is very close to the threshold, there will be no separation and the p-value obtained would be high. In contrast, our method is not sensitive to individual points being close to the threshold, because it aggregates outcomes over the full selection set A_{a:b}. We pay a price in having an inexact method that leans on sampling and a possibly misspecified choice of mean vector.
Additional applications and future work
The importance of accurate regional inference extends beyond genomics. Indeed, the threshold-and-merge region detection algorithm is extremely common in neuroscience for the analysis of fMRI data, where it is known as cluster inference. The standard parametric methods used for cluster inference rely on approximations for extreme sets in stationary Gaussian processes (Friston et al., 1994). Recently, the high-profile study of Eklund et al. (2016) showed these methods to be too liberal by testing them on manufactured null data; moreover, the distribution of detections was not uniform across the brain, suggesting the underlying process is non-stationary. The alternatives offered were non-parametric permutations of subject assignments (Hayasaka et al., 2004), similar to those used in Section 7. As we observe, these non-parametric methods can be grossly over-conservative, in particular when multiple regions are detected. Adapting our method to functional data may allow a powerful parametric model that relaxes the stationarity and strong thresholding requirements, without sacrificing power.
More work is required to expand the scope of the algorithm for these additional applications. In particular, the method needs to work well for larger regions and smaller samples. Currently, for larger regions, sampling becomes harder due to increased sensitivity to the initial parameter and to the mixing time of the Gibbs sampler. Solutions for these problems include more robust sampling algorithms and convergence diagnostics. Perhaps, for larger regions, approximations of t(Z) | A can replace Monte Carlo methods.
Furthermore, local covariance estimates might require too many samples to stabilize, and rigorous methods should be employed to deal with the unknown covariances. Using the truncated multivariate t instead of the multivariate normal would account for the uncertainty in estimating the variances; however, the correlation structure also has uncertainty which we currently do not take into account. We suggest in (6) an inflation parameter that gives a conservative estimate of the correlation, leaving the choice of λ to the user. More likely, for each application, specific models for covariance estimation can be developed. In genomics, external annotation including probe distance and sequence composition can give a prior model for shrinkage.
9 Acknowledgements
The authors would like to thank Will Fithian, Yossi Rinott and Yoav Benjamini for their comments that contributed to this manuscript. Y.B. would also like to thank Giovanni Parmigiani and the Stein Fellowship at Stanford for sponsoring a summer stay at the Biostatistics Department at Dana Farber that initiated this research. R.A.I is supported by NIH grant R01GM083084.
A Appendix: Accelerated sampling
Recall that we parameterize the truncated multivariate normal family with a single mean parameter θ that linearly determines the mean vector. The distributions corresponding to different values of θ form a single-parameter exponential family; this implies that importance weighting can efficiently convert a sample drawn at one value θ_0 into a sample for any other value θ. We detail the algorithm here. This is an expansion of the ideas described in the appendix of Fithian et al. (2014).
Suppose we have a Monte Carlo sample Z^{(1)}, …, Z^{(M)} from f_{Θ_s(θ_0)} and we would like to estimate E_θ[g(Z)] for a different value θ. Then the importance sampling estimate of E_θ[g(Z)] is

(1/M) Σ_{i=1}^{M} w_i · g(Z^{(i)}), where w_i = f_{Θ_s(θ)}(Z^{(i)}) / f_{Θ_s(θ_0)}(Z^{(i)}).

The importance estimator is unbiased. With a careful choice of θ_0 it may enjoy lower variance per sample size compared to direct Monte Carlo estimates from f_{Θ_s(θ)} (Owen, 2013). Nevertheless, for our application the primary gain is the ability to invert tests for numerous values of θ using a single sample.
The exponential tilting principle (Siegmund, 1976) recognizes that for single-parameter exponential families, the importance weights w_1, …, w_M can be calculated without explicitly computing the normalizing constant of the destination density f_{Θ_s(θ)}. We develop here the explicit form for the TMN densities parameterized by a linear mean shift.
The TMN density (5) is written in full as:

f_Θ(z) = exp{ −(1/2)(z − Θ)′ Σ^{−1} (z − Θ) } · 1{z ∈ A} / ∫_A exp{ −(1/2)(u − Θ)′ Σ^{−1} (u − Θ) } du,

with the mean vector parameterized linearly in θ, Θ = Θ_s(θ) = θ · s.

As described above, we take the statistic t(Z) = t(Z_{a:b}) and a profile s normalized so that t(s_{a:b}) = 1.

We can expand this density to exponential family form:

f_{Θ_s(θ)}(z) = h(z) exp{ θ · s′Σ^{−1} z − g(θ) }, z ∈ A,

where h(z) does not depend on θ, and g(θ) is the log normalizing constant. Therefore, the likelihood ratio for an example z ∈ A depends on the covariance-corrected shape s′Σ^{−1}:

f_{Θ_s(θ)}(z) / f_{Θ_s(θ_0)}(z) = exp{ (θ − θ_0) · s′Σ^{−1} z } · exp{ g(θ_0) − g(θ) },

where the second factor does not depend on z. Instead of computing the normalizing constants, the weights w_i ∝ exp{ (θ − θ_0) · s′Σ^{−1} Z^{(i)} } for a given Monte Carlo sample can be normalized to meet both conditions of empirical importance sampling weights.
A.1 Tilting for tests and intervals
The main speed-up from tilting arises from the ability to use a single sample, drawn at some θ_0, to identify the acceptance region for any θ. This is particularly useful because our method for estimating confidence intervals requires building many individual tests over a dense grid of θ values. The main computational load comes from sampling the TMN under the constraint; tilting allows us to perform this task only once, for a convenient value θ_0, and extract all tests from this sample.
Consider computing the p-value for the null hypothesis H_0 : θ ≤ 0 against a one-sided alternative H_1 : θ > 0. To estimate the probability under the null of getting a value greater than t_obs, we use the tail of the ‘tilted’ empirical distribution:

P̂_{θ=0}( t(Z) ≥ t_obs ) = Σ_{i=1}^{M} w_i · 1{ t_i ≥ t_obs },

with the weights w_i computed for the destination value θ = 0.

For a two-sided test of H_0 : θ = θ_1, as required for two-sided intervals, we use the tilted α and 1 − α quantiles of t(Z) at θ_1. The test rejects if t_obs falls outside these quantiles.
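The following R sketch computes these tilted tail probabilities over a grid of θ values from a single sample drawn at θ_0, using the weights derived above. It assumes the rejection-sampler helper from Section 3.3 and, for simplicity, scales the whole profile by θ_0; names are ours.

```r
## Sketch (R): exponential tilting of one TMN sample at theta0 to obtain
## P_theta( t(Z) >= t_obs ) for every theta on a grid.
tilted_pvalues <- function(t_obs, s, Sigma, lower, upper,
                           theta0, theta_grid, M = 5000) {
  Zs  <- rtmn_reject(M, theta0 * s, Sigma, lower, upper)   # single sample at theta0
  eta <- as.numeric(Zs %*% solve(Sigma, s))                # natural statistic s' Sigma^-1 z
  ts  <- rowMeans(Zs[, 2:(length(s) - 1), drop = FALSE])   # t(Z) for each draw
  sapply(theta_grid, function(theta) {
    lw <- (theta - theta0) * eta                           # log importance weights
    w  <- exp(lw - max(lw)); w <- w / sum(w)               # self-normalized weights
    sum(w[ts >= t_obs])                                    # tilted tail probability
  })
}
```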
Note that it is not necessary that the statistic t(z) coincide with the sufficient statistic of the exponential family. Although the sufficient statistic yields the most powerful tests, other linear statistics may be more robust to deviations from the mean model. In the case where a different statistic is preferred, the reweighting of the Monte Carlo samples might not be perfectly correlated with the statistic.
A.2 Choosing θ0
Heuristically, we prefer a θ_0 for which t_obs is a likely outcome, with enough samples on both sides of t_obs. In the extreme case, identifying any p-value in (0, 1) via tilting requires that min_i t_i < t_obs < max_i t_i, meaning we have at least one sample on each side of t_obs. The variance of an importance sampling estimate grows with the variability of the weights (Owen, 2013, ch. 9), so a more equal distribution of weights leads to better variances. Moreover, a likely choice for θ_0 would often mean the selection criterion is not very strong, so that the sampler at θ_0 would be relatively efficient.
For choosing θ0 under a truncation at c, one candidate is the estimator that would be unbiased had there been no selection, θ0 = tobs. We require that the empirical distribution of t(Z) sampled under θ0 place tobs strictly inside its range, with at least one sample on each side. If this requirement fails, we run a linear search over θ0 values between tobs and 0 until we find a value that succeeds. Due to selection against small values of Z, tobs is an upward-biased estimator of θ; when variances are small, the sampler might need a better starting point for θ0, so multiple starting points can be explored.
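A minimal sketch of this search, assuming a user-supplied `sample_tmn(theta)` routine that returns an n × D TMN sample under the selection constraint; the grid size and the one-sample-per-side requirement are illustrative choices.

```python
import numpy as np

def choose_theta0(sample_tmn, t_obs, n_grid=10, min_each_side=1):
    """Linear search for theta0 between t_obs and 0, keeping the first value
    whose sampled sum statistics bracket t_obs on both sides."""
    for theta0 in np.linspace(t_obs, 0.0, n_grid):
        Z = sample_tmn(theta0)
        t = Z.sum(axis=1)
        if (t < t_obs).sum() >= min_each_side and (t > t_obs).sum() >= min_each_side:
            return theta0, Z
    raise RuntimeError("no theta0 between t_obs and 0 bracketed t_obs")
```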
B Appendix: Proofs and Examples
Lemma 1
Let {f̃θ} denote the family of densities of t(Z), indexed by a single parameter θ. Then:
1. {f̃θ} is a monotone likelihood ratio family;
2. Eθ[t(Z)] is an increasing function of θ;
3. the confidence set for θ obtained by inverting two-sided tests is an interval.
Proof
We recall the exponential family form of fθ:

  fθ(z) = h(z) exp{θ · t(z) − g(θ)},  z ∈ A,

where h(z) does not depend on θ and g(θ) is the normalizing constant; here the statistic t(z) = z′(1,…,1)′ plays the role of the sufficient statistic.

Next note that the density of t(Z) at a particular value x corresponds to integrating fθ over the intersection of the hyper-plane ∑ Z = x with the conditional set A, which we denote Ax. Within this hyperplane the value of the statistic is fixed, z′(1,…,1)′ = x, so

  f̃θ(x) = ∫Ax h(z) exp{θx − g(θ)} dz.

Because the expression exp{θx − g(θ)} is fixed within Ax, it can be moved outside the integral. Defining h̃(x) = ∫Ax h(z) dz, we get the exponential family structure in x:

  f̃θ(x) = h̃(x) exp{θx − g(θ)}.

Therefore, the likelihood ratio for a value t(z) = x,

  f̃θ′(x) / f̃θ(x) = exp{(θ′ − θ)x} · exp{g(θ) − g(θ′)},  θ′ > θ,

can be written as a strictly increasing function of x.
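For conclusion 2, a standard exponential-family identity (sketched here for completeness; the step is only implicit above) gives

\[
E_\theta[t(Z)] = g'(\theta), \qquad \frac{d}{d\theta}\,E_\theta[t(Z)] = g''(\theta) = \operatorname{Var}_\theta\big(t(Z)\big) \ge 0,
\]

so Eθ[t(Z)] is non-decreasing in θ, and strictly increasing whenever t(Z) is non-degenerate on A.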
Lemma 2
We use results from Rinott and Scarsini (2006) for multivariate normals to identify the following conditions on Σ and s:
Σ−1 is an M-matrix, meaning all of its off-diagonal elements are non-positive. (In particular, all entries of Σ are then non-negative.)
s lies in the cone CΣ of non-negative linear combinations of columns of Σ
Note that for Σ = I, any non-negative profile lies in CΣ (indeed, all profiles produce a monotone curve under an iid covariance; see Figure 5). The condition is sufficient but not necessary; even if Σ−1 is not an M-matrix, there may still be specific (profile, statistic) pairs for which monotonicity holds, as discussed in Lemma 1. However, there will then be no profile for which every statistic is monotone.
Let {f̃θ} be the family of densities of t(Z)|A parametrized by θ, where Z ~ N(θ · s, Σ). If Σ−1 is an M-matrix, and the profile s can be written as a non-negative sum of columns of Σ, then {f̃θ} is a monotone likelihood ratio family.
Proof
Consider random vectors Z and Z′ where Z ~ N(θ · s, Σ) and Z′ ~ N(θ′ · s, Σ) for some θ′ > θ. The lemma identifies a sufficient condition for Z′|A being stochastically larger (≥st) than Z|A. Stochastic ordering implies E[ϕ(Z′)|A] ≥ E[ϕ(Z)|A] for any coordinate-wise non-decreasing functional ϕ, and in particular for a linear statistic with non-negative weights; the quantile functions of t(Z′)|A and t(Z)|A are ordered in the same direction.
The key to the proof is moving from an ordering of Z and Z′ to an ordering of the conditional vectors [Z|A] and [Z′|A] (interpreted as [Z|Z ∈ A] and [Z′|Z′ ∈ A]). For rectangular sets A, the stronger ordering notion of total positivity is maintained through the conditioning. We review here the main results of Rinott and Scarsini (2006); proofs and an extended discussion of the cited properties are found there.
For multivariate densities f, g, the relation f ≥tp g (total positivity ordering) implies that the ratio f(x)/g(x) increases under any coordinate-wise increase in x. [Lemma 2.2, a]
If X ≥tp Y, then X is also stochastically greater than Y, implying E[ϕ(X)] ≥ E[ϕ(Y)] for any nondecreasing function ϕ [Proposition 2.4]. If X ≥tp X, then X is said to be multivariate totally positive of order 2, or MTP2.
For a rectangular set A, X ≥tp Y implies [X | X ∈ A] ≥tp [Y | Y ∈ A] [Theorem 2.5, a special condition of Remark 2.6 (i)]. Note that conditioning by thresholding individual coordinates always results in a rectangular set A.
A multivariate normal Z with an invertible covariance matrix Σ is MTP2 if and only if Σ−1 is an M-matrix. [2.15]
For Z that is MTP2, if µ = Σa for some a ≥ 0 (that is, µ ∈ CΣ), then Z + µ ≥tp Z. [Thm 3.2]
Therefore, under the assumptions regarding Σ, Z is MTP2. Call µ the shift between the two mean vectors; then µ = (θ′ − θ) · s, and since θ′ > θ and s ∈ CΣ we have µ ∈ CΣ, so Z′ =d Z + µ ≥tp Z. Therefore, the conditions suffice for [Z′|A] ≥tp [Z|A] on the rectangular set A. This further implies [Z′|A] ≥st [Z|A], and hence the stochastic ordering of t(Z′)|A and t(Z)|A.
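The two sufficient conditions are easy to verify numerically for a given Σ and profile s; the following sketch (an illustrative helper with an arbitrary tolerance, not part of the original method) checks them directly.

```python
import numpy as np

def check_lemma2_conditions(Sigma, s, tol=1e-10):
    """Check (i) Sigma^{-1} is an M-matrix (non-positive off-diagonals) and
    (ii) s lies in the cone C_Sigma, i.e. s = Sigma @ a for some a >= 0."""
    P = np.linalg.inv(Sigma)
    off_diag = P - np.diag(np.diag(P))
    is_m_matrix = bool(np.all(off_diag <= tol))
    a = np.linalg.solve(Sigma, s)          # unique coefficients with s = Sigma @ a
    in_cone = bool(np.all(a >= -tol))
    return is_m_matrix, in_cone
```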
C Appendix: Simulation Details
Details of the two simulation studies are described below.
Experiment 1
In the first set of experiments, we tested the coverage probability of 1 − 2α = 0.9 confidence intervals by repeatedly sampling the same set of variables, keeping only those data-vectors for which all locations passed the selection threshold. We sampled data vectors of length D = 5, selecting for the positive region r = (2, 4, +) spanning locations 2 through 4 of the vector. Data was generated from the multivariate normal for a range of mean bump-heights θ; for the true mean vector we used a bump of height θ over the selected region. Sample sizes were n1 = n2 = 4, 8, 16; increased sample size reduces the variance of Z and, for unknown covariance, increases the accuracy of the covariance estimate. The data was generated using two covariance matrices for the samples: a correlated Ccor and an uncorrelated Ciid.
In both matrices, σ2 = 0.04 was chosen to reflect the average observed within-group variance in the DNA-methylation data.
The following results are based on a tilting algorithm using n = 12000 samples from the reference distribution. The confidence interval estimation was repeated N = 2000 times for one value of θ and N = 250 times for each of the remaining values. We repeated each experiment three times, using (a) the true covariance, (b) the sample covariance estimate, and (c) the inflated covariance estimate with λ = 0.15, see (6). We used the same fixed profile in all experiments: under the known iid covariance, it corresponds to the true shape of the mean; in the other cases the profile does not match the shape of the true mean.
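The rejection-sampling step of this experiment can be sketched as follows; the bump heights, the selection threshold, and the exact form of Ccor are not reproduced in the text above, so they appear here as placeholder arguments.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_selected_vectors(theta, Sigma, n_per_group, threshold, n_keep):
    """Draw group-difference vectors of length D = 5 and keep those whose
    coordinates inside the selected region r = (2, 4, +) all pass `threshold`."""
    D = Sigma.shape[0]
    region = slice(1, 4)                    # locations 2-4 (1-indexed) of the vector
    mu = np.zeros(D)
    mu[region] = theta                      # bump of height theta over the region
    kept = []
    while len(kept) < n_keep:
        xbar = rng.multivariate_normal(mu, Sigma / n_per_group)           # group 1 mean
        ybar = rng.multivariate_normal(np.zeros(D), Sigma / n_per_group)  # group 2 mean
        z = xbar - ybar
        if np.all(z[region] > threshold):   # selection: all locations pass
            kept.append(z)
    return np.array(kept)
```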
Experiment 2
Data was generated in the following way:
The (untransformed) mean process for group A, µA, was generated by sampling D = 50 data points from an iid N(0,1) process and convolving with the 1-dimensional kernel Kµ = (0.1, 0.2, 0.4, 0.2, 0.1). The mean difference process µδ was generated by sampling iid values and convolving with K1. The mean for group B was µB = µA + µδ.
For each sample we added noise by concatenating noise from two correlation regimes:
For one half of the coordinates, iid N(0,1) samples were smoothed by convolving each vector with Kε = (0.05, 0.1, 0.15, 0.4, 0.15, 0.1, 0.05). This resulted in correlated noise in that half of the vector.
For the remaining coordinates, iid normal noise was sampled with no smoothing.
Noise was added to each sample, so that Yi = µA + εi or Yi = µB + εi according to the sample's group.
For the transformed data, each sample Yi was transformed coordinate-wise with the logistic function f(y) = 1/(1 + e−y). A population of N = 10000 transformed samples was generated, and the mean vector and covariance matrix of each group were estimated empirically from these samples.
Multiple subsamples of n = 40, 20, 10, 5 were taken from each group.
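A compact sketch of this generating process is given below; the kernel used for µδ, the assignment of the correlated regime to the first half of the coordinates, and the unit variance of the unsmoothed half are assumptions where the text above leaves the details unspecified.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 50
K_mu = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
K_eps = np.array([0.05, 0.1, 0.15, 0.4, 0.15, 0.1, 0.05])

def smooth(x, kernel):
    """Convolution trimmed back to the input length."""
    return np.convolve(x, kernel, mode="same")

# Mean processes: group A, a smoothed difference, and group B = A + difference.
mu_A = smooth(rng.standard_normal(D), K_mu)
mu_delta = smooth(rng.standard_normal(D), K_mu)     # kernel for the difference assumed
mu_B = mu_A + mu_delta

def draw_sample(mu):
    """One noisy sample: smoothed (correlated) noise on the first half of the
    coordinates, unsmoothed iid noise on the second half."""
    eps = np.empty(D)
    eps[: D // 2] = smooth(rng.standard_normal(D // 2), K_eps)
    eps[D // 2:] = rng.standard_normal(D - D // 2)
    return mu + eps

def logistic(y):
    return 1.0 / (1.0 + np.exp(-y))

# Transformed population used to estimate each group's mean and covariance.
pop_A = logistic(np.array([draw_sample(mu_A) for _ in range(10000)]))
mean_A, cov_A = pop_A.mean(axis=0), np.cov(pop_A, rowvar=False)
```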
Footnotes
1 Recall that p is the number of sample covariates – length{(Xi, Wi)} – which is typically small, not to be confused with the size of the measurement vector D = length{Yi}.
2 Admittedly, we reverse here the usual flow of statistical modeling, by first choosing the statistic and only later the model.