Neuro-Current Response Functions: A Unified Approach to MEG Source Analysis under the Continuous Stimuli Paradigm

Characterizing the neural dynamics underlying sensory processing is one of the central areas of investigation in systems and cognitive neuroscience. Neuroimaging techniques such as magnetoencephalography (MEG) and Electroencephalography (EEG) have provided significant insights into the neural processing of continuous stimuli, such as speech, thanks to their high temporal resolution. Existing work in the context of auditory processing suggests that certain features of speech, such as the acoustic envelope, can be used as reliable linear predictors of the neural response manifested in MEG/EEG. The corresponding linear filters are referred to as temporal response functions (TRFs). While the functional roles of specific components of the TRF are well-studied and linked to behavioral attributes such as attention, the cortical origins of the underlying neural processes are not as well understood. Existing methods for obtaining the cortical representation of such linear speech processing work in a two-stage fashion: either the TRFs are first estimated at the sensor level, and then mapped to the cortex via source localization, or the neuroimaging data are first mapped to the cortex followed by estimating TRFs for each of the resulting cortical sources. Given that each stage is biased towards specific requirements, such as sparsity and smoothness, the end result typically suffers from destructive propagation of biases across stages, which in turn hinders a valid statistical interpretation of the results and requires significant post-hoc processing to summarize the results in a meaningful fashion. In this work we address this issue by estimating a linear filter representation of cortical sources directly from neuroimaging data. 
To this end, we introduce Neuro-Current Response Functions (NCRFs), which are three-dimensionally oriented linear filters spatially distributed throughout the cortex, that predict the cortical current dipoles giving rise to the observed MEG (or EEG) data in response to speech. NCRF estimation is cast within a Bayesian framework, which allows unification of the TRF and source estimation problems, and also facilitates the incorporation of prior information on the structural properties of the NCRFs. We present a fast estimation algorithm, which we refer to as the Champ-Lasso algorithm, by leveraging recent advances in optimization. We demonstrate the utility of the Champ-Lasso algorithm through application to simulated and experimentally recorded MEG data under auditory experiments. Our simulation studies reveal significant improvements over existing work, in terms of both spatial resolution and reliance on fine-tuned coordinate co-registration. Application to experimentally-recorded data corroborates existing results, while also delineating the distinct cortical distribution of the underlying neural processes at high spatiotemporal resolution and obviating the need for post-processing steps such as clustering and denoising. In summary, we provide a principled modeling and estimation paradigm for MEG source analysis tailored to extracting the cortical origin of electrophysiological responses to continuous stimuli.

Arguably the earliest and most widely used technique to construct neural encoding models is the 'reverse correlation' technique, in which neural responses time-locked to multiple repetitions of simple stimuli (such as acoustic tones and visual gratings) are averaged, weighted by the instantaneous value of the preceding stimulus, to form the so-called evoked response function. Originally devised to study the tuning properties of sensory neurons (Aertsen and Johannesma, 1983;Aertsen et al., 1981;Ringach and Shapley, 2004), it was later incorporated into MEG/EEG analysis. In probing the neural response to more sophisticated stimuli such as continuous speech and video, the goal is to understand the encoding of the continuous stimuli as a whole, which is composed of both low level (e.g., acoustics) and high level (e.g., semantics) features which are bound together and distributed across time.
To address this issue, techniques from linear systems theory have been successfully utilized to capture neural encoding using MEG/EEG under the continuous stimuli paradigm. In this setting, the encoding model takes the form of a linear filter which predicts the MEG/EEG response from the features of the stimulus. For example, it has been shown that the acoustic envelope of speech is a suitable predictor of the EEG response (Lalor and Foxe, 2010). These filters, or impulse response functions, play a crucial role in characterizing the temporal structure of auditory information processing in the brain, and are often referred to as Temporal Response Function (TRF) Simon, 2012b, 2013a,b). For instance, in a competingspeaker environment in the presence of two speech streams, it has been observed that the TRF extracted from MEG response to the acoustic power consists of an early component at around 50 ms representing the acoustic power of the speech mixture, while a later peak at around 100 ms preferentially encodes the acoustic power of the attended speech stream (Ding and Simon, 2012a;Akram et al., 2016). Building up on evidence from ferret electrophysiology (Mesgarani et al., 2008) and human electrocorticography (ECoG) (Mesgarani et al., 2014), recent studies have expanded the TRF framework beyond the acoustic level to account for phoneme-level processing (Di Liberto et al., 2015), lexical processing (Brodbeck et al., 2018a) and semantic processing (Broderick et al., 2018).
Thanks to the grounding of the TRF model in linear system theory, several techniques from the system identification literature have been utilized for TRF estimation, such as the normalized reverse correlation (Theunissen et al., 2001), ridge regression (Machens et al., 2004), boosting (David et al., 2007), and SPARLS (Akram et al., 2017), some of which are available as software packages (Theunissen, 2007(Theunissen, , 2010Crosse et al., 2016;Brodbeck, 2017). While these methods have facilitated the characterization of the functional roles of various TRF components in sensory and cognitive processing of auditory stimuli, they predominantly aim at estimating TRFs over the MEG/EEG sensor space. While recent studies, using electrophysiology in animal models and ECoG in humans, have provided new insights into the cortical origins of auditory processing (See, for example, Mesgarani et al., 2008;Mesgarani and Chang, 2012;Pasley et al., 2012;Mesgarani et al., 2014), they do not account for the whole-brain distribution of the underlying sources due to their limited spatial range. As such, the whole-brain cortical origins of the TRF components are not well studied.
To address this issue using neuroimaging, current dipole fitting methods have been utilized to map the sensor space distribution of the estimated TRF components onto cortical sources (Lalor et al., 2009;Ding and Simon, 2012a). Given that the processing of sophisticated stimuli such as speech is known to be facilitated by a widely distributed cortical network, single dipole sources are unlikely to capture the underlying cortical dynamics. More recent results have used the minimum norm estimate (MNE) source localization technique to first map the MEG activity onto the cortical mantle, followed by estimating a TRF for each of the resulting cortical sources (Brodbeck et al., 2018b). While these methods have shed new light on the cortical origins of the TRF, they have several limitations that need to be addressed. First, the ill-posed nature of the MEG/EEG source localization problem under distributed source models results in cortical estimates with low spatial resolution (Schoffelen and Gross, 2009;Baillet et al., 2001). Given the recent and ongoing advances in MEG/EEG source localization towards improving the spatial resolution of the inverse solutions (See, for example, Gorodnitsky et al., 1995;Friston et al., 2008;Haufe et al., 2008;Wipf et al., 2010;Wipf and Nagarajan, 2009;Wu et al., 2016;Gramfort et al., 2013;Babadi et al., 2014;Krishnaswamy et al., 2017;Pirondini et al., 2018), it is tempting to simply use more advanced source localization techniques followed by fitting TRFs to the resulting cortical sources. However, these techniques leverage specific prior knowledge on the source configurations and are typically developed for the event-related potential (ERP) paradigm (Handy, 2005;Gazzaniga and Ivry, 2009;Luck, 2014). 
As such, they are biased towards providing source estimates with high resolution under specific repetition-based experimental settings, and do not account for the key structural properties of the underlying neural processes that extract information from continuous sensory stimuli. These key properties include the smoothness and/or sparsity of the response functions in the lag domain, and their spatial correlation over the cortex.
Second, high resolution inverse solutions require precise forward models that are constructed based on high resolution magnetic resonance imaging (MRI) scans with accurate sensor registration, which may not be readily available. In addition, the single-trial nature of experiments involving continuous auditory stimuli, does not allow to leverage the time-averaging across multiple trials common in source localization of evoked responses from MEG/EEG. Finally, the two-stage procedures of first fitting TRFs over the sensor space followed by localizing the peaks using dipole fitting, or first finding source estimates over the cortex followed by fitting TRFs to cortical sources, results in so-called bias propagation: the inherent biases arising from the estimation procedure in the first stage propagate to the second stage, often destructively so, and limit, sometimes severely, the statistical interpretability of the resulting cortical TRFs.
In order to address these limitations, in this paper we provide a methodology for direct estimation of cortical TRFs from MEG observations, taking into account their sparsity and spatiotemporal structure. We refer to these cortical TRFs as neuro-current response functions (NCRFs). We construct a unified estimation framework by integrating the linear encoding and distributed forward source models, in which the NCRFs linearly process different features of the continuous stimulus and result in the observed neural responses at the sensor level. We cast the inverse problem of estimating the NCRFs as a Bayesian optimization problem where the likelihood of the recorded MEG response is directly maximized over the NCRFs, thus eliminating the need for the aforementioned two-stage procedures. In addition, to guard against over-fitting and ensure robust recovery of the NCRFs, we design a regularizer that captures the spatial sparsity and temporal smoothness of the NCRFs and mitigates the need for accurate cortical surface patch statistics in the head model. While the resulting optimization problem turns out to be non-convex, we provide an efficient coordinate-descent algorithm that leverages recent advances in evidence maximization to obtain the solution in a fast and efficient manner.
We evaluate the performance of the proposed NCRF estimation framework using both simulation studies and application to experimentally recorded MEG data. Our simulation studies reveal that the proposed method is capable of producing cortical TRF estimates with higher spatial resolution and less variability compared to existing methods. Application to experimentally recorded MEG data from young adult subject listening to speech provides new insights into the cortical dynamics of speech processing.

Theory and Methods
We develop our theory and methods for a canonical MEG auditory experiment in which the subject is listening to a single speech stream. Our goal is to determine how the different features of the speech stream are processed at different cortical stages and evoke specific neural responses that give rise to the recorded MEG data. For clarity of description and algorithm development, we first consider a single-trial experiment, and take the momentary acoustic power of the speech stream, i.e., the speech envelope, as the feature of interest. We will discuss below the more general scenarios including multiple trials, multiple speech stimuli, and multiple, and possibly competing, features reflecting different levels of cognitive processing.

Preliminaries and Notation
Let e t , 1 ≤ t ≤ T denote the speech envelope at discrete time index t for a duration of T samples taken at a sampling frequency of f s . We consider a distributed cortical source model composed of M dipole sources d m = (r m , j m,t ), 1 ≤ m ≤ M , where r m ∈ R 3 denotes the Euclidean coordinates of the m th dipole and j m,t := [j m,t,R , j m,t,A , j m,t,S ] ∈ R 3 denotes the dipole current vector at time t in the right-anterior-superior (RAS) coordinate system. The dipole locations can be obtained by standard tessellation of the 3D structural MR images of the cortex and assigning dipoles to the corresponding vertices (Baillet et al., 2001;Gramfort et al., 2014).
We assume that these current dipoles are in part stimulus-driven, i.e., j m,t,i = f i (e t , e t−1 , · · · , e 1 ) + v m,t,i where the placeholder i takes the values of one of the coordinate axes, {R, A, S}, f i is a generic function, and v m,t := [v m,t,R , v m,t,A , v m,t,S ] accounts for the stimulus-independent background activity. Following the common modeling approaches in this context (Ringach and Shapley, 2004;Lalor et al., 2006;Lalor and Foxe, 2010;Lalor et al., 2009), we take f i to represent a linear finite impulse response (FIR) filter of length L: where τ m,i := [τ m,i,0 , τ m,i,1 · · · , τ m,i,L−1 ] and e t := [e t , e t−1 , · · · , e t−L+1 ] . Note that τ m,i can be thought of as a TRF corresponding to the activity of dipole source m along the coordinate axis determined by i.
The length of the filter L is typically determined by a priori assumptions on the effective integration window of the underlying neural process. The 3D linear filters τ m := [τ m,R , τ m,A , τ m,S ] ∈ R L×3 are vector-valued TRFs at each source m, capturing the linear processing of the stimuli at the cortical level. As such, we refer to these vector-valued filters as Neuro-Current Response Functions (NCRFs) henceforth.
Let j t := [j 1,t , j 2,t , · · · , j M,t ] ∈ R 3M be a vector containing all the current dipoles at time t, and J := [j 1 , · · · , j T ] ∈ R 3M ×T be the matrix of current dipoles obtained by concatenating the instantaneous current dipoles across time t = 1, 2, · · · , T . Similarly, let V ∈ R 3M ×T denote the matrix of stimulusindependent background activity, Φ := [τ 1 , τ 2 , · · · , τ M ] ∈ R 3M ×L denote the matrix of NCRFs, and S := [e 1 , e 2 , · · · , e T ] ∈ R L×T denote the matrix of features. Eqs. (1) and (2) can then be compactly expressed as: As for the sensor space, we assume a conventional MEG setting with N sensors placed at different positions over the scalp, recording magnetic fields/gradients as a multidimensional time series. The MEG observation at the i th sensor at time t is denoted by y i,t , 1 ≤ i ≤ N and t ∈ [1, · · · , T ]. Let Y ∈ R N ×T be the MEG measurement matrix with the (i, t) th element given by y i,t . The MEG measurement matrix is related to the matrix of current dipoles J according to the following forward model (Sarvas, 1987;Mosher et al., 1999;Baillet et al., 2001): where L ∈ R N ×dM maps the source space activity to the sensor space and is referred to as the lead-field matrix, and W ∈ R N ×T is the matrix of additive measurement noise. The lead-field matrix can be estimated based on structural MRI scans by solving Maxwell's equations under the quasi-static approximation (Hämäläinen et al., 1993).

Problem Formulation
Given the stimulus-driven and current-driven forward models of Eqs.
(3) and (4), our main goal is to estimate the matrix Φ, i.e., the NCRFs. To this end, we take a Bayesian approach, which demands distributional assumptions on the various uncertainties involved, i.e., the stimulus-independent background activity and the measurement noise. For the measurement noise, we adopt the common temporally uncorrelated multivariate Gaussian assumption, i.e., where A 2 B := tr A BA and Σ w ∈ S N + denotes the unknown noise covariance matrix. The covariance matrix Σ w can be estimated from either empty-room or pre-stimulus recordings (Engemann and Gramfort, 2015). Next, let V m ∈ R 3×T denote the matrix of background activity at source m, for m = 1, 2, · · · , M .
We adopt the following distribution for the background activity V: i.e., the portion of the current dipoles reflecting the background activity are modeled as zero-mean independent Gaussian random vectors with unknown 3D covariance matrix Γ m ∈ S 3 + . Under this assumption, Eq. (3) can be expressed as: where Γ ∈ S 3M + is a block-diagonal covariance matrix with its m th diagonal block given by Γ m , for m = 1, 2, · · · , M .
Under these assumptions, the joint distribution of the MEG measurement and current dipole matrices is given by: By marginalizing over J, we obtain the distribution of the MEG measurement matrix parametrized by the NCRF matrix Φ and the source covariance matrix Γ: It is now possible to cast the problem of finding Φ as a Bayesian estimation problem, in which a loss function fully determined by the posterior distribution of NCRF matrix Φ given the MEG measurement matrix Y is minimized. In other words, if Γ were known, the NCRF matrix estimation would amount to the following maximum likelihood problem: Figure 1: Mixed-norm penalty term P 2,1,1 (Θ) for regularizing the loss function. The penalty term is constructed by first isolating all 3D Gabor coefficient vectors across the dictionary elements and space, and then aggregating their 2 norm. As a result, it promotes sparsity in space and Gabor coefficients, while being invariant to the orientation of the dipole currents.
Another advantage of this Bayesian framework is the possibility of introducing regularization schemes that can mitigate the ill-posed nature of this problem, and instead work with regularized maximum likelihood problems. Note that this optimization problem makes a direct connection between the MEG measurement matrix,Y and the NCRF matrix Φ and allows us to avoid the aforementioned two-stage procedures in finding TRFs at the cortical level (Lalor et al., 2009;Brodbeck et al., 2018b).

Regularization
As is the case in other source imaging methods, there are many fewer constraints than the free parameters determining the NCRFs. This makes the problem severely ill-posed. As such, proceeding with the maximum likelihood problem in Eq. (10) is likely to result in overfitting. In order to ensure robust recovery of a meaningful solution to this ill-posed problem, we need to include prior knowledge on the structure of the NCRFs in the form of regularization.
To this end, we construct regularizers based on a convex norm of the NCRF matrix Φ, to both capture the structural properties of the NCRFs and facilitate algorithm development. The structural properties of interest in this case are spatial sparsity over the cortical source space, sparsity of the peaks/troughs, smoothness in the lag domain, and rotational invariance (Ding and Simon, 2012b;Akram et al., 2017).
In order to promote smoothness in the lag domain and sparsity of the peaks/troughs, we adopt a concept from Chen et al. (2001), in which a temporally smooth time series is approximated by a small number of Gabor atoms over an over-complete dictionary G ∈ R L×L , for someL ≥ L (Feichtinger and Strohmer, 2012;Akram et al., 2017). To this end, we first perform a change of variables τ m := Gθ m , Φ = ΘG , and S := G S, where θ m ∈ RL ×3 are the coefficients of the m th NCRF over the dictionary G and Θ ∈ R 3M ×L is a matrix containing θ m s across its rows. Then, to enforce sparsity of the peaks/troughs, spatial sparsity, and rotational invariance, we use the following mixed-norm penalty over θ m s, i.e., the Gabor coefficients: Let θ m,l ∈ R 3 be the l th Gabor coefficient vector for the m th NCRF. Note that the summand is θ m,l 2 , which is a rotational invariant norm with respect to the choice of dipole RAS coordinate system. This structural feature allows the estimates to be robust to coordinate rotations (See Appendix A). The inner summation of θ m,l 2 (as opposed to θ m,l 2 2 ) over l = 1, 2, · · · ,L enforces group sparsity of the Gabor coefficients (i.e., the number of peaks/troughs), akin to the effect of 1 -norm. Finally, the outer summation over m = 1, 2, · · · , M promotes spatial sparsity of the NCRFs (See Fig. 1, and also Appendix A).
Using this change of variables and regularization scheme, we can reformulate (10) as the following regularized maximum likelihood problem: The parameter η > 0 controls the trade-off between data fidelity and regularization, i.e., the complexity of the resulting model grows inversely with the magnitude of η. This parameter can be chosen in a datadriven fashion using cross-validation (See Section 3.2). Fig. 2 provides a visual illustration of the proposed modeling and estimation paradigm.

Source Covariance Matrix Adaptation
Note that the objective function in Eq. (12) is convex in Θ and thus one can proceed to solve for Θ by standard convex optimization techniques. However, this requires the knowledge of the source covariance matrix Γ, which is unknown in general. From Eq. (7), it is evident that Γ implicitly offers adaptive penalization over the source space through spatial filtering. As such, the source covariance matrix serves as a surrogate for depth compensation (Lin et al., 2006), by reducing the penalization level at locations with low SNR. One data-independent approach for estimating Γ is based on the lead-field matrix (Haufe et al., 2008). Here, thanks to the Bayesian formulation of our problem, we take a data-driven approach to adapt the source covariance matrix to the observations. One principled way to do so is to estimate both Θ and Γ from the observed MEG data by solving the following optimization problem: Unfortunately, the loss function in Eq. (13) is not convex in Γ. However, given an estimate of Θ, solving for the minimizer of (13) in Γ is a well-known problem in Bayesian estimation and is referred to as evidence maximization or empirical Bayes (Berger, 1985). Although a general solution to this problem is not straightforward to obtain, there exist several Expectation-Maximization (EM)-type algorithms, such as ReML (Friston et al., 2008), sMAP-EM (Lamus et al., 2012), and the conjugate function-based algorithm called Champagne (Wipf et al., 2010), which might be employed to estimate Γ given an estimate of Θ. In the next section, we present an efficient recursive coordinate descent-type algorithm that leverages recent advances in evidence maximization and proximal gradient methods to solve the problem of Eq. (13).

Inverse Solution: The Champ-Lasso Algorithm
Since simultaneous minimization of (13) with respect to both Θ and Γ is not straightforward, we instead aim to optimize the objective function by alternatingly updating Θ and Γ, keeping one fixed at a time.
Suppose after the r th iteration, the updated variable pair is given by Θ (r) , Γ (r) , then the update rules for (r + 1) th iteration are as given as follows: Updating Γ With Θ = Θ (r) , Eq. (13) reduces to the following optimization problem: with Although the problem is non-convex in Γ, it can be solved via the Champagne algorithm (Wipf et al., 2010), which solves for Γ by updating a set of auxiliary variables iteratively. Though the solution Γ (r+1) is not guaranteed to be a global minimum, the convergence rate is fast (with computation cost per iteration being linear in N ), and more importantly each iteration is guaranteed not to increase the loss function in Eq. (14).
Updating Θ Fixing Γ = Γ (r+1) , results in the following convex optimization problem over Θ: where Σ The first term in Eq. (15) is a smooth differentiable function whose gradient is straightforward to compute, and the proximal operator for the penalty term P 2,1,1 (Θ) has a closed-form expression and can be computed in an efficient manner (Gramfort et al., 2012). Regularized optimization problems of this nature can be efficiently solved using an instance of the forward-backward splitting (FBS) method (Beck and Teboulle, 2009;Nesterov, 2005). We use an efficient implementation of FBS similar to FASTA (Fast Adaptive Shrinkage/Thresholding Algorithm) software package (Goldstein et al., 2014) to obtain Θ (r+1) from Eq. (15).
Although the loss function is not jointly-convex in (Θ, Γ), the foregoing update steps ensure that the loss in (13) is not increased at any iteration and stops changing when a fixed-point or limit-cycle is reached (Wright, 2015). Finally, note that due to the efficiency of the overall solver, it is possible to restart the optimization with a randomized initialization, and choose the best solution among several potential alternatives.

Extension to Multiple Feature Variables
The preceding sections focused on the case of a single stimulus feature variable, i.e., the speech envelope.
However, complex auditory stimuli such as natural speech, are processed at various levels of hierarchy. Upon entering the ears, the auditory signal is decomposed into an approximate spectrogram representation at the cochlear level prior to moving further into the auditory pathway (Yang et al., 1992). Beyond these lowlevel acoustic features, higher-level phonemic, lexical, and semantic features of the natural speech are also processed in the brain. Thus, to obtain a complete picture of complex auditory cortical processing, it is desirable to consider response functions corresponding to more than one feature variable.
One can proceed to estimate response functions for each feature variable separately. But, since many of these features have significant temporal correlations, the resulting response functions do not readily provide unique information regarding the different levels of the processing hierarchy. To investigate simultaneous processing of these various feature variables and allow them to compete in providing independently informative encoding models, we consider a multivariate extension of the response functions (Ding and Simon, 2012b;Di Liberto et al., 2015).
Suppose that there are F ≥ 1 feature variables of interest. We modify Eq. (3) by replacing each column of the NCRF matrix Φ by F columns (one for each temporal response function) and each row of the stimulus matrix by F rows (one for each feature variable). As we will demonstrate below in Section 3, this will enable
Output: Θ (R) where R is the index of the last outer iteration of the algorithm.
us to distinguish between different cortical regions in terms of their response latency across a hierarchy of features.

Extension to Multiple Trials with Different Stimuli
Next, we consider extension to K different trials corresponding to possibly different auditory stimuli.
Let the stimuli, MEG observation, and background activity covariance matrices for the k th trial be denoted by S k , Y k , and Γ k , respectively, for k = 1, · · · , K. We can extend the optimization problem of Eq. (13) as follows: In doing so, we have assumed that the NCRFs remain unchanged across trials, which promotes integration of complementary information from different trials (without direct averaging). Note that this assumption intentionally suppresses the trial-to-trial variability of the NCRFs by adaptively weighting the contribution of each trial according to its noise level (i.e., Γ k ), in favor of recovering NCRFs that can explain common cortical patterns of auditory processing. In contrast, if all the trials were to be concatenated or directly averaged to form a unified trial (with a single covariance matrix Γ), the trial-to-trial variability would not necessarily be suppressed, especially when there are few trials available. Furthermore, this formulation allows to incorporate trials with different lengths into the same framework.
The optimization problem of Eq. (16) can be solved via a slightly modified version of that presented in Section 2.3. The resulting algorithm is summarized in Algorithm 1, which we refer to as the Champ-Lasso algorithm. A python implementation of the Champ-Lasso algorithm is archived on the open source repository Github (Das, 2019) to ease reproducibility and facilitate usage by the broader systems neuroscience community.

Subjects, Stimuli, and Procedures
The data used in this work are a subset of recordings presented in , and is publicly available in the Digital Repository at the University of Maryland (Presacco et al., 2018). To ensure that the participants actively engage in the listening task, they were tasked to also silently count the number of specific words that they would hear in the story.

Recording and Preprocessing
The data were acquired using a whole head MEG system (KIT, Nonoichi, Ishikawa, Japan) consisting of 157 axial gradiometers, at the University of Maryland Neuroimaging Center, with online low-pass filtering (200 Hz) and notch filtering (60 Hz) at a sampling rate of 1 kHz. Data were pre-processed with MNEpython 0.18.1 (Gramfort et al., 2013(Gramfort et al., , 2014. After excluding flat and noisy channels, temporal signal space separation was applied to remove extraneous artifacts (Taulu and Simola, 2006). Data were then filtered between 1 Hz to 80 Hz using a zero-phase FIR filter. Independent component analysis (extended infomax, Bell and Sejnowski, 1995) was applied to remove ocular and cardiac artifacts. Finally, 60 s long data epochs corresponding to the stimuli were extracted and downsampled to 200 Hz.

Source Space Construction
At the beginning of each recording session, each participant's head shape was digitized with a 3-Space Fastrak system (Polhemus), including 3 fiducial points and 5 marker positions. Five marker coils attached to these five marker positions were used to localize the head position of the participant relative to the MEG sensors. The head position measurement was recorded twice: at the beginning and end of the recording session and the average measured head positions were used. Since MR scans of the participants were not performed, the 'fsaverage' brain model (Fischl, 2012) was co-registered (via uniform scaling, translation and rotation) to each participant's head, using the digitized head shapes.
A volumetric source space for the 'fsaverage' brain was defined on a regular grid with spacing of 7 mm between two neighboring points, and then morphed to individual participants. These morphed source spaces were then used to compute lead-field matrices by placing 3 orthogonal virtual current dipoles on each of the grid points. The computed lead-field matrices contained contribution from 3222 virtual current dipoles, after removing those within subcortical structures along the midline. No cortical patch statistics were available due to the lack of MR scans, so the current dipoles were allowed to have arbitrary orientations in 3D.

Stimulus Feature Variables
We included predictor variables reflecting three different hierarchical levels of speech processing, including acoustic, lexical, and semantic features. These feature variables are described in detail in (Brodbeck et al., 2018b): • Envelope: The speech envelope was found by averaging the auditory spectrogram representation generated using a model of the auditory periphery (Yang et al., 1992) across frequency bands. This continuous univariate feature variable reflects the momentary acoustic power of the speech signal.
• Word Frequency: First, logarithmic word frequency measures, log 10 wf, were extracted from the SUB-TLEX database (Brysbaert and New, 2009) for each word. Then, a piecewise-continuous feature variable was constructed by representing each word in the speech segment by a rectangular pulse with height given by 6.33 − log 10 wf. Note that in this coding scheme, infrequent words are assigned higher values, while common words get lower values. Windows of silence were assigned 0.
• Semantic Composition: Lastly, to probe semantic processing, the semantic composition patterns identified by Westerlund et al. (2015), including adjective-noun, adverb-verb, adverb-adjective, verb-noun, preposition-noun, and determiner-noun pairs, were used. To generate the feature variable, the second word in each pair was represented by a rectangular pulse of height 1, with 0 elsewhere. This binary-valued feature variable identifies the semantic binding of word pairs within the speech stream.
All three variables were constructed from the speech segments at the same sampling frequency as the preprocessed MEG data (i.e. 200 Hz). All feature variables were centered and scaled by their mean absolute value, to facilitate comparison of NCRF components pertaining to different feature variables.
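The pulse-coding and normalization described above can be sketched as follows; the function names, the word-list format, and the tuple layout are hypothetical, introduced only for illustration:

```python
import numpy as np

FS = 200  # sampling rate of the preprocessed MEG data (Hz)

def word_frequency_feature(words, n_samples, fs=FS):
    """Pulse-code words as rectangles of height 6.33 - log10(word frequency).

    `words` is a hypothetical list of (onset_sec, offset_sec, log10_wf)
    tuples; silent stretches remain 0.
    """
    x = np.zeros(n_samples)
    for onset, offset, log10_wf in words:
        i0, i1 = int(round(onset * fs)), int(round(offset * fs))
        x[i0:i1] = 6.33 - log10_wf  # infrequent words receive larger values
    return x

def center_and_scale(x):
    """Center the feature, then scale by its mean absolute value."""
    x = x - x.mean()
    return x / np.abs(x).mean()
```

The same centering and scaling would be applied to each of the three feature variables so that NCRF components pertaining to different features remain comparable.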

Simulation Studies
Before applying the Champ-Lasso algorithm to localize NCRFs from experimentally recorded data, we assessed its performance using realistic simulation studies with known ground truth. In accordance with our experimental settings, we synthesized six 1 min long MEG data segments according to the forward model of Eqs. (3) and (4), mimicking neural processing of the speech envelope. To this end, we simulated temporal response functions of length 500 ms (with significant M50 and M100 components) associated with dipole current sources within the auditory and motor cortices (Fig. 3, middle panel). The cortical activity was simulated using patches defined over a finely-discretized source space (namely, ico-5, with average dipole spacing of 3.1 mm), with the dipole directions constrained to be normal to the 'fsaverage' surface patches. To make the simulation as realistic as possible, we used real MEG recordings corresponding to a different speech stream as stimulus-independent background activity.
In order to avoid any favorable bias in the inverse solution, we used a different source space for NCRF estimation, i.e., the aforementioned volumetric source space with unconstrained dipole orientations (Subsection 2.8), than the one used for simulating the data, i.e., ico-5. As a comparison benchmark, we also applied the two-stage method of Brodbeck et al. (2018b) to first localize the cortical sources via MNE, followed by independently estimating TRFs for all sources.
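A minimal sketch of this kind of synthesis is given below, with a random stand-in for the lead-field matrix, a toy two-peak response function in place of the simulated M50/M100 components, and white noise in place of the real background recordings; all dimensions and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 200
n_sensors, n_sources = 32, 60
n_lags = int(0.5 * fs)  # 500 ms response function

# Stand-in lead field; a real one would come from a BEM forward model
L = rng.standard_normal((n_sensors, n_sources))

# Toy response function on one source: an early positive peak near 50 ms
# and a later negative peak near 100 ms, mimicking M50/M100 components
lags = np.arange(n_lags) / fs
trf = np.zeros((n_sources, n_lags))
trf[0] = (np.exp(-((lags - 0.05) / 0.01) ** 2)
          - np.exp(-((lags - 0.10) / 0.015) ** 2))

# Source currents: convolve the stimulus with each response function
T = 60 * fs  # one 1 min segment
envelope = rng.standard_normal(T)  # stand-in for the real speech envelope
x = np.array([np.convolve(envelope, h)[:T] for h in trf])

# Sensor data = forward-projected currents + stimulus-independent background
y = L @ x + rng.standard_normal((n_sensors, T))
```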
The TRFs localized by the two-stage method and the NCRFs estimated by the Champ-Lasso algorithm are shown in Fig. 3. The Champ-Lasso algorithm also successfully suppresses spurious peaks in the anterior temporal and inferior frontal lobes, demonstrating its robustness to over-fitting.

Application to Experimentally Recorded MEG Data
As mentioned in Section 2.9, we consider the three feature variables of acoustic envelope, word frequency, and semantic composition and aim at estimating 1000 ms-long NCRFs. The regularization parameter is tuned on the basis of generalization error via a 3-fold cross-validation procedure. To check whether inclusion of each feature variable improves the overall NCRF model significantly, the original model fit (i.e., the cost function evaluated at the estimated NCRF parameters) is compared against the average of three other model fits constructed by deliberately misaligning one feature variable via 4-fold cyclic permutations, using one-tailed t-tests.
To evaluate the group-level significance of the estimated NCRF components, the NCRF estimates are first smoothed with a Gaussian kernel (with standard deviation of 10 mm) over the source locations to compensate for possible head misalignments and anatomical differences across subjects. Then, at each dipole location and time index, the magnitudes of the vector-valued NCRFs are tested for significance using a permutation test via the threshold-free cluster-enhancement (TFCE) algorithm (Smith and Nichols, 2009) (see Appendix B).

Analysis of the Acoustic Envelope NCRFs
The NCRFs corresponding to the acoustic envelope in Fig. 4-A exhibit two prominent temporal peaks: an early peak at around 30−35 ms, bilaterally centered over the auditory cortex (AC), and a later peak at around 100 ms, dorsal to the first peak and stronger in the right hemisphere. The latter is evident from comparing the left temporal profiles lA1env and lA2env with their right hemisphere counterparts rA1env and rA2env. Note that the orientations of the NCRFs at the second peak (blue arrow, bottom panel of Fig. 4-A) are nearly opposite to those at the first peak (red arrow, bottom panel of Fig. 4-A), which accounts for the negative polarity of the M100 peak relative to the M50 peak in standard TRF analysis. Furthermore, after the appearance of the first peak (∼35 ms; auditory traces lA1env, lA2env, rA1env, and rA2env) at the AC, the activity appears to gradually shift towards the primary motor cortex (PMC) in both hemispheres (∼50 ms; motor traces lMenv and rMenv). Additionally, the NCRFs show small bilateral late auditory components at around 250−350 ms.

Analysis of the Word Frequency NCRFs
The significant NCRF components corresponding to the word frequency feature variable manifest predominantly in the left hemisphere (Fig. 5-A). The earliest peak in the left AC occurs at around 50 ms, followed by a much stronger peak at around 150 ms, slightly posterior to the former (see lAwf in Fig. 5-B). The earlier peak also has contributions from the inferior temporal gyrus, as indicated by lITwf. In addition, the left frontal cortex exhibits weak activity at around 150 ms (see lFwf in Fig. 5-B). A weak but localized peak centered over the left superior temporal sulcus (STS) is visible at around 240 ms. The only significant component in the right hemisphere occurs around the same time. Finally, the late NCRF components (at around 500−600 ms) mostly originate from the left AC and STS, with weak contributions from the right frontal cortex.

Analysis of the Semantic Composition NCRFs
The estimated NCRFs corresponding to the semantic composition feature variable are shown in Fig. 6-A, along with 5 representative NCRFs in Fig. 6-B. These include two left auditory (lA1sc and lA2sc), two right frontal (rF1sc and rF2sc), and one right middle temporal (rMTsc) NCRF. The main NCRF components in the left AC peak at around 155 ms and 475 ms, with the earlier peak being ventral to the later one (see lA1sc and lA2sc in Fig. 6-B). The significant right hemispheric NCRFs are temporally concentrated between 155 and 210 ms, and appear superior to those in the left hemisphere, involving the inferior frontal gyrus (IFG).
Strikingly, these NCRFs in the right hemisphere seem to move in the anterosuperior direction until around 185 ms, at which point the right hemisphere exhibits strong frontal activity (Fig. 6-A). The NCRFs then return to their initial location at around 210 ms. This sequence of spatiotemporal changes is also evident in the sequence of temporal peaks in Fig. 6-B, given by rMTsc → rF2sc → rF1sc → rF2sc.

Discussion and Concluding Remarks
Characterizing the dynamics of cortical activity from noninvasive neuroimaging data allows us to probe the underlying mechanisms of sensory processing at high spatiotemporal resolutions. In this work, we demonstrated a framework for direct estimation of such cortical dynamics in response to various features of continuous auditory stimuli from the MEG response. To this end, we developed a fast inverse solution under a Bayesian estimation setting, the Champ-Lasso algorithm, for inferring the Neuro-Current Response Functions (as spatiotemporal models of cortical processing) in a robust and scalable fashion.
One of the key features of the Champ-Lasso algorithm is its ability to simultaneously estimate the cortical source covariances in a data-driven fashion (as opposed to relying on data-agnostic depth-weighting procedures) and the NCRF model parameters. The interplay between the two, together with the incorporation of the structural properties of the NCRFs into the model (taking advantage of the Bayesian nature of the estimation framework), ultimately leads to spatially focal NCRFs with smooth temporal profiles. In other words, the NCRF and source covariance estimation procedures work in tandem to best explain the observed MEG data, while minimizing spatial leakage and capturing the smoothness of the temporal responses. In contrast, previously existing methodologies produce estimates that are spatially broad and thus require post-hoc clustering procedures to meaningfully summarize the underlying spatiotemporal cortical dynamics. These serialized procedures in turn introduce biases to the estimates and hinder meaningful statistical interpretation of the results.
To demonstrate the utility of our proposed framework, we estimated NCRFs from MEG for several feature variables of speech, reflecting different levels of cognitive processing and comprehension.
The data analyzed in this paper were previously studied in (Brodbeck et al., 2018b) using a two-stage procedure to probe the cortical processing of speech: the MEG data were first cortically localized using an MNE inverse solver, followed by estimating individual temporal response functions for each source. In order to summarize the resulting estimates in a meaningful fashion, yet another processing step was necessary to disentangle the spatially dispersed and highly overlapping cortical sources. Our results corroborate those obtained in (Brodbeck et al., 2018b), while obviating the need for any such post-processing, by providing a one-step estimation procedure with the substantial benefit of greatly improved spatial resolution. In addition, the three-dimensional nature of the NCRFs in our framework allows the segregation of spatial activation patterns that overlap in time. For example, the bilateral activity components in the primary motor cortex in response to the acoustic envelope are clearly distinguishable from the early activation in the auditory cortex, without the need for any post-hoc processing.
Our results also support other neuroimaging evidence for the hierarchical model of speech processing, involving not only the temporal lobe, but also the motor and frontal cortices (Scott and Johnsrude, 2003; Hickok and Poeppel, 2004; Davis and Johnsrude, 2007; Okada et al., 2010; Peelle et al., 2010; de Heer et al., 2017). To probe the functional organization of this hierarchy, we estimated NCRFs corresponding to features extracted from speech at the acoustic, lexical, and semantic levels, and found distinct patterns of cortical processing at high spatiotemporal resolutions. Our results indeed imply that while the acoustic and lexical features are processed primarily within the temporal and motor cortical regions (Fadiga et al., 2002; Wilson et al., 2004; Pulvermüller et al., 2006; Crinion et al., 2003; Dewitt and Rauschecker, 2012; Mesgarani et al., 2014; Hullett et al., 2016), phrase-level processing, assessed here using the semantic composition variable, is carried out through the involvement of the frontal cortex (Kaas and Hackett, 1999; Hickok and Poeppel, 2007; Rauschecker and Scott, 2009).
Another advantage of our proposed methodology is that it mitigates the dependence of the solution on the precise geometry of the underlying cortical source models. In conventional neuromagnetic source imaging, individual structural MR images are utilized to construct source space models, particularly for retrieving the cortical surface segmentation. The direction normal to the so-called cortical patches in these models is key in determining the lead-field matrix; such models are often referred to as orientation-constrained source models. However, in many available neuroimaging datasets (including the one analyzed in this work), MR images are not available, and analyses must instead rely on an average head model rather than one informed by the subject-specific cortical geometry. In order to mitigate the need for such information, we utilized a free-orientation volumetric source space in our estimation framework. While this makes the underlying optimization problem more involved and computationally intensive, it adds more than a compensatory amount of flexibility to the underlying models and allows them to recover missing information regarding the cortical source space geometry. To this end, we used rotationally invariant sparsity-inducing priors to regularize the spatiotemporal distribution of the NCRFs. Together with the aforementioned data-driven source covariance adaptation, this regularization scheme results in consistent source orientation estimates and provides a degree of immunity to the unwanted side-effects of error-prone coordinate-frame rotations. To confirm these theoretical expectations, we validated this feature of our framework using simulation studies with known ground truth.
In light of the above, posing NCRF estimation over an orientation-free volumetric source space can also be thought of as unifying the virtues of distributed source imaging and single dipole fitting: we aim at estimating both the orientations and magnitudes of spatially sparse dipole currents within the head volume that can best linearly predict the MEG responses to continuous stimuli.
A Python implementation of the Champ-Lasso algorithm is archived on the open-source repository GitHub (Das, 2019) to ease reproducibility and facilitate usage by the broader systems neuroscience community. The current implementation of our algorithm uses the aforementioned regularization scheme to recover temporally smooth and spatially sparse NCRFs. Due to the plug-and-play nature of the proposed Bayesian estimation framework, one can easily utilize other relevant regularization schemes, e.g., to promote spatial smoothness or to incorporate spectro-temporal prior information, by simply modifying the penalty term.

Appendix A. Details of the Regularization Scheme
In this appendix, we provide more details on the regularization scheme used for NCRF estimation. Recall that, given a particular choice of Γ, estimating the NCRF matrix Θ amounts to the following maximum likelihood problem:

min_Θ tr[ (Y − LΘS^⊤)^⊤ (Σ_w + LΓL^⊤)^{−1} (Y − LΘS^⊤) ].

With this choice, the gradient of the objective with respect to Θ is

−2 L^⊤ (Σ_w + LΓL^⊤)^{−1} (Y − LΘS^⊤) S,

and thus one can attempt to solve the maximum likelihood problem using gradient descent techniques. The following observations on the gradient, however, show that the problem is ill-conditioned:
1. The left multiplier of Θ, i.e., L^⊤(Σ_w + LΓL^⊤)^{−1}L, is singular.
2. The right multiplier of Θ, i.e., S^⊤S, the empirical stimulus correlation matrix, is likely to be rank-deficient for naturalistic stimuli (Crosse et al., 2016).
Therefore, a direct attempt at solving the problem via gradient descent results in estimates of Θ with high variability. In estimation theory, such ill-conditioning is handled by introducing a bias to the estimator, encoding a priori information about the problem, in order to reduce the estimation variance. In addition, the NCRF model typically has many more free parameters than there are observed data points, and without introducing prior information, the estimation problem is prone to over-fitting.
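Reading the fixed-Γ likelihood as a weighted least-squares problem, both sources of ill-conditioning can be checked numerically. The sketch below is illustrative: the dimensions and variable names are made up, and the objective is written in the generic weighted form consistent with the two multipliers noted above:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sensors, n_sources, n_feat, T = 8, 20, 30, 25  # more parameters than data

L = rng.standard_normal((n_sensors, n_sources))
S = rng.standard_normal((T, n_feat))   # lagged-stimulus design matrix
Y = rng.standard_normal((n_sensors, T))
Sigma_w = np.eye(n_sensors)            # noise covariance
Gamma = np.eye(n_sources)              # fixed source covariance

P = np.linalg.inv(Sigma_w + L @ Gamma @ L.T)  # model-covariance whitener

def objective(Theta):
    R = Y - L @ Theta @ S.T
    return np.trace(R.T @ P @ R)

def gradient(Theta):
    return -2 * L.T @ P @ (Y - L @ Theta @ S.T) @ S

# Both "multipliers" of Theta in the gradient are rank-deficient here:
rank_left = np.linalg.matrix_rank(L.T @ P @ L)  # at most n_sensors < n_sources
rank_right = np.linalg.matrix_rank(S.T @ S)     # at most T < n_feat
```

With fewer sensors than sources and fewer time samples than features, neither multiplier is invertible, so gradient descent alone cannot pin down Θ.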
The prior information is often incorporated in the form of regularization. A commonly used regularization scheme in this context is Tikhonov regularization and its variants for promoting smoothness (Lalor et al., 2006). Other estimation schemes, such as boosting and ℓ₁-regularization, promote sparse solutions (David et al., 2007; Akram et al., 2017). Here, we introduce a structured regularization by penalizing a specific mixed norm of the NCRF matrix to recover spatio-temporally sparse solutions over the Gabor coefficients:

P_{2,1,1}(Θ) = Σ_{m=1}^{M} Σ_{l=1}^{L} ‖θ_{m,l}‖₂.

In words, for each current dipole location, we penalize the vector-valued response function by the sum of the magnitudes of its corresponding Gabor coefficients.
Note that ℓ₁-regularization in this case, i.e.,

Σ_{m=1}^{M} Σ_{l=1}^{L} ‖θ_{m,l}‖₁,

is not compatible with the expected cortical distribution of the NCRFs. Since the ℓ₁-norm is separable with respect to the three coordinates of θ_{m,l}, it tends to select a sparse subset of the 3D coordinates, rendering the recovered NCRF components parallel to the coordinate axes. In contrast, the proposed penalty selects each NCRF component as a single entity by penalizing the vector magnitudes at each lag. Indeed, if the current dipoles are constrained to be normal to the cortical patches in the NCRF formulation, the proposed penalty coincides with ℓ₁-regularization.
Another advantage of this mixed-norm penalty is its rotational invariance when working with 3D vector-valued response functions. Suppose the coordinate system is rotated by an orthogonal matrix U ∈ ℝ^{3×3}. Then

Σ_{m=1}^{M} Σ_{l=1}^{L} ‖Uθ_{m,l}‖₂ = Σ_{m=1}^{M} Σ_{l=1}^{L} ‖θ_{m,l}‖₂ = P_{2,1,1}(Θ),

which implies the aforementioned rotational invariance. As a result, the solutions do not depend on any particular choice of coordinate system. Also, since the penalty does not prefer specific source orientations, it makes the solution more resilient to co-registration error than other approaches that either do not consider the vector-valued nature of the current dipoles or constrain the solutions to be normal to the cortical surface.
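The contrast between the two penalties, and the rotational invariance of the mixed norm, can be verified numerically; array shapes and function names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def p_211(theta):
    """Mixed norm P_{2,1,1}: sum of Euclidean norms of the 3D components.

    `theta` has shape (n_sources, n_lags, 3); the last axis holds the
    dipole coordinates.
    """
    return np.linalg.norm(theta, axis=-1).sum()

def l1(theta):
    """Fully separable l1 penalty, for comparison."""
    return np.abs(theta).sum()

# A 45-degree rotation about the z-axis (orthogonal, so a valid U)
c = np.sqrt(0.5)
U = np.array([[c, -c, 0.0], [c, c, 0.0], [0.0, 0.0, 1.0]])

theta = rng.standard_normal((5, 4, 3))
rotated = theta @ U.T  # apply U to every 3D component

# An axis-aligned dipole, on which the l1 penalty is smallest;
# rotating it by 45 degrees inflates its l1 norm but not its l2 norm
axis_aligned = np.zeros((1, 1, 3))
axis_aligned[0, 0, 0] = 1.0
```

The mixed norm is unchanged by the rotation, while the ℓ₁ penalty of the axis-aligned dipole grows from 1 to √2, illustrating why ℓ₁ biases the recovered components towards the coordinate axes.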

Appendix B. Statistical Testing Procedures
To assess the statistical significance of the estimated NCRF components at the group level and across the source space, they need to be compared against suitable null hypotheses. The fact that the NCRF components are 3D vectors requires technical care in choosing the null hypotheses. Here, we provide two possible null hypotheses and testing methodologies: the Length Test, which only considers the length (magnitude) of the NCRFs, and the Vector Test, which takes into account both the magnitude and direction of the NCRFs. The source code implementing these tests can be found at (Brodbeck, 2017).

The Length Test
This test aims to assess the statistical significance of the NCRF components by comparing their magnitudes against a baseline 'null' NCRF model at the group level. To control for false positives arising from over-fitting, instead of using an all-zero null model of the NCRFs, we learn the null model from the dataset itself. The time-series of each feature variable is split into four equal segments, and these segments are permuted cyclically to yield three 'misaligned' feature time-series. Then, for each feature variable, three 'misaligned' models are fit by swapping its original time-series with each of the 'misaligned' ones, while keeping the other two feature variables intact. The average of the NCRF magnitudes estimated from these three 'misaligned' time-series is taken as the null model for that feature variable. The NCRF magnitude pairs from the original data and the null model are then tested for significance using mass-univariate related-measures (paired) t-tests.
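The cyclic misalignment scheme can be sketched as follows; the function name and fold count are illustrative:

```python
import numpy as np

def misaligned_series(x, n_folds=4):
    """Return the n_folds - 1 cyclically permuted versions of a feature series.

    The series is split into n_folds equal segments, which are then rotated
    by 1, ..., n_folds - 1 positions, destroying the stimulus alignment
    while preserving the local statistics of the feature.
    """
    segments = np.array_split(x, n_folds)
    permuted = []
    for shift in range(1, n_folds):
        rolled = segments[shift:] + segments[:shift]
        permuted.append(np.concatenate(rolled))
    return permuted
```

Each permuted series contains exactly the same values as the original, so any apparent response it elicits reflects over-fitting rather than genuine stimulus-locked activity.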
To control for multiple comparisons, nonparametric permutation tests (Nichols and Holmes, 2002; Maris and Oostenveld, 2007) based on the threshold-free cluster-enhancement (TFCE) algorithm (Smith and Nichols, 2009) are used. First, at each dipole location and time point, the t-statistic is computed from the difference between the NCRF magnitude pairs. The resulting statistic map is then processed by the TFCE algorithm, which boosts contiguous regions with high test statistics relative to isolated ones, based on the assumption that the spatial extent of true sources is typically broader than that of regions generated by noise.
To find the distribution of these TFCE values under the null hypothesis, TFCE values are calculated following the same procedure, on 10000 different random permutations of the data. In each permutation, the sign of the NCRF magnitude differences is flipped for a randomly selected set of subjects, without resampling the same set of subjects. Then, at every permutation, the maximum value of the obtained TFCE values is recorded, thereby constructing a non-parametric distribution of the maximum TFCE values under the null hypothesis. The original TFCE values that exceed the (1 − α) percentile of the null distribution are considered significant at a level of α corrected for multiple comparisons across the sources.
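The sign-flip permutation scheme with a max-statistic correction can be sketched as follows. For brevity, this sketch thresholds the raw t-map where the text applies the TFCE transform first; names and array sizes are illustrative:

```python
import numpy as np

def max_stat_threshold(diffs, n_perm=1000, alpha=0.05, seed=0):
    """Family-wise threshold from sign-flip permutations of subject differences.

    `diffs`: (n_subjects, n_tests) array of NCRF-magnitude differences
    (original minus null model), one row per subject.
    """
    rng = np.random.default_rng(seed)
    n_sub = diffs.shape[0]

    def t_map(d):
        return d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(n_sub))

    null_max = np.empty(n_perm)
    for i in range(n_perm):
        flips = rng.choice([-1.0, 1.0], size=n_sub)[:, None]  # per-subject sign
        null_max[i] = t_map(diffs * flips).max()
    threshold = np.quantile(null_max, 1.0 - alpha)  # (1 - alpha) percentile
    return t_map(diffs), threshold
```

Recording only the maximum statistic per permutation is what yields the family-wise correction: a single threshold controls false positives across all dipole locations and time points simultaneously.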

The Vector Test
This test quantifies the significance of the estimated 3D NCRF components at the group level, based on the one-sample Hotelling's T² test, in which the mean of the sample vectors is tested against the null hypothesis of zero mean, i.e., µ₀ = 0. To control for multiple comparisons, a nonparametric permutation strategy similar to that of the Length Test is used. At every time lag, the T² statistic for each dipole is computed as

T² = n x̄^⊤ Σ̂^{−1} x̄,

where x̄ and Σ̂ are the sample mean and covariance matrix of the vector-valued NCRF components across the n subjects, respectively. The T² statistic quantifies the variability of vector-valued samples, akin to the role of the t-statistic for 1D samples (Mardia, 1975). The resulting T²-maps are then processed by the TFCE algorithm. As before, to construct a non-parametric distribution of maximum TFCE values under the null hypothesis, the maximum values of the TFCE-processed T² maps over 10000 different random permutations of the data are recorded. In each permutation, the vector-valued NCRF components of each subject undergo uniform random rotations in 3D (Miles, 1965). The original TFCE values that exceed the (1 − α) percentile of the null distribution are considered significant at a level of α, corrected for multiple comparisons across the sources.
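A minimal sketch of the statistic, assuming samples are stacked one 3D NCRF component per subject (array shapes illustrative):

```python
import numpy as np

def hotelling_t2(x):
    """One-sample Hotelling's T^2 statistic against mu_0 = 0.

    `x` has shape (n_subjects, 3): one vector-valued NCRF component
    per subject, at a fixed dipole location and time lag.
    """
    n = x.shape[0]
    xbar = x.mean(axis=0)              # sample mean vector
    cov = np.cov(x, rowvar=False)      # unbiased 3x3 sample covariance
    return n * xbar @ np.linalg.solve(cov, xbar)
```

Note that, like the mixed-norm penalty, this statistic is invariant under a common rotation of all samples, so it inherits the coordinate-system independence desired for vector-valued NCRFs.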
Traditionally, response functions are estimated as scalar functions of the data, either over the sensor space or over the source space via orientation-constrained inverse solvers. Considering the directional variability of the NCRF estimates at the group level, however, takes into account the group-level anatomical variability that may affect the current dipole orientations. In addition, the Vector Test is less computationally demanding than the Length Test, because it does not require refitting NCRFs for permuted models. In the Results section of the manuscript, we presented the NCRFs masked at a significance level of 5%, based on the Length Test. To demonstrate the difference between the two tests, here we also present the same results using the Vector Test (Fig. B.7).