Adding Extra Knowledge in Scalable Learning of Sparse Differential Gaussian Graphical Models

Arshdeep Sekhon, Beilun Wang, Yanjun Qi
doi: https://doi.org/10.1101/716852
Arshdeep Sekhon: Department of Computer Science, University of Virginia
Beilun Wang: Department of Computer Science, University of Virginia
Yanjun Qi: Department of Computer Science, University of Virginia
For correspondence: yanjun@virginia.edu

Abstract

We focus on integrating different types of extra knowledge (other than the observed samples) for estimating the sparse structure change between two p-dimensional Gaussian Graphical Models (i.e. differential GGMs). Previous differential GGM estimators either fail to include additional knowledge or cannot scale up to a high-dimensional (large p) situation. This paper proposes a novel method KDiffNet that incorporates Additional Knowledge in identifying Differential Networks via an Elementary Estimator. We design a novel hybrid norm as a superposition of two structured norms guided by the extra edge information and the additional node group knowledge. KDiffNet is solved through a fast parallel proximal algorithm, enabling it to work in large-scale settings. KDiffNet can incorporate various combinations of existing knowledge without re-designing the optimization. Through rigorous statistical analysis we show that, while considering more evidence, KDiffNet achieves the same convergence rate as the state-of-the-art. Empirically, on multiple synthetic datasets and one real-world fMRI brain dataset, KDiffNet significantly outperforms the cutting-edge baselines with regard to prediction performance, while achieving the same level of time cost or less.

1 Introduction

Learning the change of dependencies between random variables is an essential task in many real-world applications. For example, when analyzing functional MRI samples from different groups of human subjects, detecting the difference in brain connectivity networks can shed light on studying and designing treatments for psychiatric diseases [6]. In this paper, we consider Gaussian graphical models (GGMs) and focus on directly estimating changes in the dependency structures of two p-dimensional GGMs, based on nc and nd samples drawn from the two models (we call the task differential GGMs). In particular, we focus on estimating the structure change under a high-dimensional situation, where the number of variables p may exceed the number of observations: p > max(nc, nd). To conduct consistent estimation under high-dimensional settings, we leverage the sparsity constraint. In the context of estimating structural changes between two GGMs, this translates into a differential network with few edges. We review the state-of-the-art estimators for differential GGMs in Section 2.1.

One significant caveat of previous differential GGM estimators is that little attention has been paid to incorporating extra knowledge of the nodes or of the edges. In addition to the observed samples, extra information is widely available in real-world applications. For instance, when estimating the functional connectivity networks among brain regions via fMRI measurements (i.e. observed samples), there exists considerable knowledge about the spatial and anatomical evidence of these regions. Adding such evidence will help the learned differential structure better reflect domain experts’ knowledge, for example that certain anatomical regions or spatially related regions are more likely to be connected [20].

Although it is strong evidence of the structural patterns that we aim to discover, extra information has rarely been considered when estimating differential GGMs from two sets of observed samples. To the authors’ best knowledge, only two loosely-related studies exist in the literature: (1) One study with the name NAK [2] (following ideas from [14]) proposed to integrate Additional Knowledge into the estimation of a single-task graphical model using a weighted Neighbourhood selection formulation. (2) Another study with the name JEEK [18] (following [15]) considered edge-level evidence via a weighted objective formulation to estimate multiple dependency graphs from heterogeneous samples. Both studies only added edge-level extra knowledge in structural learning and neither of the approaches was designed for the direct structure estimation of differential GGMs.

This paper fills the gap by proposing a novel method, namely KDiffNet, to add additional Knowledge in identifying DIFFerential networks via an Elementary Estimator. Our main objective is to make KDiffNet flexible enough to model various combinations of existing knowledge without re-designing the optimization. This is achieved by: (1) representing the edge-level domain knowledge as weights and applying these weights through a weighted 𝓁1 regularization constraint; and (2) describing the node-level knowledge as variable groups and enforcing it through a group norm constraint. KDiffNet then designs a novel hybrid norm as the minimization objective and enforces the superposition of the two aforementioned structured constraints. Our second main aim for KDiffNet is to achieve direct, scalable, and fast differential GGM estimation, and at the same time to guarantee that the estimation error is well bounded. We achieve this goal by modeling KDiffNet in an elementary estimator based framework and solving it via parallel proximal based optimization. Briefly speaking, this paper makes the following contributions:

  • Novel and Flexible: KDiffNet is the first method to integrate different kinds of additional knowledge for structure learning of differential GGMs. KDiffNet proposes a flexible formulation to consider both the edge-level evidence and the node-group level knowledge (Section 2.3).

  • Fast and Scalable: We optimize KDiffNet through a proximal algorithm, making it scalable to large values of p. KDiffNet's unified formulation avoids the need to design knowledge-specific optimization (Section 2.5).

  • Theoretically Sound: We theoretically prove the convergence rate of KDiffNet as Embedded Image, achieving the same error bound as the state-of-the-art (Section 2.6).

  • Empirical Evaluation: We evaluate KDiffNet using multiple synthetic datasets and one real-world task. Our experiments showcase how KDiffNet can integrate knowledge of spatial distances, known edges or anatomical grouping evidence in the proposed formulation, empirically showing its real-world adaptivity. KDiffNet improves on the state-of-the-art baselines with consistently better prediction accuracy while maintaining the same or less time cost (Section 3).

2 Proposed Method: KDiffNet

2.1 Previous Estimators for Structure Change between two GGMs (Differential GGMs)

The task of estimating differential GGMs assumes we are given two sets of observed samples (in the form of two matrices) $X_c \in \mathbb{R}^{n_c \times p}$ and $X_d \in \mathbb{R}^{n_d \times p}$, identically and independently drawn from two normal distributions Np(µc, Σc) and Np(µd, Σd) respectively. Here µc, µd ∈ ℝp describe the mean vectors and Σc, Σd ∈ ℝp×p represent covariance matrices. The goal of differential GGMs is to estimate the structural change Δ defined by [27]:

$$\Delta = \Omega_d - \Omega_c \tag{2.1}$$

Here the precision matrices Ωc := (Σc)−1 and Ωd := (Σd)−1. The conditional dependency structure of a GGM is encoded by the sparsity pattern of its precision matrix. Therefore, one entry of Δ describes if the magnitude of conditional dependency of a pair of random variables changes between two conditions. A sparse Δ means few of its entries are non-zero, indicating a differential network with few edges.

A naive approach to estimate Δ is a two-step procedure in which we estimate $\hat{\Omega}_d$ and $\hat{\Omega}_c$ from the two sets of samples separately and calculate $\hat{\Delta}$ using Eq. (2.1). However, in a high-dimensional setting, this strategy needs to assume that both Ωd and Ωc are sparse (to achieve consistent estimation of each), although the assumption is not necessarily true even if the change Δ is sparse (details in Section S:1).
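The gap between "sparse change" and "sparse precision matrices" can be made concrete with a toy sketch. It assumes the sign convention Δ = Ωd − Ωc and uses hypothetical matrices that are not taken from the paper: both precision matrices are fully dense, yet their difference contains a single changed edge.

```python
import numpy as np

p = 5
# Hypothetical dense precision matrix for condition c: every entry is non-zero.
omega_c = np.full((p, p), 0.2) + np.eye(p)

# Condition d differs from c on a single symmetric edge between nodes 0 and 1.
omega_d = omega_c.copy()
omega_d[0, 1] += 0.3
omega_d[1, 0] += 0.3

# Assumed convention for the structural change: Delta = Omega_d - Omega_c.
delta = omega_d - omega_c

dense_entries = np.count_nonzero(omega_c)   # all p * p entries are non-zero
changed_entries = np.count_nonzero(delta)   # only the two mirrored entries
```

A two-step estimator would have to recover all 25 non-zero entries of each precision matrix, whereas a direct estimator only targets the two non-zero entries of Δ.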

Multiple recent studies have been motivated to directly estimate Δ from two sets of samples. We call these studies differential GGM estimators and group them into four kinds. (1) Likelihood based. Zhang et al. [25] used the fused norm for regularizing the maximum likelihood estimation (MLE) to simultaneously learn both GGMs and their difference (λ2(‖Ωc‖1 + ‖Ωd‖1) + λn‖Δ‖1). The resulting penalized MLE framework is a log-determinant program, which can be solved by block coordinate descent [25] or by the alternating direction method of multipliers (ADMM) in the JGLFUSED package [5]. (2) Density ratio based. Liu et al. used density ratio estimation (SDRE) to directly learn Δ without having to identify the structures of Ωc and Ωd. The authors focused on exponential family-based pairwise Markov networks [10] and solved the resulting optimization using proximal gradient descent [9]. (3) Constrained 𝓁1 minimization based. Diff-CLIME, another regularized convex program, was proposed to directly learn the structural change Δ without going through the learning of each individual GGM [26]. It uses an 𝓁1 minimization formulation constrained by the covariance-precision matching, reducing the estimation problem to solving linear programs. All three aforementioned groups use 𝓁1 regularized convex formulations for estimating Δ. (4) Elementary estimator based. The last category extends the so-called Elementary Estimator proposed by [21, 23, 22] to achieve a closed-form estimation of differential GGMs via the DIFFEE estimator [19] (more in the next section and Section 2.4).

2.2 Background: Elementary Estimators for Graphical Models

𝓁1 Regularized MLE for GGM Estimation: Graphical Lasso (GLasso)

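The “GLasso” Estimator [24, 1] is the classic formulation for estimating a sparse GGM from observations drawn from a single multivariate Gaussian distribution. It optimizes the following 𝓁1 penalized MLE objective:

$$\operatorname*{argmin}_{\Omega \succ 0} \; -\log\det(\Omega) + \langle \Omega, \hat{\Sigma} \rangle + \lambda_n \|\Omega\|_1 \tag{2.2}$$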

Here λn > 0 is the sparsity regularization parameter. While state-of-the-art optimization methods have been developed to solve Eq. (2.2), they are expensive for large-scale tasks.

𝓁1 based Elementary Estimator for Graphical Model (EE-GM) Estimation

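Yang et al. [23] proposed to learn a sparse Gaussian graphical model via the following formulation instead:

$$\operatorname*{argmin}_{\Omega} \; \|\Omega\|_1 \quad \text{subject to: } \|\Omega - [T_v(\hat{\Sigma})]^{-1}\|_\infty \le \lambda_n \tag{2.3}$$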

In fact, [23] proposed the following generic formulation to estimate graphical models (GM) of exponential families (a GGM is a special case of a GM with an exponential-family distribution):

$$\operatorname*{argmin}_{\theta} \; \|\theta\|_1 \quad \text{subject to: } \|\theta - \mathcal{B}^*(\hat{\phi})\|_\infty \le \lambda_n \tag{2.4}$$

Here θ is the canonical parameter to be estimated and $\mathcal{B}^*(\hat{\phi})$ is a so-called proxy of backward mapping for the target GM. $\hat{\phi}$ is the empirical mean of the sufficient statistics of the underlying exponential distribution. For example, in the case of Gaussian, θ is the precision matrix, $\hat{\phi}$ is the sample covariance matrix $\hat{\Sigma}$ and the proxy backward mapping is $[T_v(\hat{\Sigma})]^{-1}$ (we explain backward mapping, proxy backward mapping and the property and convergence rate of $[T_v(\hat{\Sigma})]^{-1}$ in Section S:4.1).

The main advantage of Eq. (2.4) and Eq. (2.3) is that they are simple estimators with computationally easy solutions. Importantly, their solutions achieve the same sharp convergence rate as the regularized convex formulation of Eq. (2.2) under high-dimensional settings.
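To illustrate why such estimators are computationally easy, here is a minimal sketch of an elementary estimator for a single sparse precision matrix. It assumes that T_v is entry-wise soft-thresholding of the sample covariance and that the 𝓁1 minimization under an element-wise 𝓁∞ constraint is solved by soft-thresholding the proxy backward mapping at level λn. The function names, data and parameter values are hypothetical.

```python
import numpy as np

def soft_threshold(a, t):
    """Entry-wise soft-thresholding: shrink each entry toward zero by t."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def ee_ggm(samples, v, lam):
    """Sketch of an elementary estimator for a sparse precision matrix."""
    sigma_hat = np.cov(samples, rowvar=False)            # sample covariance
    proxy = np.linalg.inv(soft_threshold(sigma_hat, v))  # proxy backward mapping
    omega_hat = soft_threshold(proxy, lam)               # closed-form solution
    return omega_hat, proxy

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))   # hypothetical data: n = 200 samples, p = 4 variables
omega_hat, proxy = ee_ggm(x, v=0.05, lam=0.1)
```

The whole computation is one covariance, one matrix inversion and two entry-wise operations, with no iterative solver.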

ℛ(·) norm based Elementary Estimators

Recently, multiple studies [21, 22, 19, 18] followed [23] and expanded EE-GM into a more general framework, “Elementary estimators” (EE):

$$\operatorname*{argmin}_{\theta} \; \mathcal{R}(\theta) \quad \text{subject to: } \mathcal{R}^*(\theta - \hat{\theta}_n) \le \lambda_n \tag{2.5}$$

Here ℛ(·) represents a decomposable regularization function and ℛ*(·) is the dual norm of ℛ(·):

$$\mathcal{R}^*(v) := \sup_{u \ne 0} \frac{\langle u, v \rangle}{\mathcal{R}(u)} = \sup_{\mathcal{R}(u) \le 1} \langle u, v \rangle \tag{2.6}$$

Eq. (2.4) and Eq. (2.3) are special cases of Eq. (2.5). $\hat{\theta}_n$ needs to be carefully constructed, well-defined and closed-form for the purpose of simplified computations. For example, [21] conduct the high-dimensional estimation of linear regression models by using the classical ridge estimator as $\hat{\theta}_n$ in Eq. (2.5). When $\hat{\theta}_n$ itself is closed-form and comes with strong statistical convergence guarantees in high-dimensional situations, we can use the unified framework proposed by the recent seminal study from [11] to prove that, under certain conditions, the solution of Eq. (2.5) achieves a near-optimal convergence rate comparable to that of regularized convex formulations.

2.3 Integrating additional knowledge and Δ with a Novel Function: kEV norm

Section 2.1 points out that none of the previous Δ estimators has been designed to integrate extra evidence beyond two sets of observed samples. In contrast, our Δ estimator aims to achieve two goals: (1) the new estimator should be flexible enough to describe various kinds of real-world knowledge, such as spatial distance, hub knowledge, known interactions or how multiple variables function as groups (see below); (2) the new estimator should work well in high-dimensional situations (large p) and be computationally practical. Eq. (2.5) provides an intriguing formulation to build simpler and possibly fast estimators accompanied by statistical guarantees, as long as $\hat{\theta}_n$ can be carefully constructed, well-defined and closed-form. We adapt it to design KDiffNet in Section 2.4.

In order to use Eq. (2.5) for estimating our target parameter θ = Δ, we need to design ℛ (Δ).

(1) Knowledge as Weight Matrix

We can describe the edge-level knowledge as positive weight matrices like WE ∈ ℝp×p. For example, when estimating the functional brain connectivity networks among brain regions, WE can describe the spatial distance among brain regions, which is publicly available through projects like openfMRI [12]. Another important example is identifying gene-gene interactions from patients’ gene expression profiles. Besides the patient samples, state-of-the-art bio-databases like HPRD [13] have collected a significant amount of information about direct physical interactions among corresponding proteins, regulatory gene pairs or signaling relationships collected from high-quality bio-experiments. Here WE can describe existing known edges as the knowledge, like those from interaction databases for discovering gene networks (a semi-supervised setting for such sample-based network estimations).

The positive matrix-based representation provides a powerful and flexible strategy that allows integration of many possible forms of existing knowledge to improve differential structure estimation, as long as they can be represented as edge-level weights. We can combine WE knowledge and the sparse regularization of Δ into a weighted 𝓁1 norm ‖WE ∘ Δ‖1, enforcing prior known importance of edges in the differential graph through weights. The larger a weight entry in WE, the less likely the corresponding edge belongs to the true differential graph. As mentioned in Section 1, the NAK and JEEK estimators have tried a similar weight-matrix-based strategy to add extra knowledge in identifying a single-task GGM and in discovering multiple GGMs. None of the previous differential GGM estimators has explored this, though.
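As a small illustration of how the weights act, the sketch below evaluates the weighted 𝓁1 norm for two candidate differential edges of equal magnitude. The coordinates and the weight design (pairwise distance plus one, to keep all weights positive) are hypothetical choices made only for this example.

```python
import numpy as np

# Hypothetical 1-D coordinates of three regions (illustrative values only).
coords = np.array([0.0, 1.0, 4.0])
# Assumed weight design: pairwise distance plus 1, so all weights stay positive.
w_e = np.abs(coords[:, None] - coords[None, :]) + 1.0

def weighted_l1(w, delta):
    """Weighted l1 norm ||W o Delta||_1, with o the element-wise product."""
    return float(np.sum(np.abs(w * delta)))

def single_edge(i, j, value, p=3):
    """Symmetric p x p matrix with one differential edge (i, j)."""
    delta = np.zeros((p, p))
    delta[i, j] = delta[j, i] = value
    return delta

near_cost = weighted_l1(w_e, single_edge(0, 1, 0.5))  # edge between nearby regions
far_cost = weighted_l1(w_e, single_edge(0, 2, 0.5))   # edge between distant regions
```

The same edge magnitude is penalized more heavily between the distant pair, which is how a larger weight makes an edge less likely to be selected.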

(2) Knowledge as Node Groups

In many real-world applications, there exists known group knowledge about random variables. For example, when working with genomics samples, biologists have collected a rich set of group evidence about how genes belong to various biological pathways or exist in the same or different cellular locations [4]. Such knowledge of node grouping provides a solid biological bias, for example that genes belonging to the same biological pathway tend to interact with each other (shared dependency pattern) in one cellular context or tend not to interact with each other (shared sparsity) under some other cellular conditions. However, this type of group evidence cannot be described via the weight matrix WE based formulation.

This is because, even though it is safe to assume nodes in the same group share similar interaction patterns, we do not know beforehand if the nodes in the group are collectively part of the differential network (group dependency) or not (group sparsity). To mitigate this issue, we use a flexible known node-group norm to include such extra knowledge. We represent the group knowledge as a set of groups on feature variables (vertices) 𝒢p. Formally, ∀gk ∈ 𝒢p, gk = {i} where i indicates that the i-th node belongs to the group k. Integrating 𝒢p knowledge into Δ means to enforce a group sparsity regularization on Δ. We generate the edge-group index 𝒢V from the node group index 𝒢p. This is done via defining $\mathcal{G}_V := \{g'_k \mid (i, j) \in g'_k \iff i \in g_k \text{ and } j \in g_k\}$. For vertex nodes in each node group gk, all possible pairs between these nodes belong to an edge-group $g'_k$. We propose to use the group,2 norm $\|\Delta\|_{\mathcal{G}_V} := \sum_k \|\Delta_{g'_k}\|_2$ to enforce a group-wise sparse structure on Δ. None of the previous differential GGM estimators has explored this knowledge-integration strategy before.
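The construction of edge groups from node groups can be sketched as follows. This is an illustration under stated assumptions: each edge group contains every ordered pair (i, j) of nodes from the same node group (including i = j), and the group,2 norm sums the 𝓁2 norms of the entries of Δ restricted to each edge group. The helper names and the toy grouping are hypothetical.

```python
import numpy as np
from itertools import product

def edge_groups(node_groups):
    """Map each node group g_k to an edge group: all pairs (i, j) with i, j in g_k."""
    return [list(product(g, g)) for g in node_groups]

def group2_norm(delta, node_groups):
    """Sum over edge groups of the l2 norm of the entries of delta in that group."""
    total = 0.0
    for pairs in edge_groups(node_groups):
        rows, cols = zip(*pairs)
        total += np.linalg.norm(delta[list(rows), list(cols)])
    return float(total)

node_groups = [[0, 1], [2, 3]]           # hypothetical grouping of p = 4 nodes
delta = np.zeros((4, 4))
delta[0, 1] = delta[1, 0] = 3.0          # change inside the first group only
value = group2_norm(delta, node_groups)  # the second group contributes nothing
```

Because the penalty is a sum of unsquared 𝓁2 norms, it tends to switch whole edge groups on or off together rather than individual entries.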

kEV norm

Now we propose a novel norm ℛ(Δ) to combine the two strategies above. We assume that the true parameter $\Delta^*$ is a superposition of two “clean” structures, a sparse structured $\Delta_e^*$ and a group-structured $\Delta_g^*$. We propose a new norm, the knowledge for Edges and Vertex norm (kEV-norm), as the superposition of the edge-weighted 𝓁1 norm and the group structured norm:

$$\mathcal{R}(\Delta) = \|W_E \circ \Delta_e\|_1 + \epsilon \|\Delta_g\|_{\mathcal{G}_V} \tag{2.7}$$

Our target parameter is Δ = Δe + Δg. The Hadamard product ∘ is the element-wise product between two matrices, i.e. [A ∘ B]ij = AijBij, and $\|\Delta_g\|_{\mathcal{G}_V} = \sum_k \|(\Delta_g)_{g'_k}\|_2$, where $g'_k$ is the edge group of the k-th node group.

WE ∈ ℝp×p is the aforementioned edge-level additional knowledge, with [WE]ij > 0, ∀ i, j ∈ {1 … p}. ϵ > 0 is a hyperparameter. kEV-norm has the following three properties (proofs in Section S:3).

  • (i) kEV-norm is a norm function if ϵ and entries of WE are positive.

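  • (ii) If the condition in (i) holds, kEV-norm is a decomposable norm.

  • (iii) The dual norm of kEV-norm is

$$\mathcal{R}^*(u) = \max\left( \|(1 \oslash W_E) \circ u\|_\infty, \; \frac{1}{\epsilon} \|u\|^*_{\mathcal{G}_V} \right) \tag{2.8}$$

where ⊘ denotes element-wise division and $\|\cdot\|^*_{\mathcal{G}_V}$ is the dual norm of the group,2 norm.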
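To make the superposition concrete, the following sketch evaluates a kEV-style penalty for a given decomposition (Δe, Δg). It assumes the penalty is the weighted 𝓁1 norm of Δe plus ϵ times the group,2 norm of Δg, with each edge group taken as the full block of Δg indexed by one node group. The function name and toy inputs are hypothetical.

```python
import numpy as np

def kev_norm(delta_e, delta_g, w_e, node_groups, eps):
    """Assumed kEV-style penalty: ||W_E o Delta_e||_1 + eps * sum_k ||block_k(Delta_g)||_2."""
    edge_term = np.sum(np.abs(w_e * delta_e))
    # np.ix_(g, g) selects the block of entries whose row and column are both in g.
    group_term = sum(np.linalg.norm(delta_g[np.ix_(g, g)]) for g in node_groups)
    return float(edge_term + eps * group_term)

p = 3
w_e = np.ones((p, p))                  # hypothetical uniform edge weights
delta_e = np.zeros((p, p))
delta_e[0, 2] = delta_e[2, 0] = 1.0    # sparse component: one edge
delta_g = np.zeros((p, p))
delta_g[0, 1] = delta_g[1, 0] = 2.0    # group component: inside group {0, 1}
node_groups = [[0, 1]]

value = kev_norm(delta_e, delta_g, w_e, node_groups, eps=0.5)
scaled = kev_norm(3 * delta_e, 3 * delta_g, w_e, node_groups, eps=0.5)
```

Here ϵ trades off the two components: a small ϵ makes group-structured changes cheap relative to individual weighted edges.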

2.4 KDiffNet : kEV Norm based Elementary Estimator for identifying Differential Net

Our goal is to achieve simple, scalable and theoretically sound estimation. EE in Eq. (2.5) provides such a formulation as long as we can construct $\hat{\theta}_n$ well. Now we have ℛ(Δ) as Eq. (2.7) and its dual norm ℛ*(·) in Eq. (2.8). We just need to find a $\hat{\theta}_n$ for Δ that is carefully constructed, theoretically well-behaved in high dimensions, and closed-form for the purpose of simplified computations.

One key insight of differential GGM is that the density ratio of two Gaussian distributions is naturally an exponential-family distribution (see proofs in Section S:4.2). The differential network Δ is one entry of the canonical parameter for this distribution. The MLE solution of estimating a vanilla (i.e. no sparsity and not high-dimensional) graphical model in an exponential family distribution can be expressed as a backward mapping that computes the target model parameters from certain given moments. When using vanilla MLE to learn the exponential distribution about differential GGM (i.e., estimating the canonical parameter), the backward mapping of Δ can be easily inferred from the two sample covariance matrices using $\hat{\Sigma}_d^{-1} - \hat{\Sigma}_c^{-1}$ (Section S:4.2). Even though this backward mapping has a simple closed form, it is not well-defined in high dimensions because $\hat{\Sigma}_c$ and $\hat{\Sigma}_d$ are rank-deficient (thus not invertible) when p > n. Using Eq. (2.3) to estimate Δ, Wang et al. [19] proposed the DIFFEE estimator for EE-based differential GGM estimation and used only the sparsity assumption on Δ. This study proposed a proxy backward mapping as $\hat{\theta}_n = [T_v(\hat{\Sigma}_d)]^{-1} - [T_v(\hat{\Sigma}_c)]^{-1}$. Here [Tv(A)]ij := ρv(Aij) and ρv(·) is chosen as a soft-threshold function.
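The proxy backward mapping can be sketched in a few lines. The example assumes the proxy is the difference of the inverses of the two soft-thresholded sample covariance matrices (condition d minus condition c), applies the threshold to every entry as in the definition of Tv above, and uses an arbitrary illustrative v. The data are synthetic and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(a, v):
    """Entry-wise soft-thresholding operator, [T_v(A)]_ij = rho_v(A_ij)."""
    return np.sign(a) * np.maximum(np.abs(a) - v, 0.0)

def proxy_backward_mapping(x_c, x_d, v):
    """Assumed proxy: inv(T_v(Sigma_d_hat)) - inv(T_v(Sigma_c_hat))."""
    sigma_c = np.cov(x_c, rowvar=False)
    sigma_d = np.cov(x_d, rowvar=False)
    return (np.linalg.inv(soft_threshold(sigma_d, v))
            - np.linalg.inv(soft_threshold(sigma_c, v)))

rng = np.random.default_rng(0)
n, p = 10, 20                                 # high-dimensional toy setting, p > n
x_c = rng.normal(size=(n, p))
x_d = rng.normal(size=(n, p))

# The raw sample covariance has rank at most n - 1, so it cannot be inverted.
rank_c = np.linalg.matrix_rank(np.cov(x_c, rowvar=False))
proxy = proxy_backward_mapping(x_c, x_d, v=0.1)   # v = 0.1 is an illustrative value
```

In practice v is tuned (Section 3) so that both thresholded matrices are invertible and well conditioned.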

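We borrow the idea and use $\hat{\theta}_n = [T_v(\hat{\Sigma}_d)]^{-1} - [T_v(\hat{\Sigma}_c)]^{-1}$. In Section S:4.3 and Section S:4.4 we prove that this $\hat{\theta}_n$ is both available in closed form and well-defined in high-dimensional settings. Now by plugging ℛ(Δ), its dual ℛ*(·) and $\hat{\theta}_n$ into Eq. (2.5), we get the formulation of KDiffNet (with ⊘ the element-wise division):

$$\begin{aligned} \operatorname*{argmin}_{\Delta} \;& \|W_E \circ \Delta_e\|_1 + \epsilon \|\Delta_g\|_{\mathcal{G}_V} \\ \text{subject to: } & \|(1 \oslash W_E) \circ (\Delta - \hat{\theta}_n)\|_\infty \le \lambda_n, \\ & \|\Delta - \hat{\theta}_n\|^*_{\mathcal{G}_V} \le \epsilon \lambda_n, \\ & \Delta = \Delta_e + \Delta_g \end{aligned} \tag{2.9}$$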

2.5 Solving KDiffNet

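We then propose to use a parallel proximal optimization to solve Eq. (2.9), inspired by its distributed and parallel nature [3]. To simplify notations, we add a new notation Δtot := [Δe; Δg], where ; denotes the row-wise concatenation. We also add three operator notations: Le(Δtot) = Δe, Lg(Δtot) = Δg and Ltot(Δtot) = Δe + Δg. Now we obtain the following re-formulation of KDiffNet:

$$\begin{aligned} \operatorname*{argmin}_{\Delta_{tot}} \;& \|W_E \circ L_e(\Delta_{tot})\|_1 + \epsilon \|L_g(\Delta_{tot})\|_{\mathcal{G}_V} \\ \text{subject to: } & \left\|(1 \oslash W_E) \circ \left(L_{tot}(\Delta_{tot}) - \left([T_v(\hat{\Sigma}_d)]^{-1} - [T_v(\hat{\Sigma}_c)]^{-1}\right)\right)\right\|_\infty \le \lambda_n, \\ & \left\|L_{tot}(\Delta_{tot}) - \left([T_v(\hat{\Sigma}_d)]^{-1} - [T_v(\hat{\Sigma}_c)]^{-1}\right)\right\|^*_{\mathcal{G}_V} \le \epsilon \lambda_n \end{aligned} \tag{2.10}$$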

In fact, the three added operators are affine mappings: Le(Δtot) = AeΔtot, Lg(Δtot) = AgΔtot, and Ltot(Δtot) = AtotΔtot, where Ae = [Ip×p 0p×p], Ag = [0p×p Ip×p] and Atot = [Ip×p Ip×p].
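The three operators can be checked numerically. The sketch below builds Δtot by stacking two random p × p matrices row-wise and applies the block matrices Ae, Ag and Atot defined above; the random inputs are placeholders.

```python
import numpy as np

p = 3
rng = np.random.default_rng(0)
delta_e = rng.normal(size=(p, p))
delta_g = rng.normal(size=(p, p))

# Row-wise concatenation [Delta_e; Delta_g], shape (2p, p).
delta_tot = np.vstack([delta_e, delta_g])

eye, zero = np.eye(p), np.zeros((p, p))
a_e = np.hstack([eye, zero])    # selects the first block
a_g = np.hstack([zero, eye])    # selects the second block
a_tot = np.hstack([eye, eye])   # sums the two blocks

recovered_e = a_e @ delta_tot
recovered_g = a_g @ delta_tot
recovered_sum = a_tot @ delta_tot
```

Each operator is a p × 2p matrix, so applying it costs no more than a matrix product and is easy to use inside a proximal scheme.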

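Algorithm 1 summarizes the Parallel Proximal algorithm [3, 22] we propose for optimizing Eq. (2.10). In Section S:1.3 we further prove that its computational cost is O(p^3). More concretely, in Algorithm 1 we simplify the notations by denoting $\mathcal{B}^* := [T_v(\hat{\Sigma}_d)]^{-1} - [T_v(\hat{\Sigma}_c)]^{-1}$, and reformulate Eq. (2.10) to the following equivalent and distributed formulation:

$$\begin{aligned} \operatorname*{argmin}_{\Delta_{tot_1}, \Delta_{tot_2}, \Delta_{tot_3}, \Delta_{tot_4}} \;& F_1(\Delta_{tot_1}) + F_2(\Delta_{tot_2}) + G_1(\Delta_{tot_3}) + G_2(\Delta_{tot_4}) \\ \text{subject to: } & \Delta_{tot_1} = \Delta_{tot_2} = \Delta_{tot_3} = \Delta_{tot_4} \end{aligned} \tag{2.11}$$

where $F_1(\cdot) = \|W_E \circ L_e(\cdot)\|_1$, $F_2(\cdot) = \epsilon \|L_g(\cdot)\|_{\mathcal{G}_V}$, $G_1(\cdot) = \mathcal{I}_{\|(1 \oslash W_E) \circ (L_{tot}(\cdot) - \mathcal{B}^*)\|_\infty \le \lambda_n}$ and $G_2(\cdot) = \mathcal{I}_{\|L_{tot}(\cdot) - \mathcal{B}^*\|^*_{\mathcal{G}_V} \le \epsilon \lambda_n}$. Here ℐC(·) represents the indicator function of a convex set C, with ℐC(x) = 0 when x ∈ C and ℐC(x) = ∞ otherwise. The detailed solution of each proximal operator is summarized in Table S:1 and Section S:2.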
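For readers unfamiliar with proximal operators, the sketch below lists three standard building blocks of the kind a parallel proximal scheme combines: the proximal operator of a weighted 𝓁1 norm, the proximal operator of an 𝓁2 (group) norm, and the projection onto an 𝓁∞ box, which is the proximal operator of that box's indicator function. These are generic textbook forms written under that assumption; they are not a reproduction of the paper's Table S:1, and the function names are hypothetical.

```python
import numpy as np

def prox_weighted_l1(x, w, gamma):
    """Prox of gamma * ||w o x||_1: entry-wise soft-thresholding at gamma * w."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma * w, 0.0)

def prox_group_l2(block, gamma):
    """Prox of gamma * ||block||_2: shrink the whole block toward zero."""
    norm = np.linalg.norm(block)
    if norm <= gamma:
        return np.zeros_like(block)
    return (1.0 - gamma / norm) * block

def project_linf_box(x, center, radius):
    """Prox of the indicator of {x : |x - center| <= radius entry-wise}."""
    return np.clip(x, center - radius, center + radius)

shrunk = prox_weighted_l1(np.array([3.0, -1.0, 0.5]), np.ones(3), 1.0)
block_kept = prox_group_l2(np.array([3.0, 4.0]), 1.0)
block_zeroed = prox_group_l2(np.array([3.0, 4.0]), 10.0)
clipped = project_linf_box(np.array([5.0, -5.0, 0.2]), 0.0, 1.0)
```

Each operator acts entry-wise or block-wise, which is what allows the four terms of a consensus formulation to be processed in parallel.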

Algorithm 1: A Parallel Proximal Algorithm to optimize KDiffNet

2.6 Analysis of Error Bounds

Based on Theorem S:5.3 and conditions in Section S:5, we have the following corollary about the convergence rate of KDiffNet. See its proof in Section S:5.2.2.

Corollary 2.1.

In the high-dimensional setting, i.e., p > max(nc, nd), let Embedded Image. Then for Embedded Image and min(nc, nd) > c log p, with a probability of at least 1 −2C1 exp(-C2p log(p)), the estimated optimal solution Embedded Image has the following error bound: Embedded Image where C1,C2,a, c, κ1 and κ2 are constants. See s and sG in Definition S:3.4.

3 Experiments

We aim to empirically show that KDiffNet is adaptive and flexible in incorporating different kinds of available evidence for improved differential network estimation. Data: This is accomplished by evaluating KDiffNet and baselines on two sets of datasets: (1) a total of 126 different synthetic datasets representing various combinations of additional knowledge (for details see Section 3.1); and (2) one real-world fMRI dataset, ABIDE, for functional brain connectivity estimation (Section 3.2). We obtain the edge-level knowledge from three different human brain atlases [7, 8, 16] about brain connectivity, resulting in three different WE with p = {116, 160, 246}. For each atlas we compute WE using the spatial distance between its brain Regions of Interest (ROIs). At the same time, we explore two different types of group knowledge about brain regions from the Dosenbach Atlas [7] (Section 3.2). Baselines: We compare KDiffNet to JEEK [18] and NAK [2], which use the extra edge knowledge, two direct differential estimators (SDRE [9], DIFFEE [19]) and the MLE based JGLFUSED [5] (Section 2.1 and detailed equations of each in Section S:1). We also extend KDiffNet to data situations with only edge knowledge (KDiffNet-E) or only group knowledge (KDiffNet-G). Both variations (KDiffNet-E and KDiffNet-G) can be solved by fast closed-form solutions (Section S:2.2).
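As an illustration of why an edge-only variant admits a closed form, consider minimizing a weighted 𝓁1 norm ‖WE ∘ Δ‖1 subject to the entry-wise box constraint |Δij − Bij| ≤ λn [WE]ij around a proxy B. This constraint form is an assumption made for the sketch. The problem then separates over entries, and each entry is solved by soft-thresholding Bij at level λn [WE]ij. The derivation and the function below are a sketch under that assumption and are not taken from Section S:2.2.

```python
import numpy as np

def edge_only_closed_form(proxy, w_e, lam):
    """Entry-wise soft-thresholding of the proxy at the weighted level lam * W_E."""
    return np.sign(proxy) * np.maximum(np.abs(proxy) - lam * w_e, 0.0)

# Hypothetical proxy backward mapping and edge weights for p = 2.
proxy = np.array([[1.0, -0.3],
                  [-0.3, 0.5]])
w_e = np.array([[1.0, 2.0],
                [2.0, 1.0]])

delta_hat = edge_only_closed_form(proxy, w_e, lam=0.2)
```

The off-diagonal entry carries the larger weight, so its threshold (0.4) exceeds its magnitude (0.3) and it is set to zero, while the diagonal entries are only shrunk by 0.2.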

Additional details of setup, metrics and hyper-parameters are in Section S:6.1. Hyperparameters: The key hyper-parameters are tuned as follows:

  • v : To compute the proxy backward mapping, we vary v in {0.001i|i = 1, 2, …, 1000} (to make Tv(Σc) and Tv(Σd) invertible).

  • λn : According to our convergence rate analysis in Section 2.6, Embedded Image, we choose λn from a range of Embedded Image using cross-validation. For KDiffNet-G, we tune over λn from a range of Embedded Image3.

  • ϵ: For KDiffNet-EG experiments, we tune ϵ ∈ {0.0001, 0.01, 1, 100}.
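The grid search over v can be sketched as below. The grid {0.001i | i = 1, …, 1000} follows the text; the selection rule (take the first grid value for which the thresholded matrix has its smallest singular value above a tolerance) and the tolerance itself are assumptions made for this illustration, as is the toy covariance matrix.

```python
import numpy as np

def soft_threshold(a, v):
    """Entry-wise soft-thresholding operator T_v."""
    return np.sign(a) * np.maximum(np.abs(a) - v, 0.0)

def smallest_invertible_v(sigma_hat, tol=1e-6):
    """Return the first v on the grid 0.001 * i making T_v(sigma_hat) numerically invertible."""
    for i in range(1, 1001):
        v = 0.001 * i
        smallest_sv = np.linalg.svd(soft_threshold(sigma_hat, v), compute_uv=False).min()
        if smallest_sv > tol:
            return v
    return None  # no grid value satisfied the tolerance

# Hypothetical rank-one (singular) covariance estimate: [[4, 2], [2, 1]].
a = np.array([[2.0], [1.0]])
sigma_hat = a @ a.T

original_smallest_sv = np.linalg.svd(sigma_hat, compute_uv=False).min()
v_star = smallest_invertible_v(sigma_hat)
```

A stricter tolerance, or a condition-number criterion, would push the selected v higher; the appropriate choice depends on the data.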

3.1 Experiment: Simulated Data about Brain Connectivity using Three Real-World Brain Spatial Matrices and Anatomic Group Evidence from Neuroscience as Knowledge

In this section, we show the effectiveness of KDiffNet in integrating additional evidence through a comprehensive set of simulation datasets. Our simulated data settings mimic three possible types of additional knowledge in the real world: with both edge and known node group knowledge (Data-EG), with only edge-level evidence (Data-E) or with only known node groups (Data-G). For the edge knowledge, we consider three cases of WE with p = {116, 160, 246} computed from three human brain atlases about brain regions [7, 8, 16]. For the group knowledge, we simulate groups to represent related anatomic regions inspired by the atlas [7]. For each simulation dataset, two blocks of data samples are generated following Gaussian distributions $N(0, \Omega_c^{-1})$ and $N(0, \Omega_d^{-1})$ via the simulated Ωc and Ωd. Each simulated dataset includes a pair of data blocks to estimate its differential GGM. We conduct a comprehensive evaluation over a total of 126 different simulated datasets by varying the dimension p, varying the number of samples (nc and nd), changing the proportion of edges controlled by WE (s) and varying the number of known groups sG. The details of the simulation framework are in Section S:6.2.
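Generating a pair of data blocks from two precision matrices can be sketched as follows. The zero means, the toy precision matrices and the sample sizes are assumptions for illustration; the paper's actual simulation framework (Section S:6.2) builds Ωc and Ωd from the atlas-based knowledge and is not reproduced here.

```python
import numpy as np

def sample_block(omega, n, rng):
    """Draw n samples from a zero-mean Gaussian with precision matrix omega."""
    p = omega.shape[0]
    return rng.multivariate_normal(np.zeros(p), np.linalg.inv(omega), size=n)

rng = np.random.default_rng(0)
p, n_c, n_d = 6, 30, 40                 # hypothetical dimension and sample sizes

omega_c = np.eye(p)                     # toy precision matrix for condition c
omega_d = omega_c.copy()
omega_d[0, 1] = omega_d[1, 0] = 0.4     # one differential edge in condition d

x_c = sample_block(omega_c, n_c, rng)   # data block of shape (n_c, p)
x_d = sample_block(omega_d, n_d, rng)   # data block of shape (n_d, p)
```

The pair (x_c, x_d) is then what a differential estimator receives, with the true Δ known for evaluation.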

We present a summary of our results (partial) in Table 1 using columns showing two cases of data generation settings (Data-EG and Data-G). Table 1 uses the mean F1-score and the computational time cost to compare methods (rows). Results about simulated datasets under Data-E case are in Section S:6. We can make several conclusions:

Table 1: Mean Performance (F1-Score) and Computation Time (seconds), with standard deviation given in parentheses, for the same setting of nc and nd, of KDiffNet-EG, KDiffNet-E, KDiffNet-G and baselines on simulated data. * indicates that the method is not applicable to a data setting.

  1. KDiffNet outperforms those baselines not considering knowledge. KDiffNet and its variations achieve the highest F1-score across all 126 datasets. SDRE and DIFFEE are direct differential network estimators but perform poorly, indicating that adding additional knowledge improves differential GGM estimation. The MLE based JGLFUSED performs the worst in all cases.

  2. KDiffNet outperforms those baselines considering knowledge, especially when group knowledge exists. Under the Data-EG setting, while JEEK and NAK include the extra edge information, they cannot integrate group information and are not designed for differential estimation. This results in lower F1-Scores (0.581 and 0.204 for W2) compared to KDiffNet-EG (0.927 for W2). The advantage of modeling both edge and node group evidence is also indicated by the higher F1-Score of KDiffNet-EG with respect to KDiffNet-E and KDiffNet-G on the Data-EG setting (top 3 rows in Table 1). On Data-G cases, none of the baselines can model node group evidence. On average KDiffNet-G performs 6.4× better than the baselines for p = 246 with respect to F1.

  3. KDiffNet achieves reasonable time cost versus the baselines and is scalable to large p. Figure 1(a) shows each method’s time cost per λn for large p = 2000. KDiffNet-EG is consistently faster than JEEK, JGLFUSED and SDRE (Column 1 in Table 1). KDiffNet-E and KDiffNet-G are faster than KDiffNet-EG owing to closed-form solutions. On Data-G and Data-E datasets (Section S:6.2), our faster closed-form solutions achieve much more significant computational speedups over all the baselines. For example, on datasets using W2 with p = 246, KDiffNet-E and KDiffNet-G are on average 21000× and 7400× faster (Column 5 in Table 1) than the baselines, respectively. We have all detailed results and figures about F1-Score and time cost for all 126 data settings in Section S:6.2. Besides F1-Score, we also present the ROC curves from all methods when varying λn. KDiffNet achieves the highest Area under Curve (AUC) in comparison to all other baselines.
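An edge-level F1-score for a recovered differential network can be computed as in the sketch below. It assumes that an edge is any off-diagonal entry of the upper triangle whose magnitude exceeds a small tolerance; the paper's exact metric definition is in Section S:6.1 and may differ. The toy matrices are hypothetical.

```python
import numpy as np

def edge_f1(delta_true, delta_est, tol=1e-8):
    """F1-score between the supports of the upper triangles of two matrices."""
    iu = np.triu_indices_from(delta_true, k=1)
    true_edges = np.abs(delta_true[iu]) > tol
    est_edges = np.abs(delta_est[iu]) > tol
    tp = np.sum(true_edges & est_edges)
    fp = np.sum(~true_edges & est_edges)
    fn = np.sum(true_edges & ~est_edges)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

delta_true = np.zeros((3, 3))
delta_true[0, 1] = delta_true[1, 0] = 1.0
delta_true[1, 2] = delta_true[2, 1] = -1.0

delta_est = np.zeros((3, 3))
delta_est[0, 1] = delta_est[1, 0] = 0.7    # correctly recovered edge
delta_est[0, 2] = delta_est[2, 0] = 0.2    # spurious edge; edge (1, 2) is missed

score = edge_f1(delta_true, delta_est)
```

With one true positive, one false positive and one false negative, precision and recall are both 0.5, and so is the F1-score.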

3.2 Experiment: Functional Connectivity Estimation from Real-World Brain fMRI Data

In this experiment, we evaluate KDiffNet in a real-world downstream classification task on a publicly available resting-state fMRI dataset: ABIDE [6]. The aim is to understand how functional dependencies among brain regions vary between normal and abnormal subjects and to help discover contributing markers that influence or cause the neural disorders [17]. ABIDE includes two groups of human subjects: autism and control. We utilize three types of additional knowledge: WE based on the spatial distance between 160 regions of the brain [7] and two types of available node groups from the Dosenbach Atlas [7]: one with 40 unique groups about macroscopic brain structures (G1) and another with 6 higher-level node groups having the same functional connectivity (G2). We use Quadratic Discriminant Analysis (QDA) in downstream classification to assess the ability of the estimators to learn the differential patterns about the connectome structures. (Details of the ABIDE dataset, baselines, design of the additional knowledge WE matrix, cross-validation and the QDA classification method are in Section S:6.4.) Figure 1(b) compares KDiffNet-EG, KDiffNet-E, KDiffNet-G and baselines on ABIDE, using the y axis for classification test accuracy (higher is better) and the x axis for the computation speed (negative log seconds, further right is better). KDiffNet-EG1, incorporating both edge (WE) and group (G1) knowledge, achieves the highest accuracy of 57.2% for distinguishing the autism subjects from the control subjects without sacrificing computation speed.

Figure 1: (a) (LEFT) Computation Time (log milliseconds) per λn for large p = 2000: KDiffNet-EG has reasonable time cost with respect to baseline methods. KDiffNet-E and KDiffNet-G are fast closed-form solutions. (b) (RIGHT) ABIDE Dataset: KDiffNet-EG achieves the highest accuracy without sacrificing computation speed (points towards the top right are better).

4 Conclusions

We propose a novel method, KDiffNet, to incorporate additional knowledge in estimating differential GGMs. KDiffNet elegantly formulates existing knowledge based on the problem at hand and avoids the need to design knowledge-specific optimization. We sincerely believe the scalability and flexibility provided by KDiffNet can make differential structure learning of GGMs feasible in many real-world tasks. We plan to generalize KDiffNet from Gaussian to semi-parametric distributions or to Ising Model structures. As node group knowledge is particularly important and abundant in genomics, we plan to evaluate KDiffNet on more real-world genomics data with multiple types of group information.

Acknowledgement

This work was supported partly by the National Science Foundation under NSF CAREER award No. 1453580. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation.

Appendix

S:1 Connecting to Relevant Studies

S:1.1 Differential GGM Estimation

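JGLFUSED [6]: This study extends the previously mentioned MLE based GLasso (Section 2.2) estimator for sparse differential GGM estimation. An additional sparsity penalty on the differential network, called the fused norm, is included as part of the optimization objective:

$$\operatorname*{argmin}_{\Omega_c, \Omega_d \succ 0, \, \Delta} \; \mathcal{L}(\Omega_c) + \mathcal{L}(\Omega_d) + \lambda_2 \left( \|\Omega_c\|_1 + \|\Omega_d\|_1 \right) + \lambda_n \|\Delta\|_1 \tag{S:1.1}$$

where $\mathcal{L}(\Omega) = -\log\det(\Omega) + \langle \Omega, \hat{\Sigma} \rangle$ is the negative Gaussian log-likelihood of the corresponding condition.

The alternating direction method of multipliers (ADMM) was used to solve Eq. (S:1.1); it needs to run an expensive SVD in one sub-procedure [6].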

DIFFEE[23]: Computationally, EEs are much faster than their regularized convex program peers for GM estimation. [23] proposed the so-called DIFFEE for estimating sparse changes in high-dimensional GGM structure using EE: Embedded Image

[23] use a closed-form and well-defined proxy function Embedded Image to approximate the backward mapping (the vanilla MLE solution) for differential sGGMs. We explain the proxy backward mapping and its statistical properties in Section S:4.1. The DIFFEE solution is a closed-form entry-wise thresholding operation on Embedded Image to ensure the desired sparsity structure of its final estimate. Here λn > 0 is the tuning parameter. Eq. (S:1.2) is a special case of Eq. (2.5), in which ℛ(·) is the ℓ1-norm for sparsity and the differential network Δ is the θ we aim to estimate.

As claimed by [10], direct estimation of differential GGMs can be more efficient in terms of both the number of required samples and the computation time. Besides, it does not require assuming that each precision matrix is sparse. For instance, recent literature in neuroscience has suggested that each subject’s functional brain connection network may not be sparse, even though differences across subjects may be sparse [1]. When identifying how genetic networks vary between two conditions, each individual network may contain hub nodes and is therefore not entirely sparse [11].

SDRE[12]: [12] proposed to estimate Sparse differential networks in exponential families by Density Ratio Estimation using the following formulation: Embedded Image

ℒ KLIEP minimizes the KL divergence between the true probability density pd(x) and the estimated Embedded Image without explicitly modeling the true pc(x) and pd(x). This estimator uses the elastic-net penalty for enforcing sparsity. We use sparseKLIEP1, which uses sub-gradient descent optimization, as a baseline for our method.

Diff-CLIME[25]: This study directly learns Δ through a constrained optimization formulation. Embedded Image

The optimization reduces to multiple linear programming problems, which makes this method less scalable to large p, with a computational complexity of O(p^8).

S:1.2 Incorporating Additional Knowledge in GGM Estimation

While previous studies do not use available additional knowledge for differential structure estimation, a few studies have tried to incorporate edge level weights for other types of GGM estimation.

NAK [3]: For the single task sGGM, one recent study [3] (following ideas from [17]) proposed to use a weighted Neighborhood selection formulation to integrate edge-level Additional Knowledge (NAK) as: Embedded Image. Here Embedded Image is the j-th column of a single sGGM Embedded Image. Specifically, Embedded Image if and only if Embedded Image. rj represents a weight vector designed using available extra knowledge for estimating a brain connectivity network from samples X drawn from a single condition. The NAK formulation can be solved by a classic Lasso solver like glmnet.

JEEK[22]: Two related studies, JEEK[22] and W-SIMULE[18], incorporate edge-level extra knowledge in the joint discovery of K heterogeneous graphs. In both these studies, each sGGM corresponding to a condition i is assumed to be composed of a task-specific sGGM component Embedded Image and a shared component ΩS across all conditions, i.e., Embedded Image. The minimization objective of W-SIMULE is as follows: Embedded Image

W-SIMULE is very slow when p > 200 due to its expensive computation cost of O(K^4 p^5). In comparison, JEEK is an EE-based optimization formulation: Embedded Image

Here, Embedded Image and Embedded Image. The edge knowledge of the task-specific graph is represented as the weight matrix W(i), and WS plays the same role for the shared network. JEEK differs from W-SIMULE in its constraint formulation, which in turn makes its optimization much faster and more scalable than that of W-SIMULE. In our experiments, we use JEEK as our baseline.

Drawbacks: However, none of these studies is flexible enough to incorporate other types of additional knowledge, such as node groups, or to handle cases where overlapping group and edge knowledge are available for the same target parameter. Further, these studies are limited by the assumption of sparse single-condition graphs. Estimating a sparse difference graph directly is more flexible as it does not rely on this assumption.

S:1.3 Computational Complexity

We optimize KDiffNet through a proximal algorithm, while KDiffNet-E and KDiffNet-G are solved through closed-form solutions. The resulting computational cost for KDiffNet is O(p^3), broken down into the following steps:

  • Estimating two covariance matrices: The computational complexity is O(max(nc, nd) p^2).

  • Backward Mapping: The element-wise soft-thresholding operation [Tv(·)] on the estimated covariance matrices costs O(p^2). This is followed by the matrix inversions [Tv(·)]^−1 to get the proxy backward mapping, which cost O(p^3).

  • Optimization: For KDiffNet, each operation in the proximal algorithm is group entry-wise or entry-wise, so the resulting computational cost is O(p^2). In addition, the matrix multiplications cost O(p^3).

For the KDiffNet-E and KDiffNet-G versions, the solution is the element-wise soft-thresholding operation Embedded Image, which costs O(p^2).
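The first two steps of this cost breakdown can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, and it assumes that Tv soft-thresholds the off-diagonal entries by v while leaving the diagonal untouched, and that the proxy backward mapping for the differential network is [Tv(Σ̂d)]^−1 − [Tv(Σ̂c)]^−1, matching Δ = Ωd − Ωc.

```python
import numpy as np

def soft_threshold_offdiag(a, v):
    """Sketch of T_v (assumed form): soft-threshold off-diagonal entries by v, keep the diagonal."""
    out = np.sign(a) * np.maximum(np.abs(a) - v, 0.0)
    np.fill_diagonal(out, np.diag(a))
    return out

def proxy_backward_mapping(x_c, x_d, v):
    """Return [T_v(S_d)]^{-1} - [T_v(S_c)]^{-1} for the sample covariances S_c, S_d.

    x_c and x_d are (n_c, p) and (n_d, p) sample matrices (rows are samples).
    """
    s_c = np.cov(x_c, rowvar=False)        # O(n_c p^2)
    s_d = np.cov(x_d, rowvar=False)        # O(n_d p^2)
    t_c = soft_threshold_offdiag(s_c, v)   # O(p^2)
    t_d = soft_threshold_offdiag(s_d, v)   # O(p^2)
    # The two inversions dominate the cost at O(p^3); np.linalg.inv raises
    # LinAlgError if v is too small for the thresholded matrix to be invertible.
    return np.linalg.inv(t_d) - np.linalg.inv(t_c)
```

Since the result does not depend on λn, it can be computed once and reused while tuning λn.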

Figure S:1:

Schematic Diagram of KDiffNet : integrating extra edge and node groups knowledge for directly estimating the sparse change in the dependency structures of two p-dimensional GGMs (differential GGMs)

S:2 Optimization of KDiffNet and Its Variants

S:2.1 Optimization via Proximal Solution

We assume Δtot = [Δe; Δg], where ";" denotes row-wise concatenation. Consider the operators Le(Δtot) = Δe, Lg(Δtot) = Δg and Ltot(Δtot) = Δe + Δg. Embedded Image

This can be rewritten as: Embedded Image

Where: Embedded Image

Here, Le, Lg and Ltot can be written as affine mappings. By Lemma in [], Embedded Image if AA^T = βI, and h(x) = g(Ax), Embedded Image

βg = 1, βe = 1 and βtot = 2.
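The lemma referred to above is, as far as we can tell, the standard result for the proximal operator of a function composed with a semi-orthogonal linear map: if h(x) = g(Ax) and AA^T = βI, then prox_{γh}(x) = x − β^{−1} A^T (Ax − prox_{βγg}(Ax)). The following is a minimal sketch of that identity, taking g to be the ℓ1 norm purely for illustration; the names `prox_l1`, `prox_composed` and `objective` are hypothetical and not taken from the paper.

```python
import numpy as np

def prox_l1(z, t):
    """Proximal operator of t * ||z||_1 (entry-wise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_composed(x, a, beta, prox_g, gamma=1.0):
    """Prox of gamma * h at x, where h(x) = g(a @ x) and a @ a.T = beta * I."""
    ax = a @ x
    return x - (a.T @ (ax - prox_g(ax, beta * gamma))) / beta

# Illustration with A_tot = [I I], so that A_tot @ [De; Dg] = De + Dg and beta = 2.
p = 3
a_tot = np.hstack([np.eye(p), np.eye(p)])
rng = np.random.default_rng(0)
delta_tot = rng.standard_normal((2 * p, p))
gamma = 0.3
result = prox_composed(delta_tot, a_tot, 2.0, prox_l1, gamma)

def objective(u):
    """0.5 * ||u - delta_tot||_F^2 + gamma * ||A_tot u||_1, minimized by the prox."""
    return 0.5 * np.sum((u - delta_tot) ** 2) + gamma * np.sum(np.abs(a_tot @ u))
```

Because the objective is strictly convex, `result` should attain a value no larger than any perturbed point, which gives a simple numerical sanity check of the identity.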

Solving for each proximal operator

A. F1(Δtot) = ‖WE ∘ (Le(Δtot))‖1

Le(Δtot) = AeΔtot = Δe. Embedded Image

Here, Embedded Image. Embedded Image

Here j, k = 1, …, p. This is an entry-wise operator (i.e., the calculation of each entry is only related to itself). This can be written in closed form: Embedded Image

We replace this in Eq. (S:2.6).

Embedded Image Here, Lg(Δtot) = AgΔtot = Δg. Embedded Image

Here, Embedded Image. Embedded Image

Here g ∈ 𝒢𝒱. This is a group entry-wise operator (computing a group of entries is not related to other groups). In closed form: Embedded Image

We replace this in Eq. (S:2.9).

C. Embedded Image Here, Ltot = AtotΔtot and Atot = [Ip×p Ip×p]. Embedded Image Embedded Image

In closed form: Embedded Image

We replace this in Eq. (S:2.12).

D. Embedded Image Here, Ltot = AtotΔtot and Atot = [Ip×p Ip×p]. Embedded Image Embedded Image

Table S:1:

The four proximal operators

This operator is group entry-wise. In closed form: Embedded Image

We replace this in Eq. (S:2.15).

S:2.2 Closed-form Solutions if Incorporating Only Edge or Only Node Group Knowledge

In cases where we do not have superposition structures in the differential graph estimation, we can estimate the target Δ through a closed-form solution, making the method scalable to larger p. In detail:

KDiffNet-E Only Edge-level Knowledge WE

If additional knowledge is only available in the form of edge weights, Eq. (S:2.1) reduces to: Embedded Image

This has a closed form solution: Embedded Image

Here Embedded Image

KDiffNet-G Only Node Groups Knowledge GV

If additional knowledge is only available in the form of groups of vertices 𝒢V, Eq. (S:2.1) reduces to: Embedded Image

Here, we treat nodes not in any group as individual groups with cardinality 1. The closed-form solution is given by: Embedded Image

Where Embedded Image and max is the element-wise max function. Algorithm 1 shows the detailed steps of the KDiffNet estimator. Being non-iterative, the closed form solution helps KDiffNet achieve significant computational advantages.
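To make the two closed-form solutions concrete, the sketch below applies them to a given proxy backward mapping `b`. It rests on assumptions that are not spelled out in the extracted text: that the edge-only solution is entry-wise soft-thresholding with per-entry threshold λn·WE, that the group-only solution is block soft-thresholding of each node-group block by its Frobenius norm, that the groups are disjoint lists of node indices, and that entries outside every group block are treated as singleton groups. The function names are hypothetical.

```python
import numpy as np

def kdiffnet_e(b, w_e, lam):
    """Assumed KDiffNet-E form: soft-threshold b entry-wise with threshold lam * w_e."""
    return np.sign(b) * np.maximum(np.abs(b) - lam * w_e, 0.0)

def kdiffnet_g(b, groups, lam):
    """Assumed KDiffNet-G form: block soft-thresholding over node-group blocks.

    groups is a list of disjoint lists of node indices; entries outside all
    group blocks are treated as singleton groups (plain soft-thresholding).
    """
    delta = np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)  # singleton groups
    for nodes in groups:
        idx = np.ix_(nodes, nodes)
        norm = np.linalg.norm(b[idx])                      # Frobenius norm of the block
        scale = max(1.0 - lam / norm, 0.0) if norm > 0 else 0.0
        delta[idx] = scale * b[idx]                        # shrink the whole block together
    return delta
```

Both functions touch each entry a constant number of times, which is consistent with the O(p^2) cost stated for the closed-form variants.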

Algorithm 1

KDiffNet-E and KDiffNet-G


S:3 More Proof about kEV Norm and Its Dual Norm

S:3.1 Proof for kEV Norm is a norm

We reformulate kEV norm as Embedded Image to Embedded Image

Theorem S:3.1.

kEV Norm is a norm if and only if ℛ1(·) and ℛ2(·) are norms.

Proof. By the following Theorem S:3.3, ℛ1(·) is a norm. If ϵ > 0, ℛ2(·) is a norm. The sum of two norms is a norm; hence the kEV norm is a norm.

Lemma S:3.2.

For the kEV norm, the condition WEj,k ≠ 0 is equivalent to WEj,k > 0.

Proof. If WEj,k < 0, then |WEj,kΔj,k| = | - WEj,kΔj,k|. Notice that -WEj,k > 0.

Theorem S:3.3.

ℛ1(·) = ||WE ∘ · ||1 is a norm if and only if ∀ 1 ≤ j, k ≤ p, WEjk ≠ 0.

Proof. To prove that ℛ1(·) = ||WE ∘ · ||1 is a norm, by Lemma S:3.2 we need to prove that f(x) = ||W ∘ x||1 is a norm function if Wi,j > 0. 1. f(ax) = ||aW ∘ x||1 = |a| ||W ∘ x||1 = |a| f(x). 2. f(x + y) = ||W ∘ (x + y)||1 = ||W ∘ x + W ∘ y||1 ≤ ||W ∘ x||1 + ||W ∘ y||1 = f(x) + f(y). 3. f(x) ≥ 0. 4. If f(x) = 0, then Σ|Wi,jxi,j| = 0. Since Wi,j ≠ 0, xi,j = 0. Therefore, x = 0. Based on the above, f(x) is a norm function, and hence ℛ1(·) is a norm function.

S:3.2 kEV Norm is a decomposable norm

We show that kEV Norm is a decomposable norm within a certain subspace, with the following structural assumptions of the true parameter Δ*:

(EV-Sparsity)

The ‘true’ parameter Δ* can be decomposed into two clear structures: Δe* and Δg*. Δe* is exactly sparse with s non-zero entries indexed by a support set SE, and Δg* is exactly sparse with Embedded Image non-zero groups (each with at least one non-zero entry) indexed by a support set SV. SE ⋂ SV = Ø. All other elements (in (SE ⋃ SV)c) are equal to 0.

Definition S:3.4.

(EV-subspace) Embedded Image

Theorem S:3.5.

kEV Norm is a decomposable norm with respect to ℳ and Embedded Image

Proof. Assume u ∈ℳ and Embedded ImageEmbedded Image. Therefore, kEV-norm is a decomposable norm with respect to the subspace pair Embedded Image.

S:3.3 Proofs of Dual Norms for kEV Norm
Theorem S:3.6.

Dual Norm of kEV Norm is Embedded Image.

Proof. Suppose Embedded Image, where Embedded Image. Then the dual norm ℛ*(·) can be derived by the following equation. Embedded Image

Connecting ℛ1(·) = ‖WE ∘ · ‖1 and Embedded Image. By the following Theorem S:3.7, Embedded Image. From [13], for Embedded Image, the dual norm is given by Embedded Image where Embedded Image are dual exponents and s𝒢 denotes the number of groups. As a special case of this general duality relation, this leads to a block (∞, 2) norm as the dual.

Hence, Embedded Image. Hence, the dual norm of kEV norm is Embedded Image.

Theorem S:3.7.

The dual norm of ‖WE ∘ · ‖1 is: Embedded Image

For ℛ1(·) = ‖WE ∘ · ‖1, the dual norm is given by: Embedded Image
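As a numerical illustration of this duality, recall the standard fact that for strictly positive weights the dual of x ↦ ‖W ∘ x‖1 is u ↦ max_{j,k} |u_{j,k}| / W_{j,k}. The sketch below (with hypothetical function names) checks the generalized Hölder inequality ⟨u, x⟩ ≤ ℛ*(u) ℛ(x) and constructs a point where it is tight; it is meant as a sanity check under the assumption W > 0, not as a restatement of the omitted formula.

```python
import numpy as np

def weighted_l1(x, w):
    """R(x) = ||w ∘ x||_1 with strictly positive weights w."""
    return np.sum(np.abs(w * x))

def weighted_l1_dual(u, w):
    """Dual norm of R for w > 0: the largest ratio |u_jk| / w_jk."""
    return np.max(np.abs(u) / w)

def tight_point(u, w):
    """A point x with R(x) = 1 at which <u, x> equals the dual norm of u."""
    ratios = np.abs(u) / w
    j, k = np.unravel_index(np.argmax(ratios), ratios.shape)
    x = np.zeros_like(u)
    x[j, k] = np.sign(u[j, k]) / w[j, k]
    return x
```

For any x, `np.sum(u * x)` should not exceed `weighted_l1_dual(u, w) * weighted_l1(x, w)`, with equality at `tight_point(u, w)`.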

S:4 Appendix: More Background of Proxy Backward mapping and Theorems of Tv Being Invertible

Essentially, the MLE solution for estimating a vanilla graphical model in an exponential family distribution can be expressed as a backward mapping that computes the target model parameters from certain given moments. For instance, when learning a Gaussian GM with vanilla MLE, the backward mapping is Embedded Image, which estimates Ω from the sample covariance matrix (moment) Embedded Image. However, this backward mapping is normally not well-defined in high-dimensional settings. In the case of GGM, when given the sample covariance Embedded Image, we cannot just compute the vanilla MLE solution as Embedded Image in the high-dimensional setting, since Embedded Image is rank-deficient when p > n. Therefore Yang et al. [24] proposed to use carefully constructed proxy backward maps for Eq. (2.4) that are both available in closed form and well-defined in high-dimensional settings for exponential GM models. For instance, Embedded Image in Eq. (2.3) is the proxy backward mapping [24] used for GGM.

S:4.1 Backward mapping for an exponential-family distribution

The solution of vanilla graphical model MLE can be expressed as a backward mapping[21] for an exponential family distribution. It estimates the model parameters (canonical parameter θ) from certain (sample) moments. We provide detailed explanations about backward mapping of exponential families, backward mapping for Gaussian special case and backward mapping for differential network of GGM in this section.

Backward mapping

Essentially the vanilla graphical model MLE can be expressed as a backward mapping that computes the model parameters corresponding to some given moments in an exponential family distribution. For instance, in the case of learning GGM with vanilla MLE, the backward mapping is Embedded Image that estimates Ω from the sample covariance (moment) Embedded Image.

Suppose a random variable X ∈ ℝp follows the exponential family distribution: Embedded Image

Where θ ∈ Θ ⊂ ℝd is the canonical parameter to be estimated and Θ denotes the parameter space. ϕ(X) denotes the sufficient statistics as a feature mapping function ϕ : ℝp → ℝd, and A(θ) is the log-partition function. We then define the mean parameters v as the expectation of ϕ(X): v(θ) := 𝔼[ϕ(X)], which can be the first and second moments of the sufficient statistics ϕ(X) under the exponential family distribution. The set of all possible moments is denoted by the moment polytope: Embedded Image

Mostly, graphical model inference involves the task of computing the moments v(θ) ∈ ℳ given the canonical parameters Embedded Image. We denote this computation as the forward mapping: Embedded Image

The learning/estimation of graphical models involves the reverse of the forward mapping, the so-called backward mapping [21]. We denote the interior of ℳ as ℳ0. The backward mapping is defined as: Embedded Image which does not need to be unique. For the exponential family distribution, Embedded Image

Where Embedded Image.

Backward Mapping: Gaussian Case

If a random variable X ∈ ℝp follows the Gaussian distribution N(µ, Σ), then Embedded Image. The sufficient statistics Embedded Image, and the log-partition function Embedded Image

When performing inference in Gaussian Graphical Models, it is easy to estimate the mean vector v(θ), since it equals 𝔼[X, XX^T].

When learning the GGM, we estimate its canonical parameter θ through vanilla MLE. Because Σ−1 is one entry of θ we can use the backward mapping to estimate Σ−1. Embedded Image

By plugging in Eq. (S:4.6) into Eq. (S:4.5), we get the backward mapping of Ω as Embedded Image, easily computable from the sample covariance matrix.
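The caveat that this backward mapping is only usable in low-dimensional settings can be seen numerically: with n samples in p dimensions, the mean-centered sample covariance has rank at most n − 1, so it cannot be inverted when p > n. The following is a small sketch on synthetic standard-normal data; the helper name `sample_cov_rank` and the chosen sizes are illustrative assumptions.

```python
import numpy as np

def sample_cov_rank(n, p, seed=0):
    """Rank of the sample covariance of n i.i.d. standard-normal samples in R^p."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, p))
    s_hat = np.cov(x, rowvar=False)   # p x p sample covariance
    return np.linalg.matrix_rank(s_hat)

# High-dimensional case (p > n): rank is at most n - 1, so s_hat is singular.
rank_high_dim = sample_cov_rank(n=10, p=30)

# Low-dimensional case (n >> p): full rank, so the plain inverse is available.
rank_low_dim = sample_cov_rank(n=500, p=5)
```

This is the gap that the proxy backward mapping based on Tv is designed to fill.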

S:4.2 Backward Mapping for Differential GGM

When the random variables Xc, Xd ∈ ℝp follow the Gaussian distributions N(µc, Σc) and N(µd, Σd), their density ratio (defined by [12]) essentially is a distribution in exponential families: Embedded Image

Here Embedded Image and Embedded Image.

The log-partition function Embedded Image

The canonical parameter Embedded Image

The sufficient statistics ϕ([Xc, Xd]) and the log-partition function A(θ): Embedded Image

And h(x) = 1.

Now we can estimate this exponential distribution (θ) through vanilla MLE. By plugging Eq. (S:4.11) into Eq. (S:4.5), we get the following backward mapping via the conjugate of the log-partition function: Embedded Image

The mean parameter vector v(θ) includes the moments of the sufficient statistics ϕ() under the exponential distribution. It can be easily estimated through Embedded Image.

Therefore the backward mapping of θ becomes, Embedded Image

Because the second entry of the canonical parameter θ is Embedded Image, we get the backward mapping of Δ as Embedded Image

This can be easily inferred from the two sample covariance matrices Embedded Image and Embedded Image (note: only in low-dimensional settings).

S:4.3 Theorems of Proxy Backward Mapping Tv Being Invertible

Based on [24] for any matrix A, the element wise operator Tv is defined as: Embedded Image

Suppose we apply this operator Tv to the sample covariance matrix Embedded Image to obtain Embedded Image. Then, Embedded Image under high dimensional settings will be invertible with high probability, under the following conditions:

Condition-1 (Σ-Gaussian ensemble) Each row of the design matrix X ∈ ℝn×p is i.i.d. sampled from N(0, Σ).

Condition-2 The covariance Σ of the Σ-Gaussian ensemble is strictly diagonally dominant: for all rows i, δi := Σii − Σ_{j≠i} |Σij| ≥ δmin > 0, where δmin is a large enough constant so that Embedded Image.

This assumption guarantees that the matrix Embedded Image is invertible, and its induced ℓ∞ norm is well bounded. Then the following theorem holds:

Theorem S:4.1.

Suppose Condition-1 and Condition-2 hold. Then for any v ≥ Embedded Image, the matrix Embedded Image is invertible with probability at least Embedded Image for p′ := max{n, p} and any constant τ > 2.

S:4.4 Useful lemma(s) of Error Bounds of Proxy Backward Mapping Tv
Lemma S:4.2.

(Theorem 1 of [16]). Let δ be Embedded Image. Suppose that ν > 2δ. Then, under the conditions (C-SparseΣ), and as ρv() is a soft-threshold function, we can deterministically guarantee that the spectral norm of error is bounded as follows: Embedded Image

Lemma S:4.3.

(Lemma 1 of [15]). Let 𝒜 be the event that Embedded Image where p′ := max(n, p) and τ is any constant greater than 2. Suppose that the design matrix X is i.i.d. sampled from Σ-Gaussian ensemble with n ≥ 40 maxi Σii. Then, the probability of event 𝒜 occurring is at least Embedded Image.

S:5 Theoretical Analysis of Error Bounds

S:5.1 Background: Error bounds of Elementary Estimators

KDiffNet formulations are special cases of the following generic formulation for the elementary estimator. Embedded Image

Where ℛ*(·) is the dual norm of ℛ(·), Embedded Image

Following the unified framework [13], we first decompose the parameter space into a subspace pair Embedded Image, where Embedded Image is the closure of ℳ. Here Embedded Image. ℳ is the model subspace that typically has a much lower dimension than the original high-dimensional space. Embedded Image is the perturbation subspace of parameters. For further proofs, we assume the regularization function in Eq. (S:5.1) is decomposable w.r.t the subspace pair Embedded Image. Embedded Image

[13] showed that most regularization norms are decomposable corresponding to a certain subspace pair.

Definition S:5.1.

Subspace Compatibility Constant

Subspace compatibility constant is defined as Embedded Image which captures the relative value between the error norm | · | and the regularization function ℛ(·).

For simplicity, we assume there exists a true parameter θ* which has the exact structure w.r.t a certain subspace pair. Concretely: Embedded Image

Then we have the following theorem.

Theorem S:5.2.

Suppose the regularization function in Eq. (S:5.1) satisfies condition (C1), the true parameter of Eq. (S:5.1) satisfies condition (C2), and λn satisfies that Embedded Image. Then, the optimal solution Embedded Image of Eq. (S:5.1) satisfies: Embedded Image Embedded Image Embedded Image

Proof. Let Embedded Image be the error vector that we are interested in. Embedded Image

By the fact that Embedded Image, and the decomposability of ℛ with respect to Embedded Image Embedded Image

Here, the inequality holds by the triangle inequality of norm. Since Eq. (S:5.1) minimizes Embedded Image, we have Embedded Image. Combining this inequality with Eq. (S:5.7), we have: Embedded Image

Moreover, by Hölder’s inequality and the decomposability of ℛ(·), we have: Embedded Image where Embedded Image is a simple notation for Embedded Image.

Since the projection operator is defined in terms of ‖· ‖2 norm, it is non-expansive: Embedded Image. Therefore, by Eq. (S:5.9), we have: Embedded Image and plugging it back to Eq. (S:5.9) yields the error bound Eq. (S:5.4).

Finally, Eq. (S:5.5) is straightforward from Eq. (S:5.8) and Eq. (S:5.10). Embedded Image

S:5.2 Error Bounds of KDiffNet

Theorem S:5.2 provides the error bounds via λn with respect to three different metrics. In the following, we focus on one of the metrics, the Frobenius norm, to evaluate the convergence rate of our KDiffNet estimator.

S:5.2.1 Error Bounds of KDiffNet through λn and ϵ
Theorem S:5.3.

Assuming the true parameter Δ* satisfies the conditions (C1)(C2) and Embedded Image, then the optimal point Embedded Image has the following error bounds: Embedded Image

Proof: KDiffNet uses ℛ(·) = ‖WE ∘ ·‖1 + ϵ‖·‖𝒢,2, which is a superposition of two norms: ℛ1 = ‖WE ∘ ·‖1 and ℛ2 = ϵ‖·‖𝒢,2. Based on the results in [13], Embedded Image and Embedded Image, where s is the number of nonzero entries in Δ and sG is the number of groups in which there exists at least one nonzero entry. Therefore, Embedded Image. Hence, using this in Eq. (S:5.4), Embedded Image.

S:5.2.2 Proof of Corollary (2.1)-Derivation of the KDiffNet error bounds

To derive the convergence rate for KDiffNet, we introduce the following two sufficient conditions on Σc and Σd to show that the proxy backward mapping Embedded Image is well-defined [23]:

(C-MinInf−Σ) The true Embedded Image and Embedded Image of Eq. (2.1) have bounded induced operator norm, i.e., Embedded Image and Embedded Image.

(C-Sparse-Σ): The two true covariance matrices Embedded Image and Embedded Image are “approximately sparse” (following [2]). For some constant 0 ≤ q < 1 and Embedded Image and Embedded Image.2

We additionally require Embedded Image and Embedded Image.

We assume the true parameters Embedded Image and Embedded Image satisfy the C-MinInfΣ and C-SparseΣ conditions.

Using the above theorem and conditions, we have the following corollary for the convergence rate of KDiffNet (note: the following corollary is the same as Corollary 2.1 in the main draft; we repeat it here for the reader’s convenience):

Corollary S:5.4.

In the high-dimensional setting, i.e., Embedded Image. Then for Embedded Image and min(nc, nd) > c log p, with a probability of at least 1 −2C1 exp(−C2p log(p)), the estimated optimal solution Embedded Image has the following error bound: Embedded Image

where a, c, κ1 and κ2 are constants.

Proof. In the following proof, we first prove Embedded Image. Here Embedded Image and p′ = max(p, nc)

The condition (C-SparseΣ) and condition (C-MinInfΣ) also hold for Embedded Image and Embedded Image. In order to utilize Theorem (S:5.3) for this specific case, we only need to show that Embedded Image for the setting of Embedded Image: Embedded Image

We first compute the upper bound of Embedded Image. By the selection of v in the statement, Lemma (S:4.2) and Lemma (S:4.3) hold with probability at least Embedded Image. Armed with Eq. (S:4.15), we use the triangle inequality of the norm and the condition (C-SparseΣ): for any w, Embedded Image

Where the second inequality uses the condition (C-SparseΣ). Now, by Lemma (S:4.2) with the selection of v, we have Embedded Image where c1 is a constant depending only on τ and maxi Σii. Specifically, it is defined as Embedded Image. Hence, as long as Embedded Image as stated, so that Embedded Image, we can conclude that Embedded Image, which implies Embedded Image.

The remaining term in Eq. (S:5.14) is Embedded Image. By construction of Tv(·) in (C-Thresh) and by Lemma (S:4.3), we can confirm that Embedded Image as well as Embedded Image can be upper-bounded by v.

Similarly, the Embedded Image has the same result.

Finally, Embedded Image Embedded Image Embedded Image

Because by Theorem S:5.3, we know if Embedded Image, Embedded Image

Suppose p > max(nc, nd) we have that Embedded Image

By combining all together, we can confirm that the selection of λn satisfies the requirement of Theorem (S:5.3), which completes the proof.

S:6 More about Experiments

S:6.1 Experimental Setup

The hyper-parameters in our experiments are v, λn, ϵ and λ2. In detail:

  • To compute the proxy backward mapping in (S:2.1), DIFFEE, and JEEK, we vary the soft-thresholding parameter v over the set {0.001i | i = 1, 2, …, 1000} (to make Tv(Σc) and Tv(Σd) invertible).

  • λn is the hyper-parameter in our KDiffNet formulation. According to our convergence rate analysis in Section 2.6, Embedded Image, we choose λn from a range of Embedded Image. For the KDiffNet-G case, we tune λn over a range of Embedded Image. We use the same range to tune λ1 for SDRE. Tuning for NAK is done by the package itself.

  • ϵ: For KDiffNet-EG experiments, we tune ϵ ∈{0.0001, 0.01, 1, 100}.

  • λ2 controls the sparsity of the individual graphs in JGLFUSED. We choose λ2 = 0.0001 (a very small value) for all experiments to ensure only the differential network is sparse.

Evaluation Metrics
  • F1-score: We use the edge-level F1-score as a measure of the performance of each method. Embedded Image, where Embedded Image and Embedded Image. The better method achieves a higher F1-score. We choose the best performing λn using validation and report the performance on a test dataset.

  • Time Cost: We use the execution time (measured in seconds or log(seconds)) for a method as a measure of its scalability. The better method uses less time3
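The edge-level F1-score above can be computed as in the following sketch, which uses the standard definitions F1 = 2·precision·recall / (precision + recall), precision = TP / (TP + FP) and recall = TP / (TP + FN). Two details are assumptions made only for illustration: an edge is counted when the absolute value of an off-diagonal entry exceeds a small tolerance, and each undirected edge is counted once via the upper triangle. The name `edge_f1` is hypothetical.

```python
import numpy as np

def edge_f1(delta_hat, delta_true, tol=1e-8):
    """Edge-level F1 between the supports of an estimated and a true differential network."""
    pred = np.abs(delta_hat) > tol
    true = np.abs(delta_true) > tol
    iu = np.triu_indices_from(delta_true, k=1)   # off-diagonal entries, each edge once
    pred, true = pred[iu], true[iu]
    tp = np.sum(pred & true)                     # edges found in both
    fp = np.sum(pred & ~true)                    # predicted but not true
    fn = np.sum(~pred & true)                    # true but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, if the true network has edges (0, 1) and (1, 2) and the estimate has edges (0, 1) and (0, 2), precision and recall are both 0.5, giving an F1-score of 0.5.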

S:6.2 Simulation Dataset Generation

We first use simulation to evaluate KDiffNet for improving differential structure estimation by making use of extra knowledge. We generate simulated datasets with a clear underlying differential structure between two conditions, using the following method:

Data Generation for Edge Knowledge (KE)

Given a known weight matrix WE (e.g., spatial distance matrix between p brain regions), we set Wd = inv.logit(−WE). We assume that the higher the value of Wij, the lower the probability that the edge occurs in the true precision matrix. This is motivated by the role of spatial distance in brain connectivity networks: farther regions are less likely to be connected and vice versa. We select different levels in the matrix Wd, denoted by s, where if Embedded Image, else Embedded Image, where Δd ∈ ℝp×p. We denote by s the sparsity, i.e., the number of non-zero entries in Δd. BI is a random graph with each edge Embedded Image with probability p. δc and δd are selected large enough to guarantee positive definiteness. Embedded Image Embedded Image Embedded Image

There is a clear differential structure in Δ = Ωd − Ωc, controlled by Δd. To generate data from two conditions that follow the above differential structure, we generate two blocks of data samples following Gaussian distributions using Embedded Image and Embedded Image. We only use these data samples to approximate the differential GGM to compare to the ground truth Δ.

Data Generation for Vertex Knowledge (KG)

In this case, we simulate the case of extra knowledge of nodes in known groups. Let the node group size, i.e., the number of nodes with a similar interaction pattern in the differential graph, be m. We select the block diagonals of size m as groups in Δg. If two variables i, j are in a group g′, in Embedded Image, else Embedded Image, where Δg ∈ ℝp×p. We denote by sG the number of groups in Δg. BI is a random graph with each edge Embedded Image with probability p. Embedded Image Embedded Image Embedded Image

δc and δd are selected large enough to guarantee positive definiteness. We generate two blocks of data samples following Gaussian distribution using Embedded Image and Embedded Image.

Data Generation for both Edge and Vertex Knowledge (KEG)

In this case, we simulate the case of overlapping group and edge knowledge. Let the node group size, i.e., the number of nodes with a similar interaction pattern in the differential graph, be m. We select the block diagonals of size m as groups in Δg. If two variables i, j are in a group g′, in Embedded Image, else Embedded Image, where Δg ∈ ℝp×p.

For the edge-level knowledge component, given a known weight matrix WE, we set Wd = inv.logit(−WE). The higher the value of Embedded Image, the lower the value of Embedded Image, and hence the lower the probability that the edge occurs in the true precision matrix. We select different levels in the matrix Wd, denoted by s, where if Embedded Image, we set Embedded Image, else Embedded Image. We denote by s the number of non-zero entries in Δd. BI is a random graph with each edge Embedded Image with probability p. Embedded Image Embedded Image Embedded Image δc and δd are selected large enough to guarantee positive definiteness. Similar to the previous case, we generate two blocks of data samples following Gaussian distributions using Embedded Image and Embedded Image. We only use these data samples to approximate the differential GGM to compare to the ground truth Δ.

S:6.3 Simulation Experiment Results

We consider three different types of known edge knowledge WE generated from the spatial distance between different brain regions and simulate groups to represent related anatomical regions. These three are distinguished by different p = 116, 160, 246 representing spatially related brain regions. We generate three types of datasets: Data-EG (having both edge and vertex knowledge), Data-G (with edge-level extra knowledge) and Data-V (with known node groups knowledge). We generate two blocks of data samples Xc and Xd following Gaussian distributions using Embedded Image and Embedded Image. We use these data samples to estimate the differential GGM to compare to the ground truth Δ. The details of the simulation are in Section S:6.2. We vary the sparsity of the true differential graph (s) and the number of control and case samples (nc and nd respectively) used to estimate the differential graph. For each case of p, we vary nc and nd in {p/2, p/4, p, 2p} to account for both high-dimensional and low-dimensional cases. The sparsity of the underlying differential graph is controlled by s = {0.125, 0.25, 0.375, 0.5} and sG as explained in Section S:6.2. This results in 126 different datasets representing diverse settings: different numbers of dimensions p, numbers of samples nc and nd, multiple levels of sparsity s and numbers of groups sG of the differential graph for both KE and KEG data settings.

Edge and Vertex Knowledge (KEG)

We use KDiffNet (Algorithm 1) to infer the differential structure in this case.

Figure S:2(a) shows the performance in terms of F1-score of KDiffNet in comparison to the baselines for p = 116, corresponding to 116 regions of the brain. KDiffNet outperforms the best baseline in each case by an average improvement of 414%. KDiffNet-EG does better than JEEK and NAK, which can model the edge information but cannot include group information. SDRE and DIFFEE are direct estimators but perform poorly, indicating that adding additional knowledge aids differential network estimation. JGLFUSED performs the worst in all cases. We list the detailed results in Section S:6.5.

Figure S:2:

KDiffNet Edge and Vertex Knowledge Simulation Results for p = 116 for different settings of nc, nd and s: (a) The test F1-score and (b) The average computation time (measured in seconds) per λn for KDiffNet and baseline methods.

Figure S:2(b) shows the average computation cost per λn of each method measured in seconds. In all settings, KDiffNet has lower computation cost than JEEK, SDRE and JGLFUSED across varying nc and nd, as well as across different sparsity levels of the differential network. KDiffNet is on average 24× faster than the best performing baseline. It is slower than DIFFEE owing to DIFFEE’s non-iterative closed-form solution; however, DIFFEE does not have good prediction performance. Note that B*() in KDiffNet, JEEK and DIFFEE and the kernel term in SDRE are precomputed only once prior to tuning across multiple λn. In Figure S:3(a), we plot the test F1-score for simulated datasets generated using W with p = 160, representing spatial distances between 160 different regions of the brain. This represents a larger and different set of spatial brain regions. In the p = 160 case, KDiffNet outperforms the best baseline in each case by an average improvement of 928%. Including available additional knowledge is clearly useful, as JEEK does relatively better than the other baselines. JGLFUSED performs the worst in all cases. Figure S:3(b) shows the computation cost of each method measured in seconds for each case. KDiffNet is on average 37× faster than the best performing baseline.

Figure S:3:

KDiffNet Edge and Vertex Knowledge Simulation Results for p = 160 for different settings of nc, nd and s: (a) The test F1-score and (b) The average computation time (measured in seconds) per λn for KDiffNet and baseline methods.

Figure S:4:

KDiffNet Edge and Vertex Knowledge Simulation Results for p = 246 for different settings of nc, nd and s: (a) The test F1-score and (b) The average computation time (measured in seconds) per λn for KDiffNet and baseline methods.

In Figure S:4(a), we plot the test F1-score for simulated datasets generated using a larger WE with p = 246, representing spatial distances between 246 different regions of the brain. This represents a larger and different set of spatial brain regions. In this case, KDiffNet outperforms the best baseline in each case by an average improvement of 1400%. Here as well, including available additional knowledge is clearly useful, as JEEK does relatively better than the other baselines, which do not incorporate available additional knowledge. JGLFUSED again performs the worst in all cases. Figure S:4(b) shows the computation cost of each method, measured in seconds, for each case. KDiffNet has the lowest computation cost across the different data generation settings and is on average 20× faster than the best performing baseline. For detailed results, see Section S:6.5.

We could not compare against Diff-CLIME because it takes more than 2 days to finish the p = 246 case.

Edge Knowledge (KE)

Given known WE, we use KDiffNet-E to infer the differential structure in this case.

Figure S:5(a) shows the F1-score of KDiffNet-E in comparison to the baselines for p = 116, corresponding to 116 spatial regions of the brain. In the p = 116 case, KDiffNet-E outperforms the best baseline in each case by an average improvement of 23%. While JEEK, DIFFEE and SDRE perform similarly to each other, JGLFUSED performs the worst in all cases.

Figure S:5:

KDiffNet-E Simulation Results for p = 116 for different settings of nc, nd and s: (a) The test F1-score and (b) The average computation time (measured in seconds) per λn for KDiffNet-E and baseline methods.

Figure S:5(b) shows the computation cost of each method, measured in seconds, for each case. KDiffNet-E has the lowest computation cost across all settings of nc, nd and sparsity of the differential network. For p = 116, KDiffNet-E, owing to an entry-wise parallelizable closed-form solution, is on average 2356× faster than the best performing baseline.

In Figure S:6(a), we plot the test F1-score for simulated datasets generated using W with p = 160, representing spatial distances between 160 different regions of the brain. This represents a larger and different set of spatial brain regions. In the p = 160 case, KDiffNet-E outperforms the best baseline in each case by an average improvement of 67.5%. Including available additional knowledge is clearly useful, as JEEK does relatively better than the other baselines, which do not incorporate available additional knowledge. JGLFUSED performs the worst in all cases. Figure S:6(b) shows the computation cost of each method, measured in seconds, for each case. Again, KDiffNet-E has the lowest computation cost across all settings of nc, nd and sparsity of the differential network, and is on average 3300× faster than the best performing baseline.

In Figure S:7(a), we plot the test F1-score for simulated datasets generated using a larger W with p = 246, representing spatial distances between 246 different regions of the brain. This represents a larger and different set of spatial brain regions. In this case, KDiffNet-E outperforms the best baseline in each case by an average improvement of 66.4%. Including available additional knowledge is clearly useful, as JEEK does relatively better than the other baselines, which do not incorporate available additional knowledge. JGLFUSED performs the worst in all cases. Figure S:7(b) shows the computation cost of each method, measured in seconds, for each case. KDiffNet-E has the lowest computation cost across all settings of nc, nd and sparsity of the differential network, and is on average 3966× faster than the best performing baseline.
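To illustrate what an entry-wise parallelizable closed-form solution can look like, the sketch below applies a soft-threshold whose level differs per entry. It assumes, as our reading rather than something stated in this section, that the edge-knowledge weights scale the threshold of each entry as λn · WE[i, j], so that entries with larger weights are penalized more. The function name `weighted_soft_threshold` and the toy matrices are hypothetical.

```python
import numpy as np

def weighted_soft_threshold(b, w, lam):
    """Entry-wise soft-threshold with per-entry level lam * w[i, j] (assumed form)."""
    thresh = lam * w
    return np.sign(b) * np.maximum(np.abs(b) - thresh, 0.0)

# Toy difference of backward mappings (symmetric, zero diagonal).
b = np.array([[0.0, 0.5, -0.5],
              [0.5, 0.0, 0.2],
              [-0.5, 0.2, 0.0]])

# Toy weights: the (0, 2) entry is heavily penalized, the others are not.
w = np.array([[1.0, 1.0, 10.0],
              [1.0, 1.0, 1.0],
              [10.0, 1.0, 1.0]])

delta_e = weighted_soft_threshold(b, w, lam=0.1)
```

Each output entry depends only on the matching entries of `b` and `w`, so the operation vectorizes trivially and can be split across workers without communication.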

Figure S:6:

KDiffNet-E Simulation Results for p = 160 for different settings of nc, nd and s: (a) The test F1-score and (b) The average computation time (measured in seconds) per λn for KDiffNet-E and baseline methods.

Figure S:7:

KDiffNet-E Simulation Results for p = 246 for different settings of nc, nd and s: (a) The test F1-score and (b) The average computation time (measured in seconds) per λn for KDiffNet-E and baseline methods.

Node Group Knowledge

We use KDiffNet-G to estimate the differential network with the known groups as extra knowledge. We vary the number of groups sG and the number of samples nc and nd for each case of p ∈ {116, 160, 246}. Figure S:8 shows the F1-score of KDiffNet-G and the baselines for p = 116. KDiffNet-G has a clear advantage when extra node group knowledge is available, since the baselines cannot model such knowledge.

Figure S:8:

KDiffNet-G Simulation Results for p = 246 for different settings of nc, nd and s: (a) The test F1-score and (b) The average computation time (measured in seconds) per λn for KDiffNet-G and baseline methods.

Varying proportion of known edges

We generate WE matrices with p = 150 using the Erdős–Rényi random graph model [9] and use the generated graph as the prior edge knowledge WE. Additionally, we simulate 15 groups of size 10, and we simulate Ωc and Ωd, as explained in Section S:6.2. Figure S:9 presents the performance of KDiffNet-EG, KDiffNet-E and DIFFEE with a varying proportion of known edges.
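A minimal sketch of this kind of setup is given below. It only illustrates sampling an Erdős–Rényi graph, revealing a chosen proportion of its edges as "known", and forming 15 node groups of size 10. The edge probability 0.05, the way known edges are sampled, the contiguous group layout, and the helper names are all assumptions; the actual construction of WE, Ωc and Ωd follows Section S:6.2 and is not reproduced here.

```python
import numpy as np

def erdos_renyi_adjacency(p, edge_prob, rng):
    """Symmetric boolean adjacency matrix of a G(p, edge_prob) graph, no self-loops."""
    upper = np.triu(rng.random((p, p)) < edge_prob, k=1)
    return upper | upper.T

def reveal_known_edges(adj, proportion, rng):
    """Keep a random subset of the edges as 'known' (hypothetical sampling scheme)."""
    iu, ju = np.nonzero(np.triu(adj, k=1))
    n_known = int(round(proportion * len(iu)))
    keep = rng.choice(len(iu), size=n_known, replace=False)
    known = np.zeros_like(adj)
    known[iu[keep], ju[keep]] = True
    return known | known.T

rng = np.random.default_rng(0)
p = 150
adj = erdos_renyi_adjacency(p, edge_prob=0.05, rng=rng)      # edge_prob is illustrative
known_half = reveal_known_edges(adj, proportion=0.5, rng=rng)

# 15 disjoint groups of 10 nodes each; row g lists the node indices of group g.
groups = np.arange(p).reshape(15, 10)
```

Sweeping `proportion` over a grid between 0 and 1 gives one prior-knowledge matrix per point on the horizontal axis of a plot such as Figure S:9.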

KDiffNet-EG has a higher F1-score than KDiffNet-E because it can additionally incorporate known group information. As expected, the F1-score of both KDiffNet-EG and KDiffNet-E improves as the proportion of known edges increases. In contrast, DIFFEE cannot make use of additional information, so its F1-score remains the same.

Scalability in p

To evaluate the scalability of KDiffNet and the baselines to large p, we also generate larger WE matrices with p = 2000 using the Erdős–Rényi random graph model [9], similar to the aforementioned design. Using the generated graph as prior edge knowledge WE, we design Ωc and Ωd as explained in Section S:6.2. For the case of both edge and vertex knowledge, we use 100 groups of size 10. We measure the scalability of KDiffNet-EG and the baselines in terms of computation cost per λn.

Figure S:11 shows the computation time cost per λn for all methods. KDiffNet takes the least time for large p as well.

Choice of λn

We show the performance of all the methods as a function of the choice of λn. Figure S:10 shows the True Positive Rate (TPR) and False Positive Rate (FPR) obtained by varying λn for p = 116, s = 0.5 and nc = nd = p/2 under the Data-EG setting. KDiffNet-EG achieves a higher Area Under Curve (AUC) than all baseline methods. KDiffNet-EG also outperforms JEEK and NAK, which take into account edge knowledge but cannot model the known group knowledge.
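The TPR/FPR sweep described above can be sketched as follows. The sketch assumes that each λ yields an estimate by entry-wise soft-thresholding of a fixed score matrix, that edges are nonzero off-diagonal entries, and that the AUC is obtained with the trapezoidal rule after adding the end points (0, 0) and (1, 1). The helper names and the toy matrices are hypothetical.

```python
import numpy as np

def soft_threshold(a, lam):
    """Entry-wise soft-thresholding operator."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def tpr_fpr(delta_true, delta_hat, tol=1e-8):
    """True and false positive rates over off-diagonal (upper-triangular) entries."""
    iu = np.triu_indices(delta_true.shape[0], k=1)
    truth = np.abs(delta_true[iu]) > tol
    pred = np.abs(delta_hat[iu]) > tol
    tpr = np.sum(truth & pred) / max(int(np.sum(truth)), 1)
    fpr = np.sum(~truth & pred) / max(int(np.sum(~truth)), 1)
    return float(tpr), float(fpr)

def auc_from_points(fprs, tprs):
    """Trapezoidal area under the (FPR, TPR) points, with (0, 0) and (1, 1) added."""
    pts = sorted(set(zip(fprs, tprs)) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy problem: one true edge (0, 1); the score matrix also has a weaker spurious entry.
delta_true = np.zeros((4, 4))
delta_true[0, 1] = delta_true[1, 0] = 1.0
scores = np.zeros((4, 4))
scores[0, 1] = scores[1, 0] = 0.9
scores[2, 3] = scores[3, 2] = 0.3

fprs, tprs = [], []
for lam in [0.1, 0.5, 1.0]:
    tpr, fpr = tpr_fpr(delta_true, soft_threshold(scores, lam))
    tprs.append(tpr)
    fprs.append(fpr)
auc_example = auc_from_points(fprs, tprs)
```

In the toy problem the true edge has the largest score, so some λ separates it perfectly and the curve passes through the point with FPR 0 and TPR 1.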

Figure S:9:

F1-Score of KDiffNet-EG, KDiffNet-E and DIFFEE with varying proportion of known edges.

Figure S:10:

TPR versus FPR curves, summarized by Area Under Curve (AUC), for KDiffNet and baselines at different hyperparameter values λ.

Figure S:11:

Scalability of KDiffNet: Computation Cost (computation time per λ) as a function of p.

S:6.4 More Experiments: Brain Connectivity Estimation from Real-World fMRI
ABIDE Dataset

This data is from the Autism Brain Imaging Data Exchange (ABIDE) [7], a publicly available resting-state fMRI dataset. The ABIDE data aims to understand human brain connectivity and how it reflects neural disorders [19]. The data is retrieved from the Preprocessed Connectomes Project [4], where preprocessing is performed using the Configurable Pipeline for the Analysis of Connectomes (CPAC) [5] without global signal correction or band-pass filtering. After preprocessing with this pipeline, 871 individuals remain (468 diagnosed with autism). We examine signals for the 160 regions of interest (ROIs), i.e. p = 160 features, in the often-used Dosenbach Atlas [8]. We also include two types of available node groups: one with 40 unique groups of regions belonging to the same functional network, and another with 6 groups of nodes belonging to the same broader anatomical region of the brain.

Cross-validation

Classification is performed using the 3-fold cross-validation suggested by the literature [14][20]. We tune over λn and pick the best λn using cross-validation. The subjects are randomly partitioned into three equal sets: a training set, a validation set, and a test set. Each estimator produces an estimated differential network using the training set. Then, these differential networks are used as inputs to quadratic discriminant analysis (QDA), which is tuned via cross-validation on the validation set. Finally, accuracy is calculated by running QDA on the test set. This classification process aims to assess the ability of an estimator to learn the differential patterns of the connectome structures.
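The classification step can be sketched as below. This is a generic QDA decision rule written directly in terms of class precision matrices, plus a random three-way split of subject indices. How the estimated differential network is turned into the two class precision matrices is not specified in this section, so the sketch simply takes them as given; the function names, the equal class prior, and the toy inputs are assumptions.

```python
import numpy as np

def gaussian_log_density(x, mean, precision):
    """Gaussian log-density per row of x, dropping the shared -p/2 * log(2*pi) term."""
    centered = x - mean
    _, logdet = np.linalg.slogdet(precision)
    quad = np.einsum("ij,jk,ik->i", centered, precision, centered)
    return 0.5 * logdet - 0.5 * quad

def qda_predict(x, mean_c, omega_c, mean_d, omega_d, prior_d=0.5):
    """Return 1 where class d has the higher posterior score, else 0 (class c)."""
    score_d = gaussian_log_density(x, mean_d, omega_d) + np.log(prior_d)
    score_c = gaussian_log_density(x, mean_c, omega_c) + np.log(1.0 - prior_d)
    return (score_d > score_c).astype(int)

def three_way_split(n_subjects, rng):
    """Random partition of subject indices into train, validation and test parts."""
    perm = rng.permutation(n_subjects)
    return np.array_split(perm, 3)

# Toy check: class c is tightly concentrated, class d is widely spread.
mean = np.zeros(2)
omega_c = np.eye(2)            # precision of class c (variance 1)
omega_d = 0.01 * np.eye(2)     # precision of class d (variance 100)
x = np.array([[0.0, 0.0], [10.0, 10.0]])
preds = qda_predict(x, mean, omega_c, mean, omega_d)

train_idx, val_idx, test_idx = three_way_split(871, np.random.default_rng(0))
```

The point near the origin is assigned to the concentrated class and the distant point to the spread-out class. The rule needs both precision matrices, so it can only be applied to estimators that provide them.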

S:6.5 Detailed Simulation Results

Table S:2, Table S:3 and Table S:4 present a summary of results for KDiffNet-EG, KDiffNet-E and KDiffNet-G in terms of F1-score, respectively. We report the average F1-score (along with the standard deviation across the same setting of nc and nd) across all simulation settings for each p. Table S:5, Table S:6 and Table S:7 present a summary of computation time for KDiffNet-EG, KDiffNet-E and KDiffNet-G, respectively. We report the average computation time per λn across all simulation settings for each p.

Table S:2:

Mean Performance (and standard deviation) of KDiffNet-EG and baselines for multiple data settings.

Table S:3:

Mean Performance (and standard deviation) of KDiffNet-E and baselines for multiple data settings.

Table S:4:

Mean Performance (and standard deviation) of KDiffNet-G and baselines for multiple data settings.

Table S:5:

Mean (and standard deviation) Computation Time (measured in seconds) per λn of KDiffNet-EG and baselines for multiple data settings.

Table S:6:

Mean (and standard deviation) Computation Time (measured in seconds) per λn of KDiffNet-E and baselines for multiple data settings.

Table S:7:

Mean (and standard deviation) Computation Time (measured in seconds) per λn of KDiffNet-G and baselines for multiple data settings.

Table S:8:

KDiffNet-E, p = 160: Relative performance and speedup with respect to the best performing baseline.

Table S:9:

KDiffNet-E, p = 246: Relative performance and speedup with respect to the best performing baseline.

Footnotes

  • http://jointnets.org/

  • 1. We put details of theoretical proofs, details of how we generate simulation datasets, and concrete performance figures in the appendix. Notations with “S:” as the prefix indicate that the corresponding contents are in the appendix.

  • 2. For instance, on samples from a controlled drug study, ‘c’ may represent the ‘control’ group and ‘d’ may represent the ‘drug-treating’ group. Which of the two sample sets is used as the ‘c’ set (or ‘d’ set) does not affect the computational cost and does not influence the statistical convergence rates.

  • 3. We use the same range to tune λ1 for SDRE and λ2 for JGLFUSED. We use λ1 = 0.0001 (a small value) for JGLFUSED to ensure only the differential network is sparse. Tuning NAK is done by the package itself.

  • 4. We cannot compare to NAK and SDRE because they do not provide the precision matrices required for QDA.

  • 1. http://allmodelsarewrong.net/kliep_sparse/demo_sparse.html

  • 2. This indicates for some positive constant Embedded Image and Embedded Image for all diagonal entries. Moreover, if q = 0, then this condition reduces to Embedded Image and Embedded Image being sparse.

  • 3. The machine that we use for experiments has an Intel Core i7 CPU and 16 GB of memory.

References

  [1] O. Banerjee, L. El Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. The Journal of Machine Learning Research, 9:485–516, 2008.
  [2] Y. Bu and J. Lederer. Integrating additional knowledge into estimation of graphical models.
  [3] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-point algorithms for inverse problems in science and engineering, pages 185–212. Springer, 2011.
  [4] B. T. S. Da Wei Huang and R. A. Lempicki. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols, 4(1):44–57, 2008.
  [5] P. Danaher, P. Wang, and D. M. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2013.
  [6] A. Di Martino, C.-G. Yan, Q. Li, E. Denio, F. X. Castellanos, K. Alaerts, J. S. Anderson, M. Assaf, S. Y. Bookheimer, M. Dapretto, et al. The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular psychiatry, 19(6):659–667, 2014.
  [7] N. U. Dosenbach, B. Nardos, A. L. Cohen, D. A. Fair, J. D. Power, J. A. Church, S. M. Nelson, G. S. Wig, A. C. Vogel, C. N. Lessov-Schlaggar, et al. Prediction of individual brain maturity using fmri. Science, 329(5997):1358–1361, 2010.
  [8] L. Fan, H. Li, J. Zhuo, Y. Zhang, J. Wang, L. Chen, Z. Yang, C. Chu, S. Xie, A. R. Laird, et al. The human brainnetome atlas: a new brain atlas based on connectional architecture. Cerebral cortex, 26(8):3508–3526, 2016.
  [9] S. Liu, K. Fukumizu, and T. Suzuki. Learning sparse structural changes in high-dimensional markov networks. Behaviormetrika, 44(1):265–286, 2017.
  [10] S. Liu, J. A. Quinn, M. U. Gutmann, T. Suzuki, and M. Sugiyama. Direct learning of sparse changes in markov networks by density ratio estimation. Neural computation, 26(6):1169–1197, 2014.
  [11] S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.
  [12] R. A. Poldrack, D. M. Barch, J. Mitchell, T. Wager, A. D. Wagner, J. T. Devlin, C. Cumba, O. Koyejo, and M. Milham. Toward open sharing of task-based fmri data: the openfmri project. Frontiers in neuroinformatics, 7:12, 2013.
  [13] T. K. Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, D. Telikicherla, R. Raju, B. Shafreen, A. Venugopal, et al. Human protein reference database 2009 update. Nucleic acids research, 37(suppl 1):D767–D772, 2009.
  [14] T. Shimamura, S. Imoto, R. Yamaguchi, and S. Miyano. Weighted lasso in graphical gaussian modeling for large gene network estimation based on microarray data. 19:142–153.
  [15] C. Singh, B. Wang, and Y. Qi. A constrained, weighted-l1 minimization approach for joint discovery of heterogeneous neural connectivity graphs. arXiv preprint arxiv:1709.04090, 2017.
  [16] N. Tzourio-Mazoyer, B. Landeau, D. Papathanassiou, F. Crivello, O. Etard, N. Delcroix, B. Mazoyer, and M. Joliot. Automated anatomical labeling of activations in spm using a macroscopic anatomical parcellation of the mni mri single-subject brain. Neuroimage, 15(1):273–289, 2002.
  [17] D. C. Van Essen, S. M. Smith, D. M. Barch, T. E. Behrens, E. Yacoub, K. Ugurbil, W.-M. H. Consortium, et al. The wu-minn human connectome project: an overview. Neuroimage, 80:62–79, 2013.
  [18] B. Wang, A. Sekhon, and Y. Qi. A fast and scalable joint estimator for integrating additional knowledge in learning multiple related sparse gaussian graphical models. arXiv preprint arxiv:1806.00548, 2018.
  [19] B. Wang, A. Sekhon, and Y. Qi. Fast and scalable learning of sparse changes in high-dimensional gaussian graphical model structure. In Proceedings of The 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018. [PS].
  [20] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. 393(6684):440–442.
  [21] E. Yang, A. C. Lozano, and P. Ravikumar. Elementary estimators for high-dimensional linear regression. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 388–396, 2014.
  [22] E. Yang, A. C. Lozano, and P. D. Ravikumar. Elementary estimators for sparse covariance matrices and other structured moments. In ICML, pages 397–405, 2014.
  [23] E. Yang, A. C. Lozano, and P. K. Ravikumar. Elementary estimators for graphical models. In Advances in Neural Information Processing Systems, pages 2159–2167, 2014.
  [24] M. Yuan and Y. Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19–35, 2007.
  [25] B. Zhang and Y. Wang. Learning structural changes of gaussian graphical models in controlled experiments. arXiv preprint arxiv:1203.3532, 2012.
  [26] S. D. Zhao, T. T. Cai, and H. Li. Direct estimation of differential networks. page asu009.
  [27] S. D. Zhao, T. T. Cai, and H. Li. Direct estimation of differential networks. Biometrika, 101(2):253–268, 2014.

References

  [1] E. Belilovsky, G. Varoquaux, and M. B. Blaschko. Testing for differences in gaussian graphical models: applications to brain connectivity. In Advances in Neural Information Processing Systems, pages 595–603, 2016.
  [2] P. J. Bickel and E. Levina. Covariance regularization by thresholding. The Annals of Statistics, pages 2577–2604, 2008.
  [3] Y. Bu and J. Lederer. Integrating additional knowledge into estimation of graphical models.
  [4] C. Craddock. Preprocessed connectomes project: open sharing of preprocessed neuroimaging data and derivatives. In 61st Annual Meeting. AACAP, 2014.
  [5] C. Craddock, S. Sikka, B. Cheung, R. Khanuja, S. Ghosh, C. Yan, Q. Li, D. Lurie, J. Vogelstein, R. Burns, et al. Towards automated analysis of connectomes: The configurable pipeline for the analysis of connectomes (c-pac). Front Neuroinform, 42, 2013.
  [6] P. Danaher, P. Wang, and D. M. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2013.
  [7] A. Di Martino, C.-G. Yan, Q. Li, E. Denio, F. X. Castellanos, K. Alaerts, J. S. Anderson, M. Assaf, S. Y. Bookheimer, M. Dapretto, et al. The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular psychiatry, 19(6):659–667, 2014.
  [8] N. U. Dosenbach, B. Nardos, A. L. Cohen, D. A. Fair, J. D. Power, J. A. Church, S. M. Nelson, G. S. Wig, A. C. Vogel, C. N. Lessov-Schlaggar, et al. Prediction of individual brain maturity using fmri. Science, 329(5997):1358–1361, 2010.
  [9] P. Erdős and A. Rényi. On random graphs I. Publ. Math. Debrecen, 6:290–297, 1959.
  [10] F. Fazayeli and A. Banerjee. Generalized direct change estimation in ising model structure. In International Conference on Machine Learning, pages 2281–2290, 2016.
  [11] T. Ideker and N. J. Krogan. Differential network biology. Molecular systems biology, 8(1):565, 2012.
  [12] S. Liu, J. A. Quinn, M. U. Gutmann, T. Suzuki, and M. Sugiyama. Direct learning of sparse changes in markov networks by density ratio estimation. Neural computation, 26(6):1169–1197, 2014.
  [13] S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.
  [14] R. A. Poldrack, P. C. Fletcher, R. N. Henson, K. J. Worsley, M. Brett, and T. E. Nichols. Guidelines for reporting an fmri study. Neuroimage, 40(2):409–414, 2008.
  [15] P. Ravikumar, M. J. Wainwright, G. Raskutti, B. Yu, et al. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
  [16] A. J. Rothman, E. Levina, and J. Zhu. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association, 104(485):177–186, 2009.
  [17] T. Shimamura, S. Imoto, R. Yamaguchi, and S. Miyano. Weighted lasso in graphical gaussian modeling for large gene network estimation based on microarray data. 19:142–153.
  [18] C. Singh, B. Wang, and Y. Qi. A constrained, weighted-l1 minimization approach for joint discovery of heterogeneous neural connectivity graphs. arXiv preprint arxiv:1709.04090, 2017.
  [19] D. C. Van Essen, S. M. Smith, D. M. Barch, T. E. Behrens, E. Yacoub, K. Ugurbil, W.-M. H. Consortium, et al. The wu-minn human connectome project: an overview. Neuroimage, 80:62–79, 2013.
  [20] G. Varoquaux, A. Gramfort, J.-B. Poline, and B. Thirion. Brain covariance selection: better individual functional connectivity models using population prior. In Advances in neural information processing systems, pages 2334–2342, 2010.
  [21] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1-2):1–305, 2008.
  [22] B. Wang, A. Sekhon, and Y. Qi. A fast and scalable joint estimator for integrating additional knowledge in learning multiple related sparse gaussian graphical models. arXiv preprint arxiv:1806.00548, 2018.
  [23] B. Wang, A. Sekhon, and Y. Qi. Fast and scalable learning of sparse changes in high-dimensional gaussian graphical model structure. In Proceedings of The 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018. [PS].
  [24] E. Yang, A. C. Lozano, and P. K. Ravikumar. Elementary estimators for graphical models. In Advances in Neural Information Processing Systems, pages 2159–2167, 2014.
  [25] S. D. Zhao, T. T. Cai, and H. Li. Direct estimation of differential networks. Biometrika, 101(2):253–268, 2014.
Posted July 28, 2019.
Subject Area

  • Bioinformatics