Gene regulatory network structure informs the distribution of perturbation effects

Gene regulatory networks (GRNs) govern many core developmental and biological processes underlying human complex traits. Even with broad-scale efforts to characterize the effects of molecular perturbations and interpret gene coexpression, it remains challenging to infer the architecture of gene regulation in a precise and efficient manner. Key properties of GRNs, like hierarchical structure, modular organization, and sparsity, provide both challenges and opportunities for this objective. Here, we seek to better understand properties of GRNs using a new approach to simulate their structure and model their function. We produce realistic network structures with a novel generating algorithm based on insights from small-world network theory, and we model gene expression regulation using stochastic differential equations formulated to accommodate modeling molecular perturbations. With these tools, we systematically describe the effects of gene knockouts within and across GRNs, finding a subset of networks that recapitulate features of a recent genome-scale perturbation study. With deeper analysis of these exemplar networks, we consider future avenues to map the architecture of gene expression regulation using data from cells in perturbed and unperturbed states, finding that while perturbation data are critical to discover specific regulatory interactions, data from unperturbed cells may be sufficient to reveal regulatory programs.


Figure S1
: Mediated effects outnumber direct effects at most magnitudes.Same as Fig. 3D, but with distances binned by whether pairs of genes are connected by an edge (distance 1, a "direct effect"), any path (distance greater than 1, a "mediated effect"), or no path at all ("null").Note also that the y-axis is the count of gene pairs with a perturbation effect of at least the magnitude given on the x-axis -that is, the distribution shown is a non-normalized inverse CDF.Gene pairs are pooled from the 50 example GRNs in Fig. 3.   showing the number of key target genes and stratifying by whether the expression equilibrium point of the synthetic GRN is stable (Methods).In all, 1,693 of the 1,920 GRNs (88.2%) reach an expression equilibrium through forward simulation of the SDE which is a stable fixed point of the corresponding ODE.These GRNs tend to be sparse (lower 1/p), modular (higher w), and have more hub regulators (lower δ out ).    S1 and S2.Similarly, for perturbation effects, we show the distribution split by whether pairs of genes share an edge ("A regulates B"), a path of distance 2 ("A indirectly regulates B"), or another relationship (right panel).At nearly all levels of coexpression, coregulation is more common than direct regulation.Meanwhile, direct regulation is more common than indirect regulation for the largest perturbation effects -note that the range of KO effects is clipped as in Fig. 6.At low levels of noise (small s), replicates from perturbed conditions are much more similar to one another than to the unperturbed data.At high levels of noise (large s), the perturbed data are more similar by canonical correlation to the unperturbed data than to the replicate perturbed data; but programs derived from each of the singular vectors are equivalently reproducible across conditions.

Figure S2 :
Figure S2: Modularity term differentiates within-and between-module effects.Same as Fig.3E, with within-module perturbation effects in red and between-module perturbation effects in blue.Here, networks are chosen so as to highlight the effect of the modularity term w.Each pair of blue and red tracelines is distribution of the within-(red) or across-module perturbation effects a single GRN.The generating parameters for these GRNs vary w (see legend) but hold other parameters constant, as follow: p = 1/4, k = 50, δ in = 10, δ out = 10.

Figure S3 :
Figure S3: Network generating parameters affect the number of KO effects.Same as Fig.4, but with summary statistic (y-axis) as the average number of perturbation effects per gene in the GRN with | log 2 FC| ≥ 0.1.We observe a similar direction of effect for each parameter as with the statistics presented in the main text.

Figure S4 :
Figure S4: Network generating parameters affect the stability of the fixed point.Same as Fig. 4, onlyshowing the number of key target genes and stratifying by whether the expression equilibrium point of the synthetic GRN is stable (Methods).In all, 1,693 of the 1,920 GRNs (88.2%) reach an expression equilibrium through forward simulation of the SDE which is a stable fixed point of the corresponding ODE.These GRNs tend to be sparse (lower 1/p), modular (higher w), and have more hub regulators (lower δ out ).

Figure S5 :
Figure S5: No interaction between sparsity term and other network generating parameters.Same as Fig. 4, but with additional stratification by the sparsity term 1/p.There is no obvious visual evidence for interactions between the parameters.

Figure S6 :
Figure S6: No interaction between out-degree term and other network generating parameters.Same as Fig.4, but with additional stratification by the out-degree term δ out .There is no obvious visual evidence for interactions between the parameters.

Figure S7 :
Figure S7: Summary of regression models for effects of network parameters on perturbations.Coefficients from regressing the logit-transformed fraction of genes which are hub knockouts (top) or target genes (bottom) on network generating parameters.Errorbars denote 95% confidence intervals for the regression coefficients.Model summaries can be found in TablesS1 and S2.

Figure S8 :
Figure S8: No interaction between network generating parameters and fit to experimental data.As in Fig.5C-E, we show the relationship between pairs of network generating parameters and goodness of fit to the cumulative distribution of perturbation effects from experimental Perturb-seq data.Each GRN (one point in every subpanel) is colored by its ranked fit to data: the synthetic GRNs are ranked separately by Kolmogorov-Smirnov p-value for incoming and outgoing perturbation effects, then the sum of these two ranks is used to produce an overall ranking.Intense red color indicates better ranked fit to data, and intense blue color indicates a worse ranking.

Figure S9 :
Figure S9: Coexpression is more often due to coregulation than edges.In the focal GRN from Fig. 6, we show a histogram of coexpression values split by whether pairs of genes share an edge ("A regulates B, or B regulates A", share a regulator ("A and B are coregulated"), or have another relationship (left panel).Similarly, for perturbation effects, we show the distribution split by whether pairs of genes share an edge ("A regulates B"), a path of distance 2 ("A indirectly regulates B"), or another relationship (right panel).At nearly all levels of coexpression, coregulation is more common than direct regulation.Meanwhile, direct regulation is more common than indirect regulation for the largest perturbation effects -note that the range of KO effects is clipped as in Fig.6.

Figure S10 :
Figure S10: Baseline coexpression and perturbation effects are uncorrelated in Perturb-seq data.Same as Fig. 6E, using data from our analysis subset of Replogle et.al. 2022 [9].Gene co-expression (x-axis) is the unsigned Pearson correlation between normalized single-cell gene expression data from unperturbed cells (clipped at |r| = 0.1).Perturbation effects (y-axis) are pairwise log-transformed Anderson-Darling pvalues for differences in gene expression distribution between perturbed and unperturbed states (clipped at − log 10 (p) = 10).Rank correlation (Spearman's ρ) is computed on the transformed but not clipped values of these two statistics.

Figure S11 :
Figure S11: True groups in the synthetic GRN are represented among gene programs.In the focal GRN from Fig. 7, we show the overlap between each of the true groups (k = 10, shown as points in each of the bins on the y-axis) and its closest matching program (maximum overlap across all 50 gene sets, values shown on the x-axis).Points corresponding to the same true group are connected with a line spanning across y-axis bins.There is similar representation of all of the groups among the learned gene programs, regardless of input data type.

Figure S12 :
Figure S12: Program replication depends on the number of cells.Same as Fig 7C,D -instead of taking downsamples of unperturbed cells from Replogle et.al., 2022, we here downsample the entire experiment to various study sizes.Here, the "entire experiment" is the normalized expression measurements of 5,247 genes in 932,593 control and intervened-upon cells which received one of the 5,247 perturbations in our analysis subset (Methods).We compare singular vectors (left) and programs (right) from the resulting downsamples of the entire experiment, as well as the subsets from Fig 7C,D.We note that the 75,328 control cells replicate the programs from the entire dataset comparably to 30,000 cells from the entire experiment.

Figure S13 :
Figure S13: Program replication depends on the magnitude of intrinsic noise.Same as Fig 7A,B for different levels of noise.We repeat CCA and analysis of gene programs as in Fig 7 (see Methods), varying the level of intrinsic noise (s).At low levels of noise (small s), replicates from perturbed conditions are much more similar to one another than to the unperturbed data.At high levels of noise (large s), the perturbed data are more similar by canonical correlation to the unperturbed data than to the replicate perturbed data; but programs derived from each of the singular vectors are equivalently reproducible across conditions.

Table S1 :
Summary of regression results (fraction of genes which are hub knockouts).

Table S2 :
Summary of regression results (fraction of genes which are hub target genes).