Hierarchical Markov Random Field model captures spatial dependency in gene expression, demonstrating regulation via the 3D genome

HiC technology has revealed many details about the eukaryotic genome’s complex 3D architecture. It has been shown that the genome is separated into organizational structures which are associated with gene expression. However, to the best of our knowledge, no studies have quantitatively measured the level of gene expression in the context of the 3D genome. Here we present a novel model that integrates data from RNA-seq and HiC experiments, and determines how much of the variation in gene expression can be accounted for by the genes’ spatial locations. We used Poisson hierarchical Markov Random Field (PhiMRF), to estimate the level of spatial dependency among protein-coding genes in two different human cell lines. The inference of PhiMRF follows a Bayesian framework, and we introduce the Spatial Interaction Estimate (SIE) to measure the strength of spatial dependency in gene expression. We find that the quantitative expression of genes in some chromosomes show meaningful positive intra-chromosomal spatial dependency. Interestingly, the spatial dependency is much stronger than the dependency based on linear gene neighborhoods, suggesting that 3D chromosome structures such as chromatin loops and Topologically Associating Domains (TADs) are strongly associated with gene expression levels. In some chromosomes the spatial dependency in gene expression is only detectable when the spatial neighborhoods are confined within TADs, suggesting TAD boundaries serve as insulating barriers for spatial gene regulation in the genome. We also report high inter-chromosomal spatial correlations in the majority of chromosome pairs, as well as the whole genome. Some functional groups of genes show strong spatial dependency in gene expression as well, providing new insights into the regulation mechanisms of these molecular functions. This study both confirms and quantifies widespread spatial correlation in gene expression. We propose that, with the growing influx of HiC data complementing gene expression data, the use of spatial dependence should be an integral part of the toolkit in the computational analysis of the relationship between chromosome structure and gene expression.

from HiC data (Figure 1), where a gene is mapped to all of its overlapping loci (Supplementary Figure S1), and its 48 spatial proximity to another gene is calculated as the summary of all of the loci pairs the two gene covers (Methods). can be observed at each gene. (b) A spatial gene network is inferred from HiC data. Each gene is treated as a node in the network. An edge exists between two genes if the spatial interaction frequency between loci overlapping with the two genes is higher than a threshold. See Methods for detailed description of the network inference. The triangular HiC heat map is generated using the 3D genome browser [28] with data from Rao et al [17].
where N i is the set of locations neighboring s i : N i = {s j : s j is a neighbor of s i } and w(N i ) = {w j : s j is a neighbor of s i }. Equation 1 is a conditionally specified model, its mean is further modelled as in Equation 2: Throughout this study, the posterior distribution of η helps us to understand the strength of the spatial dependency. The 55 main properties of the posterior η distribution that are the mean (η) and the 95% credible interval, which is obtained as 56 the 2.5% and 97.5% quantiles of the simulated posterior distribution. We will refer to the estimated posterior mean of η 57 (η) as the Spatial Interaction Estimate (SIE). If the 95% posterior credible interval for η does not contain 0, we say 58 that there is meaningful spatial dependency. The two other unknown parameters in this model are α and τ 2 , where 59 αis connected with a basal expression rate for all genes. The parameter τ 2 is connected with the conditional Gaussian 60 variance, which accounts for any remaining variance in gene expression within a sample. The same properties (mean, 61 2.5% and 97.5% quantiles) are used to summarize their posterior distributions.

62
In summary, PhiMRF models gene expression with a conditional Poisson-lognormal mixture,and a autoregressive 63 model with a parameter η that is connected with spatial dependency. We applied Bayesian inference that allowed us the 64 simulate from the posterior distributions of η, resulting in the Spatial Interaction Estimate (SIE) that symbolizes the 65 strength of the spatial dependency. Within each chromosome. We ran PhiMRF on all genes in each of the 23 human chromosomes (Y chromosome 68 excluded) in the IMR90 cell, with the spatial gene networks inferred from intra-chromosomal HiC data with 10kb 69 resolution [17]. We also implemented a linear baseline for each chromosome. The baseline takes the same gene 70 expression data for each gene but the spatial network is simply inferred from genes within 10k base pairs of each other 71 in the linear chromosome. By comparing our data with this baseline dataset, we can observe whether the long-range, 72 non-linear interactions do play a role in gene expression. We found strong evidence of positive spatial dependency 73 in eleven chromosomes: 1, 4, 5, 6, 8, 9, 12, 19, 20, 21 and X, with SIEs higher than the linear baseline ( Figure 2a, 74 Supplementary Tables S1 and S2). This suggested genes in these chromosomes are co-dependent on their neighbors 75 when it comes to gene expression, proving that genes that are spatially close have coordinated expression patterns on 76 a global scale. We noticed that although in all these eleven chromosomes the 95% credible interval of η is greater 77 than zero, their ranges vary greatly. Although these SIEs are all larger than their linear counterpart, the intervals do 78 not suggest that the difference between HiC and linear is statistically significant, except in Chromosomes 1, 9 and 21.

79
However, the differences in these SIEs should not be quantitatively compared (e.g. measuring the difference or ratio of 80 these SIE's), as they all have different number of genes and connectivity. There does not seem to be correlation between     Figure S2). To further investigate the genes within TADs, we isolated those edges that are 100 located within TADs as well, i.e. interactions that connect two genes within the same TAD. About half of the degree for while only nine chromosomes show positive spatial dependency when including both intra-TAD and inter-TAD edges. 106 We also ran the model on TAD genes with only inter-TAD edges, to rule out the possibility that the difference in SIE is Histogram ofτ 2 for 100 randomly permuted networks for all TAD genes in four chromosomes. Red dashed line is the observedτ 2 from the non-random HiC network. The permutation is carried out using an Erdős-Rényi algorithm with equal probability for any possible edge to be sampled. Total number of actual edges in each random graph is equal to the number of intra-TAD edges in the observed HiC network.
we carried out an in silico experiment to disrupt the TAD boundaries by randomly sampling edges in our HiC network.

114
In other words, we randomly permuted the order of the edges in each HiC network to create a reference distribution.

115
For each of these 100 random samples, edges are randomly designated between any two genes. The total number of 116 random edges is equal to the number of intra-TAD edges. We then fit PhiMRF for all 100 networks and obtain the Next we tested the hypothesis that spatial dependency of gene expression is correlated with the function of the genes.

149
Evidence suggest that co-functioning genes tend to cluster in space, and function is closely correlated with expression 150 levels, since co-functioning genes sometimes need to be co-transcribed [39,22].

151
Having developed the PhiMRF framework to understand the relationship between expression and spatial organization, 152 we can apply it to any group of genes, to see whether spatial dependency is particularly strong for certain functions,    Figure S7a).   The goal of our Bayesian framework is to simulate from the posterior distributions p(α|y), p(η|y) and p(τ 2 |y). The properties of these distributions directly answer the biological questions from which we abstracted the stochastic model.
The overall strategy is to simulate from the joint posterior distribution of p(α, η, τ 2 , w|y) using a Gibbs sampler, where we sequentially simulate from each of the full conditional posteriors of our parameters as follows.

282
The threshold is determined as a 90% quantile of all locus-locus interactions found in that chromosome or chromosome 283 pair. Therefore, the threshold changes from chromosome to chromosome, eliminating chromosomal bias. Out of the 284 four metrics, the min metric is the most conservative, as it requires that all interactions in the pool be larger than the 285 threshold, to consider the gene pair to be neighbors, while the max metric only requires that one interaction out of the 286 pool be larger than the threshold. Mean is our main metric, and all subsequent studies presented in the main text uses 287 the mean metric. Intra-chromosomal datasets are repeated using the other three metrics (median, min and max) as well 288 and presented in Supplementary Tables S9, S10 and S11. is where the ontology structure specifies that carbohydrate phosphorylation (GO:0046835) "is a" phosphorylation 295 (GO:0016310). If a protein is annotated with the former, a more informative term, then it is automatically annotated 296 with the latter, a less informative term. Since we are ranking the GO terms by the number of proteins they annotate, if 297 the annotations are propagated, the top ones will be the very general functions like "cellular process", that are in general 298 of proteins they directly annotate. These leaf annotations are directly curated from published experiments, which is 300 evidence that these terms are of interest to some researchers. The source code, prerequisites and installation guide, as well as a Docker image for the R package PhiMRF are available 303 at https://github.com/ashleyzhou972/PhiMRF under GPL-2 license.

311
[2] Robert Schneider and Rudolf Grosschedl. Dynamics and interplay of nuclear architecture, genome organization,