Abstract
We present VBASS, a Bayesian method that integrates single-cell expression and de novo variant (DNV) data to improve the power of disease risk gene discovery. VBASS models the disease risk prior as a function of expression profiles, approximated by deep neural networks. It learns the weights of the neural networks and the parameters of Poisson likelihood models of DNV counts jointly from expression and genetics data. On simulated data, VBASS shows proper error rate control and better power than state-of-the-art methods. We applied VBASS to published datasets and identified more candidate risk genes with support from the literature or from independent cohorts.
Background
About 3% of children are born with congenital anomalies or will develop neurodevelopmental disorders (NDD)1. Given the severe consequences of these conditions for reproductive fitness, risk variants with large effects are under strong negative selection and therefore have low frequency in the population. Recent genetic studies identified hundreds of risk genes for these conditions, largely through rare de novo variants2-11. However, the majority of risk genes remain unidentified10,12-15 because of limited statistical power in the analysis of rare variants16.
Cell-type-specific gene expression has long been used qualitatively to interpret biological mechanisms in developmental biology and genetics. We have previously shown that high expression in the developing heart and diaphragm is associated with an increased burden of de novo coding variants in congenital heart disease (CHD)7 and congenital diaphragmatic hernia9, respectively. We have also shown that cell-type-specific expression in the brain is associated with the plausibility of autism spectrum disorder (ASD) risk genes17,18. Gene expression profiles can therefore inform association analysis of rare variants for risk gene discovery. However, the ability to improve power in gene discovery using expression data has been hindered by the lack of rigorous statistical methods and of cell-type-specific expression data from relevant tissues during development. Recent cell atlas efforts in human and model organisms have been generating large amounts of single-cell expression data from adult tissues19,20 and, increasingly, from various developmental stages21-24. Here we describe a novel computational method that leverages expression data with probabilistic models to improve the statistical power of risk gene discovery.
VBASS (Variational inference Bayesian ASSociation) takes a vector of expression profiles, such as cell-type-specific expression from single-cell RNA-seq, and models the prior of being a risk gene as a function of the expression profile across multiple cell types. VBASS uses deep neural networks to approximate this function and semi-supervised variational inference to estimate the parameters. Although optimized for scRNA-seq data, VBASS can also be applied to bulk RNA-seq data with a simplified framework. We compared the performance of VBASS with that of extTADA under both conditions (bulk and scRNA-seq data), using simulated and published de novo variant datasets to assess error control and statistical power, and found that VBASS performed better.
Methods
The probabilistic model of VBASS
VBASS is a Bayesian mixture model with learnable priors (Figure 1). VBASS assumes that dgv, the number of genetic variants of interest of type v (LGD or Dmis de novo variants) in gene g, is drawn independently through a generative process that depends on Mgv, the aggregated background mutation rate for variant type v in gene g, and xg, the cell-type-specific gene expression profile of gene g. The components of this process are defined as follows.
πg is the gene-specific prior probability of being a disease risk gene. yg is a binary random variable that indicates whether the gene is a risk gene. It is also used to generate a mixture of posterior probabilities on the effect size γgv.
Figure 1. Model architecture of VBASS. The input can be either a vector of single-cell expression profiles or a scalar bulk expression value. For vector input, fE is a neural network that infers the gene-specific parameter πg; for scalar input, fE can be simplified to a sigmoid function with four parameters. πg parameterizes the distribution of yg, which is penalized by a Bernoulli prior through a KL penalty term. The model can also take predefined labels as input, in which case yg is given by a one-hot encoding of the labels and the Bernoulli KL penalty is replaced with a cross-entropy loss on the true label. kgv and θgv are two random variables conditioned on yg that determine the parameters of the Gamma-Poisson distribution of dgv.
We use the neural network fE to infer πg from the gene expression data xg, with a KL penalty toward a fixed Bernoulli prior (Figure 1). By default, we used a 32-dimensional encoding module followed by a 2-dimensional sampler module for π. Each module consists of a linear layer followed by ELU activation and layer normalization. We apply the same reparameterization trick as in conventional variational autoencoders, with a Bernoulli sampler, in fE25.
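The module structure described above (linear layer, then ELU, then layer normalization) can be sketched in plain Python. This is a minimal illustration and not the VBASS implementation: the initialization, the per-feature gain and shift in the normalization, the number of cell types, and all function names are assumptions made for the example, and the step that turns the 2-dimensional output into πg is left out because the text does not specify it.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def elu(x, alpha=1.0):
    """Exponential linear unit, applied element-wise."""
    return [v if v > 0 else alpha * (math.exp(v) - 1.0) for v in x]


def layer_norm(x, gain, shift, eps=1e-5):
    """Normalize across features, then apply a per-feature gain and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + s for v, g, s in zip(x, gain, shift)]


def init_module(n_in, n_out, rng):
    """Hypothetical initialization: small uniform weights, zero bias, identity norm."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    return {"weights": weights, "bias": [0.0] * n_out,
            "gain": [1.0] * n_out, "shift": [0.0] * n_out}


def apply_module(x, params):
    """Linear layer -> ELU -> layer normalization, as described in the text."""
    hidden = elu(linear(x, params["weights"], params["bias"]))
    return layer_norm(hidden, params["gain"], params["shift"])


def encode(x, encoder, sampler):
    """32-dim encoding module followed by a 2-dim sampler module."""
    return apply_module(apply_module(x, encoder), sampler)


rng = random.Random(0)
n_cell_types = 20  # hypothetical number of cell types in the expression profile
encoder = init_module(n_cell_types, 32, rng)
sampler = init_module(32, 2, rng)
profile = [rng.random() for _ in range(n_cell_types)]  # proportions of expressing cells
sampler_output = encode(profile, encoder, sampler)
```

In a trained model the weights, gains, and shifts would be learned; here they are random and serve only to show the data flow from an expression profile to the 2-dimensional sampler output.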
γgv is a random variable that denotes the enrichment rate of damaging variants of type v in the patient cohort, also known as the relative risk of the gene. γgv is drawn independently from a Gamma distribution p(γgv|kgv, θgv). kgv and θgv are conditioned on yg: under the null they are both equal to 1, while under the alternative they take values specific to the variant type v. We assume that the alternative values are shared across all disease risk genes to reduce the number of parameters.
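The generative process described in this section can be summarized compactly. The following is a sketch under stated assumptions rather than the authors' exact formulation: it assumes that Mgv enters the Poisson rate multiplicatively and that θgv is the scale (not the rate) of the Gamma distribution.

```latex
\begin{aligned}
\pi_g &= f_E(x_g), \\
y_g \mid \pi_g &\sim \mathrm{Bernoulli}(\pi_g), \\
\gamma_{gv} \mid y_g &\sim \mathrm{Gamma}\bigl(k_{gv}(y_g),\ \theta_{gv}(y_g)\bigr), \\
d_{gv} \mid \gamma_{gv} &\sim \mathrm{Poisson}\bigl(M_{gv}\,\gamma_{gv}\bigr).
\end{aligned}

% Integrating out \gamma_{gv} gives the Gamma-Poisson (negative binomial) marginal:
p(d_{gv} \mid y_g) =
\frac{\Gamma(d_{gv} + k_{gv})}{\Gamma(k_{gv})\, d_{gv}!}
\left(\frac{M_{gv}\theta_{gv}}{1 + M_{gv}\theta_{gv}}\right)^{d_{gv}}
\left(\frac{1}{1 + M_{gv}\theta_{gv}}\right)^{k_{gv}}
```

Under these assumptions the expected count is Mgv kgv θgv, so the product kgv θgv plays the role of the mean relative risk.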
The loss function is given by the evidence lower bound (ELBO), which combines a KL penalty term and an expectation term. The KL penalty term regularizes the gene-specific prior π toward a hyperparameter (Figure 1) that reflects the average proportion of risk genes. The expectation term quantifies the log likelihood of d conditioned on y, integrated over the distribution parameterized by π. We use stochastic gradient descent to estimate the learnable parameters. The estimated parameters are then used to calculate the posterior probability of association (PPA), the posterior probability that a gene is a risk gene.
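As an illustration of the last step, the PPA of a single gene can be computed with Bayes' rule from its prior and the likelihoods of its variant counts under the two mixture components. The sketch below makes several assumptions that are not taken from the text: the null component is a plain Poisson with rate Mgv, the alternative is a Gamma-Poisson with shape k and scale θ, variant types are independent given the risk status, and all function names are hypothetical.

```python
import math


def log_poisson(d, rate):
    """Log probability of d counts under Poisson(rate), rate > 0."""
    return d * math.log(rate) - rate - math.lgamma(d + 1)


def log_gamma_poisson(d, m, k, theta):
    """Log probability of d counts when the rate is m * gamma and
    gamma ~ Gamma(shape=k, scale=theta) (negative binomial marginal)."""
    p = m * theta / (1.0 + m * theta)
    return (math.lgamma(d + k) - math.lgamma(k) - math.lgamma(d + 1)
            + d * math.log(p) + k * math.log(1.0 - p))


def ppa(pi_g, counts, rates, shapes, scales):
    """Posterior probability that a gene is a risk gene.

    counts, rates, shapes, scales hold one entry per variant type
    (for example LGD and Dmis)."""
    log_alt = math.log(pi_g)
    log_null = math.log(1.0 - pi_g)
    for d, m, k, theta in zip(counts, rates, shapes, scales):
        log_alt += log_gamma_poisson(d, m, k, theta)
        log_null += log_poisson(d, m)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_alt, log_null)
    alt, null = math.exp(log_alt - top), math.exp(log_null - top)
    return alt / (alt + null)
```

With a prior of 0.03, a gene carrying several LGD variants where few are expected obtains a PPA close to 1, whereas a gene with no variants ends up slightly below its prior.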
When the gene expression data xg is a scalar, i.e., bulk RNA-seq data or the average expression of a certain cell type in scRNA-seq data, fE can be rewritten as a sigmoid-shaped function, corresponding to a linear transformation followed by a sigmoid activation, while the other parts of the model remain the same.
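A sigmoid-shaped prior of this kind might look as follows. The parameter names A, B, and C follow the Results section, but how they enter the function is an assumption made for this sketch: B and C form the linear transformation of the expression value, and A scales the sigmoid output so that the prior is bounded above by A.

```python
import math


def sigmoid_prior(x, a, b, c):
    """Hypothetical scalar prior: a * sigmoid(b * x + c).

    x is a scalar expression covariate such as a rank percentile in [0, 1];
    a caps the prior, b sets the slope, c sets the offset."""
    return a / (1.0 + math.exp(-(b * x + c)))
```

For a positive slope b, the prior increases monotonically with expression, which matches the stated intuition that genes with higher expression in the relevant tissue are more likely to be risk genes.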
Given the PPA of all genes, we calculate the Bayesian false discovery rate (FDR) from the estimated false discovery proportion, following the method described in He et al., 201312:
$$\mathrm{FDR}_k = \frac{1}{k} \sum_{i=1}^{k} \left(1 - \mathrm{PPA}_i\right)$$
where i is the rank index of genes (starting from the highest PPA), and FDRk is the estimated FDR of the gene ranked at k.
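The rank-based estimate can be sketched in a few lines: genes are sorted by decreasing PPA, and the FDR at rank k is the average of 1 − PPA over the top k genes. The function name and the dictionary input are illustrative choices, not part of VBASS.

```python
def bayesian_fdr(ppa_by_gene):
    """Map each gene to its estimated Bayesian FDR.

    ppa_by_gene: dict of gene name -> posterior probability of association."""
    ranked = sorted(ppa_by_gene.items(), key=lambda item: item[1], reverse=True)
    fdr = {}
    cumulative = 0.0
    for k, (gene, ppa) in enumerate(ranked, start=1):
        cumulative += 1.0 - ppa      # expected number of false discoveries so far
        fdr[gene] = cumulative / k   # expected proportion among the top k genes
    return fdr
```

Because each step adds a term at least as large as the previous ones, the estimate is non-decreasing along the ranking, so a cutoff such as FDR ≤ 0.1 selects a contiguous set of top-ranked genes.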
Parameter inference for VBASS
The parameters of VBASS can be inferred with either unsupervised or semi-supervised training. The scalar version has only six parameters to estimate, which makes fully unsupervised training via MCMC feasible. In practice, we used the rstan package with 4 chains and 2000 iterations. For the neural network version, it is better to train in a semi-supervised manner to avoid convergence issues. In practice, we trained the model in two steps. First, we pre-trained the model using known risk genes labeled as positives and genes that harbor LGD variants in a control cohort as negatives, replacing the Bernoulli KL penalty with a cross-entropy loss26. The known risk genes (59 in total) were randomly picked from SFARI27 (release 2021 Q4) genes with a score of 1, while the negative controls (86 in total) were picked from genes with LGD variants in a control cohort14 (Table S1). During pre-training we set a large learning rate to make the model converge faster. The parameters estimated from pre-training were then used as initial values in the second step, unsupervised training, which uses all genes without labels and reduces the learning rate after each epoch. In practice, we used 50 epochs of semi-supervised pre-training and 60 epochs of unsupervised training. After training, we calculated the PPA for all genes using the estimated parameters. For the simulated datasets, we estimated FDR on all genes to measure statistical power. For the real datasets, we removed the known risk genes selected as positives in training when estimating FDR to identify candidate risk genes.
De novo variants (DNV) and gene expression data
We obtained DNV data sets from a publication on congenital heart disease (CHD)13 with 2,645 parent-offspring trios (Table S2) and a preprint on autism spectrum disorder (ASD)11 with 16,616 trios (Table S3). The latter is a combined data set from exome or whole genome sequencing data of the SPARK consortium28, Simons Simplex Collection29, Autism Sequencing Consortium30, and MSSNG31. For CHD, the gene expression rank was based on bulk RNA-seq data of mouse developing heart at E14.5, following previous publications6,7. For ASD, we obtained single-cell RNA-seq data of human fetal midbrain and prefrontal cortex from two publications21,22. We used the combination of developmental time and cell ontology annotations described in the two publications to define cell types. Small clusters of cell types with fewer than 10 cells were removed. For each gene and cell type, we calculated the proportion of cells that express the gene and used it as input to VBASS.
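The expression input described above can be computed as sketched below. The data layout (one cell type label and one dictionary of counts per cell), the function name, and the definition of "expressed" as a count greater than zero are assumptions made for the example.

```python
from collections import defaultdict


def expression_proportions(cells, min_cells=10):
    """Per-gene proportion of expressing cells in each retained cell type.

    cells: iterable of (cell_type, {gene: count}) pairs, one per cell.
    Cell types with fewer than min_cells cells are dropped.
    Returns (cell_types, {gene: [proportion per retained cell type]})."""
    n_cells = defaultdict(int)
    n_expressing = defaultdict(lambda: defaultdict(int))
    for cell_type, counts in cells:
        n_cells[cell_type] += 1
        for gene, count in counts.items():
            if count > 0:  # assumed definition of "expresses the gene"
                n_expressing[cell_type][gene] += 1
    kept = sorted(ct for ct, n in n_cells.items() if n >= min_cells)
    genes = sorted({gene for ct in kept for gene in n_expressing[ct]})
    profile = {
        gene: [n_expressing[ct].get(gene, 0) / n_cells[ct] for ct in kept]
        for gene in genes
    }
    return kept, profile
```

Each gene then has one value per retained cell type, which together form the vector xg passed to the model.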
Annotation of de novo variants and background mutation rate calculation
We used ANNOVAR32 and VEP33 to annotate variants with protein-coding consequences and predicted damaging scores for missense variants. We classified variants as LGD (likely gene disrupting, including frameshift, stop gained/lost, start lost, and splice acceptor/donor), Dmis (damaging missense variants, defined by a REVEL34 score ≥ 0.5), missense, or synonymous. For each variant type, we calculated the expected number of variants from a background mutation rate model7,35 given the sample size. In-frame deletions/insertions (multiples of 3 nucleotides) and other splice region variants were excluded from the analysis. Variants in olfactory receptor genes, HLA genes, or the MUC gene family were also excluded.
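The classification rules above can be expressed as a small function. This is a sketch that assumes each variant has already been reduced to a single consequence written as a VEP-style Sequence Ontology term plus an optional REVEL score; the function name and the handling of a missing REVEL score are illustrative choices.

```python
LGD_CONSEQUENCES = {
    "frameshift_variant",
    "stop_gained",
    "stop_lost",
    "start_lost",
    "splice_acceptor_variant",
    "splice_donor_variant",
}


def classify_variant(consequence, revel=None, dmis_threshold=0.5):
    """Return 'LGD', 'Dmis', 'missense', 'synonymous', or None if excluded."""
    if consequence in LGD_CONSEQUENCES:
        return "LGD"
    if consequence == "missense_variant":
        # Assumption: a missense variant without a REVEL score is not called Dmis.
        if revel is not None and revel >= dmis_threshold:
            return "Dmis"
        return "missense"
    if consequence == "synonymous_variant":
        return "synonymous"
    return None  # e.g. in-frame indels and other splice region variants
```

Gene-level exclusions (olfactory receptor, HLA, and MUC genes) would be applied separately, before counting variants per gene.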
Generation of simulation datasets
We simulated two datasets to test VBASS’s performance with bulk and scRNA-seq data, respectively. For the first scenario, we estimated the parameters from the real dataset and then used the estimated hyperparameters to generate simulated datasets from the Bayesian mixture model. Specifically, we randomly assigned 3.7% of genes as risk genes and drew the covariates (gene expression rank) of the risk genes from the sigmoid distribution function. The de novo damaging variants were drawn from a Gamma-Poisson distribution with relative risks of 20 and 12 for LoF and Dmis, respectively. For non-risk genes, we drew covariates from a uniform distribution and de novo variants from a Poisson distribution. We ran the simulation under sample sizes ranging from 2,645 to 20,000. For each sample size, we simulated 100 datasets and fit both models (VBASS and extTADA) to each simulated dataset independently to estimate the hyperparameters, which were used to calculate the posterior probability of association (PPA) and then the Bayesian false discovery rate (FDR) from the implied false discovery proportion. We also performed one-tailed Poisson tests independently on each simulated dataset to show the baseline statistical power, with FDR calculated by the Benjamini-Hochberg (BH) method.
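The count-generation part of the first scenario can be sketched as follows for a single variant type. The 3.7% risk fraction and the mean relative risk of 20 come from the text; the Gamma shape parameter, the function names, and the use of Knuth's Poisson sampler are assumptions made for the example, and the expression covariate is omitted.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(expected, risk_fraction=0.037, mean_rr=20.0, shape=2.0, seed=0):
    """Simulate risk status and de novo variant counts for one variant type.

    expected: expected number of variants per gene under the null,
              i.e. the background mutation rate scaled by the sample size.
    shape:    hypothetical Gamma shape; the scale is set so the mean is mean_rr."""
    rng = random.Random(seed)
    is_risk, counts = [], []
    for m in expected:
        risk = rng.random() < risk_fraction
        # Risk genes: relative risk drawn from a Gamma distribution (Gamma-Poisson counts).
        # Non-risk genes: relative risk fixed at 1 (plain Poisson counts).
        gamma = rng.gammavariate(shape, mean_rr / shape) if risk else 1.0
        counts.append(sample_poisson(m * gamma, rng))
        is_risk.append(risk)
    return is_risk, counts
```

Repeating the call with different seeds and larger values of `expected` mimics the repeated simulations at increasing sample sizes.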
For the second scenario, the simulation is based on a real single-cell dataset, for which we created a non-linear function that maps cell-type-specific expression to the prior of being a risk gene in the following steps. First, we performed a singular value decomposition (SVD) on the expression data of 59 known ASD risk genes (picked randomly from SFARI27 genes with a score of 1) and 86 negative control genes (picked randomly from genes with LGD variants in a control cohort14) (Table S1). Next, we fit a logistic regression model with an elastic net penalty on the eigenvectors that explain 95% of the variance. The regression model was applied to all other genes, and the output probabilities were squared and scaled to have an average of 3.2%, which matches the average proportion of risk genes estimated by the extTADA model. This value served as the simulated prior of being a risk gene, from which disease risk genes were randomly sampled. The de novo damaging variants were drawn from a Gamma-Poisson distribution for disease risk genes and from a Poisson distribution for non-risk genes, with the same sample size and relative risk as in the real ASD dataset. We performed the simulation 50 times with the same simulated prior and disease risk genes, then estimated the hyperparameters and calculated the PPA and Bayesian FDR independently on each simulated dataset for both models.
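The squaring and rescaling step, followed by sampling of risk genes, can be sketched as below. The inputs stand in for the logistic regression outputs, which are not reproduced here; the function names and the clipping at 1 are assumptions made for the example.

```python
import random


def simulated_prior(probabilities, target_mean=0.032):
    """Square the regression outputs and rescale them to the target average.

    Values are clipped at 1 as a safeguard; if clipping occurs, the realized
    mean falls slightly below target_mean."""
    squared = [p * p for p in probabilities]
    scale = target_mean / (sum(squared) / len(squared))
    return [min(1.0, s * scale) for s in squared]


def sample_risk_genes(prior, seed=0):
    """Draw one Bernoulli risk label per gene from its simulated prior."""
    rng = random.Random(seed)
    return [rng.random() < p for p in prior]
```

Squaring sharpens the contrast between genes with high and low predicted probability, which makes the mapping from expression to prior more strongly non-linear.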
Results
VBASS models disease risk association with both genetics and expression data
VBASS is a Bayesian mixture model with learnable priors (Figure 1). We model the number of genetic variants of interest (e.g., LGD or Dmis de novo variants) in a gene as a sample drawn independently from a mixture of Poisson and Gamma-Poisson (negative binomial) distributions. The Gamma-Poisson distribution has proven useful in modeling sparse de novo variant data12,36. Instead of using a naïve prior under which all genes share the same probability of being a disease risk gene, we assume that the prior is gene-specific and can be inferred from spatiotemporal expression data of the corresponding organ during fetal development. In VBASS, we model this prior πg as a function of the expression profile xg, approximated by a neural network fE (Figure 1 and Methods). This approximation makes it possible to take advantage of state-of-the-art stochastic gradient descent methods37. yg is a binary random variable that indicates the risk status of a gene; it follows a Bernoulli distribution with parameter πg and is constrained by a Kullback–Leibler divergence penalty toward the average proportion of disease risk genes. This penalty term can be replaced by a cross-entropy loss term if the label of the gene is known, which makes semi-supervised training possible26. γgv is a random variable that denotes the relative risk of damaging variants of type v in the patient cohort. It is drawn independently from a Gamma distribution p(γgv|kgv, θgv). We assume this distribution is shared across all disease risk genes (Methods).
VBASS can also take bulk RNA-seq data of a certain organ or cell type as input when prior knowledge of its association with disease risk is available, for example, the increased burden of damaging variants in genes highly expressed in the heart in CHD7. In that case xg is a scalar and fE can be parameterized by three parameters (A, B, C) that correspond to a linear transformation followed by a sigmoid activation (Methods). This sigmoid-shaped function quantifies the fact that genes with higher expression in the corresponding organ or cell type are more likely to harbor disease risk variants.
We trained VBASS in a semi-supervised manner with stochastic gradient descent to estimate the parameters (Methods). For the simplified version with scalar input, VBASS can be trained in a completely unsupervised manner with MCMC (Methods). The estimated parameters were used to calculate PPA and FDR for all genes (Methods).
VBASS showed better power than extTADA on simulated data with bulk RNA-seq expression
We tested the performance of VBASS and extTADA on the simulated CHD dataset (Methods). As expected, both models showed good false discovery control and local false discovery control (Supplementary Figure 1). VBASS outperformed extTADA, with better recall at the same precision level (Figure 2A), across sample sizes from 2,645 to 20,000. Although the difference in power decreases with increasing sample size, VBASS still improved recall over extTADA by roughly 10% at a sample size of 10,000, which is feasible for CHD in the next few years.
Figure 2. Performance comparison of the scalar version of VBASS and extTADA on simulated data. A) Precision-recall curves of the models; only the part with FDR ≤ 0.2 is shown for extTADA and VBASS, and only the part with FDR ≤ 0.01 is shown for the Poisson test. B) Comparison of recall (y-axis) for gene sets with different mutation rates (x-axis).
To test the power of VBASS with respect to gene size, we calculated the recall at the same significance level (FDR ≤ 0.05) for both models on genes with different mutation rates. VBASS showed better statistical power, especially for genes with higher mutation rates at small sample sizes (Figure 2B). As the sample size increases, the power difference between VBASS and extTADA becomes smaller on large genes, while VBASS still outperforms extTADA on genes with medium mutation rates (Figure 2B). Overall, our simulation results showed that VBASS can increase the statistical power for prioritizing disease risk genes by estimating the risk prior as a function of expression.
VBASS showed better power than extTADA on simulated data with scRNA-seq expression
We ran VBASS and extTADA separately on the simulated dataset (Methods). Both models showed good false discovery control (Figure 3A). To test the statistical power of VBASS and extTADA, we plotted precision-recall curves using the posterior probabilities output by VBASS and extTADA and those computed from the real parameters used in the simulation. VBASS outperformed extTADA, with higher recall at the same precision (Figure 3B). Further comparison showed good correlation between the prior inferred by VBASS and the real πg used in the simulation (Figure 3C), indicating that VBASS can reconstruct the prior of being a risk gene from single-cell expression data. Moreover, we assessed the association between the expression profile x and π via Spearman correlation, and the result from VBASS is close to the real values (Figure 3D). Overall, these results showed that our model not only reaches higher statistical power than extTADA on simulated data but also uncovers the association between cell type expression profiles and disease risk.
Figure 3. Performance of the single-cell version of VBASS on simulated data. A) True false discovery rate (real.FDR, y-axis) at different FDR cutoffs (x-axis) estimated by extTADA and VBASS. B) Comparison of precision-recall for extTADA and VBASS; only the part with FDR ≤ 0.5 is shown. C) Scatter plot of the disease risk prior (π) assigned in the simulation (y-axis) and inferred by VBASS (x-axis). Genes are colored by label and by whether they were used in semi-supervised training; TN and TP denote true negative and true positive, respectively. D) Comparison of the correlation between the real disease risk prior and cell type expression (y-axis) versus the correlation between the VBASS-inferred prior and cell type expression (x-axis). Each dot represents a cell type.
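The per-cell-type association between expression and prior can be computed with a rank correlation, sketched here with the standard library only. The function names are illustrative, and ties receive average ranks, which is the usual convention for Spearman correlation; the function assumes neither input is constant.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

Applying `spearman` to the expression proportions of one cell type across genes and the corresponding priors gives one value per cell type, as plotted in panel D.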
VBASS identified novel CHD candidate risk genes on published DNV data
We applied VBASS to a CHD data set with DNVs from 2,645 trios13. We used mouse embryonic E14.5 heart bulk RNA-seq data to set the gene expression rank percentile6,7. The estimated distribution of expression rank under the null and alternative hypotheses showed that most risk genes have a rank percentile ≥ 75% (genes with rank percentile ≥ 0.75 are roughly 3 times more likely to be risk genes than other genes) (Figure 4A; Table 1), consistent with previous burden analysis of de novo variants7. At FDR ≤ 0.1, we identified 49 candidate risk genes. In contrast, using the original TADA method, we were able to identify only 40 candidate genes (Figure 4B-C, Table 2, Table S4). Among the genes detected only by VBASS, FLT4 was reported to be a risk gene via combined analysis of de novo and inherited variants in the original paper, while TSC1 and FBN1 were in its curated CHD gene set from literature search13,38. CHD4 was reported to be significantly associated with CHD in a UK CHD cohort of 1,891 probands8, while 3 genes (FRYL, SETD5, KMT2C) have carriers of both LoF and missense variants and 2 (GANAB, KDM5A) have carriers of missense variants only in that cohort. Furthermore, 4 genes (CHD4, SETD5, KMT2C, FBN1) are significantly associated with neurodevelopmental disorders39, while 11 (CHD4, FRYL, GANAB, SETD5, MINK1, ANK3, KMT2C, IQGAP1, TSC1, KDM5A, FBN1) have carriers of both LoF and missense variants, and 2 (CAD, SLIT3) have carriers of missense variants only in that cohort. Overall, these genes have additional genetic evidence in other cohorts and are plausible candidates. These results indicate that the assumption of VBASS is biologically sound and suggest that it has higher statistical power even with a smaller cohort.
Figure 4. Performance comparison of VBASS and extTADA on CHD data. A) Disease risk prior as a function of expression rank percentile, estimated by VBASS. B) Genes identified by VBASS and extTADA at significance level 0.1. C) FDR of genes in extTADA (y-axis) and VBASS (x-axis); genes are colored by whether they are significant in both models (red), only in VBASS (purple), or only in extTADA (green) at significance level 0.1 (FDR ≤ 0.1).
Table 1. Estimated VBASS parameters in CHD data. mean, posterior mean; sd, posterior standard deviation; 2.5% and 97.5%, bounds of the 95% credible interval; n_eff, effective sample size in MCMC; Rhat, convergence diagnostic in MCMC.
Table 2. Genes identified by VBASS but not extTADA. dn_LGD, de novo LGD variants; dn_Dmis, de novo Dmis variants.
VBASS identified novel ASD candidate risk genes on published DNV data
Previous studies have shown that gene expression in multiple cell types in the brain is associated with ASD risk17,18,40. This in part motivated the design of VBASS. We obtained ASD DNV data from a recent preprint15 that combined exome and genome data from four studies (see Methods), and single-cell RNA-seq data of human fetal midbrain and prefrontal cortex from two publications21,22. We applied VBASS and extTADA to the full ASD data set with 16,616 trios. VBASS identified 122 genes with PPA above 0.8 (Table S5). To compare performance in identifying novel candidate risk genes, we removed the known risk genes used in training and calculated the Bayesian FDR of all other genes with VBASS and extTADA (Methods). We then compared the candidate genes identified by VBASS and extTADA at significance levels 0.05 and 0.1 (FDR ≤ 0.05 and FDR ≤ 0.1, respectively). At significance level 0.05, VBASS identified 51 genes, of which 5 were not identified as candidates by extTADA (Figure 5A, Table S6). Among the 5 genes, 2 (DLG4, PAX5) are listed as risk genes in the SFARI27 database (release 2021 Q4) with a score of 1 but were not in our training gene list. METTL23 is a transcriptional partner of GABPA and is essential for human cognition41, and disruption of METTL23 was reported to cause mild autosomal recessive intellectual disability42. ATF4 was reported to have significantly altered expression in the middle frontal gyrus of ASD subjects43. At significance level 0.1, VBASS identified 75 genes, of which 6 were not identified by extTADA (Figure 5A, Table S6). Among the 6 genes, 2 (ZMYND8, CASZ1) have a score of 1 in the SFARI database and CMPK2 has a score of 3. LMTK3 was reported to cause behavioral abnormalities such as locomotor hyperactivity and reduced anxiety in knockout mouse models44,45. Furthermore, 7 of the 11 genes identified only by VBASS (DLG4, METTL23, SPRY2, LMTK3, PFN2, CASZ1, ZMYND8) have additional genetic evidence in related cohorts39.
Six genes (CCDC40, FUBP3, PRKAR1B, SIN3A, ITGB5, PMM2) were identified by extTADA but not by VBASS, likely because of their low detection rates or low co-expression strength with other candidates in the single-cell datasets. Finally, we examined which cell types are most strongly associated with disease risk. According to Spearman correlation analysis, oculomotor/trochlear nucleus (hOMTN), GABAergic neurons (hGaba), and dopaminergic neurons (hDA1) at gestation weeks 9-10 are more associated with autism risk, while microglia and endothelial cells (hEndo) are less associated with autism risk (Figure 5B). This observation is consistent with previous evidence of abnormalities in GABAergic neurons and synapses in neurodevelopmental disorders that share symptoms with ASD46, and reductions in GABA have been reported in several brain regions in children with ASD47,48. There is also evidence that dopaminergic dysfunction is associated with autistic-like behavior49,50.
Figure 5. Performance comparison of VBASS and extTADA on ASD data. A) FDR of genes in extTADA (y-axis) or VBASS (x-axis). B) Spearman correlation between cell type expression and disease risk prior (π). The cell types from the two single-cell data sets are shown separately, each ordered by correlation with π.
Discussion
In this study, we described VBASS, a method for identifying candidate risk genes by joint analysis of de novo variants in cases and gene expression profiles of normal samples. The core idea of the method is that the prior probability that a gene increases disease risk is a function of its expression profile in relevant cell types, and that we can estimate the parameters of this function from the data in an empirical Bayesian framework. For bulk RNA-seq data, we set the function to be a sigmoid function with three parameters. For scRNA-seq data, we use deep neural networks to approximate the function and learn the contribution of cell types jointly with genetic data. Using simulation, we showed that VBASS has accurate error rate control and better statistical power than existing methods under both scenarios.
We applied VBASS to a published CHD DNV data set and estimated that high-expression genes are approximately 3 times more likely to be risk genes than low-expression genes in the developing heart. We identified 14 more candidate risk genes, 6 of which have additional support in independent cohorts. We applied VBASS to a published ASD DNV data set and identified 5 and 6 more candidate genes at significance levels 0.05 and 0.1, respectively, 8 of which have literature support or additional genetic evidence in neurodevelopmental disorders. Moreover, we showed that gene expression profiles of GABAergic neurons and dopaminergic neurons during gestation weeks 9-10 are strongly associated with autism risk, indicating their potential roles in neural circuit formation.
VBASS is based on the biological hypothesis that the gene expression level in relevant cell or tissue types informs the plausibility of being a disease risk gene. The bulk RNA-seq version is optimized for a single expression profile that is informative of disease risk, such as bulk RNA sequencing data for congenital heart disease. The single-cell RNA-seq version is optimized for conditions in which multiple cell types and time points are associated with disease risk. One alternative approach to improving power with informative non-genetic data is to calculate p-values for each gene using genetic data and then optimize FDR estimation using the non-genetic data as covariates51-53. While this approach is generalizable, these methods require p-values to be properly (uniformly) distributed under the null. In the analysis of de novo or ultra-rare variants, the data is usually too sparse to support a proper distribution of p-values under the null. VBASS does not have this limitation.
A limitation of VBASS is that it only estimates the association of cell types with disease risk. It is not designed to answer questions about whether a certain cell type confers causality in the diseases caused by risk variants. Additionally, the performance of VBASS is partially determined by how well the expression data captures true expression states of genes. In this study, we used average expression of genes in cells within a cell type inferred from single cell data. This approach has limitations in representing rare and transient cellular states. More advanced representation, like RNA velocity54,55, together with more comprehensive measurements of cell types may improve the model.
Finally, we note that the inference part of VBASS is not limited to scRNA-seq data but could be extended, without much modification of the architecture, to other functional genomics modalities of genes, such as single-cell ATAC-seq data or regulator-target information.
Conclusions
We developed VBASS, a new computational method that integrates expression data with Bayesian probabilistic models to improve statistical power of risk gene discovery. It showed proper error rate control and better power than current Bayesian methods in simulation and real datasets. VBASS is freely available for academic use at: https://github.com/ShenLab/VBASS.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and materials
All data sets (de novo variants and gene expression data) were obtained from publications and are available from the corresponding publications. VBASS is available on GitHub: https://github.com/ShenLab/VBASS.
Competing interests
The authors have declared that no conflict of interest exists.
Funding
This work was supported by NIH grants R01GM120609, R03HL147197, and U01HL153009, and Simons Foundation Autism Research Initiative (SIMONS606450).
Authors’ contributions
Conceptualization, Y.S.; Methodology, G.Z. and Y.S.; Software, G.Z.; Investigation, G.Z., Y.A.C. and Y.S.; Writing – Original Draft, G.Z. and Y.S.; Writing – Review & Editing, G.Z., Y.A.C. and Y.S.; Supervision, Y.S.; Funding Acquisition, Y.S.
Supplemental materials
Supplementary Table 1. Labels of genes for VBASS in semi-supervised training.
Supplementary Table 2. De novo variants of 2645 CHD trios in Jin et al 2017.
Supplementary Table 3. De novo variants of 16616 ASD trios in Zhou et al 2021.
Supplementary Table 4. Posterior probabilities of all genes calculated in CHD cohort by VBASS and extTADA.
Supplementary Table 5. Posterior probabilities of all genes calculated in ASD cohort by VBASS and extTADA.
Supplementary Table 6. Posterior probabilities of all genes calculated in ASD cohort by VBASS and extTADA. Removed positive training genes when calculating FDR.
Acknowledgements
We would like to thank Dr. Wendy Chung, Dr. David Knowles, Dr. Nicholas Tatonetti, Dr. Haicang Zhang, Dr. Xueya Zhou, Dr. Xiao Fan, Yige Zhao, Joseph Obiajulu, Xi Fu, and members of Shen lab for helpful discussions.
Footnotes
Abstract revised. Figure 1 and Figure 2 revised.