A statistical method for joint estimation of cis-eQTLs and parent-of-orign effects using an orthogonal framework with RNA-seq data

In the past few years extensive studies have been put on the analysis of genome function, especially on expression quantitative trait loci (eQTL) which offered promise for characterization of the functional sequencing variation and for the understanding of the basic processes of gene regulation. However, most studies of eQTL mapping have not implemented models that allow for the non-equivalence of parental alleles as so-called parent-of-origin effects (POEs); thus, the number and effects of imprinted genes remain important open questions. Imprinting is a type of POE that the expression of certain genes depends on their allelic parent-of-origin which are important contributors to phenotypic variations, such as diabetes and many cancer types. Besides, multi-collinearity is an important issue arising from modeling multiple genetic effects. To address these challenges, we proposed a statistical framework to test the main allelic effects of the candidate eQTLs along with the POE with an orthogonal model for RNA sequencing (RNA-seq) data. Using simulations, we demonstrated the desirable power and Type I error of the orthogonal model which also achieved accurate estimation of the genetic effects and over-dispersion of the RNA-seq data. These methods were applied to an existing HapMap project trio dataset to validate the reported imprinted genes and to discovery novel imprinted genes. Using the orthogonal method, we validated existing imprinting genes and discovered two novel imprinting genes with significant dominance effect. Author Summary In the past decades, an unprecedented wealth of knowledge has been accumulated for understanding variations in human DNA level. However, this DNA-level knowledge has not been sufficiently translated to understanding the mechanisms of human diseases. Gene expression quantitative trait locus (eQTL) mapping is one of the most promising approaches to fill this gap, which aims to explore the genetic basis of gene expression. Genomic imprinting is an important epigenetic phenomenon which is an important contributor to phenotypic variation in human complex diseases and may explain some of the “hidden” heritable variability. Many imprinting genes are known to play important roles in human complex diseases such as diabetes, breast cancer and obesity. However, traditional eQTL mapping approaches does not allow for the detection of imprinting which is usually involved in gene expression imbalance. In this study, we have for the first time demonstrated the orthogonal statistical model can be applied to eQTL mapping for RNA sequencing (RNA-seq) data. We showed by simulated and real data that the orthogonal model outperformed the usual functional model for detecting main effects in most cases, which addressed the issue of confounding between the dominance and additive effects. Application of the statistical model to the HapMap data resulted in discovery of some potential eQTLs with imprinting effects and dominance effects on expression of RB1 and IGF1R genes. In summary, we developed a comprehensive framework for modeling imprinting effect for eQTL mapping, by decomposing the effects to multiple genetic components. This study is providing new insights into statistical modeling of eQTL mapping with RNA-seq data which allows for uncorrelated parameter estimation of genetic effects, covariates and over-dispersion parameter.


Abstract
In the past few years extensive studies have been put on the analysis of genome function, especially on expression quantitative trait loci (eQTL) which offered promise for characterization of the functional sequencing variation and for the understanding of the basic processes of gene regulation. However, most studies of eQTL mapping have not implemented models that allow for the non-equivalence of parental alleles as so-called parent-of-origin effects (POEs); thus, the number and effects of imprinted genes remain important open questions.
Imprinting is a type of POE that the expression of certain genes depends on their allelic parentof-origin which are important contributors to phenotypic variations, such as diabetes and many cancer types. Besides, multi-collinearity is an important issue arising from modeling multiple genetic effects. To address these challenges, we proposed a statistical framework to test the main allelic effects of the candidate eQTLs along with the POE with an orthogonal model for RNA sequencing (RNA-seq) data. Using simulations, we demonstrated the desirable power and Type I error of the orthogonal model which also achieved accurate estimation of the genetic effects and over-dispersion of the RNA-seq data. These methods were applied to an existing HapMap project trio dataset to validate the reported imprinted genes and to discovery novel imprinted genes. Using the orthogonal method, we validated existing imprinting genes and discovered two novel imprinting genes with significant dominance effect.

Author Summary
In the past decades, an unprecedented wealth of knowledge has been accumulated for understanding variations in human DNA level. However, this DNA-level knowledge has not been sufficiently translated to understanding the mechanisms of human diseases. Gene expression quantitative trait locus (eQTL) mapping is one of the most promising approaches to fill this gap, which aims to explore the genetic basis of gene expression. Genomic imprinting is an important epigenetic phenomenon which is an important contributor to phenotypic variation in human complex diseases and may explain some of the "hidden" heritable variability. Many imprinting genes are known to play important roles in human complex diseases such as diabetes, breast cancer and obesity. However, traditional eQTL mapping approaches does not allow for the detection of imprinting which is usually involved in gene expression imbalance. In this study, we have for the first time demonstrated the orthogonal statistical model can be applied to eQTL mapping for RNA sequencing (RNA-seq) data. We showed by simulated and real data that the orthogonal model outperformed the usual functional model for detecting main effects in most cases, which addressed the issue of confounding between the dominance and additive effects.

Introduction
With the completion the 1000 Genomes Project [1], an unprecedented wealth of knowledge has been accumulated for understanding variations in human DNA level. However, much less of this DNA-level knowledge has been translated to understanding the mechanisms of human diseases.
Gene expression quantitative trait locus (eQTL) mapping is one of the most promising approaches to fill this gap, which aims to explore the genetic basis of gene expression [2]. Among the eQTL techniques, cis-eQTL mapping is the most commonly used technique to map local eQTLs on the same chromosome of the gene. To evaluate gene expression levels, RNA sequencing (RNA-seq) technology has recently become a widely used high-throughput tool to assess the gene expression abundance, which has many advantages over microarray, especially for discovery of novel eQTLs.
Imprinting is a type of parent-of-origin effect that the expression of certain genes depends on their allelic parent-of-origin. The same alleles transmitted from the mother may have different expression levels on transcripts with those transmitted from the father. The influence on the phenotype between the two types of heterozygotes are different, as so-called parent-of-origin effect (POE). It is now known that there are at least 80 imprinted genes in humans, many of which are involved in embryonic and placental growth and development [3]. Studies suggested that POE is an important contributor to phenotypic variation in human complex diseases and may explain some of the "hidden" heritability. An earlier study showed that for type II diabetes, a variant of SNP rs2334499 in chromosome region 11p15 was protective when maternally transmitted, whereas confers risk when paternally transmitted [4]. Evidence also demonstrated the important roles of POEs in type I diabetes, breast cancer and other carcinomas [4,5].
However, most studies of eQTL maping have not implemented models that allow for the non-equivalence of parental alleles; hence the number and effects of imprinted genes remain important open questions. In the past few years, there were a few approaches that modeled POEs while searching for eQTLs with RNA-seq data. For example, Zhabotynsky et al. proposed to jointly model genetic effect and POE with a RNA-seq statistical method which focused on modeling the allele specific expression [6].
In quantitative genetics, the partition of the variance in statistical components due to additivity, dominance does not reflect the biological (or functional) effect of the genes but it is most useful for prediction, selection, and evolution [7]. In 2013, we proposed a unified orthogonal framework for modeling genetic variants displaying imprinting effects [8]. It allowed for imprinting effect detection whereas maintained the power to detect the main allelic effect including the additive and dominance effects. From theory of quantitative genetics, statistical additive genetic effects are obtained from average allele substitution effects, whereas dominance effects reflect the deviation of the genotypic values of the heterozygotes and the expected midpoint of the two homozygotes. In recent years, the inclusion of dominance in genomic models have been proposed by several researchers. However, multi-collinearity is an important issue arising from modeling multiple genetic effects. To achieve straightforward model selection and variance component analysis, uncorrelated estimation of the additive and dominance effect was necessary. Therefore, in current study, we propose to implement an orthogonal model to jointly evaluate the effect from both additive and dominance effect along with POE in eQTL mapping.
Genetic imprinting affects complex diseases through regulating the gene expression and can reveal an important component of heritable variation that remains "hidden" in traditional complex trait studies. In this study, we hypothesized that POEs contribute to regulate gene expression along with the main allelic effect from the gene. We proposed a statistical framework to test the main allelic effects (i.e., additive and dominance effects) of the candidate eQTLs along with the POE with an orthogonal model. Intensive simulations were conducted to evaluate the methods. We also applied the methods to an existing HapMap project trio dataset to validate the reported imprinting genes and identifying eQTLs for these genes.

Simulations
The statistical power of the Stat-POE and Func-POE methods was illustrated when the POE was and 91% when the sample size was 500 ( Figure 1). As expected, the Stat-POE method yielded same power in detecting POE but more desirable power in detecting main genetic effects compared to the Func-POE model (Figures 1-2). In conclusion, the Stat-POE method outperformed the Func-POE method in most simulation scenarios and our methods all achieved sufficient power for detection of POEs with a relatively sample size (N=100).  With the simulated data, we also evaluated the estimation bias for all the parameters ( , , ) from the Stat-POE model. Table 1 shows that the estimation of all genetic effects achieved higher accuracy when sample size increased. Interestingly, the estimation of the covariate and over-dispersion parameter was not notably affected by the sample sizes. Also, the estimation of genetic effects was not obviously affected by the value of the over-dispersion parameter.
These results revealed the accurate and stable estimation of the covariates and over-dispersion parameters by using the Stat-POE model. Moreover, large sample sizes and small overdispersion ensured better overall performance of the proposed methods. Table 1. Simulation results with different sample sizes. The estimation relative bias was defined as the difference between estimated value and true parameter value divided by the true parameter value. Relative bias and mean square of errors (MSE) have been multiplied by 100. for model parameters including additive effect (α), imprinting effect (ι), covariate (β), and dispersion parameter (φ).
To evaluate the Type I error, we also simulated scenarios where there no genetic effect or POE. Although there were slightly inflated false positives in detecting genetic effect and POEs when over-dispersion was large, the type I error rates were around the nominal level 0.05 for both methods in most scenarios especially when moderate over-dispersion existed ( = 0.2 ( Table 2). The false positive rate for detecting the genetic effects was comparable between the Stat-POE and Func-POE models in most scenarios. to estimate additive and imprinting effects for 22 genes with previous evidence of imprinting.
These selected genes were identified to be imprinted genes using 296 phased trios from the 1000 Genomes Project and the Genome of the Netherlands participants [9].
With the proposed Stat-POE method, we identified 33 significant cis-eQTLs (adjusted pvalues in additive effects < 0.05) for seven genes including LPAR6, RB1, PXDC1, IGF1R, AC069277.2, IGF2BP3, SNRPN genes (S2 Table). Among them, most candidate cis-eQTLs presented maternal expression pattern in regulating the gene expression. Besides, we identified six genes with imprinting effects, including LPAR6, PER3, RB1, PXDC, IGF1R, and IGF2BP3 with adjusted p-values < 0.05 (Table 3). Among the significantly imprinted genes, the gene expression of LPAR6 and IGF1R had significant regulation from the candidate cis-eQTLs rs11633209, rs728075 and rs7329291 in additive effect (adjusted p-values = 1.96×10 -66 , 3.57×10 -64 and 3.57×10 -64 ). Interestingly, we also discovered two novel genes which presented significant dominance effect in gene expression (Table 4), including RB1 gene from multiple candidate cis-eQTLs (adjusted p-value = 3.02×10 -80 ) and IGF1R gene from rs4965238 (adjusted p-value = 1.37×10 -67 ). Among the identified genes presenting dominance effect in eQTL mapping, the RB1 gene locating on chromosome 13 is a tumor suppressor gene, the mutation inactivation of which has been found to be the cause of human cancer [10]. It was also found to be an imprinted gene earlier in 2009 [11]. Interesting, IGF1R is the only gene which presented both dominance effect and imprinting effect from the candidate cis-eQTL rs4965238 (Table 4).
In conclusion, our real data application validated several existing imprinted genes, mapped candidate eQTLs for these imprinted genes. More interestingly, we discovered a few genes presenting significant dominance effect which might be involved in tumorigenesis.   We also investigated the parameter estimation and hypothesis testing of the two models with NB regression and Poisson regression assumption of the read counts for different application scope (results not shown). NB regression will be suggested when the over-dispersion is relatively large and Poisson regression is suggested to be used for small over-dispersion. In conclusion, we provided a comprehensive and robust framework for joint estimation of cis-eQTLs and parentof-origin effects.
The current study has several limitations. First, family data such as trios are needed to obtain the genotypes of heterozygotes in the offspring. Second, borrowing information from the whole samples will allow for better power for modeling the RNA-seq data. A third direction is to incorporate the allele specific gene expression(ASE) as conducted in Zhabotynsky's work so we can extend our study to model both ASE and POE for candidate cis-eQTLs [6]. The gradually decreased cost in RNA-seq technology and future studies in methodology are warranted to achieve more powerful estimation of decomposed variance from different genetic components.

Methods: the Stat-POE and Func-POE methods
Our methods are proposed upon a basic model of eQTL mapping of a single gene with RNA-seq data which are read counts. Therefore, we consider a single gene and study the association of its expression with the -th candidate expression quantitative trait loci (eQTL). Let be the total read counts mapped to this gene in the i-th sample, where and n is the sample size. We model = 1,…, by a negative binomial distribution as they are sparse counts data, which is a generalization of Poisson distribution to allow for over-dispersion. Let be the probability mass function ( ; , ) for a negative binomial distribution with mean and dispersion parameter : where is the gamma function. It's easy to find that the variance , in which Γ( ⋅ ) Var( ) = + 2 is the over-dispersion part. When the over-dispersion parameter goes to , For a bi-allelic locus, let the major and minor alleles of the -th candidate eQTL as and , 1 2 respectively. The genotype G takes four possible values , , and , the first allele

Parameter Estimation and Hypothesis testing
To estimate the genetic effects and POE, we can write the likelihood based on the data ( , , ) as follows, ( = 1,2,... ) (3) Termination: Iterate steps (1) and (2) until estimates of all the parameters converge.
In order to assess whether each regression covariate in the model is significant on the read counts of the gene or not, statistical hypothesis testing will be performed. We constructed three testing methods including the likelihood ratio test (LRT), score test and wald test as follows.

= ( , ')'
Then the score function for is , and the observed fisher information matrix is is the log-likelihood function given as in According to the theory from Rao 2005 [16], in our statistical setting, the score test statistic is therefore defined as (2.7) where .
Moreover, the Wald test statistic is defined by: Under , the statistics , , and all converge in distribution to . For a given significance hypothesis testing for the other parameters can be implemented similarly.

Simulations
To evaluate the performance of the proposed statistical methods in eQTL mapping with RNAseq data, we carried out extensive simulation studies in realistic settings. First, we compared power of the Stat-POE and Func-POE methods in detecting the main allelic effects (including the additive and dominance effects) and POE. We simulated , the total number of read counts of a gene in the i-th sample by a negative binomial distribution with = (0.1 + ( , )) and over-dispersion parameter , where X was a continuous covariate . To evaluate the estimation accuracy. The estimation relative bias was defined as the difference between estimated value and the true parameter value and then divided by the true parameter value. We also used simulated data to quantify the power and Type I error rates of the methods.
To illustrate the performance of the proposed methods in detecting different genetic effect terms and POE, the statistical power for the hypothesis testing was calculated using a range of different critical values under different scenarios. Type I error was calculated under the null model where there was no genetic effect or POE.

Datasets
We used an RNA-seq dataset from 30 HapMap Caucasian samples obtained from the NCBI Bioproject (PRJNA385599). The samples were collected from lymphoblastoid cell lines from 15 males and 15 females. For most of these samples, the RNA reads were 150 bp paired-end reads, with an additional run with 75 bp paired-end reads. The median of the total number of reads for these 30 samples was approximately 20 million. All of these reads were mapped to hg38 human reference genome using Tophat2 [6].
Since all of these samples were children of family trios, the parents of which were also part of the samples included in the 1000 Genomes Project [17]. Genotype data of the 30 trios were used to obtain the phased genotype of the children. For these 30 trios, the HapMap project genotyped about 3.9 million SNPs. The phasing and imputation of these 30 trios were conducted by Zhabotynsky et al.'s study, where the phased and imputed genotypes in our study were directly obtained from [6]. Briefly, SHAPEIT2 [18] was used for phasing and IMPUTE2 [19] was used for imputation against the 1000 Genome reference panel containing 2, 504 individuals and ~82 million SNPs. Based on the phased and imputed SNPs, we extracted 6, 211, 048 of high confidence imputed SNPs (the ones with at least one heterozygote in the sample) in total.

Identification of imprinted genes and genes with dominance effect
We selected 22 known imprinted genes based on the list reported by Jadhav et al. [9]. These genes were selected because they had abundant expression in the 30 samples (S1 Table). For each potential imprinted gene, all SNPs in the gene coding region were selected as candidate cis-eQTLs. For each cis-eQTL and gene expression pair, the Stat-POE method was applied to detect candidate eQTLs with additive, dominance and POE effects. Four covariates were adjusted in the model including the total read counts per individual and the first three principal components computed from the matrix of normalized expression. In all of the hypothesis testing, Benjamini-Hochberg (BH) method was used for multiple comparison to adjust the p-values obtained from the LRT tests [20]. We tested the POE of the previously reported imprinted genes to evaluate the performance of our methods. For novel discovery of genetic effects of these imprinted genes, we tested the additive and dominance effects simultaneously.