A comprehensive catalogue of regulatory variants in the cattle transcriptome

Characterization of genetic regulatory variants acting on the transcriptome of livestock is essential for interpreting the molecular mechanisms underlying traits of economic value and for increasing the rate of genetic gain through artificial selection. Here, we build a cattle Genotype-Tissue Expression atlas (cGTEx, http://cgtex.roslin.ed.ac.uk/) for the research community based on 11,642 RNA sequences from publicly available datasets representing over 100 cattle tissues. We describe the landscape of the transcriptome across tissues and report hundreds of thousands of cis- and trans- genetic associations with gene expression and alternative splicing for 24 major tissues. We evaluate the specificity/similarity of these genetic regulatory effects across tissues, and functionally annotate them using a combination of multi-omics data. Finally, we link gene expression in different tissues to 43 economically important traits using a large transcriptome-wide association study (TWAS) to provide novel biological insights into the molecular regulatory mechanisms underpinning agronomic traits in cattle.


Introduction 64
Genome-wide association studies (GWAS) have identified thousands of genetic 65 variants associated with complex traits in human and livestock populations 1-3 . As the 66 majority of these variants are non-coding, the characterization of the molecular mechanisms 67 by which such variants affect complex traits has been extremely challenging. Indeed, in 68 human genetics, it would have been impossible without projects such as the Genotype-Tissue 69 Expression (GTEx) project that has characterized genetic effects on the human transcriptome 70 and paved the way to understanding the molecular mechanisms of human variation 4 . 71 However, livestock genetics lags behind human genetics, and to date no study has 72 systematically explored the regulatory landscape of the transcriptome across a wide range of 73 tissues. GWAS signals of agronomic traits are significantly enriched in regulatory regions of 74 genes expressed in trait-relevant tissues in cattle 5,6 , but attempts to dissect variation in gene 75 expression have generally been small, both in terms of number of individuals and tissues. 76 Here, we describe the largest and most comprehensive study of the regulatory landscape of 77 any livestock species by analyzing 11,642 publicly available cattle RNA-Seq runs, 78 representing over 100 different tissues and cells types. We combined all of these data and 79 make the results freely and easily accessible to the research community through a web portal 80 (http://cgtex.roslin.ed.ac.uk/). 81 There has been a recent exponential growth in the number of RNA-Seq samples 82 made publicly available in cattle (Fig. S1a), but these data have never before been gathered 83 in one collection and processed uniformly. Here, we present a pipeline to uniformly integrate 84 11,642 public RNA-Seq datasets and identify both cisand trans-expression and splicing 85 quantitative trait loci (eQTLs and sQTLs) for 24 important cattle tissues. The latter is 86 facilitated by calling variants directly from the RNA-Seq reads and imputing to sequence 87 level with data from the 1000 Bull Genomes Project 7 , as previously done with human data 8 . 88 Next, we conducted in silico analyses, annotating eQTLs and sQTLs with a variety of 89 publicly available omics data, including DNA methylation, chromatin state, and chromatin 90 conformation characteristics. Finally, we integrated gene expression results with a large 91 GWAS of 27,214 dairy bulls from 43 cattle traits via a transcriptome-wide association study 92 (TWAS) and identified 442 previously unknown genes associated with 43 traits. The cattle 93 Genotype-Tissue Expression (cGTEx) atlas will serve as a primary source of reference for 94 cattle genomics, cattle breeding, adaptive evolution, comparative genomics, and veterinary 95 medicine.

97
Data summary 98 We analyzed 11,642 publicly available RNA-Seq runs from 8,653 samples using a 99 uniform quality control pipeline, yielding ~200 billion cleaned reads (Table S1). Summary 100 distributions of sequence platform, single/paired reads, clean read number, read length, sex, 101 age, and mapping rate across samples show that the quality of publicly available data is 102 acceptable for the analyses shown here (Fig. S1b-h). We kept 7,180 samples with clean reads 103 number > 500,000 and mapping rates > 60% for subsequent analyses, representing 114 104 tissues from 47 breeds and breed combinations (Fig. S1i, Table S1). Holsteins were the most 105 represented population (35.5% of all samples), reflecting its global economic value. A total 106 of 1,831 samples (21%) had no breed identified. We grouped the 114 tissues into 13 107 categories based on trait phylogeny and the 47 breeds into seven major sub-species, with Bos 108 taurus representing 87% of all samples (Table S1). 109 We also analyzed 144 publicly available whole-genome bisulfite sequencing (WGBS) 110 data sets from 21 cattle tissues to investigate the tissue-specificity of DNA methylation and 111 to functionally annotate eQTLs and sQTLs. These data were even more heavily dominated 112 by Holsteins with 90 (62%) samples representing that single breed. These data included ~73 113 billion good quality reads with an average mapping rate of 71% (Table S2).

114
Variation in gene expression across individuals and tissues 115 As expected, the number of expressed genes (Transcripts per Kilobase Million, TPM > 116 0.1) increased with the number of clean reads. However, we observed a plateau at 50 million 117 clean reads (Fig. S2a) where we only detected about ~60% of 27,607 annotated genes. This 118 observation suggests that tissues express only a certain proportion of genes and that there is 119 little benefit in increasing read depth beyond this plateau. Only 61 genes were not expressed 120 in any of the samples. About half of those (54.10%) were located in unplaced scaffolds, with 121 significantly (P < 0.05, 1-sided) shorter gene length, fewer exons, higher CG density, and 122 lower sequence conservation than expressed genes ( Fig. S2b-f).

123
Similarly, we detected more alternative splicing events with increasing numbers of 124 clean reads (Fig. S2g). More genes (874) had no splicing events than were not expressed (61) 125 in any sample. These genes also exhibited significantly shorter gene length, fewer exons, 126 lower expression, and lower sequence constraints than spliced ones ( Fig. S2h-k).

129
Genes without splicing events were significantly engaged in the integral component of 130 membrane and G-protein coupled receptor signaling pathway (Fig. S2m). 131 We found that about 25% of CpG sites were not covered in any of the WGBS samples, 132 even if these had more than 300 million clean reads, partially due to sequence technique 133 problems (Fig. S3a). These CpG sites were enriched in gene desert regions (e.g., telomeres) 134 with significantly higher CG density than the CpG sites captured by the WGBS (Fig. S3b-d). 135 We called a median of 21,623 SNPs from RNA-Seq data (Fig. S4a), and then imputed 136 each sample up to 3,824,444 SNPs using a multi-breed reference population of 3,310 bulls 137 (Methods , Table S3). We validated the imputation accuracy by comparing imputed and 138 RNA-Seq-derived SNPs with those directly called from whole-genome sequence (WGS) in 139 the same individuals (Fig. S4b, c), and the concordance rate was over 98%. After removing 140 duplicated samples (Methods), we obtained 4,889 individuals for subsequent QTL mapping 141 (Fig. S4d). 142 We found that clusters of samples based on both gene expression and alternative 143 splicing accurately recapitulated tissue types (Fig. 1a, b), reinforcing the quality and utility of 144 the data for our purposes. Tissue type was the primary driver of expression/splicing 145 differences, followed by sub-species. For instance, all 831 muscle samples from 46 projects 146 clustered together. Similar to expression and splicing, DNA methylation profiles also 147 recapitulated tissue types (Fig. 1c). But for imputed genotypes, as expected, samples 148 clustered by sub-species (Fig. 1d).

149
Tissue specificity of gene expression 150 Tissue-specificity of gene expression was conserved across cattle and human ( Fig. 2a-151 b), and the function of genes with tissue-specific expression accurately reflected the known 152 biology of the tissues (Fig. S5a). For instance, brain-specific genes were significantly 153 associated with synapse and neuron function, and testis-specific genes with spermatogenesis 154 and reproduction. We also calculated tissue-specificity of promoter DNA methylation and 155 alternative splicing (Methods). We found that, based on tissue-specificity, gene expression 156 level was significantly (FDR < 0.05) and negatively correlated with DNA methylation level 157 in promoter (Fig. 2c), and positively correlated with splicing ratios (Fig. 2d). For example, 158 CELF2, a brain-related gene, had significantly higher expression, lower methylation, and 159 more splicing events in the brain than in the other tissues considered (Fig. 2e). Similarly, the 160 function of genes with tissue-specific promoter hypomethylation and splicing reflected the 161 known biology of the tissues (Fig.S5b, c). Tissue-specific genes exhibited distinct patterns of 162 sequence constraints (Fig. S5d), supporting the hypothesis of tissue-driven genome 163 evolution 5,11 . We found that while brain-specific genes evolve slowly, blood or testis-specific 164 ones evolve rapidly. This trend was also observed within tissue-specific hypomethylated 165 regions ( Fig. 2f; Fig. S5e). 167 We identified cis-e/sQTLs for 24 tissues with 40 or more individuals, while accounting 168 for relevant confounding factors (Methods, Fig. S6a, b). The number of eGenes (genes with significant cis-eQTLs) ranged from 172 in ileum to 10,174 in blood, with 19,559 (83% of all 170 23,523 tested genes) eGenes in at least one tissue (Table S4). The number of sGenes (genes 171 with significant cis-sQTLs) ranged from four in the salivary gland to 7,913 in macrophages, 172 with 15,376 (70.8%) sGenes in at least one tissue (Table S4). Genes with no cis-eQTLs or -173 sQTLs in any of the tissues were significantly enriched in hormone activity, regulation of 174 receptor activity, neuropeptide signaling pathway, and reproduction. In general, the larger the 175 number of samples, the larger the number of cis-e/sGenes detected (Pearson's r = 0.85 for 176 cis-eQTLs with P < 1.3210 -7 , r = 0.63 for cis-sQTLs with P < 9.110 -4 ), (Fig.3a-b). As suggesting a lack of power to detect such associations (Fig. S7c). Furthermore, we fine-183 mapped eGenes to assess whether the identified signals were attributed to one or more causal 184 SNPs. We found that 46% (range 14.5 -73.9%) of eGenes had more than one independent 185 SNPs associated with expression ( Fig. 3c), indicating the widespread allele heterogeneity of 186 gene expression. In addition to discover the SNPs with the smallest effects within a locus, we 187 needed to adjust for the SNPs with the biggest effects (Fig. 3d), and the biggest effects  from different sub-species and breeds will increase statistical power for detecting shared 213 eQTLs, but decreased power will almost certainly result from eQTL present in only some of 214 the sub-species. We also noted the variability in eQTLs among sub-species. For instance, the 215 expression of an immune-related gene, SSNA1, was regulated by a cis-eQTL in Bos indicus 216 (P = 3.9310 -11 ) but not in Bos taurus or the hybrids (Fig. 3k).

217
Patterns and biological mechanisms underlying tissue-specificity/similarity of cis-QTLs 218 provide insights into pleiotropic regulatory effects on phenotypes 4 . We therefore conducted a 219 non-model-based pairwise analysis using significant eGene-eVariant pairs in one tissue to 220 estimate the proportion of non-null associations of these pairs in another tissue (Methods). 221 We found that tissues with similar functions (e.g., hypothalamus and pituitary) tended to  (Table S5). We speculated that 238 the large number of trans-eQTLs in cattle could be due to the high selection intensity for 239 economically important traits and the modest effective population size. This has led to a 240 complex inter-chromosomal pattern of linkage disequilibrium (LD) and gene co-expression. 241 We showed this might be the case, as we found significantly higher LD for cis-eQTL & 242 trans-eQTL pairs than cis-eQTL & random-SNP pairs (on matched chromosomes) across all 243 tissues (Fig. 4a). Similarly, the expression correlation between cis-eGenes and trans-eGenes 244 with shared eQTLs were significantly higher than cis-eGene and random-gene pairs across 245 all tissues (Fig. 4b). Among 29 autosomes, we observed trans-eQTL hotspots in nine 246 chromosomes (Fig. 4c). Genes associated with these trans-eQTLs hotpots were significantly 247 enriched in gene co-expression modules (Fig. 4d, e), and played key roles in RNA biology 248 and energy metabolism (Fig. 4f). After accounting for fine-mapped cis-eQTLs, the number of 249 trans-eGenes slightly decreased in each tissue (Fig. S9, Table S5). 251 We employed multiple layers of biological data to better define the molecular 252 mechanisms of genetic regulatory effects. As expected, cis-e/sQTLs were significantly 253 enriched in functional elements, such as 3'UTR and open chromatin regions ( Fig. 5a-b). As 254 expected, cis-sQTLs had a significantly higher enrichment in splice regions than cis-eQTLs. 255 The cis-eQTLs associated with stop gain had significantly larger effect sizes than other cis-256 eQTLs (Fig. 5c).  The primary goal of this study is to provide a resource for elucidating the genetic and 273 biological bases of agronomic traits in cattle. We thus evaluated e/sQTLs detected in each 274 tissue for associations with four distinct agronomic traits, i.e., ketosis, somatic cell score in 275 milk (SCS), age at first calving (AFC), and milk yield (MY). The top SNPs associated with ketosis from GWAS were significantly enriched within liver cis-e/sQTLs and trans-eQTLs 277 (Fig. 6a, Fig. S10a). Similarly, MY-associated SNPs were overrepresented in mammary 278 gland cis-e/sQTLs and trans-eQTLs (Fig. 6b, Fig. S10b). Compared to other tissues, 279 mammary gland, milk cells and liver were top tissues regarding the enrichment of cis-eQTLs 280 with MY-associated SNPs (Fig. 6c). Additionally, AFC-associated SNPs were most enriched 281 for monocytes cis-eQTLs, and SCS for mammary gland (Fig. S10c-d). 282 We then detected 496 candidate gene-tissue pairs for 43 agronomic traits in cattle via 283 TWAS, representing 337 unique genes (Table S6) Table S1). We then obtained   (Table S3). Finally, we obtained 6,123 365 samples that were genotyped and imputed successfully. We filtered out variants with MAF < 366 0.05 and dosage R-squared (DR 2 ) < 0.8, resulting in 3,824,444 SNPs used for QTL mapping.   (Table S2). We first used FastQC (v0.11.2) and Trim Galore v0  algorithm. We considered Hi-C contacts with FDR < 0.05 as significant.

419
Tissue-specificity analysis of gene expression, alternative splicing and DNA methylation 420 To quantify tissue-specific expression of genes, we computed a t-statistics for each gene 421 in each of the 114 tissues. We grouped 114 tissues into 13 categories (Table S1). We scaled 422 the log2-transformed expression (i.e., log2TPM) of genes to have a mean of zero and variance 423 of one within each tissue. We then fitted a linear model as described in 37 for each gene in 424 each tissue using least-squares. When constructing the matrix of dummy variables (i.e., 425 design matrix) for tissues, we denoted samples of the target tissue/cell type (e.g., CD4 cells) 426 as '1', while samples outside the target category (e.g., non-blood/immune tissues) as '-1'. We 427 excluded samples within the same category (e.g., CD8 cells and lymphocytes) to detect 428 genes with specific expression in each particular category, even if they were not specific to 429 the target tissue within this category. We obtained t-statistics for each gene to measure its 430 expression specificity in a given tissue. We considered the top 5% of genes ranked by largest 431 t-statistics as genes with high tissue-specific expression.

432
To detect tissue-specific alternative splicing, we used leafcutter to analyze the differential 433 intron excision by comparing the samples from target tissue to the remaining tissues 27 .

434
Samples from tissues of the same category with the target tissue were excluded. We used 435 Benjamini-Hochberg method (FDR) to control multiple testing.

436
For DNA methylation, we focused on DNA methylation levels of gene promoters (from like clean data size, mapping rate, project, breeds, sub-species, sex and age. (Fig. S6b) these 24 tissues is in Table S4. We kept genes with TPM > 0.1 in ≥ 20% samples in each  trans-eQTL mapping 511 We conducted trans-eQTLs for 13 tissues with at least 100 samples each. We filtered 512 genomic variants using a more stringent threshold than cis-eQTL mapping to partially 513 account for the reduction in statistical power. We obtained mappability of variants based on 514 k-mer length of 36 and 75 following the procedure described in autosomal chromosomes, while adjusting for the same covariates as in cis-eQTL analysis.

522
We adjusted P-values for multiple testing using Benjamini-Hochberg method to obtain FDR. 523 We considered gene-variant pairs with FDR < 0.05 as significant.

524
WGCNA co-expression network analysis and estimation of π1 statistics 525 We applied the Weighted Gene Co-Expression network (WGCNA) analysis 47 to obtain 526 gene co-expression network in each of 24 tissues used in eQTL mapping. We estimated the 527 sharing of cis-eQTLs and cis-sQTLs among tissues using the π1 statistics, as described in