Intra- and Inter-individual genetic variation in human ribosomal RNAs

The ribosome is an ancient RNA-protein complex essential for translating DNA to protein. At its core are the ribosomal RNAs (rRNAs), the most abundant RNA in the cell. To support high levels of transcription, repetitive arrays of ribosomal DNA (rDNA) are necessary and the long-standing hypothesis is that they undergo sequence homogenization towards rDNA uniformity. Here I present evidence of the rich genetic diversity in human rDNA, both within and between individuals. Using state-of-the-art genome sequencing data revealed an average of 192.7 intra-individual variants, including some deeply penetrating the rDNA copies, such as the bi-allelically expressed 28S.59A>G. From 104 diverse genomes, 947 high-confidence variants were identified and unmask a hidden genetic diversity of humans. These findings support the emerging concept that ribosomes are heterogeneous within cells and extends the heterogeneity into the realm of population genetics. Fundamentally, do our ribosomal variants determine how our cells interpret the genome?


Introduction
The ribosome is an RNA-protein complex essential to all life. This ancient complex translates genetic information into protein. In humans, a mature ribosome contains four structural and catalytic RNA molecules: 18S in the small subunit and 5S, 5.8S and 28S in the large subunit.
Ribosomes are highly abundant in cells with ribosomal RNA (rRNA) making up >80% of the RNA in a human cell. (1) To be transcribed to such a high level, the DNA encoding for rRNA (rDNA) is present at ~600 copies per diploid genome. The In the initial human genome sequencing project rDNA-containing BACmids were systematically depleted since sequencing was expensive and these repetitive sequences are thought to 'homogenize' into a uniform sequence. (3) Yet notably, in the original research yielding the reference human rDNA, Kuo and colleagues repeatedly cloned and sequenced variants both within and between individuals. (4,5) The variation in these clones localized to 'variable loop regions' and at the time it would have been challenging to distinguish these variants from pseudogenized rDNA. With recent improvements in sequencing technology, rDNA variation has been increasingly characterized in a diverse collection of species (6)(7)(8)(9)(10),so rRNA may not be as homogeneous as previously thought. Originally I hypothesized that a transposable element could exist within human rDNA (similar to the Drosophila R1 and R2 elements (11)) yet would be undetected by standard variant analysis of the human genome. I interrogated human rDNA next-generation sequencing data and 're-discovered' massive intraand inter-individual sequence heterogeneity in human rDNA, including expressed variants of 18S and 28S.

Intra-individual variation in rDNA
To measure the intra-individual variation in rDNA, high coverage PCR-free whole genome sequencing data(12) (DNAseq) from the Utah CEPH-1463 trio (daughter, NA12878; father, NA12891; mother, NA12892) was aligned to a reference human rDNA sequence, hgr1 (see materials and methods and Figure   1A).  Figure 1C). The distribution of the variants is uneven over the different regions of rDNA ( Figure 1D). Excluding 5S and 5.8S which were invariable, the density of variation was lower in 18S and 28S at 9.7 variants per kilobase (kb) compared to the other transcribed regions at 15.8 variants per kb, a trend consistent with rDNA variation in other species. (8,10) As expected, the evolutionary conservation of the mature rRNA variants were at evolutionarily variable regions, such as the expansion loops, or were rare within individuals. The outlier to this trend is the 28S.59G>A variation, which is both evolutionarily conserved and highly variable within individuals. ( Figure 1D).
To test if the variants in the mature rRNA are expressed, poly-A selected RNA sequencing (RNAseq) from the ymphoblastoid cell line GM12878 (derived from NA12878) was analyzed.(13) Variants at positions of moderate GC-content could be analyzed such as 28S.59G>A which shows bi-allelic expression. To interrogate a purer rRNA library, K562 total RNAseq (not poly-A selected and not rRNAdepleted) was analyzed with 70.3% of all reads mapping to rDNA, and here too, 28S.59G>A shows bi-allelic expression ( Figure 1E).
While inspecting RNAseq libraries, one variant was recurrently observed at the highly conserved 18S.1248T, yet was absent from DNAseq. In the reference RNA, this base is a uradine, but in the mature 18S rRNA, this base is hypermodified to 1-methyl-3-(α-amino-α-carboxyl-propyl) Pseudouridine ( Figure 1E) for its function in the catalyitic core of the ribosome's P-site. This hyper-modification hinders reverse transcriptase function (14), thus the "variation" is likely an error profile specific to this modified nucleotide. Besides being interesting, these findings indicate that most standard poly-A selected RNAseq libraries have 'contaminating' levels of rRNA which can be used to measure variant rRNA expression in future studies.

Inter-individual variation in rDNA
To measure if there is rDNA variation between individuals, an additional 104 genomes (four genomes from each of 26 populations)(15) were aligned to hgr1. Reasoning that biologically relevant variants, with respect to cell-or tissueheterogeneous ribosomes, would be maintained in humans, I quantitatively assessed the distribution of the variants in the population ( Figure 2C). There is a cluster of 23 variants which are both, prevalent in the population, and abundant within individuals (population variant-detection frequency (pVDF) > 0.5, iVAF 0.33 -0.66). Alternatively these variants could have had a higher iVAF in human ancestors, and drift could explain their prevalence and abundance as well. To truly measure balancing selection in rDNA, a more complete understanding of its heredity is necessary, but these 23 candidate variants warrant priority biochemical and molecular analysis (Table 1).
Surprisingly, by organizing the 26 human populations into their five superpopulations, blocks of called variants or even possibly alternative haplotypes emerged (yellow arrows, Figure 2A). These population-uncommon (low pVDF) but intra-individual abundant (higher iVAF) variants ( Figure 2C) lends support to the idea that besides intra-individual heterogeneity, there is population-level heterogeneity in rRNA. Altogether, there is substantial intra-and inter-individual variation in human rDNA, and these exciting findings raise far more questions then they answer.

Discussion
The idea that ribosomes are homogenous is increasingly being challenged. To date I haven't found evidence for a human rDNA-specific transposable element such as R1-or R2.

Materials and Methods
An electronic laboratory notebook is available online (http://rRNA.ca) with line-by-line annotated scripts and methods to replicate these experiments in their entirety.

Reference Genomes and rDNA repeats
The 43 kb reference rDNA repeat, U13369.1 was downloaded from NCBI. The first nucleotide of this sequence is the transcription start site (+1) of the 13.5 kb 45S pre-rRNA, and the last 1 kb is the rDNA upstream promoter region. Initial analyses were hindered by the sequence complexity of the intergenic spacer, so this study focuses on the genic portion of rDNA. PCR-free whole genome sequencing data from CEPH-1463 was aligned to RNA45S and its promoter as well as a single copy of the 5S locus from hg38 (chr1:228744112-228746352). The sequences were manually edited to match the consensus supported by the data, with the reasoning that this represents the common rDNA sequence at each position. Three difficult to align regions were masked with 'N' nucleotides, an AluY element and low complexity sequence in the 5S locus and a simple-repeat sequence in the ITS1. The resulting sequence, hgr1, was used in subsequent analysis {Sequence Accession Pending}.

Alignment
Initial alignment parameters were selected to maximize sensitivity. DNAseq alignment was performed with bowtie2 (28) using the command `bowtie2 --verysensitive-local -x hgr1 -1 <read1.fq.gz> -2 <read2.fq.gz>`. The majority of the alignments were ran on Amazon EC2 using an C4.2xlarge instance. A publicly available instance image with the necessary software and reference genomes is available: 'crown-161229'.

Variant Discovery
The purpose of this study was to define the major, high frequency variants in rDNA. Pseudogenized rDNA is a potential source of error in this study, but I reasoned that pseudo-rDNA mutations at identical positions between copies would be rare relative to total rDNA copy number. For variant-discovery, the 107 hgr1aligned bam files were analyzed with the Genome Analysis Toolkit (v.3.6) HaplotypeCaller (29) using standard parameters. HaplotypeCaller performs local de-novo assembly over variant positions, which makes it well suited for genotyping rDNA variants. The 926 called variants had an alternative allele quality score of >30 PHRED (p < 0.001) and 849 of those variants had a score of >100 PHRED (p < 10 -10 ). The type of substitutions and indel size distribution was plotted with bcftools. (30) This analysis should be interpreted as a conservative, lower bound on rDNA variation. With a larger data-set and more importantly, deeper and less PCR-bias genome sequencing, multi-ploidy variant analysis should yield a more complete set of variants at lower intra-individual variant allele frequencies. The product of iVAF and pVDF therefore, is variant allele frequency in the classical sense, taking into account the pool of all rDNA alleles. Although, this makes the assumption that rDNA copy number is the same in each individual which may not be true. (2,32) As such I have refrained from terming any of the variation polymorphisms even though a sub-set (Table 1) Table 1).