GeTallele: a method for integrative analysis and visualization of DNA and RNA allele frequencies

Background Asymmetric allele expression typically indicates functional and/or structural features associated with the underlying genetic variants. When integrated, RNA and DNA allele frequencies can reveal patterns characteristic of a wide-range of biological traits, including ploidy changes, genome admixture, allele-specific expression and gene-dosage transcriptional response. Results To assess RNA and DNA allele frequencies from matched sequencing datasets, we introduce a method for generating model distributions of variant allele frequencies (VAF) with a given variant read probability. In contrast to other methods, based on whole sequences or single SNV, proposed methodology uses continuous multi-SNV genomic regions. The methodology is implemented in a GeTallele toolbox that provides a suite of functions for integrative analysis, statistical assessment and visualization of Genome and Transcriptome allele frequencies. Using model VAF probabilities, GeTallele allows estimation and comparison of variant read probabilities (VAF distributions) in a sequencing dataset. We demonstrate this functionality across cancer DNA and RNA sequencing datasets. Conclusion Based on our evaluation, variant read probabilities can serve as a dependable indicator to assess gene and chromosomal allele asymmetries and to aid calls of genomic events in matched sequencing RNA and DNA datasets. Contact P.M.Slowinski@exeter.ac.uk


1
Introduction RNA and DNA carry and present genetic variation in related yet distinct manners; the differences being informative of functional and structural traits. In diploid organisms, an important measure of genetic variation is the allele frequency, which can be measured from both genome (DNA) and transcriptome (RNA) sequencing data (encoded and expressed allele frequency, respectively). Differential DNA-RNA allele frequencies are associated with a variety of biological processes, such as genome admixture or allele-specific transcriptional regulation (Ferreira, et al., 2016;Ha, et al., 2012;Han, et al., 2015;Movassagh, et al., 2016;Shah, et al., 2012).
Most of the RNA-DNA allele comparisons from sequencing have been approached at nucleotide level, where it has proven to be highly informative for determining the alleles' functionality (Ferreira, et al., 2016;Ha, et al., 2012;Han, et al., 2015;Macaulay, et al., 2016;Morin, et al., 2013;Movassagh, et al., 2016;Reuter, et al., 2016;Shah, et al., 2012;Shi, et al., 2016;Shlien, et al., 2016;Yang, et al., 2016). Comparatively, integration of allele signals at the molecular level, as derived from linear DNA and RNA carriers, is less explored due to challenges presented by short sequencing length. The different molecular nature of RNA and DNA also leads to limited compatibility of the sequencing output.
Herein, we address some of the above challenges by employing a mathematical model to assess differences between RNA-and DNA-variant allele frequencies (VAF) at single nucleotide variant (SNV) positions in the genome. To do that, we introduce GeTallele: a toolbox that provides a suit of functions for integrative analysis and visualization of Genome (VAFDNA) and Transcriptome (VAFRNA) allele frequencies along  Analysis of the VAFRNA and VAFDNA in the GeTallele is based on comparing probability of observing a given VAF value at various positions of interest. Estimation of the variant allele probability (vPR) is implemented using a data-driven mathematical model of distribution of the VAF values and is the core functionality of the GeTallele.

Variant probability
GetAllele aggregates frequency counts of a VAF sample into variant probability, vPR -probability of observing a variant allele. The vPR is a single informative value indicative of copy number variation. The vPR is a parameter that describes the genomic event that through the sequencing process was transformed into a specific distribution of VAF values found in the signal. For example, in VAFDNA from a diploid genome, variant probability vPR=0.5 (meaning that both alleles are equally probable) corresponds to a true allelic ratio of 1:1 for heterozygote sites. For heterozygote sites in the DNA from diploid monoclonal sample, the corresponding tumour VAFDNA is expected to have the following interpretations: vPR=1 or vPR=0 corresponds to a monoallelic status resulting from a deletion, and vPR=0.8 (or 0.2), 0.75 (or 0.25), 0.67 (or 0.33) correspond to allelespecific tetra-, tri-, and duplication of the variant-bearing allele, respectively.
The vPR of the VAFRNA is interpreted as follows. In positions corresponding to DNA heterozygote sites, alleles not preferentially targeted by regulatory traits are expected to have expression rates with variant probability vPR=0.5, which (by default) scale with the DNA allele distribution. Differences between VAFDNA and VAFRNA values are observed in special cases of transcriptional regulation where one of the alleles is preferentially transcribed over the other. In the absence of allele-preferential transcription, VAFDNA and VAFRNA are anticipated to have similar vPR across both diploid (normal) and aneuploid (affected by CNAs) genomic regions. Consequently, VAFDNA and VAFRNA are expected to synchronously switch between allelic patterns along the chromosomes, with the switches indicating break points of DNA deletions or amplifications.
Since we observed that DNA and RNA signals have different distributions of total reads and also that the distributions of total reads vary between participants, the model VAF distributions are generated individually for each sequencing signal and each participant.
To estimate vPR in the signals, GeTallele first generates model VAF distributions and then uses the earth mover's distance (EMD) (Kantorovich and Rubinstein, 1958;Levina and Bickel, 2001) to fit them to the data. To generate a model VAF distribution with a given variant probability, vPR, GeTallele, bootstraps 10000 values of the total reads (sum of the variant and reference reads; nVAR + nREF) from the analysed signal in the dataset. It then uses binomial pseudorandom number generator to get number of successes for given number of total reads and a given value of vPR (implemented in the Matlab function binornd). The vPR is the probability of success and generated number of successes is interpreted as an nVAR. Since the model vPR can take any value, it can correspond to a single genomic event as well as any combination of genomic events in any mixture of normal and tumour populations (See section 2.1.3).
The analysis presented in the paper uses 51 model VAF distributions with vPR values that vary from 0.5 to 1 with step (increment of) 0.01. The model VAF distributions are parametrized using only vPR ≥ 0.5, however, to generate them we use vPR and its symmetric counterpart 1-vPR.  Figure 1D.

vPR estimation
Earth mover's distance (EMD) is a metric for quantifying differences between probability distributions (Kantorovich and Rubinstein, 1958;Levina and Bickel, 2001) and in the case of univariate distributions it can be computed as: Here, PDF1 and PDF2 are two probability density functions and CDF1 and CDF2 are their respective cumulative distribution functions. Z is the support of the PDFs (i.e. set of all the possible values of the random variables described by them). Because VAFs are defined as simple fractions with values between 0 and 1, their support is given by a Farey sequence (Hardy, et al., 2008) of order n; n is the highest denominator in the sequence. For example, Farey sequence of order 2 is 0, 1/2, 1 and Farey sequence of order 3 is 0, 1/3, 1/2, 2/3, 1. GeTallele uses a Farey sequence of order 1000 as the support Z for all computations involving EMD. To estimate vPR, GeTallele computes EMD between the distribution of the VAF values of each signal in the window and the 51 model VAF distributions (i.e. observed vs modelled VAF), the estimate is given by the vPR of the model VAF distribution that is closest to the VAF distribution in the window. Examples of VAF distributions with fitted model VAF distributions are shown in Figure 3A and D. The dependence of the confidence intervals of the estimation on the number of VAF values in a window is illustrated in Fig. 4, which clearly demonstrates that the larger the number of VAFs in the chosen window the higher the accuracy of the estimate.

Fig. 4. Confidence intervals for artificial samples with different numbers of VAFs.
Each confidence interval is based on estimation of vPR in 1000 randomly generated samples with a fixed vPR (True value). Light grey bar is 95% confidence interval (950 samples lay within this interval), dark grey bar is 50% confidence interval (500 samples lay within this interval), red dot is median value.

vPR values in mixtures of normal and tumour populations
Since the vPR can take any value between 0.5 and 1 it can correspond to a single genomic event as well as any combination of genomic events in any mixture of normal and tumour populations. A mixture vPR value that corresponds to a combination of genomic events can be computed using the following expression: Where eVAR and eREF are the multiplicities of variant and reference alleles and pPL is a proportion of one of the populations. For heterozygote sites eVAR=1 and eREF=1, for deletions eVAR=0 or eREF=0, for du-, tri-and tetraplications eVAR or eREF can be equal to 2, 3 or 4, respectively. The sum of proportions pPL over the populations is equal 1. For example, for a mixture of 1 normal (N, pN=0.44) and 2 tumour populations (T1, pT1=0.39 and T2, pT2=0.17), T1 with deletion and T2 with deletion the mixture vPR value can be computed as follows: v $% = p 2 ⋅ B + p 6! ⋅ B + p 6" ⋅ B p 2 ⋅ (A + B) + p 6! ⋅ (0 + B) + p 6" ⋅ (0 + B) = 0.44 ⋅ 1 + 0.39 ⋅ 1 + 0.17 ⋅ 1 0.44 ⋅ (1 + 1) + 0.39 ⋅ (1 + 0) + 0.17 ⋅ (1 + 0) = 0.694.
By comparing the vPR values estimated from data with possible mixture vPR values we propose to estimate sample purity and its clonal composition. To this end, we first generate a full set of proportions of all the population in the mixture with step (increment of) 0.01 and compute all the possible vPR values that each of the mixtures could produce. For step 0.01: two populations (1 tumour) give 99 proportions, three populations (2 tumours) give 4851 proportions, four populations (3 tumours) give 156849 proportions. The matrices with mixture vPR values for each proportion, vary from 2x2, for two populations with deletions, to 35x35 for four populations with all events up to tetraplications. Then, we run an exhaustive approximate search over all the matrices with mixture vPR values over all the proportions. The search is approximate because the estimated vPR values have limited accuracy and because we consider only discrete values of proportions. In the analysis we define a match between estimated and mixture vPR values if they differ <0.009. The search returns a large number of admissible mixtures that could produce the estimated vPR values. This process is illustrated in Fig.  5.
To represent the admissible mixtures graphically we use ternary plots. Ternary plot allows to illustrate composition of three components using just two dimensions. The composition, represented by ratios of the three components which sum to a constant, is depicted as point inside or on the edge of an equilateral triangle. If the point is on the edges, the composition has only two components. To facilitate interpretation of the ternary plots, we also plot the grid lines that are parallel to the sides of the triangle. These gridlines indicate the directions of constant ratios of the components. Along such direction the ratio of one of the components is fixed and only the other two ratios vary. Examples of visualisation of admissible mixtures on ternary plots are shown in Figs. 5 and 6. To facilitate analysis of the admissible mixtures returned by the search procedure we introduce mixture complexity. Mixture complexity is a measure that increases with number of populations as well as with variety of genetic events. From the simplest mixture of 1 normal and 1 tumour population in which only deletions are possible to a model with 1 normal and multiple tumour populations where each can have deletions, and any level of multiplications. In practice, we set the limit at 3 tumour populations and tetraplications. Mixture complexity helps to group and visualise admissible mixtures. Mixtures with higher complexity allow more possible vPR values, meaning that it is easier to find the match with the estimated vPR values and the number of admissible mixtures increases (see Figure 6). We, further, observe that proportion of normal population, pN, increases with a number of clonal tumour populations included in the model mixture and that, generally, pN stays constant with increasing variety of genetic events, for a fixed number of clonal tumour populations.

Analysis of RNA-DNA relationships
GetAllele is readily applicable to assess RNA-DNA relationships between normal and tumour sequencing signals derived from the same sample/individual (matched datasets). As a proof of concept, we assessed matched normal and tumour exome and transcriptome sequencing data of 72 breast carcinoma (BRCA) datasets with pre-assessed copy-number and genome admixture estimation acquired through TCGA (Supplementary Table 1). For these datasets, purity and genome admixture has been assessed using at least three of the following five approaches: ESTIMATE, ABSOLUTE, LUMP, IHC, and the Consensus Purity Estimation (CPE) (Aran, et al., 2015;Carter, et al., 2012;Katkovnik, et al., 2002;Pagès, et al., 2010;Yoshihara, et al., 2013;Zheng, et al., 2014). In addition, on the same datasets we applied THetA -a popular tool for assessing CNA and admixture from sequencing data (Oesper, et al., 2013;Oesper, et al., 2014).

Segmentation results
Segmentation of the data, based on the tumour exome signal, resulted in 2697 windows across the 72 datasets. We excluded from further analysis 294 windows where either tumour exome or transcriptome had vPR>=0.58 but their VAF distribution could not be differentiated from the model VAF distributions with vPR=0.5 (p>1e-5, Kolmogorov Smirnov test, equivalent to Bonferroni FWER correction for 100000 comparisons). The 294 excluded windows correspond to 4% of the data in terms of number of base pairs in the windows and 4% of all the available data points; i.e. they are short and contain only few VAF values. In the remaining 2403 windows, we systematically examined the similarity between corresponding VAFTEX (tumour exome), VAFTTR (tumour transcriptome) and CNA. We obtained several distinct patterns of coordinated RNA-DNA allelic behaviour as well as correlations with CNA data.
In 60% of all analysed windows the distributions of VAFTEX and VAFTTR were statistically concordant (had p>1e-5, Kolmogorov Smirnov test), and in 40% they were statistically discordant (p<1e-5, Kolmogorov Smirnov test). In 2 windows VAFTEX and VAFTTR had the same vPR but had statistically different distributions (p<1e-5, Kolmogorov Smirnov test), we consider such windows as concordant; Kolmogorov-Smirnov test is very sensitive for differences between distributions, vPR fitting is more robust. In the vast majority of the discordant windows vPR of the VAFTTR, vPR,TTR, was higher than vPR of the VAFTEX, vPR,TEX, (only in 21 out of 959 discordant windows vPR,TTR was lower than vPR,TEX).

Concurrence of segmentation based on DNA and RNA
We next analysed the concurrence between windows resulting from independent segmentations of the dataset based on the tumour exome and transcriptome signals in the datasets (2697 and 3605 windows, respectively, across all the samples). We first assessed chromosome-wise alignment of the start and end points of the windows. In 45% of the chromosomes both VAFTEX and VAFTTR signals produce a single window that contains the whole chromosome. In 33% of chromosomes both signals produced multiple windows. These windows are well aligned, with 90% of the break-points within 7% difference in terms of the number of data points in the chromosome (Q50=0.02%, Q75=2% of data points in the chromosome). The probability of observing such an alignment by chance is smaller than p=1e-5 (100,000 bootstrap samples with breaking points assigned randomly in all the individual chromosomes where both signals produced multiple windows). In 22% of the chromosome windows based on VAFTEX and VAFTTR the signals were positionally discordantone signal produced a single window containing whole chromosome while the other produced multiple windows.
To compare the vPR values in the 55% of chromosomes where at least one signal produced more than one window, we computed chromosome-wise mean absolute error (MAE) between the vPR in two sets of windows. To account for different start and end points of the windows we interpolated the vPR values (nearest neighbour interpolation) at each data point in the chromosome. We separately compared the vPR,TEX and vPR,TTR values. The alignment in terms of MAE is very good, vPR,TEX agreed perfectly in 11% of the chromosomes and had the percentiles of MAE equal to Q50= 0.012, Q75= 0.022 and Q97.5= 0.047, while vPR,TTR agreed perfectly in 8% but had slightly higher percentiles of MAE Q50= 0.019, Q75= 0.034 and Q97.5= 0.07. vPR,TEX and vPR,TTR values had MAE=0 simultaneously in 4% of the chromosomes. Probability of observing such values of MAE by chance is smaller than p=1e-3 (1000 random assignments of vPR,TEX and vPR,TTR values to windows in the 873 chromosomes where at least one signal had more than one window). It is noteworthy that MAE Q97.5<0.07 is comparable with the confidence interval of single vPR estimate based on 50 VAF values. In other words, both signals in a sample (Tex and Ttr) give very similar results in terms of windows' segmentation and estimated values of the vPR. Albeit, segmentation of VAFTTR generates a higher number of windows. The higher number of VAFTTR windows indicates that transcriptional regulation occurs at a smaller scale than allelic status changes in DNA. Figure 7 shows examples of concurrence between windows based on VAFTEX and VAFTTR signals in a positionally concordant chromosome (both signals produced multiple windows). Fig. 7. Illustration of concurrence between windows resulting from independent segmentations of the dataset based on the VAFTEX and VAFTTR signals. A yellow dots, VAFTTR; grey circles, vPR,TTR interpolated at all data points in windows based on VAFTEX; yellow crosses, vPR,TTR interpolated at all data points in windows based on VAFTTR. B bar plot of the absolute difference between the vPR values in the two kinds of windows. C orange dots, VAFTEX; grey crosses, vPR,TEX interpolated at all data points in windows based on VAFTEX; orange dots vPR,TEX interpolated at all data points in windows based on VAFTTR. D bar plot of the absolute difference between the vPR values in the two kinds of windows.

Correlation between vPR and CNA
Finally, we assess the correlations between vPR and CNA in the individual datasets. We separately computed correlations for deletions and amplifications. In order to separate deletions and amplifications, for each data set we found CNAMIN, value of the CNA in the range -0.3 to 0.3 that had the smallest corresponding vPR,TEX. To account for observed variability of the CNA values near the CNAMIN, we set the threshold for amplifications to CNAA = CNAMIN-0.05, and for deletions we set it to CNAD = CNAMIN+0.05.

vPR based purity estimation
To demonstrate a practical application of estimating the admissible mixtures by means of comparing the estimated vPR,TEX values with the mixture vPR values we next use them to estimate purity of the samples. To this end we compared the vPR based purity (VBP) estimates with ESTIMATE, ABSOLUTE, LUMP, IHC, and the Consensus Purity Estimation (CPE) (Aran, et al., 2015;Carter, et al., 2012;Katkovnik, et al., 2002;Pagès, et al., 2010;Yoshihara, et al., 2013;Zheng, et al., 2014).
To obtain the VBP estimate we used vPR,TEX values. We, first, selected the vPR,TEX values that: 1. are estimated with high confidence, i.e. are based on at least 50 data points; 2. are most likely heterozygous in normal exome, i.e., have a corresponding vPR value in normal exome vPR,NEX <0.59; 3. most likely have vPR,TEX > 0.5, i.e., their p-value for comparison with vPR,TEX = 0.5 is very small p<1e-5 (Kolmogorov-Smirnov test). Next, we used the selected vPR,TEX values to find all admissible mixtures (with 1 to 3 tumour populations and allowing for all events, from deletions to tetraplications). To estimate the VBP, out of all the admissible mixtures we chose these with lowest mixture complexity and among these mixtures we take one with the highest pN (proportion of the normal population). The VBP, percentage of tumour populations in the sample, is then given as 1-pN. Such approach provides rather conservative estimates of VBP (the smallest 1-pN). However, GetAllele can be extended to offer alternative methods of employing the admissible mixtures to estimate VBP. Development, analysis and comparison of alternative VBP estimation methods is beyond scope of the current paper. Figure 9A shows violin plots of all considered 1-pN values and (x) indicates the smallest value taken as a VBP estimate. In two of the datasets we could not estimate the purity due to lack of suitable vPR,TEX values. The VBP estimates shows the best agreement with ABSOLUTE method (r=0.76, p<3.4e-14, Pearson's correlation, Fig. 9B2). We suppose that this is because the ABSOLUTE method is based on copy number distributions, and our analysis (Section 2. The approach presented in this section differs from other methods for inferring genomic mixture composition (e.g. SciClone (Miller, et al., 2014), PyClone (Roth, et al., 2014) or TPES (Locallo, et al., 2019)) in that it is based on changes of the alleles multiplicity estimated along continuous multi-SNV genomic regions.
The vPR values used to estimate admissible mixtures are based on windows with at least 50 VAF values which extend over millions of base pairs. In this way, the presented method is complementary to the SciClone and PyClone methods that use VAFs of somatic mutations.

3
Discussion Potential for integrative analysis of RNA and DNA sequence data is facilitated by the growing availability of RNA and DNA sequencing datasets and by the technological advances now enabling simultaneous RNA and DNA sequencing from the same source (Macaulay, et al., 2016;Reuter, et al., 2016;. However, integrative RNA and DNA analyses are challenged by limited compatibility between RNA and DNA datasets and high technical variance of the sequencing-produced signals. Our approach -GeTalleleaddresses the compatibility by restricting the analyses to confidently co-covered DNA and RNA regions, and the high variability -through quantification of differences between the DNA and RNA signals. In contrast to other methods based on statistical modelling, GeTallele is based on a mechanistic model of VAF distributions that depends on the distribution of total reads (extracted from data) and the vPR parameter. The brute-force simplicity and transparency of the presented methodology is one of its biggest advantages.
Additionally, in contrast to other methods, to analyse sequencing data GeTallele uses data segments that include multiple adjacent SNVs. The proposed mechanistic model indicates that due to probabilistic nature of the reads that estimates of genomic events based on continuous multi-SNV regions are intrinsically more robust than estimates based on a single SNV. Finally, GeTallele offers a unified pipeline to evaluate and visualise DNA-RNA whereas many of the other methods cited herein are used in conjunction with each other to achieve a similar result.
Using GeTallele, we detected several relationships between DNA-RNA allele frequencies and biological processes. First, in chromosomes affected by deletions and amplifications, VAFRNA and VAFDNA showed highly concordant breakpoint calls. This indicates that VAFRNA alone can serve as preliminary indicator for deletions and amplifications, which can facilitate the applications of RNA-sequencing analysis on the large and constantly growing collections of transcriptome sequencing data. Second, we showcased that vPR,TEX can be used to estimate sample composition (in terms of proportions of normal and tumour populations) and as a result its purity. Once the mixture composition is decided by the user, GeTallele allow to further interrogate genetic events in each population at a specific data segment.

4
Conclusions Based on our results, variant probability vPR can serve as a dependable indicator to assess gene and chromosomal allele asymmetries and to aid calls of genomic events in matched sequencing RNA and DNA datasets. Methods for estimating and analysing vPR values are implemented in a GeTallele toolbox. GeTallele provides a singular suit of functions for integrative analysis, statistical assessment and visualization of the observed patterns at desired resolution, including chromosome, gene, or custom genome region.

Data processing
All datasets were generated through paired-end sequencing on an Illumina HiSeq platform. The human genome reference (hg38)-aligned sequencing reads (Binary Alignment Maps, .bams) were downloaded from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/) and processed downstream through an in-house pipeline. After variant call (Li, 2011), the RNA and DNA alignments, together with the variant lists were processed through the read count module of the package RNA2DNAlign (Movassagh, et al., 2016), to produce variant and reference sequencing read counts for all the variant positions in all four sequencing signals (normal exome, normal transcriptome, tumour exome and tumour transcriptome). Selected read count assessments were visually examined using Integrative Genomics Viewer (Thorvaldsdóttir, et al., 2013).

Data segmentation
To analyse variant allele frequencies (VAF) at genome-wide level, GeTallele first divides the VAF dataset into a set of non-overlapping windows along the chromosomes. Segmentation of the dataset into windows is based on a sequencing signal chosen out of all the available datasets in the aggregated aligned VAF dataset (one out of four in the presented analysis).
To partition the data into the windows GeTallele uses a parametric global method, which detects the breakpoints in the signals using its mean, as implemented in the Matlab function findchangepts (Killick, et al., 2012;Lavielle, 2005) or in R (Killick and Eckley, 2014). In each window, the VAF values of the chosen signal have a mean that is different from the mean in the adjacent windows. Sensitivity of breakpoint detection can be controlled using parameter MinThreshold, in the presented analysis it was set to 0.2. For segmentation and analysis (without loss of generality) we transform all the original VAF values to VAF=|VAF-0.5|+0.5.

Statistics
To test statistical significance, GeTallele uses parametric and non-parametric methods and statistical tests (Corder and Foreman, 2014;Hollander, et al., 2013). Namely, to compare distributions of the variant allele frequencies (VAF) we use Kolmogorov-Smirnov test (examples of VAF distributions are depicted in Figs. 2  and 3). To study concurrence of windows, we use permutation/ bootstrap tests. To test relations between vPR and copy number alterations (CNA) we use Pearson's correlation coefficient.