An approach for normalization and quality control for NanoString RNA expression data

The NanoString RNA counting assay for formalin-fixed paraffin embedded samples is unique in its sensitivity, technical reproducibility, and robustness for analysis of clinical and archival samples. While commercial normalization methods are provided by NanoString, they are not optimal for all settings, particularly when samples exhibit strong technical or biological variation or where housekeeping genes have variable performance across the cohort. Here, we develop and evaluate a more comprehensive normalization procedure for NanoString data with steps for quality control, selection of housekeeping targets, normalization, and iterative data visualization and biological validation. The approach was evaluated using a large cohort (N = 1,649) from the Carolina Breast Cancer Study, two cohorts of moderate sample size (N = 359 and 130), and a small published dataset (N = 12). The iterative process developed here eliminates technical variation (e.g. from different study phases or sites) more reliably than the three other methods, including NanoString’s commercial package, without diminishing biological variation, especially in long-term longitudinal multi-phase or multi-site cohorts. We also find that probe sets validated for nCounter, such as the PAM50 gene signature, are impervious to batch issues. This work emphasizes that systematic quality control, normalization, and visualization of NanoString nCounter data is an imperative component of study design that influences results in downstream analyses.

genes (Figure 4B). While the log2-fold changes were correlated between the two normalization procedures, the genes found to be differentially expressed only with nSolver-normalized data tended to have large standard errors with RUVSeq-normalized data and were therefore not statistically significant using

Case study: bladder cancer gene expression

RUVSeq reduced technical variation (study site) while maintaining the biological variation (tumor grade).

RUVSeq data showed the most homogeneity in per-sample median deviation of log-expression compared to raw and nSolver data (Figure 5A). The first principal component of nSolver data differed significantly by study site, a difference that was not present in RUVSeq data (Figure 5B). In addition, there was a stronger biological association with tumor grade in the first principal component of expression using RUVSeq data (Figure 5C).
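The per-sample median deviation of log-expression used above can be illustrated with a minimal sketch. It assumes the usual relative-log-expression construction: each gene's log2 expression is centered on that gene's median across samples, and the median of the centered values is then taken within each sample. The function name, the pseudocount, and the toy counts are hypothetical and are not taken from the study's pipeline.

```python
import math
from statistics import median


def per_sample_median_deviation(counts, pseudocount=1.0):
    """Median deviation of log2 expression for each sample.

    counts: dict mapping sample ID -> list of counts, with genes in the
    same order for every sample (hypothetical input layout).
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log2-transform with a pseudocount so zero counts stay finite.
    log_expr = {
        s: [math.log2(c + pseudocount) for c in counts[s]] for s in samples
    }

    # Median log-expression of each gene across all samples.
    gene_medians = [
        median(log_expr[s][g] for s in samples) for g in range(n_genes)
    ]

    # Within each sample, take the median deviation from the gene medians.
    return {
        s: median(log_expr[s][g] - gene_medians[g] for g in range(n_genes))
        for s in samples
    }


# Toy example: sample "s3" is shifted upward by one log2 unit on every gene.
toy_counts = {
    "s1": [7, 15, 31],
    "s2": [7, 15, 31],
    "s3": [15, 31, 63],
}
deviations = per_sample_median_deviation(toy_counts)
```

In a well-normalized dataset these per-sample values are expected to sit close to zero, so a sample such as `s3` in the toy data would stand out as shifted.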

Case study: kidney cancer gene expression

We found only subtle differences in the deviations from the median expression between the normalization procedures for the kidney cancer dataset (Figure 6A). This cohort did not have the known technical variables observed in the other cohorts, such as study site or sample age, and the RNA came from fresh-frozen material (Supplemental Table S1). We evaluated normalization methods using DV300, the proportion of RNA fragments detected at greater than 300 base pairs, as a source of technical variation, and tumor stage as a biological variable of interest. The first two principal components

Proper normalization is imperative in performing correct statistical inference from complex gene expression data. Here, we outline a sequential framework for NanoString nCounter RNA expression data that provides quality control checks, considerations for choosing housekeeping genes, and iterative normalization with biological validation using both NanoString's nSolver software [9,12] and RUVSeq [6,8]. We show that RUVSeq provided a superior normalization to nSolver on three out of four datasets by more efficiently removing sources of technical variation, while retaining robust biological associations.

We also benchmark RUVSeq normalization against two other normalization methods implemented in R and show that RUVSeq outperformed all methods in reducing technical variation. We observed that normalization methods were sensitive to the quality and the set of housekeeping

We developed a quality metric to assess sample quality: samples with high proportions of genes detected below the LOD in both endogenous genes and housekeepers were indicative of either low-quality samples or reduced assay efficiency. Sample age was correlated with higher proportions of genes below the LOD in both endogenous and housekeeping genes, which was likely due to RNA degradation over

those derived from nSolver's assessment of data quality. We excluded these samples for analysis in both

showed a strong cis-eQTL signal in data from both normalization methods. We found significantly more trans-eQTLs with the nSolver-normalized data (Figure 3). However, many of the trans-eSNPs for the loci found with nSolver-normalized data tended to have moderate MAF differences across phase, leading us

The choice of normalization procedure is less of a concern in cohorts with minimal sources of technical

al recommend [6,7], normalization should be a part of the scientific process and should be approached iteratively with visual inspection and biological validation to tune the process. One normalization procedure is not necessarily applicable to all datasets and must be re-evaluated on each dataset.

In conclusion, we outline a systematic and iterative framework for the normalization of NanoString with RUVSeq [6,8] and data analysis with popular count-based R/Bioconductor packages, as well as iterative data visualization and biological validation to assess normalization. Researchers must pay close attention to the normalization process and systematically assess pipelines that best suit each dataset.

sub-optimal quality control and normalization pipelines.
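The sample-quality metric described in the discussion above (the proportion of genes detected below the LOD in both endogenous and housekeeping genes) can be sketched as follows. This excerpt does not define how the LOD is computed, so the sketch assumes a common nCounter convention: the mean of the negative control probe counts plus two standard deviations. That definition, the function names, and the 0.5 flagging cutoff are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, stdev


def limit_of_detection(negative_controls, n_sd=2.0):
    """Assumed LOD: mean + n_sd * SD of the negative control probe counts."""
    return mean(negative_controls) + n_sd * stdev(negative_controls)


def proportion_below_lod(counts, lod):
    """Fraction of probes in one sample whose count falls below the LOD."""
    return sum(c < lod for c in counts) / len(counts)


def assess_sample(endogenous, housekeeping, negative_controls, max_prop=0.5):
    """Summarize one sample; max_prop is a hypothetical flagging cutoff."""
    lod = limit_of_detection(negative_controls)
    endo = proportion_below_lod(endogenous, lod)
    hk = proportion_below_lod(housekeeping, lod)
    return {
        "lod": lod,
        "endogenous_below_lod": endo,
        "housekeeping_below_lod": hk,
        # Flag only when BOTH probe classes are largely undetected.
        "flag": endo > max_prop and hk > max_prop,
    }


# Toy samples sharing the same negative controls (mean 6, SD 2, so LOD = 10).
poor = assess_sample([3, 5, 9, 50], [8, 9, 200], [4, 6, 8])
good = assess_sample([20, 30, 9, 50], [100, 200, 300], [4, 6, 8])
```

Requiring both proportions to be high follows the text's wording that low detection in both endogenous genes and housekeepers indicates a low-quality sample or reduced assay efficiency. In practice the cutoff would be chosen by inspecting the distribution of these proportions across the cohort.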

• We provide an iterative framework for nCounter data with steps for quality control, normalization, and visualization/validation using RUVSeq.

• Using four real datasets, we show that our framework eliminates technical variation more reliably than other methods, including NanoString's provided software nSolver, without diminishing biological variation.

• We stress that quality control and normalization must be emphasized in study design and

Raw RCC files for nCounter expression from Sabry et al. [20] are available from the NCBI Gene Expression Omnibus (GEO) under accession number GSE130286. Raw and normalized expression data from CBCS will be available on GEO upon publication. For replication prior to publication, these data can be requested from the authors.