## Abstract

High-throughput DNA sequencing enables detection of copy number variations (CNVs) on a genome-wide scale with finer resolution than array-based methods, but suffers from biases and artifacts that lead to false discoveries and low sensitivity. We describe CODEX2, a statistical framework for full-spectrum CNV profiling that is sensitive for variants with both common and rare population frequencies and that is applicable to study designs with and without negative control samples. We demonstrate and evaluate CODEX2 on whole-exome and targeted sequencing data, where biases are the most prominent. CODEX2 outperforms existing methods and, in particular, significantly improves sensitivity for common CNVs.

## Background

Copy number variations (CNVs) are large deletions and duplications of chromosomal segments. CNVs are pervasive in the human genome and play a causal role in diseases such as cancer [1]. In the study of disease, CNVs usually appear in two contexts: *germline* CNVs refer to inherited variants, many of which are polymorphic at the population level [2]; in contrast, *somatic* CNVs refer to variants resulting from somatic mutations, such as those commonly observed in cancer. Germline CNVs can also be described as *common* or *rare* based on their population frequencies. This paper addresses the problem of detecting both germline and somatic CNVs and, in particular, of improving detection sensitivity for common CNVs in both categories.

With the dramatic growth of sequencing capacity and the accompanying drop in sequencing cost, massively parallel next-generation sequencing offers appealing platforms for genome-wide CNV detection. Whole-genome sequencing (WGS) offers an unbiased genome-wide approach to detect CNVs, while whole-exome sequencing (WES) and targeted sequencing allow the identification of disease-associated variants in coding regions with direct functional interpretation. Despite the rapid technological progress, CNV detection using high-throughput sequencing still faces analytical challenges due to the pervasive biases and artifacts that are introduced during library preparation and sequencing. Proper data normalization is crucial, especially for WES and targeted sequencing, where technical biases are usually orders of magnitude larger than the CNV signal.

In studies where deep coverage of specific genome regions is desired [3–5], or where the cohort of interest is large, WES and targeted sequencing are often preferred as cheaper alternatives to WGS. For example, the DiscovEHR Collaboration (http://www.discovehrshare.com) has sequenced the exomes of more than 50,000 participants, and the Exome Aggregation Consortium (http://exac.broadinstitute.org) has generated WES data for 60,706 unrelated individuals taken from multiple disease-specific and population genetic studies. This paper was primarily motivated by the challenges in WES and targeted sequencing data, but the methods developed here have also been integrated into a CNV detection pipeline for WGS data (Zhou et al., https://doi.org/10.1101/172700), where Mendelian concordance analysis on genetically related samples shows that the normalization methods described here allow for more accurate CNV calls.

CNV detection by WES and targeted sequencing is primarily based on detecting local changes in read coverage along the genome. This rests on the simple intuition that regions with copy number gain should have increased read coverage, and regions with copy number loss should have decreased read coverage. However, read coverage depends not just on copy number but also on many other factors, such as GC content [6], mappability [7], and other local sequence characteristics [8]. Therefore, read coverage tends to fluctuate even when there is no CNV, and is especially variable in WES and targeted sequencing data due to the biases and artifacts introduced during the targeting and amplification steps [9–11].

Many methods are available for CNV detection in high-throughput sequencing data [9–15]; a detailed review of recent advances is given in Methods. Despite much progress [16], significant challenges remain. Independent benchmark results from multiple studies [9, 11, 13, 17] show that existing methods suffer from low precision and recall rates, especially for the detection of common germline CNV signals. This is not unexpected, since in WES and targeted sequencing data, and also to an extent in WGS data, the technical background bias for each genomic target varies across samples, leading to false positives and negatives if not properly removed. To remove this technical background, recent methods have relied on low-dimensional linear factor models to capture the background bias [9, 10], an approach that tends to control false positives. Yet these low-dimensional linear factor models also tend to remove common CNVs that correlate with the estimated factors. In this paper, we show that this issue also affects the detection of somatic CNVs in cancer genomes, where CNVs that are recurrent across multiple cancer samples can be accidentally removed in the normalization step. Due to these limitations, current genetic studies using WES and targeted sequencing data have focused mostly on single nucleotide variations and, at best, rare CNVs [18–20].

Here we propose CODEX2 for full-spectrum CNV detection in high-throughput sequencing data. By full-spectrum, we mean sensitive detection of both common and rare CNVs. CODEX2 can be applied to two scenarios: the case-control scenario, where the goal is to detect CNVs that are enriched in the case samples (Figure 1A), and the scenario where control samples are not available and the goal is simply to profile all CNVs in all samples (Figure 1B). CODEX2 is evaluated in three ways. First, we reanalyze the WES data of the HapMap samples from the 1000 Genomes Project [2], where matched microarrays and experimental validation [21–23] are used to assess performance. Results here show that CODEX2 significantly improves both sensitivity and specificity over existing methods, especially for common CNVs. Next, we apply CODEX2 to targeted sequencing data of melanoma cancer cell lines and successfully identify recurrent CNVs, with recurrence rates highly concordant with those obtained from a separate cohort studied by The Cancer Genome Atlas (TCGA) [1]. Finally, we perform extensive simulations to benchmark existing methods and to explore how key variables, such as population frequency and CNV length, influence performance. CODEX2 is compiled as an open-source R package available at https://github.com/yuchaojiang/CODEX2.

## Results

### Methods overview

Figure 1 illustrates the two experimental designs where CODEX2 can be applied: The case-control design where a group of negative control samples are available and the goal is to detect CNVs present disproportionately in the case samples (Figure 1A), and the design with no negative control samples, such as the Exome Aggregation Consortium, where the goal is to detect all CNVs present in all samples (Figure 1B). The key innovation in CODEX2 is the construction and usage of *negative control genome regions* in a genome-wide latent factor model for sample- and position-specific background correction, and the utilization of *negative control samples*, under a case-control design, to further improve background bias estimation under this model. The negative control genome regions defined by CODEX2 are regions that do not harbor *common* CNVs, but are still allowed to harbor rare CNVs, and can be constructed from existing studies or learned from data.

Figure 2 illustrates how CODEX2 normalization works and how it differs from CODEX and from SVD-based normalization methods such as XHMM [10]. For simplicity, we assume that the background bias can be captured by a one-dimensional latent factor, which we call the “latent batch effect”. Against this background, assume there are two CNV regions: region A, where the carrier status (the vector indicating whether each sample is a carrier) is correlated with the underlying latent batch, and region B, where the carrier status is uncorrelated with the latent batch. The signal for these two CNV regions is obscured by the background batch effect in the observed data. In a standard SVD or CODEX normalization, which does not make use of negative controls, all samples and all genomic regions are used in the estimation of the latent background model, resulting in contamination of the background estimates by the CNV signal. The contamination is especially severe for CNV A, which is correlated with the latent batch. For CODEX2, in the scenario where negative control samples are available, only those samples are used to fit the latent factor model, which is then used to predict the background bias of the remaining samples. In the scenario where negative control samples are not available but negative control regions are identified, the latent factor model is fitted using only the negative control regions, and then used to “fill in” the background bias for the remaining regions. This way, we avoid the contamination of the background estimates by the CNV signal, thus attaining better separation of signal from noise, as can be seen from the histograms of the normalized *z*-scores. More details are given in Methods.

CODEX2 estimates a separate background factor for each genomic target/region in each sample, which can then be used to normalize the observed coverage and detect CNV regions using the recursive Poisson-likelihood segmentation algorithm in Jiang et al. [11].
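The distinction between fitting the background on all samples versus on negative controls only can be sketched with a toy rank-1 example. The data, dimensions, and the plain SVD/projection fits below are illustrative simplifications (CODEX2 itself fits Poisson latent factor models, not SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
n_exons, n_samples = 60, 40
controls = np.arange(10)              # indices of negative control samples

# Hypothetical rank-1 background ("latent batch effect"): exon loadings g,
# sample factors h.
g = rng.normal(0, 1, n_exons)
h = rng.normal(0, 1, n_samples)

# Spike a heterozygous deletion (log ratio -1) into exons 0-9 for the 15
# case samples with the smallest h, so that carrier status is correlated
# with the latent batch (the hard case, like region A in Figure 2).
cases = np.arange(10, n_samples)
carriers = cases[np.argsort(h[cases])[:15]]
signal = np.zeros((n_exons, n_samples))
signal[np.ix_(np.arange(10), carriers)] = -1.0

obs = np.outer(g, h) + signal + rng.normal(0, 0.05, (n_exons, n_samples))

# Naive rank-1 normalization: the background fitted on ALL samples and
# regions absorbs part of the correlated CNV signal.
u, s, vt = np.linalg.svd(obs, full_matrices=False)
naive_resid = obs - s[0] * np.outer(u[:, 0], vt[0])

# Control-based normalization: estimate the exon loadings from the control
# columns only, then project every sample onto them to get its factor.
u_ctrl, _, _ = np.linalg.svd(obs[:, controls], full_matrices=False)
g_hat = u_ctrl[:, 0]
h_hat = g_hat @ obs / (g_hat @ g_hat)
ctrl_resid = obs - np.outer(g_hat, h_hat)

cnv_cells = np.ix_(np.arange(10), carriers)
naive_mean = naive_resid[cnv_cells].mean()   # attenuated towards 0
ctrl_mean = ctrl_resid[cnv_cells].mean()     # close to the true -1
```

In this toy run the control-based residuals should retain the spiked deletion at close to its true depth, while the naive residuals are attenuated towards zero because the top singular component absorbs part of the correlated signal.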

### Analysis of WES data from the 1000 Genomes Project

We first evaluate CODEX2 by reanalyzing a publicly available WES data set from the 1000 Genomes Project [2], which contains 90 healthy individuals: 40 Utah residents with northern and western European ancestry, 20 Japanese from Tokyo, and 26 Yoruba from Ibadan. The sexes are well balanced, with 44 males and 46 females. All samples have CNV profiles from three previous microarray studies with experimental validations [21–23]. Please refer to Jiang et al. [11] for more details on the sample input and the microarray studies.

To assess the performance of CODEX2 and to benchmark it against existing methods, we use the experimentally validated CNVs from these three previous studies [21–23] as gold standards. Specifically, we adopt a stringent quality control procedure (i.e., the reported CNV must overlap with at least 2 and at most 20 exons, have less than a 5% NA rate across all samples, and have at most three copy number states). These ‘gold-standard’ CNVs are categorized as common if they are present in more than 10% of the samples, and rare otherwise. Using this ‘validation set’, we assess the performance of XHMM [10], EXCAVATOR [13], CODEX [11], and CODEX2 in detecting common and rare CNVs. Figure 3 shows the precision and recall rates across the four benchmarked methods, using the CNVs validated by each of the three studies [21–23] as ground truth. The grey lines are the contours of the *F*-measure, defined as the harmonic mean of precision and recall. XHMM, which is designed for detection of rare CNVs, lacks sensitivity to common CNVs, as does EXCAVATOR. CODEX detects proportionately more common CNVs but still lacks sensitivity. The best of the three existing methods achieves less than a 40% recall rate for common CNVs. CODEX2 achieves recall rates of 87.6%, 57.2%, and 83.1% in the three validation sets, respectively, while simultaneously making substantial improvements in specificity. Across all three validation sets, CODEX2 achieves good performance for both common and rare CNVs, with significant improvement in precision and recall for common CNV detection.
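For reference, the *F*-measure contours in Figure 3 follow directly from the standard definitions of precision and recall; the counts below are made up for illustration:

```python
def precision_recall_f(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and their harmonic mean (the F-measure)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical benchmark: 80 true CNVs, 70 calls made, 60 of them correct.
p, r, f = precision_recall_f(tp=60, fp=10, fn=20)  # -> (0.857..., 0.75, 0.8)
```

A method sitting on a higher grey contour in Figure 3 thus has a better precision-recall trade-off, regardless of where it falls along that contour.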

### Analysis of targeted sequencing data of melanoma cases and controls

We further apply CODEX2 to a melanoma cohort study from Garman et al. (unpublished) comprising 334 cases (untreated human melanoma cell lines, patient-derived xenografts, and tumors) and 16 controls. Samples are sequenced on a custom capture panel of 108 genes that have been previously implicated in the tumorigenesis of melanoma. For almost all tumor suppressor genes, the entire gene region (exons and introns) was sequenced to facilitate CNV calling; for oncogenes, only exons were sequenced. For genes where the full gene region is captured and sequenced, we separate the region into consecutive windows of 500 base pairs. This results in a panel of 1398 targets across 350 samples.

We apply CODEX2 to this data set and compare it against CODEX. The number of Poisson latent factors in the background model is determined by BIC for both programs. The use of negative controls in estimating the background model makes CODEX2 more robust to model tuning: for CODEX2, the number of latent factors has minimal effect on normalization and, more generally, on CNV detection, since only the normal samples are used to estimate the bias coefficient for each exon (Supplementary Table 1). In comparison, for CODEX, the number of CNV events decreases as the number of latent factors increases (Supplementary Table 1). Since the 108 genes are sparsely scattered across the genome, segmentation is performed within each gene separately. Furthermore, due to clonal heterogeneity and normal cell contamination, copy numbers may not be integers, and are assumed to be continuous and fractional, representing attenuated mean estimates of the genome mixtures. We categorize a CNV event as high gain, gain, diploid, one-copy deletion, or two-copy deletion if the profiled copy number is above 3.3, between 2.3 and 3.3, between 1.7 and 2.3, between 0.7 and 1.7, or below 0.7, respectively. Figure 4 shows the heatmaps of the segmentation results by CODEX and CODEX2. Each row corresponds to a sample, with the first 16 samples towards the bottom corresponding to the normal controls; each column corresponds to a target in the gene panel. In melanoma, somatic deletions of tumor suppressors (e.g., *PTEN*) and duplications of oncogenes (e.g., *BRAF*) are known to have high incidence rates [1]. From visual inspection of the heatmaps in Figure 4, we see that CODEX2 successfully retains these expected recurrent deletions and duplications, while CODEX, which does not make use of the negative control samples in fitting the background model, misinterprets the recurrent signals as a background latent factor.
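The copy-number categorization above is a simple thresholding rule; a direct transcription follows (thresholds as stated in the text; the handling of exact boundary values is our choice, since the text leaves it open):

```python
def call_state(copy_number: float) -> str:
    """Map a fractional copy-number estimate to the five states of Figure 4."""
    if copy_number > 3.3:
        return "high gain"
    if copy_number > 2.3:
        return "gain"
    if copy_number > 1.7:
        return "diploid"
    if copy_number > 0.7:
        return "one-copy deletion"
    return "two-copy deletion"
```

For example, `call_state(2.0)` returns `"diploid"`, and a fractional estimate of 1.4 (e.g., a subclonal deletion) is called a one-copy deletion.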

To rigorously evaluate CODEX2’s accuracy on this data set, we compare the frequencies of the profiled gains and losses, that is, the proportion of samples in which a call is made, with frequencies from an independent melanoma patient cohort in TCGA [1]. Specifically, for each gene target, we plot in Figure 5 the proportion of samples carrying a deletion (or duplication) in TCGA against the corresponding proportion in our current data set. CODEX2 achieves much higher concordance with the TCGA results, with overall correlation across genes reaching 0.842 for deletions and 0.853 for gains, as compared to CODEX (correlation = 0.52 for deletions and 0.049 for gains). CODEX2 detects in these cell line samples a higher frequency of *BRAF* amplification and *CDKN2A* loss than the frozen-tissue-derived TCGA results, which is not surprising given the relative *in vitro* growth advantage of cells carrying these mutations. Based on the CODEX2 results, Garman et al. (unpublished) further separate the cohort by cancer subtype and clinicopathological characteristics (responses to targeted and/or immunotherapy) and investigate the differences in mutational profiles between groups. The accurate profiling of CNVs in this data set enables unbiased downstream analysis.

### Performance assessment via spike-in studies

To understand how variables such as CNV length and population frequency affect the sensitivity of CODEX, CODEX2, and methods based on singular value decomposition (SVD, such as XHMM and CoNIFER), we conduct *in silico* spike-in studies. We start with the exon read depth data from chromosomes 16 to 22 in the 90 samples we analyzed from the 1000 Genomes Project, and apply filtering to remove putative existing CNVs. We then add, to the background count matrix under the null, heterozygous CNV signals of varying length, frequency, and degree of correlation with the first latent factor in the background model. See details of the simulation setup in Methods.

As an illustration, Figure 6 shows a small subset of the CNV regions in the spike-in data with the ground truth, the post-normalization heatmap, and the CNV assignments across multiple methods, with the “null” regions containing no CNVs removed for easier visualization. The histograms in Figure 6 show the distribution of the normalized *z*-scores, with exons that harbor CNVs in red and exons within diploid regions in grey. We see that CODEX2 achieves clear separation of the deletions and diploids with the distributions centered on the expected values, log (1/2). The segmentations by CODEX and XHMM contain false negatives as well as false positives, especially in regions where common CNV signals reside, while improved normalization by CODEX2 allows almost perfect segmentation. Supplementary Figure 3 further shows the true latent background parameters (estimated using raw read depth without spike-in, which represents the null model that would be unobservable in a real data scenario) and the estimated parameters obtained by CODEX and CODEX2 on the “observed data” containing added CNVs. Our results show that while the sample loadings {*h*_{1}, *h*_{2}, *h*_{3}} are consistent between CODEX, CODEX2, and the ground truth, the exon-specific factors {*g*_{1}, *g*_{2}, *g*_{3}} estimated by CODEX are biased due to the inclusion of the mean CNV signals, reflecting the same trend in Supplementary Figure 2. CODEX2 corrects this bias through the use of the mixture model.

We systematically compare the performance of CODEX2 against existing methods by spiking in CNVs of length 5, 10, 20, and 40 exons with population frequency *p* ∈ {5%, 10%, …, 95%}, repeating each simulation run 20 times. The precision and recall rates achieved by each method are shown in Figure 7. The plots show how two variables, population frequency and degree of correlation with the batch effect, impact the accuracy of each method. The results show that CODEX and the SVD-based methods are sensitive to both variables, while CODEX2 maintains high accuracy across all frequencies and all degrees of correlation. CODEX2 has nearly perfect performance, while CODEX and the SVD-based methods suffer from low sensitivity and specificity, especially for common CNVs with frequencies around 50%. We also investigate the effect of CNV length on performance and show that CODEX and the SVD-based methods have lower sensitivity and specificity for longer CNVs, as compared to CODEX2 (Supplementary Figure 4).

We further study the relationship between CNV carrier status and batch effects (44 and 46 samples were sequenced by the Baylor College of Medicine and the Washington University Genome Sequencing Center, respectively) and show that when the two are highly correlated, CODEX suffers from low sensitivity, while CODEX2 is able to recover the CNV signals from the batch effect (Figure 7). Detailed results from an example run are shown in Supplementary Figure 5, where the spiked-in CNVs are highly correlated with the batch effect (i.e., most of the carriers are from one of the sequencing centers). CODEX estimates the latent factors assuming all exons are null, with the fitted regression line lying between the carriers and non-carriers, resulting in low sensitivity for true deletions in carriers and false-positive duplication calls in non-carriers. CODEX2, on the other hand, estimates the exon-specific latent factors in common CNV regions using the proposed EM algorithm and successfully separates the carriers from the non-carriers, which leads to clean downstream segmentation. These results show that normalization has a first-order effect on detection accuracy.

## Methods

Multiple efforts have been made to recover CNV signals from experimental noise. VarScan2 [12], ExomeDepth [14], and ExomeCNV [15] control for baseline fluctuations in read coverage by relying on matched normal samples or building an optimized reference set. EXCAVATOR [13] adopts a median normalization approach for bias removal. It was soon realized that the magnitudes of the various sources of bias are sample-specific, and thus cannot be completely removed by normalizing to control samples or reference sets. This motivated CoNIFER [9] and XHMM [10] to adopt SVD to estimate sample-specific backgrounds, which can be more effective. SVD is, however, designed for capturing linear biases in continuous-valued observations. GC content has been shown to have a sample-specific, non-linear bias in sequencing data. Furthermore, read counts are not fit well by Gaussian models, even after transformation, due to their fluctuation over a very wide range. Our previous work, CODEX [11], adopts a Poisson latent factor model for count-based sequencing data and estimates a sample-specific background for each genomic position that incorporates nonlinear biases due to GC content, target-specific capture and amplification efficiency, and low-dimensional latent systemic artifacts.

We will start by giving an overview of the SVD-based methods by CoNIFER [9] and XHMM [10] and the Poisson latent factor model by CODEX [11]. We will discuss the limitations of existing models and the reason why they lack sensitivity for common CNVs. We will then describe the model for CODEX2, leaving algorithmic details to the Supplements.

### Existing methods

Denote **Y** as an *n* × *m* matrix of raw read depth, where *Y*_{ij} corresponds to the read depth for exon *i* ∈ {1, …, *n*} in sample *j* ∈ {1, …, *m*}, and each column of **Y** is column-standardized. The SVD-based methods CoNIFER [9] and XHMM [10] remove the strongest *K* SVD components,

**Y*** = **Y** − *U*_{K}*D*_{K}*V*_{K}^{T},

where *U*_{K} and *V*_{K} correspond to the top *K* left and right singular vectors, respectively, and *D*_{K} corresponds to the diagonal matrix of the *K* largest singular values. The genome is then segmented by a hidden Markov model (HMM) into ‘diploid’, ‘deletion’, or ‘duplication’ states.
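As a concrete sketch, this normalization step can be reimplemented in a few lines (a generic illustration, not the actual XHMM or CoNIFER code):

```python
import numpy as np

def svd_normalize(Y: np.ndarray, K: int) -> np.ndarray:
    """Column-standardize the depth matrix, then remove the strongest
    K SVD components: Y* = Z - U_K D_K V_K^T."""
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    # U[:, :K] * d[:K] scales the top-K left singular vectors by their
    # singular values, giving the rank-K background estimate.
    return Z - (U[:, :K] * d[:K]) @ Vt[:K]
```

The residual matrix is what the HMM then segments into the deletion, diploid, and duplication states.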

This paper extends CODEX [11], which improves upon SVD-based approaches in several ways: CODEX adopts a Poisson model that more accurately models count data, and, importantly, it explicitly models observable and measurable sources of bias, such as GC content and exon length, in addition to unmeasurable biases due to unanticipated experimental variables, in the form of latent factors. In particular, GC content bias exhibits nonlinear patterns of variation across samples [6, 11], and thus CODEX uses a non-linear sample-specific function instead of a low-rank linear factor to capture this bias. Since an understanding of the CODEX model is integral to our ensuing discussion, we review it here. CODEX assumes that

*Y*_{ij} ~ Poisson(*λ*_{ij}), *λ*_{ij} = *N*_{j} *f*_{j}(*GC*_{i}) *β*_{i} exp(∑_{k=1}^{K} *g*_{ik}*h*_{jk}),

where *N*_{j} is the total number of reads in sample *j*, *f*_{j}(·) is the GC content bias function for sample *j*, *GC*_{i} is the GC percentage of exon *i*, *β*_{i} is an exon-specific factor capturing multiplicative effects due to features such as exon length, and *g*_{k} = {*g*_{1k}, …, *g*_{nk}}, *h*_{k} = {*h*_{1k}, …, *h*_{mk}} are the *k*th (1 ≤ *k* ≤ *K*) Poisson latent factors. The Poisson latent factors form a low-dimensional background model to capture unanticipated experimental variables, similar to the singular vectors in SVD-based normalization models [9, 10]. CODEX uses maximum likelihood to estimate the parameters of the null model, where **g** and **h** are assumed to be orthogonal to ensure identifiability (see Supplementary Algorithm 1 for details). For the remaining details on model selection, parameter estimation, and genome segmentation, see Jiang et al. [11].
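To fix notation, the mean of the CODEX model can be evaluated directly; every numeric value below is made up for illustration (CODEX estimates the GC bias function *f*_{j} non-parametrically from the data, not from a closed-form curve):

```python
import numpy as np

def codex_mean(N_j, f_j, GC_i, beta_i, g_i, h_j):
    """lambda_ij = N_j * f_j(GC_i) * beta_i * exp(sum_k g_ik * h_jk)."""
    return N_j * f_j(GC_i) * beta_i * np.exp(np.dot(g_i, h_j))

# Toy GC bias curve peaking at 45% GC, for illustration only.
f = lambda gc: np.exp(-((gc - 0.45) ** 2) / 0.02)

# Expected depth for one exon in one sample under made-up parameters:
# 2e7 total reads, 50% GC, exon-specific factor 1e-6, one latent factor.
lam = codex_mean(N_j=2e7, f_j=f, GC_i=0.50, beta_i=1e-6,
                 g_i=np.array([0.1]), h_j=np.array([0.2]))
```

Each factor enters multiplicatively, so the latent term exp(∑ *g*_{ik}*h*_{jk}) acts as a sample- and exon-specific correction on top of the measured biases.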

These SVD- and factor-based methods all lack sensitivity for common CNVs. This is because common CNVs bias the estimation of the low-dimensional linear factors (*g*_{k} and *h*_{k} in CODEX's model and *U*_{K}, *V*_{K} in SVD-based algorithms), which, in CNV regions, capture part of the CNV signal. Therefore, subsequent removal of these factors also removes the signal for the CNVs. Supplementary Figure 2 shows a toy example illustrating this issue, focusing on one exon *A* and assuming that *K* = 1. For this exon, CNV signals are *in silico* added at six population frequencies, creating six scenarios. For each scenario, simulated read counts *Y*_{i} = {*Y*_{i1}, …, *Y*_{im}} are plotted against the sample factors *h*_{1} = {*h*_{11}, …, *h*_{m1}}. The samples that carry the CNV are shown in red; the rest are shown in black. In each iteration of CODEX, Poisson log-linear regression is performed for *Y*_{i} on *h*_{1} to get *g*_{i1}, the ‘loading’ for exon *i*. CODEX estimates *g*_{i1} assuming that the great majority of the samples are ‘nulls’, that is, that they don't carry CNVs, resulting in the estimated background values shown in green. Ideally, the background curve should be fit to the control samples, the true ‘nulls’, as shown in blue. Yet in real data we don't know which samples are the true carriers of this CNV, and the background curve is thus contaminated with signal, lying between the carriers and the non-carriers. The higher the incidence rate of the CNV, the higher the contamination of the background by signal, and the closer the fitted green curve lies to the carriers rather than the non-carriers. This, as a result, leads to low sensitivity in detecting the true CNV carriers. The effect of population frequency on detection sensitivity by CODEX and SVD-based algorithms is also explored by the simulations shown in Figure 6 and Figure 7.

### Full-spectrum CNV detection

Figure 1 gives an overview of the two scenarios of CODEX2. To describe CODEX2, we first consider the simpler scenario where negative control samples are available. Without loss of generality, consider a cohort comprised of both tumor and normal samples, where the normal samples need not be matched to the tumor samples. Our goal is to identify CNVs present in the cancer samples but not in the normal samples. Duplications of oncogenes and deletions of tumor suppressors are commonly seen in cancer samples and have been reported to be associated with cancer; it is therefore crucial to detect somatic CNVs recurrent only in cancer samples with high sensitivity and specificity. Denote *J*_{c} as the set of indices of the normal samples, which serve as negative controls. Assuming that the normal samples are copy-number-neutral, the negative control samples are used by CODEX2 to estimate the exon-specific bias **β** and latent factors **g**, so as to avoid attenuating and removing common CNV signals. Poisson regression is then applied to each cancer sample to obtain the sample loadings **h**. The sample-specific background values are then computed from **β**, **g**, and **h**, and used in Poisson likelihood segmentation to identify CNVs. Refer to Supplementary Algorithm 2 for the detailed estimation procedure.

Now let’s consider the scenario where negative control samples are not available. We denote *I** as the set of indices of the exons that harbor highly recurrent CNVs, the complement of which gives the indices of exons within the negative control regions. The set *I** can be obtained based on prior knowledge (e.g., common deletions in tumor suppressors), from existing databases (e.g., the Database of Genomic Variants), or empirically from a first-pass CODEX run: if an exon lies within a common CNV region, CODEX will return a high standard deviation of the normalized *z*-scores across all samples for this exon, since, as Supplementary Figure 2 shows, for a common CNV the estimated null is biased towards the alternative, which has a high incidence rate. Figure 1B shows an example of identifying germline CNVs from a population of samples (e.g., healthy individuals from the 1000 Genomes Project). For step 1 in Supplementary Algorithm 2, we no longer have a set of control samples from which to directly estimate the exon-specific parameters. We therefore propose an expectation-maximization (EM) algorithm embedded in our iterative parameter estimation procedure, where the missing data are the carrier statuses of the samples. Specifically, for each exon *i* ∈ *I**, we assume a two-group mixture,

*Y*_{ij} ~ Poisson(*λ*_{ij}), *λ*_{ij} = *N*_{j} *f*_{j}(*GC*_{i}) *β*_{i} exp(∑_{k} *g*_{ik}*h*_{jk} + *μ*_{i}*Z*_{ij}), *Z*_{ij} ~ Bernoulli(*p*_{i}),

where *p*_{i} is the incidence rate of the CNV that spans exon *i*, *Z*_{ij} indicates whether sample *j* carries the CNV, and *μ*_{i} is the deviation from the null on the log scale, which can be either pre-fixed (e.g., log(1/2) for a heterozygous deletion) or estimated by CODEX2. For simplicity, here we show the case where there is only one type of CNV event within the carriers; this can be easily extended to multiple subgroups with a model selection metric, which is enabled in the CODEX2 package.

*N*_{j}, *f*_{j}(*GC*_{i}), *β*_{i}, and **h** can be estimated using the negative control regions, as shown in Supplementary Algorithm 3 steps 1 and 2. We then adopt an EM algorithm to estimate the remaining unknown parameters, treating the carrier indicators *Z*_{ij} as missing data. Refer to Supplementary Algorithm 3 for implementation details.
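A minimal one-exon sketch of this EM step may be helpful. It assumes the null means λ0 are already known (in CODEX2 they come from the bias model fitted on negative control regions) and updates only the carrier frequency and the log-scale shift; the package's actual implementation iterates this jointly with the latent factor estimation:

```python
import numpy as np

def em_one_exon(y, lam0, mu_init=np.log(0.5), p_init=0.2, n_iter=50):
    """Two-group Poisson mixture at one exon.

    y    : observed read depths across samples
    lam0 : null (background) means from the fitted bias model
    Returns the carrier frequency p, the log-scale shift mu, and the
    posterior carrier probabilities (the E-step responsibilities).
    """
    mu, p = mu_init, p_init
    for _ in range(n_iter):
        # E-step: posterior probability that each sample carries the CNV
        # (Poisson log-likelihoods up to the shared log(y!) constant).
        ll_alt = y * (np.log(lam0) + mu) - lam0 * np.exp(mu)
        ll_null = y * np.log(lam0) - lam0
        r = 1.0 / (1.0 + (1.0 - p) / p * np.exp(ll_null - ll_alt))
        # M-step: closed-form updates for p and mu given the responsibilities.
        p = r.mean()
        mu = np.log((r @ y) / (r @ lam0))
    return p, mu, r

rng = np.random.default_rng(2)
lam0 = rng.uniform(80, 120, 200)       # hypothetical null means
z = rng.random(200) < 0.4              # 40% of samples carry a het deletion
y = rng.poisson(lam0 * np.where(z, 0.5, 1.0))
p_hat, mu_hat, r = em_one_exon(y, lam0)
```

With the strong separation in this toy setting, the posterior probabilities recover the simulated carriers and μ converges near log(1/2).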

CODEX2 offers the choice of Akaike information criterion (AIC), Bayesian information criterion (BIC), and variance reduction to determine the optimal number of latent factors. In CODEX and SVD-based methods (XHMM and CoNIFER), the number of latent factors is a critical model tuning parameter that affects normalization and segmentation results. It is often not clear whether to use the AIC, BIC, or simple visual examination of the scree plot, and this arbitrariness plagues all methods that rely on SVD, PCA, or factor models. By using negative controls to guide the estimation of the background model, CODEX2 is less sensitive to the number of latent factors (Supplementary Table 1), thus giving results that are easier to reproduce.

For segmentation, CODEX2 adopts the same Poisson likelihood ratio based approach as CODEX; please refer to Jiang et al. [11] for details. For targeted sequencing, where a smaller pre-selected panel of targets is sequenced, the normalization model is exactly the same, and segmentation is performed within each gene separately. For WGS, refer to Zhou et al. (https://doi.org/10.1101/172700) for details.

### Simulation setup

We start with the exonic read depth data of chromosomes 16 to 22 from the 90 samples we analyzed from the 1000 Genomes Project, and apply filtering to remove putative existing CNVs. Specifically, the filtering step removes all exons that: (i) are called to harbor CNV events by XHMM, EXCAVATOR, CODEX, or CODEX2; (ii) overlap with duplications or deletions reported in the Database of Genomic Variants; (iii) do not pass the quality control procedure of CODEX (median coverage between 40 and 4000, length between 30 and 2000 base pairs, mappability above 0.95, GC content between 30% and 70%); or (iv) have a standard deviation of normalized *z*-scores across samples above 0.3, a maximum normalized *z*-score above 0.8, or a minimum normalized *z*-score below −0.8. This leaves 4035 ‘null’ exons that are CNV-free across the 90 samples. Treating this filtered count matrix as background, we fit the background model of CODEX and take the estimated parameters as ground truth. The optimal number of latent factors is 3 by AIC, BIC, and variance reduction (Supplementary Figure 3) and is kept the same for the subsequent analyses by CODEX2, CODEX, and the SVD-based method. We then add, to this background count matrix, heterozygous CNV signals of varying length, frequency, and degree of correlation with the first latent factor in the background model. In more detail, we spike in heterozygous deletions, of varying lengths and population frequencies, by reducing the raw depth of coverage for exons spanned by the CNV from *y* to *y* × *c*/2, where *c* is sampled from a normal distribution with mean 1 and standard deviation 0.1. Note that heterozygous deletions with frequency *p* in the population have exactly the same detection accuracy as duplications with frequency 1 − *p*, since all copy number events are defined in reference to a population average.
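The spike-in transformation itself (*y* → *y* × *c*/2 with *c* ~ Normal(1, 0.1)) is straightforward; a sketch with toy matrix dimensions rather than the actual 1000 Genomes counts:

```python
import numpy as np

def spike_in_het_deletion(Y, exons, carriers, rng):
    """Reduce raw depth y to y * c / 2 for the given CNV exons in the
    carrier samples, with multiplicative jitter c ~ Normal(1, 0.1)."""
    Y = np.asarray(Y, dtype=float).copy()
    c = rng.normal(1.0, 0.1, size=(len(exons), len(carriers)))
    Y[np.ix_(exons, carriers)] *= c / 2.0
    return Y

rng = np.random.default_rng(3)
Y_null = rng.poisson(200, size=(100, 90))   # toy 'null' depth matrix
Y_obs = spike_in_het_deletion(Y_null, exons=list(range(10)),
                              carriers=list(range(45)), rng=rng)
```

Varying which samples appear in `carriers` (e.g., drawing them preferentially from one sequencing center) is how the degree of correlation with the batch effect is controlled.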

## Discussion and conclusions

A limitation shared by all existing CNV detection methods, highlighted by multiple independent benchmarking studies, is the lack of sensitivity for common variants. Similarly, in our experience applying CODEX [11] to WES and targeted sequencing of tumor samples, we easily detect sporadic mutations but miss highly recurrent ones. To meet the widespread demand for improved CNV detection, we develop in this paper a new method, CODEX2, that removes technical noise and improves the CNV signal-to-noise ratio for all sequencing platforms, including WES and targeted sequencing. CODEX2 builds on our existing method CODEX [11] with a significant improvement in sensitivity for common variants, thus allowing full-spectrum CNV detection. CODEX2 can be applied in two scenarios: the case-control scenario, where the goal is to detect CNVs that are enriched in the case samples, and the scenario where control samples are not available and the goal is simply to profile all CNVs.

We extensively benchmark CODEX2 against existing methods in three ways. In the first evaluation, we reanalyze the WES data of the HapMap samples from the 1000 Genomes Project, where a set of experimentally validated CNV calls from microarrays and other sources is used to assess performance. Results here show that CODEX2 markedly improves both sensitivity and specificity over existing methods. The improvement for common variants is the most substantial, from a 40% recall rate to >80% according to two of the three validation sets used, while simultaneously improving precision from 60% to 90%. In the second evaluation, we apply CODEX2 to targeted sequencing data of melanoma cancer cell lines, where CODEX2 detects CNVs with recurrence rates that are highly concordant with those obtained from TCGA. Finally, we perform extensive simulations benchmarking existing methods and elucidating how key variables, such as population frequency, influence detection sensitivity. Taken together, these results establish the improved accuracy of CODEX2 over existing state-of-the-art approaches, and the utility of the software under varying study designs.

Risso et al. [24] propose the remove unwanted variation (RUV) approach for normalizing RNA sequencing data and detecting differential expression, which adopts factor analysis based on sets of control genes or samples. CODEX2 resembles RUV but has additional distinguishing features. First, RUV is designed for a case-control setting to identify differentially expressed genes, whereas CODEX2 can be applied with or without negative control samples. Second, when estimating the latent factors that correspond to the control genes or samples, RUV adopts SVD under a Gaussian assumption, while CODEX2 uses Poisson generalized linear modeling, which directly accommodates count data. Furthermore, CODEX2 explicitly models the GC content bias, which cannot be fully captured by a linear principal component [6, 11]. CODEX2 also adjusts for the library size factor and exon lengths, all of which can be directly quantified and modeled.

CODEX2 normalizes the read depth data for CNV detection via a Poisson latent factor model, a framework that can be adapted to other settings within the genomics domain. Lee et al. [25] apply it to non-normalized microRNA sequencing data. Chen et al. [26] estimate allele-specific copy number in a tumor-normal setting, using the Poisson latent factor model to remove biases and artifacts that cannot be fully captured by comparison to the matched normal.

In this paper, we show application results of CODEX2 on WES and targeted sequencing data. For WGS, user-defined consecutive bins can be treated as ‘targets’ in the WES setting, with normalization and segmentation procedures carried out in the same fashion. A method based on this approach has been integrated into a pipeline that detects CNVs using both sequencing and microarray data (Zhou et al., https://doi.org/10.1101/172700). Notably, exome sequencing, with its low cost and fast turnaround, remains the method of choice in large-scale genomic studies whose primary goal is to identify variants within protein-coding regions. Sequencing capacity has also increased exponentially over the past few years. CODEX2 is currently being applied to 12,000 blood-derived WES samples in the Penn Medicine BioBank and over 10,000 WES samples in the Alzheimer's Disease Sequencing Project. This tremendous amount of data opens up great opportunities, yet at the same time calls for caution with regard to running efficiency and method scalability. CODEX2 processes each chromosome within each batch separately and can thus be highly parallelized. With increasing sequencing capacity, and an increasing need to profile CNVs as a non-negligible source of genetic variation, we believe that CODEX2 can be a useful tool for the genetics and genomics community.

### Abbreviations

- CNV: copy number variation
- WGS: whole-genome sequencing
- WES: whole-exome sequencing
- SVD: singular value decomposition
- TCGA: The Cancer Genome Atlas
- AIC: Akaike information criterion
- BIC: Bayesian information criterion
- EM: expectation-maximization
- RUV: remove unwanted variation

## Declarations

### Ethics approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Availability of data and material

CODEX2 is an open-source R package available at https://github.com/yuchaojiang/CODEX2 with license GPL-2.0.

### Competing interests

The authors declare no conflict of interest.

### Funding

This work was supported by the National Institutes of Health (NIH) grant R01 HG006137 to NRZ and P01 CA142538 to YJ.

## Authors’ contributions

NRZ initiated and envisioned the study. YJ and NRZ formulated the model. YJ developed and implemented the algorithm. YJ, KLN, and NRZ conducted the analysis. YJ and NRZ wrote the manuscript. All authors read and approved the final manuscript.

## Supplement to “CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing”

### Parameter estimation algorithm for CODEX

**Initialization:** estimate the library size factor $\hat{N}_j$ for each sample $j$ and initialize $\hat{\beta}_i^{\mathrm{old}} = \mathrm{median}_j(Y_{ij}/\hat{N}_j)$ for each exon $i$.

**Iteration:**

1. For each sample $j$, fit a smoothing spline of $\log\{Y_{ij}/(\hat{N}_j\hat{\beta}_i^{\mathrm{old}})\}$ against $GC_i$ to obtain $\hat{f}_j(GC)$.
2. For each exon $i$, update $\hat{\beta}_i$ as $\hat{\beta}_i^{\mathrm{new}} = \sum_j Y_{ij}\big/\sum_j \hat{N}_j\hat{f}_j(GC_i)$.
3. Denote $Z = Nf(GC)\beta^{\mathrm{new}}$, i.e., $Z_{ij} = \hat{N}_j\hat{f}_j(GC_i)\hat{\beta}_i^{\mathrm{new}}$. Apply SVD to the row-centered $\log(Y/Z)$ and use the first $K$ right singular vectors as $\hat{h}^{\mathrm{old}}$. Then:
   - (a) For each exon $i$, fit a Poisson log-linear regression with $Y_{i:}$ as response, $\hat{h}^{\mathrm{old}}$ as covariates, and $\log(Z_{i:})$ as fixed offset to obtain the updated estimate $\hat{g}_i$.
   - (b) For each sample $j$, fit a Poisson log-linear regression with $Y_{:j}$ as response, $\hat{g}$ as covariates, and $\log(Z_{:j})$ as fixed offset to obtain the updated estimate $\hat{h}_j^{\mathrm{new}}$.
   - (c) Center each row of $\hat{g}(\hat{h}^{\mathrm{new}})^{T}$ and apply SVD to the row-centered matrix; the first $K$ right singular vectors update $\hat{h}^{\mathrm{new}}$.

   Repeat steps (a) to (c) with $\hat{h}^{\mathrm{old}} = \hat{h}^{\mathrm{new}}$ until convergence to obtain $\hat{h}$ and $\hat{g}$.
4. Repeat steps 1 to 3 with $\hat{\beta}^{\mathrm{old}} = \hat{\beta}^{\mathrm{new}}$ until convergence.
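To make the latent-factor step concrete, the sketch below simulates a count matrix with a planted rank-$K$ sample-level artifact and recovers the sample factors by SVD of the row-centered $\log(Y/Z)$, as in the iteration above. The matrix sizes, factor scales, and the `+ 1` zero guard are illustrative assumptions of ours, not part of the CODEX implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_exon, n_sample, K = 300, 40, 2

# Z is a stand-in for the multiplicative null N_j * f_j(GC_i) * beta_i;
# a rank-K artifact g h^T is planted on the log scale.
Z = rng.uniform(50, 150, size=(n_exon, n_sample))
g = rng.normal(0, 0.3, size=(n_exon, K))        # exon-level loadings
h = rng.normal(0, 1.0, size=(n_sample, K))      # sample-level factors
Y = rng.poisson(Z * np.exp(g @ h.T))

# Row-center log(Y/Z) and take the first K right singular vectors as h_hat.
R = np.log((Y + 1) / Z)                          # +1 guards against zero counts
R -= R.mean(axis=1, keepdims=True)
_, _, Vt = np.linalg.svd(R, full_matrices=False)
h_hat = Vt[:K].T                                 # n_sample x K estimate of h

# Row-centering removes the column means of h, so compare the recovered
# subspace against the column-centered true factors.
hc = h - h.mean(axis=0)
proj = h_hat @ np.linalg.lstsq(h_hat, hc, rcond=None)[0]
rel = np.linalg.norm(hc - proj) / np.linalg.norm(hc)
```

The relative residual `rel` is small because the planted factors dominate the Poisson noise on the log scale.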

### Parameter estimation algorithm for CODEX2 with negative control samples

Let $J_c$ be the indices of the control samples.

1. Apply CODEX (Supplementary Algorithm 1) to the control samples $\{Y_{ij}: j\in J_c\}$ to obtain estimates of $\hat{N}$, $\hat{\beta}$, and $\hat{g}$.
2. For each case sample $j\notin J_c$:
   - (a) Fit a smoothing spline of $\log\{Y_{ij}/(\hat{N}_j\hat{\beta}_i)\}$ against $GC_i$ to obtain $\hat{f}_j(GC)$.
   - (b) Denote $Z_{:j} = \hat{N}_j\hat{f}_j(GC)\hat{\beta}$. Fit a Poisson log-linear regression with $Y_{:j}$ as response, $\hat{g}$ as covariates, and $\log(Z_{:j})$ as fixed offset to obtain the updated estimate $\hat{h}_j$.

   Repeat steps (a) and (b) until convergence to obtain $\hat{h}_j$ and $\hat{f}_j(GC)$.

### Parameter estimation algorithm for CODEX2 without negative control samples

Let $I^{*}$ be the indices of exons harboring CNVs with high population frequencies.

1. Apply CODEX (Supplementary Algorithm 1) to the null regions $\{Y_{ij}: i\notin I^{*}\}$ to obtain estimates of $\hat{N}$, $\hat{f}_j(\cdot)$, $\hat{\beta}$, $\hat{g}$, and $\hat{h}$.
2. Use the estimated non-parametric smoothing spline $\hat{f}_j(\cdot)$ to estimate the GC content bias $\hat{f}_j(GC_i)$ at the corresponding GC contents in the non-null regions $i\in I^{*}$.
3. For each $i\in I^{*}$, adopt an EM algorithm to estimate $\hat{\beta}_i$, $\hat{g}_i$, and the carrier statuses:
   - E-step: compute the posterior carrier probability
     $$\hat{\tau}_{ij} = \frac{\hat{p}_i\,P_1(Y_{ij})}{\hat{p}_i\,P_1(Y_{ij}) + (1-\hat{p}_i)\,P_0(Y_{ij})},$$
     where $\hat{p}_i$, $\hat{\beta}_i$, and $\hat{g}_i$ are from the previous M-step and $P_1(\cdot)$ and $P_0(\cdot)$ are the Poisson probabilities of $Y_{ij}$ with means calculated under the carrier and non-carrier copy numbers given the current parameters.
   - M-step: update $\hat{p}_i = \sum_j \hat{\tau}_{ij}/n$, where $\hat{\tau}_{ij}$ is from the previous E-step. Run a Poisson log-linear regression with $Y_{i:}$ as response, $\hat{h}$ and $\hat{\tau}_{i:}$ as covariates, and $\log\{\hat{N}_j\hat{f}_j(GC_i)\}$ as fixed offsets to obtain $\log\hat{\beta}_i$ as the intercept and $\hat{g}_i$ as the coefficients.
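The E-step/M-step pair can be illustrated with a stripped-down sketch at a single exon, assuming the normalized background mean is known and estimating only the carrier frequency and the posterior carrier probabilities (the full M-step also refits the exon intercept and latent-factor loadings by Poisson regression). All data here are simulated, and the closed-form likelihood ratio is our own simplification.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
mu = rng.uniform(80, 120, size=n)          # known normalized mean (copy number 2)
true_carrier = rng.random(n) < 0.3         # 30% heterozygous-deletion carriers
y = rng.poisson(np.where(true_carrier, mu / 2, mu)).astype(float)

p = 0.5                                    # initial carrier frequency
for _ in range(50):
    # E-step: posterior carrier probability. The Poisson likelihood ratio
    # P(y | mu) / P(y | mu/2) simplifies to exp(y*log(2) - mu/2), so the
    # Poisson pmf never needs to be evaluated explicitly.
    ratio = np.exp(y * np.log(2.0) - mu / 2.0)
    tau = p / (p + (1.0 - p) * ratio)
    # M-step: update the carrier frequency as the mean posterior probability.
    p = tau.mean()
```

With well-separated carrier and non-carrier coverage, the estimated frequency converges to the fraction of true carriers, and thresholding `tau` at 0.5 recovers the carrier statuses.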

## Acknowledgements

We thank Bradley Garman, Ioannis N. Anastopoulos, and Bradley Wubbenhorst for providing the case-control melanoma dataset and Xiaoqi Geng, Xiangwei Zhong, and Dr. Jingshu Wang for helpful comments and discussions.