Rare copy number variants (CNVs) and breast cancer risk

Background Copy number variants (CNVs) are pervasive in the human genome but potential disease associations with rare CNVs have not been comprehensively assessed in large datasets. We analysed rare CNVs in genes and non-coding regions for 86,788 breast cancer cases and 76,122 controls of European ancestry with genome-wide array data. Results Gene burden tests detected the strongest association for deletions in BRCA1 (P= 3.7E-18). Nine other genes were associated with a p-value < 0.01 including known susceptibility genes CHEK2 (P= 0.0008), ATM (P= 0.002) and BRCA2 (P= 0.008). Outside the known genes we detected associations with p-values < 0.001 for either overall or subtype-specific breast cancer at nine deletion regions and four duplication regions. Three of the deletion regions were in established common susceptibility loci. Conclusions This is the first genome-wide analysis of rare CNVs in a large breast cancer case-control dataset. We detected associations with exonic deletions in established breast cancer susceptibility genes. We also detected suggestive associations with non-coding CNVs in known and novel loci with large effects sizes. Larger sample sizes will be required to reach robust levels of statistical significance.


Introduction
Copy number variants (CNVs) are pervasive in the human genome but are more challenging to detect with current technologies than single nucleotide variants (SNVs). Recent comprehensive sequencing projects 1,2 have characterised CNVs in large sample sets. The gnomAD project identified a median of 3,505 deletions and 723 duplications covering more than 50 base pairs per genome. Most deletions and duplications tend to be rare with longer variants tending to be rarer, suggesting negative selection against these variants. At the population level the 1000 Genomes project has mapped a large proportion of inherited CNVs 3 and observed that 65% had a frequency below 2%. Some CNVs have an established role in the inherited risk of breast cancer. Rare loss of function variants in susceptibility genes such as BRCA1 and CHEK2 are associated with a large increase in risk 4 . While the majority of these variants are single nucleotide mutations and short indels, they also include longer deletions and duplications. It has been reported that up to a third of loss of function BRCA1 variants in some populations may be CNVs 5 .
Large-scale genome-wide association studies (GWAS) have established breast cancer associations with common variants at more than 150 loci, mostly in non-coding regions [6][7][8][9] . At two of the loci, deletions imputed from the 1000 Genomes reference panel have been identified as likely causal variants. A deletion of the APOBEC3B gene-coding region increases breast cancer risk 10 and analysis of the tumours of the germline deletion carriers showed an increase in APOBEC-mediated somatic mutations. 11 A deletion in a regulatory region was identified as a likely causal variant at the 2q35 locus 12,13 .
Detecting CNVs from the intensity measurements of genotyping array probes is prone to producing unreliable calls due to the high level of noise. We recently developed a novel CNV calling method, CamCNV 14 , which focuses on rare CNVs and identifies outlier samples that may have a CNV, based on the intensity distribution across all samples at each probe. We showed that this approach is able to detect CNVs using as few as three probes 14 . Here, we apply this approach to a very large array genotype dataset to search for novel breast cancer associated CNVs.  Table 1). Studies included population-based and hospitalbased case-control studies, and case-control studies nested within prospective cohorts; we only included data from studies that provided both cases and controls. Phenotype data were based on version 12 of the BCAC database. Cases were diagnosed with either invasive breast cancer or carcinoma-in-situ. Oestrogen receptor (ER) status was determined from medical records or tissue microarray evaluation, where available. Analyses were restricted to participants of European ancestry, as defined by ancestry informative principal components 6,7 . Where samples were genotyped on both arrays, we excluded the iCOGS sample as the OncoArray has better genome-wide coverage. After sample quality control (see below), data on 36,980 cases and 34,706 controls with iCOGS genotyping, and 49,808 cases and 41,416 controls with OncoArray genotyping, were available for analysis (Supplementary Table 2).

Arrays
The Illumina iCOGs genotyping array 6 includes 211,155 probes for SNVs and short insertions/deletions. Most variants were selected because of previous association in casecontrol studies for breast prostate and ovarian cancers, or for dense mapping of regions harbouring an association. The OncoArray includes 533,631 probes, of which approximately half were selected from the Illumina HumanCore backbone, a set of SNPs designed to tag most common variants. The remainder were selected on the basis of evidence of previous association with breast, prostate, ovarian, lung or colorectal cancer risk. Approximately 32,000 variants on the OncoArray were selected to provide dense coverage of associated loci and known genes. The remainder were mostly selected from lists of common variants ranked by p-value, with a small number from lists of candidate variants.

CNV Calling
CNVs were called using the CamCNV pipeline as previously described 14 . In brief, the log R (LRR) intensity measurements and B allele frequency (BAF) for each sample at each probe were exported from Illumina's Genome Studio software. A principal component adjustment (PCA) was applied to the LRR, grouped by study, to remove noise and batch effects. After removing noisy probes and those in regions with known common CNVs, the LRRs for each probe were converted to z-scores using the mean and standard deviation from all BCAC samples. Circular binary segmentation was applied to the z-scores ordered by probe position for each sample using the DNACopy package. 16 This produces a list of segments for each sample by chromosome where the z-score of consecutive probes changes by more than two standard deviations. Segments with a mean probe z-score between -3.7 and -14 were called as deletions and segments with a mean z-score between +2 and +10 as duplications. We restricted the calls to segments covering a minimum of three and a maximum of 200 probes.
As per the CamCNV pipeline, we then excluded deletions with inconsistent B Allele Frequency and CNVs with a shift in LRR at the sample level that was outside the expected range. The additional CNV exclusions are summarised in Supplementary Table 3. To exclude regions with a high level of noise we also excluded CNVs falling within 1Mb of telomeres and centromeres and a number of immune loci such as the T-cell receptor genes where somatic mutations in blood are often observed 17 .

Sample Quality Control
Standard sample quality control exclusions were performed, as previously described for the SNP genotype analyses 6,7 . These include exclusions for excess heterozygosity, ancestry outliers, mismatches with other genotyping, and close relatives. A stricter sample call rate of >99% was used for the CNV analysis, compared to >95% used in the genotype analyses. We also excluded any participants for whom a DNA sample was not collected from blood and any that had been whole genome amplified.
In addition, we used two metrics to exclude noisy samples liable to produce an excess of unreliable CNV calls. First, we calculated a derivative log ratio spread (DLRS) figure for each sample as the standard deviation of the differences between LRR for probes ordered by genomic position, divided by the square root of two. This measures the variance in the LRR from each probe to the next averaged over the whole genome and thus is insensitive to large fluctuations such as might be expected between different chromosomes in the same sample. An ideal sample would have a small DLRS as the only variance would come from a small number of genuine CNVs. We calculated the DLRS using the dLRs function in R package ADM3 (https://CRAN.R-project.org/package=ADM3) before and after the PCA. At both stages we excluded samples with a DLRS more than 3.5 standard deviations above the mean DLRS for that study.
Second, we counted the number of short segments (between three and 200 probes) output by DNACopy for each sample. We observed that the distribution of segment counts was skewed to the right with an excess of samples with a large number of segments. We calculated a cut-off for the maximum number of segments using the following formula where x is the segment count for each sample (based on the rationale that the distribution of the true number of segments should be approximately Poisson): y=2*sqrt(x) cut-off = median(y)+3.5 The sample exclusions resulting from these QC steps are summarised in Supplementary Table 4.

Association Tests
All analyses were carried out separately for deletions and duplications. As we were only assessing rare CNVs, we treated all carriers as heterozygotes and did not attempt to identify rare homozygotes.
To account for overlapping CNVs and imprecision in the breakpoints, we assigned individual CNVs to regions. To identify the regions, we moved sequentially along each chromosome, identifying the start as an Oncoarray probe position where deletions were observed in at least five samples, and then the end position as the probe position before the first probe where deletions were observed in fewer than five samples. Regions within five probes of each other were then merged together. The process was repeated for duplications. Regions were also merged such that the major susceptibility genes (BRCA1, BRCA2, CHEK2) were included within a single region. We then assigned individual CNVs to regions where at least 90% of the CNV's length fell within the region. For iCOGS, which generally has less dense probe coverage, we first assigned CNVs to the OncoArray regions where they showed > 90% overlap. We then assigned any remaining CNVs to regions defined using the iCOGS probes, using the same procedure. Using this approach, 3,306 deletion regions were identified from OncoArray data, 812 of which were also observed using iCOGS data, and 541 regions identified using iCOGS alone. For duplications there were 2,203 OncoArray regions, with 854 also observed using iCOGS data, and 483 iCOGS specific regions.
Associations were evaluated for each array and each region using logistic regression, to derive a log odds ratio per deletion/duplication. Statistical significance was evaluated using a likelihood ratio test. The logistic regression analyses were conducted using in-house software (https://ccge.medschl.cam.ac.uk/software/mlogit/). Study and ten ancestry informative principal components, defined separately for each array, were included as covariates. The results from each array were combined in a standard fixed effect metaanalysis using the METAL software 18 . To avoid regions with too few observations, we excluded regions with fewer than 24 deletions or duplications (~0.015% of samples).
To detect more precisely the location of association signals, we also generated results for each probe. We created a vector of pseudo-genotypes for each probe with samples, such that a deletion covering that probe was coded as 1 and all other samples were coded as 0. We generated a similar set of genotypes for duplications. The results were analysed using logistic regression, as above.
To test for association between CNVs affecting the coding sequence of genes, in aggregate, and breast cancer risk, we identified samples with a deletion or duplication overlapping the exons of each gene. Exon positions were downloaded from the UCSC Genome Browser hg19 knownGene table. We used logistic regression to generate a log odds ratio (OR) for carriers of coding variants covering each gene, adjusted for study, as above. We generated results for each array and then for carriers combined across both arrays. For the combined analyses we treated studies with samples on both arrays as separate studies.
To calculate Bayesian False Discovery Probabilities (BFDPs) we assumed a log-normally distributed prior effect size as described by Wakefield 19 . The prior log(OR) was determined by assuming a 95% probability that the OR was less than some bound K, where K=3 for the regional and gene-based analysis, except for BRCA1 and BRCA2 where K=20 was assumed. The prior probability of association was assumed to be 0.001 for the regional analysis, 0.99 for BRCA1, BRCA2, ATM and CHEK2 and 0.002 for other genes. For the gene-based analysis only positive associations were considered as the prior evidence for all genes was in favour of PTVs being positively associated with risk.
To determine whether there was a tendency for CNVs to be associated with an excess, or deficit, of risk across genes or regions, we computed signed z-scores as the square root of the chi-squared statistic for each gene, multiplied by ±1 depending on whether the effect estimate was positive or negative. These were ranked and normalised summed z-scores, based on the r most significant associations, were derived. The overall test statistic was the maximum summed z-score over all possible values of r: Where n is the total number of genes/regions being tested. The significance of U was then determined by permutation, randomly permuting case-control labels within study 50 times.

Summary of CNVs Detected
After quality control we detected a mean of 2.9 deletions (standard deviation 1.6) and 2.5 duplications (SD 2.0) per sample. Supplementary Table 5 shows the mean length, probe coverage and segment z-scores of called CNVs. Duplications tended to be longer than deletions: for example, deletions called on OncoArray covered a mean of 45 Kilobases (Kb) (SD 106 Kb) over 9.8 probes (SD 17.2), while duplications covered a mean of 109 Kb (SD 202 Kb) over 18.9 probes (SD 36.5). CNV calls observed in multiple samples were concentrated in a small proportion of probes (Supplementary Table 6), with <11% of probes having frequency >0.01% and <2% of probes having frequency >0.5%.
We identified called CNVs which overlapped for at least 90% of their length with rare deletions and duplications (frequency <1%) identified by the 1000 Genomes Project (Supplementary Table 5). Forty-nine percent of OncoArray deletions and 47% of iCOGs deletions matched a 1000 Genomes Project variant while 29% of OncoArray duplications and 20% of iCOGs duplications matched. In total we identified CNVs closely matching 3,273 of the deletion variants published by the 1000 Genomes Project (~9% of total) and 1,255 of their duplication variants (~24% of the total).

CNVs Associated with Overall Risk
Association results were derived for 1,301 regions containing deletions and 992 regions containing duplications. QQ plots are shown in Figure 1A for deletions and 1B for duplications. There was no evidence for inflation in the test statistics for duplications (lambda=0.98; lambda 1000 =1.00) and minimal evidence for deletions (lambda = 1.11; lambda 1000 =1.00).
Seven deletion and two duplication regions were associated with breast cancer risk at p<0.001 (Table 1); of these, deletions within the BRCA1 region achieved p< 10 -6 . The results for all regions are shown in Supplementary Tables 7 and 8 and include statistics on the  number of probes covered by the calls. The results for individual probes covered by the  regions analysed are in Supplementary Tables 9 to 12. The BRCA1 locus contains multiple deletions across the gene. The CHEK2 region (OR 1.94, p=0.0003) covers the whole gene but nearly all the calls correspond to a deletion of exons nine and ten, which was previously observed in 1% of breast cancer cases and 0.4% of controls in Poland 20 . We observed the deletion in 0.9% of Polish cases and 0.5% of controls. The

Associations with Risk of Breast Cancer Subtypes
We repeated the analyses restricting cases to those with ER-positive, ER-negative and triple negative disease. Deletions and duplications with p-values below 0.001 are shown in Tables  2 and 3  In total we observed five deletion and two duplication regions with p-values < 0.001 that were not below this threshold in the overall risk analysis. The strongest novel association for ER-positive was for an intronic deletion in ITGBL1 (OR = 3.3, P=0.00007, P for difference by ER-status=0.18). For ER-negative disease the strongest novel association was with an intergenic deletion between ABCC4 (MRP4) and CLDN10 (OR=2.16, P=0.0002, P for difference by ER-status=0.02). Neither of these associations was significant for the other subtype. For triple negative disease, the strongest evidence of association was found for an intergenic duplication between TMC3 and MEX3B (OR=2.39 P= 0.00009) and for two separate deletions upstream of the DDX18 gene: 2:118258797-118389164 (OR= 6.56, P= 0.00001) and 2:117973154-118107795 (OR=4.54, P= 0.0008). The association at these two deletions was driven by the same samples, with 75% of the carriers of the first deletion observed to have the second deletion and normal copy number at the 62kb gap between the deletions.

Associations at Established Common Susceptibility Loci
Three of the most significant associations were observed within regions harbouring known breast cancer susceptibility loci for breast cancer. The most significant (OR=1.42;CI,1.21 to 1.67; P=0.00015) was upstream of FGFR2 and consistent with a 28 kb deletion in the 1000 Genomes Project data (chr10:123433204-123461492). Three independent risk signals have been previously identified at this region 22,23 . The effect size for the CNV was larger than those previously reported for these common SNPs ( For deletions, we found 10 genes with P < 0.01 (Table 4), the most significant being BRCA1 (OR=7.66, P= 3.72E-18). Four of these 10 genes (ATM, BRCA1, BRCA2, CHEK2) are known breast cancer susceptibility genes. 21 Deletions were also observed in two other known susceptibility genes: PALB2 (23 cases, 9 controls, OR=2.02, P=0.09) and RAD51C (21 cases, 9 controls, OR=2.04, P=0.08). The most significant novel association was for SUPT3H (OR=0.27, P=0.0004).
For duplications we observed 15 genes with P < 0.01 (Table 5). The most significant association was for VPS53 (OR = 0.5, P= 0.0009). This gene and ABR (OR=0.61 P= 0.008) both lie within the region on 17p which had the strongest association in the regional analysis. These associations were driven by duplications in different samples, with only one duplication in one sample overlapping both genes. Duplications were associated with an increased risk for only four of the 15 genes; the most statistically significant was RSU1 (OR=3.4, P= 0.004). There was also some evidence of association with risk for duplications in BRCA1 (OR =

Direction of effect tests
In the gene burden and individual probe analyses we observed a directional effect, whereby the strongest associations for deletions tended to increase risk and those for duplications tended to be protective. To test whether these associations deviated from what would be expected by chance, we computed ranked summed z-score tests and evaluated the significance of the maximum test statistic by permutation. Results are summarised in Table  6. The statistic for deletion regions was more significant than any of the permuted statistics (P=0.04) but was reduced to P=0.12 after removing the known genes BRCA1 and CHEK2.
The significance of the gene burden test for deletions also was reduced from P=0.04 to P=0.2 when the known genes were removed. The statistic for the duplication regions was lower than any of the 50 permutations (P=0.04). The gene burden analysis for duplications was not significant.

Discussion
We used the largest available breast cancer case-control dataset, comprising more than 86,000 cases and 76,000 controls with array genotyping, to test for associations with rare CNVs. Using the intensities from genotyping arrays to detect CNVs is not ideal due to a high level of noise and uncertainty in the calling, particularly for duplications. However, in tests of known CNVs and replication of calls between duplicate samples, the CamCNV method shows reasonable levels of sensitivity and specificity 14 . The main focus of this analysis was low frequency CNVs (<1% frequency) since higher frequency CNVs can generally be studied through imputation to a reference panel. In the 0.05%-1% frequency range, we could detect ~20% of the CNVs identified by the 1000 Genomes project. For some loci we only had evidence from one array because the probes do not exist to detect the variants on the other array. Thus, while this array-based approach provides power to evaluate the CNVs that can be assayed, much denser arrays or direct sequencing would be required to provide a complete evaluation of the contribution of CNVs.
In support of the reliability of the method, we detected evidence for both deletions and duplications in BRCA1, which was stronger for ER-negative disease, and for deletions in CHEK2 , which were stronger for ER-positive disease. The latter appeared to be driven by a single founder deletion in East European populations. Weaker evidence of association was also observed for deletions in other susceptibility genes (BRCA2, ATM, PALB2, RAD51C); the ORs were consistent with those seen for deleterious SNVs and indels. 21 In total, around 0.5% of cases in our analysis had a deletion in one of the known susceptibility genes with the proportion rising to ~1% for cases diagnosed under 50 years of age. The majority of coding deletions are expected to affect only part of the gene, with one study observing that a quarter covered only a single exon. 26 To detect all of these using array data would require at least three probes per exon. The OncoArray has this level of coverage for a few genes, including BRCA1 and BRCA2, but the coverage is lower for most genes and many coding CNVs will have been missed.
A key issue is the appropriate level of statistical significance to apply to these analyses. For the gene burden tests, P<2.5x10 -6 , as used in exome-sequencing, seems an appropriate level. It is less clear what is appropriate for non-coding variants. A level of P<5x10 -8 has become standard in GWAS and has been shown to lead to acceptable replication, but this seems over-conservative for CNVs, which are more likely to be deleterious. Consistent with this, for at least two of the ~200 common susceptibility loci, the likely causal variant is a CNV, a higher fraction than expected given the relative frequencies of CNVs and SNPs. Based on frequency analysis of whole-genome sequence data Abel et al. 1 estimated that rare CNVs are >800 times more likely to be deleterious than rare SNVs and >300 times more likely than rare indels. On the other hand, the significance level for non-coding CNVs should logically be more stringent than for the gene burden tests. Taken together, a significance level of ~10 -6 seems appropriate, while associations at P<0.001 may be worth following up in future studies. In our analyses only the association at BRCA1 (both in the overall and gene burden tests) passes the higher threshold. We also calculated Bayesian False Discovery Probabilities (BFDPs) 19 (Supplementary Tables 20 and 21) for our associations using prior probabilities of 0.001 for regions and 0.002 for genes. Outside the known genes none of the BDFPs gave a probability below 10%, with the lowest BFDP of 0.11 for the deletion in the FGFR2 locus. For a CNV observed with a frequency of 0.1% (n=91 samples in the OncoArray dataset) we had 40% power to detect an association with an odds ratio of 2 but only 1.5% power to detect an association with an odds ratio of 1.5. An OR of 2, comparable to that seen for deleterious variants in ATM and CHEK2, may be plausible for rare coding CNVs or non-coding CNVs that have a significant effect on gene expression. Larger sample sizes will clearly be required to evaluate rare CNVs with more modest associations.
In addition to the BRCA1 and CHEK2 loci, we found associations in three known susceptibility regions identified through GWAS, harbouring FGFR2, ADCY8 and KLF12. In each case, the variants are rarer than the established associated variants, but confer higher risks. The ADCY8 and KLF12 deletions are not in linkage disequilibrium with the associated SNPs. The FGFR2 deletion is in linkage disequilibrium with two of the likely causal common SNPs although there was still evidence of association with the deletion, albeit weaker, after conditional analysis. In-silico and functional analysis clearly demonstrate that FGFR2 is the target of the previously established variants 22,23 ; it will be interesting to establish if the same is true for the CNV.
Excluding loci in known susceptibility regions, the strongest evidence of association was for a 12kb deletion ( 13:102121830-102133956) in the second intron of ITGBL1 (OR = 3.3, P=0.00007 in the ER-positive analysis). This deletes a promoter flanking region (Ensembl ID: ENSR00001563823) and CTCF binding site (Ensembl ID: ENSR00001062398) active in mammary epithelial cells. There is experimental evidence that ITGBL1 expression, mediated by the RUNX2 transcription factor, enables breast cancer cells to form bone metastases 27 .
In the gene burden analysis, the strongest novel association was for deletions within SUPT3H, which were associated with a reduced risk. SUPT3H encodes human SPT3, a component of the STAGA complex which acts as a co-activator of the MYC oncoprotein 28 . SUPTH is located within the first intron of the RUNX2 transcription factor and the syntenic relationship between the two genes is highly evolutionarily conserved 29 . RUNX2 has a role in mammary gland development and high RUNX2 expression is found in ER-negative tumours. 30 The PCDHGB2 association appears to be due to a single variant (5:140739812-140740918) that deletes the first exon but as this gene is part of the protocadherin gamma gene cluster it is also possible that the deletion may be having an effect on one of the five genes that overlap PCDHGB2. It also deletes a promoter active in mammary epithelial cells (ENSR00001342785). The next strongest signals were for MEAK7 (OR=2.19 P= 0.001), a gene implicated in a mTOR signalling pathway 31 , and MAD1L1 ((OR=2.00 P=0.005), a component of the mitotic spindle-assembly that has been suggested as a possible tumour suppressor 32 .
After BRCA1, the most significant association for ER Negative disease in the gene burden analysis was for CYP2C18 which overlaps CYP2C19 ( ER-negative OR=2.6, P=0.002; triplenegative OR=4.4, P=0.0002). A previous analysis of CNVs and breast cancer in the Finnish population identified a founder mutation reaching an overall frequency of ~ 3% and reported a possible association at this locus for triple negative (OR 2.8, p= 0.02) and ER-negative breast cancer (OR =2.2 p=0.048). 33 The results from duplications are harder to interpret as there are often longer duplications overlapping whole genes with shorter variants covering some part of their length. For the gene burden analysis there was little evidence of strong associations. In the regional analysis, the two strongest associations cover multiple genes. The strongest evidence of association (OR=0.69 P=1.1E-05) was for a 1.5 Mb region at the start of chromosome 17 (17:13905-1559829). The probe-specific and gene burden results highlighted some stronger signals within this region, for example within NXN and VPS53, but the direction of effect was consistent across the region with 80% of the OncoArray probes having an odds ratio of 0.75 or lower (Figure 2). For the 0.4Mb duplication region on chromosome 21 (OR= 2.23 P=0.0001) the probe-specific results from OncoArray highlighted that the signal is specific to a shorter intergenic region (21:33421860-33459975) between HUNK and LINC00159.
We observed some evidence of an aggregate directional effect, both for genes and nongenic regions, such that the deletions in aggregate were associated with increased risk. There was also some suggestion that duplications, in aggregate, were associated with a reduced risk. These results suggest that additional associations are present that could be established with a larger dataset. A new GWAS, Confluence (https://dceg.cancer.gov/research/cancer-types/breast-cancer/confluence-project), aims to double the available sample size for breast cancer. This GWAS includes probes specifically designed to assay some of the most significant CNVs observed in this study (those significant at P<0.001), and the sample size should be sufficient to confirm or refute these associations.