Profiling copy number alterations in cell-free tumour DNA using a single-reference

Background The accurate detection of copy number alterations from the analysis of circulating cell free tumour DNA (ctDNA) in blood is essential to realising the potential of liquid biopsies. However, currently available approaches require a large number of plasma samples from healthy individuals, sequenced using the same platform and protocols, to act as a reference panel. Obtaining this reference panel can be challenging and prohibitively expensive, and it limits the ability to migrate to improved sequencing platforms and protocols.
Methods We developed qCNV and sCNA-seq, two distinct tools that together provide a new approach for profiling somatic copy number alterations (sCNA) through the analysis of cell free DNA (cfDNA) without a reference panel. Our approach was designed to identify sCNA from cfDNA through the analysis of a single plasma sample and a matched normal DNA sample, both of which can be obtained from the same blood draw. qCNV is an efficient method for extracting read-depth from BAM files and sCNA-seq is a method that uses a probabilistic model of read depth to infer the copy number segmentation of the tumour. We compared the results from our pipeline to the established copy number profile of a cell-line, as well as the results from the plasma-Seq analysis of cfDNA-like mixtures and real, clinical data-sets.
Results With a single, unmatched, germline reference sample, our pipeline recapitulated the known copy number profile of a cell-line and demonstrated similar results to those obtained from plasma-Seq. With less than 1X genome coverage, our approach identified clinically relevant sCNA in samples with as little as 20% tumour DNA. When applied to plasma samples from cancer patients, our pipeline identified clinically significant mutations.
Conclusions These results show it is possible to identify therapeutically-relevant copy number mutations from plasma samples without the need to generate a reference panel from a large number of healthy individuals. Together with the range of sequencing platforms supported by our qCNV+sCNA-Seq pipeline, as well as the Galaxy implementation of this solution, this pipeline makes cfDNA profiling more accessible and makes it easier to identify sCNA from the plasma of cancer patients.


Introduction
Somatic copy number alterations (sCNA) are an important class of mutation in cancer [1,2]. Specific sCNA such as the amplification of HER2 in breast cancer or the Androgen Receptor (AR) in prostate cancer have both been linked to better outcomes from specific therapies and increased five-year survival rates [3][4][5][6]. Determining the specific copy number mutations present in an individual's cancer is essential for fully understanding each patient's disease.
Characterisation of the cell-free DNA (cfDNA) in the bloodstream of cancer patients can offer insights into the specific molecular events associated with each patient's disease without the need for invasive surgeries [7][8][9][10][11].
Analysis of the circulating tumour DNA (ctDNA) component of cfDNA can make it possible to determine an individual's response to specific treatments [12], and to monitor the evolution of the patient's disease [13,14].
While the importance of sCNA is well established, much of the work surrounding the use of cfDNA to characterise a patient's disease has focused on the analysis of somatic single nucleotide variations (SNV) [5,15,16,17].
Several methods have identified copy number changes from cfDNA by comparing the DNA from the patient's plasma to a panel of reference samples collected from the plasma of groups of healthy individuals that have been processed and characterised using the same technology [8,10,16,18]. These methods identify sCNA segments by determining regions in the cfDNA that are significantly different from the reference panel (Z-score). This Z-score based approach has been effectively used across a range of different cytogenetic settings and is routinely used in prenatal screening [18,19].
Analysis of a cell-line processed to resemble cfDNA (DNA samples fragmented to 150-250 bp) with plasma-Seq demonstrated that it was able to recapitulate the established copy number profile of the cell-line across a range of samples that contained varying amounts of tumour DNA [8]. Together, these attributes have made plasma-Seq an effective and popular approach for identifying sCNA in cfDNA, and have allowed researchers to characterise the copy number profiles from the plasma of patients suffering from a range of different cancer types [8,10,14,20,21]. However, a panel of library-matched, technology-matched, and sex-matched data-sets from a group of healthy individuals can be prohibitively difficult for smaller laboratories to source and generate. Moreover, migration to a new sequencing technology, alterations to sequencing chemistry, or improvements to the processes for manipulating cell-free DNA mandate the periodic generation of a new reference panel. Finally, this reference panel based approach can make it difficult to separate germline copy number variants (CNV) from sCNA.
In this manuscript, we describe the qCNV+sCNA-seq pipeline, which identifies sCNA without requiring a reference panel of healthy plasma samples. This approach identifies sCNA from the cfDNA of a cancer patient relative to a single reference sample, such as the cfDNA sample from a healthy, unrelated individual, or a sample of normal germline DNA from the same patient. We show this pipeline can characterise sCNA through the analysis of artificial, cfDNA-like mixtures and clinical cfDNA samples. We also describe a web-based Galaxy implementation of the qCNV+sCNA-Seq pipeline [23], which can be used by labs without dedicated bioinformatics resources. The versatility of this approach makes it possible for more researchers to adopt cfDNA sequencing to identify sCNA.

Methods
We developed a new pipeline composed of two novel tools, qCNV and sCNA-seq (Figure 1). qCNV is a memory- and time-efficient method for determining the number of reads that align to specific regions of the genome, while sCNA-seq takes the output from qCNV and uses the distribution of reads from the 'normal' reference sample to identify sCNA from the cfDNA sample. Both tools operate independently from one another, allowing researchers to make use of the advantages each tool offers for their own pipelines. The counts file from qCNV can be used for any purpose, and sCNA-Seq can use a correctly formatted counts file from an alternative program, but together these tools offer a fast pipeline able to detect sCNA from low coverage cfDNA samples.
Figure 1. The pipeline is made up of two separate tools and three distinct stages. (A) The first stage uses qCNV to determine the number of reads that align to each 1000 bp window in both the cfDNA and normal germline samples. (B) The second stage takes the output from qCNV and passes it through the whole genome training mode of the second tool, sCNA-Seq. This mode uses larger window sizes and a smaller number of copy number states to better determine the proportion of tumour content in a sample. The third stage uses the tumour content estimation to run sCNA-Seq with a smaller window size (down to 2 kb) and a higher number of potential copy number states to provide a high resolution annotation of the tumour genome through the analysis of the cfDNA. This can be run across the entire genome (Mode 2) or for a specific region of the genome (Mode 3).

Data from previous plasma-Seq experiments.
Heitzer et al. [8] generated a series of plasma-like DNA samples to determine the capacity of the plasma-Seq approach to characterise cfDNA samples that contained varying amounts of tumour derived DNA (Additional File 1, Supplementary Table 1). These mixtures were generated by mixing DNA from the HT-29 tumour cell-line with DNA from an unmatched normal cell line at varying proportions and subsequently mechanically shearing the DNA in order to generate fragment lengths similar to those observed in cell-free DNA (150-250 bp). These samples were then sequenced with the MiSeq platform (1 x 150 bp). We downloaded this data from the European Genome-Phenome Archive (accession: EGAD00001000364) [24]. We also downloaded raw sequence data from 11 cfDNA samples collected from 7 prostate cancer patients, and 9 cfDNA samples from healthy females, using the same accession. These datasets were mapped to the HG19 annotation of the human genome with BWA, using the internal pipeline detailed below. The coverage of these samples ranged from 0.06X to 0.29X (Additional File 1, Supplementary Table 1).
Butler et al. [17] generated exome sequencing data on primary tumour, metastatic, cfDNA and germline samples from two patients (Additional File 1, Supplementary Table 1). This data set contained data from metastatic breast cancer and sarcoma. Mapped BAM files were downloaded from the European Nucleotide Archive [25] (accession PRJEB8969). The mapping and processing of these files were described in the original publication [17]; the coverage of these samples ranged from 118X to 309X.
Visualization of the established copy number profile for the HT-29 cell-line.
An independent annotation of the complete copy number profile of the HT-29 cell-line was downloaded from the COSMIC Cell Lines Project (CCLP) [26], using the Sample ID: COSS905939. This whole genome annotation of the sCNA in HT-29 was generated by the CCLP through a PICNIC analysis [27] of microarray data produced by the Affymetrix SNP6.0 array. Unlike plasma-Seq and qCNV+sCNA-seq, which analysed short degraded fragments of DNA in the plasma, the CCLP analysis used genomic DNA collected from intact cells in a pure cell-line. The copy number profile generated by the CCLP represents the gold standard copy number for the HT-29 cell-line. To visually represent this information, the data file containing the absolute copy number of each segment was plotted with ggplot2 [28]. To allow readers to fairly compare the results from different platforms, each chromosome was condensed to the same width as the corresponding chromosome in the sCNA-seq figure.
Generation of a sex matched, germline reference, used in the comparison to EGAD00001000364
As the EGAD00001000364 cohort did not contain any normal, germline samples, we made use of an existing ultra-low coverage, sex-matched sample of germline DNA that our group had previously sequenced with the MiSeq platform. This sample was collected as part of a study to follow up patients diagnosed with breast cancer.
The collection of breast cancer samples, by the Brisbane Breast Bank, was approved by the Human Research Ethics Committee at the University of Queensland (approval number: 2005000785). A sample of the patient's germline DNA was collected from a buffy coat sample and was used as our unmatched, normal reference. The DNA from this sample was extracted with Qiagen's DNeasy Blood and Tissue kit and was sequenced using the second iteration of the MiSeq sequencing chemistry. These raw fastq files were then passed through the same mapping pipeline as the raw plasma-Seq data from EGAD00001000364. While low coverage (0.92X), our reference sample was still considerably deeper than any of the plasma-Seq samples (Additional File 1, Supplementary Table 1). To achieve a comparable depth, this reference sample was down-sampled using Samtools [29]. The final coverage of the down-sampled reference sample used in these comparisons was 0.18X.

Sequence Alignment and Post-Processing
All raw fastq files (both public and sequenced in this study) were mapped to the genome (HG19) with BWA-MEM [30]. The resulting SAM file produced by BWA was sorted and converted into a bam file using Samtools [29]. Samtools was used to filter out supplementary alignments and index the filtered bam. The duplicate reads in this filtered bam were then identified using the MarkDuplicates function of Picard [31]. The filtered, sorted, duplicate marked bam file was then characterised by qCoverage and qProfiler [32] to determine the quality and coverage of the mapped samples. When required, down-sampling of the final bams was performed with the view -s function of Samtools.
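The mapping and down-sampling steps above can be sketched as a small command builder. This is an illustrative sketch, not the authors' internal pipeline: the file names, the `picard.jar` path and the helper names `build_alignment_commands` and `downsample_command` are assumptions, and the commands are only assembled as strings, not executed. The down-sampling helper uses the `SEED.FRACTION` form accepted by `samtools view -s`, with the fraction taken as target coverage divided by current coverage (for example 0.18X from 0.92X).

```python
import shlex


def build_alignment_commands(ref, fastq, prefix, picard_jar="picard.jar"):
    """Assemble shell commands for mapping, filtering and duplicate marking.

    All paths are hypothetical placeholders chosen for illustration.
    """
    ref, fastq = shlex.quote(ref), shlex.quote(fastq)
    sorted_bam = shlex.quote(f"{prefix}.sorted.bam")
    filtered_bam = shlex.quote(f"{prefix}.filtered.bam")
    marked_bam = shlex.quote(f"{prefix}.dedup.bam")
    metrics = shlex.quote(f"{prefix}.metrics.txt")
    return [
        # Map with BWA-MEM and sort the SAM stream into a BAM file.
        f"bwa mem {ref} {fastq} | samtools sort -o {sorted_bam} -",
        # Drop supplementary alignments (SAM flag 0x800 = 2048).
        f"samtools view -b -F 2048 -o {filtered_bam} {sorted_bam}",
        # Index the filtered BAM.
        f"samtools index {filtered_bam}",
        # Mark duplicate reads with Picard.
        f"java -jar {shlex.quote(picard_jar)} MarkDuplicates "
        f"I={filtered_bam} O={marked_bam} M={metrics}",
    ]


def downsample_command(bam, out_bam, current_cov, target_cov, seed=1):
    """Build a `samtools view -s` command that subsamples to target_cov."""
    fraction = target_cov / current_cov
    if not 0 < fraction < 1:
        raise ValueError("target coverage must be below current coverage")
    # samtools reads the integer part as the seed, the decimals as the fraction.
    seed_and_fraction = f"{seed + fraction:.4f}"
    return (f"samtools view -b -s {seed_and_fraction} "
            f"-o {shlex.quote(out_bam)} {shlex.quote(bam)}")


commands = build_alignment_commands("hg19.fa", "sample.fastq", "sample")
reference_downsample = downsample_command("ref.bam", "ref.ds.bam", 0.92, 0.18)
```

Keeping the commands as data makes it easy to log or review them before running them through a shell.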
Detection of sCNA with the qCNV+sCNA-seq pipeline
This pipeline was developed to quickly analyse pairs of cfDNA/normal bams and identify sCNA. The pipeline is made up of qCNV and sCNA-seq. These tools were designed to complement each other; however, both applications can be run independently.
qCNV
qCNV is an efficient Java tool designed to determine the number of reads that align to specific regions in the genome. This method divides the whole genome into windows of a fixed size and counts the number of reads that begin in each window. The size of the windows is defined by the user at the beginning of the experiment. The window size used in each of these experiments is 1000 bp.
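The counting rule described above (each read is assigned to the fixed-size window that contains its start position) can be illustrated with a few lines of Python. This is only a sketch of the idea, not the qCNV implementation, which is a Java tool operating on BAM files; the function name `count_read_starts` and the tuple-based input are assumptions made for the example.

```python
from collections import Counter


def count_read_starts(read_starts, window_size=1000):
    """Count reads by the window that contains their start position.

    read_starts: iterable of (chromosome, 0-based start) tuples.
    Returns a Counter keyed by (chromosome, window_index).
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    counts = Counter()
    for chrom, start in read_starts:
        # Integer division maps a start coordinate to its window index.
        counts[(chrom, start // window_size)] += 1
    return counts


reads = [("chr1", 5), ("chr1", 999), ("chr1", 1000), ("chr2", 2500)]
counts = count_read_starts(reads)
```

Running the same function on the cfDNA and the normal sample gives the paired per-window counts that the downstream model consumes.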
sCNA-seq
sCNA-seq characterizes the copy number profile of cfDNA sequencing data through the analysis of the raw read counts produced by qCNV. This method allows for the detection of both small-scale and larger mutations. In our model, we use the normalized read counts from user-defined windows to estimate the proportion of tumour DNA in the cfDNA sample and, in cfDNA/normal pairs from the same individual, to distinguish sCNA from CNVs. sCNA-seq utilises the hidden Markov model from cnvHitSeq [33], although the read-depth modelling is substantially different.
Expectation-maximization (EM) model for identifying sCNA from circulating tumour DNA in cfDNA samples
We divide each chromosome into windows of a fixed size (described further below). We consider the observed data to consist of the number of reads which begin in each window in the normal and plasma samples, Ni and Ti respectively. Define N = ∑ Ni and T = ∑ Ti as the total read counts for the normal and plasma samples. The hidden states in our model consist of the ratio of the tumour copy number to the normal copy number in each window, rcnj = {0, 0.5, 1, 1.5, 2, 2.5, ...}, so that rcn = 1 corresponds to no amplification or deletion. We also define two parameters in our model: the tumour purity, which is the proportion of cell-free DNA that is derived from the tumour, and the ploidy ratio, which is the average relative copy number state across the genome (and can be considered as the ratio of observed ploidy to the normal ploidy of 2).
We model the expected number of reads in the plasma sample in each window using a beta-binomial distribution (equation 1), where we have defined a multiplication factor m as
$$m = \frac{rcn}{ratio} \cdot purity + (1 - purity)$$
This multiplication factor can be thought of as a way to calculate the expected proportion of reads in a given window in the plasma genome, conditional on the proportion of reads observed in the same window in the normal sample.
The first term, (rcn/ratio)*purity, reflects that a greater proportion of reads are expected to come from the tumour portion in proportion to the increase in copy number relative to the average. The second term (1-purity) in this equation is the contribution to this proportion from the non-tumour derived component of plasma.
The beta distribution models our uncertainty in the estimate of the probability of selecting a read in a given window in the normal genome. This uncertainty scales with the number of independent reads. We introduce a parameter to adjust for any non-independence (e.g. caused by PCR), called Beta-down weight (BDW), which scales down the effective number of independent reads. One example of the use of this parameter is to adjust for counting both ends of a paired end read (in this case BDW = 2). However, we have observed that BDW is also useful for modelling over-dispersion observed in both whole genome and exome-capture datasets.
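The read-depth model described above can be sketched in Python as follows. The multiplication factor follows the two terms given in the text, (rcn/ratio)*purity and (1 - purity). The exact beta parameters are not recoverable from this text, so the pseudo-counts used here (`m * n_i / bdw` and `(N - m * n_i) / bdw`) are an assumption chosen so that the expected plasma proportion is m * Ni / N and BDW divides the effective read count. The function names are hypothetical.

```python
from math import exp, lgamma


def multiplication_factor(rcn, purity, ratio):
    """Tumour contribution (rcn/ratio)*purity plus normal contribution."""
    return (rcn / ratio) * purity + (1.0 - purity)


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_logpmf(k, n, alpha, beta):
    """Log-probability of k successes in n trials under a beta-binomial."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def emission_logprob(t_i, n_i, T, N, rcn, purity, ratio, bdw=1.0):
    """Log-probability of t_i plasma reads in a window with n_i normal reads.

    ASSUMPTION: the beta pseudo-counts below are illustrative, not taken
    from the source; bdw down-weights the effective number of reads.
    """
    m = multiplication_factor(rcn, purity, ratio)
    floor = 1e-9  # keeps both parameters positive when m or n_i is zero
    alpha = max(m * n_i / bdw, floor)
    beta = max((N - m * n_i) / bdw, floor)
    return betabinom_logpmf(t_i, T, alpha, beta)


# A window with 1.5x the expected plasma depth favours rcn = 1.5 over rcn = 1.
gain = emission_logprob(150, 100, 10000, 10000, rcn=1.5, purity=1.0, ratio=1.0)
neutral = emission_logprob(150, 100, 10000, 10000, rcn=1.0, purity=1.0, ratio=1.0)
pmf_total = sum(exp(betabinom_logpmf(k, 20, 2.0, 5.0)) for k in range(21))
```

Working in log space avoids underflow when these emission terms are multiplied across many windows.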
We adopt the hidden Markov model described in Bellos et al. [33]. The emission probabilities are defined by equation 1. This model allows transitions between different copy number states according to a globally defined transition rate matrix, such that a per-base transition rate is defined, and the transition probability matrix is defined as the matrix exponential of the rate matrix scaled by the distance between the mid-points of adjacent windows.
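A distance-scaled transition matrix of this kind can be sketched in pure Python. The structure of the rate matrix is an assumption made for illustration (every pair of distinct states shares one per-base rate); the source only states that a global rate matrix is exponentiated after scaling by the distance between windows. The matrix exponential is computed with a truncated Taylor series plus scaling and squaring, and the helper names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(matrix, terms=20, squarings=10):
    """Matrix exponential via Taylor series with scaling and squaring."""
    n = len(matrix)
    scale = 2 ** squarings
    scaled = [[x / scale for x in row] for row in matrix]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes scaled**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_matrix(n_states, rate_per_base, distance_bp):
    """P = exp(Q * d) for a uniform-rate Q (ASSUMED structure)."""
    off = rate_per_base * distance_bp
    q_times_d = [[off if i != j else -(n_states - 1) * off
                  for j in range(n_states)] for i in range(n_states)]
    return mat_exp(q_times_d)


p = transition_matrix(3, 1e-6, 100_000)
# Closed form for a uniform-rate chain: 1/K + (K-1)/K * exp(-K * r * d).
expected_stay = 1 / 3 + (2 / 3) * math.exp(-0.3)
```

Because the exponent grows with distance, windows that are far apart are more likely to switch state than adjacent ones, which is the behaviour the text describes.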
The ratio is initialized to 1, and the purity is initialized to 0.99. We use the forward-backward algorithm to calculate the posterior probability of using each relative copy number state at each window, wi(rcnj), which we use as a weight to assign each data point (Ti, Ni) to each copy number state. We then use a gradient descent algorithm to calculate the tumour purity and ratio which maximize the probability of this data, i.e. the combination of purity and ratio which maximizes the weighted log-likelihood of the data under equation 1. We iterate these two steps until convergence of the purity and ratio estimates (such that the difference between iterations is less than 0.01) and of the value of the objective function, or until a pre-defined maximum number of iterations is reached.
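Under the usual EM reading of the step described above, the quantity being maximized can be written as a posterior-weighted log-likelihood. The form below is an assumed reconstruction from the surrounding definitions (the weights wi(rcnj), the counts Ti and Ni, and the emission probability of equation 1), not a formula quoted from the source.

```latex
(\widehat{\mathrm{purity}}, \widehat{\mathrm{ratio}})
  = \arg\max_{\mathrm{purity},\,\mathrm{ratio}}
    \sum_{i} \sum_{j} w_i(rcn_j)\,
    \log P\!\left(T_i \mid N_i,\, rcn_j,\, \mathrm{purity},\, \mathrm{ratio}\right)
```

Here the outer sum runs over windows and the inner sum over the relative copy number states, so each window contributes to every state in proportion to its posterior weight.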
Analysis of cfDNA sCNA-seq modes
sCNA-seq can be run in three modes (Supplementary Methods). The first mode is a training mode which estimates the proportion of tumour DNA in the sample and the ploidy ratio, and provides a low resolution sCNA segmentation genome-wide. The second mode uses this information to characterise the tumour genome at a higher resolution and with an increased number of copy number states (Figure 1). The third mode allows the user to perform a high-resolution analysis of a specific section of the genome. Like the second mode, this high-resolution survey mode requires the tumour content information produced in the whole genome training mode.

GC correction
To address the impact of GC bias on the analysis of the low coverage samples that compared cfDNA to the genomic DNA from intact cells, a form of GC correction was developed to apply to the results from qCNV. This correction used a polynomial regression (with a quadratic model) to adjust the results for the GC content of each of the 1000 bp regions of the genome in both the cfDNA and reference samples. As the effects of GC bias differ between runs and batches, this correction should be run at the user's discretion. For all samples in the EGAD00001000364 data-set, GC correction was applied; however, it was not needed in the samples described by Butler et al. [17].

Quality Control
In order to assess the quality of the results from the qCNV+sCNA-Seq analysis of whole genomes, we introduced a quality control measure to account for excessive copy number switching. This can occur if the model is unable to resolve the correct tumour purity as a result of unmodelled variation in the data. As an example, we observed this behaviour when modelling publicly available exome data [17] for which the plasma and matched normal were assayed on different versions of an exome capture array. We flag the copy number segmentation as unreliable if the observed number of distinct sCNA exceeds 5% of the number of potentially distinct sCNA (which is just the total number of windows used in the sCNA-Seq analysis).
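The quality control rule above can be expressed as a short check. This is a sketch under stated assumptions: a "distinct sCNA" is taken to be a run of consecutive windows sharing the same non-neutral relative copy number state, chromosome boundaries are ignored for brevity, and the function names are hypothetical.

```python
def count_scna_segments(states, neutral=1.0):
    """Count runs of consecutive windows in the same non-neutral state."""
    segments = 0
    previous = neutral
    for state in states:
        # A new sCNA starts whenever we enter a different altered state.
        if state != neutral and state != previous:
            segments += 1
        previous = state
    return segments


def is_unreliable(states, max_fraction=0.05, neutral=1.0):
    """Flag a segmentation whose sCNA count exceeds 5% of the windows."""
    return count_scna_segments(states, neutral) > max_fraction * len(states)


stable = [1.0] * 40 + [1.5] * 20 + [1.0] * 40   # one gain in 100 windows
noisy = [1.0, 1.5] * 50                          # 50 gains in 100 windows
```

A segmentation that alternates between states in almost every window, as in `noisy`, is the excessive switching the rule is meant to catch.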

Comparing copy number profiles
Determining the exact level of overlap between two different profiling methods (the official CCLP PICNIC copy number profile for the HT-29 cell-line [26], the pre-existing plasma-Seq results [8] and the results from qCNV+sCNA-seq ) was done at a base pair resolution and involved comparing the exact position of all of the sCNA identified in each analysis to the total sCNA identified in the corresponding analysis. In this comparison, sCNA were defined as either a gain or a loss, with both copy number mutations representing a genomic region that did not contain the same copy number state as the defined base-ploidy of the sample. The results from the qCNV + sCNA-Seq pipeline were typically designated as the reference for consistency. The number of overlapping bases between the reference sCNA and those identified in one of the other analyses were used to measure sensitivity and specificity.
The results from the original plasma-Seq analysis [8] and the independent copy number profile from the CCLP analysis of the HT-29 cell-line [26] had to be reformatted to allow for this comparison. For the sCNA identified by the CCLP, this involved defining any segment lower than the base ploidy as a loss, while segments with a copy number estimate greater than the base ploidy were classed as gains. As the results from plasma-Seq do not contain any estimates of absolute copy number changes, a loss was defined as a region with a segmental Z-score of < -5.00, while gains had a Z-score of > 5.00, a method previously used when comparing the copy number results from plasma-Seq to another technology [34]. In both samples, any reads aligning to the Y chromosome were removed.
The X chromosome was retained due to the importance of this chromosome in prostate cancer.
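The base-pair comparison described in this section can be sketched with interval arithmetic. The example below is a simplification under stated assumptions: it treats a single chromosome, uses half-open (start, end) intervals, and does not distinguish gains from losses (the same function could be run separately on each class). The function names are hypothetical.

```python
def merge(intervals):
    """Sort and merge overlapping half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_bases(intervals):
    return sum(end - start for start, end in intervals)


def overlap_bases(a, b):
    """Bases shared by two merged, sorted interval lists."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        low = max(a[i][0], b[j][0])
        high = min(a[i][1], b[j][1])
        if high > low:
            shared += high - low
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def sensitivity_specificity(reference, test, genome_length):
    """Base-pair sensitivity and specificity of `test` against `reference`."""
    ref, tst = merge(reference), merge(test)
    tp = overlap_bases(ref, tst)
    fn = total_bases(ref) - tp
    fp = total_bases(tst) - tp
    tn = genome_length - tp - fn - fp
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference must contain both altered and neutral bases")
    return tp / (tp + fn), tn / (tn + fp)


sens, spec = sensitivity_specificity([(0, 100)], [(50, 150)], 1000)
```

In this toy case half of the reference sCNA is recovered, and 850 of the 900 neutral bases are correctly left uncalled.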

Galaxy Implementation
The java applications used for analysis have been wrapped for use by the Galaxy scientific workflow system and added to a Galaxy server (http://www.genomicsresearch.org/galaxy) under the menu item "Plasma CNV Pipeline" [23]. The values required to run these programs have been pre-set, likewise the small amounts of preprocessing to allow these tools to seamlessly communicate with one another have been put in place. This Galaxy server was launched as part of a cluster on the NeCTAR research cloud using the GVL(1) launch process.

Results
The sCNA-Seq recapitulates the established copy number profile of a plasma-like cell-line using a single plasma reference sample
We first assessed whether a single reference sample of cfDNA from an unrelated, healthy individual was sufficient to identify known sCNA from a cancer cell-line with an established copy number profile. To achieve this, the qCNV+sCNA-Seq pipeline was used to analyse DNA from the triploid HT-29 cancer cell-line that had been fragmented to resemble plasma cfDNA, albeit at 100% tumour purity [8], using a single sample from the plasma-Seq reference panel as a reference. Our single reference approach recapitulated the same copy number profile present in the official COSMIC Cell Lines Project (CCLP) analysis of this cell-line, as well as the corresponding plasma-Seq annotation of the same sample (Figure 2 A,B,C). Using a 150 kb window size, the sCNA-seq pipeline identified clinically significant sCNA known to be present in HT-29, such as the gain of chromosome 11, as well as the amplification of 8q24, the region containing the MYC oncogene [35,36].
When compared to the sCNA identified in the PICNIC analysis of the high quality, genomic, cell-line DNA used in the CCLP analysis (Figure 2A) [26], sCNA-seq obtained an 87% base-pair sensitivity and 87% base-pair specificity (Table 1).
Figure 2. [...] [8]. The Y-axis shows the log2-ratio for each segment. (C) The copy number profile from the qCNV+sCNA-seq analysis of the plasma-like cell-line against a single cfDNA sample from a healthy individual (a member of the plasma-Seq reference cohort), assuming a triploid model. The Y-axis and segment colour show the absolute copy number for each segment. (D) The copy number profile from the qCNV+sCNA-seq analysis of the plasma-like cell-line and a single reference sample, prepared from intact normal cells. As information about the ploidy of a tumour is not always known, this sample was analysed as a diploid. Despite incorrectly assigning the ploidy, we were still able to capture the same broad copy number profile. In all the figures produced by sCNA-Seq, the colour of the segment represents a distinct copy number state.

sCNA-seq determines the known sCNA of a plasma-like cell-line using a white blood cell DNA reference
Having shown the feasibility of using a single, unrelated plasma sample as a reference sample, we next investigated the feasibility of using DNA obtained from intact white blood cells as a reference. The experimental advantage of this approach is that it allows normal DNA from white blood cells to be extracted at the same time as cell-free DNA from the same blood draw, making it possible to compare matched pairs of cfDNA and germline DNA. As the samples used to benchmark plasma-Seq did not contain any 'normal' germline DNA sample, we used a sex-matched, low coverage WGS sample of normal DNA that had been generated in-house using the same sequencing platform. We down-sampled this file (0.18X) to match the coverage obtained for the reference plasma samples, so that the greater read depth of the normal sample would not bias the results.
Using this low coverage, unmatched normal DNA sample as a reference, and analysing the sample as a triploid, sCNA-seq obtained an 89% base-pair sensitivity and 82% specificity when compared to the CCLP copy number profile (Figure 2C, Table 1). As the ploidy of a tumour may not be known a priori, we re-analysed this sample using the default diploid model, which led to a slightly reduced sensitivity of 82%. The copy number profiles generated from the triploid and diploid models were very similar (Figure 2 C,D). These results suggest that sCNA can be identified from the analysis of the cfDNA in the plasma when compared to intact normal germline DNA, and that the approach is robust to the mis-specification of tumour ploidy.

sCNA-seq identifies known copy number mutations across a dilution of cfDNA-like samples
The proportion of tumour derived DNA in cfDNA is highly dynamic. In some patients 90% of the cfDNA can be tumour derived, while in other cancer patients these fragments can almost be undetectable [11,16,17,37,38].
Having shown the accuracy of the method with a pure ctDNA-like sample (100% cell-line DNA mixture), we next considered the performance of our approach at lower concentrations of plasma-like DNA. To achieve this, we analysed the same ultra-low coverage WGS mixtures (0.11X-0.17X) used to benchmark plasma-Seq with the qCNV+sCNA-seq pipeline, using our sex-matched, germline reference sample (Figure 3). As with the results from plasma-Seq, as the proportion of tumour cfDNA decreased, we detected a smaller number of copy number mutations, and a smaller proportion of the genome altered by these events. The 50% ctDNA mixture was found to represent the pure ctDNA sample with 68% sensitivity, dropping to 18% in the 5% ctDNA mixture (Table 2). The specificity of the approach remained high across the mixtures, only dropping to 94% in the 10% ctDNA mixture and 76% for the 5% ctDNA mixture. A similar pattern was observed for the plasma-Seq analysis, with sensitivity falling to 15% for the 5% mixture, but specificity remaining above 90%. Both plasma-Seq and sCNA-seq suffer from a degree of over-prediction when there is no or very little ctDNA in the mixture, although in this case qCNV+sCNA-seq identifies a much smaller amount of spurious sCNA at these very low levels of tumour purity (Table 2). Some of these predicted sCNA at very low levels of tumour purity may be due to germline differences between the patient and the unrelated reference.
Encouragingly, some of the clinically significant sCNA identified in the pure cell-line, such as the amplification of 8q, a mutation commonly seen across a range of different cancers [39], and the amplification of 19q, a region frequently altered in ovarian cancers [40], were identified in all the mixtures containing more than 5% tumour cfDNA (Figure 3A-E). This indicates that while a significant proportion of the sCNA seen in the pure cell-line sample were not identified in these samples, there was still enough information to identify biologically relevant mutations in these low ctDNA samples. Furthermore, the 10% mixture revealed that the plasma-Seq analysis captured the copy number profile of the pure cell-line with 20% sensitivity and 93% specificity at single base pair resolution (Table 2). In comparison, the qCNV+sCNA-seq analysis of the 10% mixture reflected the results from the pure cell-line with 21% sensitivity and 94% specificity (Table 2), suggesting that in these low purity samples, a single reference sample collected from genomic DNA can perform as well as a panel of cfDNA samples collected from healthy individuals.

A single reference can identify clinically-significant copy numbers in clinical samples
Having demonstrated the potential of our pipeline on mixtures of plasma-like cell-line DNA, we next applied it to characterise publicly available cell free DNA data collected from cancer patients. We re-analysed a cohort of ultra-low coverage, WGS cfDNA samples collected from a group of patients diagnosed with prostate cancer and previously characterised by plasma-Seq [8]. To achieve this, we made use of the same unmatched, blood-derived reference described above. In this case, the reference sample is not sex-matched, as can be observed from the apparent copy number alteration on the X chromosome (Figure 4). [...] Figure 2). The amplification of this gene is a known mechanism of chemo-resistance in the transition of CS to CR prostate cancer [42]. The remaining CR prostate cancer sample contained prominent focal peaks on chromosome 8 and chromosome 11. Examination of these peaks revealed they contained multiple genes that have been linked to cancer, including CCND1 as well as genes from the Fibroblast Growth Factor (FGF) family (Supplementary Figure 3). As FGF signalling has been suggested as a novel treatment strategy for CR prostate cancer [43], and the role of CCND1 has been well established in cancer [44], this further showcases the potential of our approach. The identification of transformative, chemoresistant copy number mutations from the analysis of cfDNA from the plasma of multiple cancer patients clearly demonstrates the potential of our single reference pipeline to provide clinically significant results from the analysis of ultra-low coverage cfDNA samples.
The identification of recurrent sCNA commonly found in prostate cancer genomes suggests that our single reference method is able to identify clinically significant mutations in real plasma samples from cancer patients.
To verify the accuracy of these copy number profiles, the sCNA identified by qCNV+sCNA-seq were compared to the plasma-Seq analysis of the same cohort [8]. This comparison revealed both methodologies were generally able to identify the same broad copy number changes in each of these samples (Supplementary Table 3). One sample, CSPC2, had a low level of overlap with the plasma-Seq copy number profile (Supplementary Table 3), and while the same broad trends were present, examination of the results suggested that the base ploidy had been incorrectly assigned (Figure 4B). Re-analysing this sample as a triploid produced results that were much more consistent with the plasma-Seq analysis (Figure 4C, Supplementary Table 3).

sCNA-seq captures tumour evolution in clinical exome data
This pipeline was used to analyse two cohorts of matched (tumour, plasma, and blood cell derived DNA), high coverage exome samples. The capacity of the cfDNA to reflect the SNV content of each patient's disease [17] had been previously described for each of these data-sets; however, the copy number profile of these samples had not been determined. [...] Table 4). Investigating these samples revealed differences in capture array and the resulting sequencing that may have affected the distribution of reads across these libraries (Supplementary Figure 5). These results show that a single reference is capable of identifying clinically significant sCNA from cell-free DNA; however, the failure of the sarcoma samples indicates the importance of robust quality control measures when profiling sCNA.

Discussion
Liquid biopsies have the potential to transform precision medicine. The current best practice for identifying sCNA in cfDNA samples is plasma-Seq [8], a powerful method, designed specifically to identify sCNA from cfDNA samples. While new tools able to identify sCNA from cfDNA samples have recently been developed [45], like plasma-Seq these methods require a reference panel of cfDNA samples taken from a group of healthy individuals.
Such a panel can be difficult for smaller laboratories to source, or prohibitively expensive depending on the sequencing technology used. In this study, we have developed a model that is capable of producing a cfDNA sCNA segmentation using only a single reference, and that produces results comparable to an established Z-score based approach. Moreover, we have demonstrated that it is possible to use buffy-coat derived normal, germline DNA as a reference, instead of plasma DNA from a healthy reference sample. This provides a more streamlined protocol for sCNA profiling in cell-free DNA, in which genomic germline DNA from the buffy coat is sequenced at the same time as the cell-free DNA in plasma, and makes it easier to ensure that the reference sample is sequenced using exactly the same procedure as the plasma sample. This pipeline also makes it possible to compare matched samples, removing the mischaracterization of CNV as sCNA, something that is not possible with existing Z-score based approaches.
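The practical difference between a panel-based Z-score and a single-reference comparison can be sketched as follows. This is a simplified illustration, not the plasma-Seq implementation or the sCNA-seq probabilistic model: it assumes read depth has already been summarised into equal-sized genomic bins, and the function names and input layout are hypothetical.

```python
import math
import statistics


def panel_z_scores(sample, panel):
    """Per-bin Z-scores of one sample against a panel of healthy controls.

    `sample` is a list of normalised bin depths; `panel` is a list of such
    lists, one per control. At least two controls are needed to estimate a
    standard deviation, which is why a panel is required.
    """
    scores = []
    for i, value in enumerate(sample):
        controls = [ref[i] for ref in panel]
        mu = statistics.mean(controls)
        sd = statistics.stdev(controls)  # assumes the controls are not all identical
        scores.append((value - mu) / sd)
    return scores


def single_reference_log2_ratio(sample_counts, reference_counts):
    """Per-bin log2 ratio of library-size-normalised read counts.

    Only one reference is needed. All counts are assumed to be non-zero.
    """
    sample_total = sum(sample_counts)
    reference_total = sum(reference_counts)
    return [
        math.log2((s / sample_total) / (r / reference_total))
        for s, r in zip(sample_counts, reference_counts)
    ]
```

The Z-score needs a per-bin variance estimate and therefore several controls, whereas the log2 ratio needs only one reference but leaves the noise level to be modelled separately.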
We have shown that this approach can accurately infer the sCNA profile of cfDNA even using ultra-low coverage (0.06-0.29X) sequence data generated using a single Illumina MiSeq run, provided the proportion of tumour derived DNA circulating in plasma is 20% or higher. Some clinically relevant focal amplifications could be observed at lower tumour purity (down to 10%); however, a substantial proportion of copy number changes are missed with ultra-low coverage sequencing of low tumour purity samples, both by our approach and by plasma-Seq. This represents an achievement, as traditional matched copy number profiling approaches that rely on SNV frequencies would be unable to identify mutations at this low level of coverage. While sCNA-Seq is currently unable to deduce the ploidy of a tumour sample a priori, the low level of coverage needed to identify clinically significant mutations in low purity samples, even when ploidy has been mis-assigned, highlights the potential of this approach for rapid sCNA profiling.
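A simple mixture calculation helps explain why focal amplifications remain visible at lower tumour purity while single-copy changes are lost. The sketch below is not part of the qCNV+sCNA-seq pipeline; it assumes diploid normal cells and a diploid tumour baseline by default, and the function name and the example copy numbers are hypothetical.

```python
import math


def expected_depth_ratio(tumour_cn, purity, tumour_ploidy=2.0, normal_cn=2.0):
    """Expected plasma/reference read-depth ratio for one segment.

    The plasma signal is modelled as a mixture of tumour DNA (fraction
    `purity`) and normal DNA (fraction 1 - purity).
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be in [0, 1]")
    segment = purity * tumour_cn + (1.0 - purity) * normal_cn
    baseline = purity * tumour_ploidy + (1.0 - purity) * normal_cn
    return segment / baseline


# Compare a single-copy gain (3 copies) with a focal amplification (10 copies).
for purity in (0.10, 0.20):
    for copies in (3, 10):
        ratio = expected_depth_ratio(copies, purity)
        print(f"purity={purity:.2f} copies={copies} "
              f"ratio={ratio:.2f} log2={math.log2(ratio):.2f}")
```

With these assumptions, a single-copy gain at 10% purity shifts the expected depth by only about 5%, whereas a ten-copy amplification at the same purity shifts it by about 40%, which is far easier to separate from sampling noise at low coverage.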
One way to address lower tumour purity samples is to use high coverage exome sequencing of plasma DNA. We have shown that our approach applied to a high coverage (309x) sequencing of DNA from plasma coupled with a 201x sequencing of normal DNA in the buffy coat could infer tumour sCNA at high resolution in a sample in which 28% of cell free DNA was tumour derived [17].
Many of the comparisons described here relied on a single unmatched plasma cfDNA reference. This approach identified mutations known to be involved in cancer; however, it also has the potential to conflate germline CNV differences with sCNA. A recent estimate suggested that between 4.8 and 9.5% of the genome is affected by CNV [46]. These germline differences begin to influence the signal in cases in which the tumour purity is very low, and/or the tumour does not contain significant somatic copy number changes. To avoid the potential for detecting germline CNV, we recommend using matched normal DNA as the reference sample.
The detection of therapeutically relevant sCNA from clinical cfDNA samples collected from patients with prostate cancer demonstrates the potential clinical utility of the qCNV+sCNA-Seq pipeline. This potential was best exemplified by the identification of the causative, chemo-resistant mutations from the cfDNA of two patients with CR prostate cancer. Even when the base-line ploidy was misassigned in CSPC-2, it was still possible to identify the focal peaks of amplification; more importantly, these peaks were still readily identifiable when this sample was reanalysed as a triploid. The identification of some of the clinically significant sCNA in the low tumour content mixtures, as well as the results from the prostate cancer patients, highlights the potential of the qCNV+sCNA-Seq pipeline in cfDNA studies.

Conclusion
The ability to profile sCNA with a single reference sample unlocks the potential of liquid biopsy both for smaller labs and for further experimentation using new platforms. By making a version of this approach available on Galaxy, we have also lowered the bioinformatics barriers to using it.