All2: A tool for selecting mosaic mutations from comprehensive multi-cell comparisons

Accurate discovery of somatic mutations in a cell is a challenge that partially lays in immaturity of dedicated analytical approaches. Approaches comparing a cell’s genome to a control bulk sample miss common mutations, while approaches to find such mutations from bulk suffer from low sensitivity. We developed a tool, All2, which enables accurate filtering of mutations in a cell without the need for data from bulk(s). It is based on pair-wise comparisons of all cells to each other where every call for base pair substitution and indel is classified as either a germline variant, mosaic mutation, or false positive. As All2 allows for considering dropped-out regions, it is applicable to whole genome and exome analysis of cloned and amplified cells. By applying the approach to a variety of available data, we showed that its application reduces false positives, enables sensitive discovery of high frequency mutations, and is indispensable for conducting high resolution cell lineage tracing.


Introduction
With advances in sequencing technologies, analysis of the genome of a single cell is gaining traction owing to Figure 1. Conceptual overview of All 2 approach and scoring. A) A tissue/sample is made up of different cells (ovals) carrying various mosaic mutations (reflected by different colors). Post single cell clonal expansion, rare mosaic mutations (in red) can be easily detected by comparing the clone to the bulk tissue. However, frequent mutations (in blue) will be missed by this approach. B-D) Each mutation in clone-to-clone (which is cell-to-cell) comparison can be represented by a NxN matrix of pairwise clone comparisons, where each box represents the call between a clone in the row versus a clone in the column. B) In case of a true mosaic mutation, the calls are arranged as rows in the matrix. The pattern in the matrix shows that the mutation is called in clone 2 and clone 5 when comparing them to other clones. C) In case of a germline variant, the calls are arranged in a column in the matrix. The displayed pattern suggests that the mutation is present in all clones except clone 3. D) The pattern has a sporadic distribution of calls in the pairwise matrix and does not suggest either mosaic mutations or germline variants. Such call is deemed as a false positive or noise. E) Distribution of mosaic and germline scores for calls (the size of the dot/circle corresponds to the number of calls with the same scores). The plot can be divided into four areas: mosaic mutations (where the mutations have high mosaic scores and low germline scores), germline variants (where the mutations have high germline and low mosiac scores), high frequency mosaic mutations (where calls have high both mosaic and germline scores) and, lastly, noise or false positive calls.
Since the mosaic mutations are typically present in a small fraction of cells in the bulk, and germline variants are 54 present in (almost) all the cells, we conditionally call " as frequency of a mosaic mutation and # as frequency 55 of a germline variant. Note, a germline variant can be lost or undetected in some cells, and that is why its cell 56 frequency in a bulk may be below 1.

57
Since " = 1 − # , we can just use one, such as = " , where is the fraction of cells with mosaic 58 mutation or the fraction of cells without germline variant. Now, we can calculate the number of cells ′ carrying 59 the mosaic mutation or the number of cells not carrying germline variants as ′ ≈ . In case of a true mosaic 60 mutation, the corresponding calls are arranged in rows in the matrix (Fig. 1B), and would sum up to where $ is the number of calls for the variant in a row corresponding to the i th cell. From the data, the best where $ is the number of calls for the variant in a column corresponding to i th cell. And best estimate of # is

76
A call having a high mosaic score and low germline score is defined as a mosaic mutation. Similarly, a call with 77 high germline score and low mosaic score is defined as a germline score. When a call has both high germline 78 and mosaic scores, we define it as a high frequency mosaic mutation. Such mutations are likely present at a 79 higher fraction of cells in a tissue, yet at a lower fraction than germline variant. For example, such mutations 80 could occur during early development and be present in a high fraction of cells across tissues in the human body 81 (6). The distribution of mutations (as dots) on a plane with axes corresponding to the two scores can be used to 82 divide the calls into mosaic mutations, germline variants, noise or false positive (low mosaic and low germline 83 score) and high frequency mosaic mutations (high mosaic and high germline score) (Fig. 1E).

85
Implementation and usage

86
Genomes of all pairs of cells need to be compared prior to using All 2 . Variant calls can be made by a 119 Figure 2. Calls from All 2 enable reconstruction of high-resolution lineage tree. A, C) Application of All 2 to iPSC clones discovers more variants (cyan) than analysis of deeply sequenced bulk tissues (gray) or pairwise comparison of clonal lines and the bulk (orange). The approach also calls variants across entire VAF spectrum. Analysis of bulk may discover variants with intermediate VAF (1%-10%) which are not sampled in clones. For the displayed comparison, variants with at least two supporting reads in the bulks are considered for each discovery approach. B) Lineage tree reconstructed from the analysis of 25 clones from an adult individual. Variants discovered from either bulk (gray) or pairwise (orange) comparisons provide limited information as compared to All 2 , which is the most comprehensive. Multiple additional branches in the lineage tree can be traced when using additional variants (cyan) discovered by applying only the All 2 approach, which is also reflected in the Venn diagram. SNVs found only in the bulk tissues are marked with asterisks and define putative branch not sampled by clones. The percentage values next to branches denote the average fraction of the bulk cells carrying the mutations. Clone names are shown on the right. representation of one allele over the other (allele imbalance). In extreme, a locus can have only one allele 143 amplified and germline variants on the unamplified locus will be lost. ADA mode is designed to address this 144 issue. In ADA mode, All 2 takes a list of genomic regions (in bed format) where no allele dropout is observed (see we applied a single cell specialized caller named SCOUT(9). We observed that even though this partially reduced 152 the effect of MDA amplification artifacts (Fig. 3C), it still resulted in a large number of mosaic and high frequency 153 mosaic mutations, which was further mitigated by the ADA mode (Fig. 3D). Additionally, there is a reduction in 154 the germline variants after applying ADA. These mutations, falsely called as homozygous reference due to allele 155 dropout in the single cell, are effectively filtered by the ADA mode. Mutation counts per clone (Fig. 3D) were also 156 similar to those found when analyzing only clones (Fig. 3A). This comparison yields evidence that even though number of potential false calls introduced by single cell amplification by half, without compromising the mutation 159 calls from clones not affected by allele dropout.

162
Runtime depends on the number of cells in the study and the variant caller used (since some variant callers will 163 output higher number of calls than others). For the first example with 25 iPSC lines (Fig. 2B), application of All 2 164 using 8 GB memory on a 2.4 GHz dual-core processor took less than 15 minutes for the 'score' module and less cell-to-cell comparison of WGS data from single cells or clones. Our method is superior to using deep sequencing 173 of bulk tissues and/or paired comparison of single cells versus bulk for detection of both low and high frequency 174 mosaic mutations. A limitation relative to bulk method is that the mutations that are not sampled by the analyzed 175 single cells cannot be discovered. This can be addressed by increasing the number of analyzed single cell. We

176
have also applied All 2 for comprehensive reconstruction of a developmental lineage tree, showing that All 2 allows 177 a vastly more comprehensive lineage discovery. Furthermore, the method is general and can be applied to any 178 problem of lineage tracing that relies on the analysis of multiple cells, such as tracing cancer evolution.

179
We further demonstrate that All 2 facilitates removal of false positive calls (in ADA mode) from amplified 180 single cells. Additionally, since ADA mode takes a bed file with inclusive regions as input, All 2 can be applied to instructions. Saliva DNA was extracted using DNeasy Blood and Tissue kit (Qiagen) with the following 200 modifications: 5 ml AL-buffer and 200 µl Proteinase K were added to saliva and incubated at 56°C for 30 minutes.

201
RNA was digested using 20µl RNAse A (Qiagen) for 5 minutes and DNA was extracted using 4 extraction 202 columns in parallel to optimize the yield.

204
Blood collection and DNA extraction 205 10-15 ml of blood was collected using BD Vacutainer ACD tubes. DNA was extracted using the Gentra Puregene

209
DNA extracted from iPSC lines were sequenced at 30X, while DNA extracted from saliva and blood was 210 sequenced at 200X. All sequencing was conducted at BGI using with 2x100 bp paired reads. The sequencing 211 library preparation was PCR-free.

213
Fetal brain tissue and MDA

214
Collection of fetal brain tissues for subject 316, derivation of clonal neurosphere lines and sequencing has been 215 previously described ( (Fig. S4). Single cell QC on the MDA amplified cells was performed using Scellector (8) and only one cell 243 (cell5) was used owing to its better quality than the rest (Fig. S5).

245
Mutation calling for lineage analyses

246
The files were processed the same way as the clones above. Calls were made using allele frequency cut-off of calls between Mutect2 and Strelka2. Mutations with a depth greater than 10 read, with at least 2 alternate 250 supporting reads and PASS value from both callers were used. For the allele frequency plots (Fig. 2A&C)