barcodetrackR: an R package for the interrogation of clonal tracking data

Clonal tracking methods provide quantitative insights into the cellular output of genetically labelled progenitor cells across time and cellular compartments. In the context of gene and cell therapies, clonal tracking methods have enabled the tracking of progenitor cell output both in humans receiving cellular therapies and in corresponding animal models, providing valuable insight into lineage reconstitution, clonal dynamics, and vector genotoxicity. However, the absence of a toolbox by which to interrogate these data has precluded the development of standardized analytical frameworks within the field. Thus, we developed barcodetrackR, an R package that provides users with tools for the analysis and visualization of clonal dynamics across time and cellular compartments in clonal tracking experiments. Here, we demonstrate the utility of barcodetrackR in exploring longitudinal clonal patterns and lineage relationships in the context of a number of clonal tracking studies of hematopoietic stem and progenitor cells (HSPCs) in humans receiving HSPC gene therapy and in animals receiving lentivirally transduced HSPC transplants.

of HSPCs(8), but also into the dynamics of clones within cell populations whose abundances are 51 largely independent of HSPC behavior, such as certain T cell(9) and natural killer (NK) cell 52 subsets(10). Furthermore, such clonal tracking methods have also been leveraged to provide 53 valuable insight into the clonal dynamics of cancer progression(11), in vitro differentiation(12), 54 and clonal dynamics of CAR-T cells(13).

56
Given the diversity of labelling and recovery strategies, as well as underlying differences in 57 vector and barcode constructs, a number of approaches for recovery of sequences from raw 58 sequencing data and determination of "true" integration site or genetic tag from sequencing 59 artifacts or other confounders have been developed and are largely approach-dependent(14-17).   Table 1. 76 Pairwise comparisons of clonal abundance profiles in clonal tracking data provide insight into 77 the relationships between upstream progenitor pools across cellular compartments. Here, we use 78 barcodetrackR to determine and visualize the correlation values and dissimilarity indices (Fig. 2 Fig. 2A). A similar pattern is observed when 87 plotting the Bray-Curtis dissimilarity indices between samples from the Six dataset, projected 88 into two dimensions using principal coordinates analysis (PCoA), where the first axis of 89 variation separates NK cells from other lineages based on their clonal abundances, and the 90 second separates T cells, B cells, and myeloid (Gr and Mo) cells (Fig. 2B). These analyses 91 demonstrate that the myeloid lineages are closely coupled and thus likely arise from shared 92 pathways originating from the same HSPC pool, in comparison to disparate generation of mature of serial transplant cluster together in principal coordinate space, supporting the notion presented 121 in the study that ALL founder cells retain self-renewal capacity over several serial transplants 122 (Fig. 2F). The distinct groupings of clones based on anatomic sampling site within the primary 123 transplanted animals suggests that ALL clonal output also appears to be geographically   129 In clonal tracking studies, both clonal counts and diversity measures can provide insight into the 130 clonality of progenitor cell pools. Here, we utilize barcodetrackR to assess clonality by 131 visualizing the detected clone counts and Shannon diversities of samples from three datasets over 132 time (Fig. 3). When quantifying clone numbers within the Six dataset, we show hundreds of 133 unique integration sites retrieved across five purified peripheral blood lineage samples, with a 134 larger number of clones detected in B cells and T cells as compared to Gr, Mo, and NK cells at 135 most individual timepoints (Fig. 3A). Decreasing clonal diversity (Shannon index) in the NK cell 136 lineage, as compared to other lineages, indicates that over time, a smaller number of clones 137 account for a larger fraction of hematopoiesis in the NK cell compartment (Fig 3B). This implies 138 a more oligoclonal population of contributing progenitors. The finding that the mature NK cell 139 compartment is largely composed of a few high-contributing clones post-transplantation is in 140 agreement with rhesus macaque barcoded autologous HSPC transplant studies(10).

Determining clonality based on clonal counts and diversity measures
Next, we analyzed patterns in the number of detected clones and Shannon diversities over time in 143 peripheral blood samples from a single mouse xenograft obtained from the Belderbos dataset. 144 We show that more clones were detected from the bulk peripheral blood sample at sacrifice than 145 at the first time point (green line, Fig. 3C). The Shannon diversity of bulk samples decreased 146 after the 9-week time point before stabilizing (Fig. 3D), underscoring the notion that clone 147 counts alone are not ideal measures of sample diversity. These findings suggest that within this 148 xenograft transplantation model, the diversity of HSPC output becomes stable over time, in 149 agreement with previous long term clonal tracking studies in macaque(8) and human(7).

151
Finally, we use barcodetrackR to quantify an extreme case of minimal clonal counts and 152 Shannon diversities in the context of clonal hematopoiesis using the Espinoza dataset (Table 1).

153
In this study, multiple lentiviral insertions in an HSPC mediated dysplastic clonal erythroid and 154 myeloid expansion, while largely sparing the lymphoid lineages. In agreement with the findings 155 of the study, we find that the longitudinal clone numbers contributing to the B and T cell lineages 156 fluctuate, but that the Shannon diversity index of these lineages remains high, especially at early 157 time points, indicating polyclonal contribution to the lymphoid lineages (Fig 3E). However, after 158 day 266 post-transplant, we observe a massive drop-off in both the number of unique clones 159 detected (Fig 3E) and the Shannon diversity (Fig 3F) within the myeloid lineages. This  In the Espinoza study, 6 detected genetic tags in this dataset were all in fact found to correspond 162 to a single lentivirally transduced cell with 9 insertions (3 of which were virtually undetected 163 with the available dataset's methodology due to deletions in the insertion proviral sequences), 164 verifying that this pattern is representative of clonal hematopoiesis(21). While in the above examples we utilize unique detected clones at each time point, cumulative clone counts can also 166 be calculated (Fig. S1) and provide a complementary view of clone numbers over time.    Table   208 1), both based on lineage-purified samples following autologous transplantation with genetically 209 tagged HSPCs. 210 We first track the density of clones, weighted by their added proportions, at each value of lineage 212 bias (log ratio) between the Gr and T cell lineages across multiple timepoints in the Six dataset. 213 We find the presence of three high-contributing sets of clones as determined by Gr/T lineage 214 bias: Gr-biased (rightmost peak), balanced clones (middle peak), and T-biased (leftmost peak) 215 (Fig. 5A). By systematically comparing each cell type in the dataset, we find that three sets of   246 We highlight the versatility of barcodetrackR by analyzing clonal tracking data collected from 247 TCR sequencing by visualizing the number of T cell clones detected and the Shannon diversity 248 of longitudinal samples from X-SCID patients treated with HSPC gene therapy(24) (Fig. S4). 249 The barcodetrackR package includes a graphical user interface (built using shiny(26)) to allow 250 researchers without programming experience to use these advanced quantitative tools. After    Dataset availability 302 The datasets used in this study are publicly available from published barcoding experiments and 303 detailed in Table 1. Data were pre-processed in R to create tabular data files amenable for entry 304 into barcodetrackR, and these procedures are outlined in scripts within the inst/sample_data 305 directory within the barcodetrackR package on GitHub.  The barcodetrackR package can operate on any dataset that contains rows as observations and 317 columns as samples, regardless of which experimental method for genetic labelling and 318 approaches for tag retrieval were used. When creating a SummarizedExperiment (SE) object 319 from the publicly available Bioconductor repository(32) for input into barcodetrackR, users have 320 the option of including a threshold to exclude low-abundance occurrences that are more likely to 321 come from noise or sequencing error(2). Using a threshold of 0.005, for example, retains genetic 322 tags which are present at an abundance of 0.5% or greater in at least one sample. Within this paper, the Six, Elder, Espinoza, Wu, and Koelle datasets were used with a threshold of 0.0001.

324
The Belderbos and Clarke datasets were used with no threshold.

326
Instantiating a SE through the function create_SE automatically creates the following assays:    The mds_plot function calculates dissimilarity indices between samples using any distance 344 metric within the vegdist function from the R package vegan(29). At time of publication, these 345 include "manhattan", "euclidean", "canberra", "clark", "bray", "kulczynski", "jaccard", "gower", "altGower", "morisita", "horn", "mountford", "raup", "binomial", "chao", or "cao". 347 The distance methods produce a matrix composed of the distances between samples, given the  the "dendro" parameter to TRUE allows users to see this clustering and setting the "clusters" 381 parameter to a non-zero value will label hierarchical clusters of clones. The dendrogram is drawn 382 using ggdendro(33). Users can choose from a number of methods to calculate the distance metric 383 (any method included in the proxy R package) and clustering ("ward.D", "ward.D2", "single", 384 "complete", "average", "mcquitty", "median", or "centroid"). Clones can also be ordered by the 385 first sample in which they had a non-zero abundance by setting the "row_order" parameter to 388 To track the emergence of clones over time, the function barcode_binary_heatmap will display a 389 heatmap indicating the presence or absence of clones in each sample. The "threshold" parameter 390 to the function dictates the limit of detection. Clones with percentage abundance below this 391 threshold in a given sample will be set to "absent." Clones are ordered in the binary heatmap 392 based on the first sample in which they emerged (had a non-zero abundance).  Clonal bias 409 The ridge_plot function calculates bias as a continuous variable and displays the density of 410 clones at each level of log bias. In order to calculate log bias between clones that are only present 411 in one sample, log bias is calculated as followed within the ridge plot function: The normalized value is taken from the SE slot which is scaled to counts per million. The option 415 to only analyze clones present in both samples is also included in the ridge_plot function through 416 the parameter "remove_unique". The density of clones at each value of log bias is estimated 417 using the kernel density estimator included in the ggplot2 R package. When the "weighted" 418 parameter is set to TRUE, the function weights the density estimation by the combined 419 abundance of the clone between the two samples, calculated as: Chord diagrams 424 The circos_plot function utilizes the circlize package in R(31) to display shared clones between