STIG: Generation and simulated sequencing of synthetic T cell receptor repertoires

T cell receptor repertoire inference from DNA and RNA sequencing experiments is frequently performed to characterize host immune responses to disease states. Existing tools for repertoire inference have been compared across publicly available biological datasets or unpublished simulated sequencing data. Evaluation and comparison of these tools is challenging without common data sets created from a known repertoire with well-defined biological and sequencing characteristics. Here we introduce STIG, a tool to create simulated T cell receptor sequencing data from a customizable virtual T cell repertoire, with clear attribution of individual reads back to locations within their respective T-cell receptor clonotypes. STIG allows for robust performance evaluation of T cell repertoire inference and downstream analysis methods. STIG is implemented in Python 3 and is freely available for download at https://github.com/Benjamin-Vincent-Lab/stig Author summary As part of the acquired immune system, T cells are integral in the host response to microbes, tumors and autoimmune disease. These cells each have a semi-unique T cell receptor that serves to bind a set of antigens that will in turn stimulate that cell to perform its particular pro- (or anti) inflammatory role. This receptor is the product of DNA rearrangement of germline gene segments, similar to B cell receptor loci rearrangement, which provides a wide variety of potential T cell receptors to respond to antigens. At the site of an immune reaction, T cells can increase their number through clonal expansion and methods have been developed to analyze bulk genetic sequencing data to infer the individual receptors and the relative size of their clonal subpopulations present within a sample. To date, these methods and tools have been tested and compared using either biological samples (where the true quantitiy and types of T cells is unknown) or unshared synthetic datasets. In this paper I describe a new tool to generate biologically-inspired T-cell repertoires in-silico and generate simulated sequencing data from them.

Characterization of T cell receptor repertoires using next generation sequencing has 2 allowed the adaptive immune system in various states of health and to be studied in fine 3 detail. The number and clonal diversity of T cells can be measured to give insight to 4 the host's resting immune state and response to disease, as well as being potentially 5 prognostic or predictive of response to therapy [1][2][3][4][5][6][7]. In bulk sequencing approaches, 6 the sequences of T cell receptor chains must be inferred from read fragments. This 7 process involves some uncertainty as individual reads may only partially fall into the 8 receptor region of interest and read depth can limit detection of rare receptor 9 clonotypes [8]. A number of tools have been developed to infer the underlying repertoire 10 of T cell receptor chains from bulk sequencing data [8][9][10][11][12]. To properly evaluate these 11 software tools and their underlying methodologies, a known "ground truth" repertoire 12 must be constructed from which inferences can be compared. Existing tools for 13 simulating TCR gene recombination lack features to mimic next-generation sequencing 14 output (e.g. FASTQ-formatted output, DNA or RNA library preparation approaches, 15 quality limitations) that might otherwise allow their direct use with inference tools. 16 There remains a need for tools that can create simulated TCR sequencing data from 17 an underlying set of virtual T cell receptors. To that end, we developed Synthetic TCR 18 Informatics Generator (STIG) to allow the creation of simulated sequencing data that 19 reflects underlying T cell receptor biology, commonly used sequencing library 20 approaches, and sequencing quality limitations. STIG is capable of generating sets of individuals health or disease [13,14], STIG supports multiple distributions to assign 38 virtual T cells between clonotypes: even, pseudo-normal and skewed (e.g. logistic, see S1 39 Fig). Each clonotype is assigned a TCR receptor type (i.e. alpha/beta or gamma/delta) 40 based on a ratio provided by the user before undergoing simulated V(D)J recombination. 41 The recombination process is highly configurable, with the ability to define specific V recombination can be modified from reasonable defaults. As transparency and flexibility 47 of the recombination model are core to STIG's utility, all recombination probabilities 48 are exposed and adjustable through a well documented and user-editable configuration 49 file. Users may provide custom segment allele sequences in FASTA-formatted files. The 50 recombination results are tested to ensure results are in-frame, without early stops, and 51 contain an appropriate CDR3 amino-acid motif, before being added to the repertoire:  To simulate non-optimal read quality, STIG will optionally degrade the quality and 79 nucleotide strings of generated reads. Several methods are provided: A fixed 80 PHRED+33 string applied to all generated reads, a length-dependent logistic function, 81 or a file-based approach where STIG duplicates quality strings from a user-provided 82 FASTQ file. Read sequences are degraded based on the error probabilities implied by 83 the paired PHRED+33 quality string before being written to a second set of output 84 FASTQ files. The degraded reads are numbered and written in a one-to-one fashion, 85 such that each degraded read can be linked back to a corresponding perfect quality read, 86 clonotype and TCR location.

87
STIG is written in Python 3 and has two widely-adpoted Python packages as using an internal hidden Markov model, and can generate synthetic DNA sequences for 106 those chains with the likelihood of generation implied by the model. OLGA [20] was 107 created to estimate probabilities for creation of user-provided alpha and beta chains (as 108 well as Ig-heavy chains) within a given recombination model, using a dynamic 109 programming approach to search across possible recombination paths.

110
While STIG does support creating T cell receptor chains based on a recombination 111 model, its core utility -creating virtual T cell repertoires from which to generate 112 realistic DNA or RNA-seq data -differs from other TCR analysis software (S1 Table). 113 STIG's use of a biologically-inspired model for T cell repertoires and receptors, as well 114 as its support for RNA-space reads and multiple sequencing library approaches, is novel 115 functionality compared to existing tools. By creating sequencing data from a set of T   Our results show that MiXCR performs very favorably at read depths of 10 and 134 above, and is sensitive to even low-population clonotypes at that read depth (Fig 2A,B). 135 The sensitivity and positive-predictive value (PPV) indicate that MiXCR is able 136 accurately discern the underlying CDR3 regions of both alpha and beta chains with 137 narrow variation across the 10 iterations (Fig 2A-D). At sufficient read depths, MiXCR 138 is able to accurately quantify the detected clonotypes (Fig 2E,F), based on the 139 Horn-modified Morisita's overlap index comparing the clonotype distributions between 140 STIG's output repertoire and MiXCR's inferred repertoire. We noted the difference in 141 the peak Morisita overlap index values between alpha and beta chains (Fig 2E,F), 142 indicating that MiXCR was better at quantifying alpha chains over beta chains even at 143 read depths exceeding that of typical RNA-seq experiments (i.e. 10 3 coverage per 144 base) [21]. These differing peak values were also observed when performing 145 Kullback-Leibler divergence calculations between the inferred and underlying repertoires 146 (see S2 Fig). Using data from the highest read depth analysis -chosen to minimize the 147 effects of inadequate chain coverage -we compared the length of each chain against that 148 chain's contribution to the final Kullback-Leibler divergence score (Fig 3A,B). We 149 hypothesize that any tendency to over-represent shorter CDR3s may be due to the 150 higher probability of finding reads that unambiguously cover the CDR3 when that 151 region is shorter. Further work is needed to confirm the relationship and any possible 152 causality.

153
Resource utilization

154
To measure CPU and memory requirements of repertoire generation, STIG was invoked 155 to create repertoires of increasingly large numbers of clonotypes. Each invocation was 156 repeated 10 times in total, with the maximum Resident-Shared Size (RSS) and CPU 157 runtime recorded. Runtime and memory consumption increases linearly in proportion to 158 the repertoire size (Fig 4), which is expected based on the design of STIG's algorithm  Tools exist to output DNA sequences from recombination models but have limited support for simulating output seen in real-world sequencing approaches. "Models chains as paired TCR" refers to associating alpha-beta and gamma-delta chains to allow for realistic chain ratios during read generation. 1 Can optionally flag chain sequences containing frameshifts, early stop codons, or invalid CDR3 amino-acid motifs. 2 Can output sequences at a fixed offset in the V(D)J sequence. Does not support amplicon anchors. * Outputs only full DNA sequence of generated chains

Supplementary Text 1
For analyzing paired-end RNA-seq data with MiXCR, we used the settings recommended for RNA-seq workflows by MiXCR's authors (see manual at: https://mixcr.readthedocs.io/en/master/rnaseq.html#typical-analysis-workflow). The recommended settings for the alignment step as of this writing were: . The MiXCR version utilized was 2.1.9 (release date: 7 February, 2018).

Supplementary Text 2
Effective read depths were calculated by trying to approximate a total RNA read depth of 0.1, 1, 10, 100 and 1000 reads per TCR region nucleotide (nt). The average alpha and beta chain gene RNA lengths were calculated from the generated repertoires yielding values of 816 and 933, respectively. Using 50nt paired reads with a 150nt insert length, that yields 816 + 933 + 600 nucleotides per chain in total (600 being the +150 nt * 2 per chain to account for reads that only partially fall into the 5' or 3' UTR regions), or 2349nt per virtual cell. With 100nt per read (50nt length, paired reads), one would need 23.49 reads per cell (or ~11 per chain) for a read depth of 1. Thus, with 500 TCR chains between 250 clonotypes, our reads counts would be approximately 500 reads = 0.1 reads per base, 5000 reads = 1 read per base, and so on.

Supplementary Figure 1
Example clonotype distributions in STIG. (A) The 'equal' option, in which cells are assigned to each clonotype with equal probability. (B) The 'unimodal' option. (C) The 'logisticcdf' option, which yields a truncated normal distribution when viewed as the number of clonotypes per population size.