Integrating T-cell receptor and transcriptome for large-scale single-cell immune profiling analysis

Recent advancements in single-cell immune profiling that enable the measurement of the transcriptome and T-cell receptor (TCR) sequences simultaneously have emerged as a promising approach to study immune responses at cellular resolution. Yet, combining these different types of information from multiple datasets into a joint representation is complicated by the unique characteristics of each modality and the technical effects between datasets. Here, we present mvTCR, a multimodal generative model to learn a unified representation across modalities and datasets for joint analysis of single-cell immune profiling data. We show that mvTCR allows the construction of large-scale and multimodal T-cell atlases by distilling modality-specific properties into a shared view, enabling unique and improved data analysis. Specifically, we demonstrated mvTCR’s potential by revealing and separating SARS-CoV-2-specific T-cell clusters from bystanders that would have been missed in individual unimodal data analysis. Finally, mvTCR can enable automated analysis of new datasets when combined with transfer-learning approaches. Overall, mvTCR provides a principled solution for standard analysis tasks such as multimodal integration, clustering, specificity analysis, and batch correction for single-cell immune profiling data.

| mvTCR jointly embeds TCR and transcriptome. a, Overview of the mvTCR architecture and the different mixture models. b, UMAP visualizations [12] of the 10x Genomics dataset (donor 2) comparing the embeddings of the multimodal (MoE) and unimodal (RNA, TCR) models colored by epitope specificity, ten largest clonotypes, and a cytotoxicity score as an example of cell state. c-e Comparison between the unimodal and all versions of the multimodal embeddings (Concatenation, PoE, MoE). Statistical significance to the corresponding unimodal embedding is calculated via one-sided, paired t-test (p-values: *<0.05, **<0.01, ***<0.001). The boxplot represents the quartiles and median line, while the whiskers extend to the full value range excluding outliers. c, Capturing of specificity by atlas-query prediction (weighted F1-Score) and clustering (Normalized Mutual Information) on the 10x dataset (donor 2). d, Avidity prediction measured by mean squared logarithmic error and R² value on 10x dataset (donor 2). e, Conservation of modality-specific properties on the Fischer dataset by clustering (NMI, normalized by performance of underlying modality), i.e., cell type defined on transcriptomic level and reactivity towards a SARS-CoV-2 mix defined on a clonotype level. f, Comparison between MoE embedding trained only on gene expression and CDR3β and tessa [9] on the tasks defined in c and e. mvTCR incorporates knowledge from transcriptome and TCR 88 When analyzing T-cell repertoires, the antigen specificity of each cell is critical to provide the context 89 for its cell state. The specificity of a T cell towards their cognate epitope is inherently determined 90 by their individual TCR sequence with similar TCR recognizing similar epitopes. Additionally, cells 91 with shared specificity express similar phenotypic characteristics. Therefore, we investigated to what 92 extend a joint T cell embedding will benefit the preservation and prediction of antigen specificity.
with regards to antigen specificity by Normalized Mutual Information (NMI) (Fig. 1c donor 2. Interestingly, the RNA based model showed significantly higher clustering scores than the 131 TCR model for donor 2 (Fig. 1c, NMI, one-sided, paired t-test, p-value=0.007), even though it had 132 significantly lower scores for kNN prediction (F1-Score, one-sided, paired t-test, p-value=0.029). This 133 is in line with our previous observation, that the TCR embedding was highly fractured, while the 134 RNA model formed a more continuous embedding space. Overall, the multimodal models captured 135 the advantages of both modalities. They locally preserved antigen specificity, while globally grouping 136 larger specificity clusters based on their similar transcriptomic profile. 137 In a next step, we investigated how informative the latent embedding was not only with regards 138 to antigen specificity, but also to binding strength. To this end, we trained an additional neural 139 network, that received the embedding as input to predict the counts of detected pMHC multimers 140 as an approximate measure of avidity a i (Methods Avidity Prediction, Supplementary Data 1). For 141 donor 2, the multimodal embedding proved to be beneficial (Fig. 1d) which were able to recognize a mix of disease-specific peptides. Ideally, the shared embedding should 158 conserve these modality-specific annotations to a large degree during clustering. Therefore, we re-159 ported the NMI of the models normalized to the score obtained by the respective modality, on which 160 the annotation was defined (Fig. 1e). This provided us with an estimate of how well modality-specific 161 characteristics were retained. All multimodal models clustered the cell types better than the TCR- Hence, we concluded, that both TCR-and gene expression-specific characteristics are represented to 167 a high degree in the shared embedding of mvTCR. Based on its high performance in most settings 168 mvTCR's MoE module was used for all following analyses. 169 Finally, we compared mvTCR against the weighted embedding derived by tessa for predicting spe-170 cificity and conserving modality-specific annotation (Fig. 1f, Supplementary Fig. 3c) (Methods 171 Benchmarks). tessa maximizes the correlation of the transcriptome and the TCR by deriving a 172 position-wise weighting of the TCR features, which was uniform for all sequences in a dataset. tessa 173 used the CDR3β sequence as the only input for representing the TCR. Therefore, we retrained mvT-174 CR without the TCRα sequence to avoid an unfair advantage due to additional information. Except 175 for clustering on donor 2, mvTCR showed better conservation of specificity on the 10x dataset.

176
While we observed a large increase of 11.3% for kNN-prediction for donor 2, mvTCR only slightly 177 surpassed the embedding derived from tessa by 2.45% for donor 1. On the Fischer dataset, mvTCR 178 clustered significantly better for cell type (NMI, one-sided, paired t-test, p<1e-4) and reactivity 179 (NMI, one-sided, paired t-test, p=0.026) with an increase in NMI of 25.4% and 8.1%, respectively. 180 We attribute the gain in performance to mvTCR's ability to integrate transcriptome information on    We trained mvTCR with and without scArches on the TIL dataset to evaluate the integration 258 quality. Without scArches, the cells were embedded into fragmented clusters. While each cluster ved the predictions for cancer and tissue source, while being almost on par for cell type annotations.

279
Therefore, we conclude, that mvTCR can efficiently remove batch effects between query and atlas 280 sets, while preserving the biological signal between studies.   In conclusion, we presented mvTCR as a model for analysing T cell repertoires in the context of 318 infectious disease, tumors, and therapies. We envision that the integration of TCR and transcriptome 319 via mvTCR will uncover hidden interdependencies between the two modalities and identify functional 320 related T-cell sub-clusters, that would remain hidden in separate analysis of the T-cell response, 321 thereby contributing to our understanding of T-cell modulation in healthy and disease. x i T CR when both chains are considered. x i RN A ∈ R genes comprises the 5,000 most highly variable 361 gene, whose read counts were normalized and log1p-transformed.

362
mvTCR encodes the TCR sequences x i T RA and x i T RB via the two encoder E T RA and E T RB , respectively, to obtain the lower-dimensional representations: (1) to the shared latent distribution: of size h. All downstream analysis and benchmark tests were performed on z i joint . The networks D RN A , D T RA , and D T RB decode the embeddings to the reconstructions: x i T RA = D T RA (z i joint ), and (6) is used. If not stated otherwise, this mixture module was used throughout this paper.

424
of the input data and regularization losses, which shaped the properties of the latent distribution. where is the mean squared error, and for each clonotype ct from the set of all clonotypes CT .

441
For benchmarks (Fig. 1c-f, Supplementary Fig. 3), an additional test set of 20% was used to evaluate 442 the performance on unobserved data. Training, validation, and test sets were constructed randomly 443 on a clonotype-level, i.e., cells with the same TCR input sequence were exclusively contained in a 444 single subset.

445
For evaluating atlas-level integration (Fig. 3a-c), two studies [39,40] containing cells from lung 446 cancer patients were held out of the accumulated dataset and 20% of the remaining data was used 447 as validation set.

448
To compare the running times with other multimodal methods (Fig. 3d), mvTCR was trained on 449 random subsets. Again 20% of data was used as validation set to measure the time for evaluation 450 calculations. Since the model converged after 20 epochs on the full dataset, this number was held 451 constant over all subsets, i.e., no early stopping was performed.

453
To select the best model structure, we perform optimization of all hyperparameter of the architecture 454 via Optuna [57]. Depending on the information available, different performance metrics are optimized between the ground truth and the predicted avidity a i the models are trained with ADAM optimizer the benchmark tests (Fig. 1c, Supplementary Fig. 3a) to enable statistical testing. The avidity 522 prediction (Fig. 1d, Supplementary Fig. 3b) is conducted on the same training, validation, and 523 testing splits as indicated above.

524
Fischer dataset: as for the 10x dataset, the models are trained five times on different dataset splits 525 (Fig. 1e). The hyperparameter are adapted via Optuna to preserve cell type and clonotype at 526 a ratio of one-to-one.

527
Comparison to tessa: We evaluate the performance of mvTCR against tessa as a baseline model 528 (Fig. 1f, Supplementary Fig. 3c) 10,000 reads per cell, followed by log1p-transformation and the reduction to the 5,000 most 560 highly variable genes. Additionally, the specificity annotation suggested in the publication note 561 was added. All cells not expressing a full TCR consisting of one α-and β-chain were remo-562 ved from the datasets since mvTCR requires paired information. To ensure correct matching 563 between TCR and specificity in our benchmark data, we further removed all cells expressing 564 multiple TCRs. A clonotype ID was assigned grouping cells with identical α-and β-chain. For 565 better quantification during the benchmark studies, this dataset was reduced to T cells with 566 reported binding to the most abundant eight antigens excluding non-binders. The cytotoxicity 567 score (Fig. 1b, Supplementary Fig. 1 and 2), was calculated by the mean of the normalized 568 marker genes described in [18]. 569 Fischer dataset: The filtered, normalized, and log1p-transformed dataset of Fischer et al.
[6] was 570 downloaded from NCBI GEO. Cells with missing α-or β-chain were removed from the dataset.

571
Clonotypes were assigned for the remaining cells. Finally, we selected the 5,000 most variable 572 genes for training mvTCR. with TCR information. Quality control, normalization, and log1p-transformation was already 576 performed in the original publication. We reduced the dataset to the 5,000 most variable genes, 577 filtered for incomplete TCRs, and assigned the clonotype. The models were selected to equally 578 preserve cell and clonotype. The database query of TCRs to the IEDB (Fig. 2d)  were assigned as antigen-specific, if more than 5% of the cluster's cells and more than 50% of 584 the epitope matches of the cluster stemmed from SARS-CoV-2 variants. Similarly, bystander 585 clusters were defined as exceeding these thresholds with non-Covid related epitopes. Differential 586 gene expression was calculated between antigen-specific and bystander clusters via a t-test with 587 Benjamini-Hochberg correction using scanpy. Only genes with an adjusted p-value smaller 5% 588 and a log-fold change greater 0.25 were reported.

589
Tumor-Infiltrating Lymphocyte dataset: The Tumor-Infiltrating Lymphocyte (TIL) dataset 590 consisted of a collection of studies, which were downloaded as described under https://gi 591 thub.com/ncborcherding/utility/tree/dev. Transcriptome and TCRs were manually 592 merged and cells without complete TCR were filtered. Additional, annotated doublets and 593 genes present in less than 100 cells were removed. The data was normalized to 10,000 counts per 594 cell, log1p-transformed, reduced to 5,000 most variable genes, and annotated with clonotype.

595
After filtering, the dataset contained 722,461 T cells.