Predicting cellular position in the Drosophila embryo from Single-Cell Transcriptomics data

Single-cell RNA-seq (scRNAseq) technologies are rapidly evolving and a growing number of datasets are now available. While very informative, in standard scRNAseq experiments the spatial organization of the cells in the organism or tissue of origin is lost. Conversely, spatial RNA-seq technologies designed to keep the localization of the cells have limited throughput and gene coverage. Mapping scRNAseq to data of genes with spatial information can thus increase coverage while providing spatial location. However, methods to perform such a mapping are still in their infancy and have not been benchmarked in an unbiased manner. To bridge the gap, we organized the DREAM Single-Cell Transcriptomics challenge to evaluate methods for reconstructing the spatial arrangement of single cells from single-cell RNA sequencing data. The challenge focused on the spatial reconstruction of cells from the Drosophila embryo from single-cell transcriptomic and, leveraging as gold standard, in situ hybridization data of a set of selected driver genes from the Berkeley Drosophila Transcription Network Project reference atlas. The 34 participating teams used an array of different algorithms for gene selection and location prediction. We devised a novel scoring and cross-validation scheme to evaluate the robustness of the best performing algorithms. Participants were able to correctly and robustly localize rare subpopulations of cells, accurately mapping both spatially co-localized and scattered groups of cells. The selection of predictor genes was essential for accurately locating the cells in the embryo. Among the most frequently selected set of genes we measured a relatively high expression entropy, high spatial clustering and the presence of prominent developmental genes such as gap and pair-ruled genes and tissue defining markers.


Introduction
The recent technological advances in single-cell sequencing technologies have revolutionized 1 the biological sciences. In particular single-cell RNA sequencing (scRNAseq) methods allow 2 transcriptome profiling in a highly parallel manner, resulting in the quantification of thousands of 3 genes across thousands of cells of the same tissue. However, with a few exceptions [1,2,3,4, 5] 4 current high-throughput scRNAseq methods share the drawback of losing the information relative 5 to the spatial arrangement of the cells in the tissue during the cell dissociation step. 6 One way of regaining spatial information computationally is to appropriately combine the single-7 cell RNA dataset at hand with a reference database, or atlas, containing spatial expression patterns 8 for several genes across the tissue. This approach was pursued in a few studies [6,7,8,9,10]. 9 Achim et al identified the location of 139 cells using 72 reference genes with spatial information 10 from whole mount in situ hybridization (WMISH) of a marine annelid and Satija et al developed 11 the Seurat algorithm to predict position of 851 zebrafish cells based on their scRNAseq data and 12 spatial information from in situ-hybridizations of 47 genes in ZFIN collection [11]. In both cases, Project (BDTNP). In this project, in situ hybridization data was collected resulting in a quantitative 19 high-resolution gene expression reference atlas [12]. Indeed, Karaiskos et al showed that the 20 combinatorial expression of these 84 BDTNP markers suffice to uniquely classify almost every cell 21 to a position within the embryo. 22 In the absence of a reference database, it is also possible to regain spatial information compu-   3. The predictions were compared to the ground truth location determined by calculating the 66 maximum correlation using all 84 in situs [10]. We received submissions from 34 teams, and 67 the overall analysis of the results showed that the selection of genes is essential for accurately 68 locating the cells in the embryo. The most selected genes had a relatively high expression entropy, 69 showed high spatial clustering and featured developmental genes such as gap and pair-ruled genes 70 in addition to tissue defining markers.    The challenge was organized in two rounds, a leaderboard round, and a final round. During the 86 leaderboard round the participants were able to obtain scores for five submitted solutions before   As stated, given that the ground truth for this challenge was publicly available and to avoid 94 over-fitting, we decided to invite the top 10 performing teams to contribute to a post-challenge 95 collaborative analysis phase to assess the soundness and stability of their gene selection and 96 cell location prediction. Consequently, teams were tasked to provide predictions for a 10-fold  Table 1.

114
A summary of the methods used by participants for gene selection and location prediction can 115 be seen in Table S2. The most frequently used method by participants for location prediction was a 116 similarity based prediction, such as the maximum Matthews correlation coefficient between the 117 binarized transcriptomics and the in situs that was proposed by Karaiskos et al. [10]. Another well  The most frequently used method by participants for gene selection was unsupervised or 124 supervised feature importance estimation and ranking. For example, in a supervised feature 125 importance estimation approach a machine learning model is trained to predict the coordinates of 126 each cell, given the transcriptomics data at input, that is, the genes with available in situ hybridization 127 measurements or all genes. Different machine learning models were trained such as Random Forest   where the euclidean distance between the locations was used as the distance metric. In order to 148 find the optimal k, we used the elbow method, i.e. we chose a k that saturates the sum of squares 149 between clusters [19]. Note that each cluster consists of a group of locations and each location 150 is predicted by one or more teams. Hence, for each cluster we calculated the average frequency 151 that its constituent locations are predicted by individual teams. We then picked the cluster with  The WOC location prediction approach does not take the genes used by the teams to make the 158 predictions into account. However, after the WOC predictions are generated, in order to score them, 159 we needed a list of genes for every subchallenge. To this end we used a WOC approach to gene 160 selection (see the following section for more details) and used the most frequently selected genes 161 per challenge. As reported above, the WOC solution performed better compared to the individual 162 solutions ( Figure 1B).  Table S2). However, if a subset of genes is selected as a candidate for solving the general 170 task of location prediction, it should be consistently identified when similar sets of single cells are 171 used as inputs. Therefore, we analyzed the consistency of gene selection for each team across folds 172 by 10-fold cross-validation. More importantly, we were interested in subsets of genes that were 173 consistently selected by multiple teams as this could underlie biological relevance.

174
The approaches for selecting genes taken by the top 10 teams resulted in consistent selection 175 across folds, significantly better than random, for all subchallenges. Indeed, all of the pairwise 176 Jaccard similarities of sets of selected genes for all teams were significantly higher than the expected 177 Jaccard similarity of a random pair of subset of genes (see Supplementary Figure S4). Importantly, 178 we measured an observable increase in variance and decrease of mean similarity as the number of 179 selected genes decreased.  (Table S3). 190 We conclude that the gene selection is not only consistent by team across folds, but also across       215 We conjectured that the most frequently selected genes should carry enough information content 216 collectively to uniquely encode a cell's location. Furthermore, genes should also contain location 217 specific information, i.e. their expression should cluster well in space. To quantify these features, 218 we calculated the entropy and the join count statistic for spatial autocorrelation of the in situs (see 219 Figure 4A and Methods for description). We observed that most of the in situ genes have relatively     Table 2). Although the selection of highly variable genes was one of kruppel (kr), knirps (kni) were selected in all 3 subchallenges (see Figure 5 and Table S3 that also 293 includes kni-like knrl) although tailless (tll) and hunchback (hb) were not. Along the A-P axis,  Table S3.

311
Since the ground truth of single cell locations was publicly available, the organization of this where N is the total number of cells with predicted locations.

429
The Matthews correlation coefficient, or f coefficient, is calculated from the contingency table  The second metric s 2 is simply the average inverse distance of the predicted most probable locations to the ground truth location Finally, the third metric s 3 measures the accuracy of reconstructed gene-wise spatial patterns where 8c denotes that the MCC is calculated cell wise for each gene.

437
For 287 out of the 1297 cells, the ground truth location predictions were ambiguous, i.e., the 438 MCC scores were identical for multiple locations. These cells were removed both from the ground 439 truth and the submissions before calculating the scores.

440
The teams were ranked according to each score independently. The final assigned rank r t 441 for team t was calculated as the average rank across scores. Teams were ranked based on the 442 performance as measured by the three scores on 1000 bootstrap replicates of the submitted solutions.

443
The three scores were calculated for each bootstrap. The teams were then ranked according to 444 each score. These ranks were then averaged to obtain a final rank for each team on that bootstrap.

445
The winner for each subchallenge was the team that achieved the lowest ranks. We calculated the 446 Bayes factor of the bootstrap ranks for the top performing teams. Bayesian factor of 3 or more was 447 considered as a significantly better performance. The Bayes factor of the 1000 bootstrapped ranks 448 of teams T 1 and T 2 was calculated as where r(T 1 ) i is the rank of team T 1 on the i-th bootstrap, r(T 2 ) i is the rank of team T 2 on the i-th  The entropy of a binarized in situ measurements of gene G was calculated as where p is the probability of gene G to have value 1. In other words, p is the fraction of cells where 453 G is expressed.

454
The join count statistic is a measure of a spatial autocorrelation of a binary variable. We will 455 refer to the binary expression 1 and 0 as black (B) and white (W ). Let n B be the number of bins 456 where G is expressed (G = B), and n W = n n B the number of bins where G is not expressed 457 (G = W ). Two neighboring spatial bins can form join of type J 2 {WW, BB, BW }. 458 We are interested in the distribution of BW joins. If a gene has a lower number of BW joins 459 that the expected number of BW, then the gene is positively spatially autocorrelated, i.e., the gene is 460 highly clustered. Contrarily, higher number of BW joins points towards negative spatial correlation, 461 i.e. dispersion.

462
Following Cliff and Ord [27] and Sokal and Oden [28], the expected count of BW joins is where the spatial connectivity matrix w is defined as ( 1 if i 6 = j and j is in the list of 10 nearest neighbors of i 0 otherwise The variance of BW joins is where x 1 = Â i Â j w i j , x 2 = 1 2 Â i Â j (w i j w ji ) 2 , x 3 = Â i Â j w i j + Â j w ji 2 .

463
Note that the connectivity matrix w can also be asymmetric, since it is defined by the nearest 464 neighbor function.

465
Finally, the observed BW counts are The join counts test statistic is then defined as which is assumed to be asymptotically normally distributed under the null hypothesis of no spatial   . .