Topslam: Waddington Landscape Recovery for Single Cell Experiments

We present an approach to estimating the nature of the Waddington (or epigenetic) landscape that underlies a population of individual cells. Through exploiting high resolution single cell transcription experiments we show that cells can be located on a landscape that reflects their differentiated nature. Our approach makes use of probabilistic non-linear dimensionality reduction that respects the topology of our estimated epigenetic landscape. In simulation studies and analyses of real data we show that the approach, known as topslam, outperforms previous attempts to understand the differentiation landscape. Hereby, the novelty of our approach lies in the correction of distances before extracting ordering information. This gives the advantage over other attempts, which have to correct for extracted time lines by post processing or additional data.


Introduction
High-throughput single-cell real-time polymerase chain reaction gene expression 2 measurements (Section S2) are new and promising techniques to give insights into the 3 heterogeneous development of individual cells in organism tissues [19]. However, 4 interpretation of measurements can be highly challenging. 5 Waddington [33,34] proposed a representation for understanding the process of differentiation, known as Waddington's landscape or the epigenetic landscape. The idea is that differentiated cells are located at different points on the epigenetic landscape with particular paths through the landscape more likely than others due to its underlying topology ( Figure 1). Originally, this landscape represents the quasi-potential function of genetic network dynamics and is shaped by evolution through mutational re-wiring of regulatory interactions [16]. In this context, the landscape is created by a complex set of interactions between transcription factors, genes and epigenomic modifications. Unpicking the mechanism behind this relationship is extremely challenging [2,16,20,35,36]. Instead we propose an alternative, data driven approach based on machine learning algorithms and judicious application of probabilistic methods. In this paper we reconstruct such landscapes from rich phenotype information through probabilistic dimensionality reduction. In particular, we extract maps of the epigenetic landscape given the observations of gene expression. The mathematical underpinnings of mapping involve a projection from a low dimensional space to a higher dimensional space. Classically we might wish to project the three dimensional world around us down to two dimensions for use as a map or a chart. Formally this involves a mapping, f (⋅) from the positions in the two dimensional space, x, to our measurements, y: T i m e Cell Type Figure 1. Waddington landscape representation for differentiating cells. The topology of the landscape is created by genetic network dynamics, shaped by evolution through mutational re-wiring of regulatory interactions. The cells follow along the topologystochastically deciding at key junctions for differentiation paths.
In epigenetic landscapes, rather than considering the high dimensional measurements 6 to be direct measurements of the physical world around us, we instead observe a rich 7 phenotype, such as the gene expression of an individual cell, y. Our aim is to develop a 8 coherent map such that the position of each cell, x, is consistent with cells that are 9 expressing a similar phenotype. In other words, if two cells have a similar gene 10 expression they should be located near to each other in the map, just as two people 11 located near to each other in a real landscape would have a similar view. 12 The utility of a map is given by the precision in which it can be recreated. 13 Geographic surveys were originally created through triangulation and laborious ground 14 level surveys. The challenges we face for the epigenetic landscape are somewhat greater. 15 In particular the measurements of phenotype are subject to a great deal of noise, 16 particularly in single cell experiments, in other words there is a mistiness to the 17 observations. Further, we cannot access all areas. We can only query individual cells as 18 to their particular phenotype, we cannot move around the landscape at will. Finally, 19 there is a complex, most likely non-linear relationship between any location on the map. 20 Thus, we have to estimate the full smooth map and its distortions from the discrete 21 observed points in form of cells and their gene expression patterns. 22 We are inspired by challenges in robotics: in robot navigation a robot facing a 23 landscape for the first time needs to continually assess its current position (the values of 24 x) and simultaneously update its estimate of the map (the function f (⋅)). This 25 challenge is known as as simultaneous localisation and mapping (SLAM [27]). 26 For example Ferris et al. [9] showed how simultaneous localisation and mapping 27 could be formed by measuring the relative strength of different WiFi access points as it 28 moves around a building. When you are near to a given access point you will receive a 29 strong signal, when far, a weak signal. If two robots both perceive a particular access 30 point to have a strong signal they are likely to be near each other. We can think of the 31 sophisticated analyses to resolve confounding factors. By providing the scientist with 48 the underlying Waddington landscape for cells in a given experiment, along with the 49 location of each cell in the landscape, we hope to significantly simplify this process.

50
Unpicking the nature of the genetic landmarks in the presence of noise typically exploits 51 feature extraction, where high dimensional gene expression data has its dimensionality 52 reduced [10,14,23,24], often through linear techniques. However, it is difficult to 53 determine the number of dimensions to use for further analyses [3][4][5].  deterministic algorithms are necessary for a coherent mapping [1,25]. 73 We make use of the Bayesian Gaussian process latent variable model (Bayesian  Further, we make use of the geometry of the underlying map by exploiting recent 77 advances in metrics for probabilistic geometries [30]. clustered and a further PCA is applied to each cluster separately, giving one coordinate 84 system per cluster [14] (see also [8,28] for an elegant implementation of this approach 85 and more). 86 Islam et al. [18] developed a graph based method, using similarities of cell profiles to 87 characterise two different cell types in a so called "graph map". Linear cell-to-cell 88 correlation is used to create a 5 nearest neighbour graph. The graph is then visually 89 adjusted by a force-directed layout to visualise cell-to-cell correlations. In topslam we 90 only make use of a graph to extract shortest distance between cells and not for 91 visualisation.

92
The Waddington's landscape [33,34] can be seen as a non-linear map for the 93 branching process of cells, where the cell process is described as a ball rolling down a 94 hill following stochastically (by e.g. cell stage distribution) the valleys of the hillside 95 ( Fig. 3).

96
For topslam, the underlying probabilistic dimensionality reduction technique 97 (Gaussian process latent variable model) has been successfully used in other applications 98 to single cell transcriptomics data, e.g. for visualisation [3], to uncover sub populations 99 of cells [4] and to uncover transcriptional networks in blood stem cells [22].

100
The novelty of our approach is to not correct after extraction of graph information, 101 but to correct the distances the graph extraction uses to extract information. We can do 102 that by estimating the underlying Waddington landscape along differentiation of cells.  Other methods apply deterministic non-linear dimensionality reductions and attempt 119 to recover the underlying pseudo time in a probabilistic framework [6].  graph. This stabilises the extraction of time along the graph. Outliers can create "short cuts" in the graph structure, which will be identified by the landscape's topography.

130
A Riemannian geometry distorts distances, just as in a real map movement is not 131 equally easy in all directions (it is easier to go down hill or follow roads) the underlying 132 Waddington landscape has a topology which should be respected. Topslam landscapes 133 are both non-linear and probabilistic and we correct, locally, for Riemannian distortions 134 introduced by the non-linear surfaces. In the next section we will show how the 135 combination of these three characteristics allows us to recover pseudo time without 136 reliance on additional data or additional (correctional) algorithms for graph extraction, 137 to correct for the underlying dimensionality reduction technique used.

138
In summary, we introduce a probabilistic approach to inferring Waddington

155
Epigenetic progression is a discrete process that Waddington suggested could be 156 visualised as part of a continuous landscape. However, the relationship between location 157 on the landscape and the measured state of the cell is unlikely to be linear.

158
Further, when mapping natural landscapes, a laborious process of triangulation 159 through high precision measurements is used to specify the map accurately. In the 160 epigenetic landscape, no such precision is available. As a result it is vital that we sustain 161 an estimate of our uncertainty about the nature of the landscape as we develop the map. 162

163
Simulation was done by simulating 5 differentiation patterns of cells (Fig. 2). We then

168
We compare extracted pseudo time orderings for four methods in Table 1. The four 169 methods we compare are Monocle [31], Wishbone [25], Bayesian GPLVM, and topslam.  (Table 1). This is about 5% higher correlation then the next  Figure 3. Representation of the probabilistic Waddington's landscape. The contour lines represent heights of the landscape. The lower the landscape, the less "resistance" there is to move around. The time is then extracted along the cells such that it follows the landscape, depicted as splitting arrows. This also reflects the separate cell fates in the epigenetic progression of the cells.

207
We extract the pseudo time for a mouse embryonic single cell qPCR experiment [13,19]   We extract the landscape for the developmental cells and compute distances along the 215 landscape through an embedded graph.

216
The starting cell needs to be given, whereas no more information is needed to extract  We use the same labelling of Guo et al. [14], which introduces some systematic errors Tspan8, TE PE Shared Independent gp confidence Data Figure 4. Some example plots for the marker gene extraction. In green you can see the individual fits of two GPs, sharing one prior, and in blue the shared fit of one GP to all the data. Differential expression is decided on which of those two models (green or blue) fits the data better. Note the time line elucidates when (in time) the gene can be used as a marker gene. Gata6 is a known marker for TE, but evidently it is also differentially expressed in mice between PE and EPI differentiation states.
marker genes for the three cell stages ( Table 2). We compiled the list as a comparison   Figure 5. Comparison plots between different dimensionality reduction techniques for the Guo et al. data set of developmental mouse embryonic stem cells [14]. As can be seen, only topslam (probabilistic Waddington landscape) can fully identify the relationships between the cells and order them correctly for pseudo time extraction. t-SNE is the underlying method Wishbone relies on and ICA the one for Monocle.  Table 2. Marker genes for differentiation between the three cell stages compiled from time series differential expression along the pseudotime. Shown are the ten most differentially expressed genes, pairwise between the three stages. For example Id2 is differentially expressed between (TE and EPI) and between (TE and PE). This means it is a marker gene for TE, as it behaves differently from the two other differentiation stages, but not within the two others. Id2 is known to be a marker for TE.
2. Supply starting point s ∈ X of pseudo time ordering extracted in the next step.   we are able to take account of the local topography when extracting pseudo times, 293 correcting distances by applying non Euclidean metrics along the landscape [30].

294
To perform pseudo time extraction with topslam we build a minimum spanning tree 295 (or k-nearest-neighbour graph) along the latent landscape uncovered by topslam. This