Abstract
The mammalian brain develops through a complex interplay of spatial cues generated by diffusible morphogens, cell-cell interactions, and intrinsic genetic programs that result in the generation of likely more than a thousand distinct cell types. Therefore, a complete understanding of mammalian brain development requires systematic mapping of cell states covering the entire relevant spatiotemporal range. Here we report a comprehensive single-cell transcriptome atlas of mouse brain development spanning from gastrulation to birth. We identified almost a thousand distinct cellular states, including the initial emergence of the neuroepithelium, a rich set of region-specific secondary organizers and a complete developmental program for the functional elements of the brain and its enclosing membranes. We used the atlas to directly test the hypothesis that human glioblastoma reflects a return to a developmental cell state. In agreement, most aneuploid tumor cells matched embryonic rather than adult types, while karyotypically normal cells predominantly matched adult immune cell types.
The brain emerges from the primitive ectoderm as a sheet of neuroepithelial cells which folds into the neural tube during neurulation1. The developing nervous system is unique for the length of the developmental window, the extent of the interplay between different anatomical regions and lineages, and the diversity of cell types generated. Therefore, the ability of single-cell RNA-seq to disentangle the molecular heterogeneity of a complex cell pool has been particularly useful to study nervous system development2–10. Recent studies have shed light on the developing telencephalon5,11, the hippocampus9,12,13, the developing ventral midbrain14–16, the developing spinal cord and cerebellum17,18, and the hypothalamic arcuate nucleus and diencephalon19,20. Single-cell RNA-seq has elucidated the differences between embryonic, postnatal and adult neural progenitors9,21,22, and compared normal glial progenitors with their malignant counterparts23,24.
To map mouse brain development in detail, we collected embryonic brain tissue from 43 pregnant CD-1 mice, sampling each day from E7 to E18 (Extended Data Figure 1a-b, Table S1). We prepared 105 samples by droplet-based single-cell RNA sequencing. After removing low-quality cells and doublets (Methods), 96 samples remained with a mean of 5766 transcripts (unique molecular identifiers, UMIs) and 1 934 genes detected per cell (Extended Data Figure 1c-f). The total cellular RNA content dropped as a function of embryonic age in all lineages, and then increased again in the most mature neurons (Extended Data Figure 1e-f and Extended Data Figure 2c-d). Erythrocytes segregated into primitive and definitive erythropoiesis and showed exceptionally high UMI counts but low gene counts (Extended Data Figure 2a). A transcriptional signature of cell-cycle activation (Methods) showed a clear bimodal pattern where 32% of all cells were cycling (>1% cell-cycle gene UMIs; Figure 1 and Extended Data Figure 2b). Immature cells from young embryos were generally proliferating, whereas more mature cells from older embryos had stopped dividing. Postmitotic cells expressed a larger proportion of nascent (unspliced) RNA relative to mature (spliced) RNA (Extended Data Figure 2e), likely due to the induction of lineage-specific genes when cells began their maturation programs.
A tSNE embedding of the dataset (Figure 1) revealed that cells were organized primarily by gestational age, branching to show the major lineages originating from the neuroepithelium. A large connected component — representing the neural tube and its derivatives — was surrounded by disconnected islets of microglia, erythrocytes, and vascular cells, none of which are derived from the neuroepithelium. From E7 the embryo split into clusters corresponding to the three germ layers, then formed the early neural tube. By E8-10, the neural crest had formed and from E10 to E16 differentiated into the VLMC (vascular and lep-tomeningeal cells25, also known as brain fibroblast-like cells26) lineage which proceeded to form the two layers of the meninges, pia and arachnoid27. The early neural tube then matured into proliferating radial glia, the precursors of all neural cells. Two large cohorts of neurons branched from the neural tube, one comprising forebrain excitatory neurons, and one containing all other neuronal types. The clear separation of forebrain excitatory neurons agrees with our previous findings in the adult mouse brain22, indicating the strong transcriptional and functional specialization of those cell types. After about E14, radial glia gradually lost pro-liferative capacity and switched to a glioblast state that eventually gave rise to astrocytes, ependymal cells and oligodendrocyte precursor cells. The orderly progression of successive waves of cell types reveal the coordinated temporal development of the mammalian brain (Extended Data Figure 2f).
Focusing first on the late gastrulation and early neurulation stages, we extracted E7-E8 cells from our dataset (Figure 2a-b) and divided the population into three parts (Figure 2b-c): primitive streak-stage cells (Pou5f1 and T, ref. 28); non-neural lineage-restricted progenitor cells (Foxa2 and Krt8, refs. 29,30); and early neuroepithelial cells (Pax6 and Hes3, refs. 31,32). Aligning our dataset to a previously described single-cell atlas of gastrulation (Figure 2d-f, ref. 33) confirmed that our data captured a continuous set of cell states from gastrulation and neurulation (Figure 2e) but revealed a greater heterogeneity of states than previously reported (Figure 2f).
Among neural cells we identified fourteen populations: twelve neuroepithelial clusters expressing Fez1, Sox3, Pax6, and Pdzph134, one neuromesodermal progenitor cluster expressing Cdx4, Cdx1 and a set of caudal Hox genes35,36 and a population of early neuroblasts expressing Nhlh1 and Nhlh237 (Figure 2g-i). The twelve neuroepithelial populations comparably expressed a set of stem-cell genes but differentially expressed genes linked to spatial patterning (Figure 2h, refs. 38,39). We recognized anterior populations including forebrain, anterior neural ridge and eye field progenitors (NEpitA1-3), a population expressing midbrain, hindbrain and spinal cord pattering factors (NEpitMd, NEpitMH, NEpitH and NEpitC) but also populations with a stark mediolateral patterned signature corresponding to the progenitors located at the midline and the neural fold (NEpitML and NEpitB). Finally, the population of early neuroblasts showed heterogeneous gene expression and likely represented a mixture of cells. Comparing these cells with later neuroblasts, revealed an enrichment of a midbrain and hindbrain signature with telencephalic-specific genes being less represented, suggesting that the first neuroblasts appear posteriorly (Figure 2j; Methods).
Next, we examined the neurulation and morphogenesis stages, E9-E11. Differently from gastrulation, where signals for the establishment of the neural plate anteroposterior axis are provided extrinsically, during neurulation specific neural progenitors, called secondary organizers, orchestrate the morphogenesis of the brain. We examined the distribution of genes encoding morphogens including SHH, WNTs, BMPs, FGFs, Neuregulins and R-spondins and found them enriched in a subset of E9-E11 radial glia-like cells (Figure 3a-b; Extended Data Figure 5a-b). We found four ventralizing organizers expressing Shh, four dorsalizing organizers expressing Wnts and Bmps, three Fgf-expressing organizers that characterized cells at the boundaries between regions, and two Neuregulin-expressing antihem organizers that appeared later in development (Figure 3c-d). The localization of these organizers clusters was corroborated by querying the Allen Developmental Mouse Brain ISH Atlas using Voxhunt (Extended Data Figure 5c; Methods; ref. 40).
Because organizer cells were identified by clustering, their transcriptomes were distinct from those of other cells and not defined only by the small number of genes described in previous literature. A large number of transcription factors and secreted ligands and surface receptors were expressed specifically in subtypes of organizers (Extended Data Figure 5d-e). For example, Wnt7b and Rspo2 distinguished dorsalizing organizers in the cortical hem, whereas Wnt3 and Fzd10 were specific to roof plate organizers. Ntf3 and Sost were expressed by mesencephalic and rhombencephalic roof plate cells, respectively. All dorsalizing organizers expressed Wnt9a, Wnt3a, Resp1, Resp3 and Bmp6.
Resources
The raw sequence data is deposited in the sequence read archive under accession PRJNA637987.
The companion wiki at http://mousebrain.org provides the following resources:
An interactive version of Figure 1.
Expression matrices.
Metadata
The analysis software developed for this paper is available at https://github.com/linnarsson-lab, in repositories named cytograph, punchcards and auto-annotation-md.
Though defined by Shh expression, each of the ventralizing organizers were further distinguished by region-specific expression of secreted molecules. The hypothalamic floor plate was the only ventralizing organizer that expressed Dlk1, a Notch ligand, and was distinguished by the absence of Spon1 and Ntn1. Furthermore, the Zona limitans intrathalamica (ZLI) and the mesencephalic floor plate shared expression of Wnt5a, Wnt5b and Lrtm1 but were distinguished by Pappa and Dkk2 expression, respectively. These results establish a rich repertoire of secondary organizers secreting combinations of morphogens that may act region-specifically to induce locally distinct neuronal fates.
To directly validate these secondary organizers, we performed in-situ sequencing41 of 119 genes (Figure 3c and Extended Data Figure 5d-e) in an E10.5 mouse brain (Figure 3e-i). The spatial distributions of Shh, Wnt, and Fgf isoforms clearly revealed the secondary organizers (Figure 3f). Moreover, the subtypes of floor, roof, and boundary organizers identified by scRNA-seq were revealed as spatial domains of the more general organizers (Figure 3g-i), defining a more complex system of spatial regulation of brain development than previously recognized. To summarize the complexity of the 119 spatial patterns of expression we clustered the in-situ sequencing data defining 40 embryonic territories characterized by a unique repertoire of signaling molecules and patterning factors (Extended Data Figure 6a-b). Overall, the fact that subtypes of organizers were both transcriptionally distinct, spatially segregated, and expressed distinct sets of ligands and receptors, shows that progenitors born in even closely adjacent spatial domains are subject to distinct environmental cues.
A distinct clade of the cluster dendrogram comprised late glia (Figure 1; clusters 751 and above) and related progenitors (Figure 4a and Extended Data Figure 7a-f, Methods). Focusing on these cells revealed two major groups of radial glia: those that expressed neurogenic markers, like Neurog2 or Dlx2, and those that expressed glial markers like Tnc or Egfr (Figure 4b-c). We refer to the latter as glioblasts, given their likely commitment to gliogenesis. More than 6 000 genes were significantly enriched in one of the two groups (Figure 4d, Extended Data Figure 7c-d). Over 600 of these genes were cell-cycle related, 89% of which were enriched in neurogenic radial glia. Olig1, Olig2, Aqp4, Stat3, and glutamate and GABA receptors were enriched in glioblasts. To further distinguish progenitors related to astrocytes or oligodendrocyte precursor cells (OPCs), we compared with genes enriched in these cell types in the adolescent mouse (Extended Data Figure 8j).
We previously defined seven region-specific astrocyte types of the adolescent mouse brain22, dominated by abundant telencephalic and non-telencephalic subtypes. Consistent with this observation, non-telencephalic glioblasts and oligodendrocyte precursors of mixed forebrain, midbrain, and hindbrain origin were distinct from a cluster of predominantly telencephalic glioblasts biased towards astrocytes (Extended Data Figure 7e). In contrast, neurogenic radial glia split into separate clusters based on region. These observations suggest a loss of regionalization particularly in the oligodendrocyte lineage, but also among astrocytes. To investigate further, we sampled from our dataset six groups of cells - early glia, neurogenic radial glia, glioblasts, OPCs, neuroblasts, and neurons – and trained a classifier to identify a cell’s region of origin using transcription-factor expression (Extended Data Figure 7g-i). The classifier was trained on each group and tested on the five remaining groups. Although the results suggest that region-defining genes shift throughout development, the classifier still performed relatively well when trained on glioblasts and OPCs, demonstrating that some region-defining genes were still present at low levels (Extended Data Figure 7j).
The transition from neurogenic to gliogenic occurred between E12 and E16 (Figure 4e and Extended Data Figure 7e). The earliest gliogenic cell to appear was a cluster unique in its co-expression of Egfr and the Notch ligands Dll1, Dll3, and Dll4 (Extended Data Figure 7k-l), This cluster, which was closely related to OPCs and might represent a pre-OPC state, expressed high levels of Olig1 and Olig2, as well as Crlf1, a soluble member of the Cntfr pathway implicated in gliogenesis42. The appearance of astrocytes at approximately E15 was revealed by increased levels of Gfap, Agt, and Aqp4 (Extended Data Figure 7k).
To better resolve lineage-specific versus shared gene expression programs, we inferred a pseudolineage tree for the whole dataset (Extended Data Figure 8). We arbitrarily designated a single cell in the early neuroepithelium as the root, and computed shortest paths (i.e., geodesics on the manifold) to every other cell on the radius nearest neighbor graph (RNN). The set of all shortest paths from the root forms a pseudolineage tree, due to the convergence of paths towards the root (Extended Data Figure 8a-b). Plotting individual pseudolineage trajectories to a few selected single cells (Extended Data Figure 8c-d) on the atlas embedding confirmed that pseudolineages agreed with expectations. For example, pia and arachnoid originated from the cranial neural crest; astrocytes and OPCs were generated from distinct but closely related radial glia; and cortical excitatory neurons and forebrain GABAergic neurons were derived from distinct radial glia precursors.
The pseudolineage tree provides a tool to investigate gene expression along putative lineage trajectories leading to specific end states. To do this, we isolated all the trajectories that ended in a specific cluster (or set of clusters), and projected all the single cells in those trajectories onto the embryonal pseudoage (Figure 4f-h and Extended Data Figure 8e-i). We then calculated average gene expression in pseudoage bins. To validate this approach, we first examined the subtree terminating in cortical excitatory neurons (Figure 4f) whose development is well understood. As expected, cells along the trajectory predominantly were located in the forebrain. Furthermore, a set of known regulators of cortical excitatory neurogenesis was expressed in an ordered progression from E11 to E18: Emx1, Eomes, Tbr1 and Satb211. RNA in situ hybridization (Allen Brain Atlas) confirmed the timing of expression of these key transcription factors (Extended Data Figure 8k).
Expanding the analysis to additional glial and neuronal lineages showed the expected lineage markers (Figure 4g and Extended Data Figure 8e-i). We noticed that nearly all neuronal — but none of the glial — lineages passed through an early neuroblast state expressing all or most of Nhlh1, Nhlh2, Ebf1, Ebf2 and Ebf3, extending the HPF analysis above with lineage-specificity. The expression of Nhlh1 and Nhlh2, especially in combination, (Figure 4g, Extended Data Figure 8j) identified a pan-neuronal state just at or after cell cycle exit (Figure 1 and Extended Data Figure 2b), which coincided with the set of clusters we had manually annotated as neuroblasts. Notably, however, the forebrain GABAergic lineage appeared to lack Nhlh1 expression, although low levels of Nhlh2 were detected.
Human brain cancer is a devastating disease, with a median survival of less than two years for glioblastoma. Recent work using single-cell RNA-sequencing to determine the cellular composition of glioblastoma has suggested that these tumors reflect a reversion to an embryonic cell state resembling neural progenitors, OPCs and/or immature astrocytes24,43,44. To unambiguously test this hypothesis, we compared human brain tumor data — comprising five glioblastoma samples and one anaplastic astrocytoma45 — with a reference cell type catalog constructed by merging the developmental cell atlas of this paper with our previous adolescent brain cell atlas22 (Methods and Extended Data Figure 9).
Each of the tumor samples matched a unique set of reference cell types, but in each case aneuploid cells predominantly matched embryonic cell types (mainly radial glia and neuroblasts) or adult astrocytes (Figure 5). In contrast, normal (euploid) cells matched adult immune cells, oligodendrocytes and vascular cells. Euploid vascular cells invariably matched embryonic vascular cells, not adult, likely indicating ongoing angiogenesis. The tumors differed in composition, proliferative status and the nature of the immune response. For example, only the anaplastic astrocytoma showed B cell and T cell response, and the most highly proliferative glioblastoma (SF11215) was almost devoid of immune cells. However, every tumor contained proliferating progenitors resembling embryonic radial glia or OPCs. With one exception (a neural match in tumor SF11159), all aneuploid clusters that matched adult cell types were astrocyte, OPC or neuroblast-like; all of which are cell types that are phenotypically close to embryonic progenitor cells. Thus, an unbiased comparison with embryonic and adult cell types confirms the essentially fetal cellular nature of human glioblastoma.
Our atlas is a step toward dissecting the principles of mammalian nervous-system development and provides an overview of the prenatal brain’s transcriptomic diversity. The wealth of information on time-, lineage-, and region-specific gene expression provides powerful tools for genetic targeting, and for understanding genes involved in neurodevelopmental disorders and human brain cancer.
Author contributions
S.L. supervised the project. S.L. and G.L. conceived the study design. G.L., K.S. and S.L. analyzed, annotated and interpreted the single-cell data and wrote the manuscript. G.L. and A.F. performed the single-cell experiments. A.F. performed the embryo dissociations. A.J. prepared sequencing libraries. P.L. and S.L. built the companion website. E.V., D.G. and C.M.L. performed in situ sequencing experiments. K.S., I.K., G.L. and C.M.L. analyzed and interpreted the in situ sequencing data. M.N. supervised the in situ sequencing experiments. All the authors critically reviewed the manuscript and approved the final version.
Code and data availability
Source code is available at https://github.com/linnarsson-lab. RNA-seq data is available at the Sequence Read Archive (SRA; https://www.ncbi.nlm.nih.gov/sra) under accession PRJNA637987. HybISS data is available at http://mousebrain.org/downloads.
Methods
Animals
CD-1 mice were obtained from Charles River (Germany) and mated overnight. The morning that plugs were detected was considered E0.5. Mice were housed with regular dark/light cycle and fed standard diet and water ad libitum. All animal procedures were approved by the Stockholm ethics committee (N68/14; Stockholms djurförsöksetiska nämnd) and followed Directive 2010/63/EU of the European Parliament and of the Council, the Swedish Animal Welfare Act (Djurskyddslagen: SFS 1988:534), the Swedish Animal Welfare Ordinance (Djurskyddsförordningen: SFS 1988:539) and the provisions regarding the use of animals for scientific purposes: DFS 2004:15 and SJVFS 2012:26.
Dissections
161 embryos between E7 and E18.5 were collected from 36 pregnant mothers. Embryos were extracted from the uterine horn and dissected in phosphate buffered saline (PBS) on ice within one hour. Dissections were performed with microsurgical tools under a fluorescent stereomicroscope. The number of embryos dissected was adapted to accommodate changes in size and cell numbers in the developing brain (see Table S1).
E7 and E8 embryos were cut into two parts to isolate the rostro-ventral neural tube from dorsal tissue including the trophectoderm, which was discarded. The developing spinal cord was identified by corresponding somite pairs and also discarded. From E9 and E10 embryos, we isolated the entire cephalic part, including prospective forebrain, midbrain, and hindbrain. From E11 onwards, only developing brain tissue was dissected and collected; surrounding head tissue was discarded. Meninges was removed whenever possible. E9 to E11 brains were divided into forebrain, midbrain and hindbrain. Forebrain and midbrain were separated at the diencephalon; midbrain and hindbrain at the isthmus. E12 to E15 forebrains were further divided into dorsal (including the hemispheres) and ventral (including the subcortical territories, caudoputamen and the ventral striatum) parts. E16 to E18 forebrains were divided into dorsal, ventral, and thalamic parts.
Cell dissociation
Dissected tissue was processed as previously described, with minor modifications9,22. Tissue was transferred to a 2 ml Eppendorf tube containing preheated digestion solution (37°C). Before E11 Trypsin was used for dissociation; after E11 the digestion solution contained 300 µl TrypLE Express (Life Technologies; cat. no. 12605-010), 2 000 µl papain solution (Worthington Biochemical; cat. no. LK003178; 25 U/ml in cutting solution), 100 µl DNase I (Worthington Biochemical; cat. no. LK003172; 1 mM in cutting solution) and 100 µl cutting solution. Enzymatic incubation time was adjusted based on developmental stage. The tissue was kept in a 37°C water bath and gently pipetted three times every 10 minutes. After 30 min at 37°C, another 100 µl of TrypLE solution was added, followed by 10 more minutes at 37°C and pipetting. After approximately 45–50 min, dissociation was complete, and cells became visible at the bottom of the Eppendorf tube. The solution was filtered using a 20-µm cell strainer (Falcon cat. no. 352340) and collected in a 15-ml plastic tube. The digestion solution was diluted with 2.4 ml of cutting solution and 0.6 ml of Neurobasal-A medium (Gibco; cat. no. 10888) and centrifuged at 100g for 4 min at 4°C. The supernatant was removed and the pellet resuspended in 0.5 ml cutting solution and 0.5 ml Neurobasal-A medium. Neurobasal-A medium was supplemented with l-glutamine (Gibco; cat. no. 25030-123), B27 (Gibco; cat. no. 17504-044) and penicillin plus streptomycin (Sigma 50x; cat. no. P4458). The cell suspension was carefully transferred with a Pasteur pipette onto an Optiprep gradient. For the gradient, 85 µl of Optiprep Density Solution (Sigma; cat. no. D1556) was mixed with 457.5 µl cutting solution and 457.5 µl Neurobasal medium including supplements. The gradient was subsequently centrifuged at 80 g for 10 min at 4°C. The supernatant was removed until only 100–200 µl remained. DNase I (10 µl) was added to prevent aggregation. The single-cell density was evaluated. Then cells were filtered with a 20 µm strainer (Partec CellTrics). Cells were pelleted, resuspended in cutting solution with DNaseI, and loaded into 10X Genomics Chromium v1 chips.
Single-cell RNA sequencing
Droplet-based single-cell RNA sequencing was performed using the 10x Genomics Chromium Single Cell Kit v1. Single-cell suspensions concentrated at 500-700 cells/ml were mixed with master mix and nuclease free water according to the Chromium manual, targeting 3500 cells per reaction. 12 PCR cycles were used for cDNA synthesis. Each library was sequenced using an Illumina HiSeq instrument, one sample per lane, with one 98 bp read located near the 3′ end of the mRNA. Illumina runs were demultiplexed and aligned to the genome (mm10-3.0.0), and BAM files were obtained from the 10X Genomics cellranger pipeline (version 3.0.2). Molecule counts were attributed to spliced and unspliced transcripts by running velocyto46 (version 0.17.11) with standard parameters, resulting in one loom file (http://loompy.org) per sample.
In-situ sequencing by HybISS
HybISS41 was performed as published at protocols.io (http://dx.doi.org/10.17504/protocols.io.xy4fpyw). Padlock probes, bridge probes and detection oligo sequences used are listed in Supplementary Table 3. Imaging was performed with a Leica DMi8 epifluorescence microscope equipped with LED light source (Lumencor® SPECTRA X), sCMOS camera (Leica DFC9000GTC), and 20× objective (HC PL APO, 0.80). ROIs were imaged with 10% overlap and 24 z-stack planes with 0.5 μm spacing and then maximum projected in LASX software and then tiles exported as raw TIFF files.
Computational Analysis
Pooling samples and preprocessing for analysis
Loom files from each 10x sample were aggregated by time point and brain region, the latter dictated by our dissection strategy. In particular we pooled: Cephalic Neural Tube E7-8, Forebrain E9-11, Midbrain E9-11, Hindbrain E9-11, Forebrain Dorsal E12-15, Forebrain Ventral E12-15, Midbrain E12-15, Hindbrain E12-15, Forebrain Dorsal E16-18, Forebrain Ventrothalamic E16-18, Forebrain Ventrolateral E16-18, Midbrain E16-18, and Hindbrain E16-18. Cells with fewer than 2 000 UMIs were excluded from pooling. Genes expressed in fewer than ten cells or greater than 60% of cells were excluded from analysis.
Cytograph 2.0
The analysis pipeline is available as an update to our Cytograph package, Cytograph 2.0 (https://github.com/linnarsson-lab/cytograph-dev). Cytograph is under continuous development, and the results in this paper were generated at commit 2e8eb79a5f83b-d9dda15503d9b2476ebdd8fa0b3. We described Cytograph v1.0 elsewhere22, including gene enrichment (overexpression), trinariation and clustering. We detail below only the key improvements: changes to dimensionality reduction and graph construction; auto-annotation; and punchcards.
In brief, we used Hierarchical Poisson factorization (HPF47) to decompose the expression matrix into 96 non-negative and modular components, while reducing noise and preserving most of the structural information in the original matrix. We then computed a radius nearest neighbor graph (RNN) using the information radius (also called the Jensen-Shannon divergence, a measure of the Shannon information difference) to link cells with near-identical gene-expression states. Finally, we clustered the RNN graph using modularity (Louvain) to define distinct sets of cells representing cell types or states along differentiation trajectories. We thus represent the high-dimensional gene expression manifold simultaneously as an RNN graph (facilitating analysis of continuous processes such as differentiation), a set of clusters (facilitating comparison between discrete cell states) and the components of the HPF (facilitating analysis of modular processes shared between cells). For visualization, we embedded the manifold in two dimensions using an optimized tSNE projection from HPF space (e.g. Figure 1). Unless otherwise specified, the following parameters were used: k=25, k_pooling=10, factorization=“HPF”, n_factors=96, n_genes=2000, mask=(), doublets_action=None, features=“enrichment”.
Cytograph uses the following open-source packages, and we are grateful to the authors for making such important resources freely available to the community: numpy, scikit-learn, scipy, networkx, python-louvain, hdbscan, pyyaml, statsmodels, numpy-groupies, tqdm, umap-learn, torch, harmony-pytorch, pynndescent, click, leidenalg, unidip and opentsne.
Distance metric, manifold graph, and pseudo-lineage geodesics
In single-cell analysis, it is typical to construct a nearest-neighbor graph by imposing a distance metric on a reduced-dimension-al space of gene expression. This graph is a proxy for the manifold of allowed gene expression states, and can be used for downstream processing such as clustering. However, the standard approach has several issues:
Nearest neighbors need not be close to each other in any absolute sense. A lone blood cell among neurons will be nearest to some neuron, but not close in any useful sense.
Distances are not biologically meaningful. For example, “euclidean distance in PCA space” has no intrinsic meaning that can be interpreted as a meaningful distance between cells.
Neighborhood structure on the manifold graph depends strongly on sampling density
We address these issues in three ways:
We use Hierarchical Poisson Factorization (HPF) to model the gene expression matrix as a linear combination of factors, and show that this model captures nearly all structure in the expression matrix (Extended Data Figure 3a-d).
We use the Jensen-Shannon Divergence, also known as the “information radius” as a principled metric that distinguishes “near-identical” cells from “dissimilar cells” (Extended Data Figure 3e-i).
We define pseudo-lineage trajectories as geodesics along the manifold, which therefore correspond to sets of cells that form a path connected by pairs of cells within a defined maximal information radius (Extended Data Figure 8a-b).
Given a pair of cells i, j and their normalized HPF component vectors θi, θj (considered here as probability distributions over the components) the Jensen Shannon Divergence (JSD) is defined as where H(·) is the Shannon entropy function. The JSD measures the entropy of the average of two cells, compared to the average entropies of the individual cells. It is a symmetrized version of relative entropy (Kullback-Leibler divergence). For a pair of identical cells, their HPF component values are drawn from identical distributions, the entropy of the average is equal to the average of entropies, and the JSD is zero. For a pair of dissimilar cells, the average of entropies is smaller than the entropy of the average, because information is lost in averaging. A pleasing property of the JSD is the fact that its square root is a true metric (satisfying non-negativity, symmetry, identity of discernibles and triangle inequality), which is required by some clustering and manifold learning algorithms. The JSD can be viewed as defining a small neighborhood around each cell, in which neighbors are near-indistinguishable in an information-theoretic sense. Exploiting this property, we construct a radius nearest neighbor (RNN) graph, by imposing a fixed JSD radius around each cell, and connecting only cells that fall within this radius (for computational efficiency, we cap the maximum number of neighbors to a fixed number, usually 25). This approach addresses several of the deficiencies of the common KNN approach:
Distances between neighbors have a clear meaning in terms of relative entropy, and informally, cells that are connected in the RNN can be viewed as “near-identical”
If the manifold is undersampled, the RNN will automatically be sparsely connected, or even disconnected, reflecting missing neighbors
Regions of the manifold that have been densely sampled will be densely connected
Connected paths through the RNN are guaranteed to proceed through individual cell neighbors that are near-identical in terms of relative entropy; a more meaningful property than arbitrary distances in PCA space
Next, for inference of longer trajectories along the manifold, we define a pseudo-lineage geodesic as the shortest path between two given cells (e.g. a root cell and a terminally differentiated cell) on the RNN using the JSD as edge weights. The shortest path is therefore a path of small steps, each connecting a pair of near-identical cells. A pseudo-lineage tree of all cells can be constructed by designating a single root cell, and computing the geodesics to all other cells. On this tree, many cells will be internal (branches), some external (leaves) and the root cell will have no predecessor. Cells not connected directly or indirectly to the root cell will not be part of the tree. For this paper, we constructed the pseudo-lineage tree by picking a single cell (#206,003) in the early neuroepithelium arbitrarily as the root.
Auto-annotation
After clustering, Cytograph incorporates a step termed auto-annotation that automatically assigns meaningful labels to the clusters (Extended Data Figure 10). Labels are flexible and may denote either specific cell types or general properties such as “cell cycle” or “GABAergic neurotransmission.” Multiple labels can be assigned to the same cluster, and each label can be attributed to multiple clusters. Each label is defined as a text document with a short computer-readable section that consists of four mandatory fields: name, abbreviation, definition and category:
name: Cranial Neural Crest Cell
abbreviation: CNC
definition: +Dlx2 +Plp1 +Sox10
categories: Ectodermal NeuralCrest Face
The “abbreviation” and “name” are unique identifiers, whereas “categories” can be used to organize the labels into meta-labels. The “definition” is the most important field and represents the label-assignment rule. It consists of a list of gene names, each prefixed with “+” if the gene must be present or “-” if the gene must be absent. To determine presence or absence, we use Cytograph’s trinarization score22. A cluster is assigned a label if and only if the “+” genes are present and the “-” genes are absent. Many different definitions for the same label are possible. To create robust labels that apply to diverse datasets, we favored definitions that comprised short lists of genes that are ideally well-studied. The auto-annotation tags were used for data exploration and to subset the dataset as described below.
Each document additionally contains a free-form section that is used to provide context, justification, and references, which was refined continuously throughout our study and constructed by cross-referencing the literature and our data. As a result, it forms a cumulative body of knowledge about mouse brain cell types and states that can be automatically applied to future datasets, available at https://github.com/linnarsson-lab/auto-annotation-md.
Punchcards
Repeating feature selection and dimensionality reduction on a subset of cells often highlights further structure than can be revealed by clustering a large, diverse dataset. Although subsetting the dataset in a supervised way incorporates biological knowledge to expose particularly interesting aspects of the heterogeneity, these decisions quickly become verbose, and the approach requires manual intervention at each level of analysis.
We addressed this issue by using the auto-annotation framework to implement a rule-based splitting procedure. Each level of analysis is defined by a .yaml file termed a punchcard. The name of the punchcard indicates the parent analysis to use as input (e.g. MidbrainE911.yaml). Within the file, each subset to analyze is given a name (e.g. Neurons and Glia) and the clusters to include are determined by the auto-annotation listed in the include field. A conditional expression can optionally be specified in the onlyif field to further constrain the subset. Punchcards can therefore define a series of biologically meaningful analyses that remain general to the input dataset and analysis parameters. Punchcards can be applied hierarchically to define increasingly fine-grained analyses.
Post-processing
Removal of doublets and contaminated clusters
Doublets were removed by running a modified version of DoubletFinder available in cytograph on each individual sample in the dataset. DoubletFinder scores were then averaged across clusters, and clusters with an average doublet score greater than 0.3 were removed from further analysis. Approximately 20 additional clusters were removed that were manually annotated as blood-contaminated, doublets, or contained fewer than five cells.
Merging
A dendrogram was constructed using Cytograph’s Aggregator class on the dataset. The dendrogram was cut at a height of 20 to generate groups of similar clusters. For each group, differential expression testing was performed pairwise between the clusters. The package diffxpy was used to run t-tests on the log expression values. Genes were considered differentially expressed if 1) q-value was less than 0.01, 2) log fold change was greater than 2, and 3) at least 30% of cells expressed the gene in one cluster, and the percentage of expressing cells in the second cluster was less than 70% of the percentage of expressing cells in the first cluster. Clusters with less than one differentially expressed gene were merged. The procedure was repeated until the remaining clusters in each group were distinguishable from one another by at least one differentially expressed gene.
tSNE of all cells and dendrogram of all clusters
We computed HPF to 30 components and then used the Art of tSNE heuristics48 (but with exaggeration=1.5, perplexity=150) to obtain the tSNE layout in Figure 1. The use of a relatively small number of HPF components resulted in a globally smoother and more connected manifold, at the expense of reduced local resolution. This tSNE was used for visualization only; clustering and annotation was performed as described above.
The final dendrogram was generated by running Cytograph’s Aggregator class on 942 clusters with parameter mask set to [‘cellcycle’, ‘sex’, ‘ieg’, ‘mt’], which excludes cell-cycle, sex-related, immediate-early, and mitochondrial genes from the analysis. Both the cluster dendrogram and the tSNE of all cells were computed with a more recent version of cytograph (commit 8a5518e-81197f5aa9d54565cbce3875738d8dac9).
Glia analysis
Clustering
Cells belonging to clusters 751 and above were pulled from the dataset. The mean expression of each gene was calculated for every cluster, and a new dendrogram was calculated based on the correlation of gene expression between clusters. The resulting dendrogram was cut at a height of 1, resulting in 18 meta-clusters that were subsequently annotated based on gene expression. Independently, a new t-SNE was calculated on this subset of cells using cytograph with the parameters described above except n_factors: 50 and mask: [‘cellcycle’, ‘sex’, ‘ieg’, ‘mt’, ‘ery’] to mask cell-cycle, sex-related, immediate-early, mitochondrial, and blood-related genes.
Differential expression
Differential expression was calculated using the diffxpy package. A t-test was calculated on the log gene-expression values normalized by total molecules per cell. Independently, the fraction of positive cells for each gene was calculated for both neurogenic radial glia and glioblasts. Differentially expressed genes were then filtered for those with q-values less than 10−5 and expressed in at least 10% of either neurogenic radial glia or glioblasts.
Cell type scores
Cell-type scores were based on the top 200 enriched genes for astrocytes and OPCs in the adolescent mouse dataset. The score was calculated as the percentage of total genes expressed in each cell that belonged to the enriched set.
Region classifier
1 200 cells were sampled from each of six subsets of the dataset: early glia, neurogenic glia, gliogenic glia, OPCs, neuroblasts, and neurons. Equal numbers were sampled from the forebrain, midbrain, and hindbrain. Each of these subsets was used to train a classifier for region that was then tested on the other five groups. The sklearn GradientBoostingClassifier was regularized with the parameter max_features, and max_features=50 was chosen after testing the performance of the classifier over a range of values. Default values were used for other parameters. Precision, recall, and F1 scores were calculated separately for forebrain, midbrain, and hindbrain cells and averaged.
Gastrulation and neural plate analysis
The dataset from Pijuan-Sala et al. was prefiltered by removing cells not classified by the authors. Feature selection using a CV-mean was used to restrict the analysis to the most variable 3500 genes. Analogously 5 223 genes were selected for the E7-8 dataset (2 000 for the primitive streak-stage; non-neural lineage-restricted, and early neuroepithelial). We considered the intersection between these two gene sets. The count tables were imputed by k-nearest neighbor pooling46 (k=30) and then log-transformed. To place the E7-8 data on the t-SNE embedding for Pijuan-Sala et al., we first computed the similarity (Pearson’s R) between each of the E7-8 cells and the reference cells. Then we assigned to each cell the average of the embedding coordinates computed over the 20 most similar cells in the reference.
The signature score for neuroblast from different regions was obtained considering genes enriched in neuroblasts sampled from forebrain, midbrain and hindbrain. In particular, we considered the neuroblast clusters sampled between E9 and E13 and computed an enrichment score for each of the areas and selected the top 200 genes. Neuroblast clusters were defined using a rule based on the auto-annotations: negative for the cell cycle label and positive for one of the neuronal lineage labels (Glut, GABA, Glyc, etc.). Finally to compute the signature score for each cell we used the same approach adopted by Seurat49. We calculate the Z-scores for each of the genes in the set, then average them and normalize the result using a reference computed by drawing random gene samples.
Spatial localization of organizer-like clusters
We used the R package Voxunt to determine the developing brain regions most likely to host each of the organizer-like cell clusters40. The tool constructs a 3D similarity map of the cells using a reference spatial expression atlas that can be interpreted as a putative localization for the cells. Briefly, the tool computes the Spearman correlation between average gene expression of a scRNA-seq cluster and each voxel of a gene expression model constructed from the ISH experiments of the Allen Brain Developmental Atlas50. To compute the similarity map we used the same script for E11 and E13, where we call variable_genes with nfeatures=150 and run voxel_map and plot_map using default parameters.
In situ sequencing analysis
Channel images were merged into a multichannel tiff image and aligned to a reference round. The channel used for tile alignment was the DAPI stain. Following alignment, the tiles were stitched together using Microscopy Image Stitching Tool (MIST51) and subsequently split into non-overlapping 2000×2000 pixel tiles. Each individual tile was then top-hat filtered, the RCPs were segmented and the intensity was measured in each channel. The RCP was then assigned the base with the highest intensity. The public repository for the code used can be found at https://github.com/Moldia/Tools. Gene expression was summarized considering a isotropic grid which step 16 μm and counting the called dots for each gene. The data was smoothed with gaussian filter with a bandwidth of 24 μm. The grid location with a total of called dots smaller than 4 were excluded from the clustering analysis. As a preprocessing before clustering the counts were transformed with the variance stabilizing transform log2(x+1), PCA was computed and the top 40 principal components retained. Finally, we performed a KMeans clustering setting n_cluster=40.
Glioblastoma analysis
To identify aneuploid cells, we calculated the normalized sum of expression on each chromosome, and then fit a 5-component Bayesian Gaussian mixture model to the 24-dimensional chromosome expression data of all cells (Extended Data Figure 9a-c). We then inferred the maximum a posteriori component label for each cell, and the posterior probability. We manually examined these labels for all clusters to determine which components corresponded to aneuploid cells, and which were euploid. We then mapped human to mouse genes and re-clustered and re-annotated the tumor data using cytograph to ensure comparable analyses. To match tumor clusters with reference cell types, we first computed the top 25 enriched genes for each reference cell type (developmental and adolescent). For each reference cluster and each tumor cluster, we then computed a matching score as the product of gene enrichment in the tumor for the 25 top enriched genes in the reference cell type. Thus, a high matching score reflected highly specific expression of the set of genes specifically expressed in the reference cell type. The top-scoring matching cell type was selected for each tumor cluster. For immune cells, we lacked reference data, and therefore called those cell types manually based on well-known markers of macrophages, B cells and T cells.
Acknowledgements
We thank Hannah Hochgerner and Amit Zeisel (Technion, Israel) for valuable advice, Patrik Ernfors (Karolinska Institute) for supporting the work of A.F., and the National Genomics Infrastructure for sequencing services. This work was supported by grants from the Knut and Alice Wallenberg Foundation (2015.0041 to S.L. and 2018.0172 to S.L. and M.N.), the Erling Persson Family Foundation (HDCA, to S.L. and M.N.), the Swedish Foundation for Strategic Research (RIF 15-0057, SB16-0065) to S.L., Hjärnfonden (PS2018-0012) to D.G., the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation (2018-191929) to M.N, and the Swiss National Science Foundation (CRSK-3_190495) to G.L.