ABSTRACT
The human brain has changed dramatically since humans diverged from our closest living relatives, chimpanzees and the other great apes1–5. However, the genetic and developmental programs underlying this divergence are not fully understood6–8. Here, we have analyzed stem cell-derived cerebral organoids using single-cell transcriptomics (scRNA-seq) and accessible chromatin profiling (scATAC-seq) to explore gene regulatory changes that are specific to humans. We first analyze cell composition and reconstruct differentiation trajectories over the entire course of human cerebral organoid development from pluripotency, through neuroectoderm and neuroepithelial stages, followed by divergence into neuronal fates within the dorsal and ventral forebrain, midbrain and hindbrain regions. We find that brain region composition varies in organoids from different iPSC lines, yet regional gene expression patterns are largely reproducible across individuals. We then analyze chimpanzee and macaque cerebral organoids and find that human neuronal development proceeds at a delayed pace relative to the other two primates. Through pseudotemporal alignment of differentiation paths, we identify human-specific gene expression resolved to distinct cell states along progenitor to neuron lineages in the cortex. We find that chromatin accessibility is dynamic during cortex development, and identify instances of accessibility divergence between human and chimpanzee that correlate with human-specific gene expression and genetic change. Finally, we map human-specific expression in adult prefrontal cortex using single-nucleus RNA-seq and find developmental differences that persist into adulthood, as well as cell state-specific changes that occur exclusively in the adult brain. Our data provide a temporal cell atlas of great ape forebrain development, and illuminate dynamic gene regulatory features that are unique to humans.
MAIN TEXT
Bulk genomic measurements in primary brain tissue from humans, chimpanzees and other apes have identified molecular features that appear specific to the human brain9–13. These studies have been limited to a snapshot of adult brain tissues, or average measurements across heterogeneous cell populations. Time course measurements of rhesus macaque brain development provide insights into developmental divergence in primates14, but it has been difficult to perform similar experiments in great apes due to the lack of available tissue. Cerebral organoids15 grown from human and other great ape induced pluripotent stem cells (iPSCs)16 offer the exciting potential to study the evolution of human brain development in controlled culture environments. Previously, we and others have shown that human and chimpanzee cerebral organoids recapitulate many aspects of in vivo cortex development17–22. In particular, low-throughput single-cell transcriptomics on cortical-like regions within human and chimpanzee cerebral organoids revealed that gene expression patterns of early fetal neocortex development were largely recapitulated in the organoids17, 19, and comparative analyses revealed changes between human and chimpanzee21. Higher throughput scRNA-seq methodologies enable genomic dissection of individual organoids with the potential to study gene expression landscapes across multiple brain regions23, 24 and from multiple individuals. Here, we set out to use high-throughput single-cell RNA-seq, together with accessible chromatin profiling, to understand human cerebral organoid development from pluripotency, and to explore how human cortical gene expression programs have diverged from the other great apes. We further analyze adult prefrontal cortex tissue using single-nucleus RNA-Seq to reveal the potential and limits of cerebral organoids to study human-specific expression patterns observed in the mature brain.
We first used droplet-based scRNA-seq (10X genomics) to profile cell composition across a time course of human organoid development (pluripotency: 0 days (d); embryoid body: 4d; neuroectoderm: 10d; neuroepithelium: 15d; organoid stages: 1, 2, and 4 months (m)) from two human pluripotent stem cell lines (H9, embryonic stem cell (ESC), 23,226 cells; 409b2, iPSC, 20,272 cells; Fig. 1a,b; Extended Data Fig. 1). Marker gene analysis of two-dimensional t-SNE projections of the data from each time point separately, as well as all time points combined, revealed distinct progenitor, neuronal, astrocyte, and mesenchymal populations that emerged across the time course, with intermixing of iPSC and ESC-derived cells (Fig. 1c; Extended Data Fig. 2). We generated pseudocells by combining nearest neighbors in the high dimension gene expression space, which resulted in a more robust transcript estimation (on average ∼6,000 genes detected per pseudocell compared to ∼3,000 genes per single cell). We then constructed a force-directed k-nearest neighbors graph25 to visualize the temporal progression of the data (Fig. 1d, Extended Data Fig. 3). We track a progression through pluripotent, neuroectodermal, and neuroepithelial stem cell states during the first 15 days of differentiation. By 1 month, cells diversify into neural progenitors from multiple brain regions including the forebrain (dorsal and ventral telencephalon, diencephalon), midbrain (mesencephalon), and hindbrain (rhombencephalon). A small subpopulation resembling retinal progenitors of the developing eye field is also present, but these cells were only detected in an iPSC 409b2-derived organoid. In addition, a non-neuronal mesenchymal population appears from both cell lines early in the differentiation time course. By 2 months, excitatory and inhibitory neuronal fates have differentiated from progenitors of multiple brain regions, and by 4 months astrocytes have emerged (Fig. 1e). 
These observations were based on the supervised analysis of known marker genes and inspection of in situ patterns from the Allen Developing Mouse Brain Atlas, comparisons to bulk RNA-seq data from microdissected regions of the developing human brain (BrainSpan data26) and single-cell reference maps of cell prototypes from the dorsal and ventral telencephalon27, as well as the analysis of spliced and unspliced transcripts using RNA velocity28 (Fig. 1e,f; Extended Data Fig. 3-4). Together, this data provides a temporally and pseudotemporally resolved gene expression atlas of the earliest stages of human brain development.
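The pseudocell construction described above can be illustrated with a minimal sketch. It assumes a dense cells-by-genes matrix, randomly chosen seed cells, and plain Euclidean distance; the function name `build_pseudocells` and its parameters are hypothetical and are not taken from the study's code, which may select seeds and neighbors differently.

```python
import numpy as np

def build_pseudocells(expr, n_seeds, k, seed=0):
    """Average each seed cell with its nearest neighbours.

    expr    : cells x genes matrix of non-negative expression values
    n_seeds : number of pseudocells to build (hypothetical parameter)
    k       : number of cells pooled per pseudocell, seed included
    Returns (pseudocells x genes matrix, pseudocells x k neighbour indices).
    """
    rng = np.random.default_rng(seed)
    seeds = rng.choice(expr.shape[0], size=n_seeds, replace=False)
    # Squared Euclidean distance from every seed to every cell.
    d2 = ((expr[seeds, None, :] - expr[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k closest cells per seed (the seed is at distance 0).
    neighbours = np.argsort(d2, axis=1)[:, :k]
    # Pooling by the mean: a gene is detected if any pooled cell detects it.
    pseudocells = expr[neighbours].mean(axis=1)
    return pseudocells, neighbours
```

Because a gene counts as detected in a pseudocell when any pooled cell detects it, the number of detected genes per pseudocell can only rise relative to a single member cell, which is the direction of the effect reported above.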
We next analyzed the reproducibility of these gene expression patterns across PSC lines from different human individuals (Fig. 2a; Extended Data Fig. 5). In addition to the 2 lines (iPSC 409b2 and ESC H9) described above, we generated single-cell transcriptomic data from 2 month old organoids from 5 additional iPSC lines (Sc102a1, 9,525 cells; Wibj2, 13,356; Kucg2, 4,395; Hoik1, 2,660; Sojd3, 3,830), resulting in a total of 62,305 cells from 20 organoids. We identified cells on the neuronal lineage (49,153 cells), for which we constructed pseudocells as described above. We then quantified the similarity (Pearson correlation) of each pseudocell transcriptome to each time point and brain region bulk RNA-seq reference transcriptome from the developing human brain (BrainSpan29). We used these similarities to calculate a reference similarity spectrum (RSS) score for each pseudocell, used SPRING to reconstruct the relationships between pseudocells based on RSS, and projected all single cells to the SPRING-based pseudocell embedding. This analysis revealed neuronal differentiation trajectories representing ventral and dorsal telencephalon, as well as distinct populations of cortical excitatory (GLI3, EOMES, NEUROD6), ventral telencephalon inhibitory (DLX1, SOX6, GAD1/2), diencephalon excitatory, diencephalon inhibitory (with Cajal-Retzius cell signatures), mesen- (or midbrain) and rhombencephalon (hindbrain) excitatory, and mesen- and rhombencephalon inhibitory neurons (Fig. 2b-f). Notably, the use of RSS as input for the SPRING analysis instead of the transcriptomes resulted in a well-integrated projection of the data from all human individuals without the need for further integration approaches (Extended Data Fig. 5). Cell annotations were also confirmed through comparisons to voxel maps of in situ hybridization patterns from the developing mouse brain (Extended Data Fig. 5). 
Molecular signatures of the annotated cell types match with those in published scRNA-seq data sets of human cerebral organoids and fetal human brain tissues (Extended Data Fig. 6)21, 27. We found that each iPSC line contributed cells to multiple differentiation trajectories, however the proportions of cells in each trajectory varied across organoid and iPSC line (Fig. 2d; Extended Data Fig. 5). For example, over 90% of cells from the line Kucg2 were on the cortical excitatory (dorsal) trajectory in each of the 3 organoids, whereas Hoik1-derived organoids predominantly contained cells from non-telencephalic regions. This tendency of iPSC lines to form different compendiums of brain regions and cell types is consistent with prior work in the literature23, 30. Nonetheless, the brain region-specific gene expression patterns across the lines were highly correlated (median of Pearson correlation of pseudotemporally dependent genes: 0.91 and 0.90 for dorsal and ventral trajectories, respectively) and cells from each region clustered together (Fig. 2g-i). This data suggests that even though there is variation in the relative proportion of cell types that form in each organoid, the gene expression patterns within each brain region are largely reproduced across diverse human pluripotent stem cell lines, thus providing a baseline for identifying human-specific gene expression.
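As an illustration of the reference similarity spectrum (RSS) described above, the sketch below computes the Pearson correlation of each pseudocell with each reference transcriptome. The final z-scoring across references is an assumption made here for illustration, as is the function name; the text above only states that the similarities were used to calculate the RSS.

```python
import numpy as np

def reference_similarity_spectrum(pseudocells, reference):
    """Correlate pseudocells with reference transcriptomes.

    pseudocells : n_pseudocells x n_genes matrix
    reference   : n_references x n_genes matrix (same gene order)
    Returns (Pearson correlations, z-scored spectrum), both
    n_pseudocells x n_references.
    """
    def standardize(m):
        # Center and scale each profile across genes so that a dot
        # product between two profiles equals their Pearson correlation.
        m = m - m.mean(axis=1, keepdims=True)
        return m / np.linalg.norm(m, axis=1, keepdims=True)

    corr = standardize(pseudocells) @ standardize(reference).T
    # Assumption: normalize each pseudocell's correlations across references.
    rss = (corr - corr.mean(axis=1, keepdims=True)) / corr.std(axis=1, keepdims=True)
    return corr, rss
```

Representing each pseudocell by its similarity to a shared reference rather than by its raw transcriptome is one way line-specific expression offsets could be damped, which would be in keeping with the well-integrated projection noted above.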
We next used chimpanzee organoids to identify features that differ from early human brain development. As for humans, we generated an atlas of gene expression across chimpanzee organoid development from pluripotency to 4 months in culture (Fig. 3a, Extended Data Fig. 7). Similar to human, chimpanzee organoids were morphologically complex with cortical-like regions containing apically located PAX6-positive progenitor cells and basally located neurons (Fig. 3b), with intermediates in between19. From the scRNA-seq data, we identified dorsal and ventral telencephalon differentiation trajectories, as well as rhombencephalon cell populations in chimpanzee organoids with a graph topology and gene expression patterns that were very similar to those observed in human (Fig. 3c-e). One difference was that upper and deep layer neurons in these chimpanzee organoids appeared to diversify and mature at an earlier stage along the cortical excitatory trajectory (Extended Data Fig. 8). We used a time warping algorithm to align the iPSC-to-cortical excitatory neuron pseudotimes from human and chimpanzee and observed that the later time points in chimpanzee failed to map to a human pseudocell counterpart (Fig. 3f,g). This observation suggested that neurons within chimpanzee organoid cortical regions may develop at a faster rate than in humans. In support of this observation we found that human dorsal telencephalon pseudocells projected to earlier parts of the developmental trajectory reconstructed in fetal human brain tissues27 than the chimp counterpart (Fig. 3h). In addition, we found that neuron maturation scores based on the cumulative expression of neuron projection, synapse assembly, and neurotransmitter secretion genes increased to higher levels in chimpanzee relative to human neurons over the time course (Fig. 3i, Extended Data Fig. 8). 
We also observed significantly more astrocytes relative to the number of radial glia (RG) cells in chimpanzee in 2- and 4-month organoids compared to humans (Fig. 3j). To determine the heterogeneity in organoid maturation across iPSC lines, we analyzed single-cell RNA-seq data (Smart-seq2) from additional human (15 individuals, 52 organoids) and chimpanzee (11 individuals, 38 organoids) organoids19, 21. Indeed, we found that there is heterogeneity in terms of upper and deep layer bifurcation timing that could be dependent on iPSC lines or organoid protocols. However, we found significant consistency across lines, organoids, and protocols in our assessment of neuron maturation based on gene expression (Extended Data Fig. 8). To determine if this difference in maturation timing is specific to humans, we generated cerebral organoids from macaque ESCs and analyzed 2 and 4 month organoids using single-cell transcriptomics (Extended Data Fig. 9). We found that upper and deep layer neurons diverge as early as 2 months, and that neurons mature over an even shorter time frame than in chimpanzees. Also, more upper layer neurons were detected in fetal macaque brain tissues compared with fetal human brain tissues with similar ages (Extended Data Fig. 8). This is consistent with expectations from previous reports comparing human and macaque brain development in vivo14 and in vitro 2D cultures18, 31. Together, this data suggests that delayed maturation of the human brain11, 29, 32, 33 can be traced back to very early stages of brain development.
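The pseudotime alignment between species can be sketched with classic dynamic time warping. This is a minimal illustration that assumes expression profiles already ordered by pseudotime and Euclidean distance between profiles; it forces both end points to match, whereas the observation that late chimpanzee time points had no human counterpart suggests the analysis above used a more permissive variant. The name `dtw_align` is hypothetical.

```python
import numpy as np

def dtw_align(a, b):
    """Align two pseudotime-ordered series of expression profiles.

    a : n x genes matrix (e.g. human pseudocells ordered by pseudotime)
    b : m x genes matrix (e.g. chimpanzee pseudocells ordered by pseudotime)
    Returns (list of matched (i, j) index pairs, total alignment cost).
    """
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1],  # advance both series
                cost[i - 1, j],      # advance a only
                cost[i, j - 1],      # advance b only
            )
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        move = int(np.argmin((cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])))
        if move == 0:
            i, j = i - 1, j - 1
        elif move == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path, cost[n, m]
```

When one series passes through the same states with more sampled points per state, several of its indices map to a single index of the other series, which is how a difference in developmental pace shows up in the warping path.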
We next wanted to detect gene expression changes in the developing cortex that have occurred since humans diverged from chimpanzees (Fig. 4a). We first inspected the expression patterns of human genes resulting from duplication or rearrangement that do not exist in other apes (Extended Data Fig. 10)34–37. We found that 22 of these 41 genes are detected in our human cerebral organoid data, and four of them (ARHGAP11B, FAM72B, FAM72C, FAM72D) are highly specific to G2M phase progenitors of the dorsal and ventral telencephalon (Extended Data Fig. 10). ARHGAP11B has previously been shown to regulate basal radial glia cell proliferation and self-renewal38, and our data highlight the specificity of its expression to a distinct phase of the cell cycle and show that expression is highly specific to RG progenitors along the time course of cortex development.
To identify quantitative gene expression differences between the primates, we first aligned all human, chimpanzee and macaque reads to a consensus genome and then aligned pseudotimes of dorsal telencephalon progenitor to early-born deep layer neuron trajectory between the species (Fig. 4b; Extended Data Fig. 11). We searched for genes that vary in expression along the pseudotime in each species, and find that 76.6% of these pseudotemporally dynamic genes have a conserved expression pattern. They represent ancestral gene regulatory programs that have been preserved in the primate developing cortex. We then searched for genes that are differentially expressed specifically on the human branch, and identified 98 genes, 96 of which clustered into seven different pseudotemporal patterns (Fig. 4c,d). Notably, clusters 1, 2, and 3 were enriched in human RGs, IPs, and neurons, respectively, and projections onto the entire human and chimpanzee cortical differentiation trajectory from pluripotency revealed specificity of differential expression to these cortical excitatory cell populations (Fig. 4e,f). Surprisingly, we observed that most of the human-specific deviations from chimpanzee and macaque were expression gains, rather than the loss of expression, in humans (Fig. 4g). This gain of expression skew was also observed for chimpanzee-specific changes relative to human and macaque (Extended Data Fig. 11). Our interpretation is that it could be more deleterious to lose a highly conserved gene expression pattern than it is to gain the expression of a new gene. Genes with gain of human-specific expression have no specific gene ontology enrichment, but are predicted to be involved in diverse cell biological processes including RG proliferation, neuron migration, neurite formation and are localized to different components of maturing neurons including axons, dendrites, and synapses (Fig. 4h). 
When comparing our results to previously published datasets generated using different single-cell RNA-seq methods (Fluidigm C1, Smart-seq2), as well as to human and macaque fetal data, we find strong overlap between the datasets (Extended Data Fig. 11). We also performed a similar comparison between human and chimp using cells of ventral telencephalon identity, and find 92 human-chimp DE genes, with 17% being distinct from what was observed in the cortex (Extended Data Fig. 12). Together, this analysis identifies human-specific gene expression changes that can be specific to certain cell states within the developing human forebrain.
To identify potential mechanisms that could underlie the human-specific gain and loss of gene expression in the cortex, we performed bulk and single-cell accessible chromatin profiling (scATAC-seq, Fluidigm C1) along the differentiation time course from pluripotency to 4 month organoids in human and chimpanzee (Fig. 5a). For the organoid time points, we enriched for dorsal telencephalon by using microdissected cortical regions as input for scATAC-seq (Extended Data Fig. 13). Aggregating the data from each single cell revealed a strong signal-to-noise ratio, and the organoid data was highly similar to bulk fetal brain DNase hypersensitivity39 (Fig. 5b), and overlapped forebrain regulatory regions (Extended Data Fig. 13). Accessible regions were scanned for transcription factor binding motifs and k-mers (7 nucleotides in length) to identify features that differ among cells and correlate with accessibility variation40. These features were used to visualize cell similarity in a two-dimensional t-SNE projection, which separated iPS, EB, neuroectodermal, neuroepithelial, cortical neural progenitor (NPC) and neuronal cells in both human and chimpanzee (Fig. 5c; Extended data Fig. 13). We then ordered cells in pseudotime using diffusion maps, allowing us to monitor transcription factor binding motifs and chromatin accessibility dynamically over the differentiation path from pluripotency to cortical neurons (Fig. 5d; Extended Data Fig. 13-14). In this way, we found that the majority of genes not expressed in cerebral organoids have inaccessible promoters in organoids (Extended Data Fig. 13).
We next searched for differential accessibility (DA) between human and chimpanzee cortical NPCs and neurons. We identified 8,099 peaks (7.4% of all accessible peaks) that gained accessibility in humans relative to chimpanzee, whereas 9,836 peaks (9% of all accessible peaks) lost accessibility (Fig. 5e). Some of these peaks (2,219, 12.4% of DA peaks) are DA in both NPCs and neurons, however most are specific to either NPCs (9,659, 53.8% of DA peaks) or neurons (6,057, 33.8% of DA peaks) and are enriched for various biological processes relative to all accessible organoid peaks (Fig. 5f). Notably, the majority of DA regions are specifically accessible in organoids relative to the earlier developmental stages (Fig. 5g) and many have been shown to drive reporter expression in the mouse developing forebrain (Extended Data Fig. 15)41. Consistent with other analyses of gene regulatory evolution42, 43, most DA peaks are located in intergenic or intronic non-protein coding regions of the genome (Fig. 5h). The majority of genes that are differentially expressed between human and chimpanzee along the dorsal telencephalon trajectory have one or more human-chimp DA peaks nearby (63% of differentially expressed protein-coding genes, Fig. 5h). We indeed found that genes with differential expression between human and chimpanzee were significantly more likely to have a nearby differentially accessible region than genes that are not differentially expressed between the species (Extended Data Fig. 15, Kolmogorov–Smirnov test, p<0.05). DA peaks are also significantly enriched for single nucleotide changes (SNCs) that are fixed in all humans and distinct from chimpanzee and other primates44 (Fig. 5i). Furthermore, these SNCs generate new or disrupt transcription factor binding sites for TFs that are expressed in organoids (Extended Data Fig. 15).
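The peak counts reported above can be cross-checked with a few lines of arithmetic. The sketch uses only the numbers given in the text; the implied total number of accessible peaks is derived here from the rounded percentages and is not a figure reported in the study.

```python
# Differentially accessible (DA) peaks, human relative to chimpanzee.
gained, lost = 8_099, 9_836
da_total = gained + lost

# Breakdown of the same DA peaks by the cell state in which they are DA.
by_state = {"NPC and neuron": 2_219, "NPC only": 9_659, "neuron only": 6_057}
assert sum(by_state.values()) == da_total  # the two breakdowns agree

shares = {state: 100 * n / da_total for state, n in by_state.items()}
for state, pct in shares.items():
    print(f"{state}: {pct:.1f}% of DA peaks")

# Total accessible peaks implied by the rounded percentages (7.4% and 9%).
implied_total_from_gain = gained / 0.074
implied_total_from_loss = lost / 0.09
print(round(implied_total_from_gain), round(implied_total_from_loss))
```

The two breakdowns sum to the same 17,935 DA peaks, the shares agree with the reported percentages to within rounding, and both percentages imply roughly 109,000 accessible peaks in total.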
We annotated organoid-specific peaks that are DA between humans and chimpanzees and lie near differentially expressed genes with various evolutionary signatures (Fig. 5j). This analysis identified potential regulatory regions that have human-derived fixed SNCs44, have undergone accelerated evolution in humans45–47, or overlap conserved regions that have been deleted in humans48. For instance, we identified 62 human accelerated regions that overlap DA peaks (32 in human DA peaks, 30 in chimp DA peaks), with one of these sites being near a gene with human-specific expression. In this case, the potential regulatory region is 244 Kb away from cadherin 7 (CDH7), a gene with higher expression specifically in human cortical neurons, and has increased accessibility in human neurons relative to chimpanzee and macaque (Fig. 5k). We also find DA regions near two genes, Ly6/PLAUR domain-containing protein 1 (LYPD1) and Ras-related C3 botulinum toxin substrate 1 (RAC1), that have human-specific expression in NPCs and neurons, respectively. LYPD1 is involved in neurotransmitter receptor binding and anxiety-related behaviors49, and RAC1 is a GTPase involved in diverse processes including glucose uptake and cytoskeletal reorganization; genetic variants in this gene can lead to micro- or macrocephaly50 (Extended Data Fig. 15). In addition, we identify 22 regions that are accessible in chimpanzee NPCs or neurons and highly conserved in mammals, but whose DNA has been deleted in humans (so-called human conserved deletions, hCONDELs)48; 1 of these is located near a DE gene (FADS1, Supplementary Table 10).
Finally, we wanted to know if the human-specific gene expression patterns observed in the developing brain were stage-specific or if they persist into adulthood. We generated single-nucleus RNA-Seq data from postmortem prefrontal cortex tissue of three human, chimpanzee/bonobo and macaque individuals (50,035 nuclei in human, 33,847 in chimp/bonobo and 50,403 in macaque). We obtained spatial information by isolating nuclei from sequential sections sliced from basal to apical positions, which allows us to link cell type-specific differences to cortical layering (Fig. 6a)9. By integrating the species using canonical correlation analysis and clustering51, we recover expected cell classes such as excitatory and inhibitory neurons, astrocytes, oligodendrocytes, microglia and endothelial cells (Fig. 6b-d). For our purposes, we focused on these broad cell classifications but note that subtypes could be more finely resolved and characterized (Extended Data Fig. 16). Different cell classes and cell types show distinct distributions along layers (Extended Data Fig. 16), which is consistent with previous reports9. Notably, we find that the portion of the transcriptome that is specific to neurons is more highly conserved based on sequence constraint than that of other cell types (Wilcoxon rank-sum test, P<0.001, Fig. 6e). Indeed, neuronal markers also show higher conservation than genes with higher expression in earlier pluripotent and progenitor states (Extended Data Fig. 16). Inhibitory neuron markers show slightly higher conservation than excitatory neuron markers in both adult brain and organoids (Fig. 6e and Extended Data Fig. 16), and inhibitory neurons in the adult cortex and organoid ventral telencephalon had fewer human-specific DE genes in comparison with excitatory neurons in the adult cortex and organoid dorsal telencephalon trajectory, respectively. We also find that astrocytes have slightly more differential expression than neurons or oligodendrocytes in adults (Fig. 6e,f; Extended Data Fig. 16). Together, these observations suggest different levels of evolutionary constraint on specific cell types in the cortex.
A substantial fraction of the genes expressed in dorsal and ventral telencephalic organoid neurons are also detected in excitatory and inhibitory neurons in the adult cortex, respectively (Fig. 6g,h). Notably, we find that 53 and 13 genes are commonly detected as DE in the organoid and adult excitatory and inhibitory neurons, respectively, compared to hundreds of genes that are specifically DE in the adult stage (Fig. 6i). Genes with human-chimp DE detected ubiquitously in the organoid dorsal telencephalon show stronger consistency with DE in adult excitatory neurons than genes with DE that is specific to either organoid NPCs or neurons, with NPC-specific DE genes having the weakest consistency in adult (Fig. 6j). In addition, DE genes restricted to organoids or adult show higher expression levels at the stage where DE is detected (Extended Data Fig. 16). There are interesting examples of genes with human-specific DE in adult cell classes (Fig. 6k,l), including genes that are DE in developing and adult neurons, such as COL6A1 which has been shown to have a protective role limiting autophagy and apoptosis in aging neurons52 and RIC3 which regulates the number and maturation of acetylcholine-gated ion channels in neurons53. We also find genes with human-specific DE in excitatory neurons showing significant layer specificity (Fig. 6l), suggesting their functions in specific subpopulations of cells at specific layer structures. Together, these analyses suggest that, with some exceptions, cortical cell type-specific transcriptome differences between human and chimpanzee are dynamic and linked to developmental stages.
To summarize, we identified patterns of dynamic gene expression and chromatin accessibility differences between human and chimpanzee cerebral organoid development from pluripotency through neuroepithelium, into multiple regions of the ape brain. We provide strong evidence that despite differences in brain region composition, gene expression patterns in the organoid forebrain are largely reproducible across iPSC lines from different individuals. We find that delayed maturation of the human brain begins during the very early stages of brain development. Moreover, we resolve differential gene expression to dynamic cell states along the ontogenetic path from pluripotency to cortical neurons, and identify regulatory regions that could underlie human-specific innovations in gene expression. Finally, we map human-specific gene expression to cell types in the prefrontal cortex, and identify gene expression patterns that are specific to the adult brain, as well as patterns that can already be detected during development. The data generated in this study are available for exploration via a public interactive browser (https://bioinf.eva.mpg.de/shiny/sample-apps/scApeX/). Taken together, these data illuminate features of individual cell states that are uniquely human, and provide an extensive resource to guide exploration into the gene regulatory mechanisms that distinguish the developing human and chimpanzee brains.
AUTHOR CONTRIBUTIONS
SK, MB grew organoids with assistance from AW, LS, MH. SK performed scRNA-seq and snRNA-Seq with assistance from MS. MB performed scATAC-seq. ZH, MB, and SK analyzed the data. FSC, MH performed immunohistochemical stainings. JF compared organoid scRNA-seq data to mouse voxel maps. PG dissected and sliced tissue for snRNA-Seq. DH and ZQ performed bulk RNA-Seq of adult tissue. SK, MB, ZH, BT, JGC designed the study, and wrote the manuscript with support from PK, WH, SP.
AUTHOR INFORMATION
Conflict of interest: The authors declare no conflicts of interest.
ACCESSION CODE
The single-cell RNA-seq data is being deposited to EMBL-EBI ArrayExpress with the accession number E-MTAB-7552.
METHODS
Pluripotent stem cell lines and organoid culture
We acquired 6 human induced pluripotent stem cell (iPSC) lines (Sojd3, Hoik1, Kucg2, Wibj2 from the HipSci resource54; h409b2 from the RIKEN BRC cell bank17; Sc102a1 from System Biosciences), one human ES cell line (H9, WiCell)55, three chimpanzee iPSC lines (SandraA19; PR818-519, originally generated by the Gage lab and kindly provided to us by the R. Livesey group; JoC, generated in this study), one bonobo iPSC line (Bokela, generated in this study) and one macaque ES cell line (MN118, kindly provided through the R. Livesey group from Eliza Curnow). The iPSC line JoC (chimpanzee, Tchimpounga Sanctuary) was reprogrammed from blood cells (primary lymphocytes) using plasmid-based reprogramming56 and Bokela (bonobo, Zoo Leipzig) was reprogrammed from fibroblasts using the StemMACS mRNA transfection kit (Miltenyi Biotec). Cell lines were validated for pluripotency markers by immunohistochemical stainings using the Human Pluripotent Stem Cell 3-Color Immunohistochemistry Kit (R&D Systems, SC021) and were differentiated into the three different germ layers using the Human Pluripotent Stem Cell Functional Identification kit (R&D Systems) and StemMACS Trilineage Differentiation Kit (Miltenyi Biotec). Karyotyping was carried out using Giemsa banding at the Stem Cell Engineering facility, a core facility of CMCB at Technische Universität Dresden, and karyotypes were found to be normal. Cell lines were cultivated using standard feeder-free conditions in mTeSR1 (StemCell Technologies) and StemMACS iPS-Brew XF (Miltenyi Biotec) on Matrigel-coated plates and differentiated into cerebral organoids using a whole organoid differentiation protocol (Lancaster et al. 2014). iPS-Brew was used for cultivation of macaque ESCs as well as for EB generation during organoid differentiation for these batches (Supplementary Table 1). Cell lines were regularly tested for mycoplasma using PCR validation (Venor GeM Classic, Minerva Biolabs) and found to be negative.
Single-cell RNA-seq data generation
A summary of all single-cell experiments can be found in Supplementary Table 1. For organoid experiments (1 month, 2 months, 3 months, 4 months), whole organoids were dissociated for generating single-cell gene expression libraries. Briefly, organoids were transferred to HBSS (without Ca2+ and Mg2+, -/-) and cut into two pieces to clear away debris from the center of the organoid (2-3 washes in total). Organoid pieces were then dissociated using the Papain-based Neural dissociation kit (P) (Miltenyi Biotec). Organoid pieces were incubated in Papain at 37 °C (enzyme mix 1) for an initial 15 min, followed by addition of Enzyme A (enzyme mix 2) to the Papain mix. Organoid pieces were then triturated using wide-bore 1000 µl tips and incubated for additional intervals of 5-10 min with triturations between the incubation steps, amounting to a total Papain incubation time of approximately 45 min. Cells were filtered through a 30 μm strainer and washed, centrifuged for 5 min at 300xg and washed 3 times with HBSS (-/-). Cells were then analyzed using Trypan Blue assay, counted using the automated cell counter Countess (Thermo Fisher), and diluted to an appropriate concentration to obtain approximately 6000 cells per lane of a 10X microfluidic chip device. Typically, cells from one organoid were loaded per lane in the microfluidic device, and in some cases organoids from different lines were pooled onto the same lane and demultiplexed based on single-nucleotide polymorphisms. For 1 month organoids, three pooled 409b2 organoids and one H9 organoid were dissociated and cells from the two cell lines were mixed at equal ratios to be loaded on the chip. For a set of 2 month HipSci organoid data, organoids were dissociated for all four HipSci cell lines and pooled at equal ratios to be loaded on one lane of the microfluidic device aiming for 10k cells. 
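The SNP-based demultiplexing of pooled lanes mentioned above can be sketched as a likelihood comparison. This is an illustrative model, not the procedure used in this study: it assumes known genotypes for each pooled line, per-cell counts of reads supporting the reference and alternate allele at each SNP, and a fixed error rate. All names and parameters are hypothetical.

```python
import numpy as np

def assign_cells_to_lines(alt_counts, ref_counts, genotypes, error=0.01):
    """Assign each cell to the pooled line that best explains its SNP reads.

    alt_counts, ref_counts : cells x SNPs read counts per allele
    genotypes              : lines x SNPs alternate-allele dosage (0, 1 or 2)
    error                  : assumed per-read error rate (hypothetical value)
    Returns (index of the most likely line per cell, cells x lines log-likelihoods).
    """
    # Expected alternate-allele read fraction for each line at each SNP,
    # kept away from 0 and 1 so that a single miscalled read is not fatal.
    p_alt = np.clip(genotypes / 2.0, error, 1.0 - error)
    # Binomial log-likelihood summed over SNPs; terms that do not depend
    # on the line are dropped because they do not change the argmax.
    loglik = alt_counts @ np.log(p_alt).T + ref_counts @ np.log(1.0 - p_alt).T
    return loglik.argmax(axis=1), loglik
```

In a setting like this, cells with few informative reads would have nearly equal likelihoods across lines, so a real workflow would usually also set a minimum margin before accepting an assignment.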
Fluidigm C1 data (Supplementary Table 1) were generated as previously described19. For chimpanzee SandraA 75d organoids, regions were microdissected from vibratome slices and single-cell suspensions were generated from them as described above. Single cells were then sorted into 96-well plates using a FACS Aria III sorter and further processed using the SmartSeq2 protocol57 to generate cDNA and the NexteraXT kit (Illumina) to generate sequencing libraries. All libraries (10X and Fluidigm C1/SmartSeq2) were sequenced on Illumina’s HiSeq2500 platform in paired-end mode (100 bp for Fluidigm C1/SmartSeq2; 26+8 bp, 100 bp for 10X).
Early stages of organoid differentiation (iPS cells to neuroepithelium)
For iPSC/ESC single-cell experiments, cells were detached from cell culture dishes using TrypLExpress (Thermo Fisher) incubation for 5 min. followed by addition of mTeSR1. Cells were centrifuged for 5 min. at 200xg and resuspended in mTeSR1, filtered through a 20 μm strainer and washed with mTeSR1. Cells were then centrifuged again for 5 min. at 200xg and resuspended in mTeSR1, counted, diluted to the same concentration and mixed at equal ratios for the three cell lines to be loaded on the 10X microfluidic chip aiming for 10,000 cells. Thirty embryoid bodies (EBs), 15 neuroectoderms, and 1-3 neuroepithelium of each cell line were pooled for each dissociation, respectively. Cells were obtained by papain dissociation as described above for organoid dissociation, with slightly shorter incubation times in enzyme mix 1 (approximately 30 min.). For 10X experiments, cells from the three different cell lines were diluted and mixed at equal ratios to be loaded on the microfluidic chip device.
Single-cell experiments were conducted using the 10X Chromium Single Cell 3’ v2 Kit following the manufacturer’s instructions. Briefly, cells mixed with reverse transcription mix were loaded on the chip device together with gel beads and oil to be coencapsulated into droplets, in which first strand cDNA synthesis tagged mRNAs with a unique molecular identifier (UMI) and a unique cell barcode. All following steps were conducted in bulk by breaking the droplets and cleaning up and amplifying the cDNA. Single-cell libraries were then constructed by fragmentation, end repair, adapter ligation and amplification using library-specific index sequences as provided by 10X Genomics. Quantification and quality control of libraries were performed using High Sensitivity DNA assays for Agilent’s Bioanalyzer, and libraries were sequenced on a HiSeq2500 in Rapid or HighOutput sequencing mode. Typically, one 10X library was sequenced on one lane of a sequencing flow cell, with the exception of the HipSci organoids for which three pooled libraries (each library contained pooled cells from four dissociated HipSci organoids from different cell lines) were sequenced on two lanes of a flow cell. See Table S1 for more details.
Immunohistochemistry
Organoids were washed in PBS prior to fixing in 4% PFA for 2-4 hours (h). The excess of fixative was removed with three PBS washes and organoids were then transferred to a 30% sucrose solution for 24-48 h for cryoprotection. Finally, organoids were transferred to plastic cryomolds (Tissue Tek) and embedded in OCT compound 4583 (Tissue Tek) for snap-freezing on dry ice. For immunohistochemical stainings, organoids were sectioned in slices of 20 µm thickness using a Leica CM3050 S cryostat and Microm HM 560 (Thermo Fisher Scientific) at -15 to -20°C. Organoid sections were quickly washed in PBS to remove any residual OCT. Then, sections were incubated in antigen retrieval solution (HistoVT One, Nacalai Tesque) at 70°C for 20 min. Excess solution was washed away with PBS and the tissue was incubated in blocking-permeabilizing solution (0.3% Triton, 0.2% Tween-20 and 5% Normal Goat Serum in PBS) for 1h at room temperature. Afterwards, sections were incubated overnight at 4°C in blocking-permeabilizing solution containing antibodies anti-PAX6 (mouse, 1:1000, Thermo Fisher Scientific, MA1-109; rabbit, 1:300, Covance, PRB-278P) and anti-CTIP2 (rat, 1:1000, Abcam, AB18465), anti-SATB2 (rabbit, 1:500, Abcam, Ab92446; mouse, 1:500, Abcam, Ab51502), anti-Tbr2 (mouse, 1:500, MPI-CBG Antibody Facility35). 
On the next day, sections were rinsed three times in PBS before incubation for 1h at room temperature in secondary antibody solution, which contained blocking-permeabilizing solution, DAPI (1:3000), Alexa Fluor 488-conjugated anti-rabbit antibody (goat, 1:1000, Thermo Fisher, A11008), Alexa Fluor 546-conjugated anti-mouse antibody (goat, 1:500, Thermo Fisher Scientific, A-21123), Alexa Fluor 647-conjugated anti-rat antibody (goat, 1:500, Thermo Fisher Scientific, A-21247) and Alexa Fluor 488-conjugated anti-mouse (A21202) and anti-rat antibody (A21208), Alexa Fluor 555-conjugated anti-rabbit antibody (A31572), Alexa Fluor 647-conjugated anti-mouse antibody (A31571) (all donkey-derived, 1:500, Molecular Probes). Finally, remainders of secondary antibody solution were washed off with PBS before covering with ProLong Gold Antifade Mountant medium (Thermo Fisher Scientific). Stained organoid cryosections were imaged using a confocal laser scanning Olympus Fluoview FV1200 microscope and Zeiss LSM 880 Airy upright microscope. Whole-section tilescans composed of 3 different z-plane images (z-step = 5-8 µm) were acquired using a 10X magnification objective, Plan-Apochromat 10x/0.45 M27 and Plan-Apochromat 20x/0.8 M27 objectives. Images were then stitched, stacked and further processed using the Olympus Fluoview 4.2b software and ImageJ (Fiji).
Single cell RNA-seq data preprocessing
We used Cell Ranger, the set of analysis pipelines suggested by 10X Genomics, to demultiplex raw base call files to FASTQ files and align reads to the human genome and transcriptome (hg38, provided by 10X Genomics) with the default alignment parameters. Pooled samples, including samples from different species or human lines, were then demultiplexed using a two-step procedure based on the read mapping results. In the first step, the genome alignment between human (hg38) and chimpanzee (panTro5) was downloaded from UCSC Genome Browser. Sites with diverged bases between human and chimpanzee were obtained based on the genome alignment. Reads covering the species-diverged sites were collected for each reported cell, and the number of bases matching each species was counted. Cells in which more than 80% of the reads covering species-diverged sites matched one species were assigned to that species. For those samples with human cells from different lines pooled, a second step of demultiplexing was done using demuxlet58, based on the genotyping information of lines downloaded from HipSci websites (Kucg2, Wibj2, Hoik1, Sojd3) or called using bcftools based on the unpooled scRNA-seq data (H9, 409b2). Cells whose best singlet likelihood was at least 50 higher than the second best singlet likelihood and whose estimated mixture ratio was less than 30% were labeled as their best-matched lines. All cells failing to pass any of the above thresholds were classified as doublets and excluded from the following analysis.
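The two thresholding steps can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function names, the input format (per-cell base counts and a dictionary of singlet log-likelihoods) and the choice to ignore bases matching neither species are assumptions made for illustration only.

```python
def assign_species(human_matches, chimp_matches, min_fraction=0.8):
    """Assign a cell to a species from base counts at species-diverged sites.

    Assumption: only bases matching one of the two species are counted.
    Returns "human", "chimpanzee", or None if neither exceeds the threshold.
    """
    total = human_matches + chimp_matches
    if total == 0:
        return None  # no informative reads for this cell
    if human_matches / total > min_fraction:
        return "human"
    if chimp_matches / total > min_fraction:
        return "chimpanzee"
    return None


def assign_line(singlet_loglik, mixture_ratio, min_gap=50.0, max_mixture=0.3):
    """Assign a cell to its best-matched line, or None (treated as a doublet).

    singlet_loglik: hypothetical dict mapping line name -> singlet likelihood
    score; mixture_ratio: estimated mixture ratio for the cell.
    """
    ranked = sorted(singlet_loglik.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return None
    (best_line, best), (_, second) = ranked[0], ranked[1]
    # "no less than 50 higher" -> >= ; "less than 30%" -> strict <
    if best - second >= min_gap and mixture_ratio < max_mixture:
        return best_line
    return None
```

Cells for which either function returns None would correspond to the cells excluded as doublets in the text.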
Seurat59 was then applied for further data processing. Cells with more than 6,000 or fewer than 200 detected genes, as well as those with a mitochondrial transcript proportion higher than 5%, were excluded. After log-normalization, confounding factors including the number of detected genes and the proportion of mitochondrial transcripts were regressed out. Highly variable genes were then obtained as genes with dispersion higher than 0.5 and normalized expression level between 0.0125 and 3, followed by principal component analysis (PCA) based on the z-transformed expression levels of the identified highly variable genes (Supplementary Table 2). The top 20 PCs were used for clustering with Seurat. Additional quality controls of the measured cells were based on primary cell type predictions using public human fetal brain scRNA-seq data (Nowakowski data set)27. In brief, a Lasso logistic regression model was built, using gene expression ranks of the Nowakowski data set as the training set, to predict the primary cell type identity of each single cell in two-month-old and four-month-old organoids. Cells which were predicted to be of ‘glycolysis’ identity were excluded, as were cells in Seurat clusters where more than 80% of cells were predicted to be of ‘glycolysis’ identity. Heterogeneity analysis of human (Extended Data Figure 1, Extended Data Figure 2, Supplementary Tables 3 and 4) and chimpanzee (Extended Data Figure 6, Supplementary Tables 5 and 6) full lineage data was performed using t-distributed stochastic neighbor embedding (t-SNE) based on the top principal components identified (top 20 PCs for human, top 15 PCs for chimpanzee). Cluster identities were assigned based on cluster gene markers (Supplementary Table 6) as determined by the FindAllMarkers function in Seurat (min percentage of cells expressed = 0.25 and log fold change threshold = 0.25) and gene expression of known marker genes.
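As a rough illustration of the cell-level filters described above, the following Python sketch applies the gene-count and mitochondrial thresholds and the ‘glycolysis’ exclusion rule. The function names and input layout are hypothetical, and the actual analysis was performed in Seurat.

```python
from collections import defaultdict


def passes_qc(n_genes, mito_fraction, min_genes=200, max_genes=6000, max_mito=0.05):
    """Keep cells with 200-6,000 detected genes and at most 5% mitochondrial transcripts."""
    return min_genes <= n_genes <= max_genes and mito_fraction <= max_mito


def glycolysis_filter(clusters, predicted, max_cluster_fraction=0.8):
    """Return one boolean per cell, True for cells to keep.

    A cell is dropped if it is predicted as 'glycolysis', or if it sits in a
    cluster where more than 80% of cells are predicted as 'glycolysis'.
    """
    total = defaultdict(int)
    glyco = defaultdict(int)
    for cluster, label in zip(clusters, predicted):
        total[cluster] += 1
        if label == "glycolysis":
            glyco[cluster] += 1
    bad_clusters = {c for c in total if glyco[c] / total[c] > max_cluster_fraction}
    return [
        label != "glycolysis" and cluster not in bad_clusters
        for cluster, label in zip(clusters, predicted)
    ]
```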
For human data, cells from 409b2 and H9 were integrated using canonical correlation analysis (CCA) as implemented in Seurat (v3). Briefly, data were normalized and the top 2000 highly variable genes for 409b2 and H9 cells were determined using the vst method. The datasets were integrated based on the top 20 CCs using the Seurat method by identifying anchors and integrating the datasets. The resulting integrated data were scaled and principal component analysis was performed. Clustering was performed based on the top 20 PCs and using a resolution of 0.6. Feature plots show non-integrated expression values. Cluster markers were determined using the Wilcoxon test, considering only genes that show a minimum log fold expression change of 0.25 in at least a fraction of 0.25 of cells in the clusters, using the non-integrated expression values.
Reference similarity spectrum (RSS) and construction of pseudocell transcriptomes
The reference similarity spectrum (RSS) of one cell to the Human Developing Brain (HDB) atlas was defined as the normalized similarity between gene expression levels of the cell and gene expression levels of each of the 237 fetal samples with RNA-seq data in the BrainSpan database in Allen Brain Atlas (Extended Data Fig. 3). To increase discrimination of different reference samples, only the highly variable genes of the HDB data set (see Supplementary Table 2), defined based on expression variation-mean comparison of the reference data set, were used for the RSS calculation. Between each cell and each sample in the HDB data set, the Pearson correlation coefficient (PCC) was calculated across the HDB highly variable genes. Z-transformation was then applied to the PCCs between each cell and the 237 fetal HDB samples to obtain the normalized similarities.
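The RSS calculation can be sketched in a few lines of dependency-free Python. This is an illustrative reading of the definition above, not the original implementation. The function names are hypothetical, and the use of the sample standard deviation (n-1 denominator) in the z-transformation is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def reference_similarity_spectrum(cell_expr, reference_exprs):
    """z-transformed PCCs of one cell against every reference sample.

    cell_expr: expression of the cell across the reference highly variable genes.
    reference_exprs: one expression vector per reference sample, same gene order.
    """
    pccs = [pearson(cell_expr, ref) for ref in reference_exprs]
    mean = sum(pccs) / len(pccs)
    # Assumption: sample standard deviation, as in R's scale().
    sd = math.sqrt(sum((p - mean) ** 2 for p in pccs) / (len(pccs) - 1))
    return [(p - mean) / sd for p in pccs]
```

In the setting described above, `reference_exprs` would hold the 237 fetal BrainSpan samples, so each cell receives a spectrum of length 237.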
To construct pseudocells, single cells were firstly grouped based on their sample sources and Seurat clusters. Within each group of cells, i.e. those cells from the same sample and in the same Seurat cluster, cells were selected randomly with a selection probability of 20%. The selected cells were called pseudocell seeds or territory capitals (Extended Data Fig. 3). The ten nearest neighbors (NN) of each seed, based on Euclidean distances of the top-20 PCs, were then assigned to the seed, forming a pseudocell territory. If one cell was assigned to multiple pseudocell territories, one territory was chosen randomly. The expression level of one gene in each pseudocell was then calculated as the average gene expression level across cells in the pseudocell territory.
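The pseudocell construction can be sketched as follows. This is a simplified Python illustration under stated assumptions: neighbors are searched only within the seed's own sample-and-cluster group, seeds are not absorbed into other territories, and cells not claimed by any seed are left out. None of these details are specified in the text, and the function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def build_pseudocells(pcs, groups, seed_prob=0.2, k=10, rng=None):
    """Return {seed index: [member cell indices]} (seed listed first).

    pcs: per-cell coordinates in PC space; groups: per-cell (sample, cluster) label.
    """
    rng = rng or random.Random(0)
    members_by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        members_by_group[group].append(idx)

    seeds = []
    claims = defaultdict(list)  # non-seed cell -> seeds whose k-NN include it
    for members in members_by_group.values():
        group_seeds = [i for i in members if rng.random() < seed_prob]
        seeds.extend(group_seeds)
        seed_set = set(group_seeds)
        non_seeds = [i for i in members if i not in seed_set]
        for s in group_seeds:
            nearest = sorted(non_seeds, key=lambda i: math.dist(pcs[i], pcs[s]))[:k]
            for i in nearest:
                claims[i].append(s)

    territories = {s: [s] for s in seeds}
    for cell, claiming_seeds in claims.items():
        # A cell claimed by several territories is given to one at random.
        territories[rng.choice(claiming_seeds)].append(cell)
    return territories


def average_expression(expr, territories):
    """Pseudocell expression = mean expression over the cells of each territory."""
    n_genes = len(expr[0])
    return {
        seed: [sum(expr[i][g] for i in cells) / len(cells) for g in range(n_genes)]
        for seed, cells in territories.items()
    }
```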
Visualization, lineage identification and pseudotime estimation of pseudocells for reconstructing human cerebral organoid differentiation from pluripotency
First, PCA was applied to a pseudocell expression matrix using the z-transformed expression levels of the highly variable genes as input. Euclidean distance between the top 10 PCs of each pair of pseudocells was calculated and a kNN-network (k=100) was then calculated with the constraint to only consider pseudocells from the same or nearby stages when screening for nearest neighbors. The kNN-network was visualized using SPRING25. To construct the pseudotime course of human cerebral organoid differentiation from pluripotency, the Walktrap community identification algorithm (implemented in the R package igraph) was applied to the above kNN-network to identify network communities. The resulting communities were manually aggregated into four groups to minimize branches in each group. A diffusion map algorithm (implemented in R package destiny60) was applied to pseudocells in each of the four groups, with the expression levels of the highly variable genes of pseudocells as the input. The ranks in DC1 were used as the pseudotimes. We used an F-test based ANOVA analysis to identify genes with pseudotime-dependent expression patterns. In brief, we established a natural splined linear regression model (ns function in the R package splines) with six degrees of freedom (df), with expression levels as the response variable and pseudotimes as the independent variable, for each of the highly variable genes. An F-test was applied to compare the variation explained by the splined linear model with that of the residuals, normalized by degrees of freedom. Bonferroni correction was performed across tested genes, with a corrected p-value threshold of 0.01 to identify genes with pseudotime-dependent expression. The analysis was applied to the four groups of pseudocells separately.
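The stage-constrained nearest-neighbor search can be sketched as below. This is a minimal Python illustration, assuming that stages are encoded as consecutive integers and that "nearby" means a stage difference of at most one. Both are assumptions, as the text does not define them, and the function name is hypothetical.

```python
import math


def stage_constrained_knn(pcs, stages, k=100, max_stage_gap=1):
    """Return {pseudocell index: list of up to k nearest neighbor indices}.

    pcs: per-pseudocell coordinates (e.g. top 10 PCs).
    stages: per-pseudocell integer stage index (assumed encoding).
    Only pseudocells from the same or an adjacent stage are eligible neighbors.
    """
    neighbors = {}
    for i, (point, stage) in enumerate(zip(pcs, stages)):
        candidates = [
            j
            for j, other_stage in enumerate(stages)
            if j != i and abs(other_stage - stage) <= max_stage_gap
        ]
        candidates.sort(key=lambda j: math.dist(point, pcs[j]))
        neighbors[i] = candidates[:k]
    return neighbors
```

The resulting neighbor lists define the edges of the kNN-network that is then passed to community detection and visualization.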
Visualization, lineage identification, pseudotime estimation of cells in human two-month-old cerebral organoids from different individuals
Pseudocells were constructed for the human two-month-old organoids as above with constraints on samples and based on cells with predicted primary cell types as one of radial glia, intermediate progenitor (IPC), excitatory neuron, and inhibitory neuron. RSS to the BrainSpan fetal samples was calculated for each pseudocell, with the distance between two pseudocells defined as the correlation distance between the RSS of the two pseudocells. The kNN-network (k=20) was then constructed and SPRING was used to determine coordinates of pseudocells for visualization. To further discriminate pseudocells representing different neuronal lineages, a Walktrap algorithm for network community identification was applied to the RSS-based kNN-network (k=100). Communities that were significantly connected and showed concordant marker expression or similarity spectrum were aggregated, which resulted in three progenitor-to-neuron trajectories. Based on gene expression level ranks of cells in the three defined trajectories, two Lasso logistic regression models were trained, one for classification of the cortical and ventral lineages, and the other for classification of all three trajectories. The first model was applied to pseudocells in the community C6 which was significantly connected to cortical and ventral trajectories, while the second model was applied to pseudocells in the community C4, which was significantly connected with both the non-telencephalon pseudocells and community C4. With a unique lineage label defined for each pseudocell, a 1*500 self-organizing-map (SOM) model was trained for each of the three trajectories, using RSS of pseudocells within the lineage as the training data. The index of the SOM neuron that a pseudocell was assigned to was used as its pseudotime. Diffusion map analysis was also applied to pseudocells in each trajectory, with highly variable gene expression as the input, with ranks of DC1 defined as the alternative pseudotime of pseudocells.
Pseudotimes obtained by the two methods are highly correlated (Spearman correlation is 0.91 and 0.92 for the dorsal and ventral telencephalon trajectories, respectively).
To project the single-cell data to the cell embedding space that was defined for pseudocells, two support vector regression (SVR) models (implemented in the R package e1071), one for each dimension of the embedding, were trained using the RSS of pseudocells as the training set. The trained models were applied to the RSS of single cells to predict their coordinates. These coordinates were further refined by moving each cell 70% closer to its nearest pseudocell, i.e. the pseudocell with the smallest RSS correlation distance to the cell.
Similarly, a support vector machine (SVM) model was trained (implemented in the R package e1071) using RSS of pseudocells for the three trajectories, and applied to RSS of single cells for their lineage identity. After that, the corresponding SOM model for pseudotime estimation was applied to RSS of each single cell for its estimated pseudotime.
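The coordinate refinement applied after the SVR projection (moving each cell 70% of the way toward its nearest pseudocell) can be illustrated with a short Python sketch. The function name and inputs are hypothetical, and the RSS correlation distances are assumed to be precomputed.

```python
def refine_projection(cell_xy, pseudocell_xy, rss_distances, fraction=0.7):
    """Move a projected cell toward its nearest pseudocell.

    cell_xy: predicted embedding coordinates of the cell.
    pseudocell_xy: embedding coordinates of all pseudocells.
    rss_distances: RSS correlation distance from the cell to each pseudocell.
    fraction: share of the gap to close (0.7 = 70% closer).
    """
    nearest = min(range(len(pseudocell_xy)), key=lambda i: rss_distances[i])
    target = pseudocell_xy[nearest]
    return tuple(c + fraction * (t - c) for c, t in zip(cell_xy, target))
```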
Dynamic time warping (dtw)-based alignment of pseudotime courses
We used a dynamic time warping algorithm to align different pseudotime courses. In brief, each pseudotime course was evenly divided into 50 blocks. Average gene expression levels of pseudocells or cells within each block were calculated. Pairwise distances between blocks from the two courses were calculated as the Pearson correlation distance, i.e. 1-PCC, across the highly variable genes in cells of both pseudotime courses. Suppose di,j represents the distance between the i-th block in the reference pseudotime course and the j-th block in the query pseudotime course. We defined D as the alignment distance matrix, where
A trace-back procedure was then performed to get the alignment. Three modes of alignment were implemented. In the first mode, the ‘fixed-end’ alignment, the initialization of D was done as:
In the other two modes, the ‘fixed-start’ and ‘end-to-end’ alignments, D was initialized as:
In the trace-back step, the ‘fixed-end’ and ‘end-to-end’ alignments were started from DM,N, where M and N are the numbers of blocks in the reference and query pseudotime courses, respectively. The ‘fixed-start’ alignment, on the other hand, was started from Dm,N, where m = argmini(Di,N). In our study, the ‘fixed-end’ alignment was used to align the cortical and ventral lineage pseudotime courses of human organoid cells; the ‘fixed-start’ alignment was used to align pseudotime courses of human and chimpanzee cortical pseudocells; the ‘end-to-end’ alignment was used in the truncated alignment of pseudotime courses of different species.
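The three alignment modes can be sketched in Python as follows. The recurrence and initialization formulas are not reproduced in this text, so the sketch assumes the standard dynamic time warping recurrence D[i][j] = d[i][j] + min(D[i-1][j], D[i][j-1], D[i-1][j-1]). It assumes cumulative-sum initialization of the first row and column for the ‘fixed-start’ and ‘end-to-end’ modes, and a free start in the reference (D[i][0] = d[i][0]) for the ‘fixed-end’ mode. These choices are consistent with the trace-back description above but may differ from the original definitions. The function name is hypothetical.

```python
def dtw_align(d, mode="end-to-end"):
    """Align a query course to a reference course from a block distance matrix.

    d[i][j]: distance between reference block i and query block j (0-based).
    Returns (path, D) where path is a list of (reference, query) block pairs.
    """
    if mode not in ("fixed-end", "fixed-start", "end-to-end"):
        raise ValueError("unknown alignment mode: %s" % mode)
    M, N = len(d), len(d[0])
    D = [[0.0] * N for _ in range(M)]

    # Initialization (assumed, see lead-in).
    D[0][0] = d[0][0]
    for j in range(1, N):
        D[0][j] = D[0][j - 1] + d[0][j]
    for i in range(1, M):
        if mode == "fixed-end":
            D[i][0] = d[i][0]  # query start may match any reference block
        else:
            D[i][0] = D[i - 1][0] + d[i][0]

    # Standard DTW recurrence (assumed).
    for i in range(1, M):
        for j in range(1, N):
            D[i][j] = d[i][j] + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])

    # Trace-back start: D[M][N] for 'fixed-end' and 'end-to-end',
    # D[m][N] with m = argmin_i D[i][N] for 'fixed-start'.
    if mode == "fixed-start":
        i = min(range(M), key=lambda r: D[r][N - 1])
    else:
        i = M - 1
    j = N - 1
    path = [(i, j)]
    while True:
        if j == 0 and (mode == "fixed-end" or i == 0):
            break
        if i == 0:
            j -= 1
        elif j == 0:
            i -= 1
        else:
            steps = [
                (D[i - 1][j - 1], i - 1, j - 1),
                (D[i - 1][j], i - 1, j),
                (D[i][j - 1], i, j - 1),
            ]
            _, i, j = min(steps)
        path.append((i, j))
    path.reverse()
    return path, D
```

In the setting described above, `d` would be the 50 x 50 matrix of Pearson correlation distances (1-PCC) between block-averaged expression profiles of the reference and query courses.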
Reconstruction of chimpanzee cerebral organoid differentiation from pluripotency
We applied a procedure similar to that described above for the reconstruction of human cerebral organoid differentiation from pluripotency to reconstruct the organoid differentiation trajectory from chimpanzee single-cell RNA-seq data. In brief, the single-cell RNA-seq reads were mapped to the human-chimpanzee-macaque consensus genome and counted using Cell Ranger. Seurat was used for further preprocessing including gene expression normalization, confounding factor regression, PCA and clustering. Cells from organoid samples with predicted primary cell type identity of ‘glycolysis’, as well as cells within clusters with more than 80% cells having ‘glycolysis’ identity, were excluded. Pseudocells were then constructed with a seed selection probability of 20% and constraints on samples and Seurat clusters. PCA was applied to expression levels of highly variable genes across pseudocells, and pairwise distances of pseudocells were calculated as the Euclidean distances between the top-10 PCs. The kNN network (k=100) of pseudocells was constructed, linking every pseudocell with its 100-nearest pseudocells representing the same or nearby stages. Three-month-old and four-month-old organoids were treated as the same stage. The Walktrap network community identification algorithm was applied and the resulting community labels (walktrap communities) of pseudocells were compared with the predicted community labels (projected communities) based on a Lasso logistic regression model trained on ranks of gene expression levels of the human pseudocells representing the human organoid differentiation from pluripotency as described above. Any walktrap community with < 1000 kNN connections with other communities was discarded. One of the four labels (early, cortical, ventral, non-telencephalon) was assigned to a walktrap community if more than 95% of pseudocells within the community were from the same group according to their projected communities.
For one community with more than 10% of pseudocells with projected communities belonging to both ventral and midbrain-hindbrain groups, the non-telencephalon label was only assigned to pseudocells with projected communities in the non-telencephalon group. The diffusion map algorithm was applied to each of the four pseudocell groups, using the expression levels of highly variable genes as input, to estimate their pseudotimes. For the cortical, ventral and midbrain-hindbrain groups, the ranks of DC1 were used as the pseudotimes. For the early group, a principal curve (implemented in the R package princurve) was fitted in the DC1-DC2 space. The order of pseudocells projecting to the resulting principal curve was used as the pseudotimes.
Human-chimpanzee-macaque consensus genome
The construction of the consensus genome was performed using the procedure as described9, 61. In brief, the chained and netted pairwise genome alignments of the human (hg38) and chimpanzee (panTro5) genomes, and the human and macaque (rheMac8) genomes, were downloaded from UCSC Genome Browser. Based on the downloaded pairwise genome alignments, a multiple genome alignment of human-chimpanzee-macaque was constructed using multiz. Based on the human-chimpanzee-macaque genome alignment, we constructed the consensus genome by masking all discordant sites including mismatches, insertion/deletion (indels), as well as 6-bp flanking regions of indels on the human genome. The obtained consensus genome was indexed with gene annotation in GENCODE v27 for read mapping to the consensus genome with Cell Ranger.
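The masking rule can be illustrated on a toy three-way alignment. The Python sketch below is not the procedure used to build the consensus genome, which operated on chained and netted genome alignments combined with multiz. It assumes pre-aligned, equal-length strings with "-" for gaps, reports the result in human coordinates, and uses "N" as the mask character. The function name and these conventions are assumptions made for illustration.

```python
def mask_consensus(human, chimp, macaque, flank=6, mask_char="N"):
    """Mask discordant sites of the human sequence in a three-way alignment.

    Masked positions (human coordinates): mismatches, human bases aligned to a
    gap in another species, and `flank` bases on each side of any indel.
    """
    if not (len(human) == len(chimp) == len(macaque)):
        raise ValueError("aligned sequences must have equal length")
    bases = []       # human bases, i.e. the sequence in human coordinates
    to_mask = set()  # human positions to mask (may include out-of-range values)
    for h, c, m in zip(human.upper(), chimp.upper(), macaque.upper()):
        pos = len(bases)  # human coordinate of the next human base
        if h == "-":
            # Insertion in another species: mask the flanks around the junction.
            if c != "-" or m != "-":
                to_mask.update(range(pos - flank, pos + flank))
            continue
        bases.append(h)
        if c == "-" or m == "-":
            # Deletion in another species: mask the base and its flanks.
            to_mask.update(range(pos - flank, pos + flank + 1))
        elif not (h == c == m):
            to_mask.add(pos)  # mismatch
    return "".join(mask_char if i in to_mask else b for i, b in enumerate(bases))
```

Reads from all three species mapped to such a masked sequence are then compared at concordant positions only, which is the motivation for the consensus genome.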
Pseudotime estimation of cerebral organoid cells in different species
Single cell RNA-seq data of organoids with ages from two-month-old to four-month-old in human, chimpanzee, and macaque were mapped to the human-chimpanzee-macaque consensus genome and counted using Cell Ranger. Further preprocessing using Seurat was applied separately for data from the three species. Only cells with predicted primary cell type identities as radial glia, intermediate progenitors, excitatory neurons, or inhibitory neurons were included in the later analysis. Pseudocells were constructed for humans and chimpanzees, both with a coarse grain ratio of 20% and constraints on samples and Seurat clusters. The RSS to the HDB data set was calculated for each pseudocell, and the SVM model for lineage estimation was applied to estimate the lineage identity of each pseudocell. Focusing on the cortical lineage, a diffusion map analysis was applied to cortical pseudocells of the three species, respectively. The ranks of DC1 were used as the pseudotimes of the pseudocells. In macaque, a similar procedure was applied directly to single cells without pseudocell construction.
Truncated dtw-based alignment of pseudotime courses representing neural progenitors and deeper layer neurons in different species
We used the first DC discriminating BCL11B+ and SATB2+ cortical neurons (DC3 in chimpanzee, DC4 in macaque) to identify upper layer (UL) neurons, as the pseudocells in the branch with the highest expression level of SATB2. To identify potential upper layer neurons in human, we first retrieved markers of upper and deeper layer (DL) excitatory neurons27. The summed expression levels of UL and DL markers were then calculated for each pseudocell in human and chimpanzee, with the UL-specificity score (sUL) being defined as the UL/DL markers expression ratio. The distribution of sUL in UL neurons in chimpanzee was used to determine the threshold to discriminate UL neurons from other cell types (sUL > 0). All UL neurons in the three species were excluded from the following analysis.
To correct for the DL neuron maturation timing differences between human and the other two species, a two-step pseudotime course alignment strategy was used. The first step, namely the trimming step, aims to determine the pseudotime points in chimpanzee and macaque which correspond to the latest pseudotime point in human. In brief, an SVR model with a Gaussian kernel was first constructed, with chimpanzee or macaque pseudotimes as the response variables and the RSS as the independent variables. Two models were trained with the chimpanzee pseudocells and macaque cells respectively, and applied to the human pseudocells to predict their corresponding chimpanzee and macaque pseudotime points. Two constrained B-splines regression models (FHC, FHM) were then fitted (implemented in the R package cobs): human pseudotimes of human pseudocells (th) versus their predicted chimpanzee or macaque pseudotimes, with constraints of FHC(th = 0) = FHM(th = 0) = 0. FHC(th=1) and FHM(th=1) were used as the pseudotime thresholds to select chimpanzee pseudocells and macaque cells. Chimpanzee DL neurons with pseudotime tC > FHC(th=1), as well as macaque DL neurons with pseudotime tM > FHM(th=1), were excluded from the following analysis. The second step, namely the alignment step, was then applied to the remaining pseudotime courses of the three species. An ‘end-to-end’ dtw-based alignment, as described above, was used to align the human pseudotimes with pseudotimes of each of the other two species using the human pseudotime course as the template.
Conserved developmental trajectories from NPCs to neurons in primates
Genes with pseudotime-dependent expression changes in organoids were identified in each of the three species, using the F-test based ANOVA analysis as described above. Those genes with significant pseudotime-dependent expression changes (BH-corrected P<0.05) in all the three species were defined as the genes with universal pseudotime-dependent expression changes, or pseudotime-dependently expressed genes. To estimate the similarities of the expression trajectories among the three species for those genes, Pearson’s correlation coefficient (PCC) was calculated for each gene between each pair of species, across its interpolated expression levels at 50 evenly distributed points along the aligned pseudotimes based on a natural spline regression model (df=6). To determine the threshold of a conserved trajectory, we performed 100 pseudotime permutations of pseudocells in the three species. Pairwise PCCs between species were calculated for each of the pseudotime-dependently expressed genes based on the randomized pseudotimes. Minimal PCC of each gene based on each permutation was obtained, and the PCC threshold was determined as the average of the second highest minimal PCC among permutations across all genes of interest. Pseudotime-dependently expressed genes with PCC higher than the threshold between any species pairs were defined as genes with conserved expression pseudotemporal patterns in primates.
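The permutation-based threshold can be sketched as below. This Python illustration assumes that the minimal PCC across species pairs has already been computed for every gene and every permutation, and it reads the text as taking, for each gene, the second highest of these minimal PCCs and then averaging across genes. The function name and input layout are hypothetical.

```python
def conservation_threshold(min_pcc_by_gene):
    """Average, across genes, of the second highest permuted minimal PCC.

    min_pcc_by_gene: dict mapping gene -> list of minimal PCCs (one per
    permutation, each being the minimum over the species pairs).
    """
    second_highest = [
        sorted(values, reverse=True)[1] for values in min_pcc_by_gene.values()
    ]
    return sum(second_highest) / len(second_highest)
```

With 100 permutations per gene, taking the second highest value corresponds roughly to the 99th percentile of the null distribution for that gene.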
Identification, clustering, and species specificity of differentially expressed genes between humans and chimpanzees
To compare transcriptome changes of the developmental trajectory from cortical neural progenitors to deeper layer neurons between human and chimpanzee, an F-test based comparison was applied to the expression profile along pseudotimes of the two species. In brief, for each gene, a natural spline linear regression model (df=6) was constructed for human and chimpanzee pseudocells along the aligned pseudotime course, without discriminating human and chimpanzee samples, and used as the null model (m0). The alternative natural spline linear regression model was also constructed, with each species having its own slopes and intercept (m1). The residuals of the variation, which cannot be explained by each model, were compared by an F test. Non-ribosomal genes with BH-corrected P<0.01 were identified as differentially expressed genes (DE genes) between human and chimpanzee along the developmental trajectory from cortical neural progenitors to deeper layer neurons (Supplementary Table 7).
To estimate the robustness of the identified differential expression (DE) to the number of lines used, as well as to the pseudocell distribution along the pseudotime course, we used a series of constrained pseudocell sampling procedures with replacement. In brief, in each round of sampling with replacement, the candidate pseudocells are restricted to those from a certain number of human cell lines. In addition, the subsampling of human pseudocells is performed to recapitulate the pseudocell distribution along the aligned pseudotime of chimpanzee pseudocells, i.e. each of the ten pseudotime bins contains the same number of human and chimpanzee pseudocells. This sampling procedure was performed 100 times for each possible number of human lines, ranging from one to seven. DE analysis, as described above, was applied to compare gene expression of human pseudocells in each sampling with the chimpanzee pseudocells.
Robust DE genes were defined as DE genes that were detected in at least 80% of the tests performed on pseudocell samplings with replacement, with any number of human cell lines used.
A similar strategy was also used to estimate the false positive human-chimpanzee DE genes due to differences between cell lines. In each sampling, two lines were randomly selected as group one, and a certain number of lines, ranging from one to five, were selected from the remaining lines as group two. For each group, pseudocells were randomly sampled from the selected lines to recapitulate the pseudocell distribution along the aligned pseudotime of chimpanzee pseudocells. Such sampling was performed 100 times for each possible number of lines used in group two. The transcriptome trajectory from cortical neural progenitors to deeper layer neurons in macaque organoids was used as the evolutionary outgroup to determine species specificity of the identified human-chimpanzee DE genes. First, the cumulative expression divergences of each gene between human and macaque (dHM), and between chimpanzee and macaque (dCM), were calculated. The cumulative expression divergence was calculated by summing up absolute values of average expression differences between species at the 50 pseudotime bins of equal sizes along the aligned pseudotimes. The human-chimpanzee DE of one gene is seen as human-specific if dHM − dCM > max (dHM, dCM)/2. Genes with chimpanzee-specific DE were identified in the same way. Genes with human-specific DE were then clustered based on their human-chimpanzee DE along pseudotimes. Average expression differences between human and chimpanzee at each of the 50 pseudotime bins along the pseudotimes was calculated for each gene with human-specific DE (denoted as dt at pseudotime bin t), and then normalized as . Hierarchical clustering (Ward algorithm) was then used to cluster those genes into nine clusters, with distances between genes calculated as the Euclidean distances between their normalized DE spectrums. Clusters with fewer than five genes were discarded. 
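The species-specificity criterion can be written compactly. The Python sketch below assumes that average expression per pseudotime bin has already been computed for each species along the aligned pseudotimes. The function names and the "unassigned" label are hypothetical.

```python
def cumulative_divergence(expr_a, expr_b):
    """Sum of absolute differences in average expression across pseudotime bins."""
    return sum(abs(a - b) for a, b in zip(expr_a, expr_b))


def classify_specificity(human, chimp, macaque):
    """Classify a human-chimpanzee DE gene using macaque as the outgroup.

    Each argument is the per-bin average expression of one gene (e.g. 50 bins).
    Human-specific if dHM - dCM > max(dHM, dCM) / 2, and symmetrically for
    chimpanzee-specific DE.
    """
    d_hm = cumulative_divergence(human, macaque)
    d_cm = cumulative_divergence(chimp, macaque)
    half_max = max(d_hm, d_cm) / 2
    if d_hm - d_cm > half_max:
        return "human-specific"
    if d_cm - d_hm > half_max:
        return "chimpanzee-specific"
    return "unassigned"
```

The criterion is equivalent to requiring that the larger of the two divergences be more than twice the smaller one.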
We annotated genes with human-specific expression patterns using the Homo sapiens Gene Ontology Annotation file (validation date: 21/04/2017) provided by the Gene Ontology Consortium.
Processing of the Fluidigm C1 based scRNA-seq data of cerebral organoids
In addition to the newly generated Fludigm C1 (SmartSeq2)-based scRNA-seq data, we further retrieved published sequencing data of 786 and 344 single cells from human and chimpanzee cerebral organoids17, 19, in the format of FASTQ files from GEO accession numbers GSE75140 and GSE86207 (CMK data set). All the reads were mapped to the human-chimpanzee-macaque consensus genome using STAR (v2.6.1d) with ‘--quantMode’ parameter set to ‘TranscriptomeSAM’ and GENCODE v27 annotation provided. Gene expression levels in each cell were quantified as TPM by RSEM (v1.3.1). Additionally, we retrieved the recently published gene expression matrix representing 3211 cells from human and chimpanzee cerebral organoids (excluding redundant cells from GSE75140 and GSE86207) and 4854 cells from human and macaque fetal brains21.
Based on the resulting gene expression profile, RSS to the fetal Brainspan data set was calculated as described above for each cell, with 248 genes with significant differential expression between cortical neurons measured by Smart-seq and Smart-seq2 excluded from the references. Distances between organoid cells were calculated as the Pearson’s coefficient distances between RSS of cells. Distances between cells from fetal brains were calculated in the same way. The resulted distance matrices of all organoid cells and fetal brain cells were used as the input to generate tSNE embeddings. kNN-network (k=50) was generated for organoid cells and fetal brain cells separated based on the RSS-based distances, and a Walktrap algorithm for network community identification was applied to identify cell clusters, which were further annotated based on their marker genes. Based on the cell type annotation, the diffusion map analysis, with the RSS profiles as input, was applied to the dorsal forebrain NPCs and neurons in organoids and fetal brains, respectively. The ranks of DC1 were used as the pseudotimes.
To validate the human-chimpanzee differential expression identified in our droplet-based scRNA-seq data using the C1-based cerebral organoid data, the organoid dorsal telencephalon pseudotemporal trajectory was first split into ten intervals. In each pseudotemporal interval, the human-chimpanzee DE was calculated as the log2-transformed fold change (log2FC) between the average expression of human and chimpanzee cells in the interval. Here, the CMK data set and the other data sets, which used a distinct quantification method, were processed separately. A similar strategy was also applied to the aligned droplet-based human and chimpanzee pseudotemporal trajectories. The generalized log2-transformed fold change (gLog2FC), defined as the average log2FC across the pseudotemporal intervals, as well as the maximum log2FC across the intervals (mLog2FC), was further calculated for each robust human-chimpanzee DE gene in organoids. A DE gene was considered consistent between the two data sets if both the gLog2FC and the mLog2FC of the C1-based and droplet-based human-chimpanzee comparisons had the same sign (referred to as consistent DE genes). The pseudotemporal intervals with the maximum fold change in the droplet-based and C1-based trajectories were also obtained and compared for the consistent DE genes. This procedure was also applied to compare human-macaque differential expression of the human-specific DE genes along the droplet-based pseudotemporal trajectory and the C1-based fetal brain pseudotemporal trajectory.
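The interval-wise comparison can be sketched as follows. This is a minimal illustration under two assumptions that are not stated in the text: a pseudocount of 1 is added before taking the ratio, and mLog2FC is taken as the interval log2FC with the largest absolute value, so that its sign is meaningful. Function names are hypothetical.

```python
import math

def interval_log2fc(human, chimp, pseudocount=1.0):
    """log2FC per pseudotemporal interval.

    human, chimp: one list of per-cell expression values per interval.
    pseudocount is an assumption of this sketch to avoid division by zero.
    """
    out = []
    for h_cells, c_cells in zip(human, chimp):
        mean_h = sum(h_cells) / len(h_cells)
        mean_c = sum(c_cells) / len(c_cells)
        out.append(math.log2((mean_h + pseudocount) / (mean_c + pseudocount)))
    return out

def summarize(log2fc):
    """Return (gLog2FC, mLog2FC): the mean and the largest-magnitude interval log2FC."""
    g = sum(log2fc) / len(log2fc)
    m = max(log2fc, key=abs)
    return g, m

def is_consistent(log2fc_c1, log2fc_droplet):
    """A gene is consistent if gLog2FC and mLog2FC each share a sign across data sets."""
    g1, m1 = summarize(log2fc_c1)
    g2, m2 = summarize(log2fc_droplet)
    return g1 * g2 > 0 and m1 * m2 > 0
```

For example, a gene with interval log2FC values of 1 and 2 in one data set and 0.5 and 0.1 in the other is called consistent, whereas flipping the sign of the second data set makes it inconsistent.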
Single-cell and bulk ATAC-seq data generation
Organoids of 2 to 4 months old were washed twice with PBS in a Tissue-Tek Cryomold (Sakura), then embedded in 4% low-melting agarose (Sigma) and sliced into 150 μm sections using a vibrating microtome (Ci 7000 smz, Camden Instruments). Slices were placed on microscope slides containing Differentiation medium with vitamin A (Diff +VA) and inspected under a stereomicroscope to dissect cortical regions. Selected regions were washed twice in 500 μL PBS and incubated at 37°C in 500 μL Accutase (Sigma) plus 0.5 μL DNase I (New England Biolabs) for ∼45 minutes. Trituration was performed for additional mechanical dissociation. Cells were passed through a 30 μm pre-separation filter (Miltenyi Biotec), washed with Diff+VA medium, and spun down at 300 x g (Heraeus Megafuge 40R, Thermo Scientific) for 5 minutes. The cell pellet was resuspended in 200 μl of Diff+VA medium. Cells were viewed under a microscope to ensure a single cell suspension was obtained, and then counted using a Countess Automated Cell Counter (Invitrogen). Single cell suspensions for the early stages of organoid differentiation (iPS cells to neuroepithelium) were obtained as described above.
From the cell suspension, 50,000 cells were used as input for bulk ATAC-seq as described62. The remaining cells were diluted to a final concentration of 300 cells/μl and used for microfluidics-based single-cell ATAC-seq as described63. Briefly, cells were mixed with Suspension Reagent (Fluidigm) at a 3:2 ratio and loaded onto a primed medium (10-17 μm) integrated microfluidic circuit (Fluidigm) for capturing. Cell capture sites were examined under a microscope and noted for containing 0, 1, or multiple cells. Lysis, transposition, and amplification were performed on the Fluidigm C1 platform. DNA from each cell was transferred to an individual well of a 96-well plate and barcoded with unique combinations of 24 adapter-index i7 and 16 adapter-index i5 primers63. Quantification and library size distribution were assessed on an Agilent 2100 Bioanalyzer using High Sensitivity DNA chips. Excess primer contamination was removed using SPRIselect (Beckman Coulter Life Sciences) size selection. Up to 192 cells were pooled and sequenced in paired-end, dual-index mode for 50+8+50+8 cycles on one lane of an Illumina HiSeq 2500. A summary of all single-cell experiments can be found in Supplementary Table 1.
Single-cell and bulk ATAC-seq data processing
Base calling was performed using Bustard (Illumina), adapter trimming with leeHom64, and demultiplexing with deML65. Reads were aligned to hg19 for human, panTro4 for chimpanzee, and rheMac8 for macaque using bowtie2 with a maximum fragment length of 2000. PCR duplicates were marked and removed using Picard tools (http://broadinstitute.github.io/picard). Samtools66 was used to retain properly paired reads with mapping quality greater than 30. Reads mapping to the mitochondrial genome, the Y chromosome, or blacklisted genomic regions that show excessively high read mapping, several of which correspond to nuclear mitochondrial DNA segments (identified in Buenrostro et al.63 and the ENCODE Project67), were removed. For scATAC-seq, single-cell BAM files were merged, excluding data from any capture site with 0 or more than 1 cell, to create an aggregated BAM file. Peaks, which represent regions enriched in mapped paired-end sequences, were called using MACS268 with the options nomodel, nolambda, keep-dup all, and call-summits. Peak summits were extended by ±250 bp. In the event of overlapping peaks, the peak with the lowest p-value was kept. A single-cell ATAC-seq consensus peak set was obtained by requiring a peak to be accessible in a minimum of 5% of cells. Data visualization was carried out using the Integrative Genomics Viewer (IGV)69.
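The post-calling peak handling (summit extension, overlap resolution and the 5% consensus filter) can be illustrated with the short Python sketch below. It is not the pipeline used in this study: the greedy strategy for chains of overlapping peaks, the half-open interval convention and all function names are assumptions made for the example.

```python
def extend_summits(summits, flank=250):
    """Turn (chrom, summit, p_value) records into fixed-width (chrom, start, end, p_value) peaks."""
    return [(chrom, max(0, pos - flank), pos + flank, p) for chrom, pos, p in summits]

def resolve_overlaps(peaks):
    """Keep the lowest-p-value peak among overlapping peaks.

    Assumed strategy: visit peaks from most to least significant and keep a
    peak only if it does not overlap any peak already kept.
    """
    kept = []
    for peak in sorted(peaks, key=lambda p: p[3]):
        chrom, start, end, _ = peak
        if all(k[0] != chrom or end <= k[1] or start >= k[2] for k in kept):
            kept.append(peak)
    return sorted(kept, key=lambda p: (p[0], p[1]))

def consensus_peaks(cells_with_peak, n_cells, min_fraction=0.05):
    """Names of peaks accessible in at least min_fraction of cells.

    cells_with_peak: {peak_name: number of cells with a read in the peak}.
    """
    return [name for name, count in cells_with_peak.items()
            if count / n_cells >= min_fraction]
```

With summits at positions 1000 and 1200 on the same chromosome, the two 500 bp windows overlap and only the more significant one is retained.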
Enrichment for validated human VISTA enhancers
We overlapped scATAC-seq peaks detected in human cerebral organoids with positive human VISTA enhancers using bedtools intersect. For each tissue annotated in the VISTA Enhancer Browser, we counted the number of enhancers that did or did not overlap a peak.
We compared these values to the number of all other tissue elements that did or did not overlap a peak. Fisher’s exact tests were performed to determine which tissues’ enhancers had a higher likelihood of being represented. The significance values were corrected for multiple testing using the qvalue package in R.
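For illustration, the per-tissue test on the 2×2 table of overlap counts can be written with the Python standard library alone. The sketch assumes a one-sided test (enrichment of the tissue's enhancers among peak-overlapping elements), which is our reading of the text rather than a stated detail, and it does not reproduce the q-value correction. The function name is hypothetical.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment.

    a: tissue enhancers overlapping a peak
    b: tissue enhancers not overlapping a peak
    c: enhancers of all other tissues overlapping a peak
    d: enhancers of all other tissues not overlapping a peak
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b            # enhancers of the tissue of interest
    col1 = a + c            # all enhancers overlapping a peak
    total = a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(total, row1)
```

As a small check, a table with counts 3, 1, 1, 3 gives a one-sided p-value of 17/70.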
Cell state identification using single cell ATAC-seq on cerebral organoids and pseudotime estimation
The accessibility at each site in the consensus peak set for every single cell was used to create a count matrix. Cells with fewer than 5000 reads and less than 5% of reads in peaks were filtered out from further analyses. chromVAR40 was used to scan the peaks for transcription factor binding motif occurrences, using a curated collection of 1,765 human motifs from the cisBP database, and to identify significantly variable motifs among cells. In addition to TF binding motifs, peaks were scanned for 7-mers. Cell similarity was visualized in a two-dimensional t-SNE plot using the bias-corrected deviations in accessibility for 7-mers with a variability score greater than 1.5.
Each cell’s t-SNE coordinates and the consensus peaks were passed to Cicero70, and the densityPeak algorithm was used to identify two clusters of cells. Statistically significant differences in TF motif accessibility between the two clusters were calculated using chromVAR, and those motifs corresponding to marker TFs known to distinguish neural progenitors and neurons were used for cell state identification. Statistically significant differences in accessibility of additional annotations between the two clusters were used to support cell state identities. These annotations included differentially accessible chromatin peaks identified as being enriched in developing mouse brain radial glial cells or excitatory neurons71, as well as accessibility in peaks near genes showing pseudotime-dependent expression in cortical neural progenitors or cortical neurons identified as part of this study.
We identified differentially accessible (DA) peaks between the two clusters using the command differentialGeneTest in Cicero. A count matrix was generated with featureCounts 72 using the top 250 DA peaks in each cluster. This count matrix was used as input for a diffusion map in order to obtain a pseudotemporal ordering of the cells73. Projecting transcription factor binding motif deviation Z-scores on the cells revealed a gradient of known neural progenitor to neuronal markers along the first diffusion map component and we took a cell’s rank along this component as its pseudotime value.
DA peaks identified between the two clusters were used as input test regions for GREAT (version 3.0.0)74, with all accessible organoid peaks serving as background regions. We used the default basal plus extension genomic association rule with its default values. All gene ontology (GO) Biological Process terms and their associated hypergeometric p-values were exported. For each term, we plotted the p-value obtained using cluster 1 (identified as NPCs) DA peaks and the p-value obtained using cluster 2 (identified as neurons) DA peaks as input. Terms with a p-value < 0.05 were considered enriched. Informative enriched terms were highlighted based on their significance in one cell state relative to the other or, for terms enriched in both cell states, based on a small difference in significance between the two.
Single cell ATAC-seq pseudotime estimation for cells in early states of differentiation and cerebral organoids
Similar to the analysis of the cerebral organoids, we used chromVAR to calculate bias-corrected deviations in accessibility for TF motifs and 7-mers for each cell. Here, we included the scATAC-seq consensus peak sets called in the iPSC, embryoid body, neuroectoderm, and neuroepithelial time points, in addition to the scATAC-seq consensus peak set from the cerebral organoid time point. In the event of overlapping peaks, the peak with strongest signal was retained. Cells with fewer than 5,000 reads and less than 5% of reads in peaks (fraction of reads in peaks, FRiP) were removed from further analyses (Supplementary Table 8). Cell similarity was visualized in a two dimensional t-SNE plot using the bias-corrected deviations in accessibility for 7-mers.
As the cerebral organoid cells’ pseudotimes were previously resolved, we focused on ordering the earlier stages. For this we used Cicero’s differentialGeneTest to identify DA peaks among the iPSC, embryoid body, neuroectoderm, and neuroepithelial time points. A count matrix was generated using the top 250 DA peaks in each time point and used as input for a diffusion map. Projecting TF motif deviation Z-scores onto the cells revealed a gradient of pluripotent to more differentiated marker TFs along the first three diffusion map components. We fit a principal curve through the map, and used the pluripotent cells as a starting point to guide the curve. The rank of a cell along this curve was used as its pseudotime. We then appended the cerebral organoid cells’ pseudotime ranks to the end of this earlier-stage pseudotime. We used the pheatmap R package to visualize the dynamics of significantly variable motifs across pseudotime.
Annotation of Accessible Chromatin Peaks
Peaks were linked to an expressed protein-coding gene using the nearest (maximum distance 1 Mb) transcription start site of the canonical transcript as defined by GENCODE (comprehensive gene annotation, release 19). Promoter regions were defined as 1000bp upstream a TSS, and distal regions refer to non-promoter regions. Exon and intron annotations were also obtained from GENCODE (comprehensive gene annotation, release 19). BEDtools75 was used to annotate peaks for several evolutionary signatures, including: human accelerated regions45–47; selective sweeps compared to great apes76 and archaic humans77; single nucleotide changes (SNC) in modern humans that happened since the split with great apes and before or after the split with the ancestor of Neandertals and Denisovans, first identified in Prüfer et al. 201444 and updated for this analysis using the most current 1000 Genomes Phase 3 allele frequencies, with a global allele frequency ≥99.5% defined as fixed in all modern humans; small insertions and deletions (up to 5 nucleotides) fixed in modern humans that happened since the split with great apes and before or after the split with the ancestor of Neandertals and Denisovans78; and, human deletions that are highly conserved in mammals (hCONDELs, Supplementary Table 10)48.
Identification of genomic regions with differential accessibility between human and chimpanzee organoid neural progenitors and neurons
To compare the chromatin accessibility of NPCs and neurons in cerebral organoids between human and chimpanzee, and to identify putative regulatory regions that may contribute to transcriptome divergence between the two species, we applied a likelihood ratio test based on a generalized linear model with binomial error distribution to each regulatory region identified in human and chimpanzee organoids. More specifically, we identified open chromatin regions in human and chimpanzee organoids separately as described above. To compare an equal number of human and chimpanzee regions, we took the top 77,611 peaks (corresponding to the number of human consensus peaks) in each species and performed reciprocal liftOver, requiring a 50% minimum ratio of bases that must remap, in order to identify their orthologous counterparts in the other species. Peaks that successfully lifted over (>99%) were merged using bedtools and renamed (i.e. mergePeak#). Count matrices were generated at these merged peaks in each species’ own genome, and the matrices were then joined on the common peak name. Considering the higher read coverage in human cells, we subsampled reads in human cells to equalize the medians of the total number of reads mapped to the regions of interest in human and chimpanzee. This procedure was applied separately to NPCs and neurons. The resulting count matrices were binarized. We then fitted a generalized linear model for each region across all human and chimpanzee cells, with accessibility as the response variable and species as the independent variable. Another model, with the species variable replaced by a scaling coefficient, was fitted as the null model. The scaling coefficient is fixed to one for human cells and pc/ph for chimpanzee cells, where pc and ph are the average accessibility across all regions and all cells in chimpanzee and human, respectively. We compared the two models and obtained p-values using the likelihood ratio test.
Regions with BH-corrected P<0.01 were defined as differentially accessible (DA) regions (Supplementary Table 9). This procedure was applied to NPCs and neurons separately to obtain DA regions in the two cell states.
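The logic of the per-region test can be sketched in Python for a single binarized region. The text does not specify how the scaling coefficient enters the null model; this sketch assumes one possible reading, in which the chimpanzee accessibility probability is constrained to equal the human probability multiplied by pc/ph, leaving one free parameter. The full model has one free probability per species, so the test has one degree of freedom. All function names are hypothetical, and this is not the GLM implementation used for the reported results.

```python
import math

def binom_loglik(k, n, p):
    """Binomial log-likelihood of k accessible cells out of n (constant term omitted)."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll

def null_loglik(k_h, n_h, k_c, n_c, scale):
    """Maximized log-likelihood with chimpanzee probability fixed to scale * human probability."""
    eps = 1e-12
    lo, hi = eps, min(1.0, 1.0 / scale) - eps

    def f(p):
        return binom_loglik(k_h, n_h, p) + binom_loglik(k_c, n_c, scale * p)

    for _ in range(200):  # ternary search; f is concave in p
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return f((lo + hi) / 2.0)

def lrt_pvalue(k_h, n_h, k_c, n_c, scale):
    """Likelihood ratio test p-value (chi-square, 1 df) for a species effect at one region."""
    full = binom_loglik(k_h, n_h, k_h / n_h) + binom_loglik(k_c, n_c, k_c / n_c)
    stat = max(0.0, 2.0 * (full - null_loglik(k_h, n_h, k_c, n_c, scale)))
    return math.erfc(math.sqrt(stat / 2.0))  # survival function of chi-square with 1 df
```

Under this reading, a region accessible in 50 of 100 human cells and 25 of 100 chimpanzee cells is not called differential when pc/ph is 0.5, because the difference is fully explained by the global accessibility ratio. The resulting p-values would then be adjusted across regions with the Benjamini-Hochberg procedure.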
Functional and evolutionary characterization of genomic regions with differential accessibility
We performed permutations to determine if differentially accessible (DA) peaks were significantly more likely to overlap a given annotation compared to non-differentially accessible (non-DA) peaks. In more detail, we first resized all peaks to an equal length of 500bp and calculated the average accessibility of human and chimp cells in the resized DA and non-DA peaks. Peaks were then placed into average accessibility bins of 5% intervals. Given the number of DA peaks in each accessibility bin, the same number of non-DA peaks was chosen at random from the corresponding accessibility bin. The random set of non-DA peaks was then overlapped with the given annotation using bedtools intersect. The random sampling of non-DA peaks and annotation overlap was repeated 2000 times. For each annotation, we counted the number of times a non-DA peak permutation resulted in a higher overlap than what was observed for DA peaks. This number was divided by the number of permutations to determine significance (p<0.05).
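A minimal Python sketch of this accessibility-matched permutation scheme is given below. It takes the annotation overlap of each peak as a precomputed boolean (the bedtools step is not reproduced), and the data layout and function names are assumptions of the example.

```python
import random

def accessibility_bin(accessibility, n_bins=20):
    """Index of the 5%-wide accessibility bin (20 bins covering 0 to 1)."""
    return min(int(accessibility * n_bins), n_bins - 1)

def permutation_pvalue(da_peaks, non_da_peaks, n_perm=2000, seed=0):
    """Fraction of matched non-DA samples that overlap the annotation more often than DA peaks.

    da_peaks, non_da_peaks: lists of (mean_accessibility, overlaps_annotation) tuples.
    Raises an error if a bin holds fewer non-DA peaks than DA peaks.
    """
    rng = random.Random(seed)
    observed = sum(hit for _, hit in da_peaks)

    # Number of DA peaks per accessibility bin.
    needed = {}
    for acc, _ in da_peaks:
        b = accessibility_bin(acc)
        needed[b] = needed.get(b, 0) + 1

    # Overlap flags of non-DA peaks, grouped by the same bins.
    pool = {}
    for acc, hit in non_da_peaks:
        pool.setdefault(accessibility_bin(acc), []).append(hit)

    higher = 0
    for _ in range(n_perm):
        overlap = sum(sum(rng.sample(pool[b], k)) for b, k in needed.items())
        if overlap > observed:
            higher += 1
    return higher / n_perm
```

Matching on accessibility bins ensures that an apparent enrichment among DA peaks is not simply a consequence of DA peaks being more (or less) accessible on average than the background.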
We used fixed SNCs, organoid-specific peaks, and linked differentially expressed (DE) genes as annotations. When overlapping peaks with fixed SNCs, we restricted the analysis to include only regions that passed a stringent genome alignability filter (“map35_100%”)44, in which SNCs could be called. Organoid-specific peaks were defined as peaks detected in 2-month and 4-month old cerebral organoid stages, but not detected in earlier stages of differentiation (pluripotency to neuroepithelial stages). Cell state-specific peaks were those identified as differentially accessible between NPCs and neurons in either human or chimp.
To study putative effects of fixed SNCs on transcription factor binding in the accessible genomic regions, we used funseq279 to scan and statistically evaluate all possible transcription factor binding motifs created by fixed SNCs in DA peaks. To generate a list of TF motifs lost on the human lineage, we used the human allele as the reference allele and the ancestral allele44 as the alternative allele. To generate a list of TF motifs gained on the human lineage, we flipped the state of the reference and alternative allele. This allowed us to directly compare the sequence scores of TF motifs gained or lost in humans. We subtracted the sequence score with the alternative allele from the sequence score with reference allele and performed min-max normalization. Human TF motif gains were plotted as positive values, while human TF motif losses were plotted as negative values. The genomic location of SNCs predicted to alter TF motif binding are provided in Supplementary Table 10. The alteration rate for TF motifs gained in humans was calculated by dividing the number of gains in DA peaks by the number of occurrences of that motif when scanning all organoid accessible peaks using chromVAR and the human genome sequence. The alteration rate for TF motifs lost in humans was calculated by dividing the number of losses in DA peaks by the number of occurrences of that motif when scanning all organoid accessible peaks using chromVAR and the chimpanzee genome sequence. The alteration rates of human TF gains and losses were also calculated per TF family, using TF motif family assignments obtained from80.
We used the macaque cerebral organoid scATAC-seq data to determine the species specificity of the peaks identified as differentially accessible between human and chimpanzee (Supplementary Table 9). In brief, for each accessible region compared between human and chimpanzee that could be lifted over to the macaque genome, we counted read coverage in each macaque cell. Regions that failed liftOver were considered inaccessible in all macaque cells. A random sampling of reads in human and chimpanzee cells was applied to equalize median read coverage in the three species. This procedure was applied 100 times and to the two cell states separately. Accessibility probability was then calculated for the two cell states in the three species. In human and chimpanzee, averages across the 100 read-subsampling-based estimates were used. The difference in accessibility probability between human and macaque (H-M), and that between chimpanzee and macaque (C-M), was then calculated for each human-chimpanzee DA peak in each cell state. A DA peak was considered human-specific if its H-M difference was at least four times larger than its C-M difference and its H-M difference was no less than 2%. Similar criteria were applied to define chimpanzee-specific DA peaks.
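The polarization rule can be expressed compactly in Python. The sketch assumes that the H-M and C-M differences are compared as absolute values and that the chimpanzee-specific rule mirrors the human-specific one; neither detail is spelled out in the text. The function name is hypothetical.

```python
def classify_da_peak(p_human, p_chimp, p_macaque, fold=4.0, min_diff=0.02):
    """Assign a human-chimpanzee DA peak to a lineage using macaque as outgroup.

    Inputs are accessibility probabilities (fraction of cells with the peak open).
    """
    hm = abs(p_human - p_macaque)   # human vs. macaque difference
    cm = abs(p_chimp - p_macaque)   # chimpanzee vs. macaque difference
    if hm >= fold * cm and hm >= min_diff:
        return "human-specific"
    if cm >= fold * hm and cm >= min_diff:
        return "chimp-specific"
    return "unassigned"
```

A peak open in 30% of human cells, 10% of chimpanzee cells and 8% of macaque cells would be classified as human-specific, since chimpanzee and macaque agree and human differs by more than 2%.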
To investigate potential biological processes that may be influenced by DA peaks, we used human-chimp DA peaks for each cell state (NPC or neuron) as input test regions for GREAT (version 3.0.0)74, with all accessible organoid peaks serving as background regions. This analysis was then carried out in the same way as explained above.
Single-nucleus and bulk RNA-Seq data generation
Cubes were dissected on dry ice from the prefrontal cortex of human, chimpanzee, bonobo and macaque, aiming for cubes with little curvature to obtain reproducible slicing results. Briefly, the thickness of gray matter at all facets of the cube was measured to obtain a mean gray matter thickness. The mean thickness was divided by 10 to obtain the thickness of each segment, whereby each segment consisted of several slices of 50 μm thickness. Sectioning was performed in a cryostat (Microm, Thermo Fisher), with slices being alternately immersed in Trizol (Invitrogen) for bulk RNA isolation or transferred on dry ice to a dry low-binding tube for single-nucleus isolation. Segments 11 and 12 were collected as well but were considered to be derived from the white matter of the cortex. Samples were then stored at -80°C until further use.
For nuclei isolation from frozen tissue, all following steps were performed on ice with precooled buffers, and centrifugation steps were performed at 4°C. Briefly, tissue was spun down and thawed on ice, and 1 ml PBSE (PBS (Gibco), 2 mM EDTA (Life Technologies)) was added to the tissue. The tissue slices were incubated at 4°C on a shaker at 1500 rpm for a total of 45-60 min, with trituration steps in between using 1000p and 200p to homogenize the tissue. Generally, segments 1-10 were used for single-nucleus experiments. Two segments were pooled to obtain sufficient material for single-nucleus isolation, resulting in 5 segments per individual. To reduce batch effects and increase the number of nuclei per experiment, material from three different individuals (originating from human, chimp/bonobo and macaque, respectively) was pooled for each segment. After homogenization, solutions were combined in a 5 ml tube and spun down at 900xg for 5 min. The pellet was resuspended in 1.5 ml PBSE + 1% NP-40 (BioVision), triturated 20 times using 1000p and incubated for 7 min on ice. Samples were then twice spun down at 900xg for 5 min and resuspended in 1.5 ml PBSE + 1% BSA (Serva). Samples were then spun down again at 900xg for 5 min and resuspended in PBS + 1% BSA. Before sorting, samples were filtered through a 30 μm cell filter (Miltenyi Biotec) and stained using DAPI (1:1000, BD Pharmingen). Nuclei were sorted in yield sort mode (BD FACS AriaIII and BD FACS Fusion) based on a defined nuclei population by excluding debris using FSC and SSC and by sorting DAPI-positive events. Nuclei were sorted in bulk into 96-well plates and spun down for 5 min at 600xg to enrich for nuclei in the pellet.
For each of the pooled samples, 2 lanes on a 10X Chromium microfluidic chip were loaded if feasible, targeting the maximum possible number of nuclei obtained from sorting. Single-nucleus experiments were performed using the 10X Genomics Single Cell 3’ kit v2 to encapsulate nuclei along with barcode-tagged beads, to generate and amplify cDNA, and to generate sequencing libraries. Each pooled library was barcoded using i7 barcodes provided by 10X Genomics. cDNA and sequencing library quality and quantity were determined using Agilent’s High Sensitivity DNA Assay. Libraries were pooled and sequenced in 150 bp paired-end mode on Illumina’s NovaSeq platform as provided in Supplementary Table 1.
RNA isolation for bulk-RNA Seq was performed using the Direct-zol 96 RNA kit (Zymo Research) and was quantified using Agilent’s Bioanalyzer RNA 6000 Nano and Pico kit. Libraries were prepared using the NEBNext Ultra Low RNA Library Prep Kit (New England Biolabs). Library quantification was performed using Agilent’s Bioanalyzer DNA 1000 chip kit. All bulk RNA Seq libraries were pooled at equal ratios and sequenced on one lane of an Illumina NovaSeq platform in 150 bp paired-end mode.
Processing of single-nucleus and bulk RNA-seq data from human, chimpanzee and macaque adult brains
Single-nucleus libraries were demultiplexed based on their i7 index sequences using 10x Cell Ranger (v2.1). Mapping to the human-chimp-macaque consensus genome and generation of count matrices were then performed using the same Cell Ranger version, with the GENCODE v27 human annotation provided. Nuclei were assigned to species based on species-specific sites using a two-step approach, first separating all great ape nuclei from macaque nuclei and subsequently assigning great ape nuclei to either human or chimp/bonobo. Nuclei with a support of less than 80% for either of the groups were removed from further analysis. Moreover, nuclei with fewer than 200 or more than 6,000 genes detected, as well as those with more than 5% of detected transcripts derived from mitochondrial genes, were removed from further analyses.
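The quality filters described above amount to a simple predicate per nucleus, sketched here in Python. The argument names and the representation of species support as a fraction are assumptions of this illustration; the thresholds are those given in the text.

```python
def passes_qc(species_support, n_genes, mito_fraction,
              min_support=0.80, min_genes=200, max_genes=6000, max_mito=0.05):
    """Return True if a nucleus survives the species-assignment and coverage filters.

    species_support: fraction of informative sites supporting the assigned species group
    n_genes: number of genes detected in the nucleus
    mito_fraction: fraction of detected transcripts from mitochondrial genes
    """
    return (species_support >= min_support
            and min_genes <= n_genes <= max_genes
            and mito_fraction <= max_mito)

def filter_nuclei(nuclei):
    """Keep nuclei (dicts with 'support', 'genes' and 'mito' keys) that pass all filters."""
    return [n for n in nuclei if passes_qc(n["support"], n["genes"], n["mito"])]
```

The upper bound on detected genes is a common heuristic for excluding likely doublets, while the lower bound and the mitochondrial threshold exclude low-quality or damaged nuclei.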
The full single-nucleus RNA-seq data set including all species was further analyzed using Seurat (v3) (Supplementary Table 12). Single-nucleus expression values were normalized and highly variable genes were identified using a variance stabilizing function to detect the top 2000 variable genes (Supplementary Table 11). Data were then integrated by finding corresponding anchors between the species using 30 dimensions. Scaling and principal component analysis were performed using the integrated data. The top 20 principal components were used to identify neighbors of cells and clusters and to visualize the clustering using tSNE embedding. Cluster identities were assigned using unbiased identification using cluster markers by running Seurat’s FindAllMarkers function (Wilcoxon test, min.pct = 0.25, min logFC = 0.25) using non-integrated expression values, known marker genes reported elsewhere (Lake et al., PMID: 27339989, 29227469) and by cell type prediction using Seurat’s TransferData function to anchor to the published Drop-seq based human adult frontal cortex snRNA-seq data (Lake et al. Nature Biotech, PMID: 29227469). Two potential doublet clusters (c11, c19) were excluded from further analysis. For analysis of the major cell classes (excitatory neurons, inhibitory neurons, astrocytes, oligodendrocytes, oligodendrocyte precursor cells, microglia, endothelial cells) subtype clusters were combined and cell type markers recalculated using Seurat’s FindAllMarkers function (Wilcoxon test, min.pct = 0.25, min logFC = 0.25) using non-integrated expression values (Supplementary Table 13).
Since nuclei of the three species have significantly different transcriptome coverage, pseudo-nuclei were constructed for more robust transcriptome measurement, as well as for more fair and efficient comparison, using a similar procedure as described above to generate pseudocells, under the constraint of merging only nuclei from the same segment of the same sample and grouped in the same cell cluster. The probabilities of nuclei selected as pseudo-nuclei seed were 1/13 for human, 1/8 for chimpanzee and 1/10 for macaque.
Reads of the bulk RNA-seq samples were mapped to the human-chimpanzee-macaque consensus genome using STAR (v2.6.1d). The Python utility htseq-count was used to count the numbers of uniquely mapped reads for genes annotated in the GENCODE v27 human annotation. DESeq2 was used for normalization and for retrieving FPKM as the expression level measurement.
To determine the laminar origin of each segment, genes with segment-dependent expression were first screened for each cortical cube. In brief, an ANCOVA analysis was applied to compare two models: a natural spline (df = 6) linear model with log10-transformed FPKM as the response and the segment order as the variable, and a null model in which expression has no relationship with segment. For each of the resulting genes, its enriched segments in the cube were identified as the segments in which the gene’s expression was at least one standard deviation higher than the mean across segments. Genes with enriched expression at each segment were then overlapped with the layer markers identified previously9. Segments whose enriched genes significantly overlapped with markers of only one layer were considered to originate from a single layer; the others were considered mixtures of multiple layers. For each mixture segment, a quadratic-programming-based transcriptome deconvolution81 was applied to determine the relative contribution of the enriched layers. A layer index was then obtained for each segment as the average of the enriched layers weighted by their contributed proportions.
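Two of the steps above, the identification of enriched segments and the computation of the layer index, are simple enough to sketch in Python. The use of the sample standard deviation and the function names are assumptions of this example; the spline-based screening and the quadratic-programming deconvolution are not reproduced.

```python
from statistics import mean, stdev

def enriched_segments(expression):
    """1-based indices of segments where expression is >= mean + 1 SD across segments.

    expression: per-segment expression of one gene (e.g. log10 FPKM), in segment order.
    """
    threshold = mean(expression) + stdev(expression)
    return [i + 1 for i, value in enumerate(expression) if value >= threshold]

def layer_index(layer_proportions):
    """Weighted average layer number of a segment.

    layer_proportions: {layer_number: estimated contribution of that layer}.
    """
    total = sum(layer_proportions.values())
    return sum(layer * w for layer, w in layer_proportions.items()) / total
```

For instance, a segment estimated to consist of 25% layer 2 and 75% layer 3 receives a layer index of 2.75, while a pure layer 5 segment receives an index of 5.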
Estimation of cell type distribution across cortical layers and gene expression patterns in neurons across cortical layers
To estimate the cell type composition of each layer, nuclei from each sample were randomly assigned to one layer, based on the layer mixture proportions estimated above. The proportion of each of the seven major cell classes (excitatory neurons, inhibitory neurons, astrocytes, oligodendrocytes, oligodendrocyte precursor cells (OPCs), microglia and endothelial cells) was then calculated for nuclei assigned to each layer in human. This procedure was repeated 100 times, with the resulting average as the final estimate. The laminar distribution of each cell cluster was also estimated using the described procedure. In addition, to control for differences in the number of nuclei detected in each layer, the same number of nuclei (n = 200) was subsampled with replacement from each layer in each of the 100 random layer assignments. To obtain a more precise estimate of layer origin at the level of individual nuclei for excitatory and inhibitory neurons, both of which show distinct layer distribution patterns across subtypes, we trained an elastic net linear regression model (alpha = 0.5) on excitatory and inhibitory neurons separately, with the sample layer indices as the training response and expression levels of the highly variable genes as the variables. To enhance model robustness, pseudo-nuclei from all three species together were used for model training. The trained models were then applied to the excitatory and inhibitory pseudo-nuclei again. The predicted layer indices were used as the estimated relative laminar location of the pseudo-nuclei. The projection of the predicted layer indices to layers was done by averaging expression patterns of markers of different layers9.
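The random layer assignment at the start of this procedure can be sketched as follows. This Python illustration covers only the assignment and averaging steps; the n = 200 subsampling and the elastic net model are omitted. The input layout, the function name and the choice to average each layer over the replicates in which it received at least one nucleus are assumptions of the sketch.

```python
import random
from collections import Counter

def layer_composition(nuclei, n_rep=100, seed=0):
    """Average cell class proportions per layer over repeated random layer assignments.

    nuclei: list of (cell_class, {layer: mixture proportion of the nucleus's sample}).
    Returns {layer: {cell_class: mean proportion}}.
    """
    rng = random.Random(seed)
    summed = {}         # layer -> Counter of summed class proportions
    n_seen = Counter()  # layer -> replicates in which the layer received nuclei
    for _ in range(n_rep):
        counts = {}
        for cell_class, mixture in nuclei:
            layers = list(mixture)
            weights = [mixture[layer] for layer in layers]
            layer = rng.choices(layers, weights=weights)[0]
            counts.setdefault(layer, Counter())[cell_class] += 1
        for layer, class_counts in counts.items():
            total = sum(class_counts.values())
            n_seen[layer] += 1
            acc = summed.setdefault(layer, Counter())
            for cell_class, n in class_counts.items():
                acc[cell_class] += n / total
    return {layer: {c: v / n_seen[layer] for c, v in acc.items()}
            for layer, acc in summed.items()}
```

Nuclei from a pure-layer segment are always assigned to that layer, whereas nuclei from a mixture segment are distributed across its layers in proportion to the deconvolution estimates, so that the uncertainty of the segment's laminar origin is propagated into the composition estimate.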
Differential expression analysis between human and chimpanzee cell types in adult brains and determination of their species-specificity
Due to the sparse nature of the snRNA-seq data and the unequal coverage of nuclei from different species, commonly used statistical tests for differential expression analysis (e.g. Wilcoxon’s rank sum test) failed to provide reliable estimates of DE, even with state-of-the-art VST normalization methods82. As detection rates of genes are correlated with their expression levels82, we compared gene expression levels of the same cell type in human and chimpanzee by comparing their detection rates, using a GLM-ANCOVA analysis similar to the one described above for identifying genomic regions with differential accessibility. In brief, the pseudo-nuclei expression matrix was binarized. A binomial GLM was fitted for each gene, with its detection as the response variable and the species of the pseudo-nuclei as the independent variable. This model was compared to the null model with the species variable replaced by a scaling coefficient. The scaling coefficient is fixed to one for human pseudo-nuclei and pc/ph for chimpanzee pseudo-nuclei, where pc and ph are the average numbers of detected genes across the pseudo-nuclei involved in the test in chimpanzee and human, respectively.
While the described DE test was applied to four cell classes with sufficient numbers of pseudo-nuclei (excitatory neurons, inhibitory neurons, astrocytes and oligodendrocytes), the heterogeneity within the two neuron types, as well as their uneven distributions in human and chimpanzee, needed to be considered. A subsampling procedure with replacement was therefore applied. In every subsampling, an equal number of pseudo-nuclei (n = 200) was sampled from each species, with pseudo-nuclei in clusters annotated as the cell class of interest sharing an equal probability of being selected. The described DE test was then applied to the sampled pseudo-nuclei of this cell class. This subsampling procedure was repeated 100 times, and DE genes of each cell class were defined as genes with significant DE (BH-corrected P<0.005) in at least 80 of the 100 subsamplings. Additional filtering was then applied, requiring the same direction of human-chimpanzee difference in detection rates and VST-normalized expression values.
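The final calling step, Benjamini-Hochberg correction within each subsampling followed by the 80-of-100 recurrence requirement, can be sketched in Python. The per-gene raw p-values are taken as given, and the data layout and function names are assumptions of this example.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p-value, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def robust_de_genes(pvalue_runs, alpha=0.005, min_hits=80):
    """Genes significant (BH-adjusted p < alpha) in at least min_hits subsamplings.

    pvalue_runs: one {gene: raw p-value} dict per subsampling.
    """
    hits = {}
    for run in pvalue_runs:
        genes = list(run)
        for gene, q in zip(genes, bh_adjust([run[g] for g in genes])):
            if q < alpha:
                hits[gene] = hits.get(gene, 0) + 1
    return sorted(g for g, n in hits.items() if n >= min_hits)
```

Requiring recurrence across subsamplings guards against DE calls that depend on a particular draw of pseudo-nuclei, which matters when subtypes are unevenly represented between species.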
Macaque pseudo-nuclei were then introduced to investigate the species specificity of the identified DE. A similar procedure, sampling the same number of pseudo-nuclei from clusters annotated as the same cell class, was applied 100 times to the macaque pseudo-nuclei. For each sampling, average VST-normalized expression values were calculated for each cell class in human, chimpanzee and macaque, from which the differences between human and macaque (dHM), as well as between chimpanzee and macaque (dCM), were calculated. The identified human-chimpanzee DE was defined as human-specific if |dHM| > 4 * |dCM|. Genes with chimpanzee-specific DE were identified in the same way (Supplementary Table 14).
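The species-specificity rule can be sketched for one gene and one sampling round. This is a minimal illustration that assumes the criterion compares the human-macaque difference (dHM) with the chimpanzee-macaque difference (dCM), and that the chimpanzee-specific rule is its mirror image. How the 100 sampling rounds are combined is not specified in the text and is left out. The function name, the returned labels and the `fold` parameter are hypothetical.

```python
from statistics import mean

def species_specificity(human, chimp, macaque, fold=4.0):
    """Classify one DE gene from VST-normalized expression values of the
    sampled pseudo-nuclei of one cell class in each species."""
    d_hm = mean(human) - mean(macaque)   # human minus macaque
    d_cm = mean(chimp) - mean(macaque)   # chimpanzee minus macaque
    if abs(d_hm) > fold * abs(d_cm):
        return "human-specific"
    if abs(d_cm) > fold * abs(d_hm):
        return "chimpanzee-specific"
    return "unassigned"
```

With macaque as the outgroup, a change is assigned to the human lineage only when chimpanzee stays close to macaque while human departs from it by more than the chosen fold.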
ACKNOWLEDGEMENTS
We thank D. Wollny, A. Brazovskaya, K. Köhler, T. Schaffer, B. Schellbach, A. Weihmann, R. Schultz, I. Bünger, M. Dannemann, R. Snabel, B. Vernot, W. Hevers, M. Schörnig, J. Kelso of MPI-EVA, and K. Sekine of Yokohama University for their help with this project. We thank A. Fischer, M. Halbwax, K. Köhler, Anne Weigert and the Tchimpounga Sanctuary for support with generating the JoC iPSC line, and Lea Berninger and Jula Peters (MPI-CBG, Dresden) for contributing Sc102a1 and SandraA organoids. Karyotyping was supported by the Stem Cell Engineering Facility, a core facility of CMCB at Technische Universität Dresden. Sorting was in part performed at the CUDZ at the veterinary medicine faculty at the University of Leipzig. This work was supported by the Max Planck Society (BT), Chan Zuckerberg Initiative (BT/JGC), and European Research Council (ANTHROPOID, JGC; ORGANOMICS, BT). SK was supported by the Boehringer Ingelheim Fonds.