Abstract
We have created a compendium of single cell transcriptome data from the model organism Mus musculus comprising nearly 100,000 cells from 20 organs and tissues. These data reveal many new aspects of cell biology, including gene expression in poorly characterized cell populations and the characterization of new populations of cells in many tissues. Furthermore, they allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as immune cells from distinct anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3’-end counting1,2, enabled the survey of thousands of cells at relatively low coverage, while the other, FACS-based full length transcript analysis3, enabled characterization of cell types with high sensitivity and coverage. The cumulative data set provides the foundation for a comprehensive resource of transcriptomic cell biology.
The cell is a fundamental unit of structure and function in biology, and multicellular organisms have evolved a wide variety of different cell types with specialized roles. Although cell types have historically been characterized on the basis of morphology and phenotype, the development of molecular methods has enabled ever more precise definition of their properties, typically by measuring protein or mRNA expression patterns4. Technological advances have enabled increasingly greater degrees of multiplexing of these measurements5,6, and it is now possible to use highly parallel sequencing to enumerate nearly every mRNA molecule in a given single cell3,7. This approach has provided many novel insights into cell biology and the composition of organs and tissues from a variety of organisms8–17. However, while these reports provide valuable characterization of individual organs, it is challenging to compare data taken with varying experimental techniques in independent labs from different animals. It therefore remains an open question whether data from individual organs can be synthesized and used as a more general resource for biology.
Here we report a compendium of cell types from the mouse Mus musculus. The compendium comprises single cell transcriptome sequence data from 97,029 cells isolated from 20 organs and tissues (Fig. 1). These organs were collected from 3 female and 4 male C57BL/6 NIA mice aged 3 months (10-15 weeks), whose developmental age is roughly analogous to humans at 20 years of age. We analyzed multiple organs and tissues from the same animal, thereby generating a data set controlled for age, environment and epigenetic effects. This enables the direct comparison of cell type composition between organs as well as comparison of shared cell types across the entire organism. All data, protocols, and analysis scripts from the compendium are shared as a public resource.
a) 19 tissues from 4 male and 3 female mice were analyzed. After tissue dissociation, cells were either sorted by FACS or manipulated in microfluidic emulsions, after which they were lysed and their transcriptomes amplified, followed by sequencing, read mapping, and data analysis. b) Barplot showing number of sequenced cells prepared by FACS sorting from each tissue (n=20). c) Barplot showing number of sequenced cells prepared by microfluidic emulsion from each tissue (n=12). d) Histogram of number of reads per cell for each tissue from FACS sorted cells. e) Histogram of number of genes detected per cell for each tissue from FACS sorted cells. f) Histogram of number of unique molecular identifiers (UMI) sequenced per cell for each tissue from cells prepared by microfluidic emulsion. g) Histogram of number of genes detected per cell for each tissue for cells prepared by microfluidic emulsion.
Gene counts and metadata from all single cells are accessible via Digital Object Identifier (DOI) at 10.6084/m9.figshare.5715040 for the FACS data and 10.6084/m9.figshare.5715025 for the microfluidic droplet data. While these data are by no means a complete representation of all mouse organs and cell types, they provide a first draft attempt to create an organism-wide representation of cellular diversity and a comparative framework for future studies using the large variety of murine disease models.
We developed a specific procedure to collect 20 organs and tissues from the same mouse (see Methods). Briefly, each mouse was anesthetized with 2.5% v/v Avertin, followed by transcardial perfusion with 20 ml PBS. Aorta, bladder, bone marrow, brain (cerebellum, cortex, hippocampus, striatum), colon, diaphragm, fat (brown, gonadal, mesenteric, subcutaneous), heart, kidney, liver, lung, mammary gland, pancreas, skin, skeletal muscle, spleen, thymus, tongue, and trachea were immediately dissected and processed into single cell suspensions (see Extended Data). Organs were dissociated into single cells, which were sorted by FACS into 384-well plates (with the exception of liver hepatocytes, which were sorted into 96-well plates, and heart-derived cardiomyocytes, which were hand-picked into 96-well plates) and in many cases also loaded into a microfluidic droplet emulsion generating device for cell isolation followed by transcriptome capture (Fig. 1a). Once prepared into sequencing libraries, transcriptomes were sequenced to an average depth of 685,500 reads per cell for the plate data and 7,709 unique molecular identifiers (UMI) per cell for the microfluidic emulsion data. After quality control filtering, 44,879 FACS sorted cells and 55,656 microfluidic droplet processed cells were retained for further analysis. A comparison of the two methods showed differences for each organ in the number of cells analyzed (Fig. 1b,c), reads per cell (Fig. 1d,f) and genes per cell (Fig. 1e,g).
Using two distinct measurement approaches on the same samples yielded several insights. We compared the methods to understand their relative strengths and weaknesses. Importantly, our results show that such comparisons can be misleading when performed on a single organ type, as there is substantial variation in performance across organs (Supp. Fig. 1). For example, the lung, trachea and thymus showed nearly twice as many genes detected per cell in the FACS data as in the microfluidic data, whereas heart, kidney and marrow showed comparable numbers of expressed genes detected by both methods (Supp. Fig. 1-3). The number of genes detected is a fairly crude measure of sensitivity, and one must also consider other metrics such as dynamic range for gene expression. The FACS-based approach has much higher dynamic range per gene than the microfluidic droplets (Supp. Fig. 4) and enables the analysis of full length transcripts. While the FACS approach generally resulted in higher sensitivity and dynamic range, it is also more laborious and time consuming to perform. The microfluidic droplet approach is faster and enables analysis of larger numbers of cells, albeit at generally lower sensitivity and dynamic range, and with reads only from the 3’ end of the transcript. We also expect this study to facilitate the development of computational tools to compare sequencing methods, and our data set will provide a training and validation set for such algorithms. The creation of these tools will be crucial for comparing independent data generated by labs around the world as various tissue atlas studies begin generating results.
We performed unbiased graph-based clustering of the pooled set of transcriptomes across all organs, and visualized them using tSNE (Fig. 2a). The majority of clusters contained cells from only one organ, but a substantial minority contained cells from many organs. To further dissect these clusters, we separately performed dimensional reduction and clustering on cellular transcriptomes from each individual organ (Fig. 2b). The resulting clusters were manually annotated using a priori biological knowledge about cell-type specific gene expression, which confirmed the presence of many specific cell types within organs (Fig. 3). Many of these cell types have not previously been obtained in pure populations and our data provide a wealth of new information on their characteristic gene expression profiles. Initial annotation of each organ and tissue can be found in the extended data, and a detailed discussion of each cell type on an organ-by-organ basis can be found in the supplement. Some unexpected discoveries include: 1) novel immune cell types in the lung, 2) a putative neuroendocrine cell population in the trachea, 3) new differentiated keratinocyte cell types of the tongue, 4) strong sexual dimorphism in hepatocytes with differential gene expression of male pheromones, 5) the suggestion of new roles for genes such as Neurog3, Hhex, and Prss53 in the adult pancreas, 6) sexual dimorphism in epithelial and mesenchymal cells of the bladder, and 7) a novel cell population expressing Chodl in skeletal muscle.
a) Dimensionally reduced tSNE plot of all cells sorted by FACS color coded by tissue of origin. b) Dimensionally reduced tSNE plots for each tissue of cells sorted by FACS. Color coding indicates distinct clusters. c) Barplots of manually annotated cell types based on differential gene expression across all tissues.
To better understand the relationships between cell types across organs, we mapped the biologically informed annotations of organ-specific cell types onto the clusters resulting from the unbiased analysis of the complete set of all cell transcriptomes. This analysis revealed that the clusters with cells from many organs generally represent shared cell types common to those organs (Figure 3). For example, B cells from fat, limb muscle, diaphragm, lung, spleen and marrow cluster together, as do T cells from spleen, marrow, lung, limb muscle, fat and thymus. Endothelial cells from fat, heart and lung cluster together, but form a distinct grouping from endothelial cells from mammary, kidney, trachea, limb muscle, aorta, diaphragm and pancreas. There are two clusters containing myeloid cells from limb muscle, brain, diaphragm, aorta, pancreas, kidney, trachea, heart, liver, fat and marrow. Lastly, a cluster containing mesenchymal stem cells from fat, diaphragm, and limb muscle suggests these cells share similar transcriptomes with stromal cells in mammary, trachea, and lung. These findings show that gene-specific manual annotation of cell types is consistent with unbiased clustering based on whole transcriptome profiles, and that cell type identity is strong enough across tissues to enable their unbiased identification at the whole transcriptome level.
Comparison of cell type determination as done by unbiased whole transcriptome comparison versus manual annotation by organ-specific experts. The x-axis represents clusters from Figure 2a with multiple organs contributing, while the y-axis represents manual expert annotation of cell types in an organ-specific fashion based on the data in Figure 2b and 2c. The unbiased method discovers relationships between similar cell types found in different organs (highlighted regions); in particular it groups T cells from different organs into a single cluster, B cells from different organs into a different single cluster, and endothelial cells from different organs into a single cluster.
To further investigate a common cell type across organs, we pooled all T-cells and performed an unbiased clustering analysis, revealing 5 clusters (Figure 4) grouped into two distinct sub-groups: clusters 0,4 and clusters 1,2,3. Cluster 0 comprises cells from the Thymus that are undergoing VDJ recombination characterized by the expression of RAG (Rag1 and Rag2) and TdT (Dntt), and includes uncommitted double positive T-cells (Cd4 and Cd8a positive). Cluster 4 contains proliferating T-cells, predominantly from the thymus, and we hypothesize that these are pre-T cells that are expanding after having completed VDJ recombination. In contrast, Clusters 1,2,3 contain mature T-cells. The cells in Cluster 3 are also predominantly from the thymus and show high Cd5 expression, suggesting that they are undergoing positive selection. Cluster 2 cells are mostly peripheral cells and are most likely activated T-cells expressing the high affinity IL2 receptor (Il2ra and Il2rb). Interestingly, they also express MHC type II genes (H2-Aa and H2-Ab1). Finally, Cluster 1 also represents mature T-cells, but primarily from the spleen.
Analysis of all T cells sorted by FACS. a) Dimensionally reduced tSNE plot of all T cells colored by cluster membership. Five clusters were identified. b) Dotchart showing level of expression (color scale) and number of expressing cells (point diameter) within each cluster of T cells. c) Barplot showing fraction of T cells originating from Fat, Lung, Marrow, Muscle, Spleen or Thymus for each of the T cell clusters. d) Barplot showing fraction of Cd4+/ Cd8+/ Cd4+Cd8+ / Cd4−Cd8− T cells for each of the T cell clusters.
A key challenge for many single cell studies to date is understanding the potential changes to the transcriptome caused by handling, dissociation and other experimental manipulation. A previous study in limb muscle showed that quiescent satellite cells tend to become activated by dissociation and consequently express immediate early genes among other genes.18 We found that expression of these dissociation-related markers was also clearly observed in our limb muscle data, as well as in mammary and bladder (Supp. Fig. 6), but that many organs showed little evidence of cellular activation with this panel. Therefore the dissociation-related activation markers found in limb muscle are not universal across all organs. This is not to say that other tissues lack dissociation-related gene expression changes, but that some of the genes involved are specific to a given tissue. It is important to note that the presence of such gene expression changes does not prevent the identification of cell type or the comparison of cell types across organs.
To understand the extent to which transcription factor (TF) expression determines cell type, we performed a correlation analysis of TF expression14 across the entire data set for TFs that were significantly enriched (p < 10^-5, log10 fold change > 0.3) in at least one unique combination of cell type and tissue (e.g., endothelial cells from liver or basal cells of the bladder) (Fig 5A). The off-diagonal elements of the correlation matrix indicate groups of transcription factors with correlated cell type expression. We measured co-expression of these groups of transcription factors and discovered that typically only one or a few cell types used any given combination of two to four TFs (Supp. Fig. 7). The largest cluster of correlated genes was enriched in a broad swath of immune cells across most tissues. Similarly, another cluster of genes was enriched in endothelial cells. We also observed groups of TFs specific to several organs; for example, within colon, goblet cells were highly enriched for a cluster of TFs that are either known or potential new targets for specification of goblet cells (Fig 5A, Supplementary Figure 7), and we also saw clusters specific to muscle cells, pancreatic cells, neurons, microglia and astrocytes. Another group of genes showed enrichment in several epithelial subtypes of tongue and bladder.
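The selection and correlation steps described above can be illustrated with a minimal NumPy sketch. It is not the analysis code used for Fig 5. The function name `tf_correlogram`, the layout of the inputs (one row per TF, one column per cell type and tissue combination for the enrichment statistics, one column per cell for expression) and the use of Pearson correlation are all assumptions made for illustration.

```python
import numpy as np

def tf_correlogram(expr, p_values, log_fold_changes, p_max=1e-5, lfc_min=0.3):
    """Select enriched TFs and correlate their expression profiles.

    expr             : TFs x cells expression matrix (hypothetical layout)
    p_values         : TFs x groups enrichment p-values, one group per
                       cell type + tissue combination (hypothetical layout)
    log_fold_changes : TFs x groups log10 fold changes
    Returns the row indices of the selected TFs and their correlation matrix.
    """
    expr = np.asarray(expr, dtype=float)
    enriched = (np.asarray(p_values) < p_max) & (np.asarray(log_fold_changes) > lfc_min)
    # keep TFs that pass both thresholds in at least one group
    selected = np.flatnonzero(enriched.any(axis=1))
    # rows are variables, so corr[i, j] is the Pearson correlation of TF i and TF j
    corr = np.corrcoef(expr[selected])
    return selected, corr

# Toy example: TFs 0 and 1 are co-expressed, TF 2 is anti-correlated, TF 3 is not enriched.
toy_expr = np.array([[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 1, 2, 1]])
toy_p = np.array([[1e-8, 0.5], [1e-7, 0.5], [0.5, 1e-9], [0.5, 0.5]])
toy_lfc = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 1.2], [0.1, 0.1]])
selected, corr = tf_correlogram(toy_expr, toy_p, toy_lfc)
```

Off-diagonal entries of `corr` close to 1 mark TFs with similar expression patterns. A TF with constant expression would produce undefined (NaN) correlations and would need to be dropped beforehand.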
Transcription factor expression analysis. A. Gene-gene correlation (correlogram) of top cell type-tissue specific transcription factors (TFs) (selected by minimum differential expression of 0.3 and p-value of 10^-6; immediate early genes excluded). Columns are colored by the cell ontology (left column) and tissue (right column) of the cell type+tissue with the highest expression of that gene. Insets show clusters of TFs enriched in selected cell types and tissues. B. Correlogram of top tissue-specific TFs for epithelial cells (cell ontologies containing “epithelial”, “basal”, “keratinocyte”, or “epidermis”). Row colors correspond to tissue (right) and cell ontology (left) of most-enriched cell type. tSNE plots are calculated using all variable genes and show tissue origin (right) and gene expression of select TFs (bottom). C. Correlogram and tSNE of tissue-specific TFs within stromal cells (cell ontologies containing “stromal”, “fibroblast”, or “mesenchymal”). D. Correlogram and tSNE of tissue-specific TFs within endothelial cells.
We further analyzed tissue-specific TFs within the endothelial, epithelial and stromal cell types found in several different organs (Fig 5B-D). Within epithelial cells, we found genes grouped by tissue: tongue (Pax9), bladder (Ets1), skin keratinocyte stem cells (Lhx2), and lung (Aebp1) (Fig 5B). Pax9 is enriched in tongue basal cells and is known to be necessary for formation of filiform papillae. Foxq1 is enriched in bladder basal cells, but its role in bladder basal cell differentiation is heretofore unexplored. Within fibroblast/stromal cells, we found TFs that separated heart from kidney fibroblasts, and specified the stromal cells of lung, trachea, or mammary gland (Fig 5C). Within endothelial cells, we found less strong signatures for tissue specificity; however, liver, brain, kidney, and lung-specific groups of TFs were distinguishable (Fig 5D). These results illustrate how single cell data taken across many organs and tissues can be used in combination to identify which transcriptional regulatory programs are specific to cell types of interest.
In conclusion, we have created a compendium of single-cell transcriptional measurements across 20 organs and tissues of the mouse. This Tabula Muris, or “Mouse Atlas”, has many uses, including the discovery of new putative cell types, the discovery of novel gene expression in known cell types, and the ability to compare cell types across organs. It will also serve as a reference of healthy young adult organs which can be used as a baseline for current and future mouse models of disease. While it is not an exhaustive characterization of all organs of the mouse, it does provide a rich data set of the most highly studied tissues in biology. The Tabula Muris provides a framework and description of many of the most populous and important cell populations within the mouse, and represents a foundation for future studies across a multitude of diverse physiological disciplines.
Methods
Mice and Tissue Collection
Four 10-15 week old male and four virgin female C57BL/6 mice were shipped from the National Institute on Aging colony at Charles River to the Veterinary Medical Unit (VMU) at the VA Palo Alto (VA). At both locations, mice were housed on a 12-h light/dark cycle, and provided food and water ad libitum. The diet at Charles River was NIH-31, and Teklad 2918 at the VA VMU. Littermates were not recorded or tracked, and mice were housed at the VA VMU for no longer than 2 weeks before euthanasia. Prior to tissue collection, mice were placed in sterile collection chambers for 15 minutes to collect fresh fecal pellets. Following anesthetization with 2.5% v/v Avertin, mice were weighed and shaved, and blood was drawn via cardiac puncture before transcardial perfusion with 20 ml PBS. Mesenteric adipose tissue (MAT) was then immediately collected to avoid exposure to the liver and pancreas perfusate, which negatively impacts cell sorting. Isolating viable single cells from both pancreas and liver of the same mouse was not possible; therefore, 2 males and 2 females were used for each. Whole tissues were then dissected in the following order: colon, spleen, thymus, trachea, tongue, brain, heart, lung, kidney, gonadal adipose tissue (GAT), bladder, diaphragm, skeletal muscle (tibialis anterior), skin (dorsal), subcutaneous adipose tissue (SCAT, inguinal pad), mammary glands (fat pads 2, 3, and 4), brown adipose tissue (BAT, interscapular pad), aorta, and bone marrow (spine and limb bones). Following single cell dissociation as described below, cell suspensions were either used for FACS sorting of individual cells into 384-well plates, or for microfluidic droplet library preparation. All animal care and procedures were carried out in accordance with institutional guidelines approved by the VA Palo Alto Committee on Animal Research.
Tissue dissociation and sample preparation
Specific protocols for each tissue are described in the supplement.
Single Cell Methods
Lysis plate preparation
Lysis plates were created by dispensing 0.4 μl lysis buffer (0.5U Recombinant RNase Inhibitor (Takara Bio, 2313B), 0.0625% Triton™ X-100 (Sigma, 93443-100ML), 3.125 mM dNTP mix (Thermo Fisher, R0193), 3.125 μM Oligo-dT30VN (IDT, 5’AGCAGTGGTATCAACGCAGAGTACT30VN-3’) and 1:600,000 ERCC RNA spike-in mix (Thermo Fisher, 4456740)) into 384-well hard-shell PCR plates (Biorad HSP3901) using a Tempest or Mantis liquid handler (Formulatrix). 96 well lysis plates were also prepared with 4 μl lysis buffer. All plates were sealed with AlumaSeal CS Films (Sigma-Aldrich Z722634) or Microseal ‘F’ PCR plate seal (Biorad MSF 1001) and then spun down for 1 minute at 3220 xg and snap frozen on dry ice. Plates were stored at −80°C until used for sorting.
FACS sorting
After dissociation, single cells from each tissue were isolated in 384 or 96-well plates via Fluorescence Activated Cell Sorting (FACS). Most tissues were sorted into 384-well plates using SH800S (Sony) sorters. Heart and liver were sorted into 96-well plates and cardiomyocytes were hand-picked into 96-well plates. Skeletal muscle and diaphragm were sorted into 384-well plates on an Aria III (Becton Dickinson) sorter. Most tissues used combinations of fluorescent antibodies to enrich for the presence of known rare cell populations (see tissue section below), but some tissues were simply sorted into plates after removal of dead cells and small and large debris. For all samples, an initial gate was set to exclude small debris, platelets and cell aggregates, and most tissues included a forward scattering gate to select for single cells. One color channel was used in most tissues to stain and exclude dead cells and high prevalence cells that would otherwise dominate the cell population. Color compensation was used whenever necessary. Cells were sorted using the highest purity setting on the sorter (“Single cell” in the case of the SH800S) for all but the rarest cell types, for which the “Ultrapure” setting was used. Sorters were calibrated using FACS buffer before collecting cells from any tissue and after every 8 sorted plates to ensure accurate dispensing into plate wells. For a typical sort, a tube containing 1-3 ml of the pre-stained cell suspension was filtered, vortexed gently and loaded onto the FACS machine. A small number of cells were flowed at low pressure to check cell concentration and amount of debris. Then the pressure was adjusted, flow was paused, the first destination plate was unsealed and loaded, and single-cell sorting started. If a cell suspension was too concentrated, it was diluted using FACS buffer or 1X PBS. For some cell types (e.g. hepatocytes), 96-well plates were used when it was not possible to sort individual cells accurately into 384-well plates. Immediately after sorting, plates were sealed with a pre-labeled aluminium seal, centrifuged and flash frozen on dry ice. On average, each 384-well plate took around 8 minutes to sort.
cDNA synthesis and library preparation
cDNA synthesis was performed using the Smart-seq2 protocol [1,2]. Briefly, 384-well plates containing single-cell lysates were thawed on ice followed by first strand synthesis. 0.6 μl of reaction mix (16.7 U/μl SMARTScribe Reverse Transcriptase (Takara Bio, 639538), 1.67 U/μl Recombinant RNase Inhibitor (Takara Bio, 2313B), 1.67X First-Strand Buffer (Takara Bio, 639538), 1.67 μM TSO (Exiqon, 5’-AAGCAGTGGTATCAACGCAGACTACATrGrG+G-3’), 8.33 mM DTT (Bioworld, 40420001-1), 1.67 M Betaine (Sigma, B0300-5VL), and 10 mM MgCl2 (Sigma, M1028-10X1ML)) was added to each well using a Tempest liquid handler. Reverse transcription was carried out by incubating wells on a ProFlex 2x384 thermal-cycler (Thermo Fisher) at 42°C for 90 min and stopped by heating at 70°C for 5 min.
Subsequently, 1.5 μl of PCR mix (1.67X KAPA HiFi HotStart ReadyMix (Kapa Biosystems, KK2602), 0.17 μM IS PCR primer (IDT, 5’-AAGCAGTGGTATCAACGCAGAGT-3’), and 0.038 U/μl Lambda Exonuclease (NEB, M0262L)) was added to each well with a Mantis liquid handler (Formulatrix), and second strand synthesis was performed on a ProFlex 2x384 thermal-cycler by using the following program: 1. 37°C for 30 minutes, 2. 95°C for 3 minutes, 3. 23 cycles of 98°C for 20 seconds, 67°C for 15 seconds, and 72°C for 4 minutes, and 4. 72°C for 5 minutes. The amplified product was diluted with a ratio of 1 part cDNA to 10 parts 10mM Tris-HCl (Thermo Fisher, 15568025), and concentrations were measured with a dye-fluorescence assay (Quant-iT dsDNA High Sensitivity kit; Thermo Fisher, Q33120) on a SpectraMax i3x microplate reader (Molecular Devices). Sample plates were selected for downstream processing if the mean concentration of blanks - ERCC-containing, non-cell wells - was greater than 0 ng/μl, and, after linear regression of the values obtained from the Quant-iT dsDNA standard curve, the R2 value was greater than 0.98. Sample wells were then selected if their cDNA concentrations were at least one standard deviation greater than the mean concentration of the blanks. These wells were reformatted to a new 384-well plate at a concentration of 0.3 ng/μl and final volume of 0.4 μl using an Echo 550 acoustic liquid dispenser (Labcyte).
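The plate and well selection criteria above can be expressed as a short NumPy sketch. This is an illustration under stated assumptions, not the script used in the study: the function names are hypothetical, the standard curve is fitted as fluorescence against known concentration, and the "one standard deviation" is taken to be the sample standard deviation of the blank wells, which the text does not specify.

```python
import numpy as np

def fit_standard_curve(known_conc, fluorescence):
    """Fit fluorescence = slope * concentration + intercept and return R^2."""
    known_conc = np.asarray(known_conc, dtype=float)
    fluorescence = np.asarray(fluorescence, dtype=float)
    slope, intercept = np.polyfit(known_conc, fluorescence, 1)
    predicted = slope * known_conc + intercept
    ss_res = np.sum((fluorescence - predicted) ** 2)
    ss_tot = np.sum((fluorescence - fluorescence.mean()) ** 2)
    return slope, intercept, 1.0 - ss_res / ss_tot

def select_wells(sample_conc, blank_conc, r_squared, min_r2=0.98):
    """Boolean mask of sample wells that pass the plate and well criteria.

    Plate criterion: mean blank concentration > 0 and standard-curve R^2 > min_r2.
    Well criterion : concentration >= blank mean + one blank standard deviation
                     (assumption: SD of the blanks, computed with ddof=1).
    """
    sample_conc = np.asarray(sample_conc, dtype=float)
    blank_conc = np.asarray(blank_conc, dtype=float)
    blank_mean = blank_conc.mean()
    if not (blank_mean > 0 and r_squared > min_r2):
        # the whole plate fails, so no wells are carried forward
        return np.zeros(sample_conc.shape, dtype=bool)
    threshold = blank_mean + blank_conc.std(ddof=1)
    return sample_conc >= threshold

# Toy plate: a perfectly linear standard curve, three blanks, three sample wells (ng/μl).
slope, intercept, r2 = fit_standard_curve([0, 1, 2, 4], [10, 110, 210, 410])
passing = select_wells([0.25, 0.5, 1.0], [0.1, 0.2, 0.3], r2)
```

In this toy plate the blanks average 0.2 ng/μl with a standard deviation of 0.1 ng/μl, so only wells at or above roughly 0.3 ng/μl are kept.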
Illumina sequencing libraries were prepared as described in Darmanis et al. 2015.13 Briefly, tagmentation was carried out on double-stranded cDNA using the Nextera XT Library Sample Preparation kit (Illumina, FC-131-1096). Each well was mixed with 0.8 μl Nextera tagmentation DNA buffer (Illumina) and 0.4 μl Tn5 enzyme (Illumina), then incubated at 55°C for 10 min. The reaction was stopped by adding 0.4 μl “Neutralize Tagment Buffer” (Illumina) and spinning at room temperature in a centrifuge at 3220xg for 5 min. Indexing PCR reactions were performed by adding 0.4 μl of 5 μM i5 indexing primer, 0.4 μl of 5 μM i7 indexing primer, and 1.2 μl of Nextera NPM mix (Illumina). PCR amplification was carried out on a ProFlex 2x384 thermal cycler using the following program: 1. 72°C for 3 minutes, 2. 95°C for 30 seconds, 3. 12 cycles of 95°C for 10 seconds, 55°C for 30 seconds, and 72°C for 1 minute, and 4. 72°C for 5 minutes.
Library pooling, quality control, and sequencing
Following library preparation, wells of each library plate were pooled using a Mosquito liquid handler (TTP Labtech). Pooling was followed by two purifications using 0.7x AMPure beads (Fisher, A63881). Library quality was assessed using capillary electrophoresis on a Fragment Analyzer (AATI), and libraries were quantified by qPCR (Kapa Biosystems, KK4923) on a CFX96 Touch Real-Time PCR Detection System (Biorad). Plate pools were normalized to 2 nM and equal volumes from 10 or 20 plates were mixed together to make the sequencing sample pool. PhiX control library was spiked in at 0.2% before sequencing.
Sequencing libraries from 384-well and 96-well plates
Libraries were sequenced on the NovaSeq 6000 Sequencing System (Illumina) using 2 x 100bp paired-end reads and 2 x 8bp or 2 x 12bp index reads with either a 200 or 300-cycle kit (Illumina, 20012861 or 20012860).
Microfluidic droplet single cell analysis
Single cells were captured in droplet emulsions using the GemCode Single-Cell Instrument (10x Genomics, Pleasanton, CA, USA) and scRNA-seq libraries were constructed as per the 10x Genomics protocol using the GemCode Single-Cell 3’ Gel Bead and Library V2 Kit. Briefly, single cell suspensions were examined using an inverted microscope and, if sample quality was deemed satisfactory, the sample was diluted in PBS/2% FBS to achieve a target concentration of 1000 cells/μl. If cell suspensions contained cell aggregates or debris, two additional washes in PBS/2% FBS at 300 x g for 5 min at 4°C were performed. Cell concentration was measured either with a Moxi GO II (Orflo Technologies) or a hemocytometer. Cells were loaded in each channel with a target output of 5,000 cells per sample. All reactions were performed in a Biorad C1000 Touch Thermal cycler with 96-Deep Well Reaction Module. 12 cycles were used for cDNA amplification and sample index PCR. Amplified cDNA and final library were evaluated on a Fragment Analyzer using a High Sensitivity NGS Analysis Kit (Advanced Analytical). 10x cDNA libraries were quantitated for average fragment length using a 12 or 96 capillary Fragment Analyzer (Advanced Analytical) and by qPCR with the Kapa Library Quantification kit for Illumina. Each library was diluted to 2 nM and equal volumes of 16 libraries were pooled to fill each NovaSeq sequencing run. Pools were sequenced with 100 cycle run kits with 26 bases Read 1, 8 bases Index 1 and 90 bases Read 2 (Illumina 20012862). PhiX control library was spiked in at 0.2 to 1%. Libraries were sequenced on the NovaSeq 6000 Sequencing System (Illumina).
Data Processing
Sequences from the NovaSeq were demultiplexed using bcl2fastq version 2.19.0.316. Reads were aligned to a copy of the mm10 genome augmented with ERCC sequences, using STAR version 2.5.2b with the following options: --outFilterType BySJout --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outSAMstrandField intronMotif --outSAMtype BAM Unsorted --outSAMattributes NH HI NM MD --genomeLoad LoadAndKeep --outReadsUnmapped Fastx --readFilesCommand zcat
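For readers who script their alignments, the option list above can be assembled into a STAR invocation as in the following sketch. Each flag is written as a single token, and the intron maximum is read as 1000000. The index directory, FASTQ file names and output prefix are hypothetical placeholders, and `--genomeDir`, `--readFilesIn` and `--outFileNamePrefix` are standard STAR arguments that the text does not list but that any run needs.

```python
import shlex

def star_command(genome_dir, read1, read2, out_prefix):
    """Build the STAR argument list for one paired-end library."""
    return [
        "STAR",
        "--genomeDir", genome_dir,            # mm10 + ERCC index (placeholder path)
        "--readFilesIn", read1, read2,        # gzipped paired-end FASTQ files
        "--outFileNamePrefix", out_prefix,
        "--outFilterType", "BySJout",
        "--outFilterMultimapNmax", "20",
        "--alignSJoverhangMin", "8",
        "--alignSJDBoverhangMin", "1",
        "--outFilterMismatchNmax", "999",
        "--outFilterMismatchNoverLmax", "0.04",
        "--alignIntronMin", "20",
        "--alignIntronMax", "1000000",
        "--alignMatesGapMax", "1000000",
        "--outSAMstrandField", "intronMotif",
        "--outSAMtype", "BAM", "Unsorted",
        "--outSAMattributes", "NH", "HI", "NM", "MD",
        "--genomeLoad", "LoadAndKeep",
        "--outReadsUnmapped", "Fastx",
        "--readFilesCommand", "zcat",         # decompress FASTQ on the fly
    ]

cmd = star_command("mm10_ercc_index", "cell_R1.fastq.gz", "cell_R2.fastq.gz", "cell_")
print(shlex.join(cmd))  # the list can also be passed directly to subprocess.run
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.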
Gene counts were produced using HTSeq version 0.6.1p1 with default parameters, except that stranded was set to false and mode was set to intersection-nonempty.
Sequences from the microfluidic platform were demultiplexed and aligned using CellRanger with default parameters.
Clustering
Standard procedures for filtering, variable gene selection, dimensionality reduction, and clustering were performed using the Seurat package. For each tissue and each sequencing method (FACS and microfluidic), the following steps were performed. (The parameters used were tuned on a per-tissue basis. Values appear in the supplement.)
1. Genes appearing in fewer than 5 cells were excluded.
2. Cells with fewer than 500 genes were excluded. Cells with fewer than 50,000 reads (FACS) or 1000 UMI (microfluidic) were excluded. (In some organs, cells with more than 2 million reads were also excluded as a conservative measure to avoid doublets.)
3. Counts were log-normalized (log(1 + counts per N)), then scaled by linear regression against the number of reads (or UMIs), the percent of reads mapping to Rn45s, and the percent of reads mapping to ribosomal genes.
4. Variable genes were selected using a threshold for dispersion (log of variance/mean). The distribution of dispersion in each expression bin was computed, and variable genes were those with a z-score above a cutoff of between 0.5 and 1, depending on the tissue.
5. The variable genes were projected onto a low-dimensional subspace using principal component analysis. The number of principal components was selected based on inspection of the plot of variance explained.
6. A shared-nearest-neighbors graph was constructed based on the Euclidean distance metric in the low-dimensional subspace. (The SNN is a weighted graph where w_ij is the Jaccard similarity of the k-neighborhood of i with the k-neighborhood of j, where k = 30.) Cells were clustered using the Louvain method to maximize modularity.
7. Cells were visualized using a 2-dimensional t-distributed Stochastic Neighbor Embedding on the same distance metric.
8. Cell types were assigned to each cluster using the abundance of known marker genes. (Plots showing the expression of the markers for each tissue appear in the supplement.)
9. When clusters appeared to be mixtures of cell types, they were refined either by increasing the resolution parameter for Louvain clustering or by subsetting the data and rerunning steps 3-7.
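The filtering, normalization and variable gene selection steps listed above can be sketched in NumPy as follows. This is a simplified illustration rather than the Seurat implementation used in the study. The function names are hypothetical, both filters are computed on the unfiltered matrix, the scale factor N is left as a parameter, and the expression bins are taken to be equal-width bins of mean expression; these are assumptions.

```python
import numpy as np

def filter_counts(counts, min_cells=5, min_genes=500, min_counts=50_000):
    """Filter a genes x cells matrix of raw counts.

    Genes detected in fewer than `min_cells` cells and cells with fewer than
    `min_genes` detected genes or fewer than `min_counts` total counts are dropped.
    """
    counts = np.asarray(counts, dtype=float)
    detected = counts > 0
    gene_mask = detected.sum(axis=1) >= min_cells
    cell_mask = (detected.sum(axis=0) >= min_genes) & (counts.sum(axis=0) >= min_counts)
    return counts[np.ix_(gene_mask, cell_mask)], gene_mask, cell_mask

def log_normalize(counts, scale=1e6):
    """log(1 + counts per `scale`) for each cell (column); totals must be nonzero."""
    totals = counts.sum(axis=0, keepdims=True)
    return np.log1p(counts / totals * scale)

def dispersion_zscores(values, n_bins=20):
    """z-score of log(variance / mean) within equal-width bins of mean expression."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=1)
    var = values.var(axis=1, ddof=1)
    expressed = (mean > 0) & (var > 0)
    z = np.full(mean.shape, np.nan)
    if not expressed.any():
        return z
    disp = np.full(mean.shape, np.nan)
    disp[expressed] = np.log(var[expressed] / mean[expressed])
    lo, hi = mean[expressed].min(), mean[expressed].max()
    width = (hi - lo) / n_bins
    if width > 0:
        bin_idx = np.minimum(((mean - lo) / width).astype(int), n_bins - 1)
    else:
        bin_idx = np.zeros(mean.shape, dtype=int)
    for b in range(n_bins):
        in_bin = expressed & (bin_idx == b)
        if in_bin.sum() > 1:
            sd = disp[in_bin].std(ddof=1)
            if sd > 0:
                z[in_bin] = (disp[in_bin] - disp[in_bin].mean()) / sd
    return z

# Toy example with relaxed thresholds so that a 3 x 3 matrix survives filtering.
toy = np.array([[1, 1, 1], [0, 0, 0], [5, 0, 2]])
filtered, gene_mask, cell_mask = filter_counts(toy, min_cells=2, min_genes=1, min_counts=2)
normalized = log_normalize(filtered, scale=10)
```

Variable genes would then be those whose z-score exceeds the per-tissue cutoff, for example `dispersion_zscores(values) >= 0.5`. Genes that are alone in their bin receive NaN and are never selected in this sketch.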
A similar analysis was done globally for all FACS processed cells and for all microfluidic processed cells.
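The shared-nearest-neighbors weights used in the clustering step can be written out explicitly. The sketch below computes w_ij as the Jaccard similarity of the k-neighborhoods of cells i and j from Euclidean distances, using a dense distance matrix that is only practical for small inputs. It is an illustration, not the Seurat implementation: the function names are hypothetical, a cell is assumed not to be a member of its own neighborhood, and the Louvain clustering itself is not shown.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours (Euclidean) of each row, excluding the row itself."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a cell is never its own neighbour (assumption)
    return np.argsort(dist, axis=1)[:, :k]

def snn_weights(points, k=30):
    """w_ij = |N_k(i) & N_k(j)| / |N_k(i) | N_k(j)| for cells in a low-dimensional space."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    if k >= n:
        raise ValueError("k must be smaller than the number of cells")
    neighbours = knn_indices(points, k)
    membership = np.zeros((n, n), dtype=int)
    membership[np.arange(n)[:, None], neighbours] = 1   # row i marks N_k(i)
    shared = membership @ membership.T                  # size of the intersection
    union = 2 * k - shared                              # each neighbourhood has exactly k members
    weights = shared / union
    np.fill_diagonal(weights, 0.0)                      # no self-edges
    return weights

# Toy example: two well-separated groups of three points on a line, k = 2.
toy_points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
w = snn_weights(toy_points, k=2)
```

In the toy example, cells within a group share neighbours and receive positive weights, while cells from different groups share none and receive weight 0. This is the structure that modularity-based clustering then exploits.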
To attempt to remove the effect of technical variation from the gene expression matrix, it is common practice to regress out statistics associated with quality, like the number of reads or the percent of ribosomal or mitochondrial RNA. If those statistics differ systematically between cell types or tissues, this can have the unwanted effect of compressing the true variation between those groups. For example, the percent of ribosomal RNA varies from 1% in the Liver and Pancreas to 5% in the Spleen and Tongue. Hence we regress out these factors separately in the analysis of each tissue.
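A minimal sketch of regressing out quality covariates separately within each tissue is given below, using ordinary least squares in NumPy. The function names and the input layout (cells in rows) are hypothetical, and the sketch stands in for, rather than reproduces, the regression performed by the Seurat package.

```python
import numpy as np

def regress_out(expr, covariates):
    """Residuals of each gene after a least-squares fit on the covariates.

    expr       : cells x genes matrix of normalized expression
    covariates : cells x p matrix (e.g. number of reads, percent ribosomal reads)
    """
    expr = np.asarray(expr, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    design = np.column_stack([np.ones(len(covariates)), covariates])  # add intercept
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ coef

def regress_out_per_tissue(expr, covariates, tissue_labels):
    """Apply `regress_out` independently to the cells of each tissue."""
    expr = np.asarray(expr, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    tissue_labels = np.asarray(tissue_labels)
    residuals = np.empty_like(expr)
    for tissue in np.unique(tissue_labels):
        rows = tissue_labels == tissue
        residuals[rows] = regress_out(expr[rows], covariates[rows])
    return residuals

# Toy example: ribosomal fraction differs systematically between two tissues.
rng = np.random.default_rng(1)
labels = np.array(["Liver"] * 4 + ["Spleen"] * 4)
ribo_fraction = np.array([0.010, 0.012, 0.009, 0.011, 0.050, 0.052, 0.049, 0.051])
toy_expr = rng.normal(size=(8, 3))
residuals = regress_out_per_tissue(toy_expr, ribo_fraction, labels)
```

Because the fit is made within each tissue, the large between-tissue difference in the covariate never enters the regression. The residuals are centered within each tissue, so they are suited to within-tissue analysis only.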
For the “dynamic range” comparison in Supplementary Figure 4, the gene expression was scaled to log of reads per million for FACS and log of UMI per 10,000 for microfluidic emulsions. Genes with nonzero expression were rank ordered and plotted.
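The scaling and rank ordering described above can be sketched as follows. The function name is hypothetical, and the use of log10 without a pseudocount and the application to a single vector of per-gene counts are assumptions, since the text specifies neither the base of the logarithm nor the level of aggregation.

```python
import numpy as np

def ranked_log_expression(counts, scale):
    """Rank-ordered log10 expression of the nonzero genes in one count vector.

    counts : 1-D array of per-gene reads (FACS) or UMIs (microfluidic)
    scale  : 1e6 for reads per million, 1e4 for UMI per 10,000
    """
    counts = np.asarray(counts, dtype=float)
    scaled = counts / counts.sum() * scale
    nonzero = scaled[scaled > 0]
    return np.sort(np.log10(nonzero))[::-1]  # highest expression first

# Toy example: four genes, one of them undetected.
ranked = ranked_log_expression([0, 10, 90, 900], scale=1e6)
dynamic_range = ranked[0] - ranked[-1]  # log10 span between top and bottom detected gene
```

Plotting `ranked` against rank for each method gives a curve whose vertical extent reflects the dynamic range. In the toy example the span is log10(900/10), just under two orders of magnitude.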
Acknowledgements
This work was funded by the Chan Zuckerberg Biohub, the National Institute on Aging (DP1AG053015, T.W.-C), the Veterans Administration (T.W.-C.), and the Stanford Diabetes Research Center (NIH DK116074, S.K.K.). We thank Sony Biotechnology for making an SH800S instrument available for this project. Some cell sorting/flow cytometry analysis for this project was done on a Sony SH800S instrument in the Stanford Shared FACS Facility. Some fluorescence activated cell sorting (FACS) was done with instruments in the VA Flow Cytometry Core, which is supported by the US Department of Veterans Affairs (VA), Palo Alto Veterans Institute for Research (PAVIR), and the National Institutes of Health (NIH).