A single-cell atlas of the human healthy airways

Rationale The respiratory tract constitutes an elaborated line of defense based on a unique cellular ecosystem. Single-cell profiling methods enable the investigation of cell population distributions and transcriptional changes along the airways. Methods We have explored cellular heterogeneity of the human airway epithelium in 10 healthy living volunteers by single-cell RNA profiling. 77,969 cells were collected by bronchoscopy at 35 distinct locations, from the nose to the 12th division of the airway tree. Results The resulting atlas is composed of a high percentage of epithelial cells (89.1%), but also immune (6.2%) and stromal (4.7%) cells with peculiar cellular proportions in different sites of the airways. It reveals differential gene expression between identical cell types (suprabasal, secretory, and multiciliated cells) from the nose (MUC4, PI3, SIX3) and tracheobronchial (SCGB1A1, TFF3) airways. By contrast, cell-type specific gene expression was stable across all tracheobronchial samples. Our atlas improves the description of ionocytes, pulmonary neuro-endocrine (PNEC) and brush cells, which are likely derived from a common population of precursor cells. We also report a population of KRT13 positive cells with a high percentage of dividing cells which are reminiscent of “hillock” cells previously described in mouse. Conclusions Robust characterization of this unprecedented large single-cell cohort establishes an important resource for future investigations. The precise description of the continuum existing from nasal epithelium to successive divisions of lung airways and the stable gene expression profile of these regions better defines conditions under which relevant tracheobronchial proxies of human respiratory diseases can be developed.


Introduction
The prevalence of chronic respiratory diseases is thought to increase in the future by raising exposure to diverse atmospheric contaminants (pollution, allergens, smoking). The respiratory tract constitutes an elaborated line of defense based on a unique cellular ecosystem. Thus, secretory and multiciliated cells form a self-clearing mechanism that efficiently removes inhaled particles from upper airways, impeding their transfer to deeper lung zones. Several mechanical filters (nose, pharynx, ramified structure of the lung airways) further limit the flux of pathogens and inhaled particles downwards the bronchial tree. While nose and bronchus are sharing many cellular properties, which has led to the definition of a pathophysiological continuum in allergic respiratory diseases (1,2), they differ by features such as host defense against viruses, oxidative stress (3), or anti-bacterial mechanisms (4). Bulk RNA sequencing has indeed identified differences in gene expression between nose and bronchi (5).

Methods
The atlas of the airway epithelium (nose to approximately 12 th division of bronchi) was obtained from biopsies and brushings from 10 healthy non-smoking volunteers. Each donor was sampled 4-5 times in different regions of upper (nose) and lower airways (tracheal, intermediate, distal bronchi), located in different lobes ( Figure E1, Tables E1-2). Single-cell capture was carried out using the 10X Genomics Chromium device (3' V2). Large integrative analysis of the 35 samples composing the atlas was done using the fastMNN R package (6) and the following analysis was done using the Scanpy framework (Python) (7) to provide robust cell type annotation. Additional differential gene expression analysis was done using the edgeR R package (8) to investigate both cell distributions and gene expression heterogeneity along the airways. GSEA, trajectory inference (PAGA) (7) and gene network inference (GRNBoost2) (7) were also performed to characterize further the identified cell populations.
Results were validated using RNAscope and immunostainings. Additional details on the methods are provided in an online data supplement. populations could not be clustered separately and essentially differed by the level of expression of MUC5AC and MUC5B ( Figure E4) (12). We detected clusters of multiciliated cells (expressing high levels of FOXJ1, TPPP3, and SNTN) and deuterosomal cells, which correspond to precursors of multiciliated cells and express several specific markers: CCDC67/DEUP1, FOXN4 and CDC20B ( Figure 1C and 1D) (12,13). The suprabasal, secretory and multiciliated clusters each comprised a sub-cluster of cells that were only detected in nasal samples. These clusters were labelled "Suprabasal N", "Secretory N" and "Multiciliated N" and will be described later in the manuscript. Two cell types were associated with submucosal glands: serous cells (expressing high levels of LTF, LYZ and PIP) and mucous cells (expressing high levels of MUC5B but no MUC5AC) ( Figure 1C and 1D). Finally, we identified 222 cells belonging to clusters of rare epithelial cells (0.3% of the cells) ( Figure 1C and 1D). Incidentally, we detected the presence of some alveolar cells: 10 type I (AT1) and 11 type II (AT2) pneumocytes, which were all derived from a unique distal brushing. AT1 expressed HOPX, AGER, SPOCK2; AT2 expressed SFTPA, SFTPB, SFTPC and SFTPD ( Figure 3B, Table E2, Figure E5).

Immune cells: annotation and distribution along the respiratory tree
We clustered the 4891 immune cells in 7 distinct cell types ( Figure 1C-1E, Figure E6A). Four clusters of myeloid cells were found: (i) macrophages and (ii) monocytes, mostly detected in distal brushings; (iii) mast cells, mostly detected in distal brushing as well and at a lesser extent in tracheal and intermediate bronchial biopsies; (iv) dendritic cells, found everywhere. We also identified 3 clusters of lymphoid cells: T cells were found in all samples; plasma cells were exclusively found in biopsies, in line with an interstitial localization and B cells were mostly detected in distal airway brushings ( Figure 1F, Figure E1B-D). The gene regulatory network was further characterized with GRNboost2, a program that infers regulatory unit activity (14) ( Figure E6B). In the lymphoid lineage, we were able to discriminate B cells (expressing high levels of MS4A1 and LTB, and high PAX5 inferred activity) from plasma cells (expressing high levels of IGJ and MZBI, and high IRF4 inferred activity) ( Figure 1D, Figure E6A, B). T cells and related subtypes, that our analysis did not separate well, were characterized by a high and specific transcriptional activity of the XCL1 and CD3D regulatory units ( Figure E6B, E6C and E6E).  Figure 1D, Figure E7C). Based on a specific expression of markers such as RERGL, MCAM and PDGFRB, we also identified pericytes, a population of peri-endothelial mesenchymal cells with contractile properties that are located on the vascular basement membrane of capillaries (15,16). Pericytes also share with smooth muscle cells markers such as ACTA2 and MYL9 ( Figure 1D, Figure E7D).

Large variations in the composition of epithelial cells distinguish nasal and tracheobronchial airways
We then compared the epithelial composition in each of the 5 types of samples. We noticed a large effect produced by the sampling mode on the distribution of cells: brushings collected more luminal cell types, such as multiciliated or secretory cells, while forceps biopsies collected cells located deeper in the tissue such as basal, stromal cells, and submucosal gland cells ( Figure 1F, Figure E1B, Table E2). All following comparisons were then performed on samples obtained with similar sampling methods.
Tracheal and intermediate bronchial biopsies shared very similar cell type distributions, with few differences between biopsies taken from upper, middle and lower lobes ( Figure 1F, Figure   E1B). The most striking variation was for submucosal gland cells (serous and mucous cells).
Their detection in 3 out of 3 nasal biopsies, 3 out of 9 tracheal biopsies and 0 out of 9 intermediate biopsies ( Figure E1B) suggests a larger density of glands in the nose, and a progressive decline in smaller airways, as previously described (17)(18)(19)(20). Comparison between nasal and distal brushing samples also showed a clear enrichment of secretory cells in nasal samples, and an enrichment of multiciliated cells in distal samples ( Figure 1F, Figure E1B). In order to characterize qualitative differences between nasal and tracheobronchial compartments, we assessed the correlations in average gene expression between each epithelial cell type. We found stronger correlations (>0.9) between cells belonging to the same cell type, in a donor-independent manner, than between cells belonging to distinct cell types ( Figure 2A), confirming that cell type identity was well conserved across samples ( Figure E8). This analysis also revealed nasal-specific and tracheobronchial-specific sub-clusters for suprabasal, secretory and multiciliated cells ( Figures 1C, 1D, 2A, Figure E8), characterized by differentially expressed genes. We found a higher number of overlapping genes between suprabasal, secretory and multiciliated cell types from the nasal epithelium than between the same cell types from the tracheobronchial epithelium ( Figure 2B and 2C). Among the top 22 genes shared by all 3 cell types in nasal samples were SIX3, PAX7 and FOXG1 ( Figure E9), which have well-reported roles in the eye, neural and/or neural crest-derived development (21)(22)(23).
We then compared gene expression between secretory cells from nasal or tracheobronchial origins ( Figure 2D). A total of 543 genes were found up-regulated in the nose with a FC>1.5, logCPM >1.5 and p-value <0.05, whereas 499 genes were found up-regulated in the trachea/bronchi with identical thresholds (Table E3). We noticed an enriched expression of SCGB1A1, SCGB3A1, KLK11 and SERPINF1 in the bronchi, and of LYNX1, S100A4, CEACAM5, LYPD2, PI3 and MUC4 in the nose ( Figure 2D, Figure E9). Immunostainings confirmed the nasal expression of PI3 and MUC4 on independent brushing and biopsies ( Figure 2F). Thirty seven additional transcripts were confirmed, based on a comparison with the Protein Atlas database (24) (Table E4). First insights about distinctive functional properties were obtained after a gene set enrichment analysis. In bronchi, functional terms related to defense and innate immune response to aggression were found. In nasal epithelium, enrichment in terms related to differentiation and motility supports the existence of a higher cellular turnover ( Figure 2E, Table E5). Several regulatory units were associated with secretory cells of the nasal epithelium, such as MESP1, reported as a regulator of somitic mesoderm epithelialization (25).
AHR, the aryl hydrocarbon receptor, which contributes to adaptive and innate responses by inducing the expression of several xenobiotic-metabolizing enzymes, and STAT1, a transcription factor that acts downstream to the interferon pathway, were both enriched in the nasal tissue. FOXA3 regulatory unit, which promotes goblet metaplasia in mouse and induces MUC5AC and SPDEF expression (26,27), was enriched in tracheobronchial samples.
Intriguingly, dissociated nasal cells appeared larger. There was a proximo-distal gradient of cell size, with the largest average size in the nose (12.56 µm ± 0.71) and the smallest size in the distal airways (8.77 µm ± 0.71) ( Figure E2C and E2D). This difference correlated with the number of detected genes ( Figure E2A and E2B).

Identification of rare epithelial cells along the human airways
We identified 13 brush/tuft cells according to their high expression of LRMP and RGS13 (12,28,29) (Figure 3A-3C). We also noticed in these cells a specific activity of HOXC5, HMX2, and ANXA4 regulatory units ( Figure E10A).
A cluster of 29 pulmonary neuroendocrine cells (PNECs) ( Figure 3A) was found, mostly in tracheal and intermediate biopsies ( Figure 3B). PNECs expressed the neurotransmitterassociated genes PCSK1N, SCGN and NEB ( Figure 3C) and we identified HOXB1, ASCL1, and  Figure E10B). This profile suggests a precursor role for these cells.
A last group of rare cells, already described in primary cultures (12) and in asthmatic patients (30) named multiciliating-goblet cells, corresponded to cells expressing both goblet and multiciliated cell markers. In our dataset, around 60 cells were found positive for both FOXJ1 and MUC5AC. They were equally distributed between the secretory and the multiciliated cell clusters. We have used SoupX to remove gene counts that may emerge from cell-free mRNA contamination, thus avoiding interference with the quantification of multiciliating-goblet cells ( Figure 3E). We confirmed the presence of these cells using RNAscope in situ RNA hybridization ( Figure 3F). When these cells were superimposed in a PAGA representation of tracheobronchial cell lineages, we noticed that they were located close to multiciliated cells, while in a similar representation of nasal cell lineages, they were located between secretory and deuterosomal cells, nearer to these latter ( Figure E10B). This result supports our previous description of goblet cells as precursors of multiciliated cells in homeostatic and healthy epithelium and additionally suggests that transition through this stage may have slightly distinct dynamics between nasal and tracheobronchial epithelia (12).

Cell proliferation within homeostatic airways
Before batch correction, we identified a cluster of cycling cells, defined by the expression of  (12). Figure 4D shows UMAP graphs for the subgroup of cells that belonged to the bona fide cycling cluster with a superimposition of the cell cycle scores for G1, S and G2/M phases, which delineates each phase of the cell cycle inside the circular embedding ( Figure 4D). We noticed that the marker genes of this cycling population largely overlap with those of suprabasal cells ( Figure 4E), suggesting that in the homeostatic and healthy epithelium, suprabasal cells may be the main proliferating population in the epithelium. Labelling of bronchial epithelium sections with MKI67 antibody confirmed the presence of MKI67+/KRT5+ cells that were located in a para/suprabasal position ( Figure 4F). Cycling cells were distributed all across the 35 samples, although with a highly variable distribution, which was reminiscent to the expression profile of KRT13 in suprabasal cells ( Figure 4G and 4H, Figure E11A). These KRT13-high samples also displayed the highest cycling cell proportion (more than 20% of cycling cells, Figure 4G). In situ RNA hybridization in nasal epithelium sections confirmed an association of MKI67 RNA with cells expressing KRT13 ( Figure E11B). This association between KRT13 expression and proliferation, together with the variability of detection of these cells, is highly reminiscent of the previous description of hillocks in mouse airway epithelium (29). We indeed confirmed the presence of KRT13+ cell clusters in nasal epithelium, with patterns very similar to those previously found in mouse ( Figure 4I).

Discussion
We have established a reference single-cell atlas of normal human airways after analyzing 35 fresh tissue samples collected by bronchoscopy in 10 healthy volunteers, resulting in a largescale gene expression profiling that also integrated spatial information of each sample. This approach was well adapted to collect samples from the nose to the mid airways but excluded the bronchiolar compartment and the parenchyma, for which alternative experimental approaches have already been proposed (30,31). Establishing a comprehensive airway atlas will result from the combination of our atlas with these other datasets. That saying, our approach provides a unique opportunity to build a single-cell gene expression resource based on well-characterized healthy volunteers which are rarely accessible in most large scale studies. The use of bronchoscopy, a minimally invasive approach to the airways, also creates a real opportunity to transfer rapidly novel information generated in the context of the Human Cell Atlas project to new clinical practices. Profiling of identical cell types across many sites of the airway tree has allowed us to quantify the frequencies of epithelial, submucosal-gland, immune and stromal cells, and has revealed an influence of the mode of sampling. However, this did not prevent us from defining stable core cell type signatures for each epithelial, stromal and immune cell types, irrespective of their anatomical location. In contrast, important variations of gene expression were found when comparing the same populations of suprabasal, secretory and multiciliated cells from the surface epithelium between nasal and tracheobronchial compartments. Some variations, which did not reach significance, were also found in basal cells. These results fit well with previous works reporting dozens of differentially expressed transcripts between nasal and bronchial brushings (32,33). We also tested whether additional gradients of expression could be detected, for instance between upper and lower lobes, or between anterior or posterior part of the trachea. While we noticed some trends, the observed differences, such as an increased number of T cells, B and dendritic cells in lower lobes, need additional measurements to be validated. Future functional analysis will be important to assess the impact of these variations on the biology of each cell type. Our gene set enrichment and inferred regulatory unit analysis suggest that nasal cells undergo higher epithelial regeneration, and have function in xenobiotic metabolism as well as in interferon signaling.
Interestingly, SIX3, PAX6-7, and OTX1/2, which we found to be specific of the nasal epithelium, are all associated with gene ontology terms such as "pattern specification process", and "axis specification" and have well-described functions during embryonic patterning of the head (22,(34)(35)(36). Expression of Six3 in murine ependymocytes, which are radial glia-derived multiciliated cells, is necessary for the maturation of these cells during postnatal stages of brain development (34). Hence, nasal-specific expression of developmental patterning genes might be the consequence of head vs. trunk differential developmental origins and may not necessarily confer specific functions to nasal epithelial cells. The underlying mechanisms that confer a persistence in the expression of these developmental hallmarks remain to be elucidated.
Our focus on secretory cells demonstrates that nasal ones contain few SCGB1A1 + -and SCGB3A1 + -cells. Despite this low secretoglobin content, they display the core gene signature of secretory cells, suggesting that secretoglobins may not be sufficient marker genes to identify all secretory/club cells. These differences are important to consider when using nasal samplings as a proxy to assess bronchial status.
Our atlas also provides a comprehensive description of novel cell types, such as the multiciliating-goblet cells and the undefined rare cells, which, given their gene expression profile, may be precursors for the ionocytes, PNECs and brush cells. This is, to our knowledge, the first scRNAseq identification of human PNECs and brush cells with distinct gene signatures.
We also identified KRT13 + -cells with a detection frequency and organization that are highly reminiscent of the mouse airway hillocks (29), providing the first identification in human airways of hillock structures, and further work is required to investigate their functions.
Altogether, this atlas improves the cellular stratification of gene expression profiles in healthy human airway epithelium. It now makes possible an extensive exploration of the various situations involved in homeostasis and regeneration of normal and pathological airways.

CellType
T c e l l s

Ethics statement
The study was approved by the Comité de Protection des Personnes Sud Est IV (approval number: 17/081) and informed written consent was obtained from all participants involved.
All experiments were performed during 8 months, in accordance with relevant guidelines and  Table   E1.

Dissociation of brushings (protocols.io qubdwsn)
The brush was soaked in a 5 mL Eppendorf containing 1 mL of dissociation buffer which was

Dissociation of bronchial biopsy (protocols.io x3efqje)
The biopsy was soaked in 1 mL dissociation buffer which was composed of DPBS, 10 mg/mL protease from Bacillus Licheniformis (Sigma-Aldrich, reference P5380) and 0.5 mM EDTA.
After 1 h, the biopsy was finely minced with a scalpel, and returned to dissociation buffer.
From this point, the dissociation procedure is the same as the one described in the "dissociation of brushings" section, with an incubation time increased to 1h. For the cell capture by the 10X genomics device, the cell concentration was adjusted to 500 cells/µl in HBSS aiming to capture 5000 cells. All steps were performed on ice.

Cytospins from brushings
Cells dissociated from brushings were cytocentrifuged at 72 g for 10 min onto SuperFrostTM Plus slides using a Shandon CytospinTM 4 cytocentrifuge. CytospinTM slides were fixed for 10 min in 4% paraformaldehyde at room temperature for further immunostaining.

Processing of nasal turbinates
Inferior turbinates were resected from patients who underwent surgical intervention for nasal obstruction or septoplasty (kindly provided by Professor Castillo, Pasteur Hospital, Nice, France supplemented HBSS, tissues were processed depending on the assay.

Whole mounts of nasal turbinate epithelium
The outer layer (approximatively 1.5-mm thick) of nasal turbinates was resected with the help of a scalpel blade allowing the recovery of the epithelium that covers the turbinates. Nasal epithelium was fixed in PFA 4% for 1 hour at room temperature then overnight at 4°. After two washes in PBS, the epithelium was permeabilized with 0.5% Triton X-100 in PBS, blocked in 0.3% BSA for 30 min. Primary antibodies were incubated for 24 hours at room temperature, washed in 0.3% Triton X-100 in PBS, incubated with appropriate secondary antibodies diluted in blocking buffer for 4 hours at room temperature, washed in 0.3% Triton X-100 in PBS, all the steps were performed in a shaker. The epithelium was then mounted between a slide and cover-slip using imaging spacers. Imaging of the samples was performed in a Confocal LSM780 Zeiss.

Cryostat section of nasal turbinate epithelium
Nasal turbinates were fixed in paraformaldehyde 4% at 4°C overnight then extensively rinsed with phosphate-buffered saline (PBS). Fixed tissues where then prepared for cryo-embedding for cryostat sectioning. Tissue was embedded in optimal cutting temperature (OCT) medium (Thermo Fisher Scientific) at room temperature and then frozen by contact with liquid nitrogen. 10 µm-thick frozen tissue sections were obtained with a cryostat Leica CM3050S on Superfrost Plus® Gold slide (Thermo Scientific). Sections were kept at -80°C with desiccant for few weeks until use for RNAscope protocol.

Pretreatment Protocol
For cryostat tissue sections, the manufacturer's protocol for fixed frozen tissues described in user manual RNAscope® Multiplex Fluorescent Reagent Kit v2 Assay (Cat. No. 323100, Advanced Cell Diagnostics, lnc., USA) was followed. To avoid tissue section detachment from slides, the target retrieval step was replaced by an increased protease III incubation time to 45 min. For cytospin samples, the cell pretreatment described in ACD technical note MK-50 010 was followed. As red blood cell lysis has been performed during cell dissociation, hydrogen peroxide treatment step was skipped in further pretreatment. Protease III was incubated 30 min without dilution as cytospin cells are fixed with paraformaldehyde to follow the same pretreatment condition described by ACD for fixed frozen tissue (Cat. No. 323100, Advanced Cell Diagnostics, lnc., USA).

RNAscope Assay
After pretreatment, for both sections and cytospin samples, we followed manufacturer's The images were captured by a Zeiss LSM780 confocal microscope.

Immunostaining of paraffin sections
Sections were deparaffinized, an antigen retrieval treatment was performed using citrate buffer at pH6. Sections and cytospins were permeabilized with 0.5% Triton X-100 in PBS for 10 min, a following blocking treatment was performed with 3% BSA in PBS for 30 min. The incubation with primary antibodies was carried out at 4°C overnight. Incubation with secondary antibodies was carried out during 1h at room temperature. Nuclei were stained with 4,6-diamidino-2-phenylindole (DAPI). Primary and secondary antibodies information represented in supplementary table E6. When necessary, KRT5 antibody was directly coupled to CF488 with the Mix-n-Stain kit (Sigma-Aldrich) according to the manufacturer's instruction.
Coupled primary antibody was applied for 2 hours at room temperature after secondary antibodies had been extensively washed and after a 30 min blocking stage in 3% normal rabbit serum in PBS. Imaging of the samples was performed using a Confocal FV-10 from Olympus. parameters and human build hg19. All single-cell datasets that we generated, and the corresponding quality metrics are displayed in Table E1.

Primary data analysis
Initially, each dataset was roughly analysed using Seurat (v3) [1] to determine the best analysis workflow needed for the merged dataset. Permissive filtering was done on low-quality cells followed by median normalization, identification of highly variable genes and Louvain clustering. Marker genes of cell-clusters were identified using Wilcoxon's rank test and shared top genes across datasets resulted in a common and robust cell type annotation. This initial annotation was later used to create a reference for the precise annotation of the merged dataset ( Figure E8).

Data quality control done on individual datasets
Cell and gene filtering Each sample processing being slightly different from others (sample size, presence of blood, dissociation times), sample-specific quality metrics vary slightly between samples. To take into account this variability, each sample was pre-processed individually. Cells were excluded based on three criteria: high number of Unique Molecular Identifier (UMIs) per cell (max +3 Median Absolute Deviation, MAD), low number of detected genes per cell (min 500 genes) and high percentage of mitochondrial genes (max +3 MAD). Mitochondrial and ribosomal genes (gene symbols starting with RPS/RPL) were excluded from the count matrices.

Doublet removal
We used DoubletDetection for unbiased identification of doublets (technical error) in our datasets (https://github.com/JonathanShor/DoubletDetection). We kept the default parameters values and removed the predicted doublets. Further doublet removal was done on the merged dataset (without data integration), to remove clusters with a high proportion of predicted doublet (over 50 %).

Ambient mRNA correction
Dissociation of complex tissue, such as brushing and biopsies, results in a certain proportion of cell lysis. It results in the presence of ambient mRNA that spreads across all droplets of a single experiment. This gene expression background is highly dependent of the cell-type composition which might lead to misleading analysis. We used soupX (https://github.com/constantAmateur/SoupX) for background correction and compared analysis results between corrected and uncorrected dataset (corrected dataset being very sparse). Background corrected data were mainly used for visualization purposes.

Normalization
Size factors were calculated for the complete (merged) dataset using 'ComputeSumFactor' from the scran R package [2] . Cells were pre-clustered with the 'quickCluster' function, method 'igraph' and minimum and maximum cluster size of 100 and 3,000, respectively. Raw counts were then normalized and log-transformed with cell-specific size factors. A count of 1 was added to each value prior to log transformation.

Selection of highly variable genes
Highly variable genes (HVGs) were identified/calculated using the getHVGs function from the scran package, with default parameter values.

Batch Correction and Data Integration
Batch effects were removed using the 'fastMNN' function in the scran R package on 50 principal components computed from the HVGs only [3] . Correction was performed The resulting batch-corrected principal component analysis was then used for further analysis steps. The compared analysis was performed between batch-corrected and uncorrected datasets.

Dimensionality reduction and visualization
UMAPs were calculated using scanpy [4] . For the complete dataset, the first 12 components of the batch-corrected PCA were used and considering the 100 nearest neighbours of each cell. For each data subset (immune, rare cells, stromal, and cycling) UMAPs were computed on uncorrected PCA based on subset specific HVGs.

Data clustering and sub-clustering
We clustered cells using phenograph [5] (available in scanpy) with two parameter settings (i: 12 PCs and 100 nearest neighbours) to tackle the imbalance in cell proportion (e.g. basal cells vs rare cells). The number of PCs used was estimated empirically on the PCA elbow plot, and by manual examination of top genes correlated with PCs. After a first annotation step, described below, a sub-clustering step was performed for each annotated cell type, to clean the boundaries between the distinct cell types (basal and suprabasal), but also in order to better identify small clusters such as rare cells or stromal cells. The number of PCs used for these sub-clustering steps varies from 3 to 8 with 20 nearest neighbours per cell.

Markers identification and data annotation
Marker genes were identify using rank_genes_group function from scanpy using the Wilcoxon's rank test. The robustness of those markers was assessed by reviewing the literature, and by the high correlation of phenograph clusters sharing similar markers genes.
These clusters were then grouped and annotated as a unique cluster.

Gene expression differential analysis
Differential analysis between specific clusters (secretory vs. secretory N for instance) was designed differently from the rank_gene_group function from scanpy to overcome the sample-specific gene expression background still present even after correction and to amplify the statistical power of the differential analysis. Pseudo-bulk samples were created from each cell-clusters by summing the raw counts of each gene in multiple single cells. Each bulk was designed to be composed of an equal number of cells (to get similar library size between bulks), and to contain randomly picked cells from a homogeneous mix of all the donor samples (to have a similar gene expression background between all bulks from the same cell type).
Differential analysis was performed using glmFit function from the R package edgeR [6] .

GSEA analysis
Gene set enrichment analysis was performed using the fgsea R package with the GO Biological Process gene sets (Broad Institute GSEA MSigDB).

Cycling cell identification and cell cycle analysis
Cycling cells were identified in the batch-uncorrected analysis of the dataset as a single cluster, and this specific cell type annotation was reported in the batch-corrected dataset.
Cell cycle scoring (S phase and G2M phase) was performed using the function score_genes_cell_cycle from scanpy tools and the associated cell cycle genes [1] . The G1 phase score was estimated as the opposite of both the S and G2M phase.

Trajectory inference using PAGA
To compare the cell trajectories between nasal and tracheobronchial samples, two subsets of randomly picked cells from each nasal or tracheobronchial surface epithelial cell types (n = 500 cells per cell type) were used to infer their trajectories using the PAGA algorithm [7] available in scanpy. The included epithelial cell types were cycling, basal, suprabasal, suprabasal N, secretory, secretory N, deuterosomal, multiciliated and multiciliated N cells.
Cells were then projected on the corresponding force atlas embedding and multiciliatedgoblet cells were highlighted on the resulting trajectory.

Inference of transcriptional regulatory units
We inferred transcriptional regulatory units using the GRNboost2 algorithm implemented in the arboreto package (https://arboreto.readthedocs.io/en/latest/). Expression correlations between transcription factors and potential target genes were computed from a raw count data matrix where we set a maximum threshold of 5000 cells by cell types. We obtained 1222 modules composed of the 50 first top correlated genes with a confirmed transcription factor.
We scored the activity of those modules in each cell of the complete dataset using the score_genes function from scanpy tools. Cell type-specific activity of each module was determined with a Wilcoxon's rank test.