Regulatory variants active in iPSC-derived pancreatic progenitor cells are associated with Type 2 Diabetes in adults

Pancreatic progenitor cells (PPC) are an early developmental multipotent cell type that give rise to mature endocrine, exocrine, and ductal cells. To investigate the extent to which regulatory variants active in PPC contribute to pancreatic complex traits and disease in the adult, we derived PPC from induced pluripotent stem cells (iPSCs) of nine unrelated individuals and generated single cell profiles of chromatin accessibility (snATAC- seq) and transcriptome (scRNA-seq). While iPSC-PPC differentiation was asynchronous and included cell types from early to late developmental stages, we found that the predominant cell type consisted of NKX6-1+ progenitors. Genetic characterization using snATAC-seq identified 86,261 regulatory variants that either displayed chromatin allelic bias and/or were predicted to affect active transcription factor (TF) binding sites. Integration of these regulatory variants with 380 fine-mapped type 2 diabetes (T2D) risk loci identified regulatory variants in 209 of these loci that are functional in iPSC-PPC, either by affecting transcription factor binding or through association with allelic effects on chromatin accessibility. The PPC active regulatory variants in 65 of these loci showed strong evidence of causally underlying the association with T2D. Our study shows that studying the functional associations of regulatory variation in iPSC-PPC enables the identification and characterization of causal SNPs for adult Type 2 Diabetes.


33
In early development the pancreas is formed from pancreatic progenitor cells (PPCs), which are a multipotent cell 34 type that has the potential to give rise to endocrine cells (clusters of hormone secreting cells, such , , , and  35 cells) and exocrine cells (i.e. acinar and ductal cells) (Cano et al., 2014;Jennings et al., 2015). While PPCs are 36 precursors to mature pancreas cell types, the extent to which regulatory variants active in PPCs contribute to 37 pancreatic complex traits and disease in the adult is currently not known. Recently it has become possible to use 38 human pluripotent stem cells to derive PPCs (Pagliuca et al., 2014;Rezania et al., 2014), which provide a virtually 39 unlimited source of cells to identify and characterize regulatory variants. Given that regulatory variation is largely 40 located in enhancers and promoters (Pennacchio et al., 2013), ATAC-seq provides an optimal method to identify 41 and characterize variants in PPCs that directly alter transcription factor binding and downstream gene expression. 42 Examining induced pluripotent stem cell derived PPCs (iPSC-PPCs) from whole-genome sequenced unrelated 43 individuals using ATAC-seq could enable identification of regulatory variation in PPCs and determine whether 44 or not they are associated with adult pancreatic traits and diseases such as Type 2 Diabetes (T2D). to the iPSC scRNA-seq sample. These results are consistent with the differentiation protocol used in Veres et al. 124 (Veres et al., 2019), generating similar albeit more advanced cell types than the differentiation protocol we used 125 to generate the iPSC-PPCs. 126 While the nine iPSC-PPC samples all consisted of multiple cell types, the vast majority of the scRNA-seq cells 127 were PDX1+ progenitors, NKX6-1+ progenitors or replicating NKX6-1+ progenitors. To determine if our results 128 reflected the differentiation efficiency measured by flow cytometry, we compared the fraction of NKX6-1+ 129 progenitors and replicating NKX6-1+ progenitors, with the fraction of cells that stained double-positive for PDX1 130 and NKX6-1. We found that these two independent measurements were highly correlated (R = 0.843, p = 0.00218; 131 8 Pearson's correlation, Figure S7A), indicating that scRNA-seq captured the variable differentiation efficiency 132 observed in FACS. 133 To annotate the snATAC-seq clusters, we compared motif activity scores of pancreatic-associated transcription 134 factors from chromVAR (Schep et al., 2017) with their gene expression levels from scRNA-seq ( Figure 1G, Table  135 S4). While most cell types could be identified in both scRNA-seq and snATAC-seq, we observed several 136 differences (Figure 1, Figure S5, Figure S6, Table S5, Table S6). Both technologies identified mesendoderm 137 (strong motif activity scores for TFAP2A, TFAP2B, (Raap et al., 2021;Wang et al., 2011)), NKX6-1+ progenitors 138 (GATA4/6, HNF4A, FOXA1/2 PDX1, NKX6-1), endocrine (HNF1A, MAFA, NEUROD1, NKX2-2), and early 139 ductal (ETV1, ETS1, ETS2) cell type populations. However, with scRNA-seq, we were able to distinguish 140 clusters of early definitive endoderm cells and PDX1+ progenitors, which could not be distinguished in snATAC-141 seq from NKX6-1+ progenitors, which were the predominant cell type. Furthermore, snATAC-seq could not 142 discriminate between replicating and non-replicating iPSC-PPC. However, when we compared the cell type 143 fractions of NKX6-1+ progenitors in snATAC-seq with the fraction of early definitive endoderm, PDX1+, NKX6-144 1+ progenitors, and replicating cells in scRNA-seq, the fractions were significantly correlated (R = 0.956, p = 145 0.000783, Figure S7B). Interestingly, we identified two distinct endocrine cell clusters using snATAC-seq, which 146 differed in motif activity levels of early endocrine (SOX4, SOX10  PAX and RFX ( Figure 1G), which regulate endocrine differentiation, we determined that these cells are 150 committed to endocrine lineage but have not yet fully developed into mature endocrine cells. Overall, while our 151 results show that while scRNA-seq and snATAC-seq capture slightly different iPSC-PPC cell types, the 152 predominant cell type across all samples in both assays consisted of NKX6-1+ progenitors. 153 Endocrine cells express combinations of three pancreatic endocrine hormones 9 ESC-PPCs have been shown to produce polyhormonal endocrine cells (Shahjalal et al., 2018). We tested whether 155 the 952 cells in the endocrine cluster expressed combinations of INS (insulin), GCG (glucagon), and SST 156 (somatostatin, Figure S8). We observed that 50.7% of endocrine cells expressed INS, 18.7% expressed GCG, and 157 30.9% expressed SST. Of these cells, 23.7% expressed only one of the three hormones and 32.8% expressed at 158 least two hormones, with INS and SST being the most common combination (16%) and 11% expressed all three 159 hormones. While these hormones were also expressed in non-endocrine cell clusters, they were expressed in less 160 than 5% of the cells. These results suggest that, while the protocol results in asynchronous differentiation, NKX6-161 1+ progenitors, which are the most common cell type in the iPSC-PPCs, represent early pancreatic lineage to 162 endocrine cells. 163 Characterizing regulatory genetic variation in snATAC-seq

164
To understand the potential of iPSC-PPC as a model system to study pancreas regulatory genetics, we 165 investigated the potential effects of variants overlapping 203,895 autosomal snATAC-seq peaks using three 166 methods: 1) by identifying active transcription footprints and detecting their overlapping variants, using TOBIAS 167 (Bentsen et al., 2020); 2) by predicting the allelic effects of variants in snATAC-seq peaks, using deltaSVM 168 (Ghandi et  We observed multiple footprints at the same peak for two reasons: 1) multiple transcription factors may bind to 173 the same peak; and 2) since TOBIAS identifies transcription factor footprints independently for each transcription 174 factor and determines the presence of a bound footprint based on each transcription factor motif, it cannot 175 distinguish between transcription factors with highly similar motifs (for example: NKX6-1 and NKX6-2); 176 10 therefore, multiple transcription factors in the same family may be identified as bound to the same footprint. factor binding sites and to have allelic effects on their binding (Table S7, Table S8). To validate these predictions, 190 we investigated their overlap with the transcription factor footprints determined using TOBIAS for 89 191 transcription factors tested with both methods. For each of these transcription factors, we confirmed that variants 192 predicted by deltaSVM to overlap bound transcription factor binding sites were more likely than expected to 193 overlap the transcription factor footprint identified by TOBIAS (p = 2.2 x 10 -47 , t-test, Figure S9A, Table S9) and  194 found a negative association between deltaSVM score and distance from each transcription factor footprint (p = 195 4.3 x 10 -15 , t-test, Figure S9B). 196 Finally, to measure allelic-specific effects in 48,738 snATAC-seq peaks that overlapped at least one SNP 197 heterozygous in one or more of the seven tested samples (110,290 SNPs in total, including 86,660 of the ones 198 tested with TOBIAS and deltaSVM, Table S7, Table S10), we utilized genotypes of the nine iPSCORE 199 11 individuals from whole genome sequencing Panopoulos et al., 2017). We divided SNPs 200 according to whether they were heterozygous in one sample (termed "singletons" hereafter: 60,742 SNPs) or two 201 or more samples ("multiplets": 49,548 SNPs, Figure S10). We found 3,862 singleton and 3,487 multiplet SNPs  202 with ASE (7,349 in total, FDR < 0.05, binomial test, adjusted using Benjamini-Hochberg's method) in 5,583 203 peaks (2.25% of all peaks). We compared the allelic effects measured by ASE with the estimations by deltaSVM 204 and observed a significantly positive correlation (R = 0.26, p-value = 3.0 x 10 -34 , Figure 2A). We also computed 205 the correlation between snATAC-seq ASE and the deltaSVM predictions of allelic effects in each of these 206 transcription factors. We observed a significantly positive correlation for 27 transcription factors (FDR ≤ 0.05, 207 Benjamini-Hochberg's method, Figure 2B, Table S11) and the distribution of correlation values across all 208 transcription factors was significantly greater than zero (p = 3.7 x 10 -6 , t-test . We also observed a negative association between ASE measured on heterozygous 212 variants in snATAC-seq peaks and their distance from transcription factor footprints (p = 6.45 x 10 -84 , t-test, 213 Figure S9C, Table S12), indicating that variants with allelic effects are more likely than expected to affect 214 transcription factor binding and that the results from the methods we employed to analyze the associations 215 between genetic variation and chromatin function are concordant. Using three methods (TOBIAS, deltaSVM and 216 ASE), we were able to detect or predict allelic effects or effects on transcription factor binding for 86,261 variants 217 ( Figure 2C). These results indicate that the genotypes of a large proportion of variants at iPSC-PPC snATAC-seq 218 peaks are likely associated with altered binding affinities for transcription factors and therefore may be associated 219 with adult pancreatic complex traits and disorders. 220 To examine the correlation between ASE in iPSC-PPC regulatory elements due to genetic variation and to changes 221 in gene expression, we performed bulk RNA-seq (scRNA-seq only has coverage at 3' end of gene) for the seven 222 12 samples with snATAC-seq and performed ASE on the transcriptome ( Figure S10, Table S13). We observed a 223 significant positive correlation between ASE at promoters and allelic bias with corresponding genes (R = 0.039, 224 p = 1.6 x 10 -6 , Figure 2D). Although significant, the correlation between ASE at promoters and corresponding 225 genes was weak, which is consistent with previous observations ( snATAC-seq peaks (Table S14). These overlapping variants were more likely to have a higher rank (based on 234 causality within a credible set, 8.6 x 10 -136 , Mann-Whitney U test) and a higher posterior probability of association 235 (PPA) with T2D (p = 3.6 x 10 -68 ) than SNPs outside of peaks ( Figure 3A,B). These results indicate that regulatory 236 elements active in iPSC-PPC are enriched for causal variants associated with T2D. 237 To further characterize the potential of iPSC-PPC to study T2D genetics, we identified variants in the T2D 238 credible sets that had high PPA or were the top-ranked and overlapped bound transcription factor binding sites 239 (TOBIAS), had predicted allelic effects (deltaSVM) and/or validated allelic effects (ASE). Of the 269 credible 240 sets with at least one SNP overlapping snATAC-seq peaks, 209 (77.7%) had at least one SNP potentially altering 241 transcription factor binding affinities, including 163 with TOBIAS support, 152 with deltaSVM support and 65 242 with ASE. Of note, 73 had support from two methods and six from all three (Table S14). Among these 209 T2D 243 13 loci, we found 65 SNPs that showed strong evidence of being causal, including 38 that were top-ranked and 27 244 that were not the top ranked but had high PPA (PPA > 10%, Figure 3C). 245 The 38 SNPs that were the top-ranked included six with PPA ≥ 90%, corresponding to the loci for GLI2 (predicted 246 to affect binding by deltaSVM and/or TOBIAS of HSF2, HSF4 and YY2), CHCR2 (RFX1-5, SCRT1, SCRT2), 247 UBAP2 (HSF2 and HSF4), SLC30A8 (PAX1 and PAX9), IGF2BP2 (IRF1, PRDM1, ZNF384, and shows ASE 248 in snATAC-seq) and DGKB (RFX1) and eight with PPA between 50% and 90%, corresponding to 14 Twenty-seven credible sets had SNPs with allelic effects that were not the top ranked but had high PPA (PPA > 266 10%). These cases include the ZNF169 locus, whose top-ranked SNP (rs12236906) has PPA = 33% but does not 267 overlap any snATAC-seq peak, whereas its second-ranked SNP (rs10993329, PPA = 31%) overlaps a snATAC-268 seq peak and has deltaSVM predicted allelic effects ( Figure 3D). Although rs12236906 has been indicated as the 269 most likely causal SNP for this locus, our results suggest that rs10993329 is more likely to be functional. These 270 observations are supported by the higher activity of the genomic region surrounding rs10993329 across multiple 271 tissues ( Figure 3D). We further investigated the predicted allelic effects of rs10993329 and found that it is whereas the variant with the highest PPA (rs55834317: 27%) is not associated with any GWAS. While 277 rs79310463 and rs55834317 both overlap a snATAC-seq peak, only rs79310463 is associated with ASE and 278 overlaps footprints for TFDP1 and ZNF263, suggesting that this SNP is more likely to have functional 279 consequences in this locus ( Figure 3E). 280 In other loci containing lower ranked SNPs (PPA > 10%), such as HMG20A and IRS2, multiple variants overlap 281 iPSC-PPC snATAC-seq peaks and are predicted to have ASE, indicating that additional studies are needed to 282 determine if multiple causal SNPs underlie the associations in these loci and whether they are functional. In 283 conclusion, our genetic association analysis shows that many regulatory variants implicated in T2D are active 284 and have allelic effects in iPSC-PPC, making these cells a suitable model system to identify and characterize the 285 molecular mechanisms underlying T2D genetic associations. 286

287
In this study, we derived ten iPSC-PPC samples from nine unrelated individuals to generate matched scRNA-seq 288 and snATAC-seq and determined that while the differentiation was asynchronous and similar to ESC-PPC (Veres 289 et al., 2019), the derived cells largely consisted of a single cell type (NKX6-1+ progenitors). We characterized 290 regulatory variants that overlapped open chromatin in iPSC-PPC and found that these variants are likely to have 291 allelic effects on chromatin accessibility and may affect transcription factor binding. To validate the utility of 292 iPSC-PPC to characterize and annotate genetic variants associated with adult T2D, we used previously fine-293 mapped 380 T2D risk loci (Mahajan et al., 2018). Enrichment analyses revealed that the majority of the T2D risk 294 loci were located within open chromatin regions of iPSC-PPC. Furthermore, these loci contain variants that 295 overlap active transcription factor binding sites and/or show allele specific effects on chromatin accessibility. 296 Our study identified 65 T2D risk loci containing SNPs with strong evidence of being causal (i.e. high PPA) that 297 are associated with allelic-specific effects and/or predicted to affect transcription factor binding in the iPSC-PPCs. 298 For 38 of the T2D risk loci, the top-ranked SNP (i.e. the SNP with highest PPA) had functional effects on 299 transcription factor binding and/or chromatin accessibility; while in 27 loci, we observed that at least one lower-300 ranked SNP with high PPA (≥10%) was associated with allelic effects or altered transcription factor binding, 301 suggesting that the top-ranked SNP is likely not causal. In certain cases, such as rs79310463 in the KSR2 locus, 302 the SNP we identified as being associated with allelic effects or transcription factor binding had been described 303 as being causal in previous T2D studies ( empowering chromatin accessibility profiles with advanced tools such as transcription factor footprinting, allelic 319 effect predictions, and co-accessibility, it is feasible to uncover novel molecular mechanisms that underlie the 320 genetic risk of T2D. 321 Our study shows that by combining GWAS with epigenomic information from iPSC-PPC, it is feasible to gain 322 insight into the molecular mechanisms underlying the associations between genetic variation and adult pancreatic 323 complex traits and disease. Although we were able to determine the associations between SNPs in 209 T2D risk 324 loci and transcription factor binding or allelic effects, the majority of the associations were predictions. Larger 325 sample sizes would result in additional variants in ATAC-seq peaks and greater statistical power to test each 326 variant-containing peak for ASE and downstream changes in gene expression. Indeed, to gain insight into global 327 functional genetic variation, it will be necessary to obtain data for hundreds of iPSC-PPCs. With a small sample 328 size, we show that iPSC-PPC provide a suitable model system to study the associations between genetic variation, 329 regulatory mechanisms, and T2D, and that studies involving large numbers of samples could aid in the 330 identification of causal variants at the majority of T2D risk loci. 331

332
iPSCORE subject information 333 We obtained 9 iPSC lines from the iPSCORE collection (Panopoulos et al., 2017) (Table S1). These lines were 334 reprogrammed from skin fibroblasts collected from 9 unrelated subjects (8 female, 1 male), who ranged in age at 335 time of donation from 21 to 65 years old, and represent three 1000 Genomes Project super populations: European 336 American (7), Asian American (1), and African American (1). From each subject, whole blood samples were 337 collected and used to generate and process whole genome sequence (WGS) data as previously described 338 ( We applied the GATK best-practices pipeline for variant calling that includes indel-realignment, base-341 recalibration, genotyping using HaplotypeCaller, and finally joint genotyping using GenotypeGVCFs (DePristo  342  et  manufacture's recommendations except as noted below. One iPSC line (from subject 90e8222f-2a97-4a3c-9517-348 fbd7626122fd) was independently differentiated twice (PPC_029 and PPC_036) resulting in a total of 10 derived 349 iPSC-PPC samples. 350 Expansion of iPSC: One vial from each of 9 iPSC lines was thawed into mTeSR1 medium containing 10 μM 351 ROCK Inhibitor (Selleckchem) and plated on one well of a 6-well plate coated with matrigel. iPSCs were grown 352 until they reached 80% confluency and then passaged using 2mg/ml solution of Dispase II (ThermoFisher  353  Scientific). To obtain a sufficient number of iPSCs for differentiation, iPSCs were passaged twice: 1) cells from 354 the first passage were plated on three wells of a 6-well plate (ratio 1:3); and 2) cells from the second passage were 355 plated on six wells of a 6-well plate (ratio 1:2). 356 Monolayer plating (Day 0; D0): When the confluency of iPSC cells in the six wells of a 6-well plate reached 357 80%, cells were dissociated into single cells using Accutase (Innovative Cell Technologies Inc. Harvest: On D15 cells were dissociated using Accutase, collected and counted, and either processed fresh 372 (scRNA-seq) or cryopreserved (scRNA-seq and snATAC-seq). 373 Flow Cytometry

19
Each of the 10 iPSC-PPC differentiations were analyzed for co-expression of two pancreatic precursor markers, 375 PDX1 and NKX6-1, using flow cytometry. Specifically, at least 2x10 6 (Table S1). 413 Demultiplexing. To reassign pooled cells iPSC-PPCs back to the original subject (RNA_Pool_1 and 414 RNA_Pool_2; Table S1), we obtained the BAM files for each scRNA-seq sample and a VCF file containing SNPs 415 (called from WGS) that are bi-allelic and located at UTR or exon regions on autosomes as annotated by Gencode 416 v34lift37 (Harrow et al., 2012) calls from each of the nine subjects, two of which was not included in scRNA-seq 417 pools but served as negative controls. The two files (BAM and VCF) were used as input to Demuxlet (Kang et  418 al., 2018), which outputted the subject identities of each single cell based on genotype. We found that less than 419 21 1% of the cells mapped to negative controls after filtering for low quality cells and thus, were removed from 420 downstream analyses. 421 Data Processing. To merge the 10 scRNA-seq samples (1 iPSC, 7 fresh iPSC-PPCs, RNA_Pool_1, and 422 RNA_Pool_2), we first aggregated the samples that were sequenced as an independent batch (1 iPSC and 7 fresh 423 iPSC-PPC) using the CellRanger V6.0.1 command aggr with no normalization. For each sample (aggregated 424 sample, RNA_Pool_1, and RNA_Pool_2), we log-normalized the gene counts (NormalizeData) and identified 425 the 2000 most variables genes using a threshold of 0.5 for the standardized log dispersion (FindVariableFeatures). 426 We next applied Seurat's standard integration workflow to adjust for batch differences between the samples. 427 Specifically, we used FindIntegrationAnchors to identify a set of integration anchors between the samples using 428 30 dimensions computed from canonical correlation analysis (CCA). Next, we integrated the samples using 429 IntegrateData and applied the standard downstream workflow of scaling the data (ScaleData), applying principle 430 dimension reduction for 30 principle components (RunPCA), and then visualizing the single cells using Uniform 431 Manifold Approximation and Projection (UMAP). To identify clusters, we used a shared-nearest-neighbor (SNN) 432 graph of the significant PCs. To remove poor quality cells, we removed cells with fewer than 500 genes/cell or 433 more than 50% of the reads mapping to the mitochondrial chromosome. We performed iterative clustering until 434 all clusters driven by high mitochondrial reads or low number of genes were removed. Clusters with fewer than 435 250 cells were also removed. After filtering, 83,971 cells remained. We tested resolutions 0.05, 0.08, and 0.1 for 436 clustering analyses and determined that 0.08 was more representative of the cell types predicted to be observed 437 during stem cell differentiation into PPCs ( Figure 1B,F, Figure S3).  Figure S3, Table  451 S2), confirming that they were NKX6-1+ progenitors. To identify differentially expressed genes, we performed 452 Wilcoxon rank sum test between the normalized expression values of cells within the cluster and cells outside of 453 the cluster (Table S3). P-values were adjusted using a Bonferroni correction, and genes with FDR ≤ 0.05 were 454 considered differentially expressed. 455 Generation of snATAC-seq

456
Library Generation: A total of 7 iPSC-PPC samples were used for snATAC-seq generation ( first applying normalization (RunTFIDF) and linear dimensional reduction (FindTopFeatures and RunSVD) on 480 each sample dataset. We then identified a random subset of 20,000 peaks and computed a set of integration 481 anchors between the samples (FindIntegrationAnchors for 2,000 anchors) The two snaATAC-seq was integrated 482 using IntegrateData and 2-30 most significant dimensions calculated from dimension reduction analyses. Finally, 483 on the integrated dataset, dimension reduction was applied (RunSVD for 30 singular values), and single cells were 484 visualized using UMAP (RunUMAP on 2:30 dimensions). Clusters were identified using a SNN-graph method 485 using FindNeighbors and FindClusters. To remove low quality cells, we removed cells that satisfy one of the 486 24 following criteria: 1) the number of peak region fragments < 2,000 or > 20,000, 2) the percentage of reads in 487 peaks < 40%, 3) nucleosome signal > 1.5, or 4) TSS enrichment score < 2.5. Furthermore, we removed cells that 488 do not visually belong to a cluster (i.e. cells that are scattered between two distinct clusters). We performed 489 iterative clustering until we do not observe significant outliers of single cells. each motif in each cell. To annotate the cell types, we examined the motif activities of transcription factors with 500 known developmental or pancreatic functions: TFAP2A/B (mesendoderm), GATA4/6 (PDX1+ progenitors), 501 HNF4A, FOXA1/2, PDX1, NKX6-1 (NKX6-1+ progenitors), HES1, SOX4/9/10/13 (early endocrine), PAX4/6, 502 RFX1/3, HNF1A, MAFA, NKX2-2, NEUROD1 (endocrine), and ETV1, ETS1, ETS2 (early ductal). To validate 503 our annotations, we compared the motif activities to their gene expression in scRNA-seq using the same z-504 normalization method. We examined the motif activity profiles at resolutions 0.1, 0.15, and 0.20 ( Figure 1G, 505 Figure S4, Table S4), and reasoned that because subclusters within the predominant cluster expressed both PDX1 506 and NKX6-1 but at varying levels, we collapsed these clusters into NKX6-1+ progenitors. Resolution 0.1 was 507 used for downstream analyses. To identify differentially expressed peaks, we applied the FindAllMarkers function 508 25 in Signac with default parameters. Peaks with FDR ≤ 0.05 were considered differentially expressed (Table   509 S5).

511
To identify differentially enriched motifs for each snATAC-seq cluster at resolutions of 0.1, 0.15, and 512 0.20, we computed a Wilcoxon rank sum test for each motif and cell type cluster, comparing the 513 chromVAR z-score distributions of a random sample of 2,000 cells within the cluster and a random 514 sample of 2,000 cells outside of the cluster. Then, for each cluster, we applied a Bonferroni correction to 515 account for multiple testing. Motifs with FDR ≤ 0.05 were considered differentially enriched (Table S6).

517
To characterize regulatory variants for transcription factor binding sites, we used TOBIAS (Bentsen et al., 2020) 518 to identify the binding sites of transcription factors. We merged BAM files from snATAC_Pool_1 and 519 snATAC_Pool_2 and corrected for bias from Tn5 cutsites using TOBIAS function ATACorrect. Using the cutsite 520 tracks from ATACorrect, we computed footprint scores across the regions using FootprintScores. Using the 521 footprint scores along with transcription factor binding motifs from JASPAR2020 (Fornes et al., 2020), we then 522 estimated the binding positions of each transcription factor footprint across the genome. Using these positions, 523 we calculated the distance of each of the 325,942 SNPs in snATAC-seq peaks from its closest transcription factor 524 footprint on the same peak using bedtools closest -d. Information about the footprints can be found in Table S7. transcription factor binding, we filtered these tests based on seq_binding == "Y" and preferred_allele != "None". 537 seq_binding refers to whether the transcription factor is predicted to be bound to the locus overlapping the tested 538 SNP and preferred_allele describes whether the SNP is associated with improved binding affinity for the 539 transcription factor ("Gain"), decreased binding affinity ("Loss") or is not associated with changes in transcription 540 factor binding affinity ("None"). We found 84,196 tests passing these filters, for a total of 52,653 unique SNPs, 541 as one SNP may be predicted to affect the binding affinity for more than one transcription factor. 542 Allele-specific effects in snATAC-seq peaks 543 We filtered the total 288,813 snATAC-seq peaks to include peaks on autosomal chromosomes, with MACS2 544 score ≥ 100, and outside of ENCODE blacklist regions (hg19-blacklist.v2.bed.gz) ( Allele-specific effects in bulk RNA-seq using MBASED

586
To detect the allele-specific effects of gene expression in bulk RNA-seq from seven samples (7 subjects), we used 587 an R package MBASED , which uses a meta-analysis approach that aggregates information 588 from all SNPs within the gene body to measure gene-level ASE. For ASE analysis, we only considered genes on 589 autosomes and that is expressed. We determined a gene to be expressed if the gene is expressed with at least 1 590 TPM in at least 10% of the seven samples. Of the 62,492 genes, 18,217 genes (29.2%) were expressed, of which 591 10,715 were on autosomes. For each sample and gene, we obtained read counts for the reference and alternate 592 allele using samtools mpileup at the SNP loci for which the sample was heterozygous. We ran the 1-sample 593 analysis on MBASED, obtained the major allele frequency and p-value of ASE for each gene, and applied 594 29 multiple test correction using Benjamini-Hochberg's method. As default, MBASED removed genes with less than 595 10 reads for read depth. We determine a gene to display ASE if FDR ≤ 0.05 and major allele frequency ≥ 0.6. 596 Processing T2D loci 597 We obtained the genomic coordinates of each of 380 fine-mapped T2D loci from Mahajan et al. (Mahajan et al., 598 2018) . For each SNP with PPA > 0 in each locus, we extracted all its iPSC-PPC snATAC-seq peaks using 599 bedtools intersect (Quinlan and Hall, 2010 Author information    Violin plots showing the distribution of chromVAR motif activity score for the transcription factors described in Figure 1G for each snATAC-seq clusters in Figure 1D. Motif logos are shown underneath the corresponding plot.

Figure S7: Correlation between flow cytometry, scRNA-seq and snATAC-seq results
To determine the correspondence between FACS, scRNA-seq, and snATAC-seq, we compared the fraction of cells expressing both PDX1 and NKX6-1 within each iPSC-PPC sample. For scRNA-seq, we computed the total number of NKX6-1+ progenitors as the sum of NKX6-1+ progenitors and replicating NKX6-1+ progenitors (X axis, Figure S7A), as these cells express both PDX1 and NKX6-1. We found that the FACS and scRNA-seq was significantly correlated (R = 0.843, p = 0.00218, Figure S7A). We next determined whether the cell type fractions in snATAC-seq corresponds to scRNA-seq. Because snATAC-seq was not able to detect PDX1+ progenitors and early definitive endoderm (DE), we reasoned that these cells may be included in the main NKX6-1+ progenitor nuclei cluster. Therefore, we computed the fraction of cells that are early DE, PDX1+, NKX6-1+ progenitors and replicating NKX6-1+ progenitors in scRNA-seq and compared it to the fraction of nuclei that are NKX6-1+ progenitors in snATAC-seq. We found a significant correlation between scRNA-seq and snATAC-seq (R = 0.956, p = 0.000783, Figure S7B). These results show that scRNA-seq, snATAC-seq, and FACs are highly associated with each other, and that both sequencing methods can capture the variable differentiation efficiency in iPSC-PPC.  (A) For each of the 89 transcription factors tested with both deltaSVM and TOBIAS, we investigated the agreement between these two tools. Here, we determined if variants predicted to have allelic effects on a specific transcription factor by deltaSVM were more likely than expected to occur at genomic locations bound to the same transcription factor, as determined by TOBIAS. Effect size (X axis, log2 ratio) and p-values (Y axis, Fisher's exact tests) were measured using the fisher.test function in R. All tests had significant p-values after FDR correction (Benjamini-Hochberg) and had a positive log2 ratio and, overall, the distribution of log2 ratios was significantly greater than zero (p = 2.2 x 10 -47 , t-test, measured with the t.test function in R, with option mu = 0).
Each dot represents one of the 89 tested transcription factors.
(B) For the same transcription factors in (A), we tested if variants predicted to have allelic effects by deltaSVM were more likely to be closer to transcription factor footprints determined using TOBIAS than expected. We computed the distance of each variant from the closest transcription factor footprint on the same peak, and performed a linear regression between the distance and the absolute value of the variant score measured by In Sheet 2: We provide information for each sample in our study. Subject_UUID is provided in Column A.
Cell_type (Column B) indicates the type of cell (iPSC or PPC) obtained from each subject as part of this study.
Unique Differentiation Identifier (UDID, Column C) is a unique digit assigned for each attempted iPSC-PPC differentiation. %PDX1+ (Column D), %NKX6.1+, (Column E), and %PDX1+_NKX6.1+ (Column F) are the fractions of cells from each iPSC-PPC differentiation positively stained for PDX1, NKX6-1, or both PDX1 and NKX6-1, respectively. Data type indicates the type of sequencing method performed for each differentiation (Column G). The UUID for each sequenced sample is provided (Column H). Pooling schemes for samples combined prior to sequencing are shown (Column I).
In Sheet 3: We provide the UUIDs of each iPSC-PPC sample (Column A) by subject (Column B), WGS (Column C), bulk RNA-seq (Column D), fresh scRNA-seq (Column E), cryopreserved scRNA-seq (Column F), and snATAC-seq (Column G). Bulk RNA-seq was generated for all the of 10 iPSC-PPC samples that have scRNAseq, but in this study we analyzed only the seven that have matched snATAC-seq (Table S1). The table shows, for each of the 83,971 single cells, the sample UDID (Column A), the preparation of the sample corresponding to either fresh or cryopreserved (Column B), the cell barcode (Column C), UUIDs for subject, WGS, and scRNA-seq (Columns D-F), the number of reads (Column G), the number of genes detected (Column H), the percent of mitochondrial reads (Column I), the cell type assignment (Column J), UMAP coordinates (Column K-L), and the cluster assignments at resolutions 0.05, 0.08, and 0.1 (Columns M-O). The UUIDs for WGS was obtained using Demuxlet (Kang et al., 2018) and then mapped to subject and sample UUID. Barcodes for cells from freshly prepared samples were formatted as barcode-aggregate_id (Sheet 2) while those from cryopreserved samples were formatted as barcode-1.    For each cluster in snATAC-seq described in Figure 1D, we performed differential expression using the FindAllMarkers function in Signac. The table provides the cluster name, peak ID, average log2 fold-change, the fraction of cells within cluster that has peak expression > 0.1, the fraction of cells outside of cluster that has peak expression > 0.1, p-value, and q-value adjusted with a Bonferroni correction. We consider peaks with q-value  0.05 to be differentially expressed.   whether deltaSVM predicts that the transcription is bound, deltaSVM score for allelic TF binding, and the consequence of the SNP on the transcription factor binding ("Gain": the alternative allele has a stronger binding score; "Loss": the reference allele has a stronger binding score; "None": no difference between reference and alternative alleles  The table shows the enrichment for transcription factor binding and allelic effects predicted by deltaSVM in the genomic regions associated with a bound transcription factor annotated by TOBIAS. For each of the 89 transcription factors (the transcription factor ID by JASPAR and the transcription factor name by deltaSVM are shown in the table) in common between deltaSVM predictions and TOBIAS, we performed two enrichment analyses (shown in Figure S9A,B): 1) we tested for the enrichment of variants predicted to overlap a transcription factor binding site by deltaSVM and bound transcription factor binding sites measured by TOBIAS, using Fisher's exact test (the table shows estimate, p-value and q-value adjusted using Benjamini-Hochberg's method); and 2) we tested for the enrichment of variants predicted by deltaSVM to have allelic effects at a transcription factor binding site and bound transcription factor sites measured by TOBIAS, using linear regression (the table shows effect size, standard error, p-value and q-value adjusted using Benjamini-Hochberg's method).

Table S12: Associations between transcription factor binding and ASE
The table shows the associations between the major allele frequency of variant tested for ASE and its distance from each bound transcription factor binding site in the same snATAC-seq peak measured by TOBIAS ( Figure   S9C). The table shows, for each of the tested 746 transcription factors, its JASPAR ID, the measured effect size, standard error, p-value and q-value adjusted using Benjamini-Hochberg's method.

Table S13: Correspondence between ASE at promoters and their associated genes
Shown are the allele-specific effects of 5,380 expressed genes that have a snATAC-seq peak at the promoter region. The analysis was performed using MBASED (Mayba et al., 2014) on a per-sample basis. For each sample and gene (Columns A-C), we provide information about the major allele frequency (Column D), p-value of heterogeneity (Column E), p-value of ASE (Column F), FDR-corrected p-value using Benjamini Hochberg's method (Column G), and whether this gene exhibits ASE or not (Column H). We determine a gene to exhibit ASE if FDR ≤ 0.05 and the major allele frequency is ≥ 0.6. The major allele frequency, p-value of ASE, and pvalue of heterogeneity were computed by MBASED using the 1-sample analysis in the provided protocol. including the position of the index SNP, the index gene and the RS ID of the index SNP, as defined in Mahajan et al. (Mahajan et al., 2018); the PPA and the rank in the credible set; whether each SNP overlaps an iPSC-PPC snATAC-seq peak; whether it is predicted to have allelic effects, based on deltaSVM results; and whether it has ASE in the iPSC-PPC snATAC-seq dataset.