Introductory paragraph
A better understanding of the genetic mechanisms regulating hematopoiesis is necessary and could augment translational efforts to generate red blood cells (RBCs) and/or platelets in vitro. Using available genome-wide association data sets, we applied a machine-learning framework to identify genomic features enriched at established platelet trait associations and to score variants genome-wide to identify biologically plausible gene candidates. We found that high-scoring SNPs marked relevant loci and genes, including an expression quantitative trait locus for Tropomyosin 1 (TPM1). CRISPR/Cas9-mediated TPM1 knockout in human induced pluripotent stem cells (iPSCs) unexpectedly enhanced early hematopoietic progenitor development. Our findings may help explain human genetics associations and identify a novel genetic strategy to enhance in vitro hematopoiesis, increasing RBC and megakaryocyte (MK) yield.
Main Text
Elucidating genetic mechanisms governing hematopoiesis has broad value in understanding blood production and hematologic diseases.1 Given interest in generating red blood cells (RBCs) and platelets from in vitro culture of induced pluripotent stem cells,2,3 there is also translational value in harnessing genetic and molecular processes that regulate hematopoiesis. For example, recent advances have increased platelet yield in vitro,2 but generating MKs cost-effectively will require novel strategies based on better knowledge of underlying mechanisms.2,4,5
In vitro systems might be improved by identifying novel factors from human genetic studies. Genome-wide association studies (GWAS) have linked hundreds of single nucleotide polymorphisms (SNPs) with platelet trait variability.6–9 Because most GWAS SNPs are non-coding, likely influencing transcriptional expression of key genes,10,11 it has been challenging to derive a functional biochemical understanding of the key genes of action,11–13 and few studies have elucidated biochemical mechanisms for platelet trait variability loci.14–18 One strategy to narrow focus on candidate genes is to link non-coding variation to expression of nearby genes.1,19,20 However, platelet trait variation GWAS have thus far implicated >6700 expression quantitative trait loci (eQTL) affecting expression of >1100 genes (Methods), highlighting a need to more specifically identify putatively functional sites.
To further narrow studies onto credible candidates for functional follow-up (Fig. 1a), we applied a penalized logistic regression model to select a subset of 628 different chromatin features that best distinguished 73 platelet trait GWAS SNPs6 from matched control SNPs not associated with platelet traits (Methods, Fig. 1b, and Supplementary Table 1). The resultant predictive model selected 9 epigenomic features and was able to discriminate between positively and negatively labeled examples (area under the receiver operating characteristic curve (AUC) = 0.793, Fig. 1c, Supplementary Fig. 1a, and Table 1). Each selected feature had a positive coefficient, meaning each was more likely to overlap a platelet-associated GWAS SNP than a control.
While some care in interpretation was required, it was encouraging that the model selected biologically plausible features. GATA1 and FLI1 are critical MK transcription factors,21,22 and most of our features came from hematopoietic cells (primary MK, K562, GM12878; Table 1). Furthermore, this set of 9 chromatin features is functionally predicted to identify regulatory elements near and within gene bodies (Table 1). This regulatory localization is consistent with previous observations.14
We calculated a trait-enrichment score based on SNP overlap with each of the 9 selected features, weighted by our penalized regression model coefficients (Methods, Table 1). Resultant scores were significantly higher for training, holdout and validation sets of platelet trait GWAS SNPs, relative to SNPs genome-wide (p<0.0001, Fig. 2a, Supplementary Table 2). Our regression model performed well compared to other methods20,23,24 (Fig. 2b-c, Supplementary Fig. 1b).
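As an illustration of this scoring step, the following Python sketch computes a score as the sum of model coefficients over the selected features that a SNP overlaps. The feature names and coefficient values are hypothetical placeholders (the fitted values are those in Table 1), and because the text does not state whether the intercept or baseline covariates contribute to the reported score, they are omitted here.

```python
from typing import Dict, Set


def trait_enrichment_score(overlaps: Set[str], coefficients: Dict[str, float]) -> float:
    """Sum the coefficients of every selected feature that the SNP overlaps."""
    return sum(coef for feature, coef in coefficients.items() if feature in overlaps)


# Hypothetical feature names and weights, for illustration only.
example_coefficients = {"MK_GATA1": 0.9, "MK_FLI1": 0.6, "K562_H3K4me1": 0.3}

# A SNP overlapping two of the three example features.
snp_overlaps = {"MK_GATA1", "K562_H3K4me1"}
example_score = trait_enrichment_score(snp_overlaps, example_coefficients)
print(round(example_score, 2))
```

Under this formulation, and given that every selected feature had a positive coefficient, a SNP overlapping none of the selected features scores zero and each additional overlap can only raise the score.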
We next assessed biological support for penalized regression scoring beyond the machine learning framework. First, we evaluated the biological specificity of variation prioritized by the model, given practical limitations associated with fine-mapping and cellular validation experiments. The number of high-scoring SNPs from our model fell within the range of other predictors (Fig. 2d), and Gene Ontology analysis indicated that the nearest genes to penalized regression-prioritized variants were enriched for biologically relevant pathways (Methods, Fig. 2e and Supplementary Tables 3-5). Second, SNP scores correlated with summary association statistics for platelet trait-GWAS data6 (Supplementary Fig. 2a-b), with variants that were nominally associated but not yet genome-wide significant (and not used during the training or testing phases) having a significantly higher average score compared to SNPs with no clear association (p<0.0001, Methods, Supplementary Fig. 2c-d). This correlation suggested that our scoring algorithm was valid genome-wide and could potentially reveal true biological associations, as had the GWAS itself.6,14,15,17 Third, FANTOM5 enhancer regions25 were enriched for high-scoring SNPs, with an average score >0.9 compared with an average score 0.21 genome-wide (Methods, Supplementary Fig. 3a), consistent with the hypothesis that functional non-coding SNPs associate with active regulatory regions.11,26 We further observed that enhancer regions in hematopoietic cell types scored significantly higher than enhancers from irrelevant control cells (Supplementary Fig. 3a). This argues for trait specificity in hematopoietic enhancers, consistent with prior studies.27 Lastly, most high-scoring SNPs from the regression model were in gene bodies or near transcriptional start sites (TSSs, Supplementary Fig. 3b), with SNPs near key MK genes scoring significantly higher than SNPs in matched control regions28 (Supplementary Fig. 3c). 
Collectively, this evidence indicated that our model successfully targeted hematopoietic trait-relevant loci, particularly those near and within gene bodies.
We reasoned that active variants would (i) be in high linkage disequilibrium (LD) with established platelet trait GWAS loci, (ii) score highly relative to other SNPs within that LD block, (iii) regulate target gene(s) as quantitative trait loci, and (iv) overlap GATA binding sites.29,30 We prioritized GATA binding sites based on the importance of GATA factors in hematopoiesis21,31 and in our scoring algorithm (Methods, Fig. 1c).
This approach led us directly to SNPs known to impact hematopoiesis, MK and/or platelet biology (Table 2 and Supplementary Table 6). For example, rs342293 is a GWAS SNP6 that regulates PIK3CG gene expression15 (Fig. 3a-d). In platelets, PIK3CG activity regulates PIK3 signaling32 and response to collagen.33 The GATA site is disrupted in the presence of the SNP minor allele (Fig. 3d). Individuals harboring this minor allele had increased mean platelet volume (MPV) and decreased platelet reactivity.15
This approach also highlighted rs11071720, found within the 3rd intron of the Tropomyosin 1 (TPM1) gene locus. This SNP was in reasonably strong LD with the sentinel GWAS SNP6 rs3809566 (EUR r2=0.73, Fig. 3e-h). The rs11071720 minor allele, which disrupts a near-canonical GATA binding site, is an eQTL associated with decreased TPM1 expression34,35 (Methods, Fig. 3h, and Supplementary Fig. 4), higher platelet count, and lower mean platelet volume (MPV).8 Further, the minor allele for high-scoring rs4075583 (EUR r2=0.71 with rs3809566) was associated with decreased TPM1 expression in heterologous cells,36 though not in GTEx tissues.35 To our knowledge, neither of these SNPs, nor TPM1, had been functionally evaluated in the context of human hematopoiesis.
Given that these high-scoring putatively active SNPs impacted TPM1 expression, we investigated functions for the TPM1 gene in an in vitro human model of primitive hematopoiesis.37 We anticipated that total gene deletion would show stronger effects than non-coding SNP modification.38 Using CRISPR/Cas9, we targeted a ~5kb region containing TPM1 exons 4-8 in iPSCs (Fig. 4a), anticipating creation of a null allele.39 We confirmed deletion by sequencing and western blot (Fig. 4b and Supplementary Fig. 5). In total, we obtained 3 TPM1 knockout (KO) clones from 2 separate genetic backgrounds. Karyotype and copy number variation analyses confirmed that engineering these clones did not introduce any de novo genomic aberrancies (data not shown).
TPM1 protein was most abundant in iPSCs and downregulated during hematopoietic differentiation (Fig. 4b). Early differentiation proceeded normally in KO clones, with normal patterns of primitive streak and mesoderm gene expression (Fig. 4c) and pluripotency marker loss (Supplementary Fig. 6). The kinetics by which hemogenic endothelium (KDR+/CD31+) and hematopoietic progenitor cells (HPCs, CD34+/CD43+) emerged were also normal (Fig. 4d-e). In this culture system, hemogenic endothelium yields HPCs.
Remarkably, KO clones showed enhanced formation of hemogenic endothelium (Fig. 4d) and KO HPC yield roughly doubled that of WT controls (Fig. 4e-f and Supplementary Fig. 7) with normal cell surface expression of hematopoietic markers (Supplementary Fig. 8). KO HPCs generated normal quantities of mature MKs in liquid expansion culture (Fig. 4g). KO MK morphology and activation in response to agonists were normal (Supplementary Fig. 9-10). KO HPCs also generated increased numbers of erythroid cells (Fig. 4h) and normal quantities of myeloid cells (Supplementary Fig. 11).
Microarray gene expression analyses of WT and KO MKs revealed no statistically significant changes in MK genes, though Gene Set Enrichment Analysis (GSEA) showed a trend toward higher MK-related pathway expression in KO MKs (Supplementary Fig. 12a-e). Overall, 19 molecular pathways were upregulated in KO MKs (Supplementary Fig. 12f and Supplementary Table 7).
These data support a model whereby TPM1 deficiency enhances in vitro hematopoiesis and resulting RBC and MK yield, perhaps helping to explain human genetic association data linking SNPs that lower TPM1 expression34,36 with increased platelet count (Fig. 3h and Fig. 4i).8 Consistent with an impact on HPCs, TPM1-related SNPs have marginal effects on red cell traits in addition to genome-wide significant effects on platelet traits (Supplementary Fig. 13). These findings do not exclude additional effects on terminal MK or RBC development in vitro, nor in vivo effects outside the scope of our model.
TPM1 deficiency could leave filamentous actin more ‘accessible’ to other modulators,40,41 such as other TPMs. Of these, TPM4 promotes MK development18 and likely modulates actin dynamics similarly to TPM1.42 TPM4 isoforms were upregulated during MK differentiation and significantly increased in TPM1 KO iPSCs (Fig. 4b and Supplementary Fig. 14). Increased TPM4 may partially account for our observed enhanced HPC and MK yield.
Enhanced hematopoiesis in TPM1 KO iPSCs contrasts with detrimental effects of TPM1 deficiency on organism fitness in other contexts.6,43,44 For example, abrogated D. rerio thrombopoiesis with tpma-directed morpholinos6 resembles human TPM4 deficiency18 rather than TPM1 deficiency. This highlights the importance of species-specific genetic validation, particularly given inter-species disparities in hematopoiesis.45
In conclusion, we used penalized regression modeling and cellular validation to define a role for TPM1 in constraining in vitro hematopoiesis. In addition to understanding a genetic modifier of hematopoietic traits,6,8 application of our results may augment RBC and MK yield in vitro. Recent advances increasing per-MK platelet yields2 have focused a spotlight on increasing cost effectiveness of in vitro MK generation. In addition to improved recognition of genes and mechanisms underlying quantitative hematopoietic trait variation, application of the computational approach described herein could also help to specify trait-specific causal genetic variants for virtually any clinically relevant human trait.
Methods
In silico analysis
Relevant data sets and coding scripts can be found on GitHub (https://github.com/thomchr/2019.PLT.TPM1.Paper). Human genome version hg19 was used for all analyses, and we utilized the LiftOver script when necessary (https://bioconductor.org/packages/release/workflows/html/liftOver.html). GWAS summary statistics were obtained courtesy of Nicole Soranzo (for 6) and are publicly available (http://www.bloodcellgenetics.org/).
Expression Quantitative Trait Locus analysis
To estimate the number of eQTLs implicated by prior platelet trait GWAS, SNPs in high LD with established GWAS loci8 (EUR r2>0.9) were identified using PLINK. From this set of SNPs, eQTLs and affected genes were identified from GTEx V7.35 Numbers reported in the text reflect unique eQTL SNPs, which often functioned across multiple tissues. The affected gene estimate reflects the number of unique Ensembl gene identifiers (ENSG).
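The counting step can be sketched as follows, assuming the LD proxies (EUR r2>0.9) have already been extracted with PLINK and the eQTL table has been parsed into records. The record layout, identifiers, and tissue names below are hypothetical stand-ins rather than the GTEx file format.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class EqtlRecord(NamedTuple):
    snp_id: str   # variant identifier
    gene_id: str  # Ensembl gene identifier (ENSG...)
    tissue: str   # tissue in which the eQTL was detected


def count_unique_eqtls(records: Iterable[EqtlRecord], ld_proxies: Set[str]) -> Tuple[int, int]:
    """Count unique eQTL SNPs and unique affected genes among LD proxies."""
    snps: Set[str] = set()
    genes: Set[str] = set()
    for record in records:
        if record.snp_id in ld_proxies:
            snps.add(record.snp_id)    # a SNP active in several tissues counts once
            genes.add(record.gene_id)  # a gene hit by several SNPs counts once
    return len(snps), len(genes)


# Toy records with placeholder identifiers (not real rsIDs or ENSG IDs).
toy_records = [
    EqtlRecord("snp_A", "ENSG_X", "whole_blood"),
    EqtlRecord("snp_A", "ENSG_X", "spleen"),
    EqtlRecord("snp_A", "ENSG_Y", "whole_blood"),
    EqtlRecord("snp_B", "ENSG_Y", "liver"),
    EqtlRecord("snp_C", "ENSG_Z", "liver"),  # not an LD proxy, so ignored
]
toy_proxies = {"snp_A", "snp_B"}
n_snps, n_genes = count_unique_eqtls(toy_records, toy_proxies)
print(n_snps, n_genes)
```

Using sets for both tallies mirrors the de-duplication described above: multi-tissue eQTLs are counted once per SNP, and genes once per Ensembl identifier.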
SNP selection
Platelet trait GWAS SNPs were identified from Gieger et al. (see Table 1 in 6). When two SNPs had been identified in a given region, the SNP with the greater effect size was chosen. The resultant 73 SNPs comprised our training SNP set (Supplementary Table 2). The remaining 8 SNPs were designated as a holdout set. From a total of 710 platelet trait (PLT, platelet count; MPV, mean platelet volume; PDW, platelet distribution width; PCT, platelet-crit)-associated GWAS SNPs from a more recent study,8 614 had rsIDs that matched our scored genome; these comprised our validation set. All of these SNP sets can be found on GitHub. We used the Genomic Regulatory Elements and GWAS Overlap algoRithm (GREGOR)46 tool to select control SNPs for our study. GREGOR matched SNPs based on Distance to nearest gene, “LD buddies” (i.e., number of SNPs within an LD block) and Minor allele frequency, and identified controls for each of the 73 training set SNPs.
Chromatin feature selection
We collected a subset of available feature tracks from ENCODE,47 including data for hematopoietic (K562 and GM12878) as well as other cell types (H1-hESC, HUVEC, HeLa, HepG2). We also collected available feature tracks from primary MKs.21,48 See Supplementary Table 1 for a list of these features.
Penalized regression modeling
To generate our model, we first analyzed training set GWAS SNPs (73) and matched control SNPs (780,632) for overlap with 628 chromatin features (data set available on GitHub). Columns representing our 3 baseline parameters (Distance to Nearest Gene, LD Buddies and Minor Allele Frequency) were also included in this data table for each SNP. This chromatin feature overlap data file was then analyzed using the least absolute shrinkage and selection operator (LASSO, L1 regularization, glmnet version 2.0-2)49,50 with 10-fold cross-validation and forced inclusion of the 3 baseline parameters. Baseline parameters were assigned penalty factors of 0, while other chromatin features were assigned penalty factors of 1. Features and coefficients were taken from the λse (Df 12, %Dev 0.062980, λ 6.203e-05). For downstream genome-wide analyses, we scored all SNPs within NCBI dbSNP Build 147.
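To make the role of the penalty factors concrete, the sketch below fits an L1-penalized logistic regression in which each coefficient's penalty is scaled by its own factor (0 forces inclusion without shrinkage, 1 applies the full LASSO penalty). It is a minimal pure-Python proximal-gradient solver written for illustration; it is not glmnet, which uses coordinate descent, and it omits cross-validation. The toy data, feature roles, step size, and λ value are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink a value toward zero by `threshold`, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X: Sequence[Sequence[float]], y: Sequence[float],
                    penalty_factors: Sequence[float], lam: float,
                    step: float = 0.5, n_iter: int = 2000) -> Tuple[float, List[float]]:
    """Minimize mean logistic loss + lam * sum(penalty_factor_j * |beta_j|)."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for row, label in zip(X, y):
            z = intercept + sum(b * x for b, x in zip(beta, row))
            resid = 1.0 / (1.0 + math.exp(-z)) - label
            grad0 += resid / n
            for j in range(p):
                grad[j] += resid * row[j] / n
        intercept -= step * grad0  # the intercept is never penalized
        beta = [soft_threshold(b - step * g, step * lam * pf)
                for b, g, pf in zip(beta, grad, penalty_factors)]
    return intercept, beta


def make_toy_data() -> Tuple[List[List[float]], List[float]]:
    """Balanced synthetic design: only the second column predicts the label."""
    X, y = [], []
    for baseline in (-1.0, 1.0):          # standardized baseline covariate
        for informative in (0.0, 1.0):    # feature enriched among positives
            for noise in (0.0, 1.0):      # feature unrelated to the label
                n_pos = 8 if informative else 2
                for i in range(10):
                    X.append([baseline, informative, noise])
                    y.append(1.0 if i < n_pos else 0.0)
    return X, y


toy_X, toy_y = make_toy_data()
# Penalty factor 0 for the baseline column, 1 for the two chromatin-like columns.
toy_intercept, toy_beta = fit_l1_logistic(toy_X, toy_y, [0.0, 1.0, 1.0], lam=0.05)
print([round(b, 3) for b in toy_beta])
```

In this toy setting the uninformative column is shrunk to exactly zero while the informative column keeps a positive coefficient, which is the selection behavior that motivates using LASSO to choose a small subset of chromatin features.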
Model performance comparison
We used ROCR51 to compare prediction model performance. We used public databases to obtain SNP scores for alternative models (CADD v1.3, GWAVA (unmatched score), DeepSea; https://cadd.gs.washington.edu/download, http://www.sanger.ac.uk/resources/software/gwava, http://deepsea.princeton.edu).
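For readers unfamiliar with the AUC metric reported by ROCR, the following sketch computes it directly from its rank interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. This is a generic illustration with made-up scores, not the ROCR implementation.

```python
from typing import Sequence


def roc_auc(positive_scores: Sequence[float], negative_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up scores for three "GWAS" SNPs and four "control" SNPs.
toy_positive = [0.9, 0.8, 0.4]
toy_negative = [0.7, 0.4, 0.1, 0.2]
toy_auc = roc_auc(toy_positive, toy_negative)
print(toy_auc)  # 10.5 winning pairs out of 12 -> 0.875
```

The pairwise form is quadratic in the number of SNPs, so it is suitable only for small illustrations; rank-based implementations are preferable for genome-scale control sets.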
Model specificity analysis
To compare how restrictive and specific predictive models were for high-scoring SNPs, we first obtained scored NCBI dbSNP Build 147 SNPs for LASSO, GWAVA and CADD models. We quantified the number of SNPs scoring within the top 10% of each model's score scale, that is, above 90% of the maximum score. For GWAVA, this included any SNP scored >0.90 given a maximum score of 1.0. For CADD, this included any SNP scored >32.4 (maximum PHRED score 36). For LASSO, this included any SNP scored >2.32 (maximum score 2.58).
To assess biological specificity, we identified the top 1% highest-scoring SNPs from each model (LASSO, GWAVA, CADD) after excluding platelet trait-associated GWAS loci (81 SNPs from 6 and 710 SNPs from 8). We then used closestBed (https://bedtools.readthedocs.io/en/latest/content/tools/closest.html) to identify the nearest gene to each of these SNPs. Genes and positions were defined by BioMart (http://www.biomart.org/). We then used the Gene Ontology resource (http://geneontology.org/) to analyze pathway enrichment. Input analysis settings were Binomial tests and calculated FDR for GO Biological Process complete. Pathways identified with FDR<5% are presented in Fig. 2e and Supplementary Tables 3-5. Pathways shown in Fig. 2e are GO:0045652, GO:1902036, GO:1901532, GO:1903706, GO:0048534, GO:0030097, and GO:0030220.
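The three cutoffs listed above each equal 90% of the corresponding model's maximum score (0.9 × 1.0, 0.9 × 36, and 0.9 × 2.58 ≈ 2.32). A minimal sketch of that counting rule follows; the function name and the example scores are hypothetical.

```python
from typing import Iterable


def count_high_scoring(scores: Iterable[float], max_score: float,
                       fraction: float = 0.9) -> int:
    """Count scores strictly above `fraction` of the model's maximum score."""
    cutoff = fraction * max_score
    return sum(1 for score in scores if score > cutoff)


# Hypothetical LASSO-style scores with the stated maximum of 2.58.
toy_scores = [2.50, 2.33, 2.32, 1.00, 0.20]
n_high = count_high_scoring(toy_scores, max_score=2.58)
print(n_high)
```

Because the cutoff is tied to the maximum score rather than to a percentile of the score distribution, the number of SNPs passing it can differ widely between models, which is what the comparison in Fig. 2d reports.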
Score validation
Gene Ontology pathways were used to identify key MK genes. A total of 132 “MK genes” were collected from pathways that were returned after a search for the term “megakaryocyte”, including “positive regulation of megakaryocyte differentiation”, “negative regulation of megakaryocyte differentiation”, “regulation of megakaryocyte differentiation”, “megakaryocyte differentiation”, “megakaryocyte development”, “platelet alpha granule”, “platelet formation”, “platelet morphogenesis” and “platelet maturation” (Supplementary Table 8). Gene locations for hg19 were obtained from the UCSC Genome Browser Table Browser feature.
The Genomic Regions Enrichment of Annotation Tool28 (GREAT) was used in combination with the UCSC Genome Browser52 (Table Browser interface) to analyze SNP locations and proximity to known genes.
Enhancer regulatory regions were defined according to the FANTOM5 data set.25 Presented FANTOM5 data represent scores for all overlapping SNPs from dbSNP 147.
Linkage disequilibrium structure assessment
The SNP Annotation and Proxy Search tool (https://archive.broadinstitute.org/mpg/snap/ldsearch.php), LDlink (https://analysistools.nci.nih.gov/LDlink), and 1000 Genomes Project (phase 3) data were used to measure linkage disequilibrium in the EUR population.
Transcription factor binding site identification
Transcription factor binding sites were identified using the Find Individual Motif Sequences (FIMO) and Analysis of Motif Enrichment (AME) tools from MemeSuite (http://meme-suite.org). To identify GATA sites, the genomic sequence contexts for LD blocks containing each GWAS SNP were analyzed for matches (p<0.001) by manual curation of canonical or near-canonical GATA binding motifs in all orientations (AGATAA, TTATCA, AATAGA, TTATCT).
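The motif search can be approximated by a simple exact-match scan for the four listed orientations, as sketched below. This is a simplification: it does not reproduce FIMO's position weight matrix scoring or p-value threshold, it will miss near-canonical sites that differ from the four strings, and the example sequence is invented.

```python
from typing import List, Tuple

# The four motif strings listed in the Methods text.
GATA_MOTIFS = ("AGATAA", "TTATCA", "AATAGA", "TTATCT")


def find_gata_sites(sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start, motif) for every exact motif match, sorted by position."""
    seq = sequence.upper()
    hits = []
    for motif in GATA_MOTIFS:
        start = seq.find(motif)
        while start != -1:
            hits.append((start, motif))
            start = seq.find(motif, start + 1)  # allow overlapping matches
    return sorted(hits)


def substitute(sequence: str, position: int, allele: str) -> str:
    """Return the sequence with a single-base substitution at a 0-based position."""
    return sequence[:position] + allele + sequence[position + 1:]


# Invented reference sequence containing one GATA site, and a variant allele.
reference = "CCAGATAAGG"
variant = substitute(reference, 4, "C")
print(find_gata_sites(reference), find_gata_sites(variant))
```

Comparing the hits for the reference and alternate alleles, as in the last line, is one way to flag SNPs whose minor allele disrupts a GATA site.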
Human iPSC generation
iPSC models were generated as described from peripheral blood mononuclear cells.53 The “CHOP10” and “CHOP14” lines were used in this study. CRISPR/Cas9-mediated genome editing was performed as described54 per protocols from the CHOP Human Pluripotent Stem Cell Core Facility (https://ccmt.research.chop.edu/cores_hpsc.php) with the following guide sequences: 5’ (1) ATGACGAAAGGTACCACGTCAGG, 5’ (2) TGAGTACTGATGAAACTATCAGG, 3’ (1) CCCTTTTCTTGCTGCTGTGTTGG, 3’ (2) GGAGAGTGATCAAGAAATGGAGG.
Karyotyping (Cell Line Genetics, Madison, WI) and copy number variation (CHOP Center for Applied Genomics, Philadelphia, PA) analyses were performed per institutional protocols.
iPSC hematopoietic differentiation and analysis
iPS cells were differentiated into HPCs and terminal lineages (MKs, erythroid, myeloid) per published protocols.37,55–57 Validated flow cytometry gating for pluripotency (SSEA3+/SSEA4+), hemogenic endothelium (KDR+/CD31+), hematopoietic progenitors (CD43+/CD34+ and CD41+/CD235+) and terminal lineages can be found in these references.
Flow cytometry
Flow cytometry analysis was performed on a Cytoflex LX and FACS-sorting was performed on a FACS Aria II (BD Biosciences). Flow cytometry data were analyzed using FlowJo 10 (Tree Star, Inc.). The following antibodies were used for flow cytometry: FITC-conjugated anti-CD41 (BioLegend), PE-conjugated anti-CD42b (BD Biosciences), APC-conjugated anti-CD235 (BD Biosciences), PB450-conjugated anti-CD45 (BioLegend), AF488-conjugated anti-SSEA3 (BioLegend), AF647-conjugated anti-SSEA4 (BioLegend), PE-conjugated anti-KDR (R&D Systems), PECy7-conjugated anti-CD31 (BioLegend), PECy7-conjugated anti-CD34 (eBioscience) and FITC-conjugated anti-CD43 (BioLegend).
Gene expression analysis by RT-semiquantitative PCR
Total RNA was prepared using PureLink RNA micro kits (Invitrogen) in which samples were treated with RNase-free DNase. The reverse transcription of RNA (100 ng-1 μg) into cDNA was performed using random hexamers with Superscript II Reverse Transcriptase (RT) (Life Technologies), according to the manufacturer’s instructions. Real-time quantitative polymerase chain reaction (PCR) was performed on a QuantStudio 5 Real-Time PCR Instrument (Applied Biosystems). All experiments were done in triplicate with SYBR-GreenER qPCR SuperMix (Life Technologies), according to the manufacturer’s instructions. Primers (Supplementary Table 9) were prepared by Integrated DNA Technologies or Sigma Aldrich. Dilutions of human genomic DNA standards ranging from 100 ng/μl to 10 pg/μl were used to evaluate PCR efficiency of each gene relative to the housekeeping gene TATA-Box Binding Protein (TBP).
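One common way to derive PCR efficiency from such a dilution series is the standard-curve method, in which threshold cycle (Ct) is regressed on log10 of template concentration and efficiency is computed as 10^(-1/slope) - 1. The text does not state which calculation was used, so the sketch below is an assumption, and the Ct values are synthetic (constructed to represent perfect doubling per cycle).

```python
import math
from typing import Sequence


def linear_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def pcr_efficiency(concentrations: Sequence[float], ct_values: Sequence[float]) -> float:
    """Standard-curve efficiency: 1.0 corresponds to doubling every cycle."""
    log_conc = [math.log10(c) for c in concentrations]
    slope = linear_slope(log_conc, ct_values)
    return 10.0 ** (-1.0 / slope) - 1.0


# Ten-fold dilutions from 100 ng/ul down to 10 pg/ul (0.01 ng/ul).
toy_concentrations = [100.0, 10.0, 1.0, 0.1, 0.01]
# Synthetic Ct values: each ten-fold dilution adds log2(10) cycles.
toy_ct = [20.0 + math.log2(10.0) * (2.0 - math.log10(c)) for c in toy_concentrations]
toy_efficiency = pcr_efficiency(toy_concentrations, toy_ct)
print(round(toy_efficiency, 3))
```

Computing the same quantity for a target gene and for TBP allows the two efficiencies to be compared before relative expression is calculated.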
Microarray analysis
For microarray analysis, 50,000 cells were FACS-sorted directly into Trizol. RNA was extracted using a miRNeasy Mini protocol (Qiagen). Samples passing quality control were run on the human Clariom D Assay (ThermoFisher Scientific) and analyzed using Transcriptome Analysis Console software (ThermoFisher Scientific) and Gene Set Enrichment Analysis software (http://software.broadinstitute.org/gsea/index.jsp).
Cell analysis and imaging
For Cytospins, FACS-sorted MKs were spun onto a glass slide and stained with May-Grünwald and Giemsa. Images were obtained on an Olympus BX60 microscope with a 40X objective. An Invitrogen EVOS microscope with a 10× objective was used to image cells in culture.
Western blots
Cell pellets were resuspended in Laemmli buffer, sonicated for 5 min, and boiled for 5 min at 95°C. Lysates were centrifuged at 10,000 rpm for 5 min at room temperature, and supernatants were used for analysis. Lysate volumes were normalized to cell counts. Samples were run on 4-12% NuPAGE Bis-Tris gels (Invitrogen) and transferred onto nitrocellulose membranes (0.45 μm pore size, Invitrogen) at 350 mA for 90 minutes. Following blocking in 5% milk for 1 h, membranes were incubated with primary antibodies overnight at 4°C. After washing thrice in TBST, membranes were incubated with secondary horseradish peroxidase-conjugated antibodies for 1 h at room temperature, washed in TBST thrice, and developed using ECL western blotting substrate (Pierce) and HyBlot CL autoradiography film (Denville Scientific). The following antibodies were used for western blotting: Rabbit anti-TPM1 (D12H4, #3910, Cell Signaling Technologies), Mouse anti-TPM1/TPM2 (15D12.2, MAB2254, Millipore Sigma), Mouse anti-TPM3 (3D5AH3AB4, ab113692, Abcam), Rabbit anti-TPM4 (AB5449, Millipore Sigma), and Mouse anti-β Actin (A1978, Sigma). Western blot band quantitation was performed using FIJI (https://fiji.sc/).
MK activation assay
MKs were pelleted and resuspended in Tyrode’s Salts (Sigma) with 0.1% bovine serum albumin (BSA) containing FITC-conjugated PAC-1 (BD Biosciences), PacBlue-conjugated CD42a (eBioscience) and APC-conjugated CD42b (eBioscience) at a concentration of roughly 100,000 cells per 50μl. Following addition of Convulxin (Enzo Biochem) or Thrombin (Sigma), cells were incubated at room temperature in the dark for 10 min. Cells were then incubated on ice for 10 min. An additional 100μl Tyrode’s Salts containing 0.1% BSA were added and cells were immediately analyzed by flow cytometry.
Data presentation
Genome-wide SNP Scores were loaded as custom tracks into the UCSC Genome Browser.52 Images depicting genomic loci were generated using this tool, as well as Gviz.58 Other data were created and presented using R, Adobe Illustrator CS6 or GraphPad Prism 6.
Statistics
Statistical analyses were conducted using R or GraphPad Prism 6.
Author Contributions
CST and BFV conceived of this study. CST, CDJ, KL, JAM, DLF, and BFV conducted and/or analyzed experiments. CST and BFV wrote the manuscript. BFV oversaw the work.
Competing Interests
The authors declare no competing interests.
Acknowledgements
We are grateful for thoughtful suggestions from Drs. Mortimer Poncz, Michele Lambert, and members of the Voight laboratory, as well as technical support from Tapan Ganguly and Hetty Rodriguez (University of Pennsylvania Microarray Core Facility), and the Penn Medicine Academic Computing Services. We thank Dr. Nicole Soranzo for sharing summary level GWAS data6 and Osheiza Abdulmalik for generous use of his microscope for Cytospin imaging.
This work was supported through R01DK101478 (BFV), a Linda Pechenik Montague Investigator Award (BFV), R01HL130698 (DLF, PG), T32HD043021 (CST), a Children’s Hospital of Philadelphia Neonatal and Perinatal Medicine Fellow’s Research Award (CST), an American Academy of Pediatrics Marshall Klaus Neonatal-Perinatal Research Award (CST) and a Children’s Hospital of Philadelphia Foerderer Award (CST).