Integrated analysis of high throughput transcriptomic data revealed specific gene expression signature of cardiomyocytes

Acquiring a specific transcriptomic signature of the human and mouse cardiomyocyte (CM) will greatly increase our understanding of their biology and associated diseases that remain the most deadly across the world. In this study, using comprehensive transcriptomic mining of 91 cell types over 877 samples from bulk RNA-sequencing, single cell RNA-sequencing, and microarray techniques, we describe a unique 118-gene signature of human and mouse primary CMs. Once we had access to this CM-specific gene signature, we investigated the spatial heterogeneity of CMs throughout the heart tissue. Moreover, we compared the CM-specific gene signature to that of CMs derived from 10 differentiation protocols, and we identified the protocols that generate cells most similar to primary CMs. Finally, we looked at the specific differences between primary and differentiated CMs and found that differentiated cells underexpress genes related to CM development and maturity. The differentiated cells conversely overexpressed cell cycle-related genes, resulting in the progenitor features that remain in differentiated CMs compared to primary adult CMs. The presence of histone post translational modification H3K27ac from ChIP sequencing data sets were used to confirm transcriptomic findings. To the best of our knowledge, this is the most comprehensive study to date that unravels the unique transcriptomic signature of primary and differentiated CMs. This study provides important insights into our understanding of CM biology and the molecular mechanisms that make them such a unique cell type. Moreover, the specific transcriptomic signature of CMs could be used in developmental studies, stem cell therapy, regenerative medicine, and drug screening assays.


Introduction:
Cardiomyocytes (CMs) constitute the majority of heart by mass and have an essential role in tissue contraction and overall heart function. 1 CM damage and dysfunction is the primary cause of heartrelated diseases. Despite decades of research related to drug development, regenerative medicine, and cell therapy for heart-related disorders, heart disease remains the global leading cause of death and has seen minimal progress in treatment strategies. 2,3 Gene expression quantification of CMs is a crucial step to understand their contribution to disease development, which could lead to new therapies. 4 This transcriptomic profile of primary CMs could be used as a reference to guide differentiation protocols of induced CMs. More humanistic CMs would then be invaluable for research, drug screening studies, and regenerative medicine. In spite of rich literature exploring the potential uses of differentiated CMs, there is an unmet need to define the unique transcriptomic signature of these cells that would open up their full potential.
Bulk RNA-sequencing is the most commonly used technique in transcriptomic research. While it provides detailed information on the entire tissue under investigation (heart tissue in this case), single cell RNA-sequencing is required to isolate individual cells and look exclusively at CMs. In one recent paper, Wang and colleagues identified different subtypes of human CMs in the left atrium and ventricle. 5 Moreover, spatio-temporal heterogeneity of mouse CMs has been studied by DeLaughter et al. (2016), in which they discovered that the CM subtype depends on the stage of heart development and the location of the heart tissue. 6 Other single cell RNA-seq studies have further noted the heterogeneity of cardiac muscle cells, with a gradient expression of CM marker genes like ACTC1 and MYH6. 7 Moreover, single cell studies have revealed the diverse cellulome of heart tissue through identifying major cell types and their gene expression heterogeneities in the heart. 8 Altogether, despite improvements in our understanding of the transcriptional profile of the heart and CMs, larger studies that combine hundreds of datasets are required to confidently define the specific gene expression profile of CMs compared to other cells in the body.
Heart transplant is currently the treatment of choice for end-stage heart failure; however, it is an unsustainable and rare solution given the scarcity of donors that limits this treatment to a small portion of patients. 9 To overcome such obstacles, other strategies have been proposed, including stimulating the proliferation of existing CMs, activation of endogenous cardiac progenitor cells, and differentiation of pluripotent stem cells or direct reprogramming of somatic cells towards CMs. [10][11][12] Among these techniques, generation of CMs through stem cell differentiation and somatic cell direct reprogramming have earned the most attention over the past decade. For example, van den Berg and colleagues proposed a monolayer differentiation technique through which pluripotent stem cells can be converted to CMs. 13 Similarly, it has been indicated that mouse fibroblasts can be converted directly into cardiac muscle cells using three transcription factors, Gata4, Tbx5 and Mef2c. 14 To assess the maturity of differentiated CMs, the transcriptomic profile is supplemented by the structure and function of the cytoskeleton and contractile apparatus, metabolic status, and electrophysiological properties. [15][16][17][18] These studies also show that, despite significant progress in the generation of more mature CMs, they still fail to functionally resemble their adult in vivo counterparts. Therefore, determining the transcriptomic differences between in vivo and differentiated CMs through comprehensive transcriptomic analysis will allow for more targeted differentiation techniques to create CMs more useful for research and eventually regenerative medicine. 19 In this study, we have conducted comprehensive transcriptomic analysis of 877 gene expression samples complemented with epigenetic data sets to define the specific transcriptomic profile of CMs. In addition, we assessed spatial gene expression heterogeneity of human CM populations.
Furthermore, we used our source of data to precisely compare several differentiation protocols and show the similarity of differentiated cells with their in vivo counterparts. Finally, through assessing 15 different differentiation studies, we found common transcriptomic features of all pluripotent stem cell-derived CMs that were significantly different from mature adult human CMs. Our transcriptomic finding well correlated with H3K27ac peaks derived from ChIP-seq data sets. Our results, in addition to identifying the fundemental transcriptomic features of cardiac muscle cells, could be crucial in future translational studies, like generating mature CMs for cell based therapy and tissue engineering studies.

Results:
To characterize the specific transcriptomic profile of cardiomyocytes (CMs), we analyzed human and mouse transcriptomic data of 91 somatic cell types and subtypes, as well as multiple types of stem cells (65 human and 26 mice cell types). We categorized cells based on the cell type, tissue and organ (location), and developmental stages (embryonic/fetal, pediatric/adolescent, and young/adult). The transcriptomic data came from bulk RNA-seq, microarray, and single cell RNAseq experiments. By analyzing the compiled data sets, we identified a unique profile of genes that was differentially expressed in CMs compared to all other cell types. We analyzed this specific profile of genes to investigate their function and heterogeneity in CMs. In addition, we assessed the success of CM differentiation from human pluripotent stem cells in a number of differentiation protocols to find the most efficient protocols in generating CMs similar to their adult human counterparts. Finally, we compared differentiated CMs to primary adult CMs to explore transcriptomic differences that may hinder the application of differentiated CMs in research and regenerative medicine.
Comprehensive transcriptome mining reveals cardiomyocyte-specific gene expression landscape.
To gain insight into the transcriptomic profile that make CMs a distinct cell type with highly specialized functions, we compiled data from 180 studies that examine 91 different cell types or subtypes from 877 individual RNA-seq and microarray experiments (Fig. 1A, Fig. S1A). All samples were curated to remove outliers. Expression of canonical CM marker genes were checked in the compiled CM dataset and a consistent expression of the marker genes was seen in them which implies the high quality of the gathered CM samples (Table S1 and Fig. S2A). Comparing RNA-seq and microarray data sets of human and mouse CMs to all other somatic and stem cell types revealed 137 genes with significantly higher expression levels in both mouse and human CMs compared to any other cell types. Because the data in Fig 1A is generated from bulk RNAseq and microarray techniques, which are unable to capture transcriptomic information of individual cells, we checked the validity of the identified genes at the single cell level. For single cell gene expression analysis, we used the Tabula Muris Consortium, which contains mice single cell RNA-seq data. 20 Cross-referencing our 137 upregulated genes from bulk RNA-seq and microarray data to the Tabula Muris Consortium identified 118 of the genes to also be significantly upregulated in CMs at the single cell level. Jaccard similarity correlation analysis confirms that these 118 genes make primary CM to cluster together and apart from all other human RNAseq datasets (Fig. S2B). Due to the higher expression of these genes in CMs, these 118 genes functioned as identity markers of CMs compared to other cells ( Fig. 1B and Table S2). Across all cell types, the expression of housekeeping genes remained relatively constant, with the exception of ACTB that had lower expression in CMs compared to somatic cells (Fig. 1B). This observation suggests that the significantly upregulated genes in this study are highly specific for CMs and are not artefacts generated from the acquisition of transcriptomic data. As expected, gene ontology analysis and human phenotype ontology revealed the heavy contribution of these genes to muscle development and in the normal function of the heart, as their dysregulation results in heart-related abnormalities like myopathy ( Fig. 1C and 1D).
To test and validate the reproducibility of our findings, we recruited publicly available important transcriptomic source for human tissue and organs, the Human Protein Atlas (HPA) dataset. 21 Although this source contains whole tissue and organs data (not individual cells), it remains valuable resource to which we can compare our human and mouse transcriptomic data. We extracted the expression level of 118 CM-specific gene list in the HPA data set which contains expression data of 42 different tissues and made an expression matrix which contain 4956 data points (each data point correspond to one single expression value in the matrix). Interestingly, in 98.06% data points (4860 out of 4956) expression of these 118 genes were higher in heart tissue compared to all other tissues which indicates the reliability of our findings. The most similar tissue to the heart tissue in this matrix is skeletal muscle cell. Even though we did not have the skeletal muscle cell information in our study, if we look at the expression of 118 cells individually between the heart and skeletal muscle tissues in the HPA database, we see 69 genes have higher expression in the heart compared to skeletal muscle cells. However, principle component analysis (PCA) analysis using expression value of 118 genes shows a clear separation between CMs and skeletal muscle cells (Fig. S1B). Therefore, we can confidently consider this expression signature as CM specific profile. So far, we considered expression of 118 CM-specific genes which is provided by HPA (intra-database analysis). At the next step we wanted to directly compare the expression values that we get in our analysis with the HPA information (Inter-database analysis). To this end, we chose smooth muscle cells and tissue as an example comparator which is a cell type and tissue which that is present in both datasets. The heart tissue and smooth muscle cell information of HPA were directly compared to heart tissue and smooth muscle cell information of our study respectively. While there was a high correlation between comparisons at tissue and cell level in which both data sets showed similar expression pattern (Fig. 1E), individual expression values were higher in CM compared to the smooth muscle cell ( Fig. 1F and G). In conclusion, the high correlation between CMs and heart tissue from the HPA dataset, as well as and low correlation with smooth muscle HPA data, confirmed the reliability of our data to assess the unique gene expression signature of heart cardiac muscle cells.
To validate the specificity of the identified CM-specific genes in human heart tissue, we reanalyzed the available single cell RNA-seq data set from a Wang et al. (2020), in which they isolated and sequenced the heart tissue that included CMs and non-CM cells. Following the quality control and removing unwanted cells we ended up having 3168 cells including CMs (32%), fibroblasts (13%), smooth muscle cells (11%), endothelial cells (39%), and macrophages (5%) in the human heart tissue consistent with the findings of their original study. We checked the previously identified 118 CM genes in this single cell dataset, and all detected CM genes were again more highly expressed in the CMs than any other heart tissue cell types ( Fig. 1H and Fig. S2C). Due to the low sensitivity of single cell RNA-seq, the expression of some genes could not be identified (Fig. S3).
Overall, we analyzed an extensive array of transcriptomic data to confidently define a CM-specific gene expression profile and we compared this list to CM single-cell sequencing data to confirm the validity of the data.
In addition to measuring the mRNA expression of genes, identifying the level of histone posttranslational modifications provides important insight into the accessibility of the gene for transcription. Post-translational modifications of histone tails, specifically acetylation at lysine 27 of histone H3 (H3K27ac), is tightly coupled and correlated with gene expression activation. In this regard, we used ChIP-seq data from human epigenome roadmap dataset to identify epigenetic signature of our identified genes and correlate these epigenetic signatures with their expression level . 22 We observed enriched H3K27ac marks on the promoters of CM-specific genes, as an example we chose top 15 most highly expressed CM-specific genes to visualize the presence of H3K27ac peaks (Fig. 1B, first panel). The majority of the analyzed genes had a high correlation between increased expression and the presence of H3K27ac in their promoter regions ( Fig. 2A).
We also looked at the presence of H3K27ac on CM-specific genes in 14 other cell types and found that the acetylation peaks were specific to cells of the heart tissue ( Fig. 2A). To control the validity of our analysis across different ChIP-seq experiments, we examined the H3K27ac marks of genes known to be upregulated in other cell types, including pluripotent specific genes (POU5F1/OCT4), a hepatocyte marker (ALB), and a smooth muscle marker (MYH11). The H3K27ac peaks were indeed present in the promoters of these genes in the correct cellular context and were not present in heart and other cell/tissue types ( Fig. 2A, S4A). In addition, we used H3K27ac ChIP-seq experiment from left ventricle of adult female downloaded from ENCODE project to confirm identified peaks using human epigenome roadmap data (Fig. S4B). In summary, analysis of H3K27ac marks revealed a high level of specificity and a strong correlation between H3K27ac marks and CM-specific gene expression of heart cells.
To better understand the interaction of these specific genes and their role in establishing major CM functions, we constructed an integrated gene regulatory network using protein-DNA and proteinprotein interaction data of the genes (Fig. 2B). The constructed network contains 97 of the CMspecific genes which have been connected by 311 protein-protein and protein-DNA interactions ( Fig. 2B). Protein-protein interaction analysis revealed four significant complexes, one of which has 14 highly expressed CM genes that are involved in actin-mediated cell contraction (Fig. 2C,   Fig. S5A, S5B, and S5C). Furthermore, regulatory network analysis revealed that three transcription factors in the list of 118 genes, TBX20, NKX2-5 and TBX5, regulate the expression of most downstream genes in the network. TBX20 alone regulates 40 genes in the network, while NKX2-5 and TBX5 each target 15 genes (Fig. 2D). In summary, by observing protein-protein and protein-DNA interactions, we have identified that many CM-specific genes are involved in cardiac muscle function and additionally, our results revealed that TBX20 is specific master regulator of this specific profile.

Spatial heterogeneity of cardiomyocyte-specific transcriptome
Single cell RNA sequencing technology provides an opportunity to explore heterogeneity within a given cell type or lineage. Therefore, we investigated the spatial heterogeneity of the identified genes in the CMs of heart tissue. To address this aim, we used the Wang et al. (2020) study and analyzed single cell data of 12 normal heart samples (five samples from the left ventricle and seven samples from the left atrium). 5 After removing unwanted and low-quality cells, we kept 4881 cells. Most of the CM-specific genes that we identified in the previous analysis (118 genes), like COX7A1, did not show a significant difference between LV and LA clusters (Fig. 3A). 19 genes did show significant differential expression between LV and LA (Fig. 3A). Among these 19 genes, MYH7, MYL2, MYL3, and TNNI3 were enriched in the LV, while 15 genes were enriched in the LA, including NPPA, NPPB, TRDN, CMYA5 and TTN (Fig. 3A, 3B, Fig. S6A and S6B). We looked into the H3K27ac pattern of these genes to see if we could find the same pattern of heterogeneity at the epigenetic level. Interestingly, we found that multiple genes had different H3K27ac patterns in the atrium compared to ventricle (Fig. 3C). For example, we found that the NPPA, MYL4 and MYL7 promoters had higher H3K27ac peaks in the right atrium compared to the left and right ventricles ( Fig. 3C and Fig. S4B). We identified ventricle-specific genes MYL2 and MYL3 that had higher H3K27ac marks in the ventricles compared to the atrium (Fig. 3C,   S4A). LA specific genes are mainly involved in the muscle cell function and muscle cell development (Fig. 3D). In addition to highlighting heterogeneity between LA and LV for CMspecific genes, we interestingly discovered that CM-specific genes are differentially expressed between different clusters of both the ventricle and atrium, with a continuum expression profile of many genes observed ( Fig. 3E and 3F). We found 73 genes differentially expressed in LA1 versus other LA clusters, and 53 genes show significant variation in LV1 compared to other LV clusters. . This observation may imply that the atrium has more dispersed gene expression than the ventricle. Altogether, our analysis showed that there is a spatial heterogeneity for CM-specific genes between the atrium and ventricle and, moreover, there is a gradient expression of these genes within each heart chamber.

Evaluating pluripotent stem cell differentiation toward cardiomyocytes:
Differentiating pluripotent stem cells (PSCs) into mature CMs that resemble their in vivo counterparts has been a challenge in the field of cardiology research. 23 Differentiated CMs stray from their natural counterparts in multiple aspects, including transcriptomic features, structure of cytoskeleton and contractile fibers, and metabolic and electrochemical properties. 15 Because of the rigorous steps we used to obtain a reliable reference of an in vivo CM transcriptomic profile in the previous steps, we could compare the success of differentiation protocols from PSCs to this in vivo baseline. To this aim, we used the expression data of 15 studies in which CMs had been generated using 10 distinct differentiation protocols (Fig. 4A, Table S3). [24][25][26][27][28][29][30][31][32][33][34][35][36][37][38][39][40][41][42][43] These protocols use combinations of signaling proteins and small molecules over a range of 7 days to 23 days to make the differentiated CMs. Therefore, the list of compiled protocols covered in our study includes a wide range of differentiation molecules and timelines (Fig. 4A).
To evaluate which differentiation protocol generates CMs that are the most similar to those in vivo, we calculated the Jaccard similarity index for all genes identified from RNAseq experiments  (Fig. 4B). In addition to considering the entire set of genes which were identified by RNAseq technique, we looked at the expression of the three specific transcription factors previously identified, TBX20, TBX5, and NKX2-5, that play significant roles in the regulation of CM-specific genes during development and differentiation. 44 Initially, however, we found that TBX5 has a more heterogenous expression in CMs than the other transcription factors and make primary heart samples to be dispersed on Principle Component Analysis (PCA) plot (Fig. S6C). Wang et al., (2020) also reported that TBX5 has a differential expression pattern between the left ventricle and the left atrium. 5 Therefore, we proceeded with TBX20 and NKX2-5 as common references to evaluate the accuracy of the differentiation protocols. PCA based on expression values of TBX20 and NKX2-5 showed high levels of similarity between primary CMs from all studies, but the differentiated CMs showed high variation between studies (Fig. 4C). Based on the PCA results, we identified four studies that possessed the highest similarity to the primary CMs (Fig. 4D). Three of these studies used two separate differentiation protocols, while the last one used a combination of the two protocols.  25 . These four studies exhibit the most similar expression of TBX20 and NKX2-5 to that found in CMs in vivo (Fig. 4E). Expressions of these two transcription factors were used to calculate the Jaccard similarity index between pooled primary CMs samples and differentiated CMs, and we found that the Jaccard similarity index (Fig. 4F)  The previous section determined which differentiation protocols produce CMs with the expression of key transcription factors most similar to primary CMs. Here, we further aimed to accurately identify the transcriptomic differences between primary and differentiated CMs. Therefore, we compared expression profiles of all differentiated CMs (57 samples) to that of heart tissue (61 samples). Although two protocols best mimic the production of primary CMs, we did metaanalysis for all differentiation protocols between generated CMs and primary adult CMs to better understand what all of these protocols should focus on in the future. PCA analysis of expressed genes showed that all differentiated samples, regardless of the protocol followed, have distinct gene expression profiles compared to in vivo CMs (Fig. S7C). Our meta-analysis revealed 348 genes that were significantly differentially expressed between these groups of CMs (Fig. 5A). 224 of these genes had higher expression in primary CMs, while 124 genes had higher expression levels in differentiated CMs (Table S4). Gene expression clustering showed two distinct clusters, in which all in vivo CMs clustered together and all differentiated CMs clustered separately (Fig.   5A). The genes that have higher expression in primary CMs mainly have role in functionality of CMs, like cardiovascular system development and muscle contraction ( Fig. 5B and Fig. S8A). On the other hand, 124 genes had higher expression in the differentiated CMs, most of which play a role in cell cycle regulation, cell division, and proliferation ( Fig. 5C and Fig. S8B). Portion of genes that have higher expression in primary CMs are among the CM-specific gene list. For instance, MB, COX7A1 and NRAP are genes with very well-known functions in CM that differentiated CMs failed to express at levels comparable to primary CMs (Fig. 5D, Fig S9). [45][46][47] The list of genes failed to be expressed by induced CMs is not restricted to those CM-specific genes that we identified in the first step. Instead, there are other well-characterized genes that have crucial roles in CMs development while they also have high expression in other types. For example, FABP4 is involved in the metabolic transition of CMs from glycolysis to fatty acid oxidation, which is an important step in the maturation of CMs. 6 The failure of induced CMs to express genes like FABP4 adds to the list of difference that they exemplify compared to primary CMs. These genes, as well as others that fail to be expressed by differentiated CMs, are included in Fig. 5E and Fig. S10A. These results indicate that regardless of the differentiation protocol used, differentiated CMs fail to express many genes necessary for cardiac muscle cell function and development. On the other hand, among 124 genes that had higher expression in the differentiated CMs and have role in cell cycle regulation and proliferation we can highlight MYCN gene (Fig.   5F, S8B, Fig. S10B). This transcription factor involved in late embryonic and early postnatal proliferation of CMs and is an example of a gene that must be silenced during normal CM development. 48 The above data was generated from bulk RNA-seq datasets. Because some of the gene expression data may therefore come from non-CM cells like smooth muscle cells, fibroblasts, and endothelial cells, we cross-referenced our data to the single cell RNA-seq data from Wang et al (2020). 5 This data confirmed that many genes that are not among the CM-specific profile, like FABP4 and genes involved in cardiovascular system development, are expressed in primary CMs ( Fig. 5G and Fig.   S10A). In addition, single cell gene expression analysis showed that 124 genes present in differentiated CMs have lower expression or are not detectable in primary CMs, including MYCN, CDK1, CDC20, ORC6, TOP2A, BUB1B and AURKB ( Fig. 5F and S10B). In summary, we showed that differentiated CMs fail to establish a gene expression profile that fully resembles in vivo human adult CMs and instead retain a more progenitor-like transcriptome. differentiation protocol, which we previously identified as producing the CMs most similar to in vivo CMs. Secondly, they retain the cells in culture for longer to generate ventricular CMs (about 80 days), which again we demonstrated as increasing the similarity between induced and primary CMs. Therefore, the epigenetic data produced by this protocol possesses both features and makes it ideal for further analysis. We looked into the status of H3K27ac marks during different stages of CM differentiation compared to adult human heart tissue. We first investigated the acetylation pattern of CM-specific genes that were not expressed in the transcriptome data of the differentiated CMs. As expected, we found that genes which were not expressed by differentiated CMs similarly lacked H3K27ac marks (Fig. 5H). These genes included CASQ2, COX7A1, RPL3L, and SGCG, all of which lacked H3K27ac and mRNA expression in differentiated cells but were present in primary CMs. Regardless of the duration of the differentiation protocol, none of the differentiation durations resulted in H3K27ac marks for these genes (Fig. 5H). While these CM-specific genes failed to be acetylated in differentiated CMs at all time points, acetylation status of a group of genes were indeed dependent on the differentiation time. Differentiation of ventricular CMs for 80 days resulted in the similar acetylation pattern of CM-specific genes like COX6A2, NRAP, LMOD3 or genes involved in cardiovascular development like CAV2 between induced and primary cells (Fig. S11). These genes acquired the same H3K27ac pattern at day 80 of differentiation but not at the earlier stages of differentiation (Fig. S11). Other genes that have important roles in CM development, but which are not restricted to the list of CM-specific genes, also have different acetylation pattern in induced CMs compared to primary CMs (Fig .5H). For example, all differentiation protocols failed to result in RAMP3 H3K27ac marks, while CMs from the left ventricle and right atrium have H3K27ac marks in their promoters.
We also investigated the levels of cell cycle-related genes, including MYCN, CDK1, BUB1B, and TYMS and found that differentiated CMs have higher expression of these genes. Consistent with their increased expression, differentiated CM had increased H3K27ac marks, similar to pluripotent stem cells, on these cell cycle-related genes (Fig. 5H). Notably, the in vivo CMs lack both the mRNA expression and acetylation of these genes. Therefore, there is a significant correlation between continued expression of these genes and their active epigenetic marks. An inability to silence these genes in differentiated CMs may provide one explanation for the discrepancy observed between primary and differentiated CMs. In other words, in vivo CMs regulate their cell cycle status, while differentiated CMs may be unable to exit the cell cycle and thus exhibit more progenitor characteristics.
In this study, we analyzed an extensive set of transcriptomic data to define the specific transcriptomic landscape of primary and differentiated CMs. We identified 118 genes that are highly expressed in CMs compared to all other cell types (Fig. 6). These genes are critical for CM function and development, including cardiac muscle fiber organization, energy metabolism, and cardiac development and function (Fig. 6). 19 of these genes showed spatial gene expression heterogeneity between the left ventricle and the left atrium. By using the expression profile of the master transcription factors of CM development, we then assessed available PSC to CM differentiation protocols to find the protocols that generate CMs most similar to adult human CMs, with two protocols determined to be the most successful. Finally, upon comparing the full transcriptome of differentiated CMs to primary CMs, we found that all differentiation protocols result in CMs that retain more embryonic and proliferative features than do primary adult CMs.
Overall, our study provides novel insight into the specific transcriptomic landscape of CMs, including confirming the spatial heterogeneity between CMs in different parts of the heart. Furthermore, we demonstrated the specific weaknesses that all differentiation protocols exhibit, which is a critical step in determining the future changes needed to generate CMs that are suitable for regenerative medicine and research.

Discussion:
Transcriptomic analysis is a powerful tool used to understand the identity and function of cells. In the present study, we focused on unraveling the unique transcriptomic profile of human and mouse primary CMs that subsequently allowed for the comparison of primary CMs to differentiated CMs.
We first gathered the transcriptomic information of 91 cell types and subtypes. Danielsson and colleagues showed that integrating RPKM/FPKM values from datasets generated in different laboratories with different protocols successfully results in the clustering of tissues based on the tissue type and not on the laboratory of origin . 49 The same group found that using FASTQ file to generate FPKM values had no major benefit compared to using the RPKM/FPKM values . 49 Additionally, the use of different analysis pipelines on RNA-seq datasets has been shown to yield 88% similar results for protein coding genes . 50 In our study, we converted the FPKM/RPKM values from multiple datasets to TPM values, which homogenously mixed the data for further analysis.
We then used a single analytical technique to determine the differentially expressed genes.
Although we did not use FASTQ files, our large sample number provided reliable and valid statistical results. In addition, we found a strong correlation of differentially expressed genes between our dataset results and the human protein atlas (HPA) datasets. Furthermore, single cell RNA-seq data and different ChIP-seq data sets confirmed our transcriptomic results. Following comprehensive data mining, provided a list of genes that have higher expression in mouse and human CMs compared to all other cell types. While our study investigated a large and diverse number of cell types, one limitation of the current study is that we could not include transcriptomic analysis of every somatic cell type that exists in the body. This would have increased the power of our findings and may have heightened the specificity of genes found to be expressed by CMs. For example, we have 61 heart tissue samples and 34 smooth muscle cell samples, but we have no samples specific to skeletal muscle. Using datasets from the (HPA), we discovered that a portion of the 118 genes identified as CM-specific genes indeed have high expression in skeletal muscle tissues. 21 Even this comparison between our findings and data from the HPA dataset is imperfect, as their data contains signals from various cell types present in a single tissue. Because we compared our data to well-defined, single cell CM data, we are confident that our results are accurate for detailing the expression profile of CMs compared to most other cell types. To increase the validity of H3K27ac of our findings, we used human epigenomic data from ChIP sequencing experiments of H3K27ac mark. Indeed, our transcriptional findings of primary and induced CMs were in agreement with the epigenetic signature of the genes. The project would be more impactful if it included proteomic data, which could have been further supplemented with functional confirmation. Despite this possible addition that could increase the validity of our findings, transcriptomic expression profiling coupled with epigenetic confirmation has been proven to be a reliable source of information that can lead to greater conclusions and benefit both research and medicine. 21,22,51 Additionally, our results provide important guidance for future proteomic and functional studies, as we have identified specific genes and pathways implicated in these cells.
As expected, the majority of the genes that were upregulated specifically in CMs were genes with well-known roles in CM function and development. For example, master regulators of CMs, including NKX2-5 and TBX20, were among the identified upregulated genes. 52,53 In addition to the expected genes, we also found a small group of genes specifically upregulated that do not have an identified role in CMs, such as HHATL. This gene might have a role in the formation of a septum between ventricles as it is involved in the sonic hedgehog (SHH) signaling cascade. 54 Obtaining a list of CM-specific genes, as we have done here, may have a number of applications.
For instance, it could potentially be used in regenerative medicine to check the extent of CM differentiation based on the similarity to their in vivo counterparts. This list may also act as a reliable source for characterizing primary CMs during development.
The majority of studies that have profiled the transcriptome of CMs have performed their investigations on the entire heart tissue, even though the heart at least contains five different cell types. 5 In our study, although we obtained the list of CM-specific genes using microarray and bulk RNA-seq techniques, we confirmed our results using publicly available heart single cell RNA-seq datasets. By doing this, we were able to benefit from the advantages of each technique. We had the large cell number and experiment sizes from bulk RNA-seq, and we could compare this data to the very specific, though small power samples from single cell RNA-seq. Through this approach, we identified CM-specific genes and were able to identify spatial heterogeneity present in the heart tissue. For instance, we confirmed that NPPA has increased expression in mouse and human primary CMs compared to other somatic cells, and we noted that its expression is specifically increased in the atrium compared to the ventricle. It has previously been indicated that NPPA is reduced in the ventricles upon birth. 55 Furthermore, it has been shown that NPPA mutations are correlated with diseases, such as atrial fibrosis. 56 Moreover, it has been indicated that NKX2-5 and TBX20 bind to the NPPA promotor and control its expression. 57,58 In our study, we confirmed the expression of these genes specifically in CMs and used ChIP sequencing data to confirm a direct interaction between TBX20 and NPPA. Altogether, with the expression of key transcription factors and known CM-specific genes that agrees with previous literature, this is the most thorough and high-throughput study of the CM transcriptomic profile to date.
Another aim of this study was to clarify the differences between primary and differentiated CMs based on their transcriptional profiles. To this aim, we compared the gene expression profiles of primary and induced CMs and concluded that the populations differed in two ways. Firstly, the induced CMs fail to express genes critical for development and function of CMs. For example, induced CMs do not express COX7A1, the absence of which results in dilated cardiomyopathy in mouse studies. 46 Furthermore, NRAP is the second largest actin-binding cytoskeletal protein and is expressed in myofibril precursor cells during myofibrillogenesis. 59 Homozygous truncated NRAP genes are found in the whole exome sequencing data of dilated cardiomyopathy patients which highlights the importance of this gene CMs development. 47 In our analysis, we confirmed that induced CMs have lower expression of this gene compared to primary CMs (Fig. 5E).
Secondly, the induced CMs aberrantly down regulate positive regulators of the cell cycle, preventing them from becoming fully differentiated (Fig. S5B and Fig. S7B). For instance, MYCN, which is known to induce cell cycle progression and proliferation, is more highly expressed in differentiated cells than in primary adult CMs. 48 It has been shown that MYCN is regulated by Sema6D that regulates cardiac muscle cell proliferation and prevents differentiation of cells. 48 Therefore, differentiation protocols may aim to down-regulate MYCN to inhibit proliferation and promote differentiation of CMs that would make them more similar to their primary counterparts. Altogether, these findings provide a detailed view of the current state of CM differentiation protocols compared to in vivo CMs, and it may guide future research that will produce more impactful CMs.
To the best of our knowledge, this is the first study that has compiled such an extensive and comprehensive transcriptomic profile of primary CMs. The large number of cell types and subtypes investigated, combined with the high consistency between our findings and those in the HPA dataset, make our findings very reliable. Based on this data, the detailed primary CM transcriptomic profile may guide future CM differentiation protocols that were analyzed in this same study. To detect outliers in the generated source of normal human and mice cells we used TPM values for PCA analysis and hierarchical clustering. TPM values were used to perform PCA using the built-in function of R called prcomp, and the results were visualized for 2D and 3D PCA using the ggplot2 and PCA3D packages in R, respectively. The first and second principle components were used to remove outliers. Overall, 57 RNA-seq samples were removed from 15 independent datasets (studies), resulting in 584 RNA-seq samples that we continued to use (Table S5).
In addition to bulk RNA-sequencing, single cell RNA-seq datasets GSE109816 and GSE121893 from Wang et al., (2020) were used to investigate CMs more specifically. Single cell RNA-seq was also used to explore the gene expression heterogeneity of cells. 5 Microarray data sets from 62 studies with 273 samples were also used to analyze the transcriptome of cells. We downloaded affymetrix mouse genome 430 2.0 array raw files (.CEL files). All samples were filtered to only include normal somatic cells (Table S5). In addition, all data was normalized using Robust Multichip Average (RMA) in R through the AffylmGUI package. 62,63 Finally, to add further confirmation to the transcriptomic analysis, we analyzed H3K27ac ChIP-

Meta-analysis
To get a specific transcriptomic profile of CMs, we compared the gene expression profile of CMs with all cells compiled in our source. To do this with a meta-analysis approach, we created a transcriptomic reference of CMs and subsequently compared all other somatic and stem cell types to this reference. These transcriptomic references were made for both human and mouse CMs' RNA-seq and microarray samples separately (Fig. S12A). In each category of mice and human RNA-seq and microarray data we obtained DEGs (Fig. 12). For example, to make this reference for human RNA-seq expression datasets, we merged 61 human heart tissue RNA-seq samples from three independent studies and considered them as a single reference to compare against all human somatic and stem cell RNA-seq samples. In order to handle such enormous amounts of data, NetworkAnalyst 66 was used was used, but it has a limitation that prevents analysis of all datasets at the same time. To overcome this obstacle, we compared the expression profile of CMs to five and six somatic/stem cell datasets at a time for human and mouse respectively, which resulted in 16 groups and 81 comparisons for human RNA-seq samples and 3 groups and 18 comparisons for mouse RNA-seq samples (Fig. S12B).
To prepare the data for meta-analysis, we inputted our values into the "multiple gene expression table" section of NetworkAnalyst. After uploading the data as a text file, we converted the gene IDs to "Official Gene Symbol", thus creating a common ID format for all of our data. We used the "visualization data plot" to check the distribution of values within samples, and we performed differential expression analysis on the data using p-value of 0.05 as a cut-off, which was generated using Benjamini-Hochberg's False Discovery Rate. For meta-analysis, the combined effect size, combined p-value, vote counting, and combined fold change (direct merging) were the four criteria used to detect significantly differentially expressed genes (DEGs), though the combined effect size and combined p-value were the primary criteria used to select the DEGs (Fig. S12). The effect size considers the size of the difference between two groups and is calculated by dividing the mean differences of the comparisons by the standard deviation. Based on the Cochran's Q tests we used random effects model to determine effect sizes (genes with pvalue < 0.05 were kept as significant genes). To calculate combined p-values we used Fisher's method, which is a weight-free method (combined p-values < 0.05 were kept as significant genes).
Vote counting is a way to identify genes that are significant across all comparisons during metaanalysis. For vote counting we kept genes with vote counts 5 and for combined fold changes, we kept genes with p-values <0.05 and fold change >2 as significant genes. For each criterion, we considered only significant genes that passed a certain threshold (Table S6). We then extracted genes that were significantly differentially expressed as measured by all four criteria, thus removing any genes that were deemed non-significant by one of the metrics. Finally, we merged all genes from the human and mouse RNA-seq and microarray datasets to create a finalized list of significantly differentially expressed genes in CMs compared to all other somatic and stem cells in our datasets (Fig. S12).

Single cell RNA-seq data analysis
All single cell analysis was performed with version 3 of Seurat. 67,68 The raw umi matrix of the heart single cell study was downloaded from GEO with GSE109816 and GSE121893 accession numbers. 5 We excluded the cells that expressed more than 72 percent of mitochondrial genes,

Network analysis
To construct a gene regulatory network, transcription factor (TF)-binding sites were obtained from For our analysis, complexes that have score greater than 2.0 were considered significant protein complexes.
g:Profiler was used for gene ontology analysis, and results were filtered based on the adjusted-pvalue. 76 TPM values from RNA-seq experiments were used to generate heatmaps using ggplot2 and Superheat packagesuperheat packages in R. 77

Jaccard similarity matrix calculation
The weighted Jaccard similarity index calculates the similarity between two numeric vectors (in our study two RNAseq samples). The range of this similarity is from 0 to 1, in which 1 is highest similarity and 0 is lowest similarity between samples. The Jaccard similarity index is calculated using the sum of minimum values in both vectors (the intersection of two RNA-seq data samples) divided by the sum of the maximum values in both vectors (the union of each). We used MATLAB code to calculate the Jaccard similarity index automatically.       in this study as CM-specific expression profile. Table S3. The list of differentiation data set and differentiation protocols that we used in this study.