ABSTRACT
scATAC-seq explains chromatin accessibility at cell-type resolution. Accordingly, this process is crucial for advancing our understanding of pathology and disease states. However, annotated data from scATAC-seq are both extensive and sparse; thus, conducting multidimensional analyses under multiple conditions is a challenging task. TDbasedUFE is a valuable tool for analyzing scATAC-seq data as it can extract genes in an unsupervised manner under multiple conditions based on tensor decomposition. We analyzed scATAC-seq data from the peripheral blood mononuclear cells of patients with sepsis infected with S. aureus using TDbasedUFE.
We extracted genes that exhibited different responses in methicillin-resistant (MSSA) and methicillin-sensitive S. aureus (MSSA) strains in sepsis for each cell type. Subsequently, we searched for studies containing gene sets similar to the extracted genes and predicted their functions. We also constructed protein-protein interactions (PPIs) for the extracted genes, defined hub proteins as central to the interactions based on degrees and clustering coefficients, and investigated the functions of these hub proteins. The genes of interest were abundant across all cell types, ranging from 710 to 1,372 genes. The functions of the extracted genes were predicted to be associated with several diseases or physiological substances. The hub proteins identified from the PPI analysis were mainly related to the ribosome, and their functions were associated with protein synthesis. These results highlight the suitability of TDbasedUFE for the analysis of scATAC-seq data. The functions of the genes identified in this study may provide insights into new promising therapeutic approaches, considering the distinction between methicillin resistance and S. aureus sepsis.
Introduction
Gene expression is influenced by the chromatin structure. An Assay for Transposase Accessible Chromatin with high-throughput sequencing (ATAC-seq) provides a valuable profile for analyzing epigenetic regulatory functions by revealing the open chromatin regions1. Single-cell ATAC-seq (scATAC-seq) allows the examination of cellular heterogeneity by analyzing the cell type-specific distribution of chromatin accessibility2. However, as the total length of coverage of DNA fragments obtained from scATAC-seq is very short compared to the full length of the genome, the resulting scATAC-seq profile is very sparse. The scATAC-seq data under multiple conditions are formatted as high-dimensional tensors after sparse individual scATAC-seq profiles are aggregated, resulting in a difficult analysis. TDbasedUFE3 is a powerful application that applies tensor decomposition to analyze large tensors, enabling the extraction of biologically critical genes (e.g., disease-causing genes). Tensor decomposition has already been proven to be successful at analyzing large sparse matrices of scATAC-seq data4.
In a clinical setting, whether Staphylococcus aureus (S. aureus) is methicillin-resistant must be elucidated. In S. aureus infections, methicillin resistance serves as a guide for selecting an appropriate treatment option5, and the recommended antibiotics differ according to their methicillin-resistant S. aureus (MRSA) or methicillin-susceptible S. aureus (MSSA) status6. Similarly, in bacteremia caused by the same S. aureus infection as sepsis, early molecular diagnosis of MRSA is expected to reduce mortality and treatment costs7. However, the management of S. aureus in sepsis does not distinguish between MRSA and MSSA and antibiotics effective against MRSA are prophylactically recommended8. Furthermore, the increased frequency of vancomycin use as a first-line drug for MRSA infections is considered a contributing factor to the recent decrease in the vancomycin sensitivity of S. aureus9. Notably, empirical antibiotic therapy poses a public health risk. Therefore, comparing the epigenetic responses of patients with MRSA and MSSA infections in sepsis may lead to selective and more effective treatment strategies.
Distinguishing between MRSA and MSSA in sepsis has not been extensively achieved using the Differentially Expressed Genes (DEGs) obtained from the analysis of host single-cell RNA-sequencing (scRNA-seq) data10. The analysis of multi-omics data from scRNA-seq and scATAC-seq using MAGICAL revealed 53 circuit genes, which represents at most one-third of the circuit gene between infection and control10. The 53 circuit genes were expressed between those infected with MRSA and those infected with MSSA (area under the receiver operating characteristic curve from 0.67 to 0.75)10. In this study, using annotated scATAC-seq data10 obtained from peripheral blood mononuclear cell (PBMC) samples from healthy controls and patients infected with two strains of S. aureus, TDbasedUFE3 was employed to identify genes with distinct expression between MRSA and MSSA. Enrichment analysis was conducted using the obtained genes to determine their functions. Furthermore, using the STRING database11, we investigated the protein-protein networks to identify the hub proteins.
Results
Gene extraction by TDbasedUFE
We identified genes for the following six cell types: Naïve CD4+ T cells (CD4 Naive), CD4+ central memory T cells (CD4 TCM), CD8+ effector memory T cells (CD8 TEM), CD14+ monocytes (CD14 Mono), CD16+ monocytes (CD16 Mono), and natural killer cells (NK), using TDbasedUFE3. Table 1 lists the numbers of patients for each cell type. The condition was listed in Table 1. The patient counts correspond to the number of Singular Value Vectors (SVVs) in the bar graph under the same conditions (left red bar in Figure 1). Similarly, the right green bar graph in Figure 1 represents SVVs for each condition, where the conditions listed from left to right include the Control, MRSA, and MSSA. For all cell types, the index of condition l3 for SVVwith opposite signs for MRSA (k = 2) and MSSA (k = 3), was selected. In all cell types, SVVs under the same conditions (left red bar graph in Figure 1) had the same sign, indicating that tensor decomposition was performed to minimize the dependence on patients. In CD14 Mono, CD16 Mono, and CD4 Naive cells, the components of the control were the smallest, suggesting that genes distinguishing between MRSA and MSSA in healthy individuals were selected. CD4 TCM and NK cells had significantly larger absolute values specifically for MRSA, whereas in CD8 TEM, they were selected to differentiate between MSSA and the control. Although various independences of SSVs were found under the three conditions for individual cell types, all selected SSVs had large absolute values under the conditions of either MRSA or MSSA, indicating that the genes of interest were selected for all cell types (Figure 1).
The numbers of extracted genes and their similarities are summarized in Figure 2. Figure 2 (a) shows the number of genes commonly extracted from each cell type, where the diagonal elements represent the number of gene-specific cell types. The number of genes ranged from 710 for CD16 Mono to 1,372 for CD4 TCM. Figure 2 (b) shows the percentage of gene overlap between different cell types. These ratios are not symmetrical because they represent the proportion of genes present in the cell type of the column among those present in the cell type of the row. The largest gene overlap was observed between NK and CD8 TEM cells, with a similarity of 12%. This result suggests that the individual cell types share a limited number of genes with those selected for each cell type.
Enrichment analysis
In Table 2, papers that encompass gene sets selected across all cell types using Rummagene12, including their respective keywords, are listed. The criterion for selecting terms was based on the top 1,000 terms with smaller adjusted P-values for each cell type and common across all cell types. Thus, the papers in Table 2 were notably selected across all cell types, at least for the cell types in this study. The keywords were subjectively selected based on their biological importance from the titles or abstracts of each paper, with a maximum of two per paper.
The terms were assigned codes in increasing order of the geometric mean of P-values (or in decreasing order of L, which is the absolute value of logarithm of the adjusted P-value), with a higher L indicating a paper that is more likely to contain gene sets significantly resembling our extracted genes. These keywords predicted potential functions of the extracted genes related to the presence or absence of methicillin resistance in S. aureus.
The adjusted P-values for the top 1,000 terms ranged up to the 10−7 scale. The papers or gene sets listed in Table 2 had large L ≥ 13, indicating significant selection across all cell types. Among the gene sets involved in the study shown in Table 2, several gene sets with cell-type distinction were obtained from T cells and NK cells. Among the sequenced gene profiling methods, RNA-seq is the most frequently used. Keywords, such as tissue, disease, and molecular names, were found to markedly vary. Creating biologically meaningful classifications has become challenging owing to this diversity. Two keywords were commonly selected: “SARS-CoV-2” in RG02, RG14 and RG16, and “leukemia” in RG07 and RG15 (considering acute myeloid leukaemia (AML) as a type of leukemia). The detailed results for the top 1,000 items in this table are available in Supplementary Table 3.
The existence of papers supporting the association between keywords and methicillin resistance was summarized for each keyword in the subsequent “Discussion” section.
STRING analysis
The extracted genes for each cell type were uploaded to STRING11 to obtain PPI data for each cell type. Only edges with a confidence score of > 0.900 were retained, capturing a network of strongly correlated proteins (Supplementary Figure 1). The average cluster coefficients14 ⟨Ci⟩, which characterize the shape of the network and the total number of nodes representing the size of the network, are listed in Table 3. ⟨Ci⟩ is the mean of all node cluster coefficients Ci, with values close to 1 indicating a graph closer to a complete graph and values close to 0 indicating a graph closer to a tree graph14. ⟨Ci⟩ for all cell types, except NK, falls within the range of 0.26±0.02, whereas NK exhibits an extremely low result of 0.14. Therefore, the PPI of NK had a structure similar to that of a tree, indicating relatively sparse and fewer interactions among proteins. The total number of nodes varied among different cell types, ranging from 155 for NK cells to 474 for CD4 TCM, indicating a wide range of node counts. Among the cell types, except for NK, with a similar ⟨Ci⟩, the largest network had more than twice the number of nodes compared with the smallest network. NK had the smallest total number of nodes; however, the network size was not significantly correlated with ⟨Ci⟩. Accordingly, NK cells may possess a specific PPI with a relatively small scale and sparse interactions. This result suggests that the functions of the gene products of the extracted genes in NK cells are relatively straightforward.
Table 3 lists the number of hub proteins for each cell. The hub type is presented in Table 3. Hub proteins were defined in the PPI network as proteins with a degree greater than five, indicating that they are central proteins interacting with a relatively large number of proteins. Among the hub proteins, those with a clustering coefficient greater than 0.5 (⟨Ci⟩ > 0.5) were defined as “family hubs,” and others were categorized as “normal hubs” with ⟨Ci⟩ ≤ 0.5 (Figure 3). Based on the definition of family hubs, these proteins form dense interaction groups, including those adjacent to each other. Thus, family hubs are the central proteins in a protein cluster connected by strong relationships.
The hub proteins in Table 3 indicate the total number of normal and family hubs, and the family rate represents the percentage of family hubs among the hub proteins. The number of nodes and counts of various hubs revealed a strong correlation, with a coefficient greater than 0.9. In contrast, the family rate did not show significant correlations with other values (with absolute value of correlation coefficients less than 0.7) and varied from 0% to 35% across the different cell types. Thus, the density of family hubs (or protein families) varies across cell types depending on the genes obtained for each cell type. In addition, despite the specifically low ⟨Ci⟩ in NK, the family rate was 33% in the higher group. Therefore, no correlation was found between the family rate and network structure.
Table 4 lists the family hubs for each cell type, including their identifiers in approved symbols with HGNC nomenclature13. For CD8 TEM and CD4 TCM, a relatively large number of proteins was selected. However, a reduced number of unique identifiers was found due to the presence of several proteins associated with mitochondrial ribosomes (MRPL and MRPS) and ribosomal proteins (RPL and RPS). In NK cells, the four hub families were exclusively composed of ribosomal-related proteins. Figure 4 shows the network plots of genes and enriched terms for each library:GO_Biological_Process_2023 (GOBP)16GO_Cellular_Component_2023, (GOCC)16 and, (KEGG)KEGG_2021_Human17. Combining all cell types, 16 terms were obtained for GOBP, 15 were obtained for GOCC, and 3 were obtained for KEGG. GO_Molecular_Function_2023(GOMF)16 terms were not obtained for any cell type.
Multiple overlapping terms (nodes containing multiple colors) were observed across different cell types, suggesting that the biological roles of the family hub are shared among various cells. The abundance of terms related to gene expression, translation, and ribosomes GOBP suggests that the family hub plays a role in preparing for active protein synthesis. In GOCC, no terms were selected across three or more cell types, suggesting that the cellular activity of the family hub may vary among different cell types. In all libraries, terms associated with CD8 TEM were included in the terms for CD4 TCM, which are both memory T cell types. In GOCC, the terms from CD8 TEM formed a distinct network. Terms related to ribosomes were obtained for all cell types, providing further evidence that this family hub contributes to protein synthesis in all cell types. In KEGG, “Ribosomes” were obtained with very small P-values (ranging from P≈10−12 to P≈10−7). “Coronavirus disease” was the other term obtained for multiple cell types. All results of the enrichment analysis of the family hubs are available in the Supplementary Table 5.
Discussion
This study sought to investigate the epigenetic functions of genes related to S. aureus sepsis by selecting genes from ATAC-seq data. The extracted genes represented the difference between the genetic responses of patients with sepsis infected with MRSA and those infected with MSSA, with the functions of these genes being unknown. The functions of the genes identified in this study may provide a novel perspective on the symptomatic treatment of S. aureus sepsis, considering differences in methicillin resistance. Enrichment analysis suggested several potential functions of the extracted genes and STRING analysis indicated potential functions related to ribosomes.
In gene extraction, owing to differences among patients for each cell type and the consistent direction of the SVVs under the same conditions, as indicated in Table 1, no patient bias was found in the identified genes. The genes were extracted based on genes that were significantly expressed (had high scores) in patients with MRSA and those with MSSA. Therefore, these genes are no longer genes uniquely expressed on one side (MRSA or MSSA).
The extracted genes were selected using scATAC-seq alone and a scoring metric for genes targeted at the Transcription Start Site (TSS)10,18. Therefore, gene ontologies and pathways may not function effectively as libraries for enrichment analysis due to this method. In the enrichment analysis, when subjecting the genes to Enrichr19 for each cell type, no significant terms were obtained from the GO library (GOBP, GOCC, GOMF) for all cell types. In the KEGG analysis, terms were only identified for memory T cells (CD4 TCM and CD8 TEM).
The relationship between the keywords in Table 2 obtained from enrichment analysis and methicillin resistance in S. aureus is supported by several papers found during the analysis (Table 5). The keywords identified in these previous studies strongly suggest the functions of the extracted genes, providing evidence that the approach used in this study with Rummage12 successfully predicted the gene functions. Keywords can serve as clinical markers for determining methicillin resistance in S. aureus that causes sepsis or as clues when exploring diseases, in which methicillin resistance in S. aureus could be a risk factor.
Methods
An overview of these approaches is presented in Figure 5.
Data set
The dataset is included in series GSE22018810 of the Gene Expression Omnibus (GSO) repository provided by the National Center for Biotechnology Information (NCBI). GSE22018810 is scATAC-seq that has already been annotated using ArchR18 of PBMCs from 10 MRSA samples, 11 MSSA samples, and 23 healthy control samples. In this study, the analysis was conducted only on cell types derived from the MRSA, MSSA, and healthy control groups, ensuring a sample size of five or more for each category (the sample count for each condition and each cell type). Specific sample names and their attributes are summarized in Supplementary Table 1.
Tensor decomposition
We opted to provide an explanation of TDbasedUFE3. Briefly, for each cell type, we arranged the scored datasets into third-order tensors with three axes: gene, sample, and infectious conditions. Let N be the number of genes. In addition, for a specific cell type, let Mc denote the number of samples and Kc the number of infectious condition types. In this study, three conditions were considered: healthy controls, MRSA, and MSSA, resulting in Kc = 3. Furthermore, the number of genes was N = 23746. The number of patients Mc is provided in Table 1. Let represent the score assigned to the ith gene of the jth sample under the kth condition for a certain cell type (its index in c). The tensorswere standardized using the gene vector from the scored scATAC-seq data for each j and k: Therefore, the averageand distribution of were When applying higher-order singular value decomposition (HOSVD)3,24tousing the gene size of the core tensor I1 ∈ ℕ, the following result is obtained where is a core tensor representing the weight of the product . , and are singular value orthogonal matrices (SVOM)3 (Figure 6 (a)). Choosing a value for ln, (n = 1, 2, 3) results in the selection of a vector from , which is referred to as the Singular Value Vector (SVV) that represents differences in the corresponding direction3 (for example, when considering representing conditions, if a difference exists between and, a difference between the first and second conditions is implied). I1 ∈ ℕ, (I1 ≤ N) also represents the multilinear rank24 in the gene direction (its index in i). To simplify the calculations, we set I1 = 10. The other ranks were not omitted for subsequent “Unsupervised feature extraction,” as they are square matrices (Figure 6 (a)).
Unsupervised feature extraction
The tensor decomposition of TDbasedUFE3 allows unsupervised feature extraction. However, a decomposition method must be chosen for our study. In this study, we created tensors with three conditions: control (k = 1), MRSA (k = 2), and MSSA (k = 3) to determine the differences between MRSA and MSSA. Accordingly, we chose l3, where the largest difference exists between and . In contrast, to minimize the differences among patients, we chose l2 where all values have the same sign. Under these conditions, l2 and l3 were uniquely determined in this analysis. Thereafter, l1 was selected to maximize the absolute value of core tensor. Consequently, we obtained the SVV, which represents the different genes identified under the condition of methicillin resistance of S. aureus. Assuming that the obtained SVV distribution followed a Gaussian distribution (null hypothesis), we selected the genes that exhibited significant differences. As a 2-tailed χ2 test was conducted, the P-value Pi is where is the cumulative χ2 distribution and the argument is larger than x3. is the optimized standard deviation, such thatobeys a Gaussian distribution as much as possible25 (Figure 6 (b)). Thereafter, the obtained P-values Pi were adjusted using the Benjamini-Hochberg (BH) procedure26. with adjusted P-values Qi below a threshold were selected, and these were considered as the extracted genes (Figure 6 (c)). In this study, the threshold was set to 0.01.
Enrichment analysis
We proceeded to evaluate the gene sets that exhibited similarities to the extracted genes. The genes extracted from each cell type were processed using Rummagene12 and their overlap with publicly available gene sets in PubMed Central (PMC) was verified. In this gene set, as more PMC gene sets were selected using Rummagene12, we considered the terms chosen for all cell types to represent the functions associated with the extracted genes. The top 1,000 terms with the smallest P-values, adjusted using the BH procedure26, were selected for each cell type. Subsequently, we extracted the terms chosen for all cell types.
The relevance to genes was assessed by taking the score, L, which is the logarithm base 10 of the geometric mean of the adjusted P-values (or the average of the logarithm base 10 of the adjusted P-values Pad), represented by the following formula: where nc represents the total number of cell types, that is, nc = 6, as shown in Table 1. In addition, ⟨X ⟩geo denotes the geometric mean of X, and denotes the adjusted P-value for the terms upon input of the genes extracted from the cth cell type.
Overall, papers with a larger L are more likely to include gene sets closely related to those extracted. Keywords, such as disease and tissue names, were extracted from the terms selected by Rummagene12, and their association with methicillin resistance was investigated.
STRING analysis
We conducted STRING analysis11 for each cell type. Protein-protein interactions (PPI) were considered when confidence, representing the credibility of the interaction, was greater than 0.900. Nodes that were not connected to any other node (degree 0) were excluded from the PPI network. We analyzed the graph-theoretical properties of the obtained PPI using the R package “igraph”27 and identified hub proteins that could be central to the protein interactions. The definition of a hub is based on the study by Jing-Dong J. Han et al.28, in which a node with a degree greater than five is considered a hub protein. Maslov and Sneppen argued that proteins that bind to hub proteins do not tend to interact with other hub proteins29. Therefore, in this PPI, where only strong correlations were retained, the boundaries between the hub proteins were expected to become clearer. A cluster formed by interactions centered around a hub is referred to as a community. We calculated the clustering coefficient for each hub to measure the extent to which interactions occurred within communities. As proposed by Watts and Strogatz14, the clustering coefficient is defined as the proportion of actual connections among neighboring nodes. Therefore, the clustering coefficient Ci is computed as where ki ≥2 is the ith node of degree. Ei is the number of connections among neighboring nodes, and Si is the number of possible connections among neighboring nodes. Among hub proteins, those with a clustering coefficient Ci > 0.5 were defined as “family hubs.” In contrast, hub proteins with a cluster coefficient Ci ≤0.5 were defined as “normal hubs” (Figure 3).
These hub proteins are speculated to have communities that resemble complete graphs and are likely to possess biological functions. Furthermore, by definition, family hubs are likely to overlap with gene or protein families, indicating that they are part of a family in a PPI context. We identified hubs, particularly crucial family hubs, using the network obtained from the STRING analysis.
Subsequently, to investigate the biological functions of the family hub obtained, gene set enrichment analysis was performed using Enrichr19 for genes with the same names as those in the family hub. We conducted enrichment analysis for each cell type using four libraries of Enrichr19: GO_Biological_Process_2023 (GOBP)16GO_Cellular_Component_2023, (GOCC)16, GO_Molecular_Function_2023(GOMF)16, KEGG_2021_Human and, (KEGG)17. The enrichment analysis was performed using the R package “clusterProfiler”15, with a significance level of α = 0.01. P-values were corrected using the BH procedure26.
Author contributions statement
Y.T. conceived the study, S.W. and Y.T. designed the research methods, and S.W. analyzed the data. All authors reviewed the manuscript.
Acknowledgements
We would like to thank Editage for English language editing.