Abstract
Understanding the molecular mechanism of SARS-CoV-2 infection (the cause of COVID-19) is a scientific priority for 2020. Various research groups are working toward development of vaccines and drugs, and many have published genomic and transcriptomic data related to this viral infection. The power inherent in publicly available data can be demonstrated via comparative transcriptome analyses. In the current study, we collected high-throughput gene expression data related to human lung epithelial cells infected with SARS-CoV-2 or other respiratory viruses (SARS, H1N1, rhinovirus, avian influenza, and Dhori) and compared the effect of these viruses on the human transcriptome. The analyses identified fifteen genes specifically expressed in cells transfected with SARS-CoV-2; these included CSF2 (colony-stimulating factor 2) and S100A8 and S100A9 (calcium-binding proteins), all of which are involved in lung/respiratory disorders. The analyses showed that genes involved in the Type1 interferon signaling pathway and the apoptosis process are commonly altered by infection of SARS-CoV-2 and influenza viruses. Furthermore, results of protein-protein interaction analyses were consistent with a functional role of CSF2 in COVID-19 disease. In conclusion, our analysis has revealed cellular genes associated with SARS-CoV-2 infection of the human lung epithelium; these are potential therapeutic targets.
Introduction
Infection of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) is the cause of human coronavirus disease 2019 (COVID-19). The recent pandemic has caused devastation due to rapid spread of this viral infection. As a respiratory illness, the disease is readily transmitted. It also has a long incubation and can be carried asymptomatically, thus spreading through communities (1). The COVID-19 pandemic has affected almost every country regardless of their medical infrastructure and economic status. It has caused a healthcare crisis and created a devastating economic burden, including high unemployment, which has exacerbating the effect of the disease (2). At present, more than 9.4 million people from 213 countries have been infected with the virus [https://www.worldometers.info/coronavirus/]. The rapid spreading of this respiratory infection has forced millions to shelter in their homes and has led to death of more than 480,000 individuals. Additionally, COVID-19 disproportionately affects patients, particularly minorities in the U.S. and those with chronic problems such as hypertension, lung disease, diabetes, and immunocompromised conditions.
In the last three decades, the world has witnessed zoonotic transmission of various viruses (from animals to humans) leading to severe respiratory complications. These include H1N1, avian influenza, severe acute respiratory syndrome (SARS), and Middle East respiratory syndrome coronavirus (MERS□CoV) (3, 4). Although infections of these viruses is often fatal, their effect is restricted to geographic locations such as Africa, Asia, and South America. As the world awaits a vaccine for SARS-CoV-2, efforts are being made to understand the molecular mechanisms of these infections (5, 6).
High-throughput technologies such as RNA sequencing and microarrays are useful in the detection of respiratory virus infections and in understanding their molecular effect on human lung epithelial cells (7). Extensive data on corona virus sequencing has been deposited in public repositories such as NCBI Gene Expression Omnibus (GEO) and EMBL Array Express (8, 9). Meta-analysis and mining of such data can aid in a) understanding the molecular impact of COVID-19, b) elucidating differences and similarities between SARS-CoV-2 and other respiratory virus infections, and c) identification of targets for drug development. In the current study, we performed comparative analysis of publicly available gene expression data related to human lung epithelial cells infected with a respiratory virus. The analyses identified genes specifically expressed by SARS-CoV-2 infection and those that are commonly altered due to infection of coronovirus-2 and/or other respiratory viruses. In particular, expression of CSF2 (colony-stimulating factor 2) appears to be involved in COVID-19 disease.
Methods
RNA sequencing data analysis
A search of the NCBI GEO database for microarray or RNA-sequencing data publicly available as of 13th April 2020, found RNA-seq data uploaded by Blanco-Melo et al. (GSE147507) (6). We focused on samples related to normal human bronchial epithelial cells subjected to mock treatment (n=3) or SARS-CoV-2 infection (n=3). Raw sequencing data related to selected samples of GSE147507 as fastq files were downloaded from Sequence Read Archive (SRA) using fastq-dump of sratoolkit v2.9.6 [http://ncbi.github.io/sra-tools/]. First, raw sequencing reads were trimmed to remove adapter sequences and low-quality regions using Trim Galore! (v0.4.1) [http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/]. Trimmed reads were subjected to quality control analysis using FastQC [https://www.bioinformatics.babraham.ac.uk/projects/fastqc/]. Tophat v2.1 was used to map trimmed raw reads to the human reference genome (hg38) (10). All bam files from multiple runs related to the same samples were merged and sorted using SAMtools (Version: 1.3.1) (11). Finally, raw read counts were enumerated for each gene in each sample using HTSeq-count (12).
Analysis of differential expression was performed using DESeq2 according to a standard protocol [https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html] (13). Genes with adj.P-value <0.05 and absolute fold change >= 1.5 were considered as significantly differentially expressed. Gene ontology enrichment analyses of the Differentially Expressed Genes (DEGs) were accomplished by use of the Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.8 online tool (14). Gene ontology (GO) biological processes with P-values <0.05 and gene counts >2 were considered as significantly enriched.
Microarray data collection and analysis
The NCBI GEO database was queried for microarray data related to SARS-CoV infections of human lung epithelial cells. A query (SARS-CoV) AND “Homo sapiens” [porgn] AND (“gse” [Filter] AND (“Expression profiling by array” [Filter])) led to 15 search results. After screening, two studies (GSE47962, GSE17400) were selected. To find microarray data related to SARS-CoV infections of human lung epithelial cells, the GEO database was queried using (((Human lung epithelium) OR (Human bronchial epithelial) AND “Homo sapiens” [porgn] AND (“gse” [Filter] AND “Expression profiling by array” [Filter]))) AND (viral infection AND (“gse” [Filter] AND “Expression profiling by array” [Filter])) AND (“gse” [Filter] AND “Expression profiling by array” [Filter]) AND (“Expression profiling by array” [Filter]). This led to 38 search results, three of which (GSE49840, GSE71766, and GSE48575) were selected for analysis. Table 1 provides sample, platform, and cell line details for all five studies.
Summary of publicly available, high-throughput experimental data considered in the current study.
Since SARS-CoV-2 RNA-seq data included transcriptome profiling after 24 hours of infection, in all microarray studies we only considered samples after 24 hours of viral infection. GSE47962 included samples from human airway epithelium (HAE) cells infected with SARS-CoV, influenza virus (H1N1), or variants of SARS-CoV (SARS-dORF6 and SARS-BatSRBD) (15). GSE71766 comprised human bronchial epithelial cells (BEAS-2B) infected with rhino virus (RV), influenza virus (H1N1), or both (RV + H1N1) (16). Bronchial epithelial cell line 2B4 (a clonal derivative of Calu-3 cells) infected with SARS-CoV or Dhori virus (DOHV) were part of GSE17400 (17). GSE49840 included polarized calu-3 (cultured human airway epithelial cells) infected with human influenza virus (H3N2) or avian influenza viruses (H7N9, H5N1, and H7N7) (18). GSE48575 consisted of normal human bronchial epithelial cells (NHBEC) infected with seasonal H1N1 influenza A (sH1N1) or pandemic H1N1 influenza A (H1N1pdm) (19). GEO2R was used to identify differentially expressed genes for each of these studies independently (9). Probes with adj. P-value <0.05 and absolute fold change >=1.5 were considered as statistically significant and compared with DEGs of SARS-CoV-2 infection from RNA-seq data.
Protein-protein interaction analysis
STRING, a database of known or predicted protein-protein interactions (PPIs) was used to obtain interactions between genes altered on SARS-CoV-2 infection (20). Output from the STRING database was uploaded to Cytoscape v3.7.2 in simple interaction format, and the Cytohubba app was employed to identify hub genes (21-23). The top 50 genes were obtained separately from the PPI network based on three network parameters (closeness, degree, and betweenness), then common genes among these were selected as hub genes. We also checked the DrugBank database to determine if a drug is available to target them (24).
Results
RNA sequencing identified genes altered on SARS-CoV-2 infection of normal human bronchial epithelial cells (NHBEC)
Differential expression analysis was performed for NHBEC subjected to SARS-CoV-2 infection (n=3) or mock infection (n=3) from GSE147507. In total, 164 genes were up-regulated, and 76 genes were down-regulated by SARS-CoV-2 infection compared to mock infection [Figure 1A]. Gene Ontology enrichment analysis of differentially expressed genes showed “Type I interferon signaling pathway”, “Inflammatory response”, “Immune response”, “Response to virus”, and “Defense response to virus” as the top 5 enriched biological processes [Figure 1B].
(A) Volcano plot showing the proportion of differentially expressed genes after processing COVID-19 RNA-seq data [GSE147507]. (B) Top 10 biological processes associated with genes differentially expressed on COVID-19 infection of human lung epithelial cells. (C) Histogram showing the differentially expressed genes as determined by GEO2R analysis of respiratory viral infection microarray data.
The impact of other viral infections on the human lung epithelial cell transcriptome was explored using publicly available microarray data
A search of microarray datasets in the NCBI GEO database led to identification of five studies involving human lung epithelial cells subjected to viral infection. GEO2R analyses of each study were performed separately identifying differentially expressed genes. Figure 1C shows DEGs identified from each comparative analysis. From GSE71766, there were 574 differentially expressed probes in H1N1-infected BEAS-2B cells compared to control; 166 probes were up-regulated by RV16 infection. The combined infection of RV16 and H1N1 altered expression of 589 probes. In the case of GSE49840, there were 19959 probes differentially expressed in H7N7-infected Calu-3 cells compared to mock infected cells; 20155, 14859, and 17496 probes were differentially expressed by H5N1, H3N2, or H7N9 infections, respectively. For GSE17400, DOHV or SARS-CoV infection of Calu-3 cells led to altered expression of 447 and 221 probes, respectively. Analysis of GSE48575 led to identification of 130 and 9 differentially expressed probes on seasonal (sH1N1) or pandemic (H1N1pdm) influenza virus infection, respectively, of NHBECs. Lastly, the processing of GSE47962 resulted in discovery of 13414, 1415, 7, and 8 differentially expressed probes after H1N1, SARS-CoV-dORF6, SARS-CoV-BAT, or SARS-CoV viral infections, respectively, of HAE cells.
Comparative analysis of DEGs resulted in identification of SARS-CoV-2 infection-specific genes and those commonly affected by most of the viral infections
Differentially expressed genes from SARS-CoV-2 RNA-seq data were compared with GEO2R analysis results of GSE47962, GSE17400, GSE48575, GSE49840, and GSE71766. There were 15 genes differentially expressed after infection of SARS-CoV-2 [Figure 2A]. TCIM, HEPHL1, TRIML2, S100A9, S100A8, CRCT1, MAB21L4, MRGPRX3, and CSF2 were up-regulated; HNRNPUL2−BSCL2, CXCL14, THBD, GCNT4, PCDH7, and PADI3 were down-regulated.
(A) Heatmap showing 15 genes that are exclusively differentially expressed by SARS-CoV-2 infection. (B) Heatmap showing top 50 genes whose expression is altered on infection of at least two respiratory viruses including SARS-CoV-2.
Genes involved in “type1 interferon signaling pathway” and “apoptosis process” were commonly altered by SARS-CoV-2 and other viral infections. These included OAS1, OAS2, OAS3, MX1, MX2, SAMHD1, XAF1, BST2, IFI6, IFIT1, IFIT3, IFITM1, IFITM3, IRF9, STAT1, TNFAIP3, BIRC3, IL1A, IL1B, MAP3K8, PLSCR1, and TLR2 [Figure 2B].
Protein-protein interaction analysis of altered genes on SARS-CoV-2 infection revealed hub genes
In NHBECs, all 240 genes differentially expressed by SARS-CoV-2 infection were queried in the STRING database to identify known interactions among them. The database provided a PPI network of 1080 interactions (edges) among 168 genes (nodes) [Figure 3]. The PPI network was downloaded as a simple interaction format (SIF) file, visualized with Cytoscape, and analyzed with Cytohubba plugin to identify hub genes. The top 50 genes were obtained based on three network parameters: degree, closeness, and betweenness separately. The 24 genes featured in all three lists were considered as hub genes [Table 2]. The hub genes were IL6, CXCL8, IL1B, STAT1, MMP9, TLR2, CXCL1, ICAM1, CSF2, IFIH1, NFKBIA, DDX58, MX1, MMP1, PI3, SAA1, BST2, LCN2, EDN1, STAT5A, C3, SOD2, LIF, and HBEGF. To understand the exclusivity of these hub genes with COVID19 infection, we analyzed their expression after other viral infections. CSF2 was the only gene to remain unaltered on exposure to other viral infections; most of the COVID19 hub genes were affected by infection of human or avian influenza viruses such as H7N7, H1N1, H7N9, H3N2, and H5N1 [Figure 4].
Hub genes involved in COVID19 infection of normal human bronchial epithelial cells.
The network with 1080 edges among 168 nodes is obtained from the STRING database. Interactions with scores of 0.4 or above were considered.
A network generated by use of Cytoscape depicts the expression of COVID-19 hub genes on infection by various respiratory viruses. The green arrows indicate down-regulation, and the red arrows show up-regulation of these genes.
Discussion
Microarray meta-analysis and comparative transcriptome analysis have been useful bioinformatic approaches for maximum utilization of publicly available gene expression data (25, 26). Since the advent of high-throughput technologies such as microarrays and RNA-seq, researchers have performed in-depth transcriptome analysis of various biological conditions leading to multiple discoveries (27). Data from thousands of such experiments are being deposited in public repositories such as GEO and ArrayExpress. Selection and combinatorial analysis of such data can aid researchers in understanding the molecular mechanism of a disease, and in discovering biomarkers (28-30). We observed the availability of multiple microarray studies related to human respiratory viral infection and sensed the opportunity to compare the effect of SARS-CoV-2 and other respiratory viral infections on the human lung transcriptome.
The comparative transcriptome analysis led to identification of genes that are altered exclusively after SARS-CoV-2 infection. Among these genes, S100 calcium-binding proteinA9 (S100A9) and S100 calcium-binding protein A8 (S100A8) are calcium- and zinc-binding proteins that are elevated in inflammatory lung disorders (31). Colony stimulating factor 2 (CSF2) is a cytokine-coding gene associated with respiratory diseases such as pulmonary alveolar proteinosis (32). Transcriptional and immune response regulator (TCIM), a positive regulator of the Wnt/beta-catenin signaling pathway, is involved in proliferation of thyroid cancers (33, 34). Cysteine-rich C-terminal 1 (CRCT1) functions as a suppressor in esophageal cancer (35). Tripartite motif family-like 2 (TRIML2) promotes growth human oral cancers (36). Hephaestin-like 1 (HEPHL1) is a copper-binding glycoprotein with ferroxidase activity (37). MAS-related GPR family member X3 (MRGPRX3), a member of the mas-related/sensory neuron specific subfamily of G protein coupled receptors, is down-regulated in human airway epithelial cells exposed to smoke from electronic cigarettes (38). C-X-C Motif chemokine ligand 14 (CXCL14) functions as a tumor suppressor in oral cancer, lung cancer, and head and neck cancer; it also induces growth of prostate and breast cancers (39-43). Protocadherin 7 [PCDH7] is involved in cell-cell recognition and adhesion (44). Glucosaminyl (N-acetyl) transferase 4 (GCNT4), thrombomodulin (THBD), peptidyl arginine deiminase 3(PADI3), Mab-21-like 4 (MAB21L4), and HNRNPUL2−BSCL2 have no known association with lung disorders or respiratory virus infections.
The PPI analysis of genes differentially expressed by SARS-CoV-2 infection identified 24 hub genes. CSF2 was the only hub gene to show a SARS-CoV-2-exclusive gene expression pattern. Almost all other hub genes were affected by infection of other respiratory viruses. In conclusion, both PPI analysis and comparative transcriptome analysis point to a role of CSF2 in the molecular mechanism of SARS-CoV-2 infections of human lung epithelium. The current study also highlights the exclusivity of known lung inflammation disorder genes such as S100A8 and S100A9 with respect to SARS-CoV-2 infection.
Although the current bioinformatics approach makes use of available data to identify molecular targets for treatment of COVID-19, we understand that experimental validation is necessary to establish the exclusive association of CSF2, S100A8, and S100A9 with SARS-CoV-2 infection.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Acknowledgment
This study was supported by UAB impact funds to UM and SV. We thank Dr. Donald Hill, of the UAB Comprehensive Cancer Center, for editing this manuscript.
Footnotes
↵# Share Senior Authorship (UM Email: upendermanne{at}uabmc.edu)
Figure 2 has been revised Author affiliation is updated