Defining the Characteristics of Type I Interferon Stimulated Genes: Insight from Expression Data and Machine Learning

A virus-infected cell triggers a signalling cascade resulting in the secretion of interferons (IFNs), which in turn induce the up-regulation of IFN-stimulated genes (ISGs) that play an important role in the inhibition of the viral infection and the return to cellular homeostasis. Here, we conduct detailed analyses on 7443 features relating to evolutionary conservation, nucleotide composition, gene expression, amino acid composition, and network properties to elucidate factors associated with the stimulation of genes in response to type I IFNs. Our results show that ISGs are less evolutionarily conserved than genes that are not significantly stimulated in IFN experiments (non-ISGs). ISGs show significant depletion of GC-content in the coding region of their canonical transcripts, which leads to under-representation in the nucleotide compositions. Differences between ISGs and non-ISGs are also reflected in the properties of their coded amino acid sequence compositions. Network analyses show that ISG products tend to be involved in key paths but are away from hubs or bottlenecks of the human protein-protein interaction (PPI) network. Our analyses also show that interferon-repressed human genes (IRGs), which are down-regulated in the presence of IFNs, can have similar properties to ISGs, thus leading to false positives in ISG predictions. Based on these analyses, we design a machine learning framework integrating a support vector machine (SVM) with feature selection algorithms. The ISG prediction achieves an area under the receiver operating characteristic curve (AUC) of 0.7455 and demonstrates the similarity between ISGs triggered by type I and III IFNs. Our machine learning model predicts a number of genes as potential ISGs that so far have shown no significant differential expression when stimulated with IFN in the cell types and tissue types compiled in the available IFN-related databases.
A webserver implementing our method is accessible at http://isgpre.cvr.gla.ac.uk/.

Author summary
Interferons (IFNs) are signalling proteins secreted from host cells. IFN-triggered signalling activates the host immune system in response to intra-cellular infection. It results in the stimulation of many genes that have anti-pathogen roles in host defenses. Interferon-stimulated genes (ISGs) have unique properties that make them different from those not significantly up-regulated in response to IFNs (non-ISGs). We find the down-regulated interferon-repressed genes (IRGs) have some shared properties with ISGs. This increases the difficulty of distinguishing ISGs from non-ISGs. The use of machine learning is a sensible strategy to provide high throughput classifications of putative ISGs, for investigation with in vivo or in vitro experiments. Machine learning can also be applied to human genes for which there are insufficient expression levels before and after IFN treatment in various experiments. Additionally, the interferon type has some impact on ISG predictability. We expect that our study will provide new insight into better understanding the inherent characteristics of human genes that are related to response in the presence of IFNs.

We filter out non-ISGs showing enhanced expression after type I IFN treatments (Log2(Fold Change) > 0). The exclusion of these non-ISGs can effectively reduce the risk of involving false negatives in analyses and producing false positives in predictions. As a result, the refined dataset S2 contains 620 ISGs and 874 non-ISGs with relatively high confidence.
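The filtering rule can be sketched in a few lines. This is a minimal illustration rather than the study's pipeline: the `Gene` record, its field names, and the toy genes are hypothetical, and only the rule itself (drop non-ISGs with Log2(Fold Change) > 0) comes from the text.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str      # hypothetical gene identifier
    label: str     # "ISG" or "non-ISG" (assumed label encoding)
    log2fc: float  # Log2(Fold Change) after type I IFN treatment


def refine(genes: List[Gene]) -> List[Gene]:
    """Drop non-ISGs whose expression is enhanced (Log2(Fold Change) > 0)."""
    return [g for g in genes if not (g.label == "non-ISG" and g.log2fc > 0)]


# Toy records, not real expression data.
toy = [
    Gene("A", "ISG", 2.1),
    Gene("B", "non-ISG", 0.4),   # enhanced non-ISG: removed
    Gene("C", "non-ISG", -0.3),
    Gene("D", "non-ISG", 0.0),   # not strictly above 0: kept
]
refined = refine(toy)
```

Only non-ISGs are filtered here; ISGs pass through unchanged regardless of their fold change.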

The training procedure in the machine learning framework is conducted on a balanced dataset, S2', consisting of 992 randomly selected ISGs and non-ISGs from dataset S2. The remaining human genes in S2 are used for independent testing. Additionally, we construct another six testing datasets [...]. The criterion for an ISG in the latter three datasets is a high level of up-regulation (Log2(Fold Change) > 1.0), while that for non-ISGs is no up-regulation after IFN treatments (Log2(Fold Change) < 0). The last testing dataset, S8, is derived from our background dataset S1 and contains 2217 ELGs. A breakdown of the aforementioned eight datasets is shown in Table 1. Detailed information on the human genes used in this study is provided in S1 Data.
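The construction of the balanced training set and the held-out remainder of S2 might look like the sketch below. It assumes that "balanced" means equal class sizes, i.e. 496 ISGs and 496 non-ISGs, which the text does not state explicitly; the gene identifiers, the function name, and the seed are placeholders.

```python
import random
from typing import List, Tuple


def balanced_split(isgs: List[str], non_isgs: List[str],
                   n_per_class: int, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly draw n_per_class genes from each class for training;
    every remaining gene goes to the independent test set."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    train_pos = rng.sample(isgs, n_per_class)
    train_neg = rng.sample(non_isgs, n_per_class)
    chosen = set(train_pos) | set(train_neg)
    train = train_pos + train_neg
    test = [g for g in isgs + non_isgs if g not in chosen]
    return train, test


# Placeholder identifiers matching the class sizes of dataset S2.
isgs = [f"ISG_{i}" for i in range(620)]
non_isgs = [f"nonISG_{i}" for i in range(874)]
train, test = balanced_split(isgs, non_isgs, n_per_class=496)
```

Under the equal-class assumption, the held-out set is imbalanced (124 ISGs and 378 non-ISGs), which is worth keeping in mind when reading threshold-dependent metrics on it.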
Here, the number of divided parts equals 126 in this study, and the values used in each part are the Log2(Fold Change) and AREP of that part. The mean and standard deviation of Log2(Fold Change) are set to 6.4 and 3.7 respectively in this study, while the mean and standard deviation of the 126 AREP values reflect the representation of the considered feature. To make fair comparisons among features with different scales, we normalise them based on the major value of their representations:

Evolutionary characteristics of ISGs
In this study, we construct a dataset consisting of 620 ISGs and 874 non-ISGs (dataset S2) from 10836 well-annotated human genes (dataset S1). Human genes in the S1 dataset have higher confidence based [...]

ORFs, open reading frames; ELGs, human genes with limited expression in interferon experiments.

To determine whether ISGs tend to originate from duplications, we count the number of within-human paralogs of each gene (Fig 4A). [...] the background human genes in dataset S1 (M1 = 10.5, M2 = 11.5, p = 8.8E-03). We hypothesize that such a difference is mainly caused by the imbalanced distribution of singletons in ISGs and non-ISGs.

[Figure labels: "The difference between ISGs and non-ISGs"; "The difference between ISGs and human genes". Feature groups: nucleotide compositions (7), dinucleotide compositions (16), codon usages (64).]

[...] We find several amino acids that are either enriched or depleted in ISG products compared to background human proteins, which are produced by genes in dataset S1 (Fig 7). [...]

[Figure labels: "The difference between ISG and non-ISG products"; "The difference between ISG products and human proteins".]

[...] the basic composition of nucleotides influences the correlation between the representation of sequence- [...]

[Figure labels, feature groups: nucleotide compositions (7), dinucleotide compositions (16), codon usages (64), nucleotide 4-mer compositions (256), evolution (7), amino acid composition (37), interactome network (8).]

In this study, we train and optimise an SVM model from our training dataset, i.e., S2', and prepare seven testing datasets to assess the generalisation capability of our model under different conditions. The S2'' testing dataset is a subset of dataset S2. The prediction performance on this testing dataset is close to that in the training stage, with an AUC of 0.7455 (Fig 14). The best MCC value is achieved when setting the judgement threshold to 0.438, which means that the prediction model is sensitive to signals related to ISGs. In this case, it produces predictions with high sensitivity but inevitably produces many false positives, especially within the IRG class.
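Choosing a judgement threshold by maximising the Matthews correlation coefficient (MCC) can be illustrated as follows. This is a sketch under stated assumptions, not the authors' code: the scores, labels, and candidate thresholds are toy values, a gene is called an ISG when its score is at or above the threshold, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_threshold(scores: Sequence[float], labels: Sequence[int],
                   thresholds: Sequence[float]) -> Tuple[float, float]:
    """Return the (threshold, MCC) pair with the highest MCC.
    Assumed convention: score >= threshold is predicted as ISG (label 1)."""
    best_t, best_m = thresholds[0], -2.0  # MCC is always >= -1
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        m = mcc(tp, fp, tn, fn)
        if m > best_m:
            best_t, best_m = t, m
    return best_t, best_m


# Toy classifier outputs: 1 = ISG, 0 = non-ISG.
scores = [0.9, 0.8, 0.6, 0.45, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 1, 0, 0]
best_t, best_m = best_threshold(scores, labels, [0.25, 0.35, 0.5, 0.7])
```

In this toy example, lowering the threshold from 0.5 to 0.35 recovers the remaining true ISG but admits a false positive, the same sensitivity versus false-positive trade-off described above.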

In the S3 testing dataset, we use 695 ISGs with low confidence. The overall accuracy only reaches 44.0% when using a judgement threshold of 0.549, about 18% lower than SN under the same threshold in the training dataset S2' (Table 4). This is expected as they have some inherent attributes