RT Journal Article
SR Electronic
T1 Rule-based meta-analysis reveals the major role of PB2 in influencing influenza A virus virulence in mice
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 556647
DO 10.1101/556647
A1 Fransiskus Xaverius Ivan
A1 Chee Keong Kwoh
YR 2019
UL http://biorxiv.org/content/early/2019/02/22/556647.abstract
AB Background Influenza A virus (IAV) poses threats to human health and life. Many individual studies have been carried out in mice to uncover the viral factors responsible for the virulence of IAV infections. Virus adaptation through serial lung-to-lung passaging, reverse genetic engineering and mutagenesis have been widely used in these studies. Nonetheless, a single study may not provide enough confidence about virulence factors; combining several studies in a meta-analysis is therefore desirable to provide a clearer picture. Methods Virulence information on IAV infections and the corresponding virus and mouse strains was collected from the literature. Using the mouse lethal dose 50, time series of weight loss or percentage of survival, the virulence of the infections was classified as avirulent or virulent for two-class problems, and as low, intermediate or high for three-class problems. Protein sequences were decoded from the corresponding IAV genomes or reconstructed manually from other proteins according to mutations mentioned in the related literature. IAV virulence models were then learned from various datasets that contained the amino acids at aligned positions of IAV proteins together with the corresponding two-class or three-class virulence labels. Three proven rule-based learning approaches (OneR, JRip and PART) and, additionally, random forest were used for modelling, and top protein sites and synergy between protein sites were identified from the models. Results More than 500 records of IAV infections in mice for which the viral proteins could be retrieved were documented.
The BALB/c and C57BL/6 mouse strains and the H1N1, H3N2 and H5N1 viruses dominated the infection records. PART models learned from full datasets or subsets of them achieved the best performance, with moderate average accuracies ranging from 65.0% to 84.4% for two-class datasets and from 54.0% to 66.6% for three-class datasets that used all records of aligned IAV proteins. Their average accuracies were comparable to or even better than those of random forest models, so PART models should be preferred on the basis of Occam’s razor. Interestingly, models based on a dataset that included all IAV strains achieved a better average accuracy when host information was taken into account. Regarding model interpretation, we observed that although many sites in HA were highly correlated with virulence, PART models based on sites in PB2 were competitive with, and often better than, PART models based on sites in HA. Moreover, PART showed a strong preference for including sites in PB2 when models were learned from datasets containing concatenated alignments of all IAV proteins. Several sites with a known contribution to virulence were found among the top protein sites, and site pairs that may synergistically influence virulence were also uncovered. Conclusion Modelling the virulence of IAV infections is a challenging problem. Rule-based models generated using only viral proteins are useful because they are easy to interpret, but they achieve only moderate performance. Development of more advanced machine learning approaches that learn models from features extracted from both viral and host proteins must be considered for future work.
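
To illustrate how a rule-based learner can single out a protein site, the following is a minimal sketch of the OneR idea applied to aligned sequences. It is not the authors' implementation: the function names `one_r` and `predict`, the four-residue toy sequences and their labels are all hypothetical and chosen only for illustration. For each alignment column, the sketch maps every residue to its majority virulence label and keeps the column whose rule makes the fewest training errors.

```python
from collections import Counter, defaultdict


def one_r(records, labels):
    """Simplified OneR: return (site, rule, errors) for the single
    alignment column whose residue -> majority-label rule makes the
    fewest errors on the training records."""
    n_sites = len(records[0])
    best = None
    for site in range(n_sites):
        # Count class labels separately for each residue seen at this site.
        counts = defaultdict(Counter)
        for seq, label in zip(records, labels):
            counts[seq[site]][label] += 1
        # Each residue predicts its most frequent label.
        rule = {res: c.most_common(1)[0][0] for res, c in counts.items()}
        # Errors are the records that disagree with their residue's majority label.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[2]:
            best = (site, rule, errors)
    return best


def predict(model, seq, default):
    """Apply the one-site rule; fall back to `default` for unseen residues."""
    site, rule, _ = model
    return rule.get(seq[site], default)


# Hypothetical toy alignment (NOT real IAV data): equal-length sequences
# with made-up two-class virulence labels.
records = ["AEKT", "AEKS", "AKKT", "GKRT", "AKRS", "GERS"]
labels = ["avirulent", "avirulent", "virulent",
          "virulent", "virulent", "avirulent"]

model = one_r(records, labels)
print(model)
```

On this toy data the column at index 1 separates the two classes without error, so it is the one selected. A real analysis would also need cross-validated accuracy estimates and handling of gaps and ambiguous residues, which this sketch omits.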