Abstract
Since the term phylodynamics was coined, the use of phylogenies to understand infectious disease dynamics has steadily increased. As methods for phylodynamics and genomic epidemiology have proliferated and grown more computationally expensive, the epidemiological information they extract has also evolved to better complement what can be learned from traditional epidemiological data. However, for genomic epidemiology to continue to grow, and for the accumulating pathogen genetic sequences to fulfill their potential for widespread utility, the extraction of epidemiological information from phylogenies needs to be simpler and more efficient. Summary statistics provide a straightforward way of extracting information from a phylogenetic tree, but the relationship between these statistics and epidemiological quantities needs to be better understood. In this work we address this need via simulation. Using two different benchmark scenarios, we evaluate 74 tree summary statistics and their relationship to epidemiological quantities. In addition to evaluating the epidemiological information that can be inferred from each summary statistic, we assess its computational cost, which helps optimize the selection of summary statistics for specific applications. Our study offers guidelines on essential considerations for designing or choosing summary statistics. The evaluated set of summary statistics, along with additional helpful functions for phylogenetic analysis, is accessible through an open-source Python library. Our research not only illuminates the main characteristics of many tree summary statistics but also provides computational tools for real-world epidemiological analyses. These contributions aim to enhance our understanding of disease spread dynamics and advance the broader use of genomic epidemiology in public health efforts.
Author Summary
Our study focuses on the use of phylogenetic analysis to extract epidemiological insights. We conducted a simulation study to evaluate 74 phylogenetic summary statistics and their relationship to epidemiological quantities, shedding light on the potential of each statistic to quantify different characteristics of disease spread dynamics. We also assessed the computational cost of each statistic, which provides additional information when selecting a statistic for a particular application. The evaluated statistics are available through an open-source Python library. This work enhances our understanding of phylogenetic tree structures and contributes to the broader application of genomic epidemiology in public health initiatives.
Competing Interest Statement
The authors have declared no competing interest.