TY  - JOUR
T1  - Public health in genetic spaces: a statistical framework to optimize cluster-based outbreak detection
JF  - bioRxiv
DO  - 10.1101/639997
SP  - 639997
AU  - Connor Chato
AU  - Marcia L. Kalish
AU  - Art F. Y. Poon
Y1  - 2019/01/01
UR  - http://biorxiv.org/content/early/2019/10/17/639997.abstract
N2  - Genetic clustering is a popular method for characterizing variation in transmission rates for rapidly-evolving viruses, and could potentially be used to detect outbreaks in ‘near real time’. However, the statistical properties of clustering are poorly understood in this context, and there are no objective guidelines for setting clustering criteria. Here we develop a new statistical framework to optimize a genetic clustering method based on the ability to forecast new cases. We analyzed the pairwise Tamura-Nei (TN93) genetic distances for anonymized HIV-1 subtype B pol sequences from Seattle (n = 1,653) and Middle Tennessee, USA (n = 2,779), and northern Alberta, Canada (n = 809). Under varying TN93 thresholds, we fit two models to the distributions of new cases relative to clusters of known cases: (1) a null model that assumes cluster growth is strictly proportional to cluster size, i.e., no variation in transmission rates among individuals; and (2) a weighted model that incorporates individual-level covariates, such as recency of diagnosis. The optimal threshold maximizes the difference in information loss between models, where covariates are used most effectively. Optimal TN93 thresholds varied substantially between data sets, e.g., 0.0104 in Alberta and 0.016 in Seattle and Tennessee, such that the optimum for one population will potentially mis-direct prevention efforts in another. The range of thresholds where the weighted model conferred greater predictive accuracy tended to be narrow (±0.005 units), but the optimal threshold for a given population also tended to be stable over time.
We also extended our method to demonstrate that variation in recency of HIV diagnosis among clusters was significantly more predictive of new cases than sample collection dates (ΔAIC > 50). These results demonstrate that one cannot rely on historical precedence or convention to configure genetic clustering methods for public health applications. Our framework not only provides an objective procedure to optimize a clustering method, but can also be used for variable selection in forecasting new cases.
ER  -