Abstract
In infectious disease epidemiology, clustering cases of infection in space and time is a standard method to identify and characterize outbreaks. Clustering cases by genetic similarity is analogous to spatial clustering, and may be more effective for pathogens transmitted at a relatively low rate by intimate contact. However, the statistical properties of genetic clustering in the context of outbreak detection are not well characterized, and cluster-defining criteria are generally set to arbitrary values. We describe a new method that optimizes the predictive value of a clustering method by selecting the parameters that maximize the difference in the Akaike information criterion (AIC) between individual-weighted and null models of cluster growth. This approach mirrors solutions to the modifiable areal unit problem (MAUP): the statistical association between covariates and an outcome is contingent on how their spatial distribution is partitioned into units of observation. To evaluate our method, we analyzed the distributions of pairwise Tamura-Nei (TN93) genetic distances from two published sets of anonymized HIV-1 subtype B pol sequence data stratified by collection year. We generated 46 different graphs by varying the pairwise threshold, where an edge in a graph indicates that the TN93 distance between the respective cases is below the corresponding threshold. For each graph, we generated predictions of cluster growth (numbers of new cases with edges to clusters of known cases) under two different Poisson regression models: a null model in which growth is proportional only to cluster size (i.e., no variation among individuals), and a weighted model in which the variation associated with individual-level covariates is summed by cluster. Next, we calculated the AIC for each model on the distributions of observed cluster growth in two published HIV-1 pol data sets from Seattle, USA (n = 1,653) and Alberta, Canada (n = 809).
Based on the difference in AICs, we obtained different optimized TN93 thresholds for these data sets (0.014 and 0.011, respectively). We show that the choice of this threshold parameter can substantially limit the utility of genetic clusters for public health, and that the optimal parameter for one population can misdirect prevention efforts in another. This statistical framework can potentially be used to optimize any clustering method, and to evaluate it against other methods, including those that do not use genetic information.
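The threshold scan described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it makes several simplifying assumptions: the individual weights are supplied directly rather than estimated from covariates, each model is reduced to a Poisson mean proportional to an exposure (cluster size for the null model, summed weights for the weighted model) with a single fitted rate, each new case is assigned to the cluster of its closest known case, and all weights are positive. All function names, variable names, and toy values are hypothetical.

```python
import math

def find_clusters(n_known, edges):
    """Label each known case with its connected component (union-find)."""
    parent = list(range(n_known))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in edges:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n_known)]

def aic_proportional(counts, exposures):
    """AIC of a Poisson model with mean = rate * exposure and one fitted rate."""
    rate = sum(counts) / sum(exposures)  # maximum-likelihood estimate
    loglik = 0.0
    for k, e in zip(counts, exposures):
        mu = rate * e
        loglik += (k * math.log(mu) if k else 0.0) - mu - math.lgamma(k + 1)
    return 2 * 1 - 2 * loglik  # one free parameter

def aic_difference(threshold, n_known, known_dists, new_dists, weights):
    """AIC(weighted) - AIC(null) for the graph defined by one threshold."""
    edges = [pair for pair, d in known_dists.items() if d < threshold]
    label = find_clusters(n_known, edges)
    roots = sorted(set(label))
    size = {r: 0 for r in roots}
    weight = {r: 0.0 for r in roots}
    growth = {r: 0 for r in roots}
    for i, r in enumerate(label):
        size[r] += 1
        weight[r] += weights[i]
    # Assumption: a new case joins the cluster of its closest known case.
    for dists in new_dists:
        nearest = min(dists, key=dists.get)
        if dists[nearest] < threshold:
            growth[label[nearest]] += 1
    counts = [growth[r] for r in roots]
    null_aic = aic_proportional(counts, [size[r] for r in roots])
    weighted_aic = aic_proportional(counts, [weight[r] for r in roots])
    return weighted_aic - null_aic

# Toy data (hypothetical): 6 known cases, 3 new cases.
known_dists = {(0, 1): 0.005, (1, 2): 0.012, (3, 4): 0.008}
new_dists = [{3: 0.004}, {4: 0.006, 0: 0.02}, {2: 0.013}]
weights = [1.0, 1.0, 1.0, 3.0, 3.0, 0.5]  # e.g. higher for recent cases

thresholds = [0.005, 0.010, 0.015]
diffs = {t: aic_difference(t, 6, known_dists, new_dists, weights)
         for t in thresholds}
best = min(diffs, key=diffs.get)  # most negative: weighted model gains most
print(best, diffs)
```

Because the difference is computed here as weighted minus null, the threshold with the most negative value is the one at which the weighted model most improves on the null model. A fuller analysis would fit both Poisson regressions with a statistics package and estimate the weights from the data.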