Abstract
We present a quantitative, population-genetics-based physico-chemical model that predicts the stationary probability of observing an amino acid residue at a site, given the optimal residue for that site and the sensitivity of the protein's function to deviation from the optimum. We contextualize our physico-chemical model by comparing it to the more general, but less biologically meaningful, models of sequence entropy. To illustrate the model's use, we parameterize it using more than 1,000 sequences of the HIV subtype C Gag polyprotein. Using data from the LANL HIV database, we evaluate the model's performance first by comparing its site sensitivity parameters G' to entropy-based measures of site conservation, and then by assessing its ability to predict empirical in vitro and in vivo measures of HIV fitness. Although G' is well correlated with conservation, it predicts the empirical fitness data significantly better than the entropy-based measures do. More importantly, unlike the entropy model, our model can be further refined and used to test more complex biological hypotheses. For example, in our analysis we find evidence that different protein regions of the Gag polyprotein differ in their sensitivity to deviation from the molecular volume of the optimal amino acid residue. Finally, because the model has a biological basis, it should be possible to extend it to include epistasis in a more realistic manner than Ising models allow, while requiring far fewer parameters than Potts models.











