Coverage-adjusted entropy estimation

Stat Med. 2007 Sep 20;26(21):4039-60. doi: 10.1002/sim.2942.

Abstract

Data on 'neural coding' have frequently been analyzed using information-theoretic measures. These formulations involve the fundamental and generally difficult statistical problem of estimating entropy. We briefly review several methods that have been advanced to estimate entropy and highlight a method, the coverage-adjusted entropy estimator (CAE), due to Chao and Shen, which appeared recently in the environmental statistics literature. This method begins with the elementary Horvitz-Thompson estimator, developed for sampling from a finite population, and adjusts for the potential new species that have not yet been observed in the sample; these become the new patterns or 'words' in a spike train that have not yet been observed. The adjustment is due to I. J. Good and is called the Good-Turing coverage estimate. We provide a new empirical regularization derivation of the coverage-adjusted probability estimator, which shrinks the maximum likelihood estimate. We prove that the CAE is consistent and first-order optimal, with rate O_P(1/log n), in the class of distributions with finite entropy variance, and that, within the class of distributions with finite qth moment of the log-likelihood, the Good-Turing coverage estimate and the total probability of unobserved words converge at rate O_P(1/(log n)^q). We then provide a simulation study of the estimator with standard distributions and examples from neuronal data, where observations are dependent. The results show that, with a minor modification, the CAE performs much better than the MLE and is better than the best upper bound (BUB) estimator, due to Paninski, when the number of possible words m is unknown or infinite.
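
For concreteness, the estimator described above can be sketched as follows. With observed word counts n_i, sample size n, and f_1 singletons, the Good-Turing coverage estimate is C = 1 - f_1/n; the shrunk probabilities are p~_i = C n_i/n; and the Horvitz-Thompson form gives H_CAE = -sum_i p~_i log(p~_i) / (1 - (1 - p~_i)^n). The code below is a minimal illustrative implementation under these formulas, not the authors' code; the function name, the singleton safeguard, and the uniform test distribution are assumptions.

import numpy as np

def coverage_adjusted_entropy(counts):
    """Chao-Shen coverage-adjusted entropy estimate, in nats.

    counts : observed frequencies n_i of each word (zeros are dropped).
    """
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    n = counts.sum()                       # total number of observations
    f1 = np.count_nonzero(counts == 1)     # number of singleton words
    if f1 == n:                            # every word seen once: a common safeguard
        f1 = n - 1                         # keeps the coverage estimate positive
    C = 1.0 - f1 / n                       # Good-Turing sample-coverage estimate
    p = C * counts / n                     # coverage-adjusted (shrunk) MLE probabilities
    # Horvitz-Thompson weighting by the probability 1 - (1 - p_i)^n
    # that word i appears at least once in a sample of size n.
    inclusion = 1.0 - (1.0 - p) ** n
    return -np.sum(p * np.log(p) / inclusion)

# Illustrative check on a uniform distribution over m = 100 words (true H = log 100)
rng = np.random.default_rng(0)
counts = np.bincount(rng.integers(0, 100, size=200), minlength=100)
print(coverage_adjusted_entropy(counts))   # compare with np.log(100) ~ 4.605

The estimate is in nats; divide by log 2 for bits. Setting C = 1 and the inclusion weights to 1 recovers the MLE plug-in estimate, which makes both the shrinkage and the Horvitz-Thompson correction explicit.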

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, Non-P.H.S.
  • Validation Study

MeSH terms

  • Action Potentials / physiology*
  • Biometry / methods*
  • Entropy*
  • Humans
  • Likelihood Functions
  • Models, Neurological
  • Models, Statistical
  • Neurons / physiology
  • United States