PT - JOURNAL ARTICLE
AU - Alexander M. Conway
AU - Ian N. Durbach
AU - Alistair McInnes
AU - Robert N. Harris
TI - Frame-by-frame annotation of video recordings using deep neural networks
AID - 10.1101/2020.06.29.177261
DP - 2020 Jan 01
TA - bioRxiv
PG - 2020.06.29.177261
4099 - http://biorxiv.org/content/early/2020/06/29/2020.06.29.177261.short
4100 - http://biorxiv.org/content/early/2020/06/29/2020.06.29.177261.full
AB - Video data are widely collected in ecological studies, but manual annotation is a challenging and time-consuming task that has become a bottleneck for scientific research. Classification models based on convolutional neural networks (CNNs) have proved successful in annotating images, but few applications have extended these to video classification. We demonstrate an approach that combines a standard CNN, which summarizes each video frame, with a recurrent neural network (RNN) that models the temporal component of the video. The approach is illustrated using two datasets: one collected by static video cameras detecting seal activity inside coastal salmon nets, and another collected by animal-borne cameras deployed on African penguins, used to classify behaviour. For penguins, the combined RNN-CNN improved test set classification accuracy over an image-only model from 80% to 85%, a 25% relative reduction in error, and substantially improved classification precision or recall for four of six behaviour classes (12–17%). Image-only and video models classified seal activity with equally high accuracy (90%). Temporal patterns related to movement provide valuable information about animal behaviour, and classifiers benefit from including this information explicitly. We recommend including temporal information whenever manual inspection suggests that movement is predictive of class membership.
Competing Interest Statement: The authors have declared no competing interest.
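
The CNN-plus-RNN architecture described in the abstract can be sketched as follows. This is a minimal illustration only, assuming PyTorch, a ResNet-18 frame encoder, a single-layer LSTM, and six output classes (matching the number of penguin behaviour classes); the record does not specify the authors' actual backbone, sequence length, or training setup, so all of those choices are assumptions.

    # Sketch of a CNN+RNN video classifier of the kind the abstract describes.
    # Assumptions (not from the record): PyTorch, ResNet-18 frame encoder,
    # single-layer LSTM, six behaviour classes.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class FrameRNNClassifier(nn.Module):
        def __init__(self, num_classes=6, hidden_size=256):
            super().__init__()
            cnn = models.resnet18(weights=None)
            feat_dim = cnn.fc.in_features      # 512 for ResNet-18
            cnn.fc = nn.Identity()             # keep per-frame features only
            self.cnn = cnn
            self.rnn = nn.LSTM(feat_dim, hidden_size, batch_first=True)
            self.head = nn.Linear(hidden_size, num_classes)

        def forward(self, clips):
            # clips: (batch, time, channels, height, width)
            b, t, c, h, w = clips.shape
            feats = self.cnn(clips.reshape(b * t, c, h, w))  # summarize each frame
            feats = feats.reshape(b, t, -1)
            out, _ = self.rnn(feats)           # model the temporal component
            return self.head(out[:, -1])       # classify from the final state

    # Example: a batch of two 16-frame clips of 224x224 RGB frames.
    model = FrameRNNClassifier()
    logits = model(torch.randn(2, 16, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 6])

Folding time into the batch dimension keeps the frame encoder identical to an image-only baseline, so in this sketch the recurrent layer is the only component added to model movement across frames.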