Abstract
The development of automatic methods for image and video quality assessment that correlate well with human perception is a challenging open problem in vision science, with numerous practical applications in image processing and computer vision, as well as in the media industry. Over the past two decades, image quality research has sought to improve upon classical metrics by developing models that emulate aspects of the visual system. While progress has been considerable, state-of-the-art quality assessment methods still share a number of shortcomings: their performance drops considerably when they are tested on a database quite different from the one used to train them, and they have significant limitations in predicting observer scores for high-frame-rate videos. In this work we propose a novel objective method for image and video quality assessment based on the recently introduced Intrinsically Non-linear Receptive Field (INRF) formulation, a neural summation model that has been shown to predict neural activity and visual perception phenomena better than the classical linear receptive field. We first optimize, on a classic image quality database, the four parameters of a very simple INRF-based metric, and then test this metric on three other databases, showing that its performance equals or surpasses that of state-of-the-art methods, some of which have millions of parameters. Next, we extend this INRF image quality metric to the temporal domain and test it on several popular video quality datasets; the results of the proposed INRF-based video quality metric are again very competitive.
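For concreteness, the sketch below shows one way a simple full-reference INRF-style quality metric with a four-parameter response could be assembled: a linear Gaussian term minus a weighted pooling of a pointwise nonlinearity applied to a center-surround difference, compared between reference and distorted inputs, with mean pooling over frames as a minimal temporal extension. This is a hypothetical illustration under our own assumptions; the kernel shapes, the signed-square-root nonlinearity, the function names, and all default parameter values are placeholders, not the fitted model reported in this work.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def inrf_response(img, sigma_m=1.0, sigma_g=1.0, sigma_w=4.0, lam=2.0):
    """One plausible discrete INRF-style response (illustrative only).

    Linear term (Gaussian kernel m) minus lam times a Gaussian-weighted
    (kernel w) pooling of a pointwise nonlinearity applied to the
    difference between the image and its g-smoothed version. The four
    parameters here (sigma_m, sigma_g, sigma_w, lam) are assumptions.
    """
    linear = gaussian_filter(img, sigma_m)      # sum_y m(x - y) I(y)
    smoothed = gaussian_filter(img, sigma_g)    # sum_z g(x - z) I(z)
    diff = img - smoothed
    nl = np.sign(diff) * np.sqrt(np.abs(diff))  # placeholder nonlinearity
    nonlinear = gaussian_filter(nl, sigma_w)    # sum_y w(x - y) sigma(...)
    return linear - lam * nonlinear

def inrf_image_quality(ref, dist, **params):
    """Hypothetical score: mean absolute difference of INRF responses."""
    return float(np.mean(np.abs(inrf_response(ref, **params)
                                - inrf_response(dist, **params))))

def inrf_video_quality(ref_frames, dist_frames, **params):
    """Hypothetical temporal extension: score frame pairs, then average.

    Mean pooling is the simplest choice; other temporal pooling methods
    are possible (see the footnote below on the removed appendices).
    """
    scores = [inrf_image_quality(r, d, **params)
              for r, d in zip(ref_frames, dist_frames)]
    return float(np.mean(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.random((64, 64))
    dist = np.clip(ref + 0.05 * rng.standard_normal((64, 64)), 0.0, 1.0)
    print(inrf_image_quality(ref, dist))  # larger = more predicted distortion
```

In this toy usage, a larger score indicates greater predicted perceptual distortion between the reference and distorted images; in practice the four parameters would be optimized on a subjective quality database, as described in the abstract.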
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
In this new version we test our video quality metric on three additional video datasets and report the results; in the first version, we reported the performance of our video quality prediction algorithm on a single video dataset only. We have also removed the appendices, in order to present our video quality metric in its simplest form. The material in those appendices concerned aspects that made the algorithm more complex (i.e., performance using different temporal pooling methods, and the use of a temporal processing function in the video quality prediction model); these aspects were not sufficiently studied there and would require further work.