Abstract
Reflectance, lighting, and geometry combine in complex ways to create images. How do we disentangle these to perceive individual properties, like the glossiness of a surface? We suggest that brains disentangle properties by learning to model statistical structure in proximal images. To test this, we trained unsupervised generative neural networks on renderings of glossy surfaces and compared their representations with human gloss judgments. The networks spontaneously cluster images according to distal properties such as reflectance and illumination, despite receiving no explicit information about them. Intriguingly, the resulting representations predict the specific patterns of ‘successes’ and ‘errors’ in human perception. Linearly decoding specular reflectance from the model’s internal code predicts human gloss perception better than ground truth, supervised networks, or control models, and predicts, on an image-by-image basis, illusions of gloss perception caused by interactions between material, shape, and lighting. Unsupervised learning may underlie many perceptual dimensions in vision, and beyond.
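For illustration only, a minimal sketch of the linear-decoding step described above: fit a linear read-out of specular reflectance from an unsupervised model's latent codes, then ask whether the decoded values track human gloss judgments better than the ground-truth rendering parameter does. All array names and data here are hypothetical placeholders, not the authors' actual pipeline.

```python
# Sketch: linear decoding of specular reflectance from latent codes,
# compared against human gloss ratings. Placeholder data throughout.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical inputs: latent codes z for N rendered images (from the
# unsupervised model's encoder), the ground-truth specular reflectance
# used to render each image, and mean human gloss ratings per image.
N, D = 1000, 64
latents = rng.normal(size=(N, D))
specular_gt = rng.uniform(0.0, 1.0, size=N)
human_ratings = rng.uniform(0.0, 1.0, size=N)

# Fit a linear read-out of specular reflectance from the latent code.
z_tr, z_te, s_tr, s_te, h_tr, h_te = train_test_split(
    latents, specular_gt, human_ratings, test_size=0.2, random_state=0)
decoder = LinearRegression().fit(z_tr, s_tr)
decoded = decoder.predict(z_te)

# Does decoded reflectance correlate with human judgments more strongly
# than the ground-truth reflectance itself?
r_decoded, _ = pearsonr(decoded, h_te)
r_truth, _ = pearsonr(s_te, h_te)
print(f"decoded-vs-human r = {r_decoded:.2f}, truth-vs-human r = {r_truth:.2f}")
```

With real latent codes and ratings in place of the random placeholders, image-by-image divergence between the decoded and ground-truth reflectance is where the model's predicted 'errors' can be compared with human misperceptions of gloss.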
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Substantially extended the Introduction and Discussion, added a representational similarity analysis, and included further tests of network generalisation.