Abstract
Reflectance, lighting, and geometry combine in complex ways to create images. How do we disentangle these to perceive individual properties, like surface glossiness? We suggest that brains disentangle properties by learning to model statistical structure in proximal images. To test this, we trained unsupervised generative neural networks on renderings of glossy surfaces and compared their representations with human gloss judgments. The networks spontaneously cluster images according to distal properties such as reflectance and illumination, despite receiving no explicit information about them. Intriguingly, the resulting representations also predict the specific patterns of ‘successes’ and ‘errors’ in human perception. Linearly decoding specular reflectance from the model’s internal code predicts human gloss perception better than ground truth, supervised networks, or control models, and predicts, on an image-by-image basis, illusions of gloss perception caused by interactions between material, shape, and lighting. Unsupervised learning may underlie many perceptual dimensions in vision, and beyond.
Main
A photograph of a glass of water might consist of large bright patches, sparkling dots, and low-contrast blurs—yet we immediately see these as reflections on the water’s surface, bubbles within, and smudges on the glass. Identifying the distal physical causes of proximal image features is widely considered to be the central challenge of vision 1–7. Yet how we infer the outside world from ambiguous sensory data remains mysterious.
Somehow the visual system infers the combination of distal scene variables that most plausibly explains the proximal sensory input 4, 8–11. Yet, a major unsolved question is how the candidate explanations of proximal inputs came to be known in the first place. Our visual systems did not – and do not – have access to ground truth information about the number or kinds of distal sources of image structure that operate in the world. Any knowledge about the world must have been acquired from exposure to proximal stimuli over evolutionary and/or developmental time scales 12–14.
Here we explore the intriguing possibility that visual systems might be able to discover the operation of distal scene variables by learning statistical regularities in proximal images, rather than through learning an explicit mapping between proximal cues and known distal causes. Specifically, we show that by learning to efficiently compress and spatially predict images of surfaces, an unsupervised generative deep neural network (DNN) not only spontaneously clusters inputs by distal factors like material and illumination, but also strikingly reproduces many characteristic ‘misperceptions’ of human observers.
Two general principles motivate our approach. The first is that variability in proximal sensory stimuli is caused by variations in a lower-dimensional set of environmental factors (such as shape, reflectance, and illumination). This implies that the variation between images can be captured in a more compact or simple way when represented in terms of their underlying causes, rather than (say) in terms of pixels 15–17. A machine learning model encouraged to discover a compact representation of images might therefore converge on the finite sources that generate image variability. Identifying and disentangling these distal sources is only possible, however, if a second principle holds: different distal sources must generate statistically distinguishable effects in the proximal input 18. This seems intuitively true – for example, changes in illumination generate different kinds or patterns of variability in images than changes in surface material do. Based on these two principles, we reasoned that it should be possible for a sufficiently powerful statistical learning model to discover the existence of distal variables without a priori knowledge of either the number or kinds of distal variables that exist in the world, based solely on the variability they generate in images.
The idea that our perceptual systems exploit statistical regularities to derive information about the world has a long and venerable history in both psychology and neuroscience 19–21. For example, neural response properties in early visual cortex are well predicted by models trained to generate sparse codes for natural image patches 19, 21–24. Unfortunately, such ‘efficient coding’ approaches have not yet scaled beyond the initial encoding of images. One of the main motivations for the present work was to determine whether such ideas could provide leverage into mid-level scene understanding, i.e., inferring the distal physical causes of sense data.
Even if different distal factors have different statistical effects on images, and a sufficiently powerful unsupervised neural network is able to learn these, its success in disentangling different factors is unlikely to be perfect. That is, the network would sometimes misattribute the distal causes responsible for the data. However, we regard such misattributions as a potential strength; there are well-documented examples where the human visual system systematically misattributes image structure to the wrong distal source—failures of ‘perceptual constancy’ 1, 5, 25–33. We were interested in whether the pattern of successful and unsuccessful attributions made by human observers would also be exhibited by networks that failed to fully disentangle distal scene variables. The goal was not to understand visual processes as an estimation of ground truth, but rather to understand why our visual systems extract what they do about the world in the absence of access to ground truth knowledge.
One of the most striking patterns of successes and failures in estimating distal scene variables occurs in the perception of surface gloss 3, 5, 31, 34–37. Gloss perception is a paradigmatic case of a perceptual judgment where multiple physical effects must be separated. The pattern of specular reflections can change dramatically as a function of a surface’s 3D shape, illumination direction, and the observer’s viewpoint. Indeed, psychophysical evidence has shown that the perception of gloss in human observers depends not only on specular reflectance, as expected 32, 38–44, but also on lighting and shape 33–37, 45. We were interested in whether the specific pattern of these complex interactions could be a consequence of the visual system having learned to approximately disentangle distal sources from their effects on image structure.
Our work exploits DNN methods that have emerged for learning sophisticated models of image structure in the form of latent variables, which summarise how images differ from one another 46–50. DNNs achieve complex transformations by passing input data through layers of units that apply weighted summations and non-linearities, roughly mimicking the operations of biological neurons 51–55. During training, connections between units are iteratively adjusted to improve the network’s performance on some learning objective.
In supervised learning, networks are directly told what label to output for every input image in a training dataset, from which they learn to subsequently output appropriate labels for new test images. Supervised DNNs have revolutionised computer vision, achieving near-human object and face recognition 56–59, and are the best extant models of late ventral stream function 51–53, 60–63. However, unlike humans, they are often fragile to tiny image perturbations 64, 65 and over-rely on local texture 66, 67. As models of biological gloss perception, it is unclear from where the necessary training labels could come.
In unsupervised learning, training objectives encourage networks to learn statistical regularities in the training data without being given any explicit labels. For example, autoencoder networks are trained to compress training images into compact descriptions, then reconstruct them as accurately as possible 50, 68, 69. Here, we use a variant known as a PixelVAE 47, 48, which learns to both summarise and spatially predict images, in two connected processing streams (Figure 1C). It is a generative model that can create completely novel images with high-order statistical structure similar to the training data—in our case, images of glossy and matte surfaces.
Our main finding is that the representation of gloss learned by an unsupervised PixelVAE network closely matches the pattern of successes and failures in perceived gloss shown by human observers. The unsupervised models better match human data than do a range of supervised networks and simpler comparison models, suggesting that the learning process by which different distal sources are disentangled may play a fundamental role in shaping our visual experience of the world, and providing a potential answer to the conundrum of how we learn to see without explicit training.
Results
To test whether an unsupervised DNN can learn human-like gloss perception, we rendered 10,000 images from a virtual world consisting of frontal views of bumpy surfaces with either high (‘gloss’) or low (‘matte’) specular reflectance. Using renderings grants tight control over the statistics of the training environment, allowing us to guarantee that reflectance could not be trivially decoded from raw images and that physical factors varied independently of one another. Each image had a different random configuration of bumps, depth of surface relief, and colour, and was illuminated by one of six natural light fields (Figure 1A-B). We then trained ten different instances of a PixelVAE 47, 48 network with different initial random weights on this dataset, to ensure results were robust to representational differences between training instances of the same architecture 70. The network culminates in a probability distribution over pixel values (Figure 1C). Its training objective is to adjust the shape of this distribution in order to increase the likelihood of the training images under it, leading to a model of the structure and variability within and across images.
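For readers unfamiliar with this class of model, the likelihood-based objective can be sketched in miniature. The code below is a simplified VAE-style loss in numpy; it is an illustration only, not the paper's implementation (the actual PixelVAE additionally conditions an autoregressive decoder on previously generated pixels), and all shapes and values are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_loss(x, mu, logvar, x_recon):
    """Negative evidence lower bound for one image: reconstruction error
    plus a KL penalty that keeps the latent code compact.
    x, x_recon: flattened pixel intensities in [0, 1]
    mu, logvar: parameters of the approximate posterior q(z|x)."""
    eps = 1e-7
    x_recon = np.clip(x_recon, eps, 1 - eps)
    # Bernoulli reconstruction log-likelihood of the observed pixels
    recon_ll = np.sum(x * np.log(x_recon) + (1 - x) * np.log(1 - x_recon))
    # KL( q(z|x) || N(0, I) ), closed form for a diagonal Gaussian posterior
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1 - logvar)
    return -recon_ll + kl  # quantity minimised during training

# Toy example: a 64-pixel 'image' and a 10-dimensional latent posterior
x = rng.uniform(0, 1, 64)
mu, logvar = rng.normal(0, 0.1, 10), np.full(10, -2.0)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=10)  # reparameterised sample
# (in a full model, z would be decoded into x_recon; here it is a placeholder)
loss = vae_loss(x, mu, logvar, rng.uniform(0, 1, 64))
```

Minimising this loss across the training set is one concrete way of "adjusting the shape of the distribution to increase the likelihood of the training images under it".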
New images can be created from the unsupervised PixelVAE model. These are generated pixel-by-pixel from the top left corner by probabilistically sampling from the network’s learned distribution, conditioned both on previous pixels and on values in the model’s 10-dimensional latent code (Figure 1C). This latent code is a highly compressed representation of whole-image properties and is the focus of all subsequent analyses. After training, all ten instances of the model could generate wholly novel images that look like plausible surfaces (Figure 1D).
As a comparison DNN, we also trained ten instances of a supervised ResNet 71 network to classify the same images as high or low gloss, using ground truth (high or low specular reflectance in the rendering parameters) as training labels. Its mean classification accuracy was 99.4% (SD = 0.001). This supervised model also contained a 10-dimensional fully-connected layer prior to its two-unit output classification layer, which we treated as its high-level latent code for comparisons with the unsupervised model (Figure 1C).
An unsupervised generative model disentangles distal scene properties
We are interested in the extent to which models transform raw images into a feature space within which different physical causes are disentangled 12, 14, 16, 72. Surfaces with similar reflectance properties may occupy very disparate points in raw pixel space, but should cluster together in the feature space of a good perceptual model. Although the unsupervised PixelVAE model’s training objective deals only with proximal image data, after training on the rendered dataset, we found that distal scene properties—such as gloss and lighting—spontaneously clustered within the networks’ 10-dimensional latent codes (cf. 47, 50, 73). Visualising in two dimensions using tSNE 74 strikingly reveals that low-gloss images cluster together, while high-gloss images form multiple tight clusters, corresponding to different light fields (Figure 2A). Within each light-field cluster, images are arranged by the angle of illumination, as well as by surface relief, with flatter surfaces occupying nearby points, and bumpier surfaces more distant points. This shows that without explicit labels, the unsupervised model learns at least partially to reorganise stimuli by their physical properties, one of the core challenges of mid-level vision.
The emergence of this clustering of images by scene properties is far from trivial. It was not caused by raw image similarities, since tSNE visualisation of the same images in raw pixel space showed a tight entangling of scene properties (Figure 2B). Other linear and non-linear pixel embeddings such as MDS and LLE 75 also failed to separate low- from high-gloss surfaces (Supplementary Figure 2). When the same visualisation was applied to the 10D layer of the gloss-supervised models, high and low gloss images were neatly separated, but other world factors were intermixed (Figure 2C). Similar qualitative patterns held for all ten instances of both unsupervised and supervised models.
To quantify these clustering effects, we used representational similarity analysis 76 (Figure 2D). The results support the tSNE visualisations. Pairs of images belonging to the same gloss condition (both glossy or both matte), corresponded to closer points in the unsupervised models’ 10D latent codes than pairs of images belonging to different gloss conditions (repeated-measures t-test comparing average distances between same-vs different-gloss image pairs, across network training instances: t9 = 16.73, p < 0.001, Cohen’s d = 0.97, 95% CI of difference = [0.37–0.46]). Likewise, pairs of images illuminated by the same light field had more similar latent representations than those lit by different light fields (t9 = 29.76, p < 0.001, d = 0.95, 95% CI = [0.36–0.41]). In the supervised models, clustering was dominated by gloss (Figure 2E; two-way mixed-effects ANOVA interaction between model type and scene factor F1,18 = 9878.34, p < 0.001, η2 = 0.99; follow-up tests show far stronger gloss clustering in supervised than unsupervised models t18 = 99.39, p < 0.001, d = 44.45, 95% CI = [6.42–6.66], but stronger light-field clustering in unsupervised models, t18 = −19.90, p < 0.001, d = 8.90, 95% CI = [0.27–0.32]). Thus, while the supervised model optimizes disentanglement of the single physical property on which it is trained, the unsupervised model spontaneously discovers multiple scene factors contributing to image structure.
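The same-versus-different-condition distance comparison at the heart of this analysis can be sketched on toy data. The class offset, dimensionality, and sample counts below are illustrative placeholders, not the paper's latent codes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 10-D latent codes: 50 'glossy' and 50 'matte' images, with the two
# classes separated along one direction (offset chosen for illustration)
glossy = rng.normal(0, 1, (50, 10)) + np.r_[2.0, np.zeros(9)]
matte = rng.normal(0, 1, (50, 10))
codes = np.vstack([glossy, matte])
labels = np.r_[np.ones(50), np.zeros(50)]

# Full pairwise Euclidean distance matrix: the representational
# dissimilarity matrix used in representational similarity analysis
diff = codes[:, None, :] - codes[None, :, :]
rdm = np.sqrt((diff**2).sum(-1))

# Compare average distances for same-gloss vs different-gloss image pairs,
# using unique pairs only (upper triangle, excluding the diagonal)
same = labels[:, None] == labels[None, :]
iu = np.triu_indices(len(codes), k=1)
same_mean = rdm[iu][same[iu]].mean()
diff_mean = rdm[iu][~same[iu]].mean()
```

If the latent space clusters by gloss, `diff_mean` exceeds `same_mean`; the statistics reported above test exactly this difference across network training instances.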
The unsupervised model predicts human gloss perception for novel images
Our central question was whether the spontaneous separation of high and low gloss images in the unsupervised model could capture human gloss judgments. To derive quantitative gloss predictions from the models, a linear support vector machine (SVM) classifier was trained to find the hyperplane in the 10D latent code of each network that best separates high from low specular reflectance images (Figure 3A). Although this evaluation step involves label-based decoding, it is simply a formal way of quantifying the degree and form of the disentanglement. In neuroscience, information that is available directly via a linear readout from units or neurons is generally considered to be explicitly represented by a model or brain region (e.g. 16, 77–81). The linear classifier does not provide the model with any new information, but merely measures the relative placement of different classes of images within its existing feature space.
Based on this linear decoding, we find that gloss-classification accuracy for novel renderings across the ten unsupervised models was extremely good at 99.3% (SD = 0.002)—practically as good as decoding gloss from the 10D latent code of the supervised models (99.4%, SD = 0.002; Figure 3B). Light field and surface relief could also be decoded well above chance from the unsupervised networks (Figure 3B), and significantly better than from the supervised networks (independent-measures t-test comparing light field decoding between unsupervised and supervised models: t18 = 23.25, p < 0.001, Cohen’s d = 10.40, 95% CI of difference = [0.28–0.34]; surface relief: t18 = 3.30, p = 0.004, d = 1.48, 95% CI = [0.08–0.36]). Thus, linear decoding further demonstrates that the unsupervised networks learn a compact representation that summarises information about not only surface material, but other scene properties such as illumination and surface relief. The analysis also revealed that representations were distributed rather than sparse. The full latent code predicted scene properties much better than any individual dimension could (Figure 3B and Supplementary Figure 1C).
Crucially, we could now derive a predicted gloss value for any image by inputting it to a network and calculating the SVM decision value for its corresponding point in latent space (i.e., signed distance of point from network’s gloss-separating hyperplane; Figure 3A). This allowed us to compare the model against human gloss perception.
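The decision value is simply the signed distance of a latent code from the SVM hyperplane. A minimal sketch, with placeholder weights standing in for a fitted SVM:

```python
import numpy as np

def gloss_decision_value(z, w, b):
    """Signed distance of latent code z from the hyperplane w.z + b = 0.
    w and b would come from a fitted linear SVM; the values below are
    illustrative placeholders, not fitted parameters."""
    return (z @ w + b) / np.linalg.norm(w)

w = np.array([1.0, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25])
b = -0.2

# Two hypothetical latent codes, one on each side of the hyperplane
z_glossy = np.array([2.0, -1.0, 0, 0, 0, 0, 0, 0, 0, 0.5])
z_matte = np.zeros(10)

d_glossy = gloss_decision_value(z_glossy, w, b)
d_matte = gloss_decision_value(z_matte, w, b)
```

Positive values correspond to the high-gloss side of the hyperplane and larger magnitudes to points further from it, which is why the decision value can serve as a graded gloss prediction rather than a binary label.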
For Experiment 1: Gloss ratings, we rendered 50 new images of surfaces with random magnitudes of specular reflectance, sampled uniformly from almost matte to almost mirror-like. Twenty observers rated the apparent gloss of each surface, from 1 (matte) to 6 (glossy). We compared their ratings to the gloss values predicted by the unsupervised model. Figure 4A shows that agreement was excellent (mean R2 over ten model training instances = 0.84), and was substantially better than for the supervised model (mean R2 = 0.40; independent-samples t-test t18 = 12.45, p < 0.001, Cohen’s d = 5.57, 95% CI of difference = [0.37–0.50]). Notably, the unsupervised model even predicted human ratings better than ground truth (specular magnitude within the rendering engine; R2 = 0.73, one-sample t-test of difference, across model training instances t9 = 4.74, p = 0.001, d = 1.50, 95% CI = [0.05–0.13]).
We also considered a number of alternative models (Figure 4A bar graph), all of which predicted human judgements less well than ground truth. The best of these was a feature space consisting of the 1,000 final-layer features from a ResNet DNN trained on 1.2 million images to classify objects 58, 71. This is consistent with previous findings that representations in object-recognition DNNs capture perceptually-relevant features of textures and images 61, 82–84. Other, less well-performing, models included a multi-scale texture description comprising 1,350 feature dimensions 85; the 4,096 latent features from a relatively simple image autoencoder; 10-dimensional embeddings of raw images via tSNE 74, MDS or LLE 75; and luminance histogram skewness—a measure previously proposed to predict human gloss perception 86 (see Methods). Supplementary Figure 2 shows visualisations of how gloss and other scene factors are organised within each of these feature spaces.
Since PixelVAE networks are generative models, novel images can be generated by sampling from them (Figure 1D). In Experiment 2: Gloss manipulation, we used such images to test whether perceived gloss varied systematically with an image’s location in the model’s latent space. From each of the model training instances, four sequences of five images were generated by conditioning the image-sampling process first on a ‘low gloss’ point in the latent space and then on progressively higher-gloss points (i.e., moving along the model’s ‘gloss-discriminating axis’, orthogonal to the SVM hyperplane). Another four sequences progressed from high to lower gloss points (Figure 4C). The same 20 observers sorted the images within each sequence from least to most glossy. Figure 4B shows that traversing the model’s gloss-discriminating axis systematically reduced the apparent gloss of generated images in the matte direction (one-way repeated-measures ANOVA F4,76 = 244.11, p < 0.001, η2 = 0.93, 95% CI of correlation r = [0.87–0.93]) and increased it in the high-gloss direction (F4,76 = 649.82, p < 0.001, η2 = 0.97, 95% CI r = [0.97–0.98]).
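The traversal used to condition these image sequences amounts to stepping a latent point along the unit normal of the gloss hyperplane. A sketch, where the weight vector is an illustrative placeholder for a fitted SVM's:

```python
import numpy as np

# Placeholder SVM weight vector defining the 'gloss-discriminating axis'
w = np.array([1.0, -0.5] + [0.0] * 8)
axis = w / np.linalg.norm(w)  # unit normal of the separating hyperplane

z_start = np.zeros(10)            # a hypothetical 'low gloss' latent point
steps = np.linspace(0, 2.0, 5)    # five progressively glossier points
sequence = [z_start + s * axis for s in steps]

# Decision values increase monotonically along the traversal, so each
# sampled image is conditioned on a strictly 'glossier' latent point
decision = [z @ w for z in sequence]
```

Each point in `sequence` would then condition the pixel-by-pixel sampling process, yielding one image per step.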
Thus, despite never being given information about scene properties during training, we find the unsupervised networks develop internal representations that not only disentangle distal causes that are impossible to tease apart in the raw input (Figure 2B and Supplementary Figure 2), but also, more remarkably, predict human gloss perception better than the true physical reflectance properties of surfaces.
The unsupervised model predicts failures of human gloss constancy
Although human gloss perception generally aligns well with specular reflectance, it also exhibits some well-documented ‘errors’. For example, bumpier surfaces tend to look glossier than flatter surfaces, and specific combinations of lighting and surface relief yield specific patterns of misperception (Figure 5A; 34, 35). Mimicking such perceptual errors is a key test of any computational model of biological vision 87. We assessed how well different models capture the systematic failures of gloss constancy exhibited by human observers.
To do this, we rendered sequences where surface relief increased in seven steps, while reflectance and other scene properties remained fixed (Figure 5A). For stimuli in Experiments 3a and 3b: Patterns of gloss constancy we selected two sets of 10 sequences for which (a) both the unsupervised and supervised models predicted deviations from constant gloss, and (b) the models made different predictions about the particular pattern of deviations (see Methods). The rationale behind this is that cases where models disagree provide the strongest power to test which model is superior 88, 89.
With these image sequences in hand, in Experiments 3a and 3b: Patterns of gloss constancy, two groups of 20 observers judged gloss in a paired-comparison task (see Methods). If observers correctly estimated reflectance, all surfaces should appear equally glossy, yet we find that they do not. Observers exhibited strong failures of gloss constancy, usually reporting surfaces with deeper relief to be glossier, although perceived gloss was non-monotonic in seven of the twenty sequences, being highest for intermediate reliefs (four examples shown in Figure 5B; complete data: Supplementary Figure 3). The unsupervised model, despite never being explicitly trained to represent gloss, and without being fit to human data, predicted patterns of failures of gloss constancy remarkably well (median R2 across sequences and model training instances = 0.71). The model correctly predicted the qualitative pattern of constancy failure (monotonic vs non-monotonic) for 18 out of the 20 stimulus sequences. In contrast, the supervised model completely failed to predict human gloss constancy. For almost all sequences it made predictions that were anti-correlated with human patterns (median R2 = −1.45). Of the alternative models (Figure 5C), mid-level texture features 85 provided the next-best performance (median R2 = 0.54), but performed significantly worse than the unsupervised model (one-sample t-test of difference across model training instances t9 = 10.48, p < 0.001, Cohen’s d = 3.31, 95% CI of difference = [0.14–0.20]).
Human gloss constancy ranges from good to bad depending on interactions between lighting and shape 30, 33–35, 90. For Experiment 4: Degrees of gloss constancy, we rendered 40 image pairs depicting surfaces with identical material but different surface relief or lighting (Figure 5D; Methods), for which the unsupervised model predicted a wide range of degrees of gloss constancy, from excellent (near-identical predicted gloss for both images within a pair) to very poor (images received very different predicted gloss values; see Methods).
Twenty observers indicated which surface in each pair appeared glossier, with each pair repeated eight times. The unsupervised model predicted the degree and direction of human (failures of) constancy reasonably well (mean r across model training instances = 0.70, Figure 5E) and outperformed all alternative models (next-best model object-DNN features r = 0.64; one-sample t-test of difference, across PixelVAE training instances t9 = 4.00, p = 0.003, Cohen’s d = 1.27, 95% CI of difference = [0.03–0.08]). Fitting a simple logistic function to relate model and human gloss values further improves the prediction (from average R2 across model training instances = 0.60 for a linear fit, to R2 = 0.74 for a logistic fit).
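The logistic-versus-linear comparison can be illustrated on toy data. The parameters, noise level, and crude grid-search fit below are invented for illustration and are not the study's fitting procedure:

```python
import numpy as np

rng = np.random.default_rng(2)

def logistic(x, a, c):
    return 1.0 / (1.0 + np.exp(-(a * x + c)))

# Toy data: human responses saturate at the extremes of the model's
# decision values, as paired-comparison proportions tend to do
model_gloss = np.linspace(-3, 3, 40)
human = logistic(model_gloss, 2.0, 0.0) + rng.normal(0, 0.02, 40)

# Crude grid search over the two logistic parameters
grid_a = np.linspace(0.1, 3.0, 30)
grid_c = np.linspace(-1.0, 1.0, 21)
best = min(((a, c) for a in grid_a for c in grid_c),
           key=lambda p: np.sum((human - logistic(model_gloss, *p))**2))

def r_squared(y, yhat):
    return 1 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)

linear_fit = np.polyval(np.polyfit(model_gloss, human, 1), model_gloss)
r2_linear = r_squared(human, linear_fit)
r2_logistic = r_squared(human, logistic(model_gloss, *best))
```

Because the human data saturate while the decision values do not, the two-parameter logistic fit captures more variance than the straight line, mirroring the improvement reported above.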
Aggregating results across the three experiments using renderings (Experiments 1, 3, and 4), the unsupervised model predicts human perceptual judgements better than all others (Figure 6). It achieves near-perfect ground-truth gloss-classification accuracy, while still predicting idiosyncratic errors of human gloss perception (Figure 6A), with a feature space two orders of magnitude more compact than the next best model (Figure 6B). The next best model was a set of texture features hand-engineered to efficiently capture higher-order statistical structure in images 85, which performed significantly less well (one-sample t-test of difference in RMSE predicting individual data across three experiments, t59 = 10.73, p < 0.001, Cohen’s d = 1.39, 95% CI = [0.05–0.07]). Simple image statistics, such as the skewness of the luminance histogram 86, failed to capture human gloss perception under our deliberately challenging stimulus conditions, where differences in specular reflectance must be de-confounded from differences in lighting and surface shape.
Model generalization and effects of training set
The composition of the training dataset has profound effects on all machine learning models. However, a good model of human mid-level vision should generalise over changes to training or test data. Our unsupervised networks were trained on a simulated environment with bimodally distributed gloss (near-matte or high-gloss). Nevertheless, they predicted gloss well in new image sets containing continuously-varying gloss (Figure 7A, mean R2 = 0.79 +/- SD 0.03; see also Supplementary Figure 6A), and predicted gloss for scenes with novel geometries and light fields as well as they did for familiar scenes (Figure 7A).
Model predictions do not seem to depend on artefacts of the specific computer graphics techniques used to generate our images (realtime rasterised rendering), as gloss predictions were near-identical for matched surfaces rendered using more time-intensive but physically faithful raytraced rendering (Supplementary Figure 4B). Remarkably, given their constrained training environments, the models even seem able to broadly categorise close-up photographs of real-world surfaces 91 as being of high or low gloss, although they fail when shown surfaces far outside their training data, such as fabrics with high-contrast patterns (Supplementary Figure 6C-D).
We performed two tests of robustness to different training datasets. First, five new PixelVAE networks were trained on 10,000 additional renderings in which gloss was sampled continuously rather than bimodally (“continuously-sampled gloss training dataset”; Methods). Second, an additional PixelVAE was trained on a third dataset of 10,000 renderings in which surface geometry and lighting varied far more widely (“higher-variation training dataset”; Methods), with each scene comprising one of 10 novel surface geometries combined with one of 50 novel light fields. Both new training environments produced models with latent codes that supported accurate gloss classification on the original bimodal gloss dataset (mean accuracy = 96.7% and 91.4%, respectively). Importantly, we found that all three versions of the PixelVAE model (original, continuous-gloss training, and higher-variation training) made highly similar predictions regarding the relative gloss levels of experimental stimuli. This indicates that the ability to predict human perception is not highly sensitive to training set. The three versions of the unsupervised PixelVAE model, trained on non-overlapping datasets, made more similar gloss predictions to one another than to those made by any of the ten diverse alternative models (dark cluster of low dissimilarity values in the bottom left of Figure 7B). All three training environments led to unsupervised models that predicted human data reasonably well (Figure 7C), both for gloss ratings of novel rendered images (Experiment 1), and for the more challenging task of predicting patterns (Experiment 3) and degrees (Experiment 4) of (failures of) gloss constancy.
Each of the three model versions predicted human data significantly better than the supervised model in all experiments (t19 = 4.01–31.08, p < 0.001, Cohen’s d = 0.56–7.00, 95% CI = [0.06–0.18]–[0.60–1.07] in nine repeated-measures t-tests comparing unsupervised vs supervised model correlation with individual participants’ data; Bonferroni-corrected alpha = 0.006), and significantly better than the most promising alternative model, texture features 85, in six out of nine comparisons (t19 = 0.73–5.84, p ranging from <0.001 to 0.47, d = 0.17–1.16, 95% CI = [-0.12–0.25]–[0.12–0.31]).
Several analyses were performed to assess how robustly unsupervised models outperformed their supervised counterparts. In building and training a DNN, values must be chosen for the many hyperparameters controlling network architecture and training. We evaluated the effects of some of these hyperparameters by training 28 additional models (14 unsupervised and 14 supervised) that differed from the original implementations in depth, learning rate, learning rate decay, training batch size, and complexity of the learned model (for unsupervised PixelVAEs); see Supplementary Table 1 for details. Eleven of the 14 unsupervised network variants outperformed all supervised network variants in predicting human gloss judgements; the only exceptions were networks that failed to train due to poor learning rate settings (Supplementary Figure 5). We also found that representations in the unsupervised model better predicted human judgements than those in the supervised model for all intermediate layers (Supplementary Figure 4A). Finally, we created a version of the supervised model that outputs continuous-valued gloss estimates rather than categorical decisions and trained it on a dataset with continuous rather than bimodal reflectances (see Supplementary Methods and Results). This version better predicted human gloss judgements than the category-supervised model, but less well than the unsupervised model, failing to exhibit the systematic errors that characterize human gloss perception (Supplementary Figure 4B). Overall, unsupervised learning in PixelVAE models appears to converge on a representation that captures key aspects of human gloss perception, and tolerates changes in the particular network hyperparameters, or the statistics, illuminations, or geometries of the training and test sets.
Features underlying gloss representation in the model
Previous research 35 identified specific image features—the coverage, contrast, and sharpness of specular highlights—that predicted perceived gloss for surfaces like those evaluated here. To test whether the PixelVAE model was also sensitive to these cues, we measured the coverage, contrast, and sharpness of highlights in 10,000 new renderings of surfaces with specular reflectance varying continuously from near-matte to near-mirror. All three cues could be decoded from the latent code of a PixelVAE trained on these images (mean R2 = 0.71), and could be increasingly well decoded from successive convolutional and fully connected layers (Supplementary Figure 7A). A linear combination of the three cues in the layer immediately preceding the 10D latent code correlated with gloss predicted from the latent code (r = 0.80; Supplementary Figure 7B). Moreover, manipulating the highlights in images to weaken each cue also reduced predicted gloss (Figure 8A-B; one-way repeated-measures ANOVA for the effect of highlight contrast reduction: F9,81 = 11.20, p < 0.001, η2 = 0.55, 95% CI of correlation r = [-0.65, -0.41]; sharpness: F9,81 = 9.65, p < 0.001, η2 = 0.52, 95% CI r = [-0.61, -0.40]; coverage: F9,81 = 18.14, p < 0.001, η2 = 0.67, 95% CI r = [-0.58, -0.29]).
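Simplified versions of the three cue measurements might look as follows; the threshold and the specific definitions of coverage, contrast, and sharpness are illustrative stand-ins for those used in the cited work:

```python
import numpy as np

def highlight_cues(img, thresh=0.8):
    """Measure three highlight cues on a grayscale image in [0, 1].
    Simplified, illustrative definitions:
    coverage  - fraction of pixels belonging to bright highlights
    contrast  - mean intensity difference, highlight vs non-highlight regions
    sharpness - mean gradient magnitude within the highlight regions."""
    mask = img > thresh
    if not mask.any():
        return 0.0, 0.0, 0.0
    coverage = mask.mean()
    contrast = img[mask].mean() - img[~mask].mean()
    gy, gx = np.gradient(img)
    sharpness = np.hypot(gx, gy)[mask].mean()
    return coverage, contrast, sharpness

# Synthetic 'surface': dim background with one bright, sharp-edged highlight
img = np.full((64, 64), 0.3)
img[20:28, 20:28] = 0.95
cov, con, sharp = highlight_cues(img)
```

Manipulations like blurring the highlight or reducing its intensity would lower `sharp` or `con`, respectively, which is the sense in which each cue was selectively weakened in the image manipulations described above.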
Interestingly, we did not find evidence that the model’s predicted gloss decreased when the specular component of images was shifted so that highlights were misaligned with the geometry of the depicted surfaces (Figure 8B; F9,81 = 0.48, p = 0.88, η2 = 0.05, 95% CI r = [-0.21, 0.09]). This manipulation preserves coverage, contrast and sharpness, yet infringes photogeometric constraints that are a precondition for humans to identify highlights as specularities, and therefore to see gloss at all 6, 92, 93. The fact that the model predicts human gloss constancy without appearing to be sensitive to such constraints suggests that although these constraints are crucial to many aspects of surface perception 7, 93–96, the degree of perceived gloss, within surfaces that are seen as having highlights, may be largely explainable in terms of image features.
Discussion
Efficient representation of sensory data has long been hypothesised as a central goal of neural coding 15, 19–24, 97. Yet, while such approaches predict many aspects of low-level image encoding, to date, they have not explained how we visually infer properties of the outside world. Unsupervised learning objectives in modern DNNs, such as data compression and spatial prediction, offer powerful new implementations of these statistical learning principles 17. Our findings show that mid-level perceptual dimensions, like gloss—which imperfectly map onto properties of the physical world—can emerge spontaneously by learning to efficiently encode images. Thus, unsupervised learning may provide a bridge that links theoretical views emphasizing the importance of image statistics (e.g. 86, 98, 99), to those that treat perceptual processes as a decomposition of images into distinct physical causes (e.g. 4, 6, 8, 100).
One of the fundamental unsolved questions in vision science is how the visual system became aware of the different physical sources that contribute to image structure. Perception is commonly framed as the optimal estimation of a set of known physical quantities 4, 9–11. But these quantities that the brain putatively estimates were not specified a priori; they must somehow be discoverable (over either the course of evolution or learning) based on their manifestation in sensory experience 12–14. Here, we suggest that different physical causes give rise to different high-order regularities in visual data, making them discoverable through data-driven unsupervised learning processes 15–18. We provide a proof-of-principle that it is possible to learn to disentangle distal causes without prior knowledge about which classes of causes exist in the world, the cues that could be used to distinguish them, or even how many different classes of causes there are. An unsupervised statistical learning model predicted both the expected changes in perceived gloss caused by varying specular reflectance (Experiment 1; 32, 33, 39–41, 43) and the illusory changes in perceived gloss that arise from varying lighting and shape (Experiments 3 and 4; 34, 35). We suggest that known systematic errors in gloss perception can be attributed to the particular pattern of partial disentanglement arising from unsupervised statistical learning of surface appearances.
One of our more intriguing results is that the unsupervised model predicted human perception better than the supervised model tested. It is important to note that this is not because humans and unsupervised networks were better at extracting ground truth in these stimuli. Categorisation-supervised networks categorised gloss almost perfectly (Figure 3b), and regression-supervised networks predicted continuous gloss levels almost perfectly (Supplementary Figure 4b), yet both predicted human judgements less well than unsupervised networks. This implies that the systematic errors exhibited by humans and the PixelVAE model are not a trivial consequence of some inherent impossibility in recovering specular reflectance for these stimuli. Nor is it explained by the supervised model reporting ground truth specular reflectance too faithfully. Although the supervised model is near-perfect at coarsely categorising surfaces as having high or low specular reflectance, it still predicts different degrees of glossiness for different images, including sometimes erroneously. Yet the supervised and unsupervised models make different predictions on an image-by-image basis, with the latter more closely matching those made by humans. We propose that this shared pattern of deviation from ‘ideal performance’ may arise from shared characteristics in how the human visual system and unsupervised model learn to encode images.
One of the most notable failures of the PixelVAE in capturing human data is its insensitivity to photogeometric constraints known to affect human surface perception, such as the alignment of specular highlights with diffuse shading 7, 92, 93 (Figure 8). We believe that this failure is likely due to the relative poverty of 3D shape information in its training set. The link between highlights and diffuse shading arises from constraints imposed by the 3D shape of a surface 94, 96. It seems implausible to expect any visual system trained solely on monocular, static images to develop good sensitivity to these constraints, and without a detailed representation of 3D shape, no model is likely to explain all aspects of human gloss perception 94, 95, 101. We tailored the training sets towards modelling variations in perceived glossiness for physically realistic surfaces, where highlights are assumed to align with surface shading 32, 33, 39–41, 43. An important direction for future research is testing whether unsupervised DNNs can also learn photogeometric relationships, if training sets provide additional information about shape (e.g., through motion, stereo, occlusion, or larger variations in geometry).
In using deep learning models, we do not wish to imply that all material perception is learned during an individual lifetime. Unsupervised learning principles can also operate on an evolutionary timescale. For example, V1 cell receptive fields are predicted by simple unsupervised learning models such as independent components analysis 22 and these seem to be present at birth in macaques 102. There is evidence that 5-8 month-old infants can distinguish between matte and specular surfaces 103, but also that material recognition and classification are still developing in 5-10 year-old children 104, 105. Even 5-8 months of visual experience provides a huge dataset for statistical regularity learning 106. It could be that approximate versions of many perceptual dimensions are rapidly learned during the first days, weeks and months of life.
Although we do not propose the PixelVAE architecture as a physiological simulation of visual cortex, its computational principles are eminently achievable in biological brains. Data compression and spatial prediction are learning objectives that require no additional information beyond the incoming sensory data, and there are several mechanisms by which brains could represent the probability distributions used within the PixelVAE network 107–109. At the same time, the brain certainly does not learn statistical distributions over RGB pixels. If the visual system uses learning objectives like the ones investigated here, they presumably operate on image representations that have undergone substantial retinal processing 110, 111.
In conclusion, unsupervised DNN models provide an ecologically feasible solution to the problem of how brains come to represent properties of the distal world without access to ground truth training data 12–14, 17, 112. Non-linear transformations reorganise inputs according to high-order regularities within and across images, allowing the visual system to better summarise and predict sensory data. Because regularities in images are caused by underlying physical objects and processes, these new configurations often end up (partially) disentangling physical properties from one another. Our results suggest that the imperfect nature of this disentanglement may account for the characteristic ‘errors’ humans make. Failures of constancy, which are rife in vision, may therefore offer tell-tale clues to how we learn to see. Unsupervised learning may account for them, not just in gloss perception but in perception more broadly.
Methods
Participants
Three groups of human naïve observers reported perceived gloss across five experiments: Experiments 1 and 2 (N=20, mean age 23.45 [range 19-32], 16 female, 4 male), Experiment 3a (N=20, mean age [range 19-31], 16 female, 4 male), Experiments 3b and 4 (N=20, mean age 24.45 [range 19-35], 14 female, 6 male). Six individuals participated in two different experimental groups, but received no information about the experimental design or hypotheses after the first session. No statistical methods were used to pre-determine sample sizes but our sample sizes are larger than those reported in previous publications measuring gloss constancy 34, 35, 37. All participants had normal or corrected-to-normal visual acuity. Two male participants self-reported poor red-green colour vision. Experiments were conducted in accordance with the Declaration of Helsinki (6th Revision), with prior approval from the ethics committee of Justus Liebig University, Giessen, Germany. Volunteers gave written informed consent and were paid €8 per hour.
Stimuli
Stimuli were 800×800 pixel images of bumpy surfaces rendered using Unity3D (version 2018.2.3). A virtual 40×40cm sheet with irregularly positioned bumps was placed in a scene and illuminated by one of six high dynamic range (HDR) light probes. Light probes were white-balanced 8192×4096 pixel images of four exterior (beach, bay, woodland, savannah) and two interior (church, conference hall) environments, captured by Unity Technologies (https://assetstore.unity.com/packages/2d/textures-materials/sky/unity-hdri-pack-72511). A virtual camera (60° field of view) hovered 12cm above the sheet, viewing directly downwards. By randomly varying the camera’s location and orientation, an extremely large number of unique images could be generated from the scene.
Main training dataset
10,000 images were rendered to create the main training dataset for neural network models. For each rendering, one of the six HDRIs was randomly selected to illuminate the scene. Surface relief was scaled multiplicatively in depth by a uniformly-sampled random scaling factor, so that the distance between the lowest and highest points on the surface was between 0.1–2.5cm. The surface’s reflectance properties were controlled via the Unity3D standard shader, using the ‘specular setup’. The diffuse reflectance component of the surface material was chosen by selecting RGB values randomly uniformly within the interval 0.3 to 0.7, independently for each channel. The surface was either low or high gloss, with equal probability. For low gloss surfaces, the specular reflectance component of the material was selected randomly uniformly between 0.1–0.3, and the concentration of the specular component was randomly uniformly between 0.2–0.4. For high gloss surfaces, specular reflectance was between 0.3–0.5, and specular concentration between 0.75–0.95. The training dataset therefore had a bimodal distribution of approximately 50% low- and 50% high-gloss surfaces, with small variations in reflectance properties within each group. The same dataset was also used when training classifiers to decode gloss and other world factors from models.
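The bimodal sampling scheme can be sketched as follows (an illustrative numpy reconstruction from the parameters above; the function and variable names are ours, not part of the rendering pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_material(rng):
    """Sample one surface material as in the main training dataset."""
    # Diffuse RGB components drawn independently from U(0.3, 0.7).
    diffuse = rng.uniform(0.3, 0.7, size=3)
    # Surfaces are low- or high-gloss with equal probability.
    if rng.random() < 0.5:                      # low gloss
        specular = rng.uniform(0.1, 0.3)
        concentration = rng.uniform(0.2, 0.4)
        label = 0
    else:                                       # high gloss
        specular = rng.uniform(0.3, 0.5)
        concentration = rng.uniform(0.75, 0.95)
        label = 1
    return diffuse, specular, concentration, label

materials = [sample_material(rng) for _ in range(10000)]
labels = np.array([m[3] for m in materials])
print(labels.mean())  # close to 0.5: roughly half low- and half high-gloss
```

The resulting bimodal distribution of specular parameters is what the supervised networks were later trained to classify.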
Continuously-sampled gloss training dataset
To verify that the representations learned by our models were not artefacts of the bimodal gloss sampling in the main dataset, we rendered 10,000 new images in which both the magnitude and concentration of the specular reflectance component were sampled randomly uniformly between 0.1–0.8, independently of each other. All other scene properties were varied as in the main dataset.
Higher-variation training dataset
To verify that less constrained visual diets could lead to similar gloss representations, we created a third training dataset. 10,000 images each randomly combined one of ten novel surface geometries with fifty novel light fields. Novel geometries were virtual 40×40cm sheets with different shapes and sizes of irregularly placed ridges, bumps, and indentations (examples in Figure 7A). Novel light fields were fifty 4096×2048 pixel HDR images (25 exterior, 25 interior) from the HDRI Haven database (http://www.hdrihaven.com). The distance of the virtual camera above the surface varied randomly uniformly between 8–12cm to introduce scale variation. All other scene properties varied as in the main dataset.
Renderings with continuously sampled gloss levels (Experiment 1: Gloss ratings)
Stimuli were 50 new rendered images with specular reflectance chosen randomly uniformly between 0.2–1.0, and specular concentration set to the same value. Other attributes were varied as in the main dataset. For all rendered images used as stimuli in psychophysical experiments (Experiments 1, 3 and 4), the same geometry and set of light fields were used as in the main training dataset, but images were novel renderings unseen by any model during training.
Gloss-modulated images generated from PixelVAE models (Experiment 2: Gloss manipulation)
For each of the PixelVAE networks, we first determined the axis in 10D latent space along which gloss could be most strongly discriminated (see Data analysis). Eight images (four low-gloss, four high-gloss) from the main training dataset were input to each network, and their corresponding latent points were recorded. These were used as ‘seed points’ to generate eight ‘gloss-modulated’ sequences for each network. The first step in each sequence was generated by conditioning the model’s pixel sampling process on the respective seed point in latent space, and sampling a 128×128 pixel image (see Neural network architectures and training). The conditioning point was then moved 0.07 units in latent space along that model’s gloss-discriminating axis, either in the matte (for high-gloss seed images) or glossy (for low-gloss seed images) direction, and a second image was generated. This was repeated three more times, yielding five-step sequences for each seed image from each model training instance (400 experimental images in total).
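The latent-space manipulation can be sketched as follows (a simplified illustration; the call that conditions the PixelVAE’s pixel sampling on each point is omitted, and all names are ours):

```python
import numpy as np

def gloss_modulated_sequence(seed_latent, gloss_axis, n_steps=5, step=0.07,
                             towards_glossy=True):
    """Return the latent conditioning points for one five-step sequence.

    seed_latent: 10D latent code of the seed image.
    gloss_axis:  vector along the model's gloss-discriminating axis.
    """
    direction = 1.0 if towards_glossy else -1.0
    axis = gloss_axis / np.linalg.norm(gloss_axis)   # ensure unit length
    return [seed_latent + direction * step * i * axis for i in range(n_steps)]

# Each point would then condition the PixelVAE's autoregressive sampling
# (a model call omitted here) to generate one 128x128 image.
seed = np.zeros(10)
axis = np.eye(10)[0]
seq = gloss_modulated_sequence(seed, axis)
```

Low-gloss seeds would use `towards_glossy=True` and high-gloss seeds `towards_glossy=False`, so that each sequence traverses the gloss axis away from its starting appearance.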
Sequences of renderings with increasing surface relief (Experiments 3a and 3b: Patterns of gloss constancy)
To create a strong test of the different models, we wanted to probe human gloss constancy using stimuli (a) for which there were clear differences between constancy patterns predicted by unsupervised vs supervised models 88, 89, and (b) which were likely to produce diverse patterns of failures of constancy in human observers. We therefore first generated candidate stimuli, then selected those that best satisfied these desiderata.
For each experiment, we rendered 200 different candidate sequences of 7 images with fixed random material, illumination, and camera position, but exponentially increasing surface relief (maximum bump heights of {0.10, 0.18, 0.31, 0.55, 0.97, 1.70, 3.0}cm). All surfaces had relatively high gloss. In Experiment 3a, specular reflectance was randomly uniformly between 0.2–1.0, and specular concentration 0.7–1.0. In Experiment 3b, reflectance was between 0.1–0.3, and concentration 0.8–1.0. Other attributes were varied as in the main dataset. All 1400 images in each set were input to each of the unsupervised PixelVAE and supervised ResNet networks, to obtain predicted gloss values (see Data analysis). For each candidate sequence of 7 images, we then simulated the responses from each network under a two-alternative forced-choice (2AFC) experiment in which each possible pair of images was compared, and the image for which the model predicted a higher gloss value was selected. This was analogous to the “which is more glossy?” task performed by human observers (see Psychophysical procedures).
First, for each sequence we performed a 2×7 (model × surface relief) ANOVA between unsupervised and supervised model predictions, averaged over training instances of each model. Sequences were sorted according to the F interaction statistic, prioritising those with strong disagreements between model predictions 88, 89. In Experiment 3a, we visually inspected the top-ranked sequences and selected 10 with diverse appearances and subjective failures of gloss constancy.
In Experiment 3b we selected test sequences in an entirely automated manner by classifying sequences into four qualitative groups based on the average constancy pattern predicted by the unsupervised models:
Group 1: model predicts approximate constancy (range of predicted “proportion glossier” values < 0.25, and a linear function fit achieved R2 > 0.70);
Group 2: model predicts an approximately linear increase in gloss with bump height (range > 0.50, and a linear fit achieved R2 > 0.90);
Group 3: model predicts a non-monotonic failure of constancy (range > 0.50, and a quadratic fit of R2 > 0.90 with a negative coefficient of the squared term);
Group 4: model predicts an upward-swinging non-linear failure of constancy (range > 0.50, and a quadratic fit of R2 > 0.90 with a positive coefficient of the squared term).
We then selected the top sequences in each group (ranked by F interaction term), in proportion to the size of each group (1 “constant,” 4 “linear,” 2 “non-monotonic,” and 3 “non-linear” sequences).
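The four group criteria above can be sketched as a small classifier (an illustrative numpy reconstruction using the thresholds given in the text; the exact fitting code used in the study may differ):

```python
import numpy as np

def classify_constancy(y):
    """Assign a 7-point 'proportion glossier' curve to one of four groups."""
    x = np.arange(len(y), dtype=float)
    rng_y = y.max() - y.min()

    def r2(pred):
        ss_res = np.sum((y - pred) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1 - ss_res / ss_tot if ss_tot > 0 else 0.0

    lin = np.polyval(np.polyfit(x, y, 1), x)
    quad_coefs = np.polyfit(x, y, 2)          # highest degree first
    quad = np.polyval(quad_coefs, x)

    if rng_y < 0.25 and r2(lin) > 0.70:
        return "constant"                      # Group 1
    if rng_y > 0.50 and r2(lin) > 0.90:
        return "linear"                        # Group 2
    if rng_y > 0.50 and r2(quad) > 0.90:
        # Sign of the squared term separates Groups 3 and 4.
        return "non-monotonic" if quad_coefs[0] < 0 else "non-linear"
    return None
```

For example, a curve rising linearly from 0 to 1 falls into Group 2, while an inverted-U-shaped curve of the same range falls into Group 3.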
Pairs of renderings with differing surface relief or lighting (Experiment 4: Degrees of gloss constancy)
Here we sought pairs of stimuli (a) with diverse scene differences (some differing in surface relief, some in lighting environment, and some in angle of illumination), and (b) for which the unsupervised model made a wide range of predictions, ranging from good constancy to strong failure of constancy.
We rendered 800 candidate image pairs with the same random material properties and camera position, but differing in either surface relief (0.40 vs 1.5cm maximum bump height) or illumination. Pairs differing in illumination were either ‘naturally’ lit (differing in HDR lighting environment, from among the six used in the main training dataset), or illuminated by a directional lamp (at a 30° vs 90° angle). Candidate pairs were generated in four groups of 200:
Group 1: identical natural light field; different surface relief
Group 2: identical lamp angle; different surface relief
Group 3: identical surface relief; different natural light field
Group 4: identical surface relief; different lamp angle
Average predicted gloss values were obtained for each image, across training instances of the unsupervised PixelVAE model (see Data analysis). Within each group, we ranked pairs by the absolute difference in predicted gloss of the images in the pair, and selected pairs lying at each of the 10th percentiles, yielding 40 test pairs.
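The percentile-based selection can be sketched as follows (an illustrative reconstruction; the names are ours):

```python
import numpy as np

def select_pairs_by_percentile(diffs, n=10):
    """Pick indices of pairs lying at each 10th percentile of |gloss difference|.

    diffs: absolute differences in predicted gloss, one value per candidate pair.
    Returns n indices, ordered from small differences (good predicted
    constancy) to large ones (strong predicted failure of constancy).
    """
    order = np.argsort(diffs)
    # Evenly spaced rank positions (10th, 20th, ..., 100th percentile).
    positions = np.linspace(len(diffs) / n, len(diffs), n).astype(int) - 1
    return order[positions]

rng = np.random.default_rng(1)
diffs = rng.random(200)                 # e.g. one group of 200 candidate pairs
chosen = select_pairs_by_percentile(diffs)
```

Applying this within each of the four groups yields the 40 test pairs.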
Geometry and lighting generalisation test set
To test generalisation of models trained on the main dataset, we rendered five sets of 500 new images, one using the original surface sheet, and four using novel surface geometries (Figure 7A). Illumination was randomly one of eight novel 4096×2048pixel HDR light probes (six exterior, two interior) from the Dutch Skies 360° HDRI Project (http://www.dutch360hdr.com/). Specular reflectance was varied as in the continuously-sampled gloss training dataset.
Highlight-feature manipulated images
To manipulate highlight features (Figure 8), we manipulated the specular component image in four different ways, before combining with the diffuse component image by addition:
(1) Translated highlights: specular component was shifted rightwards in ten 5-pixel steps. To avoid edge artefacts after translation, images were rendered at 1024×1024 resolution, and then cropped to the lower right 800×800 pixels.
(2) Reducing highlight coverage: we applied image erosion to the specular component using the OpenCV package for Python. Kernel size was 2×2. Up to ten iterations were applied to create ten progressively reduced highlight maps.
(3) Reducing highlight sharpness: the specular image was convolved with a Gaussian filter, with filter size ranging from 1×1 to 11×11 pixels across the ten steps.
(4) Reducing highlight contrast: we created a ‘highlight mask’ by identifying pixels with values >4 (from a 0-255 range) in a greyscale version of the specular image. The standard deviation of pixel intensities within the highlight mask was multiplicatively reduced in ten steps, from 1 to 0.1 times the original SD, while retaining the original mean.
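As one example, manipulation (4) can be sketched in numpy (a simplified greyscale version; the threshold and scaling follow the description above, but the exact code used may differ):

```python
import numpy as np

def reduce_highlight_contrast(specular, factor):
    """Scale pixel-intensity SD within the highlight region by `factor`.

    specular: greyscale specular-component image, values in 0-255.
    factor:   1.0 leaves the image unchanged; 0.1 is the strongest reduction.
    Pixels with intensity > 4 form the highlight mask; their mean is retained.
    """
    out = specular.astype(float).copy()
    mask = out > 4
    mean = out[mask].mean()
    out[mask] = mean + factor * (out[mask] - mean)   # shrink towards the mean
    return out

rng = np.random.default_rng(2)
img = rng.uniform(0, 255, size=(64, 64))
reduced = reduce_highlight_contrast(img, 0.5)
```

Because the transformation is affine about the highlight mean, the mean intensity of the masked region is preserved exactly while its SD is scaled by the chosen factor.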
Neural network architectures and training
DNNs were implemented and trained in Python 2.7 with Tensorflow 1.14 on a GPU-accelerated machine using 1-4 GTX1080 graphics cards. Networks were trained on the first 9,000 images of the main training dataset, using the next 500 for validation, and the final 500 for testing accuracy of the supervised networks. All images were input at 128×128 pixel resolution. For both unsupervised and supervised models, ten independent instances of the same architecture were trained from different random initial weights and with different random sampling of training batches, to ensure robustness to representational differences between training instances 70. Additionally, we trained five instances of the PixelVAE network on the continuously-sampled gloss dataset and one on the higher-variation dataset (see Stimuli), dividing training and validation data in the same proportions. No data augmentation was used. All architectures used rectified linear activation functions.
Unsupervised PixelVAE model
We used the implementation from 47 of the PixelVAE architecture 48, available at https://github.com/ermongroup/Generalized-PixelVAE. The architecture consists of two streams of convolutional layers, which learn jointly via backpropagation. One stream is a ‘conditioning network,’ which is a convolutional variational autoencoder that takes an image as input and outputs a 10-dimensional smooth latent representation. We chose a 10D latent code as being approximately the most compact representation that still allowed the network to learn a good model of the training images, based on pilot data (see Supplementary Figure 1A-B). The other stream is an autoregressive PixelCNN++ 113, 114 model that learns the structure of the training data in terms of a probability distribution over pixel values, and takes as inputs both the image and the latent code output by the conditioning network. To generate new images, the autoregressive stream chooses an RGB value for each pixel, working from the top left to the bottom right of the image. Each pixel is sampled from the learned probability distribution, conditioning on both the values of pixels generated so far (which constrain the local structure of the image), and on the values provided in the latent code (which constrain more holistic image properties).
The conditioning network consisted of three convolutional layers of 64, 128, and 256 feature maps, followed by a fully connected layer of 512 units and a 10-unit latent code layer. The autoregressive network consisted of six ‘residual blocks’ of three layers of 64 convolutional feature maps each, with skip connections linking the first and sixth, and second and fifth blocks. Pixel likelihood distributions were modelled with a mixture of 12 logistic functions, and networks were trained with a batch size of 5 and a learning rate of 0.001 for 200 epochs, around which point validation error plateaued. Learning rate was gradually decayed by multiplying by 0.999995 after each epoch. No regularisation was used during training, to encourage the network to depend on information in its latent code 47.
Supervised ResNet model
We used the Tensorflow implementation of the ResNet architecture 57 from https://github.com/wenxinxu/resnet-in-tensorflow. Networks consisted of three ‘residual blocks’ each made up of three layers of 56 convolutional feature maps, with skip connections between each. The output of the final layer was passed to a 10-unit fully-connected layer, which we treated as the ‘latent code’ of the model for analysis purposes. This was passed, via one more non-linearity, to a two-unit softmax output layer. Ten networks were trained, from different random initial weights, to classify images as renderings of high or low specular reflectance surfaces (see Stimuli). Networks were trained with a batch size of 32 and a learning rate of 0.001 to minimise the sparse softmax cross-entropy between outputs and correct labels. Learning rate was gradually decayed by multiplying by 0.99995 after each epoch. Networks were trained for 21 epochs, by which point validation accuracy plateaued above 99%.
Simple autoencoder
We also considered a far simpler unsupervised model, in the form of a non-variational convolutional autoencoder implemented using Keras 2.2.5 for Tensorflow. The architecture comprised four layers of 64, 32, 16, and 16 feature maps alternating with 2×2 pooling layers, leading to a 4,096-unit fully connected bottleneck layer, which we treated as the model’s latent feature space for analysis purposes. The compressed code was expanded through mirrored layers of 16, 16, 32, and 64 feature maps to an output of the original dimensionality (128×128×3). The network was trained for 1,000 epochs to minimise mean absolute error between its input and output images (batch size 32, other learning parameters used default Adam optimiser values as implemented in Keras).
Object-trained deep neural network
Finally, we evaluated a pre-trained DNN supervised on object-recognition: an 18-layer Resnet 57 model available from the MATLAB 2019b Deep Learning toolbox (https://mathworks.com/help/deeplearning/ref/resnet18.html). The network had been pre-trained on the Imagenet Large-Scale Visual Recognition Challenge (ILSVRC) database to classify 1.2 million images into 1,000 object and animal categories 58. For analyses, we used representations in the 1,000-unit fully-connected layer immediately before softmax category readout.
Additional comparison models
Histogram skewness
Skewness was defined as the skew (3rd moment) of the distribution of pixel intensities in a greyscale version of each image.
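This measure amounts to the third standardised moment of the intensity histogram; a minimal sketch:

```python
import numpy as np

def histogram_skewness(grey):
    """Skew (third standardised moment) of greyscale pixel intensities."""
    x = grey.ravel().astype(float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

# A highlight-like image (mostly dark with a few bright pixels) skews positive.
img = np.zeros((16, 16))
img[:2, :2] = 255.0
print(histogram_skewness(img))  # positive
```

Positive skew arises when a small proportion of pixels is much brighter than the rest, as with sharp specular highlights on a darker diffuse base.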
MDS / tSNE / LLE
Multi-dimensional scaling (MDS) is a linear dimensionality reduction technique that finds the linear projection of a dataset that best preserves distances between all data points. t-distributed Stochastic Neighbourhood Embedding (tSNE 74) is a non-linear technique that preferentially preserves distances between nearby data points. Locally Linear Embedding (LLE 75) is a non-linear technique that seeks a lower-dimensional manifold on which each data point is well reconstructed from its nearest neighbours, preserving local geometry.
Each dimensionality reduction algorithm was used to create a 10D embedding of the 10,000 images from the main training dataset, as well as all 270 rendered images used in experiments (50 images from Experiment 1; 140 from Experiment 3; 80 from Experiment 4). The experimental probe images were included because it is not possible to project new data points into the reduced-dimensionality solution discovered by tSNE or LLE. Additional 2D embeddings were performed to create visualisations (Figure 2A-C and Supplementary Figure 2) using 4,000 images randomly selected from the main training dataset. Default parameters were used, as implemented in the scikit-learn package for Python.
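To illustrate the simplest of these baselines: classical MDS reduces to an eigendecomposition of the double-centred squared-distance matrix (a numpy sketch of the principle; in practice the scikit-learn implementations with default parameters were used):

```python
import numpy as np

def classical_mds(X, n_dims=2):
    """Classical (Torgerson) MDS: embed the rows of X in n_dims dimensions."""
    # Squared Euclidean distance matrix between all rows.
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    # Double centring: B = -0.5 * J D2 J, with J the centring matrix.
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    # Top eigenvectors scaled by sqrt eigenvalues give the embedding.
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_dims]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 20))           # toy stand-in for image features
Y = classical_mds(X, n_dims=2)
```

When the data genuinely lie in the target dimensionality, this recovers all pairwise distances exactly (up to rotation); otherwise it gives the best linear distance-preserving projection.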
Texture features
For each image we calculated a multi-scale texture feature description 85 using the TextureSynth package for MATLAB (www.cns.nyu.edu/~lcv/texture/). Images were rescaled to 256×256 pixels, and analysed at 3 scales, 4 orientations, with a spatial neighbourhood of 7, producing a description of each image in terms of 1,350 feature dimensions.
Psychophysical procedures
Psychophysical experiments were conducted in a dimly lit room, with participants seated at a comfortable distance from the screen. Participants could freely view the screen, and were given no fixation instructions. Stimuli were presented on an EIZO ColorEdge CG277 self-calibrating LCD monitor with a resolution of 2560×1440 and a refresh rate of 60Hz. PsychoPy 3.1 with Python 3.6 was used to present stimuli and record responses. At the beginning of each experiment, observers were shown 4-12 example experimental stimuli, randomly arranged, to familiarise them with the appearance and range of material properties they would be asked to judge. Response times were unconstrained, and stimuli were displayed until a response was recorded.
Experiment 1: Gloss ratings
Experiment 1 measured gloss ratings for novel rendered images. Fifty 800×800 pixel (18.6×18.6cm) images were presented singly in the centre of the screen. Each was repeated three times, for a total of 150 trials, presented in a random order. Participants were asked to rate the apparent gloss of each image using a 6-point scale with endpoints labelled “1 = completely matte” and “6 = extremely glossy”.
Experiment 2: Manipulating gloss
Experiment 2 measured gloss rankings for gloss-modulated network-generated images. Eighty sets of five 128×128 pixel (2.9×2.9cm) PixelVAE-generated images were shown. The five images within each set were arrayed in a random order in a row in the centre of the screen. Participants were asked to use the keyboard number keys to sort the images into a row of initially empty boxes at the bottom of the screen, in order from least glossy to most glossy. Each set of stimuli was shown once, for a total of 80 trials, presented in a random order.
Experiments 3a and 3b: Patterns of gloss constancy
Experiment 3 measured pairwise gloss comparisons for probe sequences differing only in surface relief. Experiments 3a and 3b differed only in the specific sequences shown to participants. Both experiments consisted of 10 sequences of 7 images, within which all possible pairs were shown twice each, for a total of 420 trials. Images appeared at a resolution of 800×800 pixels (18.6×18.6cm), side by side on screen. Side of screen and trial order were randomised, with all sequences interleaved with one another. Participants were asked to report, using the left and right arrow keys, which surface appeared more glossy.
Experiment 4: Degrees of gloss constancy
Experiment 4 measured pairwise gloss comparisons for pairs of images differing in surface relief or illumination. Each of 40 pairs of images was shown 8 times, for a total of 320 trials. Images appeared at a resolution of 128×128 pixels (2.9×2.9cm), side by side, with side of screen and trial order randomised. Participants reported, using the left and right arrow keys, which surface appeared more glossy.
Data analysis
All analyses of human and model data were performed in Python 2.7 or 3.6, using numpy v1.16.5 and/or scikit-learn v0.21.3 packages.
Representational similarity analysis
We measured average correlation distance (1 minus Pearson r) between the latent representations of all 10,000 images in the main training dataset, grouped by whether they used the same (diagonal blocks) or different (off-diagonal blocks) gloss/lighting conditions, for each network training instance. Representational dissimilarity matrices (RDMs; Figure 2D-E) were created by averaging these values over training instances of each model. To visualise the similarity of model predictions (Figure 7B) we created a vector, for each model, of predicted gloss values for all 270 rendered images used as experimental stimuli (50 images from Experiment 1; 140 images from Experiment 3; and 80 images from Experiment 4), normalised into the range 0–1. Euclidean distance between these prediction vectors was calculated for all pairs of models.
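The RDM computation can be sketched as follows (an illustrative numpy version with toy data; the labels and sizes here are ours):

```python
import numpy as np

def correlation_distance_rdm(latents):
    """Representational dissimilarity matrix: 1 minus Pearson r between rows."""
    return 1.0 - np.corrcoef(latents)

rng = np.random.default_rng(4)
codes = rng.normal(size=(8, 10))          # e.g. 8 images, 10D latent codes
rdm = correlation_distance_rdm(codes)

# Average within-condition (diagonal-block) vs between-condition distances,
# given a toy condition label for each image.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(labels), dtype=bool)
within = rdm[same & off_diag].mean()
between = rdm[~same].mean()
```

Lower within- than between-condition distances indicate that the latent space clusters images by that condition.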
Decoding world factors from models and deriving gloss predictions
For models with multidimensional feature spaces (i.e., all except histogram skewness), a linear support vector machine (SVM) was used to classify specular reflectance (high vs low) of rendered images from their representations within the model. SVMs were trained on a random sample of 7,500 of the main training dataset images, and tested on the remaining 2,500 to derive gloss classification accuracies (Figure 3, Figure 6A and Supplementary Figures 4A and 5A). To derive continuously-valued gloss predictions for experimental images, images were input to each model, and the signed distance of their representation from the model’s SVM decision boundary was measured. Positive values indicate “high-gloss” classifications and negative values indicate “low-gloss”, with absolute magnitude indicating the strength of evidence for that classification (Figure 3, Figure 4B-C, Figure 5D-E, Figure 7A, Figure 8B, Supplementary Figures 6B-D and 7B).
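The decode-then-measure-distance procedure can be sketched with scikit-learn (toy stand-in data; the actual analyses used the models’ 10D latent codes and a 7,500/2,500 train/test split):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy stand-in for latent codes: two gloss classes separated along all axes.
rng = np.random.default_rng(5)
low = rng.normal(loc=-1.0, size=(200, 10))
high = rng.normal(loc=+1.0, size=(200, 10))
X = np.vstack([low, high])
y = np.repeat([0, 1], 200)

svm = LinearSVC(C=1.0).fit(X, y)

# Continuous gloss predictions: signed distance from the decision boundary.
# decision_function returns w.x + b; dividing by ||w|| makes it a distance.
def predicted_gloss(codes):
    return svm.decision_function(codes) / np.linalg.norm(svm.coef_)

print(predicted_gloss(np.ones((1, 10))))   # positive: "high gloss"
```

The sign of the output gives the binary classification, and its magnitude serves as the graded gloss prediction compared against human judgements.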
For unsupervised PixelVAE and supervised ResNet models, we also trained linear SVMs to perform a 6-way light-field classification, and fitted linear regressions to predict surface relief (Figure 3B). Classifiers and regressions were performed once using the full 10D latent space for each network, and again using each of the network’s individual latent dimensions (Figure 3B; see also Supplementary Figure 1C for visualisation of individual dimensions in one unsupervised network).
For the histogram skewness model, raw skew values were used as gloss predictors for the purposes of model comparison. Ground-truth gloss classification accuracy (Figure 6A) was defined as the accuracy of an optimal threshold separating high- from low-specular-reflectance images on the basis of their skewness, fitting the threshold on 7,500 main training dataset images and testing on the remaining 2,500.
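Fitting an optimal threshold on a one-dimensional feature such as skewness could be implemented along these lines (hypothetical helper; here candidate thresholds are the midpoints between adjacent sorted feature values):

```python
import numpy as np

def best_threshold_accuracy(skew, labels):
    """Find the threshold on a 1-D feature (e.g. histogram skewness) that best
    separates high- from low-specular images, and return it with its training
    accuracy. Either labelling direction of the threshold is allowed."""
    order = np.argsort(skew)
    s = np.asarray(skew, dtype=float)[order]
    y = np.asarray(labels)[order]
    candidates = (s[:-1] + s[1:]) / 2.0       # midpoints between sorted values
    best_acc, best_t = 0.0, None
    for t in candidates:
        acc = np.mean((s > t) == y)           # "high gloss" above threshold
        acc = max(acc, 1.0 - acc)             # or below, whichever is better
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_t, best_acc
```

The fitted threshold would then be applied unchanged to the held-out 2,500 images to obtain the test accuracy.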
Deriving predicted 2AFC experimental data from models
Model predictions for Experiment 3 were derived by simulating responses of each model in the 2AFC task performed by humans. For each sequence of 7 images, the predicted gloss values of each possible pair of images were compared, and the image for which the model predicted higher gloss was selected. Predicted responses were summarised as ‘proportion selected as being glossier’ for each level in the sequence, as for human data (Figure 5B and Supplementary Figure 3).
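The simulated pairwise responses for one 7-image sequence can be sketched as follows (the helper name is illustrative; ties, if any, are broken arbitrarily):

```python
import itertools
import numpy as np

def simulate_2afc(predicted_gloss):
    """Simulate a model's 2AFC responses for one sequence of images: for every
    possible pair, the image with the higher predicted gloss value is
    'selected as glossier'. Returns, per image, the proportion of its
    pairings in which it was selected."""
    n = len(predicted_gloss)
    wins = np.zeros(n)
    for i, j in itertools.combinations(range(n), 2):
        if predicted_gloss[i] > predicted_gloss[j]:
            wins[i] += 1
        else:
            wins[j] += 1
    return wins / (n - 1)     # each image appears in exactly n - 1 pairings
```

For a sequence whose predicted gloss values increase monotonically, this yields selection proportions rising linearly from 0 to 1 across the sequence, directly comparable to the human 'proportion selected as glossier' curves.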
Measuring highlight features
Coverage, sharpness and contrast of highlights (Figure 8) were measured using a MATLAB package developed by Schmid et al 115. Briefly: (1) Coverage is defined as the proportion of pixels with higher intensities in the full image than in the diffuse component image; (2) Sharpness is defined as the Local Phase Coherence 116 within the specular component image; (3) Contrast is defined as the sum of the RMS contrast across eight bandpass-filtered versions of the specular component image.
Psychophysical data analysis
No participants or trials were excluded from analysis. In Experiment 1, both human ratings and model-predicted gloss values were normalised to the range 0-1 before comparison, individually for each participant or model instance. In Experiment 2, rankings were converted to the average rank position of each image step for each participant, averaging within matte and glossy seed image sequences. In Experiments 3a and 3b, pairwise comparisons were converted to the proportion of times each image within each sequence was judged as having higher gloss, across pairings and repetitions, for each participant. In Experiment 4, pairwise comparisons were converted to the proportion of times ‘image A’ (arbitrarily labelled) was judged as having higher gloss than ‘image B,’ and this proportion was subtracted from 0.5 (i.e. 0 indicates equal apparent gloss, and good constancy; deviations in either direction indicate deviations from constancy). Model predictions were obtained by subtracting the predicted gloss value of image B from that of image A, and scaling this gloss difference into the range −0.5 to 0.5, retaining the original zero point. For models with multiple training instances, model predictions were always derived, and performances calculated, for each individual training instance.
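The two rescaling steps (0-1 normalisation for Experiment 1, and signed scaling into −0.5 to 0.5 preserving the zero point for Experiment 4) amount to, in sketch form (function names are illustrative):

```python
import numpy as np

def normalise_01(values):
    """Rescale ratings or model-predicted gloss values into the range 0-1,
    as done individually per participant or model instance (Experiment 1)."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

def scale_signed_half(gloss_diff):
    """Scale gloss differences (image A minus image B) into -0.5 to 0.5 while
    retaining the original zero point (Experiment 4): divide by the largest
    absolute difference, then halve."""
    gloss_diff = np.asarray(gloss_diff, dtype=float)
    return 0.5 * gloss_diff / np.abs(gloss_diff).max()
```

Note the difference between the two: min-max normalisation shifts as well as scales, whereas the signed scaling only scales, so a zero difference (equal predicted gloss) stays at exactly zero.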
Statistical Analysis
Statistical analyses were performed using the Pingouin117 package for Python (version 0.3.10). All tests were two-tailed. Confidence intervals reported for mean differences and correlations were calculated by bootstrapping the respective estimate 10,000 times. Data distributions were assumed to be normal but this was not formally tested. The gloss predictions of all models were fixed, with no free parameters in evaluation against human data, except for the analysis of the data in Figure 5E, where the performance of a logistic transform of unsupervised model predictions is also reported.
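A percentile-bootstrap confidence interval of the kind described above can be sketched as follows (hypothetical helper; the study's exact resampling code may differ):

```python
import numpy as np

def bootstrap_ci(data, statistic=np.mean, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic (e.g. a mean
    difference): resample the data with replacement n_boot times, recompute
    the statistic each time, and take the alpha/2 and 1 - alpha/2 quantiles."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    boots = np.array([statistic(rng.choice(data, size=len(data), replace=True))
                      for _ in range(n_boot)])
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

For a correlation the same scheme applies, resampling paired observations rather than single values before recomputing the statistic on each bootstrap sample.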
Data availability
All human and model data are available on Zenodo at http://doi.org/10.5281/zenodo.4495586.
Code availability
Custom analysis code that supports the findings of this study is available on Zenodo at http://doi.org/10.5281/zenodo.4495586.
Author contributions
Conceptualization, K.R.S., R.W.F. and B.L.A.; Methodology, K.R.S. and R.W.F.; Software, K.R.S.; Formal Analysis, K.R.S.; Investigation, K.R.S.; Resources, R.W.F.; Writing – Original Draft, K.R.S. and R.W.F.; Writing – Review & Editing, K.R.S., R.W.F. and B.L.A.; Visualization, K.R.S.; Supervision, R.W.F.; Funding Acquisition, R.W.F. and K.R.S.
Competing interests
The authors declare no competing interests.
Supplementary Methods and Results
Pilot experiment to select latent dimensionality of unsupervised network models
We wanted to select a latent dimensionality for the PixelVAE model such that the network was forced to learn a highly compressed code, but was still able to capture the majority of meaningful structure in the training images. To gauge where this dimensionality lay for our dataset, we trained five PixelVAE networks with latent dimensionalities of 2, 5, 10, 20 and 40. From each network we generated 50 images by conditioning the generative sampling process on each of 50 random bumpy surface images that did not belong to the training dataset (see Supplementary Figure 1A). We then performed a psychophysical pilot experiment in which we showed twenty observers the generated images. On each trial, one of the rendered seed images was shown at the top of the screen, and its corresponding five generated images (one from each network) were shown in a random order in a horizontal row beneath. Participants rearranged the five generated images in order of their apparent similarity to the rendered seed image, and we measured the average similarity ranking given to the images generated by each model. Results are shown in Supplementary Figure 1B. Apparent similarity of generated images to seed images increased with a model’s latent dimensionality, but began to plateau after a dimensionality of ten, which we used for all subsequent PixelVAE models.
Performance of intermediate layers in models
The analyses in the main work focus only on one 10-dimensional layer within both the supervised and unsupervised models. We also explored how well representations in the intermediate layers could predict both ground-truth gloss and human gloss perception, for one training instance of each model type. From the unsupervised PixelVAE architecture we trained an SVM to categorise high-vs-low gloss surfaces from the activations in each of the three convolutional layers of the autoencoder stream leading up to the latent code. From the supervised ResNet architecture we trained an SVM to categorise gloss from activations in the last convolutional layer of each of the three residual blocks leading up to the output classification. Gloss classification accuracy was near-perfect for all layers of the supervised network, but increased across layers in the unsupervised network. Using the trained gloss classifiers, we then calculated predicted gloss values for each layer of each model and compared them to human gloss judgements in three psychophysical experiments as for all models in the main manuscript. The unsupervised network better predicted human gloss judgements across all layers, gradually improving across convolutional layers towards the 10-dimensional latent code (see Supplementary Figure 4A).
Continuous gloss regression supervised network model
To explore whether a supervised network with a richer training objective might learn representations more similar to humans’, we created a version of the ResNet supervised model architecture in which the final two “high-vs-low-gloss” categorical units were replaced with a single continuously-valued “gloss level” unit (see Supplementary Figure 4B). A new training set of 10,000 images was rendered, in which the magnitude of the specular component was sampled uniformly at random between 0 and 1, and the concentration of the specular component was set to the same value. All other scene factors were varied as in the original training set. This created a continuously-sampled one-dimensional gloss space, ranging from completely matte to completely mirrored. The supervised regression model was trained for 21 epochs on 9,000 of the images to predict the true specular magnitude, by minimising mean absolute error (MAE) using the Adam optimiser with a learning rate of 0.0001 and a decay rate of 0.9999 after each epoch. Other network parameters were kept as for the categorisation-supervised models. Five model instances were trained from different random initial weight settings. Ability to predict specular magnitude was excellent (average MAE on 500 test images = 0.002). We then evaluated the model’s ability to predict human data in the same way as for the categorisation-supervised ResNet models in the main manuscript, by training a gloss-classification SVM on the feature space in the 10D penultimate layer of the network, deriving gloss predictions for experimental images, and comparing them to human gloss judgements (see Supplementary Figure 4B).
The regression-supervised networks performed better than the categorisation-supervised networks, predicting human judgements on average as well as ground-truth specular reflectance (one-sample t-test of model RMSE against ground truth RMSE across model training instances t4 = 0.24, p = 0.82, Cohen’s d = 0.10, 95% CI of difference = [-0.02–0.02]). Unsupervised models, however, consistently outperformed ground truth, predicting errors in human gloss perception as well as successes (one-sample t-test of the poorest-performing set of unsupervised networks—those trained on a continuously-varying gloss dataset—against ground truth t4 = −4.78, p = 0.009, d = 2.14, 95% CI = [0.02–0.08]; independent-samples t-test against regression-supervised networks t8 = 3.98, p = 0.005, d = 2.52, 95% CI = [0.02–0.08]).
Effect of varying model hyperparameters
When working with deep neural networks one must choose values for a large number of hyperparameters, such as the number of layers and convolutional filters, the learning rate, the rate at which the learning rate decays as training progresses, and more. Fully exploring this hyperparameter space is prohibitively time consuming, so we performed a small exploration of some key factors by training 28 additional models (14 unsupervised PixelVAE and 14 supervised ResNet networks) and evaluating their ability to classify gloss and to predict human gloss perception (see Supplementary Figure 5). Supplementary Table 1 lists the variants tested. For both the unsupervised and supervised models we tested a deeper and shallower architecture than the one originally used, as well as lower and higher learning rates, and lower and higher decay rates for the learning rate. Some hyperparameters of interest could not be varied in both model types. For batch size, we explored higher and lower sizes for the supervised model only, as memory constraints on the training GPUs limited the unsupervised model to the small batch size of 5. Finally, we explored higher and lower degrees of complexity for the pixel distribution learned by the unsupervised PixelVAE model, a hyperparameter not present in the supervised model. For each variant, we altered only the single hyperparameter of interest, and held all others at their original values. All models were trained for the same duration as the original models (200 epochs for PixelVAE models and 25 epochs for ResNet models, which was sufficient for convergence in each model type, see Supplementary Figure 5B). Once trained, each model instance was evaluated against human psychophysical data from Experiments 1, 3 and 4 identically to the model evaluation procedures in the main manuscript. Average error in predicting human data, across the three experiments, is shown for each network in Supplementary Figure 5A and C.
Unsupervised models generally predicted human perceptual data better than supervised models across the wide range of hyperparameter settings explored, with the exception of two unsupervised models with large learning rates that failed to train (as evidenced by poor convergence and noisy generated samples). The performances of the original unsupervised model training instances fall towards the better end of the range of performances found for hyperparameter variants, and those of the original supervised instances fall towards the worse end of the range of performances found for variants, but both are within the bounds apparently typical of their model types. The slightly larger difference between the original models may be due to having specifically selected stimuli for Experiments 3 and 4 in order to maximise the “disagreement” in gloss predictions made by the original models.
Tolerance to variations in rendering method
In order to generate large datasets in a reasonable timeframe, we used real-time rasterised rendering in the Unity3D engine. The resulting images have good visual quality, but are not perfectly physically faithful. For example, they lack inter-reflections, and use an approximation of ambient occlusion. To test whether the unsupervised models trained on these rasterised images could make reasonable gloss predictions for surfaces rendered with more time-intensive but physically accurate methods, we generated 24 images of surfaces rendered once using rasterisation and once using physically faithful ray-tracing, via the Eevee and Cycles renderers within the Blender engine, respectively. Each image was then input to all training instances of the original unsupervised PixelVAE model, and an average predicted gloss value was calculated. The models’ gloss classifications were correct for all images rendered via both methods, although gloss values were systematically slightly higher for images rendered via the rasterised method familiar to the networks (see Supplementary Figure 6B).
Generalisation to real-world photographs
We performed two tests of generalisation to real-world images. First, we took 20 colour close-up photographs of common surfaces, ten of which were highly glossy (e.g. tomatoes, plastic, metallic foil) and ten of which were strongly matte (e.g. chalk, fleece, styrofoam). Predicted gloss values for each image were calculated for each of the ten training instances of the original unsupervised PixelVAE models (Supplementary Figure 6C). The unsupervised models were able to correctly classify all but one of the surfaces as being high or low gloss. Second, we performed a broader test by inputting to the models all images from the Giessen Material Image Database (Wiebel, Valsecchi & Gegenfurtner, 2013), which comprises 300 close-up photographs of assorted wood, metal, stone, and fabric surfaces. Predicted gloss values for each image from each model training instance are shown in Supplementary Figure 6D. Although the unsupervised model was moderately successful at sorting the “metal” category images from dull brushed surfaces to highly polished ones, it also showed clear failure cases. For example, matte fabrics with high-contrast patterns often received high predicted gloss values. Given that textured or patterned surfaces never occurred in the model’s training dataset, we would not expect it to have been able to learn the image structures associated with these.
Supplementary Figures
Acknowledgements
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation; project number 222641018–SFB/TRR 135 TP C1), by the European Research Council (ERC; Consolidator Award ‘SHAPE’–project number ERC-CoG-2015-682859) to R.W.F. and by an Alexander von Humboldt Postdoctoral Research Fellowship to K.R.S. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. We would like to thank Alexandra Schmid and Katja Doerschner for sharing code to implement the highlight feature measurement model, and Karl Gegenfurtner and James Todd for comments on earlier versions of this manuscript.
References