Abstract
Humans are remarkably well tuned to the statistical properties of natural images. However, quantitative characterization of processing within the domain of natural images has been difficult because most parametric manipulations of a natural image make that image appear less natural. We used generative adversarial networks (GANs) to constrain parametric manipulations to remain within (an approximation of) the manifold of natural images. In the first experiment, 7 observers decided which one of two perturbed images matched an unperturbed comparison image. Observers were significantly more sensitive to perturbations that were constrained to the manifold of natural images than they were to perturbations applied directly in pixel space. Trial by trial errors were consistent with the idea that these perturbations disrupt configural aspects of visual structure used in image segmentation. In a second experiment, 5 observers discriminated paths along the image manifold. Observers were remarkably good at this task, confirming that observers were tuned to fairly detailed properties of the manifold of natural images. We conclude that human tuning to natural images is more general than detecting deviations from natural appearance, and that humans have, to some extent, access to detailed interrelations between natural images.
Introduction
The images that we experience in our everyday visual environment are highly complex. However, they only comprise a small part of all possible digital images and our visual system seems to be adapted to perform well with these natural images (Simoncelli & Olshausen, 2001; Geisler, 2008).
Humans are very quick at making simple decisions in natural images (Thorpe et al., 2001). We can easily complete missing image components with the most natural looking value (Bethge et al., 2007) or detect visual deviations from naturalness (Gerhard et al., 2013; Wallis et al., 2017; Fründ & Elder, 2013). This precise tuning seems to be mostly restricted to foveal areas and humans are much less sensitive to deviations from naturalness in the periphery (Wallis & Bex, 2012). In fact, two physically different images that match in only a coarse set of image statistics in the periphery typically appear to be the same (Freeman & Simoncelli, 2011).
Most of these studies are concerned with human sensitivity to deviations from naturalness. It is, however, less clear how to characterize human performance within the manifold of natural images. One challenge is that most experimental manipulations of natural images make the image itself appear less natural. For example, Bex (2010) manipulates images by local deformations. He finds that sensitivity to these deformations is tuned to the spatial frequency at which they occur. Although the deformations used by Bex (2010) only moderately alter the power spectrum of natural images, they considerably disrupt higher order statistical properties such as the phase spectrum of the images (Wichmann et al., 2006). Others have manipulated images by introducing “dead leaves”—small homogeneous patches—in different locations (Wallis & Bex, 2012) or by manipulating the correlation structure of the images (McDonald & Tadmor, 2006). For small patches presented in the visual periphery, observers can often not detect these manipulations. However, we argue that these manipulations still make the image appear less natural, and thus probe sensitivity to deviations from naturalness rather than visual processing within the domain of natural images. A possible solution to this problem is to use selective sampling (Sebastian et al., 2017): instead of attempting to manipulate the image directly, one chooses another natural image that by chance shows the desired manipulation. Although this approach guarantees that the resulting “manipulations” remain natural, it is highly dependent on the indexing mechanism used to select the “manipulated” image. This dependence makes it difficult to generalize the selective sampling approach to higher levels of visual processing.
Recent advances in machine learning might provide a means to constrain image manipulations to the domain of natural images. Here, we focus on a class of very powerful generative image models, known as generative adversarial nets (GANs, Goodfellow et al., 2014; Radford et al., 2016; Arjovsky et al., 2017; Gulrajani et al., 2017; Miyato et al., 2018; Hjelm et al., 2018). GANs learn a mapping—called the generator—from an isotropic Gaussian distribution to the space of images. One key insight with GANs is the use of an auxiliary classification function—often called the critic—to judge how good the generator mapping is. Specifically, the critic attempts to predict if a given image has been generated by mapping isotropic Gaussian noise through the generator, or if the image is an instance from the training database. Generator and critic are trained in alternation, where the generator is trained to increase the errors of the critic and the critic is trained to decrease its own error (see for example Goodfellow et al., 2014, for details). In general, generator and critic can be any possible transformation, but they are typically implemented as artificial neural networks with multiple hidden layers (Goodfellow et al., 2014; Radford et al., 2016). Although never studied quantitatively, images generated from GANs look quite similar to natural images and manipulations in a GAN’s latent space seem to correspond in a meaningful way to perceptual experience. For example, Radford et al. (2016) start with a picture of a smiling woman, subtract the average latent representation of a neutral woman’s face and add a neutral man’s face to arrive at a picture of a smiling man. Similarly, Zhu et al. (2016) illustrate that projecting perceptually meaningful constraints back to a GAN’s latent space allows creation of random images with specified features (e.g. edges or colored patches) in the specified locations. 
Together, this suggests that GANs recover a reasonably good approximation to the manifold of natural images.
In this study, we used GANs to manipulate images and observers made perceptual judgments about these images. In experiment 1, the observers viewed a target image and decided which one of two noisy comparison images corresponded to that target. Crucially, noise was either applied directly in pixel space or it was restricted to remain within an approximation to the manifold of natural images by applying it in the latent space of a GAN. This task was more difficult when noise was applied in the latent space of a GAN, suggesting that noise in the GAN’s latent space actually changes image features that are relevant for image recognition, whereas noise applied directly in pixel space merely degrades the image without necessarily changing its content. In experiment 2, observers were asked to detect changes of direction in videos that were constructed by moving along smooth paths through a GAN’s latent space. Observers performed significantly above chance even for very small directional changes, suggesting that GANs not only recover a good approximation to the manifold of natural images, but also a perceptually meaningful parameterization of this manifold.
Experiment 1: Sensitivity to perturbations within the natural image manifold
Method
Training generative adversarial nets
We trained a Wasserstein-GAN (Arjovsky et al., 2017) on the 60 000 32 × 32 images contained in the CIFAR10 dataset (Krizhevsky, 2009) using the gradient penalty proposed by Gulrajani et al. (2017). See Figure 1A for example training images. In short, a GAN consists of a generator network G that maps a latent vector z to image space and a critic network D that takes an image as input and predicts whether that image is a real image from the training dataset or an image that was generated by mapping a latent vector through the generator network (see Figure 2 and Gulrajani et al., 2017 for details of the architecture of the two networks). The generator network and the critic network were trained in alternation using stochastic gradient descent. Specifically, training alternated between 5 updates of the critic network and one update of the generator network. Updates of the critic network were chosen to minimize the loss

L = E_z[D(G(z))] − E_y[D(y)] + λ E[(||∇y D(y)|| − 1)²]

and updates of the generator were chosen to maximize this loss. Here, E_z and E_y denote averages over a batch of 64 latent vectors z or training images y respectively. Furthermore, ∇y denotes the gradient with respect to image pixels y, which was evaluated at random points along straight line interpolations between real and generated images (see Gulrajani et al., 2017 for details). We set λ = 10 during training.
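The structure of this critic loss, including the gradient penalty term, can be illustrated with toy linear stand-ins for the two networks (an assumption for the sketch: G(z) = Az and D(x) = w·x rather than the convolutional networks of Figure 2; for a linear critic the gradient with respect to the image is the constant vector w, so the penalty has a closed form):

```python
import numpy as np

LAMBDA = 10.0  # gradient-penalty weight, λ = 10 as in the text

def critic_loss(w, A, z_batch, y_batch):
    """Structure of the WGAN-GP critic loss,
    L = E_z[D(G(z))] - E_y[D(y)] + λ·E[(||∇y D(y)|| - 1)²],
    illustrated with toy *linear* stand-ins G(z) = A z and D(x) = w·x."""
    fake = z_batch @ A.T     # G(z) for every latent vector in the batch
    d_fake = fake @ w        # critic scores for generated images
    d_real = y_batch @ w     # critic scores for training images
    # gradient penalty: for a linear critic, ∇y D(y) = w everywhere,
    # including at interpolations between real and generated images
    penalty = (np.linalg.norm(w) - 1.0) ** 2
    return d_fake.mean() - d_real.mean() + LAMBDA * penalty
```

The critic is updated to decrease this value and the generator to increase it; with λ = 10 any deviation of the critic's gradient norm from 1 is penalized strongly.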
Networks with different numbers of hidden states (parameter N in Figure 2) were trained for 200 000 epochs using an ADAM optimizer (Kingma & Ba, 2015) with learning rate 10⁻⁴ and β1 = 0, β2 = 0.9. Specifically, we trained networks with N = 40, 50, 60, 64, 70, 80, 90, 128 (see Figure 2). Wasserstein-2 error (Arjovsky et al., 2017) on a validation set (the CIFAR10 test dataset) was lowest with N = 90, in agreement with visual inspection of sample quality, so we chose a network with N = 90 for all remaining analyses. Example images generated from this final network are shown in Figure 1B.
Observers
Seven observers participated in the experiment. Two of them were authors, the remaining five were students from various labs at the Centre for Vision Research at York University, Toronto, Ontario. One additional observer participated in the first session, but withdrew from the experiment afterwards and their data was excluded from the analysis. All observers reported normal or corrected-to-normal vision. Prior to participation, all observers provided written informed consent to participate and all procedures were approved by the York University Ethics Board.
Procedure
Each individual observer judged image samples in a spatial two-alternative forced-choice image matching paradigm (see Figure 3). On every trial, the observer saw three images: a target image at the center, one comparison stimulus on the left and another on the right. One of the comparison stimuli was a perturbed version of the target image, while the other was an equally perturbed version of another sample from the GAN. The observer was required to indicate which of the two comparison images matched the central target image by pressing a corresponding button on a computer keyboard (left arrow key if the left comparison stimulus matched, right arrow key if the right comparison stimulus matched). Stimuli were presented for up to 6s or until the observer made a response, resulting in practically unlimited viewing time. Before each trial, a fixation point appeared on the screen for 500ms. Each observer performed 5 sessions and each session consisted of 80 trials for each noise level and type, resulting in a total of 1300 trials per observer (except O2, who performed 1437 trials).
Stimuli
All stimuli were samples from a GAN, converted to gray scale by averaging the red, green and blue channels of the sample image. The target stimulus was always noise-free, while the two comparison stimuli were perturbed by one of three noise types (see Figure 3). Pixel noise was constructed by adding independent Gaussian noise to each pixel of the respective image. Fourier noise was constructed in the Fourier domain by replacing the image’s phase component by random numbers. This resulted in an image with the same power spectrum as the source image but with completely random phases. A Fourier-perturbed image was constructed by adding a multiple of this power-spectrum-matched noise to the source image. Latent noise was constructed by manipulating the latent vector z from which an image was generated. To generate perturbed images with a predefined difference in pixel space, we started by adding independent Gaussian noise ζ to z and determining the corresponding image G(z + ζ). We then used gradient descent on (||G(z + ζ) − G(z)|| − δ)² to adjust ζ such that the final difference between the target and the perturbed target was equal to the predefined pixel space difference δ.
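The adjustment of ζ can be sketched as follows (a minimal sketch: G is any generator function, and the gradient of (||G(z + ζ) − G(z)|| − δ)² is approximated by central finite differences here, whereas the actual experiment would backpropagate through the trained GAN; names and step sizes are ours):

```python
import numpy as np

def match_pixel_difference(G, z, zeta, delta, lr=0.05, steps=500):
    """Adjust the latent perturbation ζ so that ||G(z + ζ) - G(z)|| ≈ δ
    by gradient descent on f(ζ) = (||G(z + ζ) - G(z)|| - δ)²."""
    def pixel_diff(zeta):
        return np.linalg.norm(G(z + zeta) - G(z))

    eps = 1e-5  # finite-difference step
    for _ in range(steps):
        if abs(pixel_diff(zeta) - delta) < 1e-6:
            break
        grad = np.zeros_like(zeta)
        for i in range(zeta.size):
            e = np.zeros_like(zeta)
            e[i] = eps
            grad[i] = ((pixel_diff(zeta + e) - delta) ** 2
                       - (pixel_diff(zeta - e) - delta) ** 2) / (2 * eps)
        zeta = zeta - lr * grad
    return zeta
```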
Stimuli were presented on a medium gray background (54.1 cd/m2) on a Sony Trinitron Multiscan G520 CRT monitor in a dimly illuminated room. The monitor was carefully linearized using a Minolta LS-100 photometer. Maximum stimulus luminance was 106.9 cd/m2, minimum stimulus luminance was 1.39 cd/m2. If the nominal stimulus luminance exceeded that range, it was clipped (for subsequent analyses, we also used the clipped stimuli). On every frame, the stimuli were re-rendered using random dithering to generate a quasi-continuous luminance resolution (Allard & Faubert, 2008). At a viewing distance of approximately 87cm, each stimulus image subtended approximately 0.65 degrees of visual angle and the images were separated by approximately 0.13 degrees of visual angle. One pixel subtended approximately 0.02 degrees of visual angle.
Data analysis
For every observer, we estimated a psychometric function parameterised as

ψ(x) = γ + (1 − γ − λ)σ(a + bx),

where γ = 0.5 is the probability to guess the stimulus correctly by chance, λ is the lapse probability, σ is the logistic function and a and b govern the offset and the slope of the psychometric function. We adopted a Bayesian perspective on estimation of the psychometric function (Fründ et al., 2011) and used weak priors λ ~ Beta(1.5, 20), a ~ N(0, 100), b ~ N(0, 100). Mean a posteriori estimates of the critical noise level xc at which ψ(xc) = 0.75 and the slope of the psychometric function at xc were obtained using numerical integration of the posterior (Schütt et al., 2016).
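The parameterisation of the psychometric function, and the determination of the critical noise level xc with ψ(xc) = 0.75, can be sketched as follows (bisection is our stand-in for the posterior-based estimate of Schütt et al., 2016; function names are ours):

```python
import numpy as np

def psychometric(x, a, b, lam, gamma=0.5):
    """ψ(x) = γ + (1 - γ - λ)·σ(a + b·x) with the logistic function σ."""
    sigma = 1.0 / (1.0 + np.exp(-(a + b * x)))
    return gamma + (1.0 - gamma - lam) * sigma

def critical_level(a, b, lam, gamma=0.5, lo=-50.0, hi=50.0):
    """Find the critical level xc with ψ(xc) = 0.75 by bisection,
    assuming ψ is monotone on [lo, hi]."""
    target = 0.75
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        val = psychometric(mid, a, b, lam, gamma)
        # step toward the target, accounting for the sign of the slope b
        if (val < target) == (b > 0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With γ = 0.5 and λ = 0, ψ(xc) = 0.75 exactly where a + b·xc = 0, i.e. xc = −a/b.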
To understand how the structure of the GAN’s latent space determined image matching performance, we re-analysed data from the latent condition. For this re-analysis, we assumed that observers would pick the perturbed stimulus that is closer to the target with respect to some distance measure. More specifically, let t denote the noise-free target stimulus and let t̂ and x̂ denote the perturbed target and distractor stimuli respectively. If an observer is picking the stimulus that is closer to the target, then c = d(t, x̂) − d(t, t̂) should show a positive correlation with the observer’s trial by trial response accuracy. Here, d is a suitably defined distance measure. We used either the Euclidean distance d(u, v) = ||u − v||, the radial distance d(u, v) = | ||u|| − ||v|| |, or the cosine distance d(u, v) = 1 − 〈u, v〉 / (||u|| ||v||), where ||u|| denotes the Euclidean norm of a vector u and 〈u, v〉 denotes the scalar product of vectors u and v. These distances were applied in either the GAN’s latent space or directly in pixel space, after concatenating the respective stimulus’ pixel intensities into one long vector. We then determined receiver operating curves (ROC) for predicting correct vs incorrect responses based on c. The area under the ROC is a measure of how well the respective distance measure predicts the observer’s trial by trial responses (Green & Swets, 1966). To test if the area under the curve (AUC) was significantly different from chance, we performed a permutation test, randomly re-shuffling the correct/incorrect labels 1000 times and taking the 95-th percentile of the resulting distribution as the critical value. We also used permutation tests to determine if the AUC for two different distance measures was significantly different. For the pairwise comparisons, there are 128 possible re-assignments of AUC values to the two conditions, and we computed all of them. The p-values for these post-hoc comparisons were corrected for multiple comparisons to control for inflation of the false discovery rate (Benjamini & Hochberg, 1995).
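The three distance measures, the decision variable c and the AUC can be sketched as follows (the AUC is computed here via the equivalent rank-sum statistic rather than by tracing the ROC; names are ours):

```python
import numpy as np

def euclidean(u, v):
    return np.linalg.norm(u - v)

def radial(u, v):
    return abs(np.linalg.norm(u) - np.linalg.norm(v))

def cosine(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def decision_variable(t, t_hat, x_hat, dist):
    """c = d(t, perturbed distractor) - d(t, perturbed target);
    positive c should go with a correct response."""
    return dist(t, x_hat) - dist(t, t_hat)

def auc(c_correct, c_incorrect):
    """Area under the ROC, computed via the equivalent rank-sum
    (Mann-Whitney) statistic over all correct/incorrect pairs."""
    wins = sum((c > i) + 0.5 * (c == i)
               for c in c_correct for i in c_incorrect)
    return wins / (len(c_correct) * len(c_incorrect))
```

An AUC of 0.5 corresponds to chance; an AUC of 1.0 means c perfectly separates correct from incorrect trials.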
To gain insight into the image features that determined the observers’ responses, we applied the ROC analysis to a number of image features as well. Firstly, we calculated the luminance histogram (50 bins) for each image and calculated the distance difference c between luminance histograms of the respective images. Secondly, to determine local dominant orientation at each pixel we first filtered the image with horizontal and vertical Scharr filters (Scharr, 2000) as implemented in scikit-image (van der Walt et al., 2014) giving local horizontal structure h and vertical structure v. The local orientation ϕ was extracted from these two responses such that h = r cos(ϕ) and v = r sin(ϕ), where r = √(h² + v²). We then determined the histogram (3 bins) of the local orientations across the image and calculated c as the distance difference between these orientation histograms. As a third feature, we calculated the edge densities of the two images, using the Canny edge detector from scikit-image with a standard deviation of 2 pixels and calculating the fraction of pixels labeled as edges by this algorithm. As a fourth feature we determined the slope of the power spectrum in double logarithmic coordinates.
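The orientation-histogram feature can be sketched as follows (the explicit Scharr kernels and the plain 'valid' correlation are our stand-ins for the scikit-image implementation referenced in the text; the 1/32 kernel normalization is an assumption):

```python
import numpy as np

# 3x3 Scharr kernels; the 1/32 normalization is an assumption and may
# differ from the scikit-image implementation used in the study
SCHARR_H = np.array([[3, 0, -3],
                     [10, 0, -10],
                     [3, 0, -3]]) / 32.0
SCHARR_V = SCHARR_H.T

def conv2_valid(img, k):
    """Plain 'valid' 2-D correlation; sufficient for this sketch."""
    H, W = img.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def orientation_histogram(img, bins=3):
    """Histogram of the local dominant orientation ϕ, where
    h = r·cos(ϕ), v = r·sin(ϕ) and r = sqrt(h² + v²)."""
    h = conv2_valid(img, SCHARR_H)
    v = conv2_valid(img, SCHARR_V)
    phi = np.arctan2(v, h)
    hist, _ = np.histogram(phi, bins=bins, range=(-np.pi, np.pi))
    return hist / hist.sum()
```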
Finally, we used a standard method for image segmentation (Felzenszwalb & Huttenlocher, 2004) to calculate segmentations of the target image and of the two perturbed comparison stimuli. We used the method implemented in scikit-image (van der Walt et al., 2014). Briefly, this algorithm iteratively merges neighbouring pixels or pixel groups if the differences across their borders are small compared to the differences within them. Each segmentation consists of a number of discrete labels assigned to the pixels of the original image. If two pixels belong to the same segmented region, the two labels associated with them should be the same. However, different segmentations may assign different labels to the same region. Thus, two segmentations s and s̃ would be similar if, for many pairs (i, j) of pixels, si = sj implies s̃i = s̃j. When calculating the distance between two segmentations s and s̃, we therefore count the number of pixel pairs for which si = sj and s̃i = s̃j, and we normalize by the number of pixel pairs that are assigned to the same region by at least one of the two segmentations. This is captured by the distance measure

dsegm(s, s̃) = 1 − Σi,j 1[si = sj and s̃i = s̃j] / Σi,j 1[si = sj or s̃i = s̃j],

where 1[A] = 1 if the expression A is true and 0 otherwise, and the sums go over all pairs of pixels i, j. If the two segmentations define exactly the same regions (but possibly with different labels), dsegm will be 0; if the two segmentations are completely different, in the sense that one has only one region (the entire image) and the other assigns each pixel to its own region, then dsegm will be 1.
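This segmentation distance can be computed without enumerating all pixel pairs, by counting pairs through label frequencies via the identity n(n−1)/2 (a sketch; the function name is ours):

```python
import numpy as np

def d_segm(s, s_tilde):
    """Distance between two segmentations (label images):
    1 minus the number of pixel pairs grouped together by *both*
    segmentations, divided by the number of pairs grouped together
    by *at least one* of them. Pair counts come from label
    frequencies via n(n-1)/2 instead of enumerating all pairs."""
    s = np.asarray(s).ravel()
    t = np.asarray(s_tilde).ravel()

    def same_pairs(counts):
        return float(np.sum(counts * (counts - 1) // 2))

    _, cs = np.unique(s, return_counts=True)   # pairs with si = sj
    _, ct = np.unique(t, return_counts=True)   # pairs with s̃i = s̃j
    _, cj = np.unique(np.stack([s, t]), axis=1,
                      return_counts=True)      # pairs equal in both
    both = same_pairs(cj)
    union = same_pairs(cs) + same_pairs(ct) - both
    return 1.0 - both / union if union > 0 else 0.0
```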
Results
Lower tolerance for noise applied within the natural image manifold
In the first experiment, seven observers were required to judge which one of two noisy comparison images matched a centrally presented target stimulus (see Figure 3). Figure 4A shows psychometric functions for one example observer as a function of noise level. In general, higher noise levels were associated with fewer correct responses. The observer’s performance was least affected by noise that was applied independently to each pixel (critical noise level with 75% correct performance at 14.8 ± 1.43dB, posterior mean and standard deviation, blue dots and line in Figure 4A). Performance was more affected by noise that was applied in the pixel domain but matched the power spectrum of the original image (critical noise level at 7.31 ± 0.54dB, green dots and line in Figure 4A). Finally, performance was most affected by noise that was applied in the latent space of the GAN and thus stayed within the manifold of natural images (critical noise level at 1.60 ± 0.60 dB, red dots and line in Figure 4A). Furthermore, the psychometric function for this observer was considerably steeper in the latent noise condition (slope at critical noise level −0.11 ± 0.09/dB) than in the other two conditions (slopes at critical noise level −0.015 ± 0.006/dB for pixel noise and −0.013 ± 0.0015/dB for Fourier noise), suggesting that the observer was relatively insensitive to low amplitude latent noise and then abruptly became unable to do the task. This is consistent with the idea that latent noise above a certain level produces a categorical change in the image, while noise in the pixel domain results in a more gradual decrease in image quality.
On average across all 7 observers, the critical noise level was highest for independent pixel noise (8.91±1.22dB, mean±s.e.m.), was comparable for Fourier noise (6.77±0.83dB, paired t-test pixel vs. Fourier noise: t(6) = −1.59, n.s.) and was significantly lower for latent noise (3.47±0.36dB, paired t-test pixel vs. latent noise: t(6) = 3.59, p = 0.011, Fourier vs. latent noise: t(6) = 3.89, p = 0.0080; see Figure 4B). Thus overall, observers were most affected by noise that was approximately applied within the manifold of natural images by perturbing the GAN’s latent representation of the stimulus. We verified that this result also held for every individual observer. We further found that psychometric functions tended to fall off more steeply when noise was applied in the GAN’s latent space (average slope at critical noise level for latent noise -0.066±0.0084/dB, see Figure 4C) than when noise was applied in pixel space (average slope at critical noise level -0.024±0.0041/dB for pixel noise, -0.017±0.0027/dB for Fourier noise), replicating the observations from Figure 4A.
To summarize, we found that, in general, noise that was approximately applied within the manifold of natural images was more effective at disrupting observers’ performance than noise applied outside of the manifold of natural images in pixel space. We performed two additional sets of analyses to determine (i) whether latent space or pixel space image differences were more predictive of observers’ trial by trial behaviour and (ii) which image features were responsible for the decline in performance in the latent noise condition.
Distance in latent space correlates with image matching performance
We modeled observer behaviour in the image matching task by assuming that, on every trial, the observer picks the comparison stimulus that appears closer to the target with respect to some appropriate distance measure. We looked at three different candidate distance measures and applied them both in the GAN’s latent space and directly in pixel space. To evaluate the relevance of each distance measure, we used the area under the receiver operating curve (AUC) for discrimination between correct and incorrect trials.
Figure 5A shows average AUC-values for different hypothetical distance measures applied either in latent space or in pixel space. The simplest distance measure is the Euclidean distance, marked by the blue bars in Figure 5A (darker blue for Euclidean distance in latent space, lighter blue for Euclidean distance in pixel space). Note that Euclidean distance in pixel space is equivalent to the RMS difference between stimuli that was used as a unified measure of perturbation strength in Figure 4. Euclidean distance in latent space was at least as predictive as Euclidean distance in pixel space (latent space: 0.83±0.0092, average AUC ± s.e.m., pixel space: 0.82±0.014, permutation test p = 0.17).
We performed the same analysis using the difference between the norms of either the latent vector or the pixel vector (green bars in Figure 5A). In pixel space, radial distance is equivalent to an observer who simply compares the RMS contrast of the images. In latent space, the norm of the latent vector seems to be related to contrast as well, but the relationship is more complex. Radial distance received considerably lower AUC values than Euclidean distance in both latent and pixel space and was a much less reliable predictor of trial by trial performance. Radial distance in latent space was significantly less predictive than radial distance in pixel space (latent space: 0.57±0.022, pixel space: 0.68±0.016, permutation test p < 0.05 corrected) and for four out of 7 observers, radial distance in latent space did not predict trial by trial choices significantly better than chance.
Finally, we analyzed how well cosine distance explained observers’ responses (red bars in Figure 5A). Cosine distance is interesting for two reasons. Firstly, cosine distance is equivalent to Euclidean distance except for the influence of the radial component. Secondly, cosine distance can be interpreted as an observer who uses the target stimulus as a template and linearly matches that template against both comparison stimuli to pick the stimulus that matches best. Cosine distance applied in latent space was a better predictor than cosine distance applied in pixel space (latent space: 0.82±0.012, pixel space: 0.78±0.019, permutation test p = 0.05 corrected).
Taken together, these results suggest that the latent space of GANs captures processing beyond simple contrast differences.
Distortions of mid-level features explain trial by trial performance
In order to determine which image features were responsible for the decline in performance with perturbations in the latent space of GANs, we applied the same analysis to different image features (see Figure 5B).
We found that differences in the luminance distribution of the images were clearly predictive of trial by trial behaviour (average AUC: 0.67±0.018; AUC was larger than the 95-th percentile of the null distribution in all seven observers). Yet, other features, such as the difference in local orientation (average AUC: 0.70±0.010) or differences in the images’ edge density (average AUC: 0.67±0.011), were equally good predictors of the observers’ trial by trial behaviour (permutation test not significant after correction for multiple comparisons).
One of the quantities that might have been relevant for our observers is the slope of the power spectrum (see for example Alam et al., 2014). We evaluated to what extent this feature contributed to our observers’ decisions and found that it was largely irrelevant for explaining their trial by trial behaviour: in three out of seven observers the AUC of this feature was not significantly different from chance performance, and the average AUC for the slope of the images’ power spectrum was significantly lower than that for any other feature we studied.
In order to quantify how well differences in the mid-level structure of the perturbed images could explain trial by trial responses, we calculated segmentations of all images using a standard segmentation algorithm (Felzenszwalb & Huttenlocher, 2004). Although we do not believe that humans necessarily segment images using graph based optimization as the algorithm does, we believe that this approach provides at least a coarse approximation to the mid-level structure of the images. Differences in segmentation were considerably more predictive than differences in any other feature distribution (mean AUC for segmentation 0.82±0.011, permutation test p < 0.05 corrected). In fact, differences in segmentation were about as predictive of trial by trial behaviour as cosine distance in latent space (permutation test p = 0.60), suggesting that indeed distortions of the images’ mid-level structure might be responsible for the decline in image matching performance when noise was constrained to stay within the manifold of natural images by applying it in the GAN’s latent space.
Experiment 2: Sensitivity to directions in the natural image manifold
In Experiment 1, we found that human observers are particularly sensitive to image perturbations that stay within the manifold of natural images. This was achieved by perturbing stimuli along a parameterization of the manifold of natural images as recovered by a GAN. We wondered if observers were also sensitive to other aspects of this parameterization, such as direction. We therefore asked observers to discriminate between videos that were created by walking along either straight paths in latent space (i.e. paths with no change in direction) or paths with a sudden turn (i.e. a change in direction).
Method
Observers
Five observers participated in the second experiment. All five of them had participated in the first experiment as well. The procedures were approved by the Ethics Board of York University, ON, Canada and observers provided informed consent before participating. One observer (O3) accidentally did one session with incorrect angle logging. As the correct angles could not be recovered, we decided to exclude the corresponding trials from the analysis.
Procedure
The experiment was a single interval design. On every trial, the observers either saw a video without a turn in latent space direction or a video that contained a turn in latent space direction and they had to decide if the presented video contained a turn or not. The probabilities for turn and no-turn videos were each 0.5. In order to avoid bias about the image features that would indicate a turn in latent space, observers were instructed that there would be two classes of videos and that we were not able to describe the difference unambiguously in words. Instead, each observer first saw 100 trials with path angles of 90° (vs straight) and received trial by trial feedback about their performance. After that, each observer performed two blocks of 100 trials for each path angle for the main experiment. For each block, the size of the possible path angle was kept constant. To allow the observers to calibrate their decision criterion to the size of turns presented in the respective block, we provided trial by trial feedback during the first 20 trials of each block and only analyzed the remaining 80 trials. Thus in total, we analyzed 160 trials per observer per path angle.
Stimuli
Each video consisted of 60 frames. The first frame corresponded to a random point on the sphere with radius 10. Successive frames were then constructed by taking steps of norm 0.5 in a random direction tangential to the sphere with radius 10 (see Figure 6A). Straight paths were constructed by simply taking 60 successive steps in the same direction. Paths with a turn were constructed by changing the direction of steps by an angle a of 15, 30, 60, or 90 degrees after the first 30 frames. We will refer to this angle as the “path angle” in the following. At a frame rate of 60Hz, each video had a duration of 1s and if the video contained a turn in latent space, that turn happened after 500ms. Otherwise, the setup for experiment 2 was the same as in experiment 1.
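The construction of these latent space paths can be sketched as follows (a sketch under stated assumptions: the function name is ours, and re-projecting each point back onto the radius-10 sphere after every tangential step is our assumption about how the walk stays on the sphere):

```python
import numpy as np

def latent_path(dim, n_frames=60, radius=10.0, step=0.5,
                turn_angle=None, turn_at=30, seed=None):
    """Random walk on the sphere of radius 10 in latent space: steps of
    norm 0.5 in a direction tangential to the sphere, with an optional
    turn by `turn_angle` degrees after `turn_at` frames."""
    rng = np.random.default_rng(seed)

    def tangential(v, z):
        # project v into the tangent plane at z and rescale to `step`
        v = v - (v @ z) * z / (z @ z)
        return step * v / np.linalg.norm(v)

    z = rng.standard_normal(dim)
    z *= radius / np.linalg.norm(z)
    direction = tangential(rng.standard_normal(dim), z)

    frames = [z.copy()]
    for k in range(1, n_frames):
        if turn_angle is not None and k == turn_at:
            # rotate the step direction within the tangent plane,
            # using a random tangent vector orthogonal to `direction`
            other = tangential(rng.standard_normal(dim), z)
            other -= (other @ direction) * direction / (direction @ direction)
            other *= step / np.linalg.norm(other)
            a = np.deg2rad(turn_angle)
            direction = np.cos(a) * direction + np.sin(a) * other
        z = z + tangential(direction, z)
        z *= radius / np.linalg.norm(z)   # stay on the sphere
        frames.append(z.copy())
    return np.array(frames)
```

Each frame of the resulting (n_frames × dim) array would then be rendered through the generator G to produce the video.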
Data analysis
To evaluate observers’ sensitivity to path angles, we calculated d′ at each path angle. For single observers, confidence intervals for d′ were determined by bootstrap with 1000 samples.
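From the four response counts of the single-interval task, d′ can be computed as follows (a sketch; the log-linear correction for extreme hit or false-alarm rates is our assumption, not necessarily the correction used in the paper):

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate) for the single-interval
    turn/no-turn task. Adding 0.5 to every cell (log-linear correction)
    avoids infinite z-scores at rates of 0 or 1; this correction is an
    assumption on our part."""
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return z(hit_rate) - z(fa_rate)
```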
Results
Figure 6B shows average discrimination performance as a function of path angle. Not surprisingly, observers were best at detecting turns of 90 degrees in latent space (average d′ = 2.23 ± 0.59). However, even for path angles of 15 degrees, the smallest path angle tested, three out of five observers performed above chance (average d′ = 0.20 ± 0.090, one-sided t-test against zero: t(4) = 2.25, p = 0.0436). This indicates that even small changes in latent space direction were detected by the observers.
Overall, observers were remarkably good at discriminating between different directions taken by movies in latent space. In fact, half of the observers responded correctly in more than 80% of the trials (average 78.1±2.56%, mean ± s.e.m.) at the largest turn.
General Discussion
We found that observers are sensitive to image manipulations that are restricted to remain within the manifold of natural images. Errors made within the manifold of natural images seem to be related to changes in the configural structure of the images more than to changes in local image features. We further find that observers are sensitive not only to noise within the manifold, but also to more subtle aspects of paths along the manifold (see Experiment 2).
Global vs local image models
Our results seem to contradict reports which find that observers are very sensitive to deviations from naturalness (Gerhard et al., 2013): how can observers be sensitive both to changes within the manifold of natural images and to deviations away from that manifold? We find that sensitivity to perturbations within the manifold of natural images can be well predicted by the part and object structure of the image that is captured by standard segmentation algorithms (see Figure 5), while deviations from naturalness might be more related to the fine structure of local statistics. This is also visible in Figure 1: the two image classes in parts A and B appear very similar at first glance and only upon closer inspection does it become clear that the examples in part B are mostly meaningless. We believe that this point could be taken both as an advantage and a disadvantage. Clearly, the samples in Figure 1B do not match every aspect of the natural images in Figure 1A (although better matches can be achieved if the GAN is restricted to more narrow classes of objects, see for example Radford et al., 2016; Gulrajani et al., 2017). However, the images capture many of the global and highly non-stationary properties of natural images that texture models based on stationarity do not capture (Portilla & Simoncelli, 2000; Gatys et al., 2015). We therefore believe that our approach is complementary to the local approach taken by studies that investigated texture processing (e.g. Freeman & Simoncelli, 2011; Gerhard et al., 2013; Wallis et al., 2016, 2017).
Small images
The images employed here are relatively small: only 32×32 pixels in size. In contrast, Wallis et al. (2017) used image patches of 128×128 pixels to compare texture images created from a deep neural network model with real photographs of textures. Other studies have used a range of image sizes (Alam et al., 2014; Sebastian et al., 2017; Bex, 2010, in increasing order of image size), but our images are closer to the range of sizes used for image patches (Gerhard et al., 2013) than to those used for entire images. However, training generative adversarial networks on larger images with image variability similar to CIFAR10 currently requires training class-conditional networks, as for example done by Miyato et al. (2018) when training on the entire ImageNet dataset (Russakovsky et al., 2015). Although this would in principle have been possible in this study as well, it would have implied using separate manifolds for different classes and would have made the interpretation of our results considerably more complex. We therefore decided to restrict ourselves to smaller images.
Dataset bias
Many publicly available databases appear to be systematically biased (Wichmann et al., 2010): The images in these databases are pre-segmented in the sense that a photographer selected a viewpoint that they considered particularly “interesting” or selected which objects to put in focus. Although pictures from these databases may appear natural, conclusions drawn from them may be misleading (Wichmann et al., 2010). The example pictures in Figure 1A. clearly show such photographer bias. In every one of these examples, the perspective is clearly focused on one specific object, while typical natural scenes often contain multiple objects and large regions of texture that is not explicitly defined (i.e. background). As outlined above, we believe that our approach is less prone to such dataset bias because it predominantly depends on the GAN’s ability to capture the global structure of the images. Although this global structure may be imposed by the photographer, a similar global structure might also be imposed by eye movements centering objects on or close to the fovea (Kayser et al., 2006; ’t Hart et al., 2013).
Non-object images
Upon closer inspection, the examples in Figure 1B. do not look exactly like real objects. Although each of the examples can clearly be segmented into foreground and background, it is not always possible to actually name the foreground objects in Figure 1B., while this seems easier for the training examples in Figure 1A., despite the relatively low resolution of the images. Thus, the samples from the GAN used in the present study would probably be easy to discriminate from real images in a direct comparison (Gerhard et al., 2013; Wallis et al., 2017). It should thus be noted that the image representation learned by the GAN used in this study is only an approximation to the manifold of natural images (note, however, that other studies training on the CIFAR10 dataset show samples of similar quality, for example Gulrajani et al., 2017; Roth et al., 2017). Although image manipulations in this approximation appear convincing when the GAN has been trained on more restricted sets of training images (see for example Zhu et al., 2016), this cannot be guaranteed for all the stimuli used in this experiment. To date, it is unclear how exactly the GAN samples used in this study match the perceived properties of the training data (even though the training data themselves might be biased, see Section “Dataset bias”).
We believe, however, that natural-appearing non-objects are still an interesting class of stimuli. For example, Huth et al. (2012) report that semantic content is important for shaping the response properties of large parts of anterior visual cortex, suggesting that many areas traditionally thought of as visual are also semantic areas. Attempts to further test this claim have been restricted to correlational approaches (Khaligh-Razavi & Kriegeskorte, 2014), partly because it is difficult to generate “non-object” stimuli that otherwise fully match the properties of object stimuli (see also Fründ et al., 2008). Comparing responses to training images with objects (Figure 1A) with responses to images generated from a GAN but lacking full semantic information (Figure 1B) could help resolve this point.
Conclusion
In this study, we explored the potential of studying vision within the manifold of natural images. To do so, we employed a generative adversarial network to constrain perturbations to remain within the manifold of natural images, and we find that observers are remarkably sensitive to image manipulations that are constrained in this way. We observe that perturbations within the manifold of natural images tend to disrupt more global image structure, such as figure-ground segmentation. This might prove useful in future studies that investigate such processes under more naturalistic conditions. The fact that GANs provide an approximate parametrization of the manifold of natural images encourages further use of these powerful image models to study vision under complex naturalistic stimulus conditions.
Footnotes
1 Note that for cosine distance in latent space, linear template matching would be performed in the GAN’s latent space, which corresponds to image space in a complex and highly nonlinear way.
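To make the footnote concrete, the sketch below (again with a hypothetical random generator standing in for the trained GAN) computes the same cosine-distance template match once on latent vectors and once on the corresponding decoded images. Because the generator is nonlinear, the two measures generally disagree: a linear template in latent space does not correspond to a linear template in pixel space.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical random generator standing in for the trained GAN;
# only its nonlinearity matters for this illustration.
W = rng.standard_normal((32 * 32, 16))

def G(z):
    """Nonlinear map from a 16-d latent vector to a flattened 32x32 image."""
    return np.tanh(W @ z)

def cosine_distance(a, b):
    """1 - cos(angle between a and b); ranges from 0 to 2."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

z0 = rng.standard_normal(16)
z1 = rng.standard_normal(16)

d_latent = cosine_distance(z0, z1)       # linear template match in latent space
d_image = cosine_distance(G(z0), G(z1))  # the same measure in pixel space

# Since G is nonlinear, d_latent and d_image need not agree: a linear
# decision boundary in latent space maps to a curved surface in image space.
```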