Predictive coding as a unifying principle for explaining a broad range of brightness phenomena

The visual system is highly sensitive to spatial context when encoding luminance patterns. This context sensitivity has inspired the proposal of many neural mechanisms for explaining the perception of luminance (brightness). Here we propose a novel computational model for estimating the brightness of many visual illusions. We hypothesize that many aspects of brightness can be explained by a predictive coding mechanism, which reduces the redundancy in edge representations on the one hand, while enhancing non-redundant activity on the other (response equalization). Response equalization is implemented with a dynamic filtering process, which adapts to each input image. Dynamic filtering is applied to the responses of complex cells in order to build a gain control map. The gain control map then acts on simple cell responses before they are used to create a brightness map via activity propagation. Our approach is successful in predicting many challenging visual illusions, including contrast effects, assimilation, and reverse contrast.

Author summary

We hardly notice that what we see is often different from the physical world “outside” of the brain. This means that the visual experience that the brain actively constructs may be different from the actual physical properties of objects in the world. In this work, we propose a hypothesis about how the visual system of the brain may construct a representation for achromatic images. Since this process is not unambiguous, sometimes we notice “errors” in our perception, which cause visual illusions. The challenge for theorists, therefore, is to propose computational principles that recreate a large number of visual illusions and to explain why they occur. Notably, our proposed mechanism explains a broader set of visual illusions than any previously published proposal. We achieved this by trying to suppress predictable information.
For example, if an image contained repetitive structures, then these structures are predictable and would be suppressed. In this way, non-predictable structures stand out. Predictive coding mechanisms act as early as in the retina (which enhances luminance changes but suppresses uniform regions of luminance), and our computational model holds that this principle also acts at the next stage in the visual system, where representations of perceived luminance (brightness) are created.


Visual perception is relative rather than absolute; the visual system (VS) computes the perceptual attributes of a visual target not only based on its physical properties, but also by considering information from the surrounding region of the target (context). For example, it is possible to induce different kinds of effects by context modification, such that the brightness of a target is contrasted (increasing brightness differences) or assimilated (decreasing brightness differences) with respect to its adjacent surround. Variants of these effects give rise to a myriad of visual illusions, which are of great utility for building hypotheses about computational mechanisms or perceptual rules for brightness perception.

At first sight it seems that contrast effects, such as simultaneous brightness contrast (SBC), can be explained by lateral inhibition between a target (center) and its context (surround). However, activity related to brightness contrast possibly does not occur before V1, although the receptive fields of retinal ganglion cells are consistent with lateral inhibition [1].

Unlike contrast, brightness assimilation pulls a target's brightness towards that of its immediate context, and therefore cannot be explained by mechanisms based on plain lateral inhibition. In fact, the neural mechanisms involved in generating brightness appear to be more intricate, and only a few computational proposals have been made so far for explaining assimilation effects. Note that any (computational) proposal can only be deemed successful if it explains contrast and assimilation effects at the same time. For example, the simplest (low-level) explanation for assimilation is spatial lowpass filtering, but it would predict brightness assimilation also for contrast displays.
Similarly, [2] suggested a retinal model based on a second-order adaptation mechanism induced by double opponent receptive fields; it predicted assimilation, but failed to predict contrast displays. With respect to mid- or higher-level processing, [3] suggested a pattern-specific inhibition mechanism acting in the visual cortex, which inhibits regularly arranged patterns of a visual stimulus.

Anchoring theory is a rule-based mechanism for segregating the contributions of illumination and reflectance in brightness, and thus for mapping luminance to perceived gray levels [4]. Accordingly, a visual image is first divided into one or more perceptual frameworks (a framework is a set of surfaces that are grouped together). Within each framework, the highest luminance value is anchored at perceived white, and smaller luminance values are mapped to gray levels according to their luminance ratio with the anchor. Note that "white" may not be the only possible anchor. For example, the lowest luminance may be anchored at "black", or the mean luminance may be anchored at the Eigengrau level. Although anchoring theory is successful in explaining the

subsequently used for suppressing (in the FCS) contrast measurements at spatially corresponding positions. A filling-in process then uses the original boundaries along with the modified contrast measurements. However, psychophysical evidence suggests that White's illusion is not affected significantly if the T-junctions are suppressed [23,24], nor do other illusions seem to be [25]. Furthermore, it is not readily clear whether junction rules represent reliable cues in complex natural scenes: the utility of junction rules has only been illustrated with relatively simple artificial displays [22,26].

Another filling-in-like model used center-surround receptive fields at four resolution levels for edge (or contrast) extraction [2].
At each resolution level, filter response amplitudes ("local contrast") were gain-controlled with a low-pass filtered version of themselves ("remote contrast"). A brightness map was estimated from the gain-controlled contrast map with fixpoint iteration of a Laplacian [27], which implements the filling-in process. The model was successful in simulating assimilation and reverse assimilation effects (mostly centered on challenging variants of White's effect), but failed in predicting Simultaneous Brightness Contrast (SBC).

More recently, [28] extended the mechanism of [18] for explaining assimilation effects to two dimensions. Thus, Domijan modified BCS activity accordingly before filling-in occurs in the FCS. To this end, he relies heavily on the max-operator, which he justified with a dendritic circuit proposal. In comparison, [29] used a modified diffusion operator, which could be efficiently implemented (both physiologically and computationally) with rectifying gap-junctions or rectifying dendro-dendritic connections [17,29]. For filling-in, Domijan used a luminance-sensitive pathway. Luminance-modulated contrast responses are computed with an unbalanced center-surround kernel (similar to [30], but see also [17] for a different way of computing multiplexed contrasts). In addition, [28] computed BCS activity by first deriving a local boundary map, where the loss of activity at junctions and corners was corrected. Based on the local boundary map, a global boundary map was computed. In the latter, contours which are parallel or co-linear to another contour were enhanced. Finally, local boundary activity was divided, at each

Otherwise it is smaller than one. The final BCS output keeps only those activities that are relatively close to one: low-contrast boundaries which are parallel to high-contrast edges are eliminated.
This causes FCS activity to freely diffuse across the eliminated boundaries. In this way assimilation displays are predicted. Unfortunately, Domijan did not show any results with real-world luminance images as input, and thus it is not clear whether his proposed mechanisms are robust and how well they generalize.

A completely different approach for explaining visual illusions is based on a statistical analysis of real-world images. This approach suggests that the perception of brightness [31,32] or lightness [32–34] is related to knowledge about the statistical relationships between visual patterns. In particular, [31] proposed that the brightness of a visual target embedded in some context depends on the expected luminance according to a probability distribution function. The probability distribution function integrates all contexts in which the target was seen previously. The perception of the target then depends on its expected luminance given its current context: it is perceived as brighter if the expected luminance is lower, and as darker otherwise. This approach is successful in predicting contrast and assimilation for several visual illusions, and suggests a statistical relationship between luminance patterns and brightness perception. Unfortunately, no attempt has been made to unveil any information processing strategy from the statistical analysis (but see [34]).

As with some of the models reviewed above, our approach also emphasizes the importance of boundaries in brightness perception. Specifically, we propose to reduce redundancy in the boundary maps. Such encoding strategies usually reduce the overall activity of a representation (and thus the expenditure of metabolic energy [35,36]), and are also known as efficient coding [37], predictive coding [38], whitening [90] or response equalization [40].

The underlying idea of our computational model is to adjust a boundary map, such that redundant activity is suppressed, while non-redundant activity is enhanced. Since neurons that encode redundant patterns tend to be over-represented, the overall boundary activity is reduced after the adjustment (response equalization). Response equalization is carried out by a dynamic filter. Fig 1 shows an overview of our model.

Materials and methods

In the first step, an input image is encoded in parallel by two sets of Gabor filters, which mimic the spatial response properties of simple cells in V1 [46]. The responses of the high-resolution filters define the contrast-only channel, while responses of the more coarse-grained filters define the contrast-luminance channel. From the contrast-only channel, we compute boundary activity via local energy [42,43]. Local energy is insensitive to the phase of simple cells, and thus resembles complex cell responses. We learn a decorrelation kernel from the local energy map, and apply it to the latter in order to reduce its redundancy (= dynamic filtering). The redundancy-reduced energy map then functions as a gain control map for both contrast channels. As a consequence, contrast activity is modified. Subsequently, an iterative procedure is used to recover a brightness map from the contrast channels. Our iterative procedure resembles a

Each of the three stages is mathematically specified in the corresponding subsection of the text. (A) Stage 1: The Contrast-Only channel and Contrast-Luminance channel are encoded by filtering the input image with a corresponding set of Gabor filters with high spatial resolution and coarse resolution, respectively. The local energy map is computed from the contrast-only channel. (B) Stage 2: The kernel of the dynamic filter is estimated from the local energy map. Dynamic filtering equalizes the amplitude spectrum of the energy map, reducing redundancy. The decorrelated energy map serves as gain control map for both contrast channels. (C) Stage 3: The output of the model is a brightness map that is obtained by solving an inverse problem, that is, recovering the image from both of the contrast channels. Thus, the contrast channels do not interact before Stage 3.

Contrast-only and contrast-luminance channel

We use Gabor filters in order to encode contrast-only and contrast-luminance information. In the primary visual cortex, simple cells respond to oriented light-dark bars across a certain spatial frequency range [44] and their receptive fields can be modeled by Gabor filters [45–47]. Consistent with the properties of Gabor filters, it seems that many simple cells in V1 encode contrast information. Under certain circumstances though, neurons in V1 may respond to surface brightness as well, even without (sharp luminance) contrasts in their receptive fields [48,49]. For example, [50] found such neurons in V1, which have large receptive fields, broad orientation tuning, and a preference for low spatial frequencies. These neurons respond to both contrast and luminance. Accordingly, we computed the contrast-only channel with Gabor filters having high spatial frequency tuning and balanced ON-OFF sub-regions (i.e., the sum across the kernel is zero), such that they do not respond to homogeneous regions of the stimulus. On the other hand, the contrast-luminance channel is computed with Gabor filters having a lower spatial frequency and unbalanced ON-OFF sub-regions (i.e., the sum across the kernel is positive), such that they respond to both luminance and contrast. Fig 1 illustrates these two sets of filters (see S1 Appendix for parameter values and a mathematical description). The contrast-only and contrast-luminance channels were computed by simply convolving (symbol "*") a luminance image with the corresponding set of Gabor filters. That is, if g represents a Gabor kernel, then R_g(x, y), the result of convolving the luminance image with g, represents its activity. The arguments (x, y) indicate 2D spatial coordinates. The contrast channels remain separated until the filling-in process.
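The distinction between balanced and unbalanced kernels can be illustrated with a minimal NumPy sketch. It is not the parameterization of S1 Appendix: the kernel sizes, wavelengths and widths are arbitrary, the helper names `gabor_kernel` and `convolve_same` are hypothetical, and balancing by subtracting a scaled copy of the Gaussian envelope is only one possible way to enforce a zero kernel sum.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, phase, sigma, balanced=True):
    """Gabor kernel on a (size x size) grid; size is assumed to be odd."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)      # coordinate along the carrier
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    kernel = envelope * np.cos(2.0 * np.pi * xr / wavelength + phase)
    if balanced:
        # Assumption: remove the DC component with a scaled envelope,
        # so that the kernel sums to zero (contrast-only behavior).
        kernel -= envelope * kernel.sum() / envelope.sum()
    return kernel

def convolve_same(image, kernel):
    """Zero-padded linear convolution via FFT, cropped to the image size."""
    H, W = image.shape
    kh, kw = kernel.shape
    shape = (H + kh - 1, W + kw - 1)
    spectrum = np.fft.rfft2(image, s=shape) * np.fft.rfft2(kernel, s=shape)
    full = np.fft.irfft2(spectrum, s=shape)
    return full[kh // 2:kh // 2 + H, kw // 2:kw // 2 + W]

# Illustrative parameters only (not the values used in the paper).
g_contrast_only = gabor_kernel(25, 8.0, 0.0, 0.0, 3.0, balanced=True)
g_contrast_lum = gabor_kernel(25, 16.0, 0.0, 0.0, 4.0, balanced=False)

uniform = np.full((64, 64), 0.5)                    # homogeneous luminance
r_only = convolve_same(uniform, g_contrast_only)
r_lum = convolve_same(uniform, g_contrast_lum)
```

Away from the image border, the balanced kernel gives a response close to zero on the homogeneous image, whereas the unbalanced kernel gives a positive response proportional to the luminance level.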

Local Energy Map

The idea underlying the "energy map" is to generalize across contrast polarity, orientation and phase, which also leads to a certain degree of position invariance [42,43,51,52]. The local energy map therefore resembles the properties of complex cells in the primary visual cortex [53]. In particular, complex cell responses are similar to those of simple cells in terms of orientation and spatial frequency preference. However, complex cell responses tend to be non-linear and shift-invariant with respect to contrast phase [53,54]. Local energy is computed from a pair of Gabor filters with quadrature phase (and identical orientation and spatial frequency) by

Parameters and mathematical details can be found in S2 Appendix. First, the local energy was computed (at each position in the image) by summing the squared responses of a pair of Gabor kernels which have the same spatial frequency and orientation, while their spatial phases differ by 90 degrees. After that, an energy map was computed by averaging the local energy across orientations.
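The computation described above (squared responses of a quadrature pair, summed, then averaged across orientations) can be sketched as follows. This is a hedged illustration rather than the implementation of S2 Appendix: the number of orientations, the kernel parameters and the helper names `quadrature_pair`, `convolve_same` and `energy_map` are assumptions made for the example.

```python
import numpy as np

def quadrature_pair(size, wavelength, theta, sigma):
    """Even (cosine) and odd (sine) Gabor kernels with a 90 degree phase difference."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)
    env = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    even = env * np.cos(2.0 * np.pi * xr / wavelength)
    even -= env * even.sum() / env.sum()            # assumption: zero-sum (balanced) kernel
    odd = env * np.sin(2.0 * np.pi * xr / wavelength)
    return even, odd

def convolve_same(image, kernel):
    """Zero-padded linear convolution via FFT, cropped to the image size."""
    H, W = image.shape
    kh, kw = kernel.shape
    shape = (H + kh - 1, W + kw - 1)
    spectrum = np.fft.rfft2(image, s=shape) * np.fft.rfft2(kernel, s=shape)
    full = np.fft.irfft2(spectrum, s=shape)
    return full[kh // 2:kh // 2 + H, kw // 2:kw // 2 + W]

def energy_map(image, n_orient=4, size=21, wavelength=8.0, sigma=3.0):
    """Sum of squared quadrature responses, averaged across orientations."""
    energies = []
    for k in range(n_orient):
        even, odd = quadrature_pair(size, wavelength, k * np.pi / n_orient, sigma)
        energies.append(convolve_same(image, even) ** 2
                        + convolve_same(image, odd) ** 2)
    return np.mean(energies, axis=0)

# A vertical luminance step and its contrast-reversed version.
step = np.zeros((64, 64))
step[:, 32:] = 1.0
e_step = energy_map(step)
e_reversed = energy_map(1.0 - step)
```

In the image interior, `e_step` and `e_reversed` agree, which illustrates the invariance to contrast polarity; the energy is concentrated at the edge and is close to zero in homogeneous regions.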

In this section we describe stage 2 of our model (Fig 1B). The first subsection describes how the dynamic filter is computed using zero-phase whitening (ZCA) [55].

With the dynamic filter we equalize the amplitude spectrum of the energy map. In fact, it produces very similar results to the "Whitening-by-Diffusion" method proposed in [40]. In the second and third subsections we detail the computation of the gain control map and how it interacts with the whitened energy map, respectively.

The purpose of the dynamic filter is to equalize the amplitude spectrum of the energy map. We compute our "dynamic filter" using zero-phase whitening (ZCA), a technique which has been used for learning the receptive fields of retinal ganglion cells [55]. ZCA is very similar to principal component analysis (PCA), and signal decorrelation can be achieved with either of them. However, the components are constrained to be symmetrical with ZCA. This "symmetry constraint" guarantees that the principal components are localized in the spatial domain [55], and therefore can be used as filter kernels. We nevertheless introduced a couple of modifications to the original ZCA (see S3 Appendix). As a result of the modifications, we obtained a spatial filter that adapts to the spatial structure of the local energy map of an image. It is called "dynamic" because a different filter is learned from each image. After filtering, the amplitude spectrum of the energy map is more uniform (see Fig 3). By the Wiener-Khinchin theorem, a more uniform power or amplitude spectrum implies that the original signal is more decorrelated [56,57]. For the decorrelated energy map, this means that spatial patterns with low redundancy tend to be intensified, while patterns with high redundancy tend to be attenuated. This is illustrated with
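For orientation, textbook ZCA whitening on image patches can be sketched as below. This is standard ZCA, not the modified variant of S3 Appendix; the patch size, the number of patches, the regularizer `eps`, the synthetic test image and the helper names are all assumptions made for the example.

```python
import numpy as np

def extract_patches(image, patch, n_patches, rng):
    """Sample square patches at random positions and flatten them to rows."""
    H, W = image.shape
    rows = rng.integers(0, H - patch + 1, n_patches)
    cols = rng.integers(0, W - patch + 1, n_patches)
    return np.stack([image[r:r + patch, c:c + patch].ravel()
                     for r, c in zip(rows, cols)])

def zca_matrix(patches, eps=1e-8):
    """Symmetric whitening matrix W = U diag((lambda + eps)^-1/2) U^T."""
    X = patches - patches.mean(axis=0)
    cov = X.T @ X / X.shape[0]
    eigval, eigvec = np.linalg.eigh(cov)            # cov is symmetric
    return eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T

rng = np.random.default_rng(0)
# Synthetic, spatially correlated "energy map" (random walk along each row).
image = np.cumsum(rng.standard_normal((64, 64)), axis=1)

patch = 5
patches = extract_patches(image, patch, 5000, rng)
W = zca_matrix(patches)

X = patches - patches.mean(axis=0)
whitened = X @ W
cov_whitened = whitened.T @ whitened / X.shape[0]

# The central row of W, reshaped to the patch size, is a localized kernel
# that could be applied to the map by convolution.
kernel = W[(patch * patch) // 2].reshape(patch, patch)
```

Because `W` is estimated from the patches of one particular image, a different kernel results for each input, which is the sense in which such a filter adapts to the image. After whitening, the patch covariance is close to the identity matrix, that is, the patch elements are decorrelated.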

A gain control map G was computed in two steps. First, the dynamic filter F was used as a convolution kernel for the energy map E: where the symbol "*" indicates convolution. We set the threshold to 10 percent of

The contrast-only and the contrast-luminance channel were subjected to gain control using the decorrelated energy map as where G is the gain control map, τ is a control parameter, R_g represents the activity of a Gabor filter g of the corresponding filter set, and (x, y) are the pixel coordinates of the input image. The value of τ acts as an upper bound to the maximum activity that we observed during the simulations. If the value of G at pixel coordinates (x, y) is bigger (or smaller) than zero, then the activity of the corresponding Gabor response will increase (or decrease). On the other hand, if G(x, y) ≈ 0 then R*_g ≈ R_g at position (x, y). Finally, after applying Eq 4 to each channel, the results were lowpass filtered with a Gaussian kernel (standard deviation 1 pixel) in order to reduce possible artifacts.

Stage 3: Brightness estimation as a filling-in process
The brightness map ẑ is the output of our model. It is estimated by minimizing an objective function E(z), which optimizes the trade-off between the reconstruction error (first term in the sum of Eq 5) using the gain-controlled contrast channels and a smoothness constraint (second term): Notice that the sum involves the Gabor filters g of both contrast channels (i.e.,

smoothness term serves to reduce artifacts produced at discontinuities. Eq 5 is solved iteratively with the conjugate gradient method (see S4 Appendix for more detail), which was terminated either when a maximum number of iterations was reached, or when an error criterion was satisfied: The last formula measures the difference in reconstruction between subsequent iterations ("relative error"). In terms of convergence, we observed as our model

normalization mechanism. The purpose of this section is to evaluate the relative influence of the numerator and the denominator, respectively, on predicted brightness. To this end, we identified three prominent scenarios for explaining corresponding classes of brightness illusions (Fig 5).

The input consists of a series of nine squares arranged in a spatially redundant pattern, where the middle square has a different luminance. The profile plot suggests an overall increase in brightness contrast: brightness of the middle square is further reduced, while the brightness of the surrounding squares is enhanced. While the brightness contrast also increased in the display with the bright middle square, this increase in contrast is caused nearly exclusively by the middle square.
The luminance pattern giving rise to Scenario 1 differs only in its spatial structure (Fig 5A), as all structures have the same intensity value. In this case, the major contribution to predicted brightness comes from the Gain Control Map (numerator of Eq 4). The patterns with high spatial correlations are attenuated by the dynamic filter, while patterns with lower spatial correlation are somewhat increased, such that a brightness contrast effect is predicted for the central disk. Scenario 2 is defined by luminance patterns with similar spatial structure but different intensity range (see Fig 5B). Analogously to Scenario 1, the major contribution to predicted brightness stems from the Gain Control Map, but here the effect is limited by the size of the dynamic filter. We observed that the dynamic filter will not only reduce the spatial correlations, but it will also act as a contrast filter if the redundant activity is in a sufficiently small spatial region. As a result, redundant activity with higher (lower) intensity than the other patterns would be increased (decreased). If this increment (or decrement) is sufficiently big, it will produce a major (minor) brightness contrast effect.

April 18, 2020 9/30

In Scenario 3, the major contribution to predicted brightness is caused by the

A luminance step and a luminance staircase, respectively, served as input images. Activity in response to the luminance step is close to the control parameter of Eq 4 (τ = 0.5), producing barely any change in the corresponding brightness estimation at the edges. In contrast, for the luminance staircase, the activity at the edges is relatively far from the control parameter, inducing a boost (an increment of brightness contrast) in the corresponding brightness estimation.

This section presents simulation results of our model in predicting brightness illusions. The first subsection focuses on contrast effects: Simultaneous Brightness Contrast, Benary-Cross and Reverse Contrast.

The SBC display consists of two gray patches with identical luminance which are embedded in a dark and bright background, respectively. The patch on the bright background is perceived as darker than the patch on the dark background (Fig 7A).

SBC can be attributed to low-level processing. For example, retinal ganglion cells may enhance patch contrast by lateral inhibition. However, other studies suggest that SBC may involve higher-level processing as well [59]: the apparent brightness of the patches can be modulated by the region surrounding the patches (= spatial context). In fact, psychophysical studies report that the contrast effect is perceived as less intense for smaller patches [60–64].

Fig 7). This translates to an increased contrast in predicted brightness (profile plot in Fig 7B). We also studied the relation between patch size and predicted brightness. In agreement with previous studies, Fig 7D shows a logarithmic relationship between patch size and our brightness estimation [65].

The Benary-Cross [66] is composed of a black cross and two gray triangles with the same luminance (Fig 8). The triangle embedded in the cross is perceived as brighter than the other. Notice that both of the triangles are made up of identical contrast polarities: one white-to-gray and two black-to-gray. This effect cannot be explained by lateral inhibition and is usually attributed to "belongingness theory", where the region to which the triangle appears to belong induces a contrast effect [66]. Noise masking experiments support the idea that the effect is caused by low-level mechanisms [67].

Our model predicted the brightness difference in the triangles according to two scenarios. According to Scenario 1 (but also 2, considering the intensity differences), the redundant patterns correspond to those edges of the triangles which are aligned with the cross. The latter are attenuated, while the non-redundant edges are enhanced. The Gain Control Map suggests a rather balanced effect, which is confirmed by the model's predicted brightness map. Notice, however, that the length of the non-suppressed edges is greater for the left triangle.

Reverse Contrast (Fig 9) was introduced by Gilchrist in the context of his anchoring theory. Gilchrist and co-authors suggest that simultaneous brightness contrast (SBC) can be reversed (e.g. by overcoming lateral inhibition) by adding more structures to the original SBC display. The purported mechanism acts on grounds of perceptual grouping of these structures.

In order to better understand how our model predicted reverse contrast, we probed it with further configurations (see Fig 9). We observed that the change in brightness of the gray patches increases as a function of the number of flanking bars (Fig 10A). On the other hand, if the flanking bars were misaligned to various degrees (disrupting the good continuation principle of perceptual organization), the effect was considerably reduced (Fig 10B). Both results are in agreement with psychophysical experiments [68]. However, in the latter study the authors examined displays with even more configurations that our model cannot predict (results not shown).

perceived as brighter than the other one. Lateral inhibition cannot account for this effect, and it has been suggested that the effect is caused by assimilation [3,69,70].

Assimilation means that the brightness of the flanking stripes averages with the gray bars, and therefore one expects that reducing the bar height would also reduce the strength of assimilation. However, experimental data indicate that the perceived difference between the bars increases with smaller heights [71], and that bandpass-filtered noise with the same orientation as the stripes enhanced the effect, while with perpendicular orientation the effect was diminished [67]. Therefore, White's effect seems to be principally generated by contrast at the horizontal edges of the bars (Fig 11B and Fig 11C), and to a lesser extent by assimilation from the flanking stripes [72].

In fact, a mainly contrast-based account is supported by the Gain Control Maps of Fig 11A and Fig 11B. Because the vertical edges (assimilation) are highly redundant, their activity is diminished (Scenario 1 & 2). The brightness estimation is dominated by the horizontal edges of the bars (contrast), which are enhanced. The vertical edges nevertheless account for a residual assimilation, but the effect in estimated brightness is smaller for Fig 11B than for Fig 11A. Therefore, the display with the smaller bars (Fig 11B) has a higher predicted brightness difference between them, because less activity from the vertical edges "mixes" with that from the horizontal edges during filling-in.

Despite the presence of stripes (but with low contrast), the display of Fig 11C shows a clear contrast effect of the bars (cf. Fig 11C) according to Scenario 2. We also studied the relation between the target size and the brightness estimation. We observed that the predicted brightness of the bars could be modified as a function of bar height and spatial frequency of the background (Fig 11D). Specifically, the predicted brightness difference between the bars increases both with decreasing bar height and with increasing spatial frequency.
These model predictions are thus in agreement with previous studies [10].

test patch occluded by the white squares appears to be brighter than the other. This illusion was originally explained in terms of T-junctions [26]. However, [23] showed that the effect persisted without T-junctions. Later, [73] studied different variations of Todorovic's Illusion (labeled Context A and Context C in Fig 12). They found that the target size interacted with the strength of assimilation. The original effect can be reversed according to Context C (Fig 12), which resembles looking through a window cross. It can also be abolished by moving the disk into the foreground (Fig 12, Context

The predictions of our model generalize well to further assimilation displays. Fig 14 shows the Dungeon illusion [25], the Checkerboard illusion [8] and Shevell's Ring [74].

Our model predicted the COCE, but the explanation is more intricate. At first sight, being a filling-in type effect, it should be predicted in a straightforward way by our model. We verified that without gain control (Eq 4), the COCE cannot be predicted (data not shown). Could it then be that the effect is produced by the low-pass filtering which is applied after the gain control mechanism (see method section)? This is not the case, since the removal of low-pass filtering did not affect the prediction of the COCE (see profile plot 2 in Fig 15A). Therefore, the gain control mechanism contributes to producing the effect. Indeed, the luminance gradients cause negative values (indicated by black lines) in the Gain Control Map around the edges (cf. Fig 15A). As a consequence, activity corresponding to the luminance gradients is suppressed by the gain control mechanism, which furthermore reduces the peak activity at the edges. In effect, the gradients are "ignored", and our model generates the COCE as a result of assimilation of the edges.
This explanation is also consistent with the cow-skin illusion, which is a variant of the COCE without luminance gradients. It is composed exclusively of adjacent black and white lines, and the empty regions are randomly arranged.

retinal ganglion cells as the principal mechanism [75]: assume a circular receptive field with an excitatory center which has the same width as the white grid lines. The inhibitory surround covers in addition the black squares. If the center is located right at an intersection, it receives more inhibition from the surround (from two white lines) than when the center is positioned between two intersections (inhibition from one line). This translates to a brightness reduction at the intersections, but not in between. This mechanism, though, is insufficient to explain why the effect is considerably reduced (or even removed) if the bars are slightly corrugated (Fig 16A, bottom). It is also reduced as a function of the ratio between grid line width and block width, where no effect is produced for a ratio of one [76].

Control Map. In this way, an assimilation effect is induced. As to the corrugated HG, the corners also represent the redundant patterns, but because the spatial structures are less regular, the inhibition is correspondingly weaker (compare the Gain Control Maps shown in Fig 16B). Consequently, the brightness reduction at the intersections is considerably weaker for the corrugated HG.
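The classical lateral-inhibition account of the Hermann grid described above (more surround inhibition at an intersection than between intersections) can be checked with a small numerical sketch. It illustrates that textbook account only, not our model. The grid geometry, the use of square instead of circular windows, and the helper names `hermann_grid` and `center_surround_response` are assumptions chosen for simplicity.

```python
import numpy as np

def hermann_grid(n_blocks=3, block=25, line=5):
    """White grid lines (1.0) on black blocks (0.0)."""
    period = block + line
    idx = np.arange(n_blocks * period)
    on_line = (idx % period) < line                 # pixels belonging to a grid line
    return (on_line[:, None] | on_line[None, :]).astype(float)

def center_surround_response(img, row, col, center=5, surround=15):
    """Mean luminance in the center window minus mean in the surround window.

    Assumption: square windows with odd widths stand in for a circular
    receptive field; the center width equals the grid line width.
    """
    def window_mean(width):
        h = width // 2
        return img[row - h:row + h + 1, col - h:col + h + 1].mean()
    return window_mean(center) - window_mean(surround)

grid = hermann_grid()
# (32, 32) is the middle of an intersection; (32, 47) lies on the same
# horizontal line, midway between two intersections.
r_intersection = center_surround_response(grid, 32, 32)
r_street = center_surround_response(grid, 32, 47)
```

With these settings the surround window contains two crossing lines at the intersection but only one line between intersections, so `r_intersection` is smaller than `r_street`, in line with the darkening at intersections predicted by this account.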

Fig 16D shows the dependence of the darkening effect on the ratio between grid line width and block width. In agreement with the results from [76], we find that the darkening effect decreases as the ratio approaches one. We also wish to note that we were unable to predict further results with the HG that were presented in [76].

Luminance Staircase and Pyramid (Chevreul's illusion)

Chevreul's illusion consists of increasing levels of luminance, arranged as a staircase or as a pyramid. Although luminance is constant at each step, one perceives an illusory brightening on the side of each step where the adjacent step is darker, and an illusory darkening on the other. In the pyramid version, one perceives in addition (illusory) glowing diagonals (Fig 17). The effect is absent at the lowest (black) and highest (white) luminance level, and is considerably reduced on the middle step for a staircase made up of three steps.

All aspects of Chevreul's illusion are consistently predicted by the gradient system, which is a computational model for representing luminance gradients [77,78]. The idea behind gradient representations is to capture the smooth variations of luminance (illumination effects), in order to help to disentangle reflectance from the illumination component in luminance (since luminance is the product of reflectance with illumination).

Brightness predictions from our model for the luminance staircase and the pyramid are shown in Fig 17. The illusory whitening and darkening at the stairs can be explained according to Scenario 3: On the one hand, the gain control map increases the activity of the Contrast-Luminance channel. On the other hand, the increase in excitation is offset by the normalization mechanism, thereby producing non-uniform brightness activity at the stairs. The glowing diagonals of the Pyramid Illusion are produced according to Scenario 1, where the activity of non-redundant spatial patterns, especially at the corners, is enhanced. On the other hand, the edges of the staircase represent a redundant pattern, the activity of which is decreased. Consequently, more (less) contrast at the corners (at edges) is generated in the brightness estimation (Fig 17C, bottom). Finally, it is essential to highlight a limitation of our model in this context. We observed that for a large number of steps (i.e., very narrow steps) the dynamic filter "collapses" and the model could no longer predict the illusion or the glowing diagonals (data not shown). This is a consequence of the scale-sensitivity of the dynamic filter (i.e., the size of the sampling patches), since with decreasing step size, the staircase eventually approaches a linear luminance gradient, and the filter can no longer resolve individual steps.

Mach Bands are illusory glowing stripes that are perceived adjacent to knee points that are connected with a luminance ramp, where the bright (dark) band is attached to the plateau with high (low) luminance (Fig 18). Notice that Mach bands do not cause Chevreul's illusion. The perceived strength of Mach bands decreases when the ramp gets steeper and eventually approaches a luminance step. Also, for very shallow ramps, the perceived strength decreases. The perceived strength thus has a maximum at intermediate ramp widths [79,80]. The textbook explanation based on lateral inhibition is insufficient to explain the variation of strength with ramp width: it would wrongly predict maximum perceived strength at a luminance step [81,83,84]. The perceived strength of Mach bands is also modulated by the proximity, contrast and sharpness of an adjacently placed stimulus [81,84].

wave. The gradient system furthermore suggests that bright Mach bands are key for the perception of light-emitting surfaces [78]. Our model predicted the Mach bands, as well as their absence at steps (see profile plots in Fig 18). It furthermore succeeds in predicting the inverted-U curve of the perceived strength of Mach Bands as a function of the ramp width (Fig 18C). The inverted-U curve could be explained by two mechanisms which act in opposite ways. (i) If the ramp width decreases, then the activity at the knee points reaches a maximum that renders the normalization mechanism (denominator of Eq 4) of the gain control mechanism ineffective (Scenario 3; ideal step luminance). If the ramp width increases, then the luminance transition between the plateaus is more gradual, which is associated with less activity at the edge locations. In this way, the edge activity becomes more susceptible to the gain control mechanism towards the maximum of the perceived strength.
However, this effect does not remain constant. After a certain ramp width, the activity across the ramp becomes comparable to the activity at the knee points, which produces less variability in the energy map E. In consequence, (ii) the dynamic filter has less effect in Eq 2, reducing gradually the

multi-scale filtering [9,60]), but also by filling-in models [86]. Notice that a common misconception with diffusion-based approaches (i.e., filling-in models) is that the illusory brightness modulation across the test field would average out. This, however, is usually not the case. The exact explanation depends on the model under consideration. For instance, one mechanism that counteracts "averaging out" is the boundary webs from the boundary contour system (BCS), which extend across the test field and trap feature contour activity (FCS) [87]. Other mechanisms include cross-channel inhibition between brightness and darkness activity during filling-in [88].

19C). We also observed that, from a specific spatial frequency onward (4 cycles/image), the brightness modulation decreases with increasing separation and spatial frequency, respectively, of the inducer gratings (surface plot in Fig 19C). Unlike the rest of the illusions, the GI effect was produced mainly at the filling-in stage (Eq 5), and to a lesser degree by dynamic filtering. Dynamic filtering increased the activity at the boundaries of the inducer gratings (see gain control maps in Fig 19); this increment produces a significant contrast in estimated brightness between the inducer gratings and the test field, which eventually propagated (by filling-in) across the test field.
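Returning to the Mach band discussion above, the statement that a pure lateral-inhibition account predicts the strongest bands at a luminance step can be checked with a one-dimensional sketch. It is not part of the proposed model. Lateral inhibition is approximated here by a difference-of-Gaussians (DoG) kernel, and the widths `sigma_c` and `sigma_s`, the profile length, and the function names are assumptions made for illustration.

```python
import numpy as np

def ramp_profile(width, n=400):
    """Two plateaus (0 and 1) joined by a linear ramp; width 0 gives a step."""
    x = np.arange(n, dtype=float)
    if width == 0:
        return (x >= n / 2).astype(float)
    return np.clip((x - (n - width) / 2) / width, 0.0, 1.0)

def dog_response(profile, sigma_c=2.0, sigma_s=6.0):
    """Center-minus-surround filtering with a zero-sum DoG kernel."""
    half = int(4 * sigma_s)
    x = np.arange(-half, half + 1)
    center = np.exp(-x**2 / (2 * sigma_c**2))
    surround = np.exp(-x**2 / (2 * sigma_s**2))
    kernel = center / center.sum() - surround / surround.sum()
    # Replicate the plateau values at the borders to avoid spurious edges.
    padded = np.pad(profile, half, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

def overshoot(width):
    """Peak response on the bright side, a proxy for bright-band strength."""
    return dog_response(ramp_profile(width)).max()

for width in (0, 20, 80):
    print(width, overshoot(width))
```

In this sketch the overshoot is largest for the step and shrinks as the ramp widens, so the filter alone does not reproduce the inverted-U dependence on ramp width reported in [79,80].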

Real World Images and Noise

Although synthetic images are a valuable tool for the study of certain aspects of the visual system, the visual system nevertheless evolved to process real-world images. Real-world images provide, therefore, a test of robustness for any model of the visual system.

redundancy of the edges would not be affected by spatially uncorrelated noise. Finally, we note that dynamic filtering has a couple of limitations with respect to spatially correlated noise, such as band-pass-limited additive noise (see Discussion). We did not study this issue in more depth, as it would go beyond the scope of the present paper.

The perceived luminance (brightness) of a target structure is highly sensitive to its spatial context. Despite many modeling attempts for brightness, we still have not arrived at a detailed understanding of the corresponding neuronal information processing principles. With our model, we emphasize the role of predictive coding in brightness perception. Predictive coding for us simply means "what can be predicted will be suppressed", and our dynamic filtering aims at reducing redundancy in boundary maps. Coding strategies that aim at reducing activity and thus energy expenditure in organisms are consistent, for example, with efficient coding [35-37, 90, 91], predictive coding [38], whitening [90] or response equalization [40]. In this sense, we propose that brightness perception is the consequence of suppressing redundant (i.e., predictable) information. Our model is built on the latter idea(s), and apart from being able to process real-world images, it predicts a larger set of visual illusions than any other previously published model.

Our focus is thereby on low-level vision, as our model simulates the activity of simple and complex cells of the primary visual cortex. For each input, the model learns a filter kernel by identifying redundant patterns in (simulated) complex cell responses (i.e., the edge map), and subsequently uses the filter kernel to suppress redundant information (dynamic filtering). Dynamic filtering amounts to response equalization of simulated complex cell responses, much like the previously proposed "Whitening-by-Diffusion" method which directly acts on the (Fourier) amplitude spectrum [40]. The equalized responses are subsequently used for creating a representation of the sensory input by filling-in (brightness estimation). Nevertheless, dynamic filtering is a global mechanism, which was adopted for ease of implementation. In the primary visual cortex, we expect that dynamic filtering acts in a more local fashion, but still on a spatial scale that exceeds the typical receptive field sizes of V1 neurons. Such non-local mechanisms could be biologically implemented by feedback from mid-level visual neurons with sufficiently large receptive fields for detecting non-local correlations in activity.

We believe that our success in predicting a relatively large number of visual illusions lends some support to our proposed computational principle.

proposals and theories [4,22,28], which purport that a stimulus is divided into perceptual frameworks based on anchors [4] or T-junctions [22]. However, it is not clear whether anchors or T-junctions are sufficiently robust cues in real-world images, and actually few previously published models demonstrated the processing of real-world images.

Dynamic filtering is sensitive to the correlation structure of spatial patterns in order to generate contrast and assimilation effects. In this way, the output of our model would not be significantly affected if uncorrelated noise was added to the input. Yet multi-scale models are highly sensitive to additive noise [9-11], because their predictions depend on a careful re-adjustment of filter responses according to the contrast-sensitivity function [8]. Thus, if noise was added to contrast and assimilation displays, then the corresponding predictions would be altered, because of corresponding changes in the spatial frequency spectrum [12].

Our model adapts to the statistical structure of each input image. This is to say that we do not evaluate each input image in a previously learned long-term statistical context. A long-term statistical context usually is learned from a large number of input samples in order to derive feature-specific probability distributions. In connection with brightness, a relationship between the occurrence frequency of certain types of natural images and brightness perception has been proposed [31-33]. The main limitation of such models is that they require an enormous amount of data, and that visual illusions act much like an associative trigger or they are perceived according to humdrum Bayesian inference.

In conclusion, this study provides a proof of concept of a hypothetical information processing strategy for the visual system, based on economizing edge representations. Our predictions rely on the self-structure of the visual input, but not on accumulated visual experience, spatial frequency representations, or predefined detectors. Our proposed mechanism does not exclude information processing principles like accumulating visual experience or spatial frequency representations, and should be considered as being complementary to these. Finally, future work should address how the statistical structure of the context surrounding a target patch influences its appearance. We also plan to study how different noise structures (such as narrow-band, oriented, or correlated noise) influence the predictions of our model. Our predictive coding hypothesis should be compatible with all levels of information processing. This means that redundancy reduction might well apply to higher-order patterns and shapes that form the primitives for object recognition.

The ON and OFF sub-fields of a Gabor filter g are normalized independently via:

where $\|\cdot\|$ indicates the Frobenius norm, and $g_{ON}$ and $g_{OFF}$ indicate the ON and OFF region, respectively, of filter g. The parameter α controls the sensitivity to luminance and is fixed at 0.1 (note that for α = 0 the neuron would not respond to homogeneous regions of luminance).

S2 Appendix. Energy Map. Firstly, the complex cells were computed directly from the activity of the contrast-channel using the local energy model [42,43,51,52], defined as:

$$C_g(x, y) = R_{g,\mathrm{odd}}(x, y)^2 + R_{g,\mathrm{even}}(x, y)^2 \qquad (9)$$

where $R_{g,\mathrm{odd}}$ and $R_{g,\mathrm{even}}$ indicate a pair of Gabor filters with identical orientation and spatial frequency, but different phase ($R_{g,\mathrm{even}}$ with ρ = 0, and $R_{g,\mathrm{odd}}$ with ρ = π/2).
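The local energy computation can be sketched as follows. The code follows Eq 9 as printed (sum of the squared even and odd responses); some formulations of local energy additionally take a square root. The filter parameters (`wavelength`, `sigma`), the circular FFT-based convolution, and the function names are assumptions made for illustration, and the sketch operates on a generic input image rather than on the contrast channel of the model.

```python
import numpy as np

def gabor_pair(n=64, wavelength=8.0, sigma=4.0):
    """Even (rho = 0) and odd (rho = pi/2) Gabor filters, vertical orientation."""
    half = n // 2
    y, x = np.mgrid[-half:half, -half:half]
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    even = envelope * np.cos(2 * np.pi * x / wavelength)
    odd = envelope * np.cos(2 * np.pi * x / wavelength + np.pi / 2)
    return even, odd

def circ_conv(img, kernel):
    """Circular convolution via the FFT (kernel origin moved to index 0,0)."""
    k = np.fft.ifftshift(kernel)
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(k)))

def local_energy(img, even, odd):
    """Sum of squared quadrature responses, as in Eq 9."""
    return circ_conv(img, even) ** 2 + circ_conv(img, odd) ** 2

n = 64
even, odd = gabor_pair(n)
# Zero-mean vertical grating at the preferred wavelength, arbitrary phase.
grating = np.cos(2 * np.pi * np.arange(n) / 8.0 + 0.7)[None, :] * np.ones((n, 1))
energy = local_energy(grating, even, odd)
```

For a grating at the preferred frequency, the energy is practically constant across the image although the individual even and odd responses oscillate, which is the phase invariance associated with complex cells.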

Finally, the local energy map E was computed as:

S3 Appendix. Dynamic filtering with zero-phase whitening (ZCA). Initially, we sub-sampled the energy map E to half of its original size:

$$E_2(x, y) = \frac{E(x, y) + E(x + 1, y) + E(x, y + 1) + E(x + 1, y + 1)}{4} \qquad (11)$$

where $x, y \in \{1, 3, 5, \ldots, n - 1\}$ are spatial indices. Because ZCA decorrelates intensity variations at the pixel level, the two-fold reduction in spatial scale has three advantages. First, the computational cost of ZCA is reduced. Second, the sensitivity to high spatial frequencies (and thus noise) is reduced. Third, because the edges on the energy map typically span more than one pixel in width, the scale reduction retains the intensity variations between different edges, while reducing the intensity variation along the edges. This leads to a less variable and thus improved contour map, which in turn facilitates the decorrelation between edges when applying the ZCA method.
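The sub-sampling of Eq 11 amounts to averaging non-overlapping 2 × 2 blocks. A minimal sketch is given below; the function name `subsample_by_two` is hypothetical, and cropping odd-sized maps to an even size is an assumption that is not stated in the text.

```python
import numpy as np

def subsample_by_two(E):
    """Average non-overlapping 2 x 2 blocks of E (Eq 11)."""
    h, w = E.shape
    E = E[:h - h % 2, :w - w % 2]           # assumption: crop to even size
    blocks = E.reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

E = np.arange(16.0).reshape(4, 4)
E2 = subsample_by_two(E)
```

Each output pixel is the mean of the four input pixels of its block, so the map is halved along both dimensions.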

Next, a set of 10000 patches of size 17 × 17 pixels was extracted randomly from the sub-sampled energy map $E_2$. To be computationally tractable, the set was normalized (subtracting the mean and dividing by the standard deviation) and cast into a matrix X of dimension 17² × 10000 such that each column of the matrix represents one patch.
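The patch sampling step can be sketched as follows. The function names are hypothetical, and two details are assumptions because the text does not specify them: patches are drawn uniformly at random with replacement, and normalization uses the global mean and standard deviation of the whole patch set.

```python
import numpy as np

def extract_patches(E2, n_patches=10000, size=17, seed=0):
    """Return a (size*size, n_patches) matrix with one random patch per column."""
    rng = np.random.default_rng(seed)
    h, w = E2.shape
    tops = rng.integers(0, h - size + 1, n_patches)
    lefts = rng.integers(0, w - size + 1, n_patches)
    columns = [E2[t:t + size, l:l + size].ravel() for t, l in zip(tops, lefts)]
    return np.stack(columns, axis=1)

def normalize(X):
    """Assumption: subtract the global mean, divide by the global std."""
    return (X - X.mean()) / X.std()

# Stand-in for a sub-sampled energy map (random values, for illustration only).
E2_demo = np.random.default_rng(1).random((64, 64))
X = normalize(extract_patches(E2_demo, n_patches=500))
```

With the values from the text (10000 patches of 17 × 17 pixels), X has dimension 289 × 10000.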

The zero-phase whitening (ZCA) transformation [55] consists in finding a symmetric matrix W of dimension 17² × 17² such that, after applying it to X, the spatial correlations between the patches are eliminated (i.e., the covariance matrix after transformation is equal to the identity matrix). Then, the columns of W form a basis in which the patches are decorrelated. Because W is symmetric, the columns of W are identical up to a cyclic change in rows (i.e., the columns of W differ only by shifting their values cyclically). This property permits selecting any column of W, centering it, and reshaping it to build the kernel for the dynamic filter (see further down). In order to

where I represents the identity matrix and cov indicates the covariance matrix. Because W is symmetric (i.e., $W^T = W$), the last equation can be solved, after some linear algebra, by

$$W = (X^T X)^{-1/2} = \Sigma^{-1/2} \qquad (13)$$

where Σ = cov(X) (i.e., Σ represents the covariance matrix of X).
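The unregularized solution $W = \Sigma^{-1/2}$ can be sketched with an eigendecomposition of the covariance matrix. The function name `zca_matrix` and the small synthetic data set are assumptions for illustration. The covariance is computed with `np.cov`, treating each row of X as one pixel position and each column as one patch, which yields a matrix with one row and column per pixel position.

```python
import numpy as np

def zca_matrix(X):
    """Symmetric whitening matrix W = Sigma^(-1/2) for X of shape (d, N)."""
    sigma = np.cov(X)                        # (d, d) covariance across columns
    evals, evecs = np.linalg.eigh(sigma)     # sigma is symmetric
    return evecs @ np.diag(evals ** -0.5) @ evecs.T

# Synthetic correlated data: d = 9 "pixels", N = 5000 "patches".
rng = np.random.default_rng(0)
d = 9
mixing = np.eye(d) + 0.3 * np.ones((d, d))   # well-conditioned mixing matrix
X_demo = mixing @ rng.normal(size=(d, 5000))

W = zca_matrix(X_demo)
cov_after = np.cov(W @ X_demo)
```

After the transformation the sample covariance equals the identity matrix up to numerical precision, and W is symmetric, as required by the definition in the text. The inverse square root becomes unstable when Σ has very small eigenvalues.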

However, Eq 13 is very sensitive to high frequencies and isolated points (usually noise). This issue can be alleviated by introducing a regularization parameter and/or compression. We thus express the covariance matrix Σ by singular value decomposition $\Sigma = U S V^T$ (using Matlab's svd function) and add a regularization parameter according to $\Sigma = U (S + \epsilon I) V^T$, where ε was set to 0.01 * max(S), I is the identity matrix, and S is the matrix with singular values along its diagonal. In addition we reduce the

x,y |F (x,y)| to be computationally tractable.
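The regularized whitening matrix can be sketched in NumPy as a stand-in for the Matlab implementation mentioned above. The function names, the 3 × 3 demonstration patch size, and the synthetic data are assumptions. The kernel extraction follows the description in this appendix (select a column of W, center it, reshape it), where "centering" is interpreted here as subtracting the mean, which is also an assumption.

```python
import numpy as np

def regularized_zca(sigma, rel_eps=0.01):
    """W = U (S + eps I)^(-1/2) U^T with eps = rel_eps * max(S)."""
    U, S, _ = np.linalg.svd(sigma)           # sigma symmetric PSD: V equals U
    eps = rel_eps * S.max()
    return U @ np.diag((S + eps) ** -0.5) @ U.T

def kernel_from_column(W, size):
    """Assumption: take the middle column, subtract its mean, reshape."""
    column = W[:, W.shape[1] // 2]
    return (column - column.mean()).reshape(size, size)

# Synthetic correlated data for 3 x 3 patches (d = 9), for illustration only.
rng = np.random.default_rng(0)
size = 3
d = size * size
X_demo = (np.eye(d) + 0.3 * np.ones((d, d))) @ rng.normal(size=(d, 5000))

sigma = np.cov(X_demo)
W_reg = regularized_zca(sigma)
kernel = kernel_from_column(W_reg, size)
```

With regularization the whitened variances are S / (S + ε) and therefore stay below one, with the smallest singular values attenuated most. This limits the amplification of noisy, low-variance components.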