Modeling Bottom-Up and Top-Down Attention with a Neurodynamic Model of V1
arXiv:1904.02741v1 [q-bio.NC] 4 Apr 2019

Previous studies in this line suggested that lateral interactions of V1 cells are responsible, among other visual effects, for bottom-up visual attention (alternatively named visual salience or saliency). Our objective is to mimic these connections in the visual system with a neurodynamic network of firing-rate neurons. Early subcortical processes (i.e. retinal and thalamic) are functionally simulated. An implementation of the cortical magnification function is included to define the retinotopic projections towards V1, processing neuronal activity for each distinct view during scene observation. Novel computational definitions of top-down inhibition (in terms of inhibition of return and selection mechanisms) are also proposed to predict attention in free-viewing and visual search conditions. Results show that our model outperforms other biologically-inspired models of saliency prediction as well as models that predict visual saccade sequences during free viewing. We also show how the temporal and spatial characteristics of inhibition of return can improve the prediction of saccades, and how distinct search strategies (in terms of feature-selective or category-specific inhibition) predict attention in distinct image contexts.

Author summary

Saliency maps are representations of how certain visual regions attract attention in a visual scene, and these can be measured with eye movements. A myriad of computational models with artificial and biological inspiration have been able to acquire outstanding predictions of human fixations. However, most of these models have been built specifically for visual saliency, a characteristic that denies their biological plausibility for modeling distinct visual processing mechanisms, or other visual processes, simultaneously.
In addition to saliency, our approach also works efficiently for other tasks (without applying any type of training or optimization and keeping the same parametrization), such as Visual Search, Visual Discomfort [1], Brightness [2] and Color Induction [3]. By performing simulations of human physiology and its mechanisms, we propose to build a unified model that could be extended to predict and understand distinct perceptual processes for which V1 is responsible.


Introduction
The human visual system (HVS) structure has evolved in a way to efficiently discriminate redundant information [4][5][6]. In order to filter or select the information to be processed in higher areas of visual processing in the brain, the HVS guides eye movements towards regions that appear to be visually conspicuous or distinct in the scene. This phenomenon was observed during visual search tasks [7,8], where detecting early visual features (such as orientation, color or size) was done in parallel (pre-attentively) or required a serial "binding" step, depending on scene context. Koch & Ullman [9] came up with the hypothesis that the neuronal mechanisms involved in selective visual attention generate a unique "master" map from visual scenes, coined with the term "saliency map". From that, Itti, Koch & Niebur [10] presented a computational implementation of the aforementioned framework (IKN), inspired by the early mechanisms of the HVS. It was done by extracting properties of the image as feature maps (using a pyramid of difference-of-gaussians filters at distinct orientations, colors and intensities), obtaining feature-wise conspicuity by computing center-surround differences as receptive field responses, and integrating them on a unique map using winner-take-all mechanisms. Such a framework served as a starting point for saliency modeling [11,12], which derived in a myriad of computational models that differed in their computations but conserved a similar pipeline. From a biological perspective, further hypotheses suggested that the primate visual system structure was mainly connected to the efficient coding principle. Later studies considered that maximizing the information of scenes was the key factor in forming visual feature representations. To test that, Bruce & Tsotsos [13] implemented a saliency model (AIM) by extracting sparse representations of image statistics (using independent component analysis). These
representations were found to be remarkably similar to the responses of cells in V1, which share spatial properties with Gabor filters [14].
While the current concept of a saliency map is to predict probabilities of specific spatial locations as candidates for eye movements, it is also crucial to understand how to predict individual fixations or saccade sequences (also named "scanpaths"). Scanpath predictions were formerly done through probabilistic measures of saccade amplitude statistics. These followed a heavy-tailed distribution similar to a Cauchy-Levy one (in reference to random walks or "Levy flights", minimizing global uncertainty) [15], with the highest probability of fixations at low saccade amplitudes. This procedure was implemented in Boccignone & Ferraro's model [16], taking saliency from IKN. Later, LeMeur & Liu [17] proposed a more biologically plausible approach, accounting for oculomotor biases and inhibition of return effects. It used a graph-based saliency model (GBVS, also inspired by IKN) [18], with a higher probability to catch grouped fixations (which tend to be in the stimulus center).
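The heavy-tailed amplitude sampling described above can be sketched as follows. This is a minimal illustration, not any of the cited implementations; the Cauchy location and scale parameters are arbitrary placeholders rather than fitted values:

```python
import math
import random

def sample_saccade_amplitudes(n, location=2.0, scale=1.5, max_amp=30.0, seed=0):
    """Sample saccade amplitudes (deg) from a truncated Cauchy distribution.

    Heavy-tailed sampling favors short saccades with occasional long
    "Levy flights"; location/scale here are illustrative, not fitted.
    """
    rng = random.Random(seed)
    amplitudes = []
    while len(amplitudes) < n:
        # Inverse-CDF sampling of a Cauchy variate.
        u = rng.random()
        amp = location + scale * math.tan(math.pi * (u - 0.5))
        if 0.0 < amp <= max_amp:  # keep physically plausible amplitudes
            amplitudes.append(amp)
    return amplitudes

amps = sample_saccade_amplitudes(1000)
```

Most sampled amplitudes cluster near the location parameter, while the heavy tail occasionally produces long exploratory saccades.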
In order to evaluate model predictions with eye movement data, certain patterns underlying human eye movement behavior need to be accounted for in a more detailed description and analysis of visual attention. These effects are found to be dependent on context, discriminability, temporality, task and memory during scene viewing and visual search [19,20]. Attention and spatial selection, therefore, are also dependent on the neuronal activations from both bottom-up and top-down mechanisms. These processes are known to compete [21] to form a unique representation, termed the priority map [22]. These hypotheses suggest that attention is separated into distinct stages (pre-attentive as bottom-up and attentive as top-down) and that contributions towards guiding eye movements are simultaneously affected by distinct mechanisms in the HVS [23]. This competition for visual priority is biased by a term called relevance (as opposed to saliency), where top-down attention is driven by task demands, working and semantic memory, as well as episodic memory, emotion and motivation (the latter three of which seem to be unique for each individual and moment) [24]. To that end, it is stated [25,26] that visual selection relies on activations from higher-level layers towards lower-level receptive fields, modeling attention towards spatio-temporal regions of interest using top-down instructions.

Objectives
Initial hypotheses by Li [27,28] suggested that visual saliency is processed by the lateral interactions of V1 cells. In their work, pyramidal cells and interneurons in the primary visual cortex (V1, Brodmann Area 17 or striate cortex) and their horizontal intracortical connections are seen to modulate activity in V1. Li's neurodynamic model [29] of excitatory and inhibitory firing-rate neurons was able to determine how contextual influences of visual scenes contribute to the formation of saliency. In this model, interactions between neurons tuned to specific orientation sensitivities served as predictors of pop-out effects and search asymmetries [30]. Li's neurodynamic model was later extended by Penacchio et al. [2], proposing the aforementioned lateral interactions to also be responsible for brightness induction mechanisms. By considering neuron orientation selectivity at distinct spatial scales, this model can act as a contrast enhancement mechanism of a particular visual area depending on the induced activity from surrounding regions. Recent work from Berga & Otazu [31] has shown that the same model (without changing its parametrization) is able to predict saliency using real and synthetic color images. We propose to extend the model by providing saliency computations with foveation, concerning distinct viewpoints during scene observation (mapping retinal projections towards V1 retinotopy), as a main hypothesis for predicting visual scanpaths. Furthermore, we also test how the model is able to provide predictions considering recurrent feedback mechanisms of already visited regions, as well as visual feature and exemplar search tasks with top-down inhibition mechanisms.

A unified model of V1 predicts several perceptual processes
Here we present a novel neurodynamic model of visual attention, and we remark its biological plausibility as being able to simultaneously reproduce other effects such as Brightness Induction [2], Chromatic Induction [3] and Visual Discomfort [1]. Brightness and chromatic induction stand for the variation of perceived luminance and color of a visual target depending on the luminance and/or chromatic properties of its surrounding area. Thus, a visual target can be perceived as being different from (contrast) or similar to (assimilation) its physical properties by varying its surrounding context. With the simulations of our model, the output of V1's neuronal activity (coded as firing rates), after several cycles of excitatory-inhibitory V1 interneuron interactions, is used as a predictor of induction and saliency respectively. These responses act as a contrast enhancement mechanism which, for the case of saliency, is integrated towards projections in the superior colliculus (SC) for eye movement control. Therewith, our model has also been able to reproduce visual discomfort, as the relative contrast energy of a particular region in a scene is found to produce hyperexcitability in V1 [32,33], one of the possible causes of certain conditions such as malaise, nausea or even migraine. Previous neurodynamic [34][35][36][37][38] and saliency models [11,12] are able to reproduce attention processes and predict eye movements [39], but are presented to work uniquely for that specific task. On behalf of the model's biological plausibility regarding V1 function and its computations, we present a unified model of lateral connections in V1, able to predict attention from real and synthetic color images while mimicking the physiological properties of the neural circuitry stated previously.

Model Retinal and LGN responses
The HVS perceives light at distinct wavelengths of the visual spectrum and separates them into distinct channels for further processing in the cortex. First, retinal photoreceptors (RP, corresponding to rod and cone cells) are photosensitive to luminance (rhodopsin-pigmented) and color (photopsin-pigmented) [40,41]. Mammal cone cells are photosensitive to distinct wavelengths in a range of ∼400-700 nm, corresponding to three cell types, measured to be maximally responsive to Long (L, λmax ≈ 560 nm), Medium (M, λmax ≈ 530 nm) and Short (S, λmax ≈ 430 nm) wavelengths respectively [42]. RP signals are received by retinal ganglion cells (RGC), forming an opponent process [43]. This opponent process allows modeling midget, bistratified and parasol cells as "Red vs Green", "Blue vs Yellow", and "Light vs Dark" channels. In order to simulate these chromatic and light intensity opponencies using digital images, we transformed the RGB color space to the CIELAB (Lab or L*a*b*) space (including a gamma correction of γRGB = 1/2.2), as exemplified in Figure 1. The L*, a* and b* channels form a cubic color space [44] with RGB opponencies (+L = lighter, −L = darker, +a = reddish, −a = greenish, +b = yellowish and −b = blueish). We modeled V1's simple cell responses with a 2D "à trous" wavelet transform [45]. Discrete wavelet transforms allow processing signals by extracting information of orientation and scale-dependent features in the visual space, filtering each of the aforementioned opponencies. By building feature maps of orientation sensitivities at distinct spatial frequencies, it is possible to represent V1 RF input activity (which we applied separately to each pathway of the LGN), shown in Figure 2. The "à trous" transform is undecimated and allows performing a transform whose basis functions remain similar to Gabor filters. The "à trous" wavelet transform can be defined as:

c_{s,θ} = c_{s−1} ∗ h_{s,θ}    (1)
ω_{s,θ} = c_{s−1} − c_{s,θ}    (2)

where h_{s,θ} denotes the wavelet filter dilated at scale s and oriented towards θ. By transposing the wavelet filter (h_s, expressed in Fig.
2) and dilating it at distinct spatial scales (s = 1..S), we can obtain a set of wavelet approximation planes (c_{s,θ}), which are combined for calculating wavelet coefficients (ω_{s,θ}) at distinct orientation selectivities (θ = h, v, d). From these equations, three orientation selectivities can be extracted, corresponding to horizontal (θ_h ∈ {0 ± 30 || 180 ± 30}º), vertical (θ_v ∈ {90 ± 30 || 270 ± 30}º) and diagonal (θ_d ∈ {45 ± 15 || 135 ± 15 || 225 ± 15 || 315 ± 15}º) angles. For the case of scale features, sensitivities to size (in degrees of visual angle) correspond to s_0·2^{(s−1)}/{pxva}, where "pxva" is the number of pixels for each degree of visual angle according to experimentation, and s_0 = 8 is the minimum size of the wavelet filter (h_0), defining the frequency sensitivity of the first scale. The initial plane c_0 = I_o is obtained from the CIE L*a*b* components, and c_n corresponds to the residual plane of the last wavelet component (e.g. s = n). The image inverse (I_o) can be obtained by integrating the wavelet (ω_{s,θ}) and residual (c_n) planes:

I_o = Σ_{s=1}^{S} Σ_θ ω_{s,θ} + c_n    (4)
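The decomposition and its exact reconstruction can be sketched in one dimension. This is a simplified illustration that omits the 2D oriented filtering of the actual model; the B3-spline-like kernel for h is an assumption:

```python
def a_trous_1d(signal, n_scales=3, h=(1/16, 4/16, 6/16, 4/16, 1/16)):
    """1D undecimated 'a trous' transform (sketch; the paper's model uses
    2D oriented filters, which this simplification omits).

    Returns wavelet planes w[s] = c[s-1] - c[s] and the residual c[n]."""
    c = list(signal)
    planes = []
    for s in range(1, n_scales + 1):
        step = 2 ** (s - 1)  # dilate the filter by inserting "holes"
        smoothed = []
        for i in range(len(c)):
            acc = 0.0
            for k, hk in enumerate(h):
                j = i + (k - len(h) // 2) * step
                j = min(max(j, 0), len(c) - 1)  # clamp at borders
                acc += hk * c[j]
            smoothed.append(acc)
        planes.append([a - b for a, b in zip(c, smoothed)])
        c = smoothed
    return planes, c

def inverse_a_trous(planes, residual):
    """Exact reconstruction: I = sum_s w_s + c_n (telescoping sum)."""
    out = list(residual)
    for w in planes:
        out = [o + wi for o, wi in zip(out, w)]
    return out

signal = [0.0] * 16 + [1.0] * 16
planes, residual = a_trous_1d(signal)
recon = inverse_a_trous(planes, residual)
```

Because each plane is a difference of consecutive approximations, summing all planes plus the residual recovers the input exactly, mirroring Equation 4.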

Cortical mapping
The human eye is composed of RP, but these are not homogeneously or equally distributed along the retina, contrarily to digital camera sensors. RP are distributed as a function of eccentricity with respect to the fovea (or central vision) [46]. The fovea is known to comprise ∼5 deg of diameter in the visual field, extended by the parafovea (∼5-9 deg), the perifovea (∼9-17 deg) and the macula (∼17 deg). Central vision is known to provide maximal resolution at ∼1 deg of the fovea, whereas in the periphery (∼60-180 deg) there is lower resolution for the retinotopic positions that are further away from the fovea. These effects are known to affect color, shape, grouping and motion perception of visual objects (even at a few degrees of eccentricity), making the performance of attentional mechanisms eccentricity-dependent [47]. Axons from the nasal retina project to the contralateral LGN, whereas the ones from the temporal retina are connected with the ipsilateral LGN. These projections [48] make the left visual field send inputs from the LGN towards the right hemifield of V1, and similarly for the case of the right visual field to the left hemifield of V1 (Figure 3-Right). We simulated this retinotopy with Schwartz's cortical magnification function [49] [28, Section 2.3.1], using 128 mm of simulated cortical surface (see an example in Figure 3-Left). The visual space is transformed to a cortically-magnified space (with its correspondence of millimeters for each degree of visual angle) with a logarithmic mapping function. The pixel-wise cartesian visual space is transformed to polar coordinates in terms of eccentricity and azimuth for a specific foveation instance, then transformed to coordinates in mm of cortical space. Acknowledging that the visual space for digital images is represented with either a squared or rectangular shape, we computed the continuation of cortical coordinates by symmetrically mirroring existing coordinates of the image with their correspondence of visual space outside boundaries in the cortical space. In that manner, we exclude possible effects
of zero-padding over recurrent processing while preserving 2D shapes for our feature representations. In this case, border effects were minimized by applying the inverse transform and repeating the same process at specific interaction cycles. Schwartz's mapping has been applied over the wavelet coefficients represented in Figure 2, as basis functions are convolved in the visual space and later magnified to the cortical space for representing V1 signals. These signals will serve as input to excitatory pyramidal cells, projected to their respective iso-orientation domains at distinct RF sizes.
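A minimal sketch of a Schwartz-style log-polar mapping and its inverse is given below; the constants a and k are illustrative placeholders, not the values used in the model:

```python
import cmath

def schwartz_mapping(ecc_deg, azimuth_rad, a=0.7, k=15.0):
    """Monopole log-polar map w = k*log(z + a) (Schwartz-style sketch).

    z encodes a retinal position (eccentricity in deg, azimuth in rad);
    the result is a cortical position in mm. a and k are illustrative."""
    z = ecc_deg * cmath.exp(1j * azimuth_rad)
    w = k * cmath.log(z + a)
    return w.real, w.imag  # mm along / across the cortical surface

def inverse_schwartz(x_mm, y_mm, a=0.7, k=15.0):
    """Invert the mapping to recover (eccentricity, azimuth)."""
    z = cmath.exp(complex(x_mm, y_mm) / k) - a
    return abs(z), cmath.phase(z)

x, y = schwartz_mapping(5.0, 0.3)
ecc, az = inverse_schwartz(x, y)
```

The forward map compresses peripheral eccentricities logarithmically (cortical magnification), while the inverse decodes cortical coordinates back to the visual space, as used for saccade decoding later in the paper.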

V1 Neuronal Dynamics
Li's hypotheses suggest that V1 computations are responsible for generating a bottom-up saliency map [27,28]. These hypotheses state that intracortical interactions between orientation-selective neurons in V1 are able to explain contextually-dependent perceptual effects present in pre-attentive vision [29,30,[50][51][52][53], relative to contour integration, visual segmentation, visual search asymmetries, figure-ground and border effects, among others. Pop-out effects that form the saliency map are believed to be the result of horizontal connections in V1, which interact with each other locally and reciprocally. These connections are formed by excitatory cells and inhibitory interneurons [54,55], processing information from pyramidal cell signals in layers of V1. The spatial organization of these cells accounts for selectivity in their orientation columns, their RF size and their axonal field localization. The aforementioned interactions between orientation-selective cells were defined by Li's model [29] of excitatory-inhibitory firing-rate neural dynamics, later extended by Penacchio et al. [2]. Here, contrast enhancement or suppression in neural responses emerges from lateral connections as an induction mechanism. The latest implementation, by Berga & Otazu [31] for saliency prediction, used color images, where chromatic (P-, K-) and luminance (M-) opponent channels were individually processed in order to compute the firing-rate dynamics of each pathway separately. With cortical magnification, each gaze can significantly vary contextual information and therefore the output of the model.
Our excitatory-inhibitory model is described in Table 1. Horizontal connections (lateral and reciprocal) are schematized in Figure 4 and Table 1C, where excitatory cells have self-directed (J_0) and monosynaptic (J) connections between each other, and are disynaptically connected through (W) inhibitory interneurons. Axonal field projections follow a concentric toroid of radius Δ_s = 15 × 2^{s−1} and radial distance Δ_θ (accounting for RF size d_s and radial distance β). Membrane potentials of excitatory (ẋ_{isθ}) and inhibitory (ẏ_{isθ}) cells are obtained with the partial derivative equations defined in Table 1D, composed of a chain of functions that consider firing rates (obtained by the piecewise linear functions g_x and g_y) and membrane potentials from previous membrane cycles (modulated by the α_x, α_y constants), current lateral connection potentials (J and W) and the spread of inhibitory activity within hypercolumns (ψ). Background inputs (I_noise and I_norm) correspond to simulated random noise and divisive normalization signals (i.e. accounting for local non-orientation-specific cortical normalization and nonlinearities). Figure 4, right: the model's intracortical excitatory-inhibitory interactions, membrane potentials (orange "ẋ" for excitatory and yellow "ẏ" for inhibitory) and connectivities ("J" for monosynaptic excitation and "W" for disynaptic inhibition).
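The excitatory-inhibitory firing-rate dynamics of Table 1D can be sketched as a pair of Euler-integrated equations. This is a toy two-unit version with illustrative constants and piecewise-linear transfer functions; it omits the orientation-specific spread ψ and the noise/normalization inputs of the full model:

```python
def gx(x, T=1.0, gain=1.0, cap=2.0):
    """Piecewise-linear firing rate for excitatory cells (threshold T)."""
    return max(0.0, min(cap, gain * (x - T)))

def gy(y, gain=2.0):
    """Piecewise-linear firing rate for inhibitory interneurons."""
    return max(0.0, gain * y)

def step(x, y, I_ext, J0=0.8, J=None, W=None, ax=1.0, ay=1.0, dt=0.1):
    """One Euler step of a minimal E-I firing-rate pair per unit.

    x, y, I_ext: lists over units; J, W: lateral weight matrices.
    Constants are illustrative, not the paper's calibrated values."""
    n = len(x)
    J = J or [[0.0] * n for _ in range(n)]
    W = W or [[0.0] * n for _ in range(n)]
    nx, ny = [], []
    for i in range(n):
        lat_e = sum(J[i][j] * gx(x[j]) for j in range(n) if j != i)
        lat_i = sum(W[i][j] * gx(x[j]) for j in range(n) if j != i)
        # Excitatory cell: leak, inhibition, self/lateral excitation, input.
        dx = -ax * x[i] - gy(y[i]) + J0 * gx(x[i]) + lat_e + I_ext[i]
        # Inhibitory interneuron: leak plus excitatory drive.
        dy = -ay * y[i] + gx(x[i]) + lat_i
        nx.append(x[i] + dt * dx)
        ny.append(y[i] + dt * dy)
    return nx, ny

x, y = [0.0, 0.0], [0.0, 0.0]
for _ in range(200):  # iterate membrane cycles until near steady state
    x, y = step(x, y, I_ext=[3.0, 1.2])
```

After the transient, the strongly driven unit settles at a higher firing rate than the weakly driven one, the kind of contrast signal the model reads out as saliency.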

Input signals (I^t_{isoθ}) have been defined as the wavelet coefficients (ω^t_{isoθ}), split between ON and OFF components (representing ON and OFF-center cell signals from RGC and LGN) depending on the value polarity (+ for positive and − for negative coefficient values) of the RF. These signals are processed separately during 10τ (τ = 1 membrane time = 10 ms), including a rest interval (using an empty input) of 3τ to simulate the intervals between each saccade shift. The model output has been computed as the firing-rate average g_x of the ON and OFF components (M(ω^{t+}_{isoθ}) and M(ω^{t−}_{isoθ})) during the whole viewing time, corresponding to a total of 10 membrane times (being the mean of g_x for a specific range of t).
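The polarity splitting can be sketched as a simple rectification of the coefficients; a minimal illustration:

```python
def split_on_off(coeffs):
    """Split wavelet coefficients into rectified ON (+) and OFF (-)
    components, analogous to ON/OFF-center RGC/LGN signals."""
    on = [max(c, 0.0) for c in coeffs]
    off = [max(-c, 0.0) for c in coeffs]
    return on, off

on, off = split_on_off([0.5, -0.2, 0.0, 1.5])
recombined = [a - b for a, b in zip(on, off)]  # ON - OFF restores the input
```

Each component is non-negative, so both can be fed to the firing-rate network as separate input streams and recombined afterwards.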

Combining the output of all components as the average of the ON and OFF firing rates across scales and orientations,

S^t_{i;o} = ⟨ M(ω^{t+}_{isoθ}) + M(ω^{t−}_{isoθ}) ⟩_{s,θ}

we can describe the changes of the model (resulting from the simulated lateral interactions of V1) with respect to the original wavelet coefficients ω^t_{isoθ}. Our result (S^t_{i;o}) will define the saliency map as an average conspicuity map, or feature-wise distinctiveness (RF firing rates across scales and orientations for each pathway). These changes in firing rate alternatively define the contrast enhancement seen in the brightness and chromatic induction cases [1][2][3], where the model output is combined with the wavelet coefficients {M(ω^t_{iso})·ω^t_{iso}} instead. The network is composed, in total, of 1.18 × 10^6 neurons (accounting for 3 opponent channels, both ON/OFF polarities, and RF sizes of 128 × 64 × 3 × 8). Top-down inhibitory control mechanisms (I_c) are further explained in Table 1E and in the section "Attention as top-down inhibition".

Projections to the SC
The latest hypotheses about the neural correlates of saliency [56,57] state that the superior colliculus is responsible for encoding visual saliency and guiding eye movements [23,58]. Acknowledging that the superficial layers of the SC (sSC) receive inputs from the early stages of visual processing (V1, retina), the SC selects these as the root of bottom-up activity to be selected in the intermediate and deep layers (iSC, dSC). In accordance with the previously stated hypotheses [27], saccadic eye movements modulated by saliency are therefore computed by V1 activity, whereas recurrent and top-down attention is suggested to be processed by neural correlates in the parieto-frontal cortex and basal ganglia. All these projections are selected through a winner-take-all mechanism in the SC [27,28,30] into a unique map, where the retinotopic positions with the highest activity will be considered as candidates for the corresponding saccade locations. These activations in the SC are transmitted to guide the vertical and horizontal saccade visuomotor nerves [59]. We have defined the most active neurons (Equation 8) as the locations for saccades in the visual space (i,j), decoding the inverse of the cortical magnification (Equation 6) of their respective retinotopic position (neuron "i" at X_i, Y_i).
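The winner-take-all readout can be sketched as an argmax over a retinotopic activity map, decoded back to map coordinates. This simplified example assumes a flattened row-major map and omits the actual inverse cortical magnification step:

```python
def select_saccade(activity, width):
    """Winner-take-all: pick the most active retinotopic unit and decode
    its (i, j) location (row-major layout is an assumption here)."""
    winner = max(range(len(activity)), key=lambda n: activity[n])
    return winner // width, winner % width  # (row i, column j)

# Toy 2x3 activity map, flattened row by row.
act = [0.1, 0.3, 0.9, 0.2, 0.8, 0.05]
i, j = select_saccade(act, width=3)
```

In the full model the winning unit's retinotopic position would additionally be passed through the inverse mapping (Equation 6) to obtain visual-space coordinates.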
The behavioral quantity of the unique 2D saliency map has been defined by computing the inverse of the previous processes, using the model output for each pathway separately. Retinotopic positions have been transformed to coordinates in the visual space using the inverse of the cortical magnification function (Equation 6). Output signals (V1 sensitivities to orientation and spatial frequencies) are integrated by computing the inverse discrete wavelet transform to obtain unique maps for each channel opponency (Equation 4). A unique representation (Equation 9) of the final neuronal responses for each pathway (P-, K- and M- as a*, b* and L*) is generated with the euclidean norm (adding the responses of all channels as in Murray et al.'s model [60]). The resulting map is later normalized by the variance (Equation 10) of the firing rate [28, Chapter 5]. This map represents the final saliency map, which describes the probability distribution of fixation points over certain areas of the image. In addition to this estimation, the saliency map has been convolved with a gaussian filter simulating a smoothing of σ = 1 deg, given the deviations from eye tracking experimentation, as recommended by LeMeur & Baccino [61]. The disynaptic inhibitory connections (Table 1C) follow

W_{isθ,js'θ'} = λ(Δs) · 0.14 (1 − e^{−0.4(β/d_s)^{1.5}}) e^{−(Δθ/(π/4))^{1.5}}    (12)

Attention as top-down inhibition

An additional purpose of our work is the modeling of attentional mechanisms beyond pre-attentive visual selection. Instead of analyzing the scene serially, the visual brain uses a set of attentional biases to recognize objects, their relationships and their importance with respect to the task, all given in a set of visual representations.
Similarly to the saliency map, the priority map can be interpreted as a unique 2D representation for eye movement guidance formed in the SC, here including top-down (not guided by the stimulus itself) and recurrent information as visual relevance. This phenomenon suggests that executive, long-term and short-term/working memory correlates also direct eye movement control [23,63]. Previous hypotheses model these properties by forming the priority map through selective tuning [25,64]. Selective tuning explains attention mechanisms as a hierarchy of winner-take-all processes. This hypothesis suggests that top-down attention can be simulated by spatially inhibiting specific layers of processing. The latest hypotheses [65] confirm that striate cortical activity gain can be modulated by SC responses, with additional modulations arising from the pulvinar to extrastriate visual areas. In addition, it has also been stated [66] that the DLPFC is responsible for memory-guided saccades and smooth pursuit tasks [63,67,68] (also suggested to be crucial for planning intentional or endogenously-guided saccades), where its signals are sent to the SC. By feeding our model with inhibitory signals (I_c, shown in Figure 4 and Table 1E), we can simulate top-down feedback control mechanisms in V1 (initially proposed by Li [29, Sec. 3.7]). In this case, a new term I_{vs} is added to the top-down inhibition of our V1 cortical signals that will be projected to the SC during each gaze.
In this implementation, we can perform distinct search tasks such as feature search (by manually selecting the features, or selecting features with maximal responses, similarly to a boolean selection [26]) and exemplar or categorical object search (by processing the mean of the responses ω̄ from the wavelet coefficients of a single or several image samples "N"). These low-level computations serve as cortical activations to be stored as weights in our low-level memory representations, which are used as inhibitory modulation for the task execution.
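A hypothetical sketch of building such memory weights and applying them as inhibition follows. The feature dictionaries, the averaging over exemplars, and the suppression rule are illustrative assumptions, not the paper's exact formulation:

```python
def exemplar_inhibition_weights(samples):
    """Average feature responses over N exemplar images to build a
    top-down template (a sketch of the described low-level memory).

    samples: list of per-image feature maps (dicts feature -> response)."""
    features = samples[0].keys()
    n = len(samples)
    return {f: sum(s[f] for s in samples) / n for f in features}

def apply_top_down(bottom_up, weights, strength=1.0):
    """Suppress non-target features: responses matching the remembered
    template are spared, the rest receive inhibition (clipped at 0)."""
    top = max(weights.values()) or 1.0
    return {f: max(0.0, r - strength * (1.0 - weights[f] / top))
            for f, r in bottom_up.items()}

# Two hypothetical exemplars of a mostly-vertical target.
mem = exemplar_inhibition_weights([
    {"vertical": 0.9, "horizontal": 0.1},
    {"vertical": 0.7, "horizontal": 0.3},
])
out = apply_top_down({"vertical": 0.5, "horizontal": 0.5}, mem)
```

With equal bottom-up responses, the remembered (target-like) feature survives while the non-target feature is inhibited, which is the intended effect of I_{vs}.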
Inhibition of Return

During scene viewing, saccadic eye movements show distinct patterns of fixations [69], directed either by exploratory purposes or by putting the attentional focus on specific objects in the scene. For the former case, the HVS needs to ignore already visited regions (triggering anti-saccades away from these memorized regions, as a consequence of inhibition) during a period of time before gazing towards them again. This phenomenon is named inhibition of return (IoR) [70], and similarly involves extracting sensory information and short-term memory during scene perception. As mentioned before, the DLPFC is responsible for memory-guided saccades, and this function might be performed in conjunction with the parietal cortex and the FEF. The parietal areas (LIP and PEF) [63,67,71] are known to be responsible for visuospatial integration and the preparation of saccade sequences. These areas conjunctively interact with the FEF and DLPFC for planning these reflexive visually-guided saccades. Acknowledging that LIP receives inputs from FEF and DLPFC, the role of each cannot be disentangled as a unique functional correlate for the IoR. Following the above, we have modeled return mechanisms as top-down cortical inhibition feedback control accounting for previously-viewed saccade locations. Thus, we added an inhibition input I_{IoR} at the start of each saccade, which determines our IoR mechanism:

I^{g}_{IoR}(i,j) = β_{IoR} · I^{g−1}_{IoR}(i,j) + α_{IoR} · max(Ŝ) · exp(−((i−g_i)² + (j−g_j)²)/(2σ²_{IoR}))

This term is modulated by a constant power factor α_{IoR} and a decay factor β_{IoR}, which in every cycle progressively reduces the inhibition. The spatial region of the IoR has been defined as a gaussian function centered on the previous gaze (g), with a spatial standard deviation σ_{IoR} dependent on a specific spatial scale, and a peak with an amplitude of the maximal RF firing rate of our model's output (Ŝ). Inhibitory activity is accumulated in the same map, and it can be shown how it is progressively reduced during viewing time (Fig. 14). As alternatively illustrated in Itti et al.'s work [10], the IoR can be applied to static saliency models by subtracting the accumulated inhibitory map from the saliency map during each gaze (Ŝ − I^g_{IoR}).
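The accumulation-and-decay scheme described above can be sketched as follows; the sparse map representation and the constants are illustrative assumptions:

```python
import math

def ior_update(ior_map, gaze, peak, sigma=2.0, alpha=1.0, beta=0.9):
    """Decay earlier inhibition and stamp a gaussian blob at the
    previous gaze (alpha: power factor, beta: per-cycle decay;
    values illustrative).

    ior_map: dict (x, y) -> inhibition; gaze: (gx, gy);
    peak: maximal firing rate of the model output."""
    decayed = {pos: beta * v for pos, v in ior_map.items()}
    gx_, gy_ = gaze
    r = 3 * int(sigma)  # restrict the blob to +/- 3 sigma
    for x in range(gx_ - r, gx_ + r + 1):
        for y in range(gy_ - r, gy_ + r + 1):
            d2 = (x - gx_) ** 2 + (y - gy_) ** 2
            blob = alpha * peak * math.exp(-d2 / (2 * sigma ** 2))
            decayed[(x, y)] = decayed.get((x, y), 0.0) + blob
    return decayed

m = ior_update({}, gaze=(10, 10), peak=1.0)   # first fixation
m = ior_update(m, gaze=(20, 10), peak=1.0)    # second fixation
```

After the second fixation the first blob has decayed by β, so the most recently visited location carries the strongest inhibition, as in the accumulated map of Fig. 14.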

Procedure
Experimental data have been extracted from eye tracking experimentation. Four datasets were analyzed, corresponding to 120 real indoor and outdoor images (Toronto [13]), 40 nature scene images (KTH [72]), 100 synthetic image patterns (CAT2000-P [73]) and 230 psychophysical images (SID4VAM [20]). Generically, experimentation for these types of datasets [74] captures fixations from about 5 to 55 subjects looking at a monitor inside a luminance-controlled room while being restrained with a chin rest, located at a relative distance of 30-40 pixels per degree of visual angle (pxva). The tasks performed mostly consist of freely looking at each image during 5000 ms, looking at the "most salient objects", or searching for specific objects of interest. We have selected these datasets to evaluate prediction performance in distinct scene contexts. Indicators of the psychophysical consistency of the models have been presented, evaluating prediction performance upon fixation number and feature contrast. Visual search performance has been evaluated by computing predictions of locating specific objects of interest. For the case of stimuli from real image contexts (Fig. 17), we have used the salient-object segmented regions from Toronto's dataset [13], extracted from Li et al. [75]. Finally, for the case of evaluating fixations performed with synthetic image patterns, we used fixations from SID4VAM's psychophysical stimuli.

Model evaluation
Current eye tracking experimentation represents indicators of saliency as the probability of fixations on certain regions of an image. Metrics used in saliency benchmarks [39] consider all fixations during viewing time with the same importance, leaving saliency hypotheses unclear as to which computational procedures perform best on real image datasets. Previous psychophysical studies [19,20] revealed that fixations guided by bottom-up attention are influenced by the type of features that appear in the scene and their relative feature contrast. From these properties, the order of fixations and the type of task can drive specific eye movement patterns and center biases, which are relevant in this case.
The AUC metric (Area Under the ROC/Receiver Operating Characteristic curve) represents a score for a curve comprised of true positive (TP) values against false positive (FP) values. The TP are set as human fixations inside a region of the saliency map, whereas the FP are those predicted saliency regions that did not fall on human fixation instances. For our prediction evaluation we computed the sAUC (shuffled AUC), where FP are expressed as TP from fixations of other image instances. This metric prioritizes model consistency and penalizes prediction biases that appear over eye movement datasets, such as oculomotor and center biases (not driven by pre-attentional factors). We also calculated the Information Gain (InfoGain) metric for model evaluation, which compares FP in the probability density distribution of human fixations with the model prediction, while subtracting a baseline distribution of the center bias (all fixations grouped together in a single map). Saliency metrics, largely explained by Bylinskii et al. [76], usually compare model predictions with human fixations during the whole viewing time, regardless of fixation order. Our study also represents the evolution of prediction scores for each gaze. For the case of scanpaths, we evaluated saccade sequences by analyzing saccade amplitude (SA) and saccade landing (SL) statistics. These are calculated using the euclidean distance between fixation coordinates (the difference between saccade lengths for SA, and the distance between saccade locations for SL).
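A rank-based sketch of the (shuffled) AUC computation is shown below; it assumes precomputed saliency values and fixation coordinates, and omits the threshold-sweep curve construction used by some benchmark implementations:

```python
def auc(pos_vals, neg_vals):
    """Rank-based AUC: probability that a fixated location scores
    higher than a control location (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_vals for n in neg_vals)
    return wins / (len(pos_vals) * len(neg_vals))

def shuffled_auc(salmap, fixations, other_fixations):
    """sAUC sketch: negatives are fixation locations from OTHER images,
    penalizing center/oculomotor biases shared across a dataset."""
    pos = [salmap[y][x] for x, y in fixations]
    neg = [salmap[y][x] for x, y in other_fixations]
    return auc(pos, neg)

salmap = [[0.0, 0.2, 0.1],
          [0.1, 0.9, 0.3],
          [0.0, 0.1, 0.0]]
score = shuffled_auc(salmap, fixations=[(1, 1)],
                     other_fixations=[(0, 0), (2, 2)])
```

A score of 1.0 means every true fixation outranks every shuffled control location; 0.5 corresponds to chance.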
Initial investigations of visual attention [7,8] during visual search tasks formulated that the reaction times for finding a target (defined in a region of interest/ROI) among a set of distractors are dependent on set size as well as on target-distractor feature contrast. In order to evaluate performance on visual search, we utilised two metrics that account for the ground truth mask of specific regions for search and the saliency map (in this context, it could be considered a "relevance" map) or the predicted saccade coordinates (from locations with the highest neuronal activity). The Saliency Index (SI) [20,77] calculates the amount of energy of a saliency map inside a ROI (S_t) with respect to the one outside (S_b), calculated as SI = (S_t − S_b)/S_b. For the case of saccades in visual search, we calculated the probability of fixations inside the ROI (PFI).
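Both metrics can be sketched directly from their definitions; the boolean ROI mask layout is an illustrative assumption:

```python
def saliency_index(salmap, roi):
    """SI = (S_t - S_b) / S_b: mean saliency inside the target ROI
    versus outside it (roi: same-shape boolean mask)."""
    inside, outside = [], []
    for row_s, row_m in zip(salmap, roi):
        for s, m in zip(row_s, row_m):
            (inside if m else outside).append(s)
    s_t = sum(inside) / len(inside)
    s_b = sum(outside) / len(outside)
    return (s_t - s_b) / s_b

def prob_fix_inside(fixations, roi):
    """PFI: fraction of predicted fixations landing inside the ROI."""
    hits = sum(1 for x, y in fixations if roi[y][x])
    return hits / len(fixations)

roi = [[False, False], [False, True]]
sal = [[0.1, 0.1], [0.1, 0.7]]
si = saliency_index(sal, roi)
pfi = prob_fix_inside([(1, 1), (0, 0)], roi)
```

A positive SI means the relevance map concentrates energy on the target; PFI complements it by scoring the discrete saccade predictions.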

Results on predicting Saliency
In this section, probabilities of fixations have been generated using fixation maps of all participants from the Toronto, KTH, CAT2000 and SID4VAM datasets. In SID4VAM, stimuli are categorized by specific difficulty (according to the relative target-distractor feature contrast). With these, we computed the score for each relative contrast instance (Ψ) in Fig. 11. After computing each low-level stimulus instance with the presented models and evaluating the results with the same metrics, our saliency model (NSWAM and NSWAM-CM) outperforms AIM and IKN, with scores increasing at higher feature contrasts.

Discussion
Quantitatively, systematic tendencies in free-viewing (center biases, inter-participant differences, etc. [79]) should not be considered indicators of saliency. Even if shuffled metrics try to penalize these effects, benchmarks do not discard such tendencies from model evaluations (they are particular to each dataset's task and stimulus properties). First saccades are acknowledged to determine bottom-up eye movement guidance [80,81], a phenomenon also observable in our experimental data (as a decrease of performance for fixation region probability compared to fixation locations). In that respect, weighting first fixations more heavily could define new benchmarks for saliency modeling, as could stimuli in which the feature contrast of salient objects is quantified. Under ideal conditions (following Weber's law), if the salient region is easy to find (higher target-distractor contrast), saliency will be focused on that region; conversely, fixations will be distributed over the whole scene. Our model presents better performance than other biologically-inspired models on this basis.

Results on predicting scanpaths
Scanpaths for the datasets presented in the previous section were computed with scanpath models (illustrated in Fig. 13). Scanpaths are predicted by NSWAM-CM by selecting the maximum activity of our model during the first 10 saccades. We plotted our model's performance alongside Boccignone & Ferraro's and LeMeur & Liu's predictions (Fig. 12). Saccade statistics show an initial increase of saccade amplitude, which then decreases as a function of fixation number. Errors in SA and SL (∆SA and ∆SL) are calculated as absolute differences between model predictions and human fixations.
Values of ∆SL appear to be lower and similar for all models during initial fixations.
Prediction errors are sustained or increasing for CLE and NSWAM-CM (possibly due to their lack of higher-level feature processing, experimental center biases, etc.). Errors in ∆SA predictions are lower for LeMeur & Liu's model, which retains similar saccades (except for the synthetic images of SID4VAM). Although these errors are representative of the saccade sequence, we also computed correlations of the models' SA with the ground truth (ρSA). Here, NSWAM-CM presents higher correlation values for Toronto, Kootstra and CAT2000_P (ρSA_Toronto = −.38, p = .09; ρSA_KTH = .012, p = .96; ρSA_CAT2000P = .28, p = .16) than the other models. Most models accurately predict SA for SID4VAM (which mostly contains visual search psychophysical image patterns), with ρSA between .7 and .8. Compared models: LeMeur_Natural, LeMeur_Faces, LeMeur_Landscapes [17] and NSWAM-CM (ours).
We simulated the inhibition factor for all datasets by subtracting the inhibition factor I_{IoR} from our model's saliency maps (NSWAM+IoR). After computing prediction errors in SA and SL for a single sample (Fig. 15-Top), the best predictions appear at decay values β_{IoR} between .93 and .98, which corresponds to 1 to 5 saccades (similar to the accounts of Samuel & Kat [82] and Berga et al. [20], where the IoR lasts 300-1600 ms, corresponding to 1 to 5 times the fixation duration). For σ_{IoR}, the lowest prediction error (again, in both SA and SL) is found from 1 to 3 deg (by comparison, LeMeur & Liu [17] parametrized it by default as 2 deg). Results on ∆SA statistics show similar or slightly increasing performance up to a single fixation time (β_{IoR} < 1), decreasing at the highest decay (β_{IoR} ≥ 5th saccade). For ∆SL values, errors in datasets such as KTH and SID4VAM decrease at higher decay; for the latter, ∆SA errors decrease progressively at the highest decay values (β_{IoR} ≥ .93). Lastly, when parametrizing the spatial properties of the IoR, saccade prediction performance is highest at lower sizes (with a near-constant error in SA and SL, increasing by about 1 deg from σ_{IoR} = 1 to 8 deg on all datasets).
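The spatiotemporal IoR described above can be sketched as follows (a simplified reading, not the authors' exact implementation: a Gaussian of width σ_{IoR} is placed at each past fixation and its strength is assumed to decay as a power of β_{IoR} with saccade age, before being subtracted from the saliency map):

```python
import numpy as np

def ior_inhibition(shape, fixations, beta=0.95, sigma_deg=2.0, px_per_deg=30):
    """Sketch of the inhibition factor I_IoR.

    A Gaussian of width sigma (deg) is centered on each past fixation;
    its strength decays as beta**age, with age the number of saccades
    since that fixation (beta in [.93, .98], i.e. roughly 1-5 fixations
    of persistence, gave the best saccade predictions in the text)."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    sigma_px = sigma_deg * px_per_deg
    inhibition = np.zeros(shape)
    for age, (r, c) in enumerate(reversed(fixations)):
        g = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma_px ** 2))
        inhibition += (beta ** age) * g
    return np.clip(inhibition, 0.0, 1.0)

# NSWAM+IoR: subtract the inhibition factor from the saliency map
# saliency_ior = np.clip(saliency - ior_inhibition(saliency.shape, fixs), 0, None)
```

With this form, recently visited locations are suppressed most strongly, and the suppression fades after a handful of saccades, allowing revisits.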

Discussion
Our model's SA predictions correlate better with human data than those of other scanpath models (in terms of how SA evolves over fixations); however, prediction errors are higher in both SL and SA. This might be caused by the predicted target locations of fixations (saliency), by shifts in the saccade sequence (due to large saccade predictions), or by ignoring systematic tendencies in free-viewing (derived from center biases and/or focal fixations in a particular region of the image). We must stress that first fixations are long known to be determinants of bottom-up attention [20,80]. By contrast, inter-participant differences [79] and center biases [83] increase as functions of fixation number, making later fixations worse candidates for predicting attention. These parameters appear to affect each stimulus differently (considering that each stimulus may convey specific semantic importance between contextual elements), which may relate to top-down attention but not to the image characteristics per se. We also want to stress the importance of foveation in our model. It is a major procedure for determining saccade characteristics (including oculomotor tendencies) and saliency computations, as it determines current human actions during scene visualization. Possessing lower resolution as a function of eccentricity provides the aforementioned properties, innate in human vision and invariant to scene semantics. Adding an IoR mechanism has been seen to affect model activity and therefore scanpath predictions. In Fig. 14-Left we show how our inhibition factor (I_{IoR}) decreases over simulation time in relation to the parametrized decay β_{IoR}, as well as the projected RF size with respect to the Gaussian parameter σ_{IoR}. These variables (decay and size) affect both the location of saccades and their sequence, modulating firing-rate activity at already-visited locations. The example in Fig. 14-Right shows that the initial saccade focuses on the salient region and then spreads to a specific location in the scene, without repetition at higher values of inhibition decay or field size. In the next section we show how our model can reproduce eye movements beyond free-viewing tasks by modulating inhibitory top-down signals.

Results on feature and exemplar search
Saliency maps have also been computed with (NSWAM+VS) and without (NSWAM) top-down inhibitory modulation for singleton search stimuli [20]. Top-down selection is applied to our low-level feature dimensions (scale, orientation, channel opponency and its polarity). In NSWAM+VS_M, inhibition is parametrized according to the feature with the highest activation inside the stimulus ROI (Equation 15-Top). In NSWAM+VS_C, inhibitory control is instead set from the mean wavelet coefficients (Equation 15-Bottom).
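The two selection schemes can be sketched as follows (an illustrative reading of Equation 15, not its exact form; the weighting and normalization choices here are our own assumptions):

```python
import numpy as np

def topdown_weights(feature_maps, roi_mask, mode="max"):
    """Sketch of the two top-down selection schemes.

    feature_maps: array (n_features, H, W) of wavelet responses, one map
                  per scale/orientation/opponency/polarity combination.
    mode="max":   NSWAM+VS_M -- keep only the feature with the highest
                  response inside the ROI (inhibit all the others).
    mode="mean":  NSWAM+VS_C -- weight features by their mean wavelet
                  coefficient inside the ROI."""
    roi_resp = feature_maps[:, roi_mask]            # (n_features, n_roi_px)
    if mode == "max":
        w = np.zeros(feature_maps.shape[0])
        w[roi_resp.max(axis=1).argmax()] = 1.0      # winner-take-all
    else:
        m = roi_resp.mean(axis=1)
        w = m / m.sum()                             # normalized mean weights
    return w

# Relevance map as a weighted recombination of the feature maps:
# relevance = np.tensordot(topdown_weights(F, roi), F, axes=1)
```

The "max" scheme commits fully to the single most diagnostic feature, while the "mean" scheme spreads the top-down signal across all features present in the ROI.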

Synthetic Pattern Search
Object Search
Comparing NSWAM with bottom-up processing only against NSWAM with top-down inhibition (NSWAM+VS_M and NSWAM+VS_C) shows higher scores for both SI and PFI with top-down inhibition (Fig. 16). There is an increase of fixations inside the ROI: ∆(PFI)_{VS_M} ≈ 7% and ∆(PFI)_{VS_C} ≈ 5% for real object search, and ∆(PFI)_{VS_M} ≈ 6% and ∆(PFI)_{VS_C} ≈ 1% for synthetic image patterns. The SI also increases in both cases, with differences of ∆(SI)_{VS_M} = .6 × 10 Free-viewing fixations are predicted with similar performance to NSWAM (Fig. 7). Saliency metrics are similar or increasing with respect to NSWAM for feature singleton search fixations. This phenomenon might be caused by influences of the center bias, with more fixations near the center in free-viewing tasks [20].
We illustrate results of PFI and SI (Fig. 18) in relation to relative target-distractor feature contrast for cases of Brightness, Color and Size differences, as well as Set Size for searching a certain target pattern (i.e. a circle superposed by an oriented bar). After computing the SI for each distinct psychophysical stimulus, Figs. 18-19 show that our model performs best when searching for differences in brightness, color, size and/or superimposed singletons, rather than for different combinations of orientations, especially with heterogeneous, nonlinear or categorical angle configurations.

Discussion
Overall results show that top-down selection by maximal cortical activity (NSWAM+VS_M) seemingly performs better than the alternative of considering average statistics of certain features (NSWAM+VS_C). When searching real objects, NSWAM+VS_M and NSWAM+VS_C present overall similar results, slightly higher for NSWAM+VS_M (as dataset ROIs are selected from objects that are already salient). However, we suggest that considering scene statistics could perform better when searching for contextually complex exemplars: a combination of features could be captured implicitly when processing average ROI characteristics but not when using maximal activations, as shown qualitatively in Fig. 17. Search in psychophysical image patterns is significantly more efficient when selecting maximal feature activations (NSWAM+VS_M). In that regard, exemplar and categorical search for objects in real image scenes would require, for search efficiency, computations with a higher number of features [84] (representing each cortical cell sensitivity in more detail).

Fig 19. Performance on visual search evaluated on each distinct low-level feature; stimulus instances are from SID4VAM's dataset [20].

General Discussion
The current implementation of our V1 model is based on Li's excitatory-inhibitory firing rate network [29], following previous hypotheses of pyramidal and interneuron connectivity for orientation selectivity in V1 [54,55]. To support and extend this hypothesis, distinct connectivity schemas (following up V1 cell subtype characterization) [85,86] could be tested (e.g. adding disynaptic connections between inhibitory interneurons) to better understand V1 intra-cortical computations. Furthermore, modeling intra-layer interactions of V1 cells [43] could explain how visual information is processed in parallel and integrated by simple and complex cells [87], how distinct chromatic opponencies (P-, K- and M-) are computed at each layer [88], and how V1 responses affect SC activity (i.e. from layer 5) [89]. Testing the contributions of each of these chromatic pathways (at distinct single/double opponencies and polarities), as well as distinct fusion mechanisms for feature integration, would define a more detailed description of how visual features affect saliency map predictions.
Previous and current scanpath model predictions could be considered insufficient given scene complexity and the numerous factors (such as task specificity, scene semantics, etc.) simultaneously involved in saccade programming. These factors increase overall errors in scanpath predictions, as systematic tendencies increase over time [20,22,79,83], making late saccades difficult to predict. In free-viewing tasks (where no task is defined), top-down attention is likely to depend on the internal state of the subject, and further understanding of high-level attentional processes has only been approximated through statistical and optimization techniques on fixation data. It has also been observed that fixations during free-viewing and visual search have distinct temporal properties, which could explain why saliency and relevance are elicited differently during viewing time. The latest literature in that respect discerns two distinct patterns of fixations (ambient and focal), whereby subjects first observe the scene (possibly toward salient regions) and then focus their attention on regions that are relevant to them [69]; these influences are mainly temporal, and their modeling for eye movements in combination with memory processing is still under discussion. Current return mechanisms have long been computed by inhibiting the regions of previous fixations (spatially based); nonetheless, the IoR could also have feature-selective properties [90] to consider. We suggest that not all fixations should have the same importance when evaluating saliency predictions. Nature and synthetic scene images lack semantic (man-made) information, which might contribute to the aforementioned voluntary (top-down guided) eye movements [91]. Acknowledging that objects are usually composed of a combination of several features (in shape, color, etc.), we should analyze whether low-level features are sufficient to perform complex categorical search tasks. Extrastriate computations could allow the usage of object representations at higher-level processing, introducing semantically-relevant information and several image samples per category. Cortical processing of extrastriate areas (from V2 and V3) toward temporal (V4 & IT) and dorsal (V5 & MT) pathways [92, Section II] [43] could represent cortical activity at these distinct levels of processing, modeling in more detail the computations within the two-stream hypothesis (the "what" & "where" pathways). Color, shape and motion processing in each of these areas could generate more accurate representations of SC activity [23], producing more complex predictions such as microsaccadic and smooth pursuit eye movements.

Future Work
Current and future implementations of the model are able to process dynamic stimuli so as to represent attention in videos. Simulating motion energy from V1 cells and MT direction-selective cells [28, Section 2.3.5] would allow our model to reproduce the object motion and flicker mechanisms found in the HVS. Moreover, foveation through more plausible cortical mapping algorithms [93] could provide better spatial detail of the cortical field organization of foveal and peripheral retinotopic regions and lateralization, currently seen to reproduce V1/V2/V3 physiological responses. In addition, hypercolumnar feature computations of geniculocortical pathways could be extended with a higher number of orientation and scale sensitivities using self-invertible 2D log-Gabor filters [94]. In that regard, angle-configuration pop-out effects and contour detection computations [95,96] can be obtained by changing neuron connectivity and orientation tuning modulations.
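A minimal frequency-domain sketch of one such 2D log-Gabor filter (one scale and orientation; the parameter values and function name are illustrative, not those of [94]) could look like:

```python
import numpy as np

def log_gabor_2d(size, f0=0.25, theta0=0.0, sigma_f=0.55, sigma_theta=np.pi / 8):
    """Frequency-domain 2D log-Gabor filter for one scale/orientation.

    f0: peak radial frequency (cycles/pixel); theta0: preferred
    orientation; sigma_f / sigma_theta: radial and angular bandwidths."""
    fy, fx = np.meshgrid(np.fft.fftfreq(size), np.fft.fftfreq(size), indexing="ij")
    f = np.hypot(fx, fy)
    theta = np.arctan2(fy, fx)
    f[0, 0] = 1.0  # avoid log(0) at DC; the DC term is zeroed below
    radial = np.exp(-(np.log(f / f0) ** 2) / (2 * np.log(sigma_f) ** 2))
    radial[0, 0] = 0.0  # log-Gabor filters have no DC component
    dtheta = np.angle(np.exp(1j * (theta - theta0)))  # wrapped angular distance
    angular = np.exp(-(dtheta ** 2) / (2 * sigma_theta ** 2))
    return radial * angular

# Apply in the Fourier domain:
# response = np.real(np.fft.ifft2(np.fft.fft2(img) * log_gabor_2d(img.shape[0])))
```

Tiling f0 over scales and theta0 over orientations yields the hypercolumnar bank of sensitivities referred to above.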
In future implementations we aim to model the impact of feedback in cortico-cortical interactions with respect to striate and extrastriate areas of the HVS. Some of these regions project directly to the SC, including intermediate areas (pulvinar and medial dorsal) and the basal ganglia [23,63,67]. Our current implementation can be extended with a large-scale network of spiking neurons [97], which would also be able to learn certain image patterns through spike-timing-dependent plasticity mechanisms [98]. With such a network, the same model would be able to perform both psychophysical and electrophysiological evaluations while providing novel biologically-plausible computations on large-scale image datasets.

Conclusion
In this study we have presented a biologically-plausible model of visual attention that mimics visual mechanisms from the retina to V1 using real images. Computations at early visual areas of the HVS (i.e. RP, RGC, LGN and V1) are performed following physiological and psychophysical characteristics. We state that lateral interactions of V1 cells are able to produce real-scene saliency maps and to predict the locations of visual fixations. We have also proposed novel scanpath computations of scene visualization using a cortical magnification function. Our model outperforms other biologically-inspired saliency models in saliency prediction (specifically with nature and synthetic images) and tends to acquire similar scanpath prediction performance with respect to other artificial models, outperforming them in saccade amplitude correlations. In addition, we formulated projections of recurrent and selective attention using the same model (simulating frontoparietal top-down inhibition mechanisms). Our implementation includes top-down projections from DLPFC, FEF and LIP (regarding visual selection and inhibition-of-return mechanisms). We have shown how scanpath predictions improve by parametrizing the inhibition of return, with highest performance at a size of 2 deg and a decay time between 1 and 5 fixations. By processing low-level feature representations of real images (considering statistics of wavelet coefficients for each object or feature exemplar) and using them as top-down cues, we have been able to perform feature and object search with the same model. Two search procedures are presented, increasing both the probability of gazing inside a region of interest and the number of fixations inside that region. In previous studies, the same model reproduced brightness [2] and chromatic [3] induction, as well as explaining V1 cortical hyperexcitability as an indicator of visual discomfort [1]. With the same parameters and without any type of training or optimization, NSWAM is able to predict bottom-up and top-down attention for free-viewing and visual search tasks. Model characteristics have been constrained (in both architecture and parametrization) by human physiology and visual psychophysics, and the model can be considered a simplified and unified simulation of how visual processes occur in the HVS.

Fig 1 .
Fig 1. Example of CIELAB components of color opponencies given a sample image, corresponding to L* (Intensity), a* (Red-Green) and b* (Blue-Yellow).

Fig 3 .
Fig 3. Left: Examples of applying the cortical magnification function (transforming the visual space to the cortical space) at distinct views of the image presented in Figure 1. Right: Illustration of how polar coordinates (Z-plane) of azimuth Φ = (1, 2, 3, 4, 5) in the left visual field at distinct eccentricities r = (d, c, b, a) are transformed to the cortical space (W-plane) in mm (X and Y_i axis values). Equations 5 & 6 express the monopole direct and inverse cortical mapping transformations (parameters set as λ = 12 mm and e_0 = 1 deg [28, Section 2.3.1]). Illustration sketch was adapted from E.L.
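Equations 5 & 6 are not reproduced in this excerpt; a common form of the monopole mapping consistent with the caption's parameters (λ = 12 mm, e_0 = 1 deg) is sketched below, with the caveat that the exact formulation used by the model may differ:

```python
import numpy as np

LAMBDA_MM = 12.0  # cortical scaling lambda (mm), per the caption
E0_DEG = 1.0      # eccentricity constant e_0 (deg), per the caption

def visual_to_cortex(z):
    """Direct monopole mapping: visual-field position z (deg, encoded as
    a complex number r * exp(i*phi)) -> cortical position w (mm)."""
    return LAMBDA_MM * np.log(1 + z / E0_DEG)

def cortex_to_visual(w):
    """Inverse monopole mapping: cortical position w (mm) -> z (deg)."""
    return E0_DEG * (np.exp(w / LAMBDA_MM) - 1)
```

The logarithm compresses the periphery and expands the fovea, which is what produces the cortical magnification illustrated in the figure.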

Fig 6 .
Fig 6. Diagram illustrating how visual information is processed by NSWAM-CM, including a brain drawing of each bottom-up and top-down attention mechanism and their localization in the cortex (Bottom-Right).
Model scores and examples for the eye tracking datasets are shown in Figs 7-10. Several saliency predictions have been computed from different biologically-inspired models. Our model has been computed without (NSWAM) and with foveation (NSWAM-CM), the latter as a mean of cortically-mapped saliency computations over a loop of 1, 2, 5 and 10 saccades. Based on the shuffled metric scores, traditional saliency models such as AIM overall score higher on real scene images (Fig 7), with sAUC_AIM = .663 and InfoGain_IKN = .024. For nature images (Fig 8), our non-foveated and foveated versions of the model (NSWAM and NSWAM-CM) scored highest on both metrics (InfoGain_NSWAM = .168 and sAUC_NSWAM-CM_10 = .567). As mentioned before, fixation center biases are present when the task and/or stimulus does not induce regions salient enough to produce bottom-up saccades. In addition, in real image datasets (Toronto and KTH), not all images contain particularly salient regions. This phenomenon is seemingly present in our model's saliency maps from the 1st to the 10th fixation (Figs. 7-8, rows 5-8), where salient regions become less evident across fixation order.

Fig 11 .
Fig 11. sAUC and InfoGain scores for each relative target-distractor feature contrast

Fig 16 .
Fig 16. Statistics of the Saliency Index (top row) and Probability of Fixations Inside the ROI (bottom row) for synthetic image patterns (left) and salient object detection regions from real image scenes (right).

Fig 18 .
Fig 18. Performance on visual search examples with a specific low-level feature contrast (for Brightness, Color or Size) and Set Size.We represented 7 instances ordered by search difficulty of each feature sample.