Recurrent issues with deep neural network models of visual recognition

Object recognition requires flexible and robust information processing, especially in view of the challenges posed by naturalistic visual settings. The recurrent connectivity of the ventral stream in visual cortex provides this robustness. Recurrent deep neural networks (DNNs) have recently emerged as promising models of the ventral stream. In this study, we asked whether DNNs could be used to explore the role of different recurrent computations during challenging visual recognition. We assembled a stimulus set that included manipulations often associated with recurrent processing in the literature, such as occlusion, partial viewing, clutter, and spatial phase scrambling. We obtained a benchmark dataset from human participants performing a categorisation task on this stimulus set. By applying a wide range of model architectures to the same task, we uncovered a nuanced relationship between recurrence, model size, and performance. While recurrent models reach higher performance than their feedforward counterparts, we could not dissociate this improvement from that obtained by increasing model size. We found consistency between human and model patterns of difficulty across the visual manipulations, but this consistency was not modulated in an obvious way by the specific type of recurrence or size added to the model. Finally, depth/size rather than recurrence makes model confusion patterns more human-like. Our findings challenge the notion that recurrent models are better models of human recognition behaviour than feedforward models, and emphasise the complexity of incorporating recurrence into computational models.

Author summary

Deep neural networks (DNNs) are considered the best current models of visual recognition. This is mainly due to the correspondence between their structure and that of the ventral stream in the primate visual system, as well as a twofold match: between their representations and human neural representations, and between their behaviour and human error patterns. Recently, it has been suggested that adding recurrence to usually feedforward-only DNNs improves this match, while simultaneously making their architecture more brain-like. But how much of human behaviour do these models actually replicate, and does recurrence really make things better? We conducted an in-depth investigation of this question by putting DNNs to the test. In our work, we ask: do models still resemble humans when the task becomes difficult, and do they use similar strategies to recognise objects? Bringing different architectures together, we show that recurrence tends to increase model performance and consistency with humans. However, we cannot dissociate this improvement from that brought by parameter size alone. Additionally, we find a strikingly worse match with human patterns of errors in models with recurrence, as compared to purely feedforward models. Our findings challenge the notion that recurrent models are better models of human recognition behaviour than feedforward models, and emphasise the complexity of incorporating recurrence into computational models.


Introduction
The seemingly effortless nature of human object recognition hides the computational difficulty of the task. The visual complexity of scenes from everyday life calls for flexible and robust processing in order to decode the identity of objects in the environment [Kravitz et al., 2013, Grill-Spector and Weiner, 2014, Bracci and Op de Beeck, 2023]. This feat is accomplished seamlessly by the ventral stream in visual cortex, whose connectivity underpins these extraordinary abilities. While the ventral stream is often modelled as a bottom-up network, every feedforward connection it contains is paralleled by one or several non-feedforward connections ([Van Essen et al., 1986, Felleman and Van Essen, 1991, Lamme et al., 1998, Ungerleider et al., 2008, Baizer et al., 1991]). It is hence better understood as a highly recurrent network.
Recurrence allows a system to process information dynamically, and visual recognition is indeed a dynamic process. Although most objects are instantaneously recognised through a fast, feedforward sweep of processing ([DiCarlo et al., 2012, Riesenhuber and Poggio, 2000, Riesenhuber, 2005]), visual information undergoes progressive transformations across time, during which sensory inputs are modulated and integrated. This extra-feedforward processing occurs through cycles of recurrent activity considered essential for many perceptual phenomena, e.g. stimulus-context modulation, predictive processing and figure-ground segmentation [Pennartz et al., 2019, Wyatte et al., 2014, Lamme and Roelfsema, 2000, Kreiman and Serre, 2020]. This broad involvement of recurrence in non-canonical, non-trivial contexts is supported by experimental evidence. Given the temporal unfolding that is inherently linked with recurrent processing, it is possible to selectively impact it while leaving initial feedforward processing intact, for example through the procedure of backward masking. Several studies made use of backward masking to manipulate recurrent processing, and found that it selectively impaired visual recognition under challenging conditions, but not under non-challenging ones ([Seijdel et al., 2021, Rajaei et al., 2019, Tang et al., 2018]).
Strong evidence in favour of the recurrent nature of the visual cortex comes from studies using deep neural networks (DNNs). Optimised on recognition tasks, DNNs are considered good models of the ventral stream, mimicking its hierarchical structure and performing classification with human-like accuracy ([Cadieu et al., 2014]). DNNs have been found to display representational similarity with the ventral stream ([Yamins et al., 2014, Khaligh-Razavi and Kriegeskorte, 2014]). Strikingly, this similarity increases with recurrent DNNs ([Kar et al., 2019, Kietzmann et al., 2019]), indicating that the representational dynamics of the visual cortex are better matched by those of a recurrent network than by those of a feedforward-only one. The aforementioned backward-masking studies also included DNNs to show that models equipped with recurrent-like features show more resilience to challenging conditions than feedforward-only models.
While feedforward processing is well understood, there is no universal account of the exact mechanisms served by recurrence in visual recognition. Two factors complicate this matter. First, there are virtually unlimited ways an image might be challenging to process: natural environments have infinite potential variations, and to reproduce those variations, objects can be rendered cluttered, blurred, noisy, and so on. Second, recurrence is an umbrella term for a variety of anatomical connections. There are lateral and feedback connections within and between brain regions, respectively, and there is no guarantee that a lateral connection in, for example, primary visual cortex serves the same function as a lateral connection in prefrontal cortex. Similarly, a top-down connection between two nearby regions in the visual system might serve a different purpose than a top-down connection all the way from prefrontal cortex to V4.
Taken together, these two factors yield a high number of possible combinations of how specific recurrent connections might underlie the resolution of well-defined image challenges. Across the visual cortex, various specific forms of recurrence have been studied and linked to perceptual phenomena. For instance: uncertainty computation ([Ladret et al., 2023]) and pattern completion ([Shin et al., 2023]) by lateral connections within V1, edge detection in cluttered images through lateral processing in V1 ([Self et al., 2013]), prediction about occluded shapes by feedback from the prefrontal cortex (PFC) to V4 ([Choi et al., 2018]), figure-ground modulation by feedback from V4 to V1 ([Klink et al., 2017, Angelucci et al., 2017]), and interactions with spatial frequency transformations by feedback from frontal areas ([Bar et al., 2006, Goddard et al., 2016]). Overall, the literature points to a wide range of roles played by recurrent processing in resolving a variety of visual challenges, with each recurrent cortical connection potentially performing a different computation on incoming sensory information.
The wealth of perceptual phenomena explained by discrete recurrent connectivity patterns, and the diversity of visual challenges found to be solved by recurrence through backward masking, suggest that visual recognition is served by a wide range of recurrent computations in charge of solving visual complexity. Identifying the various roles of recurrence and merging them into a larger picture of ventral stream function would prove significant for the understanding of human visual cognition. The present study aimed at taking a step in this direction with an experiment designed to dissociate between different types of recurrence. We combined a variety of image-level manipulations known to involve recurrence in humans into one large experiment ([Seijdel et al., 2021, O'Reilly et al., 2013, Tang et al., 2018, Rajaei et al., 2019, Goddard et al., 2016]). We implemented different types of recurrence in a number of DNN models, and compared their behaviour on a classification task with that of humans. Taking advantage of the architectural diversity of our models, we asked whether DNNs could be used to dissociate between different recurrent processes responsible for the resolution of different challenges during visual recognition.
We found that recurrence in DNNs helped them match human performance, but not more than added depth. This feeds into the observation that recurrent neural networks can be considered time-unrolled equivalents of size-matched feedforward neural networks ([van Bergen and Kriegeskorte, 2020]). However, our results go a step further, as we also found a decrease in the models' match with human confusion matrices when equipped with recurrence. Additionally, we were not able to find dissociable patterns of performance across our recurrent models, which indicates that DNNs, as tools to model the visual cortex, are unable to reproduce the variability of phenomena that occur in brain recurrence. Overall, we interpret our results, in light of current developments in the field, as an argument for the development of more elaborate, biologically plausible ways of implementing recurrence in models of visual recognition.

Participants
A total of 231 subjects participated in the study (195 females, age (mean±SEM): 18.7 ± 1.7), of which 13 were excluded due to low performance (average accuracy below 0.7). All participants gave their informed consent before taking part. The protocol was approved by the ethics committee at KU Leuven. The task took place online and lasted approximately 30 minutes.

Stimuli
Images of objects were selected from real-world scenes of the publicly available online databases MS COCO ([Lin et al., 2014]) and ADE20K ([Zhou et al., 2018]). Eight object categories were included: person, cat, bird, tree, fire hydrant, building, bus, and banana (see Fig 2). The categories were chosen to cover a diverse range of levels of animacy, real-world size and aspect ratio. Stimuli were segmented out from their backgrounds, transformed to a 700×700 pixel size, and equalised on their contrast and average luminance levels. Ten exemplar objects were picked for each category, resulting in a stimulus set of 80 images.

Challenging manipulations
Object recognition was rendered challenging by applying several levels of visual manipulation to the stimulus set (see Fig 1). Each manipulation was selected based on literature demonstrating that it is linked to recurrent processing, as evidenced by psychophysical, neuroimaging, or DNN data. We focused on three of them: occlusion, clutter, and phase scrambling. In total, the full stimulus set was rendered in one control condition (segmented object on a gray background) and 16 challenging conditions.
• Phase scrambling: phase-scrambled images were generated by randomising the phase spectrum of objects. Images were Fourier transformed, and their phase spectrum was replaced by random noise on either side of a spatial frequency threshold of 1.5 cycles per degree (with images assumed to subtend 10 degrees of visual angle on screen). The origin of the phase spectrum was left intact. Images were then transformed back into image space. This resulted in two phase-scrambled versions of each image: a low-pass phase scramble and a high-pass phase scramble.
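The procedure above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: the function name, the fixed random seed, and the pixel-to-degree mapping are our assumptions.

```python
import numpy as np

def phase_scramble(img, threshold_cpd=1.5, image_deg=10.0, low_pass=True):
    """Scramble the Fourier phase of a grayscale image on one side of a
    spatial-frequency threshold (paper values: 1.5 cpd, 10 deg of visual angle)."""
    h, w = img.shape
    F = np.fft.fftshift(np.fft.fft2(img))
    # Radial spatial frequency of every Fourier coefficient, in cycles/degree.
    fy = np.fft.fftshift(np.fft.fftfreq(h)) * h / image_deg
    fx = np.fft.fftshift(np.fft.fftfreq(w)) * w / image_deg
    radius = np.sqrt(fx[None, :] ** 2 + fy[:, None] ** 2)
    # Scramble phases below (low-pass scramble) or above the threshold.
    region = radius < threshold_cpd if low_pass else radius >= threshold_cpd
    region[h // 2, w // 2] = False            # keep the DC origin intact
    rng = np.random.default_rng(0)            # fixed seed for reproducibility
    random_phase = rng.uniform(-np.pi, np.pi, size=(h, w))
    amplitude, phase = np.abs(F), np.angle(F)
    phase[region] = random_phase[region]
    F_new = amplitude * np.exp(1j * phase)
    # Random phases break conjugate symmetry, so keep the real part.
    return np.real(np.fft.ifft2(np.fft.ifftshift(F_new)))
```

Because the origin (DC component) is left untouched, the mean luminance of the scrambled image equals that of the original, consistent with the luminance equalisation described above.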
• Clutter: cluttered images were obtained by replacing plain backgrounds with cluttered backgrounds. These were obtained by phase scrambling randomly selected natural scenes from the MS COCO dataset, and measuring an index of clutter on the outcome. The clutter index was calculated following the method of [Groen et al., 2013], whereby measures of spatial coherence (SC) and contrast energy (CE) are computed for each image, the combination of which gives a proxy of how cluttered an image is. More precisely, CE and SC values are proxies of the LGN population response, derived from the activation of contrast filters ([Ghebreab et al., 2009], [Scholte et al., 2009]). Taken together, CE and SC distribute images on a two-dimensional space, with high object segmentability at the origin (high SC, low CE), and with objects becoming complex to segment from their background at the other end (low SC, high CE). We defined two levels of clutter by selecting 80 backgrounds at each end of the spectrum. Objects were hence rendered once on a lightly cluttered background (light clutter), and once on a heavily cluttered background (heavy clutter).

Figure 1: Stimulus manipulations. The stimulus set included eight object categories of 10 objects each, passed through a total of 16 visual manipulations. Objects in the images were rendered challenging by applying either phase scrambling, clutter, or occlusion. (A) Phase scrambling was applied by replacing the phase spectrum of images with random noise, on either side of a 1.5 cpd threshold (red square not actual size). (B) Clutter was created following the method described in [Groen et al., 2013], whereby phase-scrambled versions of natural scenes taken from the MS COCO dataset were ranked on their spatial coherence and contrast energy values. Subsets of the most and least cluttered of a large number of such images were taken as light- and heavy-cluttered backgrounds for the stimuli. (C) Images were lightly or heavily occluded (percentage of object left visible: 80% and 40%, respectively), using many small or a few large disks. This was done in one of three possible fashions: adding black blobs, adding blobs the colour of the background (deletion), or adding a full black occluder with disk-like apertures.

• Occlusion: images were occluded so that only a given proportion of the initial object remained visible, either 80% (easy condition) or 40% (hard condition). Objects were concealed in three possible ways: by adding a full occluder with disk-like apertures, by adding occluding black blobs, or by deleting parts of the objects. Within each occluding technique, two levels of size were implemented: either many small disks or a few large disks were used.
We combined several types of occlusion based on the observation that they are not equally challenging ([Tang et al., 2018, Johnson and Olshausen, 2005]), thereby increasing the richness of our stimulus set. This combination of difficulty levels (2), occluding techniques (3) and disk sizes (2) resulted in 12 different occlusion manipulations.
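A disk-based occluder of the kind described above can be sketched as follows. This is a hypothetical reconstruction: the function name, the radius heuristic, and the stopping rule are our assumptions, not the stimulus-generation code used in the study.

```python
import numpy as np

def occlude_with_disks(img, visible_frac=0.8, n_disks=40, rng=None):
    """Cover a grayscale image with black disks until roughly
    (1 - visible_frac) of its pixels are hidden. n_disks controls the
    'many small' vs 'few large' variants: the radius is chosen so that
    n_disks non-overlapping disks would cover the target hidden area."""
    rng = np.random.default_rng(rng)
    h, w = img.shape
    target_hidden = (1.0 - visible_frac) * h * w
    radius = int(np.sqrt(target_hidden / (n_disks * np.pi)))
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.zeros((h, w), dtype=bool)
    # Keep dropping disks at random positions until the budget is met.
    while mask.sum() < target_hidden:
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        mask |= (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    out = img.copy()
    out[mask] = 0.0   # black blobs; use the background colour for 'deletion'
    return out, mask
```

Swapping the fill value for the background colour gives the deletion variant, while inverting the mask and filling the complement approximates the full occluder with disk-like apertures.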

Experimental task
Participants were given a categorisation task, where each image was to be classified into one of the eight possible categories of the stimulus set. All trials had the same timeline (see Fig 2): after a fixation cross, a target image appeared for 50 ms. Targets were either followed by an empty screen or by a mask. The instructions asked for an accurate answer, which could be registered as soon as the target was presented. All trials ended with the presentation of a reminder of the eight categories and their associated response keys. The latter appeared at the bottom of the screen 200 ms after the mask (or empty screen) disappeared.
Masks were phase-scrambled versions of natural scenes taken from MS COCO. Images were randomly selected from the dataset, resized to match target images (700×700), and taken to Fourier space, where their phase spectrum was fully replaced by random noise (except for the origin, left intact) before reconstruction into image space.
In total, 16 experimental conditions and one control made 17 conditions, all of which were presented with and without masking. The resulting 34 conditions were divided into sub-experiments: occlusion made up 2 sub-experiments (few large occluder disks and many small occluder disks, 6 conditions each), while phase scrambling and clutter were joined into one other sub-experiment (4 conditions). These three sub-experiments were run with and without masking, resulting in 6 experiments in total. An extra unmasked control condition was added to all three of the masked experiments to serve as a comparable baseline. All experiments were conducted online over the same period, and no participant took part more than once.
Within tasks, each condition was presented once per image, i.e. 80 trials per condition. As a result, tasks consisted of 400 (without masking, phase scrambling and clutter conditions) to 640 trials (with masking, occlusion conditions).

Deep neural network architectures
We investigated thirteen distinct DNN models, each varying in architecture and recurrent connections (see Fig 3). These models were specifically chosen to represent a range of complexities, from basic feedforward structures to more sophisticated recurrent networks. We categorised these models into three groups: CORnet models, B models, and VGG models (see Table 1 for more details).

CORnet Models.
From the CORnet family, we included the foundational models CORnet Z, CORnet RT, and CORnet S ([Kubilius et al., 2018, Kubilius et al., 2019]). While CORnet Z uses a simple feedforward architecture, CORnet RT enhances this design by integrating biologically inspired additive recurrent mechanisms within each block. CORnet S further advances this architecture by adding additional convolutional layers in a bottleneck ResNet-like block structure, thereby enriching its internal dynamics and significantly increasing its parameter count relative to its simpler counterparts. For better readability, CORnet is hereafter abbreviated C, and models are named after the recurrent element of connectivity they contain (L for lateral connections, T for top-down connections). We therefore selected C (CORnet Z), CL (CORnet RT) and CS (CORnet S). We custom-extended the C series for this research by adding the following: C V1-V1, featuring within-layer connections in its initial convolutional layer; C IT-IT, implementing similar within-layer connections but in its terminal convolutional layer; CT, designed with top-down connections extending from higher to lower layers; and CLT, a specialized version of CL augmented with top-down connections analogous to those in CT.

B Models.
B models ([Spoerer et al., 2017]) include the B, BL, BT, and BLT models. BL is characterized by lateral connections and BT by top-down connections, while the BLT model combines both lateral and top-down mechanisms. These models were used in their original implementations; see ([Spoerer et al., 2017]) for a more detailed description.

VGG Models.
The VGG models, specifically VGG11 and VGG16 [Simonyan and Zisserman, 2015], served as the feedforward control group. Their architecture, lacking feedback mechanisms, is deeper, providing a baseline comparison against the recurrent models. VGG11 is a custom variant of VGG16 featuring only 11 trainable layers and a smaller decoding head with a single linear layer. This modification was made to create a VGG16-like very deep feedforward network, but with a number of learnable parameters more comparable to the recurrent models.
Our model training process was divided into two distinct phases: a training phase on the ImageNet dataset, followed by a phase of fine-tuning on the specific 8 categories contained in our dataset.

ImageNet Training.
Models lacking pre-existing ImageNet-trained weights or featuring custom recurrent mechanisms (i.e. C V1-V1, C IT-IT, CT, CLT, VGG11) were trained in-house. This phase involved a single cycle of 25 epochs, using a fixed seed of 42 and a batch size of 256. A one-cycle learning rate policy [Smith and Topin, 2017], as implemented in the fastai library [Howard et al., 2018], was employed. The maximum learning rate for the fit_one_cycle algorithm was set by multiplying the optimal rate, determined using fastai's implementation of the learning rate finder [Smith, 2015], by a factor of 10. The training was performed on the ILSVRC2012 challenge subset of the ImageNet dataset [Russakovsky et al., 2015], which includes 1,000 classes and approximately 1.2 million images for training and an additional 50,000 for validation. During training, each image was normalized using the mean and standard deviation statistics from ImageNet, followed by a random crop with a minimum scale of 0.35, which was then resized to 224 × 224 pixels. In the validation phase, images were directly normalized and resized to 224 × 224 pixels, omitting the cropping step.

Fine-Tuning.

For fine-tuning, we conducted 20 independent training runs on the same ImageNet-trained model, each with a distinct fixed seed. This approach ensured varied initializations for the new classifier head (adapted from 1,000 output units to the 8 units needed for our classification task) and for the dataloaders, as well as different seed settings for essential libraries (e.g., NumPy, PyTorch, fastai, and CUDA). Each run comprised 3 cycles of 9 epochs each, amounting to 27 epochs in total. The initial 6 epochs of each cycle focused on training the model's classifier and the last convolutional layer before the classifier. This strategy was adopted to preserve the low-level features learned from ImageNet, aligning with our goal to model human-like visual processing by retaining features from a dataset rich in natural visual variance. In the last 3 epochs of each cycle, the full model was trained, allowing for a comprehensive adaptation to our 8-way classification task. The learning rate for each cycle was dynamically determined using the learning rate finder algorithm from fastai at the start of each cycle. The fit_one_cycle method parameters were set with lr_max as the optimal learning rate multiplied by 10.
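The freeze/unfreeze schedule described above can be illustrated in plain PyTorch. This is a minimal sketch, not the authors' fastai code: the toy model, layer indices, and the `steps_per_epoch` value are placeholders, and `fit_one_cycle` is stood in for by PyTorch's `OneCycleLR` scheduler.

```python
import torch
import torch.nn as nn

# Toy stand-in for an ImageNet-trained backbone plus a fresh 8-way head.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),   # "last conv before the classifier"
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 8),                             # new 8-category classifier head
)

def set_trainable(model, full):
    """Epochs 1-6 of a cycle: train only the head and the last conv layer.
    Epochs 7-9: unfreeze the full model."""
    for p in model.parameters():
        p.requires_grad = full
    for layer in [model[2], model[6]]:           # last conv + linear head
        for p in layer.parameters():
            p.requires_grad = True

set_trainable(model, full=False)
params = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.SGD(params, lr=1e-3)
steps_per_epoch = 100                            # assumed size of the fine-tuning loader
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-2, epochs=9, steps_per_epoch=steps_per_epoch)
```

In the actual pipeline the equivalent of `set_trainable(model, full=True)` would be called before the last 3 epochs of each cycle, and `max_lr` would come from the learning rate finder rather than a constant.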
The fine-tuning dataset included a total of 32,000 images taken from the same datasets as those used to build the human stimulus set (see Stimuli), with an equal distribution of 4,000 images per category. Each category was split so that half of the images appeared on their original background (divided equally between colored and black and white) and half against a neutral gray background (also divided equally between colored and black and white). This approach allowed us to train the networks to categorize images accurately irrespective of background variations and color cues.

Model Sizes and Connections
A comprehensive overview of the model architectures is provided in Fig. 3. Models that implemented at least one type of recurrent connection (i.e., a top-down or within-layer recurrent mechanism) have an added time dimension due to the iterative nature of recurrent dynamics. Here, we set the total number of time-steps to 5 during both training and testing phases. A time-step here represents an end-to-end (from input to output) processing of the input image.
Top-Down Connections.
Top-down mechanisms were implemented in three of the models used here: CT, CLT and BLT. The CT and CLT architectures, inspired by the CORnet model, and the BLT model take distinct approaches to incorporating feedback connections in deep neural networks.
The BLT architecture incorporates top-down connections through transposed convolutional layers between consecutive layers, enabling the model to integrate abstract, higher-level information back into earlier stages of processing. The de-convolution and reintegration process could be seen as part of a predictive coding strategy, where the model minimizes prediction errors between higher-level predictions and actual sensory input, akin to generative models that seek to enhance feature selectivity and reconstruct low-level sensory inputs from higher-level abstractions. The additive nature of the de-convolved signal in the implementation used here could lead to a direct influence of higher-level representations on lower-level processing, but offers less capacity for non-linear integration of the bottom-up and top-down signals compared to our other networks.
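The additive transposed-convolution feedback described above can be sketched as follows. This is our reading of the mechanism, not the original BLT code; channel counts, kernel sizes, and the class name are illustrative.

```python
import torch
import torch.nn as nn

class AdditiveTopDown(nn.Module):
    """Sketch of a BLT-style feedback path: higher-layer activity is
    de-convolved back to the lower layer's resolution and added to its
    bottom-up drive before the non-linearity."""
    def __init__(self, lo_ch=16, hi_ch=32):
        super().__init__()
        self.bottom_up = nn.Conv2d(3, lo_ch, 3, stride=1, padding=1)
        # Transposed conv maps the (downsampled) higher layer back up in size.
        self.top_down = nn.ConvTranspose2d(hi_ch, lo_ch, 4, stride=2, padding=1)

    def forward(self, img, higher=None):
        drive = self.bottom_up(img)
        if higher is not None:                    # time-steps t > 0 only
            drive = drive + self.top_down(higher)  # additive feedback
        return torch.relu(drive)

block = AdditiveTopDown()
img = torch.randn(1, 3, 32, 32)
t0 = block(img)                                   # first sweep: no feedback yet
t1 = block(img, higher=torch.randn(1, 32, 16, 16))
```

Because the feedback is summed into the drive rather than concatenated, the lower layer cannot learn an arbitrary non-linear combination of the two signals, which is the limitation noted in the text.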
In the CT and CLT models, on the other hand, each block is uniquely structured to integrate top-down inputs from all higher blocks, not just the adjacent one, through a specialized top-down pathway, before being combined with the feedforward input through an additional convolutional layer. Such a design facilitates a more sophisticated non-linear integration of hierarchical information, allowing for the combination of bottom-up data with top-down information from all higher layers. This method fosters richer representations at each processing stage, at the cost of increased computational complexity and a partial information loss due to channel averaging.
Specifically, in the CT and CLT models, a block processes a bottom-up input x (the output of the previous block at time-step t − 1) and combines it with top-down feedback M, derived from the aggregated outputs of higher-level blocks at time-step t − 1. In instances where top-down input is absent, such as during the initial time-step when no higher-level information is available, M is set to zero.
The feedforward path FF and the top-down path TD each involve convolution (∗), Group Normalization (N), and ReLU activation (R):

FF(x) = R(N(x ∗ θ_ff)),    TD(M) = R(N(M ∗ θ_td)),

where the weights θ_ff and θ_td represent distinct sets of learned weights for the feedforward and top-down convolutions, respectively.
The integrated top-down input M is formed by processing the outputs from the higher layers m_1, m_2, ..., m_n:

M = Concat(ProcessLayer(m_1, x), ..., ProcessLayer(m_n, x)).

Given a top-down input tensor m_i from layer i and a bottom-up input tensor x, the function ProcessLayer(m_i, x) is defined as follows:

ProcessLayer(m_i, x) = mean_c(AdaptiveAvgPool(m_i, H_x)),

where H_x is the height of the bottom-up input tensor x, AdaptiveAvgPool(m_i, H_x) is the tensor obtained after applying adaptive average pooling to m_i with a target spatial dimension of H_x, and mean_c computes the mean across the channel dimension of the pooled tensor, resulting in a tensor with a single channel and the same spatial dimensions as the bottom-up input.
The final output y of the block is generated by concatenating the feedforward and top-down outputs and applying convolution, normalization, and ReLU:

z = Concat(FF(x), TD(M)),    y = O(z) = R(N(z ∗ θ_out)),

where concatenation occurs along the channel dimension, and the output operation O applies convolution, normalization, and ReLU activation to the concatenated tensor z. In this setup, the feedforward path (FF) processes input from the preceding block, while the top-down path (TD) integrates inputs from multiple higher layers, creating a comprehensive feedback mechanism. This design allows for a dynamic interplay of bottom-up perceptual data and high-level predictions. Top-down connections facilitate a backward flow of information, refining feedforward outputs with knowledge acquired from more abstract representations. This approach mirrors the concept of biologically plausible models, where visual information processing involves not only a feedforward pathway but also feedback loops that dynamically adjust and refine the system's interpretation of visual inputs [van Bergen and Kriegeskorte, 2020].
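The block equations above can be turned into a runnable sketch. Channel counts and kernel sizes here are illustrative placeholders, not those of the actual CT/CLT models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownBlock(nn.Module):
    """Sketch of a CT/CLT-style block: feedback from ALL higher blocks is
    pooled to the bottom-up resolution, averaged over channels, and merged
    with the feedforward drive through an extra convolution."""
    def __init__(self, in_ch, out_ch, n_higher):
        super().__init__()
        self.ff = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                nn.GroupNorm(1, out_ch), nn.ReLU())
        self.td = nn.Sequential(nn.Conv2d(n_higher, out_ch, 3, padding=1),
                                nn.GroupNorm(1, out_ch), nn.ReLU())
        self.out = nn.Sequential(nn.Conv2d(2 * out_ch, out_ch, 1),
                                 nn.GroupNorm(1, out_ch), nn.ReLU())

    def forward(self, x, higher):
        ff = self.ff(x)
        if higher:   # ProcessLayer: pool each m_i to x's size, mean over channels
            M = torch.cat([F.adaptive_avg_pool2d(m, x.shape[-2:]).mean(1, keepdim=True)
                           for m in higher], dim=1)
        else:        # first time-step: no top-down information yet, M := 0
            M = torch.zeros(x.size(0), self.td[0].in_channels, *x.shape[-2:])
        z = torch.cat([ff, self.td(M)], dim=1)   # concat along channels
        return self.out(z)

block = TopDownBlock(in_ch=3, out_ch=8, n_higher=2)
x = torch.randn(1, 3, 32, 32)
higher = [torch.randn(1, 16, 16, 16), torch.randn(1, 32, 8, 8)]
y = block(x, higher)
```

The channel averaging inside the list comprehension is where the "partial information loss" mentioned above occurs: each higher layer contributes only a single channel to M.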
The CORblock RT block, used in the original implementation of CL ([Kubilius et al., 2018]), was used here to model within-layer connections in models with within-layer connections only (i.e., CL, C V1-V1 and C IT-IT), to keep our custom implementations closer to the original. In these models, lateral connections are established within the block: the output of a block at one timestep is used as an additional input at the next timestep (see [Kubilius et al., 2018] for more details). In CORnet S, on the other hand, within-layer connections are implemented through temporal depth (i.e., through several iterations over the same block), with skip connections that facilitate the preservation and integration of information across multiple passes of the same block, effectively approximating a very deep network architecture with shared weights. In contrast, lateral connections in BLT are implemented through dedicated convolutional layers that process inputs from the same layer at previous timesteps. These layers are specifically tasked with processing the output of the same block from the previous timestep, effectively allowing the block to integrate its past state with the current input at the "cost" of additional parameters.
A different approach was used to model within-layer connections for CLT, due to the more complex shapes and different natures of the top-down, bottom-up and lateral inputs. For each CLT block, during the forward pass for an input x at a given timestep t, with a previous state s_{t−1} and top-down input td, the output y_t and the updated state s_t can be described as follows:

ff = R(N(x ∗ θ_ff)),
rr = R(N(s_{t−1} ∗ θ_rr)),
td = ProcessTopDown(M),
x̃ = Concat(ff, rr, td),
y_t = s_t = R(N(x̃ ∗ θ_out)).

In this representation, ff corresponds to the output of the feedforward pass, computed by convolving the input x with the weights θ_ff, followed by normalization N and ReLU activation R. The processed top-down input, td, is obtained by applying a sequence of operations defined by ProcessTopDown to the concatenated top-down input M (see sub-section Top-Down Connections for more details). The term rr represents the within-layer processing of the previous state s_{t−1}, convolved with θ_rr with kernel size and stride of 1 to preserve spatial dimensions, and then normalized and ReLU-activated. The concatenation operation Concat merges the outputs ff, rr, and td into a single tensor x̃ along the channel dimension. This concatenation preserves the spatial dimensions of the feature maps, and effectively pools the distinct features extracted from each individual path into a unified representation for subsequent layers. Finally, the output y_t of the block at timestep t and the updated state s_t are obtained by convolving x̃ with the output weights θ_out, followed by normalization and ReLU activation. This output serves both as the response of the block at the current timestep and the state for the next timestep.
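The lateral path of this state update can be sketched as follows; for brevity the top-down term td is omitted, so this shows only the ff/rr portion of the CLT block (all names and sizes are illustrative, not the authors' code).

```python
import torch
import torch.nn as nn

class LateralState(nn.Module):
    """Sketch of the CLT lateral path: the block's own previous output
    s_{t-1} re-enters through a 1x1 convolution (kernel and stride 1, so
    spatial dimensions are preserved) and is concatenated with the
    feedforward drive before the output convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.ff = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                nn.GroupNorm(1, out_ch), nn.ReLU())
        self.rr = nn.Sequential(nn.Conv2d(out_ch, out_ch, 1),
                                nn.GroupNorm(1, out_ch), nn.ReLU())
        self.out = nn.Sequential(nn.Conv2d(2 * out_ch, out_ch, 1),
                                 nn.GroupNorm(1, out_ch), nn.ReLU())

    def forward(self, x, state=None):
        ff = self.ff(x)
        # At t = 0 there is no previous state, so the lateral term is zero.
        rr = self.rr(state) if state is not None else torch.zeros_like(ff)
        y = self.out(torch.cat([ff, rr], dim=1))
        return y, y          # the output doubles as the state for step t+1

block = LateralState(3, 8)
x = torch.randn(1, 3, 16, 16)
state = None
for t in range(5):           # 5 time-steps, as used in the paper
    y, state = block(x, state)
```

Unrolling the loop over 5 time-steps reproduces the iterative end-to-end sweeps described in the Model Sizes and Connections section.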
This recurrent structure enables the layer to integrate both current and previous activations, thereby enriching the artificial system with temporal dynamics and mimicking the recurrent mechanisms observed in biological visual systems ([Kietzmann et al., 2019, Kar et al., 2019]).

Model Evaluation and Statistical Analysis
After training and fine-tuning, each model was tested on our test dataset. The test dataset included 800 black and white images (100 images for each of the same 8 categories) on a gray background. The dataset was presented to the network in a control or challenging condition (see Challenging manipulations for more details).
To evaluate the performance of each DNN model, we employed the Top-1 error rate, which assesses classification accuracy based on the top prediction. Classification accuracy for each item in the test dataset was recorded for each model and each training seed, resulting in one confusion matrix per model across all seeds.
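The two metrics can be computed in a few lines; this is a generic sketch of Top-1 error and confusion-matrix bookkeeping, not the authors' evaluation code, and the function name and toy logits are ours.

```python
import numpy as np

def top1_and_confusion(logits, labels, n_classes=8):
    """Top-1 error rate and confusion matrix from model outputs.
    Rows of the confusion matrix index the true class, columns the
    predicted class."""
    preds = logits.argmax(axis=1)                # top prediction per image
    top1_error = float((preds != labels).mean())
    conf = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(conf, (labels, preds), 1)          # accumulate (true, pred) counts
    return top1_error, conf

# Toy example with 3 classes and 3 items; one item is misclassified.
logits = np.array([[2.0, 0.1, 0.0], [0.2, 1.5, 0.0], [0.1, 0.9, 0.0]])
labels = np.array([0, 1, 0])
err, conf = top1_and_confusion(logits, labels, n_classes=3)
```

Summing such matrices over the 20 fine-tuning seeds gives the per-model confusion matrix used for the human-model comparisons.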

Results
In this study, we compared humans and DNN models on a categorisation task in order to explore the respective advantages offered by different types of recurrence. To capture these potential differences we developed a rich stimulus set, sensitive enough to allow for a fine-grained test of model fit to human data. The stimulus set had two key qualities. First, it implemented a variety of visual challenges (occlusion, phase scrambling, clutter), aimed at reproducing the diversity of natural scene complexity while maintaining control over the degree of challenge. Second, it included a high number of object categories (eight categories: people, cats, birds, trees, bananas, buses, buildings and fire hydrants, see Fig 2), thereby covering an array of human recognition behaviours.
During the experiment, images were presented either manipulated (challenging conditions, 16 conditions) or non-manipulated (control condition, segmented objects on a plain background), for a total of 17 tasks to solve. Models and humans were asked to label the category each image belonged to on every trial, with human participants instructed to do so as fast and as accurately as possible. We measured model accuracy, as well as human accuracy and reaction time (RT) on the task, to quantify the difficulty posed by the visual manipulations.
For humans only, half of the trials contained backward masking. In masked trials, a phase-scrambled noise pattern was presented shortly after the stimulus. This allowed for the selective impairment of recurrent processing, and served as further confirmation that the visual manipulations we implemented were linked to recurrence.
Different DNN models were trained and tested. We ensured our models covered a wide range of architectures and sizes. The architectures we focused on were either purely feedforward or contained recurrence. Within our recurrent models, different types of recurrent connections were implemented. We used existing models from three families: C, B, and VGG ([Kubilius et al., 2018, Spoerer et al., 2017, Simonyan and Zisserman, 2015]). While the VGG family is purely feedforward and aims at maximising performance through depth, both the C and B families involve recurrence and are built as biologically plausible architectures. In contrast to the B models, where all combinations of feedback and lateral connectivity are explored, the C models included only a base feedforward model (C Z, referred to hereafter as C), a fully laterally connected model (C RT, referred to hereafter as CL), and a high-performance, enhanced version of the latter (C S). To broaden the spectrum of recurrent architectures in this family, we built four extra C architectures: C V1-V1, C IT-IT, CT and CLT. On top of C and CL, these new models allowed us to separately examine the role of lateral and feedback connections in the C family too. Beyond architecture, there were large size differences across our DNNs, with the smaller models (e.g. C, 4.5M parameters) more than twenty times smaller than the bigger ones (e.g. VGG16, 110M parameters). Overall, we used a range of models capable of dissociating between the effects of depth, size, and recurrent connectivity on performance. All our models underwent a similar training regimen: first on ImageNet, then fine-tuned on a custom dataset of images from the eight selected categories.

Visual manipulations trigger recurrent processing
Data from all six experiments were pooled together, after confirming that results on the unmasked control, which was the same for all participants, did not differ significantly across experiments (one-way ANOVA on accuracy with experiment as factor, F = 0.68, p = 0.64). Average accuracy was then computed for each participant. A first two-way ANOVA was run on average accuracy per manipulation (control, phase scrambling, occlusion and clutter) and per level of masking (mask, no mask). Results showed a significant interaction effect (F = 13.06, p = 3.89e−08) typical of masking, with performance varying differently across levels of task difficulty (see Fig 4). Post-hoc analyses showed significant differences between control and both phase scrambling (all Bonferroni corrected; p = 2.8e−41) and occlusion (p = 3.59e−25), but not clutter (p > 0.05), as well as between clutter and both occlusion (p = 1.37e−19) and phase scrambling (p = 4.16e−35). Finally, occlusion and phase scrambling also differed significantly (p = 2e−16). To further investigate the role of masking, three two-way ANOVAs were then performed on accuracy within each of the three main manipulations, with manipulation level and masking as factors. Results varied across tasks. There was no significant effect of masking, clutter difficulty (light, heavy) or their interaction (all p > 0.2). There was, however, a significant interaction effect of masking and occlusion level (easy, hard; F = 23.93, p = 1.69e−06), on top of significant main effects of masking (F = 43.45, p = 2.15e−10) and occlusion level (F = 265.29, p = 2.03e−42). We finally found a significant interaction effect of masking and phase scrambling level (low pass, high pass; F = 33.78, p = 3.65e−08), as well as significant main effects of masking (F = 45.07, p = 3.8e−10) and phase scrambling level (F = 670.55, p = 7.81e−57).
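The pooling check at the start of this analysis can be sketched as a one-way ANOVA; here is a minimal illustration on hypothetical per-participant control accuracies, assuming SciPy is available:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Hypothetical control-condition accuracies for six experiments
# (30 participants each), drawn from the same distribution
groups = [rng.normal(0.89, 0.05, size=30) for _ in range(6)]

# One-way ANOVA with experiment as factor
F, p = f_oneway(*groups)
# A non-significant p licenses pooling the experiments together
```

The within-manipulation analyses in the text are two-way ANOVAs (manipulation level x masking), which would require a factorial model rather than `f_oneway`; the sketch only covers the simpler pooling test.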

To better account for the variations in task difficulty across our manipulations, we turned to RT. Correlations were computed between the masking effect and the average RT of each of our 17 tasks (see Fig 4). The masking effect is defined as the difference in average accuracy between non-masked and masked trials, and indicates how much performance decreases in the presence of masking. We found a strong, positive correlation between average RT and masking effect (Pearson's correlation r = 0.92, p = 7.04e−08), which remained strong even after exclusion of the two most extreme values (low-pass phase scrambling and MS occluder high; Pearson's correlation r = 0.86, p = 4.5e−05).
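A minimal sketch of the masking-effect computation and its correlation with RT (the values below are hypothetical, not the study's data):

```python
import numpy as np

def masking_effect(acc_unmasked, acc_masked):
    """Per-task masking effect: drop in average accuracy when a
    backward mask follows the stimulus."""
    return np.asarray(acc_unmasked) - np.asarray(acc_masked)

def pearson_r(x, y):
    """Pearson correlation coefficient via the correlation matrix."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical per-task values (four tasks, illustration only)
acc_no_mask = [0.95, 0.80, 0.70, 0.60]
acc_mask    = [0.93, 0.70, 0.50, 0.35]
rt_ms       = [450, 520, 600, 700]

effect = masking_effect(acc_no_mask, acc_mask)
r = pearson_r(effect, rt_ms)  # strongly positive for these values
```

In the actual analysis each of the 17 tasks contributes one (masking effect, mean RT) pair, so the correlation is computed over 17 points.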
Masking effect and RT are traditionally used to measure recurrence in visual recognition, but separately. Longer RTs are typically taken as indicators of extra, post-feedforward computations, while backward masking effects are direct evidence that recurrent processing is needed to solve the task. The concurrence of both measures of recurrence in our results shows that our tasks covary in difficulty and in their reliance on recurrent processing.

Model fit to human behaviour varies with recurrence and size
Having established the role of recurrent processing in our stimulus set, we next asked how our models would perform on it. We started by comparing average model performances, in order to see whether added recurrence or larger model size would lead to an improvement at all. Average accuracy per model was collected for each of the 20 random initialisation seeds (see Fig 5). Performances were first compared across models by running a one-way ANOVA on accuracy with model as a main factor. Results showed a significant effect of model (F = 474.63, p = 5.64e−163). Post-hoc comparisons of the models' average accuracies showed significant differences across almost all pairs of models (Tukey test; all pair-wise comparisons significant, except: C - C V1-V1, C IT-IT - B, CT - BL, CLT - BT - BLT, CL - BL - BT, BT - BLT). In particular, VGG16 showed significantly higher accuracy than all other models (p < 0.001, Tukey test). This reflects a striking difference in performance, with VGG16 on average 10% more accurate than the other models. Within our recurrent models, we observed an increase in performance with added recurrent connections. The more recurrent models especially (CL, CT, CLT, CS, BL, BT & BLT) seem to perform better than their baseline counterparts (CL, CT, CLT and CS > 95% CI of C; BT and BLT >= 95% CI of B). This improvement, however, could also be explained by the increase in size undergone by models as recurrence is added to them. To check whether performance across models was driven by size, we correlated the number of model parameters with average model performance. Results showed a large correlation (Pearson's correlation r = 0.74, p = 0.003). Overall, although recurrent models perform better than their non-recurrent counterparts, this analysis points to model size as the main driver of model performance.
Better performing models are more consistent with human performance patterns
While performance is a good indicator of how well a given model can solve the complex tasks of our stimulus set, it does not take into account the variability in task difficulty. A model could have a lower overall performance but a better task-wise fit with humans. To look beyond accuracy, we took advantage of our stimulus set's variability to ask which models most consistently fit the human pattern of performance, and find difficulty where human participants found it. To do so, we divided the data into 17 tasks, from which a vector of 17 average accuracy values (one for each task) was calculated per model and for humans (see Fig 6). We then correlated each model with participants to measure their consistency with human data.
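A sketch of this consistency measure, assuming per-task accuracy arrays (the values shown are hypothetical, for five of the 17 tasks):

```python
import numpy as np

def task_vector(acc_by_task):
    """Average accuracy per task -> one vector per model
    (17 entries in the actual analysis)."""
    return np.array([np.mean(a) for a in acc_by_task])

def consistency(model_vec, human_vec):
    """Pearson correlation between a model's and humans' task-wise
    accuracy patterns."""
    return float(np.corrcoef(model_vec, human_vec)[0, 1])

# Hypothetical task-wise accuracies: the model is lower overall
# but follows the same pattern of difficulty as humans
human = np.array([0.95, 0.85, 0.75, 0.60, 0.50])
model = np.array([0.90, 0.80, 0.72, 0.55, 0.40])
```

Note how the correlation abstracts away from the overall accuracy gap: a uniformly worse model that ranks tasks the same way as humans still scores a high consistency.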
Results from this analysis (see Fig 7) show an overall high correlation (> 0.65), indicating that DNNs, on average, correlate with humans and agree on which tasks are easier or more difficult. Models overall correlate better with humans when equipped with recurrent connectivity. With the exception of C V1-V1, all recurrent models achieve larger correlation scores than their baseline equivalent (all recurrent C and B models with a mean correlation above the 95% CI of both the baseline C and B models). Notably, all these models also show a higher correlation than VGG11, but not VGG16. The latter surpasses all with a correlation score of 0.85. These results show that, at approximately equivalent sizes, some recurrent models fare better at matching human behaviour than feedforward-only models (for instance, CT & CLT do better than VGG11 with approximately similar parameter counts). However, this is not always true (for instance, C compared to C V1-V1 and C IT-IT). Regardless, the highest correlation is found for VGG16, an indication that the computational flexibility brought by recurrence in our models is not superior to that brought by added depth. In fact, model size seems to drive the consistency of models with humans (Pearson's correlation r = 0.83, p = 3.9e−04), in a way comparable to how it drives performance.

Recurrence types do not dissociate tasks
Better performing DNNs are on average more consistent with humans. This correspondence between general performance and performance consistency suggests that solving the task better leads to better matching human patterns. However, this does not rule out model-specific or connection-specific effects, whereby some tasks could be better solved by some models than others. For instance, a task could be best solved by a model exhibiting lateral recurrence, without that model performing best overall. To look for such effects and try to dissociate between types of recurrence, we checked whether the order of performance across models (see Fig 5) was repeated within each task, and whether there were exceptions to that order. For each task, one vector of thirteen values was created with the average performance of each DNN model. Each vector (17 vectors in total, one per task) was then correlated with the overall model order of performance.
Within-task model performance rankings converge with the overall results: all tasks show a strong correlation with the order of model performance (Pearson's r > 0.54, 0.76 on average), with the sole exception of many small occluder, high (correlation of 0.04). The latter deviates from the global trend, with baseline models reaching better accuracies than expected (e.g. VGG11 surpasses VGG16, B surpasses its recurrent B counterparts, C is the second best of the C models). While it is not yet clear why this task falls outside the general trend, it is noticeably more difficult than average, and might reflect a floor effect. We further address this point in the discussion. This convergence of within-task with overall model performance rankings indicates that models do not display strong task-specific effects that could link architectural features to particular task challenges.
Despite the absence of model-specific effects, task-specific effects could remain: some tasks could fall outside the human-DNN consensus, and instead be better solved by humans or by models. To check for this, we looked for outliers in the human-model agreement (see Fig 7). Two noticeable pairs of outliers were found. In the phase scrambling condition, the low pass and high pass pair, surprisingly, shows models performing better than expected on the former, and worse than expected on the latter. This clashes with the generally accepted notion that DNNs rely on high-frequency information to perform classification ([Geirhos et al., 2022, Avberšek et al., 2021]). In the occlusion condition, the MS occluder high and MS blobs high pair also falls out of distribution, with models performing surprisingly well on the former and surprisingly poorly on the latter. This might be due to the high number of black pixels in these manipulations (see Fig 2). The occluder images, indeed, have on average more black pixels than the blobs images (301e3 vs. 137e3 black pixels per image, on average). These black pixels might create a visual transient that could distract participants and interfere with performance, which could explain the unexpectedly low human accuracy.
Depth, not recurrence, makes model mistakes more human-like
In order to compare the respective advantages of recurrence and size in matching human patterns of performance, we looked at category-level errors. We used confusion matrices to get a measure of how categories are represented with respect to each other. Confusion matrices are built by counting the number of times each of the eight categories was given as a response when each of the eight categories was presented. The result is an 8x8 matrix where each cell counts the number of responses to a given category. A confusion matrix was built for humans, and one for each DNN model. The results were then visualised with principal component analysis (PCA). We ran PCAs at the image level, with two principal components fitted onto the confusion data of each individual image (see Fig S10). To quantify the agreement between models and humans, we ran a correlation analysis at the confusion matrix level. Matrices were correlated without their diagonal values, using Pearson's r (see Fig 8).
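A minimal sketch of the off-diagonal correlation between confusion matrices (the 3x3 matrices below are hypothetical; the study uses 8x8 matrices):

```python
import numpy as np

def offdiag_correlation(cm_a, cm_b):
    """Pearson correlation between two confusion matrices after
    discarding the diagonal (correct responses), so that only the
    pattern of mistakes is compared."""
    mask = ~np.eye(cm_a.shape[0], dtype=bool)
    return float(np.corrcoef(cm_a[mask], cm_b[mask])[0, 1])

# Two hypothetical confusion matrices with similar error patterns
a = np.array([[90, 6, 4],
              [8, 85, 7],
              [2, 10, 88]])
b = np.array([[80, 7, 3],
              [9, 78, 6],
              [3, 12, 82]])
```

Dropping the diagonal matters: correct responses dominate the counts, so including them would inflate every correlation and mask differences in how errors are distributed.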
The results of this analysis show a pattern diverging from that seen in the performance correlations run previously. Here, recurrence in models does not seem to confer an advantage. Rather, within the C and B families, adding recurrence decreases the correlation with humans: within each family, baseline models seem to outperform recurrent ones. To confirm the lack of an advantage of recurrence, we built 95% CIs around model correlation scores. To do so, we calculated participant-level correlations: we obtained 218 correlation values for each model, from the correlations of each participant's confusion matrix with the overall model confusion matrix. This allowed us to build CIs around the average individual-level correlations (see Fig 8). This more conservative approach confirms the observed patterns. In particular, CS and the recurrent versions of B all have average participant-level correlation scores outside of, and lower than, the CI boundaries of the correlation scores of their baseline counterparts (boundaries of 0.09 and 0.08 for C and B, respectively). Additionally, all recurrent versions of C and B show correlations lower than VGG 16 (all below its CI lower boundary of 0.13). These values are striking when compared to model-model confusion matrix correlations: all models have a relatively higher average correlation with other models (Pearson's r, seed-level correlation with all other models in the range 0.24−0.94, average 0.73). Therefore, when looking at the pattern of mistakes committed by our models, we do not find an advantage of adding recurrence.
Overall, the results of our analyses all single out VGG 16, the deep, large, feedforward-only model, as the best fit to human data. From the tests we ran, we cannot conclude that recurrent connectivity in DNNs is a better strategy than depth and overall size for achieving a task in a more human-like fashion.

Discussion
In this study, we investigated the role of recurrence in DNNs and its implication in resolving visual complexity.
We implemented a range of image-level manipulations known to involve recurrence in humans, and aimed at associating them with different types of recurrent connectivity in a set of DNN models. Our objective was to use DNNs to dissociate between distinct recurrent processes responsible for resolving different challenges during visual recognition. By comparing human and DNN behaviour in a classification task, we set out to find conditions where certain models more closely aligned with human behaviour. Our results on model performance across various architectures showcase the current limitations of recurrence in DNNs and their implications for modelling human visual recognition.
We found a nuanced relationship between recurrence, model size, and performance. The inclusion of recurrent connections in DNNs improved performance; however, this improvement was tied to the overall size of the network. Larger models, therefore, fared better on our tasks than smaller ones, with VGG16 performing best.
In a comparable way, when looking at performance consistency across tasks, we did not find that recurrent models aligned better with human difficulty perception. Instead, a similar pattern emerged whereby larger models aligned more consistently with humans, with model size broadly predicting how well a given model would do. Additionally, the accuracy ranking of models collapsed when task difficulty reached a certain level (notably on the MS blobs high occlusion task, see Recurrence types do not dissociate tasks), which seems to match reports of DNNs failing to replicate human performance in highly difficult conditions ([Geirhos et al., 2018, Ghodrati et al., 2014]). Overall, we found an agreement between models and humans on task difficulty, in line with other reports showing a tendency of DNNs to generally fail when humans fail ([Kheradpisheh et al., 2016]). We also found this agreement to be performance-dependent, with larger models displaying more consistency than smaller ones, in line with other reports ([Lee and DiCarlo, 2023]).
On top of this size comparison, we set up our experiment to be able to distinguish between different types of recurrence. By choosing and building DNNs that contained different types of recurrence in different layers, we aimed at finding task-dependent effects of recurrent processing. However, we did not find any such distinctions in our results, as evidenced by the overall consensus of performance rankings across tasks (see Better performing models are more consistent with human performance patterns). This null result is surprising considering the ample evidence for region-specific perceptual phenomena in the brain. One could have expected, for instance, C V1-V1 to behave differently in tasks requiring figure-ground modulation (e.g. clutter tasks), given the known role of recurrent processing in V1 in this phenomenon ([Lamme et al., 1998, Self et al., 2013]). This result, combined with the generally performance-dependent model consistency, shows that the recurrent connections we implemented do not critically change the behaviour of our models.
When considering task performance and consistency, our results align well with the notion that recurrent neural networks can be considered time-wrapped equivalents of size-matched feedforward networks ([van Bergen and Kriegeskorte, 2020, Liao and Poggio, 2020]). Recurrent architectures, though, could be considered superior because they are more brain-like. Furthermore, recurrence makes it possible to increase the number of computations without increasing the number of neurons. Thus, if we were to express model size in terms of the number of units rather than the number of parameters, recurrence would yield improved performance without an increase in 'size'.
However, even with this handy way out to promote recurrence as a solution with unique benefits, recurrence faces other challenges. Our examination of confusion matrices provided a finer-grained comparison of models, and showed, contrary to expectations, a drop in the alignment of recurrent models, specifically CS and BLT, with human behaviour. This corroborates reports showing that DNNs rely on different strategies to perform visual recognition ([Lonnqvist et al., 2019, Biscione and Bowers, 2023, Hosseini et al., 2017]). Importantly, we found a characteristic trade-off between overall performance and fit with human data ([Fel et al., 2022]) within the recurrent model families C and B, but not VGG.
This challenges the observation that recurrent models better mimic the representational patterns of the brain than non-recurrent ones ([Nayebi et al., 2022, Kietzmann et al., 2019, Kar et al., 2019, Spoerer et al., 2020]). While recurrent models are reported to better match the brain's functioning, our study suggests that they could simultaneously be worse models of human behaviour. We therefore stress behavioural fidelity as a metric for developing better models of human visual recognition.
Strikingly, although we used a design equipped for it, we found no clear dissociations between the effects of specific types of recurrence. Lateral connections did not induce specific effects compared to top-down connections, either in general or in a layer-dependent way. This lack of dissociation could mean that the implementation of recurrence within DNNs does not inherently alter the mechanisms of information processing. Rather than fundamentally reshaping how inputs are processed, recurrent processing, in its current implementation, seems to operate similarly to added depth, as a tool adding to the computational power of a network. This is in line with our observation that, despite vast differences in architecture, the approach to visual recognition, as indexed by confusion matrices, remains relatively homogeneous for a given model size (see Fig S11).
It is worth noting that each of the recurrent connections in our models can be implemented in multiple ways. Here, we adhered to the common implementation as used by architectures such as C and B; yet there is a wide range of other options and frequent developments in the field ([Nayebi et al., 2022, Wang and Hu, 2022, Linsley et al., 2019, Fukui et al., 2019, Lotter et al., 2017, Lazar et al., 2009, Konkle and Alvarez, 2023]). In addition, even though the visual challenges we created were motivated by previous research pointing to a potential specific role of lateral versus top-down connections ([Rajaei et al., 2019, Seijdel et al., 2021, Bar et al., 2006]), there might be better manipulations to be found. Overall, before we can accept the hypothesis that the type of recurrence does not matter, further studies are needed with more refined, biologically plausible implementations of recurrence and with additional visual challenges.
The need to search further in the space of visual manipulations also finds support in the wide variety we found across our 17 tasks. Interesting distinctions emerged, with seemingly similar tasks yielding surprisingly different results. The implementation of many small occluders, for instance, brought two such experimental conditions: MS occluder, high and MS blobs, high. While in the former both humans and models see significant drops in their performance, the latter shows disagreement, with humans finding it much less difficult. A similar distinction can be found by comparing phase scrambling in the low pass and high pass conditions. Even in the absence of model-specific effects, our results highlight the behavioural variability that can be measured using a rich stimulus set.
In conclusion, contrary to our expectations, our findings highlight the limitations of recurrence as implemented in our models. Furthermore, they point to an overall discrepancy between visual processing in DNNs and in humans. While we reproduce the general ability of DNNs to capture patterns of task difficulty, our results do not support the claim that recurrent models outperform their feedforward counterparts in capturing human visual recognition processes. As a consequence, we stress the importance of behavioural fidelity as a metric in developing new models, and emphasise the need for refined, biologically plausible implementations of recurrence.

Supporting information
Table 1: General model information. Extra information about model parameter sizes, connections (lateral/feedback) and references.
Table 3: Custom model extra information: C IT-IT. Layer-specific information (type, input and output shape, parameters, trainability) about C IT-IT (CORnet IT-IT).
Table 4: Custom model extra information: CT. Layer-specific information (type, input and output shape, parameters, trainability) about CT (CORnet T).

Figure 2 :
Figure 2: Stimulus set & task. (A) Full set of stimulus manipulations, represented by one selected exemplar per category. Each row represents a particular manipulation. Colours above the top row indicate the eight categories (there were 10 exemplar images per category; one example is selected here). PS: phase scrambling, FL: few large, MS: many small. (B) The eight categories included in the stimuli. (C) Experimental paradigm used for human participants.

Figure 3 :
Figure 3: Model architectures. Rightward arrows represent feedforward connections. Circular, backward arrows above layers represent lateral connections. Leftward arrows represent feedback connections. The dashed arrows of CS represent the ResNet-inspired residual block structures. (A) Eight models were used from the CORnet family, including three readily available and five custom-made models. From [Kubilius et al., 2018], we selected the base FF model CORnet Z (C), the lateral-connections-equipped CORnet RT (CL) and the highest-performing CORnet S (CS). Custom-made versions of these architectures were built, including full top-down connections (CT), full top-down and lateral connections (CLT), and specific lateral connections in layers V1 (C V1-V1) and IT (C IT-IT). (B) The four available models of the B family were used. These models share a baseline 2-layer architecture (B), with the addition of either lateral connections (BL), top-down connections (BT), or both (BLT). (C) VGG models were used as feedforward controls to the recurrent models. We used the state-of-the-art VGG16 model, as well as a custom-made, smaller VGG11 to better match the size of our recurrent models.

Figure 4 :
Figure 4: Human behavioural results. (A) Masking shows an interaction effect with manipulations, and impairs performance in occlusion and phase scrambling but not clutter. PS: phase scrambling. (B) A strong correlation is found between average RT and masking effect (r = 0.93). Their relationship is shown per task. Each dot represents a task (17 tasks in total). The line shows the best linear regression fit.

Figure 5 :
Figure 5: Model performance results. (A) Model performances: bar plot of average performance per model. Error bars represent SEM. The blue horizontal bar represents average human performance (0.89). (B) Scatter plot of average model accuracy against model size, in number of parameters. Note that the x-axis is on a logarithmic scale.

Figure 6 :
Figure 6: Performance across tasks per model. Each diamond represents the average performance of a model on a given task (bars on the diamonds indicate SEM). Human performance is shown in blue, with the bottom of the blue bars indicating average human performance. While models on average perform lower than humans, they overall follow the same pattern.

Figure 7 :
Figure 7: Model consistency with humans. (A) Correlation between human and model performance across the 17 tasks. Error bars show a 95% CI around the mean of the correlations calculated from each of the model's 20 initialisation seeds. The blue shaded area represents the noise ceiling, calculated as the average split-half reliability of the human data (over 100 iterations). (B) Scatter plot of human and average model performance per task. Each dot represents a task. The grey line shows the best linear regression fit, with the grey shaded area showing a 95% confidence interval (CI) around the regression.

Figure 8 :
Figure 8: Correlating confusion matrices. (A) Bar plot of correlations between the human confusion matrix and the model confusion matrices. Bars indicate the standard error of the mean, calculated from the 20 random starting initialisations of each model. The blue shaded area represents the noise ceiling, calculated as the average split-half reliability of the human group-level confusion matrix (over 100 iterations). (B) Average individual-participant confusion matrix correlation with each model. Bars represent a 95% confidence interval around the mean. VGG 16 shows a higher correlation than any other model (all models below its lower CI boundary, 0.13).

Figure S9 :
Figure S9: Colour-blind friendly performance point plot. A less colour-heavy reproduction of Figure 6. The colour palette has been adapted for more contrast. Shape distinctions have been added as well.

Figure S10 :
Figure S10: Model PCA plots. PCA plots built from the category-wise confusion data of individual images, for each model and for human participants.

Figure S11 :
Figure S11: Within-model confusion matrix correlations. Correlation matrix (Pearson's r) of the pair-wise confusion matrix correlations of all models.
This dual-phase training approach produced 20 uniquely fine-tuned versions of each model, ensuring robust adaptation to our classification task while preserving foundational ImageNet-acquired knowledge. The Adam optimizer with a Cross Entropy loss function was employed throughout both phases of training. Training was executed in parallel on 3 NVIDIA GeForce RTX 2080 GPUs. The models were implemented using PyTorch version 2.1.2 [Paszke et al., 2019], TorchVision version 0.16.2 [TorchVision maintainers and contributors, 2016], and fastai version 2.7.13.

Table 7 :
Average performance per task per model. Detailed average performance per model and per task. Acronyms: FL: few large, MS: many small.