Iterative convergent computation may not be a useful inductive bias for residual neural networks

Recent work has suggested that feedforward residual neural networks (ResNets) approximate iterative recurrent computations. Iterative computations are useful in many domains, so they might provide good solutions for neural networks to learn. Here we quantify the degree to which ResNets learn iterative solutions and introduce a regularization approach that encourages learning of iterative solutions. Iterative methods are characterized by two properties: iteration and convergence. To quantify these properties, we define three indices of iterative convergence. Consistent with previous work, we show that, even though ResNets can express iterative solutions, they do not learn them when trained conventionally on computer vision tasks. We then introduce regularizations to encourage iterative convergent computation and test whether this provides a useful inductive bias. To make the networks more iterative, we manipulate the degree of weight sharing across layers using soft gradient coupling. This new method provides a form of recurrence regularization and can interpolate smoothly between an ordinary ResNet and a “recurrent” ResNet (i.e., one that uses identical weights across layers and thus could be physically implemented with a recurrent network computing the successive stages iteratively across time). To make the networks more convergent we impose a Lipschitz constraint on the residual functions using spectral normalization. The three indices of iterative convergence reveal that the gradient coupling and the Lipschitz constraint succeed at making the networks iterative and convergent, respectively. However, neither recurrence regularization nor spectral normalization improve classification accuracy on standard visual recognition tasks (MNIST, CIFAR-10, CIFAR-100) or on challenging recognition tasks with partial occlusions (Digitclutter). Iterative convergent computation, in these tasks, does not provide a useful inductive bias for ResNets.


Introduction
An iterative method solves a difficult estimation or optimization problem by starting from an initial guess and repeatedly applying a transformation that is known to improve the estimate, leading to a sequence of estimates that converges to the solution. Iterative methods provide a powerful approach to finding exact or approximate solutions where direct methods fail (e.g., for difficult inverse problems or solutions to systems of equations that are nonlinear and/or large).
October 10, 2023 1/17
Recurrent neural networks (RNNs) iteratively apply the same transformation to their internal representation, suggesting that they may learn algorithms similar to the iterative methods used in mathematics and engineering. The idea of iterative refinement of a representation has also driven recent progress in the context of feedforward networks. New architectures, based on the idea of iterative refinement, have allowed for the training of very deep feedforward models with hundreds of layers. Prominent architectures for achieving high depth are residual networks (ResNets; 1) and highway networks (2), which use skip connections to drive the network to learn the residual: a pattern of adjustments to the input, thus encouraging the model to learn successive refinements of a representation of the input that is shared across layers.
These architectures combine two ideas. The first is to use skip connections to alleviate the problem of vanishing or exploding gradients (3). The second is to make these skip connections fixed identity connections, such that the layers learn successive refinement of a shared representational format.
The second idea relates residual and highway networks to RNNs and iterative methods. Learning a single transformation that can be iteratively applied is attractive because it enables trading speed for accuracy by iterating longer (4). In addition, a preference for an iterative solution may provide a useful inductive bias for certain computational tasks.
Moreover, deep neural networks were originally inspired by neuroscience and have, in turn, contributed to a better understanding of human cognition and, in particular, vision. The primate visual system has often been suggested to perform iterative refinement (5; 6). This principle could therefore allow us to better understand the shared ideas behind deep neural networks and primate vision. Enhancing ResNets' preference for an iterative solution could also make them more similar to the primate visual system.
However, it is unclear whether ResNets indeed learn solutions akin to iterative methods and, if they do, whether this is a useful inductive bias. The two defining features of iterative methods are (a) iteration and (b) convergence. Here we analyze to what extent these features emerge in ResNets. In order to investigate whether these features provide a useful inductive bias, we introduce two simple modifications of classical ResNets and study their impact on a number of datasets. First, we study CIFAR-10, CIFAR-100, and MNIST (7; 8) as examples of classical vision tasks, assessing the networks' performance and sample efficiency. Since iterative and convergent inductive biases may be more useful for tasks that require some degree of recurrence, we also assess the networks' performance and sample efficiency on several variations of Digitclutter, a task that requires the recognition of multiple digits that occlude each other (9).
To study the effect of iteration, we manipulate the degree of weight sharing in ResNets, smoothly interpolating between ordinary and recurrent ResNets. We find that a higher degree of weight sharing tends to make the network more iterative, but does not result in improved performance or sample efficiency. This suggests that in ordinary ResNets, recurrent connections do not provide a useful inductive bias and the networks can harness the additional computational flexibility provided by non-recurrent residual blocks.
Recurrence implies iteration, but not convergence, and so is not sufficient for a network to implement an iterative method as defined above. ResNets, whether they are recurrent (i.e., sharing weights across layers) or not, are therefore neither required nor encouraged to learn a convergent algorithm during training. We demonstrate empirically that ResNets in general do not exhibit convergent behavior and that recurrent ResNets are more convergent than non-recurrent networks. To study the effect of convergence, we upper bound the residual blocks' Lipschitz constant. This modification adversely impacts performance, suggesting that the non-convergent behavior in ordinary ResNets is not merely due to lack of incentive, but underpins the networks' high performance. Across convergent ResNets, a higher degree of weight sharing does not negatively affect performance. This suggests that convergent ResNets, in contrast to non-convergent ones, do not benefit from the increased computational flexibility of non-recurrent residual blocks.
Taken together, our results suggest that an inductive bias favoring an iterative convergent solution does not outweigh the computational flexibility of non-recurrent residual blocks for the considered tasks.

Related work
Prior theoretical work has focused on explaining the success of ResNets (1) and the more general class of highway networks (2) by studying the learning dynamics in ResNets (3; 10; 11) and their interpretation as an ensemble of shallow networks (12; 13), as a discretized dynamical system (14; 15; 16), and as performing iterative refinement.
The iterative refinement hypothesis. Our work builds on (17), who argue that the sequential application of the residual blocks in a ResNet iteratively refines a representational estimate. Their work builds on observations that dropping out residual blocks, shuffling their order, or evaluating the last block several times retains reasonable performance (12; 18) and can be used for training (19). Another set of methods uses such perturbations to train deep neural networks, using stochastic depth (19; 20; 21).
Other methods learn to evaluate a limited number of layers that depends on the input (22; 23) or learn increasingly fine-grained object categories across layers (24). Instead of using perturbations to encourage stability of the trained network, (25) propose a method inspired by dynamical systems theory to guarantee such stability in their model.
Iterative refinement and inverse problems. The iterative refinement hypothesis is particularly important in the context of inverse problems, which are often solved using iterative methods. This is particularly relevant for ResNets trained for perceptual tasks since perception is often conceptualized as an inverse problem (5; 6). (26) modeled visual cortex using recurrent neural networks that iteratively infer the latent causes of a hierarchical Bayesian model. Though these networks have been applied to complex datasets (27), most models either learn the inverse model (28; 29) or define an analytically invertible forward model (30; 31). Notable exceptions are invertible ResNets (32), whose inverse can be computed through a fixed point iteration. Another set of works starts out with a classical iterative algorithm and unfolds this algorithm to create a deep network, approaching a similar problem from the opposite direction (33; 34; 35; 36). Moreover, (37) refine CNNs to yield an inference algorithm on a specific generative model (recently implemented by 38).
Recurrent and residual neural networks. The idea of iterative refinement has motivated an increasing number of recurrent convolutional neural networks being applied to computer vision without an explicit implementation of iterative inference (9; 39). ResNets may be seen as RNNs that have been unfolded for a fixed number of iterations. Sharing weights between the different blocks allows us to train a recurrent residual neural network (40). In this framework, recurrent residual neural networks may be seen as a special case of residual neural networks where the weights are equal between all blocks. (41) and (42) relaxed this constraint by defining the weights of the different blocks as a linear combination of a smaller set of underlying weights. The work by (41) is particularly interesting in this context, as their method yielded a more efficient parameterization of Wide ResNets (43), which generally have more channels, but far fewer layers than the architectures we consider here.
Inductive bias of recurrent operations on visual tasks. The inverse problem perspective on perception suggests that ordinary recognition tasks in computer vision may already benefit from an iteratively convergent inductive bias. For other tasks, we have additional reasons to believe that recurrent processing may be beneficial. For example, object recognition in the primate ventral stream can be well captured by feedforward models (44; 45) but benefits from recurrent processing under challenging conditions. This includes tasks where the presented objects are partially occluded or degraded (46) or tasks that involve perceptual grouping according to local statistical regularities (47) or object-level information (48). These observations have inspired tasks such as Digitclutter and Digitdebris (9) as well as Pathfinder (49) and cABC (50). For these datasets, recurrent networks have been shown to outperform corresponding feedforward control models (9; 51). This method has inspired several promising models in computer vision. Deep equilibrium models (56) harness a quasi-Newton method to find the fixed point of their recurrent operation. They have recently been applied to object recognition and image segmentation tasks (57). These works empirically demonstrated the existence of a fixed point under their parameterization. Other models enforce such a fixed point through the Lipschitz constraints that we also employ here. (25) use an upper bound based on the Frobenius norm, whereas (58) approximate the Lipschitz constant using a vector-Jacobian product. Our method (see below) is based on (59) and has recently been employed (for different purposes) by (60) and (32).

Do ResNets implement iterative computations?
Iterative methods are characterized by two key properties: iteration (i.e., recurrent computation) and convergence. In this section, we examine whether residual neural networks have these properties. We start by showing that ResNets can express iterative algorithms, but then demonstrate that ResNets do not automatically learn iterative algorithms. We show that this observation extends to large-scale neural networks trained on real data. Finally, we introduce a paradigm that allows us to compare neural networks to iterative methods.

ResNets can represent iterative computations
A popular application for iterative methods is given by inverse problems of the form $x = f(z)$. It is often not possible to directly compute the inverse, $f^{-1}$, of the forward model, because of analytical or combinatorial intractability. Instead, approximations (61) or iterative error-corrective algorithms (62) are used. Consider a linear forward model $f$. Though $f$ has an analytical inverse, it serves as an illustrative example. Based on an input $x$, we can infer the latent variable $z$ using an iterative update that starts from an initial estimate $\hat{z}^{(0)}$ and repeatedly corrects the current estimate $\hat{z}^{(i)}$ according to its prediction error, scaled by a step size $\epsilon_i$. Here $\hat{z}^{(T)}$, for some $T$, is the estimate of $z$ and $\epsilon_i > 0$ should be sufficiently small.
Figure 1. a The error is used to update $\hat{z}$, whereas $x$ is simply retained throughout all blocks. b Trajectories of the estimates $\hat{z}_1$, $\hat{z}_2$ across blocks in a feedforward network, iterative steps in the error-corrective algorithm, residual blocks in a trained ResNet, and residual blocks in a trained recurrent ResNet. All four methods converge to the correct estimate (indicated by black 'x'). c Dropping out the fourth block (unbroken line) has a minor impact on the ResNet. d If the last block is iteratively applied to the final estimate, the value diverges for both the residual and the recurrent network (broken lines), indicating that they do not learn a convergent structure.
This update can be implemented in a small ResNet. Figure 1a contains a schematic illustration of one block of this network. The representation spanned by an encoder in the beginning of the network contains an input representation $x$ and a representation of the current estimate $z$. In the hidden layer of the residual block, we determine the positive and negative prediction errors and use them to update $z$. This recurrent residual building block would implement an iteratively convergent computation in a ResNet. The linear model is, of course, a trivial case, but serves as an illustration of the appeal of this approach. A wide neural network is a flexible function approximator, and learning to represent the prediction error instead of the prediction is easier in many cases (24).

ResNets do not automatically learn iterative computations
The fact that ResNets can express iterative computations does not imply that they necessarily learn iterative solutions when trained via backpropagation.
Here we consider the simple example from above to better understand the behavior that may emerge. We highlight three behaviors that distinguish iterative from non-iterative computations and will examine the behavior of large-scale neural networks in the following sections. As an example, we train a conventional ResNet and a ResNet that uses the same weights in each block (equivalent to a recurrent network) to invert the function $f(z) := (\tfrac{3}{2} z_1, \tfrac{3}{4} z_2)$. We contrast their behavior with the iterative error-corrective algorithm outlined above.
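To make the error-corrective baseline concrete, the following is a minimal sketch for the toy forward model, read here as $f(z) = (\tfrac{3}{2} z_1, \tfrac{3}{4} z_2)$. The specific update rule (a gradient step on the squared prediction error), the zero initialization, the constant step size `eps`, and all function names are assumptions made for illustration; the text only states that the prediction error is used to update the estimate with a sufficiently small step size.

```python
# Illustrative sketch only: the update rule, zero initialization, and constant
# step size are assumptions, not taken from the paper.
A_DIAG = (1.5, 0.75)  # diagonal of the linear forward model f(z) = (3/2 z1, 3/4 z2)


def forward(z):
    """Apply the linear forward model coordinate-wise."""
    return [a * zi for a, zi in zip(A_DIAG, z)]


def error_corrective_inverse(x, steps=50, eps=0.4):
    """Estimate z from x by repeatedly correcting the estimate with its prediction error.

    Each step is z_hat <- z_hat + eps * A^T (x - A z_hat). For this diagonal model
    the iteration converges whenever 0 < eps * a_i**2 < 2 for every coordinate.
    """
    z_hat = [0.0, 0.0]
    trajectory = [list(z_hat)]
    for _ in range(steps):
        error = [xi - pi for xi, pi in zip(x, forward(z_hat))]
        z_hat = [zi + eps * a * ei for zi, a, ei in zip(z_hat, A_DIAG, error)]
        trajectory.append(list(z_hat))
    return z_hat, trajectory


z_true = [1.0, -2.0]
z_est, traj = error_corrective_inverse(forward(z_true))
```

In this sketch the distance between the estimate and `z_true` shrinks at every step, and additional steps beyond `steps` leave the estimate near the solution, which is the behavior the trained networks are compared against.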
Due to the lack of constraints, a non-residual feedforward neural network changes its representation in every layer. As a consequence, the linear decoder at the end of the network is not aligned to the intermediate representation and early readout (as depicted by the orange dots in Fig. 1b) leads to a meaningless estimate. In contrast, the skip connections encourage a ResNet to use the same representational format across blocks. This is to say that its intermediate representations are better aligned with the final decoder. Early readout is therefore possible and the representation across blocks will approach the final estimate. As a consequence, the across-block dynamics of the non-residual network are meaningless, whereas the recurrent and residual network's early readouts are close to the final estimate and approach it in a smooth manner, just like the error-corrective algorithm (Figure 1b). Since iterative methods iteratively refine their initial estimate, their behavior is more similar to the ResNets' monotonic convergence.
Aside from their smooth convergence, a fundamental property of iterative methods is their recurrence (i.e., the repeated use of the same computation). This means that dropping out an earlier block has the same effect as dropping out the last one. We can relax the requirement for exact repetition (weight sharing) and require merely similar computations. Figure 1c illustrates that the trained ResNet is indeed relatively robust to block dropout.
Yet the learned models fall short of an iterative method, which is apparent from a third mode of investigation. Iterative convergence would imply that applying the last block's transformation iteratively should keep the readout in the vicinity of the actual estimate. This is clearly not the case in our toy example (Figure 1d). Rather than representing a convergent estimate, this result is more compatible with understanding ResNets as approaching their final estimate at a constant speed and, in the case of late readout, moving past this estimate and overshooting at the same speed. Notably, both the non-recurrent and recurrent networks exhibit this behavior. This behavior is not surprising. After all, the network is trained to work for a fixed number of steps and not constrained to stay within the vicinity of its final estimate if more steps are added. However, it reveals that even recurrent ResNets do not automatically learn iterative convergent computations. To assess the extent to which they do learn iterative convergent computations, we define three continuous indices, which measure convergence, recurrence, and divergence (defined below).

Iterative Convergence Indices
We evaluate the indices for six instances of ResNet-101, trained on CIFAR-10 (7). This ResNet consists of three stages of 16 residual blocks with 16, 32, and 64 channels, respectively, using the architecture recommended by (63). S1 Appendix includes details on the architecture and the training paradigm. The ResNet achieved 5.2% classification error on the validation set. To characterize the extent to which the ResNets have learned an iterative convergent computation, we introduce three indices measuring different aspects of such computations.
Convergence Index. Viewing ResNets as performing iterative refinement suggests that each stage gradually approaches its final estimate before passing this estimate to the next stage using a downsampling layer. By passing the estimate at each of the residual blocks to the next stage, we can monitor how the stages approach their final estimate across blocks (see Fig 2a, left panel). In accordance with previous results (17), we find that all stages smoothly approach their final estimate, confirming the earlier intuition of a shared representational format. To measure the rate of convergence, we compute the area under the curve (AUC) of the classification error, which we call the Convergence Index. We invert and normalize this value such that a Convergence Index of 0 corresponds to chance-level read-out at each residual block, whereas a Convergence Index of 1 corresponds to an instant convergence to the final classification error at the first residual block. Figure 2b, left panel, depicts this value for each stage, and averaged across stages.
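One possible way to turn the description of the Convergence Index into a computation is sketched below. The exact normalization is an assumption: the sketch averages the excess error of the early read-outs over the final error and divides by the gap between chance-level error and final error, which reproduces the two anchor points stated in the text (0 for chance-level read-out at every earlier block, 1 for instant convergence). The function name and arguments are hypothetical.

```python
def convergence_index(errors, chance_error):
    """Normalized, inverted area under the early read-out error curve.

    errors: classification error when reading out after each block of a stage,
            with the last entry being the error of the full stage.
    chance_error: error rate of a chance-level classifier (e.g. 0.9 for 10 classes).

    The normalization below is an assumption chosen to match the two anchor
    points described in the text, not necessarily the paper's exact formula.
    """
    *early, final = errors
    if not early:
        raise ValueError("need at least two read-outs")
    if chance_error <= final:
        raise ValueError("final error must be below chance level")
    mean_excess = sum(e - final for e in early) / len(early)
    return 1.0 - mean_excess / (chance_error - final)


# Example: a stage whose read-out error falls steadily towards its final value.
example_index = convergence_index([0.60, 0.30, 0.10, 0.05], chance_error=0.9)
```

Under this normalization, the example stage receives an index strictly between 0 and 1, and a stage that reaches its final error earlier receives a higher value.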
Recurrence Index. To measure the degree of recurrence, we evaluated the effect of dropping out individual blocks on the error rate of the network (see Fig 2a, middle panel). In a non-recurrent ResNet, dropping out earlier blocks may have a stronger effect on the error rate than dropping out the last block. In contrast, in a recurrent ResNet, the effect on error rate is the same for dropping out either earlier blocks or the last block. We therefore computed the difference in error rate observed after these two manipulations. We summarized the behavior by the AUC, which we refer to as Recurrence Index (RI). We invert and normalize this value such that the RI is 0 if dropping out any block leads to an error rate at chance level and the RI is 1 in the case of a recurrent algorithm. Even though we study non-recurrent ResNets, dropping out any block other than the very first leads to a negligible drop in performance, replicating previous results (12; 17). As a consequence, the RI, across all stages, is close to 1.0 (see Fig 2b, middle panel).
Divergence Index. ResNets may either converge to their final estimate or simply approach it in a sequence of steps. An iterative algorithm should not be negatively affected by additional applications of the same function. To examine this property, we apply the last block of each stage for up to sixteen additional steps (see Fig 2a, right panel). We find that no stage is particularly robust to such additional evaluations, though the first stage has the lowest Divergence Index (DI), indicating that it is the most robust. This suggests that ResNets approach and move away from their final estimate in a sequence of steps, with their computations bearing little similarity to an iterative convergent algorithm. A high DI does not indicate that the ResNet has failed in some way. After all, it was not trained to be robust to such perturbations. However, it indicates that the ResNet may not implement an iterative convergent computation.

Manipulating convergence and iteration in residual networks
We provided indices measuring convergence, recurrence, and divergence to assess the degree to which a ResNet implements an iterative method. Even though they are able to, ResNets do not necessarily learn to implement a purely iterative method. In particular, they show divergent behavior. Nevertheless, as we have shown above, their behavior does show some similarity to iterative methods and their success has been attributed to these similarities (17; 18). This suggests that even though the parameterization and optimization do not promote the emergence of an iterative method in a ResNet, a ResNet with iteratively convergent behavior may still have a better inductive bias. To test this hypothesis, we control the inductive bias of ResNets, namely their recurrence and convergence.

Soft gradient coupling can interpolate smoothly between ordinary and recurrent optimization
We propose a method to blend between recurrent and non-recurrent networks without changing the architecture or the loss landscape. The method is motivated by the observation that we can train a recurrent neural network by sharing the different blocks' gradients (17; 40). In ordinary ResNets, the residual block $t$ with the weights $W_t$ is changed by following the gradient $\Delta_t = \partial_{W_t} L$, whereas RNNs impose a gradient that is shared by all residual blocks within a stage, where the weights across these blocks must start from the same initialization. The former means that we do not employ any inductive bias towards recurrence, whereas the latter imposes a possibly overly restrictive function space on the architecture. To address both limitations, we propose soft gradient coupling, whose update rule interpolates between these two gradients according to a coupling parameter $\lambda \in [0, 1]$. For $\lambda < 1$, this retains the entire space of computations, enabling both non-recurrent and recurrent computations. However, for $\lambda > 0$, the optimization is biased to find more recurrent optima. In contrast to penalty regularizations or strict weight sharing models (41; 42), this does not change the network or loss landscape, but simply the accessibility of different local minima of the loss landscape.
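As a rough illustration of soft gradient coupling, the sketch below mixes each block's own gradient with a gradient shared across the blocks of a stage. The convex combination with weight `lam` and the use of the mean (rather than, say, the sum) as the shared gradient are assumptions made for this example; the function name and the flat-list representation of gradients are likewise hypothetical.

```python
def soft_gradient_coupling(block_grads, lam):
    """Blend per-block gradients with a gradient shared across blocks.

    block_grads: list with one flat gradient (list of floats) per residual block;
                 all blocks of a stage are assumed to have the same shape.
    lam: coupling parameter in [0, 1]; 0 leaves the gradients untouched,
         1 gives every block the same (shared) gradient.

    Assumption: the shared gradient is the mean over blocks and the blend is
    a convex combination. This is one plausible form, not a confirmed one.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    n_blocks = len(block_grads)
    mean_grad = [sum(column) / n_blocks for column in zip(*block_grads)]
    return [
        [(1.0 - lam) * g + lam * m for g, m in zip(grad, mean_grad)]
        for grad in block_grads
    ]


grads = [[1.0, 0.0], [3.0, 2.0]]  # two blocks, two weights each
half_coupled = soft_gradient_coupling(grads, 0.5)
```

With `lam = 1` and identical initial weights, every block receives the same update at each step, so the weights stay tied and the stage behaves like a recurrent network; intermediate values only bias the blocks towards similar weights.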

Spectral normalization can guarantee convergence in residual networks
Iterative methods preserve a stable output when applied repeatedly. In contrast, the output of the ResNets diverged when the last block was applied repeatedly beyond the number of steps it was trained for. In order to control the degree of convergence in a ResNet, we constrain the Lipschitz constant $L$ of the residual function $f$. $L$ is defined as the minimal value such that for any inputs $x$, $y$, $\|f(x) - f(y)\| \leq L\|x - y\|$. The smaller $L$, the more stable $f$. The Lipschitz constant is hard to determine accurately as it is a global property of $f$. We therefore determine an upper bound $L(f) \geq L$ based on the linear operations within $f$. The next section will detail how this upper bound is computed. Using this upper bound, we replace $f$ by its spectral normalization, which rescales $f$ such that its Lipschitz constant is at most a chosen constant $\mu$.
For a given input $x$, a recurrent residual stage is defined by the iterative application $z_0 := x$, $z_t := R(z_{t-1})$, where $R$ defines the recurrently applied residual block. We wish to guarantee that $z_t$ eventually converges to a fixed point $z_\infty$ and hope that empirically, $z_T$, the representation after the specified number of iterations, will be close to this fixed point. According to the Banach fixed point theorem, one way to guarantee such convergence is to require that the residual block's Lipschitz constant $L_R$ be smaller than one.
We can achieve this by replacing the residual connections between adjacent blocks by residual connections between the input to the stage and each block, i.e., $R(z) := x + f(z)$. $R$ has the same Lipschitz constant as $f$. Setting $\mu < 1$, we therefore guarantee that $R$ converges to a fixed point defined by $x$. We call this network the properly convergent ResNet (PCR).
To get a network more similar to an ordinary ResNet, we consider $R(z) := z + f(z)$. Though convergence is not guaranteed for the network defined by this residual block, we will demonstrate its empirical convergence. We thus call this network the improperly convergent ResNet (ICR).
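The convergence argument for the properly convergent block $R(z) = x + f(z)$ can be illustrated with a scalar toy example. The residual function used here, $f(z) = \mu \tanh(z)$, is purely a stand-in chosen because its Lipschitz constant is at most $\mu$; it is not the residual function of any network in the paper, and the names below are hypothetical.

```python
import math

MU = 0.95  # Lipschitz bound of the toy residual function (must be < 1)


def f(z):
    """Toy residual function: |tanh'| <= 1, so mu * tanh has Lipschitz constant <= mu."""
    return MU * math.tanh(z)


def iterate_pcr(x, steps=200):
    """Iterate z_0 = x, z_t = x + f(z_{t-1}) and return the whole trajectory."""
    z = x
    trajectory = [z]
    for _ in range(steps):
        z = x + f(z)
        trajectory.append(z)
    return trajectory


traj = iterate_pcr(2.0)
# Distance between successive iterates; a contraction shrinks it by a factor <= MU.
step_sizes = [abs(b - a) for a, b in zip(traj, traj[1:])]
```

Because successive step sizes shrink by at least the factor `MU`, the iterates form a Cauchy sequence and settle on the fixed point, which is the content of the Banach fixed point theorem in this setting. The same argument does not apply to $R(z) = z + f(z)$, whose Lipschitz constant can be as large as $1 + \mu$.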

Upper bound on the Lipschitz constant
To determine an upper bound on the Lipschitz constant of $f$, we use an alternative characterization based on the Jacobian $J_f(x)$. The spectral norm $\|A\|_2$ of a matrix $A$ is defined as its maximal singular value. The Lipschitz constant is then given by
$$L = \sup_x \|J_f(x)\|_2.$$
According to the chain rule,
$$J_f(x) = J_{\mathrm{Conv}_2}\, J_{\mathrm{ReLU}}(z_2)\, J_{\mathrm{BN}_2}\, J_{\mathrm{Conv}_1}\, J_{\mathrm{ReLU}}(z_1)\, J_{\mathrm{BN}_1}.$$
Here $z_1$ and $z_2$ are given by the representation at the appropriate intermediate stages of the residual function, i.e., $z_1$ is the representation after $\mathrm{BN}_1$ and $z_2$ is the representation after $\mathrm{BN}_2$. The Jacobians of the batch normalizations and convolutions do not depend on the input as these are linear operations. Since $J_{\mathrm{ReLU}}$ and $J_{\mathrm{BN}_1}$ are both diagonal matrices, they commute and therefore
$$J_f(x) = J_{\mathrm{Conv}_2}\, J_{\mathrm{ReLU}}(z_2)\, \left(J_{\mathrm{BN}_2}\, J_{\mathrm{Conv}_1}\, J_{\mathrm{BN}_1}\right) J_{\mathrm{ReLU}}(z_1).$$
We have therefore split $J_f$ into a product of two constant matrices and two Jacobians of the rectified linear unit.
The spectral norm $\|\cdot\|_2$ is known to be sub-multiplicative. This means that for matrices $A$ and $B$, $\|BA\|_2 \leq \|B\|_2 \|A\|_2$. Moreover, $J_{\mathrm{ReLU}}$ is a diagonal matrix with ones or zeros on the diagonal, depending on whether the corresponding input is positive or negative. Its singular values are therefore at most 1. Putting this together, we can upper bound the Lipschitz constant as
$$L \leq \|J_{\mathrm{Conv}_2}\|_2 \, \|J_{\mathrm{BN}_2}\, J_{\mathrm{Conv}_1}\, J_{\mathrm{BN}_1}\|_2.$$
Theoretically, we could determine the maximal singular value of the convolution using a singular value decomposition. However, the singular value decomposition of such a large matrix is computationally expensive. Instead, Yoshida & Miyato (59) lay out how the maximal singular value can be approximated using a power iteration (64). The only difference from their method is that our first convolution additionally involves multiplying the input by the diagonal matrices given by the two batch normalizations.
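The power iteration mentioned above can be sketched for a small dense matrix as follows. This is a generic textbook version written for illustration: for a convolution one would apply the convolution and its transpose instead of forming an explicit matrix, and the fixed all-ones starting vector is an assumption made for reproducibility (a random start is the usual choice, since a start orthogonal to the leading singular vector would fail). The helper names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a dense matrix (list of rows) with a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def norm(vector):
    return math.sqrt(sum(a * a for a in vector))


def max_singular_value(matrix, n_iter=100):
    """Approximate the largest singular value by power iteration on A^T A."""
    matrix_t = transpose(matrix)
    v = [1.0] * len(matrix[0])  # fixed start for reproducibility (assumption)
    for _ in range(n_iter):
        w = matvec(matrix_t, matvec(matrix, v))
        w_norm = norm(w)
        if w_norm == 0.0:
            return 0.0  # the start vector lies in the null space of the matrix
        v = [a / w_norm for a in w]
    # With v a unit vector close to the leading right singular vector,
    # the norm of A v approximates the largest singular value.
    return norm(matvec(matrix, v))


sigma = max_singular_value([[3.0, 0.0], [0.0, 1.0]])
```

By sub-multiplicativity, multiplying the estimates obtained for the two constant factors of the Jacobian then yields an upper bound on the Lipschitz constant of the residual function, provided each estimate has converged.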

Results
To assess our hypotheses, we considered non-recurrently initialized (ordinary) ResNets as well as recurrently initialized ResNets with coupling parameters 0, 0.5, 0.9, and 1. In addition, we considered properly and improperly convergent ResNets across the same coupling parameters, setting $\mu = 0.95$ (we only trained these networks on CIFAR-10).
We trained several instances of all these ResNets with 8, 16, 32, and 64 channels in the first stage on classical visual recognition tasks (CIFAR-10, MNIST, and CIFAR-100) as well as Digitclutter, a challenging task with partially occluded objects, which has previously been observed to benefit from recurrent connections (9).

Soft gradient coupling improves Iterative Recurrence Indices
As Fig 3a shows, soft gradient coupling indeed improves iterative convergence, increasing the Convergence Index and decreasing the Divergence Index. The Recurrence Index is centered closely around 1 (see S1 Fig). Convergence and Divergence Index, on the other hand, tend to increase and decrease, respectively, both with higher coupling parameters and with a higher number of channels. Notably, this trend appears to not hold up for a fully recurrent ResNet, corresponding to a coupling parameter of 1. These results show that soft gradient coupling is an effective way of manipulating a ResNet's behavior. Increasing weight similarity across blocks leads to more iterative convergence in a ResNet.
Moreover, increasing width makes networks converge faster and diverge more slowly. This could potentially be mediated by the fact that the wide networks' increased computational expressivity allows them to move closer to the ultimate target within the first layers (faster convergence). This, in turn, would mean that the last layer would have had less work to do, which could have attenuated detrimental effects of its repeated application (slower divergence). Perhaps surprisingly, we find that higher recurrence does not necessarily lead to more convergent behavior. This may indicate that less recurrent ResNets use their increased expressivity to more quickly approach their final estimate.

Stronger iterative convergence does not provide a useful inductive bias
We first assessed the effect of gradient coupling on the performance of non-convergent ResNets. As Fig 3b shows, a higher coupling parameter consistently leads to a higher error rate, both for CIFAR-10 and Digitclutter. Additional supporting experiments on CIFAR-100, MNIST, the Digitclutter task, and on sample efficiency can be found in S3 Fig. However, for intermediate coupling parameters of 0.5 and 0.9, this increase in error rate is smaller for networks with higher capacity (i.e., more channels). This effect can also be seen from the relationship between iterative convergence indices and performance. In particular, Fig 3c demonstrates that performance is higher for ResNets with a higher Convergence Index and a lower Divergence Index. This effect, however, is driven by the fact that ResNets with a higher number of channels also show higher measures of iterative convergence (see Fig 3a). When controlling for the number of parameters (see lines in Fig 3c), we find no clear relationship between Convergence and Divergence Index and performance. In part, a higher Divergence Index even seems to lead to higher performance, which would seem to indicate that less iterative computations lead to a better inductive bias. This is most likely caused by the fact that we manipulated the Divergence Index using recurrence regularization. More specifically, higher recurrence regularization seems to lead to a lower Divergence Index and lower performance.
We then assessed the effect of convergence regularization on performance by training several convergent ResNets on CIFAR-10 (see Fig 4b for ICRs). Convergence regularization led to a higher error rate across nearly all coupling parameters and architectures. A notable exception is the fully coupled ResNet with 16 channels, which performs equally with and without convergence regularization. This suggests that convergence is not a useful inductive bias in ResNets. Taken together, these experiments suggest that iterative convergence may not provide a useful inductive bias for ResNets.

Discussion
We introduced soft gradient coupling, a new method of recurrence regularization, and demonstrated that this enabled us to manipulate iterative convergence properties in ResNets. To measure these properties, we introduced three indices of iterative convergence, quantifying the effect of perturbations previously introduced in the literature (12; 17).
Iterative methods are considered powerful approaches, in particular for solving difficult inverse problems. However, here we did not find iterative convergence to be a useful inductive bias for ResNets. Moreover, we found that higher degrees of weight sharing did not improve a ResNet's parameter efficiency. One reason for this may be that soft gradient coupling or spectral normalization are the wrong methods for this purpose or require a different optimization strategy. We explored minor variations of soft gradient coupling: we selectively coupled only the last eight layers or used a non-uniform kernel in coupling the different layers. These variations did not have an overall effect on our findings (see S1 Appendix). Similarly, future research could explore selectively applying spectral normalization only to the last layers of each stage. Our findings also suggest, however, that deep feedforward computations should perhaps not be characterized as iterative refinement on a latent representation, but simply as a sequence of operations smoothly approaching their final estimate.
Our conclusions are based on experiments on four visual classification tasks. Visual tasks have been proposed to be inverse problems and therefore lend themselves to iterative inference algorithms. Recognition tasks like Digitclutter that involve partial occlusions of objects have in particular been shown to benefit from recurrent computations (9; 46). However, an iterative method such as an error-corrective algorithm, which requires a forward model of the data, may be more complex and therefore harder to learn than a purely discriminative model. Hence, for the four tested tasks, ResNets may learn direct inference rather than error-corrective inference via a forward model.
The primate visual system has been suggested to solve an inverse problem using iterative refinement (5; 6). Deep neural networks, in turn, have emerged as the best image-computable models of the ventral visual stream (65; 66; 67). Recently, recurrent neural networks have been shown to predict the temporal dynamics of the ventral stream better than any other computational model (39; 68). In light of these successes, the fact that the behavior of ResNets is more consistent with direct inference than iterative refinement may highlight an important discrepancy between these models and the primate visual system.
Uncovering the role of iterative computations in the visual system may benefit from more discerning tests of iterative computations. We have focused here on classical object classification (perhaps made harder by occlusions). Performance on this task has been proposed to benefit from iterative computations, but object classification can also be solved without such computations (even with occlusions, in most cases). Moreover, if ResNets implemented iterative computations at all, they would implement ones that are focused on local interactions. Future research may benefit from tasks for which high performance necessitates iterative computations uncovering long-range context. This may be the case, for example, for Pathfinder or cABC.
Although it did not improve performance here, soft gradient coupling provides a method for smoothly interpolating between feedforward and recurrent neural networks. More generally, soft gradient coupling provides a simple way to encourage sets of weights to remain similar to each other. This technique may find further use in relaxing weight-sharing constraints and studying the benefit of various forms of weight sharing, including recurrence and convolution, in deep neural networks.
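To make the interpolation concrete, the following is a minimal sketch of one way gradients could be coupled across layers. The mixing rule used here (each layer's gradient is blended with the mean gradient across layers, weighted by a coupling parameter `lam`) is an assumption made for illustration and may differ from the exact formulation used in our experiments; the function names are likewise hypothetical.

```python
# Sketch of soft gradient coupling with a uniform kernel.
# ASSUMPTION: coupled gradient = (1 - lam) * own gradient + lam * mean gradient.
# This rule and all names below are illustrative, not the paper's implementation.

def couple_gradients(grads, lam):
    """Blend each layer's gradient with the mean gradient across layers."""
    n_layers = len(grads)
    dim = len(grads[0])
    mean = [sum(g[d] for g in grads) / n_layers for d in range(dim)]
    return [[(1.0 - lam) * g[d] + lam * mean[d] for d in range(dim)]
            for g in grads]


def sgd_step(weights, grads, lam, lr=0.1):
    """Plain gradient descent step using the coupled gradients."""
    coupled = couple_gradients(grads, lam)
    return [[w - lr * c for w, c in zip(w_layer, c_layer)]
            for w_layer, c_layer in zip(weights, coupled)]


# Three layers with identical (recurrent) initialization but different gradients.
weights = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
grads = [[0.5, -1.0], [1.5, 0.0], [-2.0, 4.0]]

fully_coupled = sgd_step(weights, grads, lam=1.0)  # layers stay identical
uncoupled = sgd_step(weights, grads, lam=0.0)      # ordinary per-layer update
```

Under this assumed rule, `lam=1.0` with identical initialization keeps all layers' weights identical after every step (a "recurrent" network), `lam=0.0` recovers independent updates, and intermediate values only pull the weights toward each other.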

S1 Appendix
The appendix contains details on the software used and the ResNets' architecture and training paradigm. It also details a few variations on gradient coupling that were explored. As detailed in the appendix, the findings on these variations were consistent with the results presented in the main article.

S1 Repository
Repository for the model training and analysis. This repository contains the code to train the models, the resulting performance metrics, and code to analyse these metrics.

Fig 1. Illustration of iterative computations in ResNets. a A recurrent ResNet implementing a simple error-corrective inverse model. The prediction based on the current estimate ẑ is compared to the input x (via positive and negative errors p and n). The error is used to update ẑ, whereas x is simply retained throughout all blocks. b Trajectories of the estimates ẑ1, ẑ2 across blocks in a feedforward network, iterative steps in an error-corrective algorithm, residual blocks in a trained ResNet, and residual blocks in a trained recurrent ResNet. All four methods converge to the correct estimate (indicated by black 'x'). c Dropping out the fourth block (unbroken line) has a minor impact on the ResNet. d If the last block is iteratively applied to the final estimate, the value diverges for both the residual and the recurrent network (broken lines), indicating that they do not learn a convergent structure.
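The error-corrective scheme in panel a, together with the perturbations in panels c and d, can be sketched on a toy linear inverse problem. The forward model `A`, the step size `ETA`, and the number of blocks are assumptions chosen for illustration; this is not the network shown in the figure.

```python
import math

# Hypothetical forward model x = A z (illustrative assumption only).
A = [[2.0, 0.0],
     [0.0, 1.0]]
Z_TRUE = [1.0, -1.0]
ETA = 0.2  # step size; small enough for this A that each step contracts the error


def forward(z):
    """Predict the input from an estimate: A z."""
    return [sum(a * zi for a, zi in zip(row, z)) for row in A]


X = forward(Z_TRUE)  # observed input, retained unchanged across all blocks


def block(z_hat, x):
    """One error-corrective step: z_hat + ETA * A^T (x - A z_hat)."""
    err = [xi - pi for xi, pi in zip(x, forward(z_hat))]
    back = [sum(A[i][j] * err[i] for i in range(len(A)))
            for j in range(len(z_hat))]
    return [zj + ETA * bj for zj, bj in zip(z_hat, back)]


def run(x, n_blocks=8, drop=None, extra=0):
    """Apply the same block repeatedly; optionally drop one block (0-indexed)
    or apply the last block `extra` additional times."""
    z_hat = [0.0] * len(Z_TRUE)
    for k in range(n_blocks):
        if k != drop:
            z_hat = block(z_hat, x)
    for _ in range(extra):
        z_hat = block(z_hat, x)
    return z_hat


def distance(z_hat):
    """Euclidean distance between the estimate and the true latent."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_hat, Z_TRUE)))


full = distance(run(X))                 # unperturbed
early = distance(run(X, n_blocks=4))    # early read-out
dropped = distance(run(X, drop=3))      # fourth block dropped
repeated = distance(run(X, extra=20))   # last block applied 20 more times
```

In this toy iterative algorithm, dropping one block costs little and repeating the last block moves the estimate closer to the solution. The trained ResNets in panel d behave differently under the last perturbation: their estimates diverge.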

Fig 2. Iterative convergence in ResNets with standard training. a The different perturbation methods (early read-out for determining convergence, dropping-out blocks for determining recurrence, and additional evaluations of the last block for determining divergence) are illustrated for the three stages of the ResNet. The x axis depicts the residual block targeted by the perturbation and the y axis the error rate resulting from the corresponding perturbation (chance performance at 90%). For clarity, one of the six instances is emphasized in the plots. b The resulting index values for each stage (small translucent dots) and their averages across instances (large dots). c The error rate for the individual network instances is plotted against Convergence and Divergence Index.
, right panel) and determine the AUC (Divergence Index, DI, see Fig 2b, right panel) for a certain value µ. If L(f) ≤ µ, the corrective factor c will not change the function. If L(f) > µ, it will set the corresponding upper bound of f at L(f) = µ, constraining the residual function's Lipschitz constant.
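The behavior of the corrective factor c can be illustrated for a single linear map, whose Lipschitz constant equals its largest singular value. The sketch below assumes c = min(1, µ / L(f)), which matches the two cases described above but is not necessarily the exact expression used in our implementation; the function names and the power-iteration estimate are likewise illustrative.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def spectral_norm(W, n_iter=100):
    """Estimate the largest singular value of W by power iteration on W^T W.
    Note: the all-ones start vector must not be orthogonal to the leading
    right singular vector; a random start would be safer in general."""
    Wt = transpose(W)
    v = [1.0] * len(W[0])
    for _ in range(n_iter):
        u = matvec(Wt, matvec(W, v))
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in u]
    return math.sqrt(sum(x * x for x in matvec(W, v)))


def constrain(W, mu):
    """Rescale W so its spectral norm is at most mu.
    ASSUMPTION: corrective factor c = min(1, mu / L(f))."""
    sigma = spectral_norm(W)
    c = min(1.0, mu / sigma) if sigma > 0.0 else 1.0
    return [[c * w for w in row] for row in W], c


W = [[3.0, 0.0],
     [0.0, 1.0]]                      # largest singular value is 3
W_clipped, c_clipped = constrain(W, mu=0.9)   # L(f) > mu: rescaled
W_same, c_same = constrain(W, mu=5.0)         # L(f) <= mu: unchanged
```

In a convolutional residual block the constraint would be applied per layer; the linear case is used here only because its Lipschitz constant can be computed directly.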

Fig 3. Iterative convergence and performance of coupled ResNets. a Effect of gradient coupling and initialization (rec.: recurrent; non-rec.: non-recurrent) on indices of iterative convergence for architectures with different numbers of channels. b Effect of gradient coupling and initialization on the performance on CIFAR-10 and Digitclutter-5. c Relationship between performance and iterative convergence, i.e., Convergence (left) and Divergence Index (right). Models with the same number of parameters are visualized by the same color and individual lines. Results in a and c are on CIFAR-10.
Both the properly and improperly convergent ResNets have a high Convergence Index as well as a Divergence Index of almost zero (see Fig 4a for ICRs and S2 Fig for PCRs). The Recurrence Index is again centered around one (see S1 Fig). For improperly convergent ResNets with 16 or 32 channels in the first stage, higher coupling parameters generally yield a lower Convergence Index and a higher Divergence Index. Nevertheless, these indices indicate that the spectrally normalized ResNets exhibit much more convergent behavior than the ordinary networks. Such behavior was only guaranteed for the recurrent, properly convergent ResNets, so this is an important observation.

Fig 4. Iterative convergence and performance of improperly convergent ResNets. a Effects of gradient coupling and initialization (rec.: recurrent; non-rec.: non-recurrent) on iterative convergence indices. b Error rates on CIFAR-10 as a function of gradient coupling and initialization.

S1 Fig. Recurrence Index for gradient-coupled and improperly convergent ResNets. a Recurrence Index for gradient-coupled ResNets. b Recurrence Index for improperly convergent ResNets.

S2 Fig. Results on properly convergent ResNets (PCRs) trained on CIFAR-10. a The indices of iterative convergence demonstrate that the PCRs indeed converge. b As the error rate on CIFAR-10 indicates, PCRs tend to perform a bit worse than the improperly convergent ResNets we studied in the main article.

S3 Fig. Additional experiments on gradient-coupled ResNets. a Performance of gradient-coupled ResNets on variations of Digitclutter with a different number of overlapping digits and different size of training data. b Performance of gradient-coupled ResNets on CIFAR-100, c CIFAR-10 with few training data, and d MNIST.

S4 Fig. Performance of training variations on CIFAR-10. a The effect of initializing batchnorm with γ = 0.1 instead of γ = 1. b The effect of using a triangular kernel for gradient coupling instead of a uniform kernel. c A variation of gradient coupling where the first five blocks in each stage were uncoupled.

S5 Fig. The indices of iterative convergence plotted against the coupling parameters for the different training variations.