Abstract
Classically, visual processing is described as a cascade of local feedforward computations. Feedforward Convolutional Neural Networks (ffCNNs) have shown how powerful such models can be and have revolutionized computer vision. However, ffCNNs only roughly mimic human vision: they lack recurrent connections and rely mainly on local features, whereas humans use global shape computations. Previously, using visual crowding as a well-controlled challenge, we showed that no classic model of vision, including ffCNNs, can explain human global shape processing (1). Here, we show that Capsule Neural Networks (CapsNets; 2), which combine ffCNNs with a grouping and segmentation mechanism, solve this challenge in a natural manner. We hypothesize that one computational function of recurrence is to efficiently implement grouping and segmentation. We provide psychophysical evidence that time-consuming recurrent processes indeed implement complex grouping and segmentation in humans, and we show that CapsNets reproduce these results as well. Together, these findings provide mutually reinforcing psychophysical and computational evidence that a recurrent grouping and segmentation process is essential for understanding the visual system and for creating better models that harness global shape computations.





