## Abstract

Neural computations are currently investigated using two separate approaches: sorting neurons into functional populations, or examining the low-dimensional dynamics of collective activity. Whether and how these two aspects interact to shape computations is currently unclear. Using a novel approach to extract computational mechanisms from networks trained on neuroscience tasks, here we show that the dimensionality of the dynamics and cell-class structure play fundamentally complementary roles. While various tasks can be implemented by increasing the dimensionality in networks with fully random population structure, flexible input-output mappings instead require a non-random population structure that can be described in terms of multiple sub-populations. Our analyses revealed that such a population structure enabled flexible computations through a mechanism based on gain-controlled modulations that flexibly shape the dynamical landscape of collective dynamics. Our results lead to task-specific predictions for the structure of neural selectivity, for inactivation experiments, and for the involvement of different neurons in multi-tasking.

## 1 Introduction

The quest to understand the neural bases of cognition currently relies on two disjoint paradigms [Barack and Krakauer, 2021]. Classical works have sought to determine the computational role of individual cells by sorting them into functional populations based on their responses to sensory and behavioral variables [Hubel and Wiesel, 1959; Moser et al., 2017; Hardcastle et al., 2017]. Fast-developing tools for dissecting neural circuits have opened the possibility of mapping such functional populations onto genetic and anatomic cell types, and given new momentum to this cell-category approach [Adesnik et al., 2012; Ye et al., 2016; Kvitsiani et al., 2013; Hangya et al., 2014; Pinto and Dan, 2015; Hirokawa et al., 2019]. This viewpoint has however been challenged by observations that individual neurons often represent seemingly random mixtures of sensory and behavioral variables, especially in higher cortical areas [Churchland and Shenoy, 2007; Machens et al., 2010; Rigotti et al., 2013; Mante et al., 2013; Park et al., 2014], where sharply defined functional cell populations are often not directly apparent [Mante et al., 2013; Raposo et al., 2014; Hardcastle et al., 2017]. A newly emerging paradigm has therefore proposed that neural computations need instead to be interpreted in terms of collective dynamics in the state space of joint activity of all neurons [Buonomano and Maass, 2009; Rigotti et al., 2013; Mante et al., 2013; Gallego et al., 2017; Remington et al., 2018; Saxena and Cunningham, 2019]. This computation-through-dynamics framework [Vyas et al., 2020] hence posits that neural computations are revealed by studying the geometry of low-dimensional trajectories of activity in state space [Mante et al., 2013; Rajan et al., 2016; Chaisangmongkon et al., 2017; Remington et al., 2018; Wang et al., 2018; Sohn et al., 2019], while remaining agnostic to the role of any underlying population structure.

In view of the apparent antagonism between these two approaches, two works have sought to precisely assess the presence of functional cell populations in the posterior parietal cortex (PPC) [Raposo et al., 2014] and prefrontal cortex [Hirokawa et al., 2019]. Rather than define cell populations by classical methods such as thresholding the activity or selectivity of individual neurons, these studies developed new statistical techniques to determine whether the distribution of selectivity across neurons displayed non-random population structure [Hardcastle et al., 2017]. Using analogous analyses, but different behavioral tasks, the two studies reached opposite conclusions. Raposo et al. found no evidence for non-random population structure in selectivity, and argued that PPC neurons fully multiplex information. Hirokawa et al. also observed that individual neurons responded to mixtures of task features, but in contrast to Raposo et al., they detected important deviations from a fully random distribution of selectivity, a situation they termed non-random mixed selectivity. By clustering neurons according to their response properties, they defined separate, though mixed-selective, populations that appeared to represent distinct task variables and to reflect underlying connectivity. To resolve the apparent discrepancy with Raposo et al., Hirokawa et al. conjectured that revealing non-random population structure in higher cortical areas may require sufficiently complex behavioral tasks.

The conflicting findings of [Raposo et al., 2014; Hirokawa et al., 2019] therefore raise a fundamental theoretical question: do specific computational tasks require a non-random population structure, or alternatively can any task in principle be implemented with a fully random population structure as in Raposo et al. [2014]? To address this question, we trained recurrent neural networks on a range of systems neuroscience tasks [Sussillo, 2014; Barak, 2017; Yang et al., 2019] and examined the population structure that emerges in both selectivity and connectivity, using the same methods as Raposo et al. [2014] and Hirokawa et al. [2019]. Starting from the premise that computations are necessarily determined by the underlying connectivity [Mastrogiuseppe and Ostojic, 2018], we then developed a new approach for assessing the computational role of population structure in connectivity for each task. Together, these analyses revealed that, while a fully random population structure was sufficient to implement a range of tasks, specific tasks appeared to require a non-random population structure in connectivity that could be described in terms of a small number of statistically-defined sub-populations. This was in particular the case when a flexible reconfiguration of input-output associations was needed, a common component of many cognitive tasks [Sakai, 2008] and more generally of multi-tasking [Yang et al., 2019; Duncker et al., 2020; Masse et al., 2018]. To extract the mechanistic role of this population structure for computations-through-dynamics, we focused on the class of low-rank models [Mastrogiuseppe and Ostojic, 2018; Schuessler et al., 2020a,b] that can be reduced to interpretable latent dynamics characterized by a minimal intrinsic dimension and number of sub-populations [Beiran et al., 2021].
We found that the sub-population structure of the connectivity enables networks to implement flexible computations through a mechanism based on modulations of gain and effective interactions that flexibly modify the low-dimensional latent dynamics across epochs of the task. Specifically, at the level of the collective dynamics, the sub-population structure allows different inputs to act either as drivers or modulators [Sherman and Guillery, 1998; Salinas, 2004; Ferguson and Cardin, 2020]. Our results lead to task-specific predictions for the statistical structure of single-neuron selectivity, for inactivations of specific sub-populations, as well as for the involvement of different neurons in multi-tasking.

## 2 Results

### 2.1 Identifying non-random population structure in trained recurrent networks

We trained recurrent neural networks (RNNs) on five systems neuroscience tasks spanning a range of cognitive components: perceptual decision-making (DM) [Gold and Shadlen, 2007], parametric working-memory (WM) [Romo et al., 1999], multi-sensory decision-making (MDM) [Raposo et al., 2014], contextual decision-making (CDM) [Mante et al., 2013] and delay-match-to-sample (DMS) [Miyashita, 1988]. Each network consisted of *N* units, and the activation *x*_{i} of unit *i* was given by

$$\tau \frac{dx_i}{dt} = -x_i + \sum_{j=1}^{N} J_{ij}\,\phi(x_j) + \sum_{s=1}^{N_{in}} I_i^{(s)} u_s(t) + \eta_i(t) \tag{1}$$

where *τ* is the single-unit time constant, *ϕ*(*x*) = tanh(*x*) is the single-unit non-linearity, *J*_{ij} is the recurrent connectivity matrix, *η*_{i}(*t*) is single-unit noise, and the network receives *N*_{in} task-defined inputs *u*_{s}(*t*) through sets of feed-forward weights *I*_{i}^{(s)} (see Methods 4.1). The output *z*(*t*) of the network was obtained by a linear readout of firing rates *ϕ*(*x*_{i}) through a set of weights {*w*_{i}}_{i=1…N} (Fig. 1a top). Each task was modeled as a mapping from a set of inputs representing stimuli and contextual cues to desired outputs (see Methods 4.3). For each task, we used gradient-descent to train 100 networks starting from different, random initial connectivities [Yang and Wang, 2020]. We then searched for evidence of non-random population structure by comparing the selectivity, connectivity and performance of the trained networks with randomized shuffles.
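The network dynamics above can be sketched in a few lines; this is a minimal illustration, not the trained setup, and the time constant, connectivity scale, and input used here are illustrative assumptions:

```python
import numpy as np

def simulate_rnn(J, I, u, dt=0.1, tau=1.0, noise_std=0.0, seed=0):
    """Euler-integrate tau dx/dt = -x + J phi(x) + I u(t) + noise.

    J: (N, N) recurrent matrix; I: (N, N_in) input weights;
    u: (T, N_in) time course of the task-defined inputs.
    """
    rng = np.random.default_rng(seed)
    N, T = J.shape[0], u.shape[0]
    x = np.zeros((T, N))
    for t in range(1, T):
        phi = np.tanh(x[t - 1])                       # firing rates
        drive = J @ phi + I @ u[t - 1]
        noise = noise_std * np.sqrt(2 * tau / dt) * rng.standard_normal(N)
        x[t] = x[t - 1] + (dt / tau) * (-x[t - 1] + drive + noise)
    return x

# Example: a small random network driven by a constant scalar input.
rng = np.random.default_rng(1)
N = 200
J = rng.standard_normal((N, N)) * 0.5 / np.sqrt(N)    # stable random connectivity
I = rng.standard_normal((N, 1))
u = np.full((500, 1), 0.1)
x = simulate_rnn(J, I, u)
z = np.tanh(x) @ rng.standard_normal(N) / N           # linear readout of rates
```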

#### Population structure in selectivity

We first asked if training on each task led to the emergence of non-random structure in selectivity, as previously assessed in the posterior parietal [Raposo et al., 2014] and prefrontal [Hirokawa et al., 2019] cortices. Following the approach developed in those studies, we represented each neuron as a point in a *selectivity space*, where each axis was given by the linear regression coefficient of neural firing rate with respect to a task variable *v* such as stimulus, decision or context (Fig. 1a). The dimension of the selectivity space ranged from 2 to 4 depending on the task (see Methods 4.4), and each trained network led to a distribution of points in that space (Fig. 1b). For each network, we compared the obtained distribution with a randomized shuffle corresponding to a multivariate Gaussian with matching empirical mean and covariance (Fig. 1b,c), and assessed the difference using the ePAIRS statistical test [Raposo et al., 2014; Hirokawa et al., 2019]. A non-significant outcome suggests an isotropic distribution of single-neuron selectivity, a situation that has been denoted as *fully-random population structure*, or non-categorical mixed selectivity [Raposo et al., 2014]. A statistically significant outcome instead indicates that neurons tend to be clustered along multiple axes of the selectivity space. Following Raposo et al. [2014]; Hirokawa et al. [2019], we refer to this situation as non-random mixed selectivity, or *non-random population structure*. The ePAIRS test on the selectivity distributions revealed the presence of non-random population structure for two out of the five tasks, the contextual decision-making and delay-match-to-sample tasks (proportion of statistically significant networks under the ePAIRS test, *p* < 0.05, Bonferroni corrected: DM task: 1/100, WM task: 6/100, MDM task: 10/100, CDM task: 87/100, DMS task: 100/100) (Fig. 1d).
In particular, this analysis revealed a clear difference between the multi-sensory [Raposo et al., 2014] and context-dependent [Mante et al., 2013] decision making tasks, which had an identical input structure (two stimuli *A* and *B* and two contextual cues *A* and *B*, Fig. 3b) and therefore identical four-dimensional selectivity spaces, but required different mappings from inputs to outputs.
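The selectivity-space construction and its comparison against a moment-matched Gaussian shuffle can be sketched as follows. This is a simplified PAIRS-style statistic for illustration only (the published ePAIRS test differs in its exact angle statistic and calibration), and all parameter values below are assumptions:

```python
import numpy as np

def selectivity_coefficients(rates, task_vars):
    """Regression coefficients of each neuron's rate against task variables:
    one point per neuron in selectivity space. rates: (T, N); task_vars: (T, V)."""
    X = np.column_stack([task_vars, np.ones(len(task_vars))])  # add intercept
    beta, *_ = np.linalg.lstsq(X, rates, rcond=None)
    return beta[:-1].T                                         # (N, V)

def pairs_statistic(points, k=3):
    """Median angle to the k nearest neighbours of each unit-normalised
    selectivity vector; small values indicate clustering along common axes."""
    P = points / np.linalg.norm(points, axis=1, keepdims=True)
    cos = np.clip(np.abs(P @ P.T), 0.0, 1.0)   # |cos|: sign-symmetric axes
    np.fill_diagonal(cos, 0.0)
    nearest = np.sort(cos, axis=1)[:, -k:]     # k largest cosines per neuron
    return np.median(np.arccos(nearest))

def pairs_pvalue(points, n_shuffles=200, seed=0):
    """Fraction of moment-matched Gaussian shuffles at least as clustered."""
    rng = np.random.default_rng(seed)
    mu, cov = points.mean(axis=0), np.cov(points.T)
    stat = pairs_statistic(points)
    null = [pairs_statistic(rng.multivariate_normal(mu, cov, len(points)))
            for _ in range(n_shuffles)]
    return float(np.mean(np.array(null) <= stat))

# Synthetic example: 200 neurons clustered along the 4 axes of selectivity space.
rng = np.random.default_rng(5)
axes = np.eye(4)[rng.integers(0, 4, 200)]
points = axes * rng.standard_normal((200, 1)) + 0.05 * rng.standard_normal((200, 4))
p_clustered = pairs_pvalue(points)
```

A small p-value here plays the role of a significant ePAIRS outcome: the data are more clustered than the Gaussian shuffles.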

#### Population structure in connectivity

The selectivity in trained RNNs necessarily reflects the underlying connectivity [Mastrogiuseppe and Ostojic, 2018]. We therefore next sought to determine the presence of non-random population structure directly in the connectivity of networks trained on different tasks. Recent work has shown that training networks on simple tasks as considered here leads to a particular form of recurrent connectivity based on a *low-rank* structure [Schuessler et al., 2020b], meaning that the connectivity of each neuron is specified by a small number of parameters as detailed below. We leveraged this structure to represent trained networks in a low-dimensional *connectivity space*, and then assessed the presence of non-random population structure in that space using a procedure identical to the analysis of selectivity.

More specifically, we focused on RNNs constrained to have recurrent connectivity matrices *J*_{ij} of a fixed rank *R*, and for each task determined the minimal required *R* (Sup. Fig. S2). A matrix of rank *R* can in general be written as

$$J_{ij} = \frac{1}{N} \sum_{r=1}^{R} m_i^{(r)} n_j^{(r)} \tag{2}$$

so that neuron *i* is characterized by 2*R* recurrent connectivity parameters {*m*_{i}^{(r)}, *n*_{i}^{(r)}}_{r=1…R}. Each neuron moreover received *N*_{in} input weights and sent out one readout weight (Fig. 1e), leading to a total of 2*R* + *N*_{in} + 1 parameters per neuron. We therefore represented the connectivity of each neuron as a point in a (2*R* + *N*_{in} + 1)-dimensional *connectivity space*, where each axis corresponds to entries along one connectivity vector. The connectivity of a full network can then be described as a distribution of points in that space (Fig. 1f). Similarly to the selectivity analysis, we assessed the presence of non-random population structure by comparing connectivity distributions of trained networks with randomized shuffles corresponding to multivariate Gaussians with matching empirical means and covariances, and quantified the deviations using the ePAIRS test. The results were consistent with the analysis of selectivity (Fig. 1g,h), and we again observed a clear gap between two groups of tasks (number of networks with statistically significant clustering for each task: DM: 3/100; WM: 5/100; MDM: 1/100; CDM: 100/100; DMS: 100/100; *p* < 0.05 with Bonferroni correction) (Fig. 1h). In particular, and as was the case for selectivity, the MDM and CDM tasks led to opposite results although their connectivity spaces were identical (seven-dimensional, with *N*_{in} = 4, *R* = 1, so that the total dimension was 2*R* + *N*_{in} + 1 = 7).
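A minimal sketch of this low-rank construction and of the resulting connectivity space (the sizes below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, R, N_in = 300, 2, 4     # illustrative sizes

# Connectivity vectors: m^(r), n^(r) (Eq. 2), input vectors I^(s), readout w.
m = rng.standard_normal((N, R))
n = rng.standard_normal((N, R))
I = rng.standard_normal((N, N_in))
w = rng.standard_normal((N, 1))

J = (m @ n.T) / N          # rank-R recurrent connectivity matrix

# Connectivity space: one point (row) per neuron, 2R + N_in + 1 coordinates.
conn_space = np.column_stack([m, n, I, w])
print(conn_space.shape)    # (300, 9)
```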

#### Computational role of population structure

The analyses of selectivity and connectivity provided a consistent picture on the absence or presence of non-random population structure across tasks. These analyses are however purely correlational, and do not allow us to infer a causal role of the observed structure. To determine when non-random population structure is computationally necessary, or conversely when random population structure is computationally sufficient, we therefore developed a new *resampling* analysis. For each task, we first generated new networks by sampling the connectivity parameters of each neuron from the randomized distribution used to assess structure in Fig. 1e-h, i.e. a multivariate Gaussian distribution with mean and covariance matching the trained low-rank RNNs. This procedure preserved the rank of the connectivity (Fig. 1i), and the overall correlation structure of connectivity parameters, but scrambled any non-random population structure (Fig. 1j,k). We then quantified the performance of each randomly resampled network on the original task. This key analysis revealed that the randomly resampled networks achieved near-perfect accuracy on the DM, WM and MDM tasks, but not on the CDM and DMS tasks (Fig. 1l). This demonstrates that, on the one hand, random population structure is sufficient to implement the DM, WM and MDM tasks, while on the other hand non-random population structure is necessary for the CDM and DMS tasks. These results held independently of the constraints on the rank of the connectivity, and in particular for unconstrained, full-rank networks in which only the learned part of the connectivity was resampled (Sup. Fig. S3).
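The single-population resampling procedure can be sketched as follows, assuming the per-neuron connectivity parameters are stacked as rows of a matrix as in the connectivity-space construction; the column ordering and sizes are illustrative assumptions:

```python
import numpy as np

def resample_network(conn_space, R, N_in, seed=0):
    """Draw a new network whose per-neuron connectivity parameters follow a
    single Gaussian with the empirical mean and covariance of the trained
    network, scrambling any finer population structure. Assumes columns are
    ordered as [m^(1..R), n^(1..R), I^(1..N_in), w]."""
    rng = np.random.default_rng(seed)
    n_neurons = len(conn_space)
    mu = conn_space.mean(axis=0)
    cov = np.cov(conn_space.T)
    new = rng.multivariate_normal(mu, cov, n_neurons)
    m, n = new[:, :R], new[:, R:2 * R]
    I = new[:, 2 * R:2 * R + N_in]
    w = new[:, -1]
    J = (m @ n.T) / n_neurons        # resampling preserves the rank R
    return J, I, w

# Example with a synthetic "trained" connectivity space (R = 1, N_in = 4).
rng = np.random.default_rng(2)
fake_conn = rng.standard_normal((400, 2 * 1 + 4 + 1))
J, I, w = resample_network(fake_conn, R=1, N_in=4)
```

The resampled network is then simply re-simulated on the original task to quantify its performance.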

It is important to stress that the performance of resampled networks is a much more direct assessment of the computational role of the non-random population structure than the analyses of selectivity and connectivity through the ePAIRS test. Indeed, the ePAIRS analyses can lead to false positives in which statistically significant non-random structure is found in both selectivity and connectivity although resampled networks with a single Gaussian still match the performance of the trained network (Sup. Fig. S4). As an illustration, networks trained on the DM task sometimes exhibited two diametrically opposed clusters in the connectivity space, suggesting two concurrent pools of self-excitatory populations, reminiscent of solutions previously found for this task [Wang, 2002; Williams et al., 2018; Schaeffer et al., 2020]. Generating resampled networks scrambled that structure, but still led to functioning networks, which showed that in the DM task the population structure does not play an essential computational role, and might be an artifact of specific training parameters. Spurious structure can also appear in selectivity when the non-linearity is strongly engaged (Sup. Fig. S4).

In summary, our analyses of trained recurrent neural networks revealed that certain tasks can be implemented with a fully random population structure in both connectivity and selectivity, while others appeared to require additional organization in the connectivity that led to non-random structure in selectivity. We next sought to understand the mechanisms by which the population structure of connectivity determines the dynamics and the resulting computations. In a first step, we examined the situation in which the population structure is fully random. In a second step, in line with Hirokawa et al. [2019], we asked whether non-random population structure in the connectivity space could be represented in terms of separate clusters or sub-populations, and how this additional organization expands the computational capabilities of the network.

### 2.2 Interpreting computations in terms of latent dynamical systems

To unravel the mechanisms by which population structure impacts computations, we developed a method for interpreting the trained recurrent neural networks in terms of underlying low-dimensional dynamics [Vyas et al., 2020]. We specifically focused on networks with low-rank connectivity (Fig. 2a), which can be directly reduced to low-dimensional dynamical systems [Beiran et al., 2021]. Here we first outline this model reduction approach, and then apply it to trained recurrent networks.

In line with recent methods for analyzing large-scale neural activity [Buonomano and Maass, 2009; Cunningham and Byron, 2014; Gallego et al., 2017; Saxena and Cunningham, 2019], we start by representing the dynamics as trajectories **x**(*t*) = {*x*_{i}(*t*)}_{i=1…N} in the high-dimensional state space, where the *i*-th dimension corresponds to the activation *x*_{i} of neuron *i* (Fig. 2b). For low-rank networks, the set of connectivity parameters can be interpreted as vectors over neurons that directly correspond to directions in the state space (Fig. 2b). Indeed, each feed-forward input corresponds to an input connectivity vector **I**^{(s)}, the low-rank parameters of the connectivity matrix (Eq. 2) can be represented as *R* pairs of recurrent connectivity vectors **m**^{(r)} and **n**^{(r)} for *r* = 1 … *R*, and the readout forms a vector **w** (Fig. 2a). Crucially, the low-rank connectivity structure directly restricts the dynamics to lie in a low-dimensional subspace spanned by the connectivity vectors **I**^{(s)} and **m**^{(r)} (Fig. 2b) [Mastrogiuseppe and Ostojic, 2018]. In line with dimensionality reduction approaches [Cunningham and Byron, 2014; Gallego et al., 2017], the collective activity in the network can therefore be fully described in terms of a small number of latent variables that quantify activity in this subspace [Mastrogiuseppe and Ostojic, 2018]. More specifically, **x**(*t*) can be decomposed into a set of internal variables *κ*_{r} and inputs *u*_{s}, which quantify respectively activity along **m**^{(r)} and **I**^{(s)} (Fig. 2b, Methods section 4.8.1), and correspond to recurrent and input-driven directions in state space [Wang et al., 2018]:

$$\mathbf{x}(t) = \sum_{r=1}^{R} \kappa_r(t)\, \mathbf{m}^{(r)} + \sum_{s=1}^{N_{in}} u_s(t)\, \mathbf{I}^{(s)}. \tag{3}$$

Altogether, the activity **x**(*t*) is therefore embedded in a linear subspace of dimension *R* + *N*_{in}, where *R* is the rank of the connectivity, and *N*_{in} is the dimensionality of the feed-forward inputs. The dynamics are then fully specified by the evolution of the internal variables **κ** = {*κ*_{r}}_{r=1…R} driven by the inputs **u** = {*u*_{s}}_{s=1…N_{in}}. A mathematical analysis shows that the internal variables form a dynamical system [Remington et al., 2018; Vyas et al., 2020] with a temporal evolution of the form

$$\tau \frac{d\boldsymbol{\kappa}}{dt} = F(\boldsymbol{\kappa}, \mathbf{u}). \tag{4}$$

Here *F* is a non-linear function that determines the amount of change of **κ** at every time step. In the limit of large networks, the precise shape of *F* is set by the statistics of the connectivity across neurons (Methods section 4.8.4), i.e. precisely the distribution of points in the connectivity space that we previously examined in Fig. 1f,j. The connectivity in the network can therefore be represented in two complementary ways, either in terms of directions in the activity state space (Fig. 2b top left) or in terms of distributions in the connectivity space (Fig. 2b bottom left), and these two representations together determine the low-dimensional latent dynamics.
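In practice, latent variables of this kind can be read off from simulated activity by projecting **x**(*t*) onto the spanning vectors. A minimal sketch, using least squares to handle non-orthogonal spanning vectors (the decomposition in Methods 4.8.1 may differ in its exact normalization):

```python
import numpy as np

def latent_variables(x, m, I):
    """Decompose activity x(t) (T, N) on the subspace spanned by the columns
    of m (N, R) and I (N, N_in), as in Eq. 3:
    x(t) ~ sum_r kappa_r(t) m^(r) + sum_s u_s(t) I^(s)."""
    B = np.column_stack([m, I])                    # (N, R + N_in) spanning set
    coef, *_ = np.linalg.lstsq(B, x.T, rcond=None)
    R = m.shape[1]
    return coef[:R].T, coef[R:].T                  # kappa: (T, R), u: (T, N_in)

# Sanity check: activity built from known latents is recovered.
rng = np.random.default_rng(0)
N, T = 100, 50
m = rng.standard_normal((N, 1))
I = rng.standard_normal((N, 1))
kappa_true = np.linspace(0.0, 1.0, T)[:, None]
u_true = np.sin(np.linspace(0.0, 3.0, T))[:, None]
x = kappa_true @ m.T + u_true @ I.T                # (T, N)
kappa, u = latent_variables(x, m, I)
```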

In summary, in line with the computation-through-dynamics framework [Vyas et al., 2020], low-rank networks can be exactly reduced to low-dimensional, non-linear latent dynamical systems that determine the performed computations. We next examined how the population structure in trained recurrent networks impacts the resulting latent dynamical system. To facilitate the interpretation of computational mechanisms, we focused on networks of minimal rank, which lead to latent dynamics of minimal dimensionality for each task (Methods 4.2). We later verify that the main conclusions carry over in the absence of this constraint.

### 2.3 Latent dynamics and computations for fully random population structure

Our resampling analyses of trained RNNs revealed that a range of tasks could be performed by networks in which the population structure was fully random in connectivity space (Fig. 1l). We therefore first examine the latent dynamics underlying computations in that situation. Crucially, a fully random population structure limits the available parameter space, and strongly constrains the set of achievable latent dynamics independently of their dimensionality [Beiran et al., 2021]. We start by specifying these constraints on the dynamics, and then show that they nevertheless allow networks with random population structure to implement a range of tasks of increasing complexity, by increasing the rank of the connectivity and therefore the dimensionality of the dynamics.

Networks with fully random population structure were defined in Fig. 1i-l as having distributions of connectivity parameters computationally equivalent to a Gaussian distribution. In such networks, the statistics of connectivity are therefore fully characterized by a set of covariances between connectivity parameters, each of which can be directly interpreted as the alignment, or overlap between two connectivity vectors (Fig. 2b bottom left, see Eq. 12). For this type of connectivity, a mean-field analysis shows that the latent low-dimensional dynamics can be directly reduced to an effective circuit, where internal variables *κ*_{r} integrate external inputs *u*_{s}, and interact with each other through *effective couplings* set by the overlaps between connectivity vectors multiplied by a common, activity-dependent gain factor [Beiran et al., 2021]. In such reduced models, the role of individual parameters can then be analyzed in detail (Sup. Info).
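The shared gain factor and the resulting effective couplings can be sketched as follows; the activity samples and overlap values below are illustrative assumptions:

```python
import numpy as np

def average_gain(x):
    """Shared gain factor <Phi'>: mean slope of phi = tanh across all units."""
    return float(np.mean(1.0 - np.tanh(x) ** 2))

def effective_couplings(x, sigma_nm, sigma_nI):
    """Effective couplings of the latent circuit for fully random (Gaussian)
    population structure: raw connectivity overlaps times the shared gain."""
    g = average_gain(x)
    return g * sigma_nm, g * sigma_nI

# Weakly driven units have gain ~1; strongly driven units saturate,
# scaling down all effective couplings together.
weak = np.full(1000, 0.01)
strong = np.full(1000, 3.0)
```

Because a single gain multiplies every overlap, activity can rescale all effective couplings together but cannot change their ratios, which is the constraint discussed in the text.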

As a concrete example, a unit-rank network (*R* = 1) with connectivity vectors **m** and **n** and a single feed-forward input vector **I** (*N*_{in} = 1) leads to two-dimensional activity (Eq. 3), fully described by a single internal variable *κ*(*t*) and a single external variable *u*(*t*) (Fig. 2b). The latent dynamics of *κ*(*t*) are given by

$$\tau \frac{d\kappa}{dt} = -\kappa + \tilde{\sigma}_{nm}\,\kappa + \tilde{\sigma}_{nI}\,u \tag{5}$$

where σ̃_{nm} and σ̃_{nI} are effective couplings that depend both on overlaps between connectivity vectors and, implicitly, on *κ* and *u* through a gain factor, so that the full dynamics in Eq. 5 are non-linear despite their apparently linear form. More specifically, the effective couplings are defined as σ̃_{nm} = ⟨Φ′⟩*σ*_{nm} and σ̃_{nI} = ⟨Φ′⟩*σ*_{nI}, where *σ*_{nm} (resp. *σ*_{nI}) is the fixed overlap between the vector **n** and the vector **m** (resp. **I**). The connectivity vector **n** therefore selects inputs to the latent dynamics [Mastrogiuseppe and Ostojic, 2018]: the overlap between **n** and **I** controls how strongly the latent dynamics integrate feed-forward inputs, while the overlap between **n** and **m** controls the strength of positive feedback in the latent dynamics. Crucially, all the effective couplings are scaled by the same factor ⟨Φ′⟩ that represents the average gain of all neurons in the network. This gain depends on the activity in the network (Methods section 4.8.4), which makes the dynamics non-linear. The fact that all the effective couplings are scaled by the same factor however implies that, in networks with a fully random population structure, the overall form of the effective circuit is determined by the connectivity overlaps, which strongly limits the range of possible dynamics for the internal variables [Beiran et al., 2021]. Tasks for which a fully random population structure is sufficient are therefore those that can be implemented by a fixed effective circuit at the level of latent dynamics.
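The unit-rank latent dynamics described above can be simulated directly. In this minimal sketch the population gain is estimated by sampling Gaussian loadings; the overlap values and input strength are illustrative assumptions:

```python
import numpy as np

def simulate_latent(u, sigma_nm=1.2, sigma_nI=1.0, dt=0.1, tau=1.0,
                    n_samples=5000, seed=0):
    """Integrate tau dkappa/dt = -kappa + <Phi'>(sigma_nm kappa + sigma_nI u),
    with the shared gain <Phi'> estimated by sampling Gaussian loadings m_i,
    I_i and evaluating each sampled unit's activation x_i = kappa m_i + u I_i."""
    rng = np.random.default_rng(seed)
    m = rng.standard_normal(n_samples)
    Ivec = rng.standard_normal(n_samples)
    kappa = np.zeros(len(u))
    for t in range(1, len(u)):
        x = kappa[t - 1] * m + u[t - 1] * Ivec      # Eq. 3 for each sampled unit
        gain = np.mean(1.0 - np.tanh(x) ** 2)       # shared gain factor <Phi'>
        drive = gain * (sigma_nm * kappa[t - 1] + sigma_nI * u[t - 1])
        kappa[t] = kappa[t - 1] + (dt / tau) * (-kappa[t - 1] + drive)
    return kappa

# A weak constant positive stimulus is integrated toward a positive state;
# flipping the stimulus sign flips the decision by symmetry.
u = np.full(800, 0.2)
kappa = simulate_latent(u)
```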

We first applied this model reduction framework to the perceptual decision-making task, where a network received a noisy scalar stimulus *u*(*t*) along a random input vector, and was trained to report the sign of its temporal average along a random readout vector (Fig. 2c). Minimizing the rank of the trained recurrent connectivity matrix, we found that a unit-rank network was sufficient to solve the task (Sup. Fig. S2). The network connectivity was fully characterized by four connectivity vectors: the input vector **I**, the recurrent connectivity vectors **m** and **n**, and the readout vector **w** (Fig. 2c). As a result, the activity **x**(*t*) evolved in a two-dimensional plane spanned by **I** and **m**, and was fully described by two corresponding collective variables *u*(*t*) and *κ*(*t*) (Fig. 2d). The resampling analysis in Fig. 1l showed that trained networks were fully specified by the overlaps, or covariances, between connectivity vectors, as generating new networks by sampling connectivity from a Gaussian distribution with identical covariances led to identical performance. The latent dynamics of *κ*(*t*) could then be reduced to a simple effective circuit (Fig. 2e, Eq. 5). Inspecting the values of covariances in the trained networks (Sup. Fig. S10) and analyzing the effective circuit (Sup. Fig. S5) revealed that the latent dynamics relied on a strong overlap *σ*_{nI} to integrate inputs, and an overlap *σ*_{nm} ≈ 1 to generate a long integration timescale via positive feedback. The internal variable *κ*(*t*) therefore represented integrated evidence along a direction in state space determined by the connectivity vector **m** (Fig. 2e,f). The readout vector **w** was aligned with **m**, so that the output directly corresponded to the integrated evidence *κ*(*t*). Controlling only three parameters in the latent dynamics was sufficient to reproduce the psychometric input-output curve of the full trained network (Fig. 2f). Note that this network implementation is very similar to the implementation proposed in previous work without making use of a learning algorithm [Mastrogiuseppe and Ostojic, 2018]. The findings from the perceptual decision-making task directly extended to the multi-sensory decision-making task [Raposo et al., 2014], in which the latent dynamics were identical, but integrated two inputs corresponding to two different stimulus features.

We next turned to the parametric working-memory task [Romo et al., 1999], where two scalar stimuli *f*_{1} and *f*_{2} were successively presented along an identical input vector **I**, and the network was trained to report the difference *f*_{1} − *f*_{2} between the values of the two stimuli (Fig. 2g). We found that this task required rank *R* = 2 recurrent connectivity (Sup. Fig. S2), so that the activity was constrained to the three-dimensional space spanned by **I** and the connectivity vectors **m**^{(1)} and **m**^{(2)}. The low-dimensional dynamics could therefore be described by two internal variables *κ*_{1}(*t*) and *κ*_{2}(*t*) that represented activity along **m**^{(1)} and **m**^{(2)}, and formed a two-dimensional dynamical system that integrated the input *u*(*t*) received along **I**. The resampling analysis indicated that in this case also the trained connectivity was fully specified by covariances between connectivity vectors (Fig. 1i-l). Inspecting the connectivity distribution (Sup. Fig. S10) revealed that the two internal variables *κ*_{1} and *κ*_{2} did not directly interact, but instead independently integrated stimuli through dynamics given by Eq. 5 (Fig. 2i). For *κ*_{1}, a strong overlap led to strong positive feedback that generated a persistent representation of the intensity *f*_{1} of the first stimulus along the direction of state space set by the connectivity vector **m**^{(1)} (Fig. 2h top). For *κ*_{2}, the overlap, and therefore the positive feedback, was weaker, leading to a transient response that encoded the most recent stimulus along the direction **m**^{(2)} in state space (Fig. 2h bottom). The readout vector **w** was aligned with both **m**^{(1)} and **m**^{(2)}, but with overlaps of opposite signs, so that the output of the network in the decision period corresponded to the difference between *κ*_{1} and *κ*_{2}, and therefore effectively to *f*_{2} − *f*_{1} (Fig. 2i). Controlling only five parameters in the latent dynamics (Sup. Fig. S6) was therefore sufficient to reproduce the psychometric matrix describing the input-output mapping of the full trained network (Fig. 2j).
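The contrast between the two internal variables can be caricatured with two non-interacting leaky integrators that differ only in feedback strength. This linearized sketch (fixed couplings, illustrative feedback values and stimulus timing) only illustrates the persistent-versus-transient distinction, not the full gain-dependent mechanism:

```python
import numpy as np

def wm_latents(u, fb1=0.99, fb2=0.7, dt=0.1, tau=1.0):
    """Two non-interacting latent variables driven by the same input u(t):
    kappa_1 with near-unit positive feedback (persistent trace) and
    kappa_2 with weaker feedback (transient trace of the latest stimulus)."""
    k1 = np.zeros(len(u))
    k2 = np.zeros(len(u))
    for t in range(1, len(u)):
        k1[t] = k1[t - 1] + (dt / tau) * (-k1[t - 1] + fb1 * k1[t - 1] + u[t - 1])
        k2[t] = k2[t - 1] + (dt / tau) * (-k2[t - 1] + fb2 * k2[t - 1] + u[t - 1])
    return k1, k2

# First stimulus (steps 100-200), a delay, then a second stimulus (steps 800-900);
# the readout would combine the two latents with overlaps of opposite signs.
u = np.zeros(1200)
u[100:200] = 0.5    # f1
u[800:900] = 0.2    # f2
k1, k2 = wm_latents(u)
```

During the delay, k1 retains most of its value while k2 decays back to baseline, reproducing the persistent and transient traces of Fig. 2h.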

In summary, networks with random population structure can perform tasks of increasing complexity by relying on the dimensionality of recurrent dynamics to represent an increasing number of task-relevant latent variables. The random population structure however limits the ways in which these latent variables can be combined, by fixing the shape of the equivalent circuit. As a consequence, for more complex tasks a fully random population structure was not sufficient. We next sought to further elucidate this aspect.

### 2.4 Representing non-random connectivity structure with multiple populations

The resampling analysis in Fig. 1l indicated that tasks such as context-dependent decision-making and delay-match-to-sample relied on a population structure in connectivity that was not fully random. To better understand the underlying structure and its computational role, we further examined RNNs trained on these two tasks, and asked whether their connectivity could be represented in terms of multiple populations. We first examined whether a multi-population connectivity structure is sufficient to implement the two tasks, and in a second step examined how such a structure modifies the latent dynamics and expands their computational capacity.

To identify computationally-relevant populations, we took inspiration from Hirokawa et al. [2019], and first performed clustering analyses in the connectivity space where non-random population structure was found (Fig. 3a, Methods section 4.7). Each axis in that space represents entries along one connectivity vector, and each neuron corresponds to one point. Applying a Gaussian mixture clustering algorithm on the cloud of points formed by each trained network, we partitioned the neurons into separate sub-populations. In the trained networks, all clusters were centered close to the origin, but each had a different shape and orientation that corresponded to multiple peaks in the distribution of nearest-neighbour angles detected by the ePAIRS analysis (Fig. 1f-g). Each population was therefore characterized by a different set of covariances, or overlaps, between input, recurrent, and output connectivity vectors. We then extended our resampling approach from Fig. 1i-l, and generated new networks by first randomly assigning each neuron to a population, and then sampling its connectivity parameters from a Gaussian distribution with the fitted covariance structure. Finally, we inspected the performance of these randomly generated networks, and compared them with fully trained ones. By progressively increasing the number of fitted clusters, we determined the minimal number of populations needed to implement the task (Methods 4.7). Within this approach, networks with a fully random population structure such as those described in Fig. 2 correspond to a single overall population in connectivity space.
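The clustering-plus-resampling procedure can be sketched with scikit-learn's `GaussianMixture`. The two synthetic populations below (zero-centered, differently oriented covariances) mimic the structure described in the text and are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def multi_population_resample(conn_space, n_clusters, seed=0):
    """Fit a mixture of Gaussians to the per-neuron connectivity parameters,
    then draw a new network: each neuron is randomly assigned to a population
    and its parameters are sampled from that population's fitted Gaussian."""
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed)
    gmm.fit(conn_space)
    new_params, labels = gmm.sample(len(conn_space))
    return new_params, labels

# Two synthetic populations centered at the origin, with covariances of
# different orientations (as for the clusters found in trained networks).
rng = np.random.default_rng(3)
pop1 = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], 150)
pop2 = rng.multivariate_normal([0, 0], [[1.0, -0.9], [-0.9, 1.0]], 150)
X = np.vstack([pop1, pop2])
new_params, labels = multi_population_resample(X, n_clusters=2)
```

Increasing `n_clusters` until the resampled networks match trained performance gives the minimal number of populations, as in the procedure described above.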

We first considered context-dependent decision making, where stimuli consisted of a combination of two scalar features that fluctuated in time [Mante et al., 2013]. Depending on a contextual cue, only one of the two features needed to be integrated (Fig. 3b), so that the same stimulus could require opposite responses, a hallmark of flexible input-output transformations [Fusi et al., 2016]. We implemented each stimulus feature and contextual cue as an independent input vector over the population, so that the dimension of feed-forward inputs was *N*_{in} = 4. We found that unit-rank connectivity was sufficient (Fig. S2), and focused on such networks. The analysis in Fig. 1l showed that generating networks by resampling connectivity from a single, fully random population led to a strong degradation of the performance, although it remained above chance. A closer inspection of psychometric matrices representing input-output transforms in different contexts (Fig. 3c) revealed that single-population resampled networks in fact generated correct responses for stimuli requiring identical outputs in the two contexts, but failed for incongruent stimuli, for which responses needed to be flipped according to context (Fig. 3c right). This observation was not specific to unit-rank networks, as randomizing the population structure in higher-rank (Sup. Fig. S11) and full-rank networks (Sup. Fig. S3) led to a similar reduction in performance. We therefore performed a clustering analysis in the connectivity space. The number of clusters varied across networks (Sup. Fig. S9), but the minimal required number was two. For such minimal networks, we found that randomly resampling from the corresponding mixture-of-Gaussians distribution led to an accuracy close to that of the original trained connectivity (Fig. 3d). In particular, the randomly generated networks correctly switched their response to incongruent stimuli across contexts (Fig. 3c), in contrast to networks with random population structure. This indicated that a connectivity structure based on two populations was sufficient to implement the context-dependent decision-making task.

We next turned to the delayed-match-to-sample task [Miyashita, 1988; Engel and Wang, 2011; Chaisangmongkon et al., 2017], where two stimuli were separated by a variable delay period, and the network was trained to indicate in each trial whether the two stimuli were identical or different (Fig. 3e). This task involved flexible stimulus processing analogous to the context-dependent decision-making task, because an identical stimulus presented in the second position required opposite responses depending on the stimulus presented in the first position (Fig. 3f). We found that this task required a rank-two connectivity (Fig. S2), but, similarly to the context-dependent decision-making task, a fully random population structure was not sufficient: networks generated by randomizing connectivity parameters performed at chance level (Fig. 1l, 3f,g). Fitting instead two clusters in the connectivity space showed that two sub-populations were sufficient, as networks generated by sampling connectivity based on a two-population structure reached a performance close to that of the fully trained network (Fig. 3g).

Altogether, our analyses based on clustering connectivity parameters, and randomly generating networks from the obtained multi-population distributions, indicated that connectivity distributions described by a small number of populations were sufficient to implement tasks requiring flexible input-output mappings. To identify the mechanistic role of this multi-population structure, we next examined how it impacted the latent dynamics implemented by trained networks.

### 2.5 Gain-based modulation of latent dynamics by multi-population connectivity

To unveil the mechanisms underlying flexible input-output mappings in networks with connectivity based on multiple populations, we examined how such a structure impacts the latent dynamics of internal variables. We first show that, in contrast to a single population, a multi-population structure allows external inputs to flexibly modulate the overall form of the circuit describing latent dynamics. We then show how this general principle applies specifically to the two flexible tasks described in Fig. 3. We focus here on networks with minimal rank and minimal number of populations, and show in the next section that the inferred predictions hold more generally.

In Fig. 3 we defined sub-populations as subsets of neurons characterized by different overlaps between input, recurrent and output connectivity vectors in a network of fixed rank. In a network with a multi-population structure, the number of internal variables describing low-dimensional dynamics is determined by the rank of the recurrent connectivity, as in networks without population structure (Fig. 2a). Remarkably, a mean-field analysis (Methods 4.8.4, [Beiran et al., 2021]) shows that the latent low-dimensional dynamics can still be represented in terms of an effective circuit where internal variables *κ*_{r} integrate inputs and interact with each other through effective couplings (Fig. 4a). The key effect of the multi-population structure is however to modify the form of the effective couplings and endow them with much greater flexibility than in the case of a single, fully random population. Indeed, in a network with a single population, the effective couplings were given by connectivity overlaps multiplied by a single, global gain factor, and modulating the gain therefore scaled all effective couplings simultaneously. In contrast, in networks with multiple populations, each population is described by its own set of overlaps between connectivity sub-vectors (Fig. 3a-c), and, importantly, by its own gain, which corresponds to the average slope *ϕ′*(*x*_{i}) of the input-output nonlinearity across the neurons in the population. The effective couplings between inputs and internal variables are then given by a sum over populations of connectivity overlaps, each weighted by the gain of the corresponding population (Methods Eq. (40)). As an illustration, in the case of two populations of equal size, the effective coupling between the input and the internal variable becomes

$$\tilde{\sigma}_{nI} = \frac{1}{2}\left(\langle\Phi'\rangle_1\,\sigma^{(1)}_{nI} + \langle\Phi'\rangle_2\,\sigma^{(2)}_{nI}\right),$$

where *σ*^{(1)}_{nI} and *σ*^{(2)}_{nI} are the overlaps for each population between the input vector *I* and the input-selection vector *n*, while ⟨Φ′⟩_{1} and ⟨Φ′⟩_{2} are the gains of the two populations, which depend implicitly both on inputs and the values of internal variables.

Crucially, additional inputs restricted to a given population can modulate its gain independently of other populations by shifting the position of neurons on the non-linear input-output function. Additional inputs can thereby shape latent dynamics without directly driving them, but by modifying effective couplings (Methods 4.8.5). Indeed, as pointed out earlier, only inputs corresponding to input vectors aligned with the input-selection vectors *n*^{(r)} directly drive internal variables through a non-zero effective coupling. In contrast, inputs corresponding to input vectors orthogonal to input-selection vectors do not directly drive the latent dynamics, but do modulate the values of the gain ⟨Φ′⟩_{p} of each population, and therefore the effective couplings. As a consequence, depending on the geometry between input vectors and input-selection vectors, different sets of inputs can play distinct roles of drivers and modulators [Sherman and Guillery, 1998] at the level of the effective circuit describing latent dynamics. Such a mechanism considerably extends the range of possible dynamics with respect to the case of a single overall population. In particular, modulating the gains of different populations allows the network to flexibly remodel the effective circuit formed by collective variables in different trials or epochs according to the demands of the task, in contrast to the single-population case, where the form of the effective circuit is fixed. We next describe how this general mechanism explains the computations in the two flexible tasks of Fig. 3.
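This modulatory effect can be illustrated numerically: for tanh units the gain of a population is ⟨1 − tanh²(x)⟩, so a constant additive input that shifts the operating points changes the gain, and with it any overlap-times-gain effective coupling. A minimal sketch, with all numerical values purely illustrative:

```python
import numpy as np

def population_gain(x):
    # average slope of phi(x) = tanh(x): <phi'(x)> = <1 - tanh(x)^2>
    return np.mean(1.0 - np.tanh(x) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)     # baseline activations of one population

g_off = population_gain(x)          # gain without the modulatory input
g_on = population_gain(x + 2.0)     # shifting along the nonlinearity lowers it

sigma_nI = 1.5                      # illustrative overlap between I and n
coupling_off = sigma_nI * g_off     # effective coupling = overlap * gain
coupling_on = sigma_nI * g_on
```

The modulatory input never enters the latent dynamics directly; it only rescales the effective coupling through the population gain, which is the mechanism exploited by the contextual cues below.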

For the context-dependent decision-making task, the minimal trained networks were of unit rank and consisted of two sub-populations (Fig. 3f). Analyzing the statistics of input and connectivity vectors for each population, we found that the input vectors *I*^{A} and *I*^{B} corresponding to the two stimulus features *u*_{A} and *u*_{B} had different overlaps with the input-selection vector *n* in the two populations (Fig. 4b right), so that the two stimulus features *u*_{A} and *u*_{B} acted as drivers of the latent dynamics. The contextual input vectors *I*^{ctxA} and *I*^{ctxB} in contrast had weak overlaps with the input-selection vector *n* (Sup. Fig. S10), but strongly different amplitudes on the two populations (Fig. 4b left). They therefore modified the gains of the two populations in an opposite manner (Fig. 4c bottom), and played the role of modulators that reshaped the effective circuit describing latent dynamics in each context (Fig. 4c top). More specifically, the latent dynamics of the internal variable *κ* could be approximated by (Methods 4.8.4 and Sup. Fig. S7)

$$\frac{d\kappa}{dt} = -\kappa + \tilde{\sigma}_{nm}\,\kappa + \frac{1}{2}\left(\langle\Phi'\rangle_1\,\sigma^{(1)}_{nI_A}\,u_A + \langle\Phi'\rangle_2\,\sigma^{(2)}_{nI_B}\,u_B\right),$$

where ⟨Φ′⟩_{1} and ⟨Φ′⟩_{2} are the average gains of the two populations, *σ*^{(1)}_{nI_A} the overlap for the first population between the input vector for stimulus feature *A* and the input-selection vector *n*, *σ*^{(2)}_{nI_B} the overlap for the second population between *n* and the input vector for stimulus feature *B*, and *σ̃*_{nm} the gain-weighted overlap between the input-selection and output connectivity vectors. By modulating the gains of the two populations in a differential manner between the two contexts (Fig. 4c bottom), the contextual cues controlled the effective couplings between stimulus inputs and the internal variable *κ*, and determined which feature was integrated by the internal variable in each context (Fig. 4d). This mechanism implemented an effective input gating, but only at the level of the latent dynamics of the internal variable *κ* that integrated relevant evidence. Importantly, at the level of the full network, the two stimulus features were instead equally represented in both contexts, as observed in experimental data [Mante et al., 2013], but along directions in state space orthogonal to the direction *m* that encoded the internal collective variable (Sup. Fig. S12).
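The context-dependent gating can be reproduced with the reduced latent dynamics alone. The sketch below (overlap and gain values are illustrative, not fitted to a trained network) integrates the one-dimensional latent equation for the same noisy stimulus in the two contexts, with the contexts differing only through the two population gains:

```python
import numpy as np

def integrate_kappa(uA, uB, gains, dt=0.01):
    """Euler-integrate the reduced latent dynamics
       dkappa/dt = -kappa + s_mn*kappa + 0.5*(g1*s1*uA(t) + g2*s2*uB(t)),
    with illustrative overlaps; gains = (g1, g2) are set by the context."""
    s_mn, s1, s2 = 0.5, 2.0, 2.0      # illustrative overlap values
    g1, g2 = gains
    kappa = 0.0
    for t in range(len(uA)):
        drive = 0.5 * (g1 * s1 * uA[t] + g2 * s2 * uB[t])
        kappa += dt * (-kappa + s_mn * kappa + drive)
    return kappa

rng = np.random.default_rng(1)
T = 2000
uA = 0.5 + 0.1 * rng.standard_normal(T)    # feature A: positive evidence
uB = -0.5 + 0.1 * rng.standard_normal(T)   # feature B: negative evidence

k_ctxA = integrate_kappa(uA, uB, gains=(1.0, 0.1))  # context A gates in uA
k_ctxB = integrate_kappa(uA, uB, gains=(0.1, 1.0))  # context B gates in uB
```

For an identical stimulus, the internal variable converges to opposite signs in the two contexts, even though neither gain is ever exactly zero.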

For the delayed-match-to-sample task, we found that the multi-population structure also led to a modulation of latent dynamics, but across task epochs rather than across trials. Fig. 4e-i describes an example minimal network implementing this task, where one of the stimuli played the role of a modulatory input, and transiently modified the latent dynamics when presented (Fig. 4e,g,i). More specifically, the network was of rank two, so that the latent dynamics were described by effective interactions between two internal variables *κ*_{1} and *κ*_{2} (Fig. 4e), and could be visualised in terms of a flow in a dynamical landscape in the *κ*_{1}-*κ*_{2} plane (Fig. 4f). The minimal connectivity moreover consisted of two populations (Fig. 3i). Stimulus A modulated the gain of the first population (Fig. 4h), and therefore, when presented, modified the effective couplings in the latent dynamics and the dynamical landscape (Fig. 4i and Sup. Fig. S8). The main effect of the inputs was therefore to shape the trajectories of internal variables by modulating the dynamical landscape at different trial epochs (Fig. 4i and Sup. Fig. S13). In particular, stimulus A strongly enhanced negative feedback (Fig. 4g), which transiently gave rise to limit-cycle dynamics that opened a fast channel able to flip neural activity in the *κ*_{1}-*κ*_{2} plane [Chaisangmongkon et al., 2017]. The four trial types in the task therefore corresponded to different sequences of dynamical landscapes (Fig. 4i), leading to different neural trajectories and final states that determined the correct behavioral outputs (Sup. Fig. S13).
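The transient channel opened by stimulus A can be caricatured as a rotation in the *κ*_{1}-*κ*_{2} plane: while the stimulus is on, the modulated couplings add a strong antisymmetric component that rotates the state by half a cycle, flipping the memory of the first stimulus. A toy sketch, with timescales and couplings chosen for illustration rather than fitted:

```python
import numpy as np

def evolve(kappa, rotating, T=1.0, dt=0.001):
    """Latent dynamics dkappa/dt = W @ kappa. At rest W = 0 (a caricature of
    slow memory dynamics); during stimulus A, gain modulation adds a strong
    antisymmetric component that rotates the state by pi over the epoch."""
    omega = np.pi / T
    W = omega * np.array([[0.0, -1.0], [1.0, 0.0]]) if rotating \
        else np.zeros((2, 2))
    for _ in range(int(T / dt)):
        kappa = kappa + dt * (W @ kappa)
    return kappa

k = np.array([1.0, 0.0])          # latent state after a first stimulus "A"
k_AA = evolve(k, rotating=True)   # a second stimulus A flips the state
k_Ax = evolve(k, rotating=False)  # no second stimulus: the state is held
```

Which quadrant the state ends up in after the second stimulus then determines the match/non-match output, mirroring the sequence of landscapes in Fig. 4i.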

In summary, we found that networks with multiple sub-populations implemented flexible computations by exploiting gain modulation to modify effective couplings between collective variables. The minimal solutions for the two tasks displayed in Fig. 3 and Fig. 4 illustrate two different variants of this general mechanism. In the context-dependent decision-making task, the sensory inputs acted as drivers of the internal dynamics, and contextual inputs as gain modulators that controlled effective couplings between the sensory inputs and the internal collective variable. In contrast, in the delayed-match-to-sample task, sensory inputs acted as modulators of recurrent interactions, and gain modulation controlled only the effective couplings between the two internal variables. More generally, modulations of inputs and modulations of recurrent interactions could be combined to implement more complex tasks.

### 2.6 Predictions for neural selectivity and inactivations

Analyzing networks of minimal rank and minimal number of populations allowed us to identify the mechanisms underlying computations based on a multi-population structure in connectivity. We next sought to derive predictions of the identified mechanisms that are experimentally testable without access to details of the connectivity. We then tested these predictions on networks with a higher number of populations or higher rank, obtained by varying the constraints used during training. We focus here specifically on the context-dependent decision-making (CDM) task, and contrast it with the multi-sensory decision-making (MDM) task, for which networks received an identical input structure, but were required to produce an output independent of context.

For the CDM task, reducing the trained networks to effective circuits revealed that the key computations relied on a differential gain-modulation of separate populations by contextual inputs. For each neuron, contextual cues set its operating point on the non-linearity, and thereby the gain of its response to incoming stimuli. A direct implication is that neurons more strongly modulated by contextual cues change their gain more strongly across contexts, and with it the amplitude of their responses to stimulus features (Fig. 5a). An ensuing prediction at the level of selectivity of individual neurons is therefore that the pre-stimulus selectivity to context should be correlated with the change across contexts of regression coefficients for stimulus features (Fig. 5b). Our analyses therefore predict a specific form of multiplicative interactions, or non-linear mixed selectivity to stimulus features and context cues [Rigotti et al., 2013], but also imply that the two populations can be identified based on their selectivity to context (Fig. 5b).

The multiplicative interaction between context and stimulus selectivity is a necessary, but not sufficient, condition for implementing context-dependent responses. A second necessary component of the computational mechanism is that each population dominantly integrates one of the two features into the latent dynamics, as seen previously from the overlaps between the input vectors and the input-selection vectors (Fig. 4b right). This leads to a specific prediction for inactivation experiments: separately inactivating populations defined by their selectivity to context disrupts performance in one context, while leaving the other intact (Fig. 5c-d). In contrast, inactivating a random subset of neurons leads only to a slight overall decrease in performance, independently of the context (Fig. 5c-d).

We first tested the two predictions on networks constrained to be of minimal, unit rank, but in which clustering analyses in connectivity space revealed more than two populations (Sup. Fig. S9), as in [Yang et al., 2019]. The two predictions for selectivity and inactivations were directly borne out for such networks (Fig. 5e). We next turned to networks trained without rank constraint, and tested the two predictions without analyzing connectivity, as would be the case in experimental studies. The two predictions were again borne out (Fig. 5f), confirming that key aspects of the computational mechanism extend to networks in which the connectivity structure was of higher rank, and the dynamics higher-dimensional.

Finally we examined unit-rank networks trained on the MDM task. Such networks received an input structure identical to the CDM task, consisting of two stimulus features and two context cues. In contrast to the CDM task, the network was required to average the two stimulus features, and contextual cues were irrelevant, so that a fully random population structure was sufficient to perform the task (Fig. 1l). We therefore expected that the two predictions made for the CDM task do not necessarily hold in this case. We indeed found that training networks on the MDM task led to weaker selectivity to context, and weaker correlation between context selectivity and the change in stimulus selectivity (Fig. 5f). Specific neurons still exhibited selectivity to contextual cues, but inactivating them led to changes in performance similar to inactivating a random subset of neurons (Fig. 5f). Importantly, we controlled for the effect of context selectivity strength by manually increasing the amplitude of contextual inputs until the context selectivity matched that of networks trained on the CDM task. This increased the correlation between context selectivity and the change in stimulus selectivity but did not increase the impact of inactivating context-selective neurons (Sup. Fig. S15).

Altogether, our analyses therefore show that inactivating specific selectivity-defined populations leads to specific effects on performance in networks predicted to rely on non-random population structure, but not in networks for which population structure is expected to be computationally irrelevant.

### 2.7 Implications for multi-tasking

A recent study has reported that multiple populations emerge in networks trained simultaneously on multiple tasks, and can be repurposed across tasks [Yang et al., 2019]. Our results more specifically suggest that a multi-population structure in connectivity is needed only when an identical stimulus requires different outputs depending on the context set by the performed task. While this is the case in many multi-tasking situations, concurrent tasks are often instead based on different sets of stimuli [Cromer et al., 2010; Fritz et al., 2010; Elgueda et al., 2019]. Here we show that the reduced models developed by analyzing networks trained on individual tasks can be used to build networks that perform multiple tasks in parallel (Fig. 6). More specifically, multiple tasks on an identical set of stimuli can be performed by combining and repurposing multiple populations, while multiple tasks on separate sets of stimuli can in contrast be performed with a single population by relying on dynamics in orthogonal subspaces [Duncker et al., 2020; Zenke et al., 2017]. As a result, when identical stimuli are processed, some individual neurons exhibit task-specialisation, while for separate sets of stimuli all neurons are multi-taskers, and contribute to multiple tasks in parallel. These findings are in direct agreement with the activity of neurons in the prefrontal cortex during flexible categorisation, which show specialisation when identical stimuli are processed [Roy et al., 2010], and multi-tasking when separate stimulus sets are used [Cromer et al., 2010].

To illustrate task-specialization, we first consider a network that receives stimuli composed of two sensory features and, depending on a rule cue, performs one out of three different tasks on them: perceptual decision-making on the first stimulus feature, perceptual decision-making on the second stimulus feature, or integration of the two features as in the multi-sensory decision-making task (Fig. 6a). This multi-tasking setup is in fact a direct extension of context-dependent decision-making, and we implemented it using a simplified network based on the CDM task, consisting of unit-rank connectivity with three separate populations (Sup. Fig. S9). In that network, each population has a well-defined computational role. One of the populations plays the role of an integrator, and endows the latent dynamics with a long time-scale through strong positive feedback, allowing the network to integrate evidence. That population is repurposed across all tasks (Fig. 6c brown traces), and inactivating it leads to performance degradation on all three tasks (Fig. 6b). The other two populations relay separately the two sensory features into the latent dynamics, as in the CDM task (Fig. 4b,c,d). Each of them participates in only two of the three tasks, as corroborated by changes in task performance after selective inactivations (Fig. 6b). Neurons belonging to these two populations are therefore specialised for specific tasks, as seen in their task-specific responses to stimuli (Fig. 6c green and purple neurons).

We next illustrate multi-tasking in a network that performs two tasks on distinct sets of stimuli, the perceptual decision-making (DM) and the parametric working-memory (WM) tasks (Fig. 6d). Such a network can be obtained by directly superposing the connectivity matrices *J*_{DM} and *J*_{WM} of two minimal networks of rank one and two that perform the individual tasks with random population structure (Fig. 2). The resulting connectivity *J* = *J*_{DM} + *J*_{WM} is of rank three, and has a random population structure. The corresponding latent dynamics are based on a recurrent sub-space of dimension three, and the two tasks rely on two orthogonal subspaces, with one dimension implementing the DM task and the other two implementing the WM task (Fig. 6e). Because of the random population structure, each neuron is a random combination of collective variables corresponding to different tasks, so that all neurons display multi-tasking activity (Fig. 6e).
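The rank arithmetic behind this construction is easy to verify: low-rank matrices built from independent random Gaussian vectors have linearly independent column spaces with overwhelming probability in high dimension, so superposing them adds the ranks. A numpy sketch, with sizes chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

def low_rank(rank):
    # connectivity built as a sum of `rank` outer products m_r n_r^T
    m = rng.standard_normal((N, rank))
    n = rng.standard_normal((N, rank))
    return m @ n.T / N               # 1/N scaling keeps the feedback O(1)

J_dm = low_rank(1)                   # stand-in for the rank-one DM solution
J_wm = low_rank(2)                   # stand-in for the rank-two WM solution
J = J_dm + J_wm                      # superposed network
```

Because the random connectivity vectors of the two sub-networks are mutually near-orthogonal, the combined network has a three-dimensional recurrent subspace in which the two tasks occupy orthogonal components.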

## 3 Discussion

The goal of this study was to determine whether and when a non-random population structure is necessary for networks to perform a specific computation based on recurrent dynamics. To address this question, we first trained recurrent neural networks on a range of standard systems neuroscience tasks, and examined the emerging population structure in the selectivity and connectivity, and its relationship with the computations. We then identified underlying mechanisms by extracting the latent low-dimensional dynamics. Although a number of tasks could be implemented with random population structure in connectivity, we found that tasks based on flexible input-output mappings instead appeared to require an additional structure that could be accurately approximated in terms of a small number of sub-populations which played functionally distinct roles.

The starting motivation of this work was the apparent discrepancy between the experimental results of Raposo et al. [2014] and Hirokawa et al. [2019]. Analyzing neural activity in the rat posterior parietal cortex during a multi-sensory decision-making task, Raposo et al. [2014] found no evidence for non-random population structure in selectivity. Applying identical analyses to the prefrontal cortex, Hirokawa et al. [2019] instead identified population structure in activity during a more complex task that combined perceptual and value-guided decisions. Our results suggest that the difference between tasks provides a possible explanation for these diverging conclusions. Examining networks trained on an abstracted version of the multi-sensory integration task of Raposo et al. [2014], we found that a non-random population structure was not needed. Implementing a full version of the task used in Hirokawa et al. [2019] would have required reinforcement learning, which falls beyond the scope of the supervised methods for training networks used here. The core component of that task was however a flexible weighting of two sensory features depending on the context set by reward history. That requirement of context-dependent weighting of input streams is in fact identical to the one in the context-dependent decision-making task, in which all-or-none weights were assigned to the two stimulus features depending on the contextual cues. The gain-modulation mechanism underlying networks that performed the CDM task can more generally assign graded weights to each feature, as required by the task of Hirokawa et al. [2019]. Since this mechanism requires multiple populations, our analyses predict that a non-random population structure is needed for the task used in Hirokawa et al. [2019].

Fundamental theoretical results guarantee that unconstrained recurrent neural networks are able to approximate any input-output function if the number of neurons is large enough [Doya, 1993]. Here we have instead sought to determine how far this property extends to networks with a random population structure in connectivity, as defined based on the null hypothesis of the analyses of Raposo et al. [2014] and Hirokawa et al. [2019]. A key point is that such a random population structure in fact sets constraints on the parameter space that the connectivity can explore, by precluding distributions of connectivity parameters more complex than a single Gaussian population. In a previous study, we demonstrated that low-rank recurrent networks with such a random population structure can generate only a limited range of autonomous dynamics independently of their rank, while having multiple sub-populations instead allows networks to approximate any low-dimensional dynamical system [Beiran et al., 2021]. Here we showed that these theoretical findings directly allow us to interpret the computational role of population structure in networks trained on neuroscience tasks.

We found that in trained networks relying on a non-random population structure, connectivity could be accurately described by a small number of sub-populations. Mechanistically, the role of such a sub-population structure can be understood from two perspectives. From the neural state-space perspective, the collective dynamics explore a low-dimensional recurrent subspace, and the sub-population structure shapes the non-linear dynamical landscape of the activity in this subspace [Sussillo and Barak, 2013]. Specifically, different inputs differentially activate different populations, and shift the recurrent sub-space into different regions of the state-space with different non-linear dynamical landscapes. A complementary picture emerges from the perspective of the effective circuits which describe the low-dimensional latent dynamics in terms of interactions between collective variables through effective couplings (Fig. 4c,g). In that picture, the sub-population structure allows inputs to control the effective couplings by modulating the average gain of different sub-populations. The computations then rely on two functionally distinct types of inputs: drivers that directly entrain the collective variables, and modulators that shape the gains of the different sub-populations, and thereby the interactions between collective variables. Interestingly, gain modulation has long been posited as a mechanism underlying selective attention [Rabinowitz et al., 2015], a type of processing closely related to flexible input-output tasks considered here. While patterns of gain modulation [Salinas and Thier, 2000; Ferguson and Cardin, 2020], and the distinction between drivers and modulators [Sherman and Guillery, 1998] are fundamentally physiological concepts, here we found that an analogous mechanism emerges in abstract trained networks at the collective level of latent dynamics. 
Note that in our framework, drivers and modulators are indistinguishable at the single-cell level, where they both correspond to additive inputs (in contrast to, e.g., neuromodulation, which may multiplicatively control the gain of individual neurons, see [Stroud et al., 2018]). The functional distinction between drivers and modulators instead stems from the relation between the collective pattern of inputs and the recurrent connectivity in the network. Our analyses therefore establish a bridge between two levels of description, in terms of circuits and in terms of collective dynamics [Barack and Krakauer, 2021].

To focus on the functional role of population structure, we initialized our networks before training with fully unstructured connectivity, and in particular did not include any explicit anatomical constraints such as Dale’s law. Our analyses nevertheless show that the non-random population structure that emerges through training can be accurately described in terms of abstract sub-populations, defined as clusters in the connectivity space. What could be the physiological counterparts of the different functional sub-populations that we identified? There are at least two distinct possibilities. In the network trained on the context-dependent decision-making task, we found that the two sub-populations differed only in the relationship of their connectivity with respect to feed-forward and contextual inputs. Such sub-populations therefore bear an analogy with input- and output-defined cortical populations, such as those defined by inputs from the thalamus [Harris and Mrsic-Flogel, 2013; Schmitt et al., 2017] or outputs to the striatum [Znamenskiy and Zador, 2013]. In the network trained on the delayed-match-to-sample task, the two sub-populations instead differed at the level of recurrent connectivity: one population implemented positive, and the other negative feedback, the two being in general balanced (Sup. Fig. S10). This situation is reminiscent of excitatory and inhibitory sub-populations, which effectively implement positive and negative feedback in biological networks. Note that we do not mean to suggest that such population structure emerges biologically over the course of learning a task. Here we used artificial network training protocols to identify computational constraints, and we did not assume that they correspond to biological task-learning mediated by synaptic plasticity.
The population wiring structure that emerges through network training could for instance be interpreted as the result of evolutionary selection leading to anatomic structure encoded at the genetic or developmental level [Zador, 2019].

Previous studies have reported that when training networks on a given task, some aspects of the solutions are invariant [Maheswaranathan et al., 2019] while others depend on the details of the implementation [Yang et al., 2019; Duncker et al., 2020; Flesch et al., 2021]. Our analyses confirmed these observations. Our main result on the computational requirement of a non-random population structure in connectivity (Fig. 1l) held independently of the details of the training, and in particular in the absence of constraints on the rank of the network (Sup. Fig. S3). For tasks requiring a non-random population structure, the number of sub-populations needed to approximate connectivity however varied across networks (Sup. Fig. S9). For those tasks, our results show that a single global population is insufficient and that fundamental computational mechanisms are conserved across a range of different networks (Fig. 5). Our analyses however do not predict the specific dimensionality or number of populations to be expected. More systematic model selection could for instance be performed by further constraining recurrent neural networks based on recorded neural activity [Rajan et al., 2016; Aoi et al., 2020].

The fact that neurons are selective to mixtures of task variables rather than individual features has emerged as one of the defining properties of higher-order areas in the mammalian cortex [Fusi et al., 2016]. Following up on the analyses of Hirokawa et al. [2019], our results further clarify that mixed selectivity does not necessarily preclude an underlying population structure, and demonstrate how such a structure shapes the collective dynamics that underlie computations.

## Author contributions

A.D., A.V. and S.O. designed the study. A.D., A.V., S.O. developed the training and analysis pipelines. A.D., A.V., M.B., F.M., S.O. performed research and contributed to writing the manuscript.

## Code availability

Code will be made available on Github upon publication.

## Data availability

Trained models will be made available on Github upon publication.

## 4 Methods

### 4.1 Recurrent Neural Networks

We considered networks of *N* rate units that evolve over time according to

τ (d*x*_{i}/d*t*) = −*x*_{i} + Σ_{j=1}^{N} *J*_{ij} *ϕ*(*x*_{j}) + Σ_{s=1}^{N_{in}} *I*_{i}^{(s)} *u*_{s}(*t*) + *η*_{i}(*t*).   (8)

Here *x*_{i} represents the *activation* or total current received by the *i*-th unit, and *ϕ*(*x*_{i}) = tanh(*x*_{i}) is its firing rate. Moreover, each neuron received a feed-forward input and an independent white-noise input *η*_{i}(*t*), specified below.

The recurrent connectivity is set by the connectivity matrix **J** = {*J*_{ij}}_{i,j=1…N}. For full-rank networks, the coefficients *J*_{ij} were treated as independent parameters. For low-rank networks, **J** was constrained to be of rank *R*, and parametrized as

*J*_{ij} = (1/*N*) Σ_{r=1}^{R} *m*_{i}^{(r)} *n*_{j}^{(r)},   (9)

i.e. **J** was a sum of *R* outer products of vectors **m**^{(r)} and **n**^{(r)}. Throughout the text, we refer to the vectors **m**^{(r)} and **n**^{(r)} as the *connectivity vectors*, with **m**^{(r)} the *r*-th output vector, and **n**^{(r)} the *r*-th input-selection vector. Without loss of generality, we will assume that all the output vectors (and respectively all the input-selection vectors) are mutually orthogonal. Such a representation is uniquely defined by the singular-value decomposition of **J**, taking **m**^{(r)} to be the left singular vectors, and **n**^{(r)} the right singular vectors multiplied by the corresponding singular values.
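As a concrete illustration, the rank-*R* parametrization and its relation to the SVD can be sketched in a few lines of numpy (a minimal sketch, not the training code; the names `M`, `Nvec`, `J` are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 512, 2

# Columns: output vectors m^(r) and input-selection vectors n^(r).
M = rng.standard_normal((N, R))
Nvec = rng.standard_normal((N, R))

# Rank-R connectivity as a sum of R outer products, J_ij = (1/N) sum_r m_i^(r) n_j^(r).
J = (M @ Nvec.T) / N

# The SVD exposes the same structure: only R singular values are non-zero,
# the left singular vectors span the {m^(r)} subspace, and the right singular
# vectors scaled by the singular values play the role of the n^(r).
U, S, Vt = np.linalg.svd(J)
rank = int(np.sum(S > 1e-10))
```

Training only the 2*NR* entries of `M` and `Nvec`, rather than the *N*² entries of `J`, is what makes the low-rank parametrization cheap.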

The feedforward inputs were generated by *N*_{in} temporally-varying scalar stimuli *u*_{s}(*t*), each fed into unit *i* through a set of weights *I*_{i}^{(s)}:

*u*_{i}^{ff}(*t*) = Σ_{s=1}^{N_{in}} *I*_{i}^{(s)} *u*_{s}(*t*).   (10)

We refer to **I**^{(s)} = {*I*_{i}^{(s)}}_{i=1…N} as the *s*-th *input vector*.

The output of the network was defined by a readout value

*z*(*t*) = (1/*N*) Σ_{i=1}^{N} *w*_{i} *ϕ*(*x*_{i}),   (11)

where **w** = {*w*_{i}}_{i=1…N} is the *readout vector*.

The time constant of neurons was *τ* = 100ms. For simulation and training, equation (8) was discretized using Euler’s method with a time step Δ*t* = 20ms. The white noise *η*_{i} was simulated by drawing at each time step a random number from a centered Gaussian distribution of standard deviation 0.05.
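The Euler discretization described above can be sketched as follows (our notation; `simulate` and its arguments are illustrative, and the per-step noise draw follows the description in the text):

```python
import numpy as np

def simulate(J, I, u, dt=0.02, tau=0.1, noise_std=0.05, seed=0):
    """Euler discretization of tau dx/dt = -x + J phi(x) + I u(t) + eta (Eq. (8))."""
    rng = np.random.default_rng(seed)
    x = np.zeros(J.shape[0])
    traj = np.empty((len(u), len(x)))
    for t, ut in enumerate(u):
        eta = noise_std * rng.standard_normal(len(x))   # fresh noise each step
        x = x + (dt / tau) * (-x + J @ np.tanh(x) + I * ut + eta)
        traj[t] = x
    return traj
```

With zero recurrence, the activations simply low-pass filter the input with time constant *τ*, which is a quick sanity check on the integration.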

For any pair of *N*-dimensional vectors **a** and **b**, the *overlap* *σ*_{ab} was defined as the empirical covariance of their entries:

*σ*_{ab} = (1/*N*) Σ_{i=1}^{N} *a*_{i} *b*_{i}.   (12)

### 4.2 Network training procedure

We used backpropagation through time [Werbos, 1990] to train networks to minimize loss functions corresponding to specific tasks. For each task (see details below), we specified the temporal structure of trials and the desired mapping from combinations of stimulus inputs to target readouts ẑ, and then stochastically generated trials. We minimized the mean squared error loss function

*L* = (1/(*KT*)) Σ_{k=1}^{K} Σ_{t=1}^{T} *M*_{t} (*z*_{k,t} − ẑ_{k,t})²,   (13)

where *z*_{k,t} and ẑ_{k,t} are respectively the actual and the target readout values, and the indices *k*, *t* respectively run over trials and time steps. The terms *M*_{t} are {0, 1} masks that were non-zero only during a decision period at the end of each trial, when the readouts were required to match their target values. For each task we also defined a performance measure called accuracy: the percentage of trials for which the network output has the same sign as the target output.
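In numpy terms, the masked loss and accuracy can be sketched as follows (the actual training used pytorch; `masked_mse` and `accuracy` are our names, and normalizing by the number of unmasked steps is one natural reading of the masked mean squared error):

```python
import numpy as np

def masked_mse(z, z_target, mask):
    """MSE restricted to decision-period time steps; arrays of shape (trials, time)."""
    return np.sum(mask * (z - z_target) ** 2) / np.sum(mask)

def accuracy(z, z_target, mask):
    """Fraction of trials where the masked mean output matches the target sign."""
    z_dec = np.sum(mask * z, axis=1) / np.sum(mask, axis=1)
    tgt = np.sum(mask * z_target, axis=1) / np.sum(mask, axis=1)
    return np.mean(np.sign(z_dec) == np.sign(tgt))
```

The mask ensures gradients only flow from the decision period, leaving the network free to use arbitrary transients earlier in the trial.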

For full-rank networks (Figs. 1, 5) the gradients were computed with respect to individual entries *J*_{ij} of the connectivity matrix. For results on full-rank networks in Fig. 1 (left column) and Sup. Fig. S3, matrices **J** were initialized with random independent Gaussian weights of mean 0 and variance *ρ* = 1/*N*. For the supplementary results in Sup. Fig. S3, we also trained networks whose weights were initialized with a variance *ρ* = 0.1/*N*, since these tend to be approximated more easily by low-rank networks [Schuessler et al., 2020b].

For low-rank networks, we specifically looked for solutions in the sub-space of connectivity matrices with rank *R*. The loss functions were therefore minimized by computing gradients with respect to the elements of the connectivity vectors {**m**^{(r)}}_{r=1…R}, {**n**^{(r)}}_{r=1…R}. Unless specified otherwise in the description of individual tasks, we did not train the entries of the input and readout vectors, but only an overall amplitude factor for each input vector and for the readout vector. All vectors were initialized with their entries drawn from Gaussian distributions with zero mean and unit standard deviation, except for the readout vector, for which the standard deviation was 4. The initial network state at the beginning of each trial was always set to **0**. We used the ADAM optimizer [Kingma and Ba, 2014] in pytorch [Paszke et al., 2017] with decay rates of the first and second moments of 0.9 and 0.999, and learning rates between 10^{−3} and 10^{−2}.

To identify networks of minimal rank that performed each task, the number of pairs of connectivity vectors *R* was treated as a hyper-parameter. We first trained full-rank networks (*R* = *N*) and determined the accuracy with which they solved the task. We then trained rank *R* = 5 networks, and progressively decreased the rank until there was a sharp decrease in accuracy (Sup. Fig. S2). The minimal rank *R** was defined for each task as the smallest rank at which the accuracy was at least 95%.

To ease the clustering and resampling procedures, and to approach mean-field solutions, we trained large networks (512 neurons for the networks of Figs. 1 and 2, 4096 neurons for the context-dependent DM and DMS task networks of Figs. 3 and 4, and 1024 neurons in Fig. 5).

### 4.3 Definition of individual tasks

#### 4.3.1 Perceptual decision making (DM) task

##### Trial structure

A fixation epoch of duration *T*_{fix} = 100ms was followed by a stimulation epoch of duration *T*_{stim} = 800ms, a delay epoch of duration *T*_{delay} = 100ms and a decision epoch of duration *T*_{decision} = 20ms.

##### Inputs and outputs

The feed-forward input to neuron *i* on trial *k* was

*u*_{i}^{(k)}(*t*) = *I*_{i} *u*^{(k)}(*t*),

where, during the stimulation period, *u*^{(k)}(*t*) = ū^{(k)} + *ξ*^{(k)}(*t*), with *ξ*^{(k)}(*t*) a zero-mean Gaussian white noise with standard deviation *σ*_{u} = 0.1. The mean stimulus ū^{(k)} was drawn uniformly from ±0.1 × {1, 2, 4} on each trial. The elements *I*_{i} of the input vector were generated from a Gaussian distribution with zero mean and unit standard deviation, and fixed during training. During the decision epoch, the output *z* was evaluated through a readout vector **w** = {*w*_{i}}_{i=1…N}, the elements *w*_{i} of which were generated from a Gaussian distribution with zero mean and standard deviation of 4, and fixed during training. On trial *k*, the target output value in the loss function (Eq. (13)) was defined as the sign of the mean input ū^{(k)}.
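Trial generation for this task can be sketched as follows (epoch lengths are counted in steps of Δ*t* = 20ms; `dm_trial` is our name, not the paper's code):

```python
import numpy as np

def dm_trial(rng, sigma_u=0.1):
    """One perceptual-DM trial: fixation, stimulation, delay, decision epochs."""
    n_fix, n_stim, n_delay, n_dec = 5, 40, 5, 1   # 100, 800, 100, 20 ms at dt = 20 ms
    u_bar = rng.choice([-1, 1]) * rng.choice([0.1, 0.2, 0.4])
    u = np.zeros(n_fix + n_stim + n_delay + n_dec)
    u[n_fix:n_fix + n_stim] = u_bar + sigma_u * rng.standard_normal(n_stim)
    mask = np.zeros_like(u)
    mask[-n_dec:] = 1.0            # loss is evaluated only in the decision epoch
    target = np.sign(u_bar)        # target readout: sign of the mean stimulus
    return u, mask, target
```

Batching many such trials and feeding them through the input vector *I* reproduces the stochastic trial generation described above.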

#### 4.3.2 Parametric working memory (WM) task

##### Trial structure

A fixation epoch of duration *T*_{fix} = 100ms was followed by a first stimulation epoch of duration *T*_{stim1} = 100ms, a delay epoch of duration *T*_{delay} drawn from a uniform distribution between 500 and 2000ms, a second stimulation epoch of duration *T*_{stim2} = 100ms and a decision epoch of duration *T*_{decision} = 100ms.

##### Inputs and outputs

The feed-forward input to neuron *i* on trial *k* was

*u*_{i}^{(k)}(*t*) = *I*_{i} (*u*_{1}^{(k)}(*t*) + *u*_{2}^{(k)}(*t*)),

where *u*_{1}^{(k)} and *u*_{2}^{(k)} were non-zero during the first and second stimulation epochs respectively. On trial *k* and during the corresponding stimulation epoch, the values of these inputs were set by the stimulus frequencies *f*_{1}^{(k)} and *f*_{2}^{(k)}, with *f*_{1}^{(k)} and *f*_{2}^{(k)} drawn uniformly from {10, 11, … , 34}, and *f*_{min} = 10 and *f*_{max} = 34. The elements *I*_{i} of the input vector were generated from a Gaussian distribution with zero mean and unit standard deviation, and fixed during training.

During the decision epoch, the output *z* was evaluated through a readout vector **w** = {*w*_{i}}_{i=1…N}, the elements *w*_{i} of which were generated from a Gaussian distribution with zero mean and standard deviation of 4, and fixed during training. On trial *k*, the target output value in the loss function (Eq. (13)) was defined as (*f*_{1}^{(k)} − *f*_{2}^{(k)})/(*f*_{max} − *f*_{min}).

#### 4.3.3 Context-dependent decision making (CDM) task

##### Trial structure

A fixation epoch of duration *T*_{fix} = 100ms was followed by a first context-only epoch of duration *T*_{ctxt1} = 0ms for Fig. 1 and 350ms for the Fig. 3 and 4 plots, followed by a stimulation epoch of duration *T*_{stim} = 800ms, a second context-only epoch of duration *T*_{ctxt2} = 500ms, and a decision epoch of duration *T*_{decision} = 20ms.

##### Stimuli and outputs

The feed-forward input to neuron *i* on trial *k* was

*u*_{i}^{(k)}(*t*) = *I*_{i}^{A} *u*_{A}^{(k)}(*t*) + *I*_{i}^{B} *u*_{B}^{(k)}(*t*) + *I*_{i}^{ctxA} *u*_{ctxA}(*t*) + *I*_{i}^{ctxB} *u*_{ctxB}(*t*).

Here *u*_{ctxA} and *u*_{ctxB} correspond to contextual cues. On each trial, during the context-only and the stimulation epochs, one of the two cues took a value +0.1 (or +0.5 for Figs. 3 and 4), while the other was 0. The inputs *u*_{A}^{(k)} and *u*_{B}^{(k)} represent two sensory features of the stimulus. They were non-zero only during the stimulation epoch, and took the same form as in the perceptual decision-making task, with means ū_{A}^{(k)} and ū_{B}^{(k)}, and fluctuating parts *ξ*_{A}^{(k)} and *ξ*_{B}^{(k)} drawn independently for each feature, on each trial. The elements of the input vectors were generated from a Gaussian distribution with zero mean and unit standard deviation on both populations. For the networks presented in the main text, input vectors were trained, while for the networks reported in supplementary section S2.3 all the input vectors were fixed throughout training. During the decision epoch, on trial *k* the target in the loss function (Eq. (13)) was defined as the sign of the mean of the feature *X* = *A* or *B* for which the contextual cue was activated, i.e. sign(ū_{X}^{(k)}). The readout vector was fixed throughout training.

#### 4.3.4 Multi-sensory decision making (MDM) task

##### Trial structure

A fixation epoch of duration *T*_{fix} = 100ms was followed by a context-only period of duration *T*_{ctx} = 350ms, a stimulation epoch of duration *T*_{stim} = 800ms, a delay epoch of duration *T*_{delay} = 300ms and a decision epoch of duration *T*_{decision} = 20ms.

##### Inputs and outputs

The feed-forward input to neuron *i* on trial *k* had the same structure as for the context-dependent decision-making task, and was given by:

*u*_{i}^{(k)}(*t*) = *I*_{i}^{A} *u*_{A}^{(k)}(*t*) + *I*_{i}^{B} *u*_{B}^{(k)}(*t*) + *I*_{i}^{ctxA} *u*_{ctxA}(*t*) + *I*_{i}^{ctxB} *u*_{ctxB}(*t*),

where the two stimulus inputs *u*_{A}^{(k)} and *u*_{B}^{(k)} represent two sensory modalities, and *u*_{ctxA} and *u*_{ctxB} are contextual inputs. In this task, the contextual inputs were irrelevant for the output, and we included them as a control. The inputs *u*_{A}^{(k)} and *u*_{B}^{(k)} were generated as for the CDM task, with the difference that on each trial the two inputs provided congruent evidence for the output, i.e. their means were of the same sign.

Specifically, on each trial a sign *s*_{k} ∈ {−1, 1} was generated randomly, as well as a modality that could be A, B, or AB. If the modality was A or AB, a mean was chosen from 0.1 × *s*_{k} × {1, 2, 4} and the signal *u*_{A}^{(k)} during the stimulation period was set to that mean plus a Gaussian white noise, as in the perceptual decision-making task; the contextual input signal *u*_{ctxA} was set to 0.1 from the beginning of the contextual period to the end of the trial. If the modality was B, the signal *u*_{A}^{(k)} was only equal to the zero-centered Gaussian white noise. The signals *u*_{B}^{(k)} and *u*_{ctxB} were set in a similar manner. During the decision epoch, the target was the underlying common sign *s*_{k}.

The networks received input signals through input vectors *I*^{A}, *I*^{B}, *I*^{ctxA} and *I*^{ctxB} which were trained, and output was read through a readout vector ** w** which was fixed throughout training.

#### 4.3.5 Delayed-match-to-sample task

##### Trial structure

A fixation epoch of duration *T*_{fix} = 100ms was followed by a first stimulus epoch of duration *T*_{stim1} = 500ms, a delay epoch of a duration drawn uniformly between 500ms and 3000ms, a second stimulus epoch of duration *T*_{stim2} = 500ms, and a decision epoch of duration *T*_{decision} = 1000ms.

##### Stimuli and outputs

During each stimulus epoch, the network received one of two stimuli *A* or *B*, which were randomly and independently chosen on each trial and stimulus epoch. These two stimuli were represented by two input vectors *I*^{A} and *I*^{B}, so that the feed-forward input to neuron *i* on trial *k* was:

*u*_{i}^{(k)}(*t*) = *I*_{i}^{A} *u*_{A}(*t*) + *I*_{i}^{B} *u*_{B}(*t*),

where the inputs *u*_{A} and *u*_{B} were non-zero only when stimuli *A* or *B* were respectively received, in which case they were equal to one.

During the decision epoch, the target output value in the loss function (Eq. (13)) was equal to +1 if the same stimulus was received in both stimulation epochs and −1 otherwise.

### 4.4 Regression analyses and selectivity space

We used multivariate linear regression to predict time-averaged neural firing rates *r*_{i} = *ϕ*(*x*_{i}) from task variables, using a linear model:

**r**_{i} = **X** *β*_{i} + *ϵ*_{i}.

Here **r**_{i} = {*r*_{i,1}, … , *r*_{i,K}} is a vector containing the time-averaged firing rates of neuron *i* in trials 1 to *K*, **X** is the *K*-by-*D* design matrix whose rows correspond to different trials and whose columns correspond to *D* task variables such as stimulus, context and decision (defined below for each task), *β*_{i} is a *D*-by-1 vector of regression coefficients, and *ϵ*_{i} is a *K*-by-1 vector of residuals.

The regression coefficients defined the *selectivity space* (Fig. 1a-d) of dimension *D*, in which each axis corresponded to the regression coefficient with respect to one task variable, and each neuron was represented as a point *β*_{i}. The choice of task variables and the window for time-averaging firing rates depended on the task:

- For the DM task, two regressions were performed on different time windows, leading to *D* = 2 coefficients per neuron: a regression of the average firing rate during the first 100ms of the stimulation period against the mean stimulus, which defined the coefficient *β*_{i}^{stim}, and a regression of the average firing rate during the decision period against the network choice, which defined the coefficient *β*_{i}^{choice}. This was done to avoid ill-conditioning due to correlations between choice and stimulus.
- For the WM task, the mean firing rate during the decision period was regressed against both *f*_{1} and *f*_{2}, leading to *D* = 2 coefficients per neuron.
- For the MDM task and the CDM task, the average firing rate during the stimulation period was regressed against both mean stimulus features ū_{A} and ū_{B} and both contextual input signals *u*_{ctxA} and *u*_{ctxB}, leading to *D* = 4 coefficients per neuron. In Fig. 5, the selectivity to context was characterized by a single regression coefficient *β*_{i}^{ctx} obtained by regressing the absolute value of the firing rate |*r*_{i}|, averaged over the pre-stimulus period where only the contextual cues are non-zero, against a regressor *X* that takes the value +1 in context A and −1 in context B; the context selectivity is thus extracted through the linear model |**r**_{i}| = **X** *β*_{i}^{ctx} + *ϵ*_{i} over *K* trials. In order to characterize the changes in selectivity with context, we subtracted the pre-stimulus firing rate from the firing rate averaged over the first 100ms of stimulus presentation, and regressed this quantity against ū_{A} and ū_{B} separately in each context to obtain the regression coefficients *β*_{i}^{A,ctxA} and *β*_{i}^{A,ctxB}. The change in selectivity is then given by Δ*β*_{i}^{A} = *β*_{i}^{A,ctxA} − *β*_{i}^{A,ctxB}. In Fig. 5 the analysis is presented for feature A; similar results are obtained for feature B (not shown).
- For the DMS task, the average firing rate during the decision period was regressed against both first and second stimulus identity (with *X*_{k,s} = 1 if stimulus *s* is A in trial *k*, 0 otherwise, *s* ∈ {0, 1}), leading to *D* = 2 regression coefficients per neuron.
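Since the same design matrix is shared by all neurons, the per-neuron regressions reduce to a single least-squares problem; a compact sketch (our helper `selectivity`, with centering of rates and regressors as an assumption):

```python
import numpy as np

def selectivity(R, X):
    """Least-squares regression coefficients beta_i for each neuron.

    R: (N, K) time-averaged rates, one row per neuron, one column per trial.
    X: (K, D) design matrix of task variables, one column per variable.
    Returns B: (N, D), one point per neuron in the selectivity space."""
    Xc = X - X.mean(axis=0)                     # center regressors
    Rc = R - R.mean(axis=1, keepdims=True)      # center rates per neuron
    B, *_ = np.linalg.lstsq(Xc, Rc.T, rcond=None)
    return B.T
```

Each row of the returned matrix is one neuron's point *β*_{i} in the selectivity space analyzed in Fig. 1a-d.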

### 4.5 Connectivity space

For a low-rank network, the connectivity is specified by 2*R* + *N*_{in} + 1 parameters for each neuron, corresponding to its entries on the input, connectivity and output vectors. The connectivity of each neuron can therefore be represented as a point in a space of dimension 2*R* + *N*_{in} + 1 that we term *connectivity space*. For each network, the distribution of points in this space is analysed for randomness in Figure 1, and used in the resampling procedures of figures 1, 2 and 3. Our mean-field theory shows that in the limit of large networks, the distribution of points in this space determines the low-dimensional latent dynamics of the network (see Section 4.8.2).

### 4.6 ePairs analysis

To statistically assess the presence of non-random population structure in the selectivity and connectivity spaces of trained networks, we implemented a version of the ePAIRS statistical test [Hirokawa et al., 2019], which is itself derived from the PAIRS test developed in Raposo et al. [2014]. We consider a point cloud **X** = (*X*_{ij})_{1≤i≤N,1≤j≤d}, where the rows **x**_{i} correspond to different points (here neurons) and the columns correspond to different axes of the considered space (regression coefficients for different variables in the selectivity space; entries of the different input, connectivity and readout vectors in the connectivity space), which is centered by removing the mean (so that Σ_{i} *X*_{ij} = 0 for each *j*). The ePAIRS test examines the directional distribution of points, i.e. the empirical distribution of **x**_{i}/∥**x**_{i}∥, and determines whether it is statistically distinguishable from the null distribution generated by a multivariate Gaussian with a covariance matrix identical to the covariance of **X**. A significant outcome of the ePAIRS test indicates that the empirical distribution presents multiple “preferred” directions incompatible with a Gaussian.

More specifically, the analysis proceeds as follows:

1. For each point **x**_{i}, we determine its *l* nearest neighbors in terms of the cosine metric (i.e. the *l* points **x**_{j} for which ⟨**x**_{i}, **x**_{j}⟩/(∥**x**_{i}∥∥**x**_{j}∥) are the highest, *l* being a hyperparameter set to 3 in our case).
2. For each neuron, we compute the mean angle *α*_{i} with its *l* nearest neighbors, defining an empirical distribution {*α*_{i}}_{i=1…N}.
3. To generate the corresponding null distribution, a multivariate Gaussian distribution is fit to the cloud of points **X**, with **Σ** the empirical covariance of **X**, computed as **Σ** = **X**^{T}**X**/*N*. Then steps 1–2 are applied to 500 samples of the multivariate Gaussian with the same number *N* of data points to compute a Monte-Carlo null distribution {α̃_{i}}.
4. Finally, the difference between the data and null distributions is assessed using Wilcoxon’s rank-sum test, giving a p-value, and the effect size *c* is computed as *c* = (μ̃ − *μ*)/√((*σ*² + σ̃²)/2), where *μ*, *σ* and μ̃, σ̃ represent the means and standard deviations of {*α*_{i}} and {α̃_{i}}. An effect size *c* > 0 indicates that angles between neighbors are smaller in the data than in the resampled point clouds, meaning that points are more highly clustered than expected. On the contrary, *c* < 0 indicates that points are more regularly dispersed than expected from random.
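The steps above can be sketched as follows (a simplified re-implementation, not the authors' code; we use scipy's rank-sum test and a pooled-standard-deviation effect size, which is one common convention):

```python
import numpy as np
from scipy.stats import ranksums

def mean_nn_angles(X, l=3):
    """Mean angle of each point to its l nearest neighbors under the cosine metric."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = np.clip(Xn @ Xn.T, -1.0, 1.0)
    np.fill_diagonal(cos, -np.inf)              # exclude self-matches
    nn = np.sort(cos, axis=1)[:, -l:]           # l largest cosine similarities
    return np.arccos(nn).mean(axis=1)

def epairs(X, l=3, n_resamples=50, seed=0):
    """ePAIRS-style test: data angles vs. a covariance-matched Gaussian null."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)
    cov = X.T @ X / len(X)
    data = mean_nn_angles(X, l)
    null = np.concatenate([
        mean_nn_angles(rng.multivariate_normal(np.zeros(X.shape[1]), cov, len(X)), l)
        for _ in range(n_resamples)])
    p = ranksums(data, null).pvalue
    c = (null.mean() - data.mean()) / np.sqrt((data.std() ** 2 + null.std() ** 2) / 2)
    return p, c
```

A positive *c* with a small p-value flags clustering; a negative *c* flags an unusually regular dispersion of directions.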

### 4.7 Resampling and clustering trained networks

For a given trained network, we first fitted a single multivariate Gaussian to its connectivity distribution by computing the empirical covariance matrix (matrix of size (*N*_{in} + 2*R* + 1)^{2}). We then generated networks by resampling connectivity parameters from this distribution, and examined their performance (Fig. 1i and Sup. Fig. S3). In all trained networks we examined, the empirical means were close to zero, and we neglected them.

For the CDM and DMS tasks, we performed a clustering analysis in the connectivity space by fitting multivariate mixtures of Gaussians with an increasing number of clusters, and by resampling from the obtained distributions until we found networks that were able to optimally perform the task, defined as an accuracy higher than 95% for at least 95% of the sampled networks. We used variational inference with a Gaussian prior on the means with a precision equal to 10^{5} to enforce a zero-mean constraint on all components of the mixtures, and a Dirichlet process prior on the weights with concentration 1 divided by the number of components, using the model `BayesianGaussianMixture` of the package scikit-learn [Pedregosa et al., 2011].
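With scikit-learn, the fitting and resampling steps can be sketched as follows (a toy two-population cloud; the parameter names are real `BayesianGaussianMixture` arguments, but the values mirror the description above rather than the exact configuration):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Toy connectivity cloud: two zero-mean populations with different covariances.
A = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], 300)
B = rng.multivariate_normal([0, 0], [[1.0, -0.8], [-0.8, 1.0]], 300)
X = np.vstack([A, B])

gmm = BayesianGaussianMixture(
    n_components=5,                              # upper bound; extra components get ~0 weight
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0 / 5,          # concentration 1 / number of components
    mean_prior=np.zeros(2),
    mean_precision_prior=1e5,                    # pins component means near zero
    random_state=0,
).fit(X)

# Resample a surrogate connectivity cloud from the fitted mixture.
X_new, labels = gmm.sample(600)
```

A resampled cloud like `X_new` is then used to build new networks whose performance is compared to the trained one.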

Since the inference and resampling processes are susceptible to finite-size fluctuations, for the DMS task in Fig. 3 we complemented the clustering with some retraining of the covariance matrices found for each component. For this we developed a class of Gaussian-mixture low-rank RNNs, in which the covariance structure of each population is trainable. Directly training the covariance matrices is difficult given that they need to be symmetric positive definite; we therefore used a trick akin to the reparametrization trick used in variational auto-encoders [Kingma and Welling, 2013]: the set of input, connectivity and readout vectors was defined as a linear transformation of a basis of i.i.d. normal vectors, such that for any connectivity vector **a**:

*a*_{i} = ⟨**c**_{p}^{(a)}, **X**_{i}⟩,

where *p* is the population index of neuron *i* (sampled from a categorical distribution with weights {*α*_{p}}_{p=1…P} derived by the variational inference), the **X**_{i} are random normal vectors of dimension *N*_{in} + 2*R* + 1, and the vectors **c**_{p}^{(a)} correspond to the rows of the Cholesky factor **C**_{p} of the covariance matrix **Σ**_{p} (such that **Σ**_{p} = **C**_{p}**C**_{p}^{T}; see SI section S1 for more details). We then trained the vectors **c**_{p}^{(a)}, with the population indices being sampled only once, and the **X**_{i} being resampled at each training epoch.
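The reparametrization itself is a few lines (a sketch with a random target covariance; in the actual model the Cholesky factor would be a trainable tensor per population):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                      # N_in + 2R + 1 parameters per neuron
N = 1000

# Target covariance for one population (symmetric positive definite by construction).
Atmp = rng.standard_normal((d, d))
Sigma = Atmp @ Atmp.T / d + np.eye(d)

# Reparametrization trick: a_i = C @ x_i with x_i ~ N(0, I) and Sigma = C C^T,
# so the unconstrained factor C can be trained by gradient descent directly.
C = np.linalg.cholesky(Sigma)
Xbasis = rng.standard_normal((N, d))       # i.i.d. normal basis, resampled per epoch
A = Xbasis @ C.T                           # rows: connectivity parameters of each neuron

emp = A.T @ A / N                          # empirical covariance approaches Sigma
```

Gradients with respect to `C` flow through the sampled parameters while the positive-definiteness of the implied covariance is automatic.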

### 4.8 Analysis of latent dynamics in low-rank networks

Here we provide an overview of the reduction of low-rank networks to low-dimensional latent dynamics. A more complete derivation can be found in [Beiran et al., 2021]. For simplicity, we consider the noise free case (*η*_{i}(*t*) = 0 in Eq. (8)), and we assume the initial condition *x*_{i} = 0 at *t* = 0 for all *i* = 1 … *N*.

#### 4.8.1 Low-dimensional dynamics

The dynamics defined by Eq. (8) can be represented as a trajectory in the *N*-dimensional state space in which each axis corresponds to the activation *x*_{i} of unit *i*. When the connectivity is constrained to be of low rank, the dynamics are restricted to a low-dimensional subspace of this state-space [Mastrogiuseppe and Ostojic, 2018]. Indeed, inserting Eqs. (9) and (10) into Eq. (8) leads to

τ (d**x**/d*t*) = −**x** + Σ_{r=1}^{R} (**n**^{(r)T} *ϕ*(**x**)/*N*) **m**^{(r)} + Σ_{s=1}^{N_{in}} *u*_{s}(*t*) **I**^{(s)}.

At any time *t*, the right-hand side is confined to the linear subspace spanned by the vectors {**m**^{(r)}}_{r=1…R} and {**I**^{(s)}}_{s=1…N_{in}}. Since we assumed *x*_{i} = 0 at *t* = 0, the dynamics of **x**(*t*) = {*x*_{i}(*t*)}_{i=1…N} remain in that subspace for all *t*. The activation vector **x** can therefore be expressed in terms of *R* internal collective variables *κ*_{r} and *N*_{in} external collective variables *υ*_{s}:

**x**(*t*) = Σ_{r=1}^{R} *κ*_{r}(*t*) **m**^{(r)} + Σ_{s=1}^{N_{in}} *υ*_{s}(*t*) **I**^{(s)⊥}.   (25)

The first term on the right-hand side in Eq. (25) represents the component of the activity on the *recurrent space* [Wang et al., 2018; Remington et al., 2018], defined as the sub-space spanned by the output connectivity vectors {**m**^{(r)}}_{r=1…R}. The corresponding internal collective variables *κ*_{r} are defined as projections of the activation vector **x** on the **m**^{(r)}:

*κ*_{r} = **m**^{(r)T}**x** / ∥**m**^{(r)}∥².   (26)

The second term on the right-hand side in Eq. (25) represents the component of the activity on the *input space*, defined as the sub-space spanned by {**I**^{(s)⊥}}_{s=1…N_{in}}, the set of input vectors orthogonalized with respect to the recurrent sub-space. The corresponding external collective variables *υ*_{s} are defined as projections of the activation vector **x** on the **I**^{(s)⊥}:

*υ*_{s} = **I**^{(s)⊥T}**x** / ∥**I**^{(s)⊥}∥².   (27)

The dimensionality of the dynamics in state space is thus given by the sum of the dimension *R* of the recurrent sub-space, i.e. the rank of the connectivity, and the dimensionality *N*_{in} of the input space.

The dynamics of the internal variables *κ*_{r} are obtained by projecting Eq. (8) onto the output connectivity vectors **m**^{(r)}:

τ (d*κ*_{r}/d*t*) = −*κ*_{r} + *κ*_{r}^{rec} + Σ_{s=1}^{N_{in}} (**m**^{(r)T}**I**^{(s)} / ∥**m**^{(r)}∥²) *u*_{s}(*t*),   (28)

where *κ*_{r}^{rec} represents the recurrent input to the *r*-th collective variable, defined as the projection of the firing-rate vector *ϕ*(**x**) onto the input-selection vector **n**^{(r)}:

*κ*_{r}^{rec} = **n**^{(r)T} *ϕ*(**x**) / *N*.   (29)

Inserting Eq. (25) into the expression for *κ*_{r}^{rec} leads to a closed set of equations for the *κ*_{r}:

*κ*_{r}^{rec} = (1/*N*) Σ_{j=1}^{N} *n*_{j}^{(r)} *ϕ*(Σ_{r′=1}^{R} *κ*_{r′} *m*_{j}^{(r′)} + Σ_{s=1}^{N_{in}} *υ*_{s} *I*_{j}^{(s)⊥}).   (30)

The dynamics of the external variables *υ*_{s} are obtained by projecting Eq. (8) onto the orthogonalized input vectors **I**^{(s)⊥}. They are given by the external inputs *u*_{s}(*t*) filtered by the single-neuron time constant *τ*:

τ (d*υ*_{s}/d*t*) = −*υ*_{s} + *u*_{s}(*t*).   (31)
Throughout the main text, we assume for simplicity that the stimuli *u*_{s} vary on a timescale slower than *τ*, and replace *υ*_{s} with *u*_{s}. We also assume throughout the main text that input vectors are orthogonal to the output connectivity vectors, i.e. **m**^{(r)T}**I**^{(s)} = 0 for all *r* and *s*; hence the third term on the r.h.s. of Eq. (28) equals zero. Using Eq. (25), the readout value *z* can be expressed in terms of the collective variables as

*z* = (1/*N*) Σ_{i=1}^{N} *w*_{i} *ϕ*(Σ_{r=1}^{R} *κ*_{r} *m*_{i}^{(r)} + Σ_{s=1}^{N_{in}} *υ*_{s} *I*_{i}^{(s)⊥}).
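Numerically, recovering the collective variables from activity is a simple projection (a self-contained sketch; we orthonormalize the `m^(r)` as assumed in the text and rescale them so that ∥**m**^{(r)}∥² = *N*):

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 400, 2

# Mutually orthogonal output vectors m^(r), rescaled so ||m^(r)||^2 = N.
M, _ = np.linalg.qr(rng.standard_normal((N, R)))
M *= np.sqrt(N)

kappa_true = np.array([0.7, -1.3])
x = M @ kappa_true                           # activity confined to the recurrent subspace

# Internal collective variables recovered by projection:
# kappa_r = m^(r)^T x / ||m^(r)||^2.
kappa = M.T @ x / np.sum(M ** 2, axis=0)
```

With recorded or simulated trajectories, applying the same projection at every time step yields the latent trajectories κ_{r}(*t*) analyzed in the main text.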

#### 4.8.2 Connectivity space and mean-field limit

The dynamics of the collective variables are fundamentally determined by the components of the connectivity and input vectors through Eq. (30). Neuron *i* is therefore characterized by the 2*R* + *N*_{in} + 1 parameters

{*n*_{i}^{(1)}, … , *n*_{i}^{(R)}, *m*_{i}^{(1)}, … , *m*_{i}^{(R)}, *I*_{i}^{(1)}, … , *I*_{i}^{(N_{in})}, *w*_{i}}.

Each neuron can thus be represented as a point in the *connectivity space* of dimension 2*R* + *N*_{in} + 1, and the connectivity of the full network can therefore be described as a set of *N* points in this space. Note that the right-hand side of Eq. (30) consists of a sum of *N* terms, where term *j* contains only the connectivity parameters of neuron *j*. The connectivity parameters of different neurons therefore do not interact in *κ*_{r}^{rec}, so that the r.h.s. of Eq. (30) can be interpreted as an average over the set of points corresponding to all neurons in the connectivity space.

Our main assumption will be that in the limit of large networks (*N* → ∞), the set of points in the connectivity space is described by a probability distribution *P*(**n**, **m**, **I**, *w*). In this mean-field limit, the r.h.s. of Eq. (30) becomes:

*κ*_{r}^{rec} = ⟨*n*^{(r)} *ϕ*(*x*)⟩,   (34)

where we have used the shorthand *x* = Σ_{r′} *κ*_{r′} *m*^{(r′)} + Σ_{s} *υ*_{s} *I*^{(s)}, and ⟨·⟩ denotes an average over *P*(**n**, **m**, **I**, *w*). The collective dynamics are therefore fully specified by the single-neuron distribution of connectivity parameters. Once this distribution is specified, any network generated by sampling from it will have identical collective dynamics in the limit of a large number of neurons.

The joint distribution of connectivity parameters *P*(**n**, **m**, **I**, *w*) also determines the value of the readout:

*z* = ⟨*w* *ϕ*(*x*)⟩.   (35)

#### 4.8.3 Statistics of connectivity and sub-populations

To approximate arbitrary joint distributions of connectivity parameters *P*(**n**, **m**, **I**, *w*), we used multivariate Gaussian mixture models (GMMs). This choice was based on the following considerations: (i) GMMs are able to approximate an arbitrary multivariate distribution [Kostantinos, 2000]; (ii) model parameters can be easily inferred from data using GMM clustering; (iii) GMMs afford a natural interpretation in terms of sub-populations; (iv) GMMs allow for a mathematically tractable and transparent analysis of the dynamics, as shown below [Beiran et al., 2021].

In a multivariate Gaussian mixture model, every neuron belongs to one of *P* sub-populations. For a neuron in sub-population *p*, the set of parameters is generated from a multivariate Gaussian distribution with mean *μ*_{p} and covariance **Σ**_{p}, where *μ*_{p} is a vector of size 2*R* + *N*_{in} + 1, and **Σ**_{p} is a covariance matrix of size (2*R* + *N*_{in} + 1)². The full distribution of connectivity parameters is therefore given by

*P*(**n**, **m**, **I**, *w*) = Σ_{p=1}^{P} *α*_{p} 𝒩(*μ*_{p}, **Σ**_{p}),

where the coefficients *α*_{p} define the fraction of neurons belonging to each sub-population.

Each sub-population directly corresponds to a Gaussian cluster of points in the connectivity space. The vector *μ*_{p} determines the center of the *p*-th cluster, while the covariance matrix **Σ**_{p} determines its shape and orientation. For a neuron *i* belonging to population *p*, we will write *σ*_{ab}^{(p)} for the covariance between two connectivity parameters *a* and *b*, with *a*, *b* ∈ {{*n*^{(r)}}_{r=1…R}, {*m*^{(r)}}_{r=1…R}, {*I*^{(s)}}_{s=1…N_{in}}, *w*}. Note that because the output vectors **m**^{(r)} (resp. input-selection vectors **n**^{(r)}) are mutually orthogonal, the covariances between the parameters *m*^{(r)} and *m*^{(r′)} (respectively *n*^{(r)} and *n*^{(r′)}) with *r* ≠ *r′* vanish.

Since every neuron belongs to a single population, the r.h.s. of Eq. (30) can be split into *P* terms, each corresponding to an average over one population. As within each population the distribution is a joint Gaussian, Eq. (34) becomes a sum of *P* Gaussian integrals:

*κ*_{r}^{rec} = Σ_{p=1}^{P} *α*_{p} ⟨*n*^{(r)} *ϕ*(*x*)⟩_{p},

where ⟨·⟩_{p} denotes an average over the Gaussian distribution of population *p*.

#### 4.8.4 Effective circuit description of latent dynamics

In the following, we focus on zero-mean multivariate Gaussian mixture distributions for the connectivity parameters, and on input vectors orthogonal to {**m**^{(r)}}_{r=1…R}, as distributions with these assumptions were sufficient to describe trained networks. The more general case of Gaussian mixtures with non-zero means is treated in [Beiran et al., 2021]. Using Stein’s lemma for Gaussian distributions, the dynamics of the internal collective variables can be expressed as a dynamical system (see SI section S1)

τ (d*κ*_{r}/d*t*) = −*κ*_{r} + Σ_{r′=1}^{R} σ̃_{n^{(r)}m^{(r′)}} *κ*_{r′} + Σ_{s=1}^{N_{in}} σ̃_{n^{(r)}I^{(s)}} *υ*_{s}.   (39)

In the main text, *υ*_{s} were replaced by *u*_{s}, which amounts to assuming that inputs vary slowly with respect to the single-neuron time constant *τ*.

In Eq. (39), σ̃_{n^{(r)}m^{(r)}} represents the effective self-feedback of the collective variable *κ*_{r}, σ̃_{n^{(r)}m^{(r′)}} sets the interaction between the collective variables *κ*_{r} and *κ*_{r′}, and σ̃_{n^{(r)}I^{(s)}} is the effective coupling between the input *u*_{s} and *κ*_{r}. These effective interactions between the internal variables are given by weighted averages over populations

σ̃_{ab} = Σ_{p=1}^{P} *α*_{p} *σ*_{ab}^{(p)} ⟨Φ′⟩_{p},   (40)

where *σ*_{ab}^{(p)} is the covariance between connectivity parameters *a* and *b* for population *p*, and ⟨Φ′⟩_{p} is the average gain of population *p*, defined as

⟨Φ′⟩_{p} = ∫ 𝒟*z* *ϕ*′(Δ_{p} *z*),   (41)

with

𝒟*z* = (e^{−z²/2}/√(2π)) d*z*   (42)

and Δ_{p} the standard deviation of the activation variables in population *p*:

Δ_{p}² = Σ_{r=1}^{R} *σ*_{m^{(r)}}²(*p*) *κ*_{r}² + Σ_{s=1}^{N_{in}} *σ*_{I^{(s)}}²(*p*) *υ*_{s}²,   (43)

where *σ*_{a}²(*p*) is the variance of a vector **a** on population *p*.

In Eq. (39), the covariances are set by the statistics of the connectivity and input vectors, but the gain factors ⟨Φ′⟩_{p} in general depend on both the internal and external collective variables *κ*_{r} and *υ*_{s}. As a consequence, the dynamics in Eq. (39) are non-linear, and in fact it can be shown that, given a sufficient number of sub-populations, the right-hand side of Eq. (39) can approximate any arbitrary dynamical system [Beiran et al., 2021].
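The gain factors and effective couplings are one-dimensional Gaussian integrals, which can be evaluated with Gauss-Hermite quadrature (a sketch for *ϕ* = tanh; `gain` and `effective_coupling` are our names):

```python
import numpy as np

def gain(delta, n_quad=80):
    """<Phi'>(Delta): Gaussian average of tanh'(Delta * z) over z ~ N(0, 1)."""
    z, w = np.polynomial.hermite_e.hermegauss(n_quad)   # probabilists' Hermite rule
    w = w / np.sqrt(2 * np.pi)                          # normalize to N(0, 1) weights
    return float(np.sum(w * (1.0 - np.tanh(delta * z) ** 2)))

def effective_coupling(alphas, sigmas_ab, deltas):
    """Effective interaction of Eq. (40): sum_p alpha_p * sigma_ab^(p) * <Phi'>_p."""
    return sum(a * s * gain(d) for a, s, d in zip(alphas, sigmas_ab, deltas))
```

Because each population enters with its own gain, an input that changes Δ_{p} for one population rescales only that population's contribution to the coupling, which is the basis of the gain-modulation mechanism discussed below.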

In the special case of linear networks (i.e. *ϕ*(*x*) = *x*), the gain is constant, so that the effective couplings in Eq. (40) are equal to the overlaps *σ*_{ab} of vectors **a** and **b** over the full population, as defined in Eq. (12). The population structure therefore only plays a role in non-linear networks.

The value of the readout (Eq. (35)) can also be expressed in terms of effective interactions as

*z* = Σ_{r=1}^{R} σ̃_{wm^{(r)}} *κ*_{r} + Σ_{s=1}^{N_{in}} σ̃_{wI^{(s)}} *υ*_{s}.

#### 4.8.5 Drivers and modulators of latent dynamics

Eq. (39) shows that feed-forward inputs to the network can have two distinct effects on the collective dynamics of the internal variables *κ*_{r}. If the input vector **I**^{(s)} overlaps with the *r*-th input-selection vector **n**^{(r)}, i.e. the corresponding covariance *σ*_{n^{(r)}I^{(s)}}^{(p)} is non-zero for some population *p*, the input directly drives the latent dynamics, in the sense that *υ*_{s} acts as an effective external input to the dynamics of *κ*_{r} in Eq. (39).

In contrast, when all covariances between the input vector **I**^{(s)} and the input-selection vectors are zero (i.e. *σ*_{n^{(r)}I^{(s)}}^{(p)} = 0 for all *r*, *p*), the corresponding input does not drive the latent dynamics, but can still modulate them by modifying the gains ⟨Φ′⟩_{p} through Eq. (43), provided the variance *σ*_{I^{(s)}}²(*p*) of the input on some population *p* is non-zero.

The inputs can therefore play roles of drivers and modulators of latent dynamics, depending on whether the corresponding input vectors overlap or not with the input selection vectors *n*^{(r)}.

### 4.9 Reduced models of latent dynamics for individual tasks

#### 4.9.1 Perceptual decision making task

##### Latent dynamics and reduced model

We found that computations in the rank-one, single-population trained networks could be reproduced by a reduced model with two non-zero covariances *σ*_{nI} and *σ*_{nm} (Sup. Fig. S5a). For this reduced model, the dynamics of the internal collective variable are given by

τ (d*κ*/d*t*) = −*κ* + σ̃_{nm} *κ* + σ̃_{nI} *υ*(*t*),   (45)

where σ̃_{nm} = *σ*_{nm} ⟨Φ′⟩(Δ) and σ̃_{nI} = *σ*_{nI} ⟨Φ′⟩(Δ), with ⟨Φ′⟩(Δ) defined in Eq. (41), and the effective population variance Δ given by:

Δ² = *σ*_{m}² *κ*² + *σ*_{I}² *υ*².

Here *υ*(*t*) corresponds to the integrated input *u*(*t*), see Eq. (31).

An analysis of the nonlinear dynamics defined by Eq. (45) showed that adjusting these parameters was sufficient to implement the task, as additional parameters only modulate the overall gain (see SI section S2.1). In particular, the value of *σ*_{nm} determines the qualitative shape of the dynamical landscape on which the internal variable evolves, and sets the timescale on which it integrates inputs (see SI S2.1 for more details).
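The reduced dynamics of Eq. (45) can be integrated directly (a sketch with illustrative covariance values of our choosing; *σ*_{nm} > 1 produces the bistable landscape used for decision making):

```python
import numpy as np

def reduced_dm(u, sig_nm=2.0, sig_nI=1.0, sig_m=1.0, sig_I=1.0, dt=0.02, tau=0.1):
    """Euler integration of Eq. (45), with the gain <Phi'>(Delta) evaluated
    on a fixed Gaussian grid at every step."""
    z = np.linspace(-5.0, 5.0, 2001)
    pz = np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi) * (z[1] - z[0])    # N(0,1) weights
    kappa, v, traj = 0.0, 0.0, []
    for ut in u:
        v += (dt / tau) * (-v + ut)                                  # filtered input, Eq. (31)
        delta = np.sqrt(sig_m ** 2 * kappa ** 2 + sig_I ** 2 * v ** 2)
        g = np.sum(pz * (1.0 - np.tanh(delta * z) ** 2))             # <Phi'>(Delta)
        kappa += (dt / tau) * (-kappa + sig_nm * g * kappa + sig_nI * g * v)
        traj.append(kappa)
    return np.array(traj)
```

With these illustrative values, the latent variable integrates the stimulus toward one of two symmetric attractors whose sign reports the decision.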

#### 4.9.2 Parametric working memory task

##### Latent dynamics and reduced model

We found that computations in the rank two, single population trained networks could be reproduced by a reduced model with four non-zero covariances (Sup. Fig. S6a). In particular, covariances across the two connectivity modes could be set to zero without impairing performance.

For this reduced model, the dynamics of the two internal collective variables is given by: where , with ⟨Φ′⟩(Δ) defined in Eq. (41), and the effective noise Δ given by:

Here *υ*(*t*) corresponds to the integrated input *u*(*t*), see Eq. (31).

The two internal collective variables are therefore effectively uncoupled, and integrate the incoming feed-forward inputs at two different timescales due to different levels of positive feedback. For the first collective variable, a strong, fine-tuned positive feedback leads to an approximate line attractor along *κ*_{1} that persistently encodes the first stimulus throughout the delay and the sum of the two stimuli at the decision epoch. For the second internal variable, a weaker positive feedback leads to a shorter timescale of a transient response to stimuli along *κ*_{2}, such that the first stimulus is forgotten during the delay and that the second stimulus is represented during the decision epoch.
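The two-timescale mechanism can be sketched with a linearized toy model of two uncoupled leaky integrators; the feedback strengths below are hypothetical, chosen only to contrast fine-tuned and weak positive feedback.

```python
import numpy as np

def integrate(sigma_fb, v, dt=0.05):
    """Linearized sketch of one latent variable with positive feedback:
    dkappa/dt = -kappa + sigma_fb * kappa + v(t), Euler-integrated."""
    kappa = 0.0
    trace = []
    for vt in v:
        kappa += dt * (-kappa + sigma_fb * kappa + vt)
        trace.append(kappa)
    return np.array(trace)

steps = 2000                 # total time: 100 units (dt = 0.05)
v = np.zeros(steps)
v[:100] = 1.0                # first stimulus, followed by a long delay

slow = integrate(0.99, v)    # fine-tuned feedback: approximate line attractor
fast = integrate(0.50, v)    # weak feedback: transient response only
```

At the end of the delay, the fine-tuned variable still carries the first stimulus, while the weakly fed-back variable has responded transiently and relaxed back to zero, mirroring the persistent *κ*_{1} and transient *κ*_{2} described above.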

#### 4.9.3 Context-dependent decision making task

##### Latent dynamics and reduced model

We found that the computations in the unit rank, two populations network relied on the following conditions for the covariances in the two populations (Sup. Fig. S7a): (i) *I*^{ctxA} and *I*^{ctxB} were essentially orthogonal to the input-selection vector **n**, implying that and for both populations *p* = 1, 2; (ii) on each population, each of the two input-selection vectors was correlated with only one of the input-feature vectors, i.e. and , while and ; (iii) each context-cue vector had a strong variance on a different sub-population, i.e. for the first population *I*^{ctxA} and *I*^{ctxB} had respectively weak and strong variance (i.e. and ), and conversely for the second population and .

The computation could therefore be described by a reduced model, in which all other covariances were set to zero. The dynamics of the internal variable was then given by with effective couplings

The averaged gains for each population were given by equations (42), with the standard deviations of currents onto each population

Here *υ*_{A}(*t*) and *υ*_{B}(*t*) correspond to the integrated inputs *u*_{A}(*t*) and *u*_{B}(*t*), see Eq. (31).

As for the perceptual decision making task, the value of *σ*_{nm} determines the qualitative shape of the dynamical landscape on which the internal variable evolves and sets the time scale on which it integrates inputs. Large values of the variances allow the contextual cues to differentially vary the gain of the two populations in the two contexts, leading to an effective gating of the inputs integrated by the internal collective variable (see SI section S2.3 for more details).
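The gain-gating mechanism can be sketched numerically: the active contextual cue loads a large input variance onto the population to be silenced, which lowers that population's average gain and hence the effective coupling of the input it selects. All parameter values below (`var_ctx`, the covariances `sigma_nA1`, `sigma_nB2`) are hypothetical.

```python
import numpy as np

def mean_gain(delta, n=801):
    """<Phi'>(Delta) for Phi = tanh: average slope over a centered
    Gaussian current of variance Delta, by simple quadrature."""
    z = np.linspace(-6.0, 6.0, n)
    w = np.exp(-z**2 / 2)
    w /= w.sum()
    return np.sum(w * (1.0 - np.tanh(np.sqrt(delta) * z) ** 2))

def effective_couplings(context, var_ctx=25.0, sigma_nA1=2.0, sigma_nB2=2.0):
    """Gain-gating sketch: population 1 selects input A, population 2
    selects input B; the active contextual cue loads variance var_ctx
    onto the population to be silenced (hypothetical values)."""
    delta1 = var_ctx if context == "B" else 0.0  # cue B targets population 1
    delta2 = var_ctx if context == "A" else 0.0  # cue A targets population 2
    g1, g2 = mean_gain(delta1), mean_gain(delta2)
    coupling_A = 0.5 * sigma_nA1 * g1  # input A enters through population 1
    coupling_B = 0.5 * sigma_nB2 * g2  # input B enters through population 2
    return coupling_A, coupling_B
```

In context A the effective coupling of input B is strongly reduced, and conversely in context B, so the internal collective variable integrates only the contextually relevant input.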

#### 4.9.4 Delayed-match-to-sample task

##### Latent dynamics and reduced model

We found that the computations in the rank two, two population network relied on the following conditions for the covariances in the two populations (Sup. Fig. S8a): (i) on one population, the two connectivity modes were coupled through , with a specific condition on their values to induce a limit cycle (their difference must be large; see SI and [Mastrogiuseppe and Ostojic, 2018; Beiran et al., 2021]); (ii) on the other population, the covariances were in contrast set to counter-balance the first population, and cancel the rotational dynamics and ; (iii) the input-selection and output vectors for the second connectivity mode on the second population had a strong overlap that led to strong positive feedback; (iv) the input vector *I*^{A} had a strong variance on population 2, while the other input sub-vectors had small variances .

For this reduced model, the dynamics of the two internal collective variables is given by: with the effective couplings mediating inputs and effective couplings governing the autonomous dynamics:

The average gains are given by equations (42), with standard deviations of currents onto each population

Here *υ*_{A}(*t*) and *υ*_{B}(*t*) correspond to the integrated inputs *u*_{A}(*t*) and *u*_{B}(*t*), see Eq. (31).

Conditions (i) to (iv) on the covariances make it possible to implement the dynamical-landscape modulation of Fig. 4h (see Sup. Fig. S8d). When stimulus A is present (*u*_{A} = 1), the gain of population 2 is set to ⟨Φ′⟩_{2} ≃ 0 because of (see Eq. (60)), and the specific values of covariances for sub-vectors in population 1 induce a limit cycle (see SI section S2.4). In the absence of inputs, or when input B was present, the gains were approximately equal for the two populations (Sup. Fig. S8c), leading to a cancellation of the cross effective couplings and , while the positive feedback implemented through shaped a dynamical landscape with two fixed-points.
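A linearized sketch of this gain-controlled switch can be written in terms of the effective coupling matrix acting on (*κ*_{1}, *κ*_{2}). The covariance values below are hypothetical, chosen only so that population 1 carries rotational couplings that population 2 counter-balances; in the dynamics d*κ*/d*t* = −*κ* + *M κ*, an eigenvalue of *M* is unstable when its real part exceeds 1.

```python
import numpy as np

def effective_matrix(g1, g2, a=2.4, b=3.0, c=2.0):
    """Effective coupling matrix on (kappa_1, kappa_2) in a two-population
    sketch with hypothetical covariance values a, b, c. Population 1
    carries rotational couplings; population 2 counter-balances them and
    adds positive feedback c on the second mode. Each population
    contributes with weight 1/2, scaled by its average gain."""
    J1 = np.array([[a,   b],
                   [-b,  a]])
    J2 = np.array([[-a, -b],
                   [b,   c]])
    return 0.5 * (g1 * J1 + g2 * J2)

# Stimulus A suppresses the gain of population 2: the rotational couplings
# survive and yield an unstable spiral (a limit cycle once the nonlinearity
# saturates the growth).
eig_cycle = np.linalg.eigvals(effective_matrix(g1=1.0, g2=0.0))

# Equal gains: the cross-couplings cancel, leaving only the positive
# feedback on the second mode -> real eigenvalues and fixed points.
eig_fp = np.linalg.eigvals(effective_matrix(g1=1.0, g2=1.0))
```

The two gain configurations thus produce qualitatively different spectra: complex eigenvalues with real part above the instability threshold in one case, purely real eigenvalues in the other.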

## Supplementary information

### S1 Parametrization and collective dynamics for mixture of Gaussians connectivity vectors

In this section we show how connectivity vectors with entries drawn from mixtures of multivariate Gaussians can be constructed from independent Gaussians, as mentioned in Eq. (23). We then derive the dynamics of the internal collective variables (Eq. (39)) in this setting.

We considered distributions of connectivity parameters characterized by *P* covariance matrices **Σ**_{p}, and zero means *μ*_{p} = **0**, *p* = 1, … , *P*. For a neuron *i* belonging to population *p*, each vector entry is constructed as a linear transformation of the same set of values

Here the latent Gaussian variables are drawn independently for each *i* and *d* from a centered distribution of unit variance. The linear coefficients are different for each connectivity vector *a* ∈ {*n*^{(r)}, *m*^{(r)}, *I*^{(s)}, *w*}, but identical across neurons within a given population. These sets of coefficients therefore determine the covariance between entries of connectivity vectors within a given population *p*:

The row-vectors in fact correspond to the rows of the Cholesky factorization of the covariance matrix. We next turn to the derivation of Eq. (39). With the parametrization for the entries of connectivity vectors defined in Eq. (S1), the recurrent inputs to the *r*-th internal collective variable Eq. (34) can be written as
with . For a given *p*, we then compute each of the *D* integrals by successively applying Stein’s lemma
and using the fact that a sum of independent Gaussians is a Gaussian with variance given by the sum of variances, so that

This leads to with

Exchanging the sums over the *p* and *r*′, *s* indices and assuming that input vectors *I*^{(s)} are orthogonal to the output vectors {*m*^{(r)}}_{r=1,…,R} (as in all the reduced models described in the section below), we get the compact description in terms of effective couplings for the dynamics of internal collective variables Eq. (39)
with, for any two vectors **a**, **b**, the effective couplings and averaged gains

### S2 Theoretical analysis of reduced models

Here we examine reduced network models that are minimally parametrized to solve each task while relying on the same network dynamics as the trained networks presented in the main text. The minimal parameter sets correspond to subsets of the computationally important covariances between the connectivity vectors of the trained networks. These parameters were first set by hand and then, if necessary, fine-tuned with the ADAM optimizer to solve the task with optimal accuracy. We first describe how to parametrize the connectivity vectors to build these networks. We then examine the effects of these parameters on the mean-field collective dynamics and show how they contribute to solving the task.
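The parametrization of Eq. (S1) can be sketched directly in code: entries of the connectivity vectors for one population are obtained as linear combinations of i.i.d. standard Gaussians, with coefficients given by the rows of the Cholesky factor of the desired covariance matrix. The covariance matrix below is a hypothetical example for three vectors (*m*, *n*, *I*), not a trained parameter set.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_population(Sigma, size):
    """Draw connectivity entries for one population: each neuron's entries
    (one per connectivity vector) are linear combinations of i.i.d.
    standard Gaussians, with coefficients from the rows of the Cholesky
    factor of Sigma, so that the rows have covariance Sigma (cf. Eq. (S1))."""
    L = np.linalg.cholesky(Sigma)
    X = rng.standard_normal((size, Sigma.shape[0]))
    return X @ L.T

# Hypothetical covariance matrix for three connectivity vectors (m, n, I):
# variances on the diagonal, covariances sigma_nm = 0.8 and sigma_nI = 0.5.
Sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 2.0, 0.5],
                  [0.0, 0.5, 1.0]])
pop = sample_population(Sigma, size=20000)
emp = pop.T @ pop / len(pop)   # empirical covariance of the sampled entries
```

A multi-population network is obtained by sampling each population from its own covariance matrix **Σ**_{p} and concatenating the resulting sub-vectors.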

#### S2.1 Perceptual decision-making network

The network trained on this task was of unit rank, and consisted of a single population. Such a network can be minimally parametrized using three covariances *σ*_{nm}, *σ*_{nI} and *σ*_{mw} (Supp. Fig. S5a). This can be obtained with an input vector and a pair of recurrent connectivity vectors given by:
for *i* = 1, … , *N* , with drawn independently from zero-mean Gaussian distributions of unit variance. The readout components were taken as

The dynamics of the single internal collective variable is then given by with effective couplings given by equation (S9), i.e. the covariances scaled by the global gain factor

These dynamics can be graphically summarized as in Fig. 2e and lead to network dynamics that match the ones of trained networks (Fig. S5b).

The autonomous dynamics of the network is determined by the parameter *σ*_{nm}, which controls (i) the qualitative shape of the dynamical landscape, with a transition from a single stable fixed-point (*σ*_{nm} < 1) to two symmetric fixed-points (*σ*_{nm} > 1), and (ii) the time-scale with which the network state relaxes or diverges from the initial condition **x** = **0** at the beginning of each trial (Fig. S5c,d, [Mastrogiuseppe and Ostojic, 2018]). The integration of the filtered input *v*(*t*) by *κ* is controlled by *σ*_{nI}, the covariance between the input vector **I** and the input-selection vector **n**. For instance, for *σ*_{nI} = 0, *v*(*t*) is projected on a direction orthogonal to the input-selection vector and is not integrated by the recurrent activity (Supp. Fig. S5g, light shade line).

Finally, the covariance *σ*_{mw} between the output vector **m** and the readout vector **w** controls the extent to which the readout is driven by *κ*, with no drive of the readout in the case of orthogonal output and readout vectors, *σ*_{mw} = 0 (Supp. Fig. S5f, light shade line).

The network connectivity of equation (S11) also involves the variance *σ*_{mm} of the connectivity vector **m**. Changing *σ*_{mm} influences the autonomous dynamics of the network (Supp. Fig. S5c) by influencing the gain of the neurons (see Eq. (S14)).

For the reduced model shown in the main text, the non-zero covariances were: *σ*_{nm} = 1.4, *σ*_{nI} = 2.6 and *σ*_{mw} = 2.1.

#### S2.2 Parametric working-memory network

The network trained on this task was of rank two, and consisted of a single population. A minimal parametrization of this network involves six covariances (Supp. Fig. S6a).

This can be obtained with an input vector and two pairs of recurrent connectivity vectors:
for *i* = 1, … , *N* , with drawn from zero-mean Gaussian distributions of unit variance. The readout components were taken as

The dynamics of the two internal collective variables is then given by: with effective couplings given by equation (S9), i.e. the covariances scaled by the global gain factor with

These dynamics can be graphically summarized as in Fig. 2i and reproduce the dynamics of trained networks as shown in Supp. Fig. S6b. Supp. Fig. S6c shows the dynamical phase portrait on which the recurrent activity evolves. It approximates a line attractor [Seung, 1996] along the direction *m*^{(1)}, as the covariance sets the network close to the bifurcation point of Supp. Fig. S5c. Along the second direction *m*^{(2)}, the dynamics relax with a time scale set by the covariance . For the reduced model shown in the main text, the non-zero covariances were: and .

#### S2.3 Context-dependent decision-making network

The networks trained on this task were of unit rank and consisted of either two or three populations depending on the training procedure (see methods section 4.3.3, supplementary section S3 and Sup. Fig. S9).

##### Two-population network

Such a network can be minimally parametrized using four non-zero covariances on each population. This can be obtained with the two sensory input vectors generated independently, irrespective of the population structure. The connectivity vectors are structured in two sub-vectors.

For *i* in population 1:
while for *i* in population 2:
with ’s drawn from i.i.d. centered Gaussian distributions of unit variance. The readout vector is taken as
for *i* in population 1 and
for *i* in population 2. Importantly the contextual input vectors are also structured in two sub-vectors, such that for *i* in population 1:
while for *i* in population 2:
with .

The recurrent activity is then described by a single internal collective variable, graphically summarized in Fig. 4a:

The time evolution of the internal collective variable is coupled to the two inputs through the two effective couplings

The recurrent dynamics are supported equally by the two populations: with averaged gains given by equations (S10) and standard deviations of currents onto each population

The obtained dynamics are similar to those of the trained networks, graphically illustrated in Fig. 4c, with contextual inputs controlling the gain of each of the two populations (Supp. Fig. S7b). This control relies on the large amplitude of the weights of the contextual input vectors, as illustrated in Supp. Fig. S7c, where we show the effect of varying these parameters on the network readout. In this implementation, each of the two populations selectively integrates one of the two sensory inputs thanks to the non-zero covariances between sensory input and input-selection vectors, as illustrated in Supp. Fig. S7d.

The non-zero covariances for the implementation of the solution presented in the main text are given by .

##### Three-population network

For the context-dependent decision-making task, we also examined a network relying on three populations. In this network, two populations selectively gate inputs as in the two-population network, but the recurrent interactions that implement evidence integration are segregated to a third population. Here we describe the corresponding reduced model.

As for the two-population network, the two sensory input vectors are generated independently, irrespective of the population structure. The pair of recurrent connectivity vectors is structured in three sub-vectors, one for each population. For *i* in population 1:
for *i* in population 2:
and for *i* in population 3:
for *i* = 1, … , *N*, with drawn independently from centered Gaussian distributions of unit variance. The readout vector reads only from the third population:

The contextual inputs are the same as in the two-population network. The overall expression for the time evolution of the internal collective variable is unchanged compared to the two-population solution Eq. (S26). Each of the effective couplings between *κ* and the inputs is supported by one of the first two populations
and the self-coupling of the internal collective variable is supported by the third population
with averaged gains given by equations (S10) and standard deviations of currents onto each population by

In this three-population implementation, the contextual inputs do not control the gains of neurons in the third population and thus modulate only the effective couplings that mediate the influence of sensory inputs. The non-zero covariances for an implementation of this solution are given by .

#### S2.4 Delay-match-to-sample network

Networks trained on this task were of rank two and consisted of two populations. Here we propose a minimally parametrized network (Supp. Fig. S8a) that, similarly to the trained network presented in the main text, relies on the ability of inputs to control the autonomous dynamics of the network. The pairs of recurrent connectivity vectors defined on the first population are coupled to each other through covariances and [Mastrogiuseppe and Ostojic, 2018; Beiran et al., 2021]:
with covariances chosen such that the trivial fixed point **x** = **0** is an unstable spiral point, and the dynamics defined by the first sub-population generate a limit cycle. As shown by a linear stability analysis of the dynamical equation for internal collective variables, this dynamical feature arises when the covariances are such that the following matrix has complex eigenvalues with positive real parts [Mastrogiuseppe and Ostojic, 2018; Beiran et al., 2021]

This first population is coupled to a second population which, in the absence of inputs, cancels the rotational dynamics, through the relationships and . The second population also implements a pair of fixed-points that will be used to store the identity of the first stimulus throughout the delay and report the match/non-match decision. The connectivity sub-vectors on the second population can then be written as:

The readout vector reads only from the second population:

The input vector *I*^{B} also stimulates only the second population, pushing the dynamics towards one fixed point on the direction *m*^{(2)}
while the input vector *I*^{A} activates the two populations. For units in the second population
pushing the dynamics towards the other fixed point on the direction *m*^{(2)}, while for *i* in the first population
with

Such a connectivity leads to the dynamical equation for the two internal collective variables with the effective couplings mediating inputs of the form and effective couplings governing the autonomous dynamics: with averaged gains given by equations (S10) and standard deviations of currents onto each population

These dynamics can be graphically summarized as in Fig. 4g. They reproduce the dynamics of the trained rank two networks presented in the main text (Supp. Fig. S8b), relying on the same network mechanism, with input *A* controlling the gains of neurons in population one (Supp. Fig. S8c, middle) and thus shaping the dynamical landscape on which the internal collective variables evolve (Supp. Fig. S8d). The important non-zero covariances of the reduced model are given by: for the first population and .
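The saturation of the unstable spiral into a limit cycle can be illustrated with a minimal simulation of the latent dynamics, in which the population gain decreases as activity grows. The rotational couplings and the identification of Δ with |*κ*|² are hypothetical simplifications, not the trained parameter values.

```python
import numpy as np

def mean_gain(delta, n=801):
    """<Phi'>(Delta) for Phi = tanh: average slope over a centered
    Gaussian current of variance Delta, by simple quadrature."""
    z = np.linspace(-6.0, 6.0, n)
    w = np.exp(-z**2 / 2)
    w /= w.sum()
    return np.sum(w * (1.0 - np.tanh(np.sqrt(delta) * z) ** 2))

def simulate_radius(J, k0=(0.05, 0.0), dt=0.02, steps=6000):
    """Euler-integrate dk/dt = -k + <Phi'>(Delta) J k with Delta = |k|^2:
    the gain drops as activity grows, so the unstable spiral saturates
    into a limit cycle of fixed radius."""
    k = np.array(k0, dtype=float)
    radius = []
    for _ in range(steps):
        delta = k @ k
        k = k + dt * (-k + mean_gain(delta) * (J @ k))
        radius.append(np.linalg.norm(k))
    return np.array(radius)

# Hypothetical rotational couplings for population 1 (gain of population 2
# suppressed by stimulus A): complex eigenvalues with real part > 1.
J = np.array([[2.4, 3.0],
              [-3.0, 2.4]])
radius = simulate_radius(J)
```

Because this sketch is rotationally symmetric, the radius obeys an autonomous one-dimensional equation: it grows away from zero and converges to a constant value, the amplitude of the limit cycle.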

### S3 Non-uniqueness of network implementation for a given task

We observed that varying training parameters on a given task can lead to various network implementations. We identified three factors that contribute to such variability.

A first factor is the determination of the network parameters that are trained (e.g. number of pairs of recurrent connectivity vectors *R*, whether input vectors are trained or not, scaling of trained parameters with network size, etc.). An example of this is provided by training a rank-one network on the context-dependent decision-making task without training any of the input vectors (while the contextual-input vectors *I*^{ctxA} and *I*^{ctxB} are trained for the rank-one networks presented in the main text). Supp. Fig. S9 reports the analysis of such a trained network, showing that training leads to a network with three functional populations, whose roles in the computation are reproduced and detailed in a reduced model (section S2.3), and which is reminiscent of the one found in [Yang et al., 2019]. Another such example concerns the number of pairs of recurrent connectivity vectors allowed during training. For instance, training a rank-two network on the perceptual decision-making task can yield networks with a ring-like slow manifold [Mastrogiuseppe and Ostojic, 2018], which gives rise to a single, non-linear collective variable embedded in a two-dimensional subspace.

A second factor is task parametrization. For instance, we observed that training on the parametric working-memory task with fixed delays between the two stimuli (whereas delays are drawn randomly here) leads to solutions that exploit network oscillations with periods fine-tuned to the delay, rather than a line attractor (not shown). Another such example concerns the context-dependent decision-making task. Here we trained networks on a two-alternative forced-choice version of this task, in which every stimulus requires one out of two responses (section 4.3.3), and found that multiple populations were required for the implementation (Supp. Fig. S11). In a Go-Nogo setting, where the alternatives are to either respond or not, flexible input-output associations can instead be implemented with a single population, through a mechanism based on biasing the response threshold rather than modulating the gain [Mastrogiuseppe and Ostojic, 2018].

A third factor is the stochastic nature of the training procedure: the initial connectivity is randomly drawn for each training run, and training examples are stochastically split into batches by the stochastic-gradient-descent-based methods used here. In Supp. Fig. S14, we show the dynamics of a network trained on the delay-match-to-sample task, obtained with the same task parametrization and the same set of trained parameters as in the main text. Similarly to the solution described in the main text, it relies on gain modulations through external inputs to shape the dynamics of the network in the *κ*_{1} − *κ*_{2} plane. However, that solution relies on four stable fixed points in the autonomous dynamics, two of them encoding the memory of the first stimulus (A/B) while the other two encode the final decision (match/non-match). The mechanism relies on different input-driven dynamics, each implementing bistable dynamics with a separatrix that moves just enough to execute an XOR operation during the second stimulation.

## Acknowledgements

The project was supported by the ANR project MORSE (ANR-16-CE37-0016), the CRCNS project PIND, the program “Ecoles Universitaires de Recherche” launched by the French Government and implemented by the ANR, with the reference ANR-17-EURE-0017. There are no competing interests. SO thanks Joshua Johansen and Bijan Pesaran for fruitful discussions.