Encoding and Decoding Dynamic Sensory Signals with Recurrent Neural Networks: An Application of Conceptors to Birdsongs

In a constantly changing environment, the brain has to make sense of dynamic patterns of sensory input. These patterns can refer to stimuli with a complex and hierarchical structure that has to be inferred from the neural activity of sensory areas in the brain. Such areas have been found to be locally recurrently structured as well as hierarchically organized within a given sensory domain. While there is a large body of work identifying neural representations of various sensory stimuli at different hierarchical levels, less is known about the nature of these representations. In this work, we propose a model that describes a way to encode and decode sensory stimuli based on the activity patterns of multiple, recurrently connected neural populations with different receptive fields. We demonstrate the ability of our model to learn and recognize complex, dynamic stimuli using birdsongs as exemplary data. These birdsongs can be described by a 2-level hierarchical structure, i.e., as sequences of syllables. Our model matches this hierarchy by learning single syllables on a first level and sequences of these syllables on a top level. Model performance on recognition tasks is investigated for an increasing number of syllables or songs to recognize and compared to state-of-the-art machine learning approaches. Finally, we discuss the implications of our model for the understanding of sensory pattern processing in the brain. We conclude that the employed encoding and decoding mechanisms might capture general computational principles of how the brain extracts relevant information from the activity of recurrently connected neural populations.

satisfied by the membrane potential at the level of single neurons, it can be satisfied by recurrent connections at the level of neural populations.
We were particularly interested in the question of whether it is possible to encode and decode multiple dynamic sensory signals with a strongly simplified model of a recurrently connected neural circuit. More specifically, we wanted to investigate whether dynamic signals, as they occur naturally, can be learned by a recurrent neural network (RNN) with very simple neuron models. The crucial task the model has to perform for this purpose is to extract information about the input to the RNN from its continuously changing activation patterns. This involves observing the state dynamics of the network and detecting patterns in those dynamics that are specific to a certain input signal. Once learned, we wanted to use these representations for recognition of the respective sensory signals. A possible mechanism for the recognition task is proposed by predictive coding theory [12]. It states that internal representations are compared to current sensory input, resulting in a difference signal [3]. This signal in turn is thought to be used to update internal beliefs about which pattern the sensory input might belong to. Therefore, our model needs to be able to both learn representations based on the activity patterns of an RNN and compare already learned representations with the current state of the RNN. Recent advances in the field of reservoir computing provide the tools to build such models. A reservoir is a randomly connected RNN that can be driven with some kind of input pattern and trained to approximate and reproduce that pattern [22]. Each of its neurons has a different, randomly initialized receptive field and a non-linear activation function. In such a network, stimuli are encoded by the state of the whole network, or by a series of states in the case of dynamic input patterns.
Unfortunately, reservoirs suffer from so-called catastrophic forgetting, which refers to the inability of a single reservoir to learn multiple patterns. However, a recent development by Herbert Jaeger, called the conceptor, can solve that problem [24]. Conceptors rely on the idea that a randomly connected RNN visits only a sub-part of all the possible states it can visit, given an input pattern of limited length. Different inputs should therefore push the network into different parts of its state space, as long as the state space is sufficiently large. A conceptor exploits this behavior by capturing the parts of the state space an RNN visits while being driven with a certain input pattern. It does so by extracting directions of maximum variance from the state development observed in the RNN state space. In other words, conceptors are a dimensionality reduction technique that tries to identify the manifolds in the state space of a reservoir in which different signals live. If imposed on the reservoir, the conceptor constrains it to visit only those manifolds, hence acting like an attractor. It is therefore possible to use an RNN of limited size to learn representations of multiple dynamic patterns in the form of conceptors and then use these as input to the same network to reproduce the learned patterns. This makes reservoir-conceptor dynamics a viable option to model representation formation in the brain. Furthermore, it allows for comparing the already learned conceptors to the network dynamics observed for new input to obtain evidence for which pattern the current input might belong to, thus performing recognition. In the auditory domain, the brain has to deal with a one-dimensional, but highly variable and complex dynamic signal. This makes it an optimal candidate for demonstrating pattern learning and recognition within RNNs.
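The construction just described can be made concrete with a short sketch. The following is not the authors' code; reservoir size, weight scales, and aperture are illustrative choices, but the conceptor formula C = R (R + α⁻² I)⁻¹, with R the state correlation matrix, follows the standard construction in the conceptor literature:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                                         # reservoir size (arbitrary choice)
W = rng.normal(0, 1.0 / np.sqrt(N), (N, N))    # random recurrent weights
W_in = rng.normal(0, 1.0, (N, 1))              # input weights
b = rng.normal(0, 0.2, N)                      # bias

def collect_states(signal, washout=20):
    """Drive the reservoir with a 1-D signal and return post-washout states."""
    x = np.zeros(N)
    states = []
    for n, s in enumerate(signal):
        x = np.tanh(W @ x + W_in[:, 0] * s + b)
        if n >= washout:
            states.append(x.copy())
    return np.array(states)

def conceptor(states, aperture=10.0):
    """C = R (R + aperture^-2 I)^-1, with R the state correlation matrix."""
    R = states.T @ states / len(states)
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(N))

X = collect_states(np.sin(0.3 * np.arange(220)))
C = conceptor(X)
# Singular values of a conceptor lie in [0, 1): values near 1 mark state-space
# directions strongly excited by this input, values near 0 mark unused ones.
sv = np.linalg.svd(C, compute_uv=False)
```

The singular-value spectrum of C is exactly the "soft projection" onto the manifold the driven reservoir occupies; imposing C on the running network suppresses the directions with small singular values.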
In human speech, this complexity is further increased by its deep hierarchical structure, ranging from single phonemes to nested sentences [35, 6]. As capturing this hierarchy would require a similarly deep and complex model, we chose birdsongs as the dynamic pattern to model. Birdsongs have gained increasing popularity for investigating auditory processing in the brain because of their similar, yet less complex hierarchy as compared to human language, paired with the better-understood neural circuits of the auditory bird brain [10, 6]. Many birdsongs show a hierarchical structure combining single syllables into complex songs, thereby resembling human syntax [6]. Such a hierarchical structure has also been found in the bird brain performing song generation [2]. More specifically, high-level neurons encoding syllables fire in a certain sequence that refers to a song and drive neurons on a lower level to initiate the respective motor output [38, 33]. Thus, a hierarchically structured model performing recognition on the level of syllables and songs seems plausible from both a biological and a behavioral perspective. Moreover, similar to early language acquisition in humans, song learning in many bird species has been shown to be error-driven [5]. Therefore, song recognition and acquisition processes driven by the similarity between sensory input and stored song representations are readily motivated as underlying computational principles of auditory processing in the bird brain. Birdsongs can be divided into two categories depending on their inherent complexity. Some songs exhibit a linear syntax structure, i.e., the song always consists of the same sequence of syllables, whereas other songs show stochastic patterns, with the sequence of syllables changing between repetitions.
Our study focuses on the linear song syntax found in species like the zebra finch and the sparrow [36], since it compares better to the deterministic succession of syllables within words in human language.

Recent approaches to modeling birdsongs have focused either on mere recognition performance instead of the underlying brain processes [29, 25, 31] or on the brain processes mapping a song to an actual motor output [44, 45]. However, none of these studies tried to capture the inherent hierarchy of the birdsong within a neural network model of the listening bird brain. Here, we propose a birdsong recognition model that recognizes single syllables on a bottom level and sequences of syllables, which we refer to as songs, on a top level. It does so by forming representations of syllables and songs on the respective levels and classifying input according to its similarity to those representations. Both levels are thereby instantiated as reservoirs, while syllables and songs are represented by conceptors. By demonstrating that such a model can recognize hierarchical, dynamic signals such as birdsongs, we argue that the recurrent structure at a given level of a certain cortical hierarchy could be sufficient to explain how the brain is able to deal with time-evolving sensory stimuli. Conceptor learning would thereby describe what kind of information the brain has to extract from its sensory areas to form representations of those stimuli. Namely, it would need to find linear combinations of neurons that describe the directions in the state space of the respective sensory area along which the state of that area varies most strongly under a certain input. Moreover, we propose that conceptors, as used for song recognition on the top level of our model, are adequate for implementing a predictive coding scheme within an RNN. In summary,

we suggest that our model illustrates a possible general principle of how the brain can deal with hierarchically structured dynamic patterns based on the activation patterns of recurrently connected sensory areas. The subsequent chapter describes our model in detail, while the third chapter reports the model performance on birdsong data. We believe that a plausible model for recognizing any kind of sensory signal needs to perform well on basic recognition tasks that the species in question is clearly able to solve. This is why we compared the performance of both levels of our model to state-of-the-art machine learning methods. Finally, we discuss the implications of our model for future research on pattern formation and recognition in the brain as well as for computational models of these processes.
In (1), x is the state vector of the reservoir, W is the internal connectivity matrix, W^in is the input weight matrix, s(n) is the input signal, and b is a bias term. W as well as W^in are randomly initialized. At training time, we fed training samples of each syllable into the reservoir and collected its states. We then used those, as well as the input of the reservoir, to compute a preliminary conceptor C for each syllable j, employing the following equation: In (2), R corresponds to the correlation matrix of the j-th syllable, expressing the correlation of each reservoir unit and input dimension with each other over all training samples, while I stands for the identity matrix. On top of that conceptor (which we will from now on refer to as the positive conceptor), we also calculated a preliminary negative conceptor for each syllable, representing all of the state space not occupied by the positive conceptors of all other syllables. This was done using the logical operators for conceptors defined in [23]. With both positive and negative preliminary conceptors at hand, we employed the following equation to obtain the final conceptors: The purpose of (3) is to weave the aperture α into the conceptors, which controls how strongly the solution is pulled towards the identity or towards zero. A more detailed explanation of the aperture and how to use it to adapt conceptors can be found in [23]. By the end of this training procedure, each syllable was represented by its own positive conceptor and the logical exclusion of all other syllables.
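The equations referenced in this passage are not reproduced above. A plausible reconstruction, based on the surrounding definitions and the standard conceptor formalism in [23] (the authors' exact notation may differ, and (3) is shown for a single preliminary conceptor), is:

```latex
% Reconstructed from the surrounding text; notation may differ from the
% authors' original typesetting.
\begin{align}
x(n+1) &= \tanh\bigl(W\,x(n) + W^{\mathrm{in}} s(n) + b\bigr) \tag{1}\\
\tilde{C}_j &= R_j\bigl(R_j + \alpha^{-2} I\bigr)^{-1} \tag{2}\\
C_j &= \tilde{C}_j\bigl(\tilde{C}_j + \alpha^{-2}\,(I - \tilde{C}_j)\bigr)^{-1} \tag{3}
\end{align}
```

Here (3) is the aperture-adaptation map φ(C, α) from [23]: as α grows, C is pulled towards the identity; as α shrinks, towards zero.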

At testing time, separate test samples of each syllable were fed into the reservoir while collecting its states. Subsequently, we computed the similarity between the collected states x for one test sample and each of the conceptors in the following way: The outcome h is a vector containing an evidence value for each conceptor, expressing the similarity between that conceptor and the testing sample that was fed into the Conceptor (HFC) architecture as proposed by Herbert Jaeger [23]. Similar to the syllable classification module, a conceptor is learned for each song. However, the module does not need to hear the whole song to provide evidences for which song it heard.
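The similarity formula itself is not reproduced above, so the following sketch shows one common form such an evidence computation takes in the conceptor literature: the mean quadratic form x(n)ᵀ C x(n), i.e. how much state energy falls inside the subspace a conceptor retains. The normalization and the exact formula of the authors' equation may differ:

```python
import numpy as np

def evidences(X, conceptors):
    """Evidence of each conceptor for a state sequence X (rows = time steps)."""
    # mean of x(n)^T C x(n) over all collected states, per conceptor
    h = np.array([np.mean(np.einsum('ni,ij,nj->n', X, C, X)) for C in conceptors])
    return h / h.sum()          # normalize to a comparable evidence vector

# toy check: states concentrated along the first state-space axis
rng = np.random.default_rng(1)
X = np.zeros((100, 4))
X[:, 0] = rng.normal(size=100)
C_match = np.diag([0.9, 0.1, 0.1, 0.1])     # conceptor retaining axis 0
C_other = np.diag([0.1, 0.9, 0.1, 0.1])     # conceptor retaining another axis
h = evidences(X, [C_match, C_other])        # h[0] exceeds h[1]
```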

Instead, it assigns belief values to each learned song and updates them for every new syllable that it receives as input. This update is based on the difference between the network state observed after applying the input and each of the stored conceptors.

Another difference to the syllable classification module is the setup of reservoir and conceptors. They follow an architecture called Random Feature Conceptor (RFC), which was first introduced by Herbert Jaeger [23]. This architecture is visualized in the right box of figure 1. In the following, we explain its sub-parts and dynamics in more detail. The RFC is the main building block of the song recognition module. It is an attempt to store conceptors in a more efficient and biologically plausible way by applying the conceptors to the neurons of a network instead of to their connections. This can be done by introducing a second, larger reservoir z, whose states evolve in the following way: z(n + 1) = diag(c(n)) F r(n + 1) In (5), diag refers to the mathematical operator that creates a matrix with the elements of its input vector on the diagonal and zeros on the off-diagonal elements. Thus, every row of F is weighted by the respective entry in the conceptor c(n), with F being a random matrix mapping the states from the smaller reservoir r(n) to z. The dynamics of r are described by: As one can see, (6) is nearly identical to (1), except that the dependency on its own states from the previous time step is replaced by a dependency on z. G is again a random matrix, mapping from z to r. The dynamics of the RFC as described by (5) and (6) can thus be understood as a loop of two major mechanisms. First, a state space with λ being the step size and α the aperture. In this case, the aperture determines how strongly the conceptor should be able to change in the presence of new information stored in the current state vector z. For a more detailed analysis of these dynamics, we refer the interested reader to the chapter on RFCs in [23].
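Equations (6) and (7) are not reproduced above; based on the surrounding description and the RFC formalism in [23], they plausibly read as follows, with (5) repeated for context (squaring and ⊙ acting elementwise; the authors' exact notation may differ):

```latex
% Reconstruction of (6)-(7) from the surrounding text; (5) is quoted there.
\begin{align}
z(n+1) &= \operatorname{diag}\bigl(c(n)\bigr)\, F\, r(n+1) \tag{5}\\
r(n+1) &= \tanh\bigl(G\, z(n) + W^{\mathrm{in}} s(n) + b\bigr) \tag{6}\\
c(n+1) &= c(n) + \lambda\bigl(z(n)^{2} - c(n)\odot z(n)^{2} - \alpha^{-2}\, c(n)\bigr) \tag{7}
\end{align}
```

Under this rule, each entry of c converges towards the fraction of signal variance its random feature carries, shrunk by the aperture term α⁻² c(n).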
To train conceptors for different songs, we created training patterns of multiple repetitions of each song and fed them into reservoir r for a certain training period during which (7) should converge. After this was done for all songs, we trained read-out weights for r using ridge regression over all training patterns [20]. This method chooses the weight for each unit in r such that it minimizes the L2 norm of the weights plus the sum of the squared residuals between the read-out and the input to r. The same method was used to update G, changing the internal weights of the mapping from z to r such that z is capable of reproducing the input r has received during training. This gave us the final RFC, which we then used for song recognition.
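The ridge-regression step described above has a simple closed form: the weights minimizing ‖Xw − y‖² + β‖w‖² are w = (XᵀX + βI)⁻¹ Xᵀy. A minimal sketch (not the authors' code; the regularization strength β and the toy data are illustrative):

```python
import numpy as np

def ridge(X, y, beta=1e-2):
    """Closed-form ridge regression: w = (X^T X + beta I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + beta * np.eye(d), X.T @ y)

# toy usage: recover a linear read-out from noisy observations
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))        # stand-in for collected reservoir states
w_true = rng.normal(size=10)
y = X @ w_true + 0.01 * rng.normal(size=500)
w_hat = ridge(X, y)                   # close to w_true
```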

Song recognition was performed on the timescale of single syllables. More specifically, at every time step a syllable was fed into the network, which returned a belief value for each known pattern, thus performing online song recognition. Thereby, the belief values refer to the weights γ assigned to each previously trained conceptor. These weights were used to calculate a weighted sum of all trained conceptors, which was then applied to z in the same way as the conceptor c in (5). They were updated at each point in time according to In (8), η is the learning rate, m is the number of patterns stored in the RFC, and P is a depicted for a single syllable in figure 2. The input p to the syllable classification module as in (1) were those spectral feature vectors. As the Cassin's Vireo does not have a linear song syntax, we could not use these data for song recognition, but only for syllable classification. Therefore, we created a total of 20 synthetic songs from the set of 65 syllables. Each song consisted of 3-5 syllables and was of Markov order 1 or 2, as is typical for birdsongs [9]. In this case, the Markov order refers to the number of previous syllables necessary to predict the next syllable within a particular song. We created a represented the evidence for that syllable being the input at that particular time step. A noise-free syllable representation would thus be a vector with all entries being 0 except the one of the syllable currently being input, which would be a 1. These syllable vectors were then used as input s to the song recognition module as in (6).
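The exact update rule (8) is not reproduced above. As an illustration only, and NOT the authors' rule, the following sketch shows one generic form such a belief update can take: each stored (here diagonal) conceptor c_j is scored by how well it explains the current squared state z², the weights γ are nudged multiplicatively towards the best-fitting pattern, and then renormalized:

```python
import numpy as np

def update_beliefs(gamma, z, conceptors, eta=0.5):
    """Illustrative belief update over m stored diagonal conceptors (assumed form)."""
    z2 = z ** 2
    # fit of each conceptor: negative squared mismatch between z^2 and c_j * z^2
    fits = np.array([-np.sum((z2 - c * z2) ** 2) for c in conceptors])
    gamma = gamma * np.exp(eta * (fits - fits.max()))   # multiplicative update
    return gamma / gamma.sum()                          # keep gamma normalized

# toy usage: drive the update with states shaped by pattern 0's conceptor
rng = np.random.default_rng(3)
c0 = np.array([0.95, 0.90, 0.05, 0.05])
c1 = np.array([0.05, 0.05, 0.90, 0.95])
gamma = np.full(2, 0.5)
for _ in range(20):
    z = np.sqrt(c0) * rng.normal(size=4)   # energy concentrated where c0 is large
    gamma = update_beliefs(gamma, z, [c0, c1])
# gamma now strongly favors pattern 0
```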

The following paragraphs describe how well our model performs on the tasks it was designed for.

We tested the performance of our syllable classification module on an increasing number of syllables to distinguish between. Furthermore, we also tested a shallow MLP on the same task, serving as a comparison to our conceptor-based syllable classification. It the MLP was trained on. During training, the cross-entropy between MLP output and target was minimized using an Adam optimizer with an initial learning rate of 0.001 [27]. Initial weights were sampled from a normal distribution with zero mean and a standard deviation of 0.1. We trained the MLP on all training samples for 3 epochs and used a winner-takes-all transform on those to arrive at a song classification.
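A shallow-MLP baseline of the kind described can be sketched as follows. The hidden-layer size, tanh activation, and toy data are our own illustrative assumptions; the optimizer settings (Adam, learning rate 0.001, softmax cross-entropy, weights from N(0, 0.1)) follow the text:

```python
import numpy as np

rng = np.random.default_rng(4)

def one_hot(y, k):
    out = np.zeros((len(y), k))
    out[np.arange(len(y)), y] = 1.0
    return out

# toy 3-class problem standing in for spectral syllable features
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 2] > 1).astype(int)  # labels 0..2
T = one_hot(y, 3)

# shallow MLP: input 8 -> hidden 16 -> 3 classes, weights ~ N(0, 0.1)
params = [rng.normal(0, 0.1, s) for s in [(8, 16), (16,), (16, 3), (3,)]]
adam_m = [np.zeros_like(p) for p in params]
adam_v = [np.zeros_like(p) for p in params]

def forward(X, params):
    W1, b1, W2, b2 = params
    h = np.tanh(X @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, T):
    return -np.mean(np.sum(T * np.log(p + 1e-12), axis=1))

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
loss_before = cross_entropy(forward(X, params)[1], T)
for t in range(1, 301):                        # full-batch Adam steps
    h, p = forward(X, params)
    W2 = params[2]
    d_logits = (p - T) / len(X)                # softmax cross-entropy gradient
    dh = (d_logits @ W2.T) * (1 - h ** 2)      # backprop through tanh
    grads = [X.T @ dh, dh.sum(axis=0), h.T @ d_logits, d_logits.sum(axis=0)]
    for i, g in enumerate(grads):              # Adam update with bias correction
        adam_m[i] = beta1 * adam_m[i] + (1 - beta1) * g
        adam_v[i] = beta2 * adam_v[i] + (1 - beta2) * g ** 2
        m_hat = adam_m[i] / (1 - beta1 ** t)
        v_hat = adam_v[i] / (1 - beta2 ** t)
        params[i] = params[i] - lr * m_hat / (np.sqrt(v_hat) + eps)
loss_after = cross_entropy(forward(X, params)[1], T)
```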

Performance was then measured as the fraction of correct classifications over the entire test data set. Each combination of song number and signal-to-noise ratio was tested 10 song number and SNR on recognition performance to be roughly additive, as indicated by the diagonal structure of the performance matrix. This is very different from the GRU classifier, which can deal well with increasing numbers of songs but is more susceptible to noise. Looking at the difference between the two performance matrices, it is clearly visible that the conceptor-based classifier is superior at low SNRs. It is important to note that it was trained on noise-free training data, while we added noise to the training data of the GRU to improve its performance. This training data noise was in the range of the noise used for the testing data. We do recognize that increasing the amount of training data could have improved the performance of the GRU classifier. However, since both methods are supervised and we already allowed the GRU to see

The main goal of our work was to demonstrate how a recurrent neural circuit can encode and decode complex dynamic patterns such as birdsongs. Thereby, we employed a recently developed technique that allows the capture and control of the dynamics of randomly connected RNNs [23]. Using this mechanism, we were able to show how birdsongs can be learned and recognized within a 2-level hierarchical model of RNNs.

While the first level learned and recognized single syllables based on their spectral features, the second level learned and recognized songs as sequences of syllables. Both processes, learning and recognition, relied on extracting information from the activity pattern of an RNN driven by dynamic input. We were able to show that, for a limited number of distinguishable patterns, the recognition performance of our model compares to state-of-the-art machine learning methods and that song recognition is remarkably robust to noise, a desirable property when dealing with recognition in natural environments. Our model was motivated by two basic organizational properties of the brain: first, its recurrent structure on the level of local neuronal circuits [8, 41], and second, the hierarchical organization of the brain and the representations stored therein [12]. If the description of sensory areas in the brain as recurrently connected neural circuits is sufficient to capture the crucial information transmitted by those areas, our model suggests general computational principles for encoding and decoding dynamic sensory signals in the brain. performed within a predictive coding framework [28]. Since each conceptor in our top-level module was a unit with weights to every neuron in the reservoir z, they could also be interpreted as single neurons with synaptic connections. From such a perspective, the weights γ would represent the activation of the neurons, meaning that each neuron's activity would be determined by the similarity of the song it codes for to the current input to the reservoir r. Such activation could then be detected by a hierarchically even higher level and used for decision processes and the like. This could be a possible way of how probability distributions are represented and used in the brain.
Furthermore, this offers a direct way to implement top-down influences by acting on the weights of the conceptors. The second desirable property is that the trained conceptors can serve as generative models. More specifically, applying a conceptor encoding a certain song to the RNN it was trained on will force the RNN to generate that song without any further input needed. Therefore, one can use the learned representations not only for recognition, as demonstrated in this paper, but also for other cognitive tasks requiring generation processes. One drawback of using RFCs is the limited stability of the training process. At its current state of development, the RFC can sometimes fail to converge to a correct solution for a certain pattern. More specifically, it can fail to capture the subspace of the RNN state space which the RNN visited for a certain input pattern. This is not a problem for the presented recognition task, as once a correct conceptor is learned for every pattern, the network performance is stable.

Unfortunately, the probability of at least one pattern not being learned correctly increases with the overall number of patterns to learn. Therefore, given limited training data and computational capacities, testing recognition performance for more than 7 songs was unfeasible for us. To resolve this issue in a satisfactory manner, more work has to be invested in stabilizing the RFC architecture. One possible direction of research could be to extend conceptors to other dimensionality reduction mechanisms apart from the one currently employed in conceptor learning.

Due to this limitation of the RFC, we used a more simplified architecture for syllable classification, which allowed us to use a bigger set of syllables to construct our songs from. While the conceptors within this simplified architecture cannot serve as generative models, they still allow for using the logical operations AND, OR, and NOT on them. Thus, we were able to calculate negative conceptors for each syllable, performance significantly. Integrating these logical operations into an online recognition process such as the song recognition on the top level would in theory be possible as well. However, one would have to integrate both the positive and the negative conceptor into a single one in order to use the update scheme for the belief values in the way it is laid out in this paper. Importantly, the simplified architecture of the syllable classifier sufficed to demonstrate that an abstract spectral representation of highly variable sound patterns such as bird calls is sufficient to learn and recognize those syllables. This is in line with experimental evidence suggesting that mental representations are typically under-specified and abstract, making them more robust to noise [11].
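The Boolean operations on conceptor matrices mentioned above are defined in [23] as NOT C = I − C, C AND B = (C⁻¹ + B⁻¹ − I)⁻¹, and C OR B = NOT(NOT C AND NOT B). A minimal sketch (shown for invertible matrices for simplicity; the diagonal toy conceptors are illustrative):

```python
import numpy as np

def NOT(C):
    """Complement of a conceptor: the state space C does not occupy."""
    return np.eye(len(C)) - C

def AND(C, B):
    """Intersection of two conceptors (requires invertible C, B here)."""
    I = np.eye(len(C))
    return np.linalg.inv(np.linalg.inv(C) + np.linalg.inv(B) - I)

def OR(C, B):
    """Union, defined via De Morgan's law."""
    return NOT(AND(NOT(C), NOT(B)))

# toy check with diagonal conceptors: each diagonal entry acts like a soft
# membership value in [0, 1)
C = np.diag([0.5, 0.8])
B = np.diag([0.5, 0.2])
neg = NOT(C)         # diag(0.5, 0.2)
both = AND(C, B)     # first entry (1/0.5 + 1/0.5 - 1)^-1 = 1/3
either = OR(C, B)    # first entry 1 - 1/3 = 2/3
```

On diagonal conceptors these operations behave like a soft (fuzzy) logic on the per-direction membership values, which is what makes the "exclusion of all other syllables" construction above expressible.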

In summary, we put forth a hierarchical model of RNNs which is able to recognize non-stochastic birdsongs online. However, the mechanisms used in the model are in principle not specific to that one task or domain. Every dynamic pattern of similar length and dimensionality could be learned and recognized by such an architecture.

Thereby, the complexity of the hierarchy of the input could be met by a suitable number of hierarchical levels in the model. Our model suggests a general principle of how the activity pattern of a population of neurons with different receptive fields could be used to encode and decode information about, and from, the environment. Moreover, it shows how these principles could be implemented in a hierarchically organized neural circuit performing recognition tasks, though they could even be used in a similar fashion to perform pattern generation tasks.