## Abstract

The problem of deciphering how low-level patterns (action potentials in the brain, amino acids in a protein, etc.) drive high-level biological features (sensorimotor behavior, enzymatic function) represents the central challenge of quantitative biology. The lack of general methods for doing so from datasets of the size that can be collected experimentally severely limits our understanding of the biological world. For example, in neuroscience, some sensory and motor codes have been shown to consist of precisely timed multi-spike patterns. However, the combinatorial complexity of such pattern codes has precluded development of methods for their comprehensive analysis. Thus, just as it is hard to predict a protein’s function based on its sequence, we still do not understand how to accurately predict an organism’s behavior based on neural activity. Here we derive a method for solving this class of problems. We demonstrate its utility in an application to neural data, detecting precisely timed spike patterns that code for specific motor behaviors in a songbird vocal system. Our method detects such codewords with an arbitrary number of spikes, does so from small data sets, and accounts for dependencies in occurrences of codewords. Detecting such dictionaries of important spike patterns, rather than merely identifying the timescale on which such patterns exist, as in some prior approaches, opens the door for understanding fine motor control and the neural bases of sensorimotor learning in animals. For example, for the first time, we identify differences in encoding motor exploration versus typical behavior. Crucially, our method can be used not only for analysis of neural systems, but also for understanding the structure of correlations in other biological and nonbiological datasets.

## Introduction

One of the goals of modern high-throughput biology is to generate predictive models of interaction networks, from interactions among individual biological molecules^{1} to the encoding of information by networks of neurons in the brain^{2}. Making predictions about activity across networks requires one to accurately approximate (or build a *model* of) their joint probability distribution, such as the distribution of joint firing patterns in neural populations or the distribution of co-occurring mutations in proteins of the same family. To successfully generalize and to improve interpretability, models should contain as few terms as possible. Thus constructing a model requires detecting *relevant* features in the data: namely, the smallest possible set of spike patterns or nucleotide sequences that capture the most correlations among the network components. By analogy with the human language, where words are strongly correlated, co-occurring combinations of letters, we refer to the problem of detecting features that succinctly describe correlations in a data set as the problem of *dictionary reconstruction*, see Fig. 1.

In recent years, the problem of dictionary reconstruction has been addressed under different names for a variety of biological contexts^{3} including gene expression networks^{4,5}, protein structure, protein-protein interactions^{1,6–8}, the structure of regulatory DNA^{9}, distribution of antibodies and pathogenic sequences^{10,11}, species abundance^{12}, and collective behaviors^{13–17}. The efforts to identify interactions in neural activity have been particularly plentiful^{2,18–21}. The diversity of biological applications notwithstanding, most of these attempts have relied on similar mathematical constructs, and most have suffered from the same limitations. First, unlike in classical statistics and traditional quantitative model building, where the number of observations, *M*, usually vastly exceeds the number of unknowns to be estimated, *K*, so that *K*/*M* ≪ 1, modern biological data often have *M* ≫ 1, but also *K*/*M* ≫ 1. Indeed, because of, for example, protein allostery, recurrent connections within neural populations, or coupling to global stimuli, biological systems are rarely limited to local interactions only^{22–24}, so that the number of pairwise interactions among *N* variables is *K* ~ *N*^{2}, and the number of all higher order interactions among them is *K* ~ 2^{N}. Put differently, words in biological dictionaries can be of an arbitrary length, and spelling rules may involve many letters simultaneously, some of which are far away from each other. Because of this, reconstruction of biological dictionaries from data sets of realistic sizes requires assumptions and simplifications about the structure of possible biological correlations, and will not be possible by brute force. The second problem is that, as in human languages, biological dictionaries have redundancies: there are synonyms and words that share roots.
For example, a set of gene expressions may be correlated not because the genes interact directly, but because they are related to some other genes that do. Similarly, a certain pattern of neural activity may be statistically over- or under-represented not on its own, but because it is a subset or a superset of another, more important, pattern. Identifying *irreducible words*—the root forms of biological dictionaries—is therefore harder than detecting all correlations while also being crucial to fully understanding biological systems. Altogether, these complications make it impossible to use off-the-shelf methods for reconstructing combinatorially complex dictionaries from datasets of realistic sizes.

In this work, we propose a new method for reconstructing complex biological dictionaries from relatively small datasets, as few as *M* ~ 10^{2} … 10^{3} samples of the joint activity. We impose no limitation on the structure of the words that can enter the dictionary (neither their length nor their rules of spelling) beyond the obvious limitation that (i) words that do not happen in the data cannot be detected, and (ii) that data contain few samples of many words, rather than of just a few that repeat many times. Additionally, we address the problem of irreducibility, making the inferred dictionaries compact, non-redundant, and easier to comprehend. The main realization that allows this progress is that instead of approximating the entire joint probability distribution of a system’s states and hence answering *how* specific significant words matter, we can focus on a more restricted, and hence simpler, question: *which* specific words contribute to the dictionary. We answer this question in the language of Bayesian inference and statistical mechanics by developing an unsupervised version of the Bayesian Ising Approximation^{25} and by merging it with the *reliable interactions* model^{26}.

To illustrate the capabilities of our approach, we develop it in the context of a specific biological system: recordings from single neurons in brain area RA (the robust nucleus of the arcopallium) in the Bengalese finch, a songbird. Neurons communicate with each other using patterns of action potentials (spikes), which encode sensory information and motor commands, and hence behavior. Reconstructing the neural dictionary, and specifically detecting irreducible patterns of neural activity that correlate with (or “encode”) sensory stimuli or motor behaviors — which we hereafter call *codewords* — has been a key problem in computational neuroscience for decades^{18}. It is now known that both in the sensory^{27–34} and in the motor systems^{35–38} the timing of neural action potentials (spikes) in multispike patterns, down to millisecond resolution, can contribute to the encoding of sensory or motor information. Such dictionaries that involve long sequences of neural activities (or incorporate multiple neurons) at high temporal resolution are both complex and severely undersampled. Specifically, even though the songbird datasets considered here are large by neurophysiological standards, they are too small for most existing analysis approaches for modeling neural activity to detect the codewords. This motivates the general inference problem we address here.

Understanding the neural-motor dictionary answers important questions about vertebrate motor control which could not be addressed by previous techniques (see *Online Methods*). Specifically, while it is known that various features of the complex vocal behaviors are encoded by millisecond-scale firing patterns^{35}, here for the first time we can identify which specific patterns most strongly predict behavioral variations. Further, we show that dictionaries of individual neurons are rather large and quite variable, so that neurons speak different languages, which nonetheless share some universal features. Intriguingly, we detect that codewords that predict large, exploratory deviations in vocal acoustics are statistically different from those that predict typical behaviors. Collectively, these findings pave the way for development of future theories of the structure of these dictionaries, of how they are formed during development, how they adapt to different contexts, and how motor biophysics translates them into specific movements. More generally, these findings open gates for using our method in other biological domains, where reconstruction of feature dictionaries is equally important.

## Results

### The neuro-motor code reconstruction problem

Owing to their complex and tightly regulated vocal behavior and experimentally accessible nervous system, songbirds provide an ideal model system for investigating the neural dictionaries underlying complex motor behaviors^{39,40}. We recorded from individual neurons in the motor cortical area RA of Bengalese finches during spontaneous singing, while at the same time quantifying acoustic features of song, specifically the fundamental frequency (“pitch”), amplitude, and spectral entropy of individual vocal gestures, or “syllables”, as described previously^{35,41,42}. The data sets are sufficiently large to allow reconstruction of the dictionaries: we have 49 data sets (spanning 4 birds, 30 neurons, and sometimes multiple song syllables for each neuron) for which we observed at least 200 instances of the behavior and the associated neural activity, which we estimate below to be the lower threshold for sufficient statistical power.

To represent analysis of this motor code as a dictionary reconstruction problem, we binarize the recorded spiking time series so that *σ*_{t} = (0,1) indicates the absence or presence of a spike in a time slice of [(*t* − 1)Δ*t*, *t*Δ*t*), see Fig. 2. Thus each time interval is represented by a binary variable, and interactions among these patterns are described by over-represented or under-represented sequences of zeros and ones in the data. Using a complementary information-theoretic analysis, Tang et al.^{35} showed that the mutual information between the neural spike train and various features of song acoustics peaks at Δ*t* = 1 … 2 ms. Thus studying precise timing codes means that we focus on Δ*t* = 2 ms (our datasets are not large enough to explore smaller Δ*t*), as discussed previously in^{35}. Detection of statistically significant codewords at this temporal resolution would both re-confirm that this neural motor code is timing based, consistent with previous analyses^{35}, and, for the first time, reveal the specific patterns that most strongly predict behavior. We focus on neural time series of duration *T* = 40 ms preceding a certain acoustic syllable, which includes the approximate premotor delay with which neurons and muscles influence behavior^{35}. Thus the index *t* runs between 1 and *N* = *T*/Δ*t* = 20.
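A minimal sketch of this binarization is given below. It assumes that spike times are expressed in ms relative to the start of the 40 ms premotor window; the function name `binarize_spikes` and the example spike times are hypothetical and serve only as an illustration.

```python
def binarize_spikes(spike_times_ms, window_ms=40.0, dt_ms=2.0):
    """Turn spike times into a binary word of N = window_ms / dt_ms bins.

    Bin t (1-based) covers [(t - 1) * dt, t * dt) and is stored at list
    index t - 1. Assumption: times are measured from the window start.
    """
    n_bins = int(round(window_ms / dt_ms))
    word = [0] * n_bins
    for s in spike_times_ms:
        if 0.0 <= s < window_ms:
            # Several spikes in the same bin still give a single 1.
            index = min(int(s // dt_ms), n_bins - 1)
            word[index] = 1
    return word


# Hypothetical spike times (ms); the spike at 40.0 ms lies outside the window.
word = binarize_spikes([3.1, 3.9, 12.0, 39.99, 40.0])
active_bins = [i for i, bit in enumerate(word) if bit]
```

Here two spikes falling into the same 2 ms bin are merged into one symbol, which is the information discarded by working at a fixed Δ*t*.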

Since we are interested in codewords that relate neural activity and behavior, we similarly binarize the motor output^{35}, denoting by 0 or 1 different binary characteristics of the behavior, such as pitch being either below or above its median value, or outside or inside a certain range, see Fig. 2. We treat the behavioral variable as the 0’th component of the neuro-behavioral activity network, which then has *N* + 1 = 21 binary variables, **σ** = (*σ*_{0}, *σ*_{1}, …, *σ*_{N}). Finding the codewords and reconstructing the neural-behavioral dictionary is then equivalent to detecting significantly over- or under-represented patterns in the probability distribution *P*(**σ**), and specifically those patterns that are overrepresented together with the behavioral bit. Note that 2^{N+1} = 2^{21} ≈ 2 · 10^{6}, which is much greater than the *M* ~ 100 … 1000 observations of the activity that we can record, illustrating the complexity of the problem. In fact, similar to the famous birthday problem, one expects a substantial number of repeating samples of the activity, and hence the ability to detect statistically over- and under-represented binary words, only when *M* is comparable to or larger than the square root of the number of accessible patterns, which is what limits the statistical power of any dictionary reconstruction method.
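The birthday-problem scaling can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that all 2^{21} words are equally likely; since spikes are rare, real words are far from uniformly distributed and repeats occur sooner, so these numbers are a worst case. The function names are hypothetical.

```python
import math


def expected_repeated_pairs(n_samples, n_bits):
    """Expected number of coinciding sample pairs for uniformly random words."""
    n_pairs = n_samples * (n_samples - 1) / 2
    return n_pairs / 2 ** n_bits


def samples_for_one_repeat(n_bits):
    """Sample size at which roughly one coinciding pair is expected."""
    return math.sqrt(2 * 2 ** n_bits)


# 21 binary variables: 20 time bins plus the behavioral bit.
pairs_small = expected_repeated_pairs(100, 21)
pairs_large = expected_repeated_pairs(1000, 21)
threshold = samples_for_one_repeat(21)
```

Under the uniform assumption, even *M* = 1000 samples yield well under one expected repeat, which is why the rarity of spikes (and the resulting concentration of probability on short words) matters for detectability.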

### The unsupervised BIA method (uBIA)

To reconstruct the neural-motor dictionaries and detect codewords that predict specific behaviors, we develop the unsupervised Bayesian Ising Approximation (uBIA) method based on the Bayesian Ising Approximation for detection of significant features in regression problems^{25}. For this, we write the probability distribution *P*(**σ**) as

$$\log P(\boldsymbol{\sigma}) = -\log Z + \sum_{\mu} \theta_{\mu} \prod_{i \in V_{\mu}} \sigma_{i}, \tag{1}$$

where *Z* is the normalization coefficient^{43}. We use the notation such that *V*_{μ} is a nonempty subset of indexes *i* ∈ [0, *N*], and *μ* = 1, …, 2^{N+1} − 1 is the subset number. Then *θ*_{μ} are coefficients in the log-linear model in front of the corresponding product of binary *σ*s. In other words, *V*_{μ} denotes a specific combination of the behavior and / or times when the neuron is active. If *θ*_{μ} is nonzero for a term Π_{*i*∈*V*_{μ}} *σ*_{i}, where *i* = 0 (the response variable) is in *V*_{μ}, then this specific spike word is correlated with the motor output, and is a significant codeword in the neural code, see Fig. 1. Finding nonzero *θ*_{μ}s is then equivalent to identifying *which* codewords matter and should be included in the dictionary in Fig. 1, and inferring the exact values of *θ*_{μ} tells *how* they matter.
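The log-linear parameterization can be illustrated on a toy system small enough to enumerate. The sketch below assumes three binary variables (index 0 standing for the behavior) and hypothetical values of the coefficients; it is meant only to show how subsets, products of *σ*s, and normalization fit together, not to reproduce the analysis of the recordings.

```python
from itertools import combinations, product
import math


def all_subsets(n_vars):
    """All nonempty subsets V_mu of the indexes 0..n_vars-1, as tuples."""
    idx = range(n_vars)
    return [c for r in range(1, n_vars + 1) for c in combinations(idx, r)]


def log_weight(sigma, theta):
    """Sum of theta_mu over subsets whose sigma_i are all equal to 1."""
    return sum(t for subset, t in theta.items() if all(sigma[i] for i in subset))


def distribution(n_vars, theta):
    """Normalized P(sigma) over all 2**n_vars binary states."""
    states = list(product((0, 1), repeat=n_vars))
    weights = [math.exp(log_weight(s, theta)) for s in states]
    z = sum(weights)  # normalization coefficient Z
    return {s: w / z for s, w in zip(states, weights)}


# Hypothetical coefficients: three single-variable terms and one third-order
# term that ties the behavior (index 0) to spikes in bins 1 and 2.
theta = {(0,): -1.0, (1,): -2.0, (2,): -2.0, (0, 1, 2): 1.5}
p = distribution(3, theta)
codewords = [v for v, t in theta.items() if 0 in v and len(v) > 1 and t != 0]
```

In this toy example the only codeword is the subset `(0, 1, 2)`: it has a nonzero coefficient, contains the behavioral index, and involves at least one neural bin.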

A common alternative model of probability distributions uses *x* = 2*σ* − 1 = ±1 instead of *σ* = (0,1). A third order term coupling, for example, *σ*_{i}*σ*_{j}*σ*_{k} represents a combination of first, second, and third order terms in the corresponding *x*s, and vice versa. Thus which words are codewords may depend on the parameterization used, but the longest codewords and nonoverlapping groups of codewords remain the same in both parameterizations. Our choice of *σ* vs *x* is for a practical reason: a codeword in the *σ* basis does not contribute to *P* unless *all* of its constituent bins are nonzero. Thus since spikes are rare, we do not need to consider contributions of very long words to the code.
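As a worked illustration of this change of variables, substituting *σ* = (1 + *x*)/2 into a third order term gives

$$\sigma_i \sigma_j \sigma_k = \frac{1}{8}\left(1 + x_i + x_j + x_k + x_i x_j + x_i x_k + x_j x_k + x_i x_j x_k\right),$$

and, in the opposite direction, substituting *x* = 2*σ* − 1 gives

$$x_i x_j x_k = 8\,\sigma_i \sigma_j \sigma_k - 4\left(\sigma_i \sigma_j + \sigma_i \sigma_k + \sigma_j \sigma_k\right) + 2\left(\sigma_i + \sigma_j + \sigma_k\right) - 1.$$

In both directions the highest order term survives with a nonzero coefficient, while lower order terms are reshuffled. This is why the longest codewords coincide in the two parameterizations even though shorter ones may not.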

We would like to investigate the neural dictionary systematically and to avoid pitfalls of other methods that arbitrarily truncate Eq. (1) at some order of interactions. For this, we define indicator variables *s*_{μ} = (0,1), *μ* = 1, …, 2^{N+1} − 1, which denote if a particular sequence of *σ*_{i} = 1, *i* ∈ *V*_{μ}, and *σ*_{i} = 0, *i* ∉ *V*_{μ}, “matters” (is a putative word in the dictionary), i.e., it is either statistically significantly over- or under-represented in the data set compared to a null model (which we define later). In other words, we rewrite Eq. (1) as

$$\log P(\boldsymbol{\sigma}) = -\log Z + \sum_{\mu} s_{\mu}\, \theta_{\mu} \prod_{i \in V_{\mu}} \sigma_{i}. \tag{2}$$

We then choose a prior on *θ*_{μ} and on *s*_{μ}. As mentioned above, detecting words that are severely anomalously represented is easy, and it is not our focus. Instead, since many neurons control the muscles and hence the behavioral output, we assume that an individual neuron can only have a very weak effect on the motor behavior. In other words, typically |*θ*_{μ}| ≪ 1. We thus work in the strong regularization limit and impose priors on *θ*_{μ} conditional on *s*_{μ}.

Note that the prior distribution *P*(*θ*_{μ}|*s*_{μ} = 0) is irrelevant since, for *s*_{μ} = 0, *θ*_{μ} does not contribute to *P*(**σ**). We then choose the parameters of the prior in a way such that the *a priori* and the empirical averages of individual *σ*_{i}s are equal (throughout, 〈…〉 denotes *a posteriori* expectations, as distinct from *a priori* ones). This is equivalent to saying that we constrain our model to reproduce the firing rate of neurons and the frequency of the behavior. In typical problems, such marginal expectations are, indeed, well-known, and it is the higher order interaction terms, the complex words in the dictionary, that make the reconstruction hard. Finally, we choose *P*(*s*_{μ} = 1) = *P*(*s*_{μ} = 0) = 0.5, so that *a priori* a particular codeword has a fifty-fifty chance of being included in the neural dictionary.

Since we are only interested in whether a spike word is important in predicting the behavior and not how important it is, we integrate out all *θ*_{μ} having observed *M* samples of the *N* + 1 dimensional vector **σ**. We perform this integration to the quadratic order in *ε* to get the posterior probability of indicator variables, Eq. (4) (see *Online Methods*).

Eq. (4) is of a familiar pairwise Ising form^{44}, with data-dependent magnetic fields *h*_{μ} and the couplings *J*_{μν}. Note that this Ising model has 2^{N+1} spins, replacing the model with *N* + 1 spins with higher order interactions in Eq. (1). This is naively a much harder problem. However, since most of the 2^{N+1} words do not appear in the actual data, and because of the *ε*^{2} in front of the pairwise coupling term, evaluating posterior expectations 〈*s*_{μ}〉 for all words that actually occur is relatively easy (see *Online Methods*). Knowing 〈*s*_{μ}〉 allows us to tell which specific words should enter the neural-behavioral dictionary with a high *a posteriori* probability. To interpret Eq. (4), we notice (see details in *Online Methods*) that the linear terms *h*_{μ}({**σ**}) bias the indicator variables: those words that over- or under-occur (are anomalously represented) in the data relative to their expected frequency in the null model that matches 〈*σ*_{i}〉 will have *h*_{μ} > 0. In their turn, the couplings *J*_{μν} are typically negative for those words that frequently co-occur, which includes words that at least partially overlap. Thus in this Ising model, words compete to explain certain correlations in data. Once a word has a large evidence (the *h*_{μ} term), it suppresses all other correlated words that explain the data in a weaker way, bringing the dictionary closer to the irreducible form. To verify that our analysis can, indeed, recover neural dictionaries and to set various adjustable parameters involved in the method, we successfully tested the approach on synthetic data that are statistically similar to our neural recordings (see *Online Methods*). In particular, for the data we have, we expect a false discovery rate of < 1 codeword per neuron, but the fraction of the words that we discover depends on the (unknown) typical values of *θ*s.
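The competition between overlapping words can be illustrated with a small mean-field calculation. The sketch below rests on several assumptions that are not taken from the text: the posterior is written schematically as log *P*(**s**|{**σ**}) ≈ const + *ε* Σ_{μ} *h*_{μ}*s*_{μ} + *ε*^{2} Σ_{μ<ν} *J*_{μν}*s*_{μ}*s*_{ν} with *s*_{μ} = (0,1) (the prefactors and index conventions are assumed), the expectations are obtained with a naive mean-field update that may differ from the scheme of the *Online Methods*, and the values of *h*, *J*, and *ε* are hypothetical.

```python
import math


def mean_field_inclusion(h, J, eps, n_iter=200, damping=0.5):
    """Naive mean-field estimate of <s_mu> for 0/1 indicator variables.

    Assumed schematic posterior (not the exact published expression):
    log P(s) = const + eps * sum_mu h[mu] * s[mu]
                     + eps**2 * sum_{mu < nu} J[mu][nu] * s[mu] * s[nu]
    """
    n = len(h)
    m = [0.5] * n  # start from the fifty-fifty prior inclusion probability
    for _ in range(n_iter):
        for mu in range(n):
            coupling = sum(J[mu][nu] * m[nu] for nu in range(n) if nu != mu)
            field = eps * h[mu] + eps ** 2 * coupling
            target = 1.0 / (1.0 + math.exp(-field))
            m[mu] = damping * m[mu] + (1.0 - damping) * target
    return m


# Hypothetical inputs: words 0 and 1 overlap (negative coupling) and both have
# positive evidence; word 2 is independent and has weak negative evidence.
h = [40.0, 15.0, -5.0]
J = [[0.0, -300.0, 0.0],
     [-300.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
m = mean_field_inclusion(h, J, eps=0.1)
uncoupled_m1 = 1.0 / (1.0 + math.exp(-0.1 * h[1]))
```

With these inputs, the word with the larger evidence keeps a high inclusion probability, while the overlapping word with weaker evidence drops well below the value `uncoupled_m1` that it would have on its own. This is the suppression mechanism that drives the dictionary toward its irreducible form.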

### Statistical properties of neural motor codes

Figure 3 illustrates the occurrence of two specific codewords found by uBIA that encode high-pitch renditions of syllables. Note that these codewords are, indeed, overrepresented together with the high pitch vocalizations. Analyzing if a particular word is correlated with an acoustic feature is, of course, not hard. However, detecting which words should be tested, without incurring a multiple hypothesis testing significance penalty, is nontrivial. Thus the power of uBIA comes from being able to systematically analyze abundances of *combinatorially many* such spike patterns, and further to identify which of them are *irreducibly* over- or under-represented. Figure 4 illustrates statistical properties of entire neural-behavior dictionaries discovered by uBIA for different songbird premotor neurons and for three features of the acoustic behavior. While we reconstruct the dictionaries that include all irreducible words, including those that have only anomalous firing patterns, here we primarily focus on codewords, that is, the words that relate behavior to the neural activity. We do the analysis twice, first for behavior binarized as *σ*_{0} = 1 for the above-median acoustic features, and then for the below-median acoustic features. We then combine the results. Note that the same pattern of spikes should not be simultaneously over- or under-represented when studying both the above and the below median codes, since the pattern cannot code for two mutually exclusive features. There were 0.7 such codewords on average per dictionary. This is consistent with the expected false discovery rate of about 0.3 codewords per neuron for data sets of our size and statistical properties.

The most salient observation is that the inferred codewords consist of present or absent spikes in specific 2 ms time bins. This is consistent with previous analysis^{35}, which identified the same timescale for this dataset by analyzing the dependence of the mutual information between the activity and the behavior on the temporal resolution, but was unable to detect the specific words that carry the information. The second crucial observation is that most codewords are composed of multiple spikes, representing an orthographically complex *pattern* timing code^{37}, in contrast to single spike timing codes, such as in^{45}. The large number of codewords of 2 or more spikes (and thus 3 or more features, including the behavior itself) suggests that analyzing these dictionaries with the widely used MaxEnt or GLM methods, which typically focus on lower-order statistics (see *Online Methods*), would miss their significant components. Our third crucial observation is that very few sub-word / super-word pairs occur in the dictionaries, cf. Fig. 4 (e.g., the second codeword coding for entropy in neuron 2 in panel A is a subword of the others). This indicates that uBIA fulfills its goal of rejecting multiple correlated explanations for the same data.

We quantify these observations as follows. In the 49 different datasets, the average size of a dictionary is 14. Of these words, on average 5.6 include the behavioral feature and hence are *codewords*, cf. Fig. 4(D). That there are so many specific temporally precise codewords suggests that behaviorally relevant spike timing patterns are the rule, rather than the exception, in this dataset. We found that 66% of codewords are unique to one of the three analyzed acoustic features. This further quantifies the observation that some neurons in RA are *selective* for specific acoustic features, as noted previously in^{41}. Across all neurons and all acoustic features, only 15% of codewords consist of a single spike (or absence of spike), while 58%, 23%, and 4% consist of two, three, and four spikes respectively, cf. Fig. 4(C) (we are likely missing many long codewords, especially those with small *θ*s, due to undersampling, see *Online Methods*). This observation is consistent across all neurons and acoustic features, again indicating that coding by temporally precise spike patterns is a rule and not an exception.

At the same time, the observed dictionaries are quite variable across neurons and the production of particular song syllables. Codewords are built by stitching together multiple spikes or spike absences, and individual spikes occur at certain time points in the (−40,0) ms window with different probabilities in different neurons and syllables (i.e., the firing rate is both time and neuron dependent, cf. Fig. 4(A), grey lines). Codewords are likely to occur where the probability of seeing a spike in a bin is ~ 50%, since these are the times that have more capacity to transmit information. Thus variability in firing rates as a function of time across neurons necessarily creates variability in the dictionaries across these neurons. Beyond this, we observe additional variability among the dictionaries that is *not* explained by the rate fluctuations. For example, we can differentiate one of the four birds from two of the others just by looking at the proportions of high-order codewords (an average of 0.21 bits in Jensen-Shannon divergence between the target bird and the rest, which means that we need around five independent samples/codewords to distinguish this bird from the others). Further, the mean inter-spike interval (ISI) for codewords is different from that of other words in the dictionaries, and this difference is also bird-dependent, see Fig. 4(E).
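For reference, the Jensen-Shannon divergence between two distributions of codeword orders can be computed as in the sketch below. The two example distributions are hypothetical and are not the measured proportions for any bird. One rough reading of the "around five samples" estimate, stated here as an assumption, is that at 0.21 bits per codeword about 1/0.21 ≈ 4.8 codewords are needed to accumulate one bit of evidence.

```python
import math


def kl_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 vanish."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence_bits(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by 1."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, mix) + 0.5 * kl_bits(q, mix)


# Hypothetical fractions of codewords with 1, 2, 3, and 4 spikes.
target_bird = [0.10, 0.50, 0.30, 0.10]
other_birds = [0.20, 0.60, 0.17, 0.03]
d_js = js_divergence_bits(target_bird, other_birds)

# Rough number of codewords per bit of evidence at the reported 0.21 bits.
samples_per_bit = 1 / 0.21
```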

### Verification of the inferred dictionaries

To show that the dictionaries we decoded are biologically (and not just statistically) significant, we verify whether the codewords can, in fact, be used to predict the behavioral features. For this, we built two logistic regression models that relate the neural activity to behavior. The first one uses the presence / absence of spikes in individual time bins and the second the presence / absence of the uBIA detected codewords as predictor variables (see *Online Methods*). Note that the individual spikes model is still a precise-timing model, which has 20 predictors (20 time bins, each 2 ms long), and hence one may expect it to predict better than the codewords model, which typically has many fewer predictors. To account for the possibility of overfitting, in all comparisons we test the predictive power of models using cross-validation. We emphasize that we do not expect either of the two models to capture an especially large fraction of the behavioral variation. Indeed, Tang et al.^{35} have shown that, at 2 ms resolution, on average, there is only about 0.12 bits of information between the activity of an individual neuron and the behavioral features, and the assumption behind our entire approach is that none of the individual predictors has a strong effect. Further, a specific model, such as logistic regression, will likely recover even less predictive power from the data. With this, Supplementary Fig. 8 compares prediction between the two models, showing a significantly higher accuracy and a lower mean cross-entropy between the model and the data for the models that use codewords as predictors. In other words, the higher order, but smaller, dictionaries decoded by uBIA outperform larger, non-specific dictionaries in predicting behavior.
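The two sets of predictors and the cross-entropy score can be sketched as follows. The code does not fit the logistic regressions themselves. It assumes, as a convention chosen for this illustration, that a codeword counts as present whenever all of its bins contain a spike, regardless of the other bins; the function names and example values are hypothetical.

```python
import math


def bin_features(word):
    """Individual-spike model: one 0/1 predictor per 2 ms time bin."""
    return list(word)


def codeword_features(word, codewords):
    """Codeword model: predictor k is 1 only if every bin of codeword k spikes."""
    return [int(all(word[t] for t in cw)) for cw in codewords]


def mean_cross_entropy_bits(y_true, p_pred, tiny=1e-12):
    """Mean cross-entropy (bits) between binary outcomes and predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, tiny), 1.0 - tiny)  # guard against log(0)
        total -= y * math.log2(p) + (1 - y) * math.log2(1.0 - p)
    return total / len(y_true)


# Hypothetical 20-bin word with spikes in bins 2 and 5, and two codewords.
word = [0] * 20
word[2] = word[5] = 1
x_bins = bin_features(word)
x_codewords = codeword_features(word, [(2, 5), (2, 7)])
chance_level = mean_cross_entropy_bits([1, 0], [0.5, 0.5])
```

A model that always predicts 0.5 scores exactly 1 bit per trial, so cross-validated values below 1 bit indicate that the predictors carry information about the binarized behavior.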

### Dictionaries for exploratory vs. typical behaviors

Bengalese finches retain the ability to learn throughout their lifetimes, updating their vocalizations based on the sensorimotor feedback^{40,46,47}. A key element of this lifelong learning capacity is the precise regulation of vocal variability, which songbirds use to explore the space of possible motor outputs, cf. Fig. 5A,B. For example, male songbirds minimize variability when singing to females during courtship, but dramatically increase the range of variability in acoustic features such as pitch when singing alone^{48,49}. The variability is controlled by the activity of nucleus LMAN. Silencing or lesioning LMAN reduces the acoustic variance of undirected song (Fig. 5A) to a level approximately equal to that of female-directed song^{50,51}. Using uBIA, we can ask for the first time whether the statistics of codewords controlling the exploratory vs. the baseline range of motor variability are different. To do this, we analyze the statistics of codewords representing different parts of the pitch distribution. First, we define the output as *σ*_{0} = 1 if the behavior belongs to a specific 20-percentile interval ([0 – 20], [10 – 30], …, [80 – 100]) and compare the dictionaries that code for behavior in each of the intervals. We find that there are significantly more codewords for exploratory behaviors (percentile intervals farthest from the median, cf. Fig. 5C). This holds true for different features of the vocal output, though the results are only statistically significant if pooled over all features. To improve statistical power by increasing the number of trials in each acoustic interval, we also consider a division of the output into three equal intervals: low, medium, and high. In this case, there are still more codewords for the high exploratory pitch, and the dictionaries for each of the intervals are still multispike, cf. Fig. 5D.
We further observe that the codewords themselves are different for the three percentile groups: the mean ISI of high pitch, amplitude, and spectral entropy codewords is higher, with the largest effect coming from the pitch and the spectral entropy, cf. Fig. 5E. Examples of typical and exploratory dictionaries are illustrated in Fig. 5F.
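The percentile-based definition of the behavioral bit can be sketched as follows. The rank-based treatment of ties, the half-open intervals, and the function name `percentile_labels` are assumptions made for this illustration, and the example values are synthetic.

```python
def percentile_labels(values, lo_pct, hi_pct):
    """Return sigma_0 for each rendition: 1 if its rank-based percentile lies
    in [lo_pct, hi_pct), else 0. Ties are broken by sort order (an assumption).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        pct = 100.0 * rank / n
        if lo_pct <= pct < hi_pct:
            labels[i] = 1
    return labels


# Overlapping 20-percentile intervals: [0-20], [10-30], ..., [80-100].
intervals = [(lo, lo + 20) for lo in range(0, 81, 10)]

# Synthetic "pitch" values: a permutation of 0..99 standing in for real data.
pitch = [(7 * i) % 100 for i in range(100)]
top_labels = percentile_labels(pitch, 80, 100)
```

Each choice of interval yields a separate binarized dataset, on which the dictionary reconstruction is then run independently.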

These findings challenge common accounts of motor variability, in songbirds and other systems, according to which motor exploration is induced by adding random spiking variations to a baseline motor program. Rather, the over-abundance of codewords in the exploratory flanks of the acoustic distributions indicates that the mapping between the neural activity and the behavior is more reliable than in the bulk of the behavioral activity: multiple renditions of the same neural command result in the same behaviors more frequently, making it easier to detect the codewords. One possibility is that the motor system is less biomechanically noisy for large behavioral deviations. This is unlikely due to the tremendous variation in the acoustic structure (pitch, etc.) of different song syllables within and across animals^{47,52}, which indicates that songbirds can produce a wide range of sounds and that particular pitches (i.e., those at one syllable’s exploratory tail) are not intrinsically different or harder for an animal to produce. Similarly, songbirds can dramatically modify syllable pitch in response to manipulations of auditory feedback^{40,47}. A more likely explanation for the greater prevalence of codewords in the exploratory tails is that the nervous system drives motor exploration by selectively introducing particular patterns into motor commands that are specifically chosen for their reliable neural-to-motor mapping. This would result in a more accurate deliberate exploration and evaluation of the sensory feedback signal, which, in turn, is likely to be useful during sensorimotor learning^{53}.

## Discussion

In this work, we developed the unsupervised Bayesian Ising Approximation as a new method for reconstructing biological dictionaries, that is, the sets of anomalously represented joint activities of multiple components of biological systems. Inferring these dictionaries directly from data is a key problem in many fields of modern data-rich biological and complex systems research including systems biology, immunology, collective animal behavior, and population genetics. Our approach addresses crucial shortcomings that so far have limited applicability of other methods. First, it does not limit the possible dictionaries, either by considering words of only limited length or of a pre-defined structure. Instead, it performs a systematic analysis of all possible words that occur in the data sample. Second, it promotes construction of irreducible dictionaries, de-emphasizing related, co-occurring words. Further, unlike various linear methods, uBIA does not make assumptions about the linear structure of dependencies.

To illustrate capabilities of the method, we applied it to analysis of motor activity in cortical area RA in a songbird. We were able to infer statistically significant codewords from large-dimensional probability distributions (2^{21} possible different words) with relatively small data sets (~ 10^{2} … 10^{3} samples). We verified that the codewords are biologically meaningful, in the sense that they predict behavioral features better than alternative approaches. Importantly, most of the words in the hundreds of dictionaries that we reconstructed were more complex than is usually considered, involving multiple spikes in precisely timed patterns. The multi-spike, precisely timed nature of the codes was universal across individuals, neurons, and acoustic features, while details of the codes (e.g., specific codewords and their number) showed tremendous variability.

Further, we identified codewords that correlate with three different acoustic features of the behavior (pitch, amplitude, and spectral entropy), and with different percentile ranges for each of these acoustic features. Across many of these analyses, various statistics of codewords predicting exploratory vs. typical behaviors were different. Specifically, the exploratory dictionaries contained more codewords than the dictionaries for typical behavior, suggesting that the exploratory spiking patterns are more consistently able to evoke particular behaviors. This is surprising since the exploratory behavior is usually viewed as being noisier than the typical one. Crucially, exploration is a fundamental aspect of sensorimotor learning^{40,46,54}, and it has been argued that large deviations in behaviors are crucial to explaining the observed learning phenomenology^{53}. However, the neural basis for controlling exploration vs. typical performance is not well understood. Intriguingly, vocal motor exploration in songbirds is driven by the output of a cortical-basal ganglia-thalamo-cortical circuit, and lesions of the output nucleus of this circuit (area LMAN) abolish the exploratory (larger) pitch deviations^{50,51}. Our findings therefore suggest that the careful selection of the spike patterns most consistently able to drive behavior may be a key function of basal ganglia circuits.

While the identified codewords are statistically significant, and we show that they can predict the behavior better than larger, but non-specific features of the neural activity, a crucial future test of our findings will be in establishing their *causal* rather than merely correlative nature by means of stimulating neurons with patterns of pulses mimicking the identified codewords^{36}. This will be facilitated by the speed of our method, which can reconstruct dictionaries in real time on a laptop computer. Additional future work can explore how population-level dictionaries are built from the activity of individual neurons, how the dictionaries develop and are modified in development, and whether the structure of dictionaries as a whole can be predicted from various biomechanical and information-theoretic optimization principles. Finally, one needs to understand how these dictionaries are implemented in the recurrent dynamics of neural networks in animals’ brains.

## Online Methods

### Overview of neural decoding methods

For many different experimental systems, it has been possible to measure the information content of spike trains^{34–36}, but the question of decoding (which spike patterns carry this information?) has turned out to be a harder one. Multiple approaches have been used to address this problem and to reconstruct neural dictionaries, whether in the context of sensory or motor systems, starting with linear decoding methods^{45}. All have fallen a bit short, especially in the context of motor codes, where an animal is free to perform any one of many behaviors it wishes, and hence statistics are usually poor. A leading method is Generalized Linear Models (GLMs)^{19,55,56}, which encode the rate of spike generation from a certain neuron at a certain time as a nonlinear function of a linear combination of past stimuli (sensory systems) or of future motor behavior (motor systems) and the past spiking activity of a neuron and its presynaptic partners. GLM approaches can detect the importance of the timing of individual spikes and sometimes interspike intervals for information encoding, but generalizations to detect the importance of higher order spiking patterns are not yet well established. Another common approach is based on maximum entropy (MaxEnt) models^{2,57,58}. These replace the true distribution of the data with the least constrained (i.e., maximum entropy) approximation consistent with low-order, well-sampled correlation functions of the distribution. The approach is computationally intensive, especially when higher order correlations are constrained by data. At the same time, to approximate empirical distributions well, a large number of such constraints is often required. This requires very large datasets, especially if one is interested in relating the neural activity to the external (behavioral or sensory) signals. Such large datasets are hard to obtain in the motor control setting.

More recently, feed-forward and recurrent artificial neural network approaches have been used to decode large-scale neural activity^{59,60}, but these have focused primarily on neural firing rates over large (tens of milliseconds) temporal windows. As a result, to date, there have not been successful attempts to reconstruct neural dictionaries from data, which would (i) resolve spike timing in words of the dictionary to a high temporal resolution, (ii) be comprehensive and orthographically complex, not limiting the words to just single spikes or pairs of spikes, and (iii) discount correlations among spiking words to produce irreducible dictionaries that only detect those codewords that cannot be explained away by correlations with other words in the dictionary.

### Details of the Bayesian Ising Approximation approach

To obtain Eq. (4) for the posterior probability of including a word into the dictionary, we start with

Now we make two approximations. First, we evaluate the integral in Eq. (5) using the saddle point approximation around the peak of the *prior*, ***θ***^{∗}. This is a low signal-to-noise limit, and it is different from most high signal-to-noise approaches that analyze the saddle around the peak of the *posterior*. Second, we do all calculations as a Taylor series in the small parameter *ε* (see below on the choice of *ε*). Both approximations are facets of the same strong regularization assumption, which insists that most coupling constants *θ*_{μ} are small. Following Fisher and Mehta^{25}, we obtain Eq. (4), where the magnetic fields (biases) *h*_{μ} and the exchange interactions *J*_{μν} are expressed in terms of the log-likelihood (see Fig. 6 for a geometric interpretation of the field *h*_{μ}). Plugging in the model of the probability distribution, Eq. (1), we get for the fields and the exchange interactions

Here, to simplify the notation, we defined *σ*_{μ} ≡ Π_{i∈Vμ} *σ*_{i}. Further, angular brackets, cov, and var denote the *a priori* expectations, covariances, and variances of frequencies of words in the null model, which matches the frequency of occurrence of each individual *σ*_{i} (probability of firing in every time bin for the songbird data). Similarly, overlines denote the empirical counts or correlations between co-occurrences of words in the observed data. Specifically, denoting by *n*_{μ} the marginal frequencies of the word *V*_{μ} in the data, these expectations and frequencies are defined as follows:

To derive these equations, note that . Note also that cov(*σ*_{μ}, *σ*_{ν}) = 0 if the intersection of *V*_{μ} and *V*_{ν} is empty.

Equation (7) has a straightforward interpretation, which we illustrate in Fig. 6. Specifically, if the difference between the *a priori* expected frequency and the empirical frequency of a word is statistically significantly nonzero (compared to the *a priori* standard error), then the corresponding word is anomalously represented. It does not matter whether the word is over- or under-represented: in either case, if the frequency deviates from the expectation, then the field *h*_{μ} is positive, biasing the indicator *s*_{μ} towards 1, and hence towards inclusion of the word in the dictionary. If the frequency is as expected, then the field is negative, and the indicator is biased towards 0, excluding the word from the dictionary. Note that as *M* increases, the standard error goes down, and the field generally increases, allowing us to consider more words. The sign of *θ*_{μ} would determine whether the word is over- or under-represented. However, estimating the exact value of *θ*_{μ} from small datasets is often impossible and is not our goal, even though, in Fig. 2, we denote words as under- or over-represented by whether their empirical frequency is smaller or larger than the *a priori* expectation. Thus in some aspects, our approach is similar to the previous work^{61}, where multi-neuronal patterns are found by comparing empirical firing probabilities to expectations. However, we do this comprehensively for *all* patterns that occur in data, and we account for reducibility of the dictionaries (also see below).
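The comparison described above can be illustrated with a minimal sketch. It is not the field of Eq. (7) itself: it only computes the deviation of a word's empirical frequency from its expectation under an independent null model, in units of the null standard error. The function name `word_deviation` and the binomial form of the standard error are assumptions made for this illustration.

```python
import math

def word_deviation(samples, word):
    """Compare the empirical frequency of a word with its independent-null expectation.

    samples: list of equal-length tuples of 0/1 values (one tuple per trial).
    word: tuple of bin indices V_mu; the word is "on" when all of its bins are 1.
    Returns (empirical, expected, z), where z is the deviation in units of
    the null standard error of a frequency estimated from M samples.
    """
    M = len(samples)
    # Marginal probability of each bin in the word, p_i = n_i / M.
    p = [sum(s[i] for s in samples) / M for i in word]
    expected = math.prod(p)  # <sigma_mu> when bins are independent
    empirical = sum(all(s[i] for i in word) for s in samples) / M
    # Binomial standard error (an assumption of this sketch).
    se = math.sqrt(expected * (1.0 - expected) / M)
    z = (empirical - expected) / se if se > 0 else 0.0
    return empirical, expected, z
```

A large |z| of either sign marks the word as anomalously represented, and for a fixed deviation |z| grows as √M, in line with the remark that more words become detectable as *M* increases.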

The exchange interactions *J*_{μν} are also interpretable. As explained above, correlations among words are a serious problem. Indeed, for example, a word *σ*_{0}*σ*_{i}*σ*_{j} = 1 may occur too frequently simply because its sub-words *σ*_{0}*σ*_{i} = 1, *σ*_{i}*σ*_{j} = 1, or *σ*_{0}*σ*_{j} = 1 are common. Alternatively, the word may be frequent because, in its turn, it is a sub-word of a larger common word, for example, *σ*_{0}*σ*_{i}*σ*_{j}*σ*_{k} = 1. In GLMs, resolving these overlaps requires imposing sparsity or other additional constraints. In contrast, the couplings *J*_{μν} address this problem for uBIA naturally and computationally efficiently. Notice that because of the factor 2 in Eq. (8), the exchange interactions are predominantly negative if one expects the two studied words to be correlated, and if they co-occur in the empirical data as much as they are expected to co-occur in the null model because of the overlaps in their composition, *V*_{μ} and *V*_{ν}. Negative *J*_{μν}s implement a “winner-take-all” mechanism, where statistical anomalies in data that can be explained, in principle, by many different *θ*_{μ}s are attributed predominantly to one such *θ*_{μ} that explains them the best. On the other hand, the exchange interactions are positive if one expects correlations between the words *a priori*, but does not observe them. Thus, in principle, a word can be included in the dictionary even at zero field *h*_{μ}.

Knowing the coefficients *h*_{μ} and *J*_{μν}, one can numerically estimate 〈*s*_{μ}〉, the posterior expectation for including a word *V*_{μ} in the dictionary. Generally, finding such marginal expectations from the joint distribution in disordered systems is a hard problem. However, here *h*_{μ} ∝ *ε* and *J*_{μν} ∝ *ε*^{2}, so that the fields and the interactions create small perturbations around the “total ignorance” solution, 〈*s*_{μ}〉 = 1/2 (this is a manifestation of our general assumption that none of the words is very easy to detect). Therefore, we calculate the marginal expectation using fast mean field techniques^{62}. We use the *naive* mean field approximation, which is given by self-consistent equations for the posterior expectations in terms of the magnetizations *m*_{μ} = 2〈*s*_{μ}〉 − 1,

so that interactions among spins are encapsulated in an effective field. We solve Eq. (13) iteratively^{25}, by increasing *ε* from 0 (that is, from the total ignorance 〈*s*_{μ}〉 = 1/2 or *m*_{μ} = 0) up to the limiting value *ε*_{max} in steps of *δε* = *M*^{−1}/20. This limiting value *ε*_{max} is determined by the two approximations involved in the strong regularization assumption. First, the saddle point approximation around the peak of the prior in Eq. (5) implies that the characteristic width of the prior should be smaller than that of the likelihood, *ε* ≤ *ε*_{1} = 1/*M*. Second, the Taylor series up to second order in *ε* for the posterior of the indicator variables implies that the quadratic corrections should not be larger than the linear terms. Within the mean field approximation, the corresponding bound is saturated at some *ε*_{2} (notice that, in contrast to our usual notation, the averages in this bound are over the indices, and not the data). Thus overall we take *ε*_{max} = min{*ε*_{1}, *ε*_{2}}.
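The iterative solution can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation: it assumes a posterior of the form P(**s**) ∝ exp(Σ_{μ} *h*_{μ}*s*_{μ} + ½ Σ_{μ≠ν} *J*_{μν}*s*_{μ}*s*_{ν}) with *s*_{μ} ∈ {0, 1}, for which the naive mean field equations read *m*_{μ} = tanh[(*h*_{μ} + Σ_{ν} *J*_{μν}(1 + *m*_{ν})/2)/2]. The arrays `a` and `b` (fields and couplings per unit *ε* and *ε*^{2}), the function name, and the number of fixed-point sweeps are hypothetical.

```python
import math

def naive_mean_field(a, b, M, eps_max, n_sweeps=50):
    """Anneal naive mean-field magnetizations from eps = 0 up to eps_max.

    a[mu]: field per unit eps, so that h_mu = eps * a[mu]            (assumed scaling)
    b[mu][nu]: coupling per unit eps**2, J_mu_nu = eps**2 * b[mu][nu] (assumed scaling)
    Returns the magnetizations m_mu = 2<s_mu> - 1.
    """
    n = len(a)
    m = [0.0] * n                      # total ignorance: <s_mu> = 1/2
    d_eps = 1.0 / (20.0 * M)           # step size quoted in the text
    eps = 0.0
    while eps < eps_max:
        eps = min(eps + d_eps, eps_max)
        for _ in range(n_sweeps):      # fixed-point iteration at this eps
            new_m = []
            for mu in range(n):
                # Effective field: own field plus couplings to the mean of s_nu.
                h_eff = eps * a[mu] + sum(
                    eps ** 2 * b[mu][nu] * 0.5 * (1.0 + m[nu])
                    for nu in range(n) if nu != mu
                )
                new_m.append(math.tanh(0.5 * h_eff))
            m = new_m
    return m
```

Starting each fixed-point iteration from the solution at the previous *ε* keeps the iterate close to the branch that grows continuously out of the total ignorance solution.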

Additionally, we also used the TAP equations^{62} instead of Eq. (13) to calculate magnetizations. These are more accurate since they account for how a spin affects itself through its couplings with the other spins. However, corrections due to this more complicated method were observed to be negligible in our strongly regularized regime, since they were of higher order in *ε* ≪ 1. Thus all results that we report are based on the mean field estimation.

### Effect of absent words

Of the exponentially many possible words, the majority will not occur in a realistic data set. In particular, this includes most long words. At the same time, *a priori* expectations for the frequency of such words, Eq. (10), decrease exponentially fast with the word length. Thus the fields, Eq. (7), for the words that do not occur are small, and the posterior expectation for including these words in the dictionary is 〈*s*_{μ}〉 ≈ 1/2, so that we do not need to analyze them explicitly. A bit more complicated is the fact that all words affect each other’s probability to be included in the dictionary through the exchange couplings *J*_{μν}, so that, in principle, the sum in the mean field equations, Eq. (14), is over exponentially many terms. Here we show that the effect of non-occurring words on the interaction terms is exponentially small in *N*, as long as the empirical averages *n*_{i}/*M* ≪ 1.

To illustrate this, we start with the probabilities *p*(*σ*_{i} = 1) = *p*_{i} of a single variable *i* being active. We then define the average such probability *q* = *N*^{−1}Σ_{i} *p*_{i}. Without loss of generality, we assume *q* < 1/2, and otherwise we rename *σ*_{i} → 1 − *σ*_{i}. Denoting a long word of a high order *k* that does not occur in the data as *σ*_{ω}, we have *n*_{ω} = 0. Then the corresponding field is

Here we consider as *high order* those words for which *q*^{k}*M* ≪ 1 (in general, 〈*σ*_{ω}〉*M* = 〈*n*_{ω}〉 ≪ 1, which happens for *k* ~ 4 … 5 for our datasets). Then the magnetization is

This illustrates our first assertion that none of these non-occurring words will be included in the dictionary. However, as a group, they may still have an effect on words of lower orders. To estimate this effect, for a word *σ*_{μ} of a low order *k*_{0}, we calculate the effective field that all of the non-occurring words *σ*_{ω} have on it. First we notice that, if *V*_{μ} and *V*_{ω} do not overlap, then their covariance is zero, and *J*_{μω} = 0. That is, only high order words that overlap with *V*_{μ} can contribute to this effective field. Since cov(*σ*_{μ}, *σ*_{ω}) ~ *q*^{k}(1 − *q*^{k_{0}}), the couplings are

Using Eq. (13), this gives for the typical effective field that absent words have on the word *μ*
where the number of words of order *k* that overlap with *σ*_{μ} and can affect it is given by the combinatorial coefficient . This has a very sharp peak at *k* = (*N* + *k*_{0})/2, where . We can approximate the sum in Eq. (19) as the argument of the sum evaluated at this peak *k* = (*N* + *k*_{0})/2, obtaining an effective field coming from high order words

In other words, even the combined effect of all higher order absent words is small if the average frequency of individual letters is smaller than 1/2. We thus can disregard all non-occurring words in the mean field equations.
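The location of the peak quoted above can be checked numerically. The sketch below assumes that the combinatorial coefficient counts the order-*k* words that contain a given order-*k*_{0} word, that is, the binomial coefficient C(*N* − *k*_{0}, *k* − *k*_{0}); this form is an assumption of the illustration, chosen because it is consistent with the stated peak at *k* = (*N* + *k*_{0})/2.

```python
import math

def containing_word_counts(N, k0):
    """Number of order-k words that contain a given order-k0 word, for k = k0..N."""
    return {k: math.comb(N - k0, k - k0) for k in range(k0, N + 1)}

# Example with N = 20 variables and a pair word (k0 = 2).
counts = containing_word_counts(N=20, k0=2)
k_peak = max(counts, key=counts.get)   # order k at which the count is largest
```

For these values the maximum falls at *k* = 11 = (20 + 2)/2, and the counts sum to 2^{18} over all *k*, which shows how strongly the sum is dominated by orders near the peak.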

We stress that, for this to hold, the average of the binary variables *σ*_{i} must be small, *q* = *N*^{−1}Σ_{i} *p*(*σ*_{i} = 1) < 1/2. In our songbird dataset, this condition was fulfilled with *q* ~ 0.2. However, in 4% of cases the probability to have a spike in a certain time bin was *p*_{i} > 1/2. Thus to stay on the safe side, we performed additional analyses by redefining variables as *σ*_{i} → 1 − *σ*_{i} if the probability of a spike in a bin was above 50%. In other words, in such cases, we defined the absence of the spike as 1 and the presence as 0. For our datasets, the findings did not change with this redefinition.

This previous analysis does not imply that absent words of high order are irrelevant; it only says that they cannot be detected with the available datasets. In the numerical implementation of the method, we filter out long absent words *ω* such that 〈*σ*_{ω}〉*M* = 〈*n*_{ω}〉 < 0.02, with this cutoff determined by Eqs. (15–17), so that, for these words, *h*_{ω} ≪ 1. These words get assigned 1/2 as the posterior probability of inclusion in the dictionary, and their contribution to the mean field equations is neglected. In contrast, if a word *ω* is absent but 〈*n*_{ω}〉 ≥ 0.02, we include it in the analysis, Eq. (4). Such words may turn out to be relevant codewords, especially if they happen a lot less frequently than expected *a priori*.

### Synthetic data for testing and fine-tuning the method

To set the free parameters of our approach and to quantify its ability to reconstruct dictionaries, we test it on synthetic data sets that are similar to the songbird data analyzed in this work. We use the log-linear model, Eq. (1), as a generative model for binary correlated observables ***σ*** with *N* = 20. We choose the individual biases in the generative model from a normal distribution whose parameters match the observed probability of a spike in a bin in the bird data. That is, *p*(*σ*_{i} = 1) ≃ [1 + exp(−2*θ*_{i})]^{−1} ~ *q* ~ 0.2. Then we select which binary variables interact. We allow interactions of 2nd, 3rd, and 4th order, with an equal number of interactions per order. For different tests, we choose the interaction strengths from (a) the sum of two Gaussian distributions, one with a positive mean and the other with a negative one, mean(*θ*_{μ}) = ±0.5, std(*θ*_{μ}) = 0.1, and (b) one Gaussian distribution centered at zero with std(*θ*_{μ}) = 0.5. Both choices reflect our strong regularization assumption, so that effects of individual variables on each other are weak, and a state of one variable does not determine the state of the others, and hence does not “freeze” the system. We are specifically interested in the performance of the algorithm in the case where ***θ*** are distributed as the sum of Gaussians. On the one hand, this tests how the algorithm deals with data that are atypical within its own assumptions. On the other hand, this choice ensures that there are very few values of ***θ*** that are statistically indistinguishable from zero, making it easier to quantify findings of the algorithm as either true or false. We have additionally tested other distributions of ***θ***, but no new qualitatively different behaviors were observed. Finally, for both types of distributions of ***θ***, we also varied the density of interactions *α*, from *α* = 2 to *α* = 4, which spans the interaction densities of tree-like and 2D lattice-like networks.
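A generator of this kind can be sketched for a small number of variables. The sketch assumes the exponent 2 Σ_{μ} *θ*_{μ}*σ*_{μ}, chosen only because an isolated variable then reproduces the marginal [1 + exp(−2*θ*_{i})]^{−1} quoted above; the exact form of Eq. (1) is not restated here. It samples by exact enumeration, which costs 2^{N} and is therefore shown with *N* = 6 rather than *N* = 20. All names and the specific parameter values are illustrative.

```python
import itertools
import math
import random

def sample_log_linear(theta, N, M, seed=0):
    """Draw M samples of N binary variables by exact enumeration.

    theta: dict mapping a word (tuple of indices) to its coupling theta_mu.
    Assumed model: log P(sigma) = 2 * sum_mu theta_mu * prod_{i in V_mu} sigma_i - log Z.
    """
    rng = random.Random(seed)
    states = list(itertools.product((0, 1), repeat=N))
    # Unnormalized weight of each state: only words that are fully "on" contribute.
    weights = [
        math.exp(2.0 * sum(t for word, t in theta.items()
                           if all(s[i] for i in word)))
        for s in states
    ]
    return rng.choices(states, weights=weights, k=M)

# Illustrative parameters: biases near -0.7 give p(sigma_i = 1) close to 0.2.
rng = random.Random(1)
N = 6
theta = {(i,): rng.gauss(-0.7, 0.1) for i in range(N)}
theta[(0, 1)] = 0.5        # one pairwise interaction
theta[(2, 3, 4)] = -0.5    # one third-order interaction
data = sample_log_linear(theta, N, M=500)
```

For *N* = 20 the same enumeration visits about 10^{6} states, so a Gibbs sampler may be a more practical choice there.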

Next, we generate *M* samples from these random probability distributions and apply our pipeline to reconstruct the dictionary. We test on 400 distributions from each family. As the first step, we discard high-order words absent in the data using a threshold on the expected number of occurrences 〈*σ*_{μ}〉*M* = 〈*n*_{μ}〉 < 0.02, as explained above. Next, we select the *N*_{max} words that have the highest (absolute) values of the magnetic field (we have tested *N*_{max} = 200, 500, 2000, 5000, and finally use 500 after not observing differences). To decide which of these remaining words are to be included in the dictionary, we build the Ising model on the indicator variables, Eq. (4), with its corresponding magnetizations *m*_{μ} given by the mean field equations. We start from an inverse regularization strength of *ε* = 0 and then increase *ε* in steps of *δε* = 1/(20*M*), up to *ε*_{max} = min{*ε*_{1}, *ε*_{2}}, as detailed above. We then set the full false discovery rate *n*_{false}, the number of dictionary words that we allow ourselves to identify falsely from the fully reshuffled data, which must have zero words in it. Note that we do the reshuffling while keeping the observed frequency of individual variables *n*_{i} constant. For the value of *n*_{false}, we identify the significance threshold for the magnetization, which allows this many false words to be accepted in the shuffled data, on average (from the same 400 distributions). Finally, we select as dictionary words those that have their marginal magnetizations *m*_{μ} above a significance threshold, *m*_{μ} > *m*(*n*_{false}).
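The shuffling and threshold selection can be sketched as follows. This is a minimal illustration with hypothetical function names. It assumes that reshuffling means permuting each variable independently across samples, which keeps every *n*_{i} fixed, and that the threshold is chosen from the pooled magnetizations of all shuffled data sets.

```python
import random

def shuffle_columns(samples, rng):
    """Permute each variable independently across trials.

    This keeps every n_i fixed while destroying correlations between variables.
    """
    columns = [list(col) for col in zip(*samples)]
    for col in columns:
        rng.shuffle(col)
    return [tuple(row) for row in zip(*columns)]

def magnetization_threshold(shuffled_magnetizations, n_false):
    """Threshold that accepts at most n_false words per shuffled data set, on average.

    shuffled_magnetizations: one list of m_mu values per shuffled data set.
    """
    n_sets = len(shuffled_magnetizations)
    pooled = sorted((m for ms in shuffled_magnetizations for m in ms), reverse=True)
    allowed = int(n_false * n_sets)   # false words tolerated across all sets
    if allowed >= len(pooled):
        return float("-inf")
    # Words are accepted when m > threshold, so take the (allowed + 1)-th largest value.
    return pooled[allowed]
```

Words in the real data are then accepted when their magnetization exceeds the returned value.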

We then repeat the analysis for different values of *n*_{false}, seeking the thresholds that minimize false negatives (exclusions of true words from the dictionary) and false positives (inclusion of spurious words into the dictionary) simultaneously. To measure these, we use two metrics: precision and recall. First, precision measures the fraction of the words included in the dictionary that are true, i.e., have a nonzero *θ*_{μ} in the generative model. Second, recall measures the fraction of the words in the generative model with *θ*_{μ} ≠ 0 that were included in the dictionary. Results of the analysis are shown in Fig. 7B. Since data set sizes are relatively small, we do not expect to detect all words, especially in the case where *θ*_{μ} are allowed to be close to 0 in the generative model (Gaussian distributed). Thus we emphasize precision over recall in setting parameters of the algorithm: we are willing to not include words in a dictionary, but those words that we include should have a large probability of being true words in the underlying model. Our most crucial observation is that the precision-recall curves are remarkably stable with the changing density of interactions. Recall is smaller when interaction coefficients are taken from a Gaussian centered at zero. However, one could argue that missing words with very small *θ*_{μ} should not be considered a mistake: they are not significant words in the studied dictionary. We observe that by keeping the full false discovery rate *n*_{false} below 0.5 (only about half a word detected falsely, on average, in shuffled data), we can reach a precision as high as 80%, extracting 20% – 30% of the codewords depending on the number of samples, the distribution of ***θ***, and *α*. We are thus confident that our method produces dictionaries in which a substantial fraction of words correspond to true words in the data.
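The two metrics can be written compactly. The sketch below is illustrative; it represents words as tuples of variable indices, which is an assumption about the data layout.

```python
def precision_recall(dictionary, true_words):
    """Precision and recall of a reconstructed dictionary against the generative model.

    dictionary: iterable of words (tuples of indices) selected by the method.
    true_words: iterable of words with nonzero theta_mu in the generative model.
    """
    found = set(dictionary)
    truth = set(true_words)
    hits = len(found & truth)
    precision = hits / len(found) if found else float("nan")
    recall = hits / len(truth) if truth else float("nan")
    return precision, recall
```

For instance, a dictionary of three words, two of which are among five true words, has a precision of 2/3 and a recall of 2/5.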

### Testing the predictive power of the uBIA dictionaries

In this section, we test whether the codewords found in data from songbird premotor neurons can be used to predict the subsequent behavior. We compare two logistic regression models: one that uses the activity in the 20 time bins to predict the behavior and another that uses as features only the activity of the few relevant codewords, usually far fewer than 20. The features corresponding to the codewords are binary, and they are only active when all the time bins of such words are active. This means that the model using the time bins is more complex, as it already has all the information that the codewords model has and more, though it does not account for combinatorial effects of combining spikes into patterns. In order to properly compare the predictive power of these two models with different complexity, we perform 2-fold cross-validation, using a log-likelihood loss function. As is common in these cases, an L2 penalty is included to help the convergence to a solution (the models were implemented with the Classify function from Mathematica, whose optimization is done by the LBFGS algorithm). As shown by Tang et al.^{35}, not all neurons in our dataset are timing neurons, or even code for the behavior at all. Thus we restrict the comparison to those cases that have at least 4 codewords (27 cases in total, with 10 codewords on average). Both of the logistic regression models have the following structure

where *y* corresponds to the behavior, and the features correspond to the time bins in one case (*z*_{i} = *x*_{i}) and to the codewords in the other (*z*_{i} = Π_{j∈Vi} *x*_{j}), while *β*_{i} are the coefficients of the model. The loss function used is the log-likelihood with the L2 penalty, where *M* is the number of samples, and *λ* is the regularization strength. In our analysis, as different datasets have different numbers of samples, we show the results for the mean cross-entropy over the test data, which corresponds to the normalized log-likelihood.
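The two kinds of features and the penalized loss can be sketched in a few lines. This is an illustration under stated assumptions, not the Mathematica implementation used in the analysis: the logistic form with an intercept `beta0`, the normalization of the penalty, and all function names are assumptions.

```python
import math

def codeword_features(x, words):
    """z_i = product of x_j over the bins j in word V_i (1 only if every bin fired)."""
    return [int(all(x[j] for j in word)) for word in words]

def penalized_loss(beta0, beta, Z, y, lam):
    """Mean negative log-likelihood of a logistic model plus an L2 penalty.

    Assumed form: p(y = 1 | z) = 1 / (1 + exp(-(beta0 + sum_i beta_i * z_i))).
    Z: list of feature vectors, y: list of 0/1 labels, lam: regularization strength.
    """
    M = len(y)
    nll = 0.0
    for z, target in zip(Z, y):
        a = beta0 + sum(b * zi for b, zi in zip(beta, z))
        # log(1 + exp(a)) - target * a, written in a numerically stable way.
        nll += max(a, 0.0) + math.log1p(math.exp(-abs(a))) - target * a
    return nll / M + lam * sum(b * b for b in beta)
```

With `lam = 0` the returned value is the mean cross-entropy, which is the quantity compared across datasets of different sizes. For the time-bin model the feature vector is the raw activity itself, whereas for the codeword model it is `codeword_features(x, words)`.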

Tang et al.^{35} showed that individual neurons on average carry around 0.12 bits at a 2 ms scale. So for both models we expect the prediction accuracy to be barely above chance, especially since we are focusing on a particular prediction model (a logistic regression), and may be missing predictive features not easily incorporated in it. Figure 8**a** shows the scatter plot of accuracy in the 27 analyzed datasets, plotting the prediction using the time bins activity on the horizontal axis versus prediction using only the codewords activity on the vertical one. We observe that the models based on codewords are consistently better than the ones using all the 20 time bins, and the difference is significant (inset). We additionally evaluate the quality of prediction using the mean cross-entropy between the model and the data. Figure 8**b** shows that the models with the codewords have lower mean cross-entropies and thus generalize better (see inset).

## Software Implementation

The software implementation of uBIA is available from https://github.com/dghernandez/decomotor.

## Data

The data used in this work is available from https://figshare.com/articles/Songbird_premotor_dictionaries/10315844.

## Author contributions statement

DGH, SJS, and IN conceived the project, performed the research, and wrote the paper. DGH wrote the software implementation of uBIA.

## Additional information

### Competing interests

The authors declare no competing interests.

## Acknowledgements

We thank David Hoffman and Pankaj Mehta for valuable discussions. This work was supported in part by NIH Grants R01-EB022872, R01-NS084844, and R01-NS099375, and NSF grant BCS-1822677. IN acknowledges hospitality of the Kavli Institute for Theoretical Physics, supported in part by NSF Grant PHY-1748958, NIH Grant R25GM067110, and the Gordon and Betty Moore Foundation Grant 2919.01. IN and SJS further acknowledge hospitality of the Aspen Center for Physics, which is supported by NSF grant PHY-1607611.