Inferring neural dynamics of memory during naturalistic social communication

Memory processes in complex behaviors like social communication require forming representations of the past that grow with time. The neural mechanisms that support such continually growing memory remain unknown. We address this gap in the context of fly courtship, a natural social behavior involving the production and perception of long, complex song sequences. To study female memory for male song history in unrestrained courtship, we present ‘Natural Continuation’ (NC)—a general, simulation-based model comparison procedure to evaluate candidate neural codes for complex stimuli using naturalistic behavioral data. Applying NC to fly courtship revealed strong evidence for an adaptive population mechanism through which female auditory neural dynamics could convert long song histories into a rich mnemonic format. Song temporal patterning is continually transformed by heterogeneous nonlinear adaptation dynamics, then integrated into persistent activity, enabling common neural mechanisms to retain continuously unfolding information over long periods and yielding state-of-the-art predictions of female courtship behavior. At a population level this coding model produces multi-dimensional advection-diffusion-like responses that separate songs over a continuum of timescales and can be linearly transformed into flexible output signals, illustrating its potential to create a generic, scalable mnemonic format for extended input signals poised to drive complex behavioral responses. This work thus shows how naturalistic behavior can directly inform neural population coding models, revealing here a novel process for memory formation.

Recordings of song were initially conducted at a much higher sampling rate, then segmented and binned at 30.03 Hz, matching the sampling rate of the video cameras used to record locomotion. Female flies were also rendered blind and pheromone-insensitive (PIBL) to increase their responses to auditory stimuli. See (3) for further details on the data collection.
For the analyses reported in this manuscript we used 87 courtship sessions (comprising 13.4 hours of song/behavior) with wild-type PIBL females and two strains of males (NM91 and ZH23), which sang more robustly than other male strains and evoked robust female locomotion. Although more courtship sessions were available, we found that including them did not substantially improve our MA-based predictions of female behavior, and the improved predictive capacity of the MA over the LN model persisted as more trials were included in the training data (Fig S13). Using 87 sessions also greatly reduced the computational resources required to generate and store the many iterations of artificial neural recordings and behavioral predictions that we performed.
Fitting the encoding models. We fit the four parameters of the MA encoding model to each neuron by minimizing the squared error between the model MA response to the sine and pulse stimuli (together with their 10-second post-stimulus periods) and the empirical calcium responses. To ensure a fair comparison between the LN and MA models, we derived a method to parameterize the LN models by exactly the same parameters as the MA model (τint, τa, xs, xp); otherwise the LN model would have many more parameters, including one per timepoint in each filter. To this end, we analytically computed the MA step response and took its time-derivative to identify the corresponding LN filter, which was consequently parameterized by the MA parameters (see next section). For the nonlinearity g we used a signed rectifying nonlinearity, with sign chosen according to whether the empirical neural activity increased or decreased following stimulus onset. We then fine-tuned the LN filter by adjusting the four parameters through gradient descent so that the LN response captured both the onset and offset responses to the block stimuli.
We also fit the LN model directly to the calcium data using Ridge Regression and a sigmoidal nonlinearity. The nonlinearity was a sigmoid applied to the filtered song input z = hs * Is + hp * Ip. We fit the filters first without the nonlinearity, then fit the nonlinearity given the filters, and finally performed a joint optimization of the filter and nonlinearity parameters to fine-tune the fit, minimizing the squared error between the LN predictions and the mean calcium responses and penalizing the filter weights with the same Ridge Regression parameter. Using a sigmoidal nonlinearity produced results similar to using a signed rectification nonlinearity. In general, however, the preceding fitting procedure, in which we parameterized the LN model by the MA parameters, yielded more conservative results (i.e. the LN models parameterized by the 4 MA parameters yielded a better LN prediction of female walking than the LN models fit with Ridge Regression), in particular because the LN model tended to inherit timescales very similar to the MA timescales (Fig S15).

Analytical derivation of the LN filter. The MA step response r(t) is the convolution of the adapted input with hr(t) = Θ(t) e^(−t/τint)/τint, the exponential filter described by the dynamical system for r. That is,

r(t) = ∫_(−∞)^t (1/τint) e^(−(t−t′)/τint) s(t′) dt′,

where the Θ(t) corresponding to the filter hr has been absorbed into the limited integration range. We can take care of the second Heaviside function, arising from the adapted step input s(t′) = Θ(t′) e^(−t′/τa), in the same way:

r(t) = ∫_0^t (1/τint) e^(−(t−t′)/τint) e^(−t′/τa) dt′ = (τa/(τa − τint)) (e^(−t/τa) − e^(−t/τint)) for τa ≠ τint, and r(t) = (t/τint) e^(−t/τint) otherwise.

Note that the τa ≠ τint case converges to the τa = τint case as τa → τint, even though the denominator in the prefactor goes to 0.
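As a numerical sanity check of this limit (a standalone sketch, not part of the analysis code), the double-exponential step response r(t) = τa/(τa − τint) · (e^(−t/τa) − e^(−t/τint)) can be evaluated directly and compared with its equal-timescale limit (t/τint) e^(−t/τint):

```python
import math

def ma_step_response(t, tau_int, tau_a):
    """Step response from convolving exp(-t/tau_int)/tau_int with an input
    whose adaptation decays as exp(-t/tau_a)."""
    if math.isclose(tau_a, tau_int):
        # degenerate (equal-timescale) case
        return (t / tau_int) * math.exp(-t / tau_int)
    pref = tau_a / (tau_a - tau_int)
    return pref * (math.exp(-t / tau_a) - math.exp(-t / tau_int))

# As tau_a -> tau_int the two-timescale response converges to the degenerate
# case, even though the prefactor's denominator vanishes.
t = 2.0
limit = ma_step_response(t, 1.0, 1.0)
approx = ma_step_response(t, 1.0, 1.0 + 1e-4)
print(limit, approx)  # the two values agree closely
```

The cancellation between the vanishing denominator and the vanishing difference of exponentials is exactly why the limit is finite.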
To construct the LN model we use these filters, combined with a signed rectification nonlinearity whose signs match the selectivities xs and xp. Thus, by construction, the LN responses to step inputs (i.e. block song onsets) are identical to the MA responses. The offset responses following the 10-second block stimulus, however, while typically similar, are not necessarily exactly matched to the MA responses. We therefore adjusted the 4 parameters of the LN model using a standard gradient-based procedure (scipy.optimize.minimize) to maximally reproduce the full block song responses over the 10-second stimulus period and 10-second post-stimulus period.
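This kind of few-parameter fine-tuning loop can be sketched with scipy.optimize.minimize. Here a two-parameter toy saturating-exponential model stands in for the actual LN response; the model function, target trace, and parameter names are illustrative assumptions, not the paper's code:

```python
import numpy as np
from scipy.optimize import minimize

t = np.arange(0, 20, 1 / 30.03)          # 20 s at the 30.03 Hz frame rate
target = 0.8 * (1 - np.exp(-t / 2.0))    # stand-in "empirical" response

def model_response(params, t):
    # Toy stand-in: amplitude and timescale of a saturating exponential.
    # The real objective would evaluate the LN response parameterized by
    # (tau_int, tau_a, xs, xp) over stimulus and post-stimulus periods.
    amp, tau = params
    return amp * (1 - np.exp(-t / tau))

def sq_err(params):
    return np.mean((model_response(params, t) - target) ** 2)

res = minimize(sq_err, x0=[1.0, 1.0], method="Nelder-Mead")
amp_hat, tau_hat = res.x  # should recover amp ~ 0.8, tau ~ 2.0
```

Nelder-Mead is used here because the toy objective is cheap and low-dimensional; a gradient-based method works equally well when gradients of the model response are available.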
Rich Pang, Christa Baker, Mala Murthy, and Jonathan Pillow

Bout-duration and hand-picked feature models. It was shown in Clemens et al. (2015) (4) that among several manually chosen song features, time-averaged bout duration exhibited the strongest correlation with time-averaged female walking speed, with the correlation plateauing near a 1-minute averaging window. To verify these conclusions within our moment-to-moment predictive analysis of female walking speed, we computed the momentary values of a battery of hand-picked song features, time-averaged over different windows, and then used each feature individually to predict moment-to-moment female walking speed (using either a 1-second or 1-minute forward averaging window for walking speed, as in our other analyses). Corroborating the results of Clemens et al., we found that mean bout duration, averaged over about 1-4 minutes, provided the strongest prediction of moment-to-moment female walking speed, explaining about 10-12% of the variance of the 1-second-averaged walking speed and 12-15% of the variance of the 1-minute-averaged walking speed (Fig S14). (As in our other analyses, the bout-duration regressor was fit on 80% of the trials, the prediction and variance explained were computed on a held-out 20% of the trials, and results were averaged over 30 training/test splits.) Note that while (4) showed that mean bout duration could be estimated by a single-neuron filter applied directly to the song's acoustic envelope waveform, combined with a set of nonlinearities and an integrator, we did not explicitly implement this pseudo-neural model, since its end result was to provide a highly accurate estimate of mean bout duration (Pearson correlation = 0.93), which we instead computed directly from the ternary song representation.
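The ternary song representation makes bout extraction straightforward. A minimal sketch, assuming quiet is coded as 0 and sine/pulse as nonzero labels, with one sample per 30.03 Hz frame:

```python
import numpy as np

FS = 30.03  # shared video/song frame rate, Hz

def bout_durations(ternary):
    """Durations (s) of contiguous non-quiet runs in a ternary song array."""
    song = np.asarray(ternary) != 0
    # pad with quiet so every bout has a detectable start and end
    padded = np.concatenate(([False], song, [False])).astype(int)
    d = np.diff(padded)
    starts = np.where(d == 1)[0]
    ends = np.where(d == -1)[0]
    return (ends - starts) / FS

# e.g. quiet, a 2-sample sine bout, quiet, a 3-sample pulse bout, quiet
durs = bout_durations([0, 1, 1, 0, 0, 2, 2, 2, 0])
```

The momentary mean-bout-duration feature would then be a sliding (1-4 minute) average over the durations of bouts ending within each window.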
Direct song-to-locomotion filter. To estimate how much female locomotion variance could be explained by direct linear filters on sine and pulse song, we represented both the sine and pulse filters, hs and hp, as sums of 16 raised cosine basis functions

bi(t) = ½ cos(clip(a log(t + c) − ϕi, −π, π)) + ½,

where we set a = 6, c = 0, and ϕi = iπ/2, which spanned timescales up to approximately 1 minute. Filters were then represented as

hs(t) = Σi wis bi(t),

and similarly for the pulse filter. We then fit the basis function weights {wis}, {wip} using Ridge Regression (α = 10) to minimize the squared error between female locomotion and the summed filter outputs (Fig S17).
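A sketch of constructing log-stretched raised-cosine basis functions of this general form and fitting the weights with Ridge Regression; the exact argument scaling, the stand-in song and locomotion signals, and the function names are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge

dt = 1 / 30.03
t = np.arange(dt, 60, dt)  # filter support up to ~1 minute

def raised_cosine_basis(t, n=16, a=6.0, c=0.0):
    """b_i(t) = 0.5*cos(clip(a*log(t+c) - phi_i, -pi, pi)) + 0.5, phi_i = i*pi/2."""
    B = np.zeros((len(t), n))
    for i in range(n):
        phi = i * np.pi / 2
        arg = np.clip(a * np.log(t + c) - phi, -np.pi, np.pi)
        B[:, i] = 0.5 * np.cos(arg) + 0.5
    return B

B = raised_cosine_basis(t)

# Convolve the song channel with each basis function, then ridge-regress
# locomotion onto the resulting design matrix (random stand-in signals here;
# the pulse channel would contribute 16 more columns in the same way).
rng = np.random.default_rng(0)
song = rng.random(5000)   # stand-in sine-song indicator
y = rng.random(5000)      # stand-in locomotion
X = np.column_stack([np.convolve(song, B[:, i])[: len(song)]
                     for i in range(B.shape[1])])
w = Ridge(alpha=10.0).fit(X, y).coef_  # fitted basis weights {w_i^s}
h = B @ w                              # reconstructed filter h_s(t)
```

Representing each filter by 16 weights instead of one weight per timepoint is what keeps this regression well-conditioned at minute-long timescales.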
Fitting the neural-to-locomotion readout. Except where otherwise mentioned, each encoding model and variation we investigated produced a 224-dimensional time series (the number of original calcium recordings) accompanying each of the 87 courtship sessions. Each recording was a deterministic function of the courtship songs, as all encoding models were deterministic. As our locomotion variable we used the total walking speed of the fly, which could comprise both forward and lateral components (relative to the fly's body axis), and which was as predictable as or more predictable than purely forward or purely lateral motion (Fig S4).
To predict locomotion we first "forward-smoothed" the locomotion time series (i.e. total walking speed at each timestep) by replacing the walking speed at each timestep t with its time average from t to t + ∆T. Our motivation for smoothing only in the forward direction was to avoid contaminating our measure of ongoing/future locomotion with past features of female locomotion that could have influenced song at time t.
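Forward smoothing as described above can be implemented as follows (a minimal sketch, with the window ∆T expressed in samples):

```python
import numpy as np

def forward_smooth(x, n_samples):
    """Replace x[t] with the mean of x[t : t + n_samples] (truncated at the end)."""
    x = np.asarray(x, dtype=float)
    return np.array([x[t : t + n_samples].mean() for t in range(len(x))])

# e.g. a 2-sample forward window
smoothed = forward_smooth([1.0, 2.0, 3.0, 4.0], 2)  # -> [1.5, 2.5, 3.5, 4.0]
```

Because the window only extends forward, the smoothed value at time t never mixes in locomotion from before t.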
For each encoding model or variation we trained a linear readout from the artificial population neural activity at time t (which encodes song history up to time t) to the forward-smoothed female locomotion variable at time t. We trained the readout (using Ridge Regression with α = 10) on 80% of the courtship sessions and tested the readout across all timepoints in the remaining 20% of courtship sessions, which contained different flies. In general, female flies walked even before male song began. To focus on song-modulated locomotion, we therefore excluded from training/testing (1) all timepoints occurring before the first non-quiet song timepoint, and (2) any timepoint occurring after more than 30 seconds of pure quiet had passed. (This second condition was extremely rare, since the mean quiet period was only 2.34 seconds [Fig S4], with substantially longer quiet periods occurring with very low probability.) To compute the final score for each encoding model we performed this procedure for 30 random 80/20 splits of the 87 sessions into training/test sessions and reported the variance explained in held-out sessions, averaged across splits.

Song shuffling procedure. Songs were shuffled (Fig 1J) as follows. Songs could not be randomly assigned to different courtship sessions directly because sessions had different lengths. Instead, we concatenated all songs, randomly circularly shifted the concatenation, and re-segmented the result into individual sessions using the empirical session durations. This procedure retains temporal correlations in song while breaking correlations between song and locomotion.

Estimating song information. For a given MA model neuron we estimated the information about preceding song contained in its instantaneous activity level by presenting the entire courtship song extracted from each of the 87 sessions to the neuron (excluding any initial quiet period at the start of a session before singing began), and then creating a histogram of responses (using 16 evenly spaced bins ranging from 0 to the maximum neural response level), aggregating over all songs and timepoints (yielding 1,448,116 timepoints total binned into 16 bins). We then computed the Shannon entropy of the resulting histogram:

H = −Σi fi log2 fi,

where fi is the estimated fraction of timepoints in bin i. While this quantity is in general a biased estimate of entropy, we do not expect our results to be significantly affected, since the histogram estimate is built from over 1 million timepoints. Values reported in the figures are relative to the Shannon entropy of the equivalent uniform distribution, and hence range between 0 and 1.

Selection of songs for accumulation and manifold analysis. To investigate accumulator-like dynamics of the MA neurons we presented 1-minute song segments taken from the courtship sessions used in the rest of our analyses. To extract song segments, we segmented all songs across all 87 sessions into segments lasting at least 1 minute and separated by at least 5 seconds of quiet. For any selected segment that extended beyond 1 minute, only the first minute was used. This yielded 108 unique natural song segments, which we presented to the model neurons studied in Fig 3.
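The normalized-entropy estimate used for song information can be sketched as follows (16 bins from 0 to the maximum response, with the entropy normalized by that of a uniform distribution over the bins):

```python
import numpy as np

def relative_entropy(responses, n_bins=16):
    """Shannon entropy of a response histogram, relative to the uniform maximum."""
    responses = np.asarray(responses, dtype=float)
    bins = np.linspace(0, responses.max(), n_bins + 1)
    counts, _ = np.histogram(responses, bins=bins)
    f = counts / counts.sum()     # estimated fraction of timepoints per bin
    f = f[f > 0]                  # 0*log(0) = 0 by convention
    H = -np.sum(f * np.log2(f))
    return H / np.log2(n_bins)    # 0 (constant) to 1 (uniform over bins)

# responses spread evenly across all 16 bins give relative entropy 1;
# a constant response gives 0
x_uniform = np.repeat(np.arange(16, dtype=float), 10)
x_constant = np.ones(100)
```

With more than a million timepoints and only 16 bins, the plug-in bias of this estimator is negligible, as noted above.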

For the manifold analysis
Fig S1: Example labeled-line and pan-neuronally imaged calcium responses to block song. A. Example responses from a diverse selection of neurons in the labeled-line data (Baker et al 2022) (1). Every pair of responses to the sine and pulse blocks corresponds to one neural recording. B. Example responses from a selection of example ROIs extracted from the pan-neuronal imaging data in Pacheco et al 2021. C. Distributions of MA parameters fit to all neurons in the Baker et al 2022 dataset (N=224). D. Distributions of MA parameters fit to all ROIs in the Pacheco et al 2021 dataset (N=19036) (2).

Fig S2: Larger example set of encoding model fits to neural recordings. A. Example MA fits (black) to song-element responses (green) recorded in Baker et al 2022 (1). B. As in A but for LN encoding models with filters parameterized by the MA parameters. Formatting as in Fig 1E.

Fig S4: Variance explained for different female behavioral variables and over different smoothing windows. A. As in Fig 1J, but predicting female walking speed forward-averaged over a 60 s window. B. Comparison of female forward velocity (FFV), absolute female lateral speed (FLS) (relative to the female's body axis), and total female motion (MTN), i.e. walking speed, predictions using LN or MA populations, after forward-smoothing female behavior in a 1-second window. C. As in B but for a 60 s window. D. Additional examples of walking speed predictions from the LN vs MA population models for three different held-out trials (as in Fig 1H). E. As in D, but using a 1-minute forward averaging window for female walking speed. F. As in Fig 2A, but for 1-minute-forward-averaged female walking speed.

Fig S13: Female walking speed variance explained vs number of trials. A. Walking speed variance explained using artificial neural recordings generated by MA and LN population encoding models (LN model fit using the MA-step-response-matching procedure), using 1-second forward averaging of female walking speed. Error bars show standard error over 30 random train/test splits. B. As in A but using 1-minute forward averaging of female walking speed. In each plot the first 47 trials came from the male NM91 strain, the next 40 from ZH23, the next 42 from CarM03, and the final 37 from ZW109 (3).
Stimulus-invariant adaptation model. The stimulus-invariant version of the MA encoding model was given by the MA equations with the stimulus selectivity removed from the adaptation dynamics, so that adaptation was driven by the total song input.

Our greedily constructed behaviorally predictive MA population was built in the following way. We first defined a finite set of parameter values to select from, since this procedure amounts to an optimization requiring training a readout and predicting locomotion over 30 training/test splits of a large behavioral dataset at every iteration, making it infeasible to search over a continuous parameter space. The parameter ranges we considered were τint ∈ {0.1, 0.5, 1, 2, 5, 10, 30, 60, 120} s, τa ∈ {0.1, 0.5, 1, 2, 5, 10, 30, 60, ∞} s, xs ∈ {0, 0.5, 1}, and xp = 1 − xs. Note that we did not include negative selectivities, which correspond to empirical neural activity that decreases in response to song: these do not affect the ability to predict behavior, because the linear readout can absorb arbitrary signs and scalings of the neural activity.
We first generated an artificial recording containing one neural response for every combination of parameters. Next, we identified the single neuron that best predicted female locomotion in held-out courtship sessions. We then iteratively added the neuron that, when appended to the existing population, maximally increased the female locomotion variance explained in held-out courtship sessions (averaged over training/test splits), up to 50 neurons.
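The greedy construction can be sketched as follows. This is a simplified single-split version with a plain ridge solver; the full procedure averages held-out scores over 30 splits, and the toy data and function names are illustrative assumptions:

```python
import numpy as np

def ridge_r2(X_tr, y_tr, X_te, y_te, alpha=10.0):
    """Fit ridge weights on training data; return R^2 on held-out data."""
    w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(X_tr.shape[1]),
                        X_tr.T @ y_tr)
    resid = y_te - X_te @ w
    return 1 - resid.var() / y_te.var()

def greedy_select(X_tr, y_tr, X_te, y_te, max_neurons=50):
    """Iteratively add the neuron that most improves held-out variance explained."""
    selected, remaining = [], list(range(X_tr.shape[1]))
    while remaining and len(selected) < max_neurons:
        scores = [ridge_r2(X_tr[:, selected + [j]], y_tr,
                           X_te[:, selected + [j]], y_te) for j in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# toy check: "locomotion" driven by artificial neurons 2 and 5
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 8))   # timepoints x candidate neurons
y = 2 * X[:, 2] - X[:, 5]
order = greedy_select(X[:300], y[:300], X[300:], y[300:], max_neurons=2)
```

Each iteration refits the readout from scratch on the candidate population, so the cost grows with both the candidate grid size and the population size, which is why the parameter grid above was kept finite.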