## Abstract

Sensory neurons encode information using multiple nonlinear and dynamical transformations. For instance, auditory receptor neurons in *Drosophila* adapt to the mean and the intensity of the stimulus, change their frequency tuning with sound intensity, and employ a quadratic nonlinearity. While these computations are considered advantageous in isolation, their combination can lead to a highly ambiguous and complex code that is hard to decode. Combining electrophysiological recordings and computational modelling, we investigate how the different computations found in auditory receptor neurons in *Drosophila* combine to encode behaviorally-relevant acoustic signals like the courtship song.

The computational model consists of a quadratic filter followed by a divisive normalization stage and reproduces population neural responses to artificial and natural sounds. For general classes of sounds, like band-limited noise, the representation resulting from these highly nonlinear computations is highly ambiguous and does not allow for a recovery of information about the frequency content and amplitude pattern. However, for courtship song, the code is simple and efficient: The quadratic filter improves the representation of the song envelope while preserving information about the song’s fine structure across intensities. Divisive normalization renders the representation of the song envelope robust to the relatively slow fluctuations in intensity that arise during social interactions, while preserving information about the species-specific fast fluctuations of the envelope.

Overall, we demonstrate how a sensory system can benefit from adaptive and nonlinear computations while minimizing concomitant costs arising from ambiguity and complexity of readouts by adapting the code for behaviorally-relevant signals.

## Introduction

Sensory receptor neurons transform physical stimuli into neuronal representations. Constraints in neuronal dynamic range and bandwidth result in neural codes that represent some aspects of the stimulus at the cost of others, typically through multiple nonlinear and dynamical computations. For instance, sensory neurons are known to adapt to the statistics of their inputs: Visual systems adapt to the luminance and contrast of a visual scene, and auditory systems to the frequency and intensity statistics of the soundscape (Kastner and Baccus, 2014; Nagel and Doupe, 2006). Adaptation creates energy-efficient representations and improves stimulus discrimination (Benda and Hennig, 2008; Fairhall et al., 2001; Finn et al., 2007; Gorur-Shandilya et al., 2017; Laughlin, 1981). Adaptation can act on two properties of the mapping from stimulus to neural response: 1) how inputs are integrated in time and/or space (filtering), and 2) how the integrated input is transformed into spikes (gain) (Atick and Redlich, 1990; Attneave, 1954; Barlow, 1961; Carandini and Heeger, 2012; Laughlin, 1981; Nagel and Wilson, 2011; Zhaoping, 2006). However, while adaptive filtering and gain improve information transmission, they also remove information, for instance about absolute light or sound intensity levels. This introduces ambiguity for neural computations in downstream circuits since the meaning of a spike – the pattern and magnitude of the stimulus it represents – will depend on the stimulus history (Fairhall et al., 2001; Haak and Mesik, 2016; Seriès et al., 2009; Whitmire and Stanley, 2016; Zavitz et al., 2016). This ambiguity is desirable when it introduces invariances to intensity or contrast, but it may distort neural readouts if the adaptation affects behaviorally-relevant stimulus features (Fairhall et al., 2001; Hildebrandt et al., 2015).

Like adaptation, other strongly nonlinear computations are beneficial for making explicit some stimulus features at the cost of distorting others. For instance, the recognition of acoustic signals – such as speech – typically relies on two features of a sound: 1) the carrier, or fast oscillations that define the fine structure of the waveform, and 2) the envelope, relatively slower modulations of the intensity or variance of the carrier. A quadratic output nonlinearity produces responses that depend on the magnitude but not on the sign of an input stimulus. This operation is considered crucial for extracting the envelope but it introduces ambiguity about the structure of the carrier. Importantly, the impact of this ambiguity on decoding depends on the stimulus: For spectrally simple, pure-tone carriers, a quadratic nonlinearity produces frequency doubling responses, from which the stimulus frequency can be easily recovered. However, if the carrier is spectrally complex, the frequency content of the stimulus cannot generally be recovered after a quadratic nonlinearity.

Thus, unbiased measures based on information transmission on generic stimulus ensembles can be insufficient to assess the performance of a neural code because the impact of sensory computations, like adaptation or quadratic nonlinearities, is context-specific. A code’s performance depends on the statistics of the stimuli the system processes, and on the features the system aims to extract from the stimuli to drive behavior (Gomez-Marin and Ghazanfar, 2019).

We address this issue in the auditory receptor neurons of *Drosophila melanogaster.* Hearing is important for courtship behavior (Coen et al., 2014): Males chase the female and produce a dynamical courtship song by vibrating their wing. The song of *Drosophila melanogaster* males contains two major modes: The sine song consists of sustained oscillations with a frequency around 150 Hz, and the pulse song consists of regular trains of short pulses that can have two distinct shapes (Arthur et al., 2013; Clemens et al., 2018a). Females — but also nearby males — evaluate both the spectral and the temporal properties of the song to inform social interactions (Deutsch et al., 2019; Li et al., 2018; Versteven et al., 2017). Mechanoreceptor neurons in the fly’s ear (its antenna) must therefore maintain the carrier and the envelope of the song for further processing downstream. In addition to acoustic communication, flies also detect and avoid threats based on sudden increases in the sound envelope (Lehnert et al., 2013).

Flies detect sound using the arista, a feathery extension of the antenna (Fig. 1A). Sound-induced antennal vibrations, but also slow antennal movement induced by wind and gravity, activate a diverse population of stretch-sensitive mechanoreceptors in the antenna’s second segment – the Johnston’s organ neurons (JONs) (Kamikouchi et al., 2009; 2006; Yorozu et al., 2009). While some types of JONs encode slow antennal movement induced by wind and gravity (JONs C and E) (Yorozu et al., 2009), the JON-A and JON-B subpopulations respond best to fast, sound-induced antennal vibrations in the frequency range of the courtship song (100-400 Hz) and act as the auditory receptor neurons. These auditory JONs perform multiple nonlinear computations that are thought to improve the representation of sounds (Fig. 1B): JONs adapt to the mean and to the variance of receiver movement (Albert et al., 2007; Clemens et al., 2018b; Lehnert et al., 2013). Mean adaptation renders responses robust to slow antennal movement induced by wind and gravity. Variance adaptation corrects for slow intensity fluctuations, which arise during the dynamical social interactions from the strongly directional and distance-dependent sound receiver (Bennet-Clark, 1971; Morley et al., 2012). The extracellularly recorded bulk spiking activity of JONs A and B – the compound action potential (CAP) (Clemens et al., 2018b; Kamikouchi et al., 2009; Lehnert et al., 2013) (Fig. 1A) – exhibits frequency doubling responses for sinusoids. For instance, a 300 Hz sinusoid evokes 600 Hz oscillations in the CAP (Eberl et al., 2000; Lehnert et al., 2013; Tootoonian et al., 2012) (Fig. 1B). This frequency doubling can be reproduced mathematically by squaring the sinusoid, indicating a quadratic transformation for sound across the JON population.
Yet another computation performed by JONs is adaptive temporal filtering: the cutoff frequency of the antennal receiver increases with intensity (Nadrowski and Göpfert, 2014) and this effect is linked to the mechanical amplification of low-amplitude sounds in JONs (Albert and Kozlov, 2016; Göpfert et al., 2006; 2005; Göpfert and Robert, 2003).

To understand how diverse computations (mean and variance adaptation, frequency doubling, and adaptive temporal filtering) combine in JONs to encode the song’s carrier and envelope, we built a computational model of the JONs based on CAP recordings. We relied on CAP recordings, since they are currently the only readout of JON responses with sufficient resolution and allow all of the above computations to be measured. The model reproduces all major features of the responses for a wide range of stimuli including the courtship song, and shows that different adaptive and nonlinear computations produce an efficient representation of song. That is, a representation that, despite the complexity of the transformation from sound to response in JONs, is easy to read out and allows a reconstruction of the behaviorally relevant features of the song’s carrier and envelope with little ambiguity.

## Results

### A quadratic filter and a dynamic normalization stage reproduce CAP responses

To fit a model that captures the multiple computations in JONs (Fig. 1B), we recorded CAP (compound action potential, representing the summed activity of many JONs) responses for stimuli with diverse carrier and envelope dynamics (Fig. 1C): Gaussian noise band-pass filtered between 100 and 900 Hz (termed “noise” from here on) with constant intensity (Fig. 1D) or with step-wise modulations in amplitude (Fig. 1E); and the natural courtship song, which exhibits narrow-band carriers and strong amplitude modulations (Fig. 1F).

A common framework for describing the stimulus-response mapping of sensory neurons is the linear-nonlinear (LN) model (Schwartz et al., 2006; Sharpee, 2013) in which the stimulus is 1) linearly filtered to account for temporal integration properties and 2) transformed by a static nonlinearity to account for neuronal threshold or saturation (Fig. 1C, top). However, an LN model fails to reproduce the nonlinear CAP responses (Fig. 1D-F, top, 1G). A quadratic filter (QF, Fig. 1C, middle) reproduces the fine structure of responses to noise stimuli with constant intensity well but fails for stimuli with dynamic switches in intensity (Fig. 1D-G). For instance, the QF output lacks the response transients after intensity steps and the relative intensity invariance of the steady-state responses (Fig. 1E, F). To reproduce this adaptive gain, we added a divisive normalization (DN) stage after the QF (Fig. 1C, bottom). The DN stage was placed at the output of the QF because variance adaptation arises in the JONs after mean adaptation and frequency doubling (Clemens et al., 2018b; Lehnert et al., 2013). The DN stage low-pass filters the rectified QF output into an adaptation signal, which then divides the QF output to normalize the response. Adding the DN stage greatly improved model performance for stimuli with dynamic envelopes, in particular for the courtship song (Fig. 1E, F, H).

We term this two-stage model – consisting of a quadratic filter (QF) followed by a divisive normalization (DN) stage – QF-DN. The model reproduces responses to a wide range of stimuli enabling us to examine how both quadratic filtering and divisive normalization contribute to the representation of song features in JONs. We first focus on the properties of the quadratic filter.

### The quadratic filter is a frequency-dependent encoder of sound carrier and envelope

A linear filter corresponds to a set of weights *h*(*τ*) for the stimulus *s*(*t-τ*) at *τ* time steps in the past: *r*(*t*) = ∑_{τ} *h*(*τ*) *s*(*t-τ*) (Fig. 1C, top). A quadratic filter also constitutes a set of weights, *H*(*τ*_{1}, *τ*_{2}), but for the product of stimulus values at different delays *τ*_{1} and *τ*_{2}: *r*(*t*) = ∑_{τ1,τ2} *H*(*τ*_{1}, *τ*_{2}) *s*(*t-τ*_{1}) *s*(*t-τ*_{2}) (Fig. 2A). While a linear filter can be easily interpreted given its set of weights *h* – for instance, purely positive filters smooth the stimulus while filters with positive and negative lobes are band-pass filters – understanding the stimulus selectivity of a QF directly from its weight matrix *H* is challenging. We hence approximated the QF in terms of a bank of LN models with quadratic output nonlinearities, obtained through eigenvalue decomposition of *H* (Fig. 2B, S1) (Berkes and Wiskott, 2007; Lewis et al., 2002). The outputs of all LN models (Fig. 2C, D) in the filter bank are linearly combined to predict the CAP: *r*(*t*) = ∑_{i} *σ*_{i} [∑_{τ} *s*(*t-τ*) *v*_{i}(*τ*)]^{2}. The *σ*_{i} correspond to the eigenvalues of *H* (Fig. 2E) and provide weights for each LN model in the bank, while the filters *v*_{i} are given by *H*’s eigenvectors (Fig. 2C). Note that the full model includes additional bias and linear terms but their contribution to model performance is negligible and these terms were therefore omitted in the following analyses (Fig. S1D).

Overall, eigenvalue decomposition yields a compact description of the computation implemented in JONs, since four LN models – corresponding to the four highest-magnitude eigenvalues and vectors – are sufficient to reproduce the performance of the QF (Fig. S1B-D).
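The equivalence between the full quadratic filter and its eigenvector filter bank can be sketched numerically. All names, sizes, and values below are hypothetical stand-ins, not the fitted filters from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20                      # filter length (delays tau = 0..T-1), hypothetical
H = rng.normal(size=(T, T))
H = (H + H.T) / 2           # quadratic filter weight matrix must be symmetric

s = rng.normal(size=T)      # stimulus snippet: the most recent T samples

# Direct quadratic-filter response: r = sum_{tau1,tau2} H(tau1,tau2) s(tau1) s(tau2)
r_direct = s @ H @ s

# Bank-of-LN form: eigen-decompose H into weights sigma_i and filters v_i,
# then r = sum_i sigma_i * (sum_tau s(tau) v_i(tau))**2
sigma, V = np.linalg.eigh(H)        # columns of V are the eigenvectors v_i
r_bank = np.sum(sigma * (s @ V) ** 2)

assert np.isclose(r_direct, r_bank)

# A compact approximation keeps only the k highest-magnitude eigenvalues,
# analogous to the four dominant filters described in the text.
k = 4
top = np.argsort(np.abs(sigma))[-k:]
r_approx = np.sum(sigma[top] * (s @ V[:, top]) ** 2)
```

For the measured QF a few eigenvalues dominate, so the truncated sum is a good approximation; for the random `H` above it generally is not, which is why no accuracy is asserted for `r_approx`.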

The four filters in the filter bank representation are well approximated by Gabor wavelets (Fig. S2A-B). The individual Gabor filters have positive and negative lobes and therefore respond only weakly to static stimuli and best to fluctuating inputs (Fig. 2C, S2E). This filter shape accounts for mean adaptation in JONs, which suppresses responses to static deflections of the antenna. Intriguingly, the four LN units in the model form two filter pairs: an “excitatory” pair with low latency and a “suppressive” pair with higher latency (Fig. 2C-F). Both filter pairs are similar: The suppressive pair resembles a delayed and sign-inverted version of the excitatory pair (Fig. 2C). Within each pair, the filters are phase-shifted by 90° and form so-called quadrature pair filters (Fig. 2G). This principal filter structure is independent of the stimulus type used for fitting the model (Fig. S3A-D).
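A minimal sketch of why a quadrature pair acts as a phase-invariant energy detector (filter center frequency, bandwidth, and sampling rate are hypothetical; the fitted JON filters differ in detail):

```python
import numpy as np

fs = 10_000                                    # sampling rate in Hz (hypothetical)
t = np.arange(0, 0.01, 1 / fs)                 # 10 ms filter support
f0, sigma_t = 500.0, 0.002                     # center frequency, envelope width (hypothetical)
env = np.exp(-0.5 * ((t - t.mean()) / sigma_t) ** 2)

# Quadrature pair: two Gabor filters phase-shifted by 90 degrees
g_cos = env * np.cos(2 * np.pi * f0 * (t - t.mean()))
g_sin = env * np.sin(2 * np.pi * f0 * (t - t.mean()))

def energy(phase):
    """Summed squared outputs of the pair for a tone of arbitrary phase."""
    stim = np.cos(2 * np.pi * f0 * np.arange(0, 0.05, 1 / fs) + phase)
    a = np.convolve(stim, g_cos, mode="valid")
    b = np.convolve(stim, g_sin, mode="valid")
    return np.mean(a ** 2 + b ** 2)

# The energy readout is (nearly) independent of carrier phase:
e0, e1 = energy(0.0), energy(np.pi / 3)
assert abs(e0 - e1) / e0 < 0.01
```

If one filter of the pair responds much more strongly than the other, as in JONs below 400 Hz, the summed output is no longer phase-invariant and the pair instead phase-locks, which is the frequency-dependent behavior described in the text.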

Quadrature-pair filters are well known from auditory nerve fibers in vertebrates (Lewis et al., 2002), from complex cells in primary auditory and visual cortex of vertebrates (Rust et al., 2005; Tian et al., 2013) and from motion-sensitive cells in vertebrate visual cortex or in the *Drosophila* optic lobe (Borst and Helmstaedter, 2015). Canonical quadrature pair filters extract the stimulus envelope, since the responses of each filter in a pair combine to a phase-invariant and smooth output that is proportional to the stimulus energy (Lewis et al., 2002; Rajan et al., 2013). However, the notion of JONs as envelope detectors is at odds with the phase-locking of responses to sound (Fig. 1D-F). This is explained by an asymmetry in the frequency tuning of the filters in each pair (Fig. 2H): Quadrature pair filters only work as envelope detectors for carrier frequencies at which the two filters have similar response magnitude. In JONs, this is the case for frequencies >400 Hz (Fig. 2H, L) and for this high frequency range, each quadrature pair represents the stimulus envelope. Since the inhibitory pair is delayed relative to the excitatory pair (Fig. 2C, F), JONs only respond to transient increases in stimulus energy (Fig. 2L). The filter structure therefore explains aspects of variance adaptation, which induces transient responses, for high frequencies (Clemens et al., 2012; Rajan and Bialek, 2013; Slee et al., 2005). Below 400 Hz however, only one filter per quadrature pair responds strongly (Fig. 2H, K). In this low-frequency range, the QF effectively behaves like two linear filters with a quadratic nonlinearity: It phase-locks at twice the carrier frequency just like the CAP and responds in a sustained manner (Fig. 2I-K). 
The QF thus reveals two coding regimes present in JONs: for high frequencies (>400 Hz), JONs encode envelope transients, while for low frequencies (<400 Hz), JONs produce phase-locked, more sustained, and frequency-doubling responses (Tootoonian et al., 2012).

Overall, the properties of the quadratic filter contribute to multiple JON computations, including the frequency-doubling, mean adaptation, and variance adaptation at high frequencies. To examine to what extent the nonlinear filter is itself adaptive, we examined how the structure of the quadratic filter changes with intensity.

### Adaptive temporal filtering arises from adaptive antennal mechanics

To characterize adaptive processes that shape filter properties, we analyzed QFs fitted to noise stimuli for a range of intensities (1/16 to 4 mm/s, Fig. 3A, B, S4A, B). Eigenvalue decomposition of the different QFs reveals changes in the relative timing and weight (eigenvalues) of the Gabor filters but most of these changes are relatively small (Fig. 3C, S4A). By contrast, a narrowing of the negative lobe of all four Gabor filters leads to an increase in their cutoff frequencies by 200 to 500 Hz over the intensity range tested (Fig. 3D). This increase in the cutoff frequencies with intensity is consistent with predictions from optimal coding theory (Attneave, 1954; Barlow, 1961) and with observations from other systems (Nagel and Doupe, 2006; Zhaoping, 2006).

Where does this adaptive temporal filtering arise? Previous studies have shown that the frequency tuning of antennal responses to sound is intensity dependent and that this arises from active processes driven by mechanotransducer gating in JONs (Göpfert and Robert, 2003; 2002; Nadrowski and Göpfert, 2014; Riabinina et al., 2011). Measuring antennal movement in response to sound using laser Doppler vibrometry, we confirm that the antenna’s cutoff frequency increases with intensity (Fig. 3E, S4C). To test whether the adaptive antennal tuning fully explains the changes in QF structure, we decomposed the filter into “mechanical” and “neuronal” terms: We first passed the stimulus through the intensity-dependent antennal filter obtained through vibrometry (Fig. 3E) – the “mechanical” term – and used this pre-processed stimulus as an input for estimating a new quadratic filter (Fig. 3F), which now captures the remaining “neuronal” processes.

We call this new model “antQF”, short for antennal QF. Accounting for intensity-dependent antennal tuning in the model by prefiltering the stimulus drastically simplifies the structure of the resulting quadratic filter (Fig. 3F): The new quadratic filter only contains slow components associated with the timing of the excitatory and suppressive Gabor filter pairs from the original quadratic filter (Fig. 3B). Eigenvalue decomposition shows that the new quadratic filter is well approximated by only two eigenvectors (instead of four in the original model). The new eigenvectors lack the fast oscillations found in the original filter (Fig. 3G, compare Fig. 3D), demonstrating that these fast oscillations have been captured by the mechanical antennal filters in the antQF. Importantly, the shape and the frequency preference of the two eigenvectors in the antQF are intensity invariant. This demonstrates that adaptive temporal filtering in JONs arises from adaptive antennal mechanics driven by transducer gating, that these antennal mechanics are represented by the Gabor filters in the original QF model, and that the remaining properties of the QF model likely reflect intensity-invariant JON-intrinsic processes. Note that the antQF model is simply a decomposition of the QF model into an intensity-dependent and linear “mechanical” term followed by a quadratic “neuronal” term (Fig. 3F). It is identical to the original QF in terms of the computations it can perform and the outputs it produces.
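The antQF decomposition rests on a general identity: a linear prefilter followed by a quadratic stage is itself a quadratic filter on the raw stimulus, with weight matrix A^T H_n A, where A is the matrix that applies the linear filter. A numerical sketch, with all sizes and values hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 30, 8                          # stimulus-snippet and filter lengths (hypothetical)

# "Mechanical" stage: a linear antennal filter a, applied by convolution.
a = rng.normal(size=K)
A = np.zeros((T - K + 1, T))          # Toeplitz matrix: each row is a shifted copy of a
for i in range(T - K + 1):
    A[i, i:i + K] = a[::-1]

# "Neuronal" stage: a quadratic filter H_n acting on the prefiltered stimulus.
M = T - K + 1
H_n = rng.normal(size=(M, M))
H_n = (H_n + H_n.T) / 2

s = rng.normal(size=T)                # raw stimulus snippet

# Cascade response: prefilter, then quadratic stage
x = A @ s
r_cascade = x @ H_n @ x

# Equivalent single quadratic filter acting on the raw stimulus: H = A^T H_n A
H = A.T @ H_n @ A
r_single = s @ H @ s

assert np.isclose(r_cascade, r_single)
```

This is why the antQF model is "identical to the original QF in terms of the computations it can perform": the decomposition only factors H, it does not change the input-output mapping.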

### Divisive normalization reproduces the strength and dynamics of variance adaptation

While the QF explained some aspects of variance adaptation at high frequencies (Fig. 2L), it was not sufficient to reproduce the variance adaptation over the full frequency range (Fig. 2K), for instance for the courtship song (Fig. 1F, H). Our initial model selection (Fig. 1C-H) had revealed that a divisive normalization (DN) stage was required to fully capture the JONs’ adaptive gain (Carandini and Heeger, 2012). In the DN stage (Fig. 4A), the output of the quadratic filter is divided by an adaptation signal – a running estimate of the response gain – obtained by rectifying and low-pass filtering the filtered stimulus. Two parameters of the DN stage, the adaptation strength σ and the adaptation speed τ, were fitted to the data.

The adaptation strength σ varies across stimulus types, but all values are ≫1, consistent with the near-perfect intensity invariance of the CAP amplitude after adaptation (Fig. 4B, S5J-L). The adaptation time constant τ is ~15 ms for all stimulus types (Fig. 4C). The DN stage explains the transient CAP dynamics for stimuli with dynamical envelopes (Fig. 4D-I): After rapid changes in intensity, CAP transients arise because the adaptation signal lags the stimulus intensity through a delay introduced by the low-pass filter in the DN stage (Fig. 4E, H). Through this lagged gain compensation, the adaptation stage briefly overcompensates for fast decreases and briefly undercompensates for fast increases in intensity (Fig. 4F, I). Note that in the CAP, transients are slower for negative than for positive steps in intensity, and our adaptation model shows that these asymmetrical dynamics arise from the multiplicative nature of adaptation with a single time constant parameter (Fig. S5A-E).
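The lagged-gain mechanism can be sketched with a toy divisive-normalization loop. The parameterization below (dividing by 1 + σ·adapt) is one common form of divisive normalization and not necessarily the paper's exact equation; σ and the step stimulus are hypothetical, while τ ≈ 15 ms is taken from the text:

```python
import numpy as np

fs = 1_000                      # sample rate in Hz (hypothetical)
tau = 0.015                     # adaptation time constant ~15 ms (from the text)
sigma = 50.0                    # adaptation strength >> 1 (hypothetical value)
alpha = 1.0 / (tau * fs)        # per-sample update of the exponential low-pass

# Input: rectified "QF output" with a step increase, then a step decrease
x = np.ones(3000)
x[1000:2000] = 4.0              # 1 s at 4x intensity

y = np.empty_like(x)
adapt = x[0]                    # adaptation signal, initialized at steady state
for i, xi in enumerate(x):
    adapt += alpha * (abs(xi) - adapt)    # leaky integrator with time constant tau
    y[i] = xi / (1.0 + sigma * adapt)     # divisive normalization

# Steady-state output is nearly intensity invariant...
assert abs(y[900] - y[1900]) / y[900] < 0.05
# ...while intensity steps evoke brief over- and undershoot transients,
# because the adaptation signal lags the stimulus intensity:
assert y[1001] > 1.5 * y[900]   # onset transient after the upward step
assert y[2001] < 0.7 * y[900]   # suppression after the downward step
```

The asymmetry noted in the text falls out of the same loop: because the gain acts divisively, recovery from a high gain state after a downward step unfolds more slowly in the output than the onset transient after an upward step, even with a single time constant.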

Overall, the QF-DN model accounts for the adaptive and nonlinear JON responses using two computations – quadratic filtering and divisive normalization. Each of these computations comes at a cost: The quadratic filter that induces frequency doubling strongly distorts the representation of the stimulus fine structure (carrier); the adaptive filtering could introduce further ambiguity into the code for carrier because the strength with which particular frequencies are transmitted depends on intensity; and adaptation may be too slow or too fast to contribute to an efficient code for transient communication signals like the pulse song. We therefore explored whether and how each of these computations contribute to an efficient representation of communication signals.

### The quadratic nonlinearities produce a robust code of the song envelope

We first examined the contribution of the quadratic filter to producing a robust representation of the song’s envelope pattern. Canonical quadratic filters encode the sound envelope (Rajan et al., 2013). However, as shown above, the JON filter does not act as a canonical envelope encoder for the low frequencies typically found in song (Fig. 2H-L). Rather, for the song, the QF acts more like a linear filter with a quadratic nonlinearity and produces frequency doubling responses – each cycle in the song’s carrier induces two cycles in the response (Fig. 1B, 2K). We reasoned that the frequency doubling could improve the representation of the song envelope, since it doubles the temporal resolution of the representation of the envelope and may improve coding of higher envelope frequencies. To test this hypothesis, we set up two simple encoders: one that squares the stimulus just as in JONs and one that simply thresholds the song waveform (Fig. 5A). We then asked how well an optimal linear decoder could reconstruct the song envelope from the responses of these two encoders under varying levels of response noise. To ensure a fair comparison, we normalized responses of the two encoders to have the same average energy before adding noise. Consistent with our intuition, the quadratic encoder transmits more information about the song envelope than the threshold linear encoder (Fig. 5B, C). Thus, the frequency doubling in the CAP improves the representation of the song envelope.
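A noise-free toy version of this comparison (using a fixed moving-average decoder instead of the optimal linear decoder fitted in the paper; all parameters hypothetical) illustrates the two ingredients: frequency doubling samples the envelope twice per carrier cycle, and a simple linear readout recovers the (squared) envelope from the quadratic response:

```python
import numpy as np

fs = 10_000                                   # sample rate in Hz (hypothetical)
t = np.arange(0, 0.2, 1 / fs)
f0 = 200.0                                    # song-like carrier frequency (hypothetical)
env = 1.0 + 0.5 * np.sin(2 * np.pi * 10 * t)  # slow, always-positive envelope (hypothetical)
s = env * np.sin(2 * np.pi * f0 * t)

r_quad = s ** 2                               # quadratic encoder, as in the CAP
r_thr = np.maximum(s, 0.0)                    # threshold-linear encoder

def count_peaks(x):
    """Number of strict local maxima."""
    return int(np.sum((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:])))

# Squaring responds at both carrier extremes and therefore samples the
# envelope twice per carrier cycle (the frequency-doubling advantage):
assert count_peaks(r_quad) >= 1.9 * count_peaks(r_thr)

# A simple linear decoder (one-carrier-cycle moving average) recovers the
# squared envelope from the quadratic response:
win = int(fs / f0)
dec = np.convolve(r_quad, np.ones(win) / win, mode="same")
c = np.corrcoef(dec[win:-win], (env ** 2 / 2)[win:-win])[0, 1]
assert c > 0.95
```

The paper's analysis additionally adds response noise and equalizes encoder energy before decoding; this sketch only demonstrates the resolution argument that motivates the result.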

### A simple and unambiguous code for communication signals despite an adaptive and quadratic filter

The advantage of a quadratic code for coding the envelope comes at a potential cost when reading out the carrier structure of the stimulus. Generally, a linear system only produces responses with power at frequencies present in the input. For instance, linearly filtering a sinusoid produces another sinusoid of the same frequency with a different gain and phase. This simple mapping is fully reversible even for spectrally complex stimuli. By contrast, a quadratic filter produces power at frequencies that are pairwise combinations of the stimulus frequencies. This includes the frequency doubling seen in the CAP, but also sums and differences of the different input frequencies. For spectrally complex stimuli, this complicates stimulus reconstruction since it is unclear which response frequencies existed originally in the stimulus and which are the product of the quadratic filter. However, fly communication signals are relatively narrow-band over the duration of the quadratic filter and their dominant frequencies are in the range of phase-locking in JONs (Fig. 5D, F, compare Fig. 2H-K). In this regime, the quadratic filter’s main effect is that of frequency doubling – the dominant frequency in the response is twice the stimulus carrier frequency. This allows a simple readout of stimulus frequency for the continuous sine song and for the transient pulse song based on half the dominant frequency of the CAP (Tootoonian et al., 2012) (Fig. 5E, G, S6).
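This readout can be sketched directly: square a sine-song-like tone and take half the dominant frequency of the response spectrum. Sampling rate and duration are hypothetical; the 150 Hz carrier matches the sine-song frequency given in the text:

```python
import numpy as np

fs = 10_000                         # sample rate in Hz (hypothetical)
t = np.arange(0, 0.5, 1 / fs)
f_song = 150.0                      # sine-song carrier frequency (from the text)
s = np.sin(2 * np.pi * f_song * t)

r = s ** 2                          # quadratic (frequency-doubling) response

# Dominant non-DC frequency of the response:
spec = np.abs(np.fft.rfft(r - r.mean()))
freqs = np.fft.rfftfreq(r.size, 1 / fs)
f_dom = freqs[np.argmax(spec)]

# The carrier is recovered as half the dominant response frequency:
assert np.isclose(f_dom, 2 * f_song)
f_recovered = f_dom / 2
```

For a narrow-band pulse, the same readout applies to the pulse's dominant frequency; for broadband stimuli the sum and difference terms make this inversion ambiguous, which is the point of the preceding paragraph.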

### Intensity adaptation for song enables robust pattern recognition

Having shown that the nonlinear adaptive code improves envelope coding while still allowing for a relatively simple representation of the sound carrier for the spectrally simple communication signals of *Drosophila*, we next examined whether the timescale and strength of adaptation are sufficient to enable robust recognition of song pattern across intensities. This is relevant since sound intensity at the female changes drastically during the dynamical courtship because of the angle and distance dependence of the arista (Coen et al., 2016; Morley et al., 2012; 2018). To do this, we employed a classifier which uses the Euclidean distance between responses to classify the identity of short stimulus patterns across intensities (Clemens et al., 2011). We first assessed the ability of the classifier to identify 100 short noise patterns, each presented at 8 different intensities (Fig. 6A). In addition to classifying CAP responses to these stimuli, we also classified model responses, since this allowed us to directly demonstrate the contribution of adaptation by removing the divisive normalization (DN) stage (Fig. 6B, C). Responses of the CAP and the model with DN stage cluster poorly by intensity (Fig. 6D) but very well by pattern identity (Fig. 6E). By contrast, the responses from a model missing the DN stage represent intensity, not pattern identity. Accordingly, the information about stimulus pattern retrieved by the classifier is very high for both the CAP and model with DN stage, reaching ~85% of the maximal information (Fig. 6F). For the model without DN stage, pattern information is strongly reduced. Thus, intensity adaptation enables intensity-invariant classification for noise patterns. However, variance adaptation is incomplete for courtship song (Clemens et al., 2018b), since intensity fluctuations in song can be faster than the adaptation, calling into question whether adaptation can contribute to the coding of song. 
We therefore also classified stimulus identity for short song patterns of different intensities. We found that the model with DN stage reached near-perfect song pattern identification, while the model without adaptation yielded much lower information values (Fig. 6G). Combined, these results show that adaptation in auditory receptor neurons supports robust, intensity-invariant song-pattern recognition in *Drosophila.*
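A toy version of the distance-based argument (patterns, intensities, and the normalization rule are all hypothetical stand-ins for the fitted QF-DN model) shows why divisive normalization makes responses cluster by pattern rather than by intensity:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200                                  # response length in samples (hypothetical)
p1, p2 = rng.normal(size=T), rng.normal(size=T)   # two stimulus patterns

def respond(pat, intensity, normalize):
    r = (intensity * pat) ** 2           # toy quadratic response
    if normalize:
        r = r / r.mean()                 # stand-in for divisive normalization
    return r

def dist(x, y):
    return np.linalg.norm(x - y)         # Euclidean distance, as in the classifier

# With normalization, responses to the same pattern at different intensities
# coincide, so responses cluster by pattern identity:
d_same_norm = dist(respond(p1, 1.0, True), respond(p1, 8.0, True))
d_diff_norm = dist(respond(p1, 1.0, True), respond(p2, 1.0, True))
assert d_same_norm < 1e-9 < d_diff_norm

# Without normalization, an intensity change moves a response much further
# than a change of pattern, so responses cluster by intensity instead:
d_same_plain = dist(respond(p1, 1.0, False), respond(p1, 8.0, False))
d_diff_plain = dist(respond(p1, 1.0, False), respond(p2, 1.0, False))
assert d_same_plain > d_diff_plain
```

The toy normalization here is perfect and instantaneous; the fitted DN stage has a finite time constant, which is why adaptation in the paper is incomplete for the fast intensity fluctuations of song.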

## Discussion

Here, we examined how the different computations performed in the JONs affect the representation of song in *Drosophila*. Based on electrophysiological recordings, we built a computational model that describes the nonlinear and dynamical mapping from sound to the extracellularly recorded population response, the CAP. The model consists of a quadratic filter followed by a divisive normalization stage and fully reproduces CAPs for artificial and natural stimuli (Fig. 1). The quadratic filter acts in a frequency-dependent manner: For high frequencies (>400 Hz), the filter encodes envelope transients, while at the lower frequencies found in the fly’s courtship song, it encodes the envelope but also phase locks to the carrier (Fig. 2). The filter’s preferred frequency increases with intensity and by incorporating direct readouts of antennal movement, we demonstrated that this adaptive frequency filtering arises from antennal mechanics (Fig. 3). Divisive normalization produces the variance (=intensity) adaptation and reveals that response transients after fast changes in sound intensity arise from the adaptation signal lagging behind the stimulus (Fig. 4). Nonlinear and dynamical computations like adaptive temporal filtering and variance adaptation can introduce ambiguities about the carrier and the envelope of sounds, which may reduce coding performance. However, we find that these computations improve the robustness to noise (Fig. 5) and to fluctuations in intensity (Fig. 6) while avoiding strong ambiguities for signals with the statistics of song.

### Adaptive and quadratic temporal filtering in JONs

The computation performed by the intensity-dependent quadratic filter at the heart of our JO model can be decomposed into three stages (Fig. 2B, 3F): adaptive temporal filters, quadratic nonlinearities, and combination of excitatory and suppressive quadrature pairs. First, the adaptive Gabor filters determine the intensity-dependent frequency preference of JONs (Fig. 2, 3). By incorporating measures of antennal movement into the model, we demonstrated that the mechanical filtering in the antenna fully explains the adaptive frequency filtering found in the CAP (Göpfert and Robert, 2003; 2002).

Second, a quadratic nonlinearity processes the filtered stimulus and reproduces the frequency doubling responses (Fig. 2K). Since the CAP reflects the bulk spiking activity of all auditory JONs (Clemens et al., 2018b; Lehnert et al., 2013; Łęski et al., 2013), the origin of the frequency doubling and of the quadratic nonlinearity is still unclear. The frequency doubling could arise in individual JONs that are activated twice during the period of the sound carrier – for instance at negative and positive peaks of a sinusoid. Alternatively, responses of JONs with different phase preferences could combine in the CAP to produce the frequency doubling (Kamikouchi et al., 2009). Existing anatomical and physiological evidence favors a single-neuron origin of quadratic filtering (Lehnert et al., 2013; Pézier and Blagburn, 2013): Responses from JON subsets from only one side of the JO exhibit frequency doubling (Pézier and Blagburn, 2013), and adaptation to positive steps carries over to reduce responses to negative steps, suggesting that both response components arise in the same neurons (Lehnert et al., 2013). Note that our analyses do not rule out the existence of a subpopulation of JONs that encodes the song’s carrier via linear phase-locking responses. Only comprehensive single-neuron recordings from the population of JONs can resolve this issue.

In the last stage of the QF, the Gabor filters are combined to produce the filter output (Fig. 2C). The Gabor filters form quadrature pairs which are mainly known from higher-order sensory neurons in insects and vertebrates (Borst and Helmstaedter, 2015; Rust et al., 2005; Tian et al., 2013). Interestingly, quadratic filters with the early excitatory and delayed suppressive quadrature pairs found in JONs are also known from the auditory nerve fibers of mammals (Lewis et al., 2002). Typically, quadrature pair filters report stimulus energy at specific frequency bands independent of phase (Rajan et al., 2013). However, to act as an energy detector, the responses of each filter in a quadrature pair must have similar magnitude. In JONs, this is only the case at high frequencies (>400 Hz, Fig. 2H, L). For low frequencies, only one of the filters in a pair responds strongly and the quadratic filter produces phase-locked responses (Fig. 2K). The resulting frequency-dependent code preserves information about fine structure for low carrier frequencies through phase locking while maintaining responsiveness to transients in sound energy for high carrier frequencies. Note that this split in coding schemes by frequency could stem from the CAP largely reflecting synchronous activity across the population of JONs (Łęski et al., 2013). That is, the lack of a sustained response in the CAP for higher stimulus frequencies may stem from a breakdown of phase-locking, not a reduction in firing rates. However, this is unlikely, since the frequency split is also evident from mechanical responses of the antenna, which are not sensitive to neuronal phase locking but which show strongly reduced sensitivity for frequencies >400 Hz, just like the CAP. The combination of the early excitatory and the delayed suppressive quadrature pair (Fig. 2C-F) produces transient responses for high frequencies in JONs (Fig. 2H-L).
In olfactory receptor neurons, a similar delayed-suppressive operation – modeled with a biphasic filter – describes the transformation from transduction current to spiking (Nagel and Wilson, 2011). Whether or not this component of the model maps to the spike generation remains to be tested.

### Organization of mean and variance adaptation in JONs

JONs implement two forms of adaptation – to the mean and to the variance of the physical stimulus. Previously, an abstract model backed by experimental data suggested that mean and variance adaptation arise serially in JONs, with mean adaptation occurring before variance adaptation (Clemens et al., 2018b). The more detailed model in the present study confirms this result. Mean adaptation arises in the first stage of the model: The Gabor filters (Fig. 2C) respond only to fluctuating stimuli and do not transmit static or very slow input components corresponding to the stimulus mean (Fig. S2E). The Gabor filters represent active antennal mechanics driven by the gating of mechanotransduction channels (Fig. 3E-G), consistent with previous results that show that mean adaptation is visible at the level of antennal mechanics (Albert et al., 2007). The other form of adaptation found in JONs – to stimulus variance – is produced in two stages in the model: First, the delayed-suppressive operation that combines the two quadrature pairs in the quadratic filter suppresses sustained responses to high frequencies (>400 Hz). Second, the divisive normalization in the last stage of the model corrects for sound intensity at low frequencies to produce variance adaptation over the full frequency spectrum (see Fig. 4A, E, H). This is consistent with experimental results that show that variance adaptation arises first in the subthreshold transduction currents (corresponding to the first operation in the quadratic filter) and is completed after spike generation (corresponding to the divisive normalization stage) (Clemens et al., 2018b). The model thus confirms existing experimental and modelling results on the implementation of mean and variance adaptation in JONs. In addition, the model constitutes a valuable computational tool for analyzing changes in mean and variance adaptation in genetic mutants to identify the biophysical and anatomical origins of adaptation.

### JONs produce an efficient and robust representation of song

Our model demonstrates how different computations – mean adaptation, adaptive temporal filtering, a quadratic nonlinearity, and variance adaptation – contribute to an efficient and robust representation of courtship song features – in particular, the carrier and the envelope (Deutsch et al., 2019) (Fig. 5, 6). First, adaptive temporal filtering (Fig. 3E-H) is consistent with efficient coding principles (Atick, 1992; Attneave, 1954; Barlow, 1961): For weak inputs, the preference for lower frequencies favors integration and sensitivity. For strong inputs, JONs prefer higher frequencies which leads to more differentiation and selectivity. Second, the quadratic nonlinearities produce a representation of the song’s envelope that is robust to noise (Fig. 5A-C), because it pools information from positive and negative stimulus components. Third, mean adaptation (Albert et al., 2007; Lehnert et al., 2013) uncouples the sensitivity of JONs to sound from the baseline position of the antenna (Fig. S2E) – it renders the code for song robust to slow antennal movement from wind or gravity (Clemens et al., 2018b). Finally, variance adaptation produces a representation of the song that is robust to the fluctuations in intensity arising from the dynamical interaction between male and female during courtship (Fig. 6). Adaptation acts as a high-pass filter whose cutoff frequency is set by the adaptation time constant (Benda and Herz, 2003). In JONs, adaptation is fast (Fig. 4), but not too fast – it is sufficiently fast to compensate for the slower (<20Hz) intensity fluctuations that arise from the constant changes in position of the singing male relative to the female during courtship (Coen et al., 2014). But it preserves the behaviorally relevant intensity fluctuations associated with the periodical envelope of the fly pulse song (>20Hz, (Deutsch et al., 2019)).

### The code for behaviorally relevant classes of sounds is simple despite a complex encoding scheme

The sequence of highly nonlinear and dynamical computations in JONs results in a complex mapping from sound to JON response. A faithful reconstruction of carrier and envelope from this representation is impossible for arbitrary sounds, because quadratic coding and adaptation combine to produce an ambiguous and stimulus-dependent code. However, sensory neurons like JONs do not serve as general and faithful encoders of stimuli but rather extract specific features from behaviorally relevant classes of sensory signals. Hearing in flies is known to be used for acoustic threat detection and for acoustic communication (Kamikouchi et al., 2009; Lehnert et al., 2013). We find that for these signal classes, the relevant stimulus features can be extracted from JONs using relatively simple computations despite a complex encoding scheme.

In the context of acoustic threat detection, sudden increases in sound energy trigger startle responses (Lehnert et al., 2013). JONs accentuate increases in sound energy at frequencies >400 Hz through the quadratic filter (Fig. 2L) and at lower frequencies through variance adaptation (Fig. 4). Information relevant for acoustic threat detection can therefore be read out directly from the amplitude of JON responses by postsynaptic neurons like the giant fiber neuron to trigger startle responses (Pézier and Blagburn, 2013). When evaluating the courtship song, flies are sensitive to a wide range of features of the envelope and the carrier (Batchelor and Wilson, 2019; Deutsch et al., 2019). The envelope pattern can be obtained by low-pass filtering the JON responses (Fig. 5A-C) and this operation is likely performed in neurons directly postsynaptic to the JONs (Clemens et al., 2015; Yamada et al., 2018). For instance, synaptic currents from JONs into AMMC-A2 in the fly brain are low-pass filtered by the synapse and the postsynaptic membrane to explicitly encode the envelope (Azevedo and Wilson, 2017). This sequence of transformations resembles the root mean square algorithm in which a signal is first squared and then low-pass filtered to extract the envelope. The code for carrier is highly ambiguous for arbitrary sounds due to the adaptive temporal filtering and the quadratic nonlinearities (Fig. 2, 3). However, for the narrow-band and low frequency (100-400 Hz) courtship song, the code for carrier reduces to amplitude scaling from adaptive filtering (Fig. 2H, 3D) followed by frequency doubling from the quadratic nonlinearity (Fig. 2K). From this relatively simple temporal code, the carrier frequency of courtship song can be decoded as half of the dominant frequency in the JON response (Fig. 5D-G, (Tootoonian et al., 2012)).

### Conclusion

Overall, our study highlights the importance of examining sensory systems in the context of behaviorally relevant signals. JONs produce a highly dynamical and nonlinear code for sound. This code prevents a faithful reconstruction of general classes of sounds in the fly brain. However, JONs function to represent specific features from particular classes of sounds. For these behaviorally relevant sounds, the relevant features are simple to decode. This match between the behaviorally-relevant signals and the neural code is assumed to be a general feature of neural codes (Ryan and Cummings, 2013; Wehner, 1987). Here we show that shaping a code towards the relevant signals allows animals to avoid the costs of strong nonlinearities in the form of ambiguity and complex decoding while benefitting from improved noise robustness and intensity invariance.

## Methods

### Flies

Virgin females of the *Drosophila melanogaster* wildtype strain CantonS Tully were used for all recordings. Flies were sexed and housed in groups of ~10 at 25°C and a 12:12 dark-light cycle. All experiments were performed 2-5 days post eclosion.

### Stimulus design and presentation

Stimuli were generated at a sampling frequency of 10 kHz. Band-limited Gaussian noise (from now on termed “noise”) was produced from a sequence of normally distributed random values by band-pass filtering using a linear-phase, finite impulse response filter with a pass band between 80 Hz and 1000 Hz. The effect of intensity on JON responses was estimated using 5 independent noise patterns that lasted 5 seconds and were presented at intensities of 1/16, 1/8, 1/4, 1/2, 1, 1.5, 2, and 4 mm/s (“noise” stimuli in Fig. 1). For probing adaptation, we switched the intensity of the noise every 100 ms in a sequence that contained all transitions between 1/4, 1/2, 1, and 2 mm/s (“step” stimuli in Fig. 1). The effect of adaptation on intensity tuning was assessed using a noise stimulus at intensities of 1/4, 1/2, 1, 2, and 4 mm/s (adaptation background) whose intensity was switched every 120 ms for 20 ms to a probe intensity of 1/16, 1/8, 1/4, 1/2, 1, 2, 4, or 8 mm/s (“background” stimuli in Figs. 1G, H). To minimize artifacts from abrupt changes in sound intensity, each intensity switch in the step and background stimuli had a duration of 1 ms during which the intensity was linearly interpolated to the new value. We also assessed responses to natural courtship song, recorded from a *Drosophila melanogaster* male courting a virgin female (Arthur et al., 2013; Coen et al., 2014) (“song” stimuli).
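The noise generation above can be sketched as follows; this is a minimal illustration, and the filter order and random seed are our own choices, not specified in the text:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def bandlimited_noise(duration_s, fs=10_000, f_lo=80.0, f_hi=1000.0,
                      intensity=1.0, n_taps=513, seed=0):
    """Gaussian noise, band-pass filtered with a linear-phase FIR filter
    and scaled so that its standard deviation equals `intensity` (mm/s)."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(int(duration_s * fs))
    # symmetric FIR taps -> linear phase, pass band 80-1000 Hz
    taps = firwin(n_taps, [f_lo, f_hi], pass_zero=False, fs=fs)
    noise = lfilter(taps, 1.0, white)
    return intensity * noise / noise.std()

stim = bandlimited_noise(5.0, intensity=0.5)  # one 5-s "noise" pattern
```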

#### Sound

The sound delivery system consisted of i) the analog output of a DAQ card (PCI-5251, National Instruments), ii) a 2-channel amplifier (Crown D-75A), iii) a headphone speaker (KOSS, 16 Ohm impedance; sensitivity, 112 dB SPL/1 mW), and iv) a coupling tube (12 cm, diameter: 1 mm).

The stimulus presentation setup was calibrated as in Clemens et al. (2015). Briefly, the amplitude of pure tones of all frequencies used (100-1000Hz) was calibrated using a frequency-specific attenuation value measured using a calibrated pressure gradient microphone (NR23159, Knowles Electronics Inc., Itasca, IL, USA). To ensure that the temporal pattern of the noise stimuli was reproduced faithfully, we corrected the presented noise patterns by the inverse of the system’s transfer function, measured using a pressure microphone (4190-L-001, Brüel & Kjaer) that was placed at the position of the fly head during recordings.

### Electrophysiology

Extracellular recordings were performed using glass electrodes (1.5ID/2.12OD, WPI) pulled with a micropipette puller (Model P-1000, Sutter Instruments). The fly’s wings and legs were removed under cold anesthesia and the abdomen was subsequently fixed using low-temperature melting wax. The head was fixed by extending and waxing the proboscis’ tip. The preparation was further stabilized by applying wax or small drops of UV-curable glue to the neck and the proboscis. The recording electrode was placed in the joint between the second and third antennal segment and the reference electrode was placed in the eye. Both electrodes were filled with external saline (Murthy and Turner, 2013). The sound delivery tube was placed orthogonal to the arista on the side from which we recorded JON activity, at a distance of 2 mm. The recorded signal was amplified and band-pass filtered between 5 and 5000 Hz to reduce high frequency noise and slow baseline fluctuations induced by spontaneous movement of the antenna (Model 440 Instrumentation Amplifier, Brownlee Precision). We ensured that the band-pass filter did not distort the recorded signal, e.g. it did not introduce artefactual response transients. We subsequently digitized the recorded signal at 10 kHz with the same DAQ card used for stimulus presentation (PCI-5251, National Instruments).

### Laser-Doppler vibrometry

Arista movement was measured using laser-Doppler vibrometry (Polytec OFV534 laser unit, OFV-5000 vibrometer controller, Physik Instrumente, low-pass 5 kHz).

### Data analysis

#### Pre-processing

The instantaneous amplitude of the CAP was estimated from the envelope of the recorded signal as the magnitude of the Hilbert transform (Fig. S5A). Tuning curves from the “background” stimuli were obtained by averaging the CAP amplitude over the first 8 ms of the probe intensities (see stimulus description above).
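The amplitude estimation amounts to taking the magnitude of the analytic signal; a minimal sketch (the test tone is illustrative):

```python
import numpy as np
from scipy.signal import hilbert

def cap_amplitude(cap):
    """Instantaneous amplitude as the magnitude of the analytic signal
    obtained via the Hilbert transform."""
    return np.abs(hilbert(cap))

# sanity check: a pure tone has a flat envelope equal to its amplitude
t = np.arange(10_000) / 10_000          # 1 s at 10 kHz
env = cap_amplitude(0.5 * np.sin(2 * np.pi * 200 * t))
```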

#### Adaptation time scale and strength

The adaptation time scale was estimated from the CAP amplitude traces by fitting an exponential function *r*(*t*) = *r _{0}* + *r _{max}* exp(−*t*/*τ*) to the falling/rising phases of positive/negative transients after a change in intensity from the “step” stimuli (Fig. S5C).
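A minimal sketch of such an exponential fit using `scipy.optimize.curve_fit`; the initial guesses are our own heuristic, not part of the published procedure:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_adaptation_tau(t, amp):
    """Fit r(t) = r0 + rmax * exp(-t / tau) to one CAP-amplitude transient."""
    def model(t, r0, rmax, tau):
        return r0 + rmax * np.exp(-t / tau)
    # heuristic initial guesses (our own choice): offset, amplitude, time scale
    p0 = (amp[-1], amp[0] - amp[-1], (t[-1] - t[0]) / 5)
    (r0, rmax, tau), _ = curve_fit(model, t, amp, p0=p0)
    return r0, rmax, tau

# synthetic transient with tau = 20 ms
t = np.linspace(0.0, 0.1, 1000)
r0, rmax, tau = fit_adaptation_tau(t, 1.0 + 2.0 * np.exp(-t / 0.02))
```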

### Modelling

#### Model Structure

To reproduce the transfer function from stimulus waveform to CAP waveform, we use the discretized Volterra series, which decomposes the transfer function into a constant term *h _{o}*, a linear term *h _{1}*, a quadratic term *H _{2}*, and higher-order terms *ε* which we do not consider here:

*y*(*t*) = *h _{o}* + ∑_{τ} *h _{1}*(*τ*) *s*(*t* − *τ*) + ∑_{τ1, τ2} *H _{2}*(*τ _{1}*, *τ _{2}*) *s*(*t* − *τ _{1}*) *s*(*t* − *τ _{2}*) + *ε*

*s*(*t*) and *y*(*t*) are the stimulus and the CAP response, respectively. *h _{o}* describes a constant offset or bias. *h _{1}* is a linear filter and describes how stimulus values *τ* time steps into the past are weighted in the response. *H _{2}* is a quadratic filter and describes how the product of stimulus values at two different time steps in the past, *τ _{1}* and *τ _{2}*, are weighted in the response. The linear models in this paper only include the *h _{o}* and the *h _{1}* terms. The quadratic filter model (QF) consists of all terms up to and including *H _{2}*. The temporal support of the filters *h _{1}* and *H _{2}*, *τ _{max}*, describes the memory of the system and was set to 10 ms. This duration saturates performance – in initial tests, longer *τ _{max}* did not improve model performance. This is confirmed by the lack of filter structure in *h _{1}* and *H _{2}* for *τ* > 8 ms (Fig. 2A).
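The series can be evaluated directly, if inefficiently, as in the following sketch (filter values in the test are placeholders, not fitted filters):

```python
import numpy as np

def qf_response(stim, h0, h1, H2):
    """Evaluate y(t) = h0 + sum_tau h1(tau) s(t-tau)
                     + sum_{tau1,tau2} H2(tau1,tau2) s(t-tau1) s(t-tau2)
    for every time point; filters span tau = 0..n_tau-1 samples."""
    n_tau = len(h1)
    y = np.full(len(stim), h0, dtype=float)
    for t in range(n_tau - 1, len(stim)):
        past = stim[t - n_tau + 1:t + 1][::-1]   # [s(t), s(t-1), ...]
        y[t] += h1 @ past + past @ H2 @ past
    return y
```

With `H2` zero everywhere except `H2[0, 0]` and `h1` a delta at lag 0, the model reduces to `h0 + s(t) + s(t)**2`, which makes the structure of the three terms easy to verify.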

#### Model fitting

The individual terms of the model – *h _{o}*, *h _{1}* and *H _{2}* – were estimated using linear regression: *y*’(*t*) = *σ*(*t*) *w*, where *y*’ is the predicted response and *w* is a concatenation of the model coefficients: [*h _{o}*, *h _{1}*, *h _{2}*]. Exploiting the symmetry of the quadratic filter, *H _{2}*(*τ _{1}*, *τ _{2}*) = *H _{2}*(*τ _{2}*, *τ _{1}*), *h _{2}* is a vector containing all upper triangular values of *H _{2}* including the diagonal (*τ _{1}* ≥ *τ _{2}*). Equivalently, *σ*(*t*) is a concatenation of the inputs for each of the model terms: [*s _{0}*, *s _{1}*, *s _{2}*]. For the bias *h _{o}*, the input *s _{0}* = 1. For the linear filter *h _{1}*, the input *s _{1}*(*τ*) corresponds to the stimulus in the 10 ms preceding the response time *t*: *s _{1}*(*τ*) = *s*(*t* − *τ*) for all *τ _{max}* ≥ *τ* ≥ 0. To reduce the number of filter coefficients to be estimated, we projected *s _{1}* onto a basis composed of 50 Gaussian bumps with a standard deviation of 0.2 ms (2 samples) and a spacing of 0.2 ms (2 samples), with the first bump at 0 ms and the last bump at 10 ms. For the input values to *H _{2}*, we take the outer product of the stimulus values in the 10 ms (100 samples) preceding the stimulus to be predicted: *S _{2}*(*τ _{1}*, *τ _{2}*) = *s*(*t* − *τ _{1}*) *s*(*t* − *τ _{2}*). To reduce the number of filter coefficients, we projected each matrix onto a basis composed of two-dimensional Gaussian bumps, each bump with a standard deviation of 0.3 ms (3 samples) and a spacing of 0.2 ms (2 samples) (cf. Rajan et al. (2013)). We exploit the symmetry of *S _{2}* and keep only the unique values of *S _{2}* for which *τ _{1}* ≥ *τ _{2}*, flattened into a vector *s _{2}*. The projection onto Gaussian bump bases and the exploitation of symmetry reduces the number of free parameters from 1 + 100 + 100×100 = 10101 to 1 + 50 + (50^{2} + 50)/2 = 1326.

All filter coefficients were then fitted using ridge regression. Ridge regression minimizes the mean-squared error between the data and the predicted responses plus the norm of the filter coefficients: ∑_{t} (*y* − *y*’)^{2} + *α* ∑_{i} *w _{i}*^{2} (Tikhonov and Arsenin, 1977). The first term increases the match between the data and the model; the second term penalizes filter weights that do not contribute to improving this match. *α* controls the influence of the penalty term and was chosen using methods and code from Park and Pillow (2011). Since song is relatively sparse, we only used samples that were within ±100 ms of song for model fitting and evaluation. For visualization and analysis of the filters, *h _{1}* and *H _{2}* were projected from the basis of Gaussian bumps back into the temporal domain.

For models that only contain the first two terms (*h _{o}*, *h _{1}*), so-called linear-nonlinear models, we estimate an output nonlinearity based on methods described in Schwartz et al. (2006). For the quadratic models, the nonlinearity was approximately linear and did not improve performance. We therefore omitted that step to simplify model interpretation.
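The regression setup can be sketched as follows; for brevity this omits the Gaussian-bump basis projection, and the variable names are our own:

```python
import numpy as np

def design_matrix(stim, n_tau):
    """Concatenate [1, s1, s2] per time point: bias, the n_tau past stimulus
    values, and the unique upper-triangular outer-product terms (tau1 >= tau2),
    giving 1 + n_tau + n_tau*(n_tau+1)/2 columns."""
    iu = np.triu_indices(n_tau)                 # unique quadratic terms
    rows = []
    for t in range(n_tau - 1, len(stim)):
        s1 = stim[t - n_tau + 1:t + 1][::-1]    # s(t-tau), tau = 0..n_tau-1
        s2 = np.outer(s1, s1)[iu]
        rows.append(np.concatenate(([1.0], s1, s2)))
    return np.asarray(rows)

def ridge_fit(X, y, alpha):
    """w = argmin ||y - Xw||^2 + alpha ||w||^2, in closed form."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)
```

With a vanishing penalty and noiseless data generated from known weights, `ridge_fit` recovers those weights, which is a quick correctness check for the design matrix.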

#### Fitting the antennal filters and the antQF model

The transfer function from stimulus to antennal movement for each intensity was fitted as a discretized Volterra series with terms *h _{o}, h_{1}*. This model typically explained more than 90% of the variance in antennal responses to noise (Fig. S4C). To account for intensity-dependent antennal filters in the quadratic model, the intensity-specific antennal filters obtained from one animal were used to pre-filter the stimulus before fitting the quadratic model. This was necessary, because simultaneous measurement of antennal movement and CAP responses was not possible in our rig. However, antennal filters are virtually identical across animals (pairwise Pearson correlation of the filters r=0.97±0.03 across N=5 flies; filters from different intensities were concatenated before calculating the correlations).

#### Fitting QF-DN

The divisive normalization (DN) stage has the form *x*’ = *x*/(*σ* + *γ*), where *γ* is the gain control signal that divides the input *x* and is given by *γ*(*t*) = ∫ *dv* ||*x*(*t − v*)|| *e*^{−v/τ}. *τ* is the adaptation time constant. The input to the DN stage, *x*, is the output of the quadratic filter. The QF-DN model was fitted using an iterative procedure: We initialized the filter by fitting a QF model without the DN stage to the data using ridge regression. Then, we optimized the parameters of the DN stage, *σ* and *τ*, by minimizing the mean squared error between the CAP and the model prediction using the Matlab function “fmincon”, holding the filter coefficients constant. Lastly, we optimized the filter coefficients, holding the DN parameters constant. The model parameters typically converged after 1-2 cycles of fitting the parameters of the DN stage and the filter. During the fitting, only the magnitude, but not the structure, of the filter changed from its initialization.
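A minimal sketch of the DN stage; normalizing the exponential kernel to unit gain is our assumption, since the text only specifies the kernel's shape:

```python
import numpy as np

def divisive_normalization(x, tau_s, sigma, fs=10_000):
    """x'(t) = x(t) / (sigma + gamma(t)), where gamma is |x| filtered with a
    causal exponential kernel exp(-v / tau_s)."""
    v = np.arange(int(5 * tau_s * fs)) / fs   # 5 time constants of support
    kernel = np.exp(-v / tau_s)
    kernel /= kernel.sum()                    # unit-gain kernel (our assumption)
    gamma = np.convolve(np.abs(x), kernel)[:len(x)]
    return x / (sigma + gamma)
```

For a constant input *c*, the gain signal settles at *c* and the output saturates at *c*/(*σ* + *c*) – the compressive, intensity-correcting behavior described above.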

#### Model evaluation

Model performance was quantified using the squared Pearson’s correlation coefficient between the CAP and the model predictions. Initial experiments with cross-validation showed that training and test performance were within ±1% of each other.

#### QN representations

To gain insight into the computational structure of the quadratic filter, we represented *H _{2}* by its eigenvalue decomposition: *H _{2}* = ∑_{i} *σ _{i}* *v _{i}* *v _{i}*^{T}, where the *v _{i}* and *σ _{i}* are the eigenvectors and their associated eigenvalues (Berkes and Wiskott, 2007; Lewis et al., 2002). This representation is useful, since *H _{2}* is typically of low rank, that is, few eigenvector-eigenvalue pairs are sufficient to reconstruct *H _{2}* with sufficient fidelity (Fig. S1). This representation is equivalent to a bank of linear-nonlinear models with filters *v _{i}* and quadratic nonlinearities: *y*’ = ∑_{i} *σ _{i}* (∑_{τ} *v _{i}*(*τ*) *s*(*t* − *τ*))^{2}.
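The equivalence between the quadratic filter and a bank of squaring linear-nonlinear models can be sketched with `numpy.linalg.eigh`; the mode count and variable names are our own:

```python
import numpy as np

def qf_as_ln_bank(H2, stim, n_modes=4):
    """Represent the symmetric quadratic filter H2 by its leading eigenmodes
    (by absolute eigenvalue) and evaluate the equivalent bank of
    linear-nonlinear (squaring) models."""
    evals, evecs = np.linalg.eigh(H2)
    order = np.argsort(np.abs(evals))[::-1][:n_modes]
    evals, evecs = evals[order], evecs[:, order]
    n_tau = H2.shape[0]
    y = np.zeros(len(stim))
    for t in range(n_tau - 1, len(stim)):
        past = stim[t - n_tau + 1:t + 1][::-1]
        y[t] = np.sum(evals * (evecs.T @ past) ** 2)
    return y
```

For a rank-1 filter, a single eigenmode reproduces the full quadratic form exactly, which provides a direct check of the decomposition.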

#### Subspace overlap

To compare quadratic filters fitted to different stimuli in a manner that is robust to noise, we computed the overlap between the subspaces spanned by the four leading eigenvectors. Specifically, given two quadratic filters *A* and *B*, with eigenvectors *a _{i}* and *b _{i}*, the cumulative overlap between the pairs of the K largest eigenvectors is given by the normalized sum of squared inner products between the two sets of eigenvectors (Fig. S3D) (Romo and Grossfield, 2011).
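A minimal sketch of one such subspace-overlap measure – assuming the common definition as the mean squared inner product between the two sets of leading eigenvectors; the exact normalization in Romo and Grossfield (2011) may differ:

```python
import numpy as np

def subspace_overlap(A, B, K=4):
    """Overlap between the subspaces spanned by the K leading eigenvectors
    (by absolute eigenvalue) of two symmetric filters A and B.
    Returns 1 for identical subspaces and ~0 for orthogonal ones."""
    def leading(M):
        evals, evecs = np.linalg.eigh(M)
        order = np.argsort(np.abs(evals))[::-1][:K]
        return evecs[:, order]
    VA, VB = leading(A), leading(B)
    # sum over i,j of (a_i . b_j)^2, normalized by K (assumed normalization)
    return np.sum((VA.T @ VB) ** 2) / K
```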

#### Envelope reconstruction

To assess the contribution of the quadratic filter to encoding the envelope, we compared the performance of two simple encoders: a rectified linear one, which thresholds the stimulus at 0, and a quadratic one, which squares the stimulus. For a stimulus that is symmetrical around 0, the quadratic encoder will produce more response energy, since the rectified linear one cuts off all negative stimulus components. We therefore normalized the output of each of the encoders to unit norm. Gaussian noise with standard deviation *σ* was then added to the normalized outputs and the coherence between the model output, *r*, and the original envelope, *e*, was estimated: *C _{er}*(*f*) = *P _{er}* *P _{re}*/(*P _{ee}* *P _{rr}*) for different noise-to-signal ratios as a measure of decoding accuracy.
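The encoder comparison can be sketched as follows, using `scipy.signal.coherence` as a stand-in for the spectral estimate; the stimulus parameters are illustrative:

```python
import numpy as np
from scipy.signal import coherence

def encode(stim, kind):
    """Rectified-linear ("relu") or quadratic ("quad") encoder,
    output normalized to unit norm."""
    r = np.maximum(stim, 0.0) if kind == "relu" else stim ** 2
    return r / np.linalg.norm(r)

def decoding_accuracy(env, stim, kind, noise_std, fs=10_000, seed=0):
    """Mean coherence between the envelope and the noisy encoder output."""
    rng = np.random.default_rng(seed)
    r = encode(stim, kind) + noise_std * rng.standard_normal(len(stim))
    _, C = coherence(env, r, fs=fs, nperseg=1024)
    return C.mean()
```

Sweeping `noise_std` for both encoder kinds on an amplitude-modulated carrier reproduces the comparison qualitatively.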

#### Pattern classification

For pattern classification, we used a leave-one-out nearest-neighbor classifier. Each of 100 10-ms-long response patterns was selected as a template and assigned to the class of its nearest neighbor, based on the Euclidean distance to the 99 non-template patterns (Machens et al., 2003). This resulted in a confusion matrix *p*(*s, s*’), which tabulates the joint probability with which a response that was elicited by stimulus *s* was classified as being elicited by any of the stimuli *s*’. The mutual information of *p* serves as a lower bound on the mutual information between stimulus and response: *I*(*s, r*) ≤ *I*(*s, s*’) = ∑_{s,s’} *p*(*s, s*’) log_{2} *p*(*s, s*’)/(*p*(*s*)*p*(*s*’)). *I*(*s, s*’) = 0 if *p*(*s, s*’) is uniform, that is, if there is no correspondence between the actual and the classified stimulus. The maximal value of *I*(*s, s*’) is given by the log_{2} of the number of stimulus classes (log_{2}(100) = 6.64 for pattern, log_{2}(8) = 3 for intensity) for a perfect, one-to-one mapping between stimulus and response. We visualized the similarity structure of the representations using t-SNE of a random subset of 10 patterns (Maaten and Hinton, 2008).
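The classification analysis can be sketched as follows (a minimal Python version; the data shapes in the check are illustrative):

```python
import numpy as np

def confusion_mi(responses, labels):
    """Leave-one-out nearest-neighbor classification (Euclidean distance)
    followed by the mutual information (bits) of the confusion matrix,
    a lower bound on the stimulus-response mutual information."""
    classes = np.unique(labels)
    p = np.zeros((len(classes), len(classes)))
    for i in range(len(responses)):
        d = np.linalg.norm(responses - responses[i], axis=1)
        d[i] = np.inf                              # leave the template out
        true = np.searchsorted(classes, labels[i])
        pred = np.searchsorted(classes, labels[np.argmin(d)])
        p[true, pred] += 1.0
    p /= p.sum()                                   # joint probability p(s, s')
    ps = p.sum(axis=1, keepdims=True)              # p(s)
    pr = p.sum(axis=0, keepdims=True)              # p(s')
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (ps @ pr)[nz])))
```

Two well-separated response clusters yield a diagonal confusion matrix and the maximal value of log2(2) = 1 bit.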

## Supplemental Figures

## Acknowledgements

We thank Cyrille Girardin for discussions during an early phase of the project; Georgia Guan for technical assistance; members of the Murthy lab, the Clemens lab, and Tim Gollisch for helpful feedback; and Rachel Wilson and Stephen Holtz for feedback on the manuscript. JC was supported by a postdoctoral fellowship through the Princeton Sloan-Swartz Center, an Emmy Noether grant (329518246) from the DFG, by the DFG through SFB889 (grant #154113120), Mechanisms of Sensory Processing, Project A10. MM was supported by a NIH New Innovator award (DP2) and a HHMI Faculty Scholar award.