Abstract
In longitudinal observations of animal groups, the goal is to identify individuals and to reliably detect their interactive behaviors, including their vocalizations. However, reliably extracting individual vocalizations from their mixtures and from other environmental sounds remains a serious challenge. Promising approaches are multi-modal systems that make use of animal-borne wireless sensors and exploit the inherent signal redundancy. In this vein, we designed a modular recording system (BirdPark) that yields synchronized data streams and contains a custom software-defined radio receiver. We record pairs of songbirds with multiple cameras and microphones and record their body vibrations with custom low-power frequency-modulated (FM) radio transmitters. Our custom multi-antenna radio demodulation technique increases the signal-to-noise ratio of the received radio signals by 6 dB and reduces the signal loss rate by a factor of 87, to only 0.03% of the recording time, compared to standard single-antenna demodulation techniques. Nevertheless, neither a single vibration channel nor a single sound channel is sufficient by itself to capture the complete vocal output of an individual, with each sensor modality missing on average about 3.7% of vocalizations. Our work emphasizes the need for high-quality recording systems and for multi-modal analysis of social behavior.
Introduction
Acoustic communication is vital for many social behaviors. However, studying interactions among animals that are kept in groups entails many measurement challenges beyond the already considerable challenges of analyzing longitudinal data of isolated animals1–3. One of the key difficulties of group-level behavior research is the automatic recognition of individuals and their actions.
Action recognition is the task of detecting behaviors from video or audio, or from signals collected with animal-borne sensors. Video-based action recognition has traditionally been based on posture tracking4–6 to avoid data-hungry training of classifiers on high-dimensional video data. Recently, posture tracking has greatly improved thanks to deep-learning approaches7–11. Action recognition from video requires good visibility of focal animals because visual obstructions tend to hamper recognition accuracy. Given that freely moving animals may occlude one another, e.g., in birds during nesting, there seems to be a limitation to the usefulness of pure vision-based approaches.
Sound recordings have also been instrumental in action recognition. Sounds can be informative about both vocal and non-vocal behaviors. For example, wing flapping during flight produces a characteristic sound signature, and preening, walking, and shaking can also be recognized from sounds12. The task of classifying sounds is known as acoustic scene classification13. As with vision, microphones record not just the focal animal but also background sounds. Therefore, action recognition and actor identification from sounds alone are challenging tasks, especially when many animals interact with one another. Possible workarounds are multi-modal approaches that combine multiple cameras and microphones, for example to assign vocalizations from dairy cattle in a barn to individual animals14. Other examples are systems that also include motion-tracking devices, for example to quantify gesture–speech synchrony in humans15.
In general, limitations due to sight occlusions and sound superpositions can be overcome with animal-borne sensors such as accelerometers16,17, gyroscopes, microphones18, and global positioning systems (GPS)17. In combination with wireless transmitters18 and loggers16, these sensors enable the detection of behaviors such as walking, grooming, eating, drinking, and flying, for example, in birds19, cats20, and dogs21, though often with low reliability because of noisy and ambiguous sensor signals12. In general, animal-borne transmitter devices are designed to achieve high reliability, low weight, small size, and long battery life, giving rise to a complex tradeoff. Among the best transmitters, in terms of battery life, size, and weight, are analog frequency-modulated (FM) radio transmitters. Their low power requirement minimizes animal handling frequency and associated handling stress, making them an excellent choice for longitudinal observations of small vertebrates18,22,23.
Among the challenges associated with FM radio reception is radio signal fading due to relative movement between animal-borne transmitters and stationary receivers. Fading arises when electromagnetic waves arrive over multiple paths and interfere destructively (channel fading)24, e.g., after reflection off metallic walls. Fading also occurs because every receiving antenna has a direction of zero gain, which can affect signal reception from a moving transmitter. Signal fading can be addressed with antenna diversity, i.e., the use of several antennas. In multi-antenna approaches of diversity combining, either the strongest signal is selected, all signals are summed, or signals are first weighted by their strength and then summed25. However, these approaches do not guarantee protection from fading when signals annihilate. Alternatively, diversity combining is possible with phase compensation, which is the technique of shifting signal phases such that the shifted signals align and sum constructively26,27. Phase shifting reduces fading and increases the signal-to-noise ratio of the received signal, and it provides cues for localizing a transmitter28,29. We set out to bring the benefits of antenna diversity and of phase-shifting techniques to ethology research.
We focus on systems that make use of more than one sensor modality and recognize actions with higher accuracy than would be possible from one sensor modality alone. One particular challenge with multi-modal systems is the synchronization of the multimodal data streams. Usually, each data modality is recorded with a dedicated recording device that uses its own internal sampling clock. Furthermore, clocks tend to drift apart, and often the recordings cannot be started at exactly the same time on all devices. Therefore, the individual data streams must be aligned post-recording using either markers in the sensor signals or auxiliary synchronization channels, which are labor-intensive and error-prone processes15,30,31.
To perform individual-level longitudinal observations of social behaviors and to record high-quality multimodal data sets suitable for action recognition, we present a custom recording system for groups of vocalizing animals (BirdPark). We built a naturalistic environment inside a soundproof enclosure that features a set of microphones to record sounds and several video cameras to capture the entire scene. Moreover, all animals wear a miniature low-power transmitter device that transmits body movements from a firmly attached accelerometer via analog, frequency-modulated (FM) radio that we receive with several antennas.
Our system is optimized for robust longitudinal recordings of vocal interactions in songbirds. The on-animal accelerometers enable week-long monitoring of vocalizations without a battery change. The combination of multiple antennas minimizes signal losses. All sensor signals are perfectly synchronized, which we achieve by routing dedicated sample triggers, derived from a central quartz clock using clock dividers, to all recording devices (radio receiver, stationary microphone digitizer, and cameras). We release our custom recording software and demonstrate the high data quality and the redundant signaling of vocal gestures.
Results
The Recording Arena
We built an arena optimized for audiovisual recordings, minimizing acoustic resonances and visual occlusions (Figure 1A). It provides space for up to 8 songbirds and contains nest boxes, perches, sand baths, food, and water sources. To record the sounds inside the chamber, we installed five microphones. Three video cameras capture the overall scene from three orthogonal viewpoints. In addition, we installed a camera and a microphone in each of two nest boxes. To simplify video analysis, we combined the images from all five cameras into a single video image (Figure 1B). The camera resolutions are sufficient to resolve key points on birds even in midflight (Figure 1D).
Recording arena and schematic of recording system. A: Inside a soundproof chamber, we built a recording arena (red dotted line) for up to 8 birds. We record the animals’ behaviors with three cameras mounted through the ceiling. These provide a direct top view and indirect side and back views via two mirrors (delimited by green and magenta dotted lines). To record the sounds in the chamber, we installed five microphones (blue dotted lines) on the four sides of the cage (one attached to the front door is not visible) and on the ceiling, and two small microphones in the nest boxes. The radio signals from the transmitter devices are received with four radio antennas (orange dotted lines) mounted on three side walls and the ceiling. One nest box is indicated with yellow arrows and a water bottle with blue arrows. B: A composite still image of all camera views shows two monochrome nest box views (top left) and three views of the arena (top, side, back) with 8 birds, one of which is flying (red arrows). Yellow and blue arrows as in A. C: Schematic of the recording system for gapless and synchronized recording of audio (microphones), radio (accelerometer sensors), and video channels (cameras). The radio receiver is implemented on a universal software radio peripheral (USRP) with a large field programmable gate array (FPGA) that runs at the main clock frequency of 200 MHz. Clock dividers on the FPGA provide the sample trigger for audio recordings and the frame trigger for the cameras. The data streams are collected on a host computer that runs two custom programs, one (BirdRadio) for streaming audio and sensor signals to disk and one (BirdVideo) for encoding video data. D: Zoom-in on an airborne bird, illustrating the spatial and temporal resolution of the camera.
To record vocalizations, we mounted transmitter devices to birds’ backs. On each device there is an accelerometer that picks up body vibrations from vocalizations and body movements such as hopping and wing flapping16. The devices transmit the acceleration signals as frequency-modulated (FM) radio waves to four antennas inside the recording chamber.
We demodulated the FM radio signals with a custom eight-channel radio receiver. To ensure that the data streams from the video cameras, microphones, and transmitter devices are well synchronized, the clock of the radio receiver triggers the video frames and audio samples (Figure 1C): We divided the frequency of this 200 MHz clock by 2^13 to generate the audio sample rate of 24.414 kHz, and with a further division by 2^9, we generated the 47.684 Hz video frame rate.
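As a sanity check, the clock-division arithmetic behind these derived rates can be reproduced in a few lines (a minimal Python sketch; the only inputs are the 200 MHz main clock and the two divider exponents stated above):

```python
# Clock-division arithmetic of the synchronization scheme described above.
f_main = 200e6               # USRP/FPGA main clock (Hz)
f_audio = f_main / 2**13     # audio/sensor sample rate: 24414.0625 Hz (~24.414 kHz)
f_video = f_audio / 2**9     # camera frame rate: ~47.684 Hz
samples_per_frame = f_audio / f_video
print(f_audio, f_video, samples_per_frame)   # 24414.0625  47.6837...  512.0
```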
On the host computer that controlled the acquisition system, we ran two custom applications: BirdVideo, which writes the video data to a file, and BirdRadio, which acquires the microphone and sensor data and writes them to a file (see Methods). The generated file pairs are synchronized such that with each video frame there are 512 audio and sensor samples. While the recording is gapless, it is split into files of typically 7 min duration.
Transmitter device
Our transmitter devices are based on the FM radio transmitter circuit described in Ter Maat et al.18 (Figure 2A). To record birds’ vocalizations distinctly, irrespective of external sounds, we replaced the microphone with an accelerometer16 that acts as a contact microphone. The radio circuit uses a single transistor to frequency modulate the sensor signal a(t) onto a radio carrier frequency ωc (set by the resonator circuit properties), with an inductor coil serving as the emitting antenna. As a result, the acceleration signal a(t) is encoded as the momentary radio transmitter frequency ωT(t) ≃ ωc(t) + c·a(t), where c is a constant. We found that the carrier frequency ωc is not constant but modulated by the proximity of body parts (proximity effect, Figure 5). Additionally, it is subject to temperature and end-of-battery-life drift (see Methods: Transmitter device). While the instability of the carrier frequency ωc is a disadvantage of the single-transistor circuit design, its advantages are its low weight (1.5 g) and low power consumption (the battery lifetime is 12 days).
Transmitter device. A: Schematic of the electronic circuit (adapted from Ter Maat et al.18). The analog FM radio transmits the vibration transducer (accelerometer) signal via a high-pass filter (cutoff frequency: 15 Hz) followed by a radiating oscillator. B: A fully assembled transmitter (left), another one without epoxy and battery (middle), and a piece of mounting foil (right). C: Picture of the device mounted on a bird. The transmitters are color coded to help identify the birds in video data. D: Schematic of a bird wearing the device. The harness (adapted from Alarcón-Nieto et al.32) is made of rubber string (black) that goes around the wings and the chest.
Before mounting a device on a bird, we adjusted the coil to the desired carrier frequency in the range 250–350 MHz by slightly bending the wires. Thereafter, we embedded the coil and the electronics in dyed epoxy, which protects the circuit, stabilizes the coil’s inductance, and helps identify the birds in the video images (Figure 2B,C). We mounted the devices on birds using a rubber-string harness adapted from Alarcón-Nieto et al.32 (Figure 2D).
Radio receiver
To reconstruct the acceleration signal a(t) of a given vibration transmitter device from the received multi-antenna FM signals, we demodulated the latter using a phase-locked loop (PLL), which measures the momentary transmitter frequency ωT(t). A PLL generates an internal oscillatory signal of variable instantaneous frequency ω(t) and adjusts that frequency to maintain a zero relative phase with respect to the received (input) signal. As long as the zero-relative-phase condition is fulfilled, the PLL’s instantaneous frequency ω(t), after high-pass filtering, forms our estimate of the bird’s acceleration a(t) (although proximity effects degrade this relationship).
In our diversity-combining approach, we construct the PLL’s input signal as a combination of all four antenna signals. Namely, we compensate the individual phase offsets of the four antenna signals in such a way that all phases are aligned (we achieve this phase shifting with phase-compensation circuits, one for each antenna). We then form the desired mixture signal by summing the phase-shifted signals; the phase of the summed signal serves as the PLL’s error signal that we use to adjust the instantaneous frequency ω(t). The variable phase offsets on the four antennas arise from the variable locations and orientations of transmitters relative to receivers.
In summary, our approach to minimizing fading in wireless ethology research is to use a demodulator comprising a PLL and a diversity combining approach based on phase-compensation. Our FM radio demodulation is described in more detail in the following.
We implemented our custom demodulation technique by installing four antennas labeled A, B, C, and D perpendicular to the walls and the ceiling of the chamber (see Methods: Radio reception). We fed the antenna signals into a universal software radio peripheral (USRP) containing a large field programmable gate array (FPGA) (Figure 3A). The four input stages of the USRP extract an 80 MHz wide band around the local oscillator frequency ωLO, which we typically set to 300 MHz. These four signals then become digitally available on the FPGA as complex-valued signals sampled at 100 MS/s. We call them the intermediate signals za(t), a ∈ {A, B, C, D} (the intermediate band in the frequency domain; Methods: Intermediate band, Figure 3B).
Radio receiver with PLL demodulators and phase compensation. A: Schematic of the software-defined radio receiver. Analog antenna signals are first down-converted by mixing the amplified signals with a local oscillator of frequency ωLO, followed by low-pass filtering. The resulting radio frontend (RF) outputs have digital representations zA(t), zB(t), zC(t), and zD(t). From these, eight FM demodulators extract the instantaneous frequencies ωi(t). B: The power spectrum of the 80 MHz wide intermediate band zA(ω) centered on the oscillator frequency ωLO = 300 MHz. The limiting ranges (gray bars) of the 8 channels (transmitters) are typically set to ±1 MHz around their manually set center frequencies (black vertical lines). A zoom-in (lower graph) on the limiting range of one active channel reveals the large peak associated with the momentary radio transmitter frequency. C: The baseband power spectral density is a down-converted, ±100 kHz (dashed vertical lines) frequency-flat band around the instantaneous frequency (solid vertical line) that tracks the momentary radio transmitter frequency (the large peak). D: For each channel (transmitter), a demodulator using a PLL (orange shading) and four phase compensation circuits (blue shading) computes the instantaneous frequency ωi (shown for i = 2). The baseband signals z2,a(t) are derived from the intermediate signals with digital downconverters (DDCs) that operate on a common instantaneous frequency ω2. The phase compensation circuits drive the angles ψ2,a of the rotated baseband signals r2,a(t) towards zero. E: Vector diagram in the complex plane illustrating the effects of alignment by the PLL (middle) and of phase compensation (right) on the main and baseband vectors. F: The measured phases of the four baseband vectors z2,a(t) are shown without phase compensation (left), after aligning the main vector with the PLL (middle), and after additional phase compensation (right). The result is that all vectors are aligned, and the main vector is of maximal amplitude.
On the FPGA, we instantiated eight demodulators indexed by i ∈ {1,…,8}. Each demodulator contained four digital downconverters (DDCs) that cut out, from the four intermediate signals za(t), the four baseband signals zi,a(t) around the demodulator’s instantaneous frequency ωi(t), sampled at 781.25 kHz (Methods: Baseband, Figure 3C and D). All baseband signals and the main signal are complex-valued, so we interchangeably call them vectors and refer to their phases as angles.
The PLL is driven by the main vector that we form as the sum of the four baseband vectors zi,a(t). The error signal for the PLL’s feedback controller is the phase θi(t) of the main vector (Methods: PLL, Figure 3D). When that phase is approximately zero (the PLL is locked), the instantaneous frequency ωi(t) tracks the frequency of transmitter i. We updated the PLL’s instantaneous frequency at a rate of 781.25 kHz.
To avoid destructive interference in the summation of baseband vectors, we compensated their phases relative to the main vector. We introduced individual phase offsets φi,a that were set in feedback loops to drive the angles ψi,a of the rotated baseband vectors ri,a(t) towards zero (Methods: Phase compensation, Figure 3E). Phase offsets were updated at a rate of 1.5 kHz and logged at every video frame.
The PLL and the phase compensation form independent control loops. When the PLL is unlocked (e.g., off), the instantaneous frequency does not match the momentary transmitter frequency and the baseband vectors rotate at the difference frequency (Figure 3F left). When the PLL is switched on and locked, the baseband vectors do not rotate, and the main signal displays a phase θi(t) ≃ 0 (Figure 3F middle). When the phase alignment is switched on, the baseband signals align and their sum maximizes the magnitude of the main vector (Figure 3F right).
Because birds’ locomotion is slower than their rapid vocalization-induced vibratory signals, the phases change more slowly than the instantaneous frequency ωi(t) and therefore we updated phase offsets less often than the PLL’s internal frequency.
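The benefit of phase-compensated summation over naive summation or antenna selection can be illustrated with a small numerical sketch (a minimal Python example with hypothetical amplitudes and phases, not the FPGA implementation):

```python
import numpy as np

# Hypothetical complex baseband amplitudes of one transmitter as seen by the
# four antennas; the relative phases depend on the bird's position and posture.
rng = np.random.default_rng(0)
amplitudes = np.array([1.0, 0.9, 0.5, 0.1])
phases = rng.uniform(0, 2 * np.pi, size=4)
basebands = amplitudes * np.exp(1j * phases)

naive_sum = np.abs(basebands.sum())       # uncompensated sum: can nearly cancel
best_antenna = np.abs(basebands).max()    # selection combining (best antenna)
aligned_sum = np.abs(basebands).sum()     # sum after ideal phase compensation

print(naive_sum, best_antenna, aligned_sum)
# The phase-compensated sum is never smaller than the best single antenna,
# whereas the uncompensated sum can fall below any individual signal.
```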
Operation
The intermediate band of za(ω) is wide enough to accommodate up to eight FM transmitters. Provided the transmitters’ FM carrier frequencies are roughly evenly spread, the momentary transmitter frequencies do not cross, even during very large frequency excursions such as when a bird pecks on its sensor. Nevertheless, we limited each instantaneous frequency ωi to the range [Ωi − ΔΩ, Ωi + ΔΩ], where Ωi is the channel’s center frequency and ΔΩ = 1 MHz is the common limiting range of all channels (Figure 3B). We manually set the center frequency Ωi of each channel at the beginning of an experiment to the associated FM carrier frequency. We found the limiting range of 1 MHz to be narrow enough for the PLL to rapidly re-lock after brief signal losses.
During operation, we often observed large excursions of the instantaneous frequency. These excursions occurred while birds pecked on the device, while they preened their feathers near the sensor, or when one bird sat on top of another, such as during copulations. The magnitude of these jumps could reach 1 MHz, which is much larger than the shifts of at most about 1 kHz induced by vocalizations (Figure 5). The large excursions were likely caused by capacitive-inductive effects of movements on the resonator circuit. These observations demonstrate the large bandwidth and robustness of our PLLs.
Performance of diversity combining technique
We validated the robustness of our multi-antenna demodulation method by analyzing the signal-to-noise ratio of the received transmitter signals of a zebra finch pair over a full day (2 x 14 h measurement period). We quantified the diversity-combining performance in terms of the fraction of time during which signal fading occurred, both with and without diversity combining. We defined the radio signal-to-noise ratio (RSNR) of the main (summed) signal as the signal power (mean square) divided by the median of the signal variance (the median is taken over the duration of the 7-min long data files). In theory, the noise amplitude of the main signal is twice as large as that of a single baseband signal (i.e., its noise power is four times larger), assuming independence of the radio amplifier noises on the four antennas. The signal power of the main signal is upper bounded by 16 times that of its largest constituent, i.e., the best-placed antenna signal (the baseband signal of largest amplitude). The achievable RSNR gain is therefore bounded by a factor of 4 (6 dB).

Indeed, we found that the RSNR of the main signal was on average about 6 dB higher than that of the best single-antenna signal (Figure 4B). The total reception time during which the RSNR was critically low (<13 dB, our operational definition of signal fading) was 87 times shorter when using multi-antenna demodulation than when using best-antenna demodulation (Figure 4C). Thus, our diversity-combining technique results in a coherent, constructive addition of the four antenna signals and in very robust frequency tracking even when the signal at one antenna vanishes.
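A back-of-the-envelope check of this theoretical bound, assuming four equally strong, perfectly aligned antenna signals with independent noise:

```python
import numpy as np

n_antennas = 4
signal_power_gain = n_antennas ** 2   # coherent (phase-aligned) sum: amplitude 4x, power 16x
noise_power_gain = n_antennas         # independent noise adds in power: 4x
print(10 * np.log10(signal_power_gain / noise_power_gain))   # ~6.02 dB upper bound
```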
A: Spectrogram examples of demodulated acceleration signals generated by single zebra finch calls with radio signal-to-noise ratios of 11, 22, 32, and 41 dB (top, from left to right). The bottom plots show the spectra of the calls (red lines) and of noise (blue lines), computed as time averages from the spectrograms above (time windows indicated by red and blue horizontal bars). The relative noise power PN (integral of the blue line relative to the noise power of the first example) decreases with increasing RSNR. When the RSNR is below 13 dB, the noise power spectral density is above the signal power of most vocalizations (these become invisible). B: Histogram of RSNR over time for the multi-antenna signal (blue line) and the single-antenna signals for the antenna with the greatest mean RSNR (purple line) and all other antennas (grey lines). The mean of the multi-antenna RSNR (dashed blue line) is 6 dB larger than the mean RSNR of the best single antenna (dashed purple line). C: Multi-antenna demodulation outperforms best-antenna demodulation, as demonstrated by the significantly longer reception periods above a given signal-to-noise ratio. The time below the critical RSNR of 13 dB is reduced by a factor of 87.
Proximity affects the instantaneous frequency much more than vocalizations. On a transmitter device with a disabled (short-circuited) accelerometer, the instantaneous frequency is strongly modulated by the proximity of the head during preening (left) and by wing movements during flight (middle). In contrast, the modulation of the instantaneous frequency is much weaker for vocalizations (right) on a transmitter with an enabled accelerometer. The modulation amplitudes (black arrows) due to proximity are about 1000 times larger than the modulation amplitudes due to vocalizations.
Analysis of vocalizations
We observed that signals picked up from vocalizations reached up to 7 kHz in optimal cases, but only up to 1 kHz in the worst case. This variability stems from differences in vocalization amplitude and RSNR levels, and presumably also from how well and at which position the sensor contacts the bird’s skin. The vibration sensor also measures accelerations from body movements like wing flaps or hopping. To quantify our ability to detect vocalizations in the multimodal data stream, we manually segmented vocalizations by inspection of log-power sound spectrograms from 2 vibration and 6 microphone channels (see Methods for details). We were particularly interested in missed vocalizations that are invisible on a given channel but whose presence is revealed on another channel. Compared to the miss rate of 4.2% associated with a single microphone channel (range 2.5%–10.0%, 7-min recordings of n=2 bird pairs), all microphone channels combined reduced the miss rate by a factor of about 10, which reveals a large benefit of the multi-microphone approach. The percentage of missed vocalizations on a given vibration channel was similar to that on a single microphone channel, 3.7% (range 1.2%–5.8%, n=4 birds), although these numbers are not directly comparable because the miss rate on a microphone channel stems from two birds whereas the miss rate on a vibration channel stems from a single bird. A very small number of vocalizations was missed on all six microphone channels, 0.4% (range 0.0%–0.6%, n=2 bird pairs); these were all very faint calls that were masked by loud noises.
We release our open-source radio receiver software and our graphical user interface (GUI) for monitoring live data streams on a host computer. We also share a sample of our dataset together with the annotations of all vocal segments; these may be useful for benchmarking machine learning systems on the task of vocal segmentation.
Discussion
We designed and validated a behavioral recording system for up to 8 songbirds that yields perfectly synchronized multi-modal data. Compared with single-antenna demodulation, our custom multi-antenna FM demodulation technique increases the radio SNR by 6 dB and reduces signal-loss events by a factor of 87. The wireless devices transmit well-separated vocalizations unless these are masked by large body movements such as wing flaps or attenuated by weak mechanical coupling between the bird and the device, which we could not completely eliminate. We segmented vocalizations in two 7-min recordings of pair-housed birds and found that each signal channel by itself is of limited value for reliably extracting vocal segments. We believe this ability to evaluate the inherent quality of a sensor modality can be very valuable in ethology research.
Ideally, the quality of a data set of vocalizations should be evaluated relative to ground truth, i.e., the true segmentation of vocalizations and background noise. However, such a ground truth data set does not (yet) exist in practice. For example, to perfectly measure the vocal output of a bird would require simultaneous measurements of syringeal labia and muscle activity, of sub-syringeal air pressure, and of tracheal air flow33, which are measurements that have not been performed simultaneously in freely moving group-housed birds. Without such a data set or at least an approximation thereof, quantifications of vocal output will be biased.
We estimated vocal output by visual inspection of sound spectrograms from five microphones and two animal-borne accelerometers. With respect to this approximate ground truth, we find that no sensor channel by itself achieves a rate of misses (false-negative vocalizations) of less than 1%, with an average of 3.7%, which is a large number given that we took our measurements in ideal settings using the minimal group size of two instrumented birds housed in a relatively small, acoustically well-isolated environment. Since we might have missed some vocalizations even when considering all channels, our estimated single-channel miss rates constitute lower bounds, implying that the true miss rate of vocalizations must be higher. And, assuming that published datasets are of lower quality than ours (less instrumentation), it is likely that findings on animal vocal communication in the literature exhibit a bias due to a miss rate higher than 3.7%. Moreover, since we expect the miss rate to increase with the number of interacting birds studied, the benefits of our recording system likely increase with the size of the social group.
Our system could promote research on the meaning of vocalizations for social behavior. Of key interest are courtship and reproductive behaviors, which have attracted much attention in the past. For songbirds, selecting a partner for copulation and subsequent offspring rearing involves complex courtship displays that include varied vocalizations and coordinated body movements on very rapid time scales. The reproductive behavior of zebra finches has been studied thoroughly34–37, with efforts to define comprehensive ethograms38,39. The copulation behavior of zebra finches has attracted much attention, partly because of the phenomenon of extra-pair copulation: the tendency of zebra finches to copulate with mates that are not their partner40–42. However, not much is known about the roles of vocalizations in signaling copulation readiness and in reflecting the subsequent pair bond.
Similarly, multimodal group-level studies using our system could help us better understand the learning strategies young birds use while they modify their immature vocalizations to match a sensory target provided by a tutor. Much of our knowledge on song learning stems from research on isolated animals in controlled environments43–45. However, the isolated-animal paradigm is impoverished compared to the natural setting in which animals live in groups, because vocal learning is subject to social influences46. For example, the song learning success of juveniles is positively influenced by social interactions47,48, including with non-singing adult females49. Furthermore, the directed songs that males produce towards females are different from, and subserved by different brain mechanisms than, the undirected songs produced while alone50. To study these processes, it would be valuable to acquire longitudinal high-quality audio and video data sets of freely behaving and vocalizing animals in complex social settings. As vocalizations are communicative signals, the information they contain can only be fully captured by considering both the environment and the social group of the studied individual.
Methods
The chamber
All our experiments were performed in a sound-isolation chamber (Industrial Acoustic Company, UK) of dimensions 124 cm (width) x 90 cm (depth) x 130 cm (height). The isolation chamber has a special silent ventilation system for air exchange at a rate of roughly 500 l/min. Two types of light sources made of light emitting diodes (LEDs, Waveform Lighting) are mounted on aluminum plates on the ceiling of the chamber: 1) LEDs with natural frequency spectrum (consuming a total electric power of 80 W); and 2) ultraviolet (UV) LEDs of wavelength 365 nm and consuming a total electric power of 13 W.
The ceiling of the chamber contains three circular holes of 7 cm diameter through which we inserted aluminum tubes of 5 mm wall thickness that serve as holders for three cameras (Basler acA2040-120uc). The tubes also conduct the heat of the LEDs (93 W) and the cameras (13 W) to the outside of the chamber, where we placed silent fans to keep the cameras below 55 °C.
With the cameras, we filmed the arena directly from the top and indirectly from two sides via two tilted mirrors in front of the glass side panels. This setup yields an object distance of about 1 – 1.5 m, which allows for a depth of field that covers the entire arena. Furthermore, the large object distance results in a small perspective distortion.
We installed four microphones on the four side walls of the isolation chamber and one microphone on the ceiling. We mounted two further microphones in the nest boxes and one microphone outside the isolation chamber. Wherever possible, we covered the side walls of the chamber with sound-absorbing foam panels. A door sensor measures whether the door of the chamber is open or closed. We kept the temperature inside the chamber at roughly 26 °C and the humidity at roughly 24%.
The arena
Inside the chamber, we placed the bird arena of dimensions 90 cm x 55 cm (floor). To minimize acoustic resonances and optical reflections, the 40-cm high side panels of the arena are tilted by 11° and 13° towards the inside. Two of the side panels are made of glass for videography and the two opposite panels are made of perforated plastic plates. The floor of the arena is covered with a sound-absorbing carpet. A pyramidal tent made of a fine plastic net covers the upper part of the arena reaching up to 125 cm above ground. At a height of 35 cm, we attached two nest boxes to the side panels, each equipped with a microphone and a camera. Furthermore, the arena is equipped with perches, a sand-bath, and a food and water tray.
Video acquisition system
To visualize the arena, we used industrial cameras (3 Megapixel, MP) with zoom lenses (opening angles: top view: 45° x 26°, back view: 55° x 26°, side view: 26° x 35°) and exposure times of 3 ms. To visualize nests in the dimly lit nest boxes, we used monochrome infrared cameras (2 MP, Basler daA1600-60um) and fisheye lenses (143° x 112°).
The uncompressed camera outputs (ca. 400 MB/s in total) are relayed to a host computer over a USB3 Vision interface. Each camera receives frame trigger signals generated by the USRP (Figure 1C). A custom program (BirdVideo), written in C++ and using the OpenCV library51 and FFMPEG52, undistorts the nest camera images with a fisheye lens model and transforms all five camera images to a single 2976 x 2016 pixel-sized image (Figure 1B). The composite images are then encoded with the h264 codec on a NVIDIA GPU and stored in an MP4 (ISO/IEC 14496-14) file. We used constant-quality compression with a variable compression ratio in the range 150-370, resulting in a data rate of 0.72 MB/s – 1.8 MB/s, depending on the number of birds and their activity in the arena. Compression ratios in this range do not significantly decrease key point tracking performance53. The frame rate of the video is about 48 frames per second (frame period 21 ms). The spatial resolution of the main cameras is about 2.2 pixels/mm.
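BirdVideo itself is written in C++; the following Python/OpenCV sketch only illustrates the same processing steps (fisheye undistortion of the nest views, composition into one 2976 x 2016 image, and encoding). All camera parameters, the tile layout, and the 'mp4v' codec are illustrative placeholders, not the values used by BirdVideo.

```python
import cv2
import numpy as np

# Hypothetical fisheye intrinsics for a nest camera; the real values come from
# a calibration run (e.g., cv2.fisheye.calibrate) and are not reproduced here.
K = np.array([[700.0, 0.0, 800.0],
              [0.0, 700.0, 600.0],
              [0.0, 0.0, 1.0]])
D = np.array([[-0.05], [0.01], [0.0], [0.0]])
nest_size = (1600, 1200)
map1, map2 = cv2.fisheye.initUndistortRectifyMap(K, D, np.eye(3), K,
                                                 nest_size, cv2.CV_16SC2)

# Placeholder codec; BirdVideo instead encodes h264 on an NVIDIA GPU.
writer = cv2.VideoWriter("composite.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                         47.68, (2976, 2016))

def compose(top, side, back, nest1, nest2):
    """Assemble one 2976 x 2016 composite frame (tile layout is illustrative)."""
    canvas = np.zeros((2016, 2976, 3), dtype=np.uint8)
    for idx, nest in enumerate((nest1, nest2)):
        undist = cv2.remap(nest, map1, map2, interpolation=cv2.INTER_LINEAR)
        tile = cv2.resize(cv2.cvtColor(undist, cv2.COLOR_GRAY2BGR), (672, 504))
        canvas[0:504, idx * 672:(idx + 1) * 672] = tile
    canvas[504:2016, 0:2016] = cv2.resize(top, (2016, 1512))
    canvas[504:2016, 2016:2976] = cv2.resize(np.vstack([side, back]), (960, 1512))
    return canvas

# In the acquisition loop, each composed frame is written to the MP4 file:
# writer.write(compose(top, side, back, nest1, nest2))
```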
Transmitter device
The vibration transducer (Knowles BU21771) senses acceleration with a sensitivity of 1.4 mV/m/s2 in the 30 Hz – 6 kHz frequency range54. Inspired by Ter Maat et al.18, we performed the frequency modulation of the vibration signal onto a radio carrier using a simple Hartley oscillator transmitter stage with only one transistor and a coil that functions both as the inductor in the LC resonator and as the antenna. The LC resonator determines the carrier frequency ωc of the radio signal. We set ωc of our transmitters in the vicinity of 300 MHz, which corresponds to an electromagnetic wavelength of about 1 m and is roughly equal to the dimension of our soundproof chambers, implying that our radio system operates in the near field.
We found that ωc depends on temperature with an average coefficient of −73 kHz/°C (range −67 kHz/°C to −87 kHz/°C, n=3 transmitters). Towards the end of the battery life (over the course of the last 3 days), we observed an increase of ωc by 500–800 kHz. These slow drifts can easily be accounted for by tracking and high-pass filtering the momentary frequency. The measured end-to-end sensitivity of the frequency modulation is 5 kHz/g, where g is the gravitational acceleration. The transmitters are powered by a zinc-air battery (type LR41), which lasts at least 12 days. The total weight of the transmitter including the harness and battery is 1.5 g, which is ca. 10% of the body weight of a zebra finch.
We moderately tightened the harness during attachment, in a tradeoff between picking up vibratory signals from the singer and preserving the wearer’s natural behaviors. It is known that the act of mounting devices can transiently affect zebra finch behavior: right after attachment, the singing rate and the amount of locomotion both tend to decrease; they return to baseline within 3 days for sensors weighing 0.75 g55 and within 2 weeks for sensors weighing 3 g56.
Radio reception
We mounted four whip antennas of 30 cm length (Albrecht 6157) perpendicular to the metallic side walls and ceiling of the chamber (using magnetic feet). We fed the antenna signals to a universal software radio peripheral (USRP-2940, National Instruments, USA), which comprises four independent antenna amplifiers whose gains we set to 68 dB (adjustable in the range −20 dB to +90 dB).
Intermediate band
The USRP generates from the amplified radio signal a down-converted signal with an analog bandwidth of 80 MHz around the local oscillator frequency (ωLO). After digitization, this intermediate signal za(t) for antenna a ∈ {A, B, C, D}, is a 100 MS/s complex-valued signal (IQ) of 2 x 18 bits precision and is relayed to the FPGA. For details about the analog down conversion and the sampling of complex valued signals, see the NI manual of the USRP and the documentation of our custom radio software.
Digital downconverters (DDCs) and baseband
Every PLL, i ∈ {1,…,8}, makes use of four digital downconverters (DDCs, Figure 3D), one for each antenna a, that cut from the intermediate band a narrow baseband around the instantaneous frequency ωi. The 3-stage decimation filter of this down-conversion has a flat frequency response within a ±100 kHz band. Given the decimation factor of 128, the resulting complex base-band signals (the outputs of the DDCs) have a sample rate of 781.25 kHz and a precision of 2 x 25 bits.
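The following Python sketch mimics one such DDC in a single stage (the FPGA uses a 3-stage decimation filter); the channel offset, filter length, and input signal are placeholders:

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs_if = 100e6        # sample rate of the complex intermediate signal za(t)
decim = 128          # 100 MS/s / 128 = 781.25 kHz baseband rate
f_offset = 12.3e6    # hypothetical offset of channel i from the local oscillator

def ddc(z_if, f_offset, fs_if=fs_if, decim=decim):
    """Single-stage stand-in for the FPGA's 3-stage DDC: mix down, low-pass, decimate."""
    t = np.arange(len(z_if)) / fs_if
    mixed = z_if * np.exp(-2j * np.pi * f_offset * t)   # shift channel i to 0 Hz
    taps = firwin(255, 100e3, fs=fs_if)                 # ~±100 kHz passband (sketch only)
    return lfilter(taps, 1.0, mixed)[::decim]           # complex baseband at 781.25 kHz
```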
Phase locked loop
Each PLL generates its instantaneous frequency ωi by direct digital synthesis (DDS) with a 48-bit phase accumulation register and a lookup table. This frequency is dynamically adjusted to keep the phase of the main vector (i.e., the summed baseband signals) close to zero. We calculated the angle of the main vector with the CORDIC algorithm57 and unwrapped the phase up to ±128 turns before using it as the error signal for a proportional-integral-derivative (PID) controller. The PID controller adjusts the instantaneous frequency of the PLL.
The PID controller was implemented on the FPGA in single precision floating point (32-bit) arithmetic. The controller included a limiting range and an anti-windup mechanism58. The unwrapping of the error phase was crucial for the PLL to quickly lock-in and to keep the lock even during large and fast frequency deviations of the transmitter. To tune the PID parameters, we measured the closed-loop transfer function of the PLL by adding a white-noise signal to the control signal (instantaneous frequency), and then adjusted the PID parameters until we observed a low-pass characteristic without overshoot. We achieved a closed loop bandwidth of about 30 kHz.
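A minimal software sketch of such a limited PID update with anti-windup is shown below; the gains and the conditional-integration scheme are illustrative, not the tuned FPGA parameters (on the FPGA, the unwrapped CORDIC phase serves as the error input).

```python
class FrequencyPid:
    """Simplified PID controller for the PLL's instantaneous frequency.

    error: unwrapped phase of the main vector (radians); output: frequency,
    limited to [center - limit, center + limit]. Gains are placeholders.
    """
    def __init__(self, kp, ki, kd, center, limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.center, self.limit = center, limit
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error, dt):
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        candidate = self.integral + error * dt
        out = (self.center + self.kp * error
               + self.ki * candidate + self.kd * derivative)
        lo, hi = self.center - self.limit, self.center + self.limit
        if lo <= out <= hi:
            self.integral = candidate   # anti-windup: integrate only when unsaturated
        return min(max(out, lo), hi)    # limited instantaneous frequency
```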
Phase compensation
For a given PLL i, we compensated the relative phases under which a radio signal arrives at the four antennas a in order to align the four baseband vectors zi,a(t). The alignment was achieved by providing a phase offset φi,a to each downconverter, where it acts as an offset to the phase accumulation register of the direct digital synthesis. To compute the angle of a baseband vector relative to the main vector, we rotated the baseband vectors zi,a(t) by the phase θi of the main vector, which yields the rotated vectors ri,a(t). After averaging the rotated vectors across 512 samples, we computed their angles ψi,a. We then compensated these angles by iteratively adding a fraction γ ∈ [0,1] of them to the phase offsets: φi,a ← φi,a + γψi,a. The parameter γ is the phase compensation gain (Figure 3D), typically set to γ = 0.2. This iterative update was performed at a rate of about 1.5 kHz (781.25 kHz / 512), which is faster than birds’ locomotion (changes in the physical position or orientation of the transmitter).
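In code, one compensation step per block could look as follows (a sketch assuming a sign convention in which the offset is subtracted inside the DDC; shapes and names are illustrative):

```python
import numpy as np

GAMMA = 0.2    # phase compensation gain (value from the text)
BLOCK = 512    # baseband samples per update (781.25 kHz / 512 ≈ 1.5 kHz update rate)

def update_phase_offsets(basebands, theta_i, phase_offsets, gamma=GAMMA):
    """One phase-compensation step for the four antennas of channel i.

    basebands: complex array of shape (4, BLOCK); baseband vectors after the DDCs,
               which already apply the current phase_offsets.
    theta_i:   phase of the main vector (sum over antennas), as computed by the PLL.
    Returns the updated per-antenna phase offsets (radians).
    """
    rotated = basebands * np.exp(-1j * theta_i)    # rotate by the main-vector phase
    psi = np.angle(rotated.mean(axis=1))           # block-averaged angles of the rotated vectors
    return phase_offsets + gamma * psi             # iterative offset update
```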
Central control software
On the host computer, we run our central control software (BirdRadio), programmed in LabVIEW, which acquires the microphone and transmitter signals and writes them to a TDMS file (Figure 1C). Furthermore, BirdRadio sends UDP control signals to BirdVideo; it automatically starts and stops the recording in the morning and evening, respectively; it controls the light in the sound-isolation chamber with dimming that simulates sunrise and sunset; it triggers an email alarm when the radio signal from a transmitter device is lost; and it automatically adjusts the center frequency of each radio channel every morning to compensate for carrier frequency drift.
Data management
The BirdPark is designed for continuous recordings over multiple months, producing data at a rate of 60 GB/day for 2 birds and 130 GB/day for 8 birds. We implemented the FAIR (findable, accessible, interoperable, and reusable) principles of scientific data management59 as follows:
Our recording software splits the data into gapless files of 20’000 video frames (ca. 7 mins duration). At the end of a recording day, all files are processed by a script that converts the TDMS files into HDF5 files and augments them with rich metadata. The HDF5 files are self-descriptive in that every data field (main data and metadata) contains its description as a property. We use the lossless compression feature of HDF5 to obtain a compression ratio of typically 2.5 for the audio and accelerometer data. The script also adds two AAC-compressed microphone channels into the video files. Although this step introduces redundancy, the availability of sound in the video files is very useful during manual annotation of the videos. Furthermore, the script also exports the metadata as a JSON file and copies the processed data onto a NAS server. At the end of an experiment, the metadata is uploaded onto an openBIS60 server and is linked with the datafiles on the NAS.
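A minimal sketch of the nightly conversion step is given below (using h5py; dataset names, shapes, and metadata fields are illustrative and do not reproduce the released file layout):

```python
import h5py
import numpy as np

def write_hdf5(path, audio, accel, metadata):
    """Write compressed, self-descriptive audio and accelerometer data to HDF5."""
    with h5py.File(path, "w") as f:
        d = f.create_dataset("audio", data=audio, compression="gzip")      # lossless
        d.attrs["description"] = "wall and nest microphones, 24.414 kHz, synchronized"
        d = f.create_dataset("acceleration", data=accel, compression="gzip")
        d.attrs["description"] = "demodulated transmitter signals, 24.414 kHz"
        for key, value in metadata.items():                                # rich metadata
            f.attrs[key] = value

write_hdf5("example_file.h5",
           audio=np.zeros((7, 24414 * 10), dtype=np.int16),
           accel=np.zeros((2, 24414 * 10), dtype=np.float32),
           metadata={"bird_ids": "b15p5_m,b14p4_f", "video_frames": 20000})
```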
Manual Segmentation of vocalizations
We manually segmented vocalizations in recordings of mixed-sex zebra finch pairs (7-min recordings of n=2 bird pairs). We high-pass filtered the raw accelerometer and microphone signals (FIR filter with a stopband frequency of 100 Hz and a passband frequency of 200 Hz) and produced spectrograms (window size=384 samples, hop size=96 samples) that we manually segmented using Raven Pro 1.661. We performed three types of segmentations in the following order:
Transmitter-based vocal segments: We separately segmented all vocalizations on either transmitter channel, precisely annotating their onsets and offsets. In our released data sets, these segments are referred to as transmitter-based vocal segments. When we were uncertain whether a sound segment was a vocalization or not, we also looked at spectrograms of microphone channels and listened to sound playbacks. If we remained uncertain, the segment was tagged with the label ‘Unsure’; such segments were treated as (non-vocal) noise and were excluded from further analysis.
Microphone-based vocal segments: We simultaneously visualized all microphone spectrograms (from Mic1 to Mic6) using the multi-channel view in Raven Pro (we ignored Mic7, which was located in the second nest that was not accessible to the birds). On those, we annotated each vocal segment on the first microphone channel on which it was visible (e.g., a syllable that is visible on all microphone channels is only annotated on Mic1). Overlapping vocalizations were annotated as a single vocal segment. When we were uncertain whether a sound segment was a vocalization or not, we also looked at spectrograms of accelerometer channels and listened to sound playbacks. If we remained uncertain, the segment was tagged with the label ‘Unsure’; such segments were treated as (non-vocal) noise and were excluded from further analysis. These segments are referred to as microphone-based vocal segments.
Consolidated vocal segments: We labelled all consistent (perfectly overlapping) transmitter- and microphone-based vocal segments as consolidated vocal segments. We then inspected all inconsistent (not perfectly overlapping) segments by visualizing all channel spectrograms. We resolved inconsistencies caused by human annotation errors (e.g., lapses of attention) by correcting the erroneous or missing transmitter- and microphone-based segments. From the inconsistent (partially overlapping) segments that were not caused by human error, we generated one or several consolidated segments by trusting the modality that more clearly revealed the presence of a vocalization (hence our reporting of ‘misses’ in Table 1). In our released comma-separated value (CSV) files, we give each consolidated vocal segment a Bird Tag (e.g., either ‘b15p5_m’ or ‘b14p4_f’) that identifies the bird that produced the vocalization, a Transmitter Tag that identifies the transmitter channel on which the vocalization was identified (either ‘b15p5_m’ or ‘b14p4_f’ or ‘None’), and a FirstMic Tag that identifies the first microphone channel on which the segment was visible (‘Mic1’ to ‘Mic6’, or ‘None’). We resolved inconsistencies and chose these tags as follows (a simplified sketch of this pairing logic is given after the list):
If a microphone (-based vocal) segment was paired (partially overlapping) with exactly one transmitter segment, a consolidated segment was generated with the onset time set to the minimum onset time and the offset time set to the maximum offset time of the segment pair. The Bird and Transmitter Tags were set to the transmitter channel name, and the FirstMic Tag was set to the microphone channel name.
If a transmitter segment was unpaired, a consolidated segment was created with the same onset and offset times. The Bird and Transmitter Tags were set to the transmitter channel name, and the FirstMic Tag was set to ‘None’.
If a microphone segment was unpaired, a transmitter channel name was inferred based on the vocal repertoire and the noise levels on both transmitter channels. We visually verified that the vocal segment was not the result of multiple overlapping vocalizations (which was never the case). Then we created a consolidated segment with the same onset and offset, set the FirstMic Tag to the microphone channel name, the Bird Tag to the inferred transmitter channel name, and the Transmitter Tag to ‘None’.
If a microphone segment was paired with more than one transmitter segment, a consolidated segment was created for each of the accelerometer segments. The onsets and offsets were manually set based on visual inspection of all spectrograms. Bird and Transmitter Tags were set to the transmitter channel name and the FirstMic Tag was set to the microphone channel name.
We never encountered the case where a transmitter segment was paired with multiple microphone segments.
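The following Python sketch summarizes the pairing rules above; it omits the manual steps (repertoire-based assignment of unpaired microphone segments and manual onset/offset correction for multi-pair cases), for which placeholder arguments are used.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    onset: float
    offset: float
    channel: str   # e.g. 'b15p5_m', 'b14p4_f', or 'Mic1'..'Mic6'

def overlaps(a, b):
    return a.onset < b.offset and b.onset < a.offset

def consolidate(mic_segs, tr_segs, guess_bird):
    """Simplified consolidation of microphone- and transmitter-based segments.

    guess_bird: callable standing in for the manual, repertoire-based assignment
    of unpaired microphone segments to a bird.
    """
    consolidated, paired = [], set()
    for m in mic_segs:
        partners = [t for t in tr_segs if overlaps(m, t)]
        if len(partners) == 1:                       # one-to-one pair
            t = partners[0]
            paired.add(id(t))
            consolidated.append(dict(onset=min(m.onset, t.onset),
                                     offset=max(m.offset, t.offset),
                                     bird=t.channel, transmitter=t.channel,
                                     first_mic=m.channel))
        elif not partners:                           # unpaired microphone segment
            consolidated.append(dict(onset=m.onset, offset=m.offset,
                                     bird=guess_bird(m), transmitter="None",
                                     first_mic=m.channel))
        else:                                        # one consolidated segment per transmitter segment
            for t in partners:                       # (onsets/offsets were set manually in practice)
                paired.add(id(t))
                consolidated.append(dict(onset=t.onset, offset=t.offset,
                                         bird=t.channel, transmitter=t.channel,
                                         first_mic=m.channel))
    for t in tr_segs:                                # unpaired transmitter segments
        if id(t) not in paired:
            consolidated.append(dict(onset=t.onset, offset=t.offset,
                                     bird=t.channel, transmitter=t.channel,
                                     first_mic="None"))
    return consolidated
```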
Statistics of missed vocalizations on a transmitter channel (second row), on all microphone channels (third row), and on an individual microphone channel (Mic1, fourth row). The ‘syllable overlaps’ row quantifies the percentage of time in which both birds vocalize simultaneously (as in the example in Figure 6A), expressed as a fraction of the total vocalization duration.
Vocal signals and their segmentation. A: Example log-power spectrograms of vocalizations produced by a mixed-sex zebra finch pair. The songs and calls overlap on the microphone channel (Mic1, top) but appear well separated on spectrograms of the male’s momentary frequency ω1(t) (male transmitter, middle) and the female’s momentary frequency ω2(t) (female transmitter, bottom). Distinct transmitter-based vocal segments are indicated by red (male) and purple (female) horizontal bars on top of the spectrograms. High-frequency vibrations appear attenuated, but even a high-pitched 7 kHz song note (white arrow) by the male is still visible. B: Example vocalizations that are not captured by the transmitter device (A1: syllable masked by wing flaps, A2: low signal level) or that are not visible on some microphone channels (B1: low signal level) or on all microphone channels (B2: syllable masked by loud noise).
The statistics in Table 1 were calculated as follows:
# vocalizations = (number of consolidated segments)
# voc. missed on tr. channels = (number of consolidated segments with Transmitter=None)
# voc. missed on all mic. channels = (number of consolidated segments with FirstMic=None)
# voc. missed on Mic1 = (number of consolidated segments with FirstMic≠Mic1 and FirstMic≠None)
The assignment to the female/male column was determined by the ‘Bird’ tag.
The vocal overlap statistic was calculated based on the number of spectrogram bins in the consolidated segmentation as follows:
vocal overlap = 2*(number of spectrogram bins with vocal overlap)/(number of spectrogram bins with vocalization)
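Assuming the consolidated segments are exported with columns named Bird, Transmitter, and FirstMic (hypothetical headers; the released CSV files may differ), the counts above can be reproduced with a few lines of Python:

```python
import pandas as pd

segs = pd.read_csv("consolidated_segments.csv")   # hypothetical file name

n_vocalizations = len(segs)
missed_on_transmitter = (segs["Transmitter"] == "None").sum()
missed_on_all_mics = (segs["FirstMic"] == "None").sum()
missed_on_mic1 = ((segs["FirstMic"] != "Mic1") & (segs["FirstMic"] != "None")).sum()

print(n_vocalizations, missed_on_transmitter, missed_on_all_mics, missed_on_mic1)
```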
Animals and Experiments
Mixed-sex zebra finch (Taeniopygia guttata) pairs bred and raised in our colony (University of Zurich and ETH Zurich) were kept in the BirdPark on a 14/10 h light/dark daily cycle for multiple days, with food and water provided ad libitum. All experimental procedures were approved by the Cantonal Veterinary Office of the Canton of Zurich, Switzerland (license number ZH054/2020). All methods were carried out in accordance with relevant guidelines and regulations (Swiss Animal Welfare Act and Ordinance, TSchG, TSchV, TVV).
Data and code availability
The recording software (BirdRadio and BirdVideo) and the segment annotations including the raw data files (HDF5 files with transmitter and microphone signals and MP4 videos) will be made available for download.
Author contributions
Conceptualization, J.R., L.R., M.D.R. and R.H.R.H.; System and software development, L.R. and J.R.; animal experiments, L.R. and H.H.; data annotation and data analysis, T.T. and L.R.; writing—original draft, J.R., L.R. and R.H.R.H.; writing—reviewing and editing, J.R., L.R., T.T., M.D.R. and R.H.R.H.
Competing interests
The author(s) declare no competing interests.
Acknowledgments
We thank Aymeric Nager for his help with the design and construction of the BirdPark. Financial support was provided by Swiss National Science Foundation grant 31003A_182638 (The roles of vocal communication in pair formation and cultural learning in songbirds) and the Swiss National Science Foundation NCCR Evolving Language (Agreement #51NF40_180888).
References
- [1]
- [2]
- [3]
- [4]
- [5]
- [6]
- [7]
- [8]
- [9]
- [10]
- [11]
- [12]
- [13]
- [14]
- [15]
- [16]
- [17]
- [18]
- [19]
- [20]
- [21]
- [22]
- [23]
- [24]
- [25]
- [26]
- [27]
- [28]
- [29]
- [30]
- [31]
- [32]
- [33]
- [34]
- [35]
- [36]
- [37]
- [38]
- [39]
- [40]
- [41]
- [42]
- [43]
- [44]
- [45]
- [46]
- [47]
- [48]
- [49]
- [50]
- [51]
- [52]
- [53]
- [54]
- [55]
- [56]
- [57]
- [58]
- [59]
- [60]