Abstract
Our understanding of collective animal behavior is limited by our ability to track each of the individuals. We describe an algorithm and software, idtracker.ai, that extracts from video all trajectories with correct identities at high accuracy for collectives of up to 100 individuals. It uses two deep networks, one that detects when animals touch or cross and another that identifies each animal, trained adaptively to the conditions and difficulty of the video.
Obtaining animal trajectories from a video faces the problem of how to track animals with correct identities after they touch, cross or are occluded by environmental features. To bypass this problem, we proposed in idTracker the idea of tracking by identification of each individual using a set of reference images obtained from the video [1]. idTracker and further developments in animal identification algorithms [2–6] can work for small groups of 2-15 individuals. In larger groups, they only work for particular videos with few animal crossings [7] or with few crossings of particular species-specific features [5].
Here we present idtracker.ai, a system to track all individuals in small or large collectives (up to 100 individuals) at a high identification accuracy, often of > 99.9%. The method is species-agnostic and we have tested it in small and large collectives of zebrafish, Danio rerio and flies, Drosophila melanogaster. Code, quickstart guide and data used are provided (see Methods), and Supplementary Text describes algorithms and gives pseudocode. A graphical user interface walks users through tracking, exploration and validation (Fig. 1a).
Similar to idTracker [1], but with different algorithms, idtracker.ai identifies animals using their visual features. In idtracker.ai, animal identification is done by adapting deep learning [8–10] to work in videos of animal collectives thanks to specific training protocols. In brief, it consists of a series of processing steps summarized in Fig. 1b. After image preprocessing, a first deep network finds when animals are touching or crossing. The system then uses the images between these detected crossings to train a second deep network for animal identification. The system first assumes that a single portion of video in which animals do not touch or cross has enough images to properly train the identification network (Protocol 1). However, animals touch or cross often, so this portion is typically very short and the system estimates that identification quality is too low. If this happens, two extra protocols (Protocols 2 and 3) safely accumulate enough images of each animal from several of these portions of video to build a larger training set. After training and assignment of identities, some postprocessing is performed to output the trajectories and an estimate of identification accuracy.
In the following we give more details of the processing steps. The preprocessing extracts blobs, areas of each video frame corresponding either to a single animal or to several animals that are touching or crossing, i.e. 'crossings'. It then orients the blobs using their axes of maximum elongation (Fig. 1c). This procedure leaves the animal pointing in one of two possible orientations. We resolve this ambiguity by training with both upright and 180-degree-rotated images. This method is valid for any elongated animal and is preferred to species-specific methods.
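The orientation step can be sketched in a few lines of Python (a minimal illustration, not the released code): the axis of maximum elongation of a blob is the first principal component of its pixel cloud, defined up to the 180-degree ambiguity mentioned above.

```python
import numpy as np

def principal_axis_angle(pixel_coords):
    """Angle (radians) of the axis of maximum elongation of a blob.

    pixel_coords: (N, 2) sequence of (x, y) pixel coordinates of one blob.
    Returns the angle of the first principal component with the x axis,
    defined only up to a 180-degree ambiguity (as discussed in the text).
    """
    pts = np.asarray(pixel_coords, dtype=float)
    centred = pts - pts.mean(axis=0)
    # Eigenvector of the covariance matrix with the largest eigenvalue
    cov = np.cov(centred.T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, np.argmax(eigvals)]
    return np.arctan2(major[1], major[0])
```

Rotating every blob image by the negative of this angle aligns all animals along a common axis before they are fed to the networks.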
The deep crossing detector network determines whether each preprocessed image corresponds to a single animal or a crossing (Fig. 1d; details of network architecture in Supplementary Table 1). idtracker.ai trains this network using images that it confidently classifies as single animals or crossings (see Supplementary Text for the heuristics used). Once trained, it classifies all blobs as single animals or crossings. We depict the detected crossings as small black segments in Fig. 1g.
The deep identification network is then used to identify each individual between two crossings (Fig. 1e; Supplementary Table 1 for details of network architecture). We measured the identification capacity of this network using 184 single-animal videos, with 300 pixels per animal on average. The advantage of single-animal videos is that we obtain a very large number of images per animal. Out of the 18,000 images per animal we randomly selected 3,000 for training. Testing on 300 new images gave a > 95% single-image accuracy up to 150 animals (Fig. 1f; see Supplementary Fig. 2 for experimental set-up and Supplementary Fig. 1 for results using alternative architectures detailed in Supplementary Tables 2-3). In contrast, idTracker degrades more quickly, down to ≈83% for 30 individuals, and is computationally too demanding for larger groups.
In videos of collective animal behavior, however, we lack direct access to 3,000 images per animal to train the identification network. Instead, we use a cascade of three protocols that obtains the training images differently depending on the difficulty of the video (Fig. 1b, cascade of training protocols; see Supplementary Figures 3-4 for video-acquisition setups for zebrafish and flies, respectively).
Protocol 1 starts by finding all intervals of the video where all the animals are detected as separated from each other. To each interval, for each animal, we add images up to the next crossing from future frames and images up to the immediately previous crossing from past frames. We call these extended intervals global fragments; they can contain different numbers of images per animal. Among all the global fragments, the system then chooses the one that maximizes the distance traveled by its least-traveled animal (Fig. 1g, Step 1, colors indicate each of the 100 individuals in the collective). The system uses this global fragment to train the identification network. Once trained, the network assigns identities in all the remaining global fragments.
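The selection criterion, choosing the global fragment whose least-traveled animal has moved the most, can be written compactly. This is a sketch under a simplified data model (each global fragment reduced to a mapping from animal identity to distance traveled); the real code operates on richer fragment objects.

```python
def best_global_fragment(global_fragments):
    """Pick the global fragment whose least-travelled animal moved the most.

    global_fragments: list of dicts mapping animal id -> distance travelled
    by that animal within the fragment (a simplification for illustration).
    """
    # max-min criterion: maximize the minimum distance over animals
    return max(global_fragments, key=lambda frag: min(frag.values()))
```

A fragment in which every animal moves at least a little is preferred over one where a single animal barely moves, since a nearly static animal yields many redundant, near-identical training images.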
Afterwards, the system evaluates the quality of the assigned global fragments. It eliminates: (1) global fragments with an estimated identification accuracy below some threshold, (2) those with identifications inconsistent with already assigned global fragments, and (3) those where the same identity has been assigned to several animals. If the remaining high-quality global fragments (Fig. 1g, Step 2) cover < 99.95% of the images in global fragments, then Protocol 1 failed and Protocol 2 starts, as in our example.
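The three-step quality check can be sketched as follows; the accuracy threshold and the data model here are illustrative choices of ours, not idtracker.ai's actual values.

```python
def passes_quality_check(identities, estimated_accuracy,
                         previously_assigned, threshold=0.99):
    """Three-step quality check on an assigned global fragment (sketch).

    identities: list of identities assigned in this fragment, one per animal.
    estimated_accuracy: the network's estimated accuracy for this assignment.
    previously_assigned: dict mapping animal-fragment index -> identity fixed
        by already-accepted global fragments (empty if there is no overlap).
    The 0.99 threshold is illustrative, not the value used by idtracker.ai.
    """
    # 1. estimated identification accuracy must exceed the threshold
    if estimated_accuracy < threshold:
        return False
    # 2. identities must be consistent with already-accepted fragments
    for idx, fixed_identity in previously_assigned.items():
        if identities[idx] != fixed_identity:
            return False
    # 3. no identity may be duplicated within the fragment
    return len(set(identities)) == len(identities)
```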
Protocol 2 starts by training the network with the high-quality global fragments found in Protocol 1. This network is then used to assign the remaining global fragments again, selecting those passing our three-step quality check. This procedure iterates until at least 50% of the images have been assigned. From this point on, the system runs the accumulation as before, alternating it with the following extension. Single-animal fragments belonging to an unsuitable global fragment are accumulated if they are certain enough, are consistent with fragments already accumulated and do not introduce identity duplications. Accumulation continues until no more acceptable global fragments remain or 99.95% of the images from global fragments have survived the quality check. After this point, if > 90% of the images in global fragments have been accumulated, Protocol 2 ends successfully. In our example, Protocol 2 stops the accumulation at the 9th step (Fig. 1g, Step 9). Afterwards, the remaining images are assigned using the final network (see higher-transparency segments in the close-up given in Fig. 1h).
The system then estimates identification accuracy using a conservative Bayesian framework (Supplementary Fig. 5), 99.95% in our example (Fig. 1i, top). Human validation of 3,000 sequential video frames, by revising 680 crossings, gave 99.997% (Fig. 1i, bottom). An identification accuracy of 100% was obtained with the alternative method of following 10 random animals throughout the video.
A post-processing step obtains animal images by iterative image erosion and assigns them with a heuristic (Fig. 1j; Supplementary Text). Human validation gives an accuracy of 99.988% for the final assignments, including images between crossings and during crossings.
If Protocol 2 fails, Protocol 3 starts by training the convolutional part of the identification network using most of the global fragments. It then proceeds as Protocol 2, but always keeping the convolutional layers fixed.
We have tested idtracker.ai in small and large animal collectives (Supplementary Tables 4 and 5, respectively). In zebrafish, Protocol 2 was always successful, giving accuracies of 99.96% (mean) ± 0.06% (std) for 60 individuals and 99.99% (mean) ± 0.01% (std) for 100 individuals. Importantly, of the remaining 0.01% in videos of 100 animals, only 0.003% corresponds to isolated frames with assignment errors and 0.007% to short non-assigned segments. In flies, Protocol 2 succeeded for a collective of 38 individuals with 99.98% accuracy. For larger groups, Protocol 3 was successful. For 72 flies the accuracy is 99.997%. For 80-100 flies the system reaches its limit, still with > 99.5% accuracy.
We also studied how performance depends on the number of images between crossings. We built synthetic global fragments from individual videos of 184 zebrafish (Supplementary Fig. 2). We found that the system reaches high accuracy provided there is at least one global fragment with more than 30 images per animal, but it can still be successful with fewer (Supplementary Fig. 6, empty markers). Recorded collectives of up to 100 zebrafish meet this condition by a large margin (Supplementary Fig. 6, green dots). Flies also meet it, except at very low locomotor activity levels, here obtained in a low-humidity setup (Supplementary Fig. 6, purple dots). Also note that video-acquisition conditions should ideally allow for high image quality (Supplementary Text), but idtracker.ai seems more robust than idTracker when some of these conditions are not met (Supplementary Table 6).
Note: Supplementary Information is available
The authors declare that no conflict of interest exists.
Author contributions
F.R-F., M.G.B. and G.G.dP. devised project, algorithms and analysed data, F.R-F. and M.G.B. wrote the code with help from F.H., M.G.B. managed code architecture and GUI, F.R-F. managed testing procedures, R.H. built set-ups and performed experiments with help from F.R-F., G.G.dP. supervised project, M.G.B. wrote supplement with help from F.R.-F, R.H, F.H. and G.G.dP., and G.G.dP. wrote main text with help from F.R.-F, M.G.B. and F.H.
Methods
Software availability
idtracker.ai is open-source and free software (license GPL v.3). The source code and installation instructions are available at www.gitlab.com/polavieja_lab/idtrackerai. A quick-start user guide and a detailed explanation of the graphical user interface can be found at www.idtracker.ai.
Data availability
All videos used in this study can be downloaded from www.idtracker.ai. A library of single-individual images of zebrafish to test identification methods can be found in the same link. Two example videos, one of 8 adult zebrafish and another of 100 juvenile zebrafish, are also included as part of the quick-start user guide.
Computers
We tracked all the videos with desktop computers running GNU/Linux Mint 18.1 64bit (processor Intel Core i7-6800K or i7-7700K, 32 or 128 GB RAM, Titan X or GTX 1080 Ti GPUs, and 1 TB SSD disk). Sample videos can also be tracked using only the CPU, but performance is severely degraded.
Animal rearing and handling
All fish were raised at the Champalimaud Foundation Fish Platform, according to the housing and husbandry methods integrated in the zebrafish welfare program fully described in [11]. Animal handling and experimental procedures were approved by the Champalimaud Foundation Ethics Committee and the Portuguese Direcção Geral Veterinária and were performed according to the European Directive 2010/63/EU. For zebrafish videos we used the wild-type TU strain at 31 days post fertilization (dpf). Animals were kept in 8 L holding tanks at a density of 10 fish/L and a 14 h light / 10 h dark cycle in the main fish facility. For each experiment, a holding tank with the necessary number of fish was transported to the experimental room, where fish were carefully transferred to the experimental arena using a standard fish net appropriate for their age.
For the fruit fly videos we used adults from the Canton S wild-type strain at 2-4 days post-eclosion. Animals were reared on a standard fly medium and kept on a 12-h light-dark cycle at 28°. Flies were placed in the arena either by anesthetizing them with CO2 or ice, or by using a suction tube. We found the suction tube to have the least negative effect on the flies' health and to provide better activity levels.
Details of the networks
Network architectures
The deep crossing detector network (Fig. 1d) is a convolutional neural network [8, 10]. It has 2 convolutional layers that learn a relevant hierarchy of filters from the data. A hidden layer of 100 neurons then transforms the convolutional output into a classification as single animal or crossing. idtracker.ai trains this network using images that it can confidently classify as single or as multiple animals (for example, single animals as blobs with an area consistent with single-animal statistics that do not split into more blobs in their past or future). Further details of the architecture are given in Supplementary Table 1.
The architecture of the identification network (Fig. 1e) consists of 3 convolutional layers, a hidden layer of 100 neurons and a classification layer with as many classes as animals in the collective. Further details are given in Supplementary Table 1. We tested variations of the architecture by modifying either the number of convolutional layers (Supplementary Table 2) or the number of hidden-layer neurons (Supplementary Table 3). Analysis of these networks indicated that the most important feature for successful identification is that the convolutional part has at least two layers (Supplementary Fig. 1). The GUI allows users to modify the architecture of this network and its training hyperparameters.
Network training
The convolutional and fully-connected layers of both networks are initialised using Xavier initialisation [12]. Biases are initialised to 0.
The deep crossing detector network is trained using the algorithm and hyperparameters in [13]. The learning rate is set at the initial value of 0.005. This network is trained in mini batches of 100 images.
The identification network is trained using stochastic gradient descent, setting the learning rate to 0.005. This network is trained in mini batches of 500 images. Further details are given in the Supplementary Text.
C General video conditions
It is advisable to adhere to some guidelines when recording videos of freely-moving animals. The following conditions help maximise the probability of success and the accuracy of the tracking.
Resolution. The higher the number of pixels per individual, the more information is available to distinguish it from the rest. On the downside, the additional information makes the algorithm less time-efficient. See Supplementary Tables 4 and 5 for the average number of pixels per animal in each of the videos tracked.
Frame rate. The frame rate must be high enough for the blobs associated with the same individual to overlap in consecutive frames when moving at average speed. A low frame rate, relative to the average speed of the animals, can cause a bad fragmentation of the video, an essential process in the tracking pipeline that collects images belonging to the same individual and organises them in fragments. Conversely, an excessively high frame rate makes the information coming from the analysis of the fragments highly redundant, increasing the computational time needed to track the video without guaranteeing better identification of the individuals. In the examples provided in this paper, the frame rate ranges from 25 fps to 50 fps.
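As a rule of thumb (our back-of-the-envelope estimate, not a formula from the paper), the overlap condition bounds the frame rate from below by the ratio of average speed to body length: the displacement per frame must stay below roughly one body length for a blob to overlap its previous position.

```python
def min_frame_rate(avg_speed_px_per_s, body_length_px, safety=2.0):
    """Rough lower bound on frame rate so an animal's blob overlaps itself
    across consecutive frames.

    Overlap roughly holds while the per-frame displacement stays below one
    body length; the safety factor (our choice, not the paper's) leaves
    margin for bursts above the average speed.
    """
    return safety * avg_speed_px_per_s / body_length_px
```

For example, an animal moving 500 px/s with a 40 px body would need at least 25 fps under this estimate, consistent with the 25-50 fps range used in the paper.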
Duration. The length of video for which the system works depends on the number of animals, the distribution of images per individual fragment and the number of pixels per animal. For few animals (8 zebrafish) we can track videos as short as ≈18 s (≈500 frames at 28 fps). For large groups we can track videos as short as 1 min (≈1950 frames at 32 fps). The system works for longer videos as long as the overall conditions do not change abruptly across the video. Very long videos with many animals require a large amount of RAM and can exhaust your computer's memory.
Video format. The system works with any video format compatible with OpenCV. We recommend uncompressed or lossless video formats: some compression algorithms work by deleting pieces of information that could be crucial for the identification of the individuals. However, we have successfully tracked videos in compressed formats: .avi (FPM4 video codec) and .MOV (avc1 video codec) (see Supplementary Table 6).
Illumination. Illumination should be as uniform as possible, so that the appearance of the animals is homogeneous throughout the video. We recommend using indirect light, either by reflecting the light off the walls of the setup or by covering the setup with a light diffuser, as shown in Supplementary Figures 3 and 4. We have also tracked videos with retroilluminated arenas (see Supplementary Table 6), but recall that the tracking system relies on visual features of the animals that this type of illumination can hide.
Definition and focus. Images of individuals should be as sharp and focused as possible so that their features are clearly visible throughout the entire video. When using wide apertures on the camera, the depth of field can be quite narrow. Make sure that the plane of the camera sensor is parallel to the plane of the arena so that animals are in focus in all parts of it. In addition, the shutter speed should be fast enough (i.e. the exposure time short enough) that animals do not appear blurred when moving at average speed. Blurred and out-of-focus images are more difficult to identify correctly.
Background. The background should be as uniform as possible. To facilitate the detection of the animals during the segmentation process (see section D.1), the background colour should be chosen to maximise the contrast with the animals. Small background inhomogeneities or noise are acceptable and can be removed by the user during the selection of the preprocessing parameters:
– Static or moving objects much smaller or much larger than the animals can be removed by setting appropriate minimum and maximum blob-size thresholds (in pixels).
– Static objects of the same size and intensity of the animals can be removed by selecting the option “subtract background” in the preprocessing tab.
– Regions of the frame can be also excluded by selecting a region of interest.
Shadows. Shadows projected by the individuals on the background can lead to a bad segmentation and, hence, to a bad identification. Shadows can be diffused by using a transparent base separated from an opaque background (see Supplementary Figure 3) or by using a retroilluminated arena.
Reflections. Reflections of individuals on the walls of the arena should be avoided: They could be mistaken for an actual individual during the segmentation process. Reflections in opaque walls can be reduced by using either very diffused light or matte walls. For aquatic arenas with transparent walls, reflections can be softened by having water at both sides of the walls. Furthermore, reflections can be removed by selecting an appropriate ROI.
Variability in number of pixels per animal. The number of pixels in a blob is one of the criteria used to distinguish individual fish from crossings. An optimal video should fulfil the two following conditions. First, the number of pixels associated with each individual should vary as little as possible along the video. Second, the size of an individual should vary as little as possible with its position in the arena. In any case, strategies to avoid misidentification are put in place even when animal sizes vary (see section D.2.3).
D Algorithm
We first introduce the workflow of the algorithm; subsequent sections give further details on each of its components. The algorithm is composed of six computational cores, highlighted in blue in Supplementary Figure 7. First, during the segmentation process, the images representing either single or multiple touching animals are extracted from the video. In the remainder, we refer to images representing a single individual as individual images and to images in which two or more individuals are touching as crossing images.
A model of the average area of the individuals and, later, a convolutional neural network (CNN), named deep crossing detector in the remainder, are used to discriminate between individual and crossing images.
Each image extracted from the video is now labelled as either a single individual or a crossing. By means of an extra-safe protocol, we define collections of images in subsequent frames of the video in which the same individual (or crossing) is represented. We name these collections individual and crossing fragments, respectively.
The fourth computational core is the gist of the algorithm. A subset of the collection of individual fragments, in which all the individuals are visible in the same part of the video, is used to generate a dataset of individual images labelled with the corresponding identities. This dataset is then used to train a second CNN to classify images according to their identity. A cascade of increasingly encompassing training/identification protocols is put in place, so that an appropriate identification strategy is automatically chosen by the algorithm according to the complexity of the video. The idea underlying this family of methods is that the information gained from the first dataset of labelled images allows the algorithm either to accurately assign the entire collection of individual fragments, or to enlarge the first dataset by incorporating safely identified individual fragments from throughout the video.
The knowledge acquired during the protocol cascade is used to identify the individual fragments that were not used to train the identification CNN. In the remainder, we will refer to this operation as residual identification.
Finally, trivial identification errors are corrected by a series of post-processing routines, and the identity of the crossing fragments is inferred in a last computational core.
D.1 Segmentation
idtracker.ai tracks the individuals by relying on their visual features. Hence, given a frame of the video, it is necessary to distinguish between pixels associated with individuals and pixels belonging to the background. Following the standard notation adopted in computer vision, we refer to a collection of connected pixels which is not part of the background as a blob.
The segmentation process has four main steps. First, the user can define a region of interest to be applied on each frame of the video. In this way it is possible to exclude, for instance, walls which may contain reflections of the animals.
Second, each frame is normalised with respect to its average intensity to correct for illumination fluctuations. It is also possible to perform background subtraction by generating a model of the background, calculated as the average of a collection of frames obtained by subsampling the video.
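The background model described above can be sketched with NumPy (an illustration; the released implementation differs in detail):

```python
import numpy as np

def background_model(frames, step=10):
    """Background as the average of a subsample of frames (sketch).

    frames: iterable of 2-D grayscale arrays of equal shape.
    step: subsampling period; every `step`-th frame enters the average.
    """
    sample = [np.asarray(f, dtype=float) for i, f in enumerate(frames)
              if i % step == 0]
    return np.mean(sample, axis=0)

def subtract_background(frame, background):
    """Absolute difference between a frame and the background model,
    leaving only moving objects (and noise) with high values."""
    return np.abs(np.asarray(frame, dtype=float) - background)
```

Averaging over subsampled frames suppresses the moving animals, since each animal occupies any given pixel in only a small fraction of the sampled frames.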
Then, blobs of pixels corresponding to animals are detected by intensity thresholding and subsequent labelling of connected components. The intensity thresholds that distinguish the individuals from the background are specified by the user. Often, intensity alone is not enough to segment the animals in the entire video. For this reason, it is also possible to specify a minimum and a maximum area (number of pixels) for a blob to be acceptable. For instance, these parameters allow dust to be excluded during segmentation.
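The thresholding and connected-component labelling can be sketched in pure Python with a 4-connectivity flood fill (a didactic version, much slower than the optimised routines a real implementation would use; it assumes dark animals on a bright background):

```python
import numpy as np
from collections import deque

def segment_blobs(frame, intensity_threshold, min_area=1, max_area=None):
    """Threshold a grayscale frame and extract blobs by connected-component
    labelling (4-connectivity), filtering them by area.

    A pixel is taken to belong to an animal when its intensity is below the
    threshold (dark animals on a bright background).
    Returns a list of blobs, each a list of (row, col) pixel coordinates.
    """
    frame = np.asarray(frame)
    mask = frame < intensity_threshold
    visited = np.zeros_like(mask, dtype=bool)
    blobs = []
    h, w = mask.shape
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not visited[y, x]:
                # breadth-first flood fill of one connected component
                queue, blob = deque([(y, x)]), []
                visited[y, x] = True
                while queue:
                    cy, cx = queue.popleft()
                    blob.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and not visited[ny, nx]):
                            visited[ny, nx] = True
                            queue.append((ny, nx))
                # keep only blobs within the user-specified area range
                if len(blob) >= min_area and (max_area is None
                                              or len(blob) <= max_area):
                    blobs.append(blob)
    return blobs
```

The `min_area` filter plays the role described above: a one-pixel speck of dust is discarded while a multi-pixel animal blob is kept.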
All these operations are carried out intuitively using the idtracker.ai graphical user interface, where both the intensity and area thresholds can be adjusted while observing their effect on the frame in real time; see Supplementary Figure 8.
The software currently supports only grayscale video segmentation. Frames captured from a color video will be automatically mapped to grayscale.
Remark 1 (On background subtraction). Background subtraction is often useful when trying to segment a video in which a static object has the same intensity level as the individuals one wants to segment (see Supplementary Material C).
D.2 Detection of individual and crossing images
The training/identification process can identify only images representing single individuals. Thus, a crucial point of the algorithm is the discrimination between individual and crossing images. To differentiate between these two classes, we apply a series of three algorithms to the images segmented from the video.
First, we use two heuristics to detect images that in all likelihood correspond to a single animal (sure individual images) or to crossing animals (sure crossing images), respectively. Then, we use these sure individual and sure crossing images to train a neural network. Finally, the trained network labels ambiguous (not sure) images as either crossing or individual images.
D.2.1 Model area
We build a model of the area of the individuals by taking into account portions of the video in which the number of segmented blobs corresponds to the number of animals declared by the user. If no frame fulfils this condition, the tracking cannot proceed and an error is raised. Let ϱ = {b1, …, bn} be the collection of the blobs segmented from these parts of the video and A = {area(bi) for every bi ∈ ϱ} the collection of the corresponding individual areas, where the function area(bi) counts the number of pixels of the blob bi. The model area is defined by the median mA = median(A) and the standard deviation sA = σ(A). Given a blob b, we classify it as an individual if its area deviates from mA by less than a threshold proportional to sA (eq. (D.1)), and as a crossing otherwise.

A model based exclusively on the area of the blobs can easily fail when the individuals' body is not rigid (e.g. fish or mice), can suddenly change shape (e.g. a fly with open or closed wings), or under heterogeneous lighting conditions. Even more complex situations arise when animals can move freely in 3 dimensions (e.g. fish swimming at different depths). In this latter case, one individual can be almost completely occluded by another, causing the area model to fail.
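The area model reduces to a few lines. In this sketch the tolerance k is an illustrative choice of ours; the exact threshold is the one of eq. (D.1).

```python
import numpy as np

def fit_area_model(areas):
    """Median m_A and standard deviation s_A of individual blob areas,
    computed on frames where the number of blobs equals the declared
    number of animals."""
    areas = np.asarray(areas, dtype=float)
    return np.median(areas), np.std(areas)

def is_individual_by_area(blob_area, m_a, s_a, k=4.0):
    """Classify a blob as a single individual when its area lies within
    k standard deviations of the model median. The value of k is
    illustrative, not idtracker.ai's actual threshold."""
    return abs(blob_area - m_a) <= k * s_a
```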
D.2.2 Blobs overlapping in subsequent frames
The second heuristic is based on the overlapping of blobs in subsequent frames: it selects sure crossing and sure individual images depending on the merging or splitting of consecutive, overlapping blobs. We recall that a blob is a collection of connected acceptable pixels in a certain frame, where a pixel is considered acceptable depending on its intensity value and the thresholding described in section D.1. Let b1 and b2 be two blobs. We say that the two blobs overlap if and only if b1 ∩ b2 ≠ ∅, where b1 ∩ b2 is the intersection between the corresponding sets of pixels. See Supplementary Figure 9 for an example.
Let Bi = {bi,1, …, bi,n} be the collection of blobs segmented from the ith frame of a video 𝒱. For every blob bi,j ∈ Bi we derive the collections of blobs overlapping with bi,j in frames (i − 1) and (i + 1). We call these collections the sets of previous and next blobs of bi,j, denoted by Pbi,j and Nbi,j, respectively.
Let b be a blob. We say that b is a blob associated with a sure individual image if:
a) b is an individual according to eq. (D.1);
b) |Pb| = |Nb| = 1, i.e. the blob overlaps with one and only one blob both in the previous and in the subsequent frame. The notation |·| indicates the cardinality of a set, i.e. the number of elements of the set;
c) for every bp and bn in the past and future overlapping history of b, |Pbp| ⩽ 1 and |Nbn| ⩽ 1.

Symmetrically, we say that b is associated with a sure crossing image if:
a) b is a crossing according to eq. (D.1);
b) |Pb| > 1 or |Nb| > 1.
or
a) b does not satisfy the model of the area;
b) |Pb| = |Nb| = 1;
c) for some bp or bn in the past or future overlapping history of b, |Pbp| > 1 or |Nbn| > 1.
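These heuristics can be sketched as follows. For brevity the sketch checks only the immediate past and future neighbours and implements only the first sure-crossing rule; the full algorithm walks the entire overlapping history.

```python
def blobs_overlap(pixels_a, pixels_b):
    """Two blobs overlap iff they share at least one pixel, i.e. the
    intersection of their pixel sets is non-empty."""
    return bool(set(pixels_a) & set(pixels_b))

def is_sure_individual(blob):
    """Sure-individual heuristic (sketch). `blob` is a dict with:
      'is_individual_by_area' - passes the area model of section D.2.1
      'prev', 'next'          - lists of overlapping blobs in the previous
                                and next frame, each a dict of the same shape
    """
    if not blob['is_individual_by_area']:
        return False
    # the blob must overlap exactly one blob in each adjacent frame
    if len(blob['prev']) != 1 or len(blob['next']) != 1:
        return False
    prev_blob, next_blob = blob['prev'][0], blob['next'][0]
    # its neighbours must not merge from, or split into, several blobs
    return len(prev_blob['prev']) <= 1 and len(next_blob['next']) <= 1

def is_sure_crossing(blob):
    """Sure-crossing heuristic (sketch of the first rule): a crossing by
    area that merges or splits, i.e. overlaps with more than one blob in
    an adjacent frame."""
    return (not blob['is_individual_by_area']
            and (len(blob['prev']) > 1 or len(blob['next']) > 1))
```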
D.2.3 Deep crossing detector
The methods described in sections D.2.1 and D.2.2 can be applied to any video to generate a dataset 𝒟ic of sure individual and sure crossing images. With this dataset we train a CNN in the task of distinguishing crossing and individual images. We call this model the deep crossing detector (DCD). In the following paragraphs we describe the preprocessing, architecture, hyperparameters and stopping criteria used to define and train this model.
Preprocessing
Let b be a blob segmented from a video 𝒱 and Ib the image generated by cropping a rectangular bounding box around the centroid of b, such that all the pixels of b are represented in Ib. We first consider a dilation b* of b generated with a 5×5 kernel, and assign value 0 to every pixel of Ib that is not in b*. To overcome the sensitivity of CNNs to rotation, we compute the first principal component of the cloud of pixels defined by b and then rotate and crop Ib such that the first principal component forms a fixed angle with the x axis. After the rotation, the size of each image is set to the maximum of the largest bounding-box side over the collection of sure crossing images. The images are then resized to 40×40 pixels; the resizing improves both the time and memory efficiency of the algorithm. Finally, each image I ∈ 𝒟ic is standardised to zero mean and unit variance (see sample images in Fig. 1, Panel d in the main text).
Architecture
See Supplementary Table 1 (deep crossing detector). Both convolutional and fully-connected layers are initialised using Xavier initialisation [12]. Biases are initialised to 0.
Loss function
Let (x, li) be a labelled image, where li is the label in one-hot encoding, i.e. l0 = [1, 0] is the label associated with x if x is a crossing image and l1 = [0, 1] if x is an individual image. We compute the loss associated to (x, li) as a weighted cross-entropy, 𝔏(x, li) = −wi log σ(ai), where σ(ai) = e^ai / Σj e^aj is the softmax function applied to the activation ai of the ith unit of the last layer of the network, with j varying over all classes, in this case j ∈ {0, 1}. The weight wi associated with li is inversely proportional to the class frequency, wi = Σj |Lj| / (2 |Li|), where |Li| is the number of training samples belonging to the class li and j varies over all the classes of the dataset (only two in this case: individual and crossing). The weighting allows us to deal with the potentially unbalanced dataset 𝒟ic. Indeed, we prefer to collect all the sure-crossing and sure-individual images available in a given video rather than force 𝒟ic to be balanced in the number of samples per class. After dividing the dataset 𝒟ic into batches Xi of 50 images, we optimise the mean batch loss µ(𝔏(Xi)) using the algorithm described in [13], with the hyperparameters suggested in that paper. The learning rate is set at the initial value of 0.005.
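The loss can be written out numerically as follows. The inverse-class-frequency weighting below is our reading of the (incompletely rendered) formula in the source; treat the exact normalisation as an assumption.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax (temperature fixed to 1, see Remark 2)."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def weighted_cross_entropy(activations, label_index, class_counts):
    """Weighted cross-entropy for one sample (sketch of the DCD loss).

    activations: raw outputs of the last layer, one per class.
    label_index: index of the true class (0 = crossing, 1 = individual).
    class_counts: number of training samples per class. The weight is the
    inverse class frequency, so the rarer class weighs more; the exact
    normalisation here is our assumption.
    """
    counts = np.asarray(class_counts, dtype=float)
    weights = counts.sum() / (len(counts) * counts)
    p = softmax(np.asarray(activations, dtype=float))
    return -weights[label_index] * np.log(p[label_index])
```

With balanced classes the weights reduce to 1 and the expression becomes the ordinary cross-entropy; with an unbalanced dataset, errors on the minority class are penalised more.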
Remark 2. (On the softmax function) In general, the softmax is equipped with an extra parameter called temperature. We omit discussing it in the formula, since we always set it to 1.
Training and validation set
Before training, the dataset of sure crossing and sure individual images is split into two parts: 90% of the images are used for training, i.e. the weights of the network are updated to minimise the error (loss function) with respect to the labels associated with this set of images. We call this portion of the dataset the training set, denoted by T. The remaining 10% of images, the validation set V, are used to evaluate the generalisation power of the network. For this reason, the performance of the model on the validation set is used to stop its training. In the paragraph on training stopping criteria below, we discuss in more detail the algorithm used to stop the training of the network.
Accuracy
We measure the accuracy of the network by comparing the predictions generated by the softmax computed on the activations of the last layer of the network with the labels associated with each image in both the training and the validation set. Hence, let | V | be the number of images in the validation set, PV = {p1,…, pn} the ordered predictions generated by a forward pass of these images through the network, and LV = {l1,…, ln} the corresponding labels. Let AV be the set of correct predictions, defined as AV = {pi s.t. pi = li, for pi ∈ PV, li ∈ LV}. We define the overall accuracy of the network as

AccV = |AV| / |V|.     (D.3)

We will also take into account the accuracy on each of the inferred classes. Let c* be a class (in this case c* could be either the crossing or the individual class). The set AV (c*) = {pi ∈ AV s.t. li = c*} corresponds to the predictions equal to their associated labels and attributed to the particular class c*. In this case the class-accuracy is defined as

AccV (c*) = |AV (c*)| / |V (c*)|,     (D.4)

where |V (c*)| is the number of validation images labelled with c*. Symmetrically, we define the error and the class-error as 1 - AccV and 1 - AccV (c*), respectively.
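These accuracy measures can be sketched as follows (helper names are ours; the class-accuracy denominator is the number of validation samples of that class, as in eq. (D.4)):

```python
import numpy as np

def overall_accuracy(preds, labels):
    # fraction of predictions equal to their labels (overall accuracy)
    preds, labels = np.asarray(preds), np.asarray(labels)
    return float(np.mean(preds == labels))

def class_accuracy(preds, labels, c):
    # accuracy restricted to the samples whose label is class c
    preds, labels = np.asarray(preds), np.asarray(labels)
    mask = labels == c
    return float(np.mean(preds[mask] == labels[mask]))
```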
Training stopping criteria
While training the network, we verify the goodness of its outputs by computing both the loss function and the accuracy on the validation set V (see the previous sections). This procedure gives reasonable control over the actual classification power of the network on new unlabelled images. It is crucial to stop the training of the network to prevent two main behaviours. On the one hand, we want to prevent overfitting: an overly exact representation of the training data that would prevent the network from generalising to new data points. On the other hand, it is desirable to stop the training in case the error cannot be further minimised, i. e. the loss function has reached a plateau.
More formally, we call an epoch a complete training pass on the set T, concluded with the evaluation (of both loss and accuracy) on the validation set V. Let 𝔏i (T) and 𝔏i (V) be the values of the loss function on T and V after the epoch i. We define

di* = 𝔏i* (V) - (1/10) Σ_{i = i*-10}^{i*-1} 𝔏i (V),

the difference between the loss value in validation at epoch i* and the mean of the loss values of the previous 10 epochs. We stop the training at epoch i* > 10 if one of the following conditions holds.
a) The network is overfitting: di > 0 for every epoch i with i* - 5 < i ⩽ i*;
b) the loss reached a plateau: |di*| < 0.05 · 10^(log10 (𝔏i* (V)) - 1);
c) the network reached class-accuracy 1 on all the classes, for every sample in the validation set (see eq. (D.4));
d) the loss is zero: 𝔏i* (V) = 0.
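The four stopping conditions can be sketched as a single predicate. This is a simplified reconstruction: the 10-epoch mean, the 5-epoch overfitting window and the plateau threshold follow the text above; everything else (names, data layout) is an assumption.

```python
import math

def should_stop(val_losses, val_class_accuracies):
    """val_losses: validation loss per epoch; val_class_accuracies:
    per-class accuracy on the validation set after the last epoch."""
    if len(val_losses) <= 10:
        return False
    current = val_losses[-1]
    if current == 0.0:                                  # (d) zero loss
        return True
    if all(a == 1.0 for a in val_class_accuracies):     # (c) perfect class accuracy
        return True
    # d_i: validation loss at epoch i minus the mean of the 10 previous losses
    deltas = [val_losses[i] - sum(val_losses[i - 10:i]) / 10.0
              for i in range(10, len(val_losses))]
    if len(deltas) >= 5 and all(d > 0 for d in deltas[-5:]):
        return True                                     # (a) overfitting
    if abs(deltas[-1]) < 0.05 * 10 ** (math.log10(current) - 1):
        return True                                     # (b) plateau
    return False
```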
Crossing detection
Let Δ be the set of parameters learnt by training the DCD as described above, and let us denote the trained model as DCD (Δ). We create the test set 𝕋 of unlabelled images by considering all the images that are neither sure individuals nor sure crossings. The trained model acts as a function taking as input an image I ∈ 𝕋 and outputting a prediction as the softmax computed on the activations of the last layer of DCD (Δ). We recall that the softmax is the function s (ai) = e^ai / Σj e^aj, where ai is the activation of the ith unit of the last layer. Since the DCD classifies images in two classes, its last layer is composed of two units. Hence, given an image I, we obtain DCD (Δ) (I) = (s1, s2), where for brevity we set si = s (ai). If s1 > s2 the image is classified as a crossing, and as an individual otherwise.
Exceptions
It is possible that during training the loss value diverges to infinity. In this case a warning is produced, and the algorithm falls back to a crossing-individual image discrimination process based only on a model of the area of individual blobs (see section D.2.1). In case the criteria forcing the training to stop are never reached, we set a maximum of 100 epochs for the training of the DCD. If this threshold is reached, a warning is produced and the training is stopped. The parameters computed in the last iteration are then used to classify individual and crossing images.
D.3 Fragmentation
At this stage of the algorithm, the images segmented from the video (see section D.1) are labelled either as individual or crossing, following the protocols described in section D.2. A careful dynamical analysis of the segmented blobs makes it possible to create collections of images associated with the same individual (or crossing) in subsequent frames. In the remainder, we refer to these collections as individual and crossing fragments. See Supplementary Figure 9 for an example of fragments and their decomposition into individual and crossing components.
The method used to create these fragments relies, on the one hand, on the classification of the images into crossing and individual categories and, on the other hand, on the overlap of the blobs associated with these images. We start by introducing some notation. Then, we describe the algorithms to generate individual and crossing fragments in two separate sections.
Let B = {bi,j} be the collection of segmented blobs, where the first index of the elements of B corresponds to the frame number, and the second to the order in which the blobs have been segmented. We recall that, given two blobs b1 and b2, we say that they overlap if and only if the intersection of their sets of pixels is not empty. Following the notation introduced in section D.2.2, given a blob bi,j ∈ B we call P (bi,j) and N (bi,j) the collections of blobs overlapping with bi,j in frames (i - 1) and (i + 1), respectively.
D.3.1 Individual fragments
We iterate over the elements of B = {bi,j} proceeding by frame number i and then following the natural ordering induced by the second index. Let bi,j be a blob associated with an image labelled as an individual. We create individual fragments by considering only the future overlapping history of bi,j. If bi,j is not yet part of any individual fragment, we associate with bi,j a unique fragment identifier α (i. e. bi,j is the blob initiating an individual fragment). To simplify the notation, let bi,j = b, P = P (b) and N = N (b). We consider two cases:
case 1: |N| > 1. The blob b in frame i overlaps with more than one blob in frame i + 1; hence it is the only blob (and image) associated with the individual fragment α.
case 2: |N| = 1. Let nb be the unique element of N. The fact that b overlaps with a single blob in its future history is a necessary condition for nb to be part of the same fragment as b, but not sufficient: it could be that nb overlaps with more than one blob in frame i, i. e. |P (nb)| > 1. Thus we say that nb is in the same individual fragment as b if and only if |N (b)| = 1 and |P (nb)| = 1. We also require the image of nb to be labelled as an individual.
If case 1 is verified we generate a new fragment identifier and continue iterating on the elements of B. Otherwise, we apply the same algorithm to nb in order to enlarge the individual fragment α as much as possible. We stop adding blobs to the fragment whenever, during the iteration, a new candidate blob fulfils the condition in case 1. See algorithm 1 for the pseudocode.
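The fragment-construction loop above can be sketched as follows. The blob data structure and its field names are hypothetical; overlap lists (`prev`, `next`) are assumed precomputed from the pixel overlap of consecutive frames.

```python
def build_individual_fragments(blobs):
    """blobs: dict mapping (frame, index) -> {'is_individual': bool,
    'next': [keys of overlapping blobs in frame+1],
    'prev': [keys of overlapping blobs in frame-1]}.
    Returns a dict mapping blob key -> fragment identifier."""
    fragment_of = {}
    next_id = 0
    for key in sorted(blobs):                 # by frame, then segmentation order
        if not blobs[key]['is_individual'] or key in fragment_of:
            continue
        fragment_of[key] = next_id            # blob initiating fragment alpha
        cur = key
        # extend while there is exactly one overlapping blob in the next frame,
        # that blob has a single past overlap, and it is an individual
        while True:
            nxt = blobs[cur]['next']
            if len(nxt) != 1:
                break
            nb = nxt[0]
            if len(blobs[nb]['prev']) != 1 or not blobs[nb]['is_individual']:
                break
            fragment_of[nb] = next_id
            cur = nb
        next_id += 1
    return fragment_of
```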
D.3.2 Crossing fragments
In the same setting as the previous section, let bi,j = b be a blob associated with a crossing image. If b is not yet equipped with a crossing fragment identifier, we generate a new identifier β. The conditions are almost identical to the individual fragments' case:
case 1: |N| > 1. The blob b in frame i overlaps with more than one blob in frame i + 1, hence the crossing represented by b is splitting. Thus b is the only blob associated with the crossing fragment β.
case 2: |N| = 1, |P (nb)| = 1 and nb is associated with a crossing image. We add nb to the crossing fragment β.
In the second case, we try to extend the crossing fragment simply by iterating the algorithm on nb, and verifying that both P (nb) and N (nb) have cardinality 1, and the unique element of N (nb) is associated with a crossing image.
The pseudocode presented in algorithm 1 can easily be adapted to work with crossing fragments by modifying the if and while conditions.
D.4 Cascade of training/identification protocols
After fragmentation has finished, the training of the identification network begins. We would first like to give the reader some intuition regarding why it is possible to train an identification network in an automated manner. First imagine that we had at our disposal an all-knowing black box that looked at the set of fragments we have compiled from one part of the video and then told us which fragment belonged to which individual. Remember that each fragment contains an entire set of images belonging to a single individual. Therefore, thanks to the information coming from the black box, we would effectively have at our disposal a set of labelled images, and we could use standard supervised learning to train a classifier that tells the individuals apart. The trained network could then be used on other parts of the video to do identification.
In real life, we do not have access to such a source of information, so we need heuristics to generate our training dataset. To understand our heuristic, let us again remember that each fragment is supposed to contain images belonging to a single individual. We also know the total number of individuals in the video, as this number is specified by the user. Consequently, if we can find one frame of the video where the number of fragments present is equal to the total number of individuals, then we can be sure that each visible fragment in that frame belongs to a separate individual. We can then label the images within each fragment with the same label, while every fragment of course has a different label. Next, we train our network on the resulting dataset of images and labels. This is the intuition behind how we achieve our aim without the help of an omniscient black box.
In order to train the identification network, we have designed three training protocols. The first protocol is the fastest and is able to deal with videos where animals are relatively well separated (crossings are not too frequent). The other two protocols handle more difficult scenarios, where crossings may be frequent, the lighting intensity may drift over time, or the animals may change their features throughout the video (e. g., posture, colour).
Each protocol relies on the information acquired and structures defined in the previous ones. In the following sections we will introduce some definitions and the main elements on which the fingerprint protocols are built. Then, we will discuss each of the three protocols from the simplest and fastest, to the most general and computationally expensive one.
D.4.1 Global fragments
All the protocols rely on a strong, fundamental hypothesis: To learn the features characterizing each individual and consequently identify it, there must exist at least one portion of the video in which all animals are visible and separated.
Let 𝒱 be a video in which the aforementioned condition is fulfilled in frame number i. We define a global fragment as the collection of individual fragments (see section D.3.1) whose images contain the ones extracted from the ith frame of the video, and which counts the same number of individual fragments as the number of individuals to be tracked. We call the minimum frame number in which this condition is satisfied the core of the global fragment. See Supplementary Figure 10 for a visual representation of a global fragment. We denote by 𝒢 the set of all the global fragments in 𝒱 whose shortest individual fragment counts at least 3 images.
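A simplified sketch of how candidate cores of global fragments can be detected from the frame spans of the individual fragments (the flat representation and the function name are ours):

```python
def find_global_fragment_cores(fragments, n_animals):
    """fragments: list of (start_frame, end_frame) spans of individual
    fragments (inclusive). Returns the frames in which exactly n_animals
    individual fragments coexist, i.e. candidate cores of global fragments."""
    if not fragments:
        return []
    last = max(end for _, end in fragments)
    cores = []
    for frame in range(last + 1):
        # count individual fragments whose span covers this frame
        if sum(start <= frame <= end for start, end in fragments) == n_animals:
            cores.append(frame)
    return cores
```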
D.4.2 Identification network
All the fingerprint protocols aim at finding different strategies in order to create datasets of images of the animals labelled with their identities. These datasets will be created from one, or a collection of global fragments and used to train the identification CNN, denoted in the remainder as idCNN. In the following paragraphs we define the architecture, the hyperparameters and algorithms used to train the idCNN.
Preprocessing
The images used to train and test the idCNN are preprocessed with an algorithm similar to the one described in section D.2.3. The images are obtained, aligned and standardised in the same way. The only difference is that the images used to train the DCD are resized to squares of size 40 × 40, while the side of the square training images of the idCNN is set according to the estimated body length of the animals. The body length is estimated as the median of the diagonals of the images associated with the individual blobs.
Architecture
See Supplementary Table 1 (identification convolutional neural network). Both convolutional and fully-connected layers are initialised using Xavier initialisation [12]. Biases are initialised to 0.
Loss function
The loss is the weighted cross-entropy of eq. (D.2). The dataset given by a global fragment is potentially unbalanced: every individual fragment Fi ∈ G contains a certain number of images, say ni. For every Fi ∈ G, we compute the weight wi of eq. (D.2) as

wi = (Σj nj) / ni,

where j varies over the fragments F1,…, Fn in G. Thus, a larger loss is associated with individuals less represented in the dataset. We optimise using stochastic gradient descent, setting the learning rate to 0.005.
Training and validation set
Every individual fragment Fi ∈ G can be written as Fi = {(I1, l1), …, (In, ln)}, where (Ij, lj) is a pair such that the label lj is the identity of the individual depicted in the image Ij. The dataset generated from G is given by the union 𝒟G = ∪iFi. After a random permutation of the pairs (Ij, lj), performed to remove any temporal correlation between the images, we split 𝒟G into the training and validation sets, denoted by T and V and composed of 90% and 10% of the available data, respectively.
Accuracy
The accuracy of the network is computed as the number of successfully classified images over the total number of images, according to eq. (D.3). We measure the single class accuracy following eq. (D.4). This second expression is fundamental when dealing with large groups, in order to evaluate the capability of the network to distinguish each of the individuals.
Training stopping criteria
See section D.2.3.
Exceptions
It is possible that during training, the loss value diverges to infinity. In this case an error is raised and the algorithm stops its execution. Advanced users have the possibility to change the parameters of the idCNN (e. g. learning rate, dropout, number of units per layer).
If the training of the idCNN is not stopped before 10000 epochs (passes over the entire dataset) a warning is produced and the training is stopped. The parameters computed in the last epoch will be used to continue the fingerprint protocol cascade.
D.4.3 Protocol 1: One-global-fragment tracking
This protocol is based on the features learnt by considering the images belonging to a single global fragment. Thus, it is likely to be successful when the individual images are uniform along the entire video.
Choosing the global fragment
Since the network will be trained on a single global fragment, its choice is fundamental. We aim at selecting the global fragment whose individual fragments are sets of images with high variability, in order to be as close as possible to the setting described in Supplementary Figure 1, where images are subsampled from the entire video and hence uncorrelated in time.
We define the distance travelled in a global fragment G as the minimum of the distance travelled in its individual fragments, as described in algorithm 2.
We choose the global fragment realising the maximum of the minimum distance travelled, denoted Gσ(1). This procedure does not guarantee that the chosen global fragment is the one whose images are maximally variable. However, there is a natural correlation between the distance travelled by an animal and the variability of the images stored in the corresponding fragment.
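The max-min criterion can be sketched as follows (a NumPy sketch with a hypothetical data layout: each global fragment is given as a list of centroid sequences, one per individual fragment):

```python
import numpy as np

def distance_travelled(centroids):
    # sum of pixel distances between consecutive centroids of one
    # individual fragment
    c = np.asarray(centroids, dtype=float)
    return float(np.sum(np.linalg.norm(np.diff(c, axis=0), axis=1)))

def choose_global_fragment(global_fragments):
    # score each global fragment by the minimum distance travelled among
    # its individual fragments, then pick the maximum of these minima
    scores = [min(distance_travelled(f) for f in g) for g in global_fragments]
    return int(np.argmax(scores))
```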
Training
After choosing Gσ(1), we label the images belonging to each of its individual fragments with random, unique identities (think of increasing natural numbers). From this dataset of labelled images we create the training and validation sets, as described in section D.4.2, and train the idCNN as specified in section D.4.2. The training is interrupted automatically when one of the conditions in section D.4.2 is satisfied. Let us call θ0 the set of parameters of the idCNN after training, and denote the trained model as idCNN (θ0).
Single image identification
We recall that a trained neural network acts as a function. We use the trained idCNN to identify the individual fragments not used for training. For every image I in an individual fragment F, we compute idCNN (θ0) (I), obtaining SI = {s1,…, sn}, where sj is the softmax computed on the activation of the jth unit of the last fully-connected layer. The idCNN's last layer has as many units as the number of individuals to be tracked (see Supplementary Table 1, idCNN). By definition, Σ_{s ∈ SI} s = 1. Thus, SI can be interpreted as the probability of the input image representing each of the individuals. Each image is labelled with the identity realising the maximum of the softmax: id (I) = argmax (SI).
Remark 3. Given the relatively low number of parameters of idCNN (≈ 200000, see Supplementary Table 1, idCNN), and the usage of GPU computing, the single image identification step is time efficient, even when dealing with large groups of animals.
Computing the identity probability mass function
When considering an individual fragment, it is natural to take advantage of the hypothesis that all its images are associated with the same individual. We follow the assumptions and logic already presented in [1, Supporting text, Section 3.1]. Let ΛF = (f1,…, fn) be the vector of identification frequencies associated with F, i. e. the vector whose ith component corresponds to the number of images of F assigned to the ith individual, computed as in algorithm 3. Under the assumption that the images in F are independent and that the probability of assigning one image to the correct individual is twice as large as the probability of assigning it to any of the incorrect individuals, we compute for every identity i ∈ {1,…, n}

P1 (F, i) = 2^ΛF(i) / Σ_{j=1}^{n} 2^ΛF(j),     (D.7)

where ΛF (i) is the ith component of the vector ΛF.
The vector P1 (F) = (P1 (F, 1), …, P1 (F, n)) is the probability mass function of F being identified as one of the individuals.
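Computing this probability mass function from the identification frequencies is a one-liner; in the sketch below the shift by the maximum frequency is only for numerical stability with long fragments (function name is ours):

```python
import numpy as np

def p1_vector(frequencies):
    """frequencies: number of images in the fragment assigned to each
    identity. Each image is assumed twice as likely to be assigned to the
    correct identity as to any single incorrect one, giving weights 2**f_i."""
    f = np.asarray(frequencies, dtype=float)
    w = np.exp2(f - f.max())   # proportional to 2**f_i, numerically stable
    return w / w.sum()
```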
Model quality check and identification of individual fragments in global fragments
While identifying the individual fragments belonging to the global fragments not used to train idCNN (θ0), we evaluate the goodness of the model. For every G ∈ 𝒢 \ {Gσ(1)} we proceed by verifying the following conditions, providing at the same time a temporary identification of the individual fragments in G:
G is certain: A global fragment G is certain if all the individual fragments Fi in G are also certain. A fragment F is certain if cert (F) ⩾ 0.1. Let a and b be the indices (identities) realising the first and second maximum of P1 (F) respectively, and Sj the vector of softmax values of all the images assigned to the index j in the fragment F. The function cert (F) is defined as

cert (F) = (⟨Sa⟩ P1 (F, a) - ⟨Sb⟩ P1 (F, b)) / (⟨Sa⟩ P1 (F, a) + ⟨Sb⟩ P1 (F, b)),     (D.8)

where ⟨Sj⟩ denotes the median of the entries of Sj.
Temporary identification of the individual fragments: Let us consider the entire collection of individual fragments {Fi}_{Fi ∈ G}. We start by reordering it according to the maximum value of each P1 (Fi). Let us denote the reordered collection of individual fragments as ℱ = (Fρ(1),…, Fρ(m)). We iterate on the individual fragments indexed as in ℱ. So, Fρ(1) is the individual fragment having maximum probability of being identified as the individual with identity ι = argmaxi P1 (Fρ(1), i). We set ι to be the temporary identity of Fρ(1) if two conditions are met:
There is no identified individual fragment coexisting with Fρ(1) with identity ι.
The value P1 (Fρ(1), ι) is above a minimum threshold that depends on |Fρ(1)|, the number of images in Fρ(1).
We proceed to the next iteration by considering Fρ(2). If one of the aforementioned conditions is not satisfied, the individual fragment is marked as non-consistent, and so is the entire global fragment G.
G is unique: A global fragment G is unique if the temporary identities of the individual fragments Fi in G are all distinct within the global fragment.
We iterate on the global fragments in 𝒢 \{Gσ(1)} by sorting them from the nearest to the farthest with respect to the distance in frames between their core and the core of Gσ(1). See section D.4.1 for the definition of the core of a global fragment.
We now consider the number of images in all the individual fragments that are part of a global fragment. We will refer to this set of images in the remainder as the global fragments’ images or the images in global fragments. If at least 99.95% of these images are contained in global fragments considered acceptable with respect to the conditions listed in the previous paragraph, we interrupt the cascade of protocols and we pass to the residual identification described in section D.5. Otherwise, the second protocol is put in place.
D.4.4 Protocol 2: Global-fragments-accumulation
The main aim of this protocol is to accumulate the images belonging to those global fragments, detected during protocol 1, that are simultaneously certain, consistent and unique. By iterating this procedure, it is possible to incorporate new images into the labelled dataset used to train the idCNN. This accumulation procedure makes it possible to learn features that capture the individuals' variability throughout the video. See Supplementary Figure 12 for the flow of the algorithm of this protocol.
Global accumulation
Let A-1 = {Gσ(1)} be the set containing the first global fragment used for training, and A0 = {G1,1,…, G1,n} ⊂ 𝒢 \ {Gσ(1)} be the subset of global fragments that meet the conditions described in section D.4.3.
First, we fix the identities of the individual fragments belonging to the global fragments in A0, since the images associated with these individual fragments will be used to train the idCNN.
We build the dataset 𝒟A0 by considering all the labelled images contained in every global fragment in A-1 ∪ A0. Note that, since an individual fragment can be shared by several global fragments, its images are collected only once.
𝒟A0 is then split into the training and validation sets (T1 and V1), according to the proportions specified in section D.4.2 and an additional constraint: in the training and validation sets every individual can be represented by at most 3000 images. If the number of images associated with a certain individual in 𝒟A0 exceeds this threshold, 3000 images are randomly subsampled from this collection, taking 1800 samples from the images previously accumulated (images in A-1) and the remaining 1200 from the new set of accumulated images (A0).
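The per-individual cap can be sketched as follows. Names and the fixed seed are ours; the real procedure changes the subsampling permutation at every accumulation iteration (cf. Remark 4).

```python
import random

def subsample_identity_images(old_images, new_images, cap=3000, old_cap=1800):
    """Cap the number of images per individual, keeping at most `old_cap`
    previously accumulated images and filling the rest with new ones."""
    if len(old_images) + len(new_images) <= cap:
        return old_images + new_images
    rng = random.Random(0)  # fixed seed only for reproducibility of this sketch
    n_old = min(len(old_images), old_cap)
    n_new = min(len(new_images), cap - n_old)
    return rng.sample(old_images, n_old) + rng.sample(new_images, n_new)
```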
Remark 4. At every iteration of the accumulation, the permutation used to subsample the images representing the same individual changes in order to train the idCNN with maximally variable images.
We train the network using the stopping criteria listed in section D.2.3. Following the notation introduced in section D.4.3, we denote the idCNN model obtained after training as idCNN (θ1). The accumulation process is iterated by defining Ai as the set of acceptable global fragments in 𝒢 \ (A-1 ∪ … ∪ Ai-1). The set of new acceptable global fragments is computed by identifying the individual fragments not used for training and applying the procedure described in section D.4.3.
Partial accumulation
Partial accumulation is a riskier accumulation strategy: it allows single individual fragments, rather than entire global fragments, to be included in the dataset of accumulated images. For this reason, before applying this strategy, we require that more than half of the images contained in the set of global fragments have been accumulated via global accumulation. Assume that this condition is reached at iteration i. Then an individual fragment F ∈ G, for some global fragment G ∈ 𝒢, is accumulated if:
1. cert (F) > 0.1;
2. letting γ (F) be the set of individual fragments that coexist with F in at least one frame of the video, at least half of the elements of γ (F) have already been accumulated;
3. the identity of F is coherent with all the individual fragments in γ (F), i. e. the assignment of a certain identity to F does not create duplications.
If all these conditions are met, F is added to the set of accumulated images as a single individual fragment.
Accumulation stopping criteria
We stop both the global and the partial accumulation processes if one of the two following conditions holds:
99.95% of the images in global fragments have been accumulated;
there are no more acceptable global or individual fragments.
Evaluation of the accumulation
If the number of images accumulated in the last iteration is less than 90% of all the images in the global fragments, the accumulation is considered not acceptable and the third protocol is used. Otherwise we proceed to the identification of the individual fragments not identified during accumulation, see section D.5.
D.4.5 Protocol 3: pretraining and accumulation
This last protocol makes it possible to learn the features of the images representing the individuals globally, by taking advantage of their local organisation in global fragments.
Pretraining
Given a video 𝒱, let 𝒢 be the set of global fragments as defined in section D.4.1. We rewrite the idCNN by considering it as the juxtaposition of its convolutional part idCNNc and its fully-connected part idCNNf, with sets of parameters Γ and Φ, respectively. Here follows the list of processes involved in the pretraining algorithm:
(i) Consider the set σ (𝒢) ={Gσ(1),…, Gσ(n)} of global fragments 𝒢 ordered by distance travelled. See section D.4.3.
(ii) Iterate on the elements of σ (𝒢). Let 𝒟Gσ(i) be the dataset of labelled images built at the ith iteration. Assign a random unique identity to each individual fragment: the aim is to learn features, and to classify the individuals only locally. Generate both the training and validation sets as in section D.4.2.
(iii) Train the model using, for the convolutional part, the parameters Γσ(i-1) learnt during the previous iteration, and reinitialise idCNNf. This step makes it possible to learn convolutional filters optimised on the task of distinguishing the animals in Gσ(i-1) based on their local labelling in the global fragment.
(iv) Conclude the training according to the conditions listed in section D.4.2.
(v) Iterate on σ (𝒢) until 95% of the images stored in the global fragments have been used to train the network.
Accumulation parachute
After pretraining, we start the accumulation of reference images as in section D.4.4, but keeping the parameters of idCNNc learnt during pretraining frozen along the entire accumulation. Thus, in the first step of this second accumulation we reinitialise only the fully-connected part of the idCNN. With these settings, we apply the accumulation protocol, updating only the parameters Φ and starting from the global fragment Gσ(1).
If more than 90% of the images in the global fragments are accumulated during the accumulation, we proceed to the identification of the individual fragments not used for training (section D.5). Otherwise, we repeat the accumulation starting from Gσ(2). If the accumulation fails in this case as well, we repeat it with Gσ(3) as a basis. Finally, we end the deep protocol cascade by selecting the accumulation in which the largest number of images has been used for training and hence already identified. Using the parameters of the idCNN learnt in the chosen accumulation, we proceed to the identification of the remaining individual fragments.
Remark 5. As pointed out in section D.4.3, the computation of the distance travelled cannot guarantee that the images in Gσ(1) are maximally uncorrelated. Hence, rather than assigning identities with a non-optimal model, it is important to try to learn starting from different global fragments, which could incorporate images whose features are key to maximising the number of accumulated global fragments.
D.5 Residual identification
After the fingerprint protocol cascade, it is necessary to identify those individual fragments that could not be accumulated, either because they are not included in any global fragment, or because they gave a low certainty value during testing. We recall that all the individual fragments already accumulated are endowed with both an identity and the P1-vector. See section D.4.3 and eq. (D.7) for details about the identification of individual fragments during accumulation.
D.5.1 Non-accumulated images identification
Let 𝒰 = {F1,…, Fn} be the set of individual fragments that were not identified during the protocols described in section D.4. First, we assign an identity to every image I in every Fi ∈ 𝒰 by passing it through idCNN (θfinal), where θfinal are the parameters learnt during the fingerprint protocol cascade. Then P1 (Fi) is computed for every Fi ∈ 𝒰 according to section D.4.3 and eq. (D.7).
D.5.2 Identification of non-accumulated individual fragments
When assigning the identity of an individual fragment, it is desirable to take advantage of the fact that the same identity cannot be assigned to two fragments that coexist in time. Given an individual fragment F, let 𝒞 (F) be the set of identified individual fragments coexisting with F (excluding F itself), such that every element of 𝒞 (F) is equipped with a P1-vector. We integrate the information coming from the identified fragments coexisting with F by following the approach of [1, Supporting text, Section 3.1]. We define the probability of the fragment F being assigned the identity i as

P2 (F, i) = P1 (F, i) Π_{F̃ ∈ 𝒞(F)} (1 - P1 (F̃, i)) / Σ_{k=1}^{n} P1 (F, k) Π_{F̃ ∈ 𝒞(F)} (1 - P1 (F̃, k)),     (D.9)

where n is the total number of animals.
Furthermore, we define the identification certainty of F as

cert (F) = (⟨Sa⟩ P1 (F, a) - ⟨Sb⟩ P1 (F, b)) / (⟨Sa⟩ P1 (F, a) + ⟨Sb⟩ P1 (F, b)),

where a and b are the indices (identities) that realise the first and second maximum of P1 (F), respectively, and ⟨Sj⟩ is the median of the softmax values of the images of F assigned to identity j.
We compute P2 (F) = (P2 (F, 1), …, P2 (F, n)) and the identification certainty for every individual fragment F ∈ 𝒰. We proceed to identify the fragments in 𝒰 from higher to lower values of the certainty. For every fragment we assign the identity ι = argmaxi (P2 (F, i)). If there are two identities realising the maximum of P2, we do not identify the fragment (in the GUI these fragments are indicated with the identity 0 and the colour black). If the fragment F is identified, say with identity i, we set P1 (F, i) = 1 and P1 (F, j) = 0, ∀j ≠ i. Then, we recompute P2 and the certainty for every identified fragment coexisting with F. According to eq. (D.9), all these coexisting fragments will have P2 ( · , i) = 0. This prevents the assignment of the same identity to multiple coexisting individual fragments.
The process is iterated on 𝒰 \ {F}, until all the fragments are either equipped with an identity or deemed unsuitable for identification.
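The P2 update of eq. (D.9) can be sketched as follows (a NumPy sketch with names of our choosing; the degenerate case where every identity is excluded by coexisting fragments is not handled):

```python
import numpy as np

def p2_vector(p1_fragment, p1_coexisting):
    """P2 down-weights identities already supported by coexisting fragments:
    P2(F, i) is proportional to P1(F, i) times the product over coexisting
    fragments of (1 - P1(., i)), renormalised over identities.
    p1_coexisting: list of P1 vectors of fragments coexisting with F."""
    p2 = np.asarray(p1_fragment, dtype=float).copy()
    for q in p1_coexisting:
        p2 *= 1.0 - np.asarray(q, dtype=float)
    return p2 / p2.sum()
```

Note how a coexisting fragment already assigned to identity i (P1 = 1 for that identity) drives P2(F, i) to zero, as described above.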
D.6 Post-processing
The training/identification protocols and the residual identification assign an identity to as many individual fragments as possible. The methods involved in the post-processing stage of the algorithm correct trivial identification mistakes and, thereafter, identify the individuals involved in crossings.
These processes are described in detail in the following sections; here, we provide an intuition about the algorithms involved. It is possible to correct trivial identification errors by considering adjacent individual fragments assigned to the same individual. If the individual would have to reach an unrealistic speed in order to move from its position at the end of a fragment to the position corresponding to the beginning of the next one, the identification is assumed to be incorrect. A series of heuristics then either assigns a new (not necessarily different) identity to the fragments involved, or leaves them unidentified.
The idea underlying the identification of crossings is essentially an informed interpolation of the individual trajectories. First, we consider a blob associated with a crossing and work out the identities of the crossing individuals by trying to split the blob through successive erosions. If the blob splits into smaller parts (sub-blobs), we try to link each sub-blob to an already identified individual fragment using two conditions. On the one hand, we evaluate the possible overlap of the sub-blobs with identified individual blobs segmented in either the next or the previous frame. In case the overlapping strategy fails, we seek individual blobs in frames adjacent to the considered crossing that can be linked to the sub-blob by using speed constraints similar to the ones discussed in the previous paragraph.
D.6.1 Evaluate unrealistic identifications at fragment boundaries
Individual fragments are defined by considering the overlap of blobs segmented from consecutive frames (see section D.3). Let us denote the frame numbers spanned by a fragment F as [fs, fe], where fs is the number of the frame from which the first blob associated with F has been segmented. We say that two individual fragments F1 and F2 are consecutive if they share the same identity, say i, and f1e < f2s. We aim to evaluate the quality of the identification of such fragments by comparing a model of the stereotypical speed of the individuals in the video with the speed that individual i would need to travel from its position at frame f1e to its position at frame f2s.
Computation of the stereotypical speed: The stereotypical speed is computed from the speed of the animals in every individual fragment, as follows:
1. For every individual fragment F, let (b1,…, bn) be the blobs collected in F, and (c1,…, cn) their centroids.
2. We compute the speed of the animal in F by considering the distance in pixels between subsequent centroids. Namely, vi = d (ci, ci+1) for i ∈ { 1,…, n - 1 }.
3. We define vmax = P99 (V), where V collects the speeds computed from every individual fragment F.
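The three steps above can be sketched in Python (numpy only; we assume fragment centroids are given as lists of (x, y) pixel coordinates):

```python
import numpy as np

def fragment_speeds(centroids):
    """Frame-to-frame speeds (pixels/frame) within one individual fragment:
    v_i = d(c_i, c_{i+1})."""
    c = np.asarray(centroids, dtype=float)
    return np.linalg.norm(np.diff(c, axis=0), axis=1)

def stereotypical_max_speed(fragments):
    """v_max: the 99th percentile of the pooled speed distribution V."""
    all_speeds = np.concatenate([fragment_speeds(c) for c in fragments])
    return np.percentile(all_speeds, 99)

# Toy example: mostly slow motion plus one fast 19-pixel jump
frag_a = [(0, 0), (1, 0), (2, 0), (3, 0)]
frag_b = [(10, 10), (11, 10), (30, 10)]
v_max = stereotypical_max_speed([frag_a, frag_b])
```

Using the 99th percentile rather than the maximum makes v_max robust to a few spurious centroid jumps caused by segmentation noise.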
Evaluation of consecutive fragments: We set to immutable the identity of all the fragments that have either been identified during the deep protocol cascade, or whose identity has been assigned during the residual identification with max_i P2(F, i) ⩾ 0.9.
Let F1 and F2 be the consecutive fragments described above. The speed at the boundary needed to connect the two fragments is realistic if it does not exceed vmax. In order to test and correct for unrealistic connecting speeds, we proceed as follows. We iterate on the collection of individual fragments, separating them into two subsets: first the individual fragments whose last frame precedes the core of the first global fragment used for training (see section D.4.3), then the others. Let us consider a generic individual fragment F spanning frame numbers [fs, fe]. We check whether there exist fragments Fp and Fn sharing the identity of F and defined, respectively, in the past and in the future. If no such fragments exist, we proceed with the iteration. Otherwise, we evaluate the boundary speeds vp (between Fp and F) and vn (between F and Fn), and distinguish the following cases:
vp > vmax and vn > vmax: If the identity of Fp or Fn is fixed, or if neither the connection of Fp with its previous consecutive fragment nor the connection of Fn with its next consecutive fragment is unrealistic, we set F to be reidentified.
vp > vmax and vn ⩽ vmax: If the identity of Fp is fixed, F is set to be reidentified. Otherwise, Fp is.
We proceed symmetrically in the case vp ⩽ vmax and vn > vmax.
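A minimal sketch of the boundary-speed test; the function names and the normalisation by the frame gap are our assumptions, and the actual implementation operates on fragment objects:

```python
import math

def boundary_speed(end_centroid, end_frame, start_centroid, start_frame):
    """Speed (pixels/frame) needed to travel from the end of one
    fragment to the start of the next one with the same identity."""
    dist = math.dist(end_centroid, start_centroid)
    return dist / max(start_frame - end_frame, 1)

def is_realistic(speed, v_max):
    """A boundary speed is realistic if it does not exceed v_max."""
    return speed <= v_max

# F1 ends at frame 100 at (50, 50); F2 starts at frame 102 at (58, 56):
# 10 pixels covered in 2 frames -> 5 pixels/frame
v = boundary_speed((50, 50), 100, (58, 56), 102)
```

Fragments connected by an unrealistic boundary speed are flagged for the reidentification procedure described next.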
Reidentification of unrealistic consecutive fragments: Let F be an individual fragment to be reidentified. We compute the set A of available identities, i.e. the identities not assigned to the individual fragments coexisting with F, plus the identity of F itself. If |A| = 1, we assign the only available identity to F. Otherwise we proceed by calculating:
The subset S ⊆A of available identities that would not imply unrealistic boundary speeds given F.
The set Q = { i ∈ A s.t. P2(F, i) > ρ(F) }, where ρ(F) is a threshold that depends on n, the number of tracked animals.
The set C = Q ∩ S of candidate identities, i.e. the set of identities that do not create duplications if assigned to F and are at the same time acceptable with respect to both P2(F) and the speed model.
By considering the set C just defined we have:
|C| = 0: No identities are available, thus F is not identified.
C = {i}: We identify F with i.
|C| > 1: F is identified with the identity i ∈ C that realises the minimum boundary speed.
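The selection among candidate identities can be sketched as follows. This is a simplified illustration with names of our choosing: `p2`, `speed_ok` and `rho` stand in for P2(F, ·), the speed model and ρ(F):

```python
def reidentify(available, p2, speed_ok, rho):
    """Choose a new identity for a fragment flagged for reidentification.

    available: set A of identities not used by coexisting fragments
    p2:        dict identity -> P2(F, identity)
    speed_ok:  dict identity -> boundary speed implied by that identity,
               containing only identities with realistic speeds (set S)
    rho:       P2 acceptance threshold rho(F)
    Returns the chosen identity, or None if no candidate survives.
    """
    if len(available) == 1:
        return next(iter(available))
    S = {i for i in available if i in speed_ok}         # realistic speeds
    Q = {i for i in available if p2.get(i, 0.0) > rho}  # acceptable P2
    C = S & Q
    if not C:
        return None  # |C| = 0: the fragment is left unidentified
    # |C| >= 1: pick the identity realising the minimum boundary speed
    return min(C, key=lambda i: speed_ok[i])

choice = reidentify(
    available={1, 2, 3},
    p2={1: 0.95, 2: 0.4, 3: 0.9},
    speed_ok={1: 3.0, 3: 2.0},  # identity 2 would need an unrealistic speed
    rho=0.5,
)
```

Here identities 1 and 3 survive both filters, and 3 is chosen because it implies the smaller boundary speed.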
D.6.2 Crossing identification
Single individuals in a crossing are identified by a Python reimplementation of the algorithm described in [1, Supporting Text, Section 2.12]. See idtracker.ai/postprocessing/ for the documentation and the source code of the algorithm.
D.7 Output
In this section we discuss the final outputs of the algorithm: an estimation of the tracking accuracy, which warns the user in case the algorithm could not proceed smoothly through the tracking process, and the files containing the trajectories of each individual, which are saved and made available to the user.
D.7.1 Estimated accuracy
Let ℐ be the set of all identified individual fragments, NF the number of images in a fragment F, and N = Σ_{F∈ℐ} NF the total number of images in such fragments. We estimate the overall accuracy of the algorithm as accuracy = (1/N) Σ_{F∈ℐ} NF · P2(F, i), where i is the identity assigned by the algorithm to the fragment F.
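One way to read this estimator is as the P2 of each fragment's assigned identity weighted by the fragment's number of images; the sketch below follows that interpretation (ours, not necessarily the exact formula used):

```python
def estimated_accuracy(fragments):
    """Image-weighted mean of P2 over identified fragments.

    fragments: list of (n_images, p2_of_assigned_identity) pairs,
               one per identified individual fragment.
    """
    total_images = sum(n for n, _ in fragments)
    return sum(n * p2 for n, p2 in fragments) / total_images

# Hypothetical fragments: long confident ones dominate the estimate
acc = estimated_accuracy([(100, 0.999), (50, 0.95), (10, 0.8)])
```

Weighting by image count means that a few short, poorly identified fragments lower the estimate only in proportion to the video time they cover.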
D.7.2 Individual trajectories
The algorithm outputs two individual trajectory files: one generated by considering the identification of individual images only, and a second one that also includes the identification of individuals during crossings (see section D.6.2).
Both are organised as matrices of shape (number of frames) × (number of individuals) × 2, where the last two components are the position of the centroid of each individual in pixel coordinates, with respect to the entire frame.
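For concreteness, the layout of such a matrix in numpy; the use of NaN for frames where an individual has no identified centroid is our assumption:

```python
import numpy as np

n_frames, n_individuals = 1000, 10

# trajectories[frame, individual] = (x, y) centroid in pixel coordinates;
# NaN may mark frames where the individual is not identified (assumption)
trajectories = np.full((n_frames, n_individuals, 2), np.nan)

# Positions of individual 3 over the whole video: shape (n_frames, 2)
ind3 = trajectories[:, 3, :]

# Positions of all individuals at frame 100: shape (n_individuals, 2)
frame100 = trajectories[100, :, :]
```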
E Human validation
After a video has been tracked, idtracker.ai provides an estimate of its own accuracy (see section D.7.1). Human validation is necessary to evaluate the goodness of the automatic accuracy assessment, to notice recurrent inaccurate identifications, and to evaluate the limiting conditions under which the tracking system can work (e.g. suitability of the setup and of the recording conditions, and quality of the images).
We recall that the identity of an individual is maintained throughout an individual fragment, so a misidentification can only happen after a crossing or an occlusion. Note that here a bad segmentation of the images (see section D.1) counts as an occlusion. Hence, the optimal validation would consist in checking that the identities of the animals before and after every crossing are conserved. Identities are assigned for the first time when labelling the individual fragments of the first global fragment used to train the idCNN. Hence, only by starting the validation from that global fragment can one be sure that no switch of identities between two or more individuals has occurred.
When dealing with large groups or particularly long videos, validating all the crossings is extremely costly. For this reason, we provide two procedures to facilitate the process. On the one hand, a global validation graphical interface allows the user to easily check the goodness of the identification of all the individuals in a segment of the video, to correct their identities, and to compute the accuracy of the identification with respect to the user-generated ground truth. On the other hand, an individual validation procedure allows the user to select a specific animal and follow it throughout the video. All the crossings or occlusions that do not involve that individual are ignored, allowing fast validation of long segments of the video.
E.1 Global validation
Starting from the core of the first global fragment (see section D.4.1), we manually check that in every crossing the identities of all the animals involved are maintained. Corrected identities are stored. After providing the segment S = (start - end) on which validation has been performed, we compute the following accuracy indices. Let ℐS be the total number of individual images validated.
1. Accuracy during the protocol cascade: Number of images correctly identified during the fingerprint protocol cascade, over the total number of individual images used to train the idCNN in S.
2. Accuracy: Number of images correctly identified over ℐS.
3. Percentage of non-identified images: Number of images not identified, over ℐS.
4. Percentage of misidentified images: Number of images misidentified, over ℐS.
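The four indices reduce to simple ratios over the validated segment; a sketch with hypothetical counts (the function and argument names are ours):

```python
def validation_indices(n_total, n_correct, n_unidentified,
                       n_correct_training, n_training):
    """Accuracy indices over a validated segment S.

    n_total:            total validated individual images (|I_S|)
    n_correct:          images whose identity matches the ground truth
    n_unidentified:     images with no identity assigned
    n_correct_training: correctly identified images among those used
                        to train the idCNN in S
    n_training:         images used to train the idCNN in S
    """
    n_wrong = n_total - n_correct - n_unidentified
    return {
        "accuracy_protocol_cascade": n_correct_training / n_training,
        "accuracy": n_correct / n_total,
        "pct_non_identified": 100 * n_unidentified / n_total,
        "pct_misidentified": 100 * n_wrong / n_total,
    }

idx = validation_indices(n_total=2000, n_correct=1980, n_unidentified=10,
                         n_correct_training=500, n_training=500)
```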
E.2 Individual validation
Individual validation is performed by considering a single individual at a time, always proceeding from the core of the first global fragment used in the protocol cascade towards earlier or later frames. When validating the individual assigned the identity ι, we are interested only in the crossings and occlusions in which it is involved. In this way, the validation is much faster and it is possible to control the quality of the identification over a wider timespan. After correcting the misidentified images, we compute the accuracy of the assignment of ι as the number of correctly identified images over the total number of images representing the individual.
Acknowledgements
We thank Alfonso Perez-Escudero and Andres Laan for discussions, Antonia Groneberg and Andres Laan for a critical reading of the manuscript, João Bauto and Ricardo Ribeiro for assistance with hardware and software, Paulo Carriço for help in designing the fish arenas, Ana Catarina Certal and Isabel Campos for animal husbandry, Andrew I. Bruce and Nico Blüthgen for videos of ants (Diacamma), and Joana Couceiro, Liliana Costa, Clara Ferreira and Tomas Cruz for assistance with fly experiments. This study was supported by an NVIDIA GPU grant (to M.G.B., F.H. and G.G.dP.), Fundação para a Ciência e a Tecnologia grant PTDC/NEU-SCC/0948/2014 (to G.G.dP.), including a contract to F.H., and the Champalimaud Foundation (to G.G.dP.), including contracts to M.G.B. and R.H. F.R-F. acknowledges an FCT PhD fellowship.