Abstract
Our understanding of collective animal behavior is limited by our ability to track each of the individuals. We describe an algorithm and software, idtracker.ai, that extracts from video all trajectories with correct identities at high accuracy for collectives of up to 100 individuals. It uses two deep networks, one that detects when animals touch or cross and another that identifies each animal, trained adaptively to the conditions and difficulty of the video.
Obtaining animal trajectories from a video faces the problem of how to track animals with correct identities after they touch, cross or are occluded by environmental features. To bypass this problem, we proposed in idTracker the idea of tracking by identification of each individual using a set of reference images obtained from the video [1]. idTracker and further developments in animal identification algorithms [2–6] can work for small groups of 2-15 individuals. In larger groups, they only work for particular videos with few animal crossings [7] or with few crossings of particular species-specific features [5].
Here we present idtracker.ai, a system to track all individuals in small or large collectives (up to 100 individuals) at a high identification accuracy, often of > 99.9%. The method is species-agnostic and we have tested it in small and large collectives of zebrafish, Danio rerio and flies, Drosophila melanogaster. Code, quickstart guide and data used are provided (see Methods), and Supplementary Text describes algorithms and gives pseudocode. A graphical user interface walks users through tracking, exploration and validation (Fig. 1a).
Similar to idTracker [1], but with different algorithms, idtracker.ai identifies animals using their visual features. In idtracker.ai, animal identification is done by adapting deep learning [8–10] to work in videos of animal collectives thanks to specific training protocols. In brief, it consists of a series of processing steps summarized in Fig. 1b. After image preprocessing, a first deep network finds when animals are touching or crossing. The system then uses the images between these detected crossings to train a second deep network for animal identification. The system first assumes that a single portion of video in which animals do not touch or cross has enough images to properly train the identification network (Protocol 1). However, animals touch or cross often, so this portion is typically very short and the system estimates that identification quality is too low. If this happens, two extra protocols (Protocols 2 and 3) safely accumulate enough images of each animal from several of these portions of video to build a larger training set. After training and assignment of identities, some postprocessing is performed to output the trajectories and an estimate of identification accuracy.
In the following we give more details of the processing steps. The preprocessing extracts blobs, areas of each video frame corresponding either to a single animal or to several animals that are touching or crossing, i.e. 'crossings'. It then orients the blobs using their axes of maximum elongation (Fig. 1c). This procedure leaves the animal pointing in one of two possible orientations. We resolve this ambiguity by training with both upright and 180-degree-rotated images. This method is valid for any elongated animal and is preferred to species-specific methods.
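The orientation step can be sketched in a few lines of Python (a minimal illustration, not the released code): the axis of maximum elongation of a blob is the first principal component of its pixel cloud, defined up to the 180-degree ambiguity mentioned above.

```python
import numpy as np

def principal_axis_angle(pixel_coords):
    """Angle (radians) of the axis of maximum elongation of a blob.

    pixel_coords: (N, 2) sequence of (x, y) pixel coordinates of one blob.
    Returns the angle of the first principal component with the x axis,
    defined only up to a 180-degree ambiguity (as discussed in the text).
    """
    pts = np.asarray(pixel_coords, dtype=float)
    centred = pts - pts.mean(axis=0)
    # Eigenvector of the covariance matrix with the largest eigenvalue
    cov = np.cov(centred.T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, np.argmax(eigvals)]
    return np.arctan2(major[1], major[0])
```

Rotating every blob image by the negative of this angle aligns all animals along a common axis before they are fed to the networks.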
The deep crossing detector network determines whether each preprocessed image corresponds to a single animal or a crossing (Fig. 1d; details of network architecture in Supplementary Table 1). idtracker.ai trains this network using images that it confidently classifies as single animals or crossings (see Supplementary Text for the heuristics used). Once trained, it classifies all blobs as single animals or crossings. We depict the detected crossings as small black segments in Fig. 1g.
The deep identification network is then used to identify each individual between two crossings (Fig. 1e; Supplementary Table 1 for details of network architecture). We measured the identification capacity of this network using 184 single-animal videos, with 300 pixels per animal on average. The advantage of single-animal videos is that we obtain a very large number of images per animal. Out of the 18,000 images per animal we randomly selected 3,000 for training. Testing on 300 new images gave a > 95% single-image accuracy up to 150 animals (Fig. 1f; see Supplementary Fig. 2 for experimental set-up and Supplementary Fig. 1 for results using alternative architectures detailed in Supplementary Tables 2-3). In contrast, idTracker degrades more quickly, down to ≈83% for 30 individuals, and is computationally too demanding for larger groups.
In videos of collective animal behavior, however, we lack direct access to 3,000 images per animal to train the identification network. Instead, we use a cascade of three protocols that obtains the training images differently depending on the difficulty of the video (Fig. 1b, cascade of training protocols; see Supplementary Figures 3-4 for video-acquisition setups for zebrafish and flies, respectively).
Protocol 1 starts by finding all intervals of the video where all the animals are detected as separated from each other. To each interval, for each animal, we add images up to the next crossing from future frames and images up to the immediately previous crossing from past frames. We call these extended intervals global fragments; they can contain different numbers of images per animal. Among all the global fragments, the system then chooses the one that maximizes the distance traveled by its least-traveled animal (Fig. 1g, Step 1, colors indicate each of the 100 individuals in the collective). The system uses this global fragment to train the identification network. Once trained, the network assigns identities in all the remaining global fragments.
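The selection criterion, choosing the global fragment whose least-traveled animal has moved the most, can be written compactly. This is a sketch under a simplified data model (each global fragment reduced to a mapping from animal identity to distance traveled); the real code operates on richer fragment objects.

```python
def best_global_fragment(global_fragments):
    """Pick the global fragment whose least-travelled animal moved the most.

    global_fragments: list of dicts mapping animal id -> distance travelled
    by that animal within the fragment (a simplification for illustration).
    """
    # max-min criterion: maximize the minimum distance over animals
    return max(global_fragments, key=lambda frag: min(frag.values()))
```

A fragment in which every animal moves at least a little is preferred over one where a single animal barely moves, since a nearly static animal yields many redundant, near-identical training images.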
Afterwards, the system evaluates the quality of the assigned global fragments. It eliminates: (1) global fragments with an estimated identification accuracy below some threshold, (2) those with identifications inconsistent with already assigned global fragments, and (3) those where the same identity has been assigned to several animals. If the remaining high-quality global fragments (Fig. 1g, Step 2) cover < 99.95% of the images in global fragments, then Protocol 1 failed and Protocol 2 starts, as in our example.
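The three-step quality check can be sketched as follows; the accuracy threshold and the data model here are illustrative choices of ours, not idtracker.ai's actual values.

```python
def passes_quality_check(identities, estimated_accuracy,
                         previously_assigned, threshold=0.99):
    """Three-step quality check on an assigned global fragment (sketch).

    identities: list of identities assigned in this fragment, one per animal.
    estimated_accuracy: the network's estimated accuracy for this assignment.
    previously_assigned: dict mapping animal-fragment index -> identity fixed
        by already-accepted global fragments (empty if there is no overlap).
    The 0.99 threshold is illustrative, not the value used by idtracker.ai.
    """
    # 1. estimated identification accuracy must exceed the threshold
    if estimated_accuracy < threshold:
        return False
    # 2. identities must be consistent with already-accepted fragments
    for idx, fixed_identity in previously_assigned.items():
        if identities[idx] != fixed_identity:
            return False
    # 3. no identity may be duplicated within the fragment
    return len(set(identities)) == len(identities)
```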
Protocol 2 starts by training the network with the high-quality global fragments found in Protocol 1. This network is then used to assign the remaining global fragments again, selecting those passing our three-step quality check. This procedure iterates until at least 50% of the images have been assigned. From this point on, the system runs the accumulation as before, alternating it with the following extension. Single-animal fragments belonging to an unsuitable global fragment are accumulated if they are certain enough, are consistent with fragments already accumulated and do not introduce identity duplications. Accumulation continues until no more acceptable global fragments remain or 99.95% of the images from global fragments have survived the quality check. After this point, if > 90% of the images in global fragments have been accumulated, Protocol 2 ends successfully. In our example, Protocol 2 stops the accumulation at the 9th step (Fig. 1g, Step 9). Afterwards, the remaining images are assigned using the final network (see higher-transparency segments in the close-up given in Fig. 1h).
The system then estimates identification accuracy using a conservative Bayesian framework (Supplementary Fig. 5), 99.95% in our example (Fig. 1i, top). Human validation of 3,000 sequential video frames, by revising 680 crossings, gave 99.997% (Fig. 1i, bottom). An identification accuracy of 100% was obtained with the alternative method of following 10 random animals throughout the video.
A post-processing step obtains animal images by iterative image erosion and assigns them with a heuristic (Fig. 1j; Supplementary Text). Human validation gives an accuracy of 99.988% for the final assignments, including images between crossings and during crossings.
If Protocol 2 fails, Protocol 3 starts by training the convolutional part of the identification network using most of the global fragments. It then proceeds as Protocol 2, but always keeping the convolutional layers fixed.
We have tested idtracker.ai in small and large animal collectives (Supplementary Tables 4 and 5, respectively). In zebrafish, Protocol 2 was always successful, giving accuracies of 99.96% (mean) ± 0.06% (std) for 60 individuals and 99.99% (mean) ± 0.01% (std) for 100 individuals. Importantly, of the remaining 0.01% in videos of 100 animals, only 0.003% corresponds to isolated frames with assignment errors and 0.007% to short non-assigned segments. In flies, Protocol 2 succeeded for a collective of 38 individuals with 99.98% accuracy. For larger groups, Protocol 3 was successful. For 72 flies the accuracy is 99.997%. For 80-100 flies the system reaches its limit, still with > 99.5% accuracy.
We also studied how performance depends on the number of images between crossings. We built synthetic global fragments from individual videos of 184 zebrafish (Supplementary Fig. 2). We found that the system reaches high accuracy provided there is at least one global fragment with more than 30 images per animal, but it can still be successful with fewer (Supplementary Fig. 6, empty markers). Recorded collectives of up to 100 zebrafish meet this condition by a large margin (Supplementary Fig. 6, green dots). Flies also meet it, except at very low locomotor activity levels, here obtained in a low-humidity setup (Supplementary Fig. 6, purple dots). Also note that video-acquisition conditions should ideally allow for high image quality (Supplementary Text), but idtracker.ai seems more robust than idTracker when some of these conditions are not met (Supplementary Table 6).
Note: Supplementary Information is available
The authors declare that no conflict of interest exists.
Author contributions
F.R-F., M.G.B. and G.G.dP. devised project, algorithms and analysed data, F.R-F. and M.G.B. wrote the code with help from F.H., M.G.B. managed code architecture and GUI, F.R-F. managed testing procedures, R.H. built set-ups and performed experiments with help from F.R-F., G.G.dP. supervised project, M.G.B. wrote supplement with help from F.R.-F, R.H, F.H. and G.G.dP., and G.G.dP. wrote main text with help from F.R.-F, M.G.B. and F.H.
Methods
Software availability
idtracker.ai is open-source and free software (license GPL v.3). The source code and installation instructions are available at www.gitlab.com/polavieja_lab/idtrackerai. A quick-start user guide and a detailed explanation of the graphical user interface can be found at www.idtracker.ai.
Data availability
All videos used in this study can be downloaded from www.idtracker.ai. A library of single-individual images of zebrafish to test identification methods can be found in the same link. Two example videos, one of 8 adult zebrafish and another of 100 juvenile zebrafish, are also included as part of the quick-start user guide.
Computers
We tracked all the videos with desktop computers running GNU/Linux Mint 18.1 64bit (processor Intel Core i7-6800K or i7-7700K, 32 or 128 GB RAM, Titan X or GTX 1080 Ti GPUs, and 1 TB SSD disk). Sample videos can also be tracked using only the CPU, but performance is severely degraded.
Animal rearing and handling
All fish were raised at the Champalimaud Foundation Fish Platform, according to the housing and husbandry methods integrated in the zebrafish welfare program fully described in [11]. Animal handling and experimental procedures were approved by the Champalimaud Foundation Ethics Committee and the Portuguese Direcção Geral Veterinária and were performed according to the European Directive 2010/63/EU. For zebrafish videos we used the wild-type TU strain at 31 days post fertilization (dpf). Animals were kept in 8 L holding tanks at a density of 10 fish/L and a 14 h light / 10 h dark cycle in the main fish facility. For each experiment, a holding tank with the necessary number of fish was transported to the experimental room, where fish were carefully transferred to the experimental arena using a standard fish net appropriate for their age.
For the fruit fly videos we used adults from the Canton S wild-type strain at 2-4 days post-eclosion. Animals were reared on a standard fly medium and kept on a 12-h light-dark cycle at 28°. Flies were placed in the arena either by anesthetizing them with CO2 or ice, or by using a suction tube. We found the suction tube to have the least negative effect on the flies' health and to provide better activity levels.
Details of the networks
Network architectures
The deep crossing detector network (Fig. 1d) is a convolutional neural network [8, 10]. It has 2 convolutional layers that learn a relevant hierarchy of filters from the data. A hidden layer of 100 neurons then transforms the convolutional output into a classification as single animal or crossing. idtracker.ai trains this network using images that it can confidently classify as single or as multiple animals (for example, single animals as blobs with an area consistent with single-animal statistics that do not split into more blobs in their past or future). Further details of the architecture are given in Supplementary Table 1.
The architecture of the identification network (Fig. 1e) consists of 3 convolutional layers, a hidden layer of 100 neurons and a classification layer with as many classes as animals in the collective. Further details are given in Supplementary Table 1. We tested variations of the architecture by modifying either the number of convolutional layers (Supplementary Table 2) or the number of hidden-layer neurons (Supplementary Table 3). Analysis of these networks indicated that the most important feature for successful identification is that the convolutional part has at least two layers (Supplementary Fig. 1). The GUI allows users to modify the architecture of this network and its training hyperparameters.
Network training
The convolutional and fully-connected layers of both networks are initialised using Xavier initialisation [12]. Biases are initialised to 0.
The deep crossing detector network is trained using the algorithm and hyperparameters in [13]. The learning rate is set at the initial value of 0.005. This network is trained in mini batches of 100 images.
The identification network is trained using stochastic gradient descent, setting the learning rate to 0.005. This network is trained in mini batches of 500 images. Further details are given in the Supplementary Text.
C General video conditions
It is advisable to adhere to some guidelines when recording videos of freely-moving animals. The following conditions help maximise the probability of success and the accuracy of the tracking.
Resolution. The higher the number of pixels per individual, the more information is available to distinguish it from the rest. On the downside, the additional information makes the algorithm less time-efficient. See Supplementary Tables 4 and 5 for the average number of pixels per animal in each of the videos tracked.
Frame rate. The frame rate must be high enough for the blobs associated with the same individual to overlap in consecutive frames when moving at average speed. A low frame rate, relative to the average speed of the animals, can cause a bad fragmentation of the video, an essential process in the tracking pipeline that collects images belonging to the same individual and organises them in fragments. Conversely, an excessively high frame rate makes the information coming from the analysis of the fragments highly redundant, increasing the computational time needed to track the video without guaranteeing better identification of the individuals. In the examples provided in this paper, the frame rate ranges from 25 fps to 50 fps.
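As a rule of thumb (our back-of-the-envelope estimate, not a formula from the paper), the overlap condition bounds the frame rate from below by the ratio of average speed to body length: the displacement per frame must stay below roughly one body length for a blob to overlap its previous position.

```python
def min_frame_rate(avg_speed_px_per_s, body_length_px, safety=2.0):
    """Rough lower bound on frame rate so an animal's blob overlaps itself
    across consecutive frames.

    Overlap roughly holds while the per-frame displacement stays below one
    body length; the safety factor (our choice, not the paper's) leaves
    margin for bursts above the average speed.
    """
    return safety * avg_speed_px_per_s / body_length_px
```

For example, an animal moving 500 px/s with a 40 px body would need at least 25 fps under this estimate, consistent with the 25-50 fps range used in the paper.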
Duration. The length of video for which the system works depends on the number of animals, the distribution of images per individual fragment and the number of pixels per animal. For few animals (8 zebrafish) we can track videos as short as ≈18 s (≈500 frames at 28 fps). For large groups we can track videos as short as 1 min (≈1950 frames at 32 fps). The system works for longer videos as long as the overall conditions do not change abruptly across the video. Very long videos with many animals require a large amount of RAM and can exhaust your computer's memory.
Video format. The system works with any video format compatible with OpenCV. We recommend uncompressed or lossless video formats: some compression algorithms work by deleting pieces of information that could be crucial for the identification of the individuals. However, we have successfully tracked videos in compressed formats: .avi (FPM4 video codec) and .MOV (avc1 video codec) (see Supplementary Table 6).
Illumination. Illumination should be as uniform as possible, so that the appearance of the animals is homogeneous throughout the video. We recommend using indirect light, either by reflecting the light off the walls of the setup or by covering the setup with a light diffuser, as shown in Supplementary Figures 3 and 4. We have also tracked videos with retroilluminated arenas (see Supplementary Table 6), but recall that the tracking system relies on visual features of the animals that this type of illumination can hide.
Definition and focus. Images of individuals should be as sharp and focused as possible so that their features are clearly visible throughout the entire video. When using wide apertures on the camera, the depth of field can be quite narrow. Make sure that the plane of the camera sensor is parallel to the plane of the arena so that animals are in focus in all parts of it. In addition, the shutter speed should be fast enough (i.e. the exposure time short enough) that animals do not appear blurred when moving at average speed. Blurred and out-of-focus images are more difficult to identify correctly.
Background. The background should be as uniform as possible. To facilitate the detection of the animals during the segmentation process (see section D.1), the background colour should be chosen to maximise the contrast with the animals. Small background inhomogeneities or noise are acceptable and can be removed by the user during the selection of the preprocessing parameters:
– Static or moving objects much smaller or much larger than the animals can be removed by setting appropriate minimum and maximum blob-size thresholds (in pixels).
– Static objects of the same size and intensity of the animals can be removed by selecting the option “subtract background” in the preprocessing tab.
– Regions of the frame can be also excluded by selecting a region of interest.
Shadows. Shadows projected by the individuals on the background can lead to a bad segmentation and, hence, to a bad identification. Shadows can be diffused by using a transparent base separated from an opaque background (see Supplementary Figure 3) or by using a retroilluminated arena.
Reflections. Reflections of individuals on the walls of the arena should be avoided: They could be mistaken for an actual individual during the segmentation process. Reflections in opaque walls can be reduced by using either very diffused light or matte walls. For aquatic arenas with transparent walls, reflections can be softened by having water at both sides of the walls. Furthermore, reflections can be removed by selecting an appropriate ROI.
Variability in number of pixels per animal. The number of pixels in a blob is one of the criteria used to distinguish individual fish from crossings. An optimal video should fulfil the two following conditions. First, the number of pixels associated with each individual should vary as little as possible along the video. Second, the size of an individual should vary as little as possible with its position in the arena. In any case, strategies to avoid misidentification are put in place even when animal sizes vary (see section D.2.3).
D Algorithm
We first introduce the workflow of the algorithm; subsequent sections give further details on each of its components. The algorithm is composed of six computational cores, highlighted in blue in Supplementary Figure 7. First, during the segmentation process, the images representing either single or multiple touching animals are extracted from the video. In the remainder, we refer to images representing a single individual as individual images and to images in which two or more individuals are touching as crossing images.
A model of the average area of the individuals and, later, a convolutional neural network (CNN), named deep crossing detector in the remainder, are used to discriminate between individual and crossing images.
Each image extracted from the video is now labelled as either a single individual or a crossing. By means of an extra-safe protocol, we define collections of images in subsequent frames of the video in which the same individual (or crossing) is represented. We name these collections individual and crossing fragments, respectively.
The fourth computational core is the gist of the algorithm. A subset of the collection of individual fragments, in which all the individuals are visible in the same part of the video, is used to generate a dataset of individual images labelled with the corresponding identities. This dataset is then used to train a second CNN to classify images according to their identity. A cascade of increasingly encompassing training/identification protocols is put in place, so that an appropriate identification strategy is automatically chosen by the algorithm according to the complexity of the video. The idea underlying this family of methods is that the information gained from the first dataset of labelled images allows the algorithm either to accurately assign the entire collection of individual fragments, or to enlarge the first dataset by incorporating safely identified individual fragments from throughout the video.
The knowledge acquired during the protocol cascade is used to identify the individual fragments that were not used to train the identification CNN. In the remainder, we will refer to this operation as residual identification.
Finally, trivial identification errors are corrected by a series of post-processing routines, and the identity of the crossing fragments is inferred in a last computational core.
D.1 Segmentation
idtracker.ai tracks the individuals by relying on their visual features. Hence, given a frame of the video, it is necessary to distinguish between pixels associated with individuals and pixels belonging to the background. Following the standard notation adopted in computer vision, we refer to a collection of connected pixels which is not part of the background as a blob.
The segmentation process has four main steps. First, the user can define a region of interest to be applied on each frame of the video. In this way it is possible to exclude, for instance, walls which may contain reflections of the animals.
Second, each frame is normalised with respect to its average intensity to correct for illumination fluctuations. It is also possible to perform background subtraction by generating a model of the background, calculated as the average of a collection of frames obtained by subsampling the video.
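The background model described above can be sketched with NumPy (an illustration; the released implementation differs in detail):

```python
import numpy as np

def background_model(frames, step=10):
    """Background as the average of a subsample of frames (sketch).

    frames: iterable of 2-D grayscale arrays of equal shape.
    step: subsampling period; every `step`-th frame enters the average.
    """
    sample = [np.asarray(f, dtype=float) for i, f in enumerate(frames)
              if i % step == 0]
    return np.mean(sample, axis=0)

def subtract_background(frame, background):
    """Absolute difference between a frame and the background model,
    leaving only moving objects (and noise) with high values."""
    return np.abs(np.asarray(frame, dtype=float) - background)
```

Averaging over subsampled frames suppresses the moving animals, since each animal occupies any given pixel in only a small fraction of the sampled frames.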
Then, blobs of pixels corresponding to animals are detected by intensity thresholding and subsequent labelling of connected components. The intensity thresholds that distinguish the individuals from the background are specified by the user. Often, intensity alone is not enough to segment the animals in the entire video. For this reason, it is also possible to specify a minimum and a maximum area (number of pixels) for a blob to be acceptable. For instance, these parameters allow dust to be excluded during segmentation.
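The thresholding and connected-component labelling can be sketched in pure Python with a 4-connectivity flood fill (a didactic version, much slower than the optimised routines a real implementation would use; it assumes dark animals on a bright background):

```python
import numpy as np
from collections import deque

def segment_blobs(frame, intensity_threshold, min_area=1, max_area=None):
    """Threshold a grayscale frame and extract blobs by connected-component
    labelling (4-connectivity), filtering them by area.

    A pixel is taken to belong to an animal when its intensity is below the
    threshold (dark animals on a bright background).
    Returns a list of blobs, each a list of (row, col) pixel coordinates.
    """
    frame = np.asarray(frame)
    mask = frame < intensity_threshold
    visited = np.zeros_like(mask, dtype=bool)
    blobs = []
    h, w = mask.shape
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not visited[y, x]:
                # breadth-first flood fill of one connected component
                queue, blob = deque([(y, x)]), []
                visited[y, x] = True
                while queue:
                    cy, cx = queue.popleft()
                    blob.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and not visited[ny, nx]):
                            visited[ny, nx] = True
                            queue.append((ny, nx))
                # keep only blobs within the user-specified area range
                if len(blob) >= min_area and (max_area is None
                                              or len(blob) <= max_area):
                    blobs.append(blob)
    return blobs
```

The `min_area` filter plays the role described above: a one-pixel speck of dust is discarded while a multi-pixel animal blob is kept.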
All these operations are carried out intuitively using the idtracker.ai graphical user interface, where both the intensity and area thresholds can be adjusted while observing their effect on the frame in real time; see Supplementary Figure 8.
The software currently supports only grayscale video segmentation. Frames captured from a color video will be automatically mapped to grayscale.
Remark 1 (On background subtraction). Background subtraction is often useful when trying to segment a video in which a static object has the same intensity level as the individuals one wants to segment (see Supplementary Material C).
D.2 Detection of individual and crossing images
The training/identification process can identify only images representing single individuals. Thus, a crucial point of the algorithm is the discrimination between individual and crossing images. To differentiate between these two classes, we apply a series of three algorithms to the images segmented from the video.
First, we use two heuristics to detect images that in all likelihood correspond to a single animal (sure individual images) or to crossing animals (sure crossing images), respectively. Then, we use these sure individual and sure crossing images to train a neural network. Finally, the trained network labels ambiguous (not sure) images as either crossing or individual images.
D.2.1 Model area
We build a model of the area of the individuals by taking into account portions of the video in which the number of segmented blobs corresponds to the number of animals declared by the user. If no frame fulfils this condition, the tracking cannot proceed and an error is raised. Let ϱ = {b1, …, bn} be the collection of the blobs segmented from these parts of the video and A = {area(bi) for every bi ∈ ϱ} the collection of the corresponding individual areas, where the function area(bi) counts the number of pixels of the blob bi. The model area is defined by the median mA = median(A) and the standard deviation sA = σ(A). Given a blob b, we classify it as an individual if its area deviates from mA by less than a threshold proportional to sA (eq. (D.1)), and as a crossing otherwise.

A model based exclusively on the area of the blobs can easily fail when the individuals' body is not rigid (e.g. fish or mice), can suddenly change shape (e.g. a fly with open or closed wings), or under heterogeneous lighting conditions. Even more complex situations arise when animals can move freely in 3 dimensions (e.g. fish swimming at different depths). In this latter case, one individual can be almost completely occluded by another, causing the area model to fail.
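The area model reduces to a few lines. In this sketch the tolerance k is an illustrative choice of ours; the exact threshold is the one of eq. (D.1).

```python
import numpy as np

def fit_area_model(areas):
    """Median m_A and standard deviation s_A of individual blob areas,
    computed on frames where the number of blobs equals the declared
    number of animals."""
    areas = np.asarray(areas, dtype=float)
    return np.median(areas), np.std(areas)

def is_individual_by_area(blob_area, m_a, s_a, k=4.0):
    """Classify a blob as a single individual when its area lies within
    k standard deviations of the model median. The value of k is
    illustrative, not idtracker.ai's actual threshold."""
    return abs(blob_area - m_a) <= k * s_a
```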
D.2.2 Blobs overlapping in subsequent frames
The second heuristic is based on the overlapping of blobs in subsequent frames: it selects sure crossing and sure individual images depending on the merging or splitting of consecutive, overlapping blobs. We recall that a blob is a collection of connected acceptable pixels in a certain frame, where a pixel is considered acceptable depending on its intensity value and the thresholding described in section D.1. Let b1 and b2 be two blobs. We say that the two blobs overlap if and only if b1 ∩ b2 ≠ ∅, where b1 ∩ b2 is the intersection between the corresponding sets of pixels. See Supplementary Figure 9 for an example.
Let Bi = {bi,1, …, bi,n} be the collection of blobs segmented from the ith frame of a video 𝒱. For every blob bi,j ∈ Bi we derive the collections of blobs overlapping with bi,j in frames (i − 1) and (i + 1). We call these collections the sets of previous and next blobs of bi,j, denoted by Pbi,j and Nbi,j, respectively.
Let b be a blob. We say that b is a blob associated with a sure individual image if:
a) b is an individual according to eq. (D.1);
b) |Pb| = |Nb| = 1, i.e. the blob overlaps with one and only one blob both in the previous and in the subsequent frame. The notation |·| indicates the cardinality of a set, i.e. the number of elements of the set;
c) for every bp and bn in the past and future overlapping history of b, |Pbp| ⩽ 1 and |Nbn| ⩽ 1.

Symmetrically, we say that b is associated with a sure crossing image if:
a) b is a crossing according to eq. (D.1);
b) |Pb| > 1 or |Nb| > 1.
or
a) b does not satisfy the model of the area;
b) |Pb| = |Nb| = 1;
c) for some bp or bn in the past or future overlapping history of b, |Pbp| > 1 or |Nbn| > 1.
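These heuristics can be sketched as follows. For brevity the sketch checks only the immediate past and future neighbours and implements only the first sure-crossing rule; the full algorithm walks the entire overlapping history.

```python
def blobs_overlap(pixels_a, pixels_b):
    """Two blobs overlap iff they share at least one pixel, i.e. the
    intersection of their pixel sets is non-empty."""
    return bool(set(pixels_a) & set(pixels_b))

def is_sure_individual(blob):
    """Sure-individual heuristic (sketch). `blob` is a dict with:
      'is_individual_by_area' - passes the area model of section D.2.1
      'prev', 'next'          - lists of overlapping blobs in the previous
                                and next frame, each a dict of the same shape
    """
    if not blob['is_individual_by_area']:
        return False
    # the blob must overlap exactly one blob in each adjacent frame
    if len(blob['prev']) != 1 or len(blob['next']) != 1:
        return False
    prev_blob, next_blob = blob['prev'][0], blob['next'][0]
    # its neighbours must not merge from, or split into, several blobs
    return len(prev_blob['prev']) <= 1 and len(next_blob['next']) <= 1

def is_sure_crossing(blob):
    """Sure-crossing heuristic (sketch of the first rule): a crossing by
    area that merges or splits, i.e. overlaps with more than one blob in
    an adjacent frame."""
    return (not blob['is_individual_by_area']
            and (len(blob['prev']) > 1 or len(blob['next']) > 1))
```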
D.2.3 Deep crossing detector
The methods described in sections D.2.1 and D.2.2 can be applied to any video to generate a dataset 𝒟ic of sure individual and sure crossing images. With this dataset we train a CNN in the task of distinguishing crossing and individual images. We call this model the deep crossing detector (DCD). In the following paragraphs we describe the preprocessing, architecture, hyperparameters and stopping criteria used to define and train this model.
Preprocessing
Let b be a blob segmented from a video 𝒱 and Ib the image generated by cropping a rectangular bounding box around the centroid of b, such that all the pixels of b are represented in Ib. We first consider a dilation b* of b generated with a 5×5 kernel, and assign value 0 to every pixel of Ib that is not in b*. To overcome the sensitivity of CNNs to rotation, we compute the first principal component of the cloud of pixels defined by b and then rotate and crop Ib such that the first principal component forms a fixed angle with the x axis. After the rotation, the size of each image is set to the maximum of the largest bounding-box side over the collection of sure crossing images. The images are then resized to 40×40 pixels; the resizing improves both the time and memory efficiency of the algorithm. Finally, each image I ∈ 𝒟ic is standardised to zero mean and unit variance (see sample images in Fig. 1, Panel d in the main text).
Architecture
See Supplementary Table 1 (deep crossing detector). Both convolutional and fully-connected layers are initialised using Xavier initialisation [12]. Biases are initialised to 0.
Loss function
Let (x, li) be a labelled image, where li is the label in one-hot encoding, i.e. l0 = [1, 0] is the label associated with x if x is a crossing image and l1 = [0, 1] if x is an individual image. We compute the loss associated to (x, li) as a weighted cross-entropy, 𝔏(x, li) = −wi log σ(ai), where σ(ai) = e^ai / Σj e^aj is the softmax function applied to the activation ai of the ith unit of the last layer of the network, with j varying over all classes, in this case j ∈ {0, 1}. The weight wi associated with li is inversely proportional to the class frequency, wi = Σj |Lj| / (2 |Li|), where |Li| is the number of training samples belonging to the class li and j varies over all the classes of the dataset (only two in this case: individual and crossing). The weighting allows us to deal with the potentially unbalanced dataset 𝒟ic. Indeed, we prefer to collect all the sure-crossing and sure-individual images available in a given video rather than force 𝒟ic to be balanced in the number of samples per class. After dividing the dataset 𝒟ic into batches Xi of 50 images, we optimise the mean batch loss µ(𝔏(Xi)) using the algorithm described in [13], with the hyperparameters suggested in that paper. The learning rate is set at the initial value of 0.005.
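The loss can be written out numerically as follows. The inverse-class-frequency weighting below is our reading of the (incompletely rendered) formula in the source; treat the exact normalisation as an assumption.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax (temperature fixed to 1, see Remark 2)."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def weighted_cross_entropy(activations, label_index, class_counts):
    """Weighted cross-entropy for one sample (sketch of the DCD loss).

    activations: raw outputs of the last layer, one per class.
    label_index: index of the true class (0 = crossing, 1 = individual).
    class_counts: number of training samples per class. The weight is the
    inverse class frequency, so the rarer class weighs more; the exact
    normalisation here is our assumption.
    """
    counts = np.asarray(class_counts, dtype=float)
    weights = counts.sum() / (len(counts) * counts)
    p = softmax(np.asarray(activations, dtype=float))
    return -weights[label_index] * np.log(p[label_index])
```

With balanced classes the weights reduce to 1 and the expression becomes the ordinary cross-entropy; with an unbalanced dataset, errors on the minority class are penalised more.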
Remark 2. (On the softmax function) In general, the softmax is equipped with an extra parameter called temperature. We omit discussing it in the formula, since we always set it to 1.
Training and validation set
Before training, the dataset of sure crossing and sure individual images is split into two parts: 90% of the images are used for training, i.e. the weights of the network are updated to minimise the error (loss function) with respect to the labels associated with this set of images. We call this portion of the dataset the training set, denoted by T. The remaining 10% of images, the validation set V, are used to evaluate the generalisation power of the network. For this reason, the performance of the model on the validation set is used to stop its training. In the paragraph on training stopping criteria below, we discuss in more detail the algorithm used to stop the training of the network.
Accuracy
We measure the accuracy of the network by comparing the predictions generated by the softmax computed on the activations of the last layer of the network with the labels associated with each image in both the training and the validation set. Hence, let | V | be the number of images in the validation set, PV = {p1,…, pn} the ordered predictions generated by a forward pass of these images through the network, and LV = {l1,…, ln} the corresponding labels. Let AV be the set of correct predictions, defined as AV = {pi s.t. pi = li, for pi ∈ PV, li ∈ LV}. We define the overall accuracy of the network as

AccV = |AV| / |V|.     (D.3)

We will also take into account the accuracy on each of the inferred classes. Let c* be a class (in this case c* could be either the crossing or the individual class). The set AV (c*) = {pi ∈ AV s.t. li = c*} corresponds to the predictions equal to their associated labels and attributed to the particular class c*. In this case the class-accuracy is defined as

AccV (c*) = |AV (c*)| / |V (c*)|,     (D.4)

where |V (c*)| is the number of validation images labelled with c*. Symmetrically, we define the error and the class-error as 1 - AccV and 1 - AccV (c*), respectively.
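These accuracy measures can be sketched as follows (helper names are ours; the class-accuracy denominator is the number of validation samples of that class, as in eq. (D.4)):

```python
import numpy as np

def overall_accuracy(preds, labels):
    # fraction of predictions equal to their labels (overall accuracy)
    preds, labels = np.asarray(preds), np.asarray(labels)
    return float(np.mean(preds == labels))

def class_accuracy(preds, labels, c):
    # accuracy restricted to the samples whose label is class c
    preds, labels = np.asarray(preds), np.asarray(labels)
    mask = labels == c
    return float(np.mean(preds[mask] == labels[mask]))
```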
Training stopping criteria
While training the network, we verify the goodness of its outputs by computing both the loss function and the accuracy on the validation set V (see the previous sections). This procedure gives reasonable control over the actual classification power of the network on new unlabelled images. It is crucial to stop the training of the network to prevent two main behaviours. On the one hand, we want to prevent overfitting: an overly exact representation of the training data that would prevent the network from generalising to new data points. On the other hand, it is desirable to stop the training in case the error cannot be further minimised, i. e. the loss function has reached a plateau.
More formally, we call an epoch a complete training pass on the set T, concluded with the evaluation (of both loss and accuracy) on the validation set V. Let 𝔏i (T) and 𝔏i (V) be the values of the loss function on T and V after the epoch i. We define

di* = 𝔏i* (V) - (1/10) Σ_{i = i*-10}^{i*-1} 𝔏i (V),

the difference between the loss value in validation at epoch i* and the mean of the loss values of the previous 10 epochs. We stop the training at epoch i* > 10 if one of the following conditions holds.
a) The network is overfitting: di > 0 for every epoch i with i* - 5 < i ⩽ i*;
b) the loss reached a plateau: |di*| < 0.05 · 10^(log10 (𝔏i* (V)) - 1);
c) the network reached class-accuracy 1 on all the classes, for every sample in the validation set (see eq. (D.4));
d) the loss is zero: 𝔏i* (V) = 0.
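The four stopping conditions can be sketched as a single predicate. This is a simplified reconstruction: the 10-epoch mean, the 5-epoch overfitting window and the plateau threshold follow the text above; everything else (names, data layout) is an assumption.

```python
import math

def should_stop(val_losses, val_class_accuracies):
    """val_losses: validation loss per epoch; val_class_accuracies:
    per-class accuracy on the validation set after the last epoch."""
    if len(val_losses) <= 10:
        return False
    current = val_losses[-1]
    if current == 0.0:                                  # (d) zero loss
        return True
    if all(a == 1.0 for a in val_class_accuracies):     # (c) perfect class accuracy
        return True
    # d_i: validation loss at epoch i minus the mean of the 10 previous losses
    deltas = [val_losses[i] - sum(val_losses[i - 10:i]) / 10.0
              for i in range(10, len(val_losses))]
    if len(deltas) >= 5 and all(d > 0 for d in deltas[-5:]):
        return True                                     # (a) overfitting
    if abs(deltas[-1]) < 0.05 * 10 ** (math.log10(current) - 1):
        return True                                     # (b) plateau
    return False
```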
Crossing detection
Let Δ be the set of parameters learnt by training the DCD as described above, and let us denote the trained model as DCD (Δ). We create the test set 𝕋 of unlabelled images by considering all the images that are neither sure individuals nor sure crossings. The trained model acts as a function taking as input an image I ∈ 𝕋 and outputting a prediction as the softmax computed on the activations of the last layer of DCD (Δ). We recall that the softmax is the function s (ai) = e^ai / Σj e^aj, where ai is the activation of the ith unit of the last layer. Since the DCD classifies images in two classes, its last layer is composed of two units. Hence, given an image I, we obtain DCD (Δ) (I) = (s1, s2), where for brevity we set si = s (ai). If s1 > s2 the image is classified as a crossing, and as an individual otherwise.
Exceptions
It is possible that during training the loss value diverges to infinity. In this case a warning is produced, and the algorithm falls back to a crossing-individual image discrimination process based only on a model of the area of individual blobs (see section D.2.1). In case the criteria forcing the training to stop are never reached, we set a maximum of 100 epochs for the training of the DCD. If this threshold is reached, a warning is produced and the training is stopped. The parameters computed in the last iteration are then used to classify individual and crossing images.
D.3 Fragmentation
At this stage of the algorithm, the images segmented from the video (see section D.1) are labelled either as individual or crossing, following the protocols described in section D.2. A careful dynamical analysis of the segmented blobs makes it possible to create collections of images associated with the same individual (or crossing) in subsequent frames. In the remainder, we refer to these collections as individual and crossing fragments. See Supplementary Figure 9 for an example of fragments and their decomposition into individual and crossing components.
The method used to create these fragments relies, on the one hand, on the classification of the images into crossing and individual categories and, on the other hand, on the overlap of the blobs associated with these images. We start by introducing some notation. Then, we describe the algorithms to generate individual and crossing fragments in two separate sections.
Let B = {bi,j} be the collection of segmented blobs, where the first index of the elements of B corresponds to the frame number, and the second to the order in which the blobs have been segmented. We recall that, given two blobs b1 and b2, we say that they overlap if and only if the intersection of their sets of pixels is not empty. Following the notation introduced in section D.2.2, given a blob bi,j ∈ B we call P (bi,j) and N (bi,j) the collections of blobs overlapping with bi,j in frames (i - 1) and (i + 1), respectively.
D.3.1 Individual fragments
We iterate over the elements of B = {bi,j} proceeding by frame number i and then following the natural ordering induced by the second index. Let bi,j be a blob associated with an image labelled as an individual. We create individual fragments by considering only the future overlapping history of bi,j. If bi,j is not yet part of any individual fragment, we associate with bi,j a unique fragment identifier α (i. e. bi,j is the blob initiating an individual fragment). To simplify the notation, let bi,j = b, P = P (b) and N = N (b). We consider two cases:
case 1: |N| > 1. The blob b in frame i overlaps with more than one blob in frame i + 1; hence it is the only blob (and image) associated with the individual fragment α.
case 2: |N| = 1. Let nb be the unique element of N. The fact that b overlaps with a single blob in its future history is a necessary condition for nb to be part of the same fragment as b, but not sufficient: it could be that nb overlaps with more than one blob in frame i, i. e. |P (nb)| > 1. Thus we say that nb is in the same individual fragment as b if and only if |N (b)| = 1 and |P (nb)| = 1. We also require the image of nb to be labelled as an individual.
If case 1 is verified we generate a new fragment identifier and continue iterating on the elements of B. Otherwise, we apply the same algorithm to nb in order to enlarge the individual fragment α as much as possible. We stop adding blobs to the fragment whenever, during the iteration, a new candidate blob fulfils the condition in case 1. See algorithm 1 for the pseudocode.
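The fragment-construction loop above can be sketched as follows. The blob data structure and its field names are hypothetical; overlap lists (`prev`, `next`) are assumed precomputed from the pixel overlap of consecutive frames.

```python
def build_individual_fragments(blobs):
    """blobs: dict mapping (frame, index) -> {'is_individual': bool,
    'next': [keys of overlapping blobs in frame+1],
    'prev': [keys of overlapping blobs in frame-1]}.
    Returns a dict mapping blob key -> fragment identifier."""
    fragment_of = {}
    next_id = 0
    for key in sorted(blobs):                 # by frame, then segmentation order
        if not blobs[key]['is_individual'] or key in fragment_of:
            continue
        fragment_of[key] = next_id            # blob initiating fragment alpha
        cur = key
        # extend while there is exactly one overlapping blob in the next frame,
        # that blob has a single past overlap, and it is an individual
        while True:
            nxt = blobs[cur]['next']
            if len(nxt) != 1:
                break
            nb = nxt[0]
            if len(blobs[nb]['prev']) != 1 or not blobs[nb]['is_individual']:
                break
            fragment_of[nb] = next_id
            cur = nb
        next_id += 1
    return fragment_of
```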
D.3.2 Crossing fragments
In the same setting as the previous section, let bi,j = b be a blob associated with a crossing image. If b is not yet equipped with a crossing fragment identifier, we generate a new identifier β. The conditions are almost identical to the individual fragments' case:
case 1: |N| > 1. The blob b in frame i overlaps with more than one blob in frame i + 1, hence the crossing represented by b is splitting. Thus b is the only blob associated with the crossing fragment β.
case 2: |N| = 1, |P (nb)| = 1 and nb is associated with a crossing image. We add nb to the crossing fragment β.
In the second case, we try to extend the crossing fragment simply by iterating the algorithm on nb, and verifying that both P (nb) and N (nb) have cardinality 1, and the unique element of N (nb) is associated with a crossing image.
The pseudocode presented in algorithm 1 can easily be adapted to work with crossing fragments by modifying the if and while conditions.
D.4 Cascade of training/identification protocols
After fragmentation has finished, the training of the identification network begins. We would first like to give the reader some intuition regarding why it is possible to train an identification network in an automated manner. First imagine that we had at our disposal an all-knowing black box that looked at the set of fragments we have compiled from one part of the video and then told us which fragment belonged to which individual. Remember that each fragment contains an entire set of images belonging to a single individual. Therefore, thanks to the information coming from the black box, we would effectively have at our disposal a set of labelled images, and we could use standard supervised learning to train a classifier that tells the individuals apart. The trained network could then be used on other parts of the video to do identification.
In real life, we do not have access to such a source of information, so we need heuristics to generate our training dataset. To understand our heuristic, let us again remember that each fragment is supposed to contain images belonging to a single individual. We also know the total number of individuals in the video, as this number is specified by the user. Consequently, if we can find one frame of the video where the number of fragments present is equal to the total number of individuals, then we can be sure that each visible fragment in that frame belongs to a separate individual. We can then label the images within each fragment with the same label, while every fragment of course has a different label. Next, we train our network on the resulting dataset of images and labels. This is the intuition behind how we achieve our aim without the help of an omniscient black box.
In order to train the identification network, we have designed three training protocols. The first protocol is the fastest and is able to deal with videos where animals are relatively well separated (crossings are not too frequent). The other two protocols handle more difficult scenarios, where crossings may be frequent, the lighting intensity may drift over time, or the animals may change their features throughout the video (e. g., posture, colour).
Each protocol relies on the information acquired and structures defined in the previous ones. In the following sections we will introduce some definitions and the main elements on which the fingerprint protocols are built. Then, we will discuss each of the three protocols from the simplest and fastest, to the most general and computationally expensive one.
D.4.1 Global fragments
All the protocols rely on a strong, fundamental hypothesis: To learn the features characterizing each individual and consequently identify it, there must exist at least one portion of the video in which all animals are visible and separated.
Let 𝒱 be a video in which the aforementioned condition is fulfilled in frame number i. We define a global fragment as the collection of individual fragments (see section D.3.1) whose images contain the ones extracted from the ith frame of the video, and which counts the same number of individual fragments as the number of individuals to be tracked. We call the minimum frame number in which this condition is satisfied the core of the global fragment. See Supplementary Figure 10 for a visual representation of a global fragment. We denote by 𝒢 the set of all the global fragments in 𝒱 whose shortest individual fragment counts at least 3 images.
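A simplified sketch of how candidate cores of global fragments can be detected from the frame spans of the individual fragments (the flat representation and the function name are ours):

```python
def find_global_fragment_cores(fragments, n_animals):
    """fragments: list of (start_frame, end_frame) spans of individual
    fragments (inclusive). Returns the frames in which exactly n_animals
    individual fragments coexist, i.e. candidate cores of global fragments."""
    if not fragments:
        return []
    last = max(end for _, end in fragments)
    cores = []
    for frame in range(last + 1):
        # count individual fragments whose span covers this frame
        if sum(start <= frame <= end for start, end in fragments) == n_animals:
            cores.append(frame)
    return cores
```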
D.4.2 Identification network
All the fingerprint protocols aim at finding different strategies in order to create datasets of images of the animals labelled with their identities. These datasets will be created from one, or a collection of global fragments and used to train the identification CNN, denoted in the remainder as idCNN. In the following paragraphs we define the architecture, the hyperparameters and algorithms used to train the idCNN.
Preprocessing
The images used to train and test the idCNN are preprocessed with an algorithm similar to the one described in section D.2.3. The images are obtained, aligned and standardised in the same way. The only difference is that the images used to train the DCD are resized to squares of size 40 × 40, while the side of the square training images of the idCNN is set according to the estimated body length of the animals. The body length is estimated as the median of the diagonals of the images associated with the individual blobs.
Architecture
See Supplementary Table 1 (identification convolutional neural network). Both convolutional and fully-connected layers are initialised using Xavier initialisation [12]. Biases are initialised to 0.
Loss function
The loss is the weighted cross-entropy of eq. (D.2). The dataset given by a global fragment is potentially unbalanced: every individual fragment Fi ∈ G contains a certain number of images, say ni. For every Fi ∈ G, we compute the weight wi of eq. (D.2) as

wi = (Σj nj) / ni,

where j varies over the fragments F1,…, Fn in G. Thus, a larger loss is associated with individuals less represented in the dataset. We optimise using stochastic gradient descent, setting the learning rate to 0.005.
Training and validation set
Every individual fragment Fi ∈ G can be written as Fi = {(I1, l1), …, (In, ln)}, where (Ij, lj) is a pair such that the label lj is the identity of the individual depicted in the image Ij. The dataset generated from G is given by the union 𝒟G = ∪iFi. After a random permutation of the pairs (Ij, lj), performed to remove any temporal correlation between the images, we split 𝒟G into the training and validation sets, denoted by T and V and composed of 90% and 10% of the available data, respectively.
Accuracy
The accuracy of the network is computed as the number of successfully classified images over the total number of images, according to eq. (D.3). We measure the single class accuracy following eq. (D.4). This second expression is fundamental when dealing with large groups, in order to evaluate the capability of the network to distinguish each of the individuals.
Training stopping criteria
See section D.2.3.
Exceptions
It is possible that during training, the loss value diverges to infinity. In this case an error is raised and the algorithm stops its execution. Advanced users have the possibility to change the parameters of the idCNN (e. g. learning rate, dropout, number of units per layer).
If the training of the idCNN is not stopped before 10000 epochs (passes over the entire dataset) a warning is produced and the training is stopped. The parameters computed in the last epoch will be used to continue the fingerprint protocol cascade.
D.4.3 Protocol 1: One-global-fragment tracking
This protocol is based on the features learnt by considering the images belonging to a single global fragment. Thus, it is likely to be successful when the individual images are uniform along the entire video.
Choosing the global fragment
Since the network will be trained on a single global fragment, its choice is fundamental. We aim at selecting the global fragment whose individual fragments are sets of images with high variability, in order to be as close as possible to the setting described in Supplementary Figure 1, where images are subsampled from the entire video and hence uncorrelated in time.
We define the distance travelled in a global fragment G as the minimum of the distance travelled in its individual fragments, as described in algorithm 2.
We choose the global fragment realising the maximum of the minimum distance travelled, denoted Gσ(1). This procedure does not guarantee that the chosen global fragment is the one whose images are maximally variable. However, there is a natural correlation between the distance travelled by an animal and the variability of the images stored in the corresponding fragment.
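The max-min criterion can be sketched as follows (a NumPy sketch with a hypothetical data layout: each global fragment is given as a list of centroid sequences, one per individual fragment):

```python
import numpy as np

def distance_travelled(centroids):
    # sum of pixel distances between consecutive centroids of one
    # individual fragment
    c = np.asarray(centroids, dtype=float)
    return float(np.sum(np.linalg.norm(np.diff(c, axis=0), axis=1)))

def choose_global_fragment(global_fragments):
    # score each global fragment by the minimum distance travelled among
    # its individual fragments, then pick the maximum of these minima
    scores = [min(distance_travelled(f) for f in g) for g in global_fragments]
    return int(np.argmax(scores))
```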
Training
After choosing Gσ(1), we label the images belonging to each of its individual fragments with random, unique identities (think of increasing natural numbers). From this dataset of labelled images we create the training and validation sets, as described in section D.4.2, and train the idCNN as specified in section D.4.2. The training is interrupted automatically when one of the conditions in section D.4.2 is satisfied. Let us call θ0 the set of parameters of the idCNN after training, and denote the trained model as idCNN (θ0).
Single image identification
We recall that a trained neural network acts as a function. We use the trained idCNN to identify the individual fragments not used for training. For every image I in an individual fragment F, we compute idCNN (θ0) (I), obtaining SI = {s1,…, sn}, where sj is the softmax computed on the activation of the jth unit of the last fully-connected layer. The idCNN's last layer has as many units as the number of individuals to be tracked (see Supplementary Table 1, idCNN). By definition, Σ_{s ∈ SI} s = 1. Thus, SI can be interpreted as the probability of the input image representing each of the individuals. Each image is labelled with the identity realising the maximum of the softmax: id (I) = argmax (SI).
Remark 3. Given the relatively low number of parameters of idCNN (≈ 200000, see Supplementary Table 1, idCNN), and the usage of GPU computing, the single image identification step is time efficient, even when dealing with large groups of animals.
Computing the identity probability mass function
When considering an individual fragment, it is natural to take advantage of the hypothesis that all its images are associated with the same individual. We follow the assumptions and logic already presented in [1, Supporting text, Section 3.1]. Let ΛF = (f1,…, fn) be the vector of identification frequencies associated with F, i. e. the vector whose ith component corresponds to the number of images of F assigned to the ith individual, computed as in algorithm 3. Under the assumption that the images in F are independent and that the probability of assigning one image to the correct individual is twice as large as the probability of assigning it to any of the incorrect individuals, we compute for every identity i ∈ {1,…, n}

P1 (F, i) = 2^ΛF(i) / Σ_{j=1}^{n} 2^ΛF(j),     (D.7)

where ΛF (i) is the ith component of the vector ΛF.
The vector P1 (F) = (P1 (F, 1), …, P1 (F, n)) is the probability mass function of F being identified as one of the individuals.
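Computing this probability mass function from the identification frequencies is a one-liner; in the sketch below the shift by the maximum frequency is only for numerical stability with long fragments (function name is ours):

```python
import numpy as np

def p1_vector(frequencies):
    """frequencies: number of images in the fragment assigned to each
    identity. Each image is assumed twice as likely to be assigned to the
    correct identity as to any single incorrect one, giving weights 2**f_i."""
    f = np.asarray(frequencies, dtype=float)
    w = np.exp2(f - f.max())   # proportional to 2**f_i, numerically stable
    return w / w.sum()
```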
Model quality check and identification of individual fragments in global fragments
While identifying the individual fragments belonging to the global fragments not used to train idCNN (θ0), we evaluate the goodness of the model. For every G ∈ 𝒢 \ {Gσ(1)} we proceed by verifying the following conditions, providing at the same time a temporary identification of the individual fragments in G:
G is certain: A global fragment G is certain if all the individual fragments Fi in G are also certain. A fragment F is certain if cert (F) ⩾ 0.1. Let a and b be the indices (identities) realising the first and second maximum of P1 (F) respectively, and Sj the vector of softmax values of all the images assigned to the index j in the fragment F. The function cert (F) is defined as

cert (F) = (⟨Sa⟩ P1 (F, a) - ⟨Sb⟩ P1 (F, b)) / (⟨Sa⟩ P1 (F, a) + ⟨Sb⟩ P1 (F, b)),     (D.8)

where ⟨Sj⟩ denotes the median of the entries of Sj.
Temporary identification of the individual fragments: Let us consider the entire collection of individual fragments {Fi}_{Fi ∈ G}. We start by reordering it according to the maximum value of each P1 (Fi). Let us denote the reordered collection of individual fragments as ℱ = (Fρ(1),…, Fρ(m)). We iterate on the individual fragments indexed as in ℱ. So, Fρ(1) is the individual fragment having maximum probability of being identified as the individual with identity ι = argmaxi P1 (Fρ(1), i). We set ι to be the temporary identity of Fρ(1) if two conditions are met:
There is no identified individual fragment coexisting with Fρ(1) with identity ι.
The value P1 (Fρ(1), ι) is above a minimum threshold that depends on |Fρ(1)|, the number of images in Fρ(1).
We proceed to the next iteration by considering Fρ(2). If one of the aforementioned conditions is not satisfied, the individual fragment is marked as non-consistent, and so is the entire global fragment G.
G is unique: A global fragment G is unique if the temporary identities of the individual fragments Fi in G are all distinct within the global fragment.
We iterate on the global fragments in 𝒢 \{Gσ(1)} by sorting them from the nearest to the farthest with respect to the distance in frames between their core and the core of Gσ(1). See section D.4.1 for the definition of the core of a global fragment.
We now consider the number of images in all the individual fragments that are part of a global fragment. We will refer to this set of images in the remainder as the global fragments’ images or the images in global fragments. If at least 99.95% of these images are contained in global fragments considered acceptable with respect to the conditions listed in the previous paragraph, we interrupt the cascade of protocols and we pass to the residual identification described in section D.5. Otherwise, the second protocol is put in place.
D.4.4 Protocol 2: Global-fragments-accumulation
The main aim of this protocol is to accumulate the images belonging to those global fragments, detected during protocol 1, that are simultaneously certain, consistent and unique. By iterating this procedure, it is possible to incorporate new images into the labelled dataset used to train the idCNN. This accumulation procedure makes it possible to learn features that capture the individuals' variability throughout the video. See Supplementary Figure 12 for the flow of the algorithm of this protocol.
Global accumulation
Let A-1 = {Gσ(1)} be the set containing the first global fragment used for training, and A0 = {G1,1,…, G1,n} ⊂ 𝒢 \ {Gσ(1)} be the subset of global fragments that meet the conditions described in section D.4.3.
First, we fix the identities of the individual fragments belonging to the global fragments in A0, since the images associated with these individual fragments will be used to train the idCNN.
We build the dataset 𝒟A0 by considering all the labelled images contained in every global fragment in A-1 ∪ A0. Note that, since an individual fragment can be shared by several global fragments, its images are collected only once.
𝒟A0 is then split into the training and validation sets (T1 and V1), according to the proportions specified in section D.4.2 and an additional constraint: in the training and validation sets every individual can be represented by at most 3000 images. If the number of images associated with a certain individual in 𝒟A0 exceeds this threshold, 3000 images are randomly subsampled from this collection, taking 1800 samples from the images previously accumulated (images in A-1) and the remaining 1200 from the new set of accumulated images (A0).
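The per-individual cap can be sketched as follows. Names and the fixed seed are ours; the real procedure changes the subsampling permutation at every accumulation iteration (cf. Remark 4).

```python
import random

def subsample_identity_images(old_images, new_images, cap=3000, old_cap=1800):
    """Cap the number of images per individual, keeping at most `old_cap`
    previously accumulated images and filling the rest with new ones."""
    if len(old_images) + len(new_images) <= cap:
        return old_images + new_images
    rng = random.Random(0)  # fixed seed only for reproducibility of this sketch
    n_old = min(len(old_images), old_cap)
    n_new = min(len(new_images), cap - n_old)
    return rng.sample(old_images, n_old) + rng.sample(new_images, n_new)
```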
Remark 4. At every iteration of the accumulation, the permutation used to subsample the images representing the same individual changes in order to train the idCNN with maximally variable images.
We train the network using the stopping criteria listed in section D.2.3. Following the notation introduced in section D.4.3, we denote the idCNN model obtained after training as idCNN (θ1). The accumulation process is iterated by defining Ai as the set of acceptable global fragments in 𝒢 \ (A-1 ∪ … ∪ Ai-1). The set of new acceptable global fragments is computed by identifying the individual fragments not used for training and applying the procedure described in section D.4.3.
Partial accumulation
Partial accumulation is a riskier accumulation strategy: it allows single individual fragments, rather than entire global fragments, to be included in the dataset of accumulated images. For this reason, before applying this strategy, we require that more than half of the images contained in the set of global fragments have been accumulated via global accumulation. Assume that this condition is reached at iteration i. Then an individual fragment F ∈ G, for some global fragment G ∈ 𝒢, is accumulated if:
1. cert (F) > 0.1;
2. letting γ (F) be the set of individual fragments that coexist with F in at least one frame of the video, at least half of the elements of γ (F) have already been accumulated;
3. the identity of F is coherent with all the individual fragments in γ (F), i. e. the assignment of a certain identity to F does not create duplications.
If all these conditions are met, F is added to the set of accumulated images as a single individual fragment.
Accumulation stopping criteria
We stop both the global and the partial accumulation processes if one of the two following conditions holds:
99.95% of the images in global fragments have been accumulated;
there are no more acceptable global or individual fragments.
Evaluation of the accumulation
If the number of images accumulated in the last iteration is less than 90% of all the images in the global fragments, the accumulation is considered not acceptable and the third protocol is used. Otherwise we proceed to the identification of the individual fragments not identified during accumulation, see section D.5.
D.4.5 Protocol 3: pretraining and accumulation
This last protocol makes it possible to learn the features of the images representing the individuals globally, by taking advantage of their local organisation in global fragments.
Pretraining
Given a video 𝒱, let 𝒢 be the set of global fragments as defined in section D.4.1. We rewrite the idCNN by considering it as the juxtaposition of its convolutional part idCNNc and its fully-connected part idCNNf, with sets of parameters Γ and Φ, respectively. Here follows the list of processes involved in the pretraining algorithm:
(i) Consider the set σ (𝒢) ={Gσ(1),…, Gσ(n)} of global fragments 𝒢 ordered by distance travelled. See section D.4.3.
(ii) Iterate on the elements of σ (𝒢). Let 𝒟Gσ(i) be the dataset of labelled images built at the ith iteration. Assign a random unique identity to each individual fragment: the aim is to learn features, and to classify the individuals only locally. Generate both the training and validation sets as in section D.4.2.
(iii) Train the model using, for the convolutional part, the parameters Γσ(i-1) learnt during the previous iteration, and reinitialise idCNNf. This step makes it possible to learn convolutional filters optimised on the task of distinguishing the animals in Gσ(i-1) based on their local labelling in the global fragment.
(iv) Conclude the training according to the conditions listed in section D.4.2.
(v) Iterate on σ (𝒢) until 95% of the images stored in the global fragments have been used to train the network.
Accumulation parachute
After pretraining, we start the accumulation of reference images as in section D.4.4, but keeping the parameters of idCNNc learnt during pretraining frozen along the entire accumulation. Thus, in the first step of this second accumulation we reinitialise only the fully-connected part of the idCNN. With these settings, we apply the accumulation protocol, updating only the parameters Φ and starting from the global fragment Gσ(1).
If more than 90% of the images in the global fragments are accumulated during the accumulation, we proceed to the identification of the individual fragments not used for training (section D.5). Otherwise, we repeat the accumulation starting from Gσ(2). If the accumulation fails in this case as well, we repeat it with Gσ(3) as a basis. Finally, we end the deep protocol cascade by selecting the accumulation in which the largest number of images has been used for training and hence already identified. Using the parameters of the idCNN learnt in the chosen accumulation, we proceed to the identification of the remaining individual fragments.
Remark 5. As pointed out in section D.4.3, the computation of the distance travelled cannot guarantee that the images in Gσ(1) are maximally uncorrelated. Hence, rather than assigning identities with a non-optimal model, it is important to try to learn starting from different global fragments, which could incorporate images whose features are key to maximising the number of accumulated global fragments.
D.5 Residual identification
After the fingerprint protocol cascade, it is necessary to identify those individual fragments that could not be accumulated, either because they are not included in any global fragment, or because they gave a low certainty value during testing. We recall that all the individual fragments already accumulated are endowed with both an identity and the P1-vector. See section D.4.3 and eq. (D.7) for details about the identification of individual fragments during accumulation.
D.5.1 Non-accumulated images identification
Let 𝒰 = {F1,…, Fn} be the set of individual fragments that were not identified during the protocols described in section D.4. First, we assign an identity to every image I in every Fi ∈ 𝒰 by passing it through idCNN (θfinal), where θfinal are the parameters learnt during the fingerprint protocol cascade. Then P1 (Fi) is computed for every Fi ∈ 𝒰 according to section D.4.3 and eq. (D.7).
D.5.2 Identification of non-accumulated individual fragments
When assigning the identity of an individual fragment, it is desirable to take advantage of the fact that the same identity cannot be assigned to two fragments that coexist in time. Given an individual fragment F, let 𝒞 (F) be the set of identified individual fragments coexisting with F (excluding F itself), such that every element of 𝒞 (F) is equipped with a P1-vector. We integrate the information coming from the identified fragments coexisting with F by following the approach of [1, Supporting text, Section 3.1]. We define the probability of the fragment F being assigned the identity i as

P2 (F, i) = P1 (F, i) Π_{F̃ ∈ 𝒞(F)} (1 - P1 (F̃, i)) / Σ_{k=1}^{n} P1 (F, k) Π_{F̃ ∈ 𝒞(F)} (1 - P1 (F̃, k)),     (D.9)

where n is the total number of animals.
Furthermore, we define the identification certainty of F as

cert (F) = (⟨Sa⟩ P1 (F, a) - ⟨Sb⟩ P1 (F, b)) / (⟨Sa⟩ P1 (F, a) + ⟨Sb⟩ P1 (F, b)),

where a and b are the indices (identities) that realise the first and second maximum of P1 (F), respectively, and ⟨Sj⟩ is the median of the softmax values of the images of F assigned to identity j.
We compute P2 (F) = (P2 (F, 1), …, P2 (F, n)) and the identification certainty for every individual fragment F ∈ 𝒰. We proceed to identify the fragments in 𝒰 from higher to lower values of the certainty. For every fragment we assign the identity ι = argmaxi (P2 (F, i)). If there are two identities realising the maximum of P2, we do not identify the fragment (in the GUI these fragments are indicated with the identity 0 and the colour black). If the fragment F is identified, say with identity i, we set P1 (F, i) = 1 and P1 (F, j) = 0, ∀j ≠ i. Then, we recompute P2 and the certainty for every identified fragment coexisting with F. According to eq. (D.9), all these coexisting fragments will have P2 ( · , i) = 0. This prevents the assignment of the same identity to multiple coexisting individual fragments.
The process is iterated on 𝒰 \ {F}, until all the fragments are either equipped with an identity or deemed unsuitable for identification.
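The P2 update of eq. (D.9) can be sketched as follows (a NumPy sketch with names of our choosing; the degenerate case where every identity is excluded by coexisting fragments is not handled):

```python
import numpy as np

def p2_vector(p1_fragment, p1_coexisting):
    """P2 down-weights identities already supported by coexisting fragments:
    P2(F, i) is proportional to P1(F, i) times the product over coexisting
    fragments of (1 - P1(., i)), renormalised over identities.
    p1_coexisting: list of P1 vectors of fragments coexisting with F."""
    p2 = np.asarray(p1_fragment, dtype=float).copy()
    for q in p1_coexisting:
        p2 *= 1.0 - np.asarray(q, dtype=float)
    return p2 / p2.sum()
```

Note how a coexisting fragment already assigned to identity i (P1 = 1 for that identity) drives P2(F, i) to zero, as described above.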
D.6 Post-processing
The training/identification protocols and the residual identification assign an identity to as many individual fragments as possible. The methods involved in the post-processing stage of the algorithm correct trivial identification mistakes and, thereafter, identify the individuals involved in crossings.
These processes are described in detail in the following sections; here, we provide an intuition about the algorithms involved. It is possible to correct trivial identification errors by considering adjacent individual fragments assigned to the same individual. If the individual would have to reach an unrealistic speed in order to move from its position at the end of a fragment to the position corresponding to the beginning of the next one, the identification is assumed to be incorrect. A series of heuristics then either assigns a new (not necessarily different) identity to the fragments involved, or leaves them unidentified.
The idea underlying the identification of crossings is essentially an informed interpolation of the individual trajectories. First, we consider a blob associated with a crossing and work out the identities of the crossing individuals by trying to split the blob through successive erosions. If the blob splits into smaller parts (sub-blobs), we try to link each sub-blob to an already identified individual fragment using two conditions. On the one hand, we evaluate the possible overlap of the sub-blobs with identified individual blobs segmented in either the next or the previous frame. In case the overlapping strategy fails, we seek individual blobs in frames adjacent to the considered crossing that can be linked to the sub-blob by using speed constraints similar to the ones discussed in the previous paragraph.
D.6.1 Evaluate unrealistic identifications at fragment boundaries
Individual fragments are defined by considering the overlap of blobs segmented from consecutive frames (see section D.3). Let us denote the frame numbers spanned by a fragment F as [fs, fe], where fs is the number of the frame from which the first blob associated with F has been segmented. We say that two individual fragments F1 and F2 are consecutive if they share the same identity, say i, and f1e < f2s. We aim to evaluate the quality of the identification of such fragments by comparing a model of the stereotypical speed of the individuals in the video with the speed that individual i would need to travel from its position at frame f1e to its position at frame f2s.
Computation of the stereotypical speed: The stereotypical speed is computed from the speed of the animals in every individual fragment, as follows:
1. For every individual fragment F, let (b1,…, bn) be the blobs collected in F, and (c1,…, cn) their centroids.
2. We compute the speed of the animal in F by considering the distance in pixels between subsequent centroids. Namely, vi = d (ci, ci+1) for i ∈ { 1,…, n - 1 }.
3. We define vmax = P99 (V), where V collects the speeds computed from every individual fragment F.
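The three steps above can be sketched in Python (numpy only; we assume fragment centroids are given as lists of (x, y) pixel coordinates):

```python
import numpy as np

def fragment_speeds(centroids):
    """Frame-to-frame speeds (pixels/frame) within one individual fragment:
    v_i = d(c_i, c_{i+1})."""
    c = np.asarray(centroids, dtype=float)
    return np.linalg.norm(np.diff(c, axis=0), axis=1)

def stereotypical_max_speed(fragments):
    """v_max: the 99th percentile of the pooled speed distribution V."""
    all_speeds = np.concatenate([fragment_speeds(c) for c in fragments])
    return np.percentile(all_speeds, 99)

# Toy example: mostly slow motion plus one fast 19-pixel jump
frag_a = [(0, 0), (1, 0), (2, 0), (3, 0)]
frag_b = [(10, 10), (11, 10), (30, 10)]
v_max = stereotypical_max_speed([frag_a, frag_b])
```

Using the 99th percentile rather than the maximum makes v_max robust to a few spurious centroid jumps caused by segmentation noise.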
Evaluation of consecutive fragments: We set to immutable the identity of all the fragments that have either been identified during the deep protocol cascade, or whose identity has been assigned during the residual identification with max_i P2(F, i) ⩾ 0.9.
Let F1 and F2 be the consecutive fragments described above. The speed at the boundary needed to connect the two fragments is realistic if it does not exceed vmax. In order to test and correct for unrealistic connecting speeds, we proceed as follows. We iterate on the collection of individual fragments, separating them into two subsets: first the individual fragments whose last frame precedes the core of the first global fragment used for training (see section D.4.3), then the others. Let us consider a generic individual fragment F spanning frame numbers [fs, fe]. We check whether there exist fragments Fp and Fn sharing the identity of F and defined, respectively, in the past and in the future. If no such fragments exist, we proceed with the iteration. Otherwise, we evaluate the boundary speeds vp (between Fp and F) and vn (between F and Fn), and distinguish the following cases:
vp > vmax and vn > vmax: If the identity of Fp or Fn is fixed, or if neither the connection of Fp with its previous consecutive fragment nor the connection of Fn with its next consecutive fragment is unrealistic, we set F to be reidentified.
vp > vmax and vn ⩽ vmax: If the identity of Fp is fixed, F is set to be reidentified. Otherwise, Fp is.
We proceed symmetrically in the case vp ⩽ vmax and vn > vmax.
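A minimal sketch of the boundary-speed test; the function names and the normalisation by the frame gap are our assumptions, and the actual implementation operates on fragment objects:

```python
import math

def boundary_speed(end_centroid, end_frame, start_centroid, start_frame):
    """Speed (pixels/frame) needed to travel from the end of one
    fragment to the start of the next one with the same identity."""
    dist = math.dist(end_centroid, start_centroid)
    return dist / max(start_frame - end_frame, 1)

def is_realistic(speed, v_max):
    """A boundary speed is realistic if it does not exceed v_max."""
    return speed <= v_max

# F1 ends at frame 100 at (50, 50); F2 starts at frame 102 at (58, 56):
# 10 pixels covered in 2 frames -> 5 pixels/frame
v = boundary_speed((50, 50), 100, (58, 56), 102)
```

Fragments connected by an unrealistic boundary speed are flagged for the reidentification procedure described next.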
Reidentification of unrealistic consecutive fragments: Let F be an individual fragment to be reidentified. We compute the set A of available identities, i.e. the identities not assigned to the individual fragments coexisting with F, plus the identity of F itself. If |A| = 1, we assign the only available identity to F. Otherwise we proceed by calculating:
The subset S ⊆A of available identities that would not imply unrealistic boundary speeds given F.
The set Q = { i ∈ A s.t. P2(F, i) > ρ(F) }, where ρ(F) is a threshold that depends on n, the number of tracked animals.
The set C = Q ∩ S of candidate identities, i.e. the set of identities that do not create duplications if assigned to F and are at the same time acceptable with respect to both P2(F) and the speed model.
By considering the set C just defined we have:
|C| = 0: No identities are available, thus F is not identified.
C = {i}: We identify F with i.
|C| > 1: F is identified with the identity i ∈ C that realises the minimum boundary speed.
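The selection among candidate identities can be sketched as follows. This is a simplified illustration with names of our choosing: `p2`, `speed_ok` and `rho` stand in for P2(F, ·), the speed model and ρ(F):

```python
def reidentify(available, p2, speed_ok, rho):
    """Choose a new identity for a fragment flagged for reidentification.

    available: set A of identities not used by coexisting fragments
    p2:        dict identity -> P2(F, identity)
    speed_ok:  dict identity -> boundary speed implied by that identity,
               containing only identities with realistic speeds (set S)
    rho:       P2 acceptance threshold rho(F)
    Returns the chosen identity, or None if no candidate survives.
    """
    if len(available) == 1:
        return next(iter(available))
    S = {i for i in available if i in speed_ok}         # realistic speeds
    Q = {i for i in available if p2.get(i, 0.0) > rho}  # acceptable P2
    C = S & Q
    if not C:
        return None  # |C| = 0: the fragment is left unidentified
    # |C| >= 1: pick the identity realising the minimum boundary speed
    return min(C, key=lambda i: speed_ok[i])

choice = reidentify(
    available={1, 2, 3},
    p2={1: 0.95, 2: 0.4, 3: 0.9},
    speed_ok={1: 3.0, 3: 2.0},  # identity 2 would need an unrealistic speed
    rho=0.5,
)
```

Here identities 1 and 3 survive both filters, and 3 is chosen because it implies the smaller boundary speed.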
D.6.2 Crossing identification
Single individuals in a crossing are identified by a Python reimplementation of the algorithm described in [1, Supporting Text, Section 2.12]. See idtracker.ai/postprocessing/ for the documentation and the source code of the algorithm.
D.7 Output
In this section we discuss the final outputs of the algorithm: an estimation of the tracking accuracy, which warns the user in case the algorithm could not proceed smoothly through the tracking process, and the files containing the trajectories of each individual, which are saved and made available to the user.
D.7.1 Estimated accuracy
Let ℐ be the set of all identified individual fragments, NF the number of images in a fragment F, and N = Σ_{F∈ℐ} NF the total number of images in such fragments. We estimate the overall accuracy of the algorithm as accuracy = (1/N) Σ_{F∈ℐ} NF · P2(F, i), where i is the identity assigned by the algorithm to the fragment F.
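One way to read this estimator is as the P2 of each fragment's assigned identity weighted by the fragment's number of images; the sketch below follows that interpretation (ours, not necessarily the exact formula used):

```python
def estimated_accuracy(fragments):
    """Image-weighted mean of P2 over identified fragments.

    fragments: list of (n_images, p2_of_assigned_identity) pairs,
               one per identified individual fragment.
    """
    total_images = sum(n for n, _ in fragments)
    return sum(n * p2 for n, p2 in fragments) / total_images

# Hypothetical fragments: long confident ones dominate the estimate
acc = estimated_accuracy([(100, 0.999), (50, 0.95), (10, 0.8)])
```

Weighting by image count means that a few short, poorly identified fragments lower the estimate only in proportion to the video time they cover.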
D.7.2 Individual trajectories
The algorithm outputs two individual trajectory files: one generated by considering the identification of individual images only, and a second one that also includes the identification of individuals during crossings (see section D.6.2).
Both are organised as matrices of shape (number of frames) × (number of individuals) × 2, where the last two components are the position of the centroid of each individual in pixel coordinates, with respect to the entire frame.
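For concreteness, the layout of such a matrix in numpy; the use of NaN for frames where an individual has no identified centroid is our assumption:

```python
import numpy as np

n_frames, n_individuals = 1000, 10

# trajectories[frame, individual] = (x, y) centroid in pixel coordinates;
# NaN may mark frames where the individual is not identified (assumption)
trajectories = np.full((n_frames, n_individuals, 2), np.nan)

# Positions of individual 3 over the whole video: shape (n_frames, 2)
ind3 = trajectories[:, 3, :]

# Positions of all individuals at frame 100: shape (n_individuals, 2)
frame100 = trajectories[100, :, :]
```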
E Human validation
After a video has been tracked, idtracker.ai provides an estimate of its own accuracy (see section D.7.1). Human validation is necessary to evaluate the goodness of the automatic accuracy assessment, to notice recurrent inaccurate identifications, and to evaluate the limiting conditions under which the tracking system can work (e.g. suitability of the setup and of the recording conditions, and quality of the images).
We recall that the identity of an individual is maintained throughout an individual fragment, so a misidentification can only happen after a crossing or an occlusion. Note that here a bad segmentation of the images (see section D.1) counts as an occlusion. Hence, the optimal validation would consist in checking that the identities of the animals before and after every crossing are conserved. Identities are assigned for the first time when labelling the individual fragments of the first global fragment used to train the idCNN. Hence, only by starting the validation from that global fragment can one be sure that no switch of identities between two or more individuals has occurred.
When dealing with large groups or particularly long videos, validating all the crossings is extremely costly. For this reason, we provide two procedures to facilitate the process. On the one hand, a global validation graphical interface allows the user to easily check the goodness of the identification of all the individuals in a segment of the video, to correct their identities, and to compute the accuracy of the identification with respect to the user-generated ground truth. On the other hand, an individual validation procedure allows the user to select a specific animal and follow it throughout the video. All the crossings or occlusions that do not involve that individual are ignored, allowing fast validation of long segments of the video.
E.1 Global validation
Starting from the core of the first global fragment (see section D.4.1), we manually check that in every crossing the identities of all the animals involved are maintained. Corrected identities are stored. After providing the segment S = (start - end) on which validation has been performed, we compute the following accuracy indices. Let ℐS be the total number of individual images validated.
1. Accuracy during the protocol cascade: Number of images correctly identified during the fingerprint protocol cascade, over the total number of individual images used to train the idCNN in S.
2. Accuracy: Number of images correctly identified over ℐS.
3. Percentage of non-identified images: Number of images not identified, over ℐS.
4. Percentage of misidentified images: Number of images misidentified, over ℐS.
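The four indices reduce to simple ratios over the validated segment; a sketch with hypothetical counts (the function and argument names are ours):

```python
def validation_indices(n_total, n_correct, n_unidentified,
                       n_correct_training, n_training):
    """Accuracy indices over a validated segment S.

    n_total:            total validated individual images (|I_S|)
    n_correct:          images whose identity matches the ground truth
    n_unidentified:     images with no identity assigned
    n_correct_training: correctly identified images among those used
                        to train the idCNN in S
    n_training:         images used to train the idCNN in S
    """
    n_wrong = n_total - n_correct - n_unidentified
    return {
        "accuracy_protocol_cascade": n_correct_training / n_training,
        "accuracy": n_correct / n_total,
        "pct_non_identified": 100 * n_unidentified / n_total,
        "pct_misidentified": 100 * n_wrong / n_total,
    }

idx = validation_indices(n_total=2000, n_correct=1980, n_unidentified=10,
                         n_correct_training=500, n_training=500)
```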
E.2 Individual validation
Individual validation is performed by considering a single individual at a time, always proceeding from the core of the first global fragment used in the protocol cascade towards earlier or later frames. When validating the individual assigned the identity ι, we are interested only in the crossings and occlusions in which it is involved. In this way, the validation is much faster and it is possible to control the quality of the identification over a wider timespan. After correcting the misidentified images, we compute the accuracy of the assignment of ι as the number of correctly identified images over the total number of images representing the individual.
Acknowledgements
We thank Alfonso Perez-Escudero and Andres Laan for discussions, Antonia Groneberg and Andres Laan for a critical reading of the manuscript, João Bauto and Ricardo Ribeiro for assistance with hardware and software, Paulo Carriço for help in designing the fish arenas, Ana Catarina Certal and Isabel Campos for animal husbandry, Andrew I. Bruce and Nico Blüthgen for videos of ants (Diacamma), and Joana Couceiro, Liliana Costa, Clara Ferreira and Tomas Cruz for assistance with fly experiments. This study was supported by an NVIDIA GPU grant (to M.G.B., F.H. and G.G.dP.), Fundação para a Ciência e a Tecnologia grant PTDC/NEU-SCC/0948/2014 (to G.G.dP.), including a contract to F.H., and the Champalimaud Foundation (to G.G.dP.), including contracts to M.G.B. and R.H. F.R-F. acknowledges an FCT PhD fellowship.