MoveFormer: a Transformer-based model for step-selection animal movement modelling

The movement of animals is a central component of their behavioural strategies. Statistical tools for movement data analysis, however, have long been limited, and in particular, unable to account for past movement information except in a very simplified way. In this work, we propose MoveFormer, a new step-based model of movement capable of learning directly from full animal trajectories. While inspired by the classical step-selection framework and previous work on the quantification of uncertainty in movement predictions, MoveFormer also builds upon recent developments in deep learning, such as the Transformer architecture, allowing it to incorporate long temporal contexts. The model predicts an animal’s next movement step given its past movement history, including not only purely positional and temporal information, but also any available environmental covariates such as land cover or temperature. We apply our model to a diverse dataset made up of over 1550 trajectories from over 100 studies, and show how it can be used to gain insights about the importance of the provided context features, including the extent of past movement history. Our software, along with the trained model weights, is released as open source.


Introduction
The movement of animals is a central component of their behavioural strategies, allowing them to best exploit the landscape they live in, to find a mate or to avoid predators, for instance. The role that these movements play beyond the individual, for instance in shaping animals' ecosystem impacts, is also clear. Accordingly, and thanks to technological developments that make it possible to collect more detailed movement data on more individuals and species each day, the study of animal movement has become an important goal of ecology [1].
For a long time, however, statistical tools to analyze movement data were lacking or limited. Over time, though, purely pattern-based descriptions (e.g. home-range analyses) have been complemented by regression models that allow the effects of spatio-temporal features on movement to be inferred.
Step-selection function (SSF) models, which compare actual movement steps with realistic candidate ones, are one such model class and have de facto become the established approach to analysing animal trajectories [2][3][4]. They are now routinely used to infer and quantify the effect of environmental variables such as, for instance, land cover or temperature.
However, an animal's movement is likely to be driven not only by spatio-temporal environmental features, but also by internal knowledge and rules that are not directly observable. The importance of memory and of an animal's familiarity with places is increasingly recognized [5][6][7], and familiarity is usually incorporated into SSF models using a previously-visited yes/no variable, or a time-spent variable, often calculated over an arbitrary time window [8,9]. Memory of places and their characteristics can also lead to routine movement behaviours. Traplining, in which an individual travels to the same places in the same order, is rare, but it is clear from visual inspection of animal trajectories that many animals display some form of routine movement behaviour. Apart from traplining, which has received a lot of interest, the study of routine movement behaviour has remained extremely limited [10]. Riotte-Lambert et al. [11] showed how conditional entropy, calculated using information on visits to patches, could be used as a metric of routine in movement. That metric has seen little use since then, possibly because the need to determine sites makes it difficult to apply to data collected in nature, where patches can be hard to delineate, be diffuse, or not exist at all. Further work is needed to describe and explain routine movements, which result from the interaction between memorized knowledge, movement rules and environmental context. Additionally, we are not aware of any work focusing on how to incorporate complex information about past movement and environmental context into predictive models of animal movement, although such information should, by definition, improve predictions. The question of the extent to which past movements inform where an animal is likely to go next thus remains open.
The classic implementation of the SSF framework appears unsuitable to address this difficult question. We therefore developed a new type of model that we named MoveFormer. We retained the conceptual attractiveness of SSF, but built on the most recent developments in deep learning to embed information about the current and past animal locations, movement and environmental context.
Our contribution is threefold. First, we propose a model that learns to best predict the next step of a movement trajectory based on a given context length, i.e. a given time window of information about the past. Second, the proposed approach is flexible enough to allow each step in the context to be defined not only by the locations of its start and end points, but also by any kind of features that could be relevant, in particular environmental variables. Third, we show how the model can be used to gain insights about the importance of the provided context, both in terms of how much of the past it is useful to know, and in terms of what kinds of information are most ecologically relevant for predicting an animal's movement. We demonstrate this by comparing predictions, via information-theoretic metrics and prediction accuracy, for different context lengths or with randomized features. Model training and analyses are conducted on a dataset made up of over 1550 trajectories from over 100 studies, encompassing various species of mammals, birds and reptiles.
The MoveFormer source code, including code for data pre-processing and evaluation, as well as complete hyperparameter settings, is available online, and we also release the weights of the trained models.

Data
In this section, we describe our sources of data, specifically: movement data (trajectories consisting of latitudes, longitudes and timestamps), geospatial variables (associated with locations), and taxonomic classification information (associated with each animal).

GPS location data
Our main source of location data is Movebank [12], an online repository for animal movement data. The location data in Movebank is presented as latitude/longitude pairs along with UTC timestamps, and is grouped into trajectories (deployments) and associated with (occasionally missing) metadata such as a taxon name, sex, and date of birth. We used the Movebank API to retrieve data from GPS sensors for all 269 studies that were available for download under a Creative Commons license (CC0, CC BY and CC BY-NC), obtaining 13 577 trajectories comprising a total of 197 million observations (location events). We subsampled the trajectories (splitting them into segments when necessary) so that observations occur at midnight and at noon (according to local mean time) with a tolerance of ±3 h, and so that the time difference between consecutive observations is 9 to 15 h. We discarded trajectory segments shorter than 120 observations, leaving us with 1440 trajectories from 98 different studies. See Table 4 in the appendix for the full list of studies and their licenses.
We added unpublished data from 4 more studies, collected by one of us (S.C-J). These are GPS data from plains zebras and African elephants, collected in Hwange National Park (Zimbabwe), and GPS data from plains zebras and blue wildebeest, collected in Hluhluwe-iMfolozi Park (South Africa). After subsampling and filtering as for the Movebank data, we obtained 73 trajectories. The final dataset contains about 1 million observations from 1506 individuals, grouped into 1513 trajectories with a median length of 408 observations. We performed a train/validation/test split, making sure that 1) the validation and test sections contain only frequent species (those with at least 10 members in the full dataset), and 2) each individual appears in exactly one split. Table 1 details the amounts of data by section and by taxonomic classification, and Fig. 1 shows the geographical distribution.
During training and evaluation, we additionally split each trajectory into segments of length N max = 500 and subsequently consider each of these segments as a separate trajectory.

Taxon vectors
Each trajectory in our data is associated with a taxon name (most commonly the animal's species). To obtain a dense vector representation of the taxon, we look up its Wikipedia article and retrieve the associated 100-dimensional embedding vector from Wikipedia2Vec [166].
A property of Wikipedia2Vec is that embeddings of semantically similar entities are placed close together in the embedding space. To illustrate that this extends, to some degree, to similarity between species, we display in Fig. 2 the PCA (principal component analysis) projections of species embeddings, labeled by higher taxonomic ranks. We also measured the cosine similarity between all pairs of embeddings and found it to be correlated with the number of common ancestors of the two species in the taxonomic hierarchy (Spearman ρ = 0.68).
Overall, the Wikipedia2Vec embeddings appear to meaningfully encode a species' position in the phylogeny. Hence, we speculate (though we do not test this in the present work) that their inclusion should help the model generalize to species that are not present in the training data, at least as long as they are sufficiently similar to those that are.

Geospatial variables
The proposed model is powerful enough to account not only for each trajectory's intrinsic dynamics, but also for any additional third-party information that may be available as covariates. To illustrate this, we augment each trajectory data point with exogenous information. For each location, we retrieve the following geospatial variables, which could be ecologically relevant, from publicly available raster data: • 2009 Human Footprint, 2018 Release [167,168]


Model
As a fundamental use case, we are interested in analyzing the effect of the available past context on the prediction of x_{n+1}. Specifically, for a varying context length c ∈ {1, …, c_max} (where c_max is an arbitrary constant), we wish to study the behavior of the prediction of x_{n+1} given ξ_{n−c+1…n} and t_{n+1}. Hence, we are in fact interested in a model accepting as input any trajectory segment of length at most c_max, and predicting the next location.
We adopt a step-selection function modelling approach [2,4], based on selecting the end-point location of a step from a set of candidates. Specifically, for a position n+1 within a trajectory, given an associated timestamp t_{n+1}, a set of candidate locations x^{(1…K)}_{n+1} and associated variables z^{(1…K)}_{n+1}, we are interested in estimating a probability distribution over the candidates:

    P(y_{n+1} = i | ξ_{1…n}, t_{n+1}, x^{(1…K)}_{n+1}, z^{(1…K)}_{n+1}),    (1)

where i ∈ {1, …, K} and y_{n+1} denotes the index of the selected candidate.
We propose to model this distribution using a deep neural network, consisting of a Transformer [171] encoder and a candidate selection module, as depicted in Fig. 3. The role of the Transformer is to encode the trajectory up until position n, i.e. ξ_{1…n}, along with the timestamp for the next observation, t_{n+1}. The candidate selection module then encodes each candidate x^{(i)}_{n+1} and employs an attention mechanism to compute a probability distribution over the candidates. The model is described in detail in Section 3.1, followed by our choice of input representation in Section 3.2. In order to train and evaluate this model, we also need a way to generate suitable candidate locations x^{(i)}_{n+1}. We use a simple but general method employing quantile-based modelling of turning angles and movement distances, as detailed in Section 3.3.

Input embeddings
We build two sets of embeddings: the input embeddings ϕ^{in}_n, fed to the trajectory encoder, depend on ξ_n, while the candidate embeddings ϕ^{cand,(i)}_n, input for the candidate selection module, depend on x_n, x^{(i)}_{n+1} and z^{(i)}_{n+1}. The inputs are represented as collections of carefully engineered continuous and discrete features that we describe later (see Section 3.2). Missing (NaN) values are replaced with a special embedding vector learned as an additional parameter. In each case, we project each feature vector to a common embedding space R^{d_emb}, then linearly combine the projections (with different learnable coefficients in each of the two cases).
More precisely, for ϕ^{in}:

    ϕ^{in}_n = Σ_{j=1}^{F} w^{in,(j)} (W^{(j)} f^{(j)}_n + b^{(j)}),    (2)

where f^{(j)}_n is the j-th out of all F feature vectors at step n, and the learnable parameters are coefficients w^{in,(j)} ∈ R (we set w^{in,(j)} = 0 for features we do not wish to consider), biases b^{(j)} ∈ R^{d_emb} and weight matrices W^{(j)} ∈ R^{d_emb × d_j}. The formula for ϕ^{cand,(i)}_n is analogous. As can be seen from Eq. (2), the chosen method for constructing input embeddings allows features to have different dimensions and automatically projects them to the desired embedding dimension (via W^{(j)} and b^{(j)}) before applying scaling through w^{in,(j)}.
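For illustration, the embedding construction of Eq. (2) can be sketched in PyTorch as follows (a minimal sketch: the module and variable names are ours and do not correspond to the released implementation):

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Projects F feature vectors of different dimensions to a common
    embedding space and combines them with learnable scalar coefficients,
    as in Eq. (2). Fully-missing (NaN) feature vectors are replaced with
    a learned embedding vector."""

    def __init__(self, feature_dims, d_emb):
        super().__init__()
        # One linear projection (W^(j), b^(j)) per feature
        self.proj = nn.ModuleList([nn.Linear(d, d_emb) for d in feature_dims])
        # One scalar coefficient w^(j) per feature; setting it to 0 ignores the feature
        self.w = nn.Parameter(torch.ones(len(feature_dims)))
        # Learned replacement embedding for missing feature vectors
        self.nan_emb = nn.Parameter(torch.zeros(d_emb))

    def forward(self, features):
        # features: list of tensors, features[j] of shape (batch, seq, d_j)
        out = 0.0
        for j, f in enumerate(features):
            e = self.proj[j](torch.nan_to_num(f))
            # Where the whole feature vector is missing, use the learned embedding
            nan_mask = torch.isnan(f).all(dim=-1, keepdim=True)
            e = torch.where(nan_mask, self.nan_emb.expand_as(e), e)
            out = out + self.w[j] * e
        return out
```

The same module, with its own coefficients, would serve for the candidate embeddings ϕ^{cand,(i)}_n.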

Trajectory encoder
The trajectory encoder is a Transformer encoder with causally masked attention. It receives the embedding sequence ϕ^{in}_{1…N−1} and outputs a sequence of vectors h_{1…N−1}, where h_n is a representation of ξ_{1…n}. The encoder does not use any positional encoding in the conventional sense (encoding the indices 1, …, N−1, as is commonly done in Transformers); instead, position information is conveyed by the feature representations of the timestamps t_{1…N−1}.

Candidate selection
The candidate selection module is used to select the next location out of a list of candidates. We build upon the common approach that models the probability of an individual being present at a given candidate location via conditional logistic regression [3]; expressed in our notation:

    P(y_{n+1} = i | ·) = exp(β^T ϕ^{cand,(i)}_n) / Σ_{k=1}^{K} exp(β^T ϕ^{cand,(k)}_n),    (3)

where β is a parameter vector. In this work, in order to incorporate the context representation computed by the trajectory encoder, we replace the global parameter vector β with a context-dependent query vector q_n ∈ R^{d_sel}, which is a linear projection of the trajectory encoder output h_n. We also do not use the raw candidate features ϕ^{cand,(i)}_n, but replace them with a key vector k^{(i)}_n, computed by concatenating the feature vector with the corresponding encoder output h_n and passing the result through a candidate encoder (a fully-connected network): k^{(i)}_n = CandEnc([ϕ^{cand,(i)}_n; h_n]). Thus, we arrive at a dot-product attention mechanism; scaling the dot products by 1/√d_sel as in Transformer attention [171], we have:

    P(y_{n+1} = i | ·) = softmax_i( q_n · k^{(i)}_n / √d_sel ).    (4)

During training, the first candidate location x^{(1)}_{n+1} is taken as the true next location x_{n+1}; the rest of the candidates are randomly sampled around the current location x_n (we detail this process below). This allows us to define a cross-entropy loss, which we minimize through stochastic gradient descent using the Adam optimizer:

    L = − Σ_n log P(y_{n+1} = 1 | ξ_{1…n}, t_{n+1}, x^{(1…K)}_{n+1}, z^{(1…K)}_{n+1}).    (5)
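As an illustration, the candidate selection mechanism described above can be sketched as follows (a sketch only: layer sizes, hidden dimensions and names are illustrative, not those of the released code):

```python
import math
import torch
import torch.nn as nn

class CandidateSelection(nn.Module):
    """Scores K candidate embeddings against a context-dependent query via
    scaled dot-product attention, returning log-probabilities over candidates."""

    def __init__(self, d_model, d_cand, d_sel, d_hidden=256):
        super().__init__()
        self.query = nn.Linear(d_model, d_sel)        # h_n -> q_n
        self.key = nn.Sequential(                     # [phi_cand; h_n] -> k
            nn.Linear(d_cand + d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_sel),
        )
        self.d_sel = d_sel

    def forward(self, h_n, phi_cand):
        # h_n: (batch, d_model); phi_cand: (batch, K, d_cand)
        q = self.query(h_n)                                     # (batch, d_sel)
        h_rep = h_n.unsqueeze(1).expand(-1, phi_cand.size(1), -1)
        k = self.key(torch.cat([phi_cand, h_rep], dim=-1))      # (batch, K, d_sel)
        scores = (k @ q.unsqueeze(-1)).squeeze(-1) / math.sqrt(self.d_sel)
        return torch.log_softmax(scores, dim=-1)                # log P over candidates
```

With the true next location placed at candidate index 0, the cross-entropy loss of Eq. (5) is then simply `-logp[:, 0].mean()` (averaged rather than summed here, which only rescales the gradient).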

Variable receptive field training
As mentioned above, we aim to evaluate our model on arbitrary trajectory segments up to some maximum length c_max (this procedure is detailed below in Section 4.1). As can be seen from Eq. (5), our model is effectively trained simultaneously on all prefixes of the trajectory ξ_{1…N}. Hence, the model is able to accept segments of variable length as desired, but being trained only on trajectory prefixes may bias it, leading to incorrect predictions on segments that are not prefixes. To alleviate this, we propose a training scheme that intervenes on the attention weights to randomly vary the past context available for each prediction.
In each training batch, we sample a random integer B uniformly from {1, …, N_max} and apply a block-diagonal attention mask to the attention matrix (on top of the causal mask) with blocks of size B (with the last block truncated if B ∤ N). As a result, the ranges of positions {1, …, B}, {B+1, …, 2B}, etc. are prevented from attending to each other, and the corresponding segments are therefore effectively treated as separate trajectories.
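The combined causal and block-diagonal mask can be constructed as follows (a sketch; `True` marks query-key pairs that are allowed to attend):

```python
import torch

def block_causal_mask(n, block_size):
    """Boolean attention mask of shape (n, n), rows indexing queries and
    columns indexing keys, combining the causal mask with a block-diagonal
    mask of the given block size. In the variable receptive field scheme,
    block_size is sampled uniformly from {1, ..., N_max} for each batch."""
    idx = torch.arange(n)
    causal = idx.unsqueeze(1) >= idx.unsqueeze(0)        # query i may see key j <= i
    block = idx // block_size                             # block index of each position
    same_block = block.unsqueeze(1) == block.unsqueeze(0)
    return causal & same_block
```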

Data representation
Let us now describe the feature mappings used for location and time, as well as associated features.

Location
In the raw data, each location x_n is represented as a GPS coordinate pair (latitude, longitude). We represent it as a geodetic normal vector (n-vector) ν(x_n) ∈ R^3. Additionally, we encode the position relative to the previous location x_{n−1} as a movement vector μ(x_{n−1}, x_n) ∈ R^2, obtained by computing the bearing and distance from x_{n−1} to x_n and converting them to Cartesian coordinates. We apply scaling to make the overall root-mean-square (RMS) of the norms of the movement vectors computed on the training dataset equal to 1.
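For illustration, the two encodings can be computed as follows (a sketch using a spherical-Earth approximation; the function names and the choice of great-circle formulas are ours, and the RMS `scale` would be estimated on the training set):

```python
import math

def n_vector(lat, lon):
    """Geodetic normal vector of a (lat, lon) pair on the unit sphere."""
    p, l = math.radians(lat), math.radians(lon)
    return (math.cos(p) * math.cos(l), math.cos(p) * math.sin(l), math.sin(p))

def movement_vector(lat1, lon1, lat2, lon2, scale=1.0):
    """Bearing and great-circle distance from point 1 to point 2, converted
    to a 2-D (east, north) cartesian movement vector."""
    R = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    # Initial bearing from point 1 to point 2
    y = math.sin(dlon) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlon)
    bearing = math.atan2(y, x)
    # Haversine distance
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2)
    dist = 2 * R * math.asin(math.sqrt(a))
    return (dist * math.sin(bearing) / scale, dist * math.cos(bearing) / scale)
```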
Analogously, we encode each candidate location x^{(i)}_{n+1} as an n-vector ν(x^{(i)}_{n+1}) and as a movement vector μ(x_n, x^{(i)}_{n+1}).
We also encode the time difference w.r.t. the next timestamp t_{n+1} as a 12-dimensional vector of sines and cosines with the same periods as above, plus a period of 25 years. While this multi-scale encoding may not be necessary in our case (where the time differences are between 9 and 15 h), we propose it as a generic representation suitable for any time scale from seconds to years (and hence for virtually all existing animal movement data).
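Such a multi-scale encoding can be sketched as follows (only the 12-dimensional shape, the sine/cosine construction and the 25-year period follow the description above; the exact remaining periods used in MoveFormer are an assumption here):

```python
import math

DAY = 86_400.0
YEAR = 365.25 * DAY
# Assumed set of 6 periods spanning scales from a day to 25 years,
# yielding a 12-dimensional (sine, cosine) encoding.
PERIODS = [DAY, 7 * DAY, 30 * DAY, YEAR, 5 * YEAR, 25 * YEAR]

def encode_time_delta(dt_seconds):
    """Encode a time difference as interleaved sines and cosines
    at several periods."""
    feats = []
    for p in PERIODS:
        angle = 2 * math.pi * dt_seconds / p
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats
```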

Associated variables
For each input and candidate location, we retrieve and pre-process geospatial variables as described in Section 2.3.We also include the taxon vectors (as described in Section 2.2) as an additional encoder feature vector for every element of the input sequence.

Candidate sampling
We sample each candidate location x^{(i)}_{n+1} as follows:
• we estimate the current bearing β of the animal from the positions x_n and x_{n−1};
• we independently sample a turning angle θ ∼ P(θ) and a log-distance log d ∼ P(log d);
• we set β′ ← β + θ;
• we obtain x^{(i)}_{n+1} by moving x_n according to β′ and d.
P(θ) and P(log d) are estimated on the training set as follows:
• We collect all turning angles from the training set and compute the quantiles (estimated using linear interpolation) at 101 equally spaced points 0 = q_0, q_1, …, q_100 = 1. We use them to construct the quantile function of P(θ) as a piecewise linear function with knots at q_0, q_1, …, q_100.
• We collect the natural logarithms of all non-zero distances between consecutive points in the dataset; we construct the quantile function of P (log d) analogously.
We sample from each distribution by drawing a sample from U[0, 1] and passing it through the estimated quantile function; this is sometimes called the increasing rearrangement [172].
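The quantile-function estimation and the increasing-rearrangement sampling can be sketched as follows (function names are ours):

```python
import numpy as np

def fit_quantile_function(samples, n_knots=101):
    """Empirical quantile function: quantiles (linearly interpolated) at
    equally spaced probability levels, defining a piecewise-linear map."""
    probs = np.linspace(0.0, 1.0, n_knots)
    knots = np.quantile(samples, probs)  # linear interpolation by default
    return probs, knots

def sample_from_quantiles(probs, knots, size, rng=None):
    """Draw samples by passing U[0, 1] draws through the estimated quantile
    function (the 'increasing rearrangement')."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(0.0, 1.0, size)
    return np.interp(u, probs, knots)
```

In our setting, one such pair of quantile functions would be fitted per taxon, one for turning angles and one for log-distances.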
In our experiments, we condition the distributions on the taxon, i.e. we estimate a separate pair of distributions on the section of the training dataset corresponding to each taxon.

Implementation details and hyperparameters
Our implementation of MoveFormer, available as open source software, is written in Python using the PyTorch framework and the x-transformers package. The code for efficient geospatial variable loading relies on the rasterio library and is released as a separate package, gps2var.
The trajectory encoder is a 6-layer Transformer with 8 attention heads per layer and a feature dimension of 128. The candidate encoder is a fully-connected neural network with one hidden layer of size 256 and a GELU activation [173]. The candidate selection module has d_sel = 128. The total number of parameters of the model is around 2.6 million: several orders of magnitude smaller than current state-of-the-art Transformer language models, for instance, but appropriate for the limited-size dataset that we are working with.
The Adam optimizer uses a learning rate of 5 × 10^−5 with linear warm-up and exponential decay. We train for 180 epochs with a batch size of 24, taking 7.5 h on a Tesla V100 GPU (note that GPU utilization was only about 20% and the performance bottleneck appeared to be geospatial variable loading). We validate on the validation set twice per epoch and use the checkpoint with the lowest validation loss.
The complete hyperparameter settings are included with the source code.

Context length analysis
Riotte-Lambert et al. [11] propose to use conditional entropy as a measure of uncertainty in predicting the next location given the c previous locations. Specifically, given a distribution P over sequences of locations, the conditional entropy of order c can be written as

    H_c = − Σ_{s_1,…,s_{c+1}} P(s_1, …, s_{c+1}) log P(s_{c+1} | s_1, …, s_c),    (6)

where P(s_1, …, s_c) is understood as the probability of c consecutive locations in a sequence being equal to s_1, …, s_c, and P(s_{c+1} | s_1, …, s_c) as the conditional probability of s_{c+1} immediately following the sequence s_1, …, s_c. Considering this uncertainty measure as a function of the context length c, it may be used to study routine movement behavior. Riotte-Lambert et al. [11] work with a finite set of discrete locations, allowing them to evaluate expression (6) empirically on a given trajectory. However, the probability estimates quickly become unreliable with increasing c due to data sparsity. Moreover, the method is inapplicable when locations are unique, as in our case.
We propose an alternative, which is to approximate log P using a suitable machine learning model (e.g. our proposed step selection model), so that Eq. (6) becomes a cross entropy computed on trajectory segments of the appropriate length. In our case:

    H_c ≈ − (1/N′) Σ_n log_K P(y_{n+1} = 1 | ψ_{n,c}),    (7)

where the sum runs over the N′ evaluated positions n, we collapse all the conditioning variables into ψ_{n,c} for brevity, and the base-K logarithm normalizes the cross entropy so that a uniform prediction over the K candidates yields a value of 1. For more fine-grained analysis, we may be interested not in the sequence-level cross entropy, but rather in the "pointwise" values, i.e. − log_K P(y_{n+1} = 1 | ψ_{n,c}).
More generally, we may alternatively choose to examine any metric that can be computed from the probabilities. We adopt the relative entropy (also known as the Kullback-Leibler divergence) of the prediction with the maximum context length c_max with respect to the one at context length c (as proposed by Cífka and Liutkus [174] in the context of causal language models for text):

    D_{n,c} = D_KL( P(y_{n+1} | ψ_{n,c_max}) ∥ P(y_{n+1} | ψ_{n,c}) ).    (8)

Note that this metric does not depend on the ground-truth location, but measures the amount of information gained by considering the maximal context instead of the limited one.
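For two predictions over the same K candidates, this relative entropy can be computed as follows (a sketch operating on log-probability vectors):

```python
import numpy as np

def relative_entropy(log_p_full, log_p_ctx):
    """KL divergence D(P_full || P_ctx) between two categorical predictions
    over the same set of K candidates, given as log-probability arrays."""
    p_full = np.exp(log_p_full)
    return float(np.sum(p_full * (log_p_full - log_p_ctx)))
```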

Relevant context length
We may expect that there is a critical context length C after which the above metrics stop improving, as further extending the context does not result in a significant information gain. Similarly to Riotte-Lambert et al. [11], we define the relevant context length C_m, for a given metric m, as the smallest context length for which the metric reaches its optimum, with a 5% tolerance for robustness to noise:

    C_m = min { c : m(c) ≤ (1 + 0.05) · min_{c′} m(c′) }

(for metrics that are to be minimized; the definition is symmetric for metrics that are to be maximized).

Efficient evaluation
We now discuss how to efficiently compute the probabilities needed to calculate the above metrics, following the procedure proposed for causal language models by Cífka and Liutkus [174]. We may collect all the probabilities in a tensor P ∈ R^{N × c_max × K} such that

    P_{n,c,i} = P(y_{n+1} = i | ψ_{n,c}).

Observe that by running the model on a segment of the trajectory corresponding to indices n, …, n + c_max − 1 for a given n, we obtain all the values P_{n+c−1,c,∗} for c ∈ {1, …, c_max}. We may also notice that P_{n,n,∗} = P_{n,n+1,∗} = … = P_{n,c_max,∗} for any n < c_max. Hence, we can efficiently fill in the tensor P using N runs of the model on segments of length at most c_max.

Candidate feature importance
While the parameters of step-selection models fitted by conditional logistic regression or point-process models are directly interpretable [4], deep learning models are known as "black boxes" that require special techniques to be interpreted post hoc. A simple but popular technique [175,176] is based on testing the model on a dataset with the values of a given feature randomly permuted. While aware of the caveats related to using this technique with correlated features [177], we employ it here to demonstrate the possibility of interpretation, and leave more advanced techniques for future work.

Specifically, we study how individual candidate features (components of ϕ^{cand,(i)}_n) influence the selection of candidates. We pick a feature (or a group of features), and for every observation in the dataset, we randomly shuffle the feature's values among the K candidates (in contrast to Fisher et al. [176], who shuffle values across the entire dataset). The aim is to make the feature completely uninformative while keeping its values plausible in the given context. We evaluate the model on both the permuted and the original dataset, and use the difference in performance as a measure of the importance of the selected feature.
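The per-observation shuffling among candidates can be sketched as follows (assuming, for illustration, that a candidate feature is stored as an array of shape (N, K, d); the function name and layout are ours):

```python
import numpy as np

def permute_among_candidates(feature, rng=None):
    """Independently shuffle a candidate feature among the K candidates of
    each observation (axis 1), leaving the rest of the dataset untouched.
    This renders the feature uninformative while keeping its values
    plausible for the given observation."""
    rng = np.random.default_rng() if rng is None else rng
    out = feature.copy()
    for n in range(out.shape[0]):
        out[n] = out[n][rng.permutation(out.shape[1])]
    return out
```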

Validation
We evaluate the proposed model (here dubbed VarCtx) against variants that serve as baselines:
• FullCtx is a variant without the variable receptive field training (see Section 3.1.4);
• NoAtt is a model where all the attention layers are removed from the Transformer encoder, so that information is not allowed to flow between different positions in the sequence;
• NoEnc is a model where the Transformer encoder is removed, i.e. we have h_n = ϕ^{in}_n.
Note that the last two variants have a receptive field of 1 (i.e. only the features at position n are available for predicting the location at n + 1). To simulate this for VarCtx and FullCtx in a comparable way, we test these in a regime (denoted by +diag) where the attention matrices are restricted to an identity matrix, i.e. each position can only attend to itself.
After running each of the above models on the test set, we compute the following metrics:
• xent@16: cross entropy (Eq. (5)) computed with 16 candidates;
• xent@100: cross entropy computed with 100 candidates.
The results, averaged over all trajectories, are presented in Table 2. We note that the results are very consistent across all metrics, and we found all pairs of metrics to be strongly correlated (Pearson ρ > 0.87, computed over all models and trajectories).
Both FullCtx and VarCtx outperform the rest of the models, which have a receptive field length of 1. This is evidence that providing past movement as context is beneficial. Interestingly, VarCtx yields better results than FullCtx, possibly because the variable receptive field training scheme effectively makes the training data more diverse, alleviating overfitting.
We can also observe that the results of VarCtx+diag are closest to those of the models trained with minimum context (NoAtt, NoEnc). This suggests that the performance of VarCtx is not strongly degraded by limiting its receptive field at test time (unlike that of FullCtx), validating our variable receptive field training approach.
Finally, we noticed large performance differences between species. For the VarCtx model, we calculated the average cross entropy for each taxonomic order (see Table 3) and found that it tends to be lower (i.e. better) for orders with a higher number of observations in the training set (Pearson ρ = −0.71).

Context length analysis
In this section, we demonstrate how the VarCtx model can be used to study the dependence of the predictions on the length c of the available past context, as described in Section 4.1. We set c_max = 200 and K = 16.
First, we display in Fig. 5 the average cross entropy and relative entropy as a function of context length and by taxonomic order, and in Fig. 6 examples for concrete observations, with the relevant context length C highlighted. We observe that the best predictions tend to be achieved around context lengths of 10-50, which corresponds to 5-25 days. Apart from the clear inter-species differences in cross entropy already noted in the previous section (Table 3), we also observe some differences in relative entropy, though less marked. For example, while the movements of Ciconiiformes are substantially easier (in terms of cross entropy) for our model to predict than those of Anseriformes, both have a similar relative entropy profile, indicating that the amount of information contributed by each time scale is similar for both taxa. On the other hand, the flat relative entropy profile of Testudines simply reflects a failure of our model to accurately predict their movements at any time scale (as evidenced by cross entropy values close to 1), possibly due to an insufficient amount of reptile training data.

Candidate feature importance
We present in Fig. 7 the results of the feature importance experiment. Vector features (location) are treated as groups; bioclimatic variables are tested both individually and as a group.
The most important features found by this method are the movement vector and land cover, followed by human footprint. The bioclimatic variables appear to have relatively low impact, the most important ones being BIO2 (mean diurnal range), BIO14 and BIO17 (both related to precipitation). Interestingly, global location (represented as n-vectors) seems to be the least important feature, possibly because it is difficult to exploit for candidate selection compared to the relative location information provided by the movement vectors.
Note that only candidate features ϕ^{cand,(i)}_n are tested here, and the results do not say anything about the input (past observation) features ϕ^{in}_{1…n}. For example, global location, which we found to be unimportant as a candidate feature, may well turn out to be an important past context feature.

Discussion
In this work, we propose a new model to learn from animal trajectories. Inspired by the classical step-selection framework [2] and previous work on the quantification of uncertainty in movement predictions [11], we designed MoveFormer, a step-based model of movement that builds upon recent developments in deep learning, such as the Transformer architecture. This allowed us to meet our initial goal of endowing the model with a unique ability to learn how past movements influence current and future ones. Although this is an important question in movement ecology, it has remained poorly addressed so far because classical step-selection functions and other movement models are unable to account for past information except in a very simplified way (e.g. by including a feature indicating whether or not the animal has previously visited a given site).
An important contribution of this work is also to generalize the suggestion of Riotte-Lambert et al. [11] to use conditional entropy, calculated over visits to discrete sites, as a way to measure movement uncertainty. Although attractive, this idea has seen slow adoption because of the difficulty of discretizing trajectories into meaningful 'sites'. Here, we extend it to locations acquired in continuous space and propose cross entropy and relative entropy, estimated through the movement model, as a more general approach. This allows us to estimate the relevant context length (the 'relevant order of dependency' in Riotte-Lambert et al. [11]), i.e. the amount of the past that significantly improves predictions about further movements. To the best of our knowledge, our study therefore provides the first estimation of how much of the past one needs to know to improve predictions of animal movements.
Our results suggest that for most datasets, predictions are improved by integrating information from about a few days to two or three weeks before the movement to be predicted. Why this is the case, and why these results are broadly consistent between species, with possibly significant within-species variability, remains to be investigated, as it was beyond the goal of this methodological work. We note that these results are possibly affected by our choice to alternate sampling at midnight and at noon and to limit the length of trajectories to 500 locations, restricting the receptive field of our model to about 250 days. This may have weakened or excluded the influence of migration, which commonly leads to seasonal back-and-forth movement patterns and which, when accounted for, could help improve predictions about future movements.
One obvious limitation of our approach is the data requirement. As with all deep learning approaches, learning is limited by the data available in the training set, and enough data should also be available for the validation and test sets. The whole dataset we gathered here, despite being rather large (> 1500 trajectories) compared to movement datasets currently analyzed in ecology, is likely close to the minimal size required to obtain a robust model and avoid severe overfitting issues. Currently, there are probably very few, if any, single-species datasets large enough to fit this model. For this reason, we aggregated data from numerous species; as a benefit, this allowed us to demonstrate that comparative analyses could be conducted with the model, for instance by comparing the distribution of relevant context lengths between species or higher-order taxa.
An important characteristic of the proposed approach is that the model not only accounts for past movements to predict new ones, but can also account for environmental predictors. First, this is crucial for realistic predictions, as the step-selection literature has amply demonstrated that step selection by animals is critically linked to the habitats to be traversed or reached. Second, it allows us to evaluate the relative importance of predictors in improving predictions. Interestingly, we found that purely relative positional information (movement vectors) could be more important than environmental variables for future location prediction. We tentatively suggest that this result may be linked to the fact that most animals favor familiar places and, by doing so, restrict themselves to well-established home ranges [178]. Unsurprisingly, we also found that among the environmental variables tested, land cover and human footprint significantly affected animal movements [179].
To summarize, in the present work, we provide a new, state-of-the-art model to analyze and predict animal movement data. The novelty of the model lies in the fact that it leverages the power of deep learning approaches and can account for past movements in its predictions. However, we emphasize, and have shown above, that the model is not only a tool for prediction, but can also be used to test hypotheses about the intrinsic and extrinsic drivers of animal movements.

Figure 1: The geographical distribution of the observations in the dataset.

Figure 2: A PCA projection of Wikipedia2Vec embeddings of species, labeled by class (left) and order (right).

Figure 3: The high-level architecture of MoveFormer. The input to the trajectory encoder is a sequence of embedding vectors ϕ^in_{1...N−1}, each corresponding to a different data point (location-timestamp pair) in the trajectory. The encoder outputs a sequence of vectors h_{1...N−1}; the causal masking in the encoder ensures that each h_n encodes only the inputs up to position n, i.e. ϕ^in_{1...n}. This representation is then fed to the candidate selection module, which uses it as queries in an attention mechanism that assigns probabilities to different candidate locations. Both the input embeddings ϕ^in_{1...N−1} and the candidate embeddings ϕ^cand,(i)_n are computed through embedding layers, which are not displayed here but described in Section 3.1.1.

Figure 4: Metric value averages by context length and taxonomic order, computed on the test set (only positions n > c_max = 200).

Figure 5: Examples of metric values (pointwise, i.e. for a single observation within a given trajectory) plotted as a function of context length. Top and bottom correspond to different (random) positions within the same trajectory. The red dot marks the relevant context length (where the metric reaches 5 % of its min-max range).

Figure 6: Relevant context length C by taxon, computed using cross entropy and relative entropy (pointwise values, as shown in Fig. 5), respectively.The black triangles indicate means.

Table 1: Number of species, individuals, trajectories, and observations in each section of the dataset, with a breakdown by taxon.
Formally, we consider the dataset as composed of trajectories, where a trajectory ξ_{1...N} of length N consists of locations x_{1...N}, corresponding timestamps t_{1...N}, and any associated variables z_{1...N}, i.e. ξ_n = (x_n, z_n, t_n) as described above. Our main goal is to estimate a model for the next-step prediction task, i.e. for any given n ∈ {1, ..., N}, predict the next location x_{n+1} from the trajectory prefix ξ_{1...n} and the next timestamp t_{n+1}.
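The trajectory representation above can be sketched as a small data structure; the class and field names here are ours, chosen for illustration, and do not come from the MoveFormer codebase:

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass
class Trajectory:
    """A trajectory ξ_{1...N}: locations, timestamps, covariates."""
    xs: Sequence[Tuple[float, float]]  # locations x_1..N (e.g. lon/lat)
    ts: Sequence[float]                # timestamps t_1..N
    zs: Sequence[dict]                 # associated variables z_1..N

    def prefix(self, n: int):
        """The context ξ_{1...n} available when predicting x_{n+1}
        at the known next timestamp t_{n+1}."""
        return self.xs[:n], self.ts[:n], self.zs[:n]
```

Each training example for the next-step task then pairs a prefix `traj.prefix(n)` and the timestamp `traj.ts[n]` with the target location `traj.xs[n]`.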
• acc 1/16: accuracy (i.e. how often the top-scoring candidate is the ground truth) with 16 candidates,
• acc 10/100: top-10 accuracy (i.e. how often the ground truth is among the 10 top-scoring candidates) with 100 candidates.
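Both metrics are instances of top-k accuracy over a candidate set, which can be sketched as follows (the function name and NumPy implementation are ours, for illustration only):

```python
import numpy as np

def topk_accuracy(scores, true_idx, k):
    """Fraction of steps whose ground-truth candidate is among the
    k top-scoring candidates.

    scores:   (n_steps, n_candidates) array of model scores
    true_idx: for each step, the column index of the ground truth
    """
    scores = np.asarray(scores)
    topk = np.argsort(-scores, axis=1)[:, :k]  # top-k candidate indices
    hits = (topk == np.asarray(true_idx)[:, None]).any(axis=1)
    return float(hits.mean())

# acc 1/16:   topk_accuracy(scores_16,  true_idx, k=1)  with 16 candidates
# acc 10/100: topk_accuracy(scores_100, true_idx, k=10) with 100 candidates
```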

Table 3: VarCtx validation cross entropies by taxonomic order, along with the number of observations in the training data.

Table 4: The list of all Movebank datasets used in this work.

Table 5: Number of observations of each taxon in each section of the dataset.