Robust Algorithms for Capturing Population Dynamics and Transport in Oceanic Variables along Drifter Trajectories using Linear Dynamical Systems with Latent Variables

The blooms of Noctiluca in the Gulf of Oman and the Arabian Sea have been intensifying in recent years, posing a threat to regional fisheries and the long-term health of an ecosystem supporting a coastal population of nearly 120 million people. We present the results of a microscopic data analysis to investigate the onset and patterns of the Noctiluca (mixotrophic dinoflagellate Noctiluca scintillans) blooms, which form annually during the winter monsoon in the Gulf of Oman and in the Arabian Sea. Our approach combines methods from physical and biological oceanography with machine learning techniques. In particular, we present a robust algorithm, the variable-length Linear Dynamic Systems (vLDS) model, that extracts the causal factors and latent dynamics at the microscopic population level along each individual drifter trajectory, and we demonstrate its effectiveness by using it to test and confirm previously benchmarked macroscopic scientific hypotheses. The test results provide microscopic statistical evidence to support and recheck the macroscopic physical and biological oceanography hypotheses on the Noctiluca blooms; they also help identify complementary microscopic dynamics that might not be visible or discoverable at the macroscopic scale. The vLDS model also exhibits a generalization capability (inherited from its machine learning methodology) that allows it to investigate important causal factors and hidden dynamics associated with ocean biogeochemical processes and phenomena at the population level.

In particular, we obtain a robust model (the vLDS model, variable-length Linear Dynamic System model) that is capable of identifying the causal factors and dynamics at the microscopic population level along each individual drifter trajectory. Furthermore, we assess its effectiveness by testing and confirming previously benchmarked scientific hypotheses. Rigorously statistical, the vLDS model is a powerful tool that helps us: 1) discover microscopic causal relationships in a high-dimensional dataset and 2) identify complementary microscopic dynamics that might not be discoverable at the macroscopic scale or accessible in controlled laboratory experiments.

The significance of this research is that these blooms of Noctiluca have been intensifying in recent years, posing a threat to regional fisheries and the long-term health of an ecosystem supporting a coastal population of nearly 120 million people [25][26][27][28]. When seen from space, the Noctiluca blooms appear as large drifting swirls and filaments on the surface of the sea (Fig 1A). Traditionally, photosynthetic diatoms supported the Arabian Sea food chain; zooplankton grazed primarily on diatoms and were in turn preyed upon by fish. Since the early 2000s, the ecosystem of the Arabian Sea

The benefit of vLDS is threefold. First, it provides statistical evidence in a direct and zoomed-in

We note that each float record has information on its coordinates .

Using the multidimensional interpolation procedure described above, we map the satellite

It is possible that some of the components of the observations, for instance, t865 in our study, as demonstrated in the "Discussion & Conclusion" Section, are not much involved in the latent dynamics. Therefore, the dimension of the latent space recovered by the latent variable might be smaller than the dimension of the observations. In this study, the latent dimension, as determined by the cross-validation procedure, turns out to be 11, and the dimension of the observations is 12.
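The selection of the latent dimension by cross-validation can be illustrated with a minimal sketch. This section does not spell out the cross-validation procedure, so the K-fold split over floats, the function name `select_latent_dim`, and the caller-supplied `fit` and `score` callables below are all assumptions made for illustration; `fit` would stand for an EM fit of the model at a given latent dimension and `score` for a held-out log-likelihood.

```python
import numpy as np

def select_latent_dim(sequences, candidate_dims, fit, score, n_folds=5, seed=0):
    """Return the candidate latent dimension with the highest mean held-out score.

    sequences      : list of (T_i x p) arrays, one per float (lengths may differ)
    fit(train, d)  : hypothetical callable returning a model fitted with latent dim d
    score(model, s): hypothetical callable returning a scalar score for sequence s
    """
    rng = np.random.default_rng(seed)
    # Split whole floats (not time steps) into folds so each trajectory stays intact.
    folds = np.array_split(rng.permutation(len(sequences)), n_folds)
    mean_scores = {}
    for d in candidate_dims:
        fold_scores = []
        for fold in folds:
            held_out = set(fold.tolist())
            train = [s for i, s in enumerate(sequences) if i not in held_out]
            test = [sequences[i] for i in fold]
            model = fit(train, d)
            fold_scores.append(sum(score(model, s) for s in test))
        mean_scores[d] = float(np.mean(fold_scores))
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores
```

Splitting by float rather than by time step keeps each variable-length trajectory whole, which seems the natural unit when the model is fitted per trajectory.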

To test our benchmarked hypotheses in the "Background" Section, we first generate the predicted values of the observations by using equation (1) with the recovered latent variables from the vLDS model.

With the optimal value of the latent space dimension identified (11), we fit the vLDS model one more time with the full cross-validation dataset to generate the vLDS model parameter. In Fig 9, we display the log-likelihood convergence of the Expectation-Maximization algorithm for the complete cross-validation dataset and five individual floats in the cross-validation dataset. We note that from Equations (2)-(4), the log-likelihood of the complete cross-validation dataset is the sum of the log-

Also, 't865', the aerosol optical thickness over water, turns out to be independent of the chlorophyll a concentration and other ocean profiles. Moreover, the spatial information, namely, the longitude, latitude, velocity, speed of the float, and distance from the nearest coast, is all well recovered by the vLDS model (lat and spd are not shown in Figs 10, 11 due to space limitations).
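The additive structure of the log-likelihood over floats, noted above in connection with Fig 9, can be sketched as follows. Equations (1)-(4) are not reproduced in this section, so the sketch assumes the textbook linear dynamical system: z_1 ~ N(mu0, V0), z_t = A z_{t-1} + w with w ~ N(0, Gamma), and x_t = C z_t + v with v ~ N(0, Sigma). The names `A`, `C`, `mu0`, `kalman_loglik`, and `total_loglik` are assumptions for illustration, not the paper's notation or implementation.

```python
import numpy as np

def kalman_loglik(X, A, C, Gamma, Sigma, mu0, V0):
    """Log-likelihood of one sequence X (T x p) under the assumed standard LDS,
    accumulated from the Kalman filter's one-step-ahead predictive densities."""
    p = X.shape[1]
    mu_pred, V_pred = mu0, V0              # predictive mean/covariance of z_1
    ll = 0.0
    for x in X:
        e = x - C @ mu_pred                # innovation
        S = C @ V_pred @ C.T + Sigma       # innovation covariance
        _, logdet = np.linalg.slogdet(S)
        ll += -0.5 * (p * np.log(2 * np.pi) + logdet + e @ np.linalg.solve(S, e))
        K = V_pred @ C.T @ np.linalg.inv(S)    # Kalman gain
        mu = mu_pred + K @ e                   # filtered mean
        V = V_pred - K @ C @ V_pred            # filtered covariance
        mu_pred = A @ mu                       # predict the next latent state
        V_pred = A @ V @ A.T + Gamma
    return ll

def total_loglik(sequences, params):
    """Sum of per-sequence log-likelihoods; sequences may have different lengths."""
    return sum(kalman_loglik(X, *params) for X in sequences)
```

Because each trajectory is filtered independently with shared parameters, sequences of different lengths need no padding or truncation, which is one plausible reading of the "variable-length" aspect of the model.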

We next examine the robustness of the vLDS model. The floats in the heldout dataset are not used in the cross-validation process or the model's parameter estimation process. Therefore, the heldout dataset is totally unknown to the vLDS learning algorithm. We use the cross-validated latent dimension, 11, and the model parameter set (which includes Γ, Σ, and V_0) generated by training the vLDS model on the cross-validation dataset. Applying one iteration of the forward-backward smoothing process, namely, merely one iteration of the Expectation steps 4, 5, and 6 in Algorithm 2, to each float in the heldout dataset, we obtain the predictions of their profiles (Fig 11). Most of the hidden dynamics along drifter trajectories for the floats in the heldout dataset, which is totally

dispersal, and biological environments (Figs 10, 11, Table 3). The model's generalization capability