Averaging a local PLSR pipeline to predict chemical compositions and nutritive values of forages and feed from spectral near infrared data

Partial least squares regression (PLSR) is a reference method in chemometrics. In agronomy, it is used for instance to predict components of the chemical composition (response variables y) of plant materials from near infrared (NIR) spectral data X collected with spectrometers. The principle of PLSR is to reduce the dimension of the spectral data X by computing vectors that are then used as latent variables (LVs) in a multiple linear model. A difficulty is to determine the relevant dimensionality (number of LVs) of the model for the available data. This step can also become time consuming when many different datasets have to be processed and/or the datasets are frequently updated. An alternative to determining the relevant PLSR dimensionality is the ensemble learning method "PLSR averaging". This method has previously been shown to be efficient for complex biological materials such as mixed forages, and it facilitates the automation of predictions (e.g. in user-friendly web interface platforms). This article presents the extension of PLSR averaging to a k-nearest neighbors locally weighted PLSR pipeline (kNN-LWPLSR). The kNN-LWPLSR pipeline has the advantage of accounting for non-linearity between X and y that exists, for instance, in heterogeneous data (e.g. mixtures of plant species, collections from different geographical areas, etc.). In the article, kNN-LWPLSR averaging is applied to an extensive NIR database built to predict the chemical composition of European and tropical forages and feed. The main finding of the study was the overall superiority of the averaging over the usual kNN-LWPLSR. Averaging may therefore be recommended in local PLSR pipelines for predicting NIR forage and feed data.


Introduction
Near-infrared spectroscopy (NIRS) is a fast and nondestructive analytical method used in many agronomic contexts, for instance to evaluate the nutritive quality of forages.
Basically, spectral data X (a matrix of n observations × p wavelengths) are collected on samples of the material under study (e.g. forages) using a spectrometer, and targeted response variables (e.g. chemical compositions) Y = {y1, …, yk} (k vectors of n observations) are measured precisely in the laboratory. Regression models of Y on X are then fitted and used to predict the response variables from new spectral observations. Spectral data are known to be highly collinear and, in general, matrix X is ill-conditioned. Specific regression methods therefore have to be implemented, in particular partial least squares regression (PLSR) [1][2][3]. The general principle of PLSR is to reduce the dimension of X to a limited number a << p of orthogonal n × 1 vectors maximizing the squared covariance with Y, referred to as scores. The scores are then used as regressor latent variables (LVs) in a multiple linear regression (MLR). PLSR is very efficient when the relationship between X and Y is linear [4].
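To make this principle concrete, the following is a minimal numpy sketch of PLS1 (a single response y) using the classical NIPALS-style algorithm: scores maximizing the squared covariance with y are extracted one at a time and then combined linearly to predict y. This is only an illustration, not the fast PLSR algorithm [25] used in the article; the synthetic "spectra" and all variable names are invented for the example.

```python
import numpy as np

def pls1(X, y, a):
    """Fit PLS1 with `a` latent variables; return a prediction function."""
    xm, ym = X.mean(axis=0), y.mean()
    Xd, yd = X - xm, y - ym                  # centered working copies
    p = X.shape[1]
    W, P, c = np.empty((p, a)), np.empty((p, a)), np.empty(a)
    for r in range(a):
        w = Xd.T @ yd                        # direction maximizing cov(X w, y)^2
        w /= np.linalg.norm(w)
        t = Xd @ w                           # score (latent variable), length n
        tt = t @ t
        P[:, r] = Xd.T @ t / tt              # X-loadings
        c[r] = yd @ t / tt                   # y-loading (MLR coefficient on the score)
        Xd = Xd - np.outer(t, P[:, r])       # deflation
        yd = yd - c[r] * t
        W[:, r] = w
    b = W @ np.linalg.solve(P.T @ W, c)      # regression vector in the original X space
    return lambda Xnew: ym + (Xnew - xm) @ b

# Synthetic highly collinear "spectra": 50 wavelengths driven by 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(100, 50))
y = latent @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

predict = pls1(X, y, a=3)
r = np.corrcoef(predict(X), y)[0, 1]         # 3 LVs suffice to capture the structure
```

Despite p = 50 collinear columns, three LVs recover the underlying structure, which is exactly the dimension-reduction role described above.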
For several years, agronomic databases (e.g. in feed, food or soil research) have tended to aggregate large numbers of samples of different natures or origins, bringing heterogeneity. This generates curvatures and/or clustering in the data that can alter the linear relation between X and Y and therefore degrade the PLSR predictions. Local PLSR is a simple tool that can handle such non-linearity in the data [4][5][6][7]. The general principle is, for each new observation to predict, to pre-select the k nearest neighbors of the observation (the kNN selection step) and then to apply a PLSR to the neighborhood (i.e. the k neighbors). Many variants of local PLSR pipelines can be built, depending essentially on the type of PLSR implemented and on how the neighborhood is selected.
One of these variants [8] consists in applying a locally weighted PLSR (LWPLSR) on the neighborhood, instead of a PLSR as in the more common local PLSR, referred to here as kNN-PLSR. LWPLSR, detailed in a later section, has the particularity of weighting each of the n training observations {xi ; i = 1, …, n} depending on its distance (or any dissimilarity) to the observation to predict, xnew (while in PLSR a uniform weight 1/n is given to all the xi). The closer xi is to xnew, the higher its weight in the iterative PLSR equations and therefore its importance in the prediction. Implementing LWPLSR in the local pipeline, referred to as kNN-LWPLSR [8], has been observed to be more efficient than kNN-PLSR for various data including forages, for regression as well as for discrimination [8,9]. Note that kNN-LWPLSR can be considered as a particular case of LWPLSR where positive weights are given to the neighbors of xnew and null weights to the observations outside the neighborhood. Nevertheless, for large datasets, doing the kNN step and then applying LWPLSR only on the neighborhood (kNN-LWPLSR) is much faster in terms of computation time than implementing LWPLSR on all the data [8].
An important step in tuning kNN-LWPLSR, as for PLSR and LWPLSR, is to determine the dimensionality (i.e. the number of LVs), say a, used in the MLR step. Different strategies have been proposed in the PLSR literature to guide the determination of an optimal dimensionality [10][11][12][13][14][15]. One of the most common is cross-validation (CV), which searches for the value a that minimizes the CV-error curve. These strategies face several difficulties. Firstly, despite attempts at automated procedures, they often require case-by-case decisions based on expertise. Secondly, finding a unique optimal dimensionality is in general difficult when the data contain complex information, which is often the case with biological materials such as plants and forages (mixtures of stems, leaves, different stages of development and geographical areas, etc.). Finally, finding a can also become time consuming when many datasets have to be processed (e.g. many response variables y to consider successively) and/or when the datasets are periodically updated with new training observations. Alternatively, PLSR-averaging [16][17][18][19] (say PLSR-AVG) is an ensemble learning method whose main objective is to bypass the determination of a. The method consists in averaging the predictions of A + 1 PLSR models of dimensionality r = 0, 1, 2, …, A LVs. The maximal value A is set a priori and deliberately larger than the expected optimal values of a. For forages [19], PLSR-AVG was shown to be more efficient than the usual procedure where a was determined by CV. In the past, such an averaging has already been implemented in a kNN-PLSR pipeline [18].
The present article proposes its extension to the kNN-LWPLSR pipeline [8].
The article is organized as follows. Theoretical points on kNN-LWPLSR and PLSR-AVG are presented first. Then, the kNN-LWPLSR-averaging pipeline (say kNN-LWPLSR-AVG) is applied to an extensive NIR database on the chemical composition of European and tropical forages and feed. The predictive performance of the pipeline is compared to kNN-LWPLSR where the optimal number of LVs a is determined by CV.

kNN-LWPLSR
LWPLSR and kNN-LWPLSR have been described in Lesnoff et al. [8]. LWPLSR [20][21][22] is a particular case of weighted PLSR (WPLSR). In WPLSR, an n × 1 vector of weights ω = {ω1, ω2, …, ωn} is inserted into the PLSR algorithm in two steps: (a) the PLS scores (LVs) are computed by maximizing weighted (instead of unweighted) squared covariances, and (b) the prediction MLR equation is computed by weighted least squares (WLS, instead of ordinary LS) on the scores. The specificity of LWPLSR compared to WPLSR is that ω is computed from a decreasing function, say f, of the distances (or any dissimilarities) between the n training observations and xnew. This is the same principle as in the well-known locally weighted regression algorithm [23,24]. kNN-LWPLSR simply adds a preliminary step to LWPLSR: a neighborhood is selected around xnew (kNN selection step), on which LWPLSR is then applied. The present article uses the same kNN-LWPLSR pipeline as in Lesnoff et al. [8], involving a fast PLSR algorithm [25] and consisting of the steps described below. The case h = ∞ is the unweighted situation, corresponding to a kNN-PLSR.
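As an illustration of how a weight vector ω with a negative-exponential shape can be built and how it enters the score computation, here is a small numpy sketch. The scaling of the distances by their median is an assumption made for the example, not necessarily the exact function f used in [8,21].

```python
import numpy as np

def lw_weights(d, h):
    """Negative-exponential weights from distances d to xnew.
    Smaller h -> sharper decay -> more local fit; h -> infinity -> uniform (kNN-PLSR)."""
    d = np.asarray(d, dtype=float)
    scale = max(np.median(d), 1e-12)      # distance scaling: an illustrative choice
    w = np.exp(-d / (h * scale))
    return w / w.sum()                    # normalize so the weights sum to 1

def wpls_direction(X, y, omega):
    """First weighted-PLS direction: maximizes the omega-weighted squared covariance."""
    Xc = X - omega @ X                    # omega-weighted centering (omega sums to 1)
    yc = y - omega @ y
    w = Xc.T @ (omega * yc)               # weighted covariance of each column with y
    return w / np.linalg.norm(w)

w_sharp = lw_weights([0.1, 1.0, 5.0], h=0.5)   # strongly favors the closest observation
w_flat = lw_weights([0.1, 1.0, 5.0], h=1e9)    # h ~ infinity: nearly uniform weights
```

With a small h, almost all the weight concentrates on the closest observations; with a very large h, the weights tend to the uniform 1/n of ordinary PLSR.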

PLSR-averaging
Let xnew be a new observation to predict, and ŷnew,r its PLSR prediction when the dimensionality (number of LVs) is r. The PLSR-AVG prediction is defined by:

ŷnew,[0:A] = w0 ŷnew,0 + w1 ŷnew,1 + ⋯ + wA ŷnew,A    Eq. (1)

where wr (r = 0, …, A) is the weight (bounded between 0 and 1) of the model with r LVs, with the constraint: w0 + w1 + ⋯ + wA = 1. In Eq. (1), the particular case ŷnew,0 is the simple mean of y. The vector w = {w0, w1, …, wA} represents the pattern of weights, and the shape of this pattern is specific to a given averaging method. In practice, weight wr should quantify the level of confidence that can be awarded to the PLSR model with r LVs, relative to the other dimensionalities. Vector w can for instance be estimated from the predictive performance of each of the A + 1 PLSR models (e.g. estimated by CV, the Akaike criterion or other indicators; [19]) or from more integrated Bayesian approaches [26,27].
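Given the A + 1 predictions, Eq. (1) is simply a weighted mean. A minimal sketch (all names invented for the example):

```python
import numpy as np

def plsr_avg(preds, w=None):
    """PLSR-AVG (Eq. 1): average the predictions of A + 1 models with r = 0..A LVs.
    preds: (A+1, m) array, preds[r] = predictions of the r-LV model for m observations;
    row 0 (0 LVs) is the simple mean of y. Default: uniform weights wr = 1/(A+1)."""
    preds = np.asarray(preds, dtype=float)
    if w is None:
        w = np.full(preds.shape[0], 1.0 / preds.shape[0])
    w = np.asarray(w, dtype=float)
    assert np.all((0 <= w) & (w <= 1)) and abs(w.sum() - 1.0) < 1e-9
    return w @ preds

# Toy example: predictions of models with 0, 1 and 2 LVs for two new observations
avg = plsr_avg([[10.0, 10.0],   # 0 LVs: simple mean of y
                [12.0,  9.0],   # 1 LV
                [11.0,  8.0]])  # 2 LVs
```

Here `avg` is the element-wise uniform mean over the three models; any other weight pattern satisfying the bounds and sum-to-one constraint can be passed via `w`.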
A closely related approach to model averaging is stacking [26], in which the weights wr (r = 0, …, A) are estimated from a meta regression model and are not constrained to be bounded in [0, 1] and to sum to 1. Stacking did not improve the prediction performances for forage composition [19] compared to averaging and is not considered in this study.

kNN-LWPLSR-AVG
The proposed kNN-LWPLSR-AVG pipeline consists simply in chaining kNN-LWPLSR and the PLSR-AVG principle: for each neighborhood, the predictions returned from the LWPLSR models with dimensionalities r = 0, …, A are averaged.
Preliminary results on forages [19] showed that uniform weighting (i.e. wr = 1 / (A + 1); r = 0, …, A) was in general as efficient as more elaborate patterns on such data.
Therefore, for simplicity, the present article only considers uniform weighting, which also has the advantage of being much faster to compute than other patterns. Speed is essential when implementing local PLSR pipelines, since one model has to be fitted per new observation to predict (xnew).

Dataset
The predictive models were assessed on a NIR database built by Cirad (Mediterranean and tropical livestock systems research unit) on European and tropical forages and feed.
The absorbance spectra were collected on dried and ground materials using Foss 5000 or 6500 instruments over the spectral range 1100 to 2498 nm (2 nm steps).
The objective of the database is to enable the prediction of twelve main chemical composition variables (Table 1). The data are regularly updated with new observations collected from various Cirad research projects and partners. The total size of the dataset used in this study was N = 18813 observations but, depending on the availability of the response variable, the data size ranged from N = 2020 to N = 18055 observations (Table 1).
After preliminary exploration, a preprocessing was applied to X (spectra), consisting of a standard normal variate (SNV) transformation followed by a Savitzky-Golay second derivative (polynomial of order 3, window of 11 spectral points). Examples of raw and preprocessed spectra are presented in Fig. 1. The data projected onto the PCA score space (Fig. 2) illustrate the large spectral heterogeneity in the dataset.
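For readers wishing to reproduce this kind of preprocessing, a sketch with numpy/scipy is shown below. The parameter values are those given above; scipy's `savgol_filter` is used as a stand-in for whatever routine the authors actually ran.

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(X):
    """SNV (row-wise standardization) followed by a Savitzky-Golay 2nd derivative
    (polynomial of order 3, window of 11 spectral points), as described above."""
    Xsnv = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    return savgol_filter(Xsnv, window_length=11, polyorder=3, deriv=2, axis=1)

# Sanity check on 5 artificial linear "spectra": the 2nd derivative of a line is ~0
lines = np.outer(np.arange(1.0, 6.0), np.linspace(0.0, 1.0, 100))
Xp = preprocess(lines)
```

SNV removes multiplicative and additive scatter effects per spectrum, and the second derivative removes the remaining baseline while sharpening overlapping absorption bands.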

Predictive performances of the models
The model performances were compared by computing prediction error rates on selected test sets (generalization errors). The procedure, detailed below, was the same for each of the compared models and each given response variable (Table 1). The RMSEPTest distributions were summarized by means and standard errors.

The compared models
The two compared models were the usual kNN-LWPLSR, in which the optimal number of LVs, a, is determined by CV, and the kNN-LWPLSR-AVG pipeline, where PLSR-averaging is used instead of determining a. A particular case of kNN-LWPLSR-AVG was also considered (the "omnibus" strategy described below). All the computations were implemented with the package Jchemo [28]. This package is written in the free and fast Julia language [29] (https://julialang.org).

kNN-LWPLSR
The preliminary global PLS score space was of dimensionality nlvdis = 25 scores, from which were computed, for each new observation to predict xnew, the neighborhood (kNN selection based on Mahalanobis distances) and the weight vector ω. The same parameter values as above were considered for h and k, leading to 9 combinations. For each xnew, the prediction was the average of the predictions of the LWPLSR models (on the neighborhood) with nlv = 0, …, 20 LVs (21 predictions).

kNN-LWPLSR-AVG with omnibus strategy
When many datasets (spectra × response variables) have to be processed, for instance in automated prediction platforms that are regularly updated, an "omnibus" strategy applied to PLSR-averaging [19] can facilitate the predictions. This strategy consists in defining a single model (i.e. a set of a priori parameter values) that is applied blindly to all the datasets, instead of tuning and optimizing the model separately for each spectral dataset and response variable. The omnibus model is ideally expected to have acceptable efficiency for most of the spectra × response variable combinations, even if not always optimal.
In this article, an omnibus strategy was applied to kNN-LWPLSR-AVG and compared to the two above models (kNN-LWPLSR and kNN-LWPLSR-AVG) tuned by grid-search.
From preliminary studies on forages (sorghum and other data not considered in this article [8,19]), the omnibus parameters were set to h = 1 and k = 1000 neighbors.

Results
The performances of the kNN-LWPLSR and kNN-LWPLSR-AVG models tuned by grid-search over the nrep = 30 replications are summarized in Fig. 3. For all the response variables, the averaging improved the predictions, with the mean RMSEPTest decreasing in relative value from 1.7% (ADF) to 11.4% (SUGARS).

Discussion and conclusions
The averaging always improved the performances of the kNN-LWPLSR models compared to the usual strategy that optimizes the number of LVs. The improvement was also observed for the omnibus strategy, in which single values of the parameters h and k are used for all the response variables. This reinforces previous results [19] on the interest of using PLS averaging on mixed forage data. Forage and feed spectral data often contain high intrinsic complexity (the material is collected from different species, parts of plants, years, geographical areas, etc.) that can be distributed over the whole wavelength range. Averaging over different PLS dimensionalities may bring better robustness in such situations. The decrease in prediction variance observed with ensemble learning methods [26,30] can also favor better performances.
The present article focused on the uniform weighting, the simplest averaging approach.
More elaborate weighting patterns can be considered, but they do not always seem more efficient [19]. Moreover, in the context of local PLSR, they can become highly time consuming to run. For instance, additional computations on the study dataset (not detailed in this article) showed that a weighting based on the CV-errors [19] did not improve the predictions, and a weighting based on AIC [19] required several days of computation to run the nrep = 30 replications, which is not realistic in practice.
Uniform weighting therefore seems to be a good compromise for kNN-LWPLSR-AVG.
To reinforce the findings of this article, kNN-LWPLSR-AVG should be evaluated on agronomic materials other than mixed forages and feed, with the same or, conversely, a lower level of heterogeneity (data complexity). In the latter case, the method may be less advantageous. In practice, such evaluations can easily be implemented either by using already existing routines (e.g. the package Jchemo [28]) or by inserting the averaging into other available local PLSR pipelines.
Firstly, a global PLSR (i.e. over all the training observations) is fitted and defines a global score space. Secondly, for each new observation xnew, Mahalanobis distances in this global score space between the training observations and xnew are computed (obviously, other global spaces, e.g. PCA or nonlinear kernel-PCA/PLS score spaces, and/or other types of distances or dissimilarities could be used). These distances are used to compute the neighborhood of xnew (kNN selection) and the weights ω within the neighborhood. The weight function f, chosen to be easily tunable, has a negative exponential shape whose sharpness depends on a scalar parameter, say h [8,21]: the lower h, the sharper the function f and therefore the more importance the closest neighbors of xnew have in the LWPLSR fit.
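The kNN selection and weighting steps can be sketched as follows. This is a simplified illustration, not the Jchemo implementation: since the global scores are uncorrelated, the Mahalanobis distance in the score space reduces to a Euclidean distance on standardized scores, and the median scaling of the distances inside f is an assumption made for the example.

```python
import numpy as np

def knn_select(Tglobal, tnew, k, h):
    """Select the k nearest neighbors of tnew in a global score space and
    compute their negative-exponential weights (sharpness controlled by h)."""
    s = Tglobal.std(axis=0)
    Z, znew = Tglobal / s, tnew / s          # Mahalanobis ~ Euclidean on standardized scores
    d = np.sqrt(((Z - znew) ** 2).sum(axis=1))
    idx = np.argsort(d)[:k]                  # kNN selection step
    dk = d[idx]
    w = np.exp(-dk / (h * max(np.median(dk), 1e-12)))
    return idx, w / w.sum()                  # neighborhood + weights omega for LWPLSR

# Toy global score space: the neighbor identical to tnew gets the largest weight
rng = np.random.default_rng(1)
Tglobal = rng.normal(size=(50, 5))
idx, w = knn_select(Tglobal, Tglobal[7], k=10, h=1.0)
```

A LWPLSR model would then be fitted on the rows `idx` of the training data with the weights `w`, one model per observation to predict.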


For the j-th response variable, the total available data (size Ntot,j) is randomly divided into a training set, say Train (size Ntrain,j = 80% * Ntot,j), and a test set, say Test (size Ntest,j = 20% * Ntot,j). Then, Train is randomly divided into a calibration set, say Cal (size Ncal,j = 80% * Ntrain,j), and a validation set, say Val (size Nval,j = 20% * Ntrain,j). In summary, Tot = Train ∪ Test, and Train = Cal ∪ Val. The model is tuned by exhaustive grid-search over Train as follows. The combinations of the predefined values of the model parameters are listed. For each combination of parameter values, the model is fitted on Cal and the performance of the combination is evaluated by the root mean squared prediction error computed on Val (RMSEPVal). The optimal combination (the one with the lowest RMSEPVal among the combinations) is then selected, and the model is re-fitted (with the selected combination) on Train and used to predict Test. RMSEPTest is then computed and used as the final estimate of the generalization error of the given model. Due to the random splittings (Train/Test and Cal/Val), RMSEPVal and RMSEPTest are random variables. Considering only a single replication may lead to misleading conclusions, since the replication can favor, by chance, one or the other of the models. It is therefore important to replicate the process described above to get the RMSEP distributions. In this article, the process was replicated nrep = 30 times.
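The evaluation loop above can be sketched as follows. For brevity the sketch plugs in a plain (unweighted) kNN-mean regressor on one-dimensional synthetic data instead of kNN-LWPLSR, and tunes a single parameter k; the splitting proportions and the nrep = 30 replications follow the procedure described above.

```python
import numpy as np

def rmsep(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def knn_predict(Xtr, ytr, Xte, k):
    """Toy kNN-mean regressor standing in for kNN-LWPLSR (1-d inputs for simplicity)."""
    d = np.abs(Xte[:, None] - Xtr[None, :])
    idx = np.argsort(d, axis=1)[:, :k]
    return ytr[idx].mean(axis=1)

def one_replication(X, y, grid, rng):
    n = len(y)
    perm = rng.permutation(n)
    test, train = perm[: n // 5], perm[n // 5:]                    # Test = 20% of Tot
    val, cal = train[: len(train) // 5], train[len(train) // 5:]   # Val = 20% of Train
    # grid-search: fit on Cal, evaluate each candidate parameter value on Val
    errs = [rmsep(y[val], knn_predict(X[cal], y[cal], X[val], k)) for k in grid]
    kbest = grid[int(np.argmin(errs))]
    # re-fit on Train with the selected value; generalization error estimated on Test
    return rmsep(y[test], knn_predict(X[train], y[train], X[test], kbest))

rng = np.random.default_rng(42)
X = rng.uniform(-2.0, 2.0, 300)
y = np.sin(X) + 0.1 * rng.normal(size=300)
scores = [one_replication(X, y, grid=[1, 5, 20], rng=rng) for _ in range(30)]
mean_rmsep, se_rmsep = np.mean(scores), np.std(scores) / np.sqrt(len(scores))
```

`mean_rmsep` and `se_rmsep` correspond to the means and standard errors used to summarize the RMSEPTest distributions in the article.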

Fig. 2: Projection of the observations (preprocessed forage spectra; N = 18813) onto the first three PCA scores.

Fig. 4 compares the tuned kNN-LWPLSR-AVG vs. the omnibus model (h = 1 and k = 1000 neighbors). The two strategies showed very close performances: the relative difference between the mean RMSEPTest values was always within ±2.5%. The omnibus model returned lower or higher error rates depending on the response variable.

Table 1: Variables of forage and feed chemical composition to be predicted (N = total number of observations available for each variable).