Random Forest Regression models for Lactation and Successful Insemination in Holstein Friesian cows

To overcome well-known difficulties in establishing reliable models on large data sets, the Random Forest Regression (RFR) method is applied to study economical breeding and milk production of dairy cows. Positive experience with RFR in various areas of application supports the view that it can deliver reliable model predictions for industrial production, providing a useful basis for decisions. In this study, a data set covering a period of ten years and about eighty thousand cows was analysed by means of RFR. A ranking of production control parameters is obtained: the most important explanatory variables are identified by computing the variances of the target variable on the sets created during the training phases of the RFR. Predictions are made for milk production and the conception of the calves with high accuracy on the given data, and simulations are used to investigate prediction accuracy. This paper is primarily concerned with the mathematical aspects of a forthcoming work focused on the agricultural viewpoint. As for future mathematical research, the results will be compared with models based on factor analysis and linear regression.

Reproductive management is a key factor in economic dairy production, and poor practice can cause considerable economic loss, mainly because of decreased milk yield per cow per lactation and a decreased number of calves per year per cow. It is also associated with reduced conception rates [1]. Conception rate is determined by heat detection, the choice of the first insemination time after calving, induction of ovulation, and the ovulation synchronization program. While seeking the best conception rate, it is also worth noting that some environmental features and management practices directly affect insemination and thus have adverse effects on reproductive performance. The efficiency, accuracy and timing of artificial insemination (AI) remain a major challenge to improving the reproductive and economic efficiency of many dairy farms [6,14]. Various studies have shown that regression models are of great importance in addressing the issues around the conception rate. Some of them have been used in predicting the optimal time of insemination [12].

November 14, 2020 1/8

Probability of conception was analysed using a logistic procedure, which uses the maximum likelihood method to fit linear logistic regression [5]. However, logistic regression assumes that the target variables are independent and single valued, yet some data are categorical. Due to these deficiencies, other methods such as machine learning procedures are sought to address such problems. Various machine learning algorithms, including Bayesian networks, decision trees and, in particular, random forest algorithms, have been used for such tasks. Bayesian networks are mainly suited for small and incomplete data [9], with challenges in discretizing continuous variables and implementing recursive feedback loops [13]. Decision trees, as well as random forests, can be used for both classification and regression.
The RFR algorithm has been widely utilized due to its ability to accommodate complex relationships. RFR calculations can be trivially parallelized, so they can be run on multiple cores of the same CPU.

Additionally, the RFR algorithm involves very few statistical assumptions, and its hyperparameters can be used to reduce overfitting. The performance of RFR can be explained by the power of ensemble methods to generate high-performance regressors by training a collection of individual regressors. RFR was considered in a study predicting pregnant versus non-pregnant cows at the time of insemination, and it proved significantly better than other machine learning techniques in [17]. Random forest was also used in an attempt to predict conception outcome in dairy cows [17]. On the other hand, mathematical models of lactation are not new either. Models of lactation curves were referenced early on by [3], but due to the limitations of computers and the computational difficulties experienced at the time, the early models were based on simple logarithmic transformations of exponentials, polynomials, and other linear functions [15]. Another study gave an overview of the parametric models used to fit lactation curves in dairy cattle by considering linear and non-linear functions [10].
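To illustrate these points, the following minimal sketch (our own, on synthetic data; the sizes, values and hyperparameter choices are assumptions, not the study's pipeline) fits scikit-learn's RandomForestRegressor. The overfitting-related hyperparameters appear explicitly, and n_jobs=-1 exploits the trivial parallelism over trees:

```python
# A hedged sketch, not the authors' pipeline: fitting an RFR on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 5))                       # 5 hypothetical control parameters
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + 0.05 * rng.normal(size=1000)

# Hyperparameters that can curb overfitting; n_jobs=-1 trains the trees
# in parallel on all available CPU cores.
model = RandomForestRegressor(
    n_estimators=200,
    max_depth=None,          # fully grown trees, as in the study
    min_samples_split=2,     # scikit-learn's default
    n_jobs=-1,
    random_state=0,
)
model.fit(X, y)
print(round(model.score(X, y), 3))  # R^2 on the training data
```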

Machine learning approaches have also proved to be vital in the study of lactation.

Different models based on machine learning, in both non-autoregressive and autoregressive cases, have been investigated in [16], with the random forest algorithm exhibiting the best performance in both cases. Regression trees have been used in the past to analyse different factors affecting lactation. Research on the effects of the dry period, the lactation parity, the farm, the calving season, the age of the cow, the year of calving and the calving interval has been performed by several authors [4,11].

Though previous studies have used other machine learning based models (including RFR) to predict lactation and successful insemination, the present study adopts the RFR technique for the same purpose but with different variables. Our purpose is to investigate how the large collection of data gathered in the last ten years in milk production facilities throughout Europe can be analysed effectively. Therefore, the aim of this study is to apply a random forest regression model to predict the factors influencing lactation and the success of insemination (SI), as well as the choice of the time of insemination attempts.

For this analysis, a large data set was obtained. However, some records were not usable because some information was missing, and these were omitted from the study. All data editing and analyses were conducted in Python, where pandas was used for data preparation. The maximum depth of the trees is also a hyperparameter; we did not limit the depth. Figure 1 and Figure 2 depict sample decision trees of depth 3; the depth of the trees used in the models is much bigger. For each tree, there is one node at the start, the root node, that contains all the samples. For each node that has at least 2 samples, a split is performed. For each node to be split, the CART algorithm examines the possible splits of all the features for that specific node, and the best alternative is chosen according to the splitting criterion used. Subsequently, the interval of the selected feature is split at the selected value of the feature, resulting in two new nodes. Consequently, the two new nodes have fewer samples. Then, if one or both of these new nodes have at least 2 samples, the splitting process continues. When there is only 1 sample left in a node, the CART algorithm does not perform a split. The minimum number of samples required for splitting is also a hyperparameter and can be changed; we used the default value, 2. The trees are grown as long as no stopping rule stops the growing process. Pure nodes, where the target variable is identical in all samples, are not split. Moreover, we did not use pruning. This way, we got "fully grown and unpruned trees" (https://scikit-learn.org/stable/modules/generated/). In both cases, the prediction value by means of the random forest is the mean of the prediction values of its individual decision trees.
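The split search described above can be sketched as follows (a pure-Python illustration of the idea, not scikit-learn's internals): for the samples in a node, every candidate threshold of every feature is examined, and the split minimizing the weighted variance of the target in the two resulting child nodes is kept.

```python
# Illustrative CART-style split search by variance reduction (our own sketch).
import numpy as np

def best_split(X, y):
    """Return (feature index, threshold) of the best variance-reducing split."""
    best = (None, None, np.inf)
    for f in range(X.shape[1]):
        for v in np.unique(X[:, f])[1:]:            # candidate thresholds
            left, right = y[X[:, f] < v], y[X[:, f] >= v]
            # weighted sum of child variances; smaller is better
            score = len(left) * left.var() + len(right) * right.var()
            if score < best[2]:
                best = (f, v, score)
    return best[0], best[1]

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(best_split(X, y))   # splits feature 0 at 3.0, yielding two pure children
```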

Given any tree T and one of its nodes n, our procedure fixes a feature f_{T,n} along with a value v_{T,n} in the range of f_{T,n}. Furthermore, the algorithm fixes a prediction value p_{T,ℓ} for every leaf ℓ of T. Given a "virtual cow" C with f_{T,n}-values f_{T,n}(C) (for each node n of T), our RFR estimate for answering P1, P2 according to tree T is the value r_T(C) = p_{T,ℓ(C)}, where the leaf ℓ(C) is determined as follows: we start at the root of the tree, and whenever a node n is reached, we continue to the left or to the right to the next node according to whether f_{T,n}(C) < v_{T,n} or f_{T,n}(C) ≥ v_{T,n} (cf. Fig 1 and Fig 2). Our steps end when reaching a leaf, which we set to be ℓ(C).
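The traversal above can be sketched with a toy tree representation (the dictionary keys `feature`, `threshold`, `left`, `right`, `value` are our own illustrative names): a sample descends left when f_{T,n}(C) < v_{T,n} and right otherwise, and the forest prediction is the mean of the trees' leaf values.

```python
# Toy illustration of r_T(C) and the forest mean (not scikit-learn's internals).

def predict_tree(node, sample):
    """Follow the decisions down to a leaf and return its prediction value."""
    while "value" not in node:                        # internal node
        f, v = node["feature"], node["threshold"]
        node = node["left"] if sample[f] < v else node["right"]
    return node["value"]

def predict_forest(trees, sample):
    """Forest prediction = mean of the individual trees' predictions."""
    return sum(predict_tree(t, sample) for t in trees) / len(trees)

# Two hand-made depth-1 trees splitting on features 0 and 1.
trees = [
    {"feature": 0, "threshold": 5.0,
     "left": {"value": 1.0}, "right": {"value": 3.0}},
    {"feature": 1, "threshold": 0.5,
     "left": {"value": 2.0}, "right": {"value": 4.0}},
]
print(predict_forest(trees, [7.0, 0.2]))  # (3.0 + 2.0) / 2 = 2.5
```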

The prediction value by means of a random forest is the mean of the prediction values of its individual decision trees. For standard theoretical background we refer to the monograph [7] and Breiman [2]. For the implementation we used the Python machine learning package scikit-learn (https://scikit-learn.org/stable/). Scikit-learn's RandomForestRegressor is trained using the CART algorithm.

After the training is done, we need to check the model's score on the test set. If the R² score of the forest's predictions on the test set is not good, we need to fine-tune the hyperparameters of the RFR. For hyperparameter tuning, we build a grid over the hyperparameters of the RFR (maximum depth of the trees, minimum number of samples for splitting, pruning settings, etc.). Additionally, we split the training set into a proper training set and a validation set; we use the latter to find the best set of hyperparameters. Then we train an RFR for each point of the grid and check its predictive performance using cross-validation. The best set of hyperparameters is then used for our final model. We then train the RFR on the combined training and validation sets and check its final predictive performance on the test set, which we did not use until this final step.
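The tuning loop just described can be sketched with scikit-learn's GridSearchCV, which performs the cross-validated grid search internally (the grid values and synthetic data below are illustrative assumptions, not the study's settings):

```python
# Hedged sketch of hyperparameter tuning via cross-validated grid search.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(size=(400, 4))
y = X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=400)

# Held-out test set, untouched until the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1)

grid = {
    "max_depth": [None, 5],
    "min_samples_split": [2, 10],
}
search = GridSearchCV(RandomForestRegressor(n_estimators=50, random_state=1),
                      grid, cv=3, scoring="r2")
search.fit(X_train, y_train)                # cross-validation over the grid
print(search.best_params_)
print(round(search.best_estimator_.score(X_test, y_test), 3))  # final R^2
```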

If the model performs well, we can decide not to deal with hyperparameter tuning, because the model is already good enough.

The original data set consists of 82,563 records, but it contains missing and invalid data. For this reason, not all records were kept: after the preparation of the data set, 45,461 records were kept for the Lactation model and 82,378 records for the Successful Insemination model. In our study, we used 10% of the prepared data set as the test set and the remaining 90% as the training set. The results are summarized in Table 1, Table 2 and Table 3.
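The preparation step described above (dropping incomplete records, then holding out 10% of the prepared data as the test set) can be sketched as follows; the column names and toy values are hypothetical, although DFI, CN and SI are abbreviations used later in this paper:

```python
# Hedged sketch of the data preparation and the 90/10 split (toy data).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "DFI": [60, 70, None, 80, 90, 65, 75, 85, 95, 55],
    "CN":  [1, 2, 3, None, 2, 1, 3, 2, 1, 2],
    "SI":  [1, 0, 1, 0, 1, 1, 0, 1, 0, 1],
})
prepared = df.dropna()                       # discard incomplete records
train, test = train_test_split(prepared, test_size=0.1, random_state=0)
print(len(prepared), len(train), len(test))  # 8 7 1
```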

We established an alternative approach to other machine learning based models concerning Questions P1, P2 and P3 by means of random forest regression. The transformed data set was split into a test set of 10%; the remaining 90% was used to train the forest. The results indicated that when the target was SI, the prediction of the RFR on the test set and the actual targets had R² ≈ 0.948; the important features were IPAC, DFI, CN, PCIS, MFCA and UMP, with importance scores in Table 3. When the target was Lactation, the prediction of the RFR on the test set and the actual targets had R² ≈ 0.987; the important features were AMPLD, DPLM, PLD, DC and PLESI, with importance scores in Table 2. Various alternative regression methods were used for the analysis of this data set, but all these attempts failed due to the complexity of the data and the large sample size. It seems that RFR is well suited to practical applications (as for Problems P1, P2 and P3).
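Importance rankings of the kind reported in the tables can be read off a fitted forest via scikit-learn's impurity-based feature_importances_ attribute, which is derived from the variance reductions achieved by splits on each feature. A minimal sketch on synthetic data (our own, not the study's):

```python
# Hedged sketch: ranking features by impurity-based importances.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(500, 3))
y = 5.0 * X[:, 0] + 0.1 * rng.normal(size=500)   # only feature 0 matters

rf = RandomForestRegressor(n_estimators=100, random_state=2).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[0])   # feature 0 should rank first
```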