ABSTRACT
The future of personalized health relies on knowledge of dietary composition. The current analytical methods are impractical to scale up, and the computational methods are inadequate. We propose machine learning models to predict the nutritional profiles of cooked foods given the raw food composition and cooking method, for a variety of plant and animal-based foods. Our models (trained on USDA’s SR dataset) were on average 31% better than baselines, based on RMSE metric, and particularly good for leafy green vegetables and various cuts of beef. We also identified and remedied a bias in the data caused by representation of composition per 100grams. The scaling methods are based on a process-invariant nutrient, and the scaled data improves prediction performance. Finally, we advocate for an integrated approach of data analysis and modeling when generating future composition data to make the task more efficient, less costly and apply for development of reliable models.
INTRODUCTION
Food processing, such as fermentation, baking, or even boiling alters the chemical composition of food, often in unpredictable ways from the raw to the finished state. This is due to the unresolved chemical and structural complexity of the food and the physio-chemical transformation mechanisms that occur during processing. [1] [2] Yet in spite of these challenges, the objectives for prediction models are compelling which include sensory properties [3] such as aroma, texture, taste, etc., and nutrient profiles, and here we address the latter. The models in use currently, simplify the inherent complexity and instead predict the content of a nutrient based on only a few parameters. For instance, kinetic modelling based on experimental data for any given food establishes the relationship between nutrient concentration, time and temperature conditions. [4] [5] [6] [7]. This can then be applied to compute concentrations, for example predicting vitamin C (ascorbic acid) content in processed orange juice [6]. Another approach to compute post-process nutrition composition, is to apply retention factors (RF) which are based on analytical composition data on a representative set of foods and processes. RF-based computation is used widely by food manufacturers for nutrition labels, and by USDA’s dietary survey group to calculate nutrient intakes that investigators may use to determine correlations between intake and health outcomes [8]. However, all of these methods have limited potential. Kinetic models are difficult to scale up to capturing more food and processing parameters, as these measurements are time-consuming, expensive [9] and have many experimental challenges such as certain chemicals which degrade rapidly. RF-based methods in practice inevitably under or overestimate the nutrient content in a particular instance, since any single RF is representative of several foods and a cooking method. Here we address the challenge that our knowledge of composition and reactions of food systems is limited, which inevitably manifests to such incomplete or underdetermined models. This can be at least partially addressed with predictive machine learning (ML) methods that can learn the multi-parametric transformation patterns between the compositions of raw and cooked foods, from experimental data across diverse foods and cooking methods.
The application of ML to food science data is at an early stage, yet it has been successful in generalizing across a variety of prediction tasks when trained on relevant datasets. Some examples of recent work include prediction of nutrient profiles or properties of food. Classifiers model have been applied to predict sensory properties from the molecular structure, such as bitter [10] [11] and sweet [12] [13]and aroma labels [14]. A number of food quality classifiers use hyperspectral data, for example for the freshness classification of shrimp [15], detection of adulteration in red meat products [16] or detection of damaged/bruised fruits and vegetables [17] [18]. Several models have addressed attributes related to nutrient profiles. Natural language processing (NLP) methods were used to predict the macronutrient (proteins, fats and carbohydrates) content of foods from a text description of the food [19]. USDA investigators predicted the content of 3 label nutrients (carbohydrates, protein and sodium) in processed foods from the ingredient list, using the Branded Foods datatype in Food Data Central (FDC) [20]. Several projects predicted nutrient contents from the composition data; nutrient content was predicted for the missing values in food composition data [21], lactose content was predicted in dietary recall database [22], fiber content was predicted for commercially processed foods [23]. Availability of datasets with high quality data for training and testing is essential, and databases such as BitterDB [24], BTP640 [25], FlavorDB [26],FooDB [27], SuperSweet [28], Fenaroli [29],GoodScents [30], FDC [31] as well as specifically curated datasets of hyperspectral images may contribute to this end.
Here, we have constructed an ML model that predicts food micronutrient (specifically 7 vitamins and 7 minerals) composition after processing (Figure 1). We have curated a sample of 820 foods, for 5 processes, namely steaming, boiling, grilling, broiling, and roasting from FDC, and trained regressors per nutrient and per process that have achieved a correlation(R2) between measured and predicted micronutrient values that range from 0.42 to 0.95 (outliers are -0.42, -0.09,0.13 and 0.23).
METHODS
Dataset
We downloaded the composition dataset of 7,793 foods from the Standard Reference (SR) legacy dataset, which is the most suitable of the 5 data types in FDC (Figure 1; as of November 2021), since it is aligned with our objectives. This dataset is intended for application in public health initiatives such as the assessment of nutrient intakes for the purpose of national nutrition monitoring, in creating meal plans in schools and day-care centers, in product development and labeling by manufacturers. The composition data for the foods in SR is obtained from 3 sources; analytical experiments, calculations (based on the analytical data), and literature. For our models, we selected a subset of these data according to the following criteria. We matched raw/cooked food pairs, where the raw foods were a single ingredient harvested from a plant or from an animal (includes butchery products), and the cooked food was the outcome of the raw food treated to wet (boiling, steaming), or dry (roasting, grilling, broiling) heat processes. Foods were excluded from the dataset if either there was no single-ingredient raw food corresponding to the cooked food and vice-versa, or the foods had several ingredients and produced by a multi-step process like ‘Luncheon meat, pork and chicken, minced, canned, includes SPAM Lite’, ‘Bread, banana, prepared from recipe, made with margarine’. We excluded processes which have added ingredients, such as oil for frying, while we included boiling and steaming (simple aqueous, i.e., wet heat processes), as well as roasting, broiling and grilling (dry heat processes). This resulted in 840 foods total in the dataset, with 178 and 247 pairs from wet and dry heat processes, respectively. In this dataset, all plant-based foods were cooked by wet heat process (WH), and all animal-based foods by a dry heat (DH) process. The categorical breakdown of the number of pairs for plant-based and animal-based foods is shown in Figure 2.
The composition data consists of content values for up to 232 ‘chemical constituents’ or ‘components’, which include specific chemicals (vitamins, amino-acids, fatty acids, etc.) and aggregated chemicals or chemical groups (total fats, total proteins, etc.) for every food. Here, we selected the components that are reported for at least 80% of the foods in our dataset. This resulted in 27 components per food, namely 9 vitamins, 10 minerals, water, and 7 aggregates of total protein, total carbohydrates and various fat categories (Supplementary materials). The pair-wise composition data was used to train the prediction models where the input feature set to every model is the content of the 27 components in the raw food and the outputs are the contents of the 14 micronutrients in the cooked food. Prior to model fitting, the composition data should be preprocessed to adjust for the bias resulting from the conventional format of representing nutrient contents per 100 grams of a food sample. In practice, the raw sample is likely not always 100 grams, and the cooked food sample would have a relatively higher or lower weight yield primarily due to loss or gain of water in the cooking process. As a result, when water is lost the solid components have a relatively higher concentration, and in comparing the raw and cooked food composition per 100 grams, there is a concentration bias in the cooked food. Similarly, there is a dilution bias when water is absorbed in the cooking process. Ideally the data preprocessing would reverse this scaling effect. We use two different scaling methods, solid content scaling in equations 1 and 2 and process-invariant nutrient scaling in equations 3 and 4. For the solid content scaling (SCS), the water content in the raw food is set to that in the cooked food which assumes that the water content stays constant while scaling the content of other components. Then the composition is scaled preserving the proportions and the weight of the sample is maintained at 100 grams. This method lowers the extent of the dilution or concentration biases while the “process-invariant nutrient” scaling (PINS) method intends to undo this bias. The method is based on identifying a nutrient that is largely invariant to processing and can be used to calculate the scaling factor as per equation 3. This factor is then used as per equation 4 to derive the composition for the “true” weight of the cooked food corresponding to a 100-gram sample of raw food. From the 27 components in our dataset, cholesterol, iron and zinc are invariant to changes in processing for animal-based foods, but there is no such information for the plant-based foods. Cholesterol in the various meats is invariant to processing since it is in the muscle-cell membranes which are resistant to cooking loss. [32]. Iron and zinc were reported to be process invariant as per experimental studies conducted by UDSA [33]. For confirmation of these hypotheses, all components are used in the PINS method and prediction performance is compared for both animal and plant-based foods.
In the equations for scaling methods, R represents the raw food and C represents the cooked food, and component is the generalized term for each of the 26 solid components. The final dataset includes 425 pairs of foods, with 27 components, 5 processes (boiling, steaming, roasting, grilling and broiling), in two states (raw and cooked). (Supplementary materials).
MODEL
We trained models to predict the content of 14 micronutrients for which we had baseline retention factors in the cooked food. Of those, seven are vitamins, namely vitamin B1 (thiamin), vitamin B2 (riboflavin), vitamin B3 (niacin), vitamin B6 (pyridoxine), vitamin B9 (folate), vitamin C (ascorbic acid), vitamin A, and the other seven are minerals, namely calcium, iron, potassium, phosphorus, magnesium, sodium and zinc. We created separate models based on the process category (wet, dry), as these are fundamentally different processes, but not based on the actual process (e.g., boiling vs. steaming), as there are not sufficient data per process to avoid overfitting. All models have the same input, which is the composition of the raw food, as illustrated in Figure 1. Other details that might be informative to the task (cooking time, temperature, water content) were not available in the SR legacy dataset, and consequently were not present in our dataset, or our model. Since vitamin C is not present in meats (which are all the foods for DH models), the dry heat models are only 13, for the other micronutrients, resulting in 27 models total (13 for DH and 14 for DH). These sets of WH and DH models were trained and tested on scaled variants of the dataset explained earlier. We applied a filtering step to the scaled datasets to select the pairs of foods where the nutrient being predicted was more in the raw food than in the cooked food. So, each of the nutrient models were trained on different subsets of the data. The unscaled data for the dry heat models and wet heat models was not filtered for this condition. The effect of the data scaling and filtering on the predictive models is explained in the Results.
The best performing model (for any dataset variant) was selected based on a cross validation grid search across 6 regressor types (MLP, LASSO, Elastic Net, Gradient Boost, Random Forest, Decision Trees), each with a variety of hyperparameters totaling 12,336 regressors where the metric for the best model was the least root mean squared error (RMSE). This was done for each of the 27 models using the sklearn library [34] and the best hyperparameters for each of the regressor types along with the RMSE is in Supplementary materials. We then performed a feature selection technique, a recursive feature elimination variant as described in the sequential feature selector function of the mlxtend package [35]. The model performances for data variants for the WH and DH process are compared in Table 1.
We assessed the predictive performance (RMSE) in comparison with 2 baseline models. The first is to naively assume that the dependent variable (the micronutrient to predict after cooking) is equal to its value in the raw food. This baseline serves as a comparison to a naïve regressor where the retention factor (RF) is 100%, i.e., the amount of the micronutrient after the heat process is the same as in the raw food. The second baseline was based on the USDA Retention Factors table, a common, standard model for the retention of nutrients after a process [8]. The nutrient outputs were computed as a product of the RF for the specific nutrient and the content of that nutrient in the raw food. We use RSME, the coefficient of determination (R2), Pearson Correlation Coefficient (PCC), and Spearman Rank Correlation Coefficient (SRC) to assess the performance of our regressor model (Table 2 and Supplementary materials). At each case, we performed 5-fold cross validation runs, bootstrapped 50 times to avoid overfitting and increase the generalization potential of our classifiers. For a subset of foods (Supplementary materials), we provide a higher resolution baseline using retention factors from experimental studies in literature. Finally, we analyze the prediction performance through a breakdown of R2 by food category for plant-based foods (Leafy greens, Roots, Vegetables, Legumes, Cereals) and animal-based foods (Beef, Lamb, Chicken, Veal) as shown in Table 3. We do this by tagging every predicted micronutrient value by the category (associated with the food) and calculate the R2 for every group. This is repeated for all predictions, and the average R2 of a category is used to determine the best and worst performances in the plant-based and animal-based foods.
RESULTS
Approximately 10% of SR Legacy foods can be paired in processes to be used in model training
The single ingredient foods that are either raw or cooked were found in 35% of the SR legacy data, and 30% of these were paired into raw and cooked samples. The final selection of 840 foods (or 425 pairs) is 10% of SR legacy data (Figure 2A), with an unequal distribution of data pairs by food category (Figure 2B). We identified an anomaly where the content of a micronutrient was more in the cooked food than in the raw food in 50% of the pairs on average across the 14 micronutrients. The anomaly was more severe for the animal-based foods (77% vs 23% pairs, respectively; see Supplementary materials). This was partially caused by the bias introduced by the data representation convention. For the animal-based foods, the non-anomalous pairs are 30% of the total pairs for unscaled data and increase to 70% for PINS-cholesterol scaled data, p-value<0.01. This is reasonable, since the anomaly is due to a concentration bias (nutrient content in cooked food is more than in raw food), which is mitigated by scaling. For the plant-based foods, there is no significant change(p-value>0.05) in non-anomalous pairs using the scaling methods for plant-based foods, since the issue is a dilution bias which is mitigated however this does not cause an anomaly (nutrient content in cooked food is more than in raw food). The comparison of non-anomalous pairs for animal and plant-based foods is shown in Figure 2. The Discussion section explains the reasons for this differing effectiveness of the scaling methods in reducing the bias and suggests other possible causes for the bias.
Scaling improves model performance
We trained predictive models on variants of the datasets as explained in Methods. The dry heat models (broiling, grilling, roasting processes; 247 animal-based foods) and wet heat models (steaming, boiling; 178 plant-based foods) were trained on the unscaled data, which is not filtered for the anomalous condition, and on data from both the scaling methods which is filtered for non-anomalous data. We use the metric RMSE to compare model performance and confirm the hypotheses described in Methods. For the dry heat models, the average RMSE (for 13 predictions) was 20% lower when the model was trained on data scaled by the PINS-cholesterol method than data scaled using SCS method, which had 15% lower RMSE compared to the model trained on unscaled data. Although the model performance based on PINS data for iron and zinc has lower average RMSE than cholesterol, we consider the model trained on PINS-cholesterol as the best model since there is a mechanistic explanation described in Methods. For the wet heat models, the average RMSE was 35% lower when the model was trained on SCS data than that on unscaled data. These comparisons are shown in Table 1, and all results are in Supplementary materials and further analysis is in Discussion. The best model for the wet heat process is trained on SCS data and for the dry heat process it is trained on PINS-cholesterol data. We now compare results from the best predictive ML models to the baseline model.
The predictive model performs 43% and 18% better than using the standard USDA Retention Factor model for wet and dry heat processes, respectively
We compared the predicted concentrations of the micronutrients in the cooked foods for both the wet heat processes and the dry heat processes against the two baseline models, as described in the Methods section. When compared to the naïve baseline (i.e., retention factor is always 100%), the predictive model is better in all 27 out of the 27 comparisons (100%; RMSE of 9.90±16.45 vs 31.29±56.56, respectively; 64% decrease of RMSE on average for wet heat, p-value < 0.01; 52% decrease in RMSE on average for dry heat, p-value < 0.01). Then, to compare with the standard practice, we computed micronutrient concentrations using the USDA’s Retention Factor table (see Methods) as shown in Table 2. In that case, the predictive model was better than this baseline in 22 out of the 27 comparisons (81%; RMSE of 9.90±16.45 vs 16.45±28.41, respectively; 43% decrease of RMSE on average for wet heat, p-value < 0.01; 18% decrease in RMSE on average for dry heat, p-value < 0.01). Figure 3 depicts the correlation between predicted and actual (ground truth) values for the 14 micronutrients, for both the ML model and the USDA retention factor baseline. Next, we investigated the difference in the predictive performance when curating retention factors from literature. For this, we identified the retention factors of vitamin C (ascorbic acid) for 12 sample foods (green beans, beet greens, broccoli, Chinese cabbage, carrots, cauliflower, mustard greens, green peas, green peppers, pumpkin, spinach, zucchini) and of vitamin B9 (folate) for 12 sample foods (amaranth leaves, broccoli, drumstick leaves, snap beans, lentils, okra, onions, potatoes, green peas, soybeans spinach, taro leaves) (see Supplementary materials). In both cases, the ML model had a better agreement with the ground truth data than the Literature Retention factor baseline, although less so for vitamin C (for vitamin C (ascorbic acid), RMSE 10.51 vs 13.31, p-value=0.026; for vitamin B9 (folate) RMSE 25.84 vs 97.22, p-value=0.013). Note that retention factor information for each micronutrient is not available for the majority of foods, and it is a time consuming and expensive process to measure it. Using scale-invariant metrics reach the same conclusions (see Supplementary materials). The Discussion section elaborates further on the reasons that any RF baseline method is error prone and not appropriate to compute nutritional baselines.
Prediction performance is best for leafy green vegetables, and worst for cereals, in the plant-based food categories, and best for beef and worst for veal in the animal-based food categories
As reported in prior literature, the food structure/phenotype influences the chemical and physical changes that occur in food processes. Here we use the food category to represent this concept and show the differences in predictability. We group the 14 predicted micronutrient values by the food category and calculated the R2 (Table 3 and Figure 4A). Leafy green vegetables have the highest average R2 of 0.75±0.10 and cereals have the least average R2 of 0.52±0.25. In the dry-heat processed animal-based foods, beef had the highest average with R2 of 0.48±0.46 and veal the least average R2 of - 0.50 ±1.21. Due to the uncertainty associated with methods of data generation, USDA specifies the nutrients with most reliable data, these are vitamin B3 (niacin), vitamin B6, calcium, iron and zinc. The highest average R2 is now 0.85 ±0.08 for beef and the lowest is -0.06 ±1.26 for veal. As such, the nutrient loss is better predicted in leafy greens and beef given the current training data.
High variability on the top predictive features
There is a notable lack of feature importance order across the prediction models. Figure 3B shows the feature ranks, where the features are ordered by their average rank across predictions. The average rank is in the mid-range for both the WH and DH process, suggesting that no feature has a consistent importance across all the predictions. Figure 3C shows performance by feature-size plots for vitamin B6 and potassium (WH) and vitamin B6 and zinc (DH) and the feature names are listed in the caption. The common observation is that the top ranked feature is the micronutrient itself in the raw food, as expected, but all other input features are specific to every prediction. The complete coverage of best features and feature ranks is in the Supplementary materials.
DISCUSSION
In recent years the body of food composition data on preprocessing and postprocessing has grown, and so have simultaneously the expectations and vision [36] for big data in prediction tasks related to processing effects on food. But these expectations have yet to match the currently available computational capability and it is unknown whether the available data is sufficient for predictive capability and where the gaps might be. The results prove the potential of this existing data for predictive models and even show the categories that are more predictable namely leafy green vegetables in plant-based foods and beef in animal-based foods. A significant drawback in the data was the standard data convention of composition per 100grams of foods for raw and cooked foods which obscured the actual change and we remedied (but not reversed) this by the scaling method.
In the introduction we pointed out the discrepancies between the simpler prevalent methods and our ML models. Now, we elaborate on the discrepancies in perspective of our results, where the data scaling methods and ML model achieve an overwhelming improvement over the baseline. The scatter plots for vitamin B9(folate) and magnesium show that the baseline method overestimates the composition, which implies that the baseline RF is much greater than the RF inherent in the true data. RF represents the rate of loss which is influenced by process-related factors like processing times, surface area of vegetable exposed to processing conditions. Ideally for a fair comparison, these factors should be known for the baseline and matched to the data at hand. This can easily be addressed by recording additional meta-data. However, the more challenging discrepancy was that the baseline is a simple linear method, while the prediction model is a much more complex multiparametric non-linear ML model. Inevitably more sophisticated methods will emerge whether machine learning, mechanistic or a hybrid, and a suitable state-of-the-art baseline method will be available for comparison.
Our results show that methods of data analysis and predictive ML are valuable tools to assist in experimental design for food composition analysis, since the data generation process is time consuming and expensive. Specifically, we suggest how our methods justify decisions such as; the selection of food samples, recording of structured metadata/provenance, checking for data quality, and determining the composition feature set. The provenance of the data was incomplete in at least two different aspects. The composition data was calculated for some foods, and there was no explanation for the calculation method and no mention of the reference food /data used in the calculation method. It is unclear whether the samples for the raw and cooked food were related. Additionally, ontologies or structured vocabularies are a valuable resource when creating a format or structure for the dataset. Regarding data quality, we have described the anomalous condition in the Results. This is an example of a basic data sanity check, and especially in the context of a prediction hypotheses. Predictive performance depends on both the sample size as well as the entropy of the dataset, and one can use the predictive performance of the model as a guide for the sampling size for gathering new experimental data. There was only a single representative instance for each food and factors like geography, method of agriculture etc. are known to significantly impact the composition. Our results on prediction performance by category could justify the need for greater sampling. Regarding the feature space per sample, we suggest including process parameters and features known to influence nutrient loss such as pH.
Finally, representation of the composition per 100g of food, further obscured the data. We mitigated this issue by applying the data scaling methods, however our observation show that this is not a complete resolution and new standards for data representation are required. The results from applying the scaling methods on the composition data, has two unrelated interpretations; the effect on the size of non-anomalous food-pairs (Figure 3) and the effect on model performance trained on this data (Supplementary materials).
As seen in Figure 3, there is no significant effect (p-value =0.06) for the plant-based foods where the data representation causes a dilution bias, and the anomaly could instead be due to different food samples used for the raw and cooked analysis. Whereas there is a significant effect (p-value <0.01) on animal-based foods where the anomaly is due to a concentration bias. Regarding the prediction performance, a few additional components used in the PINS method had good results besides the hypotheses. For plant-based foods, the performance for SCS data was the best, followed by carbohydrate PINS data. For animal-based foods, the performance by PINS-proteins data was the better than for zinc, iron and cholesterol. However, the results for PINS-carbohydrate and PINS-protein are likely due to the methods used for generating this data. This analysis presents several questions for future inquiry and experimental validation, though the most important might be to ascertain a process-invariant nutrient and under which conditions and the biochemical/mechanistic explanation. This information might help for data transformations of existing data, but new data representation standards need to be considered and applied to future data generation efforts.
Food composition data is poised to grow in size and quality [37] [38] [39], and such predictive modelling applications can assist consumers in making reliable dietary decisions. For example, to choose food ingredients and cooking methods by getting answers to queries like-“How do boiling/streaming affect the nutrient profile of Vegetable X compared to roasting or frying”. Personalized health models must consider additional factors like bioavailability and individual physiological response, however dietary composition is an essential step towards that goal.
AUTHOR CONTRIBUTIONS
T.N. gathered the data and performed all computational analysis. T.N. and I.T. wrote the manuscript. I.T. and T.N. designed the computational methods. I.T. conceived and supervised all aspects of the project.
ACKNOWLEDGEMENTS
We would like to thank the members of the Tagkopoulos Lab, particularly Jason Youn and Gabriel Simmons for their feedback in machine learning methods, the USDA ARS methods and applications of food composition lab personnel and Dr. Bruce German for their suggestions. This work has been been supported by the USDA-NIFA AI Institute for Next Generation Food Systems (AIFS), USDA-NIFA award number 2020-67021-32855 and USDA grant 58-8040-8-015 to I.T.