Linear and Neural Network Models for Predicting N-glycosylation in Chinese Hamster Ovary Cells Based on B4GALT Levels

Glycosylation is an essential modification to proteins that has positive effects, such as improving the half-life of antibodies, and negative effects, such as promoting cancers. Despite the importance of glycosylation, predictive models have been lacking. This article constructs linear and neural network models for the prediction of the distribution of glycans on N-glycosylation sites. The models are trained on data containing normalized B4GALT levels in Chinese Hamster Ovary cells. The ANN models achieve a median prediction error of 1.39%, which is 10-fold smaller than for previously published models, and a narrow error distribution. We also discuss issues with other models reported in the literature. We provide all of the software used in this work, allowing other researchers to reproduce the work and reuse or improve the code in future endeavors.


Introduction
Glycosylation is a form of co-translational and post-translational modification that involves adding a glycan or glycans to proteins. When a glycan is added to the side-chain nitrogen of an asparagine residue, the process is called N-linked glycosylation. Glycosylation has many important functional and structural roles [1] [2] [3]. Improper glycosylation or deglycosylation, on the other hand, is associated with multiple diseases such as cancers [4], infections [5], and congenital disorders [6]. Many enzymes participate in the glycosylation process, and B4GALT1-B4GALT4 have been shown to be key contributors in multiple independent studies [7].
Due to its potential for diagnoses and treatments, glycosylation has been of significant interest to the biomedical and pharmaceutical industry, physicians, and patients. For example, increases in fucosylation, branching, and sialylation occur in many types of carcinoma [8]. Disialoganglioside is expressed by almost all neuroblastomas, and anti-disialoganglioside monoclonal antibodies have been successful against high-risk neuroblastoma in Phase I-III studies [9] [10]. Another example is poly-α2,8-sialylation, which increases the half-lives of antibodies without introducing tolerance problems [11]. On the other hand, the presence of glycans not produced by humans can be detrimental to a therapeutic. N-glycolylneuraminic acid, which is in some CHO-cell-derived glycoproteins [12], is immunogenic to humans [13].
Despite the importance of glycosylation for biotherapeutics and the many advances made in the field, such as the genetic engineering of CHO cells to increase the sialylation of glycoproteins [14], some challenges remain. Proteins have multiple glycosylation sites, and analyses need to take into account not only the glycan compositions but also where these glycans are located [8]. This structural diversity makes it difficult to explore specific functions of glycosylation [3]. Clinical laboratories struggle with analyzing patient glycosylation samples due to the complex equipment needed, which has limited the progress of personalized medicine [8].
Many computational models have been developed to assist researchers in better understanding and predicting glycosylation patterns. Some of the models, such as YinOYang [15] or O-GlcNAcPRED-II [16], are classifiers. In the context of glycosylation, these models may predict whether an amino acid is N-GlcNAcylated, or whether it is O-GlcNAcylated, for example. Other models, such as by Ref. [17], perform regression, that is, the models attempt to predict numerical values. In the context of glycosylation, a regression model may predict the numerical glycan distribution at a given glycosylation site.
In this work, we construct linear models and artificial neural network (ANN) regression models to predict glycan distribution in different glycosylation sites from normalized B4GALT levels in Chinese Hamster Ovary cells. The model construction procedures employ nested cross-validation and rigorous unbiased prediction error estimation. Our ANN models have 10-fold lower median prediction error than previously reported models. Open-source software is provided so that other researchers can reproduce the work or retrain the models as additional glycosylation data become available.

Materials and Methods
This section describes the datasets and methods for constructing the data-driven models. Details on how to run the software to make glycan distribution predictions or recreate the results in this article are provided in Supplementary Information.

Datasets
The models in this article are constructed using two experimental datasets. The larger dataset, from the Supplemental Data of Ref. [7], comprises distributions of glycans in nine different glycosylation sites in response to changes in the enzymes B4GALT1-B4GALT4 due to five types of knockouts. A smaller dataset, from Figures 3-5 and S2 of Ref. [18], comprises distributions of glycans over time in response to changes in the concentrations of nucleotide sugar donors. 20% of each dataset is separated for testing, and the remaining 80% is used for cross-validation. The data from Ref. [7] are split into five groups according to the knockouts that are performed, and the data from Ref. [18] into five groups according to the feed conditions. During each cross-validation fold and during testing, the validation/testing groups are selected such that an entire knockout group is held out and the other knockout groups are used for training. This procedure avoids test set leakage by ensuring the test set is sufficiently different from the training set.
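The group-wise hold-out described above can be sketched with scikit-learn's LeaveOneGroupOut splitter. This is a minimal illustration with randomly generated stand-in data (the real inputs come from Refs. [7] and [18], and the actual split is implemented in the provided software):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical stand-in data: 10 samples, 3 inputs, and 5 knockout groups
# (two samples per group).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
groups = np.repeat(np.arange(5), 2)

# Hold out an entire knockout group in each fold so that validation/test
# samples never share a knockout condition with the training samples.
splits = list(LeaveOneGroupOut().split(X, y, groups))
for train_idx, test_idx in splits:
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
n_folds = len(splits)  # one fold per knockout group
```

With five groups, the splitter produces five folds, each using four groups for training and the fifth for validation or testing.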

Linear data-driven models
Linear data-driven models are constructed using elastic net, ridge regression, and partial least squares as implemented in the Smart Process Analytics (SPA) software [19]. For each data target, the best hyperparameters for the linear models are selected through cross-validation on four of the five groups. The model and hyperparameters with the lowest cross-validation error are selected, and the last group is used to test that model's performance.

Properly trained linear and ANN models display good performance on these datasets
All of the data-driven models constructed in this study for prediction of N-glycosylation are trained with hyperparameters selected by cross-validation. This section compares our models with the models in Ref. [25] (denoted by ANN-2 and ANN-3), which trained ANNs using the same data but selected the hyperparameters based on test set performance, a procedure that leads to overly optimistic model errors and issues with generalization [26] [27]. Thus, it should be noted that the real relative errors for the models from Ref. [25] are higher than what is reported in this work or Ref. [25]. Moreover, the training procedure of Ref. [25] had some restrictions, such as on the learning rate and activation function, that reduced the potential prediction accuracy of its models.
The predictions of the data-driven models are compared with experimental data in the test sets in Fig. 1, and the corresponding percent relative errors (PRE) are summarized in Table 1. The median prediction error for our ANN is 10-fold lower than for the previously published models. The mean and median prediction errors for ANN-2 and ANN-3 are higher than for our ANN, in spite of being selected based on their performance on the test set instead of on a validation set, indicating issues with the ANN training in Ref. [25].
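For concreteness, the percent relative error can be computed as below. This sketch uses one common definition, 100·|predicted − measured|/|measured|; the exact handling of edge cases (e.g., zero-abundance glycans) in the compared studies may differ:

```python
import numpy as np

def percent_relative_error(y_true, y_pred):
    """Elementwise percent relative error: 100 * |pred - true| / |true|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.abs(y_pred - y_true) / np.abs(y_true)

# Hypothetical measured vs. predicted glycan fractions (in %).
pre = percent_relative_error([10.0, 20.0, 50.0], [10.5, 19.0, 50.0])
median_pre = float(np.median(pre))  # summary statistic reported in Table 1
```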
The median and mean prediction errors for the linear models are higher than for any of the ANN models. On the other hand, the linear models have the lowest prediction errors for one of the glycans, which suggests that any nonlinearity in the true relationship for that glycan is low enough that the bias of assuming linearity is small relative to the increase in variance associated with having more degrees of freedom in the training of the ANN models [19].

Table 1: Mean and median percent relative errors (PRE) for different N-glycosylation models. Models "ANN-2" and "ANN-3" are from Ref. [25]. Models "EN/RR/PLS" and "Our ANN" come from this work. "Train mean" is the mean of the training data (Refs. [7] and [18]).

The PRE for each model for each glycan is reported in Table 2. All of the models have low prediction errors for some glycans (e.g., Fc-DAO-GnGnF), but other glycans tend to have higher prediction errors (e.g., Asn38-NaGnGnGnF). These prediction errors are not significantly correlated with the Train Mean prediction error (all R² < 0.10, except for the EN/RR/PLS models, which have R² = 0.51). Our ANNs are the best model 28/32 times and produce lower prediction errors than the training mean for all glycans. In contrast, the predictions for ANN-2 and ANN-3 are worse than the training mean for some glycans.
The distributions of prediction errors for our ANNs are concentrated near low values (Fig. 2). At the other extreme, the distribution of prediction errors for ANN-2 has a wide spread, with very high prediction errors (≥ 50%) for two glycans. The linear models also have a long tail, although not to the same degree. ANN-3 has a lower range of prediction errors, but it is still significantly higher than that of our ANNs (Fig. 2).

Fig. 1: Test set predictions compared with the experimental data of Ref. [7] for the first 8 subplots and Ref. [18] for the last subplot (see Section 2.1). Models "ANN-2" and "ANN-3" came from Ref. [25]. Models "EN/RR/PLS" and "Our ANN" come from this work. "Train mean" is the mean of the training data (Refs. [7] and [18]). For each subplot, the bar orders and colors are the same, and each x-axis position represents a different glycan.

Fig. 2: Distributions of test set prediction errors for the models in Fig. 1. The horizontal bars within the distributions are median PREs.

Table 2: Test set percent relative errors (PREs) on the main glycans for different N-glycosylation models. Models "ANN-2" and "ANN-3" came from Ref. [25]. Models "EN/RR/PLS" and "Our ANN" come from this work. For each glycan, the "EN/RR/PLS" column reports the PRE for the model type with the best cross-validation score. "Train mean" is the PRE when using the mean of the training data (Refs. [7] and [18]). For each glycan, the lowest PRE(s) among the models is in bold. "N/A" marks predictions that are not reported in Ref. [25].

Nested validation highlights the robustness of our models
This work uses the same test set as Ref. [25] to allow a fair comparison between models. To ensure the models constructed in this study are not biased by that choice of test set, nested validation was performed. In each round, a group within the complete dataset is selected as the test set, and the rest is used for training and cross-validation. These rounds repeat until all groups have been used as the test set. In this work, five rounds of nested validation are performed, each following the procedure in Section 2.1.
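The nested-validation rounds described above can be sketched as an outer leave-one-group-out loop wrapped around the inner cross-validated hyperparameter search. The data and the ridge model here are hypothetical stand-ins for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Synthetic stand-in data: 25 samples in five groups.
rng = np.random.default_rng(2)
X = rng.normal(size=(25, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=25)
groups = np.repeat(np.arange(5), 5)

outer_errors = []
# Outer loop: each group serves exactly once as the test set.
for out_tr, out_te in LeaveOneGroupOut().split(X, y, groups):
    X_tr, y_tr, g_tr = X[out_tr], y[out_tr], groups[out_tr]
    # Inner loop: group-wise CV on the remaining four groups selects
    # the hyperparameter (here, a hypothetical ridge penalty).
    best_alpha, best_err = None, np.inf
    for alpha in (0.01, 0.1, 1.0):
        errs = []
        for tr, va in GroupKFold(n_splits=4).split(X_tr, y_tr, g_tr):
            model = Ridge(alpha=alpha).fit(X_tr[tr], y_tr[tr])
            errs.append(mean_squared_error(y_tr[va], model.predict(X_tr[va])))
        if np.mean(errs) < best_err:
            best_alpha, best_err = alpha, float(np.mean(errs))
    # Refit on all inner data and score once on the held-out group.
    final = Ridge(alpha=best_alpha).fit(X_tr, y_tr)
    outer_errors.append(mean_squared_error(y[out_te], final.predict(X[out_te])))
```

Because the test group never participates in hyperparameter selection, the outer errors are unbiased estimates of generalization performance.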
The difference is negligible between the PREs of the models selected with and without nested validation (Fig. 3). The Jensen-Shannon distance between the EN/RR/PLS distributions is 0.127, and that between the ANN distributions is 0.478. As such, the selection of test set did not introduce any significant bias into the models, and the models are robust to changes in the split of data into training and testing.
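The Jensen-Shannon distance between two error distributions can be computed with SciPy; the histograms below are hypothetical stand-ins for the binned PRE distributions compared in Fig. 3:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical binned PRE histograms for two model selections.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.45, 0.35, 0.2])

# jensenshannon returns the JS *distance* (the square root of the JS
# divergence); with base-2 logarithms it lies in [0, 1], and equals 0
# for identical distributions.
d_pq = float(jensenshannon(p, q, base=2))
d_pp = float(jensenshannon(p, p, base=2))
```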

Discussion
This study constructs models from literature data on antibody and fusion protein glycan distribution based on normalized CHO cell B4GALT1-B4GALT4 levels. Two types of models are trained to predict the glycan distribution based on user-input B4GALT levels. Linear models (EN/RR/PLS) provided reasonable predictions of glycan distribution despite their simplicity. Artificial neural network models are constructed that have significantly improved prediction performance compared to published models. The median PRE of the new ANN models is 1.39%, with the error distribution concentrated near low values. This median PRE is 10-fold lower than for the other models. A nested validation study shows that the prediction performance of the linear and ANN models was insensitive to the split of the data into training/validation and test sets.

The models trained in this work are compared to the ANN models of Ref. [25], which did not properly separate the experimental data into training, validation, and test sets. This issue effectively ensures that the models of Ref. [25] are overfit, resulting in overly optimistic estimates of the prediction errors. Ref. [25] also set the ANN structures manually, instead of using an established package such as PyTorch as in this study, and fixed the choice of some hyperparameters, namely, the learning rate and activation function. Moreover, the models in Ref. [25] were not trained using learning rate scheduling. These choices limit the potential prediction performance of the models in Ref. [25] and explain why the PREs of those models are high despite being overfitted to the test data.
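To illustrate the contrast with a fixed learning rate, training with learning rate scheduling in PyTorch can be sketched as below. The architecture, data, and hyperparameters here are illustrative stand-ins, not the ANN models of this work:

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(64, 4)   # stand-in for normalized B4GALT1-B4GALT4 levels
y = torch.randn(64, 1)   # stand-in for a glycan fraction

# A small illustrative regression network.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
# Halve the learning rate whenever the loss stops improving.
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, factor=0.5, patience=5
)

loss_fn = nn.MSELoss()
for epoch in range(50):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    sched.step(loss.item())  # scheduler watches the training loss

final_lr = opt.param_groups[0]["lr"]  # never exceeds the initial rate
```

Scheduling lets training start with large steps and refine with small ones, which a fixed learning rate cannot do.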
The software used in this work is publicly available, allowing other researchers to reproduce this work and reuse or improve the code in future studies. The software provides a simple way to install and run the models to predict the glycan distribution of antibodies and fusion proteins made in CHO cells given normalized B4GALT1-B4GALT4 levels.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.