The Effects of Nonlinear Signal on Expression-Based Prediction Performance

Those building predictive models from transcriptomic data are faced with two conflicting perspectives. The first, based on the inherent high dimensionality of biological systems, supposes that complex non-linear models such as neural networks will better match complex biological systems. The second, imagining that complex systems can still be well separated by simple dividing lines, prefers linear models that are easier to interpret. We compare multi-layer neural networks and logistic regression across multiple prediction tasks on GTEx and Recount3 datasets and find evidence in favor of both possibilities. We verified the presence of non-linear signal for transcriptomic prediction tasks by removing the predictive linear signal with Limma, and showed that the removal ablated the performance of linear methods but not non-linear ones. However, we also found that the presence of non-linear signal was not necessarily sufficient for neural networks to outperform logistic regression. Our results demonstrate that while multi-layer neural networks may be useful for making predictions from gene expression data, including a linear baseline model is critical because, while biological systems are high-dimensional, effective dividing lines for predictive models may not be.


Introduction
Transcriptomic data contains a wealth of information about biology. Gene expression-based models are already being used for subtyping cancer [1], predicting transplant rejections [2], and uncovering biases in public data [3]. In fact, both the capability of machine learning models [4] and the amount of transcriptomic data available [5,6] are increasing rapidly. It makes sense, then, that neural networks are frequently being used to build predictive models from transcriptomic data [7,8,9]. However, there are two conflicting ideas in the literature regarding the utility of non-linear models. One theory draws on prior biological understanding: the paths linking gene expression to phenotypes are complex [10,11], and non-linear models like neural networks should be more capable of learning that complexity. Unlike purely linear models such as logistic regression, non-linear models should learn more sophisticated representations of the relationships between expression and phenotype. Accordingly, many have used non-linear models to learn representations useful for making predictions of phenotypes from gene expression [12,13,14].
The other theory supposes that even high-dimensional complex systems may have linear decision boundaries. This is supported empirically: linear models seem to do as well as or better than non-linear ones in many cases [15]. While papers of this sort are harder to come by - perhaps scientists do not tend to write papers about how their deep learning model was worse than logistic regression - other complex biological problems have also seen linear models prove equivalent to non-linear ones [16,17].
We design experiments to ablate linear signal and find merit in both hypotheses. We construct a system of binary and multi-class classification problems on the GTEx and Recount3 compendia [18,19] that shows linear and non-linear models have similar accuracy on several prediction tasks. However, upon removing the linear signals relating the phenotype to gene expression, we find non-linear signal in the data even when the linear models outperform the non-linear ones. Given the unexpected nature of these findings, we evaluate independent tasks, examine different problem formulations, and verify our models' behavior with simulated data. The models' results are consistent across each setting, and the models themselves are comparable, as they use the same training and hyperparameter optimization processes [20].
In reconciling these two ostensibly conflicting theories, we confirm the importance of implementing and optimizing a linear baseline model before deploying a complex non-linear approach. While non-linear models may outperform simpler models at the limit of infinite data, they do not necessarily do so even when trained on the largest datasets publicly available today.

Linear and non-linear models have similar performance in many tasks
We compared the performance of linear and non-linear models across multiple datasets and tasks (fig. 1A). We examined using TPM-normalized RNA-seq data to predict tissue labels from GTEx [18], tissue labels from Recount3 [19], and metadata-derived sex labels from Flynn et al. [21]. To avoid leakage between cross-validation folds, we placed entire studies into single folds (fig. 1B). We evaluated models on subsampled datasets to determine the extent to which performance was affected by the amount of training data.

Figure 1: Schematic of the model analysis workflow. We evaluate three models on multiple classification problems in three datasets (A). We use study-wise splitting by default and evaluate the effects of sample-wise splitting and pretraining (B).
We used GTEx [18] to determine whether linear and non-linear models performed similarly on a well-characterized dataset with consistent experimental protocols across samples. We first trained our models to differentiate between tissue types on pairs of the five most common tissues in the dataset. Likely due to the clean nature of the data, all models were able to perform perfectly on these binary classification tasks (fig. 2B). Because binary classification was unable to differentiate between models, we evaluated the models on a more challenging task: multiclass classification on all 31 tissues present in the dataset. In the multiclass setting, logistic regression slightly outperformed the five-layer neural network, which in turn slightly outperformed the three-layer net (fig. 2A).
We then evaluated the same approaches in a dataset with very different characteristics: Sequence Read Archive [22] samples from Recount3 [19]. We compared the models' ability to differentiate between pairs of tissues (supp. fig. 5) and found their performance was roughly equivalent. We also evaluated the models' performance on a multiclass classification problem differentiating between the 21 most common tissues in the dataset. As in the GTEx setting, the logistic regression model outperformed the five-layer network, which outperformed the three-layer network (fig. 2C).
To examine whether these results held in a problem domain other than tissue type prediction, we tested performance on metadata-derived sex labels (fig. 2D), a task previously studied by Flynn et al. [21]. We used the same experimental setup as in our other binary prediction tasks to train the models, but rather than using tissue labels we used sex labels from Flynn et al. In this setting we found that while the models all performed similarly, the non-linear models tended to have a slight edge over the linear one.

There is predictive non-linear signal in transcriptomic data
Our results to this point are consistent with a world where the predictive signal present in transcriptomic data is entirely linear. If that were the case, non-linear models like neural networks would fail to give any substantial advantage. However, based on past results we expect there to be relevant non-linear biological signal [23]. To get a clearer idea of what that would look like, we simulated three datasets to better understand model performance for a variety of data-generating processes. We created data with both linear and non-linear signal by generating two types of features: half of the features with a linear decision boundary between the simulated classes and half with a non-linear decision boundary (see Methods for more details). After training to classify the simulated dataset, all models effectively predicted the simulated classes. To determine whether or not there was non-linear signal, we then used Limma [24] to remove the linear signal associated with the endpoint being predicted. After removing the linear signal from the dataset, non-linear models correctly predicted classes, but logistic regression performed no better than random (fig. 3).
To confirm that non-linear signal was key to the performance of non-linear methods, we generated another simulated dataset consisting solely of features with a linear decision boundary between the classes. As before, all models were able to predict the different classes well. However, once the linear signal was removed, all models performed no better than random guessing (fig. 3). That the non-linear models only achieved baseline accuracy also indicated that the signal removal method was not injecting non-linear signal into data where non-linear signal did not exist.
We also trained the models on a dataset where all features were Gaussian noise as a negative control. As expected, the models all performed at baseline accuracy both before and after the signal removal process (fig. 3). This experiment supported our decision to perform signal removal on the training and validation sets separately, as removing the linear signal in the full dataset induced predictive signal (supp. fig. 6).

We next removed linear signal from GTEx and Recount3. We found that the neural nets performed better than the baseline while logistic regression did not (fig. 4B, supp. fig. 7). Similarly, for multiclass problems logistic regression performed poorly, while the non-linear models' performance increased with more data, though it remained worse than before the linear signal was removed (fig. 4A,C). Likewise, the sex label prediction task showed a marked difference between the neural networks and logistic regression: only the neural networks could learn from the data (fig. 4D). In each of the settings, the models performed less well when run on data with signal removed, indicating an increase in the problem's difficulty. Logistic regression, in particular, performed no better than random.

To verify that our results were not an artifact of our decision to assign studies to cross-validation folds rather than samples, we compared the study-wise splitting that we used with an alternate method called sample-wise splitting. Sample-wise splitting (see Methods) is common in machine learning, but can leak information between the training and validation sets when samples are not independently and identically distributed among studies - a common feature of data in biology [25]. We found that sample-wise splitting induced substantial performance inflation (supp. fig. 8). The relative performance of each model stayed the same regardless of the data splitting technique, so the results observed were not dependent on the choice of splitting technique.

Another growing strategy in machine learning, especially on biological data where samples are limited, is training models on a general-purpose dataset and fine-tuning them on a dataset of interest. We examined the performance of models with and without pretraining (supp. fig. 9). We split the Recount3 data into three sets: pretraining, training, and validation (fig. 1B), then trained two identically initialized copies of each model. One was trained solely on the training data, while the other was trained on the pretraining data and fine-tuned on the training data. The pretrained models showed high performance even when trained with small amounts of data from the training set. However, the non-linear models did not have a greater performance gain from pretraining than logistic regression, and the balanced accuracy was similar across models.

Datasets

GTEx
We downloaded the 17,382 TPM-normalized samples of bulk RNA-seq expression data available from version 8 of GTEx. We zero-one standardized the data and retained the 5,000 most variable genes. The tissue labels we used for the GTEx dataset were derived from the 'SMTS' column of the sample metadata file.

Recount3
We downloaded RNA-seq data from the Recount3 compendium [26] during the week of March 14, 2022. Before filtering, the dataset contained 317,258 samples, each containing 63,856 genes. To filter out single-cell data, we removed all samples with greater than 75 percent sparsity. We also removed all samples marked 'scrna-seq' by Recount3's pattern matching method (stored in the metadata as 'recount_pred.pattern.predict.type'). We then converted the data to transcripts per million (TPM) using gene lengths from BioMart [27] and performed standardization to scale each gene's range from zero to one. We kept the 5,000 most variable genes within the dataset.
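The filtering and scaling steps above can be sketched in a few lines of NumPy. The function name, the (samples x genes) matrix orientation, and the exact order of operations are our assumptions for illustration, not the paper's actual pipeline:

```python
import numpy as np

def preprocess(tpm, n_genes=5000, sparsity_cutoff=0.75):
    """Sketch of the filtering/scaling steps: drop high-sparsity samples,
    zero-one scale each gene, keep the most variable genes.

    tpm: (samples, genes) array of TPM values.
    """
    # Drop likely single-cell samples: more than 75% zero entries.
    sparsity = (tpm == 0).mean(axis=1)
    tpm = tpm[sparsity <= sparsity_cutoff]

    # Zero-one standardize: scale each gene's range to [0, 1].
    mins = tpm.min(axis=0)
    ranges = tpm.max(axis=0) - mins
    ranges[ranges == 0] = 1.0  # avoid division by zero for constant genes
    scaled = (tpm - mins) / ranges

    # Retain the n_genes most variable genes.
    top = np.argsort(scaled.var(axis=0))[::-1][:n_genes]
    return scaled[:, top]
```

Note the sparsity filter runs before scaling, since scaling constant-zero genes would be meaningless for dropped single-cell samples.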
We labeled samples with their corresponding tissues using the 'recount_pred.curated.tissue' field in the Recount3 metadata. These labels were based on manual curation by the Recount3 authors. A total of 20,324 samples in the dataset had corresponding tissue labels. Samples were also labeled with their corresponding sex using labels from Flynn et al. [3]. These labels were derived using pattern matching on metadata from the European Nucleotide Archive [28]. A total of 23,525 samples in our dataset had sex labels.

Data simulation
We generated three simulated datasets. The first dataset contained 1,000 samples of 5,000 features corresponding to two classes. Of those features, 2,500 contained linear signal. That is to say, the feature values corresponding to one class were drawn from a standard normal distribution, while the feature values corresponding to the other were drawn from a Gaussian with a mean of 6 and unit variance.
We generated the non-linear features similarly. The values for the non-linear features were drawn from a standard normal distribution for one class, while the second class had values drawn from either a mean six or negative six Gaussian with equal probability. These features are referred to as "non-linear" because two dividing lines are necessary to perfectly classify such data, while a linear classifier can only draw one such line per feature.
The second dataset was similar to the first, but it consisted solely of 2,500 linear features. The final dataset contained only values drawn from a standard normal distribution regardless of class label.
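As a concrete illustration, the three generating processes can be sketched as follows. The function and parameter names are ours, and the actual simulation code may differ in details such as class balance:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_samples=1000, n_features=5000, kind="mixed"):
    """Sketch of the three simulated datasets described above.

    kind: "mixed"  - half linear, half non-linear features (first dataset)
          "linear" - features with linear signal only (second dataset)
          "noise"  - pure standard-normal noise (third dataset)
    """
    y = rng.integers(0, 2, size=n_samples)             # two classes
    X = rng.normal(0.0, 1.0, (n_samples, n_features))  # N(0, 1) baseline
    pos = y == 1
    if kind == "mixed":
        half = n_features // 2
        # Linear features: class 1 shifted to mean 6 (one dividing line).
        X[np.ix_(pos, np.arange(half))] += 6.0
        # Non-linear features: class 1 shifted to +6 or -6 with equal
        # probability (two dividing lines needed per feature).
        shifts = rng.choice([-6.0, 6.0], size=(pos.sum(), n_features - half))
        X[np.ix_(pos, np.arange(half, n_features))] += shifts
    elif kind == "linear":
        X[pos] += 6.0
    # kind == "noise": leave X untouched.
    return X, y
```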

Model architectures
We used three representative models to demonstrate the performance profiles of different model classes. The first was a linear model, ridge logistic regression, selected as a simple linear baseline to compare the non-linear models against. The next model was a three-layer fully-connected neural network with ReLU non-linearities [29] and hidden layers of size 2500 and 1250. This network served as a model of intermediate complexity: it was capable of learning non-linear decision boundaries, but not the more complex representations a deeper model might learn. Finally, we built a five-layer neural network to serve as a (somewhat) deep neural net. This model also used ReLU non-linearities and had hidden layers of sizes 2500, 2500, 2500, and 1250. The five-layer network, while not particularly deep compared to, e.g., state-of-the-art computer vision models, was still in the domain where more complex representations could be learned and vanishing gradients had to be accounted for.
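A minimal PyTorch sketch of the two networks follows. The exact ordering of dropout and batch normalization relative to each activation is our assumption; the layer sizes match those stated above, with output dimension taken from the Recount3 multiclass task:

```python
import torch
import torch.nn as nn

N_GENES, N_CLASSES = 5000, 21  # input/output sizes from the Recount3 setting

# Three-layer net: hidden layers of 2500 and 1250 with ReLU and dropout.
three_layer = nn.Sequential(
    nn.Linear(N_GENES, 2500), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(2500, 1250), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1250, N_CLASSES),
)

# Five-layer net: hidden layers of 2500, 2500, 2500, and 1250; batch
# normalization helps keep gradients from vanishing in the deeper stack.
def _block(d_in, d_out):
    return [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
            nn.ReLU(), nn.Dropout(0.5)]

five_layer = nn.Sequential(
    *_block(N_GENES, 2500), *_block(2500, 2500), *_block(2500, 2500),
    *_block(2500, 1250), nn.Linear(1250, N_CLASSES),
)
```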

Model training
We trained our models via a maximum of 50 epochs of mini-batch stochastic gradient descent in PyTorch [30]. Our models minimized the cross-entropy loss using an Adam [31] optimizer. They also used inverse frequency weighting to avoid giving more weight to more common classes. To regularize the models, we used early stopping and gradient clipping during the training process. The only training differences between the models were that the two neural nets used dropout [32] with a probability of 0.5, and the deeper network used batch normalization [33] to mitigate the vanishing gradient problem.
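A stripped-down version of this training loop might look like the following. It is full-batch rather than mini-batch for brevity, and the learning rate, clipping norm, and patience threshold are illustrative assumptions rather than the tuned values:

```python
import numpy as np
import torch
import torch.nn as nn

def train(model, X, y, X_val, y_val, max_epochs=50, lr=1e-3, patience=5):
    """Sketch: weighted cross-entropy + Adam, gradient clipping, early stopping."""
    # Inverse frequency weighting so common classes don't dominate the loss.
    counts = np.bincount(y)
    weights = torch.tensor(len(y) / (len(counts) * counts), dtype=torch.float)
    loss_fn = nn.CrossEntropyLoss(weight=weights)
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    X = torch.as_tensor(X, dtype=torch.float)
    y = torch.as_tensor(y)
    X_val = torch.as_tensor(X_val, dtype=torch.float)
    y_val = torch.as_tensor(y_val)

    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        model.train()
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        opt.step()

        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(X_val), y_val).item()
        if val_loss < best:
            best, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:  # early stopping on validation loss
                break
    return model
```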
We ensured the results were deterministic by setting the Python, NumPy, and PyTorch random seeds for each run, as well as setting the PyTorch backends to deterministic and disabling the benchmark mode. The learning rate and weight decay hyperparameters for each model were selected via nested cross-validation over the training folds at runtime, and we tracked and recorded our model training progress using Neptune [34]. We also used Limma [24] to remove linear signal associated with tissues in the data. More precisely, we ran the 'removeBatchEffect' function on the training and validation sets separately, using the tissue labels as batch labels.
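For intuition about what this removal does: with no other covariates in the model, removing a batch effect amounts to fitting each gene's mean per label group and subtracting it. A rough NumPy analogue of that idea (not a drop-in replacement for Limma's 'removeBatchEffect', which fits a full linear model and preserves the grand mean) is:

```python
import numpy as np

def remove_linear_signal(X, labels):
    """Center each gene within each label group, removing differences in
    group means - the linear signal a logistic regression would exploit.

    X: (samples, genes) array; labels: per-sample class labels.
    """
    X = X.astype(float).copy()
    for lab in np.unique(labels):
        mask = labels == lab
        X[mask] -= X[mask].mean(axis=0)  # per-gene, within-group centering
    return X
```

After this step the two groups have identical per-gene means, so no single dividing line per feature separates them, while multi-modal (non-linear) structure within a group survives.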

Model Evaluation
In our analyses we used five-fold cross-validation with study-wise data splitting. In a study-wise split, the studies are randomly assigned to cross-validation folds such that all samples in a given study end up in a single fold (fig. 1B).
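scikit-learn's 'GroupKFold' implements this kind of grouped assignment (deterministically rather than randomly, so this illustrates the constraint rather than reproducing our exact procedure):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy example: 8 samples from 4 studies. Every sample from a study lands in
# the same fold, so no study ever spans both training and validation.
studies = np.array(["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"])
X = np.zeros((8, 3))        # placeholder expression matrix
y = np.zeros(8, dtype=int)  # placeholder labels

for train_idx, val_idx in GroupKFold(n_splits=4).split(X, y, groups=studies):
    # Training and validation folds share no studies.
    assert set(studies[train_idx]).isdisjoint(studies[val_idx])
```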

Hardware
Our analyses were performed on an Ubuntu 18.04 machine and the Colorado Summit compute cluster. The desktop CPU used was an AMD Ryzen 7 3800XT processor with 16 cores and access to 64 GB of RAM, and the desktop GPU used was an Nvidia RTX 3090. The Summit cluster used Intel Xeon E5-2680 CPUs and Nvidia Tesla K80 GPUs. From initiating data download to finishing all analyses and generating all figures, the full Snakemake [35] pipeline took around one month to run.

Recount3 tissue prediction
In the Recount3 setting, the multi-tissue classification analyses were trained on the 21 tissues (see Supp. Methods) that had at least ten studies in the dataset. Each model was trained to determine which of the 21 tissues a given expression sample corresponded to.
To address class imbalance, our models' performance was measured using the balanced accuracy across all classes. Unlike raw accuracy, balanced accuracy (the mean across all classes of the per-class recall) is not predominantly determined by performance on the largest class in an imbalanced setting. For example, in a binary classification setting with 9 instances of class A and 1 instance of class B, successfully predicting 8 of the 9 instances of class A and none of class B yields an accuracy of 0.8 but a balanced accuracy of 0.44.
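The worked example above can be checked directly. This small helper mirrors what scikit-learn's 'balanced_accuracy_score' computes:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall."""
    classes = np.unique(y_true)
    return float(np.mean([(y_pred[y_true == c] == c).mean() for c in classes]))

# 9 instances of class A (8 predicted correctly), 1 of class B (missed):
y_true = np.array(list("AAAAAAAAAB"))
y_pred = np.array(list("AAAAAAAABA"))
# accuracy = 8/10 = 0.8; balanced accuracy = (8/9 + 0/1) / 2 ~ 0.44
```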
The binary classification setting was similar to the multi-class one. The five tissues with the most studies (brain, blood, breast, stem cell, and cervix) were compared against each other pairwise. The expression data used in this setting was the set of samples labeled as one of the two tissues being compared.
The data for both settings were split in a stratified manner based on their study.

GTEx classification

The multi-tissue classification analysis for GTEx used all 31 tissues. The multiclass and binary settings were formulated and evaluated in the same way as in the Recount3 data. However, rather than being split study-wise, the cross-validation splits were stratified according to the samples' donors.

Simulated data classification/sex prediction

The sex prediction and simulated data classification tasks were solely binary. Both settings used balanced accuracy, as in the Recount3 and GTEx problems.

Pretraining
When testing the effects of pretraining on the different model types, we split the data into three sets. Approximately forty percent of the data went into the pretraining set, forty percent went into the training set, and twenty percent went into the validation set. The data was split such that each study's samples were in only one of the three sets, to simulate the real-world scenario where a model is trained on publicly available data and then fine-tuned on a dataset of interest.
To ensure the results were comparable, we made two copies of each model with the same weight initialization. The first copy was trained solely on the training data, while the second was trained on the pretraining data, then the training data. Both models were then evaluated on the validation set. This process was repeated four more times with different studies assigned to the pretraining, training, and validation sets.
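In PyTorch, identically initialized copies can be produced by deep-copying one freshly initialized model before any training. The stand-in model and the commented 'train'/'evaluate' calls below are placeholders for illustration, not our actual routines:

```python
import copy
import torch
import torch.nn as nn

base = nn.Linear(5000, 21)        # stand-in for the real networks
scratch = copy.deepcopy(base)     # copy 1: trained on the training set only
pretrained = copy.deepcopy(base)  # copy 2: pretraining set, then training set

# Both copies start from the exact same weights, so any later performance gap
# is attributable to pretraining rather than to a lucky initialization.
assert torch.equal(scratch.weight, pretrained.weight)

# train(pretrained, pretraining_data); train(pretrained, training_data)
# train(scratch, training_data)
# evaluate(scratch, validation_data); evaluate(pretrained, validation_data)
```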

Conclusion
We performed a series of analyses to determine the relative performance of linear and non-linear models across multiple tasks. Consistent with previous papers [15,16], linear and non-linear models performed roughly equivalently in a number of tasks: there are some tasks where linear models perform better, some where non-linear models do, and some where both model types are equivalent.
However, when we removed all linear signal in the data, we found that residual non-linear signal remained. This was true in simulated data as well as GTEx and Recount3 data across several tasks. These results also held in altered problem settings, such as using a pretraining dataset before the training dataset and using sample-wise data splitting instead of study-wise splitting. This consistent presence of non-linear signal demonstrated that the similarity in performance across model types was not due to our problem domains having solely linear signals.
Given that non-linear signal is present in our problem domains, why doesn't that signal allow non-linear models to make better predictions? Perhaps the signal is simply drowned out: recent work has shown that only a fraction of a percent of gene-gene relationships have strong non-linear correlation despite a weak linear one [23].
One limitation of our study is that the results likely do not hold in an infinite data setting. Deep learning models have been shown to solve complex problems in biology and tend to significantly outperform linear models when given enough data. However, we do not yet live in a world in which millions of well-annotated examples are available in many areas of biology. Our results are generated on some of the largest labeled expression datasets in existence (Recount3 and GTEx), but our tens of thousands of samples are far from the millions or billions used in deep learning research.
We are also unable to make claims about all problem domains. There are many potential transcriptomic prediction tasks and many datasets to perform them on. While we show that non-linear signal is not always helpful in tissue or sex prediction, and others have shown the same for various disease prediction tasks, there may be problems where non-linear signal is more important.
Ultimately, our results show that task-relevant non-linear signal in the data, which we confirm is present, does not necessarily lead non-linear models to outperform linear ones. Additionally, our results suggest that scientists making predictions from expression data should always include simple linear models as a baseline to determine whether more complex models are warranted.

Signal removal
While it is possible to remove signal either in the full dataset or in the train and validation sets independently, we chose the latter. We made this decision because we observed potential data leakage when removing signal from the entire dataset in one pass (supp. fig. 6).

Figure 2: Performance of models across four classification tasks. In each panel the loess curve and its 95% confidence interval are plotted based on points from three seeds, ten data subsets, and five folds of study-wise cross-validation (for a total of 150 points per model per panel).

Figure 3: Performance of models in binary classification of simulated data before and after signal removal. Dotted lines indicate expected performance for a naive baseline classifier that predicts the most frequent class.

Figure 4: Performance of models across four classification tasks before and after signal removal.

Figure 5: Comparison of models' binary classification performance on Recount3 data.

Figure 6: Full dataset signal removal in a dataset without signal.

Figure 7: Comparison of models' binary classification performance before and after removing linear signal.