Evaluating generalizability of artificial intelligence models for molecular datasets

Deep learning has made rapid advances in modeling molecular sequencing data. Despite achieving high performance on benchmarks, it remains unclear to what extent deep learning models learn general principles and generalize to previously unseen sequences. Benchmarks traditionally interrogate model generalizability by generating metadata-based (MB) or sequence-similarity-based (SB) train and test splits of input data before assessing model performance. Here, we show that this approach mischaracterizes model generalizability by failing to consider the full spectrum of cross-split overlap, i.e., similarity between train and test splits. We introduce Spectra, a spectral framework for comprehensive model evaluation. For a given model and input data, Spectra plots model performance as a function of decreasing cross-split overlap and reports the area under this curve as a measure of generalizability. We apply Spectra to 18 sequencing datasets with associated phenotypes, ranging from antibiotic resistance in tuberculosis to protein-ligand binding, to evaluate the generalizability of 19 state-of-the-art deep learning models, including large language models, graph neural networks, diffusion models, and convolutional neural networks. We show that SB and MB splits provide an incomplete assessment of model generalizability. With Spectra, we find that, as cross-split overlap decreases, deep learning models consistently exhibit a reduction in performance in a task- and model-dependent manner. Although no model consistently achieved the highest performance across all tasks, we show that deep learning models can generalize to previously unseen sequences on specific tasks. Spectra paves the way toward a better understanding of how foundation models generalize in biology.

The coordinates for a gene in the reference M. tuberculosis genome, H37Rv, may not correspond to those of an M. tuberculosis sample because mutations before and after the gene can shift gene coordinates. To address this and collect nucleic acid sequences for regions of interest, we process the BAM file that results from the pairwise alignment of the input M. tuberculosis sample to H37Rv via pysam, a wrapper of samtools [3]. Pysam is used to find the number of contigs associated with a region. If there is one contig, we search the contig for the first and last 20 reference basepairs of the region of interest. If a match is found, we pull everything in between as the sequence of interest. If we cannot find the first and last 20 basepairs (in case of a mutation in these regions), we repeat the process with basepairs 20-40 from the H37Rv reference. We continue this 20-basepair progression until the sequence is found.

Supplementary Note 3: Approximating AUSPCs.
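A minimal sketch of the sliding-anchor extraction described above, assuming the contig has already been loaded as a plain string (the actual pipeline operates on pysam alignment records; `extract_region` and its arguments are hypothetical names, and both anchors are slid inward together for simplicity):

```python
def extract_region(contig, ref_region, k=20):
    """Pull a sample's version of a reference region out of a contig by
    anchoring on k-bp flanks of the H37Rv reference sequence.

    If an anchor is mutated in the sample, slide the anchors inward in
    k-bp steps (basepairs 20-40, 40-60, ...) until a match is found.
    Returns None if no anchor pair matches.
    """
    for offset in range(0, len(ref_region) // 2, k):
        head = ref_region[offset:offset + k]          # leading anchor
        tail_end = len(ref_region) - offset
        tail = ref_region[tail_end - k:tail_end]      # trailing anchor
        start = contig.find(head)
        if start == -1:
            continue
        end = contig.find(tail, start + k)
        if end == -1:
            continue
        # everything between (and including) the matched anchors
        return contig[start:end + k]
    return None
```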
Characterizing SPCs and calculating AUSPCs requires computational resources and time. We identify two instances where AUSPCs can be approximated. Let f_i(s) describe the test performance of model i on SPECTRA splits generated at a spectral parameter of s (SP = s). Then the AUSPC of model i is equal to the integral of the SPC over all spectral parameters, ∫₀¹ f_i(s) ds.
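When the SPC has been sampled at a finite grid of spectral parameters, this integral can be approximated numerically, for example with the trapezoidal rule (a sketch; `auspc` is a hypothetical helper name):

```python
def auspc(spectral_params, performances):
    """Approximate AUSPC = the integral of f_i(s) over s in [0, 1]
    from SPC values sampled at a finite set of spectral parameters,
    using the trapezoidal rule."""
    pts = sorted(zip(spectral_params, performances))
    total = 0.0
    for (s0, f0), (s1, f1) in zip(pts, pts[1:]):
        total += 0.5 * (f0 + f1) * (s1 - s0)  # area of one trapezoid
    return total
```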
In the first instance, where all samples in MSDs contain a single unique mutation and the spectral property of interest is shared mutations, the spectral performance curve is a horizontal line. This is because, regardless of the spectral parameter, SPECTRA will always generate splits with no cross-split overlap. As a result the AUSPC reduces to ∫₀¹ f_i(s) ds = f_i(s), where s is any spectral parameter. In these special cases the AUSPC is equal to the test performance on any split. This is the case for the 4D664-9INFA-Soh-CCL141-2019, A0A140D2T1-ZIKV-Sourisseau-growth-2019, A4-HUMAN-Seuma-2021, AACC1-PSEAI-Dandage-2018, and A4GRB6-PSEAI-Chen-2020 datasets from ProteinGym [4].
In the second instance, when the objective is to compare model AUSPCs in a benchmark, it is not necessary to characterize the full SPC. The SPC is a monotonically decreasing function, i.e., a function that always decreases or stays constant. As a result, the AUSPC of any model i is bounded by the test performance at splits generated with spectral parameters of 0 and 1: f_i(1) ≤ ∫₀¹ f_i(s) ds ≤ f_i(0). With this in mind, when comparing two models i and j: if f_i(0) > f_j(0) and f_i(1) > f_j(1), then the AUSPC of model i must be greater than that of model j. Likewise, if f_i(0) = f_j(0) and f_i(1) > f_j(1), or f_i(0) > f_j(0) and f_i(1) = f_j(1), then the AUSPC of model i must be greater than that of model j. If f_i(0) > f_j(0) and f_i(1) < f_j(1), or vice versa, nothing can be concluded about the relation between the AUSPCs of models i and j, and a full SPC must be generated.
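The endpoint rule above can be written down directly (a sketch; `compare_auspc` is a hypothetical name, and None marks the inconclusive case where full SPCs must be generated):

```python
def compare_auspc(fi0, fi1, fj0, fj1):
    """Compare two models' AUSPCs from SPC values at spectral
    parameters 0 and 1 only, following the rule in the text: if one
    model matches or beats the other at both endpoints, with at least
    one strict inequality, its AUSPC is taken to be larger.  Mixed
    orderings are inconclusive."""
    if fi0 >= fj0 and fi1 >= fj1 and (fi0 > fj0 or fi1 > fj1):
        return "i"
    if fj0 >= fi0 and fj1 >= fi1 and (fj0 > fi0 or fj1 > fi1):
        return "j"
    return None  # full SPCs required
```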
This approximation decreases the computational burden of using SPECTRA to calculate AUSPCs for model comparison. However, it can only approximate the relationship between the AUSPCs of two models; it cannot be used to uncover unconsidered spectral properties.
This runtime difference is due to the computational efficiency of comparing mutational barcodes versus constructing pairwise alignments for spectral property comparison. This is why constructing SPECTRA splits for GFP, an MSD with 54,024 samples, takes 3 hours, while construction for PDBBind, an SD with 18,778 samples, takes 9 hours.
The size of the SPG also plays a role. Within MSDs, SPECTRA generated splits faster for PZA (2,571 nodes and 936,431 edges in the SPG) than for RIF (3,998 nodes and 2,959,746 edges in the SPG) on a single-core CPU machine. Within SDs, SPECTRA takes longest for PDBBind (18,778 nodes and 1,987,949 edges in the SPG), requiring 9 hours to generate 40 splits on a 20-core CPU machine.
After splits are generated, SPECTRA runtime is dictated by the size of the model being evaluated. Producing the spectral performance curve for the GFP dataset with the largest model considered, ESM2-Finetuned, takes SPECTRA five days for 20 splits across four GPUs. The spectral performance curve for logistic regression on the same dataset takes one hour for 20 splits on a single GPU.

Figure S1: RIF spectral property graph. Spectral property graph for RIF. Nodes are M. tuberculosis samples and edges connect samples that share at least one mutation. There are 3,998 nodes (one for each unique set of mutations) and 2,959,746 edges. The densely connected graph structure reflects how RIF resistance is defined by a small set of key mutations [5].

Figure S2: PZA spectral property graph. Spectral property graph for PZA. Nodes are M. tuberculosis samples and edges connect samples that share at least one mutation. There are 2,571 nodes (one for each unique set of mutations) and 936,431 edges. The sparsely connected graph structure reflects how PZA resistance is defined by less common mutations [6].

Figure S3: Distribution of GFP labels. Distribution of labels for the GFP dataset. The distribution is bimodal: at one end mutations cause little fluorescence in the GFP, while at the other end mutations cause an increase in fluorescence. The distribution was log- and min-max-normalized to assist in model training.

Figure S5: Statistics for SPECTRA splits generated for the INH, PZA, RIF, SARS-CoV-2, and GFP datasets. Number of sample barcodes, unique sample barcodes, mutational barcodes, unique mutational barcodes, and cross-split overlap for SPECTRA splits. All values shown are averaged across three runs of SPECTRA with different random seeds. Across all datasets, cross-split overlap decreases as the spectral parameter increases. The number of samples also decreases, meaning dataset sizes become much smaller at high spectral parameters. In mutation sequence datasets (MSDs) there is a discrepancy between the number of sample barcodes and the number of unique sample barcodes, implying that increased dataset size will not always contribute to the diversity of observed mutations.

Figure S6: Statistics for SPECTRA splits generated for the PEER subcellular localization, ProteinGym amyloid beta protein aggregation, ProteinGym RRM domain activity, TAPE secondary structure, TAPE remote homology, and PDBBind datasets. Cross-split overlap and number of samples for SPECTRA splits for datasets from PEER, TAPE, ProteinGym, and PDBBind. Across all datasets, cross-split overlap and sample number decrease as the spectral parameter increases. However, the shape of this decrease differs from dataset to dataset, reflecting the different structure of the underlying spectral property graph (SPG). In more densely connected SPGs, decreases in the number of samples are more pronounced, since more nodes are deleted at every step of SPECTRA. The amount of cross-split overlap at a spectral parameter of zero also differs from dataset to dataset. This motivates the use of a spectral parameter, as it captures the effect on model performance relative to the amount of cross-split overlap at random splits.
Figure S7: Spectral performance curves for all models on the GFP dataset, with the respective AUSPCs and confidence intervals. Points represent average model test performance on SPECTRA splits generated at the corresponding spectral parameter. Three splits were generated for every value of the spectral parameter, and models were trained three times for every split generated by SPECTRA. Error bars represent the variance in model test performance across the three generated SPECTRA splits. Relatively large error bars indicate the potential presence of unconsidered spectral properties.

Algorithm 1: SubsetSum approximation algorithm to balance generated SPECTRA train-test splits

Input: selected sample barcodes from the spectral property graph S; number of samples per barcode N; proportion of sample barcodes in the test set γ; random number generator r
Output: balanced train and test sets

n_tr = (1 − γ) Σ_{n∈N} n
n_te = γ Σ_{n∈N} n
sum_tr = 0; sum_te = 0
initialize the sets train, test, and discarded as empty sets
for s_i in S do
    p = r()
    if p < 1 − γ then
        if n_{s_i} + sum_tr < n_tr then
            add s_i to train
            sum_tr += n_{s_i}
        else
            add s_i to discarded
        end
    else
        if n_{s_i} + sum_te < n_te then
            add s_i to test
            sum_te += n_{s_i}
        else
            add s_i to discarded
        end
    end
end

Supplementary Note 2: Algorithm to pull sequence of interest from M. tuberculosis alignments.
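A runnable sketch of Algorithm 1 above (the function name `balance_splits` and the dict-based inputs are illustrative, not the paper's implementation):

```python
import random

def balance_splits(barcodes, counts, gamma, seed=0):
    """Randomized subset-sum heuristic that partitions sample barcodes
    into train and test sets whose total sample counts approximate a
    (1 - gamma) / gamma ratio.  Barcodes whose counts would overshoot
    the chosen side's budget are discarded.

    barcodes: iterable of barcode identifiers
    counts:   mapping from barcode to its number of samples
    gamma:    target proportion of samples in the test set
    """
    rng = random.Random(seed)
    total = sum(counts[b] for b in barcodes)
    budget_tr = (1 - gamma) * total   # n_tr
    budget_te = gamma * total         # n_te
    sum_tr = sum_te = 0
    train, test, discarded = [], [], []
    for b in barcodes:
        if rng.random() < 1 - gamma:              # propose the train side
            if counts[b] + sum_tr < budget_tr:
                train.append(b)
                sum_tr += counts[b]
            else:
                discarded.append(b)
        else:                                     # propose the test side
            if counts[b] + sum_te < budget_te:
                test.append(b)
                sum_te += counts[b]
            else:
                discarded.append(b)
    return train, test, discarded
```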

Table S1: Name and start and end coordinates for genomic regions previously found to be relevant to Pyrazinamide (PZA), Rifampicin (RIF), and Isoniazid (INH) antibiotic resistance in M. tuberculosis. Sequences from these regions were pulled as input for all M. tuberculosis antibiotic resistance models used in this study. All coordinates are relative to the M. tuberculosis H37Rv reference strain.