Adaptive, sample-specific parameter selection for more accurate transcript assembly

Motivation: Transcript assemblers are tools that reconstruct expressed transcripts from RNA-seq data. These tools have a large number of tunable parameters, and accurate transcript assembly requires setting them suitably. Because RNA-seq samples are heterogeneous, a single default setting or a small fixed set of parameter candidates can guarantee good performance only on average and is often suboptimal for individual samples. Manually tuning parameters for each sample is extremely time consuming and requires specialized experience. Therefore, developing an automated system that can advise good parameter settings for individual samples is an important problem.
Results: Using Bayesian optimization and contrastive learning, we develop a new automated parameter advising system for transcript assembly that generates sets of sample-specific parameter candidates. Our framework achieves efficient sample-specific parameter advising by learning parameter knowledge from a large representative set of existing RNA-seq samples and transferring that knowledge to unseen samples. We use Scallop and StringTie, two well-known transcript assemblers, to test our framework on two collections of RNA-seq samples. The results show that our new parameter advising system significantly outperforms the previous advising method on each dataset and each transcript assembler. The source code to reproduce the results of this study is available at https://github.com/Kingsford-Group/autoparadvisor.


Introduction
One crucial computational problem in RNA-seq analysis is transcript assembly, the reconstruction of full-length expressed transcripts from the reads in a given RNA-seq sample. This problem is challenging due to the diversity of splice variants and technological biases such as uneven read coverage [Shao and Kingsford, 2017]. Although a large number of algorithms have been proposed for assembling different types of RNA-seq samples [Shao and Kingsford, 2017, Zhang et al., 2021, Trapnell et al., 2010, Guttman et al., 2010, Tomescu et al., 2013, Song et al., 2016, Liu et al., 2016, Pertea et al., 2015, Mao et al., 2020, Kovaka et al., 2019, Tung et al., 2019, Liu et al., 2019, Nip et al., 2020], according to benchmarking studies [Voshall and Moriyama, 2018] and the performance reported in recent assemblers, the accuracy of these transcript assemblers is still low. Hence, new strategies are needed to improve the performance of transcript assembly.
Transcript assemblers usually have a large number of tunable parameters. Previous work [DeBlasio et al., 2020, Kovaka et al., 2019] shows that the choices of these parameters significantly affect performance. A single default setting of these parameters, although carefully tuned by the algorithm designers, is typically not optimal for all samples. Manually tuning parameters for a specific sample often leads to higher performance, but it is time consuming. Therefore, developing an automatic parameter advising system (or parameter advisor), a system that can automatically provide good parameter choices for each sample, is an important computational problem. The solution to this problem provides an orthogonal strategy for improving the accuracy of transcript assembly: rather than developing new transcript assemblers with new algorithmic approaches, we further improve the performance of existing methods by choosing better parameters.
Previous studies developed parameter advising systems by deriving a fixed set of good parameters from data, commonly referred to as an advisor set. However, a key limitation of a fixed advisor set is that it can only guarantee good performance on average [Balcan et al., 2021] and is often suboptimal for many individual samples. This issue becomes more pronounced with smaller advisor sets, which are common in practical applications: a smaller set reduces the computational load but often fails to capture the variability across samples, leading to suboptimal performance. Conversely, larger advisor sets significantly prolong the running time of the parameter advisor, as it must evaluate each parameter vector in the set to identify the most suitable one.
Thus, there is a need for a method that can create a compact, sample-specific advisor set. Such an approach would not only enhance accuracy but also streamline the computational process involved in parameter tuning.
We propose the first adaptive, sample-specific parameter advising system for transcript assembly.
Our algorithm is based on the intuition that if two samples are "similar", then the optimal parameters for one should be effective for the other. This system operates on a foundational principle: given a set of RNA-seq samples with known optimal parameters (referred to as the training set) and a metric to gauge sample similarity, the advisor set for any new sample R is the set of the best parameters of the closest samples in the training set, as determined by our similarity measure.
The construction of a sample-specific advisor set involves two challenging sub-tasks: (1) efficiently determining the best parameters for a finite collection of existing RNA-seq samples, and (2) devising a similarity measure that facilitates the transfer of parameter knowledge to novel samples. Neither of these tasks has been adequately addressed in prior research.
To address the first sub-problem, we develop a Bayesian Optimization framework that can efficiently obtain the best parameters for a large number of RNA-seq samples representative of the universe of all possible RNA-seq samples. For the second sub-problem, we develop a contrastive learning framework that transfers the parameter knowledge to new samples, thereby enabling the identification of new advisor sets. We evaluate our system, called Auto-ParAdvisor, on the transcript assemblers Scallop [Shao and Kingsford, 2017] and StringTie [Pertea et al., 2015], and compare it against previous parameter advising work on two RNA-seq datasets. Our results show that Auto-ParAdvisor significantly outperforms the previous methods on both datasets and for each transcript assembler.
More accurate transcript assembly via Auto-ParAdvisor will benefit downstream RNA-seq analysis such as transcript quantification and differential expression analysis [Shao and Kingsford, 2017].
Moreover, the framework developed here is general and should apply to other bioinformatics tools.
2 Related work

Parameter advising
Parameter advising has been previously discussed in several bioinformatics areas such as multiple sequence alignment [Cedillo et al., 2022, DeBlasio and Kececioglu, 2014, 2015a,b], biological pathway reconstruction [Magnano and Gitter, 2021], as well as transcript assembly [DeBlasio et al., 2020]. The method proposed in DeBlasio et al. [2020] is the first and only work developing a parameter advising system for transcript assembly by learning the advisor set in a data-driven way. Although the method achieves significantly better performance than the default settings of the transcript assemblers Scallop [Shao and Kingsford, 2017] and StringTie [Pertea et al., 2015], it is limited by its reliance on a fixed advisor set for all new samples. This limitation is compounded by the small number of RNA-seq samples used in the training set, which fails to comprehensively represent the diversity of all RNA-seq samples. Consequently, this approach does not work well on new samples markedly different from those in the training set, a common issue with small training datasets.
Our current work addresses these shortcomings by employing a large, representative collection of RNA-seq samples.This allows for the learning of a system capable of generating sample-specific advisor sets, thereby offering a more adaptable and robust solution in the context of transcript assembly.

Bayesian Optimization
Bayesian Optimization (BO) is a sample-efficient framework designed for globally optimizing a black-box function f(x) that is costly to evaluate [Frazier, 2018, Shahriari et al., 2015]. It aims to find the input point that maximizes (or minimizes) the function f by iteratively acquiring input points that are likely to achieve the optimum and evaluating the function at these points. Each iteration of BO has two components: a Bayesian statistical model, usually a Gaussian Process regressor [Schulz et al., 2018], to estimate f(x), and an acquisition function that determines the next input point to sample. We refer the reader to Frazier [2018] for a detailed description of BO. Although BO has been successfully applied in many scenarios such as hyper-parameter tuning [Snoek et al., 2012, Klein et al., 2017], reinforcement learning [Brochu et al., 2010, Marco et al., 2017, Wilson et al., 2014], and chemical design [Griffiths and Hernández-Lobato, 2020, Negoescu et al., 2011], its efficiency in tuning parameters of bioinformatics tools has not been broadly explored.
In our study, we introduce the use of Bayesian Optimization (BO) as an effective method for determining the optimal transcript assembly parameters for a specific RNA-seq sample. We use a transcript assembler paired with a loss function designed to estimate the assembler's performance.
In this scenario, the task of finding the best parameters for a given RNA-seq sample R is equivalent to optimizing the loss function. As we demonstrate below, this loss function is a black-box function, so BO emerges as a highly suitable tool for this optimization due to its proficiency in handling such black-box scenarios.

Contrastive Learning
Contrastive Learning (CL) is a class of learning methods that learns by comparing input samples.
The goal of CL is to learn embeddings of input samples such that those of "similar" samples (e.g. two cat images) are embedded close together, while those of "dissimilar" samples (e.g. one cat image and one dog image) are placed further away [Le-Khac et al., 2020]. A typical CL framework has three components:
• Similarity definition: this determines whether a pair of samples in the training set is similar (a positive pair) or dissimilar (a negative pair). The criteria for similarity vary depending on the specific problem. For instance, in supervised learning, the labels of input samples might be used to define similarity: samples sharing a label are considered similar, and those with different labels are dissimilar [Khosla et al., 2020].
• Feature encoder: a trainable neural network that maps each input sample into an embedding space.
• Contrastive loss function: this function, using the outputs of the feature encoder and the defined similarity, guides the training process. By minimizing this loss, the encoder learns to map samples from a positive pair close together and samples from a negative pair far apart in the embedding space.
CL has been effectively applied in various domains such as self-supervised learning [Chen et al., 2020, Chuang et al., 2020], image classification [Khosla et al., 2020], gaze direction regression [Wang et al., 2022], and metagenomics [Zhang et al., 2022]. In this work, we adapt CL to compare different RNA-seq samples. We define "similarity" in the context of RNA-seq samples (details below) based on the premise that similar samples will likely have similar optimal parameter vectors for transcript assembly. Thus, knowing the optimal parameters for one RNA-seq sample allows us to apply them to a similar sample, bypassing the need for fresh optimization.

Overview of Auto-ParAdvisor
Let R be the universe of all possible RNA-seq samples and T be the universe of all sets of transcripts.
A transcript assembler, denoted as A, is a map A(R; θ) : R × Θ → T which takes a sample R ∈ R and a parameter vector θ ∈ Θ ⊆ ℝ^n in some parameter space Θ as inputs and outputs a set of transcripts.
Given some loss function L : T → ℝ that estimates the performance of transcript assembly, we use f(R; θ) := L ∘ A(R; θ) : R × Θ → ℝ to estimate the performance of assembler A on the input sample R with the parameter vector θ. We define f_R(θ) := L(A(R; θ)) for cleaner notation. The objective in creating a parameter advising system is to develop an oracle O : R → Θ which can accurately determine an optimal or near-optimal parameter vector for each RNA-seq sample, that is:

O(R) ∈ argmin_{θ ∈ Θ} f_R(θ). (1)

Our method, Auto-ParAdvisor, constructs such an oracle by first developing a mapping M.
This mapping takes an RNA-seq sample R as input and outputs an advisor set, a set of good parameter candidates, for this sample. Using this mapping, we then define the oracle O_M as follows:

O_M(R) = argmin_{θ ∈ M(R)} f_R(θ). (2)

Different from traditional hyper-parameter tuning in machine learning, which focuses on optimizing hyper-parameters for a single specific function, M operates as a meta-framework: it minimizes f_R(θ) for each sample R by producing sample-specific advisor sets. A significant advantage of M is that once it is trained, it does not require retraining for new samples, thus offering a more efficient and adaptable approach to hyper-parameter tuning across diverse samples.
As shown in Fig. 1, training Auto-ParAdvisor is composed of three steps: (i) choose a representative sample set, denoted as R_rep, from the sample universe R; (ii) search for the best parameter vector for each representative sample R_rep^i ∈ R_rep via Bayesian Optimization; and (iii) develop a similarity measure S for RNA-seq samples through contrastive learning, using the information from step (ii). The culmination of these steps is the definition of our mapping M. Specifically, M(R) is the union of the optimal parameter vectors of those representative samples identified as the nearest neighbors of a given sample R under the similarity measure S. That is:

M(R) = ⋃_{R_rep^i ∈ NN_p(R, R_rep, S)} {θ̂_i}, (3)

where θ̂_i represents the optimal parameter vector for the representative sample R_rep^i as determined through Bayesian Optimization in step (ii), and NN_p(R, R_rep, S) is the set of p samples within R_rep that are closest to the sample R under the measure S. We now use the following three sections to describe the three steps (i) to (iii) in detail.
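Once the per-sample optimal parameter vectors and the similarity measure are available, building the advisor set reduces to a nearest-neighbor lookup. The following minimal Python sketch illustrates this; the names (`advisor_set`, `best_params`, `similarity`) are illustrative placeholders, not part of the Auto-ParAdvisor codebase.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def advisor_set(
    new_sample: str,
    rep_samples: Sequence[str],
    best_params: Dict[str, Tuple[float, ...]],      # theta_hat_i from the BO step (ii)
    similarity: Callable[[str, str], float],        # learned measure S from step (iii)
    p: int = 5,
) -> List[Tuple[float, ...]]:
    """Union of the BO-optimal parameter vectors of the p nearest neighbors
    of the new sample under the similarity measure S (the mapping M)."""
    # rank representative samples by similarity to the new sample (descending)
    neighbors = sorted(rep_samples,
                       key=lambda r: similarity(new_sample, r),
                       reverse=True)[:p]
    # M(R) is the set of their BO-optimal parameter vectors
    return [best_params[r] for r in neighbors]
```

The oracle then simply runs the assembler with each vector in the returned advisor set and keeps the one with the smallest loss.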

Representative sample selection from R
To construct M, it is essential to use existing RNA-seq samples as the training foundation. The training set needs to effectively encapsulate the critical transcriptional characteristics of R, the entire spectrum of RNA-seq samples, to ensure that M generalizes well to new samples.
One approach could be to incorporate all existing RNA-seq samples into the training set. However, considering the immense volume of available RNA-seq data, numbering in the hundreds of thousands of samples, such an approach would demand prohibitively high computational resources. To strike a balance between comprehensive representation and computational feasibility, we adopt the representative sample set identified in Tung and Kingsford [2021], which includes around 7,000 human RNA-seq samples and effectively represents the broader collection of 196,523 existing human RNA-seq samples. To further enhance computational efficiency, we use apricot [Schreiber et al., 2020] to select a smaller representative subset from these samples. This selection process ensures robust coverage of R while significantly alleviating computational demands. The final subset of representative samples selected through this method is denoted as R_rep.

Search for the best parameter vectors for representative samples
For every representative sample R_rep^i in R_rep, we use BO to search for its best parameter vector by minimizing the function f_{R_rep^i}(θ). BO is particularly suitable for this task for two main reasons: (i) f_{R_rep^i}(θ) is essentially a black-box function, meaning its derivatives are difficult to compute. This complexity arises because the transcript assembler A typically comprises multiple modules that tackle intricate combinatorial problems; as a result, it is not feasible to express f_{R_rep^i}(θ) in closed form. (ii) Each evaluation of f_{R_rep^i}(θ) necessitates a run of the transcript assembler A, which can be a slow process, ranging from minutes to hours. Therefore, even a single evaluation of f_{R_rep^i}(θ) can be quite time-consuming.
However, standard Bayesian Optimization algorithms are not immediately applicable in our context due to their reliance on a predefined search domain. Typically, these algorithms require a user-specified domain that is assumed to encompass the optimal or near-optimal points of the function being optimized [Shahriari et al., 2016]. In the case of RNA-seq samples, however, the optimal parameter vectors vary across samples, and the optimal parameters for different samples may be significantly dissimilar; it is therefore impractical to establish a fixed search domain suitable for all cases. This necessitates a modification of standard Bayesian Optimization techniques to accommodate the varying nature of the optimal parameter vectors.
Inspired by Poloczek et al. [2016] and DeBlasio et al. [2020], we propose a new warm-up-based BO, called CAWarm-BO (Coordinate Ascent Warm-up Bayesian Optimization), that is able to automatically adjust the search domain for each sample. The effectiveness of this new framework is based on the following observations:
• Coordinate ascent is an iterative optimization algorithm that successively optimizes along the coordinate directions of a function to find an optimum [Zangwill, 1969]. Coordinate ascent does not require a predefined search domain, and it rapidly converges towards a local region potentially containing the optimal or near-optimal points. However, within that localized region, coordinate ascent tends to be slower in pinpointing the local optimum.
• Bayesian Optimization is able to efficiently find the best point within a reasonably sized search domain, but the search domain must be predefined.
By synergizing these two methodologies, CAWarm-BO leverages the strengths of both: using coordinate ascent to swiftly approach a promising region without the need for a predefined domain, followed by Bayesian Optimization to efficiently find the optimal point within that localized area.
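The two-phase idea can be sketched in a few lines of Python. Note this is only an illustration of the control flow: the real CAWarm-BO fits a Gaussian Process surrogate and maximizes an acquisition function in the refinement phase, whereas uniform random sampling stands in for that phase here, and all step sizes and budgets below are arbitrary.

```python
import random

def coordinate_ascent(f, x0, step=1.0, shrink=0.5, iters=20):
    """Warm-up phase: greedy per-coordinate moves; no predefined search
    domain is required, and the step size shrinks once no move improves."""
    x, fx = list(x0), f(x0)
    for _ in range(iters):
        improved = False
        for d in range(len(x)):
            for delta in (step, -step):
                cand = list(x)
                cand[d] += delta
                fc = f(cand)
                if fc < fx:            # we minimize the loss f_R(theta)
                    x, fx, improved = cand, fc, True
        if not improved:
            step *= shrink             # refine around the incumbent
    return x, fx

def cawarm_bo_sketch(f, x0, radius=1.0, budget=50, seed=0):
    """CAWarm-BO sketch: coordinate ascent locates a promising region, then a
    model-based search refines inside a box around it (random search here)."""
    rng = random.Random(seed)
    center, best_f = coordinate_ascent(f, x0)
    best_x = list(center)
    for _ in range(budget):
        cand = [c + rng.uniform(-radius, radius) for c in center]
        fc = f(cand)
        if fc < best_f:
            best_x, best_f = cand, fc
    return best_x, best_f
```

The warm-up supplies both a starting incumbent and an implicit search box, which is exactly the piece that standard BO is missing when no domain is given in advance.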
We run CAWarm-BO independently on each sample in the representative set R_rep; its output for sample R_rep^i is a set of pairs X_rep^i = {(θ_g^i, y_g^i)}. In this set, θ_g^i represents a queried parameter vector during the Bayesian Optimization search, and y_g^i = f_{R_rep^i}(θ_g^i) corresponds to the loss value for that parameter vector. The optimal parameter vector θ̂_i for sample R_rep^i is determined as the parameter vector θ_g^i in X_rep^i that yields the smallest y_g^i.

Learning the similarity measure via contrastive learning
Compared to other optimization algorithms, CAWarm-BO is more efficient in optimizing the function f_R(θ) for a given sample R. However, as shown in our experimental results, it is still not efficient enough to be used directly to search for the best parameter vector of every new sample, i.e., it cannot be solely relied upon as the oracle O due to its slow speed. This limitation underscores the need to develop a similarity measure S, enabling the model M to quickly generate the specific advisor set for each new sample through nearest neighbor search, as shown in Eq. 3. We learn S via the contrastive learning (CL) framework. Within this framework, any two RNA-seq samples R_1 and R_2 are embedded into a latent space via a feature encoder Enc_φ with trainable parameters φ. In this latent space, the similarity between two vectors is quantified using the cosine similarity measure:

S(R_1, R_2) = Enc_φ(R_1)^⊤ Enc_φ(R_2) / (‖Enc_φ(R_1)‖ ‖Enc_φ(R_2)‖). (4)

Learning S is now equivalent to training the encoder Enc_φ. Our training framework is inspired by previous work on supervised contrastive learning such as Khosla et al. [2020] and Wang et al. [2022], but has the following unique components:
• We represent each sample in R_rep by a set derived from the MinHash sketch [Broder, 1997, Ondov et al., 2016]. These sets serve as the inputs for the encoder Enc_φ. The details are described in Section 3.4.1.
• We develop a novel method to define the similarities between pairs of representative samples, denoted as Ŝrep(R i rep , R j rep ) for training purposes.The details are described in Section 3.4.2.
• We develop a distinctive data augmentation module for enhancing the robustness and accuracy of training.The details are described in Section 3.4.3.
• We develop a neural network architecture and a contrastive loss function that are particularly tailored to our specific requirements.The details are described in Section 3.4.4.
Algorithm 1 (advisor set inference, sketch): for each representative sample q = 1, …, m, compute the cosine similarity between the embedding z_new of the new sample R_new and the embedding z_q, S_cos^q = z_new^⊤ z_q / (‖z_new‖ ‖z_q‖); sort {S_cos^q}_{q=1}^m and pick the p samples in the representative set that have the highest similarity values with R_new; return these p samples.

RNA-seq sample representation
Given that each RNA-seq sample comprises a vast number of sequencing reads, which are not well suited for direct input into neural networks, we use the MinHash sketch [Broder, 1997, Ondov et al., 2016] to generate representations for these samples. The MinHash sketch process begins by decomposing the reads in an RNA-seq sample into constituent k-mers. Each k-mer is then processed through a hash function h to obtain a hash value. Finally, for a predetermined sketch size s, the MinHash sketch returns the s smallest hashes output by h over all k-mers, along with the count of k-mers associated with these hash values. Each RNA-seq sample R is thus represented by a set input X = {x_i}_{i=1}^s, where each x_i is a two-dimensional vector; the first dimension denotes a hash value, and the second dimension indicates the number of k-mers in R corresponding to this hash.
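The sketch construction just described can be written in a few lines of Python. The k-mer size, hash function, and sketch size below are illustrative choices only, not the ones used by Auto-ParAdvisor (which builds on Mash-style sketches [Ondov et al., 2016]).

```python
import hashlib
from collections import Counter

def minhash_sketch(reads, k=21, s=1000):
    """Represent a sample as X = {(hash value, k-mer count)}: the s smallest
    hash values over all k-mers, each paired with the number of k-mers in the
    sample that map to that hash value."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # deterministic 64-bit hash of the k-mer (illustrative choice)
            h = int.from_bytes(
                hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")
            counts[h] += 1
    smallest = sorted(counts)[:s]          # s smallest distinct hash values
    return [(h, counts[h]) for h in smallest]
```

Each returned pair is one element x_i of the set representation: a hash value together with its k-mer count.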

Similarity between representative samples
Contrary to supervised contrastive learning in classification problems [Khosla et al., 2020], where sample pairs are simply categorized as positive (having the same label) or negative (having different labels), our case more closely aligns with a contrastive regression problem as in Wang et al. [2022], where Ŝ_rep is used to assess the degree of similarity between two RNA-seq samples. Ideally, for a pair of samples R_rep^i and R_rep^j, Ŝ_rep should reflect the similarity between the functions f_{R_rep^i}(θ) and f_{R_rep^j}(θ). However, since these functions lack a closed-form representation, direct computation of their similarity is not feasible. Instead, we leverage information from the Bayesian Optimization (BO) step to approximate these functions and assess their similarity.
Specifically, as described above, the output of CAWarm-BO for each sample R_rep^i is a set of pairs X_rep^i = {(θ_g^i, y_g^i)}, from which the posterior mean µ_rep^i of the surrogate model for f_{R_rep^i} is obtained. We then introduce the following term, called normrank, to quantify the similarity between µ_rep^i and µ_rep^j:

normrank(R_rep^i, R_rep^j) = |{θ_g^i ∈ X_rep^i : µ_rep^i(θ_g^i) < µ_rep^i(θ̂_j)}| / |X_rep^i|, (5)

where θ̂_j is the optimal parameter vector for R_rep^j found by BO. The numerator of Eq. (5) represents the number of parameter vectors queried in X_rep^i that have smaller posterior mean values than the value at the optimal parameter vector for R_rep^j. Intuitively, if f_{R_rep^i} and f_{R_rep^j} are similar, the optimal parameter vector for R_rep^j should also be approximately optimal for R_rep^i, making the numerator smaller. The denominator of Eq. (5) serves as a normalization term, accounting for different sizes of X_rep^i.
normrank functions as a distance-like measure between samples: the more alike the samples are, the lower the normrank value. However, it has two drawbacks: (i) it is not symmetric, meaning normrank(R_rep^i, R_rep^j) may not equal normrank(R_rep^j, R_rep^i), and (ii) it does not constitute a true distance as it fails to satisfy the triangle inequality. To address these issues, we first compute the symmetrized value (normrank(R_rep^i, R_rep^j) + normrank(R_rep^j, R_rep^i)) / 2 to ensure symmetry. We then apply the shortest-path method, as introduced in Lesne et al. [2014], to reconstruct the distances, denoted as D_rep, and to filter out some noise. Based on this, Ŝ_rep is subsequently defined as:

Ŝ_rep(R_rep^i, R_rep^j) = 1 − 2 D_rep(R_rep^i, R_rep^j). (6)

According to Eq. (5), for any pair of samples, the value of D_rep(R_rep^i, R_rep^j) falls within the range [0, 1]. Consequently, by Def. (6), the value of Ŝ_rep(R_rep^i, R_rep^j) lies within the range [−1, 1], aligning with the range typical for cosine similarity.
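The normrank computation of Eq. (5) and its symmetrization can be sketched as follows. Here `mu_i` stands in for the posterior mean function of sample i's surrogate model, and treating parameter vectors as plain scalars is an illustrative simplification.

```python
def normrank(X_i, mu_i, theta_hat_j):
    """Fraction of queried points in X_i whose posterior mean is smaller than
    the posterior mean at sample j's optimal parameter vector (Eq. 5)."""
    ref = mu_i(theta_hat_j)
    better = sum(1 for theta, _ in X_i if mu_i(theta) < ref)
    return better / len(X_i)

def symmetric_normrank(X_i, mu_i, theta_hat_i, X_j, mu_j, theta_hat_j):
    """Symmetrized value (normrank(i, j) + normrank(j, i)) / 2; the
    shortest-path correction of Lesne et al. [2014] is applied on top of
    this in the full pipeline and is omitted here."""
    return 0.5 * (normrank(X_i, mu_i, theta_hat_j)
                  + normrank(X_j, mu_j, theta_hat_i))
```

If sample j's optimum is also near-optimal for sample i, few queried points beat it and normrank is close to 0; if it lands in a poor region of µ_rep^i, normrank approaches 1.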

Data augmentation module
The time-intensive nature of conducting Bayesian Optimization (BO) on each RNA-seq sample to acquire the optimal parameter vector imposes a constraint on the size of the representative sample set; it cannot be excessively large. This limitation implies that the amount of training data available for contrastive learning is not vast; as will be discussed in Section 4, the representative sample set comprises around 1,000 samples. To improve the accuracy of training under these constraints, we develop a novel data augmentation method that substantially expands the size of the training dataset without necessitating additional rounds of BO computation.
The underlying principle of our data augmentation method is as follows: for a new RNA-seq sample R_sub, created by subsampling reads from an original sample R, its corresponding function f_{R_sub} is expected to have a similar optimal parameter vector to that of f_R. This means we can create additional training data by subsampling reads from the original RNA-seq samples and calculate the similarity scores without extra Bayesian Optimization (BO) computations. Specifically, let R_sub1^i and R_sub2^j be subsampled versions of R_rep^i and R_rep^j respectively. The similarity score between these two new subsampled samples is defined as follows:

Ŝ_rep(R_sub1^i, R_sub2^j) = Ŝ_rep(R_rep^i, R_rep^j). (7)

In practice, for each representative sample, we generate seven subsampled variants with sampling ratios 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, and 0.95. This effectively increases the size of our training data by a factor of seven, without any additional BO computations. The effectiveness of this data augmentation module is further discussed in Section 4.4.
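The augmentation scheme amounts to read subsampling plus inheriting the parent samples' similarity score. A minimal sketch, with illustrative variant naming (`"sample@ratio"`) that is not from the Auto-ParAdvisor codebase:

```python
import random

RATIOS = (0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95)

def subsample_reads(reads, ratio, seed=0):
    """Create an augmented sample by drawing a fraction of the reads
    without replacement."""
    rng = random.Random(seed)
    return rng.sample(reads, int(len(reads) * ratio))

def augmented_pairs(sim_rep, sample_ids):
    """Every subsampled variant inherits the similarity score of its parent
    pair, so no extra BO runs are needed for the new training pairs."""
    variants = {s: [s] + [f"{s}@{r}" for r in RATIOS] for s in sample_ids}
    pairs = {}
    for i in sample_ids:
        for j in sample_ids:
            for vi in variants[i]:
                for vj in variants[j]:
                    pairs[(vi, vj)] = sim_rep[(i, j)]
    return pairs
```

With seven variants per sample, each original pair of samples yields 8 × 8 labeled training pairs from a single pair of BO runs.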

Neural network architecture and loss function
As outlined in Section 3.4.1, each RNA-seq sample is represented as a set. Consequently, it is necessary to use a neural network architecture that is capable of processing sets as inputs.
Crucially, this architecture must exhibit permutation invariance, meaning its output should remain unchanged irrespective of the order of elements within the set.
Inspired by Zaheer et al. [2017], the encoder Enc_φ is designed as a composite of two subnetworks, h_ϕ and g_ψ, with trainable parameters ϕ and ψ. Specifically, given an RNA-seq sample R with its set representation X = {x_i}_{i=1}^s, we have:

Enc_φ(R) = g_ψ( Σ_{i=1}^s h_ϕ(x_i) ). (8)

Theorem 2 of Zaheer et al. [2017] proved that the form of Eq. (8) preserves permutation invariance.
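A toy, pure-Python version of Eq. (8) makes the permutation invariance easy to check; the real encoder uses PyTorch fully connected networks, and the linear maps `h` and `g` below are illustrative stand-ins.

```python
def deepsets_encode(X, h, g):
    """Enc(R) = g(sum_i h(x_i)), Eq. (8). Because summation is commutative,
    the output does not depend on the order of the set elements."""
    pooled = None
    for x in X:
        hx = h(x)                                   # per-element features h_phi
        pooled = hx if pooled is None else [a + b for a, b in zip(pooled, hx)]
    return g(pooled)                                # post-pooling map g_psi
```

Any permutation of X yields the same embedding, which is exactly the property required for set-valued inputs such as MinHash sketches.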
Our contrastive loss function is simply defined as the mean squared error:

L(φ) = (1 / N_b^2) Σ_{i=1}^{N_b} Σ_{j=1}^{N_b} ( cos(z_batch^i, z_batch^j) − Ŝ_rep(R_batch^i, R_batch^j) )^2, (9)

where N_b is the batch size and R_batch^i and R_batch^j are the RNA-seq samples in the training data corresponding to embeddings z_batch^i and z_batch^j. Compared to the contrastive loss formulas used in Khosla et al. [2020] and Wang et al. [2022], we find that this simple loss function achieves very good performance in our case, with the advantage that no extra hyper-parameters such as the temperature need to be tuned. We leave the investigation of more sophisticated loss functions to future work.
4 Experimental results

Implementation Details
We test Auto-ParAdvisor on Scallop v0.10.2 (A = Scallop) with 18 tunable parameters and StringTie v2.2.1 (A = StringTie) with 8 tunable parameters; the details of these parameters are listed in the appendix. We use the samples identified by Tung and Kingsford [2021] as representative samples, and then use apricot [Schreiber et al., 2020] to further select 1,500 samples for constructing R_rep. These 1,500 samples are aligned using the aligner STAR [Dobin et al., 2013] to the human reference genome GRCh38. We run CAWarm-BO independently on these samples and remove samples with very low AUC values (the samples for which the best AUC value found by BO is less than 3 × 10^-4). This filtration process retained 1,263 samples for contrastive training for Scallop and 1,235 samples for StringTie.
We use PyTorch [Paszke et al., 2019] to implement the contrastive learning framework. We use simple fully connected neural networks with ReLU activations to represent h_ϕ and g_ψ in Eq. (8); we leave the investigation of optimal neural network architectures to future work. Additional details of this framework are described in Appendix A.2.
• SRA-test, a collection of 1,595 RNA-seq samples from the Sequence Read Archive [Cochrane et al., 2016]. These samples are aligned using the aligner STAR [Dobin et al., 2013]; we remove the samples that also appear in the representative sample set (fewer than 10 samples) and use the remaining samples to test the performance of Auto-ParAdvisor.
Table 9 in the appendix reveals that the best AUC values discovered by CAWarm-BO and CA are quite similar, although CA requires significantly more iterations than CAWarm-BO to reach these optimal values. Additionally, Table 10 indicates that this trend is consistent across transcript assemblers. These findings indicate that CAWarm-BO is a more efficient method than both CA and CASMOPOLITAN for finding the best parameter vector for each sample in the representative sample set.

Performance of CAWarm-BO
BO contains randomness, which can lead to variations in the results of different runs of BO on the same function. To test whether this variance significantly impacts the final performance, we executed 10 distinct runs of CAWarm-BO on samples SRR315323 and SRR307903, and plot the mean and half standard deviation of the best value found as a function of the iteration number.

Accuracy of similarity definition between representative samples
We use the following correlation test to verify the accuracy of the similarity values Ŝ_rep defined in Section 3.4.2 between representative samples. We randomly select 10 samples in the representative set.
For each of these samples, denoted R, we compute f_R(θ̂_i) using the best parameter vectors θ̂_i found by CAWarm-BO for all other samples in the representative set. We plot the relationships between these function values {f_R(θ̂_i)}_{i=1}^m and the similarity values {Ŝ_rep(R, R_rep^i)}_{i=1}^m, shown in Fig. 10(c).

Rationality and performance of the data augmentation module
Figure 3: The correlation between two sets of copula-standardized [Salinas et al., 2020] function values f_{R_rep^i}(θ_g^i) and f_{R_0.6^i}(θ_g^i) (or f_{R_0.8^i}(θ_g^i)). Each dot represents one parameter vector θ_g^i. Sample accession numbers: ERR197922 and SRR1812365. The transcript assembler A used here is Scallop. R_0.6: sample from the subsampling with ratio 0.6. R_0.8: sample from the subsampling with ratio 0.8. r_p: Pearson correlation coefficient.
In this section, we first empirically validate the underlying principle of our data augmentation method: a sample R_sub created by subsampling reads from an original sample R shares a similar optimal parameter vector with R. We randomly select 10 samples from the representative set. For each selected sample R_rep^i, we create a subsampled version R_0.8^i with a sampling ratio of 0.8, meaning 0.8N reads are randomly drawn, where N is the number of reads in R_rep^i. This procedure is repeated with a sampling ratio of 0.6. If the functions f_{R_rep^i} and f_{R_0.8^i} (or f_{R_0.6^i}) are similar, we would expect their values to be highly correlated over the queried parameter vectors θ_g^i. High correlations, as shown in Fig. 3, Table 3, and Table 4, indicate that for both transcript assemblers Scallop and StringTie, when reads in R_rep^i are randomly subsampled with an adequate ratio, the resulting function closely resembles the original.
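The correlation used in this validation can be computed with a plain Pearson coefficient over the paired loss values; the copula standardization of Salinas et al. [2020] applied in the paper is omitted here for brevity.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between paired function values,
    e.g. f_{R_rep^i}(theta_g^i) vs. f_{R_0.8^i}(theta_g^i)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A coefficient close to 1 over the queried parameter vectors indicates that the subsampled loss surface closely tracks the original one.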
To confirm that the set representation of R_0.8^i or R_0.6^i differs from that of R_rep^i (ensuring actual data augmentation rather than duplication), we also compute the Mash distance between the two set representations. The non-zero distances reported in Table 5 and Table 6 confirm that these samples have distinct set representations. Moreover, the values in both tables are considerably lower than the average Mash distances among samples in the representative set (0.0728 for Scallop, and 0.0624 for StringTie). These results show that the set representation of R_0.8^i or R_0.6^i is distinct from, but similar to, that of R_rep^i, validating the efficacy of read subsampling as a reasonable data augmentation approach.

For the ENCODE65 dataset, we compared Auto-ParAdvisor with the parameter advising method for transcript assemblers proposed by DeBlasio et al. [2020], the only existing method of its kind.

Their approach generates a fixed advisor set comprising 30 parameter vectors. To ensure a fair comparison, we also generated an advisor set from Auto-ParAdvisor with a size of 30 (i.e., p = 30).
We use "Base" to refer to the method of DeBlasio et al. [2020] in the following. Additionally, we created advisor sets from Auto-ParAdvisor with sizes of 1 and 5 (i.e., p = 1 and p = 5), comparing them against the default parameter vector. These scenarios are denoted as the top-1, top-5, and top-30 cases. We also compare Auto-ParAdvisor with two simpler methods on the ENCODE65 dataset: • Direct application of CAWarm-BO to each sample in ENCODE65.
• Implementation of a method called Auto-Mash, which follows the same framework as Auto-ParAdvisor but uses the negative Mash distance, instead of the contrastive learning framework, to compute the similarity measure S between samples.
The results of these comparisons are displayed in Figures 11 and 12 in the appendix. Generally, Auto-ParAdvisor outperforms both CAWarm-BO and Auto-Mash. These findings suggest that directly employing CAWarm-BO on each new sample is not the most efficient approach, and that the contrastive learning framework employed in Auto-ParAdvisor learns a better similarity measure S by leveraging information from the BO process. For the SRA-test dataset, Figures 7 and 8 demonstrate Auto-ParAdvisor's robust performance on this extensive dataset. In the top-1 case, Auto-ParAdvisor outperforms the default settings in 81.6% of samples for Scallop and 84.9% for StringTie. In the top-5 case, it surpasses the default settings in 97.0% of samples for Scallop and 97.8% for StringTie. In the top-30 case, it outperforms Base in 94.9% of samples for Scallop and 97.4% for StringTie.
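The percentages above summarize per-sample comparisons of best AUC values. A minimal sketch of this summary statistic (the function name `win_rate` is our own):

```python
def win_rate(advisor_auc, baseline_auc):
    """Fraction of samples on which the advised parameters achieve a
    strictly higher AUC than the baseline (default settings or Base),
    the quantity reported for the top-1/top-5/top-30 cases."""
    wins = sum(a > b for a, b in zip(advisor_auc, baseline_auc))
    return wins / len(baseline_auc)
```

For example, with per-sample best AUCs [0.71, 0.64, 0.80] against baseline AUCs [0.70, 0.66, 0.75], the advisor wins on 2 of 3 samples.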

Running time of Auto-ParAdvisor
The most time-consuming part of constructing Auto-ParAdvisor for a specific transcript assembler is searching for the optimal parameter vectors of the representative samples using BO. In practice, this process requires approximately 15 days using around 300 CPUs to complete BO on all samples in R_rep.

Discussion
We introduce Auto-ParAdvisor, the first parameter advisor that can generate a sample-specific advisor set for each RNA-seq sample, enabling more accurate transcript assembly. To achieve this, we develop a new Bayesian optimization method that can efficiently search for the best parameter vectors of representative samples. We also develop a contrastive learning framework for learning a similarity measure between samples and building the nearest-neighbor model that outputs sample-specific advisor sets. With the transcript assemblers Scallop and StringTie, we show that Auto-ParAdvisor obtains better performance than the previous method on both the ENCODE65 and SRA-test datasets.
The methods developed in this work are not limited to bioinformatics. Combining CA and BO provides a new strategy for optimizing black-box functions with unknown search domains, and the contrastive learning framework provides new insights into a special case of the contrastive regression problem in which neither the inputs nor the outputs lie in standard Euclidean space (in our case, the inputs are RNA-seq samples and the outputs are functions), an interesting setting that has not been extensively studied and for which no previous method applies.
The heuristic constructions used in this framework, such as the definition of similarity between two functions, the form of the contrastive loss function, and the architecture of the encoding network, can be further improved in the future. Additionally, following DeBlasio et al. [2020], we use AUC to estimate the performance of transcript assemblers by comparing predicted transcripts with known ones. However, this is not a perfect measure, as it may unfairly penalize assemblers that generate novel transcripts, which can be of significant interest. Developing better performance metrics is an interesting direction for future work. Finally, although for a specific transcript assembler Auto-ParAdvisor does not need to be retrained for new RNA-seq samples, it must be retrained when the transcript assembler changes. Developing a method that transfers knowledge from one transcript assembler to another to speed up this retraining procedure is therefore also an important direction for future work.

A Other implementation details

A.1 Details of CAWarm-BO
In Scallop, we tune 18 non-negative parameters, and in StringTie, we tune 8 parameters, the same parameter sets used by DeBlasio et al. [2020]. The names of these parameters, their default settings, and their types are listed in Table 7 and Table 8. The initial range for each non-binary parameter is defined as [0, 10 × default] (e.g., for max dp table size in Table 7, the initial range includes all integers within [0, 100000]). CAWarm-BO initially runs the same coordinate ascent algorithm as proposed by DeBlasio et al. [2020], with a maximum of 60 warmup iterations. The step size for each coordinate in the ascent is the initial step size specified by DeBlasio et al. [2020]. The warmup phase of CAWarm-BO ceases if it exhaustively interrogates the entire parameter set without any improvement, or upon reaching the maximum warmup iteration.
Assuming the best parameter vector identified during warmup is θ, where θ[i] denotes the value of its i-th dimension, CAWarm-BO adjusts the search range of the i-th dimension to [0, max(2θ[i], U_i)], where [0, U_i] is the initial range of the i-th dimension. Following this, CAWarm-BO executes CASMOPOLITAN over this new domain. The total number of iterations for CAWarm-BO per sample is set to 200, inclusive of the warmup phase.


A.2 Details of contrastive learning
As described in Section 3.4.1, we use a MinHash sketch to create the set input X = {x_i}_{i=1}^s for each RNA-seq sample R. Each value in each x_i is z-score normalized. We set k = 21 and s = 1000, the default settings in Ondov et al. [2016], and exclude k-mers with counts less than 2.
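As a rough illustration of how such a sketch is built (a simplified version of the MinHash procedure of Ondov et al. [2016]; real implementations use faster hash functions, canonical k-mers, and streaming counts rather than an in-memory dictionary):

```python
import hashlib

def minhash_sketch(reads, k=21, s=1000, min_count=2):
    """Bottom-s MinHash sketch of a sample's k-mer content: hash every
    k-mer that occurs at least min_count times and keep the s smallest
    hash values as the sample's set representation."""
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
    hashes = sorted(
        int(hashlib.sha1(kmer.encode()).hexdigest(), 16)
        for kmer, c in counts.items() if c >= min_count
    )
    return hashes[:s]   # the s smallest hash values form the sketch
```

The resulting sketch is the set {x_i} fed (after z-score normalization) into the encoding network.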

Figure 1: Overview of Auto-ParAdvisor. Step 1: representative sample selection from R; each blue dot represents one RNA-seq sample, and the blue ellipse represents the universe of all samples. Step 2: search for the best parameter vectors for representative samples via Bayesian optimization; each green dot represents the optimal point of a function. Step 3: learning a similarity measure S via contrastive learning; samples with similar function shapes are mapped close together. The dashed red circle represents the region of the nearest neighbors of a new sample R_new under S.
Following DeBlasio et al. [2020], we use the following three datasets in our experiments: • ENCODE10, a collection of 10 RNA-seq samples from the ENCODE database [The ENCODE Project Consortium, 2012]. These samples were already aligned in DeBlasio et al. [2020] to the human reference genome GRCh38. DeBlasio et al. [2020] use this dataset to generate their advisor set, while we use it to test the performance of CAWarm-BO. • ENCODE65, a collection of 65 RNA-seq samples, also from ENCODE. None of them is included in the representative sample set (training data). These samples have pre-existing alignments in ENCODE. We use them to test the performance of Auto-ParAdvisor.

Figure 2: (a) Comparisons of Coordinate Ascent (CA), CASMOPOLITAN, and CAWarm-BO on two RNA-seq samples from ENCODE10 (accession numbers: SRR315323 and SRR534291). The transcript assembler used here is Scallop. For each sample, we do one run for each method and plot the maximum value found by each iteration. Here (×10^4) means the AUC values shown in the figures are the real AUC values times 10^4. (b) The variance of CAWarm-BO across different runs. Here, we perform 10 independent runs on the sample SRR315323 and plot the mean and 1/2 standard deviation of the maximum value found by each iteration. Note that the warm-up part is deterministic.

Fig. 2(b) and Fig. 9(d) demonstrate that the variance among these 10 runs at each iteration is minimal, and the best values obtained from different runs are similar. This consistency indicates that running CAWarm-BO just once per sample is sufficient, and that the comparisons shown in Fig. 2(a) and Fig. 9(a,b,c) are reliable and robust.
(i) The Mash distance, a distance metric based on the set representations of two samples. We calculate this distance between sample R and all others, plotting the function values against the Mash similarity (negative Mash distance values), shown in Fig. 10(a); (ii) the Euclidean distance between the best parameter vector of R (found by CAWarm-BO) and those of all other samples, normalized to the [0, 1] range to negate scale effects. The relationship between the function values and the Euclidean similarity (negative Euclidean distance values) is also plotted, shown in Fig. 10(b). An ideal similarity measure should exhibit a strong negative correlation with these function values.
We then empirically examine the differences in the performance of Auto-ParAdvisor with and without data augmentation. We randomly select 150 samples from R_rep to form a validation set, while the remaining samples are used for training. For each sample R in the validation set, we compute the Pearson correlation between the predicted similarity values and the predefined similarity values Ŝ_rep(R, R′), where R′ ranges over the samples in the training set. We then compare the average of these correlations across all samples in the validation set, contrasting training with the data augmentation module against training without it. Figure 4 shows the average Pearson correlations of all samples in the validation set across training iterations. The inclusion of data augmentation leads to significantly higher Pearson correlation values than training without it, regardless of the transcript assembler employed. These findings indicate the beneficial impact and importance of the data augmentation module for the performance of Auto-ParAdvisor.
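The validation metric described above can be sketched as follows (the array layout and function name are our own illustrative assumptions):

```python
import numpy as np

def avg_validation_correlation(pred_sim, true_sim):
    """pred_sim and true_sim are (n_val, n_train) arrays holding, for each
    validation sample, its predicted and predefined similarities to every
    training sample. Returns the mean per-sample Pearson correlation, the
    quantity plotted in Figure 4."""
    rs = [np.corrcoef(p, t)[0, 1] for p, t in zip(pred_sim, true_sim)]
    return float(np.mean(rs))
```

Higher values mean the learned similarity ranks the training samples more like the predefined similarity Ŝ_rep does.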

Figure 4: The average Pearson correlations of all samples in the validation set across training iterations. The transcript assembler used for the left panel is Scallop; for the right panel, StringTie.

Figure 5: (a,b,c) Comparisons of Auto-ParAdvisor against the default parameter settings or Base in the top-1 (a), top-5 (b), and top-30 (c) cases for ENCODE65. Each point is one RNA-seq sample, positioned by the AUC achieved with the default parameters (or the best AUC obtained with Base) and the ratio of the best AUC achieved by Auto-ParAdvisor relative to the default parameter setting (or Base). A value above 1.0 indicates an improvement. The transcript assembler used here is Scallop.

Figure 6: (a,b,c) Comparisons of Auto-ParAdvisor against the default parameter settings or Base in the top-1 (a), top-5 (b), and top-30 (c) cases for ENCODE65. Each point is one RNA-seq sample, positioned by the AUC achieved with the default parameters (or the best AUC obtained with Base) and the ratio of the best AUC achieved by Auto-ParAdvisor relative to the default parameter setting (or Base). A value above 1.0 indicates an improvement. The transcript assembler used here is StringTie.

Figures 5 and 6 display the ratios of the best AUC values achieved by Auto-ParAdvisor over the default settings and Base in these three scenarios for the Scallop and StringTie transcript assemblers. In the top-1 case, Auto-ParAdvisor outperforms the default settings in 53 out of 65 samples for Scallop and in 46 out of 65 samples for StringTie. In the top-5 case, Auto-ParAdvisor surpasses the default settings in all samples for Scallop and in 59 out of 65 samples for StringTie. In the top-30 case,

Figure 7: (a,b,c) Comparisons of Auto-ParAdvisor against the default parameter settings or Base in the top-1 (a), top-5 (b), and top-30 (c) cases for SRA-test. Each point is one RNA-seq sample, positioned by the AUC achieved with the default parameters (or the best AUC obtained with Base) and the ratio of the best AUC achieved by Auto-ParAdvisor relative to the default parameter setting (or Base). A value above 1.0 indicates an improvement. The transcript assembler used here is Scallop.

Figure 8: (a,b,c) Comparisons of Auto-ParAdvisor against the default parameter settings or Base in the top-1 (a), top-5 (b), and top-30 (c) cases for SRA-test. Each point is one RNA-seq sample, positioned by the AUC achieved with the default parameters (or the best AUC obtained with Base) and the ratio of the best AUC achieved by Auto-ParAdvisor relative to the default parameter setting (or Base). A value above 1.0 indicates an improvement. The transcript assembler used here is StringTie.

We leave to future work the investigation of more efficient methods for finding the best parameter vectors for the training set. Creating set representations of samples in the training set and running the data augmentation module usually takes 1-2 days with simple CPU parallelism. Once all these steps are completed, the training of our contrastive framework is notably swift, taking only about 10 minutes on one GPU. As reported in DeBlasio et al. [2020], for Scallop, running coordinate ascent on a single example took between about 40 hours and over 22 days. Therefore, even with parallelism, creating the advisor set as per DeBlasio et al. [2020] can take around 22 days. Thus, in terms of wall-clock time, the training of our method is on par with the time required to generate the advisor set of DeBlasio et al. [2020] for Scallop, although it necessitates greater parallelism and consequently more CPU resources. After Auto-ParAdvisor has been trained, it eliminates the need for retraining with each new sample, allowing rapid prediction of the advisor set for new samples. The only additional computational steps for our method are creating the set representation of a new sample using the MinHash sketch technique and computing the nearest neighbors of the new sample. Both of these steps are efficient, typically taking less than 10 minutes to complete.

Figure 10: The correlation between (a) the rank of Mash similarity (negative Mash distance values −d_mash), (b) the rank of Euclidean similarity (negative Euclidean distance values −d_euc), or (c) the rank of sorted values of Ŝ_rep, and the function values {f_R(θ_i)}_{i=1}^m, for three RNA-seq samples with accession numbers ERR197922, SRR1812365, and SRR5099917. Each dot represents one representative sample (one θ_i). The transcript assembler A used here is Scallop. Each y-axis is copula-standardized [Salinas et al., 2020] for better visualization. r_s: Spearman's rank correlation coefficient.

Figure 11: (a,b,c) Comparisons of Auto-ParAdvisor against direct application of CAWarm-BO in the top-1 (a), top-5 (b), and top-30 (c) cases for ENCODE65. Each point is one RNA-seq sample, positioned by the best AUC achieved by CAWarm-BO and the ratio of the best AUC achieved by Auto-ParAdvisor relative to CAWarm-BO. A value above 1.0 indicates an improvement. The transcript assembler used for the first row is Scallop, and for the second row StringTie.

Figure 12: (a,b,c) Comparisons of Auto-ParAdvisor against Auto-Mash in the top-1 (a), top-5 (b), and top-30 (c) cases for ENCODE65. Each point is one RNA-seq sample, positioned by the best AUC achieved by Auto-Mash and the ratio of the best AUC achieved by Auto-ParAdvisor relative to Auto-Mash. A value above 1.0 indicates an improvement. The transcript assembler used for the first row is Scallop, and for the second row StringTie.
The pseudocode of the training framework is given in Algorithm 1. During each training iteration, every input sample in the training batch is embedded using the encoding network Enc_φ (line 5 of Alg. 1). The outputs from Enc_φ, together with Ŝ_rep, are used to compute the contrastive loss function.

Algorithm 1 Training framework
1: Input: MinHash sets {X_i}_{i=1}^n with size n, similarities between representative samples Ŝ_rep, initialized encoder Enc_φ, batch size N_b, total epochs N_ep, total number of batches tot_batch = n / N_b
2: for epoch = 1, ..., N_ep do

Algorithm 2 Nearest Neighbor Computation for a New Sample
1: Input: representative samples {R^i_rep}_{i=1}^m and their corresponding input sets {X^i_rep}_{i=1}^m, new RNA-seq sample R_new, trained encoder Enc_φ, constant p
2: MinHash sketch on R_new to obtain the new input set X_new
3: z_new = Enc_φ(X_new)
4: for q ∈ {1, ..., m} do

To generate the advisor set for a new sample, we use the trained encoder Enc_φ to map the new sample to the embedding space (line 3 of Alg. 2) and use the cosine similarity to identify the closest neighbors among the representative samples (line 6 of Alg. 2). The advisor set is then composed of the best parameter vectors from these nearest neighbors (line 8 of Alg. 2).
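The selection step of Algorithm 2 can be sketched as follows (a minimal sketch; the function name `advise`, the array layout, and the use of precomputed embeddings are our own illustrative assumptions):

```python
import numpy as np

def advise(z_new, z_rep, best_params, p):
    """Given the embedding z_new of a new sample and the embeddings z_rep
    (an m x d array) of the m representative samples, pick the p nearest
    neighbors under cosine similarity and return their best parameter
    vectors as the advisor set."""
    z_new = z_new / np.linalg.norm(z_new)
    z_rep = z_rep / np.linalg.norm(z_rep, axis=1, keepdims=True)
    sims = z_rep @ z_new                  # cosine similarities to z_new
    nearest = np.argsort(-sims)[:p]       # indices of the top-p neighbors
    return [best_params[q] for q in nearest]
```

With p = 30 this yields the top-30 advisor set; each returned entry is the best parameter vector found by BO for that representative sample.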

Similar to DeBlasio et al. [2020], we use the area under the curve (AUC) of an assembler's prediction of known transcripts to estimate the performance of these two transcript assemblers (true positive: transcripts in both A's output and the known transcript set; false positive: transcripts in A's output but not in the known transcript set; false negative: transcripts in the known transcript set but not in A's output). The transcript annotation file for GRCh38 can be downloaded from https://ftp.ensembl.org/pub/release-108/gtf/homosapiens. The loss function L is defined as the negative AUC value.
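The TP/FP/FN definitions above can be illustrated with a simplified sketch, where transcript matching is reduced to set membership (the actual evaluation compares transcript structures against the annotation rather than identifiers):

```python
def confusion_counts(predicted, known):
    """Count true positives, false positives, and false negatives for an
    assembler's output against the known transcript set."""
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)   # in both A's output and the known set
    fp = len(predicted - known)   # in A's output but not the known set
    fn = len(known - predicted)   # in the known set but not A's output
    return tp, fp, fn
```

Sweeping a coverage threshold over the predicted transcripts produces a curve of such counts, whose area gives the AUC used as the (negated) loss L.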
Since transcript assemblers including Scallop and StringTie usually have mixed types of parameters (categorical, integer, and continuous variables), we choose CASMOPOLITAN [Wan et al., 2021], a BO-based algorithm for mixed search spaces, as the basic framework of CAWarm-BO. Other detailed settings of CAWarm-BO are described in Appendix A.1. From 196,523 human samples, 7,860 human RNA-seq samples are selected by Tung and Kingsford

Table 1 :
Spearman's rank correlation coefficient between similarity measures and function values for different samples.The transcript assembler used here is Scallop.
Table 1 and Table 2, along with Fig. 10, show that in most tested samples Ŝ_rep demonstrates a stronger negative correlation with the function values than both the Mash and Euclidean similarities, irrespective of the transcript assembler used. These results suggest that leveraging information from the BO steps enables more accurate quantification of sample similarity than directly computing the distance between set representations of RNA-seq samples. Moreover, the normrank function used in Ŝ_rep incorporates broader information from the BO step than just the best parameter vector of each sample, further enhancing the accuracy of similarity quantification.

Table 2 :
Spearman's rank correlation coefficient between similarity measures and function values for different samples.The transcript assembler used here is StringTie.

Table 3 :
Pearson correlation coefficient between the two sets of function values of f_{R^i_0.6} (f_{R^i_0.8}) and f_{R^i_rep} in 10 different samples. The transcript assembler used here is Scallop.

Table 4 :
Pearson correlation coefficient between the two sets of function values of f_{R^i_0.6} (f_{R^i_0.8}) and f_{R^i_rep} in 10 different samples. The transcript assembler used here is StringTie.

Table 5 :
The Mash distance between R^i_0.6 (R^i_0.8) and R^i_rep in 10 different samples. The samples here correspond to those listed in Table 3.

Table 6 :
The Mash distance between R^i_0.6 (R^i_0.8) and R^i_rep in another 10 different samples. Different from the samples shown in Table 5, the samples here correspond to those listed in Table 4, which are randomly selected from the representative set for StringTie.

Table 7 :
Parameter information of Scallop.