RT Journal Article
SR Electronic
T1 Constructing benchmark test sets for biological sequence analysis using independent set algorithms
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2021.09.29.462285
DO 10.1101/2021.09.29.462285
A1 Samantha N. Petti
A1 Sean R. Eddy
YR 2021
UL http://biorxiv.org/content/early/2021/09/30/2021.09.29.462285.abstract
AB Statistical inference and machine learning methods are benchmarked on test data independent of the data used to train the method. Biological sequence families are highly non-independent because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets. Competing Interest Statement: The authors have declared no competing interest.