Abstract
We present NEAR, a method based on representation learning that is designed to rapidly identify good sequence alignment candidates from a large protein database. NEAR’s neural embedding model computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of k-NN search, filtration, and neighbor aggregation. NEAR’s ResNet embedding model is trained using an N-pairs loss function guided by sequence alignments generated by the widely used HMMER3 tool. Benchmarking results reveal improved performance relative to state-of-the-art neural embedding models specifically developed for protein sequences, as well as enhanced speed relative to the alignment-based filtering strategy used in HMMER3’s sensitive alignment pipeline.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
↵* Travis J. Wheeler. twheeler{at}arizona.edu