Abstract
Pre-training has attracted much attention in recent years. Although significant performance improvements have been achieved in many downstream tasks using pre-training, the mechanism by which a pre-training method benefits downstream tasks has not been fully elucidated. In this work, focusing on nucleotide sequences, we decompose a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model into its embedding and encoding modules to illustrate what a pre-trained model learns from the pre-training data. Through dysfunctional analysis at both the data and model levels, we demonstrate that a context-consistent k-mer representation is the primary product that a typical BERT model learns in the embedding layer. Surprisingly, using only the k-mer embedding pre-trained on random data achieves performance comparable to that of the k-mer embedding pre-trained on actual biological sequences. We further compare the learned k-mer embeddings with other commonly used k-mer representations in downstream tasks of sequence-based functional prediction and propose a novel solution to accelerate the pre-training process.
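As a rough illustration of the decomposition described above, the sketch below shows how the k-mer embedding module could be separated from the Transformer encoder of a DNABERT-style checkpoint using the HuggingFace transformers library. This is a minimal sketch, not the authors' code; the checkpoint path and the example 6-mer token are placeholders, and a standard tokenizer is assumed to ship with the checkpoint.

```python
# Minimal sketch (assumed setup, not the authors' implementation):
# separate the k-mer embedding module from the Transformer encoder
# of a DNABERT-style checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "path/to/dnabert-checkpoint"  # placeholder path, not from the paper
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# The embedding module maps k-mer tokens to vectors; the remaining
# Transformer layers provide contextual encoding on top of it.
embedding_layer = model.get_input_embeddings()      # nn.Embedding over the k-mer vocabulary
kmer_embeddings = embedding_layer.weight.detach()   # shape: (vocab_size, hidden_size)

# Example: look up the learned embedding of a single 6-mer token.
token_id = tokenizer.convert_tokens_to_ids("ATGCGT")
vector = kmer_embeddings[token_id]
print(vector.shape)  # e.g. torch.Size([768]) for a BERT-base configuration
```

Such an extracted embedding table can then be plugged into a lightweight downstream predictor in place of the full pre-trained encoder, which is the kind of comparison the abstract refers to.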
Contact: yaozhong@ims.u-tokyo.ac.jp or imoto@hgc.jp
Supplementary information: The source code and relevant data are available at https://github.com/yaozhong/bert_investigation.
Competing Interest Statement
The authors have declared no competing interests.
Footnotes
1 In the following, the terms BERT and DNABERT are used interchangeably, both referring to the BERT model for nucleotide sequences.