Abstract
We have taken several steps towards creating a fast and accurate algorithm for gene prediction in eukaryotic genomes. First, we introduced an automated method for efficient ab initio gene finding, GeneMark-ES, with parameters trained in an iterative unsupervised mode. Next, in GeneMark-ET we proposed a method for integrating unsupervised training with information on intron positions revealed by mapping RNA-Seq short reads.
Now we describe GeneMark-EP, a tool that utilizes another source of external information, a protein database, readily available prior to the start of a sequencing project. A new specialized pipeline, ProtHint, initiates massive protein mapping to the genome and extracts hints to splice sites and translation start and stop sites of potential genes. GeneMark-EP uses the hints to improve estimation of model parameters as well as to adjust coordinates of predicted genes if they disagree with the most reliable hints (the -EP+ mode).
Tests of GeneMark-EP and -EP+ demonstrated improvements in gene prediction accuracy in comparison with GeneMark-ES, while GeneMark-EP+ showed higher accuracy than GeneMark-ET. The most pronounced improvements in gene prediction accuracy were observed in large eukaryotic genomes.
Footnotes
† Joint first authors
We added data comparing the new GeneMark-EP tool, which works with external protein evidence, with GeneMark-ET, a tool that incorporates into gene prediction the hints to intron positions revealed by mapping RNA-Seq short reads to the genome. The data are shown in Tables S3, S7 and S8.