GALBA: Genome Annotation with Miniprot and AUGUSTUS

The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a previously unannotated land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.

BUSCO scores (obtained with poales_odb10) of proteins predicted with GALBA in Coix aquatica.

Table S1: Donor proteins used for annotating each species' genome with GALBA, FunAnnotate, and BRAKER2. Note: The proteins for whales and dolphins were applied to all whale and dolphin species with GALBA. *) Proteins were not used in the combined set but only for single protein set input experiments. s) Proteins were used to demonstrate GALBA accuracy with reference proteins from this species alone (GALBA s in Table 3).

Table S2: Comparison of intron predictions by spliced alignment on D. melanogaster, using a protein set of closely related species (see Table S1) and the OrthoDB v.11 (ODB) Arthropoda partition (proteins from species of the same order excluded). The reference annotation has 47,739 introns. The values in the table, True Positives (TP), False Positives (FP), Sensitivity (Sn), and Specificity (Sp), are shown for the raw miniprot result, all miniprothint predictions, and high-confidence (HC) miniprothint predictions (see Figure 3 for details).
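The relationship between the TP, FP, Sn, and Sp columns of Table S2 can be sketched as follows. This is a minimal illustration, assuming the convention common in gene prediction benchmarks that Sensitivity is TP divided by the number of reference introns and Specificity is TP divided by the number of predicted introns (TP + FP). The function name and the example counts are hypothetical and are not values from Table S2; only the reference intron count is taken from the caption.

```python
def intron_accuracy(true_positives: int, false_positives: int,
                    reference_introns: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) in percent.

    Assumption: Sn = TP / reference introns, Sp = TP / (TP + FP),
    i.e. "specificity" is used in the sense of precision.
    """
    if reference_introns <= 0:
        raise ValueError("reference_introns must be positive")
    predicted = true_positives + false_positives
    sensitivity = 100.0 * true_positives / reference_introns
    # With no predictions at all, specificity is reported as 0 by convention here.
    specificity = 100.0 * true_positives / predicted if predicted else 0.0
    return sensitivity, specificity


# Number of introns in the D. melanogaster reference annotation (from the caption).
REFERENCE_INTRONS = 47_739

# Hypothetical counts for illustration only; NOT results from Table S2.
sn, sp = intron_accuracy(true_positives=40_000, false_positives=10_000,
                         reference_introns=REFERENCE_INTRONS)
print(f"Sn = {sn:.2f}%, Sp = {sp:.2f}%")
```

Under this convention, filtering predictions to a high-confidence subset typically trades sensitivity for specificity, because both TP and FP can only decrease.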

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Table S4: Feature prediction sensitivity in a subset of reliably annotated genes. A gene is regarded as reliable if at least two annotation sets contain this exact gene structure.
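The reliability criterion of Table S4 can be sketched as a simple support count. This is an illustrative sketch only: the representation of a gene structure as a sequence name, a strand, and a tuple of CDS exon coordinates is an assumption, as are the function name and the toy annotation sets; the source does not describe how the subset was actually computed.

```python
from collections import Counter

# Assumed representation of a gene structure:
# (sequence name, strand, tuple of (start, end) coordinates of its CDS exons).
GeneStructure = tuple[str, str, tuple[tuple[int, int], ...]]


def reliable_genes(annotation_sets: list[set[GeneStructure]],
                   min_support: int = 2) -> set[GeneStructure]:
    """Return gene structures found identically in at least `min_support` sets."""
    support: Counter = Counter()
    for genes in annotation_sets:
        # Each annotation set contributes at most one vote per structure.
        support.update(set(genes))
    return {gene for gene, count in support.items() if count >= min_support}


# Toy annotation sets (hypothetical coordinates).
gene_a = ("chr1", "+", ((100, 200), (300, 400)))
gene_b = ("chr1", "-", ((900, 1000),))
gene_a_variant = ("chr1", "+", ((100, 200), (300, 410)))  # differs by one exon end

set_1 = {gene_a, gene_b}
set_2 = {gene_a}
set_3 = {gene_a_variant}

reliable = reliable_genes([set_1, set_2, set_3])
print(reliable)
```

Because the match must be exact, `gene_a_variant` does not support `gene_a`, and `gene_b` is not reliable since only one set contains it.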


Table S8: OMArk results (in percent) in genomes that were de novo annotated with GALBA. The number of conserved HOGs for whales and dolphins is 13,050; the number of conserved HOGs for Coix aquatica is 20,501.

For the results in Table 1, species belonging to the same taxonomic order as each test species were excluded from the databases for each experiment. We used the orthodb-clades pipeline to generate the protein sets. For the results in Table S7, only the target species were excluded, and this ODB partition was subsequently combined with the close relatives input from Table S1 by concatenation prior to execution of BRAKER2.
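The two exclusion schemes described above can be sketched as filters over a protein database. This is not the orthodb-clades pipeline; it is a minimal illustration, assuming each protein record carries its source species and taxonomic order. The record type, function names, and protein identifiers are hypothetical.

```python
from typing import NamedTuple


class ProteinRecord(NamedTuple):
    protein_id: str   # hypothetical identifier
    species: str
    order: str        # taxonomic order of the source species


def exclude_order(records: list[ProteinRecord], target_order: str) -> list[ProteinRecord]:
    """Table 1 scheme: drop all proteins from species of the target's order."""
    return [r for r in records if r.order != target_order]


def exclude_species(records: list[ProteinRecord], target_species: str) -> list[ProteinRecord]:
    """Table S7 scheme: drop only proteins of the target species itself."""
    return [r for r in records if r.species != target_species]


# Toy database partition (hypothetical protein identifiers).
database = [
    ProteinRecord("p1", "Drosophila melanogaster", "Diptera"),
    ProteinRecord("p2", "Anopheles gambiae", "Diptera"),
    ProteinRecord("p3", "Apis mellifera", "Hymenoptera"),
]

order_filtered = exclude_order(database, "Diptera")
species_filtered = exclude_species(database, "Drosophila melanogaster")

# Table S7 scheme: concatenate with a close-relatives set (hypothetical record).
close_relatives = [ProteinRecord("p4", "Drosophila simulans", "Diptera")]
combined = species_filtered + close_relatives

print([r.protein_id for r in order_filtered])
print([r.protein_id for r in combined])
```

The order-level filter is the stricter of the two, since it also removes the close relatives that the species-level filter retains.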

For accuracy evaluation, the gff3 output of FunAnnotate was converted to gtf format using gff3_to_gtf.pl from GeneMark-ET, and accuracy was computed with compute_accuracies.sh from BRAKER:

    gff3_to_gtf.pl funannotate.gff3 funannotate.gtf
    compute_accuracies.sh annot.gtf pseudo.gff3 funannotate.gtf gene trans cds

FunAnnotate sometimes modifies sequence names in its output automatically. We had to revert these sequence name changes to match the reference annotation. This was in particular the case for Medicago truncatula:

    cat funannotate.gtf | perl -pe 's/Mrun/Mtrun/' > funannotate.f.gtf
    mv funannotate.f.gtf funannotate.gtf
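The sequence name correction can also be sketched in Python. Unlike the Perl one-liner, which substitutes the first match anywhere on a line, this variant only rewrites a prefix of the first (sequence name) column, so attribute fields cannot be altered by accident. The function name and the example sequence name `Mrun1` are hypothetical; this is an alternative sketch, not the command used in the study.

```python
from typing import Iterable, Iterator


def restore_seqnames(gtf_lines: Iterable[str], old_prefix: str,
                     new_prefix: str) -> Iterator[str]:
    """Replace `old_prefix` with `new_prefix` in the first GTF column only."""
    for line in gtf_lines:
        # Pass comment lines and blank lines through unchanged.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        seqname, tab, rest = line.partition("\t")
        if seqname.startswith(old_prefix):
            seqname = new_prefix + seqname[len(old_prefix):]
        yield seqname + tab + rest


# Toy GTF content (hypothetical sequence name and gene identifiers).
example = [
    "# produced by a hypothetical annotation run\n",
    'Mrun1\tfunannotate\tCDS\t100\t200\t.\t+\t0\tgene_id "g1"; transcript_id "g1.t1";\n',
]
fixed = list(restore_seqnames(example, old_prefix="Mrun", new_prefix="Mtrun"))
print("".join(fixed), end="")
```

To apply it to a file, the lines of the input can be read from an open file handle and the results written to a new file before replacing the original, mirroring the two-step pattern of the shell commands.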

bioRxiv preprint doi: https://doi.org/10.1101/2023.04.10.536199; this version posted April 10, 2023.