ABSTRACT
The rapid evolutionary emergence of new genes gives rise to “orphan genes” that share no sequence homology to genes in closely related genomes. These genes provide organisms with a reservoir of genetic elements to quickly respond to changing selection pressures. Gene annotation pipelines that combine ab initio machine-learning with sequence homology-based searches are efficient in identifying basal genes with a long evolutionary history. However, their ability to identify orphan genes and other young genes has not been systematically evaluated. Here, we classify the phylostrata of curated Arabidopsis thaliana genes and use these to assess the ability of two of the most prevalent annotation pipelines, MAKER and BRAKER, to predict orphans and other young genes. MAKER predictions are highly dependent on the RNA-Seq evidence, predicting between 11% and 60% of the orphan genes and 95% to 98% of the basal genes in the annotated genome of Arabidopsis. In contrast, BRAKER consistently predicts 33% of orphan genes and 98% of basal genes. A less commonly used method to identify genes is to directly align RNA-Seq data to the genome sequence. We present a Findable, Accessible, Interoperable and Reusable (FAIR) approach, called BIND, that mitigates the under-prediction of orphan genes. BIND combines BRAKER predictions with direct evidence-based inference of transcripts based on RNA-Seq alignments to the genome. BIND increases the number and accuracy of orphan gene predictions, identifying 68% of Araport11-annotated orphan genes and 99% of the conserved genes.
Introduction
Eukaryotic and prokaryotic genomes contain genes (“orphan genes”) whose proteins are recognizable only in a single species; some of these have emerged de novo from the genome, while others have diverged so quickly as to be unrecognizable by homology1–9. Of the billions of extant orphan genes in eukaryotic species10, the precise function of only a few is understood2, 10–12, 12–19, 19–25. However, orphans appear to play a crucial role in adaptation to new biological niches. Forward and reverse genetics studies from vertebrates, annelids, insects, fungi, and plants show that extant proteins encoded by orphans interact with prey or predators, or integrate into established internal metabolic and developmental pathways2, 6, 11, 12, 15–19, 19–23, 26–28. A subset of orphans are retained in the genome and continue to evolve, such that each genome contains a mixture of genes of different ages (phylostrata)1, 3. Phylostratigraphic analyses3 indicate that orphans played a role in the development of new reproductive and neural structures across evolutionary time29–31. Thus, the advent of new genes may provide a critical enabler of speciation. Without the ability to accurately predict orphan genes and other young lineage-specific genes, we are missing a big piece in understanding how life evolves.
Gene annotation is a fundamental step in genome sequencing projects. However, no standard best practice has been established, and protocols are diverse. Prevailing methods combine homology-based analysis, which compares a new genome to previously-identified genes from other species, with ab initio prediction of genes from the genome sequence32–40. Each approach may have an inherent bias against young genes41. Homology-based methods (content sensors) assume that genes have identifiable orthologs in other species. Because orphan genes have no homology to those of other species, homology-based methods are clearly not useful in predicting them. Ab initio-based predictors (signal sensors) assume a pre-defined gene structure for all the genes of an organism42–45. Typically, the predictor is trained using extrinsic RNA-Seq evidence, and possibly additional evidence specific to the species being annotated, to provide a gene model33–35, 37–39. The nucleic acid signatures by which genes are predicted can include untranslated regions (UTR), translation start sites, termination sites, and intron-exon boundaries with standard splice donor and acceptors. Various ab initio algorithms show weaker predictions for short genes and exons39, 46, genes with alternative splicing46, genes with long introns47, highly conserved introns and non-coding sequences48, overlapping/nested genes49 and/or non-canonical splice sites50.
A more straightforward approach to identify genes is to directly align RNA-Seq data to genomes51. Although this approach is important for predicting non-coding RNAs52, 53 and young genes1, 51, 54–56, with some exceptions it has not been widely adopted for genome annotation33, 57. This may be in part because direct inference predicts many genes that have not been annotated55.
It has been suggested that canonical signatures are less well-defined in young genes, and therefore presumed that ab initio methods would be less likely to detect them1, 55, 58–64. However, the ability of ab initio approaches to detect genes of recent origin had not been directly evaluated. If young genes are indeed missed, diversifying the RNA-Seq and protein sequence that is extrinsically-supplied as training data might improve prediction of young genes. Here again, to our knowledge, the influence of extrinsically-supplied training data on the prediction outcome for young genes, or indeed for genes overall, had not been assessed.
Based on the postulates that prediction of younger genes is generally challenging, and that diverse RNA-Seq and protein sequence data used as evidence would increase detection of younger genes, we leverage the well-annotated A. thaliana genome, and RNA-Seq and protein datasets of different sizes and depths, to evaluate the efficacy of five gene prediction pipelines (Fig. 1). From the multitude of variations of ab initio gene prediction pipelines, we focused on two of the most extensively used: MAKER65 and BRAKER66. We also developed in-house pipelines that incorporate evidence-based gene inference directly from RNA-Seq data.
We show that, regardless of method or extrinsic evidence, younger genes were vastly under-detected. Furthermore, we quantify the extent to which diverse RNA-Seq evidence can improve the annotation of genes in a genome. We provide a pipeline for improved gene annotation, BIND (BRAKER-Inferred Directly); BIND combines ab initio prediction with predictions based on directly aligning RNA-Seq data to the genome.
Results
Phylostratigraphy of Araport11-annotated A. thaliana genes
Phylostratigraphic designations of gene age are contingent on the ability, borne out by extensive evolutionary simulations67, of the protein-BLAST algorithm to infer time of gene origin, spanning even temporally-distant homologs. The phylostratigraphic origins of the annotated protein-coding genes of the A. thaliana genome were inferred using the R-based pipeline, phylostratr68, and the most recent community-curated, evidence-based annotation for A. thaliana, Araport1169, 70 (See Supplementary File 1). We similarly used phylostratr to infer the age of the potential proteins that would be translated from the longest ORF in any sequence predicted as a gene by a prediction pipeline, but not annotated in Araport11.
Phylostratigraphic analysis predicted 1,038 orphan genes, as compared to the 1,084 orphans inferred in a determination using the older TAIR10 A. thaliana gene annotations and less recent genome and protein sequence data2. The resultant phylostratal assignments were used to benchmark gene prediction pipelines for their efficacy in predicting genes according to their inferred phylostrata.
RNA-Seq and protein used as extrinsic evidence for gene prediction scenarios
Ab initio protocols are heavily relied upon for genome annotation. RNA-Seq provides training data to guide the performance of ab initio gene prediction programs35, 65, 66, 71. There is little guidance in the literature on how to formulate the training data, with some genome annotation projects using raw reads from only 1-3 RNA-Seq samples as training evidence. Furthermore, in many new genome annotation projects only limited amounts of RNA-Seq data are available, and often these are developed as part of the annotation project.
To assess the effect of extrinsic evidence on prediction methods, we assembled six datasets of varied sizes and compositions (Supplementary Table 1). These are: a “Small”, “Medium” and “Large” dataset, based on size; a “Pooled” dataset consisting of all of these three; an “orphan-rich” dataset comprised of the 38 samples (selected from a total of 1074 samples in SRA) that are most highly represented in Araport11-annotated orphan genes; and, a “ground truth” dataset composed of models of all Arabidopsis genes/proteins as annotated in Araport11 (“Ara11” dataset).
The “Small”, “Medium”, “Large”, and Pooled datasets are real-world extrinsic evidence of similar sizes as might be used in genome annotation. The “orphan-rich” dataset is designed to maximize orphan gene representation. We reasoned that selecting samples which contain a breadth of orphan gene transcripts could be important because many, though by no means all10, orphans are highly expressed under only a very limited set of conditions10, 14, 54, 72–78. The “orphan-rich” RNA-Seq samples were chosen from over a thousand samples in SRA; over 60% of all Araport11-annotated orphans are transcribed in each sample.
Because Arabidopsis is a model species with high quality annotation, we were able to select a diverse RNA-Seq dataset comprised of samples rich in expression of orphan genes by determining the expression of all Araport11-annotated orphan genes across many samples. However, many species do not have a highly-vetted set of annotated orphan genes. For a species that has an available body of RNA-Seq samples, an alternative strategy to identify diverse RNA-Seq samples rich in orphan gene expression would be: 1) run an initial gene annotation; 2) infer the phylostratum of each predicted gene; 3) realign raw reads from all available RNA-Seq samples to the predicted genes; 4) select those RNA-Seq samples with maximal representation of predicted orphan genes.
Raw reads of each dataset were aligned to the genome before being provided to BRAKER (which uses genome-aligned RNA-Seq data as training data) or being used as direct-inference evidence (which predicts genes by aligning RNA-Seq data directly to the genome). Raw reads of each dataset were aligned to the genome and then assembled into transcripts before being provided to MAKER (which uses assembled transcripts of RNA-Seq datasets, along with predicted protein sequence, as training data). Protein sequence data used for MAKER was either generated by prediction from the RNA-Seq training data or downloaded from Phytozome. See Supplementary Table 2 for the full set of prediction scenarios.
Gene predictions by MAKER
MAKER is a standard gene prediction pipeline, and the original impetus of this research was to test MAKER’s ability to predict orphan genes. An initial prediction annotated only 11 percent of the orphan genes present in Araport11. This led us to consider whether providing more diverse datasets as evidence for training MAKER’s machine learning algorithms might improve the predictions.
Seven combinations of transcript and protein evidence were provided to MAKER (Table 1 and Supplementary Table 2). Fewer genes were predicted when no protein data were provided (Supplementary Table 3). Depending on the RNA-Seq plus protein training data, MAKER predicted between 80% and 94% of Araport11-annotated genes. For example, 22,065 of the Araport11-annotated genes were predicted with training data of the Small RNA-Seq dataset plus its predicted proteins, whereas 25,649 of the Araport11-annotated genes were predicted with training data from the Pooled dataset plus its predicted proteins (Figure 2 and Supplementary Table 3).
Regardless of the evidence supplied, MAKER’s ability to predict genes (Fig. 2 and Supplementary Table 3) was greatest for the genes of the oldest phylostratum (Cellular Organisms, PS1) and progressively decreased for younger phylostrata.
The numbers of genes in younger phylostrata that were predicted varied up to five-fold, depending on the evidence supplied. 11% of the Araport11-annotated orphan genes were predicted if MAKER was provided the Small RNA-Seq dataset, versus 41% predicted from the Pooled dataset, and 60% for the orphan-rich dataset (Fig. 2 and Supplementary Table 3).
The best predictions were obtained by providing MAKER with the complete set of Araport11-annotated genes and proteins as training data (i.e., 27,635 genes, 48,338 transcripts, with corresponding protein sequences); MAKER then predicted nearly 96% of all the genes. However, even with this gold-standard gene set, which contains all the Araport11-annotated orphans, only 70% of the Araport11-annotated orphans were predicted by MAKER.
The greater the diversity of the RNA-Seq evidence provided to MAKER, the more genes MAKER predicted that did not match any Araport11 gene annotation: for example, 2,048 for the Small dataset (1,474 of which are orphan ORFs) and 14,147 for the Orphan-rich dataset (11,470 of which are orphan ORFs) (Supplementary Table 4). Precision and overall accuracy were lowest with the Orphan-rich dataset (Supplementary Table 7), because: 1) many predictions did not match any Araport11 gene, and 2) fewer ancient genes were predicted.
Gene predictions by BRAKER
Six RNA-Seq datasets were provided to BRAKER (Table 1 and Supplementary Table 2). BRAKER predicted 93%-94% of all genes annotated by Araport11 (Fig. 2 and Supplementary Tables 3 and 7). Regardless of training data, BRAKER yielded predictions that differed in quantity and accuracy by less than two percent. BRAKER uses RNA-Seq alignments to the genome for unsupervised training of GeneMark-ES/ET (Supplementary Fig. 1); however, BRAKER then selects a subset of the predicted protein-coding genes to train AUGUSTUS, which makes the final gene predictions. This filtering of the extrinsic evidence prior to making the final predictions might explain why BRAKER provides similar predictions even with large variations in the makeup and sizes of the RNA-Seq dataset provided as extrinsic evidence. BRAKER’s filtering of the RNA-Seq evidence enables good predictions from even the Small RNA-Seq dataset, but also means that, when used on its own, it cannot take full advantage of very large, diverse RNA-Seq datasets.
For each prediction scenario, BRAKER’s ability to predict genes was greatest for the genes of the oldest phylostratum (“Cellular organisms”) and progressively decreased for younger phylostrata (Fig. 2 and Supplementary Table 3). As a specific example, when provided the Pooled dataset as training data, BRAKER predicted 98% of genes that traced back to Cellular Organisms (PS1), but only 34% of the orphan genes (Fig. 2). 8,204 of the genes predicted by BRAKER were not annotated by Araport11 (1,498 of these were orphan-ORFs) (Supplementary Tables 4 and 6).
Gene predictions inferred directly from transcript evidence
Prediction by direct alignment of transcriptomic evidence to the genome is only occasionally used for gene annotation40. However, use of cDNAs and ESTs was standard for early annotations. We reasoned that non-canonical gene sequences could be better identified by inferring them directly from RNA-Seq alignment. To use the transcript evidence directly, rather than as training data, RNA-Seq raw read data was aligned to the reference Arabidopsis genome and assembled. After concatenating the transcripts and removing the redundant ones, the ORF(s) in each inferred transcript were determined. The resultant transcripts were processed by Mikado79, which removes redundant transcripts, consolidates information in a relational database, and selects representative transcripts for each locus. Because this approach relies directly on RNA-Seq alignments to the genome, an RNA that is not expressed under the conditions sampled for RNA-Seq evidence will not be detected. Thus, providing a wide variety of RNA-Seq data collected from various developmental stages and tissues, and under various conditions, is particularly important. For example, the Orphan-rich dataset, by far the most diverse RNA-Seq dataset used in this study, predicted nearly 96% of all Araport11-annotated genes, and 69% of Araport11-annotated orphans, while the Pooled dataset predicted about 70% of all Araport11-annotated genes, and 41% of the orphans (Fig. 2 and Supplementary Table 3).
Gene predictions combining ab initio predictions with transcripts inferred directly (BIND or MIND)
Because ab initio methods can identify very high proportions of the conserved genes, while direct inference can identify those genes that are expressed in the RNA-Seq data evidence without regard to canonical structure or homology to genes in other organisms, we tested whether combining both approaches would maximize gene predictions across phylostrata (Fig. 1).
When BRAKER predictions were combined with transcripts INferred Directly (BIND), genes matching Araport11 annotations increased over either method alone. Similar results were observed when MAKER predictions were combined with transcripts INferred Directly (MIND) (Fig. 2 and Supplementary Table 3). BIND and MIND, using the Orphan-rich dataset as input, predicted more Araport11-annotated orphan genes than did either ab initio predictor alone. Basal F1 scores for overall prediction accuracy were comparable for BIND and MIND (75-77%). MIND predicted 11,642 genes that did not match any Araport11 genes, whereas BIND predicted 6,686. Irrespective of RNA-Seq data input, the most accurate representation of all genes, and young genes in particular, was BIND, which combined direct inference with ab initio detection by BRAKER (Supplementary Table 6).
Discussion
Structural annotation of genes in a genome is critical to making genomics data useful to the research community. In experimental studies, standard practice is to align RNA-Seq data only to the annotated genes; thus, any gene that is not annotated will be completely missed. However, genes are annotated by a wide variety of protocols, which are at times sparsely or unclearly documented, with a strong emphasis on ab initio prediction from the genome. Although RNA-Seq is almost universally supplied as training data, a wide variety of sizes and diversities of RNA-Seq data have been used. Few annotation pipelines are benchmarked against “ground-truth” genes and genomes. Many genome annotation projects routinely filter from the list of predicted genes those sequences that are “too short” (eliminating some orphan genes), have only one exon (eliminating many orphan genes), and/or have no known homologs (eliminating all orphan genes). Although these strategies minimize false positives, they also exclude many bona fide genes. Here, we deploy the highly curated, community-based gene annotations from the model species, A. thaliana, as a gold standard70, 80 to explicitly illustrate the challenge of annotating young genes. The Arabidopsis RNA-Seq datasets compiled for this study, along with the phylostratal designations of genes, are available for benchmarking other gene prediction pipelines.
As we demonstrate by their low visibility to ab initio annotation, young genes have a less recognizable sequence signature than their more ancient counterparts, which have undergone hundreds of millions of years of selection. The efficacy of every gene prediction scenario we tested, regardless of the pipeline, or the amount or quality of the evidence supplied, was strongly dependent on the phylostratal origin of the gene. This confirms earlier speculation that young genes have minimal canonical structure and thus may be more difficult to identify by ab initio approaches1–3, 55, 81, 82. The clear trend is that the more ancient a gene, the more likely it is to be predicted.
Our study reveals that the selection of RNA-Seq training data evidence greatly affects ab initio MAKER genome annotation, particularly for young genes. Using RNA-Seq data with high representation of orphan gene expression dramatically improves the efficacy of gene prediction by MAKER, by the Direct Inference pipeline, and by the BIND and MIND combination pipelines. Thus, the accuracy of an initial genome annotation will be maximized by selecting RNA-Seq data from diverse samples, including samples that typically express high levels of young genes, e.g., reproductive tissues and stressed tissues10, 14, 54, 72–78, 83. Additionally, reannotation70, 84, 85 outcomes can be optimized by utilizing evidence from diverse samples in the expanded body of public RNA-Seq datasets.
Each scenario that we tested predicted genes that are not annotated in Araport11. Compelling RNA-Seq evidence from a wide range of phyla, ranging from bacteria, to yeast, to humans, shows that many sequences that are not annotated as genes are highly transcribed and translated1, 55, 56, 62, 86–90. For example, in a study of RNA-Seq data from over 3000 diverse yeast samples, about half of the transcripts are not annotated as genes by the Saccharomyces Genome Database (SGD)55. A large portion of these unannotated transcripts might be “transcriptional noise”51 or, as has been suggested62, fodder for de novo gene evolution. Other sequences may be on a continuum from protogene to gene to pseudogene and/or to oblivion1, 91. No clear criteria have been established to assign the products of this “dark transcriptome” to any particular category.
Essential protein-mediated functions have been experimentally demonstrated for some transcripts that had not been annotated as genes92, 93. The potential that translated transcripts that encode orphan proteins could be useful, although they have only “recently” been subjected to selection pressure, has been reinforced by synthetic biology research revealing that randomly-generated or evolutionarily-selected peptides with no clear homology to other proteins are often able to bind small molecules in vivo94, and can have beneficial consequences when expressed in vivo, inducing developmental, stress-resistance, and longevity phenotypes94, 95. If information on the expression of the dark transcriptome was more easily accessible, potential roles of these transcripts could be better considered.
We argue that best practice would be a universal, evidence-based standard for gene annotation, along with a more inclusive annotation of predicted genes (See Supplementary Table 12 for an example). This could enable researchers to prioritize genes (or candidate genes) for experimental study. A case for routinely incorporating translation evidence has been logically described61, 96. In silico evidence could include: ab initio prediction; homology; direct inference; inferred phylostrata; syntenic analysis; functional annotation. Experimental evidence could include: transcription (RNA-Seq, cDNA); translation (Ribo-Seq, proteomics); genetic screens; targeted experimental studies. This more inclusive approach would provide both annotated genes and expressed transcripts (candidate genes), together with their evidence. Thus, the signal of candidate genes would be retained in the processed data. Ultimately, providing broad, straightforward access to information on predicted genes would facilitate understanding of genome evolution and function.
In summary, we demonstrate that orphan genes and other young genes often elude annotation in new genomes. Our findings support annotating genes by: 1) combining an ab initio/homology-based pipeline with a direct inference approach, e.g., BIND; 2) testing new gene prediction pipelines on well-sequenced genomes with “gold standard” gene annotations for their ability to predict annotated genes; 3) including diverse transcriptomic evidence; and 4) iteratively reannotating the genome, leveraging new RNA-Seq and other evidence.
Methods
Complete methods, all scripts used in this study, and all results files are documented on GitHub (https://github.com/eswlab/orphan-prediction).
Data download
RNA-Seq raw read data were downloaded from NCBI SRA97 as pre-filtered reads using the SRA-toolkit (v2.8.0)97. RNA-Seq used as training data input to ab initio predictors and for direct alignments is specified in Table 1 and Supplementary Table 1. To create the orphan-rich dataset, RNA-Seq samples were queried from the NCBI SRA web interface with the expression: (“Arabidopsis thaliana”[Organism] AND “filetype fastq”[Filter] AND “paired”[Layout] AND “illumina”[Platform] AND “transcriptomic”[Source]). This query returned 1074 individual RNA-Seq samples, which we downloaded and converted to fastq format using SRA-toolkit. The reads from each sample were mapped, and the number of reads mapping to each Araport11-annotated CDS sequence was computed, using Kallisto98. The 1074 RNA-Seq samples were ranked by number of expressed Araport11-annotated orphan genes (TPM > 0 was considered expressed), and the 38 samples with the greatest number of expressed orphans (more than 60% of the annotated orphans were expressed in each sample) were chosen to create an “orphan-rich” dataset. All other RNA-Seq datasets were generated by downloading SRRs directly from NCBI SRA using BioProject IDs.
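The ranking step can be illustrated with a minimal sketch. It assumes the per-sample TPM values have already been collected into a nested dictionary; the function name, sample identifiers and gene identifiers are hypothetical, and the scripts actually used are those in the GitHub repository.

```python
from typing import Dict, List, Set


def rank_samples_by_expressed_orphans(
    tpm: Dict[str, Dict[str, float]],
    orphan_ids: Set[str],
    top_n: int,
) -> List[str]:
    """Return the top_n sample IDs with the most expressed orphan genes.

    tpm maps sample ID -> {gene ID -> TPM}. A gene counts as expressed
    when its TPM is strictly greater than zero.
    """
    counts = {
        sample: sum(1 for gene in orphan_ids if values.get(gene, 0.0) > 0)
        for sample, values in tpm.items()
    }
    # Sort by descending count; break ties by sample ID for reproducibility.
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    return ranked[:top_n]


# Toy data: three hypothetical samples and three hypothetical orphan genes.
toy_tpm = {
    "SRR_A": {"orphan1": 2.5, "orphan2": 0.0, "orphan3": 1.0, "geneX": 9.0},
    "SRR_B": {"orphan1": 0.1, "orphan2": 4.0, "orphan3": 3.3},
    "SRR_C": {"orphan1": 0.0, "geneX": 12.0},
}
toy_orphans = {"orphan1", "orphan2", "orphan3"}
selected = rank_samples_by_expressed_orphans(toy_tpm, toy_orphans, top_n=2)
print(selected)
```

In the study the equivalent of `top_n` was 38, applied to 1074 samples.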
The A. thaliana reference genome version (Araport11), GFF3 files, transcripts, and protein sequences for A. thaliana (Araport11 version) were downloaded from TAIR70. Protein sequences used as evidence in MAKER65 were: 1) generated by assembling RNA-Seq reads using Trinity (v2.6.6)99, followed by open reading frame (ORF) prediction and translation using TransDecoder (v3.0.1)99 or, 2) downloaded from Phytozome100 as predicted protein sequences for Arabidopsis thaliana and nine other species (Glycine max, Populus trichocarpa, Arabidopsis lyrata, Conradina grandiflora, Setaria italica, Oryza sativa, Physcomitrella patens, Chlamydomonas reinhardtii, and Brassica rapa).
BRAKER
RNA-Seq raw reads were mapped to the indexed Arabidopsis genome using the HiSat2 aligner (v2.1.0)101 (default settings). The resultant SAM files were sorted and converted to BAM format. BAM files from each set of RNA-Seq samples were combined using SAMTools (v1.9)102 and provided as training data for the BRAKER (v2.1.2) pipeline66, along with the unmasked A. thaliana genome (Araport11). BRAKER is an automated pipeline to predict genes using GeneMark-ET (v4.33)38 and AUGUSTUS (v3.3.1)103. Briefly, GeneMark-ET generates initial gene predictions, which are used for iterative training of AUGUSTUS. GeneMark-ET-predicted genes are filtered and provided for AUGUSTUS training, followed by AUGUSTUS prediction, integrating the RNA-Seq information, to generate the final gene predictions (Fig. 1). In additional analyses, we compared results using the translated protein sequence from the transcripts as evidence. BRAKER results with and without protein evidence were virtually identical; the BRAKER User Guide notes that protein evidence is useful only if there is minimal RNA-Seq evidence.
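The order of operations in this section can be sketched as a list of command lines. The snippet below only builds the commands and does not run them; all file names, the species label and the thread count are hypothetical placeholders, and the exact commands and options used in this study are those documented in the GitHub repository.

```python
from typing import Dict, List


def braker_workflow_commands(
    genome: str,
    index_prefix: str,
    samples: List[Dict[str, str]],
    species: str,
    threads: int = 8,
) -> List[List[str]]:
    """Build (but do not run) the index, align, sort, merge and BRAKER commands."""
    commands = [["hisat2-build", genome, index_prefix]]
    sorted_bams = []
    for sample in samples:
        sam = sample["name"] + ".sam"
        bam = sample["name"] + ".sorted.bam"
        # Align paired-end reads to the indexed genome.
        commands.append(
            ["hisat2", "-p", str(threads), "-x", index_prefix,
             "-1", sample["r1"], "-2", sample["r2"], "-S", sam]
        )
        # Sort the alignments and write them in BAM format.
        commands.append(["samtools", "sort", "-@", str(threads), "-o", bam, sam])
        sorted_bams.append(bam)
    # Combine the per-sample BAM files into one evidence file.
    commands.append(["samtools", "merge", "merged.bam", *sorted_bams])
    # Run BRAKER on the genome with the merged RNA-Seq alignments.
    commands.append(
        ["braker.pl", f"--genome={genome}", "--bam=merged.bam",
         f"--species={species}", f"--cores={threads}"]
    )
    return commands


toy_samples = [
    {"name": "sampleA", "r1": "sampleA_1.fastq", "r2": "sampleA_2.fastq"},
    {"name": "sampleB", "r1": "sampleB_1.fastq", "r2": "sampleB_2.fastq"},
]
commands = braker_workflow_commands(
    "genome.fa", "genome_index", toy_samples, "toy_species"
)
for command in commands:
    print(" ".join(command))
```

Each inner list could be passed to `subprocess.run` on a system where the tools are installed.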
MAKER
To implement the MAKER (v2.31.10)65 pipeline, RNA-Seq data was assembled into a transcriptome using Trinity (v2.6.6)99; this CDS evidence was supplied along with the unmasked A. thaliana genome (Araport11). Depending on the case study, either CDS only; CDS and translated proteins; or CDS and Phytozome proteins were supplied (Supplementary Tables 1 and 2).
MAKER was run in two successive rounds with default settings (Fig. 1). In round one, transcriptome and protein data were aligned to the reference genome to generate crude gene predictions. These crude predictions were then used for training the SNAP (release 2006-07-28)37 and AUGUSTUS (v3.2.1) ab initio gene predictors with default options. In round two, the Hidden Markov Models (HMM) for the ab initio gene predictors, along with the self-trained HMM of GeneMark-ES (v4.32), were used within MAKER to predict genes. To finalize its predictions, MAKER ranked the comprehensive sets of genes from all three predictors using Annotation Edit Distance (AED)104; the highest-ranking genes were retained for the final set of predictions. MAKER’s default output includes key metadata about gene predictions (evidence scores supporting each prediction, name of the component(s) within MAKER that generated the prediction).
Computer allocations and ease of use for MAKER and BRAKER
BRAKER had much simpler prerequisites than MAKER, which had a large overhead, starting from installation through finalizing the predictions. BRAKER only required a user to execute a single command-line operation (once the raw reads were mapped to the genome), whereas MAKER required longer preparation time (assembling transcripts, collecting evidence, training gene predictors), required running multiple iterations, and had moderate to heavy scripting requirements. Run-time for BRAKER (359 CPU hours) was roughly four-fold shorter than for MAKER (1536 CPU hours, not including the Trinity/TransDecoder run-time), if each was provided with similar training data. BRAKER is more efficient in terms of disk usage and disk I/O (Input/Output), and unlike MAKER, it does not create millions of intermediary files. BRAKER is easier to install and use, and has fewer computational prerequisites than MAKER. New approaches are being developed to facilitate MAKER use105.
Direct inference-based
Raw RNA-Seq reads were assembled using four genome-guided transcriptome assemblers, viz. Class2, Trinity, StringTie and CuffLinks99, 106–108. The BAM file generated by mapping reads to the Araport11-annotated indexed genome using HiSat2 (v2.1.0)101 was provided as input to the assemblers. The resultant assembled transcripts were used to predict ORFs using TransDecoder99, and for BLAST109 alignments against the Araport11-annotated genome (XML format). We selected those ORFs over 150 nt. (Other user requirements might include: transcript length, number of exons, exon length, intron length, start/end codon, expression value, or presence of UTRs.) These data files, along with splicing junctions identified from the alignments using Portcullis110, were provided as input to Mikado79. Except for the number of threads (set to 28), Mikado pick was run with default parameters.
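The ORF length filter can be illustrated with a deliberately simplified sketch. It is not a reimplementation of TransDecoder: it scans only the forward strand, considers only complete ORFs (ATG through an in-frame stop codon), and counts the stop codon in the ORF length. The function names and toy sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_ORF_NT = 150  # ORFs must be longer than this to be kept


def longest_forward_orf(seq: str) -> str:
    """Return the longest complete ORF (ATG..stop) in the three forward frames."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best


def passes_length_filter(seq: str, min_nt: int = MIN_ORF_NT) -> bool:
    """True when the longest forward ORF is longer than min_nt nucleotides."""
    return len(longest_forward_orf(seq)) > min_nt


# Toy transcripts: one with a 156-nt ORF, one with a 36-nt ORF.
long_transcript = "CC" + "ATG" + "GCA" * 50 + "TAA" + "GG"
short_transcript = "ATG" + "GCA" * 10 + "TGA"
print(passes_length_filter(long_transcript), passes_length_filter(short_transcript))
```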
Combined gene predictions by MIND and BIND
The total predictions from Direct Inference were integrated with BRAKER predictions (BIND) or with MAKER predictions (MIND), using Evidence Modeler (EVM)111 to merge predictions. Briefly, the weights file was prepared by setting the direct evidence to 2 and ab initio predictions to 1. EVM was run as per the guidelines, with default options, using GFF3 files, genome file and weights file as input. The merged predictions were finalized in GFF3 format.
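As an illustration of the weighting described here, the sketch below writes a three-column, tab-delimited EVM weights file (evidence class, source label, weight). Two points are assumptions rather than details given in the text: the direct-inference models are assigned to the OTHER_PREDICTION class, and the source labels "mikado" and "AUGUSTUS" are placeholders that would need to match the second column of the corresponding GFF3 files.

```python
from typing import List


def evm_weights_lines(
    direct_source: str,
    ab_initio_source: str,
    direct_weight: int = 2,
    ab_initio_weight: int = 1,
) -> List[str]:
    """Return the lines of a tab-delimited EVM weights file."""
    return [
        # Assumption: direct-inference gene models are treated as OTHER_PREDICTION.
        f"OTHER_PREDICTION\t{direct_source}\t{direct_weight}",
        f"ABINITIO_PREDICTION\t{ab_initio_source}\t{ab_initio_weight}",
    ]


weights = evm_weights_lines("mikado", "AUGUSTUS")
with open("weights.txt", "w") as handle:
    handle.write("\n".join(weights) + "\n")
print("\n".join(weights))
```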
Accuracy calculations
Accuracy calculations were based on sensitivity (Sn), precision (Pr), and the combined accuracy score, the F1 score112. Sn is a measure of the percent of genes in the gold-standard Araport11 annotations that are identified by a given set of predictions (true positives), and was calculated as [Sn = true positives/(all genes in Araport11) * 100]. Precision (Pr) is a measure of how specific the predictions are, and was calculated as [Pr = true positives/(all predictions) * 100]. The F1 score combines the sensitivity and precision as a measure of accuracy [F1 = 2(Sn * Pr)/(Sn + Pr)].
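A minimal sketch of the three scores follows, taking sensitivity as true positives over all reference (Araport11) genes and precision as true positives over all predictions. The function name and the counts in the example are illustrative only and are not results from this study.

```python
from typing import Dict


def accuracy_scores(
    true_positives: int, n_reference: int, n_predicted: int
) -> Dict[str, float]:
    """Sensitivity, precision and F1, all expressed as percentages."""
    sn = 100.0 * true_positives / n_reference
    pr = 100.0 * true_positives / n_predicted
    # F1 is the harmonic mean of sensitivity and precision.
    f1 = 2.0 * sn * pr / (sn + pr) if (sn + pr) > 0 else 0.0
    return {"Sn": sn, "Pr": pr, "F1": f1}


# Toy counts: 90 reference genes recovered, 100 reference genes, 120 predictions.
scores = accuracy_scores(true_positives=90, n_reference=100, n_predicted=120)
print({name: round(value, 2) for name, value in scores.items()})
```

With these toy counts, Sn is 90%, Pr is 75%, and F1 is about 81.8%.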
Phylostratigraphy
Phylostratigraphic analysis was implemented using phylostratr software (v0.2.0)68. The phylostratal origin of each gene was inferred from the homology of its predicted protein to proteins in clades of increasing depth (age), using the default settings of phylostratr68. The focal species was set as “3702” (NCBI taxon ID for A. thaliana). A database of protein sequences was created by phylostratr from 138 species, including custom species selected based on evolutionary clade and genome/proteome quality. A customized, more comprehensive set of proteins was provided for several species, and protein sequences of the remaining species were downloaded from UniProt.
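The underlying assignment rule can be sketched in a few lines: a gene is placed in the oldest clade in which a homolog is detected, and a gene with no detected homolog outside the focal species is an orphan. This is a simplified illustration, not the phylostratr implementation, which works from BLAST results and its own scoring; the clade list is a shortened toy lineage and the function name is hypothetical.

```python
from typing import Dict, List

# Toy lineage for the focal species, ordered from oldest to youngest stratum.
STRATA_OLD_TO_YOUNG: List[str] = [
    "cellular organisms",
    "Eukaryota",
    "Viridiplantae",
    "Brassicaceae",
    "Arabidopsis",
    "Arabidopsis thaliana",
]


def assign_phylostratum(
    hits_by_stratum: Dict[str, bool],
    strata: List[str] = STRATA_OLD_TO_YOUNG,
) -> str:
    """Return the oldest stratum with a detected homolog.

    hits_by_stratum maps a stratum name to True when a homolog was found in
    a species that diverged from the focal lineage at that stratum. A gene
    with no hits falls into the youngest (species-level) stratum.
    """
    for stratum in strata:
        if hits_by_stratum.get(stratum, False):
            return stratum
    return strata[-1]


ancient_gene = assign_phylostratum({"cellular organisms": True, "Brassicaceae": True})
family_gene = assign_phylostratum({"Brassicaceae": True})
orphan_gene = assign_phylostratum({})
print(ancient_gene, family_gene, orphan_gene, sep=" | ")
```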
Comparing prediction scenarios
The results of each gene prediction pipeline scenario were compared to the Araport11 annotations using Mikado Compare79; gene structure annotation predictions were provided as GFF3 files. Similarity statistics are reported for each gene locus individually. Compiled reports of similarity statistics and of the phylostratigraphy of predictions and reference were generated using custom bash scripts, and final plots were generated using custom R scripts (see https://github.com/eswlab/orphan-prediction for code).
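The per-phylostratum summary can be sketched as follows, assuming one table that maps each reference gene to its phylostratum and one set of reference genes matched by a prediction. The function name and gene identifiers are hypothetical; the compiled reports in this study were produced with the bash and R scripts in the repository.

```python
from collections import Counter
from typing import Dict, Set


def percent_predicted_by_phylostratum(
    gene_to_stratum: Dict[str, str], matched_genes: Set[str]
) -> Dict[str, float]:
    """Percent of reference genes in each phylostratum matched by a prediction."""
    totals = Counter(gene_to_stratum.values())
    found = Counter(
        stratum for gene, stratum in gene_to_stratum.items() if gene in matched_genes
    )
    return {stratum: 100.0 * found[stratum] / totals[stratum] for stratum in totals}


# Toy reference: four ancient genes and two orphans.
toy_reference = {
    "g1": "PS1", "g2": "PS1", "g3": "PS1", "g4": "PS1",
    "o1": "orphan", "o2": "orphan",
}
toy_matched = {"g1", "g2", "g3", "g4", "o1"}
summary = percent_predicted_by_phylostratum(toy_reference, toy_matched)
print(summary)
```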
Author contributions statement
AS designed and performed the experiments, analyzed the data, and wrote the paper.
ZA performed parts of the phylostratigraphic analysis and wrote related parts of the paper. ESW conceived the study, contributed annotations and analysis tools, and wrote the paper.
Acknowledgements
We thank Andrew J. Severin, Genome Informatics Facility, Iowa State University, for assistance with data analysis and helpful discussions. We are grateful to the Center for Metabolic Biology and the High Performance Computing Facility, and Research IT at Iowa State University for providing support and cyberinfrastructure facilities that made this study possible. Many thanks to Urminder Singh, Jing Li and Priyanka Bhandary who provided valuable insights and participated in discussions. This study is based upon work supported by the National Science Foundation under Grant No. IOS 1546858.