Accurate identification of de novo genes in plant genomes using machine learning algorithms

De novo gene birth—the evolution of new protein-coding genes from ancestrally noncoding DNA—is increasingly appreciated as an important source of genetic and phenotypic innovation. However, the frequency and overall biological impact of de novo genes (DNGs) remain controversial. Large-scale surveys of de novo genes are critical to address these issues, but DNG identification represents a persistent challenge due to the lack of standardized protocols and the laborious analyses traditionally used to detect DNGs. Here, we introduced novel approaches to identify de novo genes that rely on Machine Learning Algorithms (MLAs) and are poised to accelerate DNG discovery. We specifically investigated if MLAs developed in one species using known DNGs can accurately predict de novo genes in other genomes. To maximize the applicability of these methods across species, we relied only on DNA and protein sequence features that can be easily obtained from annotation data. Using hundreds of published and newly annotated DNGs from three angiosperms, we trained and tested both Decision Tree (DT) and Neural Network (NN) algorithms. Both MLAs showed high levels of accuracy and recall within-genomes. Although accuracies and recall decreased in cross-species analyses, they remained elevated between evolutionary closely related species. A few training features, including presence of a protein domain and coding probability, held most of the MLAs predictive power. In analyses of all genes from a genome, recall was still elevated. Although false positive rates were relatively high, MLA screenings of whole-genome datasets reduced by up to ten-fold the number of genes to be examined by conventional comparative genomic methods. Thus, a combination of MLAs and traditional strategies can significantly accelerate the accurate discovery of DNG and the annotation in angiosperm genomes.


47
Novel genes are major drivers of adaptation and evolutionary innovation. A large body of work 48 suggests that new protein-coding genes form at a high rate and represent major contributors to 49 genome evolution and phenotypic variation in plants (1)(2)(3)(4)(5)(6)(7)(8). As a result of this rapid evolutionary 50 gene turnover, all species contain hundreds to thousands young, lineage-specific protein-coding 51 sequences that are absent from other taxa (9-12). Both small-scale and whole-genome 52 duplications are responsible for the formation of many novel genes, which tend to share the 53 biological function of their parent genes (1,3,4). Occasionally, plants acquired novel genes 54 throughout non-duplicative mechanisms, including horizontal DNA transfer from other species 55 (13,14) and via the "exaptation" of the coding regions of transposable elements (15)(16)(17). 56 Although fundamentally different in nature, all these processes generate new protein-coding 57 sequences from pre-existing genes. 58 59 Evolutionary genomic analyses have revealed that new genes can also emerge from ancestrally 60 noncoding DNA sequences wherein an open reading frame (ORF) and new regulatory sequences 61 originate through 'enabler substitutions', typically nucleotide substitutions (18)(19)(20). Long 62 considered unlikely evolutionary accidents (Jacob 1977), these so-called de novo genes 63 (hereafter, DNGs) encode for evolutionary novel protein sequences that share no homology with 64 genes from other species. Some DNGs have been shown to bear significant phenotypic impacts, 65 from regulating the mating pathway in budding yeast (21,22), to producing antifreeze proteins in 66 some fish (23, 24) and affecting human health (25)(26)(27). 67 68 115 Machine learning algorithms (MLAs) offer a set of approaches with the potential to mitigate the 116 limitations in DNG detection outlined above. MLAs have proven to be powerful methods for 117 learning models from non-linear datasets in a variety of domains, including many applications in 118 genomics and bioinformatics. In fact, MLAs have been developed to annotate genomic features 119 that include protein-coding genes, RNA genes, enhancers, transcription start sites, splice sites 120 and gene function (59)(60)(61). To the best of our knowledge, MLAs applied to the detection of de 121 novo genes have not been developed yet. Interestingly, a few studies have explored the ability of 122 MLAs to identify the broader category of orphan genes in plants (62,63). A variety of MLAs 123 trained and tested on 1,784 orphan genes from A. thaliana showed up to 92% accuracy and 95% 124 sensitivity (62), whereas hybrid deep-learning algorithms applied to 1,544 moso bamboo 125 (Phyllostachys edulis) orphan genes reached up to 87% balanced accuracy (63). These results 126 suggest that MLAs can achieve high levels of accuracy for orphan gene prediction. However, as 127 discussed above, DNGs represent only a fraction of orphan genes, and are evolutionarily distinct 128 from other types of genes that lack homology across species. Thus, the ability of MLAs to 129 accurately predict DNGs remains untested. genes can be confidently identified as de novo genes for training. Some MLAs can be sensitive 135 to this asymmetry, outputting models with low information content that appear accurate only 136 because the majority of genes are ancestral genes (hereafter, AGs), while being very inaccurate 137 for DNGs. There are various methods that have been proposed for handling this significant class 138 imbalance (64). We show that sub-sampling of AGs as negative examples can be effective in 139 training accurate models for DNGs. However, although the accuracy on balanced testing sets 140 can be high, even a moderate false positive rate (FPR) can lead to many false positive predictions 141 when the classifier is applied to tens-of-thousands of genes in a whole genome. We show that 142 the FPR can be reduced somewhat by For instance, DNGs tend to be shorter and with fewer exons than most AGs (8,11,35,36). 152 Proteins encoded by DNGs typically have fewer annotated domains and possess more 153 structurally disordered regions than ancestral proteins in some eukaryotes (18,20,51,65 Li et al. (41). 180 These genes were identified using sequence homology searches on a limited set of databases and 181 without validating the de novo status of each gene through synteny conservation with closely 182 related species. To produce a set of high confidence DNGs from the Li et al. (2016) catalogue, 183 we conducted additional homology searches and retained only genes that passed a series of 184 stringent criteria, including conserved synteny with other Brassicaceae and lack of homology 185 with genomes and transcriptomes deposited on NCBI, as described in Methods. This approach 186 follows a robust computational framework developed to identify DNGs (58). A total of 298 187 putative DNGs were excluded as they shared homology with genes in other species (S1 Table), 188 leaving 331 high confidence A. thaliana-specific DNGs, similarly to the number of DNGs 189 reported by Donoghue et al. (11); however, we could not directly compare our list of DNGs to 190 those identified in this paper, as de novo gene IDs were not provided by Donoghue et al. (11). 191 The remaining 153 putative A. thaliana DNGs shared no conserved synteny with other species 192 and were removed from the catalog as the de novo birth pathway could not be determined. 193

Identification of high confidence DNGs in Brassica 195
We first analyzed all available Brassica genome assemblies to determine gene annotation 196 completeness. Most assemblies showed a high level of completeness according to BUSCO 197 scores (S3 Table). We selected B. rapa as the primary focal species for a de novo gene survey 198 due to its agricultural importance and due to the extensive genomic and functional data available 199 for this species (66-71). As the main focal genome to identify de novo genes, we selected the 200 recently improved B. rapa v3.0 genome assembly from the Chiifu-401-42 genotype (71) S2D Fig). In 246 rice, the GC-content distribution is bimodal in AG, as previously described (73) Interestingly, the microsatellite content is lower in the large rice DNG dataset, which includes 251 older DNGs, suggesting that simple repeat content may decrease with time due to substitutions. 252 As expected, the predicted coding potential of DNGs is much lower than in AGs, with much 253 fewer DNGs being labeled 'coding' according to the coding potential calculator 2. In agreement 254 with this, the coding adaptation index of DNGs is on average significantly lower than in AGs 255 (   Similarly, we found a much lower number of transmembrane motifs (TMs) and genes with TMs 264 in de novo proteins than ancestral proteins. This is in contrast with the observation that S. 265 cerevisiae de novo ORFs with adaptive potential are enriched for TM domains (74). 266 267 Across all species, DNG proteins showed consistently higher isoelectric point (pI), indicating a 268 higher proportion of basic residues, and very few transmembrane domains compared to AG 269 proteins ( Table 1). Elevated pIs due to a depletion of acid residues have been also found in 270 mammalian orphan proteins (75) and in S. cerevisiae-specific translated ORFs (37), but no 271 explanations for these trends have been put forward. We also found that the distribution of pI 272 values is very different between proteins encoded by DNGs and AGs. While in AG proteins the 273 pI values follow an approximate trimodal distribution, isoelectric points in DNG proteins cluster 274 around two peaks around low (~4) and high (~11)  Overall, DNG proteins also contained fewer localization peptides compared to AG proteins, with 286 the notable exception in A. thaliana of both a higher proportion of mitochondrial transit peptides 287 in DNGs vs. AGs, and a comparable number of DNG and AG proteins with a signal peptide 288 (Table 1). 289

290
We further investigated possible correlation among features in each species (Fig 1). As expected, 291 gene length, CDS length and CDS #exons (the number of coding exons) were positively 292 correlated. Similarly, the coding probability and the presence of protein domains increased with 293 gene length and CDS length and, to a smaller extent, with CDS #exons. Longer coding regions 294 are more likely to be predicted as coding by the coding potential calculator (72). Longer genes 295 are also more likely to encode protein domains. In agreement with previous studies (51, 80), we 296 found that GC-content and ISD were positively correlated. Another expected pattern is the 297 negative correlation between ISD and the presence of transmembrane helices, which cannot form 298 in disordered protein regions. The anticorrelation between GC-content and codon adaption index 299 may be due to the paucity of GC-rich codons among the most-used codons in angiosperms. selected examples were divided randomly into 70% for training and 30% for testing, the DT and 328 NN models were found to have 91-92.0% accuracy in predicting DNGs in the test set. Thus, 329 even though Decision Trees and Neural Networks represent completely different methods for 330 capturing patterns in training data, they are both able to learn how to discriminate DNGs from 331 AGs using sequenced-based features. The confusion matrix for the 100 genes from the 30% test 332 sets shows that the errors are evenly distributed between false positives and false negatives, with 333 recall values above 90% ( In order to determine a more general estimate of the predictive accuracy (since the single tree 344 above is dependent on the specific AGs chosen as negative examples), 10-fold cross-validation 345 was carried out (where a decision tree was generated based on 90% of the data and tested on the 346 remaining 10%, repeated 10 times in a rotated manner), which resulted in a performance estimate 347 of 89.7%, with a 95%-confidence interval (CI95) of 87.3-92.1% ( Table 3). The same protocol 348 was used to train classifiers on species-specific sequence features retrieved in DNGs and AGs of 349 B. rapa and O. sativa (using the larger dataset of 343 de novo genes in the latter, see Table 1). In 350 all species, NN models performed slightly better than DT models, significantly so in A. thaliana 351 and B. rapa but not in rice ( both classifiers indicate that NN models achieve substantially higher recall than DT models in all 357 species (S4 Table). Recall is comparably high (~92%) in NN models of Brassicaceae, but 358 decreased to ~83% in rice, where DT models achieved only ~76% of recall (S4 Table). 359 360 The variation in accuracy and recall across species may be due to several factors. A higher 366 quality of gene annotation in A. thaliana may explain the increased accuracy of MLAs in this 367 species. The lower accuracy in rice could in part depend on the slightly older age of the larger 368 dataset of 343 DNGs used for this species (8) We evaluated whether adding functional genomic data could improve the accuracy of the 377 decision tree classifier using datasets available for A. thaliana. The classifier was re-trained 378 using 28 additional features which are not systematically available in all genomes, including 379 transcription data (RNAseq), translation levels estimated through Ribosomal profiling 380 (RiboSeq), proximity to transposable elements, selective constraint, and phenotype data for gene 381 knockout mutants (S5 Table). The 10-fold cross-validated accuracy of models extended with 382 these functional features was 91.4%, (89.7-93.1%), which is not significantly greater than 383 models without these functional features (P>0.05, unpaired T-test). Some of the functional 384 features were occasionally used as decision criteria in lower branches of some of the decision 385 trees; the functional feature with highest importance (0.04) was "AVG RiboP RPKM 25 386 samples", which suggests that lack of expression evidence can be an important discriminator for 387 DNGs. 388 389 390

Feature Importance in Decision Trees 391
To better assess the contribution of each feature to DT classifiers in the three species, we 392 calculated feature importances, which are based on the relative contribution of each feature to 393 splitting the data in the tree and range between 0 and 1 (see Methods and Fig 2). Feature 394 importances averaged over 10 runs for each species are shown in S6 Table. The DT classifier 395 developed in A. thaliana consisted of 70 nodes (including 24 leaves) with a depth of 11. The 396 attribute tested for splitting at the root of the tree, considered the most important based on 397 reduction of Gini impurity, was "Protein domains". This turns out to be a highly discriminating 398 feature: most AGs (22458/26423=85.0%) have at least one domain (recognized as a known fold 399 family by Pfam based on amino acid sequence), whereas most DNGs do not (only 7/331=2.1% 400 had a recognized domain) (Fig 2). Overall, the top ten features by importance are largely the same across all species. "Protein 407 domains" is the most prominent feature across species (Fig 2; S6 Table). Several other top 408 features that are known to be significantly different between DNGs and AGs, including "Coding 409 probability", "Gene length", "CDS length", "Codon adaptation index (CAI)", "%GC" and the 410 proportion of "intrinsic structurally disordered (ISD)" regions in proteins, show high importance. 411 Although the presence of a conserved Kozak motif has not been investigated before in DNGs, 412 the "Kozak score" feature showed a relatively high importance, in agreement with other findings 413 suggesting that de novo genes might acquire more 'gene-like' regulatory sequences by natural 414 selection after their emergence (31). 415 416 Furthermore, some features exhibited substantially higher importance in rice than Brassicaceae. 417 For instance, "Codon adaptation index (CAI)" represents the second most important feature in 418 rice while ranking sixth and eight in A. thaliana and B. rapa, respectively (Fig 2; S6 Table). 419 This is interesting as CAI is only slightly higher in AGs than DNGs in rice (Table 1; S2G Fig). genes and is nearly equal to the highest accuracy with only 30% of the full training size, 444 corresponding to 138 genes (Fig 3). The observed trend also suggests that the accuracy would 445 not be significantly improved by using a larger training set with more known DNGs.  distances from A. thaliana will be required to test this hypothesis more thoroughly. We noticed 487 that the recall decreased in cross-species prediction, with a limited difference in B. rapa (from 488 ~84-92% to ~82-84%) and a significant drop in in rice (from ~76-83% to ~55-59%) (S7 Table).

Whole-genome predictions of DNGs using DT and NN classifiers 498
We next assessed the accuracy of species-specific DT and NN classifiers to predict DNGs in 499 whole-genome gene sets. We calculated the balanced accuracy, which is better suited to assess 500 the performance of classifiers when classes are imbalanced. The overall balanced accuracy 501 ranged from 92.3% in A. thaliana to 76.9% in O. sativa, with slightly higher accuracy in DT vs. 502 NN models ( Table 5). A high recall of ~94-99% was found across species and classifiers, 503 although DT models also achieved lower FPRs compared to NN models (S8 Table). Given the 504 much higher number of AGs than DNGs, the total number of false positives reached ~2,055 in A. 505 thaliana and a maximum of ~8,948 genes in O. sativa (S8 Table). Given the high FPRs, MLAs 506 alone may achieve the level of accuracy required to entirely replace traditional comparative 507 genomic analyses in DNG surveys; however, the application of MLAs as a first step would 508 decrease by up to 10-fold the number of genes that need to be investigated with homology 509 searches and other time-consuming approaches in order to remove false positives. 510 511 512 We further examined if the performance of models trained on A. thaliana data but applied to the 518 two other species (in this analysis, there are no confidence intervals because a single input model 519 trained on one species was tested for accuracy on the whole genome of another). The A. thaliana 520 classifiers, particularly NN models, showed relatively high balanced accuracy in B. rapa and in 521 rice (Table 6). However, recall values dropped significantly in both species, from ~97-98% to 522 ~82-84% in B. rapa and from ~94-98% to ~55-59% in rice (S9 Table). Interestingly, the A. 523 thaliana NN model resulted in lower FPRs than the within-species models while the opposite 524 was true for the A. thaliana DT model (S9 Table). 525 526 527 examples, the balanced accuracy steadily decreases from ~91% to ~85%, primarily due to the increasing false negative rate from <10% up to 25% (Fig 4). Concomitantly, the false positive 540 rate decreased from ~8% to <4% (Fig 4). As the main goal of the MLA approach is to detect 541 DNGs, the loss of sensitivity associated with the higher number In this study, we have developed and assessed the first machine learning framework to identify 554 de novo genes. In order to make these approaches readily applicable in species with limited 555 functional genomic data, we have specifically selected basic sequence features that can be 556 obtained from DNA and protein sequences available in annotated genomes. Using DNG datasets from three plant species, including an updated gene set from A. thaliana and the first group of de 558 novo genes in B. rapa, we have found that both decision tree (DT) and neural network (NN) 559 classifiers achieve high levels of accuracy and recall in predicting DNGs. Using DT algorithms 560 applied to sub-sampled sets of DNGs and AGs, we identified a few features with significant 561 predictive power for DNGs. This is in line with performance ability of MLAs to discover orphan 562 genes in A. thaliana based on six DNA sequence features (62). Importantly, orphan genes are not 563 equivalent to de novo genes, as the former appear to be mostly constituted by rapidly evolving 564 genes (54). Training MLA models with additional features derived from functional genomic data 565 (transcription and translation data) and information from phenotypic assays that are not readily 566 available in most sequenced genomes does not lead to a substantial increase in accuracy or 567 reduction in the number false positives. 568 569 A major advantage offered by MLAs is the significant decrease in computational time compared 570 to traditional genomic approaches to find DNGs-essentially, a time contraction from weeks or 571 months to minutes. For MLAs to be successfully applied in DNG surveys across hundreds to 572 thousands of species, it is critical to train models using datasets of known DNGs from a few key 573 species, and for these models to obtain high accuracy and recall in other species. We conducted 574 initial tests to explore this possibility using DT and NN models trained on A. thaliana to predict 575 DNGs in B. rapa and rice. We found that these cross-species predictions achieved comparable 576 accuracy of species-specific models in B. rapa, and somewhat lower accuracy in rice. This 577 suggests that MLAs trained in one species can likely be used to infer DNGs in closely related 578 species, as in the case of the two Brassicaceae, A. thaliana and B. rapa. Given that many 579 angiosperm families now contain sequenced genomes from multiple species, and considering 580 both the rapid increased of the number and quality of new genome assemblies, de novo gene 581 discovery based on MLAs could likely be applied to a large number of flowering plant taxa. 582 Future work on more taxa should help better determine how cross-species MLAs accuracy 583 decreases when the evolutionary distance between taxa increases, as this study indicates. 584 585 Genome-wide analyses showed that species-specific models predicted well above 90% of known 586 DNGs, although the much higher number of ancestral genes lead to several thousand false 587 positive cases. In cross-species genome-wide analyses, A. thaliana models identified 82-84% of 588 true DNGs in B. rapa, but achieved less than 60% recall in rice. Notably, FPRs were 589 substantially lower in cross-species NN models compared to species-specific models. The 590 combination of high recall, at least in some cross-species tests, and high FPRs suggests that a 591 three-step pathway can be employed to accelerate DNG discovery in angiosperms. First, NN 592 models are developed in one or two species with the best gene annotation quality in each 593 angiosperm family. Second, these NN models are applied to an array of target species in the 594 same family. Third, the candidate DNGs predicted from each target species, comprising only a 595 few thousand genes, are analyzed post-hoc with traditional comparative genomic approaches to 596 remove false positives. 597 598 As this represents the first systematic study to assess machine learning approaches in de novo 599 gene discovery, we expect that further developments of in this area could significantly increase 600 accuracy and recall while reducing the false positive rate in DNG detection. Along these lines, 601 alternative machine learning approaches, including deep neural networks, and methods to 602 address the class imbalance between DNGs and AGs different from sub-sampling, such as 603 synthetic minority over-sampling algorithms, or SMOTE (62, 82), warrant future investigations. 604 605 606 607

Validation of Arabidopsis thaliana de novo genes 609
The TAIRv10 (TAIR10) DNA and protein sequences of A. thaliana were obtained from the 610 folder "TAIR10 blastsets" in the TAIR repository (83). Data from the files 611 "TAIR10_pep_20101214", "TAIR10_cds_20101214" and "TAIR10_exon_20101028" were 612 used in sequence similarity searches and sequence feature analyses. A. thaliana de novo genes 613 were retrieved from a set of 782 putative DNGs recently described by Li et al (2016) using 614 sequence homology searches. These genes were screened to identify high-confidence DNGs 615 supported by further comparative genomic data, particularly synteny information. Specifically, 616 we performed Blast v2.11.0 (49) searches of the corresponding protein sequences against several 617 NCBI databases. We used Blastp to search the "nr" and "tsa_nr" databases, and tBlastn to search 618 the "nr/nt", "refseq_rna", "est" and "TSA" databases with the following modified parameters: 619 num_descriptions 100 -num_alignments 100 -max_hsps 5 -evalue 0.001 -seg yes. The seg filter 620 was turned on in order to remove spurious hits due to nonhomologous stretches of similar amino 621 acids. These searches were carried out against increasingly broader taxonomic units that include 622 Arabidopsis species excluding A. thaliana, Brassicaceae (excluding Arabidopsis), rosidae 623 (excluding Brassicaceae), angiosperms (excluding rosidae), green plants (excluding 624 angiosperms). We also searched for homologous sequences of the 782 DNGs in fungi, bacteria 625 and archaea to identify and remove possible horizontal transfer cases (S1 Table). Subject 626 sequences were screened using unix scripts to remove truncated proteins. We excluded from the 627 catalog of DNGs all cases with any homology with sequences from a non-focal species. 628

629
To determine synteny conservation of DNG coding regions we searched 45 Brassicaceae genome 630 assemblies obtained from Phytozome (https://phytozome-next.jgi.doe.gov) using the translated 631 DNA sequence of each exon of the 782 putative DNGs, which allowed us to detect conserved 632 coding regions for genes with one or multiple exons, in tBlastx run with the following modified 633 parameters: num_descriptions 100 -num_alignments 100 -max_hsps 5 -evalue 0.001 -seg yes. A 634 list of the 45 species investigated in available in S2 Table.  635 636 First, we used these alignments to identify Brassicaceae genomic regions that were syntenic with 637 DNGs and with the potential to encode proteins. Although these regions did not include 638 annotated genes, they maintained long coding regions and were thus considered bona fide 639 homologs to DNGs. We based this selection on two criteria. We selected for hits with a 640 conserved methionine within the first five amino acids of the A. thaliana DNG protein, in order 641 to account for alternative first codon positions (84). Additionally, we included only hits wherein 642 the total alignment length from the first codon to the first stop codon equal to at least 75% of the 643 query protein. This threshold was selected to include hits that are likely to encode a protein, 644 given the lack of disabling mutations along most the of the coding, while allowing for slightly 645 shorter loci, as stop codons may also vary slightly between orthologs. Using this strategy, we 646 identified 44 DNGs with putative homologous genes in non-A. thaliana Brassicaceae, which 647 were thus discarded. 648 Second, we used the same alignment data for the remaining DNGs to identify those that maintain 649 synteny conservation in noncoding regions with other Brassicaceae. To this end, we applied a 650 minimum threshold of 30% coverage between A. thaliana DNGs and Brassicaceae genomes as 651 corresponding to conserved synteny. This length threshold is lower than those used in previous 652 DNG analyses in animals (31,85,86), in order to account for the decreased overall synteny 653 conservation among Brassicaceae. Furthermore, the syntenic alignments were screened for the 654 presence of enabler substitutions, represented by a novel start codon, the removal of stop codons 655 and/or frameshifts in the DNG coding region compared to the syntenic DNA of the outgroup 656 species. Overall, we found 604 A. thaliana putative DNGs with apparent synteny conservation 657 with at least one other Brassicaceae genome. 658 659

Identification of Brassica rapa de novo genes 660
We retrieved all Brassica rapa genome assemblies and gene sets available as of October 2021 661 and screened each assembly for completeness using BUSCO v3.0 (87) (S3 Table) Similarly to the procedure applied to A. thaliana putative DNGs, we carried out Blast searches 689 against increasingly broad taxonomic units containing B. rapa, starting from Brassica but 690 excluding B. rapa, then other Brassicaceae, rosidae, angiosperms, green plants, excluding the 691 previous taxon at each step, and using the following modified parameters: -max_target_seqs 50 -692 max_hsps 5-evalue 0.001 -seg yes. Fungal proteomes were also screened, whereas Archaea and 693 Bacteria sequences were not included as our analyses in A. thaliana DNGs showed that 694 prokaryotic databases contributed marginally to the detection of homologs. We found 2,089 B. 695 rapa proteins sharing no sequenced homology with two or more non-Brassica proteins, thus 696 representing B. rapa orphan genes. A cut-off of at least two non-Brassica proteins was 697 implemented to take into account possible contamination from A. thaliana into other 698 Brassicaceae genome datasets (89,90). This number of orphan genes in B. rapa is similar to but 699 higher than the 1,540 B. rapa orphan genes recently described (45), probably because of 700 differences in the homology search criteria and in the gene annotation version. We further 701 removed from the list of orphan genes 35 genes with annotated protein domains in eggNOG 702 Mapper (91), as they likely represent fast evolving non-DNGs. 703

704
In order to identify enabler changes uniquely associated with de novo gene birth, we inspected 705 the alignments from the Blast searches between B. rapa orphan genes and the genome of 45 706 Brassicaceae (S2 Table). After applying the same approach described to filter A. thaliana 707 putative DNGs, we obtained 754 B. rapa-specific de novo genes. where + is the observed nucleotide at position i relative to the start of the ATG. The scores are 736 generally negative with a mean around -5.1, but the closer to zero, the more like the consensus 737 sequence (AAAAAAATGGCG) they are. This is similar, but not identical, to the 738 acAACAATGGC consensus sequence for terrestrial plants (97) "TAIR10_GFF3_genes.gff" in the TAIR repository was used to obtain the length of "5'UTR 763 length" and "3'UTR length" and the number of coding exons ("#Exons"). The proportion of the 764 coding region overlapping with transposable elements (TEs; "TEs in CDS (bp)" and "%TEs in 765 CDS") and the gene distance to the nearest TE ("TEdist") were calculated using bedtools (94)  766 and the genome coordinate of TAIR10 coding exons and TEs. Genome coordinate of TEs were 767 obtained from the gff3 file "TAIR10_GFF3_genes_transposons.gff" available in the TAIR 768 repository. Possible regulatory motifs ("#Motifs promoter") in the promoter regions of DNGs 769 and AGs were identified using the MEME suite (103). The DNA sequences corresponding to the 770 300bp upstream of the transcription start site of each TAIR10 gene were retrieved using bedtools 771 and screened using the MEME Streme tools (104). TAIR10_all.domains ("#HMMPfam Domain") and TairProteinInteraction 20090527.txt 801 ("#PPIs", "#PPIs (w/predicted)"), respectively, from the TAIR repository. The PPI data contain 802 interaction annotations extracted from the literature by TAIR and BIOGRID (110). The 803 frequency of three different categories of amino acids, "Tiny", "Aromatic" and "Acidic", were 804 obtained using the pepstats program in the EMBOSS suite (95). 805

806
We gathered phenotypic information using data deposited in the TAIR repository containing 807 phenotypic data extracted from the literature by TAIR. Data files names: TAIR_Phenotypes_9-808 2019.txt ("TAIR_Phenotypes_9-2019.txt"), Locus_Germplasm_Phenotype_20190630.txt 809 ("LGP_20190630.txt"), Locus_Germplasm_Phenotype_20130122 ("LGP_20130122"). 810 Additionally, data from manually curated meta-analysis of loss-of-function phenotypes (111)  811 ("Lloyd and Meinke 2012") and from a high-throughput phenotype screening of annotated genes 812 with unknown function (112)  The Decision Tree (DT) classifier was trained using the scikit-learn package in Python (113). 819 The feed-forward fully-connected neural network (NN) with a single hidden layer with 20 820 hidden nodes was also trained with the MLPClassifier implementation in scikit-learn, using tanh 821 activation functions and the 'adam' solver. Log transformations were applied to length-based 822 features (Gene length, CDS length, Distance from TEs) and to functional genomic features with 823 RPKM in A. thaliana. 824 825 Feature importance (normalized decrease in Gini impurity index at each node where a feature is 826 used, weighted by the fraction of training examples represented at those nodes) was calculated 827 using the 'feature_importance' attribute of the DecisionTreeClassifier generated by scikit-learn. 828 In some runs, "At-least-one-domain" was the feature at the root of the tree, and in other runs, 829 "Coding potential (cpc2)" represented the splitting feature at the root. Thus, we estimated the 830 importance of features by averaging them over multiple runs.