Applying Machine Learning to Classify the Origins of Gene Duplications

Nearly all lineages of land plants have experienced at least one whole genome duplication (WGD) in their history. The legacy of these ancient WGDs is still observable in the diploidized genomes of extant plants. Genes originating from WGD—paleologs—can be maintained in diploidized genomes for millions of years. These paleologs have the potential to shape plant evolution through sub- and neofunctionalization, increased genetic diversity, and reciprocal gene loss among lineages. Current methods for classifying paleologs often rely on only a subset of potential genomic features, have varying levels of accuracy, and often require significant data and/or computational time. Here we developed a supervised machine learning approach to classify paleologs from a target WGD in diploidized genomes across a broad range of different duplication histories. We collected empirical data on syntenic block sizes and other genomic features from 27 plant species each with a different history of paleopolyploidy. Features from these genomes were used to develop simulations of syntenic blocks and paleologs to train a gradient boosted decision tree. Using this approach, Frackify (Fractionation Classify), we were able to accurately identify and classify paleologs across a broad range of parameter space, including cases with multiple overlapping WGDs. We then compared Frackify with other paleolog inference approaches in six species with paleotetraploid and paleohexaploid ancestries. Frackify provides a way to combine multiple genomic features to quickly classify paleologs while providing a high degree of consistency with existing approaches.


Introduction
Polyploidy, or whole genome duplication (WGD), is a common phenomenon among some lineages of eukaryotes, particularly flowering plants where nearly 30% of extant species are estimated to be recent polyploids (1-3) . Although many of these polyploid species ultimately go extinct (4-7) many return to the diploid state through a process known as diploidization (8) . This is a ubiquitous process as most land plants are ancient polyploids that have diploidized, with several lineages containing multiple WGD in their recent and ancient history (9) . WGD derived genes, also known as paleologs or ohnologs, are rapidly lost during diploidization in a process termed fractionation (8, [10][11][12] . The paleologs that are not lost during this process can persist in diploidized genomes for millions of years (13,14) . Paleologs are thought to play important roles in a variety of evolutionary processes including speciation through reciprocal gene loss or lineage specific rediploidizaiton (15-21) as well as adaptation and the evolution of novelty by sub-and neofunctionalization (22)(23)(24)(25)(26) . Recent analysis has shown that paleologs have higher levels of genetic diversity than non-paleologs and can significantly contribute to domestication (14) . Further exploration into the legacy of ancient polyploidy requires the rapid and accurate identification of paleologs in the genomes of extant diploid descendants.
Current methods for the identification of paleologs utilize sequence comparisons, genome structure information, or a combination of the two. Arguably the most common and simplest method used to detect paleologs is to identify peaks of gene duplication by examining gene age distributions (27-31) . WGD instantly doubles the number of genes in a genome and this large, abrupt birth event is visible in gene age distributions. These age distributions are often called Ks plots ( Fig. 1) because they are usually made by plotting the synonymous site divergences (Ks) Classifying Gene  among paralogs in a single genome. Mixture models can be used to identify and place bounds on peaks of gene duplication in Ks plots, and homologous sets of genes within these Ks bounds are assumed to be paleologs (27-29) . Although simple and applicable to a wide range of data, this approach suffers from potentially high rates of false positives and negatives. For example, not all genes within a Ks peak were necessarily duplicated by a WGD and the hard bounds placed by mixture models are not perfectly accurate (Fig. 2). Another method for identifying paleologs is to leverage genome structure rather than only comparing unordered sequences. This approach uses conserved gene order-synteny-either within a species or between an outgroup to identify the duplicated regions in genomes from WGDs (32-36) . Syntenic regions can also arise from other small scale duplications or additional layers of older WGD (23, 37, 38) .
Several methods have been employed to resolve the origins of syntenic regions. The simplest of these approaches is to compare the overlap of genes within syntenic regions to that of the Ks based methods (14, 34, 39-41) . As an alternative to Ks, some studies have used Double Conserved Synteny (DCS) with a diploid outgroup that does not share the WGD. In DCS analyses, syntenic regions in the self-self comparison that match to the homologous syntenic block in the outgroup are considered to be WGD duplicates ( Fig. 2) (35, 36) . Although these methods can be useful for filtering out small scale duplications, both suffer when multiple WGDs are present, a common occurrence in plants (9) . Relying solely on syntenic information will result in DCS being detected not only for the focal WGD, but possibly also older WGD (34, 42) .

Filtering these blocks by Ks can also be insufficient because the overlap between adjacent
WGDs can make them difficult or impossible to separate (Fig. 2) (34, 43) .
Classifying Gene  To help disentangle the origins of syntenic genes, several methods have been developed that account for the complex duplication histories of plants. One approach, quota-align, attempts to differentiate paralogous and orthologous syntenic blocks by using Binary Integer Programming (BIP) to fit syntenic blocks that match to an a priori chosen ploidy level (34) . Although this method can be useful for disentangling the origin of individual syntenic blocks, it often fails to identify genes retained in the single copy state as they are inherently undetected due to missing their homolog (Fig. 2). This method is also limited by the user's a priori knowledge of the evolutionary history of their study organism (34) . As an alternative, POInT (Polyploid Orthology Inference Tool) provides a powerful framework for reconstructing the evolutionary history of ancient WGDs, and is one of the few methods to formally model WGD evolution Although paleolog classification is not the primary goal of POInT, its inferences can be used to identify paleologs (11,44) . An inherent drawback to POInT's approach is that inferences are only made for regions with conserved synteny across all species examined, resulting in many paleologs being missed, especially those in potentially interesting areas of genomes that have diverged in retention or experienced high rates of rearrangement since a WGD (11, 44) . For sets of paleologs that are detected, the retention status of each copy can be filtered by the model confidence, limiting error in downstream analysis (11,44) . The tradeoff of these sophisticated models is that it requires non-trivial computational effort to prepare data and run POInT; including the identification of high confidence DCS blocks within species using simulated annealing and the estimation of ancestral gene order using DCS blocks conserved across all species of interest.
Overall, each of the existing approaches to identify paleologs have strengths and weaknesses that can be leveraged to provide a specialized tool for paleolog identification and classification that is computationally efficient. The benefit of supervised machine learning in these cases is that they often achieve accurate inferences with much less computational time and effort than traditional inference methods (53, 56) . In the context of identifying paleologs, machine learning allows for an ensemble approach that can leverage the strengths of various methods such as quota-align, DCS, and Ks divergence, yet also allows for the easy integration of updated methods in the future. Here we introduce a gradient boosted classification approach to identify paleologs using multiple features from synteny and Ks information. This method, Frackify (Fractionation-Classify), is able to make inferences with a high degree of overlap with POInT classifications in conserved syntenic regions while also identifying putative paleologs in regions not analyzed by POInT. We begin with exploring biological data in order to make informed simulations and the caveats of such an approach. We then explore methods for training, testing, and validation a gradient boosted decision tree. Finally, we analyze six paleohexaploid and paleotetraploid plant genomes to compare Frackify with other paleolog inference approaches.

Biologically Informed Approach
Frackify, like other machine learning based approaches, makes predictions using a series of "rules" which it infers from a training dataset where the "answers" are known (64, 65) . Currently no such dataset exists for paleologs as methods for paleolog identification are not necessarily accurate. Insufficient training data is a common issue in bioinformatics and many researchers supplement training data using simulations (48, 55) . However simulating WGD is often difficult Classifying Gene  or intractable given that many plant lineages have complex duplication histories (9, 11, 66-68) . This issue is compounded by our incomplete understanding of genome evolution following WGD. A prime example is the ongoing exploration of how cytological and genetic diploidization interplay on evolutionary timescales (8) . A more tractable approach is to simulate the detectable signatures of WGD in genomes. In our case we simulated the output of popular synteny inference tools such as MCScanX and CoGe SynMap (39, 69) . Both of these tools produce output files containing inferred syntenic block pairs either within or between species. Additional settings also allow for Ks values to be produced for homologous gene pairs in the detected syntenic blocks (39, 69) . To help inform our simulations using empirical datasets we used MCScanX to identify syntenic blocks in the genomes of 27 species (Table 1). These species were selected because they have well studied and assembled genomes across a variety of taxa, including 10 orders and 22 genera that have experienced varying numbers of paleotetraploidies or paleohexaploidies with different levels of divergence. Several key properties of these syntenic blocks were summarized and used to develop simulations of WGD to train Frackify.
A fundamental property of syntenic blocks is their length. In closely related taxa, syntenic blocks can be nearly as long as chromosomes, but over time they are broken up through fractionation and genomic rearrangements (8, 70,71) . To understand the lengths of syntenic blocks given the level of divergence of the WGD, we explored the distributions of self-self syntenic block lengths from MCScanX in our 27 exemplar species. Across the entire dataset, block lengths of less than 30 genes were predominant with a mean of 17.6 collinear gene pairs, however there was a long tail of much larger blocks with a maximum of 1285 genes in Brassica juncea (Fig 3). Although the approximate shape of a log-normal distribution was consistent Classifying Gene  across individual species, more fractionated genomes were skewed towards shorter block lengths. To test this, we fit a log-normal distribution to the distribution of syntenic block lengths in each species using the fitdistr() function from the MASS R library (72) . We then plotted the mean and standard deviation of those distributions against the mean (Ks WGD ) of collinearity homologous gene pair Ks (Ks pair ) of the most recent WGD in each species (Fig. 3). Using a linear model we found that although the mean length of blocks did not vary significantly with Ks WGD (F 1,27 = -1.909, p = 0.067, R 2 = 0.123), it was a significant predictor of the standard deviation ( As syntenic blocks become broken up through time, the surviving collinear homologs also accumulate substitutions (29, 34, 43) . Using the sequence divergence of these homologs it is possible to screen which syntenic blocks originate from different duplication events, such as the WGD in question or older WGDs and small scale duplications (29, 34, 43) . Similar to the use of Ks plots to visualize the age distribution of paralogs in a genome, the mean synonymous divergence (Ks) of the collinear homologous gene pairs (Ks block ) in each syntenic block pair can be plotted in a frequency distribution. As we could not a priori know the exact WGD each block originates from in our empirical data, we used EMMIX to fit a series of mixture models to a distribution of duplication node Ks values in gene family clusters calculated using DupPipe for each of the 27 species (73, 74) . We then selected the model that fit the distribution for each species best based on the Bayesian Information Criterion (BIC) score and used Ks pair with >50% Classifying Gene Duplications-10 likelihood assignment to the corresponding duplication peak to estimate a Ks boundary for each WGD peak. Syntenic blocks with a Ks block within the bounds of these estimates were considered to have originated from the WGD. We then calculated the standard deviation of Ks block values for each WGD peak and plotted it against Ks WGD of each WGD. We fit a linear model to the data and found that Ks WGD was a significant predictor of the Ks block standard deviation ( Using this model we simulated the level of synonymous divergence in collinear homologous gene pairs given the Ks block of the syntenic block pair they are within.

Simulation
Using the models and distributions estimated from our 27 species, we simulated syntenic blocks from a variety of duplication histories in a format similar to the output of MCScanX (Fig. 1).
This simulated training data was then used to train Frackify to classify paleologs in syntenic blocks detected by MCScanX or CoGe SynMap. The simplest of these training data simulated Classifying Gene Duplications-11 only a single WGD event. In single WGD simulations, we began by specifying the Ks WGD and producing a list of featureless syntenic block pairs ( Fig. 6A.1). The number of these block pairs can be determined by the user, allowing for flexibility in the scale of the simulation. Each block pair within the initial list was a different length randomly selected from a log normal distribution described by Ks WGD and the model of block lengths for our 27 species (Fig. 3). We then simulated synonymous divergence in each block pair by assigning a random Ks block from a normal distribution with a mean equal to the Ks WGD and standard deviation specified by our model of To train Frackify to identify these classes of homologous gene pairs, we populated our simulated syntenic blocks with genes that reflect these different classes of gene duplications (Fig. 6B). In order to produce genes with varying duplication histories we began by producing a list of Classifying Gene Duplications-12 unduplicated genes. The length of this list was based on the number of empty collinear gene pairs in the simulated syntenic blocks from the most divergent WGD so that post-WGD and fractionation we would have enough syntenic WGD derived genes -syntelogs -to populate each simulated syntenic block (Fig. 6B.1). We then produced WGD derived syntelogs by duplicating or triplicating each gene in the list depending on if a paleohexaploidy or paleotetraploidy was being simulated (Fig. 6B.2) (77) . Next we simulated fractionation of these genes post-WGD by selecting a random number of genes to remove from each syntelog set between 0 and 3 (triplication) or 0 and 2 (duplication) (Fig. 6B.3). This resulted in approximately equal numbers of each paleolog class so that Frackify would be trained with a large sample of each. If a more complex duplication history was desired we duplicated and fractionated the gene list again to simulate repeated duplication events in the genome. Small scale duplications can also occur in genomes and result in genes which do not originate from our focal WGD (23, 33, 34, 69,78) . Genes from these additional duplications were simulated by producing a separate list of non-paleologs with a length equal to an arbitrary 15% of the paleolog gene list.
To produce the final simulated synteny dataset we merged our simulated list of paleologs and non-paleologs into syntenic blocks appropriate for each class (Fig. 6B.4). In the case of double retained paleologs, we expected each copy to be in WGD derived regions A and B that together make a single syntenic block pair (11, 34, 79) . However, triple retained genes would be found in three regions (A,B and C), all being syntenic with each other. This resulted in pairwise block comparisons among all regions (e.g., A-B, A-C, and B-C) (11, 38, 80) . Most single retained genes would not be detected in self-self syntenic block pairs due to having lost their homologs, however a small number may be aligned with non-paleologs due to tandem duplications or gene Classifying Gene  conversion (38, 69, 76, 81, 82) . To simulate this a priori, 10% of the single retained genes were randomly selected and paired with a non-paleolog to form a misaligned homologous pair. Once all pairs of collinear homologous genes were produced, they were randomly assigned to syntenic blocks from the WGD of origin to produce the final simulated self-self syntenic dataset (Fig.   6C).
Although self-self synteny data is useful for detecting WGD-derived syntelog sets, relying on it solely could lead to single retained genes being misclassified because they have lost their other paralogs (38, 81,82) . Previous methods have often identified single retained genes using gene tree information or syntenic comparisons to an outgroup (11, 38, 44, 81,83) . For Frackify we identified single retained genes using syntenic information from a related outgroup that the focal species diverged from prior to experiencing the WGD of interest. Similar to WGD, this speciation event can be detected in Ks plots due to a large peak of orthologous genes (28, [84][85][86][87] . Genes in the focal species that were not already recognized as paleologs and were orthologous to an outgroup gene were considered to be single retained paleologs. This is because genes that have an ortholog in a pre-duplication outgroup-identified by both synteny and divergence-were likely present in the genome prior to WGD and the other copies were presumably lost during fractionation (83, 88) . To simulate orthology with an outgroup we first simulated the shared duplication history prior to speciation by duplicating our self-self comparison list and removing syntenic blocks from the youngest WGD. Natural variation in synonymous divergence of the shared WGD in the outgroup and ingroup species was simulated by assigning new Ks pair to each syntenic homolog pair using the same methods as the self-self comparison ( Fig. 6A.2-3). The ortholog peak was then added by duplicating the gene list once Classifying Gene Duplications-14 more and assigning new Ks pair to each ortholog pair with a mean Ks (Ks orth ) between the focal and the second oldest Ks WGD . Although this final list represents our self-outgroup syntenic comparison, it is possible that some genes may be differentially lost between our focal species and the outgroup (11, 43, 89) . To capture differential gene loss we randomly removed 10% of the genes from the outgroup, but users can specify different amounts of gene loss if desired.
Using our simulation framework we produced self-self and self-outgroup syntenic information each syntelog and reported how many standard deviations away from Ks WGD the three closest syntelogs were. This initial set of summary statistics was chosen to summarize both the syntenic information for each individual gene and the syntenic relationships of its syntelogs.
In addition to features from the self-self syntenic data we also included self-outgroup syntenic information to both detect single retained genes and help parse out which genes in a syntelog set may originate from overlapping WGD. First we searched if each gene was found in syntenic blocks pairs with the outgroup at all. If a gene was in more than one syntenic block pair we then filtered out only the syntenic blocks with Ks block within three standard deviations of the Ks orth .
This process was repeated for all syntelogs in each set to gain a list of all the syntenic blocks that may be from the ortholog peak. We then filtered the outgroup genes in this list of syntenic blocks to find those that were orthologous to multiple syntelogs. If more than one orthologous outgroup gene was found, we calculated the mean Ks block for all the syntenic block pairs that contained the outgroup genes. We selected the outgroup gene with the closest mean Ks block to Ks orth as it was most likely to be the ortholog of the WGD derived syntelog set. Finally, we listed the number of syntelogs in the set that matched the chosen orthologous outgroup gene. All of the features of the duplication history, self-self, and self-outgroup syntenic comparisons were formulated into a single row of a data frame along with the known class of the given gene.
With our data formatted we then trained our boosted decision tree while attempting to minimize overfitting (93, 94) . First we assessed if the portion of the data used for testing and training could bias the model and lead to artificially inflated estimates of model accuracy while also searching for optimized hyperparameters using a Bayesian approach (95) . This was done using a 5 k-cross Classifying Gene Duplications-17 fold validation, in which the data set is repeatedly split into training and testing data then an average accuracy of the model's predictions is assessed. We repeated the 5 k-cross fold validation 200 times with different hyperparameters to minimize the negative mean absolute error (96-98) . Once the final set of hyperparameters were selected we performed a final 5 k-cross fold validation, producing predictions for gene classes in the testing data with a mean accuracy of 96.05% varying by 2.32%. Given the low level of variance in how accurately the model identified gene classes, it is unlikely the model would be sensitive to the proportion of the data set that is used for training and testing. Based on this information we split our simulated dataset into 70% training and 30% testing data and trained the final model.

Testing and Validation
Using our simulated test data described above, we used a variety of standard approaches to evaluate the performance of Frackify's predictions for the class of each gene. First we tested how well the thresholds for each class of gene were defined within the model using a Receiver Operating Characteristic curve (ROC) (99, 100) . For all classes, the rate at which false positives (FPR) was discovered remained low compared to the true positive discovery rate (TPR), suggesting the model was optimized correctly to avoid biasing towards one class of gene (Fig.   7). To evaluate the performance of Frackify further we made a confusion matrix which compared the classifications predicted by Frackify for each gene against the actual class, visualized as a heat map (Fig. 8) (101) . Across the 1,540 simulations Frackify predicted the majority (~96-98%) of each classification category correctly (Fig. 8). Similarly, there was a low error rate between all classifications with no clear patterns of particular gene classes being mistaken for others.
Between the ROC and heatmap, Frackify appears to perform well across a large set of simulated Classifying Gene Duplications-18 duplication histories, however it is possible certain scenarios are contributing the majority of the 2-4% of each class identified incorrectly.
To further explore the small fraction of paleologs that were misclassified by Frackify, we considered several cases where paleolog classification might be challenging. For example, if multiple WGD peaks heavily overlap or speciation with the outgroup occurred soon after WGD   (34, 83, 102) . To evaluate Frackify's performance in these cases we parsed out predictions for simulations that contained three WGD (Fig. 9). Despite the multiple overlapping WGD, non-paleologs were rarely misidentified as paleologs. However, Frackify less accurately identified different classes of paleologs in cases with highly divergent and overlapping WGD.
There are several potential explanations for why this might be the case. As an example we considered the scenario where a given gene experiences a WGD, followed quickly by speciation, and then a second WGD. This scenario results in two sets of double retained paleologs from the younger WGD that are orthologous to two genes from the older WGD shared with the outgroup.
In this case, Frackify does not misclassify these older duplicate genes as paleologs from the focal WGD. However, it may misidentify which ortholog each of the syntelogs matches, resulting in a triple retained paleolog set and a single retained paleolog. This issue could be compounded if one or more of the orthologs or paleologs are lost, such as in highly divergent WGD. Across most parameter space Frackify was able to discern these complex scenarios with a high degree of accuracy (~85% or higher), but classification accuracy declined for Ks WGD greater than 1.00 (Fig.   9).

Classifying Gene Duplications-19
To help understand why Frackify might sometimes misclassify paleologs, we explored the boosted decision tree itself. We used SHapley Additive exPlanation (SHAP) values (63, 103,   104) to help understand how XGBoost was making predictions for each class of gene. SHAP values are a useful data visualization tool to help interpret which features are having a significant impact on a machine learning model (103) . In our case, we noted several important features that had consistently high values for several classes of genes (Fig. 10). In particular, we found that the number of syntelogs in a set and how many match to the same orthologous outgroup gene had high SHAP values across multiple gene classes. These features are likely major drivers of paleolog classification and in combination may explain why non-paleologs are rarely misclassified. For example, it is possible non-paleologs from more recent small scale duplications could be found in the focal WGD Ks peak, but they are not likely to be identified as paleologs because they do not have a clear ortholog in the outgroup. Similarly, these primary features may be used in combination with other features to help discern more complex scenarios.
For example, if a set of paleologs contains a paralog that actually originates from an older, shared WGD, that paralog will be syntenic to multiple regions of the outgroup genome. These additional syntenic regions will result in an inflated number of putative orthologous outgroup genes, lending evidence that the paleolog set is contaminated with paralogs. These secondary features can be highly gene class specific, evident by their SHAP values. For example, the most important features for the classification of double retained paleologs was the number of standard deviations away from Ks WGD the mean Ks block of each syntelog (Fig. 10). These features are likely how the model discerns if syntenic block pairs in the self-self comparison originated from the focal or overlapping WGD. Although we do not know the exact way these features are used Classifying Gene  to make final predictions, their importance to the model is useful to understand how we may improve Frackify's classifications in the future and for comparing Frackify to other approaches.

Empirical Examples
Although Frackify performed well when validated with simulated test data, we wanted to test the model's performance on empirical data sets. To accomplish this we compared Frackify's classifications with paleolog inferences from POInT for six plant species (11, 14, 80 70 (3, 66) . For Brassica rapa, B. oleracea, and Crambe hispanica we analyzed the Brassiceae whole-genome triplication, with a median WGD peak Ks near 0 .30 (3, 14) . Similar to POInT, we used Cleome violacea and Arabidopsis thaliana as outgroups for the Brassiceae whole-genome triplication and At -α WGD respectively (11) . We compared the POInT paleolog inferences in these six genomes to the less rigorous Ks + Synteny overlap based approach discussed in the introduction and new classifications from Frackify.
Across all six species we found that Frackify identified genes as paleologs with a large degree of overlap with other approaches, but also identified many additional paleologs. We visualized this overlap using euler plots from the eulerr R library (Fig. 11) (109) . Over half of the paleologs identified were shared across more than one method (58.7% total, 53.0 -62.1% across species) (Fig. 11). A relatively small proportion of unique paleologs classifications came from Classifying Gene  POInT and the Ks + Synteny overlap approach (5.1% total, 0.3 -5.5% across species) (Fig. 11).
Similarly, Frackify identified 95.3% and 92.1% of the paleologs identified by Ks + Synteny and POInT respectively. In the case of POInT, Frackify also maintained a high degree of agreement in terms of the retention status of each paleolog classification (81% average, 63 -89% range) (Fig.   12). These results suggest that Frackify is not only able to identify the majority of paleologs previously identified by other methods, but also their retention status. In addition to these findings, a significant proportion of paleologs identified were unique to Frackify (36.2% total, 33.0 -39.0% across species) (Fig. 11).
To further analyze the additional genes identified by Frackify and qualitatively assess if they are generally accurate or false positives, we conducted a more detailed comparison of classification differences between POInT and Frackify that leveraged SHAP values from the model. For all classes of paleologs identified by POInT we noted a high degree of agreement with Frackify, suggesting the model was able to recognize these classifications and make accurate predictions (81% average, 63 -89% range) (Fig. 12). The major source of disagreement between Frackify and POInT occurred in non-paleologs. It is possible these non-paleologs were simply not analyzed by POInT due to being in DCS blocks not conserved across all species (11, 44) . This was supported by the observation that 46,195 (62%) of the additional paleologs identified by Frackify were not found in the output files of POInT. To assess if these additional paleologs were false positives, we used SHAP values to compare their features. The most consistently important features for paleolog classification in Frackify were 1) the WGD derived syntelog set size and 2) how many genes in the set matched to the same outgroup gene in the ortholog peak (Fig. 10). It is possible that some of the genes identified using these metrics may originate from other WGDs in the Classifying Gene Duplications -22 genomes. To test if this was the case, we examined how many paleologs were found in any syntenic blocks with a mean Ks within three standard deviations of Ks WGD and Ks orth . Filtering by this metric we found that out of 74,745 total unique paleologs Frackify identified across all six species, only 389 were found outside our expected Ks WGD and Ks orth distributions. Although some of these paleologs (14,453) were also located in syntenic blocks pairs originating from older WGD, our analyses suggest that they were duplicated in the youngest WGDs rather than being false positive classifications. Thus, many of the paleologs uniquely recovered by Frackify are not clearly false positives and are likely paleologs missed by other methods.

Summary
Frackify demonstrates how machine learning approaches can be used to explore the evolution of complex, duplicated genomes. By combining multiple features of genome synteny and divergence, Frackify is able to relatively rapidly identify paleologs in diploidized species with both simple and complex duplication histories. Frackify uses a gradient boosted decision tree trained with simulated syntenic blocks developed from the empirical data of 27 paleopolyploid plant species. In testing, Frackify performed well under a wider variety of duplication scenarios, including those with multiple, overlapping WGDs. Finally, we compared Frackify against other sophisticated methods used for identifying paleologs. Across six empirical datasets, Frackify maintained a high degree of consistency with methods such as POInT while also identifying a large set of additional putative paleologs in regions of the genomes not analyzed by these methods. With this tool, researchers can quickly identify paleologs in species of interest while also gathering information about patterns of paleolog retention.

Classifying Gene Duplications-23
Using SHAP values we were also able to gain a deeper understanding of the importance of different genomic features for identifying paleologs. Previous methods have often relied on syntenic information or synonymous divergence and applied complex models or heuristic filters to identify paleologs (11, 27, 28, 30, 32-36, 80) . SHAP values uncovered that syntenic information was the primary feature used by Frackify to classify paleologs. However, Ks information was critical for assigning syntenic blocks to particular WGDs or other duplication events (Fig. 10). Similarly, methods which utilize an outgroup, such as POInT, Frackify, or Quota-Align, are able to resolve paleolog classifications in regions of the genome that are not necessarily detected during self-self syntenic comparisons (11, 34, 44, 80) . As a result, carefully choosing an outgroup to compare against is critical for marking accurate inferences. These inferences suggest that improving ortholog identification with outgroup genomes could be important for better classification and characterization of paleologs. Further merging of synteny and gene family phylogenetic data should improve current paleolog classifications while also allowing for multi-species comparisons. Approaches such as Frackify provide one possible path to combining these disparate types of data to make genomic inferences.

Using Frackify
Frackify has several folders that the user will need to populate with data before running the program. When populating these folders users should use the same gene names for each file and make the prefix (Example: "species") of the file consistent. Examples have been placed in the Git repository for users to emulate formatting.

CDS: This folder contains CDS fasta files for each diploid species the user intends to run
Frackify on. Use the file naming format "species".cds Classifying Gene Duplications-24 2. Self-Self: For each species place either a CoGe Synmap or MCScanX output file for the self -self synteny comparison. These files should have Ka/Ks values appended to them and the gene names match the CDS file. Use the file naming format "species"-Self.txt 3. Self-Out: This folder will contain the same files as described for Self-Self, however for a self -outgroup syntenic comparison. This outgroup should be a diploid closely related to the focal species that does not share the WGD you are analyzing for paleologs. Use the file naming format "species"-Out.txt 4. In the main folder: Supply a CSV file containing information about each species you intend to run Frackify on. This file will have 5 columns: 1) the first will contain each species name; please make this the same as your file names. 2) The second column will contain the Ks orth of the outgroup, typically found between the focal and next oldest WGD. [3][4][5] The third through fifth columns will contain the Ks WGD of the three youngest WGD in the species in ascending order. If only one or two WGD are present in the genome, list the others as "0". An example of this format can be found in the Frackify git repository (https://gitlab.com/barker-lab/frackify/-/tree/master).
The first script to run in the Frackify pipeline will produce an intermediate file that contains the features of each gene summarized in a table. This can be produced using the following command.
python3 Survey.py "species_info_.csv" Following the completion of the script two files will be generated in the " Translated_Data " folder. The first will be the list of features for each gene ( "species.Forest.csv") the second will Classifying Gene  contain the list of all syntelog sets from the self-self and self-outgroup comparisons ( "species.Groups.csv" ). Users can then produce predictions using the main Frackify command.
python3 Frackify.py "Translated_Data/species.Forest.csv" The Frackify script will produce a final csv file in the Predictions folder that contains a list of all the genes in the CDS file and their classification. For single, double, and triple retained genes a separate file will be produced that contains the WGD derived syntelog sets and the outgroup gene they match in each row.
Alternatively, users can train a new boosted decision tree using different simulations.
Simulations can be produced by supplying a CSV file with a list of each desired scenario. Users will provide unique names for each scenario to be simulated in the first column. In the second column, provide the number of syntenic blocks to be simulated per WGD. The third column allows the user to specify additional fractionation from 0-1 where 0 is no additional fractionation and 1 would represent a complete loss of all paleologs. In the fourth column the user can select which WGD will be used for paleolog simulation (if there is more than one WGD in the simulation). In column 5 specify the Ks orth of the outgroup ortholog divergence. Columns 6-8 should contain the Ks WGD of up to three WGD in order of ascending divergence. If less than three WGD are specified, place zeros in the remaining WGD Ks columns. The final three columns are used to identify if each WGD is a duplication (2 = paleotetraploid) or triplication (3 = paleohexaploidy). Similar to Frackify, running these simulations requires only one command.
python3 Sim_Stations.py "file_of_simulation_parameters.csv" Running the Sim_Stations.py script will populate the Simulated_Data folder with data from the simulations that can then be used to train new gradient boosted models. The training script Classifying Gene Duplications-26 " Fracking.Rig.py " has two commands, one that will generate Bayesian optimized hyperparameters for this training set (1), and another that will add a new trained model in the " Models " folder (2). If users wish to use their new model in Frackify, please alter the "Frackify.py" script at line 25 to load the new model. Users should consider this when interpreting results as syntenic blocks can be missed in the case of poor genome assembly quality, when using a highly divergent outgroup, or altering MCScanX/SynMap.pl settings. Below we have included several ways to help determine if paleologs are being missed due to syntenic block inferences.
a. In the final step of Frackify, genes predicted as paleologs are grouped together using the self-outgroup and self-self syntenic information. During this process it is possible that some syntelogs in a given set have disagreeing predictions; for example a triple retained group being split into a single retained gene and double retained pair. Frackify will print out the number of genes that suffer from these problems, a large amount suggesting some syntenic blocks are being undetected. b. The same issue described in point "a" can be observed within the "Groups" file. This problem will manifest as some outgroup genes occurring more than once in the file, suggesting the self-self comparison is missing a large number of syntenic blocks.
Classifying Gene  c. If more than 10% of genes display either of the issues in points "a" or "b" we recommend users refer back to their MCScanX or SynMap results to diagnose why syntenic blocks are being undetected. d. The simplest and most influential method to help limit the number of disagreeing paleolog classifications is to be careful when selecting an out group. Users should take time to carefully identify the least divergent out group possible while also prioritizing genome quality. Highly divergent genomes or those with poor assemblies can lead to inflated false positives and false negatives due to the limitations of synteny detection tools.
2. In the case where older WGD in the outgroup overlap with the orthology peak, it is possible to use quota-align in SynMap to help identify orthologous blocks if the syntenic depth of the WGD is known a priori (34) . While this can limit false positives, it can also result in false negatives for single retained paleologs.