Mirage: A phylogenetic mixture model to reconstruct gene-content evolutionary history using a realistic evolutionary rate model

Reconstruction of gene-content evolutionary history is an essential approach for understanding how complex biological systems have been organized. However, the existing gene-content evolutionary models cannot formulate complex and heterogeneous gene gain/loss processes, which reflect diverse evolutionary events and greatly depend on gene families. In this study, we developed Mirage (MIxture model with a Realistic evolutionary rate model for Ancestral Genome Estimation), which allows different gene families to have flexible gene gain/loss rates, but reasonably limits the number of parameters to be estimated by the expectation-maximization algorithm. Simulation analysis showed that Mirage can accurately estimate complex and heterogeneous gene gain/loss rates and reconstruct gene-content evolutionary history. Application to empirical datasets demonstrated that our evolutionary model better fits genome data from various taxonomic groups than other models. Using Mirage, we revealed that gene families of metabolic function-related gene families displayed frequent gene gains and losses in all taxa investigated. The source code of Mirage is freely available at https://github.com/fukunagatsu/Mirage.


Introduction
Gene gain and loss events in genomes have played essential roles in the evolutionary history of life. Complex biological systems that function through the coordination of numerous genes, e.g., metabolic pathways and signal transduction systems, have been constructed through the accumulation of such events. To answer the fundamental biological question of how such complex systems have been organized, reconstruction of gene content evolutionary history has been established as an important field in bioinformatics (Iwasaki and Takagi 2009;Montague et al. 2014;Fernández and Gabaldón 2020). To date, various algorithms to reconstruct gene content evolutionary history have been developed (Snel et al. 2002;Hahn et al. 2005;Iwasaki and Takagi 2007;Csurös and Miklós 2009;Cohen and Pupko 2010;Ames et al. 2012;Li et al. 2014;Zamani-Dahaj et al. 2016;Li et al. 2019). These algorithms estimate the gene content of ancestral species based on the maximum parsimony (MP) or maximum likelihood (ML) method, where the ML method is known to show better performance (Cohen and Pupko 2011;Ames et al. 2012).
In the ML method, it is important which gene-content evolutionary model is adopted. Given an ortholog table and a phylogenetic tree, the ML method first estimates the evolutionary model parameters, such as gene gain and loss rates, using numerical optimization methods or the expectationmaximization (EM) algorithm. Then, the ML evolutionary history of gene-content is reconstructed based on the estimated parameters. Some ML methods adopt a two-state evolutionary model and require a two-state ortholog table, which contains the presence/absence information of each ortholog group in each genome, and estimate whether each ortholog group existed or not at each ancestral node of the given phylogenetic tree (Cohen and Pupko 2010;Li et al. 2014) (Fig. 1A). Although the twostate evolutionary model is mathematically simple, it is apparently unable to deal with gene copy number variations, which play important roles in the evolution of biological systems (Saitou and Gokcumen 2020). At the opposite extreme, the ML method can also adopt an "infinite-state" (or leastconstraint) evolutionary model and require a "raw" ortholog table, which contains copy number information of each ortholog group in each genome, and estimate the copy number of each ortholog group at each ancestral node (the least-constraint model; Fig. 1B). A major drawback of this approach is that parameter numbers can become too large to be estimated. For example, even if the upper bound of the gene copy numbers is set to ~1,000 (cf., olfactory receptor genes (Niimura 2009)), ~2,000 parameters still need to be estimated at a huge computational cost and overfitting risk.
Between the two extremes of the two-state and least-constraint models, previous studies on the ML method have adopted gene-content evolution models that consider gene copy numbers but limit the numbers of parameters. The simplest model is the two-parameter model, which considers only gene gain and loss rates (Fig. 1C) (Hahn et al. 2005;Iwasaki and Takagi 2007). Other models (Csurös and Miklós (C&M) model and Birth, Death and Innovation (BDI) model) decompose the gene gain rate into gene birth (innovation) and duplication rates, resulting in three parameters (Figs. 1D and 1E) (Csurös andMiklós 2009, Karev et al. 2002;Ames et al. 2012). However, these two-and threeparameter models still cannot formulate complex processes that reflect diverse evolutionary events, e.g., de novo gene birth, tandem duplication, ectopic recombination, and horizontal gene transfers (HGTs). These models ignore tendencies that gene gain rates change non-linearly in accordance with gene copy numbers in case of tandem duplications, and copy numbers of many genes are kept one because of dosage balance conservation.
Another aspect of gene-content evolution that should be considered is that different gene families have different gene gain/loss rates. For example, housekeeping genes are seldom lost from genomes and gene loss rates to zero copies are small for such genes, whereas antibiotic genes are easily lost from genomes. Another example is olfactory receptor genes, which are prone to increase copy numbers and have exceptionally large gene gain rates. To deal with such heterogeneity in gene-content evolution, the phylogenetic mixture model, which is widely used in molecular evolutionary analyses to handle heterogeneous evolutionary rates among different residues (Lartillot and Philippe 2004;Pagel and Maede 2004;Quang et al. 2008;Dang and Kishino 2019), can be adopted. However, in gene-content evolutionary analysis, the phylogenetic mixture model has been adopted only in the ML-based methods that use the two-state model (Spencer and Sangaralingam 2009;Cohen and Pupko 2010;Li et al. 2014;Zamani-Dahaj SA et al. 2016).
In this study, we proposed a realistic evolutionary rate (RER) model, a novel gene-content evolutionary model, and developed Mirage (MIxture model with a RER model for Ancestral Genome Estimation), which adopts the RER model and phylogenetic mixture model for accurate ML reconstruction of gene-content evolutionary history (Fig. 1F). To limit the number of model parameters to be estimated, the RER model enables users to set an upper bound of gene copy numbers (Iwasaki and Takagi 2007;Ames et al. 2012). We verified that Mirage can estimate both model parameters and gene-content evolutionary history with high accuracy using simulated datasets. In addition, we demonstrated that the combination of the RER model and phylogenetic mixture model fitted empirical datasets better than the other models. Finally, we reconstructed gene-content evolutionary histories of several taxonomic groups using Mirage, and revealed that gene families involved in metabolic functions frequently exhibited gene gain/loss events in all taxonomic groups investigated.
2. New Approaches 2.1 Input data The input data for our method are an ortholog table ! and a phylogenetic tree ". ! is a data matrix that consists of # species (genomes) and $ gene families (ortholog groups). ! !,# , which is an element of the species % and the gene family & in the matrix, represents the gene copy number of j in i. The phylogenetic tree " is a binary rooted tree whose branches have branch lengths greater than 0.
The tree has # leaves (external nodes), which correspond to the # species in the ortholog table !. " also has # − 1 internal nodes, which correspond to the ancestral species. The reconstruction problem of gene content evolutionary history is defined as an estimation problem of the gene copy numbers in the ancestral species for each gene family ()).

2.2
The RER model and the phylogenetic mixture model Gene-content evolution is formulated as a continuous-time Markov model, where gene copy numbers and gene gain/loss events are represented as states and state transitions, respectively. Gene gain/loss events in each gene family are assumed to have occurred independently of those in other gene families. In an infinitesimal time *+, a gene gain/loss event of one gene is assumed to have occurred at most once. In the two-parameter model, where the model parameters are a gene gain rate , and a gene loss rate -, transition from a gene copy number . to . + 1 and . − 1 occur at probabilities of ,*+ and -*+, respectively, in *+ (Fig. 1B) (Hahn et al. 2005;Iwasaki and Takagi 2007). Csurös and Miklós developed a model with three parameters: a gene acquisition rate ,, a gene loss rateand a gene duplication rate 0 (Fig. 1C) (Csurös and Miklós 2009). By assuming that HGT is a main mechanism of gene acquisition, the C&M model defines transition rates from . to . + 1 and . − 1 as , + .0 and .-, respectively (note that the gene gain/loss rates change linearly with gene copy numbers). The other three-parameter model, the BDI model, utilizes a novel gene family acquisition rate 1 in addition to a general gene gain rate , and a gene loss rate - (Fig. 1D) (Karev et al. 2002;Ames et al. 2012). This model is basically the same as the two-parameter model, except that the transition rate from 0 to 1 is 1.
In this study, we proposed a RER model, which allows all state transition rates to be different. Instead, to limit parameter numbers, we set the maximum gene copy numbers as a parameter 2 $%& (i.e., gene families having copy numbers larger than 2 $%& are considered to have 2 $%& gene copies) (Fig. 1E). That is, the number of states is 2 $%& + 1 (0 to 2 $%& ). Because 2 $%& is a user-input parameter, the user can freely set it to a reasonable value according their interest (for example, if the user is interested in the evolution of many copy gene families, 2 $%& can be set large at the expense of a parameter number increase). Let 3 be a (2 $%& + 1) × (2 $%& + 1) transition rate matrix. Furthermore, to allow different gene families to have different gene gain/loss rates, we combined the RER model with the phylogenetic mixture model. Instead of assuming that all of the $ gene families evolve under the same transition rate matrix 3 , the phylogenetic mixture model introduces L transition rate matrices (i.e., 3 -, … , 3 . ). Each of the $ gene families is probabilistically assigned to L clusters as a mixture model. Here, L is a user-input parameter.

Parameter estimation and gene-content evolutionary history reconstruction algorithm
The evolutionary model parameter to be estimated is N = {P, 3 -, … , 3 . , Q -, … , Q . }, where P is a Llength vector that is the mixing probability of each gene-content cluster, 3 ! is a transition rate matrix of the %-th gene-content cluster, and Q ! is a (2 $%& + 1)-length vector that is the state occurrence probability at the root node in the phylogenetic tree of the %-th gene-content cluster. Note that the degree of freedom is 32 $%& L + L − 1.
The model parameters are estimated by the EM algorithm (Dempster et al. 1977). The EM algorithm is an ML method for estimating parameters from observed data in statistical models that assume unobserved hidden states. In our model, the observed data is the ortholog table ! , while the unobserved hidden states are the gene-content evolutionary history ) and assignments of each gene family to each gene-content cluster T. The EM algorithm consists of the following four steps. (1) . (4) If the log-likelihood converges, terminate the EM algorithm. Otherwise, substitute N )23 for N /01 and return to the step (2). The steps (2) and (3) can be efficiently calculated using the eigenvalue decomposition of the state transition probability matrices and a dynamic programming method across the phylogenetic tree " (Holmes and Rubin 2002;Quang et al. 2008;Kiryu 2011).
After the parameter estimation, the ML evolutionary history ) \ is reconstructed by a dynamic programming method using the estimated parameters. The reconstruction method is similar to the Viterbi algorithm, which obtains the ML path of hidden states in the hidden Markov model, and also resembles an algorithm for the reconstruction of ancestral protein sequences (Pupko et al. 2000). The details of the algorithms are described in the Supplementary Materials. We implemented the algorithms in C++. The source code is freely available at https://github.com/fukunagatsu/Mirage.

Performance evaluation of Mirage based on simulated datasets
We first evaluated the performance of Mirage using simulated datasets. We simulated the evolution of 10,000 gene families in a phylogenetic tree that had 128 leaves for 10 times, creating 10 datasets. We used the RER model whose maximum gene copy number 2 $%& was set to 3 and the phylogenetic mixture model whose number of gene content cluster L was set to 4 for the simulation. Details are described in the Material and Methods.
Then, we applied Mirage to the 10 simulated datasets to estimate the evolutionary model parameters and reconstruct the gene-content evolutionary history. Fig. 2 shows the relative errors of the estimated model parameters. The medians and maximum values of the relative errors were 1.25 and 7.23 for P, 1.122 and 27.164 for 3, and 14.7565 and 510.67 for Q, respectively. These results imply that Mirage can estimate parameters with high accuracy, although the estimated Q may substantially differ from the true parameter. Such difficulty in root parameter estimation is well known and may stem from the fact that the root node is the topologically the furthest away from the observable leaf nodes (i.e., extant genomes). The accuracy of the reconstructed ancestral states (gene copy numbers) was 84.0% (Table 1). When the phylogenetic mixture model was not used (i.e., L = 1), the accuracy decreased to 81.6%, likely because the heterogeneity of gene-content evolution was ignored. If the copy number states were ignored and only presence/absence information was considered (i.e., if incorrect estimation among the copy numbers 1, 2, and 3 was ignored), the accuracy of presence/absence state reconstruction of the ancestral nodes was 92.0%. On the other hand, when gene copy number information was not used and the two-state model (with the phylogenetic mixture model) was applied (i.e., 2 $%& = 1), the accuracy was 91.9%. This result indicates that the two-state model is sufficiently accurate only to reconstruct presence/absence information. In addition, if the two-parameter, C&M, and BDI models (without the phylogenetic mixture model) were used, the accuracies were 82.8%, 82.6%, and 82.8%, respectively. Note that 2 $%& was set to 3 to compare the models under the same condition.  (Yilmaz et al. 2014;Yarza et al. 2017). The Archaea, Micrococcales, and Fungi datasets comprised 167 species and 11,717 gene families, 111 species and 6,653 gene families, and 123 species and 23,791 gene families, respectively. The largest gene copy numbers among all gene families were 183, 102, and 513 in the Archaea, Micrococcales, and Fungi datasets, respectively, indicating that it would be reasonable to set an upper bound of gene copy numbers to limiting numbers of model parameters. Fig. 3 shows cumulative relative frequency curves of the largest gene copy numbers in each gene family. Approximately 60% of the gene families had one copy at most in any dataset, indicating that the two-state model was not suitable for reconstructing the evolution of approximately 40% of the gene families. In contrast, when we set 2 $%& to 3, the evolution of approximately 90% of the gene families could be reconstructed without the gene copy number cut-off.

Comparison of models and parameters by holdout validation
Next, we compared effects of models and parameters by evaluating holdout performance of Mirage using empirical datasets. We first divided each dataset into gene families of training and test datasets, and estimated the model parameters using the training dataset only. We then calculated log-likelihood of the test dataset. Similarly, parameters were estimated and log-likelihood values were obtained using the two-parameter, C&M, and BDI models for each dataset. Note that 2 $%& was set to 2, 3, or 4 to compare the models under the same condition.
Regardless of 2 $%& or the dataset used, the log-likelihood increased with the increasing number of transition-rate matrices L, except for limited cases, likely because of convergence to local optima by the EM algorithm, and the RER and two-parameter models showed the best and the worst loglikelihood, respectively (Fig. 4). When we divided the training and test datasets in different ways (i.e., by species or by species and gene families), we obtained similar results (Figs. S1-S2). In conclusion, both the RER model and the phylogenetic mixture model yielded a gene-content evolutionary model with high log-likelihood values.
Interestingly, although the C&M and BDI models had the same number of parameters, their loglikelihood values were slightly different (Fig. 4). When the Archaea or Micrococcales dataset was used and 2 $%& was 3 or 4, the C&M model exhibited a larger log-likelihood. On the other hand, when the Fungi dataset was used, the BDI model exhibited a larger log-likelihood. When 2 $%& ≥ 3, the C&M model naturally assumes that gene duplication and loss rates change linearly with gene copies, whereas the BDI model assumes that gene duplication and loss occur at a constant rate regardless of gene copy numbers (Fig. 1). Thus, the difference likely reflects the nature of gene duplications and losses in prokaryotic and eukaryotic genomes. Specifically, the BDI model may be more suitable for eukaryotic evolutionary processes in which meiotic recombination introduces tandem gene duplications and losses, which are basically independent of gene-copy numbers (Fitzpatrick 2012).
3.4 Analysis of estimated evolutionary model parameters Next, we applied Mirage to each of the complete Archaea, Micrococcales, and Fungi datasets. Based on an investigation of gene copy numbers and holdout validation results, we used L = 5 and 2 $%& = 3. Estimated model parameters are presented in Table 2, Fig. S3, and the Supplementary Data. In all datasets, the gene gain rates ([3 7 ] !,!8-) tended to be smaller than the gene loss rates ([3 7 ] !,!9-), being consistent to a previous study (Cohen and Pupko 2010). divided by the minimum of these values among each dataset (Fig. S4). The maximum normalized cluster evolutionary rates were 60.0, 4.9, and 12.1 for the Archaea, Micrococcales, and Fungi datasets, respectively, indicating that different gene-content clusters have different evolutionary rates. The Archaea dataset exhibited the largest difference, where the gene-content clusters 1, 2, and 3 exhibited large, moderate, and small normalized cluster evolutionary rates, respectively (Table 2). We also investigated whether specific gene functions were enriched in specific gene-content clusters. We used EGGNOG database version 4.0 for gene annotation to COG, arCOG, and NOG gene families and version 3.0 for gene annotation to KOG category (Powell et al. 2012;Powell et al. 2014). After removing "poorly characterized" supercategories, we observed differences in the enriched COG supercategories among gene-content clusters (Tables S1-S3).

Reconstruction of the gene-content evolutionary history
We then reconstructed the gene-content evolutionary history for each dataset using Mirage. We first counted gene gain/loss events in each gene family from the reconstructed evolutionary history (Fig.  S5). In all datasets, gene gain/loss events were rare in most gene families, whereas some gene families exceptionally frequently experienced gene/gain loss events. Tables S4-S6 list the 20 gene families with the most frequent gene gain/loss events for each dataset. Many transposase genes were commonly found in all three datasets, whereas two gene families, COG0286 (HsdM) and COG1848, commonly appeared in the lists of the Archaea and Micrococcales datasets. COG1848 is a function-unknown gene family, whereas COG0286 is annotated as a DNA methylase subunit of the type I restrictionmodification system. It is reasonable that this restriction-modification system has been spread by HGT, as is well known for the type II system (Jeltsch and Pingoud, 1996).
Finally,, we investigated whether specific gene functions were enriched in the gene families with frequent gene gain/loss events. First, we examined differences in the distributions of the COG supercategories between the top 10% of gene families with frequent gene gain/loss events and entire gene families (Table 3). Based on a chi-square test with Bonferroni's multiple correction, we found that the "metabolism" supercategory was significantly enriched in the gene families with frequent gene gain/loss events in all datasets. When we analyzed which categories in the "metabolism" supercategory were enriched in the different datasets, we found that different categories were enriched (Table 4). In the Archaea dataset, gene families in categories C, "Energy production and conversion", and P, "Inorganic ion transport and metabolism", were the most enriched, probably reflecting the diverse ways in which Archaea obtain energy. In the Micrococcales and Fungi datasets, gene families in categories G, "Carbohydrate transport and metabolism", and Q, "Secondary metabolites biosynthesis, transport, and catabolism", were highly enriched, probably reflecting rich secondary metabolism functions of those taxonomic groups. COG category symbols: C, energy production and conversion; E, amino acid transport and metabolism; F, nucleotide transport and metabolism; G, carbohydrate transport and metabolism; H, coenzyme transport and metabolism; I, lipid transport and metabolism; P, inorganic ion transport and metabolism; Q, secondary metabolites biosynthesis, transport, and catabolism.

Discussion
In this study, we developed Mirage, which adopts the RER model and phylogenetic mixture model for accurate ML reconstruction of gene-content evolutionary history. We demonstrated that the combination achieved good performance based on simulated and empirical datasets. The differences between the RER model and the UNREST model in DNA evolution models are noteworthy, because elements of the UNREST evolutionary rate matrix are independent of each other (Yang 1994). The first difference lies in the way of estimating Q. Whereas Q is modeled as the stationary distribution of the Markov process formulated by the parameter matrix 3 in the UNREST model, Q and 3 are defined as independent parameters in the RER model, because of the difficulty of assuming the stationarity in the gene-content evolution. Second, the RER model assumes that [3] !,# = 0 when |% − &| > 1, like all other gene-content evolution models. Although simultaneous acquisitions/ losses of multiple genes may occur during evolution, they can be modeled as consecutive gene acquisitions/losses in a very small period of time by the RER model.
The setting of the gene-content category number K is also an important problem in Mirage. Although the Akaike Information Criterion (AIC) is a widely used estimator for model selection in phylogenetics, AIC can only be applied to statistically regular models, and the phylogenetic mixture model is a nonregular model. Another popular technique for model selection is the non-parametric Bayesian method. This method can be applied to non-regular models, but requires a lot of computation time. A practical model selection method for non-regular models is an unsolved problem in statistics, and various methods have been proposed (Fujimaki and Morinaga 2012;Watanabe 2013). The integration of Mirage with model selection methods for non-regular models is a future task. As an application of Mirage, we envision function prediction of function-unknown genes by integrating it with phylogenetic profiling to be an interesting direction. The phylogenetic profiling method predicts gene functions based on correlated occurrence patterns between genes in an ortholog table (Kensche et al. 2008;Kumagai et al. 2018;Sherill-Rofe et al. 2019). The method generally ignores evolutionary relationships, for example by using simple mutual information as an index of correlation, and this ignorance is known to decrease prediction performance (Kensche et al. 2008). Previous studies have shown that the prediction performance can be improved by observing correlation patterns of gene gain/loss events in the reconstructed gene-content evolutionary history instead of gene occurrence patterns in extant species (Barker et al. 2007;Ta et al. 2011;Moi et al. 2020). Thus, precise reconstruction of the gene-content evolutionary history by Mirage would contribute to the improvement of the phylogenetic profiling method.
With the present Mirage implementation, it is still difficult to reconstruct the gene-content evolutionary histories of all genome-sequenced species at once because of the huge computation time required. Thus, improving the computation time of Mirage is essential. In particular, application of the series acceleration method, which improves the convergence rate of a series, to the iteration steps of the EM algorithm seems promising. Specifically, the vector-` acceleration technique, which does not require derivation of acceleration formula for each statistical model, may be readily applied to Mirage (Kuroda and Sakakihara. 2006). Another powerful approach would be a partitioning method, which does not use probabilistic but deterministic assignment of gene families to each gene cluster in the phylogenetic mixture model. This method has been widely used in molecular evolutionary analyses, but not in gene-content evolutionary analyses (Brown and Lemmon. 2007;Frandsen et al. 2015;Lanfear et al. 2017). Although the partitioning method can be less accurate due to the deterministic approximation, its computational efficiency would be high.
Although Mirage can model differences in the evolutionary rates among gene-content clusters, it assumes the same evolutionary rate among all branches of the phylogenetic tree. However, this assumption does not always hold true. For example, polyploidization events cause massive gene gains (Inoue et al. 2015;Sriswasdi et al. 2016), and parasitization events cause massive gene losses (Sun et al. 2018). Moreover, heterogeneity of evolutionary rates among branches may also be caused by changes in survival strategies (Sriswasdi et al. 2017) or large-scale extinction events (Wolf and Koonin 2013). Our knowledge of the nature of branch-wide gene gain/loss rate heterogeneities is still very limited, and the expansion of Mirage in this direction would be needed to deepen our understanding. Iwasaki and Takagi developed a heterogeneous gene-content evolution model, but their model had a problem with an increase in parameter numbers by assigning independent parameters for each branch (Iwasaki and Takagi 2007). Instead, a branch-level parameter mixture model that divides branches into clusters with the same evolutionary model parameters may be promising.

Parameter estimation and the data analysis
The EM algorithm is guaranteed to converge to a local optimum but not to a global optimum, and thus, the estimation results can depend on the initial values of the model parameters. Therefore, we estimated the parameters 100 times using the EM algorithm for each dataset and each model setting, and we adopted the estimation results with the highest data likelihood in the training dataset. We compared the RER model with the two-parameter, C&M, and BDI models with restriction of the maximum value of the gene copy number value. Because a software program to estimate the parameters under these model settings was not available, we implemented our own software. Technical details are described in the Supplementary Materials. In the simulated dataset analysis, we used model settings that mimic those in previous studies for comparison. Specifically, we investigated the model settings of (1) the RER model (L = 4, 2 $%& = 1) (Cohen and Pupko 2010), (2) the two-parameter model (L = 1, 2 $%& = 3) (Hahn et al. 2005;Iwasaki and Takagi 2007), (3) the C&M model (L = 1, 2 $%& = 3) (Csurös and Miklós 2009) and (4) the BDI model (L = 1, 2 $%& = 3) (Karev et al. 2002;Ames et al. 2012). Additionally, we investigated the model setting that changed L of the true model setting from 4 to 1 to assess the effect of the phylogenetic mixture model on the performance. As the evaluation criteria for the reconstructed evolutionary history in the simulated dataset analysis, we used the proportion of gene families whose gene copy numbers were correctly estimated in ancestral nodes, except for the two-state model. For the two-state model and the true model settings, we evaluated the estimation accuracy of the presence or absence of gene families. We used the averaged accuracy from the 10 simulated datasets.
In the empirical dataset analysis, we tested L values from 1 to 15 and 2 $%& values from 2 to 4. We investigated the two-parameter, C&M, BDI model, and RER models as evolutionary models.

Preprocessing of the empirical datasets
We extracted gene family data of Archaea, Micrococcales, and Fungi genomes from the ortholog table in the STRING database version 11.0 (Szklarczyk et al. 2019). We used the NCBI Taxonomy database for taxonomic annotation. Next, we retrieved species in the phylogenetic trees provided by the Genome Taxonomy Database release 89 (Parks et al. 2019) for Archaea and Micrococcales, and those provided by the SILVA database release 111 (Yilmaz et al. 2014;Yarza et al. 2017) for Fungi. Then, we removed species data that were only included in either the ortholog tables or the phylogenetic trees from these datasets. In the Fungi phylogenetic tree, for some species, there were multiple strains per species. For these species, we randomly selected one strain and removed the others. Then, we reshaped the phylogenetic trees to satisfy the following three conditions: (1) a leaf of the phylogenetic trees always corresponds to a species, (2) tree topology of the tree is a binary tree, and (3) the distances and the phylogenetic relationships between species are the same as in the original tree. Because there were branches with branch lengths of 0 in the Fungi phylogenetic tree, we added a pseudo length 0.0001 to all tree branches. Note that the minimum branch length excluding 0 in the tree was 0.00059, which is larger than 0.0001. Finally, we removed their gene families shared by # or # − 1 or only one organism from the ortholog tables. The constructed datasets are freely available at https://github.com/fukunagatsu/Mirage.

Division of the datasets into the training and test datasets
We divided the datasets in the following three ways. For experiment 1, we randomly divided the gene families into training and test datasets at a 3:1 ratio for each dataset. In this method, the species were common between the two datasets. For experiment 2, for each dataset, we randomly divided the species into training and test datasets at a 1:1 ratio and removed the gene families that were not present or shared by only 1 or . − 1 or . species in each dataset after the division. Here, . is defined as the number of species in the dataset after the division. In this method, some gene families were shared between the training and test datasets. For experiment 3, we further processed the datasets obtained by the second method. In particular, we randomly assigned gene families shared between the training and test datasets to either of the datasets, so that no gene families were shared between the two datasets. Therefore, in this division, both the species and the gene families differed between the two datasets. The numbers of gene families for each dataset in the experiments 2 and 3 are listed in Table S7. Distributions of the relative errors of the estimated parameters for the simulated datasets. A relative error is 100 × |the estimated value − the true value|/the true value . The x-axis and y-axis represent the relative error and the frequency, respectively. Distributions for (A) φ, (B) R, and (C) π are shown. Note that R and π have many parameters and thus their frequencies are larger than that of φ.

Fig. 3
Cumulative distribution curves of the maximum gene copy numbers in the empirical datasets. The xaxis and y-axis represent the maximum value of the gene copy numbers and the cumulative relative frequency, respectively. The Archaea, Micrococcales, and Fungi datasets are represented by red, black, and blue lines, respectively.