Accurate inference of tree topologies from multiple sequence alignments using deep learning

Reconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with their own strengths and weaknesses. Both simulation and empirical studies have identified several “zones” of parameter space where accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e. assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. Here we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. While numerous practical challenges remain, these findings suggest that deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.


Introduction
Reconstructing the historical relationships among biological sequences remains one of the most important challenges in evolutionary biology. In order to reconstruct a tree it is first necessary to acquire orthologous (Koonin 2005) and/or paralogous (Hellmuth, et al. 2015) sequences from a sample of species or populations. Generally, the next step involves producing a multiple sequence alignment (MSA) in order to compare orthologous sequence characters, although in recent years a number of methods that bypass explicit orthology and alignment searches have been proposed (Hohl and Ragan 2007; Bonham-Carter, et al. 2013). Finally, any one of a plethora of tree reconstruction methods can be employed to infer a tree from an MSA.
Determining the topology of a tree is an NP-hard problem (Roch 2006), as the space of possible solutions grows double-factorially with increasing numbers of sequences (Felsenstein 1978b).
We are thus limited to methods that utilize various heuristics and/or make assumptions about the process of sequence evolution in order to efficiently traverse tree space, including maximum parsimony (MP: Farris 1970; Fitch 1971), distance-based approaches such as neighbor joining (NJ: Saitou and Nei 1987), maximum-likelihood (ML: Felsenstein 1981), and Bayesian inference (BI: Rannala and Yang 1996; Li, et al. 2000; Huelsenbeck and Ronquist 2001). Although none of these methods are guaranteed to produce a global solution (i.e. the true tree topology), one can evaluate and compare their effectiveness.
Supervised machine learning algorithms represent an alternative framework for drawing inferences from data, including sequence data. These methods have been successfully applied to various areas of biological research (Tarca, et al. 2007), and have recently been introduced to evolutionary biology, population genetics in particular (Schrider and Kern 2018). Briefly, given some multivariate observation X that is associated with a response variable y, supervised machine learning seeks to create a function f(X) that predicts y. This is achieved through a process called training, wherein a set of observations with known response variables are examined by an algorithm that creates and tunes the function to minimize the disparity between f(X) and y on this training set. Typically, X is summarized by a vector of values, or features. Convolutional neural networks (CNNs: Lecun, et al. 1998), a class of deep learning algorithms (Goodfellow, et al. 2016), are able to learn to extract informative features from data arranged in a matrix in order to make a prediction; thus CNNs do not require a predefined feature vector. CNNs have proved extremely effective in analyzing image data (Krizhevsky, et al. 2012) and have recently been shown to achieve impressive accuracy on a number of tasks in population genetic inference when applied to MSAs (Chan, et al. 2018; Flagel, et al. 2018). CNNs and other deep learning methods may thus yield sizeable gains in a number of inference tasks involving biological sequence data.
Here we propose a novel approach applying CNNs to phylogenetic inference. We show that CNNs can be trained to extract phylogenetic signal from an MSA and use it to accurately reconstruct unrooted gene tree topologies (i.e. here our response variable y is a tree topology) when given an alignment of four sequences. We adopt a classification approach, wherein the CNN is trained to discriminate among discrete data classes which in this case correspond to the three possible tree topologies. We find that our CNN approach is highly accurate, can naturally include indel information, is fairly robust to classical biases affecting tree reconstruction, and has several other benefits over existing methods. Below we describe our methodology and the CNN's performance in detail before concluding with a brief discussion of limitations, potential extensions, and outstanding practical challenges for adapting deep neural networks to phylogenomics.

MSA Simulations
For the four-taxon case there exist exactly three possible unrooted tree topologies: ((A,B),(C,D)), ((A,C),(B,D)) and ((A,D),(B,C)) in Newick notation. We simulated an equal number of MSAs from each of these topologies using INDELible (Fletcher and Yang 2009), which allows for the inclusion of indels. The MSA length was set to 1000 sites unless stated otherwise. Trees with various branch length configurations were sampled uniformly from a "branch-length space" (BLspace) that encompasses a wide variety of branch lengths and nodal rotations (see Supplementary Text and Supplementary Fig. S1). Tree branch lengths are expressed in expected number of substitutions per site.

With the tree topology and branch lengths in hand, sequences were generated under a randomly assigned Markov model of nucleotide substitution using INDELible. We used five nested models of increasing complexity: JC (Jukes and Cantor 1969), TIM, TIMef (Posada 2003), GTR (Tavaré 1986) and UNREST (Yang 1994), accommodating rate heterogeneity along the sequence by drawing the shape parameter of a continuous gamma distribution (+G) from U(0,4). We also allowed the proportion of invariable sites (pinv) in an MSA to vary, drawing this parameter (+I) from U(0,1). We note that the average expected number of substitutions per site (i.e. the branch length) is specified for variable sites only, not for the entire MSA. For example, when simulating an MSA of length 1000 with pinv = 0.5, the specified branch lengths refer only to the 50% of sites that are free to vary. Substitution rate parameters and nucleotide frequencies were drawn from U(0,3) and a four-category Dirichlet distribution (α = 30), respectively. Finally, when including indels in the simulation we used identical fixed insertion and deletion rates (λI = λD = 0.01) and drew indel lengths from a Zipf distribution with α = 1.5, similar to empirical estimates (Vialle, et al. 2018), and a maximum indel size of 50. For each topology class we simulated either 5×10⁴, 1.5×10⁵ or 3×10⁵ MSAs for training and 2×10⁴ MSAs for validation and test datasets. Additionally, we simulated test sets consisting of 3000 trees from each of seven regions of the tree-shape space that may potentially cause biased tree topology inference, including the Farris and Felsenstein zones that have long been known to negatively affect the performance of classical methods (Swofford, et al. 2001). The parameters for each of these were similar to those stated above, with branch lengths drawn from the distributions shown in Table 1.
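As an illustrative sketch only (the actual pipeline passes these parameters to INDELible; all function and variable names below are our own, not from the original implementation), the per-MSA parameter draws described above could look like this:

```python
import numpy as np

rng = np.random.default_rng()

def draw_simulation_params(max_indel=50):
    """Draw one set of per-MSA simulation parameters as described in the
    text (illustrative sketch; not the authors' code)."""
    model = rng.choice(["JC", "TIMef", "TIM", "GTR", "UNREST"])
    gamma_shape = rng.uniform(0, 4)         # +G heterogeneity shape, U(0,4)
    p_inv = rng.uniform(0, 1)               # proportion of invariable sites, U(0,1)
    sub_rates = rng.uniform(0, 3, size=6)   # substitution rate parameters, U(0,3)
    base_freqs = rng.dirichlet([30.0] * 4)  # nucleotide frequencies, Dirichlet(alpha=30)
    indel_len = rng.zipf(1.5)               # indel length ~ Zipf(alpha=1.5) ...
    while indel_len > max_indel:            # ... rejected until <= 50
        indel_len = rng.zipf(1.5)
    return model, gamma_shape, p_inv, sub_rates, base_freqs, indel_len
```

Each draw then parameterizes one simulated alignment, so every training MSA carries its own substitution model, rate heterogeneity, and indel length distribution.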
Additionally, we simulated trees within the so-called "twisted" Farris zone (McTavish, et al. 2015) with asymmetrical branch lengths, as well as within a region where B5 is short. Throughout this paper we denote the sister branches (or their branch lengths) on the left side of the tree as B1 and B2, those on the right side as B3 and B4, and the internal branch as B5. The overall pipeline is summarized in Figure 1.
The internal branch length (B5) for these zones was drawn from U(0,0.05) (Table 1). In order to investigate the effect of internal branch length on each method's performance in more detail, we also created variants of the Farris and Felsenstein zones where B5 was drawn from U(0,1) (Supplementary Table S1). Previous simulation studies suggest that methodological biases primarily arise when an internal branch is very short, i.e. internal branch length ≤ 0.05 (Swofford, et al. 2001). However, since these studies followed overly simplified simulation approaches (e.g. MSA simulation under the JC model only; Siddall 1998), this conclusion may not hold for MSA datasets generated using more complex simulation strategies that incorporate various substitution models, rate heterogeneity, varying proportions of invariant sites and indels.

CNN Architecture and Training
This section describes the structure of our neural network and the training procedure, and thus many of the concepts herein may be unfamiliar to some readers. We refer those interested in an accessible overview of CNNs in the context of sequence alignments to Flagel, et al. (2018). We implemented our CNN in Python using version 2.2.4 of the Keras API (https://keras.io/) with TensorFlow 1.12.0 (Abadi, et al. 2016) as the backend. To extract phylogenetically informative features from an MSA we used a 1D convolution-pooling strategy. We represented each MSA as a matrix with rows corresponding to sequences and columns to sites in the alignment, with each character in the alignment encoded as a number ('A':0, 'T':1, 'C':2, 'G':3, '-':4), and encoded the MSA's associated topology as a numeric class label. Since indels result in MSAs of varying lengths and our CNN requires all input matrices to have equal size, shorter alignments were extended by "padding" the MSA with additional columns, wherein each value was arbitrarily set to -15, until the MSA's length matched that of the longest alignment in our simulated data. Our CNN consisted of eight convolution layers with ReLU activation (Nair and Hinton 2010), each followed by an average-pooling step, one 1024-node fully connected layer with ReLU activation, and an output layer with three nodes corresponding to the three possible topology classes. We used softmax as the activation function for our output layer in order to assign a class membership probability to each of the three classes. We used categorical cross-entropy as our loss (i.e. error) function. During training we used the adaptive moment estimation (Adam) optimizer (Kingma and Ba 2014).
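A minimal sketch of the encoding and padding scheme just described (the helper names are ours, not from the original implementation):

```python
import numpy as np

CHAR_MAP = {'A': 0, 'T': 1, 'C': 2, 'G': 3, '-': 4}
PAD_VALUE = -15  # arbitrary pad value, as described in the text

def encode_msa(sequences, target_length):
    """Encode a quartet MSA as a (4, target_length) numeric matrix,
    right-padding alignments shorter than target_length with PAD_VALUE."""
    mat = np.full((len(sequences), target_length), PAD_VALUE, dtype=np.int8)
    for i, seq in enumerate(sequences):
        mat[i, :len(seq)] = [CHAR_MAP[c] for c in seq.upper()]
    return mat

x = encode_msa(["ACG-", "ACGT", "A-GT", "ACCT"], target_length=6)
# x has shape (4, 6); the last two columns hold the pad value
```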
We set the filter size for the first convolution operation to 4×1, reasoning that it would potentially capture phylogenetic signal by striding across the MSA and examining one column of the alignment at a time. The total number of filters was set to 1024 for the first and second convolutional layers and to 128 for the remaining convolutional layers. The first average-pooling operation size was set to 1×1 (i.e. no pooling); all subsequent convolution steps had equal kernel sizes of 1×2, and the subsequent pooling steps had sizes of 1×4, 1×4, 1×4, 1×2, 1×2 and 1×1. In order to avoid possible internal covariate shift and model overfitting we used batch normalization (Ioffe and Szegedy 2015) followed by dropout regularization with a rate of 0.2 for each convolution layer as well as for the hidden layer. During training we used a minibatch size of 150 MSAs. The training procedure was run for a number of iterations determined by the stopping criterion (validation loss difference between the best prior iteration and the current iteration < 0.001) and patience value (the number of consecutive iterations that must satisfy the stopping criterion before halting training). We observed that the results of our training runs were fairly stable, with a given CNN architecture consistently achieving the same final accuracy on the validation set. To further examine performance during training, we visualized the surface of the loss function and training trajectories across the CNN's parameter space (i.e. the possible values of all weights in the network) as described below. This revealed that there appear to be many readily reachable local minima of the loss function that all have roughly equivalent values (Supplementary Fig. S2), lending further support to the notion that training results should be fairly stable, at least for the datasets and architectures examined here.
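The architecture described above might be sketched in Keras as follows. This is our own approximation reconstructed from the text, not the authors' code; the exact arrangement of the later pooling steps is ambiguous in the description, so the sizes below are one plausible reading:

```python
from tensorflow.keras import layers, Input, Model

def build_quartet_cnn(aln_len=1000, n_classes=3):
    """Approximate reconstruction of the CNN described in the text."""
    inp = Input(shape=(4, aln_len, 1))
    # Layer 1: 1024 filters of size 4x1, examining one alignment column at a time
    x = layers.Conv2D(1024, (4, 1), activation="relu")(inp)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)  # no pooling after the first convolution (1x1)
    # Layers 2-8: 1x2 convolutions; 1024 filters for layer 2, 128 afterwards
    filters = [1024, 128, 128, 128, 128, 128, 128]
    pools = [(1, 4), (1, 4), (1, 4), (1, 2), (1, 2), (1, 1), (1, 1)]
    for f, p in zip(filters, pools):
        x = layers.Conv2D(f, (1, 2), activation="relu", padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.2)(x)
        if p != (1, 1):
            x = layers.AveragePooling2D(pool_size=p)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="relu")(x)  # fully connected hidden layer
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)  # topology probabilities
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The softmax output yields the three class membership probabilities used later as confidence scores.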

Testing Procedures
We sought to compare the performance of the CNN vs. conventional tree inference methods, namely MP, NJ, ML and BI, using our simulated test datasets. MP reconstructions were performed using PHYLIP's dnapars program v3.696 (Felsenstein 1989) via the interface provided by SeaView v4.7 (Gouy, et al. 2010). ML analysis was implemented in IQ-TREE v1.6.5 (Nguyen, et al. 2015), restricting model selection to only the models that were used to simulate MSAs, i.e. JC, TIM, TIMef, GTR and UNREST with +I and/or +G, where the continuous-rate gamma distribution was approximated by a discrete gamma with 8 categories (G8). For each MSA, the ML search was run independently 100 times; along with the inferred tree, the likelihood scores from the best ML run were recorded for the likelihood-mapping analysis (Strimmer and von Haeseler 1997). The NJ trees were generated using the distance function of the best-fit model in IQ-TREE. Note that ML and NJ approaches infer a rooted tree under the UNREST model, thus every such estimated tree was unrooted for further comparisons. Bayesian tree reconstruction was carried out in ExaBayes v1.4.1 (Aberer, et al. 2014) under default parameters, starting with a random initial topology. Three independent MCMC chains were run for 10⁶ generations each, with two coupled chains.
Tree support was assessed by standard nonparametric bootstrap (Felsenstein 1985) for MP, NJ, ML and CNN, and by posterior split probability for BI. We also assessed tree support from the CNN by simply using the estimated posterior class membership probabilities emitted by the output layer. Then, for each method we calculated topological accuracy as the number of correctly returned topologies divided by the total number of inferences performed.
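Both quantities defined above, topological accuracy and the CNN's support score, are simple to compute. A minimal sketch (helper names are ours, not from the original implementation):

```python
import numpy as np

def topological_accuracy(true_labels, inferred_labels):
    """Fraction of test MSAs whose inferred topology matches the true one."""
    true_arr = np.asarray(true_labels)
    inf_arr = np.asarray(inferred_labels)
    return float(np.mean(true_arr == inf_arr))

def cnn_support(class_probs):
    """Support for the chosen topology: the posterior class membership
    probability emitted by the softmax output layer for the argmax class."""
    probs = np.asarray(class_probs, dtype=float)
    return float(probs.max())

acc = topological_accuracy([0, 1, 2, 0], [0, 1, 1, 0])  # 3 of 4 correct -> 0.75
```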

CNNs are more Accurate than Standard Methods on Gapped and Ungapped MSAs
Initially, we trained our CNN on gapped and ungapped training sets of 5×10⁴ MSAs (each with 1000 sites) for each of our three possible topologies, and in each case evaluated our accuracy on an independent test set. We compared the accuracy of our CNN on simulated test data (3×10⁴ MSAs per topology) to that of several other standard tree estimation methods, namely maximum parsimony (MP), neighbor joining (NJ), maximum likelihood (ML), and Bayesian inference (BI). Strikingly, the CNN was substantially more accurate than the other methods (McNemar test, P < 1×10⁻¹⁰ for each comparison with the CNN) when applied to gapped MSAs (Figure 2).
However, this was not the case when testing methods on ungapped MSAs: here the accuracy of the CNN (0.77), which was in this case also trained on ungapped MSAs, was marginally yet significantly lower than that of the best performing method, NJ (accuracy = 0.776; McNemar test, P < 1×10⁻¹⁰). Nevertheless, the CNN's accuracy improved with increasing amounts of training data and reached higher accuracy (0.778; Figure 2) than all other methods when trained on 3×10⁵ ungapped MSAs per topology, though the difference with the next best method (NJ) was not significant (McNemar test, P = 0.428).
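The pairwise method comparisons above rely on McNemar's test applied to paired classification outcomes. A minimal continuity-corrected version (our own implementation for illustration, not the authors' code) can be written as:

```python
from scipy.stats import chi2

def mcnemar_test(b, c):
    """McNemar's test with continuity correction for two paired classifiers.
    b: count of test MSAs method A classified correctly but method B did not;
    c: count of test MSAs method B classified correctly but method A did not."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = chi2.sf(stat, df=1)  # chi-squared survival function, 1 d.f.
    return stat, p_value

stat, p = mcnemar_test(b=120, c=60)  # asymmetric discordance -> small p
```

Because both methods are evaluated on the same test MSAs, only the discordant pairs (b and c) are informative about which method is more accurate.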
We also investigated the relationship between several simulation parameters and inference accuracy. When sequences are very closely related, the MSA may consist largely of invariant sites, which are phylogenetically uninformative. One of the parameters of our simulations was the fraction of sites that were not allowed to vary (pinv; see Methods). We found that for each method, misclassified MSAs from the gapped test set had a larger value of pinv on average (Wilcoxon rank sum test, P < 1×10⁻¹⁰ for each standard method and P = 1×10⁻⁶ for the CNN), but this difference was far less pronounced for the CNN (Figure 3A). Similarly, for the other methods misclassified MSAs had elevated values of the gamma shape parameter α (Wilcoxon rank sum test, P < 1×10⁻¹⁰ for all standard methods; Figure 3B). Similar results hold for ungapped MSAs (Supplementary Figure S3). Together these results suggest that CNNs are more robust to higher values of pinv and α than traditional methods. Additionally, we note that the DNA substitution model did not affect any method's performance (Fisher exact test, P > 0.05 for all methods).

To further investigate the ability of the gapped-MSA CNN to exploit indel information, we simulated MSAs of length 1000 with no substitutions, allowing only indels. Under these conditions all standard methods fail to achieve consistency (Warnow 2012; Truszkowski and Goldman 2016), and MP, NJ, and BI all had accuracy of 0.33 or lower, no better than random guessing. By contrast, the CNN's accuracy on this set was 0.965, and ML's was 0.94, which is surprising given that this method may not be consistent under this scenario (Warnow 2012).
Together these results show that our CNN, when trained on gapped MSAs, is in principle able to incorporate indel information into its phylogenetic inference. Because the CNN implicitly constructs substitution and indel process models by examining the simulated training data, alignment error could impact inference in practice. This could be handled in one of two ways: 1) by incorporating alignment errors into the training set (e.g. by running a sequence aligner on training data prior to training), or 2) by judging the reliability of indels within the MSA and masking particular sites or omitting entire MSAs that appear to be unreliable (e.g. following Castresana 2000; Misof and Misof 2009; Ashkenazy, et al. 2014; Fujimoto, et al. 2016). Another potential problem is model misspecification: accuracy could suffer if indel rates/lengths differ dramatically between simulated training data and real datasets. Future work will have to investigate strategies to train a network that is robust to such misspecification, perhaps by incorporating a variety of indel models into training.

Topology Estimation Under Branch Length-Heterogeneity
There exist several regions in the branch-length parameter space where common methods fail to produce the correct topology. The bias known as long branch attraction (LBA) is the inability of MP (Felsenstein 1978a) and possibly Bayesian (Kolaczkowski and Thornton 2009; Susko 2015) methods to estimate the correct topology in the presence of long nonadjacent branches (i.e. the Felsenstein zone; Table 1; Figure 4). Conversely, ML can fail to correctly infer topologies that have adjacent long branches (i.e. the Farris zone; Table 1; Figure 4); we note that this bias may diminish when using longer alignments due to ML's property of consistency (Swofford, et al. 2001). Additionally, we assessed the performance of all methods in the asymmetrical ("twisted") Farris zone (Table 1; Figure 4B), where a single incorrect topology is favored if the distance function is convex (McTavish, et al. 2015). Lastly, we examined trees with a short internal branch (≤ 0.05) but external branch lengths varying along the interval [0,1] (Table 1; Figure 4D).
We primarily focused on these four regions of known bias, and also examined several other regions of potential bias, on test sets of gapped (Figure 4 and Supplementary Fig. S4) and ungapped (Supplementary Figs. S5, S6) MSAs. In most of these regions, our CNN trained and tested on gapped alignments exhibited accuracy roughly matching or exceeding that of the best traditional method, whereas every other method failed catastrophically in at least one of these zones (Figure 4). The one exception was the Felsenstein zone, where our CNN's accuracy was 35.3%, slightly above the random expectation of 1/3, and second to ML (47.6%; Figure 4); however, the CNN still substantially outperformed the remaining methods in this case (16.1% accuracy for BI and ≤3% for both MP and NJ). The CNN showed accuracy superior to the other methods in each of the remaining regions of potential bias (Supplementary Table S1; Supplementary Fig. S4).
Generally, for gapped MSAs the CNN's performance depends mostly on the length of the internal branch (B5), with no misclassified topologies for trees with B5 > 0.4 (Supplementary Fig. S4), whereas other methods may incorrectly infer topologies across the entire range of 0 ≤ B5 ≤ 1. Our analysis also revealed that although for the Farris and Felsenstein zones the internal branch length is typically defined as 0.05 substitutions per site or less (Swofford, et al. 2001), in some cases elevated error rates may hold for longer internal branches as well ("extended" Farris and Felsenstein zones in Supplementary Figs. S4, S6).
On ungapped data, the CNN's performance roughly matched that of BI across zones. These two methods both exhibit acceptable but not exceptional accuracy in each of these different zones (Supplementary Fig. S6), perhaps with the exception of the Felsenstein zone, where all methods other than ML have very low accuracy (≤15.5% versus 47.0% for ML).

[Figure 4 caption: Results in the Farris zone, the twisted Farris zone, the Felsenstein zone, and trees with short internal branches. The top row illustrates these tree branch length configurations, and the second row shows each method's accuracy in each zone (with violins representing accuracy on bootstrap replicates from the test set). Panels three through seven show the locations (gray dots) and densities (heat map) of incorrectly inferred topologies for each method, plotted in two-dimensional space with B5 length on the x-axis and B1+B2 (labeled as "B1+2"), B1+B2 ("B1+2"), B1+B3 ("B1+3"), or B1+B2+B3+B4 ("B1+2+3+4") on the y-axis for the Farris, twisted Farris, Felsenstein and short internal branch regions, respectively. See Table 1 for branch length simulation details.]

Classification Accuracy Increases with Alignment Length
Statistical consistency is defined as the ability of a method M to return the true parameter value θ (e.g. a tree topology) with probability approaching unity as the amount of data D increases (with alignment length measured in nucleotides, n). More formally, a method is consistent when P(M(D, n) = θ) → 1 as n → ∞. Statistical consistency has been considered one of the major criteria for a good tree reconstruction method (Felsenstein 1978a; Goldman 1990; Penny, et al. 1992). Consistency of machine learning models has only been formally proven for a multilayer perceptron regression estimator with a particular network architecture (Mielniczuk and Tyrcha 1993). Generally, it is not trivial to analytically derive global theoretical guarantees for neural networks due to their complexity and flexibility, as exemplified by their large number of hyperparameters (e.g. different network architectures, activation and loss functions, and optimization algorithms), which can be tuned to accommodate a certain learning problem. Additional factors such as the sizes of both the training dataset and the input data (in our case, the number of sequences in the MSA and the MSA's length) may also affect the degree of consistency. Here we considered three consistent methods (provided the correct substitution model is used) of conventional tree estimation, NJ (Saitou and Nei 1987), ML (Felsenstein 1981; Truszkowski and Goldman 2016) and BI (Steel, et al. 2013; Warnow 2018), as well as MP as an example of a method that is inconsistent (Felsenstein 1978a) under standard Markov substitution models. Because empirical datasets are always finite in length, the rate (Fu 1975) at which P(M(D, n) = θ) increases with n is perhaps more crucial in practice.
We tested our CNN and the four standard methods on simulated sets of MSAs of increasing length. Our results show that for both gapped and ungapped MSAs each method's accuracy increases with MSA length, and that the CNN outperforms the other methods at every MSA length (Figure 5). Thus, although we do not prove that CNNs are consistent, these empirical results are highly encouraging and imply that our approach has great potential for practical applications. Indeed, in some instances accuracy on shorter MSAs may be a more crucial quality than provable consistency, because the asymptotic properties of consistent methods may not be exhibited on smaller datasets (Yang 2014).

Analysis of Phylogenetic Reliability Measures
Testing phylogenetic hypotheses usually involves calculation of some reliability measures such as bootstrap support (BS) for MP, NJ and ML or posterior probabilities (PP) in the context of BI.
Despite the fact that their direct comparison is somewhat problematic due to their fundamental differences, one can estimate which measures are more prone to support false tree topologies (Douady, et al. 2003; Huelsenbeck and Rannala 2004). For the traditional methods, a nontrivial fraction of incorrectly inferred topologies nevertheless received high support, i.e. positively misleading cases. On the other hand, the CNN's posterior probability (CPP) estimates show a large gap between scores assigned to correctly vs. incorrectly classified topologies, with the latter being significantly lower, and only 0.1% of incorrectly classified gapped MSAs having a CPP above 0.95. In other words, the CNN can avoid incorrect classifications even when imposing a relatively relaxed posterior probability cutoff, and thus simultaneously recover a large fraction of correctly classified topologies while producing very few misclassifications. This is further illustrated by precision-recall curves (Supplementary Fig. S7), which show that compared to other methods the CNN achieves high sensitivity while retaining a high positive predictive value on gapped alignments. Similarly, on ungapped data, the CNN performs better in this respect than the more traditional methods (Supplementary Fig. S8). Moreover, for our four-taxon trees the interpretation of the CPP is straightforward: the most unreliable tree estimates receive a CPP of 1/3, which is equivalent to randomly picking one of the three possible topologies. The probability vector P = [P1, P2, P3] can be plotted onto a 2D simplex to visualize the affinity of a quartet to each of the three topologies T1, T2 or T3. Conveniently, the final output layer of our CNN generates a vector of posterior probabilities (CPPs) that can be readily utilized in this manner. We compared the distributions of the CPPs to ML's likelihood-mappings using gapped (Figure 7, Supplementary Fig. S9) and ungapped (Supplementary Fig. S10) test data.
Generally, the CPPs exhibit less density near the center of the simplex, which corresponds to ambiguous quartets; this is especially so for the CNN using gapped MSAs. Thus, our CNN's class membership probabilities can be used to visualize the phylogenetic information in an MSA in a manner that is analogous to likelihood-mapping, but with minimal additional effort, as these probabilities are computed internally as part of the classification process.

[Figure 7 caption: Red dashed lines mark basins of attraction (i.e. regions where the probability of topology Ti is largest) for the three topologies T1, T2 and T3. In the center all of the topologies are equally likely, i.e. P1 = P2 = P3 = 1/3.]
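For readers who wish to reproduce this kind of visualization, mapping a probability vector onto the 2D simplex is a simple barycentric-to-Cartesian transform (a sketch with our own names; plotting itself is omitted):

```python
import numpy as np

# Vertices of an equilateral triangle representing topologies T1, T2, T3
VERTICES = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.5, np.sqrt(3.0) / 2.0]])

def simplex_coords(p):
    """Map a probability vector [P1, P2, P3] to 2D coordinates inside
    the triangle (a weighted average of the three vertices)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()       # normalize defensively
    return p @ VERTICES

corner = simplex_coords([1.0, 0.0, 0.0])   # lies exactly on the T1 vertex
center = simplex_coords([1/3, 1/3, 1/3])   # lies at the triangle's centroid
```

A point near a vertex indicates strong support for that topology, while a point near the centroid corresponds to the fully ambiguous case P1 = P2 = P3 = 1/3 described above.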

Concluding remarks
We used a convolutional neural network to infer unrooted quartet topologies through classification. Our results demonstrate that CNNs can perform this task accurately and have a number of desirable qualities.
First and foremost, our CNN resolves quartets with higher accuracy than existing methods. This is in part because our CNN was successfully trained to be relatively insensitive to regions of tree branch-length heterogeneity (e.g. the Felsenstein and Farris zones) that are known to confound certain methods. Another advantage of using CNNs is that they produce posterior probability estimates for each tree topology; this has the added benefit of making likelihood-mapping efficient and straightforward. Also, we have shown that our CNN's accuracy can be further improved by adding more training data, albeit most likely with diminishing returns (Figure 2). Although training on large datasets can impose a computational burden up front, this training only needs to be completed once, after which the CNN can be used to infer tree topologies very rapidly. Finally, CNNs have the advantage of being able to incorporate indels into their classification in a straightforward manner, as one need only include them in the simulated training data, which can improve accuracy by supplying more information to the classifier. Future efforts will have to examine the robustness of inference to indel model misspecification, and the extent to which robustness can be improved by training under a mixture of indel rates and length distributions.
Our efforts thus suggest that there is great potential for the application of deep learning techniques to inferring trees, which is central to questions in both phylogenetics (species-tree inference) and population genomics (inferring ancestral recombination graphs along a recombining chromosome; Rasmussen, et al. 2014). Indeed, using the entire MSA as input would be most useful on recombining sequences; otherwise it is possible to collapse the alignment down to a vector of statistics (e.g. the number of times each distinct possible alignment column was observed, if the number of sequences is small) that would sufficiently capture all of the relevant information from the input. Deep learning methods acting directly on the alignment could also be applied to additional phylogenetic problems, such as substitution model selection or tree inference from protein sequence data.
Although our findings are encouraging, additional advances are required for our approach to match the practical utility of conventional methods. First, we have only investigated the ability of a CNN to produce the correct tree topology, but not the associated branch lengths. Artificial neural networks are capable of estimating real-valued parameters in addition to discrete class labels, so future work should investigate the possibility of using deep learning for inferring branch lengths in addition to the topology. Another limitation of our current approach is that it is limited to a small number of sequences/taxa. Future extensions could use quartet amalgamation methods (e.g. von Haeseler and Strimmer 1996;Reaz, et al. 2014), or perhaps even devise neural networks capable of directly predicting a tree topology. The latter could in principle be accomplished by treating the values of the final hidden layer of the network as entries in a distance matrix and using an output layer that builds a tree from this matrix (e.g. using NJ), and adopting a tree dissimilarity measure such as RF distance as the objective function to be minimized. Finally, in order to move from inferring gene trees to inferring species trees, our approach will have to be extended to handle tree discordance due to variation in evolutionary rates across loci, incomplete lineage sorting (Maddison and Knowles 2006;Degnan and Rosenberg 2009) and gene flow (Maddison 1997). Supervised learning approaches may prove to be well suited for this question, as arbitrarily complicated phylogenetic models combining multiple sources of discordance can readily be incorporated into the training process via simulation. Thus, while additional practical developments are necessary, applications of deep learning to problems in tree inference and phylogenomics are likely to be a fruitful endeavor.