Trial by phylogenetics - Evaluating the Multi-Species Coalescent for phylogenetic inference on taxa with high levels of paralogy (Gonyaulacales, Dinophyceae)

From publicly available next-gen sequencing datasets of non-model organisms, such as marine protists, arise opportunities to explore their evolutionary relationships. In this study we explored the effects that dataset and model selection have on the phylogenetic inference of the Gonyaulacales, single celled marine algae of the phylum Dinoflagellata with genomes that show extensive paralogy. We developed a method for identifying and extracting single copy genes from RNA-seq libraries and compared phylogenies inferred from these single copy genes with those inferred from commonly used genetic markers and phylogenetic methods. Comparison of two datasets and three different phylogenetic models showed that exclusive use of ribosomal DNA sequences, maximum likelihood and gene concatenation showed very different results to that obtained with the multi-species coalescent. The multi-species coalescent has recently been recognized as being robust to the inclusion of paralogs, including hidden paralogs present in single copy gene sets (pseudoorthologs). Comparisons of model fit strongly favored the multi-species coalescent for these data, over a concatenated alignment (single tree) model. Our findings suggest that the multi-species coalescent (inferred either via Maximum Likelihood or Bayesian Inference) should be considered for future phylogenetic studies of organisms where accurate selection of orthologs is difficult.


INTRODUCTION
Historically, the availability of genetic data has been the limiting factor in phylogenetic inference of 28 evolutionary relationships. Now, the breadth of publicly available data sets generated by high throughput 29 sequencing techniques allows for an increasingly detailed investigation into the evolutionary relationships 30 between organisms. The quest to untangle an organism's phylogeny is often challenging but can inform 31 a broad range of further studies, for example epidemiology, toxicology and ecological interactions, e.g. till resuspended. Samples were split in two and transferred to 1.5ml eppendorf tubes. Cellular thecae were ruptured by three rounds of freeze-thaw, with tubes transferred between liquid Nitrogen and 95 • C. 149 RNA was extracted as per the protocol for TRI Reagent (Rio et al., 2010). RNA eluate was purified with  Transcriptome processing scripts 166 The workflow was separated into two parts. See section Implementation for script details.  Construction of multiple sequence alignments 176 The BUSCOv2 output from all transcriptomes from the previous step formed the input for identification 177 of single copy genes and construction of multiple alignments. Any genes that BUSCOv2 identified as 178 single copy and were present in at least 75% of the transcriptomes were indexed, the corresponding contig 179 extracted from the assemblies, aligned with hmmer3.1b2 (Eddy and Wheeler, 2015) and unaligned regions 180 trimmed. If several candidate sequences were processed for the same organism, a warning message in the 181 terminal window alerted the user before proceeding. The output for this section was used as a basis for 182 single copy gene phylogenetic inferences in subsequent sections.   Table S3. Individual genes were aligned using MUSCLE (Edgar,       Support for branches was interpreted as follows, for ML and BI, respectively: 100%/1.0 was considered 250 fully supported, above 90%/0.9 was very well supported, 80%/0.8 and above was interpreted as relatively 251 well supported and above 50%/0.5 was considered weakly supported. Below 50%/0.5 was considered was considered a given for this study. Therefore, the branch separating these taxa from others was used to 256 root ML trees in subsequent analyses where rooting was required for tree layout in visual comparisons. All nodes except one within the Gambierdiscus species cluster were relatively well supported (Fig. 2).   Table S3. Gonyaulacales (n=16) in purple, outgroups (n=3) in light blue and taxa incertae sedis (n=1) in teal. The topology was rerooted on the branch separating outgroup taxa with the Gonyaulacales. The scale represents the expected number of substitutions per site.

276
All nodes resolved with full support, except one node within the genus Gambierdiscus which was very 277 well supported as well as an internal node within the outgroup clade (Fig. 3). The species in the genera

306
This study presents the first species tree for the Gonyaulacales that has been inferred with a method robust 307 to paralogy, including hidden paralogy.

308
The phylogenetic inference for Gonyaulacales that resulted from the workflow we developed, which 309 incorporates several of the most recent innovations in analytical methodology, resolved within-genus 310 relationships well and showed high posterior probability support throughout the species tree (Fig. 4).

311
The inferred species tree topology followed a broad revised taxonomic classification of the Gonyaulacales

342
Quality of transcriptome assemblies.

343
Publicly available data sets may have been generated with a variety of different methods, and their 344 resulting quality can be highly variable, so an initial quality assessment step is essential. In the time

354
Another factor in assembly quality is RNA-seq data processing prior to assembly, especially trimming.  best-practice transcriptome assembly methods as part of this study.  Selection of paralogs to infer species evolution. 371 Inclusion of genes which diverged through a process other than speciation events, such as paralogs, 372 violates the assumptions of most commonly used phylogenetic models which assume all genes analysed 373 have an orthologous relationship. This study sought to mitigate the issues arising from paralogs by 374 identifying and using single copy genes and using a phylogenetic inference method that is robust to the  The phylogenies inferred by Price and Bhattacharya (2017) and by our study resulted in markedly and Thecadinium formed their own clade while with concatenation, Ceratium and Thecadinium were 479 ancestral genera to the rest of the Gonyaulacales. There was a marked difference in the internal branch 480 arrangement, which resulted in different taxa clustering, between the concatenation and MSC methods.

481
The concatenated approach closely mirrored the ML arrangement of taxa, apart from Crypthecodinium 482 placement. Both inferences were topologically distinct to the MSC approach.