Identification and qualification of 500 nuclear, single-copy, orthologous genes for the Eupulmonata (Gastropoda) using transcriptome sequencing and exon-capture

The qualification of orthology is a significant challenge when developing large, multiloci phylogenetic datasets from assembled transcripts. Transcriptome assemblies have various attributes, such as fragmentation, frameshifts, and mis-indexing, which pose problems to automated methods of orthology assessment. Here, we identify a set of orthologous single-copy genes from transcriptome assemblies for the land snails and slugs (Eupulmonata) using a thorough approach to orthology determination involving manual alignment curation, gene tree assessment and sequencing from genomic DNA. We qualified the orthology of 500 nuclear, protein coding genes from the transcriptome assemblies of 21 eupulmonate species to produce the most complete gene data matrix for a major molluscan lineage to date, both in terms of taxon and character completeness. Exon-capture targeting 490 of the 500 genes (those with at least one exon > 120 bp) from 22 species of Australian Camaenidae successfully captured sequences of 2,825 exons (representing all targeted genes), with only a 3.7% reduction in the data matrix due to the presence of putative paralogs or pseudogenes. The automated pipeline Agalma retrieved the majority of the manually qualified 500 single-copy gene set and identified a further 375 putative single-copy genes, although it failed to account for fragmented transcripts resulting in lower data matrix completeness. This could potentially explain the minor inconsistencies we observed in the supported topologies for the 21 eupulmonate species between the manually curated and Agalma-equivalent dataset (sharing 458 genes). Overall, our study confirms the utility of the 500 gene set to resolve phylogenetic relationships at a broad range of evolutionary depths, and highlights the importance of addressing fragmentation at the homolog alignment stage for probe design.

3) which is about the level of divergence tolerated be the probes (Hugall et al. 2015).

4 1
Including both taxa in the design increases the likelihood that we will capture sequences from 2 4 2 more divergent lineages within the Camaenidae for which we don't yet have transcriptome shorter than 120bp and were excluded from the bait design). Probes for the target sequences were designed and produced by MYcroarray (Ann Arbor, Michigan) using MYbaits custom 2 5 0 biotinylated 120bp RNA baits at 2X tiling. We tested the probe set on 22 camaenid species spanning much of the phylogenetic  (Table 2). DNA was extracted using the DNeasy blood 2 5 4 and tissue kit (Qiagen) and sheared using the Covaris S2 (targeting a fragment size of 275bp). Biosystems, USA), modified to accommodate dual-indexing using the i7 and i5 index sets 2 5 7 (see Hugall et al. 2015). Up to eight libraries (normalised to 100 ng each) were pooled per 2 5 8 capture, and hybridised to the baits (at one-quarter dilution) for 36 hours, following the 2 5 9 MYbait protocol v1. A second hybridisation was then carried out on the fragments retained We used FastUniq v1.1 (Xu et al. 2012) to remove duplicates, and Trimmomatic 2 6 4 v0.22 (Bolger et al. 2014) to trim and remove low quality reads and adaptor sequences 2 6 5 (minimum average quality score threshold of 20 per 8nt window). Reads shorter than 40 2 6 6 bases after trimming were discarded. The trimmed reads were then mapped onto the 2 6 7 transcriptome sequences used for the probe design using BFAST v0.7.0a (Homer et al. 2009) 2 6 8 with a single index of 22nt without mismatch. After creating pileup files using Samtools alignments showed that exon boundaries were often not conserved between L. gigantea and 2 7 2 the Camaenidae. In these cases ( Figure S5) the reference exons were split to reflect the actual reference and consensus sequences made as outlined above. To flag potential pseudogenes 2 7 5 and paralogs we identified consensus sequences with an elevated proportion of variable sites 2 7 6 (> 3% heterozygote sites) and reviewed the corresponding read alignments (BAM files) using greater than 3% ambiguous sites where removed from the final dataset. Exons where more 2 7 9 than 10% of the taxa contained greater than 3% ambiguous sites were discarded entirely. We again used TreSpEx to assess conflicting phylogenetic signal. We screened for 2 8 1 hidden paralogs based on five a priori phylogenetic hypotheses representing well supported 2 8 2 clades (≥75% bootstrap support) within the Australasian camaenid radiation as delineated by  northern (sister clades 5 and 6) and north-eastern (clade 7) Chloritid groups, a group subfamily Sinumeloninae (e.g. Solem 1992), and a phenotypically and ecologically diverse 2 8 7 group dominated by eastern Australian wet forest taxa (sister clades 8 and 9). Gene trees for 2 8 8 each of the 490 genes (exons from the same gene were combined as one partition) were 2 8 9 constructed using the GTRGAMMA model and 100 fast bootstraps in RAxML (Stamatakis 2 9 0 2006). TreSpEx was run using the same settings as the analysis for the transcriptome dataset 2 9 1 (i.e. TreSpEx considered nodes for strong conflict, long branches, and short branches in that 2 9 2 order with parameters upbl and lowbl set to 5 and blt 0.00001).

9 3
Comparison to the Agalma pipeline 2 9 4 As an independent qualification of the manually curated 500 gene set we ran the fully using an all-by-all tblastx, followed by clustering using the Markov Clustering algorithm  The surviving subtrees were filtered based on the number of taxa (set to greater than four 3 0 5 taxa) and realigned for subsequent phylogenetic analysis. We then identified Agalma 3 0 6 homologous clusters that corresponded to the manually curated 500 gene set using BLAST 3 0 7 (blastp, e-value cut off of 1e-10). After removal of paralogs or sequences with excessive polymorphism (>3% Agalma pipeline) and manual masking. We reconstructed maximum likelihood trees using suitable models and partitioning schemes, implemented with 1% heuristic r-cluster searches, scores. In all cases, nodal support was assessed by performing 100 full non-parametric  We analysed two datasets resulting from the Agalma pipeline. The first dataset containing only a single ortholog cluster (from here on referred to as the 'Agalma best 3 2 4 dataset'); that is, Agalma homolog clusters containing multiple copies, albeit diagnosable, were not considered further. Finally, we reconstructed a phylogeny for the camaenid dataset transcriptomes presented herein, as well as sequences of Cornu aspersum as an outgroup. The number of paired reads obtained for each of the 21 eupulmonate species 3 3 1 sequenced ranged from 7.8M to 31.6M (Table 3). Trimming and de novo assembly statistics 3 3 2 are presented in Table 3. The number of L. gigantea reference genes with BLAST matches 3 3 3 ranged from 7,011 to 9,699 per assembly (Table 3), 5,490 of which had homologous  Of the 288 genes used in a previous molluscan phylogenomic study (Kocot et al.  paralogs in at least one species (mean p-distance between paralogs within a sample was 0.28, 3 3 8 ranging from 0.16-0.46). We could not unambiguously qualify the remaining 12 genes from candidate homolog clusters until we reached a total of 500 single-copy genes. In addition to 3 4 2 the 146 Kocot genes shown to be paralogous within the eupulmonates, we identified and ranging from 228nt to 6,261nt. In total, the final alignment of this gene set represents   Based on the all-by-all BLAST comparison of the L. gigantea genes, 347 of our final 3 5 0 500 genes had a single hit at an e-value threshold of 1e-10 (i.e. single copy status was 3 5 1 consistent between the L. gigantea reference and the eupulmonates), while the remainder had the multiple-copy gene set are potentially single copy for patellogastropods). These results  Across the 500 single-copy genes, the p-distance between the two rhytidids,  TreSpEx analyses of all 500 genes found no well supported conflict with the a priori 3 7 0 phylogenetic hypotheses, suggesting that hidden paralogs (i.e., genes represented by a single 3 7 1 sequence per taxon yet paralogous across multiple taxa) were absent from our dataset.

7 2
Furthermore, this analysis also showed no evidence of cross sample contamination, nor any 3 7 3 evidence of suspect long internal branches within the Stylommatophora. in the probe design with exon boundaries delineated based on the L. gigantea genome. contained multiple internal exon boundaries within the Camaenidae ( Figure S4).

8 0
Accordingly, the mapping reference was modified to account for exon-splitting (including the  Table 1). We  We then remapped reads to the revised reference (coverage and specificity statistics 3 8 5 presented in Table 4) and flagged resulting consensus sequences which exhibited elevated 3 8 6 polymorphism (> 3% heterozygote sites). There were 508 exons where at least one taxon 3 8 7 exhibited elevated polymorphism. Of these, 105 exons had greater than 10% of the taxa 3 8 8 (typically two or more taxa, taking into account missing taxa) exhibiting elevated removed from the final alignment. In total, 3.7% of the sequences were removed from final 3 9 6 data matrix due to elevated polymorphism. The final exon capture data matrix was 98% taxa 3 9 7 complete and 95% character complete. Based on the TreSpEx analyses, four genes did not support the monophyly of the 'Far genes. An additional five genes were in conflict with the a priori taxonomic hypotheses, and corresponding genealogies indicated that they represented deep basal divergence between 4 0 7 well supported major clades, and was not reflective of hidden paralogy.  Using the Agalma pipeline we identified 11,140 ortholog clusters. Of these ortholog 4 1 9 clusters 635 corresponded to 457 of our 500 single-copy gene set. We refer to this dataset as 4 2 0 the "Agalma equivalent" dataset, and is 61% taxa complete and 54% character complete. all BLAST but dropped during the clustering step, and eight made it to the initial clusters but  least 18 taxa and that had one ortholog cluster per homolog cluster. Of these, 171 were also 4 3 5 contained in our 500 single-copy gene set. Hence, the Agalma pipeline identified 375 genes 4 3 6 in addition to the 500 manually curated genes, which had optimum taxon sampling. The 4 3 7 majority of these genes also represented the full CDS with 89% representing at least 80% of 4 3 8 the length of the respective L. gigantea gene. We refer to this dataset as the "Agalma best" 4 3 9 dataset and is 92% taxa complete and 85% character complete.  with very high bootstrap support, namely Helicoidea, Limacoidea, Orthurethra, the Australian 4 5 0 rhytidids and the Stylommatophora (Figure 4a, c). In terms of phylogenetic relationships, the  Of the Camaenidae exon capture dataset, 5% of the alignment was removed by  The identification and qualification of orthology is a critical prerequisite for sound 4 6 5 phylogenetic inference. Our approach of orthology assessment involved an initial assessment 4 6 6 and manual editing of homolog clusters, allowing us to correct for multiple isoforms and 4 6 7 errors such as sequence fragmentation, frame-shifts and mis-indexing. Using this approach,  The resulting 500 gene data matrix is the most complete produced for a major molluscan 4 7 1 lineage to date, both in terms of taxon and character completeness. We further qualified  completeness in the latter. We discuss approaches to ortholog determination and implications  genes qualified as single-copy across many highly divergent taxa are more likely to maintain 4 9 1 single-copy status with greater taxonomic sampling. We tested this idea at a preliminary stage being paralogous is due to the limited representation of eupulmonates in that study, and for taxonomic sampling, such paralogy may extend across multiple taxa and, unless conservation suitable for phylogenetic analysis. placed subsequent orthology assessment on a sounder footing. Consequently, our final data 5 3 4 matrix was highly complete (93% character complete whereas the 'Agalma best' dataset was 5 3 5 85% character complete).

3 6
The Agalma analysis confirmed the single-copy, orthology status for the majority of Agalma for any one of our 500 genes, this was due to transcript fragmentation, not missed pruning and ortholog clustering (e.g. Figure S3). Consequently, for the 'Agalma equivalent' making algorithm such as TGICL (Pertea et al. 2003) into the pipeline to address 5 5 0 fragmentation at the homolog alignment stage. Doing so is particularly desirable, given that 5 5 1 manual curation of homologous sequences requires considerable time investment.

2
A major strength of automated pipelines is that they enable a more comprehensive 5 5 3 screening of putative orthologous genes. Manual curation requires considerable effort, and 5 5 4 while more candidate genes were identified than were assessed, we ceased the manual 5 5 5 2 4 assessment once our target of 500 genes had been attained. The Agalma analyses had no 5 5 6 constraints, however, hence all possible orthologous clusters were considered. Consequently, 5 5 7 we identified an additional 375 ortholog clusters which met a strict taxa completeness 5 5 8 threshold (18 taxa or more) and represented the only ortholog cluster arising from original 5 5 9 homolog clusters. These genes (i.e. the 'Agalma best' dataset) reconstructed a phylogeny that 5 6 0 was very similar to the manually curated dataset. While beyond the scope of this study, there 5 6 1 is potential for these genes to be included in future probe designs and further qualification of 5 6 2 these additional genes using exon-capture (see below) would be highly desirable. The 500 gene set represents a significant contribution towards advancing molecular dataset, however, Orthurethra is basal within Stylommatophora, albeit with marginal support.

7 3
Without greater taxonomic sampling of all the major lineages within the eupulmonates, 5 7 4 however, a comprehensive phylogenetic assessment is beyond the scope of this study.

7 5
Nevertheless, these phylogenomic datasets do afford greater resolution of deeper 5 7 6 relationships than obtained in previous molecular studies (Wade et al. 2001(Wade et al. , 2006. Secondly, 5 7 7 convergence in supported topology between the two most complete and largely independent One of the overarching objectives of this study was to identify and qualify 500 genes small dataset for the family Camaenidae principally as a means to further qualify orthology.

8 5
There are two principle outcomes from this exploration. First, for all reference sequences target discarded due to putative paralogs or pseudogenes (Hugall et al. 2015). It is possible 5 9 6 that these putative paralogs were only detected in the genomic sequencing because they were 5 9 7 not expressed in the transcriptomes.

9 8
Within the Australian Camaenidae, uncorrected distances for the majority of the genes 5 9 9 did not exceed 13%. This level of sequence variability is within the range of mismatch that is Expanding the bait design to enrich across the Australasian camaenid radiation, 6 0 7 indeed the family Helicoidea, would require the incorporation of multiple divergent reference 6 0 8 taxa into the bait design. Recent "anchored enrichment" approaches to bait design (e.g. used a similar approach to the one in the present study, but designed baits based on ancestral 6 1 6 sequences, rather than representative tip taxa, to reduce the overall size of the reference set. approximately 260 million years. Here we have presented a simple bait design targeting a 6 2 0 specific family, but our transcriptome dataset could be used to produce a more diverse bait 6 2 1 design to facilitate a more comprehensive study of Eupulmonata phylogenetics and 6 2 2 systematics. We thank Andrew Hugall for bioinformatics advice and Felipe Zapata for assistance with specifically Travis Glenn and Roger Neilsen, for guidance and assistance in transcriptome 6 3 0 and exon-capture sequencing. LT was supported by Victorian Life Science Computing Computational Infrastructure (NCI), which is supported by the Australian Government.   Fig. 4. Maximum Likelihood phylogenies for 21 eupulmonates based on three datasets. These datasets were (a) 500 nuclear single-copy orthologous genes identified by manual curation, (b) 635 orthologous clusters identified by the automated pipeline Agalma, which corr the same 500 genes, and (c) 546 orthologous clusters identified by Agalma, where each orthologous cluster was the only one produced respective homolog cluster and had sequences for at least 18 taxa. Phylogenies are each based on analyses of amino acid sequences. Nu branches indicate bootstrap nodal support. Heat maps (d, e, f) indicate proportions of sequence obtained for each gene per sample for e dataset (sorted left to right by total proportion of data present per gene, top to bottom by total proportion of data present per sample).