DiscoMark: Nuclear marker discovery from orthologous sequences using draft genome data

High-throughput sequencing has laid the foundation for fast and cost-effective development of phylogenetic markers. Here we present the program DISCOMARK, which streamlines the development of nuclear DNA (nDNA) markers from whole-genome (or whole-transcriptome) sequencing data, combining local alignment, alignment trimming, reference mapping and primer design based on multiple sequence alignments in order to design primer pairs from input orthologous sequences. In order to demonstrate the suitability of DISCOMARK we designed markers for two groups of species, one consisting of closely related species and one group of distantly related species. For the closely related members of the species complex of Cloeon dipterum s.l. (Insecta, Ephemeroptera), the program discovered a total of 78 markers. Among these, we selected eight markers for amplification and Sanger sequencing. The exon sequence alignments (2,526 base pairs (bp)) were used to reconstruct a well supported phylogeny and to infer clearly structured haplotype networks. For the distantly related species we designed primers for several families in the insect order Ephemeroptera, using available genomic data from four sequenced species. We developed primer pairs for 23 markers that are designed to amplify across several families. The DISCOMARK program will enhance the development of new nDNA markers by providing a streamlined, automated approach to perform genome-scale scans for phylogenetic markers. The program is written in Python, released under a public license (GNU GPL v2), and together with a manual and example data set available at: https://github.com/hdetering/discomark.


Introduction
The inference of phylogenetic relationships has benefited profoundly from the availability of nuclear DNA (nDNA) sequences for an increasing number of organism groups.

Here we present DISCOMARK (=Discovery of Markers), a flexible, user-friendly program that identifies conserved regions and designs primers based on multiple sequence alignments taken from FASTA-formatted files of putative orthologous sequences from whole-genome or whole-transcriptome data. The program can be used to easily screen for phylogenetically suitable nDNA markers and to design primers that can be used for Sanger sequencing as well as high-throughput sequencing. The program is structured into several steps that can be individually optimized by the user and run independently. In terms of input, the program can be applied to large and small sets of taxa, including both closely and distantly related species. Ideally, orthologous sequences in combination with a whole-genome reference sequence are used. Thus, exon/intron boundaries can be inferred using the reference for each marker. Under the default settings, the program will design several primer pairs that anneal in conserved regions. The visualization of the alignments with potential primers allows the user to choose between primers targeting exons or introns (e.g. exon-primed intron-crossing (EPIC) markers). Additionally, information about the suitability as phylogenetic markers is provided by an estimate of the number of SNPs per marker and the applicability across species. Finally, we demonstrate the utility of DISCOMARK for (1) closely related species (i.e. Cloeon dipterum s.l. species complex) using whole-genome data, and (2) distantly related species (i.e. insect order Ephemeroptera) using whole-genome data derived from genome sequencing projects. In order to generate genomic reference sequences we used draft whole genome sequencing at shallow coverage followed by draft genome assembly. In one scenario (C. dipterum s.l. species complex) we demonstrate that incomplete genomic data can be used for ortholog prediction and primer design as well.

to be amplified can be on entire exon markers, EPIC markers, or a combination. For comparison, we also ran DISCOMARK without a reference and also present these results.
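The estimate of the number of SNPs per marker mentioned above can be illustrated with a minimal sketch. It simply counts alignment columns that contain more than one distinct nucleotide, ignoring gaps and ambiguous bases. This is an assumption about how such an estimate could be obtained, not a description of DISCOMARK's actual implementation; the function name `count_variable_sites` and the toy alignment are hypothetical.

```python
def count_variable_sites(aligned_seqs):
    """Count alignment columns holding more than one distinct nucleotide.

    Gaps ('-') and ambiguous characters ('N', '?') are ignored, so a column
    only counts as variable if at least two different bases are observed.
    Illustrative sketch only; not taken from the DISCOMARK source code.
    """
    if not aligned_seqs:
        return 0
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    ignored = {"-", "N", "?"}
    variable = 0
    # zip(*seqs) iterates over alignment columns
    for column in zip(*aligned_seqs):
        bases = {base.upper() for base in column} - ignored
        if len(bases) > 1:
            variable += 1
    return variable


# Hypothetical toy alignment of three orthologous sequences
toy_alignment = [
    "ATGCCGTA-A",
    "ATGACGTATA",
    "ATGCCGTTTA",
]
snp_count = count_variable_sites(toy_alignment)
print(snp_count)
```

In practice such a count would be restricted to the region between a primer pair, so that it reflects the variation expected in the amplified product.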

The Pearson correlation between the number of SNPs located between primer pairs and corresponding estimated product length was calculated using the function cor within the stats
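For readers who want to reproduce this kind of calculation outside R, the Pearson correlation coefficient can be computed in a few lines of Python. The sketch below is a generic implementation of the standard formula, and the SNP counts and product lengths are hypothetical values chosen only for illustration, not data from this study.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same length")
    if len(xs) < 2:
        raise ValueError("at least two observations are required")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: estimated product length (bp) and SNPs between primers
product_lengths = [250, 400, 550, 700]
snp_counts = [5, 9, 12, 20]
r = pearson_r(snp_counts, product_lengths)
print(round(r, 3))
```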

Closely related species - species complex of Cloeon dipterum s.l.
The haplotype networks based on the eight selected markers showed a clear structure for all markers, including two markers with shared haplotypes for the two species from the U.S. and Madeira (Fig. 3 and Fig. S1, Supporting information). The length of the concatenated sequence alignment of the eight markers was 3,530 bp (2,526 bp exon sequence, Table S1,

Distantly related species - insect order Ephemeroptera
In total, we found 23 orthologs with a total of 53 primer pairs for all four species (Table S2, Supporting information) for the first run with a reference. The input files per species (i.e.

The program, user manual and example data sets are freely available at:

Species overlap for identified markers

Table 2: Grouping candidate markers (e.g. orthologous group of sequences) and primer pairs by the number of species that they cover.
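The grouping described in the Table 2 caption can be sketched as a simple tally of how many markers cover a given number of species. The marker names, species labels and the function `group_by_species_count` below are hypothetical and serve only to illustrate the idea; they do not reproduce the values reported in Table 2.

```python
from collections import Counter


def group_by_species_count(marker_species):
    """Tally markers by the number of distinct species each one covers.

    marker_species maps a marker (or primer pair) name to the collection of
    species it covers. Returns {number_of_species: number_of_markers},
    sorted by the number of species.
    """
    counts = Counter(len(set(species)) for species in marker_species.values())
    return dict(sorted(counts.items()))


# Hypothetical candidate markers and the species they cover
toy_markers = {
    "ortholog_A": {"sp1", "sp2", "sp3", "sp4"},
    "ortholog_B": {"sp1", "sp2"},
    "ortholog_C": {"sp1", "sp2", "sp3", "sp4"},
    "ortholog_D": {"sp2", "sp3", "sp4"},
}
coverage_table = group_by_species_count(toy_markers)
print(coverage_table)
```

The same function could be applied to primer pairs instead of markers by passing a mapping from primer pair identifiers to covered species.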