Utility of a high-resolution mouse single nucleotide polymorphism microarray assessed for rodent comparative genomics

In the study of genetic diversity in non-model species there is a notable lack of the low-cost, high resolution tools that are readily available for model organisms. Genotyping microarray technology for model organisms is well-developed, affordable, and potentially adaptable for cross-species hybridization. The Mouse Diversity Genotyping Array (MDGA), a single nucleotide polymorphism (SNP) genotyping tool designed for Mus musculus, was tested as a tool to survey genomic diversity of wild species for inter-order, inter-genus, and intra-genus comparisons. Application of the MDGA cross-species provides genetic distance information that reflects known taxonomic relationships reported previously between non-model species, but there is an underestimation of genetic diversity for non-Mus samples, indicated by a plateau in loci genotyped beginning at 10-15 million years of divergence from the house mouse. The number and types of samples included in datasets genotyped together must be considered in cross-species hybridization studies. The number of loci with heterozygous genotypes mapped to published genome sequences indicates potential for cross-species MDGA utility. A case study of seven deer mice yielded 159,797 loci (32% of loci queried by the MDGA) that were genotyped in these rodents. For one species, Peromyscus maniculatus, 6,075 potential polymorphic loci were identified. Cross-species utility of the MDGA provides needed genetic information for non-model species that are lacking genomic resources. Genotyping arrays are widely available, developed tools that are capable of capturing large amounts of genetic information in a single application, and represent a unique opportunity to identify genomic variation in closely related species that currently have a paucity of genomic information available.
A candidate list of MDGA loci that can be utilized in cross-species hybridization studies was identified and may prove to be informative for rodent species that are known as environmental sentinels. Future studies may evaluate the utility of candidate SNP loci in populations of non-model rodents.

Author Summary

There is a need for a tool that can assay DNA sequence differences in species for which there is little or no DNA information available. One method of analyzing differences in DNA sequences in species with well-understood genomes is through a genotyping microarray, which has demonstrated utility cross-species. The Mouse Diversity Genotyping Array (MDGA) is a tool designed to examine known differences across the genome of the house mouse, Mus musculus. Given that related organisms share genetic similarity, the MDGA was tested for utility in identifying genome variation in other wild mice and rodents. Variation identified from distantly related species that were not of the same genus as the house mouse was an underestimate of the true amount of variation present in the genomes of wild species. Utility of the MDGA for wild species is best suited to mice from the same genus as the house mouse, and candidate variation identified can be tested in rodent populations in future studies. Identifying changes in genetic variation within populations of wild rodents can help researchers understand the links between specific genome changes and the ability to adapt to pressures in the environment, as well as better understand the evolution of rodents.


Introduction

The study and characterization of genomic diversity of non-model organisms is complicated by limitations in knowledge and genomic resources available [1]. By contrast, researchers studying model organisms benefit from working with species that have sequenced and annotated genomes, and from high-throughput platforms that survey genetic diversity at low cost. There is a lack of genomic sequence information available for non-model species, and a deficit of tools to assay genomic diversity in understudied organisms [2][3][4]. There is a need for custom tools to survey genomic diversity in non-model organisms, but the creation of these tools can be time-consuming and expensive. There is an opportunity to explore existing technologies designed for model organisms and test the applicability of these tools in non-model species.

Genotyping arrays are convenient tools that obtain large amounts of genetic diversity information in a single assay at low cost [5]. Genotyping arrays are designed to capture a large swath of diversity within a species, but the technology is typically tailored to the model species of interest. Hybridization of microarray oligos targeted to unique locations in test DNA of the organism of interest provides a picture of the genomic landscape of that sample [6]. Single nucleotide polymorphisms (SNPs) are single base pair genome variations found in at least one percent of individuals in a population, and are an informative type of genomic diversity that is captured by genotyping arrays [6,7].
SNPs are found in abundance throughout the genome, and this variation can be used as a metric of genomic diversity when comparing different individuals in a population, or different species of interest [8].

There is a precedent for exploring the possibility of applying existing genotyping array technologies to related, non-model species.

Table). The goal was to identify the three metrics of success that define MDGA cross-species utility in related organisms. This study represents an advance in the field of mammalian cross-species genotyping that will add to the limited genomic sequence and SNP information available for non-model mice and rodents (Fig 1). It was hypothesized that application of the MDGA to wild rodent DNA samples will help elucidate potential polymorphic loci, that is, loci at which both the A and B alleles can be detected in a population, and that can be used cross-species in non-model organisms.

Genotyping Test Sets
Intra-Genus, Inter-Genus, Inter-Order
Genotyping sets organized in descending order according to bounds of taxonomic classification and differences in maximum genetic divergence of a test set from the reference C57BL/6J (Mus musculus) organism.

Cross-species test sets exceed maximum genetic diversity of the training set
A training set of DNA samples from 114 classical, inbred laboratory mice was used in training the genotyping algorithm employed by Affymetrix Power Tools to provide accurate genotypes (S2 Table). Genetic distances reflect the relatedness between samples and were obtained from

For the inter-order genotyping set (n=44), a general decrease is observed in the percentage of loci genotyped as divergence time increases from M. musculus (r = -0.57; p-value<0.0001; Fig 3A). As divergence time increases from M. musculus, the number of 'no calls', or inability to determine a genotype at a locus, increases. A plateau in the percentage of loci genotyped is observed between 10-15 MYD for non-Mus samples from the inter-genus test set. Loci with heterozygous genotypes were of particular interest, as those loci have the potential to identify both the major and minor alleles in a population (polymorphic loci). The percentage of loci that had a heterozygous genotype increases as divergence time from the house mouse increases (Fig 3B). There is a positive correlation between increasing percent heterozygosity and the known divergence times from the house mouse (r = 0.67; p-value<0.0001). Similar to the percentage of loci genotyped, a plateau in percent heterozygosity is also observed to begin between 10-15 million years divergence from M. musculus (Fig 3B).
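The two quantities discussed above, the percentage of loci genotyped per sample and its correlation with divergence time, can be illustrated with a minimal sketch. The genotype encoding (0 = AA, 1 = AB, 2 = BB, -1 = no call), the function names, and the toy values are assumptions made for illustration only; they are not data or code from this study, and the statistical software used for the reported r values is not stated in this section.

```python
from math import sqrt

NO_CALL = -1  # assumed encoding for a locus where no genotype could be assigned


def percent_genotyped(calls):
    """Percentage of loci in one sample that received a genotype call."""
    called = sum(1 for c in calls if c != NO_CALL)
    return 100.0 * called / len(calls)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy input: divergence time (MYD) -> genotype calls at 8 loci
toy_samples = {
    0.0: [0, 2, 1, 0, 2, 0, 2, 0],
    3.0: [0, 2, 1, NO_CALL, 2, 0, 2, 1],
    7.5: [0, NO_CALL, 1, NO_CALL, 2, 1, 2, 0],
    12.0: [NO_CALL, NO_CALL, 1, NO_CALL, 2, 1, NO_CALL, 0],
}

divergence = list(toy_samples)
call_rates = [percent_genotyped(toy_samples[t]) for t in divergence]
r = pearson_r(divergence, call_rates)  # negative when call rate falls with divergence
```

In this toy input the call rate falls as divergence time rises, so `r` is negative, mirroring the direction (not the magnitude) of the trend reported for the inter-order test set.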

MDGA captures the genetic diversity of wild samples from the genus Mus
As seen in the inter-order test set, there is a general decrease in the percentage of loci that were genotyped in samples of the intra-genus test set (Fig 4A). There is a negative correlation between the percentage of loci genotyped and the known divergence times from M. musculus (r = -0.76; p<0.0001). In the intra-genus test set, heterozygosity increases as divergence time increases (Fig 4B). The increase in percent heterozygosity of Mus samples is positively correlated with an increase in divergence times (r = 0.93; p-value<0.0001). There is no plateau or obvious underestimate of genetic diversity for samples in the intra-genus test set.

A tree of relatedness derived from SNP-based genetic distance values differentiates Mus samples of the intra-genus test set from one another at a species level (Fig 5). Enough genetic diversity is captured using the MDGA to reflect the known taxonomic relationships between the intra-genus samples at a species level. At 9.5 MYD, the pygmy mouse subspecies M. n. minutoides is grouped with the subspecies M. n. orangiae and not the replicate data file of the same species.
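A tree of relatedness of the kind described above is built from a matrix of pairwise distances between samples. The sketch below shows one simple way such a matrix could be computed from genotype calls. The distance measure (mean difference in B-allele count over loci called in both samples), the genotype encoding, and the sample names are assumptions for illustration; this section does not state which distance measure or software the study used.

```python
NO_CALL = -1  # assumed encoding for a missing genotype


def snp_distance(a, b):
    """Mean difference in B-allele count (0, 1, or 2) over loci called in
    both samples, scaled to the range 0..1."""
    shared = [(x, y) for x, y in zip(a, b) if x != NO_CALL and y != NO_CALL]
    if not shared:
        raise ValueError("no loci were genotyped in both samples")
    return sum(abs(x - y) for x, y in shared) / (2 * len(shared))


def distance_matrix(genotypes):
    """Pairwise distances for every ordered pair of sample names."""
    names = sorted(genotypes)
    return {(p, q): snp_distance(genotypes[p], genotypes[q])
            for p in names for q in names}


# Hypothetical toy genotypes (0 = AA, 1 = AB, 2 = BB) at 5 loci
toy = {
    "sample_A": [0, 0, 2, 1, 2],
    "sample_B": [0, 0, 2, 2, NO_CALL],
    "sample_C": [2, 2, 0, 1, 0],
}
matrix = distance_matrix(toy)
```

Loci with a no call in either sample are skipped rather than counted as differences, so that a lower call rate in diverged samples does not by itself inflate the distance. The resulting matrix could then be passed to any standard tree-building method.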

Peromyscus case study
Seven Peromyscus species were genotyped together as a case study to determine if the MDGA could provide useful results that reflect known biological diversity for a number of species of a different genus from Mus. Of the Peromyscus samples queried, approximately 52% of loci queried by the array produce a genotype (Table 2). There are 159,797 loci genotyped across all seven samples (32% of loci queried by the array) despite a 32.7 million-year divergence time from M. musculus. SNP-based genetic distances of Peromyscus species were utilized to produce trees of genetic relatedness that reflect the known divergence times of these species (Fig 6). Top KEGG pathway annotations of the genotyped loci in Peromyscus samples are associated with neurological signaling (Table 3).

Special attention was given to potential polymorphic loci that were genotyped as heterozygous in samples using the MDGA and could be cross-validated as being present in the genome using an in silico search of publicly available genome sequences. More heterozygous loci tend to be genotyped using the MDGA than can be cross-validated as present in the publicly available genome sequence (Table 4). There are 147,452 heterozygous loci genotyped in all three M. caroli samples, and 9,413 of these loci were validated as present in the publicly available genome sequence (Table 4). There are 9,341 of the 147,452 heterozygous loci genotyped in a M. pahari sample that were cross-validated as potential polymorphic SNP loci (Table 4). In two R. norvegicus samples, there are 85,926 loci that were genotyped empirically using the MDGA, and 1,019 loci that were cross-validated using an in silico genome sequence search (Table 4) (Table 6).
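The counts reported in this section (loci genotyped in every sample, and heterozygous loci that are also found in a published genome sequence) amount to set operations over per-sample genotype tables. The following sketch shows that bookkeeping under assumed conventions: the locus identifiers, the genotype encoding, and the `found_in_genome` set are hypothetical placeholders, not values from the study.

```python
NO_CALL = -1  # assumed encoding for a missing genotype
HET = 1       # assumed encoding for a heterozygous (AB) genotype


def loci_called_in_all(samples):
    """Locus IDs that received a genotype call in every sample."""
    per_sample = [{locus for locus, g in calls.items() if g != NO_CALL}
                  for calls in samples.values()]
    return set.intersection(*per_sample)


def heterozygous_in_any(samples):
    """Locus IDs genotyped as heterozygous in at least one sample."""
    return {locus for calls in samples.values()
            for locus, g in calls.items() if g == HET}


# Hypothetical toy data: sample name -> {locus ID: genotype}
toy = {
    "sample_1": {"snp1": 0, "snp2": 1, "snp3": NO_CALL, "snp4": 2},
    "sample_2": {"snp1": 0, "snp2": 2, "snp3": 1, "snp4": NO_CALL},
}
found_in_genome = {"snp1", "snp3"}  # placeholder for in silico search hits

shared = loci_called_in_all(toy)
percent_shared = 100.0 * len(shared) / 4  # 4 loci queried in this toy array
candidates = heterozygous_in_any(toy) & found_in_genome
```

Here `candidates` plays the role of the cross-validated potential polymorphic loci: heterozygous on the array and also present in the searched genome sequence.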
Given that the mouse array targets more than twice as many positions as the canine array, there is a much larger number of loci that can be genotyped in the deer mouse than in the Antarctic fur seal. Future studies will focus on validating a panel of SNPs that are polymorphic in deer mouse populations. Pathway analyses are limited by the information assayed by each technology and are with respect to the annotations of the model organisms. As new sequence information and genome annotations become available for the deer mouse, it will be interesting to see which SNP markers associated with conserved pathways will be found to be shared between the house mouse and the deer mouse. The deer mouse is an intriguing sentinel of environmental effects and a model for population studies that has a surprising lack of genomic information available [18,43]. Cross-species array use may be one technique to identify SNP diversity in these relevant species until genome sequencing prices become more affordable for non-model species. The use of a rat genotyping array in the future may be of use, as the deer mouse shares greater genetic synteny with the rat than with the mouse [44].

two Rattus CEL files, seven Peromyscus CEL files, one Apodemus CEL file, and CEL files representing more highly diverged species including a squirrel, four naked mole rats, a tapir, and an African Black Rhino (S3 Table). CEL file raw array intensity images were analyzed for quality, and abnormalities in array images were noted. Two CEL files (S1 Fig BRLMM-P to train the algorithm in accurate assignment of genotypes [26]. The samples were organized into three test sets that were genotyped separately from one another.
The first genotyping set (known as the inter-order test set) consists of all 44 CEL files representing species spanning different orders of classification and a maximum of 96 million years of divergence (MYD) from the reference house mouse, Mus musculus (Table 1). The second test set (the intra-genus test set) is composed of the 27 samples from the genus Mus and has a maximum divergence of 9.5 MYD from the house mouse (Table 1). The third test set (Peromyscus case study test set) was composed of seven deer mouse species from the genus Peromyscus that have 32.7 MYD from the house mouse (Table 1).

In silico validation of MDGA loci genotyped cross-species and pathway analysis
In silico validation of loci genotyped from MDGA data was performed using the program E-MEM (efficient computation of maximal exact matches for very large genomes) designed by Khiste and Ilie (2015) [31]. The publicly available genomes of rodents were searched for the unique presence of MDGA probe sequences. E-MEM was employed to search a publicly available genome of wild rodents available on NCBI (S4 Table).
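The idea of testing a probe sequence for "unique presence" in a genome can be shown with a small sketch. This is not E-MEM and does not reproduce its algorithm or options; it is a naive exact-substring scan, workable only on short toy sequences. Counting matches on the reverse-complement strand as well is an assumption, and the sequences and function names below are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_overlapping(text, pattern):
    """Number of (possibly overlapping) exact occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def is_unique_in_genome(probe, genome):
    """True if the probe matches exactly once, counting both strands."""
    hits = count_overlapping(genome, probe)
    rc = reverse_complement(probe)
    if rc != probe:  # avoid double counting palindromic probes
        hits += count_overlapping(genome, rc)
    return hits == 1


# Hypothetical toy genome and probes
toy_genome = "ACGTACCTGAAGTCTTTTGGA"
forward_unique = is_unique_in_genome("CCTGAAG", toy_genome)  # one forward match
reverse_unique = is_unique_in_genome("TCAGG", toy_genome)    # one reverse-strand match
repeated = is_unique_in_genome("TT", toy_genome)             # several matches
absent = is_unique_in_genome("GGGG", toy_genome)             # no match
```

A probe counts as validated in this sketch only when exactly one exact match exists, which is why both repeated and absent probes return `False`. For real genomes, a dedicated tool such as the one named above is needed for speed and memory reasons.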