SNP discovery by exome capture and resequencing in a pea genetic resource collection


Background & Summary
In addition to being the model plant used by Mendel 1 to establish genetic laws, pea (Pisum sativum L., 2n=14) is a major pulse crop cultivated in many temperate regions of the world. To face new challenges imposed particularly by global climate change and new regulations targeted at reducing chemical inputs, pea breeders have to take advantage of the genetic diversity present in the Pisum genepool to develop improved, resilient varieties. The aim of this study was to assess the genetic diversity of a pea germplasm collection and to enable genome-wide association studies using this collection.

Performing genome-wide association approaches with high resolution requires genotyping with a large set of genetic markers, such as Single Nucleotide Polymorphism (SNP) markers, well spread over the genome. Rapid advances in second-generation sequencing technologies and the development of bioinformatic tools have revolutionized access to, and the characterization of, available genetic diversity. High-density, high-throughput genotyping has become possible for a large number of species, including those with large and complex genomes 2 such as pea (2n=14), whose genome size is estimated at 4.45 Gb 3. In this study, which is part of the PeaMUST project 4, we used a target capture technology based on pea transcriptome sequences to generate exome-enriched genomic libraries that were then subjected to Illumina sequencing in paired-end mode. This methodology was chosen because whole-genome resequencing is relatively expensive for species with large genomes and because capturing genetic variation in repeated non-coding regions is difficult to achieve or to interpret 5. Whole-exome sequencing represented an interesting alternative that focused on coding regions only 6,7. Mapping the resulting reads to the reference pea genome sequence enabled the discovery of an abundant set of SNPs.

The development of this resource is a cornerstone for research and breeding projects aimed at improving pea production and quality.

[...] Table 1). Leaves were collected from 10 plants per accession, flash-frozen in liquid nitrogen and stored at -80°C prior to DNA extraction. Tissues were then ground in liquid nitrogen using a pestle and mortar. Genomic DNA was extracted using the NucleoSpin Plant II mini kit (Macherey-Nagel, Hoerdt, France) following the manufacturer's instructions.

Probe design
As the pea genome sequence was not available when the probes were designed, two pea transcriptome datasets 9,10 were used to build a reference set (refset) of expressed genes. After redundancy was removed, 67,161 unique contigs were kept, 20,972 of them being common to the two sequence datasets. The first exome capture [...] PairedEnd sequencing strategy of 2 reads of 100 bases. The sequenced reads were trimmed for adaptor sequences using cutadapt 1.8.3 12. Low-quality nucleotides with quality value < 20 were removed from both ends.
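As a rough illustration of the read-cleaning rules described in this section (removal of low-quality bases with quality below 20 from both ends, truncation from the second unknown nucleotide, and removal of reads shorter than 30 nucleotides), the following Python sketch applies them to a single read. It is not the fastx_clean implementation: the function name `clean_read`, its parameters, and the choice to drop the second N itself together with everything after it are assumptions made for illustration, and adapter removal (performed with cutadapt in the study) is not covered.

```python
def clean_read(seq, quals, min_quality=20, min_length=30):
    """Simplified, hypothetical re-implementation of the trimming rules.

    seq   -- read sequence as a string
    quals -- per-base quality values (one integer per base)
    Returns (trimmed_seq, trimmed_quals), or None if the read is discarded.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")

    # 1. Remove low-quality bases (quality < min_quality) from both ends.
    start, end = 0, len(seq)
    while start < end and quals[start] < min_quality:
        start += 1
    while end > start and quals[end - 1] < min_quality:
        end -= 1
    seq, quals = seq[start:end], list(quals[start:end])

    # 2. Trim from the second unknown nucleotide (N) to the end of the read.
    #    Assumption: the second N itself is removed as well.
    first_n = seq.find("N")
    second_n = seq.find("N", first_n + 1) if first_n != -1 else -1
    if second_n != -1:
        seq, quals = seq[:second_n], quals[:second_n]

    # 3. Discard reads shorter than min_length after trimming.
    if len(seq) < min_length:
        return None
    return seq, quals
```

Calling `clean_read` on each read of a pair and dropping pairs where either mate returns `None` would be one way to keep the two files synchronized; how the original pipeline handled orphaned mates is not stated here.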

The longest sequence without adapters and low-quality bases was kept. Sequences between the second unknown nucleotide (N) and the end of the read were also trimmed. Reads shorter than 30 nucleotides after trimming were discarded. These trimming steps were achieved using fastx_clean (http://www.genoscope.cns.fr/fastxtend), an [...]

[...] GTR+F+ASC. Alternatively, we also used a coalescent approach using 10,000 SNP non-overlapping windows as [...]

Tree visualizations were generated using the R package ggtree 18.

Structure analysis
Structure within the collection was inferred using the Bayesian clustering program FastStructure 19 with a logistic prior for K ranging from 1 to 10. The script chooseK.py (part of the FastStructure distribution) was used to determine the best K that explained the structure in the collection based on model complexity. In addition, discriminant analysis of principal components (DAPC) was applied using the dapc function from the adegenet package (version 2.1.5) 20 in order to describe the genetic structure of the panel. Both FastStructure and DAPC groups were visualized on the phylogenetic tree using the R package ggtreeExtra 18.

[...] Supplementary Table 3. On average, 183,170 homozygous variants (compared to the Cameor genome sequence) were detected per accession. [...] Table 4).
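The FastStructure runs described in this section (logistic prior, K from 1 to 10, followed by chooseK.py) could be scripted roughly as follows. This is a sketch only: the helper name, the input and output prefixes `pea_panel` and `fs_out/pea_panel`, and the assumption that genotypes are stored in PLINK bed format are hypothetical and not taken from the study. The command-line options are those listed in the FastStructure README; the sketch only builds the commands and does not run them.

```python
def build_faststructure_commands(input_prefix, output_prefix, k_min=1, k_max=10):
    """Build one structure.py command per K, plus the final chooseK.py call.

    input_prefix and output_prefix are hypothetical file prefixes.
    """
    commands = []
    for k in range(k_min, k_max + 1):
        commands.append([
            "python", "structure.py",
            "-K", str(k),
            f"--input={input_prefix}",
            f"--output={output_prefix}",
            "--prior=logistic",   # logistic prior, as stated in the text
            "--format=bed",       # assumption: PLINK bed input
        ])
    # chooseK.py reads the per-K outputs that share the same output prefix.
    commands.append(["python", "chooseK.py", f"--input={output_prefix}"])
    return commands


if __name__ == "__main__":
    for cmd in build_faststructure_commands("pea_panel", "fs_out/pea_panel"):
        print(" ".join(cmd))
```

Each command list could be passed to `subprocess.run` from within the FastStructure directory; keeping command construction separate from execution makes it easy to inspect the runs before launching them.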

Variant effects
We used the snpEff program 21 to categorize the detected SNPs according to their predicted effects or their locations (Supplementary Table 4). The vast majority (76.71%) were labelled as "Modifier" (placed [...]

In conclusion, this dataset is a large marker resource that can be used for different purposes, including the development of targeted genotyping tools for molecular identification, genetic mapping or genomic selection in pea. It provides insights into pea diversity and helps to investigate selection processes in this species. The SNP resource also empowers genome-wide association studies targeted at revealing the genetic architecture of important traits and highlighting alleles to be used in pea breeding programmes. Indeed, the collection has been evaluated for different traits including plant architecture, phenology and resistance or tolerance to a range of [...]

The authors declare they have no conflict of interest relating to the content of this article.