Double-digest RAD-sequencing: do wet and dry protocol parameters impact biological results?

Next-generation sequencing technologies have opened a new era of research in genomics. Among these, restriction enzyme-based techniques such as restriction-site associated DNA sequencing (RADseq) or double-digest RAD-sequencing (ddRADseq) are now widely used in many population genomics fields. From DNA sampling to SNP calling, both wet and dry protocols have been discussed in the literature to identify key parameters for an optimal loci reconstruction. The impact of these parameters on downstream analyses and biological results drawn from RADseq or ddRADseq data has however not been fully explored yet. In this study, we tackled this issue by investigating the effects of ddRADseq laboratory (i.e. wet protocol) and bioinformatics (i.e. dry protocol) settings on loci reconstruction and inferred biological signal at two evolutionary scale using two systems: a complex of butterfly species (Coenonympha sp.) and populations of Common beech (Fagus sylvatica). Results suggest an impact of wet protocol parameters (DNA quantity, number of PCR cycles during library preparation) on the number of recovered reads and SNPs, the number of unique alleles and individual heterozygosity. We also found that bioinformatic settings (i.e. clustering and minimum coverage thresholds) impact loci reconstruction (e.g. number of loci, mean coverage) and SNP calling (e.g. number of SNPs, heterozygosity). We however do not detect an impact of parameter settings on three types of analysis performed with ddRADseq data: measure of genetic differentiation, estimation of individual admixture, and demographic inferences. In addition, our work demonstrates the high reproducibility and low rate of genotyping inconsistencies of the ddRADseq protocol. Thus, our study highlights the impact of wet parameters on ddRADseq protocol with strong consequences on experimental success and biological conclusions. Dry parameters affects loci reconstruction and descriptive statistics but not biological conclusion for the two studied systems. Overall, this study illustrates, with others, the relevance of ddRADseq for population and evolutionary genomics at the inter- or intraspecific scales.

repeating the experiment, since the digestion/ligation step, three times for 11 Fagus sylvatica 2 3 0 individuals. Each replicate of these triplets was processed with the same protocol and was sequenced in at least 50% of the 33 replicates. For each replicate, the number of ddRADseq fragments, the mean fragment coverage, the 2 3 6 proportion of polymorphic fragments and the individual heterozygosity were estimated. These 2 3 7 different parameters were then compared among replicates to assess the intra and inter-2 3 8 replicate variability. We also evaluated reproducibility by performing a Principal Component 2 3 9 Analysis on genetic data with the R-package adegenet (Jombart, 2008), and looking at the 2 4 0 distances between replicates in the PCA projection space. In addition, the replicates were used can take two different forms: errors of genotypes (due to PCR errors or ADO) or fragment 2 4 3 absence (due to a lack of reproducibility of the experiment). We measured the genotyping replicates. Then, by looking at the fragments in the three replicates, we could estimate the 2 4 6 proportion of "true" fragment absences, when the fragment was missing in all three replicates, 2 4 7 and the proportion of "false" fragment absences, when the fragment was missing in just one 2 4 8 or two of the replicates.  For this purpose, we used the standard ddRADseq wet protocol described above on 30 Digne, and Bauges, France; see Table S1) and 30 individuals of Coenonympha butterflies 2 5 6 from three different species (C. arcania, C. gardetta and C. macromma; see Table S1). We 6. When the M value varied, m was fixed to 4 and n was equal to M. For all these tests, the 2 6 0 remaining steps of the procedure were exactly the same and the last step of genetic dataset 2 6 1 export was performed by keeping only one SNP by RAD tag fragment and without any 2 6 2 filtering on allelic frequency.

6 3
To estimate the influence of the m and M values on RAD tag fragment recovery, we 2 6 4 determined the number of reconstructed fragments, their mean coverage and the proportion of polymorphic fragments for each value of M and m tested. We also evaluated the impact of DNA during the initial step of digestion/ligation could dramatically decrease fragment  results showed a great variability depending on the PCR settings. Increasing the number of and SNPs recovery. The mean number of fragments for Fagus sylvatica ranged from 41 for 2 9 9 10 cycles to 1,028 for 25 cycles. Similar results were obtained for Coenonympha sp. for

Reproducibility of the experiment and estimation of inconsistencies
The library of the 11 Fagus sylvatica triplicates produced a mean of 7,547 fragments with a Moreover, considering the eigenvalues, the 10 first axes retained most of the genetic variance 3 1 6 within the three replicates * 11 sample tests (92%). For example, PC1 strongly discriminates 3 1 7 individual VTX_H_83 from the rest of the sampling and PC2 differentiates individuals variability differentiating a particular individual from the remaining samples. Replicates did 3 2 0 not seem to add any substantial genetic variability that could have been caught by the PCA. more contrasted results but with similar patterns across the different parameters we measured.

2 7
No association between the initial DNA concentration after extraction and the consistency of ddRADseq results was observed (data not shown).

2 9
Regarding the estimation of genotype inconsistencies, the results were congruent 3 3 0 across the 11 tested individuals (Fig. 4). The maximum inconsistency rate was just above 4% same ones that showed a great variability in the reproducibility experiment (see above). Some pretty homogeneous across individuals, varying from 5% to 13%. Finally, for all samples, a 3 3 7 fair proportion of fragments (2 to 24%) was found in only one or two replicates, giving an 3 3 8 estimation of fragment loss not due to restriction site polymorphism across individuals but to 3 3 9 incomplete digestion, ligation, amplification or sequencing of these fragments.

4 2
The ddRADseq libraries used for bioinformatic tests produced very variable numbers of of individual heterozygosity did not seem to follow the variation of M.

9 7
In agreement with the results obtained for the m parameter, population genetic  (Table S2).  Multiple studies have speculated and tested the impact of different parameter settings on pre- as well as on population genetics and demographic inferences. The initial amount of DNA required in ddRADseq library preparation is a constraint  and the SNP recovery and by inflating the variability between samples (Fig. 2). This points The number of PCR cycles is another part of the experiment that has to be carefully during PCR, because these can lead to very weak fragment coverage impeding loci  frequency alleles in the population. Unfortunately, our sampling size did not allow such tests. Furthermore, we showed that, when properly calibrated, the protocol is greatly inconsistency and missing fragments (Fig. 4), our results illustrate that ddRADseq is an 4 7 6 accessible method with some key parameters that have to be finely tuned to gain in robustness 4 7 7 and reproducibility. Our testing procedure does not claim to cover all parameters that could 4 7 8 influence the ddRADseq method but points at key information about lab protocols and gives 4 7 9 clues to optimize the technique.  We could have expected that such variation in loci reconstruction and SNPs 4 9 9 identification would influence, to some extent, the population genetic analyses performed level for the animal model nor at the intraspecific level for the plant model (Fig 6, 7, S7 and 5 0 4 S8).

0 5
Moreover, despite a reduced number of individuals, the results of this study are Fagus sylvatica (Capblancq et al., 2015;Capblancq, unpublished data). Such patterns may be of thousands of SNPs). A potential "false" signal due to genotyping inconsistencies at some 5 1 0 loci seems negligible compared to the abundance of "true" signal provided by most of the 5 1 1 RAD loci. Similar observations have also been made for another species model, i.e..  All (3)