Double-digest RAD-sequencing: do pre- and post-sequencing protocol parameters impact biological results?

Next-generation sequencing technologies have opened a new era of research in population genetics. Following these new sequencing opportunities, the use of restriction enzyme-based genotyping techniques, such as restriction site-associated DNA sequencing (RAD-seq) or double-digest RAD-sequencing (ddRAD-seq), has dramatically increased in the last decade. From DNA sampling to SNP calling, the laboratory and bioinformatic parameters of enzyme-based techniques have been investigated in the literature. However, the impact of those parameters on downstream analyses and biological results remains less documented. In this study, we investigated the effects of sevral pre- and post-sequencing settings on ddRAD-seq results for two biological systems: a complex of butterfly species (Coenonympha sp.) and several populations of common beech (Fagus sylvatica). Our results suggest that pre-sequencing parameters (i.e., DNA quantity, number of PCR cycles during library preparation) have a significant impact on the number of recovered reads and SNPs, on the number of unique alleles and on individual heterozygosity. In the same way, we found that post-sequencing settings (i.e., clustering and minimum coverage thresholds) influenced loci reconstruction (e.g., number of loci, mean coverage) and SNP calling (e.g., number of SNPs; heterozygosity) but had only a marginal impact on downstream analyses (e.g., measure of genetic differentiation, estimation of individual admixture, and demographic inferences). In addition, replication analyses confirmed the reproducibility of the ddRAD-seq procedure. Overall, this study assesses the degree of sensitivity of ddRAD-seq data to pre- and post-sequencing protocols, and illustrates its robustness when studying population genetics.


Introduction
For a decade, next-generation sequencing (NGS) technologies have opened a new era in the large field of molecular ecology. In particular, the advances in sequencing capabilities have deeply changed the field of population genetics, by providing tremendous amounts of sequence data (10-100 thousand markers) at a decreasing cost Da Fonseca et al. 2016), even for non-model species. Whole-genome re-sequencing (WGR) methods, which provides the highest marker density, appears very interesting to study evolutionary biology (Fuentes-Pardo and Ruzzante 2017). However, the use of such methods remains limited for non-model species, for which a reference genome is not always available, and because producing the data requires considerable sequencing and computing efforts (Fuentes-Pardo and Ruzzante 2017). Stemming from these limitations, reduced-representation sequencing methods have Communicated by Stefan Hohmann.
Tristan Cumer and Charles Pouchon are contributed equally to this work. been developed. These approaches include restriction-site associated DNA sequencing (RAD-sequencing), sequencing of transcribed DNA from mRNA (RNA-sequencing), and target sequencing (e.g., whole-exome sequencing). Overall, reduced-representation methods allow sequencing numerous homologous loci with a great taxa coverage at a relatively low cost (Fuentes-Pardo and Ruzzante 2017). Among these methods, RAD-sequencing, or RAD-seq (Miller et al. 2007;Baird et al. 2008), is certainly the most popular method to obtain thousands of single nucleotide polymorphisms (SNPs) (Andrewset al. 2016). The principle of RAD-seq is to use restriction enzymes to subsample the genome of multiple individuals at homologous genomic locations (Miller et al. 2007;Baird et al. 2008). The resulting DNA fragments are sequenced and compared among individuals to detect SNPs. Since its origin, this technique has been modified into a variety of related approaches (e.g., Peterson et al. 2012;Wang et al. 2012;Toonen et al. 2013;Campbell et al. 2018). Among these, double-digest RADseq, or ddRAD-seq (Peterson et al. 2012), is highly customizable as regards to the final number of loci, according to the choice of enzymes and range of fragment sizes selected. The ddRAD-seq approach has been applied with success to many purposes including population genetic studies (Kjeldsen et al. 2016;Black et al. 2017;Sherpa et al. 2018a, b;Capblancq et al. 2019), phylogenetic reconstructions (DaCosta and Sorenson, 2016;Vargas et al. 2017;Boubli et al. 2018;Lee et al. 2018;Sherpa et al. 2018a, b), demographic inferences (Capblancq et al. 2015;Nunziata et al. 2017;Settepani et al. 2017;Elleouet and Aitken 2018) and landscape genetic analyses (Saenz-Agudelo et al. 2015;Johnson et al. 2017;Capblancq et al. 2020). Despite the recognized advantages of the ddRAD-seq technique, several limitations and weaknesses are highlighted in the literature (Davey et al. 2013;Andrews et al. 2016;Lowry et al. 2017). This study aims at exploring some of the ddRAD-seq caveats for which guidelines are missing in the literature.
The main concerns with ddRAD-seq are related to both pre-sequencing laboratory protocols and post-sequencing bioinformatic procedures (Puritz et al. 2014;Mastretta-Yanes et al. 2015;Shafer et al. 2017;O'Leary et al. 2018). As an example, the choice of enzymes is critical to optimize genome subsampling. This choice will influence the number of digested fragments, their location in the genome, and their size distribution (Burns et al. 2017;Wang et al. 2017). The quality of DNA has also been pointed out as a potential critical step (Davey et al. 2013;Mastretta-Yanes et al. 2015). Degraded (i.e., fragmented) DNA can greatly lower the efficiency of restriction enzyme-based techniques by inducing a loss of recovered RAD fragments (hereafter "loci"). Non-homogeneous amplification of fragments can lead to a substantial loss of alleles due to unbalanced loci coverage Puritz et al. 2014). Furthermore, the number of PCR cycles during library preparation is generally low to minimize PCR artefacts in the sequences (Hohenlohe et al. 2012;Peterson et al. 2012).
The impact of missing data, which are inherent to any genotyping technique, has also been evaluated over the years (Arnold et al. 2013;Gautier et al. 2013;Malinsky et al. 2018;Crotti et al. 2020;Ortiz et al. 2020). Missing data can be due to some extent to an experimental lack of reproducibility in laboratory protocol (e.g., inconsistencies in DNA quantity and quality among individuals, or in size selection), or to polymorphism within the restriction sites (DaCosta and Sorenson 2016). Such polymorphism leads to allele dropout (ADO) for the individuals lacking the restriction site in one or two of the homologous chromosomes. ADO directly influences the estimation of genetic variation and diversity by producing false homozygote callings (Davey et al. 2013;Gautier et al. 2013;Cariou et al. 2016). It has been particularly investigated in phylogenetic studies (Cariou et al. 2013;Eaton 2014;DaCosta and Sorenson, 2016) because of the direct correlation between ADO and the divergence time among lineages and species (Cariou et al. 2013).
Another important concern about RAD-based methods is the bioinformatic treatment of the sequences and the reconstruction of RAD-seq loci (e.g., Shafer et al. 2017). The principle of RAD-seq techniques relies on the identification of homologous loci among individuals. This task is implemented by clustering single-copy loci according to a similarity threshold, which is determined using either distance-based (e.g., STACKS; Catchen et al. 2013), or global alignment (e.g., pyRAD; Eaton 2014) methods. In both cases, stringent parameter settings will avoid the clustering of paralogs but can also split highly divergent single-copy loci in different clusters (Catchen et al. 2013;Eaton 2014). Coverage also influences loci reconstruction and a minimum number of reads is generally set to consider an allele (Catchen et al. 2013). Defining a high threshold value can induce a loss of alleles via insufficient coverage, while an excessively low value will not discard rare sequences originating from PCR or sequencing errors (Paris et al. 2017).
The scientific community has accumulated expertise about RAD-based methods over the last decade to make good use of such techniques. The influence of those parameter settings (e.g., the minimum coverage or the similarity threshold in sequences clustering) has therefore been widely tested and reviewed on the quality and quantity of recovered loci and SNPs (e.g., in Eaton 2014;Mastretta-Yanes et al. 2015;Paris et al. 2017;Rochette and Catchen 2017;Shafer et al. 2017;O'leary et al. 2018). However, their impacts on downstream population genomics analyses (e.g., genetic structure, demographic inferences, etc.) have not been systematically evaluated or mainly focused on genetic differentiation using F ST estimates (see Mastretta-Yanes et al. 2015;Rodríguez-Ezpeleta et al. 2016;Shafer et al. 2017;Malinsky et al. 2018;Díaz-Arce and Rodríguez-Ezpeleta 2019). In addition, several critical points of the laboratory procedure deserve further investigations. For example, the initial DNA quantity and number of PCR cycles, are thought to be critical for the success of the experiment (Peterson et al. 2012;Mastretta-Yanes et al. 2015), but to our knowledge, they have never been experimentally evaluated or reported in the literature.
This study aims to evaluate the influence of parameter variation on ddRAD-seq results for the pre-sequencing step of laboratory work and the post-sequencing step of bioinformatics treatment. On the pre-sequencing side, we evaluated the impact of (i) initial DNA quantity and (ii) number of PCR cycles on the number of recovered loci (a.k.a., RADloci or RAD-tag), their coverage, the number of identified SNPs and some metrics of individual-based genetic diversity (Heterozygosity and private alleles). We also estimated the percentage of genotyping inconsistencies and missing loci among replicates to assess the reproducibility of the protocol. On the post-sequencing side, we investigated the influence of minimum coverage and similarity threshold during loci reconstruction on three types of population genetic analyses typically performed with ddRAD-seq data: (i) genetic differentiation (evaluated using F ST estimation and Principal Component Analysis), (ii) genetic structure (genetic clustering) and (iii) demographic inferences (Approximate Bayesian Computation method). For this purpose, we used two different non-model species for which population genetics had already been investigated using ddRAD-seq data: a complex of four different species of Coenonympha butterflies (Capblancq et al. 2015), and several populations of common beech Fagus sylvatica from the French Alps (Capblancq et al. 2020). We found that variation in laboratory and bioinformatics settings impacted both loci recovery (e.g., SNPs number) and genetic diversity estimation (e.g., the heterozygosity). However, the biological signal resulting from downstream analyses remained very similar across parameter ranges and highly congruent with previous results, confirming the robustness of such approach when studying population genetics.

Sampling
A total of 108 samples were selected from the two reference study samplings (Capblancq et al. 2015(Capblancq et al. , 2020, including 58 individuals of Fagus sylvatica (genome size around 600 Mbp) and 50 individuals of Coenonympha sp. (three species; C. arcania, C. gardetta and C. macromma; genome size around 300 Mbp). Depending on the conditions and settings tested (Fig. 1), different numbers of individuals and populations were used for the experiments (see Table S1).

Standard ddRAD-seq protocol
A double-digested RAD experiment was conducted using a common protocol modified from the original protocol of Peterson et al (2012) for both pre-and post-sequencing stages of the procedure (see details in the sections x 11 x 10 x 10 x 10 x 10 x 30 x 30 Fig. 1 Overview of the experimental design used in this study "Molecular protocol" and "Bioinformatic treatment"). This same common protocol was conducted for all samples, except for some parameter settings as described in the sections "Setting tests for the laboratory protocol" and "Setting tests for the bioinformatics treatment". ) and once again two units of PstI and MspI enzymes. This simultaneous digestion-ligation reaction was performed on a thermocycler using 60 cycles of a succession of 2 min at 37 °C for digestion and 4 min at 16 °C for ligation. The complete volumes of all the digested-ligated fragment mixtures were pooled and purified using magnetic beads (Agencourt AMPure XP of Beckman Coulter, or NucleoMag of Macherey Nagel) with a DNA/beads ratio equal to 1/1.5. Fragments were sizeselected in a range between 250 and 500 bp on agarose gel (1.6%) and excised bands purified with the QIAquick Gel Extraction Kit (Qiagen). The ddRAD-seq library obtained was amplified independently eight times by PCR, and the obtained PCR products were then pooled, to minimize the impact of potential PCR errors. We used the following PCR mix: a final volume of 20 µL containing 1 µL of DNA template, 10 mM of dNTPs, 10 µM of each PCR primer (Peterson et al. 2012) and 2U/µL of Taq Phusion-HF (New England Biolabs Inc.); and the following PCR program: an initial denaturation at 98 °C for 30 s; 15 cycles of 98 °C for 10 s, 66 °C for 30 s and 72 °C for 1 min; followed by a final extension at 72 °C for 10 min. The amplified ddRADseq library was purified with magnetic beads and was sent to Fasteris SA (Geneva, Switzerland) for sequencing on half a lane of a Hi-Seq 2500 sequencer (Illumina, San Diego, CA, USA) with a paired-end approach (2 × 125 bp).

Bioinformatic treatment
DNA sequences (total: ~ 100 million reads) of Fagus sylvatica and Coenonympha sp. libraries were used to call SNP genotypes. We developed a homemade pipeline called ProcessMyRAD (fully available at https ://githu b.com/ cumtr /pmr) to automatically perform the different steps leading from the raw reads to genotype data. This pipeline allows for data cleaning, data visualization and helps users to choose pertinent thresholds during loci reconstruction and genetic dataset export. To call the genotypes, Pro-cessMyRAD relies on the STACKS pipeline (Catchen et al. 2013). To reconstruct loci, the STACKS procedure calls upon three thresholds: the minimum number of reads to consider an allele (m), the maximum number of mismatches allowed between two alleles to reconstruct a locus (M), and the maximum number of mismatches allowed between two individual loci to consider them as homologous (n).

Impact of initial DNA amount and number of PCR cycles
We evaluated the impact of two pre-sequencing parameters on the experimental results: (1) the DNA quantity used for the initial digestion/ligation step; and (2) the number of PCR cycles used to produce the final library. For 10 samples of Coenonympha and 10 samples of Fagus sylvatica (Table S1), we repeated the ddRAD-seq lab experiment three times for each sample with the standard protocol described above but with different quantities of DNA during the first step: 50, 150 or 250 ng of initial genomic DNA. Similarly, we used 10 samples of Coenonympha sp. and 10 samples of Fagus sylvatica to repeat three times the ddRAD-seq lab experiment with different numbers of PCR cycles in the final step of the protocol: 10, 15 or 25 cycles. We then sequenced the resulting libraries all together.
The sequences resulting from these tests were treated with the ProcessMyRAD pipeline, using the following clustering parameters: m = 4, M = 6, n = 8. These values were selected based on the results described in section "Setting tests for the bioinformatic treatment" and especially looking at loci recovery along increasing values of m and M parameters. To estimate the impact of laboratory settings on allele frequencies, we did not filter out alleles based on their frequency. On the resulting genetic datasets, we determined the number of polymorphic loci, the mean locus coverage, the number of SNPs per locus, the individual heterozygosity, and the proportion of private alleles in individuals.

Experimental reproducibility
We assessed the reproducibility of our laboratory protocol by repeating the experiment three times, starting at the digestion/ligation step, for 11 Fagus sylvatica individuals. Each replicate of these triplets was processed with the same protocol and was sequenced within the same Illumina sequencing run. All sequences obtained were treated together with the ProcessMyRAD pipeline with m = 4, M = 6, n = 8 and keeping only SNPs with a minor allele frequency superior to 0.1 (corresponding to at least three individuals to keep the allele). A locus was kept only if sequenced in at least 50% of the 33 replicates.
For each replicate, the number of ddRAD-seq loci (i.e., fragments of DNA digested, amplified and sequenced during the experiment, sometimes called RAD-loci or RAD-tag, can be monomorphic or polymorphic), the mean locus coverage, the proportion of polymorphic loci (within a sample) and the individual heterozygosity (i.e., the proportion of heterozygous sites for the sample across all the retained polymorphic sites in the 33 individuals) were estimated. These different parameters were then compared among replicates to assess the intra and inter-replicate variability. We also evaluated reproducibility by performing a Principal Component Analysis on genetic data with the R-package adegenet (Jombart 2008), and looking at the distances between replicates in the PCA projection space. In addition, the replicates were used to estimate the proportion of inconsistencies in our final genetic dataset. These inconsistencies included two possibilities: (1) difference of genotype between replicates (due to PCR errors or ADO) or (2) locus absence in at least one of the replicates (due to a lack of reproducibility of the experiment). To measure the genotyping inconsistency rate (or genotype mismatches), we identified the proportion of SNPs with genotypic differences among the three replicates. We also estimated the proportion of locus absence by comparing "true" absence (i.e., a locus was missing in all three replicates) to "false" absence (i.e., a locus was missing in one or two replicates).

Setting tests for the bioinformatic treatment
We estimated the influence of the m and M values on ddRAD-seq loci reconstruction and downstream analyses. For this purpose, we used the standard ddRAD-seq laboratory protocol described above on 30 individuals of Fagus sylvatica coming from three different populations (Sainte Beaume, Digne, and Bauges, France; see Table S1) and 30 individuals of Coenonympha butterflies from three different species (C. arcania, C. gardetta and C. macromma; see Table S1). We then repeated the bioinformatic treatment with different m values ranging from 1 to 15 and different values of M ranging from 1 to 25. When the m value varied, M and n were fixed to 6. When the M value varied, m was fixed to 4 and n was equal to M. For all these tests, the remaining steps of the procedure were exactly the same: we kept only one SNP per locus and we excluded SNPs with more than 50% of missing data or a minor allele frequency inferior to 0.1.
To estimate the influence of the m and M values on loci recovery, we determined the number of reconstructed loci, their mean coverage and the proportion of polymorphic loci for each value of M and m tested. We also evaluated the impact of these parameters on population genetics results by performing, for all m and M values, some of the most commonly used analyses using ddRAD-seq data (Capblancq et al. 2015;Kjeldsen et al. 2016;Black et al. 2017;Nunziata et al. 2017;Settepani et al. 2017;Elleouet and Aitken 2018;Sherpa et al. 2018a, b), i.e., mean individual heterozygosity, F ST among populations estimated with the adegenet R package (Jombart 2008), Principal Component Analysis (PCA) using the adegenet R package (Jombart 2008), genetic structure with sNMF using the LEA R package (Frichot and François 2015), and evolutionary history reconstruction using Approximate Bayesian Computation, performed with the diyABC program (Cornuet et al. 2014). The ABC analyses were conducted using a different evolutionary scenario for each model: hybrid speciation for the Coenonympha model and population divergence for the Fagus model. The 1000 simulations the closest from the observed datasets were used to estimate the posterior probability of different demographic parameters (e.g., divergence time, effective population size). The summary statistics used to compare observed and simulated datasets were the mean gene diversity, the F ST among populations across all loci, and the Nei's distance among populations. The results of these analyses were then compared across the m and M ranges and with results from the reference studies (Capblancq et al. 2015(Capblancq et al. , 2020, to evaluate the consistency of the biological signal.

DNA quantity
For all initial DNA quantity conditions, the library produced a mean of 3172 loci (i.e., monomorphic and polymorphic) for Coenonympha, with a mean coverage of 24.6 reads per locus, and 450 loci (monomorphic and polymorphic) for Fagus sylvatica, with a mean coverage of 19.5 reads per locus. If the number of loci was fitting our expectations for Coenonympha (see Capblancq et al. 2015Capblancq et al. , 2019Capblancq et al. , 2020, it was lower than expected for Fagus sylvatica (450 instead of ~ 7000). Considering that, for this particular test of DNA quantity, we only compared individuals processed the same way and showing a very similar number of sequenced loci we decided to keep F. sylvatica in our analyses. With 50 or 150 ng of genomic DNA as template, similar numbers of loci and SNPs were recovered (around 3500 loci for Coenonympha and around 500 for Fagus sylvatica) (Fig. 2). Considering the range of locus coverage observed (from 15 to 30 for Coenonympha and from 15 to 23 for Fagus sylvatica), we observed only little impact on individual heterozygosity (He ~ 0.125 for Coenonympha and He ~ 0.25 for Fagus sylvatica). Conversely, we noticed that using 250 ng of DNA during the initial step of digestion/ligation could dramatically decrease locus recovery (divided by 1.5) and SNPs identification for some individuals (Fig. 2). Using 250 ng of DNA also induced a greater variability among tested individuals (Fig. 2).

Number of PCR cycles
For all PCR conditions, the library produced a mean of 2480 loci (monomorphic and polymorphic) for Coenonympha with a mean coverage of 23.9 reads per locus and 520 loci (monomorphic and polymorphic) for Fagus sylvatica with a mean coverage of 23.3 reads per locus. Here again, the number of loci obtained for Fagus sylvatica was lower than expected but we still included the species in our analyses for the same reasons as explained above. The results showed a great variability depending on the PCR settings (Fig. 2). Increasing the number of PCR cycles in the final library preparation increased the number of loci and SNPs recovered. For example, the mean number of loci for Fagus sylvatica ranged from 41 for 10 cycles to 1028 for 25 cycles. Similar results were obtained for Coenonympha sp. for which the number of loci varied from 350 to 4500. The number of PCR cycles was also directly associated with individual heterozygosity and with the number of private alleles in the individuals. For example, the individual heterozygosity increased from 0.09 to 0.27 for Fagus sylvatica individuals when the number of cycles increased from 10 to 25. In the same way, the number of private alleles doubled when the number of PCR cycles increased from 10 to 25 for both Coenonympha sp. and Fagus sylvatica samples. Finally, 10 cycles of PCR lowered substantially the number of loci and SNPs as well as the number of private alleles in the final genetic dataset.

Reproducibility of the experiment and estimation of inconsistencies
The library of the 11 Fagus sylvatica triplicates produced a mean of 7547 loci (monomorphic and polymorphic) with a mean coverage of 20.3 reads per locus. The PCA performed on the complete genetic dataset showed very consistent results across the 11 tested individuals (Fig. 3). Inter-individual genetic variability was higher than interreplicate genetic variability. All triplicates clustered in the PCA plot and the different individuals could easily be differentiated. Moreover, considering the eigenvalues, the 10 first axes retained most of the genetic variance within the three replicates * 11 sample tests (92%). For example, PC1 strongly discriminates individual VTX_H_83 from the rest of the sampling and PC2 differentiates individuals SB_H_42 the mean coverage of these loci, the number of identified SNPs, the individual heterozygosity and the number of private alleles in individuals. A log-transformation was performed on the results in order to simplify the comparison of the two models and BG_1_1. This suggests that each axis captured parts of inter-individual genetic variability differentiating a particular individual from the remaining samples. Replicates did not seem to add any substantial genetic variability that could have been caught by the PCA.
Across all replicates and individuals, the number of recovered loci varied from around 4000-16,000; the mean coverage from 8 to 40; the proportion of polymorphic loci from 0.15 to 0.38, and the individual heterozygosity from 0.27 to 0.37 (Fig. 3). For seven individuals, the replicates returned almost exactly the same number of loci, the same ratio of polymorphic loci and the same individual heterozygosity. Four individuals showed more contrasted results. No association between the initial DNA concentration, quality (assessed on a gel) or library in which individuals were included and the consistency of ddRAD-seq results was observed for these individuals (data not shown).
Regarding the estimation of genotype inconsistencies, the results were congruent across the 11 tested individuals (Fig. 4). The maximum inconsistency rate was just above 4% and the minimum is around 1.6%. Similarly, we obtained a good proportion of loci recovery among replicates. Between 66 and 90% of the loci were found in all three replicates (Fig. 4). The individuals with a low locus recovery rate were the exact same ones that showed a great variability in the reproducibility experiment (see above). Some loci were missing for all three replicates, and the proportions of missing loci were pretty homogeneous across individuals, varying from 5 to 13%. Finally, for all samples, a fair proportion of loci (2-24%) was found in only one or two replicates, giving an estimation of locus loss not due to restriction site polymorphism across individuals but to incomplete digestion, ligation, amplification or sequencing of these loci.

Influence of bioinformatic thresholds on the biological results
The ddRAD-seq libraries used for bioinformatic tests produced very variable numbers of loci and coverage depending on the thresholds used during the analysis pipeline. With the parameter values m = 4 and M = 6, we obtained a mean of 3246 loci (monomorphic and polymorphic) for Coenonympha individuals with a mean coverage of 40.16 reads per locus and 11,018 loci (monomorphic and polymorphic) for Fagus sylvatica samples with a mean coverage of 25.18 reads per locus. The minimum coverage required to create a stack of sequences during the first step of STACKS procedure (m) had a direct influence in the number of ddRAD-seq loci, their mean coverage and the number of SNPs identified (Fig. 5). Furthermore, the pattern was very similar for both the Coenonympha sp. and Fagus sylvatica models. An increase in Nevertheless, the m parameter did not seem to greatly influence any of the downstream population genetic analyses. No major difference among the m parameter settings was observed for F ST estimation among populations (Fig. 6), PCA and genetic clustering results (Fig. 6, Fig. S1-2), or demographic inferences (Fig. 7). Here again, the results were similar for both the animal and plant models. While changes in absolute F ST values or PCA scores were noticed when m varied, the populations remained differentiated in the same way and strength (Fig. 6). For example, the F ST values ranged from 0.28 to 0.33 between Coenonympha arcania and C. gardetta but the ranking of F ST values among the three pairs of species did not change depending on m (Fig. 6). An increasing m seemed to slightly influence the percentage of inertia retained by the first two PCs for both Fagus sylvatica and Coenonympha sp. (from 18 to 20% of the genetic variance, see Fig. 6) but population differentiation on the PCA was not affected. The Procrustes superimposition performed on the first two axes of the PCAs returns correlation coefficients superior to 0.96 between each pair of m values for Coenonympha and superior to 0.85 for F. sylvatica (Fig. S7). Regarding the genetic structure (sNMF analysis), the number of K selected with the cross-entropy criterion did not vary for Coenonympha samples and varied between 2, 3 and 4 for Fagus sylvatica individuals (Fig. 6). This variation was due to very close values of cross-entropy for K = 2, 3 and 4 (Fig. S2). Similarly, the differentiation of species groups in the sNMF analysis remained exactly the same across the range of m values (Fig. 6). The Procrustes superimposition performed on the percentage of assignation to the three main clusters obtained with sNMF returns correlation coefficients superior to 0.975 between each pair of m values for Coenonympha and superior to 0.99 for F. sylvatica (Fig. S8). Finally, we did not detect any influence of the m parameter on the estimations of population size, divergence time or hybridization rate through ABC procedure (Fig. 7). All model parameters showed approximately the same posterior distribution whatever the m value, with only a small variation between the maximum and the minimum of the estimates across the m range (Table S1).   S1-S6 for the complete results). Ellipses in the PCA distinguish the different populations or species, and sNMF results are shown for K = 3, which is the best number of clusters in almost all cases according to the cross-entropy criterion 1 3 The maximum number of mismatches allowed between two stacks of sequences to merge two alleles in one locus (M) greatly influenced the number of recovered loci, the number of identified SNPs and individual heterozygosity (Fig. 5). When M varied from 1 to 25 the number of recovered loci decreased from 12,000 to 10,000 for Fagus sylvatica and from 3500 to 3000 for Coenonympha sp. On the opposite, the number of SNPs identified increased rapidly for the first M values of the range (1-6) until a plateau was reached around 3800 loci for Fagus sylvatica individuals. The influence of M on individual heterozygosity was clear for M values between 1 and 6, for which heterozygosity increased with M. For higher values of M, the relationship was less obvious and the variation of individual heterozygosity did not seem to follow the variation of M.
In agreement with the results obtained for the m parameter, population genetic analyses showed very consistent results across the range of tested M values. Again, no substantial effect was observed for F ST values among populations, PCA results, genetic clustering results or demographic inferences when M value varied (Figs. 6,7,Figs. S4 and S5). M variation did not impact PCA, neither in terms of population differentiation, nor in terms of percentage of inertia of the two first axes. The Procrustes superimposition performed on the first two axes of the PCAs returns correlation coefficients superior to 0.95 between each pair of M values for Coenonympha and superior to 0.85 for F. sylvatica (Fig. S7). For genetic structure however, we observed a slight change in the number of K selected by the cross-entropy criterion. Here again, it is rather due to close cross-entropy values at K = 3 and K = 4 than to a real variation across the M range (Fig. S5). Even though, neither the genetic grouping of individuals nor the percentage of assignation varied across the range of M values (Fig. 6 and Fig. S8). Finally, all parameters inferred during the ABC analysis showed very consistent distributions whatever the M value used for the sequence and effective population size (N1, N2, N3) for Fagus sylvatica and Coenonympha populations, and hybridization contribution (ra) for Coenonympha clustering (Fig. 7), with only a small variation between the maximum and minimum of the estimates across the M range (Table S1).

Discussion
Double-digest RAD-sequencing is a technique widely used to investigate population genetics in a large range of nonmodel organisms (Peterson et al. 2012;Andrews et al. 2016).
Multiple studies have investigated the impact of parameter settings on pre-and post-sequencing procedures of RADbased protocols, especially in loci reconstruction summary statistics, such as the number of identified loci or SNPs, or heterozygosity (Davey et al. 2013;Gautier et al. 2013;Puritz et al. 2014;Mastretta-Yanes et al. 2015;Burns et al. 2017;Rochette and Catchen 2017;Wang et al. 2017;Willis et al. 2017;O'leary et al. 2018). Nevertheless, only a handful of studies linked those parameters to downstream biological interpretations (Mastretta-Yanes et al. 2015;Rodríguez-Ezpeleta et al. 2016;Shafer et al. 2017;Malinsky et al. 2018; Díaz-Arce and Rodríguez-Ezpeleta 2019; Crotti et al. 2020;Ortiz et al. 2020). In addition, some pre-sequencing factors, such as initial DNA quantity or number of PCR cycles, are regularly pointed out as sensitive parameters of RAD-seq protocol (Hohenlohe et al. 2012;Peterson et al. 2012) but lack robust experimental testing across different models. Such investigation would allow building a solid knowledge of pre-sequencing protocol influence during RAD-seq or ddRAD-seq data production. Our testing procedure does not claim to cover all factors that could influence the ddRADseq method, but rather points out at important parameters and gives some advices to improve the protocol, complementing the recent literature (O'leary et al. 2018).

Pre-sequencing treatment
To assess the sensitivity of the laboratory protocol to variation in parameters values, we evaluated the influence of two factors for which we did not find any proper evaluation in the literature: initial DNA quantity and number of PCR cycles during library amplification.

DNA quantity
The initial amount of DNA required for ddRAD-seq library preparation is potentially limiting when working with small or scarce individuals (Blair et al. 2015;Shortt et al. 2017). We found that only a small amount of DNA template (i.e., ~ 50 ng; Fig. 2) was required for library preparation, opening the possibility of using ddRAD-seq technique with low-DNA quantity. It could, for example, be of particular interest for non-invasive sampling in a conservation genetics context. The initial protocol from Peterson et al. (2012) already suggested using less than 100 ng of DNA per individual but ddRAD-seq users commonly use more than 200 ng and even up to 1 µg (Capblancq et al. 2015;Yang et al. 2016;Burns et al. 2017;Sherpa et al. 2018a, b). Our results also showed that, in this experiment, a 5-time increase in DNA quantity (i.e., 250 ng instead of 50 or 150 ng in our study) increased the probability of experimental failure by reducing loci and SNPs recovery and by inflating the variation within samples (Fig. 2). Indeed, when using 250 ng of DNA, some samples had significantly fewer loci than others, associated with a low number of sequenced reads (Fig. S9). This inflation of within sample variance at 250 ng may be explained by multiple factors. First, we observe this phenomenon only in the library containing samples with 250 ng and only some individuals failed but similarly for the two species models. We, thus, tend to think that it may be an effect of the DNA quantity/concentration, with problems occurring before pooling the different samples, probably during the digestion/ligation step and potentially due to enzyme saturation, unbalanced ratio of DNA fragments / ligation adaptors or even pipetting errors. However, all 250 ng samples were processed in the same library (Fig.  S9). We thus cannot consequently rule out a library effect (i.e., batch effect) on the observed pattern generated by technical causes as pipetting errors, low accuracy, discrepancies in DNA pool concentration or in laboratory conditions (Bonin et al. 2004

PCR cycles
The number of PCR cycles is another part of the experiment that has to be carefully considered as it could introduce artefacts during the SNP calling (i.e., false heterozygotes calling) through PCR errors and biases. For example, Davey et al. (2013) highlighted that PCR cycles introduced GC biases in sequenced RAD libraries. RAD loci with high-GC content were sequenced more often compared to their counterparts with a low-GC content, for high numbers of PCR cycles, while the opposite was observed for low numbers of PCR cycles. Our results showed that the number of SNPs and individual heterozygosity are reduced with a low number of PCR cycles, while a high number of PCR cycles increases the number of private alleles. This highlights the trade-off existing between a satisfactory coverage, directly related to the number of PCR cycles, and the limitation of errors occurring during PCR (Hohenlohe et al. 2012). Usually the number of PCR cycles is set between 12 and 16 1 3 (Peterson et al. 2012;Capblancq et al. 2015;Yang et al. 2016;Burns et al. 2017). Our study highlights that using 15 cycles allowed to recover as many ddRAD-seq loci and SNPs as when using 25 cycles while lowering the level of heterozygosity and the number of private alleles. It would be interesting to investigate further if the observed increase in private alleles and individual heterozygosity could be corrected by stringent filtering on allele frequencies, which would discard low frequency alleles in the population. Unfortunately, our sampling size did not allow conducting such investigation. In agreements with our findings, the literature also suggests that limiting the number of PCR cycles reduces the amount of PCR clones (i.e., PCR duplicates) that are incorporated during amplification and can strongly bias heterozygosity estimation (Andrews et al. 2016;Davey et al. 2013).

Library effect
Variability in coverage and set of loci generated can occur among libraries during the laboratory stages of library preparation (see previous comments in the DNA quantity section). In their review, O'leary et al. (2018) gave strategies for technical mitigation of such library effects, by randomizing samples across sequencing lanes and by replicating individuals. Here, we used triplicates of 11 individuals of Fagus sylvatica, processed in different libraries, to evaluate the reproducibility of the ddRAD-seq protocol. Overall, the protocol seems robust with a high consistency between replicates for the number of recovered loci, the coverage, the rate of polymorphism and individual heterozygosity, although some triplicates showed variability (Fig. 3). This observed variability was due to variation in the number of reads obtained for each triplicate (Fig. S10), and appeared equally important for replicates processed in the same library or in different libraries. The fact that this variation was only present for 4/11 individuals let us think that it might be due to a lower DNA quality for those individuals. Despite this variability in loci recovery for some individuals, all triplicates were very close on the genetic PCA (Fig. 3). These results are consistent with those of Mastretta-Yanes et al.
, where most replicate pairs were clustered together in neighbour-joining dendrograms. Therefore, even if there are several steps during ddRAD-seq laboratory experiment that can lack reproducibility (e.g., digestion/ligation, range size selection, amplification by PCR), our results were robust across replicates. Given the high reproducibility and the low rates of genotyping inconsistency and missing loci (Fig. 4), we confirm here that ddRAD-seq can be a robust procedure as long as some key parameters have been carefully tuned during the laboratory work.

Post-sequencing treatment
An important part of the RAD-based sequencing literature relates to the bioinformatic treatment of sequences and to loci reconstruction (Mastretta-Yanes et al. 2015;Paris et al. 2017;Rochette and Catchen 2017;Wang et al. 2017). From de novo procedures to methods based on a draft genome, all the strategies aiming at reconstructing loci search putative homology among sequences based on clustering similarity (Catchen et al. 2013;Eaton 2014). As a consequence, clustering thresholds (e.g., M and m parameters of the STACKS software procedure) have a substantial impact on bioinformatic results and summary statistics. For example, these thresholds have been shown to influence the number of recovered loci, coverage, number of identified SNPs and error rates (Mastretta-Yanes et al. 2015;Paris et al. 2017;Rochette and Catchen, 2017;Wang et al. 2017). In agreement with those previous works, we found a substantial impact of the minimum coverage (m) and clustering (M) thresholds on RAD-loci recovery and reconstruction during the bioinformatic process (Fig. 5).
The minimum coverage imposes a minimum number of reads to consider an allele. Many alleles are thus expected to be lost with a high m, while a low m can give too much importance to very rare sequences, and thus potentially to sequencing or PCR errors (Catchen et al. 2013). As expected, we found that a high m value was associated with a higher coverage rate but also a lower number of stacked loci and identified SNPs. More surprising, we also found that individual heterozygosity increased with m. That could be due to the progressive removal of poorly covered loci, increasing at the same time the proportion of potential paralogs that would both exhibit high coverage and artificially high heterozygosity. On the other side, the similarity threshold (M) determines the minimum sequence homology to consider that two sequences are variants of the same locus. Choosing a too high M value can wrongly impede the clustering of different alleles of the same locus (i.e., oversplitting), consequently leading to a decline in estimated heterozygosity. On the contrary, a too low M value could lead to the merging of paralogs or repetitive regions of the genome (i.e., undersplitting), increasing at the same time the estimated heterozygosity rate. In agreement with the literature (Ilut et al. 2014;Harvey et al. 2015;Willis et al. 2017) we found that an increase of M comes with an increase in both the number of SNPs and the estimated individual heterozygosity, until these two values reach a plateau (Fig. 5). However, by investigating the number of alleles per locus within individuals (Fig. S11), we showed that the proportion of bi-allelic loci decreases slightly with the increase of M and m, but remains very high (up to 75% for M lower than 10 and all values of m for both species). This may indicate that even with high M and m values, impacts of drawback listed above remain limited.
One can expect that the observed variation in loci reconstruction, SNPs identification and heterozygosity estimation would influence, to some extent, the population genetic analyses performed with most ddRAD-seq datasets (Shafer et al. 2017). However, our results suggest that this is not necessarily the case, at least not for the two models studied here. If variations in genetic differentiation, clustering or demographic inferences were detected, the relative ranking or positions of populations remained unchanged both at the interspecific (animal model) and intraspecific levels (plant model) (Figs. 6, 7, Fig. S7-8). Those results are consistent with other population genetics studies (Mastretta-Yanes et al. 2015; Díaz-Arce and Rodríguez-Ezpeleta 2019; Malinsky et al. 2018;Crotti et al. 2020) or with phylogenetic studies (Herrera et al. 2015;Hou et al. 2015;Lee et al. 2018). Moreover, despite a reduced number of individuals, the biological results obtained here (i.e., genetic clustering and differentiation, demographic events timing and population sizes) are highly congruent with the results obtained with larger samplings in both Coenonympha and Fagus sylvatica reference studies (Capblancq et al. 2015(Capblancq et al. , 2020. Such consistency in downstream population genetics analyses may be explained by the large amount of information generated by the ddRAD-seq method (usually 10 or 100 of thousands of SNPs). A potential "false" signal due to genotyping inconsistencies at some loci (due either to the pre-or postsequencing stage of the experiment) seems negligible compared to the abundance of "true" signal provided by most of the ddRAD-seq loci. Such results could potentially differ for different study models, experimental conditions or bioinformatic strategies (see Ortiz et al. 2020), and we recommend to optimize the treatment of the sequences for each model and experimental design. Nonetheless, our results emphasize more the importance of testing the laboratory steps of the protocol, for example by replicating individuals and testing different conditions. This pre-sequencing part of the protocol seems to be the key for the optimal success of a ddRAD-seq experiment.