Developing best practices for genotyping-by-sequencing analysis in the construction of linkage maps

Abstract Background Genotyping-by-sequencing (GBS) provides affordable methods for genotyping hundreds of individuals using millions of markers. However, this challenges bioinformatic procedures that must overcome possible artifacts such as the bias generated by polymerase chain reaction duplicates and sequencing errors. Genotyping errors lead to data that deviate from what is expected from regular meiosis. This, in turn, leads to difficulties in grouping and ordering markers, resulting in inflated and incorrect linkage maps. Therefore, genotyping errors can be easily detected by linkage map quality evaluations. Results We developed and used the Reads2Map workflow to build linkage maps with simulated and empirical GBS data of diploid outcrossing populations. The workflows run GATK, Stacks, TASSEL, and Freebayes for single-nucleotide polymorphism calling and updog, polyRAD, and SuperMASSA for genotype calling, as well as OneMap and GUSMap to build linkage maps. Using simulated data, we observed which genotype call software fails in identifying common errors in GBS sequencing data and proposed specific filters to better handle them. We tested whether it is possible to overcome errors in a linkage map using genotype probabilities from each software or global error rates to estimate genetic distances with an updated version of OneMap. We also evaluated the impact of segregation distortion, contaminant samples, and haplotype-based multiallelic markers in the final linkage maps. Through our evaluations, we observed that some of the approaches produce different results depending on the dataset (dataset dependent) and others produce consistent advantageous results among them (dataset independent). Conclusions We set as default in the Reads2Map workflows the approaches that showed to be dataset independent for GBS datasets according to our results. This reduces the number of required tests to identify optimal pipelines and parameters for other empirical datasets. Using Reads2Map, users can select the pipeline and parameters that best fit their data context. The Reads2MapApp shiny app provides a graphical representation of the results to facilitate their interpretation.

To use the polyRAD approach, the VCF files are imported using VCF2RADdata with-181 out applying any filters or considering phase information. The polyRAD model is run with 182 PipelineMapping2Parents default arguments which assume an F 1 bi-parental population. 183 The function Export_MAPpoly is used to export the genotype probabilities. The vcfR OneMap with prior information regarding the reliability of each input genotype, thereby in-216 creasing the HMM's adaptability. The create_probs function allows users to input three 217 types of values: a global error value (global_error); an error probability for each inferred 218 genotype (genotypes_error); or genotype probabilities for each possible genotype in individ-219 uals (genotypes_probs). This flexibility empowers users to tailor the analysis to their specific 220 dataset characteristics and improve the accuracy of the results. This update is described in 221 detail in Supplementary File 1.

222
The OneMap software previous to version 3.0 considered the HMM error probability as a 223 single value of 10 −5 for every genotype. In version 3.0, this value is kept as default to keep 224 the code reproducible. However, it is noteworthy that this probability can be unreliable in 225 several situations when the genotypes are more prone to errors, especially for new genotyping 226 technology (e.g. GBS data).

227
OneMap 3.0 updates also include the possibility to parallelize the HMM using the approach 228 described by Schiffthaler et al. (2017). It parallelizes the procedure into a maximum of 229 four cores. We used this new OneMap feature to estimate the genetic distances. We also 230 implemented new functions for linkage maps quality diagnostics such as interactive plots 231 for recombination fraction matrices, progeny haplotype representation, and counts of the 232 recombination breakpoints in progeny.

233
Despite using the parallelized HMM, the genetic distance estimations in OneMap can 234 take time to run with a high number of markers, chromosomes, and tested combinations of 235 software. Therefore, the EmpiricalReads2Map workflow runs the HMM in just a subset of 236 markers which can be a single chromosome or a fragment of a chromosome. The alignment, 237 the SNP, and genotype calling steps are performed with the entire dataset. After running 238 the workflow and deciding the pipeline that provided the best results, the respective VCF 239 output can be used to build the linkage map for all chromosomes in the R environment with 240 OneMap functions.

241
The OneMap function onemap_read_vcfR is used to convert the VCFs to the OneMap R 242 object format. The markers are filtered again by a maximum number of missing data of 243 25% because the VCF files include unexpected genotypes according to the segregation of a 244 given locus (e.g. in a cross "AA x AB", genotype "BB" cannot exist). OneMap makes this

248
The Reads2Map workflows give flexibility to the user to define the probabilities to be used 249 in the OneMap HMM for the estimation of the genetic distances. Users can provide more than 250 one value to be tested as global errors (global_error input); can choose to use the upstream 251 genotype caller error probability (genoprob_error input); and can provide global error values 252 to be considered together with the software probabilities (genoprob_global_error input) 253 according to the following: 1 − (1 − global error)x(1 − sof tware error probability).

254
For GATK, TASSEL, Stacks, and Freebayes callers, the workflow uses in the HMM the 255 Phred score genotype error (GQ FORMAT value) converted to probabilities. For the software 256 polyRAD, SuperMASSA, and updog it uses 1−output genotype probability as a genotype error.

257
For these last, the population's structure (F 1 ) is used as a priori information to increase the 258 accuracy of the estimated genotypes.

259
The simulations do not consider interference in the recombination events. Therefore the 260 Haldane map function was used to estimate the genetic distances in SimulatedReads2Map.

261
Kosambi's map function was applied to estimate the genetic distances in the EmpiricalReads2Map.    The sequencing reads of the two empirical datasets were filtered using the Stacks plugin 295 process_radtags (Catchen et al., 2013) to filter sequences by the presence of the restriction 296 site and sequencing quality. The reads were discarded if the average quality score of 50% 297 of its length was below the Phred score of 10 (or 90% probability of being correct). The     Table 1: Marker types according to parental genotype combinations and progeny segregation. The letters "a", "b", "c" and "d" represent different alleles and the letter "o" represents null alleles. Adapted from (Wu et al., 2002   The markers identified by the SNP callers (GATK, TASSEL, Stacks, Freebayes) were 374 filtered by minor allele frequency (MAF) of 5% and maximum missing data allowed of 25% 375 before proceeding to the genotype callers (updog, polyRAD, and SuperMASSA). At this step, 376 we also tested two other filters. One of them was removing non-informative markers from 377 the VCF file. We considered non-informative markers homozygous in both parents or if at 378 least one of the parental genotypes was missing. The second filter was to replace the allele 379 depth (AD) field in the VCF file format by missing data when the genotype is missing. This 380 avoids that updog, polyRAD, and SuperMASSA use the allele depth when GATK filtered out 381 the genotype due to bad quality.
After the genotype call, we reduce the analysis to a subset of markers (the first 8.426 data. 396 We also tested the consequences of the presence and absence of the and Stacks haplotype- In the testing of scenarios in which we considered multiallelic markers, the VCFs con-  Table 2 shows an overview of the notations used to refer to each evaluated scenario.  We use the structure of the Reads2Map workflows, the simulated, and the empirical datasets 466 to test each software and some different parameters and markers filters. Our goal was to 467 identify the approach that provides the best quality linkage map. 468 We have categorized the approaches used in our analysis into two groups: dataset-   They can be also a combination of VCFs from different software such as the common markers 491 between the implemented SNP call software results ("intersect" in Figure 2). 492 We had to perform extra manipulations in TASSEL VCF output to be able to run the  In the VCF file outputted by GATK the P1 genotype is missing (GT ./.) because the reads did not pass the quality filters, but it reports the counts in the reference AD field (149,0). The updog software use progeny segregation (1:1) to estimate the parents, but it makes a mistake identifying which one is heterozygous. Using counts from BAM file (B) fix this issue despite losing the GATK quality filters that can be important in other situations.  Figure 8) when we applied these two described filtering steps: 562 removing non-informative data before genotype calling, and replacing allele counts with 563 missing data when the genotype is missing in the GATK calls. After the genotype calling, 564 we applied a threshold of 0.8 to filter low-quality genotypes, which also was beneficial in all 565 scenarios. It is important to notice that these filters are applied before the segregation test  Figures 9 and 10). This can be one of the reasons why using 574 genotype probabilities in the HMM did not present consistent results across tested datasets.

575
Despite we considered the HMM error rate dataset-dependent values, we identified that

602
The comparison shows that overestimated breakpoints are generally more frequent than 603 underestimated ones. We observe that when a map is inflated, it also has many wrong 604 recombination breakpoints. However, in some cases, the map has the expected map size, but 605 a high number of wrong haplotypes due to both overestimated and underestimated breaks.

606
A high number of underestimated breaks can be observed in situations where the Euclidean 607 distance is close to, or less than 1 (log 10 0) and the number of wrong recombination events is 608 between 10 and 100 (log 10 1 and log 10 2). These situations are more frequent when a global 609 Figure 5: Process of selecting best pipeline: A) Comparing the effect of different error probabilities in the OneMap 3.0 HMM in the distances between adjacent markers; B) Comparing the effect of different error probabilities in the linkage maps total size built with a single SNP call software; C) Checking the recombination fraction (rf) heatmap and markers coverage in the genome using the selected pipeline. These figures were extracted from Reads2MapApp. Figure 6: Relation between Euclidean distance (y-axis) and the number of recombination breakpoints (x-axis) in maps built with global error rates (0.001% and 5%), and with probabilities outputted by the genotype call software (relative error). Each dot represents a map built with simulated data based on the first 37% of aspen chromosome 10. The red squares highlight maps that do not present inflated size (1 or less Euclidean distance) but have from 10 to 100 wrong recombination breakpoints.
error rate of 5% is used.
In the empirical data results, we observed maps with expected size and excess recombi-  to select the best approaches before using OneMap 3.0 to guarantee that it will result in the 707 best quality genetic map possible with the data available.

708
It is important to highlight that we did not design the workflows to be a tool to build 709 a final linkage map but to select the bioinformatic pipeline that provides the best quality

716
The diversity in the results of the pipeline suggested for both empirical datasets highlights 717 that pipelines perform differently with datasets with different properties. This means that 718 the pipelines presented here as the best cannot be considered the best for every dataset. 719 We could reduce the number of required tests by users identifying the dataset-independent 720 approaches and setting them as default in Reads2Map. However, we suggest users reproduce 721 the tests presented here for the dataset-dependent approaches using the Reads2Map workflows 722 with their empirical dataset and select the best pipelines for their specific conditions.

723
The workflows were built using WDL and containers to ensure high reproducibility. This Aguia High-Performance Computing. We also thank David Gerard for the idea of using 815 genotype probabilities from updog combined with a global error rate. 816 Taniguti CH, Taniguti LM, Amadeu RR, Lau J, de S Gesteira G, de P Oliveira