PyBSASeq: a novel, simple, and effective algorithm for BSA-Seq data analysis

Bulked segregant analysis (BSA), coupled with next generation sequencing (NGS), allows the rapid identification of both qualitative and quantitative trait loci (QTL), and this technique is referred to as BSA-Seq here. The current SNP index method and G-statistic method for BSA-Seq data analysis require relatively high sequencing coverage to detect major single nucleotide polymorphism (SNP)-trait associations, which leads to high sequencing cost. Here we developed a simple and effective algorithm for BSA-Seq data analysis and implemented it in Python, the program was named PyBSASeq. Using PyBSASeq, the likely trait-associated SNPs (ltaSNPs) were identified via Fisher’s exact test and then the ratio of the ltaSNPs to total SNPs in a chromosomal interval was used to identify the genomic regions that condition the trait of interest. The results obtained this way are similar to those generated by the current methods, but with more than five times higher sensitivity, which can reduce the sequencing cost by ~80% and makes BSA-Seq more applicable for the species with a large genome. Significance Statement BSA-Seq can be utilized to rapidly identify DNA polymorphismtrait associations, and PyBSASeq allows the detection of such associations at much lower sequencing coverage than the current methods, leading to lower sequencing cost and making BSA-Seq more accessible to the research community and more applicable to the species with a large genome.

The greater the Δ(SNP index) (the difference of the SNP 28 indices between bulks), the more likely the SNP contributes 29 to the trait of interest or is linked to a gene that controls the

56
The sequencing data used in this study were generated by Yang

Significance Statement
BSA-Seq can be utilized to rapidly identify DNA polymorphismtrait associations, and PyBSASeq allows the detection of such associations at much lower sequencing coverage than the current methods, leading to lower sequencing cost and making BSA-Seq more accessible to the research community and more applicable to the species with a large genome.
Author contributions: JZ and DRP conceived the study. JZ developed the algorithm, wrote the Python code, analyzed the data, and wrote the manuscript. DRP edited the manuscript and supervised the project.  1  29 759  C  G  0,2  6  0,2  6  1  31 071  A  G  25,39  99  33,29  99  1  31 478  C  T  27,38  99  48,32  99  1  33 667  A  G  21,46  99  39,32  99  1  34 057  C  T  29,37  99  32,31  99 a The chromosome on which the SNP is located; b The position of the SNP on the chromosome; c The base sequence of the SNP that is the same as the one from the reference genome; d The base sequence that is different from REF; e The allele depths (AD) of the SNP in the first bulk (ID: 834927) or the second bulk (ID: 834931). This column contains two numbers, the first one is the REF read (AD REF ) and the second is the ALT read (AD ALT );

D R A F T
f The genotype quality of the SNP in the first bulk (ID: 834927) or the second bulk (ID: 834931).  has zero SNP, its ltaSNP/totalSNP ratio will be replaced with the 98 value of the previous sliding window. If the first sliding window 99 of a chromosome is empty, the string 'empty' will be assigned to 100 this sliding window as a placeholder that will be replaced with a 101 non-empty value of the nearest window later.  smAD REF2 /smAD ALT2 of bulk 2 were obtained as described above. 134 Using these AD values, the Δ(SNP index) was calculated with the 135 equation below and the G-statistic was calculated as previously 136 stated. This process was repeated 10 000 times, the 99% confidence 137 interval of the 10 000 Δ(SNP index) values was used as the significant 138 threshold for the SNP index method, and the 99.5 th percentile of 139 the 10 000 G-statistic values was used as the significant threshold 140 for the G-statistic method.

184
The most obvious difference between Figure 1A and figure   185 1B was the first peak on chromosome 2 and the peaks on 186 chromosomes 3, 6 and 9; these regions contained fewer SNPs, 187 but the ltaSNPs enrichment was relatively high.

188
Under the null hypothesis that the SNPs were not asso-189 ciated with the trait, resampling was utilized to obtain the 190 threshold to determine which peak in Figure 1B was statisti-191 cally significant. For each SNP in the dataset, its simulated 192 AD REF and AD ALT were calculated as detailed in the materi-193 als and methods section and then the simulated AD REF and 194 AD ALT from both bulks were used to perform Fisher's exact 195 test. A SNP with its p-value less than 0.10 was considered a 196 ltaSNP (A high cut-off p-value results in a high threshold). 197 The amount of SNPs that are the same as the average number 198 of SNPs per sliding window were randomly selected from the 199 SNP dataset and the simulated ltaSNP/totalSNP ratio (total 200 SNPs was the sample size) in the sample was recorded. This 201 process was repeated 10 000 times, and the 99.5 th percentile 202 of these 10 000 values was used as the significant threshold 203 for the detection of peak-trait associations. The threshold 204 obtained this way was 0.087. In addition to the six major 205 QTLs (two of them on chromosome 2) verified in the work of 206 Yang et al.
(3), one or more new peaks on all chromosomes 207 except chromosomes 5 and 10 were also above the threshold 208 ( Figure 1B).

Sequencing coverage affected the detection of SNP-trait as-210
sociation. Using the Lander/Waterman equation (28), the 211 sequencing coverage of SRR834927 and SRR834931 was es-212 timated to be 84× and 103×, respectively. It would be very 213 costly to achieve such high sequencing coverage for the or-214 ganisms with a large genome. Thus, we wanted to know how 215 decreasing sequencing coverage would affect the detection of 216 SNP-trait associations. To achieve lower sequencing cover-217 age, we sampled 40%, 30%, and 20% of the raw sequence 218 reads using the seqtk program (https://github.com/lh3/seqtk) 219 with random seeds 123, 160, and 100, respectively. The ltaS-220 NPs were identified from these sequence subsets and the ratios 221 of ltaSNP/totalSNP were plotted along all the chromosomes 222   QTLs and a minor QTL on chromosome 2 were detected (Fig-264 ure S1). However, the PyBSASeq approach had the highest 265 sensitivity using the same filtering criteria, and it can detect 266 more minor QTLs than other methods even if the whole SNP 267 dataset was used ( Figures 1B, 2, and S1).

268
As in PyBSASeq, we also tested how decreasing sequencing 269 coverage would affect the detection of the SNP-trait associ-270 ations in these two methods. Using the original sequencing 271 reads, the SNP index method had relatively low detection 272 power, the major QTL on chromosome 5 was missed and the 273 peak (valley) representing the major QTL on chromosome 10 274 was barely beyond the threshold. With decreasing sequencing 275 coverage, the Δ(SNP index) did not change much, but the 276 thresholds increased dramatically, the QTLs on chromosomes 277 2, 5, and 10 were missed at 40% of the original sequencing 278 coverage and all the QTL were missed at 30% or lower of the 279 original sequencing coverage (Figure 3). For the G-statistic 280 method, with the original sequencing reads, all the 6 major 281 QTLs can be detected. With decreasing sequencing cover-282 age, the G-statistic values decreased substantially, whereas 283 the threshold increased slightly; the QTLs on chromosomes 284 2, 5, and 10 were missed at 40% of the original sequencing 285 coverage, the peaks representing the QTLs on chromosomes 1 286 and 8 were barely above the threshold at 30% of the original 287 sequencing coverage, and all the QTLs were missed at 20% of 288 the original sequencing coverage (Figure 4).

290
PyBSASeq detected more than 10 minor QTLs along with all 291 of the major QTLs detected via the current methods when run 292 with the entire SNP dataset based on the original sequencing 293 reads ( Figures 1B, 3A, and 4A). Plant cold tolerance is a 294 complex quantitative trait controlled by many genes (29, 30). 295 The additional QTLs detected via PyBSASeq may represent 296 the minor QTLs that have small phenotypic effects. Filtering 297 out the SNPs with a low DP value increased the sensitivity 298 of the current methods ( Figures S1, 3, and 4), but doing so 299 increased the sensitivity of PyBSASeq as well ( Figures S1C and 300  1B). Decreasing the sequencing coverage substantially reduced 301 the detection power of all the methods (Figures 2, 3, and 4). 302 At 20% of the original coverage (17× in the first bulk and 21× 303 in the second bulk) all QTLs were missed using the current 304 methods; however, all the verified major QTLs plus two minor 305