freqpcr: interval estimation of population allele frequency based on quantitative PCR ΔΔCq measures from bulk samples

PCR techniques, both quantitative (qPCR) and non-quantitative, have been used to estimate allele frequency in a population. However, the labor required to sample more individuals and handle each sample makes it difficult to quantify rare mutations, such as pesticide-resistance genes at the early stages of resistance development. Pooling DNA from multiple individuals as a “bulk sample” may reduce handling costs. The output of qPCR on a bulk sample, however, contains uncertainty owing to variations in DNA yields from each individual, in addition to measurement error. In this study, we developed a statistical model for the interval estimation of allele frequency via ΔΔCq-based qPCR analyses of multiple bulk samples taken from a population. We assumed a gamma distribution as the individual DNA yield and developed an R package for parameter estimation, which was verified with real DNA samples from acaricide-resistant spider mites, as well as a numerical simulation. Our model resulted in unbiased point estimates of the allele frequency compared with simple averaging of the ΔΔCq values, and their confidence intervals suggested collecting more samples from individuals and pooling them may produce higher precision than individual PCR tests with moderate sample sizes.

parameters of the gamma distribution, respectively. The mean is given by . 93 Using Eq. 1, let us consider the amounts of allelic DNA in the sample extracted from multiple individuals 94 at once, hereafter referred to as "a bulk sample." Table 1 lists the variables and parameters of the model 95 structure. For simplicity, we model the case of haploidy in the main text. Appendix A1 describes the 96 approximated formulation for diploids. Now, we have insects, of which ( = 0,1, , . . . , ) are the 97 genotypes resistant to an insecticide (hereafter denoted by R). The rest − had S, the susceptible allele. 98 When we capture insects from a wild population, the size of is obvious, but is usually unknown. 99 Assuming random sampling from an infinite population with the R allele at the frequency , follows a  Eq. 5 131 Here, and 1 + ( > 0) denote the initial amount of template DNA and its amplification efficiency, 132 respectively. Standard PCR protocols are designed so that obtain the range 80% to 120% i.e., doubling in 133 each cycle. The size of Cq, , is then defined as: 134 = ln( ) − ln ln(1 + ) .
6 two levels, we amplified at least four samples (two levels, two primer sets for TG and HK loci, ignoring 148 technical replicates). ∆Cq is then defined as the difference of the Cq values of "TG − HK" for each treatment 149 level, which is equivalent to the abundance of target cDNA offset by housekeeping gene (= TG / HK) in each 150 sample (Schefe et al. 2006). Finally, we obtained ΔΔCq = ∆Cq treated − ∆Cq control from the Cq measures. 151 Derived from Eq. 6, 2 Cq gives the relative abundance of template DNA between the treatment levels 152 (Livak and Schmittgen 2001; Pfaffl 2012) (1 + = 2 was presupposed there). 153 Allele frequency estimation from a single bulk sample: RED-method 154 The original ΔΔCq method compares the quantities of (c)DNA between samples to determine the relative The RED-ΔΔCq method also utilized ΔΔCq as a proxy for relative quantity, but the Cq measurements were 161 all taken from a single bulk sample, which was collected from a population in which each individual 162 possesses R or S. The calibrator was an intact sample containing total DNA (= R + S ) on the target locus. 163 The sample in question was the same DNA extract, but digested with restriction endonucleases prior to 164 qPCR analysis. The restriction site is designed to recognize the S allele on the target locus so that the 165 operation digests the major part of S (denoted by 1 − : is a small, but positive variable giving the residual 166 rate). Consequently, we obtained the template amount R + S at the target locus after digestion. 167 The samples before and after digestion were also amplified using the HK primer set as an internal 168 reference. In the etoxazole-R diagnosis by Osakabe  The coefficient T ( T > 0) provides the relative content of the target gene to the housekeeping gene in 178 genomic DNA (the difference in the DNA extraction efficiencies is also included). After digestion (state D), 179 HD and TD denote the DNA amounts at the H and T loci, respectively:  Eq. 8

182
The common coefficient B ( B > 0) provides the rate of certain locus-independent changes in the quantities 183 of template DNA accompanying the restriction enzyme treatment. 184 As a result of qPCR, the Cq quartet, HW , TW , HD , and TD were obtained. From Eq. The actual Cq data contain measurement errors in addition to uncertainty due to experimental operations, 189 such as sample dispensation or PCR amplification. We express these using the common error term 190 ε ~N(0, c ), following the normal distribution of mean = 0 and variance = c in the scale of raw Cq values. 191 The validity of this error structure is verified later.

192
The two ΔCq values were then defined as Δ W = TW − HW and Δ D = TD − HD , respectively. Their

193
ΔΔCq are: Eq. 10 196 From Eq. 10, the expected value of ( S + R ) ( S + R ) ⁄ is calculated as (1 + ) . 197 The point estimate of the resistance allele frequency, R , is defined as R ( R + S ) ⁄ for each bulk sample. 198 When is much smaller than R , the quantity ( S + R ) ( S + R ) ⁄ = R + 1 − R itself can 199 approximate the frequency, which will be the case with enough digestion time before qPCR. However, the 200 use of the point estimate may introduce a problem in that the size of R often exceeds 1 when the R 201 frequency is high and there is a larger error in the Cq measurement (also see the result of Experiment 2).

202
Although the value of 1 + may vary on the primer sets, both target and housekeeping loci share the 203 same amplification efficiency in Eq. 9. This is because practical PCR protocols were designed to be 1 + ≅ 204 2. We can also approximately cancel the effect of heterogeneous amplification efficiencies by fitting the size 205 of T the sample sets with known allele ratios (Experiment 1). Conversely, we must consider the amount of DNA in the bulk sample to calculate the probability of .

264
Eq. 14 265 We must consider not only the possible cases of , but also the entire range of the DNA amounts. If we use  As we also implemented the gamma distribution model as freqpcr(..., beta = FALSE), a numerical 375 experiment with the gamma model was also conducted for the first 250 replicates, and the estimation 376 accuracy was compared between the two assumptions. Furthermore, we also fitted the function with the 377 settings freqpcr(..., K = 1), that is, assuming the gamma shape parameter was fixed at 1 (a.k.a. exponential The explicit modeling of individuals also allows sample division to various degrees, which helps us to 474 balance our sampling strategy on the cost-precision tradeoff. We can achieve higher precision (narrower 475 confidence interval) by increasing the total sample size, ∑ although it also increases the costs 476 associated with sample collection and laboratory work, including library preparation and PCR analysis. Although this study focused on resistance genes, the likelihood model in Eq. 13 can be used for other 507 qPCR protocols based on this ΔΔCq method. If both the specific and nonspecific primer sets are available to 508 amplify the mutant and "wild type + mutant" alleles at the target locus, they can be used for the test and 509 control samples equivalent to TV in Eq. 12 and TU in Eq. 11, respectively. However, there is a caveat in 510 determining which allele should be amplified with a specific primer set and which affects the estimation 511 accuracy due to the intrinsic nature of (1 + ) . As shown, the 95% confidence intervals were broader 512 when p = 0.75 than when p = 0.25 (Figure 2), the accuracy was not symmetric around 0.5, but more accurate 513 when the frequency was low. That is, one should design a specific primer set to amplify the allele that would 514 be rare in the population to improve the signal-to-noise ratio. 515 The maximum likelihood estimation with freqpcr() relies on the assumption that the quantities of the S and 516 R alleles in each bulk sample independently follow gamma distribution and that their quotient is expressed 517 using beta distribution. Fixing the size of the gamma shape parameter k further accelerated the optimization, 518 which was owing to the robustness of p to the size of k. However, once the size of k was fixed much larger 519 than the actual size of the gamma shape parameter (i.e., the individual DNA yield was regarded as almost a 520 fixed value), the iterative optimization using the nlm() function sometimes returned an error. Therefore, one 521 should start with a smaller shape parameter e.g., k = 1 (the exponential distribution: Figure 3), which is 522 currently the default setting of the freqpcr package. 523 In qPCR applications for diagnostic use, ΔΔCq is often used with calibration. One of the popular methods 524 is the involvement of technical replicates; each sample is dispensed and analyzed using qPCR multiple times, 525 which cancels the Cq measurement error. The measurement error obeys a homoscedastic normal distribution 526 in the Cq scale, as shown in Experiment 1. Thus, a simple solution is to average the Cq values measured for 527 every bulk sample before the estimation with freqpcr(), although the estimated size of c changes from its 528 original definition in Eq. 9. However, it is trivial if the number of technical replicates is unified between bulk 529 samples. 530 Moreover, the comparison of Cq values is sometimes conducted on more than one internal reference 531 because there is no guarantee that the expression level of a "housekeeping gene" is always constant 532 (Vandesompele et al. 2002). Future updates of freqpcr() will handle multiple internal references. As long as 533 qPCR is used to estimate population allele frequency, the use of statistical inferences on the bulk samples, as 534 presented in this study, will continue to be a realistic option for regional allele monitoring and screening for 535 practitioners, such as those in agricultural, food security, and public health sectors.