In silico prediction of COVID-19 test efficiency with DinoKnot

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a novel coronavirus spreading across the world causing the disease COVID-19. The diagnosis of COVID-19 is done by quantitative reverse-transcription polymer chain reaction (qRT-PCR) testing which utilizes different primer-probe sets depending on the assay used. Using in silico analysis we aimed to determine how the secondary structure of the SARS-CoV-2 RNA genome affects the interaction between the reverse primer during qRT-PCR and how it relates to the experimental primer-probe test efficiencies. We introduce the program DinoKnot (Duplex Interaction of Nucleic acids with pseudoKnots) that follows the hierarchical folding hypothesis to predict the secondary structure of two interacting nucleic acid strands (DNA/RNA) of similar or different type. DinoKnot is the first program that utilizes stable stems in both strands as a guide to find the structure of their interaction. Using DinoKnot we predicted the interaction of the reverse primers used in four common COVID-19 qRT-PCR tests with the SARS-CoV-2 RNA genome. In addition, we predicted how 12 mutations in the primer/probe binding region may affect the primer/probe ability and subsequent SARS-CoV-2 detection. While we found all reverse primers are capable of interacting with their target area, we identified partial mismatching between the SARS-CoV-2 genome and some reverse primers. We predicted three mutations that may prevent primer binding, reducing the ability for SARS-CoV-2 detection. We believe our contributions can aid in the design of a more sensitive SARS-CoV-2 test. Author summary The current testing for the disease COVID-19 that is caused by the novel cornonavirus SARS-CoV-2 uses oligonucleotides called primers that bind to specific target regions on the SARS-CoV-2 genome to detect the virus. Our goal was to use computational tools to predict how the structure of the SARS-CoV-2 RNA genome affects the ability of the primers to bind to their target region. We introduce the program DinoKnot (Duplex interaction of nucleic acids with pseudoknots) that is able to predict the interactions between two DNA or RNA molecules. We used DinoKnot to predict the efficiency of four common COVID-19 tests, and the effect of mutations in the SARS-CoV-2 virus on ability of the COVID-19 tests in detecting those strains. We predict partial mismatching between some primers and the SARS-CoV-2 genome but that all primers are capable of interacting with their target areas. We also predict three mutations that prevent primer binding and thus SARS-CoV-2 detection. We discuss the limitations of the current COVID-19 testing and suggest the design of a more sensitive COVID-19 test that can be aided by our findings.


Introduction
Interaction of the primers to the RNA/cDNA strands during qRT-PCR amplification. The reverse primer first binds to the target complementary sequence on the SARS-CoV-2 positive (+ve) sense RNA genome. The reverse transcriptase then generates the negative (-ve) sense complementary DNA (cDNA) strand. The forward primer then binds to the negative sense cDNA strand and the DNA polymerase generates the positive sense cDNA strand. The reverse primer binds to the complementary target sequence on the positive sense cDNA and the DNA polymerase generates a new negative sense cDNA strand. [3]. This process repeats for strand amplification during qRT-PCR.
We aimed to determine whether the RNA structure of the SARS-CoV-2 genome 21 affects the binding of the reverse primers in the qRT-PCR assay and whether this 22 correlated to the analytical efficiencies and sensitivities shown by Vogels et al. [4]. To 23 study the structure of such interactions, we developed DinoKnot, a program that given 24 two nucleic acid strands predicts their interaction structure. We used DinoKnot to 25 predict how the RNA structure of SARS-CoV-2 changes when the reverse primer binds 26 to the RNA genome and whether the primer was binding in the location that was 27 expected. We further predicted how mutations in the primer/probe binding region of 28 the SARS-CoV-2 genome provided by Vogels et al. could affect the primer/probe 29 sensitivity for SARS-CoV-2 detection [4]. 30 First, we introduce DinoKnot and our experiment set up. We then present the 31 interaction structures predicted by DinoKnot. We discuss the interactions and which 32 mutations may decrease SARS-CoV-2 detection during qRT-PCR testing. Finally, we 33 suggest the design of a more sensitive COVID-19 test and future work that can be done 34 using DinoKnot. 35 Materials and methods 36 Interaction of the reverse primer and the SARS-CoV-2 genome is an example of a 37 duplex DNA/RNA interaction, in which both strands (i.e. the RNA genome and the 38 cDNA transcript) can be structured. Their structures can change upon interaction with 39 one another to accommodate formation of more stable base pairings. Existing tools that 40 predict structure of interaction in two molecules mostly focus on similar strands (i.e. 41 DNA/DNA or RNA/RNA) [5][6][7][8][9][10], and merely focus on the interaction site (i.e. 42 ignoring the intramolecular structures) [11][12][13][14][15]. Our DinoKnot (Duplex Interaction of 43 Nucleic acids with pseudoKnots) aims to address both such shortcomings. 44 In this section we first provide an overview of the problem of secondary structure 45 prediction for two interacting molecules. We then introduce DinoKnot, and explain how 46 it overcomes the shortcomings of the existing methods on prediction of the secondary 47 structure for two different types of nucleic acid strands. Finally we provide details of 48 our in silico system and experiment setup. An RNA is a single stranded molecule with two distinct ends, namely 5' and 3', that 51 folds back onto itself by forming intramolecular base pairs. An RNA molecule is 52 represented by a sequence, S, of its four bases, Adenine (A), Cytosine (C), Guanine (G) 53 and Uracil (U) arranged on a line (representing the backbone) from 5' (left) to 3' (right) 54 ends. The length of the RNA molecule is denoted by n and each base of the RNA 55 sequence is referred to by its index i, 1 ≤ i ≤ n. Complementary bases bind and form 56 base pairs (A.U , C.G, and G.U ). Here "." represents a pairing of the two bases. A 57 secondary structure, R, is then defined as a set of base pairs i.j , 1 ≤ i < j ≤ n; i.j and 58 k.j can belong to the same set if and only if i = k i.e. each base may pair at most with 59 one other base. If i.j and k.l are two base pairs of a secondary structure, R, such that 60 1 ≤ i < k < j < l ≤ n, then i.j crosses k.l. A pseudoknotted secondary structure refers 61 to a structure with crossing base pairs. A pseudoknot-free secondary structure refers to 62 a structure without crossing base pairs. The RNA structure forms because it is 63 energetically favourable for bases to form paired helices. Different base pairing patterns 64 in a secondary structure define different loop types. We note that the principles of 65 structure prediction are essentially the same for single stranded DNAs. Computational 66 secondary structure prediction from the base sequence is often done by finding the 67 energetically most stable (minimum free energy) secondary structure, when each loop is 68 assigned an energy value. These energy values are kept in various tables and are known 69 as energy parameters. Some parameter sets have been derived directly from 70 experiments, and others are extrapolated based on experimentally determined values.

71
Energy parameters are strand type specific, i.e. similar loops in an RNA molecule have 72 different assigned energy than the ones in a DNA molecule. Existing minimum free 73 energy (MFE) structure prediction methods find the minimum free structure for a given 74 sequence from the pool of all possible structures. Methods for MFE pseudoknot-free 75 structure prediction use dynamic programming method to find the MFE 76 structure [16,17]. Since prediction of the MFE pseudoknotted structure is 77 NP-hard [18,19] and even inapproximable [20], methods for pseudoknotted MFE 78 structure prediction focus on a restricted class of structures [21][22][23][24][25]. 79 We can represent the interacting secondary structure of two nucleic acid sequences 80 similarly, by concatenating the two strands together and keeping track of the gap 81 between the two strands with a dummy linker. When two strands are of similar type, 82 the energy calculation of the concatenated sequence will be similar to that of a single 83 strand of the same type, except for loops containing the gapped region (as they are not 84 true loops when sequences are not concatenated). When two strands of different types 85 interact (i.e a DNA strand binding to an RNA strand) the situation is more 86 complicated as there is currently no energy parameters available for loops formed 87 between the two strands. DinoKnot follows the relaxed hierarchical folding hypothesis [26] for prediction of the 90 minimum free energy structure of two interacting nucleic acid strands. Following this 91 hypothesis an RNA molecule first forms simple pseudoknot-free base pairs before 92 forming more complex and possibly pseudoknotted structures [27]. During this process 93 some of the originally formed base pairs may open up to accommodate more stable 94 pairings. Existing methods based on hierarchical folding, namely HFold [24] and 95 Iterative HFold [26], focus on single RNA structure prediction [24,26]. Our DinoKnot is 96 to the best of our knowledge the first program that follows the hierarchical folding 97 hypothesis for prediction of pseudoknotted structure of two interacting nucleic acid  DinoKnot, takes a pair of nucleic acid sequences as input and returns their 103 interaction structure with its corresponding free energy value. Each sequence can be of 104 type RNA or DNA. We note that the minimum free energy structure of two input 105 strands may not involve any interaction if it is energetically more favourable for each 106 sequence to form intramolecular base pairs.

107
The user can optionally provide a pseudoknot-free input structure if such 108 information is available to guide the prediction. If no input structure is provided by the 109 user, DinoKnot will generate up to 20 pseudoknot-free secondary structures (i.e.  The SARS-CoV-2 reference genome NC 045512.2 was obtained from the National 123 Center for Biotechnology Information GenBank database [29]. The sequence of the 124 RNA transcripts used by Vogels et al. [4] to determine the primer efficiency were input 125 into the programs Iterative HFold [26] to predict their secondary structure. This output 126 is the secondary structure of the transcript prior to the interaction with the reverse 127 primer. The primer sequences were obtained from the list of World Health Organization 128 (WHO) protocols to diagnose COVID-19 [30]. The locations where the primers bind on 129 the reference genome NC 045512.2 and the corresponding RNA transcript area are 130 stated in Table 1.    [31]. In these cases, all possible 149 degenerate base combinations were predicted with DinoKnot. The dot-bracket output 150 was visualized using VARNA [32] in arc format in which RNA backbone is represented 151 by a horizontal line and base pairs are presented as arcs that connect the two bases. genome that occur at a frequency of greater than 0.1% [4]. To predict if these mutations 155 affect primer/probe binding, and thus the sensitivity of SARS-CoV-2 detection, the 156 mutated sequence of the transcript along with the affected primer/probe was entered 157 into DinoKnot. The RNA/DNA setting was used to predict the interaction between the 158 mutated RNA transcript and the reverse primer. The DNA/DNA setting in DinoKnot 159 was used to predict the interaction structure between the mutated cDNA transcript and 160 the corresponding primer/probe. The dot-bracket output was visualized using 161 VARNA [32] in arc format in which RNA backbone is visualized by a horizontal line 162 and base pairs are represented by arcs connecting bases of the backbone.

163
The GISAID next hCoV-19 app was used to look at the frequency of mutations in 164 the SARS-CoV-2 genome [33].  Vogels et al. experimentally found that all of the primer-probe sets had comparable 175 analytical efficiencies that were all above 90% [4]. All primer-probe sets had comparable 176 analytical sensitivities with a limit of detection of 100,000 SARS-CoV-2 viral copies/mL 177 except for the RdRp-SARSr set which had the lowest sensitivity [4]. 178 All of the reverse primers were predicted by DinoKnot to interact with their 179 expected target region. The 2019-nCoV N1-R, 2019-nCoV N3-R, CCDC-N-R and 180 CCDC-ORF1-R primers fully paired to their target region. This prediction agrees with 181 the analytical efficiency and sensitivity results found by Vogels et al. [4]. 182 Partial reverse primer mismatching 183 The HKU-ORF1-R and the 2019-nCoV N2-R primer were predicted to pair to their 184 target with single base mismatches. The HKU-ORF1-R primer contains the degenerate 185 base R which means the the primer may contain an A or a G at this position. The 186 target base is a U in the SARS-CoV-2 genome. The interaction structure was predicted 187 twice, both with an A and G in the position of degenerate base R. The output produced 188 the expected match, except when the primer was set as a G, the G/U base pair did not 189 bind but it did not affect the rest of the primer area from binding to the target area.

190
The last base at the 3' end of the 2019-nCoV N2-R primer did not pair with any 191 nucleotide but the rest of the primer was predicted to bind to its expected target region. 192 Although these primers contain a single predicted mismatch, the analytical efficiencies 193 and sensitivities of these primer probe sets were comparable to the other primer-probes 194 tested experimentally [4].

HKU-N-R and E-Sarbeco-R 196
The HKU-N-R and E-Sarbeco-R primers did not pair as expected to their target region 197 when DinoKnot was not given an input structure (i.e. when both strands were free to 198 assume possible structures before interacting with one another). The HKU-N-R primer 199 was predicted to bind to itself, rather than its target region. When the HKU-N-R 200 primer structure was forced to be unfolded, the first base of the HKU-N-R primer at the 201 5' end did not bind to its expected nucleotide but the rest of the primer interacted with 202 the target region as expected. Forcing the primer to be unfolded is a prediction of the 203 primer structure after the qRT-PCR test denaturation step at 95 • C in the work done by 204 Vogels et al. [4] The predicted interaction structure for the E-Sarbeco-R primer when   All combinations were given as input to DinoKnot and the predicted structures are 219 shown in Fig 3. Fig 3 represents the interaction site only. The complete structures can 220 be found in the supplemental information on the DinoKnot Github page. The  Fig 3. Interaction structures of the RdRp-SARS-R primer predicted by DinoKnot with all of the possible base combinations from the degenerate primers. The RdRp-SARS-R primer contains the degenerate bases R and S, which means an A or G may be present at the R position and a C or G may be present at the S position. All possible combinaions were predicted. The S position was also input as an A to predict if this change may increase primer sensitivity. entered as a G. The target nucleotide is a U at both of the positions where the R and S 224 base bind. The structures were also predicted with an A input at the position of the 225 degenerate base S. This change was suggested by Vogels et al. to possibly increase the 226 sensitivity of this primer-probe set because it was found experimentally to have the 227 lowest sensitivity of all the primer-probe sets [4]. However, this change resulted in the 228 primer not binding to its target region.  Table 3, along with the interaction 231 structure energies [4]. The resulting structures are shown in  Mutations that cause complete mismatching 259 The mutation in the primer binding region of E-Sarbeco-R at base pair position 26,370 260 is predicted to result in a complete mismatch between the reverse primer and the Table 3. Structure energy differences predicted with mutations in the primer binding region of the SARS-CoV-2 genome obtained from Vogels et al. [4] . The primer binding energy was predicted by DinoKnot. The probe/transcript and forward primer/transcript interactions are DNA/DNA interactions since these oligonucleotides interact with the cDNA strands [3]. Mutations in the reverse primer binding region include both DNA/DNA and DNA/RNA interaction since the reverse primer interacts with the SARS-CoV-2 genome along with the negative sense cDNA strand [3] Primer

274
We used DinoKnot to predict the secondary structure of the interaction between RNA 275 transcripts from the SARS-CoV-2 genome and the reverse primers from nine mismatch is unlikely to have an effect on cDNA strand synthesis [4] 296 The mismatching in the E-Sarbeco-R and HKU-N-R primers occur when the primer 297 is assumed to have structure prior to interaction with the target. However, when the 298 primer was forced in DinoKnot as unfolded, the program predicts the E-Sarbeco-R 299 primer to fully bind to its target sequence and the HKU-N -R primer to bind with a 300 single mismatch at the 5' end of the primer. Since the qRT-PCR assay used by Vogels 301 et al. has a denaturation step of 95 • C and an annealing step of 55 • C, it can be assumed 302 that the primer is fully unfolded at those temperatures because the E-Sarbeco-R and 303 HKU-N-R had primer efficiencies comparable to the other primers as determined by 304 Vogels et al. [4].

305
The RdRp-SARSr-R primer contains two degenerate bases, R and S. Vogels et al. 306 determined that 990/992 of their SARS-CoV-2 clinical samples contain a T at the 307 position in the reference genome NC 045512.2 where the S base binds to its target [4]. 308 This primer-probe set had the lowest sensitivity out of all of the primers [4]. During the 309 qRT-PCR test by Vogels et al. this primer-probe set had the highest Ct values, 6-10 Ct 310 values higher than the other sets, and is unable to detect low viral amounts [4].  The combination that produced the expected match was when an A was input at the 314 R position and a G was input at the S postion. Vogels et al. proposed that changing the 315 S to an A in the reverse primer could increase the sensitivity of the primer-probe set [4]. 316 To test this, the primer was given as input to DinoKnot with an A input at the S 317 position. All combinations where S was input as an A resulted in the primer not binding 318 to the target area. The result was the primer binding to itself and partial binding to the 319 5' end of the transcript, over 400 bases downstream from the target area. Therefore, 320 this result indicates that changing the S to an A is unlikely to increase the sensitivity of 321 the primer even though the target base pair is a U in the RNA genome and a T on the 322 cDNA transcript. Based on these results, we hypothesize that the low sensitivity issue of 323 the RdRp-SARSr primer-probe set may be due to the predicted primer mismatch when 324 a G is present in the R and S position of the primer. Although the concentration of 325 each base combination at the degenerate R and S positions is not stated in the protocol, 326 if the assumption is made that the four combinations are present in equal amounts, this 327 may explain the low sensitivity issue of the RdRp primer probe-set [30] [4]. The base 328 combination where a G is present at both the R and S position is predicted to mismatch 329 to the target region. This would lower the RdRp-SARSr-R concentration that is capable 330 of binding to its target region. If the effective concentration of the RdRp-SARSr-R 331 primer is lower than the other primer-probe sets, this may explain why this set had low 332 sensitivity issues due to higher Ct values shown experimentally by Vogels et al. [4]. 333 Therefore, we suggest that changing the base at the R position to an A and the base at 334 the S position to a G may increase the RdRp-SARSr primer-probe set sensitivity since 335 this base combination was predicted to perfectly bind to its target region.  The mutation in the 2019-nCoV N3-R primer binding region at position 28, 739 is 372 predicted to cause a complete mismatch between the reverse primer and the RNA 373 genome. However, the reverse primer is able to bind to the target area on the positive 374 sense cDNA, except with one base pair mismatch at the mutated base. This mutation 375 may cause this strain not to be detected by this primer-probe set because if the reverse 376 primer does not bind to the target area on the SARS-CoV-2 genome, the reverse 377 transcriptase cannot generate the positive sense cDNA strand. Therefore, the binding of 378 the reverse primer to the negative sense cDNA strand interaction would not occur.

379
The CCDC-ORF1-F primer had a mismatch at the 3' end of the primer causing the 380 5 bases at 3' end to remain unpaired. The DNA polymerase generates the 381 complementary strand in the 5'-3' direction. Since the 5' end still binds, the full target 382 area may still be generated. In this primer-probe set, the probe binds to the same area 383 of the cDNA that the forward primer binds so having the full area of the transcript 384 generated is important. The 5 bases at the 3' end did not align to any other part of the 385 reference genome in a NCBI BLAST search so it is unlikely but still possible that the 386 tail could bind to another part of the genome and hinder the ability of the DNA 387 polymerase to generate the cDNA strand, preventing SARS-CoV-2 detection [29]. This 388 mutation should be tested experimentally in order to determine its effect on the 389 primer-probe set's ability for SARS-CoV-2 detection. early after the onset of symptoms, the viral load is greater than 1 × 10 6 copies per 399 mL [34]. This is sufficient for detection by the primer-probe sets, except for 400 RdRp-SARSr [4]. However, respiratory samples from 80 patients from different stages of 401 infection showed a range of 641 copies per mL to 1.34 × 10 11 copies per mL. Therefore, 402 the more time that passes after symptom onset may result in the viral load to be below 403 the limit of detection during testing. If patients are tested a greater number of days 404 after symptom onset when the viral load may be below the primer-probe set limit of 405 detection determined by Vogels et al., this could result in a false negative test. However, 406 nasopharyngeal swabs are the most common ways to collect patient samples and the 407 work by Pan et al. only looked at one nasal swab which was tested 3 days after 408 symptom onset with a total of 1.69 × 10 5 copies per mL [2,34].

409
In a patient led survey, 27.5% of respondents believed they had COVID-19 but 410 tested negative for the disease [35]. This group of respondents were on average tested 411 later, by day 16, compared to the 21% of respondents who believed they had COVID-19 412 and also tested positive, who were on average tested by day 10 [35]. As is a limitation of 413 this survey we cannot verify whether the respondents who believed they had COVID-19 414 were in fact infected with SARS-CoV-2. However, since they were tested later, this may 415 support the hypothesis that the viral load may be below the limit of detection if the test 416 is taken a greater number of days after symptom onset. Therefore false negative tests 417 are likely an issue with having enough viral load in the sample rather than an issue with 418 the primer interaction with the SARS-CoV-2 genome since DinoKnot predicted correct 419 primer binding for most of the primers at the higher temperatures. 420 We hypothesize that a SARS-CoV-2 aptamer test is beneficial to provide a lower 421 limit of detection. Aptamers are single stranded RNA or DNA nucleotides (10-100nt) 422 that are able to bind to targets such as viruses and proteins [36]. Aptamers' binding 423 specificity is ensured by their secondary and tertiary structure [36]. An RNA aptamer 424 test designed for the SARS-CoV Nucleocapsid protein showed a detection limit of 2 425 pg/mL [37]. A recent aptamer test for Norovirus, a positive sense RNA virus, has a 426 limit of detection of 200 viral copies/mL [38]. This is lower than the qRT-PCR limit of 427 detection of determined by Vogels et al. by a magnitude of 10 3 [4]. The lower limit of 428 detection that is possible with aptamer tests would be beneficial in testing patients later 429 after symptom onset. 430 We believe that an aptamer test can be designed for SARS-CoV-2 that both 431 improves the detection sensitivity and provides a more rapid test compared to the 432 qRT-PCR tests currently in use. The areas targeted in the primer-probe sets may be 433 good targets for aptamer design, since they are specific to SARS-CoV-2. Zooming in on 434 these regions in the GISAID h-CoV-19 app, it can be shown that these areas generally 435 have few mutations, as compared to the number of mutations that can be seen in the 436 entire genome as shown in Fig 5 [33]. Our findings may be beneficial in a starting place for aptamer design. The

438
HKU-ORF1 primer-probe set was the only set to have its reverse primer completely 439 bind its target sequence and not contain any mutations in the primer/probe target areas 440 in the 992 clinical samples looked at by Vogels et al. [4]. The GISAID h-CoV-19 app 441 also shows the area targeted by this primer-probe set to have a low number of mutation 442 events [33]. The 2019-nCoV N2 primer-probe set did not have any mutations in the 443 primer/probe binding regions listed by Vogels et al. but the reverse primer did contain 444 the one base pair mismatch to the reference genome. The areas targeted by these 445 primer-probe sets may be suitable targets for designing a SARS-CoV-2 aptamer test 446 since these areas are specific for SARS-CoV-2 and contain few mutations.

448
We introduced the program DinoKnot that is able to predict the secondary structure of 449 two interacting nucleotide sequences. By in silico prediction we provided evidence to 450 support that during COVID-19 qRT-PCR tests, the reverse primers investigated here 451 are binding to their expected target region on the SARS-CoV-2 genome. Our prediction 452 supports that binding is not prevented by the secondary structure of the SARS-CoV-2 453 RNA genome. This supports the idea that the correct area of the SARS-CoV-2 genome 454 is being amplified during the qRT-PCR tests and the RNA structure does not prevent 455 SARS-CoV-2 detection. Our prediction also provides a hypothesis to explain the low 456 sensitivity issue of the RdRp-SARSr primer-probe set and we suggest changing the 457 degenerate base positions in the RdRp-SARSr-R from an R to an A and an S to a G to 458 potentially increase the sensitivity. 459 We also predicted how different mutations in the primer binding region affected  Finally we hypothesize that an aptamer test may be a better way to test for 464 COVID-19 since it can provide a lower limit of detection. Further research is required to 465 quantify the viral load at each stage of infection and determine at what point the 466 qRT-PCR can no longer detect COVID-19 so that conclusions can be made on the 467 impact of the limit of detection affecting false negative test results.

468
DinoKnot can also be used as a tool to aid in the prevention of false positive results 469 during the design COVID-19 and other nucleic acid based testing. A false positive test 470 would involve the non-specific binding of the primer-probe sets to a source other than 471 the target in the patient sample. Future work can be done using our program DinoKnot 472 to predict the interaction between primer/probes with other pathogens and nucleic acid 473 sequences that may be present in patient samples to determine what the primer/probes 474 may be capable of binding to produce false positive results. 475