coil: an R package for cytochrome C oxidase I (COI) DNA barcode data cleaning, translation, and error evaluation

Biological conclusions based on DNA barcoding and metabarcoding analyses can be strongly influenced by the methods utilized for data generation and curation, leading to varying levels of success in the separation of biological variation from experimental error. The five-prime region of cytochrome c oxidase subunit I (COI-5P) is the most common barcode gene for animals, with conserved structure and function that allows for biologically informed error identification. Here, we present coil (https://CRAN.R-project.org/package=coil), an R package for the pre-processing and error assessment of COI-5P animal barcode and metabarcode sequence data. The package contains functions for placement of barcodes into a common reading frame, accurate translation of sequences to amino acids, and highlighting insertion and deletion errors. The analysis of 10,000 barcode sequences of varying quality demonstrated how coil can place barcode sequences in reading frame and distinguish sequences containing indel errors from error-free sequences with greater than 97.5% accuracy. Package limitations were tested through the analysis of COI-5P sequences from the plant and fungal kingdoms as well as the analysis of potential contaminants: nuclear mitochondrial pseudogenes and Wolbachia COI-5P sequences. Results demonstrated that coil is a strong technical error identification method but is not reliable for detecting all biological contaminants.


Introduction 34
DNA barcoding leverages sequence diversity within standardized gene regions for the 35 identification and classification of organisms (Hebert et al. 2003; Ratnasingham and Hebert 2007). 36 Answering questions about biodiversity through DNA barcode analyses depends on the comparison of 37 novel barcode sequences to reference libraries or the de novo comparison of sequences to one another 38 (Hebert et al. 2004; Ratnasingham  An approximately 657bp fragment of the 5' region of the cytochrome C oxidase subunit I gene 49 (COI-5P) is the main marker utilized in DNA barcoding of the animal kingdom (Hebert et al. 2003). 50 COI is a component of the last enzyme in the electron transport chain, which is essential to metabolism 51 (Castresana et al. 1994;Pentinsaari et al. 2016). As a result, the COI-5P barcode region has particular 52 selective constraints, especially against changes to the corresponding amino acid sequence that modify 53 the protein structure and negatively affect metabolism (Tsukihara et al. 1995;Pentinsaari et al. 2016). 54 The conserved amino acid sequence corresponding to the COI-5P barcode region means that translation 55 of sequences can provide a powerful means of detecting insertion and deletion technical errors (indels) 56 that have resulted in reading frame shifts that alter the inferred amino acid profile. 57 of hidden states corresponding to the observed sequence by determining the optimal path through the 133 network of emission and transition probabilities, which are respectively defined as the probability of a 134 given base at a given position and the probability of a given base following the preceding base at the 135 given position. For framing of nucleotide data, the output is used to infer the presence of DNA not 136 matching the 657bp region defined by the boundaries of the model or missing information at the start 137 of the sequence (Figs. 1, 4). It is important to note that any data outside of the 657bp Folmer region 138 (upon which the model is trained) is removed and not utilized in subsequent translation or error 139 assessment; coil reports when information is trimmed from the input sequence as part of this processing 140 step. 141 Translation 142 As a result of its ancient endosymbiotic origin, mitochondrial translation utilizes a different 143 genetic code from the nucleus, and this code additionally varies across taxa (Youle 2019). The coil 144 package is therefore designed in a manner robust to the variability in animal mitochondrial genetic 145 codes (NCBI numeric identifiers: 2, 4, 5, 9, 13, 14, 21, 24, and 33) (Elzanowski and Ostell 2019). If the 146 taxonomic designation of a barcode sample is known and provided along with the input sequence, then 147 coil calls functions from the package SeqinR (https://CRAN.R-project.org/package=seqinr) to translate 148 the sequence using the appropriate genetic code (Osawa et al. 1992; Jukes and Opsawa 1993;Charif 149 and Lobry 2007). By default, a special censored translation option is utilized by coil in order to 150 accommodate instances when the taxonomy of a sample (and therefore the correct genetic code) is 151 unknown. Codons that vary in the amino acid they code for across known animal mitochondrial genetic 152 codes are translated to a placeholder character ('?') to indicate that the amino acid at this location in the 153 sequence cannot be stated with certainty (Fig. 5). This allows coil to assess the likelihood of the 154 sequence containing indel errors, without being biased by errors introduced due to the appropriation of 155 the wrong genetic code. As an example of why this is necessary, if a barcode sequence from a biting fly 156 (genetic code = 5) was amplified in the analysis of a mammal host specimen (proper genetic code = 2) 8 (Ratnasingham and Hebert 2007; Elzanowski and Ostell 2019) then the codon 'AGG' would be 158 interpreted as a stop codon, as opposed to the correct amino acid: serine. The presence of a stop codon 159 due to this translation error would make the resulting amino acid sequence highly improbable and lead 160 to it being flagged as likely erroneous, when it is in fact a true barcode sequence that a user may in 161 some instances wish to retain. To avoid this, censored translation outputs placeholder characters that do 162 not negatively impact the likelihood of the amino acid sequence because of mistranslated codons. The 163 censored translation outputs do, however, lead to a slight loss of power for subsequent error assessment 164 due to reduced sample size because placeholder characters (for 5/64 codons) are not evaluated in the 165 calculation of the likelihood score. 166

Error assessment 167
The inference of sequence errors is accomplished by comparison of a translated barcode 168 sequence against the amino acid PHMM (Fig. 6). The Viterbi algorithm (Fig. 1) is used to compare the 169 amino acid sequence to the PHMM, and the likelihood of the optimal path that aligns the input 170 sequence to the PHMM is considered. The likelihood is expressed in logarithmic format; the higher a 171 log likelihood (all log likelihood values are negative) the more likely it is that the sequence is a match 172 to the profile. Conversely, the lower the log likelihood, the more likely the amino acid sequence is at 173 some point shifted out of reading frame due to an insertion or deletion error in the nucleotide sequence, 174 leading to a highly unlikely amino acid sequence. Low log likelihood values could result from 175 frameshifts due to technical errors (i.e. PCR amplification or DNA sequencing) or due to the detection 176 of non-barcode sequences, such as nuclear pseudogenes. Amino acid insertions or deletions (indel of a 177 full codon of nucleotides) do not result in frameshifts and lead to only minor perturbations in the log 178 likelihood score; the check of the amino sequence is therefore receptive of mutations of this kind, even 179 if they were unseen in the PHMM training data. The check for frameshifts via the PHMM comparison 180 is complemented by a simple query of the amino acid sequence for stop codons, which can provide 181 additional evidence of frameshift errors or identify instances where translation has been conducted with 182 the wrong genetic code. Since the enzyme that COI-5P codes for is highly conserved, there is an 183 extremely low likelihood of a biologically viable mutation that leads to early termination of the amino 184 acid sequence. The log likelihood of the sequence path and the query of the sequence for stop codons 185 can therefore be used in conjunction to assess sequence validity with high accuracy. This error 186 checking method has the additional advantage of not employing a computationally expensive step of data were initially analyzed in BOLD's aligned format. Sequences with information outside of the COI-202 5P region were truncated and the placeholder dashes from BOLD's aligned format were then removed 203 from the sequences. The resulting unaligned sequence data for only the Folmer region were then used 204 in subsequent model training and assessment. PHMM training (described below) included de novo 205 multiple sequence alignment, so that the alignment of input sequences for PHMM generation was 206 repeatable (parameters for BOLD's alignment of sequences are not publicly available). This process 207 was repeated for the amino acid sequences, with only the 219 amino acids corresponding to the barcode 208 region retained. To ensure that the data used for model training and the data used for model testing and 209 validation were independent from one another, the barcode dataset was first split in a stratified fashion 210 (on the taxonomic level: family) into training (70%), test (15%), and validation (15%) sets for 211 independent use in model training, refinement, and final assessment. The taxonomically stratified split 212 of the data was done to ensure that the training dataset was not biased towards the largest taxonomic 213 orders due to random chance and that subsequent models were trained on diverse data representative of 214 currently-available DNA barcode data on BOLD for the entire animal kingdom. It should be noted that 215 some of the sequences in the BOLD dataset may be taxonomically misidentified, but that any errors of with the exception of families with fewer than 10 samples in the training set, which had all their 230 representatives added to each subset to ensure that these rare families were represented in model (28,189) of the training data were produced, and each was subjected to the following procedure. 233 Training of the nucleotide and amino acid PHMMs was done using the R package aphid (Wilkinson 234 2019). Model training was conducted using aphid's derivePHMM function, which produces a de novo 235 multiple sequence alignment and generates an optimal PHMM using the Baum-Welch training 236 algorithm (Wilkinson 2019). Training with the derivePHMM function utilized default parameters, 237 except for the 'pseudocounts' and 'maxsize' parameters. The 'background' option was used for the 238 'pseudocounts' parameter (see Durbin and Eddy 1998 -Chapter 5 for explanation of pseudocounts). 239 Initially, models were trained using two 'maxsize' parameter pairs: sequence length on classification ability, the long sequences (b) were compared to the sequences from 296 the variable length error dataset (e) that were elongated, and the short sequences (c) were compared to 297 the sequences from the variable length error dataset (e) that were shortened. The area under the curve 298 (AUC -a metric of how well the log likelihood values separate the two categories) and the optimal F 299 score threshold (the log likelihood value that optimizes the trade-off between false positive and false 300 negative classifications) were determined for all comparisons. 301 To demonstrate how coil can be applied in metabarcoding analyses, the coil analysis pipeline 302 described above was rerun, and associated evaluation metrics were generated, for non-full-length 303 barcode sequences from a defined subsection for the COI-5P region. Subsets of the barcode sequences 304 in categories (a) and (c) were taken for an arbitrarily chosen 300bp window (base pairs 337-636 of the 305 full COI-5P region) to simulate non-full-length COI-5P sequences commonly generated in DNA error within the defined 300bp region (1868 sequences). To determine coil's baseline classification 308 ability on these data, the coil analysis pipeline was run for the non-full-length sequences using the full-309 length nucleotide and amino acid PHMMs. To test if the use of targeted PHMMs (corresponding to 310 only that specific 300bp section COI-5P region) improves coil's ability to frame sequences and identify 311 indels, the complete coil analysis pipeline was then rerun an additional time using a subsection of the 312 PHMMs representing only the given 300bp window to evaluate the non-full-length barcode sequences 313

(Supplementary File 4). 314
Processing non-animal barcode sequences 315 In addition to identifying the presence of indels in animal COI-5P sequences (which would most 316 likely result from technical errors), we tested how coil performed on COI-5P sequences for which it 317 was not designed to process. A series of plant, fungi, and bacterial COI-5P sequences as well as animal 318 nuclear mitochondrial pseudogenes were analyzed with coil, to see if sequences from these categories 319 would be flagged as likely containing errors, or if they would be processed in the same manner as 320 animal COI-5P sequences. The first set of sequences investigated were COI-5P sequences from the 321 plant and fungi kingdoms. The NCBI nucleotide database (https://www.ncbi.nlm.nih.gov/nucleotide/) 322 was queried for 'cytochrome c oxidase 1', and the results were filtered to retain sequences from only 323 plants and fungi. Results were filtered by sequence length (minimum 550bp, maximum 2000bp), and 324 the sequence descriptions were all manually checked to ensure that all results were COI-5P sequences. 325 This resulted in a set of 570 plant and 174 fungal sequences (Supplementary File 5) that were then 326 analyzed using the coil analysis pipeline (coi5p_pipe function) with default parameters. 327 Wolbachia is a genus of bacterial endosymbionts (commonly found in arthropods) that can in 328 some instances confound attempts to isolate an animal specimen's barcode sequence (Smith et al. Across the entire validation dataset, coil's indel_check function was able assess the amino acid 367 data and accurately discern sequences with insertion or deletion errors from correct barcode sequences 368 with greater than 97% accuracy. As with the framing of sequences, coil's effectiveness was influenced 369 by the length of the sequences (Table 1). When the full-length barcode sequences with and without 370 introduced errors were evaluated, they could be categorized with greater than 99.9% accuracy. For 371 sequences with excess DNA sequence added to the ends, the sequences with errors could be discerned 372 from the correct sequences with 98.0% accuracy. For the short sequences, performance was once again 373 the lowest, with 95.1% classification accuracy (Fig. 7). The similarity in these accuracy scores to the 374 percentage of sequences placed in reading frame correctly is not coincidental, but rather due to the fact 375 that the ability to correctly classify a sequence is highly dependent on the ability of coil to place it in 376 the common reading frame correctly. 377 Closer examination of the classification errors in the short sequence dataset (the category with 378 the highest error frequency) helped to characterize the limitations of coil's classification ability. The 379 majority of the classification errors (103/132 -78%) were false negatives, i.e. sequences without 380 introduced errors that were flagged as likely to contain errors. The reason for this appears to be the 381 incorrect establishment of the reading frame and thus large-scale reading frame shifts, resulting in 382 incorrect amino acid sequences. The major limitation of coil therefore appears to be the inability to 383 always establish the reading frame properly, a problem that is magnified in short sequences where the 384 front of the sequence is truncated. The examination of the 29 falsely accepted sequences for the short 385 category revealed the three main patterns where coil failed to detect errors in a sequence: 1) When an 386 indel occurs late (>500 bp) in the sequence (observed for 7 sequences), 2) when three insertions or 387 three deletions are found in closely proximity (observed for 4 sequences), and 3) when an insertion and 388 deletion occur in close proximity to one another (observed for 18 instances). All three of these patterns 389 do not lead to large-scale alterations to the amino acid sequence, and are therefore not easily detected 390 by examination of the amino acid sequence. 391 In all error classification tests, the accuracy (as shown by AUC) was highly similar when 392 translation was conducted with the known genetic codes and when censored translation was employed 393 (Table 1). This indicates that the censored translation method is effective at allowing error 394 identification to proceed with accuracy in instances where the taxonomic origin of a sequence is 395 unknown. It should be noted that the log likelihood values associated with sequences are overall 396 slightly lower when generated using censored translation, because lower numbers of amino acids 397 contribute to the likelihood value (5/64 codons are not translated - Fig. 5). For this reason, the 398 threshold used to identify sequences with indel errors should be adapted to the characteristics of the 399 data being analyzed. Table 1 provides users with a starting point for selecting a classification threshold, 400 showing the optimal log likelihood threshold (as determined through F-score maximization) for the 401 different comparisons of validation data. These values serve as statistically informed starting points 402 from which a user can deviate if they wish to attempt minimization of type 1 (lowering the logL 403 threshold to minimize rejection of correct sequences) or type 2 error (raising the logL threshold to 404 minimize acceptance of sequences with indel errors).
The indel identification tests showed that coil's performance is suboptimal when the length of 407 the examined sequence is far shorter than the PHMM profile length. The proximal cause of this issue is 408 that similarity between probabilistic profiles for different parts of the barcode region can lead to 409 continuous series of base pairs from the sequence matching to an incorrect portion of the PHMM and 410 incorrect establishment of the reading frame. This is problematic in the analysis of metabarcode data, 411 where non-full-length (~200-400bp) sections of the COI-5P barcode are targeted for amplification and 412 sequencing by custom primers. The fact that these non-full-length barcodes are in most instances 413 derived from targeted subsections of the COI-5P region means that the framing limitation can be 414 overcome by comparing sequences to only a subsection of coil's complete PHMMs. Using a subsection 415 of the full PHMMs reduces or eliminates the discrepancy in size between the PHMM and the non-full-416 length barcodes, drastically reducing the false positive indel signals caused by incorrect establishment 417 of the reading frame. 418 Using the subsection of the complete PHMMs (defined using coil's subsetPHMM function), 419 coil was able to separate non-full-length barcode sequences with indel errors from correct sequences 420 with greater than 99.4% accuracy (Fig 8.). This proved to be a drastic improvement over the use of the It should also be noted that amino acid PHMM log likelihood values reported for input 431 sequences are dependent on the size of the model to which they are compared. The optimal log 432 likelihood cut offs presented (in Table 1 Of the 345 Wolbachia COI-5P sequences analyzed, only five were flagged as likely to be sequences being flagged as errors resulted from the incorrect framing and translation of sequences. 456 Since the Wolbachia sequences are real instances of the COI-5P gene (albeit from bacteria as opposed 457 to animals) that do not possess any drastic changes to reading frame relative to animals, their 458 nucleotide and corresponding amino acid sequences are likely enough to be accepted as matches to the 459 animal COI-5P PHMMs. This demonstrates that coil cannot be effectively appropriated to separate 460 animal COI-5P sequences from the COI-5P sequences of endosymbiote contaminants. sequences identified within a single species complex. Expansion of these tests to COI-5P-derived 470 numts from a more diverse set of taxa could reveal how well coil is able to characterize numts. At 471 present, it is hypothesized that the ability to discern pseudogenes from true barcode sequences will 472 depend on how diverged the structure and sequence of a pseudogene are, with frame-shift-causing 473 indels increasing the likelihood of detection by coil. 474 Taken together, these results show that coil Version 1.0 should only be used to process animal 475 COI-5P data and that it is less reliable at detecting biological contaminants than technical errors. The 476 genetic structure of biological contaminants such as Wolbachia COI-5P is not always dissimilar enough 477 from animal COI-5P sequences for the PHMMs to discern the two categories. Additionally, if an 478 animal specimen were contaminated with DNA not from another kingdom, but rather with DNA from employed by the package are trained on a taxonomically robust sample of animal sequences. The 481 results presented here regarding the identification of biological contaminants are conservative, as in all 482 tests we have used coil's default censored translation option, which assumes that taxonomic 483 information is not available. The ability to identify biological contaminants can be greatly increased if 484 the taxonomy (and therefore the genetic code) corresponding to a sample is known. Contaminants from 485 organisms that appropriate a different genetic code would have a higher likelihood of being identified 486 in this situation, as translation would be more likely to yield amino acid sequences with stop codons or 487 highly improbable amino acids. Overall the coil package should not be considered a panacea for all 488 DNA barcode data errors, but rather as a tool that can be incorporated into barcode analysis as a 489 reliable pre-processing and data-cleaning step that can identify technical errors. Researchers using 490 barcode data should consider additional steps in their analyses, such as BLAST, which can help inform 491 the taxonomy of sequences and identify biological contaminants. functionality would require the training of addition profile hidden Markov models for these barcode 505 regions and integration of these models into the coil package. This is a feasible addition thanks to the 506 large amount of additional barcode data available through BOLD (Ratnasingham & Hebert 2007) and 507 other protein-coding markers available through NCBI, which can be queried and filtered using methods 508 similar to those employed here for COI-5P. Another extension of the functionality of the coil package 509 is to move from error identification to error correction. PHMMs are robust models of the conserved 510 sequence of the COI-5P barcode region on both a DNA and amino acid level. Accurately identifying 511 where deviations from the expected sequence occur and applying corrections to the sequences in an 512 unbiased fashion would allow PHMMs to serve as the basis of a DNA barcode data denoiser. 513 Corrections based on PHMMs trained on the wealth of barcode sequence information available could 514 serve as a powerful complement, or alternative, to current denoising methods which rely on read 515 abundance and de novo clustering of sequences (Nearing et al. 2018).  compositions. AUC -area under the curve for the ability of the amino acid PHMM log likelihood score 659 to discriminate sequences with insertion or deletion errors from sequences without errors. The optimal 660 LogL threshold for differentiating sequences with indels from those without indels is shown. The 661 optimal LogL cut off, as determined through F-score maximization, is shown. 662

Sequence with missing information:
Path suggests 9 missing bp at start of sequence. Placeholder characters added to sequence to place in reading frame.

Frame function
Viterbi algorithm used to obtain hidden state path 2 indicates insertion (added bp), 1 indicates match to profile.

222222222111111111111111……………………111111111112222 GCGCGCGACACTCTATATTTTATA……………………ACCAACATTTACGCG
Path suggests 9 extra bp at start of sequence and 4 at end Excess bp trimmed from sequence to place in reading frame. Vitertbi algorithm to determine the likelihood that a sequence contains an insertion or deletion error 725 that has led to a shift in reading frame. 726 727 Sequence without errors:

111111111111111……………………111111111111111
Path suggests all match states, suggestive of a barcode sequence without errors. The log likelihood of the sequence is therefore high.
Log likelihood: -237.68 , stop codons: False Additionally, the amino acid sequence does not contain stop codons, which is further suggestive of a correct amino acid sequence.

1112221122222211……………………222221112122222
Path is rife with insert states, suggests a phase shift has occurred in the nucleotide sequence, moving the sequence out of correct reading frame. The log likelihood of the sequence is therefore low.
Log likelihood: -1138.69, stop codons: True Additionally, the amino acid sequence contains stop codons, which is further suggestive of a shift out of correct reading frame.  Table 1 for corresponding AUC values). The dotted grey lines represent classification through 736 chance (AUC = 0.5). 737 738