Abstract
Background To improve the quality of nucleic acid detection reagents, we present a new strategy, Shine, that mines massive intrapopulation microbial genomic data for specific, sensitive and conserved biomarkers in order to improve detection sensitivity and accuracy. We expect that the more comprehensive the genomic data are, the more effective the resulting detection biomarkers will be.
Results Across several clinical projects, we demonstrated that our method could detect previously undiscovered multicopy, conserved, species-specific or even subspecies-specific target fragments. In particular, the approach was effective for any pathogenic microorganism, even when only incompletely assembled motifs were available. Based on our strategy, a detection device designed with quantitative PCR primers and probes for systematic and automated detection of pathogenic microorganisms in biological samples may cover all pathogenic microorganisms, regardless of whether genome annotation is available. On the website https://bioinfo.liferiver.com.cn, users may select different configuration parameters depending on the purpose of the project to support routine clinical detection practice.
Conclusions Our strategy is suitable for identifying shared universal phylogenetic markers with few false positive or false negative errors and for automating the design of minimal primer and probe sets that detect pathogenic communities with cost-effective predictive power.
Introduction
The testing and rapid detection of pathogenic organisms is a crucial undertaking related to health, safety and wellbeing, and early detection of pathogens is especially important for diagnosing and preventing diseases[1–3]. While the landscape of diagnostics is rapidly evolving, polymerase chain reaction (PCR) remains the gold standard of nucleic acid-based diagnostic assays, in part due to its reliability, flexibility and wide deployment[4]. A documented process for developing an emergency-use molecular-based laboratory-developed test (LDT) would be useful to other laboratories in future outbreaks and would help to lower barriers to establishing fast and accurate diagnostic testing in crisis conditions[4]. Nevertheless, the deoxyribonucleic acid (DNA) concentrations of pathogenic microorganisms in biological samples are mostly very low and close to the detection limit, so pathogen detection has become one of the most challenging aspects of clinical applications[5]. Traditional PCR and real-time PCR often lack detection sensitivity[6, 7]. Other methods, such as two-step nested PCR, may have better sensitivity, but they are not feasible for routine tests and present a high risk of contamination[8]. Because these methods are time consuming, costly or insufficiently accurate, it is necessary to explore high-performance biomarkers to improve the quality of reagents.
Since viruses lack shared universal phylogenetic biomarkers, a rise or drop in the concentrations of single biomarkers is not sufficient for accurate prediction of viral/bacterial community-acquired pneumonia, with overlap to varying extents depending on the marker cutoff values, detection methods, analysis, and desired specificity and sensitivity[9]. Although automatic identification of species-specific repetitive DNA sequences and their utilization for detecting microbial organisms by MultiMPrimer3 have been well presented[10], the website is limited by the lack of customized settings, especially for clinical applications. For instance, if unknown microorganisms cause epidemic outbreaks[11], the pathogenic microorganism database will be updated continuously, which may cause the original probe primer design to fail to cover epidemic pathogenic microorganisms, affecting the quality of nucleic acid detection reagents. To greatly improve the predictive power of detection, biomarker combinations have become the primary choice in many studies[12–14]. However, this approach may not be cost effective and could cause several experimental mistakes in actual mechanical processes in many clinical settings. Therefore, the importance of exploring minimal biomarkers with primers and probes to improve the detection sensitivity and accuracy at any time for any pathogen cannot be overestimated.
Generally, a common way to confirm suitable biomarkers as template regions for designing primers and probes for pathogenic microorganisms is to select specific plasmid[15] and 16S ribosomal RNA (16S rRNA) sequences[16]. On the one hand, 16S rRNA gene sequence analysis can be routinely used for the identification of mycobacteria and can lead to the recognition of novel pathogens and non-cultured bacteria[17–20], because rRNA genes exist in all microbial genomes and are often present in multiple copies, which can improve detection sensitivity. However, few studies have reported consensus quantitative definitions of genera or species based on 16S rRNA gene sequence data. Several studies have highlighted that rRNA and other marker genes cannot be directly used to fully predict the functional potential of the bacterial community[21]. In fact, not all rRNA genes are species specific: rRNA genes cannot meet the requirements of species specificity or subspecies specificity because their sequences are too conserved to distinguish closely related species, let alone strains of different subtypes of the same species. On the other hand, plasmid-mediated gene transfer plays an important role in the mobilization and dissemination of antibiotic resistance genes and in the spread of degradative pathways and pathogenicity determinants of pathogens[22]. However, not all microorganisms have species-specific plasmids, and some microorganisms have no plasmids at all. That is, the mechanisms and selective pressures causing mosaic plasmids do not occur evenly over all species, and plasmids, which are abundant but unevenly distributed across prokaryotic taxa, may provide different levels of potential variation to different species[23]. Therefore, it has not been confirmed that plasmid DNA is species specific, especially given the high similarity of plasmid DNA between some different species. Moreover, a plasmid-based design cannot detect species that carry no plasmids. Hence, many clinical laboratories still have to validate the quality of assays with other primers and probes, since plasmid PCR testing carries a high risk of false positive or false negative errors. Overall, neither a specific plasmid nor an rRNA gene is the best choice of template for designing primers and probes for pathogenic microorganisms.
On the basis of previous studies on comparative analysis of molecular sequence data, such as those using Molecular Evolutionary Genetics Analysis (MEGA5)[24] or Phylogenetic Analysis by Maximum Likelihood (PAML4)[25], which involved reconstructing the evolutionary histories of species and inferring the selective forces shaping the evolution of genes and species, it is also essential to practice comparative genomics in routine tests and rapid detection of pathogenic organisms for improved performance. Here, we demonstrated the Shine strategy based on comparative genomics to explore specific, sensitive and conserved biomarkers from massive microbial genomic data within populations. We hypothesized that the more comprehensive genomic data are, the more effective detection biomarkers. We aimed to show a design strategy to improve the quality of nucleic acid detection reagents, which has been validated by several clinical projects. In particular, it is available for any pathogenic microorganism even in incompletely assembled motifs. Our method could detect undiscovered multicopy universal species-specific or even subspecies-specific target fragments as design templates and automate the production of the best and minimal primer and probe sets that covered all publicly epidemic pathogenic microorganisms.
Materials and Methods
Specificity
The pathogenic genomic data were derived from public databases, such as the National Center for Biotechnology Information (NCBI) Assembly database[26], Global Initiative on Sharing All Influenza Data (GISAID)[27, 28], ChunLab’s public data and analytics portal (EzBioCloud)[29], Eukaryotic Pathogen Database Project (EuPathDB)[30], Giardia Genomics Resources (GiardiaDB)[31], Trichomonas Informatics Resources (TrichDB)[31], and Fungal & Oomycete Informatics Resources (FungiDB)[32], which contained either completely assembled pathogenic genomes or incompletely assembled motifs. The defined populations were specific species or subspecies, and the control group was all the other species or subspecies of the same classification, excluding the defined populations. As shown in Figure 1b, the specific regions in the microorganism target fragments were identified in three steps: 1) the microorganism target fragments were compared one to one with the whole genome sequences of one or more comparison strains, and fragments whose similarity exceeded the preset value were removed, leaving a plurality of residual fragments as the first-round cut fragments T1-Tn, where n was an integer greater than or equal to 1; 2) the first-round cut fragments T1-Tn were then compared with the whole genome sequences of the remaining comparison strains, and fragments whose similarity exceeded the preset value were removed, leaving the collection of residual cut fragments as the candidate specific regions of the microorganism target fragments; and 3) the candidate specific regions were verified by a) searching public databases[33] for other species whose similarity to the candidate specific region was greater than the preset value and b) comparing the candidate specific regions with the whole genome sequences of the comparison strains to find fragments whose similarity was greater than the preset value. Candidate specific regions for which neither check found such a match were considered the specific regions of the microorganism target fragments.
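The subtraction in steps 1 and 2 can be illustrated with a minimal Python sketch. It is not the Shine implementation: shared exact k-mers stand in for the similarity threshold of a real aligner, and the function name `residual_fragments` and the parameters `k` and `min_len` are hypothetical choices made only for this illustration.

```python
def residual_fragments(target, comparison_genomes, k=8, min_len=8):
    """Return (start, end, sequence) pieces of `target` that share no k-mer
    with any comparison genome (0-based, end-exclusive coordinates).

    Assumption: exact k-mer sharing is a stand-in for "similarity exceeds
    the preset value"; a real pipeline would use an alignment tool.
    """
    covered = [False] * len(target)
    for genome in comparison_genomes:
        kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}
        for i in range(len(target) - k + 1):
            if target[i:i + k] in kmers:
                for j in range(i, i + k):
                    covered[j] = True  # mark the bases to be cut away

    fragments = []
    start = None
    # A trailing True acts as a sentinel that closes the last open run.
    for pos, hit in enumerate(covered + [True]):
        if not hit and start is None:
            start = pos
        elif hit and start is not None:
            if pos - start >= min_len:
                fragments.append((start, pos, target[start:pos]))
            start = None
    return fragments


if __name__ == "__main__":
    target = "AAAACCCCGGGGTTTTACGTACGT"
    print(residual_fragments(target, ["AAAACCCCGGGG"], k=4, min_len=4))
```

Regions that survive every comparison strain would still need the verification of step 3 (a database search plus a second comparison), which this sketch leaves out.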
Sensitivity
To identify the multicopy regions in the microorganism target fragments (Figure 1c), 1) candidate multicopy regions were searched by performing internal alignments on the microorganism target fragments and taking the regions of the to-be-detected sequences whose similarity met the preset value, where the similarity was the product of the coverage rate and the matching rate of the to-be-detected sequence; and 2) the candidate multicopy regions were verified by obtaining the median of their copy numbers, which involved a) determining the position of each candidate multicopy region on the microorganism target fragments; b) counting, for each base of the to-be-verified candidate multicopy region, the number of other candidate multicopy regions covering that position; and c) calculating the median copy number of the to-be-verified candidate multicopy region. Here, the other candidate multicopy regions are all candidate multicopy regions other than the one being verified. The target fragments of microorganisms may be complete chains or multiple incomplete motifs. If the median copy number of a candidate multicopy region was greater than 1, the region was recorded as a multicopy region. The preset similarity value could be set as needed; a value above 80% is recommended. If a region whose similarity met the preset value spanned different motifs, the region was divided at the original motif connection points into subregions, and each subregion was tested separately as a candidate multicopy region. The coverage rate = (length of similar sequence/(end value of the to-be-detected sequence – starting value of the to-be-detected sequence + 1)) × 100%. The matching rate referred to the identity value of the to-be-detected sequence in this internal (self) alignment. The identity value of two aligned sequences could be obtained with software such as needle[34], water[35] or blat[36]. The length of the similar sequence referred to the number of bases of the to-be-detected sequence occupied by the matched fragment when the sequence was aligned with other sequences, that is, the length of the matched fragment.
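As a worked illustration of these definitions, the following Python sketch computes the coverage rate, the similarity (coverage rate times matching rate) and the per-base median copy number. It rests on stated assumptions: coordinates are 1-based and inclusive to match the "end – start + 1" term, rates are returned as fractions rather than percentages, and the region being verified counts as one copy in addition to the other candidate regions covering each base. The function names are hypothetical.

```python
from statistics import median


def coverage_rate(similar_len, start, end):
    """Fraction of the to-be-detected sequence covered by the matched fragment.
    Multiply by 100 for the percentage form used in the text."""
    return similar_len / (end - start + 1)


def similarity(similar_len, start, end, identity):
    """Similarity = coverage rate x matching rate (identity as a fraction)."""
    return coverage_rate(similar_len, start, end) * identity


def median_copy_number(region, other_regions):
    """Median, over the bases of `region`, of the number of copies at that base.

    `region` and `other_regions` hold (start, end) pairs, 1-based inclusive.
    Assumption: the region itself contributes one copy, so a region with no
    overlapping candidates has a median of 1 and is not called multicopy.
    """
    start, end = region
    copies = []
    for pos in range(start, end + 1):
        n_other = sum(1 for s, e in other_regions if s <= pos <= e)
        copies.append(n_other + 1)
    return median(copies)


if __name__ == "__main__":
    print(similarity(90, 101, 200, 0.95))
    print(median_copy_number((10, 19), [(10, 19), (15, 30)]))
```

Under these assumptions a candidate would be recorded as multicopy when `median_copy_number(...) > 1` and its similarity meets the chosen preset (for example 0.8).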
Conservation
As presented in Figure 1a, species-specific consensus sequences of microorganisms were obtained in two steps: 1) to search for candidate consensus sequences, the specific sequences of target strains belonging to the same species were clustered with a clustering algorithm[37] to obtain a plurality of candidate species-specific consensus sequences; and 2) to verify and obtain the primary-screened species-specific consensus sequences, the candidates were remapped with mafft[38] and checked against two conditions: the strain coverage rate met the preset value, and the effective copy number met the preset value. Candidates that met both conditions were accepted as species-specific consensus sequences. The strain coverage rate (percentage of strains) = (number of target strains with the candidate species-specific consensus sequence/total number of target strains) × 100%. The effective copy number, i.e., the weighted average copy number, was calculated according to formula (I): effective copy number = Σ Ci × (Si/Sall), summed over i = 1, ..., n, where n was the total number of copy number gradients of the candidate species-specific consensus sequence; Ci was the copy number of the i-th gradient, ranging from Cmin to Cmax; Si was the number of strains carrying the candidate species-specific consensus sequence at copy number Ci; and Sall was the total number of target strains. Cmin and Cmax are the minimum and maximum copy numbers of all candidate species-specific consensus sequences.
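The two screening quantities can be computed as in the short Python sketch below. The function names and the dictionary layout (copy number mapped to the number of strains carrying the sequence at that copy number) are assumptions made for illustration only.

```python
def strain_percentage(n_strains_with_sequence, n_target_strains):
    """Percentage of target strains that carry the candidate consensus sequence."""
    return 100 * n_strains_with_sequence / n_target_strains


def effective_copy_number(strains_by_copy_number, n_target_strains):
    """Weighted average copy number, formula (I): sum of Ci * (Si / Sall).

    `strains_by_copy_number` maps each observed copy number Ci to Si, the
    number of strains carrying the sequence at that copy number.
    """
    return sum(
        copy_number * n_strains / n_target_strains
        for copy_number, n_strains in strains_by_copy_number.items()
    )


if __name__ == "__main__":
    # Hypothetical example: 50 target strains; 10 carry 1 copy,
    # 30 carry 2 copies and 10 carry 6 copies.
    counts = {1: 10, 2: 30, 6: 10}
    print(strain_percentage(sum(counts.values()), 50))
    print(effective_copy_number(counts, 50))
```

In this hypothetical example every strain carries the sequence, and the weighted average is (1×10 + 2×30 + 6×10)/50 = 2.6 copies.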
Design
Based on the above various combinations of different submodules, the final candidate species-specific consensus sequences could be compared to the whole genomes of all target strains to calculate the percentage of strains and effective copy numbers of the candidate species-specific consensus sequences. Designing the templates of the primary-screened species-specific consensus sequence and achieving the best sets of primers and probes were performed as follows: 1) we obtained the candidate probes and primers by Primer3[39] or Beacon Designer™; 2) the sequences of the candidate probes and primers were aligned to the whole genome of all target strains; 3) the strain coverage rates corresponding to the sequences of each probe and primer were calculated; and 4) the candidate probes and primers for which the strain coverage rates met the preset values were screened, and the primary-screened species-specific consensus sequences corresponding to the screened candidate probes and primers were chosen as the final species-specific consensus sequences.
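Steps 2 to 4 of the design procedure can be sketched in Python as follows. This is a simplified illustration, not the pipeline's code: an exact substring match on either strand stands in for a real alignment of the oligonucleotide to each genome, and all names (`strain_coverage_rate`, `screen_oligos`, the example sequences) are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def strain_coverage_rate(oligo, genomes):
    """Fraction of target strains whose genome contains the oligo.

    Assumption: an exact match on either strand counts as a hit; a real
    workflow would align with mismatch tolerance.
    """
    rc = reverse_complement(oligo)
    hits = sum(1 for genome in genomes.values() if oligo in genome or rc in genome)
    return hits / len(genomes)


def screen_oligos(oligos, genomes, threshold):
    """Keep primers/probes whose strain coverage rate meets the preset value."""
    kept = {}
    for name, seq in oligos.items():
        rate = strain_coverage_rate(seq, genomes)
        if rate >= threshold:
            kept[name] = rate
    return kept


if __name__ == "__main__":
    genomes = {"s1": "AAACCCGGGTTT", "s2": "GGGTTTACGT", "s3": "ACGTACGT"}
    oligos = {"fwd": "AAACCC", "probe": "ACGTAC"}
    print(screen_oligos(oligos, genomes, threshold=0.6))
```

The same routine could be rerun whenever new strains are added, which gives a direct check of whether existing primers and probes still cover the updated set.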
Results
The illustration of our strategy
We developed a de novo genome alignment-based pipeline to explore the original and specific multicopy biomarkers of the defined intrapopulations to cover all the members. If either repetitive regions or specific regions were preferred, the result was split into two selections and then processed in the other modules separately. Each selection was finally focused on searching for consensus sequences and designing the best primer and probe sets. Correspondingly, it was necessary to perform double-check validation in every module, as shown in Figure 1a. One of the important details was common block deletions used to search the specific regions, and each genome of the target strains was compared with every genome of the control strains for N calculations. Common block deletions lasted for X generations with multiple threads to search specific regions or subspecific regions for M target strains, as illustrated in Figure 1b. The other key point was searching repetitive regions with different copy numbers in every target strain and extracting potential repeats for validation by remapping and statistically summarizing the mean copy numbers and variations for each repeat, and the rest were discarded. Finally, to be conservative, the maximum values of the strain coverage rate were achieved as much as possible with the fewest consensus sequences. All the logic modules were verified multiple times.
Identification of specific regions
The most striking finding of this method was the contribution of specific, sensitive, and conservative biomarkers for each species or subspecies, especially those available for microbial genomes. First, the obvious advantage of our strategy was that it was capable of detecting species-specific or even subspecies-specific target fragments that contained forward primers, reverse primers and probes separately in several projects, such as Human coronavirus HKU1 (HKU1), Human coronavirus OC43 (OC43), Human coronavirus NL63 (NL63), Human coronavirus 229E (229E), Middle East respiratory syndrome (MERS) coronavirus, severe acute respiratory syndrome coronavirus (SARS-CoV) and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
Notably, if there were no hits with the above biomarker genes or probes and no annotation, the sets were defined de novo, as presented in Table 1, and were obviously distinguished from other species or subspecies.
To accelerate the comparison, the first-round cut fragments T1-Tn could be compared with the whole genome sequences of the remaining comparison strains by group iterations, as shown in Figure 1b: 1) the remaining comparison strains were divided into P groups, each containing a plurality of comparison strains; 2) the first-round cut fragment Tn was compared simultaneously, one to one, with the whole genome sequence of each comparison strain in the first group, and fragments whose similarity exceeded the preset value were removed, leaving a plurality of residual fragments as the first-round candidate sequence library of Tn; 3) the previous-round candidate sequence library of Tn was compared simultaneously, one to one, with the whole genome sequence of each comparison strain in the next group, and fragments whose similarity exceeded the preset value were removed, leaving the residual fragments as the next-round candidate sequence library of Tn; 4) step 3 was repeated until the Pth-round candidate sequence library was obtained, which served as the candidate specific sequence library of Tn; and 5) the collection of the candidate specific sequence libraries of all first-round cut fragments formed the candidate specific regions. The method further comprised comparing selected adjacent microorganism target fragments one to one; if the similarity after comparison was lower than the preset value, an alarm was issued, and the screening conditions corresponding to the target strains were displayed. In this way, abnormal data and redundant data caused by human error could be filtered out. The target fragments of microorganisms could be whole genomes of microorganisms or gene fragments.
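The group-iteration scheme can be sketched in Python as below. The sketch makes several assumptions that are not taken from the text: shared exact k-mers replace the similarity test, a thread pool models the "simultaneous" comparisons within a group, and strains are assigned to groups round-robin. All function and parameter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def shared_mask(fragment, genome, k):
    """Mark the bases of `fragment` that lie in a k-mer also found in `genome`."""
    kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}
    mask = [False] * len(fragment)
    for i in range(len(fragment) - k + 1):
        if fragment[i:i + k] in kmers:
            mask[i:i + k] = [True] * k
    return mask


def cut(fragment, mask, min_len):
    """Return the unmasked pieces of `fragment` that are at least `min_len` long."""
    pieces, start = [], None
    for pos, hit in enumerate(mask + [True]):  # sentinel closes the last run
        if not hit and start is None:
            start = pos
        elif hit and start is not None:
            if pos - start >= min_len:
                pieces.append(fragment[start:pos])
            start = None
    return pieces


def group_iterations(fragment, comparison_genomes, n_groups, k=4, min_len=4):
    """Refine one first-round cut fragment against P groups of comparison strains."""
    groups = [comparison_genomes[i::n_groups] for i in range(n_groups)]
    library = [fragment]  # candidate sequence library of the current round
    with ThreadPoolExecutor() as pool:
        for group in groups:
            next_library = []
            for piece in library:
                # Compare the piece with every strain of the group concurrently.
                masks = list(pool.map(lambda g: shared_mask(piece, g, k), group))
                combined = [any(col) for col in zip(*masks)] if masks else [False] * len(piece)
                next_library.extend(cut(piece, combined, min_len))
            library = next_library
    return library


if __name__ == "__main__":
    print(group_iterations("AAAACCCCGGGGTTTTACGTACGT", ["AAAACCCC", "GGGG", "TTTT"], n_groups=2))
```

Because each round only passes on the surviving pieces, later groups are compared against progressively less sequence, which is where the speed-up in such a scheme would be expected to come from.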
Identification of multi-copy regions
Second, compared with the previous method, our strategy was highly accurate and sensitive, and previously undiscovered multicopy regions could be identified, as demonstrated in Figure 1c. For example, IS6110 was clearly identified by Shine, as shown in Table 2. This motif is an insertion element found exclusively within members of the Mycobacterium tuberculosis complex (MTBC), and it has become an important biomarker in the identification of MTBC species[40, 41]. IS1002, which is present in both Bordetella pertussis (BP) and Bordetella parapertussis (BPP) strains isolated from humans, was also detected by Shine, consistent with a recent study[42]. Finally, all pathogenic target microorganism genomes should be covered to avoid lowering the quality of the nucleic acid detection reagents presented in Table 2. Our strategy was therefore more flexible for customized settings, obtaining the most conserved biomarkers, primers and probes from continuously updated massive microbial genomic data. Although 16S rRNA genes can be used whether or not a whole genome sequence is available, they are not always multicopy, and some rRNA genes in closely related species cannot be distinguished from each other. Likewise, not all plasmids are specific and universal, and plasmids are unevenly distributed across prokaryotic taxa. In short, our method was more comprehensive than the limited selection of plasmids or 16S rRNA genes as template regions, as repetitive, specific and universal target fragments could be found even in incompletely assembled motifs.
Identification of conservative regions
As demonstrated in Supplementary Tables 1-8, the distribution of each cluster was summarized with each copy number listed outside the parentheses and its count inside the parentheses. The target cluster sequences used as design templates could then be picked directly by the percentage of strains and the weighted average copy number detailed in Materials and Methods. In summary, the detection device based on our strategy, Shine, designed with quantitative PCR primers and probes for systematic and automated detection of pathogenic microorganisms in biological samples, may cover all pathogenic microorganisms, including bacteria, viruses, fungi, amoebas, cryptosporidia, flagellates, microsporidia, piroplasma, plasmodia, toxoplasma, trichomonas and kinetoplastids. Whether genome annotation was present was not a limiting factor. Operational tasks can be submitted by providing the names of the target strains and the comparison strains or by uploading local sequence files on the website https://bioinfo.liferiver.com.cn. Users may select different configuration parameters depending on the purpose of the project. The configuration parameters mainly include the name of the workflow, target species, comparison species, uploaded local fasta files, target fragment length, species specificity, repeated region similarity, target fragment strain distribution, host sequence filtering, priority scheme (prioritizing multicopy regions vs. prioritizing specific regions), calculation of target strain and alarm threshold similarities, and primer probe design parameters. As a consequence, the strategy was suitable for identifying shared universal phylogenetic biomarkers with few false positive or false negative errors and for automating the design of minimal primers and probes to detect the pathogenic community with cost-effective predictive power.
Discussion
We demonstrated a new strategy, Shine, to explore specific, sensitive and conserved biomarkers from massive microbial genomic data within intrapopulations to improve detection sensitivity and accuracy. Several clinical projects have been carried out with devices based on Shine. It should be noted that this study examined only limited public genomic data, and we look forward to collaborating with more organizations on the basis of open sharing of data and respect for all rights and interests[28]. Although the three properties examined here, i.e., specificity, sensitivity and conservation, are characterized only preliminarily, each merits further description and exploration for the reasons given below.
The first aspect involved the ability to identify specific regions in microorganism target fragments. The biodiversity and evolution of vertebrate RNA viruses has expanded dramatically since the beginning of the millennium, and it has been reported that more expensive, better sampling worldwide and more powerful approaches for virus characterization are needed to help us find these divergent viruses, such as chuviruses and jingmenviruses[43], which will help to fill the evolutionary gaps of RNA viruses[44]. With the development of methods for detecting more than 100 different nucleic acid targets at one time, FilmArray made the system well suited for the molecular detection of infectious agents, and the automated identification of pathogens from their corresponding target amplicons could be accomplished by analysis of the DNA melting curve of the amplicon[45]. Additionally, several studies have reported multiplex real-time PCR assays for detecting four microorganisms relevant to community-acquired pneumonia (CAP) infections[46] in Asia; CAP is one of the most common infectious diseases and a significant cause of mortality and morbidity globally. The availability of tests with improved diagnostic capabilities potentially leads to an informed choice of antibiotic usage and appropriate management of the patient to achieve a better treatment outcome and financial savings[46]. Herein, we generated a more significant biomarker dataset, which was validated by several clinical experiments, as described in Table 1 and Table 2. All the results support that our strategy is robust for detecting effective biomarkers. It seems that specificity, sensitivity and conservation could account for this performance. Interestingly, graphene is a lightweight, chemically stable and conductive material that can be successfully utilized for the detection of various virus strains. 
The current state-of-the-art applications of graphene-based systems for sensing a variety of viruses, e.g., SARS coronavirus 2 (SARS-CoV-2), influenza, dengue fever, hepatitis C virus, human immunodeficiency virus (HIV), rotavirus and Zika virus, have been summarized[47, 48]. Graphene-based biosensor technology with high sensitivity and specificity could be particularly useful in the life sciences and medicine since it can significantly enhance patient care, early disease diagnosis and pathogen detection in clinical practice[49, 50]. Notably, CRISPR-Cas systems, in particular the recently discovered DNA-targeting Cas12 and RNA-targeting Cas13 systems, both possessing unique trans-cleavage activity, are being harnessed for viral diagnostics and therapies[51]. In addition, specific high-sensitivity enzymatic reporter unlocking (SHERLOCK) testing in one pot (STOP) is a streamlined assay combining simplified extraction of viral RNA with isothermal amplification and Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-mediated detection, which can be performed at a single temperature in less than one hour with minimal equipment[52]. Therefore, we tentatively propose cooperating with related institutes to combine the strategy of Shine with graphene-based biosensor technology or CRISPR-Cas systems for application in pathogen sensing.
On the other hand, when identifying multicopy regions in microorganism target fragments, the motifs are connected together before searching for candidate multicopy regions, in which the microorganism target fragments often have multiple incomplete motifs. The motif is caused by incomplete splicing of short read lengths under existing second-generation sequencing conditions. There was no specific restriction on the order in which the motifs were connected together, i.e., the motifs may have been connected to the chain in random order. If the region where the similarity met the preset value contained different motifs, the region was divided based on the original motif connection points into different subregions to determine whether the subregions were candidate multicopy regions. This method is also suitable for whole-genome sequencing data generated by new technologies such as third-generation sequencing. In the preferred embodiment, the 95% confidence interval of the copy numbers of the candidate multicopy regions was calculated. The confidence interval refers to the estimated interval of the overall parameters constructed by the sample statistics, that is, the interval estimation of the overall copy numbers of the target regions. The confidence interval reflected the degree to which the true values of the copy numbers of the target regions were close to the measurement result. The confidence interval indicates the credibility of the measured values of the measured parameters.
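The text does not state how the 95% confidence interval was computed. As one possible reading, the Python sketch below uses a normal approximation (mean ± 1.96 standard errors) over the per-strain copy numbers of a candidate region; the formula choice, the function name and the example values are all assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def copy_number_ci95(copy_numbers):
    """Approximate 95% confidence interval for the mean copy number.

    Assumption: normal approximation with the sample standard deviation;
    for small samples a t-based interval would be wider.
    """
    n = len(copy_numbers)
    if n < 2:
        raise ValueError("at least two observations are needed")
    centre = mean(copy_numbers)
    half_width = 1.96 * stdev(copy_numbers) / sqrt(n)
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical copy numbers of one candidate region in eight strains.
    low, high = copy_number_ci95([4, 5, 6, 5, 5, 4, 6, 5])
    print(round(low, 3), round(high, 3))
```

A narrow interval would indicate that the copy number of the region is stable across strains, which is the property a multicopy template needs.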
Finally, this approach is related to obtaining species-specific consensus sequences for microorganisms, which depends on how species are demarcated. For instance, HIV-1 and HIV-2 were assigned to two different species, while SARS-CoV and SARS-CoV-2 were assigned to two strains of a single species. Were these different assignments due to the fundamental nature of the approach or the result of different approaches to species demarcation by the respective specialized study groups (SGs)? That is, how can the position of a viral entity in the natural world be defined? In practical terms, recognizing virus species as the principal subjects of virology would also expand the scale of the spatiotemporal framework connecting studies of natural virus variation, cross-host transmission, and pathogenicity and thus contribute to the understanding and control of virus infections[53]. Here, we present a method intended to cover all pathogenic microorganism genomes to avoid lowering the quality of nucleic acid detection reagents. Users may submit the latest sequence dataset through a user-friendly interface. The sequence update coverage rate module may reintegrate the latest sequence dataset into the database and calculate the coverage rates by recomparing the sequences of the original probes and primers with the updated sequences. The result indicates whether the original probes and primers still cover the newer strains. Exceptions occurred for highly divergent viruses, such as Sapovirus and human astrovirus, which have few high-performance consensus biomarkers. If none of the strain coverage rates of the candidate consensus sequences reached the preset value, we had to prioritize specificity and/or sensitivity and combine the candidate consensus sequences to improve conservation, although this may not be cost-effective and could cause several experimental errors.
The recommended process was to screen for the combinations whose strain coverage rate reached the preset value with the fewest consensus sequences, take the screened combinations as the candidate consensus sequences, and then verify and obtain the primary-screened species-specific consensus sequences. The combinations were tried in order of increasing number of consensus sequences. If no single consensus sequence covered all the current strains, we searched for two consensus sequences whose summed strain coverage rates were greater than or equal to the preset value of the strain coverage rate. If such a pair existed, the two consensus sequences were recorded in the results; if not, three consensus sequences were combined in the same way, and if no such triple existed, four consensus sequences were combined. By analogy, the number of combined consensus sequences was increased one at a time, rather than without limit, until a combination that met the preset value of the total strain coverage rate was found and recorded in the results.
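The search over combinations of increasing size can be sketched in Python as follows. One deliberate departure from the description above is labelled here as an assumption: instead of summing coverage rates, the sketch pools the covered strains as a set union, so that a strain detected by two consensus sequences is counted only once. The function name and the example data are hypothetical.

```python
from itertools import combinations


def smallest_covering_combination(covered_strains, n_target_strains, threshold):
    """Return the first combination with the fewest consensus sequences whose
    pooled strain coverage rate reaches `threshold` (a fraction), or None.

    `covered_strains` maps a consensus sequence name to the set of strains
    it covers. Assumption: coverage of a combination is the union of these
    sets divided by the total number of target strains.
    """
    names = sorted(covered_strains)
    for size in range(1, len(names) + 1):  # one sequence, then two, then three...
        for combo in combinations(names, size):
            pooled = set().union(*(covered_strains[name] for name in combo))
            if len(pooled) / n_target_strains >= threshold:
                return combo
    return None


if __name__ == "__main__":
    covered = {
        "c1": {"s1", "s2", "s3"},
        "c2": {"s3", "s4"},
        "c3": {"s5"},
    }
    print(smallest_covering_combination(covered, 5, threshold=0.8))
```

The exhaustive search grows quickly with the number of candidates, so for large candidate sets a greedy set-cover heuristic might be a practical substitute, at the cost of no longer guaranteeing the smallest combination.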
Conclusion
In summary, the Shine strategy was presented to explore specific, sensitive and conserved biomarkers from massive microbial genomic data within intrapopulations. We have proposed a design strategy to improve the quality of nucleic acid detection reagents, which has been validated in several clinical projects. Our method was highly accurate and sensitive and was capable of detecting undiscovered multicopy universal species-specific and even subspecies-specific target fragments, covering all publicly known epidemic pathogenic microorganisms. Therefore, it is suitable for identifying shared universal phylogenetic biomarkers with few false positive or false negative errors and for automating the design of minimal primers and probes to detect the pathogenic community with cost-effective predictive power.
Ethical Approval and Consent to participate
All protocols were approved by the Liferiver Science and Technology Institute, Shanghai ZJ Bio-Tech Co., Ltd and Use Committee (Shanghai, China).
Consent for publication
Not applicable.
Availability of data and materials
All shared universal phylogenetic markers identified from massive microbial genomic data are available at the website https://bioinfo.liferiver.com.cn/#/home.
Competing interests
The authors declare that they have no competing interests.
Funding
This work received no funding.
Authors’ contributions
The authors’ responsibilities were as follows: JBS and CJ designed and conducted the research; CJ and JBS analyzed the data and performed the analysis; CJ wrote the paper; JBS revised the manuscript; and CJ had primary responsibility for the final content. The authors read and approved the final manuscript.
Figure and Table Legends
Table 1. Sample sets for detecting species-specific or even subspecies-specific target fragments. This output includes the forward primers, reverse primers and probes for several projects on different species or subspecies of coronavirus that meet three criteria: specificity, sensitivity and conservation.
Table 2. Sample sets of identified undiscovered multicopy regions compared with known 16S rRNA genes. This output includes all the de novo multicopy fragments identified by our method and several known 16S rRNA genes for the target pathogenic microorganisms, with the corresponding copy numbers and conservation.
Supplementary Table 1. Cluster sets of identified multicopy regions from Mycobacterium tuberculosis, Mycobacterium africanum, Mycobacterium bovis, and the Mycobacterium tuberculosis complex.
Supplementary Table 2. Cluster sets of identified multicopy regions from B. pertussis, B. parapertussis, B. holmesii.
Supplementary Table 3. Cluster sets of identified multicopy regions from B. parapertussis, B. bronchiseptica.
Supplementary Table 4. Cluster sets of identified multicopy regions from M. pneumoniae strain M129.
Supplementary Table 5. Cluster sets of identified multicopy regions from Streptococcus agalactiae.
Supplementary Table 6. Cluster sets of identified multicopy regions from H. pylori UA802, H. pylori strain PMSS1, H. pylori strain 7.13.
Supplementary Table 7. Cluster sets of identified undiscovered multicopy regions from Legionella pneumophila.
Supplementary Table 8. Cluster sets of identified undiscovered multicopy regions from Candida auris.
Acknowledgments
We thank Zhang Hanyan, Xiong Lei, and Pan Daxia for carrying out the experimental validation in this study. The authors deeply thank Liu Yan, Zhang Jie, and Li Qiang for their valuable suggestions and comments on this work. Many facets of the user-interface design benefited from the contributions of Niu Xingsheng, Lu Wang and Pan Yajie. We also thank Shen Yilin, Zhu Lingjiao, Guo Jingjing, and Zhou Miaomiao for their valuable help in revising this paper. All the other data supporting the findings of this study and the computational code used in this study are available from the corresponding authors upon reasonable request. Cong Ji and Junbin (Jack) Shao are named inventors on the pending PCT Patent Applications PCT/CN2020/090180, PCT/CN2020/090175, and PCT/CN2020/090177 filed by the Liferiver Science and Technology Institute of Shanghai ZJ Bio-Tech Co., Ltd., which separately describe the method and device for identifying multicopy, species-specific consensus sequences in microorganism target fragments and use thereof. The other authors declare no competing interests.
Footnotes
The Abstract has been divided into three parts; the Results section has been divided into four parts and the related text updated; Table 2 and Supplementary Tables 1-8 updated; Abbreviations updated; Ethical Approval and Consent to participate updated; Consent for publication updated; Availability of data and materials updated; Competing interests updated; Funding updated; Authors' contributions updated; Author information updated.
Abbreviations
- PCR: Polymerase Chain Reaction
- LDT: Laboratory-developed Test
- DNA: Deoxyribo Nucleic Acid
- RNA: Ribonucleic Acid
- 16S rRNA: 16S Ribosomal RNA
- MEGA: Molecular Evolutionary Genetics Analysis
- PAML: Phylogenetic Analysis by Maximum Likelihood
- NCBI: National Center for Biotechnology Information
- GISAID: Global Initiative on Sharing All Influenza Data
- EzBioCloud: ChunLab’s public data and analytics portal
- EuPathDB: Eukaryotic Pathogen Database Project
- GiardiaDB: Giardia Genomics Resources
- TrichDB: Trichomonas Informatics Resources
- FungiDB: Fungal & Oomycete Informatics Resources
- HKU1: Human coronavirus HKU1
- OC43: Human coronavirus OC43
- NL63: Human coronavirus NL63
- 229E: Human coronavirus 229E
- MERS: Middle East respiratory syndrome-related coronavirus
- SARS-CoV: Severe acute respiratory syndrome coronavirus
- SARS-CoV-2: Severe acute respiratory syndrome coronavirus 2
- MTBC: Mycobacterium tuberculosis complex
- BP: Bordetella pertussis
- BPP: Bordetella parapertussis
- CAP: community-acquired pneumonia
- HIV: human immunodeficiency virus
- CRISPR: Clustered Regularly Interspaced Short Palindromic Repeats
- CRISPR-Cas: CRISPR-associated (Cas) genes
- SHERLOCK: specific high-sensitivity enzymatic reporter unlocking
- STOP: SHERLOCK testing in one pot
- SGs: study groups