GI-POP: A combinational annotation and genomic island prediction pipeline for ongoing microbial genome projects
Introduction
In research on pathogenesis and drug resistance, the relevant genes and associated elements have been found to cluster in particular chromosomal regions (Hacker et al., 1990). These DNA segments can often move and integrate into other bacterial genomes through horizontal gene transfer (HGT), which commonly occurs among microorganisms (Binnewies et al., 2006, Frost et al., 2005, Hacker and Kaper, 2000, Koonin et al., 2001, Mantri and Williams, 2004, Ou et al., 2006). The incorporated foreign DNA segments often have tRNA genes or repeated sequences at their boundaries (Hsiao et al., 2003, Lobry, 1996a, Yoon et al., 2007). Collectively referred to as genomic islands (GIs), these foreign DNA segments typically confer adaptive traits of medical or environmental relevance and range from 5 to 500 kb in length (Schmidt and Hensel, 2004). For instance, secretion islands encode supplementary secretion systems, metabolic islands carry genes for secondary metabolism, and resistance islands bring antibiotic resistance to the host bacteria (Hacker and Kaper, 2000). The sequence composition of GIs, such as guanine and cytosine content (%G + C), dinucleotide bias, and codon usage preferences, differs from that of the host genome (Frost et al., 2005, Hacker and Kaper, 2000, Koonin et al., 2001). Of the several computational approaches developed for detecting GIs, some require pre-annotated information. For example, genes homologous to those in previously detected GIs can serve as markers of related GIs (Mantri and Williams, 2004, Ou et al., 2006, Yoon et al., 2007). Novel GIs are difficult to identify in this manner because they lack available genetic annotations or sequence homology to known GIs.
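Two of the compositional features named above, %G + C and dinucleotide bias, can be illustrated with a minimal Python sketch. It is not the method of any cited tool: the function names are hypothetical, and the dinucleotide measure used here is the common observed/expected odds ratio f(xy) / (f(x) f(y)), chosen for illustration.

```python
from collections import Counter


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T bases of seq."""
    counts = Counter(seq.upper())
    acgt = sum(counts[b] for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / acgt


def dinucleotide_odds(seq: str) -> dict:
    """Odds ratio f(xy) / (f(x) * f(y)) for every observed dinucleotide.

    Ambiguous bases are dropped first, so pairs spanning a dropped base
    are counted as adjacent (a simplification made for this sketch).
    """
    seq = "".join(b for b in seq.upper() if b in "ACGT")
    if len(seq) < 2:
        return {}
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = len(seq), len(seq) - 1
    return {
        pair: (count / n_di)
        / ((mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono))
        for pair, count in di.items()
    }


def gc_deviation(region: str, genome: str) -> float:
    """Signed difference between a region's GC fraction and the genome's."""
    return gc_fraction(region) - gc_fraction(genome)
```

A candidate region whose `gc_deviation` is far from zero, or whose odds ratios differ markedly from the genome-wide values, has the kind of atypical composition that composition-based detectors look for.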
Most other methodologies identify GIs from differences in sequence composition between foreign and native DNA (e.g., %G + C, dinucleotide frequencies, and amino acid and codon usage preferences), on the assumption that different organisms vary in their chromosomal compositional patterns (Hsiao et al., 2003, Lobry, 1996b, Mantri and Williams, 2004, Merkl, 2004, Nag et al., 2006, Rajan et al., 2007, Tu and Ding, 2003, van Passel et al., 2005). However, these methodologies are limited because, even within the same organism, different chromosomal regions might vary in sequence composition and gene expression level (Schmidt and Hensel, 2004). Other strategies adopt comparative genomic approaches and are independent of sequence compositional analysis. Closely related organisms are assumed to share higher sequence similarity and similar expression properties (Schmidt and Hensel, 2004). When multiple chromosomes of closely related organisms are compared, sequences with an unexpected phyletic distribution are therefore assumed to be horizontally transferred regions (Ragan, 2001). Comparison-based methods can identify GIs that composition-based methods cannot detect because the sequence composition of these GIs resembles that of the core chromosome.
GI-predicting methods can be applied to locate possible GI regions in a given microbial genome. Genomes of organisms associated with pathogenesis, antibiotic resistance, or other phenotypes of research interest can be sequenced by high-throughput sequencing. Over the past decade, high-throughput sequencing (or next-generation sequencing) has made significant progress in reducing sequencing cost and handling time. The genome of the target organism is first fragmented into billions of short DNA elements called short reads. These short reads are then sequenced by sequencers and assembled into longer sequences called contigs. Finally, genes and proteins are annotated using annotation software, such as Glimmer (Salzberg et al., 1998).
However, finishing an entire genome using only computational methods continues to be difficult. Experimental procedures, including optical mapping and polymerase chain reaction (PCR), are performed to ensure sequencing quality, and these steps increase cost and time. Furthermore, the present GI-predicting methods need whole genome sequences and require completed gene or protein annotations, so GIs can currently be found only in microorganisms whose genomes have been completed. This limitation suggests the need for a GI prediction and analysis method that can be used in ongoing genome projects.
In this paper, we developed an annotation pipeline for ongoing genome projects. This pipeline integrates functional annotation and GI prediction and can be used for analyzing incomplete genomes. It provides an annotation service to which researchers can submit their own draft genomes. The pipeline assembles and annotates the submitted sequences; its steps include contig/scaffold assembly, gene prediction, tRNA or other non-coding RNA prediction, Clusters of Orthologous Groups (COG) searching, and GI prediction. We believe that sequence compositional approaches continue to provide a strong foundation for developing GI detection methods. Given adequate length, any fragment of a GI should have a sequence composition that resembles the remaining portion of the GI and differs from that of the core chromosome. Based on this assumption, we developed GI prediction by Genome Profile Scanning (GI-GPS), a GI detection system that operates by scanning, filtering, and refining. Cross-validation on a published data set demonstrated the feasibility of using the genome profile and the prediction engine to distinguish between GI and non-GI sequences. Moreover, GI-GPS requires only one organism's genome, which is advantageous for identifying foreign DNA in newly sequenced organisms, especially novel ones with few known related species. GI-POP is also the first combinational annotation and GI-detecting Web server that provides pre-analytic information for ongoing genome projects.
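The scan, filter, and refine sequence can be pictured with a small Python sketch. This is not the GI-GPS implementation, which is not described in this excerpt. As stated assumptions, it scores windows by GC fraction alone, filters them with a z-score cutoff, and refines by merging overlapping windows; the function names, window size, step, and cutoff are all hypothetical.

```python
from statistics import mean, pstdev


def gc_fraction(seq):
    """GC fraction of a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def scan(genome, window, step):
    """Scanning: score each window, returning (start, end, gc) tuples."""
    last_start = max(len(genome) - window, 0)
    return [
        (start, min(start + window, len(genome)),
         gc_fraction(genome[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]


def filter_windows(windows, z_cutoff):
    """Filtering: keep windows whose score is an outlier among all windows."""
    if not windows:
        return []
    values = [gc for _, _, gc in windows]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # perfectly uniform composition, nothing stands out
    return [(s, e) for s, e, gc in windows if abs(gc - mu) / sigma >= z_cutoff]


def refine(candidates):
    """Refining: merge overlapping or touching windows into regions."""
    merged = []
    for start, end in sorted(candidates):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def predict_islands(genome, window=1000, step=500, z_cutoff=2.0):
    """Chain the three stages; all parameter defaults are illustrative."""
    return refine(filter_windows(scan(genome, window, step), z_cutoff))


# Toy genome: an AT-only backbone with a 2 kb GC-only insert at 4000.
toy = "AT" * 2000 + "GC" * 1000 + "AT" * 2000
print(predict_islands(toy))
```

On the toy genome, only the windows lying fully inside the insert pass the cutoff, and merging them recovers the insert as a single region. A real profile would combine several compositional features and a trained classifier rather than a single threshold.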
GI-POP: the annotation platform with GI detecting modules
GI-POP is a Web server designed for online functional annotation, gene prediction, non-coding RNA prediction, and GI prediction in ongoing genome projects. As shown in Fig. 1, genomic sequences, including contigs, scaffolds, and chromosomal sequences, are first assembled by the assembler of the do-it-yourself annotator (DIYA), an annotation package. Coding sequence (CDS) prediction, non-coding region prediction, and analysis are then performed. Several subroutines participate in this stage; for
Discussion
GIs, or segments of foreign DNA, cause morphological changes in their host microorganisms. Such changes are an impetus for microbial evolution. Elucidating how GIs and host organisms are related sheds light on the adaptations of microbes to their living environments. Studies have analyzed and compared sequence compositions to identify foreign DNA segments (Binnewies et al., 2006, Frost et al., 2005, Koonin et al., 2001). A training set is required to apply these compositional methods to detect
Data preparation
The sequence files of complete prokaryotic genomes were obtained from the National Center for Biotechnology Information (NCBI) FTP server [ftp://ftp.ncbi.nih.gov/refseq/release/complete/]. For SVM training, the 771 positive and 3700 negative GIs belonging to 118 bacterial chromosomes were downloaded from the IslandPick (Langille et al., 2008) Web site [http://www.pathogenomics.sfu.ca/islandpick_GI_datasets/].
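Because the training set is imbalanced (771 positive against 3700 negative examples), any cross-validation over it should keep the class ratio roughly constant in each fold. The Python sketch below shows one way to build such stratified folds with the standard library only. The fold count of 5, the seed, and the function name are assumptions made for illustration, not details from the study.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Return k lists of indices that each preserve the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Label vector with the class sizes reported for the data set.
labels = [1] * 771 + [0] * 3700
folds = stratified_folds(labels, k=5)

for i, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [j for j in range(len(labels)) if j not in held_out]
    n_pos = sum(labels[j] for j in test_idx)
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)} positives={n_pos}")
```

Each fold serves once as the held-out test set while a classifier, such as an SVM, is trained on the remaining indices.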
Databases and software packages used for annotation pipeline
The DIYA is a Perl-based package designed for microbial genome annotation and
Conclusions
In this study, we developed a combinational annotation and genomic island prediction pipeline for ongoing microbial genome projects (GI-POP). To the best of our knowledge, GI-POP is the first integrated genome annotation and GI prediction server that provides genome assembly, coding/non-coding region prediction, homologous gene searching, COG alignment, and genomic island detection. In many genome projects, the genomes are unfinished. For example, draft genomes are often composed of contigs
Conflict of interest statement
The authors declare that they have no competing interests.
Contributors
CCL, TJY and CYM designed and carried out this study. TJY designed and performed the synthetic draft genome experiment. CCL and WCL drafted the manuscript and analyzed the data. PCL, CYT, WCL and YPPC conceived the study, participated in its design and helped draft the manuscript.
The following are the supplementary data related to this article.
Acknowledgments
This work is funded by the National Science Council, Taiwan, R.O.C., under grant numbers 100-2221-E-126-010-MY3 and 101-2319-B-400-001.
References (36)
Deletions of chromosomal regions coding for fimbriae and hemolysins occur in vitro and in vivo in various extraintestinal Escherichia-coli isolates
Microb. Pathog. (1990)
Transfer RNAs and pathogenicity islands
Trends Biochem. Sci. (1999)
Detection of lateral gene transfer among microbial genomes
Curr. Opin. Genet. Dev. (2001)
EMBOSS: the European molecular biology open software suite
Trends Genet. (2000)
Detecting pathogenicity islands and anomalous gene clusters by iterative discriminant analysis
FEMS Microbiol. Lett. (2003)
Ten years of bacterial genome sequencing: comparative-genomics-based discoveries
Funct. Integr. Genomics (2006)
Identifying unknown game species: experience with nucleotide sequencing of the mitochondrial cytochrome b gene and a subsequent basic local alignment search tool search
Eur. Food Res. Technol. (2001)
LIBSVM: a library for support vector machines
ACM Trans. Intell. Syst. Technol. (2011)
Using the Generic Genome Browser (GBrowse)
Curr. Protoc. Bioinformatics (2007)
Mobile genetic elements: the agents of open source evolution
Nat. Rev. Microbiol. (2005)
Pathogenicity islands and the evolution of microbes
Annu. Rev. Microbiol.
IslandPath: aiding detection of genomic islands in prokaryotes
Bioinformatics
Horizontal gene transfer in prokaryotes: quantification and classification
Annu. Rev. Microbiol.
Circos: an information aesthetic for comparative genomics
Genome Res.
RNAmmer: consistent and rapid annotation of ribosomal RNA genes
Nucleic Acids Res.
Evaluation of genomic island predictors using a comparative genomics approach
BMC Bioinformatics
An efficient algorithm for unique signature discovery on whole-genome EST Databases
ACLAME: a classification of mobile genetic elements, update 2010
Nucleic Acids Res.