GI-POP: A combinational annotation and genomic island prediction pipeline for ongoing microbial genome projects

doi:10.1016/j.gene.2012.11.063

Gene

Volume 518, Issue 1, 10 April 2013, Pages 114-123

https://doi.org/10.1016/j.gene.2012.11.063

Abstract

Sequencing of microbial genomes is important because microbes carry antibiotic and pathogenic activities. However, even with the help of new assembly software, finishing a whole genome is a time-consuming task. In most bacteria, pathogenic or antibiotic genes are carried in genomic islands. Therefore, a quick genomic island (GI) prediction method is useful for genomes whose sequencing is still ongoing. In this work, we built a Web server called GI-POP (http://gipop.life.nthu.edu.tw) that integrates a sequence assembly tool, a functional annotation pipeline, and a high-performance GI-predicting module based on a support vector machine (SVM) method called genomic island genomic profile scanning (GI-GPS). Draft genomes from ongoing genome projects, in the form of contigs or scaffolds, can be submitted to the Web server, which returns functional annotation and highly probable GI predictions. GI-POP is a comprehensive annotation Web server designed for the analysis of ongoing genome projects. Researchers can perform annotation and obtain pre-analytic information from their draft genomes, including possible GIs, coding/non-coding sequences, and functional analysis. This pre-analytic system can provide useful information for finishing a genome sequencing project.

Introduction

In research on pathogenesis and drug resistance, the relevant genes and associated elements have been found to cluster in particular chromosomal regions (Hacker et al., 1990). These DNA segments can often move and integrate into other bacterial genomes through horizontal gene transfer (HGT), which commonly occurs among microorganisms (Binnewies et al., 2006, Frost et al., 2005, Hacker and Kaper, 2000, Koonin et al., 2001, Mantri and Williams, 2004, Ou et al., 2006). The incorporated foreign DNA segments often have tRNA genes or repeated sequences at their boundaries (Hsiao et al., 2003, Lobry, 1996a, Yoon et al., 2007). Collectively referred to as genomic islands (GIs), these foreign DNA segments typically confer medically relevant or environmentally adaptive traits, and range from 5 to 500 kb in length (Schmidt and Hensel, 2004). For instance, secretion islands encode supplementary secretion systems, metabolic islands carry genes for secondary metabolism, and resistance islands bring antibiotic resistance to the host bacteria (Hacker and Kaper, 2000). The sequence composition of GIs, such as guanine and cytosine content (%G + C), dinucleotide bias, and codon usage preferences, differs from that of the host genome (Frost et al., 2005, Hacker and Kaper, 2000, Koonin et al., 2001). Of the several computational approaches developed for detecting GIs, some require pre-annotated information. For example, genes homologous to those found in previously detected GIs can serve as markers of related GIs (Mantri and Williams, 2004, Ou et al., 2006, Yoon et al., 2007). Novel GIs are difficult to identify in this manner because of the lack of available genetic annotations or sequence homologies to known GIs.
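As a rough illustration of the compositional signals mentioned above, the following Python sketch computes %G + C and dinucleotide frequencies, and reports %G + C along a sequence in fixed windows. It is only a minimal example under stated assumptions: the function names, the window size, and the toy sequence are hypothetical and are not taken from GI-POP or any cited tool.

```python
from collections import Counter
from itertools import product


def gc_percent(seq: str) -> float:
    """Return %G + C of a DNA sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def dinucleotide_frequencies(seq: str) -> dict[str, float]:
    """Return relative frequencies of the 16 overlapping dinucleotides."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + 2] for i in range(len(seq) - 1)
        if set(seq[i:i + 2]) <= set("ACGT")
    )
    total = sum(counts.values())
    keys = ["".join(p) for p in product("ACGT", repeat=2)]
    return {k: (counts[k] / total if total else 0.0) for k in keys}


def windowed_gc(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """Return (start, %G + C) for each full window along the sequence."""
    return [
        (start, gc_percent(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence (hypothetical): an AT-rich "host" with a GC-rich insert.
    toy = "AT" * 20 + "GC" * 10 + "AT" * 20
    for start, gc in windowed_gc(toy, window=20, step=20):
        print(start, f"{gc:.1f}")
```

In the toy sequence, only the window covering the GC-rich insert stands out from its neighbours, which is the kind of local deviation that composition-based approaches look for. Real genomes are far noisier, so a single statistic such as %G + C is rarely sufficient on its own.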
Most other methodologies identify GIs from compositional differences between foreign and native DNA, such as %G + C, dinucleotide frequencies, and amino acid and codon usage preferences, on the assumption that different organisms vary in their chromosomal compositional patterns (Hsiao et al., 2003, Lobry, 1996b, Mantri and Williams, 2004, Merkl, 2004, Nag et al., 2006, Rajan et al., 2007, Tu and Ding, 2003, van Passel et al., 2005). However, these methodologies are limited because, even within the same organism, different chromosomal regions may vary in sequence composition and gene expression level (Schmidt and Hensel, 2004). Other strategies adopt comparative genomic approaches and are independent of sequence compositional analysis. Closely related organisms are assumed to share higher sequence similarities and expression properties (Schmidt and Hensel, 2004), so when multiple chromosomes of closely related organisms are compared, sequences with an unexpected phyletic distribution are assumed to be horizontally transferred regions (Ragan, 2001). Comparison-based methods can identify GIs that composition-based methods cannot detect because the sequence compositions of these GIs resemble those of the core chromosomes.

GI-predicting methods can be applied to locate possible GI regions in a given microbial genome. Genomes of organisms showing pathogenesis, antibiotic resistance, or other phenotypes of research interest can be sequenced with high-throughput sequencing. Over the past decade, high-throughput sequencing (or next-generation sequencing) has made significant progress in reducing sequencing cost and handling time. The genome of the target organism is first separated into billions of short DNA elements called short reads. These short reads are then sequenced by sequencers and assembled into longer sequences called contigs. Finally, genes and proteins are annotated using annotation software, such as Glimmer (Salzberg et al., 1998).

However, finishing an entire genome using only computational methods continues to be difficult. Experimental procedures, including optical mapping and polymerase chain reaction (PCR), are performed to guarantee sequencing quality, and these increase cost and time. Furthermore, present GI-predicting methods need whole genome sequences and require completed gene or protein annotations, so finding GIs in microorganisms can only be achieved once a completed genome has been obtained. This observation suggests the need for a GI prediction and analysis method that can be used in ongoing genome projects.

In this paper, we developed an annotation pipeline for ongoing genome projects. This pipeline integrates functional annotation and GI prediction and can be used to analyze incomplete genomes. It provides an annotation service to which researchers can submit their own draft genomes. The pipeline assembles and annotates the submitted sequences, covering contig/scaffold assembly, gene prediction, tRNA and other non-coding RNA prediction, Clusters of Orthologous Groups (COG) searching, and GI prediction. We believe that sequence compositional approaches continue to provide a strong foundation for developing GI detection methods. Given an adequate length, any fragment of a GI should have a sequence composition that resembles the remaining portion of the GI and differs from the core chromosome. Based on this assumption, we developed GI prediction by Genome Profile Scanning (GI-GPS), a GI detection system that operates by scanning, filtering, and refining. Cross-validation on a published data set demonstrated the feasibility of using the genome profile and the prediction engine to distinguish between GI and non-GI sequences. Moreover, GI-GPS requires only one organism's genome, which is advantageous for identifying foreign DNA in newly sequenced organisms, especially novel ones with few known related species. Finally, GI-POP is the first combinational annotation and GI-detecting Web server to provide pre-analytic information for ongoing genome projects.
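The scanning, filtering, and refining stages named above can be pictured with a small Python sketch. This is not the GI-GPS implementation: the source describes an SVM-based predictor, whereas the sketch below substitutes a plain Euclidean distance between a window's dinucleotide profile and the whole-sequence profile. The function names, window size, and threshold are all hypothetical choices made for illustration.

```python
from itertools import product
from math import sqrt

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def profile(seq: str) -> list[float]:
    """Dinucleotide relative-frequency vector (16 values) for a sequence."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCS, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DINUCS]


def distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def scan(genome: str, window: int, step: int) -> list[tuple[int, int, float]]:
    """Scanning: score each window by its distance from the genome profile."""
    background = profile(genome)
    return [
        (s, s + window, distance(profile(genome[s:s + window]), background))
        for s in range(0, len(genome) - window + 1, step)
    ]


def filter_windows(scored, threshold: float) -> list[tuple[int, int]]:
    """Filtering: keep windows scoring above a (hypothetical) threshold."""
    return [(s, e) for s, e, score in scored if score > threshold]


def refine(windows: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Refining: merge overlapping or adjacent windows into candidate regions."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # Toy input (hypothetical): AT-rich background with a GC-rich insert.
    genome = "AT" * 200 + "GC" * 50 + "AT" * 200
    candidates = refine(filter_windows(scan(genome, 50, 50), threshold=0.5))
    print(candidates)
```

Because each window is scored only against the profile of the submitted sequence itself, a scheme of this shape needs no second genome, which matches the single-genome property claimed for GI-GPS. A trained classifier would replace the fixed threshold in the filtering step.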

Section snippets

GI-POP: the annotation platform with GI detecting modules

GI-POP is a Web server designed for online functional annotation, gene prediction, non-coding RNA prediction, and GI prediction in ongoing genome projects. As shown in Fig. 1, genomic sequences, including contigs, scaffolds, and chromosomal sequences, are first assembled using the do-it-yourself annotator (DIYA), an annotation package. Coding sequence (CDS) prediction, non-coding region prediction, and analysis are then performed. Several subroutines participate in this stage; for

Discussion

GIs, or segments of foreign DNA, cause morphological changes in their host microorganisms. Such changes are an impetus for microbial evolution. Elucidating how GIs and host organisms are related sheds light on the adaptation of microbes to their living environments. Studies have analyzed and compared sequence compositions to identify foreign DNA segments (Binnewies et al., 2006, Frost et al., 2005, Koonin et al., 2001). A training set is required to apply these compositional methods to detect

Data preparation

The sequence files of complete prokaryotic genomes were obtained from the National Center for Biotechnology Information (NCBI) FTP server [ftp://ftp.ncbi.nih.gov/refseq/release/complete/]. For SVM training, the 771 positive and 3700 negative GIs belonging to 118 bacterial chromosomes were downloaded from the IslandPick (Langille et al., 2008) Web site [http://www.pathogenomics.sfu.ca/islandpick_GI_datasets/].
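The training set described above is imbalanced (771 positive versus 3700 negative examples), and the supplementary ROC figure reports 5-fold cross-validation. The source does not say how the folds were formed, so the Python sketch below simply assumes a stratified split that keeps the class ratio similar in every fold. The function name and seed are hypothetical.

```python
import random


def stratified_folds(labels: list[int], k: int, seed: int = 0) -> list[list[int]]:
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds: list[list[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


if __name__ == "__main__":
    # Class sizes taken from the text: 771 positive and 3700 negative GIs.
    labels = [1] * 771 + [0] * 3700
    for n, fold in enumerate(stratified_folds(labels, k=5)):
        positives = sum(labels[i] for i in fold)
        print(n, len(fold), positives)
```

Each fold then serves once as the held-out test set while the remaining four are used for training. Without stratification, a random split of an imbalanced set can leave some folds with noticeably fewer positives than others.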

Databases and software packages used for annotation pipeline

The DIYA is a Perl-based package designed for microbial genome annotation and

Conclusions

In this study, we developed a combinational annotation and genomic island prediction pipeline for ongoing microbial genome projects (GI-POP). To the best of our knowledge, GI-POP is the first integrated genome annotation and GI prediction server that provides genome assembly, coding/non-coding region prediction, homologous gene searching, COG alignment, and genomic island detection. In many genome projects, the genomes are unfinished. For example, draft genomes are often composed of contigs

Conflict of interest statement

The authors declare that they have no competing interests.

Contributors

CCL, TJY and CYM designed and carried out this study. TJY designed and performed the synthetic draft genome experiment. CCL and WCL drafted the manuscript and analyzed the data. PCL, CYT, WCL and YPPC conceived the study, participated in its design and helped draft the manuscript.

The following are the supplementary data related to this article.

The ROC curve of 5-fold cross-validations. The AUC is 0.9343. The x-axis is the false positive rate and the y-axis is the true positive rate.
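For readers unfamiliar with the quantities in this caption, the Python sketch below shows one common way to obtain ROC points and the area under the curve (AUC) from classifier scores. It is a generic illustration, not the authors' evaluation code. The function names and the toy labels and scores are hypothetical.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def roc_points(labels: list[int], scores: list[float]) -> list[tuple[float, float]]:
    """(false positive rate, true positive rate) pairs, one per distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), reverse=True)
    for i, (score, y) in enumerate(ranked):
        if y == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


if __name__ == "__main__":
    # Toy data (hypothetical): 1 = GI, 0 = non-GI, with classifier scores.
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    print(roc_points(labels, scores))
    print(roc_auc(labels, scores))
```

An AUC of 1.0 means every positive example scores above every negative one, while 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of examples, so a rank-based formula is preferable for large evaluation sets.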

Acknowledgments

This work is funded by the National Science Council, Taiwan, R.O.C., under grant numbers 100-2221-E-126-010-MY3 and 101-2319-B-400-001.

References (36)

J. Hacker
Deletions of chromosomal regions coding for fimbriae and hemolysins occur in vitro and in vivo in various extraintestinal Escherichia-coli isolates
Microb. Pathog.
(1990)

Y.M. Hou
Transfer RNAs and pathogenicity islands
Trends Biochem. Sci.
(1999)

M.A. Ragan
Detection of lateral gene transfer among microbial genomes
Curr. Opin. Genet. Dev.
(2001)

P. Rice et al.
EMBOSS: the European molecular biology open software suite
Trends Genet.
(2000)

Q. Tu et al.
Detecting pathogenicity islands and anomalous gene clusters by iterative discriminant analysis
FEMS Microbiol. Lett.
(2003)

T.T. Binnewies
Ten years of bacterial genome sequencing: comparative-genomics-based discoveries
Funct. Integr. Genomics
(2006)

P.D. Brodmann et al.
Identifying unknown game species: experience with nucleotide sequencing of the mitochondrial cytochrome b gene and a subsequent basic local alignment search tool search
Eur. Food Res. Technol.
(2001)

C.-C. Chang et al.
LIBSVM: a library for support vector machines
ACM Trans. Intell. Syst. Technol.
(2011)

M.J. Donlin
Using the Generic Genome Browser (GBrowse)
Curr. Protoc. Bioinformatics
(2007)

L.S. Frost et al.
Mobile genetic elements: the agents of open source evolution
Nat. Rev. Microbiol.
(2005)

J. Hacker et al.
Pathogenicity islands and the evolution of microbes
Annu. Rev. Microbiol.
(2000)

W. Hsiao et al.
IslandPath: aiding detection of genomic islands in prokaryotes
Bioinformatics
(2003)

E.V. Koonin et al.
Horizontal gene transfer in prokaryotes: quantification and classification
Annu. Rev. Microbiol.
(2001)

M. Krzywinski
Circos: an information aesthetic for comparative genomics
Genome Res.
(2009)

K. Lagesen et al.
RNAmmer: consistent and rapid annotation of ribosomal RNA genes
Nucleic Acids Res.
(2007)

M.G.I. Langille et al.
Evaluation of genomic island predictors using a comparative genomics approach
BMC Bioinformatics
(2008)

H.P. Lee et al.
An efficient algorithm for unique signature discovery on whole-genome EST Databases

R. Leplae et al.
ACLAME: a classification of mobile genetic elements, update 2010
Nucleic Acids Res.
(2010)
