RT Journal Article
SR Electronic
T1 Closing Target Trimming: a Perl Package for Discovering Hidden Superfamily Loci in Genomes
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 490490
DO 10.1101/490490
A1 Zhihua Hua
A1 Matthew J. Early
YR 2018
UL http://biorxiv.org/content/early/2018/12/07/490490.abstract
AB The contemporary capacity of genome sequence analysis lags significantly behind rapidly evolving sequencing technologies. Retrieving biologically meaningful information from an ever-increasing amount of genome data would significantly benefit functional genomic studies. For example, the duplication, organization, evolution, and function of superfamily genes are arguably important in many aspects of life. However, the incompleteness of annotations in many sequenced genomes often results in biased conclusions in comparative genomic studies of superfamilies. Here, we present a Perl package, called Closing Target Trimming, for automatically identifying most, if not all, members of a gene family in any sequenced genome. Our test data on the F-box gene superfamily showed gene-finding accuracies of 78.2% and 79% in two well-annotated plant genomes, Arabidopsis thaliana and rice, respectively. This annotation performance is clearly higher than that of the best ab initio methods currently available. To further demonstrate the effectiveness of this program, we ran it on 18 plant genomes and five non-plant genomes to compare the expansion of the F-box and the BTB superfamilies. The program discovered that, on average, 12.7% and 9.3% of the total F-box and BTB members, respectively, are new loci in plant genomes, whereas it found only a small number of new members in vertebrate genomes. Therefore, different evolutionary and regulatory mechanisms of cullin-RING ubiquitin ligases may be present in the plant and the animal kingdoms.
Further studies may shed new light on the ubiquitin-26S proteasome system-mediated regulatory pathways in eukaryotic organisms. With detailed compiling instructions and a simple running procedure, we expect that this software will help many biologists with little programming experience to readily obtain a comprehensive dataset of a gene superfamily in any sequenced eukaryotic genome.