Comparative Annotation Toolkit (CAT)—simultaneous clade and personal genome annotation

  1. Benedict Paten (1)
  1. Genomics Institute, University of California Santa Cruz and Howard Hughes Medical Institute, Santa Cruz, California 95064, USA;
  2. 10x Genomics, Pleasanton, California 94566, USA;
  3. Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany;
  4. Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA;
  5. Pacific Biosciences of California, Incorporated, Menlo Park, California 94025, USA;
  6. Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA;
  7. European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
  8. These authors contributed equally to this work.

  • Corresponding authors: bpaten@ucsc.edu, ian.t.fiddes@gmail.com
  • Abstract

    The recent introductions of low-cost, long-read, and read-cloud sequencing technologies coupled with intense efforts to develop efficient algorithms have made affordable, high-quality de novo sequence assembly a realistic proposition. The result is an explosion of new, ultracontiguous genome assemblies. To compare these genomes, we need robust methods for genome annotation. We describe the fully open source Comparative Annotation Toolkit (CAT), which provides a flexible way to simultaneously annotate entire clades and identify orthology relationships. We show that CAT can be used to improve annotations on the rat genome, annotate the great apes, annotate a diverse set of mammals, and annotate personal, diploid human genomes. We demonstrate the resulting discovery of novel genes, isoforms, and structural variants—even in genomes as well studied as rat and the great apes—and how these annotations improve cross-species RNA expression experiments.

    Footnotes

    • Received December 7, 2017.
    • Accepted May 3, 2018.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
