PT - JOURNAL ARTICLE
AU - Kristoffer Sahlin
AU - Marta Tomaszkiewicz
AU - Kateryna D. Makova
AU - Paul Medvedev
TI - IsoCon: Deciphering highly similar multigene family transcripts from Iso-Seq data
AID - 10.1101/246066
DP - 2018 Jan 01
TA - bioRxiv
PG - 246066
4099 - http://biorxiv.org/content/early/2018/01/10/246066.short
4100 - http://biorxiv.org/content/early/2018/01/10/246066.full
AB - A significant portion of genes in vertebrate genomes belongs to multigene families, with each family containing several gene copies whose presence/absence can be highly variable across individuals. For example, each Y chromosome ampliconic gene family harbors several nearly identical (up to 99.99% identity) gene copies. Existing de novo techniques for assaying the sequences of such highly similar gene families fall short of reconstructing end-to-end transcripts with nucleotide-level precision or assigning them to their respective gene copies. We present IsoCon, a novel approach that combines experimental and computational techniques that leverage the power of long PacBio Iso-Seq reads to determine the full-length transcripts of highly similar multicopy gene families. IsoCon uses a cautiously iterative process to correct errors, followed by a statistical framework that allows it to distinguish errors from true variants with high precision. IsoCon outperforms existing methods for transcriptome analysis of Y ampliconic gene families in both simulated and real human data and is able to detect rare transcripts that differ by as little as one base pair from much more abundant transcripts. IsoCon has allowed us to detect an unprecedented number of novel isoforms, as well as to derive estimates of the number of gene copies in human Y ampliconic gene families.