Abstract
Reference-quality genomes are expected to provide a resource for studying gene structure and function. However, genes of interest are often not completely or accurately assembled, leading to unknown errors in analyses or to additional cloning efforts to obtain the correct sequences. A promising solution to this problem is long-read sequencing. Here we tested PacBio-based long-read sequencing and diploid assembly for potential improvements to the Sanger-based intermediate-read zebra finch reference and the Illumina-based short-read Anna’s hummingbird reference, which represent two vocal learning avian species widely studied in neuroscience and genomics. Using DNA from the same individuals used to generate the reference genomes, we generated diploid assemblies with the FALCON-Unzip assembler, yielding gap-free contigs in the megabase range (N50s of 5.4 and 7.7 Mb, respectively) and representing 150-fold and 200-fold improvements over the current zebra finch and hummingbird references, respectively. These long-read assemblies corrected and resolved what we discovered to be misassemblies, including those due to erroneous sequences flanking gaps, complex repeat structure errors in the references, base call errors in difficult-to-sequence regions, and inaccurate resolution of allelic differences between the two haplotypes. We analyzed protein-coding genes widely studied in neuroscience and specialized in vocal learning species, and found numerous assembly and sequence errors in the reference genes that the PacBio-based assemblies resolved completely, as validated by single long genomic reads and transcriptome reads. These findings demonstrate, for the first time in non-human vocal learning species, the impact of higher-quality, phased, and gapless assemblies on understanding gene structure and function.
Abbreviations
- A1-L4: primary auditory cortex – layer 4
- Am: nucleus ambiguus
- Area X: a vocal nucleus in the striatum
- aSt: anterior striatum vocal region
- aT: anterior thalamus speech area
- Av: avalanche
- aDLM: anterior dorsolateral nucleus of the thalamus
- DM: dorsal medial nucleus of the midbrain
- HVC: a vocal nucleus (no abbreviation)
- L2: auditory area similar to human cortex layer 4
- LSC: laryngeal somatosensory cortex
- LMC: laryngeal motor cortex
- MAN: magnocellular nucleus of the anterior nidopallium
- MO: oval nucleus of the anterior mesopallium
- NIf: interfacial nucleus of the nidopallium
- PAG: peri-aqueductal gray
- RA: robust nucleus of the arcopallium
- v: ventricle space