Abstract
Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm, which merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show that it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for metassembly by systematically evaluating all 120 permutations of the order in which the top 5 assemblies of the first Assemblathon competition can be merged. The software is open-source and available at http://metassembler.sourceforge.net.
Abbreviations
- CEGMA -
- Core Eukaryotic Genes Mapping Approach
- CE-statistic -
- Compression/Expansion statistic
- Comp Ref Bases -
- Compressed Reference Bases
- Ctg NG50 -
- Contig N50 size relative to the estimated/reference genome size
- Ctg GC-NG50 -
- Contig GAGE Corrected N50, relative to the reference genome size
- Ctg RC-NG50 -
- Contig REAPR Corrected N50, relative to the estimated genome size
- Dup Ref Bases -
- Duplicated Reference Bases
- GAGE -
- Genome Assembly Gold Standard Evaluation
- GAM-NGS -
- Genomic Assemblies Merger for Next Generation Sequencing
- ICA -
- Independent Component Analysis
- PCA -
- Principal Components Analysis
- REAPR -
- Recognising Errors in Assemblies using Paired Reads
- Scf NG50 -
- Scaffold N50 size relative to the estimated/reference genome size
- Scf GC-NG50 -
- Scaffold GAGE Corrected N50, relative to the reference genome size
- Scf RC-NG50 -
- Scaffold REAPR Corrected N50, relative to the estimated genome size
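Several of the abbreviations above are NG50 variants, so a small worked example may help. The sketch below illustrates the standard NG50 definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the genome size. It is not code from the metassembler software. The function name `ng50` and the toy contig lengths are hypothetical, and returning 0 when half the genome is never covered is an assumed convention.

```python
def ng50(lengths, genome_size):
    """Return the NG50 of a set of sequence lengths.

    Sequences are taken from longest to shortest until their cumulative
    length reaches half of genome_size; the length of the last sequence
    taken is the NG50. Returns 0 if the sequences never cover half of
    genome_size (an assumed convention for this sketch).
    """
    half = genome_size / 2
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0


# Hypothetical contig lengths (in bases), for illustration only.
contigs = [80, 70, 50, 40, 30, 20]

# NG50 uses the estimated or reference genome size as the denominator.
contig_ng50 = ng50(contigs, genome_size=400)

# Plain N50 is the same calculation with the assembly's own total length.
contig_n50 = ng50(contigs, genome_size=sum(contigs))

print(contig_ng50, contig_n50)
```

Because NG50 is normalized by the genome size rather than by each assembly's own total length, it can be compared across assemblies of different sizes. The corrected variants (GC-NG50, RC-NG50) apply the same calculation after sequences have been broken at errors reported by GAGE or REAPR.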
Copyright
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.