Plastid Genome Assembly Using Long-read data

Wenbin Zhou; Carolina E Armijos; Chaehee Lee; Ruisen Lu; Jeremy Wang; Tracey A Ruhlman; Robert K Jansen; Alan M Jones; Corbin D Jones

doi:10.1111/1755-0998.13787

Mol Ecol Resour. 2023 Aug;23(6):1442-1457. doi: 10.1111/1755-0998.13787. Epub 2023 Apr 2.

Authors

Wenbin Zhou¹, Carolina E Armijos², Chaehee Lee³, Ruisen Lu⁴, Jeremy Wang⁵, Tracey A Ruhlman⁶, Robert K Jansen⁶, Alan M Jones¹,⁷, Corbin D Jones¹,⁵

Affiliations

¹ Department of Biology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA.
² Laboratorio de Biotecnología Vegetal, Universidad San Francisco de Quito USFQ, Quito, Ecuador.
³ Department of Plant Sciences, University of California Davis, Davis, California, USA.
⁴ Institute of Botany, Jiangsu Province and Chinese Academy of Sciences, Nanjing, China.
⁵ Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA.
⁶ Department of Integrative Biology, University of Texas at Austin, Austin, Texas, USA.
⁷ Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA.

Abstract

Although plastid genome (plastome) structure is highly conserved across most seed plants, investigations during the past two decades have revealed several disparately related lineages that experienced substantial rearrangements. Most plastomes contain a large inverted repeat and two single-copy regions, and a few dispersed repeats; however, the plastomes of some taxa harbour long repeat sequences (>300 bp). These long repeats make it challenging to assemble complete plastomes using short-read data, leading to misassemblies and consensus sequences with spurious rearrangements. Single-molecule, long-read sequencing has the potential to overcome these challenges, yet there is no consensus on the most effective method for accurately assembling plastomes using long-read data. We generated a pipeline, plastid Genome Assembly Using Long-read data (ptGAUL), to address the problem of plastome assembly using long-read data from Oxford Nanopore Technologies (ONT) or Pacific Biosciences platforms. We demonstrated the efficacy of the ptGAUL pipeline using 16 published long-read data sets. We showed that ptGAUL quickly produces accurate and unbiased assemblies using only ~50× coverage of plastome data. Additionally, we deployed ptGAUL to assemble four new Juncus (Juncaceae) plastomes using ONT long reads. Our results revealed many long repeats and rearrangements in Juncus plastomes compared with basal lineages of Poales. The ptGAUL pipeline is available on GitHub: https://github.com/Bean061/ptgaul.
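To make the ~50× coverage figure concrete, the sketch below shows the standard depth arithmetic: mean coverage is the total number of sequenced bases divided by the genome length. It is not part of ptGAUL. The function names are hypothetical, and the 150 kb plastome size and 10 kb mean read length are illustrative assumptions, not values reported in the abstract.

```python
import math


def mean_coverage(read_lengths, genome_size):
    """Mean depth = total sequenced bases / genome length."""
    return sum(read_lengths) / genome_size


def reads_needed(genome_size, target_coverage, mean_read_length):
    """Smallest number of reads of a given mean length that reaches the target depth."""
    return math.ceil(genome_size * target_coverage / mean_read_length)


# Assumed, illustrative values (not taken from the paper):
PLASTOME_SIZE = 150_000    # bp, hypothetical plastome length
MEAN_READ_LENGTH = 10_000  # bp, hypothetical mean long-read length

n_reads = reads_needed(PLASTOME_SIZE, 50, MEAN_READ_LENGTH)
depth = mean_coverage([MEAN_READ_LENGTH] * n_reads, PLASTOME_SIZE)
print(n_reads, depth)
```

Under these assumptions, 50× depth corresponds to 7.5 Mb of plastid-derived sequence, a small fraction of a typical long-read run.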

Keywords: Juncus; Juncaceae; Poales; chloroplast; long-read assembly; rearrangement events.

MeSH terms

Gene Rearrangement
Genome, Plastid*
High-Throughput Nucleotide Sequencing / methods
Plastids / genetics
Repetitive Sequences, Nucleic Acid
Sequence Analysis, DNA / methods

Grants and funding

K01 DK119582/DK/NIDDK NIH HHS/United States