Comparative genome analysis revealed gene inversions, boundary expansion and contraction, and gene loss in Stemona sessilifolia (Miq.) Miq. chloroplast genome

Stemona sessilifolia (Miq.) Miq., commonly known as Baibu, is one of the most popular herbal medicines in Asia. In the Chinese Pharmacopoeia, Baibu has multiple authentic sources, and many homonymous herbs are sold as Baibu in the herbal medicine market. The existence of these counterfeits makes Baibu difficult to identify. To assist the accurate identification of Baibu, we sequenced and analyzed the complete chloroplast genome of Stemona sessilifolia using next-generation sequencing technology. The genome was 154,039 bp in length and possessed a typical quadripartite structure consisting of a pair of inverted repeats (IRs: 27,094 bp) separated by a large single copy (LSC: 81,950 bp) and a small single copy (SSC: 17,901 bp) region. A total of 112 unique genes were identified, including 80 protein-coding, 28 transfer RNA, and four ribosomal RNA genes. In addition, 45 tandem, 27 forward, 23 palindromic, and 72 simple sequence repeats were detected in the genome by repeat analysis. Compared with its counterfeits (Asparagus officinalis and Carludovica palmata), we found that IR expansion and SSC contraction events in Stemona sessilifolia resulted in two copies of the rpl22 gene in the IR regions and partial duplication of the ndhF gene in the SSC region. Secondly, an approximately 3-kb-long inversion was identified in the LSC region, placing the petA and cemA genes on the complementary strand of the chloroplast DNA molecule. Comparative analysis revealed several highly variable regions, including trnF-GAA_ndhJ, atpB_rbcL, rps15_ycf1, trnG-UCC_trnR-UCU, and ndhF_rpl32. Finally, gene loss events were investigated in the context of phylogenetic relationships. In summary, the complete plastome of Stemona sessilifolia will provide valuable information for the molecular identification of Baibu and assist in elucidating the evolution of Stemona sessilifolia.


Introduction
On the other hand, multiple authentic sources and homonyms also increase the difficulty of identifying Baibu. In some areas of China, another herbal medicine, Aconitum kusnezoffii Rchb., is also called Baibu. However, the therapeutic activity of Aconitum kusnezoffii is significantly different from that of the authentic sources of Baibu described in the Chinese Pharmacopoeia. Studies have even reported that Aconitum kusnezoffii may be toxic when taken in large quantities [9].

Besides, counterfeits in the herbal market also bring challenges to the exact identification of Baibu. Because their morphological features are similar to those of the authentic sources of Baibu, many counterfeits such as Asparagus officinalis, Asparagus filicinus, and Asparagus acicularis are frequently sold as Baibu in the herbal market [10]. Therefore, the exact identification of the origin of Baibu is critical for its usage as a medicinal herb.

DNA barcoding has been deemed a more efficient and effective method for identifying plant species than morphological characteristics. Typical barcodes such as ITS, psbA-trnH, matK, and rbcL have been used to distinguish different plant species [11][12][13]. However, these DNA barcodes do not always work effectively, especially when distinguishing closely related plant species. This may be because single-locus DNA barcodes lack adequate variation among closely related taxa. Compared with DNA barcodes, the chloroplast genome provides more abundant genetic information and higher resolution in identifying plant species. Some researchers have proposed using the chloroplast genome as a species-level DNA barcode [14,15].

The chloroplast is an organelle present in almost all green plants. It    2-parameters (K2p) evolution model [36]. We attempted to discover highly divergent regions for the

We then carried out a multi-scale comparative genome analysis of these three chloroplast genomes from four aspects, including the size, the guanine-cytosine (GC) content, the count of genes, and the gene organization (

and C. palmata, respectively. All three chloroplast genomes have 28 tRNAs and four rRNAs. The number of genes with introns in each species is 18, similar to previous reports [39]. Therefore, we may conclude that no intron loss events have occurred in the chloroplast genomes of these three species. All the genes with introns are described in Table S2. In addition, 21, 22, and 18 genes were predicted in the IR regions of S. sessilifolia, A. officinalis, and C. palmata, respectively.

The gene organizations were compared in

The number of the repeat sequences for C. palmata was 45, 33, and 17, respectively.

There are significant differences in the types of repeat sequences among S. sessilifolia, A. officinalis, and C. palmata. The repeat occurrence in S. sessilifolia was similar to that of A. officinalis but significantly higher than that of C. palmata. It should be noted that the chloroplast genomes of A. officinalis and C. palmata are larger than that of S. sessilifolia. Therefore, the relatively larger size of the chloroplast genomes of A. officinalis and C. palmata does not result from repeat sequences.
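As an illustration of how simple sequence repeats (SSRs) of the kind counted above can be located, the following is a minimal sketch using regular expressions. It is not the tool used in this study. The minimum repeat counts in `MIN_REPEATS` and the function name `find_ssrs` are assumptions chosen for illustration only, and real SSR tools apply additional rules (for example, for compound repeats).

```python
import re

# Minimum number of repeat units per motif length (1 = mono, 2 = di, ...).
# These thresholds are ASSUMED for illustration; they are not taken from the paper.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return a list of (1-based start, motif, repeat count) for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(2)
            # Skip motifs that are just a shorter unit repeated (e.g. "AA" or "ATAT"),
            # so that each repeat is reported once under its shortest motif.
            is_redundant = any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            )
            if is_redundant:
                continue
            hits.append((match.start() + 1, motif, len(match.group(1)) // unit_len))
    return hits
```

Applied to each chloroplast genome in turn, a function of this kind yields per-species SSR counts that can be compared directly; the absolute numbers depend strongly on the thresholds chosen.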

Sequence divergence analysis
To evaluate the genome sequence divergence, we aligned sequences from four species using mVISTA [34] (Fig 3). The chloroplast genome of S. sessilifolia was significantly different from A.
The two most divergent regions were the ycf4-psbJ region (red square A) and the rpl22 coding region (red square B). We suspected that this divergence might result from gene loss or genome rearrangement events; the detailed reasons are discussed later. The ycf1 gene is also highly divergent, possibly owing to pseudogenization. In summary, the LSC region showed the highest divergence, followed by the SSC region, and the IR regions were less divergent than the LSC and SSC regions. Compared with the coding regions, the intergenic spacers displayed higher divergence.

Highly divergent regions often assist in the development of molecular markers, because regions with higher MK2p values are likely candidates for high-resolution molecular markers.
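To make the divergence measure concrete, the sketch below computes the Kimura 2-parameter distance between two aligned sequences, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. It assumes that MK2p denotes the mean of the pairwise K2p distances for a region; the function names `k2p_distance` and `mean_k2p` are hypothetical, and this is not the software used in the study.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences.

    Columns containing gaps or ambiguous bases are skipped.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites in the alignment")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent (saturated).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def mean_k2p(aligned_seqs):
    """Mean of all pairwise K2p distances (ASSUMED meaning of MK2p)."""
    pairs = list(combinations(aligned_seqs, 2))
    return sum(k2p_distance(a, b) for a, b in pairs) / len(pairs)
```

Computing `mean_k2p` separately for each aligned intron or intergenic spacer and ranking the results gives a list of candidate marker regions of the kind reported here.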

Consequently, for introns (S3 Table), the MK2p value ranges from 0.0055 to 0.1096. The clpP_intron2 region had the highest MK2p value, followed by rpl16_intron1; the third, fourth, and fifth were rps16_intron1, ndhA_intron1, and trnL-UAA_intron1, respectively. For intergenic spacers (S4 Table), five highly regions, and coding regions were characterized by sky-blue block, red block, and blue block. We adopted a cutoff value of 70% in the process of alignment.

To investigate whether there are significant differences in the ycf3-psbJ regions (red square A in Fig 3) and rpl22 coding regions (red square B in Fig 3) between S. sessilifolia and its closely related species, we conducted a synteny analysis. As plotted in Fig 4, we detected a large inversion of about 3 kb in the LSC region. Interestingly, such an approximately 3-kb long inversion was confirmed was always visible (data not shown). Therefore, the inversion in the ycf3-psbJ region may be unique to S. sessilifolia.
Figure 4. Comparison of three chloroplast genomes using the MAUVE algorithm. Local collinear blocks were colored to indicate syntenic regions, and histograms within each block indicated the degree of sequence similarity.
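The idea behind detecting an inversion can be sketched in a few lines: a segment is inverted in another genome if only its reverse complement, and not the segment itself, is found there. The code below is a simplified illustration with exact string matching and hypothetical function names. It is not the MAUVE algorithm, which tolerates sequence divergence and handles rearrangements genome-wide.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def segment_orientation(segment, genome):
    """Report how a segment occurs in a genome (exact matching only).

    Returns "+" if the segment is found as given, "-" if only its reverse
    complement is found (i.e. the segment is inverted), and None otherwise.
    The genome is treated as a linear string; a segment spanning the
    origin of a circular genome would be missed.
    """
    segment = segment.upper()
    genome = genome.upper()
    if segment in genome:
        return "+"
    if reverse_complement(segment) in genome:
        return "-"
    return None
```

In practice, the segment would be a block such as the region carrying petA and cemA taken from one species and searched for in the others after alignment, rather than by exact matching.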

IR expansion and SSC contraction
IR contraction and expansion are common evolutionary events that contribute to chloroplast genome size variation [43]. Here, we performed a boundary comparison analysis to identify IR contraction and expansion events (Fig 5). Compared to A. officinalis and C. palmata, the relatively larger IR regions indicated IR expansion events in S. sessilifolia. At the same time, the SSC region was shorter than those of A. officinalis and C. palmata by 465-737 bp, suggesting the occurrence of SSC contraction events in S. sessilifolia. In A. officinalis and C. palmata, the rpl22 gene is located in the LSC region as a single copy. However, the IR regions of S. sessilifolia extended to the intergenic spacer between the rpl22 and rps3 genes, resulting in two copies of the rpl22 gene. Therefore, the significant difference in the rpl22 coding regions between S. sessilifolia and its closely related species can be attributed to IR expansion events.
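The effect of a boundary shift on gene copy number can be illustrated with a small sketch that maps a gene onto the quadripartite structure. The region lengths are those reported for S. sessilifolia in the abstract. The region order (LSC, IRb, SSC, IRa, starting at position 1) is a common convention assumed here, the function names are hypothetical, and no coordinates of real genes are used.

```python
# Region lengths (bp) reported for the S. sessilifolia chloroplast genome.
LSC, IR, SSC = 81_950, 27_094, 17_901
GENOME_LENGTH = LSC + 2 * IR + SSC  # two IR copies

# ASSUMED layout: LSC, IRb, SSC, IRa, with 1-based inclusive coordinates.
REGIONS = [
    ("LSC", 1, LSC),
    ("IRb", LSC + 1, LSC + IR),
    ("SSC", LSC + IR + 1, LSC + IR + SSC),
    ("IRa", LSC + IR + SSC + 1, GENOME_LENGTH),
]


def regions_spanned(start, end):
    """Names of all regions that the interval [start, end] overlaps."""
    return [name for name, r_start, r_end in REGIONS
            if start <= r_end and end >= r_start]


def copy_count(start, end):
    """2 if the gene lies entirely inside an IR copy, otherwise 1.

    A gene that crosses an IR boundary has one complete copy plus a
    partial duplicate at the other IR copy, which this sketch ignores.
    """
    spanned = regions_spanned(start, end)
    in_ir_only = bool(spanned) and all(r in ("IRa", "IRb") for r in spanned)
    return 2 if in_ir_only else 1
```

Under this layout, a gene that falls entirely within an IR is counted twice, which is how an expanded IR that reaches past rpl22 produces two copies of the gene, whereas the same gene placed in the LSC is counted once.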

The IRb/SSC boundary extended into the ycf1 gene by 1146-1260 bp, creating a ycf1 pseudogene in Asparagoideae formed a cluster without the lhbA gene.

The ycf68 gene had the highest frequency of gene deletion, followed by the lhbA gene. The next three were the infA, psbZ, and ycf1 genes, respectively.