Refinement of Draft Genome Assemblies of Pigeonpea (Cajanus cajan)

Genome assembly of short reads from large plant genomes remains a challenge in computational biology despite major developments in Next Generation sequencing. Of late multiple draft assemblies of plant genomes are reported in many organisms. The draft assemblies of Cajanus cajan are with different levels of genome completeness; contain large number of repeats, gaps and segmental duplications. Draft assemblies with portions of genome missing, are shorter than the referenced original genome. These assemblies come with low map accuracy affecting further functional annotation and prediction of gene component as desired by crop researchers. Genome coverage i.e. number of sequenced raw reads mapped on to certain locations of the genome is an important quality indicator of completeness and assembly quality in draft assemblies. Present work was aimed at improvement of coverage in reported de novo sequenced draft genomes (GCA_000340665.1 and GCA_000230855.2) of Pigeonpea, a legume widely cultivated in India. The two assemblies comprised 72% and 75% of estimated coverage of genome respectively. We employed assembly reconciliation approach to compare draft assemblies and merged them to generate a high quality near complete assembly with enhanced contiguity. Finished assembly has reduced number of gaps than reported in draft assemblies and improved genome coverage of 82.4%. Quality of the finished assembly was evaluated using various quality metrics and for presence of specific trait related functional genes. Employed pair-end and mate-pair local library data sets enabled to resolve gaps, repeats and other sequence errors yielding lengthier scaffolds compared to two draft assemblies. We report prediction of putative host resistance genes from improved sequence against Fusarium wilt disease and evaluated them in both wet laboratory and field phenotypic conditions.


32
Rapid developments in sequencing technologies facilitated generation of several draft assemblies 33 in plants. These are valuable resources for elucidating genetic information and understanding 34 biology of the crop. However, each of these draft assemblies have strengths and weaknesses as 35 were sequenced and assembled based on different algorithms [1,2]. Draft assemblies differ on 36 the sequencing technology and also the assembly software employed. One assembly may be 37 conservative in selection of reads resulting in low genome coverage but with many gaps. 38 Another assembler is vigorous, yielding more contigs but with many errors. Draft genomes are 39 typically sets of large contingent of assembled contigs and scaffolds that are often fragmented 40 due to presence several gaps interlaced by repetitive regions. In a misassembly different contigs 41 are improperly joined. Mis-joins problem arises due to inversions, relocation or a translocation. 42 Gaps arise also due to incorrect insertion or deletion of a sequenced read in a misassembly. An 43 inversion or a translocation alters placement of a contig on to scaffold belonging to a different 44 chromosome. Hence, annotation of unfinished and partially assembled genomes creates 45 ambiguities while accessing complete genetic information as desired by biologists. 46 In misassemblies some of the reasons for incompleteness include 1.Gaps appearing due to 47 polymorphisms in complex genomes where reads on either side of a gap representing two 48 haplotypes belonging to two separate chromosomes, 2. Abundance of repeat elements, multiple 49 ways to fill the gaps and confusing the assembler thus leaving a gap unfiled, 3. Lack of more 50 reads to cover that part of the genome, requiring additional library of reads to fill the gaps.

105
Improvement of the draft genome assemblies employing reconciliation algorithm 106 Reconciliation assembly approach was employed in the present work to refine the fragmented 107 draft genome assemblies A1 and A2. For selection of optimum K-mers, hybridSPAdes [15], was 108 employed and combinations ranging from 21 to 55 were evaluated. We observed with k-mer 109 sizes 21, 33, 55 and 77 yielded few fragmented sequences, less number contigs with high N50, 110 mean and median scaffold lengths in superior assemblies. The Illumina HiSeq sequence reads 111 resulted in 46,979 reads with the N50 length of 24,087. Metassembler was employed for merging 112 of two assemblies. Metassembler implements reconciliation algorithm to refine and obtain 113 reconstructed genome. In order to capture the suitable reference assembly set for alignment 114 during merger process we examined the required order in which assemblies A1 and A2 are to be 115 chosen as master set (GCA_000230855.2) and slave sets for aligning with the former, 116 (GCA_000340665.1). We observed that choice of A1 as master set with and A2 as slave set 117 resulted in a highly contiguous superior assembly. Superiority of resulting merged metassembly 118 was systematically evaluated with the compression-expansion (CE) statistics. Gaps present in the 119 scaffolds were closed using mate pairs. The remaining gaps were filled by searching unique 120 contig end sequences against unused reads. We observed that repeat structure analysis and 121 resulted significant reduction of gaps and contributed to prediction of specific genes. The 122 improved assembly had 46,979 contigs with total size of 548.2 Mb and covers 82.4% of the 123 genome with high contiguity ( Table 1). to fill these gaps. While building the graph, metassembler searches the mate-pair library for right 175 sized inserts to complete the shortest path between any of these contigs, to fill a gap. 176 We evaluated the closure performances of the Gapcloser and Gapfiller tools on the repeat 177 derived gaps using the raw mate pair reads. We first tested the performance of each tool using 178 the raw mate pair reads. Both the above tools used first raw pair end and Mate pair libraries. We Gap closer was more efficient by filling most of the gaps with 82.4% and with low error rates. 183 We achieved improved contiguity by using long mate-pairs to fill gaps in assembly and there by 184 achieving higher coverage in the finished assembly. Augustus was employed and 51,737 genes are predicted in the finished metassembly.

206
Predicted numbers of genes are less compared to A1 and higher than A2.    Table 3). The 256 overall analysis showed that the relative abundance of tetra, penta and hexa SSRs types were low 257 as compared to mono, di and tri SSR types in Pigeonpea genome sequences (Figure 1).    Table 7). 311 Genomic DNA from 15 day old seedlings of 34 Pigeonpea cultivars was extracted employing 312 CTAB method [28]. Purity and concentration of DNA was estimated with Nanodrops ND-1000.

313
Nine primers were selected for polymorphism study (Supplementary Table 7  To begin with we wanted to select the order in which the input draft assemblies are to be merged 333 to gain subsequent superior alignment and read mapping. After several permutations, we 334 observed that treating assembly A1 as master and aligning it with assembly A2 yielded better genes are less in our finished assembly, A3 are less compared to A1 but higher than A2 (Table   361 2). Annotation of improved assembly yielded 51,737 genes predicted. Wet lab PCR  (Figure 3). Data on yellow mosaic disease reaction is not presented here. PCR 367 amplified genes were isolated, cloned and submitted to NCBI (Supplementary Table 6). 368 Genotype, environment interaction in the field determines the phenotypic performance of In the present work genome reconciliation algorithm was adopted to reconstruct draft assemblies

392
We developed a workflow model (Figure 4) based on reconciliation algorithm, that includes 1.

393
Merging two mis-assemblies, 2. Finding matches and mismatches and other sequencing errors, 3. assembly. 425 We initially employed overlap approach with available read sequences followed by used pair-426 end as well as mate-pair library data sets for resolving repeat redundancy, gap filling and other 427 structural errors. Firstly, we used entire single paired read data sets available (minus two mate-428 pair data sets) of A1 along with all data sets from A2. Alternatively, second treatment included 429 two mate-pair data sets from A1 and all full data available from A2. At the end of analysis, all 430 the output values and statistical metric data were collected for comparative performance analysis.

431
Draft assembly A1 was sequenced in 2011 and had genome coverage of 199% [1]. However, 432 using the same raw read data, the authors had again reassembled employing A3 assembler and 433 reported gain of coverage, i.e. an increase of ~15% (from 60.0% to 75.6%,) and resubmitted to 434 18 NCBI . In our present work we used this recent assembly set, A1 along with A2 assembly data 435 [2] for reassembly and improvement (Figure 4).   for mono, 6 for di and 5 for a tri, tetra, penta, hexa motifs. We calculated the total lengths of all 478 mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide repeats in terms of base pairs of SSR per             Table 3: Repetitive sequences of draft assemblies A1, A2 and finished A3 assembly. 671 Table 4: Results of microsatellite search in the improved pigeonpea assembly A3.