Abstract
Different Musa species, subspecies, and cultivars are currently being investigated to reveal their genomic diversity. Here, we compare the genome sequence of one of the commercially most important cultivars, Musa acuminata Dwarf Cavendish, against the Pahang reference genome assembly. Numerous small sequence variants were detected and the ploidy of the cultivar presented here was determined as triploid based on sequence variant frequencies. Illumina sequence data also revealed a duplication of a large segment on the long arm of chromosome 2 in the Dwarf Cavendish genome. Comparison against previously sequenced cultivars provided evidence that this duplication is unique to Dwarf Cavendish. Although no functional relevance of this duplication was identified, this example shows the potential of plants to tolerate such aneuploidies.
Bananas (Musa) are monocotyledonous perennial plants. The edible fruit (botanically a berry) is among the most popular fruits in the world. In 2016, about 5.5 million hectares of land were used for the production of more than 112 million tons of bananas (FAO 2019). The majority of bananas were grown in Africa, Latin America, and Asia where they offer employment opportunities and are important export commodities (FAO 2019). Furthermore, with an annual per capita consumption of more than 200 kg in Rwanda and more than 100 kg in Angola, bananas provide food security in developing countries (FAO 2019; Arias et al. 2003). While plantains or cooking bananas are commonly eaten as a staple food in Africa and Latin America, the softer and sweeter dessert bananas are popular in Europe and Northern America. Between 1998 and 2000, around 47 % of the world banana production and the majority of the dessert banana production relied on the Cavendish subgroup of cultivars (Arias et al. 2003). Within this subgroup, the Dwarf Cavendish banana (“Dwarf” referring to the height of the pseudostem, not to the fruit size) is one of the commercially most important cultivars, along with Grand Naine (“Chiquita banana”).
Although Cavendish bananas are almost exclusively traded internationally, numerous varieties are used for local consumption in Africa and Southeast Asia. Bananas went through a long domestication process which started at least 7,000 years ago (Denham et al. 2003). The first step towards edible bananas was interspecific hybridisation between subspecies from different regions, which caused incorrect meiosis and diploid gametes (Perrier et al. 2011). The diversity of edible triploid banana cultivars resulted from human selection and triploidization of Musa acuminata as well as Musa balbisiana (Perrier et al. 2011).
These exciting insights into the evolution of bananas were revealed by the analysis of genome sequences. Technological advances boosted sequencing capacities and allowed the (re-)sequencing of genomes from multiple subspecies and cultivars. M. acuminata can be divided into several subspecies and cultivars. The first M. acuminata (DH Pahang) genome sequence was published in 2012 (D’Hont et al. 2012), and many more genomes have been sequenced recently, including banksii, burmannica, zebrina (Rouard et al. 2018), malaccensis (SRR8989632, SRR6996493), Baxijiao (SRR6996491, SRR6996491), and Sucrier : Pisang_Mas (SRR6996492). Additionally, the genome sequences of other Musa species, M. balbisiana (Davey et al. 2013), M. itinerans (Wu et al. 2016), and M. schizocarpa (Belser et al. 2018), have already been published.
Here we report our investigation of the genome of M. acuminata Dwarf Cavendish, one of the commercially most important cultivars. We identified an increased copy number of a segment of the long arm of chromosome 2, indicating that this region was duplicated in one haplophase.
Materials and Methods
Plant material and DNA extraction
Musa acuminata Dwarf Cavendish tissue culture seedlings were obtained from FUTURE EXOTICS/SolarTek (Düsseldorf, Germany) (Figure 1). Plants were grown under natural daylight at 21°C. Genomic DNA was isolated from leaves following the protocol of Dellaporta et al. (1983).
Library preparation and sequencing
Genomic DNA was subjected to sequencing library preparation via the TruSeq v2 protocol as previously described (Pucker et al. 2016). Paired-end sequencing was performed on an Illumina HiSeq1500 and an Illumina NextSeq500, resulting in 2×250 nt and 2×154 nt read data sets, respectively, with an average Phred score of 38. These data sets provide 55x and 65.6x coverage, respectively, for the approximately 523 Mbp (D’Hont et al. 2012) haploid banana genome.
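The relation between sequenced bases, genome size, and coverage can be made explicit with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, and any base totals it prints are back-calculated from the coverage values and the genome size stated above, not taken from the sequencing runs. The per-copy depth assumes three chromosome copies and an even distribution of reads across them.

```python
# Approximate haploid genome size used in the text (D'Hont et al. 2012).
GENOME_SIZE_BP = 523_000_000


def coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Average coverage = sequenced bases / haploid genome size."""
    return total_bases / genome_size


def bases_for_coverage(cov, genome_size=GENOME_SIZE_BP):
    """Number of sequenced bases implied by a given average coverage."""
    return cov * genome_size


def per_copy_depth(cov, copies=3):
    """Average depth per chromosome copy (assumption: reads are evenly
    distributed across all copies)."""
    return cov / copies


if __name__ == "__main__":
    for label, cov in (("2x250 nt", 55.0), ("2x154 nt", 65.6)):
        gbp = bases_for_coverage(cov) / 1e9
        print(f"{label}: {gbp:.2f} Gbp implied, "
              f"{per_copy_depth(cov):.1f}x per chromosome copy")
```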
Read mapping, variant calling, and variant annotation
All reads were mapped to the DH Pahang v2 reference genome sequence via BWA-MEM v0.7 (Li 2013) using -M to mark shorter split hits as secondary for downstream filtering. This read mapping was analysed by the HaplotypeCaller of the Genome Analysis ToolKit (GATK) v3.8 (McKenna et al. 2010; Van der Auwera et al. 2013) to identify sequence variations in single nucleotides, called “single nucleotide variants” (SNVs), and also insertions/deletions (InDels). SNVs and InDels were called using the following filter rules in accordance with the GATK developer recommendation: ‘QD<2.0’, ‘FS>60.0’, and ‘MQ<40’ for SNVs and ‘QD<2.0’ and ‘FS>200.0’ for InDels. An InDel length cutoff of 100 bp was applied to restrict downstream analyses to a set of high quality variants called from 2×250 nt reads. Only variants supported by at least five reads were kept. The resulting variant set was subjected to SnpEff (Cingolani et al. 2012) to assign predictions about the functional impact to the variants in the set. Variants with disruptive effects were selected using a customized Python script as described earlier (Pucker et al. 2016).
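The filter rules listed above can be expressed as a small predicate. The following Python sketch is not the script used in this study; it only restates the thresholds from the text. The function names are hypothetical, the variant is assumed to be biallelic, a missing annotation is assumed not to fail a variant, and "supported by at least five reads" is interpreted here as a read depth passed in by the caller.

```python
# Filter expressions from the text: a variant is removed if any rule matches.
SNV_FILTERS = {"QD": ("<", 2.0), "FS": (">", 60.0), "MQ": ("<", 40.0)}
INDEL_FILTERS = {"QD": ("<", 2.0), "FS": (">", 200.0)}
MAX_INDEL_LEN = 100  # bp
MIN_READS = 5


def parse_info(info):
    """Turn a VCF INFO string like 'QD=12.0;FS=3.1' into a dict of strings."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
    return fields


def passes(ref, alt, info, depth):
    """Return True if a biallelic variant survives all filters."""
    is_snv = len(ref) == 1 and len(alt) == 1
    rules = SNV_FILTERS if is_snv else INDEL_FILTERS
    values = parse_info(info)
    for key, (op, cutoff) in rules.items():
        if key not in values:
            continue  # assumption: a missing annotation does not fail a variant
        value = float(values[key])
        if (op == "<" and value < cutoff) or (op == ">" and value > cutoff):
            return False
    if not is_snv and abs(len(ref) - len(alt)) > MAX_INDEL_LEN:
        return False
    return depth >= MIN_READS
```

Note that the same FS value can remove an SNV but keep an InDel, because the InDel threshold (200.0) is more permissive than the SNV threshold (60.0).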
The genome-wide distribution of SNVs and InDels was assessed based on previously developed scripts (Baasner et al. 2019). The length distribution of InDels inside coding sequences was compared to the length distribution of InDels outside coding sequences using a customized Python script (Pucker et al. 2016).
De novo genome assembly
Trimmomatic v0.38 (Bolger et al. 2014) was applied to remove low quality sequences (i.e. four consecutive bases below Phred 15) and remaining adapter sequences (based on similarity to all known Illumina adapter sequences). Different sets of trimmed reads were subjected to SOAPdenovo2 (Luo et al. 2012) for assembly using optimized parameters (Pucker et al. 2019) including avg_ins=600, asm_flags=3, rd_len_cutoff=300, pair_num_cutoff=3, and map_len=100. K-mer sizes ranged from 67 to 127 in steps of 10. Resulting assemblies were evaluated using previously described criteria (Pucker et al. 2019) including general assembly statistics (e.g. number of contigs, assembly size, N50, and N90) and a BUSCO (Benchmarking Universal Single-Copy Orthologs) v3 (Simão et al. 2015) assessment. Polishing was done by removing potential contaminations and adapters as described before (Pucker et al. 2019). The DH Pahang v2 assembly (D’Hont et al. 2012; Martin et al. 2016) was used in the contamination detection process to distinguish between bona fide banana contigs and sequences of unknown origin. Contigs with high sequence similarity to non-plant sequences were removed as previously described (Pucker et al. 2019). Remaining contigs were sorted based on the DH Pahang v2 reference genome sequence and concatenated to build pseudochromosomes to facilitate downstream analyses. The de novo Dwarf Cavendish assembly generated with a K-mer size of K=127 was chosen for reporting assembly statistics.
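For readers unfamiliar with the contiguity statistics mentioned above, N50 and N90 can be computed from a list of sequence lengths as sketched below. This is a generic illustration of the standard definition, not the evaluation script referenced in the text; the function name nx is hypothetical.

```python
def nx(lengths, fraction):
    """Return the length L such that sequences of length >= L together
    cover at least `fraction` of the total assembly size
    (fraction=0.5 gives N50, fraction=0.9 gives N90)."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length


# Toy example with five contigs (30 bp in total).
toy = [10, 8, 6, 4, 2]
n50 = nx(toy, 0.5)  # 10 + 8 = 18 >= 15, so N50 is 8
n90 = nx(toy, 0.9)  # 10 + 8 + 6 + 4 = 28 >= 27, so N90 is 4
```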
Data Availability Statement
Sequencing datasets were submitted to the European Nucleotide Archive (ERR3412983, ERR3412984, ERR3413471, ERR3413472, ERR3413473, ERR3413474). Python scripts are freely available on GitHub (https://github.com/bpucker/banana). SNVs and InDels detected between the M. acuminata cultivars DH Pahang and Dwarf Cavendish are available in VCF format at http://doi.org/10.4119/unibi/2937972. The Dwarf Cavendish genome assembly is available in FASTA format at http://doi.org/10.4119/unibi/2937697.
Supplementary Material
File S1. Per chromosome read coverage distribution of Dwarf Cavendish reads.
File S2. Comparison of SNVs in the duplicated segment on the long arm of chromosome 2 to all other SNVs in the genome. The higher read coverage at variants indicates a duplication of this region.
File S3. List of public genomic banana sequence read samples used for comparison against Dwarf Cavendish based on the DH Pahang reference.
File S4. Coverage plots of public genomic banana sequence read samples (File S3) for comparison against Dwarf Cavendish based on the DH Pahang reference. Samples are: Musa acuminata, Musa acuminata AYP_BOSN_r1, Musa acuminata ssp. banksii, Musa acuminata ssp. burmannica, Musa acuminata Cavendish BaXiJiao, Musa acuminata Gros Michel, Musa acuminata ssp. malaccensis, Musa acuminata Sucrier (Pisang Mas), Musa acuminata Sucrier (Pisang Mas 1998-2307), Musa acuminata ssp. zebrina (blood banana), Musa balbisiana Pisang Klutuk Wulung, Musa itinerans, Musa schizocarpa.
File S5. List of selected high impact variants between Dwarf Cavendish and DH Pahang with resulting effects predicted by SnpEff.
Results and Discussion
Structural variants
Mapping of M. acuminata Dwarf Cavendish reads against the DH Pahang v2 reference sequence assembly revealed several copy number variations in different parts of the genome (Figure 2, File S1). The most remarkable difference between the Dwarf Cavendish and Pahang genome sequences is the amplification of a continuous region of about 6.2 Mbp (length deduced from the reference genome) on the long arm of chromosome 2 (Figure 2, File S1, S2). An investigation of allele frequencies in the duplicated segment on chromosome 2 revealed that this duplication originates from a haplophase with high similarity to the reference sequence (Figure 3). Such a duplication was not observed in any of the other publicly available genomic sequencing data sets when compared against the DH Pahang v2 genome sequence (File S3, S4). At first glance, read mapping also indicates at least four large scale deletions in Dwarf Cavendish compared to Pahang v2 on chromosomes 2, 4, 5 and 7 (Figure 2). However, analysis of the underlying sequence revealed that long stretches of ambiguous bases (Ns) at these positions in the Pahang assembly cause these apparent low coverage regions.
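The principle behind detecting such a duplication from read coverage can be sketched in a few lines of Python. This is a simplified illustration and not the procedure used to produce Figure 2: the function names, the window size, and the threshold of 1.2 are hypothetical choices. In a triploid genome, one additional copy of a segment should raise the relative coverage to about 4/3 of the genome-wide level.

```python
from statistics import median


def window_means(depths, window):
    """Mean read depth in consecutive, non-overlapping windows."""
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def relative_coverage(depths, window):
    """Window means divided by the median window mean."""
    means = window_means(depths, window)
    baseline = median(means)
    if baseline == 0:
        raise ValueError("median window coverage is zero")
    return [m / baseline for m in means]


def flag_gains(ratios, threshold=1.2):
    """Indices of windows whose relative coverage reaches the threshold
    (hypothetical cutoff between 1.0 and the expected 4/3)."""
    return [i for i, ratio in enumerate(ratios) if ratio >= threshold]


# Toy per-position depths: a 10 bp stretch at 4/3 of the baseline depth.
toy_depths = [30] * 40 + [40] * 10 + [30] * 30
gains = flag_gains(relative_coverage(toy_depths, window=10))  # window 4 is flagged
```

Stretches of ambiguous bases in the reference would show up in the same analysis as windows with a relative coverage near zero, which is why apparent deletions need to be checked against the reference sequence.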
Ploidy of M. acuminata Dwarf Cavendish
Based on the coverage of small sequence variants (SNVs and InDels), the ploidy of Dwarf Cavendish was identified as triploid (Figure 3). Many heterozygous variant positions display a frequency of the reference allele close to 0.33 or close to 0.66. This fits the expectation for two copies of the reference allele and one copy of a different allele, or vice versa. Deviation from the precise values is explained by random fluctuation of the read distribution at the given position. Since the peak around 0.66 for the frequency of the allele identical to the reference is substantially higher than the peak around 0.33, it is reasonable to assume that two haplophases are very similar to the reference. The third haplophase is the one that contains more deviating positions and differs more from the reference. It is likely that reads of the divergent haplophase are mapped with a slightly reduced rate. This might explain why the peak at 0.66 is slightly more than twice the size of the peak at 0.33. In the duplicated segment on chromosome 2 the allele frequency peaks are shifted to 0.25 and 0.75 (Figure 3), indicating a tetraploid region with three haplophases identical to the reference and one haplophase divergent from the Pahang reference.
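The expected allele frequencies follow directly from the number of chromosome copies, as the following Python sketch shows. It is an illustration of the reasoning above, not the analysis code behind Figure 3; the function names are hypothetical, and the read counts in the example are invented.

```python
from fractions import Fraction


def expected_ref_frequencies(copies):
    """Possible reference-allele frequencies at a heterozygous biallelic
    site when `copies` chromosome copies are present."""
    return [Fraction(k, copies) for k in range(1, copies)]


def ref_allele_frequency(ref_reads, alt_reads):
    """Observed fraction of reads that support the reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads at this position")
    return ref_reads / total


def nearest_expected(freq, copies):
    """Expected frequency that lies closest to the observed frequency."""
    return min(expected_ref_frequencies(copies),
               key=lambda expected: abs(float(expected) - freq))


triploid = expected_ref_frequencies(3)    # 1/3 and 2/3
four_copies = expected_ref_frequencies(4)  # 1/4, 1/2 and 3/4

# Invented example: 21 reference reads and 9 alternative reads (0.70)
# are closest to 2/3, i.e. two reference copies and one divergent copy.
call = nearest_expected(ref_allele_frequency(21, 9), copies=3)
```

With four copies, a frequency of 1/2 is also possible in principle; the peaks reported above for the duplicated segment are those at 0.25 and 0.75.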
To test hypotheses regarding differences between the haplophases of the Dwarf Cavendish genome, a highly contiguous phased assembly would be needed. Current long read sequencing technologies like Single Molecule Real-Time (Pacific Biosciences) or nanopore sequencing (Oxford Nanopore Technologies) in principle allow the generation of such assemblies. However, successful phase separation currently requires tools like TrioCanu (Koren et al. 2018) which use Mendelian relationships between parents and F1 (i.e. crosses) for assignment of reads to phases. Generation of such datasets will be very difficult for banana and goes significantly beyond the scope of this study.
Genome-wide distribution of small sequence variants
In total, 10,535,983 SNVs and 1,466,047 InDels were identified between the Dwarf Cavendish reads and the Pahang v2 assembly (see Data availability above). The genome-wide distribution of these variants is shown in Figure 4. As previously observed in other re-sequencing studies (Pucker et al. 2016), the number of SNVs exceeds the number of InDels substantially. Moreover, InDels are more frequent outside of annotated coding regions. Inside coding regions, InDels show an increased proportion of lengths which are divisible by 3, a bias introduced due to the avoidance of frameshifts.
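The frameshift argument can be checked by counting how many InDels have a length divisible by 3. The Python sketch below illustrates the idea and is not the customized script referenced in the Methods; the function names and the example alleles are hypothetical. It takes (REF, ALT) allele pairs as they appear in a VCF file.

```python
def indel_length(ref, alt):
    """Net length change of an InDel given its REF and ALT alleles."""
    return abs(len(ref) - len(alt))


def fraction_in_frame(indels):
    """Fraction of InDels whose length is a multiple of 3 and therefore
    does not shift the reading frame. Entries without a length change
    (e.g. SNVs) are ignored."""
    lengths = [indel_length(ref, alt) for ref, alt in indels]
    lengths = [length for length in lengths if length > 0]
    if not lengths:
        return 0.0
    in_frame = sum(1 for length in lengths if length % 3 == 0)
    return in_frame / len(lengths)


# Invented example: lengths 3, 3, 1 and 2, so half are in frame.
example = [("A", "ATTT"), ("ACGT", "A"), ("A", "AT"), ("AGG", "A")]
share = fraction_in_frame(example)
```

Applying the same function separately to InDels inside and outside of annotated coding sequences would give the two proportions that are compared in the text.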
SnpEff predicted 4,163 premature stop codons, 3,238 lost stop codons, and 8,065 frameshifts based on this variant set (File S5). Even given the larger genome size, these numbers are substantially higher than the numbers of high impact variants observed in previous re-sequencing studies of homozygous species (Pucker et al. 2016; Xu et al. 2019). One explanation could be the presence of three alleles for each locus leading to compensation of disrupted alleles. Since banana plants are propagated vegetatively, breeders do not encounter inbreeding depression.
De novo genome assembly
To facilitate wet lab applications like oligonucleotide design and validation of amplicons, the genome sequence of Dwarf Cavendish was assembled de novo. The assembly comprises 256,523 scaffolds with an N50 of 5.4 kb (Table 1). Differences between the three haplophases are one possible explanation for the low assembly contiguity. The assembly size slightly exceeds the size of one haplotype. Due to the low contiguity of this assembly and a proportion of complete BUSCOs (Benchmarking Universal Single-Copy Orthologs) (Simão et al. 2015) only slightly above 50%, annotation was omitted. Nevertheless, we successfully used the produced genome assembly for primer design and detection of small sequence variants.
Author contributions
BP, MB and RS planned the experiment. PV did the library preparation and sequencing. BP performed bioinformatic analyses. MB and BP wrote the initial draft. MB, BP, BW and RS revised the manuscript. All authors read and approved the final manuscript version.
Acknowledgments
We thank Joachim Weber for great technical assistance.
Footnotes
MB: mbusche{at}cebitec.uni-bielefeld.de
BP: bpucker{at}cebitec.uni-bielefeld.de
PV: viehoeve{at}cebitec.uni-bielefeld.de
BW: bernd.weisshaar{at}uni-bielefeld.de