Telomere-to-telomere genome assembly of matsutake (Tricholoma matsutake)

Abstract Here, we report the first telomere-to-telomere genome assembly of matsutake (Tricholoma matsutake), which consists of 13 sequences (spanning 161.0 Mb) and a 76 kb circular mitochondrial genome. All the 13 sequences were supported with telomeric repeats at the ends. GC-rich regions are located at the middle of the sequences and are enriched with long interspersed nuclear elements (LINEs). Repetitive sequences including long-terminal repeats (LTRs) and LINEs occupy 71.6% of the genome. A total of 21,887 potential protein-coding genes were predicted. The genomic data reported in this study served not only matsutake gene sequences but also genome structures and intergenic sequences. The information gained would be a great reference for exploring the genetics, genomics, and evolutionary study of matsutake in the future, and ultimately facilitate the conservation of this vulnerable genetic resource.


Introduction 19
Matsutake (Tricholoma matsutake [S. Ito et Imai] Singer), belonging to the phylum Basidiomycota, 20 is an ectomycorrhizal fungus that coexists with Pinaceae and Fagaceae trees in a symbiotic 21 association 1,2 . In the field, two spores of matsutake fuse together and grow to form a "shiro", which 22 is a symbiotic entity formed between matsutake and its host tree. One shiro produces a number of 23 sporocarps during the growing season. The sporocarp of matsutake has been considered as one of the 24 most valuable components of traditional Japanese cuisine since ancient times, as mentioned in 25 Manyo-shu (a series of books for Japanese poetry compiled around 700 AD in Japan), owing to its 26 pleasant aroma, which is largely attributed to 1-octen-3-ol (also known as matsutakeol) 3,4 ; however, 27 sporocarps are non-culturable. In 2019, the International Union for Conservation of Nature (IUCN) 28 categorized matsutake as vulnerable. The production of sporocarps has drastically decreased in 29 recent years 5 because of the deterioration of its growing environment. To understand the life cycle 30 and life history of matsutake, safeguarding its production and conservation is necessary, which 31 Two sporocarps, which were probably ramets derived from a single shiro (radius > 2 m) that has 23 been generating sporocarps for more than 20 years 11 , were collected from Ina, Nagano, Japan. The 24 sporocarps were flash-frozen in liquid nitrogen, dried under vacuum, and then stored at room 25 temperature until needed for DNA extraction. 26 Genomic DNA was extracted from the dried stipes using the cetyltrimethylammonium bromide 27 (CTAB) method 12 . The concentration of the extracted DNA was measured using the Qubit dsDNA 28 BR assay kit (Thermo Fisher Scientific, Waltham, MA, USA), and DNA fragment length was 29 evaluated by agarose gel electrophoresis with Pippin Pulse (Sage Science, Beverly, MA, USA). 30 31 Using the HiFi reads obtained from the Sequel IIe system (PacBio), the genome size of matsutake 11 was estimated with GCE 13 , based on k-mer frequency (k = 21) calculated with Jellyfish 14 (version 12 2.3.0). The reads were assembled using hifiasm 15  Single nucleotide polymorphisms (SNPs) and insertions and deletions (indels) were detected with 3 genome sequence reads obtained from NCBI database (accession number: PRJNA726361). Low-4 quality bases and adapter sequences were removed with PRINSEQ 25 (version 0.20.4) and 5 fastx_clipper (parameter, -a AGATCGGAAGAGC), respectively, in the FASTX-Toolkit (version 6 0.0.14; http://hannonlab.cshl.edu/fastx_toolkit). The remaining high-quality reads were mapped on 7 to the current assembly with Bowtie2 26 (version 2.3.5.1; parameters: --local -I 100 -X 1000), and 8 sequence variants were detected using the mpileup and call commands of BCFtools 27 (version 1.9). 9 High-confidence variants were selected with VCFtools 28 (version 0.1.12b) using the following 10 parameters: minimum read depth ≥ 8 (--minDP 8); minimum variant quality = 999 (--minQ 999); 11 maximum missing data < 0.5 (--max-missing 0.5); and minor allele frequency ≥ 0.05 (--maf 0.05). 12 Insertions of MarY1 were detected with PTEMD 29 (version 1.03). Effects of nucleotide sequence 13 variations on gene function were estimated with SNPeff 30 (version 4.3t). 14 Four matsutake genome assemblies downloaded from the NCBI database (accession numbers: 15 BDDP01, Tricma30605_assembly01; PKSN02, ASM293902v2; QMFF01, ASM331463v1; 16 WIUY01, Trima3) were aligned against the genome assembly generated in this study using 17 Minimap2 31 (version 2.24; parameter: -cx asm20), with a mapping-quality cutoff of 60. The 18 positions of genes, repeats, and genome alignments were compared using the intersection command 19 in BEDtools 32 (version 2.27.0; default parameters). 20 21

22
DNA sequencing, data analysis, and genome assembly 23 Genomic DNA was extracted from two dried sporocarps (samples A and B) of matsutake. The 24 amount of DNA extracted from each sample (9 µg) was sufficient for library construction; however, 25 because of degradation (Supplementary Figure S1), the extracted DNA was used for library 26 preparation without shearing. The resultant libraries were sequenced on a SMRT Cell 8M to obtained 27 9.5 Gb (sample A) and 7.8 Gb (sample B) data, with N50 lengths of 11 kb (sample A) and 10 kb 28 (sample B). The k-mer analysis detected two peaks (Supplementary Figure S2), indicating that the 29 haploid genome size of matsutake was 149 Mb and the level of heterozygosity was high. The 30 sequence reads of each sample were assembled separately to obtain two sets of contigs: 182 contigs 31 (165.5 Mb) for sample A, and 146 contigs (162.9 Mb) for sample B. Among these, 15 contigs (160.7 1 Mb) of sample A and 12 contigs (159.2 Mb) of sample B, all of which were >1 Mb in length, were 2 selected for further analysis. 3 Next, we searched for the telomeric motif, (CCCTAA)n, in the contigs (Figure 1). In sample A, 4 the telomeric motif was found at both ends of nine contigs (A1, A2, A6, A8, A11, A12, A13, A14, 5 and A16) and at one end of five contigs (A3, A4, A7, A10, and A15). In sample B, the motif was 6 found at both ends of nine contigs (B1, B2, B3, B4, B5, B7, B8, B9, and B11) and at one end of 7 three contigs (B6, B12, and B13). The average length of the telomeric sequence was 129 bp (21 8 repeats), and ranged from 66 bp (11 repeats) to 168 bp (28 repeats). 9 Comparison of the two sets of genome assemblies revealed 10 pairs of perfectly aligned contigs 10 (A1-B1, A2-B9, A6-B11, A8-B3, A10-B4, A11-B8, A12-B2, A13-B7, A14-B5, and A15-B12) 11 ( Figure 1). Three contigs of sample A (A3, A4, and A9) covered the entire sequence of one contig of 12 sample B (B13). Furthermore, two contigs of sample A (A7 and A16) corresponded to one contig of 13 sample B (B6). Thus, we concluded that contigs A3, A4, and A9 were unassembled, and contig B6 14 was misassembled. Therefore, we joined contigs A3, A4, and A9 with 100 Ns to establish a single 15 contig, and left contigs A7 and A16 as separate. Finally, 13 contigs spanning 160.7 Mb were 16 obtained, of which 11 contigs possessed telomeric motifs at both ends, while two contigs were 17 supported with the telomeric motif at either end. The 13 contigs represented 94.2% complete 18 BUSCOs. The final assembly was designated as TMA_r1.0, and the contigs were named 19 TMA_r1.0ch01 to TMA_r1.0ch13 in order of decreasing sequence length ( Figure 1, Table 1). The 20 GC content was ca. 45% over the entire genome, with one peak (~55%) in each chromosome, except 21 chromosome 1, which showed two peaks ( Figure 2). In addition, we identified a 76,067 bp circular 22 contig, which represented the mitochondrial genome of matsutake (Tma1.0mito). 23 24

Repetitive sequence analysis 25
Repetitive sequences occupied a total physical distance of 115.2 Mb (71.7%) in the genome 26 assembly (TMA_r1.0; 160.7 Mb). Nine major types of repeats were identified in varying proportions 27 (Table 2). The dominant repeat types in the chromosome sequences were LTRs (69.2 Mb) and long 28 interspersed nuclear elements (LINEs; 5.9 Mb). LINEs were predominant in regions with high GC 29 content in all chromosomes, whereas LTR retrotransposons were predominant in regions with low 30 GC content (Figure 2). Repeat sequences unavailable in public databases totaled 16.2 Mb. The MarY1 LTR, which has been extensively studied to date, and its terminal repeats were present as 683 1 and 3,240 copies, respectively, across all 13 chromosomes. 2 3 Gene prediction and annotation 4 RNA-Seq reads were mapped on to the genome assembly of matsutake (TMA_r1.0), with a mapping 5 rate of 93.8%. Based on the sequence alignment, TMA_r1.0 was predicted to contain a total of 6 28,646 genes including 28,322 protein-coding genes and 324 tRNA genes (Table 1). These predicted 7 genes possessed 90.2% complete BUSCOs. The mitochondrial genome was predicted to contain a 8 total of 28 protein-coding genes, including 26 tRNA genes and 2 rRNA genes. 9 Additionally, sequence alignment revealed that of the 23,068 genes predicted in the previous Comparative analysis of the current and previous genome assemblies of matsutake 16 The TMA_r1.0 genome assembly was compared with the four publicly available matsutake genome 17 assemblies, Tricma30605_assembly01, ASM293902v2, ASM331463v1, and Trima3. Sequence 18 coverage in GC-rich regions was mostly low in the four assemblies (Supplementary Figure S3). The 19 Tricma30605_assembly01 covered the longest part of TMA_r1.0 (79.5%) among the four 20 assemblies, followed by Trima3 (78.9%), ASM331463v1 (75.9%), and ASM293902v2 (67.9%). 21 When genomic positions of the alignments with the four assemblies were merged, 94.4% of the 22 TMA_r1.0 was covered by at least one of the four assemblies, while the remaining 5.6% was not 23 covered by any assembly. 24 25

Sequence variants in divergent matsutake lines 26
Whole-genome resequencing data of 14 matsutake lines were obtained from a public DNA database. 27 High-quality reads (3.9 Gb per sample) were mapped on to TMA_r1.0, with an average mapping rate 28 of 96.5%, except one sample (TM_NH), which showed a mapping rate of 73.3%. Totals of 29 2,322,349 SNPs and 102,831 indels were identified in the 13 chromosomes ( Figure 2). The most 30 prominent variant type was intergenic mutations (2,106,014, 86.8%) followed by missense mutations 31 (148,235, 6.1%) and synonymous mutations (99,674, 4.1%) (Supplementary Table S1). The number 1 of deleterious variations, which could disrupt gene structure and function, was 13,479 (0.6%); these 2 were categorized as high-impact variants. In the mitochondrial genome, a total of 90 SNPs and 97 indels were identified across all 14 8 lines, although no MarY1 insertion was detected. Deleterious mutations were found in three 9 mitochondrial genes, orf123, orf290, and cox1. 10 11

Discussion 12
This study presents the telomere-to-telomere genome sequence of matsutake comprising 13 13 chromosomes ( Figure 1, Table 1). To assemble the matsutake genome, we not only considered the 14 telomeric repeat motif but also identified the centromeric regions and sequenced two independent 15 samples. GC-rich regions were found at a single position in all chromosomes, except chromosome 1, 16 which had two GC-rich regions ( Figure 2). Interestingly, the GC-rich regions were enriched with 17 LINEs but devoid of LTRs ( Figure 2). Together, these observations suggest that GC-rich regions 18 represent centromeres, and that chromosome 1 is a dicentric chromosome formed by the telomeric 19 fusion of two chromosomes. We also compared the genome assemblies generated from two 20 independent data sets (samples A and B) ( Figure 1). Consequently, it was possible to identify a 21 misassembled region and an unassembled region (Table 1), which led to the establishment of a 22 telomere-to-telomere genome assembly. To the best of our knowledge, haploid chromosome number 23 of matsutake (n = 7) has been reported in only one study to date 10 . Constructing a telomere-to-24 telomere assembly could serve as an alternative to karyotyping for determining the chromosome 25 number of a species, for which no chromosome information is available. 26 The telomere-to-telomere genome assembly generated in this study spans a physical distance pf 27 160.7 Mb and covers the entire genome of matsutake. The genome size of matsutake is larger than 28 that of other mushroom species 6,7 because of the high proportion of repetitive sequences (Table 2) 33 . 29 Owing to its high content of repetitive sequences (Table 2) and high heterozygosity (Supplementary 30 Figure S2), the matsutake genome could not be fully sequenced with short-read and error-prone 31 long-read sequencing technologies. The HiFi sequencing technology (~10 kb read length) employed 1 in this study likely helped overcome the problem posed by repetitive sequences, such as MarY1 (~6 2 kb), thus enabling the construction of the telomere-to-telomere genome assembly. Owing to the long 3 contigs and high genome coverage, 28,646 genes were predicted in the matsutake genome. Of these 4 genes, 15,885 had not been represented in the previous assembly (Trima3). 5 The genome sequences and predicted genes could help us understand the ecophysiology of a 6 shiro and thus reveal the mechanism of sporocarp formation. All SNPs, indels, and transposon 7 insertions in the genome were identified, and their chromosomal locations were determined. This 8 information could be used to reveal the genetic diversity of matsutake in nature, conserve its genetic 9 resources, and ensure its production. Furthermore, sequence variant analysis, followed by genome-10 wide association study, could reveal the genetic mechanisms underlying phenotypic variations in the 11 physiological and metabolomic traits of matsutake. As mentioned above, the matsutake genome 12 assembly constructed in this study could serve as a reference for further genetic studies.