Multi-omic analysis of a hyper-diverse plant metabolic pathway reveals evolutionary routes to biological innovation

The diversity of life on Earth is a result of continual innovations in molecular networks influencing morphology and physiology. Plant specialized metabolism produces hundreds of thousands of compounds, offering striking examples of these innovations. To understand how this novelty is generated, we investigated the evolution of the Solanaceae family-specific, trichome-localized acylsugar biosynthetic pathway using a combination of mass spectrometry, RNA-seq, enzyme assays, RNAi and phylogenetics in non-model species. Our results reveal that hundreds of acylsugars are produced across the Solanaceae family and even within a single plant, revealing this phenotype to be hyper-diverse. The relatively short biosynthetic pathway experienced repeated cycles of innovation over the last 100 million years that include gene duplication and divergence, gene loss, evolution of substrate preference and promiscuity. This study provides mechanistic insights into the emergence of plant chemical novelty, and offers a template for investigating the ∼300,000 non-model plant species that remain underexplored.


24
Since the first proto-life forms arose on our planet some four billion years ago, the forces 25 of mutation, selection and drift have generated a world of rich biological complexity. This 26 complexity, evident at all levels of biological organization, has intrigued humans for millennia 27 (Tipton, 2008;Mayr, 1985). Plant metabolism, estimated to produce hundreds of thousands of 28 products with diverse structures across the plant kingdom (Fiehn, 2002;Afendi et al., 2012), 29 2 provides striking examples of this complexity. Plant metabolism is traditionally divided into 30 primary and secondary/specialized, the former referring to production of compounds essential for 31 plant development and the latter encompassing metabolites documented as important for plant 32 survival in nature and metabolites of as yet unknown functional significance (Pichersky and 33 Lewinsohn, 2011;Moghe and Last, 2015). While primary metabolism generally consists of 34 highly conserved pathways and enzymes, specialized metabolic pathways are in a state of 35 continuous innovation (Milo and Last, 2012). This dynamism produced numerous lineage-36 specific metabolite classes such as steroidal glycoalkaloids in Solanaceae (Wink, 2003), 37 benzoxazinoid alkaloids in Poaceae (Dutartre et al., 2012), betalains in Caryophyllales 38 (Brockington et al., 2015) and glucosinolates in Brassicales (Halkier and Gershenzon, 2006). The 39 structural diversity produced by lineage-specific pathways makes them exemplary systems for 40 understanding the evolution of novelty in the living world. 41 Previous studies investigating the emergence of lineage-specific metabolite classes 42 uncovered the central role of gene duplication and diversification in this process: e.g. 43 duplications of isopropylmalate synthase (glucosinolates and acylsugars), cycloartenol synthase 44 (avenacin), tryptophan synthase (benzoxazinoids) and anthranilate synthase (acridone alkaloids) 45 (Moghe and Last, 2015). Duplications of members of enzyme families (e.g. cytochromes P450, 46 glycosyltransferases, methyltransferases, BAHD acyltransferases) also play major roles in generating chemical novelty, with biosynthesis of >40,000 structurally diverse terpenoids -48 produced partly due to genomic clustering of terpene synthases and cytochrome P450 enzymes 49 (Boutanaev et al., 2015) -as an extreme example. These duplicate genes, if retained, may 50 experience sub-or neo-functionalization via transcriptional divergence and evolution of protein-51 protein interactions as well as via changes in substrate preference, reaction mechanism and 52 allosteric regulation to produce chemical novelty (Moghe and Last, 2015). In this study, we 53 acylsugar biosynthetic network and illustrate the many ways by which chemical novelty emerges 515 in specialized metabolism. 516 In this study, we used an integrative strategy combining high-throughput technologies 517 and traditional biochemistry. S. lycopersicum is the flagship species of the Solanaceae family, 518 and we were able to apply the insights derived with this crop as a foundation to discover novel 519 enzymes in related species. Our investigations in Salpiglossis were also aided by the availability 520 of ASATs and NMR structures of their products in Petunia, which is more closely related to 521 Salpiglossis than tomato. "Anchor species" such as Petunia, which are phylogenetically distant 522 from flagship/model species, can enable study of a different region of the phylogenetic tree of 523 the clade of interest. Development of limited genomic and functional resources in such anchor 524 species, coupled with integrative, comparative approaches can offer more efficient routes for the 525 exploration of biochemical complexity in the ~300,000 plant species estimated to exist on our 526 planet (Mora et al., 2011). 527

Plant acylsugar extractions and mass spectrometric analyses 529
Acylsugar extractions were carried out from plants at the New York Botanical Gardens and from 530 other sources (Figure 1-File Supplement 1). The plants sampled were at different 531 developmental stages and were growing in different environments. The extractions were carried 532 out using acetonitrile:isopropanol:water in a 3:3:2 proportion similar to previous descriptions 533 (Schilmiller et al., 2010;Ghosh et al., 2014;Fan et al., 2016a) with the exception of gently 534 shaking the tubes by hand for 1-2 minutes. All extracts were analyzed on LC/MS using 7-min, 535 22-min or 110-min LC gradients on Supelco Ascentis Express C18 (Sigma Aldrich, USA) or 536 Waters BEH amide (Waters Corporation, USA) columns (Supplemental Table 1), as described 537 previously (Schilmiller et al., 2010;Ghosh et al., 2014;Fan et al., 2016a). While the 110-min 538 method was used to minimize chromatographic overlap in support of metabolite annotation in 539 samples with diverse mixtures of acylsugars, targeted extracted ion chromatogram peak area 540 quantification was performed using the 22-min method data. The QuanLynx function in 541 MassLynx v4.1 (Waters Corporation, USA) was used to integrate extracted ion chromatograms 542 for manually selected acylsugar and internal standard peaks. Variable retention time and 543 chromatogram mass windows were used, depending on the experiment and profile complexity. 544 Peak areas were normalized to the internal standard peak area and expressed as a proportion of 545 mg of dry weight. 546

RNA-seq data analysis 547
RNA was extracted and sequenced as described in the Supplemental Methods. The mRNA-seq 548 reads were adapter-clipped and trimmed using Trimmomatic v0.32 using the parameters 549 (LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:50). The quality-trimmed reads 550 from all datasets of a species were assembled de novo into transcripts using Trinity 551 v.20140413p1 after read normalization (max_cov=50,KMER_SIZE=25,max-pct-552 stdev=200,SS_lib_type=RF) (Grabherr et al., 2011). We tested three different kmer values 553 (k=25, 27, 31) and selected the best kmer value for each species based on contig N50 values, 554 BLASTX hits to the S. lycopersicum annotated protein sequences and CEGMA results (Parra et 555 al., 2007). A minimum kmer coverage of 2 was used to reduce the probability of erroneous or 556 low abundance kmers being assembled into transcripts. After selecting the best assembly for 557 each species, we obtained a list of transcripts differentially expressed between trichomes and 558 stem/petiole for each species using RSEM/EBSeq (Li and Dewey, 2011;Leng et al., 2013) at an 559 FDR threshold of p<0.05. The differential expression of five transcripts in S. quitoense and four 560 transcripts in Salpiglossis was confirmed using semi-quantitative RT-PCR, along with the PDS 561 as negative control (Table 1- Figure Supplement 1). 562 Purification and structure elucidation of acylsugars using NMR 563 For metabolite purification, aerial tissues of 28 Salpiglossis plants (aged 10 weeks) were 564 extracted in 1.9 L of acetonitrile:isopropanol (AcN:IPA, v/v) for ~10 mins, and ~1 L of the 565 extract was concentrated to dryness on a rotary evaporator and redissolved in 5 mL of AcN:IPA. 566 Repeated injections from this extract were made onto a Thermo Scientific Acclaim 120 C18 567 HPLC column (4.6 x 150 mm, 5 µm particle size) with automated fraction collection. HPLC 568 fractions were concentrated to dryness under vacuum centrifugation, reconstituted in AcN:IPA 569 and combined according to metabolite purity as assessed by LC/MS. Samples were dried under 570 N 2 gas and reconstituted in 250 or 300 µL of deuterated NMR solvent CDCl 3 (99.8 atom % D) 571 and transferred to solvent-matched Shigemi tubes for analysis. 1 H, 13 C, J-resolved 1 H, gCOSY, 572 gHSQC, gHMBC and ROESY NMR experiments were performed at the Max T. Rogers NMR 573 Facility at Michigan State University using a Bruker Avance 900 spectrometer equipped with a 574 20 TCI triple resonance probe. All spectra were referenced to non-deuterated CDCl 3 solvent signals 575 (δ H = 7.26 and δ C = 77.20 ppm). 576

Virus induced gene silencing 602
Primers to amplify fragments of transcripts for VIGS were chosen to minimize probability of 603 cross silencing other Salpiglossis transcripts due to sequence similarity or due to homopolymeric 604 regions (Figure 2-File Supplement 1,2). This was achieved using a custom Python script 605 ( Figure 3-File Supplement 1), which divided the entire transcript of interest in silico into 606 multiple overlapping 20nt fragments, performed a BLAST against the Salpiglossis transcriptome 607 and tomato genome and flagged fragments with >95% identity and/or >50% homopolymeric 608 stretches. Contiguous transcript regions with >12 unflagged, high-quality fragments were 609 manually inspected and considered for VIGS. 1-2 300bp regions were cloned into the pTRV2-610 LIC vector (Dong et al., 2007) and transformed into Agrobacterium tumefaciens GV3101. 611 Agroinfiltration of Salpiglossis plants was performed as described previously (Velásquez et al.,612 2009) using the prick inoculation method. Salpiglossis phytoene desaturase was used as the 613 positive control for silencing. Empty pTRV2-LIC or pTRV2 vectors were used as negative 614 controls. Each VIGS trial was done slightly differently while modifying the growth, 615 transformation and maintenance conditions of this non-model species. The experimental details, 616 including the optimal growth and VIGS conditions that give the fastest results, are noted in 617

Phylogenetic inference 619
All steps in the phylogenetic reconstruction were carried out using MEGA6 (Tamura et al., 620 2013). Amino acid sequences were aligned using Muscle with default parameters. Maximum 621 likelihood and/or neighbor joining (NJ) were used to generate phylogenetic trees. For maximum 622 likelihood, the best evolutionary model (JTT[Jones-Taylor-Thornton]+G+I with 5 rate 623 categories) was selected based on the Akaike Information Criterion after screening several 624 models available in the MEGA6 software, while for NJ, the default JTT model was used. 625 Support values were obtained using 1000 bootstrap replicates, however trees obtained using 100 626 bootstrap replicates also showed similar overall topologies. Trees were generated either using 627 "complete deletion" or "partial deletion with maximum 30% gaps/missing data" options for tree 628 reconstruction. Sequences <350aa (eg: Solyc10g079570) were excluded from this analysis. 629

B.
C.   File Supplement 1 is the Python script used for picking the most optimal fragments and Source Data 1 includes values obtained from qPCR analysis. (C,D) SsASAT1 knockdown using two different constructs (SsASAT1-1, SsASAT1-2) shows reduction in acylsugar levels. The SsASAT1-1 phenotype is more prominent than the ASAT1-2 phenotype, being significantly lower (p=0.05; KS test). One construct for SsASAT2 (SsASAT2-1) also showed significant decrease in acylsugar levels (p=0.03; KS test). Note that the Y-axis total ion intensity in (C) is different for each chromatogram. (E,F) SsASAT2 knockdown leads to drops in levels of higher molecular weight acylsugars. In (C-F), number of plants used for statistical analysis is noted. Source Data 2 describes normalized peak areas from VIGS plants used for making these inferences. Results of the second set of biological replicate experiments performed at a different time under a different set of conditions --as described in Table  Supplement 1 --is shown in Figure Supplement 1. Figure Supplement 2

B. A.
Relative abundane (%)   Statistical significance of the difference between TRV2-LIC and ASAT3-2 was tested using Kolmogrov-Smirnov test. ** represents p<0.01 and * represents p<0.05. Boxes where the acylsugars had significantly different levels compared to TRV2-LIC are highlighted in green. Acylsugar peak area was normalized as described in the Methods.  We make the assumption that fragment ions and the most abundant pseudomolecular ions appearing at the same retention time are associated. In all three cases, assuming that the original mono-and di-acylations occurred on the pyranose (P) ring, we can infer that the subsequent C2 and C5 additions giving rise to the tri-and tetra-acylated sugars occurred on the furanose (F) ring.              Fig. 6 are shown by a green square. All trees were generated using NJ, using the JTT model with 100 bootstraps. All sites with <80% coverage were eliminated. ASAT activities overlaid on the Solanaceae phylogenetic relationships. Each colored square represents a single ASAT, starting from ASAT1 and moving sequentially down the pathway to ASAT5 (left to right, sequentially). Homologs are represented by the same color. Squares not contained within a box are experimentally validated activities. A solid black box indicates that a highly identical transcript exists in the RNA-seq dataset and is trichome-high. A dashed box indicates that, based on a BLAST search, the sequence exists in the genome for the contained enzymes. In vitro validated S. nigrum and H. niger ASAT activities, including their phylogenetic positions, are shown in Figure Supplement 1. Species selected for RNA-seq in this study are colored blue, previously studied species are highlighted in pink and Convolvulaceae species are in red. See Source Data 1 for results of the BLAST analysis. (B) Orthologous genomic regions between three species harboring aASAT1 orthologs. Each gene in the region is shown by a colored block. Orthologous genes are represented by the same color. The PaASAT1 gene (red box) has two homologous sequences in the Capsicum syntenic region, but one of them is truncated. aASAT1 ortholog is not seen in the syntenic region in tomato. Genes used to make this figure are described in Source Data 2. (C) Substrate utilization of aASAT2 orthologs from multiple species is described based on the activities presented in Figure Supplement 1. Hyoscyamus niger (Hn), Salpiglossis sinuata (Ss), Solanum nigrum (Sn) and Solanum lycopersicum (Sl). The hydroxyl group highlighted in red shows the predicted position of acylation by the respective aASAT2 ortholog.    (A) Neighbor-joining tree made using Salpiglossis and tomato ASATs and their best hits in S. nigrum (Sn) and H. niger (Hn) transcriptome database. Tree was made using default parameters in MEGA6 for alignment and phylogeny, and identifies putative aASAT2 orthologs in S. nigrum and H. niger. (B-D)Shown are LC/MS extracted ion chromatograms for S1:6(6) [B], S1:12 (12) [C] and S2:10(5,5) [D]. All reactions were run on a C18 LC column. Results show that Hn, Ss, Sn and SlASAT1s can catalyze sucrose to mono-acylsucrose conversion using aiC6 and nC12 CoA. However, aASAT2 orthologs in Hn and Ss are not able to catalyze this reaction. (D) HnASAT2 is able to catalyze di-acylsucrose formation using the HnASAT1 product.  Events in plant evolution are noted on the left and the acylsugar biosynthetic pathway developments are noted on the right of the timeline. Tri-acylsugars produced by Salpiglossis and tomato ASAT1,2,3 enzymes are shown with the chain color corresponding to the color of the enzyme that adds the chain. Images obtained from Wikimedia Commons under Creative Commons CC BY-SA 3.0 license. We note that Salpiglossis and SsASAT activities should not be considered as "ancestral" and Tomato and SlASAT activities as "modern", since all of them exist today. We use the word "lineages" (eg: Salpiglossis lineage) to refer to the branch leading up to the currently existing (extant) species.

Extraction and sequencing of RNA
Total RNA was extracted from 4-5 week old (S. nigrum, S. quitoense) or 7-8 week old plants (H. niger, Salpiglossis) using Qiagen RNEasy kit (Qiagen, Valencia, California) with on-column DNA digestion. In addition to trichome RNA, total RNA was extracted from shaved stems of S. nigrum and Salpiglossis, and shaved petioles of S. quitoense and H. niger. The quality of extracted RNA was determined using Qubit (Thermo Fisher Scientific, USA) and Bioanalyzer

Confirmation of Salpiglossis sinuata
We confirmed the plant to be Salpiglossis using the chloroplast ndhF and trnLF marker based phylogenies (Figure 2-Figure Supplement 1A,B). Specifically, we amplified these regions using locus-specific primers, sequenced the amplicons and assembled the contigs using Muscle (Edgar, 2004). A neighbor joining tree including ndhF and trnLF sequences from NCBI Genbank was used to confirm the identity of the plant under investigation.

Prediction of protein sequences and orthologous group assignments
The protein sequences corresponding to the longest isoform of all expressed transcripts (read count>10 in at least one dataset in a given species) were obtained using TransDecoder (Haas et al., 2013) and GeneWise v2.1.20c (Birney et al., 2004). Only the protein sequences of transcripts with ≥10 reads as defined by RSEM were used for constructing orthologous groups using OrthoMCL v5 (Li et al., 2003). We defined orthologous relationships between S. lycopersicum, S. nigrum, S. quitoense, H. niger, N. benthamiana, Salpiglossis and Coffea canephora (outgroup) using an inflation index of 1.5.

Gene Ontology enrichment analyses
We transferred the tomato gene ontology assignments to the homologs from other species in the same orthologous group. GO enrichment analysis was performed using a custom R script, and enriched categories were obtained using Fisher Exact Test and correction for multiple testing based on Q-value (Storey, 2002).

qRT-PCR
Primers to specific regions of the targeted transcript were designed with amplicons between 100-200 bp using Primer3 (Untergasser et al., 2012). The regions selected for amplification did not overlap with the region targeted in the VIGS analysis. Primer sets for Salpiglossis orthologs of PDS and elongation factor alpha (EF1a) were used as controls. We used 1 µg of total RNA from a single VIGS plant with an acylsugar phenotype to generate cDNA using the Thermo Fisher Superscript II RT kit. An initial amplification and visualization on a 1% agarose gel was performed to ensure that the primers yielded an amplicon with the predicted size and did not show visible levels of primer dimers. We first tested multiple primer sets per gene and selected primers within 85-115% efficiency range using a dilution series of cDNA from uninoculated plants. These primers were used for the final qRT-PCR reaction. The Ct values for the transcripts (on 1x template) were measured in triplicate, which were averaged for the analysis. Both uninoculated and empty vector controls were measured with all primer sets for ΔΔCt calculations.

Similarity searches using BLAST
Similarity searches against the 1kp and NCBI nr databases were performed using TBLASTN, using SlASAT1, SlASAT3, SsASAT1 and SsASAT2 protein sequences as queries. For the 1kp database, TBLASTN was performed against all Asterid sequences, and the best non-Solanaceae sequences were analyzed using phylogenetic tree reconstruction (see below; We used a similar approach to search the annotated peptide sequences in C. canephora (Denoeud et al., 2014) and Ipomoea trifida (Hirakawa et al., 2015) sequence databases.
Sequences that provided additional insights into the evolution of the ASAT clade were integrated into the final gene tree shown in Figure 6A.

Synteny analysis
Synteny between the Petunia, Capsicum and tomato genomes was determined as previously described using MCScanX (Moghe et al., 2014;Wang et al., 2012). Regions corresponding to PaASAT1 and its best matching homologs were used to make Figure 7B.

Estimating the number of acylsugars
This and previous studies provide evidence for a number of acyl chains esterified to sucrose, classified into short (C2, C3, C4, iC5, aiC5, iC6, aiC6, C8), long (nC10, iC10, nC11, nC12) and non-aliphatic (malonyl). We only focused on aliphatic chains for this estimate. We typically see one long chain and 3-4 short chains on the sugar molecule across different Solanaceae species. In our calculation, we assumed that each acyl chain has the same probability of being incorporated into the acylsugar, with the core being a sucrose core. Under these assumptions, there are 8 acyl chains that could be incorporated at three positions on a tetra-acylsugar and 12 acyl chains on the fourth position. This gives a theoretical estimate of 8*8*8*12=6144 acylsucroses that could be produced with the above acyl chains.