ABSTRACT
Reference datasets are critical in computational biology. They help define canonical biological features and are essential for benchmarking studies. Here, we describe a comprehensive reference dataset of experimentally validated plant NLR immune receptors. RefPlantNLR consists of 415 NLRs from 31 genera belonging to 11 orders of flowering plants. We used RefPlantNLR to determine the canonical features of functionally validated plant NLRs. This reference dataset should prove useful for benchmarking NLR annotation tools, guiding comparative analyses of NLRs across the wide spectrum of plant diversity and identifying under-studied taxa. We hope that the RefPlantNLR resource will contribute to moving the field beyond a uniform view of NLR structure and function.
INTRODUCTION
Reference datasets are critical in computational biology (Weber et al., 2019; Schaafsma and Vihinen, 2018). They help define canonical biological features and are essential to benchmarking studies. Reference datasets are particularly important for defining the sequence and domain features of gene and protein families. Despite this, curated collections of experimentally validated sequences are still lacking for several widely studied gene and protein families. One example is the nucleotide-binding leucine-rich repeat (NLR) family of plant proteins. NLRs constitute the predominant class of disease resistance (R) genes in plants (Jones et al., 2016; Kourelis and van der Hoorn, 2018; de Araújo et al., 2020). They function as intracellular receptors that detect pathogens and activate an immune response that generally leads to disease resistance. NLRs are thought to be engaged in a coevolutionary tug of war with pathogens and pests. As such, they tend to be among the most polymorphic genes in plant genomes, both in terms of sequence diversity and copy-number variation (Tamborski and Krasileva, 2020). Ever since their first discovery in the 1990s, hundreds of NLRs have been characterized and implicated in pathogen and self-induced immune responses (Kourelis and van der Hoorn, 2018). NLRs are among the most widely studied and economically valuable plant proteins given their importance in breeding crops with disease resistance (Dangl et al., 2013).
NLRs occur widely across all kingdoms of life where they generally function in non-self perception and innate immunity (Jones et al., 2016; Uehling et al., 2017). In the broadest biochemical definition, NLRs share a similar multidomain architecture consisting of a nucleotide-binding and oligomerization domain (NOD) and a super-structure forming repeat (SSFR) domain (Dyrka et al., 2020). The NOD is either an NB-ARC (nucleotide-binding adaptor shared by APAF-1, certain R gene products and CED-4) or a NACHT (neuronal apoptosis inhibitory protein, MHC class II transcription activator, HET-E incompatibility locus protein from Podospora anserina, and telomerase-associated protein 1) domain, whereas the SSFR domain can be formed by ankyrin (ANK) repeats, tetratricopeptide repeats (TPRs), armadillo (ARM) repeats, WD repeats or leucine-rich repeats (LRRs) (Dyrka et al., 2020; Andolfo et al., 2019). Plant NLRs exclusively carry an NB-ARC domain, with the C-terminal SSFR typically consisting of LRRs. The NB-ARC domain has been used to determine the evolutionary relationships between plant NLRs given that it is the only domain that produces reasonably good global alignments across all members of the family. In flowering plants (angiosperms), NLRs form three main monophyletic groups with distinct N-terminal domain fusions: the TIR-NLR subclade containing an N-terminal Toll/interleukin-1 receptor (TIR) domain, the CC-NLR subclade containing an N-terminal Rx-type coiled-coil (CC) domain and the CCR-NLR subclade containing an N-terminal RPW8-type CC (CCR) domain (Tamborski and Krasileva, 2020). Additionally, Lee et al. (2020) have recently proposed that the G10 subclade of NLRs is a monophyletic group containing a distinct type of CC (CCG10; CCG10-NLR). NLRs also occur in non-flowering plants where they carry additional types of N-terminal domains such as kinases and α/β hydrolases (Andolfo et al., 2019).
Plant NLRs likely evolved from multifunctional receptors to specialized receptor pairs and networks (Adachi, Contreras, et al., 2019; Adachi, Derevnina, et al., 2019). NLRs which combine pathogen detection and immune signalling activities into a single protein are referred to as “functional singletons”, whereas NLRs which have specialized in pathogen recognition or immune signalling are referred to as “sensor” or “helper” NLRs, respectively. About one quarter of NLR genes occur as “genetic singletons” in plant genomes, whereas the others form genetic clusters often near telomeres (Jacob et al., 2013). This genomic clustering likely aids the evolutionary diversification of this gene family and subsequent emergence of pairs and networks (Tamborski and Krasileva, 2020; Adachi, Derevnina, et al., 2019). The emerging picture is that NLRs form genetic and functional receptor networks of varying complexity (Wu et al., 2018; Adachi, Derevnina, et al., 2019).
The mechanism of pathogen detection by NLRs can be either direct or indirect (Kourelis and van der Hoorn, 2018). Direct recognition involves the NLR protein binding a pathogen-derived molecule or serving as a substrate for the enzymatic activity of a pathogen virulence protein (known as an effector). Indirect detection is conceptualized by the guard and decoy models, where the status of a host component (the guardee or decoy) is monitored by the NLR (van der Biezen and Jones, 1998; van der Hoorn and Kamoun, 2008). Some sensor NLRs known as NLR-IDs contain non-canonical “integrated domains” that can function as decoys to bait pathogen effectors and enable pathogen detection (Cesari et al., 2014; Sarris et al., 2016; Wu et al., 2015). These extraneous domains appear to have evolved by fusion of an effector target domain into an NLR (Cesari et al., 2014; Sarris et al., 2016; Białas et al., 2017). The sequence diversity of integrated domains in NLR-IDs is staggering, indicating that novel domain acquisitions have repeatedly occurred throughout the evolution of plant NLRs (Sarris et al., 2016; Kroj et al., 2016).
Given their multidomain nature, sequence diversity and complex evolutionary history, prediction of NLR genes from plant genomes is challenging. Several bioinformatic tools have been developed to extract plant NLRs from sequence datasets. As input, these tools take either annotated genomic features and transcriptomic data or the unannotated genomic sequence itself. NLR-Parser, RGAugury, RRGPredictor, and DRAGO2 identify transcript and protein sequences that have features of NLRs and are best described as NLR extractors (Steuernagel et al., 2015; Li et al., 2016; Osuna-Cruz et al., 2018; Santana Silva and Micheli, 2020). RGAugury, RRGPredictor, and DRAGO2 also extract other classes of immune-related genes in addition to NLRs. These tools all use pre-defined motifs to classify sequences as NLRs, but they differ in their methods and pipelines. NLR-Annotator (an extension of NLR-Parser) and NLGenomeSweeper use unannotated genome sequences as input to predict the genomic locations of NLRs (Steuernagel et al., 2020; Toda et al., 2020). This output then requires manual annotation to extract the final gene models, and some of the annotated loci may represent partial or pseudogenized genes.
The goal of this study is to provide a curated reference dataset of experimentally validated plant NLRs. This version of RefPlantNLR (v.20200528_415) consists of 415 NLRs from 31 genera belonging to 11 orders of flowering plants. We used RefPlantNLR to determine the canonical features of functionally validated plant NLRs. RefPlantNLR can also be used to benchmark NLR annotation tools, to guide comparative and phylogenetic analyses of plant NLRs, and to identify under-studied taxa for future studies.
RESULTS and DISCUSSION
RefPlantNLR pipeline
The current version of RefPlantNLR (v.20200528_415) contains 415 entries from 31 genera of flowering plants (Figure 1; Supplemental dataset 1, 2, 3). The list was obtained by mining and assessing information gathered from the literature. Briefly, we manually crawled through the literature, extracting plant NLRs that have been experimentally validated to at least some degree. We defined experimental validation broadly as genes reported to be involved in any of the following: 1) disease resistance, 2) disease susceptibility, including effector-triggered immune pathology or trailing necrosis to viruses, 3) hybrid necrosis, 4) autoimmunity, 5) NLR helper function or involvement in downstream immune responses, 6) negative regulation of immunity, and 7) well-described allelic series of NLRs with different pathogen recognition spectra even if not reported to confer disease resistance.
To validate the recovered sequences as genuine NLRs, we annotated them using InterProScan (Finn et al., 2017; see Material & Methods for the sequence signatures used). We defined NLRs as sequences containing the NB-ARC domain (Pfam signature PF00931) and at least one other annotated domain. This resulted in 414 sequences. We also included the rice protein Pb1, which carries an overall NLR-type domain architecture (N-terminal CC domain and C-terminal LRRs) but contains an atypical NOD domain instead of the Pfam NB-ARC signature (Hayashi et al., 2010). Altogether, these 415 sequences form the current version of RefPlantNLR.
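The inclusion rule described above can be sketched as a small filter. This is an illustration rather than the pipeline used for RefPlantNLR: it assumes that InterProScan hits have already been parsed into (protein ID, signature accession) pairs, the function name `select_nlrs` and the example proteins are hypothetical, and "at least one other annotated domain" is read simply as at least one signature accession other than PF00931.

```python
from collections import defaultdict

NB_ARC_PFAM = "PF00931"  # Pfam NB-ARC signature named in the text


def select_nlrs(hits):
    """Return sorted IDs of proteins with NB-ARC plus one or more other signatures.

    `hits` is an iterable of (protein_id, signature_accession) pairs.
    This input layout is an assumption made for the sketch.
    """
    by_protein = defaultdict(set)
    for protein_id, accession in hits:
        by_protein[protein_id].add(accession)
    return sorted(
        protein_id
        for protein_id, accessions in by_protein.items()
        if NB_ARC_PFAM in accessions and len(accessions - {NB_ARC_PFAM}) >= 1
    )


# Hypothetical example proteins, not RefPlantNLR entries.
example_hits = [
    ("protA", "PF00931"), ("protA", "PF18052"),  # NB-ARC + Rx-type CC
    ("protB", "PF00931"),                        # NB-ARC only
    ("protC", "PF01582"), ("protC", "PF13855"),  # TIR + LRR, no NB-ARC
]
print(select_nlrs(example_hits))
```

Under this reading only `protA` passes the filter. A protein such as Pb1, which lacks the Pfam NB-ARC signature, would be excluded by the rule and would have to be added by hand, as was done for the dataset.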
Description of the RefPlantNLR dataset
Table S1 describes the RefPlantNLR dataset, including amino acid, coding sequence (CDS) and locus identifiers, as well as the organism from which the NLR was cloned, the article describing the identification of the NLR, the pathogen type and pathogen to which the NLR provides resistance (when applicable), the matching pathogen effector, additional host components required for pathogen recognition (guardees or decoys) or required for NLR function, and the articles describing the identification of the pathogen and host components. We also provide domain annotations of the amino acid and CDS sequences (Supplemental dataset 4, 5). From these datasets, we extracted 407 unique NLRs and 347 unique NB-ARC domains (SUPERFAMILY signature SSF52540; Supplemental dataset 6, 7). NLRs with identical amino acid sequences were recovered because they have different resistance spectra when genetically linked to different sensor NLR alleles (e.g. alleles of Pik), differ in non-coding regions leading to altered regulation (e.g. RPP7 alleles) or have been independently discovered in different plant genotypes (e.g. RRS1-R and SLH1).
The distribution of RefPlantNLR entries across plant species mirrors the most heavily studied taxa: Arabidopsis, Solanaceae (Solanum, Capsicum and Nicotiana) and cereals (Oryza, Triticum and Hordeum) (Figure 1A). These seven genera comprise 77% (321 out of 415) of the RefPlantNLR sequences. When accounting for redundancy by collapsing similar sequences (>90% NB-ARC amino acid identity per genus), these seven genera still account for 77% (181 out of 235) of the sequences (Figure 1B). It should be noted that the NB-ARC domain may evolve at different rates in different NLRs, and hence some subfamilies may still be overrepresented in the reduced redundancy set.
In total, 31 plant genera representing 11 taxonomic orders are listed in RefPlantNLR. Interestingly, these species represent only a small fraction of plant diversity with only 11 of 59 major seed plant (spermatophyte) orders described by Smith and Brown represented, and no single entry from non-flowering plants (Table S2) (Smith and Brown, 2018). Arabidopsis remains the only species with experimentally validated NLRs from the four major clades (CC-NLR, CCG10-NLR, CCR-NLR and TIR-NLR) (Figure 1).
The average length of RefPlantNLR sequences varies depending on the subclass (Figure 2A, Figure 2C for the reduced redundancy set). CC-NLRs varied from 696 to 1824 amino acids (mean = 1089, N = 288), whereas TIR-NLRs varied from 380 to 2048 amino acids (mean = 1169, N = 92). NB-ARC domains were more constrained (mean = 265, N = 347, stdev = 20) (Figure 2B). Nonetheless, 11 atypically short NB-ARCs (148 to 224 amino acids) and 4 long NB-ARCs (308 to 339 amino acids) fell more than two standard deviations from the mean, illustrating the overall flexibility of plant NLRs even for this canonical domain (Figure 2B, Figure 2D for the reduced redundancy set).
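As a worked illustration of the two-standard-deviation criterion, the reported mean (265) and standard deviation (20) give cut-offs of roughly 225 and 305 amino acids, which agrees with the reported short (148 to 224) and long (308 to 339) NB-ARC ranges. The sketch below is not the analysis code used here: the function name is hypothetical, the example list contains only the four range endpoints from the text plus two made-up in-range values, and the reported mean and standard deviation are rounded, so the cut-offs are approximate.

```python
def length_outliers(lengths, mean, stdev, n_sd=2.0):
    """Split lengths into those below and above mean +/- n_sd * stdev."""
    lower = mean - n_sd * stdev
    upper = mean + n_sd * stdev
    short = [x for x in lengths if x < lower]
    long_ = [x for x in lengths if x > upper]
    return lower, upper, short, long_


# 148, 224, 308 and 339 are the range endpoints reported in the text;
# 265 and 300 are made-up in-range values added for contrast.
example_lengths = [148, 224, 265, 300, 308, 339]
lower, upper, short, long_ = length_outliers(example_lengths, mean=265, stdev=20)
print(lower, upper, short, long_)
```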
We noted that some of the unusually small NLRs lacked an SSFR domain, while some of the small NB-ARC domains appeared to be partial duplications of this domain. In order to look at domain architecture of NLRs more widely and to determine whether these unusual features are common, we mapped the domain architecture of RefPlantNLR proteins (Figure 3A, Figure 3B for the reduced redundancy set). Even though CC-NLR and TIR-NLR domain combinations were the most frequent (60% and 19%, respectively), we observed additional domain combinations. In the RefPlantNLR dataset, a subset of NLRs lack the N-terminal domain but still group with the major NLR clades based on the NB-ARC phylogeny. Some TIR-NLRs lack an SSFR domain. Non-canonical integrated domains are found in all NLR subfamilies, and occur at the N-terminus, in between the N-terminal domain and the NB-ARC domain, at the C-terminus or at both ends. Of these non-canonical domains, the N-terminal late-blight resistance protein R1 domain (also known as the Solanaceae domain; Pfam signature PF12061) only occurs in association with the NB-ARC domain, and has an ancient origin likely in the most recent common ancestor of the Asterids and Amaranthaceae (Seong et al., 2020). Other non-canonical domains are also widespread, including the monocot-specific integration of a zinc-finger BED domain, which occurs in between the CC and NB-ARC domain (Bailey et al., 2018; Marchal et al., 2018). Finally, whereas some CC-NLRs contain a partial duplication of the NB-ARC domain, TIR-NLRs appear to allow a more flexible domain architecture with multiple combinations of TIR and NB-ARC domains within the same protein (Figure 3C).
We explored the phylogenetic diversity of RefPlantNLR proteins using the extracted NB-ARC domains (Figure 4; Supplemental dataset 8, 9). As with previously reported NLR phylogenetic analyses, RefPlantNLR sequences generally grouped in well-defined clades, notably CC-NLR, CCG10-NLR, CCR-NLR and TIR-NLR. Within this phylogeny some of the branches, notably Pi54, are long and may represent highly diverged NB-ARC domains. Since Pb1 does not match the Pfam NB-ARC domain it was not included in this phylogenetic analysis.
Applications of the RefPlantNLR dataset
There are many applications of the RefPlantNLR dataset. RefPlantNLR should prove useful for benchmarking NLR extraction and annotation tools. The majority of the reported NLR prediction tools have only been evaluated using the Arabidopsis NLRome, which is not representative of NLR diversity across flowering plants (Figure 1). Additionally, it would be useful to precisely document the limitations and biases of each prediction method to enable users to select the approach that best matches their needs and project objectives. Ideally, NLR extraction tools would correctly identify the N-terminal domain, the boundaries of the NB-ARC, the type of SSFR domain and any non-canonical domains that are present.
Another use of RefPlantNLR would be to provide reference points for newly discovered NLRs and in large-scale phylogenetic analyses of NLRs. Phylogenetic analyses can be readily performed by extracting the SUPERFAMILY signature SSF52540 of the NLRs of interest and aligning the retrieved sequences to the NB-ARC domains of the RefPlantNLR list (Supplemental dataset 7). Phylogenetic analyses would help assign NLRs to subclades and provide a basis for generating hypotheses about the function and mode of action of novel NLRs which phylogenetically cluster with experimentally validated NLRs. This type of phylogenetic information can be combined with other features such as genetic clustering and has, for instance, proven valuable in our work on rice and solanaceous NLRs (Białas et al., 2017; Wu et al., 2017) and for defining the CCG10-NLR class (Lee et al., 2020).
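The domain extraction step that precedes such an alignment can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used for RefPlantNLR (where Geneious Prime was used): it assumes a nine-column GFF3 annotation in protein coordinates in which the signature accession is stored in a `Name=` attribute, and the function name, the identifier suffix scheme and the example protein are hypothetical.

```python
def extract_domains(sequences, gff3_lines, signature="SSF52540"):
    """Return (identifier, subsequence) pairs for every hit to `signature`.

    `sequences` maps protein IDs to amino acid strings. The `Name=` attribute
    layout of the GFF3 lines is an assumption made for this sketch.
    """
    counts = {}
    extracted = []
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        seqid, start, end, attributes = cols[0], int(cols[3]), int(cols[4]), cols[8]
        attrs = dict(item.split("=", 1) for item in attributes.split(";") if "=" in item)
        if attrs.get("Name") != signature or seqid not in sequences:
            continue
        # Number the hits so that proteins with several domains get unique IDs.
        counts[seqid] = counts.get(seqid, 0) + 1
        # GFF3 coordinates are 1-based and inclusive.
        subsequence = sequences[seqid][start - 1:end]
        extracted.append((f"{seqid}_{signature}_{counts[seqid]}", subsequence))
    return extracted


# Hypothetical protein and annotation, for illustration only.
example_sequences = {"protA": "MAAAAGGGGGKKKKK"}
example_gff3 = [
    "##gff-version 3",
    "protA\tSUPERFAMILY\tprotein_match\t6\t10\t.\t+\t.\tName=SSF52540",
    "protA\tPfam\tprotein_match\t11\t15\t.\t+\t.\tName=PF13855",
]
print(extract_domains(example_sequences, example_gff3))
```

The extracted subsequences can then be written to a fasta file and aligned together with the NB-ARC domains in Supplemental dataset 7.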
RefPlantNLR highlights the plant taxa that remain under-studied in NLR biology. Table S2 reveals that 48 out of the 59 major seed plant clades recently defined by Smith and Brown (Smith and Brown, 2018) do not have a single experimentally validated NLR. Certain taxa have subfamily-specific contractions and expansions, and hence may contain unexplored genetic and biochemical diversity of NLR function. For example, while all currently studied CCR-NLRs act as helper NLRs for TIR-NLRs, it has been reported that the CCR-NLR subfamily has experienced clade-specific expansions in gymnosperms and rosids, pointing to potential biochemical specialization of this subfamily in these taxa (Van Ghelder et al., 2019). In addition, although NLRs have been reported in non-seed plants and some of these appear to have distinct N-terminal domains (Andolfo et al., 2019), their experimental validation is still lacking.
The RefPlantNLR dataset has inherent limitations due to its focus on experimentally validated NLRs. First, it is biased towards a few highly studied model species and crops as illustrated in Figure 1. Additionally, RefPlantNLR entries are somewhat redundant with particular NLR allelic series, such as the monocot MLA and spinach alpha-WOLF, being overrepresented in the dataset (Figure 1, Figure 4). These issues, notably redundancy, will need to be considered for certain applications such as benchmarking. Our recommendation is to use the reduced redundancy dataset of 235 sequences for benchmarking exercises (Supplemental dataset 10, 11).
Conclusion
We hope that the RefPlantNLR resource will contribute to moving the field beyond a uniform view of NLR structure and function. It is now evident that NLRs are more structurally and functionally diverse than anticipated. Whereas a number of plant NLRs have retained the presumably ancestral three-domain architecture of the TIR/CCR/CCG10/CC fused to the NB-ARC and LRR domains, many NLRs have diversified into specialized proteins with degenerated features and extraneous non-canonical integrated domains (Sarris et al., 2016; Kroj et al., 2016; Adachi, Derevnina, et al., 2019). Therefore, it is time to question holistic concepts such as effector-triggered immunity (ETI) and appreciate the wide structural and functional diversity of NLR-mediated immunity. More specifically, a robust phylogenetic framework of plant NLRs should be fully integrated into the mechanistic study of these exceptionally diverse proteins.
MATERIAL & METHODS
Sequence retrieval
RefPlantNLR was assembled by manually crawling the literature for experimentally validated NLRs according to the criteria described in the results section. NLRs are defined as having an NB-ARC and at least one additional domain. Where possible, the amino acid and nucleotide sequences were taken from GenBank. For some NLRs, only the mRNA has been deposited and no genomic locus information was present. When GenBank records were not available, the sequences were extracted from the matching whole-genome sequencing projects or from articles and patents describing the identification of these NLRs.
Domain annotation
Protein sequences were annotated with CATH-Gene3D (v4.2.0) (Dawson et al., 2017), SUPERFAMILY (v1.75) (Gough et al., 2001), PRINTS (v42.0) (Attwood et al., 2000), and Pfam (v32.0) (El-Gebali et al., 2019) identifiers using InterProScan (v5.44-79.0) (Finn et al., 2017). A custom R script (Appendix S1) was used to convert the InterProScan output to the final GFF3 annotation of protein and CDS using the InterPro descriptions (Appendix S2). We routinely use Geneious Prime (v2020.1.2) (https://www.geneious.com) to visualize these annotations on the sequence. The NLR-associated signature motifs/domain IDs are:
Late blight resistance protein R1: PF12061
Rx-type CC: PF18052, cd14798
RPW8-type CC: PF05659, PS51153
TIR: PF01582, G3DSA:3.40.50.10140, SSF52200, PF13676
NB-ARC: PF00931
NB-ARC used for phylogenetic analysis: SSF52540
LRRs: G3DSA:3.80.10.10, G3DSA:1.25.10.10, PF08263, PF00560, PF07723, PF07725, PF12799, PF13306, PF13516, PF13855, SSF52047, SSF52058
Other: any other Pfam, SUPERFAMILY, and/or CATH-Gene3D annotation. Additionally, we included the PROSITE Profiles signatures PS51697 (ALOG domain) and PS50808 (zinc-finger BED domain), and the SMART signature SM00614 (zinc-finger BED domain).
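The signature list above can be turned into a lookup table for summarizing domain architectures. The sketch below is only an illustration of how such a table might be used and is not the R script in Appendix S1: the short class labels, the function names and the example hits are hypothetical. SSF52540 is skipped when building the architecture because the list reserves it for phylogenetic analysis, and consecutive hits of the same class are collapsed into one label, which is a simplification chosen for this sketch.

```python
# Signature accessions copied from the list above; the class labels are
# shorthand chosen for this sketch.
SIGNATURE_CLASSES = {
    "R1": {"PF12061"},
    "CC": {"PF18052", "cd14798"},
    "RPW8": {"PF05659", "PS51153"},
    "TIR": {"PF01582", "G3DSA:3.40.50.10140", "SSF52200", "PF13676"},
    "NB-ARC": {"PF00931"},
    "LRR": {
        "G3DSA:3.80.10.10", "G3DSA:1.25.10.10", "PF08263", "PF00560",
        "PF07723", "PF07725", "PF12799", "PF13306", "PF13516", "PF13855",
        "SSF52047", "SSF52058",
    },
}
PHYLOGENY_ONLY = {"SSF52540"}  # used for domain extraction, not architecture

ACCESSION_TO_CLASS = {
    accession: label
    for label, accessions in SIGNATURE_CLASSES.items()
    for accession in accessions
}


def domain_class(accession):
    """Map a signature accession to its class label, or 'Other' if unlisted."""
    return ACCESSION_TO_CLASS.get(accession, "Other")


def architecture(hits):
    """Return class labels ordered by start position, merging adjacent repeats.

    `hits` is an iterable of (start, accession) pairs (assumed layout).
    """
    ordered = []
    for _, accession in sorted(hits):
        if accession in PHYLOGENY_ONLY:
            continue
        label = domain_class(accession)
        if not ordered or ordered[-1] != label:
            ordered.append(label)
    return ordered


# Hypothetical hits for one protein.
example = [(600, "PF13855"), (1, "PF18052"), (160, "PF00931"),
           (165, "SSF52540"), (650, "G3DSA:3.80.10.10")]
print(architecture(example))
```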
Sequence deduplication
The extracted NB-ARC domains (SUPERFAMILY signature SSF52540) were clustered using CD-HIT at 90% sequence identity (Fu et al., 2012) (usage: cd-hit -i RefPlantNLR_v.20200528_415_SSF52540.fasta -o RefPlantNLR_SSF52540_90 -c 0.90 -n 5 -M 16000 -d 0). A custom R script (Appendix S1) was used to assign representative sequences per cluster per genus, i.e. if a single cluster contained sequences from multiple genera, we assigned a representative sequence per genus. In case an NLR has multiple NB-ARC domains, we grouped this NLR in clusters according to its largest NB-ARC domain. The reduced redundancy sequences are provided in Supplemental dataset 10, 11.
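The per-cluster, per-genus assignment can be illustrated with a short sketch. It does not reproduce the R script in Appendix S1: the input is assumed to be already parsed into (cluster ID, genus, sequence ID, NB-ARC length) records, the function name and example records are hypothetical, and the choice of the longest NB-ARC domain as the representative (ties broken by sequence ID) is an assumption, since the selection criterion is not specified above.

```python
def representatives_per_genus(records):
    """Pick one representative sequence for each (cluster, genus) pair.

    `records` is an iterable of (cluster_id, genus, seq_id, length) tuples.
    The longest sequence wins; ties are broken alphabetically by seq_id.
    Both the input layout and the selection rule are assumptions.
    """
    best = {}
    for cluster_id, genus, seq_id, length in records:
        key = (cluster_id, genus)
        candidate = (-length, seq_id)
        if key not in best or candidate < best[key]:
            best[key] = candidate
    return {key: seq_id for key, (_, seq_id) in best.items()}


# Hypothetical cluster members, for illustration only.
example_records = [
    (0, "Solanum", "s1", 265),
    (0, "Solanum", "s2", 270),
    (0, "Capsicum", "c1", 260),
    (1, "Oryza", "o1", 255),
]
print(representatives_per_genus(example_records))
```

In this example cluster 0 spans two genera and therefore contributes two representatives, one for Solanum and one for Capsicum.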
Phylogenetics
The NB-ARC domain (SUPERFAMILY signature SSF52540) of all NLRs was extracted using Geneious Prime. A custom R script (Appendix S1) was used to deduplicate sequences with identical NB-ARC domains and assign unique identifiers to sequences containing multiple NB-ARC domains. Sequences were aligned using Clustal Omega (Sievers et al., 2011), and all positions with less than 95% site coverage were removed using QKphylogeny (Moscou, 2020) (Supplemental dataset 8). RAxML (v8.2.12) (Stamatakis, 2014) was used (usage: raxmlHPC-PTHREADS-AVX -T 6 -s RefPlantNLR_v.20200528_415_SSF52540_Unique_ClustalOmega_missing5.phy -n RefPlantNLRs_v.20200528_415 -m PROTGAMMAAUTO -f a -# 1000 -x 8153044963028367 -p 644124967711489) to infer the evolutionary history using the Maximum Likelihood method based on the JTT model (Jones et al., 1992). Bootstrap values from 1000 rapid bootstrap replicates as implemented in RAxML are shown (Stamatakis et al., 2008) (Supplemental dataset 9). The RefPlantNLR phylogeny was edited using the iTOL suite (v5.5.1; Letunic and Bork, 2019), and can be accessed online through: https://itol.embl.de/shared/JKourelis
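The site-coverage filter can be sketched as follows. This is an illustration and not the QKphylogeny implementation: it assumes that coverage of a column is the fraction of sequences with a non-gap character at that position, that "-" is the only gap character, and that columns at or above the threshold are kept. The function name and the toy alignment are hypothetical.

```python
def filter_columns(alignment, min_coverage=0.95, gap_chars="-"):
    """Drop alignment columns whose non-gap fraction is below `min_coverage`.

    `alignment` maps sequence names to aligned strings of equal length.
    """
    if not alignment:
        return {}
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(rows)
    keep = [
        i for i in range(len(rows[0]))
        if sum(row[i] not in gap_chars for row in rows) / n_seqs >= min_coverage
    ]
    return {name: "".join(row[i] for i in keep) for name, row in zip(names, rows)}


# Toy alignment of four sequences; column coverage is 1.0, 0.75, 0.5, 1.0.
toy = {"a": "MK-L", "b": "MKAL", "c": "M-AL", "d": "MK-L"}
print(filter_columns(toy, min_coverage=0.95))
print(filter_columns(toy, min_coverage=0.75))
```

With the 95% threshold only the two fully covered columns remain, whereas a 75% threshold also retains the second column.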
FUNDING
This work has been supported by the Gatsby Charitable Foundation, Biotechnology and Biological Sciences Research Council (BBSRC), European Research Council (ERC) and BASF Plant Science. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
COMPETING INTERESTS
The authors receive funding from industry on NLR biology.
AUTHOR CONTRIBUTIONS
Conceptualization: JK, SK; Data curation: JK; Formal analysis: JK; Investigation: JK; Supervision: SK; Funding acquisition: SK; Project administration: SK; Writing initial draft: JK; Editing: JK, SK.
SUPPLEMENTAL DATA
Table S1: Description of RefPlantNLR.
Table S2: Plant orders represented in RefPlantNLR.
Supplemental dataset 1: Amino acid sequences of RefPlantNLR entries (fasta format). This file contains 415 amino acid sequences.
Supplemental dataset 2: CDS sequences of RefPlantNLR entries (fasta format). This file contains 400 CDS sequences. CDS sequences could not be retrieved for 15 RefPlantNLR entries.
Supplemental dataset 3: Annotated genomic sequences of RefPlantNLR entries (GenBank flat file format). This file contains 329 genomic loci containing the gene models of 344 RefPlantNLR entries and 56 RefPlantNLR mRNA entries lacking genomic information.
Supplemental dataset 4: InterProScan annotation of the RefPlantNLR amino acid sequences (GFF3 format). This file contains the InterProScan annotation of 415 amino acid sequences.
Supplemental dataset 5: InterProScan annotation of the RefPlantNLR CDS sequences (GFF3 format). This file contains the InterProScan annotation of the 400 CDS sequences.
Supplemental dataset 6: Amino acid sequences of the extracted RefPlantNLR NB-ARC domains (fasta format). This file contains 424 NB-ARC domain (SUPERFAMILY signature SSF52540) amino acid sequences belonging to 415 RefPlantNLR entries.
Supplemental dataset 7: Amino acid sequences of the unique RefPlantNLR extracted NB-ARC domains (fasta format). This file contains 347 unique NB-ARC domain (SUPERFAMILY signature SSF52540) amino acid sequences.
Supplemental dataset 8: Clustal Omega alignment of the unique RefPlantNLR extracted NB-ARC domains (PHYLIP format). This file contains the Clustal Omega alignment of 346 unique NB-ARC domains (SUPERFAMILY signature SSF52540) with all positions with less than 95% coverage removed. Pb1 was omitted from this alignment.
Supplemental dataset 9: NB-ARC domain phylogeny of the RefPlantNLR entries using the Maximum Likelihood method (Newick format). This file contains the phylogenetic analysis of the NB-ARC domain of the RefPlantNLR entries using the JTT model.
Supplemental dataset 10: Amino acid sequences of the non-redundant RefPlantNLR entries (fasta format). This file contains 235 amino acid sequences representing the non-redundant RefPlantNLR entries at a 90% amino acid identity threshold per genus according to the NB-ARC domain.
Supplemental dataset 11: Amino acid sequences of the NB-ARC domains of the non-redundant RefPlantNLR entries (fasta format). This file contains 241 amino acid sequences representing the extracted NB-ARC domains of the 235 non-redundant RefPlantNLR.
DATA AVAILABILITY AND UPDATES
Up-to-date versions of RefPlantNLR can be accessed via Zenodo at http://doi.org/10.5281/zenodo.3936022. This project is part of the OpenPlantNLR community on Zenodo: https://zenodo.org/communities/openplantnlr
ACKNOWLEDGEMENTS
We thank Hiroaki Adachi, Adeline Harant, and Philip Carella for useful comments and feedback, and Aleksandra Białas for the domain architecture illustrations.