Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families

Kalvari, Ioanna; Argasinska, Joanna; Quinones-Olvera, Natalia; Nawrocki, Eric P; Rivas, Elena; Eddy, Sean R; Bateman, Alex; Finn, Robert D; Petrov, Anton I

doi:10.1093/nar/gkx1038

Abstract

The Rfam database is a collection of RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. In this paper we introduce Rfam release 13.0, which switches to a new genome-centric approach that annotates a non-redundant set of reference genomes with RNA families. We describe new web interface features including faceted text search and R-scape secondary structure visualizations. We discuss a new literature curation workflow and a pipeline for building families based on RNAcentral. There are 236 new families in release 13.0, bringing the total number of families to 2687. The Rfam website is http://rfam.org.

INTRODUCTION

Rfam is a database of RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model (CM). Using a combination of manual, literature-based curation and a custom software pipeline, Rfam converts descriptions of RNA families found in research papers into computational models that can be used to annotate RNAs belonging to those families in any DNA sequence. Valuable research outputs that are often locked up in figures and supplementary information files are encapsulated in Rfam entries and made accessible through the Rfam website and other resources that integrate the Rfam data, such as Ensembl (1) and UCSC (2) genome browsers.

The process of building a new Rfam family starts with extracting RNA sequences and, where possible, sequence alignments and secondary structures, from the scientific literature and manually building a seed alignment. The alignment includes a representative set of homologous sequences spanning the family's taxonomic distribution and capturing its sequence variation. Rfam uses the Infernal software package (3) to build CMs and perform iterative sequence- and secondary structure-based searches against a sequence database to find other instances of the family. The Rfam curator examines the search results and chooses a bit score cutoff, also known as a gathering threshold, which separates true from false homologs. Newly found homologs that are not represented by the current seed alignment are incorporated into it, a new model is built, and the search is repeated. This iterative process continues until no new homologs of the family are discovered. The curator also adds Gene Ontology (4) and Sequence Ontology (5) terms and updates Wikipedia pages (6) to provide insight into the biological function of the family. As new information becomes available, existing families are reviewed and updated.

The current release, Rfam 13.0, was made available in September 2017 and contains 2687 RNA families. Figure 1 shows the growth in the number of RNA families grouped by RNA type in all major releases since 2002 when Rfam was established (7). Some Rfam families are homologous to one another and are only separated to improve annotation because multiple models can be more specific than a single all-encompassing model. For example, there are five small subunit ribosomal RNA families, one each for Archaea, Bacteria, Eukaryota, Microsporidia Fungi and Trypanosomatid mitochondria. Homologous families like these are grouped together into the same clan (6). Search results from models in the same clan often overlap so in Rfam we analyse the overlaps and report only the lowest-scoring hits using a procedure called ‘clan competing’ (8). In Rfam 13.0, 344 families are organised into 111 clans.

Figure 1.

Open in new tab Download slide

Growth in the number of RNA families grouped by RNA type in major database releases. The other RNA types group includes types with less than 50 families, such as rRNA, tRNA, snRNA or riboswitches.

In this article, we discuss the new sequence database composed of complete reference genomes. The shift towards using complete genomes instead of the entire ENA sequence archive significantly reduces data redundancy, enables meaningful taxonomic comparisons, and allows for the ncRNA annotations to be organised and displayed in a genome-specific manner on the Rfam website. We also give an overview of new families and describe a new literature curation workflow.

UPDATING THE RFAM SEQUENCE DATABASE

To build new families and identify instances of existing ones, Rfam uses an aggregated sequence database called Rfamseq. Previously Rfamseq was composed of the standard (STD) and whole genome shotgun (WGS) sequence sets obtained from the ENA (9). Prior to release 13.0 the most recent version of Rfamseq was generated based on ENA release 110 (December 2011). Since then the size of the STD and WGS sets has grown almost tenfold from 0.27 Tb to 2.2 Tb (ENA release 132, July 2017), and updating Rfamseq with these data would make sequence searches and family building impractical. Furthermore, many of these sequences represent redundant versions of the same or closely related genomes and annotating all of them may not be useful. For example, Rfamseq v12 contained 26 Gb of human sequences which is eight times more than the length of reference human genome, and 77 different Mycobacterium tuberculosis strains. This redundancy in sequence led to redundant hits in the CM search results and made the analysis of such results difficult.

Sequence redundancy is a common problem for biological databases. To deal with it, the UniProt database recently switched to annotating a non-redundant and regularly updated set of reference proteomes which are selected using a semi-automated approach to cover the taxonomic spread of species (10). The reference proteomes are reviewed by UniProt curators who ensure that species of special interest remain in the reference set.

As of Rfam 13.0, we are using the same set of genomes as is used to generate the UniProt reference proteome collection (release March 2017, supplemented by viral genomes used in release January 2017). It includes 8366 genomes from 8302 different species, with 3025 species in common with Rfamseq v12. All genome sequences were downloaded from ENA based on GCA assembly accessions or WGS identifiers obtained from UniProt. The size of the Rfamseq v13 database is comparable with the previous version, v12 (Table 1).

Comparison of Rfamseq databases used for Rfam releases 12.3 and 13.0

Table 1.

Comparison of Rfamseq databases used for Rfam releases 12.3 and 13.0

	Rfamseq v12	Rfamseq v13
Total sequence length	274 Gb	300 Gb
Number of complete genomes	N/A	8366
Number of species	464 237	8302 (3025 identical with v12)
Number of significant hits	8.7 million	2.3 million

	Rfamseq v12	Rfamseq v13
Total sequence length	274 Gb	300 Gb
Number of complete genomes	N/A	8366
Number of species	464 237	8302 (3025 identical with v12)
Number of significant hits	8.7 million	2.3 million

Open in new tab

Table 1.

Comparison of Rfamseq databases used for Rfam releases 12.3 and 13.0

	Rfamseq v12	Rfamseq v13
Total sequence length	274 Gb	300 Gb
Number of complete genomes	N/A	8366
Number of species	464 237	8302 (3025 identical with v12)
Number of significant hits	8.7 million	2.3 million

	Rfamseq v12	Rfamseq v13
Total sequence length	274 Gb	300 Gb
Number of complete genomes	N/A	8366
Number of species	464 237	8302 (3025 identical with v12)
Number of significant hits	8.7 million	2.3 million

Open in new tab

Summary statistics for Rfam annotation of various genomes

Table 2.

Summary statistics for Rfam annotation of various genomes

Genome	Assembly/accession	Size	Number of hits	Number of families
Homo sapiens	GRCh38.p9	3.2 Gb	14 882	803
Sus scrofa (pig)	Sscrofa10.2	2.7 Gb	6289	637
Drosophila melanogaster	Release 6 plus ISO1 MT	143.7 Mb	945	160
Caenorhabditis elegans	WBcel235	100.3 Mb	1027	175
Escherichia coli	U00096.3	4.6 Mb	382	130
Bacillus subtilis	ASM904v1	4.2 Mb	250	53
Methanocaldococcus jannaschii	ASM9166v1	1.7 Mb	276	18
Aquifex aeolicus	ASM862v1	1.6 Mb	55	7
Borrelia burgdorferi	ASM868v2	1.5 Mb	42	7

Genome	Assembly/accession	Size	Number of hits	Number of families
Homo sapiens	GRCh38.p9	3.2 Gb	14 882	803
Sus scrofa (pig)	Sscrofa10.2	2.7 Gb	6289	637
Drosophila melanogaster	Release 6 plus ISO1 MT	143.7 Mb	945	160
Caenorhabditis elegans	WBcel235	100.3 Mb	1027	175
Escherichia coli	U00096.3	4.6 Mb	382	130
Bacillus subtilis	ASM904v1	4.2 Mb	250	53
Methanocaldococcus jannaschii	ASM9166v1	1.7 Mb	276	18
Aquifex aeolicus	ASM862v1	1.6 Mb	55	7
Borrelia burgdorferi	ASM868v2	1.5 Mb	42	7

Open in new tab

Table 2.

Summary statistics for Rfam annotation of various genomes

Genome	Assembly/accession	Size	Number of hits	Number of families
Homo sapiens	GRCh38.p9	3.2 Gb	14 882	803
Sus scrofa (pig)	Sscrofa10.2	2.7 Gb	6289	637
Drosophila melanogaster	Release 6 plus ISO1 MT	143.7 Mb	945	160
Caenorhabditis elegans	WBcel235	100.3 Mb	1027	175
Escherichia coli	U00096.3	4.6 Mb	382	130
Bacillus subtilis	ASM904v1	4.2 Mb	250	53
Methanocaldococcus jannaschii	ASM9166v1	1.7 Mb	276	18
Aquifex aeolicus	ASM862v1	1.6 Mb	55	7
Borrelia burgdorferi	ASM868v2	1.5 Mb	42	7

Genome	Assembly/accession	Size	Number of hits	Number of families
Homo sapiens	GRCh38.p9	3.2 Gb	14 882	803
Sus scrofa (pig)	Sscrofa10.2	2.7 Gb	6289	637
Drosophila melanogaster	Release 6 plus ISO1 MT	143.7 Mb	945	160
Caenorhabditis elegans	WBcel235	100.3 Mb	1027	175
Escherichia coli	U00096.3	4.6 Mb	382	130
Bacillus subtilis	ASM904v1	4.2 Mb	250	53
Methanocaldococcus jannaschii	ASM9166v1	1.7 Mb	276	18
Aquifex aeolicus	ASM862v1	1.6 Mb	55	7
Borrelia burgdorferi	ASM868v2	1.5 Mb	42	7

Open in new tab

The updated Rfamseq has a broad taxonomic coverage shown in Figure 2. The majority of genomes come from Bacteria (∼60%), while Eukaryotes and Archaea account for 10% and 3% respectively. Twenty seven percent of genomes in the database are from Viruses but only 748 out of 2404 viral genomes had Rfam matches.

Figure 2.

Open in new tab Download slide

Overview of species annotated with RNA families in Rfam 13.0. The tree is based on NCBI Taxonomy and was generated using iToL (11).

We searched the new Rfamseq with Rfam CMs and found that almost all families had matches to the new genome-centric database. To recover the missing families, we added 26 genomes and viroid sequences that were not present in UniProt. We compared our latest annotations for the genomes that were previously discussed in the Rfam 12 paper (8), and found that the results for the set of common families are comparable between release 12.0 and 13.0 (Table 2). For example, the human genome had matches to 796 families in Rfam 12.0 and 795 in Rfam 13.0. The only missing family was mir-689 (RF00871) that was recently removed from Rfam (see below). It is important to note that not every Rfam hit represents a functional RNA because many hits originate from pseudogenes that are no longer active but still maintain enough sequence similarity to be detected as statistically significant hits by CM searches.

In future releases, we will supplement Rfamseq with genomes from Ensembl (1), Ensembl Genomes (12) and NCBI reference genome collection (13). We will also consider adding genomes upon user request. In addition, any genome or metagenomic dataset can be annotated with RNA families locally using Infernal and Rfam CMs (for detailed instructions see Rfam Help at http://rfam.readthedocs.org or the Infernal User Manual at http://eddylab.org/infernal/).

WEBSITE UPDATES

Text search

Until recently, the Rfam website provided several distinct types of search functionality, such as taxonomy, RNA type and keyword searches. To enable more powerful searches that allow a combination of these search types, we implemented a new unified text search that brings all of the website search functionality into one search box. Users can filter the data by RNA type or organism using facets and browse genomes, families, clans, motifs, and sequences with rich search results showing key information about each entry.

Users can search and browse all genomes as well as RNA families and sequences found therein. It is also possible to compare RNA families across taxonomic groups. For example, one can search for Mammalia and identify all RNA families found in Mammals (Figure 3). It is also possible to compare any set of genomes to find the families that are common to all of them or specific to some of them. The documentation and examples are available on the Rfam Help website (http://rfam.readthedocs.org). For even more complex queries that are not currently supported in the web interface, users can query the new public MySQL database (see section Availability).

Figure 3.

Open in new tab Download slide

Browsing Mammalian families and genomes annotated in Rfam. The entries can be filtered using facets and sorted by multiple criteria.

Sequence summary pages

We also developed new summary pages for each individual sequence that includes a download link as well as links to RNAcentral, ENA, and NCBI Taxonomy. The sequence pages make it possible for other resources to link to Rfam sequences and for Rfam to display more data about each identified RNA, such as its genomic context (Figure 4). The pages contain an embedded genome browser powered by Genoverse (http://genoverse.org) that displays genes retrieved using Ensembl API (14).

Figure 4.

Open in new tab Download slide

Sequence summary page for Homo sapiens small Cajal body-specific RNA 2 sequence located in chromosome 1.

R-Scape secondary structures

R-scape (15) is a new method for testing whether covariation analysis supports the presence of a conserved RNA secondary structure. In order to check the quality of Rfam structures, we analysed all Rfam seed alignments with R-scape and added interactive R-scape visualisations to the secondary structure galleries. For example, Figure 5 shows the R-scape analysis of the SAM riboswitch alignment from the Rfam website.

Figure 5.

Open in new tab Download slide

R-scape visualisation of SAM riboswitch (RF00162, http://rfam.org/family/SAM). Left: The current Rfam 13.0 SAM riboswitch seed alignment and consensus secondary structure; 19 of the 27 basepairs in the alignment show statistically significant covariation. Right: The R-scape improved SAM riboswitch seed alignment and consensus secondary structure; 27 out of 36 basepairs show statistically significant covariation. The structures are displayed using R2R (16); significant basepairs, as defined by R-scape, are shown in green. Other colours and markup of the structure diagrams are explained in the legend on the far right.

According to R-scape, the secondary structure from the Rfam seed alignment, shown on the left, has 19 statistically significant basepairs (highlighted in green). R-scape can also use statistically significant basepairs as constraints to predict a new secondary structure that is consistent with the seed alignment. Using this approach, R-scape increased the number of statistically significant basepairs from 19 to 27 and added nine new basepairs consistent with the seed alignment (structure on the right). The R-scape secondary structure is also in perfect agreement with that obtained from the crystal structure (17). In this case R-scape indicates that the SAM seed alignment may need to be updated.

R-scape analysis suggests that many existing Rfam secondary structures can be improved (for example, SAM and FMN riboswitches, 5S rRNA). In other families, the current consensus secondary structures are not statistically supported by the R-scape covariation analysis (for example, oxyS RNA) which indicates that either their seed alignments need to be expanded (to increase statistical significance) or that these RNA families do not have a conserved secondary structure. Lastly, there are also cases where the R-scape structures do not show significant improvement compared to the current secondary structure (for instance, Metazoa SRP).

In future releases, we will use R-scape to improve existing Rfam seed alignments and at the time of creating new families. In the meantime, Rfam users can get an indication of the quality of the structure using the R-scape visualisations shown on the website.

INCREASING THE NUMBER OF RNA FAMILIES

New literature curation workflow

Extracting sequences from the literature is a time-consuming and error-prone process because the sequences need to be re-typed manually from figures or other images found in papers. To speed up this process, we established a new literature workflow in collaboration with a biocuration company E-Merge Tech (http://e-mergeglobal.com). The Rfam team finds papers that potentially contain descriptions of new RNA families (∼15 papers per month on average). The PubMed identifiers of these papers are sent to E-Merge curators who extract RNA sequences from text, figures, and supplementary information and capture the metadata associated with sequences, such as experimental evidence, accessions, species, and secondary structure or alignment availability. In addition, E-Merge curators search the sequences against Rfam in order to prioritise sequences that do not match any existing Rfam family.

The data is exchanged using a shared Google Spreadsheet which allows us to keep track of the data and perform automatic validation. A copy of the spreadsheet from September 2017 is available at https://goo.gl/hb6nEj and in the Supplementary Information. We also developed an extension for Google Chrome browser (https://github.com/Rfam/rfam-pubmed-id-highlighter) that connects to the spreadsheet and highlights PubMed search results that have not been analysed yet.

Using this workflow between January 2016 and September 2017 we identified over 660 papers for curation and built 236 families after analysing ∼40% of the papers (Figure 6). If the current trend continues, there may be up to 370 new RNA families that could be added to Rfam based on the already accumulated literature alone.

Figure 6.

Open in new tab Download slide

The number of papers curated to extract RNA sequences between January 2016 and June 2017. In total, 660 papers were selected for curation, 260 of which were processed to build 236 new families (some papers were not used for family building for various reasons).

Overview of new RNA families

The work described above demonstrates that we are far from having a complete collection of non-coding RNA families in Rfam. Therefore, it is is very important to continue to build new families that we or others have identified. Over the course of three Rfam releases (12.1, 12.2 and 12.3) 236 new RNA families were made available (Table 3). The number of small RNA (sRNA) and thermoregulator families was significantly expanded by the addition of 170 new entries. The group of iron-related bacterial small RNAs was also expanded with HrrF RNA (RF02728) (18) and four Aggregatibacter iron-regulated sRNAs called JA01 to JA04 (RF02729-RF02732) (19). Two new riboswitch families were also added: NiCo (20) and Twister-sister (21).

Overview of new families added to Rfam since release 12.0

Table 3.

Overview of new families added to Rfam since release 12.0

RNA type	Number of families	Example families
Gene; sRNA	148	PsrR1, AsdA, MicL, RnaG, SprX, SorY
Cis-reg; thermoregulator	22	TrxA_thermometer, ToxT_thermometer, AilA_thermometer
Cis-reg	17	Oskar_OES, HilD 3΄UTR, DapZ
Gene; antisense	14	AsrC, Sernc350
Gene; snRNA; snoRNA; CD-box	7	sno_ZL1, sno_ZL2
Gene; snRNA; snoRNA; scaRNA	11	sca_ncR14, sca_ncR26
Gene; ribozyme	5	Hatchet, Pistol, Twister-sister
Gene	3	EBv-sisRNA-1, EBv-sisRNA-2
Cis-reg; riboswitch	2	NiCo, PreQ1-III
Gene; rRNA	2	ppoRNA, hveRNA
Cis-reg; IRES	2	IRES_RhPV, IRES_cyp24a1
Gene; snRNA; snoRNA; HACA-box	3	B_rapa_snoR775, sno_ncR1
	Total: 236

RNA type	Number of families	Example families
Gene; sRNA	148	PsrR1, AsdA, MicL, RnaG, SprX, SorY
Cis-reg; thermoregulator	22	TrxA_thermometer, ToxT_thermometer, AilA_thermometer
Cis-reg	17	Oskar_OES, HilD 3΄UTR, DapZ
Gene; antisense	14	AsrC, Sernc350
Gene; snRNA; snoRNA; CD-box	7	sno_ZL1, sno_ZL2
Gene; snRNA; snoRNA; scaRNA	11	sca_ncR14, sca_ncR26
Gene; ribozyme	5	Hatchet, Pistol, Twister-sister
Gene	3	EBv-sisRNA-1, EBv-sisRNA-2
Cis-reg; riboswitch	2	NiCo, PreQ1-III
Gene; rRNA	2	ppoRNA, hveRNA
Cis-reg; IRES	2	IRES_RhPV, IRES_cyp24a1
Gene; snRNA; snoRNA; HACA-box	3	B_rapa_snoR775, sno_ncR1
	Total: 236

Open in new tab

Table 3.

Overview of new families added to Rfam since release 12.0

RNA type	Number of families	Example families
Gene; sRNA	148	PsrR1, AsdA, MicL, RnaG, SprX, SorY
Cis-reg; thermoregulator	22	TrxA_thermometer, ToxT_thermometer, AilA_thermometer
Cis-reg	17	Oskar_OES, HilD 3΄UTR, DapZ
Gene; antisense	14	AsrC, Sernc350
Gene; snRNA; snoRNA; CD-box	7	sno_ZL1, sno_ZL2
Gene; snRNA; snoRNA; scaRNA	11	sca_ncR14, sca_ncR26
Gene; ribozyme	5	Hatchet, Pistol, Twister-sister
Gene	3	EBv-sisRNA-1, EBv-sisRNA-2
Cis-reg; riboswitch	2	NiCo, PreQ1-III
Gene; rRNA	2	ppoRNA, hveRNA
Cis-reg; IRES	2	IRES_RhPV, IRES_cyp24a1
Gene; snRNA; snoRNA; HACA-box	3	B_rapa_snoR775, sno_ncR1
	Total: 236

RNA type	Number of families	Example families
Gene; sRNA	148	PsrR1, AsdA, MicL, RnaG, SprX, SorY
Cis-reg; thermoregulator	22	TrxA_thermometer, ToxT_thermometer, AilA_thermometer
Cis-reg	17	Oskar_OES, HilD 3΄UTR, DapZ
Gene; antisense	14	AsrC, Sernc350
Gene; snRNA; snoRNA; CD-box	7	sno_ZL1, sno_ZL2
Gene; snRNA; snoRNA; scaRNA	11	sca_ncR14, sca_ncR26
Gene; ribozyme	5	Hatchet, Pistol, Twister-sister
Gene	3	EBv-sisRNA-1, EBv-sisRNA-2
Cis-reg; riboswitch	2	NiCo, PreQ1-III
Gene; rRNA	2	ppoRNA, hveRNA
Cis-reg; IRES	2	IRES_RhPV, IRES_cyp24a1
Gene; snRNA; snoRNA; HACA-box	3	B_rapa_snoR775, sno_ncR1
	Total: 236

Open in new tab

Some of the new RNA families are species-specific including Francisella tularensis sRNAs Ftr A (RF02747) and Ftr B (RF02748), which are the first novel small RNAs identified in F. tularensis (22).

Nine of the existing families were renamed to reflect new research findings (the details can be found in Supplementary Information Table S1). A new clan (CL00113) was created to group together two 5S rRNA families, 5S_rRNA (RF00001) and mtPerm-5S (RF02547). The 5_8S_rRNA (RF00002) family was added to the LSU clan (CL00112) due to its homology with the 5΄ end of bacterial and archaeal LSU rRNA (23), and Yersinia sRNA 186/sR026/CsrC (RF02767) family was added to the Csr-Rsm protein binding clan (CL00106). The RNAIII secondary structure (RF00503) has been updated based on the report by Gupta and colleagues (24).

Three families have been removed since Rfam 12.0. The PK-G12rRNA 23S rRNA pseudoknot family (RF01118) has been deleted because it was covered by the models from the large ribosomal subunit clan. Sau-50 (RF02391) failed protein-coding check, and mir-689 (RF00871) was identified as an rRNA fragment (25) and has been suppressed in miRBase.

RNAcentral as a source of new families

We investigated how well Rfam covers non-coding RNA sequence space by looking at the percentage of RNAcentral (26) sequences that match Rfam families. RNAcentral contains RNA sequences imported from 25 specialised databases, including Rfam, GtRNAdb, Flybase and others. We scanned 9.3 million RNAcentral sequences (release 5) with Rfam CMs and discovered that ∼90% either have been submitted by Rfam (50%) or match an Rfam family (40%). This finding suggests that Rfam already contains the most common RNA families (Figure 7).

Figure 7.

Open in new tab Download slide

Annotating RNAcentral with Rfam families. About 1.8% of RNAcentral could be used as a source of new Rfam families.

The remaining 10% of RNAcentral sequences do not match any existing Rfam family and could potentially represent new families. Some of these sequences belong to non-coding RNA classes that are not suitable for inclusion in Rfam because they are unstructured or their sequence length is too short or too long to be effectively modeled using CMs, such as piRNAs, rasiRNAs, siRNA, and long non-coding RNAs (5% of the total number of sequences).

About 3% of the unmatched sequences are annotated in RNAcentral as being members of ncRNA families for which there is a corresponding Rfam model, yet the existing models do not detect these apparent homologs. These include rRNA, tRNA and tmRNA sequences, and we are investigating these cases to see if the corresponding Rfam models need to be revised or if the RNAcentral annotations are inaccurate.

We focused on the remaining 2% of RNAcentral sequences that could potentially represent new Rfam families. We developed a set of filters based on sequence length and keywords found in sequence descriptions and associated publications to discard sequences that are not suitable for Rfam (mislabelled tRNAs, ITS and others). This narrowed down the list of sequences that may represent potential new families to 41,985 entries. These sequences were then clustered using CD-HIT-EST (27) and the clusters were manually reviewed. We identified several groups of well-described non-coding RNAs and created new Rfam families (EBER-2 and MCS4 sRNAs and snopsi28S-3378 snoRNA). We are currently refining our clustering approach in order to fully automate the identification of candidates for family building.

AVAILABILITY

The Rfam website is now hosted under its own domain name and is available at a new URL, http://rfam.org. We will maintain the redirection from the previously used URL (http://rfam.xfam.org) for the foreseeable future, but users are encouraged to update their links accordingly.

In order to make it easier to query the data in ways that are not supported by the website, we created a public MySQL database with the latest Rfam data. Now, users can explore the data using SQL queries from a MySQL client or programmatically using custom scripts. The MySQL database is updated with each Rfam release. Database dumps are archived on the FTP site (ftp://ftp.ebi.ac.uk/pub/databases/Rfam) that also contains FASTA files, seed alignments, and CMs for all families.

The Rfam documentation has been migrated to a dedicated documentation hosting service, ReadTheDocs, and is available at http://rfam.readthedocs.org. Rfam code and documentation are available on GitHub at http://github.com/Rfam under the Apache 2.0 license. We welcome feedback and questions, which can be directed to rfam-help@ebi.ac.uk or submitted on GitHub.

CONCLUSIONS

The new Rfam genome-based sequence database is more scalable and gives a more accurate view of the distribution of Rfam entries. Using complete genomes enables meaningful taxonomic comparisons and identification of a repertoire of RNA families found in a certain species. Users can now more easily and more powerfully search the Rfam data with the new faceted text search. Together these improvements make Rfam a more valuable resource for the sequence analysis community.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online.

ACKNOWLEDGEMENTS

We would like to thank Paul Gardner (University of Canterbury) and all previous Rfam team members. We would like to thank Zasha Weinberg (Leipzig University) for contributing new families. We also thank the Scientific Advisory Board members, including Michelle Meyer (Boston College), John Rinn (University of Boulder), Eric Westhof (University of Strasbourg), and Mihaela Zavolan (University of Basel) for useful feedback. This research was supported by the Intramural Research Program of the NIH National Library of Medicine.

FUNDING

This work was supported by Biotechnology and Biological Sciences Research Council (BBSRC) [BB/M011690/1]. Work in SRE’s laboratory is supported by the Howard Hughes Medical Institute and by US National Institutes of Health grant 1-R01-HG009116. Funding for open access charge: Research Councils UK (RCUK).

Conflict of interest statement. None declared.

REFERENCES

1.

Aken

B.L.

,

Achuthan

P.

,

Akanni

W.

,

Amode

M.R.

,

Bernsdorff

F.

,

Bhai

J.

,

Billis

K.

,

Carvalho-Silva

D.

,

Cummins

C.

,

Clapham

P.

et al.

Ensembl 2017

.

Nucleic Acids Res.

2017

;

45

:

D635

–

D642

.

2.

Speir

M.L.

,

Zweig

A.S.

,

Rosenbloom

K.R.

,

Raney

B.J.

,

Paten

B.

,

Nejad

P.

,

Lee

B.T.

,

Learned

K.

,

Karolchik

D.

,

Hinrichs

A.S.

et al.

The UCSC Genome Browser database: 2016 update

.

Nucleic Acids Res.

2016

;

44

:

D717

–

D725

.

3.

Nawrocki

E.P.

,

Eddy

S.R.

Infernal 1.1: 100-fold faster RNA homology searches

.

Bioinformatics

.

2013

;

29

:

2933

–

2935

.

4.

Gene Ontology Consortium

Gene Ontology Consortium: going forward

.

Nucleic Acids Res.

2015

;

43

:

D1049

–

D1056

.

Crossref

PubMed

WorldCat

5.

Eilbeck

K.

,

Lewis

S.E.

,

Mungall

C.J.

,

Yandell

M.

,

Stein

L.

,

Durbin

R.

,

Ashburner

M.

The Sequence Ontology: a tool for the unification of genome annotations

.

Genome Biol.

2005

;

6

:

R44

.

6.

Gardner

P.P.

,

Daub

J.

,

Tate

J.

,

Moore

B.L.

,

Osuch

I.H.

,

Griffiths-Jones

S.

,

Finn

R.D.

,

Nawrocki

E.P.

,

Kolbe

D.L.

,

Eddy

S.R.

et al.

Rfam: Wikipedia, clans and the ‘decimal’ release

.

Nucleic Acids Res.

2011

;

39

:

D141

–

D145

.

7.

Griffiths-Jones

S.

,

Bateman

A.

,

Marshall

M.

,

Khanna

A.

,

Eddy

S.R.

Rfam: an RNA family database

.

Nucleic Acids Res.

2003

;

31

:

439

–

441

.

8.

Nawrocki

E.P.

,

Burge

S.W.

,

Bateman

A.

,

Daub

J.

,

Eberhardt

R.Y.

,

Eddy

S.R.

,

Floden

E.W.

,

Gardner

P.P.

,

Jones

T.A.

,

Tate

J.

et al.

Rfam 12.0: updates to the RNA families database

.

Nucleic Acids Res.

2015

;

43

:

D130

–

D137

.

9.

Toribio

A.L.

,

Alako

B.

,

Amid

C.

,

Cerdeño-Tarrága

A.

,

Clarke

L.

,

Cleland

I.

,

Fairley

S.

,

Gibson

R.

,

Goodgame

N.

,

Hoopen ten

P.

et al.

European Nucleotide Archive in 2016

.

Nucleic Acids Res.

2017

;

45

:

D32

–

D36

.

10.

Bursteinas

B.

,

Britto

R.

,

Bely

B.

,

Auchincloss

A.

,

Rivoire

C.

,

Redaschi

N.

,

O’Donovan

C.

,

Martin

M.J.

Minimizing proteome redundancy in the UniProt Knowledgebase

.

Database

.

2016

;

2016

:

baw139

.

11.

Letunic

I.

,

Bork

P.

Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees

.

Nucleic Acids Res.

2016

;

44

:

W242

–

W245

.

12.

Kersey

P.J.

,

Allen

J.E.

,

Armean

I.

,

Boddu

S.

,

Bolt

B.J.

,

Carvalho-Silva

D.

,

Christensen

M.

,

Davis

P.

,

Falin

L.J.

,

Grabmueller

C.

et al.

Ensembl Genomes 2016: more genomes, more complexity

.

Nucleic Acids Res.

2016

;

44

:

D574

–

D580

.

13.

O’Leary

N.A.

,

Wright

M.W.

,

Brister

J.R.

,

Ciufo

S.

,

Haddad

D.

,

McVeigh

R.

,

Rajput

B.

,

Robbertse

B.

,

Smith-White

B.

,

Ako-Adjei

D.

et al.

Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

.

Nucleic Acids Res.

2016

;

44

:

D733

–

D745

.

14.

Yates

A.

,

Beal

K.

,

Keenan

S.

,

McLaren

W.

,

Pignatelli

M.

,

Ritchie

G.R.S.

,

Ruffier

M.

,

Taylor

K.

,

Vullo

A.

,

Flicek

P.

The Ensembl REST API: Ensembl Data for Any Language

.

Bioinformatics

.

2015

;

31

:

143

–

145

.

15.

Rivas

E.

,

Clements

J.

,

Eddy

S.R.

A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs

.

Nat. Methods

.

2017

;

14

:

45

–

48

.

16.

Weinberg

Z.

,

Breaker

R.R.

R2R–software to speed the depiction of aesthetic consensus RNA secondary structures

.

BMC Bioinformatics

.

2011

;

12

:

3

.

17.

Montange

R.K.

,

Batey

R.T.

Structure of the S-adenosylmethionine riboswitch regulatory mRNA element

.

Nature

.

2006

;

441

:

1172

–

1175

.

18.

Santana

E.A.

,

Harrison

A.

,

Zhang

X.

,

Baker

B.D.

,

Kelly

B.J.

,

White

P.

,

Liu

Y.

,

Munson

R.S.

Jr

HrrF is the Fur-regulated small RNA in nontypeable Haemophilus influenzae

.

PLoS One

.

2014

;

9

:

e105644

.

19.

Amarasinghe

J.J.

,

Connell

T.D.

,

Scannapieco

F.A.

,

Haase

E.M.

Novel iron-regulated and Fur-regulated small regulatory RNAs in Aggregatibacter actinomycetemcomitans

.

Mol. Oral Microbiol.

2012

;

27

:

327

–

349

.

20.

Furukawa

K.

,

Ramesh

A.

,

Zhou

Z.

,

Weinberg

Z.

,

Vallery

T.

,

Winkler

W.C.

,

Breaker

R.R.

Bacterial riboswitches cooperatively bind Ni(2+) or Co(2+) ions and control expression of heavy metal transporters

.

Mol. Cell

.

2015

;

57

:

1088

–

1098

.

21.

Weinberg

Z.

,

Kim

P.B.

,

Chen

T.H.

,

Li

S.

,

Harris

K.A.

,

Lünse

C.E.

,

Breaker

R.R.

New classes of self-cleaving ribozymes revealed by comparative genomics analysis

.

Nat. Chem. Biol.

2015

;

11

:

606

–

610

.

22.

Postic

G.

,

Frapy

E.

,

Dupuis

M.

,

Dubail

I.

,

Livny

J.

,

Charbit

A.

,

Meibom

K.L.

Identification of small RNAs in Francisella tularensis

.

BMC Genomics

.

2010

;

11

:

625

.

23.

Jacq

B.

Sequence homologies between eukaryotic 5.8S rRNA and the 5’ end of prokaryotic 23S rRNa: evidences for a common evolutionary origin

.

Nucleic Acids Res.

1981

;

9

:

2913

–

2932

.

24.

Gupta

R.K.

,

Luong

T.T.

,

Lee

C.Y.

RNAIII of the Staphylococcus aureus agr system activates global regulator MgrA by stabilizing mRNA

.

Proc. Natl. Acad. Sci. U.S.A.

2015

;

112

:

14036

–

14041

.

25.

Chiang

H.R.

,

Schoenfeld

L.W.

,

Ruby

J.G.

,

Auyeung

V.C.

,

Spies

N.

,

Baek

D.

,

Johnston

W.K.

,

Russ

C.

,

Luo

S.

,

Babiarz

J.E.

et al.

Mammalian microRNAs: experimental evaluation of novel and previously annotated genes

.

Genes Dev.

2010

;

24

:

992

–

1009

.

26.

The RNAcentral Consortium

RNAcentral: a comprehensive database of non-coding RNA sequences

.

Nucleic Acids Res.

2017

;

45

:

D128

–

D134

.

Crossref

PubMed

WorldCat

27.

Li

W.

,

Godzik

A.

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

.

Bioinformatics

.

2006

;

22

:

1658

–

1659

.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Download all slides

Month:	Total Views:
November 2017	251
December 2017	229
January 2018	405
February 2018	340
March 2018	369
April 2018	266
May 2018	233
June 2018	187
July 2018	238
August 2018	310
September 2018	167
October 2018	187
November 2018	215
December 2018	151
January 2019	133
February 2019	137
March 2019	154
April 2019	193
May 2019	154
June 2019	149
July 2019	178
August 2019	150
September 2019	157
October 2019	139
November 2019	130
December 2019	128
January 2020	133
February 2020	129
March 2020	96
April 2020	126
May 2020	114
June 2020	161
July 2020	191
August 2020	158
September 2020	160
October 2020	148
November 2020	141
December 2020	71
January 2021	66
February 2021	71
March 2021	97
April 2021	98
May 2021	75
June 2021	73
July 2021	61
August 2021	46
September 2021	63
October 2021	81
November 2021	116
December 2021	61
January 2022	64
February 2022	72
March 2022	93
April 2022	90
May 2022	62
June 2022	62
July 2022	63
August 2022	95
September 2022	104
October 2022	87
November 2022	78
December 2022	86
January 2023	103
February 2023	110
March 2023	79
April 2023	59
May 2023	68
June 2023	57
July 2023	44
August 2023	67
September 2023	70
October 2023	70
November 2023	58
December 2023	81
January 2024	100
February 2024	114
March 2024	131
April 2024	37

Article Contents

Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families

Abstract

INTRODUCTION

UPDATING THE RFAM SEQUENCE DATABASE

Comparison of Rfamseq databases used for Rfam releases 12.3 and 13.0

Summary statistics for Rfam annotation of various genomes

WEBSITE UPDATES

Text search

Sequence summary pages

R-Scape secondary structures

INCREASING THE NUMBER OF RNA FAMILIES

New literature curation workflow

Overview of new RNA families

Overview of new families added to Rfam since release 12.0

RNAcentral as a source of new families

AVAILABILITY

CONCLUSIONS

SUPPLEMENTARY DATA

ACKNOWLEDGEMENTS

FUNDING

REFERENCES

Supplementary data

Comments

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families

Abstract

INTRODUCTION

UPDATING THE RFAM SEQUENCE DATABASE

Comparison of Rfamseq databases used for Rfam releases 12.3 and 13.0

Summary statistics for Rfam annotation of various genomes

WEBSITE UPDATES

Text search

Sequence summary pages

R-Scape secondary structures

INCREASING THE NUMBER OF RNA FAMILIES

New literature curation workflow

Overview of new RNA families

Overview of new families added to Rfam since release 12.0

RNAcentral as a source of new families

AVAILABILITY

CONCLUSIONS

SUPPLEMENTARY DATA

ACKNOWLEDGEMENTS

FUNDING

REFERENCES

Supplementary data

Comments

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only