MolEvolvR: A web-app for characterizing proteins using molecular evolution and phylogeny

Studying proteins through the lens of evolution can reveal conserved features, lineage-specific variants, and their potential functions. MolEvolvR (https://jravilab.org/molevolvr) is a novel web-app enabling researchers to visualize the molecular evolution of their proteins of interest in a phylogenetic context across the tree of life, spanning all superkingdoms. The web-app accepts multiple input formats – protein/domain sequences, homologous proteins, or domain scans – and, using a general-purpose computational workflow, returns detailed homolog data and dynamic graphical summaries (e.g., phylogenetic trees, multiple sequence alignments, domain architectures, domain proximity networks, phyletic spreads, co-occurrence patterns across lineages). In addition to whole protein searches, MolEvolvR can perform domain-wise analyses. Thus, MolEvolvR is a powerful, easy-to-use web interface for computational protein characterization.

The rate of protein family discovery far outpaces the assignment of molecular or biochemical functions to proteins [1]. This gap impedes scientific understanding of critical cellular processes such as molecular pathogenesis, antibiotic resistance, or stress response. Identifying and characterizing the complete molecular systems involved in such processes requires functional knowledge of the proteins involved. Many studies [2-18] have demonstrated the power of phylogenetic and molecular evolutionary analysis in determining the molecular functions of such proteins. A variety of individual tools exist for protein sequence similarity searches or ortholog detection, delineating co-occurring domains (domain architectures), and building multiple sequence alignments and phylogenetic trees. However, despite providing complementary insight, these tools are disjointed and often require both technical skill and background knowledge for meaningful use. Biologists need a means to functionally characterize their unknown/understudied protein(s) of interest by exhaustively identifying relevant homologs with shared motifs, domains, and domain architectures. We aim to address this demand with a unified, user-friendly, web-based framework that works with data spanning multiple superkingdoms and phyletic scales. Integrating such data interpretably requires effective summarization and visualization to provide functional insights.
Here, we present MolEvolvR, a web application that comprehensively characterizes proteins in a streamlined, easy-to-use platform (Fig. 1; accessible at jravilab.org/molevolvr). MolEvolvR performs homology searches across the tree of life and reconstructs domain architectures by functionally characterizing the input proteins and each of their homologs, presenting these results in the context of evolution. The computational evolutionary framework underlying MolEvolvR is written using custom R [44-59] and shell scripts. The web application is built with an R/Shiny [46,60,61] framework with additional user interface customizations in HTML, JavaScript, and CSS, and has been tested on Chrome, Brave, Firefox, and Safari browsers on Mac, Windows, and Linux operating systems.
Consider a researcher starting with a protein of unknown function, either identified in a genetic screen for a phenotype or found as an uncharacterized protein in a genomic context of interest. In Step 1, MolEvolvR resolves this query protein into its constituent domains and uses each domain for iterative homology searches [20,21] across the tree of life [62-65]. This divide-and-conquer strategy is domain-sensitive by design and captures remote homologs that are missed by homology searches using the full-length protein sequence alone. In Step 2, MolEvolvR characterizes the protein and each of its homologs by reconstructing their domain architectures and delineating molecular function, combining: 1) sequence alignment and clustering algorithms for domain detection [19,24,25,66]; 2) profile matching against protein domain and orthology databases [32,33,67-70]; and 3) prediction algorithms for signal peptides [33,71], transmembrane regions [33,72-74], cellular localization [33,73], and secondary/tertiary structures [33,75-78]. This analysis illustrates the breadth of MolEvolvR's analytic capability: the results comprise a detailed molecular characterization of the initial query protein(s) and of their homologs, integrated across the superkingdoms of life, along with any lineage-specific functional variants.
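As an illustration of what "reconstructing a domain architecture" means in Step 2, the minimal Python sketch below orders the domain hits on each protein by start coordinate and joins their names into a single string. It is not MolEvolvR's implementation (which is written in R and shell scripts); the `DomainHit` record, the `+` separator, and the protein and domain names are all hypothetical, chosen only for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class DomainHit(NamedTuple):
    """One predicted domain on one protein (hypothetical record layout)."""
    protein: str  # protein identifier, e.g. an accession number
    domain: str   # domain name reported by a domain scan
    start: int    # first residue of the hit (1-based)
    end: int      # last residue of the hit


def domain_architectures(hits):
    """Return {protein: architecture}, with domains ordered N- to C-terminus."""
    by_protein = defaultdict(list)
    for hit in hits:
        by_protein[hit.protein].append(hit)
    return {
        protein: "+".join(h.domain for h in sorted(phits, key=lambda h: h.start))
        for protein, phits in by_protein.items()
    }


# Toy input: hits may arrive in any order, as they often do in scan output.
hits = [
    DomainHit("protA", "DomY", 120, 210),
    DomainHit("protA", "DomX", 5, 100),
    DomainHit("protB", "DomX", 10, 105),
]
architectures = domain_architectures(hits)
print(architectures)
```

Here `protA` resolves to the architecture `DomX+DomY` and `protB` to `DomX`. A real pipeline would also need to handle overlapping or redundant hits from different domain databases, which this sketch ignores.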
MolEvolvR is versatile, accommodating a variety of inputs, investigating wide-ranging questions, and producing diverse outputs. Queries can include protein/domain sequences of single- or multi-protein operons (FASTA, NCBI/UniProt protein accession numbers), homologous proteins (pre-computed MSA or web/command-line BLAST output), or motif/domain scans (InterProScan output) [Fig. 1B]. MolEvolvR tailors analyses to answer a variety of questions, e.g., determining protein features restricted to certain pathogenic groups to discover virulence factors/diagnostic targets [Fig. 1C]. Finally, MolEvolvR generates output types spanning a complete set of homologs or phylogenetic trees, the domain architectures of a query protein, or the most common partner domains [previews in Fig. 1A]. Along with tables and visualizations, MolEvolvR also provides graphical summaries combining these results in the context of evolution: i) structure-based multiple sequence alignments and phylogenetic trees; ii) domain proximity networks from all co-occurring domains (across homolog domain architectures), consolidated within and across query proteins; iii) phyletic spreads of homologs and their domain architectures; and iv) co-occurrence patterns and relative occurrences of domain architectures across lineages [Fig. 1A]. Most importantly, MolEvolvR can perform domain-wise searches, in addition to whole protein searches, to trace the evolution of the proteins/domains of interest, even across remote homologs. The web-app contains detailed documentation about all these options, including case studies (see Supplementary Material) and frequently asked questions (FAQs). In addition to easy access to a local, in-browser history of user-submitted jobs, MolEvolvR can optionally send users a detailed description and status update of their submitted jobs via email.
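To make summaries ii) to iv) more concrete, the following Python sketch derives weighted edges for a domain proximity network and a tally of domain architectures per lineage from a toy table of homologs. It is an illustration under stated assumptions rather than MolEvolvR's own code: "proximity" is taken here to mean adjacency within an architecture, architectures are assumed to be `+`-separated strings, and the lineage and domain names are hypothetical.

```python
from collections import Counter


def proximity_edges(architectures, sep="+"):
    """Count undirected edges between domains adjacent in any architecture.

    Assumption: proximity means direct adjacency; other definitions
    (e.g. all pairs within a protein) are equally plausible.
    """
    edges = Counter()
    for arch in architectures:
        domains = arch.split(sep)
        for left, right in zip(domains, domains[1:]):
            edges[tuple(sorted((left, right)))] += 1
    return edges


def phyletic_spread(records):
    """Count how often each (lineage, architecture) pair occurs."""
    return Counter(records)


# Toy homolog table: (lineage, domain architecture), one row per homolog.
records = [
    ("Bacteria", "DomX+DomY"),
    ("Bacteria", "DomX+DomY"),
    ("Archaea", "DomX"),
    ("Eukaryota", "DomY+DomX+DomZ"),
]
edges = proximity_edges(arch for _, arch in records)
spread = phyletic_spread(records)
print(edges)
print(spread)
```

The edge weights could feed a network plot, and the per-lineage counts a heatmap of relative occurrences. Sorting each pair makes the edges undirected, so `DomX+DomY` and `DomY+DomX` contribute to the same edge.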
A specific instance of the web-app, applied to study several PSP stress response proteins (present across the tree of life), can be found at https://jravilab.org/psp [2]. MolEvolvR is a generalized version of this web-app. To demonstrate its broad applicability, we have applied the approach underlying MolEvolvR to study several systems, including proteins/operons in zoonotic pathogens, e.g., nutrient acquisition systems in Staphylococcus aureus [4,5], a novel phage defense system in Vibrio cholerae [6], surface layer proteins in Bacillus anthracis [7], helicase operators in bacteria [8], and internalins in Listeria spp. [9]. We have included a few pre-loaded examples in MolEvolvR for users to explore, in addition to help pages and FAQs. Thus, MolEvolvR (jravilab.org/molevolvr) is a flexible, user-friendly, and powerful interactive web tool that allows researchers of any level of bioinformatics experience to bring molecular evolution and phylogeny to bear on their proteins of interest.

Figure 1. Overview of MolEvolvR. A. MolEvolvR allows users to start with protein(s) of interest and perform the full analysis (1+3+4), only protein characterization (1+3), or only homology searches (1+4); or start with external outputs from BLAST or InterProScan for further analysis, summarization, and visualization (2+3+4). MolEvolvR is interactive, queryable, and customizable. B. Multiple input options in MolEvolvR: FASTA, NCBI/UniProt accession numbers, pre-computed MSA (in FASTA format), and analysis outputs from external web or command-line BLAST or InterProScan runs. C. The different analysis options available in MolEvolvR depend on the chosen input formats in 1A, 1B.