ABC-finder: A containerized web server for the identification and topology prediction of ABC proteins

In view of the multiple clinical and physiological implications of ABC transporter proteins, there is considerable interest among researchers in characterizing them functionally. However, such characterizations rest on the premise that ABC proteins are accurately identified in the genome and that their topology is correctly predicted. With this objective, we have developed ABC-finder, a Docker-based package for the identification of ABC proteins in all organisms and for the visualization of their topology in an interactive web browser. ABC-finder is built and deployed in a Linux container, which makes it scalable for many concurrent users on our servers and enables users to download and run it locally. Overall, ABC-finder is a convenient, portable, and platform-independent tool for the identification and topology prediction of ABC proteins. ABC-finder is accessible at http://abc-finder.osdd.jnu.ac.in.


INTRODUCTION
ABC proteins comprise one of the largest and most important protein families, and the majority of its members are involved in active transport [1]. A typical ABC transporter is composed of a pair of nucleotide-binding domains (NBDs) and a pair of transmembrane domains (TMDs) [2]. While the NBDs fuel the transport process by means of ATP hydrolysis, the TMDs are responsible for substrate recognition and form the translocation channel [3]. NBDs have several conserved motifs, namely, Walker A, Walker B, Signature sequence (C-motif), H-loop, D-loop, and Q-loop [4]. In contrast, the TMDs display poor conservation across different subfamilies.
The HUGO Gene Nomenclature Committee divided the ABC superfamily into seven subfamilies, ABCA to ABCG. This classification was later extended to non-mammalian proteins with the addition of the ABCH subfamily, found in insects [5] and fishes [6,7], and the ABCI subfamily, found in plants [8]. ABC transporters can function as importers as well as exporters; however, importers are restricted to bacteria and plants, where they mediate the uptake of nutrients, micronutrients, phytohormones, etc. [9]. Furthermore, the Energy-Coupling Factor (ECF) transporters have recently been categorized as Type III ABC importers [9]. ABC exporters, on the other hand, facilitate the extrusion of a wide spectrum of substrates, including ions, lipids, peptides, and toxic xenobiotics [10]. Numerous studies have implicated ABC transporters in various human diseases and in chemoresistance [1]. Besides transporters, some ABC proteins harbor only the NBDs, for instance, the ABCE and ABCF family representatives, and have implications in ribosome biogenesis, translation control, etc. [11,12], further adding to the immense biological relevance of this superfamily. The primary requirement for in-depth investigation of ABC proteins is their accurate identification from the genome/proteome and analysis of their topology. Even though programs that perform individual steps in isolation exist, there is a need for a unified package that does the job with less hassle and more emphasis on reproducibility, given that reproducibility of research is a key element of modern science [13]. In an effort to make a simpler program available to the biology research community, we herein present "ABC-finder", a simple and fast tool for the identification and topology prediction of ABC proteins, based on our previously established pipeline, which led to the inventorization of ABC proteins in a number of yeast species [14][15][16].
ABC-finder seamlessly combines two established methodologies: profile-HMM searches for homology detection and TOPCONS for prediction of the topology of membrane proteins. Users need to provide only the organism's name or the proteome file as input. Our analysis with reference organisms shows close agreement with the ABC protein inventories available in the literature.

Data submission
Protein sequence data can be submitted to the ABC-finder platform in one of two ways. The user can either specify the organism name (Homo sapiens, Arabidopsis thaliana, or Candida glabrata) or upload a raw (uncompressed) FASTA file of the proteome. The file size is limited to 95 MB. In the last step, the user can provide an email address to have the result files delivered directly.
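The submission constraints described above (raw, uncompressed FASTA of at most 95 MB) can be illustrated with a minimal Python sketch. This is not the server's actual validation code: the function name `validate_fasta_upload`, the use of binary megabytes for the limit, and the gzip magic-number check are all assumptions made for illustration.

```python
# Hypothetical upload check; names and exact rules are illustrative assumptions.
MAX_UPLOAD_BYTES = 95 * 1024 * 1024  # 95 MB limit (binary megabytes assumed)


def validate_fasta_upload(data: bytes) -> int:
    """Return the number of FASTA records in `data`, or raise ValueError."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 95 MB limit")
    if data.startswith(b"\x1f\x8b"):  # gzip magic number
        raise ValueError("compressed files are not accepted; upload raw FASTA")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("file is not plain-text FASTA") from exc

    records = 0
    residues_in_record = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header must not follow a header that had no sequence.
            if records and residues_in_record == 0:
                raise ValueError("header without sequence")
            records += 1
            residues_in_record = 0
        else:
            if records == 0:
                raise ValueError("sequence data before the first header")
            residues_in_record += len(line)

    if records == 0 or residues_in_record == 0:
        raise ValueError("no complete FASTA record found")
    return records
```

A call such as `validate_fasta_upload(open("proteome.fasta", "rb").read())` (file name hypothetical) would return the record count for a well-formed proteome file and raise `ValueError` otherwise.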

Prediction of potential ABC proteins based on NBD HMM
To extract the putative ABC proteins from the sequence input, we utilized a protocol that we developed previously for the inventorization of ABC proteins in yeasts [14][15][16]. Briefly, the HMM profile of the ABC-tran model (accession PF00005) obtained from the Pfam database [17] is used as a query against the proteome of an organism with the help of the "hmmsearch" function within the HMMER package [18] using the default settings. Positive hits above the default threshold are then further filtered based on a cutoff defined from the plot of domain score and E-value. The cutoff is placed at the point in the plot where the domain score drops suddenly.
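The filtering step can be sketched in Python as follows. The sketch assumes the hits were written in HMMER's tabular format, for example with `hmmsearch --tblout hits.tbl PF00005.hmm proteome.fasta` (file names are hypothetical). The "largest gap between consecutive scores" rule in `score_drop_cutoff` is only one possible way to automate the sudden-drop criterion and is an assumption, not necessarily how ABC-finder derives its cutoff.

```python
from typing import List, Tuple

Hit = Tuple[str, float, float]  # (target name, best-domain E-value, best-domain score)


def parse_tblout(text: str) -> List[Hit]:
    """Parse the per-target table written by `hmmsearch --tblout`."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip comment and blank lines
        cols = line.split()
        # Columns: 0 target, 1 accession, 2 query, 3 accession,
        # 4-6 full-sequence E-value/score/bias,
        # 7-9 best-domain E-value/score/bias.
        hits.append((cols[0], float(cols[7]), float(cols[8])))
    return hits


def score_drop_cutoff(hits: List[Hit]) -> List[Hit]:
    """Keep the hits ranked above the largest drop in domain score.

    Assumption: the "sudden drop" is taken to be the largest gap between
    consecutive scores after sorting in descending order.
    """
    ranked = sorted(hits, key=lambda h: h[2], reverse=True)
    if len(ranked) < 2:
        return ranked
    gaps = [ranked[i][2] - ranked[i + 1][2] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps))
    return ranked[: cut + 1]
```

In practice the ranked scores would still be worth plotting, since an automatic gap rule can be misled when a proteome contains only a few hits or several comparable drops.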

Clustering of sequences using CD-HIT
Next, ABC-finder uses the Cluster Database at High Identity with Tolerance (CD-HIT) program [19]. This clustering algorithm groups the sequences based on their sequence similarity. Because CD-HIT clusters highly similar (redundant) sequences, it groups together the different ABC protein sequence variants that usually populate the database in the case of higher eukaryotes. The user can optimize the input parameters of the program from the "advanced options" menu.
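To show how the clustering output can be inspected, the following Python sketch parses a CD-HIT cluster (.clstr) file into its clusters, marking the representative sequence of each. It assumes the standard .clstr layout, in which each cluster starts with a `>Cluster N` line and the representative member ends with `*`; the function name `parse_clstr` is hypothetical and not part of ABC-finder.

```python
import re
from typing import Dict, List

# Member line, e.g. "0\t1500aa, >P1... *" or "1\t1498aa, >P2... at 99.20%"
_MEMBER = re.compile(r"^\d+\s+\d+aa,\s+>(.+?)\.\.\.\s+(\*|at\s.+)$")


def parse_clstr(text: str) -> List[Dict]:
    """Return one dict per cluster with its representative and member IDs."""
    clusters: List[Dict] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append({"representative": None, "members": []})
            continue
        match = _MEMBER.match(line)
        if match is None or not clusters:
            raise ValueError(f"unexpected line in .clstr file: {line!r}")
        seq_id, flag = match.group(1), match.group(2)
        clusters[-1]["members"].append(seq_id)
        if flag == "*":  # CD-HIT marks the representative with an asterisk
            clusters[-1]["representative"] = seq_id
    return clusters
```

Clusters with more than one member point to near-identical variants of the same protein, which is the situation described above for higher eukaryotes.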

Prediction of the TMD and protein topology
Since most of the families in the ABC superfamily include membrane transporters, it is relevant to detect the presence of TMDs among the putative proteins. Following the CD-HIT run, ABC-finder utilizes TOPCONS [20], a widely used program for consensus prediction of membrane protein topology. Once the TM helices are defined using TOPCONS, ABC-finder uses a set of in-house Python scripts to demarcate the NBDs and TMDs in each of the putative ABC proteins. The overall workflow implemented in the ABC-finder server is shown in Figure 1. The help page provides a step-by-step guide to ABC-finder. All results are kept on the server for ten days.
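The demarcation step can be illustrated with a short Python sketch. It is not the in-house code itself: it assumes that the predicted topology is available as a per-residue string in which "M" marks membrane-spanning residues (with "i" and "o" for the inside and outside loops), that NBD coordinates come from the HMM search, and that consecutive helices separated by at most `max_gap` residues belong to the same TMD. The function names and the `max_gap` default are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # 1-based, inclusive residue coordinates


def tm_helices(topology: str) -> List[Span]:
    """Extract runs of 'M' (membrane) residues from a topology string."""
    helices: List[Span] = []
    start = None
    for pos, state in enumerate(topology, start=1):
        if state == "M" and start is None:
            start = pos
        elif state != "M" and start is not None:
            helices.append((start, pos - 1))
            start = None
    if start is not None:  # helix running to the C-terminus
        helices.append((start, len(topology)))
    return helices


def domain_architecture(helices: List[Span], nbds: List[Span], max_gap: int = 100) -> str:
    """Merge nearby helices into TMDs and list all domains in sequence order."""
    tmds: List[Span] = []
    for start, end in helices:
        if tmds and start - tmds[-1][1] <= max_gap:
            tmds[-1] = (tmds[-1][0], end)  # extend the current TMD
        else:
            tmds.append((start, end))  # start a new TMD
    labelled = [("TMD", span) for span in tmds] + [("NBD", span) for span in nbds]
    labelled.sort(key=lambda item: item[1][0])
    return "-".join(name for name, _ in labelled)
```

For a full-size transporter with two helix bundles and two NBD hits, the function would yield a string such as "TMD-NBD-TMD-NBD", whereas a soluble protein of the ABCE or ABCF type would yield only NBD labels.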

IMPLEMENTATION
The ABC-finder server is implemented in Python 3.7.6 (http://www.python.org/) using the web framework Django 3.0.3 (http://www.djangoproject.com/). The system is containerized using Docker 19.03.3 [21] and docker-compose 1.25.0. Job queuing is handled by the asynchronous task queue Celery 3.0.19, which uses the distributed message-passing system Redis 3.3.11.
Users can also install the ABC-finder web server directly from the source code or build and run it from within a Docker container using the instructions provided in the Supplementary information. The Supplementary information also contains FAQs and some useful links.
[...] S. cerevisiae S288C strain is provided in Figure 3. The topology plot should be read alongside the CD-HIT output file "search_faa0(.clstr)" to detect the highly similar sequence clusters, especially in the case of higher eukaryotes, wherein multiple variants are present for each protein in the database. To test ABC-finder, we utilized previously reported inventories of ABC proteins from prokaryotic as well as eukaryotic systems. The results are summarized in Table 1.

SOFTWARE AVAILABILITY
The scripts we used to deploy the containers in our study are publicly available on GitHub and the Docker images are available on DockerHub.