Epitope-Based Peptide Vaccine Against Severe Acute Respiratory Syndrome-Coronavirus-2 Nucleocapsid Protein: An in silico Approach

With an increasing fatality rate, severe acute respiratory syndrome-coronavirus-2 (SARS-CoV-2) has emerged as a serious threat to human health worldwide. SARS-CoV-2 is a member of the Coronaviridae family; it was transmitted from animals to humans and, being contagious, subsequently spread from human to human. Recently, the World Health Organization (WHO) declared the infectious disease caused by SARS-CoV-2, known as coronavirus disease-2019 (COVID-19), a global pandemic. However, no specific medications are available for the treatment of COVID-19 so far. Consequently, there is a need for a potential vaccine to impede the progression of the disease. It has recently been documented that the nucleocapsid (N) protein of SARS-CoV-2 is responsible for viral replication and also interferes with host immune responses. We comparatively analyzed the sequences of the SARS-CoV-2 N protein to identify its core attributes and analyzed its ancestry through phylogenetic analysis. Subsequently, we predicted the most immunogenic epitopes for T-cells as well as B-cells. Importantly, our investigation mainly focused on major histocompatibility complex (MHC) class I potential peptides, and NTASWFTAL interacted with the largest number of human leukocyte antigen (HLA) alleles, which encode MHC class I molecules. Further, molecular docking analysis revealed that NTASWFTAL possessed a high affinity towards HLA and that its binding alleles are present in a wide range of the population. Our study provides a consolidated base for vaccine design, and we hope that this computational analysis will pave the way for designing novel vaccine candidates.


Introduction
The present world has witnessed outbreaks of many life-threatening human pathogens in the 21st century, including Ebola, Chikungunya, Zika, severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV). More recently, in late December 2019, a cluster of pneumonia cases of unknown cause was reported in the city of Wuhan, Hubei province, China. It was later confirmed that these pneumonia cases were due to a novel coronavirus named SARS-CoV-2 (previously named 2019-nCoV), and the disease caused by this virus is referred to as COVID-19 (1-3). On March 11, 2020, the World Health Organization (WHO) assessed that COVID-19 can be characterized as a pandemic. The current COVID-19 pandemic is a global concern and is spreading at an alarming rate; as of April 12, 2020, more than 1.6 million cases and over 105,000 deaths had been reported globally (4).
Coronaviruses (CoVs) are a phenotypically and genotypically diverse group of viruses that can adapt to new environments through mutation and recombination, probably even more than influenza. Coronaviruses often infect mammals and birds and can be transmitted to humans. Six strains of coronaviruses were found in the last few decades, but this is a completely new strain of zoonotic origin. The COVID-19 virus belongs to the genus Betacoronavirus of the Coronaviridae family. Its particles are pleomorphic or spherical, 150 to 160 nm in size, and contain positive-sense single-stranded RNA (ssRNA); the outer surface carries crown-shaped, club-like spike projections. Among all RNA viruses, coronaviruses have the largest genomes, typically ranging from 27 to 32 kb. After the two previously reported coronaviruses, SARS-CoV and MERS-CoV, this is the third such coronavirus to have infected humans, and preliminary investigations revealed that some environmental specimens from the Huanan seafood market in Wuhan were positive for COVID-19 (3). Although the seafood market tested positive for COVID-19, no specific association with an animal has been confirmed yet, according to the WHO report. Researchers are working to establish a possible animal reservoir for COVID-19 (5).
As COVID-19 is mainly a respiratory disease, in most cases it might affect the lungs only. The primary mode of infection is human-to-human transmission through close contact, which occurs via droplets sprayed by an infected individual when coughing or sneezing. The symptoms can be mild, moderate or severe and include fever, cough, and shortness of breath or pneumonia. Respiratory, hepatic and neurological complications can be seen in severe cases and can lead to death. It seems that the severity and fatality rate of COVID-19 are lower than those of SARS and MERS. Although diarrhea was present in about 20-25% of patients with SARS and MERS, intestinal symptoms were rarely reported in patients with COVID-19 (6-8). Patients with multi-organ failure, especially elderly people and people with underlying health conditions such as hypertension, cardiovascular disease and diabetes, exhibit a higher mortality rate in COVID-19.
Interestingly, SARS-CoV-2 has 82% similarity with the original SARS-CoV virus attributed to the outbreak in 2003 (9). A mature SARS-CoV-2 virus generally has a polyprotein (the open reading frame 1a and 1b, Orf1ab), four structural proteins, namely the envelope (E), membrane (M), nucleocapsid (N) and spike (S) proteins, and five accessory proteins (Orf3a, Orf6, Orf7a, Orf8, Orf10). In particular, SARS-CoV-2 encodes an additional glycoprotein with acetyl esterase and hemagglutination (HE) attributes, which distinguishes it from its two predecessors (10). The functions of the accessory proteins may include signal inhibition, apoptosis induction and cell cycle arrest (11). The S protein on the surface of the viral particle enables the infection of host cells by binding to the host cell receptor angiotensin-converting enzyme 2 (ACE2) through its receptor-binding domain (RBD).
The N protein binds to the RNA genome of SARS-CoV-2 and creates a shell, or capsid, around the enclosed nucleic acid. The N protein is involved in viral RNA synthesis and folding, interacts with the viral membrane protein during viral assembly, and affects host cell responses including the cell cycle and translation. An epitope-based peptide vaccine has been proposed in this context. The core mechanism of the peptide vaccine is the chemical synthesis of identified B-cell and T-cell epitopes that are immunodominant and can induce specific immune responses. T-cell epitopes are short peptide fragments (8-20 amino acids), while B-cell epitopes can be proteins (12,13).
Once a mutated virus infects the host cells by escaping the antibodies, the host relies upon T-cell mediated immunity to fight the virus. Viral proteins are processed into short peptides inside the infected cells and then loaded onto major histocompatibility complex (MHC) proteins. After that, the MHC-peptide complexes are presented on the infected cell surface for recognition by specific T cells. Activated CD8+ T cells then recognize the infected cells and clear them. T-cell immunity also depends strictly on the MHC-peptide complexes, whose association is similar to the antigen-antibody association. MHC proteins are encoded by the human leukocyte antigen (HLA) genes, which are located among the most genetically variable regions of the human genome. Each HLA allele can present only a certain set of peptides; the peptides that are presented on the infected cell surface and recognized by T cells are called T-cell epitopes. For a vaccine, it is essential to identify T-cell epitopes that originate from conserved regions of the virus. T-cell responses against the S and N proteins have been reported to be the most dominant and long-lasting (14).
To develop effective diagnostic tests and vaccines, the identification of B-cell and T-cell epitopes of SARS-CoV-2 proteins is critical, especially for the structural N and S proteins. Both humoral immunity and cellular immunity, provided by B-cell antibodies and T-cells respectively, are essential for effective vaccines (15,16). Although humans normally mount an antibody response against viruses, only neutralizing antibodies can completely block the entry of viruses into human cells (17). The location of an antibody binding site on a viral protein strongly affects the body's ability to produce neutralizing antibodies (18). It is important to understand whether SARS-CoV-2 has potential antibody binding sites (B-cell epitopes) near the surface that interacts with its known human entry receptor, ACE2. Besides neutralizing antibodies, the human body also depends on cytotoxic CD8+ T-cells and helper CD4+ T-cells to clear viruses completely. For anti-viral T-cell responses, presentation of viral peptides by human MHC class I and class II molecules is essential (19). MHC-I analysis includes common alleles for HLA-A, HLA-B, and HLA-C. Multiple investigations have indicated that antibodies are generated against the N protein of SARS-CoV, a highly immunogenic protein that is abundantly expressed during infection (20).
The purpose of our present study is to promote the design of a vaccine against COVID-19 using in silico methods, considering the SARS-CoV-2 N protein. We focused particularly on epitopes in the N structural protein because of the dominant and long-lasting immune response previously reported against SARS-CoV. For the identified T-cell epitopes, we incorporated information on the associated MHC alleles so that we can provide a list of epitopes that seeks to maximize population coverage globally. We therefore designed an epitope-based peptide vaccine to narrow down the search for potent targets against SARS-CoV-2 using a computational approach, with the expectation that wet laboratory research will validate our results.

Materials and Methods
The methodologies used for peptide vaccine development for SARS-CoV-2 N protein are shown in Figure 1.

Protein sequence retrieval
The SARS-CoV-2 N protein sequence was retrieved from the NCBI (National Center for Biotechnology Information) protein database (Accession no.: QIC53221.1, GI: 1811294683) in FASTA format.

Sequence analysis
Sequence analysis, the process of subjecting DNA, RNA, or peptide sequences to a wide range of analytical methods, underpins the understanding of a sequence's features, function, structure, and evolution. We employed NCBI BLAST (Basic Local Alignment Search Tool) (21), which screens its database for homologous sequences, and selected the sequences most similar to our SARS-CoV-2 N protein. We also performed multiple sequence alignment (MSA) using the ClustalW web server with default settings. Based on this alignment, a phylogenetic tree was constructed using MEGA6 software, and a sequence logo of the conserved peptide sequences was generated using WebLogo (21-23).
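To illustrate what a sequence logo summarizes, the following minimal Python sketch computes the per-column information content of a protein alignment as log2(20) minus the Shannon entropy. It is only an approximation of the idea, not WebLogo's implementation (no small-sample correction is applied), and the function name and the toy alignment rows are assumptions made for illustration.

```python
import math
from collections import Counter


def column_information(alignment):
    """Return the information content (bits) of each alignment column."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    max_bits = math.log2(20)  # upper bound for the 20 standard amino acids
    info = []
    for col in range(length):
        # Gap characters are ignored when counting residues in a column.
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            info.append(0.0)
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        info.append(max_bits - entropy)
    return info


# Hypothetical aligned rows; the two variant rows are invented for the example.
toy_alignment = ["NTASWFTAL", "NTASWFTAL", "NTASWFSAL", "NTGSWFTAL"]
bits = column_information(toy_alignment)
```

Fully conserved columns reach the maximum of log2(20) bits, while columns with substitutions score lower, which is what the letter heights in a logo convey.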

Protein antigenicity and toxicity prediction
To determine the potent antigenic protein of the SARS-CoV-2 N protein, we used the online server VaxiJen v2.0 with the default threshold value (24). All the antigenic proteins of the SARS-CoV-2 N protein, with their respective scores, were obtained and then sorted in Notepad++. A single antigenic protein with the maximum antigenicity score was selected for further evaluation. The toxicity of the epitopes was analyzed using the ToxinPred web server (25).

Protein secondary and tertiary structure prediction
The secondary structure of the SARS-CoV-2 N protein was predicted using CFSSP (Chou & Fasman Secondary Structure Prediction), because the antigenic part of a protein is more likely to belong to the β-sheet region (26). We also built a three-dimensional (3D) model of the SARS-CoV-2 N protein with EasyModeller, a graphical user interface (GUI) for MODELLER, using template proteins from the Protein Data Bank. The model was validated using the PROCHECK and PROSA web servers (27-30).

CD8+ T-cell epitope prediction
For the de novo prediction of T-cell epitopes, the NetCTL 1.2 server was used, with a threshold of 0.95 to maintain a sensitivity and specificity of 0.90 and 0.95, respectively. The tool covers 12 MHC-I supertypes and integrates the prediction of peptide MHC-I binding and proteasomal C-terminal cleavage with TAP transport efficiency. These predictions were performed by an artificial neural network and a weighted TAP transport efficiency matrix; a combined algorithm for MHC-I binding and proteasomal cleavage efficiency was then used to determine the overall scores, which were translated into sensitivity/specificity. Based on this overall score, the five best peptides (epitopes) were selected for further evaluation.
For the prediction of peptides binding to MHC-I, we used a tool from the Immune Epitope Database (IEDB) and calculated IC50 values for peptides binding to specific MHC-I molecules (31). For the binding analysis, all the frequently used alleles were selected with a peptide length of nine residues, and peptides with a binding affinity (IC50) < 200 nM were retained for further analysis. Another tool provided by the IEDB server, MHC-NP, was used to assess the probability that a given peptide was naturally processed and bound to a given MHC molecule (32).
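The selection logic described above can be sketched in a few lines of Python: enumerate all nine-residue windows, combine the three component scores into one weighted score, and keep only alleles below the 200 nM IC50 cutoff. This is a simplified illustration and not the NetCTL or IEDB code. The weights 0.15 and 0.05 are assumed to match the NetCTL default settings, and the toy fragment, allele predictions and function names are hypothetical.

```python
def nonamers(sequence, length=9):
    """Yield (1-based start position, peptide) for every window of `length` residues."""
    for start in range(len(sequence) - length + 1):
        yield start + 1, sequence[start:start + length]


def combined_score(mhc, cleavage, tap, w_cleavage=0.15, w_tap=0.05):
    """Weighted sum of MHC-I binding, C-terminal cleavage and TAP scores.

    The weights are assumptions modeled on NetCTL's defaults.
    """
    return mhc + w_cleavage * cleavage + w_tap * tap


def strong_binders(predictions, ic50_cutoff_nm=200.0):
    """Keep (allele, IC50 in nM) pairs whose predicted IC50 is below the cutoff."""
    return [(allele, ic50) for allele, ic50 in predictions if ic50 < ic50_cutoff_nm]


# Toy fragment: the epitope with invented flanking residues (not the real N protein).
fragment = "GGNTASWFTALGG"
windows = list(nonamers(fragment))

# Hypothetical IC50 predictions, for illustration only.
toy_predictions = [("HLA-A*68:02", 12.5), ("HLA-B*07:02", 850.0)]
kept = strong_binders(toy_predictions)
```

In practice the component scores and IC50 values would come from the prediction servers themselves; the sketch only shows how the threshold-based filtering fits together.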

Epitope conservancy and immunogenicity prediction
Epitope conservancy describes the degree of similarity between the epitope and the target (i.e. given) sequence. This property indicates how likely the epitope is to be present across a range of different strains. For the analysis of epitope conservancy, the web-based tool from the IEDB analysis resources was used (33). Immunogenicity prediction indicates how efficiently the respective epitope can produce an immunogenic response. For this purpose we used the T-cell class I pMHC immunogenicity predictor at IEDB, which uses amino acid properties as well as their positions within the peptide to predict the immunogenicity of a class I peptide MHC (pMHC) complex (34).
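As a rough illustration of the conservancy idea, the Python sketch below reports the fraction of sequences that contain the epitope at or above a chosen identity level, using an ungapped window comparison. It is an assumption-laden simplification rather than the IEDB tool's algorithm, and the function names and toy sequences are hypothetical.

```python
def best_window_identity(epitope, sequence):
    """Highest fraction of matching positions between the epitope and any same-length window."""
    k = len(epitope)
    best = 0.0
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        matches = sum(a == b for a, b in zip(epitope, window))
        best = max(best, matches / k)
    return best


def conservancy(epitope, sequences, identity_threshold=1.0):
    """Fraction of sequences that contain the epitope at or above the identity threshold."""
    if not sequences:
        raise ValueError("no sequences supplied")
    hits = sum(best_window_identity(epitope, s) >= identity_threshold for s in sequences)
    return hits / len(sequences)


# Invented toy sequences; the third carries a single substitution in the epitope.
toy_sequences = ["AANTASWFTALAA", "GGNTASWFTALGG", "AANTASWFSALAA"]
exact = conservancy("NTASWFTAL", toy_sequences)
relaxed = conservancy("NTASWFTAL", toy_sequences, identity_threshold=0.8)
```

Lowering the identity threshold shows how a single substitution changes whether a strain counts as a conservancy hit.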

Prediction of population coverage and allergenicity assessment
The population coverage tool from IEDB was applied to determine the population coverage for every single epitope by selecting HLA alleles of the corresponding epitope.
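The principle behind such a coverage estimate can be sketched as follows, assuming Hardy-Weinberg proportions and independence between loci: an individual is covered if at least one of the two chromosomes carries an epitope-binding allele at any locus. This is a simplified model and not necessarily the exact IEDB calculation; the allele frequencies and the function name are hypothetical.

```python
def population_coverage(covered_allele_freqs):
    """Estimate the fraction of individuals carrying at least one epitope-binding allele.

    covered_allele_freqs maps a locus name to the population frequencies of the
    alleles at that locus that are predicted to bind the epitope.
    """
    uncovered = 1.0
    for locus, freqs in covered_allele_freqs.items():
        total = sum(freqs)
        if not 0.0 <= total <= 1.0:
            raise ValueError(f"allele frequencies at {locus} must sum to a value in [0, 1]")
        # Probability that neither chromosome carries a binding allele at this locus.
        uncovered *= (1.0 - total) ** 2
    return 1.0 - uncovered


# Hypothetical allele frequencies, for illustration only.
toy_freqs = {"HLA-A": [0.10, 0.05], "HLA-B": [0.20]}
coverage = population_coverage(toy_freqs)
```

Because HLA allele frequencies differ between regions, the same epitope yields different coverage values for different populations.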
The allergenicity of the predicted epitope was assessed using AllerTop v2.0 (35), an alignment-free server for in silico allergenicity prediction of a protein based on its physicochemical properties.

Epitope model generation
The 3D structures of the selected epitopes were predicted by PEP-FOLD, a web-based server (36). For each sequence, the server predicted five probable structures. The energy of each structure was determined by SWISS-PDB VIEWER and the structure with the lowest energy was chosen for further analysis.

Retrieval of HLA allele molecule
The three-dimensional structure of the HLA-A*68:02 (PDB ID: 4I48) was retrieved from Protein Data Bank (RCSB-PDB).

Molecular docking analysis
Molecular docking analysis was performed using AutoDock Vina in PyRx 0.8, with the HLA-A*68:02 molecule as the receptor protein and the identified epitopes as ligand molecules (37).
First, we used the protein preparation wizard of UCSF Chimera (Version 1.11.2) to prepare the protein for docking analysis by deleting the attached ligand and adding hydrogens and Gasteiger-Marsili charges (38,39). The prepared file was then added to the AutoDock wizard of PyRx 0.8 and converted into pdbqt format. The energy of the ligand was minimized and the ligand was converted to pdbqt format by OpenBabel (40). The parameters used for the docking simulation were set to default. The size of the grid box in AutoDock Vina was kept at 50.183 Å along each of the X, Y and Z axes. AutoDock Vina was run via the shell script offered by the AutoDock Vina developers. Docking results were reported as negative scores in kcal/mol, representing the binding affinity of the ligands (41).
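For readers who run AutoDock Vina from the command line rather than through PyRx, the grid settings above correspond to a plain-text configuration file. The Python sketch below only assembles such a file. The file names and the grid center are placeholders (the source does not report the center coordinates), and the helper name is hypothetical.

```python
def vina_config(receptor, ligand, center, size=(50.183, 50.183, 50.183),
                out="docked.pdbqt"):
    """Return the text of an AutoDock Vina configuration file."""
    cx, cy, cz = center
    sx, sy, sz = size
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"center_x = {cx}",   # grid center in Angstrom (placeholder values below)
        f"center_y = {cy}",
        f"center_z = {cz}",
        f"size_x = {sx}",     # grid box edge lengths in Angstrom
        f"size_y = {sy}",
        f"size_z = {sz}",
        f"out = {out}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder file names and center; replace them with values from the prepared structures.
config_text = vina_config("hla_a6802.pdbqt", "ntaswftal.pdbqt", center=(0.0, 0.0, 0.0))
```

Saved to a file such as conf.txt, the text could be passed to Vina with `vina --config conf.txt`; all other parameters would then keep their defaults, as in the protocol described above.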

B-cell epitope identification
The prediction of B-cell epitopes was performed to find the potential antigen that elicits humoral immunity. To detect B-cell epitopes, various IEDB tools were used to assess B-cell antigenicity, including the Emini surface accessibility prediction, the Kolaskar and Tongaonkar antigenicity scale, the Karplus and Schulz flexibility prediction and the Bepipred linear epitope prediction; since antigenic parts of a protein tend to belong to beta-turn regions, the Chou and Fasman beta-turn prediction tool was also used (42-47).

Results

Sequence retrieval and analysis
We retrieved the SARS-CoV-2 N protein sequence from the NCBI database (Accession No.: QIC53221.1). We then performed BLASTp using NCBI-BLAST for the nucleocapsid protein of SARS-CoV-2 and retrieved a total of 100 homologs with > 60% sequence identity.
Multiple sequence alignment was then performed (Supplementary data 1), and a phylogenetic tree was constructed (Supplementary Figure S1). The multiple sequence alignment confirmed that the protein sequences are closely related. A sequence logo was generated using the WebLogo server to show the conserved region (Figure 2).

Antigenic protein prediction
The most potent antigenic protein of the SARS-CoV-2 N protein was predicted by VaxiJen v2.0, which is based on the auto-cross covariance transformation of protein sequences into uniform vectors of principal amino acid properties (Table 1). The overall antigen prediction score was 0.5002 (probable antigen) at a threshold value of 0.4.
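The auto-cross covariance (ACC) transformation mentioned above turns sequences of different lengths into vectors of equal length. A minimal Python sketch of the general ACC formula follows. It is not VaxiJen's code; the descriptor values are invented placeholders rather than published amino acid scales, and the function names are hypothetical.

```python
def auto_cross_covariance(sequence, descriptors, j, k, lag):
    """ACC_jk(lag) = sum over i of z_j(i) * z_k(i + lag), divided by (n - lag)."""
    n = len(sequence)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(sequence) - 1")
    total = 0.0
    for i in range(n - lag):
        total += descriptors[sequence[i]][j] * descriptors[sequence[i + lag]][k]
    return total / (n - lag)


def acc_vector(sequence, descriptors, max_lag):
    """Concatenate ACC terms for all descriptor pairs and all lags up to max_lag."""
    n_desc = len(next(iter(descriptors.values())))
    return [
        auto_cross_covariance(sequence, descriptors, j, k, lag)
        for lag in range(1, max_lag + 1)
        for j in range(n_desc)
        for k in range(n_desc)
    ]


# Placeholder descriptor values (three per residue), invented for illustration.
TOY_Z = {
    "A": (0.5, -1.0, 0.2), "N": (1.2, 0.3, -0.4), "T": (0.1, -0.2, 0.9),
    "S": (0.8, -0.6, 0.1), "W": (-1.5, 1.1, 0.7), "F": (-1.3, 0.6, -0.2),
    "L": (-1.0, -0.4, 0.3),
}
vector_full = acc_vector("NTASWFTAL", TOY_Z, max_lag=2)
vector_short = acc_vector("SWFTAL", TOY_Z, max_lag=2)
```

Both toy peptides yield vectors of the same length (number of lags times the square of the number of descriptors), which is the property that makes sequences of different lengths comparable.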

Toxicity prediction
The toxicity of the selected peptide sequences was assessed using the ToxinPred web server. The ToxinPred results indicated that all of our probable epitopes were non-toxic (Table 1).

Protein secondary and tertiary structure prediction
The secondary structure of a protein describes its α-helices, β-sheets, and random coils. Our SARS-CoV-2 N protein has 419 residues, of which 213 residues (50.8%) form helices, 187 residues (44.6%) form sheets and 66 residues (15.8%) form coils (Figure 3). For the 3D structure, we built a model using EasyModeller, which was validated using the PROCHECK and PROSA web servers. The discrete optimized protein energy (DOPE) score was calculated as -19585.62653. PROSA predicted a z-score of 0.56 for the model, and the PROCHECK server was used for Ramachandran plot calculations. The Ramachandran plot of the model protein indicated that 76% of residues were in the most favorable region, 22% in the allowed region and 0.4% in the disallowed region (Figure 4).

CD8+ T-cell epitope identification
Based on high combinatorial and MHC binding scores, the top eight epitopes predicted by the NetCTL server from the selected protein sequence were selected for further analysis. Using the MHC-I binding prediction tool, which is based on the stabilized matrix method (SMM), we selected those MHC-I alleles for which the epitopes had a predicted binding affinity (IC50) below 200 nM (Table 2). Moreover, the MHC-NP prediction tool assigned our predicted epitope NTASWFTAL its highest probable score, 1.11, for HLA-A*68:02. Furthermore, all the predicted epitopes had a maximum identity of 100% in the conservancy analysis (Table 2). The class I pMHC immunogenicity score of the epitope NTASWFTAL was 0.22775 (Table 2).

Population coverage
The cumulative population coverage was obtained for the predicted epitope NTASWFTAL. The results demonstrated that East Asia was the region with the highest coverage, at 57.16%. The results of the population coverage analysis are shown in Table 3 and Supplementary Figures S2-S5.

Allergenicity assessment
An allergic reaction caused by a vaccine might be harmful or life-threatening to an individual. The allergenicity of the selected epitope was therefore assessed using the AllerTop tool, which predicted it to be a probable non-allergen.

Molecular docking analysis for HLA and epitope interaction
In this study, the interaction between the HLA molecules and our predicted potential epitope was verified by molecular docking simulation using AutoDock Vina in PyRx 0.8 software. Among all the MHC class I alleles, only HLA-A*68:02 had the maximum probable score for our most potent epitope NTASWFTAL (Figure 5). Therefore, we carried out the molecular docking study using HLA-A*68:02 (PDB ID: 4I48). We found that our predicted epitope NTASWFTAL interacted with HLA-A*68:02 with a strong binding affinity of -9.4 kcal/mol (Figure 6).

B-cell epitope prediction
In this study, we predicted B-cell epitopes using amino acid scale-based methods. Different analysis methods were used for the prediction of continuous B-cell epitopes. The results of the B-cell predictions are shown in Table 4 and Figures 7-9.
First, Bepipred linear epitope prediction was used, which is regarded as the best single method for predicting linear B-cell epitopes using a Hidden Markov model. Our analysis revealed that the peptide sequence from amino acid residues 232 to 269 was able to induce the desired immune response as a B-cell epitope.
The β-turns were predicted by the Chou and Fasman β-turn prediction method. The region spanning residues 233-239 was predicted as a β-turn region with a score of 1.164, which was higher than the average score. To predict surface accessibility, this study used the Emini surface accessibility prediction method. The average surface accessibility score was 1.0 and the minimum was 0.050. Consistent with the preceding B-cell epitope results, the peptide from residues 237 to 242 was predicted to have better surface accessibility.
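Most amino acid scale-based predictors share one mechanism: a per-residue propensity is averaged over a sliding window, and stretches scoring above a threshold are reported as candidate regions. The Python sketch below shows that generic mechanism only. It is not the IEDB implementation (the Emini method, for instance, uses a product formula rather than an average), and the two-letter toy scale and function names are hypothetical.

```python
def window_profile(sequence, scale, window=7):
    """Average a per-residue scale over a centered sliding window.

    Returns a list of (1-based center position, mean score) pairs.
    """
    if window % 2 == 0 or window > len(sequence):
        raise ValueError("window must be odd and no longer than the sequence")
    half = window // 2
    profile = []
    for center in range(half, len(sequence) - half):
        residues = sequence[center - half:center + half + 1]
        score = sum(scale[r] for r in residues) / window
        profile.append((center + 1, score))
    return profile


def regions_above(profile, threshold):
    """Group consecutive positions scoring above the threshold into (start, end) ranges."""
    regions = []
    start = prev = None
    for pos, score in profile:
        if score > threshold:
            if start is None:
                start = pos
            prev = pos
        elif start is not None:
            regions.append((start, prev))
            start = None
    if start is not None:
        regions.append((start, prev))
    return regions


# Invented two-residue scale, used only to make the example easy to follow.
toy_scale = {"A": 0.0, "G": 1.0}
profile = window_profile("AAAGGGGGAAA", toy_scale, window=3)
candidate_regions = regions_above(profile, threshold=0.5)
```

With a real propensity scale, the threshold would typically be the average score over the whole protein, which is how the residue ranges reported above are obtained from the server output.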

Discussion
As yet, it has been reported that the reproduction rate of SARS-CoV-2 is greater than that of SARS and MERS, and the symptoms of COVID-19 infection include fever with a body temperature above 38 °C along with alveolar edema, leading to difficulty in breathing, whereas mild cases may not produce a high fever (48). Surprisingly, with a high fatality rate, the infection was recently reported to be more severe than those caused by SARS and MERS, with multiple organ damage (49).
At present, researchers are examining compounds repurposed from other viral infections to treat SARS-CoV-2. For example, lopinavir and ritonavir are both HIV protease inhibitors, but in a lopinavir-ritonavir clinical trial report the treatment benefit was dubious (50). Convalescent immunoglobulins derived from recovering patients are currently being investigated as a potential treatment for the disease (51). As no approved treatments for COVID-19 exist so far, these treatments are the best hope for keeping the mortality rate low before vaccines become widely available. Earlier, vaccine development was thought to rely primarily on B-cell immunity, but recent findings revealed that T-cell epitopes are more promising because CD8+ T-cells mediate a longer-lasting immune response and because, owing to antigenic drift, an antibody may no longer be able to respond against an antigen (62). In this study, focusing on MHC class I potential peptide epitopes, we predicted T-cell as well as B-cell epitopes that are predicted to elicit immune responses in various ways. Many characteristics, including antigenicity and toxicity, need to be taken into consideration when developing a protein sequence-based epitope into a vaccine candidate, and the eight predicted epitopes fulfilled all the criteria. However, only five potent epitopes were predicted by the NetCTL 1.2 server, and these epitopes were taken forward for further analysis. Besides, all peptides except SSPDDQIGY were able to interact with MHC class I alleles, and NTASWFTAL interacted with the most MHC class I alleles. Amongst them, HLA-A*68:02 had the highest probable score. Further, the conservancy of the epitopes, predicted by the IEDB conservancy analysis tool, showed that all of our predicted epitopes had the maximum identity of 100%.
Therefore, we selected the epitope NTASWFTAL for further analysis because it interacted with the most MHC class I alleles and showed the highest conservancy.
Allergenicity is regarded as one of the most noteworthy obstacles in vaccine development. Importantly, allergic reactions involve T-cells and are stimulated by type 2 T helper cells along with immunoglobulin E (63). In this experiment, we assessed allergenicity using AllerTop 2.0, which is well recognized for its high sensitivity and is able to identify structurally diverse allergens in comparison with the known allergens. AllerTop predicted our selected epitope to be a non-allergen.
It has been proposed that T-cell epitopes bind to MHC molecules; MHC class I molecules generally present short peptides of 8-11 amino acids, whereas MHC class II molecules present longer peptides of 13-17 amino acid residues (64). In this experiment, we determined the binding affinity of the predicted epitope (reflecting presentation of the antigen on the cell surface) using molecular docking analysis and demonstrated that NTASWFTAL interacted with HLA-A*68:02 with a binding affinity of -9.4 kcal/mol, which indicated a strong interaction between the epitope and the HLA molecule, as a more negative energy implies a higher binding affinity (65). The molecular docking results also revealed that the epitope NTASWFTAL formed hydrogen bonds with both chain A and chain B of the HLA molecule and that attractive charges also contributed to the binding.
Another factor considered among the most prominent during vaccine development is population coverage, as the distribution of HLA alleles varies according to ethnicity and geographical region. Our experiment showed that the epitope NTASWFTAL covered almost all regions of the world, with the highest coverage observed in East Asia, where COVID-19 was first reported. Interestingly, our findings indicated that our predicted epitope specifically binds to widespread HLA molecules, suggesting that a vaccine based on it could be broadly applicable.
In addition, a B-cell epitope can provide a strong immune response without causing any adverse effects. Therefore, we also performed B-cell epitope prediction and found that the protein sequence from amino acid residues 232 to 249 acts as a B-cell epitope. The identified region might be able to stimulate the desired immune response and is also important for developing a vaccine.
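To give a feeling for why a more negative score implies tighter binding, the short Python sketch below converts a binding free energy into an approximate dissociation constant through ΔG = RT ln(Kd). Docking scores are empirical estimates rather than measured free energies, so the result should be read as an order-of-magnitude illustration; the temperature of 298.15 K and the function name are assumptions.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def dissociation_constant(delta_g_kcal, temperature_k=298.15):
    """Approximate Kd (mol/L) from a binding free energy, using dG = RT ln(Kd)."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))


kd_epitope = dissociation_constant(-9.4)   # docking score reported for NTASWFTAL
kd_weaker = dissociation_constant(-6.0)    # hypothetical weaker binder for comparison
```

Under these assumptions, a score of -9.4 kcal/mol corresponds to a Kd on the order of 0.1 µM, whereas each additional kcal/mol of binding energy lowers Kd by roughly a factor of five.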

Conclusions
Immunoinformatics has now emerged as a promising field for the prediction of epitope-based vaccines. As viruses can elicit both T-cell and humoral immunity, our predicted epitope might help to enhance immunity against SARS-CoV-2. This assumption is based on the basic principles of immunity, which involve the attachment of the virus to the host cell, the evoking of immune responses and the transfer of information to a broad spectrum of T cells and B cells. Using computational approaches, our investigated epitopes mimic the interaction involved in antigen presentation to CD8+ T cells. However, our study is an introductory design for predicting an epitope-based vaccine against SARS-CoV-2, and we hope that this predicted epitope will assist further laboratory analysis for designing and predicting novel candidates against COVID-19.

Conflict of interest
The authors report no conflicts of interest in this work.

Funding
This work is conducted with the individual funding of all authors.

Table caption: The potential CD8+ T-cell epitopes along with their interacting MHC class I alleles and total processing score, epitope conservancy hits and pMHC-I immunogenicity score.
Table column headers: MHC-I interaction with an affinity of IC50 < 200 and the total score (proteasome score, TAP score, MHC-I score, processing score); pMHC-I immunogenicity score.
Figure caption (partial): Ramachandran plot analysis of the protein using PROCHECK web server; (C) z-score predicted by PROSA server.