Abstract
One of the reasons for the fast spread of SARS-CoV-2 is the limited accuracy of the detection tools used in clinical practice. Molecular techniques, such as quantitative real-time RT-PCR and nucleic acid sequencing methods, are widely used to identify pathogens. For this particular virus, however, they have an overall unsatisfying detection rate, due to its relatively recent emergence and still incompletely understood features. In addition, SARS-CoV-2 is remarkably similar to other Coronaviruses, and it can present alongside other respiratory infections, making identification even harder. To tackle this issue, we propose an assisted detection test that combines molecular testing with deep learning. The proposed approach employs a state-of-the-art deep convolutional neural network, able to automatically create features starting from the genome sequence of the virus. Experiments on data from the Novel Coronavirus Resource (2019nCoVR) show that the proposed approach is able to correctly classify SARS-CoV-2, distinguishing it from other coronavirus strains, such as MERS-CoV, HCoV-NL63, HCoV-OC43, HCoV-229E, HCoV-HKU1, and SARS-CoV, regardless of missing information and errors in sequencing (noise). On a dataset of 553 non-repeated complete genome sequences, varying from 1,260 to 31,029 bps in length, the proposed approach classifies the different coronaviruses with an average accuracy of 98.75% in a 10-fold cross-validation, identifying SARS-CoV-2 with an AUC of 98%, a specificity of 0.9939 and a sensitivity of 1.00 in a binary classification. Then, on the same basis, we separate SARS-CoV-2 from the other organisms among 384 complete viral genome sequences from the NCBI with human host that contain the gene ORF1ab, with a 10-fold accuracy of 98.17%, a specificity of 0.9797 and a sensitivity of 1.00. These preliminary results seem encouraging enough to identify deep learning as a promising research avenue for developing assisted detection tests for SARS-CoV-2.
To this end, the interaction between viromics and deep learning will hopefully help to solve global infection problems. In addition, we offer our code and processed data to be used for diagnostic purposes by medical doctors, virologists and scientists involved in solving the SARS-CoV-2 pandemic. As more data become available, we will update our system.
- convolutional neural networks
- coronavirus
- deep learning
- SARS-CoV-2
1. Introduction
The Coronaviridae family presents a positive-sense, single-stranded RNA genome. These viruses have been identified in avian and mammalian hosts, including humans. Coronaviruses have genomes from 26.4 kilo base-pairs (kbps) to 31.7 kbps, with G + C contents varying from 32% to 43%, and human-infecting coronaviruses include SARS-CoV, MERS-CoV, HCoV-OC43, HCoV-229E, HCoV-NL63 and HCoV-HKU1 [1]. In December 2019, SARS-CoV-2, a novel, human-infecting Coronavirus, was identified in Wuhan, China, using Next Generation Sequencing [2].
As with any typical RNA virus, new mutations appear in every replication cycle of a Coronavirus, and its average evolutionary rate is roughly 10⁻⁴ nucleotide substitutions per site each year [2]. In the specific case of SARS-CoV-2, RT-qPCR testing using primers in the ORF1ab and N genes has been used to identify the infection in humans. However, this method presents a high false negative rate (FNR), with a detection rate of 30-50% [3, 4]. This low detection rate can be explained by the variation of viral RNA sequences within virus species, and by the viral load in different anatomic sites [5]. For example, the population mutation frequency at site 8,872, located in the ORF1ab gene, and at site 28,144, located in the ORF8 gene, gradually increased from 0 to 29% as the epidemic progressed [6].
As of March 6th, 2020, SARS-CoV-2 has 98,192 confirmed cases across 88 countries, with 17,481 cases outside of China [7]. In addition, SARS-CoV-2 has an estimated mortality rate of 3-4%, and it is spreading faster than SARS-CoV and MERS-CoV [8]. SARS-CoV-2 assays can yield false positives if they are not targeted specifically to SARS-CoV-2, as the virus is closely related to other Coronavirus organisms. In addition, SARS-CoV-2 may present alongside other respiratory infections, which makes it even more difficult to identify [9, 10]. Thus, it is fundamental to improve existing diagnostic tools to contain the spread. For example, diagnostic tools combining computed tomography (CT) scans with deep learning have been proposed, achieving an improved detection accuracy of 82.9% [11]. Another solution for identifying SARS-CoV-2 is additional sequencing of the viral complementary DNA (cDNA). We can use sequencing data from cDNA resulting from the PCR of the original viral RNA, e.g., Real-Time PCR amplicons (Fig. 1), to identify SARS-CoV-2 [12].
Fig. 1. PCR amplicon sequencing procedure.
Classification using viral sequencing techniques is mainly based on alignment methods such as FASTA [13] and BLAST [14]. These methods rely on the assumption that DNA sequences share common features, and that the order of these features is preserved among different sequences [15, 16]. However, these methods require reference sequences to perform the detection [17]. It is therefore necessary to develop innovative, improved diagnostic tools that target the genome to better identify pathogenic variants, as sometimes several tests are needed to reach an accurate diagnosis. As an alternative, deep learning methods have been suggested for the classification of DNA sequences, as these methods do not need pre-selected features to identify or classify DNA sequences. Deep learning has been used efficiently for the classification of DNA sequences, using one-hot label encoding and Convolutional Neural Networks (CNN) [18, 19], although the examples in the literature only feature DNA sequences of up to 500 bps in length.
In particular, in the case of viruses, Next Generation Sequencing (NGS) genomic samples might not be identified by BLAST: because viruses have a high mutation frequency, there are no reference sequences valid for all genomes [20]. Alternative solutions based on deep learning have been proposed to classify viruses, by dividing sequences into pieces of fixed length, from 300 bps [20] to 3,000 bps [21]. However, this approach has the negative effect of potentially ignoring part of the information contained in the input sequence, which is disregarded if it cannot completely fill a piece of fixed size.
Given the impact of the world-wide outbreak, international efforts have been made to simplify access to viral genomic data and metadata through international repositories, such as the 2019 Novel Coronavirus Resource (2019nCoVR) repository [6] and the National Center for Biotechnology Information (NCBI) [22], in the expectation that easier access to information will make it possible to develop medical countermeasures to control the disease worldwide, as happened in similar cases earlier [23, 24, 25]. Thus, taking advantage of the information made available by international resources without any political and/or economic borders, we propose an innovative system based on viral gene sequencing.
Differently from previous works in the literature, which use deep learning with fixed-length features and one-hot label encoding, in this work we propose the use of a different encoding to input the full sequence as a whole. In addition, we use an input vector of 31,029 elements, which is the maximum length of the available DNA sequences for Coronavirus. Finally, we propose a novel architecture for the deep network, inspired by successful applications in cancer detection starting from miRNA [26].
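To make the cost of fixed-length splitting concrete, the following minimal sketch counts how many bases are left over when a sequence is cut into non-overlapping pieces and the incomplete last piece is dropped. The function name `split_fixed` is purely illustrative, and whether [20] and [21] handle the remainder in exactly this way is an assumption made here for the sake of the example.

```python
def split_fixed(sequence, piece_len):
    """Cut a sequence into non-overlapping pieces of piece_len bases.

    Returns the complete pieces and the number of trailing bases that
    do not fill a whole piece (and would therefore be discarded).
    """
    n_pieces = len(sequence) // piece_len
    pieces = [sequence[i * piece_len:(i + 1) * piece_len] for i in range(n_pieces)]
    discarded = len(sequence) - n_pieces * piece_len
    return pieces, discarded


# Hypothetical placeholder genome of 31,029 bases.
genome = "A" * 31029
for piece_len in (300, 3000):
    pieces, discarded = split_fixed(genome, piece_len)
    print(piece_len, len(pieces), discarded)
```

For a sequence of 31,029 bps, pieces of 300 bps leave 129 bases unused, while pieces of 3,000 bps leave 1,029 bases unused.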
2. Methods
2.1. Data
2.1.1. Classification of Coronaviruses
SARS-CoV-2 identification can give wrong results, as the virus is difficult to distinguish from other Coronaviruses, due to their genetic similarity. In addition, people with SARS-CoV-2 may present other infections besides the virus [9, 10]. Therefore, it is important to be able to reliably distinguish SARS-CoV-2 from other Coronaviruses.
From the repository 2019 Novel Coronavirus Resource (2019nCoVR) [6], we downloaded all the available sequences with the query Nucleotide Completeness=“complete” AND host=“homo sapiens”, for a total of 588 samples. Next, we removed all repeated sequences, resulting in 553 unique sequences of variable length (1,260-31,029 bps). The data was organized and labeled as summarized in Table 1. We grouped HCoV-229E and HCoV-OC43 in the same class, as they are mostly known as Coronaviruses responsible for the common cold [27]; the two available samples of HCoV-4408 were also added to the same class, as it is a Betacoronavirus 1, like HCoV-OC43. In a similar fashion, we grouped HCoV-NL63 and HCoV-HKU1, as they are both associated with acute respiratory infections (ARI) [28]. Finally, we grouped the SARS-CoV/SARS-CoV-P2/SARS-CoV HKU-39849 [29]/SARS-CoV GDH-BJH01 organisms together, as they are all strains of SARS.
To encode the cDNA data into an input tensor for the CNN, we assigned numeric values to the different bases: C=0.25, T=0.50, G=0.75, A=1.0 (see Fig. 2). All missing entries were assigned the value 0.0. This procedure is different from previous methods, which relied upon one-hot encoding [21, 20], and has the advantages of making the input more human-readable and of not multiplying the amount of memory required to store the information. We divide the available samples in two parts, 90% for training and validation (80% training, 10% validation) and 10% for testing, in a 10-fold cross-validation scheme. k-fold cross-validation is a procedure by which the available data is divided into k parts, called folds. At each iteration i, the i-th fold is used as a test set, while all the other folds are used for training. At the end of the k-th iteration, the average test performance of the model over all folds provides a good estimate of the generality of the results. In this particular case, we use stratified folds, which preserve the same proportion of classes in every fold. The procedure is summarized in Fig. 3.
Fig. 2. Coding for the input sequences.
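The encoding described above can be sketched in a few lines of Python. This is a minimal illustration rather than the released code: the names `BASE_VALUES`, `MAX_LEN` and `encode_sequence` are hypothetical, and two details are assumptions, namely that any symbol other than C, T, G or A (for example N) counts as a missing entry, and that shorter sequences are padded on the right with 0.0 up to 31,029 elements.

```python
# Numeric value assigned to each base, as given in the text.
BASE_VALUES = {"C": 0.25, "T": 0.50, "G": 0.75, "A": 1.0}
MAX_LEN = 31029  # length of the longest sequence in the dataset


def encode_sequence(sequence, max_len=MAX_LEN):
    """Map a nucleotide string to a fixed-length list of floats.

    Assumptions: unknown symbols (e.g. N) are treated as missing (0.0),
    and sequences shorter than max_len are right-padded with 0.0.
    """
    values = [BASE_VALUES.get(base, 0.0) for base in sequence.upper()[:max_len]]
    values.extend([0.0] * (max_len - len(values)))
    return values


# Short illustrative input; a real genome would use the default max_len.
print(encode_sequence("ACGTN", max_len=8))
```

Each base occupies a single float, whereas a one-hot encoding over the four bases would need four values per position.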
Fig. 3. Scheme of a k-fold cross-validation. Available data is divided into k parts. At each iteration i, the i-th fold is used for testing, while all the others are used as a training set.
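As an illustration of the stratified scheme, the sketch below builds the folds with the Python standard library only. It is not the code used in the experiments: the function names are hypothetical, and the way the validation set is chosen (the fold following the test fold) is an assumption, since the text only states the 80%/10%/10% proportions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0  # continue dealing where the previous class stopped
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train, validation, test) index lists for each iteration."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        val_fold = (i + 1) % k  # assumption: next fold serves as validation
        train = []
        for j, fold in enumerate(folds):
            if j != i and j != val_fold:
                train.extend(fold)
        yield train, folds[val_fold], folds[i]


# Hypothetical labels: 20 samples of class 0 and 80 samples of class 1.
labels = [0] * 20 + [1] * 80
for train, val, test in cross_validation_splits(labels):
    print(len(train), len(val), len(test))
```

Dealing the shuffled indices of each class round-robin keeps both the fold sizes and the per-class counts within one sample of each other.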
2.1.2. Separating SARS-CoV-2 from other viruses containing gene ORF1ab
The ORF1ab gene spans about two thirds of the Coronavirus genome [1]. Therefore, it is important that we are able to differentiate SARS-CoV-2 from other viruses that contain this gene, like Astroviruses. From the NCBI repository [30], we downloaded the genome sequences corresponding to the following search: gene=“ORF1ab” AND host=“homo sapiens” AND “complete genome”. This resulted in 402 sequences, distributed as described in Table 2. For this data, we assigned label 0 to SARS-CoV-2, and grouped the rest of the organisms together under label 1. Next, we removed all the repeated sequences, obtaining a total of 384 unique sequences, with 45 samples belonging to SARS-CoV-2. The genomic data was translated to digits using the encoding previously described in Subsection 2.1.1.
Table 2. Organism, assigned label, and number of samples in the unique sequences obtained from the repository NCBI [30].
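The two preparation steps described above, removing repeated sequences and assigning binary labels, can be sketched as follows. This is only an illustration under stated assumptions: the record format (organism name, sequence), the exact organism string "SARS-CoV-2", the case-insensitive comparison of sequences, and the function names are all hypothetical and may differ from the actual metadata and code.

```python
def deduplicate(records):
    """Keep the first occurrence of every distinct sequence.

    records is a list of (organism, sequence) pairs; sequences are
    compared case-insensitively (an assumption of this sketch).
    """
    seen = set()
    unique = []
    for organism, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((organism, sequence))
    return unique


def binary_label(organism):
    """Label 0 for SARS-CoV-2, label 1 for every other organism."""
    return 0 if organism == "SARS-CoV-2" else 1


# Hypothetical toy records, not taken from the repository.
records = [
    ("SARS-CoV-2", "ACGTTGCA"),
    ("SARS-CoV-2", "ACGTTGCA"),  # repeated sequence, will be dropped
    ("HCoV-NL63", "TTGACCGA"),
]
unique = deduplicate(records)
print([(organism, binary_label(organism)) for organism, _ in unique])
```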
2.2. Convolutional Neural Network
The deep learning model used for the experiments is a CNN with 3 convolutional layers and one fully connected layer, as described in Fig. 4. The input is a vector of 31,029 elements, which is the maximum size of the genome sequences in the dataset. Each convolutional layer is characterized by 3 hyperparameters, as shown in Fig. 5. The architecture is summarized by the hyperparameters w0 = 130, w1 = 204, w2 = 150, w3 = 196, h0 = 148, h1 = 236, h2 = 81, wd0 = 9, wd1 = 106, wd2 = 121, where w3 is the number of units in the fully connected layer. To improve generality, dropout with probability pd = 0.5 is applied to the fully connected layer during training; moreover, an L2 regularization term, computed over all weights in the convolutional layers and weighted by β = 10⁻³, is added to the categorical cross-entropy loss function. The optimizer used for the weights is Adaptive Moment Estimation (Adam) [31], with learning rate lr = 10⁻⁵, run for 500 epochs. The hyperparameters used in the experiments were selected after a set of preliminary trials. All the necessary code was developed in Python 3, using the keras library for deep learning [32], and has been made available in an open GitHub repository (see footnote 1).
Fig. 4. CNN architecture.
Fig. 5. CNN layer description.
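To give a feeling for the size of the network, the sketch below walks the input length through the three convolution and pooling stages and counts the convolutional weights. It relies on several assumptions that are not stated in the text: the three w values are read as the number of filters per convolutional layer, the three h values (in the order listed) as filter lengths, the three wd values (in the order listed) as max-pooling widths, and both convolution and pooling are assumed to use "same" padding with a pooling stride equal to the pooling width. The function names are hypothetical.

```python
import math

INPUT_LEN = 31029          # input vector length (from the text)
FILTERS = [130, 204, 150]  # w0, w1, w2: assumed number of filters per layer
KERNELS = [148, 236, 81]   # the three h values: assumed filter lengths
POOLS = [9, 106, 121]      # the three wd values: assumed pooling widths
DENSE_UNITS = 196          # w3: units in the fully connected layer


def layer_lengths(input_len, pools):
    """Sequence length after each conv + pooling stage.

    With "same" padding the convolution keeps the length, and the
    pooling divides it by the pooling width, rounding up.
    """
    lengths = []
    length = input_len
    for pool in pools:
        length = math.ceil(length / pool)
        lengths.append(length)
    return lengths


def conv_parameters(filters, kernels, in_channels=1):
    """Trainable weights and biases of each 1D convolutional layer."""
    counts = []
    for n_filters, kernel in zip(filters, kernels):
        counts.append(kernel * in_channels * n_filters + n_filters)
        in_channels = n_filters
    return counts


lengths = layer_lengths(INPUT_LEN, POOLS)
flattened = lengths[-1] * FILTERS[-1]  # features fed to the dense layer
print("lengths after each stage:", lengths)
print("flattened features:", flattened)
print("conv parameters:", conv_parameters(FILTERS, KERNELS))
print("dense parameters:", flattened * DENSE_UNITS + DENSE_UNITS)
```

Under these assumptions the aggressive pooling widths reduce the 31,029 input positions to a single position per filter before the fully connected layer.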
3. Results
3.1. Classification of SARS-CoV-2 among Coronaviruses
In the first test, we separated SARS-CoV-2 from the other sequences available at the repository 2019 Novel Coronavirus Resource (2019nCoVR) [6]. We obtained a 10-fold average test accuracy of µ = 0.9875 with σ = 0.0160. The resulting confusion matrix (Fig. 6) shows that only 3 out of the 66 SARS-CoV-2 sequences were mistakenly assigned to another class. The binarized ROC curve of the test (Fig. 7) has an area under the curve (AUC) of 0.98, with a specificity of 0.9939 and a sensitivity of 1.00. This is considered an outstanding performance, according to the guidelines provided by [33, 34].
Fig. 6. Confusion matrix resulting from the test of a 10-fold cross-validation, comprising 553 samples belonging to 5 different classes.
Fig. 7. Binarized ROC curve of the 553 sequences, where we consider samples belonging to SARS-CoV-2 as class 0, and all the rest as class 1.
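For reference, sensitivity and specificity in the binarized setting can be computed as in the following minimal sketch. Treating SARS-CoV-2 (class 0) as the positive class is an assumption of this example, the function name is hypothetical, and the toy labels are invented for illustration and are not the experimental predictions.

```python
def sensitivity_specificity(y_true, y_pred, positive=0):
    """Return (sensitivity, specificity) for a binary problem.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP),
    where `positive` is the label of the class of interest.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy labels: 3 positives (class 0) and 5 negatives (class 1).
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
print(sensitivity_specificity(y_true, y_pred))
```

In this toy case all three positives are recovered and one of the five negatives is flagged as positive, giving a sensitivity of 1.0 and a specificity of 0.8.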
As viruses are characterized by high mutation frequencies, to assess the robustness of our approach, we performed further experiments where we added noise to the dataset, simulating possible future mutations. 5% noise was added by randomly selecting 1,551 positions in each sequence, out of the 31,029 available, and randomly changing each selected base to another base or to a missing value. A new 10-fold cross-validation classification run on the noisy dataset yields an average accuracy of µ = 0.9674 with σ = 0.0158. Figs. 8 and 9 show the resulting confusion matrix and ROC curve, respectively. This gives an AUC of 0.97, with a specificity of 0.9939 and a sensitivity of 0.90.
Fig. 8. Confusion matrix resulting from the test of a 10-fold cross-validation, comprising 553 samples belonging to 5 different classes, with 5% noise in the dataset.
Fig. 9. Binarized ROC curve of the 553 sequences, where we consider samples belonging to SARS-CoV-2 as class 0, and all the rest as class 1, with 5% added noise.
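The noise procedure used in this experiment can be illustrated with the short sketch below, which operates on an already encoded sequence. It is not the code used for the experiments: the names `VALUES` and `add_noise` are hypothetical, and one detail is an assumption, namely that a selected position that already holds a missing value (0.0) is replaced by one of the four bases.

```python
import random

# Encoded values: missing entry, C, T, G, A (as defined in Subsection 2.1.1).
VALUES = [0.0, 0.25, 0.50, 0.75, 1.0]


def add_noise(encoded, fraction=0.05, seed=None):
    """Return a copy of `encoded` with a fraction of positions altered.

    Each selected position is replaced by a different value chosen at
    random among the remaining bases and the missing value.
    """
    rng = random.Random(seed)
    noisy = list(encoded)
    n_positions = int(len(noisy) * fraction)
    for pos in rng.sample(range(len(noisy)), n_positions):
        alternatives = [value for value in VALUES if value != noisy[pos]]
        noisy[pos] = rng.choice(alternatives)
    return noisy


# Hypothetical encoded sequence of 31,029 positions, all set to A.
clean = [1.0] * 31029
noisy = add_noise(clean, fraction=0.05, seed=42)
print(sum(a != b for a, b in zip(clean, noisy)))
```

With 31,029 positions and a fraction of 0.05, exactly 1,551 positions are altered, matching the number given above.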
3.2. Separating SARS-CoV-2 from other viruses containing gene ORF1ab
In the next batch of experiments, we aim to distinguish SARS-CoV-2 from other genome sequences from NCBI [30], obtained with the following search parameters: gene=“ORF1ab” AND host=“homo sapiens” AND “complete genome”. We obtain a 10-fold average accuracy of µ = 0.9817 with σ = 0.0167. The resulting confusion matrix (Fig. 10) shows that 7 out of the 45 SARS-CoV-2 sequences were assigned to the other class. The ROC curve of the test (Fig. 11) has an area under the curve (AUC) of 0.92, with a specificity of 0.9797 and a sensitivity of 1.00.
Fig. 10. Confusion matrix of the 384 NCBI sequences with 2 classes.
Fig. 11. ROC curve of the 384 sequences, where we consider SARS-CoV-2 as class 0 and the rest as class 1.
4. Conclusion
Being able to reliably identify SARS-CoV-2 and distinguish it from other similar pathogens is important to contain its spread. The time needed to process samples and the availability of reliable diagnostic tests are challenges during an outbreak. Developing innovative diagnostic tools that target the genome to improve the identification of pathogens can help reduce health costs and the time needed to identify the infection, avoiding unsuitable treatments or testing. Moreover, it is necessary to perform an accurate classification to identify the different species of Coronavirus, the genetic variants that could appear in the future, and the coinfections with other pathogens.
Given the high transmissibility of SARS-CoV-2, the proper diagnosis of the disease is urgent, to stop the virus from spreading further. Considering the false negatives given by standard nucleic acid detection, better approaches, such as those using deep learning, are necessary in order to properly detect the virus. While the accuracy of current nucleic acid testing is around 30-50%, and CT scans with deep learning reach 83%, we believe that the use of a CNN-based system with sequencing has the potential to improve the accuracy of the diagnosis above 90%.
Our preliminary results, using non-repeated sequences with lengths ranging from 1,260 to 31,029 bps, missing information (segments with Ns) and noise (errors), show an area under the curve of 98% in binary classification in a 10-fold cross-validation. To further improve the classification within the 7 existing coronavirus strains, more examples of full genome sequences with host=homo sapiens are needed, so that a full sub-type classification can be performed instead of grouping HCoV-229E/OC43 and HCoV-NL63/HKU1, as we were forced to do due to the lack of samples. Thus, to further validate our results and improve our system, we will extend our dataset as more data becomes available in the international repositories.
As of March 12th, 2020, China and the USA have each made 50 SARS-CoV-2 virus gene sequences publicly available. In Europe, however, due to strict privacy laws, only 3 sequences are available (one viral genome sequence each from Italy, Sweden and Finland), which is of great concern. We urge that more data be made publicly available, in order to increase the chances of developing countermeasures against the spread of the virus.