DIA-NN: Deep neural networks substantially improve the identification performance of Data-independent acquisition (DIA) in proteomics

Data-independent acquisition (DIA) strategies, such as SWATH-MS, have been developed to increase consistency, quantification precision and proteomic depth in label-free proteomic experiments. They aim to overcome the stochasticity of precursor-ion selection by using mass-windowed acquisition followed by computational reconstruction of the chromatograms. While DIA methods increasingly outperform typical data-dependent methods in identification consistency and precision, specifically on large sample series, room for improvement remains: at present, only a fraction of the information recorded in the complex DIA spectra is extracted by software analysis pipelines. Here we present a software tool (DIA-NN) that introduces artificial neural networks and a new quantification strategy to enhance signal processing in DIA data. DIA-NN greatly improves the identification of precursor ions and, as a consequence, protein quantification accuracy. The performance of DIA-NN demonstrates that deep learning provides opportunities to boost the analysis of data-independent acquisition workflows in proteomics.


Introduction
Mass spectrometry-based quantitative proteomics is an invaluable tool in a wide range of clinical and research applications. Data-independent acquisition (DIA) is devoid of the inherent stochasticity of data-dependent acquisition (DDA) workflows, which manifests itself as missing measurements between successive runs. Its identification performance is hence not limited by stochastic elements and only indirectly dependent on the sampling speed of the mass analyzer. In addition, quantification via DIA is more reliable, as it is performed on the fragment (MS/MS or MS2) level and is thus less susceptible to interferences [1]. Introduced already in the 1970s, DIA is becoming increasingly popular now that new instruments and acquisition strategies (like SWATH-MS [2,3]) are available. In an increasing number of instances, DIA workflows outperform label-free DDA in terms of reproducibility [1,4]. Recently, a highly optimised DIA workflow has been shown to be capable of identifying more precursors than the theoretical maximum number of tandem spectra that can be acquired in DDA mode [4]. Considerable attention has therefore been dedicated to devising advanced algorithms for processing DIA-proteomics data [5][6][7][8][9][10][11][12][13][14][15][16][17].
A main restriction remains the computational processing of DIA data, as the available algorithms still extract only part of the theoretical information content of the data. In particular, the use of wide isolation windows, which are needed to achieve fast scan times, is associated with high levels of interferences in the tandem-MS spectra [18]. These interferences cause false-positive discoveries as well as lower accuracy and precision in proteomic experiments.
Here, we demonstrate the application of artificial neural networks to the extraction of information from complex SWATH-MS data. Our approach increases precursor identification performance and, therefore, the number of protein groups that can be accurately quantified. Neural networks have been used previously to classify spectra in DDA-proteomics data [19][20][21]. However, to the best of our knowledge, DIA-proteomics data processing has so far only employed linear classifiers such as linear discriminant analysis or support vector machines [7][8][9][13,16,17]. In addition, we introduce an efficient method for the removal of interferences from tandem-MS spectra, thus significantly improving quantification precision. Our software tool, DIA-NN (Data-Independent Acquisition by Neural Networks), implements all stages of DIA data processing in a single program, taking a set of raw data files as input and reporting quantitative values for precursor ions and protein groups. DIA-NN is written in C++ and is designed to be fast. Its memory requirements are independent of the number of LC-MS runs in the experiment, as the computationally intensive processing of raw data is performed for each run separately, and the results are saved to the hard drive (the space requirements are several orders of magnitude less than the volume of the respective raw data files). This makes DIA-NN ideal for automated handling of data generated in large-scale experiments.

General architecture of DIA-NN
DIA-NN takes as input a collection of centroided mass spectrometry data files (corresponding to individual runs) and a spectral library. Each of the files is processed separately to match precursor ions to elution peaks, the result being saved to the hard drive. Quantification of fragment ions generated from the precursors is also performed at this stage. Interferences are detected and removed for each fragment.
To speed up the data processing, DIA-NN supports optional restriction of the number of elution peaks considered as potentially valid matches for a given precursor. For this, DIA-NN can use a defined set of added or internal reference precursors. These can be provided as a list by the user or automatically generated by DIA-NN using high confidence identifications in previous analyses with the same spectral library. Once elution peaks for these precursor ions are identified, they allow the relationship between the retention times in the data file and in the spectral library to be inferred. DIA-NN then only searches for elution peaks within an automatically generated retention time window.
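The retention-time restriction described above can be sketched as follows. The linear fit and the fixed margin are illustrative assumptions, not DIA-NN's exact procedure (which derives the window size from the data):

```python
import numpy as np

def rt_search_window(library_rts, observed_rts, query_library_rt, margin=2.0):
    """Fit a mapping from library to observed retention times using
    high-confidence reference precursors (a linear fit here, for
    illustration), then restrict the elution-peak search for a query
    precursor to a window around its predicted retention time."""
    slope, intercept = np.polyfit(library_rts, observed_rts, 1)
    centre = slope * query_library_rt + intercept
    return centre - margin, centre + margin
```

With three reference precursors shifted by a constant offset, a query precursor with a library retention time of 25 is searched only around the predicted time of 27.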
The false discovery rate (FDR) is estimated using a modified version of the target-decoy method [22]. Briefly, for each precursor ion, target or decoy, a set of scores is calculated for each potential elution peak. First, one of the scores is used to select the best peak. Second, a classifier is trained to distinguish between the sets of scores corresponding to the best peaks matched to target and decoy precursors; it is then used to generate a "combined score" that refines the selection of peaks. The process is repeated iteratively for a specified number of iterations; in our tests, eight iterations proved sufficient for efficient training of the classifier. After each iteration, the mapping between the observed and library retention times is refined, and the normalised retention time of the elution peak as well as the square root of its difference from the library retention time are added to the set of scores.
Finally, the ratio of decoy to target precursor numbers with combined scores exceeding a given threshold is used as the FDR estimate.
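The decoy-based estimate can be written compactly; function and variable names here are illustrative:

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """FDR at a given combined-score threshold: the ratio of the number
    of decoy precursors to the number of target precursors whose
    combined scores exceed the threshold."""
    n_targets = sum(s >= threshold for s in target_scores)
    n_decoys = sum(s >= threshold for s in decoy_scores)
    return n_decoys / max(n_targets, 1)
```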
Once the initial processing of all runs in the experiment is finished, DIA-NN quantifies precursor ions using the previously collected information on the quality of the extracted ion chromatograms of their individual fragments. The best three fragments per precursor are selected in a cross-run manner and eventually used for its quantification. DIA-NN also supports automatic cross-run retention time profiling. Briefly, if a precursor is identified in some runs with an FDR lower than a specified threshold, the retention time information in the spectral library is corrected based on the run in which this precursor was identified with the lowest FDR. All runs are then reanalysed using the corrected retention times.
After precursor ion quantification, optional cross-run normalisation and protein quantification can be performed. All precursor intensities corresponding to identifications with FDR estimates above a given threshold are replaced with zeroes. Precursors are then ordered by their coefficients of variation, and the top p·N precursors are selected, where N is the average number of identifications passing the FDR threshold and p is between 0 and 1. The sums of the intensities of these precursors are calculated and used for normalisation, i.e. the levels of all precursors are scaled to make these sums equal across runs. A "Top N" method is eventually used for protein quantification: protein intensities are obtained as the sums of the intensities of the top N most abundant precursors identified at an FDR lower than a given threshold in a particular run.
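The normalisation and "Top N" steps can be sketched as follows. This is a simplified illustration; the exact CV ranking and scaling target are assumptions consistent with the description above:

```python
import numpy as np

def normalise_runs(X, p=0.5):
    """Cross-run normalisation sketch. X is a precursors x runs intensity
    matrix in which identifications failing the FDR threshold are zeroed.
    The p*N lowest-CV precursors (N = average number of identifications
    per run) define per-run sums that are scaled to a common value."""
    X = np.asarray(X, dtype=float)
    n_avg = np.count_nonzero(X, axis=0).mean()   # average identifications per run
    k = max(1, int(p * n_avg))
    means = X.mean(axis=1)
    stds = X.std(axis=1)
    cv = np.divide(stds, means, out=np.full(len(X), np.inf), where=means > 0)
    top = np.argsort(cv)[:k]                     # most stable precursors
    run_sums = X[top].sum(axis=0)
    return X * (run_sums.mean() / run_sums)      # equalise the sums across runs

def top_n_protein(precursor_intensities, n=3):
    """'Top N' protein quantification: sum of the n most abundant
    precursors identified for the protein in a given run."""
    return sum(sorted(precursor_intensities, reverse=True)[:n])
```

After normalisation, the reference precursors' summed intensity is identical in every run; protein intensities are then simple sums over the top precursors.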

Decoy precursors generation
A decoy precursor is generated for each target precursor. The m/z value of the decoy precursor as well as its reference retention time are set to be equal to those of the target precursor, as specified in the spectral library. The order of amino acid residues (except for the first and the last ones) is reversed. If the resulting sequence of amino acid masses happens to be identical to that of the target precursor, the mass of the central amino acid is increased by an artificial value (e.g. 12 m/z). The fragmentation spectrum of the decoy precursor is then calculated using the same fragmentation pattern as that of the target precursor.
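The reversal scheme can be sketched as follows (the mass shift for palindromic sequences is noted but not implemented; names are illustrative):

```python
def decoy_sequence(target):
    """Decoy generation sketch: reverse the internal residues of the
    target peptide, keeping the first and last residues in place. If
    the reversed sequence has the same residue masses as the target,
    DIA-NN additionally shifts the mass of the central residue
    (not shown here)."""
    if len(target) <= 3:
        return target
    return target[0] + target[-2:0:-1] + target[-1]
```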

Detection and scoring of elution peaks
Each precursor is represented in the spectral library by its m/z value as well as the m/z values and reference intensities of its fragments. Six fragments with the highest reference intensities are considered in DIA-NN, and their m/z values are used to extract the respective chromatograms from the data. First, the sequence of spectra is filtered to leave only MS1 spectra and those MS2 spectra that were obtained using precursor ion selection windows containing the m/z value of the precursor. In each of the remaining spectra, the highest peak is chosen within a window centered at the m/z value of interest. The radius of this window is calculated as the product of this m/z value and the specified mass accuracy coefficient.
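Chromatogram extraction from the filtered spectra can be sketched as follows; the mass-accuracy coefficient is an illustrative value (2e-5, i.e. roughly 20 ppm):

```python
def extract_trace(spectra, mz, accuracy=2e-5):
    """Chromatogram extraction sketch: in each (already filtered)
    spectrum, take the highest peak within a window of radius
    mz * accuracy centered at the target m/z. Each spectrum is a
    list of (mz, intensity) pairs."""
    radius = mz * accuracy
    trace = []
    for peaks in spectra:
        in_window = [i for m, i in peaks if abs(m - mz) <= radius]
        trace.append(max(in_window) if in_window else 0.0)
    return trace
```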
The chromatograms are scanned using retention time (RT) windows. The window size can either be specified by the user or chosen by DIA-NN automatically. In the latter case, the diameter of the RT window is taken to be 1 + 4.4·h, where h is the average peak width at half-maximum for the reference precursors. These can be defined by the user or inferred automatically as high-confidence precursors during an initial search with a wide scan window. This procedure is carried out for the first run in the experiment, with subsequent runs using the same RT window size. DIA-NN calculates pairwise correlations of the elution curves of all fragments of the precursor within the window. The fragment with the largest sum of correlations is designated as the "best" fragment: it is assumed to be the one least affected by interferences, and hence its elution curve is expected to be representative of the true elution curve of the precursor. If the level of the best fragment at the center of the window is close to its maximum level within the window, the window is considered an elution peak. The elution curve of the best fragment is smoothed and designated as the "reference" elution curve. For each elution peak, a set of scores is then calculated.
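Selecting the "best" fragment by summed pairwise correlations can be sketched as:

```python
import numpy as np

def best_fragment(traces):
    """Return the index of the fragment whose elution curve has the
    largest sum of Pearson correlations with the other fragments'
    curves within the RT window; this fragment is assumed to be the
    least affected by interferences."""
    corr = np.corrcoef(np.asarray(traces, dtype=float))
    return int(np.argmax(corr.sum(axis=1)))
```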
Signal scores. Total signals of the top five fragments normalised by the total signal of all fragments are used as scores. These scores are ordered by the reference intensities of the respective fragments.
Correlation scores. First, correlations of the fragments' elution curves with the reference curve are calculated; the sum of these correlations is used as a score. Second, each fragment chromatogram is processed by taking the minima of each three consecutive values, correlations with the reference curve are calculated, and the sum of these is used as another score. Finally, the correlation between the MS1 elution curve and the reference curve is also used as a score.
Training and test datasets. The scored target-decoy pairs are split into two subsets to generate the training and test datasets. Only the test dataset is then used to calculate the "combined score" level that corresponds to a given FDR threshold.
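The min-of-three processing used for the second correlation score can be sketched as (boundary handling is an illustrative choice):

```python
def min_filter3(trace):
    """Replace each point of a fragment chromatogram with the minimum of
    three consecutive values around it, suppressing isolated
    interference spikes before the correlation is computed."""
    return [min(trace[max(i - 1, 0):i + 2]) for i in range(len(trace))]
```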
Linear classifier. DIA-NN calculates {Δsᵢ}, the set of score-vector differences between the paired target and decoy precursors, where i denotes a target-decoy pair. A weight vector w is then obtained as the solution of the equation Σw = µ, where µ is the average of these vector differences and Σ is their covariance matrix. The "combined score" w·s is then calculated for each score vector s.
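This is the classical Fisher-discriminant construction; a sketch in matrix form (function and variable names are illustrative):

```python
import numpy as np

def combined_scores(targets, decoys):
    """Linear classifier sketch: with paired target/decoy score vectors,
    solve Sigma w = mu for the weight vector w, where mu and Sigma are
    the mean and covariance of the pairwise score differences; the
    combined score of a score vector s is the dot product w . s."""
    T = np.asarray(targets, dtype=float)
    D = np.asarray(decoys, dtype=float)
    deltas = T - D
    mu = deltas.mean(axis=0)
    sigma = np.cov(deltas, rowvar=False)
    w = np.linalg.solve(sigma, mu)
    return T @ w, D @ w
```

By construction, the expected combined score of targets exceeds that of decoys, since the mean score difference equals µᵀΣ⁻¹µ ≥ 0.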
Neural network classifier. The neural network classifier is employed for the final two iterations of elution peak scoring; training of the neural network is only performed during the first of these iterations.
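For illustration, a minimal one-hidden-layer network trained on target (label 1) and decoy (label 0) score vectors is sketched below. DIA-NN itself uses the Cranium library with deeper architectures, so this only illustrates the principle; all names and hyperparameters are assumptions:

```python
import numpy as np

def train_classifier(X, y, hidden=8, epochs=500, lr=0.1, seed=0):
    """Train a tiny one-hidden-layer network (tanh hidden layer, sigmoid
    output, plain gradient descent on cross-entropy) to separate target
    (y=1) from decoy (y=0) score vectors; returns a function giving the
    'combined score' as the output probability."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                   # hidden activations
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # output probability
        g = (p - y[:, None]) / len(X)              # cross-entropy gradient
        gh = (g @ W2.T) * (1.0 - h ** 2)           # backprop through tanh
        W2 -= lr * (h.T @ g); b2 -= lr * g.sum(axis=0)
        W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(axis=0)
    def score(Z):
        p = 1.0 / (1.0 + np.exp(-(np.tanh(Z @ W1 + b1) @ W2 + b2)))
        return p[:, 0]
    return score
```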

Identification performance of DIA-NN
We explored the precursor identification performance of DIA-NN when using the linear classifier and the artificial neural network classifier, the latter trained in either a run-specific or a cross-run manner. We compared the performance characteristics of DIA-NN to the results obtained with Spectronaut. For other benchmarks, we used a dataset previously generated as part of the LFQbench test, and the spectral library generated therein [23].
As different software tools employ different decoy-generation algorithms and different classifiers, software-reported false discovery rates cannot be compared directly. We therefore followed a strategy suggested previously [4] and 'tested' for false-positive hits by calling E. coli peptides in a yeast proteome. Almost every call of an E. coli-specific peptide in the pure yeast digest will be a false positive; the number of such calls is hence an illustrative benchmark that visualizes the false-discovery performance of the different methods (Fig. 1).
We utilised a compound spectral library, which comprised both yeast and E. coli precursors (15153 and 13550, respectively). For this, a human-yeast-E. coli spectral library was obtained from the LFQbench test suite (64 var, openswath) [23], and precursors matching human proteins were removed (calling human-specific peptides is a less reliable benchmark for false positives, as human contamination due to sample handling is difficult to exclude). Both software tools were operated with standard parameters; in Spectronaut, precursor and protein q-value cutoff thresholds were set to 1.0. The average numbers of total and E. coli-specific precursor identifications in the yeast proteome were calculated at different reported FDR levels.
Written in C++, DIA-NN demonstrated high processing speed once the input mzML files with the raw data are converted into its own format. For example, when using a neural network with two hidden layers trained in a run-specific manner, DIA-NN was able to process all ten files in less than 13 minutes on an average processing workstation (2x 6-core Intel Xeon E5645). This fast performance renders DIA-NN suitable for the analysis of very large datasets.
In this comparison, DIA-NN outperformed the Spectronaut-implemented data analysis pipeline both with the linear classifier and with the artificial neural network classifier (Fig. 1). Application of the latter substantially suppressed false peptide discoveries: the number of falsely discovered E. coli precursors (in the yeast dataset) was reduced to about 38-48 at approximately 8000 precursor identifications, a 1.6- to 2-fold improvement over the linear classifier. Different settings for the neural network classifier showed comparable performance.

Quantification performance of DIA-NN
While identification performance is important, the key application of DIA is accurate, precise and consistent protein quantification in large sample series [24]. We illustrate the quantification performance of DIA-NN by comparing it to Spectronaut Pulsar using the LFQbench test [23]. In this benchmark, human, yeast, and E. coli lysates were mixed in different proportions and analysed via SWATH-MS. For each mixture, three injection replicates were measured. The LFQbench R package takes as input quantitative values for precursor ions and uses these to quantify peptides and proteins. Two characteristics are monitored: median bias ("accuracy") and standard deviation ("precision"). Median coefficients of variation in technical replicates ("technical variance") are also calculated for human peptides and proteins.
We considered the 64-variable-windows acquisition scheme on TripleTOF 6600 (Sciex) datasets, as these were shown in the LFQbench manuscript to perform best [23]. Both software tools were operated with standard parameters; in Spectronaut, precursor and protein q-value cutoff thresholds were set to 1.0; DIA-NN was set to use a run-specific neural network classifier with two hidden layers (Fig. 2 and 3). In order to compare the quantification performance of the two methods on the same dataset, we aimed to fix the number of precursor identifications at a comparable level (Fig. 2, 3). For example, in the case of the yeast and E. coli proteins in the HYE110 dataset, in which their ratios between samples were expected to be very high (10:1), DIA-NN shows almost 1.5-times better precision.

Summary and Discussion
In this work, we aimed to maximise the amount of information extracted from DIA-proteomics data. Ultimately, the precursor identification problem relies on the classification of potential elution peaks as false or true positives. Linear classifiers, such as linear discriminant analysis or support vector machines, usually require careful selection of a small set of scores to be calculated for each elution peak. Since DIA-proteomics generates highly information-rich datasets (unlike DDA), conventional approaches to handling DIA data effectively discard a significant portion of the information. To address this issue, we developed a set of highly optimised scores, which encode smoothed elution curves for each fragment of the precursor ion under consideration. We demonstrated that these scores allow for an efficient solution of the classification problem with the use of deep neural networks.
So far we have benchmarked this first version of DIA-NN only on a small number of datasets, and we expect its relative advantage in peptide quantification to depend on the dataset. The main advantage of DIA-NN is achieved with the use of an artificial neural network, the performance of which will depend on the number of true and false target precursor identifications used for its training as well as on parameters such as the number of data points per peak. The performance difference between DIA-NN and other tools will hence vary from dataset to dataset (it is not a fixed, universal value) and will also depend on the spectral library used. However, we are still working on improving DIA-NN, and we will conduct a more comprehensive benchmark once a more 'final' version of the software is generated and a manuscript is prepared for submission. We publish this preprint to encourage other proteomics labs to try DIA-NN already at this stage; any feedback provided will help to improve it for use in the community, or for its incorporation in commercial DIA software.
In the identification benchmark conducted with the linear classifier, DIA-NN outperformed the conventional data analysis pipeline, exhibiting a lower rate of false-positive identifications across the wide range of reported precursor identification numbers considered. The use of artificial neural networks substantially improved this result. Neural networks with two and five hidden layers showed comparable performance; there was also little difference between training the network in a run-specific or a cross-run manner. This situation may, however, be different for more variable or complex samples.
We also addressed another crucial issue associated with DIA-proteomics data processing, namely, the problem of interference removal to facilitate accurate and precise quantification of precursor ions and, therefore, proteins. For this, we introduced a novel approach based on selecting the "best" fragment per precursor in each run. Its elution curve is then assumed to be representative of the true elution curve of the precursor, and is used as a template for correcting interferences affecting other fragments.
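One way to express the template-based correction is sketched below; the apex-scaling rule is an illustrative assumption, not necessarily DIA-NN's exact formula:

```python
def correct_fragment(trace, reference):
    """Interference-removal sketch: scale the smoothed reference elution
    curve (from the 'best' fragment) to match the fragment trace at the
    peak apex, then clip the trace wherever it exceeds that shape, so
    co-eluting interference signals are removed."""
    apex = max(range(len(reference)), key=reference.__getitem__)
    scale = trace[apex] / reference[apex] if reference[apex] else 0.0
    return [min(t, scale * r) for t, r in zip(trace, reference)]
```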
We validated the high performance of this method using the LFQbench test suite. We observed that while the accuracy (bias) of quantification is comparable between DIA-NN and Spectronaut Pulsar, DIA-NN tends to be superior in quantification precision and exhibits lower technical variance.
In summary, DIA-NN is a fast software tool for processing of DIA proteomics data. Applied in datasets generated to benchmark SWATH platforms, it shows improved precursor identification performance as well as high quantification accuracy and precision. Our study demonstrates the power of using artificial neural networks in the analysis of DIA-proteomics data.

DIA-NN implementation
To facilitate the very high processing speed demonstrated by DIA-NN, its code was written in C++. DIA-NN relies on the following third-party libraries:
- Cranium (https://github.com/100/Cranium) provides functionality necessary for the implementation of the artificial neural network classifier;
- MSToolkit (https://github.com/mhoopmann/mstoolkit) provides an interface for files in the mzML format;
- Eigen (http://eigen.tuxfamily.org) is used to solve linear equations.

Yeast DIA analyses
Saccharomyces cerevisiae (BY4743, rendered prototrophic with a plasmid encoding HIS3, LEU2 and URA3 [25]) was grown to exponential phase in minimal synthetic nutrient medium.
Proteins were extracted by bead beating for 5 min at 1500 rpm in 8 M urea/0.1 M ammonium bicarbonate, reduced with 5 mM dithiothreitol and alkylated with 10 mM iodoacetamide. The sample was diluted to 1.5 M urea/0.1 M ammonium bicarbonate before the proteins were digested overnight with trypsin (1:30 trypsin-to-total-protein ratio). Peptides were cleaned up with 96-well MacroSpin plates (Nest Group) and iRT peptides (Biognosys AG) were spiked in.
The digested peptides were analysed on a nanoAcquity (Waters) coupled to a TripleTOF 6600 (Sciex). Peptides were separated with a 23 min non-linear gradient (4% acetonitrile/0.1% formic acid to 36% acetonitrile/0.1% formic acid) on a Waters HSS T3 column (150 mm x 300 µm, 1.8 µm particles) at a 5 µl/min flow rate. The DIA method consisted of an MS1 scan from m/z 400 to m/z 1250 (50 ms accumulation time) and 40 MS2 scans (35 ms accumulation time) with variable precursor isolation width covering the mass range from m/z 400 to m/z 1250.

Raw mass spectrometry data files conversion
SCIEX wiff files were converted to the mzML format using MSConvert with the following settings: binary encoding precision was set to 32-bit, MS1 and MS2 vendor peak picking was used; all other options were turned off.