Ion-networks: a sparse data format capturing full data integrity of data independent acquisition mass spectrometry

Data-independent acquisition (DIA) mass spectrometry (MS) has introduced deterministic, periodic and simultaneous acquisition of all fragment ions. Despite the chimeric side-effects associated with this unprecedented data integrity, DIA data analysis approaches still use conventional spectra and extracted ion chromatograms (XICs) that represent individual precursors and fragments. Here, we introduce ion-networks, an alternative data format wherein nodes correspond to reproducible fragment ions from multiple runs and edges correspond to consistent co-elution. Each ion-network represents a complete experiment and computationally eliminates chimericy based on reproducibility without sacrificing data integrity.

[Figure panel run labels: Run_05A, Run_06B, Run_07A, Run_08B, Run_09A, Run_10B, Run_11A, Run_12B, Run_13A, Run_14B]

An ion-network was created and annotated for a total of ten public proteomic HDMSE runs from two different benchmark samples. Herein the nodes (dots in panels A, C-D) are aggregates, i.e. between-run aligned HE fragment ions, while the edges (green lines in panel C) represent consistent within-run co-elution. With an interactive browser, we zoomed in on the aggregates of a select region, containing many aggregates with a reproducibility of at least 2 (panel A), including several that are fully reproducible (circled dots in panel A, dots in panels C-D). After visualizing the t_R values of each ion per run for (a selection of) these fully reproducible aggregates (panel B), three groups become apparent. Two of these groups co-elute in the first six runs, but are deconvoluted in the last four runs due to stochastic effects. For each potential pair of aggregates, an edge is set if and only if they consistently co-elute in each run, thereby forming the final network (panel C). When this full network is annotated, multiple aggregates of three distinct peptides are annotated (colored dots in panel D). Furthermore, when the individual ion intensities of the three clusters are visualized, each of their aggregates follows the same pattern, indicating a correct deconvolution (panels E-G). Notice that many unannotated aggregates, presumably other fragments that are not mono-isotopic singly charged b- or y-ions, still have a high quantification potential due to the deconvolution. Finally, the intensity patterns agree with the benchmark design wherein fragments from Human peptides (1), E. coli peptides (2) and Yeast peptides (3) have an expected logarithmic fold change (logFC) of respectively 0, -2 and 1, thereby confirming correct identification.
in DDA has a peptide-to-spectrum match (PSM). To illustrate the performance on our benchmark ion-network, we annotated the aggregates with singly-charged mono-isotopic b- and y-fragments from a single sample. Herein we imported the scanning quadrupole selection as if it was t_D, showing

imply that ion-networks are most performant on datasets with high data integrity as well as high chimericy. To confirm this hypothesis, we acquired precursorless HDMSE which we denominated single window ion mobility (SWIM)-DIA. With only a single continuously acquired HE scan and no precursor window selection, SWIM-DIA has the highest possible data integrity for fragment ions.

To illustrate its performance, we created three ion-networks for an in-house benchmark dataset summarization, illustrating an unprecedented quantitative accuracy (Figure SF10).

In conclusion, ion-networks are able to capture HE fragment ions from different DIA techniques in a very sparse format with minimal noise and chimericy. While we only investigated a single software application, i.e. a simplistic proteomic database search, we postulate that the noiseless nature of these ion-networks enables a plethora of other untargeted software applications such as
critical feedback during research and writing.

Conflict of interest
The authors declare no competing financial interests.
ions from data independent LC-MS with data dependant LC-MS/MS". In :

2) to obtain an exhaustive list of ions that can be analyzed concurrently. Herein each peak-picked ion has a mass-to-charge ratio (m/z) apex, drift time (t_D) apex, retention time (t_R) apex and run identifier as primary coordinates, as well as metadata describing the intensity and apex peak picking errors per coordinate. The 50,000 most abundant ions per run are used in a quick alignment to determine which are fully reproducible in all runs. These fully reproducible ions form clusters that are used to calibrate the primary coordinates between each run and furthermore give an estimate of the between-run deviation of the t_R (Supplementary note 1.4). Based on these calibrated coordinates, all ions from the complete experiment are aligned into aggregates, i.e. between-run reproducible ions (Supplementary note 1.5.1). With these aggregates, the intensity of each ion is normalized per run (Supplementary note 1.6). Next, aggregates with at least two constituent ions are defined as nodes in the ion-network, while irreproducible ions are considered noise and discarded. For each pair of aggregates, an edge is set if and only if their constituent ions consistently co-elute within each run (Supplementary note 1.5.2). Hereby fragments from chimeric precursors can be deconvoluted, as stochastic co-elution of precursors is not always consistent. For proteomics experiments, each individual aggregate within this ion-network can be annotated as a specific b- or y-ion with a simplistic database search (Supplementary note 1.7).

Figure SF4: Example of non-consistent co-elution. Two high energy (HE) fragment ions with m/z 944.4 (red) and 984.4 (purple) are co-eluting in run 8 (top) of PXD001240, with equal t_R apices and similar peak shapes.
However, in run 14 (bottom) of the same dataset, their t_R apices are separated by several seconds and their peak shapes differ, making it unlikely that these HE fragment ions originate from the same low energy (LE) precursor ion. This hypothesis is confirmed by their intensity ratio profiles, which reveal that these HE fragment ions belong to different organisms by design of the benchmark. When both runs are analyzed simultaneously with an ion-network, this inconsistent co-elution can be leveraged to deconvolute the two chimeric HE fragment ions from run 8.

An HDMSE ion-network was created for PXD001240 (Supplementary note 1.5). Herein, each aggregate, i.e. (partially) reproducible HE fragment ion, has a number of consistently co-eluting aggregates (x-axis), and the logarithmic frequency of aggregates with an equal number of consistently co-eluting aggregates (y-axis) was determined. Consistently co-eluting aggregates are presumed to originate from the same LE ion and comprise all potential HE fragment ions such as b- and y-ions, isotopes, neutral losses, et cetera. The boxplot indicates the IQR with a median (orange line) of 24 and whiskers extending to the minima and maxima.

of this aggregate can be determined. By design of the benchmark, aggregates with a logFC of 0, 1 or -2 are respectively expected to be Human (red), Yeast (green) or E. coli (blue). When all pairs of aggregates are partitioned by the number of runs in which they consistently co-elute (x-axis), the percentage of paired aggregates with equal logFC (y-axis), i.e. likely organism origin, can be determined (experimental; full lines). While an equal organism origin does not prove that the pair of aggregates are fragments from the same precursor, the converse statement is generally true: a pair of aggregates with different logFC are fragments from two different chimeric precursors that are not deconvoluted.
To determine the impact of consistent co-elution on this deconvolution, we calculated the theoretical probability that a pair of aggregates has the same logFC (expected; dotted lines), regardless of consistent co-elution. This was done by first calculating the probability $P(X)$ of an aggregate for $X \in \{\text{Human}, \text{Yeast}, \text{E. coli}\}$ per partition of consistent co-elution. When aggregates within these partitions are paired independently, a pair has the same logFC with probability $P(\text{both } X) = P(X)^2$.

Within the HDMSE ion-network of PXD001240, 99,375 aggregates were annotated with a significant score (green) (Supplementary note 1.7). The logarithmic frequency (y-axis) of these aggregates was determined as a function of their reproducibility (x-axis). This was compared against the logarithmic frequency and reproducibility of all aggregates in the whole ion-network (red), regardless of their annotation. Annotation efficiency thus seems to be related to reproducibility: e.g. only 0.1% of two-fold reproducible aggregates were annotated, while 6% of all fully reproducible aggregates were annotated.

To calibrate the m/z, t_R and t_D of each run, the 50,000 most abundant ions of each run were selected.

Since the m/z of all ions was already normalized post-acquisition by the lockmass throughout the Apex3D peak picking, this is generally the most accurate descriptive attribute of an ion. As such, the m/z distance (in ppm) was used as metric to perform a hierarchical agglomerative clustering with single linkage on all these ions. All clusters that contain each run exactly once were retained and considered potentially aligned prior to t_R and t_D outlier removal.
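The clustering step above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it relies on the fact that single-linkage clustering of one-dimensional values with a distance cut-off is equivalent to sorting the values and splitting wherever the gap between neighbours exceeds the cut-off. The function name `single_linkage_mz`, the `(mz, run)` tuple layout and the 10 ppm cut-off are assumptions made for the example; the text above does not state the threshold that was used.

```python
def single_linkage_mz(ions, n_runs, ppm_threshold=10.0):
    """Cluster ions on m/z with single linkage and a ppm cut-off.

    ions: list of (mz, run_index) tuples (hypothetical layout).
    Returns only clusters that contain each of the n_runs runs exactly once.
    """
    if not ions:
        return []
    ordered = sorted(ions)  # sort by m/z
    clusters, current = [], [ordered[0]]
    for previous, ion in zip(ordered, ordered[1:]):
        gap_ppm = (ion[0] - previous[0]) / previous[0] * 1e6
        if gap_ppm <= ppm_threshold:
            current.append(ion)       # same single-linkage cluster
        else:
            clusters.append(current)  # gap too large: close the cluster
            current = [ion]
    clusters.append(current)
    # Keep clusters in which every run occurs exactly once.
    return [
        cluster for cluster in clusters
        if len(cluster) == n_runs and len({run for _, run in cluster}) == n_runs
    ]


example_ions = [
    (500.0000, 0), (500.0010, 1), (500.0020, 2),                 # one ion per run
    (600.0000, 0), (600.0010, 1),                                # run 2 is missing
    (700.0000, 0), (700.0005, 0), (700.0010, 1), (700.0015, 2),  # run 0 occurs twice
]
retained = single_linkage_mz(example_ions, n_runs=3)
```

In this toy input only the cluster around m/z 500 is retained: the cluster around m/z 600 lacks a run, and the cluster around m/z 700 contains the same run twice.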

For each cluster the maximum distance in t_R and t_D between its constituent ions was calculated.

Based on the distribution of the absolute deviation to the median of all t_R or t_D errors, individual z-scores were calculated per cluster. Each cluster with a z-score exceeding 5 was considered an outlier and removed. This process of outlier removal was repeated until only clusters with z-scores below 5 for both t_R and t_D remained. The final set of clusters was considered to be correctly aligned and equally partitioned into a set of clusters for calibration and validation. Note that the partitioning was done by selecting even and odd clusters after m/z sorting, potentially introducing some dependency bias through isotopes between calibration and validation clusters.

of these smaller clusters is again subjected to step one, otherwise the next iteration commences.

By design, this process finishes at the latest after as many iterations as there are runs. Hereafter, no clusters containing multiple ions from the same run remain and all clusters can form aggregates with unambiguously aligned constituent ions.

As this trimming is quite stringent, a last step is performed which merges clusters not containing ions from the same run. This is done by iterating over all original untrimmed pairwise alignments in order by Euclidean distance, i.e. a pairwise alignment defines a distance

Once no clusters can be merged anymore, all clusters are defined as aggregates. Finally, all aggregates with reproducibility of at least two are defined as nodes in the ion-network.

Duplicate peptides from different proteins were merged to obtain a list of unique peptide sequences.

Peptides originating solely from decoy proteins were classified as decoy peptides, while all others were classified as targets. All fragments, i.e. mono-isotopic masses of all singly-charged b- and y-ions, were calculated for each peptide.
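As an illustration of this fragment calculation, the sketch below computes the mono-isotopic m/z of all singly-charged b- and y-ions of an unmodified peptide. It is a generic textbook calculation, not the authors' code: the function name `fragment_mzs` is hypothetical, the residue masses are standard mono-isotopic values rounded to five decimals, and fixed or variable modifications (for example on cysteine) are assumed absent because the text above does not specify them.

```python
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # mono-isotopic mass of H2O (Da)

# Standard mono-isotopic residue masses (Da), unmodified amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def fragment_mzs(peptide):
    """Return {ion_name: m/z} for all singly-charged b- and y-ions."""
    masses = [RESIDUE_MASS[residue] for residue in peptide]
    total = sum(masses)
    fragments = {}
    prefix = 0.0
    # Cleaving after residue i gives b_i (N-terminal part)
    # and y_(n-i) (C-terminal part, which keeps the extra water).
    for i, mass in enumerate(masses[:-1], start=1):
        prefix += mass
        fragments[f"b{i}"] = prefix + PROTON
        fragments[f"y{len(masses) - i}"] = total - prefix + WATER + PROTON
    return fragments


peptide_fragments = fragment_mzs("PEPTIDE")
```

For a peptide of n residues this yields n - 1 b-ions and n - 1 y-ions, so the seven-residue example produces twelve fragments.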

For each aggregate that has at least two other consistently co-eluting aggregates, all potential

defined as a peptide-fragment-to-ion-neighborhood match (PIM), analogous to a precursor that is assigned a peptide-to-spectrum match (PSM) in DDA. Note that not all aggregates are given a PIM, as there sometimes are no fragment explanations or no linear regression can be made due to e.g. too few consistently co-eluting aggregates. Equally, some aggregates are assigned more than one PIM, which by the current definition always have an equal score.

As an additional accuracy measure besides a PIM score, the t_D of each aggregate was used as a proxy for potential precursor m/z. First, the aggregates of each PIM were checked for a consistently

Each PIM is then rescored and assigned a target-decoy false discovery rate (FDR) controlled

• Consistently co-eluting aggregate match ratio cm_k