Middle-down proteomics reveals dense sites of methylation and phosphorylation in arginine-rich RNA-binding proteins

Arginine (Arg)-rich RNA-binding proteins play an integral role in RNA metabolism. Post-translational modifications (PTMs) within Arg-rich domains, such as phosphorylation and methylation, regulate multiple steps in RNA metabolism. However, the identification of PTMs within Arg-rich domains with complete trypsin digestion is extremely challenging due to the high density of Arg residues within these proteins. Here, we report a middle-down proteomic approach coupled with electron transfer dissociation (ETD) mass spectrometry to map previously unknown sites of phosphorylation and methylation within the Arg-rich domains of U1-70K and structurally similar RNA-binding proteins from nuclear extracts of HEK293 cells. Remarkably, the Arg-rich domains in RNA-binding proteins are densely modified by methylation and phosphorylation compared with the remainder of the proteome, with di-methylation and phosphorylation favoring RSRS motifs. Although they favor a common motif, analysis of combinatorial PTMs within RSRS motifs indicates that phosphorylation and methylation do not often co-occur, suggesting they may functionally oppose one another. Collectively, these findings suggest that the level of PTMs within Arg-rich domains may be among the highest in the proteome, and a possible unexplored regulator of RNA metabolism. These data also serve as a resource to facilitate future mechanistic studies of the role of PTMs in RNA-binding protein structure and function.

Briefs
Middle-down proteomics reveals arginine-rich RNA-binding proteins contain many sites of methylation and phosphorylation.


Introduction
after which the slurries were loaded onto a column. The resin was washed with 10 column volumes of wash buffer (50 mM HEPES pH 7.4, 200 mM NaCl and 1% TX-100), and eluted with 4 x 1 ml elution buffer (50 mM Tris-HCl pH 8.5, 500 mM NaCl, 20 mM reduced L-glutathione (Sigma G4251) and 0.1% TX-100). The eluted fractions were concentrated to ~200 μl using Amicon Ultra-0.5 ml 10K MWCO Centrifugal Filter Units (EMD Millipore), and dialyzed overnight against 50 mM HEPES pH 7.4, 200 mM NaCl and 0.1 mM PMSF using 10K MWCO Slide-A-Lyzer MINI Dialysis Units (Thermo). Protein concentration was determined by running each elution fraction on an SDS-PAGE gel with bovine serum albumin (BSA) standards ranging from 0.2-1 μg per lane and staining with Coomassie G-250 [36]. Densitometry of the BSA standards was used to calculate the concentration of GST affinity purified protein.

Western Blotting
Western blotting was performed according to a standard protocol, as previously described by Bishof et al. [3]. In short, samples were boiled in Laemmli sample buffer (8% glycerol, 2% SDS, 50 mM Tris pH 6.8, 3.25% beta-mercaptoethanol) for 5 minutes, then resolved on a Bolt® 4-12% Bis-tris gel (catalog no. NW04120BOX, Invitrogen) by SDS-PAGE and semi-dry transferred to a PVDF membrane with the iBlot2 system (ThermoFisher). Membranes were blocked with TBS Starting Block Blocking Buffer (catalog no. 37542, ThermoFisher) and probed with primary antibodies (1:1,000 dilutions) overnight at 4 °C. Membranes were then incubated with secondary antibodies conjugated to either Alexa Fluor 680 (Invitrogen) or IRDye800 (Rockland) fluorophores for one hour at RT. Membranes were imaged using an Odyssey Infrared Imaging System (Li-Cor Biosciences) and band intensities were calculated using Odyssey imaging software.

In-gel Limited Trypsin digestion
The GST-LC1/BAD (residues 231-310) purified protein (14 µg) was run onto a 10% acrylamide gel. The gel was stained with Coomassie G-250 and the GST-LC1/BAD band was compared to BSA standards to estimate protein concentration. The GST-LC1/BAD band was cut out and diced into small pieces. The gel pieces were divided among five tubes each receiving ~3 µg of GST-LC1/BAD. Gel pieces were de-stained until clear using 70% 50 mM ammonium bicarbonate (ABC) and 30% acetonitrile. While on ice, each tube received 30 µl digest buffer [12.5 ng/µl trypsin (Pierce MS grade) in 50 mM ABC buffer]. After the addition of trypsin, samples were incubated on ice for 3 minutes then brought up to room temperature to start digestion. At 6 different time points (15, 30, 60, 120, 240 minutes, and overnight) excess trypsin solution was removed and the digestion reaction was stopped with 30 µl extraction buffer (50% acetonitrile, 5% acetic acid).
After the addition of extraction buffer, samples were allowed to equilibrate for five minutes, then stored at -20 °C until peptide extraction. Peptides were shaken for 40 minutes at room temperature then spun at 20,000 x g for one minute. After a one-minute rest following the first spin, peptides were spun again. This spin-and-rest cycle was repeated twice. The supernatant containing the extracted peptides was then collected into a new tube, in which 30 µl of extraction buffer was then added. This extraction process was repeated two more times. Peptides were lyophilized using a SpeedVac (catalog no. 731022, Labconco) and resuspended in MS sample loading buffer (1% acetonitrile, 0.1% formic acid, and 0.03% trifluoroacetic acid).

Nucleoplasm Enrichment
This cellular extraction procedure was adapted from the Gozani group [37]. In short, cells from eight 15-cm plates were combined and rinsed with cold PBS, and then scraped in 10 ml PBS + 1X Protein Inhibitor Cocktail buffer (catalog no. COUL-RO, Roche) and centrifuged for 5 min at 1,000 x g at 4 °C. Cells were washed once in 1 ml 1X (cold) PBS and recovered by centrifugation for 5 min at 1,000 x g at 4 °C, swelled in 75 µl of hypotonic lysis buffer (10 mM HEPES pH 7.9, 20 mM KCl, 0.1 mM EDTA, 1 mM DTT, 5% glycerol, 0.5 mM PMSF, 10 µg/ml aprotinin, 10 µg/ml leupeptin) and incubated on ice for 10 min. Samples were lysed by 0.1% NP-40, vortexed, and incubated on ice for 5 min. Nuclei were recovered by centrifugation for 10 min at 15,600 x g at 4 °C. Nuclei were extracted for 30 min on ice in 40 µl high salt buffer (20 mM HEPES pH 7.9, 0.4 M NaCl, 1 mM EDTA, 1 mM EGTA, 1 mM DTT, 0.5 mM PMSF, 10 µg/ml aprotinin, 10 µg/ml leupeptin). Samples were sonicated for 5 sec and extracts were collected by centrifugation for 10 min at 15,600 x g at 4 °C. The supernatant obtained following the centrifugation step consisted of isolated nucleoplasm and the resulting pellet was the chromatin fraction.

In-solution digest of nucleoplasm fractions
For each time point, 150 µg of nucleoplasmic fraction was digested. Samples were brought to a final concentration of 1 M urea, then dithiothreitol was added to a final concentration of 1 mM, and incubated for 30 minutes. Iodoacetamide was added to a final concentration of 1 mM, and incubated for 20 minutes in the absence of light. Samples were then diluted in digestion buffer, and digested with a 1:50 ratio of trypsin (Pierce MS grade) to total protein. The digestion reactions were performed at room temperature and quenched at increasing time lengths (5, 10, 20, 40, 80, 160 minutes, overnight) with a solution of 0.1% formic acid and 0.01% trifluoroacetic acid. Resulting peptides were cleaned up using an HLB column (Waters). The column was washed first with methanol, then Buffer C (50% acetonitrile and 50% water), then 0.1% trifluoroacetic acid. The digested samples were then loaded onto the column, washed with 0.1% trifluoroacetic acid twice, and then eluted with Buffer C. The resulting eluate was lyophilized using a SpeedVac (catalog no. 731022, Labconco).

Mass spectrometry analysis
Lyophilized peptides were resuspended in loading buffer (0.1% formic acid, 0.03% TFA, 1% acetonitrile) and separated on a self-packed C18 (1.9 μm Dr. Maisch, Germany) fused silica column (20 cm × 75 μm internal diameter; New Objective, Woburn, MA) by a NanoAcquity UHPLC (Waters). Linear gradient elution was performed using Buffer A (0.1% formic acid, 0% acetonitrile) and Buffer B (0.1% formic acid, 80% acetonitrile) starting from 3% Buffer B to 40% over 100 min at a flow rate of 300 nl/min. Mass spectrometry was performed on an Orbitrap Fusion Tribrid Mass Spectrometer. Data-dependent MS/MS analyses included a high resolution step (120,000 at m/z 400) with an m/z range of 100-1000. MS1 scans were conducted in the Orbitrap, and the top 10 ions with highest charge, followed by the precursor ion with the greatest intensity, were given priority for fragmentation. A data-dependent decision tree was used [38] and MS/MS spectra from both HCD and ETD were collected in the ion trap. Peptides with a charge state of +2 were fragmented by HCD only, while the fragmentation of charge states +3 and above depended on precursor m/z. At +3 charge state, precursor ions under 650 m/z were fragmented by both ETD and HCD, and those equal to or greater than 650 m/z by HCD only. At +4 charge state, precursor ions under 900 m/z were fragmented by both HCD and ETD, and only by HCD at m/z values equal to or greater than 900. At +5 charge state, precursor ions under 950 m/z were fragmented by both HCD and ETD, and only by HCD at m/z values equal to or greater than 950. At +6 charge state, all precursor ions were fragmented by ETD. Dynamic exclusion was set to exclude previously sequenced precursor ions for 30 seconds.
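The charge- and m/z-dependent rules above can be summarized as a small lookup. The following Python sketch is illustrative only: the function name and return type are hypothetical, and the handling of charge states below +2 or above +6, which the text does not describe, is an assumption.

```python
from typing import FrozenSet

# m/z cutoffs taken from the text: below the cutoff a precursor is fragmented
# by both HCD and ETD, at or above it by HCD only.
ETD_MZ_CUTOFF = {3: 650.0, 4: 900.0, 5: 950.0}


def fragmentation_methods(charge: int, mz: float) -> FrozenSet[str]:
    """Return the fragmentation methods applied to a precursor ion."""
    if charge < 2:
        # Assumption: charge states below +2 are not described in the text.
        return frozenset()
    if charge == 2:
        return frozenset({"HCD"})
    if charge >= 6:
        # The text states the rule for +6; applying it to higher charge
        # states as well is an assumption of this sketch.
        return frozenset({"ETD"})
    if mz < ETD_MZ_CUTOFF[charge]:
        return frozenset({"HCD", "ETD"})
    return frozenset({"HCD"})


if __name__ == "__main__":
    for z, mz in [(2, 500.0), (3, 600.0), (3, 650.0), (4, 899.9), (5, 950.0), (6, 1200.0)]:
        print(z, mz, sorted(fragmentation_methods(z, mz)))
```

Written this way, the boundary cases are explicit: a +3 precursor at exactly 650 m/z falls on the HCD-only side.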

Database Searching
Data files for the time points were analyzed using MaxQuant v1.5 [39]. Methionine oxidation (+15.995 Da) and protein N-terminal acetylation (+42.011 Da) were included as variable modifications (up to 5 allowed per peptide); cysteine was assigned a fixed carbamidomethyl modification (+57.022 Da) for the nucleoplasm. A two-stage search was performed as described previously [40]. This two-step method allows for a larger search space while limiting the false discovery rate (FDR) [41]. For the first search, only fully tryptic peptides with up to 2 missed cleavages were considered. A precursor mass tolerance of ± 20 ppm was applied prior to mass accuracy calibration and ± 4.5 ppm after internal MaxQuant calibration. Other search settings included a maximum peptide mass of 6,000 Da, a minimum peptide length of 6 residues and 0.6 Da tolerance for ion-trap ETD and HCD MS/MS scans. The FDR for peptide spectral matches, proteins, and site decoy fraction were all set to 1 percent. The proteins identified in the first database search were then used to make a targeted FASTA-formatted database to re-search the raw files with an expanded missed cleavage window and variable modifications. The targeted database corresponding to purified U1-70K LC1/BAD protein contained a FASTA file of 366 proteins and allowed for 6 missed cleavages. The targeted database corresponding to the nucleoplasm contained a FASTA file of 4,307 unique protein groups (20,392 protein isoforms), allowing 6 missed cleavages. Both targeted databases were used to search the spectra for variable PTMs including phosphorylation (S/T/Y) (+79.966 Da) and mono- and di-methylation (K/R) (+14.016 Da, +28.031 Da), in addition to N-terminal acetylation and methionine oxidation. The MS/MS spectra were collected by a low-resolution ion trap (product ion tolerance = 0.6 Da). As such, trimethylation of lysine (42.047 ± 0.002 Da) and acetylation of lysine (42.011 ± 0.004 Da) were not included in our search, given that the masses of these modifications are less than 20 ppm apart. The .raw and .txt files obtained from MaxQuant searches were uploaded to ProteomeXchange on 9/4/2019 (Accession ID: PXD015208).
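As a rough worked example of why lysine trimethylation and acetylation are hard to separate, the following Python sketch converts their mass difference into parts per million of the peptide mass. The helper names are hypothetical, and reading the 20 ppm statement relative to the precursor mass tolerance is an assumption of this sketch rather than something the text specifies.

```python
# Modification masses as quoted in the text (Da).
TRIMETHYL_DA = 42.047
ACETYL_DA = 42.011


def ppm_difference(delta_da: float, peptide_mass_da: float) -> float:
    """Express an absolute mass difference as ppm of the peptide mass."""
    return delta_da / peptide_mass_da * 1e6


def mass_at_ppm(delta_da: float, ppm: float) -> float:
    """Peptide mass (Da) at which delta_da corresponds to the given ppm."""
    return delta_da / ppm * 1e6


delta = TRIMETHYL_DA - ACETYL_DA  # about 0.036 Da

if __name__ == "__main__":
    # At the 6,000 Da maximum peptide mass used in the search.
    print(round(ppm_difference(delta, 6000.0), 2))
    # Peptide mass above which the two modifications differ by < 20 ppm.
    print(round(mass_at_ppm(delta, 20.0), 1))
```

Under these assumptions, the two modifications differ by less than 20 ppm for any peptide heavier than about 1,800 Da, and the 0.036 Da gap is far below the 0.6 Da product ion tolerance of the ion trap.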

Parameters for selecting Basic Acidic Dipeptide (BAD) and Arginine-Serine (RS) peptides
We applied a BAD score algorithm that added 1.0 point for each alternating basic and acidic charge (+/- or -/+) and added 0.1057/0.0764/0.0475 points for S/T/Y residues, respectively, neighboring a basic residue. These weights were intended to reflect the average relative phosphorylation frequency of each residue across all nucleoplasmic peptides, as phosphorylation transforms the dipeptide into a BAD sequence. The sum was divided by the peptide length to give a "BAD score" for each peptide. A cutoff score of 0.3 or greater was chosen, identifying 543 BAD peptides in total. These selection criteria enriched for Lys/Arg/Ser/Thr/Tyr residues, and actual residue frequency was factored into normalized PTM fold changes. RS peptides were identified through a similar algorithm that added 1.0 point for each alternating Arg-Ser/Ser-Arg pair. The sum was divided by the peptide length to give an "RS score" for each peptide. A cutoff of greater than or equal to 0.2 was chosen, identifying 533 RS peptides. These selection criteria resulted in an enrichment for Lys/Arg/Ser/Thr/Tyr residues, which was factored into normalizing PTM fold changes for RS peptides.
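The scoring described above can be sketched in Python as follows. This is one possible reading, not the authors' implementation: the text does not say how overlapping dipeptides are counted, so scoring every adjacent residue pair once is an assumption, and the function and constant names are hypothetical.

```python
BASIC = set("KR")
ACIDIC = set("DE")
# Weights for S/T/Y next to a basic residue, as given in the text.
PHOSPHO_WEIGHT = {"S": 0.1057, "T": 0.0764, "Y": 0.0475}
BAD_CUTOFF = 0.3
RS_CUTOFF = 0.2


def bad_score(peptide: str) -> float:
    """Sum dipeptide points over adjacent pairs, divided by peptide length."""
    total = 0.0
    for a, b in zip(peptide, peptide[1:]):
        if (a in BASIC and b in ACIDIC) or (a in ACIDIC and b in BASIC):
            total += 1.0  # basic/acidic pair (+/- or -/+)
        elif a in BASIC and b in PHOSPHO_WEIGHT:
            total += PHOSPHO_WEIGHT[b]  # S/T/Y following a basic residue
        elif b in BASIC and a in PHOSPHO_WEIGHT:
            total += PHOSPHO_WEIGHT[a]  # S/T/Y preceding a basic residue
    return total / len(peptide) if peptide else 0.0


def rs_score(peptide: str) -> float:
    """Count adjacent Arg-Ser or Ser-Arg pairs, divided by peptide length."""
    total = sum(1.0 for a, b in zip(peptide, peptide[1:]) if {a, b} == {"R", "S"})
    return total / len(peptide) if peptide else 0.0


if __name__ == "__main__":
    for pep in ["RDRERDRE", "RSRSRSPK", "AAGLLVNK"]:
        print(pep, round(bad_score(pep), 3), round(rs_score(pep), 3),
              bad_score(pep) >= BAD_CUTOFF, rs_score(pep) >= RS_CUTOFF)
```

Under this reading, a peptide made entirely of alternating basic and acidic residues approaches a BAD score of 1.0, while a typical tryptic peptide scores close to zero.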

Phosphosite
Motif Logo analysis was conducted (https://www.phosphosite.org/sequenceLogoAction.action) [42] to identify sequence motifs of serine phosphorylation, arginine monomethylation and arginine dimethylation. Input peptide sequence windows with unique sites of PTM were assembled and entered for analysis. A list of 46 monomethylated arginine peptides, 252 dimethylated arginine peptides and 458 phosphorylated serine peptides generated by our MS data served as input sequences. Additionally, motif analysis was performed for mono-/di-methylated lysines and threonine/tyrosine phosphorylation (Supplemental Fig. S5). Threshold-passing peptide sequences identified in the same experiment were entered separately through the pipeline to tabulate novel coverage statistics for these two special classes of peptide.

Measurement of PTM Coverage and Motif Occupancy
Modified sequences in 31-residue-wide windows were matched to sequences downloaded (5/24/2019) from phosphosite.org [42]. Excel was used to match only those methylated or phosphorylated sequences previously observed by mass spectrometry methods. The Excel character length (LEN) and SUBSTITUTE functions were used to search for post-translationally modified RSRS/SRSR sequences. Occurrences were normalized to the number of RSRS/SRSR sequences within a peptide and calculated as percent of total occurrences.
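The two operations described here, building a 31-residue window around a modified site and counting motif occurrences by a length-difference trick, can be sketched in Python. The exact spreadsheet formula is not given in the text, so the length-based count below is an assumed equivalent; the function names and the "_" padding character are hypothetical.

```python
def site_window(protein: str, position: int, flank: int = 15, pad: str = "_") -> str:
    """Return a (2 * flank + 1)-residue window centered on a 1-based position.

    Windows that run past either terminus are padded so the modified
    residue always sits in the middle.
    """
    i = position - 1
    if not 0 <= i < len(protein):
        raise ValueError("position is outside the protein sequence")
    left = protein[max(0, i - flank):i]
    right = protein[i + 1:i + 1 + flank]
    return left.rjust(flank, pad) + protein[i] + right.ljust(flank, pad)


def count_motif(sequence: str, motif: str) -> int:
    """Count non-overlapping motif occurrences via a length difference.

    Mirrors the spreadsheet idiom
    (LEN(seq) - LEN(SUBSTITUTE(seq, motif, ""))) / LEN(motif).
    """
    return (len(sequence) - len(sequence.replace(motif, ""))) // len(motif)


if __name__ == "__main__":
    seq = "MTQFLPPNLLALFAPRSRSRSRSYDRERS"
    print(site_window(seq, 17))
    print(count_motif(seq, "RSRS"), count_motif(seq, "SRSR"))
```

Because the count is non-overlapping, a run such as RSRSRS contributes one RSRS occurrence rather than two; whether the original analysis treated overlapping motifs differently is not stated.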

Results
As shown in Fig. 1A, the spliceosomal protein, U1-70K, contains two Arg-rich low complexity (LC) domains, LC1/BAD (residues 231-310) and LC2 (residues 317-407). The 'BAD' acronym of LC1/BAD stands for "Basic Acidic Dipeptide", containing dipeptide repeats of a basic (K/R) residue adjacent to an acidic residue (D/E). The U1-70K LC1/BAD domain has both BAD and RS motifs. The LC1/BAD domain of U1-70K is of particular biological interest due to its central role in U1-70K nuclear localization, granule formation, and co-aggregation with Tau in Alzheimer's disease [3,35,43]. There are currently over 50,000 peptide spectral matches to U1-70K using standard bottom-up approaches with complete trypsin digestion, mapping approximately 70% of the protein [44]. Despite this number of spectral matches, only 7.5% of the LC1/BAD domain has previously been sequenced, or 6 out of 80 residues [45,46] (Fig.1A). Thus, we sought to develop a method to sequence the U1-70K LC1/BAD domain that would be generally applicable to other Arg-rich RNA-binding proteins in the proteome.
In order to sequence low complexity Arg-rich domains, we needed to account for a large number of missed cleavages and PTMs, which would increase the search space compared to traditional proteome database searches. A two-step database search method was utilized wherein proteins matched from a primary search were used to create a smaller, focused database. This strategy limits the false discovery rate (FDR) and the number of false negatives [41], resulting in higher confidence peptide spectral matches (PSMs) compared to traditional one-step database search methods [41]. Our focused database was then used to perform a second search that included increased missed-cleavages, and PTMs (methylation and phosphorylation), using a <1% FDR cutoff.
A strength of ETD is the ability to fragment highly protonated peptides and retain PTMs [16,25,27,28,30-32]. For example, when comparing the MS/MS spectra of the same peptide fragmented by either ETD or HCD, the ETD method results in increased fragmentation (Fig.2A-B). These fragment ions (c- and z-ions) produced by ETD provide unique diagnostic ions, unafforded by canonical HCD/CID fragmentation methods, that enhance identification of the peptide (Fig.2B) [47,48]. When examining the HCD spectra, only a neutral loss of 98 Da is observed. This mass shift is a signature of a phosphorylation PTM, indicating the loss of phosphate and water [49-52] (Fig.2A). However, the HCD spectra contain few other fragment ions. When examining LC1/BAD peptides containing PTMs, the difference between peptides identified by ETD and HCD fragmentation is similarly apparent. ETD alone identified 55 PTM-containing LC1/BAD peptides, compared with HCD, which identified just 13 such peptides (Fig.2D). To visualize these findings, the LC1/BAD unique peptides identified by either HCD or ETD fragmentation were mapped onto the LC1/BAD domain (Fig.2E). ETD led to the identification of more LC1/BAD unique peptides than HCD, providing coverage of 79/80 residues (99%) within the LC1/BAD domain (Supplemental Table S1). If trypsin digestion is allowed to proceed overnight, the number of PSMs decreases and significant coverage is no longer obtained with either ETD or HCD. Using the middle-down ETD strategy, we achieved near complete coverage of the LC1/BAD domain of U1-70K in vitro.
Thus, the combination of both ETD and a limited digest is necessary for complete coverage of the LC1/BAD domain of U1-70K. Furthermore, we identified 36 sites of PTMs summed across all proteolysis time lengths (Supplemental Fig. S4A). Eight out of 10 possible serine residues within LC1/BAD were found to be phosphorylated. Additionally, 23 total sites of arginine methylation were discovered, as 12 of these arginines were mono-methylated while 13 arginine residues were dimethylated. Furthermore, four sites of lysine mono-methylation were discovered.
No lysine di-methylation was observed in the LC1/BAD domain of recombinant U1-70K LC1/BAD protein. This near complete coverage of purified recombinant U1-70K LC1/BAD serves as proof of concept for using a middle-down ETD approach to sequence Arg-rich proteins in complex mixtures.

Preparation of nucleoplasm fractions enriched with Arg-rich RBPs
We next sought to employ this approach globally to achieve widespread coverage of Arg-rich BAD and RS proteins, mapping PTMs therein, from complex cell extracts. Arginine is not evenly distributed throughout the proteome, but rather is densely concentrated in Arg-rich domains [1,2]. Small nuclear ribonucleoproteins (snRNPs), associated heterogeneous nuclear ribonucleoproteins (hnRNPs) and serine/arginine-rich splicing factors (SRSFs) form macromolecular spliceosome structures in the nucleus, many of which contain Arg-rich domains [53,54]. To obtain a biological sample enriched in RBPs with Arg-rich domains, cellular fractionation was performed to isolate the nucleoplasm. The nucleoplasm is rich in splicing RBPs, many of which undergo liquid-liquid phase separation (LLPS) and aggregate in neurodegenerative disease [3,36,55,56]. For downstream proteomic analysis, the over-representation of highly expressed histone proteins in nuclear fractions would suppress the sensitivity of the mass spectrometer to identify comparatively less abundant RBPs that we sought to capture in our analyses. Therefore, simultaneous nucleoplasm isolation and histone depletion was performed to increase the probability of sequencing Arg-rich RBPs, especially low stoichiometry peptides within BAD and RS domains, many of which are expected to contain PTMs.

MS
Nucleoplasm samples were incubated with trypsin, and reactions were quenched with acetic acid at 5, 10, 20, 40, 80, and 160 minutes, with the standard overnight digestion serving as a control.
Partially-digested peptides were extracted and analyzed by LC-MS/MS on an Orbitrap Fusion Tribrid mass spectrometer operating on an HCD-ETD decision tree as described for the recombinant U1-70K LC1/BAD domain expressed in HEK293T cells [28].  Table S2). Although fewer input protein database entries were included in the second search, it yielded significantly more matched unique peptides (~12,000) compared to the conventional database search, consistent with previous two-step search strategies [40].
Peptides with three or more missed-cleavages (n=8,398) and modified peptides (n=7,201) made up a significant proportion of the novel peptide matches in the two-step search strategy.
Furthermore, our second database search matched 96% of protein groups identified in the first search (4,140/4,307). GO-term analysis of the sequenced proteins demonstrates that the nucleoplasm fraction was enriched with factors involved in 'RNA binding', consistent with our western blot results (Fig.3D). Thus, the cellular fractionation method successfully isolated nucleoplasmic proteins, enriching for native BAD and SR RBPs while also depleting histones, generating a complex sample amenable to middle-down ETD analysis.

Enhanced sequencing coverage of Arg-rich proteins by middle-down ETD MS
We attempted to compare the influence of divergent HCD or ETD fragmentation strategies on sequencing events across the complex nucleoplasm sample. Notably, ETD and HCD sequenced a similar number of unique peptides with relatively little overlap. A total of 23,398 and 29,891 unique peptides were sequenced by ETD and HCD, respectively, while 8,239 peptides were identified by both fragmentation methods (Fig.4A). Peptides above +2 charge are preferentially sequenced by ETD as directed by the ETD/HCD decision tree algorithm [28], contributing to this peptide sequencing disparity (Supplemental Fig. S1C). This indicates that for complex sample mixtures, a decision tree approach that utilizes both ETD and HCD fragmentation platforms may provide more complementary sequence information than is currently achieved with standard sample preparation methods [16,23-26]. For example, at shorter trypsin proteolysis times, ETD-fragmented peptides constituted the majority of sequenced peptides (Supplemental Fig. S1A).
Between 20-40 minutes of digestion, however, HCD fragmentation becomes the preferred method of fragmentation, contributing the majority of matched peptides from 40 minutes of digestion on.
Importantly, peptides sequenced following ETD fragmentation were generally more confidently scored and assigned a peptide sequence, averaging a higher Andromeda Score (147.86), a measure of the confidence of the peptide sequence assignment, as compared to HCD-fragmented peptides (103.31) (Fig.4B).
We then sought to assess the overall quality of our peptide spectral matches after expanding the search space of our raw spectral data using a non-standard targeted search accounting for increased missed cleavage events produced as a result of the shorter trypsin digestion times necessary for analyses of Arg-rich proteins (Supplemental Fig. S1D). Peptides at shorter trypsin digestion time lengths were high-confidence identifications with Andromeda scores comparable to the standard overnight digestion sample (Fig.4C). Interestingly, the 5-minute digestion produced the most unique peptides of any time point, and averaged higher Andromeda scores (140.78) than the standard overnight digestion sample (122.63) (Fig.4C).
When analyzing peptides modified by methylation or phosphorylation, the reliance on shorter trypsin incubation lengths and ETD fragmentation strategies becomes even more apparent.
The majority of peptides with at least one PTM are successfully matched after ETD fragmentation across all limited trypsin proteolysis time lengths tested, only surpassed by HCD fragmentation in the standard overnight digestion sample, where peptides average 0.58 missed-cleavage events (Supplemental Fig. S1B, S1D). These modified peptides fragmented by ETD do not suffer from significantly reduced confidence scores, however, as they generally overlap with the Andromeda Scores of HCD-fragmented peptides (Fig.4B). As methylation of arginine or lysine does not affect the charge of the residue, Arg-rich peptides may thus be preferentially identified by ETD at earlier digestion time points, according to the data-dependent HCD/ETD decision tree (Fig.4D, Supplemental Fig. S1B) [28]. Indeed, 764 out of 1076 peptides to Arg-rich proteins (71%) were fragmented by ETD (Supplemental Fig. S2D). Therefore, it is apparent that ETD fragmentation and limited digestion strategies enhance the identification of Arg-rich sequences.

BAD and RS motif algorithm resolves RNA-binding protein subgroups with distinct biological properties
RNA-binding proteins with BAD and RS domains have multiple binding partners that can change with cellular condition, a process hypothesized to be regulated by PTMs [3,57,58]. To study the characteristics of two divergent yet similar Arg-rich protein subgroups, we implemented a scoring algorithm to select for peptides with alternating basic-acidic "BAD" or arginine-serine "RS" dipeptides normalized to peptide length, and focused the following analyses on peptides that scored above a stringent cutoff score (Supplemental Fig. S2A-B). The BAD algorithm selected 543 peptides, approximately 0.89% of all unique peptides sequenced (Supplemental Fig. S2A).
To examine the biological function of proteins containing BAD or RS domains, GO-elite analysis was performed, as compared to that of the total nucleoplasmic proteome identified by our analysis (Fig.5C-D). Using the 3,900 total nucleoplasmic gene symbols identified in this study as background, both the BAD and RS proteins are enriched in 'RNA splicing' and 'mRNA processing' functions. Both the BAD and RS proteins, in particular, are enriched with helicase activity (Fig.5C-D). Helicases remodel snRNA structure, which is a critical step in spliceosome rearrangement, and various steps of RNA processing [61-66]. The RS proteins meanwhile, enrich for RS domain binding, as expected (Fig.5D). Furthermore, these algorithm-selected groups naturally partition to the correct subcellular locations, including 'nuclear speckles', the 'spliceosomal complex' and the 'nucleoplasm' (Fig.5C-D). Additionally, functions tangential to canonical splicing regulatory roles were parsed out by GO analysis, as there are many BAD proteins that ubiquitinate histones, including E3 ligases (Fig.5C) [67], while RS proteins aid in RNAPII transcription enhancement as well as mRNA export (Fig.5D). Thus, by utilizing ETD and bioinformatics approaches we were able to infer the biological function of two similar, yet distinct Arg-rich domains.

Novel sites of phosphorylation and methylation on RBPs in the nuclear proteome revealed by middle-down ETD MS
Due to the increased proteomic coverage yielded by utilizing a combination of limited trypsin digestion and ETD fragmentation, we hypothesized that our method would achieve coverage of previously uncovered regions of the proteome, and in particular, BAD and RS proteins in the nucleoplasm fraction. All nucleoplasm peptides were then searched against all previously annotated sequences on the peptideatlas.org mass spectrometry repository database to determine if our sequences represented previously observed, or rather, unreported proteomic coverage [44].
A total of 29,114 residues mapping to the top isoform of each gene that were previously unobserved by mass spectrometry approaches, were uncovered by our middle-down ETD examination (Supplemental Fig.S3A). If this analysis is applied to all known alternate isoforms, we achieved coverage of a total of 101,438 residues, currently unreported on peptideatlas.org. The BAD and RS peptides alone matched to 2,410 and 3,096 previously un-sequenced residues mapping to the most common isoform, previously missed by conventional bottom-up MS methods (Supplemental Fig.S3A). With increased proteomic coverage, we hypothesized that a significant amount of unreported PTMs would also be identified. We thus compared PTM sites identified in this experiment to that of all PTMs previously identified by mass spectrometry and uploaded on phosphosite.org [42], which allows us to catalog unreported PTM sites identified in this study. A total of 681 unreported PTMs were identified in this experiment, with over half consisting of dimethylated arginine and phosphorylated serine (Supplemental Fig.S3B).
While only accounting for ~1.5% of all residues detected (Supplemental Table S2), BAD and RS peptides made up ~20% of all unreported residue coverage (Supplemental Fig.S3B).

Arg-rich domains in RBPs contain combinatorial PTMs
As the Arg-rich BAD and RS domains have a high density of modifiable residues (Lys/Arg/Ser/Thr/Tyr), we hypothesized these domains may have an increased frequency of multiply-modified peptides. Indeed, the majority of BAD and RS peptides contained two or more PTMs, while less than 10% of nonBAD/RS peptides were multiply-modified (Fig.6A). While the majority of BAD and RS peptides were modified, nonBAD/RS peptides were overwhelmingly unmodified (89%) (Fig.6B). RS peptides, in particular, more frequently contained the maximum amount of PTMs (5) on a single peptide, than one or no PTMs at all (Fig.6B).
We next sought to determine whether BAD and RS peptides contained a combination of several PTM subtypes at once. Peptides belonging to nonBAD/RS, BAD and RS subgroups were classified according to the number of PTMs contained (0-5). The percentage of peptides belonging to uniform PTM states (mono-methylation, di-methylation, phosphorylation), double-PTM states (mono-and di-methylation, monomethylation and phosphorylation, dimethylation and phosphorylation) or a triple-PTM state (mono-and di-methylation and phosphorylation) were then calculated, and percentages were displayed as a heat map (Fig.6C). The majority of nonBAD/RS peptides were unmodified, although the most frequent PTM state was a single phosphorylation PTM. BAD peptides were enriched in combinatorial PTMs, with over 40% containing a combination of PTM subtypes on a single peptide (Fig.6C). RS domains were more frequently modified, as almost 3 out of every 4 peptides were modified by a combination of PTM subtypes (Fig.6C). Namely, ~30% of all RS peptides contained all three PTM subtypes searched for in a single peptide (Fig.6C).
This was further illustrated when we performed peptide mapping within several highly rated BAD and RS proteins, including LUC7L2, SRSF2 and SRSF4. Indeed, the Arg-rich BAD/RS domains were uniquely enriched in methylation and phosphorylation (Fig.6D, Supplemental Table S5). Compared to peptides mapping to surrounding domains, BAD and RS domains have a complex combinatorial signature of PTM. As PTMs are an essential regulator of the structure and function of RBPs [68-72], we next sought to further characterize the frequency of these PTMs in the nucleoplasm proteome.

Arg-rich domains in RBPs are densely modified
We next performed PTM quantification on BAD and RS peptides, as compared to the background nucleoplasm proteome. Specifically, we calculated the average frequency of a PTM within a peptide, divided by the number of potentially modifiable (Lys/Arg/Ser/Thr/Tyr) residues within that peptide. On average, approximately 22% of Lys/Arg/Ser/Thr/Tyr residues within the BAD peptides are modified (Fig.7A). This is in stark contrast to ~4.5% of residues modified in the remainder of the nucleoplasm (nonBAD/RS) proteome sequenced (Fig.7A). The RS peptides are similarly enriched in methylation and phosphorylation, as nearly a quarter (23%) of Lys/Arg/Ser/Thr/Tyr residues within RS peptides are modified (Fig.7A). In sum, the BAD and RS peptides are approximately five-fold more likely to contain lysine methylation, arginine methylation or serine phosphorylation, compared with nonBAD/RS sequences identified in the nucleoplasm (Fig.7A).
Next, we sought to determine whether the observed increase in PTM density is due purely to an increased number of modifiable residues or rather to a true increase in PTM frequency in BAD and RS regions compared to residues in nonBAD/RS regions. We divided the PTM frequencies observed in BAD and RS peptides by the frequencies observed in the background nucleoplasm proteome (nonBAD/RS) to estimate the relative fold increase in PTM frequency. We found markedly increased methylation and phosphorylation fold changes, indicating that the hypermodification observed within BAD and RS sequences is not simply caused by an increase in modifiable residue density within these regions, but also reflects an intrinsically increased likelihood for residues in these domains to be modified (Fig.7B). For example, serine is almost six times more likely to be phosphorylated in BAD and RS domains than in the rest of the proteome, even when normalizing for the presence of serine by our selection criteria (Fig.7B).
Thus, by increasing proteomic coverage with middle-down ETD, we discovered that Arg-rich domains contain markedly increased PTM densities.
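The per-residue normalization and fold-change estimate described above can be sketched as follows. This is an illustrative Python example under stated assumptions, not the study's analysis code: the function names, the input format (a sequence paired with a set of 0-based modified positions) and the toy peptides are all hypothetical.

```python
def residue_ptm_frequency(peptides, residue):
    """Fraction of occurrences of `residue` that are modified.

    `peptides` is an iterable of (sequence, modified_positions) pairs,
    where modified_positions is a set of 0-based indices."""
    total = 0
    modified = 0
    for sequence, positions in peptides:
        for i, aa in enumerate(sequence):
            if aa == residue:
                total += 1
                if i in positions:
                    modified += 1
    return modified / total if total else 0.0


def fold_change(region_frequency, background_frequency):
    """Ratio of the PTM frequency in a region to the background frequency."""
    if background_frequency == 0:
        raise ValueError("background frequency must be non-zero")
    return region_frequency / background_frequency


# Toy data (not real data): serine modification in a region vs. background.
region = [("RSRSRS", {1, 3})]                   # 2 of 3 serines modified
background = [("ASGSAS", {1}), ("SSS", set())]  # 1 of 6 serines modified
serine_fold = fold_change(
    residue_ptm_frequency(region, "S"),
    residue_ptm_frequency(background, "S"),
)
```

Because each frequency is divided by the number of occurrences of that residue, a region that simply contains more serines does not by itself produce a higher fold change. Applied to the aggregate values reported above, `fold_change(0.22, 0.045)` gives roughly 4.9, consistent with the approximately five-fold difference stated in the text.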

Phosphorylation and methylation favor RSRS motifs, yet do not frequently co-occur in this motif
Due to the high level of combinatorial PTMs detected within Arg-rich low complexity sequences, we analyzed PTM motif sequences in the nucleoplasm sample. To determine whether increased proteomic coverage reveals consensus sequences at sites of arginine methylation and serine phosphorylation, a motif analysis was performed. Our nucleoplasm middle-down ETD MS PTM-containing sequences were searched in 31-residue width windows using the Motif Logo tool offered by phosphosite.org to detect increased frequencies of residues adjacent to a central modified residue for monomethylated arginine (mmR), dimethylated arginine (dmR) and phosphorylated serine (pS) (Fig.7C) [42]. Importantly, the PTM motif searches were conducted in isolation, without consideration of other PTM subtypes. As expected, the canonical RGG/GAR (glycine-arginine-rich) motif was found as the favored consensus motif for arginine monomethylation [73][74][75].
Unexpectedly, an "RSRS" motif featured prominently for not only serine phosphorylation [76], but also arginine dimethylation (Fig.7C). The difference of motifs between mono-methylation and di-methylation modification states of arginine highlights the utility of the middle-down ETD strategy and alternate MS approaches in uncovering unreported features of the proteome.
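To make the windowing step concrete, the following Python sketch extracts a fixed-width sequence window centered on a modified residue, which is the kind of input a motif logo analysis operates on. It does not reproduce the phosphosite.org Motif Logo tool; the function name `motif_window`, the underscore padding character and the toy sequence are assumptions for illustration.

```python
def motif_window(protein_sequence, site, width=31, pad="_"):
    """Return a window of `width` residues centered on the 0-based `site`.

    Positions that extend past either terminus are filled with `pad`
    (an assumed convention for this sketch)."""
    if width % 2 == 0:
        raise ValueError("width must be odd so the site sits at the center")
    if not 0 <= site < len(protein_sequence):
        raise IndexError("site is outside the sequence")
    flank = width // 2
    padded = pad * flank + protein_sequence + pad * flank
    # In the padded string the site sits at index site + flank,
    # so the window starts at index site.
    return padded[site:site + width]


# Toy sequence (not real data): 5-residue window around the serine at index 2.
window = motif_window("ARSRSG", 2, width=5)
```

With the default `width=31`, the modified residue sits at index 15 of the returned window, with 15 flanking positions on each side.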
To assess whether serine phosphorylation and arginine methylation co-occur within RSRS motifs, a search was performed to select peptide sequences containing RSRS motifs. Interestingly, although arginine dimethylation and serine phosphorylation occur most frequently within BAD and RS peptides in nucleoplasmic RNA-binding proteins, these PTMs do not tend to co-occur within an RSRS motif. Namely, the most frequent modification states of the RSRS motif are either the central arginine residue(s) methylated (14.6%) or double serine phosphorylation (12.4%) (Fig.7D). This disparity of modification states suggests a complex layer of co-regulatory crosstalk between arginine and serine residues within Arg-rich domains.
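A search of this kind can be sketched in Python as below. This is only an illustration of the idea, not the search performed in the study: the labels `"me"` and `"ph"`, the function names and the state names are hypothetical, and the sketch treats methylation of either arginine in the motif alike rather than singling out the central arginine.

```python
import re


def find_rsrs(sequence):
    """Return 0-based start positions of all RSRS motifs, including
    overlapping ones (a lookahead lets matches overlap)."""
    return [m.start() for m in re.finditer(r"(?=RSRS)", sequence)]


def rsrs_state(sequence, start, mods):
    """Classify the modification state of the RSRS motif at `start`.

    `mods` maps 0-based positions to "me" (arginine methylation) or
    "ph" (serine phosphorylation); both labels are hypothetical."""
    positions = range(start, start + 4)
    has_me = any(sequence[i] == "R" and mods.get(i) == "me" for i in positions)
    has_ph = any(sequence[i] == "S" and mods.get(i) == "ph" for i in positions)
    if has_me and has_ph:
        return "both"
    if has_me:
        return "methylation only"
    if has_ph:
        return "phosphorylation only"
    return "unmodified"


# Toy peptide (not real data) containing two overlapping RSRS motifs.
peptide = "ARSRSRSG"
starts = find_rsrs(peptide)
state = rsrs_state(peptide, starts[0], {2: "ph", 4: "ph"})
```

Tallying the returned states over all RSRS-containing peptides would give a distribution of motif modification states of the kind summarized in Fig.7D.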

DISCUSSION
Mass spectrometry has achieved great advances towards the goal of globally characterizing PTMs within proteins [77], yet significant gaps in understanding remain. Here we describe a method to examine the PTM diversity of Arg-rich domains, re-purposing a technique first conceptualized by others to explore the biology of histone tails [16,24,32,78]. The power of middle-down ETD is that this method can map unique profiles of combinatorial PTMs within Arg-rich proteins, many of which are RBPs that aggregate in neurodegenerative disease [3,36]. This approach could be extended to diseased tissues to understand how PTM status changes and correlates with disease state or progression. This technique can be further used to determine the PTM profiles of Arg-rich domains across the proteome, eventually illuminating key regulatory steps in RNA processing and metabolism.
Here, we achieved full proteomic sequence coverage of the recombinant arginine-rich U1-70K LC1/BAD domain by a combination of limited proteolysis and ETD approaches. We then expanded the approach for the proteomic analysis of nucleoplasm proteins isolated from mammalian cells, many of which are Arg-rich RBPs. The use of ETD sequencing on arginine-rich proteins contributed to the coverage of thousands of residues currently unreported on peptideatlas.org [44,45], and hundreds of PTMs previously unannotated on phosphosite.org [42].
nuclear export [82]. Conversely, phosphorylation also blocks methylation in certain contexts. Namely, phosphorylation of RNA polymerase II (RNAPII) at its carboxy-terminal domain prevented symmetric arginine dimethylation at R1810, a modification integral to SMN protein interaction [83,84]. It is evident that there is functional cis-acting crosstalk between site-specific methylation and phosphorylation PTMs that is context-dependent. Therefore, the overall ratio between methylation and phosphorylation in arginine-rich domains may critically define protein function. The global identification of novel sites and further interrogation of the crosstalk between methylation and phosphorylation of RS motifs is warranted and may be revolutionized by this type of mass spectrometry approach.
The appearance of a range of arginine methylation states reflects a dynamic PTM process, in which protein function may be tuned by varying degrees of methylation of contiguous arginines. Interestingly, PRMTs appear to have a broad yet essential role in regulating alternative splicing. RNA-binding proteins were recently shown to be preferential substrates of the type II enzyme PRMT5. PRMT5 depletion in cells triggers changes in gene expression, cell-cycle deregulation and alternative splicing [81,85]. Another type II enzyme, PRMT9, regulates alternative splicing by methylation of spliceosome-associated protein 145 (SAP145) at R508, priming SAP145 for interaction with the protein SMN [86]. Thus, arginine methylation is likely an essential regulator of alternative splicing, and the residue-specific identification of arginine methylation sites in splicing RBPs can reveal novel and essential insights into the structure and function of these proteins. In the future, performing ETD MS on PRMT inhibitor-treated cells may reveal site-specific targets at individual residues within BAD and RS domains. This may offer insight into the therapeutic capacity of individual PRMT enzyme inhibition on a disease-specific basis. Numerous lines of evidence suggest that further characterization of RBP PTM status will increase our understanding of multiple steps in RNA processing [7,9,12,87].
A fundamental observation made in this study is that Arg-rich domains are densely modified by PTMs. These modifications likely regulate numerous aspects of protein function and localization. Furthermore, the middle-down ETD approach described here can be modified to study similarly Arg-rich dipeptide repeats (DPRs), including poly-GR and poly-PR, translated from an intronic repeat expansion in C9ORF72 in ALS. These large polypeptide expansion repeats may be sinks of arginine methylation, disrupting the normal PTM status of arginine-rich RBPs in ALS [88]. This sequencing technology could allow for the mapping of site-specific PTMs within these repeats, as well as a first opportunity to relatively quantitate DPRs of various lengths above disease thresholds from patient samples. At the moment, antibodies are used to confirm DPR presence [89,90], but they do not resolve exact DPR length.
One limitation of our study is the lack of PTM enrichment strategies, including IMAC and PTM immunoaffinity approaches, which limits the true depth of our analysis of the nucleoplasmic proteome. The possibility that a sample enriched for one PTM may also purify peptides with combinatorial PTMs is intriguing, and new, powerful search engines and algorithms, such as MSFragger and TagGraph, are currently being developed to handle the expanded search space of multiple PTMs [91,92]. However, by cellular fractionation methods we were able to capture and sequence combinatorially modified peptides after a two-step database search using a 1% PSM and protein FDR cutoff [40].
Importantly, the unique pattern of ETD fragmentation itself aids in the sequencing of modified peptides [33]. Notably, a number of orthogonal proteases may be used to plausibly increase and cross-validate sequence coverage of Arg-rich domains, including proteases that cleave non-specifically (Elastase), at acidic residues (AspN, GluC), at basic residues (ArgC, LysC) and at aromatic residues (Chymotrypsin). Multi-protease strategies such as Confetti have successfully increased coverage of the proteome [93]. For proteases that cleave at basic and acidic residues (ArgC, LysC, AspN, GluC), however, a limited trypsin proteolysis strategy must be utilized in a manner similar to that described in this study. Furthermore, cleaving U1-70K with the protease Chymotrypsin
Trypsin remains a robust protease that, while using middle-down strategies, generates highly-basic Arg-rich peptides of suitable lengths that are primed for ETD fragmentation and mass spectrometry analysis.
Finally, by illuminating previously "dark" regions of the Arg-rich proteome we have
Fusion using a data-dependent decision tree acquisition method [28] to alternatively select between HCD and ETD peptide fragmentation based on the charge state and m/z of the precursor ion. A two-step database search method using the Andromeda search engine was employed wherein proteins that were identified from a primary target-decoy search against human UniProtKB database were used to create a second smaller focused database. This focused database was then used to search for phosphorylation (serine, threonine or tyrosine) and mono-, di- and tri-methylation (arginine or lysine) with consideration of up to 6 missed trypsin cleavage events.

Supporting information
The following supporting information is available free of charge at ACS website http://pubs.acs.org Supplemental
(76/80 residues, 95%), identifying 9 phosphorylation sites and 9 total methylation sites. By comparison, the U1-70K LC2 domain, which lacks BAD or RS motifs but has similar arginine content, contains fewer methylation and phosphorylation PTMs. (C) Residues in red are those that have been previously observed by mass spectrometry analysis. U1-70K has 437 total amino acids, with 70% sequence coverage (306 AAs, red) currently deposited on the repository peptideatlas.org. As a result of our ETD analysis of nucleoplasm extract alone, we identified 111 new amino acids (green). This has led to an increase to 95% total sequence coverage of U1-70K by middle-down ETD MS approaches.