Genome-wide identification of protein binding sites in mammalian cells

We present GWPBS-Cap, a method to capture genome-wide protein binding sites (PBSs) without using antibodies. Using this technique, we identified many PBSs that differ in the strength of protein-DNA binding. The PBSs can be used to predict transcription factor binding sites and the co-localization of multiple transcription factors in the genome. The results also revealed that active promoters contained more PBSs with lower NaCl tolerances. Taken together, GWPBS-Cap can be used to efficiently identify protein binding sites and reveal the genome-wide landscape of DNA-protein interactions.

The distribution of PBSs in different gene regions

We also analyzed the distribution of PBSs from the six NaCl groups across different gene regions (Figure 2B). The distribution of PBSs in promoters, UTRs, and downstream (2 kb) regions followed parabolic curves. When the concentration of NaCl in the washing solution was less than 200 mM, the number of PBSs in these gene elements or downstream regions was negatively correlated with the concentration.

When the concentration of NaCl in the washing solution was more than 200 mM, the

(Table S2). PBSs in their promoters were explored (Table S3). We found that the binding strength of PBSs to promoters was inversely correlated with gene expression level (Figure 2C). For each of the five PBS groups, we selected the top 100 genes ranked by the number of PBSs from that group in their promoters. Genes whose promoters contain more PBSs tend to be highly expressed (Figure 2D). To gain further insight into the

We analyzed the base compositions of some TFs from the TFBS collection. For example, TCF3 factors have a well-defined DNA recognition site (CASSTG) that could be used for independent validation. We randomly selected 3,048 PBSs that are predicted to bind TCF3 (Table S4) and found that their PBS tags on the two strands show exonuclease barriers (Figure 3E). The sequences between the barrier pairs contained the TCF3 recognition site CASSTG (Figure 3F). This indicates that TFBSs are indeed captured by GWPBS-Cap.
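The check for the TCF3 recognition site can be illustrated with a minimal sketch. It assumes that CASSTG is written in standard IUPAC notation (S = C or G) and that the sequence between a barrier pair is available as a plain string. The function names (`find_motif`, `motif_regex`, `reverse_complement`) are hypothetical and are not part of the GWPBS-Cap pipeline.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also covers the ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA or IUPAC string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def motif_regex(motif):
    """Compile an IUPAC motif into a lookahead regex (finds overlapping hits)."""
    body = "".join(IUPAC[base] for base in motif.upper())
    return re.compile("(?=(" + body + "))")


def find_motif(seq, motif):
    """Return sorted (start, strand, matched_sequence) tuples for a motif.

    The reverse strand is scanned only when the motif is not its own
    reverse complement; CASSTG is, so one scan covers both strands.
    """
    seq = seq.upper()
    hits = [(m.start(), "+", m.group(1)) for m in motif_regex(motif).finditer(seq)]
    rc_motif = reverse_complement(motif)
    if rc_motif != motif.upper():
        hits += [(m.start(), "-", m.group(1)) for m in motif_regex(rc_motif).finditer(seq)]
    return sorted(hits)


# Toy sequence (not data from the study) containing one E-box, CAGCTG.
print(find_motif("TTCAGCTGAA", "CASSTG"))
```

Because CASSTG equals its own reverse complement at the IUPAC level, a hit on one strand implies the same site on the other, which is why the sketch skips the second scan in that case.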

To confirm that our predicted TFBSs are real TF binding sequences, we used SP1-ChIP to test our predictions. We selected one SP1 TFBS each from the 100 mM and 1000 mM salt groups (Materials and methods) and carried out ChIP-PCR to confirm the existence of these two sites (Figure S5). The result indicates that the PBSs captured by GWPBS-Cap can be used to predict TFBSs.

which motifs appear on the same PBS in Table 1.

To explore the potential protein-protein interactions between different TF families that bind to the same PBS, we selected two TF pairs, NFIC-TCF12 and RUNX3-EGR1, each composed of TFs from different families. We randomly selected 272 PBSs containing both NFIC and TCF12 binding sites (Table S4) and 100 PBSs containing both RUNX3 and EGR1 binding sites (Table S4). These PBSs with multiple TFBSs also present obvious peak pairs of exonuclease barriers (Figure 4A-B). We used MEME Suite [13] to find motifs in these PBSs and compared them with the corresponding single motifs downloaded from the JASPAR website [14]. We

PBSs genome-wide (Figure 5A-B). This confirms the notion that promoters tend to bind multiple TFs or TF families.

We also observed different binding strengths of TFs to promoters. As shown in Figure 5A and 5B, the 200 mM NaCl concentration was a dividing point, suggesting

form of TF complex to resist NaCl elution.

We also noticed that most of the TFBSs corresponding to promoters contained

DBDs. When they bind DNA, TF complexes of different families will enhance their resistance to Na+ because they have more focus on DNA, so they can still bind DNA closely in a high-concentration Na+ washing environment. Figure 5C shows that the

are not enough to capture most of the protein binding sites in these specific regions at once. In this study, we describe a method called GWPBS-Cap for capturing genome-wide protein binding sites.

In the experiments, we optimized the method for crosslinking DNA-binding proteins and magnetic beads. We tried using EDC (1-(3-dimethylaminopropyl)-3-ethylcarbodiimide hydrochloride) to mediate amino-carboxyl cross-linking between proteins and beads, but sequencing results showed that many non-specific