PT  - JOURNAL ARTICLE
AU  - G. Sampath
TI  - Peptide partitions and protein identification: a computational analysis
AID  - 10.1101/069526
DP  - 2016 Jan 01
TA  - bioRxiv
PG  - 069526
4099  - http://biorxiv.org/content/early/2016/08/14/069526.short
4100  - http://biorxiv.org/content/early/2016/08/14/069526.full
AB  - Peptide sequences from a proteome can be partitioned into N mutually exclusive sets and used to identify their parent proteins in a sequence database. This is illustrated with the human proteome (http://www.uniprot.org; id UP000005640), which is partitioned into eight subsets KZ*R, KZ*D, KZ*E, KZ*, Z*R, Z*D, Z*E, and Z*, where Z ∈ {A, N, C, Q, G, H, I, L, M, F, P, S, T, W, Y, V} and Z* denotes zero or more occurrences of Z. If the full peptide sequence is known, over 98% of the proteins in the proteome can be identified from such sequences. The identification rate exceeds 78% if the positions of four internal residue types are known. When the standard set of 20 amino acids is replaced with an alphabet of size four based on residue volume, the identification rate exceeds 96%. In an information-theoretic sense, this last result suggests that protein sequences effectively carry nearly the same amount of information as the exon sequences in the genome that code for them using an alphabet of size four. An appendix discusses possible in vitro methods to create peptide partitions and potential ways to sequence partitioned peptides.
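
The eight-subset partition described in the abstract can be read as a pattern over peptide strings: an optional leading K, zero or more residues from Z, and an optional trailing R, D, or E. The following is a minimal Python sketch of that reading, not the author's implementation. The function name `classify_peptide` is hypothetical, and returning `None` for the empty string and for peptides that fit none of the eight patterns is an assumption made here for illustration.

```python
import re

# The 16 residue types in Z, as listed in the abstract (K, R, D, E are excluded).
Z = "ANCQGHILMFPSTWYV"

# Optional leading K, zero or more Z residues, optional trailing R, D, or E.
_PATTERN = re.compile(rf"(K?)([{Z}]*)([RDE]?)")


def classify_peptide(peptide):
    """Return the subset label (e.g. 'KZ*R'), or None if no subset fits.

    Hypothetical helper: treating the empty string as unclassified is an
    assumption of this sketch, not something stated in the abstract.
    """
    if not peptide:
        return None
    match = _PATTERN.fullmatch(peptide)
    if match is None:
        # K, R, D, or E occurs at an internal position (or an unknown letter).
        return None
    head, _, tail = match.groups()
    return f"{head}Z*{tail}"
```

Because K, R, D, and E are not members of Z, each peptide matches at most one of the eight labels, which is what makes the subsets mutually exclusive. For example, `classify_peptide("KAGLR")` gives `"KZ*R"`, while a peptide with an internal K, such as `"AKG"`, falls outside all eight subsets under this sketch.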