PT - JOURNAL ARTICLE
AU - G. Sampath
TI - Peptide partitions and protein identification: a computational analysis
AID - 10.1101/069526
DP - 2016 Jan 01
TA - bioRxiv
PG - 069526
4099 - http://biorxiv.org/content/early/2016/08/14/069526.short
4100 - http://biorxiv.org/content/early/2016/08/14/069526.full
AB - Peptide sequences from a proteome can be partitioned into N mutually exclusive sets and used to identify their parent proteins in a sequence database. This is illustrated with the human proteome (http://www.uniprot.org; id UP000005640), which is partitioned into eight subsets KZ*R, KZ*D, KZ*E, KZ*, Z*R, Z*D, Z*E, and Z*, where Z ∈ {A, N, C, Q, G, H, I, L, M, F, P, S, T, W, Y, V} and Z* ≡ 0 or more occurrences of Z. If the full peptide sequence is known then over 98% of the proteins in the proteome can be identified from such sequences. The rate exceeds 78% if the positions of four internal residue types are known. When the standard set of 20 amino acids is replaced with an alphabet of size four based on residue volume the identification rate exceeds 96%. In an information-theoretic sense this last result suggests that protein sequences effectively carry nearly the same amount of information as the exon sequences in the genome that code for them using an alphabet of size four. An appendix discusses possible in vitro methods to create peptide partitions and potential ways to sequence partitioned peptides.
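
The eight subsets named in the abstract can be read as anchored patterns over the residue set Z. The following is a minimal sketch of that reading, not the author's implementation: the names `Z`, `PARTITIONS`, and `classify` are hypothetical, and the sketch only assigns an already-given peptide string to a subset. How peptides are generated from the proteome is not specified in the abstract and is not modeled here. Treating the empty string as unclassified is also an assumption, since Z* formally allows zero residues.

```python
import re
from typing import Optional

# Residue set Z exactly as listed in the abstract: 16 of the 20 standard
# amino acids. K, R, D and E are not in Z and appear only at peptide ends.
Z = "ANCQGHILMFPSTWYV"

# One pattern per subset named in the abstract; Z* means zero or more
# residues from Z. Patterns are applied with fullmatch, so each one must
# describe the whole peptide.
PARTITIONS = {
    "KZ*R": re.compile(f"K[{Z}]*R"),
    "KZ*D": re.compile(f"K[{Z}]*D"),
    "KZ*E": re.compile(f"K[{Z}]*E"),
    "KZ*": re.compile(f"K[{Z}]*"),
    "Z*R": re.compile(f"[{Z}]*R"),
    "Z*D": re.compile(f"[{Z}]*D"),
    "Z*E": re.compile(f"[{Z}]*E"),
    "Z*": re.compile(f"[{Z}]*"),
}


def classify(peptide: str) -> Optional[str]:
    """Return the label of the subset that the peptide belongs to.

    Returns None for the empty string (an assumption of this sketch) and
    for peptides that fit no subset, for example one with an internal K.
    """
    if not peptide:
        return None
    for label, pattern in PARTITIONS.items():
        if pattern.fullmatch(peptide):
            return label
    return None


if __name__ == "__main__":
    for example in ["KAGR", "KLE", "K", "AGR", "AG", "AKG"]:
        print(example, classify(example))
```

Because K, R, D, and E lie outside Z, at most one pattern can match any given peptide, which is what makes the subsets mutually exclusive in this reading.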