PT - JOURNAL ARTICLE
AU - G. Sampath
TI - Protein identification with a nanopore and a binary alphabet
AID - 10.1101/119313
DP - 2017 Jan 01
TA - bioRxiv
PG - 119313
4099 - http://biorxiv.org/content/early/2017/07/10/119313.short
4100 - http://biorxiv.org/content/early/2017/07/10/119313.full
AB - Protein sequences are recoded with a binary alphabet obtained by dividing the 20 amino acids into two subsets based on volume. A protein is identified from subsequences by database search. Computations on the Helicobacter pylori proteome show that over 93% of binary subsequences of length 20 are correct at a confidence level exceeding 90%. Over 98% of the proteins can be identified; most have multiple identifiers, so the false detection rate is low. Binary sequences of unbroken protein molecules can be obtained with a nanopore from current blockade levels proportional to residue volume; only two levels, rather than 20, need be measured to determine a residue's subset. This procedure can be translated into practice with a sub-nanopore that can measure residue volumes with ∼0.07 nm³ resolution, as shown in a recent publication. The high detector bandwidth required by the high speed of a translocating molecule can be reduced more than tenfold with an averaging technique; the resulting decrease in the identification rate is only 10%. Averaging also mitigates the homopolymer problem due to identical successive blockade levels. The proposed method is a proteolysis-free single-molecule method that can identify arbitrary proteins in a proteome rather than specific ones.
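
The recoding and lookup steps described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. The residue volumes are approximate, commonly tabulated values in cubic angstroms, and the threshold `THRESHOLD`, the helper names `to_binary`, `build_index`, and `identify`, and the exact-match k-mer index are all assumptions made for illustration; the abstract does not specify the partition or the search procedure.

```python
from collections import defaultdict

# Approximate residue volumes in cubic angstroms (commonly tabulated values;
# illustrative only, not taken from the paper).
VOLUME = {
    "G": 60.1, "A": 88.6, "S": 89.0, "C": 108.5, "D": 111.1,
    "P": 112.7, "N": 114.1, "T": 116.1, "E": 138.4, "V": 140.0,
    "Q": 143.8, "H": 153.2, "M": 162.9, "I": 166.7, "L": 166.7,
    "K": 168.6, "R": 173.4, "F": 189.9, "Y": 193.6, "W": 227.8,
}

# Hypothetical cut between the "small" and "large" subsets (assumption).
THRESHOLD = 125.0


def to_binary(seq, threshold=THRESHOLD):
    """Recode a protein sequence: '0' for small residues, '1' for large ones."""
    return "".join("1" if VOLUME[aa] > threshold else "0" for aa in seq)


def build_index(proteome, k):
    """Map each binary subsequence of length k to the ids of proteins containing it."""
    index = defaultdict(set)
    for pid, seq in proteome.items():
        code = to_binary(seq)
        for i in range(len(code) - k + 1):
            index[code[i:i + k]].add(pid)
    return index


def identify(binary_read, index, k):
    """Return a protein id if some length-k window of the read occurs in exactly one protein."""
    for i in range(len(binary_read) - k + 1):
        hits = index.get(binary_read[i:i + k], set())
        if len(hits) == 1:
            return next(iter(hits))
    return None  # no window of the read is a unique identifier
```

With a toy two-protein "proteome" such as `{"P1": "GASWFY", "P2": "GASGAS"}` and `k = 4`, `identify(to_binary("GASWFY"), build_index(proteome, 4), 4)` picks out `P1`. The abstract reports results for subsequences of length 20 on a real proteome; the short `k` here only keeps the example readable. The sketch assumes error-free reads and omits the averaging step mentioned in the abstract.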