Abstract
Alignment-based methods dominate molecular biology. However, because they primarily allow one-to-one comparisons, they favor a gene-centered viewpoint and lack the broad aperture needed for the analysis of complex biological systems. We hypothesized that contextual information related to a gene’s inclusion in the molecular network of the cell exists and is distributed among more than one sequence. The rationale was the need to conserve established interactions, which is arguably more important to the evolutionary success of a species than the conservation of individual function. To test whether this information exists, we applied a distributional semantics method, Latent Semantic Analysis (LSA), to thousands of species proteomes. Using natural language processing, we identified Latent Taxonomic Signatures (LTSs), a novel proteome-distributed feature supporting the argument that protein-coding genes do not evolve as taxonomy-independent variables. LTSs reflect the constraint imposed on the evolution of individual genes and proteins by their genome/proteome context. In summary, the discovery of LTSs indicates that genes had to trade away some of their “selfishness” to become parts of genome conglomerates.