PT - JOURNAL ARTICLE
AU - Valeriani, Lucrezia
AU - Cuturello, Francesca
AU - Ansuini, Alessio
AU - Cazzaniga, Alberto
TI - The geometry of hidden representations of protein language models
AID - 10.1101/2022.10.24.513504
DP - 2022 Jan 01
TA - bioRxiv
PG - 2022.10.24.513504
4099 - http://biorxiv.org/content/early/2022/10/26/2022.10.24.513504.short
4100 - http://biorxiv.org/content/early/2022/10/26/2022.10.24.513504.full
AB - Protein language models (pLMs) transform their input into a sequence of hidden representations whose geometric behavior changes across layers. Looking at fundamental geometric properties such as the intrinsic dimension and the neighbor composition of these representations, we observe that these changes highlight a pattern characterized by three distinct phases. This phenomenon emerges across many models trained on diverse datasets, thus revealing a general computational strategy learned by pLMs to reconstruct missing parts of the data. These analyses show the existence of low-dimensional maps that encode evolutionary and biological properties such as remote homology and structural information. Our geometric approach sets the foundations for future systematic attempts to understand the space of protein sequences with representation learning techniques. Competing Interest Statement: The authors have declared no competing interest.