Evolutionary and Structural Constraints Influencing Apolipoprotein A-I Amyloid Behavior

Apolipoprotein A-I (apoA-I) has a key function in the reverse cholesterol transport mediated by the high-density lipoprotein (HDL) particles. However, aggregation of apoA-I single point mutants can lead to hereditary amyloid pathology. Although several studies have tackled the biophysical and structural impacts introduced by these mutations, there is little information addressing the relationship between the evolutionary and structural features that contribute to the amyloid behavior of apoA-I. We combined evolutionary studies, in silico saturation mutagenesis and molecular dynamics (MD) simulations to provide a comprehensive analysis of the conservation and pathogenic role of the aggregation-prone regions (APRs) present in apoA-I. Sequence analysis demonstrated the pervasive conservation of an APR, designated here APR1, within the N-terminal α-helix bundle. Moreover, stability analysis carried out with the FoldX engine showed that this motif contributes to the marginal stability of apoA-I. Structural properties of the full-length apoA-I model suggest that aggregation is avoided by placing APRs into highly packed and rigid portions of its structure. Compared to HDL-deficiency or natural silent variants extracted from the gnomAD database, the thermodynamic and pathogenic impact of apoA-I point mutations associated with amyloid pathologies were found to show a higher destabilizing effect. MD simulations of the amyloid variant G26R evidenced the partial unfolding of the α-helix bundle and the occurrence of β-strand secondary elements at the C-terminus of apoA-I. Our findings highlight APR1 as a relevant component for apoA-I structural integrity and emphasize a destabilizing effect of amyloid variants that leads to the exposure of APRs. This information contributes to our understanding of how apoA-I, with its high degree of structural flexibility, maintains a delicate equilibrium between its native structure and intrinsic tendency to form amyloid aggregates. In addition, our stability measurements could be used as a proxy to interpret the structural impact of new mutations affecting apoA-I. Highlights Aggregation-prone region 1 (APR1), comprising residues 14-19, is consistently conserved during the evolutionary history of Apolipoprotein A-I. APR1 contributes to thermal stability of the α-helix bundle in the full-length Apolipoprotein A-I model. Amyloid variants introduce a destabilizing effect on the monomer structure of Apolipoprotein A-I, in contrast to HDL-deficiency and naturally-occurring variants, which are nearly neutral. During molecular dynamics simulations, G26R amyloidogenic mutant lead to the partial unfolding of α-helix bundle and exposure of APR1.


Introduction
Apolipoprotein A-I (apoA-I) is the most abundant protein component of high-density lipoproteins (HDL) and is responsible for the reverse cholesterol transport from extracellular tissues back to the liver (1,2), which has been associated with a protective function against cardiac disease and atherosclerosis (3,4). The scaffolding functions of apoA-I in the HDL particle and its multiple protein-protein interactions, mainly with the lecithin:cholesterol acyltransferase and the ATP-binding cassette A1 transporter (5,6), forces it to maintain a dynamic and flexible conformation (7). In contrast to these physiological functions, several point mutations affecting apoA-I have been associated with hereditary amyloid pathology (8). These mutations are mainly distributed into two "hot spots", located in the N-terminal and the C-terminal regions of the protein, each one with a typical clinical manifestation (9). Mutations that occur in the N-terminal region (residues 26-100) are characterized by amyloid deposits in the liver and kidney (10,11), while those located within a short C-terminal sequence (residues 170-178) are mainly associated with heart, larynx and skin deposits (12). In non-hereditary amyloidosis, full-length apoA-I is deposited in atherosclerotic plaques as fibrils or senile forms of amyloid. This process has been associated with aging, but it has also been described in chronic pathologies such as Alzheimer's disease and type 2 diabetes mellitus (13). Amyloid behavior of apoA-I N-terminal fragment has been attributed to the presence of aggregation-prone regions (APRs) in its sequence and, specifically, to an APR located at the N-terminus (11). It has been hypothesized that amyloidogenic mutations or post-translational modifications could promote aggregation through the destabilization of apoA-I structure, described as a molten globular state, followed by the exposure of APRs. In this sense, most studies addressing the effect of amyloid variants have focused on the biophysical and physiological consequences of single point apoA-I mutants. However, the relationship between apoA-I amyloidogenic motifs and its aggregation process remains unclear. In this study, through an extensive evolutionary analysis, we characterized the conservation of aggregating regions in a broad dataset of apoA-I sequences. Using the recently described full-length consensus structure (14), we examined the structural properties of apoA-I that contribute to reduce the exposure of its constituent APRs. In silico saturation mutagenesis analysis of apoA-I demonstrated that an evolutionary conserved APR (residues [14][15][16][17][18][19] contributes to the thermodynamic stability of the N-terminus and reveals a common destabilizing effect for amyloid-associated variants. Using molecular dynamics simulations, we studied the conformational and dynamic impact of five different amyloid variants on the structure of full-length apoA-I. Altogether, our results suggest that APR1 is a structural component that contributes to the stability of apoA-I helix bundle and emphasizes the destabilizing effect of amyloid variants, which is linked to subsequent APRs exposure in the case of G26R variant. This information is relevant to understand how a marginally stable, but metabolically active protein manages to initiate the formation of an amyloid structure and develop a severe pathology.

Molecular evolution of apoA-I in vertebrates
Because amino acid residues relevant for protein native structure tend to be maintained across its evolutionary trajectory, we conducted an analysis of apoA-I sequence conservation. As a first step, a maximum likelihood phylogeny reflecting the evolutionary relationships between sequences was reconstructed from a dataset comprising 215 species ( Figure 1A). Our phylogenetic tree segregated apoA-I sequences into three major groups: a distant clade including ray-finned fishes that diverged first from the common ancestor of apoA-I and two clades comprising tetrapods, one integrated by birds, reptiles, and frogs sequences and the remaining encompassing mammals. Using this phylogeny as a framework, we inferred site-wise evolutionary rates (ER) at both codon ( dN/dS , the ratio of nonsynonymous to synonymous mutations) and amino acid level, both with similar results. The ER profile of apoA-I protein sequence ( Figure 1B) revealed that most of the N-terminal region of the protein is relatively conserved, while a higher level of variability is present in the C-terminal region. Overall, the dN/dS profile indicated that a major portion of the protein sequence is evolving under purifying selection, suggesting the presence of functional and structural constraints acting on apoA-I (Supplementary Figure 1). ApoA-I sequence is organized into an N-terminal domain followed by ten 11/22-mer tandem repeats. Given the relevance of apoA-I repeats in the mature form of HDL particles, we decided to investigate the relationship between the ER and the residue identity inside these repeats ( Figure 1C). From these results, the proline and positively charged residues (arginine and lysine) were consistently more conserved when compared against other residues, a result that is also supported by the sequence logos of the tandem repeats (Supplementary Figure 2). Although the ER values corresponding to leucine residues were similar to the mean ER of apoA-I, we noticed several highly conserved leucine residues among apoA-I orthologs (e.g. L46, L163, L200, L214, and L218). Lastly, we tried to uncover possible patterns of coevolution between pairs of residue positions using the RaptorX server to predict a contact map for apoA-I. Intriguingly, the predicted contact pairs, located mostly in the N-terminus (Supplementary Table 1), do not correlate with contacts proposed in apoA-I structure (14). This inconsistency could reflect the coexistence of different conformational states where these contacts effectively occur.

Figure 1 ApoA-I Molecular Evolution in Vertebrates
A Maximum likelihood phylogeny of apoA-I orthologs. Major taxonomic groups are colored according to the legend. Branches with an ultra-fast bootstrap support value greater than 90 are marked with a black dot. Cartilaginous fishes were chosen as outgroup for phylogeny rooting. B Evolutionary rate (ER) profile for apoA-I protein sequence, estimated with LEISR. A value of 1 represents the average apoA-I ER, while those rates below and above 1 indicate sites evolving at a slower and a faster rate than the average, respectively. C Evolutionary rate for each residue type within apoA-I tandem repeats. Proline and positively charged residues (arginine (R) and lysine (K)) display values consistent with stringent evolutionary conservation.

ApoA-I has retained an APR during its evolutionary history
Protein aggregation through short regions has been associated with a wide range of human conformational diseases (15), notably amyloidosis. In the case of apoA-I, an N-terminal fragment has been linked to the formation of amyloid and atherosclerotic plaques, but a detailed mechanism of the amyloidogenic process remains elusive. To better understand how apoA-I sequence determines its aggregation properties, we performed a comprehensive evaluation of APRs present across vertebrates and mammals. We annotated as APR the regions with a TANGO score above 5% over a stretch of 5 or more residues and established the APR position as the centroid of its sequence range. From the distribution of APRs position across apoA-I sequences (Figure 2A), we evidenced that APRs in vertebrates are concentrated in two hotspots (residues 15-20 and 220-240) but mammalian APRs are almost exclusively located in the N-terminal region. Supported by our TANGO predictions and previous reports (16), we identified three APRs in the human apoA-I sequence, designated here APR1 (residues [14][15][16][17][18][19], APR2 (residues 53-58), and APR3 (residues 227-232). While the consensus sequences of APR1 and APR3 constitute amyloid peptides verified experimentally in the Waltz database (17) ( Figure  2B), APR2 consensus does not represent an amyloid sequence and seems to be specific to the human species. Overall, the pervasive conservation of the N-terminal APR1 suggests an important role in apoA-I structure for this region, as reported for other protein families (15). Given the fact that several natural variants of apoA-I have been reported to promote an increase in amyloid aggregation, we decided to investigate the impact of these mutations on its aggregation propensity. Although some of these variants have been proven experimentally to form amyloid fibers, we did not observe any increase in the intrinsic aggregation tendency for apoA-I amyloid variants (Supplementary File 1). Moreover, point mutations did not introduce novel APRs on apoA-I sequence. Based on the human full-length structure of apoA-I (14), we decided to characterize the structural properties of its APRs. We complemented our previous TANGO aggregation measurements with solubility scores provided by the CamSol method, hexapeptide β-zipper tendency available at ZipperDB, residue mean square fluctuation (MSF) calculated with the Gaussian Network Model (GNM) implemented in ProDy and packaging level as measured by the weighted contact number (WCN). ApoA-I APRs showed low solubility ( Figure 3A) and a high propensity to form aggregates (β-zipper structures with low rosetta energy) ( Figure 3B), which are in agreement with our TANGO results and with previous observations (16). Interestingly, these aggregating regions are located in portions of apoA-I structure with low mobility ( Figure  3C) and highly packaged environments, preventing its exposure to the solvent ( Figure 3D). This is especially evident for APR1, which is buried inside the N-terminal ɑ-helix bundle ( Figure 3E).

Amyloid-associated variants have a destabilizing effect on apoA-I monomer structure
In order to better understand the structural consequences of amyloid variants on apoA-I monomer, we explored their thermodynamic and pathological effects using in silico saturation mutagenesis. Destabilizing effect of each possible mutation in apoA-I sequence, represented by the difference in free energy (ΔΔG) between wild type and mutant structures, was measured using the FoldX empirical force field and the MutateX automation pipeline. To complement this approach, variant pathogenicity probability was estimated using Rhapsody. We noticed from the ΔΔGs distribution that most of the variants had a moderate impact on apoA-I stability (-1 kcal/mol < ΔΔG < 1 kcal/mol) ( Figure 4A, complete FoldX results are available with Supplementary Figure 3). Further examination revealed that apoA-I structure is highly sensitive to mutations in the region of residues 7-28, which comprises the APR1 ( Figure 4B). Pathogenicity probabilities also support this region as a mutation-sensible segment of apoA-I structure (Supplementary Figure 4). This result suggests that the conservation of APR1 in apoA-I could be necessary to maintain the marginal thermodynamic stability of the ɑ-helix bundle despite the risk to undergo aggregation. In line with our observations, APRs have been recently proposed to play a stabilizing role in protein structure (18). We used ΔΔG values to highlight differences between pathogenic variants associated with amyloidosis or HDL deficiencies (19), and natural variants reported by the gnomAD project (20). Our results evidenced that amyloid mutations had a destabilizing effect and a pathogenicity probability significantly greater when compared with natural or HDL-deficiency variants ( Figure  5A and 5B), emphasizing the relationship between structural destabilization and amyloid pathology onset. An interesting observation from this result is that HDL-deficiency mutations have similar effects compared with natural variants, suggesting that this type of mutations could exert its pathogenic effect with little consequences on apoA-I monomer stability. Given the fact that a small group of variants in the gnomAD database showed an elevated impact on protein stability (> 2 ΔΔG kcal/mol), we decided to investigate how frequently they occur at population level. Frequency spectrum ( Figure 5C) showed that variants with a severe impact on protein stability were present at low frequencies, thus minimizing their deleterious effect on the population. In contrast, variants with a higher frequency in our dataset had a nearly neutral effect on stability. It is worth noting that although gnomAD excluded subjects with mendelian and pediatric diseases from its cohorts, we cannot rule out the possibility that some of these destabilizing variants correspond to non-diagnosed pathologies.  Impact of apoA-I variants on protein stability and pathogenicity probability A-B Free energy difference (ΔΔG) and pathogenicity probability distributions for each variant class (amyloid, HDL-deficiency and natural, as reported by gnomAD project). Statistical significance was determined using the Mann-Whitney U Test. C Allele frequency distribution for gnomAD variants against its predicted effect on protein stability.

Molecular dynamics simulations of apoA-I mutants
To complement our previous results showing the destabilizing effect of amyloid variants, we decided to study the dynamic properties of apoA-I amyloid mutants by conducting coarse-grain molecular dynamics simulations under the SIRAH force field. We selected four amyloid mutants (G26R, L60R, Δ107 and R173P) previously characterized by our group (21)(22)(23)(24), plus the wild type protein, to prepare our simulation systems. Our selection also ensured that mutations were distributed throughout the apoA-I sequence.
In the first place, we explored the overall dynamics of our system by means of its root mean square deviation (RMSD). The recently described consensus structure for apoA-I was used as reference coordinates for RMSD calculations. We observed a great variability in the RMSD values for all simulated systems (5.4-10 Å) during our 1 µs simulation, which could be related with the highly dynamic and marginally stable structure proposed for apoA-I. We did not find evidence for any significant differences between systems (Supplementary Table 2 Figure 3C), reinforcing the dynamic profile obtained for apoA-I. The similar fluctuation profiles between the wild type apoA-I and the above-mentioned mutants suggest that mutations do not introduce major structural changes, at least during the simulation time frame. To further characterize the structural impact of single point mutations on each system we measured the gyration radius (Rg) as a general descriptor of the protein shape for each system. When mutants were compared against the wild type system ( Figure 6A), only the L60R system displayed a significantly higher Rg value, indicating a more extended conformation for this mutant. Mutants G26R and R173P also showed a tendency to present greater Rg values when compared against wild type, but they were statistically not significant, in part due to the highly variable nature of the apoA-I system. We explore the possible role of mutations in amyloid aggregation of full length apoA-I by analyzing the solvent accessible surface area (SASA) of each APR in our five systems. We noted a significant increase in the solvent exposure of the APR1 in the G26R system when compared against the wild type, while the other systems did not exhibit a significant increase of SASA values for any of the APRs ( Figure 6B). Visualization of the final time frames of the trajectory corresponding to the G26R system showed a partial unfolding of the ɑ-helix bundle, which explains the increased exposure of APR1 ( Figure 6C). Additionally, the G26R mutant evidenced the transitory formation of β-strand secondary structures at the APR3. The low impact of the L60P, Δ107 and R173P variants on APRs exposure suggests that these mutants could require further post-translational modifications in order to undergo native structure unfolding.

Figure 6 Molecular dynamics simulations of full-length apoA-I mutants
A Gyration radius (Rg) of each system computed over the last 100 nanoseconds from five independent simulations. The L60R mutant showed a higher Rg when compared against the wild type system (p-value ≤ 0.05, Student's Test). B Solvent accessible surface area (SASA) calculated for the APR1 (residues 14-19). The G26R system displayed a higher APR1 exposure when compared with the wild type system (p-value ≤ 0.05, Student's Test). C Graphical representation of the consensus model (WT) and the final snapshot of one of the replicas simulated for the G26R mutant. The substitution of glycine by arginine in position 26 destabilizes the helix bundle, expelling helix H6 with the concomitant solvent exposure of APR1 (left inset). A 180º view rotation shows the β-sheet hairpin formed between residues S224 and A232, corresponding to the APR3 (right inset).

Discussion
Molecular mechanism of amyloid aggregation for apoA-I remains largely unknown, due in part to the limited structural information given its inherent conformational plasticity (7). This work builds upon evolutionary, dynamical and structural features of apoA-I in order to provide a comprehensive characterization of the amyloid phenomena in this protein, complementing the extensive experimental results available. Collectively, our results suggest an intimate relationship between aggregation-prone regions and structural stability in apoA-I. Additionally, MD simulations of full-length apoA-I mutants shed light on the first steps of the aggregation process in amyloid mutants. First, we aimed to complement the few available studies that tackle the evolutionary history of apoA-I and their implications on its structure (25). An important observation that emerges from our results is that apoA-I evolution is tightly linked with the biophysical properties imposed by its constituent amphipathic ɑ-helices. This is especially evident in the case of prolines and positively charged residues, which are critical for apoA-I function and structure. Prolines have been extensively characterized as a fundamental component for apoA-I flexibility and stability, as their positioning at the beginning of the 22-mers induces a relative break of one helical segment respect to the other, allowing the protein rearrangement required for lipid removal and dynamic interactions with membranes and proteins interactors (26). In the case of charged residues, a strong lipid affinity has been attributed to the cationic residues within the polar face of the amphipathic ɑ-helices (27), as they interact with the negative heads of the phospholipids at the surface of lipid bilayers through a process designated "snorkeling" (28, 29). Arginine residues present at the helix 6 are also relevant in apoA-I-mediated activation of LCAT (30). In addition, the conservation of specific leucine positions observed from sequence logos could be driven by its involvement in lipid binding (31,32) and stabilization of the hydrophobic clusters inside the helix bundle (33). Given that fast evolving regions of proteins have been associated with greater flexibility (34,35), the higher evolutionary rate observed for the C-terminal region could be linked with the maintenance of the flexibility required for its lipid-binding properties. The fact that apoA-I has consistently conserved an aggregation-prone segment (termed here APR1) along its evolutionary history raises questions about its structural relevance. Amyloid motifs have been proposed to contribute to protein structural stability through extensive interactions inside protein hydrophobic cores (18,36), which establish a trade-off between protein environment, foldability and aggregation propensity (37,38). Based on its conserved nature and FoldX stability results, it is possible to hypothesize that APR1 is necessary to ensure the marginal stability of apoA-I ɑ-helix bundle, even though this region could trigger aggregation upon solvent exposure or proteolytic cleavage (39). Moreover, the presence of APR2 exclusively in human species represents a synergizing factor that could aggravate the amyloid behavior of apoA-I, as demonstrated recently for the N-terminal peptide (40,41). In this context, the structural features of APR1 (low intrinsic flexibility and highly packaged environment) are likely to control its exposure to solvent and prevent aggregation events. Hydrogen-deuterium exchange experiments (42) supports this highly packaged nature of the ɑ-helix bundle and the low solvent exposure of APR1 in apoA-I. Amyloidogenic variants are primarily located in the N-terminal region of apoA-I, whereas variants associated with HDL deficiencies are clustered in the H5-H7 region (19,43). Through a comprehensive evaluation of the destabilizing effect and pathogenicity of each possible mutation affecting apoA-I we demonstrated that amyloid variants have a significant destabilizing effect on the monomer structure. The fact that TANGO aggregation tendency of APRs was not modified by the introduction of amyloid mutations, supports the hypothesis that aggregation propensity per se has a limited impact on the aggregation process of full-length apoA-I (44). On the other hand, variants associated with HDL defects had a minimal effect on structural instability, which provides evidence that this type of disorders could be caused by mechanisms less dependent on protein unfolding and probably involving the disruption of interaction sites with protein interactors during the reverse cholesterol transport pathway. Taking advantage of the recently described consensus model of apoA-I (14), our MD simulations of mutant G26R revealed a partial unfolding of the N-terminal ɑ-helix bundle and a significant increase in the exposure of APR1, which is also congruent with the destabilizing effect predicted from our ΔΔG calculations. This partial unfolding is in line with the experimental reports of increased susceptibility to proteases (45) and greater solvent exposure of the ɑ-helix bundle (42) for this mutant. Moreover, β-sheet secondary structures present at APR3 could provide a template for the aggregation of full-length apoA-I (16). Altogether, our results obtained from full-length protein information support the current hypothesis that unfolding of the helix bundle and exposure of aggregating regions represents the first steps of apoA-I-mediated amyloidosis (41). The mild effect of L60R, Δ107 and R173P variants on apoA-I structure and APRs exposure suggest that further modifications could be required to promote protein aggregation of these mutants, like oxidation or proteolytic cleavage (46,47). Recently, the connection between the pro-inflammatory microenvironment and the formation of aggregation-prone species has been deeply characterized, reinforcing this hypothesis (21). Moreover, the presence of the N-terminal proteolytic fragment (residues 1-93) within patients' lesions raises the hypothesis that mutations may facilitate the cleavage of apoA-I by circulating proteases (48,49). Our work highlights the importance of an evolutionary-conserved aggregation-prone region for the stabilization of apoA-I ɑ-helix bundle and suggests a general destabilizing effect for amyloidogenic variants. These results expand our knowledge of the sequence determinants involved in amyloid aggregation mediated by apoA-I. In addition, we hope that our in silico saturation mutagenesis results will be useful to guide further experimental explorations of novel apoA-I mutants by others.

Materials and Methods
Evolutionary analysis of apoA-I sequences A comprehensive dataset of sequences was generated by collating apoA-I orthologs available at Ensembl and Refseq databases (50,51). To exclude low quality data, only sequences which did not contain ambiguous characters, had a proper methionine starting codon and were longer than 200 amino acids were kept. Additionally, as both Ensembl and Refseq have overlapping data for some species, CD-HIT clustering tool (52) was employed to generate groups of similar sequences with an identity cut-off value of 0.98. Our final dataset comprised 215 protein sequences from vertebrate species. In order to reconstruct a maximum likelihood phylogeny, a multiple sequence alignment (MSA) was built from the protein sequences using ClustalO with default parameters (53) and the phylogenetic inference was carried out with the IQ-TREE software (54). The substitution model was selected based on the ModelFinder evolutionary model fitting tool (55) and the ultrafast bootstrap implemented in IQ-TREE was used to calculate the support values for phylogeny branches (56). We rooted our phylogeny using cartilaginous fish species as outgroups (57). Visualization of the resulting phylogeny was carried out using the iTOL server (58). Selective pressure acting on apoA-I sequence Nucleotide coding sequences were retrieved for each protein in our dataset using the NCBI Entrez eutils tools for Refseq sequences and the Ensembl orthologs dataset. Because the evolutionary rate estimation requires a codon-level alignment, the software PAL2NAL was used to align codons in nucleotide sequence using a protein alignment as a guide (59). The Hypothesis Testing using Phylogenies (HyPhy) package was used to conduct evolutionary analysis on the codon-based alignment. Before testing for evidence of selective pressure, we conducted a recombination analysis using the Genetic Algorithm Recombination Detection (GARD) method (60), in order to screen for possible recombination events in our alignment; it is known that the presence of recombination leads to a larger number of false positives in selection analysis. We inferred the natural selection strength (Omega, dN/dS ) for each alignment position using our phylogeny as framework. We employed the Fixed Effects Likelihood (FEL) (61) and the Fast Unconstrained Bayesian Approximation (FUBAR) methods (62) to quantify the dN/dS ratio for each codon in the alignment. Although both methods provide similar information, FEL provides support for negative selection ( dN/dS < 1) whereas FUBAR has more statistical power to detect positive selection ( dN/dS > 1). Because codon alignment positions are difficult to put in structural context, data were extracted for codons occurring in wild type human apoA-I. Additionally, we estimated evolutionary rates based on alignments at amino acid level using LEISR (63).

Coevolving residue pairs
Pairs of residues that are evolutionary correlated (coevolving sites) are useful to predict structural contacts. However, the repetitive structure of apoA-I and the lack of a large number of orthologs poses difficulties for this kind of analysis. To overcome these difficulties, putative coevolving residues were computed using the RaptorX server (64). RaptorX applies an ultra-deep convolutional residual neural network to predict contacts and distance and works particularly well on proteins without many sequence homologs. This method works by predicting the contact/distance matrix as a whole instead of predicting one residue pair independent of the others. RaptorX output represents the probability of two residues being in contact (i.e., their distance falling in the range 0-8 Å). Only residue pairs with a contact probability greater than 0.5 were retained.

Structural features measurement
Residue solubility profile for apoA-I consensus structure was computed with the CamSol method (65). CamSol first calculates an intrinsic solubility score for each residue, based only on sequence information. Then, the algorithm applies a score correction to the solubility profile from the previous step to account for the spatial proximity of amino acids in the three-dimensional structure and for their solvent exposure. Fibril-forming segments were identified with the ZipperDB resource ( https://services.mbi.ucla.edu/zipperdb/ ). Fibrillation propensity is calculated as proposed by Thompson (66). Briefly, each hexapeptide not containing a proline from the query sequence is mapped onto the cross-beta crystal structure of the fibril-forming peptide NNQQNY. Energetic fit is evaluated with the RosettaDesign software (67). Hexapeptides with energies below the threshold of -23 kcal/mol were considered as highly propense to fibrillation. Packaging level for residue i was represented by its Weighted Contact Number (WCN), which was calculated as follows: Where, rij is the distance between the geometric center of the side-chain atoms for residue i and residue j . Calculations were carried out using a custom script (68). Protein intrinsic dynamics was characterized using a coarse-grained simulation model based solely on protein topological information represented as a Gaussian Network Model (GNM). In this approach, protein structure is modelled as a network of nodes (alpha carbons) connected by springs. Numerical resolution of this model allows the calculation of the equilibrium displacement for all nodes (Mean Square Fluctuation, MSF), describing the global motions of the system. The ProDy package (69) was used to adjust a GNM to the apoA-I consensus structure. We selected the first ten slow modes for analysis and plotting, since they have been reported previously as the main determinants of the global dynamics of protein structure (70).

Conservation of Aggregation-prone Regions (APRs)
Signal peptide sequences were trimmed and removed from the MSA to retain only the mature protein sequence. TANGO software (71) was used to detect APRs in the protein sequences dataset. This algorithm predicts beta-aggregation using a space phase where the unfolded protein can adopt one of five states: random coil, alpha-helix, beta-turn, alpha-helical aggregation or beta-sheet aggregation. Importantly, TANGO is based on the assumption that the core regions of an aggregate are fully buried. Predictions were carried out using default settings: no protection for the C-terminus and N-terminus, pH 7, temperature of 310 K and ionic strength of 0.1. Output files provide an aggregation score per position; as suggested in the TANGO manual, contiguous regions comprising five or more residues with a score of at least five were annotated as an APR. To address the impact of single point mutations in apoA-I aggregation tendency we ran TANGO for each mutant sequence and compared the scores profile against the wild type sequence. Sequence logos of each APR were plotted using the LogoMaker package (72). TANGO software was downloaded from http://tango.crg.es using an academic license.

Impact of missense variants on protein stability
The FoldX engine (73) implements an empirical energy function based on terms significant for protein structure stability. The free energy of unfolding (ΔG) of the protein includes terms for van der Waals interactions, solvation of apolar and polar residues, intra and intermolecular hydrogen bonds, water bridges, electrostatic interactions and entropic cost for fixed backbone and side chains. Changes in free energy of folding upon mutation is calculated as the difference between the folding energy (ΔΔG) estimated for the mutants and the wild type variants. Although FoldX seems to be more accurate for the prediction of destabilizing mutations and less accurate for the prediction of stabilizing mutations, in both cases it was shown that FoldX is a valuable tool to infer putative relevant sites for structural stability. FoldX 5 suite was downloaded from http://foldxsuite.crg.eu/academic-license-info . We employed MutateX software (74) to automate the prediction of ΔΔGs associated with the systematic mutation of each available residue within apoA-I, by employing the FoldX energy function. MutateX automated pipeline engine handles input preparation and performs parallel runs with FoldX. Basic steps involve protein data bank (PDB) structure repair (involving energy minimization to remove unfavorable interactions), model building for the mutant variants, energy calculations for both mutant and wild type structures and summarizing the estimated average free energy differences.

Pathogenicity probability of missense variants
The Rhapsody prediction tool (75) consists of a random forest classifier that combines sequence, structure, and dynamics-based features associated with a given amino acid variant and is trained over a comprehensive dataset of annotated human missense variants. Dynamical features include: mean-square fluctuations of the residue at the mutation site, which estimates local conformational flexibility; perturbation-response scanning effectiveness/sensitivity, accounting for potential allosteric responses involving the mutation site, and the mechanical stiffness at the sequence position of the mutated residue. These properties are computed from Elastic Network Models (ENM) representations of protein structures that describe inter-residue contact topology in a compact and computationally-efficient format that lends itself to a unique analytical solution for each structure. The algorithm was recently upgraded to include coevolutionary features calculated on conserved Pfam domains, and the training dataset was further expanded and refined. The latter combines annotated human variants from several publicly available datasets (Humvar, ExoVar, predictSNP, VariBench, SwissVar, Uniprot's Humsavar and ClinVar). All analyses were performed using the Rhapsody server http://rhapsody.csb.pitt.edu/ Molecular Dynamics Simulations Coarse grained Molecular Dynamics simulations were performed with the SIRAH force field (76) and GROMACS 2018.4 software package (77). We employed the consensus model of human apoA-I in its monomeric and lipid-free state (14). The PDB file was downloaded from Davidson Lab homepage (http://homepages.uc.edu/~davidswm/structures.html). Mapping atomic to coarse-grained representations was done with a Perl script included in SIRAH Tools (78). G26R, L60R, R173P and Δ107 mutants were generated with Chimera (79), editing the coordinates of the consensus model PDB file. For the case of the deletion mutant, we removed Lys107 and connected residues Lys106 and Trp108 with an unstructured segment using Modloop (80). Wild type apoA-I and the mutant systems were assembled using the following setup: The protein was placed inside an octahedron simulation box defined by setting a distance of 1.5 nm between the solute and the edges of the box. Systems were solvated setting a 150 mM NaCl concentration following the protocol (81). Energy minimization and heating steps were done following the protocol recommended (76) using positional restraints in the protein backbone to ensure side-chain relaxation, especially in the mutant models. Production runs were performed by quintuplicate in the absence of any positional restraint, generating 1 μs trajectories at 310 K using a 1 bar NPT ensemble. Structural analysis was performed with GROMACS tools gmx rmsf, gmx gyrate and gmx sasa. Root mean square fluctuation was calculated for each residue aligning the full trajectory apoA-I coordinates with the initial models. Radius of gyration and Solvent accessible surface areas (SASA) were obtained averaging the values corresponding to the last 0.1 μs of simulation. The SASA calculations were measured over three amyloid prone regions, comprising residues 14-19 (APR1), 53-58 (APR2) and 227-232 (APR3).

Code Availability
All Python packages used were installed through the Conda environment manager into a single environment. The workflow manager Snakemake was used in the evolutionary analysis in order to gain reproducibility and consistency of the results (82). All data, Snakefile and Python scripts used in this work are available at https://github.com/tomasMasson/APOA1_evolution .

Statistical Analyses and Visualizations
Scipy Python libraries were used for data manipulation and statistical analyses (83). Statistical significance was determined using Mann-Whitney U Test for variant's impact comparison and Student's Test for MD observables. MD graphs are reported as means ± standard deviation derived from five independent experiments. All visualizations were prepared with the Seaborn library (84).