Predicting functional effects of missense variants in voltage-gated sodium and calcium channels

A machine learning method can predict loss- versus gain-of-function effects of human genetic variants in disease-associated ion channels. Predicting ion channel variant phenotypes Ion channel variants have been associated with disease, predominantly neurological. Heyne et al. developed a tool to predict the functional effects of variants in disease-associated voltage-gated sodium and calcium ion channels using machine learning–based statistical models. Loss of function versus gain of function (LOF or GOF) was predicted separately from neutrality versus pathogenicity. Their model was trained to classify variant effects using protein sequences and structures containing missense variants with known or highly probable effects and validated against experimentally tested variants and in cohorts including individuals with epilepsy and autism. This work could have implications for ion channel and clinical genetics research. Malfunctions of voltage-gated sodium and calcium channels (encoded by SCNxA and CACNA1x family genes, respectively) have been associated with severe neurologic, psychiatric, cardiac, and other diseases. Altered channel activity is frequently grouped into gain or loss of ion channel function (GOF or LOF, respectively) that often corresponds not only to clinical disease manifestations but also to differences in drug response. Experimental studies of channel function are therefore important, but laborious and usually focus only on a few variants at a time. On the basis of known gene-disease mechanisms of 19 different diseases, we inferred LOF (n = 518) and GOF (n = 309) likely pathogenic variants from the disease phenotypes of variant carriers. By training a machine learning model on sequence- and structure-based features, we predicted LOF or GOF effects [area under the receiver operating characteristics curve (ROC) = 0.85] of likely pathogenic missense variants. Our LOF versus GOF prediction corresponded to molecular LOF versus GOF effects for 87 functionally tested variants in SCN1/2/8A and CACNA1I (ROC = 0.73) and was validated in exome-wide data from 21,703 cases and 128,957 controls. We showed respective regional clustering of inferred LOF and GOF nucleotide variants across the alignment of the entire gene family, suggesting shared pathomechanisms in the SCNxA/CACNA1x family genes.


INTRODUCTION
Voltage-gated sodium (Na v s) and calcium channels (Ca v s) play a critical role in initiating and propagating action potentials across a broad variety of excitable cells and physiological functions. Upon membrane depolarization, Na v s and Ca v s are activated and inactivated within milliseconds, leading to a transient influx of sodium or calcium ions into the cell (1). In humans, Na v s' channel proteins are encoded by 10 SCNxA genes (sodium channel protein type x subunit alpha) and Ca v s' channel proteins are encoded by 10 CACNA1x (voltage-dependent calcium channel subunit alpha-1) genes. Variants in genes encoding these channel proteins have been associated with multiple predominantly neurological and neurodevelopmental diseases. These diseases include developmental and epileptic encephalopathy (DEE) (SCN1A, SCN2A, SCN8A, CACNA1E and CACNA1A), episodic ataxia (CACNA1A), migraine (CACNA1A and SCN1A), autism spectrum disorder (ASD) (SCN2A and CACNA1C), and pain disorders (SCN9A, SCN10A, and SCN11A). Disorders affecting cardiac muscle (SCN5A CACNA1C), skeletal muscle (SCN4A and CACNA1S), or the retina (CACNA1F) have also been associated with variants in these gene families (Table 1). Pathogenic variants in these genes often contribute to severe early-onset disorders that are less frequently passed on to the next generation. This selective pressure is captured by the depletion of functional variants in those genes in the general population [median loss of function observed/expected upper bound fraction of 0.29 (2)]. Beyond rare diseases and high-penetrance variants, common variants at CACNA1x or SCNxA loci have also been associated with highly related common disease endpoints. For example, GWAS (genome-wide association studies) have identified genome-wide significant associations at loci including CACNA1C and CACNA1I for schizophrenia (3), SCN1A for epilepsy (4), and SCN10A and SCN5A for atrial fibrillation (5).
Phylogenetic analyses have found that Na v s and Ca v s share a common ancestral gene (6) and that they have previously been defined as one gene family (7). Na v s and Ca v s accordingly share a similar structure composed of four highly similar domains I, II, III, and IV, each consisting of six membrane-spanning segments S1 to S6 (1,(8)(9)(10)(11). These four domains come together, as a pseudoheterotetramer, to form a functional channel. In the center of the structure is the pore domain that is composed of S5 and S6 segments, surrounded by four voltage sensor domains formed by S1 and S4 segments. The general architecture of Na v s' and Ca v s' voltage-sensing and pore modules is comparable (8), and their function and structure have been extensively studied. Protein domains are associated with specific functions and diseases across channels (8,12,13). It has therefore previously been suggested that mutations in similar structural domains entail similar functional outcomes in Na v s (12)(13)(14), Ca v s (15), and Na v s and Ca v s (16). It has also been suggested that pathogenic variants in Na v s and Ca v s occur preferentially at functionally equivalent amino acids across the gene family alignment of Na v s and Ca v s (12,17). Although functional differences between channels exist, particularly among calcium channels (18), with only two amino acid changes, a sodium channel was experimentally turned calcium selective (19), thus illustrating some potential functional homology between Na v s and Ca v s. In addition, disease-associated missense variants are enriched at amino acid sites that are conserved across paralogs, (20) including in sodium and calcium channels (17,21). This further supports the hypothesis that similar biophysical pathomechanisms are involved across Na v s and Ca v s and therefore jointly analyzing them should increase statistical power to identify disease-associated protein features. Disease-associated biological features such as protein structure and conservation metrics have been successfully used in predicting pathogenic versus neutral effects of amino acid changes (22,23). In voltage-gated potassium channels, pathogenic variant prediction of only one gene, KCNQ1 (24), *We assume that pathogenic variants in SCN4A and CACNA1S predominantly cause HypoPP via reducing the main current resulting in LOF, in agreement with a clustering with LOF variants in other channels (Fig. 1B). However, we acknowledge that (114) showed that pathogenic variants at the gating charges of the voltage sensor can also lead to an additional gating pore or omega current leading to a partial GOF effect. See also (101). and the K v gene family (25) has been conducted with the aim of improving specificity of variant prediction in comparison to genomewide scores. There have also been attempts to predict functional readouts using electrophysiology data in SCN5A, with limited success potentially because of sparse training data (26). It has been a standard convention in genetics to differentiate between genetic variants that lead to reduced function and those that lead to altered or increased gene function (27,28). The labels loss of function (LOF) and gain of function (GOF) are frequently used for this purpose (27,29). Genetic variants that inactivate protein-coding genes by nonsense-mediated mRNA decay such as stop-gain, essential splice, or frameshift variants have by definition LOF effects. Missense variants, however, can have LOF or GOF effects. These functional alterations can be pathogenic, neutral (if effects are small or can be compensated), or rarely beneficial. Pathogenic missense variants in SCNxA/CACNA1x genes can lead to disease through various changes in channel properties. Such variants can, for example, affect the voltage dependence of steady-state activation or inactivation, the kinetics of the inactivation process or its recovery, ion selectivity, and other metrics that can be recorded in electrophysiological experiments (6). In a simplified disease context, these variants are usually classified molecularly as having either LOF or GOF effects, just as the genetic terminology, depending on whether the net ion flow is increased or decreased. However, a variant may change more than one of the properties described above, with potential opposite functional effects, such as slowing down the inactivation process (causing GOF) and simultaneously lowering protein expression (causing LOF). In such cases, it may be estimated which of the functional alterations dominates to define it either as a net GOF or LOF variant, but this may also be difficult to determine. Clear LOF or GOF effects in different genes are associated with specific channelopathies. For example, in SCN5A, LOF variants can cause Brugada syndrome, whereas GOF variants can lead to long QT syndrome (30). All such known gene-function-disease associations are displayed in Table 1. That a pathogenic variant has a LOF or GOF effect may therefore also be inferred from disease phenotypes. In multiple genes, however, including SCN2A (31) and SCN8A (32), phenotypic differences between LOF and GOF variants are not clear-cut or are not always present at the time of diagnosis. In addition, at scale, functional data are often sparse and this categorization therefore usually relies on genetic classifiers or variants' downstream effects (29). Therefore, although they are mostly in agreement, at times the genetic LOF/GOF terminology can contrast with actual molecular mechanisms. For example, dominant-negative mutations are thought to lead to LOF effects on the phenotype level but are categorized as GOF mutations on the molecular level (27). A recent example in SCN1A is the variant T226M that results in a biophysical GOF in turn leading to depolarization block in interneurons, a quasidominant-negative downstream LOF effect, in agreement with a severe epilepsy phenotype (33).

of 16
incorrect treatment that could have aggravating consequences [such as treatment with sodium channel blockers in individuals with LOF variants in SCN2A (35) or SCN1A (39)]. However, current variant prediction usually only focuses on whether a variant has a diseasecausing or neutral effect. We therefore introduce here machine learning-based statistical models that can classify variants in Na v s and Ca v s as LOF versus GOF or neutral versus pathogenic, thus providing a valuable resource for clinical genetics, gene discovery, and the experimental ion channel community.

Similar molecular mechanisms in different Na v s/Ca v s lead to LOF and GOF
Genetic variants in different Na V s/Ca V s lead to disease in diverse contexts. Comparing expression data (40) and associations of genes to phenotypes (41), we found that tissue-specific gene expression was correlated with tissue-associated phenotypes ( fig. S1). For example, pathogenic variants in SCN5A contributed to heart diseases (Brugada syndrome and long QT syndrome), and the SCN5A-encoded protein Na v 1.5 was predominantly expressed in heart tissue. Expression in different tissues and cell types could thus explain the clinically diverse disease spectrum of Na v s/Ca v s while allowing the possibility that similar alterations to protein structure cause heterogeneous diseases across different channels. We therefore gathered variants in SCNxA/CACNA1x genes in individuals with disease from different sources (table S1) (35,(42)(43)(44)(45)(46). We filtered these to 1521 likely pathogenic variants using the variant interpretation guideline by the American College of Medical Genetics and Genomics (47) where possible. Most diseases associated with Na v s/Ca v s are caused by either GOF or LOF of those ion channels (13,14,18,48). Thus, we inferred whether a likely pathogenic variant has a GOF or LOF effect from disease phenotypes on the basis of known gene-disease mechanisms. For example, in an individual with Brugada syndrome and a pathogenic variant in SCN5A, we assumed that the variant has a LOF effect, because it was previously described that most pathogenic SCN5A variants cause Brugada syndrome via a LOF mechanism (30,49). We screened the literature for such known gene-disease mechanisms (Table 1). Although we expect agreement in most cases, a LOF/GOF categorization based on the disease outcome can sometimes be different from the variant's molecular mechanism. An example is the variant p.K1422E in SCN2A (Na V 1.2) carried by an individual whose disease phenotype suggests that the variant has a LOF effect. The variant was however previously described as GOF electrophysiologically (50) which we could experimentally replicate. However, we also found that the p.K1422E variant carried a lower current density compared to wild type ( fig. S8) potentially explaining a net LOF effect. Applying this knowledge, of the 1521 likely pathogenic variants, we classified 518 variants as likely LOF and 309 variants as likely GOF in 12 different genes across 19 diseases. Eleven diseases had inferred LOF variants and 8 had inferred GOF variants.
We set out to ascertain whether variants with inferred LOF or GOF effects were clustered at corresponding amino acid sites in Na v s/Ca v s, because this would greatly boost our power to jointly analyze LOF and GOF variants of different Na v s/Ca v s. To compare variant location of different Na v s and Ca v s, we mapped variants on a combined gene family alignment of all 20 Na v /Ca v sequences. We then correlated variant densities between all 19 diseases (Kendall correlation; Fig. 1A). When variant densities of two diseases are significantly correlated, their variants are clustered at corresponding amino acid sites. We obtained 40 different variant density correlations between diseases. Thirty-seven of the 40 significant (P < 6 × 10 −5 ) correlations involved GOF variants of which 31 occurred between two diseases that were both inferred to be caused by GOF variants. This suggests that GOF variants are clustered at similar amino acid sites in different channels. We performed a principal component analysis to summarize all disease-disease correlations as measured by Kendall's τ. The first principal component (PC 1) perfectly separated diseases with inferred LOF from those with inferred GOF variants (Fig. 1B). This indicates regional clustering of LOF and GOF variants and thus shared mechanisms lead to LOF or GOF in different ion channels. We hence combined LOF and GOF variants of Na v s and Ca v s in further analyses.

Machine learning method predicts LOF versus GOF variant effects
We gathered 89 structure-based and sequence-based protein features putatively enriched for LOF versus GOF or neutral versus pathogenic effect variants. Structure-based annotations included protein secondary structure, protein's accessible surface area, and structural (for example, cytoplasmic or nuclear) and functional (for example, channel pore or selectivity filter) protein domains. Sequence-based features included conservation metrics across the 20 genes, physicochemical amino acid properties, and deleteriousness of amino acid changes like "missense badness" (51). These also included our own conservation score for which we estimated selection pressure on amino acids conditional on ancestry to account for the shared evolutionary history of paralog genes. We tested all binary protein features for an enrichment of inferred LOF (n = 518), GOF (n = 309), pathogenic (n = 1517), or neutral (n = 2328) variants with Fisher's exact tests (Fig. 2). Six of 9 structure-based features and 3 of 12 sequence-based features were enriched for functional entities such as the pore or selectivity filter for LOF and the S4-S5 linker helix or cytoplasm for GOF variants. In Fig. 3  We next sought to leverage these associations of protein features with variant effects to train a prediction tool that outputs the probability that a variant results in GOF or LOF. For this, we trained a machine learning model on all 89 protein features of all 518 LOF and 309 GOF variants (table S1). To assess the performance of our model, we set aside a test dataset of 82 randomly chosen variants before the modeling process. We measured performance with the following metrics: balanced accuracy (BA), Cohen's κ (kappa), Matthews correlation coefficient (MCC), area under the precisionrecall curve (prAUC), and receiver operating characteristics (ROC) curve. We aimed to maximize BA during model training. BA, kappa, MCC, ROC, and prAUC are performance metrics aiming to summarize a 2 × 2 contingency table of true-positive/true-negative and false-positive/false-negative predictions with a single number (52). The ROC curve was created by plotting the true-positive rate against the false-positive rate at various probabilities (  (22)(23)(24). Using this model, we ranked the relative influence of the 89 features on the prediction of LOF versus GOF effects (Fig. 4E). The top two features were GOF variant density features. The 10 most important GOF features also included three different amino acid hydrophobicity scores, three different conservation features, and the Grantham score.
We asked whether modeling variants in Na v s + Ca v s jointly improved variant prediction over modeling Na v s and Ca v s separately. When using only the 573 Na v variants during model training, prediction performance in Na v s was comparable to model training with all Na v s + Ca v s variants (BA 0. 79 Ca v s. These results suggest that the increased power obtained by combining Na v s and Ca v s outweighs the differences between these channel types.

Machine learning method predicts pathogenic versus neutral variant effects
We set out to predict whether a variant has a "neutral" or a potentially disease-causing ("pathogenic") effect using the same features, machine learning method, and variants as in the functional variant prediction. We used the 1517 likely pathogenic variants described above including the 825 variants with LOF/GOF annotations and 692 likely pathogenic variants for which we could not annotate with certainty whether they had LOF or GOF effects. We used 2328 variants in the Genome Aggregation Database (gnomAD) (2) in individuals Cytoplasm (297) S4-S5 linker helix (81) S4 voltage sensor, gating charges (54) Inactivation gate (34)* Pore (64) Selectivity filter (45) Selectivity filter, DEKA motif (5)* Gating break (7) . S7). We then added 12 functionally tested variants in CACNA1I (described below) that were predicted as pathogenic by our method and restricted the data to outcomes of LOF or GOF. Our model predicted the resulting 87 electrophysiological experiments with BA 0.73, ROC 0.73, and MCC 0.45 (Fig. 4, B and D; permutation P < 1 × 10 −4 ).
When subsetting to 57 variants in SCN1/2/8A that either fulfilled our phenotype/pathogenicity criteria of being included in our functional variant prediction training data or were associated with a  (31)). In each domain, transmembrane segments S1 to S6 are labeled with 1 to 6. S5 and S6 form the channel pore and S4 contains the voltage sensor that is labeled with "+" to illustrate the positive gating charges. The * at site 1151 refers to a cluster of GOF variants in CACNA1C in individuals with long QT syndrome; ** at site 1882 refers to a cluster of GOF variants in SCN2/8A. Variants with minor allele frequency (MAF) > 10 −4 in gnomAD (individuals with neuropsychiatric diseases excluded) (2) were selected as neutral variants. Variants were inferred to be LOF or GOF from disease phenotypes (see Table 1).
severe phenotype, our model predicted the variants with BA 0.77, ROC 0.80, and MCC 0.54. All five variants with neutral effects were predicted to be neutral by our pathogenicity prediction method, significantly more than other functionally tested variants (Fisher's exact test; P = 0.002; odds ratio = Inf; 95% confidence interval 2.4-Inf). Our functional validation data also included 11 SCN2A variants in individuals where age of seizure onset was unavailable or outside of the cutoffs that we used to infer GOF/ LOF. We correctly predicted 9/11 of them, emphasizing the benefit of our functional variant prediction when using phenotype as a proxy for variant function is unreliable.
We also tested our prediction on electrophysiology experiments of 50 variants in CACNA1I present in 12,332 individuals with and without psychiatric disease (table S3) (56). The functionally tested variants were present at different population frequencies. However, common variants are unlikely to have strong pathogenic effects-despite considerable efforts in GWAS, exome chip, and exome sequencing, no common strong acting variants have been identified, consistent with the fact that these would not be permitted by the strong selection against schizophrenia (57). Because our LOF-GOF prediction is trained and should hence only be used on likely pathogenic variants, we sought to first predict whether variants were likely pathogenic or neutral using our own method described above. Variants that were more rare were more likely to be predicted pathogenic despite variant frequency not being a component of the model [Spearman rank correlation between minor allele frequency (MAF) in the population cohort gnomAD (2) and pathogenic prediction, ρ = −0.60, P = 3.3 × 10 −6 ; Fig. 5A]. We found that whether a variant was predicted to be pathogenic correlated with whether a variant had a functional effect when considering variants that were present in only one individual (BA 0.75, ROC 0.77, and MCC 0.44). There was, however, no association of pathogenicity and functional effect in 19 variants that were present in >10 individuals in gnomAD (BA 0.46, ROC 0.36, and MCC −0.14). That is consistent with the abovementioned statement that common variants should have no strong pathogenic effects suggesting that functional effects found at higher variant frequencies were likely milder or not disease causing. Given that a variant is pathogenic, we predicted its functional effect (LOF or GOF) with BA 0.83, ROC 0.78, and MCC 0.58 ( Fig. 5B and table S4). We then combined the z scores of four electrophysiology parameters to investigate how well we could predict variants with different magnitudes of functional effects. First, pathogenicity probability positively correlated with the combined experimental z score in variants with functional effects (Spearman correlation; P = 0.02; ρ = 0.43; Fig. 5C). In a logistic regression model, we also found that the strength of the functional effect (combined experimental z score) influenced whether funNCion correctly predicted LOF or GOF functional effects (coefficient, 0.29; P = 0.02; Fig. 5D). Accordingly, when only analyzing the 10 variants with a combined z score of four experimental parameters ≥16, we predicted the correct functional effect (LOF or GOF) with BA 0.94, ROC 0.89, and MCC 0.67. Together, these results suggest that our disease phenotype-based model predicting LOF versus GOF effects largely corresponds to LOF versus GOF in functionally tested variants, with an increased performance in variants with larger functional effects and variants that are more likely pathogenic.

Validation of funNCion with large datasets of population controls and neuropsychiatric diseases
We first predicted functional and pathogenicity effects of missense variants in 114,704 individuals without severe pediatric and neurological disorders in gnomAD (2). We set out to test which factors predict a variant's probability to be pathogenic using linear regres-sion. The most significant predictor was −log 10 MAF in gnomAD (P = 2 × 10 −74 ; coefficient, −0.07); in other words, pathogenic variants were at significantly lower frequencies in gnomAD. This is expected, because selection should not allow deleterious variants to rise to high population frequencies (57); see CACNA1I in previous paragraph. We also observed this in individual genes (Bonferronicorrected P < 0.0025 for eight CACNA1x and five SCNxA genes, P < 0.05 for three genes, and P > 0.05 for SCN7A, SCN10A, SCN11A, and CACNA1F; see Fig. 6A). A positive predictor of variant pathogenicity was a gene's LOEUF [loss of function observed/expected upper-bound fraction (2); P = 2 × 10 −67 ; coefficient, 0.21]. A low LOEUF value means that the respective gene has significantly fewer protein-truncating variants here labeled "loss of function" variants because they have, by definition, a LOF effect, in gnomAD than expected. The equivalent value for missense variants (here termed "MOEUF") was also significant (P = 1 × 10 −4 ; coefficient, 0.06). It is again expected that genes that are most intolerant to functional variants would harbor mostly neutral rather than pathogenic missense variants in a cohort of primarily healthy individuals. LOEUF being more strongly associated with pathogenicity than MOEUF suggests that Na v s/Ca v s may be generally more intolerant to LOF (including LOF missense and truncating) variants than GOF variants.

of 16
in line with the notion that most Na v s/Ca v s, but in particular those with a lower tolerance for protein-truncating variants, tolerate LOF missense variants less than GOF missense variants. In contrast, genes with particularly low tolerance for missense variants harbored fewer GOF than LOF variants in gnomAD (Fig. 6B). We found an association of pathogenicity and GOF probability in all individual genes (P < 0.0025 corrected for 20 tests) except SCN2A, SCN8A, CACNA1A, CACNA1B, CACNA1C, CACNA1D, and CACNA1E. Fittingly, all of these except CACNA1B are implicated in severe GOF disorders, and SCN8A, SCN2A, CACNA1C, and CACNA1E had the  (2)]. Nominally significant associations (P < 0.05) are colored in orange (Bonferroni P value threshold of 7 × 10 −5 ). Horizontal bars show 95% confidence intervals of the odds ratio point estimates that are log 10 -transformed and cut at −1.7 and 1.7 for clarity. Odds ratios >1.7 or <−1.7 are shown as arrows.
lowest MOEUF values of all Na v s/Ca v s. Overall, these biologically meaningful results validate our method.
To investigate whether our functional variant prediction could replicate known disease associations and mechanisms, we tested the algorithm on large datasets of individuals with and without diseases. We compared numbers of ultrarare missense variants with Fisher's exact tests between 9170 individuals with and 8436 individuals without epilepsy [of which 6860 were also part of gnomAD (2)] from the Epi25 Collaborative (58), de novo variants in 4186 individuals with and 2179 without ASD from the Autism Sequencing Consortium (ASC) (59); and 8347 individuals with ASD or attention deficit hyperactivity disorder (ADHD) to 5214 controls from the Danish bloodspot cohort (DBS)/iPSYCH consortium (60). We found an enrichment of pathogenic LOF, but not GOF missense variants in genes, where protein-truncating variants are known to cause specific diseases. These included 29 LOF in SCN1A (61) in several nonlesional epilepsies (associated with DEE and febrile seizures with P values < Bonferroni threshold 7 × 10 −5 ) and 14 LOF in SCN2A in 5252 cases of autism with intellectual disability (31) (see Fig. 6C and table S5). CACNA1G, a recent candidate (58) for genetic generalized epilepsy (n = 3108), was also enriched for LOF missense but not GOF variants, and combining 3 LOF missense with 2 protein-truncating variants improved disease association to P = 1 × 10 −3 . In contrast, only missense variants are associated with DEE in SCN8A and CACNA1E, which were accordingly only enriched for two GOF missense variants in DEE (P = 0.01 and 0.03, respectively). We can also nominate CACNA1B as a potential candidate gene for genetic generalized epilepsy. It was enriched for six missense LOF variants (P = 2 × 10 −3 ) with an overall missense signal of P = 7 × 10 −4 . Further, biallelic protein-truncating variants in CACNA1B have recently been implicated in a severe epilepsy syndrome (62). It would therefore be plausible that heterozygous LOF in CACNA1B may lead to a milder epilepsy phenotype.

DISCUSSION
Tailoring treatment to individual patients' genetic variants has made substantial progress in many fields of medicine in recent years (63). Studying ion channel variants' functional effects in with electrophysiology experiments has enabled development of precision therapies, often accelerated by repurposing existing drugs (48,64,65). These functional studies require considerable effort and expertise and therefore usually focus on few variants. In Na v s and Ca v s, multiple precision medicine approaches have been described (34,35,(66)(67)(68); however, their success is dependent on the type of functional changes of pathogenic variants. Here, we present a method that predicts LOF versus GOF effects in likely pathogenic variants in SCNxA and CACNA1x genes-applicable across a range of diverse diseases that can be caused by pathogenic variants in Na V s and Ca V s.
In our study, we inferred LOF and GOF effects of genetic variants from disease phenotypes without functionally testing them. This poses several challenges. Phenotypes are ascertainment biased, and there is often variable expressivity of the same variant in multiple individuals. We therefore carefully curated our data to include only clearly distinguishable LOF-or GOF-associated disease phenotypes. Our control dataset was ascertained to include no individuals with severe childhood onset or neuropsychiatric disease. Because these datasets are large, few disease-associated variants may still erroneously be present. Although few variants may still be miscate-gorized, in silico validation in multiple large population and disease cohorts and performance similar to popular variant prediction tools suggest that our approach is generally meaningful. We further showed that our LOF/GOF prediction largely corresponded to molecular LOF/GOF classes based on electrophysiological experiments. As mentioned previously, there exist a few cases where the molecular LOF/GOF categorization differs by definition from the disease-based LOF/GOF categorization. For example, dominantnegative mutations are molecularly GOF but the downstream effects result in a LOF phenotype (27,33). Although only experiments can lead to functional insight, any experimental setup constitutes a model system. Hence, a variant may have a functional effect in a laboratory setting that may not always translate to a pathophysiological effect on the organismal level. A model that uses disease phenotypes as a quasi-functional readout considers these additional layers of complexity that in vitro systems are not able to reproduce. This is further illustrated by our data for CACNA1I, where pathogenicity prediction correlated with functional effects only for rare variants. This is expected, because natural selection should prevent deleterious variants from rising in population frequency.
As an example of contradictory phenotypic and functional interpretation, we highlight the selectivity filter domain of the channel proteins. In this region, 42 of 43 likely pathogenic variants were in individuals with LOF phenotypes, including the DEKA motif in Na v s that conveys selectivity for Na + ions (50). However, there are examples of GOF effects in electrophysiology experiments. The p.G1662S variant in SCN10A encoding Na V 1.8 was implicated in small fiber neuropathy and showed GOF functionally (69). However, this variant was found at a frequency of 0.0014 in the gnomAD population database, including in four homozygous individuals, and was therefore rated benign by two independent laboratories in ClinVar. Thus, the variant's functional changes are unlikely to contribute to disease. The second example is the variant p.K1422E in SCN2A carried by an individual with DEE who was 13 months old at the onset of seizures, thus corresponding to a LOF disease phenotype. In previous studies, the variant rendered the channel much elevated permeability to divalent cations like Ba 2+ and Ca 2+ , whereas selectivity for Na + was significantly reduced (50). We could experimentally replicate that the variant acted as a GOF electrophysiologically in terms of permeability to Ca 2+ . However, we also found that the Na v 1.2 p.K1422E variant carried a lower overall current density compared to Na v 1.2 wild type and the Na v 1.2 p.T1420M isogenic-variant stable cell line. The current density reduction observed may reflect biological defects in either forward trafficking, reduced single-channel conductance, increased permeability to outward Na + /Ca 2+ current, or enhanced endocytosis/ degradation. The apparent LOF effect in current density may override the GOF effects in Ca 2+ influx, thus explaining the overall LOF disease phenotype. These effects would be difficult to properly evaluate in transient expression systems, illustrating the difficulty of experimentally modeling those complex proteins.
Our approach has several limitations. We acknowledge that the classification of variant effects into LOF and GOF oversimplifies complex electrophysiological mechanisms, even if frequently done in the literature. In SCN9A, for example, two different types of GOF mechanisms impairing channel activation and inactivation have been shown to lead to two different diseases: erythromelalgia and paroxysmal pain syndrome, respectively (70). As mentioned earlier, some variants also have mixed effects (53,55) that sometimes correspond to clinical phenotypes with overlapping symptoms (30). We have currently too little power to include such variants in our model. With large-scale experimental electrophysiology data, it may be possible to expand or subdivide the GOF and LOF categories or introduce more quantitative GOF/LOF scoring systems for predictions in the future. Further, we analyzed more functional variant and experimental validation data for Na v s. Therefore, predictions in Na v s should be more reliable than in Ca v s. Also, our model was trained on likely pathogenic variants in mostly severe diseases. It remains to be properly validated in individuals with milder diseases with potentially milder variant effects.
Our results gain important insights into which functional protein domains and motifs in Na v s and Ca v s are associated with inferred GOF or LOF effects in 825 curated likely pathogenic variants. This may provide a valuable resource for experimental follow-up studies to potentially identify mechanistically important sites and drug targets. We can also confirm associations of specific amino acid sites with GOF or LOF effects across diseases with high statistical power that have thus far been shown mechanistically (50,71,72) or descriptively (12)(13)(14), often with less rigorously curated disease variants.
As a positive control, we recapitulated known structure-function associations, such as that pathogenic variants are enriched in transmembrane segments and in functionally important domains like the channel pore or inactivation machinery (8,12). As mentioned above, LOF variants were clearly associated with the ion conduction structural motifs, including the selectivity filter (50) of the pore domain (S5-S6 segments) as previously hypothesized (14). We confirmed that the structural motifs associated with the inactivation process (8,71,73) as well as the S4-S5 linker helix (72) were associated with GOF variants. The S4-S5 linker helix was previously implicated in GOF (14), for example, in pain disorders caused by variants in SCN9A (70) and in DEE caused by variants in CACNA1E (45) and SCN2A (31). We can now confirm a statistically significant association of GOF effects with the S4-S5 linker using data of nine different CACNA1x/SCNxA diseases. Worth noting is a slight extension of the GOF variant cluster beyond the linker helices toward the start of S5 consistently across the four transmembrane domains. Another notable takeaway is that GOF and LOF variants were not equally associated with transmembrane segments S1 to S6 at the four different transmembrane domains I to IV. This corroborates previous findings that different domains in Ca v s and Na v s have an overall structural similarity but a different contribution to the channel functioning (74)(75)(76). Last, we observed an accumulation of Na v and Ca v GOF variants in the cytoplasmic part downstream of each transmembrane segment S6. Exploring this further may yield mechanical insights.
We highlighted an accumulation of four likely pathogenic variants in CACNA1C encoding Cav1.2 in individuals with long QT syndrome (transcript: ENST00000347598; variants: p.P857L, p.P857R, p.R858H, and p.R860G). Two of these variants were previously functionally investigated. Peak calcium currents were larger in mutant channels than those of wild type for p.R858H (77) and p.P857R (78). One study also identified increased surface membrane expression of the channel compared to wild type (78). The authors found that those variants overlapped with the so-called PEST domain (proline, glutamic acid, serine, and threonine) that is involved in protein degradation signaling and lead to increased numbers of Cav1.2 channels at the cell membrane. This domain as well as the cluster of GOF variants are not present in other Ca v s or Na v s, pointing to a distinct GOF mechanism in Ca v 1.2.
We reported a GOF variant cluster of nine likely pathogenic SCNxA variants (genes SCN2A, ~4A, and ~8A). When mapped onto SCN2A, they are located in the C-terminal domain at amino acid sites 1875 to 1887 that is in close proximity to a FGF (fibroblast growth factor)/FHF (FGF homologous factor) binding site of a calmodulin (CaM)-FGF complex also present in Na v 1.4 and Na v 1.5 (79). FHF1 to FHF4 interact with the C-terminal domain of Na v s to modulate the channels' fast and long-term inactivation (80). One of these variants, p.R1882Q in SCN2A, also showed a slower time course of inactivation (81). Further, de novo GOF variants in FHF1 have been associated with DEE (82) and variants in FHF2 with generalized epilepsy with febrile seizures plus (GEFS+). The C-terminal lobe of the CaM-FGF complex interacts with the conserved IQ motif of helix -VI of the C terminus of all Na v channels (79), suggesting that it may serve as an anchor for the control of activation of the channels by CaM. In contrast to the FHF binding site, the I of the potential "IQ motif" overlaps with two LOF variants in SCN1A. These observations could yield starting points for hypotheses about this interaction.
We also identified secondary structural protein features associated with LOF and GOF variants. As expected (83), LOF variants are more likely to be buried in the protein where they can potentially disrupt protein stability, so the probability that an amino acid is buried becomes a predictive feature in the machine learning model. Features related to amino acid properties, particularly hydrophobicity and deleteriousness, also contributed to the LOF versus GOF prediction, in agreement with previous studies (13,33,84,85).
There exist generally more LOF than GOF variants. For SCN2A, a recent study estimates the incidence of LOF cases to be about fivefold higher than GOF cases (31). The most important reason for this is likely that GOF can be achieved at fewer sites across the genes than LOF, although other factors like frequency of genetic testing, variant penetrance, and expressivity also play a role. The observation that GOF variants can be more easily identified by their location than LOF variants is also indicated by the fact that the two top predictors of LOF/GOF are GOF variant density features.
We introduced a potentially powerful approach to predict the directionality of likely pathogenic missense variants in SCNxA/CACNA1x genes. In a clinical setting like SCN2A-or SCN8A-related DEE, treatment decisions must often be made before functional studies of disease-causing variants can be done. In the future, our prediction method could be adapted and benchmarked for use in conjunction with best current clinical practices, for example, to predict which individuals with pathogenic variants may be likely to benefit from a particular treatment based on their variants' LOF or GOF effects. Our method could potentially be refined with large-scale experimental data, for example, by introducing more types of predictions than mere LOF and GOF. Because most SCNxA/CACNA1x genes are depleted for functional variants in the general population, it is likely that more SCNxA/CACNA1x genes could contribute to disease for which disease associations or mechanisms have not yet been elucidated and to which our method could potentially be applied. Last, our study introduces disease phenotype-based functional variant prediction that can also be used in other genes or gene families.

Study design
In this study, we developed a statistical model that predicts GOF versus LOF effects of genetic variants in Na V s and Ca V s (corresponding to SCNxA and CACNA1x genes, respectively). We trained a machine learning model on sequence-and structure-based protein features with (likely) pathogenic missense variants in SCNxA/CACNA1x genes with probable LOF (n = 518) and GOF (n = 309) effects that were inferred from disease phenotypes of variant carriers based on known gene-disease mechanisms. We tested the model on 82 variants randomly chosen from the training data before model training. We then validated the GOF versus LOF prediction on 87 functionally tested variants in SCN1/2/8A and CACNA1I and in exome-wide data from the general population (gnomAD, n = 114,704) (2); individuals with epilepsy (n = 9170) (58), autism (n = 4186) (59), and ASD or ADHD (n = 8347) (60); and 15,829 control individuals of which 6860 overlapped with the 114,704 individuals from gnomAD (2). In all validation steps, testing data were always excluded from training data. Further method details are available in the Supplementary Materials.

Statistical analysis
Statistical analyses were done with the R and the C programming languages. We used the R package caret (86) for most machine learning-related functions and packages ggplot2 (87) and plotROC (88) for plotting. Methods for determining significance of association (Fisher's exact test, Kendall and Spearman correlation, and logistic and linear regression) and multiple testing correction (Bonferroni) were used as indicated in the manuscript. We used the conventional P value threshold of 0.05 in all analyses with the exception of the analysis "Clustering of inferred GOF or LOF variants in different genes" (see Fig. 1) where we used a more stringent P value threshold of 0.01.
To compare variant location between all Na v s and Ca v s for clustering of inferred LOF and GOF variants, we mapped the amino acid sites on a combined gene family alignment of all 20 Na v /Ca v sequences. We removed alignment gaps obtaining 1268 amino acid sites mappable to all sodium and calcium channels [61% of their canonical isoform length of 2064 amino acids ±222 (means ± SD)]. Seven hundred twenty-six of all 825 LOF/GOF variants could be mapped onto the 1268 family-aligned amino acid sites. We then counted the LOF or GOF variant density on the 1268 family-aligned sites in sliding windows of three amino acids hereby considering LOF or GOF effects of variants' directly neighboring amino acid sites.
For machine learning-based prediction of GOF versus LOF and pathogenic versus neutral variant effects, we used a table of all 89 protein features by all 825 variants (table S1) to train a prediction tool that outputs the probability that a variant results in GOF or LOF. We used the R package caret's train function to evaluate, using a 10-fold cross-validation resampling, the effect of model tuning parameters on performance and to choose the optimal parameters for the final model. Further details on statistical modeling and electrophysiology experiments are available in Supplementary Materials and Methods.