Mutation-specific pathophysiological mechanisms define different neurodevelopmental disorders associated with SATB1 dysfunction

Whereas large-scale statistical analyses can robustly identify disease-gene relationships, they do not accurately capture genotype-phenotype correlations or disease mechanisms. We use multiple lines of independent evidence to show that different variant types in a single gene, SATB1, cause clinically overlapping but distinct neurodevelopmental disorders. Clinical evaluation of 42 individuals carrying SATB1 variants identified overt genotype-phenotype relationships, associated with different pathophysiological mechanisms, established by functional assays. Missense variants in the CUT1 and CUT2 DNA-binding domains result in stronger chromatin binding, increased transcriptional repression and a severe phenotype. Contrastingly, variants predicted to result in haploinsufficiency are associated with a milder clinical presentation. A similarly mild phenotype is observed for individuals with premature protein truncating variants that escape nonsense-mediated decay and encode truncated proteins, which are transcriptionally active but mislocalized in the cell. Our results suggest that in-depth mutation-specific genotype-phenotype studies are essential to capture full disease complexity and to explain phenotypic variability.


Abstract
Whereas large-scale statistical analyses can robustly identify disease-gene relationships, they do not accurately capture genotype-phenotype correlations or disease mechanisms. We use multiple lines of independent evidence to show that different variant types in a single gene, SATB1, cause clinically overlapping but distinct neurodevelopmental disorders. Clinical evaluation of 42 individuals carrying SATB1 variants identified overt genotype-phenotype relationships, associated with different pathophysiological mechanisms, established by functional assays. Missense variants in the CUT1 and CUT2 DNA-binding domains result in stronger chromatin binding, increased transcriptional repression and a severe phenotype. Contrastingly, variants predicted to result in haploinsufficiency are associated with a milder clinical presentation. A similarly mild phenotype is observed for individuals with premature protein truncating variants that escape nonsense-mediated decay and encode truncated proteins, which are transcriptionally active but mislocalized in the cell. Our results suggest that in-depth mutation-specific genotype-phenotype studies are essential to capture full disease complexity and to explain phenotypic variability.

Main text
SATB1 encodes a dimeric/tetrameric transcription factor 1 with crucial roles in development and maturation of T-cells [2][3][4] . Recently, a potential contribution of SATB1 to brain development was suggested by statistically significant enrichment of de novo variants in two large neurodevelopmental disorder (NDD) cohorts 5; 6 , although its function in the central nervous system is poorly characterized.
We performed functional analyses assessing consequences of different types of SATB1 variants for cellular localization, transcriptional activity, overall chromatin binding, and dimerization capacity. Based on protein modeling (Figure 2, Suppl. Notes), we selected five missense variants in CUT1 and CUT2 affecting residues that interact with, or are close to, the DNA backbone (mosaic variant p.E407G and de novo variants p.Q420R, p.E530K/p.E530Q, p.E547K), as well as the only homeobox domain variant (p.L682V, de novo). As controls, we selected three rare missense variants from the UK10K consortium, identified in healthy individuals with a normal IQ: p.S366L (gnomAD allele frequency 6.61e-4), p.V519L (8.67e-6) and p.A573T (1.17e-4) ( Figure 1A, Suppl. Table 3) 16 . When overexpressed as YFP-fusion proteins in HEK293T/17 cells, wildtype SATB1 localized to the nucleus in a granular pattern, with an intensity profile inverse to the DNA-binding dye Hoechst 33342 ( Figure 3A-B). In contrast to wildtype and UK10K control missense variants, the p.E407G, p.Q420R, p.E530K/p.E530Q and p.E547K variants displayed a cage-like clustered nuclear pattern, strongly co-localizing with the DNA (Figure 3A-B, Suppl. Figure 6).
To assess the effects of SATB1 missense variants on transrepressive activity, we used a luciferase reporter system with two previously established downstream targets of SATB1, the IL2-promoter and IgH-MAR (matrix associated region) [17][18][19] . All five functionally assessed CUT1 and CUT2 missense variants demonstrated increased transcriptional repression of the IL2-promoter, while the UK10K control variants did not differ from wildtype ( Figure 3C). In assays using IgH-MAR, increased repression was seen for both CUT1 variants, and for one of the CUT2 variants ( Figure 3C). Taken together, these data suggest that etiological SATB1 missense variants in CUT1 and CUT2 lead to stronger binding of the transcription factor to its targets.
To study whether SATB1 missense variants affect the dynamics of chromatin binding more globally, we employed fluorescent recovery after photobleaching (FRAP) assays. Consistent with the luciferase reporter assays, all CUT1 and CUT2 missense variants, but not the UK10K control variants, affected protein mobility in the nucleus. The CUT1 and CUT2 variants demonstrated increased halftimes and reduced maximum recovery, suggesting stabilization of SATB1 chromatin binding ( Figure 3D).
In contrast to the CUT1 and CUT2 missense variants, the homeobox variant p.L682V did not show functional differences from wildtype ( Figure 3A-D, Suppl. Figure 6), suggesting that, although it is absent from gnomAD, highly intolerant to variation and evolutionarily conserved (Suppl. Figure 2, Suppl. Figure 7A-B), this variant is unlikely to be pathogenic. This conclusion is further supported by the presence of a valine residue at the equivalent position in multiple homologous homeobox domains (Suppl. Figure 7C). Additionally, the mild phenotypic features in this individual (individual 42) can be fully explained by an out-of-frame de novo intragenic duplication of FOXP2, known to cause disease through haploinsufficiency 20 .
We went on to assess the impact of the CUT1 and CUT2 missense variants (p.E407G, p.Q420R, p.E530K, p.E547K) on protein interaction capacities using bioluminescence resonance energy transfer (BRET). All tested variants retained the ability to interact with wildtype SATB1 ( Figure 3E), with the potential to yield dominant-negative dimers/tetramers in vivo and to disturb normal activity of the wildtype protein.
The identification of SATB1 deletions suggests that haploinsufficiency is a second underlying disease mechanism. This is supported by the constraint of SATB1 against loss-offunction variation, and the identification of PTV carriers that are clinically distinct from individuals with missense variants. PTVs are found throughout the locus and several are predicted to undergo NMD by in silico models of NMD efficacy (Suppl. Table 4) 21 . In contrast to these predictions, we found that one of the PTVs, p.R410*, escapes NMD (Suppl. Figure  8A-B). However, the p.R410* variant would lack critical functional domains (CUT1, CUT2, homeobox) and indeed showed reduced transcriptional activity in luciferase reporter assays when compared to wildtype protein (Suppl. Figure 8), consistent with the haploinsufficiency model.
Four unique PTVs that we identified were located within the final exon of SATB1 ( Figure  1A) and predicted to escape NMD (Suppl. Table 4). Following experimental validation of NMD escape ( Figure 4A-B), three such variants (p.P626Hfs*81, p.Q694* and p.N736Ifs*8) were assessed with the same functional assays that we used for missense variants. When overexpressed as YFP-fusion proteins, the tested variants showed altered subcellular localization, forming nuclear puncta or (nuclear) aggregates, different from patterns observed for missense variants ( Figure 4C, Suppl. Figure 9A-B). In luciferase reporter assays, the p.P626Hfs*81 variant showed increased repression of both the IL2-promoter and IgH-MAR, whereas p.Q694* only showed reduced repression of IgH-MAR ( Figure 4D). The p.N736Ifs*8 variant showed repression comparable to that of wildtype protein for both targets ( Figure 4D). In further pursuit of pathophysiological mechanisms, we tested protein stability and SUMOylation, as the previously described p.K744 SUMOylation site is missing in all assessed NMD-escaping truncated proteins ( Figure 4A) 22 . Our observations suggest the existence of multiple SATB1 SUMOylation sites (Suppl. Figure 10) and no effect of NMD-escaping variants on SUMOylation of the encoded proteins (Suppl. Figure 10) nor any changes in protein stability (Suppl. Figure 9C). Although functional assays with NMD-escaping PTVs hint towards additional disease mechanisms, HPO-based phenotypic analysis could not confirm a third distinct clinical entity (p=0.932; Suppl. Figure 4E, Suppl. Table 5).
Our study demonstrates that while statistical analyses 5; 6 can provide the first step towards identification of new NDDs, a mutation-specific functional follow-up is required to gain insight into the underlying mechanisms and to understand phenotypic differences within patient cohorts. Multiple mechanisms and/or more complex genotype-phenotype correlations are increasingly appreciated in newly described NDDs, such as those associated with RAC1, POL2RA, KMT2E and PPP2CA [23][24][25][26] . Interestingly, although less often explored, such mechanistic complexity might also underlie well-known (clinically recognizable) NDDs. For instance, a CUT1 missense variant in SATB2, a paralog of SATB1 that causes Glass syndrome through haploinsufficiency (MIM #612313) 27 , affects protein localization and nuclear mobility in a similar manner as the corresponding SATB1 missense variants (Suppl. Figure 11, Suppl. Figure 12) 28 . Taken together, these observations suggest that mutation-specific mechanisms await discovery both for new and well-established clinical syndromes.
In summary, we demonstrate that at least two different previously uncharacterized NDDs are caused by distinct classes of rare (de novo) variation at a single locus. We combined clinical investigation, in silico models and cellular assays to characterize the phenotypic consequences and functional impacts of a large patient series uncovering distinct pathophysiological mechanisms of the SATB1-associated NDDs. This level of combined analyses is recommended for known and yet undiscovered NDDs to fully understand disease etiology.    p.N736Ifs*8

G A T A A A A T A C T A A C C T A C T A A C A G A T A A A A T A C T A A C C T A C T A A C A SATB1
A Homeobox p.Q694*

Individuals and consent
For all individuals reported in this study, informed consent was obtained to publish unidentifiable data. When applicable, specific consent was obtained for publication of clinical photographs and inclusion of photographs in facial analysis. All consent procedures are in accordance with both the local ethical guidelines of the participating centers, and the Declaration of Helsinki. Individuals with possible (likely) pathogenic SATB1 variants were identified through international collaborations facilitated by MatchMakerExchange 7 , GPAP of RD-connect 8 , the Solve-RD consortium, the Decipher Database 9 , and through searching literature for cohort-studies for NDD 5; 6 . Individuals 27 and 28 were previously described in a clinical case report 29 . Clinical characterization was performed by reviewing the medical files and/or revising the phenotype of the individuals in the clinic. All (affected) individuals with a SATB1 variant are included in Suppl. Table 1. A summary of clinical characteristics is provided in Table 1, including 38 of 42 individuals: individual 16, 32 and 41 were excluded because no clinical data were available, individual 22 was excluded as she is (low) mosaic for the SATB1 variant (~1%). In Figure 1G, 37 of 42 individuals were included: in addition to individuals 16, 22, 32, and 41, we also excluded individual 18, for whom only very limited clinical information was available.

Next generation sequencing
For all individuals except individual 1, 2, and 28, SATB1 variants were identified by whole exome sequencing after variant filtering as previously described 11; 30-35 . Information on inheritance was obtained after parental confirmation, either from parental exome sequencing data or through targeted Sanger sequencing. For individual 1 the SATB1 variant was identified by array-CGH and for individual 2 an Affymetrix Cytoscan HD array was performed in addition to whole exome sequencing. For individual 28 targeted Sanger sequencing was performed after identification of the variant in his similarly affected sister. To predict deleteriousness of variants, CADD-PHRED V1.4 scores and SpliceAI scores (VCFv4.2; dated 20191004) were obtained for all variants identified in affected individuals 36; 37 . In addition, for all nonsense, frameshift and splice site variants, NMDetective scores were obtained (v2) 21 . For all missense variants, we analysed the mutation tolerance of the site of the affected residue using Metadome 38 .

UK10K controls for functional assays
Genome sequence data from 1,867 ALSPAC 39; 40 individuals in the UK10K 16 dataset were annotated in ANNOVAR 41 and filtered to identify individuals carrying rare coding variants (gnomAD genome_ALL frequency<0.1%) within SATB1. In total six rare variants were identified. These variants were carried by 13 individuals, all in a heterozygous state. Three variants (one in the CUT1 domain, one in the CUT2 domain and one outside of critical domains) were selected for functional studies. These variants were carried by nine individuals. Phenotypic data of carriers and non-carriers were available through the ALSPAC cohort, an epidemiological study of pregnant women who were resident in Avon, UK with expected dates of delivery 1st April 1991 to 31st December 1992. This dataset included 13,988 children who were alive at 1 year of age, 1,867 of whom underwent genome sequencing as part of the UK10K project. Of the UK10K individuals, 1,741 children had measures of IQ (WISC) collected at age 8 years providing an indication of cognitive development. The ALSPAC study website contains details of all the data that is available through a fully searchable data dictionary and variable search tool (http://www.bristol.ac.uk/alspac/researchers/our-data/)

Human Phenotype Ontology (HPO)-based phenotype clustering analysis
All clinical data were standardized using HPO terminology 14 . Thirty-eight of 42 individuals were included in analysis: individual 16, 32 and 41 were excluded because no clinical data were available, individual 22 was excluded as she is (low) mosaic for the SATB1 variant (~1%). The semantic similarity between all the HPO terms used in this cohort (356 features) was calculated using the Wang algorithm in the HPOSim package 42; 43 in R. HPO terms with at least a 0.5 similarity score were grouped (Suppl. Figure 5): a new feature was created as a replacement, which was the sum of the grouped features. For eleven terms, the HPO semantic similarity could not be calculated using HPOSim. Seven of those could be manually assigned to a group, since the feature clearly matched (for instance: nocturnal seizures with the seizure/epilepsy group). For a full list of the grouped features, see Suppl. Table 6. HPO terms that could not be grouped were added as separate features, as was severity of intellectual disability. This led to 100 features for every individual, instead of the previous 356 separate HPO terms. To quantify the possible genotype/phenotype correlation in the cohort, we used Partitioning Around Medoids (PAM) clustering 13 dividing our cohort into two groups (missense variants versus truncating variants), followed by a permutations test (n=100,000) and relabeling based on variant types, while keeping the original distribution of variant types into account. The same clustering and permutations test was performed when dividing our cohort into three groups. For both analyses, Bonferroni correction for multiple testing was applied and a p-value smaller than 0.025 was considered significant.

Average face analysis
For 24 of 42 individuals facial 2D-photographs were available for facial analysis. As previously described, average faces were generated while allowing for asymmetry preservation and equal representation by individuals 15 .

Three-dimensional protein modeling
The crystal structure of the CUT1 domain of SATB1 bound to Matrix Attachment Region DNA (PDB entry 2O4A 44 ) was used to contextualize the SATB1 CUT1 variants with respect to DNA using Swiss-PdbViewer 45 . The solution structure of the CUT2 domain of human SATB2 (first NMR model of the PDB entry 2CSF 46 ) was used as a template to align the SATB1 residues T491 to H577 (Uniprot entry Q01826), and to build a model using Swiss-PdbViewer 45 . The model of the CUT2 domain was superposed onto the SATB1 CUT1 domain bound to Matrix Attachment Region DNA (PDB entry 2O4A 44 using the "magic fit" option of Swiss-PdbViewer 45 ) to contextualize the SATB1 CUT2 variants with respect to DNA. The solution structure of the homeodomain of human SATB2 (second NMR model of the PDB entry 1WI3 47 was used as a template to align SATB1 residues P647 to G704 (Uniprot entry Q01826), and to build a model using Swiss-PdbViewer 45 . Chains A, C and D of the crystal structure of HNF-6alpha DNAbinding domain in complex with the TTR promoter (PDB entry 2D5V), which has a DNA binding domain similar to the CUT2 domain of SATB1 and a second DNA binding domain similar to the homeobox of SATB1, was used as a template to superpose the model of the SATB2 homeobox domain onto the HNF-6alpha structure using the "magic fit" option of Swiss-PdbViewer 48 to contextualize the SATB1 homeobox variant with respect to DNA.

Spatial clustering analysis of missense variants
Twenty-four of the observed 30 missense variants were included in the spatial clustering analysis. We excluded 6 variants, to correct for familial occurrence. The geometric mean was computed over the locations of observed (de novo) missense variants in the cDNA of SATB1 (NM_001131010.4). This geometric mean was then compared to 1,000,000 permutations, by redistributing the (de novo) variant locations over the total size of the coding region of SATB1 (2,388 bp) and calculating the resulting geometric mean from each of these permutations. The p-value was then computed by checking how often the observed geometric mean distance was smaller than the permutated geometric mean distance. This approach was previously used to identify cDNA clusters of variants 11; 12 . Code used in this analysis is available at: https://github.com/laurensvdwiel/SpatialClustering.
Cell culture HEK293T/17 cells (CRL-11268, ATCC) were cultured in DMEM supplemented with 10% fetal bovine serum and 1x penicillin-streptomycin (all Invitrogen) at 37°C with 5% CO2. Transfections for functional assays were performed using GeneJuice (Millipore) following the manufacturer's protocol. Lympoblastoid cell lines (LCLs) were established by Epstein-Barr virus transformation of peripheral lymphocytes from blood samples collected in heparin tubes, and maintained in RPMI medium (Sigma) supplemented with 15% fetal bovine serum and 5% HEPES (both Invitrogen).

Testing for nonsense mediated decay of truncating variants
Patient-derived LCLs were grown for 4 h with 100 µg/ml cycloheximide (Sigma) to block NMD. After treatment, cell pellets (10*10 6 cells) were collected and RNA was extracted using the RNeasy Mini Kit (Qiagen). RT-PCR was performed using SuperScriptIII Reverse Transcriptase (ThermoFisher) with random primers, and regions of interest were amplified from cDNA using primers listed in Suppl. Table 9.

Fluorescence microscopy
HEK293T/17 cells were grown on coverslips coated with poly-D-lysine (Sigma). Cells were fixed with 4% paraformaldehyde (PFA, Electron Microscopy Sciences) 48 h after transfection with YFP-tagged SATB1 and SATB2 variants. Nuclei were stained with Hoechst 33342 (Invitrogen). Fluorescence images were acquired with a Zeiss LSM880 confocal microscope and ZEN Image Software (Zeiss). For images of single nuclei, the Airyscan unit (Zeiss) was used with a 4.5 zoom factor. All other images were acquired with a 2.0 zoom factor. Intensity profiles were plotted using the 'Plot Profile' tool in Fiji -ImageJ.

FRAP assays
HEK293T/17 cells were transfected in clear-bottomed black 96-well plates with YFP-tagged SATB1 and SATB2 variants. After 48 h, medium was replaced with phenol red-free DMEM supplemented with 10% fetal bovine serum (both Invitrogen), and cells were moved to a temperature-controlled incubation chamber at 37°C. Fluorescent recordings were acquired using a Zeiss LSM880 and Zen Black Image Software, with an alpha Plan-Apochromat 100x/1.46 Oil DIC M27 objective (Zeiss). FRAP experiments were performed by photobleaching an area of 0.98 µm x 0.98 µm within a single nucleus with 488-nm light at 100% laser power for 15 iterations with a pixel dwell time of 32.97 µs, followed by collection of times series of 150 images with a 2.5 zoom factor and an optical section thickness of 1.4 µm (2.0 Airy units). Individual recovery curves were background subtracted and normalized to the pre-bleach values, and mean recovery curves were calculated using EasyFRAP software 52 . Curve fitting was done with the FrapBot application using direct normalization and a singlecomponent exponential model, to calculate the half-time and maximum recovery 53 .

Luciferase reporter assays
Luciferase reporter assays were performed with a pIL2-luc reporter construct containing the human IL2-promoter region, and a pGL3-basic firefly luciferase reporter plasmid carrying seven repeats of the -TCTTTAATTTCTAATATATTTAGAAttc-MAR sequence identified in an enhancer region 3' of the immunoglobulin heavy chain (IgH) genes (gift from Dr. Kathleen McGuire and Dr. Sanjeev Galande), as described previously [17][18][19] . HEK293T/17 cells were transfected with firefly luciferase reporter constructs and a Renilla luciferase (Rluc) normalization control (pGL4.74; Promega) in a ratio of 50:1, and with pYFP-SATB1 (WT or variant) or empty control vector (pYFP). After 48 h, firefly luciferase and Rluc activity was measured using the Dual-Luciferase Reporter Assay system (Promega) at the Infinite M Plex Microplate reader (Tecan).

BRET saturation assays
BRET assays were performed as previously described 51 . HEK293T/17 cells were transfected in white clear-bottomed 96-well plates with increasing molar ratios of YFP-fusion proteins and constant amounts of Rluc-fusion proteins (donor/acceptor ratios of 1/0.5, 1/1, 1/2, 1/3, 1/6, 1/9). YFP and Rluc fused to a C-terminal nuclear localization signal were used as control proteins. After 48 h, medium was replaced with phenol red-free DMEM, supplemented with 10% fetal bovine serum (both Invitrogen), containing 60 µM EnduRen Live Cell Substrate (Promega). After incubation for 4 h at 37°C, measurements were taken in live cells with an Infinite M200PRO Microplate reader (Tecan) using the Blue1 and Green1 filters. Corrected BRET ratios were calculated with the following formula: [Green1(experimental (Ex: 505 nm, Em: 545 nm) to quantify expression of the YFP-fusion proteins. Curve fitting was done with a non-linear regression equation assuming a single binding site using GraphPad Prism Software, after plotting the corrected BRET ratios against the ratio of total luminescence / total YFP fluorescence.

Fluorescence-based quantification of protein stability
Cells were transfected in triplicate in clear-bottomed black 96-well plates with YFP-tagged SATB1 variants. After 24 h, MG132 (R&D Systems) was added at a final concentration of 10 µM, and cycloheximide (Sigma) at 50 µg/ml. Cells were incubated at 37°C with 5% CO2 in the Infinite M200PRO microplate reader (Tecan), and the fluorescence intensity of YFP (Ex: 505 nm, Em: 545 nm) was measured over 24 h at 3 h intervals.

Statistical analyses of cell-based functional assays
Statistical analyses for cell-based functional assays were done using a one-or two-way ANOVA followed by a Bonferroni post-hoc test, with GraphPad Prism Software. Statistical analyses for FRAP and BRET data were performed on values derived from fitted curves of individual recordings or independent experiments respectively.