Spatiotemporal structure and substates in emotional facial expressions

Facial expressions are tools for modulating social interaction, from the display of overt expressions to subtle cues like a raised eyebrow that accompanies speech. How this complex signalling capacity is achieved remains only partially explored. An overlooked factor is the dynamic form of facial expression signalling, particularly how facial actions combine and recombine over time to produce nuanced expressions and enrich speech. Previous research has focused largely on the static aspects of facial behaviour, reflecting the theoretical and methodological challenges in modelling the production and perception of dynamic social signals. Drawing on theories of motor control, we leveraged facial-motion tracking and spatiotemporal dimensionality reduction to investigate the structure and function of facial signalling dynamics. We show that, despite the complexity of facial expressions, their emotion-signalling function is achieved through a few fundamental dynamic patterns with subtle yet diagnostic kinematic differences. Classification analyses further show that spatiotemporal components reliably differentiate emotion signals in Expression only and Emotive speech conditions. The underlying spatiotemporal structure represents an efficient encoding strategy, optimising the transmission and perception of facial social signals. These insights have implications for understanding normative and atypical face-to-face signalling and inform the optimal design of non-verbal expressive capabilities in artificial social agents. This work also contributes new methods for analysing facial movements, applicable to broader aspects of human multimodal social communication.

Author Summary

In face-to-face interactions, facial movements play a crucial role in conveying both emotional and speech-related information. Yet we do not fully understand how such complex signalling is achieved. Previous research has mainly focused on static signals, such as a single moment during a smile or frown. This study instead explores how facial movements change and evolve over time, particularly in relation to producing communicative emotions. Using advanced techniques to measure and analyse moment-to-moment changes in dozens of facial muscles during facial expression and speech production, we identified fundamental movement patterns that differentiate emotional content. These dynamic patterns enable us not only to communicate isolated emotional signals through facial expressions, but also to simultaneously communicate emotions during speech, by adding subtle cues like a raised eyebrow or a brief smile. Our findings have implications for understanding typical and atypical face-to-face interaction abilities and could enhance the design of biologically inspired, human-like facial expressions for social robots and virtual agents. Additionally, we contribute new methods for analysing facial movements, applicable to broader aspects of human multimodal social communication.


Introduction
The human face is an incredibly versatile tool for social communication [1], capable of producing a rich array of non-verbal expressions through the coordination of basic motor facial action units (AUs) [2]. For example, frowning (AU4) combined with pursed lips (AU24) can convey a prototypical facial expression of anger, while raising the cheeks (AU6) and pulling back the lip corners (AU12) can signal a smile. In addition to ostensive non-verbal emotional cues, in real-life social interaction facial expressions simultaneously modulate verbal communication. A slight tightening of the eyelids (AU7), for example, can make a "neutral" verbal utterance appear slightly angrier. The mechanisms that afford such flexible expressive capabilities to facial expressions remain only partially understood.
One overlooked factor in previous research is the dynamic nature of facial expressions. Most previous research has focused on the static aspects of facial behaviour, or aggregates of expressions that average out temporal dynamics [3].
Previous studies also largely ignore conditions that may carry a dynamic advantage, such as conversational or speech-related facial expressions, where dynamics simultaneously transmit visual cues for emotion and verbal communication [3][4][5][6].
Indeed, recent studies have started to reveal the importance of dynamics in both the production and perception of facial expressions. Psychophysics-inspired approaches have combined generative synthesis of facial movements [7] and reverse correlation to parametrically map the spatiotemporal features of AUs to perceptual representations of emotion categories [8], valence-arousal dimensions [9] and even social traits like dominance [10]. Similarly, a recent production and perception study recorded participants' facial movements during expressions of anger, happiness, and sadness, as well as during spoken sentences conveying the same emotions [11]. Kinematic analysis revealed faster movements for high-arousal emotions (happiness, anger) and slower movements for low-arousal emotions (sadness), suggesting that dynamic information in facial movements carries a diagnostic signal for differentiating facial expressions.
However, a direct exploration of the spatiotemporal structure of facial expressions, that is, how AUs dynamically combine and unfold over time and space, and how these patterns relate to their emotion-signalling function, remains lacking. Notably, one significant challenge to studying the role of spatiotemporal structure in facial expression production is the lack of a suitable theoretical and methodological framework. Such a framework should ideally integrate the dynamic nature of facial expression signals with their socio-communicative functions and provide an appropriate modelling strategy. In this work, we draw on the observation that facial expressions, like other facial and bodily movements, are controlled by an underlying motor system [12,13]. We argue that the motor control literature provides a valuable theoretical and analytical framework to investigate the spatiotemporal structure and function of dynamics in facial expressions.
A motor control perspective provides two useful insights that are applicable to studying the spatiotemporal structure and function of facial expressions. First, a longstanding motor control hypothesis suggests that body movements are generated by a low-dimensional mechanism that underpins their effective neural control and physical execution [14][15][16]. Such low dimensionality of motor control and execution might reflect the brain's solution to the problem of controlling numerous muscles and components of the motor system whilst minimising computational demands, often referred to as the "degrees of freedom" problem [17][18][19]. Crucially, such low dimensionality has been shown to describe motion patterns during walking [20], whole-body pointing [21], eye movements [22] and articulatory speech movements [23]. Indeed, while not primarily serving a communicative function, these movements are known to also convey emotion-related information [24][25][26], suggesting shared control mechanisms between social and non-social aspects of movement.
Applied to facial expressions, this raises the possibility that the dynamics underlying facial expressions, and their ability to multiplex information (e.g., visually signalling emotion and speech), may be structured around flexible low-dimensional spatiotemporal patterns. Such low-dimensional patterns could, for example, reflect how AUs combine and evolve over time during facial expression production. This spatiotemporal structure can be explored using data-driven models that learn a reduced dimensionality from high-dimensional AU movement data [27]. Thus, this perspective also suggests a methodological framework to model the spatiotemporal nature of movement, which is rare in the study of facial expression production.
A second insight from the motor-control perspective is the notion that, although outwardly fluid, movements can often be segmented into distinct 'substates', often visible in the differential modulation of kinematics (e.g., speed and displacement of motor actions).
For instance, walking movements can be segmented into 'swing' and 'stance' phases [20], eye movements into 'fixations' and 'saccades' [28], and even more fine-grained states (sub-movements) [29]. Facial expressions are likely to have substates related to the biomechanics of how they are produced, encompassing 'transitions', where action units are initiated or inhibited, and 'stability', where action units are sustained [30,31]. However, to date, facial expression substates have not been formally characterised, and whether and how they relate to emotional signalling is unknown. Similar to spatiotemporal components, substates can be explored in a data-driven fashion, using methods such as spatiotemporal clustering [27,32,33], without a priori defined kinematic profiles.
The characterisation of underlying spatiotemporal dimensions and substates in facial expressions would enable a richer description of facial dynamics that is both holistic, i.e., captures overall spatiotemporal patterns, and focal, i.e., captures micro-level spatiotemporal patterns in the movement data. Crucially, we hypothesise that these underlying patterns are not merely incidental; rather, they are diagnostic of emotional content, reflecting how facial movements are optimised for social signalling.
The current study therefore aims to (1) describe the spatiotemporal structure underlying facial expression production; (2) explore and characterise potential facial expression substates; and (3) evaluate the diagnostic value of spatiotemporal structure and substates for emotion signalling. To achieve this, we videoed participants while they produced facial movements in (a) Expression only and (b) Emotive speech conditions. We used motion tracking and the automated Facial Action Coding System (FACS) to describe the spatiotemporal patterns of facial action units (e.g., mouth widening, eyebrow raise, etc.). We then used Non-Negative Matrix Factorization (NMF) [34,35], a part-based dimensionality-reduction approach, to explore whether high-dimensional spatiotemporal patterns (e.g., AU weights over time) can be compressed into a set of diagnostic low-dimensional latent spatiotemporal patterns (groups of co-activated AUs and salient temporal patterns); see Methods. To preview our results, we find that facial expression dynamics can be distilled into a set of three fundamental spatiotemporal components that form the basis for flexible signalling. Classification and regression analyses (see Methods) show that emotions (happy, angry, sad) and conditions (Expression only and Emotive speech) are differentiated by synergistic mixing of components and subtle substate differences related to expression transitions. These results suggest that the complex and flexible signalling of emotional information through facial movements is achieved through variations in a small number of spatiotemporal components and substates.

Spatiotemporal patterns during the Expression only condition
We first examined the spatiotemporal components identified via the NMF, which represent the underlying spatiotemporal patterns (e.g., groups of AUs with common activation time courses and salient temporal patterns). The optimal number of spatiotemporal components (k), optimising for various fit indices (see Methods), was consistently three (see Fig 1, top panel; see also the Expression only condition NMF validation analysis in Supplementary Information, Figs S1-3). Evaluating the spatiotemporal patterns revealed that component 1 (C1) had stronger activation of the upper face-related AUs (e.g., AU4 brow lowerer, AU7 lid tightener). Class-wise, the spatiotemporal components significantly distinguished all expressions.
Happy had the highest accuracy (BalAcc = 0.91, AUC = 0.97, p < .001), suggesting a clear and distinct spatiotemporal profile, notable in the increased activation in C2 (lower-mid-upper face AUs) over time (see Fig 4).
Angry had the lowest classification accuracy of all expressions, yet was still significantly above chance (BalAcc = 0.67, AUC = 0.83, p = .01), reflecting a more complex and less distinct pattern across all components. Similar to happy, sad expressions were classified reliably above chance (BalAcc = 0.72, AUC = 0.84, p = 0.001). Fisher's exact test on the confusion matrix confirmed significant pairwise differentiation between happy and sad expressions (p < .001) and between angry and happy (p < .001). However, it suggested confusion between angry and sad (p = 0.12) (see Fig 4; SI1, Table S3-4).
In summary, we found three latent spatiotemporal components that reliably described the dynamics of facial expressions during Expression only and Emotive speech conditions. Robust train-test classification on unseen held-out data indicates that the spatiotemporal components carry information that is diagnostic of emotional content.
While we reported results for Expression only and Emotive speech expressions separately for simplicity, these patterns were consistent when analysed together.

Facial expression substate results
The second objective of this study was to identify and characterise facial expression substates, such as transition and sustain periods, and their diagnostic value for emotion signalling (i.e., whether they differ between happy, angry and sad expressions). We identified three primary substates through spatiotemporal clustering, which segmented periods with robust local dynamics in the spatiotemporal components (see Methods). Similar measures have been shown to capture diagnostic information about facial expression movement patterns [8,11].
In sum, transition substates carry speed information that differs between expressions and also between Expression only and Emotive speech conditions.

Substate dynamics differentiate emotion content
Entropy is an information-theoretic measure of complexity, directly related to state dynamics within a system [36]. Here, entropy is used as a metric to statistically describe the dynamics of facial expression substates in terms of how temporally structured/predictable they are (see Methods). High entropy therefore equates with high complexity in substate patterns, and low entropy with low complexity. Fig 5, for example, illustrates that substate transitions are more complex (i.e., higher entropy, less predictable) in the Emotive speech condition, where visual pattern differences between emotions are harder to discern than in the Expression only condition (see Methods).
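As a simple illustration of how entropy indexes temporal predictability, the R sketch below computes the Shannon entropy of consecutive substate transitions in a hypothetical label sequence; the exact entropy formulation used in the analysis is the one described in the Methods.

```r
# Illustrative R sketch: entropy of consecutive substate transitions.
# 'x' is a hypothetical per-time-bin vector of substate labels (e.g., 1-3);
# the actual entropy measure used in the analysis follows the Methods.
transition_entropy <- function(x) {
  pairs <- paste(head(x, -1), tail(x, -1))   # consecutive substate pairs
  p <- table(pairs) / length(pairs)          # empirical pair probabilities
  -sum(p * log2(p))                          # low = predictable, high = complex
}

structured <- c(rep(1, 40), rep(2, 20), rep(3, 40))   # long, ordered runs
erratic    <- sample(1:3, 100, replace = TRUE)        # frequent switching
transition_entropy(structured)   # lower entropy
transition_entropy(erratic)      # higher entropy
```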
Nonetheless, substate patterns differentiate expressions, with sad expressions having a more complex pattern of substates than happy and angry expressions.
To summarise, in addition to the low-dimensional patterns of spatiotemporal components that are diagnostic of emotion and expression condition, we also identified three substates reflecting periods of relaxation, transition, and sustainment of facial actions. These substates are distinguished by their unique kinematic and dynamic (complexity) profiles, which differ between emotional expressions and between the Expression only and Emotive speech conditions.

Discussion
The current study explored how the underlying dynamics of facial movements relate to their emotional signalling function. We employed an approach inspired by motor control theory to characterise the spatiotemporal structure of facial movements in Expression only and Emotive speech conditions. Specifically, we used data-driven spatiotemporal dimensionality reduction to identify spatiotemporal components that characterise how the activation of different action units evolves and combines over time. We consistently identified a small number of components that reliably described the time courses and covariance patterns of over a dozen action units. The identification of these components suggests that, despite the apparent complexity of facial expressions, their emotion-signalling function is achieved by just a few fundamental dynamic patterns; combinations of these dynamic patterns differentiate angry, happy and sad expressions. Second, we showed that these spatiotemporal patterns contain meaningful substates, which reflect periods of relaxation, transition, and sustainment of expressions. The kinematic profiles of these substates differed by emotion category as well as between the Expression only and Emotive speech conditions.
The low dimensionality of the spatiotemporal patterns of facial expressions that we have uncovered is intriguing given the various combinations of action units that can contribute to the same expressions [3], as well as the variability in how people produce expressions [37]. A low-dimensional spatiotemporal structure may afford flexibility for signalling, as a variety of expressions can be rapidly produced by relying on just a few interchangeable dynamic patterns. Such flexibility is likely beneficial in face-to-face interaction, where facial expressions may need to be quickly adapted according to interaction demands [3,38] or where it is necessary to accentuate emotional information during speech [4,39]. Interestingly, our findings are consistent with the observation of low-dimensional spatiotemporal patterns in other body movements [17,19,21], which also have the capacity to signal emotion information [24,40,41]. Thus, we extend existing findings from the bodily motor control literature to encompass facial expression movement.
An important question concerns whether our results are representative of facial expression signalling in general, or simply an outcome of the particular methodology we employed. For example, it could be argued that we identified a small number of spatiotemporal components because we asked participants to demonstrate relatively simplistic expressions, which can be signalled by activating just a few action units. We argue that this is not the case because (a) the Emotive speech condition was specifically designed to dissuade participants from relying on overly simple and caricatured expressions, and (b) visualisations of the non-reduced patterns of facial expressions, as well as their reconstruction from the components, show variability in the AU patterns. It could also be argued that our results are simply a consequence of dimensionality reduction, as any set of data can in theory be represented by a lower-dimensional structure. Yet, as we showed in a series of validation and sensitivity analyses, the components learned carry diagnostic information about emotions, which is robust to comparisons against appropriate randomised dynamic patterns and generalises to unseen data. Thus, we have not generated a nonsensical low-dimensional structure but rather a meaningful way to represent the data.
Here we provide evidence for a low-dimensional spatiotemporal structure for facial expressions of emotion. An intriguing question concerns why this low-dimensional structure exists. One possibility is that receivers only attend to a reduced subset of spatiotemporal features and so, correspondingly, signallers learn to move only a reduced subset of features, or to move them in specific ways. In support of this, studies have found that even when observers are presented with dozens of biologically plausible face movements, their perceptual representations suggest that they attend to only a reduced subset [35]. Since human infants learn to produce and recognise facial expressions throughout a protracted period of development [42], it is possible that humans learn to optimise signals based on how these are optimally perceived and responded to, creating a reciprocal dynamic interaction between these processes. Thus, the low-dimensional organisation of facial expressions may reflect the optimisation of social signals according to perceptual demands, and these perceptual demands may themselves be influenced by the low-dimensional organisation of facial expressions [43]. An alternative possibility is that there are savings (e.g., energy or computational savings) to be had from moving the face according to a few spatiotemporal patterns. Future research that integrates both perception and production of facial expressions, particularly in interactive contexts, will be crucial for elucidating these bidirectional influences.
The identification of substates in the spatiotemporal patterns of facial expressions adds a nuanced view of the dynamics of facial expression production. On the one hand, substates likely reflect the biomechanics of producing facial expressions, which inevitably result in periods of relaxation, transition and sustainment of facial motor actions [44]. Yet the fact that the patterns of these substates (e.g., speed, entropy) vary with different emotions and expressive conditions suggests that these local dynamics are sensitive to the communication of different emotions.
In real-life interaction, facial movements subserving verbal and non-verbal expressions are often seamlessly integrated. Our findings in the Emotive speech condition raise the possibility that facial movements underlying verbal and non-verbal emotion information are integrated by modulating different spatiotemporal components. For example, during speech we can modulate lip movements to produce words while accentuating movements such as an open mouth or eyebrow raise to signal emotion. Indeed, the observation of mostly part-based profiles (e.g., lower vs upper face AUs) in the Expression only condition, contrasting with the involvement of both mouth- and non-mouth-related components in Emotive speech expressions, supports this assertion.
Despite the robustness of our results, there are three important caveats that need to be addressed. The first is that the finding of a low-dimensional spatiotemporal structure for emotion signalling does not imply that three spatiotemporal components are an appropriate representation of all possible facial expressions. Discretisation of movement using FACS with more action units than those used in this study may require more dimensions, as might different emotional expressions (e.g., surprise, fear, disgust). Furthermore, while FACS AUs are anatomically based, robust descriptors of facial behaviour, facial movements for speech are more complex and not exhaustively described by AUs. Future studies should explore different coding systems (e.g., geometrical/landmark-based approaches) and include systems optimised for the complexities of speech. Second, a valid question concerns whether we would have found more or fewer substates had we employed a different methodology.
However, the number of substates, and the degree to which they are an optimal representation of dynamics, depends on the level of spatial and temporal resolution in the underlying data and that which is desired in the analysis. A different number of substates might be discovered using physiological approaches, such as electromyography (EMG). EMG is sensitive to subtle motor potentials and may therefore capture more substates, including preparatory or corrective movement impulses. Lastly, our approach to quantifying the spatiotemporal structure of facial behaviour focuses on exhaustively describing spatiotemporal features that account for variance in the signal, such as curvature patterns, entropy, and others. While our method shows that the underlying dynamic patterns are robust and diagnostic of emotion expression, it does not necessarily isolate specific features (or groups of features), nor does it compare their individual diagnostic value in classifying emotion. While such an analysis is beyond the scope of this study, it can be easily integrated with our approach.

Despite these caveats, our results and approach contribute new insights for future research on facial expression signalling. Spatiotemporal structure and substates might be important for better understanding the mechanisms underlying differences in social communication. This may provide useful insight into clinical conditions such as autism and depression. While research on these conditions has typically focused on emotion recognition differences, an underexplored dimension is the potential role of atypical facial expression production in contributing to social interaction difficulties [45,46]. A compelling extension of this work would therefore investigate whether conditions such as autism and depression are characterised by unique spatiotemporal patterns and substates during social signalling.

The insights gained from the current study also have practical applications in fields such as social robotics and human-computer interaction. A significant challenge in these fields relates to endowing social agents (robots, digital agents) with convincing and adaptable repertoires of social behaviour, including facial signalling [47]. Spatiotemporal components and substates underlying facial expression production provide actionable targets to enhance the design of flexible, realistic and expressive agents without the need to painstakingly craft each signal feature into the architecture of these systems.
Similarly, incorporating spatiotemporal cues and substates could significantly aid our understanding of how spatiotemporal structure influences emotion recognition, in both humans and machine recognition systems.

Conclusion
In conclusion, we revealed a low-dimensional spatiotemporal structure with distinct substates that underpin facial expression production in Expression only and Emotive speech conditions. We provide a methodological framework to advance the development of novel theoretical and experimental accounts of the structure and function of facial expression dynamics. This framework can be extended to other modalities and has practical implications for understanding emotional signalling and perception in human and machine systems.

Materials and Methods
Participants
43 healthy volunteers from the university community (24 female; mean age = 26) took part in a study for course credit or a monetary incentive. All participants gave informed consent, and all procedures of the study were approved by the institutional ethics committee. The dataset in this study was collected as part of a larger project investigating facial expression production kinematics and perception; see [11].
However, all the variables and analyses reported here are novel and specific to this study and not discussed in previous papers.

Experimental procedure
In two separate sessions across two days, participants' facial movements were captured during a facial expression production task. Participants were asked to produce three emotional expressions (anger, happiness, and sadness) in two counterbalanced conditions: an Expression only condition, where participants simply produced facial movements to signal each expression, and an Emotive speech condition, where they produced a fixed spoken sentence in an emotional fashion (e.g., "My name is John and I'm a scientist"). The conditions are motivated by the fact that, in social interaction, facial expressions can be signalled both in isolation and embedded within emotive speech, meaning that facial movements might signal information relevant to emotion and speech simultaneously, yet how this is achieved is unclear.
Facial movements were recorded with participants seated with their head at the centre of a polystyrene frame to ensure central positioning towards the camera, positioned 1 metre from a tripod-mounted video camcorder sampling at 60 frames per second (Sony Handycam HDR-CX240E). The start and end points were signalled using an audible beep, which allowed us to align the recordings of facial expressions. See the schematic illustration in Fig 7.

Analytical approach
Data processing. Facial expression videos were processed using OpenFace [48], an open-source suite for automated face tracking, landmarking and facial action unit detection based on the Facial Action Coding System (FACS) [2]. We focused on Action Units (AUs), which code for the observable presence and intensity of basic face actions (e.g., blinking, mouth widening, eyebrow raising) that can be used to produce almost any facial expression. We estimated the temporal activation weights for 18 AUs. Only videos with a tracking accuracy of over 90% and with participant data across both sessions were used, resulting in a total of 532 videos with approximately 240 time points each. Based on visual inspection of the movement profiles, we applied a moving average (window = 3 frames) to smooth out the slight jitter inherent in face tracking and AU estimation, while preserving the underlying trends and dynamic range. To ensure an equal temporal window for spatiotemporal analyses, AU time series were reduced into 100 time bins for each video. Binning was preferred over more computationally expensive alternatives such as dynamic time warping (DTW) for practicality, since the overall time series duration was very similar across videos, and visual comparisons between binning and DTW on our data showed negligible differences.
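As a concrete illustration of the smoothing and binning steps, the minimal R sketch below assumes a numeric matrix `au` of frames x AUs for a single video (a hypothetical name); the window and bin values match those reported above.

```r
# Minimal R sketch of AU pre-processing: moving-average smoothing (window = 3
# frames) followed by resampling each AU time series into 100 equal time bins.
# 'au' is an assumed frames x AUs matrix exported from OpenFace for one video.
smooth_ma <- function(x, w = 3) {
  as.numeric(stats::filter(x, rep(1 / w, w), sides = 2))     # centred moving average
}

bin_series <- function(x, n_bins = 100) {
  bins <- cut(seq_along(x), breaks = n_bins, labels = FALSE) # assign frames to bins
  as.numeric(tapply(x, bins, mean, na.rm = TRUE))            # average within each bin
}

au_smooth <- apply(au, 2, smooth_ma)          # smooth each AU column
au_binned <- apply(au_smooth, 2, bin_series)  # 100 x nAU matrix for this video
```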

Spatiotemporal analysis
Time series of AU weights for the full facial expression production task were analysed using Non-Negative Matrix Factorization (NMF) [34]. NMF is a multivariate dimensionality-reduction method that decomposes a high-dimensional dataset into two non-negative matrices that form a part-based representation of the data (i.e., the original data are approximated by combining/adding components). The non-negativity constraint is particularly well suited for representing movement and action unit data, which are typically positive, and the two-matrix part-based decomposition is well suited to represent spatial versus time-varying aspects of the data; see Fig 7 (see also [18,34,35]). The NMF takes the concatenated AU time series as input, denoted by a matrix M, which is approximated by the product of two low-dimensional matrices (T and S): T represents the temporal weights (basis values for time bins), and S represents the spatial weights (face action activation weights). The product of T and S, denoted M', represents an approximation of the full original matrix, such that M ≈ M' = T × S. The algorithm works by initialising the values of T and S (e.g., with random values) and using an iterative multiplicative update rule to sequentially update T and S by a factor that minimises the approximation error (i.e., the residuals). While the iterative nature of NMF means that the solution is non-deterministic (i.e., multiple, equally effective part-based representations can be learned), these update rules have been shown to guarantee convergence (i.e., stability of the cost function) after a relatively low number of iterations [49]. The optimal number of spatiotemporal components (k) was assessed computationally by fitting a range of values (k = 2 to 6) and choosing the k that optimised various model fit metrics (e.g., residual sum of squares, silhouette for clustering cohesion, cophenetic coefficient; see below). We also evaluated the final solution against NMF results on a permuted dataset, using block-based time shuffling. This provided an appropriate null dataset that destroyed the inherent spatiotemporal pattern in the data but retained some temporal ordering. We then compared fit metrics between the original and permuted NMF solutions (see also Classification and clustering analysis). We used the NMF decomposition functions in the NMF package for R [50].
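The following sketch, using the NMF package for R named above, illustrates the rank survey and final decomposition; `M` is assumed to be the concatenated non-negative time bins x AU matrix, and `shuffle_blocks` is an illustrative placeholder for the block-based time shuffling.

```r
library(NMF)

# Survey candidate ranks (k = 2 to 6) and inspect fit metrics such as the
# residual sum of squares, silhouette and cophenetic coefficient.
rank_survey <- nmf(M, rank = 2:6, nrun = 10, seed = 123456)
plot(rank_survey)

# Fit the selected rank (k = 3 in our data) with multiple random restarts.
fit   <- nmf(M, rank = 3, nrun = 30, seed = 123456)
T_mat <- basis(fit)        # temporal weights: time bins x components
S_mat <- coef(fit)         # spatial weights: components x AUs
M_hat <- T_mat %*% S_mat   # reconstruction, M ~ M' = T * S

# Compare against a block-shuffled null dataset that retains some temporal
# ordering but destroys the global spatiotemporal pattern (illustrative helper).
fit_null <- nmf(shuffle_blocks(M), rank = 3, nrun = 30, seed = 123456)
```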
To visualise our NMF results in an intuitive manner, we trained a custom model to predict aligned facial landmark displacements (after correcting for rotation, translation and looming) from AUs, based on the original data, using partial least squares regression.
Note, however, that the AU-to-landmark face model is not a perfect one-to-one mapping between the NMF results and the visualisation, given the many-to-one relationship between facial landmarks and AUs.
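A minimal sketch of this visualisation model, assuming the pls package for R, with `au_train` (frames x AUs) and `lmk_train` (frames x landmark coordinates) as hypothetical matrices from the original data; the number of PLS components is illustrative.

```r
library(pls)

# Partial least squares regression predicting aligned landmark displacements
# from AU activations; used for visualisation only, given the many-to-one mapping.
vis_model <- plsr(lmk_train ~ au_train, ncomp = 10, validation = "CV")

# Project an NMF component's AU profile (1 x nAU row vector) through the fitted
# regression weights to draw an approximate face map (intercept omitted here).
B <- coef(vis_model, ncomp = 10)                 # array: AUs x landmark coords x 1
component_face <- comp_au_profile %*% B[, , 1]   # approximate landmark displacements
```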

Multivariate timeseries classification
We used a supervised classification approach to validate the diagnostic value of the spatiotemporal components for emotion expression. This involved the following steps.
(1) Data splitting: We began by splitting the original pre-NMF data into training (80%) and testing (20%) sets, ensuring that each set contained complete AU time series for each video. This split was performed before any further analyses, and all subsequent steps maintained this separation, ensuring that further processing and model training were performed only on the training set, with the results then applied to the test set. Hence, any mention of "training" always refers to operations on the respective training set. This approach was taken to prevent information leakage between the training and test sets, ensuring that classification was conducted on truly unseen data [51,52].
(2) NMF re-training and projection: We then re-trained the NMF model using only the training set data and projected the test set onto the trained NMF model space to ensure consistency of components for classification.
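A minimal sketch of steps (1) and (2), assuming `videos` is a list of binned AU matrices (one per video); for the projection, a simple least-squares fit against the training basis, clipped at zero, is shown as a stand-in for a dedicated non-negative solver.

```r
set.seed(1)

# (1) Split by video (80%/20%) before any further processing, to avoid leakage.
n           <- length(videos)
train_id    <- sample(n, size = round(0.8 * n))
M_train     <- do.call(rbind, videos[train_id])   # concatenated training AU time series
videos_test <- videos[-train_id]                  # held-out videos, kept separate

# (2) Re-fit the NMF on training data only, then project each test video onto
# the learned spatial basis S (clipped least squares shown for illustration).
fit_train <- NMF::nmf(M_train, rank = 3, nrun = 30)
S <- coef(fit_train)                              # components x AUs
project_video <- function(v) {
  pmax(v %*% t(S) %*% solve(S %*% t(S)), 0)       # time bins x components
}
T_test <- lapply(videos_test, project_video)
```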
(3) Feature extraction: Because standard classification approaches do not account for the inherent spatiotemporal structure of the NMF components, we opted for an approach that represents the spatiotemporal structure via interpretable theoretic time series features (e.g., curvature, entropy, predictability, peak, trend) [53], consistent with features suggested to capture diagnostic information for the classification of facial expressions and body movements [8,11,54,55]. Feature extraction was done separately for the train and test sets. A Principal Component Analysis (PCA) was then used to reduce the features, retaining a number of dimensions equal to the number of NMF components (for consistency) that explained the majority of the variance in the training set. The test set features were then transformed using the PCA model fitted on the training set features. The resulting component scores were used as inputs for the classification analysis in the next step.
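A minimal sketch of step (3), computing a small illustrative subset of time-series features per component and reducing them with a PCA fitted on the training set only (the full analysis used the richer feature catalogue cited above); `T_train` and `T_test` are assumed lists of per-video component time courses.

```r
# A few illustrative time-series features per component time course.
ts_features <- function(y) {
  t_idx <- seq_along(y)
  c(peak      = max(y),
    peak_time = which.max(y) / length(y),           # relative time of the peak
    trend     = unname(coef(lm(y ~ t_idx))[2]),     # linear trend (slope)
    curvature = mean(diff(y, differences = 2)^2))   # mean squared second difference
}

featurise <- function(video_list) {
  t(sapply(video_list, function(m) as.vector(apply(m, 2, ts_features))))
}
feat_train <- featurise(T_train)
feat_test  <- featurise(T_test)

# PCA fitted on training features only; retain as many dimensions as NMF
# components (three) and apply the same rotation to the test features.
pca  <- prcomp(feat_train, scale. = TRUE)
X_tr <- pca$x[, 1:3]
X_te <- predict(pca, newdata = feat_test)[, 1:3]
```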
(4) Classification training and testing: We then trained a classifier on the spatiotemporal component features derived in step 3 and applied it to predict emotion expression in the held-out test set. Specifically, we used a Random Forests (RF) classifier, a supervised learning method that learns a discriminant model between a set of input variables (e.g., spatiotemporal components) and a set of target classes (expressions).
Note that we achieved similar results with different methods (e.g., Support Vector Machine, Linear Discriminant Analysis); thus, our results are not a function of the analytical approach. The Random Forests classifier was preferred because it is relatively robust to overfitting and able to capture non-linear relationships. We used cross-validation during training, where multiple training folds are created and the data not included in each training fold are used to assess accuracy and tune the model parameters. Reliable classification on the test set would indicate that the underlying spatiotemporal components (identified by the NMF analysis) carry diagnostic information for differentiating expressions. We derived significance tests for our classification results using binomial tests to determine whether the model's accuracy on the test set was significantly greater than what would be expected by random guessing (by comparing it to the no-information rate). These binomial p-values were compared with permutation-based p-values, generated by comparing the observed accuracy to the distribution of accuracies obtained from shuffling the labels (expressions) in the test set (N = 1000). The negligible differences between the binomial and permutation-based p-values for our dataset validated the robustness of our results, so we primarily report the binomial p-values. Additionally, we used Fisher's exact test to assess the statistical significance of the confusion matrix, comparing pairwise expression differentiation with adjustments for multiple comparisons. A low p-value in Fisher's exact test indicates statistically significant differences between pairs, meaning the pattern in the confusion matrix is unlikely to have occurred by chance.
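A minimal sketch of step (4), assuming the randomForest package and the `X_tr`/`X_te` feature scores from the previous step, with `y_tr`/`y_te` as the corresponding emotion labels (variable names are illustrative).

```r
library(randomForest)

# Train on training-set component features and predict the held-out test set.
rf    <- randomForest(x = X_tr, y = factor(y_tr), ntree = 500)
y_hat <- predict(rf, X_te)

# Binomial test: is test accuracy greater than the no-information rate?
hits <- sum(y_hat == y_te)
nir  <- max(table(y_te)) / length(y_te)
binom.test(hits, length(y_te), p = nir, alternative = "greater")

# Fisher's exact test on a pairwise slice of the confusion matrix (e.g., happy vs sad).
cm <- table(predicted = y_hat, truth = y_te)
fisher.test(cm[c("happy", "sad"), c("happy", "sad")])
```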

Spatiotemporal clustering for substates analyses
We investigated whether reliable substates could be identified from the spatiotemporal patterns of facial expressions. We considered substates to be local intervals where the components exhibit salient signatures that are consistent across expression time series. This challenge can therefore be formulated as segmenting a multivariate time series into finer temporal segments. To do this, we first derived speed and displacement for the spatiotemporal components, as these features are sensitive to local dynamics. Both speed and entropy measures were computed for each individual expression and analysed using a mixed-effects model to test the effects of emotion (happy, angry, sad), condition (Expression only, Emotive speech), and substate (relaxed, transition, sustain) as fixed effects, with subject (face id) as a random effect. Entropy was log-transformed to correct for non-normal residuals. Linear mixed models were fitted using the lme4 package for R [58]. Linear mixed models were used, as opposed to the classification approach, because the larger number of fixed effects and nested conditions would make classification results overly complex.
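A minimal sketch of the mixed-effects models, assuming a long-format data frame `df_sub` with one row per expression x substate and illustrative column names; a full factorial fixed-effects structure is shown, although the exact model specification follows the description above.

```r
library(lme4)

# Speed and (log-transformed) entropy modelled with emotion, condition and
# substate as fixed effects, and a random intercept per participant (face id).
m_speed   <- lmer(speed ~ emotion * condition * substate + (1 | subject),
                  data = df_sub)
m_entropy <- lmer(log(entropy) ~ emotion * condition * substate + (1 | subject),
                  data = df_sub)
summary(m_speed)
```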

Figure 1 .
Figure 1. Spatiotemporal components for Expression only facial expression production. The top panel (A) shows that expression production can be summarised by three components, capturing spatiotemporal patterns for upper face AUs (component 1), lower face AUs (component 2) and a combination of both lower and upper face AUs (component 3). In the top panel, the middle and right columns show the approximate spatial distribution of components in an average face model and how it changes over time, respectively. The bottom panel (B) highlights spatiotemporal profiles for angry, happy and sad expressions. Coef = coefficient from the NMF; brighter colours indicate stronger activation. Note that while heatmaps are drawn to scale, face maps are drawn for visualisation purposes only.

Figure 2 .
Figure 2. Spatiotemporal classification of Expression only facial expression production. Notes: The left panels show decision boundaries (filled backgrounds) across spatiotemporal components (comp1: upper face AUs; comp2: lower face AUs; comp3: lower to upper face AUs) estimated on a training set to predict emotion classes (angry, happy, sad). Coloured dots represent individual ground-truth expressions from both training and held-out test data. Dots within a matching background indicate correct classifications, while mismatches indicate misclassifications. The results demonstrate robust classification, with at least one component combination separating an expression from the rest; e.g., comp1 vs comp2 and comp2 vs comp3 robustly distinguish happy expressions. The top right panel presents Receiver Operating Characteristic (ROC) curves, showing good model performance (higher True Positive Rate [TPR] and lower False Positive Rate [FPR]), with the diagonal dashed line representing chance. The bottom right panel displays a confusion matrix. Brighter colours indicate higher accuracy along the diagonal elements, with off-diagonal elements showing little to no confusion (darker colours).

Figure 3 .
Figure 3. Spatiotemporal components for Emotive speech facial expression production. The top panel (A) shows that Emotive speech facial expression production can be summarised by three underlying components, capturing synergies between upper to lower face AUs (component 1), lower-mid-upper face AUs (component 2) and lower to upper face AUs (component 3). In the top panel, the middle and right columns show the approximate spatial distribution of components in an average face model and how it changes over time, respectively. The bottom panels (B) show how components change over time, highlighting different profiles for angry, happy and sad expressions. Coef = coefficient from the NMF; brighter colours indicate stronger activation. Note that while heatmaps are drawn to scale, face maps are not a one-to-one mapping and are drawn for visualisation purposes only.
Component 2 (lower-mid-upper face AUs) combined strong activation of AUs in the lower (e.g., AU25, lips part), middle (e.g., AU14, dimpler) and upper face areas (e.g., AU7, lid tightener). Component 3 primarily featured activation of lower face AUs, such as AU25 (lips part) and AU26 (jaw drop), plus additional moderate activation for AU45 (blink). Henceforth, component 3 is referred to as lower to upper face AUs (see Fig 3). C2 (lower-mid-upper face AUs) and C3 (lower to upper face AUs) had an inverted-U shape, with C3 peaking earlier than C2 (Fig 3C). As with Expression only production, Emotive speech expressions were also characterised by distinct spatiotemporal patterns. That is, a classifier trained on time series-based features of the spatiotemporal components (e.g., curvature, entropy) achieved significant accuracy in classifying expressions on a held-out test set, with performance at approximately 70% (ACC = 0.69, 95% CI: 0.53-0.81; BalAcc = 0.76; Kappa = 0.53, p < 0.001). These results indicate moderate to strong agreement between predicted and ground-truth emotional expression labels.

Figure 4 .
Figure 4. Spatiotemporal classification of Emotive speech facial expressions. The left panels show decision boundaries (filled backgrounds) across spatiotemporal components (comp1: upper to lower face AUs; comp2: lower-mid-upper face AUs; comp3: lower to upper face AUs) estimated on a training set to predict emotion classes (angry, happy, sad). Coloured dots represent individual ground-truth expressions from both training and unseen test data. Dots within a matching background indicate correct classifications, while mismatches indicate misclassifications. The results demonstrate robust classification, with component combinations effectively separating the expressions in at least one combination. The top right panel presents ROC curves, showing good model performance (higher True Positive Rate [TPR] and lower False Positive Rate [FPR]), with the diagonal dashed line representing chance. The bottom right panel displays a confusion matrix. Brighter colours indicate higher accuracy along the diagonal elements, and confusion off-diagonal (e.g., happy shows high accuracy and low confusion, whereas angry shows higher confusion than all other expressions).

Figure 5 .
Figure 5. Illustration of facial expression substate profiles. A-B. Examples of substate clusters (1 (green) = relaxed, 2 (peach) = transition, 3 (purple) = sustain). C-D. Overall speed distributions for Expression only and Emotive speech expressions for all subjects. To compare whether the pattern of facial expression substates differs between

Figure 6 :
Figure 6: Summary of speed and entropy of substate patterns for Expression only and Emotive speech conditions. Substate cluster: 1 = relaxed, 2 = sustained, 3 = transition. The Expression only condition shows faster movement than Emotive speech, with transition substates being the fastest; entropy mainly differentiates the Expression only and Emotive speech conditions.

Figure 7 .
Figure 7. Schematic illustration of the methodological and analytical approach. The left panel illustrates the facial expression production task, video capture and OpenFace processing for landmark and action unit (AU) estimation. The right panel summarises the procedure for the concatenation of AU time series data and the application of Non-Negative Matrix Factorization (NMF) to learn an underlying low-dimensional spatiotemporal representation of the data. The accuracy and effectiveness of this lower-dimensional representation were validated through a series of sensitivity and validation analyses.