Establishing an AI-based evaluation system that quantifies social/pathophysiological behaviors of common marmosets

Nonhuman primates (NHPs) are indispensable animal models by virtue of the continuity of behavioral repertoires across primates, including humans. However, behavioral assessment at the laboratory level has so far been limited. By applying multiple deep neural networks trained with large-scale datasets, we established an evaluation system that could reconstruct and estimate three-dimensional (3D) poses of common marmosets, a small NHP that is suitable for analyzing complex natural behaviors in laboratory setups. We further developed downstream analytic methodologies to quantify a variety of behavioral parameters beyond simple motion kinematics, such as social interactions and the internal state behind actions, obtained solely from 3D pose data. Moreover, a fully unsupervised approach enabled us to detect progressively-appearing symptomatic behaviors over a year in a Parkinson’s disease model. The high-throughput and versatile nature of our analytic pipeline will open a new avenue for neuroscience research dealing with big-data analyses of social/pathophysiological behaviors in NHPs.


Introduction
Quantitative evaluation of animal behavior is crucial for various research areas of neuroscience.
However, observing natural behaviors of freely moving animals by visual inspection incurs a considerable cost. Meanwhile, recent advances in artificial intelligence (AI) have paved the way to quantifying massive amounts of behavioral data in a large-scale and automated manner [1][2][3][4], and assessment of natural behaviors with "markerless pose estimation" has already been implemented in a number of studies [5][6][7][8][9][10]. Indeed, AI-based three-dimensional (3D) analysis of body posture, including the limb positions, makes it possible to evaluate a variety of behavioral aspects that characterize nonhuman primates (NHPs) 11,12.
The application of this methodological innovation to neuroscience research is now rapidly expanding 11,[13][14][15][16][17], as it has the potential to bring about fundamental changes in the design of behavioral experiments on NHPs, which have long been carried out under head-fixed conditions. In the past decades, evidence accumulated from a number of research works, such as ethological studies on wild animals, has suggested the continuity of behavioral repertoires across primates, including humans [19][20][21]. However, there remains a large gap between field and laboratory research, since experimental settings under freely moving conditions have so far been limited at the laboratory level.
Common marmosets are one of the NHP species suitable for overcoming this problem, given that their relatively small body size permits observations of complex natural behaviors in laboratory setups [18][19][20][21]. Furthermore, marmosets are remarkably prosocial animals. It is generally accepted that all family members cooperate to rear infants, whose development is successfully attained via interactions with their caregivers. This implies that marmosets can be useful as a primate model for exploring social behavior 18,22. The development of telemetric devices for brain activity recordings [28][29][30] also accelerates the preparation of experimental environments in a freely behaving fashion. In addition, the high reproductive efficiency of marmosets has led to the production of brain disease models by genetic engineering techniques [23][24][25][26][27], which requires longitudinal and high-throughput assessment of symptomatic behaviors.
Two issues should be solved to achieve a methodological improvement in designing behavioral experiments on marmosets. First, the practical use of "deep neural networks" for behavioral analysis demands both a huge volume of ground truth data 14,16 and an analytic pipeline that reconstructs the 3D poses of multiple animals simultaneously while recognizing individuals. Second, even if the best effort is made to establish such a system, a major question still arises as to how effective this approach is in evaluating the natural behaviors of freely moving marmosets. In fact, quantitative analyses to date based on markerless pose estimation have largely focused on movement itself (e.g., kinematics of body-part movements and sequences of motor actions) 6,17,31, leaving cognitive behaviors and social interactions untargeted.
In the present study, we developed a markerless 3D pose estimation system to analyze natural behaviors of marmosets under freely moving conditions, and a large-scale training dataset to promote automated quantification of videographic data.We further developed a set of downstream analytic methodologies that took advantage of the potential of 3D pose data.
Here we show that (1) the 3D pose data are suitable for defining social behavior, which is more than the kinematics of a single animal and represents complex interactions among multiple animals, (2) the 3D pose data are able to infer the animal's internal state behind actions, and (3) a completely unsupervised approach based on the 3D pose data allows us to detect behavioral changes in response to pathophysiological conditions. Through these distinct experimental subjects (parenting behavior of male vs. female marmosets, behavioral flexibility of socially interacting marmosets, and symptomatic behaviors progressively appearing in a marmoset model of Parkinson's disease (PD), respectively), we have revealed the broad applicability of our system, which permits extracting a wide range of behavioral parameters beyond spatiotemporal kinematics.

Markerless 3D pose estimation of multiple marmosets with individual identification
Our analytic framework consisted of the following three elements: a multi-camera recording system, an analytic pipeline combining multiple deep neural networks, and large-scale ground truth data to train the deep neural networks for accurate quantification. The recording system included eight synchronized cameras surrounding a transparent cage that was specially designed to allow housing of a marmoset family (up to four individuals) and to provide continuous, clear video recordings for several days or more. Multiview videographic data were fed into the custom-made analytic pipeline, which had been fully optimized for robust reconstruction of the 3D poses of multiple marmosets under individual identification in a variety of natural behavioral contexts (Fig. 1a).
For the analytic pipeline, regions of interest (ROIs) where marmosets were located were first determined in each camera view at each time frame by using a detection network. Subsequently, 18 keypoints and a potential animal identity per ROI were estimated through a pose network and an identity network, respectively. In each camera view, ROIs taken from successive time frames were combined based on spatial continuity to construct tracklets, which were composed of time-series data including the pose and identity. During this process, individual tracklets contained information only from a single camera view and were therefore fragmented into short time periods (Fig. 1a, 2D processing). As the next step, a 3D tracklet was constructed by combining several tracklets that represented the same animal from different camera views by minimizing the so-called pose affinity score (Fig. 1a, camera association; for details, see the Methods section). Finally, 3D tracklets were combined across the entire recording time based on both spatial continuity and the probability of animal identity (Fig. 1a, 3D optimization).
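The final 3D-optimization step described above can be sketched roughly as follows: fragmented 3D tracklets are greedily chained whenever they are close in time and space and agree on the most probable animal identity. The tracklet format, the threshold values, and the greedy strategy here are our illustrative assumptions, not the pipeline's actual implementation.

```python
import numpy as np

def link_tracklets(tracklets, max_gap_frames=24, max_jump_mm=150.0):
    """Greedily chain fragmented 3D tracklets sorted by start frame.

    Each tracklet is a dict with (hypothetical format):
      't0', 't1' : first/last frame index
      'p0', 'p1' : 3D centroid (mm) at the first/last frame
      'id_prob'  : probability vector over animal identities
    Returns a list of chains (lists of tracklet indices).
    """
    order = sorted(range(len(tracklets)), key=lambda i: tracklets[i]['t0'])
    chains, tail = [], {}          # tail maps chain index -> last tracklet index
    for i in order:
        tr = tracklets[i]
        best, best_d = None, np.inf
        for c, j in tail.items():
            prev = tracklets[j]
            gap = tr['t0'] - prev['t1']
            jump = np.linalg.norm(tr['p0'] - prev['p1'])
            same_id = np.argmax(tr['id_prob']) == np.argmax(prev['id_prob'])
            # chain only if close in time and space with matching identity
            if 0 < gap <= max_gap_frames and jump <= max_jump_mm and same_id:
                if jump < best_d:
                    best, best_d = c, jump
        if best is None:
            chains.append([i])
            tail[len(chains) - 1] = i
        else:
            chains[best].append(i)
            tail[best] = i
    return chains
```

In practice the published pipeline solves this association globally rather than greedily, but the same two cues (spatial continuity and identity probability) drive the decision.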
To achieve accurate and robust 3D pose estimation, we created annotations of 3D keypoints for more than 7404 bodies (each seen from eight different views) (Fig. 1c, d) in a variety of natural behavioral contexts (Fig. 2a, b), which could be used as a ground truth dataset for training both the detection and pose networks. The requirement for a training dataset for animal recognition largely depended on experimental conditions (with/without infants, the use of a color tag, implantation of neural activity recording devices, etc.). In the present study, we tested either a pair of marmosets or a breeding family (male and female parents with their infants). A necklace-type color tag was attached to adult marmosets to facilitate identification. Under these conditions, we labeled 4231 samples in total for ID classification.
Of this dataset, we used 80% for training and 10% each for validation and testing. The ground truth dataset was created from 29 different individuals ranging from 1.5 months old (infant) to 12 years old (adult).
The final performance of animal detection and identification in 3D space was 99.3% in precision and 98.8% in recall (Fig. 2c, d, Video S1). The geometric error in pose estimation at each keypoint was 9.68 mm (range, 4.86 to 15.25 mm) in 3D space (Fig. 2e). Scaled to the human body, the estimation error at, for example, the wrist corresponds to about 4 cm. This accuracy was comparable to the state-of-the-art performance on a similar task in human pose estimation 32, where enormous amounts of ground truth data are available, indicating that our system, consisting of the recording environment, training dataset, and analytic pipeline, reached the highest level considered achievable at the present time. However, a major question remained as to the extent to which our system would be practically useful for actual experiments, which was hard to judge from these scores alone. In the following sections, we explored the potential of 3D pose data by quantifying various types of behavioral parameters beyond the simple spatiotemporal kinematics of body parts.

Differential roles of male vs. female marmosets in parenting as defined by automated detection of social behavior
When introducing automated quantification of natural behaviors, evaluation of social behavior is the most difficult but also the most beneficial target, since it involves more than the kinematics of a single animal.
In the first set of our experiments, we tested the potential of 3D pose data for assessing food-sharing behavior, which is frequently observed in a breeding family of marmosets. Both male and female marmosets generally take care of their infants together and are therefore characterized as cooperative breeders, which is similar to the human case but relatively rare in other NHPs 18. As part of parenting, adult marmosets share their food with infant marmosets, which enables the infants not only to satisfy their nutritional needs, but also to obtain an opportunity to learn about diet 33. Thus, we attempted to quantify the food-sharing behavior of breeding marmoset families.
In the present experiment, we sought to detect food-sharing behavior by applying a spatiotemporal filter to 3D pose time-course data. Two marmoset families participated in this experiment. Since the output of our system was simple time-course data of the 3D posture of each marmoset, we started with engineering features that might capture food-sharing events in the marmoset families based on the 3D pose time-course data. From such data obtained from parents and infants, we computed the distances between the infants' mouths/hands and the parents' hands/mouths, and their derivatives (i.e., velocities).
Comparison with videographic images confirmed that the resulting time-course data could be good indicators for detecting food-sharing events between the parents and the infants (Fig. 3a, b). Via spatiotemporal thresholds on these quantitative posture and motion parameters, we then defined and counted the occurrence of such events automatically (for details, see the Methods section). Moreover, we acquired annotations by a human observer to optimize and verify the automated detection of food-sharing events, based on a subset of videographic sequences randomly selected from the entire study cohort. The threshold values were tuned using 25% of the annotation data. The detection accuracy (i.e., true positives, false positives, and false negatives) was estimated with the rest of the annotations, which were not used for the parameter tuning (Fig. 3c). We obtained the precision-recall curve (Fig. 3d) and estimated the optimal F1 score and Cohen's kappa, which were 0.80 and 0.77, respectively. These scores satisfied common criteria for inter-observer reliability in the behavioral sciences 34, indicating that our automated analysis was reliable enough for quantification of social behavior. Furthermore, we applied this detector to the rest of the entire dataset and found that food-sharing events occurred more frequently in male than in female parents (Fig. 3e-f). Such a difference between fathers and mothers has been suggested by previous studies on distinct species of New World monkeys [35][36][37]. The overall results demonstrated that our AI-based analytic pipeline clarified the differential roles of cooperatively breeding animals in parenting under a laboratory environment, and that this pipeline could be useful for quantifying social behavior.
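The spatiotemporal thresholding described above can be sketched minimally as follows: flag spans where the infant-mouth-to-parent-hand distance stays below a proximity threshold for a minimum duration. The full detector also used velocity-derived features; the threshold values here are placeholders, not the tuned ones from the study.

```python
import numpy as np

FPS = 24  # camera frame rate reported in the Methods

def detect_food_sharing(dist_mm, max_dist=30.0, min_frames=12):
    """Return (start, end) frame spans where dist_mm stays below max_dist
    for at least min_frames consecutive frames.

    dist_mm: 1D array of per-frame infant-to-parent keypoint distances.
    """
    close = dist_mm < max_dist
    events, start = [], None
    for t, c in enumerate(close):
        if c and start is None:
            start = t                       # entering a proximity span
        elif not c and start is not None:
            if t - start >= min_frames:     # keep only sustained contact
                events.append((start, t))
            start = None
    if start is not None and len(close) - start >= min_frames:
        events.append((start, len(close)))
    return events
```

A brief dip below the distance threshold (e.g., a passing animal) is rejected by the duration criterion, which is what makes the filter spatiotemporal rather than purely spatial.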

Behavioral adjustment depending on others' internal state as investigated by recurrent neural networks
In the second set of our experiments, we assessed the extent to which our system with 3D pose time-course data could infer the animal's internal state behind actions. In the social life of primates, it is crucial to adjust one's own behavior depending on others' internal states, such as emotions, intentions, and other physiological needs [38][39][40]. Conceivably, others' internal states may not be readily observable in themselves, but can be inferred by continuously watching their behavior 41. Several human neuroimaging studies have identified neural substrates involved in this sort of cognitive function [42][43][44][45]. On the other hand, only a few related works have so far been available in NHPs [46][47][48], because nonverbal behavioral paradigms are so limited that the possible underlying mechanisms remain to be investigated. Here, we attempted to overcome this issue by combining a novel freely-moving behavioral task with our analytic pipeline using a deep neural network.
To examine social behavioral actions in response to others' internal states, we developed a food competition task under freely moving conditions, in which two marmosets interacted to share or keep a valuable food item (Fig. 4a, b). Two different pairs of marmosets participated in this experiment. The partner's internal state (either full or hungry) was controlled without notifying the subject before the experiment started. Then, only the subject animal could obtain a large food item that takes a couple of minutes to eat. The partner animal in the same cage might try to take away or beg for the food from the subject, and, therefore, the subject had to pay attention to the partner's actions. Employing this behavioral task, we tested how the subject might adjust his/her behavior depending on the partner's internal state.
The Long Short-Term Memory (LSTM) network 49, a type of recurrent neural network for temporal data analysis, was used to decode the partner's internal state. Two different LSTMs, LSTMpartner and LSTMsubject, with the same architecture were trained to decode the partner's internal state (i.e., full or hungry) from the actions of either the partner or the subject, respectively (Fig. 4c). These LSTMs were designed to take 800 ms of 3D pose data as input and to generate as output a score representing the partner's internal state, i.e., hungriness. The output score of LSTMpartner was predicted only from the partner's actions and could even display variability within single trials (Fig. 4d). For example, in a scene with a higher score (Fig. 4d, left panel), the partner was directly approaching the subject, as if trying to take away the food.
Conversely, in a scene with a lower score, the partner was exploring the cage without any interest in either the subject or the food. Similarly, as the partner's internal state (and the resulting actions) probably affected the subject's behavior, the output score of LSTMsubject was able to predict the partner's internal state solely from the subject's actions (Fig. 4e). Even though the outputs of both LSTMs fluctuated within single trials and across trials, the overall scores were higher in the hungry than in the full condition (Fig. 4f). Thus, not only LSTMpartner but also LSTMsubject precisely predicted the partner's condition on average (Fig. 4g). The accurate decoding from the LSTMsubject output indicated that the marmoset indeed adjusted his/her own behavior flexibly based on the other's internal state.
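The preprocessing that feeds such a decoder can be sketched as slicing the 3D pose time series into 800 ms windows with condition labels. At 24 fps, 800 ms corresponds to about 19 frames; the flattening of (frames, keypoints, xyz) into a per-frame feature vector is our assumption, not necessarily the paper's exact input format.

```python
import numpy as np

FPS = 24
WIN = round(0.8 * FPS)   # ~19 frames per 800 ms window

def make_windows(poses, label, stride=WIN):
    """Cut a pose time series into fixed-length decoder inputs.

    poses: (T, K, 3) array of 3D keypoints for one animal.
    label: condition of the trial (e.g., 0 = full, 1 = hungry).
    Returns (X, y): X of shape (n, WIN, K*3) and matching labels.
    """
    T, K, _ = poses.shape
    starts = range(0, T - WIN + 1, stride)
    # each window becomes a (WIN, K*3) sequence of flattened keypoints
    X = np.stack([poses[s:s + WIN].reshape(WIN, K * 3) for s in starts])
    y = np.full(len(X), label)
    return X, y
```

Windows built this way would be passed to an LSTM whose per-window output is the hungriness score; emitting a score per window is also what lets the decoder track variability within a single trial, as described above.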
Another important question arises as to whether such a behavioral change might be an immediate, simple reaction to a particular action by the other, rather than a reflection of the other's internal state behind the sequence of its actions. The comparison between the LSTMpartner and LSTMsubject outputs exhibited a positive correlation, which indicated that an immediate action by the subject was related to the sequence of the partner's actions at that moment, regardless of the partner's internal state (Fig. 5a). Concurrently, at any level of the LSTMpartner output, the LSTMsubject output was consistently higher in the hungry than in the full condition (Fig. 5b).
This result implied that one's reaction toward the same sort of action by the other changed according to the internal state. As an example of such behavioral adjustment depending on the other's internal state, we found that, in a pair of marmosets, the gaze behavior of the subject changed according to the partner's internal state. One marmoset sometimes looked back at the other when the other looked at it (Fig. 5c). This look-back behavior was seen more frequently in the hungry than in the full condition (Fig. 5d), again indicating that the subject's reaction toward the same action by the partner changed based on the partner's internal state. The overall results demonstrated the cognitive complexity of marmosets in a social context, elucidating that they flexibly adjust their behaviors depending on others' internal states, which are not readily observable from an immediate action alone.

Progressive manifestation of motor deficits in a marmoset model of PD as revealed by unsupervised clustering
In the third set of our experiments, we evaluated whether a completely unsupervised approach might allow us to detect behavioral changes in response to pathological conditions when relatively large-scale 3D pose data are available. To this end, we analyzed symptomatic behaviors in a marmoset model of PD. It is well known that PD progressively manifests motor deficits, such as akinesia, rigidity, and tremor, which are caused by degeneration/loss of dopaminergic neurons in the substantia nigra pars compacta (SNc) 50,51. Given that over-expression of mutant variants of alpha-synuclein (α-syn) emulates the progressive aspect of the disease, much emphasis has been placed on the notion that an animal model produced by α-syn over-expression is suitable for PD research 52,53. In this model, however, observations over months or even years are required for behavioral assessment of phenotype expression, and, therefore, automated quantification of symptomatic behaviors is indispensable. Here, we generated a PD model marmoset by injecting a combination of an adeno-associated virus (AAV) vector 54 carrying the mutant α-syn gene 55,56 and pathological α-syn fibrils 57 into the nigra on one side of the brain (Fig. 6a). Histological analysis using tyrosine hydroxylase (TH) immunostaining after the behavioral observation confirmed loss of dopaminergic neurons from the SNc. With this PD model, motor activity was monitored for two days per month over one year.
Employing our analytic pipeline in a fully unsupervised manner without any a priori hypothesis, we could identify a number of behavioral changes in the marmoset PD model. First, by means of a dimensionality reduction and clustering approach, we determined action motifs, i.e., patterns of 3D pose time-series data repeatedly observed throughout the recording period (Fig. 6b; for details, see the Methods section). We found that some actions were occasionally observed before the surgery, while others gradually appeared after the surgery (Fig. 6c). Specific behavioral actions, such as running, turning, and jumping from the wooden perch, were reduced after the surgery (Fig. 6d, e; upper panels). Conversely, various types of "stay" actions were increasingly observed several months after the surgery (Fig. 6d, e; bottom panels).
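The motif-discovery logic above can be illustrated in miniature: embed fixed-length pose windows into a low-dimensional space and cluster the embeddings, so that recurring pose patterns fall into the same cluster. The specific choices below (PCA via SVD, plain k-means) are our stand-ins; the study's actual embedding and clustering methods are described in its Methods section.

```python
import numpy as np

def pca_embed(X, n_components=2):
    """Project rows of X onto their top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(Z, k, iters=50, seed=0):
    """Minimal k-means: returns an integer cluster label per row of Z."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((Z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels
```

Counting how often each cluster (motif) occurs per recording session, before vs. after the surgery, is then what reveals actions that disappear or newly emerge over the disease course.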
Interestingly, apparently similar postures were classified into different clusters, notably by the difference in the neck angle (Fig. 6f). Some postures were seen more frequently, whereas others were observed less frequently after the surgery (Fig. 6g). After three months, an increased tonus of the neck muscles markedly appeared on the contralateral side, as evidenced by the finding that the head bent toward the side opposite the nigral injection site.
We further quantified the amount of gross movement (as an index of reduced locomotion) and the head posture based on the 3D pose time-series data, and thereby confirmed the progression of the symptomatic behaviors identified by the unsupervised analysis (Fig. 6h-j).
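The gross-movement index mentioned above can be computed directly from the pose time series, for instance as the total path length of the body centroid over a session. Using the centroid of all keypoints is our simplification of whatever body reference the study actually used.

```python
import numpy as np

def gross_movement(poses):
    """Total path length of the keypoint centroid over a session.

    poses: (T, K, 3) array of 3D keypoints (mm).
    Returns the summed frame-to-frame centroid displacement (mm).
    """
    centroid = poses.mean(axis=1)                      # (T, 3) per frame
    steps = np.linalg.norm(np.diff(centroid, axis=0), axis=1)
    return steps.sum()
```

A progressive drop in this index across monthly sessions would mirror the reduced locomotion reported for the PD model.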
The overall results indicated that the parkinsonian phenotypes induced by α-syn over-expression gradually progressed, suggesting that our system allows longitudinal and high-throughput evaluation of symptomatic behaviors in brain disease models without any behavioral tasks.

Discussion
In the present study, we have developed an analytic pipeline that permits automated and high-throughput quantification of the natural behaviors of common marmosets using a markerless motion capture system consisting of multiple deep neural networks. With the large-scale ground truth dataset, the decoding accuracy reached the best performance that we could expect at the present time. Applying this system, we have revealed that our approach is capable of detecting behavioral changes under a variety of experimental conditions, such as differential contributions of males vs. females to parenting in breeding families, flexible behavioral adjustment depending on others' internal state, and progressive manifestation of motor impairments in a PD model. Our results provide a novel framework for many research areas of neuroscience using NHPs by introducing objective and large-scale quantification of animal behavior. It should also be noted, however, that there are some limitations on the use of the analytic pipeline developed in this study. First, the proposed system is able to quantify only those variations of behavioral actions that are represented by 18 keypoints. Thus, other types of actions, such as facial expressions, cannot be quantified 62. Second, careful assessment is needed to confirm that behavioral data obtained from our system are not attributable to erroneous tracking of individual animals. The 3D pose time-course data may sometimes be derived from a mixture of multiple animals, although such errors are rare, as shown in Figure 2c, d. In severe conditions where individual recognition is inaccurate, an alternative system should be called upon to address this issue specifically 58.
Recent technological innovations have attracted much attention to experimental paradigms with freely moving marmosets. Large-scale telemetric recordings of neuronal activity have been carried out successfully 31, and electrocorticography recordings from almost the entire lateral hemisphere have also been reported 59,60. Combining these recording techniques with our analytic pipeline allows a comprehensive understanding of the correlation between cortical signals and behavioral dynamics. This could be an appropriate methodology to explore the cortical circuitry related to behavioral actions of particular interest. Furthermore, optogenetic 61 and chemogenetic 62 approaches, which are also compatible with freely-moving experimental conditions, enable us to disclose the causal role of a specific neural circuit in the expression of a given type of natural behavior. Until recently, major efforts have been made to assess the motor and cognitive functions of NHPs through analysis of eye/hand movements as the behavioral output. Now, AI-based innovations make it increasingly possible to quantify and evaluate social interactions in a given animal population with high efficiency 3. This may make it feasible to elucidate the neural mechanisms underlying behavioral theories so far intensively explored in socio-ecological and ethological studies, for example, the Machiavellian intelligence hypothesis, in which expansion of the cerebral cortex, especially the frontal lobe, leads to adaptation to the social complexity of daily life [63][64][65].
The novel pipeline that we have established for 3D pose time-series analysis of a group of marmosets can be utilized in various experimental environments and laboratories. All that is required is to estimate the camera calibration parameters for accurate 3D reconstruction and to refine the neural networks for detection, identification, and pose estimation of individuals. Concerning the former requirement, at least a two-camera system should work, though our experiments were conducted with eight cameras to enhance robustness and accuracy, and the data needed for calibration can be acquired within hours. With respect to the latter requirement, the neural networks for 2D analysis should be re-tuned to each experimental environment or laboratory because of differences in various factors, such as background, lighting, and camera angle. In our experiments, we provided a substantial amount of ground truth data to achieve robust 3D analysis, which will be of immense help for adapting the neural networks to specific environments and achieving high performance. In recent years, several tools, for example, "style transfer" 66,67, have further supported transfer learning of the networks from one environment to others. Moreover, while our analytic pipeline has been highly optimized for marmosets, it can be customized for other species as well.
The present study has revealed the broad applicability of 3D pose data, as evidenced by the wide range of behavioral parameters beyond spatiotemporal kinematics that can be quantified via a proper choice of downstream analytic methodologies (Fig. 7). The simplest method is to detect specific behavioral events by defining spatiotemporal parameters derived from certain combinations of 3D keypoints, as demonstrated in the food-sharing experiment.
A key factor for success with this method is appropriate feature engineering suited to detecting the target event, together with parameter tuning on a small set of supervised data, both of which should be performed by experts in animal behavioral observation. Moreover, we have elucidated that simple spatiotemporal data concerning the 3D poses permit quantification of the internal state of marmosets when combined with cutting-edge neural networks, for instance, a recurrent neural network (i.e., LSTM) in the present study. This brings a unique opportunity to study the mind behind complex social behavior in primates.
Finally, a fully unsupervised data mining approach is capable of disclosing behavioral changes induced by pathophysiological manipulations, as shown in the PD model experiment. This approach is especially beneficial for exploring behavioral changes comprehensively when a substantial amount of data is available. Such methodological innovations are greatly meritorious, given that the behavioral complexity inherent in NHPs substantially complicates the assessment of models of neurological/psychiatric/developmental disorders. The high-throughput and versatile nature of our evaluation system will play a critical role in establishing a new standard for quantifying the social/pathophysiological behaviors of NHPs.

Animals
All procedures for the use of and experiments on common marmosets were approved by the Animal Welfare and Animal Care Committee of the Center for the Evolutionary Origins of Human Behavior, Kyoto University, in accordance with the Guidelines for Care and Use of Nonhuman Primates established by the same institution. First, 29 marmosets (ranging from 1.5 months old to 12 years old; 13 males and 16 females) were used to create the ground truth dataset. Four adult and two infant marmosets from two families participated in the food-sharing experiment. Then, two pairs of adult marmosets were utilized for the food competition experiment, and one adult marmoset was used for the PD model experiment.

Recording system
The recording booth was a 90-cm cubic box consisting of transparent acrylic walls and a metal mesh floor and ceiling. This recording booth was designed to keep up to four animals under the Ethical Guidelines of the Japan Neuroscience Society and was equipped with the common items required for a normal marmoset cage, such as water bottles, food boxes, and perches.
Videographic images were recorded by the Motif system (Loopbio, Lange G, Wien, Austria), which synchronized eight machine vision cameras (2048 x 1536 pixels, 24 fps). The cameras were arranged horizontally at equal intervals surrounding the recording booth. The viewing angle of each camera was set at 110 x 70 degrees to cover the whole booth.
To accomplish accurate 3D reconstruction, we obtained intrinsic (e.g., lens distortion coefficients) and extrinsic (e.g., camera positions) camera calibration parameters with the OpenCV framework as follows: the intrinsic parameters were obtained by cv2.omnidir.calibrate using images of a checkerboard pattern recorded by each camera, and the extrinsic parameters were initialized by the cv2.solvePnP function using the 3D coordinates of a set of landmark positions in the recording booth and their 2D coordinates projected onto the camera image. To improve the calibration accuracy, we further optimized both the intrinsic and extrinsic parameters simultaneously by minimizing the projection (reconstruction) errors of the trajectory of a small object (a ping-pong ball) moved inside the recording cage 68.

Ground truth dataset
Our keypoint schema follows that of the MacaquePose 12 dataset, with slight modifications to fit the analysis of the whole-body movements of marmosets. Specifically, we annotated 20 keypoints consisting of the nose, eyes (left and right), ears, shoulders, elbows, wrists, hips, knees, ankles, back, and the middle and tip of the tail (the last two keypoints were not used in the analytic pipeline). The annotators were trained with movies of marmosets whose body parts corresponding to the keypoints were marked with paint markers. The annotations were performed in a 3D manner using custom-made software, in which the annotations of a single body were a collection of 3D positions constructed through triangulation of 2D positions across all cameras.
While the 3D positions could be computed by triangulation once a single keypoint was annotated in more than two cameras, the annotators visually confirmed every keypoint in all cameras to maximize precision. We used images from 29 different marmosets and annotated 7404 bodies in 3D space, which were equivalent to 56103 bodies in 2D space. We selected scenes from different behavioral contexts: 732 bodies from full-day recordings of a single animal, 654 bodies from those of two animals, 2010 bodies from three-animal recordings, and 4008 bodies from four-animal recordings. The annotation frames were semi-manually selected to maximize the variation of the behavioral contents.

Markerless 3D multi-animal pose estimation
The analytic pipeline started with the analysis of 2D images taken from each camera (Fig. 1b, 2D processes). The detection network located the marmosets in the image of each frame and generated a set of bounding boxes, i.e., the smallest rectangles enclosing a marmoset, as regions of interest. Then, the pose network estimated 18 keypoints, and the ID network estimated an animal ID, for each bounding box. The bounding boxes were combined along the time axis at the 2D level to construct so-called 2D tracklets, namely time-series data consisting of the regions of a marmoset associated with the postures and animal IDs. As multiple bounding boxes could be detected in each frame, the bounding boxes that seemed to correspond to a single marmoset were combined based on the consistency of the marmoset's positions across frames. At this stage, the 2D tracklets were still fragmented into short durations, because an animal occluded by objects or other animals could not be tracked continuously. The 2D processing was implemented using OpenMMLab 69, a set of image processing libraries for deep neural networks. The network architectures used here were yolox-l 70 and resnet-50 71 for detection and identification, respectively. The pose networks were hrnet-w32 72 for both the food-sharing and food competition experiments, and dekr-hrnet-w48 73 for the PD model experiment.
The connections of bounding boxes to construct 2D tracklets were performed by using ByteTrack 74 . Subsequently, the 3D pose time-series data on each animal were obtained in four steps.
The first to third steps corresponded to the camera association and the fourth step to the 3D optimization in Figure 1a.
As the first step, in each frame, we grouped the bounding boxes (tracklets were not used here) that likely belonged to the same marmoset across different cameras. This process was performed only in key frames, set every 0.5 sec, to reduce the computational load. We searched for the optimal grouping of bounding boxes by minimizing the geometric inconsistency (i.e., the inverse of the so-called pose affinity score 75 ) between the boxes from different cameras within a group. We defined the geometric inconsistency Dg between two poses i and j as below:

$$D_g(i, j) = \frac{1}{N} \sum_{n=1}^{N} d\left(x_n^i,\, l(x_n^j)\right),$$

where $x_n^i$ indicated the 2D position of the n-th keypoint of pose i, $l(x_n^j)$ the projection line associated with $x_n^j$ from a different camera, and $d(\cdot, l)$ the point-to-line distance for l.
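As a hedged illustration of the point-to-line term in Dg: assuming the relative geometry of two cameras is summarized by a fundamental matrix F (the 3x3 matrix below is a toy example, not a value from the study), the projection line l(x) of a keypoint from the other camera is its epipolar line, and the inconsistency is the mean point-to-line distance.

```python
import numpy as np

def point_line_dist(pt, line):
    """Distance from 2D point `pt` to the homogeneous line ax+by+c=0."""
    a, b, c = line
    x, y = pt
    return abs(a * x + b * y + c) / np.hypot(a, b)

def geometric_inconsistency(kps_i, kps_j, F):
    """Sketch of Dg: mean distance from each keypoint of pose i (view 2)
    to the epipolar line induced by the matching keypoint of pose j
    (view 1), with fundamental matrix F assumed known."""
    dists = []
    for xi, xj in zip(kps_i, kps_j):
        line = F @ np.append(xj, 1.0)  # epipolar line of xj in view 2
        dists.append(point_line_dist(xi, line))
    return float(np.mean(dists))

# Toy F for two cameras (maps view-1 points to view-2 epipolar lines).
F = np.array([[0., -4., 0.],
              [0.,  0., 4.],
              [0.,  0., 0.]])
kps_view1 = np.array([[0.2, 0.4]])    # projection of a 3D point in view 1
kps_view2 = np.array([[5/3, 2/3]])    # projection of the same point in view 2

d_match = geometric_inconsistency(kps_view2, kps_view1, F)      # ~0
d_wrong = geometric_inconsistency(kps_view2 + 1.0, kps_view1, F)  # large
```

A correctly grouped pair of detections yields a near-zero inconsistency, while a mismatched pair yields a large one, which is what the grouping optimization exploits.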
The optimization was performed according to the algorithm proposed by Dong et al. 80 . Once the grouping of bounding boxes was established, we constructed the 3D pose of a marmoset for each group of bounding boxes by triangulation of the 2D poses in each key frame. We thereby obtained the 3D poses of marmosets in every key frame, while their temporal association remained undetermined.
As the second step, the matching of the same animal over time was performed as follows. A combination of 3D poses across adjacent key frames can be considered, in graph theory, as the maximum matching M of a complete bipartite graph G = (S, T; E) with a non-negative edge cost $c: E \to \mathbb{R}_{\geq 0}$, where S and T are the sets of 3D poses in key frames t and t+1, respectively. Here we defined the cost c(i, j) for the edge connecting poses Si and Tj as below:

$$c(i, j) = \frac{1}{N} \sum_{n=1}^{N} D\left(P_n^{S_i}, P_n^{T_j}\right),$$

where N indicated the number of keypoints, $P_n^{S_i}$ and $P_n^{T_j}$ represented the 3D position of the n-th keypoint of the 3D poses Si and Tj, respectively, and $D(\cdot,\cdot)$ was the distance between two points in a 3D space. This cost represented the geometrical inconsistency of a pair of 3D poses.
The maximum matching M was obtained by minimizing the total cost $\sum_{(i,j) \in M} c(i, j)$ through the Hungarian algorithm. In addition, edge connections were removed if the geometrical inconsistency per keypoint exceeded an empirically determined threshold T1 = 150. The frames between key frames were filled in using the continuity of the 2D tracklets, i.e., the combinations of multiple bounding boxes over time in a 2D space. Through this process, we obtained 3D tracklets, namely time-series data on 3D posture.
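The bipartite matching across key frames can be sketched with SciPy's Hungarian solver; the pose arrays and the jitter below are synthetic stand-ins.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

T1 = 150.0  # max allowed geometric inconsistency per keypoint (from the text)

def match_poses(poses_t, poses_t1):
    """Match 3D poses between key frames t and t+1.
    poses_*: arrays of shape (n_animals, n_keypoints, 3).
    Returns a list of matched (i, j) index pairs."""
    # c(i, j): mean 3D keypoint distance between pose i and pose j.
    cost = np.linalg.norm(
        poses_t[:, None] - poses_t1[None, :], axis=-1).mean(axis=-1)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    # Drop matches whose per-keypoint inconsistency exceeds T1.
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= T1]

rng = np.random.default_rng(0)
frame_t = rng.uniform(0, 500, size=(3, 18, 3))   # three animals, 18 keypoints
# Same animals in the next key frame: permuted order plus small jitter.
frame_t1 = frame_t[[2, 0, 1]] + rng.normal(0, 5, size=(3, 18, 3))
print(match_poses(frame_t, frame_t1))  # recovers the permutation
```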
Third, a marmoset ID was assigned to each 3D tracklet. Within a sliding time window (5 sec), all the bounding boxes belonging to a single 3D tracklet were considered, and the ID was assigned in every frame if the following criterion was satisfied:

$$N_{id} \geq T_2 \quad \text{and} \quad \frac{N_{id}}{N} \geq T_3,$$

where $N_{id}$ was the number of instances of the most frequently observed ID, N was the number of all bounding boxes taken from all cameras, and T2 and T3 were hyperparameters, set as 12 and 0.8, respectively. If the same ID was assigned to different tracklets at the same time point, the ID was given only to the tracklet with the highest $N_{id}$. A 3D tracklet was divided at the time point when the ID assigned by the above criterion changed within the 3D tracklet.
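A minimal sketch of the ID-assignment rule, assuming (as the listed values of T2, a count, and T3, a fraction, suggest) that the majority ID must occur at least T2 times and account for at least a fraction T3 of all boxes in the window:

```python
from collections import Counter

T2, T3 = 12, 0.8  # hyperparameters from the text

def assign_id(window_ids):
    """Assign an ID to a 3D tracklet from the per-box ID estimates
    collected within one sliding window, or None if the criterion
    (reconstructed as: count >= T2 and fraction >= T3) fails."""
    if not window_ids:
        return None
    (best_id, n_id), = Counter(window_ids).most_common(1)
    if n_id >= T2 and n_id / len(window_ids) >= T3:
        return best_id
    return None

assert assign_id(["A"] * 15 + ["B"] * 2) == "A"   # 15 boxes, 15/17 > 0.8
assert assign_id(["A"] * 15 + ["B"] * 5) is None  # 15/20 = 0.75 < T3
assert assign_id(["A"] * 8) is None               # 8 < T2
```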
The fourth step was the final refinement of the 3D tracklets. There might be cases where multiple 3D tracklets that should correspond to the same animal were dissociated due to failures in the previous steps. To compensate for such cases, these tracklets were integrated by the following procedure. Suppose that there was a tracklet that had not yet been assigned an ID, TnoID, and a tracklet that had been assigned an ID, TwithID. During the period when the two 3D tracklets overlapped, if the difference between their 3D trajectories was less than the error threshold T4 = 200, then the ID of TwithID was propagated to TnoID. This procedure was repeated twice over the entire dataset. Furthermore, for tracklets that had still not been assigned an ID, we assigned the remaining ID if the IDs of all but one animal had been assigned. Finally, the tracklets with the same ID were integrated, and the resulting 3D pose time-series data on individual marmosets were spatiotemporally smoothed and normalized via anipose 73 .
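The ID propagation of the fourth step might look as follows; the tracklet representation (a frame list plus a per-frame centroid and an ID) is a hypothetical simplification, not the study's data structure.

```python
import numpy as np

T4 = 200.0  # trajectory error threshold from the text

def propagate_id(t_with_id, t_no_id):
    """If two 3D tracklets overlap in time and their trajectories
    differ on average by less than T4, copy the known ID to the
    unidentified tracklet. Tracklets are dicts with 'frames'
    (frame indices), 'xyz' (one 3D centroid per frame), and 'id'."""
    common = np.intersect1d(t_with_id["frames"], t_no_id["frames"])
    if common.size == 0:
        return False  # no temporal overlap, nothing to compare
    a = {f: p for f, p in zip(t_with_id["frames"], t_with_id["xyz"])}
    b = {f: p for f, p in zip(t_no_id["frames"], t_no_id["xyz"])}
    err = np.mean([np.linalg.norm(np.subtract(a[f], b[f])) for f in common])
    if err < T4:
        t_no_id["id"] = t_with_id["id"]
        return True
    return False

known = {"frames": [0, 1, 2, 3], "xyz": [(0, 0, 0)] * 4, "id": "M1"}
orphan = {"frames": [2, 3, 4], "xyz": [(10, 0, 0)] * 3, "id": None}
propagate_id(known, orphan)
print(orphan["id"])  # the ID "M1" is propagated (distance 10 < T4)
```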

Food-sharing experiment
Two marmoset families participated in this experiment. Each family consisted of a father, a mother, and their infant, who was about three months of age at the start of the experiment. A piece of home-made Arabian gumball was given to each of the parents simultaneously, and their social interactions were then observed. When both gumballs were consumed, new ones were again given to the parents separately. The experiment was carried out for about 30 min per day and repeated for 12 or 16 days in the two families.
Food-sharing events were detected by the following procedure. First, 3D pose data on the three individuals in each family were obtained with the analytic pipeline. At this stage, the 3D pose data were independent across the animals and were not suitable for detecting social behavior.
Therefore, we created new features by combining the 3D pose data about the infant and parents.
Specifically, we calculated the distances between the mouth or the left or right hand of the infant and those of each parent. Considering all combinations for a pair of the infant and one of his/her parents, this process generated 9 different values for each time frame. The smallest of these values was taken for each frame, yielding time-series data (D), from which the first derivative (V) was also obtained. A food-sharing event was marked when there were at least TN consecutive frames in which D and V exceeded the detection parameters Td and Tv, respectively. To optimize these detection parameters and to evaluate the detection accuracy, a human observer counted the occurrences of food-sharing events in a subset of the entire dataset. The human observer coded the presence or absence of food-sharing events for every 15 sec, over 90 min in total. The detection parameters were optimized by maximizing the consistency with the human coding, using 25% of the annotation data.
The detection accuracy was obtained from the rest of the annotation data. The Precision-Recall curve shown in Figure 3d was obtained by varying Td from the optimal value. The statistical significance of the difference between the father and the mother in the food-sharing events was evaluated with a paired two-tailed t-test (α = 0.05), with the number of observations on each recording day as independent data points.
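The run-length detection of food-sharing events described above can be sketched as follows; the threshold values and the synthetic D and V traces are illustrative only.

```python
import numpy as np

def detect_events(D, V, Td, Tv, TN):
    """Mark an event wherever the condition on the distance series D
    and its derivative V holds for at least TN consecutive frames
    (thresholds Td, Tv; comparison direction follows the text)."""
    cond = (D > Td) & (V > Tv)
    events, start = [], None
    for i, c in enumerate(np.append(cond, False)):  # sentinel ends the last run
        if c and start is None:
            start = i
        elif not c and start is not None:
            if i - start >= TN:
                events.append((start, i))  # [start, end) frame range
            start = None
    return events

# Synthetic traces: two sustained episodes and one single-frame blip.
D = np.array([0, 5, 6, 7, 6, 0, 6, 0, 6, 6, 6, 6, 0], float)
V = np.ones_like(D)
print(detect_events(D, V, Td=4, Tv=0.5, TN=3))  # two events; the blip is rejected
```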

Food competition experiment
Two pairs of marmosets were used for this experiment. For each pair, the subject and partner animals were familiar with each other, as they had been kept in the same cage. Food deprivation was started on the evening of the day before the experiment, and, therefore, both animals were in a hungry state at the start of the experiment. Just before the experiment, the partner's state was controlled to be either hungry or full by the following procedure. The partner was separated from the subject immediately before the experiment so that they could not see each other.
Training was continued until the learning curve reached a plateau. The best network weights over the training iterations were selected based on the prediction performance for the rest of the dataset, which had not been used for training. In the analysis of Figure 5c and d, a look-back behavior was defined as the moment when the head direction of the subject (calculated as described above) fell below 40 degrees; these events were aligned to the onset of the partner's gaze, defined as the partner's head direction falling below 40 degrees and being maintained for more than 800 ms. The bar graph in Figure 5d denotes the sum of the look-back behaviors between 0.5 and 1.5 sec after the onset of the partner's gaze. The statistical significance of the differences between the conditions was evaluated with an unpaired two-sided t-test.
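The look-back analysis can be sketched from head-direction time series alone; the frame rate below is an assumption, and the computed fraction is only one plausible reading of the summed quantity in Figure 5d.

```python
import numpy as np

FPS = 25           # hypothetical frame rate, not stated in the text
GAZE_DEG = 40.0    # head-direction threshold from the text
MIN_GAZE_S = 0.8   # partner's gaze must be sustained this long (800 ms)

def gaze_onsets(partner_deg):
    """Frames where the partner's head direction drops below 40 deg
    and stays there for at least 800 ms."""
    below = partner_deg < GAZE_DEG
    min_len = int(MIN_GAZE_S * FPS)
    onsets, start = [], None
    for i, b in enumerate(np.append(below, False)):  # sentinel ends the last run
        if b and start is None:
            start = i
        elif not b and start is not None:
            if i - start >= min_len:
                onsets.append(start)
            start = None
    return onsets

def look_back_rate(subject_deg, partner_deg, lo=0.5, hi=1.5):
    """Fraction of frames within 0.5-1.5 s after each gaze onset in
    which the subject's head direction is below 40 deg."""
    hits, total = 0, 0
    for onset in gaze_onsets(partner_deg):
        w = subject_deg[onset + int(lo * FPS): onset + int(hi * FPS)]
        hits += int(np.sum(w < GAZE_DEG))
        total += len(w)
    return hits / total if total else 0.0

# Synthetic traces: one sustained partner gaze starting at frame 50,
# to which the subject responds by looking back.
partner = np.concatenate([np.full(50, 90.0), np.full(30, 10.0), np.full(50, 90.0)])
subject = np.full(130, 90.0)
subject[60:90] = 20.0
print(gaze_onsets(partner), look_back_rate(subject, partner))
```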

PD model experiment
A PD marmoset model was produced by unilateral injections of both a virus vector expressing mutant α-syn and pathological α-syn fibrils into the SNc. A total of 12 µl of solution, consisting of 4 µl of AAV2.1-hTH-α-syn (G51D) (4.88x10e13 gc/ml) 54 and 8 µl of the fibrils (5 mg/ml) 57 , was used for the injections (the injection procedure is described below). The recordings were started two months before the surgery and continued until 12 months after the surgery.
After the behavioral assessment, immunohistochemical analysis was performed to confirm the loss of dopamine neurons from the SNc. The animal was deeply anesthetized with ketamine hydrochloride (40 mg/kg, i.m.) and sodium secobarbital (50 mg/kg, i.v.), and perfused transcardially with 0.1 M phosphate-buffered saline (PBS) followed by 4% paraformaldehyde in 0.1 M phosphate buffer (pH 7.4). Then, the brain was removed from the skull, postfixed overnight, and saturated with 30% sucrose at 4°C. Coronal sections were cut serially at 40-µm thickness on a freezing microtome. A series of every tenth section was used for tyrosine hydroxylase (TH) immunostaining. The sections were pretreated with 0.3% H2O2 for 30 min and immersed in 1% skim milk for 2 hr. The sections were then incubated for 48 hr at 4°C with mouse anti-TH antibody (1:2,000; Millipore, Burlington, MA) in 0.1 M PBS containing 2% normal donkey serum and 0.1% Triton X-100. Subsequently, the sections were incubated with biotinylated donkey anti-mouse IgG antibody (1:1,000; Jackson ImmunoResearch, West Grove, PA) for 2 hr at room temperature in the same fresh medium, followed by the avidin-biotin-peroxidase complex (ABC Elite; 1:200; Vector Laboratories, Burlingame, USA) in 0.1 M PBS for 2 hr at room temperature. Finally, the antigen was visualized with diaminobenzidine (DAB) containing nickel ammonium sulfate (0.01% DAB, 1.0% nickel ammonium sulfate, and 0.0003% H2O2). The sections were mounted onto gelatin-coated glass slides and counterstained with 1% Neutral red.
An unsupervised clustering of behavioral actions was performed by using time-series data of action features, which were computed from the 3D pose data as follows. First, the aligned postures were obtained as described in the previous section. Then, the spectrogram representation (0.05-12.8 Hz) of these data was obtained with the fast Fourier transform, so that the data at a single time point contained not only instantaneous postural information but also the dynamics of the postures. Finally, the action features used for the clustering were created by adding the locomotion vector to this spectrogram. The clustering was carried out by the k-means method with the number of classes fixed to 56, and, thus, all videographic frames throughout the entire recording period were classified into one of the 56 action clusters. Then, the time course of the occurrence rate of each action class was obtained as shown in Figure 6c. The order of these action clusters was defined by the following procedure. The 56-dimensional time-series data representing the action occurrence rates were analyzed by principal component analysis (PCA). The first principal component (PC1) showed a monotonic increment, in which the score was low before the surgery and gradually increased after the surgery. Therefore, the order of the coefficients of PC1 was used as the order of the action clusters. In other words, the actions with small cluster numbers were frequently observed after the surgery, and those with larger cluster numbers were often observed before the surgery. In Figure 6j, the azimuth and tilt of the head were calculated from the vector from the midpoint of the shoulders to that of the eyes in the aligned posture. For both the movements and the head angles, the errors were estimated by the bootstrap method. All data during the pre-surgery period were used to estimate the 95% confidence intervals. The mean for every 15 min was taken as an independent
data point, and the bootstrap sampling was repeated 2000 times.

Fig. 3 (panels e and f): e: Food-sharing events between the male/female and their infant predicted from the AI-based analysis for an example session. f: Rates of food-sharing events averaged across days. Our behavioral quantification via the AI-based pipeline reveals that food-sharing events with infants occur more frequently in male than in female parents (df=25, t=4.55, p=0.0001). However, by combining the data with proper downstream analytic methodologies, a wide spectrum of behavioral parameters can be elucidated based on the 3D poses alone.
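The clustering and cluster-ordering procedure described above can be sketched on synthetic features; SciPy's k-means and an SVD-based PC1 stand in for the unspecified implementations (note that the sign of PC1 is arbitrary, so the direction of the ordering would in practice be fixed by the monotonicity of the scores over time).

```python
import numpy as np
from scipy.cluster.vq import kmeans2

N_CLUSTERS = 56  # number of action classes fixed in the text

def cluster_and_order(features, window):
    """k-means over per-frame action features (spectrogram plus
    locomotion, assumed precomputed), then ordering of clusters by
    their loadings on the first principal component of the
    per-cluster occurrence-rate time course."""
    _, labels = kmeans2(features, N_CLUSTERS, minit="++", seed=0)
    # Occurrence rate of each cluster per time bin.
    n_bins = len(labels) // window
    rates = np.zeros((n_bins, N_CLUSTERS))
    for b in range(n_bins):
        chunk = labels[b * window:(b + 1) * window]
        rates[b] = np.bincount(chunk, minlength=N_CLUSTERS) / len(chunk)
    # PC1 via SVD of the mean-centered rate matrix.
    centered = rates - rates.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    order = np.argsort(vt[0])  # cluster order = order of PC1 coefficients
    return labels, rates, order

rng = np.random.default_rng(1)
features = rng.normal(size=(1120, 6))  # synthetic stand-in for real features
labels, rates, order = cluster_and_order(features, window=112)
```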
The solution was injected into four rostrocaudally and mediolaterally different loci of the SNc through a 10-µl Hamilton microsyringe (30 gauge) over 35 min per penetration. The injection coordinates were adjusted individually based on MR images. A surgical navigation system (Brainsight, Rogue Research, Montréal, Québec, Canada) was used to accurately guide the position of the injection sites 76 . The animal was anesthetized with ketamine hydrochloride (20-40 mg/kg, i.m.) and maintained with isoflurane (1-2%) during the surgery, while SpO2, heart rate, and rectal temperature were monitored. A water-heating circulator was used to control the body temperature. An analgesic (meloxicam; 0.1-0.2 mg/kg, i.m.) was also administered before and for a couple of days after the injection. Behavioral observations were conducted once a month. The marmoset was moved to the recording booth and allowed to stay there for two days. Food pellets were supplied once a day, and water was available ad libitum. Video recordings were done for 20 min per hour from 9 a.m. to 4 p.m. (a total of 160 min per day).

Fig. 2. Exemplified annotations and decoding accuracy of the analytic pipeline.

Fig. 3. Differential contributions of father vs. mother marmosets to food-sharing events with their infants.

Fig. 4. Quantification of the internal state predicted from actions.