Abstract
The rhesus macaque is an important model species in several branches of science, including neuroscience, psychology, ethology, and several fields of medicine. The utility of the macaque model would be greatly enhanced by the ability to precisely measure its behavior, specifically, its pose (position of multiple major body landmarks) in freely moving conditions. Existing approaches do not provide sufficient tracking. Here, we describe OpenMonkeyStudio, a novel deep learning-based markerless motion capture system for estimating 3D pose in freely moving macaques in large unconstrained environments. Our system makes use of 62 precisely calibrated and synchronized machine vision cameras that encircle an open 2.45m×2.45m×2.75m enclosure. The resulting multiview image streams allow for novel data augmentation via 3D reconstruction of hand-annotated images that in turn train a robust view-invariant deep neural network model. This view invariance represents an important advance over previous markerless 2D tracking approaches, and allows fully automatic pose inference on unconstrained natural motion. We show that OpenMonkeyStudio can be used to accurately recognize actions and track two monkey social interactions without human intervention. We also make the training data (195,228 images) and trained detection model publicly available.
Introduction
Rhesus macaques are one of the most important model organisms in the life sciences, e.g. (1–3). They are invaluable stand-ins for humans in neuroscience and psychology. They are a standard comparison species in comparative psychology. They are a well studied group in ethology, behavioral ecology, and animal psychology. They are crucial disease models for infection, stroke, heart disease, AIDS, and several others. In all of these domains of research, characterization of macaque behavior provides an indispensable source of data for hypothesis testing. Macaques evolved to move gracefully through large three-dimensional spaces (3D) using four limbs coordinated with head, body, and tail movement. The details of this 3D movement provide a rich stream of information about the macaque’s behavioral state, allowing us to draw inferences about the interaction between the animal and its world (4–9).
We typically measure only a fraction of available information about body movement generated by our research subjects. For example, joystick, button press, and gaze tracking measure a very limited range of motion from a single modality. One could potentially incorporate more such measurement devices, but there are practical limits in training and use. More broadly, it is possible to divide movement into actions that take account of the entire body by delineating an ethogram, which expressly characterizes and interprets full body positions and actions (see, for example, Sade, 1973). However, ethograms can generally only be done by highly trained human observers, are labor intensive, costly, imprecise, and susceptible to human judgment errors (12). These limitations greatly constrain the types of science that can be done and therefore the potential value of that research.
For these reasons, the automated measurements of 3D macaque pose is an important goal (13). Pose, here, refers to a precise description of the position of all major body parts (landmarks) in relation to each other and to the physical environment. Pose estimation can currently be done with a high degree of accuracy by commercial marker-based motion capture systems (e.g., Vicon, OptiTrack, and PhaseSpace). Macaques, however, are particularly ill-suited for these marker-based systems. Their long, dense, and fastgrowing fur makes most machine-detectable markers difficult to attach and creates a great deal of occlusion. Their highly flexible skin makes markers shift position relative to bone structure during vigorous movement, which is common. Their agile hands and natural curiosity make them likely to remove most markers. The thickness and speed of growth of their fur makes skin marking impractical. They often show discomfort, and consequently unnatural movement regimes, with jackets and bodysuits.
Markerless motion capture offers the best possibility for a widely usable tracking system for macaques. Recent success in deep learning based 2D human pose estimation from RGB images opens a new opportunity for animal markerless motion capture (14–16). For instance, DeepLabCut leverages a pre-trained deep learning model (based on ImageNet) to accurately localize body landmarks. These methods work for various organisms like flies, worms, and mice by learning from a larger number of images collected from a single view (17–22). However, macaques present several problems that make current best markerless motion capture unworkable. First, they have a much greater range of possible body movements than other model organisms. Each body joint has multiple degrees of freedom, which generates a large number of distinctive poses associated with common activities such as bipedal/quadrupedal locomotion, grooming, and social interactions in even modestly sized environments. Second, they interact with the world in a fundamentally three dimensional way, and so they must be tracked in 3D. Existing 2D motion tracking learned from the visual data recorded by a single-view camera can only produce a view-dependent 2D representation. Thus, application to novel vantage points introduces substantial performance degradation. In principle, these problems could all be greatly mitigated by use of a sufficiently large database of annotated images. For example, state of the art approaches to human tracking are trained on 2.5 million annotated images (23–25). Such a database does not exist for macaques and would be prohibitively expensive to generate. Specifically, we estimate that to generate a macaque database similar to the one used for humans, it would take over hundreds of thousand distinct annotated images for each joint, and cost roughly $10M. Thus, successful tracking requires a larger pipeline that includes generating a large annotated pose database. This annotation problem simply cannot be overcome by better AI, and requires a qualitatively new approach.
Here, we present a description of OpenMonkeyStudio, a novel deep learning-based markerless motion capture system for rhesus macaques (Fig. 1). It solves the annotation problem through innovations in image acquisition, annotation and label generation, and augmentation through multiview 3D reconstruction. It then implements pose estimation using a deep neural network. Our system uses 62 cameras, which provides multiview image streams that can augment annotated data to a remarkable extent by leveraging 3D multiview geometry. However, while this large number of cameras is critical for training the pose detector, the resulting model can be used in other systems with fewer cameras without training. Our system generalizes readily across subjects and can simultaneously track two individuals. It is complemented by the OpenMonkeyPose dataset, a large database of annotated images (195,228 images) which we will make publicly available.
Results
OpenMonkeyStudio: automated markerless motion capture for macaques
We developed a multiview markerless motion capture system called OpenMonkeyStudio that reconstructs a full set of three-dimensional (3D) body landmarks (13 joints) in freely moving macaques (Fig. 2) without manual intervention. The system is composed of 62 synchronized high definition cameras that encircle a large open space (2.45m×2.45m×2.75m) and observe a macaque’s full body motion from all possible vantage points. For each image, a pose detector made of a deep neural network predicts a set of 2D locations of body landmarks that are triangulated to form the 3D pose given the camera calibration parameters (focal length, lens distortion, rotation and translation) as shown in Fig. 1.
We built the pose detector using a Convolutional Pose Machine (CPM, (10)), which learns the appearance of body landmarks (e.g., head, neck, and elbow) and their spatial relationship from a large number of annotated pose instances. Note that our framework is agnostic to the design of the underlying network, and therefore, other landmark detectors such as Stacked Hourglass (15) or DeeperCut ((16) also used in DeepLabCut (17)) can be complementary to the CPM.
Quantitative evaluation of OpenMonkeyStudio
Here we describe validations of the OpenMonkeyStudio system including its accuracy, effectiveness, and precision.
Accuracy
We evaluated the accuracy of head pose reconstruction by comparing to the best available marker-based video motion capture system (OptiTrack, NaturalPoint, Inc, Corvallis, OR). We chose the head because it is the only location where a marker can be reliably attached without disturbing free movement in macaques. Fig. 3 illustrates the head trajectory in three dimensions as measured by both methods over 13 minutes measured at 30 Hz. This illustrative sequence includes jumping and climbing (insets). Assuming that the OptiTrack system represents the ground truth (i.e. 0 error), the reconstruction by OpenMonkeyStudio has a median error of 6.76 cm, a mean error of 7.14 cm and a standard deviation: 2.34 cm. Note that there is a spatial bias due to marker attachment (roughly 5 cm above the head) reducing the actual error substantially. Note that these presumptive ground truth data from OptiTrack include obvious and frequent excursion errors as shown in Fig. 3. This is caused by marker confusion and occlusion, which requires additional manual post-processing to remove. Additionally infrared interference is a significant hurdle in a large enclosure such as ours. By contrast, OpenMonkeyStudio leverages visual semantics (appearance and spatial pose configuration) from images that can automatically associate the landmarks across time. That in turn makes our system more robust to confusion/occlusion errors, a common failure point in any motion capture system. It can even predict the occluded landmarks based on the learned spatial configuration, e.g., knowing shoulder and elbow joints is highly indicative of the occluded hand’s location.
Effectiveness of Multiview Augmentation
We evaluated the effectiveness of multiview augmentation that is used to generate the training data, i.e., how many cameras are needed for training. We measured the relative accuracy of tracking as a function of the number of cameras, each of which supplements view augmentation (m=1,2,4,8,16,32,48 cameras) compared to the model generated from the full system (m=62 cameras). For each pose detector trained by the augmentation with a factor of m, we reconstructed the 3D pose using n=62 cameras for a new testing sequence on different macaques. (Here, m and n denote the number of cameras used for training and testing, respectively). Among 62 cameras, we select the views for augmentation uniformly over camera placement. We compared the reconstructed poses with the pseudo ground truth reconstructed by the full model. Note that m=1 camera and m=2 cameras are equivalent to the single view approach (17) and stereo view approach (19). These are the special instances of OpenMonkeyStudio with limited view augmentation.
Fig. 4(a) illustrates the relative accuracy measured by the percentage of correct reconstruction for each landmark, i.e., how many testing instances are correctly reconstructed given the error tolerance (10 cm). For visually distinctive and relatively rigid landmarks such as nose, neck and head, relatively accurate reconstruction can be achieved by a small number of augmentations, e.g., training with a single view camera can produce approximately 65% of correct reconstruction. Note that even for these ostensibly “easy” landmarks, augmentation still provides substantial benefits. However, for the limb landmarks that have higher degrees of freedom (hands, knees, and feet) and are frequently self-occluded, their reconstructions are in particular vulnerable to a viewpoint variation because the appearance of such landmarks varies significantly across views. This leads to considerable performance degradation (for example, 12% of correct reconstruction for the single view). Overall performance (black line) is increased from 34% (m=1) to 76% (m=48), justifying the multiview augmentation.
Inference Precision
We next evaluated the inference precision, i.e., how many cameras are needed for inference (testing data) to produce comparable reconstruction with 62 cameras. We measured the precision of 3D reconstruction of the full model (m=62) while varying the number of views (n=2,4,8,16,32,48) for a testing sequence with different macaques. n=1 is impossible as 3D reconstruction requires at least two views. The error measure (the percentage of correct reconstruction for each landmark) is identical to the relative accuracy analysis above (Fig. 4(b)). The inference precision quickly approaches 80% average performance of 13 landmarks with as few as 8 cameras. However, as with view augmentation for training, hands, knees, and feet require more cameras (n=32). In other words, to fully capture the position of the extremities, there is a strong benefit to using dozens of cameras.
Although the trained pose detector was designed to be viewinvariant, in practice, it can still be view-variant. That is, the inference precision can still depend on viewpoint. We characterize the view-dependency in Fig. 5. Specifically, we illustrate the accuracy of the 2D pose inference by comparing decimated reconstructions to the presumed ground truth of full reconstruction. The views are organized with respect to the macaque’s facing direction (detected automatically using the detected head pose). Specifically, the relative camera angle is negative if the cameras are located on the left side, and positive otherwise. We find that for the landmarks that are visible in most views (such as head and pelvis), the 2D localization is highly view-invariant, meaning that there is uniform accuracy across views (less than 2 pixel error). However, for the hands and feet, which are frequently occluded (often, by the torso), localization is often highly view-variant. For example, right hand side views are typically less suitable to localize the left hand. This view-variance can be alleviated by leveraging multiple views. Nonetheless, to ensure the minimal view-variance in 3D reconstruction, the cameras need to be distributed uniformly across the enclosed space.
Automated identification of semantically meaningful actions based on 3D pose estimation
The central goal of tracking, of course, is to identify actions (12). The ability to infer actions from our data, is therefore a crucial measure of the effectiveness of our system. We call this expressibility. We therefore next assessed expressibility for our 3D representations and, for comparison, our 2D representations.
For processing 3D representations, we transformed each one-frame pose into a canonical coordinate system. Specifically, we defined the neck as the origin and the y-axis as the gravity direction. The z-axis is defined as aligned with the spine (specifically, the axis connecting the neck and hip), and the representation is then normalized so that the length of the spine is one. This coordinate transform standardizes the location, orientation, and size of all macaques. We then vectorize the 3D transformed landmark coordinates to form the 3D representation. That is, (note that the neck location is not included as it corresponds to the origin.).
In Fig. 6A, we visualize the clusters of the 3D representations of macaque movements in an exemplary 30 minute sequence (that is, 54,000 frames). We used Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction (26). This process results in coherent clusters. Visual inspection of these clusters in turn demonstrates that they are highly correlated with the semantic actions such as sitting, standing, climbing, and climbing upside down. We further use these clusters to classify actions in a new testing sequence (13 mins) using a k nearest neighbor search. This allows identifying the transitions among these actions as shown in Fig. 6B. The transitions are physically sensible, e.g., from walking on the floor to climbing upside down, a macaque needs transitional actions of walking → standing → climbing → climbing upside down.
For comparison, we next performed the same analyses on 2D representations. To process the 2D representations, we first vectorized the set of landmarks in each frame, thus, (The number 26 comes from the fact that we have 13 joints in each of two dimensions). In contrast, the 2D representation is affected by viewpoint where the clusters are not distributed in a semantically meaningful way as shown in Fig. 6C.
Social interaction
Macaques, like all simian primates, are highly social animals, and their social interactions are a major determinant of their reproductive success, as well as a means of communication (27–30). OpenMonkeyStudio offers the ability to measure multiple macaque poses jointly. To show its capability, we first generated a dataset using two macaques placed together in our large cage system. The two subjects were familiar with each other and exhibited clear behavioral tolerance in their home caging. These two macaques freely navigated in the open environment while interacting with each other. Fig. 7A illustrates the 3D reconstruction of their poses over time. It is important to note that extraction of the animals field of view is highly dependent on the amount of cameras available thus further justifying our high camera count. The closer animals are to each other, the more important are unique views that are able to separate the individuals. Fig. 7B demonstrates that the same tools and approaches used for single macaque pose representation readily extend to social interactions: the co-occurrences of actions of two macaques illustrate their correlation, e.g., one macaque follows the other. Fig. 7C illustrates the proxemics (31) of macaques that characterizes their social space. The 3D location of the second macaque is transformed to the body centric coordinate system of the first macaque to show the spatial distribution of social interactions. We use the polar histogram of the transformed coordinate to visualize the proxemics of macaques.
OpenMonkeyPose dataset
We have argued that the critical barrier to tracking is overcoming the annotation problem (see above). We have done so here by leveraging a large-scale multiview dataset of macaques to build the pose detector for OpenMonkeyStudio. Our dataset consists of 195,228 image instances that can densely span a large variation of poses and positions seen from 62 views (Fig. 9). The dataset includes diverse configurations of the open unconstrained environment, and also involves inanimate objects (barrels, ropes, feeding stations). It also involves multiple camera configurations and types, it involves two background colors (beige and chroma-key green), and four macaque subjects varying in size and age (5.5 – 12 kg). The dataset, trained detection model, and training code will be made available at the time of publication (https://github.com/OpenMonkeyStudio).
Discussion
Here, we present OpenMonkeyStudio, a novel method for tracking the pose of rhesus macaques without the use of markers. Our method makes use of 62 precisely arranged high-resolution video cameras, a deep learning pose detector, and 3D reconstruction to fit rich behavior. Our system can track monkey pose (13 joints) with high spatial and temporal resolution for several hours, which has been impossible with a marker-based motion capture. It can track two interacting monkeys and can consistently identify individuals over time. The ability to track positions of macaques is important because of their central role in biomedical research, as well as their importance in psychology, and ethology. Recent years have witnessed the development of widely used markerless tracking systems in many species, including flies, worms, mice and rats, and humans (17, 32, 33). Such systems are typically not designed with the specific problems of monkey pose estimation in mind. Relative to other more readily trackable species, monkeys have largely homogeneous unsegmented appearances (due to their thick continuous and mostly single colored fur covering), have much richer pose repertoires, and have much richer positional repertoires. (Consider, for example, that a mouse rarely moves to a sitting position, much less flips upside down). Although our system takes multiple approaches to solve these problems, the major innovation concerns overcoming the annotation problem. That is, the major barrier to successful pose tracking in macaques is the lack of a sufficiently large reliably annotated training set, rather than the lack of an algorithm that can estimate pose given that set.
The annotation problem is deceptively complex. We estimate that to generate a dataset of quality equivalent to that used in human studies would cost a few million dollars – several orders of magnitude more than our system costs. We get around this problem using several innovations: (1) a novel design of a dense multi-camera system that can link multiview images through a common 3D macaque’s pose, (2) an algorithm that strategically selects maximally informative frames, thus allowing us to allocate manual annotation efforts much more efficiently, and (3) multiview augmentation, or increasing the effective size of our dataset by leveraging 3D reconstruction. We further augment these with a standard additional step, affine transformation. These steps, combined with professional annotation of a subset of data, allow for the creation of a dataset called OpenMonkeyPose sufficient for machine learning, which we will make publicly available at the time of publication. Although our system is designed for a single cage environment, it can readily be extended to other environment shapes and sizes. We demonstrate that with the trained pose detector, it is possible to reduce the density of cameras for other environments, e.g., for 2.45 m × 2.45 m × 2.75 m space, 8 cameras produce 80% performance compared to 62 cameras (1280×1024). For a larger space, higher spatial resolution is needed, which can effectively produce a similar size of region of interest that contains a macaque.
It is instructive to compare our system with DeepLabCut (17, 18). The goals of the two projects are quite different – whereas DeepLabCut facilitates the development of a tracking system that can track animals, OpenMonkeyStudio is a tracking system. In other words, DeepLabCut provides a structure that helps develop a model, OpenMonkeyStudio is a specific model. DeepLabCut is very general – it can track any of a large number of species; OpenMonkeyStudio only works with macaques. On the other hand, DeepLabCut skirts the major problems associated with monkey tracking – the annotation problem. The solution to these problems constitutes the core innovative aspect of OpenMonkeyStudio. Indeed, the model that results from implementing DeepLabCut will be highly constrained by the training set provided to it; that is, if training examples come from a range of behaviors, it can only track new behaviors within that range. In contrast, OpenMonkeyStudio can track any pose and position a monkey may generate.
Conclusion
Monkeys evolved through natural selection processes to adaptively fit their environments, not to serve as scientific subjects. They are, nonetheless, invaluable, and their behavior, in particular, promises major advances, especially if it can be understood in ever more naturalistic contexts (12, 34–37). Despite this, much biological research using them contorts their behavior to our convenience. By reversing things and letting them behave in a more naturalistic and ethologically relevant way, we gain several opportunities. First, we get a more direct correspondence between what we measure and what we want to know – the internal factors that drive the animal’s behavior on a moment to moment basis (38).Second, we gain access to a much higher dimensionality representation of that dataset, which gives us greater sensitivity to effects that are not (39). Finally, it gives us an ability to measure effects that the monkeys simply cannot convey otherwise. It is for these reasons that improved measure of naturalistic behavior holds great hope in the next generation of neuroscience; we anticipate macaques will be part of that step forward.
Methods
Training OpenMonkeyStudio
The main challenge of training a generalizable pose detection model for OpenMonkeyStudio is to collect large scale annotated data that include diverse poses and viewpoints: for each image, the landmark coordinates of the macaque’s pose need to be manually specified. Unlike the pose datasets for human subjects (23–25), this requires primatological knowledge, which precludes collecting a dataset with comparable size (the order of millions; estimated cost: $12M, assuming a minimum wage of $7.25). One key innovation of OpenMonkeyStudio is the use of the multi-camera system to address the annotation problem through a novel data augmentation scheme using a theory of multiview geometry. The resulting system allows for robust reconstruction of 3D pose even with noisy landmark detections.
Keyframe selection for maximally informative poses
Some images are more informative than others. For example, when a macaque is engaged in quiescent repose, its posture will only change modestly over seconds or minutes. After the first image in such a sequence, subsequent ones will provide little to no additional visual information. Including such redundant image instances introduces imbalance of the training data, which leads to biased pose estimation. A compact set of the images that include all possible distinctive poses from many views are ideal for the training dataset.
To identify the informative images, we develop a keyframe selection algorithm based on monkey movement, e.g., locomotion, jumping, and hanging. A keyframe is defined here as frame that has large translational movement between its consecutive frames. The translational movement is the 3D distance traveled by the center of mass of a macaque. We approximate the center of mass using the triangulated center of mass. The macaque body is segmented from an image using a background subtraction method that employs a Gaussian mixture model (40), and the center of segmented pixels is computed. The centers of segmented pixels from multiview images are triangulated in 3D using the direct linear transform method (11) given the camera calibration parameters. Robust triangulation using a mean-shift triangulation approach (41)or random sample consensus (RANSAC (42), see below) can be complementary when background subtraction is highly noisy. With the keyframe selection, the amount of required annotations is reduced by a factor of 100-400, e.g., instead of needing 200,000 labeled frames, we would only need 500-2000.
Cross-view data augmentation using multiview cameras
Given the selected keyframes, we annotate the data and extensively augment the size of data using multiview images.
Annotation
For each keyframe, we crop the region of interest in the images such that the center of mass is located at the center of the cropped region and the window size is inversely proportional to the distance between the center of mass and the camera. By resizing all cropped images to the common resolution (368 width × 368 height), the macaque appears roughly the same size in pixel units. Among 62 view images, we select three to four views that maximize visibility, i.e., most body parts of macaques are visible, and minimize view redundancy, i.e., maximum distance between cameras’ optical centers. This selection process is done in a semiautomatic fashion: an algorithm is developed to propose a few camera candidates that the center of mass is visible while retaining the maximal distance between them, and a trained lab member selects the views among the candidates. This selection process significantly alleviates the annotation ambiguity and efforts and reduces the uncertainty of the triangulation by providing a wide baseline between camera optical centers. The set of selected images are manually annotated by the trained annotators. In practice, we leverage a commercial annotation service (Hive AI). As of January 2020, 33,192 images are annotated.
Adjustment
The manual annotations can be still noisy. We use a geometric verification to correct the erroneous annotations. The annotated landmarks are triangulated in 3D using the direct linear transform and projected to the annotated images to check reprojection error, i.e., how the annotations are geometrically consistent. Ideally, the annotated landmarks must agree with the projected landmarks. For the landmark that has reprojection error higher than 10 pixels, we manually adjust the annotations or indicate outliers using an interactive graphical user interface that visualizes the annotated landmarks and their corresponding projections in real time. This interface allows efficient correction of the erroneous annotations across views jointly. The resulting annotations are geometrically consistent even for occluded landmarks. MATLAB code of the adjustment interface is publicly available on our github page.
Propagation
The refined annotations form a macaque’s 3D pose (13 landmarks), which can be projected onto the rest of the views for data augmentation. For example, the annotation of the left shoulder joint in two images can be propagated through any of the other 60 view images collected at that at the same time instant (i.e. same frame count) that include that landmark (i.e., that are not occluded by the body or out of frame). Given our circular arrangement of cameras, this propagation step reduces the amount of annotation needed by a factor of 15-20 depending on the visibility of the 3D landmark location (Fig. 9).
Training Pose detector
Given the annotated landmarks, we automatically crop the region of the monkey from each image based on the method used in keyframe selection. When multiple macaques are present, we use a k-means clustering method to identify their centers. We further augment the data by applying a family of nine affine transforms (±30° rotations, 20% left/right/up/down shifting, ±10% scaling, and horizontal flipping) to form the training data. These transformed images enhance the robustness of the detector with respect to affine transformations. The pose detector (CPM) takes as an input a resized color image (368×368×3 pixel, 368 pixel width and height with RGB channels) and outputs 46×46×14 pixel response maps (46 width and height with 13 landmarks and one for background). The ground truth response maps are generated by convolving a Gaussian kernel at the landmark location, i.e., in each output response map, the coordinate of the maximum response corresponds to the models’ best guess as to the position of the landmark coordinate. L2 loss between the ground truth and inference response maps is minimized to train the CPM. We use ADAM stochastic gradient descent method for the optimization (43). A key feature of CPM is multi-stage inference, which allows iterative refinements of landmark localization (10). In particular, such multi-stage inference is highly effective for macaques as the visual appearance of their landmarks (e.g. their hips) is often ambiguous due to the uniform coloration of their pelage. In practice, we use a six stage CPM that produces optimal performance in terms of accuracy and computational complexity. We use a server containing 8 GPU’s (NVIDIA RTX 2080 Ti; 11Gb memory) to train the CPM. Training the model only requires 7 days for 1.1M iterations with 20 batch size on one card. Hardware on model training (a task usually not repeated often) are quite modest.
Plausible pose inference
For the testing (inference) phase, no manual intervention and training is needed. For synchronized multiview image streams of a testing sequence, we compute the 3D center of mass of macaque based on the method used for the keyframe selection and crop the regions of monkeys from multiview images. We localize the landmark position in each cropped image by finding the maximum locations in the response maps predicted by the trained CPM. Given the camera calibration, the landmarks are robustly triangulated in 3D using a RANSAC procedure (42), i.e., for each landmark, a pair of images among 62 images are randomly selected to reconstruct the 3D position that is validated by projecting onto the remaining images. This randomized process allows robustly finding the best 3D position that agrees with the most CPM inferences in the presence of spurious inferences.
The obtained 3D reconstruction is performed on each landmark independently while considering physical plausibility, e.g., limb length must remain approximately constant across time. Given the initialization of the 3D pose reconstruction, we incorporate two physical cues for its refinement without explicit supervision. (1) Limb length cue: for an identical macaque, the distance between landmarks needs to be preserved. We estimate the distance between the connected landmarks (e.g., right shoulder and right elbow) using the median of the distance over time. This estimated distance is used to refine the landmark localization. (2) Temporal smoothness cue: the movement of macaque is temporally smooth over time. The poses between consecutive frames must be similar, which allows us to filter the spurious initialization. We integrate these two cues by minimizing the following objective function: where Xt is the 3D location of a landmark at the t time instant, xi,t is the predicted location of the landmark at the ith image, and Πi is the projection operation at the ith image. Yt is the parent landmark in the kinematic chain, LX,Y is the estimated length between X and Y, and Xt−1 is the 3D location of the landmark at t − 1 time instant. The first term ensures the projection of the 3D landmark to match with the CPM inference, the second term enforces the limb length constraint, i.e., the distance between adjacent landmarks remain constant, and the third term applies a temporal smoothness. This optimization is recursively applied along a kinematic chain of body, e.g., neck→pelvis→right knee→right foot.
Multi-camera system design
Our computational approach is strongly tied to the customized multi-camera system that can collect our training data and to reconstruct 3D pose on the fly (Fig. 10). We integrate the camera system into our 2.45 m × 2.45 m × 2.75 m open cage at the University of Minnesota. The cameras are mounted on movable arms mounted to a rigid exoskeleton surrounding the cage system and not touching it (to reduce jitter). The cameras peer into holes in the mesh caging covered with plastic windows. The cameras are carefully positioned so as to provide coverage of the entire system. The resulting system possesses the following desired properties for accurate markerless motion capture: high spatial resolution, continuous views, precise synchronization, and subpixel accurate calibration.
High spatial resolution
We use machine vision cameras (BlackFly S, FLIR) that produce a resolution of 1280×1024 at up to 80 frames per second (although in practice we use 30 fps). The camera is equipped with a global shutter with sensor size 1/2” format (4.8μm pixel size). Fisheye lenses with 3-4 mm focal length (Fujinon) are attached to the camera. This optical configuration results in a monkey with 1 m size appearing at a approximately 150×150 pixel image patch from the farthest camera (diagonal distance: 5.2m). Our camera placement guarantees that, in each frame, there exist at least 10 cameras that observe the monkey with a greater than 550×550 resolution. This resolution is sufficiently high such that the CPM can recognize the landmark.
Continuous views
62 cameras are uniformly distributed along the two levels of horizontal perimeter of OpenMonkeyStudio made of 80/20 T-slot aluminum (Global Industrial), i.e., for each wall except for the wall with a gate, there are 16 cameras facing at the center of the studio. The baseline between adjacent cameras is approximately 35 cm, producing less than 6.7° view difference, or 70 pixel disparity at the monkey 3m away. This dense camera placement results in nearly continuous change of appearance across views where the landmark detector can learn a view invariant representation, and therefore, can reliably reconstruct landmarks using redundant detection. Further, uniform distribution of cameras minimizes the probability of self-occlusion, e.g., the left shoulder that is occluded by torso for one side of cameras can be visible from the other side of cameras. Positioning of the cameras is performed according to 2 fundamental principles. First we sample the internal space of the cage with overlapping camera field of views while maintaining a focal length that enables the viewed subject to cover at least half the camera’s image sensor. This is done to ensure adequate resolution of the subject. Additionally we configure the cameras in the corners of the rectangular skeleton frame to have a 45° angle. This allows corner cameras to oversample even further which greatly helps with intrinsic and extrinsic camera calibration.
Precise synchronization
The principle of multiview geometry applies on completely static scenes where the precise synchronization is a key enabler of 3D reconstruction. We use an external synchronization TTL pulse (5V) that triggers to open and close the shutters of all cameras at exactly the same moment through General Purpose Input/Output (GPIO). This pulse is generated by a high precision custom waveform generator (Agilent 33120A) capable of 70 ns rise and fall times. Our system has been extensively tested and remains accurace to sub millisecond precision over 4 hours of data acquisition (maximum capacity of our NVMe Raid array)
Subpixel accurate calibration
Geometric camera calibration, the estimate of the parameters of each individual lens and image sensor has to be performed before each recording session. Parameters are used to correct for lens distortions as well as determine the location of the cameras within the scene. To calibrate the cameras, we use a large 3D object (1 m × 3 m) with non repeating visual patterns (mixed art works and comic strips), which facilitates visual feature matching across views. A standard structure-from-motion algorithm (44) is used to reconstruct the 3D object and 62 camera poses including intrinsic and extrinsic parameters automatically.
Distributed image acquisition
62 cameras produce 3.7 GB data per second (each image is approximately 2MB with a JPEG lossless compression at 30 Hz). To accommodate such a large data stream, we designed a novel distributed image acquisition system consisting of 6 local servers controlled by a global server (Fig. 11). The data streams from 10-11 cameras are routed to a local server (Core i7, Intel Inc.) using individual Cat6 cables to a power over ethernet (PoE) capable network switch (Aruba 2540). The firmware of the switch has been altered by the authors to allow for the specialized requirements of high data throughput using JUMBO packages. Each PoE switch is then connected to the local servers using dedicated fiberoptic 10 Gbit SFP+ transceivers. The data streams are compressed and stored in three solid state drives (NVMe SSD in RAID 0 mode) of the local server.
Cameras also received synchronization pulses through general purpose input output lines (GPIO, Hirose). An individually designed wiring setup provided TTL pulses (5V) generated at a target frequency of 30 Hz to each camera. Pulses were generated using a high precision custom waveform generator (Agilent 33120A) capable of 70 ns rise and fall times. Upon completion of a data acquisition session, data is copied onto 12 Terabyte HDD’s and physically moved to a JBOD daisy chained SAS hot swappable array (Colfax Storage Solutions) connected to Lambda Blade (Lambda Labs) server.
Data Collection
Subjects and apparatus
All research and animal care was conducted in accordance with University of Minnesota Institutional Animal Care and Use Committee approval and in accord with National Institutes of Health standards for the care and use of non-human primates. Four male rhesus macaques served as subjects for the experiment. All four subjects were fed ad libitum and pair housed within a light and temperature controlled colony room. Subjects were water restricted to 25 mL/kg for initial training, and readily worked to maintain 50 mL/kg throughout experimental testing. Three of the subjects had previously served as subjects on standard neuroeconomic tasks, including a set shifting task (45) and several simple choice tasks (46–50). Training also included experience with foraging tasks (51, 52), including one study using the large cage apparatus (53). One subject was naive to all experimental procedures.
Subjects were allowed for unconstrained movement within the cage in three dimensions. Five 208 L drum barrels weighted with sand were placed within the cage to serve as perches for the subjects to sit upon. In some sessions, four juice feeders were placed at each of the four corners of the cage in a rotationally symmetric alignment. The juice feeders consisted of a 16 × 16 LED screen, a lever, buzzer, a solenoid (Parker Instruments), and were controlled via an Arduino Uno microcontroller. Data were collected in MATLAB via Bluetooth communication with each of the juice feeders. We first introduced subjects to the large cage and allowed them to acclimate to it. Acclimation consisted of placing subjects within the large cage for progressively longer periods of time over the course of about five weeks. To make the cage environment more positive, we provisioned the subjects with copious food rewards (chopped fruit and vegetables) placed throughout the enclosure. This process ensured that subjects were comfortable with the environment. We then trained subjects to use the specially designed juice dispenser (53).
Implantation of headcap for marker testing
For purposes of comparison with marker data, we collected one large dataset with simultaneous tracking by our OpenMonkeyStudio system and the OptiTrack system. We placed three markers onto a head implant that was surgically attached to the subject’s calvarium. This was placed for another study using methods described therein (Azab and Hayden, 2017 and 2018). Briefly, the skin was removed and ceramic screws placed with the bone overlying the crown. A headpost (GrayMatter Research) was placed adjacent to the bone and orthopedic cement (Palacos) was placed around the screws and post in a circular pattern. The marker test took place several years after this procedure. It involved attaching a novel 3-D printed three-arm holder to the heapost itself. The three arms each bore a reflective marker that could be detected by the Opti-Track system. We used 8 Opti-Track cameras (Natural Point, Corvallis, OR) mounted in the same room as our camera system. Placement of the 8 cameras was optimized to minimize IR reflections and interference and to obtain a camera calibration (through wanding) error of less that 1 mm.
ACKNOWLEDGEMENTS
We thank Marc Mancarella for critical initial help, Giuliana Loconte and Hannah Lee for ongoing assistance. We also thank Yasamin Jafarian and Jayant Sharma for help with developing the pipelines we used.
This work was supported by an award from MNFutures to HSP and BYH, from the Digital Technologies Initiative to HSP, JZ, and BYH, from the Templeton Foundation to BYH, by an R01 from NIDA (DA038615) to BYH, and by NSF CAREER (1846031) to HSP.