Abstract
This paper concerns the fully automatic direct in vivo measurement of active and passive dynamic skeletal muscle states using ultrasound imaging. Despite the long-standing medical need (myopathies, neuropathies, pain, injury, ageing), current technology (electromyography, dynamometry, shear wave imaging) provides no general, non-invasive method for online estimation of skeletal intramuscular states. Ultrasound provides a technology in which static and dynamic muscle states can be observed non-invasively, yet current computational image understanding approaches are inadequate. We propose a new approach in which deep learning methods are used for understanding the content of ultrasound images of muscle in terms of its measured state. Ultrasound data, synchronized with electromyography of the calf muscles and with measures of joint torque/angle, were recorded from 19 healthy participants (6 female, ages: 30 ± 7.7). A segmentation algorithm previously developed by our group was applied to extract a region of interest of the medial gastrocnemius. Then a deep convolutional neural network was trained to predict the measured states (joint angle/torque, electromyography) directly from the segmented images. Results revealed for the first time that active and passive muscle states can be measured directly from standard b-mode ultrasound images, accurately predicting for a held-out test participant changes in the joint angle, electromyography, and torque with as little error as 0.022°, 0.0001V, 0.256Nm (root mean square error) respectively.
I. INTRODUCTION
There is a current unmet medical demand for personalized in vivo skeletal muscle analysis. Neurological conditions (dystonia, motor neurone disease), myopathies (myositis, inflammation), neuropathies (nerve injury, spinal cord injury), ageing (motor unit loss), and pain/injury (work-related injury, neck injury, back injury, low back pain, neck pain) are some medical problems which would benefit from an ability to measure the dynamic/static states of specific/individual skeletal muscles in vivo. The state of muscle is determined by numerous input factors, the main two being neural drive, and joint rotation. Active contraction/relaxation via neural drive causes muscle shortening/lengthening. Muscles can lengthen or shorten when connecting joints rotate. Active contraction of muscles can happen with fixed joints at any angle (isometric) or free rotating joints either with the contraction (concentric) or against the contraction (eccentric). This defines a complex state-function space for muscle in which there are numerous muscle lengths with the same joint angle but different activations, and numerous muscle lengths which have the same activation but different joint angles. If joint angles and neural drive are two independent inputs to muscle which contribute to state, some others are pressure from adjacent muscles, pain, inflammation, temperature, fatigue and historical states (state transitions).
The current state of the art technology does not provide a solution to measuring specific muscle states non-invasively. Surface electromyography (EMG) can non-invasively measure active contraction in superficial muscles, but measurements are noisy (need filtering), subjective, and correlations with measured muscle force are entirely dependent on the position of the electrode. Intramuscular EMG can provide invasive (needles or fine wires inserted through skin and muscle) measurements of active contractions in deep muscles. EMG cannot measure passive tension in the muscle and has many other well-known problems [1]. Dynamometry can provide non-invasive gross measures of passive or active forces acting on a joint and therefore can provide gross estimates of the contribution of groups of muscles crossing that joint. Dynamometry is therefore not muscle specific and cannot resolve passive and active tensions within specific muscles. Supersonic shear wave imaging (SSI) can provide non-invasive estimates of the regional stiffness within cross-sectional areas of specific muscles, where such measures have been shown to correlate well with measured/estimated muscle force [2]. SSI cannot resolve active force (produced within the muscle by active motor unit firing) from passive force (resulting from joint rotation or pressure from adjacent muscles), and correlations with measured force are subjective, requiring calibration to person-specific maximum voluntary contraction (MVC). SSI requires that the ultrasound scanning plane be in line with the muscle fibers, else the theory breaks down [3]. Finally, SSI has a low sampling rate of about one sample per second [2].
There have been many previous attempts to analyse the features of skeletal muscle via ultrasound and yet this long standing medical need is still unmet. Previous attempts have all made assumptions as to what the descriptive features of active and passive tension are when observing muscle state within ultrasound; muscle thickness/cross-sectional area [4], muscle fascicle orientation/curvature [5]-[7], and muscle length [8],[9]. These previous attempts are either too presumptuous, too low dimensional, and/or do not sample the state-function space comprehensively. In this paper, we propose a new alternative approach for measuring states (torque, active contraction, passive shortening/lengthening) of individual muscles directly from ultrasound images using machine learning. The methods we have developed are applicable to standard frame-rate (25Hz) b-mode ultrasound imaging for numerous reasons, not least it is ubiquitous in a clinical and research sense, non-invasive, cost-effective, has minimal exclusion criteria, and is portable. Ultrasound can very easily image deep structures including deep muscles within the body at very practical frame rates (25-100+Hz). The focus of this paper is on the human calf muscles since they are of interest, with a dense research track record and a variety of previously developed computational methods.
II. RELATED WORK
Although ultrasound has many clear benefits it is hard to analyze the information content [10]. With respect to skeletal muscle, research has predominantly been focused on extraction of intuitive low-resolution features such as pennation angle, and muscle thickness/cross-sectional area [4], [11]-[14]. Hodges and others [4] highlighted the potential of ultrasound for analyzing specific muscles without cross-talk (an artifact of EMG) from adjacent muscles and they compare low dimensional features with EMG measurements from four specific muscles undergoing isometric contractions. Among many conclusions, their most relevant findings were that from initial conditions large changes in the low dimensional features are associated with small changes in EMG, and small changes in low dimensional features are associated with large changes in EMG. They conclude from this that ultrasound would be good for measuring small activations (4-20% maximum voluntary contraction) but not large activations. If we assume that the main finding is true (i.e. large activations are associated with small changes in state) this merely means that there is a non-linear relationship between muscle state and contractile force, which does not mean that ultrasound should be limited to small activations. We must primarily consider measurement noise arising from human error a limiting factor for discrimination at higher forces, and then we must consider the fundamental limitation of using low-dimensional predicated features.
Rana and others [5] attempted to address the problem of subjectivity when measuring muscle fascicle orientation and curvature by developing a computational approach. After filtering their images with a vessel enhancement filter, they applied two methods to ultrasound images of the vastus lateralis muscle. The first method was the Radon transform which gave the main orientation of the visible fascicles. The second method was to convolve the images with orientated Gabor wavelet filters, where the maximum convolution at each pixel reveals the orientation for that location. The mean angle obtained from the wavelet convolutions reveals the dominant orientation over all of the fascicles. They evaluate their methods on synthetic data with known orientations, reporting accuracy to 0.02°. They also acquired manually digitized fascicle orientations from 10 operators, which revealed very large subjectivity between operators such that comparisons with the computational methods were not possible. They do not use automatic segmentation and fascicle regions were manually selected. The main problem with this approach is that there is no built-in discrimination of what is considered to be a fascicle; fascicles are vessel-like structures which exhibit a bright to dark to bright pixel intensity pattern, and there are many objects within an image which exhibit this pattern such as blood vessels, connective tissues, and nerves. These other structures can and often do present at different orientations from the fascicle field which can cause local errors in the measurement. It is currently not known if fascicle orientation is sufficient to delineate active and passive muscle states, or extract forces either within a single person or generalized outside a population.
For some time, the dominant paradigm was to track the motion of local structures visible within the image plane with a view to perhaps predicting and tracking muscle length; this is known as speckle, or feature tracking. The main flaw with this approach is tracking drift; where the local tracking error accumulates over time due to noise and other effects and the absolute position is lost. Loram and others [8] applied a cross-correlation feature tracking technique to two specific muscles (superficial and deep) in the lower leg. Loram demonstrated for the first time that ultrasound can be used to measure completely different muscle lengths in two adjacent muscles during the same task. Of the many conclusions, the relevant findings were that cross-correlation tracking fails for arbitrarily large movements, and that the unregulated tracking of point features results in tracking drift. The latter confirms the findings of an earlier study by Yeung and colleagues [15], [16] in which the authors address the tracking drift problem and conclude that it is a consequence of features leaving or entering the image plane making them inherently impossible to directly track with pure feature tracking methods.
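The accumulation of tracking error described above can be illustrated with a minimal sketch. It assumes that each per-frame estimate carries independent zero-mean Gaussian noise (the noise level `noise_sd` and the function names are hypothetical, not taken from the cited studies) and that the true motion is zero, so the integrated trajectory is pure drift. Under this assumption the standard deviation of the accumulated error grows with the square root of the number of frames.

```python
import random

def accumulate(changes):
    """Integrate per-frame changes into an absolute trajectory (running sum)."""
    total, path = 0.0, []
    for c in changes:
        total += c
        path.append(total)
    return path

def simulate_drift(n_frames=4500, noise_sd=0.01, seed=0):
    """Absolute error after each frame when the true motion is zero.

    Each per-frame estimate is pure noise (hypothetical noise_sd), so the
    accumulated estimate is a random walk, i.e. tracking drift.
    4500 frames corresponds to 180 s at 25 Hz.
    """
    rng = random.Random(seed)
    noisy_changes = [rng.gauss(0.0, noise_sd) for _ in range(n_frames)]
    return [abs(p) for p in accumulate(noisy_changes)]

drift = simulate_drift()
print(f"error after 1 s (25 frames):     {drift[24]:.4f}")
print(f"error after 180 s (4500 frames): {drift[-1]:.4f}")
```

The sketch only models additive noise; features entering or leaving the image plane, which the cited authors identify as a cause of drift, are not represented.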
The techniques of Loram and others [8] were improved upon by the use of a more robust tracking method: the Kanade-Lucas-Tomasi (KLT) feature tracking method [17], [18], [19]. Loram and others mentioned that tracking failed for arbitrarily large motions, but the KLT solved that problem by use of pyramid levels which could track at coarse detail (top pyramid level) for large motions, and refine tracking at each subsequent pyramid level. Darby and others [9] not only improved tracking of local features, they automated the entire analysis by applying active shape models (ASM) [19] and using Eigen features [18] interpolated at automatically placed grid points within the muscle belly using Delaunay triangulation. They reported inconsistent segmentation results with automatic initialization, and quite accurate results (0.3mm) with manual initialization. With respect to tracking, although they show robustness for larger movements, they found more drift than the cross-correlation method for small movements, and they reported large feature dropout (large discrepancies between texture patches in sequential images causing termination of tracking of that feature). The feature dropout they experienced resulted from out of plane motion, where features leave or enter the plane suddenly and cannot possibly be tracked using these methods.
Naturally we can now discuss regulated tracking methods [16], [7]. Regulated tracking within this domain is closely related to feature engineering; some presupposition about the information content is made and then a technique is developed to automatically measure that information, which is then used to regulate spurious tracking points. There is then a hierarchical dependency of feature tracking on the measured parameters, and the measured parameters on the quality of the data. Further, there are not always intelligible features within the muscle which can be used for regulation. The neck for example contains six bilateral muscle layers and when imaged simultaneously fascicles and other internal structures are invisible, only the muscle boundary and a random-deterministic internal speckle pattern are present. That speckle pattern can be tracked [20], but it would be difficult to formulate a regulated version of the tracking without some complex model of the speckle structure or neck muscle shape and mechanics.
The recent development [21]-[29] of a class of methods known collectively as deep learning (DL) provides a framework for understanding the content of ultrasound images of muscle in relation to measured data (EMG, torque, angle). DL is a technique for building artificial neural network (ANN) representations of data in a layer-wise fashion, where each layer models increasingly abstract/complex features of the data, facilitating modeling of complex features without a priori assumptions of the descriptive features. ANNs can learn nonlinear functions to map data (images) to labels (EMG, torque, joint angle). Even without many or any labels (which may often be the case with respect to deep muscles) features can be extracted using generative models such as restricted Boltzmann machines [21], [30], deep belief networks [31], deep autoencoders/autoassociators [32]-[34], or more recently generative adversarial networks (GAN) [35]. Those features can then either be directly analyzed (using statistics or distance metrics), or re-mapped to relatively few labels. If large volumes of labeled data exist, a convolutional neural network (CNN) can be trained directly on the data to predict the labels, which can be continuous or discrete. CNNs work very well for understanding the content of static images [36] or speech [37], and more recently very deep CNNs known as residual networks (ResNet) have surpassed human-level performance [38] on the same image recognition task as [36]. CNNs have also demonstrated the ability to track local motion [39], which means that unlike standard feature tracking, a CNN can measure the dynamic state of local features, while simultaneously having access to the static state (or pose).
One could argue for the need to measure historical states with short and/or long-term dependencies perhaps with recurrent networks, but considering this is an initial investigation of new ground they are not a sensible option since they are comparatively difficult to train and would not easily generalize to different ultrasound acquisition rates (e.g. ultrafast ultrasound > 1000Hz vs standard 25-100Hz).
In order to establish a firm but accessible benchmark we investigate the application of standard deep CNNs, rather than recurrent or very deep ResNets, with a view to extending the current work in future research. Our CNNs are compared to a variation of the Darby method [9] which is fundamentally feature tracking with a fully connected feed-forward neural network on top. In addition, we used established visualization techniques to attempt to understand and interpret the models which we generate [40]. We show that these methods can be used to visually understand mechanical and functional differences between active and passive skeletal muscle.
III. METHODS
A. Data Acquisition
Ultrasound data were recorded from 19 participants (6 female, ages: 30 ± 7.7) during dynamic standing tasks. Participants stood upright on a programmable/controllable foot pedal system during three tasks while strapped at the hip to a backboard. During the tasks, we recorded calf muscle (GM) activation using electromyography (EMG), ankle joint angle/torque, all at 1000Hz, and ultrasound of the GM at 25Hz. Three distinct tasks were designed to explore the state-function space of muscle:
1) Isometric
The pedal system was fixed at a neutral angle (flat feet), and participants observed an analog oscilloscope. On the oscilloscope, we displayed side by side a dot representing the amplitude of their filtered EMG, and a dot representing the amplitude of a fabricated signal (see section III. B.). Participants were asked to contract their calf muscles by pushing down their toes while simultaneously keeping their foot in full contact with the static pedals.
2) Passive
Participants observed an analog oscilloscope. On the oscilloscope, we displayed a dot representing the amplitude of their filtered EMG. Participants were asked to monitor and minimize any EMG activity by relaxing their muscles. The pedal system was driven using a fabricated signal (see section III. B.). Participants were asked to allow their ankle to rotate and keep their feet in full contact with the moving pedals.
3) Combined
The pedal system was fixed at a neutral angle (flat feet), and participants observed an analog oscilloscope. On the oscilloscope, we displayed side by side a dot representing the amplitude of their filtered EMG, and a dot representing the amplitude of a fabricated signal (see section III. B.). The pedal system was simultaneously driven using a fabricated signal (see section III. B.). Participants were asked to allow their ankle to rotate and keep their feet in full contact with the moving pedals.
All trials were 190 seconds in length, consisting of 10 seconds of neutral standing (i.e. no signals were used to move the pedals or the dot on the screen) followed by 180 seconds of trial. Data were collected in the ranges of 0.0481V, 100.182Nm, and 12.371° (c. 3.1° dorsiflexion, 9.3° plantar-flexion) for EMG, torque, and joint angle respectively.
B. Designing the Labels
Two signals were designed to manipulate active and passive muscle input factors, active contraction and passive joint rotation respectively. Both signals were derived from the following bases:
The dot on the screen used to guide participants to contract their muscles was constructed using the following rules: 1) For the first 10 seconds signal a was used, and every 10 seconds thereafter we alternated between signals a and b. 2) After 30 seconds signal c was used, and every 30 seconds thereafter, either signal a or b was used depending on the first rule. The pedals were driven using the same bases with the following different rules: 1) For the first 20 seconds signal a was used, and every 20 seconds thereafter we alternated between signals a and b. 2) After 60 seconds signal c was used, and every 60 seconds thereafter, either signal a or b was used depending on the first rule. The signals were designed to produce transient correlations, de-correlations, and anti-correlations to maximize exploration of the state-function muscle space. The correlation of the two independent signals was r = 0.33, p = 0 (Pearson), and r = 0.34, p = 0 (Spearman).
Simulink (Matlab, R2013a, The MathWorks Inc., Natick, MA) was used to interface with the lab equipment (pedal system and EMG), and for video synchronization a hardware trigger was used to initiate recording at the start of each trial.
C. Segmentation and Region Extraction
For segmentation and region extraction we used a fast and accurate muscle segmentation algorithm previously developed by our group [41]. That method enabled normalization of the gastrocnemius muscle so as to reduce the computational dimensions and complexity while simultaneously maximizing the spatial resolution. The segmentation also provided an opportunity to standardize the input by allowing extraction of a region orthogonal to the main axis (mean over the video sequence) of the muscle (see figure).
First, an expert annotated the internal boundaries of the medial GM muscle in 500 randomly selected images of which 100 were randomly selected for testing. After interpolating the annotations to a standard 40 point vector, a principal component model was constructed from the remaining 400 images. The component model was then used to construct a texture-to-shape dictionary with only 4 components (> 90%). That dictionary was then used to give an approximate segmentation for each image in the dataset. That initial segmentation was then used to initialize a heuristic search routine using an ASM [19] constructed from just 10 principal components (> 99%). The search was conducted at full resolution ±10 pixels about each contour point. For more detail see [41].
The entire dataset (> 300,000 images) was segmented and then a region of interest (x × y = 496 × 120 pixels ≈ 55.42 × 14.67 millimeters) was extracted orthogonal to the main orientation of the GM muscle (linear least square fit to mean segmentation over the sequence).
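The orientation step can be sketched as an ordinary least-squares line fit through the points of the mean segmentation contour. This is only an illustration under stated assumptions: the function name `main_axis_angle` and the representation of the contour as (x, y) pixel pairs are hypothetical, and the actual region extraction in [41] may differ.

```python
import math

def main_axis_angle(points):
    """Angle in degrees of the least-squares line y = a*x + b through points.

    points: iterable of (x, y) contour coordinates, e.g. the mean
    segmentation over a sequence (hypothetical representation).
    """
    points = list(points)
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0.0:
        raise ValueError("points are vertically aligned; slope undefined")
    return math.degrees(math.atan(sxy / sxx))
```

A region of interest aligned with the muscle would then be obtained by rotating the sampling grid by this angle before cropping.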
D. Feature Tracking
The first stage of feature tracking was selection of ‘good’ (corner) features [18]. Within the region of interest defined by the segmentation (see sub-section C), we selected the top 1000 Eigen features, where the number 1000 was chosen empirically (it was not always possible to select greater than approximately 1000 features using the Eigen features method [18]). Eigen features were selected in every image of every sequence. Following feature selection we took 500 equidistant (over trials and participants) images from the training set (see sub-section G) and used the K-means [42] algorithm to identify integration points from the selected features in those images. The integration points were used to average the motion of the features within a cluster and record a single motion for that region/cluster (see figure 2). We empirically chose 100 means which were used to classify every Eigen feature in every image of every sequence. Greater than 100 means caused empty clusters within the testing and/or validation sets, and less than 100 gave larger average centroid distances.
After K-means clustering the KLT [17] algorithm was used to track the motion of each of the Eigen features one image forward. The displacements of features belonging to the same class/cluster were averaged (mean) and recorded. The result was a matrix in which each row (one per image, per sequence) holds the 100 cluster displacements (x,y) as a vector of 200 values. This was then used to train a variety of fully connected feed-forward neural networks (see sub-section E).
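The integration step can be sketched as follows, assuming the cluster centres have already been found by K-means and the one-frame displacements have already been produced by a KLT tracker. The function names are hypothetical, and returning zeros for an empty cluster is an assumption of this sketch (the number of means was chosen in the study so that empty clusters did not occur).

```python
def nearest_centroid(point, centroids):
    """Index of the centroid closest (squared Euclidean distance) to point."""
    px, py = point
    return min(
        range(len(centroids)),
        key=lambda k: (centroids[k][0] - px) ** 2 + (centroids[k][1] - py) ** 2,
    )

def cluster_mean_displacements(positions, displacements, centroids):
    """Average tracked feature displacements within each cluster.

    positions:     list of (x, y) feature locations in the current image
    displacements: list of (dx, dy) one-frame displacements, same order
    centroids:     list of (x, y) cluster centres (integration points)
    Returns a flat list [dx_0, dy_0, dx_1, dy_1, ...].
    Empty clusters yield 0.0, 0.0 (an assumption of this sketch).
    """
    sums = [[0.0, 0.0, 0] for _ in centroids]
    for pos, (dx, dy) in zip(positions, displacements):
        k = nearest_centroid(pos, centroids)
        sums[k][0] += dx
        sums[k][1] += dy
        sums[k][2] += 1
    flat = []
    for sx, sy, n in sums:
        flat.extend([sx / n, sy / n] if n else [0.0, 0.0])
    return flat

centroids = [(0, 0), (10, 0)]
positions = [(1, 0), (0, 1), (9, 0)]
displacements = [(1, 0), (3, 2), (-2, 4)]
print(cluster_mean_displacements(positions, displacements, centroids))
# -> [2.0, 1.0, -2.0, 4.0]
```

With 100 centroids the returned list has 200 values, matching the input vector length described above.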
E. Feed-forward Neural Network
After K-means clustering and integration of tracking points (previous section) we designed 3 feed-forward neural network architectures with which to model the data and predict changes in EMG, joint torque, and joint angle. The main design choices (or hyperparameters) in such networks are the number of neurons in a layer, the number of layers in a network, and the type of neuron transfer function. We treat the transfer function as a fixed hyperparameter and decided on the ReLU transfer function due to its popularity and success. We also treat the number of layers as a fixed hyperparameter and decided on 2 layers as a way of increasing complexity yet maintaining efficiency. The number of neurons per layer, however, was considered an important parameter and therefore we varied the number of neurons per layer in three models, A, B, and C, with 256, 512, and 1024 neurons, respectively.
These three models were trained with and without dropout (p = 0.5), with a learning rate of 1e-5 and momentum of 0.9.
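A forward pass through a network of this shape can be sketched as below, with model A's 256 neurons per hidden layer as the default. Several details are assumptions made only for illustration: a single network with three outputs (EMG, torque, angle), inverted dropout applied after both hidden layers, and a placeholder weight initialization that is not the one used in the study.

```python
import random

def relu(v):
    """Rectified linear transfer function, applied element-wise."""
    return [x if x > 0.0 else 0.0 for x in v]

def dense(v, weights, biases):
    """Affine layer; weights holds one row per output neuron."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, biases)]

def dropout(v, p, rng):
    """Inverted dropout (assumed variant): zero units with probability p, rescale the rest."""
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in v]

def init_layer(n_in, n_out, rng, scale=0.05):
    # Placeholder initialization: small uniform weights, zero biases.
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_mlp(n_in=200, n_hidden=256, n_out=3, seed=0):
    """Two hidden layers of equal width followed by a linear output layer."""
    rng = random.Random(seed)
    return [
        init_layer(n_in, n_hidden, rng),
        init_layer(n_hidden, n_hidden, rng),
        init_layer(n_hidden, n_out, rng),
    ]

def forward(layers, x, train=False, p=0.5, rng=None):
    """ReLU hidden layers (with optional dropout), then a linear output."""
    h = x
    for weights, biases in layers[:-1]:
        h = relu(dense(h, weights, biases))
        if train:
            h = dropout(h, p, rng)
    weights, biases = layers[-1]
    return dense(h, weights, biases)
```

Models B and C would correspond to `build_mlp(n_hidden=512)` and `build_mlp(n_hidden=1024)` in this sketch.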
F. Convolutional Neural Network
When considering the choice of architecture our concern primarily was that the model was large enough to minimize the training error, and then the main concerns were in computation and generalization. Our strategy was to train a variety of models, exploring width (number of filters per convolutional layer) and depth (number of convolutional layers), with state of the art regularization (dropout), while evaluating performance on held out validation data. The architectures of the three models were as follows: where the prefix c denotes convolutional layer, the prefix p denotes max pooling layer, the prefix fc denotes fully connected layer, and the number shown is the number of filters/neurons in the layer. The weight matrices associated with each convolutional filter were 2 × 2 in every layer except for the input layer which was connected to two sequential images in the form n × n × 2, where we varied n during cross validation (sub-section G).
G. Training and Cross Validation
To train our models we minimized the mean square error (MSE) between the model and the labels (change in EMG, joint torque, and joint angle) using stochastic online gradient descent (i.e. batch size of one). All images (or KLT features) and labels were normalized to have zero mean and unit variance. A learning rate of 1e-5 was empirically chosen for KLT and CNN models, with momentum of 0.95. ReLU units were used in all layers except the output layer which was linear. Prior to training all biases were initialized to 0, and all weights were initialized using a variation of the Xavier initialization [43],
The validation error was measured during training periodically to allow selection of optimal models. Cross validation was used with a test set of one held out participant (12500 samples) and a validation set of one held out participant (12500 samples). The validation set was used to choose the optimal model and the testing set was used to evaluate generalization performance. The same participants were used in cross validation in the CNN models and the KLT neural network models. The testing and validation sets were not used to train any of the models. To regularize our models we used dropout (p = 0.5) in the top 1, 2 or 3 fully connected layers (in both CNN and fully connected networks). Early stopping was used where the model with the lowest validation error was taken after the validation error did not decrease for more than 5 error evaluations.
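The update rule and stopping criterion described in this sub-section can be sketched as below. The classical (non-Nesterov) form of momentum is an assumption, since the exact variant is not stated, and the class and function names are hypothetical. The stopping rule follows the text: training halts once the validation error has failed to decrease for more than 5 consecutive evaluations, and the model with the lowest validation error is kept.

```python
def sgd_momentum_step(weights, grads, velocity, lr=1e-5, momentum=0.95):
    """One classical momentum update, applied in place to weights and velocity."""
    for i, g in enumerate(grads):
        velocity[i] = momentum * velocity[i] - lr * g
        weights[i] += velocity[i]

class EarlyStopping:
    """Track the best validation error and signal when to stop training."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_error = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step, val_error):
        """Record one validation evaluation; return True when training should stop."""
        if val_error < self.best_error:
            self.best_error, self.best_step, self.bad_evals = val_error, step, 0
        else:
            self.bad_evals += 1
        return self.bad_evals > self.patience
```

In a training loop, `update` would be called at each periodic validation evaluation, and the weights saved at `best_step` would be the ones carried forward to the test participant.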
H. Visualization of CNN model
We use established methods [40] to construct visualizations of the hierarchical knowledge learned by the best CNN model. More importantly we use these methods to understand the mechanical properties of active and passive muscle changes. For each convolutional layer, we generate images which maximize the response (output) of individual neurons (filters), and also images of each neuron maximized at every spatial location of the input space (2 sequential ultrasound images). Finally, we generate images which maximize the response of active (EMG) and passive (joint angle) neurons. Images are produced by initializing the input to the trained CNN with zero mean and unit variance Gaussian noise. Then we compute a full forward pass. Then we create an error vector which is equal to the output of the layer of interest, plus a constant (1) at the unit we want to maximize. Then from the layer of interest we back-propagate that error through to the input layer, and we use the gradient to update the pixels (at a rate of 0.5). During updates we apply L2 regularization (0.05) as per [40]. This process is repeated with the new pixels for 100 iterations.
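The iterative procedure can be sketched on a toy network. This is a simplified illustration, not the implementation used here: the trained CNN is replaced by a tiny stand-in with one ReLU hidden layer, the input is updated by plain gradient ascent on the response of the chosen unit rather than through the error vector described above, and the L2 regularization is assumed to take the form of a multiplicative decay on the input. The rate (0.5), decay (0.05), and iteration count (100) are the values quoted in the text; all function names are hypothetical.

```python
import random

def forward(x, W1, W2):
    """Tiny stand-in network: ReLU hidden layer followed by a linear output layer."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    hidden = [max(0.0, p) for p in pre]
    out = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    return pre, out

def input_gradient(x, W1, W2, unit):
    """Gradient of output `unit` with respect to every input value."""
    pre, _ = forward(x, W1, W2)
    # Back-propagate a 1 at the chosen output unit through the ReLU layer.
    delta = [W2[unit][j] if pre[j] > 0.0 else 0.0 for j in range(len(W1))]
    return [sum(delta[j] * W1[j][i] for j in range(len(W1))) for i in range(len(x))]

def maximise_unit(W1, W2, unit, n_inputs, rate=0.5, decay=0.05, iterations=100, seed=0):
    """Search for an input that maximizes one output unit of the stand-in network."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_inputs)]  # zero-mean, unit-variance noise
    for _ in range(iterations):
        grad = input_gradient(x, W1, W2, unit)
        # Gradient ascent on the unit's response with L2 decay on the input.
        x = [(1.0 - decay) * xi + rate * g for xi, g in zip(x, grad)]
    return x
```

For the real model the input would be a pair of sequential images and the gradient would come from back-propagation through the full CNN; the update line is the part that carries over.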
IV. RESULTS
A. Region Extraction
The segmentation technique we had previously developed was evaluated by manual annotation of 100 test images. Our concern here was not generalization but accuracy within the dataset. Our results showed that the segmentation was accurate to 0.16mm (> 99.9%) and segmented approximately 10 images per second.
B. Modeling Muscle Function from State
The KLT + ANN method was able to predict changes in EMG, torque, and joint angle to within 0.0007V, 0.57Nm and 0.09°, respectively. The CNN method was able to predict changes in EMG, torque, and joint angle to within 0.0006V, 0.58Nm and 0.09°, respectively. Comparison of the two different computational techniques revealed few qualitative/quantitative differences in performance. The KLT method was less discriminative of active and passive function in the isometric and passive cases (see figure, and table) for the test participant, yet showed a slight improvement over the CNN in the combined function case for torque and ankle angle predictions. The CNN method performed better than the KLT method when predicting EMG, and worse when predicting torque, while there was very little difference for ankle angle (see tables 1-3).
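For reference, the error measure can be computed as below, assuming the figures quoted above are root mean square errors as in the abstract. The function name and the list-based interface are choices made for this sketch.

```python
import math

def rmse(predicted, measured):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(measured) or len(predicted) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    return math.sqrt(squared / len(predicted))
```

Applied per label (EMG in V, torque in Nm, angle in degrees) over the samples of the held-out participant, this yields one error value per measured quantity.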
Analysis of CNN cross-validation results revealed that the most important factors for generalization were filter size in the input layer, learning rate, and number of convolutional and pooling layers; the width (number of filters per convolutional layer) of the network was less important. Network depth (number of conv./pooling layers) broadly improved performance, although adding too many pooling layers (i.e. so input dimensions to fc. layers reached 2 × 1 × n where n is the number of filters in the last layer) proved detrimental to generalization. The sizes of the filters in each convolutional layer remained fixed parameters at 3 × 3, except for the input layer, where we found smaller filter sizes improved generalization. The learning rate proved to be the factor with the greatest effect on generalization. Observations of the convolution ReLU response histograms during training revealed large dropout (dying ReLU) in many layers where the learning rate was large (1e-3), and was much more stable for lower learning rates (1e-5). Online training prevented use of adaptive learning rate algorithms. Finally, initialization of the weights was an important factor. Empirical experiments using initial weights sampled from a Gaussian distribution with unit variance and zero mean proved very difficult to train using ReLUs, but not such an issue for other unit types; the tanh unit was investigated but convergence was more of an issue than initial training. Using Xavier initialization our CNNs trained much more easily in just a few hours. All networks converged (diverged from the validation set) within 2.5 million weight updates.
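The collapse of the feature map under repeated pooling can be checked with simple shape arithmetic. The sketch below assumes unpadded ("valid") convolutions with stride 1 and non-overlapping 2 × 2 max pooling; the layer stack in the usage example is hypothetical and is not one of the architectures evaluated here.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1

def pool_out(size, window=2):
    """Output length of non-overlapping max pooling along one axis."""
    return size // window

def trace_shapes(width, height, layers):
    """Follow (width, height) through a list of ('c', k) or ('p', w) layers."""
    shapes = [(width, height)]
    for kind, k in layers:
        f = conv_out if kind == "c" else pool_out
        width, height = f(width, k), f(height, k)
        if width < 1 or height < 1:
            raise ValueError(f"feature map vanished at layer {len(shapes)}")
        shapes.append((width, height))
    return shapes

# Hypothetical stack: five blocks of 3 x 3 convolution followed by 2 x 2 pooling
# applied to the 496 x 120 pixel region of interest.
stack = [("c", 3), ("p", 2)] * 5
print(trace_shapes(496, 120, stack)[-1])
# -> (13, 1)
```

Under these assumptions the shorter image axis is reduced to a single row after five blocks, and a sixth block cannot be applied at all, which illustrates why depth is bounded by the input height.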
Analysis of KLT results revealed very little difference between models. Reducing KLT feature size from 19×19 to 11×11 had little noticeable or measurable effect. The main factor with respect to generalization was model complexity. Smaller models generalized better. Dropout regularization had a negative effect on generalization, where 2 layers of dropout caused convergence at very high error.
V. DISCUSSION
This manuscript details the first published result of its kind within its domain; i.e. generalized prediction of changes in muscle-specific torque, connecting joint angle, and motor unit activity (EMG) directly from standard frame-rate (25Hz) b-mode ultrasound in combined functional conditions. This manuscript also demonstrates the successful application of CNNs to medical ultrasound outside the domain of classification. Our CNNs were trained with relative ease after empirical tests revealed good learning parameters. There is currently no benchmark against which to compare, hence we implemented an existing state of the art technique based on KLT feature tracking, and with some parameterization standard feed-forward ANNs provided a relevant comparison. We have demonstrated state of the art performance with our CNNs, with only marginally smaller errors than the KLT with ANN. All the literature on deep learning suggests that this benchmark can be improved upon with additional data. While our dataset of over 300,000 images may seem large, there were only 19 participants, and because 2 were held out, that leaves only 17 different muscle architectures, probe positions/orientations, and tissue compositions from which to generalize. The fact that we have produced generalized predictions of torque from specific muscles in combined functional conditions from only 17 participants is testament to the feasibility of applying these methods to the problem of measuring torque from specific muscles.
After model selection we applied a standard visualization technique [40] to gain insight into what our CNN had modeled with respect to active and passive muscle function. The technique works the same way as learning in a CNN, where the input is initialized (either with a pair of images or random noise – we used noise) and instead of learning (in the active case) the weights which predict an EMG burst, the error gradient is used to learn the images which predict an EMG burst. The result was 2 image pairs representing a single motion; one motion for active and one motion for passive. The active reconstruction depicted a shearing motion (the superficial and deep parts of the muscle moved right and left respectively), while the passive motion illustrated a broadly uniform left linear translation. The ability to produce these graphics from CNNs is a particularly powerful paradigm, especially in the domain of medical image analysis. For example, if one were to construct a model of muscle function from a complex system like the posterior neck, which consists of 6 bilateral muscle layers, without segmentation, in theory this technique could provide localization of abnormal contractions of the kind that happen to people with cervical dystonia.
The success of both methods opens up a new domain for research and development, namely low-cost, personalized, non-invasive measurement of torque from deep skeletal muscles. This has clinical relevance to many musculoskeletal diseases, such as motor neurone disease (for measurement and monitoring of twitches) and dystonia (for measurement, targeting and treatment of abnormal muscle contractions). The CNN method, however, opens up an additional domain over methods like the KLT, namely drift-free tracking of functional muscle states for drift-free prediction of muscle-specific torque. Currently, both methods presented here track changes in muscle state, and when these changes are accumulated over time the result drifts from the absolute measurement. Feature tracking methods (KLT) drift for a variety of fundamental reasons (noise, viewpoint variation, occlusion, object transformation, etc.), and a KLT tracker has no model of the underlying texture. Our CNNs observed patches of texture in 2 adjacent images in a video sequence and learned to map changes in texture to changes in muscle functional states (EMG, torque, and angle). While the CNNs observed changes in texture, as the problem requires, they also had access to the absolute state in both images and could therefore potentially model absolute states and map them to absolute changes in muscle functional states (i.e. the difference between a muscle at rest and a muscle in some arbitrary state of torque output). This is not possible with the KLT. Our results suggest that this ought to be possible within some arbitrary range, since we demonstrated successful tracking within the experimental range.
We speculate that a KLT approach may work only up to some maximum velocity, owing to the unconstrained nature of the algorithm and to the motion of features out of the ultrasound image plane, which makes them fundamentally untrackable. The CNN, by contrast, might work beyond this range, and perhaps over any conceivable range of motion, because of its intrinsic ability to build models of the range of motion rather than tracking nearest-matching texture patches.
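The drift that arises when frame-to-frame change estimates are accumulated can be illustrated with a short Python sketch. It integrates per-frame changes at the 25 Hz frame rate used here; the sinusoidal "true" joint angle, the noise level, and the bias are hypothetical values chosen only to show the effect and are not taken from our data.

```python
import numpy as np


def integrate_changes(initial_state, predicted_changes):
    """Recover an absolute trajectory by accumulating per-frame changes."""
    return initial_state + np.cumsum(predicted_changes)


def simulate_drift(n_frames=1500, noise_sd=0.02, bias=0.0, seed=0):
    """Return (true_state, tracked_state) for a hypothetical joint-angle signal."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, n_frames + 1) / 25.0             # frame times at 25 Hz
    true_state = 5.0 * np.sin(2 * np.pi * 0.2 * t)    # assumed angle in degrees
    true_changes = np.diff(true_state, prepend=0.0)   # state at t = 0 is 0
    # Each per-frame estimate carries a small systematic and random error.
    predicted = true_changes + bias + rng.normal(0.0, noise_sd, n_frames)
    return true_state, integrate_changes(0.0, predicted)


true_state, tracked = simulate_drift()
error = np.abs(tracked - true_state)
print(f"error after 1 s: {error[24]:.3f}, after 60 s: {error[-1]:.3f}")
```

A constant per-frame bias accumulates linearly with the number of frames, while zero-mean noise accumulates as a random walk, so small per-frame errors do not guarantee a small absolute error over a long recording.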
The consistency of both KLT and CNN over all models and parameters was likely due to the near-perfect normalization of the data using our existing segmentation technique [41]. Accurate segmentation of skeletal muscle for modeling purposes is extremely uncommon. The benefits of segmentation to our approaches are that over-fitting is more difficult (i.e. generalization is encouraged), and that there can be no doubt that the measurements we extract are from specific muscles. We propose that it is possible to recreate this analysis without segmentation: KLT tracking would provide some degree of functional segmentation for an ANN system, and texture differences would do the same, to some degree, for the CNN. In addition, CNNs inherently cope well with translation of features, through the max-pooling layers and weight sharing during convolution. However, we recommend segmentation where possible to validate measurements from specific muscles.
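The tolerance of max-pooling to small translations can be seen in a minimal one-dimensional Python sketch. The feature map, pooling size, and shifts below are illustrative assumptions, and the example shows only the pooling step, not a full CNN.

```python
import numpy as np


def max_pool_1d(x, size):
    """Non-overlapping max-pooling; trailing samples that do not fill a window are dropped."""
    x = np.asarray(x)
    n = (len(x) // size) * size
    return x[:n].reshape(-1, size).max(axis=1)


feature = np.zeros(8)
feature[2] = 1.0                      # a single detected feature

shift_within = np.roll(feature, 1)    # feature moves to index 3, same pooling window
shift_across = np.roll(feature, 2)    # feature moves to index 4, next pooling window

print(max_pool_1d(feature, 2))        # [0. 1. 0. 0.]
print(max_pool_1d(shift_within, 2))   # [0. 1. 0. 0.]
print(max_pool_1d(shift_across, 2))   # [0. 0. 1. 0.]
```

The pooled output is unchanged only while the feature stays inside the same pooling window, so this tolerance is local; larger translations still change the representation, which is one reason segmentation remains useful.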
VI. CONCLUSIONS
In this paper we have presented a novel experiment for the generation of thousands of accurately labeled muscle ultrasound images for modeling functional muscle states. We have presented the first generalized prediction of specific muscle EMG, torque, and joint angle from standard frame-rate b-mode ultrasound for combined functional cases. Existing methods rely on simplistic measures in isolated cases (isometric only, or passive only), which do not generalize and have negligible practical application. We have demonstrated the efficacy of CNNs in this domain, which opens up a whole new line of research, namely deep learning applied to skeletal muscle ultrasound. The work presented here could realistically have practical applications in sport and performance biomechanics, and clinical applications in rehabilitation and in the diagnosis and monitoring of cervical dystonia and motor neurone disease. Although we have not demonstrated application to deep muscles, the techniques presented here are easily transferable to deep muscles. Future research will focus on increasing the population and functional range (larger torques and joint rotations with additional joint variables, i.e. the knee) in our dataset. We will also focus on extending the current research to absolute measurement of torque in multiple muscles, both with and without segmentation.