AI-Powered Smart Glasses for Sensing and Recognition of Human-Robot Walking Environments

Environment sensing and recognition can allow humans and robots to dynamically adapt to different walking terrains. However, fast and accurate visual perception is challenging, especially on embedded devices with limited computational resources. The purpose of this study was to develop a novel pair of AI-powered smart glasses for onboard sensing and recognition of human-robot walking environments with high accuracy and low latency. We used a Raspberry Pi Pico microcontroller and an ArduCam HM0360 low-power camera, both of which interface with the eyeglass frames using 3D-printed mounts that we custom-designed. We trained and optimized a lightweight and efficient convolutional neural network using a MobileNetV1 backbone to classify the walking terrain as indoor surfaces, outdoor surfaces (grass and dirt), or outdoor surfaces (paved) using over 62,500 egocentric images that we adapted and manually labelled from the Meta Ego4D dataset. We then compiled and deployed our deep learning model using TensorFlow Lite Micro and post-training quantization to create a minimized byte-array model of size 0.31 MB. Our system accurately predicted complex walking environments with 93.6% classification accuracy and had an embedded inference speed of 1.5 seconds during online experiments using the integrated camera and microcontroller. Our AI-powered smart glasses open new opportunities for visual perception of human-robot walking environments where embedded inference and a small form factor are required. Future research will focus on improving the onboard inference speed and further miniaturizing the mechatronic components.


I. INTRODUCTION
Visual sensing and recognition of human-robot walking environments is of growing interest. Applications range from autonomous control and planning of robotic leg prostheses and exoskeletons to providing sensory feedback to persons with visual impairments. However, most egocentric visual perception systems, such as Project Aria by Meta [1], have been limited to off-device inferencing with external machines and cloud computing.
Previous research has mainly focused on head-mounted cameras with large computation systems such as the Raspberry Pi 3 [2], [3], and on chest- and waist-mounted cameras [4], [5]. However, these systems did not integrate the camera sensor and computation within a single system. Other studies [6]-[11] have used large convolutional neural networks (CNNs) with many learnable parameters for accurate image classification of walking terrains, including level ground, stairs, and other obstacles, but lacked onboard inferencing. Some researchers have used classical machine learning methods with success [12]-[16]. While both deep learning and non-deep learning models have shown good accuracy, they have not treated deployment and efficiency as primary objectives. Consequently, these systems have been restricted to high-power computers and have seen limited deployment on mobile and embedded devices.
An integrated visual perception system has yet to be designed, prototyped, and evaluated on edge devices with low inference times. This gap could be explained by limitations in mobile and embedded computing, which have only recently been alleviated by advances in hardware and deep learning model compression methods. Accordingly, the purpose of this study was to develop AI-powered smart glasses that uniquely integrate both sensing and deep learning computation for visual perception of human-robot walking environments while achieving high accuracy and low latency (Fig. 1). We integrated our mechatronic components within a single device that is lightweight and has a small form factor so as not to obstruct mobility or user comfort. Computationally, it has sufficient memory and processing power for real-time inferencing on a live video stream.

II. METHODS

A. Mechatronics Design
The mechanical mounts for our smart glasses reflected two design considerations: 1) the location of the mechatronic components on the frames, and 2) the means by which these components are attached. The location of the microcontroller and camera was partially inspired by commercial smart glasses such as Google Glass [17] and Ray-Ban Stories [18], which have a forward-facing camera and the computational processor on the arms of the frames [19]. This design allows a larger processor without obstructing the visual field-of-view, while having the camera simulate the orientation and perspective of the user, i.e., egocentric vision. We designed a semi-permanent mounting system that would allow our smart glasses to be applicable and transferable to a wide range of eyeglass frames. We custom-designed and 3D-printed mounting brackets for the camera and microcontroller (Fig. 2). The two main mechatronic components required to develop our system are the camera, which captures visual information about the walking environment, and the microcontroller, which processes and computes on the images (Fig. 3). Given the low-power, low-latency constraints of our design, heightened scrutiny of the relevant metrics and constraints was required to identify optimal components. We used the ArduCam HM0360 VGA SPI camera due to its low power consumption, high frame rate, and high resolution [20]. The camera consumes less than 19.6 mW during active VGA sampling. This low power consumption supports the "always-on" operating mode that our smart glasses aim for by ensuring that continuous sensing has minimal effect on the power budget. Another important feature of the camera is its high frame rate. At 60 frames per second, the camera provides the microcontroller with a sampling rate high enough to ensure that the camera's frame rate does not bottleneck the image classification, while also providing up-to-date real-world visual data, reducing lag in our smart glasses' understanding of the scene. The camera captures monochrome images at a resolution of 640x480. This resolution is large enough to portray environmental state information and also supports down-sampling to allow for smaller models and faster inference.
For the onboard computational processing, we used the Raspberry Pi Pico W microcontroller. This recently released board has more memory and CPU power than comparable small boards. The increased processing power and memory, small form factor, and capability for wireless communication made this embedded microcontroller a viable solution for our smart glasses. Compared to microcontrollers of similar size, such as the Arduino Nano 33 BLE with a processing speed of 64 MHz, the Pico contains dual ARM Cortex-M0+ cores running at 133 MHz. This added processing power provides sufficient speed and parallelization to process live video streams while minimizing inference time, aligned with our goal of achieving real-time predictions using deep learning.
The Pico contains 264 kB of SRAM and 2 MB of QSPI flash memory, which is greater than other microcontrollers such as the Arduino Nano 33 BLE. This increased memory is important for running machine learning algorithms directly on the embedded device and provides flexibility in the types of models that can be loaded, including more memory-intensive models such as deep convolutional neural networks. The Pico also has a small form factor of 21 mm x 51.3 mm, which is essential for our design to be easily integrated into eyeglass frames while providing minimal obstruction to mobility or user comfort. The Pico can wirelessly communicate and interface with external robotic devices and computers via a CYW43439 chip, which supports single-band 2.4 GHz Wi-Fi and Bluetooth 5.2.

B. Computer Vision and Deep Learning
We created a new image dataset based on the Meta Ego4D dataset [21]. The full Ego4D dataset includes more than 3,670 hours of egocentric (i.e., first-person) video collected by 923 subjects from 74 locations worldwide (Fig. 4). The images were collected using head-mounted wearable cameras, which makes the dataset highly applicable to our computer vision application. In addition to an appropriate camera angle, the video clips were pre-labelled to identify the scene in which the videos were recorded.
Extending the Ego4D annotations, we developed new class definitions and manually re-labelled images as 1) indoor surfaces, 2) outdoor surfaces (grass and dirt), or 3) outdoor surfaces (paved). See Table 1 for our class distributions. We sampled the videos at one frame per second to collect images. To reduce the memory required to run our deep learning model, images in our dataset were downsampled to 96x96 pixels before being used for training, therein minimizing the staging area requirements for our microcontroller. This downsampling is a constraint imposed by the computing hardware. To help reduce overfitting during model training, we added random horizontal reflections, image zooms, slight rotations, and contrast changes. These augmentations were deemed appropriate because such effects would likely occur during real-world walking while wearing glasses. Finally, all images were converted to grayscale to mimic the conditions of our onboard camera (Fig. 5).
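The preprocessing and augmentation pipeline described above can be sketched in TensorFlow as follows. This is a minimal illustration, not our exact training code: the augmentation strengths (zoom, rotation, and contrast factors) are illustrative assumptions, since the precise values are not reported here.

```python
import tensorflow as tf

IMG_SIZE = 96  # input resolution used for training

# Augmentations applied during training only; the factors below are
# illustrative assumptions, not the exact values used in the study.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),  # random horizontal reflections
    tf.keras.layers.RandomZoom(0.1),           # slight image zooms
    tf.keras.layers.RandomRotation(0.02),      # slight rotations (~7 degrees)
    tf.keras.layers.RandomContrast(0.2),       # contrast changes
])

def preprocess(frame, training=False):
    """Downsample a video frame to 96x96, augment if training, then grayscale."""
    frame = tf.image.resize(frame, (IMG_SIZE, IMG_SIZE))
    if training:
        frame = augment(frame[tf.newaxis], training=True)[0]
    # Convert to grayscale last, mimicking the monochrome onboard camera.
    return tf.image.rgb_to_grayscale(frame)

# Example: one synthetic 640x480 RGB frame sampled from a video
frame = tf.random.uniform((480, 640, 3))
x = preprocess(frame, training=True)
```

In practice the augmentation runs inside the input pipeline so that each epoch sees different random variations of the same source frames.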
For image classification and automatic feature engineering, we used the base model of MobileNetV1 [22] in TensorFlow with an alpha value of 0.25, thereby reducing the model width and the number of learnable parameters to lower the computational demand on our embedded device. An additional 2D convolutional layer was added before the MobileNet base model to expand the grayscale input to the 3-channel image required by the MobileNet layers. The MobileNet base is followed by a 2D global average pooling layer to reduce the dimensionality of the 2D output, followed by a fully connected layer with a softmax activation to predict the three walking terrains (Table 2). The MobileNet architecture was selected as the underlying model, similar to [7], because its depth-wise separable convolutional layers aid efficient and accurate image classification. The model contained ~219,300 parameters and was trained using TensorFlow. The dataset was split into training (70%), validation (15%), and test (15%) sets. To avoid data leakage, frames in the training and validation sets were drawn from different source videos.
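The architecture summarized in Table 2 can be reproduced approximately as follows. This is a sketch under stated assumptions: training from scratch (weights=None) and the 3x3 kernel size of the channel-expansion convolution are our assumptions, not details confirmed in the text.

```python
import tensorflow as tf

# MobileNetV1 backbone with width multiplier alpha=0.25 and no
# classification head; weights=None (training from scratch) is an assumption.
base = tf.keras.applications.MobileNet(
    input_shape=(96, 96, 3), alpha=0.25,
    include_top=False, weights=None)

inputs = tf.keras.Input(shape=(96, 96, 1))  # 96x96 grayscale input
# Expand 1-channel images to the 3 channels MobileNet expects;
# the 3x3 kernel size here is an illustrative assumption.
x = tf.keras.layers.Conv2D(3, 3, padding="same")(inputs)
x = base(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)  # collapse spatial dimensions
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)  # 3 terrain classes

model = tf.keras.Model(inputs, outputs)
```

With these settings the parameter count lands in the same ~220k range reported above, which is what makes the model small enough for microcontroller deployment after quantization.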
Finally, we converted our deep learning model to a TensorFlow Lite model using a quantization method that converts the floating-point numbers to 8-bit integers and resolves incompatible tensor operations. The TensorFlow Lite model was then converted using the TensorFlow Lite Micro tooling to produce a byte array usable by the microcontroller for onboard inference. Our final model was 0.31 MB in size. To quantify inference speed, we took the most recent image from the camera, loaded it into the microcontroller's memory as input to the model, and derived the predicted label for that frame.
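The post-training quantization and byte-array export can be sketched as follows. For brevity, a small placeholder model stands in for the trained terrain classifier, and the representative dataset is random; in practice the calibration samples must be drawn from the real training images so the int8 scales match the data distribution.

```python
import tensorflow as tf

# Small placeholder model standing in for the trained terrain classifier.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(96, 96, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),
])

def representative_dataset():
    # Calibration samples for post-training quantization; random here,
    # but drawn from real training images in practice.
    for _ in range(10):
        yield [tf.random.uniform((1, 96, 96, 1))]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full integer quantization: float32 weights and activations -> int8.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

# Emit a C byte array for TensorFlow Lite Micro on the microcontroller
# (equivalent to running xxd -i on the .tflite file).
hex_bytes = ", ".join(f"0x{b:02x}" for b in tflite_model)
c_array = f"const unsigned char g_model[] = {{{hex_bytes}}};"
```

The resulting byte array is compiled into the microcontroller firmware, where the TensorFlow Lite Micro interpreter runs inference directly from flash.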

III. RESULTS
Our deep learning model achieved a training accuracy of 97.7%, a training loss of 0.07, a validation accuracy of 93.2%, and a validation loss of 0.41 (Fig. 6). During inference on the test set, the compressed model achieved an overall prediction accuracy of 93.6%, an F1-score of 93.6%, a precision of 93.7%, and a recall of 93.6%. The multiclass confusion matrix in Table 3 shows the distribution of the prediction accuracies for each walking environment. The neural network most accurately predicted outdoor surfaces (grass and dirt) at 96.8%, followed by outdoor surfaces (paved) at 94.7% and indoor surfaces at 90%. The onboard inference speed on the embedded device was 1.5 seconds from reading the image to outputting the predicted label.

IV. DISCUSSION
In this study, we developed a novel pair of AI-powered smart glasses that uniquely integrate the sensing and deep learning computation for visual perception of human-robot walking environments with high accuracy and low latency. We designed the system using a Raspberry Pi Pico microcontroller and an ArduCam HM0360 camera, which interface with the eyeglass frames using 3D-printed mounts that we custom-designed. We then trained and optimized a lightweight and efficient convolutional neural network using a MobileNet backbone [22] to classify the walking terrain as indoor surfaces, outdoor surfaces (grass and dirt), or outdoor surfaces (paved) using over 62,500 images that we manually re-labelled from the Meta Ego4D open-source dataset [21]. Our system accurately predicted complex walking terrains in 1.5 seconds with high accuracy (~94%). These results demonstrate, for the first time, the potential to develop fast and accurate egocentric visual perception systems for human-robot walking using embedded computing.
Compared to previous work that required off-device inference due to limited computational power and the high memory demands of machine learning models, our system leverages the latest advances in microcontroller hardware, efficient neural network architectures, and model compression algorithms to develop a novel integrated system. As a result, our smart glasses can uniquely process and classify images of walking environments with low latency and without depending on inconsistent wireless communication to desktop machines or cloud computing.
Compared to leg-, waist-, and chest-mounted systems [4]-[11], our smart glasses offer several benefits due to their human-centered design. Our system was trained and tested using egocentric images, also known as first-person vision. This point-of-view mimics the biological vision system and takes into consideration the orientation of the user's head, which has practical implications for inferring intent. Additionally, our smart glasses have no explicit requirements for pose or viewing angles, unlike other systems [23] that rely on manual heuristics and rule-based thresholds for both the users and environments.
Despite this progress, our study still has several limitations. Our inference speed should be further improved for real-time control. This would be especially important for rehabilitation robotics like exoskeletons that need to dynamically adapt online to different walking terrains. Another area for improvement would be to increase the number of environmental states in our classification model, therein allowing our system to be applicable to a wider range of applications, such as providing sensory feedback to persons with visual impairments and autonomous driving with powered wheelchairs. Future research should also consider further miniaturization of the mechatronic components in order to improve user comfort and acceptance. While our MobileNet model [22] achieved good classification accuracy and embedded inference speed, we also plan to experiment with and compare other lightweight and efficient architectures (e.g., FastViT [24] by Apple or EfficientViT [25] by Microsoft Research) that are suitable for real-time edge computing.

Fig. 6
Fig. 6 The loss and classification accuracy on the training (maroon) and validation (blue) sets.
It is important to note that our visual perception system is designed to supplement, not replace, the existing intent recognition systems of robotic prosthetic legs and exoskeletons that use mechanical, inertial, and/or EMG data to estimate the current state of the human-robot-environment system [26]. We view computer vision as a means to improve the speed and accuracy of locomotion mode (intent) recognition by minimizing the search space of potential solutions based on the perceived walking environment. Accordingly, sensor fusion methods will need to be studied in the future in order to integrate our smart glasses with existing intent recognition systems used for robot control.
In summary, we designed and prototyped a novel pair of AI-powered smart glasses that uniquely integrate sensing and deep learning computation for onboard visual perception of human-robot walking environments. Applications of our technology range from autonomous control of robotic leg prostheses and exoskeletons to providing sensory feedback to persons with visual impairments. Moving forward, we plan to further improve the onboard inference speed and miniaturize the mechatronic components.

Fig. 2
Fig. 2 Our custom-designed 3D-printed mounts for the camera (left) and microcontroller (right), with screw holes to secure the mechatronics to the eyeglass frames using rubber-tipped screws.

Fig. 3
Fig. 3 The ArduCam HM0360 low-power camera (left) used for vision sensing and the Raspberry Pi Pico microcontroller (right) used for onboard image processing and computation.

Fig. 4
Fig. 4 Examples of images we adapted and manually labelled from the Meta Ego4D dataset [21].

Fig. 5
Fig. 5 Examples of images from our glasses, including indoor surfaces (top row), outdoor surfaces (grass and dirt; middle row), and outdoor surfaces (paved; bottom row).

Table 1 .
Breakdown of the class distributions in our new image dataset that we developed from the Meta Ego4D dataset [21].

Table 2 .
The lightweight and efficient deep learning model [22] used for image classification of walking environments.

Table 3 .
Multiclass confusion matrix showing the image classification accuracies (%) during inference on the test set. The columns and rows are the predicted and labelled classes, respectively. The classes include indoor surfaces (IS), outdoor surfaces (grass and dirt; OS-GD), and outdoor surfaces (paved; OS-P).