International Journal of Control, Automation, and Systems 2024; 22(1): 252-264
Published online January 1, 2024 https://doi.org/10.1007/s12555-022-0051-6
Copyright © The International Journal of Control, Automation, and Systems.
Jialong Xie, Botao Zhang*, Qiang Lu, and Oleg Borisov
Hangzhou Dianzi University
Abstract: Head gestures are a natural, non-verbal communication channel for human-computer and human-robot interaction, conveying attitudes and intentions. However, existing vision-based recognition methods cannot meet the precision and robustness requirements of interaction. Due to limited computational resources, applying most high-accuracy methods to mobile and onboard devices is challenging, and approaches based on wearable devices are inconvenient and expensive. To address these problems, an end-to-end two-stream fusion network named TSIR3D is proposed to identify head gestures from video and analyze human attitudes and intentions. Inspired by the Inception and ResNet architectures, the width and depth of the network are increased to capture motion features sufficiently, and the convolutional kernels are expanded from the spatial domain to the spatiotemporal domain for temporal feature extraction. The fusion position of the two streams is explored under an accuracy/complexity trade-off. Furthermore, a dynamic head gesture dataset named DHG and a behavior tree are designed for human-robot interaction. Experimental results show that the proposed method achieves real-time performance on both a remote server and an onboard computer, and its accuracy on DHG surpasses most state-of-the-art vision-based methods and even most previous approaches based on head-mounted sensors. Finally, TSIR3D is deployed on a Pepper robot equipped with a Jetson TX2.
Keywords: Computer vision, deep learning, head gesture, human-robot interaction.
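To make the architecture described in the abstract concrete, the following is a minimal PyTorch sketch of a two-stream network that uses spatiotemporal (3D) convolutions and a ResNet-style residual block, with the RGB and optical-flow streams fused mid-network by channel concatenation. It is not the authors' TSIR3D implementation: the layer counts, channel widths, fusion point, clip length, and the assumed number of gesture classes are illustrative assumptions only.

# Minimal sketch of a two-stream 3D-convolutional fusion network in PyTorch.
# NOT the authors' TSIR3D: layer counts, channel widths, the fusion point,
# clip length, and the number of gesture classes are illustrative assumptions.
import torch
import torch.nn as nn


class Residual3DBlock(nn.Module):
    """Basic residual block with 3x3x3 spatiotemporal convolutions."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut (ResNet-style)


def stream_stem(in_channels, width=32):
    """Per-stream stem: spatiotemporal convolution, downsampling, one residual stage."""
    return nn.Sequential(
        nn.Conv3d(in_channels, width, kernel_size=(3, 7, 7),
                  stride=(1, 2, 2), padding=(1, 3, 3), bias=False),
        nn.BatchNorm3d(width),
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        Residual3DBlock(width),
    )


class TwoStreamFusion3D(nn.Module):
    """RGB and optical-flow streams fused mid-network by channel concatenation."""

    def __init__(self, num_classes=7, width=32):
        super().__init__()
        self.rgb_stream = stream_stem(3, width)    # RGB clip: B x 3 x T x H x W
        self.flow_stream = stream_stem(2, width)   # flow clip: B x 2 x T x H x W
        self.fused = nn.Sequential(                # layers after the fusion point
            Residual3DBlock(2 * width),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(2 * width, num_classes)

    def forward(self, rgb, flow):
        feat = torch.cat([self.rgb_stream(rgb), self.flow_stream(flow)], dim=1)
        feat = self.fused(feat).flatten(1)
        return self.classifier(feat)


if __name__ == "__main__":
    model = TwoStreamFusion3D(num_classes=7)
    rgb = torch.randn(2, 3, 16, 112, 112)    # batch of 16-frame RGB clips
    flow = torch.randn(2, 2, 16, 112, 112)   # matching optical-flow clips
    print(model(rgb, flow).shape)             # torch.Size([2, 7])

Moving the torch.cat fusion earlier or later in the stack is one way to read the accuracy/complexity trade-off mentioned in the abstract: later fusion preserves more stream-specific capacity but roughly doubles the computation up to the fusion point.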