TY - JOUR
T1 - Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition
AU - Tu, Zhigang
AU - Li, Hongyan
AU - Zhang, Dejun
AU - Dauwels, Justin
AU - Li, Baoxin
AU - Yuan, Junsong
N1 - Funding Information:
Manuscript received July 8, 2018; revised November 12, 2018 and December 17, 2018; accepted December 26, 2018. Date of publication January 3, 2019; date of current version March 21, 2019. This work was supported in part by Wuhan University under Grant CXFW-18-413100063, in part by the National Key Research and Development Program of China under Grant 2016YFF0103501, in part by the Natural Science Foundation of China (NSFC) under Grant 61572012, and in part by the Natural Science Fund of Hubei Province under Grants 2017CFB598 and 2017CFB677. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Xiaochun Cao. (Corresponding author: Hongyan Li.) Z. Tu is with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China (e-mail: tuzhigang@whu.edu.cn).
Publisher Copyright:
© 1992-2012 IEEE.
PY - 2019/6
Y1 - 2019/6
N2 - Despite outstanding performance in image recognition, convolutional neural networks (CNNs) do not yet achieve the same impressive results on action recognition in videos. This is partially due to the inability of CNNs to model long-range temporal structures, especially those involving the individual action stages that are critical to human action recognition. In this paper, we propose a novel action-stage (ActionS) emphasized spatiotemporal vector of locally aggregated descriptors (ActionS-ST-VLAD) method to aggregate informative deep features across the entire video according to adaptive video feature segmentation and adaptive segment feature sampling (AVFS-ASFS). In our ActionS-ST-VLAD encoding approach, AVFS-ASFS selects keyframe features and automatically splits the corresponding deep features into segments, with the features in each segment belonging to a temporally coherent ActionS. Then, based on the extracted keyframe feature in each segment, a flow-guided warping technique is introduced to detect and discard redundant feature maps, while the informative ones are aggregated using our proposed similarity weight. Furthermore, we exploit an RGBF modality to capture motion-salient regions in the RGB images corresponding to action activity. Extensive experiments are conducted on four public benchmarks (HMDB51, UCF101, Kinetics, and ActivityNet) for evaluation. Results show that our method is able to effectively pool useful deep features spatiotemporally, leading to state-of-the-art performance for video-based action recognition.
AB - Despite outstanding performance in image recognition, convolutional neural networks (CNNs) do not yet achieve the same impressive results on action recognition in videos. This is partially due to the inability of CNNs to model long-range temporal structures, especially those involving the individual action stages that are critical to human action recognition. In this paper, we propose a novel action-stage (ActionS) emphasized spatiotemporal vector of locally aggregated descriptors (ActionS-ST-VLAD) method to aggregate informative deep features across the entire video according to adaptive video feature segmentation and adaptive segment feature sampling (AVFS-ASFS). In our ActionS-ST-VLAD encoding approach, AVFS-ASFS selects keyframe features and automatically splits the corresponding deep features into segments, with the features in each segment belonging to a temporally coherent ActionS. Then, based on the extracted keyframe feature in each segment, a flow-guided warping technique is introduced to detect and discard redundant feature maps, while the informative ones are aggregated using our proposed similarity weight. Furthermore, we exploit an RGBF modality to capture motion-salient regions in the RGB images corresponding to action activity. Extensive experiments are conducted on four public benchmarks (HMDB51, UCF101, Kinetics, and ActivityNet) for evaluation. Results show that our method is able to effectively pool useful deep features spatiotemporally, leading to state-of-the-art performance for video-based action recognition.
KW - Action recognition
KW - ActionS-ST-VLAD
KW - adaptive feature sampling
KW - adaptive video feature segmentation
KW - feature encoding
UR - http://www.scopus.com/inward/record.url?scp=85063468385&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063468385&partnerID=8YFLogxK
U2 - 10.1109/TIP.2018.2890749
DO - 10.1109/TIP.2018.2890749
M3 - Article
AN - SCOPUS:85063468385
SN - 1057-7149
VL - 28
SP - 2799
EP - 2812
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
IS - 6
M1 - 8600333
ER -