TY - GEN
T1 - Video2vec
T2 - 23rd International Conference on Pattern Recognition, ICPR 2016
AU - Hu, Sheng Hung
AU - Li, Yikang
AU - Li, Baoxin
N1 - Funding Information:
The work was supported in part by ONR grants N00014-15-1-2344 and N00014-15-1-2722. Any opinions expressed in this material are those of the authors and do not necessarily reflect the views of ONR or ARO.
Publisher Copyright:
© 2016 IEEE.
PY - 2016/1/1
Y1 - 2016/1/1
AB - We propose to learn semantic spatio-temporal embeddings for videos to support high-level video analysis. The first step of the proposed embedding employs a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion) followed by their corresponding Gated Recurrent Unit encoders for capturing longer-term temporal structure of the CNN features. The resultant spatio-temporal representation (a vector) is used to learn a mapping via a multilayer perceptron to the word2vec semantic embedding space, leading to a semantic interpretation of the video vector that supports high-level analysis. We demonstrate the usefulness and effectiveness of this new video representation by experiments on action recognition, zero-shot video classification, and 'word-to-video' retrieval, using the UCF-101 dataset.
UR - http://www.scopus.com/inward/record.url?scp=85019107022&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85019107022&partnerID=8YFLogxK
U2 - 10.1109/ICPR.2016.7899735
DO - 10.1109/ICPR.2016.7899735
M3 - Conference contribution
AN - SCOPUS:85019107022
T3 - Proceedings - International Conference on Pattern Recognition
SP - 811
EP - 816
BT - 2016 23rd International Conference on Pattern Recognition, ICPR 2016
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 4 December 2016 through 8 December 2016
ER -