Multi-stream CNN: Learning representations based on human-related regions for action recognition

Zhigang Tu, Wei Xie, Qianqing Qin, Ronald Poppe, Remco C. Veltkamp, Baoxin Li, Junsong Yuan

Research output: Contribution to journal › Article › peer-review

182 Scopus citations


The most successful video-based human action recognition methods rely on feature representations extracted using Convolutional Neural Networks (CNNs). Inspired by the two-stream network (TS-Net), we propose a multi-stream CNN architecture to recognize human actions. We additionally consider human-related regions that contain the most informative features. First, by improving foreground detection, the region of interest corresponding to the appearance and the motion of an actor can be detected robustly under realistic circumstances. Based on the entire detected human body, we construct one appearance stream and one motion stream. In addition, we select a secondary region that contains the major moving part of an actor based on motion saliency. By combining the traditional streams with the novel human-related streams, we introduce a human-related multi-stream CNN (HR-MSCNN) architecture that encodes appearance, motion, and the captured tubes of the human-related regions. Comparative evaluation on the JHMDB, HMDB51, UCF Sports, and UCF101 datasets demonstrates that the streams contain features that complement each other. The proposed multi-stream architecture achieves state-of-the-art results on all four datasets.
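The abstract describes combining complementary streams (appearance and motion, for both the full body and a motion-salient region). A common way to combine such streams is late fusion of per-class scores; the sketch below illustrates that idea with a weighted average. The stream names, weights, and scores here are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of late fusion across multiple recognition streams,
# as used in two-stream-style action recognition architectures.
# All names and values are hypothetical, for illustration only.

def fuse_stream_scores(stream_scores, weights=None):
    """Weighted-average per-class scores across streams."""
    n_classes = len(next(iter(stream_scores.values())))
    if weights is None:
        weights = {name: 1.0 for name in stream_scores}  # equal weighting
    fused = [0.0] * n_classes
    for name, scores in stream_scores.items():
        w = weights[name]
        for c, s in enumerate(scores):
            fused[c] += w * s
    total = sum(weights.values())
    return [s / total for s in fused]

# Four hypothetical streams: appearance/motion for the whole body,
# plus appearance/motion for the motion-salient region.
scores = {
    "appearance_body":   [0.1, 0.7, 0.2],
    "motion_body":       [0.2, 0.6, 0.2],
    "appearance_region": [0.3, 0.5, 0.2],
    "motion_region":     [0.1, 0.8, 0.1],
}
fused = fuse_stream_scores(scores)
pred = max(range(len(fused)), key=fused.__getitem__)  # predicted class index
```

Here each stream votes with its class-score vector and the fused prediction is the class with the highest averaged score; unequal weights (e.g. favoring motion streams) can be passed via the `weights` argument.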

Original language: English (US)
Pages (from-to): 32-43
Number of pages: 12
Journal: Pattern Recognition
State: Published - Jul 2018

Keywords


  • Action recognition
  • Convolutional Neural Network
  • Motion salient region
  • Multi-Stream

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Computer Vision and Pattern Recognition
  • Artificial Intelligence

