TY - GEN
T1 - Cooking with blocks
T2 - 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2019
AU - Gokhale, Tejas
AU - Sampat, Shailaja
AU - Fang, Zhiyuan
AU - Yang, Yezhou
AU - Baral, Chitta
N1 - Funding Information:
We acknowledge support from NSF Grant 1816039.
Publisher Copyright:
© 2019 IEEE Computer Society. All rights reserved.
PY - 2019/6
Y1 - 2019/6
AB - The ability to identify changes or transformations in a scene and to reason about their causes and effects is a key aspect of intelligence. In this work we go beyond recent advances in computational perception and introduce a more challenging task, Image-based Event-Sequencing (IES). In IES, the task is to predict a sequence of actions required to rearrange objects from the configuration in an input source image to the one in the target image. IES also requires systems to possess inductive generalizability. Motivated by evidence from cognitive development, we compile the first IES dataset, the Blocksworld Image Reasoning Dataset (BIRD), which contains images of wooden blocks in different configurations and the sequence of moves to rearrange one configuration into the other. We first explore the use of existing deep learning architectures and show that these end-to-end methods underperform in inferring temporal event sequences and fail at inductive generalization. We propose a modular two-step approach, Visual Perception followed by Event-Sequencing, and demonstrate improved performance by combining learning and reasoning. Finally, by showing an extension of our approach to natural images, we seek to pave the way for future research on event sequencing for real-world scenes.
UR - http://www.scopus.com/inward/record.url?scp=85113852803&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85113852803&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85113852803
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 5
EP - 8
BT - Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2019
PB - IEEE Computer Society
Y2 - 16 June 2019 through 20 June 2019
ER -