TY - GEN
T1 - Ensemble learning on deep neural networks for image caption generation
AU - Katpally, Harshitha
AU - Bansal, Ajay
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/2
Y1 - 2020/2
N2 - Capturing the information in an image in a natural language sentence is considered a difficult problem for computers to solve. Image captioning involves not just detecting objects in images but also understanding the interactions between those objects so that they can be translated into relevant captions. Expertise in computer vision paired with natural language processing is therefore crucial for this purpose. The sequence-to-sequence modeling strategy of deep neural networks is the traditional approach for generating the sequential list of words that together describe the image. However, these models suffer from high variance, failing to generalize well beyond the training data. The main focus of this paper is to reduce this variance, which helps in generating better captions. To achieve this, ensemble learning techniques have been explored, as they have a reputation for mitigating the high-variance problem that occurs in machine learning algorithms. Three ensemble techniques, namely k-fold ensemble, bootstrap aggregation ensemble, and boosting ensemble, have been evaluated in our work. For each of these techniques, three output combination approaches have been analyzed. Extensive experiments have been conducted on the Flickr8k dataset, which contains a collection of 8000 images with 5 different captions for every image. The BLEU score, a performance metric considered standard for evaluating natural language processing (NLP) problems, is used to evaluate the predictions. Based on this metric, the analysis shows that ensemble learning performs significantly better and generates more meaningful captions than any of the individual models used.
AB - Capturing the information in an image in a natural language sentence is considered a difficult problem for computers to solve. Image captioning involves not just detecting objects in images but also understanding the interactions between those objects so that they can be translated into relevant captions. Expertise in computer vision paired with natural language processing is therefore crucial for this purpose. The sequence-to-sequence modeling strategy of deep neural networks is the traditional approach for generating the sequential list of words that together describe the image. However, these models suffer from high variance, failing to generalize well beyond the training data. The main focus of this paper is to reduce this variance, which helps in generating better captions. To achieve this, ensemble learning techniques have been explored, as they have a reputation for mitigating the high-variance problem that occurs in machine learning algorithms. Three ensemble techniques, namely k-fold ensemble, bootstrap aggregation ensemble, and boosting ensemble, have been evaluated in our work. For each of these techniques, three output combination approaches have been analyzed. Extensive experiments have been conducted on the Flickr8k dataset, which contains a collection of 8000 images with 5 different captions for every image. The BLEU score, a performance metric considered standard for evaluating natural language processing (NLP) problems, is used to evaluate the predictions. Based on this metric, the analysis shows that ensemble learning performs significantly better and generates more meaningful captions than any of the individual models used.
KW - Boosting
KW - Bootstrap aggregation
KW - Deep neural networks
KW - Ensemble learning
KW - Image captioning
KW - K-fold ensemble
UR - http://www.scopus.com/inward/record.url?scp=85083468511&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85083468511&partnerID=8YFLogxK
U2 - 10.1109/ICSC.2020.00016
DO - 10.1109/ICSC.2020.00016
M3 - Conference contribution
AN - SCOPUS:85083468511
T3 - Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020
SP - 61
EP - 68
BT - Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 14th IEEE International Conference on Semantic Computing, ICSC 2020
Y2 - 3 February 2020 through 5 February 2020
ER -