INDOOR VISUAL UNDERSTANDING THROUGH IMAGE CAPTIONING

Authors

  • Dhomas Hatta Fudholi, Department of Informatics, Universitas Islam Indonesia, Yogyakarta, Indonesia
  • Royan Abida N. Nayoan, Department of Informatics, Universitas Islam Indonesia, Yogyakarta, Indonesia

DOI:

https://doi.org/10.11113/aej.v14.20285

Keywords:

image captioning, indoor, visual understanding, EfficientNet, Transformer

Abstract

Transformers have been widely used in image captioning tasks on English-language datasets such as MSCOCO and Flickr. However, research on image captioning in Indonesian is still rare and typically relies on machine translation to obtain an Indonesian dataset. In this study, a Transformer model is used to generate captions from a modified MSCOCO dataset in order to gain visual understanding of an indoor environment. We modified the MSCOCO dataset by writing new Indonesian text descriptions for the MSCOCO images. A few simple rules were applied when creating the Indonesian dataset, namely including each object's location, colour, and characteristics. Experiments were carried out using several pre-trained CNN models to extract image features before feeding them to the Transformer. We also tuned hyper-parameters by assigning different values for batch size, dropout, and the number of attention heads to obtain the best model. BLEU-n, METEOR, CIDEr, and ROUGE-L were used to evaluate the models. In this study, the model using EfficientNetB0 with a batch size of 128, a dropout of 0.2, and 4 attention heads achieved the best scores across the four evaluation metrics, reaching 0.344 on BLEU-4, 0.535 on ROUGE-L, 0.264 on METEOR, and 0.492 on CIDEr.
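
As an illustration only, and not the authors' released code, the sketch below shows how EfficientNetB0 features could feed a single Transformer decoder block in Keras with the best-performing hyper-parameters reported above (batch size 128, dropout 0.2, 4 attention heads). The vocabulary size, embedding dimension, caption length, and layer arrangement are assumptions, and positional encodings are omitted for brevity.

# Minimal sketch under the assumptions stated above; not the authors' exact implementation.
import tensorflow as tf
from tensorflow.keras import layers

EMBED_DIM = 256     # assumed token/feature embedding size
NUM_HEADS = 4       # best number of attention heads reported in the abstract
DROPOUT = 0.2       # best dropout reported in the abstract
VOCAB_SIZE = 10000  # assumed Indonesian caption vocabulary size
MAX_LEN = 30        # assumed maximum caption length
BATCH_SIZE = 128    # best batch size reported in the abstract

# Encoder: pre-trained EfficientNetB0 (ImageNet weights) used as a frozen feature extractor.
cnn = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
cnn.trainable = False

image_in = layers.Input(shape=(224, 224, 3), name="image")
feat = cnn(image_in)                                    # (batch, 7, 7, 1280) feature map
feat = layers.Reshape((-1, 1280))(feat)                 # (batch, 49, 1280) region sequence
enc_out = layers.Dense(EMBED_DIM, activation="relu")(feat)

# Decoder: one Transformer block over the caption prefix tokens.
tokens_in = layers.Input(shape=(MAX_LEN,), dtype="int32", name="caption_tokens")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens_in)  # positional encoding omitted for brevity
self_att = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM, dropout=DROPOUT)
cross_att = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM, dropout=DROPOUT)
x = layers.LayerNormalization()(x + self_att(x, x, use_causal_mask=True))  # masked self-attention
x = layers.LayerNormalization()(x + cross_att(x, enc_out))                 # attend to image regions
x = layers.Dropout(DROPOUT)(layers.Dense(EMBED_DIM, activation="relu")(x))
logits = layers.Dense(VOCAB_SIZE)(x)                    # next-token scores for each position

model = tf.keras.Model([image_in, tokens_in], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# Training would pair images with shifted Indonesian captions from the modified
# MSCOCO dataset described above, using model.fit(..., batch_size=BATCH_SIZE).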


Published

2024-02-29

Issue

Section

Articles

How to Cite

INDOOR VISUAL UNDERSTANDING THROUGH IMAGE CAPTIONING. (2024). ASEAN Engineering Journal, 14(1), 137-144. https://doi.org/10.11113/aej.v14.20285