Journal of KIISE
Korean Title |
Self-revising Transformer with Multi-view for Image Captioning
English Title |
Self-revising Transformer with Multi-view for Image Captioning |
Author |
Jieun Lee (이지은)
Jinuk Park (박진욱)
Sanghyun Park (박상현)
|
Citation |
Vol. 48, No. 3, pp. 340-351 (March 2021)
Korean Abstract |
Image captioning is the task of automatically generating natural-language descriptions of a scene by identifying the object elements in a given image. Prior studies mainly capture information from the image with a single feature extractor and then generate captions with a recurrent neural network-based decoder. However, because a single feature extractor is used, multi-view image information cannot be exploited, and the recurrent neural network-based decoder suffers from the long-term dependency problem. To address this, this study processes and delivers image information from various angles through a multi-view encoder that uses multiple feature extractors. In addition, to compensate for the limitations of recurrent neural networks, we propose a self-revising transformer that improves the completeness of sentences by reconstructing the generated sentence with an additional multi-head attention mechanism in the transformer-based decoder layer. To validate the proposed model, we demonstrated the superiority of the proposed method through quantitative and qualitative evaluations in various comparative experiments on the MSCOCO dataset.
|
English Abstract |
Image captioning is the task of automatically describing a scene by identifying the object elements in a given image. In prior research, information has mainly been captured from the image using a single feature extractor, and captions have then been generated by a recurrent neural network-based decoder. However, multi-view image information is not available with this approach because of the single feature extractor, and the recurrent neural network-based decoder suffers from a long-term dependency problem. To address these issues, the proposed model employs a multi-view encoder that uses multiple feature extractors to provide processed image information from various views. In addition, to overcome the limitations of the recurrent neural network, we propose a self-revising transformer that increases the completeness of sentences by revising the generated sentence through an additional multi-head attention sub-layer in the transformer-based decoder. To validate the proposed model, we verified its superiority through quantitative and qualitative evaluations with various comparative experiments on the MSCOCO dataset.
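The decoder's self-revision step builds on the standard multi-head attention mechanism. As a rough illustration of that mechanism only (a minimal NumPy sketch; it does not reproduce the paper's multi-view encoder, self-revising sub-layer, or learned projection weights, and all function names here are illustrative):

```python
import numpy as np

def multi_head_attention(Q, K, V, num_heads):
    """Minimal scaled dot-product multi-head attention over 2-D inputs
    of shape (seq_len, d_model); learned projections are omitted."""
    seq_q, d_model = Q.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads

    def split(X):
        # (seq, d_model) -> (heads, seq, d_head)
        return X.reshape(X.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(Q), split(K), split(V)
    # attention scores per head: (heads, seq_q, seq_k)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v                                  # (heads, seq_q, d_head)
    # concatenate heads back to (seq_q, d_model)
    return out.transpose(1, 0, 2).reshape(seq_q, d_model)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                 # 5 tokens, model dim 8
y = multi_head_attention(x, x, x, num_heads=2)
print(y.shape)                              # (5, 8)
```

In the self-revising decoder described in the abstract, an attention sub-layer of this kind is applied a second time over the already-generated sentence so the model can revise it before emitting the final caption.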
|
Keyword |
natural language processing
image captioning
multi-head attention
multi-view encoder
|