정보과학회논문지 (Journal of KIISE)


Korean Title: 이미지 캡션 생성을 위한 다중 관점을 가진 자가 교열 트랜스포머
English Title: Self-revising Transformer with Multi-view for Image Captioning
Author(s): Jieun Lee, Jinuk Park, Sanghyun Park
Citation: Journal of KIISE, Vol. 48, No. 3, pp. 340-351 (Mar. 2021)
Korean Abstract:
Image caption generation is the task of automatically producing natural-language descriptions of a scene by identifying the object elements in a given image. Prior work mostly captures information from the image with a single feature extractor and then generates captions with a recurrent neural network-based decoder. However, because a single feature extractor is used, multi-view image information cannot be exploited, and the recurrent neural network-based decoder suffers from the long-term dependency problem. To resolve this, the present study processes and delivers image information from multiple angles through a multi-view encoder that uses several feature extractors. In addition, to compensate for the limitations of recurrent neural networks, we propose a self-revising transformer that raises the completeness of the output by reconstructing the generated sentence through an additional multi-head attention mechanism in the transformer-based decoder layer. To validate the proposed model, its superiority was verified through quantitative and qualitative evaluations in a variety of comparative experiments on the MSCOCO dataset.
English Abstract:
Image captioning is the task of automatically describing a scene by identifying object elements in a given image. In prior research, information has mainly been captured from the image using a single feature extractor, and captions have then been generated by a recurrent neural network-based decoder. However, multi-view image information is not available with this approach because of the single feature extractor, and the recurrent neural network-based decoder suffers from the long-term dependency problem. To address these issues, the proposed model employs a multi-view encoder that uses multiple feature extractors to provide processed image information from various views. In addition, to overcome the limitations of the recurrent neural network, we propose a self-revising transformer that increases the completeness of sentences by revising the generated sentences with an additional multi-head attention mechanism in the transformer-based decoder layer. To validate the proposed model, we verify its superiority through quantitative and qualitative evaluations in various comparative experiments on the MSCOCO dataset.
Keywords: natural language processing, image captioning, multi-head attention, multi-view encoder
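The abstracts outline two architectural ideas: a multi-view encoder that fuses features from several extractors, and a transformer decoder layer with an extra multi-head attention step that revises a draft caption. The sketch below is a minimal PyTorch illustration of such an arrangement, assuming only standard components; the class names (MultiViewEncoder, SelfRevisingDecoderLayer), dimensions, and the way the draft states are fed back are illustrative guesses and are not taken from the paper.

```python
import torch
import torch.nn as nn

class MultiViewEncoder(nn.Module):
    """Illustrative multi-view encoder: projects feature sets from several
    extractors (e.g., two CNN backbones) into a shared space and joins them."""
    def __init__(self, view_dims, d_model=512):
        super().__init__()
        # One projection per view so every feature set ends up with d_model dims.
        self.projections = nn.ModuleList([nn.Linear(d, d_model) for d in view_dims])

    def forward(self, views):
        # views: list of tensors, one per extractor, each (batch, regions_i, view_dims[i])
        projected = [proj(v) for proj, v in zip(self.projections, views)]
        # Concatenate along the region axis to form the decoder's image memory.
        return torch.cat(projected, dim=1)


class SelfRevisingDecoderLayer(nn.Module):
    """Illustrative decoder layer with an extra multi-head attention step that
    re-attends to draft caption states before the feed-forward block."""
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.revise_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, tgt, memory, draft):
        # 1) Self-attention over the caption tokens (causal mask omitted for brevity).
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt)[0])
        # 2) Cross-attention over the multi-view image memory.
        x = self.norms[1](x + self.cross_attn(x, memory, memory)[0])
        # 3) Extra attention over the draft caption states: the "revision" step.
        x = self.norms[2](x + self.revise_attn(x, draft, draft)[0])
        # 4) Position-wise feed-forward network.
        return self.norms[3](x + self.ffn(x))


# Toy usage with random tensors standing in for extracted features.
encoder = MultiViewEncoder(view_dims=[2048, 1536])
memory = encoder([torch.randn(2, 36, 2048), torch.randn(2, 49, 1536)])
layer = SelfRevisingDecoderLayer()
tokens = torch.randn(2, 20, 512)   # embedded caption tokens being decoded
draft = torch.randn(2, 20, 512)    # states of a previously generated draft caption
out = layer(tokens, memory, draft)  # (2, 20, 512)
```

In this sketch the draft states are simply a second input; in practice they might come from a first decoding pass whose hidden states are fed back for revision, which is one plausible reading of the abstract rather than the paper's confirmed design.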