Journal of KIISE (정보과학회논문지)

Korean Title (한글제목): 소규모 데이터 기반 한국어 버트 모델 [A Korean BERT Model Based on Small-Scale Data]
English Title (영문제목): A Small-Scale Korean-Specific BERT Language Model
Authors (저자): Sangah Lee (이상아), Hansol Jang (장한솔), Yunmee Baik (백연미), Suzi Park (박수지), Hyopil Shin (신효필)
Citation (원문수록처): Journal of KIISE, Vol. 47, No. 7, pp. 682-692, July 2020
Korean Abstract (한글내용; translated):
Recent models for sentence-level embedding in natural language processing rely on huge corpora and parameter counts, so they demand large amounts of hardware and data and take a long time to pre-train. This raises the need for a model that achieves comparable performance while using training data economically, even at a modest scale. This study builds a syllable-level Korean vocabulary and a sub-character (jamo)-level Korean vocabulary, and newly introduces sub-character-level training together with a bidirectional WordPiece tokenizer. As a result, we were able to implement KR-BERT, a model that uses one-tenth the training data of existing models and an appropriately sized vocabulary, and that matches their performance with fewer parameters and less computation. This confirms that when building a model for a language such as Korean, which has its own writing system, is morphologically complex, and is resource-poor, the model must reflect the linguistic phenomena specific to that language.
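To make the sub-character idea concrete, the following is a minimal sketch (not the authors' code) of the jamo decomposition that a sub-character-level Korean vocabulary rests on. It uses only the standard Unicode arithmetic for precomposed Hangul syllables (U+AC00..U+D7A3, encoding 19 initial consonants x 21 medial vowels x 28 optional finals); the choice of conjoining jamo as output characters is this sketch's assumption.

    # Sketch of jamo (sub-character) decomposition via Unicode arithmetic.
    CHOSEONG  = [chr(0x1100 + i) for i in range(19)]         # initial consonants
    JUNGSEONG = [chr(0x1161 + i) for i in range(21)]         # medial vowels
    JONGSEONG = [""] + [chr(0x11A8 + i) for i in range(27)]  # finals; index 0 = none

    def to_jamo(text):
        """Decompose each precomposed Hangul syllable into its jamo."""
        out = []
        for ch in text:
            idx = ord(ch) - 0xAC00
            if 0 <= idx <= 0xD7A3 - 0xAC00:               # Hangul syllable block
                out.append(CHOSEONG[idx // 588])          # 588 = 21 * 28
                out.append(JUNGSEONG[(idx % 588) // 28])
                out.append(JONGSEONG[idx % 28])
            else:
                out.append(ch)                            # non-Hangul passes through
        return "".join(out)

    # Example: to_jamo("한국어") yields the eight-jamo sequence ㅎ ㅏ ㄴ / ㄱ ㅜ ㄱ / ㅇ ㅓ.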
English Abstract (영문내용):
Recent models for sentence embedding use huge corpora and large numbers of parameters, so they require massive data and hardware and take an extensive amount of time to pre-train. This tendency raises the need for a model that offers comparable performance while using training data economically. In this study, we propose KR-BERT, a Korean-specific model that uses sub-character-level and character-level Korean vocabularies together with a BidirectionalWordPiece Tokenizer. As a result, our KR-BERT model performs comparably to, and in some cases better than, other existing pre-trained models while using one-tenth the size of their training data. This demonstrates that for a morphologically complex and low-resource language, sub-character-level representations and the BidirectionalWordPiece Tokenizer capture language-specific linguistic phenomena that the Multilingual BERT model misses.
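As a rough illustration of how such a tokenizer could work, here is a minimal sketch. It is an assumption, not KR-BERT's implementation: it supposes the bidirectional tokenizer runs the usual greedy longest-match-first WordPiece pass from the left, a mirrored pass from the right, and keeps whichever segmentation covers the word in fewer subwords; the authors' actual matching policy may differ. The vocab argument is a hypothetical set of subword strings using the standard '##' continuation prefix.

    def greedy_left(word, vocab):
        """Standard WordPiece: take the longest vocab piece at the front, repeat."""
        pieces, start = [], 0
        while start < len(word):
            end, piece = len(word), None
            while end > start:
                sub = word[start:end]
                if start > 0:
                    sub = "##" + sub          # non-initial pieces carry '##'
                if sub in vocab:
                    piece = sub
                    break
                end -= 1
            if piece is None:
                return None                   # word not coverable by this pass
            pieces.append(piece)
            start = end
        return pieces

    def greedy_right(word, vocab):
        """Mirrored pass: take the longest vocab piece at the back, repeat."""
        pieces, end = [], len(word)
        while end > 0:
            start, piece = 0, None
            while start < end:
                sub = word[start:end]
                if start > 0:
                    sub = "##" + sub
                if sub in vocab:
                    piece = sub
                    break
                start += 1
            if piece is None:
                return None
            pieces.append(piece)
            end = start
        return pieces[::-1]                   # collected right-to-left; restore order

    def bidirectional_wordpiece(word, vocab):
        """Keep whichever directional segmentation uses fewer pieces."""
        candidates = [c for c in (greedy_left(word, vocab), greedy_right(word, vocab)) if c]
        return min(candidates, key=len) if candidates else ["[UNK]"]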
Keywords (키워드): language modeling, embedding model, Korean language modeling, vocabulary, tokenizer, BERT