정보과학회논문지 (Journal of KIISE)
한글제목(Korean Title) |
소규모 데이터 기반 한국어 버트 모델 (A Korean BERT Model Based on Small-Scale Data) |
영문제목(English Title) |
A Small-Scale Korean-Specific BERT Language Model |
저자(Author) |
이상아
장한솔
백연미
박수지
신효필
Sangah Lee
Hansol Jang
Yunmee Baik
Suzi Park
Hyopil Shin
|
원문수록처(Citation) |
Vol. 47, No. 7, pp. 682-692 (July 2020) |
한글내용 (Korean Abstract) |
Recent models for sentence-level embedding in natural language processing rely on huge corpora and parameter counts, so they demand large hardware and datasets and take a long time to train. This raises the need for a model that, even at a modest scale, uses its training data economically while achieving comparable performance. This study constructs a syllable-level Korean vocabulary and a sub-character (jamo)-level Korean vocabulary, and newly introduces sub-character-level training and a bidirectional WordPiece tokenizer. As a result, using one-tenth the training data of existing models and an appropriately sized vocabulary, we implemented KR-BERT, a model with fewer parameters and less computation but comparable performance. This confirms that when building a model for a language like Korean, which has its own writing system, is morphologically complex, and is resource-poor, the linguistic phenomena specific to that language must be reflected.
|
영문내용 (English Abstract) |
Recent models for sentence embedding use huge corpora and parameter counts; they demand massive data and large hardware, and pre-training them takes an extensive amount of time. This tendency raises the need for a model with comparable performance that uses training data economically. In this study, we propose a Korean-specific model, KR-BERT, which uses sub-character-level and character-level Korean vocabularies together with a BidirectionalWordPiece Tokenizer. Our KR-BERT model performs comparably to, and in some cases better than, other existing pre-trained models while using one-tenth the training data. This demonstrates that for a morphologically complex, low-resource language, sub-character-level representations and the BidirectionalWordPiece Tokenizer capture language-specific linguistic phenomena that the multilingual BERT model missed.
|
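The sub-character (jamo) vocabulary described in the abstract depends on decomposing precomposed Hangul syllables into their initial, medial, and final jamo. A minimal sketch of that decomposition is shown below, using the standard Unicode composition formula for Hangul syllables; this illustrates the general technique only, not the paper's actual tokenizer code, and the function name `to_jamo` is hypothetical.

```python
# Unicode precomposed Hangul syllables are arranged algorithmically:
#   code point = 0xAC00 + (lead * 21 + vowel) * 28 + tail
# so each syllable decomposes into jamo with integer arithmetic.

LEADS = [chr(0x1100 + i) for i in range(19)]          # initial consonants
VOWELS = [chr(0x1161 + i) for i in range(21)]         # medial vowels
TAILS = [""] + [chr(0x11A8 + i) for i in range(27)]   # optional final consonants

def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables into jamo; pass other characters through."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 19 * 21 * 28:  # within the precomposed syllable block
            lead, rest = divmod(code, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(LEADS[lead] + VOWELS[vowel] + TAILS[tail])
        else:
            out.append(ch)
    return "".join(out)

print(to_jamo("한국어"))  # three syllables decompose into eight jamo
```

A sub-character vocabulary is then built over these jamo sequences rather than whole syllables, which is what lets a small vocabulary still cover Korean's rich morphology.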
키워드(Keyword) |
language modeling
embedding model
Korean language modeling
vocabulary
tokenizer
BERT
|