정보과학회논문지 (Journal of KIISE)
한글제목(Korean Title) |
소규모 데이터 기반 한국어 버트 모델 (A Korean BERT Model Based on Small-Scale Data) |
영문제목(English Title) |
A Small-Scale Korean-Specific BERT Language Model |
저자(Author) |
이상아
장한솔
백연미
박수지
신효필
Sangah Lee
Hansol Jang
Yunmee Baik
Suzi Park
Hyopil Shin
|
원문수록처(Citation) |
Vol. 47, No. 7, pp. 682-692 (July 2020) |
한글내용 (Korean Abstract) |
Recent models for sentence-level embedding in natural language processing rely on huge corpora and parameter counts, so they demand large hardware and datasets and take a long time to train. This raises the need for a model that, even at a modest scale, uses its training data economically while achieving comparable performance. This study constructs a syllable-level Korean vocabulary and a sub-character (jamo)-level Korean vocabulary, and newly introduces sub-character-level training and a bidirectional WordPiece tokenizer. As a result, using one-tenth the training data of existing models and an appropriately sized vocabulary, we implemented KR-BERT, a model with fewer parameters and less computation but comparable performance. This confirms that when building a model for a language like Korean, which has its own writing system, is morphologically complex, and is resource-poor, the linguistic phenomena specific to that language must be reflected.
|
영문내용 (English Abstract) |
Recent models for sentence embedding use huge corpora and parameter counts; they demand massive data and large hardware, and pre-training them takes an extensive amount of time. This tendency raises the need for a model with comparable performance that uses training data economically. In this study, we propose a Korean-specific model, KR-BERT, which uses sub-character-level and character-level Korean vocabularies together with a BidirectionalWordPiece Tokenizer. Our KR-BERT model performs comparably to, and in some cases better than, other existing pre-trained models while using one-tenth the training data. This demonstrates that for a morphologically complex, low-resource language, sub-character-level representations and the BidirectionalWordPiece Tokenizer capture language-specific linguistic phenomena that the multilingual BERT model missed.
|
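The sub-character (jamo) vocabulary described in the abstract depends on decomposing precomposed Hangul syllables into their initial, medial, and final jamo. A minimal sketch of that decomposition is shown below, using the standard Unicode composition formula for Hangul syllables; this illustrates the general technique only, not the paper's actual tokenizer code, and the function name `to_jamo` is hypothetical.

```python
# Unicode precomposed Hangul syllables are arranged algorithmically:
#   code point = 0xAC00 + (lead * 21 + vowel) * 28 + tail
# so each syllable decomposes into jamo with integer arithmetic.

LEADS = [chr(0x1100 + i) for i in range(19)]          # initial consonants
VOWELS = [chr(0x1161 + i) for i in range(21)]         # medial vowels
TAILS = [""] + [chr(0x11A8 + i) for i in range(27)]   # optional final consonants

def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables into jamo; pass other characters through."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 19 * 21 * 28:  # within the precomposed syllable block
            lead, rest = divmod(code, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(LEADS[lead] + VOWELS[vowel] + TAILS[tail])
        else:
            out.append(ch)
    return "".join(out)

print(to_jamo("한국어"))  # three syllables decompose into eight jamo
```

A sub-character vocabulary is then built over these jamo sequences rather than whole syllables, which is what lets a small vocabulary still cover Korean's rich morphology.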
키워드(Keyword) |
language modeling
embedding model
Korean language modeling
vocabulary
tokenizer
BERT
|