KIISE Transactions on Computing Practices
Korean Title |
Korean Machine Reading Comprehension using RoBERTa |
English Title |
RoBERTa for Korean Machine Reading Comprehension |
Author |
Yun-Su Choi
Hye-Woo Lee
Tae-Hyeong Kim
Du-Seong Chang
Young-Hoon Lee
Seung-Hoon Na
|
Citation |
Vol. 27, No. 4, pp. 198-203 (Apr. 2021) |
Korean Abstract |
Machine reading comprehension is a natural language processing task of finding the answer to a given question within a paragraph. Recently, research on using language models trained on large amounts of data, such as BERT, for natural language processing has been ongoing. In this paper, we varied the tokenization scheme, e.g., to a form combining morpheme and grapheme units, trained and evaluated RoBERTa, and observed the performance changes according to the tokenization method. We also train RoBERTa, a modification of BERT, and propose a model that combines it with MCAF (Multi-level Co-Attention Fusion) for machine reading comprehension. In experiments using KorQuAD, a Korean machine reading comprehension dataset, the model achieved EM 87.62% and F1 94.61% on the dev set.
|
English Abstract |
Machine reading comprehension is a natural language processing task that finds the answer to a given question in a given paragraph. Recently, studies using language models trained on large amounts of data, such as BERT, for natural language processing have been in progress. In this paper, we adapted a tokenizer capable of analyzing text at the morpheme and grapheme level, trained RoBERTa, and evaluated the changes in benchmark scores depending on the tokenization method. In addition, we trained a RoBERTa model, a modified BERT, and propose a model that combines it with MCAF (Multi-level Co-Attention Fusion) for machine reading comprehension. Experiments with KorQuAD, a Korean machine reading comprehension dataset, showed an EM of 87.62% and an F1 of 94.61% on the dev set.
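The morpheme-plus-grapheme tokenization described in the abstract can be illustrated with a minimal sketch (the function names and the toy vocabulary below are hypothetical, not from the paper): units found in a vocabulary, standing in for a morpheme lexicon, are kept whole, while out-of-vocabulary Hangul syllables are decomposed into their constituent jamo (graphemes) using standard Unicode arithmetic for precomposed syllables.

```python
# Hypothetical sketch of morpheme + grapheme (jamo) tokenization.
# A precomposed Hangul syllable at code point S satisfies
#   S = 0xAC00 + (initial * 21 + medial) * 28 + final.

CHOSEONG = [chr(0x1100 + i) for i in range(19)]          # initial consonants
JUNGSEONG = [chr(0x1161 + i) for i in range(21)]         # medial vowels
JONGSEONG = [""] + [chr(0x11A8 + i) for i in range(27)]  # optional final consonants

def to_jamo(char: str) -> list[str]:
    """Decompose one precomposed Hangul syllable into its jamo;
    pass any other character through unchanged."""
    code = ord(char) - 0xAC00
    if not 0 <= code < 11172:  # outside the Hangul Syllables block
        return [char]
    cho, rest = divmod(code, 21 * 28)
    jung, jong = divmod(rest, 28)
    jamo = [CHOSEONG[cho], JUNGSEONG[jung]]
    if JONGSEONG[jong]:
        jamo.append(JONGSEONG[jong])
    return jamo

def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Keep whitespace-separated units found in `vocab` (a stand-in for a
    morpheme vocabulary) whole; decompose everything else to graphemes."""
    tokens: list[str] = []
    for unit in text.split():
        if unit in vocab:
            tokens.append(unit)
        else:
            for ch in unit:
                tokens.extend(to_jamo(ch))
    return tokens
```

A real system would use a trained morpheme analyzer and subword vocabulary rather than exact whole-word lookup, but the grapheme fallback shown here is the standard jamo decomposition.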
|
Keywords |
machine reading comprehension
language model
tokenizing
RoBERTa
|