Performance Comparison of the Spoken Language Detection Model with Embedding Replacement

Hyeonjong Kim (김현종); Juhong Namgung (남궁주홍); Yang-Sae Moon (문양세); Hyung-Jin Choi (최형진); Myeong-Seon Gil (길명선)

Journal: 데이터베이스 연구회지 (SIGDB), a Korean domestic journal

Korean Title	임베딩 교체에 따른 구어체 텍스트 탐지 모델 성능 비교
English Title	Performance Comparison of the Spoken Language Detection Model with Embedding Replacement
Authors	Hyeonjong Kim (김현종); Juhong Namgung (남궁주홍); Yang-Sae Moon (문양세); Hyung-Jin Choi (최형진); Myeong-Seon Gil (길명선)
Citation	Vol. 36, No. 2, pp. 45-55 (August 2020)
Abstract	Deep learning-based abuse detection models are limited in accuracy by the typos and spacing errors that are frequent in Korean spoken-language text. In particular, when spoken-language text is morphologically analyzed to generate training data, morphemes that make word meaning hard to grasp are frequently produced, and this is the largest factor degrading the accuracy of detection models. To overcome these problems of Korean spoken language, we design and implement detection models built on different embeddings and compare their abuse detection accuracy. We use four embedding models for detection (Word2Vec, fastText, SKT-KoBERT, and KoELECTRA) and compare and evaluate the performance of each embedding-based abuse detection model through experiments. In the experiment on the character unit used, both Word2Vec and fastText achieved accuracy of 90% or higher. In the experiment on whether ambiguity is determined, SKT-KoBERT performed significantly better than fastText. Finally, in the experiment on pre-training method, SKT-KoBERT also outperformed KoELECTRA. We expect these results to help apply more effective embedding techniques to a variety of deep learning services based on spoken-language text.
Keywords	Spoken language; Text-based deep learning model; Text embedding; Abuse detection
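
The abstract contrasts word-level embeddings (Word2Vec) with subword-based ones (fastText) on typo-prone text. As a rough illustration of the general subword idea only, the sketch below extracts character n-grams with boundary markers and measures how many of them a misspelled word still shares with its correct form. A purely word-level vocabulary would instead treat the misspelling as an unrelated or unknown token. The function names, the English example words, and the n-gram range of 3 to 5 are assumptions chosen for illustration. None of them come from the paper, and this is not the authors' implementation.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the set of character n-grams of `word`.

    The word is padded with '<' and '>' so that prefixes and suffixes
    yield n-grams distinct from word-internal ones.
    """
    padded = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the padded word.
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical example pair: a word and a one-letter-dropped typo.
correct = char_ngrams("embedding")
typo = char_ngrams("embeding")

# Fraction of subword units the two spellings have in common.
shared = jaccard(correct, typo)
print(f"shared n-gram ratio: {shared:.2f}")
```

A subword model composes a word vector from the vectors of such n-grams, so a nonzero overlap means the misspelled form still receives a vector related to the correct one. How much this helps on Korean spoken-language text, and at which character unit, is what the paper's experiments measure.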