An Active Learning-based Method for Composing Training Document Set in Bayesian Text Classification Systems

김제욱; 김한준; 이상구

정보과학회 논문지 B : 소프트웨어 및 응용 (Journal of the Korea Information Science Society B: Software and Applications)

Korean Title	베이지언 문서분류시스템을 위한 능동적 학습 기반의 학습문서집합 구성방법
English Title	An Active Learning-based Method for Composing Training Document Set in Bayesian Text Classification Systems
Author	김제욱, 김한준, 이상구
Citation	Vol. 29, No. 12, pp. 966-978 (December 2002)
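Korean Abstract	Among the factors that determine the accuracy of a text classification system built with machine learning techniques, the most important are the selection of the training document set and the way that set is composed. The selection problem is to pick, from an arbitrary document space, a small set of highly informative documents and adopt them as training documents. The composition problem is to reorganize the selected training document set so as to build a more accurate classification function. The representative algorithm for the former problem is active learning, and for the latter it is boosting. This paper applies both algorithms to the Naïve Bayes text classification algorithm, analyzes the characteristics that emerge, and proposes AdaBUS, a new method for composing the training document set. Borrowing the idea of active learning, the algorithm increases the variance among the temporary classification functions (weak hypotheses) that are built on the way to the final classification function. This realizes the effect of perturbation, the key concept that boosting needs in order to work effectively, and thereby raises classification accuracy. Empirical experiments on the Reuters-21578 document collection demonstrate that AdaBUS improves the accuracy of Naïve Bayes-based text classification systems by a larger margin than existing algorithms.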
English Abstract	There are two important problems in improving text classification systems based on a machine learning approach. The first one, called the "selection problem", is how to select a minimum number of informative documents from a given document collection. The second one, called the "composition problem", is how to reorganize the selected training documents so that they fit the adopted learning method. The former problem is addressed by "active learning" algorithms, and the latter by "boosting" algorithms. This paper proposes a new learning method, called AdaBUS, which proactively solves the above problems in the context of Naïve Bayes classification systems. The proposed method constructs a more accurate classification hypothesis by increasing the variance among the "weak" hypotheses that determine the final classification hypothesis. Consequently, the proposed algorithm yields the perturbation effect that makes the boosting algorithm work properly. Through an empirical experiment using the Reuters-21578 document collection, we show that the AdaBUS algorithm improves the Naïve Bayes-based classification system more significantly than other conventional learning methods.
Keyword	Award-winning paper, 2002 Information Science Paper Contest; composing training document set; Naïve Bayes text classifier; boosting algorithm; uncertainty-based sampling algorithm; AdaBUS algorithm
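
The "selection problem" described in the abstract can be illustrated with a minimal sketch of uncertainty-based sampling on top of a Naïve Bayes classifier. This is not the AdaBUS algorithm and does not reproduce the paper's experiments; it only shows the generic idea of picking the unlabeled document the current classifier is least sure about. The function names (`train_nb`, `posterior`, `most_uncertain`), the multinomial model with Laplace smoothing, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model with Laplace (add-one) smoothing."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d.split()}
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d.split())
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)
        loglik[c] = {w: math.log((counts[c][w] + 1) / total) for w in vocab}
    return prior, loglik

def posterior(model, doc):
    """Return P(class | doc); words outside the training vocabulary are ignored."""
    prior, loglik = model
    scores = {
        c: prior[c] + sum(loglik[c][w] for w in doc.split() if w in loglik[c])
        for c in prior
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def most_uncertain(model, pool, k=1):
    """Pick the k pool documents whose highest class posterior is lowest."""
    return sorted(pool, key=lambda d: max(posterior(model, d).values()))[:k]

# Hypothetical toy data: two labeled documents and a small unlabeled pool.
labeled = ["wheat corn harvest", "stock market shares"]
labels = ["grain", "finance"]
pool = ["wheat harvest report", "corn market shares", "stock shares rise"]

model = train_nb(labeled, labels)
query = most_uncertain(model, pool, k=1)
print(query)
```

In an active learning loop, the selected document would be labeled by an oracle, appended to the labeled set, and the model retrained before the next query. How such a sampler is combined with boosting is specific to the paper and is not sketched here.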