A Machine Learning Architecture Incorporating Meta-features for Topical News Filtering

Tae-jun Kim (김태준); Han-joon Kim (김한준)

데이터베이스 연구회지(SIGDB)

Korean Title	주제 기반 뉴스 기사 수집을 위한 메타 속성 융합형 기계학습 아키텍처
English Title	A Machine Learning Architecture Incorporating Meta-features for Topical News Filtering
Authors	Tae-jun Kim (김태준), Han-joon Kim (김한준)
Citation	Vol. 33, No. 1, pp. 3-14 (April 2017)
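Korean Abstract (translated)	Existing topical crawling techniques based on keyword matching tend to collect many documents that fall outside the given topic. To filter out news articles unrelated to fire incidents, this paper proposes an effective machine learning architecture that performs an ensemble process over a feature set combining conventional bag-of-words features with meta-feature data. The multiple models obtained by applying the two types of features to various machine learning algorithms are combined through a suitable ensemble process and contribute to effective filtering for topical crawling. The ensemble model of the proposed method outperformed the classification models of existing methods. Specifically, it achieved a precision of 93.9%, which is 8.1% higher than that of the best-performing existing model based on naive Bayes, and an F1 measure of 91.1%, which is 1% higher. The model learned with the proposed method also showed a precision-recall curve better suited to filtering.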
English Abstract	Existing topical crawling methods based on keyword matching collect many documents that deviate from the given topic. In this paper, we propose an effective machine learning architecture for filtering out news articles that are not related to fire events. The architecture performs an ensemble process over a feature set that combines bag-of-words features with meta-feature data. Several models, learned by applying the two types of features to various machine learning algorithms, are combined through an appropriate ensemble process to provide effective filtering for topical crawling. The ensemble model of the proposed method outperforms the classification models of conventional methods: compared with the best-performing conventional model, which is based on naive Bayes, it achieves a precision of 93.9% (8.1% higher) and an F1-score of 91.1% (1% higher). In addition, the model learned by the proposed method shows a precision-recall curve better suited to filtering.
Keywords	machine learning, text classification, feature engineering, ensemble, web crawling, bag-of-words
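
The abstract describes combining a bag-of-words model with a meta-feature model through an ensemble step, and evaluating the result by precision and F1. The following is a minimal sketch of that general idea, not the authors' implementation. The abstract does not name the meta-features, the full set of learning algorithms, or the ensemble rule, so everything specific here is an assumption: the two meta-features (`title_has_keyword`, `section_is_society`), the logistic regression model for them, the equal-weight soft-voting rule, and the toy articles are all hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Multinomial naive Bayes over bag-of-words counts, add-one smoothing."""
    vocab = {w for d in docs for w in d.split()}
    counts = {0: Counter(), 1: Counter()}
    for d, y in zip(docs, labels):
        counts[y].update(d.split())
    priors = {c: labels.count(c) / len(labels) for c in (0, 1)}
    return vocab, counts, priors

def nb_prob(model, doc):
    """P(relevant | doc); words outside the training vocabulary are skipped."""
    vocab, counts, priors = model
    logp = {}
    for c in (0, 1):
        total = sum(counts[c].values()) + len(vocab)
        logp[c] = math.log(priors[c]) + sum(
            math.log((counts[c][w] + 1) / total)
            for w in doc.split() if w in vocab)
    m = max(logp.values())
    e = {c: math.exp(v - m) for c, v in logp.items()}
    return e[1] / (e[0] + e[1])

def logreg_prob(w, x):
    """P(relevant | meta-features) under logistic regression; w[-1] is the bias."""
    z = sum(wj * v for wj, v in zip(w, x + [1.0]))
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the mean logistic loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = logreg_prob(w, xi) - yi
            for j, v in enumerate(xi + [1.0]):
                grad[j] += err * v
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def ensemble_prob(nb_model, lr_weights, doc, meta):
    """Assumed ensemble rule: equal-weight soft voting over the two models."""
    return 0.5 * nb_prob(nb_model, doc) + 0.5 * logreg_prob(lr_weights, meta)

def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy articles (hypothetical): label 1 = about a fire incident, 0 = off-topic.
docs = ["fire broke out in warehouse",
        "firefighters contained apartment fire",
        "company will fire employees",
        "singer is on fire this season"]
labels = [1, 1, 0, 0]
# Hypothetical meta-features: [title_has_keyword, section_is_society]
meta = [[1.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, 0.0]]

nb_model = train_nb(docs, labels)
lr_weights = train_logreg(meta, labels)
preds = [int(ensemble_prob(nb_model, lr_weights, d, m) >= 0.5)
         for d, m in zip(docs, meta)]
print(precision_recall_f1(labels, preds))
```

Raising the 0.5 decision threshold trades recall for precision, which is how a precision-recall curve for a filter of this kind would be traced out. Averaging probabilities is only one possible ensemble rule; the paper may use a different one.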