A Study on Spam Document Classification Method using Characteristics of Keyword Repetition

Seongjin Lee; Jongbum Baik; Chung-Seok Han; Soowon Lee

The KIPS Transactions: Part B (정보처리학회 논문지 B)

Title	A Study on Spam Document Classification Method using Characteristics of Keyword Repetition
Original Title (Korean)	단어 반복 특징을 이용한 스팸 문서 분류 방법에 관한 연구
Author	Seongjin Lee; Jongbum Baik; Chung-Seok Han; Soowon Lee
Citation	Vol. 18-B, No. 5, pp. 315-324 (October 2011)
Abstract	On the Internet, the flood of spam causes serious social problems such as leakage of personal information, monetary loss from phishing, and indiscriminate distribution of harmful content. Moreover, the forms and techniques of spam that spreads harmful information requiring social control are becoming increasingly diverse. Learning-based spam classification using the Bag-of-Words model is the most commonly used approach in research to date. However, because this approach classifies spam documents using only the occurrence information of keywords seen while training the classification model, it copes poorly with the spam-filter evasion techniques commonly found in recent spam. To address this problem, this paper proposes a spam document detection method that uses the characteristics of words repeated within a document. Most recent spam documents tend to repeat the spam phrases they are meant to expose, and this tendency can serve as a criterion for identifying spam documents. We define six variables that capture the characteristics of word repetition and use them as features for building a classification model. To evaluate the proposed method, we ran comparative experiments against existing methods on blog post data and e-mail data; analysis of the results shows that the proposed method performs better.
Keywords	Spam Filtering; Spam; Spamdexing; Term Spamming; Word Repetition
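
The abstract's central idea, that word repetition carries a signal a keyword-occurrence model misses, can be illustrated with a minimal Python sketch. The abstract does not define the six variables, so the three statistics below (`repeated_type_ratio`, `max_term_share`, `type_token_ratio`), the function names, and the sample texts are all hypothetical illustrations, not the authors' features. Whitespace-style tokenization is also an assumption; Korean text would likely need a morphological analyzer.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def repetition_features(text):
    """Return illustrative word-repetition statistics for one document.

    Hypothetical stand-ins, NOT the six variables defined in the paper.
    """
    tokens = tokenize(text)
    if not tokens:
        return {
            "repeated_type_ratio": 0.0,
            "max_term_share": 0.0,
            "type_token_ratio": 0.0,
        }
    counts = Counter(tokens)
    repeated_types = sum(1 for c in counts.values() if c > 1)
    return {
        # Share of distinct words that occur more than once.
        "repeated_type_ratio": repeated_types / len(counts),
        # Share of all tokens taken up by the single most frequent word.
        "max_term_share": max(counts.values()) / len(tokens),
        # Distinct words divided by total tokens (low means repetitive).
        "type_token_ratio": len(counts) / len(tokens),
    }


# Made-up sample documents for illustration only.
spam_like = "cheap pills cheap pills cheap pills buy now"
normal = "the meeting moved to friday because the room was booked"

spam_feats = repetition_features(spam_like)
normal_feats = repetition_features(normal)
```

In a setup like the one the abstract describes, values of this kind would be passed to a classifier as document features; the abstract does not say which classifier the authors used.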