Outlier Detection Based on MapReduce for Analyzing Big Data

Yejin Hong (홍예진); Eunhee Na (나은희); Yonghwan Jung (정용환); Yangwoo Kim (김양우)

Journal of the Korean Society for Internet Information (한국인터넷정보학회 논문지)

Korean Title	대용량 데이터 분석을 위한 맵리듀스 기반의 이상치 탐지
English Title	Outlier Detection Based on MapReduce for Analyzing Big Data
Author	Yejin Hong (홍예진), Eunhee Na (나은희), Yonghwan Jung (정용환), Yangwoo Kim (김양우)
Citation	Vol. 18, No. 1, pp. 27-35 (Feb. 2017)
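Korean Abstract (translated)	IoT data is projected to account for a large share of big data in the near future. Accordingly, interest in and research on sensor data, which makes up much of IoT data, are also growing. Sensor data is used in many fields, but when it contains outliers (values that differ from the actual ones), accurate analysis becomes difficult and the results can be distorted to the point of being unusable. In this paper, therefore, outliers were detected and removed before the collected raw data was analyzed, so that accurate results could be obtained. In addition, to process the steadily growing volume of data quickly, the work was carried out in a distributed processing environment using Spark, which takes an in-memory approach. The MapReduce-based outlier detection and removal was implemented in four stages, each implemented as a mapper and a reducer. The proposed technique was evaluated by comparing three environments; the results showed that as the volume of data subject to outlier detection and removal grows, processing in the Spark-based distributed environment is the fastest.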
English Abstract	In the near future, IoT data is expected to make up a major portion of big data, and sensor data a major portion of IoT data, so research on sensor data is actively under way. However, if outliers are included when sensor data is processed, the results cannot be trusted or used. This paper therefore studies a method for detecting and deleting outliers before processing. To process large volumes of sensor data quickly, we used Spark, a memory-based distributed processing environment. Outlier detection and deletion consists of four stages, each implemented with Mapper and Reducer operations. The proposed method was compared in three different processing environments, and the distributed Spark environment processed outlier detection and deletion fastest as the data volume increased.
Keywords	Big Data, Outlier, MapReduce, Distributed Processing, Spark
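
The abstract describes outlier removal built from mapper and reducer steps but does not say which statistic or which four stages the authors used. As an illustration only, the sketch below shows one way such a pipeline could look in plain Python: a mapper emits partial statistics per sensor, a reducer combines them, and a final pass drops readings outside mean ± k standard deviations. The z-score rule, the threshold `k`, the `(sensor_id, value)` record shape, and all function names are assumptions, not the paper's method, and no Spark API is used here.

```python
from collections import defaultdict
from functools import reduce
import math


def map_stats(record):
    """Mapper: emit (sensor_id, (count, sum, sum_of_squares)) for one reading."""
    sensor_id, value = record
    return sensor_id, (1, value, value * value)


def reduce_stats(a, b):
    """Reducer: combine two partial (count, sum, sum_of_squares) tuples."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def reduce_by_key(pairs, reducer):
    """Group (key, value) pairs by key and fold each group with the reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(reducer, values) for key, values in groups.items()}


def remove_outliers(records, k=2.0):
    """Drop readings more than k standard deviations from their sensor's mean.

    `records` must be a list of (sensor_id, value) pairs, since it is read twice.
    The z-score rule and the default k are assumptions made for this sketch.
    """
    totals = reduce_by_key(map(map_stats, records), reduce_stats)
    bounds = {}
    for sensor_id, (n, total, total_sq) in totals.items():
        mean = total / n
        # Population variance; clamp at 0 to absorb floating-point error.
        std = math.sqrt(max(total_sq / n - mean * mean, 0.0))
        bounds[sensor_id] = (mean - k * std, mean + k * std)
    return [(sid, v) for sid, v in records
            if bounds[sid][0] <= v <= bounds[sid][1]]


readings = [("s1", v) for v in
            [20.0, 21.0, 19.0, 20.5, 19.5, 20.0, 21.0, 19.0, 20.0, 95.0]]
cleaned = remove_outliers(readings)
print(len(readings) - len(cleaned), "reading(s) removed")
```

Because the partial statistics are combined with an associative reducer, the same two functions could in principle be handed to a distributed reduce-by-key step; how the paper actually maps its four stages onto Spark is not stated in this record.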