Grid-based k-Nearest Neighbor Join Query Processing Algorithm using MapReduce

윤들녁 (DeulNyeok Yoon); 장미영 (Miyoung Jang); 장재우 (Jae-Woo Chang)

Database Society Journal (데이터베이스 연구회지, SIGDB)

Korean Title	맵리듀스를 이용한 그리드 기반 k-NN 조인 질의처리 알고리즘
English Title	Grid-based k-Nearest Neighbor Join Query Processing Algorithm using MapReduce
Author	윤들녁 (DeulNyeok Yoon), 장미영 (Miyoung Jang), 장재우 (Jae-Woo Chang)
Citation	Vol. 30, No. 2, pp. 103-115 (Aug. 2014)
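Korean Abstract (English translation)	Various MapReduce-based query processing algorithms have recently been studied for analyzing large-scale data. In particular, given two different databases R and S, the k-NN join query processing algorithm finds, for every data item in R, the k items in S nearest to it, and it plays a very important role in applications built on data mining and analysis. However, the representative existing approach, the Voronoi-based k-NN join query processing algorithm, has a very high Voronoi index construction cost and is therefore unsuitable for large-scale data with frequent updates. In addition, the R-tree it uses to store Voronoi cell information is ill-suited to distributed parallel processing in a MapReduce environment. This paper therefore proposes a new grid index-based k-NN join query processing algorithm. First, to address the high index construction cost, we propose a dynamic grid index generation scheme that takes the data distribution into account. Second, to perform k-NN join queries efficiently in a MapReduce environment, we propose a candidate region search and filtering algorithm that uses adjacent-cell information as a signature. As a result, only the cells adjacent to the grid cell containing each data item of R are searched, and only the relevant data are sent as MapReduce input, which greatly reduces data I/O and computation time. Finally, a performance evaluation shows that the proposed scheme achieves high query result accuracy while delivering up to three times better performance than the existing scheme in terms of query processing time.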
English Abstract	Recently, MapReduce-based query processing algorithms have been widely studied for analyzing big data. The k-nearest neighbor (k-NN) join algorithm, which produces the k nearest neighbors in a data set S for each point of another data set R, is considered highly important in data analysis-based applications. However, the existing k-NN join query processing algorithm suffers from a high index construction cost, which makes it unsuitable for big data processing. Furthermore, to store data partitioning information, the existing algorithm uses an R-tree, which is not well suited to a distributed computing environment. To solve these problems, we propose a new grid-based k-NN join query processing algorithm. First, to reduce the index construction cost, we design a dynamic grid index construction algorithm that considers the data distribution. Second, to efficiently perform a k-NN join query in MapReduce, we devise a candidate cell retrieval and pruning method based on a data signature. As a result, our algorithm retrieves only the data in cells neighboring the query cell and sends them as the input of a MapReduce job, which greatly reduces the data transmission and computation overhead. Our performance analysis shows that our algorithm is up to 3 times faster than the existing work in terms of query processing time while achieving high query result accuracy.
Keywords	k-NN join query processing algorithm; MapReduce-based query processing; large-scale data processing; distributed query processing; MapReduce-based processing; distributed data processing
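
The adjacent-cell candidate search described in the abstract can be illustrated with a minimal, single-process sketch. This is not the authors' implementation: it assumes 2-D points, a fixed uniform grid (the paper proposes a dynamic, distribution-aware grid), and it replaces the signature-based filtering with plain replication of each S point to its own cell and the eight adjacent cells. The names `cell_of`, `map_phase`, `reduce_phase`, `grid_knn_join`, and the `cell_size` parameter are hypothetical, and the map and reduce steps are simulated in memory rather than run on Hadoop.

```python
import heapq
import math
from collections import defaultdict


def cell_of(point, cell_size):
    """Return the integer grid-cell coordinates containing a 2-D point."""
    x, y = point
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def map_phase(R, S, cell_size):
    """Emit (cell, tagged record) pairs, as a mapper would."""
    for rid, p in R.items():
        # Each R point goes only to the cell that contains it.
        yield cell_of(p, cell_size), ("R", rid, p)
    for sid, p in S.items():
        cx, cy = cell_of(p, cell_size)
        # Assumption: replicate each S point to its own cell and the 8
        # adjacent cells, so a cell's reducer sees its whole neighbourhood.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                yield (cx + dx, cy + dy), ("S", sid, p)


def reduce_phase(grouped, k):
    """For each R point in a cell, keep the k closest candidate S points."""
    result = {}
    for records in grouped.values():
        r_points = [(i, p) for tag, i, p in records if tag == "R"]
        s_points = [(i, p) for tag, i, p in records if tag == "S"]
        for rid, rp in r_points:
            nearest = heapq.nsmallest(
                k, s_points, key=lambda s: math.dist(rp, s[1])
            )
            result[rid] = [sid for sid, _ in nearest]
    return result


def grid_knn_join(R, S, k, cell_size):
    """Simulate the shuffle step by grouping mapper output by cell key."""
    grouped = defaultdict(list)
    for key, value in map_phase(R, S, cell_size):
        grouped[key].append(value)
    return reduce_phase(grouped, k)
```

Because only adjacent cells are searched, this sketch is approximate: an R point whose neighbourhood holds fewer than k points of S gets a shorter result list, and a true neighbor lying two or more cells away is missed. Choosing the cell size from the data distribution, as the abstract's dynamic grid does, is one way to limit that loss of accuracy.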