FLASHer: A Novel Clustering-Based Scheme for Deduplicating a Large Amount of News Articles with Similar Topics

Joo-Young Lee; Sujin Cha; Young-Kyoon Suh

Journal of the Database Research Society (SIGDB)

Title	FLASHer: A Novel Clustering-Based Scheme for Deduplicating a Large Amount of News Articles with Similar Topics
Authors	Joo-Young Lee, Sujin Cha, Young-Kyoon Suh
Citation	Vol. 35, No. 2, pp. 54-65 (Aug. 2019)
Abstract	In the era of big data, news agencies produce thousands of articles every day, and a wide range of applications are built on this data. Yet most articles covering the same event contain largely the same content. Such duplication not only exposes readers to a uniform viewpoint but can also degrade the storage and processing-time performance of applications that consume the articles. This paper introduces FLASHer, a scheme for efficient and scalable deduplication of large volumes of news documents. FLASHer first preprocesses the given documents and clusters related documents together. It then computes the cosine similarity between documents and uses it to eliminate duplicates. In the empirical evaluation, FLASHer removed roughly 4.5% of the documents as redundant while using as little as about 8% of the memory consumed by existing baseline algorithms. With the proposed algorithm, users (or developers) can view a variety of non-duplicated news articles and focus on building applications on such high-quality data.
Keywords	Large-Volume News Articles; Document Deduplication; Information Retrieval; Natural Language Processing
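
The abstract outlines a three-stage pipeline: preprocess the documents, cluster related ones, then remove duplicates using cosine similarity. The Python sketch below illustrates that general flow and is not the authors' implementation. The abstract does not name a tokenizer, a vector representation, a clustering algorithm, or any thresholds, so the bag-of-words vectors, the greedy leader clustering, and the values of `topic_threshold` and `dup_threshold` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Lowercase and tokenize into a term-frequency vector (assumed preprocessing)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(vectors, topic_threshold):
    """Greedy leader clustering (an assumed stand-in for the paper's clustering step).

    Each document joins the first cluster whose leader (first member) is at
    least topic_threshold similar; otherwise it starts a new cluster.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= topic_threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def deduplicate(docs, topic_threshold=0.3, dup_threshold=0.9):
    """Keep one representative of each group of near-identical documents.

    Pairwise comparisons happen only inside a cluster, which is what keeps
    the number of similarity computations down. Both thresholds are
    hypothetical values, not taken from the paper.
    """
    vectors = [preprocess(d) for d in docs]
    kept = []
    for members in cluster(vectors, topic_threshold):
        survivors = []
        for i in members:
            if all(cosine(vectors[i], vectors[j]) < dup_threshold for j in survivors):
                survivors.append(i)
        kept.extend(survivors)
    return [docs[i] for i in sorted(kept)]


if __name__ == "__main__":
    articles = [
        "The city council approved the new budget on Monday",
        "The city council approved the new budget on Monday.",
        "The city council rejected a proposal to raise transit fares",
        "Heavy rain flooded several roads in the southern district",
    ]
    for article in deduplicate(articles):
        print(article)
```

In this toy input the second article differs from the first only by a trailing period, so it is dropped, while the third article shares a topic with the first but is kept because its similarity falls below the duplicate threshold. A production pipeline would likely use weighted vectors such as TF-IDF and language-specific tokenization, which this sketch deliberately omits.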