Contents Extraction from HTML Documents using Text Block Context

Wonmoon Song; Wooseung Kim; Myungwon Kim

Journal of KIISE B: Software and Applications (정보과학회 논문지 B : 소프트웨어 및 응용)

Korean Title	텍스트 블록 주변의 문맥을 이용한 HTML 문서 본문 추출
English Title	Contents Extraction from HTML Documents using Text Block Context
Author	송원문, 김우승, 김명원 (Wonmoon Song, Wooseung Kim, Myungwon Kim)
Citation	Vol. 40, No. 3, pp. 155-163 (March 2013)
Abstract	With the emergence of diverse Web authoring tools and new Web standards, and with easier access to the Web, a wide variety of Web content is being produced very rapidly. In this environment, providing Web services that fit users' needs requires removing non-content regions such as advertisements from Web documents and extracting only the relevant main content quickly and accurately. This paper proposes a method that accurately extracts the main content region from HTML Web documents. The method builds a decision tree and uses it to classify whether each text block in a document belongs to the main content. The classification features are contextual information around each text block, including the word density and link density of text blocks, the HTML tag distribution, and the distance between text blocks. Experiments on a published data set and a data set collected by the authors show that the F-measure improves by about 19% over existing methods.
Keywords	web document analysis, contents extraction, tag distribution, block distance, context
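
The abstract names word density and link density of text blocks, together with the context around each block, as classification features. The sketch below illustrates what such per-block features could look like. It is not the authors' implementation: the block-splitting rule (`BLOCK_TAGS`), the class `BlockParser`, the function `block_features`, and the choice to express context as the previous and next block's values are all assumptions made for illustration, and the tag-distribution and block-distance features are omitted because the abstract does not define them.

```python
from html.parser import HTMLParser

# Assumed set of tags that delimit text blocks (not taken from the paper).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "br"}
SKIP_TAGS = {"script", "style"}


class BlockParser(HTMLParser):
    """Split HTML into text blocks, counting words and words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # (word_count, link_word_count) per block
        self._words = 0
        self._link_words = 0
        self._link_depth = 0    # > 0 while inside an <a> element
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def _flush(self):
        # Close the current block; blocks without any words are dropped.
        if self._words:
            self.blocks.append((self._words, self._link_words))
        self._words = self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        n = len(data.split())
        self._words += n
        if self._link_depth:
            self._link_words += n

    def close(self):
        super().close()
        self._flush()


def block_features(html):
    """Return one feature dict per text block, including neighbour context."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    own = [{"words": w, "link_density": lw / w} for w, lw in parser.blocks]
    empty = {"words": 0, "link_density": 0.0}  # padding at document edges
    rows = []
    for i, cur in enumerate(own):
        prev = own[i - 1] if i > 0 else empty
        nxt = own[i + 1] if i + 1 < len(own) else empty
        rows.append({
            "words": cur["words"],
            "link_density": cur["link_density"],
            "prev_words": prev["words"],
            "prev_link_density": prev["link_density"],
            "next_words": nxt["words"],
            "next_link_density": nxt["link_density"],
        })
    return rows


# Hypothetical sample page: navigation, two content paragraphs, an ad link.
SAMPLE = (
    "<html><body>"
    "<div><a href='/'>Home</a> <a href='/news'>News</a></div>"
    "<p>The proposed method classifies each text block as main content or not.</p>"
    "<p>Context features describe the neighbouring blocks. <a href='/ref'>Details</a></p>"
    "<div><a href='/ad'>Buy now</a></div>"
    "</body></html>"
)

for row in block_features(SAMPLE):
    print(row)
```

In this sample the navigation and advertisement blocks consist entirely of link text, while the two paragraphs have many words and little or no link text. Rows of this kind, paired with main-content labels, are the sort of input a decision-tree learner could be trained on.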