Automatic Conceptual Sorting of Concordances using the Similarity Between Morphemes

백대호; 이호; 임해창; 박동인; Dae-Ho Baek; Ho Lee; Hae-Chang Rim; Dong-In Park

Journal of the Information Science Society B: Software and Applications (정보과학회 논문지 B : 소프트웨어 및 응용)


Korean Title	형태소 사이의 유사도를 이용한 용례의 의미별 자동 정렬
English Title	Automatic Conceptual Sorting of Concordances using the Similarity Between Morphemes
Author	백대호, 이호, 임해창, 박동인 (Dae-Ho Baek, Ho Lee, Hae-Chang Rim, Dong-In Park)
Citation	Vol. 25, No. 1, pp. 183-192 (January 1998)
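Korean Abstract (translated)	Concordance sorting is the task of rearranging the concordances extracted from a corpus. Existing sorting methods order concordances by the lexicographic order of specific morphemes, which makes it difficult to obtain the desired linguistic information. This paper sorts the concordances extracted from a corpus not by the lexicographic order of morphemes but by the meaning of the keyword. To sort concordances by keyword meaning, we compute the semantic similarity between concordances and apply a hierarchical clustering technique that gathers similar concordances into the same cluster. To compute the semantic similarity between concordances, we use the frequency with which the same morphemes appear and the similarity between morphemes. As measures of similarity between morphemes, we use mutual information, mutual information similarity, and vector similarity. We tested each method on concordances extracted from a part-of-speech-tagged corpus of about 170,000 words, using four sense-ambiguous nouns and four sense-ambiguous verbs as keywords. The experiment that used mutual information and mutual information similarity as the morpheme similarity reached 90.16% accuracy.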
English Abstract	Concordance sorting is the procedure of reordering concordances extracted from a corpus. Previous methods of concordance sorting make it difficult to acquire linguistic information because they order concordances by the lexicographical order of specific morphemes. In this paper, we propose a method of ordering the concordances extracted from a corpus by the meanings of their keywords. To do so, we compute the sense similarity between concordances and use a hierarchical clustering method to collect conceptually similar concordances in the same cluster. We use the frequency of co-occurring morphemes and the similarity between morphemes to compute the similarity between concordances. As measures of the similarity between morphemes, we use mutual information, the similarity between mutual information values, and vector similarity. We evaluated each method on the concordances of 4 polysemous nouns and 4 polysemous verbs extracted from a part-of-speech-tagged corpus of about 170,000 words. The method that uses both mutual information and the similarity between mutual information values shows 90.16% precision.
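The pipeline described in the abstract (morpheme similarity from mutual information, concordance similarity from shared and related morphemes, then hierarchical clustering) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the pointwise mutual information estimate, the way shared-morpheme counts and morpheme similarity are combined, the average-link merge rule, and all function names and toy data below are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(contexts):
    """Pointwise mutual information of morpheme pairs that co-occur in a context.

    Assumption: probabilities are estimated as the fraction of contexts
    containing a morpheme (or a pair of morphemes).
    """
    n = len(contexts)
    single, pair = Counter(), Counter()
    for ctx in contexts:
        morphs = sorted(set(ctx))
        single.update(morphs)
        pair.update(frozenset(p) for p in combinations(morphs, 2))
    table = {}
    for p, count in pair.items():
        a, b = tuple(p)
        table[p] = math.log2((count / n) / ((single[a] / n) * (single[b] / n)))
    return table


def concordance_similarity(c1, c2, pmi):
    """Hypothetical combination: number of shared morphemes plus the mean
    positive PMI between the remaining morphemes of the two concordances."""
    s1, s2 = set(c1), set(c2)
    shared = s1 & s2
    cross = [max(pmi.get(frozenset((a, b)), 0.0), 0.0)
             for a in s1 - shared for b in s2 - shared]
    related = sum(cross) / len(cross) if cross else 0.0
    return len(shared) + related


def cluster(concordances, pmi, k):
    """Average-link agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(concordances))]

    def link(x, y):
        total = sum(concordance_similarity(concordances[i], concordances[j], pmi)
                    for i in x for j in y)
        return total / (len(x) * len(y))

    while len(clusters) > k:
        # Merge the most similar pair of clusters (i < j, so deleting j is safe).
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Invented toy data: contexts for estimating PMI, and four concordance
# contexts of one ambiguous keyword (the keyword itself is left out).
corpus = [["river", "water"], ["water", "fish"], ["fish", "river"],
          ["money", "loan"], ["loan", "deposit"], ["deposit", "money"]]
concordances = [["river"], ["fish", "water"], ["money"], ["loan", "deposit"]]

pmi = pmi_table(corpus)
groups = cluster(concordances, pmi, k=2)
print(groups)  # [[0, 1], [2, 3]]
```

In this toy run no two concordances share a morpheme, so the grouping is driven entirely by the mutual information term. That is the situation the abstract's morpheme similarity is meant to address, since plain lexicographic ordering cannot relate such concordances.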