
Journal of KIISE (정보과학회논문지)


Korean Title: 영상 기반 대화를 위한 모듈 신경망 학습
English Title: Neural Module Network Learning for Visual Dialog
Authors: 조영수 (Yeongsu Cho), 김인철 (Incheol Kim)
Citation: Vol. 46, No. 12, pp. 1304-1313 (Dec. 2019)
Korean Abstract (translated)
In this paper, we propose a new neural module network model for visual dialog. Visual dialog poses several difficult challenges. The first is the visual grounding problem: determining which objects in the given input image should be associated with the entities mentioned in a natural-language question. The second is the visual co-reference resolution problem: determining which entity from a past question or answer a noun phrase or pronoun in a new question refers to, and ultimately which object in the input image it denotes. To address these problems, this paper proposes a new visual dialog model that uses question-customized neural module networks and a reference pool. The proposed model includes not only a new Compare module for answering comparison questions effectively, but also a new Find module whose performance is improved by a dual attention mechanism, and a Refer module that resolves visual co-references using the reference pool. To evaluate the proposed model, we conducted various experiments on the large benchmark datasets VisDial v0.9 and VisDial v1.0. Through these experiments, we confirmed that the proposed model outperforms existing state-of-the-art visual dialog models.
English Abstract
In this paper, we propose a novel neural module network (NMN) model for visual dialog. Visual dialog currently poses several challenges. The first is visual grounding, which concerns how to associate the entities mentioned in the natural-language question with the visual objects in the given image. The other is visual co-reference resolution, which involves determining which words, typically noun phrases and pronouns, co-refer to the same visual object in a given image. To address these issues, we suggest a new visual dialog model using both question-customized neural module networks and a reference pool. The proposed model includes not only a new Compare module to answer questions that require comparing properties of two visual objects, but also a novel Find module improved by a dual attention mechanism, and a Refer module that resolves visual co-references with the reference pool. To evaluate the performance of the proposed model, we conduct various experiments on two large benchmark datasets, VisDial v0.9 and VisDial v1.0. The results show that the proposed model outperforms the state-of-the-art models for visual dialog.
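The abstract describes composing question-specific modules (Find, Refer, Compare) over image object features, with a reference pool caching earlier attention results for later co-reference resolution. The following is a minimal toy sketch of that composition idea only; it is not the authors' implementation, and all names, dimensions, and the single-attention Find (the paper's dual attention is omitted) are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def find(obj_feats, query):
    """Find module (simplified): attention over detected objects given a query embedding."""
    return softmax(obj_feats @ query)

def refer(pool, mention):
    """Refer module (simplified): reuse a cached attention map from the reference pool."""
    return pool[mention]

def compare(att_a, att_b, obj_feats, w):
    """Compare module (simplified): score the difference of two attended representations."""
    v_a = att_a @ obj_feats  # attention-weighted object feature
    v_b = att_b @ obj_feats
    return float(w @ (v_a - v_b))

rng = np.random.default_rng(0)
obj_feats = rng.normal(size=(5, 8))  # 5 detected objects, 8-dim features (hypothetical)
q = rng.normal(size=8)               # query embedding for a mention, e.g. "the dog"
w = rng.normal(size=8)               # hypothetical comparison weight vector

pool = {}                            # reference pool: mention -> attention map
att = find(obj_feats, q)
pool["the dog"] = att                # cache for later dialog rounds
att_again = refer(pool, "the dog")   # a later pronoun ("it") resolves to the same map
score = compare(att, att_again, obj_feats, w)
```

The point of the sketch is the control flow: a question is answered by chaining modules, and Refer avoids re-grounding a mention by looking it up in the pool populated by an earlier Find.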
Keywords: visual dialog, neural module network, visual grounding, visual co-reference resolution, deep neural network