Improving Transformer with Dynamic Convolution and Shortcut for Video-Text Retrieval

Zhi Liu; Jincen Cai; Mengmeng Zhang

TIIS (Korean Society for Internet Information)

Korean Title	Improving Transformer with Dynamic Convolution and Shortcut for Video-Text Retrieval
English Title	Improving Transformer with Dynamic Convolution and Shortcut for Video-Text Retrieval
Author	Zhi Liu; Jincen Cai; Mengmeng Zhang
Citation	Vol. 16, No. 7, pp. 2407-2424 (July 2022)
English Abstract	Transformers have recently made great progress in video retrieval tasks owing to their high representation capability. In a Transformer, the cascaded self-attention modules capture long-distance feature dependencies, but local feature details are likely to deteriorate. In addition, increasing the depth of the structure is likely to introduce learning bias in the learned features. This paper proposes an improved Transformer structure named TransDCS (Transformer with Dynamic Convolution and Shortcut). A Multi-head Conv-Self-Attention module is introduced to model local dependencies and improve the efficiency of local feature extraction. Meanwhile, an augmented shortcuts module based on a dual identity matrix is applied to enhance the conduction of input features and mitigate the learning bias. The proposed model is tested on the MSRVTT, LSMDC and Activity-Net benchmarks, and it surpasses all previous solutions for the video-text retrieval task. For example, on the LSMDC benchmark, it obtains a gain of about 2.3% in MdR and 6.1% in MnR over recently proposed multimodal-based methods.
Keywords	Video representation; cross-modal retrieval; Multi-modal; Local Descriptors; Transformer
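
The abstract reports its gains in MdR and MnR. As a rough illustration only, the sketch below computes these two metrics under the usual retrieval convention, which is assumed here rather than taken from the paper: MdR is the median rank and MnR the mean rank of the correct video for each text query, so lower is better. The function name `rank_metrics` and the toy similarity matrix are hypothetical and are not the authors' evaluation code.

```python
from statistics import mean, median


def rank_metrics(sim):
    """Compute median rank (MdR) and mean rank (MnR) for text-to-video retrieval.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i (diagonal ground truth).
    """
    ranks = []
    for i, row in enumerate(sim):
        correct = row[i]
        # Rank of the correct video = 1 + number of videos scored strictly higher.
        ranks.append(1 + sum(1 for score in row if score > correct))
    return {"MdR": median(ranks), "MnR": mean(ranks)}


# Hypothetical 3x3 similarity matrix (rows: text queries, columns: videos).
toy_sim = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.8, 0.2, 0.5],  # correct video ranked 3rd
    [0.2, 0.4, 0.6],  # correct video ranked 1st
]
metrics = rank_metrics(toy_sim)
print(metrics)
```

Ties are resolved optimistically in this sketch (a tied score does not push the correct video down); an evaluation script may handle ties differently.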