MDMMT: Multidomain Multimodal Transformer for Video Retrieval
Maksim Dzabraev,
Maksim Kalashnikov,
Stepan Komkov,
Aleksandr Petiushko
June, 2021
Abstract
We present a new state of the art for text-to-video retrieval on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin. Moreover, these results are achieved with a single model and without finetuning. This multidomain generalisation comes from a proper combination of different video caption datasets. We show that, with our practical approach, training on different datasets together can improve the test results on each of them. Additionally, we check the intersection between many popular datasets and show that MSRVTT, as well as ActivityNet, contains a significant overlap between its test and training parts. More details are available at https://github.com/papermsucode/mdmmt.
Publication
In IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021 Workshop on Large Scale Holistic Video Understanding (CVPR 2021)