Abstract
The recently proposed audio-visual scene-aware dialog task opens the door to a more data-driven approach to training virtual assistants, smart speakers, and car navigation systems. However, very little is known to date about how to effectively extract meaningful information from the plethora of sensors feeding the computational engines of these devices. In this paper, we therefore provide and carefully analyze a simple baseline for audio-visual scene-aware dialog that is trained end-to-end. Our method uses an attention mechanism to distinguish, in a data-driven manner, useful signals from distracting ones. We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dialog dataset, and demonstrate the key features that allow it to outperform the current state of the art by more than 20% on CIDEr.
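The core idea the abstract describes — attending over per-modality features so that useful signals are weighted up and distracting ones weighted down — can be sketched minimally as scaled dot-product attention. This is an illustrative sketch with hypothetical shapes and names, not the authors' implementation:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(query, features):
    """Scaled dot-product attention over a set of feature vectors.

    query:    (d,)   e.g. an encoding of the dialog history/question
    features: (n, d) e.g. per-frame video or audio features
    Returns the attention-weighted context vector and the weights.
    """
    scores = features @ query / np.sqrt(query.shape[0])  # (n,) relevance scores
    weights = softmax(scores)                            # (n,) non-negative, sums to 1
    return weights @ features, weights                   # (d,) context, (n,) weights

# Toy example: 4 "frames" of 8-dim features attended by a random query.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
q = rng.standard_normal(8)
context, w = attend(q, feats)
```

In an end-to-end model the query and features would come from learned encoders, so the attention weights themselves are learned from data rather than hand-designed.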
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 |
| Publisher | IEEE Computer Society |
| Pages | 12540-12550 |
| Number of pages | 11 |
| ISBN (Electronic) | 9781728132938 |
| DOIs | |
| State | Published - Jun 2019 |
| Externally published | Yes |
| Event | 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 - Long Beach, United States |
| Duration | 16 Jun 2019 → 20 Jun 2019 |
Publication series
| Name | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
|---|---|
| Volume | 2019-June |
| ISSN (Print) | 1063-6919 |
Conference
| Conference | 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 |
|---|---|
| Country/Territory | United States |
| City | Long Beach |
| Period | 16/06/19 → 20/06/19 |
Bibliographical note
Publisher Copyright: © 2019 IEEE.
Keywords
- Vision + Language