Abstract
We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model, which generates sentences, and CLIP, which maintains a high average matching score between the generated text and the video frames. Existing zero-shot captioning methods use token-level optimization that drives the generation of each token to be related to the image. However, maintaining language fluency with a set of frames can be challenging since (i) a single token has to describe a set of non-homogeneous frames, and (ii) the generation may commit to a single direction, restricting the flexibility of the process. In our approach, we use pseudo-tokens that are updated after each complete sentence is generated, gradually improving the specificity and comprehensiveness of the sentence while letting the user control the level of specificity. The optimization takes the whole sentence into account and does not require beam search. Our experiments show that the generated captions are fluent and display a broad range of real-world knowledge for both videos and images. Moreover, while current supervised video captioning methods generate captions that often follow a short and generic pattern based on the datasets they were trained on, our approach generates diverse and descriptive captions that are much more appealing to humans. Our code is available at: https://github.com/YoadTew/zero-shot-video-to-text.
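The "average matching score" between a sentence and a set of frames can be illustrated with a minimal sketch. It assumes the sentence and every frame have already been mapped to embedding vectors (by CLIP in the method described above) and that matching is measured by cosine similarity. The function names and the toy two-dimensional vectors are hypothetical and chosen only for illustration; this is not the authors' implementation, and it does not include the pseudo-token optimization itself.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_matching_score(text_emb: Vector, frame_embs: Sequence[Vector]) -> float:
    """Mean similarity of one sentence embedding to every frame embedding."""
    return sum(cosine(text_emb, f) for f in frame_embs) / len(frame_embs)


# Toy stand-ins for embeddings (assumption: real ones would come from CLIP).
frames = [[1.0, 0.0], [0.0, 1.0]]   # two frames with different content
covers_both = [1.0, 1.0]            # sentence related to both frames
covers_one = [1.0, 0.0]             # sentence that matches only the first frame

score_both = average_matching_score(covers_both, frames)  # about 0.707
score_one = average_matching_score(covers_one, frames)    # 0.5
```

In this toy case the sentence related to both frames scores higher than the one that matches a single frame perfectly, which is the behavior an average over non-homogeneous frames is meant to reward.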
| Original language | English |
|---|---|
| State | Published - 2023 |
| Externally published | Yes |
| Event | 34th British Machine Vision Conference, BMVC 2023 - Aberdeen, United Kingdom (20 Nov 2023 → 24 Nov 2023) |
Conference
| Conference | 34th British Machine Vision Conference, BMVC 2023 |
|---|---|
| Country/Territory | United Kingdom |
| City | Aberdeen |
| Period | 20/11/23 → 24/11/23 |
Bibliographical note
Publisher Copyright: © 2022. The copyright of this document resides with its authors.
Title
Zero-Shot Video Captioning by Evolving Pseudo-tokens