Abstract
Vision Transformers have proven powerful in various vision applications. Yet their adaptation to video understanding tasks incurs large computational costs, limiting practical deployment on resource-constrained devices. Token pruning can effectively alleviate the processing overhead of the underlying attention blocks, but it often neglects the iterative nature of video models that are applied frame by frame. We propose to prune tokens according to the estimated contribution of their corresponding tokens in previous frames to previous predictions. We leverage attention rollout and token tracking to propagate token importance from previous outputs to current input tokens. Our method is interpretable, requires no training, and has negligible memory overhead. We show the efficacy of our method for both video object detection and action recognition using different transformer architectures, achieving up to 65% reduction in FLOPs on ImageNet VID and 60% on EPIC-Kitchens with no accuracy degradation. We release the code and models at https://github.com/RGTPdyn/RGTP.
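As a rough illustration of the attention rollout step mentioned in the abstract, the following minimal sketch computes rollout scores from per-layer attention maps and keeps the highest-scoring tokens. It is not the released implementation: the function names `attention_rollout` and `keep_indices`, the `keep_ratio` parameter, and the use of output token 0 as the prediction token are assumptions made for illustration. The sketch also omits token tracking, so it assumes scores from the previous frame are applied to tokens at the same positions in the current frame.

```python
import numpy as np


def attention_rollout(attentions):
    """Standard attention rollout over a stack of layers.

    attentions: list of arrays shaped (heads, tokens, tokens), first layer
    first, where each row is a softmax distribution over input tokens.
    Returns a (tokens, tokens) matrix whose entry [i, j] estimates how much
    input token j contributes to output token i.
    """
    n = attentions[0].shape[-1]
    rollout = np.eye(n)
    for attn in attentions:
        a = attn.mean(axis=0)                  # average over heads
        a = 0.5 * a + 0.5 * np.eye(n)          # account for the residual path
        a = a / a.sum(axis=-1, keepdims=True)  # keep rows normalized
        rollout = a @ rollout                  # compose with earlier layers
    return rollout


def keep_indices(importance, keep_ratio):
    """Indices of the highest-scoring tokens, in ascending position order.

    keep_ratio is a hypothetical parameter: the fraction of tokens to keep.
    """
    k = max(1, int(round(keep_ratio * importance.size)))
    return np.sort(np.argsort(importance)[-k:])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical previous-frame attention: 4 layers, 3 heads, 10 tokens.
    logits = rng.normal(size=(4, 3, 10, 10))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    scores = attention_rollout(list(attn))[0]  # contributions to output token 0
    print(keep_indices(scores, keep_ratio=0.4))
```

In a frame-by-frame setting, the indices computed from the previous frame would select which tokens of the current frame are passed to the attention blocks; how tokens are matched across frames is left out of this sketch.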
| Original language | English |
|---|---|
| Title of host publication | 2025 IEEE International Conference on Image Processing, ICIP 2025 - Proceedings |
| Publisher | IEEE Computer Society |
| Pages | 37-42 |
| Number of pages | 6 |
| ISBN (Electronic) | 9798331523794 |
| DOIs | |
| State | Published - 2025 |
| Externally published | Yes |
| Event | 32nd IEEE International Conference on Image Processing, ICIP 2025 - Anchorage, United States. Duration: 14 Sep 2025 → 17 Sep 2025 |
Publication series
| Name | Proceedings - International Conference on Image Processing, ICIP |
|---|---|
| ISSN (Print) | 1522-4880 |
Conference
| Conference | 32nd IEEE International Conference on Image Processing, ICIP 2025 |
|---|---|
| Country/Territory | United States |
| City | Anchorage |
| Period | 14/09/25 → 17/09/25 |
Bibliographical note
Publisher Copyright: © 2025 IEEE.
Keywords
- Action Recognition
- Attention Rollout
- Token Pruning
- Video Object Detection
- Video Transformers
Fingerprint
Dive into the research topics of 'ROLLOUT-GUIDED TOKEN PRUNING FOR EFFICIENT VIDEO UNDERSTANDING'. Together they form a unique fingerprint.