TAPS: Temporal Attention-Based Pruning and Scaling for Efficient Video Action Recognition

Yonatan Dinai, Avraham Raviv, Nimrod Harel, Donghoon Kim, Ishay Goldin, Niv Zehngut

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Video neural networks are computationally expensive; for real-time applications they demand compute resources that edge devices lack. Various methods have been proposed to reduce the computational load of neural networks. Among them, dynamic approaches adapt the network architecture, its weights, or the input resolution to the content of the input. Our proposed approach, showcased on the task of video action recognition, dynamically reduces computation for a wide range of video processing networks by exploiting the redundancy between frames and channels. A lightweight per-layer policy network makes a per-filter decision about each filter's importance: important filters are retained, while the others are scaled down or skipped entirely. Our method is the first to give the policy network a broader temporal context by considering features aggregated over time. Temporal aggregation is performed via self-attention between present, past, and future (if available) input tensor descriptors. As demonstrated on a large variety of leading benchmarks, such as Something-Something-V2, Mini-Kinetics, Jester, and ActivityNet1.3, and across multiple network architectures, our method can improve accuracy or save up to 70% of the FLOPs with no accuracy degradation, outperforming existing dynamic pruning methods by a large margin and setting a new bar for the accuracy-efficiency trade-off achievable by dynamic methods. We release the code and trained models at https://github.com/tapsdyn/TAPS.
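The mechanism the abstract describes (per-frame input tensor descriptors, temporal self-attention among them, and a per-filter keep/scale/skip decision) can be illustrated with a short PyTorch sketch. This is a minimal illustration under assumed details, not the authors' implementation: the module name TemporalFilterPolicy, the descriptor dimension, the number of attention heads, the Gumbel-softmax gating, and the 0.5 scaling factor are all hypothetical choices; the released code at https://github.com/tapsdyn/TAPS is the authoritative reference.

```python
# Minimal sketch of the dynamic pruning idea from the abstract -- NOT the
# authors' implementation. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalFilterPolicy(nn.Module):
    """Per-layer policy: pools each frame's input tensor into a descriptor,
    lets descriptors attend over time (past/present/future frames), then
    emits a per-filter decision among {keep, scale, skip}."""

    def __init__(self, in_channels: int, num_filters: int, embed_dim: int = 64):
        super().__init__()
        self.descriptor = nn.Linear(in_channels, embed_dim)  # from pooled features
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(embed_dim, num_filters * 3)    # 3 actions per filter
        self.num_filters = num_filters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) -- features of all frames entering the host layer
        b, t, c, h, w = x.shape
        desc = self.descriptor(x.mean(dim=(3, 4)))           # (B, T, D) descriptors
        ctx, _ = self.attn(desc, desc, desc)                 # temporal self-attention
        logits = self.head(ctx).view(b, t, self.num_filters, 3)
        # Differentiable discrete choice at train time; hard one-hot decisions.
        probs = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
        # Map actions to multiplicative gates: keep=1.0, scale=0.5, skip=0.0.
        gates = probs[..., 0] * 1.0 + probs[..., 1] * 0.5    # (B, T, num_filters)
        return gates


if __name__ == "__main__":
    policy = TemporalFilterPolicy(in_channels=64, num_filters=128)
    clips = torch.randn(2, 8, 64, 56, 56)                   # batch of 8-frame clips
    print(policy(clips).shape)                               # torch.Size([2, 8, 128])
```

In a sketch like this, the returned gates would multiply the corresponding filter outputs of the host convolutional layer, so skipped filters contribute nothing and, at inference, their computation can be bypassed entirely.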

Original language: English
Title of host publication: Computer Vision – ACCV 2024 - 17th Asian Conference on Computer Vision, Proceedings
Editors: Minsu Cho, Ivan Laptev, Du Tran, Angela Yao, Hongbin Zha
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 422-438
Number of pages: 17
ISBN (Print): 9789819609079
DOIs
State: Published - 2025
Externally published: Yes
Event: 17th Asian Conference on Computer Vision, ACCV 2024 - Hanoi, Viet Nam
Duration: 8 Dec 2024 – 12 Dec 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15474 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th Asian Conference on Computer Vision, ACCV 2024
Country/Territory: Viet Nam
City: Hanoi
Period: 8/12/24 – 12/12/24

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.

Keywords

  • Action Recognition
  • Dynamic Pruning
  • Efficient Inference
