Abstract
Video processing requires analysis of spatial features that change over time. By combining spatial and temporal modelling, a neural network can gain a better understanding of the scene with no increase in computation. Spatio-temporal modelling can also be used to identify redundant and sparse information in both the spatial and the temporal domains. In this work we present Dynamic Spatio-Temporal Pruning (DSTEP), a new, simple, yet efficient method for learning the evolution of the spatial mapping between frames. More specifically, we use a cascade of lightweight policy networks to dynamically filter out, per input, regions and channels that do not provide information, while also sharing information across time. Guided by the policy networks, the model is able to focus on relevant data and filters, avoiding unnecessary computations. Extensive evaluations on the Something-Something-V2, Jester and Mini-Kinetics action recognition datasets demonstrate that the proposed method achieves a significantly improved accuracy-compute trade-off over current state-of-the-art methods. We release our code and trained models at https://github.com/DynamicAR/DSTEP.
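To make the idea of per-input region and channel gating concrete, below is a minimal, hypothetical PyTorch sketch of a lightweight policy module that predicts hard keep/drop masks over spatial locations and channels of a frame's feature map, optionally conditioned on the previous frame's features. It is only an illustration of the general technique described in the abstract, not the authors' released implementation; the module name `SpatialChannelPolicy`, the straight-through binarisation, and the `temperature` parameter are assumptions. The actual DSTEP cascade and training procedure are in the paper and the linked repository.

```python
# Illustrative sketch only: dynamic spatial/channel gating with tiny policy heads.
# All names and design choices here are hypothetical, not DSTEP's actual code.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialChannelPolicy(nn.Module):
    def __init__(self, channels: int, hidden: int = 16, temperature: float = 1.0):
        super().__init__()
        self.temperature = temperature
        # Tiny conv head: one keep-logit per spatial location.
        self.spatial_head = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )
        # Tiny MLP head: one keep-logit per channel, from pooled features.
        self.channel_head = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )

    @staticmethod
    def _hard_mask(logits: torch.Tensor) -> torch.Tensor:
        # Straight-through estimator: hard 0/1 mask in the forward pass,
        # sigmoid gradient in the backward pass.
        soft = torch.sigmoid(logits)
        hard = (soft > 0.5).float()
        return hard + soft - soft.detach()

    def forward(self, feat: torch.Tensor, prev_feat: Optional[torch.Tensor] = None):
        # feat: (B, C, H, W) features of the current frame.
        # prev_feat: optional features from the previous frame; adding them is a
        # crude stand-in for the cross-time information sharing the abstract mentions.
        ctx = feat if prev_feat is None else feat + prev_feat
        spatial_mask = self._hard_mask(self.spatial_head(ctx) / self.temperature)     # (B, 1, H, W)
        pooled = F.adaptive_avg_pool2d(ctx, 1).flatten(1)                             # (B, C)
        channel_mask = self._hard_mask(self.channel_head(pooled) / self.temperature)  # (B, C)
        gated = feat * spatial_mask * channel_mask[:, :, None, None]
        return gated, spatial_mask, channel_mask


if __name__ == "__main__":
    policy = SpatialChannelPolicy(channels=64)
    frames = torch.randn(2, 8, 64, 14, 14)  # (batch, time, C, H, W)
    prev = None
    for t in range(frames.shape[1]):
        gated, s_mask, c_mask = policy(frames[:, t], prev)
        prev = gated  # carry gated features forward in time
```

In a full system, the fraction of zeros in `spatial_mask` and `channel_mask` would typically be regularised toward a target sparsity so that the backbone can actually skip the masked regions and channels and realise the compute savings.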
Original language | English |
---|---|
State | Published - 2022 |
Externally published | Yes |
Event | 33rd British Machine Vision Conference Proceedings, BMVC 2022 - London, United Kingdom. Duration: 21 Nov 2022 → 24 Nov 2022 |
Conference
Conference | 33rd British Machine Vision Conference Proceedings, BMVC 2022 |
---|---|
Country/Territory | United Kingdom |
City | London |
Period | 21/11/22 → 24/11/22 |
Bibliographical note
Publisher Copyright: © 2022. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.