Object-Region Video Transformers

Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

75 Scopus citations

Abstract

Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric representations starting from early layers and propagate them into the transformer-layers, thus affecting the spatio-temporal representations throughout the network. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an 'Object-Region Attention' module applies self-attention over the patches and object regions. In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate 'Object-Dynamics Module', which captures trajectory interactions, and show how to integrate the two streams. We evaluate our model on four tasks and five datasets: compositional and few-shot action recognition on SomethingElse, spatio-temporal action detection on AVA, and standard action recognition on Something-Something V2, Diving48 and Epic-Kitchen100. We show strong performance improvement across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. For code and pretrained models, visit the project page at https://roeiherz.github.io/ORViT/
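To make the appearance stream described above more concrete, here is a minimal PyTorch sketch of the Object-Region Attention idea, not the authors' implementation: object descriptors are pooled from the patch feature map with RoIAlign, appended as extra tokens, and mixed with the patch tokens through self-attention before being folded back into the backbone. The class name, token layout, and shapes are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the Object-Region Attention idea.
# Object features are pooled with RoIAlign, appended as extra tokens, and
# attended jointly with the patch tokens. Shapes and names are assumptions.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class ObjectRegionAttentionSketch(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens, boxes, h, w):
        """
        patch_tokens: (B, T*h*w, dim) spatio-temporal patch tokens
        boxes:        (B, T, O, 4) per-frame boxes in normalized (x1, y1, x2, y2)
        """
        B, _, dim = patch_tokens.shape
        T, O = boxes.shape[1], boxes.shape[2]

        # Reshape patch tokens into per-frame feature maps for RoIAlign.
        feat = patch_tokens.reshape(B * T, h, w, dim).permute(0, 3, 1, 2)

        # Scale normalized boxes to feature-map coordinates, pool object features.
        scale = torch.tensor([w, h, w, h], dtype=boxes.dtype, device=boxes.device)
        rois = list(boxes.reshape(B * T, O, 4) * scale)            # B*T tensors, each (O, 4)
        obj = roi_align(feat, rois, output_size=1).flatten(1)      # (B*T*O, dim)
        obj_tokens = obj.reshape(B, T * O, dim)

        # Joint self-attention over patch tokens and object tokens.
        tokens = self.norm(torch.cat([patch_tokens, obj_tokens], dim=1))
        out, _ = self.attn(tokens, tokens, tokens)

        # Only the (now object-aware) patch tokens are passed back to the backbone.
        return patch_tokens + out[:, : patch_tokens.shape[1]]
```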

Original language: English
Title of host publication: Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Publisher: IEEE Computer Society
Pages: 3138-3149
Number of pages: 12
ISBN (Electronic): 9781665469463
DOIs
State: Published - 2022
Event: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States
Duration: 19 Jun 2022 - 24 Jun 2022

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2022-June
ISSN (Print): 1063-6919

Conference

Conference: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Country/Territory: United States
City: New Orleans
Period: 19/06/22 - 24/06/22

Bibliographical note

Publisher Copyright:
© 2022 IEEE.

Funding

The results are shown in Table 4a. A single ORViT layer already yields a considerable improvement (66.7%), and it is important to apply it in the earlier layers rather than at the end. This contrasts with current practice in object-centric approaches (e.g., STRG and STIN), which extract RoIs from the final layer. The ODM stream also improves performance (by 2.1%, from 66.7% to 68.8%). Finally, applying the layer multiple times further improves performance to 69.7%.

Object-Centric Baselines. ORViT proposes an elegant way to integrate object-region information into a video transformer. Here we consider two other candidate models for achieving this goal. (i) MF+RoIAlign applies RoIAlign over the last video transformer layer to extract object features, then concatenates the CLS token with max-pooled object features and classifies the action with an MLP. (ii) MF+Boxes uses coordinates and patch tokens: the CLS token from the last layer of MF is concatenated with trajectory embeddings, which are obtained via standard self-attention over the coordinates, similar to our ODM stream. The first captures object appearance with global context, while the latter captures trajectory information with global context; neither fuses the object information back into the backbone several times, as we do. The results are shown in Table 4b. MF+RoIAlign does not improve over the baseline, while MF+Boxes improves by 3.5%, which is still far below ORViT (69.7%).

How important are the object bounding boxes? Since ORViT changes the architecture of the base video transformer model, we want to check whether the bounding boxes are indeed the source of improvement. We consider several variations in which the object bounding boxes are replaced with other values: (i) All boxes: every box is given the coordinates of the entire image ([0, 0, 1, 1]). (ii) Null boxes: boxes are initialized to zeros. (iii) Grid boxes: each of the 4 bounding boxes is one quarter of the image. (iv) Random boxes: each box is chosen uniformly at random. See Table 4c for results; a sketch of these variants follows below. We observe a large drop in performance for all of these baselines, which confirms the important role of the object regions in ORViT. Finally, we ask whether tracking information is important, as opposed to detection alone. We find that removing tracking degrades performance from 69.7% to 68.2%, indicating that the model performs relatively well with only detection information.
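As a concrete illustration of the box-replacement ablation above, here is a small sketch, not taken from the paper's code, of how the four box variants could be generated for a clip; the function name, the assumption of 4 objects per frame, and the particular way random corners are drawn are illustrative choices.

```python
# Illustrative sketch (not the authors' code) of the four box-replacement
# variants from the ablation; boxes are (T, O, 4) in normalized
# (x1, y1, x2, y2) coordinates, with O = 4 objects per frame assumed.
import torch


def make_ablation_boxes(variant, num_frames, num_objects=4):
    if variant == "all":
        # Every box covers the entire image: [0, 0, 1, 1].
        return torch.tensor([0.0, 0.0, 1.0, 1.0]).repeat(num_frames, num_objects, 1)
    if variant == "null":
        # Boxes initialized to zeros.
        return torch.zeros(num_frames, num_objects, 4)
    if variant == "grid":
        # Each of the 4 boxes is one quarter of the image (assumes 4 objects).
        assert num_objects == 4, "grid variant assumes 4 boxes per frame"
        quarters = torch.tensor([
            [0.0, 0.0, 0.5, 0.5],
            [0.5, 0.0, 1.0, 0.5],
            [0.0, 0.5, 0.5, 1.0],
            [0.5, 0.5, 1.0, 1.0],
        ])
        return quarters.unsqueeze(0).repeat(num_frames, 1, 1)
    if variant == "random":
        # Two random corners per box, sorted so that x1 <= x2 and y1 <= y2.
        xy = torch.rand(num_frames, num_objects, 2, 2).sort(dim=2).values
        return torch.cat([xy[..., 0, :], xy[..., 1, :]], dim=-1)
    raise ValueError(f"unknown variant: {variant}")
```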
Decreasing Model Size. Next, we show that the model size can be significantly decreased at a small cost in performance. Most of the parameters added by ORViT over the baseline MF are in the ODM, so a smaller embedding dimension can be used in the ODM (see B̃ in Section 3.2). Table 4d reports how this dimension affects performance, demonstrating that most of the gains can be achieved with a model close in size to the original MF. More details are given in D.1 and D.2 of the supplementary material. We highlight that "Object-Region Attention" alone (setting the dimension to 0, so the ODM is not used) accounts for most of the improvement, with only 2% additional parameters.

5. Discussion and Limitations

Objects are a key element of human visual perception, but their modeling is still a challenge for machine vision. In this work, we demonstrated the value of an object-centric approach that incorporates object representations starting from the early layers and propagates them into the transformer layers. Through an extensive empirical study, we showed that integrating the ORViT block into a video transformer architecture leads to improved results on four video understanding tasks and five datasets. A limitation of our work is that we did not address object detection itself and relied on externally provided boxes; replacing these with boxes that the model generates without supervision would be an interesting direction.

Acknowledgements. This project has received funding from the ERC under the European Union's Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell's group was supported in part by DoD, including DARPA's XAI and LwLL programs, as well as BAIR's industrial programs. This work was completed in partial fulfillment of the Ph.D. degree of the first author.

Funders and funder numbers:
• U.S. Department of Defense
• Defense Advanced Research Projects Agency
• Horizon 2020: ERC HOLI 819080

Keywords

• Action and event recognition
• Video analysis and understanding
