Attention-based multimodal image matching

Aviad Moreshet, Yosi Keller

Research output: Contribution to journal › Article › peer-review

1 Scopus citation


We propose a method for matching multimodal image patches using a multiscale Transformer-Encoder that attends to the feature maps of a Siamese CNN. It effectively combines multiscale image embeddings while emphasizing task-specific and appearance-invariant image cues. We also introduce a residual attention architecture that enables end-to-end training via a residual connection. To the best of our knowledge, this is the first successful use of the Transformer-Encoder architecture in multimodal image matching. We motivate the use of task-specific multimodal descriptors by achieving new state-of-the-art accuracy on both multimodal and unimodal benchmarks, and demonstrate the quantitative and qualitative advantages of our approach over state-of-the-art unimodal image matching methods in multimodal matching. Our code is publicly available.
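The residual-attention idea described above can be sketched in miniature: a self-attention block whose output is added back to its input, so the network can be trained end-to-end with gradients flowing through the skip path. This is a minimal NumPy illustration under assumed shapes, not the paper's implementation; all names, dimensions, and weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # x: (tokens, d) -- e.g. flattened multiscale CNN feature embeddings
    # Standard scaled dot-product self-attention (single head)
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def residual_attention_block(x, wq, wk, wv):
    # Residual connection: the attention update is *added* to the input,
    # giving gradients a direct bypass around the attention path
    return x + self_attention(x, wq, wk, wv)

rng = np.random.default_rng(0)
d, tokens = 8, 4  # illustrative embedding dim and token count
x = rng.standard_normal((tokens, d))
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = residual_attention_block(x, wq, wk, wv)
print(out.shape)
```

Note that with a zero value projection the block reduces to the identity, which is exactly the property that makes residual blocks easy to optimize from initialization.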

Original language: English
Article number: 103949
Journal: Computer Vision and Image Understanding
State: Published - Apr 2024

Bibliographical note

Publisher Copyright:
© 2024 Elsevier Inc.


Keywords

  • Attention-based
  • Deep learning
  • Multisensor image matching


