Vision UFormer: Long-range monocular absolute depth estimation

Tomas Polasek, Martin Čadík, Yosi Keller, Bedrich Benes

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

We introduce Vision UFormer (ViUT), a novel deep neural long-range monocular depth estimator. The input is an RGB image, and the output is an image that stores the absolute distance of the object in the scene as its per-pixel values. ViUT consists of a Transformer encoder and a ResNet decoder combined with the UNet style of skip connections. It is trained on 1M images across ten datasets in a staged regime that starts with easier-to-predict data such as indoor photographs and continues to more complex long-range outdoor scenes. We show that ViUT provides comparable results for normalized relative distances and short-range classical datasets such as NYUv2 and KITTI. We further show that it successfully estimates absolute long-range depth in meters. We validate ViUT on a wide variety of long-range scenes showing its high estimation capabilities with a relative improvement of up to 23%. Absolute depth estimation finds application in many areas, and we show its usability in image composition, range annotation, defocus, and scene reconstruction. Our models are available at cphoto.fit.vutbr.cz/viut.
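
The abstract separates two ways of reading a depth map: normalized relative distances and absolute distances in meters. The sketch below illustrates that distinction on a tiny hand-made example. It is not taken from the paper. The helper names (`normalize_depth`, `abs_rel`, `relative_improvement`), the choice of min-max normalization, the mean absolute relative error metric, and every number are assumptions made for illustration only. The abstract does not say which metric underlies the reported improvement.

```python
def normalize_depth(depth_m):
    """Min-max normalize an absolute depth map (meters) to the range [0, 1]."""
    flat = [d for row in depth_m for d in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no relative information.
        return [[0.0 for _ in row] for row in depth_m]
    return [[(d - lo) / span for d in row] for row in depth_m]


def abs_rel(pred, gt):
    """Mean absolute relative error; pixels with non-positive ground truth are skipped."""
    errors = [
        abs(p - g) / g
        for pred_row, gt_row in zip(pred, gt)
        for p, g in zip(pred_row, gt_row)
        if g > 0
    ]
    return sum(errors) / len(errors)


def relative_improvement(baseline_err, new_err):
    """Fraction by which new_err is lower than baseline_err."""
    return (baseline_err - new_err) / baseline_err


# Hypothetical 2x2 depth maps in meters (illustrative values, not measured data).
gt = [[100.0, 200.0], [400.0, 1000.0]]
pred = [[110.0, 180.0], [400.0, 900.0]]

relative_gt = normalize_depth(gt)      # ordering and ratios only, no metric scale
error = abs_rel(pred, gt)              # evaluated on absolute distances
gain = relative_improvement(0.10, 0.08)  # hypothetical baseline and new errors

print(relative_gt)
print(error)
print(gain)
```

Normalizing discards the metric scale, so two scenes with very different ranges can share the same normalized map. An absolute estimator has to recover that scale as well, which is why the two settings are evaluated separately.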

Original language: English
Pages (from-to): 180-189
Number of pages: 10
Journal: Computers and Graphics (Pergamon)
Volume: 111
DOIs
State: Published - Apr 2023

Bibliographical note

Publisher Copyright:
© 2023 Elsevier Ltd

Funding

This work was supported by project LTAIZ19004 Deep-Learning Approach to Topographical Image Analysis; by the Ministry of Education, Youth and Sports of the Czech Republic within the activity INTER-EXCELENCE (LT), subactivity INTER-ACTION (LTA), ID: SMSM2019LTAIZ. Computational resources were partly supplied by the project “e-Infrastruktura CZ” (e-INFRA CZ ID:90140) supported by the Ministry of Education, Youth and Sports of the Czech Republic.

Funders
Ministerstvo Školství, Mládeže a Tělovýchovy

    Keywords

    • Absolute depth prediction
    • Long-range
    • Monocular
    • Transformer
