TY - JOUR
T1 - Vision UFormer
T2 - Long-range monocular absolute depth estimation
AU - Polasek, Tomas
AU - Čadík, Martin
AU - Keller, Yosi
AU - Benes, Bedrich
N1 - Publisher Copyright: © 2023 Elsevier Ltd
PY - 2023/4
Y1 - 2023/4
N2 - We introduce Vision UFormer (ViUT), a novel deep neural long-range monocular depth estimator. The input is an RGB image, and the output is an image that stores the absolute distance of the object in the scene as its per-pixel values. ViUT consists of a Transformer encoder and a ResNet decoder combined with the UNet style of skip connections. It is trained on 1M images across ten datasets in a staged regime that starts with easier-to-predict data such as indoor photographs and continues to more complex long-range outdoor scenes. We show that ViUT provides comparable results for normalized relative distances and short-range classical datasets such as NYUv2 and KITTI. We further show that it successfully estimates absolute long-range depth in meters. We validate ViUT on a wide variety of long-range scenes showing its high estimation capabilities with a relative improvement of up to 23%. Absolute depth estimation finds application in many areas, and we show its usability in image composition, range annotation, defocus, and scene reconstruction. Our models are available at cphoto.fit.vutbr.cz/viut.
KW - Absolute depth prediction
KW - Long-range
KW - Monocular
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85149382691&partnerID=8YFLogxK
U2 - 10.1016/j.cag.2023.02.003
DO - 10.1016/j.cag.2023.02.003
M3 - Article
AN - SCOPUS:85149382691
SN - 0097-8493
VL - 111
SP - 180
EP - 189
JO - Computers and Graphics
JF - Computers and Graphics
ER -