Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer
This repository contains code to compute depth from a single image. It accompanies our paper:

**Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer**
René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, Vladlen Koltun

and our preprint:

**Vision Transformers for Dense Prediction**
René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

For the latest release, MiDaS 3.1, a technical report and video are available.
MiDaS was trained on up to 12 datasets (ReDWeb, DIML, Movies, MegaDepth, WSVD, TartanAir, HRWSI, ApolloScape, BlendedMVS, IRS, KITTI, NYU Depth V2) with multi-objective optimization. The original model, trained on 5 datasets (`MIX 5` in the paper), can be found here. The figure below shows an overview of the different MiDaS models; the bubble size scales with the number of parameters.
### Setup

1) Pick one or more models and download the corresponding weights to the `weights` folder:

   - MiDaS 3.1
   - MiDaS 3.0: Legacy transformer models `dpt_large_384` and `dpt_hybrid_384`
   - MiDaS 2.1: Legacy convolutional models `midas_v21_384` and `midas_v21_small_256`

2) Set up dependencies. The Next-ViT and OpenVINO models require additional packages.

### Usage

1) Place one or more input images in the folder `input`.

2) Run the model with

   `python run.py --model_type <model_type>`

   where `<model_type>` is chosen from `dpt_beit_large_512`, `dpt_beit_large_384`, `dpt_beit_base_384`, `dpt_swin2_large_384`, `dpt_swin2_base_384`, `dpt_swin2_tiny_256`, `dpt_swin_large_384`, `dpt_next_vit_large_384`, `dpt_levit_224`, `dpt_large_384`, `dpt_hybrid_384`, `midas_v21_384`, `midas_v21_small_256` or `openvino_midas_v21_small_256`.

3) The resulting depth maps are written to the `output` folder.
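For example, an end-to-end run could look like the sketch below. The weight filename `dpt_beit_large_512.pt` is only an assumption for illustration; use whichever weights you actually downloaded.

```shell
# Hypothetical end-to-end run; the weight filename is an assumption for illustration.
mkdir -p weights input output
# 1) download e.g. dpt_beit_large_512.pt into weights/
# 2) copy your RGB images into input/
python run.py --model_type dpt_beit_large_512
# 3) the corresponding depth maps appear in output/
```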
#### optional

1) By default, the inference resizes the height of input images to the size of a model to fit into the encoder. This size is given by the numbers in the model names of the accuracy table. Some models support not only a single inference height but a range of different heights. Feel free to explore different heights by appending the extra command line argument `--height`. Unsupported height values will throw an error. Note that using this argument may decrease the model accuracy.

2) By default, the inference keeps the aspect ratio of input images when feeding them into the encoder if this is supported by a model (all models except for Swin, Swin2 and LeViT). In order to resize to a square resolution, disregarding the aspect ratio while preserving the height, use the command line argument `--square`. A sketch of both flags is given below.
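As a sketch of the two optional flags (the height value 352 is just an example; whether a given model accepts it depends on the model, and unsupported values throw an error):

```shell
# Custom inference height (example value; unsupported heights throw an error)
python run.py --model_type dpt_beit_large_512 --height 352

# Square inference resolution, disregarding the input aspect ratio
python run.py --model_type dpt_beit_large_512 --square
```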
#### via Camera

If you want the input images to be grabbed from the camera and shown in a window, leave the input and output paths away and choose a model type as shown above:

```shell
python run.py --model_type <model_type> --side
```

The argument `--side` is optional and causes both the input RGB image and the output depth map to be shown side-by-side for comparison.
#### via Docker

Make sure you have installed Docker and the NVIDIA Docker runtime. Build the Docker image and run the inference in the container; the run command passes through all of your NVIDIA GPUs, mounts the `input` and `output` directories and then runs the inference (a hedged sketch of the commands is given after the "via ROS1" subsection below).

#### via PyTorch Hub

The pretrained model is also available on PyTorch Hub (a usage sketch is given after the "via ROS1" subsection below).

#### via TensorFlow or ONNX

See the README in the `tf` subdirectory. Currently only supports MiDaS v2.1.

#### via Mobile (iOS / Android)

See the README in the `mobile` subdirectory.

#### via ROS1 (Robot Operating System)

See the README in the `ros` subdirectory. Currently only supports MiDaS v2.1. DPT-based models to be added.
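A minimal sketch of the Docker workflow described above, assuming the image is tagged `midas` and the code lives under `/opt/MiDaS` inside the container (both are assumptions; adjust them to the Dockerfile):

```shell
# Build the image (the tag name is an arbitrary choice)
docker build -t midas .

# Run inference: pass through all NVIDIA GPUs and mount the input/output folders.
# The container-side path /opt/MiDaS is an assumption; match it to the Dockerfile's WORKDIR.
docker run --rm --gpus all \
    -v $PWD/input:/opt/MiDaS/input \
    -v $PWD/output:/opt/MiDaS/output \
    midas
```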
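And a sketch of loading MiDaS through PyTorch Hub; the repository name `intel-isl/MiDaS` and the entry points below follow the usual MiDaS hub usage, but treat them as assumptions and check the hub page for the exact names:

```python
import cv2
import torch

# Load a MiDaS model and the matching input transforms from PyTorch Hub.
# Repo and entry-point names are assumptions based on the usual MiDaS hub usage.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
midas.eval()

img = cv2.cvtColor(cv2.imread("input/example.jpg"), cv2.COLOR_BGR2RGB)
batch = midas_transforms.dpt_transform(img)   # resize + normalize for DPT models

with torch.no_grad():
    prediction = midas(batch)                 # relative inverse depth, shape (1, H', W')
```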
### Accuracy

We provide a zero-shot error $\epsilon_d$ which is evaluated for 6 different datasets (see paper). Lower error values are better. $\color{green}{\textsf{Overall model quality is represented by the improvement}}$ (Imp.) with respect to MiDaS 3.0 DPT<sub>L-384</sub>. The models are grouped by the height used for inference, whereas the square training resolution is given by the numbers in the model names. The table also shows the number of parameters (in millions) and the frames per second for inference at the training resolution (on an RTX 3090 GPU):

\* No zero-shot error, because these models are also trained on KITTI and NYU Depth V2

$\square$ Validation performed at square resolution, either because the transformer encoder backbone of a model does not support non-square resolutions (Swin, Swin2, LeViT) or for comparison with these models. All other validations keep the aspect ratio. A difference in resolution limits the comparability of the zero-shot error and the improvement, because these quantities are averages over the pixels of an image and do not take into account the advantage of more details due to a higher resolution.

Best values per column and same validation height are shown in bold.
### Improvement

The improvement in the above table is defined as the relative zero-shot error with respect to MiDaS v3.0 DPT<sub>L-384</sub>, averaged over the datasets. So, if $\epsilon_d$ is the zero-shot error for dataset $d$, then the $\color{green}{\textsf{improvement}}$ is given by $100(1-(1/6)\sum_d\epsilon_d/\epsilon_{d,\rm{DPT_{L-384}}})$%.

Note that the improvements of 10% for MiDaS v2.0 → v2.1 and 21% for MiDaS v2.1 → v3.0 are not visible in the improvement column (Imp.) of the table; they would require an evaluation with respect to MiDaS v2.1 Large<sub>384</sub> and v2.0 Large<sub>384</sub> respectively, instead of v3.0 DPT<sub>L-384</sub>.
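As a quick illustration of this formula, here is a minimal sketch with made-up zero-shot errors (the numbers are not taken from the table):

```python
# Illustration of the improvement formula with made-up numbers (not from the table).
eps_model = [0.10, 0.12, 0.08, 0.11, 0.09, 0.10]   # zero-shot errors of some model on the 6 datasets
eps_ref   = [0.12, 0.15, 0.10, 0.12, 0.11, 0.13]   # corresponding errors of MiDaS 3.0 DPT_L-384

improvement = 100 * (1 - sum(m / r for m, r in zip(eps_model, eps_ref)) / len(eps_ref))
print(f"Improvement: {improvement:.1f}%")  # positive means better than the DPT_L-384 reference
```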
### Depth map comparison

Zoom in for better visibility.
### Speed on Camera Feed

Test configuration:

- Windows 10
- 11th Gen Intel Core i7-1185G7 @ 3.00 GHz
- 16 GB RAM
- Camera resolution 640x480
- Model: `openvino_midas_v21_small_256`

Speed: 22 FPS
### Applications

MiDaS is used in the following other projects from Intel Labs:

- **ZoeDepth** (code available here): MiDaS computes the relative depth map given an image. For metric depth estimation, ZoeDepth can be used, which combines MiDaS with a metric depth binning module appended to the decoder. See the sketch below.
- **LDM3D** (Hugging Face model available here): LDM3D is an extension of vanilla Stable Diffusion designed to generate joint image and depth data from a text prompt. The depth maps used for supervision when training LDM3D have been computed using MiDaS.
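A hedged sketch of using ZoeDepth on top of MiDaS for metric depth; the hub repository `isl-org/ZoeDepth`, the entry point `ZoeD_N` and the `infer_pil` helper reflect the ZoeDepth project's usual usage, but treat them as assumptions and consult that repository for the exact API:

```python
import torch
from PIL import Image

# Metric depth via ZoeDepth, which wraps a MiDaS backbone with a metric binning head.
# Repo/entry-point names are assumptions; consult the ZoeDepth repository for the exact API.
zoe = torch.hub.load("isl-org/ZoeDepth", "ZoeD_N", pretrained=True)
zoe.eval()

image = Image.open("input/example.jpg").convert("RGB")
depth_metric = zoe.infer_pil(image)   # depth map in meters as a numpy array
```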
### Citation

Please cite our paper if you use this code or any of the models:

```
@ARTICLE {Ranftl2022,
    author  = "Ren\'{e} Ranftl and Katrin Lasinger and David Hafner and Konrad Schindler and Vladlen Koltun",
    title   = "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer",
    journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
    year    = "2022",
    volume  = "44",
    number  = "3"
}
```

If you use a DPT-based model, please also cite:

```
@article{Ranftl2021,
    author  = {Ren\'{e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
    title   = {Vision Transformers for Dense Prediction},
    journal = {ICCV},
    year    = {2021},
}
```

Please cite the technical report for MiDaS 3.1 models:

```
@article{birkl2023midas,
    title   = {MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation},
    author  = {Reiner Birkl and Diana Wofk and Matthias M{\"u}ller},
    journal = {arXiv preprint arXiv:2307.14460},
    year    = {2023}
}
```

For ZoeDepth, please use

```
@article{bhat2023zoedepth,
    title   = {Zoedepth: Zero-shot transfer by combining relative and metric depth},
    author  = {Bhat, Shariq Farooq and Birkl, Reiner and Wofk, Diana and Wonka, Peter and M{\"u}ller, Matthias},
    journal = {arXiv preprint arXiv:2302.12288},
    year    = {2023}
}
```

and for LDM3D

```
@article{stan2023ldm3d,
    title   = {LDM3D: Latent Diffusion Model for 3D},
    author  = {Stan, Gabriela Ben Melech and Wofk, Diana and Fox, Scottie and Redden, Alex and Saxton, Will and Yu, Jean and Aflalo, Estelle and Tseng, Shao-Yen and Nonato, Fabio and Muller, Matthias and others},
    journal = {arXiv preprint arXiv:2305.10853},
    year    = {2023}
}
```

### Acknowledgements

Our work builds on and uses code from [timm](https://github.com/rwightman/pytorch-image-models) and [Next-ViT](https://github.com/bytedance/Next-ViT).
We'd like to thank the authors for making these libraries available.

### License

MIT License