RGB-D-Fusion - Image Conditioned Depth Diffusion of Humanoid Subjects


We present RGB-D-Fusion, a multi-modal conditional denoising diffusion probabilistic model to generate high-resolution depth maps from low-resolution monocular RGB images of humanoid subjects. RGB-D-Fusion first generates a low-resolution depth map using an image-conditioned denoising diffusion probabilistic model and then upsamples the depth map using a second denoising diffusion probabilistic model conditioned on a low-resolution RGB-D image. We further introduce a novel augmentation technique, depth noise augmentation, to increase the robustness of our super-resolution model.

Authors: Sascha Kirch, Valeria Olyunina, Jan Ondřej, Rafael Pagés, Sergio Martín & Clara Pérez-Molina

[Paper] [Code]

Overview

Our RGB-D-Fusion framework takes an RGB image as input and outputs an RGB-D image with depth in perspective projection. We first downsample the input image from 256×256 to 64×64 and predict a perspective depth map using a conditional denoising diffusion probabilistic model. The predicted depth map is combined with the downsampled RGB input into a 64×64 RGB-D image, which conditions a super-resolution diffusion model that produces a high-resolution depth map. This high-resolution depth map is combined with the input RGB to construct the final output: a high-resolution RGB-D image. To reconstruct the real depth in camera coordinates, a projection matrix can be applied if available.
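
The two-stage inference procedure can be summarized in a short sketch. The snippet below is an illustrative PyTorch-style outline, not the released implementation: the callables `depth_ddpm` and `superres_ddpm` are hypothetical placeholders standing in for the reverse diffusion sampling loops of the two models, and the tensor shapes, value ranges, and channel-wise concatenation of the condition are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def rgbd_fusion_inference(rgb_256, depth_ddpm, superres_ddpm):
    """Two-stage RGB-D-Fusion inference sketch (illustrative, not the official code).

    rgb_256:       (B, 3, 256, 256) input RGB, assumed normalized to [-1, 1]
    depth_ddpm:    callable(condition) -> (B, 1, 64, 64) depth sample
    superres_ddpm: callable(condition) -> (B, 1, 256, 256) depth sample
    Both callables are assumed to wrap the full reverse diffusion loop.
    """
    # Stage 1: downsample the RGB condition to 64x64 and sample a low-resolution depth map.
    rgb_64 = F.interpolate(rgb_256, size=(64, 64), mode="bilinear", align_corners=False)
    depth_64 = depth_ddpm(rgb_64)                      # (B, 1, 64, 64)

    # Combine the predicted depth with the low-resolution RGB into an RGB-D condition.
    rgbd_64 = torch.cat([rgb_64, depth_64], dim=1)     # (B, 4, 64, 64)

    # Stage 2: super-resolve the depth conditioned on the low-resolution RGB-D image.
    depth_256 = superres_ddpm(rgbd_64)                 # (B, 1, 256, 256)

    # Final output: high-resolution RGB-D image.
    rgbd_256 = torch.cat([rgb_256, depth_256], dim=1)  # (B, 4, 256, 256)
    return rgbd_256
```

A projection matrix, if available, would then be applied to the depth channel of the returned RGB-D image to recover the real depth in camera coordinates.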

rgb-d-fusion framework
Fig.1 - RGB-D-Fusion framework.

Contribution

  • We provide a framework for high-resolution dense monocular depth estimation using diffusion models.
  • We perform super-resolution for dense depth data conditioned on a multi-modal RGB-D input using diffusion models.
  • We introduce a novel augmentation technique, namely depth noise augmentation, to enhance the robustness of the depth super-resolution model (see the sketch after this list).
  • We perform rigorous ablations and experiments to validate our design choices.
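
As a rough illustration of depth noise augmentation, the sketch below perturbs only the depth channel of the low-resolution RGB-D condition before it is fed to the super-resolution model during training. The function name, noise distribution (Gaussian), standard deviation, and application probability are illustrative assumptions, not the settings reported in the paper.

```python
import torch

def depth_noise_augmentation(rgbd_lowres, noise_std=0.05, p=0.5):
    """Perturb the depth channel of the low-resolution RGB-D condition (illustrative sketch).

    rgbd_lowres: (B, 4, H, W) tensor, channels = RGB + depth, assumed in [-1, 1].
    noise_std and p are placeholder values, not the paper's settings.
    """
    if torch.rand(()) < p:
        rgb, depth = rgbd_lowres[:, :3], rgbd_lowres[:, 3:]
        # Corrupt only the depth channel so the super-resolution model learns to
        # cope with imperfect depth conditions produced by the first stage.
        depth = depth + noise_std * torch.randn_like(depth)
        rgbd_lowres = torch.cat([rgb, depth.clamp(-1.0, 1.0)], dim=1)
    return rgbd_lowres
```

Training with such perturbed conditions is intended to make the super-resolution stage robust to the imperfect depth maps it receives from the first stage at inference time.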

Results

Prediction vs. GT

Inference results of our RGB-D-Fusion framework on the test set at various stages, compared to the ground truth. The 64×64 depth is generated by our depth diffusion stage. The 256×256 depth is generated from the 64×64 depth using our depth super-resolution diffusion model.

results_gt
Fig.2 - Inference results compared against ground truth.

In the wild predictions

Outputs of our RGB-D-Fusion framework for input images where no ground truth is available. The input images were either captured with a variety of single-view mobile cameras (Fig.3 and Fig.4) or generated with Stable Diffusion v1.5 (Fig.5).

results_in_the_wild_1
Fig.3 - In the wild prediction for images captured with a single-view mobile camera.
results_in_the_wild_1
Fig.4 - In the wild prediction for images captured with a single-view mobile camera.
results_in_the_wild_1
Fig.5 - In the wild prediction for images generated with Stable Diffusion v1.5.