Any360D: Towards 360 Depth Anything with Unlabeled 360 Data and Möbius Spatial Augmentation


Zidong Cao1    Jinjing Zhu1*    Weiming Zhang1*    Lin Wang1,2†
1AI Thrust, HKUST(GZ)    2Dept. of CSE, HKUST
* Equal contribution    † Corresponding author

Our code will be released soon.

360 Depth Visualization on Videos

Comparison between Depth Anything Model (DAM) and our Any360D

Abstract

Recently, the Depth Anything Model (DAM) has revealed impressive zero-shot capability on diverse perspective images. Despite this success, DAM's performance on 360 images, which enjoy a large field-of-view (180°×360°) but suffer from spherical distortions, remains an open question. To this end, we establish, to our knowledge, the first benchmark that aims to 1) evaluate the performance of DAM on 360 images and 2) develop a powerful 360 DAM for the benefit of the community. We conduct a large suite of experiments that consider the key properties of 360 images, e.g., different 360 representations, various spatial transformations, and diverse indoor and outdoor scenes. Our benchmark unveils some key findings, e.g., that DAM is less effective on diverse 360 scenes and is sensitive to spatial transformations. To address these challenges, we first collect a large-scale unlabeled dataset covering diverse indoor and outdoor scenes. We then propose a semi-supervised learning (SSL) framework to learn a 360 DAM, dubbed Any360D. Under the umbrella of SSL, Any360D first learns a teacher model by fine-tuning DAM with metric depth supervision. We then train a student model that uncovers the potential of the large-scale unlabeled data via pseudo labels from the teacher model. A Möbius transformation-based spatial augmentation (MTSA) is proposed to impose consistency regularization between the unlabeled data and its spatially transformed counterparts. This improves the student model's robustness to various spatial transformations, even under severe distortions. Extensive experiments demonstrate that Any360D outperforms DAM and many prior data-specific models, e.g., PanoFormer, across diverse scenes, showing impressive zero-shot capability as a 360 depth foundation model.

[Figure: Overview of the Any360D pipeline]

Utilized Data

Training set of the Matterport3D dataset (7,829 RGB-D 360 images); training set of the ZInD dataset (54,034 360 images); our Diverse360 dataset (12,063 360 images). Below are samples from our Diverse360 dataset at the campus level.

[Figure: Samples from our Diverse360 dataset]

Benchmarking Depth Anything Model

We benchmark the Depth Anything Model (DAM) on 360 images, considering different 360 image representations, 360 spatial transformations, and backbone model sizes.

[Figures: Benchmark results of DAM on 360 images]
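As a reference point for reproducing such a benchmark, the public DAM checkpoints can be queried directly on an equirectangular (ERP) image. Below is a minimal sketch using the Hugging Face depth-estimation pipeline; the checkpoint name and file path are illustrative, and this is not the paper's benchmark code.

```python
# Minimal sketch: querying a public DAM checkpoint on an equirectangular
# 360 image via the Hugging Face depth-estimation pipeline. The checkpoint
# name and file path are illustrative, not the paper's benchmark code.
from PIL import Image
from transformers import pipeline

pipe = pipeline(task="depth-estimation",
                model="LiheYoung/depth-anything-small-hf")  # small backbone; base/large also exist

erp = Image.open("samples/indoor_erp.jpg")  # hypothetical ERP input
pred = pipe(erp)

pred["depth"].save("dam_erp_depth.png")  # relative depth map as a PIL image
print(pred["predicted_depth"].shape)     # raw tensor output
```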

Visualization of the 360 spatial transformation

The 360 spatial transformation is achieved with the Möbius transformation, the only bijective conformal transformation on the sphere. The visualization below covers two instances: vertical rotation and zoom.

[Figure: Möbius-transformed 360 images under vertical rotation and zoom]
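To make the two transformations concrete, here is a minimal NumPy sketch that inverse-warps an ERP image by a sphere rotation (vertical rotation) and a scaling in the stereographic plane (zoom); both are Möbius transformations of the sphere. Function names, the sampling scheme, and the composition order are our own illustration and may differ from the paper's MTSA implementation.

```python
# Minimal NumPy sketch of the two Möbius-type 360 spatial transformations
# shown above. Names and composition order are illustrative, not the
# paper's MTSA implementation.
import numpy as np

def erp_to_sphere(h, w):
    """Unit-sphere direction for every pixel of an H x W ERP image."""
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi   # [-pi, pi)
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi   # [pi/2, -pi/2)
    lon, lat = np.meshgrid(lon, lat)
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)              # (H, W, 3)

def mobius_warp(img, pitch_deg=0.0, zoom=1.0):
    """Inverse-warp an ERP image; the forward transform rotates, then zooms."""
    h, w, _ = img.shape
    p = erp_to_sphere(h, w)
    # Inverse zoom: stereographic projection from the north pole,
    # scale by 1/zoom, then project back onto the sphere.
    z = (p[..., 0] + 1j * p[..., 1]) / (1.0 - p[..., 2] + 1e-9)
    z = z / zoom
    d = 1.0 + np.abs(z) ** 2
    p = np.stack([2 * z.real / d, 2 * z.imag / d,
                  (np.abs(z) ** 2 - 1.0) / d], axis=-1)
    # Inverse vertical rotation: rotate the sphere about the y-axis by -pitch.
    t = -np.deg2rad(pitch_deg)
    rot = np.array([[np.cos(t), 0.0, np.sin(t)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(t), 0.0, np.cos(t)]])
    p = p @ rot.T
    # Back to ERP pixel coordinates, nearest-neighbour sampling.
    lon = np.arctan2(p[..., 1], p[..., 0])
    lat = np.arcsin(np.clip(p[..., 2], -1.0, 1.0))
    u = ((lon + np.pi) / (2 * np.pi) * w).astype(int) % w
    v = ((np.pi / 2 - lat) / np.pi * h).astype(int).clip(0, h - 1)
    return img[v, u]
```

Because the Möbius transformation is conformal, local image structure is preserved while the distortion pattern of the panorama changes, which is what the consistency regularization in MTSA builds on.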

Framework

The semi-supervised learning framework of our Any360D is shown below.

[Figure: The semi-supervised learning framework of Any360D]
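Condensed into code, one SSL training step combines a supervised term on labeled data with an MTSA consistency term on unlabeled data. The sketch below uses our own naming (ssl_step, warp_fn, the L1 losses, and the weight lam are assumptions for illustration); it is not the released training code.

```python
# Condensed sketch of one teacher-student SSL step with MTSA consistency.
# ssl_step, warp_fn, the L1 losses, and lam are illustrative assumptions,
# not the released Any360D training code.
import torch
import torch.nn.functional as F

def ssl_step(student, teacher, labeled, unlabeled, warp_fn, optimizer, lam=1.0):
    rgb, depth_gt = labeled   # labeled 360 batch (e.g., Matterport3D)

    # 1) Supervised term on labeled data. The teacher was obtained
    #    beforehand by fine-tuning DAM with metric depth supervision.
    loss_sup = F.l1_loss(student(rgb), depth_gt)

    # 2) Pseudo label from the frozen teacher on the clean unlabeled view.
    with torch.no_grad():
        pseudo = teacher(unlabeled)

    # 3) MTSA consistency: apply the SAME Möbius warp (e.g., a batched
    #    version of mobius_warp above) to the input and the pseudo label,
    #    then penalize disagreement of the student's prediction.
    loss_cons = F.l1_loss(student(warp_fn(unlabeled)), warp_fn(pseudo))

    loss = loss_sup + lam * loss_cons
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```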

Comparison with SOTA monocular 360 depth estimation methods

The quantitative comparison is shown below.

[Figure: Quantitative comparison with SOTA monocular 360 depth estimation methods]