[Paper] Self-Supervised Scale Recovery for Monocular Depth Estimation and Egomotion Estimation

1. Motivation

One fundamental limitation of self-supervised monocular depth estimation is scale ambiguity. The photometric reconstruction loss alone cannot recover metric scale, so the scale factor of the depth and egomotion predictions varies inconsistently from one input to the next.

“A limitation of the photometric reconstruction loss is that it can only be used to train depth and egomotion networks that produce unscaled predictions. Furthermore, the predictions are scale inconsistent: different inputs produce depth and egomotion predictions with a varying scale factor.”
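
The ambiguity can be checked numerically. The sketch below is not taken from the paper: it assumes a pinhole camera, and the intrinsics `K`, the motion `(R, t)`, and the helper `reproject` are hypothetical values and names chosen only for illustration. It shows that scaling depth and translation by the same factor leaves the reprojected pixel unchanged, which is why a loss computed on reprojected images cannot pin down metric scale.

```python
import numpy as np

def reproject(pixel, depth, K, R, t):
    """Back-project a pixel with its depth, apply the motion (R, t), and project again."""
    u, v = pixel
    point = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))  # 3D point, source camera
    moved = R @ point + t                                        # same point, target camera
    projected = K @ moved
    return projected[:2] / projected[2]                          # pixel in the target image

# Hypothetical intrinsics and camera motion (illustration only).
K = np.array([[720.0, 0.0, 620.0],
              [0.0, 720.0, 190.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, -1.0])

s = 3.5  # arbitrary scale factor
base = reproject((400.0, 250.0), 10.0, K, R, t)
jointly_scaled = reproject((400.0, 250.0), s * 10.0, K, R, s * t)
depth_only_scaled = reproject((400.0, 250.0), s * 10.0, K, R, t)

print(np.allclose(base, jointly_scaled))     # same pixel: the loss cannot see s
print(np.allclose(base, depth_only_scaled))  # different pixel: depth and motion must agree
```

Because only the joint scale of depth and translation is unobservable, an extra metric cue (here, the camera height) is enough to fix both.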

2. Related Work

Until now, there has been no self-supervised loss function that enforces metric scale in monocular systems.

Godard et al. (Monodepth): training on stereo pairs with left-right consistency allows absolute scale to be learned, but this does not carry over to monocular setups (e.g., monodepth2 trained on monocular video).
DNet (Toward Hierarchical Self-Supervised Monocular Absolute Depth Estimation for Autonomous Driving Applications): requires test images in which the ground plane is visible. This makes it hard to apply in settings such as drones, where the ground may not be in view.

3. Proposed Method

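This paper moves the metric scale comparison into the training process itself, so the ground plane no longer needs to be visible at test time.

The key idea is to use the difference between the camera height estimated via ground plane segmentation and the known camera height as a loss. The procedure is as follows: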

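Relative → Absolute scale conversion: extract the ground region via plane segmentation, and adjust the scale by the ratio between the known (ground-truth) camera height and the estimated height
Camera height prior: 1.7 m for KITTI, 1.52 m for Oxford
Segmentation model integration: a ground segmentation model is included in the training pipeline to predict the ground region
Camera height computation: estimate the unit normal vector of the ground from the point cloud of the segmented ground region, then project the depth network's point cloud onto that normal to compute the camera height
Scale Recovery Loss: the difference between the estimated camera height and the known camera height is used as the loss
Scale application: multiply the predicted depth by the scaling factor to rescale it, and apply .detach() so that the rescaled depth acts as a supervision target (controlling gradient flow)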
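
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`backproject`, `estimate_camera_height`, `scale_recovery`) are hypothetical, the ground normal is obtained with an SVD plane fit (a swapped-in technique, since these notes do not say how the paper computes it), the per-point heights are aggregated with a median, and the loss is a plain absolute difference. The flat-ground test scene and the intrinsics `K` are synthetic.

```python
import numpy as np

def backproject(depth, K):
    """Turn an HxW depth map into an HxWx3 point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T          # K^-1 applied to every pixel
    return rays * depth[..., None]

def estimate_camera_height(points, ground_mask):
    """Fit a plane to the ground points and project them onto its normal."""
    ground = points[ground_mask]                # (N, 3) ground points
    centered = ground - ground.mean(axis=0)
    # Assumption: SVD plane fit. The right singular vector with the
    # smallest singular value is the unit normal of the best-fit plane.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    # The camera is at the origin, so |n . p| is its distance to the plane.
    return float(np.median(np.abs(ground @ normal)))

def scale_recovery(depth, ground_mask, K, known_height):
    """Return the scale factor, the height loss, and the estimated height."""
    est_height = estimate_camera_height(backproject(depth, K), ground_mask)
    scale = known_height / est_height
    loss = abs(est_height - known_height)       # assumed L1 form of the loss
    # In a PyTorch training loop, scale * depth would be detached before
    # being used as a target, so no gradient flows through the target.
    return scale, loss, est_height

# Synthetic flat ground 1.7 m below the camera (y axis pointing down).
K = np.array([[720.0, 0.0, 32.0], [0.0, 720.0, 24.0], [0.0, 0.0, 1.0]])
h, w = 48, 64
rows = np.arange(h, dtype=np.float64)[:, None] * np.ones((1, w))
ground_mask = rows > 24.0                       # pixels below the principal point
true_depth = np.full((h, w), 50.0)
true_depth[ground_mask] = 1.7 * 720.0 / (rows[ground_mask] - 24.0)

predicted = true_depth / 4.0                    # emulate an unscaled prediction
scale, loss, est_height = scale_recovery(predicted, ground_mask, K, known_height=1.7)
print(scale, loss, est_height)
```

In this synthetic case the prediction is four times too small, so the estimated height is a quarter of the known height and the recovered scale factor brings the depth back to metric units.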

4. Experiments

Evaluated on the KITTI and Oxford datasets using absolute-scale depth metrics
Metric-scale depth can be predicted at test time without a visible ground plane
Scale consistency improves over existing methods
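
To make the difference between absolute-scale and scale-aligned evaluation concrete, here is a small sketch. It assumes the commonly used Abs Rel and RMSE errors (these notes do not list which metrics the paper reports), and the function name `depth_errors` and the toy arrays are hypothetical. With `median_scaling=True` the prediction is first aligned to the ground truth by the ratio of medians, which is the usual protocol for unscaled monocular models; absolute-scale evaluation skips that step.

```python
import numpy as np

def depth_errors(pred, gt, median_scaling=False):
    """Return (abs_rel, rmse) over pixels that have valid ground truth."""
    valid = gt > 0                               # zero marks missing ground truth
    pred, gt = pred[valid], gt[valid]
    if median_scaling:
        # Per-image alignment that hides any global scale error.
        pred = pred * np.median(gt) / np.median(pred)
    abs_rel = float(np.mean(np.abs(pred - gt) / gt))
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    return abs_rel, rmse

# Toy example: the prediction has the right structure but half the scale.
gt = np.array([[5.0, 10.0], [20.0, 0.0]])
pred = np.array([[2.5, 5.0], [10.0, 3.0]])

absolute = depth_errors(pred, gt)                       # scale error is penalized
aligned = depth_errors(pred, gt, median_scaling=True)   # scale error is hidden
print(absolute, aligned)
```

A model that truly predicts metric depth should score about the same under both settings, whereas an unscaled model only does well after alignment.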

5. Conclusion & Limitation

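The paper proposes a method that recovers metric scale in a self-supervised manner by using ground plane segmentation and a camera height prior during training. Not needing the ground to be visible at test time is a practical advantage over DNet. However, because training requires the camera height and ground segmentation labels, it cannot be regarded as fully label-free. The method is also of limited use in environments with little or no visible ground (drones, indoor scenes, etc.).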
