[Paper] Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation

1. Motivation

Self-supervised depth estimation based on photometric consistency holds only under a static-scene assumption. Scenes that contain moving objects or non-rigid motion therefore need separate handling, such as masking.

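The core problem this paper addresses is how to disentangle the coupling between camera motion and object motion, because approaches that estimate only depth and camera motion break down in dynamic environments.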
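To make the coupling concrete, here is a minimal sketch (not the paper's formulation) of how a pixel's apparent displacement mixes camera motion with object motion. The pinhole intrinsics, the identity-rotation simplification, and the helper names `project` and `move_point` are assumptions made for illustration only.

```python
def project(point, focal, cx, cy):
    """Pinhole projection of a 3D camera-frame point (X, Y, Z) to a pixel (u, v)."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def move_point(point, cam_shift, obj_motion=(0.0, 0.0, 0.0)):
    """Displace a 3D point by the camera-induced shift plus its own object motion.

    Rotation is left out (identity) to keep the sketch short.
    """
    return tuple(p + c + o for p, c, o in zip(point, cam_shift, obj_motion))


# Hypothetical values, chosen only to illustrate the effect.
focal, cx, cy = 500.0, 320.0, 240.0
point = (1.0, 0.0, 10.0)    # 3D point in the source camera frame
cam_shift = (0.0, 0.0, -1.0)  # camera moves 1 m forward, so points shift by -1 in Z
obj_motion = (0.5, 0.0, 0.0)  # the object itself moves 0.5 m sideways

# Where a static-scene model predicts the point, versus where it really lands.
static_uv = project(move_point(point, cam_shift), focal, cx, cy)
dynamic_uv = project(move_point(point, cam_shift, obj_motion), focal, cx, cy)
reprojection_error = dynamic_uv[0] - static_uv[0]
```

Under these assumed numbers the static-scene prediction misses the true pixel location horizontally, and a photometric loss would wrongly attribute that gap to depth or camera pose unless the object motion is modeled separately.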

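2. Related Work

Depth + camera motion only: Valid only for static scenes, with clear limitations in dynamic environments.
Segmentation mask guidance: Some work guides object regions with segmentation masks, but this requires a large number of ground-truth (GT) labels.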

3. Proposed Method

Attention-based Motion Module

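The paper applies attention inside the motion network and proposes a module that computes correlations capturing which regions move and by how much. The module can be applied adaptively to background (BG) and foreground (FG) objects.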
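As a rough illustration of the idea of reading motion off an attention correlation, the sketch below uses plain scaled dot-product attention between region features of two frames and turns the attention weights into an expected displacement. The notes do not describe the paper's actual architecture, so the functions `attention_weights` and `expected_displacement`, the 1D region positions, and the toy features are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product correlation of every query with every key, softmaxed per query."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    rows = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        rows.append(softmax(scores))
    return rows


def expected_displacement(weights, query_pos, key_pos):
    """Attention-weighted average of (key position - query position) for each query."""
    return [
        sum(w * (kp - qp) for w, kp in zip(row, key_pos))
        for row, qp in zip(weights, query_pos)
    ]


# Toy data (hypothetical): two regions in frame t, three candidate regions in frame t+1.
feats_t = [[4.0, 0.0], [0.0, 4.0]]
pos_t = [0.0, 1.0]
feats_t1 = [[-4.0, -4.0], [0.0, 4.0], [4.0, 0.0]]
pos_t1 = [0.0, 1.0, 2.0]

weights = attention_weights(feats_t, feats_t1)
displacement = expected_displacement(weights, pos_t, pos_t1)
```

In this toy setup the first region's best match sits two positions away while the second region's best match stays in place, so the first behaves like a moving foreground region and the second like static background.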

Contrastive Sample Consensus (CSC)

CSC rests on two assumptions.

A 2D detection box contains a potentially movable object.
Within each detection box, BG motion vectors are small and FG motion vectors are large.

Using these assumptions, the method distinguishes BG motion from FG motion and separates the samples in a contrastive-learning fashion.
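The two assumptions can be sketched as a split by motion magnitude followed by a contrastive loss. This is only one plausible reading: the notes do not give CSC's actual procedure, so the fixed threshold, the InfoNCE-style loss, and the names `split_by_motion` and `info_nce` are assumptions rather than the paper's method.

```python
import math


def split_by_motion(motion_vectors, threshold):
    """Label samples inside one detection box: large motion -> FG, small motion -> BG."""
    fg, bg = [], []
    for index, (dx, dy) in enumerate(motion_vectors):
        (fg if math.hypot(dx, dy) > threshold else bg).append(index)
    return fg, bg


def cosine(a, b):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is close to positives and far from negatives."""
    pos = sum(math.exp(cosine(anchor, p) / temperature) for p in positives)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy samples from one detection box (hypothetical motion vectors and features).
motion = [(0.1, 0.0), (3.0, 0.5), (0.0, 0.2), (2.5, -0.4)]
feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

fg, bg = split_by_motion(motion, threshold=1.0)
anchor = feats[fg[0]]
loss = info_nce(anchor, [feats[i] for i in fg[1:]], [feats[i] for i in bg])

# Same anchor, but with a BG sample treated as the positive: the loss should be much larger.
wrong_loss = info_nce(anchor, [feats[bg[0]]], [feats[fg[1]], feats[bg[1]]])
```

With these toy features the loss is small when FG samples are paired with each other and large when an FG anchor is paired with a BG sample, which is the separation behavior a contrastive objective would encourage.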

4. Experiments

The method is evaluated on joint depth and motion field estimation benchmarks, where it shows improved performance on scenes that contain dynamic objects.

5. Conclusion & Limitation

The work is meaningful in that it handles BG and FG adaptively through an attention-based motion module and separates movable objects inside detection boxes with contrastive sample consensus.

However, the concrete mechanism of CSC, and whether the detection-box assumptions still hold in genuinely complex scenes, need further verification.
