[Paper] Transformers Solve the Limited Receptive Field for Monocular Depth Prediction

1. Motivation

CNN-based monocular depth estimation has a fundamental limitation: its receptive field is limited, so it cannot capture enough global context. A Transformer, through its self-attention mechanism, can model long-range dependencies across the entire image.
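To make the receptive-field gap concrete, the following sketch computes the theoretical receptive field of a stack of convolutions with the standard recurrence: each layer adds (kernel size - 1) times the accumulated stride. The function name `receptive_field` and the layer configurations are illustrative assumptions, not taken from the paper, and dilation is ignored.

```python
def receptive_field(layers):
    """Return the theoretical receptive field (in input pixels) of stacked convs.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    Dilation is not modelled in this sketch.
    """
    rf, jump = 1, 1  # one input pixel sees itself; neighbouring outputs are 1 pixel apart
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer widens the view by (k - 1) * jump
        jump *= stride                  # striding spreads later outputs further apart
    return rf


# Ten stacked 3x3 convolutions with stride 1 (hypothetical configuration).
print(receptive_field([(3, 1)] * 10))  # 21
# Four stacked 3x3 convolutions with stride 2 (hypothetical configuration).
print(receptive_field([(3, 2)] * 4))   # 31
```

By contrast, a single self-attention layer lets every token attend to every other token, so its theoretical receptive field covers the whole image after one layer.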

This paper is the first to apply a Transformer to monocular depth estimation and surface normal prediction.

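2. Related Work

CNN-based depth estimation has evolved through architectures such as U-Net and DeepLabV3+, but all of them inherit the limitation of a local receptive field. Earlier work did use attention mechanisms for depth estimation, but this paper is the first to adopt the Transformer architecture itself as the depth estimation backbone.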

3. Proposed Method

Contributions

First application of a Transformer to depth estimation and surface normal prediction
Unified Attention Gate: an attention-gate decoder that uses and fuses multi-scale information in parallel and passes information between different affinity maps

Network architecture

Encoder: extracts feature maps at multiple scales from a ResNet
- Last feature map: fr (reference feature)
- Remaining feature maps: fe (early features, from 1 to N-1)
Attention Gate: uses fr and fe to perform the following operations
- Ie→r, L, and Ir→e are implemented as linear transformations (MLP)
- The message-passing part can likewise be implemented with feature maps and linear projections
Decoder: predicts the depth map and surface normals from the multi-scale features refined by the Attention Gate

With this structure, feature maps at different scales exchange information effectively, which enables fine-grained depth prediction while preserving global context.
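The note names the linear maps but does not reproduce the attention gate equations, so the following is only a minimal sketch of one generic gated message-passing scheme between two scales, not the paper's formulation. Everything here is an assumption made for illustration: the function `attention_gate`, the projection names `W_re`, `W_er`, and `W_msg`, the softmax affinity, the sigmoid gate, and the token shapes. `f_r` and `f_e` stand for flattened versions of fr and fe.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_gate(f_r, f_e, W_re, W_er, W_msg):
    """Refine reference tokens f_r with a gated message from early tokens f_e.

    f_r: (N_r, C) flattened reference feature map
    f_e: (N_e, C) flattened early feature map
    W_re, W_er, W_msg: (C, C) linear projections (hypothetical names)
    """
    q = f_r @ W_re                                     # projected reference tokens
    k = f_e @ W_er                                     # projected early tokens
    affinity = softmax(q @ k.T / np.sqrt(q.shape[1]))  # (N_r, N_e), each row sums to 1
    message = affinity @ (f_e @ W_msg)                 # (N_r, C) message sent to f_r
    # Assumed gate: one scalar in (0, 1) per reference token.
    gate = 1.0 / (1.0 + np.exp(-(q * message).sum(axis=1, keepdims=True)))
    return f_r + gate * message, affinity


rng = np.random.default_rng(0)
C = 8
f_r = rng.normal(size=(4, C))    # e.g. a 2x2 reference map, flattened
f_e = rng.normal(size=(16, C))   # e.g. a 4x4 early map, flattened
W_re, W_er, W_msg = (rng.normal(scale=C ** -0.5, size=(C, C)) for _ in range(3))

refined, affinity = attention_gate(f_r, f_e, W_re, W_er, W_msg)
print(refined.shape, affinity.shape)  # (4, 8) (4, 16)
```

Because the affinity matrix relates every reference token to every early token, the message can carry information between positions that are far apart, which is the property the note attributes to the attention gate.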

4. Experiments

Evaluated on the NYU Depth V2 (indoor) and KITTI (outdoor) benchmarks
Improves over CNN-based methods on both depth estimation and surface normal prediction
The multi-scale attention gate is shown to be effective for both tasks
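The note does not list the evaluation metrics. As a hedged illustration, the sketch below computes three metrics commonly reported on these depth benchmarks: absolute relative error (AbsRel), RMSE, and the δ < 1.25 accuracy. The function name `depth_metrics`, the convention that a ground-truth depth of 0 marks a missing pixel, and the toy values are assumptions, not results from the paper.

```python
import math


def depth_metrics(pred, gt):
    """Compute AbsRel, RMSE and delta < 1.25 accuracy over valid pixels.

    `pred` and `gt` are equal-length sequences of depths in metres.
    Pixels with gt <= 0 are treated as missing and skipped (assumed convention);
    predictions are assumed to be strictly positive.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Fraction of pixels whose prediction is within a factor of 1.25 of the truth.
    delta1 = sum(max(p / g, g / p) < 1.25 for p, g in pairs) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


# Toy values for illustration only; the last pixel has no ground truth.
metrics = depth_metrics(pred=[1.0, 2.2, 4.0, 3.0], gt=[1.0, 2.0, 2.0, 0.0])
print(metrics)
```

Lower is better for AbsRel and RMSE, and higher is better for the δ accuracy.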

5. Conclusion & Limitation

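By applying a Transformer to monocular depth estimation for the first time, the paper addresses the limited receptive field problem. The unified attention gate structure fuses multi-scale information effectively. However, Transformer-based architectures require more computation and memory than CNNs, and real-time inference needs further optimization. The paper appeared when pure Transformer architectures such as ViT were just emerging, and it points in the direction pursued by Transformer-based depth work such as DPT.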
