oppenheimer1223's BLOG

[Paper] Unsupervised Monocular Depth Learning in Dynamic Scenes

1. Motivation: Supervised depth estimation requires large amounts of labeled data. SfM (Structure from Motion) models recover scene geometry from two views, but problems with texture, occlusion, and moving objects remain. For dynamic objects, some methods feed semantic information to an auxiliary network to learn motion, but it is unclear whether a semantic signal is really necessary. This paper tackles depth in dynamic scenes with unsupervised learning, using no semantic signal, no stereo, and no ground truth (GT). 2. Related Work: Monodepth2 excluded from the photometric loss the pixels that appear static because they move at the same velocity as the camera. Without this exclusion, a moving object can be inferred as far-away background (infinite depth), producing the “hole” problem. However, Monodepth2 resolves the issue only for that particular type of object motion. ...
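
The masking idea attributed to Monodepth2 above can be sketched roughly as follows. This is a simplified illustration, not the paper's code: it assumes per-pixel photometric errors are already computed and stored in flat lists, and the names `auto_mask` and `masked_photometric_loss` are made up for this example. A pixel is kept only when warping the source frame with the predicted depth and pose explains it better than leaving the source frame unwarped.

```python
def auto_mask(reproj_errors, identity_errors):
    """Return 1 for pixels kept in the loss and 0 for pixels masked out.

    reproj_errors: per-pixel error between the target frame and the source
        frame warped with the predicted depth and pose.
    identity_errors: per-pixel error between the target frame and the
        unwarped source frame.
    A pixel that looks the same without any warping (for example an object
    moving at the camera's velocity) has a small identity error and is dropped.
    """
    return [1 if r < i else 0 for r, i in zip(reproj_errors, identity_errors)]


def masked_photometric_loss(reproj_errors, identity_errors):
    """Average the reprojection error over the kept pixels only."""
    mask = auto_mask(reproj_errors, identity_errors)
    kept = sum(mask)
    if kept == 0:
        return 0.0  # nothing left to supervise in this toy example
    return sum(m * r for m, r in zip(mask, reproj_errors)) / kept


# Hypothetical errors for three pixels; the middle one barely changes
# between frames, so it is treated as moving with the camera.
reproj = [0.10, 0.30, 0.05]
identity = [0.40, 0.02, 0.50]
print(auto_mask(reproj, identity))  # [1, 0, 1]
```

The real method works on image tensors and several source frames; the list version here only shows the comparison that decides which pixels contribute to the loss.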

[Paper] Every Pixel Counts++: Joint Learning of Geometry and Motion with 3D Holistic Understanding

1. Motivation: An extension of Every Pixel Counts (EPC), this paper further develops the framework for jointly learning geometry and motion. The goal is a holistic understanding of depth, camera pose, optical flow, and the 3D motion of dynamic objects from monocular video. 2. Related Work: Earlier unsupervised monocular depth/motion learning methods either rely on a static-scene assumption or ignore dynamic objects by handling them with an explainability mask. The Holistic Motion Parser (HMP) proposed in Every Pixel Counts (EPC, ECCV 2018) was an attempt to overcome this, and EPC++ improves and extends it. ...

[Paper] Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation

1. Motivation: Self-supervised depth estimation based on photometric consistency holds under a static-scene assumption. Scenes with moving objects or non-rigid motion therefore need separate handling such as masking. The core problem of this paper is how to decouple camera motion from object motion, because approaches that estimate only depth and camera motion break down in dynamic environments. 2. Related Work: Depth + camera motion only: valid only for static scenes, with clear limits in dynamic environments. Segmentation mask guidance: some work guides object regions with segmentation masks, but this requires many GT labels. 3. Proposed Method: Attention-based Motion Module: the paper applies attention to the motion network and proposes a module that computes correlations indicating which regions move and by how much. It can be applied adaptively to background (BG) and foreground (FG) objects. ...

[Paper] Content-aware Unsupervised Deep Homography Estimation and Its Extensions

Motivation: Depth estimation has traditionally been solved through correspondence estimation, but this process has problems. Conventional methods fail on weakly textured or non-Lambertian surfaces. Deep-learning-based methods give inconsistent depth and do not properly reflect 3D information in their photometric consistency. This paper draws on NeRF to perform multi-view stereo depth estimation. Instead of optimizing correspondence estimation and corresponding-view depth reprojection, it directly optimizes the volume. → NeRF, however, suffers from the shape-radiance ambiguity; to address it, the paper proposes guiding NeRF training with depth priors. Related Work: Conventional methods fail on textureless, non-Lambertian surfaces. Learning-based methods give inconsistent depth, and the photometric loss does not reflect 3D information. Contribution: A proposed method for resolving depth consistency ...

[Paper] Self-Supervised Scale Recovery for Monocular Depth Estimation and Egomotion Estimation

1. Motivation: One fundamental limitation of self-supervised monocular depth estimation is scale ambiguity. The photometric reconstruction loss alone cannot recover metric scale, so the scale factor of the depth and egomotion predictions varies inconsistently across inputs. “A limitation of the photometric reconstruction loss is that it can only be used to train depth and egomotion networks that produce unscaled predictions. Furthermore, the predictions are scale inconsistent: different inputs produce depth and egomotion predictions with a varying scale factor.” ...
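
A small numeric check makes the scale ambiguity concrete. The sketch below assumes a simple pinhole camera and a pure translation between frames (no rotation, for brevity); the intrinsics and the function names `backproject`, `project`, and `reproject` are illustrative choices, not taken from the paper. Scaling depth and translation by the same factor leaves the reprojected pixel unchanged, so a loss that only compares reprojected pixels cannot tell the two solutions apart.

```python
import math

FOCAL, CX, CY = 500.0, 320.0, 240.0  # assumed pinhole intrinsics


def backproject(pixel, depth):
    """Lift a pixel (u, v) with a given depth to a 3D point in camera space."""
    u, v = pixel
    return ((u - CX) * depth / FOCAL, (v - CY) * depth / FOCAL, depth)


def project(point):
    """Project a 3D point back to pixel coordinates."""
    x, y, z = point
    return (FOCAL * x / z + CX, FOCAL * y / z + CY)


def reproject(pixel, depth, translation):
    """Where a pixel lands in the other view after a pure camera translation."""
    point = backproject(pixel, depth)
    moved = tuple(p + t for p, t in zip(point, translation))
    return project(moved)


pixel = (400.0, 300.0)
depth = 10.0
translation = (0.5, 0.0, 1.0)
scale = 3.0  # arbitrary scale factor applied to both depth and translation

original = reproject(pixel, depth, translation)
scaled = reproject(pixel, scale * depth, tuple(scale * t for t in translation))

# Both predictions explain the same pixel correspondence.
print(all(math.isclose(a, b) for a, b in zip(original, scaled)))
```

Adding a rotation does not change the conclusion, since rotation commutes with the uniform scaling of the 3D point and the translation.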

[Paper] Bidirectional Attention Network for Monocular Depth Estimation

1. Motivation: CNN-based monocular depth estimation struggles to capture global context because of its limited local receptive field. Reliable depth estimation requires understanding the global context of the whole scene, and attention mechanisms have drawn interest for this purpose. The paper argues that the key to depth estimation lies in the ability to produce an output at the same resolution as the input, and it introduces bidirectional attention into the progressive resolution-recovery process to perform refinement. 2. Related Work: Earlier work applying attention to depth estimation tried to capture global context in the encoder stage. Few attempts, however, have used attention for fine-grained refinement while recovering resolution in the decoder stage. ...

[Paper] Excavating the Potential Capacity of Self-Supervised Monocular Depth Estimation

1. Motivation: The paper explores how self-supervised learning (SSL) can get closer to supervised performance, and shows that SSL-based monocular depth estimation has untapped potential without additional external cost. 2. Related Work: To catch up with supervised performance, prior studies used auxiliary information such as segmentation, optical flow, and depth normals. These approaches, however, clearly run counter to the idea of SSL, since they use pseudo labels or require additional plug-in networks. This paper focuses on maximizing the latent capacity within the existing SSL framework, without such external dependencies. ...

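[Paper] Video Object Segmentation with Compressed Video

Motivation: How can segmentation inference be made fast using only the information in a compressed video codec? Related Works: Existing VOS methods are accurate but slow. Efficient methods have been proposed, but they trade off accuracy. Optical-flow-based methods are too expensive and can only look at two views. Contribution: A network design that warps and propagates segmentation masks from keyframes to other frames in a bidirectional, multi-hop manner. A soft propagation module that takes inaccurate, block-wise motion vectors as input, removes their noise, and enables accurate warping → a motion vector warping module is proposed ...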
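
To make the propagation step concrete, here is a naive hard warp of a keyframe mask using block-wise motion vectors. It is only a baseline sketch under stated assumptions, not the paper's soft propagation module: the function name `warp_mask_with_block_mvs`, the `(dy, dx)` convention (each block points to where its content sits in the keyframe), and the border clamping are all choices made for this example.

```python
def warp_mask_with_block_mvs(key_mask, motion_vectors, block=2):
    """Warp a keyframe mask to another frame with one motion vector per block.

    key_mask: 2D list of 0/1 labels for the keyframe.
    motion_vectors: 2D list with one (dy, dx) pair per block; every pixel in
        a block reads the keyframe label at (y + dy, x + dx).
    block: block size in pixels (assumed square).
    """
    height, width = len(key_mask), len(key_mask[0])
    warped = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dy, dx = motion_vectors[y // block][x // block]
            # Clamp to the image border so out-of-frame reads stay valid.
            src_y = min(max(y + dy, 0), height - 1)
            src_x = min(max(x + dx, 0), width - 1)
            warped[y][x] = key_mask[src_y][src_x]
    return warped


key_mask = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# Every 2x2 block reads from one pixel to its left, so the object
# appears shifted one pixel to the right in the warped mask.
motion_vectors = [[(0, -1), (0, -1)], [(0, -1), (0, -1)]]
for row in warp_mask_with_block_mvs(key_mask, motion_vectors):
    print(row)
```

Because every pixel in a block shares one vector, real codec motion vectors give blocky and sometimes wrong warps, which is the noise the excerpt says the proposed module is meant to remove.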

[Paper] Mono-SF: Multi-View Geometry Meets Single-View Depth for Monocular Scene Flow Estimation of Dynamic Traffic

1. Motivation: 3D scene flow estimates 3D geometry (structure) and 3D motion at the same time. Mono-SF proposes a method for estimating 3D structure and motion from monocular images. The key idea is to combine multi-view geometry principles with single-view depth estimation. 2. Related Work: Limitations of existing methods: They assume that moving objects are attached to their surroundings (static), so they fail to handle dynamic objects properly. Single-view depth estimation and multi-view geometry have been treated as two separate tasks. They are fundamentally limited to static scenes. The paper aims to overcome these limitations with a unified framework that estimates 3D scene flow in dynamic traffic scenes from a monocular camera. ...

[Paper] Exploiting Temporal Consistency for Real-time Video Depth Estimation

1. Motivation: The goal is real-time depth estimation from monocular video. Existing monocular depth estimation methods process each frame independently, so they lack temporal consistency and the depth results flicker between frames. This paper uses an LSTM to accumulate temporal information, improving temporal consistency while achieving real-time processing speed. 2. Related Work: Monocular depth estimation divides broadly into supervised and self-supervised approaches. Some studies have exploited temporal information in video input, but few achieve both real-time processing and temporal consistency. LSTM-based recurrent structures are effective at learning temporal dependencies in sequence data. ...