[Paper] Efficient Video Instance Segmentation via Tracklet Query and Proposal

Motivation

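Video Instance Segmentation (VIS) aims to classify, segment, and track object instances in a video simultaneously. Clip-level VIS outperforms frame-level VIS, but it does not run in real time. VisTR tried to address this, but its training time is long. It also relies heavily on hand-crafted data association, which makes it inefficient.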

Frame-level VIS
1. Follows the tracking-by-segmentation approach
2. Requires a complex data association algorithm
3. Limited in how much temporal context it can extract
4. Cannot handle object occlusion
Clip-level VIS
1. Performs segmentation and tracking clip by clip
2. Can extract longer-range temporal context than frame-level VIS
3. However, it is slow and falls short of real-time speed

Contribution

EfficientVIS
EfficientVIS uses two concepts: the tracklet query and the tracklet proposal.
A tracklet query is an embedding for a target instance.
A tracklet proposal is a tube in space-time that localizes that instance across the clip.
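To make the two concepts concrete, the sketch below shows one possible way to hold them in memory. It is an illustration only, not the paper's implementation: the `Tracklet` class, the `make_tracklet` helper, the choice of one embedding per frame, the `[x1, y1, x2, y2]` box format, and the full-image initialization are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

Box = List[float]  # hypothetical box format: [x1, y1, x2, y2]


@dataclass
class Tracklet:
    """One target instance across a clip of T frames (illustrative only)."""
    query: List[List[float]]  # T embeddings, one per frame, each of length C
    proposal: List[Box]       # T boxes, one per frame; stacked, they form the tube


def make_tracklet(num_frames: int, dim: int, width: float, height: float) -> Tracklet:
    # Assumed initialization: zero embeddings and a box covering the whole image.
    query = [[0.0] * dim for _ in range(num_frames)]
    proposal = [[0.0, 0.0, width, height] for _ in range(num_frames)]
    return Tracklet(query=query, proposal=proposal)


tracklet = make_tracklet(num_frames=3, dim=4, width=640.0, height=360.0)
```

Keeping the query and the proposal in the same object reflects that both describe the same instance: the query says what it is, the proposal says where it is in each frame.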
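Factorised Temporo-Spatial Self-Attention (FTSA)
Uses attention so that the queries compute their correspondences with one another along the spatial and temporal dimensions.
The result expresses how strongly each query relates to a specific target instance in the video.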
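The factorisation can be sketched as two ordinary self-attention passes, one along time and one across instances. The code below is a minimal sketch under several assumptions that are not taken from the notes: single-head attention, identity projections instead of learned ones, no residual connections or normalization, and the temporal pass applied before the spatial pass. The names `self_attention` and `factorised_attention` are hypothetical.

```python
import math
from typing import List

Vec = List[float]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(tokens: List[Vec]) -> List[Vec]:
    """Scaled dot-product self-attention with identity projections."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out


def factorised_attention(queries: List[List[Vec]]) -> List[List[Vec]]:
    """queries[n][t] is the embedding of instance n in frame t."""
    num_inst, num_frames = len(queries), len(queries[0])
    # Temporal pass: each instance attends over its own frames.
    temporal = [self_attention(queries[n]) for n in range(num_inst)]
    # Spatial pass: within each frame, the instances attend to each other.
    result = [[None] * num_frames for _ in range(num_inst)]
    for t in range(num_frames):
        frame_out = self_attention([temporal[n][t] for n in range(num_inst)])
        for n in range(num_inst):
            result[n][t] = frame_out[n]
    return result


# Two instances, two frames, two-dimensional embeddings.
queries = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
updated = factorised_attention(queries)
```

Splitting the attention this way means each pass compares only T or N tokens at a time, rather than all N×T tokens jointly.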
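Temporal Dynamic Convolution (TDC)
The dynamic convolution weights w are generated from the query q computed by FTSA.
As a result, w carries the semantic signal unique to each instance.
This is the step where those weights are convolved with the feature map and the tracklet tube.
Why use TDC?
To aggregate information about the target instance from across the video clip.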
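The idea of generating convolution weights from a query and applying them over a tube can be sketched as follows. This is a deliberately simplified illustration, not the paper's layer: it assumes a 1x1 kernel, a hypothetical linear generator `gen` that maps a query to kernel weights, tube features already cropped into per-frame lists of pixel vectors, and plain averaging over all pixels and frames as the aggregation.

```python
from typing import List

Vec = List[float]
Mat = List[List[float]]


def matvec(m: Mat, v: Vec) -> Vec:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def generate_kernel(query: Vec, gen: List[Mat]) -> Mat:
    """Map a query of length C to a C_out x C_in kernel; gen[o] has shape C_in x C."""
    return [matvec(gen_o, query) for gen_o in gen]


def temporal_dynamic_conv(queries: List[Vec], tube: List[List[Vec]], gen: List[Mat]) -> Vec:
    """queries[t]: the query of one instance in frame t.
    tube[t]: feature vectors (one per pixel) inside the proposal box of frame t.
    Returns one vector averaged over every pixel of every frame in the tube."""
    total, count = None, 0
    for q, pixels in zip(queries, tube):
        kernel = generate_kernel(q, gen)  # instance-specific weights for this frame
        for feat in pixels:
            out = matvec(kernel, feat)    # 1x1 convolution at this pixel
            total = out if total is None else [a + b for a, b in zip(total, out)]
            count += 1
    return [x / count for x in total]


# With this generator and query, the generated kernel is the identity matrix.
gen = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
queries = [[1.0, 0.0], [1.0, 0.0]]
tube = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
aggregated = temporal_dynamic_conv(queries, tube, gen)
```

Because the kernel comes from the query rather than from fixed parameters, two instances looking at the same feature map are filtered differently, and pooling over all frames of the tube gathers evidence from the whole clip.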

Experiments

Omitted.

Conclusion

Omitted.
