Papers

[Paper] SqueezeLLM: Dense-and-Sparse Quantization

Paper Link
Sehoon Kim et al. (UC Berkeley)
Introduction: This paper proposes a pseudo-PTQ (post-training quantization) method that accounts for the weight distribution of LLMs and for outliers.
Motivation: Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. Deploying LLMs for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, or to use smaller and less performant models. The authors demonstrate that the main bottleneck for LLM inference is memory bandwidth, specifically for single-batch inference.
Contribution:
Sensitivity-based Non-Uniform Quantization: In LLaMA-7B, the weight distribution clearly shows a non-uniform pattern. In LLM quantization, distributing quantized values uniformly is sub-optimal because the weight distribution in neural networks is typically non-uniform. The main advantage of uniform quantization is fast and efficient reduced-precision computation, but this does not lead to end-to-end latency improvement in memory-bound LLM inference. Because quantization introduces errors or perturbations in each layer, the goal is to minimize the overall perturbation with respect to the final loss term, rather than focusing on individual layers. The method places the k-means centroids closer to the values that are more sensitive with respect to the final loss, rather than treating all weight values equally. A Taylor series expansion is used to analyze how the model output changes in response to perturbations in the parameters W.
Dense-and-Sparse Decomposition: A method to filter out outliers from the weight matrix W by performing a very simple, yet effective, decomposition of the weight matrix into a dense matrix (D) and a sparse matrix (S). The sparse part is calculated by computing the outliers in a given layer and taking them out of the weight matrix. The remainder is a dense matrix that can be quantized much more effectively thanks to its significantly reduced range of values; in some cases the reduction is more than 10×.
Experiments: A PTQ-like framework that enables lossless compression to ultra-low precisions, as low as 3-bit. Consistent performance improvement across different model sizes compared to GPTQ and AWQ.
Conclusion: Nope
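The two ideas in the SqueezeLLM summary above can be illustrated with a small standard-library sketch. It is not the authors' implementation: `split_dense_sparse`, `weighted_kmeans_1d`, the magnitude-based outlier rule, the `outlier_fraction` value, and the per-weight `sensitivities` scores are all hypothetical choices made for illustration. The summary only says that sensitivity comes from a Taylor expansion of the loss, so the scores here are simply passed in.

```python
def split_dense_sparse(weights, outlier_fraction=0.1):
    """Split a flat weight list into a dense list and a sparse outlier dict.

    Assumption: outliers are the largest-magnitude entries. They are moved to
    `sparse` and zeroed in `dense`, so dense[i] + sparse.get(i, 0.0) == weights[i].
    """
    n_outliers = max(1, int(len(weights) * outlier_fraction))
    by_magnitude = sorted(range(len(weights)),
                          key=lambda i: abs(weights[i]), reverse=True)
    outlier_idx = set(by_magnitude[:n_outliers])
    sparse = {i: weights[i] for i in outlier_idx}
    dense = [0.0 if i in outlier_idx else w for i, w in enumerate(weights)]
    return dense, sparse


def weighted_kmeans_1d(values, sensitivities, k, n_iter=20):
    """1-D k-means (k >= 2) where each value is weighted by its sensitivity.

    Centroids become sensitivity-weighted means, so they drift toward the
    values whose perturbation is assumed to matter most for the loss.
    """
    lo, hi = min(values), max(values)
    # Start from an evenly spaced grid, i.e. the uniform-quantization levels.
    centroids = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    for _ in range(n_iter):
        sums = [0.0] * k
        mass = [0.0] * k
        for v, s in zip(values, sensitivities):
            j = min(range(k), key=lambda c: (v - centroids[c]) ** 2)
            sums[j] += s * v
            mass[j] += s
        # Keep a centroid unchanged if no value was assigned to it.
        centroids = [sums[j] / mass[j] if mass[j] > 0 else centroids[j]
                     for j in range(k)]
    return centroids


weights = [0.1, -0.2, 0.05, 4.0, 0.15, -0.1, -3.5, 0.0, 0.2, -0.05]
dense, sparse = split_dense_sparse(weights, outlier_fraction=0.2)
# The two large values (indices 3 and 6) go to the sparse part, so the
# dense part spans a much narrower range than the original weights.
uniform_sensitivity = [1.0] * len(dense)
codebook = weighted_kmeans_1d(dense, uniform_sensitivity, k=4)  # 2-bit codebook
```

With uniform sensitivities this reduces to ordinary k-means; raising the score of a weight pulls the nearest centroid toward it.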

[Paper] Content-aware Unsupervised Deep Homography Estimation and Its Extensions

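Motivation: Homography is fundamental to stereo vision. When the camera motion is approximately a pure rotation, or the scene is close to a planar surface, the relation between two images can be approximated by a homography matrix. When the scene satisfies these constraints, the homography can be computed directly. The paper develops a semantic-aware, robust deep learning algorithm for homography estimation.
Related Works: Omitted.
Contribution: The two images are fed through an encoder with a ResNet-34 backbone to estimate a 3x3, 8-DoF homography matrix. A triplet loss is used for homography estimation. If the estimated homography is perfect, warping with it should work well, so the warped features or image should align well with the target features or image. I do not fully understand the second loss term. A regularizer is added so that composing the a→b homography with the b→a homography gives the identity.
Content-aware prob map: The network that outputs the mask expresses which regions of the image feature map are important. This acts much like attention, and has a similar effect to removing outliers in advance with RANSAC. ...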
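The identity regularizer mentioned above (a→b composed with b→a should give the identity) can be sketched with plain 3x3 lists. This is a minimal illustration, not the paper's loss: the function names, the scale normalization by the bottom-right entry, and the squared Frobenius distance are assumptions.

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def normalize(h):
    """Scale a homography so that h[2][2] == 1 (it is defined up to scale).

    Assumes h[2][2] is non-zero.
    """
    s = h[2][2]
    return [[v / s for v in row] for row in h]


def inverse_consistency_penalty(h_ab, h_ba):
    """Squared Frobenius distance between normalize(H_ab @ H_ba) and I."""
    p = normalize(matmul3(h_ab, h_ba))
    return sum((p[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(3) for j in range(3))


# A pure translation and its exact inverse give zero penalty.
h_ab = [[1.0, 0.0, 2.0], [0.0, 1.0, -3.0], [0.0, 0.0, 1.0]]
h_ba = [[1.0, 0.0, -2.0], [0.0, 1.0, 3.0], [0.0, 0.0, 1.0]]
penalty = inverse_consistency_penalty(h_ab, h_ba)
```

The penalty grows as the two estimated homographies stop being inverses of each other, which is what makes it usable as a regularizer.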

[Paper] Rethinking the Augmentation Module in Contrastive Learning

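Motivation: Contrastive learning (CL) depends heavily on data augmentation (DA). Hand-crafted DA has the following drawbacks. Heuristic combinations of augmentations impose one specific kind of representation invariance. Strong augmentations impose too much invariance, which makes them unsuitable for fine-grained downstream tasks. The paper therefore introduces a DA methodology framed around two questions: where, and what?
Related Work: Omitted.
Contribution: Various combinations of augmentation modules are used. The Siamese structure is split into multiple stages by depth, and the features from each stage are used for CL.
Hierarchical augmentation invariance ...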

[Paper] Self-Supervised Video Representation Learning with Motion-Contrastive Perception

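Motivation: CL and certain pretext tasks tend to focus on unimportant background regions in video. Videos contain motion, so a learning method that focuses on this motion is needed.
Related Work: Pretext task methods include spatial learning, which learns geometric information; temporal learning, which learns clip order; and spatiotemporal learning, which learns space-time information. The drawback of these methods is that video contains a lot of redundant information, which leads to unnecessary learning. Static or irrelevant background information can impair the model's judgment, and the background can lower the model's understanding of the video. Optical flow is used to compensate for this, but it is expensive. Residual frames can be used as a cheaper alternative → they provide more specific information. ...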
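Residual frames, mentioned above as a cheap substitute for optical flow, are just differences between consecutive frames. The sketch below assumes grayscale frames stored as nested lists; the function name and the toy data are hypothetical.

```python
def residual_frames(frames):
    """Return the per-pixel difference between each pair of consecutive frames.

    `frames` is a list of equally sized 2-D grids (lists of rows).
    The result has len(frames) - 1 grids.
    """
    return [
        [[b - a for a, b in zip(row_prev, row_next)]
         for row_prev, row_next in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


# A bright pixel (9) moves right across a static background (5).
frames = [
    [[9, 5, 5], [5, 5, 5]],
    [[5, 9, 5], [5, 5, 5]],
    [[5, 5, 9], [5, 5, 5]],
]
residuals = residual_frames(frames)
# Static background pixels become 0; only the moving pixel leaves a trace.
```

Because the static background cancels out, the non-zero entries mark where motion happened, without any flow estimation.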

[Paper] Deep Video Prior for Consistency and Propagation

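Motivation: To resolve temporal inconsistency between video frames, the paper proposes imposing a Deep Video Prior (DVP) implicitly on a DNN. What is DVP? Performance fluctuates heavily on multimodal tasks that use video → this is addressed with a strategy that iteratively reassigns importance.
Related Work: Previous video work required large, curated video datasets. Information such as optical flow, or simply comparing similarity between frames, is not suitable for long-term video. Previous video work also struggled to perform consistently well across multimodal tasks.
Contribution: What is DVP? DVP refers to the properties used to implicitly impose video consistency in video processing. ...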

[Paper] Learning Optical Flow, Depth, and Scene Flow without Real-World Labels

Motivation: Solving for depth and scene flow jointly is an ill-posed problem with many possible solutions. The paper first estimates optical flow and computes an initial depth from it together with the known pose. It then proposes a pipeline that refines the scene flow and depth. (In other words, doing this in one stage is hard, so the authors try two stages.)
Related Works: Video-based self-supervised learning (SSL) for 3D perception falls into the following four tasks: ego-motion estimation; monocular depth estimation → suffers from scale ambiguity and the static-scene assumption; optical flow estimation; and scene flow estimation → cannot recover scene flow from an optical flow estimator (indirect estimation) and relies on stereo at training time. All of these perform well on their individual tasks but suffer from scale ambiguity. These works are related to ego-motion and flow estimation, but they ultimately need stereo datasets to resolve the reprojection ambiguity. ...

[Paper] Masked Autoencoders Are Scalable Vision Learners

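Motivation: Can an autoencoder reconstruct an input image whose patches have been randomly masked? The design is an asymmetric encoder-decoder: the encoder takes only the visible patches, excluding the masked ones, and the decoder reconstructs the original image from the resulting latent vectors. The encoder is a standard ViT and the decoder is built from Transformer blocks.
Related Works: Masked autoencoders are a form of the more general denoising autoencoder. Improving representations by masking the input was pioneered by BERT, but autoencoding has not made comparable progress in vision. The authors ask: what makes masked autoencoding different between vision and language? Natural language is a human-generated signal that is highly semantic and information-dense. ...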
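The random patch masking described above can be sketched as a shuffle-and-split over patch indices. This is an illustrative sketch only: `random_masking`, the seed handling, and the example values (196 patches, a 0.75 mask ratio) are assumptions and do not come from the summary.

```python
import random


def random_masking(num_patches, mask_ratio, seed=0):
    """Randomly split patch indices into visible and masked sets.

    Only the visible indices would be given to the encoder; the masked
    ones are what the decoder has to reconstruct.
    """
    rng = random.Random(seed)          # local RNG so results are repeatable
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


# Illustrative values: a 14x14 patch grid with three quarters masked.
visible, masked = random_masking(num_patches=196, mask_ratio=0.75)
```

Since the encoder only sees `visible`, its cost scales with the kept fraction rather than with the full patch count, which is the point of the asymmetric design.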

[Paper] DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion

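1. Motivation: Using the multi-view frames obtained from a video stream, the paper proposes a way to propagate the scene geometry computed at previous time steps to the current frame. The key assumption of the paper is that the camera pose is known. If the pose is unknown, it can be learned with a pose network. Similar work existed before, but this paper explicitly uses geometry information to improve temporal consistency.
2. Related Work: Depth estimation based on Multi-View Stereo (MVS) builds a cost volume from images taken at multiple viewpoints and estimates depth from it. Existing video-based depth methods did not make sufficient use of geometry information from previous frames. This paper performs spatio-temporal fusion explicitly through a ConvLSTM. ...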

[Paper] Just Go with the Flow: Self-Supervised Scene Flow Estimation

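1. Motivation: In self-supervised learning for scene flow estimation, the central question is which loss function to use. Because ground-truth (GT) annotation is hard to obtain, an alternative to supervised learning is needed. This paper analyzes the strengths and weaknesses of several candidate loss functions and proposes a way to combine them.
2. Related Work:
Supervised Loss: Directly annotating GT scene flow is very difficult and expensive. Acquiring dense GT in real environments is close to impossible in practice.
Nearest Neighbor (NN) Loss: For the point cloud at time t and the point cloud at time t+1, find the nearest neighbor of each point, and estimate the flow that drives the distance between the two points to zero. ...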
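The nearest neighbor loss described above can be sketched with a brute-force search over small point lists. The function name, the use of squared distance, and the averaging are assumptions for illustration; a practical version would use a spatial index instead of the O(N·M) loop.

```python
def nearest_neighbor_loss(points_t, flow, points_t1):
    """Mean squared distance from each flowed point to its nearest neighbor.

    Each point at time t is moved by its predicted flow vector, then matched
    to the closest point in the cloud at time t+1.
    """
    total = 0.0
    for p, f in zip(points_t, flow):
        moved = [pi + fi for pi, fi in zip(p, f)]
        total += min(
            sum((m - q) ** 2 for m, q in zip(moved, target))
            for target in points_t1
        )
    return total / len(points_t)


points_t = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
points_t1 = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)]
good_flow = [(0.5, 0.0, 0.0), (0.5, 0.0, 0.0)]   # lands exactly on t+1 points
zero_flow = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]   # leaves every point 0.5 away

loss_good = nearest_neighbor_loss(points_t, good_flow, points_t1)
loss_zero = nearest_neighbor_loss(points_t, zero_flow, points_t1)
```

No ground-truth flow appears anywhere in the computation, which is why this kind of loss is usable for self-supervised training.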

[Paper] FlowNet3D++: Geometric Losses for Deep Scene Flow Estimation

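1. Motivation: A noted weakness of FlowNet3D is that it simply uses an Lp loss when estimating scene flow. Specifically, FlowNet3D does not account for the difference in direction between the predicted motion vector and the GT. This paper proposes two geometric losses to overcome this limitation: a Plane-to-Plane (P2P) loss inspired by the ICP (Iterative Closest Point) algorithm, and a motion vector alignment loss based on cosine similarity.
2. Related Work:
FlowNet3D: A scene flow estimation network that operates on 3D point clouds. It is trained with an L2 loss, but the directional alignment of motion vectors is not taken into account.
ICP (Iterative Closest Point): A classical algorithm for registering two point clouds. The Plane-to-Plane variant uses the normal vectors of the local surface.
3. Proposed Method:
Plane-to-Plane Loss: Extends the Point-to-Plane concept from ICP by constraining the predicted flow to agree with the local plane of the GT. This supplies geometric information that a plain Euclidean-distance loss misses. ...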
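The cosine-similarity alignment term mentioned above can be sketched as one minus the cosine between predicted and GT motion vectors, averaged over points. The function name, the `eps` guard, and the exact form of the term are assumptions; the summary only says the loss is based on cosine similarity.

```python
import math


def cosine_alignment_loss(pred_flow, gt_flow, eps=1e-8):
    """Mean of (1 - cosine similarity) between predicted and GT flow vectors.

    The value is near 0 when directions agree, 1 when they are orthogonal,
    and near 2 when they are opposite, regardless of vector length.
    """
    total = 0.0
    for p, g in zip(pred_flow, gt_flow):
        dot = sum(a * b for a, b in zip(p, g))
        norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        total += 1.0 - dot / (norms + eps)   # eps avoids division by zero
    return total / len(pred_flow)


gt = [(1.0, 0.0, 0.0)]
same_direction = cosine_alignment_loss([(2.0, 0.0, 0.0)], gt)   # wrong length, right direction
orthogonal = cosine_alignment_loss([(0.0, 1.0, 0.0)], gt)
opposite = cosine_alignment_loss([(-1.0, 0.0, 0.0)], gt)
```

The term ignores magnitude by design, so it complements a distance-based loss instead of replacing it: the distance loss penalizes length errors and this term penalizes direction errors.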