👋 Hello there

I am Hyeongcheol Geum. This site collects my everyday thoughts and technical write-ups on ML/AI. I am a man with many questions about the world, and you probably visited my blog for similar reasons. In that sense, I believe we could become good friends. Have a nice day!

[Paper] ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds

1. Motivation

CLAP (Contrastive Language-Audio Pre-training) models perform well on zero-shot audio classification (ZSAC), but they still trail standard supervised methods. There are three main reasons.

Limited access to large-scale audio-caption datasets: unlike CLIP, CLAP has not been trained on large-scale open-source audio-caption datasets, which limits how fully it can capture diverse audio-language interactions.

Weak generalization beyond training category labels: CLAP struggles to generalize past the specific category labels used in training. For example, a model trained on "Sound of a toothbrush" from AudioSet may not generalize correctly to a similar label such as "brushing teeth" in ESC50.

Limits of hand-written ZSAC prompts: current ZSAC setups rely on hand-written prompts that map directly to dataset category labels. Such prompts provide no context beyond the label itself.

2. Related Work

Since CLAP, several studies have tried to improve its performance. Wu et al. scaled CLAP to 630k audio-caption pairs, and Elizalde et al. scaled the data to 4.6M audio-caption pairs and also included speech samples in training. Ghosh et al. built CompA-CLAP from 660k pairs using only public-domain data. CLAP is also used as the audio or text backbone for a range of foundational audio-processing tasks, including text-to-audio generation, audio captioning, and audio chat models. ...
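To make the role of the prompt in ZSAC concrete, here is a minimal sketch of the scoring step: each label is represented by one or more prompt embeddings, and the audio clip is assigned to the label whose prompts are closest in cosine similarity. The 3-dimensional vectors are toy values standing in for real CLAP audio and text embeddings, and averaging several prompt embeddings per label is an illustrative assumption, not something the excerpt above specifies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def zero_shot_classify(audio_emb, prompt_embs_by_label):
    """Pick the label whose averaged prompt embedding is closest to the audio."""
    scores = {
        label: cosine(audio_emb, mean_vector(embs))
        for label, embs in prompt_embs_by_label.items()
    }
    return max(scores, key=scores.get), scores

# Toy embeddings (hypothetical); a real system would get these from the
# CLAP text encoder (prompts) and audio encoder (clip).
prompt_embs = {
    "brushing teeth": [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]],
    "dog bark": [[0.0, 0.2, 0.9], [0.1, 0.1, 1.0]],
}
audio = [0.7, 0.2, 0.1]

label, scores = zero_shot_classify(audio, prompt_embs)
print(label, scores)
```

Because the decision depends only on the prompt embeddings, a prompt that merely repeats the label gives the text side nothing more to work with than the label itself.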

[Paper] A Multi-Resolution Front-End for End-to-End Speech Anti-Spoofing

1. Motivation

In speech signal classification, the choice of time-frequency resolution has a significant effect on performance, yet it is unclear which resolution is most suitable. Speech classification for anti-spoofing in particular needs several time-frequency scales. Prior work operates at a fixed resolution, which risks losing information and can limit classification performance. This paper proposes a multi-resolution front-end to address the problem.

2. Related Work

Earlier studies have tried to improve classification performance with multi-resolution or multi-scale architectures. For example: ...

[Paper] MATPAC: Masked Latent Prediction and Classification for Self Supervised Audio Representation Learning

1. Motivation

Self-supervised learning (SSL) methods based on masked latent prediction have recently proven effective at encoding input data into powerful representations. However, transforming the learned latent space during training so that it extracts higher-level information may make it better suited to downstream classification tasks. This paper proposes MATPAC (MAsked latenT Prediction And Classification), a new method that combines two pretext tasks to improve audio representation learning. The first pretext task is masked latent prediction. The second is unsupervised classification, which uses the latent representations to match the probability distributions of the teacher and student models. ...

[Paper] Sparse Binarization for Fast Keyword Spotting

1. Motivation

With the growth of voice-driven devices and applications, keyword spotting (KWS) enables real-time speech recognition and improves privacy and bandwidth efficiency on edge devices. Edge devices are limited in memory and compute speed, so KWS models must be lightweight and optimized. This paper proposes SparkNet, a model based on Sparse Binarization, as a new approach to efficient and accurate KWS. SparkNet is 4x faster than existing state-of-the-art (SOTA) models while achieving higher accuracy, and it is also more robust in noisy conditions.

2. Related Work

Keyword Spotting (KWS)

KWS analyzes speech data in real time to detect specific words. Prior work has designed models optimized for edge devices using small CNNs, RNNs, or hybrid networks. The main techniques are quantization, pruning, and **1D depthwise separable convolution**.

3. Proposed Method

Method Overview

Sparse Binarization: learns a binarized representation that removes uninformative features from the input and keeps the information useful for prediction.
Model structure: SparkNet binarizes the input and passes it to a linear classifier, using **1D time-channel separable convolution** for efficient computation.

SparkNet Architecture

Input: Mel-frequency cepstral coefficients (MFCC), giving an input of size \(F \times T\).
Structure: four blocks of 1D depthwise separable convolution layers, with batch normalization and ReLU activations. The final output layer is a 1x1 convolution with a Tanh activation.
Output: mapped to 12 keyword classes, consisting of 10 target words, "Unknown", and "Silence".

Sparse Binarized Representation Learning

Training: a Gaussian-based relaxed Bernoulli distribution is used to binarize the input. During training, a regularization loss (\(L_{sparse}\)) is added to encourage sparse representations.
Effect: the spatio-temporal features of the input are kept compact, which reduces computation while preserving high accuracy.

Classification Learning

Objective: the binarized representation is average-pooled, and a single linear layer then predicts the target keyword.
Loss function: \(L = L_{sparse} + \lambda L_{ce}\), where \(L_{ce}\) is the cross-entropy loss.

4. Experiments

Experimental Setup

Datasets: Google Speech Commands version 1 (V1) and version 2 (V2), with 30 and 35 keyword classes respectively, made up of 1-second samples. Inputs are preprocessed with MFCC into 32 frequency bins.
Metrics: Top-1 accuracy and Multiply-Accumulate Operations (MACs).
Noise robustness: model performance is measured at various signal-to-noise ratios (SNR).

Results

Speed and accuracy: SparkNet is 4x faster than the SOTA model (BC-ResNet) with equal or higher accuracy.
SparkNet[C=32]: reaches 97.0% accuracy on the SC2 dataset, exceeding BC-ResNet.
Noise robustness: SparkNet is consistently more accurate than BC-ResNet across SNR levels.

Ablation Study

Component analysis: the binarization process (\(L_{sparse}\)) contributes the most to model performance. Adding an auxiliary classifier was shown experimentally to give no improvement.

5. Conclusion & Limitation

Conclusion

SparkNet is a KWS model that achieves efficiency and accuracy together and is optimized for edge devices. It is robust in noisy conditions and delivers high performance with less computation than existing models.

Limitation

The model is based on supervised learning and still needs to be extended to self-supervised learning. There is room for further optimization aimed at even smaller devices.

Related Works

BC-ResNet: a KWS model based on Broadcasted Residual Learning.
MatchboxNet: a KWS model using 1D time-channel separable convolutions.
TinySpeech: an attention-based model designed to be lightweight on edge devices.

Key References

Svirsky et al., "SG-VAD: Stochastic Gates Based Speech Activity Detection" (ICASSP 2023)
Kim et al., "Broadcasted Residual Learning for Efficient Keyword Spotting" (Interspeech 2021)
Majumdar et al., "MatchboxNet: 1D Time-Channel Separable CNN for Speech Commands Recognition" (2020)
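The combined objective \(L = L_{sparse} + \lambda L_{ce}\) can be sketched in a few lines. The post does not give the exact form of the Gaussian-based relaxed Bernoulli gate, so this sketch assumes the stochastic-gates style relaxation (a gate value of `clip(mu + noise, 0, 1)` with Gaussian noise, and a sparsity penalty equal to the probability that a gate is open). The function names, `sigma=0.5`, and the averaging over gates are all illustrative assumptions rather than SparkNet's actual implementation.

```python
import math
import random

def gate_sample(mu, sigma=0.5, rng=random):
    """Assumed relaxed Bernoulli gate: Gaussian noise around mu, clipped to [0, 1]."""
    return min(1.0, max(0.0, mu + rng.gauss(0.0, sigma)))

def gate_open_probability(mu, sigma=0.5):
    """P(mu + noise > 0) for Gaussian noise, i.e. the normal CDF at mu / sigma."""
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))

def sparse_loss(mus, sigma=0.5):
    """L_sparse: mean probability that a gate is open (lower means sparser)."""
    return sum(gate_open_probability(m, sigma) for m in mus) / len(mus)

def cross_entropy(logits, target):
    """L_ce for one example, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def total_loss(mus, logits, target, lam=1.0, sigma=0.5):
    """L = L_sparse + lambda * L_ce, matching the formula in the post."""
    return sparse_loss(mus, sigma) + lam * cross_entropy(logits, target)

# Example: 4 gate parameters and 12 output classes
# (10 target words, "Unknown", "Silence").
rng = random.Random(0)
mus = [-1.0, 0.2, 0.0, 1.5]
gates = [gate_sample(m, rng=rng) for m in mus]
logits = [0.0] * 12
logits[3] = 2.0
print(gates, total_loss(mus, logits, target=3, lam=1.0))
```

In this form, \(\lambda\) sets how much classification accuracy counts relative to sparsity: a larger value favors accuracy, a smaller one pushes more gates closed.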

[Paper] Keyword Transformer: A Self-Attention Model for Keyword Spotting

1. Motivation

The Transformer architecture has been used successfully not only in natural language processing but also in image processing, speech recognition, and other domains. In keyword spotting, however, Transformers have mostly been added on top of existing architectures such as CNNs or RNNs. To address this, the paper proposes the Keyword Transformer (KWT), a model that applies the Transformer directly to keyword spotting. KWT needs no separate pre-training or additional data, outperforms more complex hybrid architectures, and achieves state-of-the-art accuracy on the Google Speech Commands dataset. ...

[Paper] BEATS: Audio Pre-Training with Acoustic Tokenizers

1. Motivation

Self-supervised learning (SSL) has recently achieved strong results in language, vision, and speech, but the audio domain still relies mainly on reconstruction loss. Reconstruction loss focuses on reproducing low-level time-frequency features, so it fails to capture high-level semantic information properly. BEATS proposes a new framework that converts continuous audio data into discrete labels in order to learn high-level semantic information. This makes training more efficient and more semantics-oriented than earlier approaches.

2. Related Work

Audio pre-training falls broadly into supervised learning and self-supervised learning. ...

[Paper] Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting

1. Motivation

Keyword spotting (KWS) usually relies on hand-crafted features such as Log-Mel or MFCC.
There have been attempts to replace these features with learnable filterbanks, but without much success.
The paper argues that when the number of filterbank channels is reduced, a learned filterbank can maintain performance while greatly reducing energy consumption.
This matters especially for always-on, low-resource KWS systems.

2. Related Works

SincNet: a study that applied learnable filterbanks to CNN-based KWS. It lacked a direct comparison with hand-crafted features.
Earlier studies concluded that Log-Mel and MFCC were still superior.
This paper demonstrates that performance can improve when the number of filterbank channels is reduced.
Dropout is used to improve noise robustness and generalization.

3. Proposed Method

Filterbank learning: the STFT of the input signal is computed and then filtered by a filterbank layer. A learnable filterbank matrix \(W\) produces the filtered output \(Y\). Dropout improves generalization.
Energy saving: fewer filterbank channels means fewer multiplications → lower energy consumption.
Model structure: a CNN back-end with residual connections captures time-frequency patterns and detects whether a keyword is present.

4. Experiments

Dataset

Google Speech Commands Dataset.
Noise is added (car interior, cafe, etc.), with SNR ranging from -10 dB to 20 dB.

Results

Filterbank learning is more robust in noisy conditions, and the effect is stronger with dropout.
Log-Mel (40 channels) vs. learned filterbank (8 channels): accuracy drops by 3.5%, energy consumption is reduced 6.3x.
8 channels vs. 5 channels: accuracy is maintained, energy consumption is reduced 2x.
The learned filterbank also performs better in noisy conditions, including unseen noise.

5. Conclusion & Limitation

Conclusion

With fewer filterbank channels, learnable filterbanks outperform hand-crafted features.
Dropout contributes substantially to noise robustness and generalization.
The approach is especially useful in low-resource settings.

Limitations and Future Work

Further research is needed on filterbank design and on optimizing noise robustness.
The goal is better feature design.
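The filterbank layer described in the method section amounts to a matrix product between the spectrogram and a learnable matrix \(W\). The sketch below shows that product and counts the multiplications it needs, to illustrate why fewer channels cost less. The shapes (257 frequency bins, 100 frames), the function names, and the plain-list representation are illustrative assumptions. The count covers only the filterbank layer, so it is not expected to reproduce the 6.3x energy figure reported above, which concerns the system as a whole.

```python
def apply_filterbank(spectrogram, W):
    """Compute Y[t][k] = sum_f W[k][f] * X[t][f].

    spectrogram: list of T frames, each with F spectral magnitudes (from the STFT).
    W: list of K filters, each with F weights (the learnable filterbank matrix).
    Returns a list of T frames, each with K filterbank outputs.
    """
    return [
        [sum(w * x for w, x in zip(filt, frame)) for filt in W]
        for frame in spectrogram
    ]

def filterbank_mults(num_channels, num_bins, num_frames):
    """Number of multiplications used by the filterbank layer."""
    return num_channels * num_bins * num_frames

# Tiny example: 1 frame, 4 frequency bins, 2 filterbank channels.
X = [[1.0, 2.0, 4.0, 8.0]]
W = [
    [1.0, 0.0, 0.0, 0.0],   # passes bin 0 only
    [0.0, 0.5, 0.5, 0.0],   # averages bins 1 and 2
]
print(apply_filterbank(X, W))

# Hypothetical sizes: 257 bins (512-point FFT) and 100 frames per clip.
ratio = filterbank_mults(40, 257, 100) / filterbank_mults(8, 257, 100)
print(ratio)
```

During training the entries of `W` would be updated by gradient descent along with the CNN back-end. Here they are fixed so that the arithmetic is easy to follow.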

[Paper] Noise-Robust Keyword Spotting through Self-Supervised Pretraining

1. Motivation

Modern voice assistants are available on almost every computer and smart device.
Voice assistants use ASR (automatic speech recognition) models, but these are computationally expensive and hard to run on small devices.
Instead, a keyword spotting (KWS) algorithm activates the ASR model only when a specific keyword is spoken.
Current state-of-the-art KWS models are trained with supervised learning and therefore need large amounts of labeled data.
Self-supervised learning, which can make use of unlabeled data, is needed.

2. Related Works

Earlier work showed that pre-training a transformer-based KWS model with the Data2Vec framework improves performance.
That work assumed clean audio input only and did not consider real-world noise.
In ASR, noise robustness through self-supervised learning has been studied for some time.
In KWS, noise robustness has mostly been obtained with supervised approaches such as multi-style training or adversarial training.

3. Proposed Method

Three pre-training approaches based on the Data2Vec framework are proposed: ...

[Paper] Survey: Efficient Large Language Models

Introduction

This post is based on the survey "Efficient Multimodal Large Language Models" by Yizhang Jin et al. Since mid-to-late 2023, multimodal large language models (MLLMs) have moved beyond text and achieved remarkable results on visual understanding and reasoning tasks. Like LLMs, however, these models are very large and expensive to train and run, which has limited their wide adoption in academia and industry. As a result, there have been many attempts to develop efficient, lightweight MLLMs that meet the requirements of local devices, edge computing, and similar settings. This shift is progressing alongside smaller LLMs and better vision encoders. ...

[Paper] Speculative Decoding

Overview

This post summarizes two papers, based on a talk given by Taesu Kim of SqueezeBits. Every time an LLM generates a token, it has to load a very large number of weights, so DRAM bandwidth becomes the bottleneck. As a result, autoregressive decoding cannot fully utilize the GPU. Speculative Decoding is one way to address this. Instead of processing one prompt as a batch of one, Speculative Decoding feeds several predicted tokens back into the model at once and processes them in parallel, so the model handles multiple input sequences as a batch.

Speculative Decoding

This paper finds the best tokens with a simple implementation of Draft and Verification. A token that turns out to be unsuitable is rolled back, so handling this rejection well is what matters. To make better use of computational resources, the paper proposes the Speculative Sampling method. ...
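The excerpt is cut off before it describes how rejection works, so the following is a minimal sketch of the acceptance rule commonly used in the speculative sampling literature, not necessarily the exact procedure the post goes on to cover. A draft token `x` drawn from the draft distribution `q` is accepted with probability `min(1, p[x] / q[x])`, where `p` is the target model's distribution. On rejection, a replacement is drawn from the normalized positive part of `p - q`. The distributions and function names are toy values chosen for illustration.

```python
import random

def sample(dist, rng):
    """Draw an index from a discrete probability distribution."""
    r = rng.random()
    acc = 0.0
    for i, w in enumerate(dist):
        acc += w
        if r < acc:
            return i
    return len(dist) - 1  # guard against floating-point round-off

def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    # If p and q are identical there is nothing to correct; fall back to p.
    return [r / total for r in raw] if total > 0 else list(p)

def verify_draft_token(token, p, q, rng):
    """Verification step: return (accepted, final_token)."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return True, token
    return False, sample(residual_distribution(p, q), rng)

# Toy next-token distributions over a 3-token vocabulary (hypothetical values).
p = [0.6, 0.3, 0.1]   # target (large) model
q = [0.2, 0.5, 0.3]   # draft (small) model

rng = random.Random(0)
counts = [0, 0, 0]
trials = 20000
for _ in range(trials):
    draft = sample(q, rng)                       # Draft
    _, token = verify_draft_token(draft, p, q, rng)  # Verification
    counts[token] += 1

# The empirical frequencies should approximate p, not q.
print([c / trials for c in counts])
```

The rule is designed so that the final tokens follow the target distribution `p` exactly, while the draft model only affects how often a proposal is accepted and therefore how much work is saved.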