KV-Cache

Summary

KV-Cache is an optimization technique used in Large Language Models (LLMs): the model stores the key and value vectors it has already computed for earlier tokens and reuses them in later computation steps, which shortens inference time. This is especially useful when the model processes long sequences. However, the KV-Cache consumes a large amount of GPU memory, which can limit both model performance and the usable context size.
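
As a minimal sketch of the idea above, assuming a single attention head and plain Python lists in place of GPU tensors, the code below appends each new token's key and value vectors to a cache and reuses them on later steps. The names `KVCache`, `decode_step`, and `attend_without_cache`, as well as the weight matrices `Wq`, `Wk`, and `Wv`, are illustrative assumptions and do not come from any particular library.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]


def attend(q, keys, values):
    """Softmax attention of one query over a list of keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    """Hypothetical cache for one attention head of one layer."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step(x, Wq, Wk, Wv, cache):
    """Process one new token: project only x, reuse cached K/V for the rest."""
    cache.append(matvec(Wk, x), matvec(Wv, x))
    return attend(matvec(Wq, x), cache.keys, cache.values)


def attend_without_cache(xs, Wq, Wk, Wv):
    """Baseline: recompute K and V for every token seen so far."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)
```

Calling `decode_step` once per token gives the same output as `attend_without_cache` on the full prefix, but each step performs only one key projection and one value projection instead of one per previous token. The price is that `cache.keys` and `cache.values` grow by one entry per token, which is the memory cost described above.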

Key Concepts

  • Purpose of the KV-Cache: The KV-Cache stores the key and value vectors the model has already computed so that later computation steps can reuse them, which shortens inference time.

  • Structure of the KV-Cache: The KV-Cache holds the key and value vectors computed for each token, and a separate cache is needed for every layer and every attention head.

  • Size of the KV-Cache: The size of the KV-Cache depends on the model size and the sequence length, growing linearly with the number of cached tokens, and it can consume a large amount of GPU memory.

  • Optimization of the KV-Cache: Various techniques are used to optimize the KV-Cache; they can improve model performance and reduce memory usage.
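
The structure and size points above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes the cache keeps one key vector and one value vector per token, per layer, and per head, with no compression or sharing. The function name `kv_cache_bytes` and the example configuration (32 layers, 32 heads, head dimension 128, 16-bit values) are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV-Cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes 16-bit floating-point storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model configuration, not taken from the source.
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=1)
full_context = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

print(f"per token:    {per_token / 1024:.0f} KiB")
print(f"4096 tokens:  {full_context / 1024**3:.1f} GiB")
```

Under these assumptions every term enters the product linearly, so halving the bytes per element or the number of cached heads halves the cache, while doubling the context length or the batch size doubles it.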

References

  • Techniques for KV Cache Optimization: https://www.omrimallis.com/posts/techniques-for-kv-cache-optimization/

  • SqueezeAttention: 2D Management of KV-Cache in LLM Inference: https://arxiv.org/html/2404.04793v1

  • LLM Jargons Explained: Part 4 - KV Cache: https://www.youtube.com/watch?v=z07GStMex4w

  • How KV cache is valid in LLM transformer: https://www.reddit.com/r/MachineLearning/comments/1b0ob2m/d_how_kv_cache_is_valid_in_llm_transformer/

  • LLM profiling guides KV cache optimization: https://www.microsoft.com/en-us/research/blog/llm-profiling-guides-kv-cache-optimization/