Splitting in LLM

Summary

Splitting in LLM refers to the process of dividing text into smaller units so that an LLM can process and understand the information more effectively. There are several ways to split text, including sentence splitting (by sentence), max token splitting (by token count), and semantic chunking (by meaning). Each method has its own strengths and weaknesses, so choosing an appropriate chunking strategy is important.
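As a rough illustration of the first two strategies, the sketch below splits by sentence and by a maximum token count. The function names `split_sentences` and `split_max_tokens` are hypothetical, the sentence boundary rule is a naive regular expression, and whitespace-separated words stand in for real tokens; an actual pipeline would count tokens with the target model's tokenizer.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitting: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_max_tokens(text: str, max_tokens: int) -> list[str]:
    """Greedy splitting: each chunk holds at most max_tokens tokens.

    Assumption: a "token" here is a whitespace-separated word, which only
    approximates what an LLM tokenizer would produce.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be >= 1")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


text = "Chunking helps retrieval. Each strategy has trade-offs! Which one fits?"
print(split_sentences(text))      # one chunk per sentence
print(split_max_tokens(text, 4))  # chunks of at most 4 words
```

Sentence splitting keeps each chunk grammatically complete but gives uneven sizes, while the token-capped version gives predictable sizes at the cost of cutting through sentences.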

Key Concepts

  • Sentence Splitting : ํ…์ŠคํŠธ๋ฅผ ๋ฌธ์žฅ ๋‹จ์œ„๋กœ ๋‚˜๋ˆ„๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ, ๊ฐ ๋ฌธ์žฅ์ด ํ•˜๋‚˜์˜ chunk๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

  • Max Token Splitting : ํ…์ŠคํŠธ๋ฅผ ํ† ํฐ ์ˆ˜์— ๋”ฐ๋ผ ๋‚˜๋ˆ„๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ, ๊ฐ chunk๋Š” ์ตœ๋Œ€ ํ† ํฐ ์ˆ˜๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค.

  • Semantic Chunking : ํ…์ŠคํŠธ๋ฅผ ์˜๋ฏธ์— ๋”ฐ๋ผ ๋‚˜๋ˆ„๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ, ๊ฐ chunk๋Š” ์˜๋ฏธ์ ์œผ๋กœ ๊ด€๋ จ๋œ ์ •๋ณด๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

  • Token-based Splitting : ํ† ํฐ ๋‹จ์œ„๋กœ ๋‚˜๋ˆ„๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ, LLM์˜ context window์— ๋งž์ถ”์–ด chunk๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

  • Context-aware Splitting : ๋ฌธ์„œ์˜ ๊ตฌ์กฐ์™€ ๊ณ„์ธต์„ ๊ณ ๋ คํ•˜์—ฌ chunk๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ, header ์ •๋ณด๋ฅผ ๋ณด์กดํ•ฉ๋‹ˆ๋‹ค.