#llm-serving · IQ Lab

AI · 2026.05.03 · 13 min · Advanced · Efficient ML Deep Dive · 7

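From resolving KV cache fragmentation to compiling for mobile NPUs, this article traces the design philosophy behind PagedAttention, Speculative Decoding, Continuous Batching, and Edge Deployment, the techniques that make LLM inference practical.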