Nvidia says it can shrink LLM memory 20x without changing model weights

Nvidia researchers have introduced a technique that dramatically reduces how much memory large language models need to track conversation history, by as much as 20x, without modifying the model itself. The method, called KV Cache Transform Coding (KVTC), applies ideas from media compression formats such as JPEG to shrink the key-value cache behind multi-turn AI systems, lowering GPU memory demands and speeding up time-to-first-token by up to 8x. For enterprise AI applications that rely on agen...
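The article does not detail KVTC's internals, but the JPEG analogy can be illustrated with a minimal transform-coding sketch: apply an orthonormal transform (here a DCT, as in JPEG) along the feature dimension of a KV-cache-shaped tensor, keep only the largest coefficients, and store them at low precision. All function names and parameters below are hypothetical and chosen for illustration; this is not Nvidia's actual implementation.

```python
import numpy as np

def dct_matrix(d):
    # Orthonormal DCT-II basis; rows are basis vectors, so M @ M.T == I.
    k = np.arange(d)
    M = np.sqrt(2.0 / d) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * d))
    M[0] /= np.sqrt(2.0)
    return M

def compress_kv(cache, keep_frac=0.25, step=0.05):
    """Transform-code a (tokens, d) cache slice: DCT, sparsify, quantize."""
    d = cache.shape[-1]
    M = dct_matrix(d)
    coefs = cache @ M.T  # decorrelate along the feature dimension
    # Keep only the largest-magnitude coefficients per token (JPEG-style
    # energy compaction: most signal lands in a few coefficients).
    k = int(d * keep_frac)
    thresh = np.partition(np.abs(coefs), -k, axis=-1)[..., -k][..., None]
    sparse = np.where(np.abs(coefs) >= thresh, coefs, 0.0)
    # Uniform scalar quantization to a compact integer code.
    q = np.round(sparse / step).astype(np.int16)
    return q, M, step

def decompress_kv(q, M, step):
    # Dequantize and invert the orthonormal transform.
    return (q.astype(np.float32) * step) @ M

# Usage: a smooth synthetic "cache" compresses well under this scheme.
rng = np.random.default_rng(0)
cache = np.cumsum(rng.standard_normal((16, 64)).astype(np.float32), axis=1)
q, M, step = compress_kv(cache)
rec = decompress_kv(q, M, step)
rel_err = np.linalg.norm(rec - cache) / np.linalg.norm(cache)
```

Because the transform is orthonormal, the only losses come from the dropped small coefficients and the quantization step, which is the same trade-off JPEG makes between file size and image fidelity.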

📰 Original Source

Read full article at Venturebeat →