Over the past year, there have been many new experiments built on the existing RAG architecture, CRAG and Self-RAG to name a few. Here we take a look at one such architecture that is gaining interest: Cache-Augmented Generation (CAG).

The main difference between RAG and CAG:

  • RAG relies on an external knowledge base (e.g. an in-memory DB, a vector DB) that is searched at query time.
  • CAG bypasses retrieval by precomputing the KV (key/value) representations for the entire knowledge base and storing them in a KV cache. The whole knowledge base is then available to the model during inference.

Traditional RAG

  1. A query is submitted
  2. Relevant documents are retrieved from a vector DB
  3. The retrieved documents are appended to the query as context (see the sketch below)
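A minimal sketch of this pipeline. The toy keyword-overlap retriever, the documents, and the query are made up for illustration; a real system would embed the documents and search them in a vector DB such as FAISS or pgvector.

```python
def retrieve(query, docs, k=2):
    """Toy retriever: rank documents by word overlap with the query.
    Stands in for an embedding search against a vector DB."""
    q_words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

docs = [
    "Acme Corp was founded in 1999.",
    "Acme Corp is headquartered in Oslo.",
    "Winters in Oslo are long and cold.",
]
query = "Where is Acme Corp headquartered?"

# Steps 2-3: retrieve the relevant subset and append it to the query as context.
context = "\n".join(retrieve(query, docs))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # this prompt is what gets sent to the LLM
```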

Characteristics:

  • Retrieves only the subset of documents relevant to the query
  • The retrieved documents plus the query must fit within the model’s context window

Problems:

  • Retrieval Latency
  • Retrieval Inaccuracies
  • Context length limits

CAG

  1. At index time, the knowledge base is tokenized and the KV values for every token are precomputed and stored in the KV cache
  2. At query time, the query tokens attend to the ENTIRE knowledge base’s KV values, which are read from the cache instead of being recomputed (see the sketch below)
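A minimal sketch of this flow with a Hugging Face decoder-only model. The model name "gpt2" is just a stand-in, the knowledge text and query are invented, and the greedy decoding loop is an illustrative assumption rather than the reference CAG implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # any decoder-only model works the same way
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

knowledge = "Acme Corp was founded in 1999. Acme Corp is headquartered in Oslo."
query = "\nQuestion: Where is Acme Corp headquartered?\nAnswer:"

# Step 1 (index time): run the knowledge base through the model once and keep the KV cache.
# In a real setup this cache would be computed once, persisted, and reused across queries.
kb_ids = tok(knowledge, return_tensors="pt").input_ids
with torch.no_grad():
    kv_cache = model(kb_ids, use_cache=True).past_key_values

# Step 2 (query time): feed only the query tokens; attention also covers the cached KV states,
# so the whole knowledge base is "visible" without retrieving or re-encoding it.
ids = tok(query, return_tensors="pt").input_ids
answer_ids = []
with torch.no_grad():
    for _ in range(20):                                # simple greedy decode
        out = model(ids, past_key_values=kv_cache, use_cache=True)
        kv_cache = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        answer_ids.append(next_id)
        ids = next_id                                  # only the newest token is fed next step

print(tok.decode(torch.cat(answer_ids, dim=-1)[0]))
```

One caveat: the cache is extended in place by the query and answer tokens, so for a second query you would reset or truncate it back to the knowledge-base portion.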

Characteristics

  1. Precomputed knowledge base (the KV values of every indexed token are calculated up front)
  2. This is unique because it bypasses traditional context length limits: the entire knowledge base is accessible via the KV cache rather than the prompt, effectively letting the model draw on more knowledge than its context window would otherwise hold.

Benefits

  1. No Retrieval Latency
  2. No Retrieval Error (since all knowledge is preloaded), but low-quality or off-topic data in the knowledge base can degrade results
  3. CAG indirectly bypasses model context length limits, as the precomputed KV cache stores the entire knowledge base, allowing the model to attend to it without appending raw text to the query

Problems

  1. High compute cost, since attention must process the query against the ENTIRE precomputed KV cache of the knowledge base
  2. High memory cost, since precomputing and storing KV caches for large knowledge bases requires significant storage and memory (see the rough estimate below)
  3. No filtering at inference time, so erroneous or outlier data in the knowledge base can degrade the quality of the generated response
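To make the memory point concrete, here is a back-of-the-envelope estimate. The layer, head, and width figures are assumptions roughly matching an 8B-class model with grouped-query attention, with the cache stored in fp16.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys + values, for every layer and every cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=8, head_dim=128)          # ~131 kB per token
full_kb   = kv_cache_bytes(1_000_000, n_layers=32, n_kv_heads=8, head_dim=128)  # 1M-token knowledge base
print(f"{per_token / 1e3:.0f} kB per token, ~{full_kb / 1e9:.0f} GB for a 1M-token knowledge base")
```

The compute cost scales the same way: every generated token has to attend over all of those cached positions.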

Comparison

Feature | Traditional RAG | Cache-Augmented Generation (CAG)
Knowledge Access | Real-time retrieval from a knowledge base. | Preloaded KV cache for the entire knowledge base.
Context Length | Limited to query + retrieved documents. | Full knowledge base is accessible via the KV cache.
Latency | Includes retrieval time. | No retrieval latency.
Accuracy | Depends on retrieval quality. | Holistic access to precomputed knowledge.
Compute Cost | Lower (limited context). | Higher (query attends to entire knowledge base).
Memory Use | Minimal (retrieved subset only). | Higher (KV cache for all documents).

Personal opinion

In theory,

  • RAG is best for large, dynamic knowledge bases.
  • CAG is ideal for tasks with small, static, relevant knowledge bases.

My concern, however:

In practice, though, I believe precomputing KV values isn’t straightforward for all model architectures. For models like GPT, KV values are not modular by default.

To enable precomputed KV values, the model architecture also has to be modified to:

  • Allow preloading a static KV representation from an external source (in this case, the KV cache).
  • Change the attention mechanism so that it attends to both the dynamically generated query values and the preloaded KV values.

So custom integration would be required for every model… and of course there are the memory and compute challenges if your knowledge base becomes too large. At some point it would make sense to switch back to traditional RAG (a rough way to estimate that crossover point is sketched below).
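One way to operationalize that crossover is a simple memory-budget heuristic. The model-shape numbers are the same assumed 8B-class values used in the estimate above, and the 50% budget fraction is an arbitrary illustrative choice.

```python
def kv_cache_gb(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV-cache footprint in GB, using the same rough formula as before."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens / 1e9

def choose_strategy(n_kb_tokens, gpu_memory_gb, budget_fraction=0.5):
    """Prefer CAG while the precomputed cache fits within a fraction of GPU memory,
    otherwise fall back to traditional RAG."""
    return "CAG" if kv_cache_gb(n_kb_tokens) <= budget_fraction * gpu_memory_gb else "RAG"

print(choose_strategy(50_000, gpu_memory_gb=80))      # small, static knowledge base -> CAG
print(choose_strategy(5_000_000, gpu_memory_gb=80))   # very large knowledge base    -> RAG
```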
