Do you need a vector database for AI on your docs?
You need a vector database only when your knowledge base is too big to fit in a model's context window, changes many times a day, or needs per-user access control. If it fits in a long-context window and changes rarely, putting the whole corpus in a cached prompt is cheaper, more accurate, and less to maintain than RAG.
You need a vector database only when your knowledge base is too large to fit in a model's context window, it changes many times a day, or different users are allowed to see different parts of it. If your documents fit in a long-context window and change rarely, the simpler build is to put the whole corpus in the prompt and cache it. As of June 2026 that path is cheaper, more accurate, and far less to maintain than retrieval-augmented generation for most small knowledge bases. The reflex to reach for a vector database first is a holdover from when context windows were tiny and long prompts were expensive. Both of those facts changed.
The reason this is worth saying plainly is that almost every "put AI on our documents" tutorial opens by spinning up a vector store, chunking the corpus, and wiring a retrieval step. That was the right default in 2023. It is the wrong default for a 100-page handbook in 2026, and following it anyway buys you a system that is harder to debug and easier to get subtly wrong.
"AI on our docs" is really two builds
The request is always the same: a bot that answers from our handbook, our product docs, our policies, or our past support tickets. There are two ways to build it, and the whole decision is which one you pick.
The first is prompt-stuffing. You load the documents the model needs into its context with each request and ask the question. The model reads everything and answers. The second is RAG. You embed the corpus into a vector database ahead of time, and at query time you retrieve the handful of chunks most similar to the question, then put only those chunks in the prompt.
RAG became the default for a real reason. Models used to hold a few thousand tokens of context. You could not fit a handbook, so you had to retrieve the relevant slice. That constraint is mostly gone, and a lot of advice has not caught up.
When the whole corpus fits, put it in the prompt
Start by measuring, not guessing. English text runs about 750 words to 1,000 tokens, so a 100-page handbook of roughly 50,000 words is about 65,000 tokens. A 200,000 token window holds around 500 pages. A 1 million token window holds a few thousand. Paste your real documents into a token counter and you will usually find the corpus is smaller than you feared.
The economics shifted in March 2026. Anthropic made the 1 million token context window generally available for Claude Opus 4.6 and Sonnet 4.6 and, more importantly, removed the long-context surcharge. Before that change, Sonnet input pricing roughly doubled, from about $3 to about $6 per million tokens, once a prompt crossed the long-context threshold. Now the full 1 million token window runs at the flat standard rate. A 900,000 token request costs the same per token as a 9,000 token one.
Then there is caching, which is the part that makes prompt-stuffing genuinely cheap. Anthropic's prompt caching charges a cached read at 0.1x the normal input rate, with a write costing 1.25x for the 5-minute cache or 2x for the 1-hour cache. Run the numbers on that 65,000 token corpus at Sonnet's $3 per million input rate:
- No caching: 65,000 tokens at $3 per million is about $0.195 per query, just for the documents.
- Cache write, once: about $0.24 to load the corpus into the cache.
- Cached read, every query after: about $0.02 per query.
Two cents a query to have the model read your entire knowledge base on every single question, with no chunk that could be missed and no retrieval step to misfire. The catch is traffic density. The 5-minute cache only stays warm if queries keep arriving inside that window. If your bot gets a question every twenty minutes, the cache expires between them and you re-pay the write each time. The 1-hour cache option exists for exactly that steady-but-sparse pattern. For a busy support bot the cache stays hot and the per-query cost holds.
The decision rule
Here is the rule we actually use when a client asks for AI over their documents. Read it by what the knowledge base looks like, not by what the tutorials assume.
| Knowledge base | How often it changes | Right build |
|---|---|---|
| Under ~500 pages (fits a standard window) | Any | Prompt-stuff with caching |
| ~500 to a few thousand pages (fits a 1M window) | Rarely | Prompt-stuff the 1M window with caching |
| Fits the window | Many times a day | Caching efficiency drops; consider RAG or a hybrid |
| Larger than the context window | Any | RAG, because the corpus cannot fit |
| Any size | Different users may see different data | RAG, so you can filter at retrieval |
| Any size | You need exact citations to a source chunk | RAG or a hybrid |
Most small-business cases land in the first two rows. The corpus is a handbook, a product catalog, a policy set, or a fixed body of past answers, and it fits. The bottom three rows are where RAG stops being optional, and they are about scale, access control, and attribution, not about whether AI can "read documents."
What premature RAG actually costs
Reaching for a vector database before you need one is not free. It adds failure modes that prompt-stuffing simply does not have, and we have watched each of these break a real build.
Chunking splits the answer. You chunk the corpus into 500-token pieces for retrieval. A policy states a rule in one paragraph and its exception in the next, and the two land in different chunks. Retrieval grabs the rule, the model never sees the exception, and the bot confidently gives the wrong answer. A model reading the whole document would have caught it.
Retrieval misses the synonym. The customer asks about a "refund." Your document says "return credit." The embeddings are close but not close enough, the right chunk never surfaces in the top results, and the model answers as if the policy does not exist. Whole-context reading does not care what word you used.
The index goes stale. Someone updates the pricing doc. Nobody re-runs the embedding job. The vector store still holds last month's prices, so the bot quotes numbers that are no longer true, and nothing errors out to tell you. A re-index job is one more pipeline you own and have to keep running, which is the same operational tax we describe in our workflow automation work.
You now debug two systems. When an answer is wrong with RAG, the first question is whether retrieval pulled the wrong chunk or the model reasoned badly over the right one. With prompt-stuffing there is one place to look. That single-surface simplicity is worth more than it sounds when the thing has to run unattended.
When RAG genuinely earns its place
None of this means RAG is obsolete. It means RAG is a tool with a job, and the job is specific. Use it when the corpus is genuinely larger than the context window, so it cannot fit no matter how cheap tokens get. Use it when query volume is high enough that per-query token cost dominates everything else: at two cents a query, ten thousand questions a month is $200, which nobody optimizes, but a million questions is $20,000, which pays for retrieval infrastructure many times over. Use it when different users are only allowed to see different slices of the data, because filtering at retrieval is cleaner than trusting the model to ignore what it can see. And use it when you need a precise citation back to the exact source passage, which retrieval gives you for free and a full-context read does not.
Those are the four thresholds. If you cross one, build RAG and build it well. If you cross none, you are paying the full complexity of retrieval to solve a problem you do not have. This is the same instinct behind decisions like whether you need an MCP server or just a webhook: match the architecture to the actual requirement, not to the most impressive option on the table.
How to decide this week
Do three measurements before you write any code. First, paste your real documents into a token counter and get the true size. Second, estimate honest monthly query volume, not the optimistic launch number. Third, ask whether any user must be blocked from any part of the corpus. If the corpus fits, volume is modest, and everyone can see everything, start with prompt-stuffing and caching. You will ship in a fraction of the time and have one system to maintain instead of three.
If the answer is genuinely "the corpus is too big" or "we serve millions of queries" or "access has to be partitioned," then RAG is the right call, and the chunking and retrieval quality become the thing worth your attention. Designing that boundary correctly is most of what we do when we build custom AI software for clients. If you want a second opinion on which side of the line your project sits, tell us what you are building and we will give you the simplest build that works.
Frequently Asked Questions
SOURCES & CITATIONS
- Prompt caching — Anthropichttps://platform.claude.com/docs/en/build-with-claude/prompt-caching
- Anthropic makes a pricing change that matters for Claude's longest prompts — The New Stackhttps://thenewstack.io/claude-million-token-pricing/
- Claude pricing — Anthropichttps://www.anthropic.com/pricing
About Alexey Yushkin
Alexey is the founder of GENERAL INFORMATICS LLC. He designs and ships AI and automation systems for businesses and operators across the US.
Related reading
Want this kind of system in your business?
We build practical AI and automation systems for operators. Send us your current workflow and we will show you what to automate first.
Request a Workflow Review