Do I need a vector database for a small AI knowledge base?

Usually not. If your documents fit inside a model's context window, which as of June 2026 means up to roughly a few thousand pages on Claude Sonnet 4.6's 1 million token window, you can put the whole corpus in the prompt and cache it. A vector database earns its place when the corpus is too big to fit, changes many times a day, or has to be filtered per user.

What is the difference between RAG and just putting documents in the prompt?

Prompt-stuffing loads the documents the model needs into the context with each request and lets the model read everything before answering. RAG, retrieval-augmented generation, embeds the corpus into a vector database ahead of time, then at query time retrieves only the chunks most similar to the question and puts those in the prompt. RAG sends fewer tokens per query but adds an embedding step, a vector store to host, a retrieval step, and a re-index job on every document change.

How big can a knowledge base be before it stops fitting in the context window?

English text runs about 750 words to 1,000 tokens, so a 100-page handbook of roughly 50,000 words is about 65,000 tokens. A 200,000 token window holds around 500 pages, and a 1 million token window holds a few thousand. Paste your documents into a token counter to get the real number before deciding.

Does prompt caching make long context cheaper than RAG?

For most small knowledge bases, yes. Anthropic's prompt caching charges cached reads at 0.1x the normal input rate, so a 65,000 token corpus costs roughly two cents per query once the cache is warm. RAG can be a bit cheaper per query in raw tokens, but the saving is small in absolute terms until query volume is very high, and you pay for it with retrieval infrastructure you now have to maintain.

When does RAG actually beat long context?

When the corpus is larger than the context window, when you serve enough queries per month that per-query token cost dominates everything else, when different users may only see different parts of the data, or when you need exact citations back to a specific source chunk. Those are real thresholds, not defaults. If none of them apply, the simpler prompt-and-cache build usually wins.

Do you need a vector database for AI on your docs?

You need a vector database only when your knowledge base is too large to fit in a model's context window, it changes many times a day, or different users are allowed to see different parts of it. If your documents fit in a long-context window and change rarely, the simpler build is to put the whole corpus in the prompt and cache it. As of June 2026 that path is cheaper, more accurate, and far less to maintain than retrieval-augmented generation for most small knowledge bases. The reflex to reach for a vector database first is a holdover from when context windows were tiny and long prompts were expensive. Both of those facts changed.

The reason this is worth saying plainly is that almost every "put AI on our documents" tutorial opens by spinning up a vector store, chunking the corpus, and wiring a retrieval step. That was the right default in 2023. It is the wrong default for a 100-page handbook in 2026, and following it anyway buys you a system that is harder to debug and easier to get subtly wrong.

"AI on our docs" is really two builds

The request is always the same: a bot that answers from our handbook, our product docs, our policies, or our past support tickets. There are two ways to build it, and the whole decision is which one you pick.

The first is prompt-stuffing. You load the documents the model needs into its context with each request and ask the question. The model reads everything and answers. The second is RAG. You embed the corpus into a vector database ahead of time, and at query time you retrieve the handful of chunks most similar to the question, then put only those chunks in the prompt.

RAG became the default for a real reason. Models used to hold a few thousand tokens of context. You could not fit a handbook, so you had to retrieve the relevant slice. That constraint is mostly gone, and a lot of advice has not caught up.

When the whole corpus fits, put it in the prompt

Start by measuring, not guessing. English text runs about 750 words to 1,000 tokens, so a 100-page handbook of roughly 50,000 words is about 65,000 tokens. A 200,000 token window holds around 500 pages. A 1 million token window holds a few thousand. Paste your real documents into a token counter and you will usually find the corpus is smaller than you feared.

The economics shifted in March 2026. Anthropic made the 1 million token context window generally available for Claude Opus 4.6 and Sonnet 4.6 and, more importantly, removed the long-context surcharge. Before that change, Sonnet input pricing roughly doubled, from about $3 to about $6 per million tokens, once a prompt crossed the long-context threshold. Now the full 1 million token window runs at the flat standard rate. A 900,000 token request costs the same per token as a 9,000 token one.

Then there is caching, which is the part that makes prompt-stuffing genuinely cheap. Anthropic's prompt caching charges a cached read at 0.1x the normal input rate, with a write costing 1.25x for the 5-minute cache or 2x for the 1-hour cache. Run the numbers on that 65,000 token corpus at Sonnet's $3 per million input rate:

No caching: 65,000 tokens at $3 per million is about $0.195 per query, just for the documents.
Cache write, once: about $0.24 to load the corpus into the cache.
Cached read, every query after: about $0.02 per query.

Two cents a query to have the model read your entire knowledge base on every single question, with no chunk that could be missed and no retrieval step to misfire. The catch is traffic density. The 5-minute cache only stays warm if queries keep arriving inside that window. If your bot gets a question every twenty minutes, the cache expires between them and you re-pay the write each time. The 1-hour cache option exists for exactly that steady-but-sparse pattern. For a busy support bot the cache stays hot and the per-query cost holds.

The decision rule

Here is the rule we actually use when a client asks for AI over their documents. Read it by what the knowledge base looks like, not by what the tutorials assume.

Knowledge base	How often it changes	Right build
Under ~500 pages (fits a standard window)	Any	Prompt-stuff with caching
~500 to a few thousand pages (fits a 1M window)	Rarely	Prompt-stuff the 1M window with caching
Fits the window	Many times a day	Caching efficiency drops; consider RAG or a hybrid
Larger than the context window	Any	RAG, because the corpus cannot fit
Any size	Different users may see different data	RAG, so you can filter at retrieval
Any size	You need exact citations to a source chunk	RAG or a hybrid

Most small-business cases land in the first two rows. The corpus is a handbook, a product catalog, a policy set, or a fixed body of past answers, and it fits. The bottom three rows are where RAG stops being optional, and they are about scale, access control, and attribution, not about whether AI can "read documents."

What premature RAG actually costs

Reaching for a vector database before you need one is not free. It adds failure modes that prompt-stuffing simply does not have, and we have watched each of these break a real build.

Chunking splits the answer. You chunk the corpus into 500-token pieces for retrieval. A policy states a rule in one paragraph and its exception in the next, and the two land in different chunks. Retrieval grabs the rule, the model never sees the exception, and the bot confidently gives the wrong answer. A model reading the whole document would have caught it.

Retrieval misses the synonym. The customer asks about a "refund." Your document says "return credit." The embeddings are close but not close enough, the right chunk never surfaces in the top results, and the model answers as if the policy does not exist. Whole-context reading does not care what word you used.

The index goes stale. Someone updates the pricing doc. Nobody re-runs the embedding job. The vector store still holds last month's prices, so the bot quotes numbers that are no longer true, and nothing errors out to tell you. A re-index job is one more pipeline you own and have to keep running, which is the same operational tax we describe in our workflow automation work.

You now debug two systems. When an answer is wrong with RAG, the first question is whether retrieval pulled the wrong chunk or the model reasoned badly over the right one. With prompt-stuffing there is one place to look. That single-surface simplicity is worth more than it sounds when the thing has to run unattended.

When RAG genuinely earns its place

None of this means RAG is obsolete. It means RAG is a tool with a job, and the job is specific. Use it when the corpus is genuinely larger than the context window, so it cannot fit no matter how cheap tokens get. Use it when query volume is high enough that per-query token cost dominates everything else: at two cents a query, ten thousand questions a month is $200, which nobody optimizes, but a million questions is $20,000, which pays for retrieval infrastructure many times over. Use it when different users are only allowed to see different slices of the data, because filtering at retrieval is cleaner than trusting the model to ignore what it can see. And use it when you need a precise citation back to the exact source passage, which retrieval gives you for free and a full-context read does not.

Those are the four thresholds. If you cross one, build RAG and build it well. If you cross none, you are paying the full complexity of retrieval to solve a problem you do not have. This is the same instinct behind decisions like whether you need an MCP server or just a webhook: match the architecture to the actual requirement, not to the most impressive option on the table.

How to decide this week

Do three measurements before you write any code. First, paste your real documents into a token counter and get the true size. Second, estimate honest monthly query volume, not the optimistic launch number. Third, ask whether any user must be blocked from any part of the corpus. If the corpus fits, volume is modest, and everyone can see everything, start with prompt-stuffing and caching. You will ship in a fraction of the time and have one system to maintain instead of three.

If the answer is genuinely "the corpus is too big" or "we serve millions of queries" or "access has to be partitioned," then RAG is the right call, and the chunking and retrieval quality become the thing worth your attention. Designing that boundary correctly is most of what we do when we build custom AI software for clients. If you want a second opinion on which side of the line your project sits, tell us what you are building and we will give you the simplest build that works.

Do you need a vector database for AI on your docs?

"AI on our docs" is really two builds

When the whole corpus fits, put it in the prompt

The decision rule

What premature RAG actually costs

When RAG genuinely earns its place

How to decide this week

Frequently Asked Questions

SOURCES & CITATIONS

About Alexey Yushkin

Related reading

How to build an AI chatbot that books appointments instead of just answering FAQs

Do you need an MCP server, or is a webhook enough?

Rolling back a broken automation isn't recovery

Want this kind of system in your business?