Fine-TuningRAGAISmall Business

Do you need to fine-tune an AI model, or just prompt it?

Fine-tuning changes how a model formats and behaves, not what it knows, so most 'train our own AI on our data' projects are knowledge problems that retrieval or a longer prompt solves better and cheaper. In 2026 the case shrank further: OpenAI is winding down self-serve fine-tuning, leaving it a narrow tool for distilling a proven behavior into a smaller, cheaper model at high volume.

Alexey YushkinFounder, GENERAL INFORMATICS3 min read

Fine-tuning changes how a model writes, not what it knows. If your goal is for the AI to answer from your prices, policies, or product docs, fine-tuning is the wrong tool, because it teaches form and behavior, not facts, and pushing new facts in through training actually makes a model hallucinate more. Most "we should train our own AI on our data" projects are knowledge problems, and a retrieval step or a longer prompt solves them better and cheaper. In 2026 the case shrank again from the supply side: OpenAI is winding down self-serve fine-tuning, which leaves it a narrow tool for one job, copying a behavior you have already proven into a smaller, cheaper model at high volume.

This matters because "train it on our data" is the single most common way a small AI project starts in the wrong place. The phrase sounds like fine-tuning. It almost never is.

Fine-tuning teaches form, not facts

There are three ways to shape what an AI does, and they are not interchangeable. Prompting gives instructions and examples in the request itself. Retrieval, often called RAG, fetches the relevant facts at query time and puts them in the prompt. Fine-tuning continues training the model's own weights on a set of your examples so it absorbs a pattern of behavior.

The distinction that gets lost is what each one moves. Prompting and retrieval change the information in front of the model. Fine-tuning changes the model's instincts: the shape of its answers, its tone, its default format, how it handles a recurring kind of input. It is good at "always reply in this exact JSON structure" or "classify every message into one of these six buckets the way our senior rep would." It is bad at "know that our return window changed to 45 days last week."

This is not a style opinion. A 2024 study by Gekhman and co-authors, presented at EMNLP, tested what happens when you fine-tune a model on facts it did not already know. Two findings stand out. The model learns those new-fact examples much more slowly than examples that agree with what it already knew. And as the new facts finally take hold, the model's rate of hallucination rises in step. The authors' read is that models mostly acquire facts during pre-training, and fine-tuning teaches them to use what they have more efficiently. Translated for an operator: you cannot reliably teach a model your data by fine-tuning, and the harder you try, the more confidently it will make things up.

What you actually want, mapped to the right tool

Almost every fine-tuning request we get is really one of these wishes wearing the wrong label. Here is the version we keep on hand.

What you want the AI to doThe real exampleThe right tool
Answer in the same fixed structure every time"Return name, email, and intent as JSON"Structured outputs / schema-constrained prompt
Know your current prices, policies, or docs"Answer from our handbook and quote the right number"Retrieval or a long-context prompt, kept current
Sound like our brand and follow our tone"Write replies the way our team writes them"System prompt plus a few good examples
Follow a long, multi-rule procedure consistently"Apply our 14-step intake SOP without skipping"The SOP in the prompt, plus a checking step
Do one narrow task at very high volume, cheaply"Classify 8,000 tickets a day at low cost"Fine-tune a small model (the surviving case)
Choose the next step based on the last result"Decide what to do depending on what it found"That is an agent design question, not training

The first four rows cover the overwhelming majority of small-business AI work, and none of them need fine-tuning. The format and tone wishes are solved by deciding per step what the model should and should not own and writing a tight prompt. The knowledge wishes are solved by retrieval. Only the fifth row is a genuine fine-tuning case, and it has conditions.

Why "train it on our data" is almost always a knowledge problem

When a client says "train the AI on our data," picture what the data usually is: a handbook, a price list, a policy set, a folder of past support answers. That is reference material the model should read, not behavior it should absorb. The moment you treat reference material as training data, you walk into two failure modes we have watched break real builds.

The first is the stale snapshot. Fine-tuning bakes the data in at the moment you train. Your prices change, a policy gets revised, a product is renamed, and the model keeps answering from the version it was trained on. Nothing errors out to warn you. A retrieval step reads the live document, so when you update the source the answers update with it. A fine-tuned model has to be retrained to learn that the return window moved, and most teams never do.

The second is the confident hallucination, which is the Gekhman finding showing up in production. You feed the model a few hundred question-and-answer pairs hoping it learns the answers. What it actually learns is the shape of a plausible answer in your domain. Ask something just outside the training set and it fills the blank with a response that looks exactly right and is wrong, delivered with the same confidence as a real one. That is worse than a model that says it does not know, because nobody catches it.

If the underlying need is "answer from our documents," the build decision is retrieval versus a long-context prompt, not fine-tuning at all. We wrote the whole decision out separately in whether you need a vector database for AI on your docs. Fine-tuning is not even on that ballot.

The 2026 shift: OpenAI is winding down self-serve fine-tuning

The market just made this argument for us. OpenAI, which for years was the default place a small team would fine-tune a model, is phasing out self-serve fine-tuning. Its deprecations page lays out the schedule. As of May 7, 2026, only organizations that had already run a fine-tuning job can create new ones. From July 2, 2026, that narrows again to organizations with inference activity on a fine-tuned model in the prior 60 days. On January 6, 2027, even active customers can no longer create new fine-tuning jobs. Inference on models you already fine-tuned keeps working, but only until the underlying base model is itself deprecated.

Read that last clause twice, because it names the failure mode operators never price in: the deprecation treadmill. A fine-tuned model is welded to one specific base model. When the vendor retires that base, your custom model goes with it, and you are back to retraining on whatever replaces it. You do not control that clock. The companies that fine-tuned on older GPT snapshots are living through exactly this now.

Fine-tuning has not vanished. It has moved to the cloud platforms and narrowed to specific models. Google Vertex AI offers supervised fine-tuning for Gemini 2.5 models (Pro, Flash, and Flash-Lite as of mid-2026, with the newest 3.x line not yet supported). Amazon Bedrock has offered fine-tuning for Claude 3 Haiku since November 2024. Both charge for the training run and then add an hourly hosting fee to keep your tuned model available, on top of normal per-token inference. The frontier models you would most want to fine-tune are generally the ones you cannot fine-tune yet.

The one case where fine-tuning still earns its keep

Here is the surviving case, stated precisely so you can check whether you are in it. You have a narrow, repetitive task. A large model already does it well when you prompt it. You run it at high enough volume that the per-call cost or the latency of the large model genuinely hurts. So you fine-tune a small, cheap model to copy the large model's behavior on that one task, and you swap it in to save money and time.

Notice the order. You prove the behavior with prompting first, on the best model you have, and only then consider distilling it down. Fine-tuning is the optimization at the end, not the starting move. If you have not already got the task working with a prompt, fine-tuning will not rescue it, it will just bake your current mistakes into the weights.

And notice the volume bar. Distillation pays off when you are running tens of thousands of calls a day and the gap between a frontier model and a small one is real money. At a few thousand jobs a month, a Haiku-class or Flash-class model with a good prompt is already cheap enough that the training cost, the dataset curation, the hosting fee, and the retraining-on-deprecation tax never earn back. The dataset alone is a project: you need hundreds of clean, correctly-labeled examples, and you have to maintain that set as the task drifts. Most operators who think they want fine-tuning want a better prompt and a cheaper base model, which they can have today with no training at all.

How to decide this week

Run three checks before anyone writes a training script. First, name what you are actually trying to change: the model's format and behavior, or the facts it has access to. If it is facts, stop, you want retrieval or a longer prompt. Second, if it is genuinely behavior, try to get it with a system prompt and a handful of examples, because most format and tone goals fall out of a good prompt in an afternoon. Third, only if a prompt provably cannot hold the behavior at the volume and cost you need, scope a fine-tune, and scope it as distilling a proven behavior into a smaller model, not as teaching the model your business.

The honest answer for most small and mid-sized operators in 2026 is that you will not fine-tune anything, and you will ship faster and maintain less for it. When we build custom AI software for a client, fine-tuning is the rare exception we reach for after prompting and retrieval are exhausted, not the headline. If you are staring at a "train it on our data" request and are not sure which of the three tools it really calls for, tell us what you are building and we will point you at the simplest version that works.

Frequently Asked Questions

SOURCES & CITATIONS

  1. Deprecations (fine-tuning API wind-down schedule) OpenAIhttps://developers.openai.com/api/docs/deprecations
  2. Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? Association for Computational Linguistics (EMNLP 2024)https://aclanthology.org/2024.emnlp-main.444/
  3. Supervised fine-tuning for Gemini models Google Cloudhttps://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-supervised-tuning
  4. Fine-tuning for Anthropic's Claude 3 Haiku in Amazon Bedrock is now generally available Amazon Web Serviceshttps://aws.amazon.com/about-aws/whats-new/2024/11/fine-tuning-anthropics-claude-3-haiku-amazon-bedrock

About Alexey Yushkin

Alexey is the founder of GENERAL INFORMATICS LLC. He designs and ships AI and automation systems for businesses and operators across the US.

Connect on LinkedIn

Related reading

Want this kind of system in your business?

We build practical AI and automation systems for operators. Send us your current workflow and we will show you what to automate first.

Request a Workflow Review