AI Engineering · 2025-11-05 · 6 min read

Fine-Tuning Small Models vs Prompting Large Ones: When to Do What

fine-tuning · prompt engineering · model selection · llm

Every AI project we start now begins with the same question: should we fine-tune a small model or prompt a large one? Two years ago, the answer was almost always "prompt GPT-4." The models were so far ahead of anything you could fine-tune that the calculus was simple. That is no longer the case. The gap between large frontier models and fine-tuned smaller models has narrowed dramatically, and in many production scenarios, the fine-tuned small model wins on every metric that matters.

We have shipped over sixty AI-powered features across various projects since early 2024. About forty used prompted large models. About twenty used fine-tuned smaller models. Here is what we have learned about when each approach wins.

Prompting large models wins when: the task requires broad world knowledge, the output format varies significantly between requests, you are in the exploration phase and the task definition is still changing, the volume is low enough that API costs are manageable (under a thousand requests per day for most use cases), or the task involves complex multi-step reasoning that smaller models genuinely cannot do.

Fine-tuning small models wins when: the task is narrow and well-defined, the output format is consistent and structured, you have at least five hundred examples of good input-output pairs, the volume is high enough that large model API costs become painful, latency requirements are strict (under 200ms), or you need to run inference on-device or in an air-gapped environment.

Let us get specific with numbers. We built a document classification system for a legal tech client. The task: given a PDF document, classify it into one of forty-seven document types. Using Claude 3.5 Sonnet with a carefully crafted prompt and few-shot examples, we achieved 91% accuracy at approximately four cents per document. Monthly volume was around thirty thousand documents. That is twelve hundred dollars per month in API costs, with a latency of 3-5 seconds per classification.

We then fine-tuned a Mistral 7B model on eight thousand labeled examples from the client's historical data. Accuracy jumped to 96% -- the fine-tuned model outperformed the frontier model because it learned the client's specific taxonomy and edge cases. Inference cost dropped to roughly 0.1 cent per document on a single A10 GPU instance. Monthly cost: about eighty dollars for the GPU instance. Latency: 180ms per document. The fine-tuned model was cheaper by 15x, faster by 20x, and more accurate by 5 percentage points.

But fine-tuning is not free. The upfront cost was substantial: two weeks of engineering time to prepare the training data, run experiments, evaluate results, and deploy the model. We estimate that at roughly fifteen thousand dollars in labor. The breakeven point versus prompting was about twelve months of operation. For a system that would run for years, that was an easy decision. For a three-month pilot, it would not have been.
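
To make the arithmetic explicit, here is a back-of-envelope sketch in Python using the round figures quoted above. The inputs are approximations, so read the outputs as orders of magnitude rather than precise values.

```python
# Back-of-envelope math for the document classification case study above.
# All figures are the approximate numbers quoted in the text.

monthly_docs = 30_000

# Prompted frontier model
prompt_cost_per_doc = 0.04                                   # ~4 cents per document
prompt_monthly_cost = monthly_docs * prompt_cost_per_doc     # ~$1,200/month
prompt_latency_s = 4.0                                       # midpoint of 3-5 seconds

# Fine-tuned Mistral 7B on a single A10 instance
ft_monthly_cost = 80                                         # GPU instance, ~$80/month
ft_latency_s = 0.18                                          # 180 ms per document

print(f"Cost ratio:    {prompt_monthly_cost / ft_monthly_cost:.0f}x cheaper")   # ~15x
print(f"Latency ratio: {prompt_latency_s / ft_latency_s:.0f}x faster")          # ~22x

# Breakeven on the ~$15k of upfront engineering effort
upfront_cost = 15_000
monthly_savings = prompt_monthly_cost - ft_monthly_cost      # ~$1,120/month
print(f"Breakeven:     {upfront_cost / monthly_savings:.1f} months")            # ~13, call it a year
```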

Here is our decision framework, distilled into a flowchart we actually use (a code sketch of the same steps follows the list):

Step one: Can you define the task with fewer than twenty example input-output pairs? If yes, start with prompting. You do not have enough signal to fine-tune effectively.

Step two: Is the task changing frequently? If you are still figuring out what the model should do, fine-tuning is premature. Every task change means re-collecting data and re-training. Prompts are cheap to iterate on.

Step three: What is your monthly volume? Below one thousand requests per month, the economics almost never favor fine-tuning. The GPU hosting costs alone exceed what you would spend on API calls to a large model.

Step four: What is your latency budget? If you need sub-200ms responses, you need a small model running on dedicated hardware. No API call to a hosted large model will consistently hit that target.

Step five: Do you have labeled data? Fine-tuning requires examples. Not ten examples. Hundreds, ideally thousands. If you do not have labeled data and would need to create it, factor that cost into your decision. Labeling one thousand examples typically costs two to five thousand dollars using a service like Scale AI, or a week of internal effort.

Step six: What is the accuracy gap? Run your best prompt against a large model on a held-out test set. If accuracy is above 95%, the marginal improvement from fine-tuning may not justify the investment. If accuracy is below 85%, fine-tuning with domain-specific data will almost certainly help.
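
For readers who prefer code to flowcharts, here is a rough sketch of the same six steps as a single function. The thresholds are the heuristics from the steps above, hard-coded for illustration; TaskProfile and its fields are our own invention, not an API from any library.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    example_pairs: int          # labeled input-output pairs you already have
    task_still_changing: bool   # is the task definition still in flux?
    monthly_volume: int         # requests per month
    latency_budget_ms: int      # hard latency requirement
    prompted_accuracy: float    # your best prompt's accuracy on a held-out set

def recommend(task: TaskProfile) -> str:
    """Heuristic encoding of the six steps above; not gospel."""
    # Steps one and two: too little signal, or a moving target.
    if task.example_pairs < 20 or task.task_still_changing:
        return "prompt a large model"
    # Step three: at low volume, API costs stay manageable and GPU hosting does not pay off.
    if task.monthly_volume < 1_000:
        return "prompt a large model"
    # Step four: strict latency means a small model on dedicated hardware.
    if task.latency_budget_ms < 200:
        return "fine-tune a small model"
    # Step five: not enough labeled data yet; budget for labeling first.
    if task.example_pairs < 500:
        return "prompt a large model (and start collecting labels)"
    # Step six: the accuracy gap decides the rest.
    if task.prompted_accuracy >= 0.95:
        return "prompt a large model"
    if task.prompted_accuracy < 0.85:
        return "fine-tune a small model"
    return "either; let cost and latency decide"

# The legal document classifier: 8,000 labels, stable task, 30k docs/month, 91% prompted accuracy.
# Prints "either; let cost and latency decide" -- in practice, cost tipped it toward fine-tuning.
print(recommend(TaskProfile(8_000, False, 30_000, 200, 0.91)))
```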

The hybrid approach is worth mentioning because we use it often. We use a large model to generate training data for fine-tuning a small model. The process: write a good prompt, run it against a large model on a diverse set of inputs, have a human review and correct the outputs, then use those corrected outputs as training data for a small model. This is sometimes called "distillation" and it works remarkably well. We have used this pattern to create fine-tuned models that match 90-95% of the large model's performance at a fraction of the cost.
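
A minimal sketch of that pipeline, assuming LiteLLM's OpenAI-style completion call for the teacher model and an instruction/output JSONL layout of the sort most fine-tuning tools accept. The prompt, model name, and helper functions are placeholders; the human review step happens between the two functions.

```python
import json
from litellm import completion   # OpenAI-style interface to most providers

TEACHER_MODEL = "gpt-4o"         # the large "teacher" model; any frontier model works
TASK_PROMPT = "Classify the following document into one of our document types:\n\n{doc}"

def draft_examples(documents: list[str]) -> list[dict]:
    """Run the teacher model over a diverse set of inputs to draft training examples."""
    drafts = []
    for doc in documents:
        response = completion(
            model=TEACHER_MODEL,
            messages=[{"role": "user", "content": TASK_PROMPT.format(doc=doc)}],
        )
        drafts.append({
            "instruction": TASK_PROMPT.format(doc=doc),
            "output": response.choices[0].message.content,
        })
    return drafts

def write_training_file(reviewed: list[dict], path: str = "train.jsonl") -> None:
    """Only human-reviewed, corrected drafts go into the fine-tuning set."""
    with open(path, "w") as f:
        for example in reviewed:
            f.write(json.dumps(example) + "\n")
```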

One pattern we have seen fail repeatedly: fine-tuning for tasks that require current knowledge. A fine-tuned model only knows what was in its training data. If the task requires understanding recent events, new product information, or evolving regulations, you either need to retrain frequently (expensive and slow) or use RAG (retrieval-augmented generation) with a prompted model. We generally recommend RAG with a large model for knowledge-intensive tasks and fine-tuning for skill-intensive tasks.
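
For contrast, this is roughly the shape of RAG with a prompted model. The retrieve callable stands in for whatever vector store or search index holds your current documents, and the model string is just an example.

```python
from litellm import completion

def answer_with_rag(question: str, retrieve) -> str:
    # Pull the freshest relevant context at request time; when regulations or
    # product docs change, you re-index the documents instead of retraining.
    chunks = retrieve(question, k=5)
    context = "\n\n".join(chunks)
    response = completion(
        model="anthropic/claude-3-5-sonnet-20241022",
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n\n{context}\n\nQuestion: {question}",
        }],
    )
    return response.choices[0].message.content
```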

The tooling has gotten significantly better. For fine-tuning, we use Axolotl for local experiments and Together AI or Fireworks for hosted fine-tuning and inference. For prompting, we use LiteLLM as a proxy layer that lets us switch between providers without changing application code. For evaluation, we use a combination of custom Python scripts and Braintrust for systematic prompt testing.
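
The point of the proxy layer is that the model identifier becomes configuration rather than code. A small sketch of that pattern, which also doubles as the held-out accuracy check from step six; the model name, prompt, and dataset format are illustrative.

```python
from litellm import completion

# The model identifier is configuration, not code: swap it to "gpt-4o" or a
# self-hosted endpoint without touching the callers.
MODEL = "anthropic/claude-3-5-sonnet-20241022"

def classify(document: str) -> str:
    response = completion(
        model=MODEL,
        messages=[{"role": "user", "content": f"Classify this document:\n\n{document}"}],
    )
    return response.choices[0].message.content.strip()

def accuracy(test_set: list[tuple[str, str]]) -> float:
    """test_set holds (document, expected_label) pairs kept out of any training data."""
    correct = sum(1 for doc, label in test_set if classify(doc) == label)
    return correct / len(test_set)
```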

Another consideration people overlook: data privacy. When you fine-tune and self-host a model, your data never leaves your infrastructure. When you prompt a hosted model, your inputs and outputs pass through a third party's servers. For healthcare, legal, and financial applications, this distinction can be the deciding factor regardless of cost or performance.

Our current default recommendation for most projects: start with prompted Claude or GPT-4o for prototyping and initial launch. Measure volume, cost, accuracy, and latency in production for sixty to ninety days. If any of those metrics are problematic, evaluate fine-tuning with the production data you have collected. This gives you the fast time-to-market of prompting with a clear upgrade path to fine-tuning when the numbers justify it.
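
Measuring in production does not require heavy tooling: logging a handful of fields per request and aggregating them at the end of the window is often enough. A minimal sketch, with illustrative field names.

```python
import json, time

def log_request(model: str, latency_ms: float, cost_usd: float,
                correct: bool | None = None, path: str = "llm_requests.jsonl") -> None:
    """Append one record per request; volume, cost, latency, and accuracy all
    fall out of this file when you aggregate it at the end of the window."""
    with open(path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "model": model,
            "latency_ms": latency_ms,
            "cost_usd": cost_usd,
            "correct": correct,   # filled in later from spot-checks or user feedback
        }) + "\n")
```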

The worst decision is the one made on ideology. "We should fine-tune because it is cooler" is as bad as "just use GPT-4 for everything." Let the numbers guide you. Measure everything. Optimize for the metric that matters most to your specific product.

About the Author

Fordel Studios

AI-native app development for startups and growing teams. 14+ years of experience shipping production software.
