Post-transformer NLP — small models, structured output, function calling.
LLM function calling handles most extraction and classification tasks that used to require custom NLP pipelines. But there are cases where a fine-tuned SLM outperforms a prompted LLM: high-volume classification with latency constraints, domain-specific entity recognition, and regulated environments where proprietary data can't leave your infrastructure. We match the architecture to the constraint, not the other way around.
The post-transformer NLP landscape has restructured the architecture decision space. Traditional NLP pipelines — rule-based entity extraction, intent classifiers, slot-filling models — have largely been replaced by two cleaner regimes. For complex, open-ended tasks: LLMs with structured output (JSON mode) or function calling. For high-volume, well-defined tasks: small language models (SLMs) fine-tuned on domain-specific data.
The architecture mistake in both directions is expensive. Using GPT-4o for a support ticket classifier that runs at high volume burns budget on a task that a fine-tuned DeBERTa-v3 or Phi-3-mini could handle at roughly 50x lower inference cost with equivalent accuracy. Using a fine-tuned SLM for a task that requires multi-document reasoning or complex judgment produces poor results where a prompted LLM would handle it correctly. The function calling pattern — where you define a JSON schema and the model populates it — handles a large class of structured extraction tasks that previously required custom NLP pipelines.
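In practice the pattern looks like this — an illustrative OpenAI-style tool definition for a hypothetical invoice-extraction task (the field names and descriptions are made up for the example; they are not from a real schema):

```python
# Illustrative function/tool definition for a hypothetical invoice-extraction
# task. The field descriptions give the model semantic context; the model
# returns arguments conforming to this JSON Schema instead of free text.
extract_invoice_tool = {
    "type": "function",
    "function": {
        "name": "extract_invoice",
        "description": "Extract structured fields from an invoice.",
        "parameters": {
            "type": "object",
            "properties": {
                "vendor_name": {
                    "type": "string",
                    "description": "Legal name of the issuing vendor, not the recipient.",
                },
                "total_amount": {
                    "type": "number",
                    "description": "Grand total including tax, as a plain number.",
                },
                "currency": {
                    "type": "string",
                    "description": "ISO 4217 code, e.g. 'USD' or 'EUR'.",
                },
                "due_date": {
                    "type": "string",
                    "description": "Payment due date in ISO 8601 format (YYYY-MM-DD).",
                },
            },
            "required": ["vendor_name", "total_amount", "currency"],
        },
    },
}
```

Precise descriptions like "not the recipient" do the disambiguation work that a custom NER pipeline used to do.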
- High-volume classification with labeled data → fine-tuned SLM (DeBERTa-v3, Phi-3-mini)
- Structured extraction from documents → LLM with JSON mode or function calling
- Named entity recognition in specialized domains → spaCy pipeline with custom components
- Complex multi-step reasoning → LLM with chain-of-thought prompting
- Semantic search → embedding models (text-embedding-3-large, BGE, E5) + vector index
- On-device NLP with privacy constraints → quantized SLM via ONNX or CoreML
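The decision list above can be sketched as a simple router. The task categories and the volume threshold here are illustrative, not a complete decision procedure:

```python
# Minimal sketch of the task-to-architecture mapping above. Categories and
# the volume threshold are illustrative assumptions, not tuned values.
def recommend_architecture(task: str, volume_per_day: int, has_labels: bool) -> str:
    if task == "classification" and has_labels and volume_per_day > 100_000:
        return "fine-tuned SLM (DeBERTa-v3, Phi-3-mini)"
    if task == "structured_extraction":
        return "LLM with JSON mode or function calling"
    if task == "semantic_search":
        return "embedding model + vector index"
    # Default: open-ended or low-volume work where engineering time
    # dominates inference cost.
    return "prompted LLM with structured output"
```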
We start every NLP engagement with task characterization: what is the input, what is the required output, what accuracy is acceptable, and what are the latency and throughput constraints. This determines whether the right approach is a fine-tuned SLM, a spaCy extraction pipeline, a RAG system, or a prompted LLM with structured output.
For structured extraction, we use the LLM function calling pattern with well-defined JSON schemas and field descriptions. For high-volume classification where cost is a concern, we fine-tune SLMs from the Hugging Face Hub on domain-specific labeled data. For entity extraction and NLP preprocessing, spaCy production pipelines handle throughput that LLMs cannot approach at viable cost.
NLP system build process
Define input/output specification, accuracy requirements, latency budget, throughput targets, and cost constraints. Select architecture based on requirements — not on what is most technically interesting.
Design JSON schemas for function calling or JSON mode with field descriptions that give the model semantic context. Validate against a sample dataset to catch schema ambiguities before production.
Fine-tune an SLM from the Hugging Face Hub on domain-specific labeled data. Evaluate against task-appropriate metrics: F1 for sequence labeling, macro-F1 for classification. Report per-class performance — aggregate accuracy hides class imbalance.
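Per-class reporting is simple to compute. A pure-Python sketch of the metrics (in practice scikit-learn's `classification_report` does this), showing why macro-F1 surfaces what aggregate accuracy hides:

```python
from collections import defaultdict

def per_class_f1(y_true, y_pred):
    """Per-class precision/recall/F1 plus macro-F1.

    Pure-Python stand-in for sklearn.metrics.classification_report;
    macro-F1 averages per-class F1, so a badly-served minority class
    drags it down even when aggregate accuracy looks fine."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        prec = tp[cls] / (tp[cls] + fp[cls]) if tp[cls] + fp[cls] else 0.0
        rec = tp[cls] / (tp[cls] + fn[cls]) if tp[cls] + fn[cls] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        report[cls] = {"precision": prec, "recall": rec, "f1": f1}
    report["macro_f1"] = sum(v["f1"] for v in report.values()) / len(report)
    return report
```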
Deploy a FastAPI serving endpoint with batch processing support, confidence scoring, and human-escalation routing for low-confidence outputs. Use vLLM or TGI for high-throughput SLM serving.
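The escalation logic itself is a small routing function. A minimal sketch — the 0.85 threshold is an illustrative assumption; in practice it is tuned against the deployed model's confidence calibration:

```python
def route_prediction(label: str, confidence: float, threshold: float = 0.85):
    """Route low-confidence predictions to human review.

    The default threshold is illustrative; production values come from
    calibration curves on a held-out set, not a fixed constant."""
    if confidence >= threshold:
        return {"route": "auto", "label": label}
    return {"route": "human_review", "label": label, "confidence": confidence}
```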
Track confidence score distributions and prediction class distributions over time. Distribution shift triggers review before user-visible degradation.
- 01
Function calling and JSON mode for structured extraction
We design JSON schemas with precise field descriptions that give LLMs the semantic context to populate them accurately — replacing what custom NER and slot-filling pipelines used to do. Output is validated against Zod (TypeScript) or Pydantic (Python) schemas at the application boundary. This pattern works well for low-to-medium volume extraction where accuracy matters more than inference cost.
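The boundary check can be sketched with the standard library alone (in production we use Pydantic or Zod models for this; the ticket fields below are hypothetical):

```python
import json
from dataclasses import dataclass

# Stdlib stand-in for a Pydantic/Zod boundary check: parse the model's
# raw JSON output and reject anything that violates the schema before it
# reaches application code. Field names are illustrative.
@dataclass
class SupportTicket:
    category: str
    priority: str
    summary: str

    ALLOWED_PRIORITIES = ("low", "medium", "high")  # class constant, not a field

    def __post_init__(self):
        if self.priority not in self.ALLOWED_PRIORITIES:
            raise ValueError(f"invalid priority: {self.priority!r}")

def parse_llm_output(raw: str) -> SupportTicket:
    data = json.loads(raw)        # malformed JSON -> exception -> retry path
    return SupportTicket(**data)  # missing/invalid fields -> exception
```

Failures raised here feed the retry-or-human-review routing rather than propagating bad data downstream.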
- 02
SLM fine-tuning for high-volume classification
For classification, labeling, and extraction tasks running at scale, fine-tuned SLMs from the Hugging Face Hub — DeBERTa-v3, Phi-3-mini, Mistral-7B — deliver competitive accuracy at a fraction of GPT-4-class inference cost. We handle dataset preparation, fine-tuning, per-class evaluation against a held-out test set, and production serving via vLLM or Text Generation Inference. Evaluation reports include confusion matrices and confidence calibration curves.
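A confusion matrix is the core of that per-class report. A pure-Python sketch (in practice `sklearn.metrics.confusion_matrix` does this):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows = true class, columns = predicted class.

    Minimal version of the matrices in our evaluation reports; off-diagonal
    cells show exactly which classes the model confuses."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]
```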
- 03
spaCy production pipelines
spaCy handles high-throughput, low-latency NLP where transformer inference overhead is unacceptable. We build custom pipeline components for domain-specific named entities, relation extraction, and text normalization — integrated with transformer-based components where the task requires it. Production pipelines are packaged as spaCy models and versioned independently from application code.
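A minimal sketch of the pattern: a blank English pipeline with an `EntityRuler` for domain-specific entities. The labels and patterns are illustrative; production pipelines add statistical components and ship as packaged, versioned spaCy models:

```python
import spacy

# Blank pipeline + rule-based entity component. No pretrained model is
# downloaded; the tokenizer and ruler run at high throughput on CPU.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "MODEL", "pattern": "Phi-3-mini"},   # illustrative domain entities
    {"label": "MODEL", "pattern": "DeBERTa-v3"},
])

doc = nlp("We fine-tuned Phi-3-mini for ticket triage.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
```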
- 04
Semantic search infrastructure
Embedding-based search using text-embedding-3-large, BGE, or E5 models depending on language coverage and cost requirements. For Postgres-native stacks we use pgvector; for dedicated search at scale, Pinecone or Weaviate. Search quality is evaluated against human-relevance baselines before deployment — NDCG and MRR reported per query category.
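The ranking step reduces to cosine similarity over embedding vectors. A toy sketch with hand-made 2-d vectors standing in for real embeddings (pgvector or a dedicated index does this at scale):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, corpus, k=2):
    """Rank documents by cosine similarity to the query embedding.

    Toy vectors stand in for real embeddings (text-embedding-3-large,
    BGE, E5); a vector index replaces this linear scan in production."""
    scored = sorted(corpus.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```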
- 05
On-device NLP for privacy-sensitive workloads
When data can't leave the device — regulated industries, client-side mobile apps, air-gapped environments — we deploy quantized SLMs via ONNX Runtime or CoreML on Apple Silicon. Practical for classification and extraction tasks where 4-bit or 8-bit quantization trade-offs are acceptable against the privacy and latency requirements. We benchmark accuracy loss from quantization before recommending this path.
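The core of that trade-off is easy to see in miniature. A sketch of symmetric 8-bit quantization — real toolchains (ONNX Runtime, CoreML) do this per-tensor or per-channel, but the error model is the same:

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: map floats to [-127, 127] via a scale.

    Illustrates the accuracy/size trade-off we benchmark before recommending
    on-device deployment; roundtrip error is bounded by half the scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]
```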
- Task characterization doc with architecture recommendation and cost model
- JSON schema design for function calling or structured output
- Fine-tuned SLM with per-class precision, recall, and F1 metrics
- Production serving endpoint with confidence scoring and escalation routing
- Batch processing pipeline for high-volume workloads
- Monitoring setup tracking prediction distribution and confidence drift
SLMs fine-tuned on domain data typically reduce inference cost by 80–95% versus GPT-4-class models on the same classification task, with comparable accuracy on in-distribution inputs. The trade-off is narrower generalization — which is acceptable for well-defined extraction and classification problems with stable schemas.
Frequently asked questions
When does fine-tuning an SLM beat prompting a large LLM?
Fine-tuning wins on: high-volume tasks where inference cost matters, tasks requiring consistent structured output format, domain-specific vocabulary or reasoning patterns the base model handles poorly, and latency-sensitive applications. Prompting large LLMs wins on: tasks with insufficient training data, highly varied open-ended inputs, multi-step reasoning across long contexts, and low-volume tasks where engineering time is the dominant cost.
Is function calling reliable enough for production structured extraction?
Yes, for well-defined schemas with clear field descriptions. GPT-4o and Claude 3.5 Sonnet reliably populate well-designed schemas. The reliability drops when schemas are ambiguous, fields have overlapping semantics, or the input text is highly unstructured. We design schemas with field descriptions that eliminate ambiguity, validate outputs with Pydantic or Zod, and route failures to retry or human review.
What is RAG and when is it appropriate?
RAG (retrieval-augmented generation) combines a retrieval system — semantic search over your document corpus — with an LLM that generates answers grounded in the retrieved content. It is appropriate when you need LLM-quality responses about a knowledge base that changes frequently: internal documentation, product catalogs, regulatory content. It is not appropriate for tasks requiring reasoning across the entire corpus simultaneously rather than retrieving relevant passages.
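The generation side is just careful prompt assembly over retrieved text. A minimal sketch — retrieval (embedding search over the corpus) is assumed to have already produced `passages` ranked by relevance:

```python
def build_grounded_prompt(question, passages, k=2):
    """Assemble a RAG prompt: numbered retrieved passages plus an
    instruction to answer only from them and cite by number."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages[:k]))
    return (
        "Answer the question using only the sources below. "
        "Cite sources by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```

Grounding the model in numbered sources is what makes answers auditable against the corpus.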
How much labeled data do we need for SLM fine-tuning?
Classification tasks: hundreds to a few thousand examples per class can achieve strong performance with modern pre-trained models — fine-tuning starts from a model that already understands language. Named entity recognition: depends on entity type diversity and domain specificity. We assess data requirements during task characterization and give realistic estimates before committing to a fine-tuning approach.
Ready to get started?
Tell us what you are building. We will scope it, price it honestly, and give you a clear plan.
Free 30-min scoping call
