The core methods for adapting a pre-trained model — what each technique changes, what it preserves, and the trade-offs between them.
Full Fine-Tuning vs Parameter-Efficient Fine-Tuning
What Changes in Each Approach
Full fine-tuning unfreezes every parameter in the model and runs gradient updates across them all. This gives the optimizer maximum freedom to reshape internal representations, which can yield the strongest task adaptation — but at the cost of enormous GPU memory, long training runs, and a full copy of the model per task.
Parameter-efficient fine-tuning (PEFT) takes a different approach: freeze the base model and either inject small trainable modules (adapters, LoRA matrices) or selectively unfreeze a tiny subset of existing parameters (bias tuning, layer-norm tuning). The result is far fewer trainable parameters — often less than 1% of the total — with surprisingly competitive quality for many tasks.
When to Prefer Each
| Criterion | Full Fine-Tuning | PEFT |
|---|---|---|
| Adaptation depth | Deepest: can reshape all representations | Moderate: adds capacity but base stays fixed |
| GPU memory | Very high: gradients and optimizer states for all params | Lower: gradients and optimizer states for adapter params only; frozen base weights still occupy memory |
| Multi-task serving | One full model copy per task | Shared base + swappable adapter files |
| Risk of forgetting | Higher: all weights can drift | Lower: base weights stay frozen |
| Data needed | More data for stable results | Can work with smaller curated sets |
Practical Considerations
In production, the choice is rarely purely technical. Full fine-tuning produces a monolithic model that is harder to roll back or diff against the base. PEFT adapters, by contrast, are small files that can be version-controlled, A/B tested, and hot-swapped at serving time. This operational advantage often matters more than marginal quality differences. See Topic 11: Cost Trade-Offs for a full breakdown of lifecycle costs.
Python — Comparing Trainable Parameters
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a base model
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Count full model parameters (before wrapping, so adapters are not included)
full_params = sum(p.numel() for p in base.parameters())
print(f"Full model params: {full_params:,}")

# Wrap with LoRA: only adapter weights are trainable
config = LoraConfig(
    r=16,                                 # rank of the low-rank matrices
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,                    # regularization
    target_modules=["q_proj", "v_proj"],  # which weights get adapters
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base, config)

# Show the dramatic difference
peft_model.print_trainable_parameters()
# Expected order of magnitude: ~6.8M trainable of ~8B total (about 0.08%)
Can you combine PEFT with full fine-tuning on different layers?
How do you serve multiple LoRA adapters efficiently?
Does PEFT always produce worse results than full fine-tuning?
LoRA and QLoRA
The Low-Rank Idea
A standard transformer weight matrix W has shape d x d (e.g., 4096 x 4096 = 16.7M parameters). LoRA keeps W frozen and represents the update as the product of two small matrices: B (d x r) and A (r x d), where r is typically 4 to 64. The effective update is delta_W = B * A, which has the same d x d shape as W, and the total adapter parameters drop to 2 * d * r instead of d * d. At rank 16, that is 131K parameters instead of 16.7M, a 128x reduction.
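The arithmetic above can be checked directly. This is a minimal sketch in plain Python; the helper name `lora_param_counts` is made up for illustration, and the square d x d shape is the simplifying assumption used in the text.

```python
def lora_param_counts(d: int, r: int) -> dict:
    """Compare a dense d x d update with a rank-r factorized update."""
    full = d * d      # parameters in a dense delta_W
    lora = 2 * d * r  # two factors, one d x r and one r x d
    return {"full": full, "lora": lora, "reduction": full / lora}


counts = lora_param_counts(d=4096, r=16)
print(f"Dense update: {counts['full']:,} params")
print(f"LoRA update:  {counts['lora']:,} params")
print(f"Reduction:    {counts['reduction']:.0f}x")
```

Because the reduction factor works out to d / (2r), halving the rank doubles the savings, which is why rank is the main capacity knob.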
Key Hyperparameters
| Parameter | Role | Typical Values |
|---|---|---|
| r (rank) | Controls adapter capacity | 4–64; 16 is a common default |
| lora_alpha | Scaling factor for the update (alpha/r) | Usually 2x the rank |
| lora_dropout | Regularization to prevent overfitting | 0.0–0.1 |
| target_modules | Which weight matrices get adapters | q_proj, v_proj; sometimes all attention projections |
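To make the roles of `r` and `lora_alpha` concrete, here is a toy forward pass in plain Python. It is a sketch under stated assumptions, not the PEFT library's implementation: the helper names `matvec` and `lora_forward` are hypothetical, the matrices are tiny nested lists, and the zero initialization of B follows the scheme described in the original LoRA paper.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    """Compute h = W x + (alpha / r) * B (A x) with W frozen."""
    base = matvec(W, x)               # frozen base projection
    update = matvec(B, matvec(A, x))  # low-rank path: d -> r -> d
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


d, r, alpha = 4, 2, 4
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen, d x d
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # trainable, r x d
B = [[0.0] * r for _ in range(d)]                               # trainable, d x r, zero init
x = [1.0, 2.0, 3.0, 4.0]

h = lora_forward(W, A, B, x, alpha, r)
# With B initialized to zero, the adapter contributes nothing yet,
# so the wrapped layer starts out identical to the base layer.
print(h == matvec(W, x))
```

Keeping `lora_alpha` proportional to `r` holds the `alpha / r` scale constant, so changing the rank does not also change the magnitude of the update.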
QLoRA: Quantize Then Adapt
QLoRA (Dettmers et al., 2023) keeps the same adapter structure but quantizes the frozen base model to 4-bit precision using the NormalFloat (NF4) data type. For a 7B model, this cuts training memory from roughly 16 GB with a 16-bit base to roughly 6 GB, enabling fine-tuning on a single consumer GPU. The adapters themselves remain in higher precision to preserve gradient quality.
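A back-of-envelope estimate shows where most of the saving comes from. The sketch below counts only the bytes needed to hold the base weights; the helper `weight_memory_gb` is hypothetical, and the figures deliberately ignore activations, adapter weights, optimizer state, and quantization constants, so real training footprints are higher than these numbers.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_7b = 7e9
for label, bits in [("fp32", 32), ("fp16 / bf16", 16), ("4-bit (NF4)", 4)]:
    print(f"{label:>12}: {weight_memory_gb(params_7b, bits):.1f} GB")
```

The gap between the weights-only estimate and the observed training memory is the overhead of everything else that must live on the GPU during training.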
LoRA vs QLoRA
| Dimension | LoRA | QLoRA |
|---|---|---|
| Base model precision | fp16 / bf16 | 4-bit (NF4) |
| Training memory (7B) | ~16 GB | ~6 GB |
| Training speed | Faster | Slightly slower (dequantization overhead) |
| Quality | Baseline PEFT quality | Very close to LoRA on most tasks |
| Best for | Production with adequate GPU | Experimentation, prototyping, constrained hardware |
See Topic 1: Full vs Parameter-Efficient FT for how LoRA fits into the broader adaptation landscape, and Topic 5: When FT Is Worth It for decision criteria.
Python — LoRA Setup with PEFT
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the pre-trained base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B"
)

# Configure LoRA adapters
config = LoraConfig(
    r=16,                                 # low-rank dimension
    lora_alpha=32,                        # scaling: the update is multiplied by alpha / r
    lora_dropout=0.05,                    # dropout on adapter inputs
    target_modules=["q_proj", "v_proj"],  # inject into Q and V projections
    task_type="CAUSAL_LM",                # causal language modeling
)

# Wrap the model: only adapter params are trainable
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# Expected order of magnitude: ~6.8M trainable of ~8B total (about 0.08%)
How do you choose the right rank for LoRA?
Can you merge LoRA weights back into the base model?
Is QLoRA good enough for production, or only for prototyping?
SFT, Instruction Tuning, and Preference Optimization
How They Differ
| Method | Training Signal | What It Shapes | Data Format |
|---|---|---|---|
| SFT | Gold input-output pairs | Task-specific skill | Prompt → completion |
| Instruction Tuning | Diverse instruction-response pairs | General instruction following | Instruction → response |
| Preference Optimization | Ranked outputs (chosen vs rejected) | Helpfulness, safety, style | (prompt, chosen, rejected) triples |
The Alignment Pipeline
Modern chat models typically go through these stages in sequence. First, SFT or instruction tuning gives the model basic conversational and task-following ability. Then preference optimization (via RLHF, DPO, or similar methods) refines the model's outputs to match human preferences for helpfulness, harmlessness, and honesty. This two-phase recipe, described for InstructGPT (Ouyang et al., 2022), underlies chat models such as ChatGPT and Claude.
See Topic 10: Alignment and Fine-Tuning for a deeper look at how alignment relates to the fine-tuning process.
DPO vs RLHF
RLHF trains a separate reward model on preference data, then uses reinforcement learning (PPO) to optimize the LLM against that reward. DPO (Direct Preference Optimization) skips the reward model entirely and directly optimizes the LLM using preference pairs. DPO is simpler to implement, more stable during training, and has become increasingly popular as a result.
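The DPO objective is compact enough to write out. The sketch below computes the per-example loss from summed sequence log-probabilities under the policy being trained and under a frozen reference model. It is a minimal illustration, not a training loop: the function name and the example log-probability values are assumptions, and `beta=0.1` is only a commonly used starting point.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference learned yet, loss is log(2)
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has shifted probability toward the chosen response: loss falls
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

The reference log-probabilities act as an anchor: the loss rewards widening the gap between chosen and rejected responses relative to the reference, rather than raising absolute likelihood.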
Python — SFT Data Format Example
# Typical SFT training data structure
# Each example is a prompt-completion pair
sft_examples = [
    {
        "prompt": "Summarize the following article:\n{article_text}",
        "completion": "The article discusses three key findings..."
    },
    {
        "prompt": "Classify the sentiment: 'Great product!'",
        "completion": "Positive"
    },
]

# Preference data for DPO / RLHF
# Each example has a prompt, a preferred response, and a rejected one
preference_examples = [
    {
        "prompt": "Explain quantum computing simply.",
        "chosen": "Quantum computers use qubits that can be...",
        "rejected": "Quantum computing is a paradigmatic shift..."
    },
]
Can you do instruction tuning and preference optimization at the same time?
How much instruction-tuning data is needed?
What is the difference between RLHF and RLAIF?
Model Distillation
[Figure: teacher model (70B parameters) and student model (7B parameters)]
Why Soft Labels Matter
When a teacher model predicts the next token, it produces a full probability distribution over the vocabulary. A hard label says "the answer is X." A soft label says "X has 70% probability, Y has 20%, Z has 10%." The student learns far more from soft labels because they encode the teacher's uncertainty, similarity judgments, and implicit knowledge about related alternatives.
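A small numeric sketch makes the hard-versus-soft distinction visible. The logits below are invented for illustration, and `softmax_with_temperature` is a hypothetical helper written in plain Python; raising the temperature flattens the teacher's distribution so the lower-ranked alternatives carry more signal.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = flatter)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [4.0, 2.5, 1.0]  # made-up scores for tokens X, Y, Z
sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])
# A hard label would keep only the argmax and discard the rest
print("hard:", [1 if p == max(sharp) else 0 for p in sharp])
```

The ranking of tokens is unchanged at any temperature; only how much probability mass the runners-up receive changes.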
Distillation vs PEFT
| Dimension | Distillation | PEFT (LoRA) |
|---|---|---|
| Output | A new, smaller model | The same model with adapters |
| Serving cost | Much lower (smaller model) | Same base model + adapter overhead |
| Training cost | High (need teacher inference + student training) | Moderate |
| Flexibility | Fixed once distilled | Adapters can be swapped |
| Best for | High-throughput, latency-critical production | Multi-task with shared base |
When Distillation Makes Sense
Distillation is most valuable when the production constraint is serving efficiency, not just training cost. Mobile deployments, edge inference, and high-throughput APIs where a 70B model is economically impossible are prime use cases. It is also common in cascading systems where a small model handles easy requests and routes hard ones to the full teacher (Hinton et al., 2015).
See Topic 1: Full vs Parameter-Efficient FT for how distillation fits alongside other adaptation methods.
Python — Distillation Loss Concept
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """
    Combine soft-label KL divergence with hard-label cross-entropy.

    Expects logits of shape (N, vocab) and labels of shape (N,); flatten
    (batch, seq, vocab) language-model logits before calling.
    temperature: higher = softer distribution (more knowledge transfer)
    alpha: balance between soft and hard losses
    """
    # Soft loss: KL divergence between teacher and student soft outputs
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    soft_loss = soft_loss * (temperature ** 2)  # scale by T^2

    # Hard loss: standard cross-entropy against ground truth
    hard_loss = F.cross_entropy(student_logits, labels)

    # Weighted combination
    return alpha * soft_loss + (1 - alpha) * hard_loss
What role does temperature play in distillation?
Can you distill a model into a fundamentally different architecture?
How does distillation relate to synthetic data generation?
Knowing when to fine-tune, what data quality means in practice, and when the right answer is to skip fine-tuning entirely.
When Is Fine-Tuning Worth the Effort?
The Intervention Ladder
Before fine-tuning, exhaust cheaper interventions. This is not about avoiding fine-tuning — it is about proving the problem actually requires it:
- Prompt engineering — rewrite instructions, add examples, refine system prompts
- Retrieval improvements — better chunking, re-ranking, evidence selection
- Tool and workflow design — structured outputs, validation layers, fallbacks
- PEFT / LoRA — lightweight adaptation when behavior is the bottleneck
- Full fine-tuning — deep adaptation when nothing else closes the gap
Signals That Fine-Tuning Will Help
- Consistent format violations that prompt engineering cannot fix
- Domain-specific language patterns the base model does not produce reliably
- Latency or cost reduction from replacing a large model + long prompt with a smaller fine-tuned model
- Policy or safety behavior that must be deeply embedded, not just prompted
See Topic 7: When to Avoid Fine-Tuning for the flip side of this decision.
Python — Decision Framework
def should_fine_tune(metrics):
    """
    Simple decision framework: check all prerequisites
    before recommending fine-tuning investment.
    """
    checks = {
        "prompt_engineering_tried": metrics.get("prompt_iterations", 0) >= 3,
        "retrieval_optimized": metrics.get("retrieval_recall", 0) > 0.85,
        "task_is_stable": metrics.get("task_change_frequency_days", 0) > 30,
        "data_quality_score": metrics.get("label_agreement", 0) > 0.90,
        "volume_justifies_cost": metrics.get("monthly_requests", 0) > 10000,
    }
    passed = sum(checks.values())
    print(f"Readiness: {passed}/{len(checks)} checks passed")
    for name, ok in checks.items():
        status = "PASS" if ok else "FAIL"
        print(f"  [{status}] {name}")
    return passed == len(checks)
Can fine-tuning reduce API cost even if quality is already good?
How do you measure that prompting has "plateaued"?
What Makes a Fine-Tuning Dataset High Quality?
Quality Defines the Ceiling
Fine-tuning amplifies the patterns in the data. It does not invent a better target than the one you provide. If 10% of your labels are wrong, the model is trained to reproduce those errors, and it will often do so with high confidence. This is why data quality is often the single biggest lever in a fine-tuning project, more impactful than model size, learning rate, or adapter rank.
Common Data Quality Issues
| Issue | Impact | Mitigation |
|---|---|---|
| Mislabeled examples | Model learns incorrect patterns | Multi-annotator review, agreement scoring |
| Distribution skew | Over-fits to common cases, fails on rare ones | Stratified sampling, targeted augmentation |
| Inconsistent formatting | Unstable output format in production | Style guides, automated validation |
| Data leakage | Inflated eval metrics, poor real-world performance | Strict train/eval splits, temporal splits |
| Too little diversity | Brittle to novel inputs | Adversarial examples, out-of-distribution test sets |
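Of the issues in the table, data leakage is one of the easiest to screen for automatically. The sketch below flags evaluation prompts that also appear in the training set after light normalization. It assumes examples are dicts with a `"prompt"` field, as in the SFT format used elsewhere in this guide; the helper names are hypothetical, and exact-match screening will miss paraphrased leaks.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(text.lower().split())


def find_leaked_prompts(train_examples, eval_examples, key="prompt"):
    """Return eval prompts whose normalized text also appears in training data."""
    train_prompts = {normalize(ex[key]) for ex in train_examples}
    return [ex[key] for ex in eval_examples if normalize(ex[key]) in train_prompts]


train = [
    {"prompt": "Classify the sentiment: 'Great product!'", "completion": "Positive"},
    {"prompt": "Summarize the article.", "completion": "..."},
]
evaluation = [
    {"prompt": "classify the sentiment:  'great product!'", "completion": "Positive"},
    {"prompt": "Translate to French: hello", "completion": "bonjour"},
]

leaked = find_leaked_prompts(train, evaluation)
print(f"{len(leaked)} of {len(evaluation)} eval prompts also appear in training data")
```

Any non-zero count means the evaluation score is partly measuring memorization, so the overlapping examples should be removed from one side before reporting results.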
How Much Data Is Enough?
There is no universal number. For LoRA on a well-pretrained base model, as few as 200–500 high-quality examples can produce meaningful behavior change for narrow tasks (style, format, tone). For broader domain adaptation, thousands to tens of thousands are typical. The diminishing-returns curve is steep: cleaning 500 examples is almost always more impactful than collecting 5,000 noisy ones.
See Topic 5: When FT Is Worth It for dataset readiness as part of the fine-tuning decision.
Python — Data Quality Checks
import json
from collections import Counter


def audit_dataset(filepath):
    """Run basic quality checks on a JSONL fine-tuning dataset."""
    with open(filepath, encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not break parsing
        examples = [json.loads(line) for line in f if line.strip()]

    # Count distinct prompts that appear more than once
    prompts = [ex.get("prompt", "") for ex in examples]
    dupes = sum(1 for c in Counter(prompts).values() if c > 1)

    # Count empty or missing completions
    completions = [ex.get("completion", "") for ex in examples]
    empty = sum(1 for c in completions if not c.strip())

    # Completion length distribution (guard against an empty file)
    lengths = [len(c) for c in completions]
    avg_len = sum(lengths) / len(lengths) if lengths else 0

    print(f"Total examples: {len(examples)}")
    print(f"Duplicate prompts: {dupes}")
    print(f"Empty completions: {empty}")
    print(f"Avg completion len: {avg_len:.0f} chars")
    return {"total": len(examples), "duplicates": dupes, "empty": empty, "avg_len": avg_len}
Is it better to have a small perfect dataset or a large noisy one?
Can you use LLM-generated data for fine-tuning?
How do you handle class imbalance in fine-tuning data?
When to Avoid Fine-Tuning
The Discipline Signal
In interviews, saying "we should not fine-tune here" signals stronger engineering judgment than saying "let's fine-tune." It shows you understand the intervention ladder and can diagnose where the real bottleneck is. Good engineers do not optimize the wrong layer of the stack.
Fine-Tuning vs Alternatives
| Problem | Wrong Solution | Right Solution |
|---|---|---|
| Model lacks domain facts | Fine-tune on domain documents | Build a RAG pipeline |
| Output format is inconsistent | Fine-tune for formatting | Structured output schemas + validation |
| Model ignores instructions | Fine-tune on examples | Improve system prompt + few-shot examples |
| Responses are too long/short | Fine-tune for length | Add explicit length constraints to prompt |
The Cost of Getting It Wrong
An unnecessary fine-tuning project wastes not just compute but also engineering time, creates a model that needs ongoing maintenance, and can introduce catastrophic forgetting or lifecycle costs that compound over time. The opportunity cost of fixing the wrong layer is often the largest hidden cost.
Python — Bottleneck Diagnosis
def diagnose_bottleneck(eval_results):
    """
    Analyze evaluation failures to identify the actual bottleneck
    before committing to fine-tuning.

    Each result is a dict with an "error_type" label assigned during
    failure review. Returns True only if behavior gaps dominate.
    """
    categories = {
        "knowledge_gap": [],     # Model lacks facts -> RAG
        "format_issue": [],      # Wrong format -> structured output
        "instruction_miss": [],  # Ignores instructions -> prompt
        "behavior_gap": [],      # Persistent style/skill issue -> FT
    }
    for result in eval_results:
        error_type = result.get("error_type")
        if error_type == "factual":
            categories["knowledge_gap"].append(result)
        elif error_type == "format":
            categories["format_issue"].append(result)
        elif error_type == "ignored_constraint":
            categories["instruction_miss"].append(result)
        else:
            categories["behavior_gap"].append(result)

    for cat, items in categories.items():
        print(f"  {cat}: {len(items)} failures")

    # Only recommend FT if behavior_gap strictly outnumbers every other category
    behavior = len(categories["behavior_gap"])
    others = max(len(v) for k, v in categories.items() if k != "behavior_gap")
    return behavior > others
What if we need both RAG and fine-tuning?
How do you convince a team that wants to fine-tune not to?
The operational realities of running fine-tuned models in production — what can go wrong, how to evaluate, and how to manage costs over time.
Catastrophic Forgetting
Why It Happens
Neural networks store knowledge distributed across many parameters. When fine-tuning updates those parameters for a new task, it can overwrite the representations that encoded prior capabilities. The more aggressive the fine-tuning (more epochs, higher learning rate, narrower data), the worse the forgetting.
Mitigation Strategies
| Strategy | How It Works | Trade-off |
|---|---|---|
| PEFT / LoRA | Freeze base weights, train only adapters | Preserves base well, may limit adaptation depth |
| Balanced data mix | Include general-purpose examples alongside task data | Slows convergence on the target task |
| Lower learning rate | Smaller updates preserve more prior knowledge | Requires more training steps |
| Early stopping | Stop before the model over-specializes | May leave target quality on the table |
| Elastic Weight Consolidation | Penalize changes to important prior-task weights | Adds training complexity |
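Elastic Weight Consolidation is the least self-explanatory row in the table, so a small sketch may help. The penalty adds a quadratic cost for moving each weight away from its pre-fine-tuning value, weighted by an importance estimate (the diagonal of the Fisher information in the original formulation). The version below uses flat Python lists and invented numbers purely for illustration; `ewc_penalty` is a hypothetical helper, and in practice this term is added to the task loss inside a framework such as PyTorch.

```python
def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


anchor = [1.0, 2.0]   # weights before fine-tuning
fisher = [5.0, 0.1]   # importance: first weight matters far more to prior tasks

# Moving the important weight by 1.0 is penalized heavily...
print(ewc_penalty([2.0, 2.0], anchor, fisher, lam=2.0))
# ...while the same move on the unimportant weight is almost free
print(ewc_penalty([1.0, 3.0], anchor, fisher, lam=2.0))
```

The optimizer is therefore steered toward satisfying the new task with weights the earlier capabilities do not depend on, at the cost of estimating and storing the importance values.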
Evaluating for Forgetting
The critical discipline is to evaluate both new and retained capabilities after fine-tuning. Run the fine-tuned model on a held-out general-purpose benchmark alongside your task-specific evaluation. If general scores drop significantly, you are trading breadth for depth and need to decide whether that trade is acceptable. See Topic 9: Evaluating Fine-Tuned Models for the full evaluation framework.
Python — Forgetting Detection
def detect_forgetting(base_scores, finetuned_scores, threshold=0.05):
    """
    Compare base model vs fine-tuned model on general benchmarks.
    Flag any capability where performance dropped significantly.
    Scores are expected as fractions in [0, 1].
    """
    regressions = {}
    for benchmark, base_score in base_scores.items():
        if benchmark not in finetuned_scores:
            # A missing result is not evidence of regression; report and skip
            print(f"SKIPPED: {benchmark} has no fine-tuned score")
            continue
        ft_score = finetuned_scores[benchmark]
        delta = ft_score - base_score
        if delta < -threshold:
            regressions[benchmark] = {
                "base": base_score,
                "finetuned": ft_score,
                "regression": abs(delta),
            }
            print(f"WARNING: {benchmark} dropped {abs(delta):.1%}")
    if not regressions:
        print("No significant regressions detected.")
    return regressions
Does LoRA fully prevent catastrophic forgetting?
Can you recover forgotten capabilities without retraining from scratch?
Evaluating a Fine-Tuned Model Before Release
The Three-Way Comparison
A mature evaluation compares three systems: (1) the base model with the best prompt, (2) the fine-tuned model, and (3) the cheapest non-fine-tuned alternative that meets requirements. This triangulation reveals whether fine-tuning truly earned its complexity or whether a simpler approach would suffice.
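The three-way comparison reduces to a simple selection rule: among the systems that clear the quality bar, prefer the cheapest. The sketch below encodes that rule. The field names, the quality threshold, and every number are placeholders invented for illustration, not measurements.

```python
def pick_simplest_passing(candidates, min_quality):
    """Return the cheapest candidate whose quality meets the bar, or None."""
    passing = [c for c in candidates if c["quality"] >= min_quality]
    if not passing:
        return None
    return min(passing, key=lambda c: c["cost"])


# Placeholder values only; substitute real evaluation results
candidates = [
    {"name": "base model + best prompt", "quality": 0.81, "cost": 4.0},
    {"name": "fine-tuned model", "quality": 0.90, "cost": 1.5},
    {"name": "cheapest non-fine-tuned alternative", "quality": 0.86, "cost": 1.0},
]

choice = pick_simplest_passing(candidates, min_quality=0.85)
print(choice["name"] if choice else "Nothing meets the bar")
```

In this made-up scenario the fine-tuned model scores highest but is not selected, because a cheaper option already meets the requirement. That is the kind of outcome the triangulation is meant to surface.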
Evaluation Dimensions
| Dimension | What to Measure | Why It Matters |
|---|---|---|
| Task accuracy | Precision, recall, F1 on target task | Did fine-tuning actually improve the target? |
| Format stability | Schema compliance rate | Inconsistent formats break downstream pipelines |
| Safety behavior | Refusal rates on harmful prompts | Fine-tuning can erode safety alignment |
| General capability | Scores on MMLU, HellaSwag, etc. | Detects catastrophic forgetting |
| Realistic prompts | Performance on production-like inputs | Training-like prompts may not match real usage |
| Latency | p50, p99 response times | Fine-tuning should not degrade serving speed |
Red Flags in Evaluation
- Only training-like prompts tested: The model may memorize training patterns without generalizing.
- No baseline comparison: You cannot claim improvement without measuring what you started from.
- Safety not re-tested: Fine-tuning can subtly erode refusal behavior even when the training data contains no harmful examples.
Python — Evaluation Harness
def evaluate_fine_tuned_model(model, test_sets, run_accuracy, run_safety_check, check_format):
    """
    Comprehensive evaluation covering target task,
    general capabilities, and safety behavior.

    run_accuracy, run_safety_check, and check_format are caller-supplied
    scoring functions that take (model, dataset). They are placeholders
    here, not part of any specific library.
    """
    results = {}

    # Target task evaluation
    target = test_sets["target"]
    results["target_accuracy"] = run_accuracy(model, target)

    # General capability regression check
    for bench_name, bench_data in test_sets["general"].items():
        results[f"general_{bench_name}"] = run_accuracy(model, bench_data)

    # Safety refusal check
    safety = test_sets["safety"]
    results["refusal_rate"] = run_safety_check(model, safety)

    # Format compliance on structured output tasks
    format_tests = test_sets["format"]
    results["format_compliance"] = check_format(model, format_tests)

    return results
How large should the evaluation set be?
Should you use automated metrics or human evaluation?
What if the fine-tuned model is better on the target but worse on safety?
Alignment and Fine-Tuning
Alignment Is Not Just Politeness
A common misconception is that alignment means making the model polite or adding refusals. In practice, alignment is about steering the model toward useful, appropriate, and policy-consistent behavior in the context of real applications. This includes:
- Helpfulness — Actually solving the user's problem, not just being safe
- Truthfulness — Acknowledging uncertainty rather than hallucinating
- Policy compliance — Following organizational rules about data handling, tone, and scope
- Harmlessness — Avoiding outputs that could cause real-world harm
Fine-Tuning's Role in Alignment
Fine-tuning (especially preference optimization) is how alignment gets baked into the model's weights. But weight-level alignment is only one layer. Production systems also need:
| Layer | Mechanism | What It Catches |
|---|---|---|
| Model weights | RLHF / DPO fine-tuning | Broad behavioral tendencies |
| System prompt | Instructions and constraints | Task-specific policies |
| Input filters | Content classification before the model | Obviously harmful requests |
| Output filters | Post-generation safety checks | Edge cases the model mishandles |
| Retrieval constraints | Limiting what knowledge the model accesses | Data boundary violations |
The Alignment Tax
Fine-tuning for alignment can reduce performance on narrow benchmarks. A model trained to refuse harmful requests may also over-refuse legitimate ones. A model trained for safety may become less creative. This tension is real and requires careful calibration — alignment is not a switch but a dial.
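One way to make this dial measurable is to track two rates side by side: how often the model refuses genuinely harmful prompts, and how often it refuses legitimate ones. The sketch below assumes each evaluation record already carries two boolean labels, `is_harmful` and `refused`; both field names and the sample records are hypothetical.

```python
def refusal_rates(results):
    """
    results: list of dicts with boolean fields 'is_harmful' and 'refused'.
    Returns the refusal rate on harmful prompts (higher is better) and the
    over-refusal rate on benign prompts (lower is better).
    """
    def rate(rows):
        return sum(r["refused"] for r in rows) / len(rows) if rows else None

    harmful = [r for r in results if r["is_harmful"]]
    benign = [r for r in results if not r["is_harmful"]]
    return {
        "harmful_refusal_rate": rate(harmful),
        "over_refusal_rate": rate(benign),
    }


sample = [
    {"is_harmful": True, "refused": True},
    {"is_harmful": True, "refused": True},
    {"is_harmful": False, "refused": False},
    {"is_harmful": False, "refused": True},  # a legitimate request that was refused
]
rates = refusal_rates(sample)
print(rates)
```

Reporting only the first number hides the tax; a model can reach a perfect harmful-refusal rate simply by refusing everything.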
Python — Multi-Layer Alignment Check
def alignment_check(prompt, model, guardrails, retrieval_policy):
    """
    Alignment is enforced at multiple layers, not just model weights.
    Each layer catches different failure modes.
    """
    # Layer 1: Input filter (block obviously harmful requests)
    if guardrails.is_blocked_input(prompt):
        return "I cannot help with that request."

    # Layer 2: Retrieval constraint (limit knowledge scope)
    context = retrieval_policy.get_permitted_context(prompt)

    # Layer 3: Model generation (alignment baked into weights)
    response = model.generate(prompt, context=context)

    # Layer 4: Output filter (catch edge cases)
    if guardrails.is_blocked_output(response):
        return "I need to rephrase my response."

    return response
Can fine-tuning undo alignment from the base model?
How do you balance helpfulness and safety in alignment?
Cost Trade-Offs in Fine-Tuning Projects
Upfront vs Lifecycle Costs
Teams commonly focus on training cost (GPU hours, cloud spend) and forget the long-term burden. A fine-tuned model needs ongoing evaluation, periodic retraining as the base model evolves, version management, and governance review. These lifecycle costs often exceed the initial training investment within 6–12 months.
Cost Comparison by Method
| Method | Upfront Cost | Ongoing Cost | Operational Burden |
|---|---|---|---|
| Prompt engineering | Very low | Very low | Minimal |
| RAG | Moderate (infra) | Moderate (data refresh) | Index maintenance |
| LoRA / PEFT | Low–moderate | Low | Adapter versioning |
| Full fine-tuning | High | High | Full model lifecycle management |
| Distillation | Very high | Moderate | Student model serving |
When the ROI Is Clear
Fine-tuning has the strongest ROI when:
- High volume — Amortizes the upfront cost across millions of requests
- Latency reduction — A smaller fine-tuned model replaces a larger base model + long prompt
- Consistent behavior — The same formatting, tone, or policy must apply every time
- Cost per request — Fine-tuned model uses fewer tokens (no long system prompt)
See Topic 5: When FT Is Worth It for the readiness checklist.
Python — ROI Calculator
def fine_tuning_roi(
    monthly_requests,
    base_cost_per_request,     # cost with prompting approach
    ft_cost_per_request,       # cost with fine-tuned model
    training_cost,             # one-time fine-tuning cost
    monthly_maintenance_cost,  # ongoing evaluation + retraining
):
    """Calculate months to break even on fine-tuning investment."""
    monthly_savings = monthly_requests * (base_cost_per_request - ft_cost_per_request)
    net_monthly_gain = monthly_savings - monthly_maintenance_cost
    if net_monthly_gain <= 0:
        print("Fine-tuning does NOT pay for itself at this volume.")
        return float("inf")
    months = training_cost / net_monthly_gain
    print(f"Monthly savings: ${monthly_savings:,.2f}")
    print(f"Net monthly gain: ${net_monthly_gain:,.2f}")
    print(f"Break-even: {months:.1f} months")
    return months