Chapter 13: Fine-Tuning Your Own Models¶
"You can prompt a model to act smart. But when it needs to be fluent in your domain? That’s when you teach it."
At some point, you’ll realize: no prompt is clever enough to fully overcome a model's limitations when it wasn’t trained for your context. If your chatbot has to speak like a lawyer, diagnose like a doctor, or respond like your company’s internal support rep, it’s time for fine-tuning.
This chapter teaches you how to adapt a pretrained LLM—like LLaMA, Mistral, or Falcon—to specialized tasks and tones using modern fine-tuning techniques. We'll focus especially on LoRA (Low-Rank Adaptation), the gold standard for low-cost, high-efficiency training.
Whether you're updating just the output layer or teaching a model your internal knowledge base, this is how you give your chatbot a true voice of its own.
When Should You Fine-Tune?¶
✅ You want responses that match a specific tone, format, or persona
✅ Your chatbot must operate in a specialized domain (legal, finance, medicine, etc.)
✅ You have structured Q&A pairs, documents, or dialogues for training
✅ You want to reduce the need for heavy prompts at inference time
✅ Prompt engineering can’t reach the level of fluency you need
If you’re mostly happy with the base model but want subtle shifts, consider prompt tuning or embedding-based retrieval (RAG). Otherwise—fine-tuning is the key.
Fine-Tuning Methods Overview¶
Method | Description | Use Case | Resource Need |
---|---|---|---|
LoRA | Injects learnable adapters into layers | Most popular for open LLMs | Moderate (1 GPU) |
QLoRA | LoRA + 4-bit quantization | Memory-efficient fine-tuning | Low (16GB GPU) |
Full Fine-Tune | Retrains all model weights | Rare—used for large datasets | Very high (A100s) |
Instruction Tuning | Fine-tunes on instruction + response examples | Great for chatbots | Varies (usually done via LoRA) |
PEFT | Parameter-Efficient Fine-Tuning (umbrella term) | Includes LoRA, Adapters, Prefix Tuning | Low to moderate |
We’ll focus on LoRA + QLoRA with the PEFT library (Hugging Face), which allows you to fine-tune large models on a single A100 or even a 3090.
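For intuition: LoRA leaves each targeted weight matrix W frozen and learns only a small low-rank update on top of it (this is the standard LoRA formulation):

$$
W' = W + \frac{\alpha}{r}\,B A, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\; A \in \mathbb{R}^{r \times d_{\text{in}}},\; r \ll d
$$

Only A and B are trained. With r = 8 on the attention projections of a 7B-parameter model, that is well under 1% of the weights, which is why a single GPU is enough.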
Dataset Formats¶
Format Type | Example | Used In |
---|---|---|
Alpaca-style | {"instruction": "...", "input": "...", "output": "..."} | Chat / instruction tuning |
Plain Q&A | {"question": "...", "answer": "..."} | FAQ bots, support bots |
JSONL / CSV | Structured columns or key-value pairs | General fine-tuning |
Dialogues | List of alternating user/assistant messages | Multi-turn chatbots |
Clean, well-formatted data is more important than volume. Even 1,000–10,000 examples of high-quality pairs can beat massive but noisy datasets.
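For concreteness, here is a minimal sketch that writes a couple of hypothetical Alpaca-style records to the your_dataset.json file used later in this chapter (the record content is made up; substitute your own examples):

```python
import json

# Hypothetical Alpaca-style training records -- replace with your own data
records = [
    {
        "instruction": "Answer as our internal support rep.",
        "input": "How do I reset my password?",
        "output": "Go to Settings > Security > Reset Password and follow the email link.",
    },
    {
        "instruction": "Answer as our internal support rep.",
        "input": "Which plans include SSO?",
        "output": "SSO is available on the Business and Enterprise plans.",
    },
]

# Write as JSON Lines (one record per line), which load_dataset("json", ...) accepts
with open("your_dataset.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```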
Implementation (QLoRA + PEFT)¶
Let’s fine-tune mistralai/Mistral-7B-Instruct-v0.1 on your data using Hugging Face tools.
Install Dependencies¶
pip install transformers datasets accelerate peft bitsandbytes trl
Load Model with 4-bit Quantization¶
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantize the base model to 4-bit so the 7B weights fit on a single GPU
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype="float16")

# device_map="auto" spreads layers across the available GPU(s)/CPU automatically
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
Add LoRA Layers with PEFT¶
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Prepare the quantized model for training (casts norms to fp32, enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor, applied as alpha / r
    target_modules=["q_proj", "v_proj"],  # attach adapters to the attention query/value projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
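It's worth sanity-checking how little of the model is actually trainable once the adapters are attached; PEFT provides a helper for this:

```python
# Prints trainable vs. total parameters -- with r=8 on q_proj/v_proj,
# expect only a few million trainable parameters (well under 1% of the 7B total)
model.print_trainable_parameters()
```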
Load Dataset (Alpaca Format)¶
from datasets import load_dataset

# Expects Alpaca-style records (instruction / input / output) in JSON or JSON Lines
data = load_dataset("json", data_files="your_dataset.json")
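The SFTTrainer used below trains on a single text field, so the Alpaca-style columns need to be flattened into one prompt string first. Here is a minimal sketch; the template shown is just one common convention, so use whichever template you standardize on and keep it identical at inference time:

```python
# Flatten each record into a single "text" column for SFTTrainer.
# The template below is only an example -- reuse whatever you pick verbatim at inference time.
def format_example(example):
    example["text"] = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )
    return example

data = data.map(format_example)
```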
Training¶
Use transformers.Trainer or, for simplicity, trl.SFTTrainer:
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./mistral-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=10,
    save_total_limit=2,        # keep only the two most recent checkpoints
    save_strategy="epoch",
    bf16=True,                 # use fp16=True instead on GPUs without bfloat16 support
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=data["train"],
    dataset_text_field="text",  # the flattened prompt column built above
    args=training_args,
)

trainer.train()
Saving and Using the Model¶
After training:
model.save_pretrained("./mistral-finetuned")      # saves only the LoRA adapter weights and config
tokenizer.save_pretrained("./mistral-finetuned")
You can now deploy this model using the same FastAPI/Docker method from Chapter 12.
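Because only the adapter weights were saved, whatever serves the model must reload the base model and attach the adapter on top. A minimal sketch (the prompt is a hypothetical example; you can also re-apply the 4-bit config from earlier to save memory):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Reload the base model, then attach the saved LoRA adapter
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "./mistral-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./mistral-finetuned")

# Quick smoke test -- use the same prompt template you trained with (hypothetical example)
prompt = "### Instruction:\nSummarize our refund policy.\n\n### Input:\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```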
Common Pitfalls¶
Issue | Solution |
---|---|
OOM (Out of Memory) errors | Use QLoRA, smaller batch, or gradient checkpointing |
Model trains but doesn't pick up the task | Ensure consistent prompt formatting and padding |
Slow convergence | Lower learning rate, cleaner data |
Inference mismatch | Use the same prompt template in training and inference |
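For the OOM row in particular, the usual levers are gradient checkpointing and trading batch size for gradient accumulation. A minimal sketch reusing the training arguments from above (the specific numbers are just an illustration):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./mistral-finetuned",
    per_device_train_batch_size=1,   # down from 4 ...
    gradient_accumulation_steps=4,   # ... but 1 x 4 keeps the same effective batch size
    gradient_checkpointing=True,     # recompute activations in the backward pass to save memory
    bf16=True,
)
```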
Summary¶
Fine-tuning is how you embed your company’s brain into an LLM. With LoRA and quantization, it’s now feasible to train on a laptop with a decent GPU or a low-cost cloud instance. You can steer tone, tighten reasoning, and build agents that go far beyond general-purpose capabilities.
Next: Once your model is trained, how do you make it faster, smaller, and cheaper to serve? Time for serious optimization.