
Chapter 13: Fine-Tuning Your Own Models

"You can prompt a model to act smart. But when it needs to be fluent in your domain? That’s when you teach it."

At some point, you’ll realize: no prompt is clever enough to fully overcome a model's limitations when it wasn’t trained for your context. If your chatbot has to speak like a lawyer, diagnose like a doctor, or respond like your company’s internal support rep, it’s time for fine-tuning.

This chapter teaches you how to adapt a pretrained LLM, such as LLaMA, Mistral, or Falcon, to specialized tasks and tones using modern fine-tuning techniques. We'll focus especially on LoRA (Low-Rank Adaptation), the most widely used approach for low-cost, parameter-efficient training.

Whether you're updating just the output layer or teaching a model your internal knowledge base, this is how you give your chatbot a true voice of its own.


When Should You Fine-Tune?

✅ You want responses that match a specific tone, format, or persona
✅ Your chatbot must operate in a specialized domain (legal, finance, medicine, etc.)
✅ You have structured Q&A pairs, documents, or dialogues for training
✅ You want to reduce the need for heavy prompts at inference time
✅ Prompt engineering can’t reach the level of fluency you need

If you're mostly happy with the base model and only need subtle shifts, consider prompt tuning or embedding-based retrieval (RAG) instead. Otherwise, fine-tuning is the way to go.


Fine-Tuning Methods Overview

| Method | Description | Use Case | Resource Need |
|---|---|---|---|
| LoRA | Injects learnable adapters into layers | Most popular for open LLMs | Moderate (1 GPU) |
| QLoRA | LoRA + 4-bit quantization | Memory-efficient fine-tuning | Low (16 GB GPU) |
| Full Fine-Tune | Retrains all model weights | Rare; used for large datasets | Very high (A100s) |
| Instruction Tuning | Fine-tunes with examples + instructions | Great for chatbots | Common in OSS |
| PEFT | Parameter-Efficient Fine-Tuning (umbrella term) | Includes LoRA, Adapters, Prefix Tuning | Modular & extensible |

We'll focus on LoRA and QLoRA with Hugging Face's PEFT library, which lets you fine-tune large models on a single A100, or even a consumer RTX 3090.


Dataset Formats

| Format Type | Example | Used In |
|---|---|---|
| Alpaca-style | `{"instruction": "...", "input": "...", "output": "..."}` | Chat / instruction tuning |
| Plain Q&A | `{"question": "...", "answer": "..."}` | FAQ bots, support bots |
| JSONL / CSV | Structured columns or key-value pairs | General fine-tuning |
| Dialogues | List of alternating user/assistant messages | Multi-turn chatbots |

Clean, well-formatted data matters more than volume: even 1,000–10,000 high-quality pairs can beat a massive but noisy dataset. An illustrative Alpaca-style file is shown below.
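
For reference, an Alpaca-style your_dataset.json is simply a list of records with those three keys. The content below is purely illustrative:

[
  {
    "instruction": "Summarize the customer's issue in one sentence.",
    "input": "My March invoice was charged twice and support hasn't replied.",
    "output": "The customer was double-charged on their March invoice and has not heard back from support."
  },
  {
    "instruction": "Explain the refund policy in a friendly tone.",
    "input": "",
    "output": "We're happy to refund any order within 30 days of purchase - just reply with your order number."
  }
]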


Implementation (QLoRA + PEFT)

Let’s fine-tune mistralai/Mistral-7B-Instruct-v0.1 on your data using Hugging Face tools.

Install Dependencies

pip install transformers datasets accelerate peft bitsandbytes trl

Load Model with 4-bit Quantization

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit weights keep the 7B model within a single 16-24 GB GPU
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype="float16")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1", quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer has no pad token; reuse EOS so padded batches work
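
If you want to follow the QLoRA paper's recommended settings more closely, BitsAndBytesConfig also exposes NF4 quantization and double quantization. A minimal variant of the config above (same import as before):

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4, the 4-bit data type proposed in the QLoRA paper
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants to save a bit more memory
    bnb_4bit_compute_dtype="float16",
)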

Add LoRA Layers with PEFT

from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
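
Before training, it's worth checking how much of the model is actually trainable; PEFT models provide a helper for this:

model.print_trainable_parameters()
# With r=8 on q_proj/v_proj of a 7B model this reports only a few million trainable
# parameters, i.e. well under 0.1% of the total (exact numbers depend on the config).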

Load Dataset (Alpaca Format)

from datasets import load_dataset

data = load_dataset("json", data_files="your_dataset.json")
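
SFTTrainer trains on a single text field per example, so it helps to flatten the Alpaca-style fields into one prompt string up front. A minimal sketch (format_example and the prompt template are illustrative; recent TRL releases look for a "text" column by default, while older ones need it named explicitly via dataset_text_field):

def format_example(example):
    # Collapse instruction / input / output into a single "text" column (Alpaca-style prompt)
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get("input"):
        prompt += f"### Input:\n{example['input']}\n\n"
    prompt += f"### Response:\n{example['output']}"
    return {"text": prompt}

data = data.map(format_example)

Whatever template you choose here, use exactly the same one at inference time.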

Training

Use transformers.Trainer, or trl.SFTTrainer for simplicity (note that SFTTrainer's exact keyword arguments have shifted between TRL releases, so check the docs for the version you have installed):

from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./mistral-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=10,
    save_total_limit=2,
    save_strategy="epoch",
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=data["train"],
    args=training_args,
)

trainer.train()

Saving and Using the Model

After training:

model.save_pretrained("./mistral-finetuned")
tokenizer.save_pretrained("./mistral-finetuned")
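
To use the adapter later, load the base model again and attach the saved LoRA weights with PEFT. A minimal inference sketch, assuming the paths used above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then wrap it with the trained LoRA adapter
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./mistral-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./mistral-finetuned")

prompt = "### Instruction:\nSummarize our refund policy.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

If you'd rather ship a single set of weights, model.merge_and_unload() folds the LoRA deltas back into the base model before saving.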

You can now deploy this model using the same FastAPI/Docker method from Chapter 12.


Common Pitfalls

| Issue | Solution |
|---|---|
| OOM (out-of-memory) errors | Use QLoRA, a smaller batch, or gradient checkpointing (see the sketch below) |
| Model trains but "forgets" the behavior at inference | Ensure consistent prompt formatting and padding |
| Slow convergence | Lower the learning rate, clean the data |
| Inference mismatch | Match the prompt_template used in training and inference |
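
For the OOM row in particular, the usual first step is to shrink the per-device batch and compensate with gradient accumulation, optionally enabling gradient checkpointing. A sketch of the relevant TrainingArguments (reusing the import from the training step; the values are illustrative):

training_args = TrainingArguments(
    output_dir="./mistral-finetuned",
    per_device_train_batch_size=1,      # smaller micro-batch to fit in memory
    gradient_accumulation_steps=8,      # effective batch size of 1 x 8 = 8
    gradient_checkpointing=True,        # recompute activations to trade speed for memory
    num_train_epochs=3,
    learning_rate=2e-4,
    bf16=True,
)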

Summary

Fine-tuning is how you embed your company’s brain into an LLM. With LoRA and quantization, it’s now feasible to train on a laptop with a decent GPU or a low-cost cloud instance. You can steer tone, tighten reasoning, and build agents that go far beyond general-purpose capabilities.

Next: Once your model is trained, how do you make it faster, smaller, and cheaper to serve? Time for serious optimization.