Chapter 7: Embeddings and Retrieval-Augmented Generation (RAG)


While prompts guide the LLM’s behavior, they alone can’t store long-term memory or domain-specific knowledge. What happens when your chatbot needs to reference company documents, FAQs, or proprietary knowledge? That’s where RAG (Retrieval-Augmented Generation) comes in.

This chapter explores how to give your chatbot memory—by using embeddings, vector databases, and a smart retrieval pipeline. You’ll learn how to embed documents, store them efficiently using pgvector in Supabase, and query relevant context at runtime to inject into your prompts.


7.1 What Is Retrieval-Augmented Generation (RAG)?

RAG is a hybrid technique combining information retrieval and language generation. The idea is simple:

“Don’t make the model guess. Let it look up information first.”

RAG Flow Overview:

  1. User sends a question.
  2. Input is embedded into a high-dimensional vector.
  3. Relevant documents are retrieved via vector similarity search.
  4. Retrieved context is injected into the prompt sent to the LLM.
  5. LLM responds with an answer grounded in that context.

This tight coupling of storage and generation minimizes hallucinations and enables domain-specific accuracy.
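
As a minimal sketch, the five steps fit in a few lines of Python; embed_text, search_documents, and call_llm are hypothetical placeholders that the rest of this chapter fills in with real implementations:

def answer(question: str) -> str:
    query_vector = embed_text(question)           # step 2: embed the input
    chunks = search_documents(query_vector, k=5)  # step 3: vector similarity search
    context = "\n".join(chunks)                   # step 4: inject retrieved context
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)                       # step 5: grounded answer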


7.2 Chunking: Preparing Text for Embedding

Before embedding your knowledge base, you must break it into digestible pieces. This is known as chunking.

Common Chunking Strategies:

  • Fixed-size (e.g., 500 tokens): simple to implement, but may split sentences awkwardly.
  • Sentence-based: preserves grammatical structure, but chunk lengths vary.
  • Overlapping windows: improve retrieval accuracy, but increase storage size.
  • Recursive splitting (based on headings): preserves hierarchical context, but is more complex to implement.

💡 Use a hybrid strategy: split by paragraph → add sliding window overlap of 10–20% for better semantic coverage.
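
Here is a minimal sketch of that hybrid strategy; splitting on blank lines and a 15% character-based overlap are assumptions you can tune:

def chunk_text(text: str, overlap_ratio: float = 0.15) -> list[str]:
    # Split on blank lines so each paragraph becomes a base chunk.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for i, paragraph in enumerate(paragraphs):
        overlap = int(len(paragraphs[i - 1]) * overlap_ratio) if i > 0 else 0
        # Sliding window: carry the tail of the previous paragraph forward so
        # content that straddles a paragraph boundary is still retrievable.
        tail = paragraphs[i - 1][-overlap:] if overlap else ""
        chunks.append((tail + " " + paragraph).strip())
    return chunks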


7.3 Embedding with text-embedding-3-small (OpenAI)

Once you have chunks, each is converted into an embedding—a numerical vector representing its semantic meaning.

Example Python Code (using OpenAI):

from openai import OpenAI

client = OpenAI(api_key="your-api-key")  # or set the OPENAI_API_KEY environment variable

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How to reset a printer to factory settings?"
)

embedding = response.data[0].embedding  # a list of 1536 floats

Why text-embedding-3-small?

  • Fast and affordable.
  • High semantic performance.
  • Compact vector size (1536 dims).
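
When embedding an entire knowledge base, pass a list of chunks in a single request; the API returns one embedding per input, in order. A short sketch reusing the client from above:

chunks = ["How to connect to WiFi?", "How to reset a printer to factory settings?"]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks  # a list of strings; embeddings come back in input order
)

vectors = [item.embedding for item in response.data]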

7.4 Setting Up Supabase with pgvector

Supabase is a developer-friendly PostgreSQL-as-a-service platform. When paired with the pgvector extension, it becomes a powerful vector store.

Step-by-Step:

  1. Create a Supabase project.

  2. Enable the pgvector extension:

create extension if not exists vector;

  3. Create an embeddings table:

create table documents (
  id uuid primary key,
  content text,
  embedding vector(1536)
);

  4. Insert documents and embeddings:

insert into documents (id, content, embedding)
values (
  gen_random_uuid(),
  'How to connect to WiFi?',
  '[0.021, 0.894, ...]'
);

  5. Query by similarity:

select content
from documents
order by embedding <=> '[user_embedding]'
limit 5;

📌 In pgvector, <=> is cosine distance and <-> is Euclidean (L2) distance. OpenAI embeddings are unit-normalized, so both operators rank results identically, but <=> makes the cosine intent explicit.
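
If you prefer to stay in Python, the insert in step 4 can also go through the supabase-py client. A minimal sketch; the project URL and key are placeholders, and get_embedding is the helper defined in Section 7.5:

import uuid
from supabase import create_client

supabase = create_client("https://<project>.supabase.co", "<service-role-key>")  # placeholders

def store_document(content: str) -> None:
    supabase.table("documents").insert({
        "id": str(uuid.uuid4()),              # the table defines no default for id
        "content": content,
        "embedding": get_embedding(content),  # pgvector accepts the JSON array
    }).execute()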


7.5 Vector Retrieval in the Backend

You’ll build a function that:

  1. Embeds the user query.
  2. Sends a similarity query to Supabase.
  3. Returns the top-k matching chunks.

Python Example:

def retrieve_context(user_query: str) -> str:
    # 1. Embed the user query.
    query_embedding = get_embedding(user_query)
    # 2. Retrieve the top-k most similar chunks from Supabase.
    results = query_supabase_top_k(query_embedding)
    # 3. Join the chunks into a single context string.
    return "\n".join(r["content"] for r in results)
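
The two helpers above aren't defined yet; here is one possible sketch using psycopg2 against Supabase's standard Postgres connection string (the DSN is a placeholder), with the <=> cosine operator from Section 7.4:

import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
conn = psycopg2.connect("postgresql://postgres:<password>@db.<project>.supabase.co:5432/postgres")

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

def query_supabase_top_k(query_embedding: list[float], k: int = 5) -> list[dict]:
    with conn.cursor() as cur:
        # str(query_embedding) renders '[0.021, 0.894, ...]', a valid pgvector literal.
        cur.execute(
            "select content from documents order by embedding <=> %s::vector limit %s",
            (str(query_embedding), k),
        )
        return [{"content": row[0]} for row in cur.fetchall()]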

You can now inject this into the prompt:

context = retrieve_context(user_input)
prompt = f"""
Use the following document context to answer the question.

[Start Context]
{context}
[End Context]

User: {user_input}
"""

7.6 Best Practices for RAG Bots

  • Limit context length: keep injected context under ~1,500 tokens.
  • Clean input: strip HTML, fix typos, and remove irrelevant sections before embedding.
  • Cache embeddings: don't re-embed the same query repeatedly (see the sketch below).
  • Handle missing results: fall back to an LLM-only response if retrieval returns nothing.
  • Track provenance: show sources or titles alongside answers for transparency.
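
For the caching point, a minimal in-process sketch using functools.lru_cache (a production system might use Redis or a database instead); get_embedding_cached is a hypothetical wrapper name:

from functools import lru_cache

@lru_cache(maxsize=1024)
def get_embedding_cached(text: str) -> tuple[float, ...]:
    # Return a tuple so callers can't mutate the cached value.
    return tuple(get_embedding(text))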

7.7 Debugging Retrieval Failures

When RAG doesn't perform as expected:

  • Check if documents were properly chunked.
  • Ensure correct vector dimensionality.
  • Inspect similarity query—are the right chunks being returned?
  • Log the actual injected prompt for inspection.

Use tools like Postgres query logs, vector visualizations (e.g., t-SNE), and similarity scoring to inspect RAG performance.
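
For the similarity-inspection step, a small helper (assuming the conn and get_embedding sketches from Section 7.5) that logs each candidate chunk with its cosine distance:

def debug_retrieval(user_query: str, k: int = 5) -> None:
    query_embedding = get_embedding(user_query)
    with conn.cursor() as cur:
        cur.execute(
            """select content, embedding <=> %s::vector as distance
               from documents order by distance limit %s""",
            (str(query_embedding), k),
        )
        for content, distance in cur.fetchall():
            print(f"{distance:.4f}  {content[:80]}")  # lower distance = closer match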


Conclusion: Giving Your Chatbot a Brain

With RAG, your chatbot evolves from a generic assistant to a domain-aware knowledge bot. By embedding documents and retrieving relevant content during conversations, you build an AI system with grounded knowledge, reduced hallucinations, and improved trustworthiness.

In the next chapter, we’ll turn back to the frontend—bringing this intelligence to users with a sleek React-based chat interface and refined UX features.