
Chapter 24: Transformers in TensorFlow (from Scratch)

Forget recurrence. Attention is all you need.


Transformers marked a revolution in deep learning, replacing RNNs and LSTMs as the new foundation of sequence modeling. Their secret? Self-attention—a mechanism that lets a model focus on the relevant parts of its input, regardless of position.

In this chapter, you will:

  • Understand the intuition behind attention and transformers
  • Explore how positional encoding enables order-awareness
  • Build a simplified transformer encoder block using TensorFlow
  • See how models like BERT and GPT are built on this architecture

Why Transformers?

RNNs process sequences one token at a time, which limits parallelism. Transformers process entire sequences in parallel and use attention to model relationships between tokens.

🔁 From RNN to Transformer

RNN:          word₁ → word₂ → word₃ → ...    (one token at a time)
Transformer:  [word₁, word₂, word₃, ...]     (all at once)


Self-Attention Explained

Self-attention allows the model to weigh the importance of different words when encoding a particular token.

In the sentence “The bank was crowded,” self-attention helps the model decide whether “bank” means a financial institution or a riverbank by attending to contextual words like “crowded.”
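Concretely, each token is projected into a query, a key, and a value vector; a token's output is a weighted sum of all value vectors, with weights obtained by comparing its query against every key:

\[
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]

where d_k is the key dimension. This is exactly the computation implemented in Step 2 below.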


Core Transformer Components

Component              Purpose
Embedding Layer        Convert tokens to dense vectors
Positional Encoding    Add order information
Multi-Head Attention   Attend to different parts of the sequence
Feed-Forward Network   Process attended representations
Residual & LayerNorm   Improve stability and gradient flow

Step-by-Step: Mini Transformer Encoder

Step 1: Positional Encoding

import numpy as np
import tensorflow as tf

def positional_encoding(seq_len, d_model):
    # Sinusoidal encodings from "Attention Is All You Need":
    # sine on even dimensions, cosine on odd dimensions.
    pos = np.arange(seq_len)[:, np.newaxis]          # (seq_len, 1)
    i = np.arange(d_model)[np.newaxis, :]            # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads = pos * angle_rates                   # (seq_len, d_model)

    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])   # even indices: sin
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])   # odd indices: cos

    return tf.cast(angle_rads, dtype=tf.float32)
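The encoding has one row per position and is simply added to the token embeddings. A quick shape check (the sequence length and model size below are illustrative, not values used later):

pe = positional_encoding(seq_len=50, d_model=128)
print(pe.shape)   # (50, 128): one d_model-sized encoding per position

# Added to the token embeddings before the first encoder block, e.g.:
# x = token_embeddings + pe[tf.newaxis, ...]   # broadcast across the batch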

Step 2: Scaled Dot-Product Attention

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    matmul_qk = tf.matmul(q, k, transpose_b=True)   # similarity of each query with each key
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled = matmul_qk / tf.math.sqrt(d_k)          # scale to keep softmax gradients stable

    if mask is not None:
        scaled += (mask * -1e9)                     # push masked positions toward zero weight

    weights = tf.nn.softmax(scaled, axis=-1)        # attention weights over key positions
    return tf.matmul(weights, v)                    # weighted sum of value vectors
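A quick sanity check with random tensors (the shapes are chosen purely for illustration):

# One batch, 4 query positions, 4 key/value positions, depth 8.
q = tf.random.normal((1, 4, 8))
k = tf.random.normal((1, 4, 8))
v = tf.random.normal((1, 4, 8))

out = scaled_dot_product_attention(q, k, v)
print(out.shape)   # (1, 4, 8): one attended vector per query position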

Step 3: Multi-Head Attention Layer

from tensorflow.keras.layers import Dense, Layer

class MultiHeadAttention(Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.depth = d_model // num_heads   # dimension per head
        self.num_heads = num_heads

        # Learned projections for queries, keys, values, and the final output.
        self.Wq = Dense(d_model)
        self.Wk = Dense(d_model)
        self.Wv = Dense(d_model)
        self.dense = Dense(d_model)

    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]
        q = self.split_heads(self.Wq(q), batch_size)
        k = self.split_heads(self.Wk(k), batch_size)
        v = self.split_heads(self.Wv(v), batch_size)

        # Attend in every head in parallel, then merge the heads back together.
        scaled_attention = scaled_dot_product_attention(q, k, v, mask)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat = tf.reshape(scaled_attention, (batch_size, -1, self.num_heads * self.depth))
        return self.dense(concat)
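A quick self-attention check, where the queries, keys, and values all come from the same input (the dimensions below are illustrative):

mha = MultiHeadAttention(d_model=128, num_heads=8)
x = tf.random.normal((2, 10, 128))   # batch of 2 sequences, 10 tokens, d_model = 128
out = mha(x, x, x)                   # self-attention: q = k = v = x
print(out.shape)                     # (2, 10, 128): shape is preserved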


Transformer Encoder Block

from tensorflow.keras.layers import LayerNormalization, Dropout, Dense

class TransformerEncoderBlock(Layer):
    def __init__(self, d_model, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward network, applied to each token independently.
        self.ffn = tf.keras.Sequential([
            Dense(ff_dim, activation='relu'),
            Dense(d_model)
        ])
        self.layernorm1 = LayerNormalization()
        self.layernorm2 = LayerNormalization()
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, x, training=None):
        # Self-attention sub-layer with residual connection and layer norm.
        attn_output = self.att(x, x, x)
        out1 = self.layernorm1(x + self.dropout1(attn_output, training=training))
        # Feed-forward sub-layer with residual connection and layer norm.
        ffn_output = self.ffn(out1)
        return self.layernorm2(out1 + self.dropout2(ffn_output, training=training))
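The block maps a sequence of d_model-dimensional vectors to a sequence of the same shape, so blocks can be stacked freely (the hyperparameters below are illustrative):

block = TransformerEncoderBlock(d_model=128, num_heads=8, ff_dim=512)
x = tf.random.normal((2, 10, 128))
out = block(x, training=False)
print(out.shape)   # (2, 10, 128): same shape in, same shape out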

Training Transformer on Text (Preview)

You can now stack transformer blocks, add token embeddings, and train with model.fit() as usual; a minimal sketch follows the list below. Later chapters (especially Parts IV and V) show this for:

  • Sentiment analysis
  • Text classification
  • Question answering
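A minimal sketch of how such a model might be assembled for binary classification, using the layers built above (the vocabulary size, sequence length, and other hyperparameters are placeholders, not values from later chapters):

from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D, Input
from tensorflow.keras.models import Model

vocab_size, seq_len, d_model = 10000, 128, 64   # placeholder hyperparameters

inputs = Input(shape=(seq_len,))
x = Embedding(vocab_size, d_model)(inputs)
x = x + positional_encoding(seq_len, d_model)             # inject order information
x = TransformerEncoderBlock(d_model, num_heads=4, ff_dim=256)(x)
x = GlobalAveragePooling1D()(x)                           # pool token vectors into one
outputs = Dense(1, activation='sigmoid')(x)               # e.g. binary sentiment

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(train_dataset, epochs=...)                    # train as usual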

BERT and GPT: Built on Transformers

Model   Direction       Use Case
BERT    Bidirectional   Classification, QA
GPT     Left-to-right   Generation, Autocomplete

You can fine-tune these models with the transformers library from Hugging Face, or build them from scratch (see the next chapter for applications).
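For example, a minimal fine-tuning sketch (the checkpoint name, toy data, and hyperparameters are illustrative, and it assumes a recent version of the transformers library with TensorFlow support):

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

texts = ["great movie", "terrible plot"]                  # toy examples
labels = tf.constant([1, 0])
enc = dict(tokenizer(texts, padding=True, truncation=True, return_tensors="tf"))

# Recent transformers versions supply the model's own loss when none is passed.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))
model.fit(enc, labels, epochs=1)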


Summary

In this chapter, you:

  • Understood how attention replaces recurrence
  • Built a mini transformer encoder using TensorFlow
  • Explored key components such as positional encoding and multi-head attention
  • Learned how modern models like BERT and GPT are architecturally composed

Transformers are the foundation of state-of-the-art NLP and many vision models. You’ve now walked through the gears that power them.