
Chapter 24: Transformers in TensorFlow (from Scratch)

Forget recurrence. Attention is all you need.


Transformers marked a revolution in deep learning, replacing RNNs and LSTMs as the new foundation of sequence modeling. Their secret? Self-attention—a mechanism that lets a model focus on the relevant parts of its input, regardless of position.

In this chapter, you will:

  • Understand the intuition behind attention and transformers
  • Explore how positional encoding enables order-awareness
  • Build a simplified transformer encoder block using TensorFlow
  • See how models like BERT and GPT are built on this architecture

Why Transformers?

RNNs process sequences one token at a time, which limits parallelism. Transformers process entire sequences in parallel and use attention to model relationships between tokens.

🔁 From RNN to Transformer

RNN:          word₁ → word₂ → word₃ → ...    (one token at a time)
Transformer:  [word₁, word₂, word₃, ...]     (all at once)


Self-Attention Explained

Self-attention allows the model to weigh the importance of different words when encoding a particular token.

In the sentence “The bank was crowded,” self-attention helps the model decide whether “bank” means a financial institution or a riverbank by attending to contextual words like “crowded.”
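Concretely, each token is projected into a query, a key, and a value vector; a token's output is a weighted sum of all value vectors, with weights obtained by comparing its query against every key:

\[
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]

where d_k is the key dimension. This is exactly the computation implemented in Step 2 below.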


Core Transformer Components

Component              Purpose
Embedding Layer        Convert tokens to dense vectors
Positional Encoding    Add order information
Multi-Head Attention   Attend to different parts of the sequence
Feed-Forward Network   Process attended representations
Residual & LayerNorm   Improve stability and gradient flow

Step-by-Step: Mini Transformer Encoder

Step 1: Positional Encoding

import numpy as np
import tensorflow as tf

def positional_encoding(seq_len, d_model):
    # Sinusoidal encodings from "Attention Is All You Need":
    # sine on even dimensions, cosine on odd dimensions.
    pos = np.arange(seq_len)[:, np.newaxis]          # (seq_len, 1)
    i = np.arange(d_model)[np.newaxis, :]            # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads = pos * angle_rates                   # (seq_len, d_model)

    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])   # even indices: sin
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])   # odd indices: cos

    return tf.cast(angle_rads, dtype=tf.float32)
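The encoding has one row per position and is simply added to the token embeddings. A quick shape check (the sequence length and model size below are illustrative, not values used later):

pe = positional_encoding(seq_len=50, d_model=128)
print(pe.shape)   # (50, 128): one d_model-sized encoding per position

# Added to the token embeddings before the first encoder block, e.g.:
# x = token_embeddings + pe[tf.newaxis, ...]   # broadcast across the batch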

Step 2: Scaled Dot-Product Attention

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    matmul_qk = tf.matmul(q, k, transpose_b=True)   # similarity of each query with each key
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled = matmul_qk / tf.math.sqrt(d_k)          # scale to keep softmax gradients stable

    if mask is not None:
        scaled += (mask * -1e9)                     # push masked positions toward zero weight

    weights = tf.nn.softmax(scaled, axis=-1)        # attention weights over key positions
    return tf.matmul(weights, v)                    # weighted sum of value vectors
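A quick sanity check with random tensors (the shapes are chosen purely for illustration):

# One batch, 4 query positions, 4 key/value positions, depth 8.
q = tf.random.normal((1, 4, 8))
k = tf.random.normal((1, 4, 8))
v = tf.random.normal((1, 4, 8))

out = scaled_dot_product_attention(q, k, v)
print(out.shape)   # (1, 4, 8): one attended vector per query position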

Step 3: Multi-Head Attention Layer

from tensorflow.keras.layers import Dense, Layer

class MultiHeadAttention(Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.depth = d_model // num_heads   # dimension per head
        self.num_heads = num_heads

        # Learned projections for queries, keys, values, and the final output.
        self.Wq = Dense(d_model)
        self.Wk = Dense(d_model)
        self.Wv = Dense(d_model)
        self.dense = Dense(d_model)

    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]
        q = self.split_heads(self.Wq(q), batch_size)
        k = self.split_heads(self.Wk(k), batch_size)
        v = self.split_heads(self.Wv(v), batch_size)

        # Attend in every head in parallel, then merge the heads back together.
        scaled_attention = scaled_dot_product_attention(q, k, v, mask)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat = tf.reshape(scaled_attention, (batch_size, -1, self.num_heads * self.depth))
        return self.dense(concat)
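A quick self-attention check, where the queries, keys, and values all come from the same input (the dimensions below are illustrative):

mha = MultiHeadAttention(d_model=128, num_heads=8)
x = tf.random.normal((2, 10, 128))   # batch of 2 sequences, 10 tokens, d_model = 128
out = mha(x, x, x)                   # self-attention: q = k = v = x
print(out.shape)                     # (2, 10, 128): shape is preserved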


Transformer Encoder Block

from tensorflow.keras.layers import LayerNormalization, Dropout, Dense

class TransformerEncoderBlock(Layer):
    def __init__(self, d_model, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward network, applied to each token independently.
        self.ffn = tf.keras.Sequential([
            Dense(ff_dim, activation='relu'),
            Dense(d_model)
        ])
        self.layernorm1 = LayerNormalization()
        self.layernorm2 = LayerNormalization()
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, x, training=None):
        # Self-attention sub-layer with residual connection and layer norm.
        attn_output = self.att(x, x, x)
        out1 = self.layernorm1(x + self.dropout1(attn_output, training=training))
        # Feed-forward sub-layer with residual connection and layer norm.
        ffn_output = self.ffn(out1)
        return self.layernorm2(out1 + self.dropout2(ffn_output, training=training))
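The block maps a sequence of d_model-dimensional vectors to a sequence of the same shape, so blocks can be stacked freely (the hyperparameters below are illustrative):

block = TransformerEncoderBlock(d_model=128, num_heads=8, ff_dim=512)
x = tf.random.normal((2, 10, 128))
out = block(x, training=False)
print(out.shape)   # (2, 10, 128): same shape in, same shape out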

Training Transformer on Text (Preview)

You can now stack transformer blocks, add token embeddings, and train with model.fit() as usual; a minimal sketch follows the list below. Later chapters (especially Parts IV and V) show this for:

  • Sentiment analysis
  • Text classification
  • Question answering
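A minimal sketch of how such a model might be assembled for binary classification, using the layers built above (the vocabulary size, sequence length, and other hyperparameters are placeholders, not values from later chapters):

from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D, Input
from tensorflow.keras.models import Model

vocab_size, seq_len, d_model = 10000, 128, 64   # placeholder hyperparameters

inputs = Input(shape=(seq_len,))
x = Embedding(vocab_size, d_model)(inputs)
x = x + positional_encoding(seq_len, d_model)             # inject order information
x = TransformerEncoderBlock(d_model, num_heads=4, ff_dim=256)(x)
x = GlobalAveragePooling1D()(x)                           # pool token vectors into one
outputs = Dense(1, activation='sigmoid')(x)               # e.g. binary sentiment

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(train_dataset, epochs=...)                    # train as usual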

BERT and GPT: Built on Transformers

Model   Direction       Use Case
BERT    Bidirectional   Classification, QA
GPT     Left-to-right   Generation, Autocomplete

You can fine-tune these models with the transformers library from Hugging Face, or build them from scratch (see the next chapter for applications).
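For example, a minimal fine-tuning sketch (the checkpoint name, toy data, and hyperparameters are illustrative, and it assumes a recent version of the transformers library with TensorFlow support):

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

texts = ["great movie", "terrible plot"]                  # toy examples
labels = tf.constant([1, 0])
enc = dict(tokenizer(texts, padding=True, truncation=True, return_tensors="tf"))

# Recent transformers versions supply the model's own loss when none is passed.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))
model.fit(enc, labels, epochs=1)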


Summary

In this chapter, you:

  • Understood how attention replaces recurrence
  • Built a mini transformer encoder using TensorFlow
  • Explored key components such as positional encoding and multi-head attention
  • Learned how modern models like BERT and GPT are architecturally composed

Transformers are the foundation of state-of-the-art NLP and many vision models. You’ve now walked through the gears that power them.