Attention mechanisms solve a core limitation of the encoder-decoder RNN: the entire input sequence is compressed into a fixed-size context vector, which the decoder then uses for all output steps. With attention, the decoder can selectively focus on different parts of the encoder’s output at each decoding step.
The Development Arc
The lecture frames the evolution as 4 explicit stages, each building on the last:
| Stage | Architecture | Key idea |
|---|---|---|
| 1 | RNN with Attention | Decoder attends over all encoder states instead of just the last one |
| 2 | RNN with Self-Attention | Each encoder hidden state attends over other states in the same sequence |
| 3 | Attention without RNN | Cross-attention between encoder and decoder, but no recurrence |
| 4 | Self-Attention without RNN | Every layer is pure self-attention — this is the Transformer |
Each step addresses the limitations of the previous:
- RNN alone: sequential computation, can’t parallelize, long-term forgetting.
- RNN + Attention: decoder learns where to look, but the RNN itself is still sequential.
- RNN + Self-Attention: richer encoder representations, but still sequential.
- Transformer: fully parallel, no recurrence, attention does everything.
Attention in Seq2Seq (RNN + Attention)
In a standard Seq2Seq model, the decoder only looks at its current hidden state. With attention, it additionally consults all encoder hidden states, weighted by relevance.
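As a rough illustration of the paragraph above, the sketch below computes attention for a single decoder step with NumPy. The dot-product scoring function, the helper name `attend`, and the toy dimensions are assumptions made for illustration; the lecture may use a different alignment function.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(decoder_state, encoder_states):
    """One decoder step: weight every encoder state by its relevance.

    decoder_state:  (d,)   current decoder hidden state
    encoder_states: (m, d) one hidden state per input position
    """
    scores = encoder_states @ decoder_state   # (m,) dot-product relevance (assumed scoring)
    weights = softmax(scores)                 # (m,) non-negative, sums to 1
    context = weights @ encoder_states        # (d,) weighted average of encoder states
    return weights, context

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))  # m = 5 input positions, d = 4
decoder_state = rng.normal(size=4)
weights, context = attend(decoder_state, encoder_states)
```

The context vector is what the decoder consults in addition to its own hidden state when producing the next output.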
Complexity
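| Model | Time complexity |
|---|---|
| Without attention | $O(m + t)$ |
| With attention | $O(mt)$ |
$m$ = input length, $t$ = output length. Each decoder step computes weights over all $m$ encoder states, so $t$ decoder steps require $m \cdot t$ weight computations in total.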
Key rule: Do NOT re-use attention weights computed in a previous decoder step.
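A quick way to see where the cost comes from is to count the weights explicitly. The lengths below are arbitrary example values, and `count_attention_weights` is a hypothetical helper written only for this illustration.

```python
def count_attention_weights(input_len, output_len):
    """Count attention weights computed over a full decode.

    Each decoder step computes one fresh weight per encoder state,
    and nothing is carried over from the previous step.
    """
    total = 0
    for _ in range(output_len):   # one pass per decoder step
        total += input_len        # one weight per encoder hidden state
    return total

total = count_attention_weights(input_len=50, output_len=40)
print(total)  # 2000
```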
Query, Key, Value Abstraction
| Component | What it means | Source |
|---|---|---|
| Query | “What am I looking for?” | Decoder hidden state |
| Key | “What do I have to offer?” | Each encoder hidden state |
| Value | “The actual content I return” | Each encoder hidden state |
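The table can be read as three linear projections followed by a weighted sum. The following is a minimal sketch assuming randomly initialised matrices `W_q`, `W_k`, `W_v` (in a real model these are learned) and toy dimensions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_attn, m = 6, 4, 5          # toy sizes (assumptions)

# Stand-ins for learned parameters; a trained model would fit these.
W_q = rng.normal(size=(d_hidden, d_attn))
W_k = rng.normal(size=(d_hidden, d_attn))
W_v = rng.normal(size=(d_hidden, d_attn))

encoder_states = rng.normal(size=(m, d_hidden))  # one row per input token
decoder_state = rng.normal(size=d_hidden)        # current decoder hidden state

query = decoder_state @ W_q            # (d_attn,)   "what am I looking for?"
keys = encoder_states @ W_k            # (m, d_attn) "what do I have to offer?"
values = encoder_states @ W_v          # (m, d_attn) "the content I return"

scores = keys @ query                  # (m,) match between the query and each key
weights = np.exp(scores - scores.max())
weights /= weights.sum()               # softmax over the m encoder positions
context = weights @ values             # (d_attn,) weighted sum of values
```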
Self-Attention
Self-attention applies the attention mechanism to a single sequence — the query, key, and value all come from the same sequence.
What it enables:
- Each token can attend to every other token in the same sequence.
- The model builds richer, context-aware representations.
- Less likely to forget earlier parts of the sequence.
Example: In “The FBI is chasing a criminal on the run”, when encoding “run”, self-attention allows the model to attend to “FBI”, “chasing”, and “criminal”.
Attention vs. Self-Attention
| | Attention (Seq2Seq) | Self-Attention |
|---|---|---|
| Needs decoder? | Yes (two sequences) | No (one sequence) |
| Use case | Machine translation | Any sequence task |
| Scope | Cross-sequence | Within-sequence |
Single-Head Self-Attention
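Three learned parameter matrices: $W_Q$, $W_K$, $W_V$.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where $Q = XW_Q$, $K = XW_K$, $V = XW_V$, $X$ holds one input vector per row, and $d_k$ is the key dimension.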
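A single head can be sketched in a few lines of NumPy. The scaled dot-product form and the names `self_attention`, `W_q`, `W_k`, `W_v` are assumptions for illustration, with random matrices standing in for learned weights.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over one sequence.

    X: (m, d_model) token representations.
    Returns (output, weights) with shapes (m, d_v) and (m, m).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # all three come from the same X
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (m, m) token-to-token scores
    scores -= scores.max(axis=-1, keepdims=True)    # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
m, d_model, d_k = 9, 16, 8                          # toy sizes (assumptions)
X = rng.normal(size=(m, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
```

Row `i` of `weights` shows how strongly token `i` attends to every token in the sequence, itself included.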
Multi-Head Self-Attention
Run $h$ independent single-head self-attention layers in parallel, each with its own $(W_Q, W_K, W_V)$. Concatenate all outputs.
- Total parameter matrices: $3h$ (Q, K, V per head).
- Output shape: if each head outputs a $d$-dimensional vector per token → multi-head output is $hd$-dimensional per token.
- Different heads can specialize in different types of relationships (syntax, coreference, proximity, etc.).
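To make the bookkeeping concrete, here is a NumPy sketch that runs several heads and concatenates them. The head count, dimensions, and the helper `single_head` are illustrative assumptions, and the random matrices stand in for learned parameters.

```python
import numpy as np

def single_head(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one head; returns (m, d_head)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
m, d_model, d_head, num_heads = 7, 32, 4, 8   # toy sizes (assumptions)
X = rng.normal(size=(m, d_model))

# One (W_q, W_k, W_v) triple per head: 3 * num_heads matrices in total.
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(num_heads)]

# Heads do not depend on each other, so they could run in parallel.
outputs = [single_head(X, *params) for params in heads]
multi_head = np.concatenate(outputs, axis=-1)  # (m, num_heads * d_head)
n_matrices = sum(len(params) for params in heads)
```

With 8 heads of size 4, each token ends up with a 32-dimensional output, and 24 projection matrices are involved.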
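```python
# Keras MultiHeadAttention layer
import numpy as np
from tensorflow import keras

x = np.random.rand(2, 10, 512).astype("float32")  # (batch, seq_len, embed_dim)
mha = keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
output = mha(query=x, key=x, value=x)  # self-attention: all three are the same x
```
Self-Attention Without RNN (The Bridge to Transformers)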
This is the conceptual leap that makes the Transformer feel inevitable.
The question to ask: “If self-attention already lets each token attend to all others in the sequence, why do we need the RNN at all?”
- The RNN’s job was to accumulate context across time steps.
- Self-attention does this directly — in a single operation, every token can look at every other token.
- The RNN is now redundant as a context aggregator.
Removing the RNN gives you:
- Full parallelism — no sequential dependency between steps.
- No vanishing gradient through time — attention weights are direct connections.
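- Direct access to the whole sequence: every token can attend to every other token in one step, regardless of distance. The practical limit is the maximum sequence length, since the cost of attention grows quadratically with it.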
The only thing lost is the notion of position — RNNs inherently encode order by processing left to right. Transformers solve this with positional encodings added to the input embeddings, injecting order information back in explicitly.
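One common way to build such encodings is the fixed sinusoidal scheme, sketched below with NumPy. The notes do not say which scheme the lecture uses, so the sinusoidal choice, the function name `sinusoidal_positions`, and the toy sizes are assumptions.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal position encodings of shape (seq_len, d_model).

    Assumes d_model is even: even columns get sines, odd columns cosines.
    """
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model / 2)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 16))        # toy token embeddings (assumption)
inputs = embeddings + sinusoidal_positions(10, 16)
```

Because each position gets a distinct vector, two identical tokens at different positions no longer look the same to the attention layers.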
In short: Self-attention without RNN = Transformer encoder. The Transformer isn’t magic — it’s what you get when you ask “what if we only use self-attention?”
The Transformer
The Transformer is a Seq2Seq model built entirely from attention and dense layers — no recurrence.
Why Remove the RNN?
- RNNs are sequential: $h_t$ cannot be computed before $h_{t-1}$, so there is no parallelism across time steps.
- RNNs still suffer from long-term forgetting even with attention.
- Attention-only computation is fully parallelizable.
Transformer Encoder
- Stack of 6 identical blocks.
- Each block: Multi-Head Self-Attention (8 heads) + Dense Layer (+ residual connections & layer norm).
- Input shape: $m \times 512$ → Output shape: $m \times 512$, where $m$ is the input length (shape-preserving so blocks can be stacked).
Transformer Decoder
- Stack of 6 identical blocks.
- Each block has 3 sublayers:
  1. Masked Multi-Head Self-Attention: attends to the decoder’s own previous outputs (masked to prevent attending to future tokens).
  2. Multi-Head Attention: Keys & Values come from encoder output; Queries come from sublayer 1.
  3. Dense Layer.
- Input: $t \times 512$ (decoder) + $m \times 512$ (encoder output) → Output: $t \times 512$, where $t$ is the number of target tokens and $m$ the number of source tokens.
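The masking in the first decoder sublayer can be sketched as follows. This is a NumPy illustration under the assumption that masking is done by setting future scores to negative infinity before the softmax; the function name is hypothetical.

```python
import numpy as np

def causal_self_attention_weights(scores):
    """Softmax over scores with future positions masked out.

    scores: (t, t) raw attention scores; row i is the query at position i.
    """
    t = scores.shape[0]
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)           # future tokens get zero weight
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
weights = causal_self_attention_weights(rng.normal(size=(4, 4)))
```

The resulting matrix is lower triangular: position `i` only distributes weight over positions `0..i`, so the first token can attend only to itself.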
Transformer as Machine Translation
- Encoder processes the full source sentence.
- Decoder generates the target sentence one token at a time.
- Decoding starts with a `[START]` token and stops at `[STOP]`.
- Both special tokens are part of the vocabulary and learned during training.
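The token-by-token loop described above might look like the sketch below. It assumes greedy decoding, and `next_token` is a hypothetical stand-in for a trained Transformer; `toy_model` exists only so the example runs without one.

```python
START, STOP = "[START]", "[STOP]"

def greedy_decode(next_token, source_tokens, max_len=20):
    """Generate target tokens one at a time until [STOP] or max_len.

    next_token(source_tokens, generated) is a hypothetical stand-in for a
    trained model that returns the most likely next target token.
    """
    generated = [START]
    while len(generated) < max_len:
        token = next_token(source_tokens, generated)
        generated.append(token)
        if token == STOP:
            break
    return generated

def toy_model(source_tokens, generated):
    """Toy stand-in for a model: upper-cases the source, then emits [STOP]."""
    position = len(generated) - 1          # number of target tokens produced so far
    if position < len(source_tokens):
        return source_tokens[position].upper()
    return STOP

result = greedy_decode(toy_model, ["guten", "morgen"])
print(result)  # ['[START]', 'GUTEN', 'MORGEN', '[STOP]']
```

The `max_len` cap is a safety net for the case where the model never produces `[STOP]`.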
```python
# Minimal Transformer encoder block in Keras
from tensorflow import keras


class TransformerEncoderBlock(keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, **kwargs):
        super().__init__(**kwargs)
        # Self-attention sublayer; embed_dim is split evenly across the heads
        self.attn = keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim // num_heads)
        # Position-wise feed-forward sublayer
        self.ffn = keras.Sequential([
            keras.layers.Dense(ff_dim, activation='relu'),
            keras.layers.Dense(embed_dim),
        ])
        self.norm1 = keras.layers.LayerNormalization()
        self.norm2 = keras.layers.LayerNormalization()

    def call(self, x, training=False):
        attn_out = self.attn(x, x)  # query = value = x (key defaults to value)
        x = self.norm1(x + attn_out)  # residual connection + layer norm
        ffn_out = self.ffn(x)
        return self.norm2(x + ffn_out)
```
BERT (Bidirectional Encoder Representations from Transformers)
BERT is a method for pre-training the Transformer encoder on unlabeled text, producing powerful general-purpose representations that can be fine-tuned for downstream tasks.
- Trained on English Wikipedia (~2.5 billion words) + BookCorpus.
- Does not require manually labeled data.
- Fully bidirectional — attends to left and right context simultaneously.
Pre-Training Tasks
Task 1 — Masked Language Modeling (MLM)
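- 15% of tokens are randomly selected for prediction and replaced with `[MASK]` (in the original BERT recipe, 80% of the selected tokens become `[MASK]`, 10% become a random token, and 10% are left unchanged).
- The model predicts the original masked words using a Softmax classifier.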
- Loss: Cross-Entropy.
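The data preparation for this task can be sketched in plain Python. This is a simplified variant in which every selected token becomes `[MASK]`; the function name `mask_tokens`, whitespace tokenisation, and the fixed seed are assumptions for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for masked language modeling.

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so the loss is computed only at masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

tokens = "the cat sat on the mat because it was tired".split()
masked, labels = mask_tokens(tokens)
```

Since the labels come from the text itself, no manual annotation is needed, which is what lets the model train on unlabeled corpora.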
Task 2 — Next Sentence Prediction (NSP)
- Input: `[CLS] Sentence A [SEP] Sentence B`.
- Target: `True` (consecutive) or `False` (random pairing).
- 50% real pairs, 50% random pairs.
- A binary classifier is applied to the `[CLS]` token output.
- Loss: Binary Cross-Entropy.
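Building the training pairs can be sketched as follows. This is a simplified illustration: it draws negatives from the same short list of sentences, whereas a real pipeline would sample from a large corpus, and `make_nsp_pairs` is a hypothetical helper that needs at least three sentences.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (input_text, is_next) examples from an ordered list of sentences.

    About half the pairs are truly consecutive; the rest pair sentence A
    with a randomly drawn sentence that is not its real successor.
    """
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            second, is_next = sentences[i + 1], True
        else:
            candidates = [s for j, s in enumerate(sentences) if j != i + 1 and j != i]
            second, is_next = rng.choice(candidates), False
        examples.append((f"[CLS] {sentences[i]} [SEP] {second}", is_next))
    return examples

corpus = ["I went home.", "I made tea.", "It rained.", "The bus was late."]
examples = make_nsp_pairs(corpus)
```

Each example carries the `True`/`False` target that the binary classifier on the `[CLS]` output is trained to predict.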
Combined Training:
Both tasks are trained jointly: their losses are added together and minimized in the same gradient descent step.
```python
# Fine-tuning BERT with HuggingFace + Keras
from transformers import TFBertModel, BertTokenizer
import tensorflow as tf

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = TFBertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
outputs = bert(**inputs)

# CLS token representation for classification
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (batch, 768)
# Untrained 2-class head; fine-tuning would train it together with BERT
logits = tf.keras.layers.Dense(2)(cls_embedding)
```
Related Notes
- RNNs & LSTMs - The sequential architectures that Transformers replace.
- Autoencoders - Another encoder-decoder architecture (for reconstruction, not generation).
- Large Language Models - GPT-style models built on the Transformer decoder.
- 4. RNNs & CNNs for Text Classification - NLP grounding for sequence models.