Before transformers: the problem with RNNs

Before 2017, sequence-to-sequence tasks like translation were handled by recurrent neural networks. The architecture was intuitive: process one token at a time, pass a hidden state forward, and build up a representation of the sequence step by step. The problem is that this is fundamentally sequential. To compute the representation of token 100, you first have to compute token 99, which requires token 98, and so on. You cannot parallelize over positions during training, which makes large-scale training slow.

The deeper problem is long-range dependencies. In a sentence like "The trophy didn't fit in the suitcase because it was too big," the word "it" refers back to "trophy," several positions earlier, and in longer texts the gap can be far larger. In an RNN, the gradient signal connecting "it" to "trophy" has to travel through every intermediate hidden state, and it tends to vanish or explode along the way. LSTMs and GRUs mitigated this but did not eliminate it.

The transformer's answer is to drop recurrence entirely. Instead of processing tokens one at a time, every position attends to every other position simultaneously. The whole sequence is processed in parallel, and long-range relationships are handled directly rather than through a chain of hidden states. The paper's title, "Attention Is All You Need," is deliberately provocative: the claim is that attention mechanisms, properly designed, are sufficient on their own.

The 30-second view

The transformer is an encoder-decoder model. The encoder reads the input sequence and produces a rich contextual representation of it. The decoder then generates the output sequence one token at a time, attending to the encoder's representation at each step. For translation, the encoder reads the source sentence and the decoder generates the target sentence.

Both the encoder and decoder are stacks of identical layers. The original paper uses six layers in each stack, with a model dimension of 512. The two stacks are connected by a cross-attention mechanism inside each decoder layer, which is how the decoder reads the encoder's output. That connection is the amber bridge in the diagram below.

[Figure: The Transformer model architecture, from Attention Is All You Need. A six-layer encoder (left) and a six-layer decoder (right). Encoder layer, bottom to top: Input Embedding + Positional Encoding, Multi-Head Self-Attention, Add & Norm, Feed-Forward Network, Add & Norm. Decoder layer, bottom to top: Output Embedding (outputs shifted right) + Positional Encoding, Masked Multi-Head Self-Attention, Add & Norm, Multi-Head Cross-Attention (Q from decoder; K, V from encoder output), Add & Norm, Feed-Forward Network, Add & Norm, followed by Linear, Softmax, and Output Probabilities. Base model: N = 6, d_model = 512, 8 heads, d_k = 64, d_ff = 2048, sin/cos positional encoding, causal mask.]
The full Transformer architecture. The encoder (left, purple) reads the source sequence and produces contextual representations. The decoder (right, green) generates the target sequence autoregressively. The amber dashed bridge carries Keys and Values from the encoder into each decoder layer's cross-attention sublayer.

Embeddings and positional encoding

Every token in the vocabulary maps to a learnable vector of dimension 512. The embedding layer is just a lookup table: given a token ID, it returns the corresponding row of a matrix of shape (vocab_size, 512). Before these vectors enter the encoder, they are multiplied by sqrt(d_model), about 22.6 for d_model = 512. A common explanation is that this keeps the embeddings from being swamped by the positional encodings added next, whose entries all lie in [-1, 1].
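
The lookup-and-scale step can be sketched without any framework. This is a minimal illustration in plain Python, not the code from transformer.py: the toy sizes, the random table, and the helper name `embed` are all assumptions made for the example.

```python
import math
import random

d_model = 8       # 512 in the base model; shrunk here for readability
vocab_size = 5    # hypothetical toy vocabulary

random.seed(0)
# The embedding "layer" is a (vocab_size, d_model) table of learnable numbers.
embedding_table = [[random.gauss(0.0, 1.0) for _ in range(d_model)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    """Look up each token's row and scale it by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [[scale * w for w in embedding_table[t]] for t in token_ids]

vectors = embed([3, 0, 3])   # the same token ID always yields the same vector
```

The first and third vectors are identical because both come from row 3 of the table, which is exactly why a positional signal has to be added afterwards.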

The transformer has no recurrence and no convolution. It processes all positions at once, which is great for parallelism but means the model has no built-in sense of order. Token 3 is indistinguishable from token 7 without some positional signal. The solution is positional encoding: add a fixed vector to each embedding that encodes the position of that token in the sequence.

The original paper uses sine and cosine functions at different frequencies across the dimension axis. For position pos and pair index i (so that 2i and 2i+1 are the actual dimension indices):

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Even dimensions get sine, odd dimensions get cosine. Each sine/cosine pair shares one frequency, and that frequency falls geometrically as i grows: low dimensions complete many cycles over a short sequence, while high dimensions change very slowly (the wavelengths run from 2π up to 10000 · 2π). Every position gets a distinct pattern across the 512 dimensions, like a fingerprint.

# from transformer.py — PositionalEncoding.__init__
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

[Figure: Positional encoding pattern (10 positions × 32 dimensions). A heatmap with one row per position (0 to 9) and one column per dimension, colored by sin/cos value from –1 to +1.]
Positional encoding pattern (first 10 positions, first 32 dimensions). Each row is a unique fingerprint for that position. Low dimensions oscillate rapidly (short wavelength); high dimensions change very slowly (long wavelength). The model can infer token position from this pattern.

One useful property of this encoding: for any fixed offset k, the encoding at pos + k is a linear function of the encoding at pos (each sine/cosine pair is rotated by an angle that depends only on k, not on pos). The paper's authors hypothesized that this makes it easy for the model to learn to attend by relative position, not just absolute position.
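
The linearity claim can be checked numerically. The sketch below is plain Python written for this article (the function names `positional_encoding` and `shift` are assumptions). It uses the angle-addition identities to map PE(pos) to PE(pos + k) with coefficients that depend only on k, then compares the result against the encoding computed directly.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) using only k: one 2x2 rotation per sin/cos pair."""
    out = []
    for i in range(d_model // 2):
        theta = k / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(theta) + c * math.sin(theta))   # sin(a + b)
        out.append(c * math.cos(theta) - s * math.sin(theta))   # cos(a + b)
    return out

d_model, pos, k = 16, 7, 5
predicted = shift(positional_encoding(pos, d_model), k, d_model)
actual = positional_encoding(pos + k, d_model)
max_error = max(abs(p - a) for p, a in zip(predicted, actual))
```

`max_error` comes out at floating-point noise level. Note that `shift` never looks at `pos`, which is the whole point: the same linear map works for every starting position.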

Self-attention: the core idea

Self-attention lets every token in the sequence look at every other token and decide how much to weight their information. The key insight is that the same token can play different roles depending on what it is attending to. Consider: "The animal didn't cross the street because it was too tired." When the model processes "it," it needs to figure out that "it" refers to "animal," not "street." Self-attention gives the model a mechanism to make this connection.

For every token, each attention head computes three 64-dimensional vectors from the 512-dim input embedding: a Query (what this token is looking for), a Key (what this token is broadcasting about itself), and a Value (the actual content this token contributes). These come from three separate learned linear projections.

[Figure: Creating Q, K, V from a token embedding. The 512-dim embedding of "it" is multiplied by W_Q, W_K, and W_V (each 512×64) to give a 64-dim query q, key k, and value v.]
Each input embedding is linearly projected three times to produce a Query, Key, and Value vector (all 64-dim when using 8 heads with d_model=512). The weight matrices W_Q, W_K, W_V are learned during training.

With Q, K, and V in hand, attention scores are computed. The Query from one token is compared, via a dot product, with the Key of every token in the sequence, itself included. A high dot product means those two tokens are related in some way the model has learned. The scores are divided by the square root of the key dimension (here, sqrt(64) = 8) to prevent the dot products from getting so large that softmax gradients become tiny. Then softmax turns the scores into a probability distribution, and a weighted sum of Value vectors produces the final output.
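
To see why the scaling matters: if the components of q and k are independent with mean 0 and variance 1, their dot product has variance d_k, so its typical size grows like sqrt(d_k). The following plain-Python simulation is a rough empirical check under that assumption (random Gaussian vectors, not trained projections), and the helper name `dot_product_std` is made up for the example.

```python
import math
import random

random.seed(0)

def dot_product_std(d_k, trials=2000):
    """Empirical standard deviation of q . k for unit-variance random q and k."""
    dots = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return math.sqrt(sum((d - mean) ** 2 for d in dots) / trials)

raw_std = dot_product_std(64)          # grows like sqrt(d_k), so close to 8 here
scaled_std = raw_std / math.sqrt(64)   # dividing by sqrt(d_k) brings it back near 1
```

Scores with a spread near 8 would make softmax close to one-hot for many rows, leaving almost no gradient for the other positions. After the division, the spread is near 1 regardless of d_k.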

# from transformer.py — scaled_dot_product_attention
scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
if mask is not None:
    scores = scores.masked_fill(mask == 0, -1e9)
attention_weights = F.softmax(scores, dim=-1)
output = torch.matmul(attention_weights, value)

[Figure: Attention score computation in five steps: q · kᵀ (dot product, raw scores); ÷ √d_k (÷ 8, scaled scores); softmax (normalize to sum = 1); × V (weight each value vector); sum (weighted sum = output). Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V]
The five steps of scaled dot-product attention. The scaling by √d_k (= 8 in the base model) prevents dot products from growing large enough to push softmax into regions with near-zero gradients.
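
The five steps can be traced by hand on a tiny example. This is a sketch in plain Python for a single query vector, with made-up numbers and d_k = 4. It mirrors the logic of the PyTorch snippet above but is not taken from transformer.py.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(query)
    # Steps 1 and 2: dot product with every key, then divide by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    # Step 3: normalize the scores into weights that sum to 1.
    weights = softmax(scores)
    # Steps 4 and 5: weight each value vector and sum, one component at a time.
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return weights, output

# Toy numbers (assumptions): the query is most aligned with the second key.
query  = [1.0, 0.0, 1.0, 0.0]
keys   = [[0.0, 1.0, 0.0, 1.0],
          [2.0, 0.0, 2.0, 0.0],
          [1.0, 1.0, 0.0, 0.0]]
values = [[1.0, 0.0],
          [0.0, 1.0],
          [0.5, 0.5]]

weights, output = attention(query, keys, values)
```

The scaled scores are 0, 2, and 0.5, so the second key receives the largest weight and the output is pulled toward the second value vector.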

[Figure: Self-attention weights for "it" attending to other tokens: The 0.04, animal 0.52, didn't 0.11, cross 0.07, it 0.19, tired 0.07. "it" attends most strongly to "animal".]
Attention weights when processing "it" in "The animal didn't cross the street because it was too tired." The model correctly assigns the highest weight (0.52) to "animal," resolving the coreference. Each head learns different relationships, so another head might attend to "tired" to capture the reason clause.

Multi-head attention

A single attention head produces one set of attention weights per token, so it can express only one pattern of relationships at a time. But natural language has several kinds of relationships that matter simultaneously: syntax (subject-verb agreement), semantics (coreference like "it" → "animal"), local structure (bigrams and phrases), and long-range dependencies. Multi-head attention runs 8 attention heads in parallel, each with its own set of learned projections.

With d_model=512 and 8 heads, each head gets a 64-dim subspace to work with (512 ÷ 8 = 64). All eight heads compute attention in parallel, produce 8 output matrices of shape (seq_len, 64), and then these are concatenated to produce a (seq_len, 512) matrix. A final linear projection W_O mixes the information across heads.

# from transformer.py — MultiHeadAttention.forward
Q = self.W_q(query).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
K = self.W_k(key).view(batch_size,   -1, self.n_heads, self.d_k).transpose(1, 2)
V = self.W_v(value).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

attn_output, _ = scaled_dot_product_attention(Q, K, V, mask, self.dropout)

# concat heads: (batch, n_heads, seq, d_k) → (batch, seq, d_model)
attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
output = self.W_o(attn_output)

[Figure: Multi-head attention with 8 heads in parallel. The 512-dim input feeds head1 through head8 (64-dim each); their outputs are concatenated (8 × 64 = 512) and passed through W_O to give the 512-dim output.]
Each of the 8 attention heads works in a 64-dimensional subspace and can learn different relationships. Their outputs are concatenated (8 × 64 = 512 dims) and projected through W_O back to d_model=512. In practice, individual heads are often observed to specialize, for example in syntax, semantics, or coreference.
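
The split and concatenation are pure bookkeeping, which a few lines of plain Python can make concrete. This sketch works on a single position's projected vector. The helpers `split_heads` and `concat_heads` are hypothetical names, and the integer list merely stands in for one row of a projected query.

```python
d_model, n_heads = 512, 8
d_k = d_model // n_heads            # 64 dimensions per head

def split_heads(vector, n_heads):
    """Cut one projected d_model-dim vector into n_heads contiguous chunks."""
    size = len(vector) // n_heads
    return [vector[h * size:(h + 1) * size] for h in range(n_heads)]

def concat_heads(heads):
    """Inverse of split_heads: join per-head outputs back into one vector."""
    return [x for head in heads for x in head]

projected = list(range(d_model))    # stand-in for one row of a projected query
heads = split_heads(projected, n_heads)
restored = concat_heads(heads)

# Weights in W_Q, W_K, W_V and W_O (biases ignored): four d_model x d_model matrices.
attention_weight_count = 4 * d_model * d_model
```

Head h sees dimensions 64h through 64h + 63. This is the same partition that the `view(batch_size, -1, n_heads, d_k)` call produces. Because the heads share the four 512 × 512 matrices (1,048,576 weights in total), eight 64-dim heads cost no more parameters than one 512-dim head would.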

The encoder layer

A single encoder layer has two sublayers: multi-head self-attention and a position-wise feed-forward network. What makes the layer work in practice is what wraps each sublayer: a residual connection and layer normalization.

The residual connection means the sublayer's output is added to its input before being passed forward. In code this is literally x = LayerNorm(x + sublayer(x)). The residual path lets gradients flow directly to earlier layers during backprop without passing through the sublayer transformation, making it much easier to train deep networks. Layer normalization rescales each position's activations to zero mean and unit variance across the feature dimension (followed by a learned gain and bias), which keeps their distribution stable from layer to layer.

The feed-forward network is applied independently to each position. It expands from 512 to 2048 dimensions, applies ReLU, then contracts back to 512. Apart from the attention softmax, the ReLU is the layer's only nonlinearity: it lets the model transform each position's representation in ways that attention's weighted averaging cannot.

# from transformer.py — EncoderLayer.forward
# Sub-layer 1: self-attention + residual + norm
attn_output = self.self_attn(x, x, x, mask)
x = self.norm1(x + self.dropout1(attn_output))

# Sub-layer 2: feed-forward + residual + norm
ff_output = self.feed_forward(x)
x = self.norm2(x + self.dropout2(ff_output))

[Figure: Encoder layer with residual connections. x (input) → Multi-Head Self-Attention → Add & Norm → Feed-Forward Network (512 → 2048 → 512, ReLU) → Add & Norm → x′ (output), with a residual path around each sublayer.]
Each sublayer is wrapped with a residual connection: the sublayer output is added to its input before layer normalization. The residual paths (dashed) let gradients flow directly backward without passing through the sublayer, enabling stable training at depth.
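
As a rough illustration of x = LayerNorm(x + sublayer(x)) around the feed-forward sublayer, here is a single-position sketch in plain Python. Everything in it is an assumption for demonstration: the toy sizes (d_model = 4, d_ff = 8), the hand-picked weights, and the omission of LayerNorm's learned gain and bias.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one position's feature vector to zero mean and unit variance.
    The learned gain and bias of a full LayerNorm are omitted here."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: expand, apply ReLU, contract back to len(x).
    w1 and w2 hold one weight vector per output unit."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, unit)) + b)
              for unit, b in zip(w1, b1)]
    return [sum(h * w for h, w in zip(hidden, unit)) + b
            for unit, b in zip(w2, b2)]

def add_and_norm(x, sublayer_out):
    """x = LayerNorm(x + sublayer(x)) for a single position."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

# Toy sizes (assumptions): d_model = 4 and d_ff = 8 instead of 512 and 2048.
x = [1.0, -2.0, 0.5, 3.0]
w1 = [[0.1 * (i - j) for i in range(4)] for j in range(8)]    # 8 hidden units
b1 = [0.0] * 8
w2 = [[0.05 * (i + j) for i in range(8)] for j in range(4)]   # 4 output units
b2 = [0.0] * 4

y = add_and_norm(x, feed_forward(x, w1, b1, w2, b2))
```

Because the input is added back before normalization, the sublayer only has to learn a correction to x rather than a full replacement for it.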

The encoder stack

The encoder is six of these layers stacked on top of each other. The output of layer N is the input to layer N+1. Each layer operates on the same sequence length but can transform the representation in ways the previous layer could not.

What does stacking buy you? Early layers tend to capture low-level structure: token-level patterns, local syntax, nearby relationships. Later layers capture more abstract, long-range semantic information. This mirrors what is observed in other deep networks: depth lets the model build increasingly abstract representations hierarchically.

The final output of the encoder stack is a matrix of shape (seq_len, 512): one 512-dimensional vector for each position in the input sequence. These vectors encode the meaning of each token in the full context of the sentence. This is what gets handed off to the decoder.

# from transformer.py — Encoder.forward
for layer in self.layers:   # 6 identical EncoderLayer instances
    x = layer(x, mask)
return x  # shape: (batch_size, seq_len, d_model)

The decoder

The decoder generates the output sequence one token at a time. At each step, it takes all the tokens it has generated so far plus the encoder's output, and predicts the next token. This autoregressive process continues until the model outputs a special end-of-sequence token.

Each decoder layer has three sublayers instead of two. The first is masked self-attention: the decoder can attend to its own previously generated tokens, but not to future ones it has not generated yet. The mask is a lower-triangular matrix: scores for future positions are replaced with a large negative value before softmax, which drives their attention weights to effectively zero.

# from transformer.py — Transformer.create_causal_mask
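# triu(..., diagonal=1) marks the strictly-future positions (j > i) with 1
mask = torch.triu(torch.ones(size, size), diagonal=1).type(torch.uint8)
# invert to a boolean lower-triangular mask (True = may attend), shaped (1, 1, size, size)
return (mask == 0).unsqueeze(0).unsqueeze(0)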
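
[Figure: Causal mask, each token can only see its past. A 5 × 5 grid over the tokens "I am a student <EOS>": each row (query token) may attend only to itself and the tokens before it.]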
The causal mask for a 5-token decoder sequence. Green cells (✓) show allowed attention; red cells (✗) are masked with a large negative value before softmax, effectively zeroing out those attention weights. This ensures the model cannot "cheat" by looking at tokens it has not generated yet.
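
The effect of the mask on the attention weights is easy to verify in isolation. The sketch below is plain Python rather than the article's PyTorch, and it uses uniform raw scores (an assumption) so that the only thing shaping the weights is the mask itself. `causal_mask` and `masked_softmax` are names made up for the example.

```python
import math

def causal_mask(size):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]

def masked_softmax(scores, allowed):
    """Softmax after replacing disallowed scores with a large negative number."""
    filled = [s if ok else -1e9 for s, ok in zip(scores, allowed)]
    m = max(filled)
    exps = [math.exp(s - m) for s in filled]   # exp(-1e9) underflows to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Uniform raw scores make the mask's effect easy to read off.
weights = [masked_softmax([0.0] * 4, row) for row in mask]
```

Row i spreads its weight evenly over positions 0 through i and gives exactly zero to everything after it, so the first token can only attend to itself.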

The second sublayer is cross-attention, and it is the heart of how the decoder uses the encoder. The Queries come from the decoder's previous sublayer (what the decoder is currently generating), but the Keys and Values come from the encoder output (the encoded source sequence). This lets every decoder position attend to every position in the encoded input at every generation step.

# from transformer.py — DecoderLayer.forward
# 1. Masked self-attention (decoder attends to its own past)
self_attn_output = self.self_attn(x, x, x, tgt_mask)
x = self.norm1(x + self.dropout1(self_attn_output))

# 2. Cross-attention (decoder queries the encoder output)
cross_attn_output = self.cross_attn(x, encoder_output, encoder_output, src_mask)
x = self.norm2(x + self.dropout2(cross_attn_output))

# 3. Feed-forward
ff_output = self.feed_forward(x)
x = self.norm3(x + self.dropout3(ff_output))

[Figure: Cross-attention, decoder queries the encoder. The encoder output (seq_len × 512, the full encoded source sequence) supplies K and V to Multi-Head Cross-Attention; Q comes from the decoder sublayer below; the output goes to the next sublayer.]
In cross-attention, the decoder provides the Query (what it is currently generating), while Keys and Values come from the encoder output (the full encoded source). This lets the decoder "look up" relevant parts of the input sequence at every generation step.
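
One consequence worth seeing concretely is that the source and target lengths need not match: the weight matrix has one row per target position and one column per source position. The following plain-Python sketch uses a single head and identity projections, both simplifications assumed for brevity, with toy 2-dim vectors.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_output):
    """Single-head cross-attention with identity projections (a simplification):
    queries come from the decoder, keys and values from the encoder output."""
    d_k = len(encoder_output[0])
    all_weights, outputs = [], []
    for q in decoder_states:                       # one query per target position
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in encoder_output]         # one score per source position
        w = softmax(scores)
        all_weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, encoder_output))
                        for j in range(d_k)])
    return all_weights, outputs

encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 source positions
decoder_states = [[2.0, 0.0], [0.0, 2.0]]               # 2 target positions

weights, outputs = cross_attention(decoder_states, encoder_output)
```

Here `weights` is 2 × 3 (target × source), while `outputs` keeps the decoder's length of 2. No causal mask is involved, since the whole source sentence is already known.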

Output projection

After the final decoder layer, each position has a 512-dimensional representation. A linear layer projects this to a vector whose size equals the vocabulary (for example, about 37,000 shared byte-pair tokens for English-German translation in the original paper). Softmax converts these logits into a probability distribution over the vocabulary. Under greedy decoding, the token with the highest probability is selected as the output at that step. The original paper used beam search instead.
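
The last step, from logits to a chosen token, looks like this in a minimal plain-Python sketch. The five-word vocabulary and the logit values are invented for illustration, and the selection rule shown is simple greedy argmax.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["<EOS>", "I", "am", "a", "student"]   # hypothetical toy vocabulary
logits = [0.1, 0.3, 2.5, 0.2, 1.0]              # stand-in for the linear layer's output

probs = softmax(logits)
next_id = max(range(len(probs)), key=probs.__getitem__)   # greedy: take the argmax
next_token = vocab[next_id]
```

Since softmax is monotonic, the argmax of the probabilities is the same as the argmax of the logits. The probabilities themselves only matter for training, sampling, or beam search.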

One detail worth noting: the embedding matrix and the output linear layer share weights. The same matrix that maps token IDs to 512-dim vectors is used (transposed) to map 512-dim decoder outputs back to vocabulary scores. This weight tying reduces the parameter count and makes sense conceptually: if token A and token B are semantically similar, their embedding vectors are close, so the output layer will also assign similar scores to them given a similar decoder state.

# from transformer.py — Transformer.__init__
self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
self.output_linear = nn.Linear(d_model, tgt_vocab_size)

# weight tying: both weights have shape (tgt_vocab_size, d_model), so a single
# Parameter serves as the embedding table and as the output projection
self.tgt_embedding.weight = self.output_linear.weight

Training

Training uses cross-entropy loss between the predicted probability distribution and the target token. For a sequence of length T, the loss is the average cross-entropy across all T positions. Training relies on teacher forcing: the ground-truth previous tokens are fed as decoder input rather than the model's own predictions, so the model learns to assign high probability to the correct next token at every position simultaneously.

The original paper uses label smoothing with a value of 0.1. Instead of training against a one-hot target (all probability mass on the correct token), 0.1 of the probability mass is spread uniformly across all vocabulary tokens. This prevents the model from becoming overconfident. The paper reports that it hurts perplexity, since the model learns to be less sure, but improves accuracy and BLEU.
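
A small numeric example shows what smoothing does to the target and to the loss at a single position. This plain-Python sketch assumes the common formulation in which epsilon is spread over all tokens, the correct one included. The five-token vocabulary and the predicted distribution are made up.

```python
import math

def smoothed_targets(correct_id, vocab_size, epsilon=0.1):
    """Move epsilon of the probability mass to a uniform spread over all tokens."""
    uniform = epsilon / vocab_size
    target = [uniform] * vocab_size
    target[correct_id] += 1.0 - epsilon
    return target

def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)

vocab_size = 5                                  # toy vocabulary (assumption)
predicted = [0.02, 0.02, 0.90, 0.03, 0.03]      # a confident, correct prediction
one_hot = smoothed_targets(2, vocab_size, epsilon=0.0)
smooth = smoothed_targets(2, vocab_size, epsilon=0.1)

loss_hard = cross_entropy(one_hot, predicted)
loss_smooth = cross_entropy(smooth, predicted)
```

With the one-hot target, the loss is just -log(0.90). With smoothing, the same prediction is penalized more, because the model has put too little mass on the other four tokens. The sequence-level loss is the mean of this quantity over all T positions.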

The learning rate schedule is unusual and specifically designed for the transformer. It warms up linearly over the first 4000 steps, then decays proportionally to the inverse square root of the step number. The warmup prevents large gradient updates during early training when the model's parameters are far from useful values. Without warmup, the model can diverge early and never recover.

# from config.py — the base model configuration
MODEL_CONFIG = {
    'n_layers': 6,     # encoder and decoder layers
    'd_model': 512,    # embedding and hidden dimension
    'd_ff':    2048,   # feed-forward inner dimension
    'n_heads':    8,   # attention heads
    'd_k':       64,   # d_model / n_heads
    'dropout':  0.1,
}

TRAINING_CONFIG = {
    'warmup_steps': 4000,
    'label_smoothing': 0.1,
}

[Figure: Learning rate schedule (warmup_steps = 4000). Learning rate against training steps (0 to 20k): linear warmup to the peak at step 4000, then decay ∝ 1/√step. lrate = d_model^(−0.5) · min(step^(−0.5), step · warmup^(−1.5))]
The transformer learning rate schedule. The learning rate increases linearly for the first 4000 steps, then decays proportionally to 1/√step. The warmup phase prevents the model from diverging early in training when gradients are noisy and parameters are far from useful values.
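
The schedule is a one-line formula, so it is easy to evaluate directly. The function below is a sketch of that formula in plain Python. The name `transformer_lr` is an assumption, and steps are counted from 1 to avoid raising zero to a negative power.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)
schedule = {step: transformer_lr(step) for step in (1, 1000, 4000, 16000)}
```

For the base configuration the peak is roughly 7 × 10^-4, reached at step 4000. Halfway through warmup (step 2000) the rate is half the peak, and it falls back to half the peak again at step 16000, since decay goes with 1/√step.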

What makes the transformer work

The transformer succeeds because of four ideas working together. Attention lets every token talk directly to every other token, eliminating the sequential bottleneck of RNNs. Multi-head attention gives the model multiple perspectives on token relationships simultaneously. Residual connections and layer normalization make it possible to stack many layers without training instability. And the careful learning rate schedule ensures the model learns in a stable regime from the very first step.