Before transformers: the problem with RNNs
Before 2017, sequence-to-sequence tasks like translation were handled by recurrent neural networks. The architecture was intuitive: process one token at a time, pass a hidden state forward, and build up a representation of the sequence step by step. The problem is that this is fundamentally sequential. To compute the representation of token 100, you first have to compute token 99, which requires token 98, and so on. You cannot parallelize over positions during training, which makes large-scale training slow.
The deeper problem is long-range dependencies. In a sentence like "The trophy didn't fit in the suitcase because it was too big," the word "it" refers back to "trophy," which in a longer sentence might be many positions earlier. In an RNN, the gradient signal connecting "it" to "trophy" has to travel through every intermediate hidden state, and it tends to vanish or explode along the way. LSTMs and GRUs helped, but did not solve the problem.
The transformer's answer is to drop recurrence entirely. Instead of processing tokens one at a time, every position attends to every other position simultaneously. The whole sequence is processed in parallel, and long-range relationships are handled directly rather than through a chain of hidden states. The title of the paper that introduced the architecture, "Attention Is All You Need," is deliberately provocative. The claim is that attention mechanisms, properly designed, are sufficient on their own.
The 30-second view
The transformer is an encoder-decoder model. The encoder reads the input sequence and produces a rich contextual representation of it. The decoder then generates the output sequence one token at a time, attending to the encoder's representation at each step. For translation, the encoder reads the source sentence and the decoder generates the target sentence.
Both the encoder and decoder are stacks of identical layers. The original paper uses six layers in each stack, with a model dimension of 512. The two stacks are connected by a cross-attention mechanism inside each decoder layer, which is how the decoder reads what the encoder encoded. That connection is the amber bridge in the diagram below.
Embeddings and positional encoding
Every token in the vocabulary maps to a learnable vector of dimension 512. The embedding layer is just a lookup table: given a token ID, return its corresponding row in a matrix of shape (vocab_size, 512). Before these vectors are passed into the encoder, they are multiplied by sqrt(d_model) (about 22.6 for d_model = 512) to keep their magnitude in range as the model trains.
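The lookup-and-scale step can be sketched in a few lines of plain Python. This is only an illustration: the tiny `d_model`, the toy `vocab_size`, and the random table standing in for learned weights are all assumptions made for the example, not values from transformer.py.

```python
import math
import random

d_model = 8      # tiny stand-in for the article's 512 (assumption for the demo)
vocab_size = 5   # hypothetical toy vocabulary

random.seed(0)
# Lookup table of shape (vocab_size, d_model); random values stand in for learned weights.
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(d_model)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Look up each token's row and multiply it by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [[scale * v for v in embedding_table[t]] for t in token_ids]

vectors = embed([3, 0, 3])  # the same token ID always maps to the same vector
```

Note that the two occurrences of token 3 get identical vectors here, which is exactly why a positional signal is needed next.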
The transformer has no recurrence and no convolution. It processes all positions at once, which is great for parallelism but means the model has no built-in sense of order. Token 3 is indistinguishable from token 7 without some positional signal. The solution is positional encoding: add a fixed vector to each embedding that encodes the position of that token in the sequence.
The original paper uses sine and cosine functions at different frequencies across the dimension axis. For position pos and dimension i:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Even dimensions get sine, odd dimensions get cosine. Each dimension oscillates at a different frequency: low dimensions complete many cycles over a short sequence, high dimensions change very slowly. Every position gets a unique pattern across the 512 dimensions, like a fingerprint.
# from transformer.py — PositionalEncoding.__init__
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
One useful property of this encoding: for any fixed offset k, the position encoding at pos + k can be expressed as a linear transformation of the encoding at pos, and that transformation depends only on k, not on pos. The paper's authors hypothesized that this makes it easy for the model to learn to attend to tokens at relative positions, not just absolute ones.
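The linear-transformation property follows from the angle-addition identities: each (sin, cos) pair is rotated by an angle that depends only on the offset k. The sketch below checks this numerically in plain Python. The helper names `positional_encoding` and `shift` are made up for this illustration and are not part of transformer.py.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position, following the two formulas above."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):          # i is the even dimension index
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        pe[i + 1] = math.cos(angle)
    return pe

def shift(pe, k, d_model):
    """Apply the position-independent linear map that moves an encoding by k positions."""
    out = [0.0] * d_model
    for i in range(0, d_model, 2):
        w = 1.0 / (10000 ** (i / d_model))  # frequency of this sin/cos pair
        s, c = pe[i], pe[i + 1]
        out[i] = s * math.cos(w * k) + c * math.sin(w * k)      # sin(w * (pos + k))
        out[i + 1] = c * math.cos(w * k) - s * math.sin(w * k)  # cos(w * (pos + k))
    return out

d_model, pos, k = 16, 7, 5
direct = positional_encoding(pos + k, d_model)
shifted = shift(positional_encoding(pos, d_model), k, d_model)
max_error = max(abs(a - b) for a, b in zip(direct, shifted))  # should be near zero
```

The coefficients inside `shift` use only `k` and the per-pair frequency, never `pos`, which is the point of the property.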
Self-attention: the core idea
Self-attention lets every token in the sequence look at every other token and decide how much to weight their information. The key insight is that the same token can play different roles depending on what it is attending to. Consider: "The animal didn't cross the street because it was too tired." When the model processes "it," it needs to figure out that "it" refers to "animal," not "street." Self-attention gives the model a mechanism to make this connection.
For every token, each attention head computes three 64-dimensional vectors from the 512-dim input embedding: a Query (what this token is looking for), a Key (what this token is broadcasting about itself), and a Value (the actual content this token contributes). These come from three separate learned linear projections.
With Q, K, and V in hand, attention scores are computed. The Query from one token is compared, via a dot product, with the Key of every token in the sequence, including its own. A high dot product means those two tokens are related in some way the model has learned. The scores are divided by the square root of the key dimension (here, sqrt(64) = 8) to prevent the dot products from getting so large that softmax saturates and its gradients become tiny. Softmax then turns each token's row of scores into a probability distribution, and a weighted sum of the Value vectors produces the final output for that token.
# from transformer.py — scaled_dot_product_attention
scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
if mask is not None:
    scores = scores.masked_fill(mask == 0, -1e9)
attention_weights = F.softmax(scores, dim=-1)
output = torch.matmul(attention_weights, value)
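To make the arithmetic concrete, here is a tiny worked example of the same computation in plain Python, for a single head with no mask and no dropout. The toy queries, keys, and values are invented for illustration, and `d_k` is 4 rather than 64 to keep the numbers readable.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, without masking or dropout."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # One score per key: dot product divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[4.0, 0.0, 0.0, 0.0]]  # points in the same direction as the first key

outputs, weights = attention(queries, keys, values)
```

The query aligns with the first key, so the scaled scores are 2, 0, and 0, and most of the attention weight lands on the first value vector.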
Multi-head attention
A single attention head produces one attention distribution per token, which makes it hard for that head to track several kinds of relationship at once. But natural language has multiple types of relationships that matter simultaneously: syntax (subject-verb agreement), semantics (coreference like "it" → "animal"), local structure (bigrams and phrases), and long-range dependencies. Multi-head attention runs 8 parallel attention heads, each with its own set of learned projections.
With d_model=512 and 8 heads, each head gets a 64-dim subspace to work with (512 ÷ 8 = 64). All eight heads compute attention in parallel, produce 8 output matrices of shape (seq_len, 64), and then these are concatenated to produce a (seq_len, 512) matrix. A final linear projection W_O mixes the information across heads.
# from transformer.py — MultiHeadAttention.forward
Q = self.W_q(query).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
K = self.W_k(key).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
V = self.W_v(value).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
attn_output, _ = scaled_dot_product_attention(Q, K, V, mask, self.dropout)
# concat heads: (batch, n_heads, seq, d_k) → (batch, seq, d_model)
attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
output = self.W_o(attn_output)
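The `view`/`transpose` reshaping above can be hard to picture. The sketch below mimics it with nested Python lists for a single sequence (no batch dimension): the feature axis is cut into 8 contiguous slices of 64, and concatenation puts them back. The helper names `split_heads` and `concat_heads` are hypothetical, and in the real layer the split is applied to the projected Q, K, and V rather than to raw inputs.

```python
d_model, n_heads = 512, 8
d_k = d_model // n_heads  # 64 features per head

def split_heads(x):
    """(seq_len, d_model) -> (n_heads, seq_len, d_k) by slicing the feature axis."""
    return [[row[h * d_k:(h + 1) * d_k] for row in x] for h in range(n_heads)]

def concat_heads(heads):
    """(n_heads, seq_len, d_k) -> (seq_len, d_model) by joining the slices per position."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]

seq_len = 3
# Fill with distinct numbers so a round trip can be checked exactly.
x = [[float(t * d_model + j) for j in range(d_model)] for t in range(seq_len)]

heads = split_heads(x)
roundtrip = concat_heads(heads)
```

Splitting and concatenating are pure bookkeeping. The mixing of information across heads happens only in the final `W_o` projection.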
The encoder layer
A single encoder layer has two sublayers: multi-head self-attention and a position-wise feed-forward network. What makes the layer work in practice is what wraps each sublayer: a residual connection and layer normalization.
The residual connection means the sublayer's output is added to its input before being passed forward. In code this is literally x = LayerNorm(x + sublayer(x)). The residual path lets gradients flow directly to earlier layers during backprop without passing through the sublayer transformation, making it much easier to train deep networks. Layer normalization stabilizes the distribution of activations across the feature dimension.
The feed-forward network is applied independently to each position. It expands from 512 to 2048 dimensions, applies ReLU, then contracts back to 512. This nonlinearity is where the model can store and transform information in ways pure attention cannot.
# from transformer.py — EncoderLayer.forward
# Sub-layer 1: self-attention + residual + norm
attn_output = self.self_attn(x, x, x, mask)
x = self.norm1(x + self.dropout1(attn_output))
# Sub-layer 2: feed-forward + residual + norm
ff_output = self.feed_forward(x)
x = self.norm2(x + self.dropout2(ff_output))
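The "add, then normalize" wrapper can be sketched for a single position in plain Python. This is a simplification under stated assumptions: the layer norm here omits the learned gain and bias that `nn.LayerNorm` normally carries, dropout is left out, and `toy_sublayer` is a made-up stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (almost exactly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    """Hypothetical stand-in for a sublayer: ReLU followed by halving."""
    return [0.5 * max(0.0, v) for v in x]

x = [1.0, -2.0, 3.0, 0.5]
out = add_and_norm(x, toy_sublayer)

out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
```

Because the input `x` is added back before normalization, the sublayer only has to learn a correction to its input rather than a full replacement for it.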
The encoder stack
The encoder is six of these layers stacked on top of each other. The output of layer N is the input to layer N+1. Each layer operates on the same sequence length but can transform the representation in ways the previous layer could not.
What does stacking buy you? Early layers tend to capture low-level structure: token-level patterns, local syntax, nearby relationships. Later layers capture more abstract, long-range semantic information. This mirrors what is observed in other deep networks: depth lets the model build increasingly abstract representations hierarchically.
The final output of the encoder stack is a matrix of shape (seq_len, 512): one 512-dimensional vector for each position in the input sequence. These vectors encode the meaning of each token in the full context of the sentence. This is what gets handed off to the decoder.
# from transformer.py — Encoder.forward
for layer in self.layers:  # 6 identical EncoderLayer instances
    x = layer(x, mask)
return x  # shape: (batch_size, seq_len, d_model)
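For a sense of scale, the size of the encoder stack can be estimated from the dimensions given so far. The count below assumes that every linear layer has a bias term and that each layer norm has one gain and one bias per feature. The text does not specify either, so treat the totals as a rough estimate rather than the exact figure for transformer.py. Embeddings are not included.

```python
d_model, d_ff, n_layers = 512, 2048, 6

def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias for one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

# Multi-head attention: W_q, W_k, W_v and W_o are each d_model x d_model.
attention_params = 4 * linear_params(d_model, d_model)

# Feed-forward network: expand to d_ff, then contract back to d_model.
ffn_params = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)

# Two layer norms per encoder layer, each with a gain and a bias vector.
norm_params = 2 * (2 * d_model)

per_layer = attention_params + ffn_params + norm_params
encoder_total = n_layers * per_layer

print(f"per layer: {per_layer:,}  encoder stack: {encoder_total:,}")
```

Under these assumptions the feed-forward network accounts for roughly two thirds of each layer's parameters, with attention making up almost all of the rest.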
The decoder
The decoder generates the output sequence one token at a time. At each step, it takes all the tokens it has generated so far plus the encoder's output, and predicts the next token. This autoregressive process continues until the model outputs a special end-of-sequence token.
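The generation loop can be sketched as follows. This version uses greedy selection (always take the highest-scoring token), which is the simplest strategy but not the only one. The `BOS`/`EOS` IDs and `toy_next_token_logits` are hypothetical: the latter is a stand-in for the real encoder-decoder forward pass and simply copies the source sequence before emitting the end token.

```python
BOS, EOS = 0, 1  # hypothetical special token IDs

def toy_next_token_logits(source, generated):
    """Hypothetical stand-in for the model: copy the source, then emit EOS."""
    vocab_size = 6
    step = len(generated) - 1  # tokens produced so far, not counting BOS
    target = source[step] if step < len(source) else EOS
    return [5.0 if tok == target else 0.0 for tok in range(vocab_size)]

def greedy_decode(source, next_token_logits, max_len=20):
    """Generate one token at a time until EOS or max_len is reached."""
    generated = [BOS]
    while len(generated) < max_len:
        logits = next_token_logits(source, generated)
        next_tok = max(range(len(logits)), key=logits.__getitem__)  # argmax
        generated.append(next_tok)
        if next_tok == EOS:
            break
    return generated

result = greedy_decode([3, 4, 2], toy_next_token_logits)
```

Each iteration feeds the full list of tokens generated so far back into the model, which is why inference is sequential even though training is not.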
Each decoder layer has three sublayers instead of two. The first is masked self-attention: the decoder can attend to its own previously generated tokens, but not to future ones it has not generated yet. The mask is a lower-triangular matrix: entries above the diagonal (future positions) are blocked by setting their attention scores to a large negative value, so softmax assigns them effectively zero weight.
# from transformer.py — Transformer.create_causal_mask
mask = torch.triu(torch.ones(size, size), diagonal=1).type(torch.uint8)
return (mask == 0).unsqueeze(0).unsqueeze(0)
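The effect of the mask is easiest to see on a small example. The plain-Python sketch below builds the same lower-triangular pattern for four positions and applies it to a row of equal scores. The helper names are made up for this illustration, and the `-1e9` fill value mirrors the one used in the attention code earlier.

```python
import math

def causal_mask(size):
    """mask[i][j] is True where query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]

def masked_softmax(scores, allowed):
    """Softmax over scores, with blocked positions pushed to a large negative value."""
    masked = [s if ok else -1e9 for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Give every key the same raw score so only the mask shapes the result.
weights = [masked_softmax([1.0, 1.0, 1.0, 1.0], row) for row in mask]

for row in weights:
    print([round(w, 3) for w in row])
```

Each row spreads its weight evenly over the positions up to and including its own, and assigns zero to everything after it.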
The second sublayer is cross-attention, and it is the heart of how the decoder uses the encoder. The Queries come from the decoder's previous sublayer (what the decoder is currently generating), but the Keys and Values come from the encoder output (the encoded source sequence). This lets every decoder position attend to every position in the encoded input at every generation step.
# from transformer.py — DecoderLayer.forward
# 1. Masked self-attention (decoder attends to its own past)
self_attn_output = self.self_attn(x, x, x, tgt_mask)
x = self.norm1(x + self.dropout1(self_attn_output))
# 2. Cross-attention (decoder queries the encoder output)
cross_attn_output = self.cross_attn(x, encoder_output, encoder_output, src_mask)
x = self.norm2(x + self.dropout2(cross_attn_output))
# 3. Feed-forward
ff_output = self.feed_forward(x)
x = self.norm3(x + self.dropout3(ff_output))
Output projection
After the final decoder layer, each position has a 512-dimensional representation. A linear layer projects this to a vector of size equal to the vocabulary (for example, about 37,000 tokens for English-German translation in the original paper). Softmax converts these logits into a probability distribution over the vocabulary. With greedy decoding, the token with the highest probability is selected as the output at that step; beam search, which the original paper uses, keeps several candidate sequences instead.
One detail worth noting: the embedding matrix and the output linear layer share weights. The same matrix that maps token IDs to 512-dim vectors is used (transposed) to map 512-dim decoder outputs back to vocabulary scores. This weight tying reduces the parameter count and makes sense conceptually: if token A and token B are semantically similar, their embedding vectors are close, so the output layer will also assign similar scores to them given a similar decoder state.
# from transformer.py — Transformer.__init__
self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
self.output_linear = nn.Linear(d_model, tgt_vocab_size)
# weight tying: same matrix for embedding lookup and output projection
self.tgt_embedding.weight = self.output_linear.weight
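The intuition behind weight tying can be checked with a toy table in plain Python. Everything here is invented for illustration: a 4-token vocabulary, 3-dimensional embeddings, and a hand-picked decoder state. The bias term of the output layer is ignored.

```python
import math

# Hypothetical toy table with shape (vocab_size, d_model) = (4, 3).
embedding = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],  # deliberately close to token 0
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

def embed(token_id):
    """Input side: look up the row for a token ID."""
    return embedding[token_id]

def output_logits(state):
    """Output side: score each token by the dot product with the same rows."""
    return [sum(s * e for s, e in zip(state, row)) for row in embedding]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

state = [2.0, 0.0, 0.0]  # a decoder state pointing toward token 0's embedding
logits = output_logits(state)
probs = softmax(logits)
```

Tokens 0 and 1 have nearby embeddings, so they receive nearby scores from the same decoder state, while the unrelated tokens 2 and 3 score lower.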
Training
Training uses cross-entropy loss between the predicted probability distribution and the target token. For a sequence of length T, the loss is the average cross-entropy across all T positions. The model is trained to assign high probability to the correct next token at every position simultaneously (teacher forcing: during training, the ground-truth previous tokens are fed as decoder input rather than the model's own predictions).
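Teacher forcing and the averaged loss can be sketched together. In the snippet below the decoder input is the target sequence shifted right by one position, so that position t is trained to predict token t+1. The `BOS`/`EOS` IDs, the 5-token vocabulary, and the uniform "predictions" are all assumptions for the example.

```python
import math

BOS, EOS = 0, 1  # hypothetical special token IDs

def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct token."""
    return -math.log(probs[target])

def sequence_loss(predicted, targets):
    """Average cross-entropy over all T positions."""
    return sum(cross_entropy(p, t) for p, t in zip(predicted, targets)) / len(targets)

target_sentence = [4, 2, 3]
decoder_input = [BOS] + target_sentence   # what the decoder sees:    [0, 4, 2, 3]
decoder_target = target_sentence + [EOS]  # what it should predict:   [4, 2, 3, 1]

# An untrained model is roughly uniform over the vocabulary.
vocab_size = 5
uniform = [1.0 / vocab_size] * vocab_size
predicted = [uniform for _ in decoder_target]

loss = sequence_loss(predicted, decoder_target)  # equals log(vocab_size) for uniform output
```

A uniform prediction gives a loss of log(vocab_size), which is a handy sanity check for the first few training steps of a real model.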
The original paper uses label smoothing with a value of 0.1. Instead of training against a one-hot target (all probability mass on the correct token), 0.1 of probability is spread uniformly across all vocabulary tokens. This prevents the model from becoming overconfident and improves generalization, at the cost of a slightly higher training loss.
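A smoothed target distribution is simple to construct. The sketch below follows the description above: 0.1 of the mass is spread uniformly over all tokens, the correct one included. Implementations differ on whether the correct token shares in the smoothed mass, so treat this as one reasonable variant rather than the paper's exact recipe.

```python
import math

def smoothed_target(correct, vocab_size, eps=0.1):
    """Put 1 - eps on the correct token, then spread eps uniformly over all tokens."""
    dist = [eps / vocab_size] * vocab_size
    dist[correct] += 1.0 - eps
    return dist

def soft_cross_entropy(probs, target_dist):
    """Cross-entropy between a predicted distribution and a soft target."""
    return -sum(q * math.log(p) for q, p in zip(target_dist, probs) if q > 0)

target = smoothed_target(correct=2, vocab_size=5)

# Even a prediction that matches the smoothed target exactly has non-zero loss.
loss_floor = soft_cross_entropy(target, target)
```

The non-zero floor is the "slightly higher training loss" mentioned above: the model is never rewarded for pushing the correct token's probability all the way to 1.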
The learning rate schedule is unusual and specifically designed for the transformer. It warms up linearly over the first 4000 steps, then decays proportionally to the inverse square root of the step number. The warmup prevents large gradient updates during early training when the model's parameters are far from useful values. Without warmup, the model can diverge early and never recover.
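The schedule fits in one line. The sketch below uses the formula given in the original paper, lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). The function name `transformer_lr` is made up for this example.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps, then inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 16000, 100000):
    print(f"step {s:>6}: lr = {transformer_lr(s):.6f}")

peak = transformer_lr(4000)  # the two branches of min() meet at step == warmup_steps
```

The peak learning rate is reached exactly at the end of warmup and depends on both `d_model` and `warmup_steps`, so changing either one rescales the whole curve.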
# from config.py — the base model configuration
MODEL_CONFIG = {
    'n_layers': 6,     # encoder and decoder layers
    'd_model': 512,    # embedding and hidden dimension
    'd_ff': 2048,      # feed-forward inner dimension
    'n_heads': 8,      # attention heads
    'd_k': 64,         # d_model / n_heads
    'dropout': 0.1,
}
TRAINING_CONFIG = {
    'warmup_steps': 4000,
    'label_smoothing': 0.1,
}