AlphaFold 3 is one of the more architecturally interesting models I've come across.
The paper is good but moves fast, and I found myself wanting a slower, visual
walkthrough I could actually sit with. This is that walkthrough.
It assumes you're comfortable with attention. If you need a refresher,
The Illustrated Transformer is the best one out there. This walkthrough won't cover why protein structure prediction matters
or what AlphaFold changed for biology. There's plenty written on that. The focus
here is purely mechanical: how molecules are represented inside the model, and what
operations turn them into a predicted 3D structure.
Overview
AF3 predicts the structure of a protein, optionally complexed with other proteins, nucleic acids, or small molecules, entirely from sequence. That broader input space
is the first thing that sets it apart from AF2. A quick note on terminology used
throughout: a "token" is a single amino acid (for proteins), a single nucleotide
(for DNA/RNA), or an individual atom for anything that doesn't fit those two.
The model has three stages: Input Preparation (embedding sequences
and querying structural databases), Representation Learning
(refining pair and single representations through stacked attention modules), and
Structure Prediction (generating 3D coordinates via diffusion).
Two types of representation run through the model. A single representation ($N \times C$) stores one feature vector per token or atom, capturing individual identity. A pair representation ($N \times N \times C$) stores one feature vector per pair, capturing relationships between them. Most computation happens in the pair representation because structure prediction is about distances and orientations between entities, not individual properties alone.
Figure: AlphaFold 3, full architecture.
From AF2 to AF3
AF2 established the pair and single representation framework that AF3 inherits. To understand what AF3 adds and why, it helps to know what AF2 did and what it could not. AF2 predicted protein structures only, operated entirely at the residue level throughout its trunk (one vector per amino acid, all the way through), and was deterministic: the same input always produced the same output. Its architecture had two main stages: a 48-block Evoformer that jointly refined an MSA representation and a pair representation, followed by a structure module that used residue-specific backbone frames and torsion angles to produce 3D coordinates. The structure module had to be rotationally invariant, which led to the Invariant Point Attention (IPA) design.
AF3 keeps the pair representation and all the triangle-based operations that update it. These carry over largely unchanged into the Pairformer and are the most proven part of AF2. What AF3 replaces is everything around them. The 48-block Evoformer is replaced by a simpler 4-block MSA module followed by a 48-block Pairformer that operates on pairs and singles only. After those 4 MSA blocks, the MSA representation is discarded entirely. The structure module is replaced by a diffusion module that works directly on raw atom coordinates. The one-sentence version: AF3 keeps AF2's pair processing, cuts the MSA processing drastically, and replaces the structure module with diffusion.
AF3 is generative, not deterministic
The most consequential change is that AF3 is a generative model. The diffusion module is trained to denoise atomic coordinates: given corrupted positions, predict the true ones. At inference time, random noise is sampled and iteratively denoised. The same input run twice with different random seeds will produce different structures. For most complex types, this variation is small and confidence-based ranking selects the best sample from a handful of seeds. For antibody-antigen complexes, performance keeps improving even up to 1,000 seeds, which reflects genuine geometric uncertainty at the interface rather than a quirk of sampling. This is a fundamental shift in what the model is doing: AF2 predicted a single answer, AF3 draws from a distribution over plausible structures.
Equivariance is dropped
AF2's IPA was carefully designed so that the attention operation remained invariant to global rotations and translations of the molecule. This made sense when the model operated inside residue backbone frames and needed to reason about relative orientations. AF3 abandons rotational invariance entirely. The diffusion training objective does not require it, and dropping it simplifies handling arbitrary chemistry: ligands, nucleic acids, and modified residues all have molecular graphs that do not fit naturally into a residue-frame-based equivariant design. No IPA, no FAPE loss, no backbone frame representation.
Atom-level representations are new
AF2 worked entirely at the residue level in its trunk. There was no atom-level computation until the structure module's final side-chain prediction step. AF3 introduces atom-level representations in the input embedder, specifically the atom single representation c, the atom pair representation p, and the Atom Transformer output q. These have no AF2 analog. They are necessary because ligands and modified residues cannot be compressed to a single residue vector the way standard amino acids can. The input embedder computes these atom-level features from conformer geometry, then aggregates them to the token level before the Pairformer runs, bridging the two scales.
Confidence training required a new procedure
In AF2, the confidence head was trained by comparing the structure module output directly to the true structure at each training step. This does not transfer to diffusion: each training step only denoises one noise level and never produces a complete structure. AF3 solves this with a rollout procedure. During training, the full diffusion process is run at a coarser step size to generate a complete predicted structure, and that structure is used to supervise the confidence head. The confidence outputs are similar to AF2 (per-residue pLDDT and a predicted aligned error matrix PAE) with one addition: a predicted distance error matrix (PDE) that measures error in the predicted distance matrix.
Hallucination and cross-distillation
Generative models tend to hallucinate: they invent compact, plausible-looking structure even in regions that are genuinely disordered. AF3 addresses this by enriching its training data with structures predicted by AlphaFold-Multimer v2.3. In those predictions, disordered regions appear as extended loops rather than compact globules. Training on them teaches AF3 to produce ribbon-like disorder with low confidence rather than incorrectly confident compact folds. This cross-distillation is not a small fix: without it, AF3 hallucination rates are substantially higher.
The MSA de-emphasis is a principled bet
Reducing MSA processing from 48 blocks to 4 is not just an efficiency cut. AF3 is making a claim: that evolutionary coevolution signal is not required for cross-entity interactions like protein-ligand or protein-nucleic acid binding. These interactions are primarily governed by local chemistry and geometry rather than by how residues co-evolved across species. The results support this. AF3 substantially outperforms specialized docking tools on protein-ligand binding despite using far less MSA processing than AF2 used for protein-only prediction.
Tokenization
AF2 only handled proteins, so tokenization was trivial: one amino acid, one token.
AF3 needs to represent proteins, nucleic acids, small molecules, and modified
residues all in one framework, so the scheme is more involved.
The rule: standard amino acids and nucleotides get one token per residue,
no matter how many atoms they contain. Everything else (modified residues, ligand atoms, ions) gets one token per atom. Hydrogens are
excluded throughout; only heavy atoms count.
The reason the compression is valid for standard residues is that the model has
access to a reference conformer for each one: a canonical 3D geometry from a chemical
lookup table (the CCD). Atom-level positions can be recovered downstream from
the residue token using this reference. Modified residues and ligands have no
such lookup, so each atom must carry its own token.
| Type | Heavy atoms | Tokens |
|---|---|---|
| Standard AA (glycine) | 5 | 1 |
| Standard nucleotide (CMP) | 21 | 1 |
| non-standard → per-atom | | |
| Modified AA (hydroxyproline) | 9 | 9 |
| Modified nucleotide (m5C) | 22 | 22 |
| Ligand (aspirin) | 13 | 13 |
The compression matters beyond just cleanliness. The Pairformer (the main attention stack) operates on an $N_\text{token} \times N_\text{token}$ pair
representation. Its memory cost scales quadratically with token count, not atom
count. Without this compression, a typical protein complex would be
computationally intractable at the atom level.
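The tokenization rule above is simple enough to sketch in a few lines. This is a minimal illustration, not AF3's code: the `Component` class and `token_count` helper are hypothetical names, and the heavy-atom counts are the ones from the table.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical container for one residue, nucleotide, or ligand."""
    name: str
    heavy_atoms: int   # hydrogens are already excluded
    is_standard: bool  # standard amino acid or standard nucleotide?


def token_count(component: Component) -> int:
    """One token per standard residue, otherwise one token per heavy atom."""
    return 1 if component.is_standard else component.heavy_atoms


components = [
    Component("glycine", 5, True),
    Component("CMP", 21, True),
    Component("hydroxyproline", 9, False),
    Component("m5C", 22, False),
    Component("aspirin", 13, False),
]

n_atoms = sum(c.heavy_atoms for c in components)
n_tokens = sum(token_count(c) for c in components)

# The pair representation grows with the square of whichever count is used.
pair_entries_token_level = n_tokens ** 2
pair_entries_atom_level = n_atoms ** 2
```

In this toy input most components are non-standard, so the saving is modest. For a real protein, where nearly every residue is standard, the gap between the two squared counts is far larger.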
Genetic Search: MSA Construction
Before any neural network runs, AF3 does something that resembles
retrieval-augmented generation: search existing databases for related sequences,
then hand them to the model as extra context. The genetic search handles the
sequence side of this.
For each protein or RNA chain in the input, AF3 runs HMM-based tools
(jackhmmer for proteins, nhmmer for RNA) against several large databases.
For proteins: UniRef90, BFD, MGnify, UniClust30. For RNA: RNAcentral, Rfam,
and the NCBI Nucleotide Database. These databases collectively span hundreds of
millions of sequences: UniRef90 alone exceeds 300 million, and BFD runs into
the billions. No neural networks are involved; these are classical bioinformatics
tools, and they run before the model sees anything.
The hits get aligned into a multiple sequence alignment (MSA):
a matrix where each row is a related sequence from another organism, and each
column corresponds to a position in the input. The MSA is capped at
$N_\text{MSA} < 2^{14}$ sequences. Compute and memory scale with $N_\text{MSA}$,
so this is a hard ceiling even when far more hits exist.
Why does the MSA help?
The same protein appears across thousands of species with slightly
different sequences. Looking at a single column of the MSA (all the amino acids at position i across organisms) tells you how much evolutionary pressure exists at that site. Positions that
are always the same are likely structurally or functionally critical.
Positions that are nearly always different are probably tolerant of
substitution.
More importantly, pairs of columns co-vary. If position
i and position j tend to mutate together across
species, those two residues are probably physically close. A mutation at one creates evolutionary pressure to compensate at the other. This
co-evolutionary signal lets the model infer spatial contacts without
ever seeing a structure.
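To make conservation and co-variation concrete, here is a toy calculation using column entropy and mutual information. These are classical statistics chosen for illustration; AF3 does not compute them explicitly but learns the signal from the MSA. The four-sequence alignment and the function names are made up for this sketch.

```python
from collections import Counter
from math import log2


def column(msa, i):
    """All residues at alignment position i, one per sequence."""
    return [row[i] for row in msa]


def entropy(symbols):
    """Shannon entropy in bits; 0 means the column is perfectly conserved."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(msa, i, j):
    """High values mean positions i and j tend to mutate together."""
    ci, cj = column(msa, i), column(msa, j)
    return entropy(ci) + entropy(cj) - entropy(list(zip(ci, cj)))


# Toy alignment: position 0 is conserved, positions 1 and 2 co-vary
# (K pairs with D, R pairs with E), position 3 varies on its own.
toy_msa = [
    "AKDL",
    "AKDL",
    "AREL",
    "AREI",
]

conservation = entropy(column(toy_msa, 0))
covarying = mutual_information(toy_msa, 1, 2)
weakly_linked = mutual_information(toy_msa, 1, 3)
```

Here `conservation` is zero because position 0 never changes, and `covarying` is larger than `weakly_linked`, which is the pattern that hints at a physical contact between positions 1 and 2.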
Paired MSA for multi-chain complexes
For a complex with multiple chains, the naive approach is to search
each chain independently and concatenate the results into a
block-diagonal matrix: chain 1's hits fill the top-left, chain 2's
hits fill the bottom-right, and the off-diagonal blocks are all gaps.
This is sparse and loses any cross-chain co-evolutionary signal.
AF3 instead pairs hits where possible: if the same organism
has a homologue of chain 1 and a homologue of chain 2, those two
sequences get placed in the same MSA row. The result is a denser
matrix that explicitly encodes whether mutations in chain 1 correlate
with mutations in chain 2 across evolution, which is exactly the signal that
lets the model reason about interfaces.
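The difference between pairing and block-diagonal concatenation can be sketched as follows. This is a simplification: it assumes one hit per species per chain, keyed by a species label, and the `pair_msa` helper and example sequences are hypothetical.

```python
def pair_msa(hits_a, hits_b, gap="-"):
    """Build MSA rows for a two-chain complex.

    hits_a, hits_b: dicts mapping a species label to the aligned homologue
    of chain A / chain B. Species present in both are placed in one row;
    the rest fall back to gap-padded (block-diagonal style) rows.
    """
    len_a = len(next(iter(hits_a.values())))
    len_b = len(next(iter(hits_b.values())))

    paired = [hits_a[sp] + hits_b[sp] for sp in hits_a if sp in hits_b]
    only_a = [hits_a[sp] + gap * len_b for sp in hits_a if sp not in hits_b]
    only_b = [gap * len_a + hits_b[sp] for sp in hits_b if sp not in hits_a]
    return paired + only_a + only_b


hits_chain_1 = {"human": "MKV", "mouse": "MRV", "yeast": "MKI"}
hits_chain_2 = {"human": "GHLL", "mouse": "GHIL", "ecoli": "AHLL"}

rows = pair_msa(hits_chain_1, hits_chain_2)
```

Only the paired rows (human and mouse here) carry cross-chain information. The unpaired rows still contribute within-chain signal but are all gaps in the other chain's columns.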
One notable change from AF2: AF3 substantially de-emphasizes MSA processing. The MSA module is only 4 blocks vs. 48 in AF2's Evoformer, and after those 4 blocks the MSA representation is discarded entirely. Only the pair representation carries forward. The evolutionary signal still matters, but AF3 bets it can be absorbed quickly and that the Pairformer does the heavy lifting.
Template Search
Alongside the MSA, AF3 retrieves structural templates: known 3D structures from the PDB that resemble the input. The process uses hmmsearch against PDB sequences using the constructed MSA as a query profile. Up to 4 high-quality template structures are sampled. Only individual chains are used, not full complexes.
Each template is represented as a pairwise distance matrix. For every pair of tokens, the Euclidean distance between their center atoms is computed: Cα for amino acids, C1' for nucleotides, or the atom itself for single-atom tokens. Rather than storing raw distances, they're binned into a distogram: a discretized distribution over distance ranges. Each entry is augmented with per-token metadata. Inter-chain distances are masked out, so templates inform only intra-chain geometry.
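A minimal sketch of the binning and masking described above, using NumPy. It assumes hard one-hot binning with an overflow bin for large distances, and the bin edges, coordinates, and the `template_distogram` name are illustrative choices rather than AF3's actual settings.

```python
import numpy as np


def template_distogram(center_xyz, chain_id, bin_edges):
    """One-hot distance bins per token pair, with inter-chain pairs zeroed.

    center_xyz: (N, 3) center-atom coordinates, one per token.
    chain_id:   (N,) integer chain label per token.
    bin_edges:  increasing distance thresholds; distances beyond the last
                edge land in a final overflow bin.
    """
    diff = center_xyz[:, None, :] - center_xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                 # (N, N)
    bin_index = np.digitize(dist, bin_edges)             # values in 0..len(bin_edges)
    one_hot = np.eye(len(bin_edges) + 1)[bin_index]      # (N, N, n_bins)
    same_chain = chain_id[:, None] == chain_id[None, :]  # (N, N) boolean mask
    return one_hot * same_chain[..., None]


center_xyz = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.0, 0.0],
    [0.0, 20.0, 0.0],   # belongs to a different chain
])
chain_id = np.array([0, 0, 0, 1])
bin_edges = np.array([4.0, 8.0, 12.0, 16.0])  # toy edges, in Å

disto = template_distogram(center_xyz, chain_id, bin_edges)
```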
Why distograms instead of raw distances?
Crystal structures vary in resolution. At 3Å resolution, atom positions can be off by a meaningful amount, and some atoms may not be resolved at all. A single distance value cannot express this uncertainty. A distogram instead distributes probability mass across bins: rather than "this pair is 8.3Å apart," the model sees "probably 7–10Å, less likely 10–13Å." This soft representation handles noisy or missing data naturally.
What if no templates are found?
When no templates are found (novel proteins, RNA with no PDB homologues, or any ligand or modified residue), the template features are masked to zero and the model proceeds without them. This is not a failure mode; de novo proteins, novel RNA folds, and all small molecules have no structural templates by definition.
Conformer Generation & Atom Representations
Every atom in the input needs a reference 3D position: a starting geometry the model can build on. For standard amino acids and nucleotides, this is a lookup: each has a canonical low-energy conformation in the Chemical Components Dictionary (CCD), retrieved directly. For small molecules there is no such lookup. AF3 uses RDKit's ETKDGv3 algorithm to generate a conformer from the SMILES string, combining experimental torsion angle preferences with distance geometry to produce chemically reasonable 3D coordinates.
These positions are not predictions. They are priors: rough geometry reflecting known chemistry, used to initialize the atom-level representations before any learned processing. The model doesn't commit to them; they're a starting point that the network updates and the diffusion module ultimately discards in favor of its own generated coordinates.
One practically important point: the conformer only needs to be reasonable, not perfect. Small errors in ring conformations or torsion angles don't catastrophically affect downstream predictions because the diffusion module re-optimizes from learned embeddings, not from the conformer directly. That said, gross failures such as invalid stereochemistry or highly strained geometries can degrade the initial atom pair representation in ways that are harder to recover from, so RDKit quality matters more than it might initially seem.
Building the atom representations
$c$: atom single representation. For every atom, concatenate its conformer position (relative to other atoms in the same token), charge, atomic number, and other identifiers. This gives $c$ ($N_\text{atoms} \times C_\text{atom}$): one feature vector per atom.
Distances → local distances → $p$. From the conformer positions, compute a full $N_\text{atoms} \times N_\text{atoms}$ pairwise distance matrix: $d_{l,m} = |r_l - r_m|$. This covers every atom pair. Then apply mask $v$, a block-diagonal matrix that keeps only within-token pairs and zeros out everything else, producing the local distances matrix. These local distances get embedded through an inverse-square term, $1/(1 + d_{l,m}^2)$ (most informative at short range, and finite even when $d_{l,m} = 0$), the feature vectors $c_l$ and $c_m$ are projected and added to each entry, and three linear layers with a residual connection refine this into $p$ ($N_\text{atoms} \times N_\text{atoms} \times C_\text{atompair}$). The three linear layers are what build $p$; this is separate from what produces $q$.
$q$: Atom Transformer output. Separately, $q$ starts as a copy of $c$, and the Atom Transformer runs 3 blocks (adaptive LayerNorm + attention biased by $p$ + conditional gating + SwiGLU) to update $q$, using $c$ and $p$ as context. The resulting $q$ is the atom-level single representation passed forward. The original $c$ is left unchanged and reused later in the diffusion module.
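The geometric part of building p (distances, the within-token mask, and the inverse-square feature) can be sketched with NumPy. This covers only the unlearned inputs. The learned projections and the three linear layers are omitted, the function name is hypothetical, and the feature is written as 1/(1 + d²) so the diagonal stays finite, which is an assumption of this sketch.

```python
import numpy as np


def local_pair_features(ref_pos, atom_to_token):
    """Distances, within-token mask, and masked inverse-square feature.

    ref_pos:       (N_atoms, 3) reference conformer coordinates.
    atom_to_token: (N_atoms,) index of the token each atom belongs to.
    """
    diff = ref_pos[:, None, :] - ref_pos[None, :, :]
    d = np.linalg.norm(diff, axis=-1)                      # d[l, m] = |r_l - r_m|
    v = atom_to_token[:, None] == atom_to_token[None, :]   # block-diagonal mask
    inv_sq = v.astype(float) / (1.0 + d ** 2)              # zero outside each block
    return d, v, inv_sq


# Three atoms in token 0 (a residue) and one single-atom token.
ref_pos = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [5.0, 5.0, 5.0],
])
atom_to_token = np.array([0, 0, 0, 1])

d, v, inv_sq = local_pair_features(ref_pos, atom_to_token)
```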
What is the token structure for different molecule types?
Amino acids and nucleotides are multi-atom tokens: a single token corresponds to one residue, which has many atoms (a standard amino acid backbone has ~5 heavy atoms, plus a side chain). Ligand atoms and modified residues are single-atom tokens: each atom is its own token. This means $N_\text{tokens} < N_\text{atoms}$ for most inputs. The within-token mask $v$ is therefore block-diagonal: large blocks along the diagonal for protein/RNA residues (each block is one residue), and 1×1 blocks for single-atom tokens.
Input Embedder: Updating Atom Representations
With q (atom-level single) and p (atom-level pair) initialized from the conformer, the input embedder uses the Atom Transformer to update q based on what neighboring atoms look like, with p steering the attention. c, the original atom features before any learned processing, stays fixed and acts as a conditioning signal throughout. It's effectively a residual anchor to the initial representation.
Each of the 3 Atom Transformer blocks has 4 steps:
q → 1. AdaNorm (cond. on c) → 2. Attn + Pair Bias (seq-local sparse, pair bias from p) → 3. Cond. Gate (cond. on c) → 4. Cond. Trans. (SwiGLU MLP, cond. on c) → q'
c and p condition each step · residual connections (q = q + step output) throughout each block
Step 1: AdaNorm
Standard LayerNorm normalizes each token's activations by mean and standard deviation, then rescales with fixed learned parameters γ (scale) and β (shift) that are the same regardless of input. AdaNorm keeps the normalization step but makes γ and β adaptive: a linear projection of c generates them on the fly. The conditioning is one-directional (c → γ,β → applied to q); c itself is never updated. This lets the initial atom features dictate how each block's normalization behaves, rather than using a single fixed rescaling learned from all examples.
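A minimal NumPy sketch of the idea, following the description above literally: normalize q without learned parameters, then let linear projections of c supply the scale and shift. The weight names `w_gamma` and `w_beta` are hypothetical, and any extra details of the real layer (such as normalizing c first or squashing the scale) are left out.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned params)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def ada_norm(q, c, w_gamma, w_beta):
    """Adaptive LayerNorm: scale and shift come from c, not from fixed parameters."""
    gamma = c @ w_gamma   # per-atom, per-channel scale
    beta = c @ w_beta     # per-atom, per-channel shift
    return gamma * layer_norm(q) + beta


rng = np.random.default_rng(0)
n_atoms, c_atom = 5, 8
q = rng.normal(size=(n_atoms, c_atom))
c = rng.normal(size=(n_atoms, c_atom))
w_gamma = rng.normal(size=(c_atom, c_atom))
w_beta = rng.normal(size=(c_atom, c_atom))

q_normed = ada_norm(q, c, w_gamma, w_beta)
```

Note that c only appears on the right-hand side: nothing in `ada_norm` writes back to it, which is the one-directional conditioning described above.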
Step 2: Attention with Pair Bias
Self-attention on q, with three differences from standard multi-head attention:
Pair bias: after computing QK dot products for query atom $l$, row $l$ of $p$ is linearly projected to a scalar per head and added as a bias to the attention logits before softmax. This is strictly one-directional: $p$ shapes which atoms attend to each other, but the attention output never updates $p$. Dimensions: $C_\text{atom} = 128$, $C_\text{atompair} = 16$.
Gating: q is also projected through a sigmoid gate ∈ [0,1]. The attention output is multiplied by this gate before heads are concatenated, controlling how much of each attention update enters the residual stream. The model learns to filter which information gets written into the running representation.
Sparse (sequence-local) attention: because $N_\text{atoms} \gg N_\text{tokens}$, full $N_\text{atoms} \times N_\text{atoms}$ attention is prohibitively expensive. AF3 uses sequence-local atom attention: groups of 32 query atoms each attend to 128 nearby key atoms.
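The first two differences (pair bias and gating) can be sketched in NumPy as below. To keep it short, this version is single-head and dense, so it does not implement the 32-query / 128-key local windowing. All weight names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attention_pair_bias(q, p, wq, wk, wv, wb, wg):
    """Single-head, dense attention on q with an additive bias from p.

    q: (N, C) atom single representation; p: (N, N, Cp) atom pair representation.
    Returns the gated update and the attention weights. p is read, never written.
    """
    query, key, value = q @ wq, q @ wk, q @ wv
    logits = query @ key.T / np.sqrt(query.shape[-1])   # (N, N)
    logits = logits + (p @ wb)[..., 0]                  # one scalar bias per atom pair
    weights = softmax(logits, axis=-1)
    gate = sigmoid(q @ wg)                              # values in (0, 1)
    return gate * (weights @ value), weights


rng = np.random.default_rng(0)
n_atoms, c_atom, c_pair = 4, 8, 2
q = rng.normal(size=(n_atoms, c_atom))
p = np.zeros((n_atoms, n_atoms, c_pair))
p[:, 2, 0] = 50.0                                       # strongly favor atom 2 as a key
wq, wk, wv, wg = (0.1 * rng.normal(size=(c_atom, c_atom)) for _ in range(4))
wb = np.array([[1.0], [0.0]])

update, weights = attention_pair_bias(q, p, wq, wk, wv, wb, wg)
```

With the large bias on column 2 of p, every query ends up attending almost entirely to atom 2, regardless of the Q/K dot products. That is the sense in which p shapes who attends to whom.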
Step 3: Conditioned Gating
After the attention step, a second gate filters q. This time the gate is generated from c, not from q itself. A linear projection of c through sigmoid gives per-channel values in [0, 1] that element-wise scale what gets absorbed from the attention step into the residual stream. The gate depends on the original atom features (c), not the current learned state of q, so the initial chemistry of each atom determines how strongly the attention update registers, independent of what q has learned so far.
Step 4: Conditioned Transition (SwiGLU MLP)
The MLP step, analogous to the FFN in a standard transformer. It's "conditioned" because it's wrapped in AdaNorm and Conditioned Gating, both of which depend on c. AF3 uses SwiGLU here instead of the ReLU used in AF2. A standard ReLU MLP projects up to 4×D, applies ReLU, and projects back down. SwiGLU takes two parallel up-projections (D → K×D), applies Swish (x ⋅ σ(x), a smooth non-linearity that does not hard-zero negative values) to one, element-wise multiplies the two, then projects down. Swish's smoothness helps gradient flow, particularly when the network is deep. SwiGLU has become standard in large models (LLaMA, PaLM) and generally outperforms ReLU without meaningful additional cost.
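The SwiGLU computation itself is short. Here is a NumPy sketch of the unconditioned core (the AdaNorm and conditioned gate wrappers are left out); the weight names and the expansion factor of 2 are arbitrary choices for illustration.

```python
import numpy as np


def swish(x):
    """x * sigmoid(x): smooth, and slightly negative for negative inputs."""
    return x / (1.0 + np.exp(-x))


def swiglu_transition(x, w_a, w_b, w_out):
    """Two parallel up-projections, Swish on one, multiply, project down."""
    return (swish(x @ w_a) * (x @ w_b)) @ w_out


rng = np.random.default_rng(0)
d_model, expansion = 8, 2
x = rng.normal(size=(5, d_model))
w_a = rng.normal(size=(d_model, expansion * d_model))
w_b = rng.normal(size=(d_model, expansion * d_model))
w_out = rng.normal(size=(expansion * d_model, d_model))

y = swiglu_transition(x, w_a, w_b, w_out)
```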
Why does the same Atom Transformer appear again in the diffusion module?
All four building blocks here (AdaNorm, attention with pair bias, conditioned gating, and the SwiGLU transition) are not unique to the input embedder. The exact same Atom Transformer structure reappears inside the diffusion module, where it runs at atom level during structure generation. Understanding it here means you've already understood it there. The key difference is what conditioning tensors are used: in the input embedder, c and p come from the conformer; in the diffusion module, they are updated versions incorporating trunk information.
Aggregating to Token Level: $s_\text{init}$ and $z_\text{init}$
Everything built so far (c, q, p) lives at the atom level. The representation learning section operates at the token level, so we aggregate here. This is the last step of the input embedder.
$s_\text{init}$ (token-level single). $q$ is linearly projected from $C_\text{atom}$ (128) to $C_\text{token}$ (384). For each token, all atom representations belonging to that token are averaged, compressing $N_\text{atoms}$ rows down to $N_\text{tokens}$. The result is concatenated with token-level MSA features where available (residue type, MSA statistics at position $i$, etc.), growing the channel dimension to $C_\text{token} + 65$. A final linear projection brings it back to $C_\text{token}$, producing $s_\text{init}$. The pre-projection version, $s_\text{inputs}$, is saved separately for later use in structure prediction.
$z_\text{init}$ (token-level pair). Starting from $s_\text{init}$, project each token's representation from $C_\text{token}$ to $C_z$ (128), giving vectors $s_i$ and $s_j$. The pair entry $z_{i,j}$ is initialized as $s_i + s_j$. Two things are then added: a relative positional encoding capturing how far apart tokens $i$ and $j$ are in sequence, and any user-specified bond information linearly embedded into the pair representation.
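The two aggregation steps can be sketched in NumPy: a per-token mean over atoms, and a broadcast sum that turns single vectors into pair entries. The linear projections, extra features, and positional encoding are omitted, and the helper names are hypothetical.

```python
import numpy as np


def aggregate_atoms_to_tokens(q_proj, atom_to_token, n_tokens):
    """Average the atom vectors belonging to each token."""
    sums = np.zeros((n_tokens, q_proj.shape[-1]))
    np.add.at(sums, atom_to_token, q_proj)   # unbuffered scatter-add per token
    counts = np.bincount(atom_to_token, minlength=n_tokens)
    return sums / counts[:, None]


def init_pair(s_proj):
    """Pair entry (i, j) starts as the sum of the two single vectors."""
    return s_proj[:, None, :] + s_proj[None, :, :]


# Two atoms in token 0, one atom in token 1.
q_proj = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]])
atom_to_token = np.array([0, 0, 1])

s_tokens = aggregate_atoms_to_tokens(q_proj, atom_to_token, n_tokens=2)
z_pairs = init_pair(s_tokens)
```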
At this point all inputs are fully embedded. The atom-level representations (c, q, p) are set aside for the diffusion module. The representation learning section from here works entirely with the token-level s and z, updated using the MSA (m) and templates (t).
Template Module
What is element-wise addition (⊕)?
When two matrices of identical shape are added together, matching entries at each position are summed. For pair representations, zi,j from one source is added directly to zi,j from another, letting information from both accumulate in the same position. This is how the template module, MSA module, and recycling loop all contribute to the same z without replacing what came before.
The template module injects structural prior information from PDB homologues into the pair representation z. The templates retrieved earlier each contribute a pairwise distance matrix: all token-pair distances binned into 38 bins spanning 3.25–50.75 Å. Cross-chain distances are masked to zero because templates only inform intra-chain geometry, never inter-chain contacts.
The processing steps for each template:
1. Each template's pair representation is linearly projected into channel dimension $C$, giving one matrix $v_t$ per template.
2. $z$ is also linearly projected ($C_z \to C$) and added to each $v_t$, giving every template context about the current state of the pair representation.
3. Each combined $v_t$ passes through 2 blocks of the Pairformer stack (same operations as the main Pairformer, described in that section).
4. All template representations are averaged into a single $N_\text{tokens} \times N_\text{tokens} \times C$ matrix.
5. A linear layer followed by ReLU produces $u$, notably one of only two uses of ReLU in AF3. The other is in the input embedder; every other non-linearity in the network uses SwiGLU.
6. $u$ is added to $z$, injecting structural template knowledge into the pair representation before the main Pairformer runs.
The design is efficient: templates are processed in parallel, averaged rather than concatenated (keeping z's dimensionality fixed regardless of how many templates are found), and the lightweight 2-block Pairformer does the heavy lifting of extracting useful structural signal before averaging collapses the template dimension.
Why run a Pairformer on each template separately?
Templates contain pairwise geometry between residues, the same kind of information as z. Running a dedicated Pairformer on template features before merging them with z allows the model to apply triangle inequality constraints to template data before it enters the main trunk. Without this, raw distance bins would be added directly to z, mixing different feature spaces and losing structured geometric reasoning. Processing templates independently (rather than concatenating them) also keeps the computation linear in the number of templates.
Why average instead of concatenate templates?
Concatenating all templates would multiply the channel dimension by the number of templates (up to 4), quadrupling z's memory footprint and making the downstream Pairformer proportionally more expensive. Averaging collapses the template dimension entirely, keeping z at fixed size. The averaging is done after the per-template Pairformer has already extracted each template's structural signal, so collapsing across templates does not discard information prematurely.
MSA Module
The MSA module is AF3's version of AF2's Evoformer. Its job is to simultaneously refine the MSA representation (m) and the pair representation (z), letting information flow between them. It runs 4 blocks, each containing three distinct steps: updating z from m (outer product mean), updating m from z (row-wise attention), and updating z through triangle operations.
Step 1: Outer Product Mean (m → z)
The first thing each block does is push MSA information into the pair representation. Comparing two columns of the MSA tells you how correlated those two positions are across evolution. For every token pair $(i, j)$, take the outer product of $a_{s,i}$ and $b_{s,j}$ (two linearly projected MSA vectors) for each evolutionary sequence $s$, then average across $s$. This gives one $N_\text{tokens} \times N_\text{tokens} \times C^2$ tensor capturing cross-position correlations. Flatten the $C^2$ depth, project down to $C_z$, and add to $z_{i,j}$.
This is the only point in the entire model where information is shared across evolutionary sequences. Everything else in the MSA module runs independently per sequence row. This is a significant simplification from AF2's Evoformer, which was much more computationally intensive in how it mixed sequences.
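In NumPy the outer product mean is essentially one `einsum`. The sketch below follows the description above; the weight names are hypothetical and the layer normalization that would normally precede the projections is left out.

```python
import numpy as np


def outer_product_mean(m, w_a, w_b, w_out):
    """Turn an MSA representation into a pair update.

    m: (N_seq, N_tokens, C_m). Returns (N_tokens, N_tokens, C_z).
    """
    a = m @ w_a                                               # (S, N, C)
    b = m @ w_b                                               # (S, N, C)
    # Outer product of a[s, i] and b[s, j], averaged over sequences s.
    outer = np.einsum("sic,sjd->ijcd", a, b) / m.shape[0]     # (N, N, C, C)
    flat = outer.reshape(outer.shape[0], outer.shape[1], -1)  # (N, N, C*C)
    return flat @ w_out


rng = np.random.default_rng(0)
n_seq, n_tok, c_m, c, c_z = 6, 4, 5, 3, 7
m = rng.normal(size=(n_seq, n_tok, c_m))
w_a = rng.normal(size=(c_m, c))
w_b = rng.normal(size=(c_m, c))
w_out = rng.normal(size=(c * c, c_z))

z_update = outer_product_mean(m, w_a, w_b, w_out)
```

The sum over `s` inside the `einsum` is the only place where different MSA rows are combined.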
Step 2: Row-wise Gated Attention (z → m)
Having updated z from m, the block then does the reverse: update m using z. This is attention with no queries or keys. Instead, row $i$ of $z$ directly provides the attention weights for position $i$, and the same weights are reused for every sequence row $s$. Each $z_{i,j}$ vector is linearly projected to a scalar per head, softmaxed along $j$, and used as attention weights to compute a weighted sum over the positions $j$ within row $s$ of $m$. A sigmoid gate (also from m) filters the output before concatenating heads. This runs independently per sequence row, with no cross-sequence information shared.
The key intuition: $z_{i,j}$ already encodes the relationship between tokens $i$ and $j$, so projecting it to a scalar gives an attention map directly from that relationship. No learned Q/K projections are needed.
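A single-head NumPy sketch of this pair-weighted update. The weights come only from z, the values and gate come from m, and each MSA row is processed on its own. Weight names are hypothetical and normalization layers are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def pair_weighted_average(m, z, w_v, w_b, w_g):
    """Update m (S, N, C) using attention weights derived only from z (N, N, Cz)."""
    values = m @ w_v                                  # (S, N, C)
    weights = softmax((z @ w_b)[..., 0], axis=-1)     # (N, N), normalized over j
    gate = sigmoid(m @ w_g)                           # (S, N, C)
    # For each row s and position i: weighted sum over positions j of the same row.
    return gate * np.einsum("ij,sjc->sic", weights, values)


rng = np.random.default_rng(0)
n_seq, n_tok, c_m, c_z = 3, 4, 5, 6
m = rng.normal(size=(n_seq, n_tok, c_m))
z = rng.normal(size=(n_tok, n_tok, c_z))
w_v = rng.normal(size=(c_m, c_m))
w_b = rng.normal(size=(c_z, 1))
w_g = rng.normal(size=(c_m, c_m))

m_update = pair_weighted_average(m, z, w_v, w_b, w_g)

# Perturb only MSA row 1: the update for row 0 should not change.
m_perturbed = m.copy()
m_perturbed[1] += 1.0
m_update_perturbed = pair_weighted_average(m_perturbed, z, w_v, w_b, w_g)
```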
Step 3: Triangle Updates (z → z)
The rest of each block refines z using the same four triangle operations described in the Pairformer section: triangle multiplicative update (outgoing), triangle multiplicative update (incoming), triangle self-attention (start node), and triangle self-attention (end node). Each is followed by a residual add. The block closes with a SwiGLU transition. No MSA features are used in this step; it operates purely on z.
Why outer product mean instead of a learned attention over sequences?
Attention across the $N_\text{MSA}$ dimension would cost $O(N_\text{MSA}^2 \times N_\text{tokens})$, which is prohibitive at scale. The outer product mean costs $O(N_\text{MSA} \times N_\text{tokens}^2)$ and relies on averaging rather than learned weights over sequences. This works because the relevant signal is the correlation structure across positions, which averaging captures directly. Individual sequences contribute equally, reflecting the assumption that each homologue is an independent sample from evolution.
Why only 4 MSA blocks vs 48 Pairformer blocks?
The MSA module is intentionally shallow. Its role is to extract co-evolutionary signal and inject it into z, not to do the full geometric reasoning the Pairformer handles. Four blocks is enough to propagate correlations from the MSA columns into pair representations, after which the deep Pairformer refines those pair representations without needing to revisit the MSA. Keeping the MSA module shallow also limits memory cost, since the MSA tensor ($N_\text{MSA} \times N_\text{tokens} \times C_m$) is large.
Pairformer
After the template and MSA modules have enriched z, the model sets the template and MSA representations aside. From here, only the pair representation z and single representation s pass through the Pairformer: 48 blocks that refine them against each other. Each block has two streams: the pair stream (4 triangle operations + transition) and the single stream (single attention with pair bias + transition). The updated z feeds down into the single stream each block.
Why triangles?
The pair representation $z_{i,j}$ encodes the relationship between token $i$ and token $j$, a learned analog of their geometric relationship. The triangle inequality says: if you know the distance from $i$ to $k$, and from $j$ to $k$, you have a strong constraint on what the distance from $i$ to $j$ can be. The triangle operations bake this principle in by ensuring every $z_{i,j}$ is updated by looking at all possible third tokens $k$ simultaneously.
Because z encodes directional relationships, we need to consider two types of triangles. If we think of tokens as nodes in a directed graph with z as a weighted adjacency matrix, "outgoing edges" from i form one type of triangle and "incoming edges" to i form another.
Triangle Updates
Each triangle update creates three linear projections of $z$ (called $a$, $b$, $g$). To update $z_{i,j}$, take row $i$ from $a$ and row $j$ from $b$, multiply them element-wise for each $k$, then sum across all $k$. This is equivalent to asking: "for every possible third token $k$, how does the relationship $i \to k$ combine with the relationship $j \to k$?" The result is gated by $g_{i,j}$ and added back to $z$. The incoming update does the same but transposes the logic: use column $i$ from $a$ and column $j$ from $b$, capturing $k \to i$ and $k \to j$ triangles instead.
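Written with `einsum`, the outgoing and incoming updates differ only in which index is summed. The NumPy sketch below keeps just the core described above; the extra gates on a and b, the layer norms, and the output projection of the full layer are omitted, and the weight names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def triangle_update(a, b, outgoing=True):
    """Combine edges through every third token k, channel by channel."""
    if outgoing:
        # update[i, j] = sum_k a[i, k] * b[j, k]   (edges i->k and j->k)
        return np.einsum("ikc,jkc->ijc", a, b)
    # update[i, j] = sum_k a[k, i] * b[k, j]       (edges k->i and k->j)
    return np.einsum("kic,kjc->ijc", a, b)


def triangle_multiplicative_update(z, w_a, w_b, w_g, outgoing=True):
    """Project z to a, b, g; gate the triangle update; add it back to z."""
    a, b = z @ w_a, z @ w_b
    gate = sigmoid(z @ w_g)
    return z + gate * triangle_update(a, b, outgoing)


rng = np.random.default_rng(0)
n_tok, c_z = 4, 3
z = rng.normal(size=(n_tok, n_tok, c_z))
w_a, w_b, w_g = (rng.normal(size=(c_z, c_z)) for _ in range(3))

z_out = triangle_multiplicative_update(z, w_a, w_b, w_g, outgoing=True)
z_in = triangle_multiplicative_update(z, w_a, w_b, w_g, outgoing=False)
```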
Triangle Attention
Triangle attention extends axial attention with the triangle principle. For the starting node variant, to compute attention scores for $z_{i,j}$ along row $i$, we compare query $z_{i,j}$ with key $z_{i,k}$ for all $k$, then add a bias from $z_{j,k}$ projected to a scalar per head. The bias from $z_{j,k}$ nudges the attention weights based on the third edge of the triangle, encoding how $j$ relates to $k$. Values also come from row $i$.
The ending node variant transposes this: keys and values come from column $j$ of $z$ instead of row $i$, and the pair bias comes from column $i$ ($z_{k,i}$). Same logic, opposite axis. Sigmoid gating also appears throughout.
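A single-head NumPy sketch of both variants. The ending-node version is implemented by transposing z, running the starting-node code, and transposing back, which reproduces the index pattern described above. Weight names are hypothetical, and layer norms and the output projection are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def triangle_attention_start(z, w_q, w_k, w_v, w_b, w_g):
    """Starting-node variant: entry (i, j) attends over entries (i, k)."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v               # each (N, N, C)
    bias = (z @ w_b)[..., 0]                          # bias[j, k] from edge j-k
    logits = np.einsum("ijc,ikc->ijk", q, k) / np.sqrt(q.shape[-1])
    logits = logits + bias[None, :, :]                # third edge of the triangle
    weights = softmax(logits, axis=-1)                # normalized over k
    out = np.einsum("ijk,ikc->ijc", weights, v)
    return sigmoid(z @ w_g) * out


def triangle_attention_end(z, w_q, w_k, w_v, w_b, w_g):
    """Ending-node variant: the same computation along columns."""
    z_t = z.transpose(1, 0, 2)
    return triangle_attention_start(z_t, w_q, w_k, w_v, w_b, w_g).transpose(1, 0, 2)


rng = np.random.default_rng(0)
n_tok, c_z, c = 4, 6, 5
z = rng.normal(size=(n_tok, n_tok, c_z))
w_q, w_k, w_v, w_g = (rng.normal(size=(c_z, c)) for _ in range(4))
w_b = rng.normal(size=(c_z, 1))

start_update = triangle_attention_start(z, w_q, w_k, w_v, w_b, w_g)
end_update = triangle_attention_end(z, w_q, w_k, w_v, w_b, w_g)
```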
Single Attention with Pair Bias
After the four triangle steps and a transition block refine z, the model updates s using z as an attention bias. The ingredients are Q/K/V projections from $s$, a bias from row $i$ of $z$ projected to a scalar per head, and a sigmoid gate. This runs at the token level with full attention (no sparse windowing, since $N_\text{tokens} \ll N_\text{atoms}$). After 48 blocks the outputs are $s_\text{trunk}$ and $z_\text{trunk}$, which feed into the diffusion module.
Recycling
The 48 blocks do not run just once. The entire trunk (input embedder, MSA module, template module, and Pairformer) runs multiple times end to end. This is called recycling (3 iterations by default). At the start of each new pass, the pair and single representations from the previous pass are added back into the input, giving the model a better initialization for the next round.
Each recycling pass lets the representations converge: early passes establish rough global geometry, later passes refine local structure. Only the final pass produces the $s_\text{trunk}$ and $z_\text{trunk}$ that feed into the diffusion module.
Why does recycling work?
Recycling was shown in AF2 to significantly improve accuracy at minimal additional cost compared to simply deepening the trunk. Rather than adding more Pairformer blocks, recycling re-runs the full pipeline. This allows each iteration to incorporate the geometric constraints discovered in the previous pass. Early iterations establish which residues are likely close; later iterations refine the exact relationships given that knowledge. The network learns to exploit this iterative structure during training.
Diffusion Module
The diffusion module is where 3D coordinates are actually generated. At each denoising step, the module takes noisy atom coordinates $x$ and predicts how much noise to remove. To do this it conditions on everything the trunk computed ($s_{\text{trunk}}$, $z_{\text{trunk}}$) plus the original atom-level embeddings from the input embedder ($s_{\text{inputs}}$, $c$, $p$). The unprocessed versions matter because the trunk representations are token-level, while diffusion also needs raw atom-level geometry.
The process has 4 steps that repeatedly move between atom-level and token-level representations within each denoising step.
Step 1: Token-Level Conditioning
Two conditioning tensors are initialized. The token-level single $s$ is formed by concatenating $s_{\text{inputs}}$ and $s_{\text{trunk}}$, projecting back to $C_{\text{token}}$, then adding a Fourier embedding of the current diffusion timestep $t$. Including $t$ ensures the model knows what noise scale to remove at this step. The token-level pair $z$ is formed similarly by concatenating $z_{\text{trunk}}$ and the relative positional encoding, projecting to $C_z$, and running through transition blocks.
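As a rough illustration of the single-representation half of this step, the sketch below builds a random-feature Fourier embedding of $t$ and adds it to the projected concatenation. The frequency and phase vectors `w` and `b`, the projection `W_proj`, and all sizes are assumptions for the example; any rescaling of $t$ and the normalisation layers are omitted.

```python
import numpy as np

def fourier_embedding(t, w, b):
    """Embed a scalar noise level t with fixed random frequencies w and phases b."""
    return np.cos(2.0 * np.pi * (t * w + b))      # (C,), every entry in [-1, 1]

def token_single_conditioning(s_inputs, s_trunk, W_proj, t, w, b):
    """Concatenate, project to the token channel size, then add the t embedding."""
    s = np.concatenate([s_inputs, s_trunk], axis=-1) @ W_proj   # (N, C_token)
    return s + fourier_embedding(t, w, b)[None, :]              # same vector for every token

rng = np.random.default_rng(3)
N, C_in, C_trunk, C_token = 5, 6, 8, 8
s_inputs = rng.normal(size=(N, C_in))
s_trunk = rng.normal(size=(N, C_trunk))
W_proj = rng.normal(size=(C_in + C_trunk, C_token))
w = rng.normal(size=C_token)    # frequencies, drawn once and then kept fixed
b = rng.normal(size=C_token)    # phases, drawn once and then kept fixed
s_cond = token_single_conditioning(s_inputs, s_trunk, W_proj, t=2.5, w=w, b=b)
```

Because `w` and `b` are fixed, two different noise levels always map to two different but repeatable vectors, which is all the network needs to tell the steps apart.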
Step 2: Atom-Level Conditioning, Coordinate Update, and Aggregation
The atom-level tensors $c$ and $p$ (from the input embedder) are updated to incorporate trunk information. $s_{\text{trunk}}$ is broadcast from $N_{\text{tokens}}$ to $N_{\text{atoms}}$ (each token's representation is copied to all its atoms), projected to $C_{\text{atom}}$, and added to $c$. $z_{\text{trunk}}$ is similarly broadcast and added to $p$.
The noisy coordinates $x$ ($N_{\text{atoms}} \times 3$) are scaled to unit variance to create dimensionless $r$, then linearly projected to $C_{\text{atom}}$ and added to $q$. This gives $q$ awareness of each atom's current position. $q$ passes through the Atom Transformer (conditioned on $c$ and $p$), and the output is aggregated back to token-level to produce $a$.
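The three data movements in this step (token to atom, coordinate scaling, atom back to token) can be sketched as follows. The `atom_to_token` index array and the `sigma_data` parameter are assumptions for the example: the scaling assumes the noisy coordinates have variance $t^2 + \sigma_{\text{data}}^2$, and aggregation is taken to be a per-token mean.

```python
import numpy as np

def broadcast_to_atoms(token_feats, atom_to_token):
    """Copy each token's feature vector to all atoms belonging to that token."""
    return token_feats[atom_to_token]                 # (N_atoms, C)

def scale_coordinates(x, t, sigma_data):
    """Rescale noisy coordinates to roughly unit variance (dimensionless r)."""
    return x / np.sqrt(t ** 2 + sigma_data ** 2)

def aggregate_to_tokens(atom_feats, atom_to_token, n_tokens):
    """Average atom features within each token to get token-level features."""
    sums = np.zeros((n_tokens, atom_feats.shape[-1]))
    np.add.at(sums, atom_to_token, atom_feats)        # scatter-add rows per token
    counts = np.bincount(atom_to_token, minlength=n_tokens)[:, None]
    return sums / np.maximum(counts, 1)               # avoid dividing by zero

# Two tokens: the first owns atoms 0-1, the second owns atoms 2-4.
atom_to_token = np.array([0, 0, 1, 1, 1])
token_feats = np.array([[1.0, 2.0], [3.0, 4.0]])
atom_feats = broadcast_to_atoms(token_feats, atom_to_token)
round_trip = aggregate_to_tokens(atom_feats, atom_to_token, n_tokens=2)
r = scale_coordinates(np.full((5, 3), 5.0), t=3.0, sigma_data=4.0)
```

Broadcasting followed by mean aggregation returns the original token features, which is a handy sanity check when wiring these two operations together.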
Step 3: Token-Level Attention
The aggregated $a$ is updated using a token-level transformer: AdaNorm (conditioned on $s$), attention with pair bias (using $z$), conditioned gating, and a SwiGLU transition. This runs at token level with full attention and uses $s$ and $z$ from Step 1 as conditioning.
Step 4: Atom-Level Attention and Coordinate Prediction
$a'$ (updated token-level) is broadcast back to atom-level, then used to update $q$ via another Atom Transformer (conditioned on $c'$ and $p'$). A final linear layer maps $q$ back to $\mathbb{R}^3$, giving the coordinate update. Because the prediction was made in unit-variance space, the update is rescaled and applied: $x_{\text{new}} = x + x_{\text{update}}$. That's a single diffusion step. The process repeats until coordinates converge.
How is diffusion different from AF2's coordinate prediction?
AF2 predicted coordinates directly via frame-based iterative refinement (IPA). AF3 instead treats structure prediction as a denoising problem: start from random noise, and repeatedly remove noise conditioned on the sequence. The key advantage is handling multi-modal distributions. If a protein can fold into two equally valid conformations, a direct regression head would average them and produce neither. A diffusion model can sample from both modes. This matters most for ligands, RNA, and flexible loops where the true distribution is genuinely multi-modal.
Confidence Module & Loss Function
The confidence module predicts how accurate the model's own output is. Its goal is not to improve the structure but to give the user a signal about which parts of a prediction to trust. Its outputs (pLDDT, PAE, PDE, and resolved atom predictions) are trained via a dedicated confidence loss. The full training objective combines three terms:
The model's predicted atom coordinates are converted to a token-level distogram by using the center atom of each token (Cα for amino acids, C1' for nucleotides). Pairwise distances between these center atoms are binned into distance categories, and the predicted distogram is compared to the true one via cross-entropy. This acts as a fast, coarse signal on overall structural accuracy.
$\mathcal{L}_{\text{MSE}}$ measures mean squared error between predicted and ground-truth atom positions across all atoms (with DNA, RNA, and ligand atoms upweighted). The full-atom MSE is scaled by a factor that depends on the diffusion noise level $\hat{t}$, giving more weight to predictions at low noise (fine-grained refinement steps). $\mathcal{L}_{\text{bond}}$ adds an extra MSE penalty specifically for protein-ligand bond lengths.
$\mathcal{L}_{\text{smooth-lDDT}}$ (smoothed local distance difference test) measures local accuracy. For each atom pair, the predicted and true distances are compared. The absolute difference is passed through a sigmoid at four thresholds (0.5, 1, 2, 4 Å), converting each into a smooth probability of passing that test. The average pass probability across all four thresholds is the smooth-lDDT score. Pairs whose true distance exceeds the inclusion radius (15 Å, or 30 Å for nucleic acids) are excluded.
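The smooth-lDDT score is compact enough to write out directly. This is a sketch under simplifying assumptions: a single `cutoff` radius is applied to every pair (no separate nucleic-acid radius), and the function returns the score, with the loss taken to be one minus that score.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_distances(x):
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

def smooth_lddt(pred, true, cutoff=15.0):
    """Smooth lDDT score between predicted and true coordinates, both (N, 3)."""
    d_pred = pairwise_distances(pred)
    d_true = pairwise_distances(true)
    delta = np.abs(d_pred - d_true)                     # per-pair distance error
    thresholds = np.array([0.5, 1.0, 2.0, 4.0])         # in angstroms
    # Soft "pass" probability at each threshold, averaged over the four tests.
    score = sigmoid(thresholds - delta[..., None]).mean(axis=-1)
    # Keep only distinct pairs that are close in the true structure.
    mask = (d_true < cutoff) & ~np.eye(len(true), dtype=bool)
    return (score * mask).sum() / max(mask.sum(), 1)

rng = np.random.default_rng(4)
true = rng.uniform(0.0, 5.0, size=(10, 3))
score_perfect = smooth_lddt(true, true)
score_inflated = smooth_lddt(3.0 * true, true)   # every distance is 3x too long
loss = 1.0 - score_perfect
```

One consequence of using sigmoids instead of hard thresholds is that even a perfect prediction scores below 1, since `sigmoid(0.5)` is well short of 1. The benefit is that the score is differentiable everywhere.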
The confidence loss trains the model to predict its own error metrics. The key insight:
if the predicted structure is highly inaccurate but the model correctly predicts that
it will be, the confidence loss is low. This teaches calibration, not structural accuracy.
Gradients from this loss only update the confidence heads. They do not affect the
rest of the network.
pLDDT: predicted lDDT
For atom l, the lDDT is computed by finding all polymer center atoms m within
15 Å (or 30 Å for nucleic acids), measuring the distance between l and each m
in the predicted structure, and checking how many of those distances are within
4, 2, 1, and 0.5 Å of the true distances. The average pass rate is binned into
50 bins over [0, 1].
The pLDDT head takes the single representation for each token, broadcasts it
to all atoms in that token, and projects to 50 logits. A softmax converts
these to a distribution over bins, and cross-entropy loss is applied.
At inference, the expected value over the bin centers gives the predicted per-atom accuracy.
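The binning, loss, and readout for this head can be sketched as below. The sketch assumes 50 equal-width bins over [0, 1] and an expected-value readout over bin centers, which is the usual convention for this kind of head; the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lddt_to_bin(lddt, n_bins=50):
    """Map true lDDT values in [0, 1] to integer bin indices."""
    return np.minimum((lddt * n_bins).astype(int), n_bins - 1)

def plddt_cross_entropy(logits, true_lddt):
    """Cross-entropy between predicted bin logits (N_atoms, n_bins) and true bins."""
    target = lddt_to_bin(true_lddt, logits.shape[-1])
    log_probs = np.log(softmax(logits))
    return -log_probs[np.arange(len(target)), target].mean()

def plddt_from_logits(logits):
    """Expected lDDT: probability-weighted average of the bin centers."""
    n_bins = logits.shape[-1]
    centers = (np.arange(n_bins) + 0.5) / n_bins
    return softmax(logits) @ centers

# Three atoms with uninformative (uniform) predictions.
logits = np.zeros((3, 50))
true_lddt = np.array([0.0, 0.5, 1.0])
bins = lddt_to_bin(true_lddt)
loss = plddt_cross_entropy(logits, true_lddt)
plddt = plddt_from_logits(logits)
```

Reading out the expectation rather than the most likely bin gives a continuous score and degrades gracefully when the predicted distribution is broad.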
PAE: predicted alignment error
Every token has a local coordinate frame defined by three atoms. For each
token pair (i, j), the predicted and true positions of token i's center atom
are both expressed in token j's frame. If token j is in exactly its correct
position, how far off is token i? That distance is the alignment error for
(i, j), binned into 64 bins.
The PAE head projects z_{i,j} to 64 logits → softmax → cross-entropy.
PAE captures inter-domain accuracy: even if both domains are internally
correct, PAE measures whether their relative orientations are right.
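To see what "expressed in token j's frame" means in practice, here is a small sketch of the ground-truth alignment error for one token pair. The frame construction is an assumption: it uses plain Gram-Schmidt on three atoms with the middle atom as origin, which may differ in detail from the model's own frame definition.

```python
import numpy as np

def make_frame(a, b, c):
    """Orthonormal frame from three atoms, with b as the origin (Gram-Schmidt)."""
    e1 = (a - b) / np.linalg.norm(a - b)
    u2 = (c - b) - ((c - b) @ e1) * e1      # remove the component along e1
    e2 = u2 / np.linalg.norm(u2)
    e3 = np.cross(e1, e2)
    return b, np.stack([e1, e2, e3])        # origin and (3, 3) axes as rows

def to_local(x, frame):
    """Express a global position x in the local coordinates of a frame."""
    origin, axes = frame
    return axes @ (x - origin)

def alignment_error(pred_frame_j, true_frame_j, pred_xi, true_xi):
    """Distance between token i's predicted and true positions, both seen from token j."""
    return np.linalg.norm(to_local(pred_xi, pred_frame_j) - to_local(true_xi, true_frame_j))

# True structure: three frame atoms for token j and a center atom for token i.
true_j = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
true_xi = np.array([3.0, 4.0, 5.0])

# Prediction = the true structure, rigidly rotated 90 degrees about z and shifted.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
shift = np.array([1.0, 2.0, 3.0])
pred_j = true_j @ R.T + shift
pred_xi = R @ true_xi + shift

err_rigid = alignment_error(make_frame(*pred_j), make_frame(*true_j), pred_xi, true_xi)
err_moved = alignment_error(make_frame(*pred_j), make_frame(*true_j),
                            pred_xi + np.array([0.0, 0.0, 2.0]), true_xi)
```

A global rotation or translation of the whole prediction leaves the error at zero, while displacing token i relative to token j shows up directly. That is why PAE is sensitive to relative domain placement rather than absolute pose.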
PDE: predicted distance error
The true distance error for token pair (i, j) is the absolute difference
between the predicted and true distance between their center atoms, binned
into 64 bins from 0–32 Å. The PDE head projects z_{i,j} + z_{j,i} to
64 logits → softmax → cross-entropy.
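A short sketch of the PDE targets and the symmetrised head follows. It assumes 64 equal-width bins over 0 to 32 Å with larger errors clipped into the last bin; the projection matrix `W` and the function names are hypothetical.

```python
import numpy as np

def pairwise_distances(x):
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

def pde_target_bins(pred_centers, true_centers, n_bins=64, max_error=32.0):
    """Bin the absolute error in center-atom distances for every token pair."""
    err = np.abs(pairwise_distances(pred_centers) - pairwise_distances(true_centers))
    width = max_error / n_bins                          # 0.5 angstrom per bin
    return np.minimum((err / width).astype(int), n_bins - 1)

def pde_logits(z, W):
    """Project the symmetrised pair representation z[i, j] + z[j, i] to bin logits."""
    return (z + z.transpose(1, 0, 2)) @ W               # (N, N, n_bins)

true_centers = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
pred_centers = np.array([[0.0, 0.0, 0.0], [11.3, 0.0, 0.0]])   # distance off by 1.3
bins = pde_target_bins(pred_centers, true_centers)

rng = np.random.default_rng(5)
z = rng.normal(size=(2, 2, 8))
W = rng.normal(size=(8, 64))
logits = pde_logits(z, W)
```

Distance error does not depend on the order of the pair, so symmetrising $z$ before the projection guarantees the head predicts the same distribution for (i, j) and (j, i).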
Experimentally resolved prediction
Not every atom in a crystal structure is experimentally resolved. Some
are inferred or missing. The model predicts binary resolved/unresolved for
each atom. The single representation s_i is broadcast to all atoms in token i,
projected to 2 logits, and trained with binary cross-entropy against the
ground-truth resolution status.
At a selected diffusion timestep $t$, the predicted coordinates $r_t$ are used to construct a distogram $d$ (by linear projection and binning). This $d$ is added to $s_i + s_j$ to initialize $z_{i,j}$, combining sequence and structural information into the pair representation for the confidence Pairformer. $s_{\text{trunk}}$ and $s_{\text{inputs}}$ are concatenated, projected, and combined with $z_{\text{trunk}}$ to initialize $s$ and $z$ for a lightweight Pairformer update. The updated $s$ and $z$ then feed the four confidence heads.
The confidence predictions are generated mid-run during diffusion. At a selected
timestep t, the current noisy coordinates are used to update s_trunk and z_trunk,
and the confidence heads predict errors from those updated representations. The
actual error metrics are then computed on those same coordinates and used as targets.
This "mini rollout" during training is what makes confidence calibration possible
without running full diffusion at every training step.
Before AlphaFold3 unified structure prediction across proteins, nucleic acids, ligands, and complexes into a single framework, DeepMind took an intermediate step with AlphaFold-Multimer (2021), a version of AF2 retrained specifically to handle multi-chain assemblies. Rather than bolting complexes onto a single-chain model with tricks like residue gaps or flexible linkers, AlphaFold-Multimer was trained end-to-end on oligomeric structures, with native handling of cross-chain genetics (pairing MSAs across chains using species annotations), permutation symmetry (so that identical chains in a homomer aren't unfairly penalized for being "out of order"), and a new confidence metric called interface pTM (ipTM) that specifically scores how well predicted interfaces match reality. Visually, this showed up as much sharper predicted aligned error (PAE) maps at chain interfaces: a model could now say not just "this chain folds correctly" but "these two chains dock correctly," and flag low-confidence interfaces (like PDB 6QF7) even when each individual chain was predicted well. It's a useful reference point for understanding why visualizing complexes, not just single folds, became such a central challenge on the road to AlphaFold3.