Visually Explaining AlphaFold 2 & 3

From AF2 to AF3

AF2 established the pair and single representation framework that AF3 inherits. To understand what AF3 adds and why, it helps to know what AF2 did and what it could not do. AF2 predicted protein structures only, operated at the residue level throughout its trunk (one vector per amino acid, all the way through), and was deterministic: the same input always produced the same output. Its architecture had two main stages: a 48-block Evoformer that jointly refined an MSA representation and a pair representation, followed by a structure module that used residue-specific backbone frames and torsion angles to produce 3D coordinates. The structure module's attention had to be invariant to global rotations and translations of the molecule, which led to the Invariant Point Attention (IPA) design.

AF3 keeps the pair representation and all the triangle-based operations that update it. These carry over largely unchanged into the Pairformer and are the most proven part of AF2. What AF3 replaces is everything around them. The 48-block Evoformer is replaced by a simpler 4-block MSA module followed by a 48-block Pairformer that operates on pairs and singles only. After those 4 MSA blocks, the MSA representation is discarded entirely. The structure module is replaced by a diffusion module that works directly on raw atom coordinates. The one-sentence version: AF3 keeps AF2's pair processing, cuts the MSA processing drastically, and replaces the structure module with diffusion.

AF3 is generative, not deterministic

The most consequential change is that AF3 is a generative model. The diffusion module is trained to denoise atomic coordinates: given corrupted positions, predict the true ones. At inference time, random noise is sampled and iteratively denoised. The same input run twice with different random seeds will produce different structures. For most complex types, this variation is small and confidence-based ranking selects the best sample from a handful of seeds. For antibody-antigen complexes, performance keeps improving even up to 1,000 seeds, which reflects genuine geometric uncertainty at the interface rather than a quirk of sampling. This is a fundamental shift in what the model is doing: AF2 predicted a single answer, AF3 draws from a distribution over plausible structures.
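The sample-then-rank loop described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not AF3's sampler: `toy_denoiser`, `toy_confidence`, the noise schedule, and the 1-D "structure" are all hypothetical stand-ins, and the toy confidence score cheats by looking at the known answer, which a real confidence head cannot do.

```python
import math
import random

# Hypothetical 1-D "structure" that a perfect model would recover.
TARGET = [0.0, 1.5, 3.0, 4.5]

def toy_denoiser(coords, sigma):
    """Stand-in for the trained network: guess clean coordinates.

    At high noise the guess leans on the learned answer; at low noise
    it mostly trusts the current coordinates.
    """
    weight = sigma ** 2 / (1.0 + sigma ** 2)
    return [x + weight * (t - x) for x, t in zip(coords, TARGET)]

def sample_structure(seed, n_steps=20, sigma_max=10.0, sigma_min=0.05):
    """Start from seeded random noise and iteratively denoise it."""
    rng = random.Random(seed)
    coords = [rng.gauss(0.0, sigma_max) for _ in TARGET]
    for step in range(n_steps):
        # Geometric noise schedule from sigma_max down to sigma_min (assumed).
        frac = step / (n_steps - 1)
        sigma = sigma_max * (sigma_min / sigma_max) ** frac
        denoised = toy_denoiser(coords, sigma)
        # Move halfway to the denoised guess, then re-inject a little noise.
        coords = [
            x + 0.5 * (d - x) + rng.gauss(0.0, 0.1 * sigma)
            for x, d in zip(coords, denoised)
        ]
    return coords

def toy_confidence(coords):
    """Toy ranking score: negative RMSD to TARGET (uses the answer; toy only)."""
    mse = sum((x - t) ** 2 for x, t in zip(coords, TARGET)) / len(TARGET)
    return -math.sqrt(mse)

# Several seeds give several different samples; ranking picks one.
samples = {seed: sample_structure(seed) for seed in range(5)}
best_seed = max(samples, key=lambda s: toy_confidence(samples[s]))
```

Rerunning with the same seed reproduces the same sample, while a new seed yields a different one. Adding seeds widens the pool that the ranking step can choose from, which is one way to read the antibody-antigen result above.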

Equivariance is dropped

AF2's IPA was carefully designed so that the attention operation remained invariant to global rotations and translations of the molecule. This made sense when the model operated inside residue backbone frames and needed to reason about relative orientations. AF3 abandons rotational invariance entirely. The diffusion training objective does not require it, and dropping it simplifies handling arbitrary chemistry: ligands, nucleic acids, and modified residues all have molecular graphs that do not fit naturally into a residue-frame-based equivariant design. No IPA, no FAPE loss, no backbone frame representation.
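To make the invariance idea concrete, the sketch below rotates a small set of points and checks what changes. It is only an illustration of the property, not of IPA or of AF3's diffusion module; the helper names and the three example points are made up for the demonstration.

```python
import math

def rotate_z(points, angle):
    """Rotate 3D points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def pairwise_distances(points):
    """Full matrix of Euclidean distances between points."""
    return [[math.dist(p, q) for q in points] for p in points]

# Hypothetical three-atom fragment.
points = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.5)]
rotated = rotate_z(points, math.radians(40))

# Raw coordinates change under a global rotation...
coords_changed = any(
    not math.isclose(a, b, abs_tol=1e-9)
    for p, q in zip(points, rotated)
    for a, b in zip(p, q)
)

# ...but the pairwise distances do not.
distances_same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(pairwise_distances(points), pairwise_distances(rotated))
    for a, b in zip(row_a, row_b)
)
```

A network that consumes raw coordinates sees `points` and `rotated` as different inputs, so nothing in its architecture forces the two to be treated alike. An invariant design such as IPA guarantees that by construction.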

Atom-level representations are new

AF2 worked entirely at the residue level in its trunk. There was no atom-level computation until the structure module's final side-chain prediction step. AF3 introduces atom-level representations in the input embedder, specifically the atom single representation c, the atom pair representation p, and the Atom Transformer output q. These have no AF2 analog. They are necessary because ligands and modified residues cannot be compressed to a single residue vector the way standard amino acids can. The input embedder computes these atom-level features from conformer geometry, then aggregates them to the token level before the Pairformer runs, bridging the two scales.
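The bridging step between the two scales can be pictured as a grouped reduction. The sketch below assumes a simple per-token mean over atom feature vectors and a mapping in which a standard residue's atoms share one token while a ligand atom gets a token of its own. The function name, the mapping, and the two-dimensional features are illustrative assumptions, and the real embedder applies learned layers around this step.

```python
def aggregate_atoms_to_tokens(atom_features, atom_to_token, n_tokens):
    """Average per-atom feature vectors into one vector per token."""
    dim = len(atom_features[0])
    sums = [[0.0] * dim for _ in range(n_tokens)]
    counts = [0] * n_tokens
    for feat, tok in zip(atom_features, atom_to_token):
        counts[tok] += 1
        for k, value in enumerate(feat):
            sums[tok][k] += value
    # Guard against empty tokens so the division is always defined.
    return [[v / max(c, 1) for v in row] for row, c in zip(sums, counts)]

# Five atoms: three belong to one residue token, two are ligand atoms
# that each form their own token (hypothetical toy input).
atom_features = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [5.0, 5.0], [7.0, 1.0]]
atom_to_token = [0, 0, 0, 1, 2]

token_features = aggregate_atoms_to_tokens(atom_features, atom_to_token, n_tokens=3)
# token_features == [[2.0, 2.0], [5.0, 5.0], [7.0, 1.0]]
```

The residue token ends up with a summary of its three atoms, while each ligand atom keeps its own features unchanged, which is why atom-level detail survives for molecules that have no residue structure to compress into.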

Confidence training required a new procedure

In AF2, the confidence head was trained by comparing the structure module output directly to the true structure at each training step. This does not transfer to diffusion: each training step only denoises one noise level and never produces a complete structure. AF3 solves this with a rollout procedure. During training, the full diffusion process is run at a coarser step size to generate a complete predicted structure, and that structure is used to supervise the confidence head. The confidence outputs are similar to AF2's: pLDDT (per atom in AF3 rather than per residue) and a predicted aligned error (PAE) matrix, with one addition: a predicted distance error (PDE) matrix that measures the error in the distance matrix of the predicted structure relative to the true structure.
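Once a rollout has produced a complete structure, supervising a distance-error head reduces to comparing two distance matrices. The sketch below shows one way such PDE training targets could be built: take the absolute difference between predicted and true pairwise distances and bin it. The function names, the 64 bins over 0 to 32 Å, and the three-point coordinates are assumptions for illustration, not values taken from the text above.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def pde_targets(pred_coords, true_coords, n_bins=64, max_error=32.0):
    """Bin |d_pred - d_true| per pair; the head would learn to predict the bin."""
    d_pred = distance_matrix(pred_coords)
    d_true = distance_matrix(true_coords)
    bin_width = max_error / n_bins
    targets = []
    for row_pred, row_true in zip(d_pred, d_true):
        row = []
        for dp, dt in zip(row_pred, row_true):
            error = abs(dp - dt)
            # Errors beyond max_error fall into the last bin.
            row.append(min(int(error / bin_width), n_bins - 1))
        targets.append(row)
    return targets

# Hypothetical rollout output versus ground truth: the third point is misplaced.
true_coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 3.0, 0.0)]
pred_coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 6.2, 0.0)]

targets = pde_targets(pred_coords, true_coords)
```

Pairs that involve the misplaced third point land in higher bins, while the correctly placed pair stays in bin 0. Because the target depends only on distances, it is unaffected by how the predicted and true structures happen to be oriented.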

Hallucination and cross-distillation

Generative models tend to hallucinate: they invent compact, plausible-looking structure even in regions that are genuinely disordered. AF3 addresses this by enriching its training data with structures predicted by AlphaFold-Multimer v2.3. In those predictions, disordered regions appear as extended loops rather than compact globules. Training on them teaches AF3 to produce ribbon-like disorder with low confidence rather than incorrectly confident compact folds. This cross-distillation is not a small fix: without it, AF3 hallucination rates are substantially higher.

The MSA de-emphasis is a principled bet

Reducing MSA processing from 48 blocks to 4 is not just an efficiency cut. AF3 is making a claim: that coevolutionary signal is not required for cross-entity interactions such as protein-ligand or protein-nucleic acid binding. These interactions are governed primarily by local chemistry and geometry rather than by how residues co-evolved across species. The results support this: AF3 substantially outperforms specialized docking tools on protein-ligand binding despite using far less MSA processing than AF2 used for protein-only prediction.
