GPT Training Visualization

Weights · Gradients · Activations · Attention

Weights (Viridis) · Gradients (Plasma) · Activations (Coolwarm) · Attention (Inferno)
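Each tensor family above gets its own colormap. A minimal sketch of how such a heatmap could be drawn, assuming matplotlib; the helper name render_heatmap and the random stand-in tensors are illustrative, not the app's actual rendering code:

```python
# Sketch only: draw a 2-D tensor with one of the colormaps listed above.
# Assumes matplotlib/NumPy; names and data are illustrative.
import numpy as np
import matplotlib.pyplot as plt

def render_heatmap(tensor, cmap, title):
    """Draw a 2-D array as a heatmap with the given colormap."""
    fig, ax = plt.subplots(figsize=(3, 3))
    im = ax.imshow(tensor, cmap=cmap, aspect="auto")
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
    return fig

weights = np.random.randn(64, 64)        # stand-in weight matrix
grads = 0.01 * np.random.randn(64, 64)   # stand-in gradient matrix
render_heatmap(weights, "viridis", "Weights")
render_heatmap(grads, "plasma", "Gradients")
plt.show()
```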
Loss
Cross-Entropy Loss scalar
L = −Σᵢ yᵢ log(ŷᵢ)
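With one-hot targets y, the sum reduces to the negative log-probability of the correct token. A minimal PyTorch sketch of the same computation over a [B, V] logits tensor (shapes and names are illustrative):

```python
# Minimal sketch of the cross-entropy loss over logits of shape [B, V].
# Assumes PyTorch; tensor names are illustrative, not from the app itself.
import torch
import torch.nn.functional as F

B, V = 4, 50257                      # batch size, vocabulary size
logits = torch.randn(B, V)           # output of the LM head
targets = torch.randint(0, V, (B,))  # ground-truth next-token ids

# Equivalent to L = -sum_i y_i * log(y_hat_i) with one-hot y:
log_probs = F.log_softmax(logits, dim=-1)
loss_manual = -log_probs[torch.arange(B), targets].mean()

# The fused library call gives the same value.
loss = F.cross_entropy(logits, targets)
assert torch.allclose(loss, loss_manual)
```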
Logits [B, V]
LM Head (Linear) [d_model, vocab]
Weights
Gradients
Hidden [B, T, D]
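The LM head is a single linear projection from the model dimension to the vocabulary. A sketch under the shapes shown above; whether the app projects every position or only the last one is an assumption here:

```python
# Sketch of the LM head: Hidden [B, T, D] -> logits over the vocabulary.
# Note: nn.Linear stores its weight as [vocab, d_model], the transpose of the
# [d_model, vocab] label in the panel; both describe the same projection.
import torch
import torch.nn as nn

B, T, d_model, vocab = 4, 128, 768, 50257
hidden = torch.randn(B, T, d_model)              # Hidden [B, T, D]

lm_head = nn.Linear(d_model, vocab, bias=False)
logits_all = lm_head(hidden)                     # [B, T, vocab]
logits_last = logits_all[:, -1, :]               # [B, vocab], matching Logits [B, V]
```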
Transformer Block ×12
LayerNorm
Feed-Forward Network
FFN Output
W₁ weights
W₁ grads
GELU
W₂ weights
W₂ grads
FFN Input
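The feed-forward sub-block is two linear layers with a GELU in between. A sketch assuming the usual 4x hidden expansion (the expansion factor is not shown in the panel):

```python
# Sketch of the position-wise feed-forward network: Linear (W1) -> GELU ->
# Linear (W2). The 4x expansion and d_model = 768 are assumptions.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model: int = 768, d_ff: int = 4 * 768):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # W1 weights / grads in the panel
        self.act = nn.GELU()
        self.w2 = nn.Linear(d_ff, d_model)   # W2 weights / grads in the panel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: FFN Input [B, T, D]  ->  FFN Output [B, T, D]
        return self.w2(self.act(self.w1(x)))
```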
LayerNorm
Multi-Head Self-Attention
Attn Output
Wq
Wk
Wv
Wo
Attention Matrix (QKᵀ/√d)
causal mask applied
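A sketch of the attention sub-block: Wq/Wk/Wv projections, per-head scores QKᵀ/√d, a lower-triangular causal mask, softmax, and the output projection Wo. Head count and sizes are illustrative, not read from the app:

```python
# Sketch of causal multi-head self-attention: project with Wq/Wk/Wv, score
# with QK^T / sqrt(d_head), mask future positions, softmax, mix values,
# then project back through Wo.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int = 768, n_head: int = 12):
        super().__init__()
        self.n_head, self.d_head = n_head, d_model // n_head
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.wo = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        # [B, T, D] -> [B, n_head, T, d_head]
        q = self.wq(x).view(B, T, self.n_head, self.d_head).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_head, self.d_head).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_head, self.d_head).transpose(1, 2)

        # Attention Matrix (QK^T / sqrt(d)), causal mask applied
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.wo(out)                  # Attn Output [B, T, D]
```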
Embedded [B, T, D]
Token Embedding [vocab, d_model]
Weights
Gradients
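The embedding lookup turns token ids into vectors using a [vocab, d_model] table. A minimal sketch; any positional embedding the app may add is not shown in the panel, so it is omitted here:

```python
# Sketch of the token embedding lookup: ids [B, T] -> vectors [B, T, D].
import torch
import torch.nn as nn

vocab, d_model = 50257, 768
tok_emb = nn.Embedding(vocab, d_model)         # Token Embedding [vocab, d_model]

input_ids = torch.randint(0, vocab, (4, 128))  # Input Tokens [B, T]
embedded = tok_emb(input_ids)                  # Embedded [B, T, D]
```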
Input Tokens
The
quick
brown
fox
Training Monitor
Phases: Forward · Backward · Update
Loss: 2.891 · Step: 0 · LR: 3e-4 · ∇ Norm: 0.00 · Tokens: 0
Status: Ready (Click Train to begin)
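One monitor cycle is Forward, Backward, Update, after which the loss, step, learning rate, gradient norm, and token counters refresh. A sketch of such a step, assuming PyTorch and AdamW at the displayed LR of 3e-4; the tiny stand-in model and the clip threshold of 1.0 are assumptions:

```python
# Sketch of one training step as the monitor shows it: Forward -> Backward ->
# Update, then report loss, gradient norm, and tokens seen.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model = 50257, 768
# Stand-in model so the sketch runs on its own; the app visualizes a 12-block GPT.
model = nn.Sequential(nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)   # LR from the monitor

def train_step(input_ids, targets, step, tokens_seen):
    # Forward
    logits = model(input_ids)                                 # [B, T, vocab]
    loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1))
    # Backward
    optimizer.zero_grad()
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # "∇ Norm"
    # Update
    optimizer.step()
    tokens_seen += input_ids.numel()
    print(f"step {step}  loss {loss.item():.3f}  "
          f"grad_norm {float(grad_norm):.2f}  tokens {tokens_seen}")
    return tokens_seen

B, T = 4, 128
tokens = 0
for step in range(3):
    ids = torch.randint(0, vocab, (B, T))
    tgt = torch.randint(0, vocab, (B, T))
    tokens = train_step(ids, tgt, step, tokens)
```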