Figure: an interactive diagram of one training step in a GPT-style transformer, read top to bottom. At the top is the scalar Cross-Entropy Loss, L = −Σᵢ yᵢ log(ŷᵢ), computed over Logits of shape [B, V]. The logits come from the LM Head, a Linear layer of shape [d_model, vocab]; like every parameterized layer in the diagram, it displays both its Weights and its Gradients.
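To make the head-and-loss stage concrete, here is a minimal PyTorch sketch (the framework choice and the sizes B, T, d_model, vocab are illustrative assumptions, not taken from the page). F.cross_entropy combines log-softmax with negative log-likelihood, which is exactly L = −Σᵢ yᵢ log(ŷᵢ) averaged over the B·T positions:

```python
import torch
import torch.nn.functional as F

B, T, d_model, vocab = 8, 128, 768, 50257    # illustrative sizes (assumed)

hidden = torch.randn(B, T, d_model)           # final Hidden activations [B, T, D]
targets = torch.randint(0, vocab, (B, T))     # next-token labels

lm_head = torch.nn.Linear(d_model, vocab, bias=False)  # LM Head [d_model, vocab]
logits = lm_head(hidden)                      # [B, T, vocab]

# cross_entropy = log-softmax + NLL, i.e. L = −Σᵢ yᵢ log(ŷᵢ), averaged
# over all positions; it expects [N, vocab] scores against [N] targets.
loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1))
print(loss.item())  # scalar; ≈ ln(vocab) ≈ 10.8 for untrained weights
```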
In the middle of the figure, ×12 stacked Transformer Blocks transform Hidden activations of shape [B, T, D]. Each block applies LayerNorm followed by Multi-Head Self-Attention (projection matrices Wq, Wk, Wv, Wo; the Attention Matrix QKᵀ/√d with the causal mask applied, producing the Attn Output), then a second LayerNorm followed by a Feed-Forward Network (FFN Input → W₁ → GELU → W₂ → FFN Output), with each weight matrix shown alongside its grads.
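One block matching those pieces could be sketched as follows. This is a hedged reconstruction: the pre-norm residual layout, head count, and 4× FFN widening are common GPT conventions that the figure does not actually specify.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """Pre-norm transformer block: LN → causal MHSA, then LN → FFN (W₁ → GELU → W₂)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.n_heads = n_heads
        self.wq = nn.Linear(d_model, d_model, bias=False)  # Wq
        self.wk = nn.Linear(d_model, d_model, bias=False)  # Wk
        self.wv = nn.Linear(d_model, d_model, bias=False)  # Wv
        self.wo = nn.Linear(d_model, d_model, bias=False)  # Wo
        self.w1 = nn.Linear(d_model, 4 * d_model)          # W₁ (4× widening assumed)
        self.w2 = nn.Linear(4 * d_model, d_model)          # W₂

    def attn(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        hd = D // self.n_heads
        # Project, then split into heads: [B, n_heads, T, head_dim].
        q = self.wq(x).view(B, T, self.n_heads, hd).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_heads, hd).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_heads, hd).transpose(1, 2)
        # Attention Matrix QKᵀ/√d with the causal mask applied.
        att = (q @ k.transpose(-2, -1)) / math.sqrt(hd)
        future = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        att = att.masked_fill(future, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        return self.wo(y.transpose(1, 2).reshape(B, T, D))  # Attn Output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))                 # attention sub-layer + residual
        x = x + self.w2(F.gelu(self.w1(self.ln2(x))))  # FFN sub-layer + residual
        return x
```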
At the bottom, the Input Tokens ("The quick brown fox") are looked up in a Token Embedding table of shape [vocab, d_model] (again with Weights and Gradients) to produce the Embedded activations of shape [B, T, D] that feed the first block.
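Chaining embedding, blocks, head, and loss end to end gives the full forward pass of the diagram. Again a sketch under assumptions: it reuses the Block class from above, and the learned positional embedding and final LayerNorm are standard GPT details the figure does not show.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGPT(nn.Module):
    def __init__(self, vocab=50257, d_model=768, n_heads=12, n_blocks=12, max_T=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d_model)    # Token Embedding [vocab, d_model]
        self.pos_emb = nn.Embedding(max_T, d_model)    # positional embedding (assumed)
        self.blocks = nn.ModuleList(Block(d_model, n_heads) for _ in range(n_blocks))  # ×12
        self.ln_f = nn.LayerNorm(d_model)              # final LayerNorm (assumed)
        self.lm_head = nn.Linear(d_model, vocab, bias=False)  # LM Head [d_model, vocab]

    def forward(self, tokens, targets=None):
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)   # Embedded [B, T, D]
        for block in self.blocks:
            x = block(x)                               # Hidden [B, T, D]
        logits = self.lm_head(self.ln_f(x))            # Logits [B, T, vocab]
        if targets is None:
            return logits, None
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```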
Alongside the diagram, a Training Monitor panel animates the Forward, Backward, and Update phases and reports the current Loss (e.g. 2.891), Step count, LR (3e-4), gradient norm (∇ Norm), and Tokens processed; it idles in a Ready state until training is started.
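The monitor's three phases correspond to the standard optimization loop. A sketch, assuming AdamW, the 3e-4 learning rate shown in the panel, and a toy shifted-copy batch standing in for real training data:

```python
import torch

model = TinyGPT()                                     # from the sketch above
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)  # LR shown in the monitor

B, T, vocab, tokens_seen = 8, 128, 50257, 0
for step in range(100):
    x = torch.randint(0, vocab, (B, T))  # stand-in batch; real token data goes here
    y = torch.roll(x, -1, dims=1)        # toy next-token targets (shift by one)

    _, loss = model(x, y)                # Forward: tokens → logits → loss
    opt.zero_grad()
    loss.backward()                      # Backward: fills every Gradients box in the figure
    gnorm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # ∇ Norm (clipping assumed)
    opt.step()                           # Update: AdamW adjusts every Weights box

    tokens_seen += B * T
    print(f"Step {step}  Loss {loss.item():.3f}  LR 3e-4  "
          f"∇Norm {gnorm.item():.2f}  Tokens {tokens_seen}")
```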