Architecture
Model Forge
Every transformer has the same building blocks: embeddings, attention heads, feedforward layers. The parameter count and memory footprint emerge from six numbers. This calculator computes the exact math so you can feel how architecture decisions compound.
MODEL FORGE
emergent architecture parameter calculator — agentic design space
Presets
Precision
Parameter Breakdown
Memory Breakdown
The Parameter Equation
A transformer's parameter count breaks into four terms:
Embeddings: vocab_size × d_model (×2 if the output head is untied)
Attention: (d_model² + d_model × d_kv × 2 + d_model²) × n_layers
FFN: d_model × d_ffn × 2 (×3 for SwiGLU) × n_layers
LayerNorm: d_model × 2 × n_layers (negligible)
Attention dominates in narrow models. FFN dominates in wide models. The FFN multiplier (2.67x for Llama, 3.5x for Llama 3, 4x for GPT) is the single biggest lever — drag it and watch the parameter count respond.
Source: Vaswani et al., "Attention Is All You Need" (NeurIPS 2017). GQA: Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (EMNLP 2023). SwiGLU: Shazeer, "GLU Variants Improve Transformer" (2020).
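The four terms above can be collected into a few lines of Python. This is an illustrative sketch, not the calculator's actual code; embeddings are assumed tied (the output head reuses the input embedding matrix), and all names here are made up for the example.

```python
def transformer_params(d_model, n_layers, n_heads, n_kv_heads,
                       d_ffn, vocab_size, swiglu=True):
    d_head = d_model // n_heads
    d_kv = n_kv_heads * d_head               # KV projection width under GQA
    embed = vocab_size * d_model             # token embeddings (tied)
    attn = (2 * d_model * d_model            # Q and output projections
            + 2 * d_model * d_kv) * n_layers # K and V projections
    ffn_mats = 3 if swiglu else 2            # SwiGLU adds a gate matrix
    ffn = d_model * d_ffn * ffn_mats * n_layers
    norm = d_model * 2 * n_layers            # scale + bias vectors (negligible)
    return embed + attn + ffn + norm

# Llama-3-8B-like shapes land near the expected count:
print(transformer_params(4096, 32, 32, 8, 14336, 128256))
# 7504920576 -- ~7.5B tied; counting the untied output head adds ~0.5B, giving ~8B
```

Note how the FFN term dominates: with these shapes it contributes roughly 5.6B of the 7.5B parameters.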
What the Sliders Teach
Llama 3 uses 32 query heads but only 8 KV heads (a 4:1 ratio). Set KV heads below attention heads to see how GQA shrinks the KV cache by 4x with little quality loss. This is why Llama 3 8B fits on a 24GB GPU at 8K context.
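The 4x shrink is easy to verify. A minimal sketch, assuming Llama 3 8B shapes (128-dim heads, 32 layers) and an FP16 cache:

```python
def kv_cache_bytes(n_kv_heads, d_head, context, n_layers, batch=1, bytes_per=2):
    # leading 2 accounts for storing both K and V per position
    return 2 * n_kv_heads * d_head * context * n_layers * batch * bytes_per

mha = kv_cache_bytes(32, 128, 8192, 32)  # hypothetical: caching all 32 heads
gqa = kv_cache_bytes(8, 128, 8192, 32)   # GQA: only 8 KV heads cached
print(mha // gqa)   # 4 -- the shrink from the 4:1 head ratio
print(gqa / 2**30)  # 1.0 -- just 1 GiB of cache at 8K context, batch 1
```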
Switch from FP32 to INT4 and watch weight memory drop 8x. A 70B model needs 280GB of weights in FP32 but 35GB in INT4. The GPU fit indicator shows the practical consequence: what runs on your hardware depends more on precision than on parameter count.
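The arithmetic behind the fit indicator, as a hedged sketch (weight memory only, ignoring KV cache and activation overhead):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(n_params, precision):
    return n_params * BYTES_PER_PARAM[precision] / 1e9

print(weight_gb(70e9, "fp32"))  # 280.0 -- needs multi-GPU serving
print(weight_gb(70e9, "int4"))  # 35.0  -- the weights alone fit on a 40GB card
```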
Increase context length and batch size together. KV cache scales as 2 × n_kv_heads × d_head × context × n_layers × batch × bytes per element. At 128K context with batch 32, the KV cache can exceed the model's weight memory. This is why long-context serving is expensive.
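Plugging Llama-3-8B-like shapes into that scaling rule (an illustrative sketch with an FP16 cache) shows the crossover:

```python
def kv_cache_gb(n_kv_heads, d_head, context, n_layers, batch, bytes_per=2):
    return 2 * n_kv_heads * d_head * context * n_layers * batch * bytes_per / 1e9

weights = 8e9 * 2 / 1e9                      # ~16 GB of FP16 weights
cache = kv_cache_gb(8, 128, 131072, 32, 32)  # 128K context, batch 32
print(cache)  # ~549.8 GB -- over 34x the weight memory
```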
Load GPT-2 Small (124M), then GPT-3 (175B). Parameters grew 1,400x. But d_model only grew 16x (768 → 12288) and layers 8x (12 → 96). The magic is multiplication: depth times width times FFN width compounds.
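A quick check of how the axes multiply, using the FFN term alone (both models follow the d_ffn = 4 × d_model convention, so 3072 and 49152 here):

```python
# FFN parameters scale as d_model x d_ffn x 2 x n_layers, so modest
# per-axis growth compounds:
ffn_gpt2 = 768 * 3072 * 2 * 12      # GPT-2 Small: 768 wide, 12 layers
ffn_gpt3 = 12288 * 49152 * 2 * 96   # GPT-3: 12288 wide, 96 layers
print(ffn_gpt3 // ffn_gpt2)  # 2048 = 16 (width) x 16 (FFN width) x 8 (depth)
```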
In the Series
Model Forge shows the static architecture. For what happens at runtime — how attention patterns form inside these structures — see Attention Heads. For the downstream cost of running these architectures, see Inference Cost.
The parameter breakdown here maps directly to the components explored in Backpropagation and Superposition — every parameter shown in the bar chart is a dimension that features must compress into.