Model Forge

Every transformer has the same building blocks: embeddings, attention heads, feedforward layers. The parameter count and memory footprint emerge from six numbers. This calculator does the exact arithmetic so you can feel how architecture decisions compound.

emergent architecture parameter calculator — agentic design space

[Interactive calculator: preset and precision selectors plus sliders. Defaults shown (Llama-3-8B-like): batch size 1 (range 1–64), d_model 4,096 (64–16,384), attention heads 32 (1–128), KV heads 8 (1–32), layers 32 (1–128), context 8.2K (512–131.1K), vocab 128.3K (1.0K–256.0K), FFN multiplier 3.5x (1.0x–8.0x).]

7.50B total parameters

Model Memory 15.01 GB · KV Cache 1.07 GB · Activations 4.29 GB · Total VRAM 20.38 GB

GPU Fit Estimate: RTX 4090 ✓ fits · A100 40GB ✓ fits · A100 80GB ✓ fits · H100 80GB ✓ fits

Parameter Breakdown

Embedding 525.34M · Attention 1.34B · FFN 5.64B · LayerNorm 262.1K

Memory Breakdown

Model Weights 15.01 GB · KV Cache 1.07 GB · Activations 4.29 GB


The Parameter Equation

A transformer's parameter count breaks into four terms:

Embedding: vocab_size × d_model
Attention: (d_model² + d_model × d_kv × 2 + d_model²) × n_layers, where d_kv = n_kv_heads × d_head (Q and output projections are d_model² each; K and V project to d_kv)
FFN: d_model × d_ffn × 2 (×3 for SwiGLU) × n_layers
LayerNorm: d_model × 2 × n_layers (negligible)
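In code, the four terms might look like this. A minimal sketch, assuming a SwiGLU toggle for the FFN term; the function name, arguments, and the Llama-3-8B-like numbers below are illustrative, not the calculator's actual implementation:

```python
def param_count(vocab, d_model, n_layers, n_heads, n_kv_heads,
                ffn_mult, swiglu=True):
    """Approximate transformer parameter count from the four terms."""
    d_head = d_model // n_heads
    d_kv = n_kv_heads * d_head                 # width of the K/V projections
    embedding = vocab * d_model
    # Q and output projections are d_model^2 each; K and V project to d_kv
    attention = (2 * d_model**2 + 2 * d_model * d_kv) * n_layers
    d_ffn = int(ffn_mult * d_model)
    n_mats = 3 if swiglu else 2                # SwiGLU adds a gate matrix
    ffn = d_model * d_ffn * n_mats * n_layers
    layernorm = d_model * 2 * n_layers         # negligible
    return embedding + attention + ffn + layernorm

# Llama-3-8B-like preset
total = param_count(vocab=128_256, d_model=4096, n_layers=32,
                    n_heads=32, n_kv_heads=8, ffn_mult=3.5)
print(f"{total / 1e9:.2f}B")  # → 7.50B
```

With these shapes the FFN term alone contributes 5.64B of the 7.50B, which is why the multiplier slider moves the total so sharply.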

Attention dominates in narrow models. FFN dominates in wide models. The FFN multiplier (2.67x for Llama, 3.5x for Llama 3, 4x for GPT) is the single biggest lever — drag it and watch the parameter count respond.

Source: Vaswani et al., "Attention Is All You Need" (NeurIPS 2017). GQA: Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (EMNLP 2023). SwiGLU: Shazeer, "GLU Variants Improve Transformer" (2020).

What the Sliders Teach

GQA — Grouped Query Attention

Llama 3 uses 32 Q heads but only 8 KV heads (4:1 ratio). Set KV heads below attention heads to see how GQA shrinks the KV cache by 4x while keeping quality. This is why Llama 3 8B fits on a 24GB GPU at 8K context.
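The cache arithmetic is easy to check. A sketch with Llama-3-8B-like shapes (d_head = 4096 / 32 = 128), assuming FP16 at 2 bytes per element; the function is hypothetical:

```python
def kv_cache_bytes(n_kv_heads, d_head, context, n_layers, batch,
                   bytes_per_elem=2):
    # one K and one V tensor per layer, hence the leading factor of 2
    return 2 * n_kv_heads * d_head * context * n_layers * batch * bytes_per_elem

mha = kv_cache_bytes(n_kv_heads=32, d_head=128, context=8192,
                     n_layers=32, batch=1)   # full multi-head attention
gqa = kv_cache_bytes(n_kv_heads=8, d_head=128, context=8192,
                     n_layers=32, batch=1)   # Llama 3's 4:1 GQA
print(f"MHA {mha / 2**30:.2f} GiB, GQA {gqa / 2**30:.2f} GiB")
# → MHA 4.00 GiB, GQA 1.00 GiB: same model quality, 4x smaller cache
```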

Precision — The Memory Multiplier

Switch from FP32 to INT4 and watch total VRAM drop 8x. A 70B model needs 280GB in FP32 but 35GB in INT4. The GPU fit indicator shows the practical consequence: what runs on your hardware depends more on precision than parameter count.
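The arithmetic behind that claim: weight memory is parameter count times bytes per weight. A sketch (the byte widths are the standard ones; the function itself is hypothetical and ignores quantization overhead):

```python
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params, precision):
    """Weight memory in decimal GB."""
    return n_params * BYTES_PER_WEIGHT[precision] / 1e9

for p in BYTES_PER_WEIGHT:
    print(f"70B @ {p}: {weight_memory_gb(70e9, p):.0f} GB")
# → fp32 280 GB, fp16 140 GB, int8 70 GB, int4 35 GB
```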

KV Cache — The Hidden Cost

Increase context length and batch size together. KV cache scales as 2 × n_kv_heads × d_head × context × layers × batch elements, times bytes per element. At 128K context with batch 32, KV cache can exceed model weight memory. This is why long-context serving is expensive.
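Plugging in Llama-3-8B-like shapes shows how fast this grows. A sketch assuming FP16 at 2 bytes per element and decimal GB; the function is hypothetical:

```python
def kv_cache_gb(n_kv_heads, d_head, context, n_layers, batch,
                bytes_per_elem=2):
    # 2 tensors (K and V) x elements x bytes, reported in decimal GB
    return (2 * n_kv_heads * d_head * context * n_layers * batch
            * bytes_per_elem / 1e9)

base = kv_cache_gb(8, 128, 8 * 1024, 32, batch=1)        # ≈ 1.07 GB
served = kv_cache_gb(8, 128, 128 * 1024, 32, batch=32)   # ≈ 550 GB
print(f"{base:.2f} GB -> {served:.0f} GB")
```

Context grew 16x and batch 32x, so the cache grew 512x, dwarfing the roughly 15 GB of FP16 weights.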

Scaling Laws

Load GPT-2 Small (124M), then GPT-3 (175B). Parameters grew 1,400x. But d_model only grew 16x (768 → 12288) and layers 8x (12 → 96). The magic is multiplication: depth times width times FFN width compounds.
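Those headline numbers can be reproduced, roughly, from the shapes alone. A sketch using dense-attention GPT shapes with the classic 4x two-matrix FFN (biases and positional embeddings omitted, so the totals land slightly under the published counts; the function is illustrative):

```python
def rough_params(vocab, d_model, n_layers, ffn_mult=4):
    # dense MHA: 4 * d_model^2 per layer; two FFN matrices at ffn_mult width
    per_layer = 4 * d_model**2 + 2 * d_model * (ffn_mult * d_model)
    return vocab * d_model + n_layers * per_layer

gpt2_small = rough_params(50_257, 768, 12)      # ≈ 0.12B
gpt3       = rough_params(50_257, 12_288, 96)   # ≈ 174.6B
print(f"{gpt3 / gpt2_small:.0f}x")  # → 1413x, the "1,400x" above
```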

In the Series

Model Forge shows the static architecture. For what happens at runtime — how attention patterns form inside these structures — see Attention Heads. For the downstream cost of running these architectures, see Inference Cost.

The parameter breakdown here maps directly to the components explored in Backpropagation and Superposition — every parameter shown in the bar chart is a dimension that features must compress into.