Model Forge

Every transformer has the same building blocks: embeddings, attention heads, feedforward layers. The parameter count and memory footprint emerge from six numbers. This calculator does the exact arithmetic so you can feel how architecture decisions compound.

emergent architecture parameter calculator — agentic design space

[Interactive calculator: preset and precision selectors plus sliders. Defaults shown (Llama-3-8B-like): batch size 1 (range 1–64), d_model 4,096 (64–16,384), attention heads 32 (1–128), KV heads 8 (1–32), layers 32 (1–128), context 8.2K (512–131.1K), vocab 128.3K (1.0K–256.0K), FFN multiplier 3.5x (1.0x–8.0x).]

7.50B total parameters

Model Memory 15.01 GB · KV Cache 1.07 GB · Activations 4.29 GB · Total VRAM 20.38 GB

GPU Fit Estimate: RTX 4090 ✓ fits · A100 40GB ✓ fits · A100 80GB ✓ fits · H100 80GB ✓ fits

Parameter Breakdown

Embedding 525.34M · Attention 1.34B · FFN 5.64B · LayerNorm 262.1K

Memory Breakdown

Model Weights 15.01 GB · KV Cache 1.07 GB · Activations 4.29 GB


The Parameter Equation

A transformer's parameter count breaks into four terms:

Embedding: vocab_size × d_model
Attention: (d_model² + d_model × d_kv × 2 + d_model²) × n_layers, where d_kv = n_kv_heads × d_head (Q and output projections are d_model² each; K and V project to d_kv)
FFN: d_model × d_ffn × 2 (×3 for SwiGLU) × n_layers
LayerNorm: d_model × 2 × n_layers (negligible)
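In code, the four terms might look like this. A minimal sketch, assuming a SwiGLU toggle for the FFN term; the function name, arguments, and the Llama-3-8B-like numbers below are illustrative, not the calculator's actual implementation:

```python
def param_count(vocab, d_model, n_layers, n_heads, n_kv_heads,
                ffn_mult, swiglu=True):
    """Approximate transformer parameter count from the four terms."""
    d_head = d_model // n_heads
    d_kv = n_kv_heads * d_head                 # width of the K/V projections
    embedding = vocab * d_model
    # Q and output projections are d_model^2 each; K and V project to d_kv
    attention = (2 * d_model**2 + 2 * d_model * d_kv) * n_layers
    d_ffn = int(ffn_mult * d_model)
    n_mats = 3 if swiglu else 2                # SwiGLU adds a gate matrix
    ffn = d_model * d_ffn * n_mats * n_layers
    layernorm = d_model * 2 * n_layers         # negligible
    return embedding + attention + ffn + layernorm

# Llama-3-8B-like preset
total = param_count(vocab=128_256, d_model=4096, n_layers=32,
                    n_heads=32, n_kv_heads=8, ffn_mult=3.5)
print(f"{total / 1e9:.2f}B")  # → 7.50B
```

With these shapes the FFN term alone contributes 5.64B of the 7.50B, which is why the multiplier slider moves the total so sharply.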

Attention dominates in narrow models. FFN dominates in wide models. The FFN multiplier (2.67x for Llama, 3.5x for Llama 3, 4x for GPT) is the single biggest lever — drag it and watch the parameter count respond.

Source: Vaswani et al., "Attention Is All You Need" (NeurIPS 2017). GQA: Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (EMNLP 2023). SwiGLU: Shazeer, "GLU Variants Improve Transformer" (2020).

What the Sliders Teach

GQA — Grouped Query Attention

Llama 3 uses 32 Q heads but only 8 KV heads (4:1 ratio). Set KV heads below attention heads to see how GQA shrinks the KV cache by 4x while keeping quality. This is why Llama 3 8B fits on a 24GB GPU at 8K context.
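The cache arithmetic is easy to check. A sketch with Llama-3-8B-like shapes (d_head = 4096 / 32 = 128), assuming FP16 at 2 bytes per element; the function is hypothetical:

```python
def kv_cache_bytes(n_kv_heads, d_head, context, n_layers, batch,
                   bytes_per_elem=2):
    # one K and one V tensor per layer, hence the leading factor of 2
    return 2 * n_kv_heads * d_head * context * n_layers * batch * bytes_per_elem

mha = kv_cache_bytes(n_kv_heads=32, d_head=128, context=8192,
                     n_layers=32, batch=1)   # full multi-head attention
gqa = kv_cache_bytes(n_kv_heads=8, d_head=128, context=8192,
                     n_layers=32, batch=1)   # Llama 3's 4:1 GQA
print(f"MHA {mha / 2**30:.2f} GiB, GQA {gqa / 2**30:.2f} GiB")
# → MHA 4.00 GiB, GQA 1.00 GiB: same model quality, 4x smaller cache
```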

Precision — The Memory Multiplier

Switch from FP32 to INT4 and watch total VRAM drop 8x. A 70B model needs 280GB in FP32 but 35GB in INT4. The GPU fit indicator shows the practical consequence: what runs on your hardware depends more on precision than parameter count.
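The arithmetic behind that claim: weight memory is parameter count times bytes per weight. A sketch (the byte widths are the standard ones; the function itself is hypothetical and ignores quantization overhead):

```python
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params, precision):
    """Weight memory in decimal GB."""
    return n_params * BYTES_PER_WEIGHT[precision] / 1e9

for p in BYTES_PER_WEIGHT:
    print(f"70B @ {p}: {weight_memory_gb(70e9, p):.0f} GB")
# → fp32 280 GB, fp16 140 GB, int8 70 GB, int4 35 GB
```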

KV Cache — The Hidden Cost

Increase context length and batch size together. KV cache scales as 2 × n_kv_heads × d_head × context × layers × batch elements, times bytes per element. At 128K context with batch 32, KV cache can exceed model weight memory. This is why long-context serving is expensive.
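Plugging in Llama-3-8B-like shapes shows how fast this grows. A sketch assuming FP16 at 2 bytes per element and decimal GB; the function is hypothetical:

```python
def kv_cache_gb(n_kv_heads, d_head, context, n_layers, batch,
                bytes_per_elem=2):
    # 2 tensors (K and V) x elements x bytes, reported in decimal GB
    return (2 * n_kv_heads * d_head * context * n_layers * batch
            * bytes_per_elem / 1e9)

base = kv_cache_gb(8, 128, 8 * 1024, 32, batch=1)        # ≈ 1.07 GB
served = kv_cache_gb(8, 128, 128 * 1024, 32, batch=32)   # ≈ 550 GB
print(f"{base:.2f} GB -> {served:.0f} GB")
```

Context grew 16x and batch 32x, so the cache grew 512x, dwarfing the roughly 15 GB of FP16 weights.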

Scaling Laws

Load GPT-2 Small (124M), then GPT-3 (175B). Parameters grew 1,400x. But d_model only grew 16x (768 → 12288) and layers 8x (12 → 96). The magic is multiplication: depth times width times FFN width compounds.
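Those headline numbers can be reproduced, roughly, from the shapes alone. A sketch using dense-attention GPT shapes with the classic 4x two-matrix FFN (biases and positional embeddings omitted, so the totals land slightly under the published counts; the function is illustrative):

```python
def rough_params(vocab, d_model, n_layers, ffn_mult=4):
    # dense MHA: 4 * d_model^2 per layer; two FFN matrices at ffn_mult width
    per_layer = 4 * d_model**2 + 2 * d_model * (ffn_mult * d_model)
    return vocab * d_model + n_layers * per_layer

gpt2_small = rough_params(50_257, 768, 12)      # ≈ 0.12B
gpt3       = rough_params(50_257, 12_288, 96)   # ≈ 174.6B
print(f"{gpt3 / gpt2_small:.0f}x")  # → 1413x, the "1,400x" above
```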

In the Series

Model Forge shows the static architecture. For what happens at runtime — how attention patterns form inside these structures — see Attention Heads. For the downstream cost of running these architectures, see Inference Cost.

The parameter breakdown here maps directly to the components explored in Backpropagation and Superposition — every parameter shown in the bar chart is a dimension that features must compress into.