04 · Transformer Block: One Pass from token_id to logits
Theory
Lessons 01–03 zoomed out (the forward → loss → sample loop) and in (autograd, attention). This lesson wires up the piece in the middle: the single function gpt() that turns one (token_id, pos_id) into a vector of logits, one per vocabulary token, for the next character. For microGPT that’s exactly one transformer block (n_layer = 1) with n_embd = 16, n_head = 4, and head_dim = n_embd / n_head = 4.
The data path, in the order the reference code runs it:
- Embedding: look up the token’s row of wte and the position’s row of wpe, add them element-wise → a length-16 vector x.
- RMSNorm (the initial one): Karpathy applies rmsnorm once right here, before the block. It’s easy to miss and looks redundant next to the norm inside the block, but his comment is explicit: “not redundant due to backward pass via the residual connection.” Because x_residual is saved after this norm, the first residual branch carries the normalized embedding rather than the raw sum.
- Attention sub-block (pre-norm + residual):
  - save x_residual = x (branch ①),
  - rmsnorm a copy,
  - multi-head attention: the same q·kᵀ/√head_dim → softmax → ·v from lesson 03, run per head and concatenated,
  - project with attn_wo,
  - add the saved branch ① back.
- MLP sub-block (pre-norm + residual):
  - save x_residual = x (branch ②),
  - rmsnorm a copy,
  - mlp_fc1 expands 16 → 64,
  - ReLU (not GeLU; the reference uses max(0, x)),
  - mlp_fc2 projects 64 → 16,
  - add the saved branch ② back.
- LM Head: linear(x, lm_head) projects the final 16-vector to one logit per vocabulary token. There is no final norm before lm_head in the reference.
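The first two steps of the list (embedding sum, then the initial RMSNorm) can be sketched on plain floats instead of Value objects. This is a minimal illustration, not the reference code: the rmsnorm body and its eps value are assumptions about how the helper is written, and tok_emb / pos_emb are hypothetical stand-ins for one row of wte and one row of wpe.

```python
import math
import random

n_embd = 16

def rmsnorm(x, eps=1e-5):
    # Rescale x so its root-mean-square is close to 1.
    # eps is an assumed small stabilizer; there is no learned gain here.
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

random.seed(0)
# Hypothetical stand-ins for wte[token_id] and wpe[pos_id].
tok_emb = [random.gauss(0, 0.08) for _ in range(n_embd)]
pos_emb = [random.gauss(0, 0.08) for _ in range(n_embd)]

x = [t + p for t, p in zip(tok_emb, pos_emb)]  # joint embedding, length 16
x = rmsnorm(x)                                 # the initial norm, before the block

# After the norm the vector keeps its length but has RMS of roughly 1.
rms = math.sqrt(sum(xi * xi for xi in x) / len(x))
print(len(x), round(rms, 3))
```

Whatever is saved as x_residual right after this point is the normalized vector, which is why the initial norm affects what the first residual branch carries.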
That’s the whole block. Things microGPT deliberately does not have, and which therefore are not in the sandbox: LayerNorm (it uses RMSNorm), GeLU (it uses ReLU), dropout, and biases on any linear.
A note on the two residuals: this is a pre-norm transformer. Each sub-block saves x, normalizes a copy, runs its sub-layer on that copy, and adds the result back to the saved x, which never passes through that sub-block’s norm. That saved-then-added bypass is drawn as the two arcs on the right of the scene.
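The pre-norm pattern can be illustrated with the MLP sub-block written on plain floats. This is a sketch under assumptions: mlp_sub_block is a hypothetical name, the helpers are simplified float versions rather than the Value-based originals, and the weights are random placeholders. Zeroing the down-projection shows the bypass directly, since the sub-block then returns its input unchanged.

```python
import random

n_embd = 16

def rmsnorm(x, eps=1e-5):
    # Assumed float version of the helper; eps is an assumed stabilizer.
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

def linear(x, w):
    # w is laid out [nout][nin]; output j is the dot product of w[j] with x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def mlp_sub_block(x, fc1, fc2):
    # Hypothetical wrapper around the MLP half of the block.
    x_residual = x                      # save the branch
    h = rmsnorm(x)                      # normalize a copy
    h = linear(h, fc1)                  # expand 16 -> 64
    h = [max(0.0, hi) for hi in h]      # ReLU
    h = linear(h, fc2)                  # project 64 -> 16
    return [a + b for a, b in zip(h, x_residual)]  # add the saved branch back

random.seed(0)
x = [random.gauss(0, 1) for _ in range(n_embd)]
fc1 = [[random.gauss(0, 0.08) for _ in range(n_embd)] for _ in range(4 * n_embd)]
fc2 = [[random.gauss(0, 0.08) for _ in range(4 * n_embd)] for _ in range(n_embd)]
fc2_zero = [[0.0] * (4 * n_embd) for _ in range(n_embd)]

y = mlp_sub_block(x, fc1, fc2)
y_bypass = mlp_sub_block(x, fc1, fc2_zero)  # sub-layer contributes nothing
print(len(y), y_bypass == x)  # 16 True
```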
Parameter Initialization
Before any forward pass, the weights have to exist. microGPT builds them once, as plain Gaussian random scalars wrapped in Value (py lines 99–114):
```python
matrix = lambda nout, nin, std=0.08: [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]
state_dict = {'wte': matrix(vocab_size, n_embd), 'wpe': matrix(block_size, n_embd), 'lm_head': matrix(vocab_size, n_embd)}
for i in range(n_layer):
    state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)
params = [p for mat in state_dict.values() for row in mat for p in row]
```

Every matrix is nout × nin of random.gauss(0, 0.08) values, drawn from a plain normal distribution with standard deviation 0.08. That is the entire initialization: no Xavier, no Kaiming, no special scaling. With n_embd = 16, block_size = 16, n_head = 4, and vocab_size = len(uchars) + 1, the state_dict holds:
| matrix | shape (nout × nin) | role |
|---|---|---|
| wte | vocab_size × 16 | token embedding table |
| wpe | 16 × 16 | position embedding table (block_size × n_embd) |
| attn_wq / wk / wv / wo | 16 × 16 each | per-layer Q/K/V projections + output projection |
| mlp_fc1 | 64 × 16 | MLP up-projection (4·n_embd × n_embd) |
| mlp_fc2 | 16 × 64 | MLP down-projection (n_embd × 4·n_embd) |
| lm_head | vocab_size × 16 | final projection to logits |
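The shapes in the table are enough to work out how many scalars the model holds. The sketch below only does the arithmetic and builds no weights. The value vocab_size = 27 is an assumption for illustration (for example 26 characters plus one extra token); the lesson only states vocab_size = len(uchars) + 1.

```python
n_embd, block_size, n_layer = 16, 16, 1
vocab_size = 27  # ASSUMPTION for illustration; really len(uchars) + 1

# Shapes as (nout, nin), mirroring the state_dict keys.
shapes = {
    'wte': (vocab_size, n_embd),
    'wpe': (block_size, n_embd),
    'lm_head': (vocab_size, n_embd),
}
for i in range(n_layer):
    for name in ('attn_wq', 'attn_wk', 'attn_wv', 'attn_wo'):
        shapes[f'layer{i}.{name}'] = (n_embd, n_embd)
    shapes[f'layer{i}.mlp_fc1'] = (4 * n_embd, n_embd)
    shapes[f'layer{i}.mlp_fc2'] = (n_embd, 4 * n_embd)

# One scalar per matrix entry; this is the length of the flat params list.
num_params = sum(nout * nin for nout, nin in shapes.values())
print(num_params)  # 4192 under the vocab_size = 27 assumption
```

With a different vocabulary size only the wte and lm_head terms change, by 2 × 16 scalars per extra token.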
linear(x, w) reads each weight matrix as [nout][nin], so output j is the dot product of w[j] with the input. Finally params flattens every scalar from every matrix into one flat list — exactly what the 05 · Training & Generation Adam loop walks over, with one m/v buffer and one update per scalar, every step.
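The [nout][nin] layout is easy to check on a tiny example. The helper below is a plain-float sketch of what linear is described as doing (one dot product per weight row); the real helper operates on Value objects, and this body is an assumption rather than a copy of it.

```python
def linear(x, w):
    # w has nout rows, each of length nin; output j = dot(w[j], x).
    return [sum(wi * xi for wi, xi in zip(w_row, x)) for w_row in w]

w = [[1.0, 0.0, 2.0],   # row 0 of a 2 x 3 matrix (nout = 2, nin = 3)
     [0.0, 3.0, 1.0]]   # row 1
x = [1.0, 2.0, 3.0]     # input of length nin

out = linear(x, w)
print(out)  # [7.0, 9.0]: 1*1 + 0*2 + 2*3 = 7 and 0*1 + 3*2 + 1*3 = 9
```

The output length equals the number of rows, which is why mlp_fc1 (64 × 16) maps a 16-vector to a 64-vector and mlp_fc2 (16 × 64) maps it back.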
Annotated Code
The block lives in src/microgpt_annotated.py, subsection attention-multihead (the helpers linear / softmax / rmsnorm are in overview-pipeline-helpers):
```python
def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id] # token embedding
    pos_emb = state_dict['wpe'][pos_id] # position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)] # joint embedding
    x = rmsnorm(x) # note: not redundant due to backward pass via the residual connection
    for li in range(n_layer):
        # 1) Multi-head Attention block
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        keys[li].append(k); values[li].append(v)
        x_attn = []
        for h in range(n_head):
            hs = h * head_dim
            q_h = q[hs:hs+head_dim]
            k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
            v_h = [vi[hs:hs+head_dim] for vi in values[li]]
            attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
                           for t in range(len(k_h))]
            attn_weights = softmax(attn_logits)
            head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
                        for j in range(head_dim)]
            x_attn.extend(head_out)
        x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
        x = [a + b for a, b in zip(x, x_residual)]
        # 2) MLP block
        x_residual = x
        x = rmsnorm(x)
        x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
        x = [xi.relu() for xi in x]
        x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
        x = [a + b for a, b in zip(x, x_residual)]
    logits = linear(x, state_dict['lm_head'])
    return logits
```

The TypeScript port in src/inference/model.ts computes the same path. Its difference is mechanical: Python calls gpt() once per position with a growing KV cache, while the port takes the whole sequence and applies an explicit j ≤ i causal mask. The math is the same and only the control flow differs (the same point lesson 03 makes about attention).
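The claim that a growing KV cache and an explicit j ≤ i mask give the same result can be checked numerically for a single head. The sketch below is written in Python for both styles and uses random placeholder q/k/v vectors. How the TypeScript port actually implements its mask is not shown in this lesson, so masking disallowed scores with -inf is an assumption, as are the helper names.

```python
import math
import random

head_dim = 4
T = 5  # sequence length for the demonstration

def softmax(logits):
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(q, k):
    return sum(qj * kj for qj, kj in zip(q, k)) / head_dim ** 0.5

random.seed(0)
qs = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(T)]
ks = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(T)]
vs = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(T)]

# Style A: one position per call, attending over a growing cache.
cache_k, cache_v, out_cached = [], [], []
for i in range(T):
    cache_k.append(ks[i])
    cache_v.append(vs[i])
    w = softmax([score(qs[i], k) for k in cache_k])
    out_cached.append([sum(w[t] * cache_v[t][d] for t in range(len(cache_v)))
                       for d in range(head_dim)])

# Style B: whole sequence at once, with an explicit causal mask (j <= i).
out_masked = []
for i in range(T):
    scores = [score(qs[i], ks[j]) if j <= i else -math.inf for j in range(T)]
    w = softmax(scores)
    out_masked.append([sum(w[j] * vs[j][d] for j in range(T))
                       for d in range(head_dim)])

max_diff = max(abs(a - b)
               for row_a, row_b in zip(out_cached, out_masked)
               for a, b in zip(row_a, row_b))
print(max_diff < 1e-12)  # True
```

Masked positions get zero weight after the softmax, so they contribute nothing to the weighted sum, which is the same as never having been in the cache.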
Sandbox
Each module on the path is a block you can click to see its input → output shape and the exact Python line it runs. Press play (or scrub) to send a data pulse down the path; the two green arcs are the residual bypasses (saved at ①/② and added back at the matching Add stages). The attention stage summarizes the same computation explained in lesson 03 — this lesson focuses on where attention sits inside the complete block. It is a map of the block’s structure and execution order, not a per-stage inspector of real tensor values (the shapes shown are the static layer dimensions; lesson 03 is where you watch the actual attention numbers).