04 · Transformer Block: One Pass from token_id to logits

Theory

Lessons 01–03 zoomed out (the forward → loss → sample loop) and in (autograd, attention). This lesson wires up the piece in the middle: the single function gpt() that turns one (token_id, pos_id) pair into a vector of logits over the next character. In microGPT that is exactly one transformer block (n_layer = 1) with n_embd = 16 and n_head = 4, so head_dim = n_embd / n_head = 4.

The data path, in the order the reference code runs it:

1. Embedding: look up the token’s row of wte and the position’s row of wpe, then add them element-wise to get a length-16 vector x.
2. RMSNorm (the initial one): Karpathy applies rmsnorm once right here, before the block. It is easy to miss and looks redundant next to the norm inside the block, but his comment is explicit: “not redundant due to backward pass via the residual connection.” Because x_residual is saved after this norm, branch ① carries the normalized embedding rather than the raw wte + wpe sum.
3. Attention sub-block (pre-norm + residual):
   - save x_residual = x (branch ①),
   - rmsnorm a copy,
   - multi-head attention: the same q·kᵀ/√head_dim → softmax → ·v from lesson 03, run per head and concatenated,
   - project with attn_wo,
   - add the saved branch ① back.
4. MLP sub-block (pre-norm + residual):
   - save x_residual = x (branch ②),
   - rmsnorm a copy,
   - mlp_fc1 expands 16 → 64,
   - ReLU (not GeLU; the reference uses max(0, x)),
   - mlp_fc2 projects 64 → 16,
   - add the saved branch ② back.
5. LM Head: linear(x, lm_head) projects the final 16-vector to one logit per vocabulary token. There is no final norm before lm_head in the reference.
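The rmsnorm helper used in the path above is not listed in this lesson. As a minimal sketch on plain floats (the reference operates on Value objects, and the epsilon of 1e-5 is an assumption here), it rescales a vector so that its root-mean-square is about 1. The state_dict in this lesson holds no norm weights, so the sketch has no learned gain either.

```python
import math

def rmsnorm(x, eps=1e-5):
    """Scale x so its root-mean-square is ~1 (no mean subtraction, no learned gain)."""
    ms = sum(xi * xi for xi in x) / len(x)   # mean of squares
    scale = (ms + eps) ** -0.5               # eps value is an assumption
    return [xi * scale for xi in x]

def rms(x):
    """Root-mean-square of a vector, used only to check the result."""
    return math.sqrt(sum(xi * xi for xi in x) / len(x))

x = [0.5, -1.0, 2.0, 0.25]
y = rmsnorm(x)
y_scaled = rmsnorm([10.0 * xi for xi in x])  # same direction, 10x the magnitude
```

Here rms(y) comes out very close to 1, and y_scaled is nearly identical to y: the norm discards the overall magnitude of its input and keeps only the direction.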

That’s the whole block. Things microGPT deliberately does not have, and which therefore are not in the sandbox: LayerNorm (it uses RMSNorm), GeLU (it uses ReLU), dropout, and biases on any linear.

A note on the two residuals: this is a pre-norm transformer. Each sub-block normalizes a copy of x, runs its sub-layer, and adds the result back to the un-normalized x it saved. That saved-then-added bypass is drawn as the two arcs on the right of the scene.
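The pre-norm pattern can be sketched as one small wrapper on plain floats. The name prenorm_residual and the toy sub-layers are hypothetical, and the rmsnorm epsilon is an assumption; the reference inlines these steps inside gpt() instead of using a wrapper.

```python
def rmsnorm(x, eps=1e-5):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

def prenorm_residual(x, sublayer):
    """Return x + sublayer(rmsnorm(x)): normalize a copy, transform it, add the saved x back."""
    x_residual = x                       # saved branch (un-normalized)
    out = sublayer(rmsnorm(x))           # only the copy goes through the norm
    return [a + b for a, b in zip(out, x_residual)]

x = [3.0, -4.0, 0.0, 1.0]

# Hypothetical sub-layer that contributes nothing: the bypass returns x unchanged.
passthrough = prenorm_residual(x, lambda h: [0.0 for _ in h])

# Hypothetical identity sub-layer: the result is x plus its normalized copy.
bumped = prenorm_residual(x, lambda h: h)
```

With the zero sub-layer the output equals x exactly, which is why the arcs are a bypass: whatever the sub-layer computes is a correction added on top of the saved vector.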

Parameter Initialization

Before any forward pass, the weights have to exist. microGPT builds them once, as plain Gaussian random scalars wrapped in Value (py lines 99–114):


matrix = lambda nout, nin, std=0.08: [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]
state_dict = {'wte': matrix(vocab_size, n_embd), 'wpe': matrix(block_size, n_embd), 'lm_head': matrix(vocab_size, n_embd)}
for i in range(n_layer):
    state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)
params = [p for mat in state_dict.values() for row in mat for p in row]

Every matrix is an nout × nin grid of random.gauss(0, 0.08) draws: a plain normal distribution with mean 0 and standard deviation 0.08. That is the entire initialization: no Xavier, no Kaiming, no special scaling. With n_embd = 16, block_size = 16, n_head = 4, and vocab_size = len(uchars) + 1, the state_dict holds:

| matrix | shape (nout × nin) | role |
| --- | --- | --- |
| `wte` | vocab_size × 16 | token embedding table |
| `wpe` | 16 × 16 | position embedding table (block_size × n_embd) |
| `attn_wq` / `wk` / `wv` / `wo` | 16 × 16 each | per-layer Q/K/V projections + output projection |
| `mlp_fc1` | 64 × 16 | MLP up-projection (4·n_embd × n_embd) |
| `mlp_fc2` | 16 × 64 | MLP down-projection (n_embd × 4·n_embd) |
| `lm_head` | vocab_size × 16 | final projection to logits |
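Summing the shapes in the table gives the total number of scalars. The sketch below only does the arithmetic, without creating any Value objects. The helper name count_params is hypothetical, and vocab_size = 27 is an assumed example value (for instance 26 letters plus one extra token), not a figure taken from this lesson.

```python
def count_params(vocab_size, n_embd=16, block_size=16, n_layer=1):
    """Return (total scalar count, dict of shapes) for the matrices listed in the table."""
    shapes = {
        'wte': (vocab_size, n_embd),
        'wpe': (block_size, n_embd),
        'lm_head': (vocab_size, n_embd),
    }
    for i in range(n_layer):
        for name in ('attn_wq', 'attn_wk', 'attn_wv', 'attn_wo'):
            shapes[f'layer{i}.{name}'] = (n_embd, n_embd)
        shapes[f'layer{i}.mlp_fc1'] = (4 * n_embd, n_embd)
        shapes[f'layer{i}.mlp_fc2'] = (n_embd, 4 * n_embd)
    total = sum(nout * nin for nout, nin in shapes.values())
    return total, shapes

total, shapes = count_params(vocab_size=27)   # 27 is an assumed example value
print(total)                                  # 4192 under that assumption
```

For these dimensions the total works out to 32 · vocab_size + 3328: the two vocabulary-sized tables contribute 32 · vocab_size, and wpe, the four attention matrices, and the two MLP matrices contribute 256 + 1024 + 2048.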

linear(x, w) reads each weight matrix as [nout][nin], so output j is the dot product of w[j] with the input. Finally, params flattens every scalar from every matrix into one list, which is exactly what the Adam loop in lesson 05 · Training & Generation walks over, with one m/v buffer and one update per scalar at every step.
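The [nout][nin] convention can be made concrete with a float-only sketch of linear. This is an illustration of the row-wise dot product described above, written on plain floats instead of Value objects; there is no bias term, matching the no-bias design of the reference.

```python
def linear(x, w):
    """w is [nout][nin]; output j is the dot product of row w[j] with x (no bias)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

w = [[1.0, 0.0, 2.0],     # row 0 produces output 0
     [0.0, -1.0, 0.5]]    # row 1 produces output 1
x = [3.0, 4.0, 2.0]

y = linear(x, w)          # [1*3 + 0*4 + 2*2, 0*3 - 1*4 + 0.5*2] = [7.0, -3.0]
```

A 2 × 3 matrix maps a 3-vector to a 2-vector, which is how mlp_fc1 (64 × 16) expands 16 → 64 and mlp_fc2 (16 × 64) brings it back.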

Annotated Code

The block lives in src/microgpt_annotated.py, subsection attention-multihead (the helpers linear / softmax / rmsnorm are in overview-pipeline-helpers):


def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id]          # token embedding
    pos_emb = state_dict['wpe'][pos_id]            # position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)]  # joint embedding
    x = rmsnorm(x)  # note: not redundant due to backward pass via the residual connection
 
    for li in range(n_layer):
        # 1) Multi-head Attention block
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        keys[li].append(k); values[li].append(v)
        x_attn = []
        for h in range(n_head):
            hs = h * head_dim
            q_h = q[hs:hs+head_dim]
            k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
            v_h = [vi[hs:hs+head_dim] for vi in values[li]]
            attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
                           for t in range(len(k_h))]
            attn_weights = softmax(attn_logits)
            head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
                        for j in range(head_dim)]
            x_attn.extend(head_out)
        x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
        x = [a + b for a, b in zip(x, x_residual)]
        # 2) MLP block
        x_residual = x
        x = rmsnorm(x)
        x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
        x = [xi.relu() for xi in x]
        x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
        x = [a + b for a, b in zip(x, x_residual)]
 
    logits = linear(x, state_dict['lm_head'])
    return logits
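To see the head slicing and the growing cache in isolation, here is a float-only sketch of the inner attention loop. The function name attend, the toy sizes (2 heads of width 2 rather than 4 × 4), and all the numbers are made up for illustration; the softmax shown is a standard max-subtracted version and is an assumption about the helper, which is not listed here.

```python
import math

def softmax(logits):
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def attend(q, keys, values, n_head, head_dim):
    """Multi-head attention for one query against all cached keys/values."""
    out = []
    for h in range(n_head):
        hs = h * head_dim                        # start of this head's slice
        q_h = q[hs:hs + head_dim]
        k_h = [k[hs:hs + head_dim] for k in keys]
        v_h = [v[hs:hs + head_dim] for v in values]
        logits = [sum(q_h[j] * k_t[j] for j in range(head_dim)) / head_dim ** 0.5
                  for k_t in k_h]
        weights = softmax(logits)
        out.extend(sum(weights[t] * v_h[t][j] for t in range(len(v_h)))
                   for j in range(head_dim))     # concatenate head outputs
    return out

n_head, head_dim = 2, 2
keys, values = [], []                            # cache for a single layer

# Position 0: one cached entry, so each head's softmax weight is 1 and the output is v0.
q0, k0, v0 = [1.0, 0.0, 0.0, 1.0], [0.5, 0.5, -1.0, 2.0], [1.0, 2.0, 3.0, 4.0]
keys.append(k0); values.append(v0)
out0 = attend(q0, keys, values, n_head, head_dim)

# Position 1: two cached entries. A zero query gives equal logits, so each head averages.
q1, k1, v1 = [0.0, 0.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.5], [3.0, 0.0, 1.0, 2.0]
keys.append(k1); values.append(v1)
out1 = attend(q1, keys, values, n_head, head_dim)   # [2.0, 1.0, 2.0, 3.0]
```

The first position can only attend to itself, so its attention output is just its own value vector; later positions mix every value stored in the cache so far.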

The TypeScript port in src/inference/model.ts computes the same path. The difference is mechanical: Python calls gpt() once per position with a growing KV cache, while the port takes the whole sequence and applies an explicit j ≤ i causal mask. The math is the same and only the control flow differs (the same point lesson 03 makes about attention).
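The equivalence of the two control flows can be checked on a toy example. The port's source is not shown in this lesson, so the following is a Python sketch of both styles for a single head, not the TypeScript code itself; the function names incremental and masked and the input numbers are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]     # exp(-inf) evaluates to 0.0
    s = sum(exps)
    return [e / s for e in exps]

def score(q, k):
    return sum(a * b for a, b in zip(q, k)) / len(q) ** 0.5

def incremental(qs, ks, vs):
    """One step per position with a growing cache (the Python-style control flow)."""
    cache_k, cache_v, outs = [], [], []
    for q, k, v in zip(qs, ks, vs):
        cache_k.append(k)
        cache_v.append(v)
        w = softmax([score(q, kj) for kj in cache_k])
        outs.append([sum(w[j] * cache_v[j][c] for j in range(len(cache_v)))
                     for c in range(len(q))])
    return outs

def masked(qs, ks, vs):
    """Whole sequence at once; position i only sees j <= i (the mask-style control flow)."""
    outs = []
    for i, q in enumerate(qs):
        logits = [score(q, ks[j]) if j <= i else float('-inf') for j in range(len(ks))]
        w = softmax(logits)                      # masked positions get weight 0.0
        outs.append([sum(w[j] * vs[j][c] for j in range(len(vs)))
                     for c in range(len(q))])
    return outs

qs = [[1.0, 0.0], [0.5, -0.5], [0.0, 2.0]]
ks = [[0.2, 0.1], [-1.0, 0.3], [0.7, 0.7]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

a = incremental(qs, ks, vs)
b = masked(qs, ks, vs)
max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

Setting a logit to negative infinity gives it a softmax weight of zero, which has the same effect as that key never having been appended to the cache, so max_diff stays at floating-point noise level.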

Sandbox

Each module on the path is a block you can click to see its input → output shape and the exact Python line it runs. Press play (or scrub) to send a data pulse down the path; the two green arcs are the residual bypasses (saved at ①/② and added back at the matching Add stages). The attention stage summarizes the same computation explained in lesson 03 — this lesson focuses on where attention sits inside the complete block. It is a map of the block’s structure and execution order, not a per-stage inspector of real tensor values (the shapes shown are the static layer dimensions; lesson 03 is where you watch the actual attention numbers).