02 · Autograd: How Gradients Flow Through a Tiny DAG

Theory

A neural network is a function composed of many small ops: add, multiply, exponentiate, ReLU. To train it, we need the gradient of the loss with respect to every parameter. The trick that makes this practical is reverse-mode automatic differentiation.

Karpathy’s Value class implements it in about 25 lines. Each Value records:

its scalar data,
the list of _children it was built from,
the local gradient d(self) / d(child_i) for each child.

When you call loss.backward():

Traverse the graph in topological order (children before parents).
Initialize loss.grad = 1.
Walk the topo list in reverse and, for each node v, distribute v.grad into each child via the chain rule: child.grad += local_grad_i * v.grad.

That’s it. No symbolic math, no static graph — the graph is built on the fly as you do the forward pass, and backward() just plays it in reverse.

The 3D sandbox below lets you type an expression, see the live DAG that the Value ops build, then play either the forward pulse (data flowing from leaves to root) or the backward pulse (gradients flowing root to leaves). Drag a slider to change a leaf’s value and watch the whole graph recompute.

Reading the graph

Layout. Leaf variables sit on the left, each operation to the right of its inputs, and the root on the far right. Branches that merge — like a and b both feeding +, then (a+b) and c feeding * — are drawn as branches that visibly come together, so you can see it’s a DAG, not a chain.
Backward is progressive. Press Backward and the reveal starts at the root with root.grad = 1, then flows outward; each node’s g=… only appears once the gradient has actually reached it. Nothing shows all the final gradients up front.
Chain rule on the arrows. Every backward arrow is labelled incoming grad × local derivative = contribution. For the default example, the * node sends 1 × c = 10 toward (a+b) and 1 × (a+b) = -1 toward c; (a+b) then sends 10 × 1 = 10 to each of a and b. That is the chain rule, made literal.
Derived ops. - and / are tagged derived: the engine builds them from primitives (a - b is a + (b·-1), a / b is a · b^-1), so the single node you see folds that internal structure. The gradients are still exact — e.g. the right input of - gets a local derivative of -1.
Exponents are constants. ** only accepts a numeric literal exponent (a ** 3), shown tagged const with no gradient flowing into it. A variable exponent like a ** b is rejected, because this engine’s pow differentiates only the base — accepting it would silently leave b.grad = 0.

Annotated Code

The Value class lives in src/microgpt_annotated.py, in the subsection marked autograd-value-class:


class Value:
    def __init__(self, data, _children=(), _local_grads=()):
        self.data = data
        self.grad = 0
        self._children = _children
        self._local_grads = _local_grads
 
    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1, 1))
 
    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))
 
    # ... pow, exp, log, relu identical in spirit ...
 
    def backward(self):
        topo = []
        visited = set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children: build(c)
                topo.append(v)
        build(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

The TypeScript port in src/inference/value.ts mirrors this one-for-one — same field names, same op semantics — so the equivalence tests can introspect both sides.

Sandbox

Type any expression using + - * / **, relu(x), exp(x), log(x), single-letter variables, and parentheses. Hit a preset for a starting point. Drag sliders to change variable values. Press Play to watch the pulse sweep the graph, or scrub the timeline by hand; switching Forward/Backward restarts the pulse from the start.

Each node is a small computation chip — a structured card shows its value and grad (the grad reads -- until the backward wave reaches it). Two toggles: local derivatives annotates the derived ops with their primitive expansion, and final gradients reveals every gradient at once (off by default, so the step-by-step backward stays intact). Drag to orbit a little; the view is clamped so the graph always stays readable.