02 · Autograd: How Gradients Flow Through a Tiny DAG
Theory
A neural network is a function composed of many small ops: add, multiply, exponentiate, ReLU. To train it, we need the gradient of the loss with respect to every parameter. The trick that makes this practical is reverse-mode automatic differentiation.
Karpathy’s Value class implements it in about 25 lines. Each Value records:
- its scalar
data, - the list of
_childrenit was built from, - the local gradient
d(self) / d(child_i)for each child.
When you call loss.backward():
- Traverse the graph in topological order (children before parents).
- Initialize
loss.grad = 1. - Walk the topo list in reverse and, for each node
v, distributev.gradinto each child via the chain rule:child.grad += local_grad_i * v.grad.
That’s it. No symbolic math, no static graph — the graph is built on the fly as you do the forward pass, and backward() just plays it in reverse.
The 3D sandbox below lets you type an expression, see the live DAG that the Value ops build, then play either the forward pulse (data flowing from leaves to root) or the backward pulse (gradients flowing root to leaves). Drag a slider to change a leaf’s value and watch the whole graph recompute.
Reading the graph
- Layout. Leaf variables sit on the left, each operation to the right of its inputs, and the root on the far right. Branches that merge — like
aandbboth feeding+, then(a+b)andcfeeding*— are drawn as branches that visibly come together, so you can see it’s a DAG, not a chain. - Backward is progressive. Press Backward and the reveal starts at the root with
root.grad = 1, then flows outward; each node’sg=…only appears once the gradient has actually reached it. Nothing shows all the final gradients up front. - Chain rule on the arrows. Every backward arrow is labelled
incoming grad × local derivative = contribution. For the default example, the*node sends1 × c = 10toward(a+b)and1 × (a+b) = -1towardc;(a+b)then sends10 × 1 = 10to each ofaandb. That is the chain rule, made literal. - Derived ops.
-and/are tagged derived: the engine builds them from primitives (a - bisa + (b·-1),a / bisa · b^-1), so the single node you see folds that internal structure. The gradients are still exact — e.g. the right input of-gets a local derivative of-1. - Exponents are constants.
**only accepts a numeric literal exponent (a ** 3), shown taggedconstwith no gradient flowing into it. A variable exponent likea ** bis rejected, because this engine’spowdifferentiates only the base — accepting it would silently leaveb.grad = 0.
Annotated Code
The Value class lives in src/microgpt_annotated.py, in the subsection marked autograd-value-class:
class Value:
def __init__(self, data, _children=(), _local_grads=()):
self.data = data
self.grad = 0
self._children = _children
self._local_grads = _local_grads
def __add__(self, other):
return Value(self.data + other.data, (self, other), (1, 1))
def __mul__(self, other):
return Value(self.data * other.data, (self, other), (other.data, self.data))
# ... pow, exp, log, relu identical in spirit ...
def backward(self):
topo = []
visited = set()
def build(v):
if v not in visited:
visited.add(v)
for c in v._children: build(c)
topo.append(v)
build(self)
self.grad = 1
for v in reversed(topo):
for child, local_grad in zip(v._children, v._local_grads):
child.grad += local_grad * v.gradThe TypeScript port in src/inference/value.ts mirrors this one-for-one — same field names, same op semantics — so the equivalence tests can introspect both sides.
Sandbox
Type any expression using + - * / **, relu(x), exp(x), log(x), single-letter variables, and parentheses. Hit a preset for a starting point. Drag sliders to change variable values. Press Play to watch the pulse sweep the graph, or scrub the timeline by hand; switching Forward/Backward restarts the pulse from the start.
Each node is a small computation chip — a structured card shows its value and grad (the grad reads -- until the backward wave reaches it). Two toggles: local derivatives annotates the derived ops with their primitive expansion, and final gradients reveals every gradient at once (off by default, so the step-by-step backward stays intact). Drag to orbit a little; the view is clamped so the graph always stays readable.