• microgpt.py

    The most atomic way to train and run inference for a GPT in pure, dependency-free Python.

    by @karpathy

    This page presents Andrej Karpathy's microgpt.py as a literate program. The entire file is the complete algorithm — everything else is just efficiency. On the left you'll find prose annotations; on the right, the corresponding code.

    Source: karpathy.github.io/2026/02/12/microgpt/
    Converted to literate programming style inspired by Backbone.js annotated source, using Claude Opus 4.6.

    Table of Contents

    • Preamble & Imports
    • Dataset
    • Tokenizer
    • Autograd Engine
    • Value: Arithmetic Ops
    • Value: Unary & Reverse Ops
    • Backward Pass
    • Hyperparameters
    • Parameter Initialization
    • Model Building Blocks
    • GPT Forward Pass
    • Multi-Head Attention
    • Feed-Forward (MLP) Block
    • Adam Optimizer
    • Training Loop
    • Adam Update Step
    • Inference
  • §

    Preamble & Imports

    The entire GPT — training and inference — uses only three Python standard library modules. No PyTorch, no NumPy, no TensorFlow. Every operation is built from scratch using pure scalar arithmetic. The random seed is fixed for reproducibility.

    """
    The most atomic way to train and run inference
    for a GPT in pure, dependency-free Python.
    This file is the complete algorithm.
    Everything else is just efficiency.
    
    @karpathy
    """
    
    import os       # os.path.exists
    import math     # math.log, math.exp
    import random   # random.seed, random.choices, random.gauss, random.shuffle
    random.seed(42) # Let there be order among chaos
    
  • §

    Dataset

    The dataset is a list of human names from Karpathy's makemore project. If input.txt isn't already on disk, it's downloaded automatically. Each line becomes one "document" in our training set. The documents are shuffled so the model doesn't memorize alphabetical order.

    # Let there be a Dataset `docs`: list[str] of documents
    # (e.g. a list of names)
    if not os.path.exists('input.txt'):
        import urllib.request
        names_url = 'https://raw.githubusercontent.com/karpathy/makemore/988aa59/names.txt'
        urllib.request.urlretrieve(names_url, 'input.txt')
    docs = [line.strip() for line in open('input.txt') if line.strip()]
    random.shuffle(docs)
    print(f"num docs: {len(docs)}")
    
  • §

    Tokenizer

    The tokenizer is character-level — the simplest possible scheme. Every unique character in the dataset gets a numeric ID. A special BOS (Beginning of Sequence) token is added to the vocabulary and used to mark both the start and end of each name. For this names dataset, the vocabulary is just the 26 lowercase letters plus the BOS token = 27 total tokens.

    Production GPTs use sub-word tokenizers (BPE), but character-level is perfect for learning the structure of the algorithm.

    # Let there be a Tokenizer to translate strings
    # to sequences of integers ("tokens") and back
    uchars = sorted(set(''.join(docs)))
    BOS = len(uchars)
    vocab_size = len(uchars) + 1
    print(f"vocab size: {vocab_size}")
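
    As a concrete sketch, here is the same scheme applied to a hypothetical mini-vocabulary (the sample names and the encode/decode helpers are made up for illustration; the source builds uchars and BOS identically but inlines the encoding in its training loop):

```python
# Hypothetical mini-vocabulary, built the same way as in the source:
# sorted unique characters, with BOS as the one extra id at the end.
uchars = sorted(set("emmaolivia"))  # ['a', 'e', 'i', 'l', 'm', 'o', 'v']
BOS = len(uchars)                   # 7
vocab_size = len(uchars) + 1        # 8

def encode(name):
    # wrap the name with BOS on both sides, as the training loop does
    return [BOS] + [uchars.index(ch) for ch in name] + [BOS]

def decode(tokens):
    return ''.join(uchars[t] for t in tokens if t != BOS)

print(encode("emma"))           # [7, 1, 4, 4, 0, 7]
print(decode(encode("emma")))   # emma
```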
    
  • §

    Autograd Engine

    The heart of any deep learning framework is automatic differentiation. Here it's implemented in a single class: Value. Each Value wraps a scalar number and tracks how it was computed (its children in the computation graph and the local gradients of that operation).

    During the forward pass, operations like +, *, and exp build up a directed acyclic graph. During the backward pass, gradients flow backwards through this graph via the chain rule.

    __slots__ is a Python memory optimization — it prevents creation of a __dict__ per instance, which matters when you have millions of Value nodes.

    # Let there be Autograd to recursively apply
    # the chain rule through a computation graph
    class Value:
        __slots__ = (
            'data', 'grad',
            '_children', '_local_grads'
        )
    
        def __init__(self, data, children=(), local_grads=()):
            self.data = data
            self.grad = 0
            self._children = children
            self._local_grads = local_grads
    
  • §

    Arithmetic Operations

    Each operation records the local gradient — the partial derivative of the output with respect to each input. For example:

    • Addition a + b: ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1
    • Multiplication a * b: ∂(a·b)/∂a = b, ∂(a·b)/∂b = a
    • Power a ** n: ∂(aⁿ)/∂a = n·aⁿ⁻¹

    Non-Value operands are automatically promoted via Value(other), so you can write x + 1 freely.

        def __add__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            return Value(
                self.data + other.data,
                (self, other), (1, 1)
            )
    
        def __mul__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            return Value(
                self.data * other.data,
                (self, other),
                (other.data, self.data)
            )
    
        def __pow__(self, other):
            return Value(
                self.data ** other,
                (self,),
                (other * self.data ** (other - 1),)
            )
    
  • §

    Unary & Reverse Operations

    The functions log, exp, and relu each have well-known derivatives:

    • log(x): derivative is 1/x
    • exp(x): derivative is exp(x) (it's its own derivative!)
    • relu(x): derivative is 1 if x>0, else 0

    The "reverse" dunder methods (__radd__, __rmul__, etc.) let Python handle expressions like 2 * value where the Value is on the right side. Subtraction and division are defined in terms of addition, negation, and power — no new gradient rules needed.

        def log(self):
            return Value(math.log(self.data),
                         (self,), (1/self.data,))
    
        def exp(self):
            return Value(math.exp(self.data),
                         (self,), (math.exp(self.data),))
    
        def relu(self):
            return Value(max(0, self.data),
                         (self,), (float(self.data > 0),))
    
        def __neg__(self):        return self * -1
        def __radd__(self, o):  return self + o
        def __sub__(self, o):   return self + (-o)
        def __rsub__(self, o):  return o + (-self)
        def __rmul__(self, o):  return self * o
        def __truediv__(self, o):
            return self * o ** -1
        def __rtruediv__(self, o):
            return o * self ** -1
    
  • §

    Backward Pass (Backpropagation)

    The backward() method implements reverse-mode automatic differentiation. It first builds a topological sort of the computation graph (depth-first post-order traversal), then walks backwards through it.

    For each node, the chain rule says: a child's gradient accumulates local_grad × parent_grad. The += handles the case where a value is used in multiple operations — gradients from different paths are summed, which is the multivariate chain rule.

    This is the exact same algorithm that PyTorch's loss.backward() runs, just without any of the performance engineering.

        def backward(self):
            topo = []
            visited = set()
            def build_topo(v):
                if v not in visited:
                    visited.add(v)
                    for child in v._children:
                        build_topo(child)
                    topo.append(v)
            build_topo(self)
            self.grad = 1
            for v in reversed(topo):
                for child, local_grad in \
                        zip(v._children, v._local_grads):
                    child.grad += local_grad * v.grad
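
    A standard way to sanity-check an engine like this is to compare its chain-rule gradients against numerical finite differences. A standalone sketch with plain floats (the toy function f and the test point are made up for illustration, not taken from the source):

```python
import math

# Toy function f(x, y) = exp(x*y) + x
def f(x, y):
    return math.exp(x * y) + x

# Analytic gradients, as a backward pass would compute them:
# df/dx = y*exp(x*y) + 1,  df/dy = x*exp(x*y)
def grad_f(x, y):
    return y * math.exp(x * y) + 1, x * math.exp(x * y)

# Central finite differences approximate the same derivatives numerically
def numeric_grad(fn, x, y, h=1e-6):
    gx = (fn(x + h, y) - fn(x - h, y)) / (2 * h)
    gy = (fn(x, y + h) - fn(x, y - h)) / (2 * h)
    return gx, gy

ax, ay = grad_f(0.5, -1.2)
nx, ny = numeric_grad(f, 0.5, -1.2)
print(ax - nx, ay - ny)  # both differences should be ~0
```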
    
  • §

    Hyperparameters

    These define the architecture of our tiny transformer:

    • n_layer = 1 — a single transformer block (GPT-3 has 96)
    • n_embd = 16 — each token is a 16-dimensional vector (GPT-2 uses 768)
    • block_size = 16 — maximum context window (the longest name is 15 chars)
    • n_head = 4 — 4 attention heads, each seeing 4 dimensions

    The head_dim is derived: each head gets n_embd / n_head = 4 dimensions. Multi-head attention lets the model attend to different positional relationships simultaneously.

    # Initialize the parameters, to store the
    # knowledge of the model
    n_layer = 1
    n_embd = 16
    block_size = 16
    n_head = 4
    head_dim = n_embd // n_head
    
  • §

    Parameter Initialization

    All learnable parameters live in a flat dictionary called state_dict (borrowing PyTorch naming). Each "matrix" is a list of lists of Value objects initialized from a Gaussian distribution with std=0.08.

    The weight matrices are:

    • wte — token embeddings (vocab_size × n_embd)
    • wpe — positional embeddings (block_size × n_embd)
    • lm_head — final linear projection back to vocabulary logits
    • Per layer: Q, K, V, O attention matrices and a 2-layer MLP with 4× expansion

    Finally, all Value objects are gathered into a single flat list params for the optimizer to iterate over.

    matrix = lambda nout, nin, std=0.08: \
        [[Value(random.gauss(0, std))
          for _ in range(nin)]
         for _ in range(nout)]
    
    state_dict = {
        'wte':     matrix(vocab_size, n_embd),
        'wpe':     matrix(block_size, n_embd),
        'lm_head': matrix(vocab_size, n_embd),
    }
    
    for i in range(n_layer):
        state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)
        state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)
        state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)
        state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)
        state_dict[f'layer{i}.mlp_fc1'] = matrix(4*n_embd, n_embd)
        state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4*n_embd)
    
    params = [p for mat in state_dict.values()
                for row in mat
                for p in row]
    print(f"num params: {len(params)}")
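
    The printed count can be checked by hand. Assuming the 27-token vocabulary of the names dataset, the matrix shapes above give:

```python
vocab_size, n_embd, block_size, n_layer = 27, 16, 16, 1

count = (vocab_size * n_embd        # wte:     27 x 16
         + block_size * n_embd      # wpe:     16 x 16
         + vocab_size * n_embd      # lm_head: 27 x 16
         + n_layer * (4 * n_embd * n_embd       # wq, wk, wv, wo
                      + 4 * n_embd * n_embd     # mlp_fc1: 64 x 16
                      + n_embd * 4 * n_embd))   # mlp_fc2: 16 x 64
print(count)  # 4192
```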
    
  • §

    Model Building Blocks

    Three small helper functions define the primitive operations used inside the transformer. These follow GPT-2 with minor simplifications (noted by Karpathy): LayerNorm → RMSNorm, no biases, and GeLU → ReLU.

    linear(x, w)

    A matrix-vector product. Each row of w is dotted with input x to produce one output element. This is the fundamental operation of neural networks — a learned linear transformation.

    softmax(logits)

    Converts raw scores (logits) into a probability distribution that sums to 1. The max_val subtraction is the standard numerical stability trick to prevent overflow in exp().

    rmsnorm(x)

    Root Mean Square Normalization — a simpler alternative to LayerNorm that skips the mean-centering step. It divides each element by the RMS of the vector, keeping activations at a controlled scale. The 1e-5 epsilon prevents division by zero.

    # Define the model architecture: a function
    # mapping tokens and parameters to logits
    # over what comes next.
    #
    # Follow GPT-2, blessed among the GPTs, with
    # minor differences: layernorm -> rmsnorm,
    # no biases, GeLU -> ReLU
    
    def linear(x, w):
        return [sum(wi * xi for wi, xi in zip(wo, x))
                for wo in w]
    
    def softmax(logits):
        max_val = max(val.data for val in logits)
        exps = [(val - max_val).exp() for val in logits]
        total = sum(exps)
        return [e / total for e in exps]
    
    def rmsnorm(x):
        ms = sum(xi * xi for xi in x) / len(x)
        scale = (ms + 1e-5) ** -0.5
        return [xi * scale for xi in x]
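
    Both normalizers can be exercised with plain floats — the arithmetic is identical to the Value versions above, and the inputs below are made up for illustration:

```python
import math

# Plain-float mirror of the document's softmax
def softmax(logits):
    max_val = max(logits)
    exps = [math.exp(v - max_val) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Plain-float mirror of the document's rmsnorm
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

probs = softmax([2.0, 1.0, 0.1])
print(sum(probs))        # ~1.0: a valid probability distribution

y = rmsnorm([3.0, -4.0])
rms_out = (sum(v * v for v in y) / len(y)) ** 0.5
print(rms_out)           # ~1.0: unit RMS after normalization
```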
    
  • §

    GPT Forward Pass

    This is the core function — it takes a single token and its position, and produces logits (unnormalized scores) over the entire vocabulary for what token comes next.

    The function operates in autoregressive / KV-cache mode: it processes one token at a time and caches each position's keys and values in the per-layer keys and values lists. This is exactly how production LLMs generate text token-by-token.

    Step 1: Embedding. Look up the token embedding and positional embedding, then add them element-wise. This gives the model both what the token is and where it is in the sequence. An initial RMSNorm is applied — not redundant because gradients flow back through the residual connections.

    def gpt(token_id, pos_id, keys, values):
        # token embedding
        tok_emb = state_dict['wte'][token_id]
        # position embedding
        pos_emb = state_dict['wpe'][pos_id]
        # joint token and position embedding
        x = [t + p for t, p in zip(tok_emb, pos_emb)]
        # note: not redundant due to backward pass
        # via the residual connection
        x = rmsnorm(x)
    
  • §

    Multi-Head Attention

    For each layer, the normalized input is projected into queries, keys, and values via three separate linear transformations. The keys and values are cached for future tokens (the KV cache).

    The attention mechanism then runs independently for each head:

    1. Slice Q, K, V into per-head chunks of size head_dim
    2. Compute attention scores: Q·Kᵀ / √d
    3. Apply softmax to get attention weights (how much to attend to each past position)
    4. Weighted sum of values produces the head's output

    The / head_dim**0.5 scaling prevents dot products from growing too large with dimensionality, which would push softmax into saturation.

    The concatenated head outputs pass through attn_wo (the output projection) and are added back to the input via a residual connection — a key ingredient that makes deep networks trainable.
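
    Steps 1–4 for a single head can be traced with plain floats. The query, cached keys, and cached values below are made up for illustration (head_dim = 2, two cached positions):

```python
import math

# Plain-float softmax, mirroring the document's helper
def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

head_dim = 2
q = [1.0, 0.0]                         # query for the current token
cached_k = [[1.0, 0.0], [0.0, 1.0]]    # keys of positions 0 and 1
cached_v = [[1.0, 2.0], [3.0, 4.0]]    # values of positions 0 and 1

# scaled dot-product scores against every cached position
scores = [sum(q[j] * k[j] for j in range(head_dim)) / head_dim ** 0.5
          for k in cached_k]
weights = softmax(scores)              # attends more to position 0,
                                       # whose key matches the query
# weighted sum of cached values is the head's output
out = [sum(weights[t] * cached_v[t][j] for t in range(len(cached_v)))
       for j in range(head_dim)]
print(weights, out)
```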

        for li in range(n_layer):
            # 1) Multi-head Attention block
            x_residual = x
            x = rmsnorm(x)
            q = linear(x, state_dict[f'layer{li}.attn_wq'])
            k = linear(x, state_dict[f'layer{li}.attn_wk'])
            v = linear(x, state_dict[f'layer{li}.attn_wv'])
            keys[li].append(k)
            values[li].append(v)
    
            x_attn = []
            for h in range(n_head):
                hs = h * head_dim
                q_h = q[hs:hs+head_dim]
                k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
                v_h = [vi[hs:hs+head_dim] for vi in values[li]]
    
                attn_logits = [
                    sum(q_h[j] * k_h[t][j]
                        for j in range(head_dim))
                    / head_dim ** 0.5
                    for t in range(len(k_h))
                ]
                attn_weights = softmax(attn_logits)
                head_out = [
                    sum(attn_weights[t] * v_h[t][j]
                        for t in range(len(v_h)))
                    for j in range(head_dim)
                ]
                x_attn.extend(head_out)
    
            x = linear(x_attn,
                        state_dict[f'layer{li}.attn_wo'])
            x = [a + b for a, b in zip(x, x_residual)]
    
  • §

    Feed-Forward (MLP) Block

    After attention, the transformer applies a position-wise feed-forward network. This is a classic two-layer MLP with:

    1. RMSNorm on the input
    2. Linear projection expanding from n_embd → 4 × n_embd
    3. ReLU activation (GPT-2 uses GeLU, but ReLU keeps things simple)
    4. Linear projection compressing back: 4 × n_embd → n_embd
    5. Another residual connection

    The 4× expansion ratio is standard in transformers. It gives the network more capacity to learn complex non-linear transformations at each position.

    After all layers, the final lm_head linear projection maps from the embedding dimension back to vocabulary size, producing one logit per token in the vocabulary.

            # 2) MLP block
            x_residual = x
            x = rmsnorm(x)
            x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
            x = [xi.relu() for xi in x]
            x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
            x = [a + b for a, b in zip(x, x_residual)]
    
        logits = linear(x, state_dict['lm_head'])
        return logits
    
  • §

    Adam Optimizer

    Adam (Adaptive Moment Estimation) is the standard optimizer for training transformers. It maintains two buffers per parameter:

    • m — first moment (exponential moving average of gradients, like momentum)
    • v — second moment (exponential moving average of squared gradients, for per-parameter learning rate scaling)

    The hyperparameters:

    • lr = 0.01 — base learning rate (quite aggressive; works for this tiny model)
    • β₁ = 0.85 — momentum decay (typical is 0.9)
    • β₂ = 0.99 — second moment decay (typical is 0.999)
    • ε = 1e-8 — prevents division by zero

    # Let there be Adam, the blessed optimizer
    # and its buffers
    learning_rate = 0.01
    beta1, beta2, eps_adam = 0.85, 0.99, 1e-8
    m = [0.0] * len(params)  # first moment
    v = [0.0] * len(params)  # second moment
    
  • §

    Training Loop

    The training loop is strikingly simple — 1000 steps, one document per step (batch size = 1). For each step:

    1. Tokenize a name, wrapping it with BOS tokens on both sides
    2. Forward pass: feed tokens one by one through the GPT, predicting each next token
    3. Compute loss: cross-entropy (negative log probability of the correct next token)
    4. Backward pass: compute all gradients via autograd
    5. Update: apply Adam optimizer to all parameters

    The forward loop processes tokens autoregressively: at each position, the model sees all previous tokens (via the KV cache) and predicts the next one. The loss is the average negative log-likelihood across all positions.

    "May yours be low." — a fitting prayer for any loss function.
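
    Step 3 in isolation: cross-entropy is just the negative log of the probability the model assigned to the correct next token (the distribution below is made up for illustration):

```python
import math

probs = [0.1, 0.7, 0.2]   # model's predicted distribution over 3 tokens
target_id = 1             # the true next token
loss = -math.log(probs[target_id])
print(loss)               # ~0.357: fairly low, since the model put 70%
                          # on the right answer; -log(1.0) = 0 is perfect
```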

    # Repeat in sequence
    num_steps = 1000
    for step in range(num_steps):
    
        # Take single document, tokenize it,
        # surround it with BOS on both sides
        doc = docs[step % len(docs)]
        tokens = [BOS] + \
                 [uchars.index(ch) for ch in doc] + \
                 [BOS]
        n = min(block_size, len(tokens) - 1)
    
        # Forward the token sequence through the model,
        # building up the computation graph all the
        # way to the loss
        keys = [[] for _ in range(n_layer)]
        values = [[] for _ in range(n_layer)]
        losses = []
        for pos_id in range(n):
            token_id = tokens[pos_id]
            target_id = tokens[pos_id + 1]
            logits = gpt(token_id, pos_id, keys, values)
            probs = softmax(logits)
            loss_t = -probs[target_id].log()
            losses.append(loss_t)
        # final average loss over the document
        # May yours be low.
        loss = (1 / n) * sum(losses)
    
        # Backward the loss, calculating the gradients
        # w.r.t. all model parameters
        loss.backward()
    
  • §

    Adam Update Step

    After backpropagation, every Value in params has a .grad attribute with the gradient of the loss with respect to that parameter. The Adam update rule is:

    1. Update first moment: m = β₁·m + (1-β₁)·grad
    2. Update second moment: v = β₂·v + (1-β₂)·grad²
    3. Bias-correct both: m̂ = m/(1-β₁ᵗ), v̂ = v/(1-β₂ᵗ)
    4. Update parameter: θ -= lr · m̂ / (√v̂ + ε)

    A linear learning rate decay schedule is applied: the learning rate ramps down to zero over the course of training. Each parameter's gradient is reset to zero after the update, ready for the next forward-backward pass.
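
    The same four-step update can be run standalone on a single float. The toy objective f(x) = x² is made up for illustration; the hyperparameters and the linear decay schedule match the document's:

```python
# Adam minimizing f(x) = x^2 from x = 1.0 (gradient is 2x)
learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8
x, m, v = 1.0, 0.0, 0.0
num_steps = 1000
for step in range(num_steps):
    grad = 2 * x                                  # df/dx
    m = beta1 * m + (1 - beta1) * grad            # 1) first moment
    v = beta2 * v + (1 - beta2) * grad ** 2       # 2) second moment
    m_hat = m / (1 - beta1 ** (step + 1))         # 3) bias correction
    v_hat = v / (1 - beta2 ** (step + 1))
    lr_t = learning_rate * (1 - step / num_steps) # linear decay
    x -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam) # 4) parameter update
print(x)  # close to 0, the minimum of f
```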

        # Adam optimizer update: update model parameters
        # based on the corresponding gradients
        lr_t = learning_rate * (1 - step / num_steps)
        for i, p in enumerate(params):
            m[i] = beta1 * m[i] + (1 - beta1) * p.grad
            v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
            m_hat = m[i] / (1 - beta1 ** (step + 1))
            v_hat = v[i] / (1 - beta2 ** (step + 1))
            p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
            p.grad = 0
    
        print(f"step {step+1:4d} / {num_steps:4d} | "
              f"loss {loss.data:.4f}", end='\r')
    
  • §

    Inference

    After training, the model generates new names that don't exist in the training set — hallucinated names, in a delightful use of the term.

    Generation is autoregressive: start with BOS, sample a token from the predicted distribution, feed it back in, repeat until BOS appears again (or we hit the maximum context length).

    The temperature parameter controls creativity:

    • temperature → 0: always picks the most likely token (deterministic, repetitive)
    • temperature = 1: samples from the raw learned distribution
    • temperature > 1: flattens the distribution (more random, more "creative")

    Here temperature = 0.5 gives a nice balance: the names sound plausible but are novel. The logits are divided by temperature before softmax, which sharpens or flattens the probability distribution.

    random.choices does weighted sampling from the vocabulary based on the model's predicted probabilities.
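
    The effect of temperature can be seen directly by dividing a fixed set of logits (made up here) before softmax:

```python
import math

# Plain-float softmax, mirroring the document's helper
def softmax(logits):
    max_val = max(logits)
    exps = [math.exp(v - max_val) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
results = {}
for temperature in (0.5, 1.0, 2.0):
    results[temperature] = softmax([l / temperature for l in logits])
    print(temperature, [round(p, 3) for p in results[temperature]])
# 0.5 [0.867, 0.117, 0.016]   <- sharper: mass concentrates on the argmax
# 1.0 [0.665, 0.245, 0.09]    <- the raw learned distribution
# 2.0 [0.506, 0.307, 0.186]   <- flatter: more random samples
```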

    And that's it — a complete GPT, from autograd to attention to Adam, in ~120 lines of pure Python. Everything else in modern deep learning is, as Karpathy says, just efficiency.

    # Inference: may the model babble back to us
    temperature = 0.5
    print("\n--- inference (new, hallucinated names) ---")
    for sample_idx in range(20):
        keys = [[] for _ in range(n_layer)]
        values = [[] for _ in range(n_layer)]
        token_id = BOS
        sample = []
        for pos_id in range(block_size):
            logits = gpt(token_id, pos_id, keys, values)
            probs = softmax(
                [l / temperature for l in logits]
            )
            token_id = random.choices(
                range(vocab_size),
                weights=[p.data for p in probs]
            )[0]
            if token_id == BOS:
                break
            sample.append(uchars[token_id])
        print(f"sample {sample_idx+1:2d}: {''.join(sample)}")