How Transformers See Every Word at Once
Self-attention, query-key-value, feed-forward networks, and generation — the full transformer architecture, animated step by step.
Every time you type into ChatGPT, a machine reads your entire message and writes back word by word. That machine is called a transformer. Before it existed, language models worked like a game of telephone — each word whispered to the next, information decaying with every step. By word fifty, the original message was a ghost. In 2017, a team at Google asked: what if every word could talk to every other word, all at once? Not a whisper chain. A group chat.
Self-Attention: The Group Chat
It starts by turning each word into a vector — a list of numbers where direction encodes meaning. Similar words point in similar directions.
Each word then creates three versions of itself:
- Query — what am I looking for?
- Key — what do I have to offer?
- Value — here's my actual information
These come from three learned weight matrices — trainable parameters shared across every position in the sequence.
The query of each word takes a dot product with the key of every other word. High score means strong connection. The scores are divided by sqrt(d_key) so they don't grow with the key dimension and push softmax into a near-one-hot distribution; softmax then converts them into weights that sum to one. Multiply those weights by the values, and each word gets a weighted sum of every word's information, including its own.
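In code, the whole mechanism is a few matrix multiplications. A minimal NumPy sketch — the function names and toy dimensions here are illustrative, not from any library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over X, shaped (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # scaled dot products
    weights = softmax(scores)                  # each row sums to 1
    return weights @ V, weights                # weighted sum of values

# Toy example: 4 "words" with 8-dimensional vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
print(out.shape, attn.shape)   # (4, 8) (4, 4)
```

The `attn` matrix is exactly the attention map described above: row *i* holds word *i*'s weights over every word in the sequence.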
The result: an attention map — every word's relationship to every other word, visible all at once.
The Full Transformer Block
One attention pattern isn't enough. Language has grammar, meaning, and long-range references happening simultaneously. So the transformer runs multiple attention heads in parallel — eight in the original paper — each learning different relationship patterns. Their outputs get concatenated and projected back together.
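A sketch of the multi-head version, with illustrative names and toy sizes: each head runs the same attention math with its own smaller weight matrices, then the per-head outputs are concatenated and projected back to the model dimension.

```python
import numpy as np

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) triples; W_o: output projection."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)              # softmax per row
        outputs.append(w @ V)                      # one head's view
    return np.concatenate(outputs, axis=-1) @ W_o  # concat, project back

rng = np.random.default_rng(1)
d_model, n_heads = 16, 8
d_head = d_model // n_heads                        # each head works in a smaller space
X = rng.normal(size=(5, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(d_model, d_model))
print(multi_head_attention(X, heads, W_o).shape)   # (5, 16)
```

Note that the heads don't add parameters for free: the model dimension is split across them, so eight heads each attend in a space one-eighth the size.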
After attention, every token passes through a feed-forward network — often holding the majority of a model's parameters. Attention decides *what to look at*; the feed-forward network processes *what was found*.
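The feed-forward step itself is simple: expand each token's vector, apply a nonlinearity, project back down. A sketch with illustrative dimensions — real models typically use a ~4x expansion, and many swap ReLU for variants like GELU:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Expand, apply a nonlinearity (ReLU here), project back down.
    # Applied to each token's vector independently — no mixing across positions.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(2)
d_model, d_ff = 8, 32            # toy sizes; d_ff is usually ~4x d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(3, d_model))                  # 3 tokens
print(feed_forward(x, W1, b1, W2, b2).shape)       # (3, 8)
```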
Wrapping both components: residual connections. A block doesn't replace its input — it adds to it. The original signal flows on a highway through every layer, which is why transformers can stack dozens of layers without the signal collapsing. Layer normalization between the components keeps the values in a stable range along the way.
| Component | Role |
|---|---|
| Multi-head attention | Find relationships between all words |
| Feed-forward network | Process and enrich each token's representation |
| Residual connections | Preserve original signal across layers |
| Layer normalization | Stabilize values between components |
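The components in that table compose into one block roughly like this — a pre-norm sketch, as used in GPT-style models (the original paper instead normalized after each residual add):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to zero mean, unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_block(x, attention, feed_forward):
    # Residual connections: each sublayer ADDS to its input rather than
    # replacing it, so the original signal survives every layer.
    x = x + attention(layer_norm(x))
    x = x + feed_forward(layer_norm(x))
    return x

# With stub sublayers that output zeros, the block is the identity —
# that's the residual highway in action:
x = np.ones((2, 4))
zero = lambda h: np.zeros_like(h)
print(np.allclose(transformer_block(x, zero, zero), x))   # True
```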
How Generation Works
Attention treats words as a set with no order. The fix: positional encodings added to each embedding before anything else, giving the model a sense of sequence.
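One common choice is the sinusoidal encoding from the original paper — each position gets a unique pattern of sine and cosine waves at different frequencies. A sketch (assuming an even `d_model`):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings; assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]            # one row per position
    i = np.arange(d_model // 2)[None, :]         # one frequency per dim pair
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

pe = positional_encoding(6, 8)
print(pe.shape)   # (6, 8)
# Added to the embeddings before the first block:
#   X = embeddings + positional_encoding(seq_len, d_model)
```

Many modern models learn their position information instead (learned embeddings, rotary encodings), but the job is the same: break the symmetry so position matters.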
During generation, the model works autoregressively — it predicts one token at a time. Your prompt runs through every layer. The final vector predicts the first word. Then the prompt plus that new word runs again to predict the second. And again.
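The loop itself is trivial once the model exists. Here `predict_next` is a stand-in for the entire transformer stack — it takes the sequence so far and returns the next token:

```python
def generate(predict_next, tokens, max_new_tokens):
    """Autoregressive loop: feed the growing sequence back in, one step at a time."""
    for _ in range(max_new_tokens):
        tokens = tokens + [predict_next(tokens)]   # append, then go again
    return tokens

# Toy stand-in model: "predict" the sum of the last two tokens, mod 10.
demo = lambda toks: (toks[-1] + toks[-2]) % 10
print(generate(demo, [1, 1], 5))   # [1, 1, 2, 3, 5, 8, 3]
```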
The clever optimization: KV caching. Keys and values from previous tokens get cached so the model doesn't recompute them from scratch on every step. This is what makes real-time generation practical.
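A toy single-head sketch of the idea (class and function names are illustrative): keys and values accumulate in a cache, and each decoding step projects only the one new token.

```python
import numpy as np

class KVCache:
    """Keys/values for already-processed tokens, so each step only
    computes projections for the ONE new token."""
    def __init__(self):
        self.K = None
        self.V = None

    def append(self, k_new, v_new):
        self.K = k_new if self.K is None else np.vstack([self.K, k_new])
        self.V = v_new if self.V is None else np.vstack([self.V, v_new])
        return self.K, self.V

def step(x_new, W_q, W_k, W_v, cache):
    # Only the newest token gets projected; past K/V come from the cache.
    q = x_new @ W_q
    K, V = cache.append(x_new @ W_k, x_new @ W_v)
    scores = q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()                       # softmax over all cached positions
    return w @ V

rng = np.random.default_rng(3)
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
cache = KVCache()
for _ in range(3):                     # three generation steps
    out = step(rng.normal(size=(1, 4)), W_q, W_k, W_v, cache)
print(cache.K.shape)                   # (3, 4): one cached row per token
```

Without the cache, every step would redo the key and value projections for the entire sequence — quadratic waste that this sidesteps.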
Token by token, the response builds — each word informed by every word that came before it, through the same attention mechanism running billions of times a day across every major AI product you've used this year.
Watch the full animated breakdown: How Transformers Actually Work
