For a long time, I couldn’t grasp what “attention” really was.
Most explanations just threw formulas at me or repeated vague phrases like “compute query, key, and value with a feedforward net.”
But I wanted something deeper. Not just what happens — but why it works.
I’m not ashamed to admit I’m an integrator: I build systems; I don’t train models from scratch. And like many developers I’ve met on forums, I had naive theories like: maybe GPT is just an expert system querying a giant database.
Then I bought Sebastian Raschka’s book Build a Large Language Model (From Scratch) and finally — finally! — got the answer I was looking for.
Let’s Start with a Puzzle
Imagine we have a large 2D puzzle, and we remove some pieces.
The question is: can we guess what’s missing just by looking at the surrounding ones?
If a hole is surrounded by 🌊 blue waves, the missing piece is likely more ocean.
If we see 🐫 camel fur, it might be a hump.
If it’s 🌼 daisies, maybe a flower or a bee.
To illustrate the association visually, I bought a real puzzle and chose one with a variety of details — I picked an adorable hedgehog.

Now let’s think: could we somehow gather information about the face to predict that the missing piece should be an eye?

And in the second picture, can we guess that what’s missing is a piece of the rainbow?

Yes — that’s the answer in both cases. So here’s the intuition:
We could collect thousands of such “islands with holes” and train a system to guess what should fill them.
From 2D to Linear: Puzzles in a Row
Now I want to shift to rows of puzzle pieces to make the analogy closer to text, because text is also sequential information. Let’s look at such rows, formed from the same 2D puzzle pieces we used before.

We want to train a system to look at such rows and predict what comes next.
- If we’ve seen a cheek and a nose, the row is likely to continue with another cheek.
- If there was one eye, we probably need to complete the second eye symmetrically.
- If the arc of a rainbow is going up, it’s natural to complete it with a matching downward curve.
The same thing happens with text — we assume that the beginning of a sentence already contains enough information to guess what comes next.
- The hedgehog has one cheek… maybe another? → cheek
- The hedgehog has one eye… what’s missing? → eye
- The rainbow goes up… and then? → down
Our first goal is to understand how we can capture the information from the first three puzzle pieces in a mathematical form so that the model can learn the patterns. This is where attention enters.
Attention = Looking Back to Decide Who Matters
Imagine we want to build a formula that takes three puzzle pieces and predicts the fourth one that fits the row.
To train it, we show many overlapping 4-piece sequences, always masking the last piece and asking the model to guess it. Over time, it learns patterns like:
“On Friday night I usually order…” → pizza, not furniture.
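As a minimal sketch (plain Python, with made-up piece names and a hypothetical `make_training_pairs` helper), building those overlapping examples from one long row might look like this:

```python
# Sketch: turn one long row of "pieces" (tokens) into overlapping 4-piece
# training examples, where the first 3 are the context and the 4th is the
# piece the model must guess. The names below are purely illustrative.

def make_training_pairs(row, context_len=3):
    pairs = []
    for i in range(len(row) - context_len):
        context = row[i : i + context_len]   # what the model sees
        target = row[i + context_len]        # the "masked" next piece
        pairs.append((context, target))
    return pairs

row = ["cheek", "nose", "eye", "eye", "ear", "fur"]
for context, target in make_training_pairs(row):
    print(context, "->", target)
# ['cheek', 'nose', 'eye'] -> eye
# ['nose', 'eye', 'eye'] -> ear
# ['eye', 'eye', 'ear'] -> fur
```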

Now look at puzzle 3 — it shows part of a right eye. To understand what should come next, it “looks” at:
- Puzzle 1 — the left cheek
- Puzzle 2 — the left eye
In simple attention, we can multiply puzzle 3 by puzzle 1 and by puzzle 2 (in practice, a dot product of their embedding vectors). The result tells us how strongly their patterns match.
Some intuitive examples:
- Left eye ⋅ Right eye → captures symmetry — the network “feels” a mirrored structure
- Cheek ⋅ Eye → low match — those belong to different regions
- Stripe pattern ⋅ matching stripe → confirms continuation of fur
- Blush ⋅ blush → suggests both sides of a face
- Hairline ⋅ darker stripe → weak match — wrong texture
Multiplication reveals when two parts align, repeat, or complete each other. Once we’ve measured how strongly puzzle 3 connects to the previous pieces, we normalize those scores (in practice with a softmax, so they sum to 1) into attention weights, and then multiply each earlier puzzle by its weight. This tells the model which pieces matter more.
But we also care about order. So we add positional information for positions 1, 2, 3 (in real models, a positional embedding added to each token’s vector), so the model knows it saw cheek → eye → eye in that order, not just an unordered mix of those parts.
After we multiply each puzzle piece by its connection strength to the third piece, we perform the final step — we add them all together. This gives us a compressed representation of the context, where both the important details and their order are preserved.

This compressed result is called the context vector. On the left is the context vector needed to predict the puzzle piece on the right.
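To make these steps concrete, here is a small PyTorch-flavored sketch of this simplified attention; the four-dimensional embeddings are toy numbers I invented for the three pieces:

```python
import torch

# Minimal sketch of the "simple attention" described above, before Q/K/V.
# Three puzzle pieces as toy 4-dimensional embeddings; the numbers are made up.
pieces = torch.tensor([
    [0.1, 0.9, 0.2, 0.4],   # piece 1: left cheek
    [0.8, 0.1, 0.7, 0.3],   # piece 2: left eye
    [0.7, 0.2, 0.8, 0.2],   # piece 3: right eye (the "current" piece)
])
# (Real models also add a positional embedding to each piece's vector here,
# so order information survives the summation below.)

query = pieces[2]                        # piece 3 looks back at the row
scores = pieces @ query                  # dot products: how well does each piece match?
weights = torch.softmax(scores, dim=0)   # normalize so the weights sum to 1
context = weights @ pieces               # weighted sum = the context vector

print(weights)   # the two eye pieces dominate; the cheek gets a small weight
print(context)   # compressed summary of the row, used to predict the next piece
```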
But real transformers don’t work directly with the raw puzzle pieces of our analogy (or with raw word embeddings, the mathematical representations of words in semantic space). Instead, each word sends representatives to do the job.
- The current token (puzzle) sends a query.
- Each neighbor sends a key (for matching) and a value (the info to borrow).
These Q/K/V vectors are not the same as the raw embedding; they’re linear transformations of it. Each one is computed by a learned single-layer neural net, that is, a linear layer with no activation.
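As a rough sketch, assuming PyTorch and toy dimensions, those three projections and the attention computed from them might look like this (`W_query`, `W_key`, `W_value` are just illustrative names):

```python
import torch
import torch.nn as nn

d_in, d_out = 4, 4   # toy dimensions, chosen only for illustration

# Three learned linear layers, no activation: each turns the same embedding
# into a different "representative" of the token.
W_query = nn.Linear(d_in, d_out, bias=False)
W_key   = nn.Linear(d_in, d_out, bias=False)
W_value = nn.Linear(d_in, d_out, bias=False)

x = torch.randn(3, d_in)   # our three puzzle pieces as embeddings

Q = W_query(x)   # what each piece is looking for
K = W_key(x)     # how each piece describes itself
V = W_value(x)   # what each piece offers if selected

# Scaled dot-product attention over the whole row:
scores = Q @ K.T / d_out ** 0.5           # query-key matching
weights = torch.softmax(scores, dim=-1)   # one row of attention weights per piece
contexts = weights @ V                    # one context vector per piece
```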
Let’s now imagine what happens when we transform our puzzle pieces into Q, K, and V vectors. Each transformation gives the same puzzle piece a different role — what it’s looking for, how it describes itself, and what it offers to the final output.
| Puzzle Piece | Q (Query): what am I looking for? | K (Key): how do I describe myself? | V (Value): what do I offer if selected? |
|---|---|---|---|
| Eye | Looking for symmetry, a matching eye | I’m a dark round shape | I bring visual focus and gaze |
| Cheek | Looking for a soft curve to continue | I’m pink and round on the left side | I soften the overall shape and add warmth |
| Corner | Looking for a border or edge | I’m straight and light | I signal a frame or outer boundary |
| Fur | Looking for matching texture direction | I’m fuzzy with diagonal lines | I carry continuity and texture |
| Nose | Looking for central alignment and symmetry | I’m centered, small and slightly raised | I bring facial balance and orientation |
This lets the model extract different “views” of the same token. Why? Because the same word might be relevant in different ways. With multiple Q/K/V projections — a.k.a. multi-head attention — we get multiple perspectives at once.
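For illustration, PyTorch even ships a ready-made multi-head module; here is a tiny sketch with arbitrary toy sizes:

```python
import torch
import torch.nn as nn

# Four heads, each with its own Q/K/V projections, looking at the same
# sequence from different angles. Dimensions are arbitrary toy values.
mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

x = torch.randn(1, 3, 64)     # a batch of 1 row with 3 "pieces"
out, weights = mha(x, x, x)   # self-attention: query = key = value = x
print(out.shape)              # torch.Size([1, 3, 64])
```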
Training the Mappers
We don’t manually program these Q/K/V projections. We just say: “Learn to project embeddings into useful Q/K/V vectors that produce good predictions.” So we train the system by:
- Feeding puzzle rows (or text sequences, for a real LLM)
- Computing context from attention
- Predicting the next piece
- Comparing prediction to ground truth
- Updating the Q/K/V generators to reduce the error
Do this on millions of examples, and the system becomes able to generalize.
It starts predicting coherent continuations.
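Here is a compressed sketch of that loop; the model, vocabulary size, and random “rows” below are toy placeholders, not a real GPT:

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 50, 32, 4

model = nn.Sequential(                     # stand-in for a real attention model
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):
    # Feed puzzle rows / token sequences (random IDs here, real data in practice)
    batch = torch.randint(0, vocab_size, (8, seq_len + 1))
    contexts, targets = batch[:, :-1], batch[:, 1:]
    logits = model(contexts)               # compute context, predict the next piece
    loss = criterion(logits.flatten(0, 1), targets.flatten())
    optimizer.zero_grad()
    loss.backward()                        # compare prediction to ground truth
    optimizer.step()                       # update the weights to reduce the error
```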
Determinism vs Exploration
Once trained, we can use it to predict. We have two main strategies:
- Greedy: pick the highest-probability continuation
- Sampling: allow for some randomness — based on temperature, top-k, etc.
If we think in terms of puzzles: a greedy choice sees an eye and places the matching second eye — a perfect mirror, no doubt. But with sampling, we might try:
- a wink instead of an open eye
- a pirate eye patch covering one side
- a leaf partly covering the eye
Each version still makes sense, but brings a different style or interpretation to the puzzle.
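A small sketch of the two strategies on a toy vector of next-piece scores (logits); the temperature and top-k values are arbitrary:

```python
import torch

logits = torch.tensor([2.0, 1.5, 0.3, -1.0])   # toy scores for 4 candidate pieces

# Greedy: always take the highest-probability continuation.
greedy_choice = torch.argmax(logits)

# Sampling with temperature and top-k: keep only the k best candidates,
# soften or sharpen the distribution, then draw at random.
temperature, k = 0.8, 3
top_logits, top_idx = torch.topk(logits, k)
probs = torch.softmax(top_logits / temperature, dim=-1)
sampled_choice = top_idx[torch.multinomial(probs, num_samples=1)]
```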
There’s More Inside the Block
Attention alone isn’t enough. Each transformer block also includes:
- Layer normalization (to stabilize training)
- Nonlinearity (e.g., GELU)
- Feedforward layers (to mix and reshape representations)
- Residual connections (to preserve original info)
These elements help the network remain deep and expressive without collapsing.
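Put together, one block might look roughly like this pre-norm sketch in PyTorch (not the exact layout of any particular model):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Sketch of one block: attention plus the pieces listed above."""
    def __init__(self, d_model=64, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)              # stabilize training
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(                        # feedforward: mix and reshape
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),                                  # nonlinearity
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]                   # residual: keep the original info
        x = x + self.ff(self.norm2(x))                  # residual again
        return x
```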
Stacking Blocks
We don’t use just one transformer block. We stack them. Why? Because deeper layers can refine meaning further. Each layer operates on the output of the previous one — allowing multi-step reasoning. It’s like layering levels of abstraction: from letters → to words → to syntax → to logic → to intention.
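Reusing the TransformerBlock sketch from the previous snippet, stacking is just applying the blocks one after another:

```python
import torch
import torch.nn as nn

# Depth and width here are arbitrary toy values.
blocks = nn.Sequential(*[TransformerBlock(d_model=64, num_heads=4) for _ in range(6)])

x = torch.randn(1, 3, 64)   # 3 "pieces", 64-dimensional embeddings
out = blocks(x)             # same shape, progressively refined representation
```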
Summary
I believe I’ve managed to explain the idea clearly using puzzles — at least the parts without deep learning terminology should feel intuitive.
To write this article, I actually bought a beautiful puzzle and worked through it myself, piece by piece, until I understood how things really fit together.
If you’re curious to dig deeper and get hands-on with how GPT works under the hood, I highly recommend the book by Sebastian Raschka that I mentioned at the beginning. Every great idea has that one click — the moment when a scientist’s brain lights up and thinks, “Wait… this could work.” And I finally understood that click for the transformer.