The Core Insight
Next Token Prediction
The secret behind every modern language model fits in a single sentence: predict what word comes next. That's it. Everything else—understanding, reasoning, creativity—emerges from this one simple task, scaled to an incomprehensible degree.
The Deceptively Simple Idea
Consider the sentence: "The cat sat on the ___"
What word comes next? You probably thought "mat" or "floor" or "couch." You didn't think "elephant" or "syntax" or "tomorrow." Why? Because you've absorbed patterns from the language around you. You know what words tend to follow other words.
This is exactly what LLMs do—but at a scale that's hard to comprehend. During training, a model sees trillions of examples like this, learning the statistical patterns of language from virtually every kind of text humans have ever written.
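Before looking at real models, a toy example helps make "statistical patterns" concrete. The sketch below is plain Python over a made-up six-sentence corpus: it simply counts which words follow "the". Actual LLMs learn far richer patterns with neural networks rather than raw counts, but the underlying idea of estimating what tends to come next is the same.

```python
# Toy illustration only: count which words follow "the" in a tiny invented corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the floor . the cat slept .".split()

# For every word, count the words that immediately follow it.
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

# Turn the counts after "the" into probabilities.
counts = next_word_counts["the"]
total = sum(counts.values())
for word, count in counts.most_common():
    print(f"P({word!r} | 'the') = {count / total:.2f}")
```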
How It Works
- Input: "The weather is" — model sees this text
- Predict: Calculate probabilities for next word — "nice" (40%), "cold" (25%), "hot" (15%)...
- Sample: Pick one word (e.g., "nice") — input becomes "The weather is nice"
- Repeat: Each new word becomes context for the next prediction (see the code sketch below)
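Here is what that loop looks like in code. This is a minimal sketch, assuming the Hugging Face transformers library and the small GPT-2 model (chosen purely for illustration; the article doesn't prescribe a particular model). Production systems add refinements such as temperature, top-p sampling, and batching.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Input: the model sees this text as token IDs
input_ids = tokenizer("The weather is", return_tensors="pt").input_ids

for _ in range(10):  # generate ten tokens
    with torch.no_grad():
        logits = model(input_ids).logits              # (1, sequence_length, vocab_size)
    probs = torch.softmax(logits[0, -1], dim=-1)      # Predict: probabilities for the next token
    next_id = torch.multinomial(probs, num_samples=1) # Sample: pick one token at random, weighted by probability
    input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=1)  # Repeat: longer context

print(tokenizer.decode(input_ids[0]))
```

Running this a few times gives different continuations, because the sampling step picks tokens at random in proportion to their probabilities.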
Tokens: The Building Blocks
Actually, LLMs don't predict whole words—they predict tokens. A token might be a word, part of a word, or even punctuation.
Token Examples
"Hello" → Hello
"incredible" → incredible
"don't" → don't
Common words are usually single tokens. Rare or complex words get split into pieces. This lets the model handle any text, even words it's never seen before, by combining familiar pieces.
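To see tokenization in practice, a few lines with a real tokenizer are enough. This sketch assumes the Hugging Face GPT-2 tokenizer; different models use different tokenizers, so the exact splits vary (GPT-2, for instance, marks a leading space with "Ġ" in its token strings).

```python
# Minimal sketch: inspect how one real tokenizer (GPT-2's BPE) splits text into tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["Hello", "incredible", "don't", "supercalifragilistic"]:
    pieces = tokenizer.tokenize(text)   # the token strings
    ids = tokenizer.encode(text)        # the integer IDs the model actually sees
    print(f"{text!r} -> {pieces} -> {ids}")
```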
Why Scale Changes Everything
Here's where it gets interesting. Train a small model to predict the next token, and you get something that can complete sentences—but not much more. The outputs are often nonsensical or repetitive.
But increase the model's size (more parameters) and train it on more data, and something remarkable happens: the model starts exhibiting capabilities it was never explicitly trained for.
Small Models
- Complete simple sentences
- Basic grammar
- Limited coherence
Large Models
- Multi-step reasoning
- Code generation
- Translation, analysis
- Creative writing
The Attention Revolution
The breakthrough that made modern LLMs possible came in 2017 with the Transformer architecture and its key innovation: the attention mechanism.
Previous models processed text sequentially—one word after another. This made it hard to capture long-range connections. If a word at the beginning of a paragraph is relevant to understanding a word at the end, the model might lose that connection.
Attention solves this by letting every position in the input "attend to" every other position directly. When predicting the next word, the model can look back at any previous word and weigh how relevant it is.
Example: Attention in Action
"The cat, which had been sleeping in the warm sunlight all afternoon, finally _____"
To predict the next word, the model attends most strongly to "cat" and "sleeping"—not the nearby words "afternoon" or "finally." It understands the sentence structure and knows the verb should relate to what the cat does (like "woke" or "stretched").
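Under the hood, attention boils down to a small amount of linear algebra: compare a query vector for each position against key vectors for every position, turn the comparison scores into weights with a softmax, and take a weighted average of value vectors. The sketch below shows this scaled dot-product attention with random stand-in numbers (PyTorch, toy sizes chosen only for illustration); in a decoder-style model like GPT, a causal mask additionally prevents positions from attending to later ones.

```python
# Toy scaled dot-product attention over five made-up token vectors.
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 8            # five positions, eight-dimensional vectors
x = torch.randn(seq_len, d_model)  # stand-in for the token representations

# In a real Transformer these projections are learned; random weights here.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / d_model ** 0.5  # how relevant is each position to every other?
weights = F.softmax(scores, dim=-1)  # each row sums to 1: the attention weights
output = weights @ V               # blend information from all positions

print(weights)  # a 5x5 grid: who attends to whom, and how strongly
```

Dividing the scores by the square root of the vector dimension keeps them from growing with model size, which keeps the softmax well-behaved.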
The Context Window
Every LLM has a context window—the maximum amount of text it can consider at once. This is measured in tokens.
Early GPT models had context windows of about 2,000 tokens (roughly 1,500 words). Modern models like Claude 3 can handle 200,000 tokens—the equivalent of a long novel.
A larger context window means the model can maintain longer conversations, analyze bigger documents, and keep track of more information while generating responses. But it also requires more computation, which is why context window size remains an active area of research.
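Context limits are easy to run into in practice, so it is common to count tokens before sending a prompt. Here is a minimal sketch, again assuming the Hugging Face GPT-2 tokenizer and using the roughly 2,000-token figure mentioned above as a stand-in limit; real limits depend entirely on the model.

```python
# Minimal sketch: check whether a prompt fits inside a given context window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
CONTEXT_WINDOW = 2000  # illustrative limit; modern models range from a few thousand to 200,000+ tokens

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

prompt = "The cat sat on the mat. " * 500  # deliberately long
n_tokens = count_tokens(prompt)
verdict = "fits in" if n_tokens <= CONTEXT_WINDOW else "exceeds"
print(f"{n_tokens} tokens {verdict} a {CONTEXT_WINDOW}-token window")
```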
Key Takeaways
- LLMs work by predicting the next token, one at a time
- Tokens are pieces of text—words, subwords, or punctuation
- Scale matters: larger models exhibit emergent capabilities
- The attention mechanism lets models capture long-range relationships
- Context window determines how much text the model can consider at once