The Core Insight
Next Token Prediction
The secret behind every modern language model fits in a single sentence: predict what word comes next. That's it. Everything else—understanding, reasoning, creativity—emerges from this one simple task, scaled to an incomprehensible degree.
The Deceptively Simple Idea
Consider the sentence: "The cat sat on the ___"
What word comes next? You probably thought "mat" or "floor" or "couch." You didn't think "elephant" or "syntax" or "tomorrow." Why? Because you've absorbed patterns from the language around you. You know what words tend to follow other words.
This is exactly what LLMs do—but at a scale that's hard to comprehend. During training, a model sees trillions of examples like this, learning the statistical patterns of language from virtually every kind of text humans have ever written.
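Before looking at real models, a toy example helps make "statistical patterns" concrete. The sketch below is plain Python over a made-up six-sentence corpus: it simply counts which words follow "the". Actual LLMs learn far richer patterns with neural networks rather than raw counts, but the underlying idea of estimating what tends to come next is the same.

```python
# Toy illustration only: count which words follow "the" in a tiny invented corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the floor . the cat slept .".split()

# For every word, count the words that immediately follow it.
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

# Turn the counts after "the" into probabilities.
counts = next_word_counts["the"]
total = sum(counts.values())
for word, count in counts.most_common():
    print(f"P({word!r} | 'the') = {count / total:.2f}")
```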
How It Works
- Input: "The weather is" — model sees this text
- Predict: Calculate probabilities for next word — "nice" (40%), "cold" (25%), "hot" (15%)...
- Sample: Pick one word (e.g., "nice") — input becomes "The weather is nice"
- Repeat: Each new word becomes context for the next prediction (see the code sketch below)
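Here is what that loop looks like in code. This is a minimal sketch, assuming the Hugging Face transformers library and the small GPT-2 model (chosen purely for illustration; the article doesn't prescribe a particular model). Production systems add refinements such as temperature, top-p sampling, and batching.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Input: the model sees this text as token IDs
input_ids = tokenizer("The weather is", return_tensors="pt").input_ids

for _ in range(10):  # generate ten tokens
    with torch.no_grad():
        logits = model(input_ids).logits              # (1, sequence_length, vocab_size)
    probs = torch.softmax(logits[0, -1], dim=-1)      # Predict: probabilities for the next token
    next_id = torch.multinomial(probs, num_samples=1) # Sample: pick one token at random, weighted by probability
    input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=1)  # Repeat: longer context

print(tokenizer.decode(input_ids[0]))
```

Running this a few times gives different continuations, because the sampling step picks tokens at random in proportion to their probabilities.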
Tokens: The Building Blocks
Actually, LLMs don't predict whole words—they predict tokens. A token might be a word, part of a word, or even punctuation.
Token Examples
"Hello" → Hello
"incredible" → incredible
"don't" → don't
Common words are usually single tokens. Rare or complex words get split into pieces. This lets the model handle any text, even words it's never seen before, by combining familiar pieces.
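To see tokenization in practice, a few lines with a real tokenizer are enough. This sketch assumes the Hugging Face GPT-2 tokenizer; different models use different tokenizers, so the exact splits vary (GPT-2, for instance, marks a leading space with "Ġ" in its token strings).

```python
# Minimal sketch: inspect how one real tokenizer (GPT-2's BPE) splits text into tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["Hello", "incredible", "don't", "supercalifragilistic"]:
    pieces = tokenizer.tokenize(text)   # the token strings
    ids = tokenizer.encode(text)        # the integer IDs the model actually sees
    print(f"{text!r} -> {pieces} -> {ids}")
```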
Why Scale Changes Everything
Here's where it gets interesting. Train a small model to predict the next token, and you get something that can complete sentences—but not much more. The outputs are often nonsensical or repetitive.
But increase the model's size (more parameters) and train it on more data, and something remarkable happens: the model starts exhibiting capabilities it was never explicitly trained for.
Small Models
- Complete simple sentences
- Basic grammar
- Limited coherence
Large Models
- Multi-step reasoning
- Code generation
- Translation, analysis
- Creative writing
The Attention Revolution
The breakthrough that made modern LLMs possible came in 2017 with the Transformer architecture and its key innovation: the attention mechanism.
Previous models processed text sequentially—one word after another. This made it hard to capture long-range connections. If a word at the beginning of a paragraph is relevant to understanding a word at the end, the model might lose that connection.
Attention solves this by letting every position in the input "attend to" every other position directly. When predicting the next word, the model can look back at any previous word and weigh how relevant it is.
Example: Attention in Action
"The cat, which had been sleeping in the warm sunlight all afternoon, finally _____"
To predict the next word, the model attends most strongly to "cat" and "sleeping"—not the nearby words "afternoon" or "finally." It understands the sentence structure and knows the verb should relate to what the cat does (like "woke" or "stretched").
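Under the hood, attention boils down to a small amount of linear algebra: compare a query vector for each position against key vectors for every position, turn the comparison scores into weights with a softmax, and take a weighted average of value vectors. The sketch below shows this scaled dot-product attention with random stand-in numbers (PyTorch, toy sizes chosen only for illustration); in a decoder-style model like GPT, a causal mask additionally prevents positions from attending to later ones.

```python
# Toy scaled dot-product attention over five made-up token vectors.
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 8            # five positions, eight-dimensional vectors
x = torch.randn(seq_len, d_model)  # stand-in for the token representations

# In a real Transformer these projections are learned; random weights here.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / d_model ** 0.5  # how relevant is each position to every other?
weights = F.softmax(scores, dim=-1)  # each row sums to 1: the attention weights
output = weights @ V               # blend information from all positions

print(weights)  # a 5x5 grid: who attends to whom, and how strongly
```

Dividing the scores by the square root of the vector dimension keeps them from growing with model size, which keeps the softmax well-behaved.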
The Context Window
Every LLM has a context window—the maximum amount of text it can consider at once. This is measured in tokens.
Early GPT models had context windows of about 2,000 tokens (roughly 1,500 words). Modern models like Claude 3 can handle 200,000 tokens—the equivalent of a long novel.
A larger context window means the model can maintain longer conversations, analyze bigger documents, and keep track of more information while generating responses. But it also requires more computation, which is why context window size remains an active area of research.
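Context limits are easy to run into in practice, so it is common to count tokens before sending a prompt. Here is a minimal sketch, again assuming the Hugging Face GPT-2 tokenizer and using the roughly 2,000-token figure mentioned above as a stand-in limit; real limits depend entirely on the model.

```python
# Minimal sketch: check whether a prompt fits inside a given context window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
CONTEXT_WINDOW = 2000  # illustrative limit; modern models range from a few thousand to 200,000+ tokens

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

prompt = "The cat sat on the mat. " * 500  # deliberately long
n_tokens = count_tokens(prompt)
verdict = "fits in" if n_tokens <= CONTEXT_WINDOW else "exceeds"
print(f"{n_tokens} tokens {verdict} a {CONTEXT_WINDOW}-token window")
```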
Key Takeaways
- LLMs work by predicting the next token, one at a time
- Tokens are pieces of text—words, subwords, or punctuation
- Scale matters: larger models exhibit emergent capabilities
- The attention mechanism lets models capture long-range relationships
- Context window determines how much text the model can consider at once