When people first encounter large language models like GPT, a common explanation is that they "predict the next word based on probabilities." This isn't wrong, exactly, but it's incomplete: it glosses over the parts that make these models interesting.
To understand why modern LLMs are different from simpler statistical models, it helps to start with Markov chains.
What Is a Markov Chain?
A Markov chain is a probabilistic model where the next state depends only on the current state. For text generation, this means looking at the last word (or last N words) and choosing the next word based on what typically follows in the training data.
For example, after seeing "the cat", a Markov model trained on English text might predict:
- "sat" (40%)
- "is" (30%)
- "ran" (20%)
- "was" (10%)
The model has no memory beyond its immediate context window. A 2-word Markov chain (bigram model) looks only at the previous word; a 3-word version (trigram) looks at the previous two. This is what I remember learning in computer science classes around 2010.
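To make this concrete, here's a minimal bigram Markov chain in Python. The corpus, function names, and sampling details are my own illustration, not any particular library's API:

```python
import random
from collections import defaultdict

def train_bigram_model(text):
    """Count which words follow each word in the training text."""
    words = text.split()
    follower_counts = defaultdict(lambda: defaultdict(int))
    for current_word, next_word in zip(words, words[1:]):
        follower_counts[current_word][next_word] += 1
    return follower_counts

def generate(model, start_word, length=10):
    """Generate text by repeatedly sampling a next word based
    only on the current word -- the Markov property in action."""
    word = start_word
    output = [word]
    for _ in range(length):
        followers = model.get(word)
        if not followers:
            break  # dead end: this word never appeared mid-corpus
        next_words = list(followers)
        counts = list(followers.values())
        word = random.choices(next_words, weights=counts)[0]
        output.append(word)
    return " ".join(output)

corpus = "the cat sat on the mat and the cat ran to the door"
model = train_bigram_model(corpus)
print(generate(model, "the"))
```

Notice that the probabilities come entirely from co-occurrence counts. The model has no representation of the sentence beyond the single word it's currently standing on.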
But Markov chains can't handle long-range dependencies: relationships between words that sit far apart in a sentence. Consider this one:
"The cat that had been chasing the mouse all morning and was now exhausted from the effort finally gave up and went to sleep."
By the time we get to "went", a simple Markov chain has completely forgotten that "cat" is the subject. It might still generate "went" correctly from local patterns, but not because it understands the sentence structure.
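You can check just how little context the model actually has at that point (a toy check, nothing more):

```python
sentence = ("The cat that had been chasing the mouse all morning "
            "and was now exhausted from the effort finally gave up "
            "and went to sleep").split()

i = sentence.index("went")
print(sentence[i - 1:i])   # bigram context:  ['and']
print(sentence[i - 2:i])   # trigram context: ['up', 'and']
```

The subject "cat" sits twenty words back, far outside either window.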
The Attention Mechanism
Modern LLMs are built on the Transformer architecture, whose central building block is a mechanism called attention. Instead of only looking at the last few words, attention lets the model look at ALL previous words and dynamically decide which ones matter most.
Think of it this way: When you're reading a sentence and encounter the word "it", you naturally glance back through the text to figure out what "it" refers to. You don't just look at the immediately preceding word. You scan for nouns, you use context, you apply your understanding of grammar.
Attention works similarly. For each word being processed, the model calculates attention scores for every other word in the context. These scores determine how much each previous word should influence the current prediction.
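Concretely, each word is projected into a query, a key, and a value vector, and the score between two positions is the dot product of one's query with the other's key. This is the scaled dot-product attention from the original Transformer paper; the NumPy sketch below simplifies it by using the raw embeddings as Q, K, and V in place of the learned projections real models apply:

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention over one sequence.

    queries, keys, values: arrays of shape (seq_len, d).
    Returns (outputs, weights): each output vector is a
    weighted average of the value vectors."""
    d = queries.shape[-1]
    # Similarity score between every pair of positions.
    scores = queries @ keys.T / np.sqrt(d)
    # Causal mask: a position may attend only to itself
    # and earlier positions (masked scores become -inf).
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Softmax turns each row of scores into weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Stand-in embeddings for a 6-word context.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
outputs, weights = attention(x, x, x)  # self-attention
print(weights[5].round(2))  # how the last word weights every earlier one
```

Each row of `weights` is exactly the set of attention scores described above: a large weight means that word contributes heavily to the prediction at that position.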
When processing "The animal didn't cross the street because it was too tired", the attention mechanism can:
- Look at all previous words simultaneously
- Calculate that "it" should pay high attention to "animal" (not "street")
- Use this understanding to make better predictions
Multiple attention heads can focus on different aspects at once: one might track subject-verb agreement, another might follow pronoun references, and another might capture semantic relationships.
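Continuing the sketch, one simplified way to picture multi-head attention is to split each embedding into slices and run attention independently on each slice. (Real models use learned per-head projection matrices rather than raw slicing; this reuses the `attention` function and the `x` embeddings from above.)

```python
def multi_head_attention(x, num_heads):
    """Run attention independently on num_heads slices of the
    embedding, letting each head specialize, then recombine."""
    seq_len, d = x.shape
    assert d % num_heads == 0, "embedding must split evenly"
    head_dim = d // num_heads
    head_outputs = []
    for h in range(num_heads):
        # Each head sees its own slice of every embedding.
        s = x[:, h * head_dim:(h + 1) * head_dim]
        out, _ = attention(s, s, s)
        head_outputs.append(out)
    # Concatenate head outputs back into one vector per position.
    return np.concatenate(head_outputs, axis=-1)

print(multi_head_attention(x, num_heads=2).shape)  # (6, 8)
```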
Why This Matters
The difference between Markov chains and Transformers isn't just about scale or performance. It's about the kind of computation that's possible.
A Markov chain asks: "What usually comes after these words?"
A Transformer asks: "Given everything in the context, including the structure and relationships in this text, what should come next?"