Why AI Cannot Count the Letters in Strawberry

Have you ever wondered why an AI that can summarize legal documents and write working code can still fail to count the letters in a single word? Ask many language models how many r's are in strawberry and they will tell you two. The correct answer is three. This has become one of the most widely shared examples of AI failure on the internet, and the explanation behind it is more interesting than most people realize.

The Model Does Not Read the Way You Do

When you look at the word strawberry, your brain processes it as ten individual characters: s, t, r, a, w, b, e, r, r, y. Counting the r's is trivial because you can walk through each letter one at a time.

A language model never sees the word that way.

Before any text reaches the model, it passes through a component called a tokenizer. The tokenizer's job is to break raw text into smaller units called tokens. These are not always words. They are frequently fragments of words, combinations of characters, or sometimes individual characters, depending on how often certain sequences appear in the training data.

Many modern language models use an algorithm called Byte Pair Encoding, or BPE, for this. It starts with individual characters (or bytes) and repeatedly merges the most frequently occurring adjacent pair until it reaches a target vocabulary size. Common words and common sequences get merged into single tokens. Rare sequences stay as smaller fragments.
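The merge loop described above can be sketched in a few lines of Python. This is a toy version under simplifying assumptions: it works on characters rather than bytes, it stops after a fixed number of merges instead of a target vocabulary size, and the corpus and the function names (most_frequent_pair, merge_pair, train_bpe) are made up for illustration. Production tokenizers differ in many details.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical corpus: "berry" is common, "straw" is rare.
merges, words = train_bpe({"berry": 5, "straw": 1}, num_merges=4)
print(merges)
print(words)
```

With this toy corpus, four merges are enough to collapse the frequent word berry into a single symbol, while the rare word straw is still stored as five separate characters. That is the same pressure that turns common sequences into single tokens in a real vocabulary.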

Figure: How BPE merges individual characters into final tokens, step by step

What Happens to Strawberry

When the word Strawberry enters the GPT-4 tokenizer, it does not come out as ten characters. It comes out as three tokens: Str, aw, and berry.

This is verifiable using OpenAI's own tokenizer tool. The split is deterministic: the same input always produces the same tokens, a product of how frequently those character sequences appeared together in the tokenizer's training data.

The model does not see S-t-r-a-w-b-e-r-r-y. It sees three opaque chunks. Asking it to count letters inside those chunks is like asking someone to count the rooms in a house from a photograph of the front door.

Now think about what each token contains. The token Str has one r. The token aw has no r at all. The token berry has two r's, but those two r's are compressed inside a single unit that the model treats as one thing. When the model tries to reason about the letter count, it in effect identifies two tokens that contain the letter r, Str and berry, and answers two.

The actual count is three. The error is not random. It follows directly from the structure of the tokenization.
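The gap between the two answers is easy to reproduce. The sketch below hardcodes the three-token split reported in this article rather than calling a real tokenizer, and compares counting tokens that contain r with counting every r. It illustrates the arithmetic only; it is not a claim about what happens inside the network.

```python
# Token split as reported in this article for the GPT-4 tokenizer.
# Hardcoded here; no real tokenizer is invoked.
tokens = ["Str", "aw", "berry"]
letter = "r"

# Token-level view: how many chunks contain the letter at all?
tokens_with_letter = sum(1 for token in tokens if letter in token)

# Character-level view: how many times does the letter actually occur?
per_token = {token: token.count(letter) for token in tokens}
occurrences = sum(per_token.values())

print(per_token)           # {'Str': 1, 'aw': 0, 'berry': 2}
print(tokens_with_letter)  # 2
print(occurrences)         # 3
```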

Figure: The r count broken down token by token, showing why the model arrives at 2

Why the Model Cannot Just Look Inside a Token

You might wonder why the model cannot simply examine what letters are inside each token. The answer lies in how the transformer architecture processes information.

The model operates on token embeddings, which are dense numerical vectors. Each token is mapped to an ID, and that ID is converted into a vector of hundreds or thousands of numbers that encode its meaning and context. The model attends to these vectors, computes relationships between them, and generates the next token based on what it has learned. The characters that made up the token are not part of that input.
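A minimal sketch of that lookup step, assuming a made-up three-entry vocabulary, an 8-dimensional embedding, and random values standing in for learned weights (all hypothetical, chosen only to keep the example small):

```python
import random

# Hypothetical vocabulary and IDs, invented for illustration.
vocab = {"Str": 0, "aw": 1, "berry": 2}
EMBED_DIM = 8  # real models use hundreds or thousands of dimensions

# Random numbers stand in for weights a real model would learn in training.
rng = random.Random(0)
embedding_table = [
    [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab
]


def embed(tokens):
    """Map each token to its ID, then look up the vector stored for that ID."""
    return [embedding_table[vocab[token]] for token in tokens]


vectors = embed(["Str", "aw", "berry"])
for vector in vectors:
    print([round(x, 2) for x in vector])
```

Everything downstream of embed receives only these lists of numbers. The string "berry" and its spelling are left behind at the lookup, which is why any knowledge of the letters has to be learned indirectly.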

One AI researcher described this directly: the model knows what the token means, but it does not know the letters that make it up. The token berry is understood in relation to fruit, color, and food. That it contains two r's is not a property the model can reliably retrieve unless it has seen that fact stated explicitly in training data enough times to memorize it.

This Is Not a Reasoning Failure, It Is a Representation Failure

This distinction matters. When people share the strawberry example as evidence that AI is not truly intelligent, they are pointing at the symptom. The actual issue is structural.

Language models are not built to execute algorithms. They are built to predict the most statistically likely next token given everything before it. Counting letters requires something different: a deterministic scan through a sequence, a tally, and a report. That is an algorithmic operation, and without either a tool to perform it or explicit training to simulate it step by step, the model is guessing from patterns.
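For contrast, the deterministic version of the task is tiny. The function below is a sketch of the kind of tool a model could hand the question to; the name count_letter and the case-insensitive matching are assumptions made for this example.

```python
def count_letter(word: str, letter: str) -> int:
    """Scan the word once, tally matches, and report the total."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    target = letter.lower()
    tally = 0
    for character in word.lower():  # the deterministic scan
        if character == target:
            tally += 1              # the tally
    return tally                    # the report


print(count_letter("strawberry", "r"))  # 3
```

The same input always gives the same output, and nothing here depends on how often an answer appeared in any training data.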

An LLM answering a letter-counting question is not performing arithmetic. It is retrieving the most probable answer based on patterns in training data. When the training data happens to contain the wrong association, the model reproduces it confidently.

Research published in late 2024 formalized this, showing that transformer-based models are theoretically constrained in the number of letters they can count by the size of certain parameters related to attention mechanisms and embeddings. The strawberry problem is not a bug that will be patched. It reflects a genuine limitation in how the architecture represents text.

Why Newer Models Often Get It Right

If this is a structural problem, why do some current models answer correctly?

The honest answer is that it depends on the model and how it was trained. Reasoning-oriented models that are trained to think through problems step by step, breaking a question into intermediate steps before answering, can arrive at the correct count by effectively simulating what you do when you count manually. Rather than retrieving an answer in a single forward pass, they decompose the problem.

"LLMs in most cases do not work with letters, only when the token happens to be a letter. The problems of LLMs to count letters have been analysed theoretically, showing that transformer-based LLMs are constrained in the number of letters they can count by the size of certain model parameters."
(Research paper on LLM letter counting, arXiv, December 2024)

This also explains why prompting a model to show its work, or asking it to spell the word out before counting, often produces the correct answer. You are not making the model smarter. You are changing the structure of the task so that the intermediate tokens force the character-level information to become visible before the final count is made.
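To make the idea concrete, the sketch below generates the kind of intermediate text a spell-it-out prompt is meant to elicit: one letter per step with a running tally. It is ordinary Python that imitates the shape of such a trace, not output from any model, and the function name and line format are invented for this example.

```python
def spelled_out_trace(word: str, letter: str) -> str:
    """Build a step-by-step trace: spell the word, then tally matches in order."""
    lines = [f"Spelling: {'-'.join(word)}"]
    count = 0
    for position, character in enumerate(word, start=1):
        if character.lower() == letter.lower():
            count += 1
            lines.append(f"{position}. {character} <- match, running count {count}")
        else:
            lines.append(f"{position}. {character}")
    lines.append(f"Answer: {count}")
    return "\n".join(lines)


trace = spelled_out_trace("strawberry", "r")
print(trace)
```

Once the word has been written out one letter at a time, each letter is typically its own token, so the final count can be read off text the model has already produced instead of being recalled from inside an opaque chunk.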

The Simple Version

The word Strawberry reaches a language model as three tokens: Str, aw, and berry. The model has no mechanism for inspecting the characters inside a token the way you would read individual letters. When it counts r's, it reasons across tokens, finds r in two of them, Str and berry, and answers two.

The error is not about intelligence. It is about the gap between how humans represent words and how language models represent them. You see letters. The model sees chunks.

The next time an AI confidently gives you a wrong answer to a simple question, the most likely explanation is not that it misunderstood you. It is that the problem was invisible to it before it even started thinking.

Sources: OpenAI Tokenizer tool for the Str-aw-berry split, academic research published on arXiv in December 2024 on LLM letter counting constraints, and secwest.net's technical breakdown of the strawberry r-counting problem citing transformer attention limitations.