Strawbery Fields: Why Does Covenant-72B Look Broken?
Someone on X pointed out that Covenant-72b can't count the R's in strawberry. They're right. But so were the people who laughed at GPT-4 for the same mistake two years before it started passing the bar exam. The interesting question was never whether the model fails. It's why, what that reveals about intelligence, and what happens next.

A few days ago, a couple of posts made the rounds on X poking fun at Covenant-72b. The test was simple: ask the model to count the R's in "strawberry." The model got it wrong. Screenshots were shared, laughs were had.
They're not wrong. Covenant-72b cannot reliably count letters in a word. Ask it to reverse a string character by character and it will stumble. These are tasks that any second-grader handles without thinking, and our model fails at them.
The interesting question is why. Two years ago, the most advanced AI systems on Earth, models built by companies with billions of dollars in compute, failed at these same tasks. OpenAI considered the problem so emblematic that when they finally built a model capable of solving it, they gave it the internal codename "Strawberry" — a project that evolved from the mysterious Q* and eventually shipped as o1.
The story of why language models struggle with something this basic turns out to be one of the most revealing windows into how artificial intelligence actually works, how it differs from human cognition, and why the distance between "barely functional" and "genuinely capable" closes faster than anyone expects.
Why the Smartest Thing in the Room Can't Count to Three
To understand why a model that can discuss quantum field theory cannot count characters in a word, you need to understand three things about how language models process text.
The first is that the model never sees individual letters. When you type "strawberry" into a language model, what arrives is not ten characters. It is tokens, chunks of text that its tokenizer has learned to treat as single units. GPT-style models using byte-pair encoding split "strawberry" into something like st, raw, berry. The model processes three units of meaning where a human eye sees ten letters. The three R's are buried inside those tokens, distributed across chunk boundaries, invisible to the model's attention mechanism. Imagine trying to count the threads in a rope without untwisting it. The model sees the rope. It knows what rope is, what it is made from, how it behaves under tension. But the individual threads are fused into the structure, and the tokenizer is what did the fusing.
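The effect is easy to see with a toy greedy tokenizer. The vocabulary below is hand-picked for illustration, not a real model's learned merges, and actual splits vary by tokenizer; the point is only how letters fuse into chunks.

```python
# Toy greedy tokenizer: a hand-picked vocabulary stands in for learned BPE merges.
# Real vocabularies differ; single letters are included so every input can be split.
VOCAB = {"st", "raw", "berry", "s", "t", "r", "a", "w", "b", "e", "y"}

def toy_tokenize(word: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest chunk first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

tokens = toy_tokenize("strawberry")
print(tokens)                              # ['st', 'raw', 'berry']
print(len(tokens))                         # 3 units of meaning, not 10 letters
print(sum(t.count("r") for t in tokens))   # the 3 R's are split across chunk boundaries
```

The model receives three opaque units; no single unit "contains" the answer to the counting question.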
Research has confirmed that the difficulty goes deeper than token boundaries. Even when repeated letters happen to fall across different tokens, the model struggles. The core problem is repeated characters. The model's internal representation compresses away the very granularity that counting requires.
The second problem is architectural. Transformer models compute at fixed depth. Each answer is produced in a single forward pass through the network, dozens of layers of computation that execute in parallel. This makes transformers extraordinarily powerful for pattern recognition and reasoning. It also makes them structurally inadequate for tasks that require sequential processing.
Counting letters is inherently sequential. A human counting R's in "strawberry" walks through the word one character at a time, maintaining a running tally. The depth of that computation scales with the length of the input. A language model cannot do this. It produces its answer in what amounts to a single computational breath. The difference is between counting the red cars in a parking lot by walking row by row versus glancing at the lot for one second and guessing. The model gets the glance. Researchers at MIT proved that without chain-of-thought reasoning, constant-precision transformers can only solve problems within a complexity class that cannot even compute whether a number is odd or even. Chain-of-thought prompting offers a workaround. When a model "thinks out loud," spelling out each letter and checking as it goes, it converts a single-pass problem into a multi-step one. The workaround succeeds, but it reveals the constraint: the model must simulate sequential processing because its architecture does not provide it natively.
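The chain-of-thought workaround amounts to simulating this loop in text. A minimal sketch of the sequential procedure the architecture lacks natively:

```python
def count_with_trace(word: str, target: str) -> tuple[int, list[str]]:
    """Count a letter the way a human does: one character at a time,
    carrying a running tally. The number of steps grows with the word,
    unlike a transformer's fixed-depth forward pass."""
    tally, trace = 0, []
    for i, ch in enumerate(word, start=1):
        if ch.lower() == target.lower():
            tally += 1
        trace.append(f"step {i}: '{ch}' -> tally {tally}")
    return tally, trace

tally, trace = count_with_trace("strawberry", "r")
print(tally)       # 3
print(len(trace))  # 10 steps for 10 letters: depth scales with input length
```

A chain-of-thought trace is, in effect, the model writing out `trace` one line at a time, spending one generation step per character instead of answering in a single pass.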
The third problem is subtler. Language models do not execute algorithms. They do not run "strawberry".count("r") internally. They predict the next most probable token based on statistical patterns learned from training data. When asked about the letter composition of "strawberry," the model is pattern-matching against similar exchanges it encountered during training, and an uncomfortable number of those exchanges contain the wrong answer. Earlier models got it wrong. Humans discussing those failures on forums and social media repeated the error. That misinformation became part of the training corpus for the next generation of models. If you learned everything you knew about strawberries from reading the internet, and a substantial fraction of online discussions confidently stated the wrong count, you would probably echo the consensus. The model is doing exactly that. It is faithfully reproducing an error that is endemic in its training data.
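That dynamic can be caricatured in a few lines. The snippets below are invented stand-ins for training data, not real corpus extracts: a predictor that simply echoes the most frequent answer it has seen will faithfully reproduce a popular error.

```python
from collections import Counter

# Invented stand-ins for training-data snippets discussing the R count.
# A majority of them confidently state the wrong answer.
observed_claims = ["two", "two", "three", "two", "three", "two"]

# Next-token prediction, caricatured: emit the most common continuation seen.
prediction = Counter(observed_claims).most_common(1)[0][0]
print(prediction)  # 'two' -- the consensus of the corpus, not the truth
```

A real model does something far richer than a frequency table, but the failure mode is the same: when the corpus is confidently wrong, the most probable continuation is wrong too.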
There is a name for this broader pattern. In 1988, the roboticist Hans Moravec observed that machines find "hard" problems easy and "easy" problems hard. A computer can beat a grandmaster at chess but cannot fold a towel. A language model can write legal briefs that pass the bar exam but trips over a task you could give to a six-year-old. The paradox inverts our intuitions about what intelligence requires. The tasks that feel trivial to humans often demand millions of years of evolutionary optimization that no neural network has replicated.
You can solve differential equations but you cannot explain how you catch a ball. You can count letters but you cannot articulate the grammatical rules you apply instinctively when constructing a sentence. Intelligence, whether biological or artificial, is profoundly uneven. Every mind has capabilities that seem miraculous alongside gaps that seem absurd.
Which raises the question: if these are such different kinds of minds, how do they learn?
The Alien Student
A child learns the word "strawberry" through a collision of sensory experience. The taste, sweet and slightly tart. The dimpled red skin under small fingers. The smell of a punnet on a summer afternoon. A parent's voice saying the word while pointing. By the time a child can spell "strawberry," the word is anchored to a web of embodied memory that no amount of text could replicate.
A language model learns "strawberry" by processing statistical relationships across millions of sentences. It encounters the word in recipes, in agricultural research papers, in children's stories, in nutritional databases, in poetry. It builds an extraordinarily rich representation of how the word relates to every other word in its vocabulary. It knows more about strawberries than any human who has ever lived: every cultivar, every chemical compound, every cultural association in every language it was trained on. It has never tasted one. It has never held one. It has never watched one rot on a kitchen counter and felt a small pang of waste.
Both human and machine learn from patterns. Children are sensitive to the statistical regularities of language in ways that researchers are still mapping. They pick up word boundaries, grammatical structures, and phonetic rules long before anyone teaches them explicitly. In this narrow sense, a child and a language model are doing something structurally similar: extracting regularities from enormous quantities of data.
The difference is in how feedback lands. A child integrates correction through emotion, through memory, through the social weight of getting something right in front of a parent. When a parent corrects a mispronunciation, the correction arrives alongside tone of voice, facial expression, the warmth or sharpness of the moment. Reinforcement learning from human feedback, the technique used to fine-tune language models after their initial training, mirrors this loop in structure. The model produces output, a human rates it, the model adjusts. Same feedback architecture. Alien substrate. There is no embarrassment at getting something wrong, only a shift in probability weights. The outcome can look similar. The experience could not be more different.
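That structural loop can be sketched in a few lines. The candidate answers, reward rule, and learning rate here are all invented for illustration; real RLHF optimizes a full policy against a learned reward model, but the shape is the same: feedback shifts probability mass, nothing more.

```python
# Caricature of the RLHF loop: rate each answer, shift its probability weight.
# Answers, rewards, and learning rate are invented for illustration.
weights = {"two": 0.6, "three": 0.4}   # model initially favors the wrong answer
lr = 0.05

for _ in range(50):                    # fifty rounds of rated feedback
    for answer in list(weights):
        reward = 1.0 if answer == "three" else -1.0   # human rater's verdict
        weights[answer] *= 1 + lr * reward            # no embarrassment, just a nudge
    total = sum(weights.values())
    weights = {a: w / total for a, w in weights.items()}  # renormalize to probabilities

print(max(weights, key=weights.get))   # 'three' eventually dominates
```

Nothing in the loop resembles a parent's tone of voice; the correction is arithmetic all the way down.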
Intelligence turns out to be stranger and more varied than our intuitions prepare us for. Counting letters is easy for humans because we have eyes, spatial processing, and a visual system that evolved over hundreds of millions of years to track individual objects in a scene. It is hard for language models because they were built to process meaning, and meaning operates at a higher level of abstraction than individual characters. A different kind of mind, with a different growth curve.
Covenant-72b was trained by a decentralized network of participants contributing compute across the open internet. But the team curated the data that fed it: approximately 1.1 trillion tokens carefully assembled across a main phase of web text and an annealing phase blending synthetic data with code, mathematics, and scientific literature. Decentralized muscle, directed curriculum. Like any student with an unconventional education, the model has gaps. The question is how fast those gaps close.
Twelve Seconds at Kitty Hawk
In December 1903, Orville Wright flew a powered aircraft for twelve seconds and covered 120 feet. The major newspapers barely covered it.

A machine that could stay airborne for less time than it takes to pour a cup of coffee did not, by any reasonable standard, look like a revolution. Sixty-six years later, human beings walked on the surface of the Moon.

The hard part was never building a faster plane. It was proving that heavier-than-air flight was possible at all.
The pattern that followed the Wright Flyer is the same pattern that followed GPT's struggle with strawberry: the distance between embarrassing and extraordinary, closed in a fraction of the time anyone expected.
Covenant-72b was trained across the open internet, by a permissionless network of participants coordinating through economic incentive on the Bittensor blockchain. No central authority decided who could contribute. Seventy unique peers participated over the course of the training run, joining and leaving freely, their contributions scored and aggregated by the Gauntlet incentive mechanism. The model achieved a 94.5% compute utilization rate, with only 70 seconds of communication idle time per training round, compared to 8.3 minutes for INTELLECT-1's DiLoCo-style approach.
The results speak plainly. On standard zero-shot benchmarks, Covenant-72b is broadly competitive with centralized baselines trained at similar scale, including LLM360 K2 and LLaMA-2-70B, models produced in conventional datacenter environments. It outperforms every other decentralized training effort. After supervised fine-tuning, Covenant-72B-Chat achieves the highest IFEval and MATH scores among all compared models in its class.
Ask it how many R's appear in a common English word and it will still get it wrong. Nobody on the team pretends otherwise. What the model proves is that decentralized training works at the 72-billion-parameter scale, with permissionless participation, over commodity internet connections. That is the zero-to-one. That is twelve seconds at Kitty Hawk.
The path from that point forward has already been measured. Epoch AI, an independent research organization that tracks compute trends across the AI industry, published a quantitative analysis of decentralized training scaling in December 2025. They named Covenant AI's Templar network specifically as the largest active decentralized training effort.
"Since 2020, the computational scale of decentralized training projects has grown 600,000 times, at an implied rate of roughly 20x per year. Centralized frontier training, by comparison, has been growing at approximately 5x per year."
— Epoch AI, "How Far Can Decentralized Training Over the Internet Scale?" (December 2025)
We do not ask you to take our word for the growth rate. Epoch AI measured it.
The distance from "barely works" to "works well" is never measured in decades. It is measured in iterations. We wrote recently about why the Covenant-72b moment matters as a barrier-breaking event. This post is about why the model does not work perfectly yet, and why those are two halves of the same story.
Two days ago, Sam Altman stood in front of an audience of infrastructure investors at BlackRock's 2026 Infrastructure Summit and laid out his vision for the future of AI:
"We see a future where intelligence is a utility, like electricity or water, and people buy it from us on a meter."
— Sam Altman, BlackRock Infrastructure Summit (March 11, 2026)
Buy it from us. On a meter. Intelligence as a commodity, dispensed by the company that built the pipe, to the customers who can afford the bill. He said this to a room full of people who build and finance infrastructure, two days ago.
We shipped a model that cannot count letters. We shipped it, because the point of Covenant-72b was never to compete with frontier labs on day one. The point was to prove that intelligence does not have to be built behind closed doors, by a handful of companies, metered and sold back to the rest of us. Covenant-72b is rough. It is early. And it belongs to the Covenant community and the broader Bittensor ecosystem: to our researchers and engineers, to the miners running gradient computations on GPUs across the world, to the gamma token holders who believed in the Templar before there was anything to show for it.
The next version will be better. The one after that will be better still. The curve that every breakthrough in this field has followed is the same curve we are on now, measured independently, growing four times faster than the incumbents.
One day, someone will ask a Covenant model to count letters in a word, and the model will answer correctly, and nobody will think twice about it. That is the goal. Not to impress anyone with what the model can do today, but to build toward a future where decentralized intelligence is just intelligence, and the strawberry question is a footnote in a history that moved on to harder problems.
We are not there yet. But the strawbery field is planted.
Related Reading:
- The 900: Why Covenant-72B Will Soon Be Ordinary — the companion piece
- The Internet is the Datacenter — technical foundation
- Covenant-72B research paper — the full technical report
Disclosure: I work with Covenant AI and am directly involved in the organization's communications. For full transparency about my involvement and investments, see my projects page. All opinions expressed are entirely my own.