Who Teaches the Machine: How Grail is Decentralizing the Most Consequential Phase of AI Development

Pre-training gives AI knowledge. Post-training teaches it judgment: what to refuse, how to reason, what to value. This is the phase where alignment happens, and until now, it has been exclusively controlled by a handful of corporations behind closed doors. This week, a research paper from Grail demonstrated that the bandwidth barrier keeping RL post-training centralized was 99% phantom, an artifact of how we were moving data, not a physical constraint. The implications extend far beyond compression ratios.

There are two phases to building an AI that can think.

The first phase, pre-training, is the one most people know. You feed the model the internet, trillions of tokens of text, and it learns patterns, facts, relationships. Pre-training produces knowledge. A pre-trained model knows that water boils at 100 degrees Celsius, that Shakespeare wrote Hamlet, that the derivative of x² is 2x. It knows a staggering amount. But it doesn't know what to do with any of it.

The second phase is where models learn judgment. Reinforcement learning post-training (RLHF, GRPO, PPO, DPO, and the growing bag of techniques the field collectively calls "post-training") is where a model learns to be useful. How to follow instructions. How to reason through problems step by step. What to refuse. How to be helpful without being harmful. The behaviors that distinguish a raw language model from something you'd actually want to talk to.

This second phase is where AI alignment happens. The values a model exhibits, the behaviors it declines, the ways it navigates difficult questions: these emerge from RL fine-tuning. Which means whoever controls post-training controls how AI systems behave. Not what they know, but what they do with what they know.

And until this week, that phase has been exclusively centralized. Locked inside the data centres of a handful of corporations, behind closed doors, shaped by commercial incentives and increasingly by military-industrial partnerships.

This week, researchers at Grail published a paper that changes the calculus. The headline result: decentralized RL post-training that matches the speed of centralized training. The method is called PULSE (Patch Updates via Lossless Sparse Encoding), and it achieves 100x more efficient weight synchronization, enough to make internet-scale post-training practical over commodity bandwidth.

As Sam Dare, who leads Grail's parent organization Covenant AI, put it on this week's TGIF Space: "It's always been 'decentralized training is too slow. It'll never work.' Right? Now we have instantiated the fastest decentralized training instance in the world with a fraction of the resources."

The bandwidth barrier that kept post-training centralized turned out to be 99% phantom, an artifact of how we were moving data, not a physical constraint. But the implications extend far beyond compression ratios. If post-training can be decentralized, the question of who teaches AI its values becomes genuinely open for the first time.

Why Post-Training Is Different

Sam admitted something on the TGIF call that's worth sitting with: "Very naively, I didn't think that there was anything more, 'cause I just thought post-training, it's just like SFT, you just do a couple of tricks. But that is changing a lot now. RL and post-training and all those bag of tricks are starting to dominate a large part of the training process."

This tracks with the broader industry trajectory. Two years ago, pre-training was the main event. The assumption was that scale (more data, more compute, bigger models) would solve everything. Post-training was an afterthought: some supervised fine-tuning, a little RLHF to sand off rough edges.

That assumption has collapsed. The reasoning capabilities that define frontier models (chain-of-thought, tool use, mathematical proof, code generation) emerge primarily from post-training. OpenAI's o1 and o3 models, Anthropic's Claude, Google's Gemini: their most impressive capabilities come not from pre-training on more data, but from sophisticated RL post-training that teaches models how to think.

Which means post-training is no longer the afterthought. It's the main event. And it's the phase where the deepest questions about AI behavior get resolved. Not in published papers or corporate mission statements, but in the specific reward signals and training procedures that shape what a model does when you ask it something difficult.

Covenant's earlier work with Templar had solved many of the problems of decentralized pre-training, the knowledge phase. But as Sam described it, "Grail provides the other missing piece towards the whole pipeline, which is how do we decentralize the process?" The post-training process. The values layer.

The problem: RL post-training is architecturally different from pre-training in a way that makes decentralization significantly harder.

The Inference-Training Loop

Pre-training is conceptually simple. You have data. You have a model. You feed the data through the model, compute how wrong it was, update the weights, repeat. Everything flows in one direction. Erfan Miahi, the lead engineer behind PULSE, explained it plainly on the TGIF call: "You have a bunch of texts and you train on those. You have a trainer and you train on those."

RL is different. It has a loop.

"When you do RL," Erfan continued, "you have two different functions: inference and trainer. Inference generates the rollouts. When you talk to ChatGPT, you ask the question, it answers something to you. Those answers are the rollouts being generated."

The loop works like this: an inference engine generates rollouts (model outputs), the trainer uses those rollouts to improve the model, the improved model's weights get sent back to the inference engine, and the inference engine generates new rollouts with the updated model. Around and around. Each cycle makes the model slightly better at producing useful outputs.
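
In pseudocode, the cycle reduces to something like this, a minimal sketch with hypothetical callables standing in for the inference engine, the trainer, and the synchronization step:

```python
def post_training_loop(generate_rollouts, train_step, sync_weights, num_steps):
    """Skeleton of the RL post-training cycle: generate -> train -> sync."""
    for _ in range(num_steps):
        rollouts = generate_rollouts()      # inference engine produces model outputs
        new_weights = train_step(rollouts)  # trainer uses the rollouts to improve the model
        sync_weights(new_weights)           # updated weights flow back to the inference engine
```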

In a data centre, this loop runs fast. The inference engine and the trainer sit in the same facility, connected by high-bandwidth interconnects. Weight synchronization takes seconds.

But in a decentralized setting, where inference nodes (miners) are scattered across the public internet and the trainer sits elsewhere, that weight synchronization becomes the bottleneck. Every time the trainer improves the model, it needs to ship the entire updated model to every inference node. For a 7-billion-parameter model in BF16 precision, that's 14 gigabytes per sync.

"Inference is 80% or more of the cost of doing reinforcement learning," Erfan noted. This is why Grail's architecture gives inference responsibility to miners. It's the expensive part, and decentralized networks have abundant compute. But it means model weights need to flow constantly from trainer to miners over public internet.

And public internet is slow.

The Fourteen-Minute Wall

"There was a paper by Prime Intellect," Erfan recounted, "that they post-trained a model and it took 14 minutes to send the new weights to the inference people. That's a bottleneck."

Fourteen minutes. For a single weight synchronization. In a training loop that needs to synchronize constantly.

"If you spend 14 minutes sending the weights, your whole training paradigm is gonna be 10 to 20 times slower than training in a centralized way. That bottleneck can screw up a lot of things."

This is why RL post-training has remained centralized. Not because the algorithms require it. Not because the training logic demands co-location. Because the weight synchronization, the simple act of shipping updated model parameters from trainer to inference nodes, takes too long over commodity internet.

This should sound familiar. In the SparseLoCo article, the same pattern appeared with pre-training: gradient synchronization was too slow over internet bandwidth, so training required co-location. Templar solved that with aggressive gradient compression and local optimization, reducing communication by 40x.

But post-training has a different bottleneck. It's not gradients flowing from nodes to a coordinator. It's full model weights flowing from trainer to inference nodes. Different direction, different data, different problem.

A problem that, as Erfan confessed, he initially thought would take a long time to solve: "I thought it would take us a long time to solve this because it's not an easy issue. Your model is really big and has so many parameters. It's gigabytes of data and you have to send it over the internet."

Then he started reading papers.

Looking at the Right Granularity

"I was reading a couple of papers two months ago, three months ago, and I realized this common theme has started showing up in different papers. There is a high sparsity in the model's weights."

Weight update sparsity means that when you train a model, not all parameters change. Some papers reported 30% of weights updating. Others 20%. Others 10%. The numbers varied with models and hyperparameters, but the pattern was consistent: a meaningful fraction of the model stays static during training.

Erfan saw the obvious implication: "If 1% of the model weights are getting updated, why are we sending the whole model? This seemed like such an intuitive and simple idea to try."

But there was a problem with the existing literature.

"All of the papers I looked at compare the model before training and 10 hours after training has finished, or two days after training has finished. They look at these two different versions and say only 20% of the weights got updated, or 30% of the weights got updated."

This is aggregate measurement. Before training versus after training. It tells you how much changed over the full course of a run. But it doesn't tell you how much changes at each step, and each step is what matters for synchronization.

"I want to see how the weights are getting updated at each step of training. Each step takes a minute or two. So I was wondering how it happens at that granularity, because that's what we care about when we are actually training a model."

This is the moment where the discovery happened. Not a flash of genius. A researcher asking the right question at the right scale.

"And if it is really low, then we can exploit it. But if it is only like 30% of the weights getting updated, we can't get that much performance boost."

So Erfan ran the experiments. Across Gemma models (Google), LLaMA models (Meta), and Qwen models (Alibaba). Multiple sizes, from 0.5B to 7B parameters. Not before-and-after. At each individual training step.

"And I saw that the same patterns keep happening. The part of the model that is getting updated after each update step is only 1% of the model. So it's not 30%, it's 1% after each update. That was crucial."

Not 30%. Not 20%. One percent.

The gap between aggregate measurement and step-level measurement was enormous, and it was the step-level measurement that mattered for synchronization. Everyone else had been looking at the wrong granularity. The sparsity was there all along, dramatically higher than anyone had reported, hiding in plain sight because nobody had thought to measure it where it counted.
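
Measuring at the right granularity is itself simple. A minimal sketch, assuming each checkpoint is available as a flat array of raw BF16 bits (not the paper's actual tooling):

```python
import numpy as np

def step_sparsity(prev_bits: np.ndarray, curr_bits: np.ndarray) -> float:
    """Fraction of parameters whose raw BF16 bits are unchanged between
    two checkpoints taken one optimizer step apart."""
    return float(np.mean(prev_bits == curr_bits))
```

Run between the start and end of a whole training run, this gives the 70 to 90% "unchanged" figures implied by the earlier papers; run between consecutive steps, it gives roughly 99%.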

"I ran these experiments, I showed it in the paper, scientifically that this is true. I ran it to make sure that I get results that I can be confident on. I ran multiple times. I averaged through, a lot of the scientific rigor that goes through these things."

Then he went further. He investigated why.

The Ghost in the Precision

The mechanism behind 99% weight update sparsity is elegant, and understanding it matters because it's what separates a useful observation from an exploitable insight.

Modern AI training uses BF16 (bfloat16) precision. BF16 uses 7 bits for the mantissa, which means it can only represent relative changes larger than roughly 0.4% of a weight's current magnitude. Any change smaller than that threshold rounds to zero. The weight stays exactly the same, bit-for-bit.

Now consider what the Adam optimizer does during RL fine-tuning. Adam normalizes gradient updates by dividing by a running estimate of their standard deviation. This bounds the effective update for each parameter to roughly 1 to 10 times the learning rate, regardless of the gradient magnitude. At standard RL learning rates (around 10⁻⁶), updates land in the range of 10⁻⁶ to 10⁻⁵.

For a weight with magnitude 0.1, BF16 requires a change of at least 0.0004 to register. But the Adam-bounded update is somewhere between 0.000001 and 0.00001, forty to four hundred times too small.

The update is computed. It's added to the weight. And it rounds back to zero. The learning algorithm intended the change (gradients are nearly fully dense, with about 99% of parameters receiving non-zero gradient signals). But the precision format absorbs the intention before it can take effect.
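
The rounding is easy to reproduce. A toy illustration in PyTorch, with a weight of magnitude 0.1 and an Adam-bounded update of about 10⁻⁵ (illustrative values, not taken from the paper's code):

```python
import torch

w = torch.tensor([0.1], dtype=torch.bfloat16)         # weight of magnitude ~0.1
update = torch.tensor([1e-5], dtype=torch.bfloat16)   # Adam-bounded step at lr ~ 1e-6

w_new = w + update
print(torch.equal(w_new, w))  # True: the update is absorbed by BF16 rounding
print(torch.equal(w_new.view(torch.int16), w.view(torch.int16)))  # True: bitwise identical
```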

We were synchronizing phantom traffic. Fourteen gigabytes that encoded almost no actual change. Bytes that were bitwise identical before and after the update, shipped across the internet at enormous cost for no reason, because nobody had measured what was actually changing.

The paper proves this sparsity isn't a fluke:

  • Consistent across model families: 99% sparsity in Qwen, LLaMA, and Gemma models from 0.5B to 7B parameters
  • Stable throughout training: Standard deviation across 400 training steps is only 0.2 to 0.4%
  • Robust to async conditions: Even with 32 steps of policy staleness (common in decentralized settings), sparsity only drops about 3 percentage points
  • Mechanistically understood: A precision ablation shows 99% sparsity in BF16, 58% in FP16, and only 2% in FP32, confirming the phenomenon is a precision artifact, not a training artifact

That last point matters. Understanding why something happens is what lets you trust it and build on it. If the sparsity were a mysterious empirical observation, you'd worry about edge cases. But the mechanism is clear: it's a direct consequence of BF16 precision and Adam's update bounds. Change the precision format and the sparsity changes predictably. That's science, not luck.

The Lossless Principle

PULSE exploits this sparsity with what amounts to a simple idea executed with rigorous care: transmit only the 1% of parameters that actually changed.

Compare consecutive checkpoints bitwise. Extract the changed indices and their new values. Compress with zstd. Send only that. For a 7B model, what required transferring 14 gigabytes now requires approximately 108 megabytes. A hundred times less.
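
A minimal sketch of that encoding side, assuming each tensor is diffed separately (so 32-bit indices suffice) and that checkpoints are exposed as flat arrays of raw BF16 bits; the production implementation lives in the Grail repository linked below:

```python
import numpy as np
import zstandard as zstd  # zstd bindings, assumed installed via `pip install zstandard`

def extract_patch(prev_bits: np.ndarray, curr_bits: np.ndarray) -> bytes:
    """Bitwise diff of one tensor between consecutive checkpoints, compressed."""
    changed = np.flatnonzero(prev_bits != curr_bits)   # ~1% of entries in RL post-training
    payload = (changed.astype(np.uint32).tobytes()     # indices of changed parameters
               + curr_bits[changed].tobytes())         # their new BF16 bits, verbatim
    return zstd.ZstdCompressor().compress(payload)
```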

"A hundred times is a lot," Erfan said, with the understated disbelief of someone who surprised himself. "It's not a small number. It's crazy even to me."

But the design choice that elevates PULSE above a clever optimization is its commitment to losslessness. When asked to compare PULSE with SparseLoCo (Templar's gradient compression technique), Erfan drew a sharp distinction:

"What they did was lossy. What we did was lossless. We're gonna send them the exact weights. There's no approximation. It's a hundred percent the same weights. It's the same everything. And the network and training is gonna work the same."

This distinction matters beyond the technical. Lossy compression accepts that some information will be lost: small errors that accumulate over time, drift that compounds across training steps. In pre-training, where you're optimizing a smooth loss landscape over trillions of tokens, this is an acceptable trade-off. SparseLoCo is brilliant precisely because it identifies what can be safely lost.

But post-training is where models learn their values. The difference between a model that helpfully explains a concept and one that refuses a reasonable request can come down to subtle weight differences shaped by specific RL training steps. In this context, "close enough" is a different kind of gamble.

PULSE stores actual values, not deltas. No arithmetic during reconstruction, just a memory copy. Every transfer verified by SHA-256 checksums. Drift isn't improbable; it's mathematically impossible. The model at every inference node is provably identical to the model at the trainer.
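
The receiving side, under the same assumptions, is nothing more than decompress, copy, and verify:

```python
import hashlib
import numpy as np
import zstandard as zstd

def apply_patch(weights_bits: np.ndarray, patch: bytes, expected_sha256: str) -> None:
    """Overwrite the changed entries in place and verify the result bit-for-bit."""
    raw = zstd.ZstdDecompressor().decompress(patch)
    n = len(raw) // 6                                   # uint32 index + uint16 value per entry
    indices = np.frombuffer(raw[: n * 4], dtype=np.uint32)
    values = np.frombuffer(raw[n * 4:], dtype=np.uint16)

    weights_bits[indices] = values                      # a memory copy, no arithmetic

    if hashlib.sha256(weights_bits.tobytes()).hexdigest() != expected_sha256:
        raise ValueError("reconstructed checkpoint does not match the trainer's checksum")
```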

When you're decentralizing the phase where AI learns judgment, exactness isn't a luxury. It's a requirement.

What It Means to Decentralize the Values Layer

Here's where the philosophical stakes exceed the technical achievement.

Today, a handful of companies (OpenAI, Anthropic, Google, Meta) control RL post-training for frontier AI models. Their internal teams decide the reward signals, the training procedures, the behavioral boundaries. Their choices determine whether a model will help you plan a birthday party but refuse to discuss certain political topics. Whether it reasons transparently or hides its chain of thought. Whether it treats users as adults capable of handling nuanced information or as potential adversaries who need to be managed.

These choices aren't neutral engineering decisions. They embed values. And they're being made behind closed doors, influenced by commercial incentives, regulatory pressure, and increasingly by the military and intelligence partnerships I documented in AI and the War Machine.

The implicit assumption has been that this centralization is necessary. That RL post-training requires the kind of tight coordination and low-latency communication that only a data centre provides. That alignment work, the process of shaping AI behavior, inherently requires centralized control.

PULSE challenges that assumption at the infrastructure level. If you can synchronize model weights 100x more efficiently, decentralized RL post-training becomes viable. Miners scattered across the public internet can generate rollouts, a training process can consume those rollouts to improve the model, and PULSE keeps the weight synchronization fast enough that GPU utilization matches centralized training.

Grail's architecture already works this way. Inference nodes (miners) generate rollouts, the trainer updates the model using techniques like GRPO, and PULSE ships the updated weights back to miners in seconds instead of minutes. Validators verify rollout authenticity through cryptographic mechanisms. The whole thing runs permissionlessly on Bittensor.

This doesn't mean decentralized post-training will immediately produce models that rival GPT-5. It doesn't mean alignment is "solved." It means the structural assumption that post-training must be centralized has lost its technical foundation. The barrier was never physics. It was 14 gigabytes of phantom traffic.

And if post-training doesn't require centralization, then the question of who shapes AI behavior stops being answered by default (whoever owns the data centre) and becomes a genuine question with multiple possible answers. Including: everyone.

Scarcity and the Source of Insight

There's a pattern in how PULSE came to exist that's worth naming, because it illuminates something about the Bittensor model that's easy to miss.

Erfan didn't solve this problem with more resources. He solved it by understanding the problem more deeply than anyone with more resources had bothered to. The insight (measuring sparsity at step-level granularity instead of aggregate) wasn't computationally expensive. It was conceptually precise. Anyone at Google or Meta could have done it. They didn't, because when you have data centre bandwidth, you don't need to.

Sam frames this as the core thesis: "Scarcity always breeds innovation... We can channel incentives to fund innovation. In resource constrained environments, what would you do? You don't have the option to say, 'Oh, we use bigger computers, you co-locate them.' You dig deep into literature, do research and come up with stuff like SparseLoCo, which is what we did for Templar, and now with PULSE."

This is not a feel-good narrative about scrappy underdogs. It's a structural claim about where insight comes from. When you can throw bandwidth at a problem, you don't have to understand it. When you can't, you do. And that understanding sometimes reveals that the problem was smaller than everyone assumed, that 99% of the difficulty was phantom.

Consider what Erfan built. Sam called Grail "a one-stop subnet. I've never had to touch it. I think I vibe-coded a V1 and Erfan threw everything away." One engineer, four or five months, working under the constraints of a decentralized network with limited resources, produced a research result that changes what's possible in distributed RL training. "It feels like three years with the amount of work you've done," Sam told him.

This is the Bittensor thesis in action. Not that decentralization is morally superior to centralization, but that the incentive structure (channeling resources to people solving hard problems in constrained environments) produces insights that well-resourced incumbents miss. Not despite the constraints, but because of them.

The Pattern Across Covenant

This is now the third time across Covenant's work that a constraint which seemed physical has turned out to be an artifact of incomplete understanding.

SparseLoCo (formerly called CCLoCo) discovered that gradient compression and local optimization could combine to reduce communication overhead by 40x, making decentralized pre-training viable. The barrier wasn't bandwidth. It was how much of the gradient actually needed to be communicated.

Heterogeneous SparseLoCo discovered that activation compression could enable consumer GPUs to participate alongside data centre hardware in frontier training. The barrier wasn't VRAM. It was how much of the activation data actually needed to flow between pipeline stages.

Now PULSE discovers that 99% of synchronized model weights in RL post-training are unchanged and don't need to be sent at all. The barrier wasn't internet bandwidth. It was the assumption that all 14 gigabytes mattered.

Each time, the pattern is the same: what looked like a hard physical constraint turned out to be a soft informational one. The solution wasn't more bandwidth, more VRAM, or more compute. It was understanding what was actually happening: measuring at the right granularity, asking the right question, and recognizing that the data being moved was mostly noise.

The PULSE paper states it directly: "We didn't just observe sparsity; we proved it's consistent, understood why it happens, and verified it works in production."

Understanding why something happens is what enables exploiting it. Observation alone doesn't give you PULSE. Mechanistic understanding does. And that understanding, the kind that comes from sitting with a problem until its structure reveals itself, is exactly what constrained environments produce.

What's Difficult, and What's Next

It would be dishonest to write this piece without acknowledging what's hard.

Sam closed the TGIF call with a candor that cuts through the triumphalism: "It's been a hard week currency-wise, but a lot of people on the team have taken sacrifices to ensure that we're anti-fragile because crypto is volatile. We run very lean and I'm very grateful to my team for the dedication, helping us stay lean so that we don't compromise on our values."

The team building the fastest decentralized training instance in the world is doing it under genuine financial pressure. Crypto markets are brutal. Token-funded research operates without the safety net of venture capital or corporate backing. The people writing these papers are making material sacrifices to keep going. This is not the sanitized narrative of a well-funded lab announcing results from a position of comfort. It's researchers delivering under constraint because they believe the work matters.

There's also a competitive vulnerability Sam raised: "The problem is if we're training our model, they can always snipe it and just put it on theirs. So we need to find a solution around that." Open research means open access, including for competitors who can take the insights without bearing the cost of discovery. This is the structural tension of doing research in the open, and it doesn't have a clean resolution.

And there are genuine technical boundaries. PULSE's sparsity results are validated for RL post-training specifically, at current model scales, with the BF16 precision format that dominates modern training. The paper is rigorous about these conditions. Whether the same patterns hold at 70B or 405B parameters, or with future precision formats, remains to be established.

What's emerging on the roadmap: Erfan revealed that Grail's first post-training target is a model for generating optimized GPU kernels, a practical, high-demand application where decentralized training can prove its worth against centralized alternatives. Sam discussed plans to mine other subnets with Grail-trained models, beginning to close the loop between post-training capability and economic sustainability.

The Full Pipeline

Step back far enough and what's taking shape is something that didn't exist a year ago: a complete decentralized pipeline for building AI.

Templar handles pre-training, teaching models knowledge through large-scale language modeling over the internet. Basilica provides decentralized compute: GPUs you can rent, jobs you can run, infrastructure that doesn't require a data centre contract. And now Grail handles post-training, teaching models judgment through reinforcement learning, with PULSE making the weight synchronization fast enough to match centralized speeds.

Pre-training, compute, post-training. Knowledge, infrastructure, values. The full stack, running permissionlessly on Bittensor, with no single entity controlling who can participate, what gets built, or who benefits.

But Covenant's subnets are only part of the picture. The broader Bittensor ecosystem contains dozens of teams building complementary capabilities: data, inference, storage, evaluation, and more. We're already collaborating with some of these teams and actively seeking partnerships with others. Closing the gap with centralized AI labs isn't something any single project can accomplish alone. It requires the kind of coordination that decentralized networks are designed to enable.

This doesn't mean the stack is complete or production-ready for frontier models. Each piece has constraints, rough edges, and open research questions. But the architectural claim is now credible in a way it wasn't before PULSE. Each major bottleneck (gradient synchronization for pre-training, activation transmission for heterogeneous training, weight synchronization for RL post-training) has a demonstrated solution. The barriers that seemed physical have, one by one, turned out to be informational.

The question of who teaches the machine (who shapes AI behavior, who determines what values get embedded, who decides what models refuse and what they help with) has been answered by default for the entire history of the field. Whoever owns the data centre. Whoever pays for the GPUs. Whoever controls the training pipeline.

PULSE doesn't change that overnight. But it removes the technical assumption that made centralized control feel inevitable. And removing that assumption is how alternative futures become possible.

What Grail builds on this foundation is the next chapter. But the ground has shifted.


Research Paper: Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL by Erfan Miahi, Eugene Belilovsky

Code: github.com/one-covenant/grail

Disclosure: I work with Covenant AI and am directly involved in the organization's communications. For full transparency about my involvement and investments, see my projects page. All opinions expressed are entirely my own.

For updates on Grail, follow @grail_ai. For Covenant AI ecosystem updates, visit covenant.ai.