An exploration of whether the AI industry's obsession with clean data might be optimizing for the wrong thing.
The AI industry might have a cleanliness problem. Not the kind you'd expect—not messy codebases or disorganized research. The problem is that we might be too clean.
Since GPT-4 demonstrated what scaled pretraining could achieve on partially cleaned internet data, the entire field pivoted hard toward data curation. Filter aggressively. Deduplicate ruthlessly. Score for quality. Synthesize where natural data falls short. The mantra became "better data beats better algorithms," and it stuck. FineWeb outperformed RedPajama with fewer but cleaner tokens. Phi showed that small models on curated data could punch above their weight. The evidence seemed clear: clean data wins.
But what if that conclusion, while locally correct, is globally misleading? What if we're measuring the right thing at the wrong scale?
Before going further, I want to be precise about what I mean by "noise" here. I'm not talking about data structure—the formatting, parsing, and structural preparation that makes raw text consumable by a training pipeline. Unstructured data in the literal sense (malformed encodings, broken HTML, incomplete documents) will simply prevent pretraining from proceeding correctly. That's not interesting or controversial. What I'm talking about is information noise: the quality filtering, denoising, and curation that happens after data has been properly structured and processed. The step where we decide which well-formed text is "high quality" enough to keep and which gets discarded. That's the decision I think deserves more scrutiny.
And there's a reason this matters beyond academic curiosity. Current LLMs, across model sizes and providers, share a peculiar failure mode: they're proficient—often superhuman—on benchmarks and structured tasks, yet they still make mistakes that no reasonably attentive human would make. Not hallucinations in the traditional sense (recent models have substantially reduced those), but failures of judgment, context sensitivity, and common-sense filtering that persist stubbornly even as capabilities scale. The gap between "impressively capable" and "reliably sensible" remains wide. I think the way we prepare pretraining data might have something to do with that gap, and that a different approach to noise might be part of what closes it.
The Experiment Nobody Has Run
Here's what the FineWeb result actually tells us: at current training scales and current compute budgets, aggressively filtered data outperforms lightly filtered data of comparable size. That's a real finding. But it doesn't tell us what happens when you have 10x more data and train long enough to converge.
Nobody has run that experiment at frontier scale. The reasons are understandable—training runs are expensive, and no lab wants to burn a $100M compute budget on a thesis that contradicts the prevailing wisdom. But the absence of evidence isn't evidence of absence. We've tested "clean and moderate" against "noisy and moderate." We haven't tested "clean and moderate" against "noisy and massive."
That's a different experiment entirely, and I think the outcome might surprise people.
The Paradox We Already Live With
There's an irony in how we train neural networks that the field doesn't talk about enough.
Dropout—one of the most successful regularization techniques in deep learning—works by injecting noise. During training, neurons are randomly deactivated, forcing the network to develop redundant representations and avoid over-reliance on any single pathway. Recent analytical work has shown that dropout reduces harmful correlations between hidden nodes, mitigates the impact of label noise, and that the optimal dropout rate actually increases as data gets noisier.
The optimizer itself introduces noise through stochastic gradient descent—processing random mini-batches rather than the full dataset. This noise helps escape local minima and find flatter, more generalizable loss landscapes. Data augmentation adds random transformations. Gaussian noise is sometimes added directly to inputs or weights.
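The noise-injection pattern in both of those paragraphs is easy to make concrete. Here is a minimal sketch of inverted dropout in plain NumPy (the function name and shapes are illustrative, not any framework's API): units are zeroed at random during training and the survivors are rescaled so the expected activation is unchanged.

```python
import numpy as np

def dropout(x, p, rng, train=True):
    """Inverted dropout: zero each unit with probability p, rescale survivors.

    The rescaling by 1/(1-p) keeps E[output] == input, so inference can
    simply skip the mask. The randomness is the point: the network cannot
    rely on any single pathway surviving a given step.
    """
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((4, 8))
y = dropout(x, p=0.5, rng=rng)
# Surviving units are scaled to 2.0, dropped units are 0.0; on average
# the signal is preserved, but its routing changes every training step.
```

Mini-batch SGD injects noise the same way at the gradient level: each step estimates the full-dataset gradient from a random subsample, and that sampling jitter is what helps the optimizer escape sharp minima.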
We inject artificial noise into every layer of the training process and call it good engineering. But then we spend enormous effort removing naturally occurring noise from the training data before it ever reaches the model. There's a tension here that deserves more attention.
The artificial noise we add is structureless—random, uncorrelated, designed to prevent overfitting. But naturally occurring noise in internet data is structured. SEO-optimized content follows recognizable templates. Low-quality writing has systematic patterns. Misinformation has narrative structures. Redundancy has its own signature. This structured noise isn't just interference—it's information about the world. A model that encounters enough of it doesn't just learn to ignore it. It learns the meta-patterns of how humans produce, optimize, and sometimes degrade information. That's a qualitatively different capability than what you get from a model trained only on the curated peaks of human output.
Stochastic Resonance: When Noise Literally Helps
There's a phenomenon in physics and neuroscience called stochastic resonance that challenges our intuition about noise in a fundamental way.
In nonlinear systems—including biological neural networks—adding the right amount of noise to a weak signal can actually improve detection of that signal. The noise pushes sub-threshold signals past the detection boundary, making them visible to the system. Too little noise and the signal stays hidden. Too much and it drowns. But at the optimal level, noise and signal cooperate.
This has been demonstrated in hippocampal neurons, in visual cortex, in sensory systems across modalities. Researchers have even applied random noise stimulation directly to human visual cortex and found that detection performance improved significantly—following an inverted U-shaped curve characteristic of stochastic resonance. The brain doesn't just tolerate noise. Under the right conditions, it exploits noise to enhance information processing.
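The inverted-U curve is simple enough to reproduce numerically. The toy below (all parameters are my own illustrative choices) feeds a sub-threshold sine wave plus Gaussian noise into a hard threshold detector and measures how well the detector's output correlates with the hidden signal: with almost no noise the signal never crosses the threshold, with heavy noise the crossings are random, and at moderate noise the crossings track the signal's peaks.

```python
import numpy as np

def detection_score(noise_sigma, rng, n=20_000, threshold=1.0, amp=0.6):
    """Correlation between a threshold detector's output and a hidden signal.

    amp < threshold, so the clean signal alone never triggers the detector.
    Noise occasionally pushes it over the line -- preferentially near the
    signal's peaks, which is what stochastic resonance exploits.
    """
    t = np.linspace(0, 40 * np.pi, n)
    signal = amp * np.sin(t)                      # sub-threshold on its own
    noisy = signal + rng.normal(0.0, noise_sigma, n)
    out = (noisy > threshold).astype(float)       # 1 when detector fires
    if out.std() == 0.0:                          # never fired: no detection
        return 0.0
    return float(np.corrcoef(out, signal)[0, 1])

rng = np.random.default_rng(1)
scores = {sigma: detection_score(sigma, rng) for sigma in (0.01, 0.5, 5.0)}
# Near-zero noise: detector silent, score ~0.
# Moderate noise: firings cluster at signal peaks, score rises.
# Heavy noise: firings decouple from the signal, score falls again.
```

The middle noise level wins, which is the inverted-U signature described above.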
I'm not claiming that internet noise maps directly onto gaussian white noise in a physics experiment. The analogy has limits. But the broader principle is worth taking seriously: the assumption that cleaner input always produces better output is not a universal law. It's a simplification that holds in certain regimes and breaks down in others. The interesting question is which regime LLM pretraining actually occupies at scale.
The Internet as a Self-Filtering System
One of the recurring concerns in the data scaling conversation is the "data wall"—the projection that we'll exhaust available high-quality public text within a few years. These projections treat the stock of internet data as roughly static, growing at historical rates.
But the internet after widespread generative AI adoption is a fundamentally different information ecosystem than the pre-GPT era. Content production velocity has increased dramatically. And much of this content goes through an interesting pipeline: AI generates candidates, humans review and select, the best version gets published. This isn't purely synthetic data and it isn't purely human-written data. It's a new category—human-curated, AI-assisted content that arguably has higher information density than either source alone, because it went through an implicit quality selection process.
The data wall might not be a wall at all. It might be a moving horizon that recedes as the production dynamics of the internet shift. The real constraint may move from data quantity to compute cost for training on an ever-growing corpus—which circles back to the question of whether you can extract more value from noisy data if you're willing to spend the compute.
Structured Noise as Curriculum
Here's the part of this idea that I find most interesting, and where I think the conventional framing misses something important.
When people argue against noisy training data, they implicitly model noise as pure interference—random corruption of an otherwise clean signal. And if that's what internet noise were, the argument would be airtight. Training on pure noise teaches nothing.
But internet data isn't pure noise. It's signal embedded within structured noise, and the noise itself carries information. A webpage with genuine technical content surrounded by SEO boilerplate, cookie consent banners, and sidebar spam isn't a degraded signal—it's a realistic representation of how information actually exists in the world. A model that learns to extract the technical content while developing internal representations that capture what boilerplate looks like, what spam looks like, what reliable sources look like, has arguably built a richer world model than one that only ever saw the extracted clean text.
Think of it as the difference between a model that knows what good content says, versus a model that knows what good content says and what bad content looks like and the structural differences between them and the motivations that produce each type. The second model understands the full topology of the information landscape.
Multi-head attention is well-suited for this. Different attention heads can specialize in different aspects of the input—some attending to semantic content, others to structural patterns, others to source reliability signals. A model trained on the full distribution of internet content has the opportunity to develop heads that specialize in source discrimination—a capability that would never emerge if the training data were pre-filtered to remove everything but the good stuff.
The Convergence Argument
One reasonable objection is compute cost. Training on 10x more noisy data requires roughly 10x the compute, and we don't know if the return justifies the investment. That's fair. But consider the trajectory we're already on.
Frontier labs are scaling compute regardless. The investment in training infrastructure continues to accelerate. Meanwhile, the supply of high-quality curated data is constrained in ways that raw internet data is not. As compute becomes less scarce relative to curated data, the economics shift. When compute was the bottleneck, maximizing learning per FLOP meant curating aggressively. When data quality at scale becomes the bottleneck, the question becomes whether you can substitute compute for curation—train longer on messier data and get something qualitatively different out the other end.
Established pretraining scaling laws already tell us that training longer on the same data keeps improving loss, up to convergence. The hypothesis here is that longer training on noisier but more diverse data might unlock capabilities that curated-data training can't access at any scale—specifically, emergent denoising ability and more robust internal representations.
What This Would Actually Look Like
If someone were to test this thesis seriously, the experimental setup would look something like this:
Same architecture, same total FLOP budget. One model trained on aggressively curated data at current best practices. Another trained on minimally filtered data—basic safety filtering, rough deduplication, but minimal quality scoring, minimal synthetic augmentation, minimal domain balancing—with proportionally more tokens to fill the compute budget.
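To make the FLOP-matching concrete, here is a sketch of the two arms as configs. Every number and field name is an illustrative assumption of mine, not any lab's actual pipeline: the curated arm repeats a smaller corpus for multiple epochs, the noisy arm sees each token once, and both end up processing the same token count through the same architecture, hence roughly the same FLOPs.

```python
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    """Hypothetical knobs for one arm of the comparison."""
    unique_tokens_B: int        # distinct tokens in the corpus (billions)
    epochs: float               # passes over that corpus
    quality_filtering: str      # how aggressively the data was curated
    synthetic_fraction: float   # share of synthetic/augmented tokens

    @property
    def tokens_seen_B(self) -> float:
        # Same architecture + same tokens seen ~= same training FLOPs.
        return self.unique_tokens_B * self.epochs

curated = PretrainConfig(unique_tokens_B=1_500, epochs=10,
                         quality_filtering="aggressive",
                         synthetic_fraction=0.2)

noisy = PretrainConfig(unique_tokens_B=15_000, epochs=1,
                       quality_filtering="safety-only",
                       synthetic_fraction=0.0)

# The comparison is only meaningful if the compute budgets match.
assert curated.tokens_seen_B == noisy.tokens_seen_B
```

The noisy arm trades repetition for diversity at equal cost, which is exactly the substitution the thesis is asking about.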
The prediction: the curated model wins on standard benchmarks initially. It should—those benchmarks were designed in a world where curated training was the norm. But the noisy model develops superior calibration, better robustness to adversarial or out-of-distribution inputs, and stronger performance on tasks requiring judgment about source reliability, tone assessment, and real-world reasoning under ambiguity. And at sufficient scale, the noisy model closes the benchmark gap because the internal representations it builds are richer and more generalizable.
This connects back to the persistent gap I mentioned earlier—the one between "impressively capable" and "reliably sensible." Current models make silly mistakes not because they lack knowledge but because they lack the kind of contextual discrimination that comes from exposure to the full range of information quality. A model that has only seen curated text has learned what correct outputs look like, but it may not have developed strong internal representations of what incorrect, misleading, or contextually inappropriate outputs look like. It recognizes good answers but doesn't have the same depth of representation for avoiding bad ones. If noisy pretraining at scale produces emergent denoising behavior—the model learning to internally distinguish signal from noise—that might be exactly the capability needed to close this gap. Not more knowledge, but better judgment.
The further prediction: post-training with RLVR or similar techniques would be more effective on the noisy-pretrained model, and this is a point worth emphasizing. A model pretrained on richer, noisier data doesn't just learn the same things with more interference—it builds a larger and more diverse set of internal representations. Those representations are latent potential. They're embedded knowledge about patterns, anti-patterns, context signals, and reliability indicators that a curated-data model never had the opportunity to develop. Post-training techniques like RLVR and GRPO work by selectively reinforcing and refining existing representations. If the pretrained model has more of them—more raw material to work with—post-training can unlock capabilities that simply don't exist in a model with a narrower representational base. The noisy-pretrained model might appear worse before post-training but improve more dramatically during post-training, because there's more there to be refined. Unfiltered pretraining followed by focused post-training might be the optimal pipeline, rather than filtered pretraining followed by further filtering during fine-tuning.
The Uncomfortable Implication
If any version of this thesis is correct, the industry might be leaving capability on the table by over-curating pretraining data. Not because curation is wrong in principle, but because the sweet spot between "too noisy" and "too clean" might be further toward the noisy end than current practice assumes—especially as scale increases.
The deeper implication is about what kind of intelligence we're building. A model trained only on curated data develops a specific kind of competence—it knows a lot about what's true and well-expressed. A model trained on the full distribution of human information production develops something closer to judgment—it knows not just what's true, but what's misleading, what's motivated, what's superficially convincing but structurally hollow. That second capability might matter more as these systems are deployed in messy, adversarial, real-world contexts.
An Honest Caveat
I want to be clear that this is an exploration, not a conclusion. I'm not claiming the industry is wrong—I'm suggesting there's a hypothesis that hasn't been adequately tested, and that the evidence base for the "clean data always wins" position is narrower than the confidence with which it's stated.
The strongest counter-argument remains the empirical one: at every scale tested so far, curation has helped. That's real data and it shouldn't be dismissed. The question is whether that trend extrapolates, or whether there's a crossover point at scales we haven't reached yet. I don't know the answer. But I think the question is worth asking, and I notice that very few people in the field are asking it.
This started as a thread of thinking inspired by Yann LeCun's observation about the raw data volume a four-year-old processes through vision alone—roughly equivalent to what the largest LLMs train on, but noisy, continuous, and multimodal. Most people in the field might look at that comparison and conclude that the data types are incomparable. I looked at it and wondered whether the noise itself might be doing more work than we give it credit for.
If you work on pretraining data pipelines and have thoughts on this, I'd genuinely like to hear them. The best ideas usually come from people who can articulate exactly why the thesis breaks down—that's where the interesting refinements live.