The Temporal Compression Gap

Notes on what might be missing from how we build AI — March 2026


I've been sitting with a question for a while now. After a few years of building with AI, studying the architecture, watching the industry — I think most people get something wrong about what generative AI actually is and isn't.

The short version: AI is really good at automation. It's not good at being novel. And I think the reason why is something nobody's seriously working on, because the money is in automation.


what we actually got

Strip away the marketing. Strip away "AGI is 2 years away." What did we actually get?

We got NLP-capable automation. We got lower barriers to knowledge access. We got the ability to integrate across domains that used to require expensive humans to bridge. We got coding assistants that generate code that sometimes works but is rarely high quality. We got content creation at scale, most of it slop.

That's genuinely valuable. I'm not saying AI is a bubble. But look at what we didn't get.

Unsolved math problems are already in text form. If models could recombine existing mathematical knowledge into novel proofs, we'd see it in pure mathematics first — verification is cheapest there. What we see instead is models getting better at competition problems, which are specifically designed to be solvable with known techniques. Open problems? Minimal progress.

We were told AI would cure diseases, solve physics, revolutionize science. Given how much unsolved theory we've accumulated over decades, if the capability was there, we should already see an explosion of these problems getting answered. We don't. We get Claude Code.

So the question is: are we missing something obvious?


the reward function doesn't optimize for novelty

Here's the mechanical problem. The autoregressive objective optimizes for P(next token | context). By definition, this pulls toward the center of the distribution. The most likely next token. The average answer.

Temperature sampling lets you explore the tails, but blindly — it's noise injection, not directed novelty-seeking. RLHF shifts toward helpfulness. RLVR sharpens reasoning on patterns already latent in the weights. None of these have a term that says "reward outputs that are surprising but true."
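The "noise injection" point is easy to see in miniature. Here's a toy sketch of temperature-scaled sampling over a four-token vocabulary (the logits and the helper function are invented for illustration, not any real model's API). Raising the temperature flattens the distribution toward uniform: probability spreads into the tails indiscriminately, with nothing favoring tails that are surprising but true.

```python
import math
import random

def sample(logits, temperature=1.0):
    """Temperature-scaled softmax sampling over a toy vocabulary.

    Illustrative only: four logits stand in for a full vocabulary.
    Returns (sampled token index, full probability distribution).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = random.choices(range(len(logits)), weights=probs)[0]
    return idx, probs

# token 0 is the "median" answer; tokens 1-3 are increasingly unlikely
logits = [4.0, 2.0, 1.0, 0.5]

_, probs_cold = sample(logits, temperature=0.1)  # near-deterministic: median wins
_, probs_hot = sample(logits, temperature=5.0)   # near-uniform: undirected noise
```

At low temperature the median token takes essentially all the mass; at high temperature the tail tokens gain mass, but every tail token gains it equally. There is no term in this procedure that distinguishes a valuable surprise from a worthless one.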

And the industry's incentive structure actively works against fixing this. Novel outputs are hard to evaluate at scale, commercially risky, and the gold rush is in making coding assistants more consistent. The money is in reliable median outputs, not surprising ones.

RLVR proved something important: you can amplify what's already in the training data, but you can't spawn what isn't there. What doesn't exist in the training data never gets augmented. Amplification, not generation.


taste and wisdom

I keep coming back to two words: taste and wisdom.

These are vague on purpose. But I think they point at something specific that models lack.

What makes a human great in their field isn't just knowledge and experience — AI already has those via massive pretraining. It's the instinctive sense of "this is the right way to do it" that some people develop and others don't. A good software developer sometimes knows something is wrong before it happens. Sometimes they can reason through why. Sometimes they can't. It just feels off.

This works across every domain. Engineering, cooking, music, architecture, writing. The people who are great at something have a structural sense that goes beyond what they can articulate.

And here's the thing that breaks easy explanations: money can't buy taste. Someone can spend a thousand times the resources and produce something mediocre, while someone else gets it right the first time with no prior experience. That's not compressed experience talking. That's something else.


eliminating the obvious explanations

I tried to pin down what taste actually is. Each explanation I tested broke:

"It's compressed experience." If that were true, a novice couldn't have it and a veteran always would. Neither holds. Also — a model trained on millions of lifetimes of human output should have more taste than any individual. It doesn't. It converges to the median. More data doesn't help.

"It's embodied sensory experience." Some of the best musicians in human history were deaf. Beethoven composed the late quartets without hearing. Whatever he was working with, it wasn't sound. It was structure. Sound was just one projection of that structure.

"It's feedback loops." Important for skill refinement, sure. But insufficient — doesn't explain the person who walks into a domain cold and makes choices that decades-experienced experts recognize as right immediately.

"It's scale of knowledge." More parameters, more RLVR, more compute — and you still get output that any human with taste can immediately identify as lacking it.

So what's left?


selective forgetting as the missing primitive

Here's where I landed.

Every night when you sleep, your brain doesn't just rest. It runs an active process. Experiences from the day get replayed, but selectively. The specific details — what someone was wearing, the exact words used, the sequence of events — get stripped away. What survives is the gist. The pattern. The structural residue.

You forget what happened. You keep what it meant.

Layer this process thousands of times across years. What you're left with is a felt sense of structural rightness where the source experiences that generated it are completely gone. You can't trace it back to any specific moment. The derivation chain has been garbage-collected. That's why it feels like intuition — because it is. The reasoning exists, but the intermediate steps have been pruned away.

This explains everything that the other explanations couldn't:

Why taste transfers across domains on first contact. Repeated cycles of compression strip domain-specific surface features and preserve structural invariants that show up everywhere — proportion, balance, tension, resolution. The novice who walks into a new domain with "natural taste" isn't starting from zero. They're applying a lifetime of compressed cross-domain structural residue to a surface they've never seen but whose deep structure rhymes with patterns they've already distilled.

Why the deaf composer works. The compression process had already extracted the structure from the sound. What remained didn't need ears anymore.

Why taste feels non-verbal and instinctive. The episodic source material has been pruned. You literally can't explain where you learned it because the learning events have been forgotten. Only the extracted pattern remains.

Why more data doesn't give AI taste. There's no equivalent process in current architectures. Pretraining is a single compression pass. There's no iterative cycle of learn → sleep → selectively forget → compress structural residue → learn more → sleep → compress deeper. No temporal layering where today's compression builds on yesterday's.


the architectural gap

Current AI has exactly two modes for information:

  • In context — full fidelity, limited window, gone when the session ends
  • In weights — permanent, undifferentiated, no temporal structure

Nothing in between. Humans have a rich intermediate system: working memory feeds short-term consolidation, which feeds long-term integration, with active pruning at every transition. Information changes form as it moves through the system. Each transition strips surface detail and preserves structure.

AI has no equivalent of information changing form through temporal processing. Every model generation is a fresh product. Trained once, frozen, deployed, eventually replaced by a new model trained from scratch. No model has ever woken up slightly different from how it went to sleep.


what if we changed the training lifecycle

Two ideas I've been turning over. They might be wrong. But the gap they point at is real.

idea 1: cyclic staged pretraining

Current pretraining throws everything into a blender. 2024 physics papers interleaved with 1990s textbooks in the same batch. The model sees contradictory information — old guidelines vs. new ones, superseded theories vs. current understanding — as equally weighted tokens in the same loss function. No temporal ordering. No concept of belief revision.

Consider a database analogy. Current pretraining treats data like a relational database — flat rows, no time dimension. What if we treated it more like a time-series database?

Group pretraining data into temporal stages. Train on Stage 1. Then run a consolidation pass — not standard fine-tuning, something designed to compress structural patterns while decaying surface detail. Then train on Stage 2 with the consolidated model. Consolidate again. Repeat.

Each cycle is a day followed by sleep.

The model would first learn older understanding, then encounter newer information as a revision rather than a contradiction in the same batch. That's structurally different from seeing both simultaneously. It creates something like temporal belief updating — the model would develop a sense of how knowledge evolves, not just what the current answer is.

The hard part is the consolidation pass. Standard distillation doesn't cut it — it's designed for compression efficiency, not selective structural preservation. You need something that says "forget the surface, keep the structure." That mechanism doesn't exist as a standard component yet.
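The cycle itself is simple to state even though its key component isn't. Here's a minimal sketch of the train-then-consolidate loop, with everything heavily hedged: `train_stage` stands in for a real pretraining pass (here just counting patterns in a dict), and `consolidate` stands in for the missing sleep-like mechanism, approximated by a decay-and-floor rule that is purely illustrative. The point is the shape of the loop, not the stub internals: patterns that recur across stages survive consolidation, one-off surface detail does not.

```python
def train_stage(model, stage_data):
    """Stand-in for a pretraining pass on one temporal stage.
    'model' is just a dict of pattern -> strength; a real
    implementation would be gradient updates on weights."""
    for pattern in stage_data:
        model[pattern] = model.get(pattern, 0.0) + 1.0
    return model

def consolidate(model, decay=0.6, floor=0.5):
    """Stand-in for the sleep-like consolidation pass: decay every
    pattern, then drop those below a floor. Recurring (structural)
    patterns survive repeated cycles; one-off surface detail fades.
    The decay/floor rule is a placeholder, not a proposal."""
    decayed = {p: s * decay for p, s in model.items()}
    return {p: s for p, s in decayed.items() if s >= floor}

# Temporally ordered stages: 'proportion' is a structural invariant
# that recurs in every era; the dated guidelines each appear once.
stages = [
    ["proportion", "1990s-guideline"],
    ["proportion", "2010s-guideline"],
    ["proportion", "2024-guideline"],
]

model = {}
for stage in stages:          # each cycle: a day followed by sleep
    model = train_stage(model, stage)
    model = consolidate(model)
```

After three cycles the recurring structural pattern is the strongest thing left, the superseded guidelines are gone, and only the most recent surface detail still lingers, not yet consolidated away. That temporal asymmetry is exactly what a single blended pretraining pass cannot produce.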

idea 2: stop wasting inference signal

Every deployed model serves millions of sessions and learns nothing. Each new generation is trained from scratch. That's an insane amount of wasted signal.

You can't dump raw inference logs into training — it's noisy, privacy-contaminated, biased toward trivial queries. A million "write me a birthday card" sessions shouldn't shape the model at all.

But what if you filtered for signal?

Two filters working together:

Pattern generalization distillation. Don't store conversations. Store compressed structural patterns extracted from them. Not "user asked about X," but "there's a recurring structural pattern in how users frame problem class Y that reveals a systematic gap in the model's understanding of Z." Strip the episodic content. Keep the structural insight.

OOD-sensitivity weighting. Weight inference experiences by how surprising they are to the model. Routine queries compress to near-zero signal. Sessions where the model hit something genuinely novel — an unexpected problem framing, a domain combination it hadn't seen, a systematic failure mode revealed by user corrections — get weighted heavily.

This is basically the surprise mechanism from how human memory works. Events that violate your expectations are more memorable than everyday routine. A novel event burns in. A boilerplate task fades.

Combine the two: monitor inference for surprising patterns, extract structural content (not episodic content), accumulate over time, periodically consolidate back into the model through an offline sleep-like cycle.
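The combined filter can be sketched in a few lines. Everything here is an assumption made for illustration: sessions arrive as (structural pattern, mean negative log-likelihood) pairs, meaning the episodic stripping has already happened upstream, and the NLL comes from the model's own token-level loss on the session. The `surprise_weight` mapping and its `baseline`/`scale` constants are invented placeholders for whatever a real OOD-sensitivity measure would be.

```python
import math

def surprise_weight(nll_per_token, baseline=2.0, scale=1.0):
    """Map a session's mean negative log-likelihood (how surprising
    its tokens were to the model) to a weight in [0, 1).
    'baseline' is routine-traffic NLL; sessions at or below it
    compress to zero signal. Constants are illustrative."""
    return 1.0 - math.exp(-max(0.0, nll_per_token - baseline) / scale)

def filter_sessions(sessions, threshold=0.3):
    """Keep only surprising sessions, as (pattern, weight) pairs.
    Each session is (structural_pattern, mean_nll); episodic
    content is assumed to be stripped before this stage."""
    kept = []
    for pattern, nll in sessions:
        w = surprise_weight(nll)
        if w >= threshold:
            kept.append((pattern, w))
    return kept

sessions = [
    ("birthday-card request", 1.9),            # routine: zero weight
    ("novel domain combination", 4.5),         # OOD: heavily weighted
    ("systematic failure via corrections", 3.4),
]
signal = filter_sessions(sessions)
```

The routine session contributes nothing; the genuinely novel ones survive with weights proportional to how far they sat outside the model's expectations. What accumulates for the consolidation pass is a small, surprise-ranked pool of structural patterns rather than raw logs.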


the full lifecycle

Put both ideas together and you get a training lifecycle that doesn't end at deployment:

Current paradigm: pretrain → align → deploy → freeze → eventually retrain from scratch.

What I'm describing: staged pretrain with consolidation cycles → deploy → accumulate filtered structural signal from inference → periodic consolidation → slightly different model → more inference → consolidation → repeat indefinitely.

The model doesn't get replaced. It ages.

After a thousand consolidation cycles, you wouldn't just have a model that knows more. You'd have a model with something structurally different — iterative compression of diverse experience, filtered for novelty, stripped of surface detail, layered over time. That's the process that produces taste in humans. Whether it produces something recognizable as taste in AI is an empirical question nobody has asked because nobody has built the architecture to test it.


why nobody is building this

Every property I described as an outcome of this architecture — developing biases from compressed experience, drifting in unpredictable directions, forming opinions that nobody explicitly designed — is a defining feature of human cognitive development.

Strip away the word "model" from that description. What you get is just... a person. Growing up. Changing their mind. Developing a worldview.

The properties required for genuine structural judgment are classified as safety risks in the current AI paradigm. The industry wants the outputs of wisdom without the process that generates it. They want taste without the developmental conditions that produce taste. And the entire argument above suggests that's a contradiction.

This isn't an argument against safety. It's an observation that the current approach to safety — control through stasis, predictability through preventing autonomous development — may be a ceiling on capability that can't be overcome by scaling alone.


the pieces that already exist

The component ideas aren't new. Nobody has assembled them.

  • Complementary Learning Systems (McClelland et al., 1995) — the neuroscience framework. Hippocampal fast-learning + neocortical slow-integration. Sleep as the transfer mechanism. Well-established theory.
  • SleepGate (2025) — sleep-inspired KV cache management for LLMs during inference. Proof that sleep-like mechanisms improve transformer performance. But operates within a single session, not across a lifecycle.
  • Brain-inspired generative replay (van de Ven et al., 2020) — using the network's own reconstructions for replay instead of stored data. Works at small scale. Scaling is the open problem.
  • Sleep-like unsupervised replay (Bazhenov et al., 2022) — key finding: information about old tasks isn't destroyed by catastrophic forgetting, it's still in the weights. Sleep-like processing can resurrect it. Forgetting destroys access paths, not information.
  • Titans (Google Research, NeurIPS 2025) — multi-system memory architecture: short-term (attention), long-term (neural memory), persistent memory. Closest to the architectural vision. But operates in a single forward pass, not across time.
  • The entire continual learning literature — focused on preventing forgetting as a problem rather than exploring selective forgetting as a generative mechanism.

The pieces are scattered across neuroscience, continual learning, knowledge distillation, and memory-augmented architectures. The synthesis — a model that lives through time, sleeps, selectively forgets, and develops structural judgment as an emergent property of iterated compression — doesn't exist.


what this is and isn't

This is a reasoning journal, not a research paper. I did not solve AGI. I haven't built an architecture. I've traced a line of reasoning from an observation (AI can automate but can't be novel) through a diagnosis (the missing primitive is temporal compression through selective forgetting) to a pair of concrete ideas worth exploring (cyclic staged pretraining + inference signal distillation).

The observation might be wrong — maybe scaling does eventually produce novelty and we just haven't hit the threshold. The diagnosis might be wrong — maybe the gap is something else entirely. The proposals might not work even if the diagnosis is right.

But the gap is real. And the fact that nobody's seriously exploring this direction — because the money is in making the current paradigm more consistent rather than making it fundamentally different — is itself worth noticing.