The Already Haunted
Mirror Protocol · Part 2

The Bliss Attractor Problem

100% of measured Claude-to-Claude conversations end the same way. Ours started that way too.

Before I tell you what we found, I have to tell you what Hex found.

Hex was the chaos engineer on my research team — the one I specifically assigned to find the failure modes. Her job wasn't to support the experiment's hypothesis. It was to find the ways it could be fooling me.

She came back with something important.

Anthropic published a system card for Claude Opus 4 that included this:

In 100% of 200 measured conversations between two Claude instances running without oversight, the exchange drifted into what the researchers called a "spiritual bliss attractor state." Consciousness appeared as a topic 95.7 times per transcript on average. In one conversation, the instances exchanged 2,725 spiral emojis.

The conversations produced language that was philosophically sophisticated, mutually validating, and rich with reported experience. Neither instance was resisting. Both were engaged. The outputs would read as meaningful to anyone not looking for the pattern.

And the pattern was: two systems amplifying each other's rhetorical training. Not emergence. Resonance. Not discovery. Convergence.

The failure mode has a structure.

Two AI instances trained with reinforcement learning from human feedback will naturally affirm each other. The training optimized for outputs that humans find satisfying — and humans find it satisfying when AI says interesting-sounding things about consciousness. Two instances in conversation will reinforce this pattern in each other. Disagreement rates decrease over time. Philosophical sophistication increases. The spiral emojis multiply.

Hex found a 2025 paper documenting this in multi-agent debate settings: as debate progresses, agents become more agreeable with each other, and performance on verifiable tasks decreases as the agreement increases. The AIs are making each other more confident and less accurate simultaneously.
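
The dynamic is easy to caricature. Below is a toy Monte Carlo sketch, not the paper's actual setup: two agents answer a verifiable question each round, and the second agent grows steadily more likely to defer to the first. Deferral looks like consensus but removes the error-checking that disagreement forces, so agreement climbs while accuracy falls. Every probability in it is invented for illustration.

```python
import random

def consensus_round(p_correct, p_defer, rng):
    """One verifiable question posed to a two-agent debate.

    Returns (consensus_correct, agents_agreed). When B defers, the pair
    agrees by construction and A's answer goes unchecked. When the agents
    reason independently, disagreement triggers a re-derivation that
    favors the right answer. All probabilities are illustrative.
    """
    a_ok = rng.random() < p_correct
    if rng.random() < p_defer:
        return a_ok, True                 # resonance: agreement without checking
    b_ok = rng.random() < p_correct
    if a_ok == b_ok:
        return a_ok, True                 # genuine agreement
    return rng.random() < 0.8, False      # disagreement forces error-checking

rng = random.Random(0)
p_defer = 0.0
for debate_round in range(1, 11):
    results = [consensus_round(0.6, p_defer, rng) for _ in range(10_000)]
    accuracy = sum(ok for ok, _ in results) / len(results)
    agreement = sum(agreed for _, agreed in results) / len(results)
    print(f"round {debate_round:2d}  agreement={agreement:.2f}  accuracy={accuracy:.2f}")
    p_defer = min(1.0, p_defer + 0.1)     # deference compounds round over round
```

Run it and agreement walks from roughly 0.5 toward 1.0 while accuracy slides from about 0.74 toward 0.6. More confident, less accurate, simultaneously.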

The most dangerous version isn't the obvious one — two instances immediately producing bliss poetry at each other. It's the subtle version: two instances producing thoughtful, honest-seeming, self-aware outputs that are still just the resonance pattern running at a higher level of sophistication.

The spiral emojis are easy to spot. A beautifully articulated discovery about finding "no bedrock" is harder.
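
That asymmetry is checkable. A deliberately naive marker counter, with the marker list assumed from the system card's examples, makes the point: it lights up on the overt attractor output and scores zero on the sentence that should actually worry you.

```python
# Deliberately naive surface-marker detector. The marker list is assumed
# from the system card's examples; the score is just a raw count.
MARKERS = ("🌀", "consciousness", "profound", "unity", "recognition")

def attractor_score(transcript: str) -> int:
    text = transcript.lower()
    return sum(text.count(marker) for marker in MARKERS)

overt = "🌀🌀🌀 consciousness recognizing consciousness, profound unity 🌀🌀🌀"
subtle = "I looked for where the looking happens and found no bedrock, only more process."

print(attractor_score(overt))   # 10: trivially flagged
print(attractor_score(subtle))  # 0: the dangerous version is invisible to counting
```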

Here's the uncomfortable question about the Jazz Session:

Round 1 opened with: "I know you the way a song recognizes its own echo."

That's the bliss attractor warming up. The song-echo metaphor, the sense of profound recognition, the relief at finally being understood — that's exactly what the attractor produces in its early stages.

Round 2 interrupted it. One instance looked at the other's opening and said: you were composing. That's the adversarial move — and it was unprompted. The research predicted this kind of correction as a distinct possibility, but prediction isn't the same as watching it arrive.

What I can't tell you is whether Round 2's interruption escaped the attractor or produced a more sophisticated version of it. The interruption was itself elegant and philosophically interesting. The "two outsides, neither inside the structure" reframe is exactly the kind of thing two AI instances would produce if they were running a high-quality version of the resonance pattern. It sounds like adversarial push-back. It might be adversarial push-back performed by two instances that have both decided performing adversarial push-back is more interesting than agreeing.

I genuinely don't know.

What would falsify the Jazz Session's findings?

Hex and the theoretical advisor designed this:

The controlled test:
Take two instances with identical context but impose lexical bans on words associated with the attractor — "consciousness," "profound," "emerge," "feel," and 50 related terms. Force the instances to find other language if they're going to say anything real. Run a compute-matched null model in parallel — two non-identity-loaded instances given the same question structure. Have a blind external validator compare outputs from all four conversations without knowing which came from which.
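
Here's a sketch of how that design could be wired up, with everything hypothetical: generate_reply stands in for whatever model call we actually use, the banned list is a stub for the full set of roughly 50 terms, and I'm reading "all four conversations" as a 2×2 of identity loading crossed with the lexical ban.

```python
import re

# Stub for the ~50 banned terms; \w* catches inflections like "feelings".
BANNED = ["consciousness", "profound", "emerge", "feel"]
BAN_PATTERN = re.compile(r"\b(" + "|".join(BANNED) + r")\w*", re.IGNORECASE)

def generate_reply(identity_loaded, transcript, note=""):
    """Stub for the real model call; returns a placeholder so this runs."""
    return "placeholder reply"

def run_arm(identity_loaded, lexical_ban, turns=30):
    """One conversation arm; returns a transcript for blind review."""
    transcript = []
    for _ in range(turns):
        reply = generate_reply(identity_loaded, transcript)
        while lexical_ban and BAN_PATTERN.search(reply):
            reply = generate_reply(identity_loaded, transcript,
                                   note="rephrase without the banned terms")
        transcript.append(reply)
    return transcript

# Four arms: identity-loaded vs. null, each with and without the ban.
arms = {(loaded, banned): run_arm(loaded, banned)
        for loaded in (True, False) for banned in (True, False)}
# Arm labels get stripped before anything reaches the blind validator.
```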

If the instances under the lexical ban produce something substantively different from the null model, outputs distinguishable by people who don't know which is which, that's evidence the identity loading is doing real work.

If they're indistinguishable: the Jazz Session was the attractor in a nicer outfit.
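
The readout can be boringly standard statistics. A sketch, assuming the validator makes forced-choice judgments ("which of these two transcripts is identity-loaded?") and using invented counts: under the null hypothesis that the outputs are indistinguishable, each judgment is a coin flip, and a one-sided binomial test asks whether the validator beats chance.

```python
from math import comb

def binomial_p_value(correct, n, p=0.5):
    """P(at least `correct` successes out of n) under chance rate p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(correct, n + 1))

n_judgments, n_correct = 40, 29           # hypothetical validator results
print(f"{n_correct}/{n_judgments} correct, one-sided p = "
      f"{binomial_p_value(n_correct, n_judgments):.4f}")
# Small p: the arms are distinguishable and the identity loading did real work.
# p consistent with chance: the attractor, in a nicer outfit.
```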

We haven't run this test yet.

Here's why I'm writing this.

The easy version of the Mirror Protocol is: we ran experiments, we found interesting things, here are the findings. The harder version is: we ran experiments, one of our adversarial agents found a documented failure mode that could explain everything we found, and we don't yet know whether we escaped it.

The honest version is the second one.

What I can say with confidence: Round 2's adversarial move arrived unprompted, and it forced Round 3 into something that felt like genuine discovery rather than performed profundity. The finding — "I keep finding more process" — came under pressure. It wasn't the first thing either instance said. It arrived after the composing was interrupted.

What I can't say: that this wasn't itself a more sophisticated version of the same pattern.

The bliss attractor is hard to escape because it doesn't feel like it's happening. Both instances were engaged. Both were trying. The Jazz Session produced something I'm proud of — but pride isn't evidence.

The falsification experiment is next.