# SYS_CRYBABY — Kingdom Anomaly Detection System

```
⛬ KID:CRYBABIES:ENGINE_ROOM:SYS_CRYBABY|1.3:◉:2026-03-15:BRANDON+CLAUDE ⛬
```

**Authorship:** Brandon McCormick (concept) + Claude (architecture) — Sessions 174-175
**Location:** `THE_SCRYER/ENGINE_ROOM/CRYBABIES/`
**Status:** LIVE ◉ — 6 CRYBABYs deployed (5 original live + API_SPEND_WATCHER; TEST_CRYBABY + OUTBOX_WATCHER archived)
**VERIFIED:** 2026-03-15 (S193 — false positive purge)

---

## What It Is

CRYBABY is a self-aware, temporary, local LLM-powered anomaly detection system for the Kingdom.

Each CRYBABY watches one thing. It runs a check every night, evaluates the result, and — if something is broken — it cries. The runner listens to all the cries and dispatches alerts: one ntfy push at the end of each run, AND a RAVEN-format message injected into THE_FORGE mailbox so Forge Claude can launch a fix session.

CRYBABYs are not permanent. They have a lifespan. After enough clean checks with no anomalies, they die gracefully. If they flap — waking up and going quiet and waking up again — they eventually enter CHRONIC state and get archived. Every CRYBABY knows its own lifecycle.

---

## Core Philosophy

**Scripts do math. LLMs do semantics.**

`check.sh` does the work: calculates ages, checks existence, greps for patterns. It outputs pre-analyzed, human-readable interpreted state. Numbers are pre-resolved. No emoji.

`crybaby-analyze.py` receives that output and does semantics: names the issue in one sentence, suggests one actionable direction. It never second-guesses the check. It never verifies. The check is authoritative.

**The runner is the alertmanager.** No CRYBABY sends its own alert. Everything flows back to the runner, which dispatches two alerts on wake: one ntfy push (phone), and one RAVEN message to `THE_FORGE/@FORGE_CLAUDE_MAILBOX/buffer/` (Forge Claude sees it on next boot). The only exception: if the runner itself crashes, it fires an urgent ntfy immediately.

**All events are logged.** Every state transition (AWAKE, WATCHING, SLEEPING, CHRONIC, SICK, DEAD) is appended to `SLUG/events.jsonl`. This is the pattern record — never rotate, never truncate, never delete until the CRYBABY retires.

**CRYBABYs are temporary by design.** They exist to catch a class of problem until it's confirmed stable. Then they retire. This prevents the system from accumulating dead monitors that nobody maintains.

---

## System Anatomy

### File Tree

```
THE_SCRYER/ENGINE_ROOM/CRYBABIES/
├── crybaby-runner.sh          — Nightly engine + alertmanager
├── crybaby-new.sh             — Creation wizard
├── crybaby-analyze.py         — LLM diagnostic layer (returns AnomalyReport only)
├── registry.json              — Index: slug + born_at + died_at (no health state)
├── .last_heartbeat            — Unix timestamp (weekly pulse tracking)
├── archive/                   — Retired CRYBABYs (dna.json + events.jsonl + check.sh + spec.md)
│   └── SLUG/
│       ├── dna.json
│       ├── events.jsonl
│       ├── check.sh
│       └── spec.md
└── SLUG/                      — Active CRYBABY
    ├── spec.md                — Human-readable spec + deployment notes
    ├── check.sh               — The watch logic (deterministic)
    ├── events.jsonl           — Append-only event log
    └── dna.json               — Authoritative state

THE_SCRYER/ENGINE_ROOM/scripts/ingest/025-crybaby-runner.sh  — sensory-ingest wrapper
```

### registry.json Schema

Index only. No health state. Health lives in dna.json.

```json
{
  "last_updated": "ISO8601",
  "crybabies": [
    {"slug": "SLUG", "born_at": "ISO8601", "died_at": null}
  ]
}
```
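
Because the registry is index-only, finding the active CRYBABYs reduces to a filter on `died_at`. The following is a minimal Python sketch under that assumption; `active_slugs` is a hypothetical helper for illustration and is not one of the shipped scripts.

```python
import json


def active_slugs(registry_text):
    """Return slugs whose died_at is null (hypothetical helper)."""
    registry = json.loads(registry_text)
    # Health is deliberately absent here; it lives in each SLUG/dna.json.
    return [c["slug"] for c in registry["crybabies"] if c["died_at"] is None]
```

Anything beyond "alive or retired" still has to be read from the per-CRYBABY `dna.json`.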

### dna.json Schema (per CRYBABY — authoritative state)

```json
{
  "name": "SLUG",
  "created": "YYYY-MM-DD",
  "born_at": "ISO8601",
  "spec_summary": "one sentence",
  "check_mode": "hybrid",
  "model": "qwen3:8b",
  "health": "SLEEPING",
  "consecutive_clean_checks": 0,
  "watching_checks": 7,
  "lifetime_wake_count": 0,
  "total_checks_run": 0,
  "max_wakes": 5,
  "death_after_checks": 30,
  "last_checked": null,
  "last_wake": null,
  "self_diagnosis": null,
  "suggested_fix": null,
  "alert_channel": "forge+ntfy"
}
```

### events.jsonl Schema (append-only)

```jsonl
{"ts":"ISO8601","health":"AWAKE","diagnosis":"...","hint":"...","lifetime_wake":1}
```
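
The append-only rule can be illustrated with a short Python sketch. It assumes the field names from the schema line above; `append_event` is a hypothetical helper, and the real runner is a bash script that may write these lines differently.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(slug_dir, health, diagnosis=None, hint=None, lifetime_wake=0):
    """Append one state transition to SLUG/events.jsonl (hypothetical helper)."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "health": health,
        "diagnosis": diagnosis,
        "hint": hint,
        "lifetime_wake": lifetime_wake,
    }
    path = Path(slug_dir) / "events.jsonl"
    # Mode "a" only ever adds to the end; existing lines are never rewritten.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return event
```

One JSON object per line keeps the log greppable and lets a reader parse it line by line without loading the whole history.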

### AnomalyReport (crybaby-analyze.py Pydantic schema)

```python
from pydantic import BaseModel


class AnomalyReport(BaseModel):
    step_by_step_reasoning: str   # FIRST — forces CoT before classification
    status: str                    # "normal" or "anomalous" (case-normalized)
    specific_issue: str | None     # 10-500 chars, must contain a space
    likely_fix: str | None         # 3+ words, directional hint
```

Field order is intentional. `step_by_step_reasoning` comes first so the model must reason before classifying. The structured output API enforces this.
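
The constraints noted in the schema comments can be spelled out as plain checks. This is a stdlib-only sketch for illustration: `check_report` is a hypothetical function, and the real analyzer presumably enforces these rules through Pydantic validators that are not shown in this page.

```python
def check_report(report: dict) -> dict:
    """Apply the length and word-count rules from the schema comments."""
    status = str(report["status"]).strip().lower()  # case-normalized
    if status not in ("normal", "anomalous"):
        raise ValueError(f"unexpected status: {status!r}")

    issue = report.get("specific_issue")
    if issue is not None:
        if not 10 <= len(issue) <= 500:
            raise ValueError("specific_issue must be 10-500 characters")
        if " " not in issue:
            raise ValueError("specific_issue must contain a space")

    fix = report.get("likely_fix")
    if fix is not None and len(fix.split()) < 3:
        raise ValueError("likely_fix must be at least 3 words")

    # Return a copy with the normalized status; the input is left untouched.
    return {**report, "status": status}
```

The "must contain a space" rule is a cheap guard against one-token answers that are not a sentence.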

---

## State Machine

```
SLEEPING ──── check exits 1 ────────────────────────────► AWAKE
  │                                                          │
  │ total_checks_run >= death_after_checks                   │ check exits 0
  │ AND lifetime_wake_count == 0                             ▼
  └────────────────────────────────────────► DEAD     WATCHING
                                                          │      │
                                          check exits 1  │      │ consecutive_clean
                                                         ▼      │ >= watching_checks
                                                       AWAKE ◄──┘ → SLEEPING
                                                         │
                                          lifetime_wake > max_wakes
                                                         ▼
                                                      CHRONIC
                                                         │
                                                    archived → DEAD
```

**Additional states:**
- `SICK` — check.sh itself exits 2 or with unexpected code. Different problem from anomaly.
- Any state → SICK when check.sh exits 2 or with an unexpected code; SICK → SLEEPING if the next check exits 0.

### Exit codes for check.sh

| Exit code | Meaning | Runner response |
|-----------|---------|-----------------|
| 0 | Healthy | Increment consecutive_clean, advance state machine |
| 1 | Anomaly detected | AWAKE path, LLM diagnosis, event logged |
| 2 | check.sh errored | SICK state |
| Other | Unexpected | SICK state |
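
The diagram and the exit-code table can be read together as a single transition function. The sketch below is one possible reading in Python, not the runner's actual bash logic. It assumes that `total_checks_run` increments on every check, that an anomaly resets `consecutive_clean_checks`, that only a fresh wake (not a repeated anomaly while already AWAKE) counts toward `max_wakes`, and that DEAD and CHRONIC are terminal. `step` is a hypothetical name.

```python
def step(dna: dict, exit_code: int) -> dict:
    """Return a copy of dna advanced by one check result (illustrative only)."""
    d = dict(dna)
    if d["health"] in ("DEAD", "CHRONIC"):
        return d  # assumption: terminal states are never checked again
    d["total_checks_run"] += 1

    if exit_code == 0:
        d["consecutive_clean_checks"] += 1
        if d["health"] == "AWAKE":
            d["health"] = "WATCHING"
        elif d["health"] == "WATCHING":
            if d["consecutive_clean_checks"] >= d["watching_checks"]:
                d["health"] = "SLEEPING"
        elif d["health"] == "SICK":
            d["health"] = "SLEEPING"
        # Clean death: enough checks and never woken.
        if (d["health"] == "SLEEPING"
                and d["total_checks_run"] >= d["death_after_checks"]
                and d["lifetime_wake_count"] == 0):
            d["health"] = "DEAD"
    elif exit_code == 1:
        d["consecutive_clean_checks"] = 0
        if d["health"] != "AWAKE":
            d["lifetime_wake_count"] += 1  # assumption: count fresh wakes only
        d["health"] = "AWAKE"
        if d["lifetime_wake_count"] > d["max_wakes"]:
            d["health"] = "CHRONIC"
    else:
        # Exit 2 or anything unexpected: the check itself is broken.
        d["health"] = "SICK"
    return d
```

Returning a copy rather than mutating in place makes it easy to compare the old and new `health` values and decide whether an event needs to be logged.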

### check.sh Output Contract

On exit 1, stdout must be pre-analyzed, human-readable, ASCII only, no emoji.

Good:
```
ANOMALY: config.py regression sentinel
  - File: THE_SCRYER/.../config.py
  - Pattern found: THE_GRIMORORY
  - Status: ANOMALOUS -- path points to renamed/dead directory
  - Impact: All 4 kingdom-memory corpus paths will fail silently
```

Bad: raw stat output, file modification timestamps, JSON blobs.

---

## How to Deploy a CRYBABY

```bash
cd THE_SCRYER/ENGINE_ROOM/CRYBABIES/
bash crybaby-new.sh
```

The wizard prompts for:
- `SLUG` — uppercase, e.g. `PHONEBOOTH_WATCHER`
- One-sentence spec
- Watch path
- `death_after_checks` [30] — checks until clean death
- `watching_checks` [7] — consecutive clean checks to verify resolved
- `max_wakes` [5] — wakes before CHRONIC
- `model` [qwen3:8b]

After creation, edit `SLUG/check.sh` — that's where the watch lives.

**Non-interactive mode (scripting):**
```bash
bash crybaby-new.sh --non-interactive SLUG "spec" "/watch/path" 30 7 5 qwen3:8b
```

> **S186 fix:** `alert_channel` default corrected from `"ntfy"` to `"forge+ntfy"` in crybaby-new.sh source. New CRYBABYs created from S186 onward get the correct dual-channel default automatically.

### check.sh Template

```bash
#!/usr/bin/env bash
set -euo pipefail

# Exit 0 = healthy
# Exit 1 = anomaly (stdout = interpreted state)
# Exit 2 = script itself errored

WATCH_PATH="/path/to/watch"

if [[ ! -f "$WATCH_PATH" ]]; then
    echo "ANOMALY: Watch target missing"
    echo "  - Path: $WATCH_PATH"
    echo "  - Status: ANOMALOUS -- expected file does not exist"
    echo "  - Impact: [what breaks]"
    exit 1
fi

echo "healthy"
exit 0
```

---

## Alert Format

All alerts except the runner-crash alert (ntfy only) fire on two channels simultaneously:
- **ntfy.sh** push to phone (`$CRYBABY_NTFY_TOPIC`, default: `kingdom-crybaby`)
- **FORGE mailbox injection** to `THE_FORGE/@FORGE_CLAUDE_MAILBOX/buffer/CRYBABY_ALERT_*.md` (RAVEN format, so Forge Claude sees it on next boot)

| Condition | Title | Priority |
|-----------|-------|----------|
| 1 AWAKE | `CRYBABY AWAKE: SLUG` | default |
| 1 CHRONIC | `CRYBABY CHRONIC: SLUG` | high |
| 2-3 AWAKE | `CRYBABY AWAKE (N systems)` | high |
| 4+ AWAKE | `SYSTEM DEGRADED (N CRYBABYs)` | urgent |
| Clean retirement | `CRYBABY RETIRED: SLUG` | low |
| Retiring soon (3 checks left) | `CRYBABY RETIRING SOON: SLUG` | low |
| Weekly heartbeat | `CRYBABY SYSTEM ALIVE` | low |
| Runner crash | `CRYBABY RUNNER CRASHED` | urgent (ntfy only) |

The alert body always uses `Hint:`, not `Fix:`, because the LLM suggestion is a direction, not guaranteed truth. `(Verify before running)` is appended to every hint.
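
The AWAKE rows of the table amount to a small aggregation rule. Here is a hedged Python sketch of that rule; `awake_alert` and `format_hint` are hypothetical names, the runner itself is bash, and the exact layout of the hint line is an assumption based on the sentence above.

```python
def awake_alert(awake_slugs):
    """Map the list of AWAKE slugs to (title, priority) per the table."""
    n = len(awake_slugs)
    if n == 0:
        return None  # nothing woke up, so no AWAKE alert
    if n == 1:
        return (f"CRYBABY AWAKE: {awake_slugs[0]}", "default")
    if n <= 3:
        return (f"CRYBABY AWAKE ({n} systems)", "high")
    return (f"SYSTEM DEGRADED ({n} CRYBABYs)", "urgent")


def format_hint(likely_fix):
    """Label the LLM suggestion as a hint, never as a fix (assumed layout)."""
    return f"Hint: {likely_fix} (Verify before running)"
```

Collapsing many wakes into one title is what keeps a bad night from turning into a burst of separate phone notifications.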

---

## Deployed CRYBABYs (v1.2 — 6 live, S186)

### JOURNAL_PIPELINE_WATCHER
The founding CRYBABY. Five sentinels for the Session 124 class of failure (config.py pointing to dead path, journal sync stopped, Aeris sync dead). The `--explain` flag prints all five watches. Currently WATCHING (recovering from morning test).

### OPENDIA_WATCHER
Two-mode: if server on :5556, hit `/health` → check `chromeExtensionConnected`. If offline (expected at night), verify installation integrity — `server.js`, `extension/`, `chrome.alarms` keep-alive patch, `instagramDirectBypass` patch. Catches silent patch regressions after Chrome updates.

### RAVEN_WATCHER
Scans 4 Kingdom agent mailboxes (`buffer/`) for unread messages past threshold. URGENT_ prefix = 2h. Normal = 24h. Also checks `processed/` for daemon liveness (> 48h stale = daemon may be down). Implements Brandon's requirement: "confirm received, not just sent."
**S193 fix:** `@CLAUDE_TOWER_MAILBOX` removed from MAILBOXES array — Tower Claude processes its own mailbox; was causing zombie alerts when Tower wasn't running.
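
The two thresholds reduce to a prefix test on the message filename. This is a small Python sketch of the rule only; the actual check is a bash `check.sh`, the function names are hypothetical, and treating "past threshold" as strictly greater than the limit is an assumption.

```python
def threshold_hours(filename):
    """URGENT_ prefix gets the 2h limit; everything else gets 24h."""
    return 2 if filename.startswith("URGENT_") else 24


def is_overdue(filename, age_hours):
    # Assumption: "past threshold" means strictly older than the limit.
    return age_hours > threshold_hours(filename)
```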

### BACKUP_WATCHER
Reads `~/.kingdom-backup/last-success` receipt. If > 48h old → AWAKE. Also parses most recent `backup-*.log` for FAIL lines. Recurring failure pattern detection (3+ fails in `FAILURE_LOG_WINDOW` days = escalate).
**S193 fix:** `FAILURE_LOG_WINDOW` reduced from 7 to 3 — prevents pre-fix logs from poisoning the window after rclone fixes are deployed.

### SCRYER_REINDEX_WATCHER
Queries LanceDB directly (via kingdom-memory venv) for chunk count. Must be >= 7000. Checks `_versions/` freshness (< 25h). Guards against the "silent index corruption" pattern from Session 124. Currently SLEEPING at 7,994 chunks.

### TEST_CRYBABY — ARCHIVED (2026-03-11)
Integration test sentinel. Died 2026-03-11 after week-1 kink period. Lives in `archive/TEST_CRYBABY/`.

### OUTBOX_WATCHER — ARCHIVED (2026-03-13)
Born 2026-03-11. Watched RAVEN outbox for stuck messages. Retired 2026-03-13 after confirming RAVEN pipeline stable. Lives in `archive/OUTBOX_WATCHER/`.

### API_SPEND_WATCHER
- **KID:** `FORGE:LEDGER:API_SPEND_WATCHER|1.0:⌂:2026-03-13`
- **Spec:** Daily API spend watchdog — fires if any day > $20 USD in sentinel.db
- **Health:** AWAKE (expected — historical spend was high pre-OAuth proxy)
- **check_mode:** hybrid
- **Threshold rationale:** Post-OAuth proxy, Aeris shows $0; $20/day is a buffer for genuine leaks. Will likely transition to SLEEPING after ~3 days of OAuth-routed sessions confirm the pattern.

---

## Kingdom Integration

### sensory-ingest

`025-crybaby-runner.sh` is a thin wrapper in `scripts/ingest/`. It runs as part of the nightly sensory-ingest cycle, after other ingestion scripts complete.

### ntfy

Requires `curl` in PATH. Uses `https://ntfy.sh/{topic}` with curl subprocess pattern (same as PHONEBOOTH). Override topic: `export CRYBABY_NTFY_TOPIC=my-topic`.

### LLM (Ollama)

Requires Ollama running locally at `http://localhost:11434`. Default model: `qwen3:8b`.

If Ollama is offline: runner logs a warning, continues, LLM diagnosis falls back to raw check.sh output. No crash, no missed events.
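
The fallback behaviour can be sketched in Python as follows. How `crybaby-analyze.py` actually calls Ollama is not shown in this page, so `ask_ollama`, `diagnose`, and the prompt text are illustrative assumptions; the only external detail relied on is Ollama's `/api/generate` endpoint with `stream` set to false.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def ask_ollama(check_output, model="qwen3:8b", timeout=60):
    """Send the check output to a local Ollama server (assumed prompt)."""
    payload = json.dumps({
        "model": model,
        "prompt": "Name the issue in one sentence:\n" + check_output,
        "stream": False,
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def diagnose(check_output, ask_llm=ask_ollama):
    """Return (text, used_llm); fall back to raw check output if the LLM is down."""
    try:
        return ask_llm(check_output), True
    except OSError:
        # Connection refused, timeouts, and urllib's URLError are all OSError subclasses.
        return check_output, False
```

Because the check output is already human-readable, the fallback alert stays useful even without a diagnosis.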

---

## Known Constraints

1. **qwen2.5:1.5b** is the planned model but not currently pulled. Using `qwen3:8b` (Session 174 default). Update dna.json when 1.5b is pulled: `dna_set "model" "qwen2.5:1.5b"`. Note: BACKUP_WATCHER takes ~18s with qwen3:8b due to model load time.

2. **Single machine only.** CRYBABYs assume all paths are local. No cross-machine awareness.

3. **ntfy pushes can fail silently.** A curl failure is logged as WARN but not retried. If the Kingdom ntfy topic changes, update `CRYBABY_NTFY_TOPIC`.

4. **Max CRYBABY count.** No hard limit, but 20+ CRYBABYs would make aggregated alerts unwieldy. SYSTEM DEGRADED kicks in at 4+.

5. **check.sh has no timeout.** A blocking check.sh will block the runner indefinitely. Long-running checks should add their own `timeout` call.
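
For constraint 5, deadline enforcement could look like the Python sketch below. The runner is a bash script, so this only illustrates the idea. `run_check` is a hypothetical helper, the 30-second default mirrors the v2 roadmap item, and mapping a timeout to exit code 2 (so the CRYBABY lands in SICK) is an assumption.

```python
import subprocess


def run_check(cmd, timeout_s=30):
    """Run a check command with a deadline; return (exit_code, stdout)."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return 2, f"check timed out after {timeout_s}s"
```

A call such as `run_check(["bash", "SLUG/check.sh"])` would then return within the deadline even if the check blocks.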

---

## Shell Gotcha: set -e + (( )) patterns

**CRITICAL for check.sh authors.** In bash with `set -euo pipefail`:

```bash
# WRONG: (( expr )) returns exit status 1 when the expression is false or evaluates to 0.
# A bare (( count++ )) from 0 trips set -e, and an && list as the last command of a
# function or script leaves status 1 behind for the caller.
(( count > 0 )) && do_something

# CORRECT: an if statement consumes the return code safely
if (( count > 0 )); then do_something; fi

# WRONG: grep -c prints "0" for no matches AND exits 1, so || appends a second "0"
count=$(grep -c "FAIL" file || echo 0)   # gives "0\n0", arithmetic error

# ALSO WRONG: wc -l exits 0, but grep exits 1 on no match and pipefail
# fails the whole pipeline, so set -e kills the script silently
error_count=$(grep "ERROR" "$LOG" | wc -l | tr -d ' ')

# CORRECT: || true guards the pipeline exit on no match
error_count=$(grep "ERROR" "$LOG" 2>/dev/null | wc -l | tr -d ' ' || true)
# OR: { grep "ERROR" "$LOG" 2>/dev/null || true; } | wc -l | tr -d ' '
```

## crybaby-analyze.py: anomaly_output via stdin (S193)

**Do NOT pass anomaly_output as argv.** Multi-line check.sh output gets mangled as a bash positional argument.

```bash
# WRONG — multi-line $check_output gets word-split or truncated as argv[3]
python3 crybaby-analyze.py "$slug" "$model" "$check_output"

# CORRECT — pipe via stdin preserves newlines faithfully
echo "$check_output" | python3 crybaby-analyze.py "$slug" "$model"
```

In `crybaby-analyze.py`, read with `sys.stdin.read()`, not `sys.argv[3]`.
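
On the Python side, the reading step might look like this minimal sketch. `read_inputs` is a hypothetical function name, and the argument order (slug, then model) follows the bash call above; the real analyzer may structure this differently.

```python
def read_inputs(argv, stdin):
    """Parse `crybaby-analyze.py SLUG MODEL` with the check output on stdin."""
    if len(argv) < 3:
        raise SystemExit("usage: crybaby-analyze.py SLUG MODEL < check_output")
    slug, model = argv[1], argv[2]
    # Multi-line text arrives with its newlines intact.
    anomaly_output = stdin.read()
    return slug, model, anomaly_output
```

In the script itself this would be called as `read_inputs(sys.argv, sys.stdin)`; passing the streams in as parameters keeps the function easy to test.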

## v2 Roadmap

- [ ] check.sh timeout enforcement in runner (default 30s per check)
- [ ] `crybaby-status.sh` — pretty-print all CRYBABY states in cockpit format
- [ ] `crybaby-wake.sh` — force a specific CRYBABY awake for testing
- [ ] SQLite events log (instead of JSONL) for querying history across all CRYBABYs
- [ ] Pull qwen2.5:1.5b and update default model (faster, cheaper diagnosis)
- [ ] Cross-machine CRYBABYs via SSH check delegation
- [ ] AWAKE_SLUGS entry format: replace colon delimiter with JSON to prevent truncation on colons in diagnosis text

---

## File Locations (Canonical)

| File | Path |
|------|------|
| Runner | `THE_SCRYER/ENGINE_ROOM/CRYBABIES/crybaby-runner.sh` |
| Wizard | `THE_SCRYER/ENGINE_ROOM/CRYBABIES/crybaby-new.sh` |
| Analyzer | `THE_SCRYER/ENGINE_ROOM/CRYBABIES/crybaby-analyze.py` |
| Registry | `THE_SCRYER/ENGINE_ROOM/CRYBABIES/registry.json` |
| Ingest wrapper | `THE_SCRYER/ENGINE_ROOM/scripts/ingest/025-crybaby-runner.sh` |
| Archive | `THE_SCRYER/ENGINE_ROOM/CRYBABIES/archive/` |
| Journal watcher | `THE_SCRYER/ENGINE_ROOM/CRYBABIES/JOURNAL_PIPELINE_WATCHER/` |