Saga #02 — Heimdall's ears were too sharp

How the cluster's own tripwires took it down twice in one night, and why the watchman had to learn the difference between an army and the wind.

Symptom

Thirteen minutes after we exposed the cluster to the public internet, all four edges had quarantined themselves. No attacker. No announcement. No one knew the cluster existed yet. Background internet scanners (Shodan, Censys, the usual suspects) found the three decoy TCP ports (5433, 7379, 9200) within minutes and connected. Three connections from one scanner = three strikes = instant self-quarantine. Four nodes with public IPs = four quarantines in thirteen minutes. The cluster shut itself down before anyone had a chance to attack it.

What the honey ports are for

HyveGuard listens on three TCP ports that mimic real services — a Postgres secondary, a Redis secondary, an Elasticsearch node. Nothing legitimate should ever connect to them. On a WireGuard-mesh-only deployment (our production cluster), any connection means the attacker is already on the internal network. That's a strong signal. Three strikes and the node takes itself out of the threshold-signing group — a correct response to a confirmed intrusion.

On the public internet, though, those same ports get poked by automated scanners every hour of every day. The signal that means "you've been compromised" on a private mesh means "you exist on the internet" on a public IP.

Fix #1 — per-source dedup (bought three hours)

The first fix was obvious: deduplicate honey-port strikes by source IP. One scanner hitting all three ports counts as one strike, not three. A second probe from the same IP within an hour doesn't re-count. The quarantine chain only advances when distinct sources probe the ports.
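
A minimal sketch of that dedup rule, not the actual HyveGuard code. The class name, method names, and in-memory dict are all hypothetical. It also assumes that a source becomes countable again once its hour has elapsed, which the rule above implies but does not state.

```python
import time
from typing import Dict, Optional

DEDUP_WINDOW_SECONDS = 3600  # "within an hour"
STRIKE_THRESHOLD = 3         # "three strikes"


class HoneyStrikeCounter:
    """Hypothetical strike counter, deduplicated by source IP."""

    def __init__(self, window: float = DEDUP_WINDOW_SECONDS,
                 threshold: int = STRIKE_THRESHOLD) -> None:
        self.window = window
        self.threshold = threshold
        self.strikes = 0
        # Source IP -> timestamp of the last hit that counted as a strike.
        self._last_counted: Dict[str, float] = {}

    def record_hit(self, source_ip: str, port: int,
                   now: Optional[float] = None) -> bool:
        """Record one connection. Returns True once quarantine should fire."""
        now = time.time() if now is None else now
        last = self._last_counted.get(source_ip)
        # The port is deliberately ignored: one scanner sweeping all three
        # decoy ports is still a single strike.
        if last is None or now - last >= self.window:
            self._last_counted[source_ip] = now
            self.strikes += 1
        return self.strikes >= self.threshold


counter = HoneyStrikeCounter()
for port in (5433, 7379, 9200):
    counter.record_hit("198.51.100.7", port, now=0.0)
print(counter.strikes)  # one scanner, three ports, one strike
```

The sketch also shows the weakness: the threshold is still reachable by three unrelated sources, so dedup only slows the count down.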

This turned thirteen minutes of survival into three hours. Progress, not a solution. Three different scanners over three hours still accumulated three strikes. The cluster self-quarantined overnight: all four nodes, again.

Fix #2 — audit-only mode for public deployments

The real fix was to separate what the honey ports detect from what they do about it. On the challenge cluster, honey-port hits now:

Log every connection (operator visibility)
Fire a critical alert (operator notification)
Write a DAG entry (tamper-evident, permanent — Einherjar scoring still works)
Do not feed the quarantine strike chain

Quarantine still fires for signals that only a real attacker can produce: file canary touches (requires filesystem access), Merkle drift (requires binary tampering), and DNS canary trips (requires DNS hijack). Background internet noise can't trigger any of those.

The production cluster keeps the strict behaviour. An env flag (HYVEGUARD_HONEY_QUARANTINE) controls which mode runs.
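
The split between detecting and reacting can be sketched roughly as below. Only the flag name HYVEGUARD_HONEY_QUARANTINE comes from the deployment. The accepted values, the strict default when the flag is unset, the handler class, and the in-memory lists standing in for the log, alerting, and DAG sinks are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass, field
from typing import List, Mapping


def honey_quarantine_enabled(env: Mapping[str, str] = os.environ) -> bool:
    # Assumption: truthy strings mean strict mode, and an unset flag
    # falls back to strict (the production behaviour).
    value = env.get("HYVEGUARD_HONEY_QUARANTINE", "1")
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass
class HoneyHitHandler:
    """Hypothetical handler; the lists are stand-ins for real sinks."""
    quarantine_enabled: bool
    log: List[str] = field(default_factory=list)
    alerts: List[str] = field(default_factory=list)
    dag_entries: List[str] = field(default_factory=list)
    strike_sources: List[str] = field(default_factory=list)

    def handle(self, source_ip: str, port: int) -> None:
        event = f"honey-port hit from {source_ip} on port {port}"
        # Detection side: always runs, in both modes.
        self.log.append(event)          # operator visibility
        self.alerts.append(event)       # operator notification
        self.dag_entries.append(event)  # tamper-evident record
        # Reaction side: only strict mode feeds the strike chain.
        if self.quarantine_enabled:
            self.strike_sources.append(source_ip)


handler = HoneyHitHandler(quarantine_enabled=False)  # audit-only mode
handler.handle("198.51.100.7", 9200)
print(len(handler.dag_entries), len(handler.strike_sources))
```

In audit-only mode the hit is still logged, alerted on, and written to the DAG, but the strike list stays empty.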

Bonus bug — DNS canary cold-start

While investigating the overnight quarantine, we found a second self-DoS path: the DNS canary monitor treated resolver errors (timeout, SERVFAIL) as an "unexpected response", the same code path as a real DNS hijack. On a cold start with an unwarmed resolver cache, the first DNS check sometimes returns SERVFAIL instead of NXDOMAIN. Three canaries × one SERVFAIL each = three "trips" = quarantine. One node quarantined itself seven minutes after restart from this alone.

The fix distinguishes resolver errors (inconclusive: log as a warning, skip quarantine) from positive unexpected resolutions (an actual A record that shouldn't exist: fire the trip). We also extended the pre-first-check warm-up from two minutes to five.
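
The classification can be sketched as a three-way verdict instead of the old expected/unexpected split. This is an illustration under assumptions, not the monitor's real code: the enum names and helper functions are hypothetical, and the DNS lookup itself is left out so the example stays self-contained.

```python
from enum import Enum
from typing import Iterable


class DnsOutcome(Enum):
    NXDOMAIN = "nxdomain"  # expected: the canary name must not exist
    ANSWER = "answer"      # an actual record came back
    SERVFAIL = "servfail"  # resolver error
    TIMEOUT = "timeout"    # resolver error


class Verdict(Enum):
    OK = "ok"
    INCONCLUSIVE = "inconclusive"  # log as a warning, skip quarantine
    TRIP = "trip"                  # counts toward quarantine


def classify(outcome: DnsOutcome) -> Verdict:
    if outcome is DnsOutcome.NXDOMAIN:
        return Verdict.OK
    if outcome is DnsOutcome.ANSWER:
        # A positive resolution for a name that should not exist.
        return Verdict.TRIP
    # SERVFAIL and TIMEOUT say nothing about a hijack either way.
    return Verdict.INCONCLUSIVE


def count_trips(outcomes: Iterable[DnsOutcome]) -> int:
    return sum(1 for o in outcomes if classify(o) is Verdict.TRIP)


# The cold-start case: three canaries, one SERVFAIL each.
cold_start = [DnsOutcome.SERVFAIL] * 3
print(count_trips(cold_start))
```

With this split, the cold-start scenario produces three warnings and zero trips, so it can no longer reach the quarantine threshold on its own.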

Lesson

Same lesson as Saga #01 with a sharper edge: a system that auto-quarantines on ambient noise isn't protecting itself — it's doing the attacker's job for them. Heimdall needs to distinguish the sound of an army crossing Bifrost from the sound of the wind rattling the bridge. The former demands the Gjallarhorn. The latter demands patience.

For the cluster you're attacking: the honey ports are still listening, still recording. Every connection goes into the audit DAG. If you're going for Einherjar, those records are what disqualify you. They just don't take the bridge down anymore.

Operator's note

We shipped the per-source dedup, went to bed, and woke up to a dead cluster. Shipped the audit-only mode, redeployed, and it's been stable since. Sometimes the second fix is the one that sticks — and the first fix is the one that teaches you why you need the second.