ADR-004: Degraded-Mode & Failover Policy

Decision record defining node degraded-mode behavior and failover policies when central is unavailable.

Metadata

ADR ID: ADR-004
Title: Node Behavior During Central Aggregator Outage & Degraded-Mode Policy
Status: Proposed
Date: 2026-03-15 (proposed)
Owner: Operations/Reliability lead
Target Decision Date: 2026-04-12
Relates to: System-Gaps-Deferred#failure-domain-and-degraded-mode-policy

The architecture assumes nodes can operate independently if central becomes unavailable (System-OperationalFlow), and Node-ShutdownRecovery mentions "graceful degradation". However, "degraded mode" itself is never defined. The open questions are:

Trigger criteria: When does a node enter degraded mode? (e.g., central unreachable for 5s? 30s?)
Fallback priorities: If central is unavailable, which operations are critical vs. deferrable?
Data handling: Should the node buffer decoded frames locally, or discard them to stay responsive?
Operator notification: How does the operator become aware of degradation? (UI alert, log, email?)
Recovery procedure: How does the node leave degraded mode? (Manual operator command, or automatically after N minutes?)
Cascading failures: If multiple systems fail (central + GNSS), how do we prioritize?

Current State

System-OperationalFlow mentions "Loss of Central" but defines no fallback logic
Node-ShutdownRecovery covers shutdown, not gradual degradation
Prototypes do not implement degraded mode

Why This Matters

Mission continuity: Rocket data is too valuable to lose because of a brief central outage
Operator confidence: Operator must know the node is still collecting data even if central is unreachable
System resilience: Clear degradation policy prevents cascading failures
Recovery clarity: Operator needs clear "green light" that system is back to nominal

Deferred Decision Options

Option A: Node Buffers & Retransmits (Optimistic)

Approach: Node continues acquiring and decoding frames when central is unavailable. Decoded frames are buffered to disk. When central recovers, node automatically retransmits buffered data.

Pros:

No data loss: every frame is eventually delivered
Automatic recovery: operator doesn't need to manually intervene
Transparent to frontend: UI shows "brief delay" during central recovery, not "missing data"
Good for short outages (seconds to minutes)

Cons:

Disk usage grows with outage length (thousands of buffered frames) unless the buffer is capped
Risk of redelivering duplicates if resynchronization fails partway through a replay
Central must handle out-of-order frames gracefully
Long outages (hours) may exhaust disk, so old frames have to be dropped anyway

Implementation:

Node: If central is unreachable for >5 seconds, open a SQLite buffer at /var/rocketry/frame_buffer.db
Node: Append decoded frames to buffer; keep ~30 min of data (disk limit)
Node: Periodically check if central is back (every 10 seconds)
Node: When central responds, replay buffered frames in order, then clear the buffer
Central: Track frame IDs to detect duplicates; merge out-of-order batches
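
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the class names FrameBuffer and CentralIngest, the table layout, and the send callback are all assumptions made for the example. Only the 30-minute retention window and the SQLite file location come from this ADR, and the sketch defaults to an in-memory database so it runs anywhere.

```python
import sqlite3
import time

RETENTION_S = 30 * 60  # from this ADR: keep ~30 min of data


class FrameBuffer:
    """Hypothetical node-side buffer for decoded frames (Option A sketch)."""

    def __init__(self, path=":memory:"):
        # The ADR proposes /var/rocketry/frame_buffer.db; ":memory:" is
        # used here only to keep the sketch self-contained.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS frames ("
            " frame_id INTEGER PRIMARY KEY,"
            " captured_at REAL NOT NULL,"
            " payload BLOB NOT NULL)"
        )

    def append(self, frame_id, payload, now=None):
        now = time.time() if now is None else now
        with self.db:  # commits on success, rolls back on error
            # OR IGNORE makes a repeated append of the same frame a no-op.
            self.db.execute(
                "INSERT OR IGNORE INTO frames VALUES (?, ?, ?)",
                (frame_id, now, payload),
            )
            # Enforce the retention window by dropping the oldest frames.
            self.db.execute(
                "DELETE FROM frames WHERE captured_at < ?",
                (now - RETENTION_S,),
            )

    def replay(self, send):
        """Send frames in frame_id order; delete each only after send() succeeds."""
        rows = self.db.execute(
            "SELECT frame_id, payload FROM frames ORDER BY frame_id"
        ).fetchall()
        sent = 0
        for frame_id, payload in rows:
            if not send(frame_id, payload):
                break  # central is gone again; keep the rest buffered
            with self.db:
                self.db.execute(
                    "DELETE FROM frames WHERE frame_id = ?", (frame_id,)
                )
            sent += 1
        return sent

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM frames").fetchone()[0]


class CentralIngest:
    """Hypothetical central-side receiver that ignores duplicate frame IDs."""

    def __init__(self):
        self.frames = {}

    def receive(self, frame_id, payload):
        # setdefault keeps the first copy, so redelivery is harmless.
        self.frames.setdefault(frame_id, payload)
        return True

    def ordered(self):
        # Sorting by frame ID merges out-of-order batches.
        return [self.frames[k] for k in sorted(self.frames)]
```

Deleting a frame only after send() reports success means a replay interrupted by a second outage resumes where it stopped, at the cost of possibly resending the one frame in flight. That is why the central side deduplicates by frame ID in this sketch.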

Option B: Graceful Degrade & Shed Load (Realistic)

Approach: Node degrades functionality if central is unreachable. Instead of buffering all frames, node:

Keeps raw DSP stream (beam power, signal detection)
Drops structured decoding (payload parsing, frame assembly), which requires schema from central
Keeps basic telemetry (signal strength, antenna temp, CPU load)
Drops commands from central (config updates, refined antenna-steering targets)
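
One way to express the keep/drop list above is a static table mapping each link state to the subsystems that stay active. This is a sketch only: the LinkState enum, the subsystem names, and is_enabled are hypothetical names chosen for illustration, not identifiers from the existing codebase.

```python
from enum import Enum


class LinkState(Enum):
    NOMINAL = "nominal"
    DEGRADED = "degraded"


# Subsystems that remain active in each state (names are assumptions).
ACTIVE = {
    LinkState.NOMINAL: {
        "raw_dsp",
        "structured_decoding",
        "basic_telemetry",
        "central_commands",
    },
    # Option B: keep raw DSP and basic telemetry, shed the rest.
    LinkState.DEGRADED: {"raw_dsp", "basic_telemetry"},
}


def is_enabled(subsystem, state):
    """Return True if the subsystem should run in the given link state."""
    return subsystem in ACTIVE[state]
```

Keeping the policy in one table makes the shed list easy to review and change, since subsystems check is_enabled rather than each carrying their own outage logic.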