ADR-004: Degraded-Mode & Failover Policy
Decision record defining node degraded-mode behavior and failover policies when central is unavailable.
Metadata
- ADR ID: ADR-004
- Title: Node Behavior During Central Aggregator Outage & Degraded-Mode Policy
- Status: Proposed
- Date: 2026-03-15 (proposed)
- Owner: Operations/Reliability lead
- Target Decision Date: 2026-04-12
- Relates to: System-Gaps-Deferred#failure-domain-and-degraded-mode-policy
Problem / Context
The architecture assumes nodes can operate independently if central becomes unavailable (System-OperationalFlow) and mentions "graceful degradation" in Node-ShutdownRecovery. However, "degraded mode" itself is never defined:
- Trigger criteria: When does a node enter degraded mode? (e.g., central unreachable for 5s? 30s?)
- Fallback priorities: If central is unavailable, which operations are critical vs. deferrable?
- Data handling: Should the node buffer decoded frames locally, or discard them to stay responsive?
- Operator notification: How does the operator become aware of degradation? (UI alert, log, email?)
- Recovery procedure: How does the node leave degraded mode? (Manual operator command, or automatic after N minutes?)
- Cascading failures: If multiple systems fail (central + GNSS), how do we prioritize?
Current State
- System-OperationalFlow mentions "Loss of Central" but no fallback logic
- Node-ShutdownRecovery covers shutdown, not gradual degradation
- Prototypes do not implement degraded mode
Why This Matters
- Mission continuity: Rocket data is too valuable to lose because central had a hiccup
- Operator confidence: Operator must know node is still collecting data even if central is unreachable
- System resilience: Clear degradation policy prevents cascading failures
- Recovery clarity: Operator needs clear "green light" that system is back to nominal
Deferred Decision Options
Option A: Node Buffers & Retransmits (Optimistic)
Approach: Node continues acquiring and decoding frames when central is unavailable. Decoded frames are buffered to disk. When central recovers, node automatically retransmits buffered data.
Pros:
- No data loss: every frame is eventually delivered
- Automatic recovery: operator doesn't need to manually intervene
- Transparent to frontend: UI shows "brief delay" during central recovery, not "missing data"
- Good for short outages (seconds to minutes)
Cons:
- Disk usage can grow without bound unless capped (buffering thousands of frames)
- Risk of redelivering duplicates if synchronization goes wrong
- Central must handle out-of-order frames gracefully
- Long outages (hours) may exhaust disk → have to drop old frames anyway
Implementation:
- Node: If central unreachable for >5 seconds, open SQLite buffer on /var/rocketry/frame_buffer.db
- Node: Append decoded frames to buffer; keep ~30 min of data (disk limit)
- Node: Periodically check if central is back (every 10 seconds)
- Node: When central responds, replay buffered frames in order + clear buffer
- Central: Track frame IDs to detect duplicates; merge out-of-order batches
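The implementation steps above could be sketched roughly as follows. This is a minimal illustration, not a settled design: the table schema, the class and function names (FrameBuffer, CentralIngest, should_buffer), and the `send` callback are all hypothetical, and it assumes frame IDs increase monotonically so that ordering by ID equals capture order. The 5-second threshold and ~30-minute retention come from the bullets above. The 10-second probe loop is omitted.

```python
import sqlite3

UNREACHABLE_THRESHOLD_S = 5.0  # from this ADR: buffer after >5 s without central
RETENTION_S = 30 * 60          # from this ADR: keep ~30 min of data


def should_buffer(last_ok_at, now):
    """True once central has been unreachable for longer than the threshold."""
    return (now - last_ok_at) > UNREACHABLE_THRESHOLD_S


class FrameBuffer:
    """Time-bounded SQLite buffer for decoded frames (hypothetical schema)."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS frames ("
            " frame_id INTEGER PRIMARY KEY,"
            " captured_at REAL NOT NULL,"
            " payload BLOB NOT NULL)"
        )

    def append(self, frame_id, captured_at, payload):
        with self.db:  # commits both statements as one transaction
            # INSERT OR IGNORE makes a repeated append of the same ID a no-op.
            self.db.execute(
                "INSERT OR IGNORE INTO frames VALUES (?, ?, ?)",
                (frame_id, captured_at, payload),
            )
            # Enforce the retention window by dropping frames that are too old.
            self.db.execute(
                "DELETE FROM frames WHERE captured_at < ?",
                (captured_at - RETENTION_S,),
            )

    def replay(self, send):
        """Send frames in ID order; delete each one only after send() succeeds."""
        rows = self.db.execute(
            "SELECT frame_id, captured_at, payload FROM frames ORDER BY frame_id"
        ).fetchall()
        sent = 0
        for frame_id, captured_at, payload in rows:
            if not send(frame_id, captured_at, payload):
                break  # central dropped again; the rest stays buffered
            with self.db:
                self.db.execute(
                    "DELETE FROM frames WHERE frame_id = ?", (frame_id,)
                )
            sent += 1
        return sent


class CentralIngest:
    """Central-side duplicate detection keyed on frame ID (in-memory stand-in)."""

    def __init__(self):
        self.frames = {}

    def ingest(self, frame_id, captured_at, payload):
        # A duplicate is acknowledged but not stored a second time.
        self.frames.setdefault(frame_id, (captured_at, payload))
        return True

    def ordered(self):
        # Merging out-of-order batches reduces to sorting by frame ID.
        return [self.frames[k] for k in sorted(self.frames)]
```

On a node this would be opened as `FrameBuffer("/var/rocketry/frame_buffer.db")`. Deleting a frame only after `send` reports success means an interrupted replay can resend a frame, which is why the central side has to tolerate duplicates.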
Option B: Graceful Degrade & Shed Load (Realistic)
Approach: Node degrades functionality if central is unreachable. Instead of buffering all frames, node:
- Keeps raw DSP stream (beam power, signal detection)
- Drops structured decoding (payload parsing, frame assembly; requires schema from central)
- Keeps basic telemetry (signal strength, antenna temp, CPU load)
- Drops commands from central (config updates, refined antenna-steering targets)
Pros:
- Bounded resource usage: no unbounded buffering
- Clear operator comprehension: "I can still see signal, but no message decoding"
- Matches user expectations: ground station lost comms, but antenna still tracking
- Simple fallback logic: raw data is always available
Cons:
- Some data loss: structured decoding unavailable during outage
- Operator must manually re-enable structured decoding after central recovers
- More complex implementation: need to identify which operations require central
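As a rough illustration of how Option B's keep/drop rules might be encoded, the sketch below classifies the four operation groups named in the approach and models the manual re-enable step listed under Cons. The names Operation, NodeMode, and KEPT_WHEN_DEGRADED are hypothetical, and the trigger for entering degraded mode is left to the caller because this ADR has not decided it.

```python
from enum import Enum


class Operation(Enum):
    RAW_DSP_STREAM = "raw_dsp_stream"            # beam power, signal detection
    BASIC_TELEMETRY = "basic_telemetry"          # signal strength, temp, CPU load
    STRUCTURED_DECODING = "structured_decoding"  # needs schema from central
    CENTRAL_COMMANDS = "central_commands"        # config updates, steering targets


# Operations Option B keeps running while central is unreachable.
KEPT_WHEN_DEGRADED = frozenset(
    {Operation.RAW_DSP_STREAM, Operation.BASIC_TELEMETRY}
)


class NodeMode:
    """Tracks degraded state and which operations are currently allowed."""

    def __init__(self):
        self.degraded = False
        self.decoding_enabled = True

    def enter_degraded(self):
        self.degraded = True
        self.decoding_enabled = False  # structured decoding is shed

    def central_recovered(self):
        # Commands resume automatically; decoding waits for the operator.
        self.degraded = False

    def operator_enable_decoding(self):
        # Refused while still degraded, since the schema source is unavailable.
        if not self.degraded:
            self.decoding_enabled = True
        return self.decoding_enabled

    def is_allowed(self, op):
        if self.degraded:
            return op in KEPT_WHEN_DEGRADED
        if op is Operation.STRUCTURED_DECODING:
            return self.decoding_enabled
        return True
```

Keeping the allow-list in one place makes the "which operations require central" question explicit and reviewable, which is the main implementation cost noted above.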