Rocket Telemetry Project Docs

ADR-004: Degraded-Mode & Failover Policy

Decision record defining node degraded-mode behavior and failover policies when central is unavailable.

Metadata


Problem / Context

The architecture assumes nodes can operate independently if central becomes unavailable (System-OperationalFlow) and mentions "graceful degradation" in Node-ShutdownRecovery. However, "degraded mode" itself is never precisely defined. The open questions are:

  1. Trigger criteria: When does a node enter degraded mode? (e.g., central unreachable for 5s? 30s?)
  2. Fallback priorities: If central is unavailable, which operations are critical vs. deferrable?
  3. Data handling: Should the node buffer decoded frames locally, or discard them to stay responsive?
  4. Operator notification: How does the operator become aware of degradation? (UI alert, log, email?)
  5. Recovery procedure: How does the node leave degraded mode? (Manual operator command, or automatically after N minutes?)
  6. Cascading failures: If multiple systems fail (central + GNSS), how do we prioritize?

Current State

Why This Matters

  • Mission continuity: Rocket data is too valuable to lose because central was briefly unavailable
  • Operator confidence: Operator must know node is still collecting data even if central is unreachable
  • System resilience: Clear degradation policy prevents cascading failures
  • Recovery clarity: Operator needs clear "green light" that system is back to nominal

Deferred Decision Options

Option A: Node Buffers & Retransmits (Optimistic)

Approach: Node continues acquiring and decoding frames when central is unavailable. Decoded frames are buffered to disk. When central recovers, node automatically retransmits buffered data.

Pros:

  • No data loss: every frame is eventually delivered
  • Automatic recovery: operator doesn't need to manually intervene
  • Transparent to frontend: UI shows "brief delay" during central recovery, not "missing data"
  • Good for short outages (seconds to minutes)

Cons:

  • Disk usage can grow without bound unless capped (thousands of buffered frames)
  • Risk of redelivering duplicates if resynchronization is interrupted or repeated
  • Central must handle out-of-order frames gracefully
  • Long outages (hours) may exhaust disk, so old frames have to be dropped anyway

Implementation:

  • Node: If central is unreachable for >5 seconds, open a SQLite buffer at /var/rocketry/frame_buffer.db
  • Node: Append decoded frames to the buffer; keep ~30 min of data (disk limit)
  • Node: Periodically check whether central is back (every 10 seconds)
  • Node: When central responds, replay buffered frames in order, then clear the buffer
  • Central: Track frame IDs to detect duplicates; merge out-of-order batches
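
The buffering and replay steps above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the table layout, the class names FrameBuffer and CentralDedup, and the send callback are all hypothetical, and the sketch assumes frame IDs are unique and increase monotonically. It uses an in-memory database so it runs anywhere; the ADR's proposed path would be passed in instead.

```python
import sqlite3
import time


class FrameBuffer:
    """Node side: disk-backed buffer for decoded frames (hypothetical schema)."""

    def __init__(self, path=":memory:", retention_s=30 * 60):
        # The ADR proposes /var/rocketry/frame_buffer.db; ":memory:" keeps the sketch portable.
        self.db = sqlite3.connect(path)
        self.retention_s = retention_s  # ~30 min of data, per the ADR
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS frames ("
            "frame_id INTEGER PRIMARY KEY, ts REAL NOT NULL, payload BLOB NOT NULL)"
        )

    def append(self, frame_id, payload, now=None):
        now = time.time() if now is None else now
        with self.db:  # commits on success, rolls back on error
            # OR IGNORE makes re-appending the same frame ID a no-op.
            self.db.execute(
                "INSERT OR IGNORE INTO frames VALUES (?, ?, ?)",
                (frame_id, now, payload),
            )
            # Enforce the retention window by dropping the oldest frames.
            self.db.execute(
                "DELETE FROM frames WHERE ts < ?", (now - self.retention_s,)
            )

    def replay(self, send):
        """Send frames in frame_id order; delete each only after send() acknowledges it."""
        rows = self.db.execute(
            "SELECT frame_id, payload FROM frames ORDER BY frame_id"
        ).fetchall()
        sent = 0
        for frame_id, payload in rows:
            if not send(frame_id, payload):
                break  # central dropped out again; keep the remainder buffered
            with self.db:
                self.db.execute("DELETE FROM frames WHERE frame_id = ?", (frame_id,))
            sent += 1
        return sent

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM frames").fetchone()[0]


class CentralDedup:
    """Central side: accept each frame ID once, tolerate out-of-order arrival."""

    def __init__(self):
        self.frames = {}

    def receive(self, frame_id, payload):
        # A duplicate is acknowledged but not stored a second time.
        self.frames.setdefault(frame_id, payload)
        return True

    def ordered(self):
        return [self.frames[k] for k in sorted(self.frames)]
```

Deleting each frame only after it is acknowledged means an interrupted replay can resend a frame, which is why the central side deduplicates by frame ID rather than relying on the node to send exactly once.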

Option B: Graceful Degrade & Shed Load (Realistic)

Approach: Node degrades functionality if central is unreachable. Instead of buffering all frames, node:

  • Keeps raw DSP stream (beam power, signal detection)
  • Drops structured decoding (payload parsing, frame assembly), which requires the schema from central
  • Keeps basic telemetry (signal strength, antenna temp, CPU load)
  • Drops commands from central (config updates, refined antenna-steering targets)

Pros:

  • Bounded resource usage: no unbounded buffering
  • Clear operator comprehension: "I can still see signal, but no message decoding"
  • Matches user expectations: ground station lost comms, but antenna still tracking
  • Simple fallback logic: raw data is always available

Cons:

  • Some data loss: structured decoding unavailable during outage
  • Operator must manually re-enable structured decoding after central recovers
  • More complex implementation: need to identify which operations require central
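
Option B's keep/drop split can be pictured as a small policy table plus a mode flag. The sketch below is illustrative only: the operation names, the DegradedModePolicy class, and its method names are hypothetical, and it assumes the manual re-enable step listed under Cons (recovery of central alone does not restore nominal mode).

```python
from enum import Enum


class Mode(Enum):
    NOMINAL = "nominal"
    DEGRADED = "degraded"


# Whether each operation stays enabled while central is unreachable.
# Keys are hypothetical names for the four items in the Option B list.
KEEP_WHEN_DEGRADED = {
    "raw_dsp_stream": True,        # beam power, signal detection
    "basic_telemetry": True,       # signal strength, antenna temp, CPU load
    "structured_decoding": False,  # needs the schema from central
    "central_commands": False,     # config updates, steering targets
}


class DegradedModePolicy:
    def __init__(self):
        self.mode = Mode.NOMINAL
        self.central_reachable = True

    def on_central_lost(self):
        self.central_reachable = False
        self.mode = Mode.DEGRADED

    def on_central_recovered(self):
        # Only records reachability; the node stays degraded until
        # the operator explicitly restores nominal mode.
        self.central_reachable = True

    def operator_restore(self):
        """Manual re-enable; refused while central is still unreachable."""
        if not self.central_reachable:
            return False
        self.mode = Mode.NOMINAL
        return True

    def is_enabled(self, operation):
        if self.mode is Mode.NOMINAL:
            return True
        return KEEP_WHEN_DEGRADED[operation]


policy = DegradedModePolicy()
policy.on_central_lost()
print(policy.is_enabled("raw_dsp_stream"), policy.is_enabled("structured_decoding"))
```

Keeping the table explicit addresses the last point under Cons: each operation has to be classified once as central-dependent or not, and the classification is visible in one place.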
