N-ADR-002: Decoder Failure Behavior & Backpressure Policy
Decision record for decoder backpressure policy: drop, buffer, or block strategies and metrics.
Metadata
- ADR ID: N-ADR-002
- Title: Node Decoder Backpressure Policy (Drop, Buffer, or Block on Overload)
- Status: Proposed
- Date: 2026-03-15 (proposed)
- Owner: Node performance lead
- Target Decision Date: 2026-04-05
- Relates to: Node-Gaps-Deferred#decoder-failure-behavior-and-backpressure
Problem / Context
The node decodes frames at a high rate (Node-DataFlow-Timeline shows millisecond-scale lock acquisitions). If the decoder falls behind (e.g., CPU overload, network congestion on the uplink to central), the incoming frame buffer will eventually overflow. The architecture defines no policy for this scenario, which leaves the following questions open:
- Backpressure trigger: At what queue depth do we activate mitigation? (50% full, 90% full, first-frame?)
- Drop policy: If dropping frames, oldest-first (FIFO) or priority-based? (e.g., drop weak signals first)
- Buffering strategy: If buffering, how long (~ms, ~sec, ~min)? On disk or RAM?
- Blocking strategy: Should node throttle DSP input to match decoder output? (flow control)
- Error signaling: How does decoder communicate backpressure to DSP and to central?
- Recovery: After overload clears, how does node recover? (restart, resume from checkpoint?)
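The trigger and recovery questions above can be made concrete with a high/low watermark pair. The following is a minimal sketch only, not a proposed decision: the class name `WatermarkTrigger`, the 90% activation and 50% release thresholds, and the capacity of 1000 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WatermarkTrigger:
    """Hysteresis trigger: activates at the high watermark, releases at the low one."""
    capacity: int
    high: float = 0.9   # assumed activation threshold (fraction of capacity)
    low: float = 0.5    # assumed release threshold (fraction of capacity)
    active: bool = False

    def update(self, depth: int) -> bool:
        """Feed the current queue depth; return whether mitigation is active."""
        fill = depth / self.capacity
        if not self.active and fill >= self.high:
            self.active = True
        elif self.active and fill <= self.low:
            self.active = False
        return self.active


trigger = WatermarkTrigger(capacity=1000)
# Queue depth rises past 90%, then drains back below 50%.
states = [trigger.update(d) for d in (100, 899, 900, 700, 501, 500, 100)]
print(states)  # [False, False, True, True, True, False, False]
```

Using two thresholds instead of one keeps mitigation from toggling on and off when the queue depth hovers around a single trigger point.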
Current State
- Node-DataFlow-Timeline assumes decoder keeps up (no overload scenario)
- Node-ResourceBudgets allocates CPU but doesn't specify headroom
- Prototypes use simple in-memory queue with no backpressure
Why This Matters
- Data integrity: Controlled dropping is better than crashing from memory exhaustion
- Operator awareness: Operator must know if frames were dropped (affects mission analysis)
- Performance: A clear policy prevents the decoder from becoming a bottleneck
- Fairness: Multi-target missions should prioritize high-SNR / high-priority targets
Deferred Decision Options
Option A: Drop Excess Frames (Simple, Lossy)
Approach: The decoder has a fixed-size queue (~1000 frames, ~500 MB). When the queue is full, each incoming frame evicts the oldest queued frame.
Rules:
- Queue capacity: 1000 frames (configurable)
- When full: Drop oldest frame in queue (FIFO)
- Signal: Log a warning at most once per second; report the drop count to central as a metric
- Recovery: Automatic (no manual step)
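As a rough sizing check on these figures, the sketch below derives the implied average frame size and estimates how long an empty queue takes to fill during overload. The arrival and decode rates are hypothetical values chosen only to show the arithmetic; the helper name `seconds_until_full` is likewise an assumption.

```python
QUEUE_FRAMES = 1000          # queue capacity from the rules above
QUEUE_BYTES = 500 * 10**6    # ~500 MB budget from the approach above

# Implied average frame size: 500 MB / 1000 frames = 500 kB per frame.
bytes_per_frame = QUEUE_BYTES / QUEUE_FRAMES


def seconds_until_full(capacity, arrival_fps, decode_fps):
    """Time for an empty queue to fill; None if the decoder keeps up."""
    surplus = arrival_fps - decode_fps
    if surplus <= 0:
        return None
    return capacity / surplus


# Hypothetical rates, for illustration only.
print(bytes_per_frame)                                # 500000.0
print(seconds_until_full(QUEUE_FRAMES, 2000, 1500))   # 2.0
print(seconds_until_full(QUEUE_FRAMES, 1000, 1500))   # None
```

Under these assumed rates the queue absorbs about two seconds of overload before the first drop, which is the kind of number the capacity setting should be checked against.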
Pros:
- Simplest implementation: circular buffer
- Bounded memory: never exceeds configured size
- Recovery automatic: as load drops, the queue drains and frames stop being dropped
- Low latency: no buffering overhead
Cons:
- Data loss: drops are unrecoverable
- Operator may not notice losses (if rare)
- No intelligence: might drop an important frame to keep a trivial one
Implementation:
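import logging
import time
from collections import deque

log = logging.getLogger(__name__)

class DecoderQueue:
    """Fixed-size frame queue that evicts the oldest frame when full."""

    def __init__(self, max_size=1000):
        # A deque with maxlen discards from the left (oldest) on append when full.
        self.queue = deque(maxlen=max_size)
        self.drop_count = 0  # drops since the last warning
        self.last_drop_log = time.monotonic()

    def append(self, frame):
        if len(self.queue) == self.queue.maxlen:
            # The append below will evict the oldest frame; count it as a drop.
            self.drop_count += 1
            now = time.monotonic()
            # Rate-limit the warning to at most once per second.
            if now - self.last_drop_log > 1.0:
                log.warning("Decoder backpressure: %d frames dropped", self.drop_count)
                self.last_drop_log = now
                self.drop_count = 0
        self.queue.append(frame)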
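One way to address the concern that rare losses go unnoticed is a lifetime drop counter that is never reset and can be exported to central as a metric. The sketch below is a hypothetical variant, not part of the proposed implementation: `CountingDecoderQueue`, `total_dropped`, and `pop` are illustrative names, and the export mechanism is left out.

```python
from collections import deque


class CountingDecoderQueue:
    """Hypothetical variant: bounded queue with a lifetime drop counter."""

    def __init__(self, max_size=1000):
        self.queue = deque(maxlen=max_size)
        self.total_dropped = 0  # never reset; suitable for export as a metric

    def append(self, frame):
        if len(self.queue) == self.queue.maxlen:
            self.total_dropped += 1  # the append below evicts the oldest frame
        self.queue.append(frame)

    def pop(self):
        """Return the oldest queued frame, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None


# Push five frames into a three-slot queue: the two oldest are evicted.
q = CountingDecoderQueue(max_size=3)
for frame_id in range(5):
    q.append(frame_id)

print(list(q.queue))    # [2, 3, 4]
print(q.total_dropped)  # 2
```

A monotonically increasing counter lets central compute drop rates over any window, even if individual per-second warnings are missed.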