Resilience Framework
Unified fault-mitigation foundation for guarded calls — propagating Deadline, outcome taxonomy, execution seam, composable strategy library, pipeline/registry, and the broker/Obsidian adapters.
Details
The resilience framework (work_buddy/resilience/) is the unified foundation for fault mitigation across guarded calls, meaning any call that can be slow or fail. It supersedes three independently hand-rolled protections (the inference broker's admission control, the Obsidian bridge's @bridge_retry, and the capability registry's DISABLED_CAPABILITIES) with one model: standard patterns, one vocabulary, shared observability. The design follows Polly v8 and resilience4j.
Core concepts
- Outcome / OutcomeKind: a guarded call returns an Outcome, which holds a value XOR an error plus an OutcomeKind (the outcome taxonomy): SUCCESS, TIMEOUT, REJECTED (shed before execution), TRANSIENT_FAILURE, TERMINAL_FAILURE, PARTIAL. The kind carries is_retryable / counts_toward_circuit_trip, so retry and circuit-breaker logic never re-inspect the underlying exception.
- Deadline: an absolute monotonic stop-time that propagates down a nested call tree; each layer clamps its timeout to the remaining budget and never extends a parent. derive_attempt() yields per-attempt sub-budgets.
- ResilienceContext: per-call state consisting of the operation key, the Deadline, a call identity (with a parent link), and a typed-key property bag. It is threaded explicitly and also published in a ContextVar, so synchronous code in a worker thread (via asyncio.to_thread, which snapshots context) can read it.
- The execution seam: guarded_call(operation_key, fn, ...) runs fn through a chain of strategies, classifies the result, emits telemetry, and returns an Outcome. A guarded call never raises to signal a classified failure, with two deliberate special cases: a failure signalled by return value is classified through result_classifier, and a declared passthrough exception (a control-flow signal) is re-raised untouched (e.g. ObsidianPostWriteUncertain, which the gateway's verify-then-decide path needs as a raised exception).
- ResilienceStrategy: the callback-wrapper protocol every primitive implements: execute(nxt, ctx) -> Outcome.
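The outcome taxonomy and the deadline clamping rule can be sketched as follows. This is a minimal illustration, not the framework's code: the class and method names are borrowed from the text, but the field names, the signatures (including the `now` parameter and the argument to `derive_attempt`), and the choice of which kinds are retryable or trip the circuit are all assumptions.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class OutcomeKind(Enum):
    SUCCESS = "success"
    TIMEOUT = "timeout"
    REJECTED = "rejected"  # shed before execution
    TRANSIENT_FAILURE = "transient_failure"
    TERMINAL_FAILURE = "terminal_failure"
    PARTIAL = "partial"

    @property
    def is_retryable(self) -> bool:
        # Assumed policy: only timeouts and transient failures are retried.
        return self in (OutcomeKind.TIMEOUT, OutcomeKind.TRANSIENT_FAILURE)

    @property
    def counts_toward_circuit_trip(self) -> bool:
        # Assumed policy: real failures count; load shedding does not.
        return self in (
            OutcomeKind.TIMEOUT,
            OutcomeKind.TRANSIENT_FAILURE,
            OutcomeKind.TERMINAL_FAILURE,
        )


@dataclass(frozen=True)
class Outcome:
    kind: OutcomeKind
    value: Any = None
    error: Optional[BaseException] = None

    def __post_init__(self) -> None:
        # Value XOR error: holding both is a programming mistake.
        if self.value is not None and self.error is not None:
            raise ValueError("an Outcome holds a value or an error, not both")


@dataclass(frozen=True)
class Deadline:
    stop_at: float  # absolute reading of time.monotonic()

    @classmethod
    def after(cls, seconds: float, now: Optional[float] = None) -> "Deadline":
        start = time.monotonic() if now is None else now
        return cls(start + seconds)

    def remaining(self, now: Optional[float] = None) -> float:
        current = time.monotonic() if now is None else now
        return max(0.0, self.stop_at - current)

    def clamp(self, timeout: float, now: Optional[float] = None) -> float:
        # A layer may shorten its own timeout but never extend the parent.
        return min(timeout, self.remaining(now))

    def derive_attempt(self, per_attempt: float, now: Optional[float] = None) -> "Deadline":
        # Per-attempt sub-budget, still bounded by the parent's stop time.
        current = time.monotonic() if now is None else now
        return Deadline(min(self.stop_at, current + per_attempt))
```

Passing `now` explicitly keeps the arithmetic deterministic in tests; callers would normally omit it and let `time.monotonic()` supply the clock.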
Strategy library
Seven composable strategies in strategies.py: TimeoutStrategy, RetryStrategy (exponential backoff + full jitter), CircuitBreakerStrategy (closed/open/half-open, consecutive-failure count), BulkheadStrategy (flat concurrency cap), PriorityBulkheadStrategy (priority-aware admission across INTERACTIVE/WORKFLOW/BACKGROUND; the native-async port of the inference broker's per-profile algorithm), RateLimiterStrategy (token bucket), and FallbackStrategy.
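As an illustration of the "exponential backoff + full jitter" policy named for RetryStrategy, here is a minimal sketch of the delay calculation. The function name, parameters, and default values are hypothetical and are not taken from strategies.py; only the technique itself (a uniform draw between zero and a capped exponential ceiling) is standard.

```python
import random
from typing import Optional


def full_jitter_delay(
    attempt: int,
    base: float = 0.1,
    cap: float = 5.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to sleep before retry number `attempt` (0-based)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Exponential ceiling, capped; the exponent is bounded to avoid overflow
    # on very large attempt numbers.
    ceiling = min(cap, base * (2 ** min(attempt, 30)))
    source = rng if rng is not None else random
    # Full jitter: draw uniformly from [0, ceiling] rather than sleeping
    # for the ceiling itself, which spreads out synchronized retries.
    return source.uniform(0.0, ceiling)
```

Injecting a seeded `random.Random` makes the delays reproducible in tests, while production callers can rely on the module-level generator.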
Composition
ResiliencePipelineBuilder assembles strategies (declaration order = outermost-first; canonical order: overall Timeout -> RateLimiter/Bulkhead -> Retry -> CircuitBreaker -> per-attempt Timeout). ResiliencePipeline.execute runs a call through them. ResiliencePipelineRegistry (get_pipeline_registry()) holds named, lazily-built pipelines. One hard rule: retry at exactly one layer per failure domain.
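The "declaration order = outermost-first" rule can be sketched with plain callback wrapping. This is a simplified, synchronous illustration: the `compose` helper, the `Trace` strategy, and the dict used as a context are hypothetical stand-ins, and the real builder and pipeline classes may work differently.

```python
from typing import Any, Callable, List, Protocol

Next = Callable[[dict], Any]


class Strategy(Protocol):
    def execute(self, nxt: Next, ctx: dict) -> Any: ...


class Trace:
    """Toy strategy that records when it is entered and exited."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def execute(self, nxt: Next, ctx: dict) -> Any:
        self.log.append(f"enter {self.name}")
        try:
            return nxt(ctx)
        finally:
            self.log.append(f"exit {self.name}")


def _bind(strategy: Strategy, nxt: Next) -> Next:
    # Separate function so each closure captures its own strategy and nxt.
    def run(ctx: dict) -> Any:
        return strategy.execute(nxt, ctx)

    return run


def compose(strategies: List[Strategy], fn: Next) -> Next:
    """Wrap fn so that the first declared strategy ends up outermost."""
    call = fn
    for strategy in reversed(strategies):
        call = _bind(strategy, call)
    return call


log: List[str] = []
pipeline = compose(
    [Trace("timeout", log), Trace("retry", log), Trace("breaker", log)],
    lambda ctx: ctx["x"] * 2,
)
result = pipeline({"x": 21})
```

Wrapping in reverse is what makes the first declared strategy the first to see the call and the last to see its result, which is why an overall timeout declared first bounds every retry beneath it.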
Adapters and consumers
Existing systems participate without being rewritten at their call sites: work_buddy/inference/resilient_broker.py (guarded_broker_call) and work_buddy/obsidian/resilient_bridge.py (guarded_bridge_call, build_obsidian_pipeline) map broker / Obsidian errors onto the taxonomy, propagate the deadline, and emit unified guard.* telemetry. The @bridge_retry decorator (work_buddy/obsidian/retry.py) is itself a thin framework consumer — each decorated call runs a RetryStrategy → _BridgeHealthGate → call chain via guarded_call_sync, so decorated capabilities share the same foundation as ad-hoc adapter calls. There is one retry loop in the codebase for the Obsidian failure domain (the one-retry-layer rule, structurally).
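How an adapter might map errors onto the taxonomy while letting a declared passthrough exception escape can be sketched like this. Everything here is illustrative: `classify_call`, `Kind`, `BridgeUnavailable`, and `PostWriteUncertain` (a stand-in for ObsidianPostWriteUncertain) are hypothetical names, and the mapping of exception types to kinds is an assumption rather than the adapters' actual table.

```python
from enum import Enum
from typing import Any, Callable, Tuple, Type


class Kind(Enum):
    SUCCESS = "success"
    TIMEOUT = "timeout"
    TRANSIENT_FAILURE = "transient_failure"
    TERMINAL_FAILURE = "terminal_failure"


class BridgeUnavailable(Exception):
    """Hypothetical transient error from a bridge."""


class PostWriteUncertain(Exception):
    """Hypothetical control-flow signal that callers need as a raised exception."""


def classify_call(
    fn: Callable[[], Any],
    passthrough: Tuple[Type[BaseException], ...] = (),
) -> Tuple[Kind, Any]:
    """Run fn and return (kind, value_or_error) instead of raising."""
    try:
        value = fn()
    except passthrough:
        # Declared passthrough: re-raised untouched, never classified.
        raise
    except TimeoutError as exc:
        return Kind.TIMEOUT, exc
    except (ConnectionError, BridgeUnavailable) as exc:
        return Kind.TRANSIENT_FAILURE, exc
    except Exception as exc:
        # Anything unrecognised is assumed to be non-retryable.
        return Kind.TERMINAL_FAILURE, exc
    return Kind.SUCCESS, value
```

Checking the passthrough tuple first matters: it guarantees a control-flow signal is never swallowed by one of the broader handlers below it.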
Telemetry
guarded_call emits CallCompleted; strategies emit CircuitStateChanged and LoadShed. Listeners registered via register_listener receive every event; InMemoryMetrics is the default in-process recorder.
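The listener model can be sketched in a few lines. The names register_listener, CallCompleted, and InMemoryMetrics come from the text, but the event fields, the `emit` helper, the counting scheme, and the policy of ignoring listener errors are assumptions made for this illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class CallCompleted:
    # Field names are assumed for this sketch.
    operation_key: str
    kind: str
    duration_s: float


Listener = Callable[[object], None]
_listeners: List[Listener] = []


def register_listener(listener: Listener) -> None:
    _listeners.append(listener)


def emit(event: object) -> None:
    # Every registered listener receives every event.
    for listener in list(_listeners):
        try:
            listener(event)
        except Exception:
            # Assumed policy: a failing listener must not break the guarded call.
            pass


class InMemoryMetrics:
    """In-process recorder that counts completed calls per (operation, kind)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def __call__(self, event: object) -> None:
        if isinstance(event, CallCompleted):
            self.counts[(event.operation_key, event.kind)] += 1


metrics = InMemoryMetrics()
register_listener(metrics)
emit(CallCompleted("obsidian.read", "SUCCESS", 0.012))
emit(CallCompleted("obsidian.read", "SUCCESS", 0.015))
emit(CallCompleted("obsidian.read", "TIMEOUT", 2.0))
```

After these three events, `metrics.counts` holds two successes and one timeout for the "obsidian.read" key, which is itself a made-up operation name.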
Scope boundary
The framework is fault mitigation only. Durable execution and human-in-the-loop waits (consent prompts, the conductor, the retry-queue's cross-restart durability) are a separate discipline — a slow human is not a fault — and must not be folded in.
State
The framework, strategy library, pipeline/registry, both adapters, and @bridge_retry's live wiring are built and verified (unit-tested plus a live integration smoke test against the real Obsidian bridge). The remaining live migrations are:
- The broker's synchronous LM Studio call-sites (embedding provider + LLM backends) — gated on a sync→async conversion of the inference path.
- DISABLED_CAPABILITIES → a CircuitBreakerStrategy on the capability registry.