Async Execution Queue¶
Unified disk-backed queue for three kinds of background work: retries of transient failures, deferred submissions (llm_submit), and scheduled jobs — shared sidecar sweep, per-reason policy (backoff, failure escalation)
Details¶
Agent-facing quick rules¶
- When
wb_runreturns{queued_for_retry: true}: do NOT treat as success. Wait for either aretry_successorretry_exhaustednotification. For mutating operations, verify the side effect landed (file content, store record) before reporting back. Retries CAN fail. - When a call returns
{status: "timeout", operation_id}: retry withmcp__work-buddy__wb_run("retry", {"operation_id": "op_xxx"}). The retry replays the original call without re-sending parameters because the op record stores the capability + params. For Obsidian-bridge-dependent ops useobsidian_retryinstead (bridge health-checks between attempts). llm_submit: the return payload includesoperation_idand ahintexplaining how to retrieve the result viawb_run("wb_status", {operation_id}). The originating session also receives a messaging ping on completion.- Op records carry
error_kindfor typed failures (post-CP4): when anObsidianError(or other typed exception with anerror_kindattribute) is caught, the gateway persists it on the op record. Useful for post-incident analysis: grepdata/agents/operations/op_*.jsonforerror_kind: "obsidian_post_write_uncertain"to count how often the post-write-uncertain recovery path fires. - Per-effect verification short-circuits success: a capability that returns
result.verified(a dict per declared effect) is treated as a partial-failure when any field is notTrueor"verified"— the sweep re-enqueues rather than reporting success.task_createis the canonical caller; the verdict vocabulary is"verified" | "absent" | "indeterminate" | "partial". Capabilities without averifiedfield are unaffected.
How it works¶
-
Canonical fields on the op record:
queued: bool,queue_reason: "retry" | "deferred_submit" | "scheduled_job",error_kind(post-CP4, optional),originating_session_id. The legacyqueued_for_retry: boolis still written and read as an alias for transitional compatibility.lease_secondson a record overrides the default 90s lease for long-running work (LLM ops set 600s). -
Error classification (work_buddy/errors.py): classify_error(exc) → transient | permanent | unknown. is_transient_result(result) checks soft failures in return dicts.
- Typed-exception fast-path (post-CP3):
isinstance(exc, ObsidianError)short-circuits —ObsidianRefused→ permanent, all other ObsidianError subclasses → transient. No string matching needed. -
error_kind in result dicts (post-CP3): when a result dict carries
error_kind(set by the gateway for typed exceptions, or by capabilities that catch and translate), is_transient_result keys off it directly. Falls back to legacy_TRANSIENT_PATTERNSfor non-Obsidian capabilities only. -
Gateway auto-enqueue for retries (gateway.py): When wb_run catches a transient exception AND the capability's retry_policy is 'replay' or 'verify_first', _enqueue_for_retry() sets queued=True, queue_reason='retry' on the operation record with a retry_at timestamp. error_kind is propagated into the record + retry_history entry for the attempt.
-
Post-write-verify (post-CP5): when the caught exception is
ObsidianPostWriteUncertain, the gateway callswork_buddy.obsidian.post_write_verify.verify_post_writeBEFORE deciding to enqueue. If the filesystem confirms the write landed, the op is marked completed (success-with-warning); otherwise it falls through to the normal enqueue path. Same logic intools/gateway.py::retry_workflow_stepandsidecar/retry_sweep.py::_replay.
For capabilities that declare an effects manifest (Capability.effects — see architecture/capability-registry), the recovery path uses verify_post_write_effects which walks every declared effect and can return partial (some landed, some missing). Partial is treated like absent at the queue level (enqueue retry); the capability is required to be idempotent under retry so the replay heals the half-written state. The agent-visible response shape is unchanged.
Wrapper-aware pre-verify: when the queued capability is a retry wrapper (retry / obsidian_retry) whose params.operation_id references an inner op, the verifier walks the inner capability's manifest with the inner op's params rather than single-effect-verifying just the carrier path. Without this, a PWU carrier pointing at the first of several inner effects would single-effect-verify as verified even when later effects never landed — silent partial-state success. Falls back to single-effect verify only when the inner record can't be loaded.
-
Result-dict verified inspection in
_replay: after the inner capability returns success, the sweep checksresult.verified(when present) for partial-failure signals before marking the op completed. Any field whose value is notTrue(legacy boolean shape) or"verified"(string-verdict shape) is treated as a transient failure and re-enqueued. Capabilities without averifiedfield are unaffected. The complementary signal to the pre-verify path above — pre-verify catches the case where the capability never got to run all its effects; post-verify catches the case where it did but a later effect's verification reports absent/indeterminate. -
Sidecar sweep recovery from disabled-capability state: when
_replayresolves the capability and finds it's NOT in the active registry but IS in the disabled registry (a transient tool probe failure left it disabled), the sweep callswork_buddy.recovery.recheck_disabled_capability(name)to re-probe just that capability's missing tools. If the probe now passes, the capability is restored to the active registry and the replay proceeds. If the probe still fails, the sweep invokes the disabled entry's callable directly — the bridge call inside will raise a typed transient exception which the existing exception handler treats as re-queue-worthy. Strictly better than reporting "not found in registry" and exhausting. -
Cross-session consent during sidecar replay: before invoking a replayed callable,
_replaysetsagent_session._originating_sessionto the op record'soriginating_session_id. The consent layer'sis_granted/get_modethen check the originating session's DB as a fallback when the sidecar's own session has no grant (seenotifications/consent). Closes the failure mode where a consented op queued by the user's session would failConsentRequiredon replay. -
Deferred submit path (work_buddy/llm/submit.py): llm_submit() directly writes a record with queued=True, queue_reason='deferred_submit', status='failed', retry_at=now, max_retries=1, lease_seconds=600. Returns immediately with {operation_id, status: 'queued', hint}.
-
Sidecar sweep (sidecar/retry_sweep.py): RetrySweep.sweep() runs every daemon tick. Scans operation records for queued ops where retry_at <= now, acquires a lease (honoring per-op lease_seconds), and invokes entry.callable(**params). Before the call it sets the originating-session contextvar so per-session artifacts (LLM cost log) AND per-session consent grants resolve to the right agent. The replay path also catches
ObsidianPostWriteUncertainand verifies before re-enqueueing. -
Backoff strategies (retry-reason only): 'adaptive' (default: 10s, 20s, 45s, 90s, 120s), 'fixed_10s', 'exponential' (10s * 2^n, capped 120s). Deferred submits use 'none' — no backoff, one attempt.
-
Failure notification varies by reason:
retryexhausted → messaging ping to originating session AND loud user notification via all surfaces (Obsidian, Telegram, Dashboard)-
deferred_submit/scheduled_jobexhausted → messaging ping to originating session only. Agent decides whether to escalate. -
Workflow integration: TaskStatus.RETRY_PENDING blocks dependents without killing the workflow. On success → conductor.resume_after_retry() completes the step and unblocks dependents. On exhaustion → conductor.fail_after_retry_exhaustion() fails the step.
Agent perspective¶
- Retry enqueued: wb_run returns {queued_for_retry: true, retry_hint: ..., error_kind?: ...}. Wait for
retry_successorretry_exhaustedmessaging notification before reporting back. Do NOT treat queued as success. - Post-write-verify recovered (single-effect): wb_run returns {status: 'ok', post_write_recovery: true, warning: ..., path: ...}. The bridge timed out client-side but the file on disk has the content. No retry needed.
- Post-write-verify recovered (multi-effect): same shape additionally with
effects_verified: <count>. Every declared effect landed. - Disabled capability auto-recovered: wb_run returns the normal capability response with
registry_auto_recovered: true. The capability was in the disabled registry at dispatch time; the gateway or sweep re-probed and restored it before invoking. - Deferred submit:
wb_run("llm_submit", ...)returns {operation_id, status: 'queued', hint, queue_reason: 'deferred_submit'}. Check withwb_status(operation_id); messaging ping lands when it completes. retry_successpayload includes the full inner result. For multi-effect capabilities, inspectresult.verifiedper-effect to confirm every effect landed; the sweep already re-enqueued any partial state, so aretry_successyou receive should show all effects verified.
Configuration¶
config.yaml → sidecar.retry_queue: enabled, max_retries (default 5), default_backoff ('adaptive'), max_retry_age_minutes (30).
Key files¶
- work_buddy/errors.py — error classification (typed-exception fast-path + legacy patterns)
- work_buddy/obsidian/errors.py — typed Obsidian exception hierarchy (post-CP1)
- work_buddy/obsidian/post_write_verify.py — post-write-uncertain recovery (single- and multi-effect)
- work_buddy/obsidian/effects.py — EffectSpec schema for the multi-effect manifest
- work_buddy/sidecar/retry_sweep.py — RetrySweep class (queue handling, disabled-cap recovery, effects-aware pre-verify, wrapper-aware inner-manifest resolution, verified-dict inspection)
- work_buddy/recovery.py — recheck_disabled_capability / recheck_tool (per-capability re-probe)
- work_buddy/mcp_server/tools/gateway.py — _enqueue_for_retry(), _is_queued() helper, operation record writes (with error_kind post-CP4), effects-aware PWU handler with the same wrapper-aware fallback
- work_buddy/llm/submit.py — llm_submit() async submission
- work_buddy/agent_session.py — set_originating_session / get_originating_session contextvar
- work_buddy/llm/cost.py — _cost_log_path() honors originating-session override
- work_buddy/sidecar/daemon.py — sweep wired into tick loop
- work_buddy/workflow.py — TaskStatus.RETRY_PENDING
- work_buddy/mcp_server/conductor.py — resume_after_retry(), fail_after_retry_exhaustion()
- work_buddy/consent.py — ConsentCache cross-session lookup
Observability¶
wb_status() includes retry_queue summary (queued count, next_retry_at). _list_recent_operations() shows 'queued_retry' / 'queued_deferred_submit' / 'queued_scheduled_job' statuses with retry_at and max_retries. error_kind on op records lets you grep for specific failure categories across history.