Async Execution Queue

A unified, disk-backed queue for three kinds of background work: retries of transient failures, deferred submissions (llm_submit), and scheduled jobs. All three share one sidecar sweep, with policy (backoff, failure escalation) set per queue reason.

Details

Agent-facing quick rules

When wb_run returns {queued_for_retry: true}: do NOT treat as success. Wait for either a retry_success or retry_exhausted notification. For mutating operations, verify the side effect landed (file content, store record) before reporting back. Retries CAN fail.
When a call returns {status: "timeout", operation_id}: retry with mcp__work-buddy__wb_run("retry", {"operation_id": "op_xxx"}). The retry replays the original call without re-sending parameters because the op record stores the capability + params. For Obsidian-bridge-dependent ops use obsidian_retry instead (it health-checks the bridge between attempts).
llm_submit: the return payload includes operation_id and a hint explaining how to retrieve the result via wb_run("wb_status", {operation_id}). The originating session also receives a messaging ping on completion.
Op records carry error_kind for typed failures (post-CP4): when an ObsidianError (or other typed exception with an error_kind attribute) is caught, the gateway persists it on the op record. Useful for post-incident analysis: grep data/agents/operations/op_*.json for error_kind: "obsidian_post_write_uncertain" to count how often the post-write-uncertain recovery path fires.
Per-effect verification short-circuits success: when a capability returns result.verified (a dict with one entry per declared effect), the result is treated as a partial failure if any value is neither True nor "verified", and the sweep re-enqueues the op rather than reporting success. task_create is the canonical caller; the verdict vocabulary is "verified" | "absent" | "indeterminate" | "partial". Capabilities without a verified field are unaffected.
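The per-effect rule in the last item can be sketched as a small predicate. This is a minimal illustration, not the sweep's actual code: the function name verified_all_effects is hypothetical, and the effect names in the usage note are made up.

```python
from typing import Any


def verified_all_effects(result: dict[str, Any]) -> bool:
    """Return True when the result carries no partial-failure signal.

    Hypothetical helper: mirrors the documented rule, not the real sweep code.
    """
    verified = result.get("verified")
    if not isinstance(verified, dict):
        # No verified dict: the capability is unaffected by this check.
        return True
    # Accept the legacy boolean shape (True) and the string-verdict shape ("verified").
    return all(v is True or v == "verified" for v in verified.values())
```

With this sketch, a result such as {"verified": {"note": "verified", "store": "absent"}} would count as a partial failure and be re-enqueued, while a result with no verified field passes straight through.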

How it works

Canonical fields on the op record: queued: bool, queue_reason: "retry" | "deferred_submit" | "scheduled_job", error_kind (post-CP4, optional), originating_session_id. The legacy queued_for_retry: bool is still written and read as an alias for transitional compatibility. lease_seconds on a record overrides the default 90s lease for long-running work (LLM ops set 600s).
Error classification (work_buddy/errors.py): classify_error(exc) → transient | permanent | unknown. is_transient_result(result) checks soft failures in return dicts.
Typed-exception fast-path: an ObsidianError is classified by its error_kind, not by type matching. The "user must act out of band" kinds — obsidian_refused, obsidian_not_running, obsidian_plugin_missing, obsidian_plugin_disabled — are permanent (retrying without user action never succeeds, so they are not auto-enqueued); every other ObsidianError kind (startup race, timeout, 5xx, editor conflict, post-write-uncertain) is transient. The same error_kind drives the result-dict path, so a raised exception and a bridge_failure return classify identically.
error_kind in result dicts (post-CP3): when a result dict carries error_kind (set by the gateway for typed exceptions, or by capabilities that catch and translate), is_transient_result keys off it directly. Falls back to legacy _TRANSIENT_PATTERNS for non-Obsidian capabilities only.
Gateway auto-enqueue for retries (gateway.py): When wb_run catches a transient exception AND the capability's retry_policy is 'replay' or 'verify_first', _enqueue_for_retry() sets queued=True, queue_reason='retry' on the operation record with a retry_at timestamp. error_kind is propagated into the record + retry_history entry for the attempt.
Post-write-verify (post-CP5): when the caught exception is ObsidianPostWriteUncertain, the gateway calls work_buddy.obsidian.post_write_verify.verify_post_write BEFORE deciding to enqueue. If the filesystem confirms the write landed, the op is marked completed (success-with-warning); otherwise it falls through to the normal enqueue path. Same logic in tools/gateway.py::retry_workflow_step and sidecar/retry_sweep.py::_replay.
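The error_kind classification described in the items above can be sketched as follows. This is an illustration under assumptions: StubObsidianError and classify_by_error_kind are hypothetical stand-ins (the real entry points are classify_error and is_transient_result in work_buddy/errors.py), and the sketch covers only Obsidian kinds.

```python
from typing import Any, Union

# The "user must act out of band" kinds listed above; everything else is transient.
PERMANENT_OBSIDIAN_KINDS = frozenset({
    "obsidian_refused",
    "obsidian_not_running",
    "obsidian_plugin_missing",
    "obsidian_plugin_disabled",
})


class StubObsidianError(Exception):
    """Hypothetical stand-in for the typed ObsidianError hierarchy."""

    def __init__(self, message: str, error_kind: str) -> None:
        super().__init__(message)
        self.error_kind = error_kind


def classify_by_error_kind(source: Union[BaseException, dict[str, Any]]) -> str:
    """Classify a raised exception or a result dict from its error_kind alone."""
    if isinstance(source, dict):
        kind = source.get("error_kind")
    else:
        kind = getattr(source, "error_kind", None)
    if kind is None:
        # The real classifier falls back to type and pattern matching here.
        return "unknown"
    return "permanent" if kind in PERMANENT_OBSIDIAN_KINDS else "transient"
```

Because both paths read the same field, a raised exception and a result dict with the same error_kind land in the same class, which is the property the text calls out.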

For capabilities that declare an effects manifest (Capability.effects — see architecture/capability-registry), the recovery path uses verify_post_write_effects which walks every declared effect and can return partial (some landed, some missing). Partial is treated like absent at the queue level (enqueue retry); the capability is required to be idempotent under retry so the replay heals the half-written state. The agent-visible response shape is unchanged.
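A rough sketch of how per-effect verdicts might roll up into one queue-level decision. The function names are hypothetical, and the handling of indeterminate and of an empty manifest is an assumption; the source only states that partial is treated like absent (enqueue a retry).

```python
def aggregate_effects(verdicts: dict[str, str]) -> str:
    """Collapse per-effect verdicts into a single overall verdict.

    Assumption: an empty manifest and any mix without a verified effect
    are handled as shown; the real verifier may differ.
    """
    values = set(verdicts.values())
    if not values:
        return "indeterminate"
    if values == {"verified"}:
        return "verified"
    if "verified" in values:
        # Some effects landed, some did not.
        return "partial"
    if "indeterminate" in values:
        return "indeterminate"
    return "absent"


def should_enqueue_retry(overall: str) -> bool:
    """Only a fully verified write skips the normal enqueue path."""
    return overall != "verified"
```

Since the replayed capability must be idempotent, re-running it after a partial verdict is what heals the half-written state.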

Wrapper-aware pre-verify: when the queued capability is a retry wrapper (retry / obsidian_retry) whose params.operation_id references an inner op, the verifier walks the inner capability's manifest with the inner op's params rather than single-effect-verifying just the carrier path. Without this, a post-write-uncertain (PWU) carrier pointing at the first of several inner effects would single-effect-verify as verified even when later effects never landed, which is a silent partial-state success. Falls back to single-effect verify only when the inner record can't be loaded.

Result-dict verified inspection in _replay: after the inner capability returns success, the sweep checks result.verified (when present) for partial-failure signals before marking the op completed. Any field whose value is not True (legacy boolean shape) or "verified" (string-verdict shape) is treated as a transient failure and re-enqueued. Capabilities without a verified field are unaffected. This is the complementary signal to the pre-verify path above: pre-verify catches the case where the capability never got to run all its effects; post-verify catches the case where it did but a later effect's verification reports absent/indeterminate.
Sidecar sweep recovery from disabled-capability state: when _replay resolves the capability and finds it's NOT in the active registry but IS in the disabled registry (a transient tool probe failure left it disabled), the sweep calls work_buddy.recovery.recheck_disabled_capability(name) to re-probe just that capability's missing tools. If the probe now passes, the capability is restored to the active registry and the replay proceeds. If the probe still fails, the sweep invokes the disabled entry's callable directly — the bridge call inside will raise a typed transient exception which the existing exception handler treats as re-queue-worthy. Strictly better than reporting "not found in registry" and exhausting.
Cross-session consent during sidecar replay: before invoking a replayed callable, _replay binds a replay_of(originating_session_id) consent principal (and also sets the _originating_session contextvar for non-consent attribution). The consent layer's is_granted / get_mode then resolve against the originating session's DB directly — and, being a REPLAY principal, ride individual op-grants only: a workflow grant that was live when the op was queued does NOT time-travel to authorize the replay (see notifications/consent, "The three consent principals"). Closes the failure mode where a consented op queued by the user's session would fail ConsentRequired on replay, without opening a workflow-blanket time-travel hole.
Deferred submit path (work_buddy/llm/submit.py): llm_submit() directly writes a record with queued=True, queue_reason='deferred_submit', status='failed', retry_at=now, max_retries=1, lease_seconds=600. Returns immediately with {operation_id, status: 'queued', hint}.
Sidecar sweep (sidecar/retry_sweep.py): RetrySweep.sweep() runs every cycle of the sidecar's dispatch loop (a background thread, so a long replay delays only other dispatch work, not supervision or state publishing). Scans operation records for queued ops where retry_at <= now, acquires a lease (honoring per-op lease_seconds), and invokes entry.callable(**params). Before the call it binds a replay_of consent principal (so per-session consent grants resolve to the originating agent, individual-grants-only) and sets the originating-session contextvar (so per-session artifacts like the LLM cost log resolve to the right agent). The replay path also catches ObsidianPostWriteUncertain and verifies before re-enqueueing.
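The due-op scan and lease step described above might look roughly like this. It is a sketch under assumptions: records are plain dicts holding timezone-aware datetimes, and the lease_until field name is hypothetical (the real on-disk format and lease bookkeeping are not shown in this document).

```python
from datetime import datetime, timedelta, timezone
from typing import Any

DEFAULT_LEASE_SECONDS = 90  # default lease; LLM ops set lease_seconds=600


def is_due(record: dict[str, Any], now: datetime) -> bool:
    """True when the op is queued, its retry_at has passed, and no lease is live."""
    # queued_for_retry is the legacy alias for queued.
    if not (record.get("queued") or record.get("queued_for_retry")):
        return False
    retry_at = record.get("retry_at")
    if retry_at is None or retry_at > now:
        return False
    lease_until = record.get("lease_until")  # hypothetical field name
    return lease_until is None or lease_until <= now


def acquire_lease(record: dict[str, Any], now: datetime) -> datetime:
    """Stamp a lease on the record, honoring a per-op lease_seconds override."""
    seconds = record.get("lease_seconds", DEFAULT_LEASE_SECONDS)
    record["lease_until"] = now + timedelta(seconds=seconds)
    return record["lease_until"]
```

A sweep cycle would then filter records with is_due, call acquire_lease on each, and only afterwards invoke the capability, so that a slow replay is not picked up a second time while its lease is still live.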
Backoff strategies (retry-reason only): 'adaptive' (default: 10s, 20s, 45s, 90s, 120s), 'fixed_10s', 'exponential' (10s * 2^n, capped 120s). Deferred submits use 'none' — no backoff, one attempt.
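The backoff schedules above can be written out as a small function. This is a sketch, not the project's implementation: backoff_seconds is a hypothetical name, the attempt index is assumed to be zero-based, and holding the adaptive schedule at its last value beyond five attempts is also an assumption.

```python
ADAPTIVE_DELAYS = (10, 20, 45, 90, 120)  # seconds, from the documented schedule
MAX_DELAY = 120


def backoff_seconds(strategy: str, attempt: int) -> int:
    """Delay before the next attempt; attempt is zero-based (assumption)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    if strategy == "adaptive":
        # Hold at the last documented value once the schedule runs out.
        return ADAPTIVE_DELAYS[min(attempt, len(ADAPTIVE_DELAYS) - 1)]
    if strategy == "fixed_10s":
        return 10
    if strategy == "exponential":
        return min(10 * 2 ** attempt, MAX_DELAY)
    if strategy == "none":
        # Deferred submits: one attempt, no backoff.
        return 0
    raise ValueError(f"unknown backoff strategy: {strategy!r}")
```

Under these assumptions the exponential strategy yields 10, 20, 40, 80 and then stays at the 120s cap, so it differs from adaptive only in the middle of the schedule.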
Failure notification varies by reason:
retry exhausted → messaging ping to originating session AND loud user notification via all surfaces (Obsidian, Telegram, Dashboard)
deferred_submit / scheduled_job exhausted → messaging ping to originating session only. Agent decides whether to escalate.
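The per-reason escalation rule can be summarized in a few lines. The function and target labels below are hypothetical; only the routing itself comes from the two items above.

```python
USER_SURFACES = ("obsidian", "telegram", "dashboard")
QUIET_REASONS = ("deferred_submit", "scheduled_job")


def exhaustion_targets(queue_reason: str) -> list[str]:
    """Who gets notified when an op of this queue reason exhausts its attempts."""
    # The originating session is always pinged.
    targets = ["originating_session"]
    if queue_reason == "retry":
        # Retry exhaustion also escalates loudly to the user on every surface.
        targets.extend(USER_SURFACES)
    elif queue_reason not in QUIET_REASONS:
        raise ValueError(f"unknown queue_reason: {queue_reason!r}")
    return targets
```

For the quiet reasons the agent that receives the ping decides whether to escalate further.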
Workflow integration: TaskStatus.RETRY_PENDING blocks dependents without killing the workflow. On success → conductor.resume_after_retry() completes the step and unblocks dependents. On exhaustion → conductor.fail_after_retry_exhaustion() fails the step.

Agent perspective

Retry enqueued: wb_run returns {queued_for_retry: true, retry_hint: ..., error_kind?: ...}. Wait for retry_success or retry_exhausted messaging notification before reporting back. Do NOT treat queued as success.
Post-write-verify recovered (single-effect): wb_run returns {status: 'ok', post_write_recovery: true, warning: ..., path: ...}. The bridge timed out client-side but the file on disk has the content. No retry needed.
Post-write-verify recovered (multi-effect): same shape, plus effects_verified: <count>. Every declared effect landed.
Disabled capability auto-recovered: wb_run returns the normal capability response with registry_auto_recovered: true. The capability was in the disabled registry at dispatch time; the gateway or sweep re-probed and restored it before invoking.
Deferred submit: wb_run("llm_submit", ...) returns {operation_id, status: 'queued', hint, queue_reason: 'deferred_submit'}. Check with wb_run("wb_status", {operation_id}); a messaging ping lands when it completes.
retry_success payload includes the full inner result. For multi-effect capabilities, inspect result.verified per-effect to confirm every effect landed; the sweep already re-enqueued any partial state, so a retry_success you receive should show all effects verified.
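One way an agent-side helper could branch on the response shapes listed above. This is a sketch only: next_action and its return labels are hypothetical, and the checks cover just the fields named in this section.

```python
from typing import Any


def next_action(response: dict[str, Any]) -> str:
    """Map a wb_run response to what the agent should do next (labels are hypothetical)."""
    if response.get("queued_for_retry"):
        # Not a success: wait for retry_success or retry_exhausted.
        return "wait_for_retry_notification"
    if response.get("status") == "timeout" and "operation_id" in response:
        # Replay through the retry capability using the stored operation_id.
        return "call_retry_with_operation_id"
    if response.get("queue_reason") == "deferred_submit":
        # Result arrives later; check wb_status or wait for the messaging ping.
        return "check_wb_status_later"
    if response.get("post_write_recovery"):
        # The write landed on disk despite the client-side timeout.
        return "report_success_with_warning"
    return "report_result"
```

A response carrying registry_auto_recovered: true falls through to the default branch, since it is otherwise a normal capability response.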

Configuration

config.yaml → sidecar.retry_queue: enabled, max_retries (default 5), default_backoff (default 'adaptive'), max_retry_age_minutes (default 30).

Key files

work_buddy/errors.py — error classification (typed-exception fast-path + legacy patterns)
work_buddy/obsidian/errors.py — typed Obsidian exception hierarchy (post-CP1)
work_buddy/obsidian/post_write_verify.py — post-write-uncertain recovery (single- and multi-effect)
work_buddy/obsidian/effects.py — EffectSpec schema for the multi-effect manifest
work_buddy/sidecar/retry_sweep.py — RetrySweep class (queue handling, disabled-cap recovery, effects-aware pre-verify, wrapper-aware inner-manifest resolution, verified-dict inspection)
work_buddy/recovery.py — recheck_disabled_capability / recheck_tool (per-capability re-probe)
work_buddy/mcp_server/tools/gateway.py — _enqueue_for_retry(), _is_queued() helper, operation record writes (with error_kind post-CP4), effects-aware PWU handler with the same wrapper-aware fallback
work_buddy/llm/submit.py — llm_submit() async submission
work_buddy/agent_session.py — set_originating_session / get_originating_session contextvar
work_buddy/llm/cost.py — _cost_log_path() honors originating-session override
work_buddy/sidecar/daemon.py — sweep wired into the dispatch loop
work_buddy/workflow.py — TaskStatus.RETRY_PENDING
work_buddy/mcp_server/conductor.py — resume_after_retry(), fail_after_retry_exhaustion()
work_buddy/consent.py — ConsentCache cross-session lookup

Observability

wb_status() includes retry_queue summary (queued count, next_retry_at). _list_recent_operations() shows 'queued_retry' / 'queued_deferred_submit' / 'queued_scheduled_job' statuses with retry_at and max_retries. error_kind on op records lets you grep for specific failure categories across history.