Retry Queue

Background retry system for transient operation failures — sidecar sweep, error classification, adaptive backoff, workflow DAG integration

Entry points

  • work_buddy.errors
  • work_buddy.sidecar.retry_sweep

Details

How it works

  1. Error classification (work_buddy/errors.py): classify_error(exc) → transient | permanent | unknown. is_transient_result(result) checks soft failures in return dicts.

  2. Gateway auto-enqueue (gateway.py): When wb_run catches a transient exception AND the capability's retry_policy is 'replay' or 'verify_first', _enqueue_for_retry() sets queued_for_retry=True on the operation record with a retry_at timestamp.

  3. Sidecar sweep (sidecar/retry_sweep.py): RetrySweep.sweep() runs every daemon tick (~30s). Scans operation records for queued retries where retry_at <= now, replays via entry.callable(**params), and handles success/failure/exhaustion.

  4. Backoff strategies: 'adaptive' (default: 10s, 20s, 45s, 90s, 120s; designed for Obsidian outages that may last seconds or minutes), 'fixed_10s', 'exponential' (10s * 2^n, capped at 120s).

  5. Notification: On success → messaging.send_message() to originating session. On exhaustion → user notification via all surfaces (Obsidian, Telegram, Dashboard).

  6. Workflow integration: TaskStatus.RETRY_PENDING blocks dependents without killing the workflow. On retry success → conductor.resume_after_retry() completes the step and unblocks dependents. On exhaustion → conductor.fail_after_retry_exhaustion() fails the step.
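
The backoff strategies in step 4 can be sketched as a small delay function. This is a minimal illustration, not the code in retry_sweep.py: the names backoff_delay and next_retry_at are hypothetical, the attempt counter is assumed to start at 0, and attempts past the end of the adaptive table are assumed to reuse the last entry.

```python
from datetime import datetime, timedelta

# Delay values taken from the strategy descriptions above.
ADAPTIVE_SCHEDULE_S = (10, 20, 45, 90, 120)
BASE_DELAY_S = 10
MAX_DELAY_S = 120


def backoff_delay(strategy: str, attempt: int) -> int:
    """Return the delay in seconds before retry number `attempt` (0-based, assumed)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    if strategy == "adaptive":
        # Assumption: attempts beyond the table reuse the final 120s entry.
        index = min(attempt, len(ADAPTIVE_SCHEDULE_S) - 1)
        return ADAPTIVE_SCHEDULE_S[index]
    if strategy == "fixed_10s":
        return BASE_DELAY_S
    if strategy == "exponential":
        # 10s * 2^n, capped at 120s.
        return min(BASE_DELAY_S * 2 ** attempt, MAX_DELAY_S)
    raise ValueError(f"unknown backoff strategy: {strategy!r}")


def next_retry_at(now: datetime, strategy: str, attempt: int) -> datetime:
    """Hypothetical helper: compute the retry_at timestamp for a queued operation."""
    return now + timedelta(seconds=backoff_delay(strategy, attempt))
```

Under these assumptions the exponential strategy yields 10s, 20s, 40s, 80s, then stays at 120s, while the adaptive table stretches its later waits (45s, 90s) slightly beyond the doubling sequence.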

Agent perspective

When an agent calls wb_run and gets {queued_for_retry: true}, it should MOVE ON. The retry_hint says: 'This operation has been queued for automatic background retry. You will be notified when it succeeds. Move on to other work.'
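
From the caller's side, handling this response might look like the sketch below. The function handle_wb_run_result and its return shape are hypothetical, and it assumes retry_hint arrives as a key in the same result dict as queued_for_retry.

```python
from typing import Any, Dict, Mapping


def handle_wb_run_result(result: Mapping[str, Any]) -> Dict[str, str]:
    """Decide what the caller does with a wb_run-style result dict (illustrative only)."""
    if result.get("queued_for_retry") is True:
        # The background sweep owns the retry now: do not wait, poll, or re-issue the call.
        return {"action": "move_on", "note": str(result.get("retry_hint", ""))}
    # Anything else is handled as a normal result by the caller.
    return {"action": "use_result", "note": ""}
```

The point of the design is that the agent never blocks on a transient outage; it carries on with other work and picks the result up when the success notification arrives.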

Configuration

config.yaml → sidecar.retry_queue: enabled, max_retries (default 5), default_backoff (default 'adaptive'), max_retry_age_minutes (default 30).
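
One way to picture these settings after config.yaml has been parsed into a dict is the sketch below. RetryQueueConfig and load_retry_queue_config are hypothetical names, and the default for enabled is an assumption, since the text above does not state one.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class RetryQueueConfig:
    enabled: bool = True  # assumption: default not given in the documentation
    max_retries: int = 5
    default_backoff: str = "adaptive"
    max_retry_age_minutes: int = 30


def load_retry_queue_config(config: Mapping[str, Any]) -> RetryQueueConfig:
    """Read sidecar.retry_queue from an already-parsed config mapping."""
    section = (config.get("sidecar") or {}).get("retry_queue") or {}
    known_names = {f.name for f in fields(RetryQueueConfig)}
    # Keep only recognised keys; anything missing falls back to the defaults above.
    overrides = {name: value for name, value in section.items() if name in known_names}
    return RetryQueueConfig(**overrides)
```

For example, a config containing only sidecar.retry_queue.max_retries: 3 would produce max_retries=3 with the other three fields left at their defaults.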

Key files

  • work_buddy/errors.py — error classification
  • work_buddy/sidecar/retry_sweep.py — RetrySweep class
  • work_buddy/mcp_server/tools/gateway.py — _enqueue_for_retry(), operation record extensions
  • work_buddy/sidecar/daemon.py — sweep wired into tick loop
  • work_buddy/workflow.py — TaskStatus.RETRY_PENDING
  • work_buddy/mcp_server/conductor.py — resume_after_retry(), fail_after_retry_exhaustion()

Observability

wb_status() includes retry_queue summary (queued count, next_retry_at). _list_recent_operations() shows 'queued_retry' status with retry_at and max_retries.
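
The retry_queue summary could be derived from operation records roughly as follows. This is an illustrative sketch, not the wb_status() implementation: retry_queue_summary is a hypothetical name, records are assumed to be dicts carrying the queued_for_retry and retry_at fields described earlier, and the output key "queued" is an assumed name for the queued count.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, Mapping


def retry_queue_summary(records: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Summarise queued retries: how many are waiting and when the next one is due."""
    queued = [r for r in records if r.get("queued_for_retry")]
    retry_times = [r["retry_at"] for r in queued if r.get("retry_at") is not None]
    return {
        "queued": len(queued),
        # None when nothing is queued or no record carries a retry_at timestamp.
        "next_retry_at": min(retry_times) if retry_times else None,
    }
```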