Sidecar Daemon

Unified process supervisor, cron scheduler, and message-driven job dispatcher

Details

A single long-lived Python process that replaces multiple independent Windows Task Scheduler entries with a unified process supervisor, cron/heartbeat scheduler, and message-driven job dispatcher.

Starting: uv run python -m work_buddy.sidecar

Manages its own lifecycle via PID file (<data_root>/runtime/sidecar.pid) and state file (<data_root>/runtime/sidecar_state.json).

Two loops split the daemon's work by blocking behavior. The supervisor loop (main thread) evaluates cached health probes, restarts failed children, and writes sidecar_state.json every tick — everything on it is fast and bounded, so the state file's freshness is a true daemon-liveness signal (wbuddy status classifies ~90s of staleness as wedged). The dispatch loop (a background thread) runs scheduler cron ticks, message-driven dispatch, and retry sweeps — the phases that execute jobs and replays inline and can legitimately block for minutes (agent spawns, index rebuilds, local-LLM leases). A slow job therefore reads as a busy dispatch phase in the state file, never as a hung daemon, and can never delay child restarts.
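The liveness rule above can be sketched as a small check on the state file's age. This is a minimal illustration, not the wbuddy status implementation: the function name, the use of the file's modification time, and the "not_running" label for a missing file are all assumptions; only the ~90s threshold comes from the text.

```python
import os
import time

# From the text: ~90s of staleness is classified as wedged.
WEDGED_AFTER_SECONDS = 90.0


def classify_liveness(state_path, now=None, wedged_after=WEDGED_AFTER_SECONDS):
    """Return 'not_running', 'wedged', or 'alive' from the state file's age.

    Hypothetical helper: assumes freshness is judged by file mtime.
    """
    try:
        mtime = os.path.getmtime(state_path)
    except FileNotFoundError:
        # Assumption: no state file means no daemon has written one.
        return "not_running"
    age = (time.time() if now is None else now) - mtime
    return "wedged" if age > wedged_after else "alive"
```

Because only the supervisor loop writes the file, a long-running job on the dispatch thread leaves the age small and the result "alive".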

Three subsystems:

1. Process Supervisor: starts and monitors child services. A dedicated HealthMonitor thread probes every service's /health endpoint concurrently (via ThreadPoolExecutor) at health_probe_interval (default 5s), so dispatch work or slow capabilities can never delay probes. The supervisor loop consumes cached health state only; a service is restarted after health_failure_threshold (default 2) consecutive failed probes, under exponential backoff, and given up on after max_service_crashes (default 5).
2. Cron/Heartbeat Scheduler: loads job .md files from both sidecar_jobs/ (system jobs, git-tracked) and <data_root>/user_jobs/ (user-authored, gitignored) and fires them on their cron schedule, inline on the dispatch loop. Supports exclusion windows (quiet hours) and optional per-job stable jitter to spread phase-aligned starts. Hot-reload has two triggers: a JobsWatcher (kernel filesystem events via watchdog; ~50ms latency on file change) and a 30s polling interval as a safety net. Watcher events set a threading.Event (Scheduler.jobs_reload_pending) that the daemon's dispatch-loop sleep waits on, so the next cycle reloads immediately. On filename-stem collision the user file wins and a WARN is logged.
3. Message-to-Job Dispatch: polls the messaging service for pending messages addressed to work-buddy, classifies each message, and executes the resulting job automatically, also inline on the dispatch loop.
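The supervisor's restart rule (failure threshold, exponential backoff, crash cap) can be sketched as a pure decision function. The names ServiceHealth and restart_decision, the backoff_base default of 1.0, and the doubling formula are assumptions for illustration; the text only states that backoff is exponential and gives the threshold and cap defaults.

```python
from dataclasses import dataclass


@dataclass
class ServiceHealth:
    """Hypothetical cached health record consumed by the supervisor loop."""
    consecutive_failures: int = 0  # failed probes in a row
    crash_count: int = 0           # restarts already attempted


def restart_decision(health, failure_threshold=2, max_crashes=5, backoff_base=1.0):
    """Return (action, delay_seconds), action in {'ok', 'restart', 'give_up'}."""
    if health.consecutive_failures < failure_threshold:
        return ("ok", 0.0)
    if health.crash_count >= max_crashes:
        return ("give_up", 0.0)
    # Assumed backoff shape: the delay doubles with each prior crash.
    return ("restart", backoff_base * (2 ** health.crash_count))
```

Keeping the decision free of I/O matches the idea that the supervisor loop reads cached state only: the function never probes, it only inspects counters the HealthMonitor would have filled in.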

Shutdown: First Ctrl+C / SIGTERM requests graceful shutdown; a watchdog thread force-exits after 15s if the main thread is stuck in a blocking syscall, so shutdown is always bounded. A second Ctrl+C force-kills children immediately. The JobsWatcher observer thread is stopped and joined alongside the HealthMonitor in the cleanup path; the dispatch thread is a daemon thread, briefly joined and abandoned if mid-job. The pid file is removed on shutdown only when it still records the exiting process's own pid — the atexit hook can fire after a takeover has replaced the file, and deleting the successor's pid file would make a healthy daemon read as not running.
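The pid-file ownership check described above might look like the following sketch. The function name and the assumption that the file holds a bare decimal pid are illustrative, not confirmed by the source.

```python
import os


def remove_pid_file_if_owned(pid_path, own_pid=None):
    """Delete pid_path only if it still records this process's pid.

    Returns True when the file was removed, False otherwise.
    Assumes the file contains a plain decimal pid (hypothetical format).
    """
    own_pid = os.getpid() if own_pid is None else own_pid
    try:
        with open(pid_path, "r", encoding="utf-8") as fh:
            recorded = fh.read().strip()
    except FileNotFoundError:
        return False
    if recorded != str(own_pid):
        # A successor has taken over; leave its pid file alone.
        return False
    os.remove(pid_path)
    return True
```

A small window remains between the read and the remove, so this narrows the takeover race rather than eliminating it.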

Child stdout/stderr: redirected to <data_root>/runtime/service_logs/<service>.log so a silent or crashing child is always observable — Popen inheritance with CREATE_NO_WINDOW can otherwise drop output on Windows. At child launch the daemon rolls an oversized live log (>16 MiB) aside to a timestamped backup (_roll_oversize_log); the service-logs artifact then reaps rolled backups older than 7 days on the twice-daily cleanup tick (pinning the live log). See the rotation dev-note above.
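As a rough illustration of the roll-aside step, the sketch below renames a live log past the 16 MiB threshold to a timestamped backup. It is a stand-in for _roll_oversize_log, not its actual code: the backup naming scheme and the return value are assumptions.

```python
import os
import time

# From the text: live logs larger than 16 MiB are rolled aside at child launch.
MAX_LIVE_LOG_BYTES = 16 * 1024 * 1024


def roll_oversize_log(log_path, max_bytes=MAX_LIVE_LOG_BYTES, now=None):
    """Rename an oversized log to '<log_path>.<timestamp>'; return the backup path.

    Returns None when the log is missing or within the size limit.
    The '<path>.<YYYYmmdd-HHMMSS>' naming is a hypothetical convention.
    """
    try:
        size = os.path.getsize(log_path)
    except FileNotFoundError:
        return None
    if size <= max_bytes:
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    backup = f"{log_path}.{stamp}"
    os.replace(log_path, backup)  # the child then reopens a fresh live log
    return backup
```

Rolling at launch rather than mid-run avoids renaming a file the child still holds open, which matters on Windows.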

Job file format: .md files in either jobs directory with YAML frontmatter (schedule, recurring, type, capability/params, enabled, spawn_mode, optional jitter_seconds). Each loaded Job carries a source field ("system" or "user") that propagates through JobState into sidecar_state.json and is used by the dashboard's Jobs tab to group entries.
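How the two job directories combine, with the source field and the user-wins collision rule, can be sketched as below. The function name, logger name, and return shape are hypothetical; frontmatter parsing is left out.

```python
import logging
from pathlib import Path

log = logging.getLogger("sidecar.jobs")  # hypothetical logger name


def discover_job_files(system_dir, user_dir):
    """Map filename stem -> (path, source), where source is 'system' or 'user'.

    System files are scanned first, so a user file with the same stem
    replaces the system entry and a warning is logged.
    """
    found = {}
    for source, directory in (("system", Path(system_dir)), ("user", Path(user_dir))):
        if not directory.is_dir():
            continue
        for path in sorted(directory.glob("*.md")):
            if path.stem in found:
                log.warning("job %r: %s overrides %s", path.stem, path, found[path.stem][0])
            found[path.stem] = (path, source)
    return found
```

The source value recorded here is what a loader would carry into each Job so the dashboard can group entries.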

Job types: capability (calls registered MCP gateway capability directly), workflow (triggers registered workflow), prompt (freeform text — spawns claude -p agent session, consent-gated).

Agent spawn modes for prompt jobs: headless_ephemeral (default, --print --no-session-persistence), headless_persistent (--print only, registered in <data_root>/runtime/agent_registry.json), interactive_persistent (deferred).
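A sketch of how a spawn mode could map to a command line, using only the flags named above. The function name and passing the prompt as the final positional argument are assumptions; consent gating and agent-registry bookkeeping are omitted.

```python
# Flags per spawn mode, taken from the description above.
SPAWN_MODE_FLAGS = {
    "headless_ephemeral": ["--print", "--no-session-persistence"],
    "headless_persistent": ["--print"],
}


def build_agent_argv(spawn_mode, prompt):
    """Return an argv list for a prompt job (hypothetical helper)."""
    if spawn_mode == "interactive_persistent":
        # The text marks this mode as deferred.
        raise NotImplementedError("interactive_persistent is deferred")
    try:
        flags = SPAWN_MODE_FLAGS[spawn_mode]
    except KeyError:
        raise ValueError(f"unknown spawn_mode: {spawn_mode!r}") from None
    return ["claude", *flags, prompt]
```

Building a list rather than a shell string keeps the freeform prompt text from being interpreted by a shell.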

Session launcher (work_buddy/session_launcher.py): remote_session_begin launches a visible Claude Code terminal session, optionally with --remote-control. Consent-gated. Primary use case: Remote Control from phone.

Cron syntax: standard 5-field cron in the timezone set in config.yaml (default: America/New_York).

Jitter — the thundering herd problem and how the scheduler avoids it

Many of work-buddy's jobs run on phase-aligned schedules: */3, */5, */10, */30. These coincide at common minute boundaries (:00, :30, hourly, etc.) and fire simultaneously: every five minutes a wave of indexers, sync jobs, and health checks all hit at the same second. That's the thundering herd: a synchronized burst of work whose contention (CPU, disk, lock acquisition, downstream API rate limits) is much worse than the same total work spread across the interval. Each individual job is fine; the simultaneous start is the problem.

The scheduler's per-job jitter solves this: set jitter_seconds: <N> in a job's frontmatter and it fires at scheduled_at + offset where offset is in [0, N], deterministic per job. The same job lands at the same offset every cycle; two jobs sharing a schedule land at different offsets and stop colliding.
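One way to get an offset that is deterministic per job is to hash the job's name. The sketch below assumes that approach; the source does not say how the scheduler derives its offset, so the SHA-256 choice and the function name are illustrative only.

```python
import hashlib


def stable_jitter_offset(job_name, jitter_seconds):
    """Return a repeatable offset in [0, jitter_seconds] for job_name.

    Hypothetical derivation: the first 8 bytes of SHA-256(job_name),
    reduced modulo jitter_seconds + 1 so both endpoints are reachable.
    """
    if jitter_seconds <= 0:
        return 0
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (jitter_seconds + 1)
```

A cryptographic digest is used instead of Python's built-in hash() because string hashing is randomized per process by default, which would move the offset on every daemon restart. With a hash-based scheme two jobs can still land on the same offset by chance, with probability about 1/(N+1).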

The dashboard form caps jitter_seconds per schedule. Worked examples: */3→10s, */5→30s, */10→60s, */30→180s, hourly/daily/weekly→300s (the hard cap is 5 minutes regardless). The cap is a UI recommendation, not a backend rule — hand-editing a .md file can exceed it.

Tick cadence is health_check_interval (default 30s), so jitter_seconds values below 30 are quantized away in practice; the dashboard form warns when the typed value falls below that floor. jitter_seconds: 0 (the default) bypasses the pending-fire queue entirely and fires inline on cron match.
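The quantization effect is easy to see with a little arithmetic. The sketch assumes a queued fire is released on the first tick at or after its due time and that ticks fall on multiples of the interval from some origin; both are modeling assumptions, and the function name is hypothetical.

```python
import math


def first_tick_at_or_after(due_at, tick_interval=30.0, tick_origin=0.0):
    """Return the time of the first tick that is not earlier than due_at."""
    ticks = math.ceil((due_at - tick_origin) / tick_interval)
    return tick_origin + ticks * tick_interval


# A job scheduled at t=600s with offsets of 7s and 22s: both are released
# on the same tick, so the difference between them is lost.
assert first_tick_at_or_after(600 + 7) == 630.0
assert first_tick_at_or_after(600 + 22) == 630.0
# An offset of 45s crosses a tick boundary and lands one tick later.
assert first_tick_at_or_after(600 + 45) == 660.0
```

Under this model, offsets only separate jobs when they differ by at least one tick interval, which is why small jitter values have little practical effect.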

Observability fields next_at (raw cron eligibility) and effective_at (next_at + offset, or queued pending due time) ride alongside each JobState in sidecar_state.json; the dashboard's Jobs tab renders effective_at in the "Next Run" column and shows the configured jitter_seconds in a dedicated "Jitter" column so users can correlate display time with cause.

Config (sidecar: section in config.yaml): health_check_interval (cadence of both loops: supervisor restart-decision evaluation and dispatch cycles), health_probe_interval (HealthMonitor cadence), health_probe_timeout, health_failure_threshold, max_service_crashes, restart_backoff_base, dispatch_stall_warn_seconds (default 600 — warn when the dispatch loop sits in one phase this long), services (with module/port/enabled per service), jobs_dir (system jobs, defaults to sidecar_jobs), user_jobs_dir (user jobs override; empty = <data_root>/user_jobs/), heartbeat, message_poll_interval.

Observability: The supervisor writes <data_root>/runtime/sidecar_state.json every tick regardless of what the dispatch loop is doing. Alongside services/jobs/events, the state carries dispatch-loop fields: dispatch_phase (scheduler_tick | message_poll | retry_sweep | idle), dispatch_phase_since, dispatch_job (the job currently executing inline, if any), and last_dispatch_at (end of the most recent dispatch cycle). A phase held past dispatch_stall_warn_seconds emits a dispatch_stalled event (once per stall) naming the phase and job; wbuddy status prints a busy line for a phase held past ~2 minutes. Query via sidecar_status or sidecar_jobs capabilities. Per-service child logs at <data_root>/runtime/service_logs/*.log (rotated as described above). Dashboard subscribes to cron.hot_reload events on the bus to refresh its Jobs tab on every actual reload.
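A consumer of the dispatch-loop fields could produce a busy line along these lines. The field names come from the text; the assumption that dispatch_phase_since is a Unix timestamp in seconds, the function name, and the output wording are illustrative and may not match wbuddy status.

```python
# From the text: a busy line is printed for a phase held past ~2 minutes.
BUSY_AFTER_SECONDS = 120.0


def dispatch_busy_line(state, now, busy_after=BUSY_AFTER_SECONDS):
    """Return a one-line busy summary, or None when nothing needs reporting.

    `state` is the parsed sidecar_state.json dict. Assumes
    dispatch_phase_since is epoch seconds (hypothetical format).
    """
    phase = state.get("dispatch_phase", "idle")
    since = state.get("dispatch_phase_since")
    if phase == "idle" or since is None:
        return None
    held = now - since
    if held <= busy_after:
        return None
    job = state.get("dispatch_job")
    target = f" (job: {job})" if job else ""
    return f"dispatch busy: {phase}{target} for {int(held)}s"
```

For example, a state with dispatch_phase "scheduler_tick", dispatch_job "reindex", and a phase start 300 seconds before `now` yields "dispatch busy: scheduler_tick (job: reindex) for 300s".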