Sidecar Daemon
Unified process supervisor, cron scheduler, and message-driven job dispatcher
Details
A single long-lived Python process that replaces multiple independent Windows Task Scheduler entries with a unified process supervisor, cron/heartbeat scheduler, and message-driven job dispatcher.
Starting: powershell.exe -Command "cd
Manages its own lifecycle via PID file (<data_root>/runtime/sidecar.pid) and state file (<data_root>/runtime/sidecar_state.json).
Three subsystems:
1. Process Supervisor — starts and monitors child services. A dedicated HealthMonitor thread probes every service's /health endpoint concurrently (via ThreadPoolExecutor) at health_probe_interval (default 5s), so scheduler ticks or slow capabilities can never delay probes. The main loop consumes cached health state only; a service is restarted after health_failure_threshold (default 2) consecutive failed probes, under exponential backoff, and given up on after max_service_crashes (default 5).
2. Cron/Heartbeat Scheduler — loads job .md files from BOTH sidecar_jobs/ (system jobs, git-tracked) AND <data_root>/user_jobs/ (user-authored, gitignored) and fires them on their cron schedule. Supports exclusion windows (quiet hours), and optional per-job stable jitter to spread phase-aligned starts. Hot-reload has two triggers: a JobsWatcher (kernel filesystem events via watchdog; ~50ms latency on file change) AND a 30s polling interval as a safety net. Watcher events set a threading.Event (Scheduler.jobs_reload_pending) that the daemon's main-loop sleep waits on, so the next tick reloads immediately. On filename-stem collision the user file wins and a WARN is logged.
3. Message-to-Job Dispatch — polls the messaging service for pending messages addressed to work-buddy, classifies each message, and automatically executes the job it maps to.
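The supervisor's restart rule (failure threshold, exponential backoff, give-up limit) can be sketched as below. This is a minimal illustration, not the daemon's actual code: ServiceHealth, record_probe, and should_restart are hypothetical names, the backoff formula (base × 2^(crashes − 1)) is an assumption, and so is the 2.0 s value used for restart_backoff_base. Only the defaults 2 and 5 come from the text above.

```python
from dataclasses import dataclass


@dataclass
class ServiceHealth:
    """Cached health state for one child service (hypothetical shape)."""
    consecutive_failures: int = 0
    crash_count: int = 0
    next_restart_at: float = 0.0
    gave_up: bool = False


def record_probe(state: ServiceHealth, healthy: bool) -> None:
    """Probe-thread side: a success resets the streak, a failure extends it."""
    state.consecutive_failures = 0 if healthy else state.consecutive_failures + 1


def should_restart(
    state: ServiceHealth,
    now: float,
    failure_threshold: int = 2,  # health_failure_threshold default from the text
    max_crashes: int = 5,        # max_service_crashes default from the text
    backoff_base: float = 2.0,   # assumed value for restart_backoff_base
) -> bool:
    """Main-loop side: decide from cached state only whether to restart now."""
    if state.gave_up or state.consecutive_failures < failure_threshold:
        return False
    if state.crash_count >= max_crashes:
        state.gave_up = True  # too many restarts: stop trying
        return False
    if now < state.next_restart_at:
        return False  # still inside the backoff window
    state.crash_count += 1
    # Assumed backoff formula: base * 2**(crashes - 1) seconds.
    state.next_restart_at = now + backoff_base * 2 ** (state.crash_count - 1)
    state.consecutive_failures = 0
    return True
```

Keeping record_probe and should_restart separate mirrors the split described above: probing runs on its own cadence, and the main loop only reads the cached result.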
Shutdown: First Ctrl+C / SIGTERM requests graceful shutdown; a watchdog thread force-exits after 15s if the main thread is stuck in a blocking syscall, so shutdown is always bounded. A second Ctrl+C force-kills children immediately. The JobsWatcher observer thread is stopped and joined alongside the HealthMonitor in the cleanup path.
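One way to get a bounded shutdown like this is a daemon thread that waits on a "cleanup finished" event and force-exits when the wait times out. The sketch below assumes that design; arm_shutdown_watchdog and its parameters are hypothetical names, and the force_exit hook is injectable only so the behaviour can be exercised without killing the interpreter.

```python
import os
import threading
from typing import Callable


def arm_shutdown_watchdog(
    done: threading.Event,
    timeout: float = 15.0,  # the 15 s bound from the text
    force_exit: Callable[[], None] = lambda: os._exit(1),
) -> threading.Thread:
    """Start a daemon thread that force-exits if `done` is not set in time."""

    def _watch() -> None:
        # Event.wait returns False when the timeout elapses without a set().
        if not done.wait(timeout):
            force_exit()

    thread = threading.Thread(target=_watch, name="shutdown-watchdog", daemon=True)
    thread.start()
    return thread
```

The signal handler would arm the watchdog on the first Ctrl+C or SIGTERM, and the normal cleanup path would call done.set() once the children and helper threads have stopped. os._exit skips interpreter cleanup, which is the point when the main thread is stuck in a blocking call.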
Child stdout/stderr: redirected to <data_root>/runtime/service_logs/<service>.log so a silent or crashing child is always observable; Popen inheritance with CREATE_NO_WINDOW can otherwise drop output on Windows. Each log is capped at 16 MiB with up to 4 rotated copies kept (an 80 MiB ceiling per service, counting the live file) via rotate-on-startup: when the daemon launches the child, an oversized log is renamed to <service>.1.log and older copies shift to .2, .3, and .4, dropping the oldest.
Job file format: .md files in either jobs directory with YAML frontmatter (schedule, recurring, type, capability/params, enabled, spawn_mode, optional jitter_seconds). Each loaded Job carries a source field ("system" or "user") that propagates through JobState into sidecar_state.json and is used by the dashboard's Jobs tab to group entries.
Job types: capability (calls registered MCP gateway capability directly), workflow (triggers registered workflow), prompt (freeform text — spawns claude -p agent session, consent-gated).
Agent spawn modes for prompt jobs: headless_ephemeral (default, --print --no-session-persistence), headless_persistent (--print only, registered in <data_root>/runtime/agent_registry.json), interactive_persistent (deferred).
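As a rough illustration of how a spawn mode could translate into a command line, the sketch below maps each headless mode to the flags listed above. build_agent_argv is a hypothetical helper, and passing the prompt as the final positional argument is an assumption about how the daemon invokes the CLI.

```python
def build_agent_argv(prompt: str, spawn_mode: str = "headless_ephemeral") -> list[str]:
    """Map a spawn mode to a claude command line (flags as listed in the text)."""
    flags = {
        "headless_ephemeral": ["--print", "--no-session-persistence"],
        "headless_persistent": ["--print"],
    }
    if spawn_mode not in flags:
        # interactive_persistent is described as deferred, so it is rejected here.
        raise ValueError(f"unsupported spawn_mode: {spawn_mode}")
    return ["claude", *flags[spawn_mode], prompt]
```

Returning an argv list rather than a shell string keeps the prompt text from being reinterpreted by a shell when it is handed to subprocess.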
Session launcher (work_buddy/session_launcher.py): remote_session_begin launches a visible Claude Code terminal session, optionally with --remote-control. Consent-gated. Primary use case: Remote Control from phone.
Cron syntax: standard 5-field cron in the timezone set in config.yaml (default: America/New_York).
Jitter — the thundering herd problem and how the scheduler avoids it
Many of work-buddy's jobs run on phase-aligned schedules: */3, */5, */10, */30. These coincide at common minute boundaries (:00, :30, hourly, etc.) and fire simultaneously: every five minutes a wave of indexers, sync jobs, and health checks all hit at the same second. That's the thundering herd: a synchronized burst of work whose contention (CPU, disk, lock acquisition, downstream API rate limits) is much worse than the same total work spread across the interval. Each individual job is fine; the simultaneous start is the problem.
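A quick count makes the alignment concrete. The sketch below treats each */N schedule as "fires when minute % N == 0" (standard cron step semantics for the minute field) and counts how many of the four schedules match each minute of an hour; fires_per_minute is a hypothetical helper written for this illustration.

```python
from collections import Counter

STEPS = [3, 5, 10, 30]  # the */3, */5, */10, */30 minute schedules from the text


def fires_per_minute(steps: list[int]) -> Counter:
    """Count how many step schedules match each minute of one hour."""
    counts: Counter = Counter()
    for minute in range(60):
        counts[minute] = sum(1 for step in steps if minute % step == 0)
    return counts


counts = fires_per_minute(STEPS)
# Minutes at which every schedule fires together.
busiest = [minute for minute, n in counts.items() if n == len(STEPS)]
print(busiest)  # → [0, 30]
```

All four schedules line up at :00 and :30, and smaller groups line up in between (for example */3 and */5 at :15), which is the burst pattern that jitter is meant to break up.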
The scheduler's per-job jitter solves this: set jitter_seconds: <N> in a job's frontmatter and it fires at scheduled_at + offset where offset is in [0, N], deterministic per job. The same job lands at the same offset every cycle; two jobs sharing a schedule land at different offsets and stop colliding.
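A deterministic per-job offset can be derived by hashing a stable job identifier. The sketch below is one possible approach, not necessarily the scheduler's: stable_offset and effective_at are hypothetical helpers, and hashing the job name with SHA-256 is an assumption. Python's built-in hash() is avoided because string hashes are randomized per process, which would move the offset on every daemon restart.

```python
import hashlib
from datetime import datetime, timedelta


def stable_offset(job_name: str, jitter_seconds: int) -> int:
    """Deterministic offset in [0, jitter_seconds], derived from the job name."""
    if jitter_seconds <= 0:
        return 0
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    # Take 8 bytes of the digest as an integer and fold it into the range.
    return int.from_bytes(digest[:8], "big") % (jitter_seconds + 1)


def effective_at(scheduled_at: datetime, job_name: str, jitter_seconds: int) -> datetime:
    """Cron-eligible time plus the job's stable offset."""
    return scheduled_at + timedelta(seconds=stable_offset(job_name, jitter_seconds))
```

With a hash-based offset, two jobs on the same schedule are likely, but not guaranteed, to land on different offsets; small jitter_seconds values leave few distinct slots.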
The dashboard form caps jitter_seconds per schedule. Worked examples: */3→10s, */5→30s, */10→60s, */30→180s, hourly/daily/weekly→300s (the hard cap is 5 minutes regardless). The cap is a UI recommendation, not a backend rule — hand-editing a .md file can exceed it.
Tick cadence is health_check_interval (default 30 s), so jitter_seconds values below 30 are quantized away in practice; the dashboard form warns when the typed value falls below that floor. jitter_seconds: 0 (the default) bypasses the pending-fire queue entirely and fires inline on cron match.
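The quantization effect is easy to see with a small model. The sketch assumes, for illustration only, that ticks land exactly on multiples of the tick interval and that a queued job fires on the first tick at or after its due time; observed_fire_time is a hypothetical helper.

```python
import math


def observed_fire_time(due: float, tick_interval: float = 30.0) -> float:
    """Earliest tick at or after `due`, assuming ticks on multiples of the interval."""
    return math.ceil(due / tick_interval) * tick_interval


cron_match = 300.0  # seconds; a tick on which the cron expression matched
for offset in (10, 29, 30, 45):
    print(offset, observed_fire_time(cron_match + offset))
# 10 330.0
# 29 330.0
# 30 330.0
# 45 360.0
```

Under this model every offset from 1 to 30 s fires on the same tick, so a jitter budget below the tick interval cannot separate two queued jobs from each other.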
Observability fields next_at (raw cron eligibility) and effective_at (next_at + offset, or queued pending due time) ride alongside each JobState in sidecar_state.json; the dashboard's Jobs tab renders effective_at in the "Next Run" column and shows the configured jitter_seconds in a dedicated "Jitter" column so users can correlate display time with cause.
Config (sidecar: section in config.yaml): health_check_interval (main-loop cadence for restart-decision evaluation), health_probe_interval (HealthMonitor cadence), health_probe_timeout, health_failure_threshold, max_service_crashes, restart_backoff_base, services (with module/port/enabled per service), jobs_dir (system jobs, defaults to sidecar_jobs), user_jobs_dir (user jobs override; empty = <data_root>/user_jobs/), heartbeat, message_poll_interval.
Observability: Writes <data_root>/runtime/sidecar_state.json on every tick. Query via sidecar_status or sidecar_jobs capabilities. Per-service child logs at <data_root>/runtime/service_logs/*.log (rotated as described above). Dashboard subscribes to cron.hot_reload events on the bus to refresh its Jobs tab on every actual reload.