Capability Registry¶

How capabilities are registered, probed for tool availability, disabled when a probe fails, and recovered cheaply via per-capability re-probe (CP-A3) instead of a full registry rebuild. Authoritative reference for the heavy-vs-light recovery decision.

Details¶

What¶

The capability registry (work_buddy/mcp_server/registry.py) holds two maps: _REGISTRY (active capabilities, directly callable via the MCP gateway) and _DISABLED_REGISTRY (capabilities whose requires=[...] tool probe failed at build time, stashed but not callable). At build time, the registry filter pass moves any capability whose tool probe failed from _REGISTRY to _DISABLED_REGISTRY and adds a row to work_buddy.tools.DISABLED_CAPABILITIES listing the missing tools.
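A minimal sketch of that filter pass, under stated assumptions: the Capability class is a simplified stand-in, `tool_available` is a hypothetical mapping of probe results, and `filter_by_probe` is an illustrative name rather than the registry's actual function. The sketch applies the rule uniformly and ignores any exemptions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Capability:
    # Simplified stand-in for the real Capability dataclass (assumption).
    name: str
    callable: Callable[..., object]
    requires: list[str] = field(default_factory=list)


def filter_by_probe(registry, tool_available):
    """Split a registry into (active, disabled, missing-tools-by-capability)."""
    active, disabled, missing_rows = {}, {}, {}
    for name, cap in registry.items():
        # A tool with no probe result is treated as unavailable.
        missing = [t for t in cap.requires if not tool_available.get(t, False)]
        if missing:
            disabled[name] = cap          # stashed, not callable
            missing_rows[name] = missing  # analogous to a DISABLED_CAPABILITIES row
        else:
            active[name] = cap
    return active, disabled, missing_rows


caps = {
    "memory_search": Capability("memory_search", lambda: "ok", ["hindsight"]),
    "echo": Capability("echo", lambda: "ok"),
}
active, disabled, missing = filter_by_probe(caps, {"hindsight": False})
```

Here `echo` stays active because it requires nothing, while `memory_search` is stashed with a record of the tool it is waiting on.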

The Obsidian-bridge tool family is the one exception. When the bridge itself is down, the filter skips not just obsidian but every tool that transitively depends on it in the probe graph: the in-Obsidian plugins datacore and google_calendar (see work_buddy.tools.obsidian_backed_tools()). The bridge is a transiently flaky shared dependency, not a genuinely absent one, so those capabilities stay admitted and are governed at runtime by a circuit breaker on the gateway dispatch (see architecture/resilience and work_buddy/mcp_server/dispatch_resilience.py). They fail fast per call while the bridge is down and recover the instant it returns, with no session-long disable and no reload. The carve-out is transitive-only: it covers failures inherited from a down bridge, not a plugin that is missing in its own right. If the bridge is up but a plugin is genuinely missing (e.g. datacore not installed), that plugin's capabilities still hard-disable here. The build-time disable therefore applies to genuinely absent dependencies (a missing plugin while the bridge is up, hindsight, thunderbird, ...).
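The carve-out can be sketched as a small predicate. This is an illustration only: the `OBSIDIAN_BACKED` set is an assumed snapshot of what work_buddy.tools.obsidian_backed_tools() might return, and `tools_that_hard_disable` is a hypothetical helper name.

```python
# Assumed contents; the real set would come from obsidian_backed_tools().
OBSIDIAN_BACKED = frozenset({"obsidian", "datacore", "google_calendar"})


def tools_that_hard_disable(requires, available, *, bridge_up):
    """Return the missing tools that should disable a capability at build time."""
    missing = [t for t in requires if not available.get(t, False)]
    if bridge_up:
        # Bridge is healthy, so a failed plugin probe means it is genuinely absent.
        return missing
    # Bridge is down: Obsidian-backed failures are treated as transient and skipped.
    return [t for t in missing if t not in OBSIDIAN_BACKED]


bridge_down = tools_that_hard_disable(
    ["datacore"], {"datacore": False}, bridge_up=False
)
plugin_absent = tools_that_hard_disable(
    ["datacore"], {"obsidian": True, "datacore": False}, bridge_up=True
)
mixed = tools_that_hard_disable(["datacore", "hindsight"], {}, bridge_up=False)
```

With the bridge down, a capability that needs only datacore stays admitted (empty list); with the bridge up, the same missing plugin disables it; and a non-Obsidian dependency such as hindsight disables the capability in either case.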

Disabled state is cached. A capability disabled by a transient probe failure (e.g. the hindsight memory service unreachable for 200ms during sidecar startup) stays disabled until something explicitly re-probes it.

Capability schema (selected fields)¶

name, description, parameters, callable — the dispatch surface.
requires: list[str] — tool IDs the dispatcher gates on.
mutates_state: bool, retry_policy: "manual" | "replay" | "verify_first" — inform the gateway's auto-enqueue policy.
consent_operations: list[str] — declarations for the gateway's pre-flight consent bundling.
op_id: str | None — set when the capability was resolved from an inert declaration rather than instantiated directly (see "Declaration-based capabilities" below); None for directly-registered capabilities.
effects: list[EffectSpec] — manifest of externally-visible effects for capabilities that produce more than one. When non-empty, the post-write-verify recovery path uses verify_post_write_effects (walks every declared effect; can return partial) instead of single-effect verify. Capabilities with declared effects MUST be idempotent under retry. Schema lives at work_buddy.obsidian.effects.EffectSpec; recovery semantics in architecture/retry-queue.
timeout_seconds: float | None | Callable[[params], float | None] — the wall-time budget for one gateway dispatch, owned by the operation (never the caller). The gateway timeout is opt-in: a scalar is a fixed ceiling, a callable derives the budget from the actual params (for operations whose runtime scales with input), and unset (None) is unbounded — the gateway imposes no cap. No flat default is applied: capabilities that deliberately block or run long (human-in-the-loop request_send/request_poll, the obsidian_retry/retry wrappers, llm_submit) would be wrongly cut off by a too-low default, so a real default must be calibrated from observed dispatch p99 and paired with explicit exemptions. Resolved at dispatch in work_buddy/mcp_server/dispatch_resilience.py; a capability that declares a finite budget and overruns it gets error_kind="mcp_gateway_timeout".
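The three forms of timeout_seconds can be resolved with a few lines. The following is a sketch of the idea, not the code in dispatch_resilience.py; `resolve_timeout` and the `paths` parameter are hypothetical names used for illustration.

```python
from typing import Any, Callable, Optional, Union

# Scalar ceiling, callable derived from params, or None for "unbounded".
TimeoutSpec = Union[float, None, Callable[[dict[str, Any]], Optional[float]]]


def resolve_timeout(spec: TimeoutSpec, params: dict[str, Any]) -> Optional[float]:
    """Return the wall-time budget in seconds, or None when no cap applies."""
    if spec is None:
        return None          # opt-in: an unset budget means no gateway cap
    if callable(spec):
        return spec(params)  # budget scales with the actual call parameters
    return float(spec)       # fixed ceiling


unbounded = resolve_timeout(None, {})
fixed = resolve_timeout(30, {})
scaled = resolve_timeout(
    lambda p: 5.0 + 0.5 * len(p["paths"]), {"paths": ["a.md", "b.md"]}
)
```

The callable form keeps the budget with the operation while still letting it grow with input size, which is the case a single flat default cannot express.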

Two recovery paths — use the right one¶

Per-capability re-probe (preferred for runtime recovery)¶

work_buddy.recovery.recheck_disabled_capability(name, *, force=False) re-probes ONLY the capability's missing tools. Each tool has its own cool-down (default 30s, overridable via the WB_RECHECK_COOLDOWN_SECS environment variable), and a single _RECOVERY_LOCK (an RLock) keeps concurrent callers safe. On success, it mutates _REGISTRY in place to restore the capability. No rebuild, no module purge.

Returns True if the capability is now in the live registry, False if it remains disabled (with DISABLED_CAPABILITIES[name] updated to reflect any partially-recovered tools).

Companion: recheck_tool(tool_id, *, force=False) for re-probing a single tool without scoping to a capability. Same cool-down, same lock.
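A simplified sketch of the cool-down and lock interplay follows. It is not the code in work_buddy/recovery.py: the signatures take explicit `probe`, registry, and clock arguments so the example is self-contained, and treating a tool inside its cool-down as still unavailable is an assumption of this sketch.

```python
import threading
import time

_LOCK = threading.RLock()                 # re-entrant: capability recheck calls tool recheck
_LAST_RECHECK_AT: dict[str, float] = {}   # per-tool timestamp of the last probe
COOLDOWN_SECS = 30.0


def recheck_tool(tool_id, probe, *, force=False, now=time.monotonic):
    """Probe one tool unless it was probed within the cool-down window."""
    with _LOCK:
        last = _LAST_RECHECK_AT.get(tool_id)
        if not force and last is not None and now() - last < COOLDOWN_SECS:
            return False  # assumption: skipped probes count as "still down"
        _LAST_RECHECK_AT[tool_id] = now()
        return bool(probe(tool_id))


def recheck_capability(name, active, disabled, missing, probe, *,
                       force=False, now=time.monotonic):
    """Re-probe only the capability's missing tools; restore it in place on success."""
    with _LOCK:
        if name in active:
            return True
        if name not in disabled:
            return False
        still_missing = [t for t in missing.get(name, [])
                         if not recheck_tool(t, probe, force=force, now=now)]
        if still_missing:
            missing[name] = still_missing  # record partial recovery
            return False
        active[name] = disabled.pop(name)  # in-place restore, no rebuild
        missing.pop(name, None)
        return True


# Demo with a fake clock so the cool-down is observable without sleeping.
clock = [0.0]
tool_up = {"hindsight": False}
active, disabled = {}, {"memory_search": "<capability>"}
missing = {"memory_search": ["hindsight"]}
probe = lambda tool: tool_up[tool]
tick = lambda: clock[0]

first = recheck_capability("memory_search", active, disabled, missing, probe, now=tick)
tool_up["hindsight"] = True
clock[0] = 10.0  # inside the cool-down: the probe is skipped
second = recheck_capability("memory_search", active, disabled, missing, probe, now=tick)
clock[0] = 31.0  # cool-down elapsed: the probe runs and succeeds
third = recheck_capability("memory_search", active, disabled, missing, probe, now=tick)
```

The re-entrant lock matters because the capability-level function holds the lock while calling the tool-level one; a plain Lock would deadlock on that nested acquisition.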

Used by:
- The gateway's wb_run dispatch path (work_buddy/mcp_server/tools/gateway.py). On hitting a disabled capability, the gateway calls recheck_disabled_capability before returning the disabled-error.
- The sidecar's retry sweep _replay (work_buddy/sidecar/retry_sweep.py). On hitting a disabled capability during a queued retry, the sweep calls recheck_disabled_capability rather than reporting "not found in registry". Falls back to invoking the disabled entry's callable when recheck still says no, since the bridge call inside raises a typed transient exception and the operation re-queues correctly.
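The gateway-side behaviour, one cheap re-probe before giving up, might look roughly like this. The `dispatch` function, its arguments, and the error_kind strings are placeholders for illustration and are not taken from gateway.py.

```python
def dispatch(name, params, active, disabled, recheck):
    """Run a capability, attempting one per-capability re-probe if it is disabled."""
    if name not in active and name in disabled:
        # recheck is expected to move the entry into `active` on success.
        recheck(name)
    cap = active.get(name)
    if cap is None:
        # Placeholder error kinds, not the gateway's real values.
        kind = "disabled" if name in disabled else "not_found"
        return {"ok": False, "error_kind": kind}
    return {"ok": True, "result": cap(**params)}


active = {}
disabled = {"memory_search": lambda query: f"results for {query}"}


def recheck_succeeds(name):
    # Simulates a successful re-probe restoring the capability in place.
    active[name] = disabled.pop(name)
    return True


recovered = dispatch(
    "memory_search", {"query": "notes"}, active, disabled, recheck_succeeds
)
unknown = dispatch("nope", {}, active, disabled, recheck_succeeds)
```

The caller sees a normal result when the tool has come back, and only gets the disabled error when the re-probe also fails.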

Data-only registry reload (declaration / workflow / param-schema changed)¶

The reload_capability_data capability (which calls reload_capability_data() in registry.py) resets the knowledge-store cache and clears _REGISTRY, then rebuilds in place via get_registry(), WITHOUT purging sys.modules. Because no module is re-imported, Capability / WorkflowDefinition class identity stays stable and the long-lived FastMCP gateway reads the rebuilt registry directly. It costs ~6–8s because it re-probes every tool via _build_registry.
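Why an in-place rebuild reaches a long-lived consumer can be shown with plain Python objects. This is an analogy under stated assumptions, not the registry's code: `RegistryHolder` and `LongLivedGateway` are hypothetical classes, and replacing the dict stands in for any reload that produces new objects the consumer never sees.

```python
class RegistryHolder:
    """Stand-in for the module that owns the registry dict."""

    def __init__(self):
        self.registry = {"echo": "v1"}

    def reload_in_place(self, build):
        # Same dict object; existing references observe the new contents.
        self.registry.clear()
        self.registry.update(build())

    def reload_by_replacing(self, build):
        # New dict object; existing references keep pointing at the old one.
        self.registry = build()


class LongLivedGateway:
    def __init__(self, registry):
        self.registry = registry  # captured once at boot

    def names(self):
        return sorted(self.registry)


holder = RegistryHolder()
gateway = LongLivedGateway(holder.registry)

holder.reload_in_place(lambda: {"echo": "v2", "new_cap": "v1"})
seen_after_in_place = gateway.names()

holder.reload_by_replacing(lambda: {"only_in_new_dict": "v1"})
seen_after_replacing = gateway.names()
```

After the in-place reload the gateway sees the new capability; after the replacing reload it still sees the previous contents, because it holds a reference captured at boot.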

Use when you edited or added a capability declaration (including its parameters schema) or a workflow unit and want it live without a restart. It also re-enables a capability whose tool just came back (the rebuild re-probes and re-runs the requirements filter).

It does NOT pick up edited Op code or a brand-new Op module — re-importing Python is what a process restart (Ctrl+R) does safely.

Retired: mcp_registry_reload (the function invalidate_registry() lives on, dormant) purged work_buddy.* from sys.modules to pick up code. In the long-lived FastMCP gateway that silently did nothing — wb_run / wb_search are frozen against the boot module generation, so the rebuilt registry never reached dispatch, while the purge corrupted Capability class identity. It was removed from the agent surface; use reload_capability_data for data and a Ctrl+R restart for code. See dev/mcp-reload and .data/designs/mcp-registry-reload.

Do NOT use reload_capability_data for transient probe failures. The dispatch path already auto-recovers a disabled capability via recheck_disabled_capability (a per-capability re-probe); the full rebuild is heavier than needed for that.

Decision tree¶

Capability is disabled / not found in active registry
  - Declaration / workflow / param-schema changed -> reload_capability_data
  - Op code or new Op module changed -> restart the gateway (Ctrl+R)
  - Probe transient-failed -> recheck_disabled_capability(name)
      - Returns True -> capability is back in _REGISTRY, proceed
      - Returns False -> tools still down
          - Caller wants to retry later -> re-queue / surface
          - Caller can run the capability anyway ->
              invoke disabled_entry.callable(...)
              (typed bridge exception -> @bridge_retry handles)
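The tree above can also be read as a small function. The `Cause` enum, `recovery_action`, and the returned strings are illustrative labels only; they do not correspond to identifiers in the codebase.

```python
from enum import Enum


class Cause(Enum):
    DATA_CHANGED = "declaration, workflow, or param schema changed"
    CODE_CHANGED = "Op code or new Op module changed"
    TRANSIENT_PROBE = "probe failed transiently"


def recovery_action(cause, *, recheck_ok=False, can_run_anyway=False):
    """Map a disabled-capability cause to the recovery step the tree recommends."""
    if cause is Cause.DATA_CHANGED:
        return "reload_capability_data"
    if cause is Cause.CODE_CHANGED:
        return "restart_gateway"
    # Transient probe failure: recheck_ok is the result of the per-capability re-probe.
    if recheck_ok:
        return "proceed"
    if can_run_anyway:
        return "invoke_disabled_callable"
    return "requeue_or_surface"


data_step = recovery_action(Cause.DATA_CHANGED)
code_step = recovery_action(Cause.CODE_CHANGED)
back_online = recovery_action(Cause.TRANSIENT_PROBE, recheck_ok=True)
sweep_fallback = recovery_action(Cause.TRANSIENT_PROBE, can_run_anyway=True)
give_up_for_now = recovery_action(Cause.TRANSIENT_PROBE)
```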

Declaration-based capabilities¶

Not every capability is a Capability(...) instance in registry.py. A capability can also be an inert declaration in the knowledge store that names an Op (a callable registered by ID in the Op registry). The capability loader resolves declarations at registry-build time and merges the resulting Capability objects into _REGISTRY alongside the directly-registered ones — a declared capability is indistinguishable at dispatch time except for its op_id field. See architecture/data-first-capabilities for the Op registry, the loader, and load-time validation.
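A rough sketch of how a declaration could be resolved against an Op registry. Everything here is assumed for illustration: the declaration shape (`name` and `op` keys), the `register_op` decorator, the `resolve_declaration` helper, and raising ValueError for an unknown Op are not taken from op_registry.py or capability_loader.py.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

OP_REGISTRY: dict[str, Callable[..., Any]] = {}


def register_op(op_id):
    """Register a callable under a stable ID (hypothetical decorator)."""
    def decorator(fn):
        OP_REGISTRY[op_id] = fn
        return fn
    return decorator


@dataclass
class Capability:
    # Simplified stand-in for the real Capability dataclass (assumption).
    name: str
    callable: Callable[..., Any]
    op_id: Optional[str] = None  # None for directly-registered capabilities


def resolve_declaration(decl: dict[str, Any]) -> Capability:
    """Turn an inert declaration into a Capability, validating the Op reference."""
    op_id = decl["op"]
    if op_id not in OP_REGISTRY:
        raise ValueError(f"declaration {decl['name']!r} names unknown op {op_id!r}")
    return Capability(name=decl["name"], callable=OP_REGISTRY[op_id], op_id=op_id)


@register_op("notes.count_words")
def count_words(text: str) -> int:
    return len(text.split())


declared = resolve_declaration({"name": "note_word_count", "op": "notes.count_words"})
direct = Capability(name="echo", callable=lambda text: text)
```

Both objects dispatch the same way; only `declared.op_id` reveals that one of them came from a declaration.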

Key files¶

work_buddy/mcp_server/registry.py — _REGISTRY, _DISABLED_REGISTRY, Capability dataclass, get_registry, get_disabled_registry, invalidate_registry
work_buddy/mcp_server/op_registry.py — Op registry backing declaration-based capabilities (see architecture/data-first-capabilities)
work_buddy/knowledge/capability_loader.py — resolves capability declarations against the Op registry
work_buddy/recovery.py — recheck_disabled_capability, recheck_tool, _RECOVERY_LOCK, _LAST_RECHECK_AT
work_buddy/obsidian/effects.py — EffectSpec schema for the Capability.effects manifest
work_buddy/obsidian/post_write_verify.py — verify_post_write_effects walker
work_buddy/tools/__init__.py — DISABLED_CAPABILITIES, is_tool_available, reprobe_one
work_buddy/mcp_server/tools/gateway.py — dispatch path with effects-aware PWU handler
work_buddy/sidecar/retry_sweep.py — sweep path with disabled-cap recovery and effects-aware pre-verify

When in doubt¶

Per-capability re-probe is almost always right at runtime. reload_capability_data is for data changes (declarations, workflows, param schemas) and full inventory rebuilds; an Op code change or a new Op module needs a gateway restart (Ctrl+R). The 30s cool-down on per-capability re-probes is your friend: it stops aggressive callers from hammering a genuinely-down tool.