Local LLM With Tools¶

[LEGACY: superseded by LLMRunner] Local LM Studio-served models invoke a restricted whitelist of work-buddy MCP tools via /api/v1/chat. Gateway-enforced security via session_acl. Superseded for internal callers by architecture/llm-runner; retained here because the MCP-exposed llm_with_tools capability still uses this path.

Details¶

Status — legacy¶

This document describes the legacy local-LLM-with-tools path. All internal Python callers have migrated to work_buddy.llm.LLMRunner. The llm_with_tools function and its MCP capability are retained because external agents (Claude Code sessions, slash commands, workflow steps) may still call them over MCP. A CI sentinel test (tests/unit/test_legacy_llm_api_guard.py) blocks new internal callers from adopting this path.

For new code, use architecture/llm-runner instead. Tool-call dispatch on LLMRunner.call(tools=[...]) currently raises NotImplementedError; the deletion pass (task t-a373609f) will wire it through native Anthropic and LM Studio backend adapters and remove llm_with_tools entirely.

The rest of this unit documents the legacy implementation for reference — the gateway-enforced security model described below is still correct because the MCP-exposed capability still operates.

Why this exists¶

Claude calls consume cloud tokens per turn. For bounded, pattern-based work (summarization, classification, structured extraction, context pre-compression, low-priority background triage), local models on a compute laptop are ~$0 per call and can run overnight without blocking. llm_with_tools lets the user offload such work to a local model while keeping it meaningfully useful by giving it access to a restricted set of work-buddy capabilities.

The design explicitly does NOT try to make local models a Claude replacement. They are slower, weaker at reasoning, and much more likely to make dangerous mistakes if given broad tool access. Every layer below assumes adversarial or confused-model behavior.

How the call flow works¶

Agent calls wb_run("llm_with_tools", {profile, tool_preset, system, user, required_capabilities?, ...}).
Pre-flight guard (if caller passed required_capabilities): verify every name is in resolve_preset(tool_preset). On miss, return a specific error listing the missing names and the preset — zero LM Studio round-trip. Guards against goal-preset mismatch (e.g. reusing a readonly preset for a workflow that needs mutating capabilities).
llm_with_tools synthesizes a per-call session id (lms-).
Calls session_acl.set_session_acl(session_id, preset_capabilities) — puts the whitelist in a process-local dict keyed by session id.
POSTs /api/v1/chat on LM Studio's localhost server with:
model from the profile
input (system + user text)
integrations: [{type: "ephemeral_mcp", server_url: "http://localhost:5126/mcp", allowed_tools: ["wb_run", "wb_search"], headers: {"X-Work-Buddy-Session": }}]
LM Studio drives a tool-call loop server-side. On each call it hits work-buddy's MCP gateway with the X-Work-Buddy-Session header.
The gateway's _auto_init_from_header(ctx) reads the header and registers the MCP connection with that session id (so the model doesn't need to call wb_init). wb_run then consults session_acl for the session and rejects any capability not in the whitelist.
When wb_search is called, session_acl.filter_search_results trims the hit list to the ACL. If anything was trimmed, the response is wrapped as {results, _acl_filtered, _acl_hidden_count, _acl_notice} so the model sees WHY its list is short rather than silently reworking the query and burning tokens on a search loop.
When all tool calls finish, LM Studio returns the model's final message. llm_with_tools clears the ACL in a finally.
Raw tool-call outputs and reasoning tokens are trimmed from the response. persist_tool_results=True or any tool error triggers auto-persist to the scratch artifact store (3-day TTL).

Security model: gateway-enforced, not LM Studio-enforced¶

LM Studio expects integrations.allowed_tools to contain top-level MCP tool names. work-buddy's MCP surface exposes only six: wb_init, wb_run, wb_search, wb_advance, wb_status, wb_step_result. Every domain capability (task_briefing, project_get, etc.) is dispatched through wb_run.

We list ["wb_run", "wb_search"] in allowed_tools and enforce the per-capability whitelist server-side via session_acl. LM Studio cannot be trusted to enforce this — a malicious or confused preset could easily over-advertise.
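
A minimal sketch of what gateway-side enforcement on wb_run might look like. The registry contents, the error shapes, and the rule that a session with no registered ACL is unrestricted are assumptions made for illustration; the real gateway.py may differ.

```python
from typing import Any, Callable, Dict, FrozenSet

# Hypothetical state: one ACL-scoped session and a tiny capability registry.
_SESSION_ACLS: Dict[str, FrozenSet[str]] = {"sess-1": frozenset({"task_briefing"})}
_REGISTRY: Dict[str, Callable[[dict], Any]] = {
    "task_briefing": lambda params: {"tasks": []},
    "project_get": lambda params: {"project": params.get("id")},
}


def wb_run(session_id: str, capability: str, params: dict) -> dict:
    """Dispatch a capability, rejecting anything outside the session whitelist."""
    acl = _SESSION_ACLS.get(session_id)
    # The check lives in the gateway, so it holds no matter what the client
    # was told in allowed_tools.
    if acl is not None and capability not in acl:
        return {"error": f"capability '{capability}' is not permitted for this session"}
    handler = _REGISTRY.get(capability)
    if handler is None:
        return {"error": f"unknown capability '{capability}'"}
    return {"result": handler(params)}
```

Because every domain capability funnels through wb_run, one check at this choke point covers the whole capability surface.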

Fail-closed bookend on unresolved sessions¶

The gateway relies on header-based auto-init to tie each tool call back to the ACL set by llm_with_tools. If an MCP transport or client reconnect loses the ctx→session mapping, _resolve_session returns None. Historically is_capability_allowed(None, cap) returned True, which silently bypassed the ACL. The fix (2026-04-17) adds session_acl.any_acl_registered() and makes is_capability_allowed fail closed when session is None AND an ACL is active anywhere in the process. wb_search's filter_search_results helper applies the same fail-closed rule on the search path. The only legitimate callers that resolve to None are normal agents in a process with no ACL-scoped runs — they're unaffected.

Tool presets¶

Defined in code at work_buddy/llm/tool_presets.py. PRESETS is a frozen dict of preset name → frozenset of capability names. Validated by validate_presets():
- readonly_* presets must contain zero mutating capabilities (checked against _MUTATING_CAPABILITIES)
- every capability name in a preset must match a real registered capability
- NO preset may contain wb_init (ACL-escape vector)

Adding or expanding a preset is a reviewed PR. Config-level overrides are not supported — presets live in code.
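
A sketch of the three validation rules, under stated assumptions: the preset name readonly_tasks, the capability task_create, and the registry contents are invented for illustration, and returning a list of problems (rather than raising) is a choice of this sketch, not necessarily what validate_presets() does.

```python
from types import MappingProxyType
from typing import FrozenSet, List, Mapping

# Hypothetical data; the real sets live in tool_presets.py and the registry.
_MUTATING_CAPABILITIES = frozenset({"task_create"})
REGISTERED = frozenset({"task_briefing", "project_get", "task_create"})

# MappingProxyType gives a read-only view, one way to get a "frozen dict".
PRESETS: Mapping[str, FrozenSet[str]] = MappingProxyType({
    "readonly_tasks": frozenset({"task_briefing", "project_get"}),
})


def validate_presets(presets: Mapping[str, FrozenSet[str]],
                     registered: FrozenSet[str]) -> List[str]:
    """Return human-readable problems; an empty list means all presets pass."""
    problems: List[str] = []
    for preset, caps in presets.items():
        if "wb_init" in caps:
            problems.append(f"{preset}: contains wb_init (ACL-escape vector)")
        if preset.startswith("readonly_"):
            mutating = sorted(caps & _MUTATING_CAPABILITIES)
            if mutating:
                problems.append(f"{preset}: readonly preset has mutating {mutating}")
        unknown = sorted(caps - registered - {"wb_init"})
        if unknown:
            problems.append(f"{preset}: unregistered capabilities {unknown}")
    return problems
```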

required_capabilities pre-flight guard¶

Callers who know which capabilities the model will need can pass required_capabilities: [...] alongside tool_preset. llm_with_tools verifies every name is in the resolved preset BEFORE opening an LM Studio session. On miss, returns an explicit error naming the missing capabilities and the preset. This catches the common failure mode of reusing a preset from a prior call without checking whether it covers the new task.
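
The guard amounts to a set difference computed before any network call. A minimal sketch, assuming a hypothetical preflight_check helper and error shape:

```python
from typing import Iterable, Optional


def preflight_check(tool_preset: str,
                    preset_capabilities: Iterable[str],
                    required_capabilities: Optional[Iterable[str]] = None) -> Optional[dict]:
    """Return an error dict if the preset lacks a required capability, else None."""
    if not required_capabilities:
        return None  # caller did not opt in to the guard
    missing = sorted(set(required_capabilities) - set(preset_capabilities))
    if not missing:
        return None
    return {
        "error": (f"preset '{tool_preset}' does not cover required "
                  f"capabilities: {', '.join(missing)}"),
        "missing": missing,
        "preset": tool_preset,
    }
```

A caller would run this first and return the error immediately, so a mismatched preset never costs an LM Studio round-trip.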

Response hygiene¶

content (model's final text answer): always surfaced
reasoning (chain of thought): stripped from response by default; saved as .md artifact on persist/error
tool_calls[].output (raw MCP tool result): stripped by default; replaced with output_size_chars + output_omitted=true; saved as .json artifact (cleanly unwrapped, no triple-JSON escape) on persist/error
error_preview (capped at 500 chars): always included when a tool call errored, so the caller has signal even without the full artifact
Any tool error in the batch auto-escalates to persist everything, so the caller can audit without re-running
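
The trimming rules above could be sketched roughly as below. The input field names (name, output, error) and the trim_response helper are assumptions for illustration; the artifact-writing step is omitted and only the persist decision is returned.

```python
import json
from typing import Tuple

ERROR_PREVIEW_CHARS = 500  # cap stated in the list above


def trim_response(response: dict, persist_tool_results: bool = False) -> Tuple[dict, bool]:
    """Return (trimmed response, whether full results should be persisted)."""
    # Only content is carried over; reasoning is dropped by default.
    trimmed = {"content": response.get("content", ""), "tool_calls": []}
    any_error = False
    for call in response.get("tool_calls", []):
        output = call.get("output", "")
        text = output if isinstance(output, str) else json.dumps(output)
        # Keep call metadata, replace the raw output with its size.
        entry = {k: v for k, v in call.items() if k not in ("output", "error")}
        entry["output_size_chars"] = len(text)
        entry["output_omitted"] = True
        if call.get("error"):
            any_error = True
            entry["error_preview"] = str(call["error"])[:ERROR_PREVIEW_CHARS]
        trimmed["tool_calls"].append(entry)
    # Any tool error escalates to persisting everything.
    return trimmed, persist_tool_results or any_error
```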

Auto-retry opt-out¶

llm_with_tools and llm_submit declare auto_retry=False on their Capability registration. The gateway honors this by forcing retry_policy="manual". Without this, a local-LLM failure gets replayed 5 times — each replay wastes tokens, spams consent prompts on bridge-dependent capabilities, and almost never succeeds on the next tick.
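
As a rough illustration of the opt-out, assuming a simplified Capability record and a hypothetical default policy name "auto" (only "manual" is named above):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Simplified stand-in for a capability registration."""
    name: str
    auto_retry: bool = True


def effective_retry_policy(capability: Capability, requested: str = "auto") -> str:
    """Force manual retries for capabilities that opted out of auto-retry."""
    if not capability.auto_retry:
        return "manual"
    return requested
```

With this shape, effective_retry_policy(Capability("llm_with_tools", auto_retry=False)) resolves to "manual" regardless of what the caller requested.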

Disabled-vs-ACL-hidden distinction¶

Two very different conditions used to share the single unavailable: true flag in wb_search results:

Capability is in the knowledge store but NOT in the live registry — typically because a tool dependency (e.g. Obsidian bridge) is unmet.
Capability is registered but filtered out by this session's ACL.

As of 2026-04-17 these are explicit:
- Case 1 → result carries disabled: true + disabled_reason: "Dependency unavailable: obsidian" (or the equivalent for whatever dep is missing). The old unavailable: true key is kept as a back-compat alias through 2026-Q3. This branch now applies to workflows too; previously, workflow hits without a live registry entry came back with no flag at all.
- Case 2 → results are filtered out entirely; if any were hidden, wb_search returns the _acl_notice wrapper described above.

Reasoning models were conflating the two conditions ("this capability is unavailable" = "I don't have permission") and drawing the wrong conclusion. The distinct signals + explicit reason strings fix that.
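
The two signals can be sketched together in one result-shaping pass. Everything here is illustrative: shape_search_results, the name key on each hit, the vault_read capability in the usage example, and the decision to apply the ACL before the disabled check are assumptions, not the real wb_search code.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Union


def shape_search_results(hits: List[dict],
                         live_registry: Set[str],
                         missing_dependency: Dict[str, str],
                         acl: Optional[FrozenSet[str]] = None) -> Union[List[dict], dict]:
    """Flag disabled hits and hide ACL-filtered ones, keeping the two distinct."""
    shaped: List[dict] = []
    hidden = 0
    for hit in hits:
        name = hit["name"]
        if acl is not None and name not in acl:
            hidden += 1  # Case 2: dropped entirely, only counted
            continue
        if name not in live_registry:
            # Case 1: known to the knowledge store but not live.
            dep = missing_dependency.get(name, "unknown")
            shaped.append({**hit,
                           "disabled": True,
                           "disabled_reason": f"Dependency unavailable: {dep}",
                           "unavailable": True})  # back-compat alias
            continue
        shaped.append(hit)
    if hidden == 0:
        return shaped
    return {
        "results": shaped,
        "_acl_filtered": True,
        "_acl_hidden_count": hidden,
        "_acl_notice": (f"{hidden} result(s) hidden by this session's capability "
                        "whitelist; rewording the query will not reveal them."),
    }
```

For example, with hits for task_briefing, project_get, and a hypothetical vault_read whose Obsidian dependency is unmet, an ACL of {task_briefing, vault_read} yields the wrapper with one hidden result, while vault_read stays visible but flagged as disabled.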

Known limitations¶

Consent-gated primitives like obsidian.eval_js fire on every call, because gating is per-primitive, not per-caller-context. A readonly preset that includes task_briefing still triggers eval_js high-risk consent even though task_briefing is read-only. Parked in task t-3629e1b1 as the call-stack-aware consent risk work.
The native /api/v1/chat endpoint is LM Studio-specific. Other OpenAI-compat servers (vLLM, Ollama) don't support it. Use llm_call with a profile for plain text generation on those.
LM Studio routes remote models via LM Link through its OpenAI-compat endpoints; /api/v1/chat behavior with LM Link is less extensively documented. Keep the compute laptop loaded and linked for reliability.

Key files¶

work_buddy/llm/with_tools.py — llm_with_tools capability
work_buddy/llm/tool_presets.py — PRESETS, resolve_preset, validate_presets
work_buddy/llm/backends/lmstudio_native.py — /api/v1/chat client
work_buddy/llm/_tool_call_trim.py — response hygiene, artifact persist
work_buddy/mcp_server/session_acl.py — per-session capability ACL, any_acl_registered, filter_search_results, fail-closed semantics
work_buddy/mcp_server/tools/gateway.py — _auto_init_from_header, wb_init escape block, wb_run ACL enforcement, wb_search ACL filter
tests/unit/test_llm_with_tools.py, test_llm_with_tools_hygiene.py, test_llm_tool_call_trim.py, test_session_acl_escape_block.py, test_search_disabled_flag.py, test_compat_port_cleanup.py

architecture/llm-runner — the unified replacement. New code should go there.
Task t-a373609f — the deletion pass that retires this path.