Inference

Local-inference subsystems: admission control, LLM backends, embedding service, and provider dispatch for LM Studio / LM Link.

Details

Local-inference subsystems — everything that decides when, where, and how a call to a local model happens.

What lives here

  • Broker: work_buddy.inference.broker.LocalInferenceBroker. Provides per-profile slot limits, priority classes (INTERACTIVE / WORKFLOW / BACKGROUND), separate timeouts for queue wait and for inference, and per-call metrics. Every outbound local-inference call (embedding or LLM) routes through it, so work-buddy, not LM Studio, is the scheduler of record.
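
To make the broker's responsibilities concrete, here is a minimal sketch of admission control with per-profile slots, the three priority classes, and separate queue-wait and inference timeouts. It is not the actual LocalInferenceBroker: the names SlotBroker, QueueWaitTimeout, acquire, release, and run, and the shape of the returned metrics dict, are all assumptions made for illustration. The sketch also assumes the wrapped call enforces its own inference timeout (for example as an HTTP client timeout).

```python
import heapq
import itertools
import threading
import time
from enum import IntEnum


class Priority(IntEnum):
    # Lower value is served first. The class names come from the doc;
    # the numeric ordering is an assumption.
    INTERACTIVE = 0
    WORKFLOW = 1
    BACKGROUND = 2


class QueueWaitTimeout(Exception):
    """Hypothetical error: no slot freed up within the queue-wait budget."""


class SlotBroker:
    """Illustrative stand-in, NOT the real LocalInferenceBroker."""

    def __init__(self, slots_per_profile):
        self._free = dict(slots_per_profile)
        self._waiting = {name: [] for name in slots_per_profile}
        self._cond = threading.Condition()
        self._seq = itertools.count()  # FIFO tie-break within a priority

    def acquire(self, profile, priority, queue_timeout_s):
        ticket = (int(priority), next(self._seq))
        deadline = time.monotonic() + queue_timeout_s
        with self._cond:
            heap = self._waiting[profile]
            heapq.heappush(heap, ticket)
            try:
                # Proceed only when a slot is free AND we are the best waiter.
                while not (self._free[profile] > 0 and heap[0] == ticket):
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
                        raise QueueWaitTimeout(profile)
                    self._cond.wait(remaining)
                self._free[profile] -= 1
            finally:
                # Leave the queue on success or timeout, then wake the others
                # so the next-best waiter can re-check its turn.
                heap.remove(ticket)
                heapq.heapify(heap)
                self._cond.notify_all()

    def release(self, profile):
        with self._cond:
            self._free[profile] += 1
            self._cond.notify_all()

    def run(self, profile, priority, call, queue_timeout_s, inference_timeout_s):
        """Admit the call, run it, and report split timings."""
        t0 = time.monotonic()
        self.acquire(profile, priority, queue_timeout_s)
        t1 = time.monotonic()
        try:
            # The callable receives the inference budget and must honour it.
            result = call(inference_timeout_s)
        finally:
            self.release(profile)
        t2 = time.monotonic()
        return result, {"queue_wait_s": t1 - t0, "inference_s": t2 - t1}
```

Splitting the two budgets means a caller can tell "the machine was busy" (queue-wait timeout) apart from "the model was slow" (inference timeout), which a single combined deadline would hide.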

What WILL live here (pending restructure)

Today these are flat siblings under architecture/; a follow-up PR (docs_move pass) will re-home them under architecture/inference/ alongside the broker:

  • architecture/llm-runner — unified LLM entry point + tier dispatch.
  • architecture/llm-with-tools — legacy tool-call loop (kept for MCP-exposed capability).
  • architecture/embedding-service — the Flask service on port 5124 + asymmetric / symmetric model registry.

Until that restructure lands, use the flat-path links on the parent architecture page for those three.

Why group these

All four concerns (broker, LLM calls, embedding calls, runner) share the same underlying infrastructure: LM Studio on localhost:1234 (possibly routing via LM Link to remote compute), per-profile slot accounting, and the same error vocabulary (LocalInferenceError + its kinds in work_buddy/llm/backends/_errors.py). Keeping them together makes it easy to find the right entry point for a new inference-adjacent feature without scanning a flat architecture/ index.

Children