Embedding Service

Local HTTP service on port 5124 providing dense vector embeddings for search and similarity; exposes a symmetric default model plus an asymmetric query/document pair via role-aware client wrappers.

Overview

A long-running sidecar service providing dense vector embeddings for work-buddy's search and similarity features. Exposes an HTTP API on localhost:5124 with eager-loaded models, so interactive calls are fast.

Endpoints

  • POST /embed — embed a batch of texts, return vectors
  • POST /similarity — cosine similarity between a query and candidate texts
  • POST /search — BM25 + embedding hybrid search over candidates
  • POST /ir/search, POST /ir/index — indexed IR search over registered sources
  • GET /health — liveness probe
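To make the /similarity scoring concrete, here is a minimal sketch of cosine-similarity ranking over vectors that have already been embedded. It illustrates the metric only; the function names are hypothetical and the service's actual implementation may differ (for example, it may use vectorized math).

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(
    query_vec: Sequence[float], candidate_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate_index, score) pairs, best match first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(candidate_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because cosine similarity ignores vector length, a candidate pointing in the same direction as the query scores 1.0 regardless of its magnitude.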

Client

work_buddy/embedding/client.py wraps the HTTP API:

  • embed(texts) — plain batch embedding, uses the service's default model
  • embed_for_ir(texts, role="query"|"document") — asymmetric IR encoding via the query↔document model pair
  • similarity_search(query, candidates) — rank candidates by similarity
  • hybrid_search(query, candidates) — BM25 + dense blend
  • ir_search(query) — search a pre-built IR index
  • ir_index(action, source, ...) — build or check an IR index
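The "BM25 + dense blend" behind hybrid_search can be pictured as a weighted sum of normalized scores. The sketch below is one common way to do this, not necessarily what the service does: min-max normalization, the alpha weight, and both function names are assumptions made for illustration.

```python
from typing import List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def blend(
    bm25: Sequence[float], dense: Sequence[float], alpha: float = 0.5
) -> List[float]:
    """Weighted sum of normalized lexical and dense scores.

    alpha is the weight on the dense score (hypothetical parameter);
    alpha=0 is pure BM25, alpha=1 is pure embedding similarity.
    """
    if len(bm25) != len(dense):
        raise ValueError("score lists must align one-to-one with candidates")
    if not bm25:
        return []
    lexical, semantic = min_max(bm25), min_max(dense)
    return [(1 - alpha) * x + alpha * y for x, y in zip(lexical, semantic)]
```

Normalizing first matters because raw BM25 scores are unbounded while cosine scores sit in [-1, 1]; without rescaling, the lexical term would dominate the sum.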

Graceful degradation

Every client function returns None (or an empty list) when the service is unavailable. Never assume the service is up: callers must handle that path, typically by falling back to BM25 alone or skipping the semantic step.
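A caller-side pattern for this might look like the sketch below. The semantic ranker is passed in as a callable so the example runs without the service. The names are hypothetical, and the token-overlap fallback is a simple stand-in for a real BM25 scorer.

```python
from typing import Callable, List, Optional, Sequence

# A ranker returns candidate indices, best first, or None/[] when unavailable.
Ranker = Callable[[str, Sequence[str]], Optional[List[int]]]


def keyword_overlap_rank(query: str, candidates: Sequence[str]) -> List[int]:
    """Crude lexical fallback: order candidates by shared lowercase tokens."""
    query_tokens = set(query.lower().split())
    overlap = [len(query_tokens & set(c.lower().split())) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: overlap[i], reverse=True)


def rank_with_fallback(
    query: str, candidates: Sequence[str], semantic_rank: Ranker
) -> List[int]:
    """Use the semantic ranker when it answers; otherwise rank lexically."""
    result = semantic_rank(query, candidates)
    if not result:  # None and [] are both treated as "service unavailable"
        return keyword_overlap_rank(query, candidates)
    return result
```

In real code, semantic_rank would wrap a client call such as similarity_search; the key point is that the None check lives at the call site rather than being assumed away.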

Consumers

Knowledge search, IR conversation search, Smart Connections ranking, task-triage similarity, and other semantic-scoring sites throughout the codebase all route through this service. Model loading happens once here; all callers share the loaded weights.

Optional: LM Studio offload for bulk document encoding

Bulk document encoding (the large passage-side model, leaf-ir / snowflake-arctic-embed-m-v1.5) can route through LM Studio instead of loading locally. This moves ~500 MB of RSS off the main machine, optionally to a remote compute device via LM Link. Opt in per model by setting embedding.models.<key>.provider: lmstudio in config.

When enabled, the call path is:

ir.dense._encode_bulk_direct
  → work_buddy.embedding.providers.lmstudio.encode
    → LocalInferenceBroker.slot(profile=f"lmstudio:{model_id}", priority=BACKGROUND, ...)
      → httpx POST to LM Studio /v1/embeddings
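The last hop in that path can be sketched as follows. This assumes LM Studio's OpenAI-compatible embeddings endpoint at its default local address (localhost:1234) and the OpenAI-style request and response shape. It uses the standard library's urllib in place of httpx to stay dependency-free, and it omits the LocalInferenceBroker slot entirely. All function names here are illustrative, not the provider module's real API.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Assumption: LM Studio's default local server address; adjust for your setup.
LMSTUDIO_URL = "http://localhost:1234/v1/embeddings"


def build_payload(model_id: str, texts: List[str]) -> Dict[str, Any]:
    """OpenAI-style embeddings request body."""
    return {"model": model_id, "input": texts}


def parse_response(body: Dict[str, Any]) -> List[List[float]]:
    """Extract vectors, ordered by each item's index field."""
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def encode(model_id: str, texts: List[str], timeout: float = 30.0) -> List[List[float]]:
    """POST a batch to LM Studio and return one vector per input text."""
    request = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(build_payload(model_id, texts)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_response(json.load(response))
```

Sorting by index before returning guards against a server that returns items out of order, so each vector lines up with the text that produced it.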

Query-side encoding (leaf-ir-query, leaf-mt) is NOT offloaded — query latency is user-facing and the network hop would hurt.

On LM Studio errors, the per-model on_error setting (fallback | fail) decides the behavior. fallback (the default) drops to the in-process sentence-transformers path. Measured drift between Q8 GGUF and fp32 sentence-transformers is ~0.0002 cosine, so mixed-provenance vectors in the same index cluster correctly.
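The on_error policy amounts to a small wrapper around the remote call. The sketch below is one plausible shape for it, with the remote and local encoders injected as callables. The function name and signature are hypothetical, and treating fail as "re-raise the original exception" is an assumption.

```python
from typing import Callable, List, Sequence

Encoder = Callable[[Sequence[str]], List[List[float]]]


def encode_with_policy(
    texts: Sequence[str],
    remote: Encoder,
    local: Encoder,
    on_error: str = "fallback",
) -> List[List[float]]:
    """Try the remote provider first; on failure, apply the on_error policy."""
    if on_error not in ("fallback", "fail"):
        raise ValueError(f"unknown on_error policy: {on_error!r}")
    try:
        return remote(texts)
    except Exception:
        if on_error == "fail":
            raise  # surface the provider error to the caller
        return local(texts)  # fallback: encode in-process instead
```

Validating the policy string up front means a typo in config fails immediately instead of silently behaving like one of the two modes.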

See architecture/inference/broker for the admission-control / priority / metrics layer, and features/lmstudio-offload-setup for the end-to-end setup procedure (GGUF download, metadata audit, drift test, config flip).

Key files

  • work_buddy/embedding/service.py — Flask service, model registry, lazy loading. _get_model() uses a per-entry Condition so a cold load of one model doesn't block concurrent access to another.
  • work_buddy/embedding/client.py — HTTP client with role-aware wrappers.
  • work_buddy/embedding/providers/lmstudio.py — optional LM Studio provider (broker-wrapped).
  • work_buddy/embedding/__main__.py — service entry point (launched by the sidecar).
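The per-entry Condition idea mentioned for _get_model() can be illustrated with a small registry in which each model has its own Condition, so a slow load for one key never holds a lock that another key needs. This is a sketch of the general technique, not the code in service.py; the class and attribute names are hypothetical.

```python
import threading
from typing import Any, Callable, Dict


class _Entry:
    """Per-model state guarded by its own Condition."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self.loader = loader
        self.cond = threading.Condition()
        self.model: Any = None
        self.loaded = False
        self.loading = False


class ModelRegistry:
    def __init__(self, loaders: Dict[str, Callable[[], Any]]) -> None:
        self._entries = {key: _Entry(fn) for key, fn in loaders.items()}

    def get_model(self, key: str) -> Any:
        entry = self._entries[key]
        with entry.cond:
            # Wait only on this entry; other keys use different Conditions.
            while entry.loading:
                entry.cond.wait()
            if entry.loaded:
                return entry.model
            entry.loading = True  # this thread becomes the loader
        try:
            model = entry.loader()  # slow work happens outside the lock
        except BaseException:
            with entry.cond:
                entry.loading = False  # let a waiting thread retry the load
                entry.cond.notify_all()
            raise
        with entry.cond:
            entry.model = model
            entry.loaded = True
            entry.loading = False
            entry.cond.notify_all()
        return model
```

Concurrent callers for the same key block until the single load finishes and then share the result, while callers for a different key proceed immediately.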