Embedding Service¶
Local HTTP service on port 5124 providing dense vector embeddings for search and similarity; exposes a symmetric default model plus an asymmetric query/document pair via role-aware client wrappers.
Details¶
Overview¶
A long-running sidecar service providing dense vector embeddings for work-buddy's search and similarity features. Exposes an HTTP API on localhost:5124 with eager-loaded models, so interactive calls are fast.
Endpoints¶
- POST /embed: embed a batch of texts, return vectors
- POST /similarity: cosine similarity between a query and candidate texts
- POST /search: BM25 + embedding hybrid search over candidates
- POST /ir/search, POST /ir/index: indexed IR search over registered sources
- GET /health: liveness probe
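A minimal sketch of calling these endpoints with only the Python standard library. The `texts` field in the request body and both helper names are assumptions for illustration; the service's actual JSON schema is not stated here, so check service.py before relying on it.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:5124"  # host and port as documented above


def build_embed_request(texts, base_url=BASE_URL):
    """Build (but do not send) a POST /embed request.

    ASSUMPTION: the body is a JSON object with a "texts" list.
    The real request schema may use different field names.
    """
    body = json.dumps({"texts": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_healthy(base_url=BASE_URL, timeout=1.0):
    """Call GET /health and report True only for an HTTP 200 response."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: treat all as "not up".
        return False
```

A caller would typically check `is_healthy()` once, then pass the built request to `urllib.request.urlopen`. In practice the client module described below is the intended entry point.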
Client¶
work_buddy/embedding/client.py wraps the HTTP API:
- embed(texts): plain batch embedding, uses the service's default model
- embed_for_ir(texts, role="query"|"document"): asymmetric IR encoding via the query↔document model pair
- similarity_search(query, candidates): rank candidates by similarity
- hybrid_search(query, candidates): BM25 + dense blend
- ir_search(query): search a pre-built IR index
- ir_index(action, source, ...): build or check an IR index
Graceful degradation¶
Every client function returns None (or an empty list) when the service is unavailable. Callers must handle that path rather than assume the service is up, typically by falling back to BM25 alone or skipping the semantic step.
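The fallback pattern can be sketched as follows. This is an illustration under assumptions, not the project's code: `semantic` stands in for a client function such as similarity_search and is assumed to return the candidates in ranked order, and `lexical_rank` is a hypothetical token-overlap stand-in for a real BM25 ranker.

```python
from typing import Callable, Optional, Sequence


def lexical_rank(query: str, candidates: Sequence[str]) -> list[str]:
    """Hypothetical stand-in for BM25: order by number of shared tokens."""
    query_tokens = set(query.lower().split())
    # sorted() is stable, so ties keep their original relative order.
    return sorted(
        candidates,
        key=lambda c: -len(query_tokens & set(c.lower().split())),
    )


def rank(
    query: str,
    candidates: Sequence[str],
    semantic: Callable[[str, Sequence[str]], Optional[list[str]]],
) -> list[str]:
    """Use the semantic ranker when it answers, else fall back to lexical."""
    ranked = semantic(query, candidates)
    if not ranked:  # None or [] both signal "service unavailable"
        return lexical_rank(query, candidates)
    return ranked
```

Passing the ranker in as a callable keeps the degradation logic testable without a running service.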
Consumers¶
Knowledge search, IR conversation search, Smart Connections ranking, task-triage similarity, and other semantic-scoring sites throughout the codebase all route through this service. Model loading happens once here; all callers share the loaded weights.
Optional: LM Studio offload for bulk document encoding¶
Bulk document encoding (the large passage-side model, leaf-ir / snowflake-arctic-embed-m-v1.5) can route through LM Studio instead of loading locally. This frees ~500 MB of RSS on the main machine, and LM Link can optionally forward the work to a remote compute device. Opt in per model by setting embedding.models.<key>.provider: lmstudio in config.
When enabled, the call path is:
ir.dense._encode_bulk_direct
→ work_buddy.embedding.providers.lmstudio.encode
→ LocalInferenceBroker.slot(profile=f"lmstudio:{model_id}", priority=BACKGROUND, ...)
→ httpx POST to LM Studio /v1/embeddings
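The last hop of this path uses the OpenAI-compatible embeddings format that LM Studio serves at /v1/embeddings. The sketch below covers only building the request body and parsing the response. The function names are hypothetical, and the broker slot and the actual httpx call are deliberately left out because their signatures are not shown here.

```python
def build_payload(model_id, texts):
    """Request body for an OpenAI-compatible POST /v1/embeddings call."""
    return {"model": model_id, "input": list(texts)}


def parse_embeddings(response_json):
    """Extract vectors from the response, ordered to match the input texts.

    Each item in "data" carries an "index" field identifying which input
    it belongs to, so sort on it instead of trusting the list order.
    """
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

With httpx installed, the payload would be sent as the `json=` argument of a POST to the LM Studio base URL plus /v1/embeddings, and the decoded JSON response passed to `parse_embeddings`.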
Query-side encoding (leaf-ir-query, leaf-mt) is NOT offloaded — query latency is user-facing and the network hop would hurt.
On LM Studio errors, the per-model on_error setting (fallback | fail) decides the behavior. fallback (the default) drops to the in-process sentence-transformers path. Measured drift between Q8 GGUF and fp32 sentence-transformers is ~0.0002 cosine, so mixed-provenance vectors in the same index cluster correctly.
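The on_error policy can be sketched as a small dispatcher. Only the `provider` and `on_error` keys and their values come from the text above. The function name, the dict shape, and the two encoder callables are hypothetical, and the assumption that `fail` re-raises the original error is an inference, not documented behavior.

```python
def encode_documents(texts, model_cfg, remote_encode, local_encode):
    """Encode texts, honoring the per-model provider and on_error settings.

    model_cfg is assumed to look like {"provider": "lmstudio", "on_error": "fallback"}.
    remote_encode / local_encode are injected callables taking a list of texts.
    """
    if model_cfg.get("provider") != "lmstudio":
        return local_encode(texts)  # offload not enabled for this model
    try:
        return remote_encode(texts)
    except Exception:
        # "fallback" is the documented default when on_error is absent.
        if model_cfg.get("on_error", "fallback") == "fail":
            raise  # ASSUMPTION: "fail" surfaces the error to the caller
        return local_encode(texts)
```

Falling back is safe only because the measured drift between the two encoders is small enough for their vectors to share one index.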
See architecture/inference/broker for the admission-control / priority / metrics layer, and features/lmstudio-offload-setup for the end-to-end setup procedure (GGUF download, metadata audit, drift test, config flip).
Key files¶
- work_buddy/embedding/service.py: Flask service, model registry, lazy loading. _get_model() uses a per-entry Condition so a cold load of one model doesn't block concurrent access to another.
- work_buddy/embedding/client.py: HTTP client with role-aware wrappers.
- work_buddy/embedding/providers/lmstudio.py: optional LM Studio provider (broker-wrapped).
- work_buddy/embedding/__main__.py: service entry point (launched by the sidecar).
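The per-entry Condition idea mentioned for _get_model() can be illustrated as below. This is a sketch of the general technique, not the code in service.py: the `ModelRegistry` class, its state names, and the loader callables are all hypothetical.

```python
import threading


class ModelRegistry:
    """Lazy model registry with one Condition per entry (hypothetical sketch)."""

    def __init__(self, loaders):
        # loaders maps a model key to a zero-argument callable that loads it.
        self._entries = {
            key: {
                "cond": threading.Condition(),
                "state": "cold",  # cold -> loading -> ready
                "model": None,
                "loader": loader,
            }
            for key, loader in loaders.items()
        }

    def get_model(self, key):
        entry = self._entries[key]
        cond = entry["cond"]
        with cond:
            # Wait only on this entry; other keys use their own Condition.
            while entry["state"] == "loading":
                cond.wait()
            if entry["state"] == "ready":
                return entry["model"]
            entry["state"] = "loading"  # this thread becomes the loader
        try:
            model = entry["loader"]()  # slow load runs outside the lock
        except Exception:
            with cond:
                entry["state"] = "cold"  # let a later caller retry
                cond.notify_all()
            raise
        with cond:
            entry["model"] = model
            entry["state"] = "ready"
            cond.notify_all()
        return model
```

Because each entry has its own Condition, threads waiting on a cold load of one model never contend with threads fetching another, and each model is loaded at most once per successful load.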