Embedding Service

Local HTTP service on port 5124 providing dense vector embeddings for search and similarity; exposes a symmetric default model plus an asymmetric query/document pair via role-aware client wrappers.

Details

Overview

A long-running sidecar service providing dense vector embeddings for work-buddy's search and similarity features. Exposes an HTTP API on localhost:5124 with eager-loaded models, so interactive calls are fast.

Endpoints

POST /embed — embed a batch of texts, return vectors
POST /similarity — cosine similarity between a query and candidate texts
POST /search — BM25 + embedding hybrid search over candidates
POST /ir/search, POST /ir/index — indexed IR search over registered sources
POST /vault/search, POST /vault/index — vault semantic index search / build, run in-process so the resident vector matrix stays warm and the bulk encode shares the broker (see architecture/vault-index)
POST /index/search, POST /index/search_many — consolidated-index hybrid search over one or more partitions (/index/search_many shares one query-encode across a batch of queries)
POST /index/build — incremental build of a partition (or all) into the separate db/index-consolidated DB. The /index endpoints run in-process so the resident matrices stay warm and the bulk encode shares the broker (see architecture/consolidated-index)
GET /health — liveness probe
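As a rough illustration of talking to these endpoints directly, the sketch below probes GET /health and posts JSON to an arbitrary path using only the standard library. The helper names `service_is_up` and `post_json` are hypothetical, and the request and response bodies of each endpoint are not specified on this page, so any payload passed in is an assumption. Both helpers swallow connection failures rather than raising.

```python
import http.client
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:5124"  # host and port as stated on this page

# Failures that mean "service unreachable or reply unusable".
_FAILURES = (urllib.error.URLError, OSError, ValueError, http.client.HTTPException)


def service_is_up(base_url: str = BASE_URL, timeout: float = 0.5) -> bool:
    """Return True if GET /health answers with a 2xx status, else False."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except _FAILURES:
        return False


def post_json(path: str, payload: dict, base_url: str = BASE_URL, timeout: float = 5.0):
    """POST a JSON body and return the decoded JSON reply, or None on any failure."""
    request = urllib.request.Request(
        f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except _FAILURES:
        return None
```

A call such as `post_json("/embed", {"texts": ["hello"]})` assumes a `texts` field that this page does not confirm; in practice the wrappers in work_buddy/embedding/client.py are the supported way in.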

Client

work_buddy/embedding/client.py wraps the HTTP API:

embed(texts) — plain batch embedding, uses the service's default model
embed_for_ir(texts, role="query"|"document") — asymmetric IR encoding via the query↔document model pair
similarity_search(query, candidates) — rank candidates by similarity
hybrid_search(query, candidates) — BM25 + dense blend
ir_search(query) — search a pre-built IR index
ir_index(action, source, ...) — build or check an IR index
vault_search(query, ...), vault_index(action, ...) — vault semantic index search / build (see architecture/vault-index)
index_search(query, ...), index_search_many(queries, ...) — consolidated-index hybrid search over its partitions (see architecture/consolidated-index)
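The list describes hybrid_search as a "BM25 + dense blend" without giving the formula. One common way to blend two score lists, shown here purely as a sketch, is to min-max normalize each list and take a weighted sum. The function names and the `alpha` weight are hypothetical, and the service may combine scores differently.

```python
from typing import List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def blend_scores(lexical: Sequence[float], dense: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Weighted sum of normalized scores; alpha is the weight on the dense side (assumed)."""
    if len(lexical) != len(dense):
        raise ValueError("lexical and dense must score the same candidates")
    if not lexical:
        return []
    lex_n, den_n = min_max(lexical), min_max(dense)
    return [alpha * d + (1.0 - alpha) * l for l, d in zip(lex_n, den_n)]


def rank(candidates: Sequence[str], lexical: Sequence[float], dense: Sequence[float], alpha: float = 0.5) -> List[str]:
    """Return candidates ordered from highest to lowest blended score."""
    blended = blend_scores(lexical, dense, alpha)
    order = sorted(range(len(candidates)), key=lambda i: blended[i], reverse=True)
    return [candidates[i] for i in order]
```

Normalizing first matters because BM25 scores are unbounded while cosine similarities sit in a fixed range; without it the lexical side would dominate the sum.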

Graceful degradation

Every client function returns None (or an empty list) when the service is unavailable. Callers must always handle that path, typically by falling back to BM25 alone or by skipping the semantic step. Never assume the service is up.
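A caller-side sketch of this pattern follows. The semantic ranker is passed in as a callable so the example runs without the service; in real code it would be something like `similarity_search` from the client module, whose return type is assumed here to be a list of `(text, score)` pairs. The fallback is a crude term-overlap score standing in for BM25, not BM25 itself, and both function names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Ranked = List[Tuple[str, float]]


def term_overlap_rank(query: str, candidates: Sequence[str]) -> Ranked:
    """Lexical stand-in: fraction of query terms that appear in each candidate."""
    terms = set(query.lower().split())
    scored = []
    for text in candidates:
        words = set(text.lower().split())
        score = len(terms & words) / len(terms) if terms else 0.0
        scored.append((text, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def rank_with_fallback(
    query: str,
    candidates: Sequence[str],
    semantic_rank: Callable[[str, Sequence[str]], Optional[Ranked]],
) -> Ranked:
    """Use the semantic ranker when it answers; otherwise degrade to lexical ranking."""
    result = semantic_rank(query, candidates)
    if not result:  # None or an empty list both mean "no semantic answer"
        return term_overlap_rank(query, candidates)
    return result
```

Treating None and an empty list the same way keeps the caller to a single degraded path, which matches the "None (or an empty list)" contract described above.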

Consumers

Knowledge search, IR conversation search, native vault search, task-triage similarity, and other semantic-scoring sites throughout the codebase all route through this service. Model loading happens once here; all callers share the loaded weights.

Optional: LM Studio offload for bulk document encoding

Bulk document encoding (the large passage-side model, leaf-ir / snowflake-arctic-embed-m-v1.5) can route through LM Studio instead of loading locally. This moves ~500 MB of RSS off the main machine, optionally onto a remote compute device via LM Link. The offload is opt-in per model: set embedding.models.<key>.provider: lmstudio in config.

When enabled, the call path is:

ir.dense._encode_bulk_direct
  → work_buddy.embedding.providers.lmstudio.encode
    → LocalInferenceBroker.slot(profile=f"lmstudio:{model_id}", priority=BACKGROUND, ...)
      → httpx POST to LM Studio /v1/embeddings
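The last hop in this path targets LM Studio's OpenAI-compatible /v1/embeddings endpoint, which takes a `model` and an `input` list and returns one `data` item per input with an `embedding` and an `index`. The sketch below only builds the request body and unpacks a reply; the function names are hypothetical and this is not the code in providers/lmstudio.py.

```python
from typing import Any, Dict, List


def build_embeddings_request(model_id: str, texts: List[str]) -> Dict[str, Any]:
    """JSON body for an OpenAI-style embeddings call."""
    return {"model": model_id, "input": list(texts)}


def parse_embeddings_response(payload: Dict[str, Any], expected: int) -> List[List[float]]:
    """Return vectors in input order, checking that none are missing."""
    # Items carry an explicit index, so sort instead of trusting arrival order.
    items = sorted(payload["data"], key=lambda item: item["index"])
    vectors = [item["embedding"] for item in items]
    if len(vectors) != expected:
        raise ValueError(f"expected {expected} vectors, got {len(vectors)}")
    return vectors
```

With httpx, the body would be sent along the lines of `httpx.post(url, json=build_embeddings_request(model_id, texts))`, where `url` is wherever the LM Studio server is listening.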

Query-side encoding (leaf-ir-query, leaf-mt) is NOT offloaded — query latency is user-facing and the network hop would hurt.

On LM Studio errors, the per-model on_error setting (fallback | fail) decides the behavior. fallback (the default) drops to the in-process sentence-transformers path. Measured drift between Q8 GGUF and fp32 sentence-transformers is ~0.0002 cosine, so an index that mixes vectors from both paths still clusters correctly.
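The on_error behavior can be sketched as a small wrapper around two encoders. The names `encode_with_policy`, `remote`, and `local` are hypothetical, and the sketch assumes that fail simply re-raises the provider error, which this page does not state explicitly.

```python
from typing import Callable, List, Sequence

Vectors = List[List[float]]
Encoder = Callable[[Sequence[str]], Vectors]


def encode_with_policy(
    texts: Sequence[str],
    remote: Encoder,
    local: Encoder,
    on_error: str = "fallback",
) -> Vectors:
    """Try the remote provider first, then apply the configured error policy."""
    if on_error not in ("fallback", "fail"):
        raise ValueError(f"unknown on_error policy: {on_error!r}")
    try:
        return remote(texts)
    except Exception:
        if on_error == "fail":
            raise  # assumed meaning of "fail": surface the provider error
        return local(texts)  # "fallback": encode in-process instead
```

Because the two paths drift by only ~0.0002 cosine, a batch that fell back to the local encoder can be stored alongside remotely encoded vectors.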

See architecture/inference/broker for the admission-control / priority / metrics layer, and features/lmstudio-offload-setup for the end-to-end setup procedure (GGUF download, metadata audit, drift test, config flip).

Key files

work_buddy/embedding/service.py — Flask service, model registry, lazy loading. _get_model() uses a per-entry Condition so a cold load of one model doesn't block concurrent access to another.
work_buddy/embedding/client.py — HTTP client with role-aware wrappers.
work_buddy/embedding/providers/lmstudio.py — optional LM Studio provider (broker-wrapped).
work_buddy/embedding/__main__.py — service entry point (launched by the sidecar).
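The per-entry Condition mentioned for _get_model() can be illustrated with a minimal registry. This is a sketch of the general pattern, not the code in service.py: the class names and the loader callables are hypothetical. Each entry has its own threading.Condition, so threads waiting on one model's cold load never contend with threads asking for another model.

```python
import threading
from typing import Any, Callable, Dict


class _Entry:
    """State for one model: its loader, its own Condition, and the loaded object."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self.loader = loader
        self.cond = threading.Condition()
        self.model: Any = None
        self.loading = False


class ModelRegistry:
    def __init__(self, loaders: Dict[str, Callable[[], Any]]) -> None:
        self._entries = {key: _Entry(fn) for key, fn in loaders.items()}

    def get_model(self, key: str) -> Any:
        entry = self._entries[key]
        with entry.cond:
            # Wait only on this entry; other models are unaffected.
            while entry.loading:
                entry.cond.wait()
            if entry.model is not None:
                return entry.model
            entry.loading = True  # this thread becomes the loader
        try:
            model = entry.loader()  # slow work happens outside the lock
        except BaseException:
            with entry.cond:
                entry.loading = False
                entry.cond.notify_all()  # let a waiter retry the load
            raise
        with entry.cond:
            entry.model = model
            entry.loading = False
            entry.cond.notify_all()
        return model
```

A single global lock would be simpler, but it would make a request for an already-loaded model wait behind an unrelated cold load.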