Segment Notes
Read the Running Notes section from a journal file, identify coherent threads of related content, and annotate the text with inline thread IDs. The raw text is never modified — only HTML comment tags are inserted.
Workflow name: segment-notes
Execution: subagent
Steps
| # | ID | Name | Type | Depends on |
|---|---|---|---|---|
| 1 | extract-section | Extract Running Notes section from journal | code | |
| 2 | generate-ids | Generate pool of thread IDs | code | extract-section |
| 3 | segment-and-tag | Segment content and insert thread tags | reasoning | generate-ids |
| 4 | validate | Validate segmentation with Python | code | segment-and-tag |
| 5 | extract-threads | Extract thread objects from tagged text | code | validate |
| 6 | return-results | Return tagged text, thread list, and validation | code | extract-threads |
Step instructions
extract-section
If given a journal_path, read the file and extract everything between:
- Start: # **Running Notes / Considerations**
- End: % RUNNING END marker (if present) or the next top-level heading
Strip the "carried over from YYYY-MM-DD" banners: these are structural artifacts of the carry-forward mechanism, not content. Preserve the dates as metadata (they indicate when items were originally written).
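The extraction rules above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual implementation: the function name `extract_running_notes`, the banner regex, and the choice to treat any line starting with `# ` as the next top-level heading are all assumptions. In particular, a `# ` comment line inside a code block in the notes would end the section early.

```python
import re

START_HEADING = "# **Running Notes / Considerations**"
END_MARKER = "% RUNNING END"
# Assumed banner shape: any line containing "carried over from YYYY-MM-DD".
BANNER_RE = re.compile(r"carried over from (\d{4}-\d{2}-\d{2})", re.IGNORECASE)


def extract_running_notes(journal_text: str) -> tuple[str, list[str]]:
    """Return (section_text, banner_dates) for the Running Notes section."""
    lines = journal_text.splitlines()
    try:
        start = next(i for i, l in enumerate(lines) if l.strip() == START_HEADING)
    except StopIteration:
        return "", []  # heading not found

    body, dates = [], []
    for line in lines[start + 1:]:
        # Stop at the end marker or at the next top-level heading.
        if line.strip().startswith(END_MARKER) or line.startswith("# "):
            break
        banner = BANNER_RE.search(line)
        if banner:
            dates.append(banner.group(1))  # keep the date, drop the banner line
            continue
        body.append(line)
    return "\n".join(body).strip("\n"), dates
```

The dates come back as a separate list so they can be carried as metadata without appearing in the text that gets segmented.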
generate-ids
Use Python to pre-generate a pool of random IDs:
import uuid

def generate_thread_id() -> str:
    return f"t_{uuid.uuid4().hex[:6]}"
Generate more IDs than you expect to need (e.g., 50) and discard any duplicates, since six hex characters do not guarantee uniqueness. Pass the pool to the LLM so it uses consistent, collision-free IDs.
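Six hex characters give 16^6 = 16,777,216 possible IDs, so a duplicate within a pool of 50 is unlikely but not impossible. A small sketch of building the pool so that uniqueness is guaranteed rather than assumed; the helper name `generate_id_pool` and its `size` parameter are illustrative, not part of the workflow:

```python
import uuid


def generate_thread_id() -> str:
    return f"t_{uuid.uuid4().hex[:6]}"


def generate_id_pool(size: int = 50) -> list[str]:
    """Return `size` distinct thread IDs, redrawing on the rare collision."""
    pool: set[str] = set()
    while len(pool) < size:
        pool.add(generate_thread_id())  # a set silently drops duplicates
    return sorted(pool)
```

Collecting into a set and looping until it reaches the target size keeps the pool exactly as large as requested.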
segment-and-tag
This is the reasoning-heavy step. Read the full extracted text and identify coherent threads — groups of content that belong to the same concern, topic, or action item.
Insert open/close HTML comment tags around each thread:
<!-- [t_a3f8c1] -->
- Graph tokenization approach for entity linking
- Looked at paper X, relates to graph tok idea
<!-- [/t_a3f8c1] -->
Segmentation rules:
- Preserve all raw text exactly: do not edit, rephrase, reorder, or delete any content. Only insert <!-- [t_xxxxxx] --> and <!-- [/t_xxxxxx] --> tags
- Every line of content must be inside exactly one thread, with no orphaned lines
- A thread is content that belongs to the same concern. This requires interpretation: "Taxes - Varsha" and "Varsha - T4 / T4A" are the same thread even though they appear on different dates. A 40-line technical exploration about graph tokenization is one thread
- Multi-concern lines must be flagged. If a single bullet contains two or more unrelated concerns (e.g., "VPN/taxes/new laptop stuff"), wrap it in one thread but add a <!-- [multi] --> annotation so the router knows it may need splitting
- Sub-bullets belong to their parent's thread unless they clearly introduce a new topic
- Code blocks, links, and formatted text are preserved exactly — tags go outside these, not inside
- Empty lines and horizontal rules between threads are fine — place tags around content, not whitespace
- Carried-over banners are stripped — they are not content. But note the source dates in a separate metadata output
- When uncertain whether two items are the same thread, keep them separate. Over-splitting is better than under-splitting — the router can propose merges, but splitting a wrongly-merged thread is harder
Edge cases:
- Long multi-paragraph explorations: one thread, even if they span 50+ lines. The thread is the coherent unit
- Isolated one-liners with no context: each gets its own thread ID. The router will ask the user what they mean
- Links/references followed by commentary: group together as one thread
- Empty carried-over sections (just banners, no content between them): skip entirely
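The "no orphaned lines" rule lends itself to a mechanical check. The sketch below is one possible approach, assuming each open and close tag sits on its own line as in the example above; the function name `find_orphaned_lines` is hypothetical.

```python
import re

OPEN_RE = re.compile(r"<!-- \[(t_[a-f0-9]{6})\] -->")
CLOSE_RE = re.compile(r"<!-- \[/(t_[a-f0-9]{6})\] -->")


def find_orphaned_lines(tagged_text: str) -> list[tuple[int, str]]:
    """Return (line_number, text) for content lines outside every thread."""
    orphans = []
    inside = False
    for number, line in enumerate(tagged_text.splitlines(), start=1):
        if OPEN_RE.search(line):
            inside = True
            continue
        if CLOSE_RE.search(line):
            inside = False
            continue
        stripped = line.strip()
        # Blank lines and horizontal rules may sit between threads.
        if not stripped or stripped.startswith("---"):
            continue
        if not inside:
            orphans.append((number, stripped))
    return orphans
```

An empty result means every content line falls between an open tag and a close tag. It does not say whether the grouping itself is sensible, which remains a judgment call for the reasoning step.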
validate
After the LLM produces the tagged text, run Python validation:
import re

def validate_segmentation(tagged_text: str, original_text: str) -> dict:
    """Validate that segmentation is complete and consistent."""
    # Extract all thread IDs used
    open_tags = re.findall(r'<!-- \[(t_[a-f0-9]{6})\] -->', tagged_text)
    close_tags = re.findall(r'<!-- \[/(t_[a-f0-9]{6})\] -->', tagged_text)
    thread_ids = set(open_tags)
    errors = []

    # Every open tag has a matching close tag
    if sorted(open_tags) != sorted(close_tags):
        errors.append("Mismatched open/close tags")

    # No nested threads (open inside open without close)
    current = None
    for slash, tag_id in re.findall(r'<!-- \[(/?)(t_[a-f0-9]{6})\] -->', tagged_text):
        if not slash and current is not None:
            errors.append(f"Thread {tag_id} opened inside {current}")
        elif slash and tag_id != current:
            errors.append(f"Close tag for {tag_id} does not match the open thread")
        current = None if slash else tag_id

    # All content lines from original are present in tagged version
    # (compared by count only; thread and [multi] tags are stripped first)
    original_lines = [l.strip() for l in original_text.splitlines() if l.strip() and not l.strip().startswith('---') and 'carried over from' not in l]
    tagged_content = re.sub(r'<!-- \[(?:/?t_[a-f0-9]{6}\??|multi)\]\s*-->', '', tagged_text)
    tagged_lines = [l.strip() for l in tagged_content.splitlines() if l.strip() and not l.strip().startswith('---')]
    if len(original_lines) != len(tagged_lines):
        errors.append(f"Line count mismatch: {len(original_lines)} original vs {len(tagged_lines)} tagged")

    return {
        "valid": len(errors) == 0,
        "thread_count": len(thread_ids),
        "thread_ids": sorted(thread_ids),
        "errors": errors,
    }
If validation fails, the LLM should fix the issues and re-validate. Do not proceed to routing with invalid segmentation.
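The fix-and-revalidate loop might be wired up along these lines. This is a sketch under assumptions: `segment_fn` stands in for whatever call produces the tagged text, `validate_fn` is any function returning a dict with `valid` and `errors` keys, and the `max_attempts` cap is an addition of this example, not something the workflow specifies.

```python
from typing import Callable


def segment_until_valid(
    original_text: str,
    segment_fn: Callable[[str, list[str]], str],
    validate_fn: Callable[[str, str], dict],
    max_attempts: int = 3,
) -> tuple[str, dict]:
    """Call segment_fn until validate_fn reports success or attempts run out."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        tagged = segment_fn(original_text, feedback)
        result = validate_fn(tagged, original_text)
        if result["valid"]:
            return tagged, result
        feedback = result["errors"]  # hand the errors back for the next attempt
    raise ValueError(
        f"Segmentation still invalid after {max_attempts} attempts: {feedback}"
    )
```

Raising instead of returning the last invalid result keeps invalid segmentation from reaching the routing step by accident.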
extract-threads
Once validated, extract each thread into a structured object:
def extract_threads(tagged_text: str) -> list[dict]:
    """Extract thread objects from tagged text."""
    threads = []
    # Parse tagged text, group content by thread ID
    # For each thread:
    threads.append({
        "id": "t_a3f8c1",
        "raw_text": "The exact text between open/close tags",
        "line_count": 3,
        "source_dates": ["2026-03-03", "2026-03-14", "2026-03-28"],  # from banner positions
        "has_multi_flag": False,
    })
    return threads
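One way the parsing step could be filled in is shown below, as a hedged sketch, not the workflow's own code. The name `extract_threads_from_tags` is hypothetical. Spans that reuse the same ID are assumed to belong to one thread and are joined with a newline. `source_dates` is left empty because the mapping from banner positions to threads is not specified here.

```python
import re

THREAD_RE = re.compile(
    r"<!-- \[(t_[a-f0-9]{6})\] -->\n?(.*?)\n?<!-- \[/\1\] -->", re.DOTALL
)
MULTI_TAG = "<!-- [multi] -->"


def extract_threads_from_tags(tagged_text: str) -> list[dict]:
    """Group tagged spans by thread ID, in order of first appearance."""
    spans: dict[str, list[str]] = {}
    for thread_id, body in THREAD_RE.findall(tagged_text):
        spans.setdefault(thread_id, []).append(body)

    threads = []
    for thread_id, bodies in spans.items():
        raw_text = "\n".join(bodies)
        content_lines = [
            line for line in raw_text.splitlines()
            if line.strip() and line.strip() != MULTI_TAG
        ]
        threads.append({
            "id": thread_id,
            "raw_text": raw_text,
            "line_count": len(content_lines),
            "source_dates": [],  # assumption: the date lookup is handled elsewhere
            "has_multi_flag": MULTI_TAG in raw_text,
        })
    return threads
```

The backreference `\1` ties each close tag to its own open tag, so a span only ends at the close tag carrying the same ID.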
return-results
Return:
1. The tagged text (original content with inserted thread tags)
2. The thread list (extracted structured objects)
3. The validation result
The caller (process-backlog.md) uses the thread list to feed into the routing workflow.
Context
What NOT to do
- Don't modify, rephrase, summarize, or reorder any raw text
- Don't interpret what items mean — that's the router's job (with user input)
- Don't delete content, even if it looks stale or irrelevant
- Don't merge threads aggressively — when in doubt, keep separate
- Don't try to route or triage during segmentation — separation of concerns
- Don't skip the Python validation step
Output quality checks
Before returning, verify:
- [ ] Every content line is inside exactly one thread
- [ ] All thread tags are properly opened and closed
- [ ] No content was modified, deleted, or reordered
- [ ] Code blocks and links are intact
- [ ] Carried-over banners are stripped (noted as metadata, not content)
- [ ] Multi-concern lines are flagged with <!-- [multi] -->