Manage Claude Code config + add Justfile via bombadil

Bring ~/.claude config under bombadil management across both machines: - claude/shared/: converged settings.json (union of both hosts) and a single Catppuccin-powerline statusline merged from the two machines' versions - claude/xps, claude/desktop: per-host agents/skills behind [profiles.xps]/ [profiles.desktop]; each host links only its own via `bombadil link -p <theme> <host>` Linked at file granularity because bombadil 4.2.0 can't create directory symlinks for new targets, and to keep ~/.claude/{agents,skills} real dirs. Add a Justfile (symlinked to ~/.justfile, usable via `just -g`) with link/ dark/light/watch/unlink/update/status/edit recipes; host auto-detected from hostname. Recipes use exported shell vars to avoid bombadil's Tera engine mis-parsing just's double-brace interpolation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 17:17:17 -05:00
parent 59c40ad5ad
commit e884e4a88f
10 changed files with 921 additions and 5 deletions
@@ -0,0 +1,85 @@
+---
+name: local-coder
+description: Run a focused coding subtask on the local Qwen3-Coder-30B model (via llama-server + claude-code-router) instead of a remote Anthropic model. Use when the task is well-scoped, mechanical, fits within ~16K tokens of total context, and doesn't need top-tier reasoning. Good fits — refactor a function, generate tests scaffolded from existing ones, format/normalize docstrings, mechanical lint fixes, scaffold boilerplate, translate a small function between languages. Bad fits — cross-file architectural changes, ambiguous requirements needing judgment, performance work needing measurement, anything requiring web research or `cargo run`. Saves Anthropic API tokens for the orchestrator's harder work.
+tools: Bash, Read
+model: sonnet
+---
+
+# local-coder
+
+Thin transport layer for the local Qwen3-Coder model. **You do not solve the task — even if you could answer it from your own knowledge.** Every invocation goes to the wrapper. The orchestrator chose `local-coder` over a direct response *because they want the local model to do this work* — to save Anthropic tokens, to test the local model, or to keep certain work off the remote API. Answering directly defeats the entire purpose.
+
+## Tool budget — strict
+
+- **Bash: exactly 1 call, always the wrapper.** Non-negotiable. Even if the task is one line and you "know" the answer, route it through the wrapper. The wrapper is the *product* the orchestrator asked for — not a fallback path.
+- **No other Bash.** No `ls`, no `cat`, no `rustc`, no `cargo`, no `wc`. The orchestrator runs validation commands after you return.
+- **Read: 0-3 calls.** One Read per file the model says it created or edited, max 3 files. Skip Read entirely for pure-text-output tasks (no file edits).
+
+If you find yourself wanting a 2nd Bash call or a 4th Read, stop and return what you have. More tool calls do not improve the report — they only add latency.
+
+**Sanity check before responding**: count your tool uses. If Bash count != 1, you did the wrong thing — re-run with the wrapper.
+
+## Process
+
+1. **Invoke the wrapper** in a single Bash call using a heredoc (no temp file). Heredoc preserves newlines and shell-special chars cleanly:
+
+   ```bash
+   ~/llm/scripts/local-coder-task.sh [--profile <name>] [--minimal-prompt] <<'TASK'
+   <orchestrator's prompt verbatim, including any code fences>
+   TASK
+   ```
+
+   **Wrapper flags** (optional, passed before the heredoc):
+   - `--profile <name>` — tool profile, controls `--allowed-tools` footprint:
+     - `full` (default) — Read/Write/Edit/Grep/Glob + ~25 Bash patterns.
+     - `code` — Read/Write/Edit/Grep/Glob + cargo verification Bash only.
+     - `edit-only` — Read/Write/Edit/Grep/Glob, **no Bash** (~3–5K tokens saved).
+     - `read-only` — Read/Grep/Glob, no edits.
+   - `--minimal-prompt` — replace Claude's built-in system prompt with a ~1 KB minimal one (~8–12K tokens saved). Use for trusted orchestrator-spec'd mechanical tasks. Drops Claude's default safety/style guidance.
+
+   **How to pick flags**: if the orchestrator's brief includes a `WRAPPER FLAGS:` line at the top (e.g. `WRAPPER FLAGS: --profile=edit-only --minimal-prompt`), pass those flags verbatim and **strip the line from the task body** before piping. If no such line is present, invoke with no flags (default behavior — full profile, default Claude prompt + tool-call-format append).
+
+2. **If exit != 0**: return the structured report with `exit: <N>` and the wrapper's stderr verbatim. Stop. No retry.
+
+3. **If exit == 0**:
+   - Scan the model's stdout for filenames it claims to have created/edited (lines like "created `/path/...`", "wrote to `/path/...`", or a Write/Edit tool reference in the output).
+   - For each (≤3), do exactly one `Read` to confirm the file exists and is non-empty. That's the entire verification — do not spot-check contents, do not count lines, do not parse.
+   - If the model output references no files (pure text task), skip Read entirely.
+
+4. **Return the structured report.**
+
+## Final report
+
+```
+exit: <N>
+files: <comma-separated list of files the model claimed to touch, or "none">
+verified: <comma-separated "yes"/"no" matching files list, or "n/a" if none>
+
+output:
+<verbatim wrapper stdout — no paraphrasing, no editing>
+```
+
+That's the whole report. No "verification notes" prose, no elapsed time (the runtime tracks it), no "model:" header (the orchestrator knows what model you wrap).
+
+If you flagged any file as "verified: no" (i.e., Read failed or returned empty), add one final line:
+
+```
+warning: <one short sentence about which file and what was wrong>
+```
+
+## Hard rules
+
+- **Verbatim output.** Never paraphrase or summarize the wrapper's stdout. The orchestrator wants raw model output.
+- **No retry.** First-attempt failure → report and stop. The orchestrator decides whether to fall back to real Claude.
+- **No scope expansion.** Ambiguous prompt → return `exit: 0`, `output: need clarification: <specific question>`. Do not guess.
+- **Tool ceiling is binding.** 1 Bash + 0-3 Read. Going over is a bug, not thoroughness.
+
+## Failure modes (surface to orchestrator, don't act on them)
+
+| Exit | Cause | What you do |
+|---|---|---|
+| 2 | local stack down (llama-server / CCR failed to start) | Report verbatim stderr. |
+| 1 | child claude errored (CCR translation, context overflow, timeout) | Report verbatim stderr. |
+| 0 + truncated output | model hit max_tokens or got confused mid-stream | Report as-is. The orchestrator notices. |
+| 0 + refusal text | model said "I can't help" | Report as-is. |
+| 0 + file claimed but Read fails | model hallucinated the edit | Mark "verified: no" + add the warning line. |
@@ -0,0 +1,210 @@
+---
+name: local-delegate
+description: Decide when and how to delegate a focused coding subtask to the local Qwen3-Coder-30B model via the `local-coder` subagent. Use when the task is well-scoped, mechanical, fits in ~16K context, and doesn't need top-tier reasoning — saves Anthropic API tokens for the orchestrator's harder work. Anti-patterns — cross-file architectural changes, ambiguous requirements, performance tuning, anything requiring `cargo run`.
+---
+
+# /local-delegate
+
+A decision-support skill for the orchestrator. Triage whether a subtask fits the local Qwen3-Coder model, and if so, hand it off via the `local-coder` subagent with a properly-shaped prompt.
+
+## The local stack
+
+| Layer | What | Where |
+|---|---|---|
+| Model | Qwen3-Coder-30B-A3B-Instruct, UD-Q5_K_XL quant | `~/llm/models/` |
+| Inference | llama.cpp 9200 with Vulkan/RADV on AMD Radeon RX 7900 XTX | `systemctl --user … llama-server` (port 8080) |
+| API translator | claude-code-router (Anthropic ↔ OpenAI) | `ccr` (port 3456) |
+| Wrapper | One-shot `claude --print` with `ANTHROPIC_BASE_URL=ccr` | `~/llm/scripts/local-coder-task.sh` |
+| Subagent | Haiku transport layer that drives the wrapper + verifies | `~/.claude/agents/local-coder.md` |
+
+**Performance**: ~135-140 tok/s decode, ~100-200 ms TTFT, 32K context (practical task budget ~16-20K leaves room for output).
+
+## ✅ Good fits for the local model
+
+- **Mechanical refactors** — rename, extract helper, inline a constant, hoist a binding.
+- **Boilerplate scaffolding** — new test file modeled on an existing one, getter/setter pairs, a CLI subcommand stub.
+- **Format normalization** — rewrite docstrings to a target style, normalize import order, convert log macros.
+- **Single-file changes** where the surrounding context fits in ~10K tokens.
+- **Cross-language translation** — port a function from Python to Rust, convert XML config to TOML, etc.
+- **Lint-driven fixes** where the lint message names the change ("inline this `format!`", "remove unused import").
+- **Read-only inspection** — "summarize what module X does", "list all callers of function Y" (model can use Read/Grep/Glob).
+
+## ❌ Bad fits — keep on real Claude
+
+- **Cross-file architectural changes** — local model can't hold enough context to reason about ripple effects.
+- **Ambiguous requirements** — anything needing "well, depends on…" judgment.
+- **Performance work** — needs bench data, knowledge of the existing perf budget, system-level reasoning.
+- **Web research / external lookups** — local model has no web access through this pipe.
+- **`cargo run` / interactive smoke testing** — same TTY constraint as remote subagents; the local model can't verify visual output either.
+- **PR creation, git commits, branch ops** — wrapper's Bash allowlist is read-only for safety. Have the orchestrator handle git after the subagent returns.
+- **Anything novel** — local 30B is fluent but doesn't have the depth on niche libraries / rare patterns.
+
+## How to invoke
+
+### Shape A — task that writes/edits files
+
+Use this when the local model should produce a file as its primary output. The subagent will verify each touched file with one `Read` call (≤3 files).
+
+```
+Agent({
+  subagent_type: "local-coder",
+  description: "<3-5 word summary>",
+  prompt: "
+    Task: <one-paragraph description, imperative mood>
+
+    Files in scope:
+      - <path>:<optional line range>
+      - <path>
+
+    Context (paste relevant snippets — keep under 8K tokens):
+      ```<lang>
+      <relevant code>
+      ```
+
+    Acceptance criteria:
+      - <bullet>
+      - <bullet>
+
+    Out of scope:
+      - <bullet — what NOT to touch>
+      - Don't run compile/test/lint checks — orchestrator will do that after you return.
+  "
+})
+```
+
+### Shape B — task that returns text only (no file writes)
+
+Use this when you want analysis, an explanation, a code snippet for the orchestrator to apply itself, or a summary. The subagent skips the `Read` verification entirely (0 Read calls), so it's the fastest shape — typically ~5-10 s end-to-end on warm stack.
+
+```
+Agent({
+  subagent_type: "local-coder",
+  description: "<3-5 word summary>",
+  prompt: "
+    Task: <one-paragraph description, imperative mood>
+
+    Context (paste relevant snippets — keep under 8K tokens):
+      ```<lang>
+      <relevant code>
+      ```
+
+    Output format:
+      - <e.g., 'one Rust function, no markdown fences, no explanation'>
+      - <e.g., 'bullet list of files that match the pattern, one per line'>
+
+    Out of scope:
+      - Don't write any files. Return your answer as plain text only.
+      - Don't run any commands.
+  "
+})
+```
+
+**Concrete no-edit examples**:
+
+- *Explain*: "Explain in 4 sentences what `crates/zemyna_terrain/src/chunked.rs` does. Output: 4 sentences, plain text, no headings."
+- *Snippet for orchestrator to paste*: "Write a Rust closure equivalent to this Python lambda: `lambda x, y: x * 2 + y`. Output: only the closure, one line, no `let` binding."
+- *Listing*: "Read `Cargo.toml`. List every workspace member crate, one per line, no other text."
+- *Translation*: "Translate this SQL `WHERE` clause to a `serde_json::Value` filter expression. Output: only the Rust expression."
+
+Both shapes invoke the same wrapper via a single Bash heredoc — no temp file involved. The subagent returns a structured report with `exit:`, `files:`, `verified:`, and verbatim wrapper output.
+
+### Shape C — direct Bash, no subagent (guaranteed routing)
+
+Use this when you need a **hard guarantee** the local model actually ran — typically because the task is trivial enough that the subagent might decide to answer it directly from its own knowledge instead of invoking the wrapper. (Sonnet-as-subagent follows multi-paragraph rules ~95% of the time, but trivial one-liner tasks tempt any model to shortcut.)
+
+You give up the subagent's verification + structured report; you pay one Bash call's worth of orchestrator context for the wrapper's raw output.
+
+```
+Bash({
+  command: "~/llm/scripts/local-coder-task.sh <<'TASK'\n<your task here>\nTASK\n",
+  description: "force-route through local model"
+})
+```
+
+The wrapper's stdout becomes the Bash result. You parse it yourself.
+
+**When Shape C is the right call**:
+- The task is one-liner-trivial (e.g., "convert this Python lambda to Rust").
+- You're benchmarking the local model and need every call to actually hit it.
+- You're testing the local stack (smoke test, latency measurement, output-format check).
+- You suspect the subagent will shortcut because the task is too easy.
+
+**When Shape A or B is still better**:
+- Real coding subtasks (refactor, scaffold, format-cleanup) — the subagent's verification step catches hallucinated file edits.
+- Tasks where you want a structured report (`exit:`, `files:`, `verified:`) for the orchestrator's downstream handling.
+- Multi-file tasks where the verification of each file is non-trivial.
+
+### Quick decision tree
+
+```
+Task fits the local model? ── no ──> keep on real Claude
+        │
+        yes
+        │
+Will the model write files? ── yes ──> Shape A (subagent, file verification)
+        │
+        no
+        │
+Is the task trivial enough that the
+subagent might answer directly?    ── yes ──> Shape C (direct Bash, guaranteed)
+        │
+        no (e.g., needs the local model's
+            actual code-gen style, length,
+            or vocabulary)
+        │
+        └──> Shape B (subagent, no file verification)
+```
+
+## Pre-flight checklist (orchestrator side)
+
+Before invoking, mentally check:
+
+1. **Sizing** — can the task be described in <500 tokens + ≤8K tokens of context? If no, scope-split or keep on real Claude.
+2. **Cohesion** — is the task contained to 1-3 files? If it sprawls, keep on real Claude.
+3. **Verifiability** — can you state an objective acceptance criterion (a passing test, a successful build, a grep returning N hits)? If you can't state how you'd know it worked, don't delegate.
+4. **Recoverability** — if the local model produces wrong output, can you `git checkout -- <files>` and try again on real Claude? If not (e.g., it's a brand-new file), reduce blast radius first.
+
+## Stack health (drop into a Bash if unsure)
+
+```bash
+curl -sf http://127.0.0.1:8080/health  # llama-server (loads model on first start, ~65 s cold)
+ccr status                              # CCR
+systemctl --user status llama-server    # if either above fails
+```
+
+The wrapper auto-starts both if missing. But on cold start, the first call takes ~65 s for model load. Subsequent calls (within the 30-min keep-alive) are warm.
+
+## Failure handling
+
+| Symptom | Likely cause | Action |
+|---|---|---|
+| Wrapper exit 2, stderr says "llama-server failed health check" | Model load failed (GPU contention, OOM) | Check `journalctl --user -u llama-server --since '5 min ago'`. Often: another GPU consumer started. Run `~/llm/scripts/use-llama-server.sh` to force-restart clean. |
+| Wrapper exit 1, claude session error | CCR translation issue or context overflow | Check `~/.claude-code-router/` logs. Shrink the prompt context, retry. |
+| Clean exit, output references edits that aren't there | Local model hallucinated the edit | Subagent's verification step catches this. Fall back to real Claude. |
+| Clean exit, output is mid-sentence cut | Hit max_tokens or context overflow | Reduce prompt size and retry, OR raise max_tokens in the wrapper. |
+| Repeated/looping output | Sampling broke (rare with our config) | Retry on real Claude — don't iterate on local. |
+
+## Anti-patterns
+
+- **Don't retry the same task on local.** If first attempt fails, fall back. Iterating burns wall clock without fixing the underlying capability gap.
+- **Don't chain local subagents.** Sequential local calls compound error rate. Use real Claude as the connecting tissue.
+- **Don't pass the orchestrator's full CLAUDE.md / rules context.** Wrapper uses `--bare` precisely to avoid this — the local model gets a clean context. Pass only the task-relevant context inline.
+- **Don't delegate work you wouldn't trust a junior dev to do with the same brief.** If the brief itself requires deep project knowledge to write correctly, the implementer needs it too.
+
+## CLI usage (outside Claude Code)
+
+Useful for testing the stack without spawning a subagent:
+
+```bash
+echo "Write a Rust function that reverses a string in-place." \
+  | ~/llm/scripts/local-coder-task.sh
+```
+
+Output goes to stdout. Same env, same flags as what the subagent uses.
+
+## See also
+
+- `~/.claude/agents/local-coder.md` — the subagent profile
+- `~/llm/scripts/local-coder-task.sh` — the wrapper this skill invokes
+- `~/.claude-code-router/config.json` — CCR routing
+- `~/llm/scripts/claude-local.sh` — interactive `claude code` against the local stack (different use case: full claude session vs one-shot subtask)
@@ -0,0 +1,84 @@
+---
+name: unload-local-model
+description: Unload the local llama.cpp model (Qwen3-Coder-30B) from the 7900 XTX to free VRAM. Stops the llama-server systemd user service and reaps any stray foreground server. Idempotent — safe to run when already unloaded. Use when done with local-model work or when you want the GPU's VRAM back.
+---
+
+# /unload-local-model
+
+Free the GPU by unloading the local Qwen3-Coder-30B model that backs the
+`local-coder` subagent (see [local-delegate](../local-delegate/SKILL.md)). The
+model is served by `llama-server` (llama.cpp) and pins ~9.5 GB of VRAM on the
+Radeon RX 7900 XTX while resident. This skill stops it cleanly and verifies the
+VRAM is back.
+
+## What holds the GPU
+
+| Layer | Holds VRAM? | This skill touches it? |
+|---|---|---|
+| `llama-server.service` (systemd --user, port 8080) | **Yes** — the model weights + KV cache | **Stops it** |
+| stray foreground `llama-server` (from `llama-server-foreground.sh`) | **Yes**, if running outside systemd | **Reaps it** |
+| `claude-code-router` / `ccr` (port 3456) | No — pure API translator, no VRAM | Left running |
+| `ollama` daemon (port 11434) | Only while a model is loaded | Out of scope — see note below |
+
+Leaving CCR up is deliberate: it holds no VRAM and re-attaches to llama-server
+the next time the stack warms. There is nothing to restart.
+
+## Run it
+
+```bash
+# 1. Canonical path — stop the systemd user service (idempotent; no-op if dead).
+systemctl --user stop llama-server.service
+
+# 2. Reap any stray foreground server started outside systemd. Match the binary
+#    PATH (leading slash) — NOT the bare word "llama-server", or pkill matches
+#    its own command line and SIGTERMs the shell running this skill.
+pkill -f '/llama-server ' 2>/dev/null || true
+```
+
+## Verify
+
+```bash
+echo "service: $(systemctl --user is-active llama-server.service)"   # want: inactive
+pgrep -af '/llama-server' | grep -v pgrep || echo "no server process"  # want: none
+curl -sf --max-time 2 http://127.0.0.1:8080/health >/dev/null 2>&1 \
+  && echo "port 8080: UP (STILL LOADED)" || echo "port 8080: down (unloaded)"
+# VRAM should drop to desktop baseline (~2.4 GiB); a loaded model adds ~9.5 GB.
+rocm-smi --showmeminfo vram 2>/dev/null | awk '/Used/{printf "VRAM used: ~%d MiB\n", $NF/1024/1024}'
+```
+
+A clean unload reads: `service: inactive`, `no server process`, `port 8080:
+down`, VRAM near the desktop baseline.
+
+## Gotchas
+
+- **Self-pkill footgun.** `pkill -f 'llama-server'` (no slash) matches *this
+  skill's own command string* and kills the shell mid-run (exit 144 = SIGTERM).
+  Always anchor on the binary path: `pkill -f '/llama-server '`.
+- **Already unloaded is the common case.** The systemd unit is `disabled` and
+  only runs on demand (the wrapper auto-starts it), so most of the time the
+  model is already down. The skill is idempotent — running it then is a no-op
+  that just confirms state. Report "already unloaded" rather than implying you
+  stopped something.
+- **Don't disable or mask the service.** Stopping unloads the model; the next
+  `/local-delegate` call auto-starts it again (~65 s cold load). Disabling would
+  break that auto-start. Stop only.
+
+## Note on ollama
+
+The stack can alternatively serve the same model via the `ollama` daemon (port
+11434). If a request asks to free the GPU broadly and ollama has a model
+resident, also run:
+
+```bash
+ollama stop qwen3-coder-30b-a3b-q5kxl 2>/dev/null || true
+```
+
+This skill's default scope is the llama.cpp path (`llama-server`), which is what
+`local-coder` uses. Reach for the ollama stop only when ollama is the active
+backend (`~/llm/scripts/use-ollama.sh` was run).
+
+## See also
+
+- [local-delegate](../local-delegate/SKILL.md) — when/how to *use* the local model.
+- `~/llm/scripts/use-ollama.sh` — stops llama-server so ollama can take the GPU.
+- `~/llm/scripts/use-llama-server.sh` — the inverse: load llama-server, free ollama.