Manage Claude Code config + add Justfile via bombadil

Bring ~/.claude config under bombadil management across both machines:

- claude/shared/: converged settings.json (union of both hosts) and a single
  Catppuccin-powerline statusline merged from the two machines' versions
- claude/xps, claude/desktop: per-host agents/skills behind
  [profiles.xps]/[profiles.desktop]; each host links only its own via
  `bombadil link -p <theme> <host>`

Linked at file granularity because bombadil 4.2.0 can't create directory
symlinks for new targets, and to keep ~/.claude/{agents,skills} real dirs.

Add a Justfile (symlinked to ~/.justfile, usable via `just -g`) with
link/dark/light/watch/unlink/update/status/edit recipes; host auto-detected
from hostname. Recipes use exported shell vars to avoid bombadil's Tera engine
mis-parsing just's double-brace interpolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 17:17:17 -05:00
parent 59c40ad5ad
commit e884e4a88f
10 changed files with 921 additions and 5 deletions
@@ -0,0 +1,84 @@
+---
+name: unload-local-model
+description: Unload the local llama.cpp model (Qwen3-Coder-30B) from the 7900 XTX to free VRAM. Stops the llama-server systemd user service and reaps any stray foreground server. Idempotent — safe to run when already unloaded. Use when done with local-model work or when you want the GPU's VRAM back.
+---
+
+# /unload-local-model
+
+Free the GPU by unloading the local Qwen3-Coder-30B model that backs the
+`local-coder` subagent (see [local-delegate](../local-delegate/SKILL.md)). The
+model is served by `llama-server` (llama.cpp) and pins ~9.5 GB of VRAM on the
+Radeon RX 7900 XTX while resident. This skill stops it cleanly and verifies the
+VRAM is back.
+
+## What holds the GPU
+
+| Layer | Holds VRAM? | This skill touches it? |
+|---|---|---|
+| `llama-server.service` (systemd --user, port 8080) | **Yes** — the model weights + KV cache | **Stops it** |
+| stray foreground `llama-server` (from `llama-server-foreground.sh`) | **Yes**, if running outside systemd | **Reaps it** |
+| `claude-code-router` / `ccr` (port 3456) | No — pure API translator, no VRAM | Left running |
+| `ollama` daemon (port 11434) | Only while a model is loaded | Out of scope — see note below |
+
+Leaving CCR up is deliberate: it holds no VRAM and re-attaches to llama-server
+the next time the stack warms. There is nothing to restart.
+
+## Run it
+
+```bash
+# 1. Canonical path — stop the systemd user service (idempotent; no-op if dead).
+systemctl --user stop llama-server.service
+
+# 2. Reap any stray foreground server started outside systemd. Match the binary
+#    PATH (leading slash) — NOT the bare word "llama-server", or pkill matches
+#    its own command line and SIGTERMs the shell running this skill.
+pkill -f '/llama-server ' 2>/dev/null || true
+```
+
+## Verify
+
+```bash
+echo "service: $(systemctl --user is-active llama-server.service)"   # want: inactive
+pgrep -af '/llama-server' | grep -v pgrep || echo "no server process"  # want: none
+curl -sf --max-time 2 http://127.0.0.1:8080/health >/dev/null 2>&1 \
+  && echo "port 8080: UP (STILL LOADED)" || echo "port 8080: down (unloaded)"
+# VRAM should drop to desktop baseline (~2.4 GiB); a loaded model adds ~9.5 GB.
+rocm-smi --showmeminfo vram 2>/dev/null | awk '/Used/{printf "VRAM used: ~%d MiB\n", $NF/1024/1024}'
+```
+
+A clean unload reads: `service: inactive`, `no server process`, `port 8080:
+down`, VRAM near the desktop baseline.
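
The awk filter in the Verify block can be checked without a GPU by feeding it a canned line. This is a minimal sketch: the `sample` line is an assumed approximation of what `rocm-smi --showmeminfo vram` prints (the real layout may differ between ROCm versions), and `vram_used_line` is a hypothetical wrapper name, not part of the committed skill.

```shell
# Assumed sample of one rocm-smi output line; the last field is bytes in use.
sample='GPU[0] : VRAM Total Used Memory (B): 10200547328'

# Hypothetical wrapper around the same awk program used in the Verify block:
# pick lines mentioning "Used" and convert the last field from bytes to MiB.
vram_used_line() {
  awk '/Used/{printf "VRAM used: ~%d MiB\n", $NF/1024/1024}'
}

printf '%s\n' "$sample" | vram_used_line   # prints: VRAM used: ~9728 MiB
```

The filter relies on the byte count being the last whitespace-separated field; if a ROCm release changes that, the awk program would need adjusting.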
+
+## Gotchas
+
+- **Self-pkill footgun.** `pkill -f 'llama-server'` (no slash) matches *this
+  skill's own command string* and kills the shell mid-run (exit 144 = SIGTERM).
+  Always anchor on the binary path: `pkill -f '/llama-server '`.
+- **Already unloaded is the common case.** The systemd unit is `disabled` and
+  only runs on demand (the wrapper auto-starts it), so most of the time the
+  model is already down. The skill is idempotent — running it then is a no-op
+  that just confirms state. Report "already unloaded" rather than implying you
+  stopped something.
+- **Don't disable or mask the service.** Stopping unloads the model; the next
+  `/local-delegate` call auto-starts it again (~65 s cold load). Disabling would
+  break that auto-start. Stop only.
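
The "already unloaded" gotcha above can be made mechanical by choosing the report wording from the state observed before anything is stopped. This is a minimal sketch, not part of the committed skill: `unload_message` and both of its arguments are hypothetical, and it assumes the caller has already captured the service state and counted any stray foreground servers.

```shell
# Hypothetical helper: choose the report wording from the state observed
# BEFORE stopping anything.
#   $1 = output of `systemctl --user is-active llama-server.service`
#   $2 = number of stray foreground llama-server processes found (default 0)
unload_message() {
  service_state="$1"
  stray_count="${2:-0}"
  case "$service_state" in
    active|activating|reloading)
      echo "stopped llama-server.service; model unloaded" ;;
    *)
      if [ "$stray_count" -gt 0 ]; then
        echo "reaped $stray_count stray llama-server process(es); model unloaded"
      else
        echo "already unloaded"
      fi ;;
  esac
}

unload_message inactive 0   # prints: already unloaded
unload_message active 0     # prints: stopped llama-server.service; model unloaded
```

On the host, the first argument would come from running `systemctl --user is-active llama-server.service` before the stop command, so the message reflects what actually changed.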
+
+## Note on ollama
+
+The stack can alternatively serve the same model via the `ollama` daemon (port
+11434). If a request asks to free the GPU broadly and ollama has a model
+resident, also run:
+
+```bash
+ollama stop qwen3-coder-30b-a3b-q5kxl 2>/dev/null || true
+```
+
+This skill's default scope is the llama.cpp path (`llama-server`), which is what
+`local-coder` uses. Reach for the ollama stop only when ollama is the active
+backend (`~/llm/scripts/use-ollama.sh` was run).
+
+## See also
+
+- [local-delegate](../local-delegate/SKILL.md) — when/how to *use* the local model.
+- `~/llm/scripts/use-ollama.sh` — stops llama-server so ollama can take the GPU.
+- `~/llm/scripts/use-llama-server.sh` — the inverse: load llama-server, free ollama.