dotfiles/claude/desktop/skills/unload-local-model/SKILL.md
daniel e884e4a88f Manage Claude Code config + add Justfile via bombadil
Bring ~/.claude config under bombadil management across both machines:
- claude/shared/: converged settings.json (union of both hosts) and a single
  Catppuccin-powerline statusline merged from the two machines' versions
- claude/xps, claude/desktop: per-host agents/skills behind [profiles.xps]/
  [profiles.desktop]; each host links only its own via `bombadil link -p <theme> <host>`

Linked at file granularity because bombadil 4.2.0 can't create directory
symlinks for new targets, and to keep ~/.claude/{agents,skills} real dirs.

Add a Justfile (symlinked to ~/.justfile, usable via `just -g`) with link/
dark/light/watch/unlink/update/status/edit recipes; host auto-detected from
hostname. Recipes use exported shell vars to avoid bombadil's Tera engine
mis-parsing just's double-brace interpolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 17:17:17 -05:00


| name | description |
| --- | --- |
| unload-local-model | Unload the local llama.cpp model (Qwen3-Coder-30B) from the 7900 XTX to free VRAM. Stops the llama-server systemd user service and reaps any stray foreground server. Idempotent; safe to run when already unloaded. Use when done with local-model work or when you want the GPU's VRAM back. |

/unload-local-model

Free the GPU by unloading the local Qwen3-Coder-30B model that backs the local-coder subagent (see local-delegate). The model is served by llama-server (llama.cpp) and pins ~9.5 GB of VRAM on the Radeon RX 7900 XTX while resident. This skill stops it cleanly and verifies the VRAM is back.

What holds the GPU

| Layer | Holds VRAM? | This skill touches it? |
| --- | --- | --- |
| llama-server.service (systemd --user, port 8080) | Yes: the model weights + KV cache | Stops it |
| stray foreground llama-server (from llama-server-foreground.sh) | Yes, if running outside systemd | Reaps it |
| claude-code-router / ccr (port 3456) | No: pure API translator, no VRAM | Left running |
| ollama daemon (port 11434) | Only while a model is loaded | Out of scope; see note below |

Leaving CCR up is deliberate: it holds no VRAM and re-attaches to llama-server the next time the stack warms. There is nothing to restart.
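
As a rough way to see which of these layers is up, the sketch below probes the three ports listed in the table. It assumes everything binds to 127.0.0.1 and that the shell is bash (the /dev/tcp redirect is a bash feature); port_state is a hypothetical helper name, not part of the skill. A listening port only shows that the process is running: for ollama it does not by itself mean a model is resident.

```shell
# Hypothetical helper: report whether anything accepts connections on a local port.
port_state() {
  local port=$1
  # The redirect succeeds only if a listener accepts the connection.
  # It runs in a subshell so the descriptor is closed again right away.
  if (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
    echo "listening"
  else
    echo "closed"
  fi
}

# Ports as given in the table above.
for entry in "llama-server:8080" "ccr:3456" "ollama:11434"; do
  name=${entry%%:*}
  port=${entry##*:}
  printf '%-13s port %-5s %s\n' "$name" "$port" "$(port_state "$port")"
done
```

After a clean unload, the expectation from the table is that 8080 reads closed while 3456 may stay listening.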

Run it

# 1. Canonical path — stop the systemd user service (idempotent; no-op if dead).
systemctl --user stop llama-server.service

# 2. Reap any stray foreground server started outside systemd. Match the binary
#    PATH (leading slash), NOT the bare word "llama-server", or pkill -f also
#    matches the command line of the shell running this skill and SIGTERMs it.
pkill -f '/llama-server ' 2>/dev/null || true
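
pkill only sends the signal and returns; it does not wait for the target to exit. If verification runs immediately afterwards, a stray server may still be shutting down. The following is a minimal sketch of a bounded wait, assuming pgrep is available; wait_gone and its 10-second default are illustrative and not part of the skill.

```shell
# Hypothetical helper: wait until no process matches the pattern, or give up.
# $1 = pgrep -f pattern, $2 = timeout in seconds (default 10, an assumption).
wait_gone() {
  local pattern=$1
  local timeout=${2:-10}
  local waited=0
  while pgrep -f "$pattern" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1   # something still matches after the timeout
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0       # nothing matches
}
```

It could be called as wait_gone '[/]llama-server ' 15 right after the pkill. The [/] character class still matches the binary path, but the pattern's own text no longer contains the string it searches for, which matters because pgrep -f also matches any shell whose command line quotes the pattern.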

Verify

echo "service: $(systemctl --user is-active llama-server.service)"   # want: inactive
pgrep -af '/llama-server' | grep -v pgrep || echo "no server process"  # want: none
curl -sf --max-time 2 http://127.0.0.1:8080/health >/dev/null 2>&1 \
  && echo "port 8080: UP (STILL LOADED)" || echo "port 8080: down (unloaded)"
# VRAM should drop to desktop baseline (~2.4 GiB); a loaded model adds ~9.5 GB.
rocm-smi --showmeminfo vram 2>/dev/null | awk '/Used/{printf "VRAM used: ~%d MiB\n", $NF/1024/1024}'

A clean unload reads: service: inactive, no server process, port 8080: down, VRAM near the desktop baseline.
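
To turn the VRAM reading into a pass/fail check, the sketch below compares a used-VRAM byte count against a cutoff. The 6144 MiB default is an assumption chosen to sit between the ~2.4 GiB baseline and a loaded model's footprint; vram_verdict is a hypothetical helper, and the byte count is expected to be the last field of the rocm-smi Used line, as in the awk command above.

```shell
# Hypothetical helper: classify a used-VRAM byte count.
# $1 = bytes in use, $2 = cutoff in MiB (default 6144, an assumed midpoint).
vram_verdict() {
  local used_bytes=$1
  local threshold_mib=${2:-6144}
  local used_mib=$(( used_bytes / 1024 / 1024 ))
  if [ "$used_mib" -ge "$threshold_mib" ]; then
    echo "VRAM used: ~${used_mib} MiB (model likely still resident)"
    return 1
  fi
  echo "VRAM used: ~${used_mib} MiB (near baseline)"
}
```

The input can come from the same pipeline, with awk printing $NF instead of formatting it; an empty reading (rocm-smi missing or failing) should be handled before calling the helper.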

Gotchas

  • Self-pkill footgun. pkill -f 'llama-server' (no slash) matches this skill's own command string and kills the shell mid-run (exit 143 = 128 + 15, SIGTERM). Always anchor on the binary path: pkill -f '/llama-server '.
  • Already unloaded is the common case. The systemd unit is disabled and only runs on demand (the wrapper auto-starts it), so most of the time the model is already down. The skill is idempotent — running it then is a no-op that just confirms state. Report "already unloaded" rather than implying you stopped something.
  • Don't disable or mask the service. Stopping unloads the model; the next /local-delegate call auto-starts it again (~65 s cold load). Disabling would break that auto-start. Stop only.

Note on ollama

The stack can alternatively serve the same model via the ollama daemon (port 11434). If a request asks to free the GPU broadly and ollama has a model resident, also run:

ollama stop qwen3-coder-30b-a3b-q5kxl 2>/dev/null || true

This skill's default scope is the llama.cpp path (llama-server), which is what local-coder uses. Reach for the ollama stop only when ollama is the active backend (~/llm/scripts/use-ollama.sh was run).
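
Rather than hard-coding the model name, one option is to ask ollama what is resident and stop only that. This is a sketch, not part of the skill: it assumes ollama ps prints a header row followed by one row per loaded model with the name in the first column, and ollama_loaded_models is a hypothetical helper.

```shell
# Hypothetical helper: read `ollama ps` output on stdin and print loaded model names.
ollama_loaded_models() {
  # Skip the header row and blank lines; the first column is assumed to be NAME.
  awk 'NR > 1 && NF { print $1 }'
}

# Only act when the ollama CLI exists; a stopped daemon simply yields no rows.
if command -v ollama >/dev/null 2>&1; then
  ollama ps 2>/dev/null | ollama_loaded_models | while read -r model; do
    ollama stop "$model" 2>/dev/null || true
  done
fi
```

Like the rest of the skill this is idempotent: with nothing resident, the loop body never runs.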

See also

  • local-delegate — when/how to use the local model.
  • ~/llm/scripts/use-ollama.sh — stops llama-server so ollama can take the GPU.
  • ~/llm/scripts/use-llama-server.sh — the inverse: load llama-server, free ollama.