Setting up the llama.cpp server on Docker + OpenClaw configuration — fully offline, no API key required
This guide shows you how to run a local LLM (Qwen3.6 35B) on a server (Ubuntu) using Docker llama.cpp, then connect it to OpenClaw to use it as a personal AI assistant.
| Task type | Decode TPS | TTFT | Description |
|---|---|---|---|
| 📝 Narrative | ~134 tok/s | ~68ms | Natural text generation, answering questions |
| 💻 Code | ~177 tok/s | ~83ms | Code generation, scripts |
GPU usage: ~21GB/24GB VRAM, utilization ~88%, power ~289W
System architecture:
services:
llama-cpp-qwen36-35b-a3b-mtp:
image: docker.io/eav782021/llama-cpp:mtp-cuda
container_name: llama-cpp-qwen36-35b-a3b-mtp
restart: unless-stopped
runtime: nvidia
ports:
- "8020:8080"
volumes:
- /path/to/models-cache:/models
environment:
LLAMA_CACHE: /models/llama-cache
entrypoint:
- bash
- -c
- |
set -e
EXTRA_ARGS=()
if [ "$${DISABLE_THINKING:-0}" = "1" ]; then
EXTRA_ARGS+=("--chat-template-kwargs" '{"enable_thinking":false}')
echo "[entrypoint] DISABLE_THINKING=1"
fi
exec /app/llama-server "$$@" "$${EXTRA_ARGS[@]}"
- --
command:
- --host
- 0.0.0.0
- --port
- "8080"
- -hf
- unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL
- -c
- "229376"
- -b
- "2048"
- -ub
- "512"
- -ngl
- "99"
- -fa
- "on"
- --cache-type-k
- q4_0
- --cache-type-v
- q4_0
- -np
- "1"
- --spec-type
- draft-mtp
- --spec-draft-n-max
- "3"
- --jinja
- --reasoning-format
- none
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["0"]
capabilities: [compute, utility]
cd /path/to/folder/ docker compose pull docker compose up -d
curl http://localhost:8020/health
# Response: {"status": "ok"}
Add this to your openclaw.json file so OpenClaw connects to the llama.cpp server:
"models": {
"providers": {
"llama-cpp": {
"baseUrl": "http://SERVER_IP:8020/v1",
"api": "openai-completions",
"apiKey": "sk-not-needed", // llama.cpp doesn't require an API key
"request": {
"allowPrivateNetwork": true
},
"models": [
{
"id": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL",
"name": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL",
"reasoning": false,
"input": ["text"],
"contextWindow": 229376,
"maxTokens": 32768
}
]
}
}
}
openclaw config set agents.defaults.model.primary "llama-cpp/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL" openclaw gateway restart
In Telegram:
/status # You should see: Model: unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL
Send a test message: "Hello" — if the server responds, you're all set! 🎉
nvidia-container-toolkit installed:
sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker
restart: unless-stopped setting in docker-compose ensures the container restarts automatically after a server reboot.