🤖 Running Local LLM with OpenClaw

Setting up the llama.cpp server on Docker + OpenClaw configuration — fully offline, no API key required

This guide shows you how to run a local LLM (Qwen3.6 35B) on a server (Ubuntu) using Docker llama.cpp, then connect it to OpenClaw to use it as a personal AI assistant.

📊 Performance Benchmark

💻 Test hardware: RTX 3090 (24GB VRAM), Ubuntu PC
📈 Model: Qwen3.6-35B-A3B-MTP-GGUF (UD-Q3_K_XL)

Task type	Decode TPS	TTFT	Description
📝 Narrative	~134 tok/s	~68ms	Natural text generation, answering questions
💻 Code	~177 tok/s	~83ms	Code generation, scripts

GPU usage: ~21GB/24GB VRAM, utilization ~88%, power ~289W

System architecture:

Server (Ubuntu): Runs the llama.cpp server via Docker, exposing port 8020
Client (MacBook): Runs OpenClaw, connects to the server over the local network

🖥️ Step 1: Set up the Ubuntu server

1.1 Create the docker-compose.yml file

services:
  llama-cpp-qwen36-35b-a3b-mtp:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: "${ESTATE_CONTAINER:-llama-cpp-qwen36-35b-a3b-mtp}"
    restart: unless-stopped
    ports:
      - "${ESTATE_PORT:-${PORT:-8020}}:8080"
    volumes:
      - "${MODEL_DIR:-../../../../../models-cache}:/models"
    environment:
      LLAMA_CACHE: /models/llama-cache
    entrypoint:
      - bash
      - -c
      - |
        set -e
        EXTRA_ARGS=()
        if [ "$${DISABLE_THINKING:-0}" = "1" ]; then
          EXTRA_ARGS+=("--chat-template-kwargs" '{"enable_thinking":false}')
          echo "[entrypoint] DISABLE_THINKING=1 — chat template will produce empty "
        fi
        exec /app/llama-server "$$@" "$${EXTRA_ARGS[@]}"
      - --
    command:
      - --host
      - 0.0.0.0
      - --port
      - "8080"
      - -hf
      - ${HF_MODEL:-unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL}
      - -c
      - ${CTX_SIZE:-262144}
      - -b
      - ${BATCH_SIZE:-4096}
      - -ub
      - ${UBATCH_SIZE:-1024}
      - -ngl
      - "99"
      - -fa
      - "on"
      - --cache-type-k
      - ${KV_TYPE:-q4_0}
      - --cache-type-v
      - ${KV_TYPE:-q4_0}
      - -np
      - "1"
      - --jinja
      - --reasoning-format
      - ${REASONING_FORMAT:-none}
      - --temp
      - "0.6"
      - --top-p
      - "0.95"
      - --top-k
      - "20"
      - --min-p
      - "0.0"
      - --presence-penalty
      - "0.0"
      - --repeat-penalty
      - "1.05"
      - --spec-type
      - draft-mtp
      - --spec-draft-n-max
      - ${MTP_N:-2}
      - --spec-draft-backend-sampling
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["${ESTATE_GPUS:-${CUDA_VISIBLE_DEVICES:-0}}"]
              capabilities: [compute, utility]

1.2 Start the container

cd /path/to/folder/
docker compose pull
docker compose up -d

1.3 Verify the server is running

curl http://localhost:8020/health
# Response: {"status": "ok"}

💻 Step 2: Configure OpenClaw on the client

2.1 Add the llama.cpp provider

Add this to your openclaw.json file so OpenClaw connects to the llama.cpp server:

"models": {
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://SERVER_IP:8020/v1",
      "api": "openai-completions",
      "apiKey": "sk-not-needed",  // llama.cpp doesn't require an API key
      "request": {
        "allowPrivateNetwork": true
      },
      "models": [
        {
          "id": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL",
          "name": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL",
          "reasoning": false,
          "input": ["text"],
          "contextWindow": 229376,
          "maxTokens": 32768
        }
      ]
    }
  }
}

2.2 Set the default model

openclaw config set agents.defaults.model.primary "llama-cpp/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL"
openclaw gateway restart

✅ Step 3: Verify the setup

In Telegram:

/status
# You should see: Model: unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL

Send a test message: "Hello" — if the server responds, you're all set! 🎉

🔧 Important Notes

💡 GPU: Make sure the server has nvidia-container-toolkit installed: sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker

💡 Model size: The Qwen3.6 35B model requires ~20GB VRAM. Ensure your server GPU has enough memory.

💡 Network: The client must be able to reach the server's IP on port 8020. Check the firewall if needed.

💡 Auto-start: The restart: unless-stopped setting in docker-compose ensures the container restarts automatically after a server reboot.

💡 Performance: On an RTX 3090, the model runs at ~134 tok/s (narrative) and ~177 tok/s (code). Using MTP (Multi-Token Prediction) significantly boosts throughput compared to vanilla llama.cpp.