← Back to Blog

🤖 Running Local LLM with OpenClaw

Setting up the llama.cpp server on Docker + OpenClaw configuration — fully offline, no API key required


This guide shows you how to run a local LLM (Qwen3.6 35B) on a server (Ubuntu) using Docker llama.cpp, then connect it to OpenClaw to use it as a personal AI assistant.

📊 Performance Benchmark

💻 Test hardware: RTX 3090 (24GB VRAM), Ubuntu PC
📈 Model: Qwen3.6-35B-A3B-MTP-GGUF (UD-Q3_K_XL)
Task type Decode TPS TTFT Description
📝 Narrative ~134 tok/s ~68ms Natural text generation, answering questions
💻 Code ~177 tok/s ~83ms Code generation, scripts

GPU usage: ~21GB/24GB VRAM, utilization ~88%, power ~289W


System architecture:

🖥️ Step 1: Set up the Ubuntu server

1.1 Create the docker-compose.yml file

services:
  llama-cpp-qwen36-35b-a3b-mtp:
    image: docker.io/eav782021/llama-cpp:mtp-cuda
    container_name: llama-cpp-qwen36-35b-a3b-mtp
    restart: unless-stopped
    runtime: nvidia
    ports:
      - "8020:8080"
    volumes:
      - /path/to/models-cache:/models
    environment:
      LLAMA_CACHE: /models/llama-cache
    entrypoint:
      - bash
      - -c
      - |
        set -e
        EXTRA_ARGS=()
        if [ "$${DISABLE_THINKING:-0}" = "1" ]; then
          EXTRA_ARGS+=("--chat-template-kwargs" '{"enable_thinking":false}')
          echo "[entrypoint] DISABLE_THINKING=1"
        fi
        exec /app/llama-server "$$@" "$${EXTRA_ARGS[@]}"
      - --
    command:
      - --host
      - 0.0.0.0
      - --port
      - "8080"
      - -hf
      - unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL
      - -c
      - "229376"
      - -b
      - "2048"
      - -ub
      - "512"
      - -ngl
      - "99"
      - -fa
      - "on"
      - --cache-type-k
      - q4_0
      - --cache-type-v
      - q4_0
      - -np
      - "1"
      - --spec-type
      - draft-mtp
      - --spec-draft-n-max
      - "3"
      - --jinja
      - --reasoning-format
      - none
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [compute, utility]

1.2 Start the container

cd /path/to/folder/
docker compose pull
docker compose up -d

1.3 Verify the server is running

curl http://localhost:8020/health
# Response: {"status": "ok"}

💻 Step 2: Configure OpenClaw on the client

2.1 Add the llama.cpp provider

Add this to your openclaw.json file so OpenClaw connects to the llama.cpp server:

"models": {
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://SERVER_IP:8020/v1",
      "api": "openai-completions",
      "apiKey": "sk-not-needed",  // llama.cpp doesn't require an API key
      "request": {
        "allowPrivateNetwork": true
      },
      "models": [
        {
          "id": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL",
          "name": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL",
          "reasoning": false,
          "input": ["text"],
          "contextWindow": 229376,
          "maxTokens": 32768
        }
      ]
    }
  }
}

2.2 Set the default model

openclaw config set agents.defaults.model.primary "llama-cpp/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL"
openclaw gateway restart

✅ Step 3: Verify the setup

In Telegram:

/status
# You should see: Model: unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q3_K_XL

Send a test message: "Hello" — if the server responds, you're all set! 🎉

🔧 Important Notes

💡 GPU: Make sure the server has nvidia-container-toolkit installed: sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker
💡 Model size: The Qwen3.6 35B model requires ~20GB VRAM. Ensure your server GPU has enough memory.
💡 Network: The client must be able to reach the server's IP on port 8020. Check the firewall if needed.
💡 Auto-start: The restart: unless-stopped setting in docker-compose ensures the container restarts automatically after a server reboot.
💡 Performance: On an RTX 3090, the model runs at ~134 tok/s (narrative) and ~177 tok/s (code). Using MTP (Multi-Token Prediction) significantly boosts throughput compared to vanilla llama.cpp.