Llama.cpp + OpenWebUI Setup on Proxmox

This document details the complete setup of a llama.cpp server with OpenWebUI interface running in a Proxmox container with HTTPS access.

Container Specifications

  • Container ID: #106
  • Name: llama-ai
  • RAM: 8GB
  • CPU: 4 cores
  • Storage: 32GB (local-lvm)
  • MAC Address: BC:24:11:15:F2:3A
  • IP Address: 172.16.32.135
  • OS: Debian 12 (Bookworm)

Services Overview

llama.cpp Server

  • Port: 8080
  • Model: Qwen2.5-1.5B-Instruct (Q4_0 quantization)
  • Context Window: 16,384 tokens (~12,000-13,000 words)
  • Service: llama-cpp.service
  • Status: Auto-start enabled

OpenWebUI

  • Port: 3000
  • Interface: Web-based chat interface
  • Service: open-webui.service
  • Status: Auto-start enabled
  • PyTorch: CPU version installed

NGINX Reverse Proxy

  • HTTP Port: 80 (redirects to HTTPS)
  • HTTPS Port: 443
  • Domain: https://llama-ai.<yourdomain.com>
  • SSL: Let’s Encrypt with Cloudflare DNS challenge
  • Auto-renewal: Enabled

Installation Steps

1. Create Proxmox Container

pct create 106 local:vztmpl/debian-12-standard_12.12-1_amd64.tar.zst \
  --hostname llama-ai \
  --memory 8192 \
  --cores 4 \
  --rootfs local-lvm:32 \
  --net0 name=eth0,bridge=vmbr0,hwaddr=BC:24:11:15:F2:3A,ip=dhcp \
  --unprivileged 1 \
  --onboot 1
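
Once the container is created, start it and open a shell from the Proxmox host; all remaining steps run inside the container:

pct start 106
pct enter 106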

2. Install Dependencies

apt update && apt upgrade -y
apt install -y build-essential cmake git curl wget python3 python3-pip pkg-config libcurl4-openssl-dev

3. Build llama.cpp

cd /opt
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake .. && make -j$(nproc)
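
If the build succeeds, the server binary lands in build/bin. As a quick sanity check, llama-server should print its version and build info with --version:

/opt/llama.cpp/build/bin/llama-server --version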

4. Download Models

mkdir -p /opt/llama.cpp/models
cd /opt/llama.cpp/models

# Qwen2.5 0.5B (fastest, 409MB)
wget -O qwen2.5-0.5b-instruct-q4_0.gguf \
  https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_0.gguf

# Qwen2.5 1.5B (most capable, 1017MB)
wget -O qwen2.5-1.5b-instruct-q4_0.gguf \
  https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_0.gguf

# Llama 3.2 1B (balanced, 738MB)
wget -O llama3.2-1b-instruct-q4_0.gguf \
  https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
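
Once the downloads finish, confirm all three GGUF files are in place:

ls -lh /opt/llama.cpp/models/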

5. Create llama.cpp Service

# /etc/systemd/system/llama-cpp.service
[Unit]
Description=Llama.cpp Server
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/opt/llama.cpp
ExecStart=/opt/llama.cpp/build/bin/llama-server \
  --model /opt/llama.cpp/models/qwen2.5-1.5b-instruct-q4_0.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 16384
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

6. Install OpenWebUI

python3 -m venv /opt/openwebui-venv
source /opt/openwebui-venv/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install open-webui
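
A quick check that the CLI landed in the venv (the open-webui entry point provides the serve command used by the service below):

/opt/openwebui-venv/bin/open-webui --help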

7. Create OpenWebUI Service

# /etc/systemd/system/open-webui.service
[Unit]
Description=OpenWebUI
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/opt/openwebui-venv
Environment=OPENAI_API_BASE_URL=http://127.0.0.1:8080/v1
Environment=OPENAI_API_KEY=sk-dummy
Environment=WEBUI_AUTH=false
ExecStart=/opt/openwebui-venv/bin/open-webui serve --port 3000 --host 0.0.0.0
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

8. Setup HTTPS with Let’s Encrypt

Install NGINX and Certbot

apt install -y nginx python3-certbot-nginx python3-certbot-dns-cloudflare

Configure Cloudflare Credentials

mkdir -p /etc/letsencrypt/credentials
chmod 700 /etc/letsencrypt/credentials
echo "dns_cloudflare_api_token = YOUR_CLOUDFLARE_TOKEN" > /etc/letsencrypt/credentials/cloudflare.ini
chmod 600 /etc/letsencrypt/credentials/cloudflare.ini

Obtain SSL Certificate

certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/credentials/cloudflare.ini \
  -d llama-ai.<yourdomain.com> \
  --non-interactive \
  --agree-tos \
  --email [email protected]

Configure NGINX

# /etc/nginx/sites-available/llama-ai.<yourdomain.com>
server {
    listen 80;
    server_name llama-ai.<yourdomain.com>;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name llama-ai.<yourdomain.com>;

    ssl_certificate /etc/letsencrypt/live/llama-ai.<yourdomain.com>/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llama-ai.<yourdomain.com>/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

9. Enable Services

systemctl daemon-reload
systemctl enable --now llama-cpp.service
systemctl enable --now open-webui.service
systemctl enable --now nginx

ln -s /etc/nginx/sites-available/llama-ai.<yourdomain.com> /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx

Switching Models

llama.cpp runs as a systemd service with a single model loaded at start (set via --model in llama-cpp.service). To change models, update the service, reload, and restart.

  • Check the currently active model
grep -- "--model" /etc/systemd/system/llama-cpp.service | sed 's/.*--model //; s/ --host.*//'
systemctl is-active llama-cpp
  • Switch to Qwen2.5 0.5B (fastest)
systemctl stop llama-cpp
sed -i "s|--model /opt/llama.cpp/models/.*\.gguf|--model /opt/llama.cpp/models/qwen3-0.6b-q4_0.gguf|" /etc/systemd/system/llama-cpp.service
systemctl daemon-reload
systemctl start llama-cpp
  • Switch to Qwen2.5 1.5B Instruct (most capable)
systemctl stop llama-cpp
sed -i "s|--model /opt/llama.cpp/models/.*\.gguf|--model /opt/llama.cpp/models/qwen2.5-1.5b-instruct-q4_0.gguf|" /etc/systemd/system/llama-cpp.service
systemctl daemon-reload
systemctl start llama-cpp
  • Switch to Llama 3.2 1B Instruct (balanced)
systemctl stop llama-cpp
sed -i "s|--model /opt/llama.cpp/models/.*\.gguf|--model /opt/llama.cpp/models/llama3.2-1b-instruct-q4_0.gguf|" /etc/systemd/system/llama-cpp.service
systemctl daemon-reload
systemctl start llama-cpp

Notes:

  • After switching, OpenWebUI will automatically use the new model (it connects to the same llama.cpp endpoint). Refresh the UI if needed.
  • If you prefer one-click switching, create small scripts (e.g., /usr/local/bin/switch-to-qwen-1.5b) that run the commands above.
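
A minimal sketch of such a helper, assuming the target model file name is passed as an argument (the script name and argument handling are illustrative, not part of the original setup):

#!/usr/bin/env bash
# /usr/local/bin/llama-switch-model  (hypothetical helper; rename to taste)
# Usage: llama-switch-model qwen2.5-0.5b-instruct-q4_0.gguf
set -euo pipefail

MODEL_DIR=/opt/llama.cpp/models
MODEL_FILE="${1:?usage: llama-switch-model <model-file.gguf>}"

# Refuse to switch to a model that has not been downloaded
if [ ! -f "$MODEL_DIR/$MODEL_FILE" ]; then
    echo "Model not found: $MODEL_DIR/$MODEL_FILE" >&2
    exit 1
fi

# Same steps as the manual procedure above
systemctl stop llama-cpp
sed -i "s|--model $MODEL_DIR/.*\.gguf|--model $MODEL_DIR/$MODEL_FILE|" /etc/systemd/system/llama-cpp.service
systemctl daemon-reload
systemctl start llama-cpp
echo "llama-cpp is now serving $MODEL_FILE"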

Access URLs

  • Primary Interface: https://llama-ai.<yourdomain.com>
  • Direct OpenWebUI: http://172.16.32.135:3000
  • llama.cpp API: http://172.16.32.135:8080
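
The llama.cpp server exposes an OpenAI-compatible API under /v1 (the same base URL OpenWebUI is pointed at), so it can be queried directly with curl; with a single model loaded, no model field is needed in the request:

curl http://172.16.32.135:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'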

Key Features

  • Large Context Window: 16,384 tokens for long conversations
  • HTTPS Security: Valid SSL certificate with auto-renewal
  • WebSocket Support: Real-time chat interface
  • Auto-startup: All services start automatically on boot
  • CPU Optimized: PyTorch CPU version for efficient resource usage

Model Details

Multiple models are available for switching. The service starts with Qwen2.5 1.5B by default.

Qwen2.5 0.5B-Instruct (qwen2.5-0.5b-instruct-q4_0.gguf)

  • Parameters: 0.49 billion
  • File Size: 409MB
  • Speed: Fastest inference
  • Use Case: Quick responses, simple tasks
  • Training Context: 32,768 tokens

Qwen2.5 1.5B-Instruct (qwen2.5-1.5b-instruct-q4_0.gguf)

  • Parameters: 1.54 billion
  • File Size: 1017MB
  • Speed: Moderate inference
  • Use Case: Most capable, complex reasoning
  • Training Context: 32,768 tokens

Llama 3.2 1B-Instruct (llama3.2-1b-instruct-q4_0.gguf)

  • Parameters: 1.24 billion
  • File Size: 738MB
  • Speed: Good balance
  • Use Case: Balanced performance/speed
  • Training Context: 131,072 tokens

Common Features:

  • Quantization: Q4_0 (4-bit) for all models
  • Running Context: 16,384 tokens (can be adjusted in the service config; see the example below)
  • Language Support: Multilingual capabilities
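
To change the running context, adjust --ctx-size in the service file and restart, mirroring the model-switch procedure; for example, to drop to 8,192 tokens (a larger context increases RAM usage):

sed -i 's/--ctx-size 16384/--ctx-size 8192/' /etc/systemd/system/llama-cpp.service
systemctl daemon-reload
systemctl restart llama-cpp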

Resource Usage

  • RAM Usage: ~1.1GB (llama.cpp) + ~500MB (OpenWebUI)
  • Storage: ~2GB total (model + applications)
  • CPU: Moderate load during inference

Maintenance

  • Certificate Renewal: Automatic via certbot.timer
  • Service Logs: journalctl -u service-name
  • Model Switching: Use commands from “Switching Models” section above
  • Model Updates: Download new GGUF files to /opt/llama.cpp/models/
  • OpenWebUI Updates: pip install --upgrade open-webui in the venv (see the snippet below)
  • Check Current Model: grep -- "--model" /etc/systemd/system/llama-cpp.service
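
A typical OpenWebUI update, run inside the venv and followed by a service restart:

source /opt/openwebui-venv/bin/activate
pip install --upgrade open-webui
deactivate
systemctl restart open-webui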

Troubleshooting

Service Status

systemctl status llama-cpp.service
systemctl status open-webui.service
systemctl status nginx

Health Checks

curl http://127.0.0.1:8080/health
curl -I http://127.0.0.1:3000
curl -I https://llama-ai.<yourdomain.com>

Logs

journalctl -u llama-cpp.service -f
journalctl -u open-webui.service -f
tail -f /var/log/nginx/access.log

Setup completed: October 16, 2025
Container IP: 172.16.32.135
Public URL: https://llama-ai.<yourdomain.com>