Setting up llama.cpp in LXC Container on Proxmox

This guide documents the complete process of setting up llama.cpp in an LXC container on Proxmox with Intel GPU support and OpenAI-compatible API endpoints.

Overview

  • Goal: Replace Ollama with llama.cpp for better performance and lower resource usage
  • Hardware: Intel N150 integrated GPU (OpenCL support)
  • Container: Debian 12 LXC on Proxmox
  • API: OpenAI-compatible endpoints on port 11434

Container Creation

1. Create LXC Container

# Download Debian 12 template
pveam download local debian-12-standard_12.12-1_amd64.tar.zst

# Create container
pct create 107 local:vztmpl/debian-12-standard_12.12-1_amd64.tar.zst \
  --hostname llama-cpp \
  --memory 8192 \
  --swap 512 \
  --cores 4 \
  --rootfs local-lvm:32 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --features keyctl=1,nesting=1 \
  --unprivileged 1 \
  --onboot 1 \
  --tags ai
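
To confirm the container was created with the intended settings before going further, you can dump its configuration (purely a sanity check):

# Show the container configuration (memory, cores, rootfs, network, features)
pct config 107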

2. Add GPU Passthrough

# Get GPU group IDs
stat -c '%g' /dev/dri/card0      # Output: 44
stat -c '%g' /dev/dri/renderD128 # Output: 104

# Add GPU devices to container
pct set 107 --dev0 /dev/dri/card0,gid=44 --dev1 /dev/dri/renderD128,gid=104

# Start container
pct start 107
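
Once the container is running, it is worth confirming that both DRM devices actually show up inside it with the expected group IDs:

# List the passed-through GPU devices from inside the container
pct exec 107 -- ls -l /dev/dri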

Software Installation

3. Install Dependencies

# Update package list
pct exec 107 -- apt update

# Install build tools and dependencies
pct exec 107 -- apt install -y \
  build-essential \
  cmake \
  git \
  curl \
  pkg-config \
  libssl-dev \
  python3 \
  python3-pip \
  libcurl4-openssl-dev

# Install OpenCL support for Intel GPU
pct exec 107 -- apt install -y \
  opencl-headers \
  ocl-icd-opencl-dev \
  intel-opencl-icd
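
Optionally, you can verify that the Intel OpenCL runtime is registered inside the container. The clinfo tool is not part of the package list above, so install it first if you want this check:

# Install and run clinfo to list the available OpenCL platforms/devices
pct exec 107 -- apt install -y clinfo
pct exec 107 -- clinfo -l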

4. Compile llama.cpp

# Clone repository
pct exec 107 -- bash -c "cd /opt && git clone https://github.com/ggerganov/llama.cpp.git"

# Configure with OpenCL support
pct exec 107 -- bash -c "cd /opt/llama.cpp && mkdir build && cd build && cmake .. -DGGML_OPENCL=ON -DCMAKE_BUILD_TYPE=Release"

# Compile the server binary (single quotes so $(nproc) expands inside the container, not on the host)
pct exec 107 -- bash -c 'cd /opt/llama.cpp/build && make -j$(nproc) llama-server'

Model Setup

5. Download Models

# Create models directory
pct exec 107 -- mkdir -p /opt/llama.cpp/models

# Download Qwen2.5-1.5B model (Q4_0 quantized)
pct exec 107 -- bash -c "cd /opt/llama.cpp/models && curl -L -o qwen2.5-1.5b-q4_0.gguf https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_0.gguf"

Service Configuration

6. Create systemd Service

# Create service file
pct exec 107 -- bash -c "printf '[Unit]\nDescription=llama.cpp Server\nAfter=network-online.target\nWants=network-online.target\n\n[Service]\nType=simple\nExecStart=/opt/llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 11434 --threads 4 --model /opt/llama.cpp/models/qwen2.5-1.5b-q4_0.gguf --ctx-size 8192 --batch-size 512\nRestart=always\nRestartSec=3\nUser=root\nGroup=root\nEnvironment=HOME=/root\nStandardOutput=journal\nStandardError=journal\nSyslogIdentifier=llama-cpp\n\n[Install]\nWantedBy=multi-user.target\n' > /etc/systemd/system/llama-cpp.service"

# Enable and start service
pct exec 107 -- systemctl daemon-reload
pct exec 107 -- systemctl enable llama-cpp.service
pct exec 107 -- systemctl start llama-cpp.service

Testing and Verification

7. Verify Service Status

# Check service status
pct exec 107 -- systemctl status llama-cpp.service

# Check port binding
pct exec 107 -- ss -tlnp | grep :11434

8. Test API

# Test OpenAI-compatible API
pct exec 107 -- curl -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5-1.5b-q4_0","messages":[{"role":"user","content":"Hello, how are you?"}],"max_tokens":50}'

Expected response:

{
  "choices": [{
    "finish_reason": "stop",
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! I'm doing well, thank you. How can I assist you today?"
    }
  }],
  "created": 1760551378,
  "model": "qwen2.5-1.5b-q4_0",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 18,
    "prompt_tokens": 14,
    "total_tokens": 32
  }
}
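
Besides chat completions, llama-server also exposes a health probe and a model listing, both convenient for scripted checks (endpoint paths and responses as of current llama.cpp builds):

# Liveness check - returns {"status":"ok"} once the model is loaded
pct exec 107 -- curl -s http://localhost:11434/health

# List the model(s) being served, in OpenAI /v1/models format
pct exec 107 -- curl -s http://localhost:11434/v1/models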

Container Specifications

Component      Details
Container ID   107
Hostname       llama-cpp
Memory         8GB RAM + 512MB swap
CPU Cores      4
Storage        32GB
GPU            Intel N150 (OpenCL)
Network        Bridge vmbr0 (DHCP)
OS             Debian 12

Service Settings

Setting        Value
Service Name   llama-cpp.service
Port           11434 (Ollama-compatible port)
Host           0.0.0.0 (external access)
Model          Qwen2.5-1.5B-Instruct (Q4_0)
Context Size   8192 tokens
Batch Size     512
Threads        4
Auto-start     Enabled

Performance Metrics

  • Memory Usage: ~814MB for 1.5B parameter model
  • Response Time: ~1.7 seconds for short responses
  • Model Size: ~1GB (Q4_0 quantized)
  • API Compatibility: Full OpenAI v1/chat/completions support
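
These numbers will vary with hardware and prompt length; a rough way to reproduce the response-time figure is to time the same chat completion call used above:

# Rough end-to-end latency check (includes HTTP overhead, not just inference).
# Run inside the container via `pct enter 107`, or replace localhost with the container IP.
time curl -s -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5-1.5b-q4_0","messages":[{"role":"user","content":"Hello"}],"max_tokens":50}' \
  > /dev/null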

Management Commands

# Service management
pct exec 107 -- systemctl start llama-cpp.service
pct exec 107 -- systemctl stop llama-cpp.service
pct exec 107 -- systemctl restart llama-cpp.service
pct exec 107 -- systemctl status llama-cpp.service

# View logs
pct exec 107 -- journalctl -u llama-cpp.service -f

# Container management
pct start 107
pct stop 107
pct enter 107

Integration with OpenWebUI

To use with OpenWebUI, add an OpenAI-compatible API connection pointing at:

http://CONTAINER-IP:11434/v1

Because llama-server speaks OpenAI's chat completions format natively, no translation layer is needed. Just make sure it is configured as an OpenAI-type connection rather than an Ollama connection, since this server exposes the OpenAI-style endpoints, not Ollama's native API.
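
A quick way to confirm the endpoint is reachable from outside the container (for example, from the machine running OpenWebUI):

# Replace CONTAINER-IP with the address assigned by DHCP (see `pct exec 107 -- ip -4 addr`)
curl http://CONTAINER-IP:11434/v1/models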

Troubleshooting

OpenCL Issues

If you see “OpenCL platform IDs not available”:

  • Check that the Intel GPU drivers are installed on the Proxmox host
  • Verify the container has access to the /dev/dri/ devices (see the checks below)
  • llama.cpp may fall back to CPU inference, which is still functional
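
Two quick checks usually narrow this down: whether the render node is visible inside the container, and whether the Intel ICD file is registered with the OpenCL loader:

# The render node must be visible inside the container
pct exec 107 -- ls -l /dev/dri

# intel-opencl-icd should have registered an ICD file here
pct exec 107 -- ls /etc/OpenCL/vendors/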

Service Issues

# Check service logs
pct exec 107 -- journalctl -u llama-cpp.service -n 50

# Test binary directly
pct exec 107 -- /opt/llama.cpp/build/bin/llama-server --help

Memory Issues

  • Reduce the context size: --ctx-size 1024 (see the example below for changing flags in the unit file)
  • Reduce the batch size: --batch-size 256
  • Use a smaller model or a different quantization
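
For example, assuming the unit file from step 6, the flags can be changed in place and the service restarted (a simple sketch; a systemd drop-in would work just as well):

# Lower the context size from 8192 to 4096 tokens in the ExecStart line
pct exec 107 -- sed -i 's/--ctx-size 8192/--ctx-size 4096/' /etc/systemd/system/llama-cpp.service
pct exec 107 -- systemctl daemon-reload
pct exec 107 -- systemctl restart llama-cpp.service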

Future Enhancements

  1. GPU Optimization: Configure Intel GPU drivers for better OpenCL performance
  2. Model Management: Create scripts to switch between different models
  3. Load Balancing: Set up multiple instances for high availability
  4. Monitoring: Add Prometheus metrics collection
  5. Backup: Implement model and configuration backup strategy

Benefits over Ollama

  • Lower Memory Usage: ~40% less RAM consumption
  • Better Performance: Faster inference times
  • More Control: Fine-grained configuration options
  • OpenCL Support: Better GPU utilization on Intel hardware
  • Lightweight: Minimal overhead and dependencies