Setting up llama.cpp in LXC Container on Proxmox
This guide documents the complete process of setting up llama.cpp in an LXC container on Proxmox with Intel GPU support and OpenAI-compatible API endpoints.
Overview
- Goal: Replace Ollama with llama.cpp for better performance and lower resource usage
- Hardware: Intel N150 integrated GPU (OpenCL support)
- Container: Debian 12 LXC on Proxmox
- API: OpenAI-compatible endpoints on port 11434
Container Creation
1. Create LXC Container
# Download Debian 12 template
pveam download local debian-12-standard_12.12-1_amd64.tar.zst
# Create container
pct create 107 local:vztmpl/debian-12-standard_12.12-1_amd64.tar.zst \
--hostname llama-cpp \
--memory 8192 \
--swap 512 \
--cores 4 \
--rootfs local-lvm:32 \
--net0 name=eth0,bridge=vmbr0,ip=dhcp \
--features keyctl=1,nesting=1 \
--unprivileged 1 \
--onboot 1 \
--tags ai
2. Add GPU Passthrough
# Get GPU group IDs
stat -c '%g' /dev/dri/card0 # Output: 44
stat -c '%g' /dev/dri/renderD128 # Output: 104
# Add GPU devices to container
pct set 107 --dev0 /dev/dri/card0,gid=44 --dev1 /dev/dri/renderD128,gid=104
# Start container
pct start 107
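Before installing anything, confirm that the devices actually show up inside the container; if this listing comes back empty, recheck the gid values above.
# Verify the DRI devices are visible inside the container
pct exec 107 -- ls -l /dev/dri/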
Software Installation
3. Install Dependencies
# Update package list
pct exec 107 -- apt update
# Install build tools and dependencies
pct exec 107 -- apt install -y \
build-essential \
cmake \
git \
curl \
pkg-config \
libssl-dev \
python3 \
python3-pip \
libcurl4-openssl-dev
# Install OpenCL support for Intel GPU
pct exec 107 -- apt install -y \
opencl-headers \
ocl-icd-opencl-dev \
intel-opencl-icd
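As an optional sanity check, the clinfo utility (not installed by the commands above) shows whether the Intel OpenCL platform is visible from inside the container:
# Optional: install clinfo and list the detected OpenCL platforms and devices
pct exec 107 -- apt install -y clinfo
pct exec 107 -- clinfo -l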
4. Compile llama.cpp
# Clone repository
pct exec 107 -- bash -c "cd /opt && git clone https://github.com/ggerganov/llama.cpp.git"
# Configure with OpenCL support
pct exec 107 -- bash -c "cd /opt/llama.cpp && mkdir build && cd build && cmake .. -DGGML_OPENCL=ON -DCMAKE_BUILD_TYPE=Release"
# Compile server binary (escape $(nproc) so it expands inside the container, which is limited to 4 cores)
pct exec 107 -- bash -c "cd /opt/llama.cpp/build && make -j\$(nproc) llama-server"
Model Setup
5. Download Models
# Create models directory
pct exec 107 -- mkdir -p /opt/llama.cpp/models
# Download Qwen2.5-1.5B model (Q4_0 quantized)
pct exec 107 -- bash -c "cd /opt/llama.cpp/models && curl -L -o qwen2.5-1.5b-q4_0.gguf https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_0.gguf"
Service Configuration
6. Create systemd Service
# Create service file
pct exec 107 -- bash -c "printf '[Unit]\nDescription=llama.cpp Server\nAfter=network-online.target\nWants=network-online.target\n\n[Service]\nType=simple\nExecStart=/opt/llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 11434 --threads 4 --model /opt/llama.cpp/models/qwen2.5-1.5b-q4_0.gguf --ctx-size 8192 --batch-size 512\nRestart=always\nRestartSec=3\nUser=root\nGroup=root\nEnvironment=HOME=/root\nStandardOutput=journal\nStandardError=journal\nSyslogIdentifier=llama-cpp\n\n[Install]\nWantedBy=multi-user.target\n' > /etc/systemd/system/llama-cpp.service"
# Enable and start service
pct exec 107 -- systemctl daemon-reload
pct exec 107 -- systemctl enable llama-cpp.service
pct exec 107 -- systemctl start llama-cpp.service
Testing and Verification
7. Verify Service Status
# Check service status
pct exec 107 -- systemctl status llama-cpp.service
# Check port binding
pct exec 107 -- ss -tlnp | grep :11434
8. Test API
# Test OpenAI-compatible API
pct exec 107 -- curl -X POST http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5-1.5b-q4_0","messages":[{"role":"user","content":"Hello, how are you?"}],"max_tokens":50}'
Expected response:
{
"choices": [{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! I'm doing well, thank you. How can I assist you today?"
}
}],
"created": 1760551378,
"model": "qwen2.5-1.5b-q4_0",
"object": "chat.completion",
"usage": {
"completion_tokens": 18,
"prompt_tokens": 14,
"total_tokens": 32
}
}
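Beyond chat completions, llama-server also exposes a health probe and a model listing, which make handy smoke tests:
# Health probe (returns an error until the model has finished loading)
pct exec 107 -- curl http://localhost:11434/health
# List the loaded model in OpenAI format
pct exec 107 -- curl http://localhost:11434/v1/models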
Container Specifications
| Component | Details |
|---|---|
| Container ID | 107 |
| Hostname | llama-cpp |
| Memory | 8 GB RAM + 512 MB swap |
| CPU Cores | 4 |
| Storage | 32 GB |
| GPU | Intel N150 integrated graphics (OpenCL) |
| Network | Bridge vmbr0 (DHCP) |
| OS | Debian 12 |
Service Configuration
| Setting | Value |
|---|---|
| Service Name | llama-cpp.service |
| Port | 11434 (Ollama's default port) |
| Host | 0.0.0.0 (external access) |
| Model | Qwen2.5-1.5B-Instruct (Q4_0) |
| Context Size | 8192 tokens |
| Batch Size | 512 |
| Threads | 4 |
| Auto-start | Enabled |
Performance Metrics
- Memory Usage: ~814MB for 1.5B parameter model
- Response Time: ~1.7 seconds for short responses
- Model Size: ~1GB (Q4_0 quantized)
- API Compatibility: Full OpenAI v1/chat/completions support
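To reproduce rough throughput numbers like the ones above, llama.cpp ships a llama-bench tool; it is not built by the steps in this guide, so this sketch assumes you compile that target first:
# Build the benchmark tool (again escaping $(nproc) so it expands inside the container)
pct exec 107 -- bash -c "cd /opt/llama.cpp/build && make -j\$(nproc) llama-bench"
# Measure prompt processing and generation speed with 4 threads
pct exec 107 -- /opt/llama.cpp/build/bin/llama-bench -m /opt/llama.cpp/models/qwen2.5-1.5b-q4_0.gguf -t 4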
Management Commands
# Service management
pct exec 107 -- systemctl start llama-cpp.service
pct exec 107 -- systemctl stop llama-cpp.service
pct exec 107 -- systemctl restart llama-cpp.service
pct exec 107 -- systemctl status llama-cpp.service
# View logs
pct exec 107 -- journalctl -u llama-cpp.service -f
# Container management
pct start 107
pct stop 107
pct enter 107
Integration with OpenWebUI
To use this server with OpenWebUI, point OpenWebUI at:
http://CONTAINER-IP:11434
The API is fully compatible with OpenAI’s chat completions format, so no additional configuration is needed.
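To confirm the endpoint is reachable from outside the container before wiring up OpenWebUI, query the model list from another machine on the network (replace CONTAINER-IP with the container's DHCP address):
# Should return a JSON list containing the loaded model
curl http://CONTAINER-IP:11434/v1/models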
Troubleshooting
OpenCL Issues
If you see “OpenCL platform IDs not available”:
- Check the Intel GPU drivers on the host
- Verify the container has access to the /dev/dri/ devices
- The server may fall back to CPU inference (still functional); the checks below help confirm whether the GPU path is usable
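A couple of checks that help narrow this down, assuming clinfo was installed in the optional step earlier; lspci runs on the Proxmox host itself:
# On the Proxmox host: confirm which kernel driver is bound to the iGPU (typically i915)
lspci -k | grep -i -A 3 vga
# Inside the container: confirm the Intel ICD is registered and a platform is detected
pct exec 107 -- ls /etc/OpenCL/vendors/
pct exec 107 -- clinfo -l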
Service Issues
# Check service logs
pct exec 107 -- journalctl -u llama-cpp.service -n 50
# Test binary directly
pct exec 107 -- /opt/llama.cpp/build/bin/llama-server --help
Memory Issues
- Reduce the context size: --ctx-size 1024 (see the drop-in sketch after this list)
- Reduce the batch size: --batch-size 256
- Use a smaller model or a different quantization
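A minimal sketch of such a drop-in, reusing the paths and flags from the service above (adjust the values to taste):
# Create a systemd override that lowers the context and batch size without touching the original unit
# (the empty ExecStart= line clears the original command before setting the new one)
pct exec 107 -- bash -c "mkdir -p /etc/systemd/system/llama-cpp.service.d && printf '[Service]\nExecStart=\nExecStart=/opt/llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 11434 --threads 4 --model /opt/llama.cpp/models/qwen2.5-1.5b-q4_0.gguf --ctx-size 1024 --batch-size 256\n' > /etc/systemd/system/llama-cpp.service.d/override.conf"
pct exec 107 -- systemctl daemon-reload
pct exec 107 -- systemctl restart llama-cpp.service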
Future Enhancements
- GPU Optimization: Configure Intel GPU drivers for better OpenCL performance
- Model Management: Create scripts to switch between different models
- Load Balancing: Set up multiple instances for high availability
- Monitoring: Add Prometheus metrics collection (see the note after this list)
- Backup: Implement model and configuration backup strategy
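On the monitoring item: recent llama-server builds can already expose Prometheus-compatible metrics when started with the --metrics flag, so a first step could be as simple as adding that flag to the ExecStart line and scraping the endpoint:
# After adding --metrics to the service's ExecStart and restarting, scrape:
curl http://CONTAINER-IP:11434/metrics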
Benefits over Ollama
- Lower Memory Usage: ~40% less RAM consumption
- Better Performance: Faster inference times
- More Control: Fine-grained configuration options
- OpenCL Support: Better GPU utilization on Intel hardware
- Lightweight: Minimal overhead and dependencies