Setting up llama.cpp in LXC Container on Proxmox
This guide documents the complete process of setting up llama.cpp in an LXC container on Proxmox with Intel GPU support and OpenAI-compatible API endpoints.
Overview
- Goal: Replace Ollama with llama.cpp for better performance and lower resource usage
- Hardware: Intel N150 integrated GPU (OpenCL support)
- Container: Debian 12 LXC on Proxmox
- API: OpenAI-compatible endpoints on port 11434
Container Creation
1. Create LXC Container
# Download Debian 12 template
pveam download local debian-12-standard_12.12-1_amd64.tar.zst
# Create container
pct create 107 local:vztmpl/debian-12-standard_12.12-1_amd64.tar.zst \
--hostname llama-cpp \
--memory 8192 \
--swap 512 \
--cores 4 \
--rootfs local-lvm:32 \
--net0 name=eth0,bridge=vmbr0,ip=dhcp \
--features keyctl=1,nesting=1 \
--unprivileged 1 \
--onboot 1 \
--tags ai
2. Add GPU Passthrough
# Get GPU group IDs
stat -c '%g' /dev/dri/card0 # Output: 44
stat -c '%g' /dev/dri/renderD128 # Output: 104
# Add GPU devices to container
pct set 107 --dev0 /dev/dri/card0,gid=44 --dev1 /dev/dri/renderD128,gid=104
# Start container
pct start 107
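Before installing anything, confirm that the devices actually show up inside the container; if this listing comes back empty, recheck the gid values above.
# Verify the DRI devices are visible inside the container
pct exec 107 -- ls -l /dev/dri/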
Software Installation
3. Install Dependencies
# Update package list
pct exec 107 -- apt update
# Install build tools and dependencies
pct exec 107 -- apt install -y \
build-essential \
cmake \
git \
curl \
pkg-config \
libssl-dev \
python3 \
python3-pip \
libcurl4-openssl-dev
# Install OpenCL support for Intel GPU
pct exec 107 -- apt install -y \
opencl-headers \
ocl-icd-opencl-dev \
intel-opencl-icd
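As an optional sanity check, the clinfo utility (not installed by the commands above) shows whether the Intel OpenCL platform is visible from inside the container:
# Optional: install clinfo and list the detected OpenCL platforms and devices
pct exec 107 -- apt install -y clinfo
pct exec 107 -- clinfo -l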
4. Compile llama.cpp
# Clone repository
pct exec 107 -- bash -c "cd /opt && git clone https://github.com/ggerganov/llama.cpp.git"
# Configure with OpenCL support
pct exec 107 -- bash -c "cd /opt/llama.cpp && mkdir build && cd build && cmake .. -DGGML_OPENCL=ON -DCMAKE_BUILD_TYPE=Release"
# Compile server binary (escape $(nproc) so it expands inside the container, which is limited to 4 cores)
pct exec 107 -- bash -c "cd /opt/llama.cpp/build && make -j\$(nproc) llama-server"
Model Setup
5. Download Models
# Create models directory
pct exec 107 -- mkdir -p /opt/llama.cpp/models
# Download Qwen2.5-1.5B model (Q4_0 quantized)
pct exec 107 -- bash -c "cd /opt/llama.cpp/models && curl -L -o qwen2.5-1.5b-q4_0.gguf https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_0.gguf"
Service Configuration
6. Create systemd Service
# Create service file
pct exec 107 -- bash -c "printf '[Unit]\nDescription=llama.cpp Server\nAfter=network-online.target\nWants=network-online.target\n\n[Service]\nType=simple\nExecStart=/opt/llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 11434 --threads 4 --model /opt/llama.cpp/models/qwen2.5-1.5b-q4_0.gguf --ctx-size 8192 --batch-size 512\nRestart=always\nRestartSec=3\nUser=root\nGroup=root\nEnvironment=HOME=/root\nStandardOutput=journal\nStandardError=journal\nSyslogIdentifier=llama-cpp\n\n[Install]\nWantedBy=multi-user.target\n' > /etc/systemd/system/llama-cpp.service"
# Enable and start service
pct exec 107 -- systemctl daemon-reload
pct exec 107 -- systemctl enable llama-cpp.service
pct exec 107 -- systemctl start llama-cpp.service
Testing and Verification
7. Verify Service Status
# Check service status
pct exec 107 -- systemctl status llama-cpp.service
# Check port binding
pct exec 107 -- ss -tlnp | grep :11434
8. Test API
# Test OpenAI-compatible API
pct exec 107 -- curl -X POST http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5-1.5b-q4_0","messages":[{"role":"user","content":"Hello, how are you?"}],"max_tokens":50}'
Expected response:
{
"choices": [{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! I'm doing well, thank you. How can I assist you today?"
}
}],
"created": 1760551378,
"model": "qwen2.5-1.5b-q4_0",
"object": "chat.completion",
"usage": {
"completion_tokens": 18,
"prompt_tokens": 14,
"total_tokens": 32
}
}
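Beyond chat completions, llama-server also exposes a health probe and a model listing, which make handy smoke tests:
# Health probe (returns an error until the model has finished loading)
pct exec 107 -- curl http://localhost:11434/health
# List the loaded model in OpenAI format
pct exec 107 -- curl http://localhost:11434/v1/models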
Container Specifications
| Component | Details |
|---|---|
| Container ID | 107 |
| Hostname | llama-cpp |
| Memory | 8 GB RAM + 512 MB swap |
| CPU Cores | 4 |
| Storage | 32 GB |
| GPU | Intel N150 integrated graphics (OpenCL) |
| Network | Bridge vmbr0 (DHCP) |
| OS | Debian 12 |
Service Configuration
| Setting | Value |
|---|---|
| Service Name | llama-cpp.service |
| Port | 11434 (Ollama's default port) |
| Host | 0.0.0.0 (external access) |
| Model | Qwen2.5-1.5B-Instruct (Q4_0) |
| Context Size | 8192 tokens |
| Batch Size | 512 |
| Threads | 4 |
| Auto-start | Enabled |
Performance Metrics
- Memory Usage: ~814MB for 1.5B parameter model
- Response Time: ~1.7 seconds for short responses
- Model Size: ~1GB (Q4_0 quantized)
- API Compatibility: Full OpenAI v1/chat/completions support
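To reproduce rough throughput numbers like the ones above, llama.cpp ships a llama-bench tool; it is not built by the steps in this guide, so this sketch assumes you compile that target first:
# Build the benchmark tool (again escaping $(nproc) so it expands inside the container)
pct exec 107 -- bash -c "cd /opt/llama.cpp/build && make -j\$(nproc) llama-bench"
# Measure prompt processing and generation speed with 4 threads
pct exec 107 -- /opt/llama.cpp/build/bin/llama-bench -m /opt/llama.cpp/models/qwen2.5-1.5b-q4_0.gguf -t 4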
Management Commands
# Service management
pct exec 107 -- systemctl start llama-cpp.service
pct exec 107 -- systemctl stop llama-cpp.service
pct exec 107 -- systemctl restart llama-cpp.service
pct exec 107 -- systemctl status llama-cpp.service
# View logs
pct exec 107 -- journalctl -u llama-cpp.service -f
# Container management
pct start 107
pct stop 107
pct enter 107
Integration with OpenWebUI
To use this server with OpenWebUI, point OpenWebUI at:
http://CONTAINER-IP:11434
The API is fully compatible with OpenAI’s chat completions format, so no additional configuration is needed.
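To confirm the endpoint is reachable from outside the container before wiring up OpenWebUI, query the model list from another machine on the network (replace CONTAINER-IP with the container's DHCP address):
# Should return a JSON list containing the loaded model
curl http://CONTAINER-IP:11434/v1/models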
Troubleshooting
OpenCL Issues
If you see “OpenCL platform IDs not available”:
- Check the Intel GPU drivers on the host
- Verify the container has access to the /dev/dri/ devices
- The server may fall back to CPU inference (still functional); the checks below help confirm whether the GPU path is usable
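A couple of checks that help narrow this down, assuming clinfo was installed in the optional step earlier; lspci runs on the Proxmox host itself:
# On the Proxmox host: confirm which kernel driver is bound to the iGPU (typically i915)
lspci -k | grep -i -A 3 vga
# Inside the container: confirm the Intel ICD is registered and a platform is detected
pct exec 107 -- ls /etc/OpenCL/vendors/
pct exec 107 -- clinfo -l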
Service Issues
# Check service logs
pct exec 107 -- journalctl -u llama-cpp.service -n 50
# Test binary directly
pct exec 107 -- /opt/llama.cpp/build/bin/llama-server --help
Memory Issues
- Reduce the context size: --ctx-size 1024 (see the drop-in sketch after this list)
- Reduce the batch size: --batch-size 256
- Use a smaller model or a different quantization
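A minimal sketch of such a drop-in, reusing the paths and flags from the service above (adjust the values to taste):
# Create a systemd override that lowers the context and batch size without touching the original unit
# (the empty ExecStart= line clears the original command before setting the new one)
pct exec 107 -- bash -c "mkdir -p /etc/systemd/system/llama-cpp.service.d && printf '[Service]\nExecStart=\nExecStart=/opt/llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 11434 --threads 4 --model /opt/llama.cpp/models/qwen2.5-1.5b-q4_0.gguf --ctx-size 1024 --batch-size 256\n' > /etc/systemd/system/llama-cpp.service.d/override.conf"
pct exec 107 -- systemctl daemon-reload
pct exec 107 -- systemctl restart llama-cpp.service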
Future Enhancements
- GPU Optimization: Configure Intel GPU drivers for better OpenCL performance
- Model Management: Create scripts to switch between different models
- Load Balancing: Set up multiple instances for high availability
- Monitoring: Add Prometheus metrics collection (see the note after this list)
- Backup: Implement model and configuration backup strategy
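On the monitoring item: recent llama-server builds can already expose Prometheus-compatible metrics when started with the --metrics flag, so a first step could be as simple as adding that flag to the ExecStart line and scraping the endpoint:
# After adding --metrics to the service's ExecStart and restarting, scrape:
curl http://CONTAINER-IP:11434/metrics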
Benefits over Ollama
- Lower Memory Usage: ~40% less RAM consumption
- Better Performance: Faster inference times
- More Control: Fine-grained configuration options
- OpenCL Support: Better GPU utilization on Intel hardware
- Lightweight: Minimal overhead and dependencies