# Setting up llama.cpp in an LXC Container on Proxmox
This guide documents the complete process of setting up llama.cpp in an LXC container on Proxmox with Intel GPU support and OpenAI-compatible API endpoints.

## Overview

- Goal: Replace Ollama with llama.cpp for better performance and lower resource usage
- Hardware: Intel N150 GPU (OpenCL support)
- Container: Debian 12 LXC on Proxmox
- API: OpenAI-compatible endpoints on port 11434

## Container Creation

### 1. Create LXC Container

```bash
# Download Debian 12 template
pveam download local debian-12-standard_12.12-1_amd64.tar.zst

# Create container
pct create 107 local:vztmpl/debian-12-standard_12.12-1_amd64.tar.zst \
  --hostname llama-cpp \
  --memory 8192 \
  --swap 512 \
  --cores 4 \
  --rootfs local-lvm:32 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --features keyctl=1,nesting=1 \
  --unprivileged 1 \
  --onboot 1 \
  --tags ai
```

### 2. Add GPU Passthrough

```bash
# Get GPU group IDs
stat -c '%g' /dev/dri/card0        # Output: 44
stat -c '%g' /dev/dri/renderD128   # Output: 104

# Add GPU devices to container
pct set 107 --dev0 /dev/dri/card0,gid=44 --dev1 /dev/dri/renderD128,gid=104

# Start container
pct start 107
```

## Software Installation

### 3. Install Dependencies

```bash
# Update package list
pct exec 107 -- apt update

# Install build tools and dependencies
pct exec 107 -- apt install -y \
  build-essential \
  cmake \
  git \
  curl \
  pkg-config \
  libssl-dev \
  python3 \
  python3-pip \
  libcurl4-openssl-dev

# Install OpenCL support for Intel GPU
pct exec 107 -- apt install -y \
  opencl-headers \
  ocl-icd-opencl-dev \
  intel-opencl-icd
```

### 4. Compile llama.cpp

```bash
# Clone repository
pct exec 107 -- bash -c "cd /opt && git clone https://github.com/ggerganov/llama.cpp.git"

# Configure with OpenCL support
pct exec 107 -- bash -c "cd /opt/llama.cpp && mkdir build && cd build && cmake .. -DGGML_OPENCL=ON -DCMAKE_BUILD_TYPE=Release"

# Compile server binary
pct exec 107 -- bash -c "cd /opt/llama.cpp/build && make -j$(nproc) llama-server"
```

## Model Setup

### 5. Download Models

```bash
# Create models directory
pct exec 107 -- mkdir -p /opt/llama.cpp/models

# Download Qwen2.5-1.5B model (Q4_0 quantized)
pct exec 107 -- bash -c "cd /opt/llama.cpp/models && curl -L -o qwen2.5-1.5b-q4_0.gguf https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_0.gguf"
```

## Service Configuration

### 6. Create systemd Service

```bash
# Create service file
pct exec 107 -- bash -c "printf '[Unit]\nDescription=llama.cpp Server\nAfter=network-online.target\nWants=network-online.target\n\n[Service]\nType=simple\nExecStart=/opt/llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 11434 --threads 4 --model /opt/llama.cpp/models/qwen2.5-1.5b-q4_0.gguf --ctx-size 8192 --batch-size 512\nRestart=always\nRestartSec=3\nUser=root\nGroup=root\nEnvironment=HOME=/root\nStandardOutput=journal\nStandardError=journal\nSyslogIdentifier=llama-cpp\n\n[Install]\nWantedBy=multi-user.target\n' > /etc/systemd/system/llama-cpp.service"

# Enable and start service
pct exec 107 -- systemctl daemon-reload
pct exec 107 -- systemctl enable llama-cpp.service
pct exec 107 -- systemctl start llama-cpp.service
```

This writes the following unit file to `/etc/systemd/system/llama-cpp.service` inside the container:

```ini
[Unit]
Description=llama.cpp Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/opt/llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 11434 --threads 4 --model /opt/llama.cpp/models/qwen2.5-1.5b-q4_0.gguf --ctx-size 8192 --batch-size 512
Restart=always
RestartSec=3
User=root
Group=root
Environment=HOME=/root
StandardOutput=journal
StandardError=journal
SyslogIdentifier=llama-cpp

[Install]
WantedBy=multi-user.target
```

## Testing and Verification

### 7. Verify Service Status

```bash
# Check service status
pct exec 107 -- systemctl status llama-cpp.service

# Check port binding
pct exec 107 -- ss -tlnp | grep :11434
```

### 8. Test API

```bash
# Test OpenAI-compatible API
pct exec 107 -- curl -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5-1.5b-q4_0","messages":[{"role":"user","content":"Hello, how are you?"}],"max_tokens":50}'
```

Expected response: ...
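Since the service binds to `0.0.0.0`, it is worth repeating the check over the network rather than only from inside the container. Below is a minimal sketch run from the Proxmox host; it assumes the container received a DHCP lease and that your llama.cpp build exposes the usual `/health` and `/v1/models` endpoints (recent `llama-server` versions do):

```bash
# Look up the container's current IP address (DHCP, so it may change)
CT_IP=$(pct exec 107 -- hostname -I | awk '{print $1}')

# Basic health check and OpenAI-compatible model listing
curl -s "http://${CT_IP}:11434/health"
curl -s "http://${CT_IP}:11434/v1/models"

# Same chat completion test as above, but from outside the container
curl -s -X POST "http://${CT_IP}:11434/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5-1.5b-q4_0","messages":[{"role":"user","content":"Say hello in one sentence."}],"max_tokens":32}'
```

If these succeed, any OpenAI-compatible client on the LAN can be pointed at `http://<container-ip>:11434/v1` as its base URL.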
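If performance looks no better than a CPU-only build, it can help to confirm that the Intel OpenCL device is actually visible inside the container and that the server initialized the OpenCL backend. A rough sketch follows; note that `clinfo` is not installed by the dependency step above, and the exact wording of the backend log lines varies between llama.cpp versions, so the grep pattern is only a guess:

```bash
# clinfo is extra: it is not part of the dependencies installed earlier
pct exec 107 -- apt install -y clinfo

# List OpenCL platforms/devices visible inside the container;
# the Intel device should appear here if /dev/dri passthrough works
pct exec 107 -- clinfo -l

# Look for OpenCL/GPU mentions in the server log
# (pattern is approximate; wording differs between llama.cpp versions)
pct exec 107 -- bash -c "journalctl -u llama-cpp.service --no-pager | grep -iE 'opencl|gpu'"
```

Depending on the llama.cpp version and its defaults, you may also need to add `--n-gpu-layers` (e.g. `-ngl 99`) to the `ExecStart` line so that model layers are actually offloaded to the GPU rather than kept on the CPU.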