This guide covers setting up Ollama (an open-source runtime for running large language models locally) in a Proxmox LXC container with GPU passthrough, and creating a simple web interface for easy interaction.
Overview
We’ll deploy Ollama in a resource-constrained LXC environment with:
- Intel UHD Graphics GPU acceleration
- llama3.2:1b model (~1.3GB)
- Lightweight Python web interface
- Auto-starting services
Prerequisites
- Proxmox VE host
- Intel integrated graphics (UHD Graphics)
- At least 4GB RAM allocated to LXC
- 40GB+ storage for container
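Before creating the container, you can sanity-check the host against this list; a quick look from the Proxmox host shell:
# Confirm available RAM, storage pools, and the integrated GPU on the host
free -h
pvesm status
lspci | grep -i vga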
Step 1: Container Setup
Check Available GPU Resources
First, verify GPU availability on the Proxmox host:
# Check for Intel graphics
lspci | grep -i vga
# Should show: Intel Corporation Alder Lake-N [UHD Graphics]
# Verify DRI devices
ls /dev/dri/
# Should show: card0 renderD128
LXC Container Configuration
Create or modify your LXC container configuration to include GPU passthrough:
# Edit container configuration
pct set <VM id> -features nesting=1
# Add GPU device passthrough to container config
cat >> /etc/pve/lxc/<VM id>.conf << 'EOF'
# GPU Passthrough for Intel Graphics
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.mount.entry: /dev/dri/card0 dev/dri/card0 none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
EOF
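The major:minor numbers in the cgroup rules above (226:0 for card0, 226:128 for renderD128) can be confirmed on the host, and a running container needs a restart to pick up the new entries:
# On the Proxmox host: the device numbers shown here must match the allow rules
ls -l /dev/dri/
# crw-rw---- 1 root video  226,   0 ... card0
# crw-rw---- 1 root render 226, 128 ... renderD128
# If the container is already running, restart it to apply the new config
pct restart <VM id>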
Step 2: Install Ollama
Using Community Script
The easiest method is using the Proxmox VE Helper Scripts:
# Download and run Ollama LXC script
bash -c "$(wget -qLO - https://github.com/community-scripts/ProxmoxVE/raw/main/ct/ollama.sh)"
Manual Installation (Alternative)
If installing manually in an existing container:
# Enter the container shell
pct exec <VM id> -- bash
# Download and install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Enable and start the Ollama service (the install script creates the systemd unit)
systemctl enable ollama
systemctl start ollama
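Whichever install route you use, a quick check confirms the binary is on the PATH and the daemon is answering on its default port (11434):
# Inside the container
ollama --version
curl -s http://localhost:11434/api/version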
Step 3: Verify GPU Access
Check if the container can access the GPU:
pct exec <VM id> -- ls -la /dev/dri/
# Should show: card0 and renderD128 with proper permissions
pct exec <VM id> -- lspci | grep -i vga
# Should show: Intel Corporation Alder Lake-N [UHD Graphics]
Step 4: Download and Test Model
Download llama3.2:1b Model
# Download lightweight model (1.3GB)
pct exec <VM id> -- /usr/local/bin/ollama pull llama3.2:1b
Test Model
# Test the model
pct exec <VM id> -- /usr/local/bin/ollama run llama3.2:1b "Hello! Please respond with just a short greeting."
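Because the web interface in Step 5 talks to Ollama over HTTP, it is also worth exercising the /api/generate endpoint directly:
# Same prompt through the HTTP API instead of the CLI
pct exec <VM id> -- curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:1b", "prompt": "Say hello in five words.", "stream": false}'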
Verify GPU Usage
Check Ollama logs to confirm GPU detection:
pct exec <VM id> -- systemctl status ollama
Look for log entries like:
level=INFO source=types.go:130 msg="inference compute" name="Intel(R) UHD Graphics"
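One way to pull just that line out of the journal (the exact wording can vary between Ollama versions):
pct exec <VM id> -- journalctl -u ollama --no-pager | grep -i "inference compute"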
Step 5: Create Web Interface
Create Web Interface Directory
pct exec <VM id> -- mkdir -p /opt/ollama-web
Create Simple Python Web Server
pct exec <VM id> -- bash -c 'cat > /opt/ollama-web/simple_web.py << "EOF"
#!/usr/bin/env python3
import http.server
import socketserver
import urllib.request
import urllib.parse
import json

PORT = 8080

class OllamaHandler(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/" or self.path == "/index.html":
            self.serve_chat_page()
        else:
            self.send_error(404)

    def do_POST(self):
        if self.path == "/chat":
            self.handle_chat()
        else:
            self.send_error(404)

    def serve_chat_page(self):
        html = """<!DOCTYPE html>
<html>
<head>
<title>Ollama Chat</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<style>
body {
font-family: Arial, sans-serif;
margin: 0;
padding: 20px;
background-color: #f5f5f5;
}
.container {
max-width: 800px;
margin: 0 auto;
background: white;
border-radius: 10px;
box-shadow: 0 2px 10px rgba(0,0,0,0.1);
}
.header {
background: #2563eb;
color: white;
padding: 20px;
border-radius: 10px 10px 0 0;
text-align: center;
}
.chat-container {
height: 400px;
overflow-y: auto;
padding: 20px;
border-bottom: 1px solid #eee;
}
.message {
margin-bottom: 15px;
padding: 10px;
border-radius: 8px;
}
.user-message {
background: #e3f2fd;
text-align: right;
margin-left: 50px;
}
.ai-message {
background: #f5f5f5;
margin-right: 50px;
}
.input-container {
padding: 20px;
display: flex;
gap: 10px;
}
#messageInput {
flex: 1;
padding: 10px;
border: 1px solid #ddd;
border-radius: 5px;
font-size: 16px;
}
#sendButton {
padding: 10px 20px;
background: #2563eb;
color: white;
border: none;
border-radius: 5px;
cursor: pointer;
}
#sendButton:hover {
background: #1d4ed8;
}
#sendButton:disabled {
background: #ccc;
cursor: not-allowed;
}
.loading {
font-style: italic;
color: #666;
}
.status {
padding: 10px;
text-align: center;
font-size: 12px;
color: #666;
}
</style>
</head>
<body>
<div class="container">
<div class="header">
<h1>🦙 Ollama Chat</h1>
<p>Model: llama3.2:1b | Status: <span id="status">Ready</span></p>
</div>
<div class="chat-container" id="chatContainer">
<div class="message ai-message">
<strong>Assistant:</strong> Hello! I am your Ollama assistant running llama3.2:1b. How can I help you today?
</div>
</div>
<div class="input-container">
<input type="text" id="messageInput" placeholder="Type your message here..." onkeypress="handleKeyPress(event)">
<button id="sendButton" onclick="sendMessage()">Send</button>
</div>
<div class="status">
GPU: Intel UHD Graphics | RAM: Available
</div>
</div>
<script>
async function sendMessage() {
const input = document.getElementById("messageInput");
const chatContainer = document.getElementById("chatContainer");
const sendButton = document.getElementById("sendButton");
const status = document.getElementById("status");
const message = input.value.trim();
if (!message) return;
// Add user message
const userDiv = document.createElement("div");
userDiv.className = "message user-message";
userDiv.innerHTML = "<strong>You:</strong> " + message;
chatContainer.appendChild(userDiv);
// Add loading message
const loadingDiv = document.createElement("div");
loadingDiv.className = "message ai-message loading";
loadingDiv.innerHTML = "<strong>Assistant:</strong> Thinking...";
chatContainer.appendChild(loadingDiv);
input.value = "";
sendButton.disabled = true;
status.textContent = "Generating...";
chatContainer.scrollTop = chatContainer.scrollHeight;
try {
const response = await fetch("/chat", {
method: "POST",
headers: {
"Content-Type": "application/x-www-form-urlencoded",
},
body: "message=" + encodeURIComponent(message)
});
const data = await response.json();
// Remove loading message
chatContainer.removeChild(loadingDiv);
// Add AI response
const aiDiv = document.createElement("div");
aiDiv.className = "message ai-message";
aiDiv.innerHTML = "<strong>Assistant:</strong> " + data.response;
chatContainer.appendChild(aiDiv);
} catch (error) {
// Remove loading message
chatContainer.removeChild(loadingDiv);
// Add error message
const errorDiv = document.createElement("div");
errorDiv.className = "message ai-message";
errorDiv.innerHTML = "<strong>Error:</strong> Failed to get response. Is Ollama running?";
chatContainer.appendChild(errorDiv);
}
sendButton.disabled = false;
status.textContent = "Ready";
chatContainer.scrollTop = chatContainer.scrollHeight;
}
function handleKeyPress(event) {
if (event.key === "Enter") {
sendMessage();
}
}
// Focus input on load
document.getElementById("messageInput").focus();
</script>
</body>
</html>"""
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        self.wfile.write(html.encode())

    def handle_chat(self):
        content_length = int(self.headers["Content-Length"])
        post_data = self.rfile.read(content_length).decode("utf-8")
        parsed_data = urllib.parse.parse_qs(post_data)
        message = parsed_data.get("message", [""])[0]
        try:
            # Create request to Ollama
            data = json.dumps({
                "model": "llama3.2:1b",
                "prompt": message,
                "stream": False
            }).encode()
            req = urllib.request.Request(
                "http://localhost:11434/api/generate",
                data=data,
                headers={"Content-Type": "application/json"}
            )
            with urllib.request.urlopen(req) as response:
                result = json.loads(response.read().decode())
                ai_response = result.get("response", "No response received")
            # Send JSON response
            response_data = json.dumps({"response": ai_response})
            self.send_response(200)
            self.send_header("Content-type", "application/json")
            self.end_headers()
            self.wfile.write(response_data.encode())
        except Exception as e:
            error_response = json.dumps({"response": f"Error: {str(e)}"})
            self.send_response(500)
            self.send_header("Content-type", "application/json")
            self.end_headers()
            self.wfile.write(error_response.encode())

if __name__ == "__main__":
    with socketserver.TCPServer(("", PORT), OllamaHandler) as httpd:
        print(f"Server running at http://0.0.0.0:{PORT}")
        httpd.serve_forever()
EOF'
Make Script Executable
pct exec <VM id> -- chmod +x /opt/ollama-web/simple_web.py
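Before wiring the script into systemd, a quick syntax check catches copy/paste issues such as lost indentation:
# Byte-compile the script; no output means it parses cleanly
pct exec <VM id> -- python3 -m py_compile /opt/ollama-web/simple_web.py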
Step 6: Create Systemd Service
Create Web Interface Service
pct exec <VM id> -- bash -c 'cat > /etc/systemd/system/ollama-web.service << "EOF"
[Unit]
Description=Ollama Web Interface
After=ollama.service
Requires=ollama.service
[Service]
Type=simple
User=root
WorkingDirectory=/opt/ollama-web
ExecStart=/usr/bin/python3 /opt/ollama-web/simple_web.py
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF'
Enable and Start Service
pct exec <VM id> -- systemctl daemon-reload
pct exec <VM id> -- systemctl enable ollama-web
pct exec <VM id> -- systemctl start ollama-web
Step 7: Verify Setup
Check Service Status
pct exec <VM id> -- systemctl status ollama ollama-web --no-pager
Test Web Interface
# Test if the web interface responds (replace CONTAINER_IP with the container's IP address)
curl -s -o /dev/null -w "%{http_code}" http://CONTAINER_IP:8080
# Should return: 200
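The /chat endpoint can be exercised from the command line too; it expects a form-encoded message field, exactly as the page's JavaScript sends it:
# Send a test message (replace CONTAINER_IP with the container's IP)
curl -s http://CONTAINER_IP:8080/chat --data-urlencode "message=Hello"
# Should return JSON like: {"response": "..."}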
Access Points
Web Interface
- URL: http://CONTAINER_IP:8080
- Features: ChatGPT-like interface, real-time chat, auto-scroll
Ollama API
- URL: http://CONTAINER_IP:11434
- Endpoint: /api/generate for direct API calls (see the example below)
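For example, listing the models this instance knows about (the /api/generate call from Step 4 works the same way against this address):
curl -s http://CONTAINER_IP:11434/api/tags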
Command Line
# Direct container access
pct exec <VM id> -- /usr/local/bin/ollama run llama3.2:1b "Your question here"
# SSH access
ssh root@CONTAINER_IP
ollama run llama3.2:1b
Resource Optimization
Remove Unnecessary Packages
After installation, clean up development packages to save space:
# Remove build tools and dev packages
pct exec <VM id> -- apt autoremove -y gcc g++ make patch dpkg-dev cpp manpages-dev
# Clean package cache
pct exec <VM id> -- apt autoclean
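To see what the cleanup bought you, check disk usage and the size of the model store; the path below assumes the default location used by the ollama system user that the installer creates:
pct exec <VM id> -- df -h /
pct exec <VM id> -- du -sh /usr/share/ollama/.ollama/models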
Resource Usage
Final Footprint
- Disk Space: ~9.4GB total (after cleanup)
- RAM Usage:
  - Ollama service: ~2.8GB (with model loaded)
  - Web interface: ~11MB
  - System overhead: ~1.7GB total
- Available RAM: 2.3GB free
Model Storage
- llama3.2:1b: 1.3GB
- Available space: 28GB for additional models
Troubleshooting
GPU Not Detected
If Ollama doesn’t detect the GPU:
- Verify the container config includes the GPU passthrough entries (see the check below)
- Check device permissions inside the container: ls -la /dev/dri/
- Restart the container: pct restart <VM id>
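To confirm the passthrough entries from Step 1 are actually in place, grep the container config on the host:
# On the Proxmox host
grep -E "dri|cgroup2" /etc/pve/lxc/<VM id>.conf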
Web Interface Not Loading
- Check service status: systemctl status ollama-web
- Verify port 8080 is reachable and not blocked by a firewall (see the check below)
- Check logs: journalctl -u ollama-web -f
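To confirm the Python server is actually listening on port 8080 inside the container:
pct exec <VM id> -- ss -tlnp | grep 8080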
Model Loading Issues
- Verify sufficient RAM is available (see the check below)
- Check Ollama logs: journalctl -u ollama -f
- Try a smaller model if memory is constrained
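Free memory and whatever model Ollama currently has loaded can be checked like this (ollama ps is available in recent Ollama releases):
pct exec <VM id> -- free -h
pct exec <VM id> -- ollama ps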
Adding More Models
Popular Small Models
# Ultra-light model (500MB)
pct exec <VM id> -- ollama pull qwen2.5:0.5b
# Code-focused model (~3.8GB)
pct exec <VM id> -- ollama pull codellama:7b-code
# Conversational model (2.2GB)
pct exec <VM id> -- ollama pull phi3.5:latest
Switch Models in Web Interface
To use different models, edit the web interface script and change:
"model": "llama3.2:1b", # Change to desired model
Performance Notes
- Intel UHD Graphics: Provides hardware acceleration for inference
- Response Time: 2-4 seconds for simple queries
- Concurrent Users: 2-3 simultaneous users on 4GB RAM
- Model Loading: ~2-3 seconds cold start
Security Considerations
- Web interface runs without authentication (add auth for production)
- Container runs as root (consider user namespacing)
- No HTTPS (add reverse proxy for external access)
- Firewall rules recommended for external exposure
Next Steps
Production Enhancements
- Add authentication to web interface
- Set up reverse proxy with SSL
- Implement user session management
- Add model switching capability
- Monitor resource usage
Alternative Web Interfaces
- Open WebUI: Full-featured ChatGPT-like interface
- AnythingLLM: Document chat capabilities
- Chatbot Ollama: Streamlit-based interface
Conclusion
This setup provides a fully functional, resource-efficient local AI chat system with:
- Hardware-accelerated inference
- Clean web interface
- Auto-starting services
- Optimized resource usage
Perfect for homelab experimentation, private AI assistance, and learning about LLM deployment.
Last updated: 2025-08-11