This guide covers setting up Ollama, a local large language model runtime, in a Proxmox LXC container with GPU passthrough, and creating a simple web interface for easy interaction.

Overview

We’ll deploy Ollama in a resource-constrained LXC environment with:

  • Intel UHD Graphics GPU acceleration
  • llama3.2:1b model (~1.3GB)
  • Lightweight Python web interface
  • Auto-starting services

Prerequisites

  • Proxmox VE host
  • Intel integrated graphics (UHD Graphics)
  • At least 4GB RAM allocated to LXC
  • 40GB+ storage for container

Step 1: Container Setup

Check Available GPU Resources

First, verify GPU availability on the Proxmox host:

# Check for Intel graphics
lspci | grep -i vga
# Should show: Intel Corporation Alder Lake-N [UHD Graphics]

# Verify DRI devices
ls /dev/dri/
# Should show: card0  renderD128

LXC Container Configuration

Create or modify your LXC container configuration to include GPU passthrough:

# Edit container configuration
pct set <VM id> -features nesting=1

# Add GPU device passthrough to container config
cat >> /etc/pve/lxc/<VM id>.conf << 'EOF'
# GPU Passthrough for Intel Graphics
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.mount.entry: /dev/dri/card0 dev/dri/card0 none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
EOF
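
Restart the container so the new device entries take effect:

# Apply the passthrough configuration
pct reboot <VM id>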

Step 2: Install Ollama

Using Community Script

The easiest method is using the Proxmox VE Helper Scripts:

# Download and run Ollama LXC script
bash -c "$(wget -qLO - https://github.com/community-scripts/ProxmoxVE/raw/main/ct/ollama.sh)"

Manual Installation (Alternative)

If installing manually in an existing container:

# Open a shell inside the container
pct exec <VM id> -- bash

# Download and install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Make sure the Ollama service is enabled and running
systemctl enable ollama
systemctl start ollama
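
As a quick sanity check, confirm the binary and the local API respond before moving on:

# Verify the installation (run inside the container)
ollama --version
curl -s http://127.0.0.1:11434/
# Should return: Ollama is running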

Step 3: Verify GPU Access

Check if the container can access the GPU:

pct exec <VM id> -- ls -la /dev/dri/
# Should show: card0 and renderD128 with proper permissions

pct exec <VM id> -- lspci | grep -i vga
# Should show: Intel Corporation Alder Lake-N [UHD Graphics]

Step 4: Download and Test Model

Download llama3.2:1b Model

# Download lightweight model (1.3GB)
pct exec <VM id> -- /usr/local/bin/ollama pull llama3.2:1b
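
Confirm the download:

# List local models and their sizes
pct exec <VM id> -- /usr/local/bin/ollama list
# Should show: llama3.2:1b at roughly 1.3GB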

Test Model

# Test the model
pct exec <VM id> -- /usr/local/bin/ollama run llama3.2:1b "Hello! Please respond with just a short greeting."

Verify GPU Usage

Check Ollama logs to confirm GPU detection:

pct exec <VM id> -- systemctl status ollama

Look for log entries like:

level=INFO source=types.go:130 msg="inference compute" name="Intel(R) UHD Graphics"
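
To filter the journal for that entry directly:

# Search the Ollama logs for the GPU detection line
pct exec <VM id> -- bash -c 'journalctl -u ollama --no-pager | grep -i "inference compute"'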

Step 5: Create Web Interface

Create Web Interface Directory

pct exec <VM id> -- mkdir -p /opt/ollama-web

Create Simple Python Web Server

pct exec <VM id> -- bash -c 'cat > /opt/ollama-web/simple_web.py << "EOF"
#!/usr/bin/env python3
import http.server
import socketserver
import urllib.request
import urllib.parse
import json

PORT = 8080

class OllamaHandler(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/" or self.path == "/index.html":
            self.serve_chat_page()
        else:
            self.send_error(404)
    
    def do_POST(self):
        if self.path == "/chat":
            self.handle_chat()
        else:
            self.send_error(404)
    
    def serve_chat_page(self):
        html = """<!DOCTYPE html>
<html>
<head>
    <title>Ollama Chat</title>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <style>
        body { 
            font-family: Arial, sans-serif; 
            margin: 0; 
            padding: 20px; 
            background-color: #f5f5f5; 
        }
        .container { 
            max-width: 800px; 
            margin: 0 auto; 
            background: white; 
            border-radius: 10px; 
            box-shadow: 0 2px 10px rgba(0,0,0,0.1); 
        }
        .header { 
            background: #2563eb; 
            color: white; 
            padding: 20px; 
            border-radius: 10px 10px 0 0; 
            text-align: center; 
        }
        .chat-container { 
            height: 400px; 
            overflow-y: auto; 
            padding: 20px; 
            border-bottom: 1px solid #eee; 
        }
        .message { 
            margin-bottom: 15px; 
            padding: 10px; 
            border-radius: 8px; 
        }
        .user-message { 
            background: #e3f2fd; 
            text-align: right; 
            margin-left: 50px; 
        }
        .ai-message { 
            background: #f5f5f5; 
            margin-right: 50px; 
        }
        .input-container { 
            padding: 20px; 
            display: flex; 
            gap: 10px; 
        }
        #messageInput { 
            flex: 1; 
            padding: 10px; 
            border: 1px solid #ddd; 
            border-radius: 5px; 
            font-size: 16px; 
        }
        #sendButton { 
            padding: 10px 20px; 
            background: #2563eb; 
            color: white; 
            border: none; 
            border-radius: 5px; 
            cursor: pointer; 
        }
        #sendButton:hover { 
            background: #1d4ed8; 
        }
        #sendButton:disabled { 
            background: #ccc; 
            cursor: not-allowed; 
        }
        .loading { 
            font-style: italic; 
            color: #666; 
        }
        .status { 
            padding: 10px; 
            text-align: center; 
            font-size: 12px; 
            color: #666; 
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="header">
            <h1>🦙 Ollama Chat</h1>
            <p>Model: llama3.2:1b | Status: <span id="status">Ready</span></p>
        </div>
        <div class="chat-container" id="chatContainer">
            <div class="message ai-message">
                <strong>Assistant:</strong> Hello! I am your Ollama assistant running llama3.2:1b. How can I help you today?
            </div>
        </div>
        <div class="input-container">
            <input type="text" id="messageInput" placeholder="Type your message here..." onkeypress="handleKeyPress(event)">
            <button id="sendButton" onclick="sendMessage()">Send</button>
        </div>
        <div class="status">
            GPU: Intel UHD Graphics | RAM: Available
        </div>
    </div>

    <script>
        async function sendMessage() {
            const input = document.getElementById("messageInput");
            const chatContainer = document.getElementById("chatContainer");
            const sendButton = document.getElementById("sendButton");
            const status = document.getElementById("status");
            
            const message = input.value.trim();
            if (!message) return;
            
            // Add user message
            const userDiv = document.createElement("div");
            userDiv.className = "message user-message";
            userDiv.innerHTML = "<strong>You:</strong> " + message;
            chatContainer.appendChild(userDiv);
            
            // Add loading message
            const loadingDiv = document.createElement("div");
            loadingDiv.className = "message ai-message loading";
            loadingDiv.innerHTML = "<strong>Assistant:</strong> Thinking...";
            chatContainer.appendChild(loadingDiv);
            
            input.value = "";
            sendButton.disabled = true;
            status.textContent = "Generating...";
            chatContainer.scrollTop = chatContainer.scrollHeight;
            
            try {
                const response = await fetch("/chat", {
                    method: "POST",
                    headers: {
                        "Content-Type": "application/x-www-form-urlencoded",
                    },
                    body: "message=" + encodeURIComponent(message)
                });
                
                const data = await response.json();
                
                // Remove loading message
                chatContainer.removeChild(loadingDiv);
                
                // Add AI response
                const aiDiv = document.createElement("div");
                aiDiv.className = "message ai-message";
                aiDiv.innerHTML = "<strong>Assistant:</strong> " + data.response;
                chatContainer.appendChild(aiDiv);
                
            } catch (error) {
                // Remove loading message
                chatContainer.removeChild(loadingDiv);
                
                // Add error message
                const errorDiv = document.createElement("div");
                errorDiv.className = "message ai-message";
                errorDiv.innerHTML = "<strong>Error:</strong> Failed to get response. Is Ollama running?";
                chatContainer.appendChild(errorDiv);
            }
            
            sendButton.disabled = false;
            status.textContent = "Ready";
            chatContainer.scrollTop = chatContainer.scrollHeight;
        }
        
        function handleKeyPress(event) {
            if (event.key === "Enter") {
                sendMessage();
            }
        }
        
        // Focus input on load
        document.getElementById("messageInput").focus();
    </script>
</body>
</html>"""
        
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        self.wfile.write(html.encode())
    
    def handle_chat(self):
        content_length = int(self.headers["Content-Length"])
        post_data = self.rfile.read(content_length).decode("utf-8")
        parsed_data = urllib.parse.parse_qs(post_data)
        
        message = parsed_data.get("message", [""])[0]
        
        try:
            # Create request to Ollama
            data = json.dumps({
                "model": "llama3.2:1b",
                "prompt": message,
                "stream": False
            }).encode()
            
            req = urllib.request.Request(
                "http://localhost:11434/api/generate",
                data=data,
                headers={"Content-Type": "application/json"}
            )
            
            with urllib.request.urlopen(req) as response:
                result = json.loads(response.read().decode())
                ai_response = result.get("response", "No response received")
            
            # Send JSON response
            response_data = json.dumps({"response": ai_response})
            self.send_response(200)
            self.send_header("Content-type", "application/json")
            self.end_headers()
            self.wfile.write(response_data.encode())
            
        except Exception as e:
            error_response = json.dumps({"response": f"Error: {str(e)}"})
            self.send_response(500)
            self.send_header("Content-type", "application/json")
            self.end_headers()
            self.wfile.write(error_response.encode())

if __name__ == "__main__":
    with socketserver.TCPServer(("", PORT), OllamaHandler) as httpd:
        print(f"Server running at http://0.0.0.0:{PORT}")
        httpd.serve_forever()
EOF'

Make Script Executable

pct exec <VM id> -- chmod +x /opt/ollama-web/simple_web.py
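
Optionally, byte-compile the script to catch copy/paste or quoting mistakes before wiring it into systemd:

# Quick syntax check
pct exec <VM id> -- python3 -m py_compile /opt/ollama-web/simple_web.py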

Step 6: Create Systemd Service

Create Web Interface Service

pct exec <VM id> -- bash -c 'cat > /etc/systemd/system/ollama-web.service << "EOF"
[Unit]
Description=Ollama Web Interface
After=ollama.service
Requires=ollama.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/ollama-web
ExecStart=/usr/bin/python3 /opt/ollama-web/simple_web.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF'

Enable and Start Service

pct exec <VM id> -- systemctl daemon-reload
pct exec <VM id> -- systemctl enable ollama-web
pct exec <VM id> -- systemctl start ollama-web

Step 7: Verify Setup

Check Service Status

pct exec <VM id> -- systemctl status ollama ollama-web --no-pager

Test Web Interface

# Test if web interface responds
curl -s -o /dev/null -w "%{http_code}" http://CONTAINER_IP:8080
# Should return: 200
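
You can also exercise the /chat endpoint the page uses, which proxies the prompt to Ollama:

# Test the chat endpoint end-to-end (form-encoded, exactly as the page sends it)
curl -s -X POST -d "message=Reply with one short sentence." http://CONTAINER_IP:8080/chat
# Should return JSON like: {"response": "..."}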

Access Points

Web Interface

  • URL: http://CONTAINER_IP:8080
  • Features: ChatGPT-like interface, real-time chat, auto-scroll

Ollama API

  • URL: http://CONTAINER_IP:11434
  • Endpoint: /api/generate for direct API calls
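
Depending on how Ollama was installed, it may only listen on 127.0.0.1 inside the container; to reach the API at CONTAINER_IP, set OLLAMA_HOST=0.0.0.0 in the service environment (for example via systemctl edit ollama) and restart it. A direct call to the generate endpoint looks like this:

# Direct API call (run inside the container, or against CONTAINER_IP if exposed)
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'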

Command Line

# Direct container access
pct exec <VM id> -- /usr/local/bin/ollama run llama3.2:1b "Your question here"

# SSH access
ssh root@CONTAINER_IP
ollama run llama3.2:1b

Resource Optimization

Remove Unnecessary Packages

After installation, clean up development packages to save space:

# Remove build tools and dev packages
pct exec <VM id> -- apt autoremove -y gcc g++ make patch dpkg-dev cpp manpages-dev

# Clean package cache
pct exec <VM id> -- apt autoclean
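
Check the container's disk usage afterwards:

# Show remaining disk space
pct exec <VM id> -- df -h /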

Resource Usage

Final Footprint

  • Disk Space: ~9.4GB total (after cleanup)
  • RAM Usage:
    • Ollama service: ~2.8GB (with model loaded)
    • Web interface: ~11MB
    • System overhead: ~1.7GB total
  • Available RAM: 2.3GB free

Model Storage

  • llama3.2:1b: 1.3GB
  • Available space: 28GB for additional models

Troubleshooting

GPU Not Detected

If Ollama doesn’t detect the GPU:

  1. Verify container config includes GPU passthrough
  2. Check device permissions: ls -la /dev/dri/
  3. Restart the container: pct reboot <VM id>

Web Interface Not Loading

  1. Check service status: systemctl status ollama-web
  2. Verify port 8080 is not blocked
  3. Check logs: journalctl -u ollama-web -f

Model Loading Issues

  1. Verify sufficient RAM available
  2. Check Ollama logs: journalctl -u ollama -f
  3. Try smaller model if memory constrained

Adding More Models

# Ultra-light model (500MB)
pct exec <VM id> -- ollama pull qwen2.5:0.5b

# Code-focused model (~3.8GB)
pct exec <VM id> -- ollama pull codellama:7b-code

# Conversational model (2.2GB)
pct exec <VM id> -- ollama pull phi3.5:latest

Switch Models in Web Interface

To use different models, edit the web interface script and change:

"model": "llama3.2:1b",  # Change to desired model

Performance Notes

  • Intel UHD Graphics: Provides hardware acceleration for inference
  • Response Time: 2-4 seconds for simple queries
  • Concurrent Users: 2-3 simultaneous users on 4GB RAM
  • Model Loading: ~2-3 seconds cold start
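
To get a feel for these numbers on your own hardware, time a one-off prompt (the first run includes model loading):

# Rough timing for a short prompt
pct exec <VM id> -- bash -c 'time /usr/local/bin/ollama run llama3.2:1b "One short sentence about Proxmox."'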

Security Considerations

  • Web interface runs without authentication (add auth for production)
  • Container runs as root (consider user namespacing)
  • No HTTPS (add reverse proxy for external access)
  • Firewall rules recommended for external exposure

Next Steps

Production Enhancements

  1. Add authentication to web interface
  2. Set up reverse proxy with SSL
  3. Implement user session management
  4. Add model switching capability
  5. Monitor resource usage

Alternative Web Interfaces

  • Open WebUI: Full-featured ChatGPT-like interface
  • AnythingLLM: Document chat capabilities
  • Chatbot Ollama: Streamlit-based interface

Conclusion

This setup provides a fully functional, resource-efficient local AI chat system with:

  • Hardware-accelerated inference
  • Clean web interface
  • Auto-starting services
  • Optimized resource usage

Perfect for homelab experimentation, private AI assistance, and learning about LLM deployment.


Last updated: 2025-08-11