This guide covers setting up Ollama, a local large language model runtime, in a Proxmox LXC container with GPU passthrough, and creating a simple web interface for easy interaction.

Overview

We’ll deploy Ollama in a resource-constrained LXC environment with:

  • Intel UHD Graphics GPU acceleration
  • llama3.2:1b model (~1.3GB)
  • Lightweight Python web interface
  • Auto-starting services

Prerequisites

  • Proxmox VE host
  • Intel integrated graphics (UHD Graphics)
  • At least 4GB RAM allocated to LXC
  • 40GB+ storage for container

Step 1: Container Setup

Check Available GPU Resources

First, verify GPU availability on the Proxmox host:

# Check for Intel graphics
lspci | grep -i vga
# Should show: Intel Corporation Alder Lake-N [UHD Graphics]

# Verify DRI devices
ls /dev/dri/
# Should show: card0  renderD128

LXC Container Configuration

Create or modify your LXC container configuration to include GPU passthrough:

# Edit container configuration
pct set <VM id> -features nesting=1

# Add GPU device passthrough to container config
cat >> /etc/pve/lxc/<VM id>.conf << 'EOF'
# GPU Passthrough for Intel Graphics
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.mount.entry: /dev/dri/card0 dev/dri/card0 none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
EOF
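
Restart the container so the new device entries take effect:

# Apply the passthrough configuration
pct reboot <VM id>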

Step 2: Install Ollama

Using Community Script

The easiest method is using the Proxmox VE Helper Scripts:

# Download and run Ollama LXC script
bash -c "$(wget -qLO - https://github.com/community-scripts/ProxmoxVE/raw/main/ct/ollama.sh)"

Manual Installation (Alternative)

If installing manually in an existing container:

# Open a shell inside the container
pct exec <VM id> -- bash

# Download and install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Make sure the Ollama service is enabled and running
systemctl enable ollama
systemctl start ollama
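
As a quick sanity check, confirm the binary and the local API respond before moving on:

# Verify the installation (run inside the container)
ollama --version
curl -s http://127.0.0.1:11434/
# Should return: Ollama is running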

Step 3: Verify GPU Access

Check if the container can access the GPU:

pct exec <VM id> -- ls -la /dev/dri/
# Should show: card0 and renderD128 with proper permissions

pct exec <VM id> -- lspci | grep -i vga
# Should show: Intel Corporation Alder Lake-N [UHD Graphics]

Step 4: Download and Test Model

Download llama3.2:1b Model

# Download lightweight model (1.3GB)
pct exec <VM id> -- /usr/local/bin/ollama pull llama3.2:1b
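
Confirm the download:

# List local models and their sizes
pct exec <VM id> -- /usr/local/bin/ollama list
# Should show: llama3.2:1b at roughly 1.3GB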

Test Model

# Test the model
pct exec <VM id> -- /usr/local/bin/ollama run llama3.2:1b "Hello! Please respond with just a short greeting."

Verify GPU Usage

Check Ollama logs to confirm GPU detection:

pct exec <VM id> -- systemctl status ollama

Look for log entries like:

level=INFO source=types.go:130 msg="inference compute" name="Intel(R) UHD Graphics"
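
To filter the journal for that entry directly:

# Search the Ollama logs for the GPU detection line
pct exec <VM id> -- bash -c 'journalctl -u ollama --no-pager | grep -i "inference compute"'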

Step 5: Create Web Interface

Create Web Interface Directory

pct exec <VM id> -- mkdir -p /opt/ollama-web

Create Simple Python Web Server

pct exec <VM id> -- bash -c 'cat > /opt/ollama-web/simple_web.py << "EOF"
#!/usr/bin/env python3
import http.server
import socketserver
import urllib.request
import urllib.parse
import json

PORT = 8080

class OllamaHandler(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/" or self.path == "/index.html":
            self.serve_chat_page()
        else:
            self.send_error(404)
    
    def do_POST(self):
        if self.path == "/chat":
            self.handle_chat()
        else:
            self.send_error(404)
    
    def serve_chat_page(self):
        html = """<!DOCTYPE html>
<html>
<head>
    <title>Ollama Chat</title>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <style>
        body { 
            font-family: Arial, sans-serif; 
            margin: 0; 
            padding: 20px; 
            background-color: #f5f5f5; 
        }
        .container { 
            max-width: 800px; 
            margin: 0 auto; 
            background: white; 
            border-radius: 10px; 
            box-shadow: 0 2px 10px rgba(0,0,0,0.1); 
        }
        .header { 
            background: #2563eb; 
            color: white; 
            padding: 20px; 
            border-radius: 10px 10px 0 0; 
            text-align: center; 
        }
        .chat-container { 
            height: 400px; 
            overflow-y: auto; 
            padding: 20px; 
            border-bottom: 1px solid #eee; 
        }
        .message { 
            margin-bottom: 15px; 
            padding: 10px; 
            border-radius: 8px; 
        }
        .user-message { 
            background: #e3f2fd; 
            text-align: right; 
            margin-left: 50px; 
        }
        .ai-message { 
            background: #f5f5f5; 
            margin-right: 50px; 
        }
        .input-container { 
            padding: 20px; 
            display: flex; 
            gap: 10px; 
        }
        #messageInput { 
            flex: 1; 
            padding: 10px; 
            border: 1px solid #ddd; 
            border-radius: 5px; 
            font-size: 16px; 
        }
        #sendButton { 
            padding: 10px 20px; 
            background: #2563eb; 
            color: white; 
            border: none; 
            border-radius: 5px; 
            cursor: pointer; 
        }
        #sendButton:hover { 
            background: #1d4ed8; 
        }
        #sendButton:disabled { 
            background: #ccc; 
            cursor: not-allowed; 
        }
        .loading { 
            font-style: italic; 
            color: #666; 
        }
        .status { 
            padding: 10px; 
            text-align: center; 
            font-size: 12px; 
            color: #666; 
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="header">
            <h1>🦙 Ollama Chat</h1>
            <p>Model: llama3.2:1b | Status: <span id="status">Ready</span></p>
        </div>
        <div class="chat-container" id="chatContainer">
            <div class="message ai-message">
                <strong>Assistant:</strong> Hello! I am your Ollama assistant running llama3.2:1b. How can I help you today?
            </div>
        </div>
        <div class="input-container">
            <input type="text" id="messageInput" placeholder="Type your message here..." onkeypress="handleKeyPress(event)">
            <button id="sendButton" onclick="sendMessage()">Send</button>
        </div>
        <div class="status">
            GPU: Intel UHD Graphics | RAM: Available
        </div>
    </div>

    <script>
        async function sendMessage() {
            const input = document.getElementById("messageInput");
            const chatContainer = document.getElementById("chatContainer");
            const sendButton = document.getElementById("sendButton");
            const status = document.getElementById("status");
            
            const message = input.value.trim();
            if (!message) return;
            
            // Add user message
            const userDiv = document.createElement("div");
            userDiv.className = "message user-message";
            userDiv.innerHTML = "<strong>You:</strong> " + message;
            chatContainer.appendChild(userDiv);
            
            // Add loading message
            const loadingDiv = document.createElement("div");
            loadingDiv.className = "message ai-message loading";
            loadingDiv.innerHTML = "<strong>Assistant:</strong> Thinking...";
            chatContainer.appendChild(loadingDiv);
            
            input.value = "";
            sendButton.disabled = true;
            status.textContent = "Generating...";
            chatContainer.scrollTop = chatContainer.scrollHeight;
            
            try {
                const response = await fetch("/chat", {
                    method: "POST",
                    headers: {
                        "Content-Type": "application/x-www-form-urlencoded",
                    },
                    body: "message=" + encodeURIComponent(message)
                });
                
                const data = await response.json();
                
                // Remove loading message
                chatContainer.removeChild(loadingDiv);
                
                // Add AI response
                const aiDiv = document.createElement("div");
                aiDiv.className = "message ai-message";
                aiDiv.innerHTML = "<strong>Assistant:</strong> " + data.response;
                chatContainer.appendChild(aiDiv);
                
            } catch (error) {
                // Remove loading message
                chatContainer.removeChild(loadingDiv);
                
                // Add error message
                const errorDiv = document.createElement("div");
                errorDiv.className = "message ai-message";
                errorDiv.innerHTML = "<strong>Error:</strong> Failed to get response. Is Ollama running?";
                chatContainer.appendChild(errorDiv);
            }
            
            sendButton.disabled = false;
            status.textContent = "Ready";
            chatContainer.scrollTop = chatContainer.scrollHeight;
        }
        
        function handleKeyPress(event) {
            if (event.key === "Enter") {
                sendMessage();
            }
        }
        
        // Focus input on load
        document.getElementById("messageInput").focus();
    </script>
</body>
</html>"""
        
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        self.wfile.write(html.encode())
    
    def handle_chat(self):
        content_length = int(self.headers["Content-Length"])
        post_data = self.rfile.read(content_length).decode("utf-8")
        parsed_data = urllib.parse.parse_qs(post_data)
        
        message = parsed_data.get("message", [""])[0]
        
        try:
            # Create request to Ollama
            data = json.dumps({
                "model": "llama3.2:1b",
                "prompt": message,
                "stream": False
            }).encode()
            
            req = urllib.request.Request(
                "http://localhost:11434/api/generate",
                data=data,
                headers={"Content-Type": "application/json"}
            )
            
            with urllib.request.urlopen(req) as response:
                result = json.loads(response.read().decode())
                ai_response = result.get("response", "No response received")
            
            # Send JSON response
            response_data = json.dumps({"response": ai_response})
            self.send_response(200)
            self.send_header("Content-type", "application/json")
            self.end_headers()
            self.wfile.write(response_data.encode())
            
        except Exception as e:
            error_response = json.dumps({"response": f"Error: {str(e)}"})
            self.send_response(500)
            self.send_header("Content-type", "application/json")
            self.end_headers()
            self.wfile.write(error_response.encode())

if __name__ == "__main__":
    with socketserver.TCPServer(("", PORT), OllamaHandler) as httpd:
        print(f"Server running at http://0.0.0.0:{PORT}")
        httpd.serve_forever()
EOF'

Make Script Executable

pct exec <VM id> -- chmod +x /opt/ollama-web/simple_web.py
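
Optionally, byte-compile the script to catch copy/paste or quoting mistakes before wiring it into systemd:

# Quick syntax check
pct exec <VM id> -- python3 -m py_compile /opt/ollama-web/simple_web.py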

Step 6: Create Systemd Service

Create Web Interface Service

pct exec <VM id> -- bash -c 'cat > /etc/systemd/system/ollama-web.service << "EOF"
[Unit]
Description=Ollama Web Interface
After=ollama.service
Requires=ollama.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/ollama-web
ExecStart=/usr/bin/python3 /opt/ollama-web/simple_web.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF'

Enable and Start Service

pct exec <VM id> -- systemctl daemon-reload
pct exec <VM id> -- systemctl enable ollama-web
pct exec <VM id> -- systemctl start ollama-web

Step 7: Verify Setup

Check Service Status

pct exec <VM id> -- systemctl status ollama ollama-web --no-pager

Test Web Interface

# Test if web interface responds
curl -s -o /dev/null -w "%{http_code}" http://CONTAINER_IP:8080
# Should return: 200
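
You can also exercise the /chat endpoint the page uses, which proxies the prompt to Ollama:

# Test the chat endpoint end-to-end (form-encoded, exactly as the page sends it)
curl -s -X POST -d "message=Reply with one short sentence." http://CONTAINER_IP:8080/chat
# Should return JSON like: {"response": "..."}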

Access Points

Web Interface

  • URL: http://CONTAINER_IP:8080
  • Features: ChatGPT-like interface, real-time chat, auto-scroll

Ollama API

  • URL: http://CONTAINER_IP:11434
  • Endpoint: /api/generate for direct API calls
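
Depending on how Ollama was installed, it may only listen on 127.0.0.1 inside the container; to reach the API at CONTAINER_IP, set OLLAMA_HOST=0.0.0.0 in the service environment (for example via systemctl edit ollama) and restart it. A direct call to the generate endpoint looks like this:

# Direct API call (run inside the container, or against CONTAINER_IP if exposed)
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'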

Command Line

# Direct container access
pct exec <VM id> -- /usr/local/bin/ollama run llama3.2:1b "Your question here"

# SSH access
ssh root@CONTAINER_IP
ollama run llama3.2:1b

Resource Optimization

Remove Unnecessary Packages

After installation, clean up development packages to save space:

# Remove build tools and dev packages
pct exec <VM id> -- apt autoremove -y gcc g++ make patch dpkg-dev cpp manpages-dev

# Clean package cache
pct exec <VM id> -- apt autoclean
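
Check the container's disk usage afterwards:

# Show remaining disk space
pct exec <VM id> -- df -h /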

Resource Usage

Final Footprint

  • Disk Space: ~9.4GB total (after cleanup)
  • RAM Usage:
    • Ollama service: ~2.8GB (with model loaded)
    • Web interface: ~11MB
    • System overhead: ~1.7GB total
  • Available RAM: 2.3GB free

Model Storage

  • llama3.2:1b: 1.3GB
  • Available space: 28GB for additional models

Troubleshooting

GPU Not Detected

If Ollama doesn’t detect the GPU:

  1. Verify container config includes GPU passthrough
  2. Check device permissions: ls -la /dev/dri/
  3. Restart the container: pct reboot <VM id>

Web Interface Not Loading

  1. Check service status: systemctl status ollama-web
  2. Verify port 8080 is not blocked
  3. Check logs: journalctl -u ollama-web -f

Model Loading Issues

  1. Verify sufficient RAM available
  2. Check Ollama logs: journalctl -u ollama -f
  3. Try smaller model if memory constrained

Adding More Models

# Ultra-light model (500MB)
pct exec <VM id> -- ollama pull qwen2.5:0.5b

# Code-focused model (~3.8GB)
pct exec <VM id> -- ollama pull codellama:7b-code

# Conversational model (2.2GB)
pct exec <VM id> -- ollama pull phi3.5:latest

Switch Models in Web Interface

To use different models, edit the web interface script and change:

"model": "llama3.2:1b",  # Change to desired model

Performance Notes

  • Intel UHD Graphics: Provides hardware acceleration for inference
  • Response Time: 2-4 seconds for simple queries
  • Concurrent Users: 2-3 simultaneous users on 4GB RAM
  • Model Loading: ~2-3 seconds cold start
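
To get a feel for these numbers on your own hardware, time a one-off prompt (the first run includes model loading):

# Rough timing for a short prompt
pct exec <VM id> -- bash -c 'time /usr/local/bin/ollama run llama3.2:1b "One short sentence about Proxmox."'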

Security Considerations

  • Web interface runs without authentication (add auth for production)
  • Container runs as root (consider user namespacing)
  • No HTTPS (add reverse proxy for external access)
  • Firewall rules recommended for external exposure

Next Steps

Production Enhancements

  1. Add authentication to web interface
  2. Set up reverse proxy with SSL
  3. Implement user session management
  4. Add model switching capability
  5. Monitor resource usage

Alternative Web Interfaces

  • Open WebUI: Full-featured ChatGPT-like interface
  • AnythingLLM: Document chat capabilities
  • Chatbot Ollama: Streamlit-based interface

Conclusion

This setup provides a fully functional, resource-efficient local AI chat system with:

  • Hardware-accelerated inference
  • Clean web interface
  • Auto-starting services
  • Optimized resource usage

Perfect for homelab experimentation, private AI assistance, and learning about LLM deployment.


Last updated: 2025-08-11