
How to: Self-Hosted LLMs

Self-hosted Large Language Models provide complete control over your AI infrastructure, ensuring data privacy, predictable costs, and independence from external providers.

Overview

Self-hosted LLMs in Vectense Platform are powered by Ollama, an open-source platform that makes it easy to run large language models locally. This approach offers:

Benefits

  • Complete Data Privacy: All processing happens on your infrastructure
  • No External Dependencies: Works in air-gapped environments
  • Predictable Costs: No per-token charges, only infrastructure costs
  • Customization: Ability to fine-tune models for specific use cases
  • Compliance: Meets strict data governance requirements

Considerations

  • Infrastructure Requirements: Needs sufficient hardware resources
  • Performance Variability: May be slower than cloud models
  • Management Overhead: Requires system administration
  • Model Updates: Manual process to update models

Ollama Installation

System Requirements

Minimum Requirements

  • RAM: 8GB+ (16GB+ recommended)
  • CPU: Modern multi-core processor
  • Storage: 10GB+ free space per model
  • OS: Linux, macOS, or Windows

Recommended for Production

  • RAM: 32GB+ for larger models
  • GPU: NVIDIA GPU with 8GB+ VRAM for better performance
  • Storage: SSD storage for faster model loading
  • Network: Stable network connection for model downloads

Installation Steps

Linux/macOS Installation

# Download and install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Verify installation
ollama --version

Windows Installation

  1. Download Ollama installer from https://ollama.ai/download
  2. Run the installer with administrator privileges
  3. Open Command Prompt or PowerShell
  4. Verify installation: ollama --version

Docker Installation

# Run Ollama in Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Pull and run a model
docker exec -it ollama ollama run llama2
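
If the host has an NVIDIA GPU, the container can use it once the NVIDIA Container Toolkit is installed; use this variant instead of the CPU-only command above (it reuses the same container name):

# Run Ollama with GPU access (requires the NVIDIA Container Toolkit on the host)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama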

Starting Ollama Service

Background Service

# Start the Ollama server (runs in the foreground; stop with Ctrl+C)
ollama serve

# For production, create a systemd service
sudo systemctl enable ollama
sudo systemctl start ollama
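
On Linux the install script normally creates this systemd unit for you. If it did not, a minimal sketch is shown below; the binary path and the ollama user/group are assumptions to adjust for your system:

# Sketch: write a basic systemd unit for Ollama (adjust paths and user to your install)
sudo tee /etc/systemd/system/ollama.service > /dev/null <<'EOF'
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd so the new unit is picked up, then enable it as shown above
sudo systemctl daemon-reload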

Check Service Status

# Verify Ollama is running
curl http://localhost:11434/api/tags

# Expected response: JSON with empty models list initially
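
Once a model has been pulled (see Model Management below), an end-to-end check against the generate endpoint confirms the server can actually load and run it; the model name here is only an example:

# Request a short, non-streaming completion from a downloaded model
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Reply with the single word: pong",
  "stream": false
}'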

Model Management

Available Models

General Purpose Models

  • Llama 2 (7B, 13B, 70B): Meta's open-source models
  • Llama 3 (8B, 70B): Latest generation with improved performance
  • Mistral (7B): Efficient open-weight model from Mistral AI
  • Mixtral (8x7B): Mixture of experts model

Specialized Models

  • CodeLlama (7B, 13B, 34B): Optimized for code generation
  • Llama2-Uncensored: Versions with reduced safety filters
  • Vicuna: Fine-tuned for conversation
  • WizardCoder: Enhanced coding capabilities

Lightweight Models

  • TinyLlama (1.1B): Very fast, minimal resource requirements
  • Phi-2 (2.7B): Microsoft's efficient small model
  • Gemma (2B, 7B): Google's efficient models

Downloading Models

Download Popular Models

# Download Llama 2 (7B parameters)
ollama pull llama2

# Download Mistral 7B
ollama pull mistral

# Download Code Llama for programming tasks
ollama pull codellama

# Download smaller model for testing
ollama pull tinyllama
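
After a pull finishes, a quick run confirms the model loads and responds; passing a prompt as an argument performs a single one-off completion instead of opening an interactive session:

# Interactive chat session (type /bye to exit)
ollama run llama2

# One-off prompt from the command line
ollama run llama2 "Summarize what Ollama does in one sentence."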

List Downloaded Models

# List locally downloaded models
ollama list

# "ls" is a shorthand alias for the same command
ollama ls

# Browse models available for download at https://ollama.ai/library

Remove Models

# Remove a model to free space
ollama rm model-name

Model Selection Guide

For General Tasks

  • Llama 3 8B: Best balance of performance and resource usage
  • Mistral 7B: Good European alternative
  • Llama 2 7B: Proven performance, widely tested

For Coding Tasks

  • CodeLlama 7B: Specialized for code generation
  • CodeLlama 13B: Better performance, more resources required
  • WizardCoder: Enhanced coding capabilities

For Resource-Constrained Environments

  • TinyLlama: Minimal resources, basic capabilities
  • Phi-2: Better performance in small package
  • Gemma 2B: Good balance of size and capability

For High-Performance Needs

  • Llama 3 70B: Best open-source performance (requires significant resources)
  • CodeLlama 34B: Best coding performance
  • Mixtral 8x7B: Excellent performance with mixture of experts

Performance Optimization

Hardware Optimization

GPU Acceleration

# Install NVIDIA drivers and CUDA toolkit
# Ubuntu/Debian
sudo apt install nvidia-driver-535 nvidia-cuda-toolkit

# Verify GPU detection
nvidia-smi

# Ollama automatically uses GPU when available
ollama run llama2
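
To confirm that generation is actually running on the GPU, recent Ollama releases include ollama ps, which lists loaded models and whether they sit on CPU or GPU; nvidia-smi should also show the Ollama process using VRAM while a model is loaded:

# Show loaded models and their CPU/GPU placement (recent Ollama versions)
ollama ps

# Watch GPU memory and utilization during generation
watch -n 1 nvidia-smi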

Memory Management

  • RAM: Ensure sufficient RAM for model size
  • Swap: Configure swap space for large models
  • Model Quantization: Use quantized models for lower memory usage (see the example after this list)
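
Quantized variants are published as tags in the model library; the tag below is an example and exact tag names vary per model, so check the model's library page before pulling:

# Pull a 4-bit quantized variant instead of the default (tag is an example; names vary by model)
ollama pull llama2:7b-chat-q4_0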

Storage Optimization

  • SSD Storage: Use SSD for faster model loading
  • Model Caching: Keep frequently used models loaded
  • Storage Cleanup: Remove unused models regularly

Ollama Configuration

Environment Variables

# Set custom model storage location
export OLLAMA_MODELS=/path/to/models

# Select which GPU Ollama may use (NVIDIA)
export CUDA_VISIBLE_DEVICES=0

# Limit how many models are kept loaded in memory at once
export OLLAMA_MAX_LOADED_MODELS=3
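
When Ollama runs as a systemd service, variables exported in a shell are not visible to it; set them in a drop-in override instead:

# Open an override file for the service
sudo systemctl edit ollama

# Add the variables under [Service], for example:
#   [Service]
#   Environment="OLLAMA_MODELS=/path/to/models"
#   Environment="OLLAMA_MAX_LOADED_MODELS=3"

# Apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama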

Per-Model Parameters (Modelfile)

Ollama has no global JSON configuration file: server behaviour is controlled by environment variables, while per-model options such as GPU layer offload and thread count are set in a Modelfile.

# Modelfile for a tuned variant
FROM llama3
PARAMETER num_gpu 35
PARAMETER num_thread 8

# Build and run the customized model
ollama create llama3-tuned -f Modelfile
ollama run llama3-tuned

Network Configuration

External Access

# Allow external connections (default: localhost only)
OLLAMA_HOST=0.0.0.0:11434 ollama serve
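
Binding to 0.0.0.0 exposes the API to anyone who can reach the port, so pair it with a firewall rule; a minimal sketch with ufw, where the allowed subnet is an assumption to adapt (ufw's default policy of denying incoming traffic is also assumed):

# Allow only a trusted subnet to reach the Ollama port
sudo ufw allow from 10.0.0.0/8 to any port 11434 proto tcp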

Reverse Proxy Setup (Nginx)

server {
    listen 80;
    server_name your-ollama-server.com;

    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Integration with Vectense

Configure Ollama Model in Vectense

Step 1: Ensure Ollama is Running

# Check Ollama status
curl http://localhost:11434/api/tags

# Start if needed
ollama serve

Step 2: Configure in Vectense

  1. Navigate to Models → "Configure new Model"
  2. Select "Ollama" as provider
  3. Configure settings:
    • Name: Descriptive name (e.g., "Local Llama 3")
    • Ollama Model: Exact model name (e.g., "llama3")
    • Ollama Endpoint: Server URL (e.g., "http://localhost:11434")

Step 3: Test Connection

  • Use built-in test to verify connectivity
  • Check response quality and speed
  • Adjust configuration if needed

Production Deployment

High Availability Setup

  • Run multiple Ollama instances (a minimal sketch follows this list)
  • Use load balancer for distribution
  • Implement health checks and failover
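
A minimal sketch of the first bullet, running two instances on one host purely for illustration; in production the instances would normally live on separate hosts or containers behind the load balancer, and the ports here are assumptions:

# Instance A on the default port
OLLAMA_HOST=127.0.0.1:11434 ollama serve &

# Instance B on a second port
OLLAMA_HOST=127.0.0.1:11435 ollama serve &

# Health check a load balancer can probe (non-zero exit code if the instance is down)
curl -f http://127.0.0.1:11434/api/tags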

Monitoring and Logging

  • Monitor resource usage (CPU, RAM, GPU)
  • Track model response times
  • Set up alerting for failures

Backup and Recovery

  • Backup model files and configurations
  • Document deployment procedures
  • Test recovery processes

Security Considerations

Network Security

  • Firewall: Restrict access to Ollama port (11434)
  • VPN: Use VPN for remote access
  • Authentication: Implement authentication proxy if needed
  • HTTPS: Use TLS for external access

Data Security

  • File Permissions: Secure model files and data
  • Access Control: Limit system access to authorized users
  • Audit Logging: Track access and usage
  • Data Encryption: Encrypt sensitive data at rest

Model Security

  • Model Verification: Verify model checksums
  • Source Validation: Download models from trusted sources
  • Version Control: Track model versions and updates
  • Sandboxing: Consider containerized deployment

Troubleshooting

Installation Issues

Permission Errors

# Fix permissions for model directory
sudo chown -R $USER:$USER ~/.ollama

# Run with sudo if needed (not recommended for production)
sudo ollama serve

Port Conflicts

# Check what's using port 11434
sudo netstat -tulpn | grep 11434

# Use different port
OLLAMA_HOST=localhost:11435 ollama serve

Performance Issues

Out of Memory

  • Use smaller models or quantized versions
  • Increase system RAM or swap space
  • Limit concurrent model loading

Slow Response Times

  • Check available system resources
  • Ensure GPU acceleration is working
  • Use SSD storage for models
  • Optimize model selection for hardware

Model Loading Failures

# Check available disk space
df -h

# Re-pull the model to verify and repair its layers
ollama pull model-name

# Check system logs
journalctl -u ollama

Connectivity Issues

Ollama Not Responding

# Check if process is running
ps aux | grep ollama

# Restart the service (systemd install)
sudo systemctl restart ollama

# Or run in the foreground to see errors directly
ollama serve

# Check service logs (systemd)
journalctl -u ollama -f

Network Access Problems

  • Verify firewall settings
  • Check network connectivity
  • Confirm correct endpoint URL in Vectense

Cost Analysis

Infrastructure Costs

Hardware Investment

  • Server hardware or cloud instances
  • GPU costs for acceleration
  • Storage for models and data
  • Network infrastructure

Operational Costs

  • Electricity and cooling
  • System administration
  • Maintenance and updates
  • Backup and disaster recovery

Cost Comparison

Break-Even Analysis

Compare self-hosted costs with cloud API costs:

  • Calculate monthly infrastructure costs
  • Estimate token usage volume
  • Compare with cloud provider pricing
  • Consider growth projections

Example Calculation

Cloud Cost: $0.002 per 1K tokens
Monthly usage: 10M tokens
Cloud cost: $20/month

Self-hosted server: $200/month
Break-even: 100M tokens/month
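
The break-even volume is simply the monthly infrastructure cost divided by the cloud price per token; a one-liner reproducing the figure from the example above:

# 200 USD/month divided by (0.002 USD per 1,000 tokens) = 100,000,000 tokens/month
awk 'BEGIN { printf "%.0f tokens/month\n", 200 / (0.002 / 1000) }'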

Best Practices

Deployment

  • Start Small: Begin with smaller models and scale up
  • Test Thoroughly: Validate performance before production
  • Monitor Resources: Track CPU, RAM, and GPU usage
  • Plan Capacity: Size infrastructure for peak usage

Management

  • Regular Updates: Keep Ollama and models updated
  • Health Monitoring: Implement comprehensive monitoring
  • Backup Strategy: Regular backups of models and config
  • Documentation: Maintain deployment and operational docs

Security

  • Access Control: Implement proper authentication
  • Network Security: Use firewalls and VPNs
  • Data Protection: Encrypt sensitive data
  • Audit Logging: Track all access and changes

Advanced Topics

Model Fine-Tuning

  • Fine-tune models on your own data with external tooling (Ollama itself does not train models)
  • Import the resulting GGUF weights or LoRA adapters through a Modelfile
  • Optimize for your use cases
  • Balance performance and resource usage

Multi-Model Deployment

  • Run multiple specialized models
  • Load balance based on task type
  • Optimize resource allocation
  • Implement intelligent routing

Containerization

  • Deploy Ollama in containers
  • Use Kubernetes for orchestration
  • Implement auto-scaling
  • Ensure data persistence

Self-hosted LLMs provide powerful capabilities while maintaining complete control over your AI infrastructure. With proper planning and implementation, they offer an excellent alternative to cloud-based models for organizations with specific privacy, cost, or compliance requirements.