
How to: Self-Hosted LLMs

Self-hosted Large Language Models provide complete control over your AI infrastructure, ensuring data privacy, predictable costs, and independence from external providers.

Overview

Self-hosted LLMs in Vectense Platform are powered by Ollama, an open-source platform that makes it easy to run large language models locally. This approach offers:

Benefits

  • Complete Data Privacy: All processing happens on your infrastructure
  • No External Dependencies: Works in air-gapped environments
  • Predictable Costs: No per-token charges, only infrastructure costs
  • Customization: Ability to fine-tune models for specific use cases
  • Compliance: Meets strict data governance requirements

Considerations

  • Infrastructure Requirements: Needs sufficient hardware resources
  • Performance Variability: May be slower than cloud models
  • Management Overhead: Requires system administration
  • Model Updates: Manual process to update models

Ollama Installation

System Requirements

Minimum Requirements

  • RAM: 8GB+ (16GB+ recommended)
  • CPU: Modern multi-core processor
  • Storage: 10GB+ free space per model
  • OS: Linux, macOS, or Windows

Recommended for Production

  • RAM: 32GB+ for larger models
  • GPU: NVIDIA GPU with 8GB+ VRAM for better performance
  • Storage: SSD storage for faster model loading
  • Network: Stable network connection for model downloads

Installation Steps

Linux/macOS Installation

# Download and install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Verify installation
ollama --version

Windows Installation

  1. Download Ollama installer from https://ollama.ai/download
  2. Run the installer with administrator privileges
  3. Open Command Prompt or PowerShell
  4. Verify installation: ollama --version

Docker Installation

# Run Ollama in Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Pull and run a model
docker exec -it ollama ollama run llama2
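
If the host has an NVIDIA GPU, the container can use it once the NVIDIA Container Toolkit is installed; use this variant instead of the CPU-only command above (it reuses the same container name):

# Run Ollama with GPU access (requires the NVIDIA Container Toolkit on the host)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama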

Starting Ollama Service

Background Service

# Start the Ollama server (runs in the foreground; stop with Ctrl+C)
ollama serve

# For production, create a systemd service
sudo systemctl enable ollama
sudo systemctl start ollama
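
On Linux the install script normally creates this systemd unit for you. If it did not, a minimal sketch is shown below; the binary path and the ollama user/group are assumptions to adjust for your system:

# Sketch: write a basic systemd unit for Ollama (adjust paths and user to your install)
sudo tee /etc/systemd/system/ollama.service > /dev/null <<'EOF'
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd so the new unit is picked up, then enable it as shown above
sudo systemctl daemon-reload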

Check Service Status

# Verify Ollama is running
curl http://localhost:11434/api/tags

# Expected response: JSON with empty models list initially
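
Once a model has been pulled (see Model Management below), an end-to-end check against the generate endpoint confirms the server can actually load and run it; the model name here is only an example:

# Request a short, non-streaming completion from a downloaded model
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Reply with the single word: pong",
  "stream": false
}'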

Model Management

Available Models

General Purpose Models

  • Llama 2 (7B, 13B, 70B): Meta's open-source models
  • Llama 3 (8B, 70B): Latest generation with improved performance
  • Mistral (7B): Efficient open-weight model from Mistral AI
  • Mixtral (8x7B): Mixture of experts model

Specialized Models

  • CodeLlama (7B, 13B, 34B): Optimized for code generation
  • Llama2-Uncensored: Versions with reduced safety filters
  • Vicuna: Fine-tuned for conversation
  • WizardCoder: Enhanced coding capabilities

Lightweight Models

  • TinyLlama (1.1B): Very fast, minimal resource requirements
  • Phi-2 (2.7B): Microsoft's efficient small model
  • Gemma (2B, 7B): Google's efficient models

Downloading Models

Download Popular Models

# Download Llama 2 (7B parameters)
ollama pull llama2

# Download Mistral 7B
ollama pull mistral

# Download Code Llama for programming tasks
ollama pull codellama

# Download smaller model for testing
ollama pull tinyllama
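
After a pull finishes, a quick run confirms the model loads and responds; passing a prompt as an argument performs a single one-off completion instead of opening an interactive session:

# Interactive chat session (type /bye to exit)
ollama run llama2

# One-off prompt from the command line
ollama run llama2 "Summarize what Ollama does in one sentence."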

List Downloaded Models

# List locally downloaded models
ollama list

# "ls" is a shorthand alias for the same command
ollama ls

# Browse models available for download at https://ollama.ai/library

Remove Models

# Remove a model to free space
ollama rm model-name

Model Selection Guide

For General Tasks

  • Llama 3 8B: Best balance of performance and resource usage
  • Mistral 7B: Good European alternative
  • Llama 2 7B: Proven performance, widely tested

For Coding Tasks

  • CodeLlama 7B: Specialized for code generation
  • CodeLlama 13B: Better performance, more resources required
  • WizardCoder: Enhanced coding capabilities

For Resource-Constrained Environments

  • TinyLlama: Minimal resources, basic capabilities
  • Phi-2: Better performance in small package
  • Gemma 2B: Good balance of size and capability

For High-Performance Needs

  • Llama 3 70B: Best open-source performance (requires significant resources)
  • CodeLlama 34B: Best coding performance
  • Mixtral 8x7B: Excellent performance with mixture of experts

Performance Optimization

Hardware Optimization

GPU Acceleration

# Install NVIDIA drivers and CUDA toolkit
# Ubuntu/Debian
sudo apt install nvidia-driver-535 nvidia-cuda-toolkit

# Verify GPU detection
nvidia-smi

# Ollama automatically uses GPU when available
ollama run llama2
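
To confirm that generation is actually running on the GPU, recent Ollama releases include ollama ps, which lists loaded models and whether they sit on CPU or GPU; nvidia-smi should also show the Ollama process using VRAM while a model is loaded:

# Show loaded models and their CPU/GPU placement (recent Ollama versions)
ollama ps

# Watch GPU memory and utilization during generation
watch -n 1 nvidia-smi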

Memory Management

  • RAM: Ensure sufficient RAM for model size
  • Swap: Configure swap space for large models
  • Model Quantization: Use quantized models for lower memory usage (see the example after this list)
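
Quantized variants are published as tags in the model library; the tag below is an example and exact tag names vary per model, so check the model's library page before pulling:

# Pull a 4-bit quantized variant instead of the default (tag is an example; names vary by model)
ollama pull llama2:7b-chat-q4_0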

Storage Optimization

  • SSD Storage: Use SSD for faster model loading
  • Model Caching: Keep frequently used models loaded
  • Storage Cleanup: Remove unused models regularly

Ollama Configuration

Environment Variables

# Set custom model storage location
export OLLAMA_MODELS=/path/to/models

# Select which GPU Ollama may use (NVIDIA)
export CUDA_VISIBLE_DEVICES=0

# Limit how many models are kept loaded in memory at once
export OLLAMA_MAX_LOADED_MODELS=3
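
When Ollama runs as a systemd service, variables exported in a shell are not visible to it; set them in a drop-in override instead:

# Open an override file for the service
sudo systemctl edit ollama

# Add the variables under [Service], for example:
#   [Service]
#   Environment="OLLAMA_MODELS=/path/to/models"
#   Environment="OLLAMA_MAX_LOADED_MODELS=3"

# Apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama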

Per-Model Parameters (Modelfile)

Ollama has no global JSON configuration file: server behaviour is controlled by environment variables, while per-model options such as GPU layer offload and thread count are set in a Modelfile.

# Modelfile for a tuned variant
FROM llama3
PARAMETER num_gpu 35
PARAMETER num_thread 8

# Build and run the customized model
ollama create llama3-tuned -f Modelfile
ollama run llama3-tuned

Network Configuration

External Access

# Allow external connections (default: localhost only)
OLLAMA_HOST=0.0.0.0:11434 ollama serve
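
Binding to 0.0.0.0 exposes the API to anyone who can reach the port, so pair it with a firewall rule; a minimal sketch with ufw, where the allowed subnet is an assumption to adapt (ufw's default policy of denying incoming traffic is also assumed):

# Allow only a trusted subnet to reach the Ollama port
sudo ufw allow from 10.0.0.0/8 to any port 11434 proto tcp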

Reverse Proxy Setup (Nginx)

server {
    listen 80;
    server_name your-ollama-server.com;

    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Integration with Vectense

Configure Ollama Model in Vectense

Step 1: Ensure Ollama is Running

# Check Ollama status
curl http://localhost:11434/api/tags

# Start if needed
ollama serve

Step 2: Configure in Vectense

  1. Navigate to Models → "Configure new Model"
  2. Select "Ollama" as provider
  3. Configure settings:
    • Name: Descriptive name (e.g., "Local Llama 3")
    • Ollama Model: Exact model name (e.g., "llama3")
    • Ollama Endpoint: Server URL (e.g., "http://localhost:11434")

Step 3: Test Connection

  • Use built-in test to verify connectivity
  • Check response quality and speed
  • Adjust configuration if needed

Production Deployment

High Availability Setup

  • Run multiple Ollama instances (a minimal sketch follows this list)
  • Use load balancer for distribution
  • Implement health checks and failover
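
A minimal sketch of the first bullet, running two instances on one host purely for illustration; in production the instances would normally live on separate hosts or containers behind the load balancer, and the ports here are assumptions:

# Instance A on the default port
OLLAMA_HOST=127.0.0.1:11434 ollama serve &

# Instance B on a second port
OLLAMA_HOST=127.0.0.1:11435 ollama serve &

# Health check a load balancer can probe (non-zero exit code if the instance is down)
curl -f http://127.0.0.1:11434/api/tags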

Monitoring and Logging

  • Monitor resource usage (CPU, RAM, GPU)
  • Track model response times
  • Set up alerting for failures

Backup and Recovery

  • Backup model files and configurations
  • Document deployment procedures
  • Test recovery processes

Security Considerations

Network Security

  • Firewall: Restrict access to Ollama port (11434)
  • VPN: Use VPN for remote access
  • Authentication: Implement authentication proxy if needed
  • HTTPS: Use TLS for external access

Data Security

  • File Permissions: Secure model files and data
  • Access Control: Limit system access to authorized users
  • Audit Logging: Track access and usage
  • Data Encryption: Encrypt sensitive data at rest

Model Security

  • Model Verification: Verify model checksums
  • Source Validation: Download models from trusted sources
  • Version Control: Track model versions and updates
  • Sandboxing: Consider containerized deployment

Troubleshooting

Installation Issues

Permission Errors

# Fix permissions for model directory
sudo chown -R $USER:$USER ~/.ollama

# Run with sudo if needed (not recommended for production)
sudo ollama serve

Port Conflicts

# Check what's using port 11434
sudo netstat -tulpn | grep 11434

# Use different port
OLLAMA_HOST=localhost:11435 ollama serve

Performance Issues

Out of Memory

  • Use smaller models or quantized versions
  • Increase system RAM or swap space
  • Limit concurrent model loading

Slow Response Times

  • Check available system resources
  • Ensure GPU acceleration is working
  • Use SSD storage for models
  • Optimize model selection for hardware

Model Loading Failures

# Check available disk space
df -h

# Re-pull the model to verify and repair its layers
ollama pull model-name

# Check system logs
journalctl -u ollama

Connectivity Issues

Ollama Not Responding

# Check if process is running
ps aux | grep ollama

# Restart the service (systemd install)
sudo systemctl restart ollama

# Or run in the foreground to see errors directly
ollama serve

# Check service logs (systemd)
journalctl -u ollama -f

Network Access Problems

  • Verify firewall settings
  • Check network connectivity
  • Confirm correct endpoint URL in Vectense

Cost Analysis

Infrastructure Costs

Hardware Investment

  • Server hardware or cloud instances
  • GPU costs for acceleration
  • Storage for models and data
  • Network infrastructure

Operational Costs

  • Electricity and cooling
  • System administration
  • Maintenance and updates
  • Backup and disaster recovery

Cost Comparison

Break-Even Analysis

Compare self-hosted costs with cloud API costs:

  • Calculate monthly infrastructure costs
  • Estimate token usage volume
  • Compare with cloud provider pricing
  • Consider growth projections

Example Calculation

Cloud Cost: $0.002 per 1K tokens
Monthly usage: 10M tokens
Cloud cost: $20/month

Self-hosted server: $200/month
Break-even: 100M tokens/month
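
The break-even volume is simply the monthly infrastructure cost divided by the cloud price per token; a one-liner reproducing the figure from the example above:

# 200 USD/month divided by (0.002 USD per 1,000 tokens) = 100,000,000 tokens/month
awk 'BEGIN { printf "%.0f tokens/month\n", 200 / (0.002 / 1000) }'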

Best Practices

Deployment

  • Start Small: Begin with smaller models and scale up
  • Test Thoroughly: Validate performance before production
  • Monitor Resources: Track CPU, RAM, and GPU usage
  • Plan Capacity: Size infrastructure for peak usage

Management

  • Regular Updates: Keep Ollama and models updated
  • Health Monitoring: Implement comprehensive monitoring
  • Backup Strategy: Regular backups of models and config
  • Documentation: Maintain deployment and operational docs

Security

  • Access Control: Implement proper authentication
  • Network Security: Use firewalls and VPNs
  • Data Protection: Encrypt sensitive data
  • Audit Logging: Track all access and changes

Advanced Topics

Model Fine-Tuning

  • Fine-tune models on your own data with external tooling (Ollama itself does not train models)
  • Import the resulting GGUF weights or LoRA adapters through a Modelfile
  • Optimize for your use cases
  • Balance performance and resource usage

Multi-Model Deployment

  • Run multiple specialized models
  • Load balance based on task type
  • Optimize resource allocation
  • Implement intelligent routing

Containerization

  • Deploy Ollama in containers
  • Use Kubernetes for orchestration
  • Implement auto-scaling
  • Ensure data persistence

Self-hosted LLMs provide powerful capabilities while maintaining complete control over your AI infrastructure. With proper planning and implementation, they offer an excellent alternative to cloud-based models for organizations with specific privacy, cost, or compliance requirements.