How to: Self-Hosted LLMs
Self-hosted Large Language Models provide complete control over your AI infrastructure, ensuring data privacy, predictable costs, and independence from external providers.
Overview
Self-hosted LLMs in Vectense Platform are powered by Ollama, an open-source platform that makes it easy to run large language models locally. The approach has clear benefits along with some trade-offs to plan for.
Benefits
- Complete Data Privacy: All processing happens on your infrastructure
- No External Dependencies: Works in air-gapped environments
- Predictable Costs: No per-token charges, only infrastructure costs
- Customization: Ability to fine-tune models for specific use cases
- Compliance: Meets strict data governance requirements
Considerations
- Infrastructure Requirements: Needs sufficient hardware resources
- Performance Variability: May be slower than cloud models
- Management Overhead: Requires system administration
- Model Updates: Manual process to update models
Ollama Installation
System Requirements
Minimum Requirements
- RAM: 8GB+ (16GB+ recommended)
- CPU: Modern multi-core processor
- Storage: 10GB+ free space per model
- OS: Linux, macOS, or Windows
Recommended for Production
- RAM: 32GB+ for larger models
- GPU: NVIDIA GPU with 8GB+ VRAM for better performance
- Storage: SSD storage for faster model loading
- Network: Stable network connection for model downloads
Installation Steps
Linux/macOS Installation
# Download and install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Verify installation
ollama --version
Windows Installation
- Download Ollama installer from https://ollama.ai/download
- Run the installer with administrator privileges
- Open Command Prompt or PowerShell
- Verify installation:
ollama --version
Docker Installation
# Run Ollama in Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# Pull and run a model
docker exec -it ollama ollama run llama2
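If the host has an NVIDIA GPU and the NVIDIA Container Toolkit installed, the same container can use it for acceleration:
# GPU-enabled variant (requires the NVIDIA Container Toolkit on the host)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama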
Starting Ollama Service
Background Service
# Start Ollama service (runs in background)
ollama serve
# For production, create a systemd service
sudo systemctl enable ollama
sudo systemctl start ollama
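On Linux, the install script normally creates the ollama systemd unit for you. If you installed the binary manually, the following is a minimal sketch of such a unit; the ExecStart path and the ollama user/group are assumptions to adjust for your system.
# Create a minimal systemd unit (adjust ExecStart path and user to your install)
sudo tee /etc/systemd/system/ollama.service <<'EOF'
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
Restart=always
User=ollama
Group=ollama
[Install]
WantedBy=default.target
EOF
# Reload systemd and start the service
sudo systemctl daemon-reload
sudo systemctl enable --now ollama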
Check Service Status
# Verify Ollama is running
curl http://localhost:11434/api/tags
# Expected response: JSON with empty models list initially
Model Management
Available Models
General Purpose Models
- Llama 2 (7B, 13B, 70B): Meta's open-source models
- Llama 3 (8B, 70B): Latest generation with improved performance
- Mistral (7B): Efficient European model
- Mixtral (8x7B): Mixture of experts model
Specialized Models
- CodeLlama (7B, 13B, 34B): Optimized for code generation
- Llama2-Uncensored: Versions with reduced safety filters
- Vicuna: Fine-tuned for conversation
- WizardCoder: Enhanced coding capabilities
Lightweight Models
- TinyLlama (1B): Very fast, minimal resource requirements
- Phi-2 (3B): Microsoft's efficient small model
- Gemma (2B, 7B): Google's efficient models
Downloading Models
Download Popular Models
# Download Llama 2 (7B parameters)
ollama pull llama2
# Download Mistral 7B
ollama pull mistral
# Download Code Llama for programming tasks
ollama pull codellama
# Download smaller model for testing
ollama pull tinyllama
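Models can also be pulled by explicit tag to pin a size or quantization level; the tags below are examples, and the full list for each model appears on its page in the Ollama library.
# Pull a specific size or quantization tag (example tags; check the model's library page)
ollama pull llama3:8b
ollama pull llama2:7b-chat-q4_0
# Quick one-shot sanity check of a downloaded model
ollama run tinyllama "Reply with one short sentence."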
List Available Models
# List downloaded models
ollama list
# "ollama ls" is an alias for the same command
ollama ls
# To browse models available for download, see the Ollama library at https://ollama.ai/library
Remove Models
# Remove a model to free space
ollama rm model-name
Model Selection Guide
For General Tasks
- Llama 3 8B: Best balance of performance and resource usage
- Mistral 7B: Capable, efficient alternative from Mistral AI
- Llama 2 7B: Proven performance, widely tested
For Coding Tasks
- CodeLlama 7B: Specialized for code generation
- CodeLlama 13B: Better performance, more resources required
- WizardCoder: Enhanced coding capabilities
For Resource-Constrained Environments
- TinyLlama: Minimal resources, basic capabilities
- Phi-2: Better performance in small package
- Gemma 2B: Good balance of size and capability
For High-Performance Needs
- Llama 3 70B: Best open-source performance (requires significant resources)
- CodeLlama 34B: Best coding performance
- Mixtral 8x7B: Excellent performance with mixture of experts
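Before standardizing on one of the models above, it can help to inspect what a downloaded candidate ships with:
# Print a downloaded model's Modelfile (base layers, parameters, template, license reference)
ollama show --modelfile llama3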
Performance Optimization
Hardware Optimization
GPU Acceleration
# Install NVIDIA drivers and CUDA toolkit
# Ubuntu/Debian
sudo apt install nvidia-driver-535 nvidia-cuda-toolkit
# Verify GPU detection
nvidia-smi
# Ollama automatically uses GPU when available
ollama run llama2
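To confirm that inference is actually running on the GPU, recent Ollama releases provide ollama ps; a quick check while a model is loaded:
# Check where loaded models are running ("100% GPU" in the PROCESSOR column)
ollama ps
# GPU memory usage should also be visible here while the model is loaded
nvidia-smi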
Memory Management
- RAM: Ensure sufficient RAM for model size
- Swap: Configure swap space for large models
- Model Quantization: Use quantized models for lower memory usage
Storage Optimization
- SSD Storage: Use SSD for faster model loading
- Model Caching: Keep frequently used models loaded
- Storage Cleanup: Remove unused models regularly
Ollama Configuration
Environment Variables
# Set custom model storage location
export OLLAMA_MODELS=/path/to/models
# Restrict which NVIDIA GPUs Ollama may use
export CUDA_VISIBLE_DEVICES=0
# Limit how many models are kept loaded in memory at once
export OLLAMA_MAX_LOADED_MODELS=3
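Note that when Ollama runs as a systemd service, variables exported in your shell are not seen by the service. One way to make such settings persistent is a systemd drop-in, sketched below for a service named ollama:
# Persist environment settings for the systemd-managed service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_MODELS=/path/to/models"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
EOF
# Apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama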
Per-Model Parameters
Ollama does not use a standalone configuration file. Server-level behavior is controlled through the environment variables above, while per-model settings such as context length and sampling parameters are defined in a Modelfile or passed as options on individual API requests.
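A minimal sketch of the Modelfile workflow, deriving a custom variant from llama2 (the model name and parameter values are only examples):
# Write a simple Modelfile with example parameters
cat > Modelfile <<'EOF'
FROM llama2
PARAMETER num_ctx 4096
PARAMETER temperature 0.7
EOF
# Build and run the customized variant
ollama create llama2-custom -f Modelfile
ollama run llama2-custom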
Network Configuration
External Access
# Allow external connections (default: localhost only)
OLLAMA_HOST=0.0.0.0:11434 ollama serve
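If you expose Ollama beyond localhost, restrict which clients can reach the port; a sketch using ufw with an example subnet:
# Allow only a trusted subnet to reach the Ollama port (example subnet; adjust to your network)
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp
sudo ufw status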
Reverse Proxy Setup (Nginx)
server {
    listen 80;
    server_name your-ollama-server.com;
    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
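After saving the server block, validate the configuration, reload nginx, and confirm that requests reach Ollama through the proxy:
# Validate the configuration and reload nginx
sudo nginx -t
sudo systemctl reload nginx
# Confirm the proxy forwards API requests to Ollama
curl http://your-ollama-server.com/api/tags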
Integration with Vectense
Configure Ollama Model in Vectense
Step 1: Ensure Ollama is Running
# Check Ollama status
curl http://localhost:11434/api/tags
# Start if needed
ollama serve
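It is also worth confirming that the specific model you plan to register responds to generation requests. A minimal check against Ollama's generate endpoint, assuming the llama3 model has already been pulled:
# One-off, non-streaming generation request (the model must already be pulled)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Reply with the single word: ready",
  "stream": false
}'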
Step 2: Configure in Vectense
- Navigate to Models → "Configure new Model"
- Select "Ollama" as provider
- Configure settings:
- Name: Descriptive name (e.g., "Local Llama 3")
- Ollama Model: Exact model name (e.g., "llama3")
- Ollama Endpoint: Server URL (e.g., "http://localhost:11434")
Step 3: Test Connection
- Use built-in test to verify connectivity
- Check response quality and speed
- Adjust configuration if needed
Production Deployment
High Availability Setup
- Run multiple Ollama instances
- Use load balancer for distribution
- Implement health checks and failover
Monitoring and Logging
- Monitor resource usage (CPU, RAM, GPU)
- Track model response times
- Set up alerting for failures
Backup and Recovery
- Backup model files and configurations
- Document deployment procedures
- Test recovery processes
Security Considerations
Network Security
- Firewall: Restrict access to Ollama port (11434)
- VPN: Use VPN for remote access
- Authentication: Implement authentication proxy if needed
- HTTPS: Use TLS for external access
Data Security
- File Permissions: Secure model files and data
- Access Control: Limit system access to authorized users
- Audit Logging: Track access and usage
- Data Encryption: Encrypt sensitive data at rest
Model Security
- Model Verification: Verify model checksums
- Source Validation: Download models from trusted sources
- Version Control: Track model versions and updates
- Sandboxing: Consider containerized deployment
Troubleshooting
Installation Issues
Permission Errors
# Fix permissions for model directory
sudo chown -R $USER:$USER ~/.ollama
# Run with sudo if needed (not recommended for production)
sudo ollama serve
Port Conflicts
# Check what's using port 11434
sudo netstat -tulpn | grep 11434
# Use different port
OLLAMA_HOST=localhost:11435 ollama serve
Performance Issues
Out of Memory
- Use smaller models or quantized versions
- Increase system RAM or swap space
- Limit concurrent model loading
Slow Response Times
- Check available system resources
- Ensure GPU acceleration is working
- Use SSD storage for models
- Optimize model selection for hardware
Model Loading Failures
# Check available disk space
df -h
# Re-pull the model; the download re-verifies its layers
ollama pull model-name
# Check system logs
journalctl -u ollama
Connectivity Issues
Ollama Not Responding
# Check if process is running
ps aux | grep ollama
# Restart the service (systemd install)
sudo systemctl restart ollama
# Or start it manually in the foreground
ollama serve
# Check service logs for errors (Linux with systemd)
journalctl -u ollama -e
Network Access Problems
- Verify firewall settings
- Check network connectivity
- Confirm correct endpoint URL in Vectense
Cost Analysis
Infrastructure Costs
Hardware Investment
- Server hardware or cloud instances
- GPU costs for acceleration
- Storage for models and data
- Network infrastructure
Operational Costs
- Electricity and cooling
- System administration
- Maintenance and updates
- Backup and disaster recovery
Cost Comparison
Break-Even Analysis
Compare self-hosted costs with cloud API costs:
- Calculate monthly infrastructure costs
- Estimate token usage volume
- Compare with cloud provider pricing
- Consider growth projections
Example Calculation
Cloud price: $0.002 per 1K tokens
Monthly usage: 10M tokens
Cloud cost: $20/month
Self-hosted server: $200/month
Break-even: 100M tokens/month
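The break-even volume is simply the monthly self-hosted cost divided by the cloud price per token; with the numbers above:
# Break-even tokens/month = monthly self-hosted cost / cloud price per token
# $200 / ($0.002 per 1K tokens) = 100,000,000 tokens (100M) per month
echo "200 / (0.002 / 1000)" | bc -l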
Best Practices
Deployment
- Start Small: Begin with smaller models and scale up
- Test Thoroughly: Validate performance before production
- Monitor Resources: Track CPU, RAM, and GPU usage
- Plan Capacity: Size infrastructure for peak usage
Management
- Regular Updates: Keep Ollama and models updated
- Health Monitoring: Implement comprehensive monitoring
- Backup Strategy: Regular backups of models and config
- Documentation: Maintain deployment and operational docs
Security
- Access Control: Implement proper authentication
- Network Security: Use firewalls and VPNs
- Data Protection: Encrypt sensitive data
- Audit Logging: Track all access and changes
Advanced Topics
Model Fine-Tuning
- Use Ollama's fine-tuning capabilities
- Train models on your specific data
- Optimize for your use cases
- Balance performance and resource usage
Multi-Model Deployment
- Run multiple specialized models
- Load balance based on task type
- Optimize resource allocation
- Implement intelligent routing
Containerization
- Deploy Ollama in containers
- Use Kubernetes for orchestration
- Implement auto-scaling
- Ensure data persistence
Self-hosted LLMs provide powerful capabilities while maintaining complete control over your AI infrastructure. With proper planning and implementation, they offer an excellent alternative to cloud-based models for organizations with specific privacy, cost, or compliance requirements.