Why Local LLMs?
Local deployment of Large Language Models ensures that sensitive immigration-related queries never leave your organization's control. This is essential because:
- Immigration status information is highly sensitive
- Cloud APIs create subpoena-able records
- Third-party services can be discontinued or restricted
- User trust requires demonstrable privacy protection
Open-Source Model Selection
Recommended Models for Legal Q&A
| Model Family | Best Size | License | Legal Reasoning | Multilingual |
|---|---|---|---|---|
| Meta Llama 3.3 | 70B | Llama Community | Excellent (MMLU 86%) | Spanish: Excellent |
| Mistral/Mixtral | 8x22B MoE | Apache 2.0 | Very Good | Spanish: Excellent |
| Microsoft Phi-4 | 14B | MIT | Good (punches above weight) | Spanish: Good |
| Alibaba Qwen 2.5/3 | 32B-72B | Apache 2.0 | Excellent | 29+ languages |
| Google Gemma 3 | 27B | Gemma License | Good | Good |
Model Selection Criteria
For General KYR Inquiries (7B-14B sufficient):
- Simple rights questions
- Checkpoint procedures
- Document checklists
- Basic procedural information
For Complex Legal Interpretation (30B-70B recommended):
- Conditional immigration statutes
- Status-specific rights analysis
- Multi-factor legal scenarios
- Reduced hallucination risk
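The two tiers above can be expressed as a simple routing heuristic that sends a query to a small or large model based on rough complexity signals. The keyword list, word-count threshold, and model tags below are illustrative assumptions, not fixed recommendations:

```python
# Route a query to a small or large model based on rough complexity signals.
# Keywords, threshold, and model tags are illustrative placeholders.
COMPLEX_SIGNALS = ("statute", "status", "eligib", "waiver", "conditional")

def pick_model(query: str) -> str:
    """Return a model tag: small for general KYR questions, large for legal analysis."""
    q = query.lower()
    hits = sum(1 for kw in COMPLEX_SIGNALS if kw in q)
    # Statute-specific language or long multi-clause questions -> larger model
    if hits >= 1 or len(q.split()) > 60:
        return "llama3.3:70b"
    return "mistral:7b-instruct"
```

In production the router would sit in front of two inference endpoints; a misroute to the large model only costs latency, so it is safe to err toward the bigger tier.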
GPU Requirements
The central constraint for local LLM deployment is Video RAM (VRAM). The entire model plus conversation context must fit in GPU memory.
Quantization Reduces VRAM Requirements
Quantization compresses model weights from 16-bit to roughly 4-bit precision (Q4_K_M averages about 4.5 bits per weight):
- Roughly 70-75% VRAM reduction for the weights
- Typically only 3-5% quality loss on benchmarks
- Q4_K_M format recommended for a balanced quality/performance trade-off
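These numbers can be turned into a back-of-the-envelope VRAM estimate. The 4.5 bits/weight figure matches Q4_K_M's average, and the 20% overhead factor for KV cache and activations is a rough rule of thumb, not a measured constant:

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float = 4.5,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% for KV cache and activations."""
    weight_gb = params_billions * bits_per_weight / 8  # bits -> bytes; 1B bytes ~ 1 GB
    return round(weight_gb * overhead, 1)

# 7B at Q4_K_M fits comfortably in an 8 GB card; 70B needs ~48 GB
print(estimate_vram_gb(7), estimate_vram_gb(70))
```

This is why the hardware tiers below jump so sharply: a 70B model at 4-bit needs roughly ten times the VRAM of a 7B one.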
Hardware Specifications by Tier
| Tier | Model Size (4-bit) | VRAM Needed | Hardware | Cost | Speed |
|---|---|---|---|---|---|
| Entry-Level | 7B-8B | 6-8 GB | RTX 4060 | ~$350 | 40+ tok/s |
| Mid-Range | 13B-32B | 16-24 GB | RTX 4090 | ~$1,600 | 20-30 tok/s |
| High-End | 70B-72B | 40-48 GB | 2x RTX 4090 | ~$3,200 | 15-25 tok/s |
| Enterprise | 70B+ (FP16) | 140-160 GB | 2x A100 (80GB) | ~$25,000 | 50+ tok/s |
Apple Silicon Alternative
Apple M2/M3 Max with 96GB+ unified memory can run 70B quantized models without multi-GPU complexity:
- Single workstation deployment
- Lower raw throughput than dedicated GPUs
- Excellent for development and small-scale production
- Cost: ~$4,000-5,000
Deployment Frameworks
Ollama (Development/Testing)
Best for: Prototyping, local testing, single-user deployments
```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull mistral:7b-instruct-q4_K_M

# Run with API
ollama serve
```
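Once `ollama serve` is running, the server listens on `localhost:11434` and accepts JSON requests to `/api/generate`. A minimal client sketch, using only the standard library (the model tag matches the pull above; `stream: false` returns a single JSON object instead of a stream):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    """JSON body for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    """Send a prompt to a locally running Ollama server and return the reply text."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = request.Request(OLLAMA_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because everything stays on `localhost`, no query ever crosses the network boundary.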
Limitations:
- Queues requests (no concurrent batching)
- Struggles under multi-user load
- Not production-ready for high traffic
vLLM (Production)
Best for: Production environments with concurrent users
```bash
# Install vLLM
pip install vllm

# Launch server
# (add --quantization awq only when pointing at an AWQ-quantized checkpoint;
#  the FP16 repo below loads without a quantization flag)
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --max-model-len 4096
```
Advantages:
- PagedAttention for efficient KV cache management
- Massive continuous batching
- OpenAI-compatible API
Requirements:
- Models must fully load into VRAM
- More complex configuration
llama.cpp (Resource-Constrained)
Best for: Legacy hardware, Apple Silicon, CPU-only environments
```bash
# Build (current llama.cpp uses CMake; the old `make` build is deprecated)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Run server
./build/bin/llama-server -m models/mistral-7b-q4_K_M.gguf -c 4096
```
Advantages:
- Runs on CPU with acceptable speed for small models
- Native Apple Metal support
- GGUF format widely supported
Text Generation Inference (TGI)
Best for: Enterprise deployments with strict observability requirements
```bash
# Docker deployment
docker run --gpus all \
    -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id mistralai/Mistral-7B-Instruct-v0.3
```
Advantages:
- Enterprise-grade telemetry
- Containerized, scalable
- HuggingFace ecosystem integration
Network Isolation for Privacy
Air-Gapped Deployment
To ensure zero data leakage, configure network isolation:
```yaml
# docker-compose.yml
services:
  llm-server:
    image: vllm/vllm-openai:latest
    networks:
      - internal_only
    # No ports exposed to the external network

  chatbot-ui:
    image: chatbot-frontend:latest
    networks:
      - internal_only
      - web
    ports:
      - "443:443"  # Only the UI is exposed

networks:
  internal_only:
    internal: true  # No external internet access
  web:
    driver: bridge
```
Firewall Rules
```bash
# iptables matches rules in order, so the ACCEPT must come first:
# allow outbound traffic to the internal network only
iptables -A OUTPUT -m owner --uid-owner llm-user -d 10.0.0.0/8 -j ACCEPT

# Then block all other outbound traffic from the LLM service account
iptables -A OUTPUT -m owner --uid-owner llm-user -j DROP
```
Zero-Retention Logging
Memory-Only Processing
Configure the inference server to:
- Never write prompts to disk
- Never log conversation content
- Clear context on session close
```python
# vLLM configuration
from vllm import EngineArgs

engine_args = EngineArgs(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    disable_log_requests=True,  # No request logging
    disable_log_stats=True,     # No usage statistics
)
```
Secure Session Management
```python
# Wipe cached conversations when the application shuts down
# (per-session cleanup belongs in the WebSocket/HTTP disconnect handler)
@app.on_event("shutdown")
async def cleanup():
    # secure_delete is an application-specific helper that overwrites
    # the in-memory session cache before releasing it
    secure_delete(session_cache)
```
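`secure_delete` is not a library function; a minimal sketch follows, assuming conversation buffers are stored as `bytearray` objects. Python strings are immutable and cannot be overwritten in place, so this is best-effort hygiene (the garbage collector may still hold copies), not a cryptographic guarantee:

```python
def secure_delete(buffers: list) -> None:
    """Best-effort wipe: zero each mutable buffer in place, then drop all references.

    Only bytearray-style buffers can be overwritten; immutable str/bytes cannot.
    """
    for buf in buffers:
        if isinstance(buf, bytearray):
            for i in range(len(buf)):
                buf[i] = 0
    buffers.clear()
```

Storing sensitive conversation text in `bytearray` from the start is what makes this wipe possible at all.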
Model Download and Verification
Downloading Models Safely
```bash
# Use the HuggingFace CLI
huggingface-cli download \
    mistralai/Mistral-7B-Instruct-v0.3 \
    --local-dir ./models/mistral-7b \
    --local-dir-use-symlinks False

# Verify checksums
sha256sum ./models/mistral-7b/* | diff - checksums.txt
```
Converting to GGUF (for llama.cpp)
```bash
# Convert HF weights to GGUF at FP16, then quantize to Q4_K_M
# (convert_hf_to_gguf.py only emits f32/f16/q8_0; 4-bit formats need llama-quantize)
python convert_hf_to_gguf.py ./models/mistral-7b --outtype f16 \
    --outfile ./models/mistral-7b-f16.gguf
./build/bin/llama-quantize ./models/mistral-7b-f16.gguf \
    ./models/mistral-7b-q4_K_M.gguf Q4_K_M
```
Performance Optimization
Batch Size Tuning
For concurrent users, optimize batch size based on VRAM:
| Users | Batch Size | VRAM Overhead |
|---|---|---|
| 1-5 | 4 | Minimal |
| 5-20 | 16 | ~2GB additional |
| 20+ | 32+ | Consider multi-GPU |
KV Cache Management
vLLM's PagedAttention automatically manages KV cache, but for llama.cpp:
```bash
# Set context length to 4096 (sizes the KV cache) and offload 35 layers to the GPU
./build/bin/llama-server -m model.gguf -c 4096 --n-gpu-layers 35
```
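A worst-case bound on KV-cache memory follows directly from the model's attention geometry: two tensors (K and V) per layer, per token, per sequence. The Mistral 7B figures below (32 layers, 8 KV heads via grouped-query attention, head dim 128) are published architecture values; the batch and context sizes are examples:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
    """Worst-case KV cache: 2 tensors (K and V) per layer, per token, per sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size / 1024**3

# Mistral 7B (GQA): 32 layers, 8 KV heads, head_dim 128, fp16 cache,
# 16 concurrent sequences each at the full 4096-token context
print(kv_cache_gb(32, 8, 128, 4096, 16))  # -> 8.0 GiB worst case
```

Real usage is usually far lower because conversations rarely fill the full context; this is exactly the waste that vLLM's PagedAttention reclaims by allocating cache pages on demand.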
Cost Comparison: CapEx vs OpEx
On-Premises (Capital Expenditure)
| Item | One-Time Cost |
|---|---|
| 2x RTX 4090 | $3,200 |
| Workstation (CPU, RAM, storage) | $2,000 |
| Setup and configuration | Staff time |
| Total | ~$5,200 |
Cloud GPU Rental (Operating Expenditure)
| Instance | Hourly Cost | Monthly (24/7) |
|---|---|---|
| A100 (80GB) | ~$3.50/hr | ~$2,520/mo |
| A10G (24GB) | ~$1.00/hr | ~$720/mo |
Break-even: On-premises hardware pays for itself in 2-3 months vs A100 cloud rental.
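The break-even claim is simple division; with the ~$5,200 build and ~$2,520/month A100 rental from the tables above:

```python
def breakeven_months(hardware_cost: float, cloud_monthly: float) -> float:
    """Months of cloud rental it takes to exceed the one-time hardware cost."""
    return hardware_cost / cloud_monthly

print(round(breakeven_months(5200, 2520), 1))  # -> 2.1 months vs A100 rental
```

Even against the cheaper A10G instance (~$720/month), the hardware pays for itself in about seven months, and electricity and staff time are the only recurring costs after that.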
Recommended Starting Configuration
For most legal aid organizations:
Hardware
- GPU: Single RTX 4090 (24GB VRAM)
- CPU: 16+ cores
- RAM: 64GB
- Storage: 1TB NVMe SSD
Software
- OS: Ubuntu 22.04 LTS
- Framework: vLLM (production) or Ollama (development)
- Model: Mistral 7B Instruct Q4_K_M (start), upgrade to Llama 3.3 70B as needed
Model Selection
- Primary: Mistral 7B Instruct (fast, efficient, Apache 2.0)
- Upgrade path: Qwen 2.5 32B (better multilingual) → Llama 3.3 70B (best reasoning)
Next Steps
- Set up RAG pipeline - Connect to Know Your Rights content
- Implement safety guardrails - Required before any deployment
- Configure privacy architecture - Zero-retention, air-gapped operation