Why Local LLMs?
Local deployment of Large Language Models ensures that sensitive immigration-related queries never leave your organization's control. This is essential because:
- Immigration status information is highly sensitive
- Cloud APIs create subpoena-able records
- Third-party services can be discontinued or restricted
- User trust requires demonstrable privacy protection
Open-Source Model Selection
Recommended Models for Legal Q&A
| Model Family | Best Size | License | Legal Reasoning | Multilingual |
|---|---|---|---|---|
| Meta Llama 3.3 | 70B | Llama Community | Excellent (MMLU 86%) | Spanish: Excellent |
| Mistral/Mixtral | 8x22B MoE | Apache 2.0 | Very Good | Spanish: Excellent |
| Microsoft Phi-4 | 14B | MIT | Good (punches above weight) | Spanish: Good |
| Alibaba Qwen 2.5/3 | 32B-72B | Apache 2.0 | Excellent | 29+ languages |
| Google Gemma 3 | 27B | Gemma License | Good | Good |
Model Selection Criteria
For General KYR Inquiries (7B-14B sufficient):
- Simple rights questions
- Checkpoint procedures
- Document checklists
- Basic procedural information
For Complex Legal Interpretation (30B-70B recommended):
- Conditional immigration statutes
- Status-specific rights analysis
- Multi-factor legal scenarios
- Reduced hallucination risk
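The two tiers above can be expressed as a simple routing heuristic that sends a query to a small or large model based on rough complexity signals. The keyword list, word-count threshold, and model tags below are illustrative assumptions, not fixed recommendations:

```python
# Route a query to a small or large model based on rough complexity signals.
# Keywords, threshold, and model tags are illustrative placeholders.
COMPLEX_SIGNALS = ("statute", "status", "eligib", "waiver", "conditional")

def pick_model(query: str) -> str:
    """Return a model tag: small for general KYR questions, large for legal analysis."""
    q = query.lower()
    hits = sum(1 for kw in COMPLEX_SIGNALS if kw in q)
    # Statute-specific language or long multi-clause questions -> larger model
    if hits >= 1 or len(q.split()) > 60:
        return "llama3.3:70b"
    return "mistral:7b-instruct"
```

In production the router would sit in front of two inference endpoints; a misroute to the large model only costs latency, so it is safe to err toward the bigger tier.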
GPU Requirements
The central constraint for local LLM deployment is Video RAM (VRAM). The entire model plus conversation context must fit in GPU memory.
Quantization Reduces VRAM Requirements
Quantization compresses model weights from 16-bit to roughly 4-bit precision (Q4_K_M averages about 4.5 bits per weight):
- Roughly 70-75% VRAM reduction for the weights
- Typically only 3-5% quality loss on benchmarks
- Q4_K_M format recommended for a balanced quality/performance trade-off
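These numbers can be turned into a back-of-the-envelope VRAM estimate. The 4.5 bits/weight figure matches Q4_K_M's average, and the 20% overhead factor for KV cache and activations is a rough rule of thumb, not a measured constant:

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float = 4.5,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% for KV cache and activations."""
    weight_gb = params_billions * bits_per_weight / 8  # bits -> bytes; 1B bytes ~ 1 GB
    return round(weight_gb * overhead, 1)

# 7B at Q4_K_M fits comfortably in an 8 GB card; 70B needs ~48 GB
print(estimate_vram_gb(7), estimate_vram_gb(70))
```

This is why the hardware tiers below jump so sharply: a 70B model at 4-bit needs roughly ten times the VRAM of a 7B one.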
Hardware Specifications by Tier
| Tier | Model Size (4-bit) | VRAM Needed | Hardware | Cost | Speed |
|---|---|---|---|---|---|
| Entry-Level | 7B-8B | 6-8 GB | RTX 4060 | ~$350 | 40+ tok/s |
| Mid-Range | 13B-32B | 16-24 GB | RTX 4090 | ~$1,600 | 20-30 tok/s |
| High-End | 70B-72B | 40-48 GB | 2x RTX 4090 | ~$3,200 | 15-25 tok/s |
| Enterprise | 70B+ (FP16) | 140-160 GB | 2x A100 (80GB) | ~$25,000 | 50+ tok/s |
Apple Silicon Alternative
Apple M2/M3 Max with 96GB+ unified memory can run 70B quantized models without multi-GPU complexity:
- Single workstation deployment
- Lower raw throughput than dedicated GPUs
- Excellent for development and small-scale production
- Cost: ~$4,000-5,000
Deployment Frameworks
Ollama (Development/Testing)
Best for: Prototyping, local testing, single-user deployments
```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull mistral:7b-instruct-q4_K_M

# Run with API
ollama serve
```
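Once `ollama serve` is running, the server listens on `localhost:11434` and accepts JSON requests to `/api/generate`. A minimal client sketch, using only the standard library (the model tag matches the pull above; `stream: false` returns a single JSON object instead of a stream):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    """JSON body for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    """Send a prompt to a locally running Ollama server and return the reply text."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = request.Request(OLLAMA_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because everything stays on `localhost`, no query ever crosses the network boundary.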
Limitations:
- Queues requests (no concurrent batching)
- Struggles under multi-user load
- Not production-ready for high traffic
vLLM (Production)
Best for: Production environments with concurrent users
```bash
# Install vLLM
pip install vllm

# Launch server
# (add --quantization awq only when pointing at an AWQ-quantized checkpoint;
#  the FP16 repo below loads without a quantization flag)
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --max-model-len 4096
```
Advantages:
- PagedAttention for efficient KV cache management
- Massive continuous batching
- OpenAI-compatible API
Requirements:
- Models must fully load into VRAM
- More complex configuration
llama.cpp (Resource-Constrained)
Best for: Legacy hardware, Apple Silicon, CPU-only environments
```bash
# Build (current llama.cpp uses CMake; the old `make` build is deprecated)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Run server
./build/bin/llama-server -m models/mistral-7b-q4_K_M.gguf -c 4096
```
Advantages:
- Runs on CPU with acceptable speed for small models
- Native Apple Metal support
- GGUF format widely supported
Text Generation Inference (TGI)
Best for: Enterprise deployments with strict observability requirements
```bash
# Docker deployment
docker run --gpus all \
    -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id mistralai/Mistral-7B-Instruct-v0.3
```
Advantages:
- Enterprise-grade telemetry
- Containerized, scalable
- HuggingFace ecosystem integration
Network Isolation for Privacy
Air-Gapped Deployment
To ensure zero data leakage, configure network isolation:
```yaml
# docker-compose.yml
services:
  llm-server:
    image: vllm/vllm-openai:latest
    networks:
      - internal_only
    # No ports exposed to the external network

  chatbot-ui:
    image: chatbot-frontend:latest
    networks:
      - internal_only
      - web
    ports:
      - "443:443"  # Only the UI is exposed

networks:
  internal_only:
    internal: true  # No external internet access
  web:
    driver: bridge
```
Firewall Rules
```bash
# iptables matches rules in order, so the ACCEPT must come first:
# allow outbound traffic to the internal network only
iptables -A OUTPUT -m owner --uid-owner llm-user -d 10.0.0.0/8 -j ACCEPT

# Then block all other outbound traffic from the LLM service account
iptables -A OUTPUT -m owner --uid-owner llm-user -j DROP
```
Zero-Retention Logging
Memory-Only Processing
Configure the inference server to:
- Never write prompts to disk
- Never log conversation content
- Clear context on session close
```python
# vLLM configuration
from vllm import EngineArgs

engine_args = EngineArgs(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    disable_log_requests=True,  # No request logging
    disable_log_stats=True,     # No usage statistics
)
```
Secure Session Management
```python
# Wipe cached conversations when the application shuts down
# (per-session cleanup belongs in the WebSocket/HTTP disconnect handler)
@app.on_event("shutdown")
async def cleanup():
    # secure_delete is an application-specific helper that overwrites
    # the in-memory session cache before releasing it
    secure_delete(session_cache)
```
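`secure_delete` is not a library function; a minimal sketch follows, assuming conversation buffers are stored as `bytearray` objects. Python strings are immutable and cannot be overwritten in place, so this is best-effort hygiene (the garbage collector may still hold copies), not a cryptographic guarantee:

```python
def secure_delete(buffers: list) -> None:
    """Best-effort wipe: zero each mutable buffer in place, then drop all references.

    Only bytearray-style buffers can be overwritten; immutable str/bytes cannot.
    """
    for buf in buffers:
        if isinstance(buf, bytearray):
            for i in range(len(buf)):
                buf[i] = 0
    buffers.clear()
```

Storing sensitive conversation text in `bytearray` from the start is what makes this wipe possible at all.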
Model Download and Verification
Downloading Models Safely
```bash
# Use the HuggingFace CLI
huggingface-cli download \
    mistralai/Mistral-7B-Instruct-v0.3 \
    --local-dir ./models/mistral-7b \
    --local-dir-use-symlinks False

# Verify checksums
sha256sum ./models/mistral-7b/* | diff - checksums.txt
```
Converting to GGUF (for llama.cpp)
```bash
# Convert HF weights to GGUF at FP16, then quantize to Q4_K_M
# (convert_hf_to_gguf.py only emits f32/f16/q8_0; 4-bit formats need llama-quantize)
python convert_hf_to_gguf.py ./models/mistral-7b --outtype f16 \
    --outfile ./models/mistral-7b-f16.gguf
./build/bin/llama-quantize ./models/mistral-7b-f16.gguf \
    ./models/mistral-7b-q4_K_M.gguf Q4_K_M
```
Performance Optimization
Batch Size Tuning
For concurrent users, optimize batch size based on VRAM:
| Users | Batch Size | VRAM Overhead |
|---|---|---|
| 1-5 | 4 | Minimal |
| 5-20 | 16 | ~2GB additional |
| 20+ | 32+ | Consider multi-GPU |
KV Cache Management
vLLM's PagedAttention automatically manages KV cache, but for llama.cpp:
```bash
# Set context length to 4096 (sizes the KV cache) and offload 35 layers to the GPU
./build/bin/llama-server -m model.gguf -c 4096 --n-gpu-layers 35
```
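A worst-case bound on KV-cache memory follows directly from the model's attention geometry: two tensors (K and V) per layer, per token, per sequence. The Mistral 7B figures below (32 layers, 8 KV heads via grouped-query attention, head dim 128) are published architecture values; the batch and context sizes are examples:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
    """Worst-case KV cache: 2 tensors (K and V) per layer, per token, per sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size / 1024**3

# Mistral 7B (GQA): 32 layers, 8 KV heads, head_dim 128, fp16 cache,
# 16 concurrent sequences each at the full 4096-token context
print(kv_cache_gb(32, 8, 128, 4096, 16))  # -> 8.0 GiB worst case
```

Real usage is usually far lower because conversations rarely fill the full context; this is exactly the waste that vLLM's PagedAttention reclaims by allocating cache pages on demand.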
Cost Comparison: CapEx vs OpEx
On-Premises (Capital Expenditure)
| Item | One-Time Cost |
|---|---|
| 2x RTX 4090 | $3,200 |
| Workstation (CPU, RAM, storage) | $2,000 |
| Setup and configuration | Staff time |
| Total | ~$5,200 |
Cloud GPU Rental (Operating Expenditure)
| Instance | Hourly Cost | Monthly (24/7) |
|---|---|---|
| A100 (80GB) | ~$3.50/hr | ~$2,520/mo |
| A10G (24GB) | ~$1.00/hr | ~$720/mo |
Break-even: On-premises hardware pays for itself in 2-3 months vs A100 cloud rental.
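The break-even claim is simple division; with the ~$5,200 build and ~$2,520/month A100 rental from the tables above:

```python
def breakeven_months(hardware_cost: float, cloud_monthly: float) -> float:
    """Months of cloud rental it takes to exceed the one-time hardware cost."""
    return hardware_cost / cloud_monthly

print(round(breakeven_months(5200, 2520), 1))  # -> 2.1 months vs A100 rental
```

Even against the cheaper A10G instance (~$720/month), the hardware pays for itself in about seven months, and electricity and staff time are the only recurring costs after that.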
Recommended Starting Configuration
For most legal aid organizations:
Hardware
- GPU: Single RTX 4090 (24GB VRAM)
- CPU: 16+ cores
- RAM: 64GB
- Storage: 1TB NVMe SSD
Software
- OS: Ubuntu 22.04 LTS
- Framework: vLLM (production) or Ollama (development)
- Model: Mistral 7B Instruct Q4_K_M (start), upgrade to Llama 3.3 70B as needed
Model Selection
- Primary: Mistral 7B Instruct (fast, efficient, Apache 2.0)
- Upgrade path: Qwen 2.5 32B (better multilingual) → Llama 3.3 70B (best reasoning)
Next Steps
- Set up RAG pipeline - Connect to Know Your Rights content
- Implement safety guardrails - Required before any deployment
- Configure privacy architecture - Zero-retention, air-gapped operation