Why Build a Database?
Translating massive volumes of disparate FOIA releases, redacted contract PDFs, and dense inspection reports into actionable intelligence requires robust, relational databases.
The methodology pioneered by TRAC (Transactional Records Access Clearinghouse) at Syracuse University and the Deportation Data Project at UC Berkeley serves as the gold standard.
Core Principles
Event-Based Architecture
A functional detention monitoring database must move beyond static facility lists to track chronological movement of individuals through the enforcement pipeline.
Anonymous Unique Identifiers
Challenge: Government redacts actual Alien Registration Numbers (A-Numbers) to protect privacy.
Solution: Cryptographic hash creates unique identifier allowing individual tracking across datasets without exposing identity.
Original A-Number → SHA-256 Hash → Anonymous Unique ID
A123456789 → [hash] → AUI_a7b3c9d2e4f1
Database Schema
Core Tables
-- Arrest/Encounter Event
CREATE TABLE arrests (
arrest_id UUID PRIMARY KEY,
anonymous_uid VARCHAR(32) NOT NULL,
arrest_date DATE NOT NULL,
geographic_loc VARCHAR(100),
zip_code VARCHAR(10),
arresting_agency VARCHAR(50),
agency_type VARCHAR(20), -- ICE, CBP, 287g
landmark_text TEXT,
latitude DECIMAL(10,8),
longitude DECIMAL(11,8),
created_at TIMESTAMP DEFAULT NOW()
);
-- Detainer Requests
CREATE TABLE detainers (
detainer_id UUID PRIMARY KEY,
anonymous_uid VARCHAR(32) NOT NULL,
issue_date DATE NOT NULL,
jail_facility VARCHAR(100),
jail_state VARCHAR(2),
response_type VARCHAR(20), -- honored, ignored, released
response_date DATE,
FOREIGN KEY (anonymous_uid) REFERENCES arrests(anonymous_uid)
);
-- Detention Stints
CREATE TABLE detention (
detention_id UUID PRIMARY KEY,
anonymous_uid VARCHAR(32) NOT NULL,
facility_code VARCHAR(20) NOT NULL,
book_in_datetime TIMESTAMP NOT NULL,
book_out_datetime TIMESTAMP,
transfer_flag BOOLEAN DEFAULT FALSE,
stint_number INTEGER, -- nth detention for this individual
FOREIGN KEY (facility_code) REFERENCES facilities(facility_code)
);
-- Removal/Release Events
CREATE TABLE removals (
removal_id UUID PRIMARY KEY,
anonymous_uid VARCHAR(32) NOT NULL,
event_date DATE NOT NULL,
event_type VARCHAR(20), -- removal, release, voluntary
destination VARCHAR(100),
flight_id VARCHAR(50),
FOREIGN KEY (anonymous_uid) REFERENCES arrests(anonymous_uid)
);
-- Facility Reference
CREATE TABLE facilities (
facility_code VARCHAR(20) PRIMARY KEY,
facility_name VARCHAR(200) NOT NULL,
facility_type VARCHAR(20), -- SPC, CDF, IGSA, FRC
operator VARCHAR(100),
standards_applied VARCHAR(20), -- PBNDS2011, NDS2019
aor VARCHAR(20), -- Area of Responsibility
state VARCHAR(2),
city VARCHAR(100),
latitude DECIMAL(10,8),
longitude DECIMAL(11,8),
capacity INTEGER,
guaranteed_beds INTEGER,
per_diem_rate DECIMAL(8,2),
active BOOLEAN DEFAULT TRUE
);
Data Integration Challenges
Inconsistent Nomenclature
ICE data is plagued by inconsistencies:
| Challenge | Example |
|---|---|
| Facility name variations | "Adelanto" vs "Adelanto ICE Processing Center" |
| AOR code changes | Historical reorganizations |
| Typos | "Lumpkin" vs "Lumpkn" |
| Format shifts | Different field offices use different conventions |
Data Cleaning Pipeline
import pandas as pd
from fuzzywuzzy import fuzz
def standardize_facility_name(raw_name, facility_lookup):
"""
Match raw facility name to standardized facility_code
using fuzzy matching with manual review threshold
"""
best_match = None
best_score = 0
for code, standard_name in facility_lookup.items():
score = fuzz.ratio(raw_name.lower(), standard_name.lower())
if score > best_score:
best_score = score
best_match = code
if best_score >= 90:
return best_match
elif best_score >= 70:
# Flag for manual review
return f"REVIEW:{best_match}:{best_score}"
else:
return "UNMATCHED"
Geocoding Raw Data
ICE often provides only landmark text strings instead of coordinates.
import googlemaps
def geocode_landmark(landmark_text, state=None):
"""
Convert landmark description to coordinates
"""
gmaps = googlemaps.Client(key='YOUR_API_KEY')
query = landmark_text
if state:
query += f", {state}"
result = gmaps.geocode(query)
if result:
location = result[0]['geometry']['location']
return {
'latitude': location['lat'],
'longitude': location['lng'],
'formatted_address': result[0]['formatted_address']
}
return None
Calculating Key Metrics
Average Daily Population (ADP)
Method: Aggregate at facility-day level by counting distinct individuals present at midnight.
-- Calculate ADP for a facility in a given month
WITH daily_counts AS (
SELECT
facility_code,
date_trunc('day', gs.day) AS count_date,
COUNT(DISTINCT anonymous_uid) AS daily_pop
FROM
generate_series(
'2026-01-01'::date,
'2026-01-31'::date,
'1 day'::interval
) AS gs(day)
JOIN detention d ON
gs.day >= d.book_in_datetime::date
AND (d.book_out_datetime IS NULL
OR gs.day < d.book_out_datetime::date)
WHERE facility_code = 'ADELANTO'
GROUP BY facility_code, date_trunc('day', gs.day)
)
SELECT
facility_code,
AVG(daily_pop) AS average_daily_population,
MAX(daily_pop) AS peak_population,
MIN(daily_pop) AS minimum_population
FROM daily_counts
GROUP BY facility_code;
Length of Stay Calculations
-- Calculate average length of stay by facility
SELECT
facility_code,
AVG(
EXTRACT(EPOCH FROM (
COALESCE(book_out_datetime, NOW()) - book_in_datetime
)) / 86400
) AS avg_days,
PERCENTILE_CONT(0.5) WITHIN GROUP (
ORDER BY EXTRACT(EPOCH FROM (
COALESCE(book_out_datetime, NOW()) - book_in_datetime
)) / 86400
) AS median_days
FROM detention
WHERE book_in_datetime >= '2025-01-01'
GROUP BY facility_code
ORDER BY avg_days DESC;
Transfer Frequency Analysis
-- Identify individuals with high transfer counts
SELECT
anonymous_uid,
COUNT(*) AS transfer_count,
COUNT(DISTINCT facility_code) AS facilities_visited,
MIN(book_in_datetime) AS first_detention,
MAX(COALESCE(book_out_datetime, NOW())) AS last_activity
FROM detention
GROUP BY anonymous_uid
HAVING COUNT(*) > 3
ORDER BY transfer_count DESC;
Data Visualization
Interactive Filtering
Design public interfaces allowing filtering by:
| Filter | Options |
|---|---|
| Fiscal Year | FY2020-FY2026 |
| Area of Responsibility | 24 AORs |
| Facility Type | SPC, CDF, IGSA, FRC |
| Subpopulation | Family, juvenile, non-criminal |
| Operator | GEO, CoreCivic, IHSC, etc. |
Map Visualizations
Effective Approaches:
- Choropleth maps by state/AOR
- Graduated symbols for facility size
- Flow maps for transfer patterns
- Heat maps for enforcement concentration
Time Series Dashboards
Track over time:
- ADP by facility/AOR
- Book-in volumes
- Length of stay trends
- Transfer frequencies
- Death rates
Public API Design
Enabling External Research
GET /api/facilities
GET /api/facilities/{facility_code}
GET /api/facilities/{facility_code}/population?start=2025-01-01&end=2025-12-31
GET /api/statistics/adp?aor=SNA&fiscal_year=2026
GET /api/statistics/transfers?facility_code=ADELANTO
Response Format:
{
"facility_code": "ADELANTO",
"facility_name": "Adelanto ICE Processing Center",
"facility_type": "CDF",
"operator": "GEO Group",
"statistics": {
"period": "2026-01",
"average_daily_population": 1802,
"book_ins": 423,
"book_outs": 398,
"avg_length_of_stay_days": 45.2
}
}
Data Quality Assurance
Validation Rules
def validate_detention_record(record):
"""
Validate detention record integrity
"""
errors = []
# Required fields
if not record.get('anonymous_uid'):
errors.append("Missing anonymous_uid")
if not record.get('facility_code'):
errors.append("Missing facility_code")
if not record.get('book_in_datetime'):
errors.append("Missing book_in_datetime")
# Logical checks
if record.get('book_out_datetime'):
if record['book_out_datetime'] < record['book_in_datetime']:
errors.append("Book out before book in")
# Referential integrity
if record.get('facility_code'):
if not facility_exists(record['facility_code']):
errors.append(f"Unknown facility: {record['facility_code']}")
return errors
Handling Data Gaps
| Gap Type | Strategy |
|---|---|
| Missing book_out | Flag as "currently detained" |
| Unknown facility | Create placeholder, research later |
| Date inconsistencies | Flag for manual review |
| Duplicate records | Deduplication pipeline |
Update Frequency
Data Sources and Cadence
| Source | Update Frequency |
|---|---|
| FOIA bulk data | As received (irregular) |
| Public statistics | Weekly/monthly ICE releases |
| Facility inventory | Quarterly refresh |
| Contract data | Annual from USASpending |
| Inspection reports | As released |
Version Control
Maintain historical snapshots:
- Monthly database dumps
- Change logs for facility updates
- Data source provenance tracking
Infrastructure Requirements
Minimum Setup
| Component | Specification |
|---|---|
| Database | PostgreSQL 14+ |
| Storage | 100GB+ for multi-year data |
| API Server | Node.js or Python Flask |
| Visualization | D3.js, Leaflet, or Tableau |
Scaling Considerations
For large-scale analysis:
- Read replicas for query load
- Time-series partitioning for detention table
- Caching layer for common queries
- CDN for map tiles
Related Resources
- FOIA Strategies - Obtaining raw data
- Contract Research - Financial data integration
- Advocacy Applications - Using database outputs
- Flight Tracking - Linking deportation data