docs: add comprehensive deployment guide for docker and manual setups

feat: implement PP-Structure table extraction pipeline with GPU runtime configuration support
update
2026-04-27 10:06:38 +07:00 · 2026-04-27 00:51:23 +07:00 · 2026-04-26 22:08:41 +08:00 · 2026-04-26 18:15:38 +07:00 · 2026-04-26 17:16:47 +07:00 · 2026-04-26 13:10:44 +07:00
37 changed files with 5725 additions and 67 deletions
--- a/defaults/inference.pdiparams
+++ b/defaults/inference.pdiparams
--- a/defaults/inference.pdiparams.info
+++ b/defaults/inference.pdiparams.info
--- a/defaults/inference.pdmodel
+++ b/defaults/inference.pdmodel
--- a/.claude/settings.local.json
+++ b/.claude/settings.local.json
@@ -0,0 +1,18 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(python -m pytest tests/unit/test_personnel_text_fallback.py -x -q)",
+      "Bash(python -c \"import sys; print\\(sys.executable\\)\")",
+      "Bash(.venv/Scripts/python.exe -m pytest tests/unit/test_personnel_text_fallback.py -x -q)",
+      "Bash(.venv/Scripts/python.exe -m pytest tests/unit -x -q)",
+      "Bash(git stash *)",
+      "Bash(.venv/Scripts/python.exe -m pytest tests/unit/test_api.py::test_documents_sync_returns_pipeline_output -x -q)",
+      "Bash(.venv/Scripts/python.exe -m pytest tests/unit --ignore=tests/unit/test_api.py -q)",
+      "Bash(.venv/Scripts/python.exe -c ' *)",
+      "Bash(xargs grep *)",
+      "Bash(.venv/Scripts/python.exe -m pytest tests/unit -q --ignore=tests/unit/test_api.py --ignore=tests/unit/test_api_hitl.py --ignore=tests/unit/test_blob_storage.py)",
+      "Bash(.venv/Scripts/python.exe -m pytest tests/unit/test_ocr_layout.py tests/unit/test_personnel_text_fallback.py -q)",
+      "Bash(.venv/Scripts/python.exe -m pytest tests/unit/test_personnel_text_fallback.py tests/unit/test_ocr_layout.py -q)"
+    ]
+  }
+}
--- a/.env.example
+++ b/.env.example
@@ -10,7 +10,8 @@ STORAGE_LOCAL_DIR=./storage
 # ==== OCR ====
 OCR_LANG=latin                # PaddleOCR lang code; "latin" works well for Bahasa Indonesia
 OCR_USE_GPU=false             # set true if running on a GPU host
-OCR_DET_MODEL_DIR=             # leave empty to use PaddleOCR defaults
+# Leave empty to use PaddleOCR defaults.
+OCR_DET_MODEL_DIR=
 OCR_REC_MODEL_DIR=
 OCR_CLS_MODEL_DIR=
 OCR_MAX_IMAGE_SIDE=2200       # downscale longest side before OCR
--- a/13
+++ b/13
@@ -1,9 +1,10 @@
-.PHONY: help install dev fmt lint typecheck test test-cov run docker-build docker-up docker-down clean
+.PHONY: help install dev update fmt lint typecheck test test-cov run docker-build docker-up docker-down clean

 help:
 	@echo "Targets:"
 	@echo "  install       - install runtime + dev deps in current env"
 	@echo "  dev           - run FastAPI app with autoreload"
+	@echo "  update        - git pull + install deps + migrate db + run dev server"
 	@echo "  fmt           - format code with ruff"
 	@echo "  lint          - lint with ruff"
 	@echo "  typecheck     - run mypy"
@@ -21,6 +22,16 @@ install:
 dev:
 	uvicorn ocr_sprint.main:app --reload --host 0.0.0.0 --port 8000

+update:
+	@echo "[1/4] Pulling latest code..."
+	git pull
+	@echo "[2/4] Installing/updating dependencies..."
+	pip install -e ".[dev]"
+	@echo "[3/4] Running database migrations..."
+	alembic upgrade head
+	@echo "[4/4] Starting dev server..."
+	uvicorn ocr_sprint.main:app --reload --host 0.0.0.0 --port 8000
+
 fmt:
 	ruff format src tests
 	ruff check --fix src tests
--- a/docs/DEPLOYMENT-EXISTING-STACK.md
+++ b/docs/DEPLOYMENT-EXISTING-STACK.md
@@ -0,0 +1,858 @@
+# Deployment OCR Sprint Service (Existing Stack)
+
+Deployment guide for a server that already has Python 3.12.3, PostgreSQL 16.13, and Redis 7.0.15 installed.
+
+## Your Server Information
+
+- **OS**: Ubuntu 24.04
+- **Python**: 3.12.3 ✅
+- **PostgreSQL**: 16.13 ✅
+- **Redis**: 7.0.15 ✅
+
+All of these versions are compatible with OCR Sprint Service.
+
+## Step 1: Install System Libraries for OpenCV & PaddleOCR
+
+```bash
+# Update package list
+sudo apt update
+
+# Install the libraries required by OpenCV and PaddleOCR
+sudo apt install -y \
+    libgl1 \
+    libglib2.0-0 \
+    libsm6 \
+    libxext6 \
+    libxrender1 \
+    libgomp1 \
+    libmagic1 \
+    python3.12-venv \
+    python3.12-dev \
+    build-essential \
+    git
+```
+
+## Step 2: Set Up the PostgreSQL Database
+
+```bash
+# Log in to PostgreSQL
+sudo -u postgres psql
+```
+
+Run the following SQL commands:
+
+```sql
+-- Create the user and database; replace the placeholder with a strong generated password
+CREATE USER ocr WITH PASSWORD 'replace-with-strong-password';
+CREATE DATABASE ocr_sprint OWNER ocr;
+
+-- Grant privileges
+GRANT ALL PRIVILEGES ON DATABASE ocr_sprint TO ocr;
+
+-- Connect to the database to grant schema privileges
+\c ocr_sprint
+
+-- Grant schema privileges (PostgreSQL 15+)
+GRANT ALL ON SCHEMA public TO ocr;
+GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO ocr;
+GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO ocr;
+
+-- Verify
+\l ocr_sprint
+\du ocr
+
+-- Exit
+\q
+```
+
+**Generate a secure password:**
+
+```bash
+# Generate a random password
+openssl rand -base64 32
+```
+
+Save this password. It is used for the database user and again in `DATABASE_URL` later. Base64 output can contain `/`, `+`, and `=`, which must be percent-encoded when the password is placed inside a URL.
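
Because `openssl rand -base64` output can contain `/`, `+`, and `=`, the password generally has to be percent-encoded before it goes into `DATABASE_URL`. The following is a minimal sketch using only the Python standard library; the helper name `build_database_url` is hypothetical and not part of the service.

```python
from urllib.parse import quote


def build_database_url(user: str, password: str, host: str = "localhost",
                       port: int = 5432, dbname: str = "ocr_sprint") -> str:
    """Return a SQLAlchemy-style URL with the credentials percent-encoded."""
    # safe="" makes quote() also encode "/", which it leaves alone by default.
    return (
        f"postgresql+psycopg://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{dbname}"
    )


if __name__ == "__main__":
    # Example password only; never reuse a value printed in documentation.
    print(build_database_url("ocr", "abc/def+ghi="))
```

Generating the password with `openssl rand -hex 32` instead avoids the encoding step, since hex output has no URL-special characters.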
+
+## Step 3: Verify Redis
+
+```bash
+# Check Redis status
+sudo systemctl status redis-server
+
+# Test connection
+redis-cli ping
+# Expected output: PONG
+
+# Check Redis config (optional)
+redis-cli CONFIG GET maxmemory
+```
+
+If Redis is not running yet:
+
+```bash
+sudo systemctl enable redis-server
+sudo systemctl start redis-server
+```
+
+## Step 4: Create Application User
+
+```bash
+# Create a dedicated user for the application
+sudo useradd -m -s /bin/bash ocr
+
+# Create application directory
+sudo mkdir -p /opt/ocr-sprint-service
+sudo chown ocr:ocr /opt/ocr-sprint-service
+```
+
+## Step 5: Clone and Install the Application
+
+```bash
+# Switch to the ocr user
+sudo su - ocr
+
+# Clone repository
+cd /opt
+git clone https://github.com/Adriankf59/ocr-sprint-service.git
+cd ocr-sprint-service
+
+# Create a virtual environment with Python 3.12
+python3.12 -m venv .venv
+
+# Activate virtual environment
+source .venv/bin/activate
+
+# Verify the Python version inside the venv
+python --version
+# Expected: Python 3.12.3
+
+# Upgrade pip
+pip install --upgrade pip setuptools wheel
+
+# Install the application with its OCR dependencies
+# This downloads ~1.5 GB of PaddlePaddle wheels
+pip install -e ".[ocr]"
+
+# Verify installation
+python -c "import paddleocr; print('PaddleOCR OK')"
+python -c "import cv2; print('OpenCV OK')"
+python -c "import fastapi; print('FastAPI OK')"
+```
+
+## Step 6: Configure the Application
+
+```bash
+# Still as the ocr user
+cd /opt/ocr-sprint-service
+
+# Copy environment template
+cp .env.example .env
+
+# Edit the configuration
+nano .env
+```
+
+**Configuration in `/opt/ocr-sprint-service/.env`:**
+
+```bash
+# ==== App ====
+APP_ENV=prod
+APP_HOST=0.0.0.0
+APP_PORT=8000
+APP_LOG_LEVEL=INFO
+
+# ==== Storage ====
+STORAGE_LOCAL_DIR=/opt/ocr-sprint-service/storage
+BLOB_STORAGE_DIR=/opt/ocr-sprint-service/storage/blobs
+BLOB_MAX_UPLOAD_MB=25
+
+# ==== OCR ====
+OCR_LANG=latin
+OCR_USE_GPU=false
+OCR_MAX_IMAGE_SIDE=2200
+
+# ==== Preprocessing ====
+PREPROCESS_TARGET_DPI=300
+PREPROCESS_DENOISE=true
+PREPROCESS_DESKEW=true
+PREPROCESS_DETECT_DOCUMENT=true
+PREPROCESS_REMOVE_SHADOW=true
+PREPROCESS_MIN_QUAD_AREA_FRACTION=0.20
+
+# ==== Table Extraction ====
+TABLES_ENABLED=true
+
+# ==== Confidence ====
+CONFIDENCE_AUTO_APPROVE=0.95
+CONFIDENCE_NEEDS_REVIEW=0.85
+
+# ==== LLM (Phase 5, optional - disabled for now) ====
+LLM_ENABLED=false
+
+# ==== Async Pipeline ====
+QUEUE_ENABLED=true
+REDIS_URL=redis://localhost:6379/0
+CELERY_TASK_DEFAULT_QUEUE=ocr_sprint
+
+# ==== Database ====
+# Replace 'your-password-here' with the password you generated in Step 2 (percent-encode any URL-special characters)
+DATABASE_URL=postgresql+psycopg://ocr:your-password-here@localhost:5432/ocr_sprint
+DATABASE_ECHO=false
+
+# ==== Auth (REQUIRED for production!) ====
+# Generate with: openssl rand -hex 32
+API_KEYS=paste-api-key-1-here,paste-api-key-2-here
+API_KEY_HEADER=X-API-Key
+```
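
The two `CONFIDENCE_*` values above suggest a three-band routing of OCR results. How the service actually applies them is not shown in this diff, so the sketch below is only an assumption: the band names, the inclusive comparisons, and the function `route_by_confidence` are all hypothetical.

```python
AUTO_APPROVE = 0.95   # mirrors CONFIDENCE_AUTO_APPROVE from .env
NEEDS_REVIEW = 0.85   # mirrors CONFIDENCE_NEEDS_REVIEW from .env


def route_by_confidence(score: float) -> str:
    """Map a document confidence score to a hypothetical review band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be within [0, 1], got {score}")
    if score >= AUTO_APPROVE:
        return "auto_approve"
    if score >= NEEDS_REVIEW:
        return "needs_review"
    # Assumed label for everything below the lower threshold.
    return "manual_entry"
```

With these values, a document scored 0.93 would fall in the review band rather than being approved automatically.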
+
+**Generate API keys:**
+
+```bash
+# Generate 2 API keys
+echo "API Key 1: $(openssl rand -hex 32)"
+echo "API Key 2: $(openssl rand -hex 32)"
+```
+
+Copy the output and paste it into `API_KEYS` in the `.env` file.
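
If `openssl` is not at hand, the same kind of key can be produced from the project's virtual environment. This is a small sketch with the standard `secrets` module; the function name `generate_api_keys` is made up for illustration.

```python
import secrets


def generate_api_keys(count: int = 2) -> str:
    """Return `count` random hex keys joined with commas for API_KEYS."""
    # token_hex(32) encodes 32 random bytes as 64 hex characters,
    # the same shape as the output of `openssl rand -hex 32`.
    return ",".join(secrets.token_hex(32) for _ in range(count))


if __name__ == "__main__":
    print(f"API_KEYS={generate_api_keys()}")
```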
+
+**Create storage directories:**
+
+```bash
+mkdir -p /opt/ocr-sprint-service/storage/blobs
+chmod 755 /opt/ocr-sprint-service/storage
+```
+
+## Step 7: Run Database Migrations
+
+```bash
+# Still as the ocr user, with the venv activated
+cd /opt/ocr-sprint-service
+source .venv/bin/activate
+
+# Run migrations
+alembic upgrade head
+
+# Verify - should show current revision
+alembic current
+
+# Expected output: (head) or a revision number
+```
+
+## Step 8: Test a Manual Run
+
+```bash
+# Still as the ocr user
+cd /opt/ocr-sprint-service
+source .venv/bin/activate
+
+# Test API server
+uvicorn ocr_sprint.main:app --host 0.0.0.0 --port 8000
+```
+
+**In another terminal (as the ubuntu user):**
+
+```bash
+# Test health check
+curl http://localhost:8000/api/v1/health
+
+# Expected: {"status":"ok","version":"0.1.0"}
+
+# Test with a sample file (if you have one)
+curl -X POST "http://localhost:8000/api/v1/documents?sync=true" \
+  -H "X-API-Key: your-api-key-here" \
+  -F "file=@/path/to/test.pdf"
+```
+
+If this works, stop the server with `Ctrl+C`.
+
+## Step 9: Set Up Systemd Services
+
+```bash
+# Exit from the ocr user
+exit
+
+# Back as the ubuntu user, with sudo
+```
+
+### Create API Service
+
+```bash
+sudo nano /etc/systemd/system/ocr-sprint-api.service
+```
+
+**Content:**
+
+```ini
+[Unit]
+Description=OCR Sprint API Service
+After=network.target postgresql.service redis-server.service
+Wants=postgresql.service redis-server.service
+
+[Service]
+Type=simple
+User=ocr
+Group=ocr
+WorkingDirectory=/opt/ocr-sprint-service
+
+# Environment
+Environment="PATH=/opt/ocr-sprint-service/.venv/bin:/usr/local/bin:/usr/bin:/bin"
+EnvironmentFile=/opt/ocr-sprint-service/.env
+
+# Start command - 4 workers for production
+ExecStart=/opt/ocr-sprint-service/.venv/bin/uvicorn \
+    ocr_sprint.main:app \
+    --host 0.0.0.0 \
+    --port 8000 \
+    --workers 4 \
+    --log-level info
+
+# Restart policy
+Restart=always
+RestartSec=10
+StartLimitInterval=0
+
+# Resource limits
+LimitNOFILE=65536
+
+# Security
+NoNewPrivileges=true
+PrivateTmp=true
+
+[Install]
+WantedBy=multi-user.target
+```
+
+### Create Celery Worker Service
+
+```bash
+sudo nano /etc/systemd/system/ocr-sprint-worker.service
+```
+
+**Content:**
+
+```ini
+[Unit]
+Description=OCR Sprint Celery Worker
+After=network.target postgresql.service redis-server.service ocr-sprint-api.service
+Wants=postgresql.service redis-server.service
+
+[Service]
+Type=simple
+User=ocr
+Group=ocr
+WorkingDirectory=/opt/ocr-sprint-service
+
+# Environment
+Environment="PATH=/opt/ocr-sprint-service/.venv/bin:/usr/local/bin:/usr/bin:/bin"
+EnvironmentFile=/opt/ocr-sprint-service/.env
+
+# Start command - concurrency 2 for a 4-core CPU
+# Adjust to the number of CPU cores on your server
+ExecStart=/opt/ocr-sprint-service/.venv/bin/celery \
+    -A ocr_sprint.worker.celery_app \
+    worker \
+    --loglevel=info \
+    --concurrency=2 \
+    --max-tasks-per-child=100
+
+# Restart policy
+Restart=always
+RestartSec=10
+StartLimitInterval=0
+
+# Resource limits
+LimitNOFILE=65536
+
+# Security
+NoNewPrivileges=true
+PrivateTmp=true
+
+[Install]
+WantedBy=multi-user.target
+```
+
+### Enable and Start Services
+
+```bash
+# Reload systemd
+sudo systemctl daemon-reload
+
+# Enable services (auto-start on boot)
+sudo systemctl enable ocr-sprint-api
+sudo systemctl enable ocr-sprint-worker
+
+# Start services
+sudo systemctl start ocr-sprint-api
+sudo systemctl start ocr-sprint-worker
+
+# Check status
+sudo systemctl status ocr-sprint-api
+sudo systemctl status ocr-sprint-worker
+```
+
+**Expected output:** `active (running)`, shown in green.
+
+### View Logs
+
+```bash
+# API logs (real-time)
+sudo journalctl -u ocr-sprint-api -f
+
+# Worker logs (real-time)
+sudo journalctl -u ocr-sprint-worker -f
+
+# Last 50 lines
+sudo journalctl -u ocr-sprint-api -n 50
+sudo journalctl -u ocr-sprint-worker -n 50
+```
+
+## Step 10: Install and Set Up Nginx
+
+```bash
+# Install Nginx and Certbot
+sudo apt install -y nginx certbot python3-certbot-nginx
+
+# Check Nginx status
+sudo systemctl status nginx
+```
+
+### Create Nginx Configuration
+
+```bash
+sudo nano /etc/nginx/sites-available/ocr-sprint
+```
+
+**Content (replace `ocr.yourdomain.com` with your domain):**
+
+```nginx
+# Upstream
+upstream ocr_api {
+    server 127.0.0.1:8000;
+    keepalive 32;
+}
+
+# Rate limiting
+limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
+
+server {
+    listen 80;
+    server_name ocr.yourdomain.com;
+
+    # Max upload size
+    client_max_body_size 30M;
+    client_body_buffer_size 128k;
+
+    # Timeouts
+    proxy_connect_timeout 300s;
+    proxy_send_timeout 300s;
+    proxy_read_timeout 300s;
+    send_timeout 300s;
+
+    # Logging
+    access_log /var/log/nginx/ocr-sprint-access.log;
+    error_log /var/log/nginx/ocr-sprint-error.log;
+
+    # API endpoints
+    location /api/ {
+        limit_req zone=api_limit burst=20 nodelay;
+
+        proxy_pass http://ocr_api;
+        proxy_http_version 1.1;
+        
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+        proxy_set_header Connection "";
+        
+        proxy_buffering off;
+    }
+
+    # Health check
+    location /api/v1/health {
+        proxy_pass http://ocr_api;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+        access_log off;
+    }
+
+    # Metrics (restrict access)
+    location /metrics {
+        allow 127.0.0.1;
+        allow 10.0.0.0/8;
+        deny all;
+
+        proxy_pass http://ocr_api;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+    }
+
+    # API docs
+    location /docs {
+        proxy_pass http://ocr_api;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+    }
+
+    location /redoc {
+        proxy_pass http://ocr_api;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+    }
+}
+```
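
The `limit_req` settings above (`rate=10r/s`, `burst=20`, `nodelay`) can be hard to reason about, so here is a simplified per-client model of the leaky-bucket accounting. It is a sketch of my understanding of the behavior, not nginx's actual code, and the class name `LimitReqSim` is hypothetical.

```python
class LimitReqSim:
    """Simplified model of limit_req with nodelay for a single client."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate      # allowed requests per second
        self.burst = burst    # extra requests tolerated above the rate
        self.excess = 0.0     # requests currently "queued" in the bucket
        self.last = None      # time of the last accepted request

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at `now` (seconds) is accepted."""
        if self.last is None:
            # First request from this client: the bucket starts empty.
            self.last = now
            return True
        # The bucket drains at `rate` per second; this request adds one.
        excess = max(0.0, self.excess - self.rate * (now - self.last) + 1.0)
        if excess > self.burst:
            return False      # rejected; nginx answers 503 by default
        self.excess = excess
        self.last = now
        return True


if __name__ == "__main__":
    sim = LimitReqSim(rate=10, burst=20)
    accepted = sum(sim.allow(0.0) for _ in range(30))
    print(f"accepted {accepted} of 30 simultaneous requests")
```

Under this model a client can send a short burst above 10 requests per second without being delayed, after which it is held to the configured rate until the bucket drains.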
+
+### Enable Site
+
+```bash
+# Enable site
+sudo ln -s /etc/nginx/sites-available/ocr-sprint /etc/nginx/sites-enabled/
+
+# Test the configuration (after enabling, so the new site is included)
+sudo nginx -t
+
+# Reload Nginx
+sudo systemctl reload nginx
+```
+
+### Set Up SSL (if you have a domain)
+
+```bash
+# Obtain certificate
+sudo certbot --nginx -d ocr.yourdomain.com
+
+# Test auto-renewal
+sudo certbot renew --dry-run
+```
+
+## Step 11: Set Up the Firewall
+
+```bash
+# Check UFW status
+sudo ufw status
+
+# Allow SSH (IMPORTANT!)
+sudo ufw allow 22/tcp
+
+# Allow HTTP and HTTPS
+sudo ufw allow 80/tcp
+sudo ufw allow 443/tcp
+
+# Enable the firewall (if not already enabled)
+sudo ufw enable
+
+# Verify
+sudo ufw status numbered
+```
+
+## Step 12: Final Verification
+
+### Test dari Server
+
+```bash
+# Health check
+curl http://localhost:8000/api/v1/health
+
+# Test async endpoint
+curl -X POST http://localhost:8000/api/v1/documents \
+  -H "X-API-Key: your-api-key-here" \
+  -F "file=@/path/to/test.pdf"
+
+# Expected: {"job_id":"...","status":"pending",...}
+
+# Check job status
+curl -H "X-API-Key: your-api-key-here" \
+  http://localhost:8000/api/v1/documents/JOB_ID_HERE
+```
+
+### Test via Domain (if SSL is already set up)
+
+```bash
+curl https://ocr.yourdomain.com/api/v1/health
+```
+
+### Check Services
+
+```bash
+# All services should be active
+sudo systemctl status ocr-sprint-api
+sudo systemctl status ocr-sprint-worker
+sudo systemctl status postgresql
+sudo systemctl status redis-server
+sudo systemctl status nginx
+```
+
+## Monitoring
+
+### View Logs
+
+```bash
+# API logs
+sudo journalctl -u ocr-sprint-api -f
+
+# Worker logs
+sudo journalctl -u ocr-sprint-worker -f
+
+# Nginx access logs
+sudo tail -f /var/log/nginx/ocr-sprint-access.log
+
+# Nginx error logs
+sudo tail -f /var/log/nginx/ocr-sprint-error.log
+```
+
+### Prometheus Metrics
+
+```bash
+# View metrics
+curl http://localhost:8000/metrics
+
+# Key metrics:
+# - ocr_documents_total
+# - ocr_processing_duration_seconds
+# - ocr_confidence_score
+```
+
+## Maintenance
+
+### Restart Services
+
+```bash
+sudo systemctl restart ocr-sprint-api
+sudo systemctl restart ocr-sprint-worker
+```
+
+### Update Application
+
+```bash
+# Switch to the ocr user
+sudo su - ocr
+cd /opt/ocr-sprint-service
+
+# Pull latest code
+git pull
+
+# Activate venv
+source .venv/bin/activate
+
+# Update dependencies
+pip install -e ".[ocr]"
+
+# Run migrations
+alembic upgrade head
+
+# Exit
+exit
+
+# Restart services
+sudo systemctl restart ocr-sprint-api
+sudo systemctl restart ocr-sprint-worker
+
+# Check logs
+sudo journalctl -u ocr-sprint-api -n 50
+```
+
+### Database Backup
+
+```bash
+# Create backup directory
+sudo mkdir -p /opt/ocr-sprint-service/backups
+sudo chown ocr:ocr /opt/ocr-sprint-service/backups
+
+# Manual backup (the whole pipeline runs as ocr so the redirect can write to the ocr-owned directory)
+sudo -u ocr sh -c 'pg_dump -h localhost -U ocr ocr_sprint | gzip > /opt/ocr-sprint-service/backups/backup_$(date +%Y%m%d_%H%M%S).sql.gz'
+```
+
+**Setup automated backup:**
+
+```bash
+# Create backup script
+sudo nano /opt/ocr-sprint-service/backup.sh
+```
+
+```bash
+#!/bin/bash
+BACKUP_DIR="/opt/ocr-sprint-service/backups"
+DATE=$(date +%Y%m%d_%H%M%S)
+
+mkdir -p $BACKUP_DIR
+
+# Backup database
+PGPASSWORD='your-db-password' pg_dump -h localhost -U ocr ocr_sprint | gzip > $BACKUP_DIR/db_$DATE.sql.gz
+
+# Keep only last 7 days
+find $BACKUP_DIR -name "db_*.sql.gz" -mtime +7 -delete
+
+echo "Backup completed: $DATE"
+```
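
The retention rule in the script (`find ... -mtime +7 -delete`) is easy to misread: `find` truncates a file's age to whole days before comparing, so `+7` only matches files at least eight full days old. The sketch below mirrors that rule in Python for checking which backups would be removed; the function names are hypothetical.

```python
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 86_400


def is_expired(mtime: float, now: float, keep_days: int = 7) -> bool:
    """Mirror `find -mtime +keep_days`: age is truncated to whole days first."""
    age_days = int((now - mtime) // SECONDS_PER_DAY)
    return age_days > keep_days


def expired_backups(backup_dir: Path, keep_days: int = 7,
                    now: Optional[float] = None) -> List[Path]:
    """Return the db_*.sql.gz files in backup_dir that the cleanup would delete."""
    now = time.time() if now is None else now
    return [
        path
        for path in sorted(backup_dir.glob("db_*.sql.gz"))
        if is_expired(path.stat().st_mtime, now, keep_days)
    ]
```

Calling `expired_backups(Path("/opt/ocr-sprint-service/backups"))` before enabling the cron job gives a dry run of the cleanup without deleting anything.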
+
+```bash
+# Make executable
+sudo chmod +x /opt/ocr-sprint-service/backup.sh
+sudo chown ocr:ocr /opt/ocr-sprint-service/backup.sh
+
+# Setup cron (daily at 2 AM)
+sudo crontab -e -u ocr
+
+# Add line (the log goes to a directory the ocr user can write to):
+0 2 * * * /opt/ocr-sprint-service/backup.sh >> /opt/ocr-sprint-service/backups/backup.log 2>&1
+```
+
+## Troubleshooting
+
+### Service does not start
+
+```bash
+# Check detailed logs
+sudo journalctl -u ocr-sprint-api -n 100 --no-pager
+sudo journalctl -u ocr-sprint-worker -n 100 --no-pager
+
+# Check file permissions
+ls -la /opt/ocr-sprint-service
+ls -la /opt/ocr-sprint-service/storage
+
+# Test manual run
+sudo su - ocr
+cd /opt/ocr-sprint-service
+source .venv/bin/activate
+uvicorn ocr_sprint.main:app --host 0.0.0.0 --port 8000
+```
+
+### Database connection error
+
+```bash
+# Test connection
+sudo -u ocr psql -h localhost -U ocr -d ocr_sprint
+
+# Check PostgreSQL status
+sudo systemctl status postgresql
+
+# Check PostgreSQL logs
+sudo journalctl -u postgresql -n 50
+```
+
+### Redis connection error
+
+```bash
+# Test Redis
+redis-cli ping
+
+# Check Redis status
+sudo systemctl status redis-server
+
+# Check Redis logs
+sudo journalctl -u redis-server -n 50
+```
+
+### Worker is not processing jobs
+
+```bash
+# Check Celery worker status
+sudo su - ocr
+cd /opt/ocr-sprint-service
+source .venv/bin/activate
+celery -A ocr_sprint.worker.celery_app inspect active
+celery -A ocr_sprint.worker.celery_app inspect stats
+
+# Check Redis queue
+redis-cli LLEN ocr_sprint
+```
+
+### PaddleOCR error
+
+```bash
+# Re-download models
+sudo su - ocr
+cd /opt/ocr-sprint-service
+source .venv/bin/activate
+
+python << EOF
+from paddleocr import PaddleOCR
+ocr = PaddleOCR(use_angle_cls=True, lang='latin')
+print("Models downloaded successfully")
+EOF
+```
+
+## Performance Tuning
+
+### Check CPU cores
+
+```bash
+nproc
+```
+
+### Adjust worker concurrency
+
+```bash
+# Edit worker service
+sudo nano /etc/systemd/system/ocr-sprint-worker.service
+
+# For 4 cores: --concurrency=2
+# For 8 cores: --concurrency=4
+# For 16 cores: --concurrency=8
+
+# Reload and restart
+sudo systemctl daemon-reload
+sudo systemctl restart ocr-sprint-worker
+```
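
The three examples above all use half the available cores. A minimal sketch of that rule, assuming it generalizes to other core counts (the guide only lists 4, 8, and 16); `suggested_concurrency` is a hypothetical helper.

```python
import os


def suggested_concurrency(cpu_cores: int) -> int:
    """Half the cores, but never fewer than one worker process."""
    return max(1, cpu_cores // 2)


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print(f"{cores} cores -> --concurrency={suggested_concurrency(cores)}")
```

Leaving half the cores free gives the API workers and the OCR libraries' own threads room to run; treat the result as a starting point and adjust after load testing.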
+
+### PostgreSQL 16 Tuning
+
+```bash
+sudo nano /etc/postgresql/16/main/postgresql.conf
+```
+
+**Recommended settings (adjust to the server's RAM):**
+
+```
+# For 8GB RAM:
+shared_buffers = 2GB
+effective_cache_size = 6GB
+maintenance_work_mem = 512MB
+work_mem = 8MB
+
+# For 16GB RAM:
+shared_buffers = 4GB
+effective_cache_size = 12GB
+maintenance_work_mem = 1GB
+work_mem = 10MB
+
+# General
+checkpoint_completion_target = 0.9
+wal_buffers = 16MB
+default_statistics_target = 100
+random_page_cost = 1.1
+effective_io_concurrency = 200
+max_worker_processes = 4
+max_parallel_workers_per_gather = 2
+max_parallel_workers = 4
+```
+
+```bash
+sudo systemctl restart postgresql
+```
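
The two memory profiles above follow simple ratios: `shared_buffers` is 25% of RAM, `effective_cache_size` is 75%, and `maintenance_work_mem` is RAM divided by 16. The sketch below applies the same ratios to other sizes. It deliberately leaves out `work_mem`, whose two listed values (8MB and 10MB) do not follow a ratio, and the function name is hypothetical.

```python
from typing import Dict


def postgres_memory_settings(ram_gb: int) -> Dict[str, str]:
    """Derive memory settings from total RAM using the guide's ratios."""
    ram_mb = ram_gb * 1024
    return {
        "shared_buffers": f"{ram_mb // 4}MB",            # 25% of RAM
        "effective_cache_size": f"{ram_mb * 3 // 4}MB",  # 75% of RAM
        "maintenance_work_mem": f"{ram_mb // 16}MB",     # RAM / 16
    }


if __name__ == "__main__":
    for name, value in postgres_memory_settings(8).items():
        print(f"{name} = {value}")
```

These ratios assume PostgreSQL has the machine mostly to itself. On this server it shares RAM with the API and the OCR workers, so lower values may be more appropriate.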
+
+## Security Checklist
+
+- [ ] API keys set to strong random values
+- [ ] Database password changed from the default
+- [ ] Firewall enabled (UFW)
+- [ ] SSL/TLS enabled (if you have a domain)
+- [ ] `/metrics` endpoint restricted
+- [ ] PostgreSQL listens only on localhost
+- [ ] Redis listens only on localhost
+- [ ] Automated backups (cron job)
+- [ ] OS security updates enabled
+
+## Next Steps
+
+1. **Setup monitoring** - Install Prometheus + Grafana (optional)
+2. **Setup alerting** - Email/Slack notifications for errors
+3. **Load testing** - Test with production document volumes
+4. **Backup verification** - Test restoring from a backup
+5. **Documentation** - Document the API keys for the team
+
+## Support
+
+For questions or issues, contact the development team.
--- a/docs/DEPLOYMENT-GUIDE.md
+++ b/docs/DEPLOYMENT-GUIDE.md
@@ -0,0 +1,571 @@
+# OCR Sprint Service Deployment Guide
+
+> This document is a step-by-step guide to deploying **ocr-sprint-service** to a production server. It is based on the actual state of the code as of April 2026 (Phase 1–4 complete).
+
+---
+
+## Table of Contents
+
+1. [Gambaran Arsitektur](#1-gambaran-arsitektur)
+2. [Prasyarat Server](#2-prasyarat-server)
+3. [Opsi A — Docker Compose (Recommended)](#3-opsi-a--docker-compose-recommended)
+4. [Opsi B — Manual (Tanpa Docker)](#4-opsi-b--manual-tanpa-docker)
+5. [Konfigurasi Environment Production](#5-konfigurasi-environment-production)
+6. [Reverse Proxy & SSL (Nginx)](#6-reverse-proxy--ssl-nginx)
+7. [Firewall](#7-firewall)
+8. [Verifikasi Deployment](#8-verifikasi-deployment)
+9. [Monitoring & Maintenance](#9-monitoring--maintenance)
+10. [Troubleshooting](#10-troubleshooting)
+11. [Security Checklist](#11-security-checklist)
+
+---
+
+## 1. Gambaran Arsitektur
+
+```
+┌──────────┐     ┌──────────────┐     ┌───────┐
+│  Client  │────▶│  Nginx (SSL) │────▶│  API  │──▶ PaddleOCR
+└──────────┘     └──────────────┘     │ :8000 │      Pipeline
+                                      └───┬───┘
+                                          │ async job
+                                    ┌─────▼─────┐
+                                    │   Redis    │
+                                    │   :6379    │
+                                    └─────┬─────┘
+                                    ┌─────▼──────┐
+                                    │   Worker   │──▶ PaddleOCR
+                                    │  (Celery)  │      Pipeline
+                                    └─────┬──────┘
+                                    ┌─────▼──────┐
+                                    │ PostgreSQL │
+                                    │   :5432    │
+                                    └────────────┘
+```
+
+**4 services** must be running:
+
+| Service | Role |
+|---------|--------|
+| **API** (FastAPI + Uvicorn) | Accepts document uploads, serves OCR results |
+| **Worker** (Celery) | Asynchronous OCR processing in the background |
+| **Redis** | Message broker for the job queue |
+| **PostgreSQL** | Stores job state and extraction results |
+
+Blob storage uses the **local filesystem** (no S3/MinIO yet).
+
+---
+
+## 2. Prasyarat Server
+
+### Minimum Specifications
+
+| Resource | Minimum | Recommended |
+|----------|---------|-------------|
+| OS | Ubuntu 20.04+ / Debian 11+ | Ubuntu 22.04+ |
+| CPU | 4 cores | 8 cores |
+| RAM | 8 GB | 16 GB |
+| Storage | 50 GB free | 100 GB free |
+| Python | 3.10–3.12 | 3.11 or 3.12 |
+| Network | Port 8000 (internal) | + Port 80/443 (Nginx) |
+
+### Disk Requirements
+
+- ~1.5 GB: PaddlePaddle wheels
+- ~200 MB: PaddleOCR model downloads (automatic on first run)
+- The rest: blob storage for uploaded documents
+
+### Required Software
+
+- **Docker Compose**: for Option A
+- **Python 3.10–3.12 + PostgreSQL + Redis**: for Option B
+- **Git**: both options
+- **Nginx** (optional): reverse proxy + SSL
+
+---
+
+## 3. Opsi A — Docker Compose (Recommended)
+
+> The fastest route. All services (API, Worker, Redis, Postgres) run in containers.
+
+### 3.1 Login & Clone
+
+```bash
+ssh user@your-server.com
+
+git clone https://github.com/Adriankf59/ocr-sprint-service.git
+cd ocr-sprint-service
+```
+
+### 3.2 Configure .env
+
+```bash
+cp .env.example .env
+nano .env
+```
+
+See [Section 5](#5-konfigurasi-environment-production) for production configuration details.
+
+> [!IMPORTANT]
+> For Docker Compose, **do not change** `DATABASE_URL` and `REDIS_URL`; `docker-compose.yml` already overrides them through each container's environment variables.
+
+### 3.3 Build & Start
+
+```bash
+# Build the image (~5–10 minutes the first time)
+docker compose build
+
+# Start all services
+docker compose up -d
+
+# Check the logs
+docker compose logs -f api worker
+```
+
+The `api` container automatically runs `alembic upgrade head` before starting the server (see `command` in `docker-compose.yml`).
+
+### 3.4 First-Run Model Download
+
+The first request triggers a download of the PaddleOCR models (~200 MB) into the `paddle-models` Docker volume. Wait for it to finish before testing.
+
+```bash
+# Monitor the download in the logs
+docker compose logs -f api
+```
+
+### 3.5 Verify
+
+```bash
+curl http://localhost:8000/api/v1/health
+# Expected: {"status":"ok","version":"0.1.0"}
+```
+
+### 3.6 Update the Service (After Code Changes)
+
+```bash
+cd ocr-sprint-service
+git pull
+docker compose build
+docker compose up -d
+```
+
+---
+
+## 4. Opsi B — Manual (Tanpa Docker)
+
+> For servers that already have Python, PostgreSQL, and Redis installed.
+
+### 4.1 Install System Libraries
+
+```bash
+sudo apt update && sudo apt upgrade -y
+
+# Libraries for OpenCV & PaddleOCR
+sudo apt install -y \
+    python3.11 python3.11-venv python3.11-dev \
+    libgl1 libglib2.0-0 libsm6 libxext6 libxrender1 \
+    libgomp1 libmagic1 \
+    build-essential git curl
+
+# Install Redis & PostgreSQL (if not already installed)
+sudo apt install -y redis-server postgresql postgresql-contrib
+sudo systemctl enable --now redis-server postgresql
+```
+
+> [!NOTE]
+> If the server already has Python 3.12, use `python3.12` in all of the following commands.
+
+### 4.2 Setup Database
+
+```bash
+sudo -u postgres psql
+```
+
+```sql
+CREATE USER ocr WITH PASSWORD 'replace-with-strong-password';
+CREATE DATABASE ocr_sprint OWNER ocr;
+GRANT ALL PRIVILEGES ON DATABASE ocr_sprint TO ocr;
+\c ocr_sprint
+GRANT ALL ON SCHEMA public TO ocr;
+\q
+```
+
+### 4.3 Create Application User & Directory
+
+```bash
+sudo useradd -m -s /bin/bash ocr
+sudo mkdir -p /opt/ocr-sprint-service
+sudo chown ocr:ocr /opt/ocr-sprint-service
+```
+
+### 4.4 Clone & Install
+
+```bash
+sudo su - ocr
+cd /opt
+git clone https://github.com/Adriankf59/ocr-sprint-service.git
+cd ocr-sprint-service
+
+# Create virtual environment
+python3.11 -m venv .venv
+source .venv/bin/activate
+
+# Install dependencies + OCR runtime (~1.5 GB download)
+pip install --upgrade pip setuptools wheel
+pip install -e ".[ocr]"
+
+# Verify
+python -c "import paddleocr; print('PaddleOCR OK')"
+python -c "import fastapi; print('FastAPI OK')"
+```
+
+### 4.5 Configure .env
+
+```bash
+cp .env.example .env
+nano .env
+```
+
+**Must be changed for a manual deployment:**
+
+```bash
+APP_ENV=prod
+DATABASE_URL=postgresql+psycopg://ocr:replace-with-strong-password@localhost:5432/ocr_sprint
+REDIS_URL=redis://localhost:6379/0
+QUEUE_ENABLED=true
+API_KEYS=your-generated-api-key
+STORAGE_LOCAL_DIR=/opt/ocr-sprint-service/storage
+BLOB_STORAGE_DIR=/opt/ocr-sprint-service/storage/blobs
+```
+
+```bash
+# Create storage directories
+mkdir -p /opt/ocr-sprint-service/storage/blobs
+```
+
+### 4.6 Run Database Migrations
+
+```bash
+source .venv/bin/activate
+alembic upgrade head
+alembic current  # verify
+```
+
+### 4.7 Manual Test
+
+```bash
+uvicorn ocr_sprint.main:app --host 0.0.0.0 --port 8000
+# In another terminal: curl http://localhost:8000/api/v1/health
+# Ctrl+C to stop
+```
+
+### 4.8 Setup Systemd Services
+
+**API Service** — `/etc/systemd/system/ocr-sprint-api.service`:
+
+```ini
+[Unit]
+Description=OCR Sprint API Service
+After=network.target postgresql.service redis-server.service
+
+[Service]
+Type=simple
+User=ocr
+Group=ocr
+WorkingDirectory=/opt/ocr-sprint-service
+Environment="PATH=/opt/ocr-sprint-service/.venv/bin:/usr/local/bin:/usr/bin:/bin"
+EnvironmentFile=/opt/ocr-sprint-service/.env
+ExecStart=/opt/ocr-sprint-service/.venv/bin/uvicorn \
+    ocr_sprint.main:app \
+    --host 0.0.0.0 --port 8000 --workers 4 --log-level info
+Restart=always
+RestartSec=10
+LimitNOFILE=65536
+NoNewPrivileges=true
+
+[Install]
+WantedBy=multi-user.target
+```
+
+**Worker Service** — `/etc/systemd/system/ocr-sprint-worker.service`:
+
+```ini
+[Unit]
+Description=OCR Sprint Celery Worker
+After=network.target postgresql.service redis-server.service
+
+[Service]
+Type=simple
+User=ocr
+Group=ocr
+WorkingDirectory=/opt/ocr-sprint-service
+Environment="PATH=/opt/ocr-sprint-service/.venv/bin:/usr/local/bin:/usr/bin:/bin"
+EnvironmentFile=/opt/ocr-sprint-service/.env
+ExecStart=/opt/ocr-sprint-service/.venv/bin/celery \
+    -A ocr_sprint.worker.celery_app worker \
+    --loglevel=info --concurrency=2 --max-tasks-per-child=100
+Restart=always
+RestartSec=10
+LimitNOFILE=65536
+NoNewPrivileges=true
+
+[Install]
+WantedBy=multi-user.target
+```
+
+**Enable & Start:**
+
+```bash
+# Leave the ocr user session first
+exit
+
+sudo systemctl daemon-reload
+sudo systemctl enable --now ocr-sprint-api ocr-sprint-worker
+sudo systemctl status ocr-sprint-api ocr-sprint-worker
+```
+
+### 4.9 Update Service (Manual)
+
+```bash
+sudo su - ocr
+cd /opt/ocr-sprint-service
+git pull
+source .venv/bin/activate
+pip install -e ".[ocr]"
+alembic upgrade head
+exit
+
+sudo systemctl restart ocr-sprint-api ocr-sprint-worker
+```
+
+---
+
+## 5. Konfigurasi Environment Production
+
+The following `.env` settings **must be changed** from their defaults for production:
+
+| Variable | Default | Production | Notes |
+|----------|---------|------------|------------|
+| `APP_ENV` | `local` | `prod` | Environment mode |
+| `API_KEYS` | *(empty)* | `key1,key2` | **REQUIRED!** Auth is disabled when empty |
+| `QUEUE_ENABLED` | `false` | `true` | Enables async processing |
+| `DATABASE_URL` | `sqlite:///...` | `postgresql+psycopg://...` | Docker: overridden automatically |
+| `REDIS_URL` | `redis://localhost:6379/0` | Adjust as needed | Docker: overridden automatically |
+| `OCR_USE_GPU` | `false` | `true` if a GPU is available | GPU mode requires the NVIDIA driver |
+| `TABLES_ENABLED` | `true` | `true` | Personnel table extraction |
+
+**Generate API Key:**
+
+```bash
+openssl rand -hex 32
+```
+
+> [!WARNING]
+> Never deploy to production without setting `API_KEYS`. If it is empty, every endpoint is open without authentication.
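
Since an empty `API_KEYS` silently disables authentication, a pre-flight check before starting the services can catch the mistake. The sketch below is not code from the service; it assumes the variable is a comma-separated list, as in the table above, and the function names are hypothetical.

```python
from typing import Dict, List


def parse_api_keys(raw: str) -> List[str]:
    """Split a comma-separated API_KEYS value, dropping blanks and whitespace."""
    return [key.strip() for key in raw.split(",") if key.strip()]


def check_production_auth(env: Dict[str, str]) -> List[str]:
    """Return the configured keys, refusing a prod config that has none."""
    keys = parse_api_keys(env.get("API_KEYS", ""))
    if env.get("APP_ENV") == "prod" and not keys:
        raise RuntimeError("API_KEYS is empty: refusing to run prod without authentication")
    return keys


if __name__ == "__main__":
    import os
    print(f"{len(check_production_auth(dict(os.environ)))} API key(s) configured")
```

Running this with the same environment file that systemd or Docker loads makes the check reflect what the service will actually see.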
+
+---
+
+## 6. Reverse Proxy & SSL (Nginx)
+
+### Install
+
+```bash
+sudo apt install -y nginx certbot python3-certbot-nginx
+```
+
+### Configuration: `/etc/nginx/sites-available/ocr-sprint`
+
+```nginx
+upstream ocr_api {
+    server 127.0.0.1:8000;
+    keepalive 32;
+}
+
+server {
+    listen 80;
+    server_name ocr.yourdomain.com;
+
+    client_max_body_size 30M;
+
+    proxy_connect_timeout 300s;
+    proxy_read_timeout 300s;
+
+    location / {
+        proxy_pass http://ocr_api;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+    }
+
+    location /metrics {
+        allow 127.0.0.1;
+        allow 10.0.0.0/8;
+        deny all;
+        proxy_pass http://ocr_api;
+    }
+}
+```
+
+### Enable & SSL
+
+```bash
+sudo ln -s /etc/nginx/sites-available/ocr-sprint /etc/nginx/sites-enabled/
+sudo nginx -t
+sudo systemctl reload nginx
+
+# SSL
+sudo certbot --nginx -d ocr.yourdomain.com
+```
+
+---
+
+## 7. Firewall
+
+```bash
+sudo ufw allow 22/tcp    # SSH (important: allow before enabling)
+sudo ufw allow 80/tcp    # HTTP
+sudo ufw allow 443/tcp   # HTTPS
+sudo ufw enable
+sudo ufw status
+```
+
+> [!CAUTION]
+> Make sure SSH (port 22) is allowed **before** enabling the firewall, so you do not lock yourself out of the server.
+
+---
+
+## 8. Deployment Verification
+
+### Health Check
+
+```bash
+curl http://localhost:8000/api/v1/health
+# {"status":"ok","version":"0.1.0"}
+```
+
+### Test OCR (Sync)
+
+```bash
+curl -X POST "http://localhost:8000/api/v1/documents?sync=true" \
+  -H "X-API-Key: your-api-key" \
+  -F "file=@/path/to/test.pdf" | jq
+```
+
+### Test OCR (Async — Production Flow)
+
+```bash
+# Submit job
+curl -X POST http://localhost:8000/api/v1/documents \
+  -H "X-API-Key: your-api-key" \
+  -F "file=@document.pdf" | jq
+# → {"job_id":"8f2a...","status":"pending",...}
+
+# Poll result
+curl -H "X-API-Key: your-api-key" \
+  http://localhost:8000/api/v1/documents/8f2a... | jq
+# → {"status":"completed","confidence":0.93,"data":{...}}
+```
+
+### Check That All Services Are Running
+
+```bash
+# Docker
+docker compose ps
+
+# Manual
+sudo systemctl status ocr-sprint-api ocr-sprint-worker postgresql redis-server nginx
+```
+
+---
+
+## 9. Monitoring & Maintenance
+
+### Logs
+
+```bash
+# Docker
+docker compose logs -f api worker
+
+# Manual (systemd)
+sudo journalctl -u ocr-sprint-api -f
+sudo journalctl -u ocr-sprint-worker -f
+```
+
+### Prometheus Metrics
+
+```bash
+curl http://localhost:8000/metrics
+```
+
+Key metrics: `ocr_documents_total`, `ocr_processing_duration_seconds`, `ocr_confidence_score`.
+
+### Backup Database
+
+```bash
+# Docker
+docker compose exec -T postgres pg_dump -U ocr ocr_sprint > backup_$(date +%Y%m%d).sql
+
+# Manual
+pg_dump -U ocr -h localhost ocr_sprint | gzip > backup_$(date +%Y%m%d).sql.gz
+```
+
+### Automated Backup (Cron)
+
+```bash
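+#!/bin/bash
+# /opt/ocr-sprint-service/backup.sh
+# Cron has no terminal for a password prompt, so pg_dump needs credentials
+# from the ocr user's ~/.pgpass (or PGPASSWORD).
+set -euo pipefail
+BACKUP_DIR="/opt/ocr-sprint-service/backups"
+mkdir -p "$BACKUP_DIR"
+pg_dump -U ocr -h localhost ocr_sprint | gzip > "$BACKUP_DIR/db_$(date +%Y%m%d_%H%M%S).sql.gz"
+find "$BACKUP_DIR" -name "db_*.sql.gz" -mtime +7 -delete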
+```
+
+```bash
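+chmod +x /opt/ocr-sprint-service/backup.sh
+# The ocr user cannot create files in /var/log, so pre-create the log file
+sudo touch /var/log/ocr-backup.log
+sudo chown ocr:ocr /var/log/ocr-backup.log
+# Cron: daily at 2 AM (appends to the ocr user's existing crontab instead of replacing it)
+(sudo crontab -u ocr -l 2>/dev/null; echo "0 2 * * * /opt/ocr-sprint-service/backup.sh >> /var/log/ocr-backup.log 2>&1") | sudo crontab -u ocr -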
+```
+
+---
+
+## 10. Troubleshooting
+
+| Problem | Diagnosis | Solution |
+|---------|-----------|----------|
+| Service does not start | `journalctl -u ocr-sprint-api -n 100` | Check permissions, `.env`, and the error log |
+| PaddleOCR model download fails | Timeout in the logs | `python -c "from paddleocr import PaddleOCR; PaddleOCR(lang='latin')"` |
+| Worker does not process jobs | `redis-cli ping` does not return PONG | Make sure Redis is running; check `REDIS_URL` |
+| Database migration error | `alembic current` | Run `alembic upgrade head`. Use `alembic stamp head` only when the schema already matches the latest revision, because it records the revision without running any migration |
+| Port 8000 already in use | `ss -tlnp \| grep 8000` | Kill the old process or change the port in `.env` |
+| Out of memory | OOM killer in the logs | Lower the worker `--concurrency`, or add RAM |
+
+---
+
+## 11. Security Checklist
+
+- [ ] `API_KEYS` set to random keys (`openssl rand -hex 32`)
+- [ ] Database password changed from the default
+- [ ] Firewall active (only ports 22, 80, 443 open)
+- [ ] SSL/TLS active via Nginx + Let's Encrypt
+- [ ] `/metrics` endpoint restricted to the internal network
+- [ ] Automated database backup via cron
+- [ ] OS security updates enabled (`unattended-upgrades`)
+- [ ] `APP_ENV=prod` (not `local`)
+
+---
+
+## Quick Reference: Everyday Commands
+
+```bash
+# === Docker ===
+docker compose up -d          # Start
+docker compose down            # Stop
+docker compose logs -f api     # Logs
+docker compose build && docker compose up -d  # Update
+
+# === Manual ===
+sudo systemctl restart ocr-sprint-api ocr-sprint-worker  # Restart
+sudo journalctl -u ocr-sprint-api -f                     # Logs
+curl http://localhost:8000/api/v1/health                  # Health check
+```
--- a/docs/DEPLOYMENT-MANUAL.md
+++ b/docs/DEPLOYMENT-MANUAL.md
@@ -0,0 +1,943 @@
+# OCR Sprint Service Manual Deployment (Without Docker)
+
+A complete guide to deploying OCR Sprint Service directly on a server without Docker.
+
+## Server Prerequisites
+
+### Minimum Specifications
+- **OS**: Ubuntu 20.04+ / Debian 11+ / RHEL 8+
+- **CPU**: 4 cores (8 cores recommended)
+- **RAM**: 8 GB minimum (16 GB recommended)
+- **Storage**: 50 GB free space
+- **User**: Non-root user with sudo access
+
+### Required Ports
+- `8000`: API server (internal, proxied by Nginx)
+- `80/443`: HTTP/HTTPS (Nginx)
+- `5432`: PostgreSQL (localhost only)
+- `6379`: Redis (localhost only)
+
+## Step 1: Install System Dependencies
+
+### Ubuntu/Debian
+
+```bash
+# Update system
+sudo apt update && sudo apt upgrade -y
+
+# Install Python 3.11 (the deadsnakes PPA is Ubuntu-only; on Debian, use the
+# distribution's python3.11 packages where available and skip add-apt-repository)
+sudo apt install -y software-properties-common
+sudo add-apt-repository ppa:deadsnakes/ppa -y
+sudo apt update
+sudo apt install -y python3.11 python3.11-venv python3.11-dev python3-pip
+
+# Install system libraries for OpenCV and PaddleOCR
+sudo apt install -y \
+    libgl1 \
+    libglib2.0-0 \
+    libsm6 \
+    libxext6 \
+    libxrender1 \
+    libgomp1 \
+    libmagic1 \
+    build-essential \
+    git \
+    curl \
+    wget
+
+# Install Redis
+sudo apt install -y redis-server
+sudo systemctl enable redis-server
+sudo systemctl start redis-server
+
+# Install PostgreSQL
+sudo apt install -y postgresql postgresql-contrib
+sudo systemctl enable postgresql
+sudo systemctl start postgresql
+```
+
+### RHEL/CentOS/Rocky Linux
+
+```bash
+# Update system
+sudo dnf update -y
+
+# Install Python 3.11
+sudo dnf install -y python3.11 python3.11-devel python3.11-pip
+
+# Install system libraries
+sudo dnf install -y \
+    mesa-libGL \
+    glib2 \
+    libSM \
+    libXext \
+    libXrender \
+    file-libs \
+    gcc \
+    gcc-c++ \
+    make \
+    git
+
+# Install Redis
+sudo dnf install -y redis
+sudo systemctl enable redis
+sudo systemctl start redis
+
+# Install PostgreSQL
+sudo dnf install -y postgresql-server postgresql-contrib
+sudo postgresql-setup --initdb
+sudo systemctl enable postgresql
+sudo systemctl start postgresql
+```
+
+## Step 2: Set Up the PostgreSQL Database
+
+```bash
+# Connect as the postgres user
+sudo -u postgres psql
+
+# Then run the following SQL commands:
+```
+
+```sql
+-- Create user and database
+CREATE USER ocr WITH PASSWORD 'ganti-dengan-password-kuat';
+CREATE DATABASE ocr_sprint OWNER ocr;
+
+-- Grant privileges
+GRANT ALL PRIVILEGES ON DATABASE ocr_sprint TO ocr;
+
+-- Connect to the database
+\c ocr_sprint
+
+-- Grant schema privileges (PostgreSQL 15+)
+GRANT ALL ON SCHEMA public TO ocr;
+
+-- Exit
+\q
+```
+
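+**PostgreSQL access configuration (optional):**
+
+```bash
+# Paths below assume PostgreSQL 14 on Ubuntu/Debian; adjust the version directory to your installation
+# Edit postgresql.conf
+sudo nano /etc/postgresql/14/main/postgresql.conf
+
+# Uncomment and set (keep localhost for security):
+listen_addresses = 'localhost'
+
+# Edit pg_hba.conf
+sudo nano /etc/postgresql/14/main/pg_hba.conf
+
+# Add these lines ("local" covers Unix-socket connections, "host" covers the
+# TCP connection to localhost used by DATABASE_URL):
+local   ocr_sprint      ocr                                     scram-sha-256
+host    ocr_sprint      ocr             127.0.0.1/32            scram-sha-256
+
+# Restart PostgreSQL
+sudo systemctl restart postgresql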
+```
+
+## Step 3: Set Up the Application User
+
+```bash
+# Create a dedicated user for the application
+sudo useradd -m -s /bin/bash ocr
+sudo usermod -aG sudo ocr  # Optional, for maintenance
+
+# Create application directory
+sudo mkdir -p /opt/ocr-sprint-service
+sudo chown ocr:ocr /opt/ocr-sprint-service
+
+# Switch to the ocr user
+sudo su - ocr
+```
+
+## Step 4: Install Application
+
+```bash
+# Clone repository
+cd /opt
+git clone https://github.com/Adriankf59/ocr-sprint-service.git
+cd ocr-sprint-service
+
+# Create virtual environment
+python3.11 -m venv .venv
+
+# Activate virtual environment
+source .venv/bin/activate
+
+# Upgrade pip
+pip install --upgrade pip setuptools wheel
+
+# Install the application with OCR dependencies
+pip install -e ".[ocr]"
+
+# Verify installation
+python -c "import paddleocr; print('PaddleOCR installed successfully')"
+```
+
+## Step 5: Configure the Application
+
+```bash
+# Copy environment template
+cp .env.example .env
+
+# Edit the configuration
+nano .env
+```
+
+**Production configuration (`/opt/ocr-sprint-service/.env`):**
+
+```bash
+# ==== App ====
+APP_ENV=prod
+APP_HOST=0.0.0.0
+APP_PORT=8000
+APP_LOG_LEVEL=INFO
+
+# ==== Storage ====
+STORAGE_LOCAL_DIR=/opt/ocr-sprint-service/storage
+BLOB_STORAGE_DIR=/opt/ocr-sprint-service/storage/blobs
+BLOB_MAX_UPLOAD_MB=25
+
+# ==== OCR ====
+OCR_LANG=latin
+OCR_USE_GPU=false
+OCR_MAX_IMAGE_SIDE=2200
+
+# ==== Preprocessing ====
+PREPROCESS_TARGET_DPI=300
+PREPROCESS_DENOISE=true
+PREPROCESS_DESKEW=true
+PREPROCESS_DETECT_DOCUMENT=true
+PREPROCESS_REMOVE_SHADOW=true
+PREPROCESS_MIN_QUAD_AREA_FRACTION=0.20
+
+# ==== Table Extraction ====
+TABLES_ENABLED=true
+
+# ==== Confidence ====
+CONFIDENCE_AUTO_APPROVE=0.95
+CONFIDENCE_NEEDS_REVIEW=0.85
+
+# ==== LLM (Phase 5, optional) ====
+LLM_ENABLED=false
+
+# ==== Async Pipeline ====
+QUEUE_ENABLED=true
+REDIS_URL=redis://localhost:6379/0
+CELERY_TASK_DEFAULT_QUEUE=ocr_sprint
+
+# ==== Database ====
+DATABASE_URL=postgresql+psycopg://ocr:ganti-dengan-password-kuat@localhost:5432/ocr_sprint
+DATABASE_ECHO=false
+
+# ==== Auth (REQUIRED!) ====
+API_KEYS=key1-ganti-dengan-random-string,key2-ganti-dengan-random-string
+API_KEY_HEADER=X-API-Key
+```
+
+**Generate secure API keys:**
+
+```bash
+# Generate 2 API keys
+openssl rand -hex 32
+openssl rand -hex 32
+```
+
+**Create storage directories:**
+
+```bash
+mkdir -p /opt/ocr-sprint-service/storage/blobs
+chmod 755 /opt/ocr-sprint-service/storage
+```
+
+## Step 6: Run Database Migrations
+
+```bash
+# Still as the ocr user, with the venv activated
+cd /opt/ocr-sprint-service
+source .venv/bin/activate
+
+# Run migrations
+alembic upgrade head
+
+# Verify
+alembic current
+```
+
+## Step 7: Test a Manual Run
+
+```bash
+# Test API server
+uvicorn ocr_sprint.main:app --host 0.0.0.0 --port 8000
+
+# In another terminal, test the health check
+curl http://localhost:8000/api/v1/health
+
+# If it succeeds, stop the server with Ctrl+C
+```
+
+## Step 8: Set Up Systemd Services
+
+### API Service
+
+```bash
+# Exit the ocr user and return to the user with sudo access
+exit
+
+# Create systemd service file
+sudo nano /etc/systemd/system/ocr-sprint-api.service
+```
+
+**Content `/etc/systemd/system/ocr-sprint-api.service`:**
+
+```ini
+[Unit]
+Description=OCR Sprint API Service
+After=network.target postgresql.service redis.service
+Wants=postgresql.service redis.service
+
+[Service]
+Type=simple
+User=ocr
+Group=ocr
+WorkingDirectory=/opt/ocr-sprint-service
+
+# Environment
+Environment="PATH=/opt/ocr-sprint-service/.venv/bin:/usr/local/bin:/usr/bin:/bin"
+EnvironmentFile=/opt/ocr-sprint-service/.env
+
+# Start command: 4 workers for production
+ExecStart=/opt/ocr-sprint-service/.venv/bin/uvicorn \
+    ocr_sprint.main:app \
+    --host 0.0.0.0 \
+    --port 8000 \
+    --workers 4 \
+    --log-level info
+
+# Restart policy
+Restart=always
+RestartSec=10
+StartLimitInterval=0
+
+# Resource limits
+LimitNOFILE=65536
+MemoryMax=6G
+
+# Security
+NoNewPrivileges=true
+PrivateTmp=true
+
+[Install]
+WantedBy=multi-user.target
+```
+
+### Celery Worker Service
+
+```bash
+sudo nano /etc/systemd/system/ocr-sprint-worker.service
+```
+
+**Content `/etc/systemd/system/ocr-sprint-worker.service`:**
+
+```ini
+[Unit]
+Description=OCR Sprint Celery Worker
+After=network.target postgresql.service redis.service ocr-sprint-api.service
+Wants=postgresql.service redis.service
+
+[Service]
+Type=simple
+User=ocr
+Group=ocr
+WorkingDirectory=/opt/ocr-sprint-service
+
+# Environment
+Environment="PATH=/opt/ocr-sprint-service/.venv/bin:/usr/local/bin:/usr/bin:/bin"
+EnvironmentFile=/opt/ocr-sprint-service/.env
+
+# Start command: concurrency 2 for a 4-core CPU
+ExecStart=/opt/ocr-sprint-service/.venv/bin/celery \
+    -A ocr_sprint.worker.celery_app \
+    worker \
+    --loglevel=info \
+    --concurrency=2 \
+    --max-tasks-per-child=100
+
+# Restart policy
+Restart=always
+RestartSec=10
+StartLimitInterval=0
+
+# Resource limits
+LimitNOFILE=65536
+MemoryMax=4G
+
+# Security
+NoNewPrivileges=true
+PrivateTmp=true
+
+[Install]
+WantedBy=multi-user.target
+```
+
+### Enable and Start Services
+
+```bash
+# Reload systemd
+sudo systemctl daemon-reload
+
+# Enable services (auto-start on boot)
+sudo systemctl enable ocr-sprint-api
+sudo systemctl enable ocr-sprint-worker
+
+# Start services
+sudo systemctl start ocr-sprint-api
+sudo systemctl start ocr-sprint-worker
+
+# Check status
+sudo systemctl status ocr-sprint-api
+sudo systemctl status ocr-sprint-worker
+
+# View logs
+sudo journalctl -u ocr-sprint-api -f
+sudo journalctl -u ocr-sprint-worker -f
+```
+
+## Step 9: Set Up the Nginx Reverse Proxy
+
+### Install Nginx
+
+```bash
+sudo apt install -y nginx certbot python3-certbot-nginx
+```
+
+### Nginx Configuration
+
+```bash
+sudo nano /etc/nginx/sites-available/ocr-sprint
+```
+
+**Content `/etc/nginx/sites-available/ocr-sprint`:**
+
+```nginx
+# Upstream for load balancing (when scaling horizontally)
+upstream ocr_api {
+    server 127.0.0.1:8000;
+    keepalive 32;
+}
+
+# Rate limiting
+limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
+
+server {
+    listen 80;
+    server_name ocr.yourdomain.com;  # Replace with your domain
+
+    # Max upload size (keep in line with BLOB_MAX_UPLOAD_MB)
+    client_max_body_size 30M;
+    client_body_buffer_size 128k;
+
+    # Timeouts for large documents (nginx connect timeouts usually cannot exceed 75s)
+    proxy_connect_timeout 75s;
+    proxy_send_timeout 300s;
+    proxy_read_timeout 300s;
+    send_timeout 300s;
+
+    # Logging
+    access_log /var/log/nginx/ocr-sprint-access.log;
+    error_log /var/log/nginx/ocr-sprint-error.log;
+
+    # API endpoints
+    location /api/ {
+        # Rate limiting
+        limit_req zone=api_limit burst=20 nodelay;
+
+        proxy_pass http://ocr_api;
+        proxy_http_version 1.1;
+        
+        # Headers
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+        proxy_set_header Connection "";
+        
+        # Disable buffering for streaming responses
+        proxy_buffering off;
+    }
+
+    # Health check endpoint (no rate limit)
+    location /api/v1/health {
+        proxy_pass http://ocr_api;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+        access_log off;
+    }
+
+    # Metrics endpoint (restrict access)
+    location /metrics {
+        # Allow only from internal network
+        allow 10.0.0.0/8;
+        allow 172.16.0.0/12;
+        allow 192.168.0.0/16;
+        allow 127.0.0.1;
+        deny all;
+
+        proxy_pass http://ocr_api;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+    }
+
+    # Docs (optional, can be disabled in production)
+    location /docs {
+        proxy_pass http://ocr_api;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+    }
+
+    location /redoc {
+        proxy_pass http://ocr_api;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+    }
+}
+```
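+
+If the interactive docs should not be reachable in production, the two `location` blocks for `/docs` and `/redoc` can be swapped for a block that returns 404. A minimal sketch, assuming the application serves its schema at FastAPI's default `/openapi.json` path (not confirmed by this guide):
+
+```nginx
+# Goes inside the server { } block in place of the /docs and /redoc locations
+location ~ ^/(docs|redoc|openapi\.json)$ {
+    return 404;
+}
+```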
+
+### Enable Site
+
+```bash
+# Enable site
+sudo ln -s /etc/nginx/sites-available/ocr-sprint /etc/nginx/sites-enabled/
+
+# Remove default site (optional)
+sudo rm -f /etc/nginx/sites-enabled/default
+
+# Test the configuration now that the site is enabled
+sudo nginx -t
+
+# Reload Nginx
+sudo systemctl reload nginx
+```
+
+### Set Up SSL with Let's Encrypt
+
+```bash
+# Install certbot
+sudo apt install -y certbot python3-certbot-nginx
+
+# Obtain certificate (replace with your domain)
+sudo certbot --nginx -d ocr.yourdomain.com
+
+# Test auto-renewal
+sudo certbot renew --dry-run
+```
+
+Certbot automatically updates the Nginx configuration for HTTPS.
+
+## Step 10: Set Up the Firewall
+
+```bash
+# Install UFW (if not already installed)
+sudo apt install -y ufw
+
+# Allow SSH (IMPORTANT: do this first so you do not lock yourself out)
+sudo ufw allow 22/tcp
+
+# Allow HTTP and HTTPS
+sudo ufw allow 80/tcp
+sudo ufw allow 443/tcp
+
+# Enable firewall
+sudo ufw enable
+
+# Check status
+sudo ufw status
+```
+
+## Step 11: Verify the Deployment
+
+### Test from the Server
+
+```bash
+# Health check
+curl http://localhost:8000/api/v1/health
+
+# Test with an API key (the URL is quoted so the shell does not interpret "?")
+curl -X POST "http://localhost:8000/api/v1/documents?sync=true" \
+  -H "X-API-Key: your-api-key-here" \
+  -F "file=@/path/to/test.pdf"
+```
+
+### Test from a Client
+
+```bash
+# Health check via domain
+curl https://ocr.yourdomain.com/api/v1/health
+
+# Upload a document
+curl -X POST https://ocr.yourdomain.com/api/v1/documents \
+  -H "X-API-Key: your-api-key-here" \
+  -F "file=@document.pdf"
+```
+
+## Monitoring and Maintenance
+
+### View Logs
+
+```bash
+# API logs
+sudo journalctl -u ocr-sprint-api -f
+
+# Worker logs
+sudo journalctl -u ocr-sprint-worker -f
+
+# Nginx logs
+sudo tail -f /var/log/nginx/ocr-sprint-access.log
+sudo tail -f /var/log/nginx/ocr-sprint-error.log
+
+# PostgreSQL logs
+sudo tail -f /var/log/postgresql/postgresql-14-main.log
+```
+
+### Service Management
+
+```bash
+# Restart services
+sudo systemctl restart ocr-sprint-api
+sudo systemctl restart ocr-sprint-worker
+
+# Stop services
+sudo systemctl stop ocr-sprint-api
+sudo systemctl stop ocr-sprint-worker
+
+# Check status
+sudo systemctl status ocr-sprint-api
+sudo systemctl status ocr-sprint-worker
+```
+
+### Database Backup
+
+```bash
+# Create backup script
+sudo nano /opt/ocr-sprint-service/backup.sh
+```
+
+**Contents of `backup.sh`:**
+
+```bash
+#!/bin/bash
+# Abort on errors, including a failing pg_dump inside the pipeline
+set -euo pipefail
+
+BACKUP_DIR="/opt/ocr-sprint-service/backups"
+DATE=$(date +%Y%m%d_%H%M%S)
+
+mkdir -p "$BACKUP_DIR"
+
+# Backup database (under cron, the password must come from ~/.pgpass or PGPASSWORD)
+pg_dump -U ocr -h localhost ocr_sprint | gzip > "$BACKUP_DIR/db_$DATE.sql.gz"
+
+# Backup blobs (optional, can be large)
+# tar -czf "$BACKUP_DIR/blobs_$DATE.tar.gz" /opt/ocr-sprint-service/storage/blobs
+
+# Delete backups older than 7 days
+find "$BACKUP_DIR" -name "db_*.sql.gz" -mtime +7 -delete
+
+echo "Backup completed: $DATE"
+```
+
+```bash
+# Make executable (the file was created with sudo, so it is owned by root)
+sudo chmod +x /opt/ocr-sprint-service/backup.sh
+
+# Set up the cron job (daily at 2 AM) in root's crontab
+sudo crontab -e
+
+# Add line:
+0 2 * * * /opt/ocr-sprint-service/backup.sh >> /var/log/ocr-backup.log 2>&1
+```
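+
+The cron job runs without a terminal, so `pg_dump` cannot prompt for the database password. One common way to supply it, sketched here as an assumption about how credentials are managed on this server, is a `.pgpass` file in the home directory of whichever user owns the crontab. The password below is the placeholder from the database setup step:
+
+```
+# ~/.pgpass (must be mode 0600), one entry per line: hostname:port:database:username:password
+localhost:5432:ocr_sprint:ocr:ganti-dengan-password-kuat
+```
+
+libpq ignores the file when its permissions are looser than `0600`, so run `chmod 600 ~/.pgpass` after creating it.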
+
+### Log Rotation
+
+```bash
+sudo nano /etc/logrotate.d/ocr-sprint
+```
+
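+**Content** (on Debian/Ubuntu, first check whether the packaged `/etc/logrotate.d/nginx` already covers `/var/log/nginx/*.log`; logrotate reports an error for duplicate entries of the same log file):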
+
+```
+/var/log/nginx/ocr-sprint-*.log {
+    daily
+    rotate 14
+    compress
+    delaycompress
+    notifempty
+    create 0640 www-data adm
+    sharedscripts
+    postrotate
+        [ -f /var/run/nginx.pid ] && kill -USR1 `cat /var/run/nginx.pid`
+    endscript
+}
+```
+
+## Update Application
+
+```bash
+# Switch ke user ocr
+sudo su - ocr
+cd /opt/ocr-sprint-service
+
+# Pull latest code
+git pull
+
+# Activate venv
+source .venv/bin/activate
+
+# Update dependencies
+pip install -e ".[ocr]"
+
+# Run migrations
+alembic upgrade head
+
+# Exit user ocr
+exit
+
+# Restart services
+sudo systemctl restart ocr-sprint-api
+sudo systemctl restart ocr-sprint-worker
+
+# Check logs
+sudo journalctl -u ocr-sprint-api -n 50
+```
+
+## Performance Tuning
+
+### Increase Worker Concurrency
+
+```bash
+# Edit worker service
+sudo nano /etc/systemd/system/ocr-sprint-worker.service
+
+# Adjust --concurrency to the number of CPU cores
+# For 8 cores: --concurrency=4
+# For 16 cores: --concurrency=8
+
+# Reload and restart
+sudo systemctl daemon-reload
+sudo systemctl restart ocr-sprint-worker
+```
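+
+As an alternative to editing the unit file in place, the change can live in a systemd drop-in override, which keeps it separate from the unit file created earlier. A minimal sketch, assuming an 8-core server and the worker unit from this guide; `sudo systemctl edit ocr-sprint-worker` opens the drop-in for editing:
+
+```ini
+# /etc/systemd/system/ocr-sprint-worker.service.d/override.conf
+[Service]
+# An empty ExecStart= clears the original command before it is redefined
+ExecStart=
+ExecStart=/opt/ocr-sprint-service/.venv/bin/celery \
+    -A ocr_sprint.worker.celery_app \
+    worker \
+    --loglevel=info \
+    --concurrency=4 \
+    --max-tasks-per-child=100
+```
+
+A restart of `ocr-sprint-worker` is still needed afterwards for the new command to take effect.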
+
+### PostgreSQL Tuning
+
+```bash
+sudo nano /etc/postgresql/14/main/postgresql.conf
+```
+
+**Recommended settings for 16 GB RAM:**
+
+```
+shared_buffers = 4GB
+effective_cache_size = 12GB
+maintenance_work_mem = 1GB
+checkpoint_completion_target = 0.9
+wal_buffers = 16MB
+default_statistics_target = 100
+random_page_cost = 1.1
+effective_io_concurrency = 200
+work_mem = 10MB
+min_wal_size = 1GB
+max_wal_size = 4GB
+max_worker_processes = 4
+max_parallel_workers_per_gather = 2
+max_parallel_workers = 4
+```
+
+```bash
+sudo systemctl restart postgresql
+```
+
+### Redis Tuning
+
+```bash
+sudo nano /etc/redis/redis.conf
+```
+
+**Recommended settings:**
+
+```
+maxmemory 2gb
+maxmemory-policy allkeys-lru
+# Disable RDB snapshots for performance (comments must be on their own line in redis.conf)
+save ""
+```
+
+```bash
+sudo systemctl restart redis
+```
+
+## Troubleshooting
+
+### Service does not start
+
+```bash
+# Check logs
+sudo journalctl -u ocr-sprint-api -n 100 --no-pager
+sudo journalctl -u ocr-sprint-worker -n 100 --no-pager
+
+# Check permissions
+ls -la /opt/ocr-sprint-service
+ls -la /opt/ocr-sprint-service/storage
+
+# Test manual run
+sudo su - ocr
+cd /opt/ocr-sprint-service
+source .venv/bin/activate
+uvicorn ocr_sprint.main:app --host 0.0.0.0 --port 8000
+```
+
+### Database connection error
+
+```bash
+# Test connection
+sudo -u ocr psql -h localhost -U ocr -d ocr_sprint
+
+# Check PostgreSQL status
+sudo systemctl status postgresql
+
+# Check pg_hba.conf
+sudo cat /etc/postgresql/14/main/pg_hba.conf | grep ocr
+```
+
+### Redis connection error
+
+```bash
+# Test Redis
+redis-cli ping
+
+# Check Redis status
+sudo systemctl status redis
+
+# Check Redis logs
+sudo journalctl -u redis -n 50
+```
+
+### PaddleOCR model download fails
+
+```bash
+# Download the models manually
+sudo su - ocr
+cd /opt/ocr-sprint-service
+source .venv/bin/activate
+
+python << EOF
+from paddleocr import PaddleOCR
+ocr = PaddleOCR(use_angle_cls=True, lang='latin')
+print("Models downloaded successfully")
+EOF
+```
+
+### Out of memory
+
+```bash
+# Check memory usage
+free -h
+htop
+
+# Reduce worker concurrency
+sudo nano /etc/systemd/system/ocr-sprint-worker.service
+# Set --concurrency=1, then run: sudo systemctl daemon-reload && sudo systemctl restart ocr-sprint-worker
+
+# Add swap (if needed)
+sudo fallocate -l 4G /swapfile
+sudo chmod 600 /swapfile
+sudo mkswap /swapfile
+sudo swapon /swapfile
+echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
+```
+
+## Security Checklist
+
+- [ ] API keys replaced with strong random values
+- [ ] Database password changed from the default
+- [ ] Firewall enabled (UFW), with only ports 22, 80, 443 open
+- [ ] SSL/TLS enabled via Let's Encrypt
+- [ ] `/metrics` endpoint restricted to the internal network
+- [ ] Nginx rate limiting configured
+- [ ] PostgreSQL listens only on localhost
+- [ ] Redis listens only on localhost
+- [ ] Regular backup configured (cron job)
+- [ ] Log rotation configured
+- [ ] OS security updates enabled (`unattended-upgrades`)
+- [ ] Fail2ban installed for SSH protection
+
+## Monitoring with Prometheus (Optional)
+
+### Install Prometheus
+
+```bash
+# Download Prometheus
+cd /tmp
+wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
+tar xvfz prometheus-*.tar.gz
+sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus
+
+# Create user
+sudo useradd --no-create-home --shell /bin/false prometheus
+
+# Create directories
+sudo mkdir /etc/prometheus /var/lib/prometheus
+sudo chown prometheus:prometheus /var/lib/prometheus
+```
+
+### Configure Prometheus
+
+```bash
+sudo nano /etc/prometheus/prometheus.yml
+```
+
+**Content:**
+
+```yaml
+global:
+  scrape_interval: 15s
+
+scrape_configs:
+  - job_name: 'ocr-sprint'
+    static_configs:
+      - targets: ['localhost:8000']
+    metrics_path: '/metrics'
+```
+
+### Create Systemd Service
+
+```bash
+sudo nano /etc/systemd/system/prometheus.service
+```
+
+**Content:**
+
+```ini
+[Unit]
+Description=Prometheus
+After=network.target
+
+[Service]
+User=prometheus
+Group=prometheus
+Type=simple
+ExecStart=/opt/prometheus/prometheus \
+    --config.file=/etc/prometheus/prometheus.yml \
+    --storage.tsdb.path=/var/lib/prometheus/
+
+[Install]
+WantedBy=multi-user.target
+```
+
+```bash
+sudo systemctl daemon-reload
+sudo systemctl enable prometheus
+sudo systemctl start prometheus
+```
+
+Access Prometheus at `http://localhost:9090`.
+
+## Support
+
+For questions or issues, contact the development team.
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -0,0 +1,437 @@
+# Quickstart Deployment OCR Sprint Service
+
+A guide to deploying OCR Sprint Service to a production server for processing Polri surat sprint documents.
+
+## Server Prerequisites
+
+### Minimum Specifications
+- **OS**: Linux (Ubuntu 20.04+ / Debian 11+ / RHEL 8+)
+- **CPU**: 4 cores (8 cores recommended for high throughput)
+- **RAM**: 8 GB minimum (16 GB recommended)
+- **Storage**: 50 GB free space
+  - ~3 GB for the PaddleOCR models
+  - ~1.5 GB for Python dependencies
+  - The remainder for document blob storage
+- **Network**: Port 8000 reachable for API access (or ports 80/443 when the API sits behind a reverse proxy)
+
+### Software Requirements
+- Docker 24.0+ and Docker Compose v2
+- Git
+- (Optional) Nginx/Caddy for reverse proxy + SSL
+
+## Deployment with Docker Compose (Recommended)
+
+### 1. Clone Repository
+
+```bash
+# Log in to the server as a non-root user with sudo access
+ssh user@your-server.com
+
+# Clone repository
+git clone https://github.com/Adriankf59/ocr-sprint-service.git
+cd ocr-sprint-service
+```
+
+### 2. Configure the Environment
+
+```bash
+# Copy the environment template
+cp .env.example .env
+
+# Edit the production configuration
+nano .env
+```
+
+**Key settings for production:**
+
+```bash
+# ==== App ====
+APP_ENV=prod
+APP_LOG_LEVEL=INFO
+
+# ==== Storage ====
+STORAGE_LOCAL_DIR=/app/storage
+BLOB_STORAGE_DIR=/app/storage/blobs
+BLOB_MAX_UPLOAD_MB=25
+
+# ==== OCR ====
+OCR_LANG=latin
+OCR_USE_GPU=false              # set to true if the server has an NVIDIA GPU
+OCR_MAX_IMAGE_SIDE=2200
+
+# ==== Preprocessing ====
+PREPROCESS_TARGET_DPI=300
+PREPROCESS_DENOISE=true
+PREPROCESS_DESKEW=true
+PREPROCESS_DETECT_DOCUMENT=true
+PREPROCESS_REMOVE_SHADOW=true
+
+# ==== Table Extraction ====
+TABLES_ENABLED=true
+
+# ==== Async Pipeline ====
+QUEUE_ENABLED=true
+REDIS_URL=redis://redis:6379/0
+CELERY_TASK_DEFAULT_QUEUE=ocr_sprint
+
+# ==== Database ====
+DATABASE_URL=postgresql+psycopg://ocr:ocr@postgres:5432/ocr_sprint
+DATABASE_ECHO=false
+
+# ==== Auth (REQUIRED for production!) ====
+API_KEYS=your-secret-key-1,your-secret-key-2
+API_KEY_HEADER=X-API-Key
+```
+
+**Generate secure API keys:**
+
+```bash
+# Generate random API key
+openssl rand -hex 32
+```
+
+### 3. Build and Start Services
+
+```bash
+# Build Docker images
+docker compose build
+
+# Start all services (API, Worker, Redis, Postgres)
+docker compose up -d
+
+# Check the logs to confirm everything is running
+docker compose logs -f api worker
+```
+
+**Running services:**
+- `api`: FastAPI server on port 8000
+- `worker`: Celery worker for async processing
+- `redis`: Message broker for the job queue
+- `postgres`: Database for job state
+
+### 4. Verify the Deployment
+
+```bash
+# Health check
+curl http://localhost:8000/api/v1/health
+
+# Expected response:
+# {"status":"ok","version":"0.1.0"}
+
+# Test the OCR endpoint (sync mode for testing)
+curl -X POST "http://localhost:8000/api/v1/documents?sync=true" \
+  -H "X-API-Key: your-secret-key-1" \
+  -F "file=@samples/pdf/example.pdf" \
+  | jq
+```
+
+### 5. Set Up the Reverse Proxy (Nginx)
+
+**Install Nginx:**
+
+```bash
+sudo apt update
+sudo apt install nginx certbot python3-certbot-nginx
+```
+
+**Nginx configuration (`/etc/nginx/sites-available/ocr-sprint`):**
+
+```nginx
+upstream ocr_api {
+    server localhost:8000;
+}
+
+server {
+    listen 80;
+    server_name ocr.yourdomain.com;
+
+    client_max_body_size 30M;  # Keep in line with BLOB_MAX_UPLOAD_MB
+
+    location / {
+        proxy_pass http://ocr_api;
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+        
+        # Timeouts for large documents
+        proxy_read_timeout 300s;
+        proxy_connect_timeout 75s;
+    }
+
+    location /metrics {
+        # Restrict metrics endpoint
+        allow 10.0.0.0/8;  # Internal network only
+        deny all;
+        proxy_pass http://ocr_api;
+    }
+}
+```
+
+**Enable the site and set up SSL:**
+
+```bash
+# Enable site
+sudo ln -s /etc/nginx/sites-available/ocr-sprint /etc/nginx/sites-enabled/
+sudo nginx -t
+sudo systemctl reload nginx
+
+# Set up SSL with Let's Encrypt
+sudo certbot --nginx -d ocr.yourdomain.com
+```
+
+## Manual Deployment (Without Docker)
+
+### 1. Install System Dependencies
+
+```bash
+# Ubuntu/Debian
+sudo apt update
+sudo apt install -y \
+    python3.11 python3.11-venv python3-pip \
+    libgl1 libglib2.0-0 libsm6 libxext6 libxrender1 \
+    libgomp1 libmagic1 \
+    redis-server postgresql
+
+# Start services
+sudo systemctl enable --now redis-server postgresql
+```
+
+### 2. Set Up the Database
+
+```bash
+# Create the database and user
+sudo -u postgres psql << EOF
+CREATE USER ocr WITH PASSWORD 'your-secure-password';
+CREATE DATABASE ocr_sprint OWNER ocr;
+GRANT ALL PRIVILEGES ON DATABASE ocr_sprint TO ocr;
+EOF
+```
+
+### 3. Install Application
+
+```bash
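+# Clone repository (the systemd units below assume it lives in /opt/ocr-sprint-service
+# and runs as a dedicated "ocr" user; see docs/DEPLOYMENT-MANUAL.md for creating both)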
+git clone https://github.com/Adriankf59/ocr-sprint-service.git
+cd ocr-sprint-service
+
+# Create virtual environment
+python3.11 -m venv .venv
+source .venv/bin/activate
+
+# Install dependencies
+pip install --upgrade pip
+pip install -e ".[ocr]"
+
+# Copy and edit .env
+cp .env.example .env
+nano .env
+```
+
+**Update `DATABASE_URL` in `.env`:**
+
+```bash
+DATABASE_URL=postgresql+psycopg://ocr:your-secure-password@localhost:5432/ocr_sprint
+REDIS_URL=redis://localhost:6379/0
+QUEUE_ENABLED=true
+```
+
+### 4. Run Database Migrations
+
+```bash
+alembic upgrade head
+```
+
+### 5. Set Up Systemd Services
+
+**API Service (`/etc/systemd/system/ocr-sprint-api.service`):**
+
+```ini
+[Unit]
+Description=OCR Sprint API
+After=network.target postgresql.service redis.service
+
+[Service]
+Type=simple
+User=ocr
+WorkingDirectory=/opt/ocr-sprint-service
+Environment="PATH=/opt/ocr-sprint-service/.venv/bin"
+ExecStart=/opt/ocr-sprint-service/.venv/bin/uvicorn ocr_sprint.main:app --host 0.0.0.0 --port 8000 --workers 4
+Restart=always
+RestartSec=10
+
+[Install]
+WantedBy=multi-user.target
+```
+
+**Worker Service (`/etc/systemd/system/ocr-sprint-worker.service`):**
+
+```ini
+[Unit]
+Description=OCR Sprint Celery Worker
+After=network.target postgresql.service redis.service
+
+[Service]
+Type=simple
+User=ocr
+WorkingDirectory=/opt/ocr-sprint-service
+Environment="PATH=/opt/ocr-sprint-service/.venv/bin"
+ExecStart=/opt/ocr-sprint-service/.venv/bin/celery -A ocr_sprint.worker.celery_app worker -l info --concurrency=2
+Restart=always
+RestartSec=10
+
+[Install]
+WantedBy=multi-user.target
+```
+
+**Enable and start the services:**
+
+```bash
+sudo systemctl daemon-reload
+sudo systemctl enable --now ocr-sprint-api ocr-sprint-worker
+sudo systemctl status ocr-sprint-api ocr-sprint-worker
+```
+
+## Monitoring and Maintenance
+
+### Monitoring Logs
+
+```bash
+# Docker deployment
+docker compose logs -f api worker
+
+# Manual deployment
+sudo journalctl -u ocr-sprint-api -f
+sudo journalctl -u ocr-sprint-worker -f
+```
+
+### Prometheus Metrics
+
+Metrics are available at the `/metrics` endpoint:
+
+```bash
+curl http://localhost:8000/metrics
+```
+
+**Key metrics:**
+- `ocr_documents_total`: Total documents processed
+- `ocr_processing_duration_seconds`: Processing duration
+- `ocr_confidence_score`: Confidence score distribution
+- `celery_task_*`: Celery worker metrics
+
+### Backup Database
+
+```bash
+# Docker deployment
+docker compose exec -T postgres pg_dump -U ocr ocr_sprint > backup_$(date +%Y%m%d).sql
+
+# Manual deployment (-h localhost forces a TCP connection with password authentication)
+pg_dump -U ocr -h localhost ocr_sprint > backup_$(date +%Y%m%d).sql
+```
+
+### Update Service
+
+```bash
+# Docker deployment
+cd ocr-sprint-service
+git pull
+docker compose build
+docker compose up -d
+
+# Manual deployment
+cd ocr-sprint-service
+git pull
+source .venv/bin/activate
+pip install -e ".[ocr]"
+alembic upgrade head
+sudo systemctl restart ocr-sprint-api ocr-sprint-worker
+```
+
+## Troubleshooting
+
+### Service does not start
+
+```bash
+# Check logs
+docker compose logs api worker
+
+# Check the health endpoint
+curl http://localhost:8000/api/v1/health
+```
+
+### PaddleOCR model download fails
+
+```bash
+# Download the models manually into the volume
+docker compose exec api python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='latin')"
+```
+
+### Worker does not process jobs
+
+```bash
+# Check the Redis connection (assumes the Compose service is named `redis`)
+docker compose exec redis redis-cli ping
+
+# Check Celery worker status
+docker compose exec worker celery -A ocr_sprint.worker.celery_app inspect active
+```
+
+### Database migration error
+
+```bash
+# Check the current revision
+docker compose exec api alembic current
+
+# Apply pending migrations
+docker compose exec api alembic upgrade head
+```
+
+### Out of memory
+
+```bash
+# Reduce worker concurrency in docker-compose.yml
+# Change to: --concurrency=1 (default), or add a memory limit
+```
+
+## Security Checklist
+
+- [ ] API_KEYS set to strong random values
+- [ ] Firewall configured (only ports 80/443 open)
+- [ ] SSL/TLS enabled via Nginx + Let's Encrypt
+- [ ] Database password changed from the default
+- [ ] `/metrics` endpoint restricted to the internal network
+- [ ] Regular backups of the database and blob storage
+- [ ] Log rotation configured
+- [ ] OS security updates enabled
+
+## Performance Tuning
+
+### For high throughput:
+
+1. **Increase worker concurrency:**
+   ```yaml
+   # docker-compose.yml
+   command: ["celery", "-A", "ocr_sprint.worker.celery_app", "worker", "-l", "info", "--concurrency=4"]
+   ```
+
+2. **Scale workers horizontally:**
+   ```bash
+   docker compose up -d --scale worker=3
+   ```
+
+3. **Enable GPU (if available):**
+   ```bash
+   # .env
+   OCR_USE_GPU=true
+   ```
+
+4. **Tune Postgres:**
+   ```sql
+   -- Raise the connection limit and shared memory (both need a Postgres restart)
+   ALTER SYSTEM SET max_connections = 200;
+   ALTER SYSTEM SET shared_buffers = '2GB';
+   ```
+
+## Support
+
+For questions or issues, contact the development team or open an issue in the repository.
--- a/docs/FRONTEND-INTEGRATION.md
+++ b/docs/FRONTEND-INTEGRATION.md
@@ -0,0 +1,537 @@
+# Frontend Integration Guide
+
+This document describes the API contract the frontend uses to upload sprint documents, display OCR results, run manual review, and approve the final result.
+
+## Base URL
+
+Default local API:
+
+```text
+http://localhost:8000/api/v1
+```
+
+For the frontend, store the URL in an environment variable:
+
+```env
+VITE_OCR_API_BASE_URL=http://localhost:8000/api/v1
+```
+
+If `API_KEYS` is set on the backend, all protected endpoints require this header:
+
+```http
+X-API-Key: <api-key>
+```
+
+Note: do not expose a production API key in a public frontend. For internal deployments, use a reverse proxy or a backend-for-frontend session if access is not fully trusted.
+
+## Health Check
+
+```http
+GET /health
+GET /health/ready
+```
+
+Example `/health` response:
+
+```json
+{
+  "status": "ok",
+  "version": "0.1.0"
+}
+```
+
+Example `/health/ready` response:
+
+```json
+{
+  "status": "ready",
+  "version": "0.1.0",
+  "models": {
+    "paddleocr": "ready",
+    "pp_structure": "disabled"
+  }
+}
+```
+
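+Use `/health/ready` to keep the upload button disabled until the OCR models are ready. While models are still loading, the endpoint responds with HTTP `503` and `"status": "warming_up"`.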
+
+## Upload a Document
+
+Endpoint:
+
+```http
+POST /documents
+POST /documents?sync=true
+```
+
+The body must be `multipart/form-data` with a `file` field.
+
+The backend accepts PDFs and common image formats. The upload size limit follows the backend setting `BLOB_MAX_UPLOAD_MB`, currently 25 MB.
+
+### Recommended Flow
+
+For a production frontend, use the async flow:
+
+1. `POST /documents`
+2. If the HTTP status is `202`, read the `job_id`
+3. Poll `GET /documents/{job_id}` every 1-3 seconds
+4. Stop polling once the status is `completed`, `needs_review`, or `failed`
+
+For simple local development, `POST /documents?sync=true` is acceptable, but the request can take a long time because OCR runs inline.
+
+### Upload Example
+
+```ts
+const API_BASE = import.meta.env.VITE_OCR_API_BASE_URL;
+const API_KEY = import.meta.env.VITE_OCR_API_KEY;
+
+async function uploadDocument(file: File) {
+  const form = new FormData();
+  form.append("file", file);
+
+  const res = await fetch(`${API_BASE}/documents`, {
+    method: "POST",
+    headers: API_KEY ? { "X-API-Key": API_KEY } : undefined,
+    body: form,
+  });
+
+  if (!res.ok) {
+    throw await readApiError(res);
+  }
+
+  return (await res.json()) as DocumentResponse;
+}
+```
+
+## Polling Job
+
+Endpoint:
+
+```http
+GET /documents/{job_id}
+```
+
+```ts
+const TERMINAL_STATUSES = new Set(["completed", "needs_review", "failed"]);
+
+async function getDocument(jobId: string) {
+  const res = await fetch(`${API_BASE}/documents/${jobId}`, {
+    headers: API_KEY ? { "X-API-Key": API_KEY } : undefined,
+  });
+
+  if (!res.ok) {
+    throw await readApiError(res);
+  }
+
+  return (await res.json()) as DocumentResponse;
+}
+
+async function pollDocument(jobId: string, onUpdate: (doc: DocumentResponse) => void) {
+  while (true) {
+    const doc = await getDocument(jobId);
+    onUpdate(doc);
+
+    if (TERMINAL_STATUSES.has(doc.status)) {
+      return doc;
+    }
+
+    await new Promise((resolve) => setTimeout(resolve, 2000));
+  }
+}
+```
+
+## Response Schema
+
+### DocumentResponse
+
+```ts
+type DocumentStatus =
+  | "pending"
+  | "processing"
+  | "completed"
+  | "needs_review"
+  | "failed";
+
+type DocumentResponse = {
+  job_id: string;
+  status: DocumentStatus;
+  confidence: number | null;
+  data: ExtractionResult | null;
+  review_flags: ReviewFlag[];
+  error: string | null;
+  approved: boolean;
+  reviewed_by: string | null;
+  reviewed_at: string | null;
+};
+```
+
+### ExtractionResult
+
+```ts
+type ExtractionResult = {
+  header: HeaderFields;
+  personel: PersonnelEntry[];
+  untuk: string[];
+  ttd: Signatory;
+  raw_text: string;
+  confidence: number;
+  review_flags: ReviewFlag[];
+};
+
+type HeaderFields = {
+  nomor_sprint: string | null;
+  tanggal: string | null; // YYYY-MM-DD
+  satuan_penerbit: string | null;
+  perihal: string | null;
+  dasar: string[];
+};
+
+type PersonnelEntry = {
+  no: number | null;
+  pangkat: string | null;
+  nrp: string | null;
+  nama: string | null;
+  jabatan_dinas: string | null;
+  jabatan_sprint: string | null;
+  keterangan: string | null;
+  confidence: number;
+};
+
+type Signatory = {
+  nama: string | null;
+  pangkat: string | null;
+  nrp: string | null;
+  jabatan: string | null;
+};
+```
+
+### Review Flags
+
+```ts
+type ReviewFlag =
+  | "low_ocr_confidence"
+  | "missing_field"
+  | "invalid_nrp"
+  | "unknown_pangkat"
+  | "personnel_count_mismatch"
+  | "date_parse_failed"
+  | "llm_fallback"
+  | "llm_unavailable"
+  | "personnel_text_fallback"
+  | "personnel_text_fallback_no_nrp"
+  | "incomplete_personnel_row";
+```
+
+Recommended UI labels:
+
+| Flag | Label |
+|---|---|
+| `low_ocr_confidence` | Confidence OCR rendah |
+| `missing_field` | Field wajib belum lengkap |
+| `invalid_nrp` | NRP tidak valid |
+| `unknown_pangkat` | Pangkat tidak dikenali |
+| `personnel_count_mismatch` | Jumlah personel perlu dicek |
+| `date_parse_failed` | Tanggal gagal dibaca |
+| `llm_fallback` | Sebagian field diisi fallback LLM |
+| `llm_unavailable` | LLM tidak tersedia |
+| `personnel_text_fallback` | Personel dibaca dari fallback teks |
+| `personnel_text_fallback_no_nrp` | Personel dibaca tanpa NRP |
+| `incomplete_personnel_row` | Baris personel belum lengkap |
+
+## Example Final Response
+
+```json
+{
+  "job_id": "e21e83ed-a42c-4672-baec-914e5c60cc5a",
+  "status": "needs_review",
+  "confidence": 0.82,
+  "data": {
+    "header": {
+      "nomor_sprint": "Sprin/123/IV/2026",
+      "tanggal": "2026-04-21",
+      "satuan_penerbit": "POLRES BANJAR",
+      "perihal": "Instruktur Ops Pekat I Lodaya 2026",
+      "dasar": []
+    },
+    "personel": [
+      {
+        "no": 1,
+        "pangkat": "IPDA",
+        "nrp": "12345678",
+        "nama": "BUDI SANTOSO",
+        "jabatan_dinas": "KANIT",
+        "jabatan_sprint": "INSTRUKTUR",
+        "keterangan": null,
+        "confidence": 0.91
+      }
+    ],
+    "untuk": ["Melaksanakan kegiatan sesuai surat perintah."],
+    "ttd": {
+      "nama": "AGUS",
+      "pangkat": "AKBP",
+      "nrp": "87654321",
+      "jabatan": "KAPOLRES"
+    },
+    "raw_text": "full OCR text...",
+    "confidence": 0.82,
+    "review_flags": ["low_ocr_confidence"]
+  },
+  "review_flags": ["low_ocr_confidence"],
+  "error": null,
+  "approved": false,
+  "reviewed_by": null,
+  "reviewed_at": null
+}
+```
+
+`raw_text` can be long. Show it in a collapsible or debug panel, not on the main screen.
+
+## HITL Review and Correction
+
+The frontend review screen should let the editor modify:
+
+- Header: sprint number, date, issuing unit, subject (perihal), basis (dasar)
+- Personnel: rank, NRP, name, service position, sprint position, remarks
+- Untuk: list of tasks
+- TTD (signatory): name, rank, NRP, position
+
+### Patch Corrections
+
+Endpoint:
+
+```http
+PATCH /documents/{job_id}
+```
+
+Body:
+
+```json
+{
+  "corrections": [
+    {
+      "path": "header.perihal",
+      "value": "Pelaksanaan Operasi Pekat I Lodaya 2026",
+      "reason": "OCR membaca perihal tidak lengkap"
+    },
+    {
+      "path": "personel[0].nama",
+      "value": "BUDI SANTOSO",
+      "reason": "Perbaikan nama"
+    }
+  ]
+}
+```
+
+Optional header for the audit trail:
+
+```http
+X-User-Id: reviewer-a
+```
+
+Commonly used paths:
+
+```text
+header.nomor_sprint
+header.tanggal
+header.satuan_penerbit
+header.perihal
+header.dasar
+ttd.nama
+ttd.pangkat
+ttd.nrp
+ttd.jabatan
+personel[0].pangkat
+personel[0].nrp
+personel[0].nama
+personel[0].jabatan_dinas
+personel[0].jabatan_sprint
+personel[0].keterangan
+untuk
+```
+
+All corrections in a single request are atomic. If any path is invalid, the whole batch is rejected and no changes are saved.
+
+### Patch Example
+
+```ts
+async function patchDocument(jobId: string, corrections: FieldCorrection[], userId?: string) {
+  const headers: Record<string, string> = { "Content-Type": "application/json" };
+  if (API_KEY) headers["X-API-Key"] = API_KEY;
+  if (userId) headers["X-User-Id"] = userId;
+
+  const res = await fetch(`${API_BASE}/documents/${jobId}`, {
+    method: "PATCH",
+    headers,
+    body: JSON.stringify({ corrections }),
+  });
+
+  if (!res.ok) {
+    throw await readApiError(res);
+  }
+
+  return (await res.json()) as DocumentResponse;
+}
+
+type FieldCorrection = {
+  path: string;
+  value: unknown;
+  reason?: string | null;
+};
+```
+
+## Correction History
+
+Endpoint:
+
+```http
+GET /documents/{job_id}/history
+```
+
+Response:
+
+```ts
+type CorrectionEventResponse = {
+  id: number;
+  job_id: string;
+  field_path: string;
+  old_value: unknown | null;
+  new_value: unknown | null;
+  corrected_by: string | null;
+  reason: string | null;
+  corrected_at: string;
+};
+```
+
+Use this endpoint for the audit panel on the review page.
+
+## Approve Final Result
+
+Endpoint:
+
+```http
+POST /documents/{job_id}/approve
+```
+
+Optional header:
+
+```http
+X-User-Id: reviewer-a
+```
+
+Response:
+
+```json
+{
+  "job_id": "e21e83ed-a42c-4672-baec-914e5c60cc5a",
+  "approved": true,
+  "reviewed_by": "reviewer-a",
+  "reviewed_at": "2026-04-26T16:30:00"
+}
+```
+
+Once approved, `PATCH /documents/{job_id}` is rejected with `409`.
+
+## Error Handling
+
+Application errors:
+
+```json
+{
+  "error": "UnsupportedDocumentError",
+  "message": "Uploaded file is empty."
+}
+```
+
+FastAPI validation errors use the standard shape:
+
+```json
+{
+  "detail": [
+    {
+      "type": "missing",
+      "loc": ["body", "file"],
+      "msg": "Field required"
+    }
+  ]
+}
+```
+
+Error helper:
+
+```ts
+async function readApiError(res: Response) {
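+  // A Response body can be read only once, so read text first, then try JSON.
+  const text = await res.text();
+  let payload: unknown = text;
+  try {
+    payload = JSON.parse(text);
+  } catch { /* not JSON: keep the raw text */ }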
+
+  return {
+    status: res.status,
+    payload,
+  };
+}
+```
+
+Recommended UI handling:
+
+| HTTP Status | UI Handling |
+|---|---|
+| `400` | Show the validation/upload message |
+| `401` | Session or API key is invalid |
+| `404` | Job not found |
+| `409` | Job not finished yet or already approved |
+| `422` | Correction form is invalid |
+| `500` | Show a generic error and ask the operator to check the backend logs |
+
+## Ground Truth Admin
+
+These endpoints are optional, intended for an admin or training-data dashboard.
+
+```http
+GET /ground-truth/stats?top_n=10
+GET /ground-truth/export?approved_only=true&has_corrections=true&limit=1000
+```
+
+`/ground-truth/export` returns `application/x-ndjson`, one JSON object per line. The frontend usually only needs to offer a download button rather than parse the whole stream in the browser.
+
+## Recommended Screens
+
+1. Upload screen
+   - Dropzone file PDF/image
+   - Health readiness badge
+   - Upload progress
+   - Processing state setelah `job_id` diterima
+
+2. Result screen
+   - Status badge
+   - Confidence score
+   - Review flags
+   - Header summary
+   - Personnel table
+   - Untuk list
+   - TTD section
+   - Raw OCR collapsible
+
+3. Review screen
+   - Editable fields
+   - Dirty-state tracking
+   - Correction reason input
+   - Save corrections via `PATCH`
+   - History panel
+   - Approve button
+
+4. Admin screen
+   - Health/ready status
+   - Ground-truth stats
+   - Export approved samples
+
+## UX Rules
+
+- Do not rely on `POST /documents?sync=true` in a production UI; use async + polling.
+- Disable approve while the status is still `pending` or `processing`.
+- Present `needs_review` as a successfully processed result that still needs human validation.
+- Do not render `raw_text` as primary content.
+- On `failed`, show the `error` from the response if present.
+- On low confidence, direct the user to the review fields that carry related flags.
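Note on the async flow in `docs/FRONTEND-INTEGRATION.md`: the contract (upload, then poll until a terminal status) is not tied to the browser. A minimal Python sketch of the polling half, for example for a smoke-test script, is shown here. It assumes a caller-supplied `fetch_status` callable, which is a hypothetical stand-in for `GET /documents/{job_id}`, and it adds an overall timeout that the guide itself does not specify.

```python
import time
from typing import Callable

# Terminal statuses as listed in the guide's recommended flow.
TERMINAL_STATUSES = {"completed", "needs_review", "failed"}


def poll_until_terminal(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call ``fetch_status`` until the job reaches a terminal status.

    ``fetch_status`` is hypothetical: it should return the decoded JSON
    body of ``GET /documents/{job_id}``. ``timeout_s`` is an assumption;
    the guide does not define a polling deadline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        doc = fetch_status()
        if doc["status"] in TERMINAL_STATUSES:
            return doc
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {doc['status']!r} after {timeout_s}s")
        sleep(interval_s)
```

Injecting `sleep` keeps the loop testable without real delays; a caller would normally leave it at the default.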
--- a/docs/OCR-RUNTIME-MODES.md
+++ b/docs/OCR-RUNTIME-MODES.md
@@ -0,0 +1,49 @@
+# OCR Runtime Modes
+
+The OCR backend can run in CPU or GPU mode via the `OCR_USE_GPU` setting.
+
+## Usage
+
+CPU mode:
+
+```powershell
+.\update.ps1 -OcrMode cpu
+```
+
+GPU mode:
+
+```powershell
+.\update.ps1 -OcrMode gpu
+```
+
+If the parameter is omitted, `update.ps1` uses the value already present in `.env`.
+
+```env
+OCR_USE_GPU=false
+```
+
+or:
+
+```env
+OCR_USE_GPU=true
+```
+
+## Script Behaviour
+
+- `-OcrMode cpu` writes `OCR_USE_GPU=false` to `.env`.
+- `-OcrMode gpu` writes `OCR_USE_GPU=true` to `.env`.
+- The script does not uninstall Paddle/CUDA packages that are already installed.
+- In GPU mode, the script installs `paddlepaddle-gpu` and the cuDNN/cuBLAS runtime if they are missing.
+- In CPU mode, the script installs CPU `paddlepaddle` only if no Paddle runtime is present at all.
+
+## Notes
+
+CPU mode does not require CUDA, cuDNN, or an NVIDIA driver.
+
+GPU mode requires an NVIDIA driver and a matching CUDA/cuDNN runtime. On Windows, the backend also adds the NVIDIA DLL folders from `.venv` automatically before PaddleOCR is initialized.
+
+`TABLES_ENABLED` is a separate setting from CPU/GPU mode. If PP-Structure is not yet stable in your local environment, leave:
+
+```env
+TABLES_ENABLED=false
+```
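Note on `docs/OCR-RUNTIME-MODES.md`: the `.env` update that the script is described as performing can be illustrated in a few lines. The real `update.ps1` is PowerShell and is not shown in this part of the change, so the Python below is only a sketch of the described behaviour; `set_env_value` and `ocr_mode_to_env` are hypothetical names.

```python
def ocr_mode_to_env(mode: str) -> str:
    """Map the ``-OcrMode`` argument to the ``OCR_USE_GPU`` value."""
    if mode not in {"cpu", "gpu"}:
        raise ValueError(f"unsupported OCR mode: {mode!r}")
    return "true" if mode == "gpu" else "false"


def set_env_value(env_text: str, key: str, value: str) -> str:
    """Return ``env_text`` with ``key`` set to ``value``.

    An existing ``KEY=...`` line is replaced in place; otherwise the
    assignment is appended. Commented-out lines are left untouched.
    """
    lines = env_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith(f"{key}="):
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

Note that replacing the line in place drops any inline comment that followed the old value; a real script may want to preserve it.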
--- a/src/ocr_sprint/api/routes/documents.py
+++ b/src/ocr_sprint/api/routes/documents.py
@@ -10,7 +10,10 @@ flow on top:
 * `POST /documents?sync=true` — runs the pipeline inline (the original
                                 Phase 1 behaviour). Useful for tests and
                                 small-volume single-tenant deploys without
-                                 a Celery worker.
+                                 a Celery worker. The heavy OCR work is
+                                 offloaded to a thread-pool executor so the
+                                 uvicorn event loop stays responsive during
+                                 processing (~30-120s on CPU).
 * `GET  /documents/{job_id}`  — returns the current job state. Async
                                 clients poll this until `status` is in a
                                 terminal state (completed / needs_review /
@@ -19,6 +22,9 @@ flow on top:

 from __future__ import annotations

+import asyncio
+from concurrent.futures import ThreadPoolExecutor
+from functools import partial
 from typing import Annotated
 from uuid import UUID, uuid4

@@ -60,6 +66,13 @@ from ocr_sprint.schemas.review import (
 from ocr_sprint.storage.blob import get_blob_storage
 from ocr_sprint.utils.logging import get_logger

+# Thread pool dedicated to blocking OCR work. Using a *separate* pool
+# (rather than the default loop executor) lets us cap the number of
+# concurrent heavy OCR jobs independently of other thread-pool users.
+# With 1 Celery worker + 1 sync slot we never exceed 2 parallel OCR
+# runs; keep the pool at 1 so RAM stays bounded on the 7.4 GB server.
+_OCR_EXECUTOR = ThreadPoolExecutor(max_workers=1, thread_name_prefix="ocr-inline")
+
 router = APIRouter(
    prefix="/documents",
    tags=["documents"],
@@ -86,9 +99,12 @@ def _row_to_response(row: object) -> DocumentResponse:

    assert isinstance(row, JobRow)
    status_enum = DocumentStatus(row.status)
-    result_obj: ExtractionResult | None = None
+    result_obj = None
    if row.result is not None:
        result_obj = ExtractionResult.model_validate(row.result)
+        # Auto-number personnel entries sequentially (1, 2, 3, ...)
+        for idx, entry in enumerate(result_obj.personel, start=1):
+            entry.no = idx
    return DocumentResponse(
        job_id=row.job_id,
        status=status_enum,
@@ -161,11 +177,13 @@ async def create_document(


 async def _run_inline(job_id: UUID, content: bytes) -> DocumentResponse:
-    """Synchronous pipeline execution.
+    """Run the OCR pipeline without blocking the uvicorn event loop.

-    Each state transition opens its own short session so the request-scoped
-    session's rollback-on-exception behaviour cannot wipe out the
-    ``mark_failed`` write or strand the blob on disk.
+    ``run_pipeline`` is CPU-bound and can take 30-120 s on a 2 vCPU server.
+    Awaiting it directly on the async handler would freeze the entire event
+    loop (and therefore block health-checks, metrics, and every other request)
+    for the full duration. We push the work onto a dedicated single-thread
+    executor so the loop stays free while the OCR runs in the background.
    """
    import time

@@ -173,8 +191,13 @@ async def _run_inline(job_id: UUID, content: bytes) -> DocumentResponse:
        JobRepository(s).mark_processing(job_id)

    started = time.perf_counter()
+    loop = asyncio.get_running_loop()
    try:
-        output = run_pipeline(content)
+        # run_pipeline is synchronous; wrap it so asyncio can await it.
+        output = await loop.run_in_executor(
+            _OCR_EXECUTOR,
+            partial(run_pipeline, content),
+        )
    except ValueError as exc:
        with session_scope() as s:
            JobRepository(s).mark_failed(job_id, error=str(exc))
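Note on `documents.py`: the executor hand-off in `_run_inline` can be seen in isolation with a small standalone sketch. Here `slow_job` is a hypothetical stand-in for `run_pipeline`, and `heartbeat` stands in for other requests (health checks, metrics) sharing the event loop; the sleep durations are arbitrary.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# One worker: at most one blocking job runs at a time (mirrors _OCR_EXECUTOR).
_EXECUTOR = ThreadPoolExecutor(max_workers=1, thread_name_prefix="demo-inline")


def slow_job(payload: bytes) -> int:
    """Hypothetical stand-in for run_pipeline: blocks its thread for a while."""
    time.sleep(0.5)
    return len(payload)


async def heartbeat(ticks: list) -> None:
    """Stands in for other requests being served by the event loop."""
    for _ in range(5):
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.01)


async def main() -> tuple:
    loop = asyncio.get_running_loop()
    ticks: list = []
    beat = asyncio.create_task(heartbeat(ticks))
    # Awaiting the executor future yields to the loop, so heartbeat keeps running.
    result = await loop.run_in_executor(_EXECUTOR, partial(slow_job, b"abc"))
    finished_first = beat.done()
    await beat
    return result, len(ticks), finished_first
```

If `slow_job` were called directly inside `main`, the heartbeat task could not run until the job returned, which is the freeze the docstring above describes.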
--- a/src/ocr_sprint/api/routes/health.py
+++ b/src/ocr_sprint/api/routes/health.py
@@ -3,8 +3,12 @@
 from __future__ import annotations

 from fastapi import APIRouter
+from fastapi.responses import JSONResponse

 from ocr_sprint import __version__
+from ocr_sprint.config import get_settings
+from ocr_sprint.pipeline import ocr as _ocr
+from ocr_sprint.pipeline import table as _table

 router = APIRouter(tags=["health"])

@@ -13,3 +17,23 @@ router = APIRouter(tags=["health"])
 async def health() -> dict[str, str]:
    """Lightweight liveness check — does NOT touch the OCR engine."""
    return {"status": "ok", "version": __version__}
+
+
+@router.get("/health/ready")
+async def readiness() -> JSONResponse:
+    """Readiness check — returns 200 when OCR models are loaded, 503 if still warming up."""
+    settings = get_settings()
+    ocr_ready = _ocr._instance is not None
+    table_ready = (not settings.tables_enabled) or _table._instance is not None
+    ready = ocr_ready and table_ready
+    payload = {
+        "status": "ready" if ready else "warming_up",
+        "version": __version__,
+        "models": {
+            "paddleocr": "ready" if ocr_ready else "loading",
+            "pp_structure": (
+                "disabled" if not settings.tables_enabled else "ready" if table_ready else "loading"
+            ),
+        },
+    }
+    return JSONResponse(content=payload, status_code=200 if ready else 503)
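Note on `health.py`: the status mapping in `readiness` reads module-level singletons, which makes it awkward to unit-test. One option, sketched here under the assumption that a pure helper is acceptable, is to separate the mapping from the lookup. `readiness_payload` is a hypothetical function and is not part of this change.

```python
def readiness_payload(
    ocr_loaded: bool,
    tables_enabled: bool,
    table_loaded: bool,
    version: str,
) -> tuple:
    """Return ``(status_code, body)`` using the same mapping as the handler."""
    # Tables only gate readiness when the feature is switched on.
    table_ready = (not tables_enabled) or table_loaded
    ready = ocr_loaded and table_ready
    if not tables_enabled:
        pp_state = "disabled"
    else:
        pp_state = "ready" if table_loaded else "loading"
    body = {
        "status": "ready" if ready else "warming_up",
        "version": version,
        "models": {
            "paddleocr": "ready" if ocr_loaded else "loading",
            "pp_structure": pp_state,
        },
    }
    return (200 if ready else 503), body
```

The handler would then only gather the three booleans and wrap the result in a `JSONResponse`.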
--- a/src/ocr_sprint/config.py
+++ b/src/ocr_sprint/config.py
@@ -24,6 +24,7 @@ class Settings(BaseSettings):
    app_host: str = "0.0.0.0"
    app_port: int = 8000
    app_log_level: str = "INFO"
+    root_path: str = ""  # For reverse proxy with path prefix (e.g., "/ocr")

    # Storage (Phase 1: local fs)
    storage_local_dir: Path = Path("./storage")
--- a/src/ocr_sprint/data/master_pangkat.py
+++ b/src/ocr_sprint/data/master_pangkat.py
@@ -22,7 +22,7 @@ PANGKAT_VARIANTS: dict[str, tuple[str, ...]] = {
    # Bintara
    "BRIPDA": ("BRIPDA",),
    "BRIPTU": ("BRIPTU",),
-    "BRIGADIR": ("BRIGADIR", "BRIG", "BRIG POL"),
+    "BRIGADIR": ("BRIGADIR", "BRIG", "BRIG POL", "BRIGPOL"),
    "BRIPKA": ("BRIPKA",),
    "AIPDA": ("AIPDA",),
    "AIPTU": ("AIPTU",),
@@ -33,12 +33,45 @@ PANGKAT_VARIANTS: dict[str, tuple[str, ...]] = {
    # Perwira Menengah
    "KOMPOL": ("KOMPOL",),
    "AKBP": ("AKBP",),
-    "KOMBES POL": ("KOMBES POL", "KOMBESPOL", "KBP"),
+    "KOMBES POL": ("KOMBES POL", "KOMBESPOL", "KBP", "KOMBES"),
    # Perwira Tinggi
    "BRIGJEN POL": ("BRIGJEN POL", "BRIGJENPOL", "BRIGJEN"),
    "IRJEN POL": ("IRJEN POL", "IRJENPOL", "IRJEN"),
    "KOMJEN POL": ("KOMJEN POL", "KOMJENPOL", "KOMJEN"),
    "JENDERAL POL": ("JENDERAL POL", "JENDERALPOL", "JENDERAL"),
+    # PNS Polri (Pegawai Negeri Sipil di lingkungan Polri). PNS appear
+    # routinely on sprint panitia / undangan templates alongside Polri
+    # personnel, so we treat them as valid ranks for extraction.
+    # Sources: PP 11/2017 jo PP 17/2020 (Manajemen PNS); golongan I-IV.
+    # Golongan I (Juru)
+    "JURU MUDA": ("JURU MUDA",),
+    "JURU MUDA TK I": ("JURU MUDA TK I", "JURU MUDA TK.I", "JURU MUDA TINGKAT I"),
+    "JURU": ("JURU",),
+    "JURU TK I": ("JURU TK I", "JURU TK.I", "JURU TINGKAT I"),
+    # Golongan II (Pengatur)
+    "PENGATUR MUDA": ("PENGATUR MUDA",),
+    "PENGATUR MUDA TK I": (
+        "PENGATUR MUDA TK I",
+        "PENGATUR MUDA TK.I",
+        "PENGATUR MUDA TINGKAT I",
+    ),
+    "PENGATUR": ("PENGATUR",),
+    "PENGATUR TK I": ("PENGATUR TK I", "PENGATUR TK.I", "PENGATUR TINGKAT I"),
+    # Golongan III (Penata)
+    "PENATA MUDA": ("PENATA MUDA",),
+    "PENATA MUDA TK I": (
+        "PENATA MUDA TK I",
+        "PENATA MUDA TK.I",
+        "PENATA MUDA TINGKAT I",
+    ),
+    "PENATA": ("PENATA",),
+    "PENATA TK I": ("PENATA TK I", "PENATA TK.I", "PENATA TINGKAT I"),
+    # Golongan IV (Pembina)
+    "PEMBINA": ("PEMBINA",),
+    "PEMBINA TK I": ("PEMBINA TK I", "PEMBINA TK.I", "PEMBINA TINGKAT I"),
+    "PEMBINA UTAMA MUDA": ("PEMBINA UTAMA MUDA",),
+    "PEMBINA UTAMA MADYA": ("PEMBINA UTAMA MADYA",),
+    "PEMBINA UTAMA": ("PEMBINA UTAMA",),
 }

 # Reverse lookup: any variant (uppercased) → canonical form.
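Note on `master_pangkat.py`: the reverse lookup mentioned in the trailing comment is not shown in this hunk. A minimal sketch of how such a table can be derived from a variants mapping follows, using a three-entry subset of the real data. The whitespace collapsing in `normalize` is an assumption, and the project's actual `normalize_pangkat` may behave differently.

```python
from __future__ import annotations

import re

# Small illustrative subset of PANGKAT_VARIANTS.
VARIANTS: dict[str, tuple[str, ...]] = {
    "BRIGADIR": ("BRIGADIR", "BRIG", "BRIG POL", "BRIGPOL"),
    "KOMBES POL": ("KOMBES POL", "KOMBESPOL", "KBP", "KOMBES"),
    "PENATA TK I": ("PENATA TK I", "PENATA TK.I", "PENATA TINGKAT I"),
}

# Any variant (uppercased) maps back to its canonical form.
LOOKUP: dict[str, str] = {
    variant.upper(): canonical
    for canonical, variants in VARIANTS.items()
    for variant in variants
}


def normalize(raw: str) -> str | None:
    """Return the canonical rank for ``raw``, or None if it is unknown."""
    key = re.sub(r"\s+", " ", raw.strip().upper())
    return LOOKUP.get(key)
```

Because the lookup is keyed by variant, adding a spelling such as `"KOMBES"` only requires touching the canonical entry's tuple.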
--- a/src/ocr_sprint/ground_truth/service.py
+++ b/src/ocr_sprint/ground_truth/service.py
@@ -171,7 +171,10 @@ def iter_ground_truth_samples(
            reviewed_at=job_row.reviewed_at,
            created_at=job_row.created_at,
            initial_result=initial,
-            final_result=copy.deepcopy(job_row.result) if job_row.result else None,
+            # Use an ``is None`` check to stay consistent with
+            # ``build_initial_result``; otherwise an empty-dict result
+            # would produce ``initial_result={}`` but ``final_result=None``.
+            final_result=(copy.deepcopy(job_row.result) if job_row.result is not None else None),
            corrections=[
                GroundTruthCorrection(
                    field_path=c.field_path,
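Note on `ground_truth/service.py`: the difference between the old truthiness check and the new `is not None` check only shows up for empty containers. A tiny sketch with hypothetical helper names makes the edge case concrete:

```python
import copy


def snapshot_truthy(result):
    """Old behaviour: an empty dict is falsy, so it collapses to None."""
    return copy.deepcopy(result) if result else None


def snapshot_is_none(result):
    """New behaviour: only a missing result (None) maps to None."""
    return copy.deepcopy(result) if result is not None else None
```

With the second form, an empty-dict result stays an empty dict, which matches how the comment says `build_initial_result` treats it.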
--- a/src/ocr_sprint/main.py
+++ b/src/ocr_sprint/main.py
@@ -2,6 +2,10 @@

 from __future__ import annotations

+import threading
+from contextlib import asynccontextmanager
+from typing import AsyncIterator
+
 from fastapi import FastAPI

 from ocr_sprint import __version__
@@ -11,7 +15,10 @@ from ocr_sprint.api.routes import documents, ground_truth, health
 from ocr_sprint.config import get_settings
 from ocr_sprint.db import models as _models  # noqa: F401  (register ORM tables)
 from ocr_sprint.db.base import Base, get_engine
-from ocr_sprint.utils.logging import configure_logging
+from ocr_sprint.utils.logging import configure_logging, get_logger
+
+
+_startup_logger = get_logger(__name__)


 def _ensure_schema() -> None:
@@ -24,22 +31,74 @@ def _ensure_schema() -> None:
    Base.metadata.create_all(bind=get_engine())


+def _warmup_models_background() -> None:
+    """Load PaddleOCR and PP-Structure models in a background thread.
+
+    Running in a thread keeps the lifespan non-blocking so uvicorn can
+    start accepting health-check requests immediately while the heavy models
+    load (~5-15s on CPU). Requests that arrive before warmup completes will
+    wait on the existing _lock in each module rather than racing to load.
+    """
+    from ocr_sprint.config import get_settings as _gs
+    from ocr_sprint.pipeline import ocr as _ocr
+    from ocr_sprint.pipeline import table as _table
+
+    s = _gs()
+    try:
+        _ocr.warmup()
+    except Exception as exc:
+        _startup_logger.warning("paddleocr.warmup.failed", error=str(exc))
+
+    if s.tables_enabled:
+        try:
+            _table.warmup()
+        except Exception as exc:
+            _startup_logger.warning("pp_structure.warmup.failed", error=str(exc))
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI) -> AsyncIterator[None]:
+    """FastAPI lifespan: warm OCR models on startup in a background thread."""
+    _startup_logger.info("startup.warmup.begin")
+    t = threading.Thread(target=_warmup_models_background, name="ocr-warmup", daemon=True)
+    t.start()
+    yield
+    # Shutdown: nothing to clean up (models are process-global singletons).
+    _startup_logger.info("shutdown.complete")
+
+
 def create_app() -> FastAPI:
    """Application factory — keeps top-level state easy to test."""
    settings = get_settings()
    configure_logging(settings.app_log_level)
    _ensure_schema()

+    # Support for reverse proxy with path prefix (e.g., /ocr)
+    root_path = getattr(settings, "root_path", "")
+
    app = FastAPI(
+        lifespan=lifespan,
        title="OCR Sprint Service",
        version=__version__,
        description="OCR + structured extraction for Indonesian police 'surat sprint' documents.",
        docs_url="/docs",
        redoc_url="/redoc",
        openapi_url="/openapi.json",
+        root_path=root_path,
    )

    register_error_handlers(app)
+
+    # CORS: any origin may call the API; credentials stay off because auth uses X-API-Key.
+    from fastapi.middleware.cors import CORSMiddleware
+    app.add_middleware(
+        CORSMiddleware,
+        allow_origins=["*"],
+        allow_credentials=False,
+        allow_methods=["*"],
+        allow_headers=["*"],
+    )
+
    app.add_middleware(MetricsMiddleware)
    app.include_router(health.router, prefix="/api/v1")
    app.include_router(documents.router, prefix="/api/v1")
--- a/src/ocr_sprint/pipeline/confidence.py
+++ b/src/ocr_sprint/pipeline/confidence.py
@@ -22,6 +22,14 @@ _FLAG_PENALTY: dict[ReviewFlag, float] = {
    ReviewFlag.UNKNOWN_PANGKAT: 0.05,
    ReviewFlag.PERSONNEL_COUNT_MISMATCH: 0.15,
    ReviewFlag.DATE_PARSE_FAILED: 0.10,
+    # Text-based personnel fallback is a recoverable degradation: rank/NRP
+    # were extracted via regex from raw OCR rather than from a parsed table
+    # grid. Worth flagging for review but not catastrophic.
+    ReviewFlag.PERSONNEL_TEXT_FALLBACK: 0.05,
+    # An incomplete personnel row (no pangkat AND no nrp) is a strong
+    # signal something went wrong. Penalise heavily so the document
+    # routes to needs_review even if the rest of the extraction is fine.
+    ReviewFlag.INCOMPLETE_PERSONNEL_ROW: 0.15,
 }

 OCR_WEIGHT = 0.6
--- a/src/ocr_sprint/pipeline/extract/personnel.py
+++ b/src/ocr_sprint/pipeline/extract/personnel.py
@@ -64,6 +64,8 @@ _HEADER_SYNONYMS: dict[str, str] = {
    "jabatan dinas": "jabatan_dinas",
    "jabatan dalam dinas": "jabatan_dinas",
    "jbt dinas": "jabatan_dinas",
+    "struktural": "jabatan_dinas",
+    "jabatan struktural": "jabatan_dinas",
    # jabatan dalam sprint (role for this dispatch)
    "jabatan dalam sprint": "jabatan_sprint",
    "jabatan dalam sprin": "jabatan_sprint",
@@ -72,6 +74,8 @@ _HEADER_SYNONYMS: dict[str, str] = {
    "jabatan sprin": "jabatan_sprint",
    "tugas": "jabatan_sprint",
    "penugasan": "jabatan_sprint",
+    "dalam penugasan": "jabatan_sprint",
+    "jabatan dalam penugasan": "jabatan_sprint",
    # remarks
    "keterangan": "keterangan",
    "ket": "keterangan",
--- a/src/ocr_sprint/pipeline/extract/personnel_text.py
+++ b/src/ocr_sprint/pipeline/extract/personnel_text.py
@@ -0,0 +1,797 @@
+"""Text-based fallback personnel extractor.
+
+PP-Structure (Phase 3) is the primary path for personnel rows because it
+preserves the table grid. But PP-Structure can fail in two ways on real
+sprint scans:
+
+1. The table is not detected at all (low-quality scan, watermark, atypical
+   layout) — `extract_personnel` returns an empty list.
+2. The table IS detected but the column mapping is too sparse, so each row
+   collapses to a single ``nama`` cell with all other fields ``None``. This
+   is what was observed on a real Polres Cimahi sprint where the OCR
+   produced 24 rows with only ``nama`` populated.
+
+This module provides a regex/heuristic fallback that operates directly on
+the flat OCR text. It is deliberately conservative: a row must have BOTH a
+recognizable Polri rank AND an 8-digit NRP to be emitted, so we never
+generate the kind of "name-only" rows that motivated the fallback in the
+first place.
+"""
+
+from __future__ import annotations
+
+import re
+
+from ocr_sprint.data.master_pangkat import (
+    PANGKAT_VARIANTS,
+    is_valid_pangkat,
+    normalize_pangkat,
+)
+from ocr_sprint.schemas.personnel import PersonnelEntry
+
+# Build a single alternation of all known rank tokens (longest first so multi-
+# word ranks like "KOMBES POL" win over the single-word "KOMBES").
+_RANK_TOKENS: tuple[str, ...] = tuple(
+    sorted(
+        {variant for variants in PANGKAT_VARIANTS.values() for variant in variants},
+        key=lambda v: -len(v),
+    )
+)
+_RANK_ALT = "|".join(re.escape(tok) for tok in _RANK_TOKENS)
+# A rank token followed (within a few characters) by an 8-digit NRP.
+# We allow common separators: '/', '-', '.', ',', ':' or whitespace.
+# The trailing ``\b`` plus proximity to the 8-digit NRP is the
+# specificity signal — we deliberately do *not* require a leading
+# ``\b`` because real Polri sprint OCR routinely mashes the rank into
+# the trailing characters of the previous cell (observed on Polres
+# Banjar: "...CPHR., CBA, CI" runs straight into "AKP" giving
+# "CIAKP 84011113"). Requiring a leading boundary loses that row
+# entirely. The longest-first alternation order ensures multi-token
+# ranks ("KOMBES POL") still win over their shorter prefixes ("KOMBES").
+_RE_RANK_NRP_LINE = re.compile(
+    rf"(?P<rank>{_RANK_ALT})\b[\s/.\-,:]*?(?P<nrp>\d{{8}})\b",
+    re.IGNORECASE,
+)
+# A bare row number marker like "1." or "12)". OCR often puts it on its own
+# line in tabular layouts.
+_RE_ROW_NUMBER = re.compile(r"^\s*(\d{1,3})\s*[.)]\s*$")
+# Lines that should never be interpreted as a personnel name. These are
+# section headers, OCR garbage anchors, and column header tokens. We match
+# them with a *word-boundary* regex (built from this list) rather than a
+# bare ``startswith`` check, because short tokens like ``"NO"`` and
+# ``"KET"`` would otherwise reject perfectly valid Indonesian names
+# (e.g. ``"NOVA SARI"``, ``"NOOR HIDAYAT"``, ``"KETUT WARDANA"`` — the
+# latter being an extremely common Balinese birth-order name).
+_NAME_BLOCKLIST_TOKENS: tuple[str, ...] = (
+    "PADA TANGGAL",  # multi-word entries first so they win the alternation
+    "SURAT PERINTAH",
+    "DASAR",
+    "PERIHAL",
+    "PERTIMBANGAN",
+    "DIPERINTAHKAN",
+    "KEPADA",
+    "UNTUK",
+    "TEMBUSAN",
+    "DIKELUARKAN",
+    "SELESAI",
+    "DAFTAR",
+    "LAMPIRAN",
+    "NOMOR",
+    "TANGGAL",
+    "KEPOLISIAN",
+    "DAERAH",
+    "RESOR",
+    "SEKTOR",
+    "MABES",
+    "NRP",
+    "NIP",
+    "PANGKAT",
+    "JABATAN",
+    "NAMA",
+    "KETERANGAN",
+    "KET",
+    "NO",
+)
+_RE_NAME_BLOCKLIST = re.compile(
+    r"^(?:" + "|".join(re.escape(tok) for tok in _NAME_BLOCKLIST_TOKENS) + r")\b",
+    re.IGNORECASE,
+)
+# Minimal sanity check for a name: the line must contain at least one
+# ASCII letter. Pure-numeric or pure-symbol lines are rejected; nothing
+# else about the line's shape is enforced by this pattern.
+_RE_NAME_OK = re.compile(r"[A-Za-z]")
+
+
+def _is_plausible_name(line: str) -> bool:
+    """Return True iff ``line`` could plausibly be a personnel name."""
+    stripped = line.strip()
+    if not stripped or not _RE_NAME_OK.search(stripped):
+        return False
+    if _RE_NAME_BLOCKLIST.match(stripped):
+        return False
+    if _RE_ROW_NUMBER.match(stripped):
+        return False
+    if _RE_RANK_NRP_LINE.search(stripped):
+        return False
+    # Punctuation-only row markers ("1 .", "2)") contain no letters, so the
+    # _RE_NAME_OK check above has already rejected them by this point.
+    return True
+
+
+def _following_jabatan(lines: list[str], idx: int) -> str | None:
+    """Collect 1-3 follow-up lines after the rank+NRP line as the jabatan.
+
+    Stops at the next rank+NRP line, the next bare row-number line, or any
+    blocked prefix (section header / column header).
+    """
+    parts: list[str] = []
+    for fwd in range(idx + 1, min(idx + 4, len(lines))):
+        candidate = lines[fwd].strip()
+        if not candidate:
+            if parts:
+                break
+            continue
+        if _RE_RANK_NRP_LINE.search(candidate):
+            break
+        if _RE_ROW_NUMBER.match(candidate):
+            break
+        if _RE_NAME_BLOCKLIST.match(candidate):
+            break
+        parts.append(candidate)
+    if not parts:
+        return None
+    joined = " ".join(parts)
+    return " ".join(joined.split()) or None
+
+
+def extract_personnel_from_text(raw_text: str) -> list[PersonnelEntry]:
+    """Best-effort personnel extraction from a flat OCR text stream.
+
+    Strategy:
+
+    **Pass 1** — same-line rank+NRP (original strategy):
+    1. Find every rank+NRP pair in the joined text (several pairs may share
+       one OCR line); text without both signals is ignored.
+    2. For each pair, nama comes from the same-line prefix or the nearest
+       earlier plausible name line; jabatan from the suffix and next 1-3 lines.
+    3. Emit a ``PersonnelEntry`` only when we have at least pangkat + nrp.
+
+    **Pass 2** — separate-line rank and NRP (for tabular sprint formats):
+    If pass 1 produces no results, scan for lines containing a standalone
+    rank token, then look up to 2 lines forward for a standalone NRP.
+    This handles sprint formats where OCR renders each column on its own
+    line (e.g. Polres Banjar layout).
+
+    **Pass 3** — rank-only (for sprint formats *without* an NRP column):
+    Some sprint templates (panitia, undangan, etc.) list only nama +
+    pangkat + jabatan, no NRP. If pass 1 and pass 2 both yield nothing,
+    fall back to a rank-only scan: every standalone rank line (or
+    two-line rank like "KOMBES" + "POL" produced by narrow-column OCR)
+    becomes a row, with name assembled from preceding lines and jabatan
+    from following lines. ``nrp`` stays ``None``. False-positive risk
+    is higher (stray rank tokens in body text), so this only fires when
+    nothing else matched.
+
+    The fallback is intentionally conservative: later passes run only when
+    the earlier ones return nothing, and a name line found by look-back can
+    only be consumed once (so a stray rank token inside a paragraph doesn't
+    turn into multiple entries that all reuse the same name).
+    """
+    lines = raw_text.splitlines()
+
+    # ── Pass 1: rank+NRP on the same line ────────────────────────────
+    rows = _extract_same_line(lines)
+    if rows:
+        return rows
+
+    # ── Pass 2: rank and NRP on separate lines ───────────────────────
+    rows = _extract_separate_lines(lines)
+    if rows:
+        return rows
+
+    # ── Pass 3: rank-only (no NRP column) ────────────────────────────
+    return _extract_rank_only(lines)
+
+
+# Regex for a line that is *only* a rank token (possibly with punctuation).
+_RE_RANK_ONLY = re.compile(
+    rf"^\s*(?P<rank>{_RANK_ALT})\s*[/.\-,:]*\s*$",
+    re.IGNORECASE,
+)
+# Regex for a line that contains a standalone 8-digit NRP.
+_RE_NRP_ONLY = re.compile(r"(?<!\d)(?P<nrp>\d{8})(?!\d)")
+
+
+# Strip a leading row number marker like "1 ", "1.", "12)" from a name
+# prefix taken from the same OCR line as a rank+NRP match. Unlike
+# _RE_ROW_NUMBER (which matches a *whole* line), this is a prefix strip
+# for embedded same-line cases like "1 CUCU JUHANA, A.K.S. KOMPOL ...".
+_RE_LEADING_ROW_NUMBER = re.compile(r"^\s*\d{1,3}\s*[.):]?\s+")
+
+
+def _extract_same_line(lines: list[str]) -> list[PersonnelEntry]:
+    """Pass 1: rank+NRP pairs found anywhere in the joined text.
+
+    Uses ``finditer`` over the full ``\\n``-joined OCR text rather than
+    ``re.search`` per line so that multiple rank+NRP pairs on the same
+    OCR line still produce separate rows. This is required for sprint
+    scans where Paddle merges several table rows into one OCR line
+    (observed on Polres Banjar where row 2's "...CBA.AKP 77020049 KASAT
+    RESKRIM" was being swallowed into row 1's jabatan because per-line
+    ``search`` only returns the first match).
+
+    For each match we resolve nama from text *before* the match (the
+    same-line prefix takes precedence; otherwise look back through the
+    preceding lines bounded by the previous match) and jabatan from text
+    *after* the match (same-line suffix plus up to ~3 follow-up lines,
+    bounded by the next match).
+    """
+    if not lines:
+        return []
+    full_text = "\n".join(lines)
+
+    line_starts: list[int] = []
+    pos = 0
+    for line in lines:
+        line_starts.append(pos)
+        pos += len(line) + 1  # +1 for the joining "\n"
+
+    def offset_to_line(offset: int) -> int:
+        lo, hi = 0, len(line_starts)
+        while lo < hi:
+            mid = (lo + hi) // 2
+            if line_starts[mid] <= offset:
+                lo = mid + 1
+            else:
+                hi = mid
+        return max(0, lo - 1)
+
+    matches = list(_RE_RANK_NRP_LINE.finditer(full_text))
+    rows: list[PersonnelEntry] = []
+    consumed_lines: set[int] = set()
+
+    for i, m in enumerate(matches):
+        pangkat = normalize_pangkat(m.group("rank"))
+        if not pangkat or not is_valid_pangkat(pangkat):
+            continue
+        nrp = m.group("nrp")
+        ml = offset_to_line(m.start())
+        prev_ml = (
+            offset_to_line(matches[i - 1].start()) if i > 0 else -1
+        )
+        next_ml = (
+            offset_to_line(matches[i + 1].start())
+            if i + 1 < len(matches)
+            else len(lines)
+        )
+
+        line_text = lines[ml]
+        line_off = line_starts[ml]
+
+        # Same-line prefix: text on this line *before* the rank token.
+        # If the previous match was on this same line, only consider the
+        # text after that previous match's NRP (otherwise we'd reuse the
+        # earlier row's tail as this row's name).
+        prefix_start_local = 0
+        if prev_ml == ml and i > 0:
+            prefix_start_local = max(0, matches[i - 1].end() - line_off)
+        prefix = line_text[prefix_start_local : m.start() - line_off]
+
+        # Same-line suffix: text on this line *after* the NRP, capped at
+        # the next match's start if it's on this same line.
+        suffix_end_local = len(line_text)
+        if next_ml == ml and i + 1 < len(matches):
+            suffix_end_local = matches[i + 1].start() - line_off
+        suffix = line_text[m.end() - line_off : suffix_end_local]
+
+        # ── Resolve nama ────────────────────────────────────────────
+        nama: str | None = None
+        prefix_clean = _RE_LEADING_ROW_NUMBER.sub("", prefix).strip()
+        if prefix_clean and _is_plausible_name(prefix_clean):
+            nama = prefix_clean
+        elif prev_ml < ml:
+            for back in range(ml - 1, prev_ml, -1):
+                if back in consumed_lines or back < 0:
+                    continue
+                candidate = lines[back].strip()
+                if _is_plausible_name(candidate):
+                    nama = candidate
+                    consumed_lines.add(back)
+                    break
+
+        # ── Resolve jabatan ─────────────────────────────────────────
+        jabatan_parts: list[str] = []
+        suffix_clean = suffix.strip()
+        if suffix_clean:
+            jabatan_parts.append(suffix_clean)
+        if next_ml > ml:
+            max_fwd = min(ml + 4, next_ml, len(lines))
+            for fwd in range(ml + 1, max_fwd):
+                candidate = lines[fwd].strip()
+                if not candidate:
+                    if jabatan_parts:
+                        break
+                    continue
+                if _RE_NAME_BLOCKLIST.match(candidate):
+                    break
+                if _RE_ROW_NUMBER.match(candidate):
+                    break
+                jabatan_parts.append(candidate)
+        jabatan = (
+            " ".join(" ".join(jabatan_parts).split())
+            if jabatan_parts
+            else None
+        )
+
+        rows.append(
+            PersonnelEntry(
+                no=None,
+                pangkat=pangkat,
+                nrp=nrp,
+                nama=nama,
+                jabatan_dinas=jabatan,
+                jabatan_sprint=None,
+                keterangan=None,
+            )
+        )
+    return rows
+
+
+def _extract_separate_lines(lines: list[str]) -> list[PersonnelEntry]:
+    """Pass 2: rank and NRP on separate nearby lines.
+
+    Handles tabular sprint formats where OCR outputs each column as its
+    own line, e.g.:
+        1
+        CUCU JUHANA, A.K.S.
+        KOMPOL
+        70100418
+        KABAGOPS
+    """
+    consumed_names: set[int] = set()
+    consumed_nrps: set[int] = set()
+    rows: list[PersonnelEntry] = []
+
+    for idx, raw_line in enumerate(lines):
+        line = raw_line.strip()
+        rank_match = _RE_RANK_ONLY.match(line)
+        if not rank_match:
+            # Also try: line starts with a rank token plus < 5 trailing chars.
+            for tok in _RANK_TOKENS:
+                if line.upper().startswith(tok.upper()) and len(line) - len(tok) < 5:
+                    rank_match = re.match(
+                        rf"^\s*(?P<rank>{re.escape(tok)})\s*[/.\-,:]*",
+                        line,
+                        re.IGNORECASE,
+                    )
+                    if rank_match:
+                        break
+        if not rank_match:
+            continue
+
+        pangkat = normalize_pangkat(rank_match.group("rank"))
+        if not pangkat or not is_valid_pangkat(pangkat):
+            continue
+
+        # Look forward up to 2 lines for NRP
+        nrp: str | None = None
+        nrp_idx: int | None = None
+        for fwd in range(idx + 1, min(idx + 3, len(lines))):
+            if fwd in consumed_nrps:
+                continue
+            nrp_match = _RE_NRP_ONLY.search(lines[fwd].strip())
+            if nrp_match:
+                nrp = nrp_match.group("nrp")
+                nrp_idx = fwd
+                break
+
+        if not nrp:
+            continue
+        assert nrp_idx is not None
+        consumed_nrps.add(nrp_idx)
+
+        # Look back for name
+        nama: str | None = None
+        for back in range(idx - 1, max(idx - 6, -1), -1):
+            if back in consumed_names:
+                continue
+            candidate = lines[back].strip()
+            if _is_plausible_name(candidate):
+                nama = candidate
+                consumed_names.add(back)
+                break
+
+        # Look forward after NRP for jabatan
+        jabatan = _following_jabatan(lines, nrp_idx)
+        rows.append(
+            PersonnelEntry(
+                no=None,
+                pangkat=pangkat,
+                nrp=nrp,
+                nama=nama,
+                jabatan_dinas=jabatan,
+                jabatan_sprint=None,
+                keterangan=None,
+            )
+        )
+    return rows
+
+
+# Bare row-number markers used by sprint formats without NRP (the dot
+# is often missing in narrow-column OCR, e.g. just "1" on its own line).
+_RE_BARE_ROW_NUMBER = re.compile(r"^\s*\d{1,3}\s*[.):]?\s*$")
+
+
+def _try_match_rank_at(lines: list[str], idx: int) -> tuple[str, int] | None:
+    """Try to match a standalone rank starting at ``lines[idx]``.
+
+    Returns ``(rank_text, lines_consumed)`` on success. Handles narrow-
+    column OCR that splits a multi-token rank across two lines (e.g.
+    ``"KOMBES"`` + ``"POL"`` or ``"PENATA"`` + ``"TK I"``).
+
+    The two-line concatenation is tried *first* so that more-specific
+    multi-token ranks ("PENATA TK I") win over their less-specific
+    single-line prefix ("PENATA"). Without this preference, "TK I"
+    would leak into the jabatan column.
+    """
+    if idx >= len(lines):
+        return None
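+    line = lines[idx].strip()
+    if not line:
+        # A blank line must not pair with the rank below it: that would use up
+        # the two-line slot and strand the rank's own second half ("POL").
+        return None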
+    if idx + 1 < len(lines):
+        combined = (line + " " + lines[idx + 1].strip()).strip()
+        m2 = _RE_RANK_ONLY.match(combined)
+        if m2:
+            return m2.group("rank"), 2
+    m = _RE_RANK_ONLY.match(line)
+    if m:
+        return m.group("rank"), 1
+    return None
+
+
+def _extract_rank_only(lines: list[str]) -> list[PersonnelEntry]:
+    """Pass 3: rank-only fallback for sprint formats without an NRP column.
+
+    Each standalone rank line (single line or two-line concatenation) is
+    treated as the pivot of a personnel row. ``nama`` is assembled from
+    the preceding contiguous plausible-name lines (typical OCR splits a
+    long name across 2-3 short lines because of narrow columns); jabatan
+    is collected from following lines until the next rank or row marker.
+
+    ``nrp`` is always ``None`` for rows produced by this pass.
+    """
+    rows: list[PersonnelEntry] = []
+    consumed_lines: set[int] = set()
+    i = 0
+    while i < len(lines):
+        match = _try_match_rank_at(lines, i)
+        if not match:
+            i += 1
+            continue
+        rank_text, rank_span = match
+        pangkat = normalize_pangkat(rank_text)
+        if not pangkat or not is_valid_pangkat(pangkat):
+            i += 1
+            continue
+
+        # ── Look back for name lines (assemble up to 5 contiguous lines) ──
+        name_lines: list[str] = []
+        for back in range(i - 1, max(i - 6, -1), -1):
+            if back in consumed_lines:
+                break
+            candidate = lines[back].strip()
+            if not candidate:
+                if name_lines:
+                    break
+                continue
+            if _RE_BARE_ROW_NUMBER.match(candidate):
+                break
+            if _RE_NAME_BLOCKLIST.match(candidate):
+                break
+            if _try_match_rank_at(lines, back) is not None:
+                break
+            if not _is_plausible_name(candidate):
+                break
+            name_lines.insert(0, candidate)
+            consumed_lines.add(back)
+        nama = " ".join(" ".join(name_lines).split()) if name_lines else None
+
+        # ── Look forward for jabatan (stop at next rank / row marker) ─────
+        jabatan_parts: list[str] = []
+        fwd = i + rank_span
+        steps = 0
+        while fwd < len(lines) and steps < 8:
+            candidate = lines[fwd].strip()
+            if not candidate:
+                if jabatan_parts:
+                    break
+                fwd += 1
+                steps += 1
+                continue
+            if _RE_BARE_ROW_NUMBER.match(candidate):
+                break
+            if _try_match_rank_at(lines, fwd) is not None:
+                break
+            if _RE_NAME_BLOCKLIST.match(candidate):
+                break
+            jabatan_parts.append(candidate)
+            fwd += 1
+            steps += 1
+        jabatan = " ".join(" ".join(jabatan_parts).split()) if jabatan_parts else None
+
+        rows.append(
+            PersonnelEntry(
+                no=None,
+                pangkat=pangkat,
+                nrp=None,
+                nama=nama,
+                jabatan_dinas=jabatan,
+                jabatan_sprint=None,
+                keterangan=None,
+            )
+        )
+        i += rank_span
+    return rows
+
+
+# ── Column-aware Pass 3 (uses OCR bounding boxes) ───────────────────────
+
+
+def _box_x_left(box: tuple[tuple[float, float], ...]) -> float:
+    return min(p[0] for p in box)
+
+
+def _box_x_right(box: tuple[tuple[float, float], ...]) -> float:
+    return max(p[0] for p in box)
+
+
+def _box_x_center(box: tuple[tuple[float, float], ...]) -> float:
+    return (_box_x_left(box) + _box_x_right(box)) / 2
+
+
+def _box_y_top(box: tuple[tuple[float, float], ...]) -> float:
+    return min(p[1] for p in box)
+
+
+def _box_y_bottom(box: tuple[tuple[float, float], ...]) -> float:
+    return max(p[1] for p in box)
+
+
+def _box_y_center(box: tuple[tuple[float, float], ...]) -> float:
+    return (_box_y_top(box) + _box_y_bottom(box)) / 2
+
+
+def _box_height(box: tuple[tuple[float, float], ...]) -> float:
+    return _box_y_bottom(box) - _box_y_top(box)
+
+
+def extract_personnel_from_ocr_lines(ocr_lines: list) -> list[PersonnelEntry]:
+    """Column-aware Pass 3 for sprint formats without an NRP column.
+
+    Each ``ocr_line`` must expose ``text`` (str) and ``box`` (a tuple of
+    4 ``(x, y)`` corner points). We use the geometry to:
+
+    1. Detect rank lines (single-line or vertically-stacked two-line).
+    2. Estimate the PANGKAT column X-center from those rank lines.
+    3. For each rank, gather **only** lines in the NAMA column (X left
+       of PANGKAT) within the row's Y span as the name fragments, and
+       **only** lines in the JABATAN column (X right of PANGKAT) for
+       jabatan. This prevents column-bleed that flat-text Pass 3
+       suffers from on dense tables.
+
+    Returns ``[]`` if no rank lines are detected (caller can fall back
+    to the text-only Pass 3).
+    """
+    if not ocr_lines:
+        return []
+
+    # Sort by (y_top, x_left) for vertical-stacking rank detection.
+    indexed = sorted(
+        range(len(ocr_lines)),
+        key=lambda i: (_box_y_top(ocr_lines[i].box), _box_x_left(ocr_lines[i].box)),
+    )
+
+    # Step 1: find rank anchors.
+    # An anchor is one or two stacked OCR lines whose combined text matches
+    # _RE_RANK_ONLY and normalises to a known pangkat. Two-line stacks must
+    # X-overlap so we don't accidentally merge cells from different columns.
+    used: set[int] = set()
+    anchors: list[dict] = []
+    for pos, idx in enumerate(indexed):
+        if idx in used:
+            continue
+        ln = ocr_lines[idx]
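+        text = ln.text.strip()
+        if not text:
+            # An empty OCR line cannot start a rank; skipping it also stops it
+            # from pairing with the line below and claiming that rank's stack slot.
+            continue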
+
+        rank_text: str | None = None
+        member_idxs: list[int] = [idx]
+
+        # Try two-line stack first (so PENATA TK I beats PENATA).
+        for j_pos in range(pos + 1, min(pos + 5, len(indexed))):
+            j_idx = indexed[j_pos]
+            if j_idx in used:
+                continue
+            other = ocr_lines[j_idx]
+            x_overlap = (
+                min(_box_x_right(ln.box), _box_x_right(other.box))
+                - max(_box_x_left(ln.box), _box_x_left(other.box))
+            )
+            if x_overlap <= 0:
+                continue
+            y_gap = _box_y_top(other.box) - _box_y_bottom(ln.box)
+            if y_gap > _box_height(ln.box) * 1.5:
+                break
+            combined = (text + " " + other.text.strip()).strip()
+            m2 = _RE_RANK_ONLY.match(combined)
+            if m2:
+                rank_text = m2.group("rank")
+                member_idxs.append(j_idx)
+                break
+
+        if rank_text is None:
+            m1 = _RE_RANK_ONLY.match(text)
+            if m1:
+                rank_text = m1.group("rank")
+
+        if rank_text is None:
+            continue
+        pangkat = normalize_pangkat(rank_text)
+        if not pangkat or not is_valid_pangkat(pangkat):
+            continue
+
+        anchors.append(
+            {
+                "member_idxs": member_idxs,
+                "pangkat": pangkat,
+                "x_center": _box_x_center(ln.box),
+                "y_top": min(_box_y_top(ocr_lines[m].box) for m in member_idxs),
+                "y_bottom": max(_box_y_bottom(ocr_lines[m].box) for m in member_idxs),
+            }
+        )
+        used.update(member_idxs)
+
+    if not anchors:
+        return []
+
+    # Sort anchors by Y so we can compute row spans.
+    anchors.sort(key=lambda a: a["y_top"])
+
+    # Estimate PANGKAT column X-center as the median of rank anchor X-centers.
+    xs_sorted = sorted(a["x_center"] for a in anchors)
+    pangkat_x = xs_sorted[len(xs_sorted) // 2]
+
+    # X tolerance: half the median rank-line width. Lines with x_center
+    # within ±tolerance of pangkat_x are *in* the PANGKAT column and
+    # excluded from both NAMA and JABATAN buckets.
+    rank_widths = [
+        _box_x_right(ocr_lines[a["member_idxs"][0]].box)
+        - _box_x_left(ocr_lines[a["member_idxs"][0]].box)
+        for a in anchors
+    ]
+    rank_widths.sort()
+    median_rank_width = rank_widths[len(rank_widths) // 2] if rank_widths else 50.0
+    column_margin = max(median_rank_width * 0.5, 5.0)
+
+    # Try to split the JABATAN side into STRUKTURAL (jabatan_dinas) and
+    # DALAM SPRIN (jabatan_sprint) by clustering jabatan-side X-centers.
+    # This is a one-dimensional largest-gap split, not an iterative
+    # k-means: collect the X-centers of all lines to the right of PANGKAT,
+    # sort them, and put the column boundary at the midpoint of the widest
+    # gap. KET is typically the right-most narrow column; it is allowed to
+    # bleed into jabatan_sprint because it is commonly empty.
+    jabatan_xs: list[float] = []
+    for ln in ocr_lines:
+        x = _box_x_center(ln.box)
+        if x > pangkat_x + column_margin and ln.text.strip():
+            jabatan_xs.append(x)
+    jabatan_split_x: float | None = None
+    if len(jabatan_xs) >= 4:
+        jabatan_xs.sort()
+        max_gap = 0.0
+        max_gap_x: float | None = None
+        for k in range(1, len(jabatan_xs)):
+            gap = jabatan_xs[k] - jabatan_xs[k - 1]
+            if gap > max_gap:
+                max_gap = gap
+                max_gap_x = (jabatan_xs[k] + jabatan_xs[k - 1]) / 2
+        # Only use the split if the gap is meaningfully larger than a
+        # within-column gap (heuristic: > 1.5× median rank width).
+        if max_gap_x is not None and max_gap > median_rank_width * 1.5:
+            jabatan_split_x = max_gap_x
+
+    # Pre-compute each anchor's y_center for midpoint row dividers.
+    anchor_y_centers = [(a["y_top"] + a["y_bottom"]) / 2 for a in anchors]
+
+    rows: list[PersonnelEntry] = []
+    for i, anchor in enumerate(anchors):
+        # Row Y span: midpoint between this anchor and its neighbours.
+        # Using the midpoint (rather than the previous anchor's
+        # y_bottom) prevents row N's tail content (e.g. last name
+        # fragment "M.H.") from leaking into row N+1's nama bucket
+        # when rank lines don't extend to the full visual row height.
+        y_lo = (
+            (anchor_y_centers[i - 1] + anchor_y_centers[i]) / 2
+            if i > 0
+            else float("-inf")
+        )
+        y_hi = (
+            (anchor_y_centers[i] + anchor_y_centers[i + 1]) / 2
+            if i + 1 < len(anchors)
+            else float("inf")
+        )
+
+        nama_pieces: list[tuple[float, str]] = []
+        struktural_pieces: list[tuple[float, str]] = []
+        sprint_pieces: list[tuple[float, str]] = []
+        for j, ln in enumerate(ocr_lines):
+            if j in anchor["member_idxs"]:
+                continue
+            text = ln.text.strip()
+            if not text:
+                continue
+            x = _box_x_center(ln.box)
+            y = _box_y_center(ln.box)
+            if not (y_lo <= y < y_hi):
+                continue
+            if x < pangkat_x - column_margin:
+                # NAMA side
+                if _RE_NAME_BLOCKLIST.match(text):
+                    continue
+                if _RE_BARE_ROW_NUMBER.match(text):
+                    continue
+                if not _is_plausible_name(text):
+                    continue
+                nama_pieces.append((y, text))
+            elif x > pangkat_x + column_margin:
+                # JABATAN side — split into STRUKTURAL vs DALAM SPRIN
+                # using the geometric column boundary detected above.
+                if _RE_NAME_BLOCKLIST.match(text):
+                    continue
+                if jabatan_split_x is not None and x > jabatan_split_x:
+                    sprint_pieces.append((y, text))
+                else:
+                    struktural_pieces.append((y, text))
+            # else: in PANGKAT column or column margin — skip
+
+        nama_pieces.sort(key=lambda p: p[0])
+        struktural_pieces.sort(key=lambda p: p[0])
+        sprint_pieces.sort(key=lambda p: p[0])
+
+        # Strip leading row number from the first nama piece (e.g. "1 F. GUNTUR"
+        # collapses to "F. GUNTUR" if the row marker happens to share a box).
+        if nama_pieces:
+            head = _RE_LEADING_ROW_NUMBER.sub("", nama_pieces[0][1]).strip()
+            nama_pieces[0] = (nama_pieces[0][0], head)
+
+        def _join(pieces: list[tuple[float, str]]) -> str | None:
+            text = " ".join(t for _, t in pieces if t).strip()
+            text = " ".join(text.split())
+            return text or None
+
+        rows.append(
+            PersonnelEntry(
+                no=None,
+                pangkat=anchor["pangkat"],
+                nrp=None,
+                nama=_join(nama_pieces),
+                jabatan_dinas=_join(struktural_pieces),
+                jabatan_sprint=_join(sprint_pieces),
+                keterangan=None,
+            )
+        )
+    return rows
+
+
+def is_low_quality(rows: list[PersonnelEntry]) -> bool:
+    """Heuristic: did PP-Structure produce useless rows?
+
+    A row is useful when it has at least pangkat OR nrp. If most rows have
+    only ``nama`` (or worse, nothing) the table extraction failed and the
+    caller should retry with the text-based fallback.
+    """
+    if not rows:
+        return True
+    useful = sum(1 for r in rows if r.pangkat or r.nrp)
+    # Require at least 30% of rows to carry rank/NRP signal. Below that we
+    # assume the column mapper degraded to "everything is nama" and prefer
+    # a fresh attempt.
+    return useful / max(1, len(rows)) < 0.3
--- a/src/ocr_sprint/pipeline/extract/regex_rules.py
+++ b/src/ocr_sprint/pipeline/extract/regex_rules.py
@@ -53,19 +53,52 @@ _RE_TANGGAL_ID = re.compile(
    re.IGNORECASE,
 )

-# Satuan penerbit usually appears in the document letterhead, prefixed by
-# KEPOLISIAN <NEGARA|DAERAH|RESORT|SEKTOR>.
-_RE_SATUAN = re.compile(
-    r"KEPOLISIAN\s+(?:NEGARA\s+REPUBLIK\s+INDONESIA|DAERAH|RESOR(?:T)?|SEKTOR|RESORT)"
-    r"[^\n]{0,80}",
+# Polri letterhead pieces. The full letterhead spans multiple lines that are
+# often broken across separate OCR rows like:
+#
+#     KEPOLISIAN NEGARA REPUBLIK INDONESIA
+#     DAERAH JAWA BARAT
+#     RESOR CIMAHI
+#
+# We capture each individual level so we can reconstruct the most-specific
+# unit (RESOR / SEKTOR > DAERAH > NEGARA) — a downstream consumer cares
+# about *which* unit issued the sprint, not just that some Polri unit did.
+_RE_LEVEL_NEGARA = re.compile(
+    r"KEPOLISIAN\s+NEGARA\s+REPUBLIK\s+INDONESIA",
    re.IGNORECASE,
 )
+_RE_LEVEL_DAERAH = re.compile(
+    r"(?:KEPOLISIAN\s+)?DAERAH\s+([A-Z][A-Z .'/-]{1,60}?)(?:$|\s*\n)",
+    re.IGNORECASE | re.MULTILINE,
+)
+_RE_LEVEL_RESOR = re.compile(
+    r"(?:KEPOLISIAN\s+)?RESORT?\s+([A-Z][A-Z .'/-]{1,60}?)(?:$|\s*\n)",
+    re.IGNORECASE | re.MULTILINE,
+)
+_RE_LEVEL_SEKTOR = re.compile(
+    r"(?:KEPOLISIAN\s+)?SEKTOR\s+([A-Z][A-Z .'/-]{1,60}?)(?:$|\s*\n)",
+    re.IGNORECASE | re.MULTILINE,
+)
+_RE_LEVEL_MABES = re.compile(r"MABES\s+POLRI\b", re.IGNORECASE)

 # "Perihal : ...." up to end of line.
 _RE_PERIHAL = re.compile(r"PERIHAL\s*[:\-]\s*(.+)", re.IGNORECASE)
+# Many sprint docs (especially Polres-level) use 'Pertimbangan' as the
+# single-paragraph rationale block instead of (or alongside) 'Perihal'.
+# When `perihal` is missing we fall back to the first non-empty line under
+# 'Pertimbangan :' so the LLM doesn't have to guess and so a downstream
+# audit trail still has *something* in the perihal slot.
+_RE_PERTIMBANGAN_LABEL = re.compile(r"^\s*PERTIMBANGAN\b", re.IGNORECASE)

 # A dasar entry typically begins with a number and dot, e.g. "1. UU No. 2 Tahun 2002 ..."
 _RE_DASAR_ITEM = re.compile(r"^\s*(\d+)\s*[.)]\s*(.+)$")
+# OCR sometimes splits the number from its content across two lines:
+#     1.
+#      Undang-Undang Nomor 2 Tahun 2002 ...
+# We detect a bare-number line and merge with the next non-empty line.
+_RE_DASAR_BARE_NUMBER = re.compile(r"^\s*(\d+)\s*[.)]\s*$")
+# Generic 'untuk' bullet — same shape as a dasar item.
+_RE_UNTUK_ITEM = re.compile(r"^\s*(\d+)\s*[.)]\s*(.+)$")

 # Signatory NRP — Polri NRPs are 8 digits, civil servant NIPs are 18 digits.
 _RE_NRP = re.compile(r"\b(NRP|NIP)\s*[.:]?\s*(\d{8,20})\b", re.IGNORECASE)
@@ -99,54 +132,159 @@ def find_tanggal(text: str) -> date | None:
        return None


+def _clean_unit_tail(tail: str) -> str:
+    """Strip trailing punctuation/noise from the captured place name."""
+    return " ".join(tail.split()).strip(" .,;:'\"")
+
+
 def find_satuan(text: str) -> str | None:
-    """Return the first letterhead match (issuing unit), normalized."""
-    match = _RE_SATUAN.search(text)
-    if not match:
-        return None
-    return " ".join(match.group(0).split())
+    """Return the issuing unit, preferring the most-specific letterhead level.
+
+    Polri letterheads are hierarchical (Negara > Daerah > Resor/Sektor). The
+    actual *issuing* unit is the deepest level present in the letterhead, not
+    the topmost generic 'KEPOLISIAN NEGARA REPUBLIK INDONESIA' line. We scan
+    for each level independently and pick the most specific one available;
+    if only the generic Negara line is present we return that.
+
+    Examples
+    --------
+    >>> find_satuan("KEPOLISIAN NEGARA REPUBLIK INDONESIA\\n"
+    ...             "DAERAH JAWA BARAT\\nRESOR CIMAHI")
+    'KEPOLISIAN RESOR CIMAHI'
+    >>> find_satuan("KEPOLISIAN NEGARA REPUBLIK INDONESIA")
+    'KEPOLISIAN NEGARA REPUBLIK INDONESIA'
+    """
+    # We only look at the first 25 lines: letterheads always sit at the
+    # very top, and constraining the search prevents false positives from
+    # body text like '... Polres Cimahi ...' deep in a paragraph.
+    head = "\n".join(text.splitlines()[:25])
+
+    sektor = _RE_LEVEL_SEKTOR.search(head)
+    if sektor:
+        return f"KEPOLISIAN SEKTOR {_clean_unit_tail(sektor.group(1))}"
+    resor = _RE_LEVEL_RESOR.search(head)
+    if resor:
+        return f"KEPOLISIAN RESOR {_clean_unit_tail(resor.group(1))}"
+    daerah = _RE_LEVEL_DAERAH.search(head)
+    if daerah:
+        return f"KEPOLISIAN DAERAH {_clean_unit_tail(daerah.group(1))}"
+    if _RE_LEVEL_MABES.search(head):
+        return "MABES POLRI"
+    if _RE_LEVEL_NEGARA.search(head):
+        return "KEPOLISIAN NEGARA REPUBLIK INDONESIA"
+    return None


 def find_perihal(text: str) -> str | None:
-    """Return the first 'Perihal: ...' line, trimmed to that line only."""
+    """Return the first 'Perihal: ...' line, trimmed to that line only.
+
+    Falls back to the first non-empty line under a 'Pertimbangan' label
+    (a common variant in Polres-level surat sprint that doesn't have a
+    distinct 'Perihal' field). We deliberately keep this in regex-land
+    rather than deferring to the LLM because the LLM tends to hallucinate
+    perihal content from arbitrary paragraphs.
+    """
    for line in text.splitlines():
        m = _RE_PERIHAL.search(line)
        if m:
            return m.group(1).strip()
+
+    lines = text.splitlines()
+    for idx, line in enumerate(lines):
+        if _RE_PERTIMBANGAN_LABEL.match(line):
+            for follow in lines[idx + 1 : idx + 5]:
+                stripped = follow.strip(" :\t")
+                if stripped:
+                    return stripped
+            break
    return None


+def _collect_numbered_section(
+    lines: list[str],
+    start_idx: int,
+    terminators: tuple[str, ...],
+) -> list[str]:
+    """Walk forward from ``start_idx`` collecting numbered list items.
+
+    Robust to OCR splitting the number marker onto its own line:
+        '1.'  ->   sets ``pending_marker``
+        next non-empty line starts the item body.
+
+    Continuation lines (non-empty, no leading number, after a started item)
+    are appended to the current item. Stops at a line whose uppercase form
+    starts with one of ``terminators``, or at two blank lines after an item.
+    """
+    items: list[str] = []
+    pending_marker = False
+    blank_run = 0
+    for raw_line in lines[start_idx:]:
+        line = raw_line.strip()
+        upper = line.upper()
+        if any(upper.startswith(term) for term in terminators):
+            break
+        if not line:
+            blank_run += 1
+            # Two consecutive blank lines reliably mark the end of a section.
+            # A single blank line is tolerated because OCR sprinkles them.
+            if blank_run >= 2 and items and not pending_marker:
+                break
+            continue
+        blank_run = 0
+        bare = _RE_DASAR_BARE_NUMBER.match(line)
+        if bare:
+            pending_marker = True
+            continue
+        m = _RE_DASAR_ITEM.match(line)
+        if m:
+            items.append(m.group(2).strip())
+            pending_marker = False
+            continue
+        if pending_marker:
+            items.append(line)
+            pending_marker = False
+            continue
+        if items:
+            items[-1] = (items[-1] + " " + line).strip()
+    return items
+
+
 def find_dasar_list(text: str) -> list[str]:
    """Extract numbered 'Dasar' items from the text.

    Heuristic: locate a line containing 'DASAR' (Indonesian: "DASAR :") and
-    collect subsequent lines that start with a number. Stops at a blank line
-    or a line beginning with another section header keyword.
+    delegate to ``_collect_numbered_section``, which handles three line
+    shapes seen in OCR output:
+
+    1. Inline numbered items: ``"1. Undang-Undang ..."``.
+    2. Bare-number lines (the OCR engine puts the number alone on a line):
+       ``"1.\\n Undang-Undang ..."``.
+    3. Continuation lines (a line that is the wrapped tail of the previous
+       item gets appended back onto it).
    """
    lines = text.splitlines()
-    items: list[str] = []
-    in_dasar = False
    section_terminators = ("DIPERINTAHKAN", "UNTUK", "DASAR HUKUM", "PERIHAL")
-    for raw_line in lines:
-        line = raw_line.strip()
-        if not in_dasar:
-            if re.match(r"^\s*DASAR\b", line, re.IGNORECASE):
-                in_dasar = True
-            continue
-        if not line:
-            if items:
-                break
-            continue
-        upper = line.upper()
-        if any(upper.startswith(term) for term in section_terminators):
-            break
-        m = _RE_DASAR_ITEM.match(line)
-        if m:
-            items.append(m.group(2).strip())
-        elif items:
-            # continuation of the previous dasar item
-            items[-1] = (items[-1] + " " + line).strip()
-    return items
+    for idx, raw_line in enumerate(lines):
+        if re.match(r"^\s*DASAR\b", raw_line.strip(), re.IGNORECASE):
+            return _collect_numbered_section(lines, idx + 1, section_terminators)
+    return []
+
+
+def find_untuk_list(text: str) -> list[str]:
+    """Extract numbered 'Untuk' items from the text.
+
+    The 'Untuk' section follows 'DIPERINTAHKAN' / 'Kepada' and lists the
+    tasks assigned to the personnel. Same OCR shape as Dasar, so we reuse
+    the collector but with different terminators.
+    """
+    lines = text.splitlines()
+    # Stop conditions: 'Selesai' (boilerplate), 'Dikeluarkan di' / 'Pada
+    # tanggal' (signature block), 'Tembusan' (carbon-copy section).
+    terminators = ("SELESAI", "DIKELUARKAN", "TEMBUSAN", "PADA TANGGAL")
+    for idx, raw_line in enumerate(lines):
+        if re.match(r"^\s*UNTUK\b", raw_line.strip(), re.IGNORECASE):
+            return _collect_numbered_section(lines, idx + 1, terminators)
+    return []


 def find_signatory(text: str) -> Signatory:
--- a/src/ocr_sprint/pipeline/extract/validators.py
+++ b/src/ocr_sprint/pipeline/extract/validators.py
@@ -30,6 +30,13 @@ def validate_personnel_entry(entry: PersonnelEntry) -> list[ReviewFlag]:
        flags.append(ReviewFlag.INVALID_NRP)
    if entry.pangkat and not is_valid_pangkat(entry.pangkat):
        flags.append(ReviewFlag.UNKNOWN_PANGKAT)
+    # Identification of a personnel row requires at least pangkat OR nrp.
+    # A row carrying only a name is structurally incomplete - likely a
+    # mis-aligned table cell or a leaked tembusan/dasar fragment - and must
+    # be flagged for human review even though the pangkat and nrp checks
+    # above each pass (because both fields are empty).
+    if not entry.pangkat and not entry.nrp:
+        flags.append(ReviewFlag.INCOMPLETE_PERSONNEL_ROW)
    return flags


--- a/src/ocr_sprint/pipeline/ocr.py
+++ b/src/ocr_sprint/pipeline/ocr.py
@@ -36,6 +36,73 @@ class OCRLine:
    box: tuple[tuple[float, float], ...]  # 4 (x, y) corner points


+def _line_y_center(line: OCRLine) -> float:
+    return sum(p[1] for p in line.box) / len(line.box)
+
+
+def _line_x_left(line: OCRLine) -> float:
+    return min(p[0] for p in line.box)
+
+
+def _line_height(line: OCRLine) -> float:
+    ys = [p[1] for p in line.box]
+    return max(ys) - min(ys)
+
+
+def sort_lines_by_layout(lines: list[OCRLine]) -> list[OCRLine]:
+    """Reorder lines into top-to-bottom, left-to-right reading order.
+
+    PaddleOCR's natural output order reflects detection order, not visual
+    layout. On dense tables (e.g. Polda Kalbar Akpol-panitia sprint) this
+    interleaves rows and columns — Paddle may emit a row's KET column
+    before its NAMA column, breaking every downstream extractor that
+    assumes top-to-bottom row order.
+
+    We rebuild reading order by:
+
+    1. Sorting by ``y_center``.
+    2. Grouping consecutive lines into row-bands while each ``y_center``
+       stays within half the median line height of the band's running
+       mean (so visually same-row cells stay together even when their
+       boxes don't perfectly align).
+    3. Sorting each band left-to-right by ``x_left``.
+    """
+    if not lines:
+        return []
+
+    heights = [_line_height(ln) for ln in lines if _line_height(ln) > 0]
+    if not heights:
+        return list(lines)
+    median_height = sorted(heights)[len(heights) // 2]
+    band_threshold = max(1.0, median_height * 0.5)
+
+    by_y = sorted(lines, key=_line_y_center)
+    bands: list[list[OCRLine]] = []
+    current_band: list[OCRLine] = []
+    current_y: float | None = None
+    for ln in by_y:
+        y = _line_y_center(ln)
+        if current_y is None or abs(y - current_y) <= band_threshold:
+            current_band.append(ln)
+            # Track the band's running y-center as the mean of its
+            # members so a slowly-drifting set of cells doesn't split
+            # mid-row.
+            current_y = (
+                sum(_line_y_center(b) for b in current_band) / len(current_band)
+            )
+        else:
+            bands.append(current_band)
+            current_band = [ln]
+            current_y = y
+    if current_band:
+        bands.append(current_band)
+
+    ordered: list[OCRLine] = []
+    for band in bands:
+        ordered.extend(sorted(band, key=_line_x_left))
+    return ordered
+
+
 @dataclass(frozen=True)
 class OCRPage:
    """OCR output for a single page."""
@@ -44,8 +111,8 @@ class OCRPage:

    @property
    def text(self) -> str:
-        """Reconstruct page text by concatenating lines (order = paddle's output order)."""
-        return "\n".join(line.text for line in self.lines)
+        """Reconstruct page text in visual reading order (top-to-bottom, left-to-right)."""
+        return "\n".join(line.text for line in sort_lines_by_layout(self.lines))

    @property
    def mean_confidence(self) -> float:
@@ -55,9 +122,14 @@ class OCRPage:


 def _build_paddleocr() -> PaddleOCR:
+    s = get_settings()
+    if s.ocr_use_gpu:
+        from ocr_sprint.utils.gpu import configure_nvidia_dll_path
+
+        configure_nvidia_dll_path()
+
    from paddleocr import PaddleOCR

-    s = get_settings()
    kwargs: dict[str, object] = {
        "lang": s.ocr_lang,
        "use_angle_cls": True,
@@ -84,6 +156,19 @@ def get_ocr() -> PaddleOCR:
    return _instance


+def warmup() -> None:
+    """Eagerly initialize the PaddleOCR engine.
+
+    Call this during application startup so the first real request does not
+    pay the model-loading cost (~2-5s on CPU). It also makes the process less
+    likely to block in uninterruptible sleep (state D) mid-request when memory
+    is tight, because the model weights were already paged in at startup.
+    """
+    _logger.info("paddleocr.warmup.start")
+    get_ocr()
+    _logger.info("paddleocr.warmup.done")
+
+
 def run_ocr(image: NDArrayU8) -> OCRPage:
    """Run OCR on a single BGR image and return a structured page result."""
    engine = get_ocr()
--- a/src/ocr_sprint/pipeline/orchestrator.py
+++ b/src/ocr_sprint/pipeline/orchestrator.py
@@ -19,7 +19,16 @@ from ocr_sprint.llm.extractor import llm_fill_header
 from ocr_sprint.pipeline.confidence import compute_confidence, route
 from ocr_sprint.pipeline.document_detect import DocumentDetectConfig, detect_and_correct
 from ocr_sprint.pipeline.extract.personnel import extract_personnel
-from ocr_sprint.pipeline.extract.regex_rules import extract_header, find_signatory
+from ocr_sprint.pipeline.extract.personnel_text import (
+    extract_personnel_from_ocr_lines,
+    extract_personnel_from_text,
+    is_low_quality,
+)
+from ocr_sprint.pipeline.extract.regex_rules import (
+    extract_header,
+    find_signatory,
+    find_untuk_list,
+)
 from ocr_sprint.pipeline.extract.validators import validate_extraction
 from ocr_sprint.pipeline.ingest import NDArrayU8, detect_source_kind, ingest
 from ocr_sprint.pipeline.ocr import OCRPage, run_ocr
@@ -112,6 +121,7 @@ def run_pipeline(content: bytes) -> PipelineOutput:
            header = merged

    personel: list[PersonnelEntry] = []
+    table_flags: list[ReviewFlag] = []
    if s.tables_enabled and cleaned_pages:
        all_tables: list[DetectedTable] = []
        for img in cleaned_pages:
@@ -126,14 +136,58 @@ def run_pipeline(content: bytes) -> PipelineOutput:
            personel_rows=len(personel),
        )

-    initial_flags: list[ReviewFlag] = list(llm_flags)
+    # Text-based fallback: PP-Structure can succeed structurally but emit
+    # rows with only ``nama`` populated (column mapper degraded), or fail to
+    # detect the table at all. In both cases the regex fallback that scans
+    # raw OCR for rank+NRP pairs produces a much more useful result. We
+    # always run it when the structured path is empty or low-quality, and
+    # raise a review flag so the operator knows the document didn't go
+    # through the preferred path.
+    if is_low_quality(personel):
+        fallback_rows = extract_personnel_from_text(full_text)
+        # If text-based fallback produced rows but they all lack NRP
+        # (Pass 3 territory), retry with the column-aware extractor that
+        # uses OCR bounding boxes. On dense tables (e.g. Polda Kalbar
+        # Akpol-panitia), text-only Pass 3 bleeds adjacent columns into
+        # nama/jabatan because lines are interleaved within each Y-band;
+        # the columnar variant restricts each field to its visual column.
+        text_only_no_nrp = bool(fallback_rows) and all(
+            r.nrp is None for r in fallback_rows
+        )
+        if (not fallback_rows) or text_only_no_nrp:
+            ocr_lines = [ln for page in ocr_pages for ln in page.lines]
+            columnar_rows = extract_personnel_from_ocr_lines(ocr_lines)
+            if columnar_rows and (
+                not fallback_rows or len(columnar_rows) >= len(fallback_rows)
+            ):
+                fallback_rows = columnar_rows
+        if fallback_rows:
+            personel = fallback_rows
+            # Pass 3 / columnar emit rows with nrp=None for sprint
+            # templates without an NRP column. Surface that with a
+            # distinct flag so operators know to expect missing NRPs by
+            # design rather than by OCR failure.
+            no_nrp = all(r.nrp is None for r in fallback_rows)
+            if no_nrp:
+                table_flags.append(ReviewFlag.PERSONNEL_TEXT_FALLBACK_NO_NRP)
+            else:
+                table_flags.append(ReviewFlag.PERSONNEL_TEXT_FALLBACK)
+            _logger.info(
+                "pipeline.personnel_text_fallback",
+                fallback_rows=len(fallback_rows),
+                no_nrp=no_nrp,
+            )
+
+    untuk_items = find_untuk_list(full_text)
+
+    initial_flags: list[ReviewFlag] = list(llm_flags) + list(table_flags)
    if mean_ocr_conf < _OCR_CONFIDENCE_FLAG_THRESHOLD:
        initial_flags.append(ReviewFlag.LOW_OCR_CONFIDENCE)

    result = ExtractionResult(
        header=header,
        personel=personel,
-        untuk=[],
+        untuk=untuk_items,
        ttd=ttd,
        raw_text=full_text,
        confidence=mean_ocr_conf,
--- a/src/ocr_sprint/pipeline/table.py
+++ b/src/ocr_sprint/pipeline/table.py
@@ -67,21 +67,43 @@ class DetectedTable:
 # ---------- PP-Structure singleton ----------


-def _build_pp_structure() -> PPStructure:
-    from paddleocr import PPStructure
-
-    s = get_settings()
-    _logger.info("pp_structure.init", lang=s.ocr_lang, use_gpu=s.ocr_use_gpu)
+def _create_pp_structure(
+    pp_structure_cls: type[PPStructure], pp_lang: str, use_gpu: bool
+) -> PPStructure:
    # layout=True so that PP-Structure also returns figure/text regions; we
    # filter to tables only afterwards. show_log=False to keep stdout clean.
-    return PPStructure(
-        lang=s.ocr_lang,
-        use_gpu=s.ocr_use_gpu,
+    return pp_structure_cls(
+        lang=pp_lang,
+        use_gpu=use_gpu,
        layout=True,
        show_log=False,
    )


+def _build_pp_structure() -> PPStructure:
+    s = get_settings()
+    if s.ocr_use_gpu:
+        from ocr_sprint.utils.gpu import configure_nvidia_dll_path
+
+        configure_nvidia_dll_path()
+
+    from paddleocr import PPStructure
+
+    # PPStructure layout models only support 'en' and 'ch', not 'latin'.
+    # Use 'en' for layout/table detection — it's language-agnostic (detects
+    # table structure, not text language). OCR within cells still works for
+    # Indonesian text because the recognition model handles Latin scripts.
+    pp_lang = "en" if s.ocr_lang not in ("en", "ch") else s.ocr_lang
+    _logger.info("pp_structure.init", lang=pp_lang, use_gpu=s.ocr_use_gpu)
+    try:
+        return _create_pp_structure(PPStructure, pp_lang, s.ocr_use_gpu)
+    except Exception as exc:
+        if not s.ocr_use_gpu:
+            raise
+        _logger.warning("pp_structure.gpu_init_failed_falling_back_cpu", error=str(exc))
+        return _create_pp_structure(PPStructure, pp_lang, False)
+
+
 def get_pp_structure() -> PPStructure:
    """Lazy, thread-safe singleton accessor for PP-Structure."""
    global _instance
@@ -92,6 +114,18 @@ def get_pp_structure() -> PPStructure:
    return _instance


+def warmup() -> None:
+    """Eagerly initialize the PP-Structure engine.
+
+    Call this during application startup so the first real request does not
+    pay the model-loading cost (~3-6s on CPU). Mirrors ocr.warmup() so the
+    lifespan handler can warm both engines in one place.
+    """
+    _logger.info("pp_structure.warmup.start")
+    get_pp_structure()
+    _logger.info("pp_structure.warmup.done")
+
+
 # ---------- table parsing ----------


--- a/src/ocr_sprint/schemas/extraction.py
+++ b/src/ocr_sprint/schemas/extraction.py
@@ -21,6 +21,9 @@ class ReviewFlag(str, Enum):
    DATE_PARSE_FAILED = "date_parse_failed"
    LLM_FALLBACK = "llm_fallback"
    LLM_UNAVAILABLE = "llm_unavailable"
+    PERSONNEL_TEXT_FALLBACK = "personnel_text_fallback"
+    PERSONNEL_TEXT_FALLBACK_NO_NRP = "personnel_text_fallback_no_nrp"
+    INCOMPLETE_PERSONNEL_ROW = "incomplete_personnel_row"


 class Signatory(BaseModel):
--- a/src/ocr_sprint/utils/gpu.py
+++ b/src/ocr_sprint/utils/gpu.py
@@ -0,0 +1,57 @@
+"""GPU runtime helpers."""
+
+from __future__ import annotations
+
+import os
+from pathlib import Path
+
+_DLL_HANDLES: list[object] = []
+_CONFIGURED = False
+
+
+def configure_nvidia_dll_path() -> None:
+    """Expose NVIDIA wheel DLL directories to the Windows dynamic loader.
+
+    Paddle's Windows GPU wheels dynamically load CUDA/cuDNN DLLs by name. When
+    those DLLs come from Python packages such as ``nvidia-cudnn-cu11`` instead
+    of a system-wide CUDA Toolkit install, their ``bin`` folders are not on
+    ``PATH`` by default.
+    """
+    global _CONFIGURED
+    if _CONFIGURED or os.name != "nt":
+        return
+
+    package_names = ("nvidia.cudnn", "nvidia.cublas", "nvidia.cuda_nvrtc")
+    dll_dirs: list[Path] = []
+    for package_name in package_names:
+        try:
+            module = __import__(package_name, fromlist=["__file__"])
+        except Exception:
+            continue
+        module_file = getattr(module, "__file__", None)
+        if not module_file:
+            continue
+        dll_dir = Path(module_file).resolve().parent / "bin"
+        if dll_dir.is_dir():
+            dll_dirs.append(dll_dir)
+
+    if not dll_dirs:
+        _CONFIGURED = True
+        return
+
+    current_path_parts = os.environ.get("PATH", "").split(os.pathsep)
+    current_path_norm = {part.casefold() for part in current_path_parts if part}
+
+    prepend: list[str] = []
+    for dll_dir in dll_dirs:
+        dll_dir_str = str(dll_dir)
+        if dll_dir_str.casefold() not in current_path_norm:
+            prepend.append(dll_dir_str)
+        add_dll_directory = getattr(os, "add_dll_directory", None)
+        if add_dll_directory is not None:
+            _DLL_HANDLES.append(add_dll_directory(dll_dir_str))
+
+    if prepend:
+        os.environ["PATH"] = os.pathsep.join([*prepend, os.environ.get("PATH", "")])
+
+    _CONFIGURED = True
--- a/src/ocr_sprint/worker/celery_app.py
+++ b/src/ocr_sprint/worker/celery_app.py
@@ -15,8 +15,12 @@ from __future__ import annotations
 import os

 from celery import Celery
+from celery.signals import worker_ready

 from ocr_sprint.config import get_settings
+from ocr_sprint.utils.logging import get_logger
+
+_logger = get_logger(__name__)


 def build_celery_app() -> Celery:
@@ -47,3 +51,32 @@ def build_celery_app() -> Celery:


 celery_app = build_celery_app()
+
+
+@worker_ready.connect
+def preload_ocr_models(sender: object, **kwargs: object) -> None:
+    """Warm up PaddleOCR and PP-Structure when the worker process is ready.
+
+    With ``--pool=solo`` the worker runs tasks in the *same* process that
+    receives this signal, so models loaded here are reused for every
+    subsequent task — no fork overhead, no duplicate model loading, and
+    RAM usage stays bounded (~1.5 GB instead of 1.5 GB × n_forks).
+    """
+    from ocr_sprint.config import get_settings as _gs
+    from ocr_sprint.pipeline import ocr as _ocr
+    from ocr_sprint.pipeline import table as _table
+
+    _logger.info("celery.worker.warmup.start")
+    s = _gs()
+    try:
+        _ocr.warmup()
+    except Exception as exc:
+        _logger.warning("celery.worker.paddleocr.warmup.failed", error=str(exc))
+
+    if s.tables_enabled:
+        try:
+            _table.warmup()
+        except Exception as exc:
+            _logger.warning("celery.worker.pp_structure.warmup.failed", error=str(exc))
+
+    _logger.info("celery.worker.warmup.done")
--- a/tests/unit/test_ground_truth_service.py
+++ b/tests/unit/test_ground_truth_service.py
@@ -40,11 +40,19 @@ def _seed_approved_job_with_corrections(
            jid,
            status=DocumentStatus.NEEDS_REVIEW,
            confidence=0.8,
-            result=final_result
-            or {
-                "header": {"nomor_sprint": "SPR/1/2025", "satuan_penerbit": "POLRES X"},
-                "personel": [{"pangkat": "AIPDA", "nrp": "77060000", "nama": "BUDI"}],
-            },
+            # ``is None`` (not truthiness) so callers can pass ``{}`` to
+            # exercise the empty-dict edge case.
+            result=(
+                final_result
+                if final_result is not None
+                else {
+                    "header": {
+                        "nomor_sprint": "SPR/1/2025",
+                        "satuan_penerbit": "POLRES X",
+                    },
+                    "personel": [{"pangkat": "AIPDA", "nrp": "77060000", "nama": "BUDI"}],
+                }
+            ),
            review_flags=[],
        )
    if corrections:
@@ -197,6 +205,19 @@ def test_stats_counts_rollup_and_top_fields(db_ready: None) -> None:
    }


+def test_empty_dict_result_stays_consistent(db_ready: None) -> None:
+    """An empty-dict result (``{}``) is logically a valid snapshot — it
+    must round-trip as ``{}`` on *both* ``initial_result`` and
+    ``final_result``, not ``{}`` on one and ``None`` on the other.
+    """
+    _seed_approved_job_with_corrections(final_result={})
+    with session_scope() as session:
+        samples = list(iter_ground_truth_samples(session, GroundTruthFilters()))
+    assert len(samples) == 1
+    assert samples[0].initial_result == {}
+    assert samples[0].final_result == {}
+
+
 def test_serialize_is_valid_jsonl(db_ready: None) -> None:
    _seed_approved_job_with_corrections(corrections=[("header.perihal", "X", None)])
    with session_scope() as session:
--- a/tests/unit/test_ocr_layout.py
+++ b/tests/unit/test_ocr_layout.py
@@ -0,0 +1,75 @@
+"""Tests for OCR layout reordering.
+
+PaddleOCR emits text boxes in detection order, not visual reading order.
+On dense table layouts (Polda Kalbar Akpol-panitia regression) this
+interleaves columns within a row and breaks every downstream extractor
+that assumes top-to-bottom row order. ``sort_lines_by_layout`` rebuilds
+reading order from the bounding-box geometry.
+"""
+
+from __future__ import annotations
+
+from ocr_sprint.pipeline.ocr import OCRLine, OCRPage, sort_lines_by_layout
+
+
+def _box(x: float, y: float, w: float = 30, h: float = 15):
+    return ((x, y), (x + w, y), (x + w, y + h), (x, y + h))
+
+
+def _make(text: str, x: float, y: float) -> OCRLine:
+    return OCRLine(text=text, confidence=1.0, box=_box(x, y))
+
+
+class TestSortLinesByLayout:
+    def test_empty_returns_empty(self) -> None:
+        assert sort_lines_by_layout([]) == []
+
+    def test_already_sorted_is_stable(self) -> None:
+        lines = [_make("A", 10, 10), _make("B", 50, 10), _make("C", 10, 30)]
+        assert [ln.text for ln in sort_lines_by_layout(lines)] == ["A", "B", "C"]
+
+    def test_reorders_column_first_detection_to_row_first(self) -> None:
+        # Simulate a 2-row, 3-col table where Paddle returned cells
+        # column-first instead of row-first.
+        lines = [
+            _make("B1", 50, 10),
+            _make("B2", 50, 30),
+            _make("A1", 10, 10),
+            _make("A2", 10, 30),
+            _make("C1", 90, 10),
+            _make("C2", 90, 30),
+        ]
+        result = [ln.text for ln in sort_lines_by_layout(lines)]
+        assert result == ["A1", "B1", "C1", "A2", "B2", "C2"]
+
+    def test_groups_slightly_misaligned_cells_into_one_band(self) -> None:
+        # Real OCR boxes for a single visual row are rarely perfectly
+        # y-aligned; we still want them grouped.
+        lines = [
+            _make("LEFT", 10, 10),
+            _make("MID", 50, 12),  # 2px below LEFT — same row visually
+            _make("RIGHT", 90, 11),
+        ]
+        result = [ln.text for ln in sort_lines_by_layout(lines)]
+        assert result == ["LEFT", "MID", "RIGHT"]
+
+    def test_separates_rows_when_y_gap_exceeds_threshold(self) -> None:
+        # Lines with a y gap larger than ~½ line-height must NOT collapse
+        # into the same band.
+        lines = [
+            _make("ROW1A", 10, 10),
+            _make("ROW1B", 50, 10),
+            _make("ROW2A", 10, 30),  # gap of 20 vs height 15 → new band
+            _make("ROW2B", 50, 30),
+        ]
+        result = [ln.text for ln in sort_lines_by_layout(lines)]
+        assert result == ["ROW1A", "ROW1B", "ROW2A", "ROW2B"]
+
+    def test_ocrpage_text_uses_sorted_order(self) -> None:
+        lines = [
+            _make("RIGHT", 90, 10),
+            _make("LEFT", 10, 10),
+            _make("BOTTOM", 10, 30),
+        ]
+        page = OCRPage(lines=lines)
+        assert page.text == "LEFT\nRIGHT\nBOTTOM"
--- a/tests/unit/test_orchestrator_llm.py
+++ b/tests/unit/test_orchestrator_llm.py
@@ -169,3 +169,92 @@ def test_orchestrator_marks_unavailable_when_llm_returns_none(
    out = run_pipeline(b"%PDF-1.4\n%fake")
    assert ReviewFlag.LLM_UNAVAILABLE in out.result.review_flags
    assert ReviewFlag.LLM_FALLBACK not in out.result.review_flags
+
+
+def test_orchestrator_uses_text_fallback_when_pp_structure_yields_only_names(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """When PP-Structure produces low-quality rows (e.g. only ``nama`` filled),
+    the orchestrator must run the text fallback against the raw OCR text and
+    raise the ``personnel_text_fallback`` flag.
+    """
+    monkeypatch.setenv("LLM_ENABLED", "false")
+    from ocr_sprint.config import get_settings
+
+    get_settings.cache_clear()
+
+    raw_text = (
+        "DAFTAR PERSONIL\n"
+        "1.\n"
+        "SRI WAHYUNI\n"
+        "AIPTU / 75070328\n"
+        "INTELKAM POLRES CIMAHI\n"
+        "2.\n"
+        "AGUNG LUKMAN\n"
+        "BRIPTU / 99030245\n"
+        "SAT INTELKAM\n"
+    )
+
+    # PP-Structure 'succeeded' but emitted name-only rows (the bug we saw on
+    # the real Polres Cimahi document).
+    from ocr_sprint.schemas.personnel import PersonnelEntry
+
+    pp_structure_low_quality = [
+        PersonnelEntry(nama="SRI WAHYUNI"),
+        PersonnelEntry(nama="AGUNG LUKMAN"),
+    ]
+    _stub_pipeline_stages(
+        monkeypatch,
+        raw_text=raw_text,
+        regex_header=HeaderFields(
+            nomor_sprint="Sprin/1/I/2025",
+            tanggal=date(2025, 1, 1),
+            satuan_penerbit="Polres Cimahi",
+            perihal="ok",
+            dasar=["UU 2/2002"],
+        ),
+    )
+    # Override extract_personnel to return the broken PP-Structure rows.
+    monkeypatch.setattr(orch_module, "extract_personnel", lambda _t: pp_structure_low_quality)
+
+    out = run_pipeline(b"%PDF-1.4\n%fake")
+    assert ReviewFlag.PERSONNEL_TEXT_FALLBACK in out.result.review_flags
+    # Fallback rows must carry pangkat + nrp (the whole point of the path).
+    assert all(r.pangkat and r.nrp for r in out.result.personel)
+    assert {r.pangkat for r in out.result.personel} == {"AIPTU", "BRIPTU"}
+
+
+def test_orchestrator_keeps_pp_structure_rows_when_quality_is_high(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """Healthy PP-Structure output (rank+nrp present on most rows) must NOT
+    be replaced by the text fallback.
+    """
+    monkeypatch.setenv("LLM_ENABLED", "false")
+    from ocr_sprint.config import get_settings
+
+    get_settings.cache_clear()
+
+    from ocr_sprint.schemas.personnel import PersonnelEntry
+
+    healthy = [
+        PersonnelEntry(pangkat="AIPTU", nrp="11111111", nama="A"),
+        PersonnelEntry(pangkat="BRIPTU", nrp="22222222", nama="B"),
+        PersonnelEntry(pangkat="BRIPDA", nrp="33333333", nama="C"),
+    ]
+    _stub_pipeline_stages(
+        monkeypatch,
+        raw_text="ignored — should not be parsed",
+        regex_header=HeaderFields(
+            nomor_sprint="Sprin/1/I/2025",
+            tanggal=date(2025, 1, 1),
+            satuan_penerbit="Polres X",
+            perihal="ok",
+            dasar=["UU 2/2002"],
+        ),
+    )
+    monkeypatch.setattr(orch_module, "extract_personnel", lambda _t: healthy)
+
+    out = run_pipeline(b"%PDF-1.4\n%fake")
+    assert ReviewFlag.PERSONNEL_TEXT_FALLBACK not in out.result.review_flags
+    assert [r.nrp for r in out.result.personel] == ["11111111", "22222222", "33333333"]
--- a/tests/unit/test_personnel_text_fallback.py
+++ b/tests/unit/test_personnel_text_fallback.py
@@ -0,0 +1,324 @@
+"""Tests for the text-based personnel fallback extractor.
+
+Driven by the real Polres Cimahi sprint document where PP-Structure
+produced 24 rows with only ``nama`` populated. The fallback should
+recover at least the rank + NRP for every row.
+"""
+
+from __future__ import annotations
+
+from ocr_sprint.pipeline.extract.personnel_text import (
+    extract_personnel_from_ocr_lines,
+    extract_personnel_from_text,
+    is_low_quality,
+)
+from ocr_sprint.pipeline.ocr import OCRLine
+from ocr_sprint.schemas.personnel import PersonnelEntry
+
+
+def _ocr_line(text: str, x: float, y: float, w: float = 80, h: float = 15) -> OCRLine:
+    box = ((x, y), (x + w, y), (x + w, y + h), (x, y + h))
+    return OCRLine(text=text, confidence=1.0, box=box)
+
+_CIMAHI_FIXTURE = """\
+DAFTAR PERSONIL SKCK POLRES DAN POLSEK JAJARAN POLRES CIMAHI TA 2024
+NO
+NAMA
+PANGKAT / NRP
+JABATAN
+KET
+BAUR SKCK SAT
+1.
+SRI WAHYUNI
+AIPTU / 75070328
+INTELKAM POLRES
+CIMAHI
+BA PELAKSANA SKCK
+2.
+CITRA DWI PUTRI R
+BRIPTU / 95070659
+ SAT INTELKAM
+POLRES CIMAHI
+BA PELAKSANA SKCK
+3.
+AGUNG LUKMAN AL
+BRIPTU / 99030245
+SAT INTELKAM
+POLRES CIMAHI
+BA POLSEK
+8.
+ARIEF SYAHRUL ZAMAN
+BRIGPOL /96030446
+MARGAASIH
+"""
+
+
+class TestExtractPersonnelFromText:
+    def test_extracts_rank_nrp_and_name(self) -> None:
+        rows = extract_personnel_from_text(_CIMAHI_FIXTURE)
+        assert len(rows) == 4
+        first = rows[0]
+        assert first.pangkat == "AIPTU"
+        assert first.nrp == "75070328"
+        assert first.nama == "SRI WAHYUNI"
+
+    def test_normalizes_brigpol_to_brigadir(self) -> None:
+        rows = extract_personnel_from_text(_CIMAHI_FIXTURE)
+        last = rows[-1]
+        # 'BRIGPOL' (no space) must canonicalize to 'BRIGADIR'.
+        assert last.pangkat == "BRIGADIR"
+        assert last.nrp == "96030446"
+        assert last.nama == "ARIEF SYAHRUL ZAMAN"
+
+    def test_skips_header_lines_as_names(self) -> None:
+        # No row should ever have a column-header word as nama.
+        rows = extract_personnel_from_text(_CIMAHI_FIXTURE)
+        names = [r.nama for r in rows]
+        for blocked in {"NAMA", "PANGKAT", "JABATAN", "KET", "DAFTAR"}:
+            assert blocked not in names
+
+    def test_jabatan_collected_from_following_lines(self) -> None:
+        rows = extract_personnel_from_text(_CIMAHI_FIXTURE)
+        assert rows[0].jabatan_dinas is not None
+        assert "INTELKAM" in rows[0].jabatan_dinas
+
+    def test_empty_text_returns_empty(self) -> None:
+        assert extract_personnel_from_text("") == []
+
+    def test_text_without_rank_nrp_pattern_returns_empty(self) -> None:
+        text = "Just a paragraph with no rank or NRP at all.\nAnother line."
+        assert extract_personnel_from_text(text) == []
+
+    def test_ignores_isolated_8digit_number_without_rank(self) -> None:
+        # NRP without a recognised rank token must not produce a row.
+        text = "Some line\n12345678\nanother line"
+        assert extract_personnel_from_text(text) == []
+
+    def test_rejects_unknown_rank_with_8digit_number(self) -> None:
+        # A "rank-shaped" word that isn't in the master list must not yield a row.
+        text = "Some line\nFAKERANK / 12345678\nanother line"
+        assert extract_personnel_from_text(text) == []
+
+    def test_does_not_drop_indonesian_names_starting_with_no_or_ket(self) -> None:
+        # Regression: 'NO' / 'KET' are legitimate column header tokens but
+        # also prefix common Indonesian names (KETUT, NOVA, NOOR). The
+        # blocklist must use word boundaries, not a raw startswith check.
+        text = (
+            "DAFTAR PERSONIL\n"
+            "1.\n"
+            "KETUT WARDANA\n"
+            "AIPTU / 11111111\n"
+            "JABATAN A\n"
+            "2.\n"
+            "NOVA SARI\n"
+            "BRIPTU / 22222222\n"
+            "JABATAN B\n"
+            "3.\n"
+            "NOOR HIDAYAT\n"
+            "BRIPDA / 33333333\n"
+            "JABATAN C\n"
+        )
+        rows = extract_personnel_from_text(text)
+        names = [r.nama for r in rows]
+        assert names == ["KETUT WARDANA", "NOVA SARI", "NOOR HIDAYAT"]
+
+    def test_extracts_multiple_rows_when_collapsed_to_one_line(self) -> None:
+        # Polres Banjar regression: when PaddleOCR merges several table
+        # rows onto a single OCR line, every rank+NRP pair on that line
+        # must still produce a separate row. Previously per-line
+        # ``re.search`` returned only the first match.
+        text = (
+            "DAFTAR NAMA INSTRUKTUR\n"
+            "1 CUCU JUHANA, A.K.S. KOMPOL 70100418 KABAGOPS "
+            "INSTRUKTUR LAT PRA OPS "
+            "HERU SAMSUL BAHRI, S.E., M.M., CPHR., CBA.AKP 77020049 "
+            "KASAT RESKRIM SDA "
+            "YAYAN SOPIANA, S.A.P., M.A.P., CPHR., CBA, CIAKP 84011113 "
+            "KASATINTELKAM POLRES BANJAR SDA\n"
+        )
+        rows = extract_personnel_from_text(text)
+        assert len(rows) == 3
+        assert [r.pangkat for r in rows] == ["KOMPOL", "AKP", "AKP"]
+        assert [r.nrp for r in rows] == ["70100418", "77020049", "84011113"]
+        assert rows[0].nama == "CUCU JUHANA, A.K.S."
+        assert rows[1].nama is not None and "HERU SAMSUL BAHRI" in rows[1].nama
+        assert rows[2].nama is not None and "YAYAN SOPIANA" in rows[2].nama
+
+    def test_extracts_multiple_rows_when_split_across_lines(self) -> None:
+        # Variant of the squished case where OCR produces one line per
+        # table row, so no line carries more than one rank+NRP pair.
+        # Verifies that the finditer-based path doesn't regress this
+        # layout.
+        text = (
+            "1 CUCU JUHANA, A.K.S. KOMPOL 70100418 KABAGOPS\n"
+            "INSTRUKTUR LAT PRA OPS\n"
+            "HERU SAMSUL BAHRI, S.E., M.M., CPHR., CBA.AKP 77020049 KASAT RESKRIM\n"
+            "SDA\n"
+            "YAYAN SOPIANA, S.A.P., M.A.P., CPHR., CBA, CIAKP 84011113 KASATINTELKAM\n"
+            "POLRES BANJAR SDA\n"
+        )
+        rows = extract_personnel_from_text(text)
+        assert [r.pangkat for r in rows] == ["KOMPOL", "AKP", "AKP"]
+        assert [r.nrp for r in rows] == ["70100418", "77020049", "84011113"]
+        assert rows[0].nama == "CUCU JUHANA, A.K.S."
+
+    def test_extracts_rows_when_sprint_has_no_nrp_column(self) -> None:
+        # Polda Kalbar Akpol-panitia regression: sprint formats without
+        # an NRP column (panitia, undangan templates) must still extract
+        # rows via the rank-only Pass 3 path. Names span multiple OCR
+        # lines (narrow column), and the multi-token rank "KOMBES POL"
+        # is split across two lines.
+        text = (
+            "DAFTAR NAMA PANITIA\n"
+            "NO\nNAMA\nPANGKAT\nJABATAN\nSTRUKTURAL\nDALAM SPRIN\nKET\n"
+            "1\nF. GUNTUR\nSUNOTO, S.I.K.,\nM.H.\n"
+            "KOMBES\nPOL\n"
+            "KARO SDM\nPOLDA KALBAR\nKETUA\nPELAKSANA\n"
+            "2\nJUDA TRISNO\nTAMPUBOLON,\nS.H., S.I.K., M.H.\n"
+            "AKBP\n"
+            "KABAGDALPERS\nRO SDM\nPOLDA KALBAR\nSEKRETARIS\n"
+            "3\nPRAYITNO, S.H.,\nM.H.\n"
+            "KOMPOL\n"
+            "KASUBBAG DIAPERS\nANGGOTA\n"
+        )
+        rows = extract_personnel_from_text(text)
+        assert len(rows) == 3
+        assert [r.pangkat for r in rows] == ["KOMBES POL", "AKBP", "KOMPOL"]
+        # All Pass 3 rows have nrp=None by design.
+        assert all(r.nrp is None for r in rows)
+        assert rows[0].nama == "F. GUNTUR SUNOTO, S.I.K., M.H."
+        assert rows[1].nama == "JUDA TRISNO TAMPUBOLON, S.H., S.I.K., M.H."
+        assert rows[2].nama == "PRAYITNO, S.H., M.H."
+        assert rows[0].jabatan_dinas is not None and "KARO SDM" in rows[0].jabatan_dinas
+
+    def test_pass3_does_not_run_when_pass1_succeeds(self) -> None:
+        # If a sprint has NRPs (Pass 1 succeeds), Pass 3 must not fire
+        # and produce duplicate/contaminating rows.
+        text = (
+            "1\nSRI WAHYUNI\nAIPTU / 75070328\nBAUR SKCK\n"
+            "2\nCITRA DWI PUTRI\nBRIPTU / 95070659\nBA PELAKSANA\n"
+        )
+        rows = extract_personnel_from_text(text)
+        assert len(rows) == 2
+        assert all(r.nrp is not None for r in rows)
+
+    def test_still_blocks_bare_column_header_tokens(self) -> None:
+        # Word-boundary fix must still reject the actual column-header
+        # rows that motivated the blocklist in the first place.
+        text = "NO\nNAMA\nPANGKAT / NRP\nJABATAN\nKET\n1.\nREAL NAME\nAIPTU / 12345678\n"
+        rows = extract_personnel_from_text(text)
+        assert len(rows) == 1
+        assert rows[0].nama == "REAL NAME"
+
+
+class TestExtractPersonnelFromOcrLines:
+    """Column-aware Pass 3 — Polda Kalbar Akpol-panitia regression.
+
+    Verifies that bounding-box geometry preserves column boundaries on
+    dense tables where text-only Pass 3 bleeds adjacent columns into
+    nama/jabatan.
+    """
+
+    def _kalbar_lines(self) -> list[OCRLine]:
+        # Stylised Polda Kalbar layout: NO | NAMA | PANGKAT | STRUKTURAL | SPRIN
+        # X columns: 10, 100, 250, 380, 520. Each row may have multi-line cells.
+        return [
+            # Row 1 — KOMBES POL spans two stacked OCR boxes
+            _ocr_line("1", 10, 100),
+            _ocr_line("F. GUNTUR", 100, 100),
+            _ocr_line("SUNOTO, S.I.K.,", 100, 120),
+            _ocr_line("M.H.", 100, 140),
+            _ocr_line("KOMBES", 250, 100),
+            _ocr_line("POL", 250, 120),
+            _ocr_line("KARO SDM", 380, 100),
+            _ocr_line("POLDA KALBAR", 380, 120),
+            _ocr_line("KETUA", 520, 100),
+            _ocr_line("PELAKSANA", 520, 120),
+            # Row 2
+            _ocr_line("2", 10, 200),
+            _ocr_line("JUDA TRISNO", 100, 200),
+            _ocr_line("TAMPUBOLON,", 100, 220),
+            _ocr_line("S.H., S.I.K., M.H.", 100, 240),
+            _ocr_line("AKBP", 250, 200),
+            _ocr_line("KABAGDALPERS", 380, 200),
+            _ocr_line("RO SDM", 380, 220),
+            _ocr_line("POLDA KALBAR", 380, 240),
+            _ocr_line("SEKRETARIS", 520, 200),
+            # Row 9 — PNS PENATA TK I (multi-token rank stacked)
+            _ocr_line("9", 10, 500),
+            _ocr_line("FITRIANSYAH,", 100, 500),
+            _ocr_line("S.E.", 100, 520),
+            _ocr_line("PENATA", 250, 500),
+            _ocr_line("TK I", 250, 520),
+            _ocr_line("KAURKEU", 380, 500),
+            _ocr_line("RO SDM", 380, 520),
+            _ocr_line("POLDA KALBAR", 380, 540),
+            _ocr_line("BENDAHARA", 520, 500),
+        ]
+
+    def test_extracts_three_rows(self) -> None:
+        rows = extract_personnel_from_ocr_lines(self._kalbar_lines())
+        assert len(rows) == 3
+        assert [r.pangkat for r in rows] == ["KOMBES POL", "AKBP", "PENATA TK I"]
+
+    def test_nama_is_assembled_only_from_nama_column(self) -> None:
+        # Each row's nama must contain *all* its multi-line fragments
+        # and *only* its multi-line fragments — no bleed from struktural.
+        rows = extract_personnel_from_ocr_lines(self._kalbar_lines())
+        assert rows[0].nama == "F. GUNTUR SUNOTO, S.I.K., M.H."
+        assert rows[1].nama == "JUDA TRISNO TAMPUBOLON, S.H., S.I.K., M.H."
+        assert rows[2].nama == "FITRIANSYAH, S.E."
+
+    def test_jabatan_split_into_struktural_and_sprint(self) -> None:
+        # The geometric column boundary must split STRUKTURAL (jabatan_dinas)
+        # from DALAM SPRIN (jabatan_sprint).
+        rows = extract_personnel_from_ocr_lines(self._kalbar_lines())
+        assert rows[0].jabatan_dinas == "KARO SDM POLDA KALBAR"
+        assert rows[0].jabatan_sprint == "KETUA PELAKSANA"
+        assert rows[1].jabatan_dinas == "KABAGDALPERS RO SDM POLDA KALBAR"
+        assert rows[1].jabatan_sprint == "SEKRETARIS"
+
+    def test_returns_empty_when_no_rank_anchors(self) -> None:
+        lines = [
+            _ocr_line("DAFTAR NAMA", 100, 50),
+            _ocr_line("HEADER", 100, 100),
+        ]
+        assert extract_personnel_from_ocr_lines(lines) == []
+
+    def test_returns_empty_for_empty_input(self) -> None:
+        assert extract_personnel_from_ocr_lines([]) == []
+
+    def test_no_row_bleed_between_consecutive_rows(self) -> None:
+        # Row 1's trailing name fragments ("SUNOTO, S.I.K.," and "M.H.")
+        # sit BELOW its first rank line but inside row 1's visual span.
+        # They must NOT leak into row 2's nama, which starts "JUDA TRISNO".
+        rows = extract_personnel_from_ocr_lines(self._kalbar_lines())
+        assert rows[1].nama is not None
+        assert rows[1].nama.startswith("JUDA TRISNO")
+        assert "GUNTUR" not in rows[1].nama
+        assert "SUNOTO" not in rows[1].nama
+
+
+class TestIsLowQuality:
+    def test_empty_list_is_low_quality(self) -> None:
+        assert is_low_quality([]) is True
+
+    def test_all_rows_with_only_name_is_low_quality(self) -> None:
+        rows = [PersonnelEntry(nama=f"NAMA {i}") for i in range(10)]
+        assert is_low_quality(rows) is True
+
+    def test_majority_with_rank_nrp_is_high_quality(self) -> None:
+        rows = [
+            PersonnelEntry(nama=f"NAMA {i}", pangkat="AIPTU", nrp=f"{10000000 + i:08d}")
+            for i in range(10)
+        ]
+        assert is_low_quality(rows) is False
+
+    def test_borderline_30_percent_threshold(self) -> None:
+        # 3 useful out of 10 = exactly 0.3, treated as not-low-quality.
+        useful = [
+            PersonnelEntry(nama=f"NAMA {i}", pangkat="AIPTU", nrp=f"{10000000 + i:08d}")
+            for i in range(3)
+        ]
+        useless = [PersonnelEntry(nama=f"NAMA {i + 3}") for i in range(7)]
+        assert is_low_quality(useful + useless) is False
--- a/tests/unit/test_regex_rules.py
+++ b/tests/unit/test_regex_rules.py
@@ -14,6 +14,7 @@ from ocr_sprint.pipeline.extract.regex_rules import (
    find_satuan,
    find_signatory,
    find_tanggal,
+    find_untuk_list,
 )


@@ -60,6 +61,36 @@ class TestSatuan:
        result = find_satuan("KEPOLISIAN NEGARA REPUBLIK INDONESIA")
        assert result is not None

+    def test_prefers_resor_over_negara_when_both_present(self) -> None:
+        # The Polri letterhead lists units hierarchically; the issuing unit
+        # is the deepest level, not the topmost generic "NEGARA" line.
+        text = (
+            "KEPOLISIAN NEGARA REPUBLIK INDONESIA\n"
+            "DAERAH JAWA BARAT\n"
+            "RESOR CIMAHI\n"
+            "SURAT PERINTAH\n"
+        )
+        result = find_satuan(text)
+        assert result == "KEPOLISIAN RESOR CIMAHI"
+
+    def test_prefers_sektor_over_resor(self) -> None:
+        text = (
+            "KEPOLISIAN NEGARA REPUBLIK INDONESIA\n"
+            "DAERAH JAWA BARAT\n"
+            "RESOR CIMAHI\n"
+            "SEKTOR PADALARANG\n"
+        )
+        result = find_satuan(text)
+        assert result == "KEPOLISIAN SEKTOR PADALARANG"
+
+    def test_handles_daerah_only(self) -> None:
+        text = "KEPOLISIAN NEGARA REPUBLIK INDONESIA\nDAERAH JAWA BARAT\n"
+        result = find_satuan(text)
+        assert result == "KEPOLISIAN DAERAH JAWA BARAT"
+
+    def test_returns_none_when_no_letterhead(self) -> None:
+        assert find_satuan("no police letterhead here") is None
+

 class TestPerihal:
    def test_extracts_perihal_line(self) -> None:
@@ -69,6 +100,25 @@ class TestPerihal:
    def test_returns_none_when_absent(self) -> None:
        assert find_perihal("no perihal field") is None

+    def test_falls_back_to_pertimbangan_block(self) -> None:
+        # Many Polres-level sprints use "Pertimbangan" instead of "Perihal".
+        # The fallback should pick up the first non-empty line under it.
+        text = (
+            "Pertimbangan\n"
+            "Bahwa dalam rangka mendukung kepentingan Dinas Polres Cimahi.\n"
+            "DASAR :\n"
+            "1. ...\n"
+        )
+        result = find_perihal(text)
+        assert result is not None
+        assert result.startswith("Bahwa dalam rangka mendukung")
+
+    def test_perihal_wins_over_pertimbangan_when_both_present(self) -> None:
+        # If the document has both a Perihal label AND a Pertimbangan
+        # paragraph, the explicit Perihal wins.
+        text = "Pertimbangan\nSome pertimbangan content.\nPERIHAL : The actual perihal.\n"
+        assert find_perihal(text) == "The actual perihal."
+

 class TestDasar:
    def test_numbered_list(self) -> None:
@@ -88,6 +138,57 @@ class TestDasar:
    def test_empty_when_section_missing(self) -> None:
        assert find_dasar_list("no dasar section") == []

+    def test_handles_bare_number_lines_split_by_ocr(self) -> None:
+        # OCR sometimes places the number marker on its own line and the
+        # body on the next non-empty line. The collector must merge them
+        # rather than dropping the body or appending it to the previous
+        # item (which the old implementation did).
+        text = (
+            "Dasar\n"
+            ":\n"
+            "1.\n"
+            " Undang - Undang Nomor 2 tahun 2002 tentang Kepolisian;\n"
+            "2. Peraturan Pemerintah Republik Indonesia No. 76 tahun 2020;\n"
+            "3.\n"
+            "Keterangan Catatan Kepolisian (SKCK);\n"
+            "4.\n"
+            "Pelayanan dilingkungan Badan Intelijen Keamanan Polri.\n"
+            "5. DIPA Petikan Satker Polres Cimahi.\n"
+            "DIPERINTAHKAN\n"
+        )
+        items = find_dasar_list(text)
+        assert len(items) == 5
+        assert items[0].startswith("Undang - Undang")
+        assert items[2].startswith("Keterangan Catatan")
+        assert items[3].startswith("Pelayanan dilingkungan")
+        assert items[4].startswith("DIPA")
+
+
+class TestUntuk:
+    def test_extracts_numbered_untuk_bullets(self) -> None:
+        text = (
+            "DIPERINTAHKAN\n"
+            "Kepada\n"
+            "Untuk\n"
+            "1.\n"
+            "melaksanakan tugas A;\n"
+            "2.\n"
+            "melaksanakan tugas B;\n"
+            "Selesai.\n"
+        )
+        items = find_untuk_list(text)
+        assert len(items) == 2
+        assert items[0] == "melaksanakan tugas A;"
+        assert items[1] == "melaksanakan tugas B;"
+
+    def test_returns_empty_when_section_missing(self) -> None:
+        assert find_untuk_list("no untuk section") == []
+
+    def test_stops_at_dikeluarkan(self) -> None:
+        text = "Untuk\n1. tugas A;\nDikeluarkan di Cimahi\n2. should not be captured\n"
+        items = find_untuk_list(text)
+        assert items == ["tugas A;"]
+

 class TestSignatory:
    def test_extracts_last_nrp(self) -> None:
--- a/tests/unit/test_table.py
+++ b/tests/unit/test_table.py
@@ -2,8 +2,12 @@

 from __future__ import annotations

+import sys
+from types import ModuleType, SimpleNamespace
+
 import pytest

+from ocr_sprint.pipeline import table as table_module
 from ocr_sprint.pipeline.table import (
    DetectedTable,
    extract_tables_from_pp_result,
@@ -82,6 +86,34 @@ class TestDetectedTable:
        assert table.n_cols == 0


+class TestPpStructureInit:
+    def test_gpu_init_falls_back_to_cpu(self, monkeypatch: pytest.MonkeyPatch) -> None:
+        calls: list[dict[str, object]] = []
+
+        class FakePPStructure:
+            def __init__(self, **kwargs: object) -> None:
+                calls.append(kwargs)
+                if kwargs["use_gpu"]:
+                    raise RuntimeError("gpu init failed")
+
+        fake_paddleocr = ModuleType("paddleocr")
+        fake_paddleocr.PPStructure = FakePPStructure
+        monkeypatch.setitem(sys.modules, "paddleocr", fake_paddleocr)
+        monkeypatch.setattr(
+            table_module,
+            "get_settings",
+            lambda: SimpleNamespace(ocr_lang="latin", ocr_use_gpu=True),
+        )
+
+        engine = table_module._build_pp_structure()
+
+        assert isinstance(engine, FakePPStructure)
+        assert calls == [
+            {"lang": "en", "use_gpu": True, "layout": True, "show_log": False},
+            {"lang": "en", "use_gpu": False, "layout": True, "show_log": False},
+        ]
+
+
 @pytest.fixture
 def sample_personnel_table() -> DetectedTable:
    """Header + three personnel rows in a typical Polres-level format."""
--- a/tests/unit/test_validators.py
+++ b/tests/unit/test_validators.py
@@ -62,6 +62,20 @@ class TestPersonnelValidator:
        entry = PersonnelEntry(pangkat="Sersan Mayor", nrp="12345678", nama="Test")
        assert ReviewFlag.UNKNOWN_PANGKAT in validate_personnel_entry(entry)

+    def test_row_with_only_name_is_flagged_incomplete(self) -> None:
+        # A row that captured only `nama` (no pangkat AND no nrp) is the
+        # signature of mis-aligned table extraction. Must be flagged so
+        # the operator routes the document to needs_review.
+        entry = PersonnelEntry(nama="LEAKED FROM SOMEWHERE")
+        flags = validate_personnel_entry(entry)
+        assert ReviewFlag.INCOMPLETE_PERSONNEL_ROW in flags
+
+    def test_row_with_only_pangkat_is_not_flagged_incomplete(self) -> None:
+        # Having pangkat without NRP is suboptimal but still identifies a
+        # rank, so we don't raise the structural-incompleteness flag.
+        entry = PersonnelEntry(pangkat="AKP", nama="Test")
+        assert ReviewFlag.INCOMPLETE_PERSONNEL_ROW not in validate_personnel_entry(entry)
+

 class TestHeaderValidator:
    def test_complete_header_no_flags(self) -> None:
--- a/update.ps1
+++ b/update.ps1
@@ -0,0 +1,214 @@
+#!/usr/bin/env pwsh
+# update.ps1 - One-command update & restart for ocr-sprint-service (local dev)
+
+param(
+    [ValidateSet("cpu", "gpu")]
+    [string] $OcrMode
+)
+
+$ErrorActionPreference = "Stop"
+
+$Port = 8000
+$ProjectRoot = $PSScriptRoot
+$VenvDir = Join-Path $ProjectRoot ".venv"
+$Python = Join-Path $VenvDir "Scripts\python.exe"
+
+function Invoke-Step {
+    param(
+        [Parameter(Mandatory = $true)]
+        [scriptblock] $Command,
+        [Parameter(Mandatory = $true)]
+        [string] $FailureMessage
+    )
+
+    & $Command
+    if ($LASTEXITCODE -ne 0) {
+        Write-Host "  $FailureMessage" -ForegroundColor Red
+        exit $LASTEXITCODE
+    }
+}
+
+function Get-DotEnvValue {
+    param(
+        [Parameter(Mandatory = $true)]
+        [string] $Name
+    )
+
+    $envFile = Join-Path $ProjectRoot ".env"
+    if (Test-Path $envFile) {
+        $line = Get-Content $envFile | Where-Object { $_ -match "^\s*$Name\s*=" } | Select-Object -Last 1
+        if ($line) {
+            return (($line -split "=", 2)[1] -split "\s+#", 2)[0].Trim()
+        }
+    }
+    return [Environment]::GetEnvironmentVariable($Name)
+}
+
+function Set-DotEnvValue {
+    param(
+        [Parameter(Mandatory = $true)]
+        [string] $Name,
+        [Parameter(Mandatory = $true)]
+        [string] $Value
+    )
+
+    $envFile = Join-Path $ProjectRoot ".env"
+    if (-not (Test-Path $envFile)) {
+        New-Item -Path $envFile -ItemType File | Out-Null
+    }
+
+    $lines = @(Get-Content $envFile)
+    $updated = $false
+    for ($i = 0; $i -lt $lines.Count; $i++) {
+        if ($lines[$i] -match "^\s*$Name\s*=") {
+            $comment = ""
+            if ($lines[$i] -match "(\s+#.*)$") {
+                $comment = $Matches[1]
+            }
+            $lines[$i] = "$Name=$Value$comment"
+            $updated = $true
+        }
+    }
+    if (-not $updated) {
+        $lines += "$Name=$Value"
+    }
+    Set-Content -Path $envFile -Value $lines
+}
+
+function Test-PythonPackage {
+    param(
+        [Parameter(Mandatory = $true)]
+        [string] $Name
+    )
+
+    & $Python -m pip show $Name *> $null
+    return $LASTEXITCODE -eq 0
+}
+
+function Add-NvidiaDllPaths {
+    $dllDirs = @(
+        (Join-Path $VenvDir "Lib\site-packages\nvidia\cudnn\bin"),
+        (Join-Path $VenvDir "Lib\site-packages\nvidia\cublas\bin"),
+        (Join-Path $VenvDir "Lib\site-packages\nvidia\cuda_nvrtc\bin")
+    )
+    foreach ($dir in $dllDirs) {
+        if ((Test-Path $dir) -and (($env:PATH -split ";") -notcontains $dir)) {
+            $env:PATH = "$dir;$env:PATH"
+        }
+    }
+}
+
+Set-Location $ProjectRoot
+
+if (-not (Test-Path $Python)) {
+    Write-Host "Virtualenv not found at $VenvDir. Creating one..." -ForegroundColor Yellow
+    $venvCreated = $false
+    $pythonLauncher = Get-Command py -ErrorAction SilentlyContinue
+    if ($pythonLauncher) {
+        foreach ($version in @("3.12", "3.11", "3.10")) {
+            & py "-$version" -m venv $VenvDir 2>$null
+            if ($LASTEXITCODE -eq 0) {
+                $venvCreated = $true
+                break
+            }
+        }
+    }
+    if (-not $venvCreated) {
+        $systemPython = Get-Command python -ErrorAction SilentlyContinue
+        if (-not $systemPython) {
+            Write-Host "  Python was not found. Install Python 3.10-3.12, then rerun this script." -ForegroundColor Red
+            exit 1
+        }
+        & python -m venv $VenvDir
+        $venvCreated = ($LASTEXITCODE -eq 0)
+    }
+    if (-not $venvCreated) {
+        Write-Host "  Failed to create virtualenv." -ForegroundColor Red
+        exit $LASTEXITCODE
+    }
+}
+
+$env:VIRTUAL_ENV = $VenvDir
+$env:PATH = "$(Join-Path $VenvDir 'Scripts');$env:PATH"
+
+if ($PSBoundParameters.ContainsKey("OcrMode")) {
+    $ocrUseGpuValue = if ($OcrMode -eq "gpu") { "true" } else { "false" }
+    Set-DotEnvValue "OCR_USE_GPU" $ocrUseGpuValue
+    $env:OCR_USE_GPU = $ocrUseGpuValue
+    Write-Host "OCR mode set to $($OcrMode.ToUpperInvariant()) and saved to .env." -ForegroundColor Green
+}
+
+# ── [1/5] Git pull ──────────────────────────────────────────────────────────
+Write-Host "`n[1/5] Pulling latest code..." -ForegroundColor Cyan
+Invoke-Step { git pull } "Git pull failed."
+
+# ── [2/5] Install/update dependencies ───────────────────────────────────────
+Write-Host "`n[2/5] Installing/updating dependencies..." -ForegroundColor Cyan
+Invoke-Step { & $Python -m pip install -e ".[dev]" -q } "Dependency install failed."
+
+$ocrUseGpu = (Get-DotEnvValue "OCR_USE_GPU")
+if ($ocrUseGpu -and $ocrUseGpu.ToLowerInvariant() -in @("1", "true", "yes", "on")) {
+    Write-Host "  GPU mode enabled; checking Paddle CUDA runtime..." -ForegroundColor Cyan
+    if (-not (Test-PythonPackage "paddlepaddle-gpu")) {
+        Invoke-Step {
+            & $Python -m pip install paddlepaddle-gpu==2.6.2 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/ -q
+        } "Paddle GPU install failed."
+    }
+    if (-not (Test-PythonPackage "nvidia-cudnn-cu11")) {
+        Invoke-Step { & $Python -m pip install nvidia-cudnn-cu11==8.9.5.29 -q } "NVIDIA cuDNN install failed."
+    }
+    Add-NvidiaDllPaths
+} else {
+    Write-Host "  CPU mode enabled; checking Paddle CPU runtime..." -ForegroundColor Cyan
+    if (-not ((Test-PythonPackage "paddlepaddle") -or (Test-PythonPackage "paddlepaddle-gpu"))) {
+        Invoke-Step { & $Python -m pip install paddlepaddle==2.6.2 -q } "Paddle CPU install failed."
+    }
+}
+
+# ── [3/5] Database migration ─────────────────────────────────────────────────
+Write-Host "`n[3/5] Running database migrations..." -ForegroundColor Cyan
+& $Python -m alembic upgrade head
+if ($LASTEXITCODE -ne 0) {
+    Write-Host "  Migration conflict detected, stamping current state as head..." -ForegroundColor Yellow
+    Invoke-Step { & $Python -m alembic stamp head } "Alembic stamp failed."
+    Write-Host "  Retrying upgrade for any remaining new migrations..." -ForegroundColor Yellow
+    & $Python -m alembic upgrade head
+    if ($LASTEXITCODE -ne 0) {
+        Write-Host "  Migration still failed. Please check alembic manually." -ForegroundColor Red
+        exit 1
+    }
+}
+Write-Host "  Migrations OK." -ForegroundColor Green
+
+# ── [4/5] Free up port ───────────────────────────────────────────────────────
+Write-Host "`n[4/5] Checking port $Port..." -ForegroundColor Cyan
+
+# Use Get-NetTCPConnection for reliable port detection on Windows
+$connections = Get-NetTCPConnection -LocalPort $Port -State Listen -ErrorAction SilentlyContinue
+if ($connections) {
+    foreach ($conn in $connections) {
+        $procId = $conn.OwningProcess
+        $procName = (Get-Process -Id $procId -ErrorAction SilentlyContinue).Name
+        Write-Host "  Port $Port used by '$procName' (PID $procId), killing..." -ForegroundColor Yellow
+        Stop-Process -Id $procId -Force -ErrorAction SilentlyContinue
+    }
+    # Wait until port is actually released (max 5 seconds)
+    $waited = 0
+    do {
+        Start-Sleep -Milliseconds 500
+        $waited += 500
+        $still = Get-NetTCPConnection -LocalPort $Port -State Listen -ErrorAction SilentlyContinue
+    } while ($still -and $waited -lt 5000)
+
+    if ($still) {
+        Write-Host "  Port $Port still in use after waiting. Try a different port or restart manually." -ForegroundColor Red
+        exit 1
+    }
+    Write-Host "  Port $Port freed." -ForegroundColor Green
+} else {
+    Write-Host "  Port $Port is free." -ForegroundColor Green
+}
+
+# ── [5/5] Start dev server ───────────────────────────────────────────────────
+Write-Host "`n[5/5] Starting dev server on port $Port (Ctrl+C to stop)..." -ForegroundColor Cyan
+& $Python -m uvicorn ocr_sprint.main:app --reload --host 0.0.0.0 --port $Port
Author	SHA1	Message	Date
Adriankf59	b8a1198e93	docs: add comprehensive deployment guide for docker and manual setups	2026-04-27 10:06:38 +07:00
Adriankf59	6d793758ff	feat: implement PP-Structure table extraction pipeline with GPU runtime configuration support	2026-04-27 00:51:23 +07:00
Nama Kamu	9d969e61fd	update	2026-04-26 22:08:41 +08:00
Adriankf59	5d9d9f784a	updated	2026-04-26 18:15:38 +07:00
Adriankf59	002821ca07	feat: implement robust personnel data extraction pipeline with text-based fallback and coordinate-aware processing	2026-04-26 17:16:47 +07:00
Adrian Kuman Firmansah	dbcf480130	Merge pull request #8 from Adriankf59/devin/1777181072-fix-personnel-extraction-cimahi Fix personnel extraction + header bugs on real Polres Cimahi sprint	2026-04-26 13:10:44 +07:00
Devin AI	737f4999dd	Use word-boundary matching for personnel name blocklist Devin Review correctly flagged that the bare "NO" and "KET" entries in the blocklist would silently drop common Indonesian names (KETUT, NOVA, NOOR, NORMAN, NOVIANTI, ...) because the check used startswith rather than a word boundary. Replaced the per-prefix loop with a single compiled regex anchored at ^ with a trailing \b, which still matches column headers like "NO" or "KET" on their own line but no longer rejects "NOOR HIDAYAT" or "KETUT WARDANA". Also fixes the same bug in _following_jabatan. Added two regression tests covering both directions: names starting with the offending tokens are kept, bare column headers still rejected. Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>	2026-04-26 05:46:21 +00:00
Devin AI	58a2bf2648	Fix personnel extraction + header bugs on real Polres Cimahi sprint This fixes 4 bugs found on a real Polres Cimahi SPRIN PDF: 1. satuan_penerbit captured the generic 'KEPOLISIAN NEGARA REPUBLIK INDONESIA' letterhead line instead of the most-specific issuing unit (e.g. RESOR CIMAHI / SEKTOR PADALARANG). Reworked find_satuan to scan for each level independently and return the deepest available. 2. find_dasar_list dropped numbered items when OCR put the marker on its own line ("1.\n Undang-Undang ..."). Refactored into _collect_numbered_section that buffers a bare-number line and uses the next non-empty line as the body. Also reused for the new find_untuk_list which extracts the previously-empty 'untuk' bullets. 3. find_perihal returned None for documents that use 'Pertimbangan' (very common in Polres-level sprint), forcing the LLM to guess. Added a regex fallback that picks up the first line under a 'Pertimbangan' label so we keep extraction deterministic. 4. Personnel rows were emitted with only nama populated when PP-Structure detected a table but the column mapper degraded. Added a text-based fallback (extract_personnel_from_text) that scans raw OCR for <rank> + <8-digit NRP> patterns. Triggered when the PP-Structure result has fewer than 30% rank/NRP-bearing rows. Reviewed by raising the new PERSONNEL_TEXT_FALLBACK flag. 5. Validation now flags rows with neither pangkat nor nrp as INCOMPLETE_PERSONNEL_ROW, so the document routes to needs_review even when individual nrp/pangkat checks pass on empty values. 6. Added 'BRIGPOL' as a variant of BRIGADIR (seen in real scans). Tests: 229 (was 203) — 26 new tests covering the regex fixes, text-based personnel extractor, low-quality detector, validator behaviour, and orchestrator wiring of the fallback path. Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>	2026-04-26 05:35:42 +00:00
Adrian Kuman Firmansah	dce77e80e1	Merge pull request #7 from Adriankf59/devin/1777149159-phase-7-followup-empty-dict-consistency Fix empty-dict consistency in ground-truth export (follow-up to #6)	2026-04-26 03:34:22 +07:00
Devin AI	0755fbebda	Fix empty-dict consistency in ground-truth export Devin Review (post-merge on PR #6) flagged that the `final_result` assignment used a truthiness check (`if job_row.result`) while `build_initial_result` used an identity check (`is None`). For a job whose result is an empty dict (`{}`), the emitted `GroundTruthSample` ended up with `initial_result={}` but `final_result=None` — logically inconsistent. Switch the final-result assignment to the same `is None` check so both fields agree. Added `test_empty_dict_result_stays_consistent` to lock the invariant in, and fixed the test helper so callers can pass `{}` without the helper's `or` fallback replacing it. Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>	2026-04-25 20:33:26 +00:00
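
The word-boundary fix described in commit 737f4999dd can be illustrated with a minimal sketch. The token list and the function names below are assumptions made for illustration; the actual blocklist and helpers in `personnel_text.py` are not shown in this diff and may differ.

```python
import re

# Hypothetical header tokens; the real blocklist may contain other entries.
_HEADER_TOKENS = ("NO", "NAMA", "PANGKAT", "JABATAN", "KET", "DAFTAR")

# A single compiled pattern, anchored at the start of the line, with a
# trailing \b so that a token only matches as a whole word.
_HEADER_RE = re.compile(
    r"^(?:" + "|".join(re.escape(tok) for tok in _HEADER_TOKENS) + r")\b"
)


def looks_like_header(line: str) -> bool:
    """Return True when the line starts with a whole-word header token."""
    return _HEADER_RE.match(line.strip().upper()) is not None


def startswith_check(line: str) -> bool:
    """Prefix-only check, shown here to contrast with the regex above."""
    return line.strip().upper().startswith(_HEADER_TOKENS)


if __name__ == "__main__":
    for sample in ("KET", "PANGKAT / NRP", "KETUT WARDANA", "NOOR HIDAYAT"):
        print(sample, looks_like_header(sample), startswith_check(sample))
```

With the prefix-only check, "KETUT WARDANA" is rejected because it begins with "KET"; with the anchored pattern it is kept, since no word boundary follows "KET" inside "KETUT". Bare header lines such as "NO" or "PANGKAT / NRP" are still rejected by both.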