feat: implement robust personnel data extraction pipeline with text-based fallback and coordinate-aware processing

Adriankf59
2026-04-26 17:16:47 +07:00
parent dbcf480130
commit 002821ca07
20 changed files with 3326 additions and 20 deletions

437
docs/DEPLOYMENT.md Normal file

@@ -0,0 +1,437 @@
# OCR Sprint Service Deployment Quickstart
A guide to deploying OCR Sprint Service to a production server for processing Polri sprint letter (surat sprint) documents.
## Server Prerequisites
### Minimum Specifications
- **OS**: Linux (Ubuntu 20.04+ / Debian 11+ / RHEL 8+)
- **CPU**: 4 cores (8 cores recommended for high throughput)
- **RAM**: 8 GB minimum (16 GB recommended)
- **Storage**: 50 GB free space
  - ~3 GB for the PaddleOCR models
  - ~1.5 GB for Python dependencies
  - The remainder for document blob storage
- **Network**: Port 8000 open for API access
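The storage breakdown above can be turned into a quick capacity estimate. The sketch below only redoes the arithmetic from the list (50 GB minus the model and dependency footprints) and divides by an upload cap; the function names are hypothetical, and the default cap of 25 MB is taken from `BLOB_MAX_UPLOAD_MB` in the sample configuration later in this guide.

```bash
# Sketch: how much of the 50 GB minimum is left for blob storage.
# All sizes are in MB so the arithmetic stays in integers.
blob_budget_mb() {
    local total_mb=$((50 * 1024))   # 50 GB free space
    local models_mb=$((3 * 1024))   # ~3 GB PaddleOCR models
    local deps_mb=1536              # ~1.5 GB Python dependencies
    echo $((total_mb - models_mb - deps_mb))
}

# Number of maximum-size uploads that fit, given an upload cap in MB.
max_size_uploads() {
    local cap_mb="${1:-25}"
    echo $(( $(blob_budget_mb) / cap_mb ))
}
```

Running `blob_budget_mb` gives 46592 MB (45.5 GB), and `max_size_uploads 25` gives 1863, which is a worst-case figure since most documents are likely smaller than the cap.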
### Software Requirements
- Docker 24.0+ and Docker Compose v2
- Git
- (Optional) Nginx/Caddy for reverse proxy + SSL
## Deployment with Docker Compose (Recommended)
### 1. Clone Repository
```bash
# Log in to the server as a non-root user with sudo access
ssh user@your-server.com
# Clone repository
git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service
```
### 2. Configure the Environment
```bash
# Copy the environment template
cp .env.example .env
# Edit the production configuration
nano .env
```
**Important settings for production:**
```bash
# ==== App ====
APP_ENV=prod
APP_LOG_LEVEL=INFO
# ==== Storage ====
STORAGE_LOCAL_DIR=/app/storage
BLOB_STORAGE_DIR=/app/storage/blobs
BLOB_MAX_UPLOAD_MB=25
# ==== OCR ====
OCR_LANG=latin
OCR_USE_GPU=false # set to true if the server has an NVIDIA GPU
OCR_MAX_IMAGE_SIDE=2200
# ==== Preprocessing ====
PREPROCESS_TARGET_DPI=300
PREPROCESS_DENOISE=true
PREPROCESS_DESKEW=true
PREPROCESS_DETECT_DOCUMENT=true
PREPROCESS_REMOVE_SHADOW=true
# ==== Table Extraction ====
TABLES_ENABLED=true
# ==== Async Pipeline ====
QUEUE_ENABLED=true
REDIS_URL=redis://redis:6379/0
CELERY_TASK_DEFAULT_QUEUE=ocr_sprint
# ==== Database ====
DATABASE_URL=postgresql+psycopg://ocr:ocr@postgres:5432/ocr_sprint
DATABASE_ECHO=false
# ==== Auth (REQUIRED for production!) ====
API_KEYS=your-secret-key-1,your-secret-key-2
API_KEY_HEADER=X-API-Key
```
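Before starting the stack it can help to confirm that the edited `.env` is not missing anything essential. The following is a minimal sketch, not part of the repository: `check_env_file` is a hypothetical helper, and the list of keys it checks is an assumption based on the settings shown above.

```bash
# Hypothetical helper: verify that a .env file defines the keys this guide
# treats as essential and that API_KEYS no longer holds the placeholder values.
check_env_file() {
    local file="$1" key status=0
    for key in APP_ENV DATABASE_URL REDIS_URL API_KEYS; do
        # A key counts as set only if it has a non-empty value.
        if ! grep -Eq "^${key}=.+" "$file"; then
            echo "missing or empty: ${key}"
            status=1
        fi
    done
    if grep -Eq '^API_KEYS=.*your-secret-key' "$file"; then
        echo "API_KEYS still contains a placeholder value"
        status=1
    fi
    return "$status"
}
```

Calling `check_env_file .env` prints one line per problem and returns a non-zero status if any were found, so it can gate a deploy script.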
**Generate secure API keys:**
```bash
# Generate a random API key
openssl rand -hex 32
```
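Since `API_KEYS` takes a comma-separated list, the single-key command above can be wrapped to produce the whole line at once. This is a sketch under assumptions: the function names are hypothetical, and the `/dev/urandom` fallback is only there for hosts without `openssl`.

```bash
# Produce one 64-character hex key (32 random bytes).
gen_key() {
    if command -v openssl >/dev/null 2>&1; then
        openssl rand -hex 32
    else
        # Fallback when openssl is not installed.
        head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
        echo
    fi
}

# Print an API_KEYS line with the requested number of keys (default: 2).
api_keys_line() {
    local count="${1:-2}" keys="" i=0
    while [ "$i" -lt "$count" ]; do
        keys="${keys}${keys:+,}$(gen_key)"
        i=$((i + 1))
    done
    printf 'API_KEYS=%s\n' "$keys"
}
```

Run `api_keys_line 2` and paste the printed line into `.env` in place of the placeholder `API_KEYS` entry.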
### 3. Build and Start Services
```bash
# Build Docker images
docker compose build
# Start all services (API, Worker, Redis, Postgres)
docker compose up -d
# Check the logs to confirm everything is running
docker compose logs -f api worker
```
**Running services:**
- `api`: FastAPI server on port 8000
- `worker`: Celery worker for async processing
- `redis`: Message broker for the job queue
- `postgres`: Database for job state
### 4. Verify the Deployment
```bash
# Health check
curl http://localhost:8000/api/v1/health
# Expected response:
# {"status":"ok","version":"0.1.0"}
# Test the OCR endpoint (sync mode, for testing only)
# The URL is quoted so the shell does not treat "?" as a glob character
curl -X POST "http://localhost:8000/api/v1/documents?sync=true" \
  -H "X-API-Key: your-secret-key-1" \
  -F "file=@samples/pdf/example.pdf" \
  | jq
```
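Right after `docker compose up -d` the API may need some time before it answers, so a single `curl` can fail even on a healthy deployment. A small polling loop avoids that. This is a sketch: the function names, retry count, and delay are assumptions, and the only thing taken from this guide is the health URL and the `"status":"ok"` field of the expected response.

```bash
# True when the response body reports status "ok".
is_healthy() {
    printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Poll a health URL until it reports ok, or give up after a number of tries.
wait_for_health() {
    local url="$1" tries="${2:-30}" delay="${3:-2}" body i=1
    while [ "$i" -le "$tries" ]; do
        body="$(curl -fsS "$url" 2>/dev/null)" && is_healthy "$body" && return 0
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

For example, `wait_for_health http://localhost:8000/api/v1/health 30 2 || docker compose logs api` waits up to about a minute and shows the API logs if the service never becomes healthy.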
### 5. Setup Reverse Proxy (Nginx)
**Install Nginx:**
```bash
sudo apt update
sudo apt install nginx certbot python3-certbot-nginx
```
**Nginx configuration (`/etc/nginx/sites-available/ocr-sprint`):**
```nginx
upstream ocr_api {
    server localhost:8000;
}
server {
    listen 80;
    server_name ocr.yourdomain.com;
    client_max_body_size 30M; # Adjust to match BLOB_MAX_UPLOAD_MB
    location / {
        proxy_pass http://ocr_api;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # Timeouts for large documents
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;
    }
    location /metrics {
        # Restrict metrics endpoint
        allow 10.0.0.0/8; # Internal network only
        deny all;
        proxy_pass http://ocr_api;
    }
}
```
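The Nginx body limit and the application upload limit are configured in two different files, so they can drift apart. If `client_max_body_size` ends up below `BLOB_MAX_UPLOAD_MB`, Nginx will reject uploads that the API itself would accept. The check below is a sketch with hypothetical function names; it assumes the Nginx value is written in megabytes with an `M` suffix, as in the configuration above.

```bash
# Extract N from "client_max_body_size <N>M;" in an Nginx config file.
nginx_body_limit_mb() {
    sed -nE 's/^[[:space:]]*client_max_body_size[[:space:]]+([0-9]+)[Mm];.*/\1/p' "$1" | head -n 1
}

# Extract the numeric value of BLOB_MAX_UPLOAD_MB from a .env file.
env_upload_limit_mb() {
    sed -nE 's/^BLOB_MAX_UPLOAD_MB=([0-9]+).*/\1/p' "$1" | head -n 1
}

# Succeed only if both values were found and Nginx allows at least as much as the app.
limits_consistent() {
    local nginx_mb env_mb
    nginx_mb="$(nginx_body_limit_mb "$1")"
    env_mb="$(env_upload_limit_mb "$2")"
    [ -n "$nginx_mb" ] && [ -n "$env_mb" ] && [ "$nginx_mb" -ge "$env_mb" ]
}
```

Usage: `limits_consistent /etc/nginx/sites-available/ocr-sprint .env || echo "upload limits do not match"`.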
**Enable the site and set up SSL:**
```bash
# Enable site
sudo ln -s /etc/nginx/sites-available/ocr-sprint /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
# Set up SSL with Let's Encrypt
sudo certbot --nginx -d ocr.yourdomain.com
```
## Manual Deployment (Without Docker)
### 1. Install System Dependencies
```bash
# Ubuntu/Debian
sudo apt update
sudo apt install -y \
  python3.11 python3.11-venv python3-pip \
  libgl1 libglib2.0-0 libsm6 libxext6 libxrender1 \
  libgomp1 libmagic1 \
  redis-server postgresql-14
# Start services
sudo systemctl enable --now redis-server postgresql
```
### 2. Setup Database
```bash
# Create the database and user
sudo -u postgres psql << EOF
CREATE USER ocr WITH PASSWORD 'your-secure-password';
CREATE DATABASE ocr_sprint OWNER ocr;
GRANT ALL PRIVILEGES ON DATABASE ocr_sprint TO ocr;
EOF
```
### 3. Install Application
```bash
# Clone repository
git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service
# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install --upgrade pip
pip install -e ".[ocr]"
# Copy and edit .env
cp .env.example .env
nano .env
```
**Update DATABASE_URL and REDIS_URL in .env:**
```bash
DATABASE_URL=postgresql+psycopg://ocr:your-secure-password@localhost:5432/ocr_sprint
REDIS_URL=redis://localhost:6379/0
QUEUE_ENABLED=true
```
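To confirm the credentials work before running migrations, the same URL can be tried with `psql`. One catch: the `postgresql+psycopg://` scheme is a SQLAlchemy dialect prefix, and libpq tools such as `psql` only accept `postgresql://` or `postgres://`. The helper below is a hypothetical sketch that strips the driver suffix.

```bash
# Convert a SQLAlchemy-style URL (postgresql+driver://...) into a plain libpq URL.
to_libpq_url() {
    printf '%s\n' "$1" | sed -E 's#^postgresql\+[a-z0-9_]+://#postgresql://#'
}
```

With `DATABASE_URL` loaded into the shell, `psql "$(to_libpq_url "$DATABASE_URL")" -c 'SELECT 1'` should print a single row if the user, password, and database are set up correctly.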
### 4. Run Database Migrations
```bash
alembic upgrade head
```
### 5. Setup Systemd Services
The unit files in this step assume the repository is checked out at `/opt/ocr-sprint-service` and that a system user named `ocr` exists. Neither is created by the earlier steps, so adjust `User`, `WorkingDirectory`, and the paths to match your setup.
**API Service (`/etc/systemd/system/ocr-sprint-api.service`):**
```ini
[Unit]
Description=OCR Sprint API
After=network.target postgresql.service redis.service
[Service]
Type=simple
User=ocr
WorkingDirectory=/opt/ocr-sprint-service
Environment="PATH=/opt/ocr-sprint-service/.venv/bin"
ExecStart=/opt/ocr-sprint-service/.venv/bin/uvicorn ocr_sprint.main:app --host 0.0.0.0 --port 8000 --workers 4
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
```
**Worker Service (`/etc/systemd/system/ocr-sprint-worker.service`):**
```ini
[Unit]
Description=OCR Sprint Celery Worker
After=network.target postgresql.service redis.service
[Service]
Type=simple
User=ocr
WorkingDirectory=/opt/ocr-sprint-service
Environment="PATH=/opt/ocr-sprint-service/.venv/bin"
ExecStart=/opt/ocr-sprint-service/.venv/bin/celery -A ocr_sprint.worker.celery_app worker -l info --concurrency=2
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
```
**Enable and start the services:**
```bash
sudo systemctl daemon-reload
sudo systemctl enable --now ocr-sprint-api ocr-sprint-worker
sudo systemctl status ocr-sprint-api ocr-sprint-worker
```
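`systemctl status` is meant for reading, not for scripting. For a deploy script or cron check, `systemctl is-active --quiet` returns an exit status that is easier to act on. The wrapper below is a minimal sketch; `units_active` is a hypothetical name.

```bash
# Succeed only if every listed systemd unit is currently active.
units_active() {
    local unit
    for unit in "$@"; do
        if ! systemctl is-active --quiet "$unit"; then
            echo "not active: $unit"
            return 1
        fi
    done
}
```

Usage: `units_active ocr-sprint-api ocr-sprint-worker || sudo journalctl -u ocr-sprint-api -n 50`.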
## Monitoring and Maintenance
### Monitoring Logs
```bash
# Docker deployment
docker compose logs -f api worker
# Manual deployment
sudo journalctl -u ocr-sprint-api -f
sudo journalctl -u ocr-sprint-worker -f
```
### Prometheus Metrics
Metrics are available at the `/metrics` endpoint:
```bash
curl http://localhost:8000/metrics
```
**Key metrics:**
- `ocr_documents_total`: Total documents processed
- `ocr_processing_duration_seconds`: Processing duration
- `ocr_confidence_score`: Confidence score distribution
- `celery_task_*`: Celery worker metrics
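Without a Prometheus server at hand, a counter such as `ocr_documents_total` can still be read straight from the text output. The sketch below sums one metric across all of its label sets. It assumes the standard Prometheus text format with no trailing timestamps and no spaces inside label values; `sum_metric` is a hypothetical name, and which labels the service actually attaches is not specified in this guide.

```bash
# Sum the sample values of one metric across all label sets.
# Reads Prometheus text-format output on stdin; fails if the metric is absent.
sum_metric() {
    awk -v name="$1" '
        # Skip comment lines; match the bare metric name or name{labels}.
        $0 !~ /^#/ && ($1 == name || index($1, name "{") == 1) { total += $NF; seen = 1 }
        END { if (seen) print total; else exit 1 }
    '
}
```

Usage: `curl -fsS http://localhost:8000/metrics | sum_metric ocr_documents_total`.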
### Backup Database
```bash
# Docker deployment
# -T disables pseudo-TTY allocation so the redirected dump is not altered
docker compose exec -T postgres pg_dump -U ocr ocr_sprint > backup_$(date +%Y%m%d).sql
# Manual deployment
pg_dump -U ocr ocr_sprint > backup_$(date +%Y%m%d).sql
```
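Daily dumps accumulate, and the storage budget is shared with the document blobs, so some form of rotation is worth having. The sketch below keeps only the newest few dumps in a directory. It relies on the `backup_YYYYMMDD.sql` naming used above, which sorts chronologically; the function name and the default of seven backups are assumptions.

```bash
# Delete the oldest backup_*.sql files in a directory, keeping the newest N.
prune_backups() {
    local dir="$1" keep="${2:-7}" total excess
    total="$(find "$dir" -maxdepth 1 -name 'backup_*.sql' | wc -l)"
    excess=$((total - keep))
    [ "$excess" -gt 0 ] || return 0
    # Date-stamped names sort oldest first, so the first $excess entries go.
    find "$dir" -maxdepth 1 -name 'backup_*.sql' | sort | head -n "$excess" |
        while IFS= read -r old; do
            rm -f -- "$old"
        done
}
```

Usage: run `prune_backups /path/to/backups 7` after each dump, for example from the same cron job.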
### Update Service
```bash
# Docker deployment
cd ocr-sprint-service
git pull
docker compose build
docker compose up -d
# Manual deployment
cd ocr-sprint-service
git pull
source .venv/bin/activate
pip install -e ".[ocr]"
alembic upgrade head
sudo systemctl restart ocr-sprint-api ocr-sprint-worker
```
## Troubleshooting
### Service does not start
```bash
# Check the logs
docker compose logs api worker
# Run the health check
curl http://localhost:8000/api/v1/health
```
### PaddleOCR model download fails
```bash
# Download the models manually into the volume
docker compose exec api python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='latin')"
```
### Worker is not processing jobs
```bash
# Check the Redis connection
docker compose exec worker redis-cli -h redis ping
# Check the Celery worker status
docker compose exec worker celery -A ocr_sprint.worker.celery_app inspect active
```
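If the worker is up but jobs still seem stuck, looking at the queue itself can tell whether tasks are piling up or never arriving. The sketch below assumes Celery's Redis transport stores pending tasks in a Redis list named after the queue (`ocr_sprint`, per `CELERY_TASK_DEFAULT_QUEUE`), and that the `redis` service uses an image that ships `redis-cli`. Both are assumptions about this deployment, and `queue_depth` is a hypothetical name.

```bash
# Print the number of tasks waiting in a Celery queue stored in Redis.
queue_depth() {
    local queue="${1:-ocr_sprint}"
    # -T disables TTY allocation; tr strips any stray carriage return.
    docker compose exec -T redis redis-cli LLEN "$queue" | tr -d '\r'
}
```

A depth that keeps growing while `inspect active` shows no tasks points at the worker, whereas a depth of 0 after submitting a document suggests the API is not enqueueing (for example, `QUEUE_ENABLED` is off).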
### Database migration error
```bash
# Check the current revision
docker compose exec api alembic current
# Apply pending migrations
docker compose exec api alembic upgrade head
```
### Out of memory
```bash
# Reduce worker concurrency in docker-compose.yml
# Change to: --concurrency=1 (default), or add a memory limit
```
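Picking a concurrency value is mostly a memory calculation: each worker process holds its own copy of the OCR models. The sketch below shows the arithmetic only. All three inputs are numbers you have to measure or decide yourself (for example with `docker stats`); this guide does not state a per-worker memory figure, and `max_worker_concurrency` is a hypothetical name.

```bash
# Largest concurrency that fits in memory, never less than 1.
# Arguments: total RAM (MB), RAM reserved for other services (MB), RAM per worker process (MB).
max_worker_concurrency() {
    local total_mb="$1" reserved_mb="$2" per_worker_mb="$3" n
    n=$(( (total_mb - reserved_mb) / per_worker_mb ))
    [ "$n" -ge 1 ] || n=1
    echo "$n"
}
```

As an illustration with made-up inputs, `max_worker_concurrency 8192 2048 2048` prints 3: on an 8 GB server with 2 GB set aside for the API, Redis, and Postgres, three 2 GB workers fit.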
## Security Checklist
- [ ] API_KEYS set to strong random values
- [ ] Firewall configured (only ports 80/443 open)
- [ ] SSL/TLS enabled via Nginx + Let's Encrypt
- [ ] Database password changed from the default
- [ ] `/metrics` endpoint restricted to the internal network
- [ ] Regular backups of the database and blob storage
- [ ] Log rotation configured
- [ ] OS security updates enabled
## Performance Tuning
### For high throughput:
1. **Increase worker concurrency:**
```yaml
# docker-compose.yml
command: ["celery", "-A", "ocr_sprint.worker.celery_app", "worker", "-l", "info", "--concurrency=4"]
```
2. **Scale workers horizontally:**
```bash
docker compose up -d --scale worker=3
```
3. **Enable GPU (if available):**
```bash
# .env
OCR_USE_GPU=true
```
4. **Tune Postgres:**
```sql
-- Raise the connection limit and shared buffers (both take effect only after a PostgreSQL restart)
ALTER SYSTEM SET max_connections = 200;
ALTER SYSTEM SET shared_buffers = '2GB';
```
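Steps 1 and 2 multiply: each worker replica runs its own pool of processes, so the number of OCR tasks that can run at once is replicas times concurrency. The sketch below just makes that product explicit and compares it with the core count; the function names are hypothetical.

```bash
# Total number of tasks that can run at the same time across all worker replicas.
total_task_slots() {
    local replicas="$1" concurrency="$2"
    echo $((replicas * concurrency))
}

# Succeed if the task slots do not exceed the available CPU cores.
slots_fit_cores() {
    [ "$(total_task_slots "$1" "$2")" -le "$3" ]
}
```

Applying both examples above (`--scale worker=3` with `--concurrency=4`) gives `total_task_slots 3 4` = 12 slots, more than the 8 cores recommended in the prerequisites. Whether that oversubscription hurts depends on how CPU-bound the OCR tasks are, so it is worth measuring before raising both knobs together.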
## Support
For questions or issues, contact the development team or open an issue in the repository.