# OCR Sprint Service

OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah) documents. Built around **FastAPI + PaddleOCR + hybrid extraction (regex → local LLM → validation)** with **on-premise** deployment as a hard requirement.

> **Status:** Phases 1–4 are done: synchronous + async PDF/image OCR with regex header extraction, PP-Structure personnel-table extraction, validation, confidence scoring, document detection / perspective correction / shadow removal, **Celery + Redis job queue, Postgres job state, local-filesystem blob storage, API-key auth, and Prometheus metrics**. Phases 5–6 (LLM extraction, HITL) are tracked in [`docs/architecture.md`](docs/architecture.md).

## Why this stack

- **PaddleOCR** is the strongest open-source OCR for mixed-language documents and runs fully on-prem (essential for police data).
- **PP-Structure** (Phase 3) handles personnel tables natively.
- **Regex-first, LLM-fallback extraction** keeps deterministic fields fast and predictable while letting an LLM handle format drift across Polri units.
- **CPU-friendly defaults**: a small (1.5B–4B) local LLM via Ollama is the recommended default; the architecture is also GPU-ready.

See [`docs/architecture.md`](docs/architecture.md) for the full architecture, accuracy expectations, and roadmap.

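
The regex-first, LLM-fallback idea described above can be sketched roughly as follows. This is an illustration only, not the service's actual code: the pattern `NOMOR_SPRINT_RE`, the function `extract_nomor_sprint`, and the `llm_fallback` callable are hypothetical names, and the pattern is guessed from the single sample value `Sprin/123/IV/2025/Reskrim`. LLM extraction itself is still planned (Phase 5).

```python
import re
from typing import Callable, Optional

# Hypothetical pattern inferred from one sample value, "Sprin/123/IV/2025/Reskrim".
# Real documents may use other prefixes or separators.
NOMOR_SPRINT_RE = re.compile(r"Sprin/\d+/[IVXLCDM]+/\d{4}/[A-Za-z]+")


def extract_nomor_sprint(
    text: str,
    llm_fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> tuple[Optional[str], str]:
    """Return (value, source), where source is "regex", "llm", or "none"."""
    # Step 1: deterministic path. Cheap and predictable when the format is clean.
    match = NOMOR_SPRINT_RE.search(text)
    if match:
        return match.group(0), "regex"

    # Step 2: optional fallback for format drift or OCR noise.
    if llm_fallback is not None:
        candidate = llm_fallback(text)
        # Step 3: validate the fallback answer against the same shape,
        # so a hallucinated value is rejected rather than returned.
        if candidate and NOMOR_SPRINT_RE.fullmatch(candidate.strip()):
            return candidate.strip(), "llm"

    return None, "none"
```

Returning the source alongside the value is one way a later stage could weight confidence or raise a review flag for fields that did not come from the deterministic path.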
## Quickstart

### Prerequisites

- Python **3.10–3.12**
- ~3 GB free disk for PaddleOCR model downloads on first run
- Linux/macOS recommended (Windows works but PaddleOCR install can be finicky)

### Install (local dev)

```bash
git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service
python -m venv .venv && source .venv/bin/activate
make install                     # installs runtime + dev deps + pre-commit
pip install -e ".[ocr]"          # only on the worker host; pulls Paddle wheels (~1.5 GB)
cp .env.example .env             # edit if you need GPU / different storage path
```

### Run the API

```bash
make dev
# → http://localhost:8000/docs
```

### Try it out

The default `POST /documents` is async: it returns `202 Accepted` with a `job_id` and the worker fills in the result. For tests / local one-shot usage you can append `?sync=true` to run inline.

```bash
# Async (production flow)
curl -F "file=@samples/pdf/example.pdf" \
     -H "X-API-Key: $API_KEY" \
     http://localhost:8000/api/v1/documents | jq
# → {"job_id":"8f2a...","status":"pending",...}

curl -H "X-API-Key: $API_KEY" \
     http://localhost:8000/api/v1/documents/8f2a... | jq

# Sync (single small doc, no worker required)
curl -F "file=@samples/pdf/example.pdf" \
     "http://localhost:8000/api/v1/documents?sync=true" | jq
```

Expected response (truncated):

```json
{
  "job_id": "8f2a...",
  "status": "completed",
  "confidence": 0.93,
  "data": {
    "header": {
      "nomor_sprint": "Sprin/123/IV/2025/Reskrim",
      "tanggal": "2025-04-21",
      "satuan_penerbit": "KEPOLISIAN RESOR BANDUNG",
      "perihal": "Pelaksanaan penyelidikan kasus pencurian",
      "dasar": ["Undang-Undang Nomor 2 Tahun 2002 ...", "..."]
    },
    "personel": [],
    "ttd": { "nrp": "12345678" }
  },
  "review_flags": []
}
```

> **Note:** As of Phase 3 the `personel[]` array is populated from PP-Structure table recognition. Set `TABLES_ENABLED=false` in `.env` to skip the table stage (faster on documents that you know contain no personnel table).

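
The async flow under "Try it out" (submit, then poll the job until it finishes) can also be scripted. The sketch below covers only the polling half and uses the standard library. The base URL and the `X-API-Key` header come from the curl examples; the helper names `fetch_job` and `wait_for_job`, the polling interval, and the `"failed"` terminal status are assumptions, since only `pending` and `completed` appear in this README.

```python
import json
import time
import urllib.request
from typing import Any, Callable

BASE_URL = "http://localhost:8000/api/v1"  # taken from the curl examples


def fetch_job(job_id: str, api_key: str) -> dict[str, Any]:
    """GET /documents/{job_id} and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{BASE_URL}/documents/{job_id}",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def wait_for_job(
    job_id: str,
    api_key: str,
    fetch: Callable[[str, str], dict[str, Any]] = fetch_job,
    interval: float = 1.0,
    timeout: float = 120.0,
    terminal: tuple[str, ...] = ("completed", "failed"),  # "failed" is assumed
) -> dict[str, Any]:
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id, api_key)
        if job.get("status") in terminal:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {job.get('status')!r} after {timeout}s"
            )
        time.sleep(interval)
```

With a running stack, a call such as `wait_for_job(job_id, os.environ["API_KEY"])` would return the final job document. The `fetch` parameter is injectable so the loop can be exercised in tests without a live API.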
### Docker

The Phase 4 stack runs four services: `api`, `worker` (Celery), `redis`, and `postgres`. Blob uploads are persisted to a Docker volume; there is **no MinIO/S3** dependency.

```bash
docker compose build
docker compose up -d
docker compose logs -f api worker
```

The API container runs `alembic upgrade head` on start, so the `jobs` table is created on first boot. The first request will trigger PaddleOCR to download its detection/recognition/cls models (~200 MB) into the `paddle-models` volume. Metrics are exposed in Prometheus text format.

## Development

```bash
make fmt        # format with ruff
make lint       # lint
make typecheck  # mypy strict mode
make test       # pytest
make test-cov   # pytest + coverage
```

Pre-commit hooks run ruff on every commit. Install once with `pre-commit install` (already done by `make install`).

## Project layout

```
src/ocr_sprint/
  api/          # FastAPI routes + error handlers
  schemas/      # Pydantic v2 models (request/response, extraction, personnel)
  pipeline/     # ingest → document_detect → preprocess → ocr + table → extract → validate → score
  extract/      # regex_rules.py (Phase 1) + personnel.py (Phase 3) → llm.py (Phase 5)
  data/         # master data (Polri ranks, etc.)
  utils/        # logging, helpers
  config.py     # pydantic-settings
  main.py       # app factory
tests/unit/     # 100+ unit tests, PaddleOCR / PP-Structure mocked
docs/           # architecture & decision records
```

## Roadmap

| Phase | Scope | Status |
|---|---|---|
| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | **Done** |
| 2 | OpenCV-based document detection, perspective transform, shadow removal for phone photos | **Done** |
| 3 | PP-Structure table extraction for personnel rows + column mapper | **Done** |
| 4 | Async pipeline (Celery + Redis), Postgres job state, local-filesystem blob storage, API-key auth, Prometheus metrics | **Done** |
| 5 | LLM hybrid extraction (Ollama + structured output) | Planned |
| 6 | HITL review endpoints + audit trail | Planned |

## License

Proprietary; internal use only.