# OCR Sprint Service

OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah) documents. Built around **FastAPI + PaddleOCR + hybrid extraction (regex → local LLM → validation)** with **on-premise** deployment as a hard requirement.

> **Status:** Phase 1 MVP: synchronous PDF/image OCR with regex header extraction, validation, and confidence scoring. Phases 2–6 (document detection, table extraction, async pipeline, LLM extraction, HITL) are tracked in [`docs/architecture.md`](docs/architecture.md).

## Why this stack

- **PaddleOCR** is among the strongest open-source OCR engines for mixed-language documents and runs fully on-prem (essential for police data).
- **PP-Structure** (Phase 3) handles personnel tables natively.
- **Regex-first, LLM-fallback extraction** keeps deterministic fields fast and predictable while letting an LLM handle format drift across Polri units.
- **CPU-friendly defaults**: a small (1.5B–4B) local LLM via Ollama is the recommended default; the architecture is also GPU-ready.

See [`docs/architecture.md`](docs/architecture.md) for the full architecture, accuracy expectations, and roadmap.
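The regex-first, LLM-fallback idea can be sketched in a few lines. This is an illustrative sketch only, not the service's actual code: the pattern is inferred solely from the sample value `Sprin/123/IV/2025/Reskrim`, and `extract_nomor_sprint` and `llm_fallback` are hypothetical names.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern, inferred only from the sample "Sprin/123/IV/2025/Reskrim".
# Real documents may use other layouts; those are what the fallback is for.
NOMOR_SPRINT_RE = re.compile(
    r"\bSprin/(?P<seq>\d+)/(?P<month>[IVX]+)/(?P<year>\d{4})/(?P<unit>[A-Za-z]+)\b"
)


def extract_nomor_sprint(
    text: str,
    llm_fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> tuple[Optional[str], str]:
    """Return (value, source), where source is "regex", "llm", or "none"."""
    match = NOMOR_SPRINT_RE.search(text)
    if match:
        # Deterministic path: fast and predictable.
        return match.group(0), "regex"
    if llm_fallback is not None:
        # Fallback path: only consulted when the regex finds nothing.
        candidate = llm_fallback(text)
        if candidate:
            return candidate, "llm"
    return None, "none"


print(extract_nomor_sprint("Nomor: Sprin/123/IV/2025/Reskrim"))
```

Returning the source alongside the value lets a later validation or scoring step treat LLM-derived fields with more suspicion than regex matches.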
## Quickstart

### Prerequisites

- Python **3.10–3.12**
- ~3 GB free disk for PaddleOCR model downloads on first run
- Linux/macOS recommended (Windows works but PaddleOCR install can be finicky)

### Install (local dev)

```bash
git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service
python -m venv .venv && source .venv/bin/activate
make install            # installs runtime + dev deps + pre-commit
cp .env.example .env    # edit if you need GPU / different storage path
```

### Run the API

```bash
make dev
# → http://localhost:8000/docs
```

### Try it out

```bash
curl -F "file=@samples/pdf/example.pdf" http://localhost:8000/api/v1/documents | jq
```

Expected response (truncated):

```json
{
  "job_id": "8f2a...",
  "status": "completed",
  "confidence": 0.93,
  "data": {
    "header": {
      "nomor_sprint": "Sprin/123/IV/2025/Reskrim",
      "tanggal": "2025-04-21",
      "satuan_penerbit": "KEPOLISIAN RESOR BANDUNG",
      "perihal": "Pelaksanaan penyelidikan kasus pencurian",
      "dasar": ["Undang-Undang Nomor 2 Tahun 2002 ...", "..."]
    },
    "personel": [],
    "ttd": { "nrp": "12345678" }
  },
  "review_flags": []
}
```

> **Note:** Phase 1 does not yet populate the `personel[]` table; that requires PP-Structure (Phase 3). Header fields, signatory NRP, confidence, and HITL routing are fully wired.

### Docker

```bash
docker compose build
docker compose up -d
docker compose logs -f api
```

The first request will trigger PaddleOCR to download its detection/recognition/cls models (~200 MB) into the `paddle-models` volume.

## Development

```bash
make fmt        # format with ruff
make lint       # lint
make typecheck  # mypy strict mode
make test       # pytest
make test-cov   # pytest + coverage
```

Pre-commit hooks run ruff on every commit. Install once with `pre-commit install` (already done by `make install`).
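When scripting checks against a running instance, the response shape shown in the Quickstart can be consumed from Python. The sketch below uses a trimmed copy of that sample response; `summarize` and the `min_confidence` threshold of 0.8 are assumptions for illustration, not values defined by the service.

```python
import json

# Trimmed copy of the sample response from the Quickstart.
SAMPLE_RESPONSE = """
{
  "job_id": "8f2a...",
  "status": "completed",
  "confidence": 0.93,
  "data": {
    "header": {
      "nomor_sprint": "Sprin/123/IV/2025/Reskrim",
      "tanggal": "2025-04-21"
    },
    "personel": [],
    "ttd": {"nrp": "12345678"}
  },
  "review_flags": []
}
"""


def summarize(response: dict, min_confidence: float = 0.8) -> dict:
    """Pick out key header fields and decide whether a human should look.

    The threshold is a hypothetical client-side choice, not a service default.
    """
    header = response.get("data", {}).get("header", {})
    flags = response.get("review_flags", [])
    confidence = response.get("confidence", 0.0)
    return {
        "nomor_sprint": header.get("nomor_sprint"),
        "tanggal": header.get("tanggal"),
        # Flag when the service raised review flags or confidence is low.
        "needs_review": bool(flags) or confidence < min_confidence,
    }


summary = summarize(json.loads(SAMPLE_RESPONSE))
print(summary)
```

Using `.get()` with defaults keeps the helper tolerant of fields that Phase 1 does not populate yet, such as `personel`.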
## Project layout

```
src/ocr_sprint/
  api/          # FastAPI routes + error handlers
  schemas/      # Pydantic v2 models (request/response, extraction, personnel)
  pipeline/     # ingest → preprocess → ocr → extract → validate → score
  extract/      # regex_rules.py (Phase 1) → llm.py (Phase 5)
  data/         # master data (Polri ranks, etc.)
  utils/        # logging, helpers
  config.py     # pydantic-settings
  main.py       # app factory
tests/unit/     # ~60 unit tests, no PaddleOCR dependency
docs/           # architecture & decision records
```

## Roadmap

| Phase | Scope | Status |
|---|---|---|
| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | **In progress** |
| 2 | DocTR document detection + dewarping for phone photos | Planned |
| 3 | PP-Structure table extraction for personnel rows | Planned |
| 4 | Async pipeline (Celery + Redis), Postgres + MinIO, auth, observability | Planned |
| 5 | LLM hybrid extraction (Ollama + structured output) | Planned |
| 6 | HITL review endpoints + audit trail | Planned |

## License

Proprietary. Internal use only.