Adds OpenCV-based phone-photo handling that runs before the standard preprocessing pipeline for IMAGE source kinds (PDF renders are flat by construction and skip this stage).

Pipeline additions in src/ocr_sprint/pipeline/document_detect.py:
- _find_document_quad: Canny + dilate + contour search, picks the largest convex 4-point polygon above a configurable area threshold; fails gracefully and returns None when no usable quad is found.
- _four_point_warp: orders corners (TL/TR/BR/BL via sum/diff trick) and runs cv2.getPerspectiveTransform + warpPerspective.
- _remove_shadow: per-channel background-division (dilate + median blur + 255 - absdiff + normalize) for uneven phone-shot lighting.
- detect_and_correct: top-level entrypoint with graceful fallback to the original image when detection fails.

Wired into the synchronous orchestrator: only enabled for IMAGE sources, skipped for PDF.

New settings:
- preprocess_detect_document (default: true)
- preprocess_remove_shadow (default: true)
- preprocess_min_quad_area_fraction (default: 0.20)

Tests: 9 new unit tests covering corner ordering, quad detection on synthetic skewed documents, perspective warp output sanity, shadow removal shape preservation, full-pipeline behavior, and graceful fallback when detection fails. 70 tests total, all green.

ML-based dewarping (DewarpNet) and DocTR detector are deferred to a future phase per the roadmap; the existing API is structured so they can be added as alternative backends behind DocumentDetectConfig.

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
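The sum/diff corner ordering and the minimum-area check mentioned in the commit message can be illustrated without OpenCV. The following is a minimal sketch, not the repository's code: the function names `order_corners`, `quad_area`, and `is_usable_quad` are hypothetical, and only the 0.20 default is taken from the `preprocess_min_quad_area_fraction` setting. It assumes image coordinates with y growing downward.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def order_corners(points: Sequence[Point]) -> List[Point]:
    """Order four corners as TL, TR, BR, BL (hypothetical helper)."""
    if len(points) != 4:
        raise ValueError("expected exactly four points")
    # x + y is smallest at the top-left corner and largest at the bottom-right.
    tl = min(points, key=lambda p: p[0] + p[1])
    br = max(points, key=lambda p: p[0] + p[1])
    # y - x is smallest at the top-right corner and largest at the bottom-left.
    tr = min(points, key=lambda p: p[1] - p[0])
    bl = max(points, key=lambda p: p[1] - p[0])
    return [tl, tr, br, bl]


def quad_area(points: Sequence[Point]) -> float:
    """Shoelace formula; points must be given in perimeter order."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def is_usable_quad(
    points: Sequence[Point],
    image_width: int,
    image_height: int,
    min_area_fraction: float = 0.20,  # mirrors preprocess_min_quad_area_fraction
) -> bool:
    """Accept a quad only if it covers enough of the image."""
    ordered = order_corners(points)
    return quad_area(ordered) >= min_area_fraction * image_width * image_height


# Corners as a contour search might return them, in arbitrary order.
detected = [(620, 780), (90, 60), (40, 800), (600, 40)]
ordered = order_corners(detected)
usable = is_usable_quad(detected, image_width=800, image_height=1000)
```

The sum/diff trick is cheap but assumes the document is not rotated close to 45 degrees; in that case two corners can tie and a more robust ordering would be needed.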
OCR Sprint Service
OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah) documents. Built around FastAPI + PaddleOCR + hybrid extraction (regex → local LLM → validation) with on-premise deployment as a hard requirement.
Status: Phases 1 and 2 are done: synchronous PDF/image OCR with regex header extraction, validation, and confidence scoring, plus document detection, perspective correction, and shadow removal for phone photos. Phases 3–6 (table extraction, async pipeline, LLM extraction, HITL) are tracked in docs/architecture.md.
Why this stack
- PaddleOCR is among the strongest open-source OCR engines for mixed-language documents and runs fully on-prem (essential for police data).
- PP-Structure (Phase 3) handles personnel tables natively.
- Regex-first, LLM-fallback extraction keeps deterministic fields fast and predictable while letting an LLM handle format drift across Polri units.
- CPU-friendly defaults: a small (1.5B–4B) local LLM via Ollama is the recommended default; the architecture is also GPU-ready.
See docs/architecture.md for the full architecture, accuracy expectations, and roadmap.
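The regex-first, LLM-fallback idea can be sketched as follows. This is an illustrative assumption, not the service's actual extraction code: the pattern is modelled only on the sample number `Sprin/123/IV/2025/Reskrim`, and `extract_nomor_sprint` and `llm_fallback` are hypothetical names. The fallback is any callable, so a local LLM client can be plugged in later without changing the deterministic path.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern: "Sprin/<number>/<roman month>/<year>/<unit>".
NOMOR_SPRINT_RE = re.compile(r"\bSprin/\d+/[IVXLCDM]+/\d{4}/[A-Za-z]+\b")


def extract_nomor_sprint(
    text: str,
    llm_fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Try the deterministic regex first; call the fallback only on a miss."""
    match = NOMOR_SPRINT_RE.search(text)
    if match:
        return match.group(0)
    if llm_fallback is not None:
        # Format drift: let the (assumed) LLM-backed extractor have a go.
        return llm_fallback(text)
    return None


ocr_text = "Nomor: Sprin/123/IV/2025/Reskrim tentang pelaksanaan tugas"
nomor = extract_nomor_sprint(ocr_text)
```

Keeping the fallback optional means the regex path stays fast and testable on its own, which matches the deterministic-first rationale above.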
Quickstart
Prerequisites
- Python 3.10–3.12
- ~3 GB free disk for PaddleOCR model downloads on first run
- Linux/macOS recommended (Windows works but PaddleOCR install can be finicky)
Install (local dev)
git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service
python -m venv .venv && source .venv/bin/activate
make install # installs runtime + dev deps + pre-commit
cp .env.example .env # edit if you need GPU / different storage path
Run the API
make dev
# → http://localhost:8000/docs
Try it out
curl -F "file=@samples/pdf/example.pdf" http://localhost:8000/api/v1/documents | jq
Expected response (truncated):
{
  "job_id": "8f2a...",
  "status": "completed",
  "confidence": 0.93,
  "data": {
    "header": {
      "nomor_sprint": "Sprin/123/IV/2025/Reskrim",
      "tanggal": "2025-04-21",
      "satuan_penerbit": "KEPOLISIAN RESOR BANDUNG",
      "perihal": "Pelaksanaan penyelidikan kasus pencurian",
      "dasar": ["Undang-Undang Nomor 2 Tahun 2002 ...", "..."]
    },
    "personel": [],
    "ttd": { "nrp": "12345678" }
  },
  "review_flags": []
}
Note: Phases 1–2 do not yet populate the personel[] table; that requires PP-Structure (Phase 3). Header fields, signatory NRP, confidence, and HITL routing are fully wired.
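A client might use the `status`, `confidence`, and `review_flags` fields of the response to decide whether a document needs a human look. The sketch below is an assumption about how such a check could be written: the `needs_review` helper and the 0.85 threshold are hypothetical and not part of the service; the sample payload mirrors the truncated response shown above.

```python
import json
from typing import Any, Dict

REVIEW_THRESHOLD = 0.85  # hypothetical cut-off, not a documented setting


def needs_review(response: Dict[str, Any], threshold: float = REVIEW_THRESHOLD) -> bool:
    """Return True when the result should be routed to a human reviewer."""
    if response.get("status") != "completed":
        return True
    if response.get("review_flags"):
        return True
    return float(response.get("confidence", 0.0)) < threshold


# Sample payload copied from the expected response (identifiers stay truncated).
sample = json.loads(
    """
    {
      "job_id": "8f2a...",
      "status": "completed",
      "confidence": 0.93,
      "data": {
        "header": {"nomor_sprint": "Sprin/123/IV/2025/Reskrim"},
        "personel": [],
        "ttd": {"nrp": "12345678"}
      },
      "review_flags": []
    }
    """
)

route_to_human = needs_review(sample)
```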
Docker
docker compose build
docker compose up -d
docker compose logs -f api
The first request will trigger PaddleOCR to download its detection/recognition/cls models (~200 MB) into the paddle-models volume.
Development
make fmt # format with ruff
make lint # lint
make typecheck # mypy strict mode
make test # pytest
make test-cov # pytest + coverage
Pre-commit hooks run ruff on every commit. Install once with pre-commit install (already done by make install).
Project layout
src/ocr_sprint/
  api/        # FastAPI routes + error handlers
  schemas/    # Pydantic v2 models (request/response, extraction, personnel)
  pipeline/   # ingest → document_detect → preprocess → ocr → extract → validate → score
  extract/    # regex_rules.py (Phase 1) → llm.py (Phase 5)
  data/       # master data (Polri ranks, etc.)
  utils/      # logging, helpers
  config.py   # pydantic-settings
  main.py     # app factory
tests/unit/   # ~70 unit tests, no PaddleOCR dependency
docs/         # architecture & decision records
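The stage order listed for pipeline/ applies document detection only to image uploads; PDF renders skip it. A minimal sketch of that selection logic is shown here, assuming a hypothetical `SourceKind` enum and `stages_for` helper; the real orchestrator may be organised differently. The `detect_document` flag stands in for the `preprocess_detect_document` setting.

```python
from enum import Enum
from typing import List


class SourceKind(str, Enum):
    """Hypothetical source kinds; the names follow the IMAGE/PDF wording used here."""

    PDF = "pdf"
    IMAGE = "image"


def stages_for(source_kind: SourceKind, detect_document: bool = True) -> List[str]:
    """Return the ordered stage names for one document."""
    stages = ["ingest"]
    # Phone photos need quad detection and warping; PDF renders are already flat.
    if source_kind is SourceKind.IMAGE and detect_document:
        stages.append("document_detect")
    stages += ["preprocess", "ocr", "extract", "validate", "score"]
    return stages


image_stages = stages_for(SourceKind.IMAGE)
pdf_stages = stages_for(SourceKind.PDF)
```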
Roadmap
| Phase | Scope | Status |
|---|---|---|
| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | Done |
| 2 | OpenCV-based document detection, perspective transform, shadow removal for phone photos | Done |
| 3 | PP-Structure table extraction for personnel rows | Planned |
| 4 | Async pipeline (Celery + Redis), Postgres + MinIO, auth, observability | Planned |
| 5 | LLM hybrid extraction (Ollama + structured output) | Planned |
| 6 | HITL review endpoints + audit trail | Planned |
License
Proprietary — internal use only.