Adrian Kuman Firmansah 812ea7e030 Merge pull request #1 from Adriankf59/devin/1777129396-phase-2-document-detection
Adds OpenCV-based phone-photo handling that runs before the standard
preprocessing pipeline for IMAGE source kinds (PDF renders are flat by
construction and skip this stage).

Pipeline additions in src/ocr_sprint/pipeline/document_detect.py:
- _find_document_quad: Canny + dilate + contour search, picks the
  largest convex 4-point polygon above a configurable area threshold;
  fails gracefully and returns None when no usable quad is found.
- _four_point_warp: orders corners (TL/TR/BR/BL via sum/diff trick)
  and runs cv2.getPerspectiveTransform + warpPerspective.
- _remove_shadow: per-channel background-division (dilate + median
  blur + 255 - absdiff + normalize) for uneven phone-shot lighting.
- detect_and_correct: top-level entrypoint with graceful fallback
  to the original image when detection fails.
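
The corner-ordering step used by _four_point_warp can be illustrated without OpenCV. The following is a minimal sketch of the sum/diff trick, assuming image coordinates with y growing downward. The helper name order_corners is hypothetical and is not taken from the repository.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def order_corners(points: List[Point]) -> List[Point]:
    """Return the four corners ordered TL, TR, BR, BL.

    Hypothetical helper illustrating the sum/diff trick; it is not the
    repository's implementation.
    """
    if len(points) != 4:
        raise ValueError("expected exactly four points")
    # x + y is smallest at the top-left corner and largest at the bottom-right.
    top_left = min(points, key=lambda p: p[0] + p[1])
    bottom_right = max(points, key=lambda p: p[0] + p[1])
    # y - x is smallest at the top-right corner and largest at the bottom-left.
    top_right = min(points, key=lambda p: p[1] - p[0])
    bottom_left = max(points, key=lambda p: p[1] - p[0])
    return [top_left, top_right, bottom_right, bottom_left]


# A skewed page outline given in arbitrary order.
quad = [(410.0, 60.0), (30.0, 520.0), (50.0, 40.0), (430.0, 560.0)]
ordered = order_corners(quad)
```

Once converted to a float32 array, the ordered corners can serve as the source points for cv2.getPerspectiveTransform. The trick assumes the page is not rotated close to 45 degrees, where sums and differences of neighbouring corners can tie.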

Wired into the synchronous orchestrator: only enabled for IMAGE
sources, skipped for PDF. New settings:
- preprocess_detect_document (default: true)
- preprocess_remove_shadow (default: true)
- preprocess_min_quad_area_fraction (default: 0.20)
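
To show how these settings could interact, here is a hedged sketch of the gating logic. The setting names and defaults come from the list above. The DetectSettings class, the helper functions, and the "IMAGE"/"PDF" string values are assumptions made for illustration and may differ from the real config and orchestrator code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class DetectSettings:
    """Mirrors the three settings listed above; the class name is hypothetical."""

    preprocess_detect_document: bool = True
    preprocess_remove_shadow: bool = True
    preprocess_min_quad_area_fraction: float = 0.20


def quad_area(quad: Sequence[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for i, (x1, y1) in enumerate(quad):
        x2, y2 = quad[(i + 1) % len(quad)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def should_detect(source_kind: str, settings: DetectSettings) -> bool:
    """Detection runs only for IMAGE sources and only when enabled."""
    # The "IMAGE" string is an assumed representation of the source kind.
    return source_kind == "IMAGE" and settings.preprocess_detect_document


def quad_is_usable(
    quad: Sequence[Point], image_w: int, image_h: int, settings: DetectSettings
) -> bool:
    """Reject quads covering less than the configured fraction of the image."""
    fraction = quad_area(quad) / float(image_w * image_h)
    return fraction >= settings.preprocess_min_quad_area_fraction


settings = DetectSettings()
big_quad = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
small_quad = [(0.0, 0.0), (50.0, 0.0), (50.0, 50.0), (0.0, 50.0)]
```

With the default threshold of 0.20, big_quad covers 25% of a 200x200 image and is accepted, while small_quad covers about 6% and would trigger the fallback to the original image.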

Tests: 9 new unit tests covering corner ordering, quad detection on
synthetic skewed documents, perspective warp output sanity, shadow
removal shape preservation, full-pipeline behavior, and graceful
fallback when detection fails. 70 tests total, all green.

ML-based dewarping (DewarpNet) and DocTR detector are deferred to a
future phase per the roadmap; the existing API is structured so they
can be added as alternative backends behind DocumentDetectConfig.

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
2026-04-25 22:08:36 +07:00

OCR Sprint Service

OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah, i.e. written order) documents. Built around FastAPI + PaddleOCR + hybrid extraction (regex → local LLM → validation) with on-premise deployment as a hard requirement.

Status: Phase 1+2 (synchronous PDF/image OCR with regex header extraction, validation, confidence scoring, and document detection + perspective correction + shadow removal for phone photos). Phases 3–6 (table extraction, async pipeline, LLM extraction, HITL) are tracked in docs/architecture.md.

Why this stack

  • PaddleOCR is the strongest open-source OCR for mixed-language documents and runs fully on-prem (essential for police data).
  • PP-Structure (Phase 3) handles personnel tables natively.
  • Regex-first, LLM-fallback extraction keeps deterministic fields fast and predictable while letting an LLM handle format drift across Polri units.
  • CPU-friendly defaults: a small (1.5B–4B parameter) local LLM via Ollama is the recommended default; the architecture is also GPU-ready.
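
As an illustration of the regex-first, LLM-fallback idea, here is a minimal sketch for a single header field. The pattern is an assumption derived from the sample number "Sprin/123/IV/2025/Reskrim", and both extract_nomor_sprint and the llm_fallback callable are hypothetical names, not the contents of regex_rules.py or llm.py.

```python
import re
from typing import Callable, Optional

# Assumed pattern for numbers such as "Sprin/123/IV/2025/Reskrim":
# prefix / running number / Roman-numeral month / year / unit code.
NOMOR_SPRINT_RE = re.compile(r"\bSprin/\d+/[IVX]+/\d{4}/[A-Za-z]\w*", re.IGNORECASE)


def extract_nomor_sprint(
    text: str,
    llm_fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Try the deterministic regex first; call the fallback only on a miss."""
    match = NOMOR_SPRINT_RE.search(text)
    if match:
        return match.group(0)
    if llm_fallback is not None:
        # Stand-in for a local LLM call; any callable taking the OCR text works.
        return llm_fallback(text)
    return None


ocr_text = "SURAT PERINTAH\nNomor: Sprin/123/IV/2025/Reskrim\nPertimbangan: ..."
nomor = extract_nomor_sprint(ocr_text)
```

Keeping the fallback as an injected callable means the regex path can be unit-tested without any model running, which matches the CPU-friendly, deterministic-first goal described above.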

See docs/architecture.md for the full architecture, accuracy expectations, and roadmap.

Quickstart

Prerequisites

  • Python 3.10–3.12
  • ~3 GB free disk for PaddleOCR model downloads on first run
  • Linux/macOS recommended (Windows works but PaddleOCR install can be finicky)

Install (local dev)

git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service

python -m venv .venv && source .venv/bin/activate
make install         # installs runtime + dev deps + pre-commit
cp .env.example .env # edit if you need GPU / different storage path

Run the API

make dev
# → http://localhost:8000/docs

Try it out

curl -F "file=@samples/pdf/example.pdf" http://localhost:8000/api/v1/documents | jq

Expected response (truncated):

{
  "job_id": "8f2a...",
  "status": "completed",
  "confidence": 0.93,
  "data": {
    "header": {
      "nomor_sprint": "Sprin/123/IV/2025/Reskrim",
      "tanggal": "2025-04-21",
      "satuan_penerbit": "KEPOLISIAN RESOR BANDUNG",
      "perihal": "Pelaksanaan penyelidikan kasus pencurian",
      "dasar": ["Undang-Undang Nomor 2 Tahun 2002 ...", "..."]
    },
    "personel": [],
    "ttd": { "nrp": "12345678" }
  },
  "review_flags": []
}

Note: Phases 1 and 2 do not yet populate the personel[] table; that requires PP-Structure (Phase 3). Header fields, signatory NRP, confidence, and HITL routing are fully wired.
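
A client might consume this response roughly as follows. This is a sketch only: the field names come from the sample response above, while the needs_human_review helper and the 0.85 threshold are assumptions for illustration, not values defined by the service.

```python
import json
from typing import Any, Dict

REVIEW_THRESHOLD = 0.85  # assumed cut-off, not a value from the service


def needs_human_review(
    response: Dict[str, Any], threshold: float = REVIEW_THRESHOLD
) -> bool:
    """Flag a result for review if it failed, was flagged, or scored low."""
    if response.get("status") != "completed":
        return True
    if response.get("review_flags"):
        return True
    return float(response.get("confidence", 0.0)) < threshold


# Trimmed copy of the sample response shown above.
sample = json.loads(
    """
    {
      "job_id": "8f2a...",
      "status": "completed",
      "confidence": 0.93,
      "data": {
        "header": {"nomor_sprint": "Sprin/123/IV/2025/Reskrim", "tanggal": "2025-04-21"},
        "personel": [],
        "ttd": {"nrp": "12345678"}
      },
      "review_flags": []
    }
    """
)

header = sample["data"]["header"]
review = needs_human_review(sample)
```

Because personel[] stays empty until Phase 3, a client should treat an empty list as "not extracted" and not as "no personnel listed".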

Docker

docker compose build
docker compose up -d
docker compose logs -f api

The first request will trigger PaddleOCR to download its detection/recognition/cls models (~200 MB) into the paddle-models volume.

Development

make fmt        # format with ruff
make lint       # lint
make typecheck  # mypy strict mode
make test       # pytest
make test-cov   # pytest + coverage

Pre-commit hooks run ruff on every commit. Install once with pre-commit install (already done by make install).

Project layout

src/ocr_sprint/
  api/          # FastAPI routes + error handlers
  schemas/      # Pydantic v2 models (request/response, extraction, personnel)
  pipeline/     # ingest → document_detect → preprocess → ocr → extract → validate → score
    extract/    # regex_rules.py (Phase 1) → llm.py (Phase 5)
  data/         # master data (Polri ranks, etc.)
  utils/        # logging, helpers
  config.py     # pydantic-settings
  main.py       # app factory
tests/unit/     # ~70 unit tests, no PaddleOCR dependency
docs/           # architecture & decision records

Roadmap

| Phase | Scope | Status |
|-------|-------|--------|
| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | Done |
| 2 | OpenCV-based document detection, perspective transform, shadow removal for phone photos | Done |
| 3 | PP-Structure table extraction for personnel rows | Planned |
| 4 | Async pipeline (Celery + Redis), Postgres + MinIO, auth, observability | Planned |
| 5 | LLM hybrid extraction (Ollama + structured output) | Planned |
| 6 | HITL review endpoints + audit trail | Planned |

License

Proprietary — internal use only.
