Phase 2: document detection + perspective correction + shadow removal
Adds OpenCV-based phone-photo handling that runs before the standard preprocessing pipeline for IMAGE source kinds (PDF renders are flat by construction and skip this stage).

Pipeline additions in src/ocr_sprint/pipeline/document_detect.py:
- _find_document_quad: Canny + dilate + contour search; picks the largest convex 4-point polygon above a configurable area threshold and returns None when no usable quad is found.
- _four_point_warp: orders corners (TL/TR/BR/BL via sum/diff trick) and runs cv2.getPerspectiveTransform + warpPerspective.
- _remove_shadow: per-channel background division (dilate + median blur + 255 - absdiff + normalize) for uneven phone-shot lighting.
- detect_and_correct: top-level entrypoint with graceful fallback to the original image when detection fails.

Wired into the synchronous orchestrator: only enabled for IMAGE sources, skipped for PDF.

New settings:
- preprocess_detect_document (default: true)
- preprocess_remove_shadow (default: true)
- preprocess_min_quad_area_fraction (default: 0.20)

Tests: 9 new unit tests covering corner ordering, quad detection on synthetic skewed documents, perspective warp output sanity, shadow removal shape preservation, full-pipeline behavior, and graceful fallback when detection fails. 70 tests total, all green.

ML-based dewarping (DewarpNet) and DocTR detector are deferred to a future phase per the roadmap; the existing API is structured so they can be added as alternative backends behind DocumentDetectConfig.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
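For readers unfamiliar with the sum/diff trick named for _four_point_warp, the following is a minimal sketch of how four corners can be ordered TL/TR/BR/BL. The function name order_corners is hypothetical and the code is not taken from document_detect.py; it only illustrates the general technique, assuming image coordinates with y growing downward.

```python
import numpy as np


def order_corners(pts) -> np.ndarray:
    """Order four (x, y) points as TL, TR, BR, BL (hypothetical helper).

    Assumes image coordinates (y grows downward) and a quad that is not
    rotated close to 45 degrees, where sums or differences can tie.
    """
    pts = np.asarray(pts, dtype=np.float32).reshape(4, 2)
    sums = pts.sum(axis=1)          # x + y: smallest at TL, largest at BR
    diffs = pts[:, 1] - pts[:, 0]   # y - x: smallest at TR, largest at BL

    ordered = np.zeros((4, 2), dtype=np.float32)
    ordered[0] = pts[np.argmin(sums)]   # top-left
    ordered[1] = pts[np.argmin(diffs)]  # top-right
    ordered[2] = pts[np.argmax(sums)]   # bottom-right
    ordered[3] = pts[np.argmax(diffs)]  # bottom-left
    return ordered


if __name__ == "__main__":
    # Corners of a skewed page, supplied in arbitrary order.
    quad = [[90, 10], [10, 95], [5, 5], [100, 100]]
    print(order_corners(quad))
```

An array ordered this way can serve as the source points for cv2.getPerspectiveTransform, paired with destination points listed in the same TL/TR/BR/BL order.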
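The three new settings and the IMAGE-only gating described in the commit message could interact roughly as sketched below. This is an illustration under stated assumptions, not the project's code: SourceKind, DocumentDetectSettings, planned_steps, and quad_is_large_enough are hypothetical names, the area threshold is assumed to be a fraction of the full image area, and shadow removal is assumed to be toggled independently of detection.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class SourceKind(Enum):
    """Hypothetical stand-in for the pipeline's source kinds."""
    IMAGE = "image"
    PDF = "pdf"


@dataclass(frozen=True)
class DocumentDetectSettings:
    """Setting names and defaults as listed in the commit message."""
    preprocess_detect_document: bool = True
    preprocess_remove_shadow: bool = True
    preprocess_min_quad_area_fraction: float = 0.20


def planned_steps(kind: SourceKind, settings: DocumentDetectSettings) -> list[str]:
    """Return the phone-photo steps that would run for this source kind."""
    if kind is not SourceKind.IMAGE:
        return []  # PDF renders are flat, so the whole stage is skipped
    steps: list[str] = []
    if settings.preprocess_detect_document:
        steps.append("detect_and_warp")
    if settings.preprocess_remove_shadow:
        steps.append("remove_shadow")
    return steps


def quad_is_large_enough(
    quad_area: float, image_area: float, settings: DocumentDetectSettings
) -> bool:
    """Assumed check: the quad must cover a minimum fraction of the image."""
    return quad_area >= settings.preprocess_min_quad_area_fraction * image_area


if __name__ == "__main__":
    defaults = DocumentDetectSettings()
    print(planned_steps(SourceKind.IMAGE, defaults))
    print(planned_steps(SourceKind.PDF, defaults))
```

Under these assumptions, a candidate quad covering a tenth of a photo would be rejected at the default 0.20 fraction, and the caller would fall back to the original image.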
@@ -2,7 +2,7 @@

 OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah) documents. Built around **FastAPI + PaddleOCR + hybrid extraction (regex → LLM lokal → validation)** with **on-premise** deployment as a hard requirement.

-> **Status:** Phase 1 MVP — synchronous PDF/image OCR with regex header extraction, validation, and confidence scoring. Phase 2–6 (document detection, table extraction, async pipeline, LLM extraction, HITL) are tracked in [`docs/architecture.md`](docs/architecture.md).
+> **Status:** Phase 1+2 — synchronous PDF/image OCR with regex header extraction, validation, confidence scoring, and **document detection + perspective correction + shadow removal** for phone photos. Phase 3–6 (table extraction, async pipeline, LLM extraction, HITL) are tracked in [`docs/architecture.md`](docs/architecture.md).

 ## Why this stack

@@ -97,7 +97,7 @@ Pre-commit hooks run ruff on every commit. Install once with `pre-commit install
 src/ocr_sprint/
   api/ # FastAPI routes + error handlers
   schemas/ # Pydantic v2 models (request/response, extraction, personnel)
-  pipeline/ # ingest → preprocess → ocr → extract → validate → score
+  pipeline/ # ingest → document_detect → preprocess → ocr → extract → validate → score
   extract/ # regex_rules.py (Phase 1) → llm.py (Phase 5)
   data/ # master data (Polri ranks, etc.)
   utils/ # logging, helpers
@@ -111,8 +111,8 @@ docs/ # architecture & decision records

 | Phase | Scope | Status |
 |---|---|---|
-| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | **In progress** |
-| 2 | DocTR document detection + dewarping for phone photos | Planned |
+| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | **Done** |
+| 2 | OpenCV-based document detection, perspective transform, shadow removal for phone photos | **Done** |
 | 3 | PP-Structure table extraction for personnel rows | Planned |
 | 4 | Async pipeline (Celery + Redis), Postgres + MinIO, auth, observability | Planned |
 | 5 | LLM hybrid extraction (Ollama + structured output) | Planned |