Phase 2: document detection + perspective correction + shadow removal

Adds OpenCV-based phone-photo handling that runs before the standard
preprocessing pipeline for IMAGE source kinds (PDF renders are flat by
construction and skip this stage).

Pipeline additions in src/ocr_sprint/pipeline/document_detect.py:
- _find_document_quad: Canny + dilate + contour search, picks the
  largest convex 4-point polygon above a configurable area threshold;
  fails gracefully and returns None when no usable quad is found.
- _four_point_warp: orders corners (TL/TR/BR/BL via sum/diff trick)
  and runs cv2.getPerspectiveTransform + warpPerspective.
- _remove_shadow: per-channel background-division (dilate + median
  blur + 255 - absdiff + normalize) for uneven phone-shot lighting.
- detect_and_correct: top-level entrypoint with graceful fallback
  to the original image when detection fails.

Wired into the synchronous orchestrator: only enabled for IMAGE
sources, skipped for PDF. New settings:
- preprocess_detect_document (default: true)
- preprocess_remove_shadow (default: true)
- preprocess_min_quad_area_fraction (default: 0.20)

Tests: 9 new unit tests covering corner ordering, quad detection on
synthetic skewed documents, perspective warp output sanity, shadow
removal shape preservation, full-pipeline behavior, and graceful
fallback when detection fails. 70 tests total, all green.

ML-based dewarping (DewarpNet) and DocTR detector are deferred to a
future phase per the roadmap; the existing API is structured so they
can be added as alternative backends behind DocumentDetectConfig.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
This commit is contained in:
Devin AI
2026-04-25 15:06:58 +00:00
parent ca0c0a0428
commit d0e1835cc1
6 changed files with 357 additions and 5 deletions

View File

@@ -21,6 +21,11 @@ PREPROCESS_DENOISE=true
PREPROCESS_DESKEW=true
PREPROCESS_ADAPTIVE_THRESHOLD=false # turn on for low-quality phone photos
# ==== Document detection (Phase 2, IMAGE sources only) ====
PREPROCESS_DETECT_DOCUMENT=true
PREPROCESS_REMOVE_SHADOW=true
PREPROCESS_MIN_QUAD_AREA_FRACTION=0.20
# ==== Confidence / routing (Phase 5) ====
CONFIDENCE_AUTO_APPROVE=0.95
CONFIDENCE_NEEDS_REVIEW=0.85

View File

@@ -2,7 +2,7 @@
OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah) documents. Built around **FastAPI + PaddleOCR + hybrid extraction (regex → LLM lokal → validation)** with **on-premise** deployment as a hard requirement.
> **Status:** Phase 1 MVP — synchronous PDF/image OCR with regex header extraction, validation, and confidence scoring. Phase 26 (document detection, table extraction, async pipeline, LLM extraction, HITL) are tracked in [`docs/architecture.md`](docs/architecture.md).
> **Status:** Phase 1+2 — synchronous PDF/image OCR with regex header extraction, validation, confidence scoring, and **document detection + perspective correction + shadow removal** for phone photos. Phase 36 (table extraction, async pipeline, LLM extraction, HITL) are tracked in [`docs/architecture.md`](docs/architecture.md).
## Why this stack
@@ -97,7 +97,7 @@ Pre-commit hooks run ruff on every commit. Install once with `pre-commit install
src/ocr_sprint/
api/ # FastAPI routes + error handlers
schemas/ # Pydantic v2 models (request/response, extraction, personnel)
pipeline/ # ingest → preprocess → ocr → extract → validate → score
pipeline/ # ingest → document_detect → preprocess → ocr → extract → validate → score
extract/ # regex_rules.py (Phase 1) → llm.py (Phase 5)
data/ # master data (Polri ranks, etc.)
utils/ # logging, helpers
@@ -111,8 +111,8 @@ docs/ # architecture & decision records
| Phase | Scope | Status |
|---|---|---|
| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | **In progress** |
| 2 | DocTR document detection + dewarping for phone photos | Planned |
| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | **Done** |
| 2 | OpenCV-based document detection, perspective transform, shadow removal for phone photos | **Done** |
| 3 | PP-Structure table extraction for personnel rows | Planned |
| 4 | Async pipeline (Celery + Redis), Postgres + MinIO, auth, observability | Planned |
| 5 | LLM hybrid extraction (Ollama + structured output) | Planned |

View File

@@ -42,6 +42,11 @@ class Settings(BaseSettings):
preprocess_deskew: bool = True
preprocess_adaptive_threshold: bool = False
# Document detection (Phase 2) — applied to IMAGE sources only
preprocess_detect_document: bool = True
preprocess_remove_shadow: bool = True
preprocess_min_quad_area_fraction: float = Field(0.20, ge=0.0, le=1.0)
# Confidence thresholds (Phase 5 routing)
confidence_auto_approve: float = Field(0.95, ge=0.0, le=1.0)
confidence_needs_review: float = Field(0.85, ge=0.0, le=1.0)

View File

@@ -0,0 +1,205 @@
"""Phase 2 — document detection + perspective correction + shadow removal.
Targets phone photos of surat sprint where the page is shot at an angle, with
uneven lighting, and not perfectly centered. The pipeline below uses pure
OpenCV (no ML model dependency) to:
1. detect the four corners of the document inside the image,
2. apply a perspective transform to obtain a flat top-down rectangle,
3. remove shadows via morphological background division.
Failure mode is **graceful**: if no usable document quadrilateral is found,
we return the original image untouched and log a warning. The downstream
`preprocess` stage will still run.
Future work (tracked in docs/architecture.md):
- swap the contour heuristic for a DocTR / MobileSAM model when accuracy
on real Polri photos isn't enough,
- add full ML-based dewarping (DewarpNet) for curved/folded pages.
"""
from __future__ import annotations
from dataclasses import dataclass
import cv2
import numpy as np
from ocr_sprint.pipeline.ingest import NDArrayU8
from ocr_sprint.utils.logging import get_logger
_logger = get_logger(__name__)
# Internal working size for contour detection. Smaller = faster + less noise.
_DETECT_HEIGHT = 500
# Reject any candidate quad whose area is below this fraction of the image —
# they're almost certainly not the document but logos, stamps, or text blocks.
_MIN_AREA_FRACTION = 0.20
# Polygon-approximation epsilon (% of perimeter). 2% works well for paper.
_POLY_EPSILON = 0.02
@dataclass(frozen=True)
class DocumentDetectConfig:
"""Tunable knobs for document detection."""
detect_document: bool = True
remove_shadow: bool = True
min_area_fraction: float = _MIN_AREA_FRACTION
# ---------- public surface ----------
def detect_and_correct(
img: NDArrayU8,
cfg: DocumentDetectConfig | None = None,
) -> NDArrayU8:
"""Run detection + perspective correction + shadow removal in one shot.
Returns either the warped + cleaned image, or the original if detection
failed. Always returns a BGR uint8 ndarray.
"""
if cfg is None:
cfg = DocumentDetectConfig()
out = img
if cfg.detect_document:
quad = _find_document_quad(img, min_area_fraction=cfg.min_area_fraction)
if quad is not None:
out = _four_point_warp(img, quad)
_logger.info("document_detect.warped", shape=out.shape[:2])
else:
_logger.info("document_detect.no_quad_found", shape=img.shape[:2])
if cfg.remove_shadow:
out = _remove_shadow(out)
return out
# ---------- corner detection ----------
def _find_document_quad(
img: NDArrayU8,
min_area_fraction: float = _MIN_AREA_FRACTION,
) -> NDArrayU8 | None:
"""Locate the document quadrilateral; return 4x2 corners or None."""
h_orig, w_orig = img.shape[:2]
if h_orig < 50 or w_orig < 50:
return None
scale = _DETECT_HEIGHT / float(h_orig)
if scale >= 1.0:
small = img
scale = 1.0
else:
small = cv2.resize(
img,
(round(w_orig * scale), _DETECT_HEIGHT),
interpolation=cv2.INTER_AREA,
)
gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
gray = cv2.bilateralFilter(gray, 9, 75, 75)
edges = cv2.Canny(gray, 60, 180)
# close small gaps so contours are continuous
edges = cv2.dilate(edges, np.ones((3, 3), np.uint8), iterations=2)
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if not contours:
return None
image_area = float(small.shape[0] * small.shape[1])
candidates = sorted(contours, key=cv2.contourArea, reverse=True)[:5]
for contour in candidates:
perimeter = cv2.arcLength(contour, True)
approx = cv2.approxPolyDP(contour, _POLY_EPSILON * perimeter, True)
if len(approx) != 4:
continue
if not cv2.isContourConvex(approx):
continue
area = cv2.contourArea(approx)
if area < min_area_fraction * image_area:
return None # remaining contours are smaller — give up early
quad = approx.reshape(4, 2).astype(np.float32) / scale
return quad
return None
# ---------- perspective transform ----------
def _order_corners(pts: NDArrayU8) -> NDArrayU8:
"""Return points sorted as (top-left, top-right, bottom-right, bottom-left)."""
rect = np.zeros((4, 2), dtype=np.float32)
s = pts.sum(axis=1)
diff = np.diff(pts, axis=1)
rect[0] = pts[np.argmin(s)] # smallest sum → top-left
rect[2] = pts[np.argmax(s)] # largest sum → bottom-right
rect[1] = pts[np.argmin(diff)] # smallest diff → top-right
rect[3] = pts[np.argmax(diff)] # largest diff → bottom-left
return rect
def _four_point_warp(img: NDArrayU8, quad: NDArrayU8) -> NDArrayU8:
"""Warp the image to a top-down view of the detected quadrilateral."""
rect = _order_corners(quad)
tl, tr, br, bl = rect
width_top = float(np.linalg.norm(tr - tl))
width_bot = float(np.linalg.norm(br - bl))
height_left = float(np.linalg.norm(bl - tl))
height_right = float(np.linalg.norm(br - tr))
max_width = round(max(width_top, width_bot))
max_height = round(max(height_left, height_right))
if max_width <= 1 or max_height <= 1:
return img
dst = np.array(
[
[0, 0],
[max_width - 1, 0],
[max_width - 1, max_height - 1],
[0, max_height - 1],
],
dtype=np.float32,
)
matrix = cv2.getPerspectiveTransform(rect, dst)
return cv2.warpPerspective(img, matrix, (max_width, max_height))
# ---------- shadow removal ----------
def _remove_shadow(img: NDArrayU8) -> NDArrayU8:
"""Background-division shadow removal applied per channel.
Idea: dilate + median-blur the channel to estimate the local background;
subtract from 255 minus the absolute diff to flatten lighting; normalize
back to 0-255. Cheap and surprisingly effective on phone shots.
"""
planes = cv2.split(img)
cleaned: list[NDArrayU8] = []
kernel = np.ones((7, 7), np.uint8)
for plane in planes:
dilated = cv2.dilate(plane, kernel)
bg = cv2.medianBlur(dilated, 21)
diff: NDArrayU8 = (255 - cv2.absdiff(plane, bg)).astype(np.uint8)
# cv2 stubs reject None dst even though the runtime accepts it.
norm = cv2.normalize( # type: ignore[call-overload]
diff,
None,
alpha=0,
beta=255,
norm_type=cv2.NORM_MINMAX,
dtype=cv2.CV_8U,
)
cleaned.append(norm)
return cv2.merge(cleaned)

View File

@@ -14,6 +14,7 @@ from dataclasses import dataclass
from ocr_sprint.config import get_settings
from ocr_sprint.pipeline.confidence import compute_confidence, route
from ocr_sprint.pipeline.document_detect import DocumentDetectConfig, detect_and_correct
from ocr_sprint.pipeline.extract.regex_rules import extract_header, find_signatory
from ocr_sprint.pipeline.extract.validators import validate_extraction
from ocr_sprint.pipeline.ingest import detect_source_kind, ingest
@@ -56,10 +57,18 @@ def run_pipeline(content: bytes) -> PipelineOutput:
deskew=s.preprocess_deskew,
adaptive_threshold=s.preprocess_adaptive_threshold,
)
# Document detection only makes sense on photographed images. PDF renders
# are already flat by construction, so we skip the heavy quad search there.
detect_cfg = DocumentDetectConfig(
detect_document=s.preprocess_detect_document and kind == SourceKind.IMAGE,
remove_shadow=s.preprocess_remove_shadow and kind == SourceKind.IMAGE,
min_area_fraction=s.preprocess_min_quad_area_fraction,
)
ocr_pages: list[OCRPage] = []
for page in pages:
cleaned = preprocess(page.image, pre_cfg)
corrected = detect_and_correct(page.image, detect_cfg)
cleaned = preprocess(corrected, pre_cfg)
ocr_pages.append(run_ocr(cleaned))
full_text = "\n".join(p.text for p in ocr_pages)

View File

@@ -0,0 +1,128 @@
"""Tests for Phase 2 document detection + perspective correction."""
from __future__ import annotations
import cv2
import numpy as np
import pytest
from ocr_sprint.pipeline.document_detect import (
DocumentDetectConfig,
_find_document_quad,
_four_point_warp,
_order_corners,
_remove_shadow,
detect_and_correct,
)
def _synthetic_skewed_doc(
canvas_size: tuple[int, int] = (900, 700),
page_size: tuple[int, int] = (500, 380),
skew: tuple[int, int, int, int] = (40, -20, -30, 25),
) -> np.ndarray:
"""Create a dark canvas containing a white quadrilateral 'document'.
Returns a BGR uint8 image. Useful for testing detection without bringing
in real photos.
"""
h, w = canvas_size
img = np.full((h, w, 3), 30, dtype=np.uint8) # dark grey background
page_w, page_h = page_size
cx, cy = w // 2, h // 2
# corners offset by `skew` to simulate perspective distortion
sx_tl, sx_tr, sx_br, sx_bl = skew
pts = np.array(
[
[cx - page_w // 2 + sx_tl, cy - page_h // 2 + sx_tl], # TL
[cx + page_w // 2 + sx_tr, cy - page_h // 2 - sx_tr], # TR
[cx + page_w // 2 + sx_br, cy + page_h // 2 + sx_br], # BR
[cx - page_w // 2 + sx_bl, cy + page_h // 2 - sx_bl], # BL
],
dtype=np.int32,
)
cv2.fillPoly(img, [pts], color=(245, 245, 245))
# add a fake text line so the page isn't a flat white blob
cv2.line(img, (cx - 100, cy), (cx + 100, cy), color=(20, 20, 20), thickness=4)
return img
class TestOrderCorners:
def test_canonical_order(self) -> None:
# input pts unordered: BR, TL, TR, BL
pts = np.array([[100, 100], [10, 10], [110, 8], [9, 95]], dtype=np.float32)
ordered = _order_corners(pts)
# TL has the smallest sum
assert tuple(ordered[0]) == pytest.approx((10.0, 10.0))
# BR has the largest sum
assert tuple(ordered[2]) == pytest.approx((100.0, 100.0))
class TestFindDocumentQuad:
def test_finds_quad_in_synthetic_image(self) -> None:
img = _synthetic_skewed_doc()
quad = _find_document_quad(img, min_area_fraction=0.10)
assert quad is not None
assert quad.shape == (4, 2)
def test_returns_none_for_flat_image(self) -> None:
flat = np.full((400, 400, 3), 200, dtype=np.uint8)
assert _find_document_quad(flat) is None
def test_returns_none_for_tiny_image(self) -> None:
tiny = np.full((20, 20, 3), 200, dtype=np.uint8)
assert _find_document_quad(tiny) is None
class TestFourPointWarp:
def test_warp_produces_axis_aligned_rectangle(self) -> None:
img = _synthetic_skewed_doc()
quad = _find_document_quad(img, min_area_fraction=0.10)
assert quad is not None
warped = _four_point_warp(img, quad)
# warped output should be smaller than the canvas (it's only the page)
assert warped.shape[0] < img.shape[0]
assert warped.shape[1] < img.shape[1]
# majority of the warped image should be near-white (the page)
gray = cv2.cvtColor(warped, cv2.COLOR_BGR2GRAY)
bright_fraction = float((gray > 200).mean())
assert bright_fraction > 0.7
class TestRemoveShadow:
def test_output_shape_matches_input(self) -> None:
img = np.random.randint(0, 256, (200, 300, 3), dtype=np.uint8)
out = _remove_shadow(img)
assert out.shape == img.shape
assert out.dtype == np.uint8
class TestDetectAndCorrect:
def test_full_pipeline_on_skewed_doc(self) -> None:
img = _synthetic_skewed_doc()
out = detect_and_correct(
img,
DocumentDetectConfig(detect_document=True, remove_shadow=True, min_area_fraction=0.1),
)
# the output is the cropped+flattened page, so it must be smaller
assert out.shape[0] < img.shape[0]
def test_disabled_returns_input_unchanged(self) -> None:
img = _synthetic_skewed_doc()
out = detect_and_correct(
img,
DocumentDetectConfig(detect_document=False, remove_shadow=False),
)
assert out.shape == img.shape
# disabled both flags → bytes-identical
assert np.array_equal(out, img)
def test_failed_detection_falls_back_to_input(self) -> None:
# flat grey image with no contour-detectable document
flat = np.full((300, 300, 3), 128, dtype=np.uint8)
out = detect_and_correct(
flat,
DocumentDetectConfig(detect_document=True, remove_shadow=False, min_area_fraction=0.5),
)
# detection failed, so the original shape is preserved
assert out.shape == flat.shape