Phase 2: document detection + perspective correction + shadow removal

Adds OpenCV-based phone-photo handling that runs before the standard
preprocessing pipeline for IMAGE source kinds (PDF renders are flat by
construction and skip this stage).

Pipeline additions in src/ocr_sprint/pipeline/document_detect.py:
- _find_document_quad: Canny + dilate + contour search, picks the largest
  convex 4-point polygon above a configurable area threshold; fails
  gracefully and returns None when no usable quad is found.
- _four_point_warp: orders corners (TL/TR/BR/BL via sum/diff trick) and
  runs cv2.getPerspectiveTransform + warpPerspective.
- _remove_shadow: per-channel background-division (dilate + median blur +
  255 - absdiff + normalize) for uneven phone-shot lighting.
- detect_and_correct: top-level entrypoint with graceful fallback to the
  original image when detection fails.

Wired into the synchronous orchestrator: only enabled for IMAGE sources,
skipped for PDF.

New settings:
- preprocess_detect_document (default: true)
- preprocess_remove_shadow (default: true)
- preprocess_min_quad_area_fraction (default: 0.20)

Tests: 9 new unit tests covering corner ordering, quad detection on
synthetic skewed documents, perspective warp output sanity, shadow removal
shape preservation, full-pipeline behavior, and graceful fallback when
detection fails. 70 tests total, all green.

ML-based dewarping (DewarpNet) and DocTR detector are deferred to a future
phase per the roadmap; the existing API is structured so they can be added
as alternative backends behind DocumentDetectConfig.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
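
The sum/diff corner-ordering trick mentioned for _four_point_warp can be illustrated with a minimal pure-Python sketch. This is not the code from document_detect.py: order_corners and the sample coordinates are hypothetical, and the sketch assumes image coordinates where y grows downward. The top-left corner has the smallest x + y, the bottom-right the largest, the top-right the smallest y - x, and the bottom-left the largest y - x.

```python
# Sketch of the sum/diff corner-ordering trick (pure Python, no OpenCV).
# order_corners is a hypothetical name, not the project's helper.
Point = tuple[float, float]


def order_corners(points: list[Point]) -> list[Point]:
    """Return four corners as [TL, TR, BR, BL] (image coords, y grows downward)."""
    if len(points) != 4:
        raise ValueError("expected exactly four points")
    # x + y is smallest at the top-left corner and largest at the bottom-right.
    top_left = min(points, key=lambda p: p[0] + p[1])
    bottom_right = max(points, key=lambda p: p[0] + p[1])
    # y - x is smallest at the top-right corner and largest at the bottom-left.
    top_right = min(points, key=lambda p: p[1] - p[0])
    bottom_left = max(points, key=lambda p: p[1] - p[0])
    return [top_left, top_right, bottom_right, bottom_left]


# Corners of a mildly skewed page, listed in arbitrary order.
skewed = [(310.0, 40.0), (20.0, 35.0), (15.0, 420.0), (330.0, 400.0)]
ordered = order_corners(skewed)
```

An ordered list like this is the kind of input that would be paired with a destination rectangle for cv2.getPerspectiveTransform. One known limit of the trick: for quads rotated close to 45 degrees, the sums or differences can tie and the same point may be picked for two corners.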
@@ -14,6 +14,7 @@ from dataclasses import dataclass
 
 from ocr_sprint.config import get_settings
 from ocr_sprint.pipeline.confidence import compute_confidence, route
+from ocr_sprint.pipeline.document_detect import DocumentDetectConfig, detect_and_correct
 from ocr_sprint.pipeline.extract.regex_rules import extract_header, find_signatory
 from ocr_sprint.pipeline.extract.validators import validate_extraction
 from ocr_sprint.pipeline.ingest import detect_source_kind, ingest
@@ -56,10 +57,18 @@ def run_pipeline(content: bytes) -> PipelineOutput:
         deskew=s.preprocess_deskew,
         adaptive_threshold=s.preprocess_adaptive_threshold,
     )
+    # Document detection only makes sense on photographed images. PDF renders
+    # are already flat by construction, so we skip the heavy quad search there.
+    detect_cfg = DocumentDetectConfig(
+        detect_document=s.preprocess_detect_document and kind == SourceKind.IMAGE,
+        remove_shadow=s.preprocess_remove_shadow and kind == SourceKind.IMAGE,
+        min_area_fraction=s.preprocess_min_quad_area_fraction,
+    )
 
     ocr_pages: list[OCRPage] = []
     for page in pages:
-        cleaned = preprocess(page.image, pre_cfg)
+        corrected = detect_and_correct(page.image, detect_cfg)
+        cleaned = preprocess(corrected, pre_cfg)
         ocr_pages.append(run_ocr(cleaned))
 
     full_text = "\n".join(p.text for p in ocr_pages)
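
The gating and fallback behavior wired in above can be sketched without OpenCV. The following is an illustrative stand-in, not the actual detect_and_correct: DetectConfig mirrors only the three fields visible in the diff, while quad_area, choose_quad, detect_or_passthrough, and the injected warp callable are hypothetical names. The sketch assumes candidate quads have already been found, and it omits the convexity check and shadow removal.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Point = tuple[float, float]
Quad = Sequence[Point]


@dataclass(frozen=True)
class DetectConfig:
    """Hypothetical stand-in for DocumentDetectConfig (fields taken from the diff)."""
    detect_document: bool = True
    remove_shadow: bool = True  # unused here; shadow removal is out of scope
    min_area_fraction: float = 0.20


def quad_area(quad: Quad) -> float:
    """Area of a simple polygon via the shoelace formula."""
    n = len(quad)
    twice_area = sum(
        quad[i][0] * quad[(i + 1) % n][1] - quad[(i + 1) % n][0] * quad[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def choose_quad(candidates: Sequence[Quad], image_size: tuple[int, int],
                cfg: DetectConfig) -> Optional[Quad]:
    """Largest 4-point candidate covering at least min_area_fraction of the image."""
    width, height = image_size
    min_area = cfg.min_area_fraction * width * height
    usable = [q for q in candidates if len(q) == 4 and quad_area(q) >= min_area]
    return max(usable, key=quad_area, default=None)


def detect_or_passthrough(image, candidates: Sequence[Quad],
                          image_size: tuple[int, int], cfg: DetectConfig,
                          warp: Callable):
    """Warp to the chosen quad, or return the input unchanged on any miss."""
    if not cfg.detect_document:
        return image  # e.g. PDF sources, where detection is switched off
    quad = choose_quad(candidates, image_size, cfg)
    if quad is None:
        return image  # graceful fallback: no usable quad found
    return warp(image, quad)


# Toy inputs: a 200x100 "image", one tiny quad and one large quad.
small = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
big = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
fake_warp = lambda img, quad: ("warped", quad)
result = detect_or_passthrough("img", [small, big], (200, 100), DetectConfig(), fake_warp)
```

Because every miss returns the original image, the caller can always pass the result straight to preprocess, which matches how the loop above chains the two steps.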