Fix personnel extraction + header bugs on real Polres Cimahi sprint

This fixes 4 bugs found on a real Polres Cimahi SPRIN PDF and adds two related improvements (items 5 and 6):

1. satuan_penerbit captured the generic 'KEPOLISIAN NEGARA REPUBLIK INDONESIA' letterhead line instead of the most specific issuing unit (e.g. RESOR CIMAHI / SEKTOR PADALARANG). Reworked find_satuan to scan for each level independently and return the deepest one found.

2. find_dasar_list dropped numbered items when OCR put the marker on its own line ("1.\n Undang-Undang ..."). Refactored into _collect_numbered_section, which buffers a bare-number line and uses the next non-empty line as the body. It is also reused by the new find_untuk_list, which extracts the previously empty 'untuk' bullets.

3. find_perihal returned None for documents that use 'Pertimbangan' (very common in Polres-level sprint), forcing the LLM to guess. Added a regex fallback that picks up the first line under a 'Pertimbangan' label so extraction stays deterministic.

4. Personnel rows were emitted with only nama populated when PP-Structure detected a table but the column mapper degraded. Added a text-based fallback (extract_personnel_from_text) that scans raw OCR for <rank> + <8-digit NRP> patterns. It is triggered when the PP-Structure result is empty or fewer than 30% of its rows are rank/NRP-bearing. When the fallback is used, the new PERSONNEL_TEXT_FALLBACK flag is raised so the document is surfaced for review.

5. Validation now flags rows with neither pangkat nor nrp as INCOMPLETE_PERSONNEL_ROW, so the document routes to needs_review even when the individual nrp/pangkat checks pass on empty values.

6. Added 'BRIGPOL' as a variant of BRIGADIR (seen in real scans).

Tests: 229 (was 203). The 26 new tests cover the regex fixes, the text-based personnel extractor, the low-quality detector, validator behaviour, and orchestrator wiring of the fallback path.

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
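
For reviewers, the text-based personnel fallback described above can be pictured with the rough sketch below. It is not the code in personnel_text.py: the names Row, RANKS, RANK_ALIASES, ROW_RE, rows_from_text and looks_low_quality are hypothetical, the rank table is illustrative, and the assumed line layout (optional list marker, name, rank, optional "NRP" label, 8 digits) is only one plausible shape of the OCR output. It also assumes a row counts as "rank/NRP-bearing" when either field is present.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical rank table for illustration; the project's real list may differ.
RANKS = (
    "KOMBES", "AKBP", "KOMPOL", "AKP", "IPTU", "IPDA",
    "AIPTU", "AIPDA", "BRIPKA", "BRIGADIR", "BRIGPOL", "BRIPTU", "BRIPDA",
)
# BRIGPOL is treated as a spelling variant of BRIGADIR.
RANK_ALIASES = {"BRIGPOL": "BRIGADIR"}

# Assumed layout: "<no>. <nama> <pangkat> [NRP] <8 digits> ...".
ROW_RE = re.compile(
    r"^\s*(?:\d+[.)]\s*)?"        # optional list marker such as "1." or "2)"
    r"(?P<nama>.*?)[\s,]*"        # name: everything before the rank
    r"\b(?P<pangkat>" + "|".join(RANKS) + r")\b"
    r"\W*(?:NRP\W*)?"             # optional separator and "NRP" label
    r"(?P<nrp>\d{8})\b",          # exactly 8 digits, not part of a longer run
    re.IGNORECASE,
)


@dataclass
class Row:
    nama: str
    pangkat: str | None
    nrp: str | None


def rows_from_text(text: str) -> list[Row]:
    """Scan raw OCR text line by line for rank + 8-digit NRP pairs."""
    rows: list[Row] = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m is None:
            continue
        pangkat = m["pangkat"].upper()
        rows.append(
            Row(
                nama=m["nama"].strip(),
                pangkat=RANK_ALIASES.get(pangkat, pangkat),
                nrp=m["nrp"],
            )
        )
    return rows


def looks_low_quality(rows: list[Row], min_ratio: float = 0.30) -> bool:
    """True when rows are empty or too few of them carry a rank or NRP."""
    if not rows:
        return True
    bearing = sum(1 for r in rows if r.pangkat or r.nrp)
    return bearing / len(rows) < min_ratio
```

In this sketch the detector runs on the structured rows first, and the text scan replaces them only when it returns at least one row, which mirrors the wiring in the orchestrator diff below.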
2026-04-26 05:35:42 +00:00
parent dce77e80e1
commit 58a2bf2648
11 changed files with 747 additions and 39 deletions
--- a/src/ocr_sprint/pipeline/orchestrator.py
+++ b/src/ocr_sprint/pipeline/orchestrator.py
@@ -19,7 +19,15 @@ from ocr_sprint.llm.extractor import llm_fill_header
 from ocr_sprint.pipeline.confidence import compute_confidence, route
 from ocr_sprint.pipeline.document_detect import DocumentDetectConfig, detect_and_correct
 from ocr_sprint.pipeline.extract.personnel import extract_personnel
-from ocr_sprint.pipeline.extract.regex_rules import extract_header, find_signatory
+from ocr_sprint.pipeline.extract.personnel_text import (
+    extract_personnel_from_text,
+    is_low_quality,
+)
+from ocr_sprint.pipeline.extract.regex_rules import (
+    extract_header,
+    find_signatory,
+    find_untuk_list,
+)
 from ocr_sprint.pipeline.extract.validators import validate_extraction
 from ocr_sprint.pipeline.ingest import NDArrayU8, detect_source_kind, ingest
 from ocr_sprint.pipeline.ocr import OCRPage, run_ocr
@@ -112,6 +120,7 @@ def run_pipeline(content: bytes) -> PipelineOutput:
            header = merged

    personel: list[PersonnelEntry] = []
+    table_flags: list[ReviewFlag] = []
    if s.tables_enabled and cleaned_pages:
        all_tables: list[DetectedTable] = []
        for img in cleaned_pages:
@@ -126,14 +135,33 @@ def run_pipeline(content: bytes) -> PipelineOutput:
            personel_rows=len(personel),
        )

-    initial_flags: list[ReviewFlag] = list(llm_flags)
+    # Text-based fallback: PP-Structure can succeed structurally but emit
+    # rows with only ``nama`` populated (column mapper degraded), or fail to
+    # detect the table at all. In both cases the regex fallback that scans
+    # raw OCR for rank+NRP pairs produces a much more useful result. We
+    # always run it when the structured path is empty or low-quality, and
+    # raise a review flag so the operator knows the document didn't go
+    # through the preferred path.
+    if is_low_quality(personel):
+        fallback_rows = extract_personnel_from_text(full_text)
+        if fallback_rows:
+            personel = fallback_rows
+            table_flags.append(ReviewFlag.PERSONNEL_TEXT_FALLBACK)
+            _logger.info(
+                "pipeline.personnel_text_fallback",
+                fallback_rows=len(fallback_rows),
+            )
+
+    untuk_items = find_untuk_list(full_text)
+
+    initial_flags: list[ReviewFlag] = list(llm_flags) + list(table_flags)
    if mean_ocr_conf < _OCR_CONFIDENCE_FLAG_THRESHOLD:
        initial_flags.append(ReviewFlag.LOW_OCR_CONFIDENCE)

    result = ExtractionResult(
        header=header,
        personel=personel,
-        untuk=[],
+        untuk=untuk_items,
        ttd=ttd,
        raw_text=full_text,
        confidence=mean_ocr_conf,