Fix personnel extraction + header bugs on real Polres Cimahi sprint
This fixes 4 extraction bugs found on a real Polres Cimahi SPRIN PDF (items 1-4) and adds two related hardening changes (items 5-6):
1. satuan_penerbit captured the generic 'KEPOLISIAN NEGARA REPUBLIK
INDONESIA' letterhead line instead of the most-specific issuing unit
(e.g. RESOR CIMAHI / SEKTOR PADALARANG). Reworked find_satuan to
scan for each level independently and return the deepest available.
2. find_dasar_list dropped numbered items when OCR put the marker on
its own line ("1.\n Undang-Undang ..."). Refactored into
_collect_numbered_section that buffers a bare-number line and uses
the next non-empty line as the body. Also reused for the new
find_untuk_list which extracts the previously-empty 'untuk' bullets.
3. find_perihal returned None for documents that use 'Pertimbangan'
(very common in Polres-level sprint), forcing the LLM to guess.
Added a regex fallback that picks up the first line under a
'Pertimbangan' label so we keep extraction deterministic.
4. Personnel rows were emitted with only nama populated when
PP-Structure detected a table but the column mapper degraded.
Added a text-based fallback (extract_personnel_from_text) that
scans raw OCR for <rank> + <8-digit NRP> patterns. Triggered when
the PP-Structure result has fewer than 30% rank/NRP-bearing rows.
When the fallback is used, the new PERSONNEL_TEXT_FALLBACK flag is raised so an operator reviews the document.
5. Validation now flags rows with neither pangkat nor nrp as
INCOMPLETE_PERSONNEL_ROW, so the document routes to needs_review
even when individual nrp/pangkat checks pass on empty values.
6. Added 'BRIGPOL' as a variant of BRIGADIR (seen in real scans).
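The bare-marker handling described in item 2 can be sketched roughly as below. This is a minimal illustration, not the code in this commit: the function name `collect_numbered_items` and both regexes are assumptions, and the real `_collect_numbered_section` may differ in signature and in how it detects the end of a section.

```python
import re

# Hypothetical marker patterns; the regexes used in the repository may differ.
_BARE_MARKER = re.compile(r"^\s*\d{1,2}[.)]\s*$")        # "1." alone on a line
_INLINE_ITEM = re.compile(r"^\s*\d{1,2}[.)]\s+(\S.*)$")  # "1. body text"


def collect_numbered_items(lines: list[str]) -> list[str]:
    """Collect numbered items, tolerating a marker that OCR split from its body."""
    items: list[str] = []
    pending_marker = False  # True after a bare "1." line whose body is still missing
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines never consume a pending marker
        inline = _INLINE_ITEM.match(line)
        if inline:
            # Normal case: marker and body on the same line.
            items.append(inline.group(1).strip())
            pending_marker = False
        elif _BARE_MARKER.match(line):
            # Marker on its own line: remember it and wait for the body.
            pending_marker = True
        elif pending_marker:
            # First non-empty line after a bare marker becomes the item body.
            items.append(line)
            pending_marker = False
        elif items:
            # Anything else is treated as a wrapped continuation of the last item.
            items[-1] += " " + line
    return items
```

For example, `collect_numbered_items(["1.", " Undang-Undang Nomor 2 Tahun 2002", "2. Rencana kerja"])` keeps both items instead of dropping the first one.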
Tests: 229 (was 203) — 26 new tests covering the regex fixes,
text-based personnel extractor, low-quality detector, validator
behaviour, and orchestrator wiring of the fallback path.
Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
@@ -19,7 +19,15 @@ from ocr_sprint.llm.extractor import llm_fill_header
 from ocr_sprint.pipeline.confidence import compute_confidence, route
 from ocr_sprint.pipeline.document_detect import DocumentDetectConfig, detect_and_correct
 from ocr_sprint.pipeline.extract.personnel import extract_personnel
-from ocr_sprint.pipeline.extract.regex_rules import extract_header, find_signatory
+from ocr_sprint.pipeline.extract.personnel_text import (
+    extract_personnel_from_text,
+    is_low_quality,
+)
+from ocr_sprint.pipeline.extract.regex_rules import (
+    extract_header,
+    find_signatory,
+    find_untuk_list,
+)
 from ocr_sprint.pipeline.extract.validators import validate_extraction
 from ocr_sprint.pipeline.ingest import NDArrayU8, detect_source_kind, ingest
 from ocr_sprint.pipeline.ocr import OCRPage, run_ocr
@@ -112,6 +120,7 @@ def run_pipeline(content: bytes) -> PipelineOutput:
     header = merged
 
     personel: list[PersonnelEntry] = []
+    table_flags: list[ReviewFlag] = []
     if s.tables_enabled and cleaned_pages:
         all_tables: list[DetectedTable] = []
         for img in cleaned_pages:
@@ -126,14 +135,33 @@ def run_pipeline(content: bytes) -> PipelineOutput:
             personel_rows=len(personel),
         )
 
-    initial_flags: list[ReviewFlag] = list(llm_flags)
+    # Text-based fallback: PP-Structure can succeed structurally but emit
+    # rows with only ``nama`` populated (column mapper degraded), or fail to
+    # detect the table at all. In both cases the regex fallback that scans
+    # raw OCR for rank+NRP pairs produces a much more useful result. We
+    # always run it when the structured path is empty or low-quality, and
+    # raise a review flag so the operator knows the document didn't go
+    # through the preferred path.
+    if is_low_quality(personel):
+        fallback_rows = extract_personnel_from_text(full_text)
+        if fallback_rows:
+            personel = fallback_rows
+            table_flags.append(ReviewFlag.PERSONNEL_TEXT_FALLBACK)
+            _logger.info(
+                "pipeline.personnel_text_fallback",
+                fallback_rows=len(fallback_rows),
+            )
+
+    untuk_items = find_untuk_list(full_text)
+
+    initial_flags: list[ReviewFlag] = list(llm_flags) + list(table_flags)
     if mean_ocr_conf < _OCR_CONFIDENCE_FLAG_THRESHOLD:
         initial_flags.append(ReviewFlag.LOW_OCR_CONFIDENCE)
 
     result = ExtractionResult(
         header=header,
         personel=personel,
-        untuk=[],
+        untuk=untuk_items,
         ttd=ttd,
         raw_text=full_text,
         confidence=mean_ocr_conf,
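The two helpers called in the fallback path are not shown in this diff. The sketch below illustrates one way they could behave, based only on the commit message (rank followed by an 8-digit NRP, 30% threshold). Everything in it is an assumption: the `Row` dataclass, the names `looks_low_quality` and `rows_from_text`, the rank subset, the regex, and the reading of "rank/NRP-bearing" as "has a rank or an NRP". The sample names and NRPs are made up.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Illustrative subset of ranks only; BRIGPOL is the variant added in this commit.
_RANKS = ("AIPTU", "AIPDA", "BRIPKA", "BRIGPOL", "BRIGADIR", "BRIPTU", "BRIPDA")
# Assumed shape: a rank, optional separators or an "NRP" label, then 8 digits.
_RANK_NRP = re.compile(r"\b(" + "|".join(_RANKS) + r")\b\W*(?:NRP\W*)?(\d{8})\b")


@dataclass
class Row:
    """Hypothetical stand-in for the project's PersonnelEntry."""
    nama: str
    pangkat: Optional[str] = None
    nrp: Optional[str] = None


def looks_low_quality(rows: list[Row], min_ratio: float = 0.30) -> bool:
    """True when the structured result is empty or mostly name-only rows."""
    if not rows:
        return True
    useful = sum(1 for r in rows if r.pangkat or r.nrp)
    return useful / len(rows) < min_ratio


def rows_from_text(text: str) -> list[Row]:
    """Scan raw OCR text line by line for rank + 8-digit NRP pairs."""
    rows: list[Row] = []
    for line in text.splitlines():
        match = _RANK_NRP.search(line)
        if match is None:
            continue
        # Assume the name precedes the rank; drop row numbers and punctuation.
        nama = line[: match.start()].strip(" .,:;-|0123456789")
        rows.append(Row(nama=nama, pangkat=match.group(1), nrp=match.group(2)))
    return rows
```

Working per line keeps a stray 8-digit number elsewhere on the page from being paired with an unrelated rank, at the cost of missing rows that OCR wrapped across two lines.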