Phase 3: PP-Structure table extraction + personnel column mapper (#2)
* Phase 3: PP-Structure table extraction + personnel column mapper

Adds the personnel-table stage of the pipeline. PaddleOCR's PP-Structure
recognizes table regions and emits HTML, which we parse into a 2D cell grid.
A separate column mapper detects the header row, classifies each column to a
canonical PersonnelEntry field via a synonym dictionary, and walks the data
rows.

Variant handling:
- Different satuan use different column orders and header phrasing.
  Supported synonyms for each canonical field are listed in
  pipeline/extract/personnel.py (Pangkat / NRP / Pangkat-NRP combo / Nama /
  Jabatan dalam Dinas / Jabatan dalam Sprint / Keterangan).
- A merged 'PANGKAT NRP' or 'PANGKAT NRP NAMA' cell is split using the
  8-digit NRP regex (with look-arounds so glued forms like 'BRIPKA98050505'
  work) and the master pangkat lookup.
- Unknown ranks are kept verbatim so the validation layer can flag them as
  UNKNOWN_PANGKAT for HITL review.
- Rows with neither nrp nor nama are dropped (separators / merged cells).

New module pipeline/table.py:
- DetectedTable dataclass (cells + html).
- parse_table_html: tag/entity-tolerant HTML -> 2D grid.
- extract_tables_from_pp_result: filter PP-Structure regions to type=table.
- run_table_extraction: top-level entrypoint with lazy-init singleton for
  the heavy PP-Structure engine.

Orchestrator now invokes table extraction (gated by TABLES_ENABLED) on every
preprocessed page and merges the discovered personnel into the
ExtractionResult. Failures are caught and logged so a flaky table recognizer
never blocks header extraction.

Tests: 38 new unit tests covering HTML parsing, region filtering, header
classification, column mapping (split, combined, glued cells), and
end-to-end personnel extraction. Total 108 tests, all green. PaddleOCR /
PP-Structure remain optional - no test imports them.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: fix header misclassification for combined Pangkat/NRP/Nama columns

Devin Review caught two related bugs in personnel column mapping:

1. _classify_header_cell iterated _HEADER_SYNONYMS in insertion order when
   falling back to substring matching. The dict listed shorter keywords
   first ('pangkat' before 'pangkat / nrp'), so a header like
   'Pangkat / NRP / Nama' classified as plain 'pangkat'. map_row then tried
   to normalize the whole 'AKP 87010101 Budi Santoso' cell as a rank,
   normalize_pangkat returned None, and the row failed the nrp-or-nama gate
   at the bottom of map_row -- silently dropping every personnel row in
   tables using this layout.
2. _split_pangkat_nrp_nama existed and was unit-tested but was never wired
   up in map_row, so even if classification had worked, the three-way split
   would not have run. The module docstring claimed the split was supported.

Fix:
- Iterate the synonym table sorted by keyword length descending in the
  substring-match fallback so the most specific synonym wins.
- Add 'pangkat_nrp_nama' synonym entries for typical separators (' / ', '/',
  whitespace, comma).
- Wire 'pangkat_nrp_nama' into map_row using the existing helper.
- Update is_personnel_table so combined headers count as both an id signal
  and a name signal.

Tests: 6 new asserts (parametrized), 1 regression test for the
triple-combined header end-to-end, 1 dedicated map_row test for the new
column type. 114 tests total, all green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: handle multi-word Polri ranks in _split_pangkat_nrp_nama

Devin Review caught: token-by-token is_valid_pangkat() check could not
recognize multi-word ranks ('KOMBES POL', 'BRIGJEN POL', 'IRJEN POL',
'KOMJEN POL', 'JENDERAL POL'). For 'KOMBES POL 88123456 John Doe' the old
code returned pangkat=None, nama='KOMBES POL John Doe', and the validator's
UNKNOWN_PANGKAT flag never fired because pangkat was falsy.

New behavior: greedy longest-prefix match. After stripping the NRP we try
the leading 3-token, 2-token, 1-token slice against normalize_pangkat() and
take the longest that maps to a canonical rank. Tokens after the matched
rank become the nama. Unknown ranks fall through to pangkat=None and the
rank text stays in the nama field, where downstream validation already flags
the row.

Tests: 5 new asserts (4 multi-word ranks + 1 unknown-rank fallback), 119
total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: don't count pangkat_nrp as a name signal in is_personnel_table

Devin Review caught: a table with header ['No', 'Pangkat / NRP', 'Jabatan']
(no name column) was wrongly classified as a personnel table because
pangkat_nrp was lumped into has_name. Such a table would produce
PersonnelEntry rows with nama=None passing the nrp-or-nama gate, polluting
the personel[] output with id-only fragments.

Split the combined-cell set into combined_id (counts toward has_id) and
combined_name (counts toward has_name). Only pangkat_nrp_nama, which
actually embeds a name, qualifies for has_name. pangkat_nrp remains an
id-only signal.

Tests: 3 new asserts (rejects id-only, accepts pangkat_nrp + separate nama,
accepts pangkat_nrp_nama). 122 total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
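The header-ordering bug described in the second commit can be reproduced in isolation. The snippet below is a minimal sketch, assuming a trimmed-down three-entry synonym table; `SYNONYMS`, `normalize`, `classify_insertion_order`, and `classify_longest_first` are hypothetical names used for illustration and are not the module's own.

```python
from __future__ import annotations

# Hypothetical, trimmed-down synonym table (not the real _HEADER_SYNONYMS).
# Shorter keywords come first, which is what triggered the original bug.
SYNONYMS: dict[str, str] = {
    "pangkat": "pangkat",
    "pangkat / nrp": "pangkat_nrp",
    "pangkat / nrp / nama": "pangkat_nrp_nama",
}


def normalize(header: str) -> str:
    """Lowercase and collapse whitespace, as a stand-in for header cleanup."""
    return " ".join(header.lower().split())


def classify_insertion_order(header: str) -> str | None:
    """Substring fallback in dict order: the first keyword that fits wins."""
    norm = normalize(header)
    for keyword, canonical in SYNONYMS.items():
        if keyword in norm:
            return canonical
    return None


def classify_longest_first(header: str) -> str | None:
    """Substring fallback with keywords sorted by length, longest first."""
    norm = normalize(header)
    ordered = sorted(SYNONYMS.items(), key=lambda kv: -len(kv[0]))
    for keyword, canonical in ordered:
        if keyword in norm:
            return canonical
    return None


header = "Pangkat / NRP / Nama Personel"
print(classify_insertion_order(header))  # pangkat
print(classify_longest_first(header))    # pangkat_nrp_nama
```

The header deliberately carries an extra word so that no exact match exists and only the substring fallback decides the result.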
parent 812ea7e030
commit 33b38aacc7
316
src/ocr_sprint/pipeline/extract/personnel.py
Normal file
@@ -0,0 +1,316 @@
"""Map a raw 2D table grid into a list of `PersonnelEntry`.

Surat sprint personnel tables don't have a fixed schema across satuan: column
order, header phrasing, and even whether pangkat/NRP are merged into one cell
all vary. We deal with this by:

1. Detecting the header row by keyword scoring: each of the first few rows
   is scored by how many cells classify as a known column (e.g. "PANGKAT",
   "NRP", "NAMA"), and the row with the highest score wins.
2. Mapping each header cell to one of the canonical PersonnelEntry fields
   via a synonym dictionary.
3. Walking the remaining rows and slotting cells into fields by column
   index. A combined "PANGKAT/NRP" or "PANGKAT/NRP/NAMA" cell is split
   heuristically (8-digit token → NRP, known-rank token → pangkat, the
   leftover words → nama).

The mapper is deliberately conservative: when in doubt it leaves a field
None and lets validation flag the row for HITL review.
"""

from __future__ import annotations

import re

from ocr_sprint.data.master_pangkat import normalize_pangkat
from ocr_sprint.pipeline.table import DetectedTable
from ocr_sprint.schemas.personnel import PersonnelEntry

# ---------- column synonyms ----------

# header keyword → canonical column id. Lowercased, whitespace-collapsed.
_HEADER_SYNONYMS: dict[str, str] = {
    # row index column
    "no": "no",
    "nomor": "no",
    "no.": "no",
    # rank
    "pangkat": "pangkat",
    "pkt": "pangkat",
    # NRP / NIP / NIPK
    "nrp": "nrp",
    "no nrp": "nrp",
    "nrp / nip": "nrp",
    "nrp/nip": "nrp",
    "nrp nip": "nrp",
    "no. mhs": "nrp",  # taruna
    # combined pangkat + NRP + nama cell, seen in compact Polri layouts.
    # Order matters here only for readability; _classify_header_cell ranks
    # synonyms by length, so the longer 'pangkat / nrp / nama' wins over
    # both 'pangkat / nrp' and 'pangkat'.
    "pangkat / nrp / nama": "pangkat_nrp_nama",
    "pangkat/nrp/nama": "pangkat_nrp_nama",
    "pangkat nrp nama": "pangkat_nrp_nama",
    "pangkat, nrp, nama": "pangkat_nrp_nama",
    # combined pangkat + NRP cell, common in Polres-level sprint
    "pangkat / nrp": "pangkat_nrp",
    "pangkat/nrp": "pangkat_nrp",
    "pangkat dan nrp": "pangkat_nrp",
    "pangkat nrp": "pangkat_nrp",
    # name
    "nama": "nama",
    "nama lengkap": "nama",
    # jabatan dalam dinas (permanent post)
    "jabatan": "jabatan_dinas",
    "jabatan dinas": "jabatan_dinas",
    "jabatan dalam dinas": "jabatan_dinas",
    "jbt dinas": "jabatan_dinas",
    # jabatan dalam sprint (role for this dispatch)
    "jabatan dalam sprint": "jabatan_sprint",
    "jabatan dalam sprin": "jabatan_sprint",
    "jabatan dalam surat perintah": "jabatan_sprint",
    "jabatan sprint": "jabatan_sprint",
    "jabatan sprin": "jabatan_sprint",
    "tugas": "jabatan_sprint",
    "penugasan": "jabatan_sprint",
    # remarks
    "keterangan": "keterangan",
    "ket": "keterangan",
    "ket.": "keterangan",
}

# 8-digit NRP. We don't anchor on word boundaries because OCR sometimes glues
# the rank directly onto the digits ("BRIPKA98050505"). We use (?<!\d) and (?!\d)
# look-arounds to make sure we don't match a substring of a longer number.
_NRP_RE = re.compile(r"(?<!\d)(\d{8})(?!\d)")
_NUMBER_RE = re.compile(r"^\s*(\d{1,3})[.)\s]*$")


# ---------- header detection ----------


def _normalize_header_cell(text: str) -> str:
    return " ".join(text.lower().split()).strip(" .:")


# Synonym keywords sorted by length (descending) so that substring matching
# in `_classify_header_cell` prefers the most specific match. Without this,
# the shorter 'pangkat' would match a 'pangkat / nrp / nama' header before
# the full 'pangkat / nrp / nama' synonym is tried, silently misclassifying
# combined-cell headers and dropping rows.
_SORTED_HEADER_KEYWORDS: list[tuple[str, str]] = sorted(
    _HEADER_SYNONYMS.items(), key=lambda kv: -len(kv[0])
)


def _classify_header_cell(text: str) -> str | None:
    """Return the canonical column id for a header cell, or None.

    First tries an exact match against the synonym table; if that fails,
    falls back to substring matching against the *longest* synonym that is
    contained in the cell text. The longest-first ordering matters: a header
    like 'Pangkat / NRP / Nama' must classify as `pangkat_nrp_nama`, not
    `pangkat`, otherwise downstream `map_row` would treat the whole cell as
    a rank string and drop the row when normalize_pangkat returns None.
    """
    norm = _normalize_header_cell(text)
    if not norm:
        return None
    if norm in _HEADER_SYNONYMS:
        return _HEADER_SYNONYMS[norm]
    for keyword, canonical in _SORTED_HEADER_KEYWORDS:
        if keyword in norm:
            return canonical
    return None


def detect_header_row(table: DetectedTable) -> tuple[int, list[str | None]] | None:
    """Find the most likely header row and return (row_index, column_mapping).

    Strategy: score each of the first ~3 rows by how many cells classify as a
    known column. Pick the highest-scoring row provided it covers at least
    two known fields (otherwise we don't have enough signal to trust it).
    """
    best_idx: int | None = None
    best_mapping: list[str | None] = []
    best_score = 0
    for r_idx in range(min(3, table.n_rows)):
        row = table.cells[r_idx]
        mapping = [_classify_header_cell(cell) for cell in row]
        score = sum(1 for m in mapping if m is not None)
        if score >= 2 and score > best_score:
            best_score = score
            best_idx = r_idx
            best_mapping = mapping
    if best_idx is None:
        return None
    return best_idx, best_mapping


# ---------- combined-cell splitting ----------


def _split_pangkat_nrp(cell: str) -> tuple[str | None, str | None]:
    """Split a 'PANGKAT NRP' cell into (pangkat, nrp).

    Returns (None, None) if the cell can't be split confidently.
    """
    if not cell:
        return None, None
    nrp_match = _NRP_RE.search(cell)
    nrp = nrp_match.group(1) if nrp_match else None
    pangkat_part = cell
    if nrp_match:
        pangkat_part = cell[: nrp_match.start()] + cell[nrp_match.end() :]
    # Strip separators commonly seen between rank and NRP ("AKP / 87010101",
    # "AKP. 87010101", "AKP - 87010101") before normalizing.
    pangkat_part = pangkat_part.strip(" /-.,;:|").strip()
    pangkat = normalize_pangkat(pangkat_part)
    return pangkat, nrp


def _split_pangkat_nrp_nama(cell: str) -> tuple[str | None, str | None, str | None]:
    """Split a 'PANGKAT NRP NAMA' single-cell into its three components.

    Multi-word ranks like 'KOMBES POL' or 'BRIGJEN POL' must be matched as
    contiguous token sequences, otherwise tokens like 'POL' leak into the
    name. We greedily try the longest leading token-prefix that normalizes
    to a known pangkat, then fall back to shorter prefixes.
    """
    if not cell:
        return None, None, None
    nrp_match = _NRP_RE.search(cell)
    nrp = nrp_match.group(1) if nrp_match else None
    rest = cell
    if nrp_match:
        # Cut out the matched span itself rather than the first occurrence of
        # the digits, which could sit inside an earlier, longer number.
        rest = cell[: nrp_match.start()] + " " + cell[nrp_match.end() :]
    tokens = rest.split()
    if not tokens:
        return None, nrp, None

    # Try the longest leading sub-sequence first so 'KOMBES POL' wins over
    # 'KOMBES' (which alone is not a valid pangkat anyway).
    pangkat: str | None = None
    consumed = 0
    for prefix_len in range(min(len(tokens), 3), 0, -1):
        candidate = " ".join(tokens[:prefix_len])
        normalized = normalize_pangkat(candidate)
        if normalized is not None:
            pangkat = normalized
            consumed = prefix_len
            break

    name_tokens = tokens[consumed:] if pangkat else tokens
    nama = " ".join(name_tokens) if name_tokens else None
    return pangkat, nrp, nama


# ---------- row mapping ----------


def _parse_int(value: str) -> int | None:
    m = _NUMBER_RE.match(value)
    return int(m.group(1)) if m else None


def map_row(row: list[str], mapping: list[str | None]) -> PersonnelEntry | None:
    """Convert one data row into a PersonnelEntry using the column mapping."""
    fields: dict[str, str | int | None] = {
        "no": None,
        "pangkat": None,
        "nrp": None,
        "nama": None,
        "jabatan_dinas": None,
        "jabatan_sprint": None,
        "keterangan": None,
    }
    for idx, cell in enumerate(row):
        if idx >= len(mapping):
            break
        column = mapping[idx]
        if column is None:
            continue
        text = cell.strip()
        if column == "no":
            fields["no"] = _parse_int(text)
        elif column == "pangkat_nrp_nama":
            pangkat, nrp, nama = _split_pangkat_nrp_nama(text)
            if pangkat:
                fields["pangkat"] = pangkat
            if nrp:
                fields["nrp"] = nrp
            if nama:
                fields["nama"] = nama
        elif column == "pangkat_nrp":
            pangkat, nrp = _split_pangkat_nrp(text)
            if pangkat:
                fields["pangkat"] = pangkat
            if nrp:
                fields["nrp"] = nrp
        elif column == "pangkat":
            fields["pangkat"] = normalize_pangkat(text) or text or None
        elif column == "nrp":
            m = _NRP_RE.search(text)
            fields["nrp"] = m.group(1) if m else (text or None)
        elif column in fields:
            fields[column] = text or None

    # require at least nama OR nrp to consider this a real personnel row;
    # otherwise it's likely a separator / footnote / merged cell.
    if not (fields["nrp"] or fields["nama"]):
        return None

    return PersonnelEntry(
        no=fields["no"] if isinstance(fields["no"], int) else None,
        pangkat=fields["pangkat"] if isinstance(fields["pangkat"], str) else None,
        nrp=fields["nrp"] if isinstance(fields["nrp"], str) else None,
        nama=fields["nama"] if isinstance(fields["nama"], str) else None,
        jabatan_dinas=(
            fields["jabatan_dinas"] if isinstance(fields["jabatan_dinas"], str) else None
        ),
        jabatan_sprint=(
            fields["jabatan_sprint"] if isinstance(fields["jabatan_sprint"], str) else None
        ),
        keterangan=(fields["keterangan"] if isinstance(fields["keterangan"], str) else None),
    )


# ---------- table-level entrypoint ----------


def is_personnel_table(table: DetectedTable) -> bool:
    """Heuristic: a table is the personnel list if its header row contains
    at least one rank/NRP indicator and one name indicator.
    """
    detected = detect_header_row(table)
    if detected is None:
        return False
    _, mapping = detected
    # `pangkat_nrp` is an id-only signal (rank + NRP, no name), while
    # `pangkat_nrp_nama` carries a name too. Counting `pangkat_nrp` toward
    # `has_name` would let id-only tables (e.g. ['No', 'Pangkat / NRP',
    # 'Jabatan']) be mistaken for personnel tables.
    combined_id = {"pangkat_nrp", "pangkat_nrp_nama"}
    combined_name = {"pangkat_nrp_nama"}
    has_id = any(m in {"nrp", "pangkat"} | combined_id for m in mapping)
    has_name = any(m == "nama" or m in combined_name for m in mapping)
    return has_id and has_name


def extract_personnel(tables: list[DetectedTable]) -> list[PersonnelEntry]:
    """Convert every table that looks like a personnel list into entries.

    If multiple tables look like personnel lists (rare), we concatenate them
    in document order so nothing is silently dropped.
    """
    rows: list[PersonnelEntry] = []
    for table in tables:
        if not is_personnel_table(table):
            continue
        detected = detect_header_row(table)
        if detected is None:
            continue
        header_idx, mapping = detected
        for r_idx in range(header_idx + 1, table.n_rows):
            entry = map_row(table.cells[r_idx], mapping)
            if entry is not None:
                rows.append(entry)
    return rows
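
The combined-cell split performed by `_split_pangkat_nrp_nama` can be tried out without the real rank master data. The following is a minimal sketch, assuming a tiny hypothetical rank set in place of `ocr_sprint.data.master_pangkat`; `KNOWN_RANKS`, `normalize_rank`, and `split_cell` are illustrative names and do not exist in the module.

```python
from __future__ import annotations

import re

# Same pattern as the module's _NRP_RE: eight digits not flanked by digits.
NRP_RE = re.compile(r"(?<!\d)(\d{8})(?!\d)")

# Hypothetical stand-in for the master pangkat lookup (assumption).
KNOWN_RANKS = {"AKP", "BRIPKA", "KOMBES POL", "BRIGJEN POL"}


def normalize_rank(text: str) -> str | None:
    """Return the canonical rank if the text is a known rank, else None."""
    candidate = " ".join(text.upper().split())
    return candidate if candidate in KNOWN_RANKS else None


def split_cell(cell: str) -> tuple[str | None, str | None, str | None]:
    """Split 'RANK NRP NAME' into (rank, nrp, name) by longest rank prefix."""
    match = NRP_RE.search(cell)
    nrp = match.group(1) if match else None
    rest = cell
    if match:
        # Remove the matched NRP span, leaving a space so glued text splits.
        rest = cell[: match.start()] + " " + cell[match.end():]
    tokens = rest.split()
    # Try 3-token, then 2-token, then 1-token prefixes as the rank.
    for prefix_len in range(min(len(tokens), 3), 0, -1):
        rank = normalize_rank(" ".join(tokens[:prefix_len]))
        if rank is not None:
            name = " ".join(tokens[prefix_len:]) or None
            return rank, nrp, name
    # Unknown rank: everything that is not the NRP stays in the name.
    return None, nrp, " ".join(tokens) or None


print(split_cell("KOMBES POL 88123456 John Doe"))
print(split_cell("BRIPKA98050505 Andi"))
print(split_cell("XYZ 88123456 John"))
```

With this stand-in rank set, the multi-word rank is kept whole, the glued `BRIPKA98050505` form still yields a rank and an NRP, and the unknown rank `XYZ` ends up in the name field with the rank left as `None`.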