Phase 3: PP-Structure table extraction + personnel column mapper (#2)

* Phase 3: PP-Structure table extraction + personnel column mapper

Adds the personnel-table stage of the pipeline. PaddleOCR's PP-Structure
recognizes table regions and emits HTML, which we parse into a 2D cell
grid. A separate column mapper detects the header row, classifies each
column to a canonical PersonnelEntry field via a synonym dictionary,
and walks the data rows.

Variant handling:
- Different satuan use different column orders and header phrasing.
  Supported synonyms for each canonical field are listed in
  pipeline/extract/personnel.py (Pangkat / NRP / Pangkat-NRP combo /
  Nama / Jabatan dalam Dinas / Jabatan dalam Sprint / Keterangan).
- A merged 'PANGKAT NRP' or 'PANGKAT NRP NAMA' cell is split using
  the 8-digit NRP regex (with look-arounds so glued forms like
  'BRIPKA98050505' work) and the master pangkat lookup.
- Unknown ranks are kept verbatim so the validation layer can flag
  them as UNKNOWN_PANGKAT for HITL review.
- Rows with neither nrp nor nama are dropped (separators / merged cells).
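
The look-around split for merged cells can be sketched roughly as below. This
is an illustration only, not the code in pipeline/extract/personnel.py: the
names NRP_RE and split_rank_and_nrp, and the exact set of separator characters
that get stripped, are assumptions made for the example.

```python
from __future__ import annotations

import re

# Assumption: an NRP is a run of exactly 8 digits. The look-arounds only
# forbid an adjacent digit, so letters glued to the number still match.
NRP_RE = re.compile(r"(?<!\d)\d{8}(?!\d)")


def split_rank_and_nrp(cell: str) -> tuple[str | None, str | None]:
    """Split a merged 'PANGKAT NRP' cell into (rank_text, nrp)."""
    match = NRP_RE.search(cell)
    if match is None:
        # No NRP found: the whole cell is treated as rank text.
        rank = cell.strip(" /,-")
        return (rank or None, None)
    # Everything outside the NRP is rank text; drop dangling separators.
    rank = (cell[: match.start()] + " " + cell[match.end():]).strip(" /,-")
    return (rank or None, match.group())


print(split_rank_and_nrp("BRIPKA98050505"))   # glued form
print(split_rank_and_nrp("IPDA / 92030404"))  # slash-separated form
```

A plain \b word boundary would not work for the glued form, because there is
no boundary between 'A' and '9'; the digit-only look-arounds avoid that.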

New module pipeline/table.py:
- DetectedTable dataclass (cells + html).
- parse_table_html: tag/entity-tolerant HTML -> 2D grid.
- extract_tables_from_pp_result: filter PP-Structure regions to type=table.
- run_table_extraction: top-level entrypoint with lazy-init singleton
  for the heavy PP-Structure engine.
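
For readers unfamiliar with the HTML -> grid step, here is a minimal sketch
built on the standard-library html.parser. It is not the parse_table_html
implementation; GridParser and html_to_grid are hypothetical names, and the
sketch ignores colspan/rowspan entirely.

```python
from __future__ import annotations

from html.parser import HTMLParser


class GridParser(HTMLParser):
    """Collect <td>/<th> text into rows; every other tag is ignored."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decodes &amp;, &nbsp;, ...
        self.rows: list[list[str]] = []
        self._cell: list[str] | None = None  # text chunks of the open cell

    def _flush(self) -> None:
        # Close the open cell, if any, collapsing runs of whitespace.
        if self._cell is not None:
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush()
            self.rows.append([])
        elif tag in ("td", "th"):
            self._flush()  # tolerate a missing </td>
            if not self.rows:  # tolerate a missing <tr>
                self.rows.append([])
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._flush()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_to_grid(html: str) -> list[list[str]]:
    """Return the table as a list of rows, each a list of cell strings."""
    parser = GridParser()
    parser.feed(html)
    parser.close()  # deliver any buffered trailing text
    parser._flush()
    return [row for row in parser.rows if row]
```

Inline tags such as <b> inside a cell are simply skipped, so their text is
concatenated into the surrounding cell.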

Orchestrator now invokes table extraction (gated by TABLES_ENABLED) on
every preprocessed page and merges the discovered personnel into the
ExtractionResult. Failures are caught and logged so a flaky table
recognizer never blocks header extraction.
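
The failure isolation described above might look something like this sketch.
The function name collect_personnel, the extract callable, and the
tables_enabled flag are placeholders for illustration; the real orchestrator
and its TABLES_ENABLED setting may be wired differently.

```python
from __future__ import annotations

import logging
from typing import Callable

logger = logging.getLogger(__name__)


def collect_personnel(
    pages: list[object],
    extract: Callable[[object], list[dict]],
    tables_enabled: bool,
) -> list[dict]:
    """Run table extraction per page; a failing page is logged and skipped."""
    if not tables_enabled:
        return []
    personnel: list[dict] = []
    for index, page in enumerate(pages):
        try:
            personnel.extend(extract(page))
        except Exception:
            # Swallow the error so the rest of the pipeline still runs.
            logger.exception("table extraction failed on page %d", index)
    return personnel
```

Catching the broad Exception is deliberate in this sketch: the table stage is
treated as best-effort, and the traceback is still recorded in the log.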

Tests: 38 new unit tests covering HTML parsing, region filtering,
header classification, column mapping (split, combined, glued cells),
and end-to-end personnel extraction. Total 108 tests, all green.
PaddleOCR / PP-Structure remain optional - no test imports them.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: fix header misclassification for combined Pangkat/NRP/Nama columns

Devin Review caught two related bugs in personnel column mapping:

1. _classify_header_cell iterated _HEADER_SYNONYMS in insertion order
   when falling back to substring matching. The dict listed shorter
   keywords first ('pangkat' before 'pangkat / nrp'), so a header like
   'Pangkat / NRP / Nama' classified as plain 'pangkat'. map_row then
   tried to normalize the whole 'AKP 87010101 Budi Santoso' cell
   as a rank, normalize_pangkat returned None, and the row failed the
   nrp-or-nama gate at the bottom of map_row -- silently dropping
   every personnel row in tables using this layout.

2. _split_pangkat_nrp_nama existed and was unit-tested but was never
   wired up in map_row, so even if classification had worked, the
   three-way split would not have run. The module docstring claimed
   the split was supported.

Fix:
- Iterate the synonym table sorted by keyword length descending in the
  substring-match fallback so the most specific synonym wins.
- Add 'pangkat_nrp_nama' synonym entries for typical separators
  (' / ', '/', whitespace, comma).
- Wire 'pangkat_nrp_nama' into map_row using the existing helper.
- Update is_personnel_table so combined headers count as both an id
  signal and a name signal.
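
A rough sketch of the longest-synonym-first fallback follows. SYNONYMS is a
tiny stand-in for _HEADER_SYNONYMS, and classify_header is a hypothetical
name. Note one deliberate simplification: the sketch normalizes separators to
' / ' before matching, whereas the actual fix adds one synonym entry per
separator style.

```python
from __future__ import annotations

import re

# Assumption: a small excerpt of the synonym table, shortest keywords first
# to mirror the insertion order that caused the original bug.
SYNONYMS = {
    "pangkat": "pangkat",
    "nrp": "nrp",
    "nama": "nama",
    "pangkat / nrp": "pangkat_nrp",
    "pangkat / nrp / nama": "pangkat_nrp_nama",
}


def _normalize(text: str) -> str:
    # Lowercase, rewrite '/' and ',' separators as ' / ', collapse spaces.
    text = re.sub(r"\s*[/,]\s*", " / ", text.lower())
    return " ".join(text.split())


def classify_header(text: str) -> str | None:
    key = _normalize(text)
    if not key:
        return None
    if key in SYNONYMS:  # exact match first
        return SYNONYMS[key]
    # Substring fallback: try the longest keyword first so that
    # 'pangkat / nrp / nama' wins over its prefix 'pangkat'.
    for keyword in sorted(SYNONYMS, key=len, reverse=True):
        if keyword in key:
            return SYNONYMS[keyword]
    return None
```

Iterating the dict directly instead of the sorted keys reproduces the bug:
'Pangkat / NRP / Nama Personel' would hit 'pangkat' first.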

Tests: 6 new asserts (parametrized), 1 regression test for triple-
combined header end-to-end, 1 dedicated map_row test for the new
column type. 114 tests total, all green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: handle multi-word Polri ranks in _split_pangkat_nrp_nama

Devin Review caught: token-by-token is_valid_pangkat() check could not
recognize multi-word ranks ('KOMBES POL', 'BRIGJEN POL', 'IRJEN POL',
'KOMJEN POL', 'JENDERAL POL'). For 'KOMBES POL 88123456 John Doe' the
old code returned pangkat=None, nama='KOMBES POL John Doe', and the
validator's UNKNOWN_PANGKAT flag never fired because pangkat was falsy.

New behavior: greedy longest-prefix match. After stripping the NRP we
try the leading 3-token, 2-token, 1-token slice against
normalize_pangkat() and take the longest that maps to a canonical
rank. Tokens after the matched rank become the nama. Unknown ranks
fall through to pangkat=None and the rank text stays in the nama
field, where downstream validation already flags the row.
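
The greedy longest-prefix match can be illustrated as below. KNOWN_RANKS is a
small stand-in for the master pangkat lookup, and normalize_rank and
split_rank_nrp_name are hypothetical names; the real helper relies on
normalize_pangkat() and may tokenize differently.

```python
from __future__ import annotations

import re

NRP_RE = re.compile(r"(?<!\d)\d{8}(?!\d)")

# Assumption: an excerpt of canonical ranks, not the full master list.
KNOWN_RANKS = {
    "AKP", "IPDA", "BRIPKA", "KOMPOL",
    "KOMBES POL", "BRIGJEN POL", "IRJEN POL", "KOMJEN POL", "JENDERAL POL",
}


def normalize_rank(text: str) -> str | None:
    candidate = " ".join(text.upper().split())
    return candidate if candidate in KNOWN_RANKS else None


def split_rank_nrp_name(cell: str) -> tuple[str | None, str | None, str | None]:
    """Split 'RANK NRP NAME' into (rank, nrp, name) with longest-prefix rank."""
    match = NRP_RE.search(cell)
    nrp = match.group() if match else None
    # Remove the NRP, then tokenize what is left on spaces, '/' and ','.
    rest = NRP_RE.sub(" ", cell, count=1) if match else cell
    tokens = [t for t in re.split(r"[\s/,]+", rest) if t]
    for size in (3, 2, 1):  # longest candidate first
        if size > len(tokens):
            continue
        rank = normalize_rank(" ".join(tokens[:size]))
        if rank is not None:
            return rank, nrp, " ".join(tokens[size:]) or None
    # Unknown rank: leave every token in the name for validation to flag.
    return None, nrp, " ".join(tokens) or None


print(split_rank_nrp_name("KOMBES POL 88123456 John Doe"))
print(split_rank_nrp_name("Foobar 87010101 Budi Santoso"))
```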

Tests: 5 new asserts (4 multi-word ranks + 1 unknown-rank fallback),
119 total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: don't count pangkat_nrp as a name signal in is_personnel_table

Devin Review caught: a table with header ['No', 'Pangkat / NRP',
'Jabatan'] (no name column) was wrongly classified as a personnel
table because pangkat_nrp was lumped into has_name. Such a table
would produce PersonnelEntry rows with nama=None passing the nrp-or-
nama gate, polluting the personel[] output with id-only fragments.

Split the combined-cell set into combined_id (counts toward has_id)
and combined_name (counts toward has_name). Only pangkat_nrp_nama,
which actually embeds a name, qualifies for has_name. pangkat_nrp
remains an id-only signal.
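
The id/name signal split can be sketched as follows. The set names and
looks_like_personnel_header are illustrative, and which plain fields count as
id signals (here pangkat and nrp) is an assumption; only the combined_id /
combined_name distinction comes from this change.

```python
from __future__ import annotations

# Assumption: plain columns that count as id or name signals.
ID_FIELDS = {"pangkat", "nrp"}
NAME_FIELDS = {"nama"}
# Combined cells: both carry an id, but only the three-way one carries a name.
COMBINED_ID = {"pangkat_nrp", "pangkat_nrp_nama"}
COMBINED_NAME = {"pangkat_nrp_nama"}


def looks_like_personnel_header(mapping: list[str | None]) -> bool:
    """True when the classified header has both an id and a name signal."""
    fields = {field for field in mapping if field is not None}
    has_id = bool(fields & (ID_FIELDS | COMBINED_ID))
    has_name = bool(fields & (NAME_FIELDS | COMBINED_NAME))
    return has_id and has_name
```

With this split, a header classified as ['no', 'pangkat_nrp', None] is
rejected, while adding a separate 'nama' column or using 'pangkat_nrp_nama'
is accepted.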

Tests: 3 new asserts (rejects id-only, accepts pangkat_nrp + separate
nama, accepts pangkat_nrp_nama). 122 total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
devin-ai-integration[bot]
2026-04-25 16:10:48 +00:00
committed by GitHub
parent 812ea7e030
commit 33b38aacc7
8 changed files with 905 additions and 12 deletions


@@ -0,0 +1,300 @@
"""Tests for the personnel-row mapper."""
from __future__ import annotations

import pytest

from ocr_sprint.pipeline.extract.personnel import (
    _classify_header_cell,
    _split_pangkat_nrp,
    _split_pangkat_nrp_nama,
    detect_header_row,
    extract_personnel,
    is_personnel_table,
    map_row,
)
from ocr_sprint.pipeline.table import DetectedTable

# ---------- header detection ----------


class TestClassifyHeaderCell:
    @pytest.mark.parametrize(
        ("text", "expected"),
        [
            ("No", "no"),
            ("NO.", "no"),
            ("Nomor", "no"),
            ("Pangkat", "pangkat"),
            ("NRP", "nrp"),
            ("Pangkat / NRP", "pangkat_nrp"),
            ("PANGKAT/NRP", "pangkat_nrp"),
            ("Pangkat / NRP / Nama", "pangkat_nrp_nama"),
            ("PANGKAT/NRP/NAMA", "pangkat_nrp_nama"),
            ("Pangkat, NRP, Nama", "pangkat_nrp_nama"),
            ("Nama", "nama"),
            ("Nama Lengkap", "nama"),
            ("Jabatan dalam Dinas", "jabatan_dinas"),
            ("Jabatan dalam Sprint", "jabatan_sprint"),
            ("Keterangan", "keterangan"),
        ],
    )
    def test_known_header(self, text: str, expected: str) -> None:
        assert _classify_header_cell(text) == expected

    def test_substring_match_prefers_longest_synonym(self) -> None:
        # 'pangkat' is a shorter prefix of 'pangkat / nrp / nama'. Without
        # length-sorted iteration we'd misclassify combined headers as plain
        # 'pangkat' and downstream map_row would drop every row.
        assert _classify_header_cell("Pangkat / NRP / Nama Personel") == "pangkat_nrp_nama"
        assert _classify_header_cell("Pangkat / NRP Polri") == "pangkat_nrp"

    def test_unknown_header(self) -> None:
        assert _classify_header_cell("Random Text") is None
        assert _classify_header_cell("") is None


class TestDetectHeaderRow:
    def test_detects_first_row_as_header(self) -> None:
        table = DetectedTable(
            cells=[
                ["No", "Pangkat", "NRP", "Nama"],
                ["1", "AKP", "87010101", "Budi"],
            ]
        )
        result = detect_header_row(table)
        assert result is not None
        idx, mapping = result
        assert idx == 0
        assert mapping == ["no", "pangkat", "nrp", "nama"]

    def test_detects_second_row_when_first_is_title(self) -> None:
        table = DetectedTable(
            cells=[
                ["DAFTAR PERSONEL"],  # title row, not a header
                ["No", "Pangkat / NRP", "Nama", "Jabatan dalam Dinas"],
                ["1", "AKP 87010101", "Budi", "Kanit"],
            ]
        )
        result = detect_header_row(table)
        assert result is not None
        idx, _ = result
        assert idx == 1

    def test_returns_none_when_no_header_found(self) -> None:
        table = DetectedTable(cells=[["foo", "bar"], ["baz", "qux"]])
        assert detect_header_row(table) is None


# ---------- combined-cell splitting ----------


class TestSplitPangkatNrp:
    @pytest.mark.parametrize(
        ("text", "expected"),
        [
            ("AKP 87010101", ("AKP", "87010101")),
            ("IPDA / 92030404", ("IPDA", "92030404")),
            ("BRIPKA98050505", ("BRIPKA", "98050505")),
            ("KOMPOL 88123456", ("KOMPOL", "88123456")),
        ],
    )
    def test_known_combos(self, text: str, expected: tuple[str, str]) -> None:
        assert _split_pangkat_nrp(text) == expected

    def test_returns_none_when_no_nrp(self) -> None:
        pangkat, nrp = _split_pangkat_nrp("AKP")
        assert pangkat == "AKP"
        assert nrp is None


class TestSplitPangkatNrpNama:
    def test_three_way_split(self) -> None:
        pangkat, nrp, nama = _split_pangkat_nrp_nama("AKP 87010101 Budi Santoso")
        assert pangkat == "AKP"
        assert nrp == "87010101"
        assert nama == "Budi Santoso"

    @pytest.mark.parametrize(
        ("text", "expected_pangkat", "expected_name"),
        [
            # multi-word ranks must be matched as contiguous token sequences,
            # otherwise tokens like 'POL' would leak into the name.
            ("KOMBES POL 88123456 John Doe", "KOMBES POL", "John Doe"),
            ("BRIGJEN POL 99887766 Jane Doe", "BRIGJEN POL", "Jane Doe"),
            ("IRJEN POL 77665544 Ahmad Hidayat", "IRJEN POL", "Ahmad Hidayat"),
            ("JENDERAL POL 11223344 Sari Wulandari", "JENDERAL POL", "Sari Wulandari"),
        ],
    )
    def test_multi_word_ranks(self, text: str, expected_pangkat: str, expected_name: str) -> None:
        pangkat, _nrp, nama = _split_pangkat_nrp_nama(text)
        assert pangkat == expected_pangkat
        assert nama == expected_name

    def test_unknown_rank_returns_none_pangkat(self) -> None:
        pangkat, nrp, nama = _split_pangkat_nrp_nama("Foobar 87010101 Budi Santoso")
        assert pangkat is None
        assert nrp == "87010101"
        # name keeps the unknown rank token; validators will flag the row.
        assert nama == "Foobar Budi Santoso"


# ---------- row mapping ----------


class TestMapRow:
    def test_split_columns_polres_layout(self) -> None:
        mapping = ["no", "pangkat", "nrp", "nama", "jabatan_dinas", "jabatan_sprint"]
        row = ["1", "AKP", "87010101", "Budi Santoso", "Kanit Reskrim", "Ketua Tim"]
        entry = map_row(row, mapping)
        assert entry is not None
        assert entry.no == 1
        assert entry.pangkat == "AKP"
        assert entry.nrp == "87010101"
        assert entry.nama == "Budi Santoso"
        assert entry.jabatan_dinas == "Kanit Reskrim"
        assert entry.jabatan_sprint == "Ketua Tim"

    def test_combined_pangkat_nrp_nama_cell(self) -> None:
        mapping = ["no", "pangkat_nrp_nama", "jabatan_dinas", "jabatan_sprint"]
        row = ["1", "AKP 87010101 Budi Santoso", "Kanit Reskrim", "Ketua Tim"]
        entry = map_row(row, mapping)
        assert entry is not None
        assert entry.no == 1
        assert entry.pangkat == "AKP"
        assert entry.nrp == "87010101"
        assert entry.nama == "Budi Santoso"
        assert entry.jabatan_dinas == "Kanit Reskrim"
        assert entry.jabatan_sprint == "Ketua Tim"

    def test_combined_pangkat_nrp_cell(self) -> None:
        mapping = ["no", "pangkat_nrp", "nama", "jabatan_dinas"]
        row = ["1", "AKP 87010101", "Budi Santoso", "Kanit Reskrim"]
        entry = map_row(row, mapping)
        assert entry is not None
        assert entry.pangkat == "AKP"
        assert entry.nrp == "87010101"
        assert entry.nama == "Budi Santoso"

    def test_skips_row_without_nama_or_nrp(self) -> None:
        mapping = ["no", "pangkat"]
        row = ["", ""]
        assert map_row(row, mapping) is None

    def test_unknown_pangkat_kept_verbatim(self) -> None:
        mapping = ["no", "pangkat", "nrp", "nama"]
        row = ["1", "Foobar", "87010101", "Budi"]
        entry = map_row(row, mapping)
        assert entry is not None
        # unknown pangkat is preserved so the validation layer can flag it
        assert entry.pangkat == "Foobar"


# ---------- end-to-end extraction ----------


class TestExtractPersonnel:
    def test_full_table_with_header(self) -> None:
        table = DetectedTable(
            cells=[
                [
                    "No",
                    "Pangkat / NRP",
                    "Nama",
                    "Jabatan dalam Dinas",
                    "Jabatan dalam Sprint",
                ],
                ["1", "AKP 87010101", "Budi Santoso", "Kanit Reskrim", "Ketua Tim"],
                ["2", "IPDA 92030404", "Sari Wulandari", "Banit Reskrim", "Anggota"],
                ["3", "BRIPKA 98050505", "Ahmad Hidayat", "Banit Reskrim", "Anggota"],
            ]
        )
        entries = extract_personnel([table])
        assert len(entries) == 3
        assert entries[0].nama == "Budi Santoso"
        assert entries[0].nrp == "87010101"
        assert entries[1].pangkat == "IPDA"
        assert entries[2].pangkat == "BRIPKA"

    def test_full_table_with_triple_combined_header(self) -> None:
        # Regression test for header misclassification: 'Pangkat / NRP / Nama'
        # used to be classified as 'pangkat' due to substring matching, which
        # silently dropped every personnel row.
        table = DetectedTable(
            cells=[
                ["No", "Pangkat / NRP / Nama", "Jabatan dalam Sprint"],
                ["1", "AKP 87010101 Budi Santoso", "Ketua Tim"],
                ["2", "IPDA 92030404 Sari Wulandari", "Anggota"],
            ]
        )
        entries = extract_personnel([table])
        assert len(entries) == 2
        assert entries[0].pangkat == "AKP"
        assert entries[0].nrp == "87010101"
        assert entries[0].nama == "Budi Santoso"
        assert entries[1].nama == "Sari Wulandari"

    def test_skips_non_personnel_table(self) -> None:
        table = DetectedTable(
            cells=[["Tahun", "Anggaran"], ["2024", "100M"]],
        )
        assert extract_personnel([table]) == []

    def test_concatenates_multiple_personnel_tables(self) -> None:
        t1 = DetectedTable(
            cells=[
                ["No", "Pangkat", "NRP", "Nama"],
                ["1", "AKP", "87010101", "Budi"],
            ]
        )
        t2 = DetectedTable(
            cells=[
                ["No", "Pangkat", "NRP", "Nama"],
                ["1", "IPDA", "92030404", "Sari"],
            ]
        )
        entries = extract_personnel([t1, t2])
        assert len(entries) == 2
        assert entries[0].nama == "Budi"
        assert entries[1].nama == "Sari"


class TestIsPersonnelTable:
    def test_matches_with_pangkat_and_nama(self) -> None:
        table = DetectedTable(
            cells=[["No", "Pangkat", "NRP", "Nama"], ["1", "AKP", "87010101", "X"]]
        )
        assert is_personnel_table(table) is True

    def test_rejects_unrelated_table(self) -> None:
        table = DetectedTable(cells=[["A", "B"], ["1", "2"]])
        assert is_personnel_table(table) is False

    def test_rejects_id_only_table_without_name_column(self) -> None:
        # 'Pangkat / NRP' carries id but no name; without a name signal
        # this should not be classified as a personnel table.
        table = DetectedTable(
            cells=[
                ["No", "Pangkat / NRP", "Jabatan"],
                ["1", "AKP 87010101", "Kanit Reskrim"],
            ]
        )
        assert is_personnel_table(table) is False

    def test_accepts_pangkat_nrp_when_separate_nama_present(self) -> None:
        table = DetectedTable(
            cells=[
                ["No", "Pangkat / NRP", "Nama"],
                ["1", "AKP 87010101", "Budi"],
            ]
        )
        assert is_personnel_table(table) is True

    def test_accepts_pangkat_nrp_nama_combined(self) -> None:
        table = DetectedTable(
            cells=[
                ["No", "Pangkat / NRP / Nama", "Jabatan"],
                ["1", "AKP 87010101 Budi", "Kanit"],
            ]
        )
        assert is_personnel_table(table) is True