Phase 3: PP-Structure table extraction + personnel column mapper (#2)
* Phase 3: PP-Structure table extraction + personnel column mapper

Adds the personnel-table stage of the pipeline. PaddleOCR's PP-Structure recognizes table regions and emits HTML, which we parse into a 2D cell grid. A separate column mapper detects the header row, classifies each column to a canonical PersonnelEntry field via a synonym dictionary, and walks the data rows.

Variant handling:
- Different satuan (units) use different column orders and header phrasing. Supported synonyms for each canonical field are listed in pipeline/extract/personnel.py (Pangkat / NRP / Pangkat-NRP combo / Nama / Jabatan dalam Dinas / Jabatan dalam Sprint / Keterangan).
- A merged 'PANGKAT NRP' or 'PANGKAT NRP NAMA' cell is split using the 8-digit NRP regex (with look-arounds so glued forms like 'BRIPKA98050505' work) and the master pangkat lookup.
- Unknown ranks are kept verbatim so the validation layer can flag them as UNKNOWN_PANGKAT for HITL review.
- Rows with neither nrp nor nama are dropped (separators / merged cells).

New module pipeline/table.py:
- DetectedTable dataclass (cells + html).
- parse_table_html: tag/entity-tolerant HTML -> 2D grid.
- extract_tables_from_pp_result: filter PP-Structure regions to type=table.
- run_table_extraction: top-level entrypoint with lazy-init singleton for the heavy PP-Structure engine.

Orchestrator now invokes table extraction (gated by TABLES_ENABLED) on every preprocessed page and merges the discovered personnel into the ExtractionResult. Failures are caught and logged so a flaky table recognizer never blocks header extraction.

Tests: 38 new unit tests covering HTML parsing, region filtering, header classification, column mapping (split, combined, glued cells), and end-to-end personnel extraction. Total 108 tests, all green. PaddleOCR / PP-Structure remain optional - no test imports them.
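To illustrate why the NRP pattern needs look-arounds rather than word boundaries, here is a minimal sketch. It is not the code from pipeline/extract/personnel.py: the names `NRP_RE` and `split_rank_and_nrp` and the separator set being stripped are assumptions made for this example.

```python
from __future__ import annotations

import re

# Exactly eight digits that are not part of a longer digit run.
# Look-arounds only inspect neighbouring digits, so letters glued to the
# number (as in 'BRIPKA98050505') do not prevent a match.
NRP_RE = re.compile(r"(?<!\d)\d{8}(?!\d)")

# A word-boundary version fails on glued forms, because there is no \b
# between the letter 'A' and the digit '9' (both are word characters).
NAIVE_NRP_RE = re.compile(r"\b\d{8}\b")


def split_rank_and_nrp(text: str) -> tuple[str | None, str | None]:
    """Split a merged rank/NRP cell into (rank_text, nrp).

    Hypothetical helper: the rank text is returned verbatim (not validated),
    and nrp is None when no 8-digit number is found.
    """
    match = NRP_RE.search(text)
    if match is None:
        rank = text.strip(" /,-")
        return (rank or None, None)
    # Everything before the number is treated as the rank text.
    rank = text[: match.start()].strip(" /,-")
    return (rank or None, match.group())
```

With this sketch, `split_rank_and_nrp("BRIPKA98050505")` yields `("BRIPKA", "98050505")`, whereas `NAIVE_NRP_RE.search("BRIPKA98050505")` finds nothing.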
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: fix header misclassification for combined Pangkat/NRP/Nama columns

Devin Review caught two related bugs in personnel column mapping:

1. _classify_header_cell iterated _HEADER_SYNONYMS in insertion order when falling back to substring matching. The dict listed shorter keywords first ('pangkat' before 'pangkat / nrp'), so a header like 'Pangkat / NRP / Nama' was classified as plain 'pangkat'. map_row then tried to normalize the whole 'AKP 87010101 Budi Santoso' cell as a rank, normalize_pangkat returned None, and the row failed the nrp-or-nama gate at the bottom of map_row -- silently dropping every personnel row in tables using this layout.
2. _split_pangkat_nrp_nama existed and was unit-tested but was never wired up in map_row, so even if classification had worked, the three-way split would not have run. The module docstring claimed the split was supported.

Fix:
- Iterate the synonym table sorted by keyword length descending in the substring-match fallback so the most specific synonym wins.
- Add 'pangkat_nrp_nama' synonym entries for typical separators (' / ', '/', whitespace, comma).
- Wire 'pangkat_nrp_nama' into map_row using the existing helper.
- Update is_personnel_table so combined headers count as both an id signal and a name signal.

Tests: 6 new asserts (parametrized), 1 regression test for triple-combined header end-to-end, 1 dedicated map_row test for the new column type. 114 tests total, all green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: handle multi-word Polri ranks in _split_pangkat_nrp_nama

Devin Review caught: the token-by-token is_valid_pangkat() check could not recognize multi-word ranks ('KOMBES POL', 'BRIGJEN POL', 'IRJEN POL', 'KOMJEN POL', 'JENDERAL POL'). For 'KOMBES POL 88123456 John Doe' the old code returned pangkat=None, nama='KOMBES POL John Doe', and the validator's UNKNOWN_PANGKAT flag never fired because pangkat was falsy.

New behavior: greedy longest-prefix match. After stripping the NRP we try the leading 3-token, 2-token, 1-token slice against normalize_pangkat() and take the longest that maps to a canonical rank. Tokens after the matched rank become the nama. Unknown ranks fall through to pangkat=None and the rank text stays in the nama field, where downstream validation already flags the row.

Tests: 5 new asserts (4 multi-word ranks + 1 unknown-rank fallback), 119 total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: don't count pangkat_nrp as a name signal in is_personnel_table

Devin Review caught: a table with header ['No', 'Pangkat / NRP', 'Jabatan'] (no name column) was wrongly classified as a personnel table because pangkat_nrp was lumped into has_name. Such a table would produce PersonnelEntry rows with nama=None passing the nrp-or-nama gate, polluting the personel[] output with id-only fragments.

Split the combined-cell set into combined_id (counts toward has_id) and combined_name (counts toward has_name). Only pangkat_nrp_nama, which actually embeds a name, qualifies for has_name. pangkat_nrp remains an id-only signal.

Tests: 3 new asserts (rejects id-only, accepts pangkat_nrp + separate nama, accepts pangkat_nrp_nama). 122 total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
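The greedy longest-prefix match described in the multi-word rank fix can be sketched roughly as follows. This is an illustration under assumptions, not the committed implementation: `KNOWN_RANKS` is a small hand-picked subset standing in for the master pangkat lookup, and `normalize_rank` and `split_rank_nrp_name` are hypothetical stand-ins for normalize_pangkat and _split_pangkat_nrp_nama.

```python
from __future__ import annotations

import re

# Eight digits not embedded in a longer digit run (works on glued forms).
NRP_RE = re.compile(r"(?<!\d)\d{8}(?!\d)")

# Assumption: an illustrative subset of canonical ranks.
KNOWN_RANKS = {
    "AKP", "IPDA", "BRIPKA", "KOMPOL",
    "KOMBES POL", "BRIGJEN POL", "IRJEN POL", "KOMJEN POL", "JENDERAL POL",
}


def normalize_rank(text: str) -> str | None:
    """Return the canonical rank for text, or None if it is not a known rank."""
    candidate = " ".join(text.upper().split())
    return candidate if candidate in KNOWN_RANKS else None


def split_rank_nrp_name(text: str) -> tuple[str | None, str | None, str | None]:
    """Split a merged 'rank NRP name' cell into (rank, nrp, name)."""
    match = NRP_RE.search(text)
    if match is None:
        nrp, remainder = None, text
    else:
        nrp = match.group()
        # Remove the NRP and keep the text on both sides of it.
        remainder = text[: match.start()] + " " + text[match.end():]

    tokens = remainder.replace("/", " ").split()

    # Try the longest leading slice first so 'KOMBES POL' wins over 'KOMBES'.
    for size in (3, 2, 1):
        if len(tokens) < size:
            continue
        rank = normalize_rank(" ".join(tokens[:size]))
        if rank is not None:
            name = " ".join(tokens[size:]) or None
            return rank, nrp, name

    # Unknown rank: leave everything in the name so validation can flag it.
    return None, nrp, " ".join(tokens) or None
```

Trying the 3-token slice before the shorter ones is what keeps a trailing 'POL' from leaking into the name; checking single tokens first would never see 'KOMBES POL' as one unit.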
committed by GitHub
parent 812ea7e030
commit 33b38aacc7

tests/unit/test_personnel_mapper.py (new file, 300 lines)
@@ -0,0 +1,300 @@
"""Tests for the personnel-row mapper."""

from __future__ import annotations

import pytest

from ocr_sprint.pipeline.extract.personnel import (
    _classify_header_cell,
    _split_pangkat_nrp,
    _split_pangkat_nrp_nama,
    detect_header_row,
    extract_personnel,
    is_personnel_table,
    map_row,
)
from ocr_sprint.pipeline.table import DetectedTable

# ---------- header detection ----------


class TestClassifyHeaderCell:
    @pytest.mark.parametrize(
        ("text", "expected"),
        [
            ("No", "no"),
            ("NO.", "no"),
            ("Nomor", "no"),
            ("Pangkat", "pangkat"),
            ("NRP", "nrp"),
            ("Pangkat / NRP", "pangkat_nrp"),
            ("PANGKAT/NRP", "pangkat_nrp"),
            ("Pangkat / NRP / Nama", "pangkat_nrp_nama"),
            ("PANGKAT/NRP/NAMA", "pangkat_nrp_nama"),
            ("Pangkat, NRP, Nama", "pangkat_nrp_nama"),
            ("Nama", "nama"),
            ("Nama Lengkap", "nama"),
            ("Jabatan dalam Dinas", "jabatan_dinas"),
            ("Jabatan dalam Sprint", "jabatan_sprint"),
            ("Keterangan", "keterangan"),
        ],
    )
    def test_known_header(self, text: str, expected: str) -> None:
        assert _classify_header_cell(text) == expected

    def test_substring_match_prefers_longest_synonym(self) -> None:
        # 'pangkat' is a shorter prefix of 'pangkat / nrp / nama'. Without
        # length-sorted iteration we'd misclassify combined headers as plain
        # 'pangkat' and downstream map_row would drop every row.
        assert _classify_header_cell("Pangkat / NRP / Nama Personel") == "pangkat_nrp_nama"
        assert _classify_header_cell("Pangkat / NRP Polri") == "pangkat_nrp"

    def test_unknown_header(self) -> None:
        assert _classify_header_cell("Random Text") is None
        assert _classify_header_cell("") is None


class TestDetectHeaderRow:
    def test_detects_first_row_as_header(self) -> None:
        table = DetectedTable(
            cells=[
                ["No", "Pangkat", "NRP", "Nama"],
                ["1", "AKP", "87010101", "Budi"],
            ]
        )
        result = detect_header_row(table)
        assert result is not None
        idx, mapping = result
        assert idx == 0
        assert mapping == ["no", "pangkat", "nrp", "nama"]

    def test_detects_second_row_when_first_is_title(self) -> None:
        table = DetectedTable(
            cells=[
                ["DAFTAR PERSONEL"],  # title row, not a header
                ["No", "Pangkat / NRP", "Nama", "Jabatan dalam Dinas"],
                ["1", "AKP 87010101", "Budi", "Kanit"],
            ]
        )
        result = detect_header_row(table)
        assert result is not None
        idx, _ = result
        assert idx == 1

    def test_returns_none_when_no_header_found(self) -> None:
        table = DetectedTable(cells=[["foo", "bar"], ["baz", "qux"]])
        assert detect_header_row(table) is None


# ---------- combined-cell splitting ----------


class TestSplitPangkatNrp:
    @pytest.mark.parametrize(
        ("text", "expected"),
        [
            ("AKP 87010101", ("AKP", "87010101")),
            ("IPDA / 92030404", ("IPDA", "92030404")),
            ("BRIPKA98050505", ("BRIPKA", "98050505")),
            ("KOMPOL 88123456", ("KOMPOL", "88123456")),
        ],
    )
    def test_known_combos(self, text: str, expected: tuple[str, str]) -> None:
        assert _split_pangkat_nrp(text) == expected

    def test_returns_none_when_no_nrp(self) -> None:
        pangkat, nrp = _split_pangkat_nrp("AKP")
        assert pangkat == "AKP"
        assert nrp is None


class TestSplitPangkatNrpNama:
    def test_three_way_split(self) -> None:
        pangkat, nrp, nama = _split_pangkat_nrp_nama("AKP 87010101 Budi Santoso")
        assert pangkat == "AKP"
        assert nrp == "87010101"
        assert nama == "Budi Santoso"

    @pytest.mark.parametrize(
        ("text", "expected_pangkat", "expected_name"),
        [
            # multi-word ranks must be matched as contiguous token sequences,
            # otherwise tokens like 'POL' would leak into the name.
            ("KOMBES POL 88123456 John Doe", "KOMBES POL", "John Doe"),
            ("BRIGJEN POL 99887766 Jane Doe", "BRIGJEN POL", "Jane Doe"),
            ("IRJEN POL 77665544 Ahmad Hidayat", "IRJEN POL", "Ahmad Hidayat"),
            ("JENDERAL POL 11223344 Sari Wulandari", "JENDERAL POL", "Sari Wulandari"),
        ],
    )
    def test_multi_word_ranks(self, text: str, expected_pangkat: str, expected_name: str) -> None:
        pangkat, _nrp, nama = _split_pangkat_nrp_nama(text)
        assert pangkat == expected_pangkat
        assert nama == expected_name

    def test_unknown_rank_returns_none_pangkat(self) -> None:
        pangkat, nrp, nama = _split_pangkat_nrp_nama("Foobar 87010101 Budi Santoso")
        assert pangkat is None
        assert nrp == "87010101"
        # name keeps the unknown rank token; validators will flag the row.
        assert nama == "Foobar Budi Santoso"


# ---------- row mapping ----------


class TestMapRow:
    def test_split_columns_polres_layout(self) -> None:
        mapping = ["no", "pangkat", "nrp", "nama", "jabatan_dinas", "jabatan_sprint"]
        row = ["1", "AKP", "87010101", "Budi Santoso", "Kanit Reskrim", "Ketua Tim"]
        entry = map_row(row, mapping)
        assert entry is not None
        assert entry.no == 1
        assert entry.pangkat == "AKP"
        assert entry.nrp == "87010101"
        assert entry.nama == "Budi Santoso"
        assert entry.jabatan_dinas == "Kanit Reskrim"
        assert entry.jabatan_sprint == "Ketua Tim"

    def test_combined_pangkat_nrp_nama_cell(self) -> None:
        mapping = ["no", "pangkat_nrp_nama", "jabatan_dinas", "jabatan_sprint"]
        row = ["1", "AKP 87010101 Budi Santoso", "Kanit Reskrim", "Ketua Tim"]
        entry = map_row(row, mapping)
        assert entry is not None
        assert entry.no == 1
        assert entry.pangkat == "AKP"
        assert entry.nrp == "87010101"
        assert entry.nama == "Budi Santoso"
        assert entry.jabatan_dinas == "Kanit Reskrim"
        assert entry.jabatan_sprint == "Ketua Tim"

    def test_combined_pangkat_nrp_cell(self) -> None:
        mapping = ["no", "pangkat_nrp", "nama", "jabatan_dinas"]
        row = ["1", "AKP 87010101", "Budi Santoso", "Kanit Reskrim"]
        entry = map_row(row, mapping)
        assert entry is not None
        assert entry.pangkat == "AKP"
        assert entry.nrp == "87010101"
        assert entry.nama == "Budi Santoso"

    def test_skips_row_without_nama_or_nrp(self) -> None:
        mapping = ["no", "pangkat"]
        row = ["", ""]
        assert map_row(row, mapping) is None

    def test_unknown_pangkat_kept_verbatim(self) -> None:
        mapping = ["no", "pangkat", "nrp", "nama"]
        row = ["1", "Foobar", "87010101", "Budi"]
        entry = map_row(row, mapping)
        assert entry is not None
        # unknown pangkat is preserved so the validation layer can flag it
        assert entry.pangkat == "Foobar"


# ---------- end-to-end extraction ----------


class TestExtractPersonnel:
    def test_full_table_with_header(self) -> None:
        table = DetectedTable(
            cells=[
                [
                    "No",
                    "Pangkat / NRP",
                    "Nama",
                    "Jabatan dalam Dinas",
                    "Jabatan dalam Sprint",
                ],
                ["1", "AKP 87010101", "Budi Santoso", "Kanit Reskrim", "Ketua Tim"],
                ["2", "IPDA 92030404", "Sari Wulandari", "Banit Reskrim", "Anggota"],
                ["3", "BRIPKA 98050505", "Ahmad Hidayat", "Banit Reskrim", "Anggota"],
            ]
        )
        entries = extract_personnel([table])
        assert len(entries) == 3
        assert entries[0].nama == "Budi Santoso"
        assert entries[0].nrp == "87010101"
        assert entries[1].pangkat == "IPDA"
        assert entries[2].pangkat == "BRIPKA"

    def test_full_table_with_triple_combined_header(self) -> None:
        # Regression test for header misclassification: 'Pangkat / NRP / Nama'
        # used to be classified as 'pangkat' due to substring matching, which
        # silently dropped every personnel row.
        table = DetectedTable(
            cells=[
                ["No", "Pangkat / NRP / Nama", "Jabatan dalam Sprint"],
                ["1", "AKP 87010101 Budi Santoso", "Ketua Tim"],
                ["2", "IPDA 92030404 Sari Wulandari", "Anggota"],
            ]
        )
        entries = extract_personnel([table])
        assert len(entries) == 2
        assert entries[0].pangkat == "AKP"
        assert entries[0].nrp == "87010101"
        assert entries[0].nama == "Budi Santoso"
        assert entries[1].nama == "Sari Wulandari"

    def test_skips_non_personnel_table(self) -> None:
        table = DetectedTable(
            cells=[["Tahun", "Anggaran"], ["2024", "100M"]],
        )
        assert extract_personnel([table]) == []

    def test_concatenates_multiple_personnel_tables(self) -> None:
        t1 = DetectedTable(
            cells=[
                ["No", "Pangkat", "NRP", "Nama"],
                ["1", "AKP", "87010101", "Budi"],
            ]
        )
        t2 = DetectedTable(
            cells=[
                ["No", "Pangkat", "NRP", "Nama"],
                ["1", "IPDA", "92030404", "Sari"],
            ]
        )
        entries = extract_personnel([t1, t2])
        assert len(entries) == 2
        assert entries[0].nama == "Budi"
        assert entries[1].nama == "Sari"


class TestIsPersonnelTable:
    def test_matches_with_pangkat_and_nama(self) -> None:
        table = DetectedTable(
            cells=[["No", "Pangkat", "NRP", "Nama"], ["1", "AKP", "87010101", "X"]]
        )
        assert is_personnel_table(table) is True

    def test_rejects_unrelated_table(self) -> None:
        table = DetectedTable(cells=[["A", "B"], ["1", "2"]])
        assert is_personnel_table(table) is False

    def test_rejects_id_only_table_without_name_column(self) -> None:
        # 'Pangkat / NRP' carries id but no name; without a name signal
        # this should not be classified as a personnel table.
        table = DetectedTable(
            cells=[
                ["No", "Pangkat / NRP", "Jabatan"],
                ["1", "AKP 87010101", "Kanit Reskrim"],
            ]
        )
        assert is_personnel_table(table) is False

    def test_accepts_pangkat_nrp_when_separate_nama_present(self) -> None:
        table = DetectedTable(
            cells=[
                ["No", "Pangkat / NRP", "Nama"],
                ["1", "AKP 87010101", "Budi"],
            ]
        )
        assert is_personnel_table(table) is True

    def test_accepts_pangkat_nrp_nama_combined(self) -> None:
        table = DetectedTable(
            cells=[
                ["No", "Pangkat / NRP / Nama", "Jabatan"],
                ["1", "AKP 87010101 Budi", "Kanit"],
            ]
        )
        assert is_personnel_table(table) is True
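
For readers without access to pipeline/extract/personnel.py, the behavior these header and table-detection tests exercise might be sketched like this. Everything here is an assumption made for illustration: the synonym table is a tiny subset, the normalization rule is a guess, and `classify_header` and `looks_like_personnel_header` are hypothetical names rather than the module's real functions.

```python
from __future__ import annotations

import re

# Assumption: a small illustrative synonym table (normalized phrase -> field).
HEADER_SYNONYMS = {
    "no": "no",
    "pangkat": "pangkat",
    "nrp": "nrp",
    "nama": "nama",
    "pangkat / nrp": "pangkat_nrp",
    "pangkat / nrp / nama": "pangkat_nrp_nama",
}

# Fields that signal an identifier and fields that signal a name.
ID_FIELDS = {"pangkat", "nrp", "pangkat_nrp", "pangkat_nrp_nama"}
NAME_FIELDS = {"nama", "pangkat_nrp_nama"}  # pangkat_nrp is id-only


def classify_header(text: str) -> str | None:
    """Map a header cell to a canonical field name, or None if unknown."""
    # Normalize case, slash spacing, and runs of whitespace.
    normalized = re.sub(r"\s*/\s*", " / ", text.lower())
    normalized = " ".join(normalized.split())
    if normalized in HEADER_SYNONYMS:
        return HEADER_SYNONYMS[normalized]
    # Substring fallback, longest synonym first, so 'pangkat / nrp / nama'
    # is tried before the shorter 'pangkat' it contains.
    for phrase in sorted(HEADER_SYNONYMS, key=len, reverse=True):
        if phrase in normalized:
            return HEADER_SYNONYMS[phrase]
    return None


def looks_like_personnel_header(headers: list[str]) -> bool:
    """Require both an id signal and a name signal among the header cells."""
    fields = {classify_header(cell) for cell in headers}
    has_id = bool(fields & ID_FIELDS)
    has_name = bool(fields & NAME_FIELDS)
    return has_id and has_name
```

Sorting by length on every call is wasteful for a large table of synonyms; a real implementation would more likely precompute the sorted order once.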