* Phase 3: PP-Structure table extraction + personnel column mapper

Adds the personnel-table stage of the pipeline. PaddleOCR's PP-Structure
recognizes table regions and emits HTML, which we parse into a 2D cell
grid. A separate column mapper detects the header row, classifies each
column to a canonical PersonnelEntry field via a synonym dictionary, and
walks the data rows.

Variant handling:
- Different satuan use different column orders and header phrasing.
  Supported synonyms for each canonical field are listed in
  pipeline/extract/personnel.py (Pangkat / NRP / Pangkat-NRP combo /
  Nama / Jabatan dalam Dinas / Jabatan dalam Sprint / Keterangan).
- A merged 'PANGKAT NRP' or 'PANGKAT NRP NAMA' cell is split using the
  8-digit NRP regex (with look-arounds so glued forms like
  'BRIPKA98050505' work) and the master pangkat lookup.
- Unknown ranks are kept verbatim so the validation layer can flag them
  as UNKNOWN_PANGKAT for HITL review.
- Rows without nrp AND nama are dropped (separators / merged cells).

New module pipeline/table.py:
- DetectedTable dataclass (cells + html).
- parse_table_html: tag/entity-tolerant HTML -> 2D grid.
- extract_tables_from_pp_result: filter PP-Structure regions to
  type=table.
- run_table_extraction: top-level entrypoint with lazy-init singleton
  for the heavy PP-Structure engine.

Orchestrator now invokes table extraction (gated by TABLES_ENABLED) on
every preprocessed page and merges the discovered personnel into the
ExtractionResult. Failures are caught and logged so a flaky table
recognizer never blocks header extraction.

Tests: 38 new unit tests covering HTML parsing, region filtering, header
classification, column mapping (split, combined, glued cells), and
end-to-end personnel extraction. Total 108 tests, all green.
PaddleOCR / PP-Structure remain optional - no test imports them.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: fix header misclassification for combined Pangkat/NRP/Nama columns

Devin Review caught two related bugs in personnel column mapping:

1. _classify_header_cell iterated _HEADER_SYNONYMS in insertion order
   when falling back to substring matching. The dict listed shorter
   keywords first ('pangkat' before 'pangkat / nrp'), so a header like
   'Pangkat / NRP / Nama' classified as plain 'pangkat'. map_row then
   tried to normalize the whole '"AKP 87010101 Budi Santoso"' cell as a
   rank, normalize_pangkat returned None, and the row failed the
   nrp-or-nama gate at the bottom of map_row -- silently dropping every
   personnel row in tables using this layout.
2. _split_pangkat_nrp_nama existed and was unit-tested but was never
   wired up in map_row, so even if classification had worked, the
   three-way split would not have run. The module docstring claimed the
   split was supported.

Fix:
- Iterate the synonym table sorted by keyword length descending in the
  substring-match fallback so the most specific synonym wins.
- Add 'pangkat_nrp_nama' synonym entries for typical separators
  (' / ', '/', whitespace, comma).
- Wire 'pangkat_nrp_nama' into map_row using the existing helper.
- Update is_personnel_table so combined headers count as both an id
  signal and a name signal.

Tests: 6 new asserts (parametrized), 1 regression test for
triple-combined header end-to-end, 1 dedicated map_row test for the new
column type. 114 tests total, all green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: handle multi-word Polri ranks in _split_pangkat_nrp_nama

Devin Review caught: token-by-token is_valid_pangkat() check could not
recognize multi-word ranks ('KOMBES POL', 'BRIGJEN POL', 'IRJEN POL',
'KOMJEN POL', 'JENDERAL POL'). For 'KOMBES POL 88123456 John Doe' the
old code returned pangkat=None, nama='KOMBES POL John Doe', and the
validator's UNKNOWN_PANGKAT flag never fired because pangkat was falsy.

New behavior: greedy longest-prefix match. After stripping the NRP we
try the leading 3-token, 2-token, 1-token slice against
normalize_pangkat() and take the longest that maps to a canonical rank.
Tokens after the matched rank become the nama. Unknown ranks fall
through to pangkat=None and the rank text stays in the nama field, where
downstream validation already flags the row.

Tests: 5 new asserts (4 multi-word ranks + 1 unknown-rank fallback),
119 total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: don't count pangkat_nrp as a name signal in is_personnel_table

Devin Review caught: a table with header ['No', 'Pangkat / NRP',
'Jabatan'] (no name column) was wrongly classified as a personnel table
because pangkat_nrp was lumped into has_name. Such a table would produce
PersonnelEntry rows with nama=None passing the nrp-or-nama gate,
polluting the personel[] output with id-only fragments.

Split the combined-cell set into combined_id (counts toward has_id) and
combined_name (counts toward has_name). Only pangkat_nrp_nama, which
actually embeds a name, qualifies for has_name. pangkat_nrp remains an
id-only signal.

Tests: 3 new asserts (rejects id-only, accepts pangkat_nrp + separate
nama, accepts pangkat_nrp_nama). 122 total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
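
The greedy longest-prefix split described in the third commit can be sketched roughly as follows. This is an illustration only, not the project's code: `_KNOWN_PANGKAT`, `_NRP_RE`, the `normalize_pangkat` body, and `split_pangkat_nrp_nama` are hypothetical stand-ins, and the real master pangkat lookup is assumed to be much larger.

```python
import re

# Hypothetical stand-in for the master pangkat lookup (assumption).
_KNOWN_PANGKAT = {"AKP", "IPDA", "BRIPKA", "KOMBES POL", "BRIGJEN POL"}

# 8-digit NRP; the look-arounds reject longer digit runs but still
# match glued forms such as 'BRIPKA98050505'.
_NRP_RE = re.compile(r"(?<!\d)\d{8}(?!\d)")


def normalize_pangkat(text):
    """Return the canonical rank, or None if the text is not a known rank."""
    candidate = " ".join(text.upper().split())
    return candidate if candidate in _KNOWN_PANGKAT else None


def split_pangkat_nrp_nama(cell):
    """Split a combined cell into (pangkat, nrp, nama)."""
    match = _NRP_RE.search(cell)
    nrp = match.group(0) if match else None
    # Replace the NRP with a space so glued neighbours become tokens.
    rest = cell[: match.start()] + " " + cell[match.end():] if match else cell
    tokens = rest.split()
    # Try the longest leading slice first so 'KOMBES POL' beats 'KOMBES'.
    for size in (3, 2, 1):
        if len(tokens) >= size:
            pangkat = normalize_pangkat(" ".join(tokens[:size]))
            if pangkat is not None:
                return pangkat, nrp, " ".join(tokens[size:]) or None
    # Unknown rank: leave the rank text in nama for validation to flag.
    return None, nrp, " ".join(tokens) or None
```

Under these assumptions, 'KOMBES POL 88123456 John Doe' splits into the rank 'KOMBES POL', the NRP '88123456', and the name 'John Doe', while an unrecognized rank stays in front of the name with pangkat left as None.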
95 lines
3.2 KiB
Python
"""Tests for the PP-Structure table parsing helpers (no paddleocr required)."""

from __future__ import annotations

import pytest

from ocr_sprint.pipeline.table import (
    DetectedTable,
    extract_tables_from_pp_result,
    parse_table_html,
)


class TestParseTableHtml:
    def test_simple_grid(self) -> None:
        html_str = """
        <html><body><table>
        <tr><td>No</td><td>Pangkat</td><td>NRP</td><td>Nama</td></tr>
        <tr><td>1</td><td>AKP</td><td>87010101</td><td>Budi Santoso</td></tr>
        <tr><td>2</td><td>IPDA</td><td>92030404</td><td>Sari Wulandari</td></tr>
        </table></body></html>
        """
        rows = parse_table_html(html_str)
        assert rows == [
            ["No", "Pangkat", "NRP", "Nama"],
            ["1", "AKP", "87010101", "Budi Santoso"],
            ["2", "IPDA", "92030404", "Sari Wulandari"],
        ]

    def test_handles_th_and_entities_and_inline_tags(self) -> None:
        html_str = (
            "<table><tr><th>Pangkat / NRP</th><th>Nama</th></tr>"
            "<tr><td>AKP <b>87010101</b></td><td>Budi Santoso</td></tr></table>"
        )
        rows = parse_table_html(html_str)
        assert rows[0] == ["Pangkat / NRP", "Nama"]
        assert rows[1] == ["AKP 87010101", "Budi Santoso"]

    def test_empty_table_returns_empty_list(self) -> None:
        assert parse_table_html("<table></table>") == []
        assert parse_table_html("") == []


class TestExtractTablesFromPpResult:
    def test_filters_table_regions_and_parses_html(self) -> None:
        pp_result = [
            {"type": "text", "res": [{"text": "ignore me", "confidence": 0.9}]},
            {
                "type": "table",
                "res": {
                    "html": "<table><tr><td>A</td><td>B</td></tr></table>",
                    "cell_bbox": [],
                },
            },
            {
                "type": "table",
                "res": {"html": ""},  # empty html → ignored
            },
            {
                "type": "figure",
                "res": [],
            },
        ]
        tables = extract_tables_from_pp_result(pp_result)
        assert len(tables) == 1
        assert tables[0].cells == [["A", "B"]]

    def test_no_tables_returns_empty_list(self) -> None:
        pp_result = [{"type": "text", "res": [{"text": "x"}]}]
        assert extract_tables_from_pp_result(pp_result) == []


class TestDetectedTable:
    def test_dimensions(self) -> None:
        table = DetectedTable(cells=[["a", "b", "c"], ["d", "e"]])
        assert table.n_rows == 2
        assert table.n_cols == 3

    def test_zero_rows(self) -> None:
        table = DetectedTable()
        assert table.n_rows == 0
        assert table.n_cols == 0


@pytest.fixture
def sample_personnel_table() -> DetectedTable:
    """Header + three personnel rows in a typical Polres-level format."""
    cells = [
        ["No", "Pangkat / NRP", "Nama", "Jabatan dalam Dinas", "Jabatan dalam Sprint"],
        ["1", "AKP 87010101", "Budi Santoso", "Kanit Reskrim", "Ketua Tim"],
        ["2", "IPDA 92030404", "Sari Wulandari", "Banit Reskrim", "Anggota"],
        ["3", "BRIPKA 98050505", "Ahmad Hidayat", "Banit Reskrim", "Anggota"],
    ]
    return DetectedTable(cells=cells)
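
For readers without the module at hand, the behavior these tests expect from parse_table_html (treating th like td, dropping inline tags, decoding entities, returning an empty list for empty input) could be reproduced with the standard library alone. The sketch below is an assumption about one possible approach, not the actual implementation in pipeline/table.py; `_GridParser` and `sketch_parse_table_html` are hypothetical names.

```python
from html.parser import HTMLParser


class _GridParser(HTMLParser):
    """Collect <td>/<th> text into rows; every other tag is ignored."""

    def __init__(self):
        # convert_charrefs=True decodes entities before handle_data runs.
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None   # cells of the <tr> currently open, if any
        self._cell = None  # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is discarded.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments split by inline tags and collapse whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def sketch_parse_table_html(html_str):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = _GridParser()
    parser.feed(html_str)
    parser.close()
    return parser.rows
```

Because inline tags such as `<b>` produce no text of their own, a cell like `AKP <b>87010101</b>` comes out as a single string, which is the shape the combined-column mapper needs.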