Phase 3: PP-Structure table extraction + personnel column mapper (#2)

* Phase 3: PP-Structure table extraction + personnel column mapper

Adds the personnel-table stage of the pipeline. PaddleOCR's PP-Structure
recognizes table regions and emits HTML, which we parse into a 2D cell
grid. A separate column mapper detects the header row, classifies each
column to a canonical PersonnelEntry field via a synonym dictionary,
and walks the data rows.

Variant handling:
- Different satuan use different column orders and header phrasing.
  Supported synonyms for each canonical field are listed in
  pipeline/extract/personnel.py (Pangkat / NRP / Pangkat-NRP combo /
  Nama / Jabatan dalam Dinas / Jabatan dalam Sprint / Keterangan).
- A merged 'PANGKAT NRP' or 'PANGKAT NRP NAMA' cell is split using
  the 8-digit NRP regex (with look-arounds so glued forms like
  'BRIPKA98050505' work) and the master pangkat lookup.
- Unknown ranks are kept verbatim so the validation layer can flag
  them as UNKNOWN_PANGKAT for HITL review.
- Rows without nrp AND nama are dropped (separators / merged cells).
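
A minimal sketch of why the NRP pattern uses digit look-arounds rather than word boundaries. The regex is the one described above; `find_nrp` and the sample cells are illustrative names, not part of the module.

```python
from __future__ import annotations

import re

# Exactly eight digits that are not part of a longer digit run.
NRP_RE = re.compile(r"(?<!\d)(\d{8})(?!\d)")
# Word-boundary variant, shown only for comparison.
WORD_BOUNDARY_RE = re.compile(r"\b\d{8}\b")


def find_nrp(cell: str) -> str | None:
    """Return the first standalone 8-digit run in the cell, or None."""
    match = NRP_RE.search(cell)
    return match.group(1) if match else None


print(find_nrp("BRIPKA98050505"))  # 98050505 (rank glued onto the digits)
print(find_nrp("AKP / 87010101"))  # 87010101
print(find_nrp("123456789"))       # None (nine digits, so no 8-digit NRP)
# The \b form misses the glued case: 'A' and '9' are both word characters.
print(WORD_BOUNDARY_RE.search("BRIPKA98050505"))  # None
```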

New module pipeline/table.py:
- DetectedTable dataclass (cells + html).
- parse_table_html: tag/entity-tolerant HTML -> 2D grid.
- extract_tables_from_pp_result: filter PP-Structure regions to type=table.
- run_table_extraction: top-level entrypoint with lazy-init singleton
  for the heavy PP-Structure engine.
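
The lazy-init singleton can be sketched in isolation as follows. `HeavyEngine` is a hypothetical stand-in for the expensive PP-Structure engine, and the thread count is arbitrary; the point is only that concurrent callers trigger a single build.

```python
from __future__ import annotations

from threading import Lock, Thread


class HeavyEngine:
    """Hypothetical stand-in for an engine that is expensive to construct."""

    builds = 0  # counts how many times the constructor ran

    def __init__(self) -> None:
        HeavyEngine.builds += 1


_lock = Lock()
_instance: HeavyEngine | None = None


def get_engine() -> HeavyEngine:
    """Build the engine on first use; later calls reuse the same object."""
    global _instance
    if _instance is None:  # fast path once the engine exists
        with _lock:
            if _instance is None:  # re-check: another thread may have built it
                _instance = HeavyEngine()
    return _instance


seen: list[HeavyEngine] = []
threads = [Thread(target=lambda: seen.append(get_engine())) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(HeavyEngine.builds)  # 1
```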

Orchestrator now invokes table extraction (gated by TABLES_ENABLED) on
every preprocessed page and merges the discovered personnel into the
ExtractionResult. Failures are caught and logged so a flaky table
recognizer never blocks header extraction.
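
A small sketch of the per-page failure isolation described above, assuming a hypothetical `flaky_table_recognizer` that crashes on one page. The real orchestrator calls run_table_extraction and a structured logger; the standard-library logger here is a substitution for illustration.

```python
from __future__ import annotations

import logging

logger = logging.getLogger("pipeline.sketch")


def flaky_table_recognizer(page: str) -> list[str]:
    """Hypothetical recognizer that fails on one specific page."""
    if page == "page-2":
        raise RuntimeError("recognizer crashed")
    return [f"table-from-{page}"]


def collect_tables(pages: list[str]) -> list[str]:
    """Gather tables page by page; a failing page is logged and skipped."""
    tables: list[str] = []
    for page in pages:
        try:
            tables.extend(flaky_table_recognizer(page))
        except Exception as exc:  # deliberately broad: tables are optional
            logger.warning("table extraction failed on %s: %s", page, exc)
    return tables


print(collect_tables(["page-1", "page-2", "page-3"]))
# ['table-from-page-1', 'table-from-page-3']
```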

Tests: 38 new unit tests covering HTML parsing, region filtering,
header classification, column mapping (split, combined, glued cells),
and end-to-end personnel extraction. Total 108 tests, all green.
PaddleOCR / PP-Structure remain optional - no test imports them.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: fix header misclassification for combined Pangkat/NRP/Nama columns

Devin Review caught two related bugs in personnel column mapping:

1. _classify_header_cell iterated _HEADER_SYNONYMS in insertion order
   when falling back to substring matching. The dict listed shorter
   keywords first ('pangkat' before 'pangkat / nrp'), so a header like
   'Pangkat / NRP / Nama' classified as plain 'pangkat'. map_row then
   tried to normalize the whole 'AKP 87010101 Budi Santoso' cell
   as a rank, normalize_pangkat returned None, and the row failed the
   nrp-or-nama gate at the bottom of map_row -- silently dropping
   every personnel row in tables using this layout.

2. _split_pangkat_nrp_nama existed and was unit-tested but was never
   wired up in map_row, so even if classification had worked, the
   three-way split would not have run. The module docstring claimed
   the split was supported.

Fix:
- Iterate the synonym table sorted by keyword length descending in the
  substring-match fallback so the most specific synonym wins.
- Add 'pangkat_nrp_nama' synonym entries for typical separators
  (' / ', '/', whitespace, comma).
- Wire 'pangkat_nrp_nama' into map_row using the existing helper.
- Update is_personnel_table so combined headers count as both an id
  signal and a name signal.
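
The ordering problem and its fix can be reproduced with a four-entry table. This is a sketch under simplifying assumptions: `SYNONYMS` is a cut-down stand-in for _HEADER_SYNONYMS, and `classify` with its `longest_first` switch is a hypothetical helper used only to compare the two iteration orders.

```python
from __future__ import annotations

# Cut-down synonym table, shorter keywords first (as in the buggy ordering).
SYNONYMS = {
    "pangkat": "pangkat",
    "pangkat / nrp": "pangkat_nrp",
    "pangkat / nrp / nama": "pangkat_nrp_nama",
    "nama": "nama",
}


def classify(header: str, longest_first: bool) -> str | None:
    """Substring-match a header against the table in the chosen order."""
    norm = " ".join(header.lower().split())
    items = list(SYNONYMS.items())
    if longest_first:
        items.sort(key=lambda kv: -len(kv[0]))
    for keyword, canonical in items:
        if keyword in norm:
            return canonical
    return None


header = "Pangkat / NRP / Nama Personel"
print(classify(header, longest_first=False))  # pangkat
print(classify(header, longest_first=True))   # pangkat_nrp_nama
```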

Tests: 6 new asserts (parametrized), 1 regression test for triple-
combined header end-to-end, 1 dedicated map_row test for the new
column type. 114 tests total, all green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: handle multi-word Polri ranks in _split_pangkat_nrp_nama

Devin Review caught: token-by-token is_valid_pangkat() check could not
recognize multi-word ranks ('KOMBES POL', 'BRIGJEN POL', 'IRJEN POL',
'KOMJEN POL', 'JENDERAL POL'). For 'KOMBES POL 88123456 John Doe' the
old code returned pangkat=None, nama='KOMBES POL John Doe', and the
validator's UNKNOWN_PANGKAT flag never fired because pangkat was falsy.

New behavior: greedy longest-prefix match. After stripping the NRP we
try the leading 3-token, 2-token, 1-token slice against
normalize_pangkat() and take the longest that maps to a canonical
rank. Tokens after the matched rank become the nama. Unknown ranks
fall through to pangkat=None and the rank text stays in the nama
field, where downstream validation already flags the row.
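
A self-contained sketch of the greedy longest-prefix match. `KNOWN_RANKS` and `normalize_rank` are hypothetical stand-ins for the master pangkat lookup and normalize_pangkat; only the matching strategy is meant to mirror the description above.

```python
from __future__ import annotations

import re

KNOWN_RANKS = {"AKP", "BRIPKA", "KOMBES POL", "BRIGJEN POL"}  # illustrative subset
NRP_RE = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def normalize_rank(text: str) -> str | None:
    """Stand-in normalizer: case-insensitive exact lookup."""
    candidate = " ".join(text.upper().split())
    return candidate if candidate in KNOWN_RANKS else None


def split_cell(cell: str) -> tuple[str | None, str | None, str | None]:
    """Split 'RANK NRP NAME' into (rank, nrp, name) by longest rank prefix."""
    match = NRP_RE.search(cell)
    nrp = match.group(1) if match else None
    rest = cell
    if match:
        rest = cell[: match.start()] + " " + cell[match.end():]
    tokens = rest.split()
    # Try 3-token, then 2-token, then 1-token prefixes.
    for size in range(min(len(tokens), 3), 0, -1):
        rank = normalize_rank(" ".join(tokens[:size]))
        if rank is not None:
            return rank, nrp, " ".join(tokens[size:]) or None
    # Unknown rank: leave everything in the name for validation to flag.
    return None, nrp, " ".join(tokens) or None


print(split_cell("KOMBES POL 88123456 John Doe"))
# ('KOMBES POL', '88123456', 'John Doe')
print(split_cell("XYZ 87010101 Budi"))
# (None, '87010101', 'XYZ Budi')
```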

Tests: 5 new asserts (4 multi-word ranks + 1 unknown-rank fallback),
119 total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: don't count pangkat_nrp as a name signal in is_personnel_table

Devin Review caught: a table with header ['No', 'Pangkat / NRP',
'Jabatan'] (no name column) was wrongly classified as a personnel
table because pangkat_nrp was lumped into has_name. Such a table
would produce PersonnelEntry rows with nama=None passing the nrp-or-
nama gate, polluting the personel[] output with id-only fragments.

Split the combined-cell set into combined_id (counts toward has_id)
and combined_name (counts toward has_name). Only pangkat_nrp_nama,
which actually embeds a name, qualifies for has_name. pangkat_nrp
remains an id-only signal.
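
The three cases named in the tests can be checked against a compact version of the predicate. This sketch operates directly on a column mapping (the list that header detection produces); `looks_like_personnel_table` is a hypothetical name for illustration.

```python
from __future__ import annotations

COMBINED_ID = {"pangkat_nrp", "pangkat_nrp_nama"}  # both carry rank/NRP
COMBINED_NAME = {"pangkat_nrp_nama"}               # only this one carries a name


def looks_like_personnel_table(mapping: list[str | None]) -> bool:
    """Require at least one id signal and one name signal in the header."""
    has_id = any(m in {"nrp", "pangkat"} | COMBINED_ID for m in mapping)
    has_name = any(m == "nama" or m in COMBINED_NAME for m in mapping)
    return has_id and has_name


print(looks_like_personnel_table(["no", "pangkat_nrp", "jabatan_dinas"]))  # False
print(looks_like_personnel_table(["no", "pangkat_nrp", "nama"]))           # True
print(looks_like_personnel_table(["no", "pangkat_nrp_nama", None]))        # True
```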

Tests: 3 new asserts (rejects id-only, accepts pangkat_nrp + separate
nama, accepts pangkat_nrp_nama). 122 total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
devin-ai-integration[bot]
2026-04-25 16:10:48 +00:00
committed by GitHub
parent 812ea7e030
commit 33b38aacc7
8 changed files with 905 additions and 12 deletions


@@ -47,6 +47,9 @@ class Settings(BaseSettings):
     preprocess_remove_shadow: bool = True
     preprocess_min_quad_area_fraction: float = Field(0.20, ge=0.0, le=1.0)

+    # Table extraction (Phase 3) via PaddleOCR PP-Structure
+    tables_enabled: bool = True
+
     # Confidence thresholds (Phase 5 routing)
     confidence_auto_approve: float = Field(0.95, ge=0.0, le=1.0)
     confidence_needs_review: float = Field(0.85, ge=0.0, le=1.0)


@@ -0,0 +1,316 @@
"""Map a raw 2D table grid into a list of `PersonnelEntry`.

Surat sprint personnel tables don't have a fixed schema across satuan: column
order, header phrasing, and even whether pangkat/NRP are merged into one cell
all vary. We deal with this by:

1. Detecting the header row by keyword scoring (rows that contain "PANGKAT"
   or "NRP" or "NAMA" are headers; the row with the highest score wins).
2. Mapping each header cell to one of the canonical PersonnelEntry fields
   via a synonym dictionary.
3. Walking the remaining rows and slotting cells into fields by column
   index. A combined "PANGKAT/NRP" or "PANGKAT/NRP/NAMA" cell is split
   heuristically (8-digit token → NRP, known-rank token → pangkat, the
   leftover words → nama).

The mapper is deliberately conservative: when in doubt it leaves a field
None and lets validation flag the row for HITL review.
"""

from __future__ import annotations

import re

from ocr_sprint.data.master_pangkat import normalize_pangkat
from ocr_sprint.pipeline.table import DetectedTable
from ocr_sprint.schemas.personnel import PersonnelEntry

# ---------- column synonyms ----------

# header keyword → canonical column id. Lowercased, whitespace-collapsed.
_HEADER_SYNONYMS: dict[str, str] = {
    # row index column
    "no": "no",
    "nomor": "no",
    "no.": "no",
    # rank
    "pangkat": "pangkat",
    "pkt": "pangkat",
    # NRP / NIP / NIPK
    "nrp": "nrp",
    "no nrp": "nrp",
    "nrp / nip": "nrp",
    "nrp/nip": "nrp",
    "nrp nip": "nrp",
    "no. mhs": "nrp",  # taruna
    # combined pangkat + NRP + nama cell, seen in compact Polri layouts.
    # Order matters here only for readability; _classify_header_cell tries
    # synonyms longest-first, so the longer 'pangkat / nrp / nama' wins over
    # both 'pangkat / nrp' and 'pangkat'.
    "pangkat / nrp / nama": "pangkat_nrp_nama",
    "pangkat/nrp/nama": "pangkat_nrp_nama",
    "pangkat nrp nama": "pangkat_nrp_nama",
    "pangkat, nrp, nama": "pangkat_nrp_nama",
    # combined pangkat + NRP cell, common in Polres-level sprint
    "pangkat / nrp": "pangkat_nrp",
    "pangkat/nrp": "pangkat_nrp",
    "pangkat dan nrp": "pangkat_nrp",
    "pangkat nrp": "pangkat_nrp",
    # name
    "nama": "nama",
    "nama lengkap": "nama",
    # jabatan dalam dinas (permanent post)
    "jabatan": "jabatan_dinas",
    "jabatan dinas": "jabatan_dinas",
    "jabatan dalam dinas": "jabatan_dinas",
    "jbt dinas": "jabatan_dinas",
    # jabatan dalam sprint (role for this dispatch)
    "jabatan dalam sprint": "jabatan_sprint",
    "jabatan dalam sprin": "jabatan_sprint",
    "jabatan dalam surat perintah": "jabatan_sprint",
    "jabatan sprint": "jabatan_sprint",
    "jabatan sprin": "jabatan_sprint",
    "tugas": "jabatan_sprint",
    "penugasan": "jabatan_sprint",
    # remarks
    "keterangan": "keterangan",
    "ket": "keterangan",
    "ket.": "keterangan",
}

# 8-digit NRP. We don't anchor on word boundaries because OCR sometimes glues
# the rank directly onto the digits ("BRIPKA98050505"). We use (?<!\d) and (?!\d)
# look-arounds to make sure we don't match a substring of a longer number.
_NRP_RE = re.compile(r"(?<!\d)(\d{8})(?!\d)")
_NUMBER_RE = re.compile(r"^\s*(\d{1,3})[.)\s]*$")

# ---------- header detection ----------


def _normalize_header_cell(text: str) -> str:
    return " ".join(text.lower().split()).strip(" .:")


# Synonym keywords sorted by length (descending) so that substring matching
# in `_classify_header_cell` prefers the most specific match. Without this,
# the short keyword 'pangkat' would match a header containing
# 'pangkat / nrp / nama' before the full 'pangkat / nrp / nama' synonym was
# tried, silently misclassifying combined-cell headers and dropping rows.
_SORTED_HEADER_KEYWORDS: list[tuple[str, str]] = sorted(
    _HEADER_SYNONYMS.items(), key=lambda kv: -len(kv[0])
)


def _classify_header_cell(text: str) -> str | None:
    """Return the canonical column id for a header cell, or None.

    First tries an exact match against the synonym table; if that fails,
    falls back to substring matching against the *longest* synonym that is
    contained in the cell text. The longest-first ordering matters: a header
    like 'Pangkat / NRP / Nama' must classify as `pangkat_nrp_nama`, not
    `pangkat`, otherwise downstream `map_row` would treat the whole cell as
    a rank string and drop the row when normalize_pangkat returns None.
    """
    norm = _normalize_header_cell(text)
    if not norm:
        return None
    if norm in _HEADER_SYNONYMS:
        return _HEADER_SYNONYMS[norm]
    for keyword, canonical in _SORTED_HEADER_KEYWORDS:
        if keyword in norm:
            return canonical
    return None


def detect_header_row(table: DetectedTable) -> tuple[int, list[str | None]] | None:
    """Find the most likely header row and return (row_index, column_mapping).

    Strategy: score each of the first ~3 rows by how many cells classify as a
    known column. Pick the highest-scoring row provided it covers at least
    two known fields (otherwise we don't have enough signal to trust it).
    """
    best_idx: int | None = None
    best_mapping: list[str | None] = []
    best_score = 0
    for r_idx in range(min(3, table.n_rows)):
        row = table.cells[r_idx]
        mapping = [_classify_header_cell(cell) for cell in row]
        score = sum(1 for m in mapping if m is not None)
        if score >= 2 and score > best_score:
            best_score = score
            best_idx = r_idx
            best_mapping = mapping
    if best_idx is None:
        return None
    return best_idx, best_mapping

# ---------- combined-cell splitting ----------


def _split_pangkat_nrp(cell: str) -> tuple[str | None, str | None]:
    """Split a 'PANGKAT NRP' cell into (pangkat, nrp).

    Returns (None, None) if the cell can't be split confidently.
    """
    if not cell:
        return None, None
    nrp_match = _NRP_RE.search(cell)
    nrp = nrp_match.group(1) if nrp_match else None
    pangkat_part = cell
    if nrp_match:
        pangkat_part = cell[: nrp_match.start()] + cell[nrp_match.end() :]
    # Strip separators commonly seen between rank and NRP ("AKP / 87010101",
    # "AKP. 87010101", "AKP - 87010101") before normalizing.
    pangkat_part = pangkat_part.strip(" /-.,;:|").strip()
    pangkat = normalize_pangkat(pangkat_part)
    return pangkat, nrp


def _split_pangkat_nrp_nama(cell: str) -> tuple[str | None, str | None, str | None]:
    """Split a 'PANGKAT NRP NAMA' single-cell into its three components.

    Multi-word ranks like 'KOMBES POL' or 'BRIGJEN POL' must be matched as
    contiguous token sequences, otherwise tokens like 'POL' leak into the
    name. We greedily try the longest leading token-prefix that normalizes
    to a known pangkat, then fall back to shorter prefixes.
    """
    if not cell:
        return None, None, None
    nrp_match = _NRP_RE.search(cell)
    nrp = nrp_match.group(1) if nrp_match else None
    rest = cell
    if nrp_match:
        # Cut out the matched span itself. A plain str.replace on the digits
        # could hit an earlier occurrence embedded in a longer number.
        rest = cell[: nrp_match.start()] + " " + cell[nrp_match.end() :]
    tokens = rest.split()
    if not tokens:
        return None, nrp, None
    # Try the longest leading sub-sequence first so 'KOMBES POL' wins over
    # 'KOMBES' (which alone is not a valid pangkat anyway).
    pangkat: str | None = None
    consumed = 0
    for prefix_len in range(min(len(tokens), 3), 0, -1):
        candidate = " ".join(tokens[:prefix_len])
        normalized = normalize_pangkat(candidate)
        if normalized is not None:
            pangkat = normalized
            consumed = prefix_len
            break
    name_tokens = tokens[consumed:] if pangkat else tokens
    nama = " ".join(name_tokens) if name_tokens else None
    return pangkat, nrp, nama

# ---------- row mapping ----------


def _parse_int(value: str) -> int | None:
    m = _NUMBER_RE.match(value)
    return int(m.group(1)) if m else None


def map_row(row: list[str], mapping: list[str | None]) -> PersonnelEntry | None:
    """Convert one data row into a PersonnelEntry using the column mapping."""
    fields: dict[str, str | int | None] = {
        "no": None,
        "pangkat": None,
        "nrp": None,
        "nama": None,
        "jabatan_dinas": None,
        "jabatan_sprint": None,
        "keterangan": None,
    }
    for idx, cell in enumerate(row):
        if idx >= len(mapping):
            break
        column = mapping[idx]
        if column is None:
            continue
        text = cell.strip()
        if column == "no":
            fields["no"] = _parse_int(text)
        elif column == "pangkat_nrp_nama":
            pangkat, nrp, nama = _split_pangkat_nrp_nama(text)
            if pangkat:
                fields["pangkat"] = pangkat
            if nrp:
                fields["nrp"] = nrp
            if nama:
                fields["nama"] = nama
        elif column == "pangkat_nrp":
            pangkat, nrp = _split_pangkat_nrp(text)
            if pangkat:
                fields["pangkat"] = pangkat
            if nrp:
                fields["nrp"] = nrp
        elif column == "pangkat":
            fields["pangkat"] = normalize_pangkat(text) or text or None
        elif column == "nrp":
            m = _NRP_RE.search(text)
            fields["nrp"] = m.group(1) if m else (text or None)
        elif column in fields:
            fields[column] = text or None
    # require at least nama OR nrp to consider this a real personnel row;
    # otherwise it's likely a separator / footnote / merged cell.
    if not (fields["nrp"] or fields["nama"]):
        return None
    return PersonnelEntry(
        no=fields["no"] if isinstance(fields["no"], int) else None,
        pangkat=fields["pangkat"] if isinstance(fields["pangkat"], str) else None,
        nrp=fields["nrp"] if isinstance(fields["nrp"], str) else None,
        nama=fields["nama"] if isinstance(fields["nama"], str) else None,
        jabatan_dinas=(
            fields["jabatan_dinas"] if isinstance(fields["jabatan_dinas"], str) else None
        ),
        jabatan_sprint=(
            fields["jabatan_sprint"] if isinstance(fields["jabatan_sprint"], str) else None
        ),
        keterangan=(fields["keterangan"] if isinstance(fields["keterangan"], str) else None),
    )

# ---------- table-level entrypoint ----------


def is_personnel_table(table: DetectedTable) -> bool:
    """Heuristic: a table is the personnel list if its header row contains
    at least one rank/NRP indicator and one name indicator.
    """
    detected = detect_header_row(table)
    if detected is None:
        return False
    _, mapping = detected
    # `pangkat_nrp` is an id-only signal (rank + NRP, no name), while
    # `pangkat_nrp_nama` carries a name too. Counting `pangkat_nrp` toward
    # `has_name` would let id-only tables (e.g. ['No', 'Pangkat / NRP',
    # 'Jabatan']) be mistaken for personnel tables.
    combined_id = {"pangkat_nrp", "pangkat_nrp_nama"}
    combined_name = {"pangkat_nrp_nama"}
    has_id = any(m in {"nrp", "pangkat"} | combined_id for m in mapping)
    has_name = any(m == "nama" or m in combined_name for m in mapping)
    return has_id and has_name


def extract_personnel(tables: list[DetectedTable]) -> list[PersonnelEntry]:
    """Pick the best-matching personnel table and convert its rows.

    If multiple tables look like personnel lists (rare), we concatenate them
    in document order so nothing is silently dropped.
    """
    rows: list[PersonnelEntry] = []
    for table in tables:
        if not is_personnel_table(table):
            continue
        detected = detect_header_row(table)
        if detected is None:
            continue
        header_idx, mapping = detected
        for r_idx in range(header_idx + 1, table.n_rows):
            entry = map_row(table.cells[r_idx], mapping)
            if entry is not None:
                rows.append(entry)
    return rows


@@ -1,11 +1,13 @@
-"""Synchronous pipeline orchestrator (Phase 1).
+"""Synchronous pipeline orchestrator (Phase 1-3).

 Wires the individual stages together:

-    bytes → ingest → preprocess → OCR → regex extract → validate → score
+    bytes -> ingest -> document_detect -> preprocess -> OCR
+          -> [PP-Structure tables -> personnel mapper]
+          -> regex header extract -> validate -> score

-Phase 4 will replace this with a Celery task graph; Phase 3/5 will plug
-in PP-Structure for tables and an LLM extractor for variant fields.
+Phase 4 will replace this with a Celery task graph; Phase 5 will plug
+in an LLM extractor for variant fields.
 """

 from __future__ import annotations
@@ -15,13 +17,16 @@ from dataclasses import dataclass
 from ocr_sprint.config import get_settings
 from ocr_sprint.pipeline.confidence import compute_confidence, route
 from ocr_sprint.pipeline.document_detect import DocumentDetectConfig, detect_and_correct
+from ocr_sprint.pipeline.extract.personnel import extract_personnel
 from ocr_sprint.pipeline.extract.regex_rules import extract_header, find_signatory
 from ocr_sprint.pipeline.extract.validators import validate_extraction
-from ocr_sprint.pipeline.ingest import detect_source_kind, ingest
+from ocr_sprint.pipeline.ingest import NDArrayU8, detect_source_kind, ingest
 from ocr_sprint.pipeline.ocr import OCRPage, run_ocr
 from ocr_sprint.pipeline.preprocess import PreprocessConfig, preprocess
+from ocr_sprint.pipeline.table import DetectedTable, run_table_extraction
 from ocr_sprint.schemas.document import DocumentStatus, SourceKind
 from ocr_sprint.schemas.extraction import ExtractionResult, ReviewFlag
+from ocr_sprint.schemas.personnel import PersonnelEntry
 from ocr_sprint.utils.logging import get_logger

 _logger = get_logger(__name__)
@@ -66,9 +71,11 @@ def run_pipeline(content: bytes) -> PipelineOutput:
     )

     ocr_pages: list[OCRPage] = []
+    cleaned_pages: list[NDArrayU8] = []
     for page in pages:
         corrected = detect_and_correct(page.image, detect_cfg)
         cleaned = preprocess(corrected, pre_cfg)
+        cleaned_pages.append(cleaned)
         ocr_pages.append(run_ocr(cleaned))

     full_text = "\n".join(p.text for p in ocr_pages)
@@ -77,13 +84,28 @@ def run_pipeline(content: bytes) -> PipelineOutput:
     header = extract_header(full_text)
     ttd = find_signatory(full_text)

+    personel: list[PersonnelEntry] = []
+    if s.tables_enabled and cleaned_pages:
+        all_tables: list[DetectedTable] = []
+        for img in cleaned_pages:
+            try:
+                all_tables.extend(run_table_extraction(img))
+            except Exception as exc:  # pragma: no cover - defensive
+                _logger.warning("pipeline.table_extraction_failed", error=str(exc))
+        personel = extract_personnel(all_tables)
+        _logger.info(
+            "pipeline.tables",
+            tables=len(all_tables),
+            personel_rows=len(personel),
+        )
+
     initial_flags: list[ReviewFlag] = []
     if mean_ocr_conf < _OCR_CONFIDENCE_FLAG_THRESHOLD:
         initial_flags.append(ReviewFlag.LOW_OCR_CONFIDENCE)

     result = ExtractionResult(
         header=header,
-        personel=[],  # Phase 3 will populate from PP-Structure
+        personel=personel,
         untuk=[],
         ttd=ttd,
         raw_text=full_text,


@@ -0,0 +1,155 @@
"""Phase 3: table extraction via PaddleOCR PP-Structure.

The personnel section of a surat sprint is almost always a table with columns
like (No, Pangkat, NRP, Nama, Jabatan dalam Dinas, Jabatan dalam Sprint,
Keterangan). Plain OCR on the page produces a flat stream of text lines that
makes column reconstruction brittle, so we use PP-Structure's table recognizer
which returns a 2D cell grid directly.

Like the OCR engine wrapper, PP-Structure has a heavy initialization cost
(~3-6s on CPU) and an API that has shifted across paddleocr releases, so we
hide it behind a small process-global accessor and a stable dataclass surface.

Tests do NOT require paddleocr to be installed: `parse_table_html`,
`extract_tables_from_pp_result`, and the personnel column mapper are pure
Python and only consume PP-Structure's HTML output.
"""

from __future__ import annotations

import html
import re
from dataclasses import dataclass, field
from threading import Lock
from typing import TYPE_CHECKING

from ocr_sprint.config import get_settings
from ocr_sprint.pipeline.ingest import NDArrayU8
from ocr_sprint.utils.logging import get_logger

if TYPE_CHECKING:
    from paddleocr import PPStructure

_logger = get_logger(__name__)
_lock = Lock()
_instance: PPStructure | None = None

@dataclass(frozen=True)
class TableCell:
    """One parsed table cell."""

    text: str
    row: int
    col: int


@dataclass
class DetectedTable:
    """One table region detected by PP-Structure, parsed into a 2D grid.

    `cells[r]` is a list of strings for row r. The list is ragged if the table
    has merged cells (we don't currently un-merge), so callers should treat it
    defensively.
    """

    cells: list[list[str]] = field(default_factory=list)
    html: str = ""

    @property
    def n_rows(self) -> int:
        return len(self.cells)

    @property
    def n_cols(self) -> int:
        return max((len(r) for r in self.cells), default=0)

# ---------- PP-Structure singleton ----------


def _build_pp_structure() -> PPStructure:
    from paddleocr import PPStructure

    s = get_settings()
    _logger.info("pp_structure.init", lang=s.ocr_lang, use_gpu=s.ocr_use_gpu)
    # layout=True so that PP-Structure also returns figure/text regions; we
    # filter to tables only afterwards. show_log=False to keep stdout clean.
    return PPStructure(
        lang=s.ocr_lang,
        use_gpu=s.ocr_use_gpu,
        layout=True,
        show_log=False,
    )


def get_pp_structure() -> PPStructure:
    """Lazy, thread-safe singleton accessor for PP-Structure."""
    global _instance
    if _instance is None:
        with _lock:
            if _instance is None:
                _instance = _build_pp_structure()
    return _instance

# ---------- table parsing ----------

_TR_RE = re.compile(r"<tr[^>]*>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
_TD_RE = re.compile(r"<t[dh][^>]*>(.*?)</t[dh]>", re.IGNORECASE | re.DOTALL)
_TAG_RE = re.compile(r"<[^>]+>")


def _strip_html(fragment: str) -> str:
    """Remove inner tags + collapse whitespace + decode HTML entities."""
    no_tags = _TAG_RE.sub(" ", fragment)
    decoded = html.unescape(no_tags)
    return " ".join(decoded.split()).strip()


def parse_table_html(table_html: str) -> list[list[str]]:
    """Parse an HTML <table> string into a 2D list of cell text values.

    Tolerant of PP-Structure's slight HTML inconsistencies (tag attributes,
    nested spans, &nbsp; entities); we don't need full HTML compliance,
    just rows x cells. Rows and cells are matched on paired opening and
    closing tags, so unclosed <tr>/<td> elements are not recovered.
    """
    rows: list[list[str]] = []
    for tr in _TR_RE.findall(table_html):
        cells = [_strip_html(td) for td in _TD_RE.findall(tr)]
        rows.append(cells)
    return rows

def extract_tables_from_pp_result(
    pp_result: list[dict[str, object]],
) -> list[DetectedTable]:
    """Pull tables out of PP-Structure's region list.

    PP-Structure returns one dict per detected region; tables have
    `type == "table"` and the recognized table HTML inside `res["html"]`.
    """
    tables: list[DetectedTable] = []
    for region in pp_result:
        if region.get("type") != "table":
            continue
        res = region.get("res")
        if not isinstance(res, dict):
            continue
        table_html = res.get("html", "")
        if not isinstance(table_html, str) or not table_html:
            continue
        cells = parse_table_html(table_html)
        if not cells:
            continue
        tables.append(DetectedTable(cells=cells, html=table_html))
    return tables


def run_table_extraction(image: NDArrayU8) -> list[DetectedTable]:
    """Run PP-Structure on a single page and return the parsed tables."""
    engine = get_pp_structure()
    raw = engine(image)
    if not isinstance(raw, list):
        return []
    return extract_tables_from_pp_result(raw)
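
As a worked example of the HTML-to-grid step, the sketch below restates the three regexes from the diff in a standalone function and runs them on a hand-written table. `parse_rows` and `SAMPLE` are illustrative; the sample HTML is made up to resemble recognizer output and is not captured from PP-Structure.

```python
from __future__ import annotations

import html
import re

TR_RE = re.compile(r"<tr[^>]*>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
TD_RE = re.compile(r"<t[dh][^>]*>(.*?)</t[dh]>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def parse_rows(table_html: str) -> list[list[str]]:
    """Return one list of cleaned cell strings per <tr> in the input."""
    rows: list[list[str]] = []
    for tr in TR_RE.findall(table_html):
        cells = []
        for td in TD_RE.findall(tr):
            text = html.unescape(TAG_RE.sub(" ", td))  # drop tags, decode &nbsp;
            cells.append(" ".join(text.split()))       # collapse whitespace
        rows.append(cells)
    return rows


# Hand-written sample with attributes, a nested span, and &nbsp; entities.
SAMPLE = (
    "<table><thead><tr><th>No</th><th>Pangkat&nbsp;/&nbsp;NRP</th>"
    "<th>Nama</th></tr></thead>"
    '<tbody><tr><td class="c">1.</td><td><span>AKP</span> 87010101</td>'
    "<td>Budi Santoso</td></tr></tbody></table>"
)

for row in parse_rows(SAMPLE):
    print(row)
# ['No', 'Pangkat / NRP', 'Nama']
# ['1.', 'AKP 87010101', 'Budi Santoso']
```

The first printed row is what header detection would score, and the second is the kind of data row the column mapper splits into rank, NRP, and name.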