devin-ai-integration[bot] 33b38aacc7 Phase 3: PP-Structure table extraction + personnel column mapper (#2)
* Phase 3: PP-Structure table extraction + personnel column mapper

Adds the personnel-table stage of the pipeline. PaddleOCR's PP-Structure
recognizes table regions and emits HTML, which we parse into a 2D cell
grid. A separate column mapper detects the header row, classifies each
column to a canonical PersonnelEntry field via a synonym dictionary,
and walks the data rows.

Variant handling:
- Different satuan use different column orders and header phrasing.
  Supported synonyms for each canonical field are listed in
  pipeline/extract/personnel.py (Pangkat / NRP / Pangkat-NRP combo /
  Nama / Jabatan dalam Dinas / Jabatan dalam Sprint / Keterangan).
- A merged 'PANGKAT NRP' or 'PANGKAT NRP NAMA' cell is split using
  the 8-digit NRP regex (with look-arounds so glued forms like
  'BRIPKA98050505' work) and the master pangkat lookup.
- Unknown ranks are kept verbatim so the validation layer can flag
  them as UNKNOWN_PANGKAT for HITL review.
- Rows without nrp AND nama are dropped (separators / merged cells).
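
A minimal sketch of the NRP-anchored split described above. It is not the code in pipeline/extract/personnel.py: the names `NRP_RE` and `split_on_nrp` are hypothetical, and it assumes the look-arounds only exclude neighbouring digits, which is what lets a glued form such as 'BRIPKA98050505' match.

```python
from __future__ import annotations

import re

# Assumption: an NRP is exactly 8 digits. The look-arounds reject only
# adjacent digits, so letters glued to the number do not block a match.
NRP_RE = re.compile(r"(?<!\d)\d{8}(?!\d)")


def split_on_nrp(cell: str) -> tuple[str, str | None, str]:
    """Return (text before the NRP, the NRP or None, text after the NRP)."""
    match = NRP_RE.search(cell)
    if match is None:
        # No 8-digit run found: hand the whole cell back untouched.
        return cell.strip(), None, ""
    before = cell[: match.start()].strip()
    after = cell[match.end():].strip()
    return before, match.group(), after
```

The text before the NRP would then go to the pangkat lookup and the text after it becomes the candidate nama.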

New module pipeline/table.py:
- DetectedTable dataclass (cells + html).
- parse_table_html: tag/entity-tolerant HTML -> 2D grid.
- extract_tables_from_pp_result: filter PP-Structure regions to type=table.
- run_table_extraction: top-level entrypoint with lazy-init singleton
  for the heavy PP-Structure engine.
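
For illustration, the HTML -> 2D grid step could look roughly like the sketch below, using only the standard library. The class and function names are hypothetical and the real parse_table_html may differ; this version ignores rowspan/colspan and simply tolerates missing closing tags and HTML entities.

```python
from __future__ import annotations

from html.parser import HTMLParser


class _GridParser(HTMLParser):
    """Collect <td>/<th> text into rows; hypothetical helper, not the repo's."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.rows: list[list[str]] = []
        self._cell: list[str] | None = None

    def _flush(self) -> None:
        # Close the current cell, collapsing internal whitespace.
        if self._cell is not None:
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush()  # tolerate a missing </td> before a new row
            self.rows.append([])
        elif tag in ("td", "th"):
            self._flush()  # tolerate a missing </td> before a new cell
            if not self.rows:
                self.rows.append([])  # cell without an enclosing <tr>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._flush()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_to_grid(html: str) -> list[list[str]]:
    """Parse table HTML into a list of rows, each a list of cell strings."""
    parser = _GridParser()
    parser.feed(html)
    parser.close()
    parser._flush()
    return [row for row in parser.rows if row]
```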

Orchestrator now invokes table extraction (gated by TABLES_ENABLED) on
every preprocessed page and merges the discovered personnel into the
ExtractionResult. Failures are caught and logged so a flaky table
recognizer never blocks header extraction.
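
The gating and failure isolation described above can be sketched as follows. This is an assumption-laden illustration, not the orchestrator itself: `collect_personnel`, the `extractor` callable, and the `tables_enabled` argument are hypothetical stand-ins for whatever the orchestrator and settings object actually expose.

```python
from __future__ import annotations

import logging
from typing import Any, Callable, Iterable

logger = logging.getLogger(__name__)


def collect_personnel(
    pages: Iterable[Any],
    extractor: Callable[[Any], list[dict]],
    tables_enabled: bool,
) -> list[dict]:
    """Run table extraction per page; a failing page never aborts the job."""
    if not tables_enabled:
        # Mirrors TABLES_ENABLED=false: skip the table stage entirely.
        return []
    personnel: list[dict] = []
    for index, page in enumerate(pages):
        try:
            personnel.extend(extractor(page))
        except Exception:
            # Log with traceback and continue; header extraction is unaffected.
            logger.exception("table extraction failed on page %d", index)
    return personnel
```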

Tests: 38 new unit tests covering HTML parsing, region filtering,
header classification, column mapping (split, combined, glued cells),
and end-to-end personnel extraction. Total 108 tests, all green.
PaddleOCR / PP-Structure remain optional - no test imports them.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: fix header misclassification for combined Pangkat/NRP/Nama columns

Devin Review caught two related bugs in personnel column mapping:

1. _classify_header_cell iterated _HEADER_SYNONYMS in insertion order
   when falling back to substring matching. The dict listed shorter
   keywords first ('pangkat' before 'pangkat / nrp'), so a header like
   'Pangkat / NRP / Nama' classified as plain 'pangkat'. map_row then
   tried to normalize the whole 'AKP 87010101 Budi Santoso' cell
   as a rank, normalize_pangkat returned None, and the row failed the
   nrp-or-nama gate at the bottom of map_row -- silently dropping
   every personnel row in tables using this layout.

2. _split_pangkat_nrp_nama existed and was unit-tested but was never
   wired up in map_row, so even if classification had worked, the
   three-way split would not have run. The module docstring claimed
   the split was supported.

Fix:
- Iterate the synonym table sorted by keyword length descending in the
  substring-match fallback so the most specific synonym wins.
- Add 'pangkat_nrp_nama' synonym entries for typical separators
  (' / ', '/', whitespace, comma).
- Wire 'pangkat_nrp_nama' into map_row using the existing helper.
- Update is_personnel_table so combined headers count as both an id
  signal and a name signal.
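
A small sketch of the longest-keyword-first fallback from the first fix item. The synonym table here is an illustrative subset and `classify_header` is a hypothetical name; the real _HEADER_SYNONYMS and _classify_header_cell hold more entries and may normalize differently.

```python
from __future__ import annotations

# Illustrative subset only; shorter keywords are deliberately listed first
# to show that insertion order no longer decides the outcome.
_SYNONYMS = {
    "pangkat": "pangkat",
    "nrp": "nrp",
    "nama": "nama",
    "pangkat / nrp": "pangkat_nrp",
    "pangkat / nrp / nama": "pangkat_nrp_nama",
}


def classify_header(text: str) -> str | None:
    """Map a header cell to a canonical field, preferring specific synonyms."""
    normalized = " ".join(text.lower().split())
    if normalized in _SYNONYMS:
        return _SYNONYMS[normalized]  # exact match
    # Substring fallback: try the longest keyword first so that
    # 'pangkat / nrp / nama' beats 'pangkat / nrp', which beats 'pangkat'.
    for keyword in sorted(_SYNONYMS, key=len, reverse=True):
        if keyword in normalized:
            return _SYNONYMS[keyword]
    return None
```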

Tests: 6 new asserts (parametrized), 1 regression test for triple-
combined header end-to-end, 1 dedicated map_row test for the new
column type. 114 tests total, all green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: handle multi-word Polri ranks in _split_pangkat_nrp_nama

Devin Review caught: token-by-token is_valid_pangkat() check could not
recognize multi-word ranks ('KOMBES POL', 'BRIGJEN POL', 'IRJEN POL',
'KOMJEN POL', 'JENDERAL POL'). For 'KOMBES POL 88123456 John Doe' the
old code returned pangkat=None, nama='KOMBES POL John Doe', and the
validator's UNKNOWN_PANGKAT flag never fired because pangkat was falsy.

New behavior: greedy longest-prefix match. After stripping the NRP we
try the leading 3-token, 2-token, 1-token slice against
normalize_pangkat() and take the longest that maps to a canonical
rank. Tokens after the matched rank become the nama. Unknown ranks
fall through to pangkat=None and the rank text stays in the nama
field, where downstream validation already flags the row.
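
The greedy longest-prefix match can be sketched like this, operating on the text that remains after the NRP has been stripped. The rank set is a tiny illustrative subset and `normalize_rank` / `split_rank_and_name` are hypothetical stand-ins for normalize_pangkat() and the real helper.

```python
from __future__ import annotations

# Illustrative subset of the master rank data, not the full list.
_RANKS = {"AKP", "BRIPKA", "KOMBES POL", "BRIGJEN POL"}


def normalize_rank(text: str) -> str | None:
    """Hypothetical stand-in for normalize_pangkat(): canonical rank or None."""
    candidate = " ".join(text.upper().split())
    return candidate if candidate in _RANKS else None


def split_rank_and_name(text: str, max_rank_tokens: int = 3) -> tuple[str | None, str | None]:
    """Try the leading 3-, 2-, then 1-token slice; the longest known rank wins."""
    tokens = text.split()
    for size in range(min(max_rank_tokens, len(tokens)), 0, -1):
        rank = normalize_rank(" ".join(tokens[:size]))
        if rank is not None:
            # Everything after the matched rank is treated as the name.
            return rank, " ".join(tokens[size:]) or None
    # Unknown rank: leave the rank text inside the name for validation to flag.
    return None, " ".join(tokens) or None
```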

Tests: 5 new asserts (4 multi-word ranks + 1 unknown-rank fallback),
119 total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: don't count pangkat_nrp as a name signal in is_personnel_table

Devin Review caught: a table with header ['No', 'Pangkat / NRP',
'Jabatan'] (no name column) was wrongly classified as a personnel
table because pangkat_nrp was lumped into has_name. Such a table
would produce PersonnelEntry rows with nama=None passing the nrp-or-
nama gate, polluting the personel[] output with id-only fragments.

Split the combined-cell set into combined_id (counts toward has_id)
and combined_name (counts toward has_name). Only pangkat_nrp_nama,
which actually embeds a name, qualifies for has_name. pangkat_nrp
remains an id-only signal.
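
In sketch form, the id/name split looks like the following. Which plain fields count as id or name signals is an assumption here ('nrp' and 'nama'), and `looks_like_personnel_table` is a hypothetical name for illustration rather than the actual is_personnel_table.

```python
from __future__ import annotations

from typing import Iterable

# Combined columns that carry an id (NRP) versus those that also carry a name.
COMBINED_ID = frozenset({"pangkat_nrp", "pangkat_nrp_nama"})
COMBINED_NAME = frozenset({"pangkat_nrp_nama"})


def looks_like_personnel_table(column_types: Iterable[str | None]) -> bool:
    """Require both an id signal and a name signal among the classified headers."""
    present = {c for c in column_types if c is not None}
    # Assumption: a plain 'nrp' column is the id signal, 'nama' the name signal.
    has_id = "nrp" in present or bool(present & COMBINED_ID)
    has_name = "nama" in present or bool(present & COMBINED_NAME)
    return has_id and has_name
```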

Tests: 3 new asserts (rejects id-only, accepts pangkat_nrp + separate
nama, accepts pangkat_nrp_nama). 122 total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
2026-04-25 16:10:48 +00:00

OCR Sprint Service

OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah) documents. Built around FastAPI + PaddleOCR + hybrid extraction (regex → local LLM → validation) with on-premise deployment as a hard requirement.

Status: Phase 1+2+3 (synchronous PDF/image OCR with regex header extraction, validation, confidence scoring, document detection + perspective correction + shadow removal for phone photos, and PP-Structure table extraction for personnel rows). Phases 4-6 (async pipeline, LLM extraction, HITL) are tracked in docs/architecture.md.

Why this stack

  • PaddleOCR is the strongest open-source OCR for mixed-language documents and runs fully on-prem (essential for police data).
  • PP-Structure (Phase 3) handles personnel tables natively.
  • Regex-first, LLM-fallback extraction keeps deterministic fields fast and predictable while letting an LLM handle format drift across Polri units.
  • CPU-friendly defaults: a small (1.5B-4B) local LLM via Ollama is the recommended default; the architecture is also GPU-ready.

See docs/architecture.md for the full architecture, accuracy expectations, and roadmap.

Quickstart

Prerequisites

  • Python 3.10-3.12
  • ~3 GB free disk for PaddleOCR model downloads on first run
  • Linux/macOS recommended (Windows works but PaddleOCR install can be finicky)

Install (local dev)

git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service

python -m venv .venv && source .venv/bin/activate
make install         # installs runtime + dev deps + pre-commit
cp .env.example .env # edit if you need GPU / different storage path

Run the API

make dev
# → http://localhost:8000/docs

Try it out

curl -F "file=@samples/pdf/example.pdf" http://localhost:8000/api/v1/documents | jq

Expected response (truncated):

{
  "job_id": "8f2a...",
  "status": "completed",
  "confidence": 0.93,
  "data": {
    "header": {
      "nomor_sprint": "Sprin/123/IV/2025/Reskrim",
      "tanggal": "2025-04-21",
      "satuan_penerbit": "KEPOLISIAN RESOR BANDUNG",
      "perihal": "Pelaksanaan penyelidikan kasus pencurian",
      "dasar": ["Undang-Undang Nomor 2 Tahun 2002 ...", "..."]
    },
    "personel": [],
    "ttd": { "nrp": "12345678" }
  },
  "review_flags": []
}

Note: As of Phase 3 the personel[] array is populated from PP-Structure table recognition. Set TABLES_ENABLED=false in .env to skip the table stage (faster on documents that you know contain no personnel table).

Docker

docker compose build
docker compose up -d
docker compose logs -f api

The first request will trigger PaddleOCR to download its detection/recognition/cls models (~200 MB) into the paddle-models volume.

Development

make fmt        # format with ruff
make lint       # lint
make typecheck  # mypy strict mode
make test       # pytest
make test-cov   # pytest + coverage

Pre-commit hooks run ruff on every commit. Install once with pre-commit install (already done by make install).

Project layout

src/ocr_sprint/
  api/          # FastAPI routes + error handlers
  schemas/      # Pydantic v2 models (request/response, extraction, personnel)
  pipeline/     # ingest → document_detect → preprocess → ocr + table → extract → validate → score
    extract/    # regex_rules.py (Phase 1) + personnel.py (Phase 3) → llm.py (Phase 5)
  data/         # master data (Polri ranks, etc.)
  utils/        # logging, helpers
  config.py     # pydantic-settings
  main.py       # app factory
tests/unit/     # 100+ unit tests, PaddleOCR / PP-Structure mocked
docs/           # architecture & decision records

Roadmap

| Phase | Scope | Status |
|-------|-------|--------|
| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | Done |
| 2 | OpenCV-based document detection, perspective transform, shadow removal for phone photos | Done |
| 3 | PP-Structure table extraction for personnel rows + column mapper | Done |
| 4 | Async pipeline (Celery + Redis), Postgres + MinIO, auth, observability | Planned |
| 5 | LLM hybrid extraction (Ollama + structured output) | Planned |
| 6 | HITL review endpoints + audit trail | Planned |

License

Proprietary — internal use only.
