Files
OCR-SPRIN-SERVICE/README.md
devin-ai-integration[bot] 33b38aacc7 Phase 3: PP-Structure table extraction + personnel column mapper (#2)
* Phase 3: PP-Structure table extraction + personnel column mapper

Adds the personnel-table stage of the pipeline. PaddleOCR's PP-Structure
recognizes table regions and emits HTML, which we parse into a 2D cell
grid. A separate column mapper detects the header row, classifies each
column to a canonical PersonnelEntry field via a synonym dictionary,
and walks the data rows.

Variant handling:
- Different satuan use different column orders and header phrasing.
  Supported synonyms for each canonical field are listed in
  pipeline/extract/personnel.py (Pangkat / NRP / Pangkat-NRP combo /
  Nama / Jabatan dalam Dinas / Jabatan dalam Sprint / Keterangan).
- A merged 'PANGKAT NRP' or 'PANGKAT NRP NAMA' cell is split using
  the 8-digit NRP regex (with look-arounds so glued forms like
  'BRIPKA98050505' work) and the master pangkat lookup.
- Unknown ranks are kept verbatim so the validation layer can flag
  them as UNKNOWN_PANGKAT for HITL review.
- Rows without nrp AND nama are dropped (separators / merged cells).

New module pipeline/table.py:
- DetectedTable dataclass (cells + html).
- parse_table_html: tag/entity-tolerant HTML -> 2D grid.
- extract_tables_from_pp_result: filter PP-Structure regions to type=table.
- run_table_extraction: top-level entrypoint with lazy-init singleton
  for the heavy PP-Structure engine.

Orchestrator now invokes table extraction (gated by TABLES_ENABLED) on
every preprocessed page and merges the discovered personnel into the
ExtractionResult. Failures are caught and logged so a flaky table
recognizer never blocks header extraction.

Tests: 38 new unit tests covering HTML parsing, region filtering,
header classification, column mapping (split, combined, glued cells),
and end-to-end personnel extraction. Total 108 tests, all green.
PaddleOCR / PP-Structure remain optional - no test imports them.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: fix header misclassification for combined Pangkat/NRP/Nama columns

Devin Review caught two related bugs in personnel column mapping:

1. _classify_header_cell iterated _HEADER_SYNONYMS in insertion order
   when falling back to substring matching. The dict listed shorter
   keywords first ('pangkat' before 'pangkat / nrp'), so a header like
   'Pangkat / NRP / Nama' classified as plain 'pangkat'. map_row then
   tried to normalize the whole '"AKP 87010101 Budi Santoso"' cell
   as a rank, normalize_pangkat returned None, and the row failed the
   nrp-or-nama gate at the bottom of map_row -- silently dropping
   every personnel row in tables using this layout.

2. _split_pangkat_nrp_nama existed and was unit-tested but was never
   wired up in map_row, so even if classification had worked, the
   three-way split would not have run. The module docstring claimed
   the split was supported.

Fix:
- Iterate the synonym table sorted by keyword length descending in the
  substring-match fallback so the most specific synonym wins.
- Add 'pangkat_nrp_nama' synonym entries for typical separators
  (' / ', '/', whitespace, comma).
- Wire 'pangkat_nrp_nama' into map_row using the existing helper.
- Update is_personnel_table so combined headers count as both an id
  signal and a name signal.

Tests: 6 new asserts (parametrized), 1 regression test for triple-
combined header end-to-end, 1 dedicated map_row test for the new
column type. 114 tests total, all green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: handle multi-word Polri ranks in _split_pangkat_nrp_nama

Devin Review caught: token-by-token is_valid_pangkat() check could not
recognize multi-word ranks ('KOMBES POL', 'BRIGJEN POL', 'IRJEN POL',
'KOMJEN POL', 'JENDERAL POL'). For 'KOMBES POL 88123456 John Doe' the
old code returned pangkat=None, nama='KOMBES POL John Doe', and the
validator's UNKNOWN_PANGKAT flag never fired because pangkat was falsy.

New behavior: greedy longest-prefix match. After stripping the NRP we
try the leading 3-token, 2-token, 1-token slice against
normalize_pangkat() and take the longest that maps to a canonical
rank. Tokens after the matched rank become the nama. Unknown ranks
fall through to pangkat=None and the rank text stays in the nama
field, where downstream validation already flags the row.

Tests: 5 new asserts (4 multi-word ranks + 1 unknown-rank fallback),
119 total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 3: don't count pangkat_nrp as a name signal in is_personnel_table

Devin Review caught: a table with header ['No', 'Pangkat / NRP',
'Jabatan'] (no name column) was wrongly classified as a personnel
table because pangkat_nrp was lumped into has_name. Such a table
would produce PersonnelEntry rows with nama=None passing the nrp-or-
nama gate, polluting the personel[] output with id-only fragments.

Split the combined-cell set into combined_id (counts toward has_id)
and combined_name (counts toward has_name). Only pangkat_nrp_nama,
which actually embeds a name, qualifies for has_name. pangkat_nrp
remains an id-only signal.

Tests: 3 new asserts (rejects id-only, accepts pangkat_nrp + separate
nama, accepts pangkat_nrp_nama). 122 total green.

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
2026-04-25 16:10:48 +00:00

124 lines
4.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# OCR Sprint Service
OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah) documents. Built around **FastAPI + PaddleOCR + hybrid extraction (regex → LLM lokal → validation)** with **on-premise** deployment as a hard requirement.
> **Status:** Phase 1+2+3 — synchronous PDF/image OCR with regex header extraction, validation, confidence scoring, document detection + perspective correction + shadow removal for phone photos, and **PP-Structure table extraction** for personnel rows. Phase 46 (async pipeline, LLM extraction, HITL) are tracked in [`docs/architecture.md`](docs/architecture.md).
## Why this stack
- **PaddleOCR** is the strongest open-source OCR for mixed-language documents and runs fully on-prem (essential for police data).
- **PP-Structure** (Phase 3) handles personnel tables natively.
- **Regex-first, LLM-fallback extraction** keeps deterministic fields fast and predictable while letting an LLM handle format drift across Polri units.
- **CPU-friendly defaults**: a small (1.5B4B) local LLM via Ollama is the recommended default; the architecture is also GPU-ready.
See [`docs/architecture.md`](docs/architecture.md) for the full architecture, accuracy expectations, and roadmap.
## Quickstart
### Prerequisites
- Python **3.103.12**
- ~3 GB free disk for PaddleOCR model downloads on first run
- Linux/macOS recommended (Windows works but PaddleOCR install can be finicky)
### Install (local dev)
```bash
git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service
python -m venv .venv && source .venv/bin/activate
make install # installs runtime + dev deps + pre-commit
cp .env.example .env # edit if you need GPU / different storage path
```
### Run the API
```bash
make dev
# → http://localhost:8000/docs
```
### Try it out
```bash
curl -F "file=@samples/pdf/example.pdf" http://localhost:8000/api/v1/documents | jq
```
Expected response (truncated):
```json
{
"job_id": "8f2a...",
"status": "completed",
"confidence": 0.93,
"data": {
"header": {
"nomor_sprint": "Sprin/123/IV/2025/Reskrim",
"tanggal": "2025-04-21",
"satuan_penerbit": "KEPOLISIAN RESOR BANDUNG",
"perihal": "Pelaksanaan penyelidikan kasus pencurian",
"dasar": ["Undang-Undang Nomor 2 Tahun 2002 ...", "..."]
},
"personel": [],
"ttd": { "nrp": "12345678" }
},
"review_flags": []
}
```
> **Note:** As of Phase 3 the `personel[]` array is populated from PP-Structure table recognition. Set `TABLES_ENABLED=false` in `.env` to skip the table stage (faster on documents that you know contain no personnel table).
### Docker
```bash
docker compose build
docker compose up -d
docker compose logs -f api
```
The first request will trigger PaddleOCR to download its detection/recognition/cls models (~200 MB) into the `paddle-models` volume.
## Development
```bash
make fmt # format with ruff
make lint # lint
make typecheck # mypy strict mode
make test # pytest
make test-cov # pytest + coverage
```
Pre-commit hooks run ruff on every commit. Install once with `pre-commit install` (already done by `make install`).
## Project layout
```
src/ocr_sprint/
api/ # FastAPI routes + error handlers
schemas/ # Pydantic v2 models (request/response, extraction, personnel)
pipeline/ # ingest → document_detect → preprocess → ocr + table → extract → validate → score
extract/ # regex_rules.py (Phase 1) + personnel.py (Phase 3) → llm.py (Phase 5)
data/ # master data (Polri ranks, etc.)
utils/ # logging, helpers
config.py # pydantic-settings
main.py # app factory
tests/unit/ # 100+ unit tests, PaddleOCR / PP-Structure mocked
docs/ # architecture & decision records
```
## Roadmap
| Phase | Scope | Status |
|---|---|---|
| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | **Done** |
| 2 | OpenCV-based document detection, perspective transform, shadow removal for phone photos | **Done** |
| 3 | PP-Structure table extraction for personnel rows + column mapper | **Done** |
| 4 | Async pipeline (Celery + Redis), Postgres + MinIO, auth, observability | Planned |
| 5 | LLM hybrid extraction (Ollama + structured output) | Planned |
| 6 | HITL review endpoints + audit trail | Planned |
## License
Proprietary — internal use only.