* Phase 3: PP-Structure table extraction + personnel column mapper

Adds the personnel-table stage of the pipeline. PaddleOCR's PP-Structure recognizes table regions and emits HTML, which we parse into a 2D cell grid. A separate column mapper detects the header row, classifies each column to a canonical PersonnelEntry field via a synonym dictionary, and walks the data rows.

Variant handling:
- Different satuan (units) use different column orders and header phrasing. Supported synonyms for each canonical field are listed in pipeline/extract/personnel.py (Pangkat / NRP / Pangkat-NRP combo / Nama / Jabatan dalam Dinas / Jabatan dalam Sprint / Keterangan).
- A merged 'PANGKAT NRP' or 'PANGKAT NRP NAMA' cell is split using the 8-digit NRP regex (with look-arounds so glued forms like 'BRIPKA98050505' work) and the master pangkat lookup.
- Unknown ranks are kept verbatim so the validation layer can flag them as UNKNOWN_PANGKAT for HITL review.
- Rows without nrp AND nama are dropped (separators / merged cells).

New module pipeline/table.py:
- DetectedTable dataclass (cells + html).
- parse_table_html: tag/entity-tolerant HTML -> 2D grid.
- extract_tables_from_pp_result: filter PP-Structure regions to type=table.
- run_table_extraction: top-level entrypoint with lazy-init singleton for the heavy PP-Structure engine.

Orchestrator now invokes table extraction (gated by TABLES_ENABLED) on every preprocessed page and merges the discovered personnel into the ExtractionResult. Failures are caught and logged so a flaky table recognizer never blocks header extraction.

Tests: 38 new unit tests covering HTML parsing, region filtering, header classification, column mapping (split, combined, glued cells), and end-to-end personnel extraction. Total 108 tests, all green. PaddleOCR / PP-Structure remain optional - no test imports them.

* Phase 3: fix header misclassification for combined Pangkat/NRP/Nama columns

Devin Review caught two related bugs in personnel column mapping:

1. _classify_header_cell iterated _HEADER_SYNONYMS in insertion order when falling back to substring matching. The dict listed shorter keywords first ('pangkat' before 'pangkat / nrp'), so a header like 'Pangkat / NRP / Nama' classified as plain 'pangkat'. map_row then tried to normalize the whole 'AKP 87010101 Budi Santoso' cell as a rank, normalize_pangkat returned None, and the row failed the nrp-or-nama gate at the bottom of map_row -- silently dropping every personnel row in tables using this layout.
2. _split_pangkat_nrp_nama existed and was unit-tested but was never wired up in map_row, so even if classification had worked, the three-way split would not have run. The module docstring claimed the split was supported.

Fix:
- Iterate the synonym table sorted by keyword length descending in the substring-match fallback so the most specific synonym wins.
- Add 'pangkat_nrp_nama' synonym entries for typical separators (' / ', '/', whitespace, comma).
- Wire 'pangkat_nrp_nama' into map_row using the existing helper.
- Update is_personnel_table so combined headers count as both an id signal and a name signal.

Tests: 6 new asserts (parametrized), 1 regression test for triple-combined header end-to-end, 1 dedicated map_row test for the new column type. 114 tests total, all green.

* Phase 3: handle multi-word Polri ranks in _split_pangkat_nrp_nama

Devin Review caught: token-by-token is_valid_pangkat() check could not recognize multi-word ranks ('KOMBES POL', 'BRIGJEN POL', 'IRJEN POL', 'KOMJEN POL', 'JENDERAL POL'). For 'KOMBES POL 88123456 John Doe' the old code returned pangkat=None, nama='KOMBES POL John Doe', and the validator's UNKNOWN_PANGKAT flag never fired because pangkat was falsy.

New behavior: greedy longest-prefix match. After stripping the NRP we try the leading 3-token, 2-token, 1-token slice against normalize_pangkat() and take the longest that maps to a canonical rank. Tokens after the matched rank become the nama. Unknown ranks fall through to pangkat=None and the rank text stays in the nama field, where downstream validation already flags the row.

Tests: 5 new asserts (4 multi-word ranks + 1 unknown-rank fallback), 119 total green.

* Phase 3: don't count pangkat_nrp as a name signal in is_personnel_table

Devin Review caught: a table with header ['No', 'Pangkat / NRP', 'Jabatan'] (no name column) was wrongly classified as a personnel table because pangkat_nrp was lumped into has_name. Such a table would produce PersonnelEntry rows with nama=None passing the nrp-or-nama gate, polluting the personel[] output with id-only fragments.

Split the combined-cell set into combined_id (counts toward has_id) and combined_name (counts toward has_name). Only pangkat_nrp_nama, which actually embeds a name, qualifies for has_name. pangkat_nrp remains an id-only signal.

Tests: 3 new asserts (rejects id-only, accepts pangkat_nrp + separate nama, accepts pangkat_nrp_nama). 122 total green.

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
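The greedy longest-prefix split described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: `KNOWN_PANGKAT` is a hypothetical subset of the master rank list, and `split_pangkat_nrp_nama` and `SplitCell` are stand-in names.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical subset of the master rank list; the real lookup lives in the
# project's master data and is not reproduced here.
KNOWN_PANGKAT = {"AKP", "BRIPKA", "KOMBES POL", "BRIGJEN POL", "IRJEN POL"}

# Exactly eight digits, not embedded in a longer digit run. The look-arounds
# still match when the rank is glued to the number ("BRIPKA98050505").
NRP_RE = re.compile(r"(?<!\d)\d{8}(?!\d)")


class SplitCell(NamedTuple):
    pangkat: Optional[str]
    nrp: Optional[str]
    nama: Optional[str]


def normalize_pangkat(text: str) -> Optional[str]:
    """Return the canonical rank, or None when the text is not a known rank."""
    candidate = " ".join(text.upper().split())
    return candidate if candidate in KNOWN_PANGKAT else None


def split_pangkat_nrp_nama(cell: str) -> SplitCell:
    match = NRP_RE.search(cell)
    nrp = match.group(0) if match else None
    # Remove the NRP, leaving rank and name tokens in their original order.
    rest = NRP_RE.sub(" ", cell, count=1) if match else cell
    tokens = rest.split()
    # Greedy longest-prefix match: try 3 tokens, then 2, then 1.
    for size in (3, 2, 1):
        if len(tokens) < size:
            continue
        pangkat = normalize_pangkat(" ".join(tokens[:size]))
        if pangkat is not None:
            return SplitCell(pangkat, nrp, " ".join(tokens[size:]) or None)
    # Unknown rank: keep the text in nama so validation can flag the row.
    return SplitCell(None, nrp, " ".join(tokens) or None)
```

Trying the longest slice first is what lets a two-word rank such as 'KOMBES POL' win over a failed single-token check on 'KOMBES'.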
# OCR Sprint Service

OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah, an official order letter) documents. Built around **FastAPI + PaddleOCR + hybrid extraction (regex → local LLM → validation)** with **on-premise** deployment as a hard requirement.

> **Status:** Phases 1–3 are implemented: synchronous PDF/image OCR with regex header extraction, validation, and confidence scoring; document detection, perspective correction, and shadow removal for phone photos; and **PP-Structure table extraction** for personnel rows. Phases 4–6 (async pipeline, LLM extraction, HITL) are tracked in [`docs/architecture.md`](docs/architecture.md).
## Why this stack

- **PaddleOCR** is one of the strongest open-source OCR engines for mixed-language documents and runs fully on-prem, which is essential for police data.
- **PP-Structure** (Phase 3) recognizes table regions and emits them as HTML, which the pipeline parses into personnel rows.
- **Regex-first, LLM-fallback extraction** keeps deterministic fields fast and predictable while letting an LLM handle format drift across Polri units. The LLM step is planned for Phase 5.
- **CPU-friendly defaults**: a small (1.5B–4B parameter) local LLM via Ollama is the recommended default; the architecture is also GPU-ready.

See [`docs/architecture.md`](docs/architecture.md) for the full architecture, accuracy expectations, and roadmap.
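As a rough illustration of the regex-first, fallback-second idea from the list above, the sketch below tries a deterministic pattern and only calls a fallback on a miss. The pattern, the function name `extract_nomor_sprint`, and the `fallback` hook are assumptions made for this example; the service's real rules and the planned LLM step may look different.

```python
import re
from typing import Callable, Optional

# Assumed pattern for a sprint number such as "Sprin/123/IV/2025/Reskrim";
# the service's real rules may differ.
NOMOR_SPRINT_RE = re.compile(r"\bSprin/\d+/[IVXLCDM]+/\d{4}/[\w.-]+", re.IGNORECASE)

Fallback = Callable[[str], Optional[str]]


def extract_nomor_sprint(text: str, fallback: Optional[Fallback] = None) -> Optional[str]:
    """Try the deterministic regex first; consult the fallback only on a miss."""
    match = NOMOR_SPRINT_RE.search(text)
    if match:
        return match.group(0)
    # The fallback stands in for the planned LLM step.
    return fallback(text) if fallback is not None else None
```

Because the fallback is only invoked when the regex misses, well-formed documents never pay the cost of the slower path.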
## Quickstart

### Prerequisites

- Python **3.10–3.12**
- ~3 GB free disk for PaddleOCR model downloads on first run
- Linux/macOS recommended (Windows works but PaddleOCR install can be finicky)
### Install (local dev)

```bash
git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service

python -m venv .venv && source .venv/bin/activate
make install            # installs runtime + dev deps + pre-commit
cp .env.example .env    # edit if you need GPU / different storage path
```
### Run the API

```bash
make dev
# → http://localhost:8000/docs
```
### Try it out

```bash
curl -F "file=@samples/pdf/example.pdf" http://localhost:8000/api/v1/documents | jq
```
Expected response (truncated):

```json
{
  "job_id": "8f2a...",
  "status": "completed",
  "confidence": 0.93,
  "data": {
    "header": {
      "nomor_sprint": "Sprin/123/IV/2025/Reskrim",
      "tanggal": "2025-04-21",
      "satuan_penerbit": "KEPOLISIAN RESOR BANDUNG",
      "perihal": "Pelaksanaan penyelidikan kasus pencurian",
      "dasar": ["Undang-Undang Nomor 2 Tahun 2002 ...", "..."]
    },
    "personel": [],
    "ttd": { "nrp": "12345678" }
  },
  "review_flags": []
}
```
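A client might use `status`, `confidence`, and `review_flags` from a response like this to decide whether a document needs a manual look. The sketch below is only an illustration: `needs_review` and the `REVIEW_THRESHOLD` value are hypothetical and not part of the service.

```python
import json
from typing import Any

# Hypothetical cut-off; the service does not document a review threshold here.
REVIEW_THRESHOLD = 0.85


def needs_review(response: dict[str, Any], threshold: float = REVIEW_THRESHOLD) -> bool:
    """Flag a result for manual checking when it carries review flags or scores low."""
    if response.get("status") != "completed":
        return True
    if response.get("review_flags"):
        return True
    return float(response.get("confidence", 0.0)) < threshold


sample = json.loads(
    '{"job_id": "8f2a...", "status": "completed", "confidence": 0.93,'
    ' "data": {"personel": []}, "review_flags": []}'
)
print(needs_review(sample))
```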
> **Note:** As of Phase 3 the `personel[]` array is populated from PP-Structure table recognition. Set `TABLES_ENABLED=false` in `.env` to skip the table stage (faster on documents that you know contain no personnel table).
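One way to picture the `TABLES_ENABLED` gate is the sketch below, which skips the table stage when the flag is off and logs, rather than raises, when a page fails. It reads the environment directly to stay self-contained; the project itself uses pydantic-settings, and `tables_enabled`, `collect_personnel`, and the default value are assumptions for this example.

```python
import logging
import os
from typing import Any, Callable

logger = logging.getLogger(__name__)

Row = dict[str, Any]


def tables_enabled() -> bool:
    """Read TABLES_ENABLED from the environment; assumed to default to on."""
    value = os.environ.get("TABLES_ENABLED", "true").strip().lower()
    return value not in {"0", "false", "no", "off"}


def collect_personnel(pages: list[Any], extract: Callable[[Any], list[Row]]) -> list[Row]:
    """Run the table stage per page, logging failures instead of raising."""
    if not tables_enabled():
        return []
    personnel: list[Row] = []
    for index, page in enumerate(pages):
        try:
            personnel.extend(extract(page))
        except Exception:
            # A flaky table recognizer must not block header extraction.
            logger.exception("table extraction failed on page %d", index)
    return personnel
```

Catching per page means one unreadable page costs only its own rows; the remaining pages still contribute to `personel[]`.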
### Docker

```bash
docker compose build
docker compose up -d
docker compose logs -f api
```

The first request triggers PaddleOCR to download its detection, recognition, and angle-classification (cls) models (~200 MB) into the `paddle-models` volume.
## Development

```bash
make fmt        # format with ruff
make lint       # lint
make typecheck  # mypy strict mode
make test       # pytest
make test-cov   # pytest + coverage
```

Pre-commit hooks run ruff on every commit. Install once with `pre-commit install` (already done by `make install`).
## Project layout

```
src/ocr_sprint/
  api/            # FastAPI routes + error handlers
  schemas/        # Pydantic v2 models (request/response, extraction, personnel)
  pipeline/       # ingest → document_detect → preprocess → ocr + table → extract → validate → score
    extract/      # regex_rules.py (Phase 1) + personnel.py (Phase 3) → llm.py (Phase 5)
  data/           # master data (Polri ranks, etc.)
  utils/          # logging, helpers
  config.py       # pydantic-settings
  main.py         # app factory
tests/unit/       # 100+ unit tests, PaddleOCR / PP-Structure mocked
docs/             # architecture & decision records
```
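The `table` step of the pipeline turns the HTML emitted by PP-Structure into a grid of cell strings. A minimal sketch of such a parser, using only the standard library, might look like the following. It is not the project's `parse_table_html`: it ignores `rowspan`/`colspan`, and the class name `_TableGridParser` is made up for this example.

```python
from html.parser import HTMLParser
from typing import Optional


class _TableGridParser(HTMLParser):
    """Collect the text of each <td>/<th> into a list of rows."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decodes entities such as &amp;
        self.rows: list[list[str]] = []
        self._cell: Optional[list[str]] = None

    def flush(self) -> None:
        """Close the open cell, if any, and append its collapsed text to the row."""
        if self._cell is not None:
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.flush()
            self.rows.append([])
        elif tag in ("td", "th"):
            self.flush()  # tolerate a missing </td> before the next cell
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr"):
            self.flush()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def parse_table_html(html: str) -> list[list[str]]:
    parser = _TableGridParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [row for row in parser.rows if row]
```

Flushing on every cell or row boundary is what makes the parser tolerant of unclosed tags, which are common in recognizer output.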
## Roadmap

| Phase | Scope | Status |
|---|---|---|
| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | **Done** |
| 2 | OpenCV-based document detection, perspective transform, shadow removal for phone photos | **Done** |
| 3 | PP-Structure table extraction for personnel rows + column mapper | **Done** |
| 4 | Async pipeline (Celery + Redis), Postgres + MinIO, auth, observability | Planned |
| 5 | LLM hybrid extraction (Ollama + structured output) | Planned |
| 6 | HITL review endpoints + audit trail | Planned |
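The Phase 3 column mapper classifies header cells through a synonym dictionary, checking the most specific synonym first. The sketch below illustrates that ordering with a hypothetical, much smaller `HEADER_SYNONYMS` table; the real dictionary in `pipeline/extract/personnel.py` is not reproduced here.

```python
from typing import Optional

# Hypothetical synonym table; the project's own dictionary is larger.
HEADER_SYNONYMS = {
    "pangkat": "pangkat",
    "nrp": "nrp",
    "nama": "nama",
    "pangkat / nrp": "pangkat_nrp",
    "pangkat / nrp / nama": "pangkat_nrp_nama",
}


def classify_header_cell(text: str) -> Optional[str]:
    normalized = " ".join(text.lower().split())
    if normalized in HEADER_SYNONYMS:  # exact match first
        return HEADER_SYNONYMS[normalized]
    # Substring fallback: longest keyword first, so the most specific wins.
    for keyword in sorted(HEADER_SYNONYMS, key=len, reverse=True):
        if keyword in normalized:
            return HEADER_SYNONYMS[keyword]
    return None
```

Without the length-descending sort, a header such as 'Pangkat / NRP / Nama Lengkap' would match the shorter 'pangkat' entry and the combined cell would never be split.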
## License
Proprietary — internal use only.