Adrian Kuman Firmansah dbcf480130 Merge pull request #8 from Adriankf59/devin/1777181072-fix-personnel-extraction-cimahi

Fix personnel extraction + header bugs on real Polres Cimahi sprint

2026-04-26 13:10:44 +07:00

alembic

Phase 6: HITL review endpoints + audit trail

2026-04-25 20:12:04 +00:00

docs

Phase 7: ground-truth export (JSONL + stats) + CLI tool

2026-04-25 20:24:40 +00:00

samples

Phase 1 MVP: synchronous OCR + regex header extraction

2026-04-25 14:58:50 +00:00

src/ocr_sprint

Use word-boundary matching for personnel name blocklist

2026-04-26 05:46:21 +00:00

tests

Use word-boundary matching for personnel name blocklist

2026-04-26 05:46:21 +00:00

.env.example

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

.gitignore

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

.pre-commit-config.yaml

Phase 1 MVP: synchronous OCR + regex header extraction

2026-04-25 14:58:50 +00:00

alembic.ini

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

docker-compose.yml

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

Dockerfile

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

Makefile

Phase 1 MVP: synchronous OCR + regex header extraction

2026-04-25 14:58:50 +00:00

pyproject.toml

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

README.md

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

README.md

OCR Sprint Service

OCR and structured-extraction service for Indonesian police "surat sprint" (surat perintah) documents. Built around FastAPI, PaddleOCR, and hybrid extraction (regex → local LLM → validation), with on-premise deployment as a hard requirement.

Status: Phases 1–4 are implemented: synchronous and asynchronous PDF/image OCR with regex header extraction, PP-Structure personnel-table extraction, validation, confidence scoring, document detection, perspective correction, shadow removal, a Celery + Redis job queue, Postgres job state, local-filesystem blob storage, API-key auth, and Prometheus metrics. Phases 5–6 (LLM extraction, HITL) are tracked in docs/architecture.md.

Why this stack

PaddleOCR is the strongest open-source OCR for mixed-language documents and runs fully on-prem (essential for police data).
PP-Structure (Phase 3) handles personnel tables natively.
Regex-first, LLM-fallback extraction keeps deterministic fields fast and predictable while letting an LLM handle format drift across Polri units.
CPU-friendly defaults: a small (1.5B–4B) local LLM via Ollama is the recommended default; the architecture is also GPU-ready.
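
As a rough illustration of the regex-first, LLM-fallback idea, the sketch below extracts a sprint number with a regex and only calls a fallback when the regex finds nothing. The pattern is a guess modeled on the sample value "Sprin/123/IV/2025/Reskrim", and the function and parameter names (extract_nomor_sprint, llm_fallback) are hypothetical; none of this is taken from the service's regex_rules.py.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern, modeled only on the sample "Sprin/123/IV/2025/Reskrim":
# literal prefix, running number, Roman-numeral month, four-digit year, unit code.
NOMOR_SPRINT_RE = re.compile(r"\bSprin/\d+/[IVX]+/\d{4}/[A-Za-z]+\b")


def extract_nomor_sprint(
    text: str,
    llm_fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> tuple[Optional[str], str]:
    """Return (value, source), where source is "regex", "llm", or "none"."""
    match = NOMOR_SPRINT_RE.search(text)
    if match:
        # Deterministic path: no model call needed.
        return match.group(0), "regex"
    if llm_fallback is not None:
        # Only reached when the regex misses, e.g. on an unfamiliar layout.
        value = llm_fallback(text)
        if value:
            return value, "llm"
    return None, "none"


if __name__ == "__main__":
    print(extract_nomor_sprint("Nomor: Sprin/123/IV/2025/Reskrim"))
```

Returning the source alongside the value lets a later validation or scoring step treat regex hits and LLM answers differently.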

See docs/architecture.md for the full architecture, accuracy expectations, and roadmap.

Quickstart

Prerequisites

Python 3.10–3.12
~3 GB free disk for PaddleOCR model downloads on first run
Linux/macOS recommended (Windows works but PaddleOCR install can be finicky)

Install (local dev)

git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service

python -m venv .venv && source .venv/bin/activate
make install         # installs runtime + dev deps + pre-commit
pip install -e ".[ocr]"  # only on the worker host — pulls Paddle wheels (~1.5 GB)
cp .env.example .env # edit if you need GPU / different storage path

Run the API

make dev
# → http://localhost:8000/docs

Try it out

By default, POST /api/v1/documents is asynchronous: it returns 202 Accepted with a job_id, and the worker fills in the result later. For tests or one-off local runs, append ?sync=true to process the document inline.

# Async (production flow)
curl -F "file=@samples/pdf/example.pdf" \
  -H "X-API-Key: $API_KEY" \
  http://localhost:8000/api/v1/documents | jq
# → {"job_id":"8f2a...","status":"pending",...}

curl -H "X-API-Key: $API_KEY" \
  http://localhost:8000/api/v1/documents/8f2a... | jq

# Sync (single small doc, no worker required)
curl -F "file=@samples/pdf/example.pdf" \
  -H "X-API-Key: $API_KEY" \
  "http://localhost:8000/api/v1/documents?sync=true" | jq

Expected response (truncated):

{
  "job_id": "8f2a...",
  "status": "completed",
  "confidence": 0.93,
  "data": {
    "header": {
      "nomor_sprint": "Sprin/123/IV/2025/Reskrim",
      "tanggal": "2025-04-21",
      "satuan_penerbit": "KEPOLISIAN RESOR BANDUNG",
      "perihal": "Pelaksanaan penyelidikan kasus pencurian",
      "dasar": ["Undang-Undang Nomor 2 Tahun 2002 ...", "..."]
    },
    "personel": [],
    "ttd": { "nrp": "12345678" }
  },
  "review_flags": []
}

Note: As of Phase 3 the personel[] array is populated from PP-Structure table recognition. Set TABLES_ENABLED=false in .env to skip the table stage (faster on documents that you know contain no personnel table).
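
The async flow described above (submit, then poll GET /api/v1/documents/{job_id} until the job finishes) could be wrapped in a small client helper. This is a minimal sketch using only the standard library. The statuses "pending" and "completed" come from the examples above; "failed" is an assumed terminal status, and the helper names, polling interval, and timeout are illustrative.

```python
import json
import time
import urllib.request
from typing import Any, Callable

# "pending" and "completed" appear in the README examples;
# "failed" is an ASSUMED terminal status, not confirmed by the service.
TERMINAL_STATUSES = {"completed", "failed"}


def http_fetch_job(base_url: str, api_key: str) -> Callable[[str], dict[str, Any]]:
    """Build a fetcher that GETs /api/v1/documents/{job_id} with the API key."""

    def fetch(job_id: str) -> dict[str, Any]:
        request = urllib.request.Request(
            f"{base_url}/api/v1/documents/{job_id}",
            headers={"X-API-Key": api_key},
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)

    return fetch


def wait_for_job(
    fetch: Callable[[str], dict[str, Any]],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job.get('status')!r} after {timeout}s")
        sleep(interval)


# Usage against a local instance (job id as returned by the POST):
# fetch = http_fetch_job("http://localhost:8000", api_key="...")
# result = wait_for_job(fetch, job_id)
```

Passing the fetcher in as a callable keeps the polling loop testable without a running API or worker.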

Docker

The Phase 4 stack runs four services: api, worker (Celery), redis, and postgres. Blob uploads are persisted to a Docker volume — there is no MinIO/S3 dependency.

docker compose build
docker compose up -d
docker compose logs -f api worker

The API container runs alembic upgrade head on start, so the jobs table is created on first boot. The first request will trigger PaddleOCR to download its detection/recognition/cls models (~200 MB) into the paddle-models volume.

Metrics are exposed at http://localhost:8000/metrics in Prometheus text format.
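
For a quick look at those metrics without a Prometheus server, the text format can be read with a few lines of Python. The parser below is a simplified sketch: it skips comment lines and assumes samples carry no trailing timestamp. The metric names in the example are invented for illustration and are not the service's actual metric names.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Map each series (name plus any labels) to its sample value.

    Simplification: assumes lines look like "<series> <value>" with no
    trailing timestamp, which is the common case for scraped endpoints.
    """
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


if __name__ == "__main__":
    # Hypothetical payload; real metric names will differ.
    example = """\
# HELP ocr_jobs_total Jobs processed (hypothetical metric)
# TYPE ocr_jobs_total counter
ocr_jobs_total{status="completed"} 12
ocr_jobs_total{status="failed"} 1
"""
    for series, value in parse_prometheus_text(example).items():
        print(series, value)
```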

Development

make fmt        # format with ruff
make lint       # lint
make typecheck  # mypy strict mode
make test       # pytest
make test-cov   # pytest + coverage

Pre-commit hooks run ruff on every commit. Install once with pre-commit install (already done by make install).

Project layout

src/ocr_sprint/
  api/          # FastAPI routes + error handlers
  schemas/      # Pydantic v2 models (request/response, extraction, personnel)
  pipeline/     # ingest → document_detect → preprocess → ocr + table → extract → validate → score
    extract/    # regex_rules.py (Phase 1) + personnel.py (Phase 3) → llm.py (Phase 5)
  data/         # master data (Polri ranks, etc.)
  utils/        # logging, helpers
  config.py     # pydantic-settings
  main.py       # app factory
tests/unit/     # 100+ unit tests, PaddleOCR / PP-Structure mocked
docs/           # architecture & decision records
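
The stage order noted next to pipeline/ (ingest → document_detect → preprocess → ocr + table → extract → validate → score) can be pictured as a chain of functions passing a shared context along. The sketch below is purely illustrative: the Stage type, run_pipeline, and the no-op stage bodies are hypothetical stand-ins, not the modules under src/ocr_sprint/pipeline/.

```python
from typing import Any, Callable

# A stage takes the working context and returns the (possibly updated) context.
Stage = Callable[[dict[str, Any]], dict[str, Any]]

# Stage names follow the order given in the project layout; "ocr_table"
# stands for the combined "ocr + table" step.
STAGE_ORDER = [
    "ingest",
    "document_detect",
    "preprocess",
    "ocr_table",
    "extract",
    "validate",
    "score",
]


def run_pipeline(stages: list[tuple[str, Stage]], context: dict[str, Any]) -> dict[str, Any]:
    """Run stages in order, recording each stage name in context["trace"]."""
    for name, stage in stages:
        context = stage(context)
        context.setdefault("trace", []).append(name)
    return context


def passthrough(context: dict[str, Any]) -> dict[str, Any]:
    """Placeholder stage body; a real stage would add OCR text, fields, etc."""
    return context


if __name__ == "__main__":
    result = run_pipeline([(name, passthrough) for name in STAGE_ORDER], {"file": "example.pdf"})
    print(result["trace"])
```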

Roadmap

Phase	Scope	Status
1	Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring	Done
2	OpenCV-based document detection, perspective transform, shadow removal for phone photos	Done
3	PP-Structure table extraction for personnel rows + column mapper	Done
4	Async pipeline (Celery + Redis), Postgres job state, local-filesystem blob storage, API-key auth, Prometheus metrics	Done
5	LLM hybrid extraction (Ollama + structured output)	Planned
6	HITL review endpoints + audit trail	Planned

License

Proprietary — internal use only.

Languages

Python 96.3%

PowerShell 2.4%

Dockerfile 0.6%

Makefile 0.5%

Mako 0.2%
