Devin AI 737f4999dd Use word-boundary matching for personnel name blocklist

Devin Review correctly flagged that the bare "NO" and "KET" entries
in the blocklist would silently drop common Indonesian names (KETUT,
NOVA, NOOR, NORMAN, NOVIANTI, ...) because the check used startswith
rather than a word boundary.

Replaced the per-prefix loop with a single compiled regex anchored at
^ with a trailing \b, which still matches column headers like "NO"
or "KET" on their own line but no longer rejects "NOOR HIDAYAT" or
"KETUT WARDANA". Also fixes the same bug in _following_jabatan.

Added two regression tests covering both directions: names starting
with the offending tokens are kept, bare column headers still rejected.

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

2026-04-26 05:46:21 +00:00
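
The difference between the old prefix check and the word-boundary check described in the commit message can be sketched as follows. This is an illustration only, not the project's code: the function names are hypothetical, and the token list contains just the two entries the message names ("NO" and "KET"), while the real blocklist is presumably longer.

```python
import re

# Only "NO" and "KET" are named in the commit message; everything else
# here (names, normalisation) is a hypothetical illustration.
HEADER_TOKENS = ("NO", "KET")

# Anchored at the start of the line, with a trailing \b so that a token
# only matches when it stands alone as a word.
HEADER_RE = re.compile(
    r"^(?:" + "|".join(re.escape(t) for t in HEADER_TOKENS) + r")\b"
)


def is_header_prefix(line: str) -> bool:
    """Old behaviour: rejects anything that merely starts with a token."""
    return line.strip().upper().startswith(HEADER_TOKENS)


def is_header_word(line: str) -> bool:
    """New behaviour: rejects only when the token is a whole word."""
    return HEADER_RE.match(line.strip().upper()) is not None
```

With the prefix check, "KETUT WARDANA" is rejected because it starts with "KET". With the anchored regex it is kept, because there is no word boundary between "KET" and "UT", while a bare "KET" or "NO" line is still rejected.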

alembic

Phase 6: HITL review endpoints + audit trail

2026-04-25 20:12:04 +00:00

docs

Phase 7: ground-truth export (JSONL + stats) + CLI tool

2026-04-25 20:24:40 +00:00

samples

Phase 1 MVP: synchronous OCR + regex header extraction

2026-04-25 14:58:50 +00:00

src/ocr_sprint

Use word-boundary matching for personnel name blocklist

2026-04-26 05:46:21 +00:00

tests

Use word-boundary matching for personnel name blocklist

2026-04-26 05:46:21 +00:00

.env.example

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

.gitignore

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

.pre-commit-config.yaml

Phase 1 MVP: synchronous OCR + regex header extraction

2026-04-25 14:58:50 +00:00

alembic.ini

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

docker-compose.yml

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

Dockerfile

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

Makefile

Phase 1 MVP: synchronous OCR + regex header extraction

2026-04-25 14:58:50 +00:00

pyproject.toml

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

README.md

Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

2026-04-25 16:50:51 +00:00

README.md

OCR Sprint Service

OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah) documents. Built around FastAPI + PaddleOCR + hybrid extraction (regex → local LLM → validation) with on-premise deployment as a hard requirement.

Status: Phases 1–4 are implemented: synchronous and asynchronous PDF/image OCR with regex header extraction, PP-Structure personnel-table extraction, validation, confidence scoring, document detection, perspective correction, shadow removal, a Celery + Redis job queue, Postgres job state, local-filesystem blob storage, API-key auth, and Prometheus metrics. Phases 5–6 (LLM extraction, HITL) are tracked in docs/architecture.md.

Why this stack

PaddleOCR is the strongest open-source OCR for mixed-language documents and runs fully on-prem (essential for police data).
PP-Structure (Phase 3) handles personnel tables natively.
Regex-first, LLM-fallback extraction keeps deterministic fields fast and predictable while letting an LLM handle format drift across Polri units.
CPU-friendly defaults: a small (1.5B–4B) local LLM via Ollama is the recommended default; the architecture is also GPU-ready.
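
A minimal sketch of the regex-first, LLM-fallback idea, assuming a single field (nomor_sprint) and a caller-supplied fallback. The pattern is modelled only on the sample value "Sprin/123/IV/2025/Reskrim" and is not the project's actual rule; extract_nomor_sprint and llm_fallback are hypothetical names.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern derived from the sample "Sprin/123/IV/2025/Reskrim":
# prefix, sequence number, Roman-numeral month, year, unit.
NOMOR_SPRINT_RE = re.compile(r"\bSprin/\d+/[IVXLCDM]+/\d{4}/[A-Za-z]+\b")


def extract_nomor_sprint(
    text: str,
    llm_fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> tuple[Optional[str], str]:
    """Return (value, source), where source is "regex", "llm", or "none"."""
    match = NOMOR_SPRINT_RE.search(text)
    if match:
        # Deterministic path: fast and predictable.
        return match.group(0), "regex"
    if llm_fallback is not None:
        # Only reached when the regex finds nothing (e.g. format drift).
        value = llm_fallback(text)
        if value:
            return value, "llm"
    return None, "none"
```

Returning the source alongside the value lets a later validation or scoring stage treat LLM-derived fields with more caution than regex-derived ones.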

See docs/architecture.md for the full architecture, accuracy expectations, and roadmap.

Quickstart

Prerequisites

Python 3.10–3.12
~3 GB free disk for PaddleOCR model downloads on first run
Linux/macOS recommended (Windows works but PaddleOCR install can be finicky)

Install (local dev)

git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service

python -m venv .venv && source .venv/bin/activate
make install         # installs runtime + dev deps + pre-commit
pip install -e ".[ocr]"  # only on the worker host — pulls Paddle wheels (~1.5 GB)
cp .env.example .env # edit if you need GPU / different storage path

Run the API

make dev
# → http://localhost:8000/docs

Try it out

By default, POST /api/v1/documents is asynchronous: it returns 202 Accepted with a job_id, and the worker fills in the result later. For tests or local one-shot usage, append ?sync=true to run the pipeline inline.

# Async (production flow)
curl -F "file=@samples/pdf/example.pdf" \
  -H "X-API-Key: $API_KEY" \
  http://localhost:8000/api/v1/documents | jq
# → {"job_id":"8f2a...","status":"pending",...}

curl -H "X-API-Key: $API_KEY" \
  http://localhost:8000/api/v1/documents/8f2a... | jq

# Sync (single small doc, no worker required)
curl -F "file=@samples/pdf/example.pdf" \
  "http://localhost:8000/api/v1/documents?sync=true" | jq
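
For the async flow, a client has to poll the job until it finishes. The helper below is a sketch of that loop under stated assumptions: fetch_job is a hypothetical callable that performs the GET on /api/v1/documents/<job_id> and returns the decoded JSON, and "failed" is an assumed terminal status (the examples here only show "pending" and "completed").

```python
import time
from typing import Callable

# "completed" appears in the sample response; "failed" is an assumption.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_job(job_id) until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job.get('status')!r}")
        sleep(interval_s)
```

Injecting fetch_job and sleep keeps the loop independent of any particular HTTP client and makes it easy to exercise in unit tests without a running worker.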

Expected response (truncated):

{
  "job_id": "8f2a...",
  "status": "completed",
  "confidence": 0.93,
  "data": {
    "header": {
      "nomor_sprint": "Sprin/123/IV/2025/Reskrim",
      "tanggal": "2025-04-21",
      "satuan_penerbit": "KEPOLISIAN RESOR BANDUNG",
      "perihal": "Pelaksanaan penyelidikan kasus pencurian",
      "dasar": ["Undang-Undang Nomor 2 Tahun 2002 ...", "..."]
    },
    "personel": [],
    "ttd": { "nrp": "12345678" }
  },
  "review_flags": []
}

Note: As of Phase 3 the personel[] array is populated from PP-Structure table recognition. Set TABLES_ENABLED=false in .env to skip the table stage (faster on documents that you know contain no personnel table).
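
A consumer of this response will usually want to decide whether a human should look at the result. The sketch below uses only the status, confidence, and review_flags fields shown above; the function name and the 0.85 threshold are assumptions for illustration, not values defined by the service.

```python
from typing import Any

# Hypothetical cut-off; the service does not document a threshold here.
CONFIDENCE_THRESHOLD = 0.85


def needs_review(
    response: dict[str, Any],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> bool:
    """Return True when a response should be checked by a person."""
    if response.get("status") != "completed":
        return True  # unfinished or unsuccessful jobs are never auto-accepted
    if response.get("review_flags"):
        return True  # the service itself raised at least one flag
    # A missing confidence is treated as 0.0, i.e. it always needs review.
    return response.get("confidence", 0.0) < threshold
```

With the sample response above (status "completed", confidence 0.93, no review flags), this returns False under the assumed threshold.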

Docker

The Phase 4 stack runs four services: api, worker (Celery), redis, and postgres. Blob uploads are persisted to a Docker volume — there is no MinIO/S3 dependency.

docker compose build
docker compose up -d
docker compose logs -f api worker

The API container runs alembic upgrade head on start, so the jobs table is created on first boot. The first request will trigger PaddleOCR to download its detection/recognition/cls models (~200 MB) into the paddle-models volume.

Metrics are exposed at http://localhost:8000/metrics in Prometheus text format.

Development

make fmt        # format with ruff
make lint       # lint
make typecheck  # mypy strict mode
make test       # pytest
make test-cov   # pytest + coverage

Pre-commit hooks run ruff on every commit. Install once with pre-commit install (already done by make install).

Project layout

src/ocr_sprint/
  api/          # FastAPI routes + error handlers
  schemas/      # Pydantic v2 models (request/response, extraction, personnel)
  pipeline/     # ingest → document_detect → preprocess → ocr + table → extract → validate → score
    extract/    # regex_rules.py (Phase 1) + personnel.py (Phase 3) → llm.py (Phase 5)
  data/         # master data (Polri ranks, etc.)
  utils/        # logging, helpers
  config.py     # pydantic-settings
  main.py       # app factory
tests/unit/     # 100+ unit tests, PaddleOCR / PP-Structure mocked
docs/           # architecture & decision records
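
The stage order noted for pipeline/ can be pictured as a chain of functions that each take and return a shared context. The runner below is a hypothetical sketch of that shape, not the package's actual interface; Stage, run_pipeline, and the toy stages are invented for illustration.

```python
from typing import Any, Callable

Context = dict[str, Any]
Stage = Callable[[Context], Context]


def run_pipeline(context: Context, stages: list[tuple[str, Stage]]) -> Context:
    """Run each named stage in order and record the order in context["trace"]."""
    for name, stage in stages:
        context = stage(context)
        context.setdefault("trace", []).append(name)
    return context


# Toy stand-ins for two of the real stages (ocr and extract).
def fake_ocr(ctx: Context) -> Context:
    return {**ctx, "text": "Nomor: Sprin/123/IV/2025/Reskrim"}


def fake_extract(ctx: Context) -> Context:
    return {**ctx, "nomor_sprint": ctx["text"].split(": ", 1)[1]}


result = run_pipeline({}, [("ocr", fake_ocr), ("extract", fake_extract)])
```

Keeping every stage behind the same signature is one way to let a stage such as table extraction be skipped or mocked without touching its neighbours.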

Roadmap

Phase	Scope	Status
1	Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring	Done
2	OpenCV-based document detection, perspective transform, shadow removal for phone photos	Done
3	PP-Structure table extraction for personnel rows + column mapper	Done
4	Async pipeline (Celery + Redis), Postgres job state, local-filesystem blob storage, API-key auth, Prometheus metrics	Done
5	LLM hybrid extraction (Ollama + structured output)	Planned
6	HITL review endpoints + audit trail	Planned

License

Proprietary — internal use only.

Languages

Python 96.3%

PowerShell 2.4%

Dockerfile 0.6%

Makefile 0.5%

Mako 0.2%
