Adds a small Ollama HTTP client (httpx-based, no extra runtime deps),
prompt builders, and a hybrid header extractor that runs *after* the
deterministic regex layer. The merger never overwrites a regex-filled
field — the LLM only fills gaps. If LLM_ENABLED=false (the default), or
the Ollama server is unreachable, the pipeline degrades gracefully:
- LLM_ENABLED=false -> no LLM call at all, no flag.
- LLM_ENABLED=true, header complete -> no LLM call.
- LLM_ENABLED=true, header has gaps, LLM responded ok -> merge +
  LLM_FALLBACK flag (review hint).
- LLM_ENABLED=true, header has gaps, LLM unavailable -> keep regex
  result + LLM_UNAVAILABLE flag.
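
The four cases above can be sketched as one small function. This is an illustrative sketch only, not the code in this commit: the names `missing_fields`, `merge_header`, `extract_header`, and the `ask_llm` callable are hypothetical, and the field names are taken from the README's sample response.

```python
from typing import Callable, Optional

# Header fields the LLM is allowed to fill (names from the sample response).
HEADER_FIELDS = ("nomor_sprint", "tanggal", "satuan_penerbit", "perihal", "dasar")

Header = dict[str, object]


def missing_fields(header: Header) -> list[str]:
    """Return the header fields the regex layer left empty (None, '' or [])."""
    return [name for name in HEADER_FIELDS if not header.get(name)]


def merge_header(regex_header: Header, llm_header: Header) -> Header:
    """Fill gaps from the LLM result; regex-filled fields are never overwritten."""
    merged = dict(regex_header)
    for name in missing_fields(regex_header):
        value = llm_header.get(name)
        if value:
            merged[name] = value
    return merged


def extract_header(
    regex_header: Header,
    llm_enabled: bool,
    ask_llm: Callable[[list[str]], Optional[Header]],  # hypothetical client hook
) -> tuple[Header, list[str]]:
    """Return (header, review_flags) following the four cases listed above."""
    if not llm_enabled:
        return dict(regex_header), []          # no LLM call, no flag
    gaps = missing_fields(regex_header)
    if not gaps:
        return dict(regex_header), []          # header complete, no LLM call
    llm_header = ask_llm(gaps)
    if llm_header is None:                     # client signals "unavailable"
        return dict(regex_header), ["LLM_UNAVAILABLE"]
    return merge_header(regex_header, llm_header), ["LLM_FALLBACK"]
```

Passing the client as a callable keeps the policy testable without a running Ollama server, which mirrors the stub-based tests described below.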
Default model qwen2.5:1.5b on http://localhost:11434 — chosen for CPU
throughput (~5-15s per call) at acceptable accuracy. The LLM only fills
the *header* (nomor, tanggal, satuan, perihal, dasar). Personnel rows
stay with PP-Structure, which is more accurate for tables and needs no LLM.
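
To make the wire format concrete, here is a hedged sketch of the request body and response handling. It assumes Ollama's `/api/generate` endpoint with `stream` disabled and JSON output, whose reply wraps the model text in a `response` field. Whether the client in this commit uses that endpoint is an assumption, and the function names are hypothetical. The sketch is stdlib-only and performs no network I/O; the real client sends the payload with httpx.

```python
import json
from typing import Optional

DEFAULT_MODEL = "qwen2.5:1.5b"
DEFAULT_BASE_URL = "http://localhost:11434"


def build_generate_request(
    prompt: str,
    model: str = DEFAULT_MODEL,
    base_url: str = DEFAULT_BASE_URL,
) -> tuple[str, dict]:
    """Return (url, JSON payload) for a non-streaming generate call."""
    url = f"{base_url.rstrip('/')}/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": False, "format": "json"}
    return url, payload


def parse_generate_response(body: str) -> Optional[dict]:
    """Extract the model's JSON object, or None on any malformed reply."""
    try:
        envelope = json.loads(body)
    except json.JSONDecodeError:
        return None                      # body is not JSON at all
    if not isinstance(envelope, dict) or not isinstance(envelope.get("response"), str):
        return None                      # missing or mistyped envelope field
    try:
        fields = json.loads(envelope["response"])
    except json.JSONDecodeError:
        return None                      # model text is not valid JSON
    return fields if isinstance(fields, dict) else None
```

Returning `None` for every failure mode gives the caller a single "unavailable" signal to map onto the LLM_UNAVAILABLE flag.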
Tests:
- test_llm_client.py: httpx MockTransport-driven tests for the wire
format, error paths (HTTP 5xx, malformed JSON, missing envelope,
ConnectError), and request shape.
- test_llm_extractor.py: merge policy + None-on-unavailable behaviour.
- test_orchestrator_llm.py: end-to-end orchestrator wiring with stubs
for ingest/preprocess/OCR/table — verifies LLM is skipped when
disabled, skipped when header is complete, called and flagged when
gaps exist, and marked unavailable when the client returns None.
162 unit tests pass total (was 146).
Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
OCR Sprint Service
OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah) documents. Built around FastAPI + PaddleOCR + hybrid extraction (regex → local LLM → validation) with on-premise deployment as a hard requirement.
Status: Phase 1–4 (synchronous + async PDF/image OCR with regex header extraction, PP-Structure personnel-table extraction, validation, confidence scoring, document detection / perspective correction / shadow removal, Celery + Redis job queue, Postgres job state, local-filesystem blob storage, API-key auth, and Prometheus metrics). Phase 5–6 (LLM extraction, HITL) are tracked in docs/architecture.md.
Why this stack
- PaddleOCR is the strongest open-source OCR for mixed-language documents and runs fully on-prem (essential for police data).
- PP-Structure (Phase 3) handles personnel tables natively.
- Regex-first, LLM-fallback extraction keeps deterministic fields fast and predictable while letting an LLM handle format drift across Polri units.
- CPU-friendly defaults: a small (1.5B–4B) local LLM via Ollama is the recommended default; the architecture is also GPU-ready.
See docs/architecture.md for the full architecture, accuracy expectations, and roadmap.
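
As an illustration of the regex-first layer, the sketch below pulls a sprint number and an ISO date out of OCR text. The patterns and function names are hypothetical and are not taken from `regex_rules.py`. They assume numbers shaped like the sample `Sprin/123/IV/2025/Reskrim` and dates written with Indonesian month names.

```python
import re
from datetime import date
from typing import Optional

# Assumed shape: Sprin/<number>/<roman month>/<year>[/<unit>...]
SPRINT_NO = re.compile(r"\bSprin/\d+/[IVXLCDM]+/\d{4}(?:/[\w-]+)*", re.IGNORECASE)

# "21 April 2025" style dates; the month name is validated against MONTHS.
DATE = re.compile(r"\b(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\b")

MONTHS = {
    "januari": 1, "februari": 2, "maret": 3, "april": 4,
    "mei": 5, "juni": 6, "juli": 7, "agustus": 8,
    "september": 9, "oktober": 10, "november": 11, "desember": 12,
}


def find_nomor_sprint(text: str) -> Optional[str]:
    """Return the first sprint number found in the text, or None."""
    match = SPRINT_NO.search(text)
    return match.group(0) if match else None


def find_tanggal(text: str) -> Optional[str]:
    """Return the first valid Indonesian-format date as YYYY-MM-DD, or None."""
    for day, month_name, year in DATE.findall(text):
        month = MONTHS.get(month_name.lower())
        if month is None:
            continue                     # not a month name, keep scanning
        try:
            return date(int(year), month, int(day)).isoformat()
        except ValueError:
            continue                     # e.g. 31 Februari
    return None
```

A field left as `None` here is exactly the kind of gap the LLM fallback is meant to fill when a unit's letterhead drifts from the expected format.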
Quickstart
Prerequisites
- Python 3.10–3.12
- ~3 GB free disk for PaddleOCR model downloads on first run
- Linux/macOS recommended (Windows works but PaddleOCR install can be finicky)
Install (local dev)
git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service
python -m venv .venv && source .venv/bin/activate
make install # installs runtime + dev deps + pre-commit
pip install -e ".[ocr]" # only on the worker host — pulls Paddle wheels (~1.5 GB)
cp .env.example .env # edit if you need GPU / different storage path
Run the API
make dev
# → http://localhost:8000/docs
Try it out
The default POST /documents is async — it returns 202 Accepted with a job_id and the worker fills in the result. For tests / local one-shot usage you can append ?sync=true to run inline.
# Async (production flow)
curl -F "file=@samples/pdf/example.pdf" \
  -H "X-API-Key: $API_KEY" \
  http://localhost:8000/api/v1/documents | jq
# → {"job_id":"8f2a...","status":"pending",...}

curl -H "X-API-Key: $API_KEY" \
  http://localhost:8000/api/v1/documents/8f2a... | jq

# Sync (single small doc, no worker required)
curl -F "file=@samples/pdf/example.pdf" \
  "http://localhost:8000/api/v1/documents?sync=true" | jq
Expected response (truncated):
{
  "job_id": "8f2a...",
  "status": "completed",
  "confidence": 0.93,
  "data": {
    "header": {
      "nomor_sprint": "Sprin/123/IV/2025/Reskrim",
      "tanggal": "2025-04-21",
      "satuan_penerbit": "KEPOLISIAN RESOR BANDUNG",
      "perihal": "Pelaksanaan penyelidikan kasus pencurian",
      "dasar": ["Undang-Undang Nomor 2 Tahun 2002 ...", "..."]
    },
    "personel": [],
    "ttd": { "nrp": "12345678" }
  },
  "review_flags": []
}
Note: As of Phase 3 the personel[] array is populated from PP-Structure table recognition. Set TABLES_ENABLED=false in .env to skip the table stage (faster on documents that you know contain no personnel table).
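
For the async flow, a client typically polls the job endpoint until the status leaves "pending". The sketch below is one hedged way to do that with the standard library. The `wait_for_job` and `http_fetcher` helpers are hypothetical, and "failed" as a terminal status is an assumption; only "pending" and "completed" appear in the examples above.

```python
import json
import time
import urllib.request
from typing import Callable

# "completed" comes from the sample response; "failed" is an assumed status.
TERMINAL_STATUSES = {"completed", "failed"}


def http_fetcher(base_url: str, api_key: str) -> Callable[[str], dict]:
    """Build a function that GETs /api/v1/documents/<job_id> with the API key."""
    def fetch(job_id: str) -> dict:
        request = urllib.request.Request(
            f"{base_url.rstrip('/')}/api/v1/documents/{job_id}",
            headers={"X-API-Key": api_key},
        )
        with urllib.request.urlopen(request, timeout=10) as response:
            return json.load(response)
    return fetch


def wait_for_job(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still pending after {timeout_s}s")
        sleep(interval_s)
```

Usage against a local instance would look like `wait_for_job(http_fetcher("http://localhost:8000", api_key), job_id)`, where `job_id` is the value returned by the upload call.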
Docker
The Phase 4 stack runs four services: api, worker (Celery), redis, and postgres. Blob uploads are persisted to a Docker volume — there is no MinIO/S3 dependency.
docker compose build
docker compose up -d
docker compose logs -f api worker
The API container runs alembic upgrade head on start, so the jobs table is created on first boot. The first request will trigger PaddleOCR to download its detection/recognition/cls models (~200 MB) into the paddle-models volume.
Metrics are exposed at http://localhost:8000/metrics in Prometheus text format.
Development
make fmt # format with ruff
make lint # lint
make typecheck # mypy strict mode
make test # pytest
make test-cov # pytest + coverage
Pre-commit hooks run ruff on every commit. Install once with pre-commit install (already done by make install).
Project layout
src/ocr_sprint/
  api/         # FastAPI routes + error handlers
  schemas/     # Pydantic v2 models (request/response, extraction, personnel)
  pipeline/    # ingest → document_detect → preprocess → ocr + table → extract → validate → score
  extract/     # regex_rules.py (Phase 1) + personnel.py (Phase 3) → llm.py (Phase 5)
  data/        # master data (Polri ranks, etc.)
  utils/       # logging, helpers
  config.py    # pydantic-settings
  main.py      # app factory
tests/unit/    # 100+ unit tests, PaddleOCR / PP-Structure mocked
docs/          # architecture & decision records
Roadmap
| Phase | Scope | Status |
|---|---|---|
| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | Done |
| 2 | OpenCV-based document detection, perspective transform, shadow removal for phone photos | Done |
| 3 | PP-Structure table extraction for personnel rows + column mapper | Done |
| 4 | Async pipeline (Celery + Redis), Postgres job state, local-filesystem blob storage, API-key auth, Prometheus metrics | Done |
| 5 | LLM hybrid extraction (Ollama + structured output) | Planned |
| 6 | HITL review endpoints + audit trail | Planned |
License
Proprietary — internal use only.