devin-ai-integration[bot] 2112023b6e Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)
* Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 4: fix sync-mode rollback orphaning blobs + use is_relative_to for path-escape check

Devin Review on PR #3 found two real bugs:

1. Sync path mark_failed was rolled back by the request-scoped session.
   When the pipeline raised an exception in ?sync=true mode, _run_inline
   modified the FastAPI session and re-raised; get_session caught the
   exception, called session.rollback(), and wiped both the create() and
   the mark_failed() writes. The blob was already on disk, so it was
   permanently orphaned with no DB record. Fix: commit the pending row
   immediately after create(), and run all subsequent state transitions in
   independent session_scope blocks (matching the worker task pattern).

2. _resolve used str.startswith for path-escape detection, which lets a
   sibling directory whose name begins with the storage root pass (e.g.
   /app/blobs_evil vs /app/blobs). Switched to Path.is_relative_to.
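
A minimal sketch of why the prefix check is unsafe, using pure paths so nothing touches the filesystem. The helper names are hypothetical and are not the service's `_resolve`; the real code also resolves paths first, which this sketch skips.

```python
from pathlib import PurePosixPath


def inside_by_prefix(base: PurePosixPath, candidate: PurePosixPath) -> bool:
    """Buggy check: plain string-prefix comparison."""
    return str(candidate).startswith(str(base))


def inside_by_path(base: PurePosixPath, candidate: PurePosixPath) -> bool:
    """Component-wise check: candidate must be base itself or live beneath it."""
    return candidate.is_relative_to(base)


base = PurePosixPath("/app/blobs")
sibling = PurePosixPath("/app/blobs_evil/secret.bin")
legit = PurePosixPath("/app/blobs/2026/04/scan.pdf")

print(inside_by_prefix(base, sibling))  # True: the escape goes undetected
print(inside_by_path(base, sibling))    # False
print(inside_by_path(base, legit))      # True
```

`is_relative_to` compares whole path components, so `blobs_evil` never matches `blobs`.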

Added regression tests for both.
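
The first bug can be reproduced in miniature with the standard-library `sqlite3` module standing in for the Postgres session. This is only an illustration of the transaction behaviour; the table layout and function names are assumptions, not the service's code.

```python
import sqlite3


def new_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.commit()
    return conn


def run_buggy(conn: sqlite3.Connection) -> list:
    """create() and mark_failed() share one transaction, which is then rolled back."""
    conn.execute("INSERT INTO jobs VALUES ('job-1', 'pending')")
    conn.execute("UPDATE jobs SET status = 'failed' WHERE id = 'job-1'")
    conn.rollback()  # stands in for the request-scoped rollback on the exception
    return conn.execute("SELECT id, status FROM jobs").fetchall()


def run_fixed(conn: sqlite3.Connection) -> list:
    """Commit the pending row first, then commit each state transition on its own."""
    conn.execute("INSERT INTO jobs VALUES ('job-1', 'pending')")
    conn.commit()
    conn.execute("UPDATE jobs SET status = 'failed' WHERE id = 'job-1'")
    conn.commit()
    conn.rollback()  # the later rollback has nothing left to undo
    return conn.execute("SELECT id, status FROM jobs").fetchall()


print(run_buggy(new_db()))  # []
print(run_fixed(new_db()))  # [('job-1', 'failed')]
```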

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 4: honor queue_enabled setting + resolve base_dir for path comparisons

Two more bugs found by Devin Review:

3. queue_enabled was declared in config and documented in .env.example but
   never read by the route. A fresh dev install with QUEUE_ENABLED=false
   (the default) would still enqueue, then fail with a Redis connection
   error. Fixed by making the ?sync= query param default to None and
   resolving to (not queue_enabled) inside the route. Tests now set
   QUEUE_ENABLED=true so the async flow stays exercised, and a new test
   verifies the inline fallback when the queue is disabled.

4. LocalFsBlobStorage stored base_dir as-is. _resolve resolved its
   candidate paths, so the empty-dir cleanup loop in delete() compared a
   resolved candidate against an unresolved base_dir and broke on the
   first iteration (no cleanup ever happened). Fixed by resolving base_dir
   once in __init__ so every path comparison is apples-to-apples.
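
The resolution rule from fix 3 can be sketched as a small pure function. `resolve_sync` is a hypothetical name, and treating an explicit `?sync=` value as always winning over the setting is an assumption about the route.

```python
from typing import Optional


def resolve_sync(sync: Optional[bool], queue_enabled: bool) -> bool:
    """Decide whether a request runs inline.

    sync is the ?sync= query parameter (None when omitted);
    queue_enabled mirrors the QUEUE_ENABLED setting.
    """
    if sync is None:
        # No explicit choice: run inline exactly when there is no queue.
        return not queue_enabled
    # Assumption: an explicit ?sync= value is honored as given.
    return sync


print(resolve_sync(None, queue_enabled=False))  # True: inline fallback
print(resolve_sync(None, queue_enabled=True))   # False: enqueue
print(resolve_sync(True, queue_enabled=True))   # True: forced inline
```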

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 4: derive ocr_jobs_total from DB so worker writes are visible at /metrics

Devin Review correctly noted the Counter-based JOBS_TOTAL would never
increment in production because the worker runs in a separate process from
the API and the registry is process-local. Replaced JOBS_TOTAL with a
custom Collector that issues SELECT status, COUNT(*) FROM jobs GROUP BY
status on every /metrics scrape. Result: the metric stays accurate
regardless of which process wrote the row.
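
A rough sketch of the idea, with two deliberate substitutions: `sqlite3` stands in for Postgres, and the Prometheus text output is rendered by hand instead of through a prometheus_client Collector. The function name and help text are assumptions.

```python
import sqlite3


def render_jobs_total(conn: sqlite3.Connection) -> str:
    """Run the GROUP BY query at scrape time and emit one sample per status."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM jobs GROUP BY status ORDER BY status"
    ).fetchall()
    lines = ["# HELP ocr_jobs_total Jobs by status, read from the database."]
    for status, count in rows:
        lines.append(f'ocr_jobs_total{{status="{status}"}} {count}')
    return "\n".join(lines) + "\n"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
# Rows written by "another process" are visible because the DB is the source of truth.
conn.executemany(
    "INSERT INTO jobs (status) VALUES (?)",
    [("completed",), ("completed",), ("failed",)],
)
conn.commit()

print(render_jobs_total(conn))
```

Because the value is computed from the table on every scrape, it does not matter whether the API or the worker wrote the row.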

Also corrected the metrics.py docstring (the old comment claimed the
counter was 'incremented by the worker', which was the bug).

Removed the JOBS_TOTAL.inc() calls from the sync route — the DB collector
covers both paths now. JOB_PROCESSING_SECONDS stays as an API-process
histogram with an updated docstring noting its scope; cross-process
latency belongs to derived dashboards over jobs.created_at/updated_at.

Added regression test test_metrics_jobs_total_reflects_worker_writes.

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
2026-04-25 16:50:51 +00:00

OCR Sprint Service

OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah, a written order) documents. Built around FastAPI + PaddleOCR + hybrid extraction (regex → local LLM → validation) with on-premise deployment as a hard requirement.

Status: Phases 1–4 are done, covering synchronous + async PDF/image OCR with regex header extraction, PP-Structure personnel-table extraction, validation, confidence scoring, document detection / perspective correction / shadow removal, Celery + Redis job queue, Postgres job state, local-filesystem blob storage, API-key auth, and Prometheus metrics. Phases 5–6 (LLM extraction, HITL) are tracked in docs/architecture.md.

Why this stack

  • PaddleOCR is the strongest open-source OCR for mixed-language documents and runs fully on-prem (essential for police data).
  • PP-Structure (Phase 3) handles personnel tables natively.
  • Regex-first, LLM-fallback extraction keeps deterministic fields fast and predictable while letting an LLM handle format drift across Polri units.
  • CPU-friendly defaults: a small (1.5B–4B) local LLM via Ollama is the recommended default; the architecture is also GPU-ready.
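
The regex-first, LLM-fallback idea can be sketched for a single field as follows. The pattern, the function name, and the `llm_fallback` callable are all hypothetical illustrations, not the rules in regex_rules.py or the planned llm.py.

```python
import re
from typing import Callable, Optional

# Assumed shape of a sprint number, modelled on "Sprin/123/IV/2025/Reskrim".
NOMOR_SPRINT_RE = re.compile(r"Sprin/\d+/[IVX]+/\d{4}/\w+", re.IGNORECASE)


def extract_nomor_sprint(
    text: str, llm_fallback: Callable[[str], Optional[str]]
) -> tuple[Optional[str], str]:
    """Return (value, source). The LLM is consulted only when the regex misses."""
    match = NOMOR_SPRINT_RE.search(text)
    if match:
        return match.group(0), "regex"
    return llm_fallback(text), "llm"


# A stand-in for a local model call; a real fallback would query the LLM.
def fake_llm(text: str) -> Optional[str]:
    return None


print(extract_nomor_sprint("Nomor: Sprin/123/IV/2025/Reskrim", fake_llm))
# ('Sprin/123/IV/2025/Reskrim', 'regex')
```

Recording the source alongside the value makes it possible to weight regex hits and LLM guesses differently when scoring confidence.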

See docs/architecture.md for the full architecture, accuracy expectations, and roadmap.

Quickstart

Prerequisites

  • Python 3.10–3.12
  • ~3 GB free disk for PaddleOCR model downloads on first run
  • Linux/macOS recommended (Windows works but PaddleOCR install can be finicky)

Install (local dev)

git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service

python -m venv .venv && source .venv/bin/activate
make install         # installs runtime + dev deps + pre-commit
pip install -e ".[ocr]"  # only on the worker host — pulls Paddle wheels (~1.5 GB)
cp .env.example .env # edit if you need GPU / different storage path

Run the API

make dev
# → http://localhost:8000/docs

Try it out

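With the queue enabled (QUEUE_ENABLED=true), POST /api/v1/documents is async by default: it returns 202 Accepted with a job_id, and the worker fills in the result. With QUEUE_ENABLED=false, a request without ?sync= falls back to running inline. For tests or local one-shot usage, append ?sync=true to force inline processing.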

# Async (production flow)
curl -F "file=@samples/pdf/example.pdf" \
  -H "X-API-Key: $API_KEY" \
  http://localhost:8000/api/v1/documents | jq
# → {"job_id":"8f2a...","status":"pending",...}

curl -H "X-API-Key: $API_KEY" \
  http://localhost:8000/api/v1/documents/8f2a... | jq

# Sync (single small doc, no worker required)
curl -F "file=@samples/pdf/example.pdf" \
  -H "X-API-Key: $API_KEY" \
  "http://localhost:8000/api/v1/documents?sync=true" | jq

Expected response (truncated):

{
  "job_id": "8f2a...",
  "status": "completed",
  "confidence": 0.93,
  "data": {
    "header": {
      "nomor_sprint": "Sprin/123/IV/2025/Reskrim",
      "tanggal": "2025-04-21",
      "satuan_penerbit": "KEPOLISIAN RESOR BANDUNG",
      "perihal": "Pelaksanaan penyelidikan kasus pencurian",
      "dasar": ["Undang-Undang Nomor 2 Tahun 2002 ...", "..."]
    },
    "personel": [],
    "ttd": { "nrp": "12345678" }
  },
  "review_flags": []
}

Note: As of Phase 3 the personel[] array is populated from PP-Structure table recognition. Set TABLES_ENABLED=false in .env to skip the table stage (faster on documents that you know contain no personnel table).
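
For scripted use, the async flow (submit, then poll by job_id) could be driven from Python roughly as below. This is a sketch using only the standard library; the helper names, the polling interval, and the assumption that "completed" and "failed" are the only terminal statuses are mine, not the service's.

```python
import json
import time
import urllib.request
from typing import Any, Callable

BASE_URL = "http://localhost:8000/api/v1"  # taken from the curl examples
TERMINAL_STATUSES = {"completed", "failed"}  # assumed terminal states


def http_fetch_job(job_id: str, api_key: str) -> dict[str, Any]:
    """GET /documents/{job_id}, sending the X-API-Key header."""
    request = urllib.request.Request(
        f"{BASE_URL}/documents/{job_id}", headers={"X-API-Key": api_key}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def wait_for_job(
    fetch_job: Callable[[str], dict[str, Any]],
    job_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout_s}s")
        sleep(interval_s)
```

A caller would typically bind the API key first, for example `wait_for_job(lambda j: http_fetch_job(j, api_key), job_id)`. Injecting `fetch_job` and `sleep` keeps the loop testable without a running server.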

Docker

The Phase 4 stack runs four services: api, worker (Celery), redis, and postgres. Blob uploads are persisted to a Docker volume — there is no MinIO/S3 dependency.

docker compose build
docker compose up -d
docker compose logs -f api worker

The API container runs alembic upgrade head on start, so the jobs table is created on first boot. The first request will trigger PaddleOCR to download its detection/recognition/cls models (~200 MB) into the paddle-models volume.

Metrics are exposed at http://localhost:8000/metrics in Prometheus text format.

Development

make fmt        # format with ruff
make lint       # lint
make typecheck  # mypy strict mode
make test       # pytest
make test-cov   # pytest + coverage

Pre-commit hooks run ruff on every commit. Install once with pre-commit install (already done by make install).

Project layout

src/ocr_sprint/
  api/          # FastAPI routes + error handlers
  schemas/      # Pydantic v2 models (request/response, extraction, personnel)
  pipeline/     # ingest → document_detect → preprocess → ocr + table → extract → validate → score
    extract/    # regex_rules.py (Phase 1) + personnel.py (Phase 3) → llm.py (Phase 5)
  data/         # master data (Polri ranks, etc.)
  utils/        # logging, helpers
  config.py     # pydantic-settings
  main.py       # app factory
tests/unit/     # 100+ unit tests, PaddleOCR / PP-Structure mocked
docs/           # architecture & decision records

Roadmap

| Phase | Scope | Status |
|-------|-------|--------|
| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | Done |
| 2 | OpenCV-based document detection, perspective transform, shadow removal for phone photos | Done |
| 3 | PP-Structure table extraction for personnel rows + column mapper | Done |
| 4 | Async pipeline (Celery + Redis), Postgres job state, local-filesystem blob storage, API-key auth, Prometheus metrics | Done |
| 5 | LLM hybrid extraction (Ollama + structured output) | Planned |
| 6 | HITL review endpoints + audit trail | Planned |

License

Proprietary — internal use only.
