Phase 1 MVP: synchronous OCR + regex header extraction

Implements the foundation of the OCR Sprint service:

- FastAPI app with /api/v1/health and /api/v1/documents (sync upload)
- Pydantic v2 schemas for documents, extraction result, personnel
- Pipeline: PDF/image ingest (PyMuPDF), preprocessing (resize, deskew, denoise, optional adaptive threshold), PaddleOCR wrapper, regex-based header extraction (nomor sprint, tanggal, satuan, perihal, dasar), signatory NRP, master-pangkat validation, confidence scoring + routing.
- Tests: 61 unit tests covering regex rules, validators, preprocess, ingest, confidence, and API contract (PaddleOCR mocked).
- Tooling: pyproject (setuptools), ruff, mypy strict, pytest, pre-commit, Dockerfile, docker-compose, Makefile.
- Docs: README + docs/architecture.md (full hybrid stack rationale and 6-phase roadmap).

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
2026-04-25 14:58:50 +00:00
commit ca0c0a0428
45 changed files with 2457 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,123 @@
+# OCR Sprint Service
+
+OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah) documents. Built around **FastAPI + PaddleOCR + hybrid extraction (regex → local LLM → validation)** with **on-premise** deployment as a hard requirement.
+
+> **Status:** Phase 1 MVP (synchronous PDF/image OCR with regex header extraction, validation, and confidence scoring). Phases 2–6 (document detection, table extraction, async pipeline, LLM extraction, human-in-the-loop (HITL) review) are tracked in [`docs/architecture.md`](docs/architecture.md).
+
+## Why this stack
+
+- **PaddleOCR** is the strongest open-source OCR for mixed-language documents and runs fully on-prem (essential for police data).
+- **PP-Structure** (Phase 3) handles personnel tables natively.
+- **Regex-first, LLM-fallback extraction** keeps deterministic fields fast and predictable while letting an LLM handle format drift across Polri units.
+- **CPU-friendly defaults**: a small (1.5B–4B) local LLM via Ollama is the recommended default; the architecture is also GPU-ready.
+
+See [`docs/architecture.md`](docs/architecture.md) for the full architecture, accuracy expectations, and roadmap.
+
+## Quickstart
+
+### Prerequisites
+
+- Python **3.10–3.12**
+- ~3 GB free disk for PaddleOCR model downloads on first run
+- Linux/macOS recommended (Windows works but PaddleOCR install can be finicky)
+
+### Install (local dev)
+
+```bash
+git clone https://github.com/Adriankf59/ocr-sprint-service.git
+cd ocr-sprint-service
+
+python -m venv .venv && source .venv/bin/activate
+make install         # installs runtime + dev deps + pre-commit
+cp .env.example .env # edit if you need GPU / different storage path
+```
+
+### Run the API
+
+```bash
+make dev
+# → http://localhost:8000/docs
+```
+
+### Try it out
+
+```bash
+curl -F "file=@samples/pdf/example.pdf" http://localhost:8000/api/v1/documents | jq
+```
+
+Expected response (truncated):
+
+```json
+{
+  "job_id": "8f2a...",
+  "status": "completed",
+  "confidence": 0.93,
+  "data": {
+    "header": {
+      "nomor_sprint": "Sprin/123/IV/2025/Reskrim",
+      "tanggal": "2025-04-21",
+      "satuan_penerbit": "KEPOLISIAN RESOR BANDUNG",
+      "perihal": "Pelaksanaan penyelidikan kasus pencurian",
+      "dasar": ["Undang-Undang Nomor 2 Tahun 2002 ...", "..."]
+    },
+    "personel": [],
+    "ttd": { "nrp": "12345678" }
+  },
+  "review_flags": []
+}
+```
+
+> **Note:** Phase 1 does not yet populate the `personel[]` table — that requires PP-Structure (Phase 3). Header fields, signatory NRP, confidence, and HITL routing are fully wired.
+
+### Docker
+
+```bash
+docker compose build
+docker compose up -d
+docker compose logs -f api
+```
+
+The first request will trigger PaddleOCR to download its detection/recognition/cls models (~200 MB) into the `paddle-models` volume.
+
+## Development
+
+```bash
+make fmt        # format with ruff
+make lint       # lint
+make typecheck  # mypy strict mode
+make test       # pytest
+make test-cov   # pytest + coverage
+```
+
+Pre-commit hooks run ruff on every commit. Install once with `pre-commit install` (already done by `make install`).
+
+## Project layout
+
+```
+src/ocr_sprint/
+  api/          # FastAPI routes + error handlers
+  schemas/      # Pydantic v2 models (request/response, extraction, personnel)
+  pipeline/     # ingest → preprocess → ocr → extract → validate → score
+    extract/    # regex_rules.py (Phase 1) → llm.py (Phase 5)
+  data/         # master data (Polri ranks, etc.)
+  utils/        # logging, helpers
+  config.py     # pydantic-settings
+  main.py       # app factory
+tests/unit/     # ~60 unit tests, no PaddleOCR dependency
+docs/           # architecture & decision records
+```
+
+## Roadmap
+
+| Phase | Scope | Status |
+|---|---|---|
+| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | **In progress** |
+| 2 | DocTR document detection + dewarping for phone photos | Planned |
+| 3 | PP-Structure table extraction for personnel rows | Planned |
+| 4 | Async pipeline (Celery + Redis), Postgres + MinIO, auth, observability | Planned |
+| 5 | LLM hybrid extraction (Ollama + structured output) | Planned |
+| 6 | HITL review endpoints + audit trail | Planned |
+
+## License
+
+Proprietary — internal use only.