Files
OCR-SPRIN-SERVICE/README.md
Devin AI ca0c0a0428 Phase 1 MVP: synchronous OCR + regex header extraction
Implements the foundation of the OCR Sprint service:
- FastAPI app with /api/v1/health and /api/v1/documents (sync upload)
- Pydantic v2 schemas for documents, extraction result, personnel
- Pipeline: PDF/image ingest (PyMuPDF), preprocessing (resize, deskew,
  denoise, optional adaptive threshold), PaddleOCR wrapper, regex-based
  header extraction (nomor sprint, tanggal, satuan, perihal, dasar),
  signatory NRP, master-pangkat validation, confidence scoring + routing.
- Tests: 61 unit tests covering regex rules, validators, preprocess,
  ingest, confidence, and API contract (PaddleOCR mocked).
- Tooling: pyproject (setuptools), ruff, mypy strict, pytest, pre-commit,
  Dockerfile, docker-compose, Makefile.
- Docs: README + docs/architecture.md (full hybrid stack rationale and
  6-phase roadmap).

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
2026-04-25 14:58:50 +00:00

124 lines
4.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# OCR Sprint Service
OCR + structured extraction service for Indonesian police "surat sprint" (surat perintah) documents. Built around **FastAPI + PaddleOCR + hybrid extraction (regex → LLM lokal → validation)** with **on-premise** deployment as a hard requirement.
> **Status:** Phase 1 MVP — synchronous PDF/image OCR with regex header extraction, validation, and confidence scoring. Phase 26 (document detection, table extraction, async pipeline, LLM extraction, HITL) are tracked in [`docs/architecture.md`](docs/architecture.md).
## Why this stack
- **PaddleOCR** is the strongest open-source OCR for mixed-language documents and runs fully on-prem (essential for police data).
- **PP-Structure** (Phase 3) handles personnel tables natively.
- **Regex-first, LLM-fallback extraction** keeps deterministic fields fast and predictable while letting an LLM handle format drift across Polri units.
- **CPU-friendly defaults**: a small (1.5B4B) local LLM via Ollama is the recommended default; the architecture is also GPU-ready.
See [`docs/architecture.md`](docs/architecture.md) for the full architecture, accuracy expectations, and roadmap.
## Quickstart
### Prerequisites
- Python **3.103.12**
- ~3 GB free disk for PaddleOCR model downloads on first run
- Linux/macOS recommended (Windows works but PaddleOCR install can be finicky)
### Install (local dev)
```bash
git clone https://github.com/Adriankf59/ocr-sprint-service.git
cd ocr-sprint-service
python -m venv .venv && source .venv/bin/activate
make install # installs runtime + dev deps + pre-commit
cp .env.example .env # edit if you need GPU / different storage path
```
### Run the API
```bash
make dev
# → http://localhost:8000/docs
```
### Try it out
```bash
curl -F "file=@samples/pdf/example.pdf" http://localhost:8000/api/v1/documents | jq
```
Expected response (truncated):
```json
{
"job_id": "8f2a...",
"status": "completed",
"confidence": 0.93,
"data": {
"header": {
"nomor_sprint": "Sprin/123/IV/2025/Reskrim",
"tanggal": "2025-04-21",
"satuan_penerbit": "KEPOLISIAN RESOR BANDUNG",
"perihal": "Pelaksanaan penyelidikan kasus pencurian",
"dasar": ["Undang-Undang Nomor 2 Tahun 2002 ...", "..."]
},
"personel": [],
"ttd": { "nrp": "12345678" }
},
"review_flags": []
}
```
> **Note:** Phase 1 does not yet populate the `personel[]` table — that requires PP-Structure (Phase 3). Header fields, signatory NRP, confidence, and HITL routing are fully wired.
### Docker
```bash
docker compose build
docker compose up -d
docker compose logs -f api
```
The first request will trigger PaddleOCR to download its detection/recognition/cls models (~200 MB) into the `paddle-models` volume.
## Development
```bash
make fmt # format with ruff
make lint # lint
make typecheck # mypy strict mode
make test # pytest
make test-cov # pytest + coverage
```
Pre-commit hooks run ruff on every commit. Install once with `pre-commit install` (already done by `make install`).
## Project layout
```
src/ocr_sprint/
api/ # FastAPI routes + error handlers
schemas/ # Pydantic v2 models (request/response, extraction, personnel)
pipeline/ # ingest → preprocess → ocr → extract → validate → score
extract/ # regex_rules.py (Phase 1) → llm.py (Phase 5)
data/ # master data (Polri ranks, etc.)
utils/ # logging, helpers
config.py # pydantic-settings
main.py # app factory
tests/unit/ # ~60 unit tests, no PaddleOCR dependency
docs/ # architecture & decision records
```
## Roadmap
| Phase | Scope | Status |
|---|---|---|
| 1 | Sync API, PDF/image ingest, basic preprocessing, PaddleOCR, regex header extraction, validation, confidence scoring | **In progress** |
| 2 | DocTR document detection + dewarping for phone photos | Planned |
| 3 | PP-Structure table extraction for personnel rows | Planned |
| 4 | Async pipeline (Celery + Redis), Postgres + MinIO, auth, observability | Planned |
| 5 | LLM hybrid extraction (Ollama + structured output) | Planned |
| 6 | HITL review endpoints + audit trail | Planned |
## License
Proprietary — internal use only.