Phase 1 MVP: synchronous OCR + regex header extraction

Implements the foundation of the OCR Sprint service:

- FastAPI app with /api/v1/health and /api/v1/documents (sync upload)
- Pydantic v2 schemas for documents, extraction result, personnel
- Pipeline: PDF/image ingest (PyMuPDF), preprocessing (resize, deskew, denoise, optional adaptive threshold), PaddleOCR wrapper, regex-based header extraction (nomor sprint, tanggal, satuan, perihal, dasar), signatory NRP, master-pangkat validation, confidence scoring + routing.
- Tests: 61 unit tests covering regex rules, validators, preprocess, ingest, confidence, and API contract (PaddleOCR mocked).
- Tooling: pyproject (setuptools), ruff, mypy strict, pytest, pre-commit, Dockerfile, docker-compose, Makefile.
- Docs: README + docs/architecture.md (full hybrid stack rationale and 6-phase roadmap).

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
2026-04-25 14:58:50 +00:00
commit ca0c0a0428
45 changed files with 2457 additions and 0 deletions
--- a/.env.example
+++ b/.env.example
@@ -0,0 +1,43 @@
+# ==== App ====
+APP_ENV=local                 # local | dev | staging | prod
+APP_HOST=0.0.0.0
+APP_PORT=8000
+APP_LOG_LEVEL=INFO
+
+# ==== Storage (Phase 1: local filesystem) ====
+STORAGE_LOCAL_DIR=./storage
+
+# ==== OCR ====
+OCR_LANG=latin                # PaddleOCR lang code; "latin" works well for Bahasa Indonesia
+OCR_USE_GPU=false             # set true if running on a GPU host
+OCR_DET_MODEL_DIR=""           # leave empty to use PaddleOCR defaults (quoted so the comment is not parsed as the value)
+OCR_REC_MODEL_DIR=
+OCR_CLS_MODEL_DIR=
+OCR_MAX_IMAGE_SIDE=2200       # downscale longest side before OCR
+
+# ==== Preprocessing ====
+PREPROCESS_TARGET_DPI=300
+PREPROCESS_DENOISE=true
+PREPROCESS_DESKEW=true
+PREPROCESS_ADAPTIVE_THRESHOLD=false  # turn on for low-quality phone photos
+
+# ==== Confidence / routing (Phase 5) ====
+CONFIDENCE_AUTO_APPROVE=0.95
+CONFIDENCE_NEEDS_REVIEW=0.85
+
+# ==== LLM (Phase 5, optional) ====
+LLM_ENABLED=false
+LLM_PROVIDER=ollama
+LLM_MODEL=qwen2.5:1.5b        # CPU-friendly default
+LLM_BASE_URL=http://localhost:11434
+LLM_TIMEOUT_S=60
+
+# ==== Async pipeline (Phase 4, optional) ====
+QUEUE_ENABLED=false
+REDIS_URL=redis://localhost:6379/0
+DATABASE_URL=postgresql+psycopg://ocr:ocr@localhost:5432/ocr_sprint
+MINIO_ENDPOINT=localhost:9000
+MINIO_ACCESS_KEY=minioadmin
+MINIO_SECRET_KEY=minioadmin
+MINIO_BUCKET=ocr-sprint
+MINIO_SECURE=false
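
The two confidence thresholds above suggest a three-way routing decision. The following is a minimal sketch of how they might be consumed, not the code in this commit: the names `RoutingThresholds`, `load_thresholds`, `route`, and the route labels (`auto_approve`, `needs_review`, `manual_entry`) are hypothetical, and only the environment variable names and default values are taken from `.env.example`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RoutingThresholds:
    """Hypothetical container for the two thresholds in .env.example."""
    auto_approve: float
    needs_review: float


def load_thresholds(env: Optional[Mapping[str, str]] = None) -> RoutingThresholds:
    """Read thresholds from the environment, falling back to the example defaults."""
    env = os.environ if env is None else env
    auto = float(env.get("CONFIDENCE_AUTO_APPROVE", "0.95"))
    review = float(env.get("CONFIDENCE_NEEDS_REVIEW", "0.85"))
    # Reject configurations where the bands would overlap or leave [0, 1].
    if not 0.0 <= review <= auto <= 1.0:
        raise ValueError(
            "expected 0 <= CONFIDENCE_NEEDS_REVIEW <= CONFIDENCE_AUTO_APPROVE <= 1"
        )
    return RoutingThresholds(auto_approve=auto, needs_review=review)


def route(confidence: float, thresholds: RoutingThresholds) -> str:
    """Map a document-level confidence score to a (hypothetical) route label."""
    if confidence >= thresholds.auto_approve:
        return "auto_approve"
    if confidence >= thresholds.needs_review:
        return "needs_review"
    return "manual_entry"  # assumed label for scores below both thresholds


if __name__ == "__main__":
    thresholds = load_thresholds()
    for score in (0.97, 0.90, 0.50):
        print(score, route(score, thresholds))
```

Validating the ordering at load time means a misconfigured `.env` fails at startup rather than silently routing every document to one band.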