Phase 1 MVP: synchronous OCR + regex header extraction

Implements the foundation of the OCR Sprint service:

- FastAPI app with /api/v1/health and /api/v1/documents (sync upload)
- Pydantic v2 schemas for documents, extraction result, personnel
- Pipeline: PDF/image ingest (PyMuPDF), preprocessing (resize, deskew,
  denoise, optional adaptive threshold), PaddleOCR wrapper, regex-based
  header extraction (nomor sprint, tanggal, satuan, perihal, dasar),
  signatory NRP, master-pangkat validation, confidence scoring + routing.
- Tests: 61 unit tests covering regex rules, validators, preprocess,
  ingest, confidence, and API contract (PaddleOCR mocked).
- Tooling: pyproject (setuptools), ruff, mypy strict, pytest, pre-commit,
  Dockerfile, docker-compose, Makefile.
- Docs: README + docs/architecture.md (full hybrid stack rationale and
  6-phase roadmap).

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
2026-04-25 14:58:50 +00:00
commit ca0c0a0428
45 changed files with 2457 additions and 0 deletions
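
The regex-based header extraction and confidence routing named in the commit message might look roughly like the sketch below. It is an illustration only: the patterns, the `ExtractionResult` dataclass, the `extract_header` function, the 0.8 threshold, and the route names are all assumptions made for this example and are not taken from the committed code, which is not shown in this diff.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for three of the header fields; the real rules
# in the repository may differ in both format and coverage.
PATTERNS = {
    "nomor_sprint": re.compile(r"Nomor\s*:\s*(?P<value>\S+)", re.IGNORECASE),
    "tanggal": re.compile(
        r"Tanggal\s*:\s*(?P<value>\d{1,2}\s+\w+\s+\d{4})", re.IGNORECASE
    ),
    "perihal": re.compile(r"Perihal\s*:\s*(?P<value>.+)", re.IGNORECASE),
}

# Assumed cut-off: results below it are sent to a human reviewer.
REVIEW_THRESHOLD = 0.8


@dataclass
class ExtractionResult:
    fields: dict[str, str] = field(default_factory=dict)
    confidence: float = 0.0
    route: str = "manual_review"


def extract_header(ocr_text: str) -> ExtractionResult:
    """Apply each pattern to the OCR text and route by field coverage."""
    fields: dict[str, str] = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            fields[name] = match.group("value").strip()

    # Naive confidence: fraction of expected fields that were found.
    confidence = len(fields) / len(PATTERNS)
    route = "auto_accept" if confidence >= REVIEW_THRESHOLD else "manual_review"
    return ExtractionResult(fields=fields, confidence=confidence, route=route)


if __name__ == "__main__":
    sample = "Nomor: Sprin/123/IV/2026\nTanggal: 25 April 2026\nPerihal: Penugasan personel\n"
    print(extract_header(sample))
```

Scoring by field coverage is the simplest possible choice; a fuller scorer would presumably also weigh per-token OCR confidence and validator results.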
--- a/samples/README.md
+++ b/samples/README.md
@@ -0,0 +1,13 @@
+# Samples
+
+Drop sample surat sprint files here for local testing. **Do NOT commit real documents** — `.gitignore` excludes binary file extensions in this folder.
+
+Recommended layout:
+```
+samples/
+  pdf/          # PDF scans
+  photo/        # phone photos
+  ground_truth/ # JSON ground-truth labels for evaluation
+```
+
+For sharing real samples with the team, use the project's secured storage (MinIO/S3 once Phase 4 is live), not git.