Devin AI
|
6003d96a94
|
Phase 7: ground-truth export (JSONL + stats) + CLI tool
- GET /api/v1/ground-truth/export: streaming JSONL (approved_only,
  since, until, has_corrections, limit)
- GET /api/v1/ground-truth/stats: total / approved / corrections
  counts + top-N most-corrected field paths
- python -m ocr_sprint.tools.export_ground_truth: operator CLI with
  the same filters + optional --print-stats
- Ground-truth sample reconstructs the pipeline's original output by
  replaying job_corrections in reverse
- docs/ground-truth-format.md: schema + fine-tuning guidance
- 17 new tests (service replay, endpoint filters, CLI)
- 201 total tests passing, ruff / mypy --strict clean
Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
|
2026-04-25 20:24:40 +00:00 |
|
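The reverse replay mentioned in the Phase 7 commit can be sketched roughly as follows. This is a minimal illustration, not the repository's code: it assumes each correction stores a dotted field path together with the value before and after the edit. The `Correction` type, its field names, and the helpers `reconstruct_original` and `to_jsonl_line` are all hypothetical.

```python
from __future__ import annotations

import copy
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Correction:
    """Hypothetical correction record: one edit to one field path."""

    field_path: str  # dotted path, e.g. "header.tanggal" (assumed layout)
    old_value: Any   # value before the reviewer's edit
    new_value: Any   # value after the reviewer's edit


def _set_path(doc: dict[str, Any], path: str, value: Any) -> None:
    """Set a value at a dotted path, creating intermediate dicts as needed."""
    keys = path.split(".")
    node = doc
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


def reconstruct_original(
    final: dict[str, Any], corrections: list[Correction]
) -> dict[str, Any]:
    """Undo corrections newest-first to recover the pre-correction output.

    `corrections` is assumed to be ordered oldest to newest.
    """
    original = copy.deepcopy(final)  # leave the corrected document untouched
    for corr in reversed(corrections):
        _set_path(original, corr.field_path, corr.old_value)
    return original


def to_jsonl_line(sample: dict[str, Any]) -> str:
    """Serialize one sample as a single JSONL line."""
    return json.dumps(sample, ensure_ascii=False) + "\n"


# Example: the same field was corrected twice before approval.
final = {"header": {"tanggal": "2026-04-25", "satuan": "B"}}
corrections = [
    Correction("header.tanggal", "2026-04-2S", "2026-04-26"),
    Correction("header.tanggal", "2026-04-26", "2026-04-25"),
]
original = reconstruct_original(final, corrections)
line = to_jsonl_line({"original": original, "approved": final})
```

Undoing the newest edit first matters when one field is corrected more than once: replaying oldest-first would leave an intermediate value in place rather than the pipeline's raw output.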
Devin AI
|
ca0c0a0428
|
Phase 1 MVP: synchronous OCR + regex header extraction
Implements the foundation of the OCR Sprint service:
- FastAPI app with /api/v1/health and /api/v1/documents (sync upload)
- Pydantic v2 schemas for documents, extraction result, personnel
- Pipeline: PDF/image ingest (PyMuPDF), preprocessing (resize, deskew,
  denoise, optional adaptive threshold), PaddleOCR wrapper, regex-based
  header extraction (nomor sprint, tanggal, satuan, perihal, dasar),
  signatory NRP, master-pangkat validation, confidence scoring + routing.
- Tests: 61 unit tests covering regex rules, validators, preprocess,
  ingest, confidence, and API contract (PaddleOCR mocked).
- Tooling: pyproject (setuptools), ruff, mypy strict, pytest, pre-commit,
  Dockerfile, docker-compose, Makefile.
- Docs: README + docs/architecture.md (full hybrid stack rationale and
  6-phase roadmap).
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
|
2026-04-25 14:58:50 +00:00 |
|
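As a rough illustration of the regex-based header extraction and confidence routing named in the Phase 1 commit, the sketch below pulls a few labelled header fields out of OCR text and routes the document by how many were found. The patterns, the `Label : value` line layout, the 0.8 threshold, and the names `HEADER_PATTERNS`, `extract_header`, and `route` are assumptions made for this example; the actual rules in the repository are not shown in the log.

```python
from __future__ import annotations

import re

_FLAGS = re.IGNORECASE | re.MULTILINE

# Hypothetical patterns: one "Label : value" line per header field.
HEADER_PATTERNS: dict[str, re.Pattern[str]] = {
    "nomor_sprint": re.compile(r"^\s*nomor\s*:\s*(?P<value>\S.*?)\s*$", _FLAGS),
    "tanggal": re.compile(r"^\s*tanggal\s*:\s*(?P<value>\S.*?)\s*$", _FLAGS),
    "perihal": re.compile(r"^\s*perihal\s*:\s*(?P<value>\S.*?)\s*$", _FLAGS),
}


def extract_header(text: str) -> dict[str, str | None]:
    """Return the first match for each header field, or None if absent."""
    result: dict[str, str | None] = {}
    for field, pattern in HEADER_PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group("value") if match else None
    return result


def route(fields: dict[str, str | None], threshold: float = 0.8) -> str:
    """Route by the fraction of fields found (a stand-in confidence score)."""
    found = sum(value is not None for value in fields.values())
    score = found / len(fields)
    return "auto_accept" if score >= threshold else "needs_review"


# Example OCR text with one header field missing.
ocr_text = "SURAT PERINTAH\nNomor : Sprin/123/IV/2026\nTanggal: 25 April 2026\n"
fields = extract_header(ocr_text)
decision = route(fields)
```

Here two of the three fields are found, so the score falls below the assumed threshold and the document is sent for review. A real scorer would presumably also weigh OCR confidence and validator results rather than field coverage alone.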