Devin AI ca0c0a0428 Phase 1 MVP: synchronous OCR + regex header extraction

Implements the foundation of the OCR Sprint service:
- FastAPI app with /api/v1/health and /api/v1/documents (sync upload)
- Pydantic v2 schemas for documents, extraction result, personnel
- Pipeline: PDF/image ingest (PyMuPDF), preprocessing (resize, deskew,
  denoise, optional adaptive threshold), PaddleOCR wrapper, regex-based
  header extraction (nomor sprint, tanggal, satuan, perihal, dasar),
  signatory NRP, master-pangkat validation, confidence scoring + routing.
- Tests: 61 unit tests covering regex rules, validators, preprocess,
  ingest, confidence, and API contract (PaddleOCR mocked).
- Tooling: pyproject (setuptools), ruff, mypy strict, pytest, pre-commit,
  Dockerfile, docker-compose, Makefile.
- Docs: README + docs/architecture.md (full hybrid stack rationale and
  6-phase roadmap).

Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>

2026-04-25 14:58:50 +00:00

Plan & Architecture: OCR Service for Police Sprint Letters

1. Honest Assessment of the Proposed Tech Stack

Your tech stack (FastAPI + PaddleOCR + OpenCV/Pillow + regex) is solid and production-viable, but it is not necessarily the most accurate option for sprint letters (surat sprint). Several gaps need to be filled before it can be called "the best".

What is already right

Component	Rationale
FastAPI	Native async, Pydantic validation, automatic OpenAPI docs; well suited to ML serving.
PaddleOCR (PP-OCRv4/v5)	One of the best open-source OCR engines for documents that mix text and tables. It supports Latin script (suitable for Indonesian) and runs on-premise, which matters for sensitive police documents: cloud OCR such as Google Vision or AWS Textract should be avoided for confidentiality reasons.
OpenCV + Pillow	Industry standard for preprocessing.
Regex/rule-based	Suitable for structured documents such as sprint letters, whose format is relatively fixed.

What is still missing / needs to be added

Table extraction is not handled. The personnel list in a sprint letter is almost always a table (No, Pangkat, NRP, Nama, Jabatan, Keterangan). Regex over the linear text from plain OCR breaks down when table rows split or columns shift. Solution: use PaddleOCR PP-Structure (Paddle's built-in table recognition module) or a dedicated model such as Table Transformer (Microsoft).
Document detection & dewarping for phone photos is not explicit. Phone photos are problematic because of skewed perspective, folds, shadows, uneven lighting, and uneven focus. A manual OpenCV crop + perspective transform alone often fails. Add:
- Document corner detection: DocTR / MobileSAM / an edge-based model, with an OpenCV contour heuristic as the fallback.
- Dewarping: DocTr / DewarpNet for curved (folded) pages.
- Shadow removal: a background-division algorithm or a specialised model.
A 100% regex extraction strategy is brittle. Sprint letters from different units (Polda, Polres, Polsek, Mabes) vary in format: different headers, different field order, and ranks that are sometimes abbreviated (AKP, IPDA) and sometimes written in full. Pure regex would need hundreds of rules and would still miss new cases. Recommended hybrid approach:
- Layer 1: regex/rules for deterministic fields (sprint number, date, legal basis) whose format is fixed.
- Layer 2: schema-aware extraction using a local LLM (Llama 3.1 8B / Qwen2.5 7B via Ollama or vLLM) with structured output (JSON schema / Pydantic) for variable fields (position, task description).
- Layer 3: validation against master data (list of valid ranks, 8-digit NRP format, etc.).
There is no confidence scoring or human-in-the-loop. For police documents, 100% automatic accuracy is a myth. The system must:
- Emit a confidence score per field.
- Automatically flag low-confidence documents for human review.
- Provide a correction UI/endpoint whose feedback can be used for retraining.
End-to-end alternatives are worth considering. If document volume later becomes large and the format is relatively stable, fine-tuning an end-to-end document understanding model may be more accurate:
- Donut (OCR-free, image → JSON directly).
- LayoutLMv3 (combines text + layout + visual features).
- Surya OCR (newer, very good on documents).
For the MVP, stay with PaddleOCR. Donut/LayoutLM are V2 options once a sufficiently large labeled dataset exists (~500–1000 documents).
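The three-layer hybrid approach described above could be wired together roughly as follows. This is a minimal sketch under stated assumptions: the pattern for the sprint number is inferred from the example value `Sprin/123/IV/2025` used elsewhere in this document, the function names are hypothetical, and `llm_fallback` is just a callable standing in for a local LLM call (no real Ollama or vLLM API is shown).

```python
import re
from typing import Callable, Dict, Optional

# Layer 1 pattern. Assumption: sprint numbers look like "Sprin/123/IV/2025"
# (running number / Roman-numeral month / year).
NOMOR_SPRINT_RE = re.compile(r"Sprin/\d+/[IVX]+/\d{4}", re.IGNORECASE)


def layer1_regex(text: str) -> Optional[str]:
    """Layer 1: deterministic extraction with a regex."""
    match = NOMOR_SPRINT_RE.search(text)
    return match.group(0) if match else None


def layer3_validate(value: Optional[str]) -> bool:
    """Layer 3: re-check the final value, whichever layer produced it."""
    return value is not None and NOMOR_SPRINT_RE.fullmatch(value.strip()) is not None


def extract_nomor_sprint(
    text: str, llm_fallback: Callable[[str], Optional[str]]
) -> Dict[str, object]:
    """Try the regex first; call the (hypothetical) LLM fallback only on a miss."""
    value = layer1_regex(text)
    source = "regex"
    if value is None:
        value = llm_fallback(text)  # Layer 2: stand-in for a local LLM call
        source = "llm"
    return {"value": value, "source": source, "valid": layer3_validate(value)}
```

Keeping validation as a separate final step means an LLM answer is never trusted more than a regex answer: both must pass the same check before the field is accepted.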

Verdict

With the components above added, your stack can reach roughly 85–92% field-level accuracy on sprint letters with good scan quality, and roughly 70–80% on phone photos. Without table extraction, dewarping, and hybrid extraction, accuracy will drop under real-world conditions.

2. Recommended Architecture

2.1 Logical Diagram

┌────────────────────────────────────────────────────────────────────┐
│                         Client (Web/Mobile)                        │
└──────────────────────────────┬─────────────────────────────────────┘
                               │ HTTPS (multipart upload)
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│                    FastAPI Gateway (stateless)                     │
│   - Auth (JWT/API key)   - Rate limit   - Request validation       │
└──────────────────────────────┬─────────────────────────────────────┘
                               │ enqueue job
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│              Job Queue (Redis + Celery / RQ / Dramatiq)            │
└──────────────────────────────┬─────────────────────────────────────┘
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│                    OCR Worker Pipeline (GPU/CPU)                   │
│  ┌────────────┐  ┌──────────────┐  ┌───────────┐  ┌────────────┐   │
│  │ 1. Ingest  │→ │ 2. Preproc   │→ │ 3. OCR +  │→ │ 4. Extract │   │
│  │  & detect  │  │ (deskew,     │  │  Layout   │  │ (regex +   │   │
│  │  PDF/IMG   │  │  dewarp,     │  │  PP-Struct│  │  LLM +     │   │
│  │            │  │  denoise)    │  │  + Table) │  │  validate) │   │
│  └────────────┘  └──────────────┘  └───────────┘  └─────┬──────┘   │
│                                                         │          │
│                          ┌──────────────────────────────┘          │
│                          ▼                                         │
│                   ┌──────────────┐                                 │
│                   │ 5. Confidence│ → low conf? flag for review     │
│                   │   scoring    │                                 │
│                   └──────┬───────┘                                 │
└──────────────────────────┼─────────────────────────────────────────┘
                           ▼
┌────────────────────────────────────────────────────────────────────┐
│           Storage: PostgreSQL (metadata) + MinIO/S3 (file)         │
│           + optional vector store (for dedup / search)             │
└────────────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌────────────────────────────────────────────────────────────────────┐
│           Review UI (optional): manual correction + audit trail    │
└────────────────────────────────────────────────────────────────────┘

2.2 Pipeline Detail per Stage

Stage 1: Ingest & Document Detection

PDF: render each page to an image at 300 DPI (pdf2image / PyMuPDF).
Image (phone photo): detect the document corners → crop → perspective transform.
- Library: DocTR document detector (more accurate) as the primary method, OpenCV findContours (fast) as the fallback.
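Whichever detector supplies the four corner points, they have to be put into a consistent order and an output size chosen before the perspective transform. The sketch below shows that step in plain Python using the common sum/difference heuristic; `order_corners` and `target_size` are hypothetical helper names, and the warp itself (for example with OpenCV's `cv2.getPerspectiveTransform` and `cv2.warpPerspective`) is not shown.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def order_corners(points: List[Point]) -> List[Point]:
    """Return corners as [top-left, top-right, bottom-right, bottom-left].

    Heuristic: top-left has the smallest x+y, bottom-right the largest x+y,
    top-right the smallest y-x, bottom-left the largest y-x.
    """
    if len(points) != 4:
        raise ValueError("expected exactly four corner points")
    tl = min(points, key=lambda p: p[0] + p[1])
    br = max(points, key=lambda p: p[0] + p[1])
    tr = min(points, key=lambda p: p[1] - p[0])
    bl = max(points, key=lambda p: p[1] - p[0])
    return [tl, tr, br, bl]


def target_size(corners: List[Point]) -> Tuple[int, int]:
    """Width/height of the rectified page: the longer of each pair of opposite edges."""
    tl, tr, br, bl = corners
    width = max(math.dist(tl, tr), math.dist(bl, br))
    height = max(math.dist(tl, bl), math.dist(tr, br))
    return round(width), round(height)
```

This heuristic assumes the page is only moderately rotated; for photos rotated close to 45° the sum/difference ordering becomes ambiguous, and a detector that returns ordered corners would be the safer choice.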

Stage 2: Preprocessing

Deskew (rotation correction): Hough transform or a model.
Dewarp (for photos of books or folded pages): DewarpNet or an RNN model.
Adaptive thresholding (for photos with uneven lighting).
Shadow removal (background division).
Denoise (Non-Local Means).
Resize to an OCR-friendly size (~1500–2500 px on the long side).
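As a small illustration of the resize step, the target dimensions can be computed from the long side while preserving aspect ratio. This is a sketch only: `resize_dimensions` is a hypothetical helper, and upscaling images below the lower bound is an assumption (the text only gives the target range); the actual resampling would be done by OpenCV or Pillow.

```python
from typing import Tuple


def resize_dimensions(
    width: int, height: int, min_long: int = 1500, max_long: int = 2500
) -> Tuple[int, int]:
    """Scale (width, height) so the long side lands inside [min_long, max_long]."""
    long_side = max(width, height)
    if long_side <= 0:
        raise ValueError("image dimensions must be positive")
    if long_side > max_long:
        scale = max_long / long_side   # shrink large scans
    elif long_side < min_long:
        scale = min_long / long_side   # assumption: upscale small photos
    else:
        scale = 1.0                    # already in range, leave untouched
    return max(1, round(width * scale)), max(1, round(height * scale))
```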

Stage 3: OCR + Layout Analysis

PaddleOCR PP-Structure runs once and produces:
- Bounding boxes + text + confidence per word/line.
- Table region detection + table-to-HTML/JSON.
- Layout type per region (title, paragraph, table, figure).
The output is held in an intermediate structure (similar to hOCR / ALTO XML).
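One possible shape for that intermediate structure is sketched below with stdlib dataclasses, together with a helper that turns a table delivered as an HTML string into rows of cell text. The class and function names are hypothetical, and this is not PaddleOCR's actual output schema; it only assumes that a table region arrives as HTML, as the list above suggests.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional, Tuple


@dataclass
class TextLine:
    text: str
    confidence: float
    bbox: Tuple[int, int, int, int]  # x0, y0, x1, y1 in pixels


@dataclass
class Region:
    kind: str  # "title", "paragraph", "table" or "figure"
    lines: List[TextLine] = field(default_factory=list)
    table_rows: List[List[str]] = field(default_factory=list)


class _TableRowParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_html_to_rows(html: str) -> List[List[str]]:
    parser = _TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

A table region could then be stored as `Region(kind="table", table_rows=table_html_to_rows(html))`, keeping later extraction stages independent of the OCR engine's own format.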

Stage 4: Information Extraction

4a. Header parsing (regex): sprint number, date, issuing unit, legal basis (dasar), subject (perihal). The format is relatively fixed, so regex fits well.
4b. Personnel table extraction: take the PP-Structure table output and map its columns (Pangkat, NRP, Nama, Jabatan, Keterangan).
4c. LLM fallback: for fields that regex/table extraction misses, send the text chunk plus a JSON schema to a local LLM (Ollama / vLLM) with structured output (Pydantic via outlines / instructor).
4d. Validation layer:
- NRP: 8 numeric digits.
- Pangkat (rank): must appear in the Polri master rank list.
- Date: parse + sanity check.
- Cross-check: the personnel count stated in the body equals the number of table rows.
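Steps 4b and 4d can be sketched together: map table rows to named fields, then run the validation rules. Everything here is illustrative: `MASTER_PANGKAT` is a small sample rather than the real master list, the date format "21 April 2025" is an assumption about how dates appear in the letters, and the helper names are hypothetical.

```python
import re
from datetime import date
from typing import Dict, List, Optional

# Illustrative subset only; the real list would come from master data.
MASTER_PANGKAT = {"AKP", "IPDA", "IPTU", "KOMPOL", "AKBP", "BRIPKA"}
BULAN = {"januari": 1, "februari": 2, "maret": 3, "april": 4, "mei": 5, "juni": 6,
         "juli": 7, "agustus": 8, "september": 9, "oktober": 10, "november": 11,
         "desember": 12}
KNOWN_COLUMNS = {"no", "pangkat", "nrp", "nama", "jabatan", "keterangan"}


def valid_nrp(value: str) -> bool:
    return re.fullmatch(r"[0-9]{8}", value.strip()) is not None


def valid_pangkat(value: str) -> bool:
    return value.strip().upper() in MASTER_PANGKAT


def parse_tanggal(value: str) -> Optional[date]:
    """Parse e.g. '21 April 2025'; return None if malformed or not a real date."""
    m = re.fullmatch(r"\s*(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\s*", value)
    if not m or m.group(2).lower() not in BULAN:
        return None
    try:
        return date(int(m.group(3)), BULAN[m.group(2).lower()], int(m.group(1)))
    except ValueError:
        return None


def map_rows(rows: List[List[str]]) -> List[Dict[str, str]]:
    """Use the first row as the header; keep only recognised columns."""
    header = [c.strip().lower() for c in rows[0]]
    return [{k: v for k, v in zip(header, row) if k in KNOWN_COLUMNS}
            for row in rows[1:]]


def validate_personnel(people: List[Dict[str, str]],
                       expected_count: Optional[int] = None) -> List[str]:
    flags = []
    for i, person in enumerate(people, start=1):
        if not valid_nrp(person.get("nrp", "")):
            flags.append(f"row {i}: invalid NRP")
        if not valid_pangkat(person.get("pangkat", "")):
            flags.append(f"row {i}: unknown pangkat")
    # Cross-check the count stated in the letter body against the table.
    if expected_count is not None and expected_count != len(people):
        flags.append("personnel count mismatch")
    return flags
```

The returned flags are plain strings so they can be surfaced directly to a reviewer, for example in a `review_flags` field of the API response.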

Stage 5: Confidence Scoring & Routing

Aggregate confidence: weighted average of OCR confidence + validation pass/fail + LLM logprob (if used).
Low threshold (e.g. < 0.85) → status NEEDS_REVIEW.
High threshold (≥ 0.95) + all validations pass → status AUTO_APPROVED.
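A minimal sketch of the scoring and routing rules above follows. The weights (0.6 / 0.3 / 0.1) are placeholder assumptions, not tuned values. The text does not say what happens between the two thresholds or when a validation fails at high confidence, so this sketch assumes `COMPLETED` for the middle band and `NEEDS_REVIEW` for any failed validation.

```python
from typing import Optional

REVIEW_THRESHOLD = 0.85   # below this: NEEDS_REVIEW
APPROVE_THRESHOLD = 0.95  # at or above this (and valid): AUTO_APPROVED


def aggregate_confidence(
    ocr_conf: float, validation_rate: float, llm_conf: Optional[float] = None
) -> float:
    """Weighted average; weights are renormalised when no LLM score is present."""
    parts = [(ocr_conf, 0.6), (validation_rate, 0.3)]  # assumed weights
    if llm_conf is not None:
        parts.append((llm_conf, 0.1))
    total_weight = sum(weight for _, weight in parts)
    return sum(value * weight for value, weight in parts) / total_weight


def route(confidence: float, all_validations_pass: bool) -> str:
    if confidence < REVIEW_THRESHOLD or not all_validations_pass:
        return "NEEDS_REVIEW"   # failed validation → review is an assumption
    if confidence >= APPROVE_THRESHOLD:
        return "AUTO_APPROVED"
    return "COMPLETED"          # assumption: middle band is accepted, not auto-approved
```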

2.3 API Endpoints (FastAPI)

POST   /api/v1/documents              # upload, returns job_id
GET    /api/v1/documents/{job_id}     # poll status + result
GET    /api/v1/documents/{job_id}/raw # raw OCR output (debug)
PATCH  /api/v1/documents/{job_id}     # manual correction (HITL)
GET    /api/v1/health                 # liveness
GET    /api/v1/metrics                # Prometheus

Response shape (example):

{
  "job_id": "uuid",
  "status": "completed | processing | needs_review | failed",
  "confidence": 0.92,
  "data": {
    "nomor_sprint": "Sprin/123/IV/2025",
    "tanggal": "2025-04-21",
    "satuan_penerbit": "Polres Bandung",
    "dasar": ["...", "..."],
    "perihal": "...",
    "personel": [
      {"no": 1, "pangkat": "AKP", "nrp": "12345678", "nama": "...", "jabatan": "Kasat Reskrim", "confidence": 0.97},
      ...
    ],
    "ttd": {"pejabat": "...", "pangkat": "...", "nrp": "..."}
  },
  "review_flags": []
}
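The upload-then-poll contract above can be illustrated without any web framework. The sketch below uses stdlib dataclasses instead of the service's Pydantic models and an in-memory dict instead of a queue and database, so it only demonstrates the state transitions and the response shape; `JobResult` and `InMemoryJobStore` are hypothetical names.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional, Sequence


@dataclass
class JobResult:
    job_id: str
    status: str = "processing"
    confidence: Optional[float] = None
    data: Dict[str, Any] = field(default_factory=dict)
    review_flags: List[str] = field(default_factory=list)


class InMemoryJobStore:
    """Stand-in for the queue + storage layers (illustration only)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, JobResult] = {}

    def submit(self) -> str:
        # What POST /api/v1/documents would hand back to the client.
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = JobResult(job_id=job_id)
        return job_id

    def complete(self, job_id: str, data: Dict[str, Any], confidence: float,
                 review_flags: Sequence[str] = ()) -> None:
        # Called by the worker once the pipeline has finished.
        job = self._jobs[job_id]
        job.data = data
        job.confidence = confidence
        job.review_flags = list(review_flags)
        job.status = "needs_review" if job.review_flags else "completed"

    def get(self, job_id: str) -> str:
        # JSON body that GET /api/v1/documents/{job_id} would return.
        return json.dumps(asdict(self._jobs[job_id]))
```

A client would call `submit()` once, then poll `get(job_id)` until `status` is no longer `processing`.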

2.4 Final Recommended Tech Stack

Layer	Choice	Notes
API	FastAPI + Uvicorn/Gunicorn	as proposed
Validation	Pydantic v2	required
Queue	Redis + Celery or Dramatiq	OCR is heavy; do not block the request
OCR	PaddleOCR PP-OCRv4 + PP-Structure	PP-Structure added for tables
Preprocessing	OpenCV + Pillow + DocTR (detection)	DocTR for phone photos
Extraction	Regex + Ollama (Llama 3.1 8B / Qwen2.5 7B) + instructor/outlines	hybrid
Storage	PostgreSQL (metadata) + MinIO (file blob)	self-hosted, meets compliance needs
Observability	Prometheus + Grafana + Loki	required in production
Container	Docker + docker-compose (dev) → Kubernetes (prod)
GPU	NVIDIA T4/A10 (1× is enough for the MVP)	PaddleOCR is much faster on GPU

3. Development Roadmap (Phased)

Phase 0: Preparation (1 week)

Collect a sample dataset: at least 50 sprint letters (a mix of scanned PDFs and phone photos) from a variety of units.
Create ground-truth labels for 20 documents (for evaluation).
Define the final output schema (JSON) together with stakeholders.

Phase 1: Synchronous MVP Pipeline (2 weeks)

Set up the FastAPI skeleton + Pydantic schemas.
Integrate PaddleOCR PP-OCRv4 (CPU first, GPU later).
Basic preprocessing: deskew + denoise + resize.
Regex extraction for header fields.
Synchronous POST /documents endpoint (for dev/testing only).
Evaluate accuracy against the 20 ground-truth documents.
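The evaluation step could compute field-level accuracy along the following lines. This is a sketch under assumptions: a field counts as correct on an exact match after light normalisation (trimmed, whitespace collapsed, case-insensitive), and `field_accuracy` is a hypothetical helper; the real metric should be agreed together with the ground-truth labelling rules.

```python
from typing import Any, Dict, List, Sequence


def _normalise(value: Any) -> str:
    """Compare values as trimmed, whitespace-collapsed, case-insensitive text."""
    return " ".join(str(value if value is not None else "").split()).casefold()


def field_accuracy(
    predictions: List[Dict[str, Any]],
    ground_truth: List[Dict[str, Any]],
    fields: Sequence[str],
) -> Dict[str, float]:
    """Fraction of documents in which each field matches its ground truth."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("need one prediction per ground-truth document")
    result = {}
    for name in fields:
        correct = sum(
            1
            for pred, truth in zip(predictions, ground_truth)
            if _normalise(pred.get(name)) == _normalise(truth.get(name))
        )
        result[name] = correct / len(ground_truth)
    return result
```

Reporting accuracy per field rather than per document makes it clear which extraction rules need work first.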

Phase 2: Robustness for Phone Photos (2 weeks)

Integrate document detection (DocTR or OpenCV contours).
Perspective transform + dewarping.
Shadow removal.
Re-evaluate accuracy on the phone-photo subset.

Phase 3: Table Extraction (1.5 weeks)

Integrate PP-Structure for the personnel table.
Column mapping + validation (NRP, pangkat).
Master data table of Polri ranks.

Phase 4: Async + Production Ready (1.5 weeks)

Move to the async architecture with Celery + Redis.
MinIO + PostgreSQL storage.
Auth, rate limiting, logging, metrics.
Docker Compose for deployment.

Phase 5: LLM Hybrid Extraction (2 weeks)

Set up Ollama / vLLM with a local model.
Structured output via instructor.
Confidence scoring + routing to review.

Phase 6: HITL Review UI (optional, 2 weeks)

Correction endpoint.
Simple web UI (Next.js) for reviewers.
Audit trail & feedback loop.

Phase 7: Further Optimisation (ongoing)

Fine-tune PaddleOCR detection/recognition on the internal dataset.
Explore Donut/LayoutLMv3 once the dataset is large enough.
Batch processing & GPU optimisation.

Total estimate for a functional MVP (Phases 1–4): ~7 weeks with 1 backend engineer + 1 ML engineer.

4. Risks & Mitigations

Risk	Mitigation
Sensitive (police) data leaks	On-prem is mandatory; no cloud OCR; encryption at rest (LUKS/pgcrypto) and in transit (mTLS); complete audit log.
Format variation across units	Hybrid extraction (regex + LLM); collect samples from many units from the start.
Poor-quality phone photos	Validate image quality on the client (minimum resolution, blur detection) before upload.
Accuracy falls short of target	HITL review is mandatory for low-confidence documents; do not deploy fully automatic.
Legal liability for OCR results	Always keep the original document and flag the extraction result as "draft, requires human verification".

5. Questions Before Implementation

Before I proceed to implementation, please confirm:

Volume: how many documents per day are targeted? (affects the choice of async vs sync, GPU vs CPU)
Deployment target: strictly on-prem, or is a private cloud (GovCloud) acceptable?
Document source: is there access to 20–50 sample sprint letters to use as the initial dataset?
Integration: which system will call this service? (affects auth & API contract)
HITL: are staff available to manually review low-confidence documents?
Hardware: is a GPU server already available, or is a sizing recommendation needed?
Final output format: is there a schema already used by the downstream system?
