Phase 7: ground-truth export (JSONL + stats) + CLI tool

- GET /api/v1/ground-truth/export streaming JSONL (approved_only, since, until, has_corrections, limit)
- GET /api/v1/ground-truth/stats total / approved / corrections counts + top-N most-corrected field paths
- python -m ocr_sprint.tools.export_ground_truth operator CLI with the same filters + optional --print-stats
- Ground-truth sample reconstructs the pipeline's original output by replaying job_corrections in reverse
- docs/ground-truth-format.md schema + fine-tuning guidance
- 17 new tests (service replay, endpoint filters, CLI)
- 201 total tests passing, ruff / mypy --strict clean

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
2026-04-25 20:24:40 +00:00
parent 9457fa3c55
commit 6003d96a94
11 changed files with 1148 additions and 1 deletions
--- a/docs/ground-truth-format.md
+++ b/docs/ground-truth-format.md
@@ -0,0 +1,112 @@
+# Ground-truth export format (Phase 7)
+
+The service exposes the HITL corpus as [JSONL](https://jsonlines.org/) —
+one training sample per line — via the HTTP endpoint
+`GET /api/v1/ground-truth/export` and the equivalent CLI:
+
+```bash
+python -m ocr_sprint.tools.export_ground_truth --out corpus.jsonl
+```
+
+Both paths read the same database and emit byte-identical JSONL, so a
+cron-scheduled dump and an ad-hoc curl download are interchangeable.
+
+## Sample schema
+
+Each line is a single JSON object with the following shape:
+
+```jsonc
+{
+  "job_id": "c5da6747-...",
+  "filename": "sprint-042.pdf",
+  "source_kind": "pdf",
+  "approved": true,
+  "reviewed_by": "reviewer-a",        // free-form; comes from X-User-Id
+  "reviewed_at": "2025-06-01T10:15:00Z",
+  "created_at":  "2025-05-28T08:02:17Z",
+
+  // The pipeline's original pre-HITL output, reconstructed by replaying
+  // the audit trail backwards. `null` for jobs that never produced a
+  // result (e.g. hard-failed on OCR).
+  "initial_result": {
+    "header": { "nomor_sprint": "Sprin/1/I/2025", "perihal": null, ... },
+    "personel": [ { "pangkat": "AIPDA", "nrp": "77060000", ... } ],
+    ...
+  },
+
+  // The reviewer-approved answer (current value of jobs.result).
+  "final_result":   { ...same shape as initial_result... },
+
+  // Every correction event, in chronological order.
+  "corrections": [
+    {
+      "field_path":    "header.perihal",
+      "old_value":     null,
+      "new_value":     "Penyelidikan kasus pencurian",
+      "corrected_by":  "reviewer-a",
+      "reason":        "LLM missed it",
+      "corrected_at":  "2025-05-30T14:00:00Z"
+    }
+  ],
+
+  "review_flags": ["llm_fallback"],
+  "confidence":   0.78
+}
+```
+
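The backwards replay mentioned in the schema comment can be sketched as follows. This is a minimal illustration, not the service's actual implementation: `parse_path`, `set_path` and `reconstruct_initial` are hypothetical names, the path grammar (dotted keys with optional `[index]` suffixes) is inferred from the `header.perihal` and `personel[0].nrp` examples in this document, and corrections are assumed to overwrite existing values only (no list insertions or deletions).

```python
import copy
import re
from typing import Any

# Assumed path grammar: dotted keys with optional [index] suffixes,
# e.g. "header.perihal" or "personel[0].nrp".
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def parse_path(field_path: str) -> list:
    """Split a field path into dict keys (str) and list indices (int)."""
    parts = []
    for key, index in _TOKEN.findall(field_path):
        parts.append(key if key else int(index))
    return parts


def set_path(doc: Any, field_path: str, value: Any) -> None:
    """Overwrite the value at field_path in place."""
    *parents, last = parse_path(field_path)
    node = doc
    for part in parents:
        node = node[part]
    node[last] = value


def reconstruct_initial(final_result: dict, corrections: list) -> dict:
    """Undo corrections newest-first to recover the pre-review output."""
    initial = copy.deepcopy(final_result)
    for event in reversed(corrections):  # corrections arrive oldest-first
        set_path(initial, event["field_path"], event["old_value"])
    return initial


final = {"header": {"perihal": "Penyelidikan kasus pencurian"}}
events = [
    {
        "field_path": "header.perihal",
        "old_value": None,
        "new_value": "Penyelidikan kasus pencurian",
    }
]
print(reconstruct_initial(final, events)["header"]["perihal"])  # None
```

Undoing the newest event first matters when one field was corrected more than once: the last `old_value` written is then the oldest one, which is the pipeline's original output.
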
+## Recommended filters
+
+* `approved_only=true` (default) — **do not** train on unreviewed
+  samples; they can still contain OCR mistakes.
+* `has_corrections=true` — for a "hard examples" set where the pipeline
+  was originally wrong.
+* `has_corrections=false` — for a "sanity" set where the pipeline was
+  already right. Good for regression tests after fine-tuning.
+* `since` / `until` — build incremental snapshots without re-processing
+  the full history.
+
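As a rough illustration of combining these filters, the sketch below builds export URLs for the "hard examples" and "sanity" sets. The endpoint path and parameter names come from this document; the `export_url` helper, the `http://localhost:8000` base URL, and the lowercase `true` / `false` encoding of booleans are assumptions to check against the running service.

```python
from urllib.parse import urlencode

EXPORT_PATH = "/api/v1/ground-truth/export"


def export_url(base_url: str, **filters: object) -> str:
    """Build an export URL, skipping filters left as None."""
    params = {}
    for name, value in filters.items():
        if value is None:
            continue
        # Assumption: booleans are sent as lowercase "true" / "false".
        params[name] = str(value).lower() if isinstance(value, bool) else str(value)
    query = urlencode(params)
    url = base_url.rstrip("/") + EXPORT_PATH
    return f"{url}?{query}" if query else url


# Hypothetical base URL; replace with the real deployment address.
BASE = "http://localhost:8000"

hard = export_url(BASE, approved_only=True, has_corrections=True)
sanity = export_url(BASE, approved_only=True, has_corrections=False)

print(hard)
# http://localhost:8000/api/v1/ground-truth/export?approved_only=true&has_corrections=true
```
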
+## When is the dataset big enough to fine-tune?
+
+Rough operational checklist (rules of thumb — adjust based on your own
+error analysis):
+
+| Bucket                          | Minimum rows | Notes                                                         |
+|---------------------------------|--------------|---------------------------------------------------------------|
+| LoRA on header extraction (LLM) | ~200–500     | Per-field error signal must be > random noise.                |
+| Per-satuan prompt tuning        | ~50 / satuan | Helps when formats differ sharply between Polda/Polres units. |
+| PP-Structure table fine-tune    | ~1 000+      | Layout models are data-hungry; hold off until HITL is steady. |
+
+Use `GET /api/v1/ground-truth/stats` to check coverage:
+
+```json
+{
+  "total_jobs": 842,
+  "approved_jobs": 613,
+  "total_corrections": 1204,
+  "jobs_with_corrections": 431,
+  "top_corrected_fields": [
+    { "field_path": "header.perihal",  "count": 289 },
+    { "field_path": "personel[0].nrp", "count":  51 },
+    ...
+  ]
+}
+```
+
+Fields at the top of `top_corrected_fields` are the highest-leverage
+targets for prompt tweaks, regex upgrades, or (eventually) fine-tuning.
+
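One way to read the stats payload is to express each field's count as a share of all corrections. The sketch below does that with the figures from the example response (elided entries omitted); `correction_shares` is a hypothetical helper, not part of the service.

```python
def correction_shares(stats: dict) -> list:
    """Return (field_path, share of all corrections) pairs, highest first."""
    total = stats["total_corrections"]
    if not total:
        return []
    ranked = sorted(
        stats["top_corrected_fields"], key=lambda f: f["count"], reverse=True
    )
    return [(f["field_path"], f["count"] / total) for f in ranked]


# Figures copied from the example response; elided entries are left out.
stats = {
    "total_jobs": 842,
    "approved_jobs": 613,
    "total_corrections": 1204,
    "jobs_with_corrections": 431,
    "top_corrected_fields": [
        {"field_path": "header.perihal", "count": 289},
        {"field_path": "personel[0].nrp", "count": 51},
    ],
}

for path, share in correction_shares(stats):
    print(f"{path}: {share:.1%}")
# header.perihal: 24.0%
# personel[0].nrp: 4.2%

print(f"approved: {stats['approved_jobs'] / stats['total_jobs']:.1%}")
# approved: 72.8%
```
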
+## Fine-tuning outside this repo
+
+The export is deliberately framework-agnostic. Suggested follow-ups on
+dedicated GPU hardware:
+
+* [**Unsloth**](https://github.com/unslothai/unsloth) — LoRA on
+  Qwen2.5 / Llama 3.1 with 2–4× speedups on a single GPU.
+* [**Axolotl**](https://github.com/axolotl-ai-cloud/axolotl) — more
+  batteries-included; good for multi-GPU runs.
+
+Typical prompt-completion conversion: feed `initial_result` (or the raw
+OCR text, if your pipeline keeps it) as the "input" and `final_result`
+as the "output". The `corrections` list is only needed if you want to
+build an error-class analysis — the model itself trains on the final
+answer.
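
The conversion described above might look like the following sketch. It assumes the export has been saved as JSONL with the schema shown earlier. The `to_pairs` helper and the `input` / `output` key names are illustrative; the exact record layout depends on the fine-tuning framework. Samples that are unapproved or have a `null` `initial_result` are skipped.

```python
import json
from typing import Iterable, Iterator


def to_pairs(lines: Iterable[str]) -> Iterator[dict]:
    """Turn exported JSONL lines into input/output training records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sample = json.loads(line)
        # Skip unreviewed jobs and jobs that never produced a result.
        if not sample.get("approved") or sample.get("initial_result") is None:
            continue
        yield {
            "input": json.dumps(
                sample["initial_result"], ensure_ascii=False, sort_keys=True
            ),
            "output": json.dumps(
                sample["final_result"], ensure_ascii=False, sort_keys=True
            ),
        }


# In-memory stand-in for a two-line corpus.jsonl.
corpus = [
    json.dumps(
        {
            "approved": True,
            "initial_result": {"header": {"perihal": None}},
            "final_result": {"header": {"perihal": "Penyelidikan kasus pencurian"}},
        }
    ),
    json.dumps({"approved": True, "initial_result": None, "final_result": None}),
]

pairs = list(to_pairs(corpus))
print(len(pairs))  # 1
```

With a real export, pass the open file object instead of the list, for example `to_pairs(open("corpus.jsonl", encoding="utf-8"))`.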