# Ground-truth export format (Phase 7)

The service exposes the HITL corpus as [JSONL](https://jsonlines.org/), one training sample per line, via the HTTP endpoint `GET /api/v1/ground-truth/export` and the equivalent CLI:

```bash
python -m ocr_sprint.tools.export_ground_truth --out corpus.jsonl
```

Both paths read the same database and emit byte-identical JSONL, so a cron-scheduled dump and an ad-hoc curl download are interchangeable.

## Sample schema

Each line is a single JSON object with the following shape:

```jsonc
{
  "job_id": "c5da6747-...",
  "filename": "sprint-042.pdf",
  "source_kind": "pdf",
  "approved": true,
  "reviewed_by": "reviewer-a",  // free-form; comes from X-User-Id
  "reviewed_at": "2025-06-01T10:15:00Z",
  "created_at": "2025-05-28T08:02:17Z",

  // The pipeline's original pre-HITL output, reconstructed by replaying
  // the audit trail backwards. `null` for jobs that never produced a
  // result (e.g. hard-failed on OCR).
  "initial_result": {
    "header": { "nomor_sprint": "Sprin/1/I/2025", "perihal": null, ... },
    "personel": [ { "pangkat": "AIPDA", "nrp": "77060000", ... } ],
    ...
  },

  // The reviewer-approved answer (current value of jobs.result).
  "final_result": { ...same shape as initial_result... },

  // Every correction event, in chronological order.
  "corrections": [
    {
      "field_path": "header.perihal",
      "old_value": null,
      "new_value": "Penyelidikan kasus pencurian",
      "corrected_by": "reviewer-a",
      "reason": "LLM missed it",
      "corrected_at": "2025-05-30T14:00:00Z"
    }
  ],

  "review_flags": ["llm_fallback"],
  "confidence": 0.78
}
```

## Recommended filters

* `approved_only=true` (default): **do not** train on unreviewed samples; they can still contain OCR mistakes.
* `has_corrections=true`: for a "hard examples" set where the pipeline was originally wrong.
* `has_corrections=false`: for a "sanity" set where the pipeline was already right. Good for regression tests after fine-tuning.
* `since` / `until`: build incremental snapshots without re-processing the full history.
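
As a rough illustration of combining these filters, the sketch below builds an export URL and parses the resulting JSONL. The filter names and the endpoint path come from this page. The helper names `build_export_url` and `iter_samples`, the host `http://localhost:8000`, and the ISO 8601 format for `since` / `until` are assumptions made for the example, not confirmed service behaviour.

```python
import json
from urllib.parse import urlencode


def build_export_url(base_url, *, approved_only=True, has_corrections=None,
                     since=None, until=None):
    """Return the export URL with the documented filters as query parameters."""
    params = {"approved_only": "true" if approved_only else "false"}
    if has_corrections is not None:
        params["has_corrections"] = "true" if has_corrections else "false"
    if since is not None:
        params["since"] = since  # assumed to be an ISO 8601 timestamp string
    if until is not None:
        params["until"] = until  # same assumption as `since`
    query = urlencode(params)
    return f"{base_url.rstrip('/')}/api/v1/ground-truth/export?{query}"


def iter_samples(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# "Hard examples" snapshot; the host is hypothetical.
hard_examples_url = build_export_url(
    "http://localhost:8000",
    has_corrections=True,
    since="2025-06-01T00:00:00Z",
)
```

`iter_samples` accepts any iterable of text lines, so a file object opened on `corpus.jsonl` from the CLI dump should work the same way as the decoded lines of an HTTP response.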

## When is the dataset big enough to fine-tune?

Rough operational checklist (rules of thumb; adjust based on your own error analysis):

| Bucket                          | Minimum rows | Notes                                                         |
|---------------------------------|--------------|---------------------------------------------------------------|
| LoRA on header extraction (LLM) | ~200–500     | Per-field error signal must be > random noise.                |
| Per-satuan prompt tuning        | ~50 / satuan | Helps when formats differ sharply between Polda/Polres units. |
| PP-Structure table fine-tune    | ~1,000+      | Layout models are data-hungry; hold off until HITL is steady. |

Use `GET /api/v1/ground-truth/stats` to check coverage:

```json
{
  "total_jobs": 842,
  "approved_jobs": 613,
  "total_corrections": 1204,
  "jobs_with_corrections": 431,
  "top_corrected_fields": [
    { "field_path": "header.perihal", "count": 289 },
    { "field_path": "personel[0].nrp", "count": 51 },
    ...
  ]
}
```

Fields at the top of `top_corrected_fields` are the highest-leverage targets for prompt tweaks, regex upgrades, or (eventually) fine-tuning.

## Fine-tuning outside this repo

The export is deliberately framework-agnostic. Suggested follow-ups on dedicated GPU hardware:

* [**Unsloth**](https://github.com/unslothai/unsloth): LoRA on Qwen2.5 / Llama 3.1 with 2–4× speedups on a single GPU.
* [**Axolotl**](https://github.com/axolotl-ai-cloud/axolotl): more batteries-included; good for multi-GPU runs.

Typical prompt-completion conversion: feed `initial_result` (or the raw OCR text, if your pipeline keeps it) as the "input" and `final_result` as the "output". The `corrections` list is only needed if you want to build an error-class analysis; the model itself trains on the final answer.
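
A minimal sketch of that conversion, assuming each sample has already been parsed from the JSONL export into a dict. The function names `to_training_pair` and `correction_counts` are hypothetical, as is the choice to serialize both results with `json.dumps`. Unsloth and Axolotl each expect their own dataset layout, so treat the `input` / `output` keys as an intermediate form to be mapped onto whichever template you use.

```python
import json
from collections import Counter


def _dump(obj):
    # Stable, human-readable serialization (an assumption, not a required format).
    return json.dumps(obj, ensure_ascii=False, sort_keys=True)


def to_training_pair(sample):
    """Map one exported sample to an input/output pair, or None if unusable."""
    if not sample.get("approved"):
        return None  # unreviewed samples can still contain OCR mistakes
    initial = sample.get("initial_result")
    final = sample.get("final_result")
    if initial is None or final is None:
        return None  # e.g. the job hard-failed before producing a result
    return {"input": _dump(initial), "output": _dump(final)}


def correction_counts(samples):
    """Count corrections per field_path for an error-class analysis."""
    return Counter(
        correction["field_path"]
        for sample in samples
        for correction in sample.get("corrections", [])
    )


# Tiny hand-written sample, trimmed to the fields used above.
sample = {
    "approved": True,
    "initial_result": {"header": {"perihal": None}},
    "final_result": {"header": {"perihal": "Penyelidikan kasus pencurian"}},
    "corrections": [{"field_path": "header.perihal"}],
}
pair = to_training_pair(sample)
counts = correction_counts([sample])
```

Dropping samples whose `initial_result` is `null` keeps the pairs well-formed. If your pipeline keeps the raw OCR text, it could be substituted for `initial` as the input, as suggested above.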