Devin AI 6003d96a94 Phase 7: ground-truth export (JSONL + stats) + CLI tool

- GET /api/v1/ground-truth/export  streaming JSONL (approved_only,
  since, until, has_corrections, limit)
- GET /api/v1/ground-truth/stats   total / approved / corrections
  counts + top-N most-corrected field paths
- python -m ocr_sprint.tools.export_ground_truth  operator CLI with
  the same filters + optional --print-stats
- Ground-truth sample reconstructs the pipeline's original output by
  replaying job_corrections in reverse
- docs/ground-truth-format.md    schema + fine-tuning guidance
- 17 new tests (service replay, endpoint filters, CLI)
- 201 total tests passing, ruff / mypy --strict clean

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

2026-04-25 20:24:40 +00:00


Ground-truth export format (Phase 7)

The service exposes the HITL corpus as JSONL — one training sample per line — via the HTTP endpoint GET /api/v1/ground-truth/export and the equivalent CLI:

python -m ocr_sprint.tools.export_ground_truth --out corpus.jsonl

Both paths read the same database and emit byte-identical JSONL, so a cron-scheduled dump and an ad-hoc curl download are interchangeable.
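
Because both paths produce the same line-delimited output, one reader is enough for either source. The following is a minimal sketch rather than part of ocr_sprint: the helper name iter_samples is hypothetical, and the demo lines are made-up stand-ins for real export rows.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_samples(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one decoded sample per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines and a trailing newline
            yield json.loads(line)


# Made-up rows standing in for a CLI dump or a saved HTTP response body.
demo_lines = [
    '{"job_id": "a", "approved": true}',
    "",
    '{"job_id": "b", "approved": false}',
]
samples = list(iter_samples(demo_lines))
approved_ids = [s["job_id"] for s in samples if s["approved"]]
```

For a file on disk, an open text-mode handle (for example the corpus.jsonl written by the CLI) can be passed directly, since iterating over it yields lines.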

Sample schema

Each line is a single JSON object with the following shape:

{
  "job_id": "c5da6747-...",
  "filename": "sprint-042.pdf",
  "source_kind": "pdf",
  "approved": true,
  "reviewed_by": "reviewer-a",        // free-form; comes from X-User-Id
  "reviewed_at": "2025-06-01T10:15:00Z",
  "created_at":  "2025-05-28T08:02:17Z",

  // The pipeline's original pre-HITL output, reconstructed by replaying
  // the audit trail backwards. `null` for jobs that never produced a
  // result (e.g. hard-failed on OCR).
  "initial_result": {
    "header": { "nomor_sprint": "Sprin/1/I/2025", "perihal": null, ... },
    "personel": [ { "pangkat": "AIPDA", "nrp": "77060000", ... } ],
    ...
  },

  // The reviewer-approved answer (current value of jobs.result).
  "final_result":   { ...same shape as initial_result... },

  // Every correction event, in chronological order.
  "corrections": [
    {
      "field_path":    "header.perihal",
      "old_value":     null,
      "new_value":     "Penyelidikan kasus pencurian",
      "corrected_by":  "reviewer-a",
      "reason":        "LLM missed it",
      "corrected_at":  "2025-05-30T14:00:00Z"
    }
  ],

  "review_flags": ["llm_fallback"],
  "confidence":   0.78
}
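
The backwards replay behind initial_result can be illustrated as follows. This is a sketch under assumptions, not the service's actual code: it assumes field_path uses only dotted keys and [n] list indices (the two forms shown in this document), that every path still exists in final_result, and that restoring old_value for each correction, newest first, is sufficient. The helper names are hypothetical.

```python
import copy
import re
from typing import Any, Dict, List, Union

# Matches either a dotted key ("header") or a list index ("[0]").
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def parse_path(field_path: str) -> List[Union[str, int]]:
    """Split 'personel[0].nrp' into ['personel', 0, 'nrp']."""
    return [int(index) if index else name
            for name, index in _TOKEN.findall(field_path)]


def set_path(doc: Any, field_path: str, value: Any) -> None:
    """Assign value at field_path inside a nested dict/list structure."""
    tokens = parse_path(field_path)
    node = doc
    for token in tokens[:-1]:
        node = node[token]
    node[tokens[-1]] = value


def reconstruct_initial(final_result: Dict[str, Any],
                        corrections: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Undo corrections newest-first, restoring each old_value."""
    initial = copy.deepcopy(final_result)  # leave final_result untouched
    for correction in reversed(corrections):  # input is chronological
        set_path(initial, correction["field_path"], correction["old_value"])
    return initial


final = {"header": {"perihal": "Penyelidikan kasus pencurian"},
         "personel": [{"nrp": "77060000"}]}
events = [{"field_path": "header.perihal", "old_value": None,
           "new_value": "Penyelidikan kasus pencurian"}]
initial = reconstruct_initial(final, events)
```

Processing the list in reverse matters when the same field was corrected more than once: the last value restored is then the oldest old_value, which is the pipeline's original output.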

Recommended filters

approved_only=true (default) — do not train on unreviewed samples; they can still contain OCR mistakes.
has_corrections=true — for a "hard examples" set where the pipeline was originally wrong.
has_corrections=false — for a "sanity" set where the pipeline was already right. Good for regression tests after fine-tuning.
since / until — build incremental snapshots without re-processing the full history.
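
The filters above map onto query parameters of the export endpoint. The sketch below only builds the URLs. The base address http://localhost:8000, the lowercase true/false serialization, and the ISO 8601 timestamp format for since are assumptions, and export_url is a hypothetical helper.

```python
from typing import Dict, Optional, Union
from urllib.parse import urlencode

EXPORT_PATH = "/api/v1/ground-truth/export"

FilterValue = Optional[Union[bool, int, str]]


def export_url(base_url: str, **filters: FilterValue) -> str:
    """Build an export URL, skipping filters that are None."""
    params: Dict[str, str] = {}
    for key, value in filters.items():
        if value is None:
            continue
        if isinstance(value, bool):
            # Assumed wire format for booleans: lowercase true/false.
            value = "true" if value else "false"
        params[key] = str(value)
    query = urlencode(params)
    url = base_url.rstrip("/") + EXPORT_PATH
    return f"{url}?{query}" if query else url


BASE = "http://localhost:8000"  # assumed address of the service

# "Hard examples": approved jobs the reviewers had to correct.
hard_url = export_url(BASE, approved_only=True, has_corrections=True,
                      since="2025-05-01T00:00:00Z")

# "Sanity" set: approved jobs the pipeline already got right.
sanity_url = export_url(BASE, approved_only=True, has_corrections=False)
```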

When is the dataset big enough to fine-tune?

Rough operational checklist (rules of thumb — adjust based on your own error analysis):

Bucket	Minimum rows	Notes
LoRA on header extraction (LLM)	~200–500	The per-field error signal must exceed random noise.
Per-satuan prompt tuning	~50 / satuan	Helps when formats differ sharply between Polda/Polres units.
PP-Structure table fine-tune	~1,000+	Layout models are data-hungry; hold off until HITL is steady.

Use GET /api/v1/ground-truth/stats to check coverage:

{
  "total_jobs": 842,
  "approved_jobs": 613,
  "total_corrections": 1204,
  "jobs_with_corrections": 431,
  "top_corrected_fields": [
    { "field_path": "header.perihal",  "count": 289 },
    { "field_path": "personel[0].nrp", "count":  51 },
    ...
  ]
}

Fields at the top of top_corrected_fields are the highest-leverage targets for prompt tweaks, regex upgrades, or (eventually) fine-tuning.
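
One way to read the ranking is to express each field's count as a share of all corrections. The sketch below uses the example figures from the stats response above; correction_shares is a hypothetical helper, not something the service provides.

```python
from typing import Any, Dict, List, Tuple


def correction_shares(stats: Dict[str, Any]) -> List[Tuple[str, float]]:
    """Return (field_path, share of total corrections) in response order."""
    total = stats["total_corrections"]
    if total == 0:
        return []  # nothing corrected yet, so no shares to report
    return [(entry["field_path"], entry["count"] / total)
            for entry in stats["top_corrected_fields"]]


# Figures copied from the example response; the elided entries are omitted.
stats = {
    "total_corrections": 1204,
    "top_corrected_fields": [
        {"field_path": "header.perihal", "count": 289},
        {"field_path": "personel[0].nrp", "count": 51},
    ],
}

for path, share in correction_shares(stats):
    print(f"{path}: {share:.1%}")
# header.perihal: 24.0%
# personel[0].nrp: 4.2%
```

In this example a single field accounts for roughly a quarter of all corrections, which suggests addressing it before investing in a broader fine-tune.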

Fine-tuning outside this repo

The export is deliberately framework-agnostic. Suggested follow-ups on dedicated GPU hardware:

Unsloth: LoRA on Qwen2.5 / Llama 3.1 with 2–4× speedups on a single GPU.
Axolotl: more batteries-included; good for multi-GPU runs.

Typical prompt-completion conversion: feed initial_result (or the raw OCR text, if your pipeline keeps it) as the "input" and final_result as the "output". The corrections list is only needed if you want to build an error-class analysis — the model itself trains on the final answer.
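
The conversion described above might look like the following sketch. Serializing both sides as JSON strings and skipping samples without an initial_result are assumptions made here, not requirements of the export; the "input" and "output" keys follow the wording above, and to_pair is a hypothetical helper. Adapt the record layout to whatever your training framework expects.

```python
import json
from typing import Any, Dict, Optional


def to_pair(sample: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Turn one exported sample into an input/output training record."""
    initial = sample.get("initial_result")
    final = sample.get("final_result")
    if initial is None or final is None:
        # Jobs that never produced a result carry no usable signal here.
        return None
    return {
        # ensure_ascii=False keeps non-ASCII text readable in the corpus.
        "input": json.dumps(initial, ensure_ascii=False, sort_keys=True),
        "output": json.dumps(final, ensure_ascii=False, sort_keys=True),
    }


sample = {
    "job_id": "demo",  # made-up sample for illustration
    "initial_result": {"header": {"perihal": None}},
    "final_result": {"header": {"perihal": "Penyelidikan kasus pencurian"}},
    "corrections": [],
}
pair = to_pair(sample)
skipped = to_pair({"job_id": "failed", "initial_result": None,
                   "final_result": None})
```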
