OCR-SPRIN-SERVICE/docs/ground-truth-format.md
Devin AI 6003d96a94 Phase 7: ground-truth export (JSONL + stats) + CLI tool
- GET /api/v1/ground-truth/export  streaming JSONL (approved_only,
  since, until, has_corrections, limit)
- GET /api/v1/ground-truth/stats   total / approved / corrections
  counts + top-N most-corrected field paths
- python -m ocr_sprint.tools.export_ground_truth  operator CLI with
  the same filters + optional --print-stats
- Ground-truth sample reconstructs the pipeline's original output by
  replaying job_corrections in reverse
- docs/ground-truth-format.md    schema + fine-tuning guidance
- 17 new tests (service replay, endpoint filters, CLI)
- 201 total tests passing, ruff / mypy --strict clean

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
2026-04-25 20:24:40 +00:00

# Ground-truth export format (Phase 7)
The service exposes the HITL corpus as [JSONL](https://jsonlines.org/) —
one training sample per line — via the HTTP endpoint
`GET /api/v1/ground-truth/export` and the equivalent CLI:
```bash
python -m ocr_sprint.tools.export_ground_truth --out corpus.jsonl
```
Both paths read the same database and, given the same filters, emit
byte-identical JSONL, so a cron-scheduled dump and an ad-hoc curl
download are interchangeable.
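
Because every line is an independent JSON object, a consumer can process the corpus one sample at a time without loading the whole file. A minimal reading sketch follows; `iter_samples` and `load_corpus` are hypothetical helper names, not part of `ocr_sprint`, and the only assumption about the file is that it is UTF-8 JSONL as described above.

```python
import json
from typing import Any, Iterable, Iterator


def iter_samples(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield one decoded sample per non-blank JSONL line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate a trailing blank line
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report the offending line so a truncated download is easy to spot.
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc


def load_corpus(path: str) -> list[dict[str, Any]]:
    """Hypothetical helper: read a file such as the corpus.jsonl written by --out."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_samples(handle))
```

The same `iter_samples` generator should work on the lines of a streamed HTTP response, since it only needs an iterable of strings.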
## Sample schema
Each line is a single JSON object with the following shape:
```jsonc
{
  "job_id": "c5da6747-...",
  "filename": "sprint-042.pdf",
  "source_kind": "pdf",
  "approved": true,
  "reviewed_by": "reviewer-a", // free-form; comes from X-User-Id
  "reviewed_at": "2025-06-01T10:15:00Z",
  "created_at": "2025-05-28T08:02:17Z",

  // The pipeline's original pre-HITL output, reconstructed by replaying
  // the audit trail backwards. `null` for jobs that never produced a
  // result (e.g. hard-failed on OCR).
  "initial_result": {
    "header": { "nomor_sprint": "Sprin/1/I/2025", "perihal": null, ... },
    "personel": [ { "pangkat": "AIPDA", "nrp": "77060000", ... } ],
    ...
  },

  // The reviewer-approved answer (current value of jobs.result).
  "final_result": { ...same shape as initial_result... },

  // Every correction event, in chronological order.
  "corrections": [
    {
      "field_path": "header.perihal",
      "old_value": null,
      "new_value": "Penyelidikan kasus pencurian",
      "corrected_by": "reviewer-a",
      "reason": "LLM missed it",
      "corrected_at": "2025-05-30T14:00:00Z"
    }
  ],

  "review_flags": ["llm_fallback"],
  "confidence": 0.78
}
```
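
To illustrate what "replaying the audit trail backwards" means, the sketch below starts from `final_result` and writes each correction's `old_value` back in reverse chronological order. It is not the service's implementation. It assumes that `field_path` uses dotted keys with optional `[index]` suffixes (the two forms that appear in this document) and that every correction is a plain value replacement at an existing path. `parse_path`, `set_path` and `reconstruct_initial` are hypothetical names.

```python
import copy
import re
from typing import Any, Union

PathToken = Union[str, int]

# Assumed path grammar: dotted keys with optional [index] suffixes,
# e.g. "header.perihal" or "personel[0].nrp".
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def parse_path(field_path: str) -> list[PathToken]:
    """Split a field path into dict keys (str) and list indices (int)."""
    tokens: list[PathToken] = []
    for key, index in _TOKEN.findall(field_path):
        tokens.append(key if key else int(index))
    return tokens


def set_path(document: Any, field_path: str, value: Any) -> None:
    """Overwrite the value at field_path in place; the path must already exist."""
    *parents, last = parse_path(field_path)
    node = document
    for token in parents:
        node = node[token]
    node[last] = value


def reconstruct_initial(sample: dict[str, Any]) -> Any:
    """Undo corrections newest-first, starting from a copy of final_result."""
    final = sample.get("final_result")
    if final is None:
        return None
    result = copy.deepcopy(final)  # never mutate the reviewer-approved answer
    for correction in reversed(sample.get("corrections") or []):
        set_path(result, correction["field_path"], correction["old_value"])
    return result
```

Since the export already ships `initial_result`, a helper like this is mainly useful as a consistency check: under the assumptions above, its output should match `initial_result` for a well-formed sample.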
## Recommended filters
* `approved_only=true` (default) — **do not** train on unreviewed
samples; they can still contain OCR mistakes.
* `has_corrections=true` — for a "hard examples" set where the pipeline
was originally wrong.
* `has_corrections=false` — for a "sanity" set where the pipeline was
already right. Good for regression tests after fine-tuning.
* `since` / `until` — build incremental snapshots without re-processing
the full history.
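
The filters above are ordinary query parameters on the export endpoint. The sketch below builds a request URL from them; `build_export_url` is a hypothetical helper, the base URL is a placeholder, and the ISO 8601 timestamp format for `since` / `until` is an assumption based on the timestamps in the sample schema. Booleans are sent as `true` / `false`, matching the spelling used in this section.

```python
from typing import Optional
from urllib.parse import urlencode

EXPORT_PATH = "/api/v1/ground-truth/export"


def build_export_url(
    base_url: str,
    *,
    approved_only: Optional[bool] = None,
    has_corrections: Optional[bool] = None,
    since: Optional[str] = None,
    until: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Return the export URL, including only the filters that were set."""
    params: dict[str, str] = {}
    if approved_only is not None:
        params["approved_only"] = "true" if approved_only else "false"
    if has_corrections is not None:
        params["has_corrections"] = "true" if has_corrections else "false"
    if since is not None:
        params["since"] = since  # assumed ISO 8601, e.g. "2025-05-01T00:00:00Z"
    if until is not None:
        params["until"] = until
    if limit is not None:
        params["limit"] = str(limit)
    url = base_url.rstrip("/") + EXPORT_PATH
    query = urlencode(params)
    return f"{url}?{query}" if query else url


# Placeholder host; a "hard examples" snapshot capped at 100 rows.
hard_examples_url = build_export_url(
    "http://localhost:8000", has_corrections=True, limit=100
)
```

Leaving a filter unset keeps it out of the query string, so the server-side default (for example `approved_only=true`) still applies.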
## When is the dataset big enough to fine-tune?
Rough operational checklist (rules of thumb — adjust based on your own
error analysis):
| Bucket | Minimum rows | Notes |
|---------------------------------|--------------|---------------------------------------------------------------|
| LoRA on header extraction (LLM) | ~200-500 | Per-field error signal must be > random noise. |
| Per-satuan prompt tuning | ~50 / satuan | Helps when formats differ sharply between Polda/Polres units. |
| PP-Structure table fine-tune | ~1 000+ | Layout models are data-hungry; hold off until HITL is steady. |
Use `GET /api/v1/ground-truth/stats` to check coverage:
```json
{
  "total_jobs": 842,
  "approved_jobs": 613,
  "total_corrections": 1204,
  "jobs_with_corrections": 431,
  "top_corrected_fields": [
    { "field_path": "header.perihal", "count": 289 },
    { "field_path": "personel[0].nrp", "count": 51 },
    ...
  ]
}
```
Fields at the top of `top_corrected_fields` are the highest-leverage
targets for prompt tweaks, regex upgrades, or (eventually) fine-tuning.
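
A similar ranking can be computed offline from an exported file, for example to inspect a filtered snapshot. The sketch below counts `field_path` occurrences across the `corrections` lists of already-parsed samples. `top_corrected_fields` here is a hypothetical local function, and whether the service counts in exactly the same way (for instance, which jobs it includes) is not specified in this document.

```python
from collections import Counter
from typing import Any, Iterable


def top_corrected_fields(
    samples: Iterable[dict[str, Any]], n: int = 10
) -> list[dict[str, Any]]:
    """Rank field paths by how often they appear in correction events."""
    counts: Counter[str] = Counter()
    for sample in samples:
        # "corrections" may be missing or empty for jobs that were already right.
        for correction in sample.get("corrections") or []:
            counts[correction["field_path"]] += 1
    return [
        {"field_path": path, "count": count}
        for path, count in counts.most_common(n)
    ]
```

Paths are counted verbatim, so `personel[0].nrp` and `personel[1].nrp` are separate entries; strip the index first if you want one per-column total.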
## Fine-tuning outside this repo
The export is deliberately framework-agnostic. Suggested follow-ups on
dedicated GPU hardware:
* [**Unsloth**](https://github.com/unslothai/unsloth) — LoRA on
Qwen2.5 / Llama 3.1 with 2-4× speedups on a single GPU.
* [**Axolotl**](https://github.com/axolotl-ai-cloud/axolotl) — more
batteries-included; good for multi-GPU runs.
Typical prompt-completion conversion: feed `initial_result` (or the raw
OCR text, if your pipeline keeps it) as the "input" and `final_result`
as the "output". The `corrections` list is only needed if you want to
build an error-class analysis — the model itself trains on the final
answer.
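
A minimal sketch of that conversion, assuming the samples have already been parsed from JSONL. `to_training_pair` is a hypothetical helper, and the `"input"` / `"output"` key names are placeholders: the exact record layout depends on the fine-tuning framework and chat template you pick.

```python
import json
from typing import Any, Optional


def to_training_pair(sample: dict[str, Any]) -> Optional[dict[str, str]]:
    """Map one ground-truth sample to an input/output pair, or None to skip it."""
    initial = sample.get("initial_result")
    final = sample.get("final_result")
    # Skip unreviewed samples and jobs that never produced a pipeline result.
    if not sample.get("approved") or initial is None or final is None:
        return None
    return {
        # sort_keys keeps the serialisation stable across exports;
        # ensure_ascii=False preserves non-ASCII text as written.
        "input": json.dumps(initial, ensure_ascii=False, sort_keys=True),
        "output": json.dumps(final, ensure_ascii=False, sort_keys=True),
    }
```

If your pipeline keeps the raw OCR text, substitute it for the serialised `initial_result` on the input side; the output side stays the same.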