Phase 7: ground-truth export (JSONL + stats) + CLI tool
- GET /api/v1/ground-truth/export streaming JSONL (approved_only, since, until, has_corrections, limit)
- GET /api/v1/ground-truth/stats total / approved / corrections counts + top-N most-corrected field paths
- python -m ocr_sprint.tools.export_ground_truth operator CLI with the same filters + optional --print-stats
- Ground-truth sample reconstructs the pipeline's original output by replaying job_corrections in reverse
- docs/ground-truth-format.md schema + fine-tuning guidance
- 17 new tests (service replay, endpoint filters, CLI)
- 201 total tests passing, ruff / mypy --strict clean

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
112
docs/ground-truth-format.md
Normal file
@@ -0,0 +1,112 @@
# Ground-truth export format (Phase 7)

The service exposes the human-in-the-loop (HITL) review corpus as
[JSONL](https://jsonlines.org/), one training sample per line, via the
HTTP endpoint `GET /api/v1/ground-truth/export` and the equivalent CLI:

```bash
python -m ocr_sprint.tools.export_ground_truth --out corpus.jsonl
```

Both paths read the same database and emit byte-identical JSONL, so a
cron-scheduled dump and an ad-hoc curl download are interchangeable.
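
As a rough illustration of consuming the HTTP path from Python, the sketch below builds the export URL from the filter names listed in the commit message (`approved_only`, `since`, `until`, `has_corrections`, `limit`) and parses the streamed JSONL line by line. The helper names, the base URL, and the way booleans and timestamps are encoded in the query string are assumptions, not something this document confirms.

```python
from __future__ import annotations

import json
from typing import Any, Iterable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

EXPORT_PATH = "/api/v1/ground-truth/export"


def build_export_url(base_url: str, **filters: Any) -> str:
    """Build the export URL, skipping filters that are left as None."""
    params: dict[str, str] = {}
    for name, value in filters.items():
        if value is None:
            continue
        # Assumption: booleans travel as lowercase "true" / "false".
        params[name] = str(value).lower() if isinstance(value, bool) else str(value)
    query = urlencode(params)
    return f"{base_url.rstrip('/')}{EXPORT_PATH}" + (f"?{query}" if query else "")


def parse_jsonl(lines: Iterable[bytes | str]) -> Iterator[dict[str, Any]]:
    """Yield one decoded sample per non-empty line."""
    for raw in lines:
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if text.strip():
            yield json.loads(text)


def fetch_samples(base_url: str, **filters: Any) -> Iterator[dict[str, Any]]:
    """Stream samples from a running service (needs network access)."""
    with urlopen(build_export_url(base_url, **filters)) as response:
        yield from parse_jsonl(response)
```

Because `parse_jsonl` accepts any iterable of lines, the same function can read a file written by the CLI (`open("corpus.jsonl", encoding="utf-8")`) as well as the HTTP response.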

## Sample schema

Each line is a single JSON object with the following shape:

```jsonc
{
  "job_id": "c5da6747-...",
  "filename": "sprint-042.pdf",
  "source_kind": "pdf",
  "approved": true,
  "reviewed_by": "reviewer-a",   // free-form; comes from X-User-Id
  "reviewed_at": "2025-06-01T10:15:00Z",
  "created_at": "2025-05-28T08:02:17Z",

  // The pipeline's original pre-HITL output, reconstructed by replaying
  // the audit trail backwards. `null` for jobs that never produced a
  // result (e.g. hard-failed on OCR).
  "initial_result": {
    "header": { "nomor_sprint": "Sprin/1/I/2025", "perihal": null, ... },
    "personel": [ { "pangkat": "AIPDA", "nrp": "77060000", ... } ],
    ...
  },

  // The reviewer-approved answer (current value of jobs.result).
  "final_result": { ...same shape as initial_result... },

  // Every correction event, in chronological order.
  "corrections": [
    {
      "field_path": "header.perihal",
      "old_value": null,
      "new_value": "Penyelidikan kasus pencurian",
      "corrected_by": "reviewer-a",
      "reason": "LLM missed it",
      "corrected_at": "2025-05-30T14:00:00Z"
    }
  ],

  "review_flags": ["llm_fallback"],
  "confidence": 0.78
}
```
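
The reverse replay that produces `initial_result` can be pictured roughly as follows. This is a minimal sketch, not the service's actual code: the helper names are hypothetical, and it assumes field paths use only the two forms seen in this document (dotted keys such as `header.perihal` and bracketed list indices such as `personel[0].nrp`). It also assumes every correction overwrites an existing slot, so corrections that add or remove list items are not handled.

```python
from __future__ import annotations

import copy
import re
from typing import Any

# Assumption: a path is a run of dict keys and [index] list accessors.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def parse_path(field_path: str) -> list[str | int]:
    """Split a field path into dict keys (str) and list indices (int)."""
    parts: list[str | int] = []
    for key, index in _TOKEN.findall(field_path):
        parts.append(key if key else int(index))
    return parts


def set_path(doc: Any, field_path: str, value: Any) -> None:
    """Assign value at field_path inside doc, in place."""
    *parents, last = parse_path(field_path)
    node = doc
    for part in parents:
        node = node[part]
    node[last] = value


def reconstruct_initial(
    final_result: dict[str, Any] | None,
    corrections: list[dict[str, Any]],
) -> dict[str, Any] | None:
    """Undo corrections newest-first, restoring each old_value."""
    if final_result is None:
        return None
    result = copy.deepcopy(final_result)  # leave the approved answer untouched
    # corrections are chronological, so walk them backwards.
    for event in reversed(corrections):
        set_path(result, event["field_path"], event["old_value"])
    return result
```

Walking the list backwards matters when one field was corrected more than once: the oldest event is applied last, so its `old_value` (the pipeline's original output) is what remains.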

## Recommended filters

* `approved_only=true` (default): **do not** train on unreviewed
  samples; they can still contain OCR mistakes.
* `has_corrections=true`: builds a "hard examples" set where the pipeline
  was originally wrong.
* `has_corrections=false`: builds a "sanity" set where the pipeline was
  already right. Good for regression tests after fine-tuning.
* `since` / `until`: build incremental snapshots without re-processing
  the full history.
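
If a single unfiltered dump is already on disk, the "hard" and "sanity" sets can also be separated offline. The sketch below is only an approximation of the server-side filters: it relies on the `approved` and `corrections` fields from the sample schema, and the function name is hypothetical.

```python
from __future__ import annotations

from typing import Any, Iterable


def split_corpus(
    samples: Iterable[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return (hard, sanity) lists built from approved samples only."""
    hard: list[dict[str, Any]] = []
    sanity: list[dict[str, Any]] = []
    for sample in samples:
        if not sample.get("approved"):
            continue  # mirrors approved_only=true
        # A non-empty corrections list means the pipeline was wrong somewhere.
        (hard if sample.get("corrections") else sanity).append(sample)
    return hard, sanity
```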

## When is the dataset big enough to fine-tune?

Rough operational checklist (rules of thumb; adjust based on your own
error analysis):

| Bucket                          | Minimum rows | Notes                                                         |
|---------------------------------|--------------|---------------------------------------------------------------|
| LoRA on header extraction (LLM) | ~200–500     | Per-field error signal must exceed random noise.              |
| Per-satuan prompt tuning        | ~50 / satuan | Helps when formats differ sharply between Polda/Polres units. |
| PP-Structure table fine-tune    | ~1,000+      | Layout models are data-hungry; hold off until HITL is steady. |

Use `GET /api/v1/ground-truth/stats` to check coverage:

```json
{
  "total_jobs": 842,
  "approved_jobs": 613,
  "total_corrections": 1204,
  "jobs_with_corrections": 431,
  "top_corrected_fields": [
    { "field_path": "header.perihal", "count": 289 },
    { "field_path": "personel[0].nrp", "count": 51 },
    ...
  ]
}
```

Fields at the top of `top_corrected_fields` are the highest-leverage
targets for prompt tweaks, regex upgrades, or (eventually) fine-tuning.
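
One possible way to turn a stats payload into a quick readiness check is sketched below. The function name is hypothetical, and treating `approved_jobs` as the "rows" of the checklist table (with 200 as the lower end of the LoRA rule of thumb) is an assumption rather than a rule stated by the service.

```python
from __future__ import annotations

from typing import Any


def summarize_stats(stats: dict[str, Any], min_rows: int = 200) -> dict[str, Any]:
    """Derive simple ratios from a /ground-truth/stats payload."""
    total = stats.get("total_jobs", 0)
    approved = stats.get("approved_jobs", 0)
    corrections = stats.get("total_corrections", 0)
    with_corrections = stats.get("jobs_with_corrections", 0)
    # Share of all corrections that each listed field accounts for.
    field_share = {
        item["field_path"]: item["count"] / corrections
        for item in stats.get("top_corrected_fields", [])
        if corrections
    }
    return {
        "approval_rate": approved / total if total else 0.0,
        "correction_rate": with_corrections / total if total else 0.0,
        "field_share": field_share,
        # Assumption: one approved job counts as one training row.
        "enough_rows": approved >= min_rows,
    }
```

A field whose share dominates `field_share` is a natural first candidate for a prompt or regex fix before any fine-tuning is attempted.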

## Fine-tuning outside this repo

The export is deliberately framework-agnostic. Suggested follow-ups on
dedicated GPU hardware:

* [**Unsloth**](https://github.com/unslothai/unsloth): LoRA on
  Qwen2.5 / Llama 3.1 with 2–4× speedups on a single GPU.
* [**Axolotl**](https://github.com/axolotl-ai-cloud/axolotl): more
  batteries-included; good for multi-GPU runs.

Typical prompt-completion conversion: feed `initial_result` (or the raw
OCR text, if your pipeline keeps it) as the "input" and `final_result`
as the "output". The `corrections` list is only needed if you want to
build an error-class analysis; the model itself trains on the final
answer.
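
The conversion described above might look like the following sketch. The `prompt` / `completion` key names and both function names are assumptions chosen for illustration; the dataset layout a given trainer expects depends on its own configuration, so adapt the keys accordingly.

```python
from __future__ import annotations

import json
from typing import Any


def to_prompt_completion(sample: dict[str, Any]) -> dict[str, str] | None:
    """Map one exported sample to a prompt/completion pair.

    Returns None when either side is missing, e.g. jobs whose
    initial_result is null because OCR hard-failed.
    """
    initial = sample.get("initial_result")
    final = sample.get("final_result")
    if initial is None or final is None:
        return None
    return {
        # sort_keys keeps the serialized text stable across runs.
        "prompt": json.dumps(initial, ensure_ascii=False, sort_keys=True),
        "completion": json.dumps(final, ensure_ascii=False, sort_keys=True),
    }


def convert_file(src_path: str, dst_path: str) -> int:
    """Rewrite an exported JSONL file as prompt/completion JSONL."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            record = to_prompt_completion(json.loads(line))
            if record is not None:
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")
                written += 1
    return written
```

For example, `convert_file("corpus.jsonl", "train.jsonl")` would return the number of usable pairs written, which can be compared against the row counts in the checklist.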