Phase 5: hybrid LLM extraction (Ollama) for header gaps

Adds a small Ollama HTTP client (httpx-based, no extra runtime deps),
prompt builders, and a hybrid header extractor that runs *after* the
deterministic regex layer. The merger never overwrites a regex-filled
field; the LLM only fills gaps.

If LLM_ENABLED=false (the default), or the Ollama server is unreachable,
the pipeline degrades gracefully:

- LLM_ENABLED=false -> no LLM call at all, no flag.
- LLM_ENABLED=true, header complete -> no LLM call.
- LLM_ENABLED=true, header has gaps, LLM responded OK -> merge +
  LLM_FALLBACK flag (review hint), set only if the merge changed a field.
- LLM_ENABLED=true, header has gaps, LLM unavailable -> keep regex
  result + LLM_UNAVAILABLE flag.

Default model qwen2.5:1.5b on http://localhost:11434, chosen for CPU
throughput (~5-15s per call) at acceptable accuracy.

The LLM only fills the *header* (nomor, tanggal, satuan, perihal,
dasar). Personnel rows stay with PP-Structure, which is more accurate
for them and does not need an LLM.

Tests:

- test_llm_client.py: httpx MockTransport-driven tests for the wire
  format, error paths (HTTP 5xx, malformed JSON, missing envelope,
  ConnectError), and request shape.
- test_llm_extractor.py: merge policy + None-on-unavailable behaviour.
- test_orchestrator_llm.py: end-to-end orchestrator wiring with stubs
  for ingest/preprocess/OCR/table. Verifies that the LLM is skipped when
  disabled, skipped when the header is complete, called and flagged when
  gaps exist, and marked unavailable when the client returns None.

162 unit tests pass total (was 146).

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
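
The merge policy described in the commit message (regex wins, the LLM only fills gaps) can be sketched roughly as below. This is an illustration under assumptions, not the code from `ocr_sprint.llm.extractor`: the `Header` dataclass, the `merge_gaps` function, and the sample values are hypothetical stand-ins, and the real header appears to be a Pydantic-style model rather than a dataclass.

```python
from __future__ import annotations

from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Header:
    """Hypothetical stand-in for the project's header model."""

    nomor_sprint: str | None = None
    tanggal: str | None = None
    satuan_penerbit: str | None = None
    perihal: str | None = None
    dasar: tuple[str, ...] = ()


def merge_gaps(regex_header: Header, llm_values: dict[str, object]) -> Header:
    """Return a copy of ``regex_header`` with only its empty fields filled.

    A field that regex already captured is never replaced, and an empty
    LLM value (None, "", empty list) is never written into a gap.
    """
    updates = {}
    for f in fields(regex_header):
        current = getattr(regex_header, f.name)
        candidate = llm_values.get(f.name)
        if not current and candidate:
            updates[f.name] = candidate
    return replace(regex_header, **updates)


if __name__ == "__main__":
    # Made-up values, purely for illustration.
    from_regex = Header(nomor_sprint="REGEX-VALUE")
    from_llm = {"nomor_sprint": "LLM-VALUE", "tanggal": "2026-04-25", "perihal": ""}
    merged = merge_gaps(from_regex, from_llm)
    print(merged.nomor_sprint)  # regex value is kept
    print(merged.tanggal)       # gap filled from the LLM
    print(merged.perihal)       # still empty: the LLM offered nothing usable
```

Returning a new object instead of mutating the input keeps the regex result available for comparison, which is what lets a caller decide whether the merge changed anything.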
2026-04-25 16:56:43 +00:00
parent 2112023b6e
commit 45fbfdabb7
9 changed files with 646 additions and 1 deletions
--- a/src/ocr_sprint/pipeline/orchestrator.py
+++ b/src/ocr_sprint/pipeline/orchestrator.py
@@ -15,6 +15,7 @@ from __future__ import annotations
 from dataclasses import dataclass

 from ocr_sprint.config import get_settings
+from ocr_sprint.llm.extractor import llm_fill_header
 from ocr_sprint.pipeline.confidence import compute_confidence, route
 from ocr_sprint.pipeline.document_detect import DocumentDetectConfig, detect_and_correct
 from ocr_sprint.pipeline.extract.personnel import extract_personnel
@@ -35,6 +36,18 @@ _logger = get_logger(__name__)
 _OCR_CONFIDENCE_FLAG_THRESHOLD = 0.80


+def _header_has_gaps(header: object) -> bool:
+    """True if any header field worth asking the LLM about is missing.
+
+    ``getattr`` with a default keeps this check from raising if the
+    header schema renames or drops a field; a missing attribute simply
+    counts as a gap. Hard-coded access was costly at the last schema change.
+    """
+    for field in ("nomor_sprint", "tanggal", "satuan_penerbit", "perihal"):
+        if not getattr(header, field, None):
+            return True
+    return not getattr(header, "dasar", None)
+
+
 @dataclass
 class PipelineOutput:
    """Bundle returned by the orchestrator."""
@@ -84,6 +97,20 @@ def run_pipeline(content: bytes) -> PipelineOutput:
    header = extract_header(full_text)
    ttd = find_signatory(full_text)

+    # Phase 5 — hybrid extraction. The regex layer is deterministic but
+    # brittle to layout variants between satuan; if any header field is
+    # still missing we ask the local LLM to fill the gaps. The merger
+    # never lets the LLM overwrite a field that regex already captured.
+    llm_flags: list[ReviewFlag] = []
+    if s.llm_enabled and _header_has_gaps(header):
+        merged = llm_fill_header(full_text, header)
+        if merged is None:
+            llm_flags.append(ReviewFlag.LLM_UNAVAILABLE)
+        else:
+            if merged.model_dump() != header.model_dump():
+                llm_flags.append(ReviewFlag.LLM_FALLBACK)
+            header = merged
+
    personel: list[PersonnelEntry] = []
    if s.tables_enabled and cleaned_pages:
        all_tables: list[DetectedTable] = []
@@ -99,7 +126,7 @@ def run_pipeline(content: bytes) -> PipelineOutput:
            personel_rows=len(personel),
        )

-    initial_flags: list[ReviewFlag] = []
+    initial_flags: list[ReviewFlag] = list(llm_flags)
    if mean_ocr_conf < _OCR_CONFIDENCE_FLAG_THRESHOLD:
        initial_flags.append(ReviewFlag.LOW_OCR_CONFIDENCE)
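
The gating logic added to `run_pipeline` above can be exercised in isolation with a small sketch like the one below. It is a simplified model under assumptions, not the orchestrator itself: headers are plain dicts, `gate_llm` and `has_gaps` are hypothetical names, the `ReviewFlag` enum values are invented, and `fill` stands in for `llm_fill_header` (which in the diff also receives the full OCR text).

```python
from __future__ import annotations

from enum import Enum
from typing import Callable, Optional


class ReviewFlag(str, Enum):
    """Hypothetical subset of the project's review flags."""

    LLM_FALLBACK = "LLM_FALLBACK"
    LLM_UNAVAILABLE = "LLM_UNAVAILABLE"


# Same five fields that _header_has_gaps checks in the diff.
HEADER_FIELDS = ("nomor_sprint", "tanggal", "satuan_penerbit", "perihal", "dasar")

Header = dict
FillFn = Callable[[Header], Optional[Header]]


def has_gaps(header: Header) -> bool:
    """True if any of the five header fields is missing or empty."""
    return any(not header.get(name) for name in HEADER_FIELDS)


def gate_llm(header: Header, llm_enabled: bool, fill: FillFn) -> tuple[Header, list[ReviewFlag]]:
    """Decide whether to call the LLM and which review flag to attach."""
    if not llm_enabled or not has_gaps(header):
        return header, []  # no call, no flag
    merged = fill(header)
    if merged is None:
        # LLM unreachable or unusable: keep the regex result.
        return header, [ReviewFlag.LLM_UNAVAILABLE]
    # Flag for review only when the LLM actually contributed something.
    flags = [ReviewFlag.LLM_FALLBACK] if merged != header else []
    return merged, flags


if __name__ == "__main__":
    gappy = {name: "x" for name in HEADER_FIELDS}
    gappy["perihal"] = None
    print(gate_llm(gappy, True, lambda h: None))
    print(gate_llm(gappy, True, lambda h: dict(h, perihal="filled")))
```

As in the diff, an LLM reply that leaves the header unchanged raises no flag, so reviewers are only pointed at documents where an LLM-derived value actually reached the output.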