Phase 5: hybrid LLM extraction (Ollama) for header gaps

Adds a small Ollama HTTP client (httpx-based, no extra runtime deps), prompt builders, and a hybrid header extractor that runs *after* the deterministic regex layer. The merger never overwrites a regex-filled field; the LLM only fills gaps.

If LLM_ENABLED=false (the default), or the Ollama server is unreachable, the pipeline degrades gracefully:

- LLM_ENABLED=false -> no LLM call at all, no flag.
- LLM_ENABLED=true, header complete -> no LLM call.
- LLM_ENABLED=true, header has gaps, LLM responded ok -> merge + LLM_FALLBACK flag (review hint).
- LLM_ENABLED=true, header has gaps, LLM unavailable -> keep regex result + LLM_UNAVAILABLE flag.

Default model qwen2.5:1.5b on http://localhost:11434, chosen for CPU throughput (~5-15s per call) at acceptable accuracy.

The LLM only fills the *header* (nomor, tanggal, satuan, perihal, dasar). Personnel rows stay with PP-Structure since that's more accurate and doesn't need LLM.

Tests:

- test_llm_client.py: httpx MockTransport-driven tests for the wire format, error paths (HTTP 5xx, malformed JSON, missing envelope, ConnectError), and request shape.
- test_llm_extractor.py: merge policy + None-on-unavailable behaviour.
- test_orchestrator_llm.py: end-to-end orchestrator wiring with stubs for ingest/preprocess/OCR/table. Verifies LLM is skipped when disabled, skipped when header is complete, called and flagged when gaps exist, and marked unavailable when the client returns None.

162 unit tests pass total (was 146).

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
2026-04-25 16:56:43 +00:00
parent 2112023b6e
commit 45fbfdabb7
9 changed files with 646 additions and 1 deletions
--- a/src/ocr_sprint/llm/extractor.py
+++ b/src/ocr_sprint/llm/extractor.py
@@ -0,0 +1,84 @@
+"""High-level LLM extractor.
+
+The job is *narrow*: take the raw OCR text plus the partial header that
+came back from the regex layer, and return an LLM-derived header that the
+caller can merge in. We never let the LLM populate the personnel table —
+PP-Structure is more accurate and cheaper for that.
+"""
+
+from __future__ import annotations
+
+from datetime import date
+
+from pydantic import BaseModel, Field
+
+from ocr_sprint.llm.client import LLMUnavailableError, OllamaClient
+from ocr_sprint.llm.prompts import SYSTEM_HEADER, build_user_prompt
+from ocr_sprint.schemas.extraction import HeaderFields
+from ocr_sprint.utils.logging import get_logger
+
+_logger = get_logger(__name__)
+
+
+class LLMHeaderResult(BaseModel):
+    """Schema we ask the model to fill. Mirrors ``HeaderFields`` but is
+    intentionally separate so we control exactly what the prompt and
+    validation surface look like — the public ``HeaderFields`` may grow
+    fields later that we don't want the LLM touching.
+    """
+
+    nomor_sprint: str | None = None
+    tanggal: date | None = None
+    satuan_penerbit: str | None = None
+    perihal: str | None = None
+    dasar: list[str] = Field(default_factory=list)
+
+
+def llm_fill_header(
+    raw_text: str,
+    regex_header: HeaderFields,
+    *,
+    client: OllamaClient | None = None,
+) -> HeaderFields | None:
+    """Run the LLM extractor and return a *merged* HeaderFields.
+
+    Returns ``None`` if the model is unavailable so the caller can decide
+    what to do (typically: keep the regex result and emit the
+    ``LLM_UNAVAILABLE`` review flag).
+    """
+    client = client or OllamaClient()
+
+    user = build_user_prompt(
+        raw_text=raw_text,
+        regex_partial=regex_header.model_dump(mode="json"),
+    )
+
+    try:
+        llm = client.chat_json(SYSTEM_HEADER, user, LLMHeaderResult)
+    except LLMUnavailableError as exc:
+        _logger.warning("llm.unavailable", error=str(exc))
+        return None
+
+    return _merge(regex_header, llm)
+
+
+def _merge(regex: HeaderFields, llm: LLMHeaderResult) -> HeaderFields:
+    """Merge LLM output into the regex result.
+
+    Policy: regex wins for any field it already filled. The LLM only fills
+    the *gaps*. This keeps deterministic / verifiable extractions for the
+    fields where regex is reliable and prevents the LLM from "correcting"
+    a value that happens to look unusual but is in fact correct.
+    """
+    merged = regex.model_copy(deep=True)
+    if merged.nomor_sprint is None and llm.nomor_sprint:
+        merged.nomor_sprint = llm.nomor_sprint
+    if merged.tanggal is None and llm.tanggal is not None:
+        merged.tanggal = llm.tanggal
+    if not merged.satuan_penerbit and llm.satuan_penerbit:
+        merged.satuan_penerbit = llm.satuan_penerbit
+    if not merged.perihal and llm.perihal:
+        merged.perihal = llm.perihal
+    if not merged.dasar and llm.dasar:
+        merged.dasar = list(llm.dasar)
+    return merged
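
The orchestrator side of this change is not part of the hunk above, so the following is only a minimal sketch of the four-case behaviour listed in the commit message. It assumes a stand-in `Header` dataclass instead of the real `HeaderFields`, and a hypothetical `resolve_header` helper whose `fill` argument plays the role of `llm_fill_header` (returning `None` when the model is unavailable). Only the flag names `LLM_FALLBACK` and `LLM_UNAVAILABLE` come from the commit message; everything else is illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Header:
    """Hypothetical stand-in for HeaderFields (illustration only)."""

    nomor_sprint: Optional[str] = None
    tanggal: Optional[str] = None
    satuan_penerbit: Optional[str] = None
    perihal: Optional[str] = None
    dasar: List[str] = field(default_factory=list)

    def has_gaps(self) -> bool:
        # A field counts as a gap when it is None or empty.
        return not all(
            [
                self.nomor_sprint,
                self.tanggal,
                self.satuan_penerbit,
                self.perihal,
                self.dasar,
            ]
        )


def resolve_header(
    regex_header: Header,
    llm_enabled: bool,
    fill: Callable[[Header], Optional[Header]],
) -> Tuple[Header, List[str]]:
    """Return (header, review_flags) following the four cases.

    `fill` is assumed to behave like llm_fill_header: it returns a merged
    header, or None when the LLM cannot be reached.
    """
    if not llm_enabled:
        return regex_header, []  # no LLM call at all, no flag
    if not regex_header.has_gaps():
        return regex_header, []  # header complete, no LLM call
    merged = fill(regex_header)
    if merged is None:
        return regex_header, ["LLM_UNAVAILABLE"]  # keep the regex result
    return merged, ["LLM_FALLBACK"]  # merged result, flagged for review


# Made-up values; an unreachable model is simulated by returning None.
partial = Header(nomor_sprint="X-001")
header, flags = resolve_header(partial, True, lambda h: None)
```

Passing the extractor in as a callable keeps the decision logic testable without a running Ollama server, which mirrors the stub-based approach the commit message describes for `test_orchestrator_llm.py`.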