Phase 5: hybrid LLM extraction (Ollama) for header gaps

Adds a small Ollama HTTP client (httpx-based, no extra runtime deps),
prompt builders, and a hybrid header extractor that runs *after* the
deterministic regex layer. The merger never overwrites a regex-filled
field — the LLM only fills gaps. If LLM_ENABLED=false (the default), or
the Ollama server is unreachable, the pipeline degrades gracefully:

  - LLM_ENABLED=false  ->  no LLM call at all, no flag.
  - LLM_ENABLED=true,
    header complete    ->  no LLM call.
  - LLM_ENABLED=true,
    header has gaps,
    LLM responded ok   ->  merge + LLM_FALLBACK flag (review hint).
  - LLM_ENABLED=true,
    header has gaps,
    LLM unavailable    ->  keep regex result + LLM_UNAVAILABLE flag.
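
The four cases above can be sketched as a single decision function. This is a minimal illustration, not the orchestrator code from this commit: `resolve_header`, `has_gaps`, and `fill_gaps` are hypothetical names, and only the two flag strings are taken from the list above.

```python
from __future__ import annotations

from typing import Callable, Optional

# Flag names follow the commit message; everything else is hypothetical.
LLM_FALLBACK = "LLM_FALLBACK"
LLM_UNAVAILABLE = "LLM_UNAVAILABLE"


def resolve_header(
    header: dict,
    *,
    llm_enabled: bool,
    has_gaps: Callable[[dict], bool],
    fill_gaps: Callable[[dict], Optional[dict]],
) -> tuple[dict, list[str]]:
    """Return (header, review_flags) for the four cases listed above."""
    if not llm_enabled:
        return header, []  # disabled: no LLM call, no flag
    if not has_gaps(header):
        return header, []  # header complete: no LLM call
    merged = fill_gaps(header)  # assumed to return None when the LLM is down
    if merged is None:
        return header, [LLM_UNAVAILABLE]  # keep the regex result
    return merged, [LLM_FALLBACK]  # merged result, flagged for review
```

Checking `llm_enabled` and `has_gaps` before calling `fill_gaps` is what guarantees that the first two cases never touch the network.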

Default model: qwen2.5:1.5b on http://localhost:11434, chosen for CPU
latency (~5-15s per call) at acceptable accuracy. The LLM only fills
the *header* (nomor, tanggal, satuan, perihal, dasar). Personnel rows
stay with PP-Structure, which is more accurate and doesn't need an LLM.

Tests:
 - test_llm_client.py: httpx MockTransport-driven tests for the wire
   format, error paths (HTTP 5xx, malformed JSON, missing envelope,
   ConnectError), and request shape.
 - test_llm_extractor.py: merge policy + None-on-unavailable behaviour.
 - test_orchestrator_llm.py: end-to-end orchestrator wiring with stubs
   for ingest/preprocess/OCR/table — verifies LLM is skipped when
   disabled, skipped when header is complete, called and flagged when
   gaps exist, and marked unavailable when the client returns None.
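
The client error paths listed above (HTTP 5xx, malformed JSON, missing envelope) can be illustrated without any network access. The sketch below is not the client from this commit: `parse_chat_response` is a hypothetical helper, the `LLMUnavailableError` class is a local stand-in for the one in `ocr_sprint.llm.client`, and the envelope shape `{"message": {"content": ...}}` is an assumption based on Ollama's non-streaming chat response.

```python
from __future__ import annotations

import json


class LLMUnavailableError(RuntimeError):
    """Local stand-in for the exception in ocr_sprint.llm.client."""


def parse_chat_response(status_code: int, body: str) -> dict:
    """Turn a raw chat response into the model's JSON payload, or raise."""
    if status_code >= 500:
        raise LLMUnavailableError(f"server error: HTTP {status_code}")
    try:
        envelope = json.loads(body)
    except json.JSONDecodeError as exc:
        raise LLMUnavailableError("response body is not JSON") from exc
    # Assumed envelope: {"message": {"content": "<JSON string>"}}
    try:
        content = envelope["message"]["content"]
    except (KeyError, TypeError) as exc:
        raise LLMUnavailableError("missing message.content envelope") from exc
    try:
        payload = json.loads(content)
    except (json.JSONDecodeError, TypeError) as exc:
        raise LLMUnavailableError("model output is not valid JSON") from exc
    if not isinstance(payload, dict):
        raise LLMUnavailableError("model output is not a JSON object")
    return payload
```

Funnelling every failure into one exception type lets the caller treat "server down" and "model returned garbage" identically, which matches the None-on-unavailable behaviour described for the extractor.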

162 unit tests pass total (was 146).

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
Devin AI
2026-04-25 16:56:43 +00:00
parent 2112023b6e
commit 45fbfdabb7
9 changed files with 646 additions and 1 deletions

@@ -0,0 +1,84 @@
"""High-level LLM extractor.

The job is *narrow*: take the raw OCR text plus the partial header that
came back from the regex layer, and return an LLM-derived header that the
caller can merge in. We never let the LLM populate the personnel table;
PP-Structure is more accurate and cheaper for that.
"""

from __future__ import annotations

from datetime import date

from pydantic import BaseModel, Field

from ocr_sprint.llm.client import LLMUnavailableError, OllamaClient
from ocr_sprint.llm.prompts import SYSTEM_HEADER, build_user_prompt
from ocr_sprint.schemas.extraction import HeaderFields
from ocr_sprint.utils.logging import get_logger

_logger = get_logger(__name__)


class LLMHeaderResult(BaseModel):
    """Schema we ask the model to fill. Mirrors ``HeaderFields`` but is
    intentionally separate so we control exactly what the prompt and
    validation surface look like; the public ``HeaderFields`` may grow
    fields later that we don't want the LLM touching.
    """

    nomor_sprint: str | None = None
    tanggal: date | None = None
    satuan_penerbit: str | None = None
    perihal: str | None = None
    dasar: list[str] = Field(default_factory=list)


def llm_fill_header(
    raw_text: str,
    regex_header: HeaderFields,
    *,
    client: OllamaClient | None = None,
) -> HeaderFields | None:
    """Run the LLM extractor and return a *merged* HeaderFields.

    Returns ``None`` if the model is unavailable so the caller can decide
    what to do (typically: keep the regex result and emit the
    ``LLM_UNAVAILABLE`` review flag).
    """
    client = client or OllamaClient()
    user = build_user_prompt(
        raw_text=raw_text,
        regex_partial=regex_header.model_dump(mode="json"),
    )
    try:
        llm = client.chat_json(SYSTEM_HEADER, user, LLMHeaderResult)
    except LLMUnavailableError as exc:
        _logger.warning("llm.unavailable", error=str(exc))
        return None
    return _merge(regex_header, llm)


def _merge(regex: HeaderFields, llm: LLMHeaderResult) -> HeaderFields:
    """Merge LLM output into the regex result.

    Policy: regex wins for any field it already filled. The LLM only fills
    the *gaps*. This keeps deterministic / verifiable extractions for the
    fields where regex is reliable and prevents the LLM from "correcting"
    a value that happens to look unusual but is in fact correct.
    """
    merged = regex.model_copy(deep=True)
    if merged.nomor_sprint is None and llm.nomor_sprint:
        merged.nomor_sprint = llm.nomor_sprint
    if merged.tanggal is None and llm.tanggal is not None:
        merged.tanggal = llm.tanggal
    if not merged.satuan_penerbit and llm.satuan_penerbit:
        merged.satuan_penerbit = llm.satuan_penerbit
    if not merged.perihal and llm.perihal:
        merged.perihal = llm.perihal
    if not merged.dasar and llm.dasar:
        merged.dasar = list(llm.dasar)
    return merged
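
The "regex wins, LLM fills gaps" policy can be tried in isolation with plain dataclasses. This is a simplified sketch rather than the module's code: `Header` is a hypothetical stand-in for both `HeaderFields` and `LLMHeaderResult`, `merge` treats every empty value as a gap (the real `_merge` checks `is None` for two of the fields), and the sample values are made up.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace
from datetime import date


@dataclass
class Header:
    """Hypothetical stand-in for HeaderFields / LLMHeaderResult."""

    nomor_sprint: str | None = None
    tanggal: date | None = None
    satuan_penerbit: str | None = None
    perihal: str | None = None
    dasar: list[str] = field(default_factory=list)


FIELDS = ("nomor_sprint", "tanggal", "satuan_penerbit", "perihal", "dasar")


def merge(regex: Header, llm: Header) -> Header:
    """Regex wins wherever it has a value; the LLM only fills gaps."""
    # Copy so the regex result passed in is never mutated.
    merged = replace(regex, dasar=list(regex.dasar))
    for name in FIELDS:
        llm_value = getattr(llm, name)
        if not getattr(merged, name) and llm_value:
            if isinstance(llm_value, list):
                llm_value = list(llm_value)  # avoid sharing the LLM's list
            setattr(merged, name, llm_value)
    return merged


# Made-up sample values, for illustration only.
regex_result = Header(nomor_sprint="Sprin/1/IV/2026")
llm_result = Header(nomor_sprint="Sprin/7/IV/2026", perihal="penugasan")
merged = merge(regex_result, llm_result)
# merged keeps the regex nomor_sprint and takes perihal from the LLM.
```

Working on a copy matters here: the caller still holds the untouched regex result and can fall back to it, or diff it against the merged header during review.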