OCR-SPRIN-SERVICE

adrian/OCR-SPRIN-SERVICE


Commit Graph


Author

SHA1

Message

Date

Devin AI

58a2bf2648

Fix personnel extraction + header bugs on real Polres Cimahi sprint

This fixes 4 bugs found on a real Polres Cimahi SPRIN PDF (items 1-4)
and adds two related hardening changes (items 5-6):

1. satuan_penerbit captured the generic 'KEPOLISIAN NEGARA REPUBLIK
   INDONESIA' letterhead line instead of the most-specific issuing unit
   (e.g. RESOR CIMAHI / SEKTOR PADALARANG). Reworked find_satuan to
   scan for each level independently and return the deepest available.

2. find_dasar_list dropped numbered items when OCR put the marker on
   its own line ("1.\n Undang-Undang ..."). Refactored into
   _collect_numbered_section that buffers a bare-number line and uses
   the next non-empty line as the body. Also reused for the new
   find_untuk_list which extracts the previously-empty 'untuk' bullets.

3. find_perihal returned None for documents that use 'Pertimbangan'
   (very common in Polres-level sprint), forcing the LLM to guess.
   Added a regex fallback that picks up the first line under a
   'Pertimbangan' label so we keep extraction deterministic.

4. Personnel rows were emitted with only nama populated when
   PP-Structure detected a table but the column mapper degraded.
   Added a text-based fallback (extract_personnel_from_text) that
   scans raw OCR for <rank> + <8-digit NRP> patterns. Triggered when
   the PP-Structure result has fewer than 30% rank/NRP-bearing rows.
   Results from this path are marked for review by raising the new
   PERSONNEL_TEXT_FALLBACK flag.

5. Validation now flags rows with neither pangkat nor nrp as
   INCOMPLETE_PERSONNEL_ROW, so the document routes to needs_review
   even when individual nrp/pangkat checks pass on empty values.

6. Added 'BRIGPOL' as a variant of BRIGADIR (seen in real scans).
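
The header fixes in items 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the level patterns, the function signatures, and the sample strings are assumptions made for the example; only the names find_satuan and _collect_numbered_section (shown here without the leading underscore) come from the message above.

```python
import re
from typing import List, Optional

# Hypothetical letterhead levels, ordered from most generic to most specific.
_LEVELS = [
    re.compile(r"^\s*KEPOLISIAN NEGARA REPUBLIK INDONESIA\s*$"),
    re.compile(r"^\s*DAERAH\s+\S.*$"),
    re.compile(r"^\s*RESOR\s+\S.*$"),
    re.compile(r"^\s*SEKTOR\s+\S.*$"),
]

_BARE_MARKER = re.compile(r"^\s*\d+\.\s*$")       # "1." alone on a line
_INLINE_ITEM = re.compile(r"^\s*\d+\.\s+(\S.*)$")  # "1. body text"


def find_satuan(lines: List[str]) -> Optional[str]:
    """Scan for each level independently and return the deepest one found."""
    deepest = None
    for pattern in _LEVELS:
        for line in lines:
            if pattern.match(line):
                deepest = line.strip()  # a deeper level replaces a shallower one
                break
    return deepest


def collect_numbered_section(lines: List[str]) -> List[str]:
    """Collect numbered items, tolerating a marker that OCR put on its own line."""
    items: List[str] = []
    pending = False  # True after a bare "N." line, until its body arrives
    for line in lines:
        if not line.strip():
            continue
        if _BARE_MARKER.match(line):
            pending = True
            continue
        if pending:
            items.append(line.strip())  # next non-empty line is the body
            pending = False
            continue
        match = _INLINE_ITEM.match(line)
        if match:
            items.append(match.group(1).strip())
        elif items:
            items[-1] += " " + line.strip()  # continuation of the previous item
    return items


if __name__ == "__main__":
    letterhead = [
        "KEPOLISIAN NEGARA REPUBLIK INDONESIA",
        "RESOR CIMAHI",
        "SEKTOR PADALARANG",
    ]
    print(find_satuan(letterhead))
    print(collect_numbered_section(["1.", " Undang-Undang contoh", "2. Rencana kerja"]))
```

Scanning level by level, rather than returning the first letterhead match, is what keeps the generic top line from winning when a more specific unit is present.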

Tests: 229 (was 203) — 26 new tests covering the regex fixes,
text-based personnel extractor, low-quality detector, validator
behaviour, and orchestrator wiring of the fallback path.
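
For orientation, the personnel fallback and the row validation from items 4 and 5 might look something like the sketch below. The row shape (dicts with nama, pangkat, nrp), the rank list, the regex, and the helper names other than extract_personnel_from_text are assumptions for illustration. The 30% threshold, the 8-digit NRP, the BRIGPOL variant, and the flag name are taken from the message above.

```python
import re
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

# Hypothetical rank list; only BRIGPOL -> BRIGADIR is confirmed by the commit.
RANK_VARIANTS = {"BRIGPOL": "BRIGADIR"}
_RANKS = ["AKP", "IPTU", "IPDA", "AIPTU", "AIPDA", "BRIPKA",
          "BRIGADIR", "BRIGPOL", "BRIPTU", "BRIPDA"]

# <name?> <rank> <up to 10 non-digits, e.g. " NRP "> <8-digit NRP>
_ROW = re.compile(
    r"(?P<nama>.*?)\b(?P<pangkat>" + "|".join(_RANKS) + r")\b"
    r"\D{0,10}(?P<nrp>\d{8})\b"
)


def needs_text_fallback(rows: List[Row], threshold: float = 0.3) -> bool:
    """True when fewer than `threshold` of the rows carry a rank or an NRP."""
    if not rows:
        return True  # assumption: an empty table also triggers the fallback
    bearing = sum(1 for r in rows if r.get("pangkat") or r.get("nrp"))
    return bearing / len(rows) < threshold


def extract_personnel_from_text(text: str) -> List[Row]:
    """Scan raw OCR text line by line for <rank> + <8-digit NRP> patterns."""
    rows: List[Row] = []
    for line in text.splitlines():
        match = _ROW.search(line)
        if not match:
            continue
        # Drop a leading row number such as "1." or "2)" from the name part.
        nama = re.sub(r"^\s*\d+[.)]?\s*", "", match.group("nama")).strip()
        pangkat = match.group("pangkat")
        rows.append({
            "nama": nama or None,
            "pangkat": RANK_VARIANTS.get(pangkat, pangkat),
            "nrp": match.group("nrp"),
        })
    return rows


def validate_rows(rows: List[Row]) -> List[str]:
    """Flag the document when any row has neither pangkat nor nrp."""
    if any(not r.get("pangkat") and not r.get("nrp") for r in rows):
        return ["INCOMPLETE_PERSONNEL_ROW"]
    return []


if __name__ == "__main__":
    table_rows = [{"nama": "BUDI SANTOSO", "pangkat": None, "nrp": None}]
    if needs_text_fallback(table_rows):
        print(extract_personnel_from_text("1. BUDI SANTOSO BRIGPOL NRP 85010123"))
```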

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

2026-04-26 05:35:42 +00:00

Devin AI

45fbfdabb7

Phase 5: hybrid LLM extraction (Ollama) for header gaps

Adds a small Ollama HTTP client (httpx-based, no extra runtime deps),
prompt builders, and a hybrid header extractor that runs *after* the
deterministic regex layer. The merger never overwrites a regex-filled
field — the LLM only fills gaps. If LLM_ENABLED=false (the default), or
the Ollama server is unreachable, the pipeline degrades gracefully:

  - LLM_ENABLED=false  ->  no LLM call at all, no flag.
  - LLM_ENABLED=true,
    header complete    ->  no LLM call.
  - LLM_ENABLED=true,
    header has gaps,
    LLM responded ok   ->  merge + LLM_FALLBACK flag (review hint).
  - LLM_ENABLED=true,
    header has gaps,
    LLM unavailable    ->  keep regex result + LLM_UNAVAILABLE flag.
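
The four cases above amount to a small decision function. The following is a sketch under stated assumptions, not the extractor from this commit: the function name fill_header_gaps, the ask_llm callable, and the dict-based header are hypothetical. The field names, the two flags, and the rule that a regex-filled field is never overwritten come from the message.

```python
from typing import Callable, Dict, List, Optional, Tuple

HEADER_FIELDS = ("nomor", "tanggal", "satuan", "perihal", "dasar")

Header = Dict[str, object]


def fill_header_gaps(
    header: Header,
    llm_enabled: bool,
    ask_llm: Callable[[List[str]], Optional[Header]],
) -> Tuple[Header, List[str]]:
    """Return (merged header, flags); the LLM only fills empty fields."""
    merged = dict(header)
    gaps = [f for f in HEADER_FIELDS if not header.get(f)]

    # Cases 1 and 2: LLM disabled, or nothing to fill. No call, no flag.
    if not llm_enabled or not gaps:
        return merged, []

    suggestion = ask_llm(gaps)  # hypothetical client; returns None when unavailable

    # Case 4: LLM unreachable. Keep the regex result and say so.
    if suggestion is None:
        return merged, ["LLM_UNAVAILABLE"]

    # Case 3: merge, touching only the fields the regex layer left empty.
    for field in gaps:
        if suggestion.get(field):
            merged[field] = suggestion[field]
    return merged, ["LLM_FALLBACK"]


if __name__ == "__main__":
    regex_header = {"nomor": "X/1", "tanggal": "2026-04-25", "satuan": "RESOR CIMAHI",
                    "perihal": None, "dasar": ["a"]}
    stub = lambda gaps: {"perihal": "contoh", "nomor": "ignored"}
    print(fill_header_gaps(regex_header, True, stub))
```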

Default model qwen2.5:1.5b on http://localhost:11434 — chosen for CPU
throughput (~5-15s per call) at acceptable accuracy. The LLM only fills
the *header* (nomor, tanggal, satuan, perihal, dasar). Personnel rows
stay with PP-Structure since that's more accurate and doesn't need an LLM.
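
As a rough illustration of the wire format such a client deals with, the sketch below builds a request body and parses a reply using only the standard library. It assumes the client targets Ollama's /api/generate endpoint with streaming disabled and JSON output requested; the commit message does not say which endpoint is used, and the helper names here are hypothetical.

```python
import json
from typing import Dict, Optional

# Assumption: the generate endpoint on the default host named above.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(prompt: str, model: str = "qwen2.5:1.5b") -> Dict[str, object]:
    """Request body for a single, non-streamed completion that returns JSON."""
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_generate_response(status_code: int, body: str) -> Optional[Dict[str, object]]:
    """Return the model's JSON object, or None on any error path.

    Covers HTTP errors, a malformed body, a missing "response" envelope
    field, and model output that is not a JSON object.
    """
    if status_code >= 400:
        return None
    try:
        envelope = json.loads(body)
        parsed = json.loads(envelope["response"])
    except (ValueError, KeyError, TypeError):
        return None
    return parsed if isinstance(parsed, dict) else None


if __name__ == "__main__":
    print(build_generate_payload("Ekstrak perihal dari teks berikut: ..."))
    reply = json.dumps({"response": json.dumps({"perihal": "contoh"}), "done": True})
    print(parse_generate_response(200, reply))
```

In a real client the payload would be posted with something like httpx.post(OLLAMA_URL, json=payload, timeout=...), with a connection failure treated the same way as a None result here.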

Tests:
 - test_llm_client.py: httpx MockTransport-driven tests for the wire
   format, error paths (HTTP 5xx, malformed JSON, missing envelope,
   ConnectError), and request shape.
 - test_llm_extractor.py: merge policy + None-on-unavailable behaviour.
 - test_orchestrator_llm.py: end-to-end orchestrator wiring with stubs
   for ingest/preprocess/OCR/table — verifies LLM is skipped when
   disabled, skipped when header is complete, called and flagged when
   gaps exist, and marked unavailable when the client returns None.

162 unit tests pass total (was 146).

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

2026-04-25 16:56:43 +00:00

2 Commits