Devin Review correctly flagged that the bare "NO" and "KET" entries
in the blocklist would silently drop common Indonesian names (KETUT,
NOVA, NOOR, NORMAN, NOVIANTI, ...) because the check used startswith
rather than a word boundary.
Replaced the per-prefix loop with a single compiled regex anchored at
^ with a trailing \b, which still matches column headers like "NO"
or "KET" on their own line but no longer rejects "NOOR HIDAYAT" or
"KETUT WARDANA". Also fixes the same bug in _following_jabatan.
Added two regression tests covering both directions: names starting
with the offending tokens are kept, and bare column headers are still
rejected.
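For reference, a minimal sketch of the word-boundary check described above. Only the "NO" and "KET" tokens come from this commit; the helper name and the upper-casing step are assumptions made for illustration.

```python
import re

# Illustrative blocklist: only "NO" and "KET" are taken from the commit text.
_HEADER_TOKENS = ("NO", "KET")

# Anchored at the start of the line, with a trailing \b so that the token
# must end at a word boundary ("NO" matches, "NOOR" does not).
_HEADER_RE = re.compile(
    r"^(?:" + "|".join(re.escape(t) for t in _HEADER_TOKENS) + r")\b"
)


def is_column_header(line: str) -> bool:
    """Return True when the line starts with a blocklisted token as a whole word."""
    return _HEADER_RE.match(line.strip().upper()) is not None
```

With startswith, "KETUT WARDANA" would have been rejected because it begins with "KET"; with the trailing \b it is kept, because "T" is followed by another word character.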
Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
This fixes 4 bugs found on a real Polres Cimahi SPRIN PDF (items 1-4)
and adds two related hardening changes (items 5-6):
1. satuan_penerbit captured the generic 'KEPOLISIAN NEGARA REPUBLIK
INDONESIA' letterhead line instead of the most-specific issuing unit
(e.g. RESOR CIMAHI / SEKTOR PADALARANG). Reworked find_satuan to
scan for each level independently and return the deepest available.
2. find_dasar_list dropped numbered items when OCR put the marker on
its own line ("1.\n Undang-Undang ..."). Refactored into
_collect_numbered_section that buffers a bare-number line and uses
the next non-empty line as the body. Also reused for the new
find_untuk_list which extracts the previously-empty 'untuk' bullets.
3. find_perihal returned None for documents that use 'Pertimbangan'
(very common in Polres-level sprint), forcing the LLM to guess.
Added a regex fallback that picks up the first line under a
'Pertimbangan' label so we keep extraction deterministic.
4. Personnel rows were emitted with only nama populated when
PP-Structure detected a table but the column mapper degraded.
Added a text-based fallback (extract_personnel_from_text) that
scans raw OCR for <rank> + <8-digit NRP> patterns. Triggered when
the PP-Structure result has fewer than 30% rank/NRP-bearing rows.
The fallback path is surfaced for review by raising the new
PERSONNEL_TEXT_FALLBACK flag.
5. Validation now flags rows with neither pangkat nor nrp as
INCOMPLETE_PERSONNEL_ROW, so the document routes to needs_review
even when individual nrp/pangkat checks pass on empty values.
6. Added 'BRIGPOL' as a variant of BRIGADIR (seen in real scans).
Tests: 229 (was 203) — 26 new tests covering the regex fixes,
text-based personnel extractor, low-quality detector, validator
behaviour, and orchestrator wiring of the fallback path.
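A rough sketch of the bare-number buffering from item 2, assuming markers of the form "1." or "1)". The function name and the regexes are hypothetical and not the actual _collect_numbered_section code.

```python
import re
from typing import Iterable, List

# Hypothetical marker patterns: "1." / "1)" alone, or followed by text.
_BARE_NUMBER = re.compile(r"^\d+[.)]$")
_NUMBERED = re.compile(r"^\d+[.)]\s+(.+)$")


def collect_numbered_items(lines: Iterable[str]) -> List[str]:
    """Collect item bodies, tolerating OCR that puts the marker on its own line."""
    items: List[str] = []
    pending = False  # True after a bare marker, until a non-empty line arrives
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines never consume a pending marker
        if pending:
            items.append(line)  # the next non-empty line is the item body
            pending = False
            continue
        if _BARE_NUMBER.match(line):
            pending = True
            continue
        match = _NUMBERED.match(line)
        if match:
            items.append(match.group(1))
    return items
```

Under these assumptions the input "1." followed by " Undang-Undang ..." yields one item with the second line as its body, instead of dropping the item.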
Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
Devin Review (post-merge on PR #6) flagged that the `final_result`
assignment used a truthiness check (`if job_row.result`) while
`build_initial_result` used an identity check (`is None`). For a
job whose result is an empty dict (`{}`), the emitted
`GroundTruthSample` ended up with `initial_result={}` but
`final_result=None` — logically inconsistent.
Switch the final-result assignment to the same `is None` check so
both fields agree. Added `test_empty_dict_result_stays_consistent`
to lock the invariant in, and fixed the test helper so callers can
pass `{}` without the helper's `or` fallback replacing it.
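The difference between the two checks, as a small standalone sketch. The function names are hypothetical and stand in for the real assignment code.

```python
from typing import Optional


def final_result_truthy(result: Optional[dict]) -> Optional[dict]:
    """Old behaviour: an empty dict is falsy, so it collapses to None."""
    return dict(result) if result else None


def final_result_identity(result: Optional[dict]) -> Optional[dict]:
    """New behaviour: only a missing result (None) maps to None."""
    return dict(result) if result is not None else None
```

For `result={}` the first variant returns None while the second returns `{}`, which is what keeps `initial_result` and `final_result` in agreement.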
Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
Devin Review caught that `--out -` discarded the sample count, so
the stderr summary always said 'wrote 0 sample(s)' even when bytes
were streamed. Capture the return value like the file-output branch
does, and add a regression test that exercises the stdout path.
Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
Adds a small Ollama HTTP client (httpx-based, no extra runtime deps),
prompt builders, and a hybrid header extractor that runs *after* the
deterministic regex layer. The merger never overwrites a regex-filled
field — the LLM only fills gaps. If LLM_ENABLED=false (the default), or
the Ollama server is unreachable, the pipeline degrades gracefully:
- LLM_ENABLED=false -> no LLM call at all, no flag.
- LLM_ENABLED=true, header complete -> no LLM call.
- LLM_ENABLED=true, header has gaps, LLM responded ok -> merge +
  LLM_FALLBACK flag (review hint).
- LLM_ENABLED=true, header has gaps, LLM unavailable -> keep regex
  result + LLM_UNAVAILABLE flag.
The default model is qwen2.5:1.5b on http://localhost:11434, chosen for
its CPU latency (~5-15s per call) at acceptable accuracy. The LLM only
fills the *header* (nomor, tanggal, satuan, perihal, dasar). Personnel
rows stay with PP-Structure since that's more accurate and doesn't need
an LLM.
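The four cases above could be sketched roughly as follows. This is an illustration of the merge policy only: fill_header_gaps, the ask_llm callable, and the emptiness rule are assumptions, while the two flag names come from this commit.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Header = Dict[str, Any]


def _is_empty(value: Any) -> bool:
    # Assumption: None, "" and [] all count as "not extracted".
    return value is None or value == "" or value == []


def fill_header_gaps(
    header: Header,
    llm_enabled: bool,
    ask_llm: Callable[[List[str]], Optional[Header]],
) -> Tuple[Header, List[str]]:
    """Fill only the empty header fields; never overwrite a regex-filled one."""
    merged = dict(header)
    gaps = [key for key, value in merged.items() if _is_empty(value)]
    if not llm_enabled or not gaps:
        return merged, []  # no LLM call, no flag
    answer = ask_llm(gaps)  # hypothetical client; returns None when unavailable
    if answer is None:
        return merged, ["LLM_UNAVAILABLE"]
    for key in gaps:
        value = answer.get(key)
        if not _is_empty(value):
            merged[key] = value
    return merged, ["LLM_FALLBACK"]
```

Iterating over `gaps` rather than over the LLM answer is what guarantees that a field already filled by the regex layer is never touched.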
Tests:
- test_llm_client.py: httpx MockTransport-driven tests for the wire
format, error paths (HTTP 5xx, malformed JSON, missing envelope,
ConnectError), and request shape.
- test_llm_extractor.py: merge policy + None-on-unavailable behaviour.
- test_orchestrator_llm.py: end-to-end orchestrator wiring with stubs
for ingest/preprocess/OCR/table — verifies LLM is skipped when
disabled, skipped when header is complete, called and flagged when
gaps exist, and marked unavailable when the client returns None.
162 unit tests pass total (was 146).
Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
* Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics
Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
* Phase 4: fix sync-mode rollback orphaning blobs + use is_relative_to for path-escape check
Devin Review on PR #3 found two real bugs:
1. Sync path mark_failed was rolled back by the request-scoped session.
When the pipeline raised an exception in ?sync=true mode, _run_inline
modified the FastAPI session and re-raised; get_session caught the
exception, called session.rollback(), and wiped both the create() and
the mark_failed() writes. The blob was already on disk, so it was
permanently orphaned with no DB record. Fix: commit the pending row
immediately after create(), and run all subsequent state transitions in
independent session_scope blocks (matching the worker task pattern).
2. _resolve used str.startswith for path-escape detection, which lets a
sibling directory whose name begins with the storage root pass (e.g.
/app/blobs_evil vs /app/blobs). Switched to Path.is_relative_to.
Added regression tests for both.
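To illustrate bug 2, a small comparison of the two containment checks. PurePosixPath is used here only so the sketch behaves the same on any platform; the real _resolve presumably works on resolved Path objects, and both helper names are hypothetical.

```python
from pathlib import PurePosixPath


def is_inside_prefix(root: str, candidate: str) -> bool:
    """Old check: plain string prefix, which also accepts sibling directories."""
    return candidate.startswith(root)


def is_inside_relative(root: str, candidate: str) -> bool:
    """New check: compares whole path components (Python 3.9+)."""
    return PurePosixPath(candidate).is_relative_to(PurePosixPath(root))
```

For root /app/blobs, the candidate /app/blobs_evil/x passes the prefix check but fails is_relative_to, because "blobs_evil" is a different path component from "blobs".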
Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
* Phase 4: honor queue_enabled setting + resolve base_dir for path comparisons
Two more bugs found by Devin Review:
3. queue_enabled was declared in config and documented in .env.example but
never read by the route. A fresh dev install with QUEUE_ENABLED=false
(the default) would still enqueue, then fail with a Redis connection
error. Fixed by making the ?sync= query param default to None and
resolving to (not queue_enabled) inside the route. Tests now set
QUEUE_ENABLED=true so the async flow stays exercised, and a new test
verifies the inline fallback when the queue is disabled.
4. LocalFsBlobStorage stored base_dir as-is. _resolve resolved its
candidate paths, so the empty-dir cleanup loop in delete() compared a
resolved candidate against an unresolved base_dir and broke on the
first iteration (no cleanup ever happened). Fixed by resolving base_dir
once in __init__ so every path comparison is apples-to-apples.
Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
* Phase 4: derive ocr_jobs_total from DB so worker writes are visible at /metrics
Devin Review correctly noted the Counter-based JOBS_TOTAL would never
increment in production because the worker runs in a separate process from
the API and the registry is process-local. Replaced JOBS_TOTAL with a
custom Collector that issues SELECT status, COUNT(*) FROM jobs GROUP BY
status on every /metrics scrape. Result: the metric stays accurate
regardless of which process wrote the row.
Also corrected the metrics.py docstring (the old comment claimed the
counter was 'incremented by the worker', which was the bug).
Removed the JOBS_TOTAL.inc() calls from the sync route — the DB collector
covers both paths now. JOB_PROCESSING_SECONDS stays as an API-process
histogram with an updated docstring noting its scope; cross-process
latency belongs to derived dashboards over jobs.created_at/updated_at.
Added regression test test_metrics_jobs_total_reflects_worker_writes.
Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
* Phase 3: PP-Structure table extraction + personnel column mapper
Adds the personnel-table stage of the pipeline. PaddleOCR's PP-Structure
recognizes table regions and emits HTML, which we parse into a 2D cell
grid. A separate column mapper detects the header row, classifies each
column to a canonical PersonnelEntry field via a synonym dictionary,
and walks the data rows.
Variant handling:
- Different satuan use different column orders and header phrasing.
Supported synonyms for each canonical field are listed in
pipeline/extract/personnel.py (Pangkat / NRP / Pangkat-NRP combo /
Nama / Jabatan dalam Dinas / Jabatan dalam Sprint / Keterangan).
- A merged 'PANGKAT NRP' or 'PANGKAT NRP NAMA' cell is split using
the 8-digit NRP regex (with look-arounds so glued forms like
'BRIPKA98050505' work) and the master pangkat lookup.
- Unknown ranks are kept verbatim so the validation layer can flag
them as UNKNOWN_PANGKAT for HITL review.
- Rows without nrp AND nama are dropped (separators / merged cells).
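A minimal sketch of the 8-digit NRP split with look-arounds, assuming the rank always precedes the NRP in the cell. The helper name is hypothetical. A plain \b would not work for the glued form, since there is no word boundary between a letter and a digit.

```python
import re
from typing import Optional, Tuple

# Exactly eight digits, not preceded or followed by another digit.
_NRP_RE = re.compile(r"(?<!\d)\d{8}(?!\d)")


def split_pangkat_nrp(cell: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a merged 'PANGKAT NRP' cell into (pangkat_text, nrp)."""
    match = _NRP_RE.search(cell)
    if match is None:
        return cell.strip() or None, None
    pangkat = cell[: match.start()].strip() or None
    return pangkat, match.group(0)
```

The look-arounds also reject longer digit runs, so a 9-digit number is not silently truncated to its first eight digits.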
New module pipeline/table.py:
- DetectedTable dataclass (cells + html).
- parse_table_html: tag/entity-tolerant HTML -> 2D grid.
- extract_tables_from_pp_result: filter PP-Structure regions to type=table.
- run_table_extraction: top-level entrypoint with lazy-init singleton
for the heavy PP-Structure engine.
Orchestrator now invokes table extraction (gated by TABLES_ENABLED) on
every preprocessed page and merges the discovered personnel into the
ExtractionResult. Failures are caught and logged so a flaky table
recognizer never blocks header extraction.
Tests: 38 new unit tests covering HTML parsing, region filtering,
header classification, column mapping (split, combined, glued cells),
and end-to-end personnel extraction. Total 108 tests, all green.
PaddleOCR / PP-Structure remain optional - no test imports them.
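The HTML-to-grid step can be approximated with the standard library alone. The sketch below is not the parse_table_html from this commit; html_to_grid and _GridParser are hypothetical names, and rowspan/colspan and nested tables are deliberately ignored.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _GridParser(HTMLParser):
    """Collect <td>/<th> text into rows; unknown tags are simply skipped."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decode entities like &nbsp;
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace (including non-breaking spaces).
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_to_grid(html: str) -> List[List[str]]:
    parser = _GridParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Inline markup inside a cell (for example `<b>`) is tolerated because only text data is collected while a cell is open.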
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
* Phase 3: fix header misclassification for combined Pangkat/NRP/Nama columns
Devin Review caught two related bugs in personnel column mapping:
1. _classify_header_cell iterated _HEADER_SYNONYMS in insertion order
when falling back to substring matching. The dict listed shorter
keywords first ('pangkat' before 'pangkat / nrp'), so a header like
'Pangkat / NRP / Nama' classified as plain 'pangkat'. map_row then
tried to normalize the whole 'AKP 87010101 Budi Santoso' cell
as a rank, normalize_pangkat returned None, and the row failed the
nrp-or-nama gate at the bottom of map_row -- silently dropping
every personnel row in tables using this layout.
2. _split_pangkat_nrp_nama existed and was unit-tested but was never
wired up in map_row, so even if classification had worked, the
three-way split would not have run. The module docstring claimed
the split was supported.
Fix:
- Iterate the synonym table sorted by keyword length descending in the
substring-match fallback so the most specific synonym wins.
- Add 'pangkat_nrp_nama' synonym entries for typical separators
(' / ', '/', whitespace, comma).
- Wire 'pangkat_nrp_nama' into map_row using the existing helper.
- Update is_personnel_table so combined headers count as both an id
signal and a name signal.
Tests: 6 new asserts (parametrized), 1 regression test for triple-
combined header end-to-end, 1 dedicated map_row test for the new
column type. 114 tests total, all green.
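A reduced illustration of the ordering bug and the fix. The synonym table below is a made-up five-entry subset, and both function names are hypothetical.

```python
from typing import Optional

# Illustrative subset; insertion order deliberately lists short keywords first.
_HEADER_SYNONYMS = {
    "pangkat": "pangkat",
    "nrp": "nrp",
    "nama": "nama",
    "pangkat / nrp": "pangkat_nrp",
    "pangkat / nrp / nama": "pangkat_nrp_nama",
}


def _normalize(header: str) -> str:
    return " ".join(header.lower().split())


def classify_insertion_order(header: str) -> Optional[str]:
    """Old fallback: the first substring hit wins, so 'pangkat' shadows the rest."""
    text = _normalize(header)
    for keyword, field in _HEADER_SYNONYMS.items():
        if keyword in text:
            return field
    return None


def classify_longest_first(header: str) -> Optional[str]:
    """New fallback: try the longest (most specific) keyword first."""
    text = _normalize(header)
    for keyword in sorted(_HEADER_SYNONYMS, key=len, reverse=True):
        if keyword in text:
            return _HEADER_SYNONYMS[keyword]
    return None
```

With this table, 'Pangkat / NRP / Nama' classifies as 'pangkat' under insertion order and as 'pangkat_nrp_nama' when sorted by keyword length.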
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
* Phase 3: handle multi-word Polri ranks in _split_pangkat_nrp_nama
Devin Review caught: token-by-token is_valid_pangkat() check could not
recognize multi-word ranks ('KOMBES POL', 'BRIGJEN POL', 'IRJEN POL',
'KOMJEN POL', 'JENDERAL POL'). For 'KOMBES POL 88123456 John Doe' the
old code returned pangkat=None, nama='KOMBES POL John Doe', and the
validator's UNKNOWN_PANGKAT flag never fired because pangkat was falsy.
New behavior: greedy longest-prefix match. After stripping the NRP we
try the leading 3-token, 2-token, 1-token slice against
normalize_pangkat() and take the longest that maps to a canonical
rank. Tokens after the matched rank become the nama. Unknown ranks
fall through to pangkat=None and the rank text stays in the nama
field, where downstream validation already flags the row.
Tests: 5 new asserts (4 multi-word ranks + 1 unknown-rank fallback),
119 total green.
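The greedy longest-prefix match, sketched under simplifying assumptions: a small hard-coded rank set stands in for normalize_pangkat(), and the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

_NRP_RE = re.compile(r"(?<!\d)\d{8}(?!\d)")

# Illustrative subset of ranks; the real lookup is normalize_pangkat().
_KNOWN_RANKS = {"AKP", "BRIPKA", "KOMBES POL", "BRIGJEN POL", "IRJEN POL"}


def split_rank_nrp_name(
    cell: str,
) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (pangkat, nrp, nama) from a combined cell."""
    match = _NRP_RE.search(cell)
    nrp = match.group(0) if match else None
    # Strip the NRP, then work on the remaining tokens.
    rest = f"{cell[:match.start()]} {cell[match.end():]}" if match else cell
    tokens = rest.split()
    # Try the 3-token, then 2-token, then 1-token prefix; longest match wins.
    for size in (3, 2, 1):
        if len(tokens) < size:
            continue
        candidate = " ".join(tokens[:size]).upper()
        if candidate in _KNOWN_RANKS:
            return candidate, nrp, " ".join(tokens[size:]) or None
    # Unknown rank: leave the rank text in nama for downstream validation.
    return None, nrp, " ".join(tokens) or None
```

Trying the longer slices first is what lets 'KOMBES POL' match as one rank instead of failing on 'KOMBES' alone.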
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
* Phase 3: don't count pangkat_nrp as a name signal in is_personnel_table
Devin Review caught: a table with header ['No', 'Pangkat / NRP',
'Jabatan'] (no name column) was wrongly classified as a personnel
table because pangkat_nrp was lumped into has_name. Such a table
would produce PersonnelEntry rows with nama=None passing the nrp-or-
nama gate, polluting the personel[] output with id-only fragments.
Split the combined-cell set into combined_id (counts toward has_id)
and combined_name (counts toward has_name). Only pangkat_nrp_nama,
which actually embeds a name, qualifies for has_name. pangkat_nrp
remains an id-only signal.
Tests: 3 new asserts (rejects id-only, accepts pangkat_nrp + separate
nama, accepts pangkat_nrp_nama). 122 total green.
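The split into combined_id and combined_name could look roughly like this. The set contents other than pangkat_nrp and pangkat_nrp_nama are assumptions, and the function takes already-classified column fields rather than raw header text.

```python
from typing import Iterable, Optional

# Assumed plain signals; only the combined sets are described in this commit.
_ID_FIELDS = {"pangkat", "nrp"}
_NAME_FIELDS = {"nama"}
_COMBINED_ID = {"pangkat_nrp", "pangkat_nrp_nama"}
_COMBINED_NAME = {"pangkat_nrp_nama"}  # only this one actually embeds a name


def is_personnel_table(fields: Iterable[Optional[str]]) -> bool:
    """A personnel table needs both an id signal and a name signal."""
    present = {field for field in fields if field}
    has_id = bool(present & (_ID_FIELDS | _COMBINED_ID))
    has_name = bool(present & (_NAME_FIELDS | _COMBINED_NAME))
    return has_id and has_name
```

Under these assumptions a header of ['No', 'Pangkat / NRP', 'Jabatan'] is rejected, while adding a separate 'Nama' column or using the three-way combined header is accepted.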
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
Adds OpenCV-based phone-photo handling that runs before the standard
preprocessing pipeline for IMAGE source kinds (PDF renders are flat by
construction and skip this stage).
Pipeline additions in src/ocr_sprint/pipeline/document_detect.py:
- _find_document_quad: Canny + dilate + contour search, picks the
largest convex 4-point polygon above a configurable area threshold;
fails gracefully and returns None when no usable quad is found.
- _four_point_warp: orders corners (TL/TR/BR/BL via sum/diff trick)
and runs cv2.getPerspectiveTransform + warpPerspective.
- _remove_shadow: per-channel background-division (dilate + median
blur + 255 - absdiff + normalize) for uneven phone-shot lighting.
- detect_and_correct: top-level entrypoint with graceful fallback
to the original image when detection fails.
Wired into the synchronous orchestrator: only enabled for IMAGE
sources, skipped for PDF. New settings:
- preprocess_detect_document (default: true)
- preprocess_remove_shadow (default: true)
- preprocess_min_quad_area_fraction (default: 0.20)
Tests: 9 new unit tests covering corner ordering, quad detection on
synthetic skewed documents, perspective warp output sanity, shadow
removal shape preservation, full-pipeline behavior, and graceful
fallback when detection fails. 70 tests total, all green.
ML-based dewarping (DewarpNet) and DocTR detector are deferred to a
future phase per the roadmap; the existing API is structured so they
can be added as alternative backends behind DocumentDetectConfig.
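The sum/diff corner-ordering trick mentioned for _four_point_warp, shown in plain Python so it runs without OpenCV or NumPy. The function name is hypothetical and the sketch assumes image coordinates (y grows downward) and a roughly upright quad.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def order_corners(points: Sequence[Sequence[float]]) -> List[Point]:
    """Order four (x, y) corners as TL, TR, BR, BL."""
    pts = [(float(x), float(y)) for x, y in points]
    top_left = min(pts, key=lambda p: p[0] + p[1])      # smallest x + y
    bottom_right = max(pts, key=lambda p: p[0] + p[1])  # largest x + y
    top_right = min(pts, key=lambda p: p[1] - p[0])     # smallest y - x
    bottom_left = max(pts, key=lambda p: p[1] - p[0])   # largest y - x
    return [top_left, top_right, bottom_right, bottom_left]
```

The ordered corners are what cv2.getPerspectiveTransform expects to pair with the destination rectangle; for strongly rotated quads the sum/diff heuristic can pick the same point twice, which is one reason a fallback to the original image is useful.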
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>