OCR-SPRIN-SERVICE

Author	SHA1	Message	Date
Devin AI	737f4999dd	Use word-boundary matching for personnel name blocklist Devin Review correctly flagged that the bare "NO" and "KET" entries in the blocklist would silently drop common Indonesian names (KETUT, NOVA, NOOR, NORMAN, NOVIANTI, ...) because the check used startswith rather than a word boundary. Replaced the per-prefix loop with a single compiled regex anchored at ^ with a trailing \b, which still matches column headers like "NO" or "KET" on their own line but no longer rejects "NOOR HIDAYAT" or "KETUT WARDANA". Also fixes the same bug in _following_jabatan. Added two regression tests covering both directions: names starting with the offending tokens are kept, bare column headers still rejected. Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>	2026-04-26 05:46:21 +00:00
Devin AI	58a2bf2648	Fix personnel extraction + header bugs on real Polres Cimahi sprint This fixes 4 bugs found on a real Polres Cimahi SPRIN PDF: 1. satuan_penerbit captured the generic 'KEPOLISIAN NEGARA REPUBLIK INDONESIA' letterhead line instead of the most-specific issuing unit (e.g. RESOR CIMAHI / SEKTOR PADALARANG). Reworked find_satuan to scan for each level independently and return the deepest available. 2. find_dasar_list dropped numbered items when OCR put the marker on its own line ("1.\n Undang-Undang ..."). Refactored into _collect_numbered_section that buffers a bare-number line and uses the next non-empty line as the body. Also reused for the new find_untuk_list which extracts the previously-empty 'untuk' bullets. 3. find_perihal returned None for documents that use 'Pertimbangan' (very common in Polres-level sprint), forcing the LLM to guess. Added a regex fallback that picks up the first line under a 'Pertimbangan' label so we keep extraction deterministic. 4. Personnel rows were emitted with only nama populated when PP-Structure detected a table but the column mapper degraded. Added a text-based fallback (extract_personnel_from_text) that scans raw OCR for <rank> + <8-digit NRP> patterns. Triggered when the PP-Structure result has fewer than 30% rank/NRP-bearing rows. Reviewed by raising the new PERSONNEL_TEXT_FALLBACK flag. 5. Validation now flags rows with neither pangkat nor nrp as INCOMPLETE_PERSONNEL_ROW, so the document routes to needs_review even when individual nrp/pangkat checks pass on empty values. 6. Added 'BRIGPOL' as a variant of BRIGADIR (seen in real scans). Tests: 229 (was 203) — 26 new tests covering the regex fixes, text-based personnel extractor, low-quality detector, validator behaviour, and orchestrator wiring of the fallback path. Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>	2026-04-26 05:35:42 +00:00
devin-ai-integration[bot]	33b38aacc7	Phase 3: PP-Structure table extraction + personnel column mapper (#2 ) * Phase 3: PP-Structure table extraction + personnel column mapper Adds the personnel-table stage of the pipeline. PaddleOCR's PP-Structure recognizes table regions and emits HTML, which we parse into a 2D cell grid. A separate column mapper detects the header row, classifies each column to a canonical PersonnelEntry field via a synonym dictionary, and walks the data rows. Variant handling: - Different satuan use different column orders and header phrasing. Supported synonyms for each canonical field are listed in pipeline/extract/personnel.py (Pangkat / NRP / Pangkat-NRP combo / Nama / Jabatan dalam Dinas / Jabatan dalam Sprint / Keterangan). - A merged 'PANGKAT NRP' or 'PANGKAT NRP NAMA' cell is split using the 8-digit NRP regex (with look-arounds so glued forms like 'BRIPKA98050505' work) and the master pangkat lookup. - Unknown ranks are kept verbatim so the validation layer can flag them as UNKNOWN_PANGKAT for HITL review. - Rows without nrp AND nama are dropped (separators / merged cells). New module pipeline/table.py: - DetectedTable dataclass (cells + html). - parse_table_html: tag/entity-tolerant HTML -> 2D grid. - extract_tables_from_pp_result: filter PP-Structure regions to type=table. - run_table_extraction: top-level entrypoint with lazy-init singleton for the heavy PP-Structure engine. Orchestrator now invokes table extraction (gated by TABLES_ENABLED) on every preprocessed page and merges the discovered personnel into the ExtractionResult. Failures are caught and logged so a flaky table recognizer never blocks header extraction. Tests: 38 new unit tests covering HTML parsing, region filtering, header classification, column mapping (split, combined, glued cells), and end-to-end personnel extraction. Total 108 tests, all green. PaddleOCR / PP-Structure remain optional - no test imports them. Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com> * Phase 3: fix header misclassification for combined Pangkat/NRP/Nama columns Devin Review caught two related bugs in personnel column mapping: 1. _classify_header_cell iterated _HEADER_SYNONYMS in insertion order when falling back to substring matching. The dict listed shorter keywords first ('pangkat' before 'pangkat / nrp'), so a header like 'Pangkat / NRP / Nama' classified as plain 'pangkat'. map_row then tried to normalize the whole '"AKP 87010101 Budi Santoso"' cell as a rank, normalize_pangkat returned None, and the row failed the nrp-or-nama gate at the bottom of map_row -- silently dropping every personnel row in tables using this layout. 2. _split_pangkat_nrp_nama existed and was unit-tested but was never wired up in map_row, so even if classification had worked, the three-way split would not have run. The module docstring claimed the split was supported. Fix: - Iterate the synonym table sorted by keyword length descending in the substring-match fallback so the most specific synonym wins. - Add 'pangkat_nrp_nama' synonym entries for typical separators (' / ', '/', whitespace, comma). - Wire 'pangkat_nrp_nama' into map_row using the existing helper. - Update is_personnel_table so combined headers count as both an id signal and a name signal. Tests: 6 new asserts (parametrized), 1 regression test for triple- combined header end-to-end, 1 dedicated map_row test for the new column type. 114 tests total, all green. Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com> * Phase 3: handle multi-word Polri ranks in _split_pangkat_nrp_nama Devin Review caught: token-by-token is_valid_pangkat() check could not recognize multi-word ranks ('KOMBES POL', 'BRIGJEN POL', 'IRJEN POL', 'KOMJEN POL', 'JENDERAL POL'). For 'KOMBES POL 88123456 John Doe' the old code returned pangkat=None, nama='KOMBES POL John Doe', and the validator's UNKNOWN_PANGKAT flag never fired because pangkat was falsy. New behavior: greedy longest-prefix match. After stripping the NRP we try the leading 3-token, 2-token, 1-token slice against normalize_pangkat() and take the longest that maps to a canonical rank. Tokens after the matched rank become the nama. Unknown ranks fall through to pangkat=None and the rank text stays in the nama field, where downstream validation already flags the row. Tests: 5 new asserts (4 multi-word ranks + 1 unknown-rank fallback), 119 total green. Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com> * Phase 3: don't count pangkat_nrp as a name signal in is_personnel_table Devin Review caught: a table with header ['No', 'Pangkat / NRP', 'Jabatan'] (no name column) was wrongly classified as a personnel table because pangkat_nrp was lumped into has_name. Such a table would produce PersonnelEntry rows with nama=None passing the nrp-or- nama gate, polluting the personel[] output with id-only fragments. Split the combined-cell set into combined_id (counts toward has_id) and combined_name (counts toward has_name). Only pangkat_nrp_nama, which actually embeds a name, qualifies for has_name. pangkat_nrp remains an id-only signal. Tests: 3 new asserts (rejects id-only, accepts pangkat_nrp + separate nama, accepts pangkat_nrp_nama). 122 total green. Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com> --------- Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>	2026-04-25 16:10:48 +00:00
Devin AI	ca0c0a0428	Phase 1 MVP: synchronous OCR + regex header extraction Implements the foundation of the OCR Sprint service: - FastAPI app with /api/v1/health and /api/v1/documents (sync upload) - Pydantic v2 schemas for documents, extraction result, personnel - Pipeline: PDF/image ingest (PyMuPDF), preprocessing (resize, deskew, denoise, optional adaptive threshold), PaddleOCR wrapper, regex-based header extraction (nomor sprint, tanggal, satuan, perihal, dasar), signatory NRP, master-pangkat validation, confidence scoring + routing. - Tests: 61 unit tests covering regex rules, validators, preprocess, ingest, confidence, and API contract (PaddleOCR mocked). - Tooling: pyproject (setuptools), ruff, mypy strict, pytest, pre-commit, Dockerfile, docker-compose, Makefile. - Docs: README + docs/architecture.md (full hybrid stack rationale and 6-phase roadmap). Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>	2026-04-25 14:58:50 +00:00

4 Commits