Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)

* Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics

  Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 4: fix sync-mode rollback orphaning blobs + use is_relative_to for path-escape check

  Devin Review on PR #3 found two real bugs:

  1. Sync path mark_failed was rolled back by the request-scoped session.
     When the pipeline raised an exception in ?sync=true mode, _run_inline
     modified the FastAPI session and re-raised; get_session caught the
     exception, called session.rollback(), and wiped both the create() and
     the mark_failed() writes. The blob was already on disk, so it was
     permanently orphaned with no DB record. Fix: commit the pending row
     immediately after create(), and run all subsequent state transitions
     in independent session_scope blocks (matching the worker task pattern).

  2. _resolve used str.startswith for path-escape detection, which lets a
     sibling directory whose name begins with the storage root pass
     (e.g. /app/blobs_evil vs /app/blobs). Switched to Path.is_relative_to.

  Added regression tests for both.

  Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 4: honor queue_enabled setting + resolve base_dir for path comparisons

  Two more bugs found by Devin Review:

  3. queue_enabled was declared in config and documented in .env.example
     but never read by the route. A fresh dev install with
     QUEUE_ENABLED=false (the default) would still enqueue, then fail with
     a Redis connection error. Fixed by making the ?sync= query param
     default to None and resolving to (not queue_enabled) inside the route.
     Tests now set QUEUE_ENABLED=true so the async flow stays exercised,
     and a new test verifies the inline fallback when the queue is disabled.

  4. LocalFsBlobStorage stored base_dir as-is. _resolve resolved its
     candidate paths, so the empty-dir cleanup loop in delete() compared a
     resolved candidate against an unresolved base_dir and broke on the
     first iteration (no cleanup ever happened). Fixed by resolving
     base_dir once in __init__ so every path comparison is
     apples-to-apples.

  Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 4: derive ocr_jobs_total from DB so worker writes are visible at /metrics

  Devin Review correctly noted the Counter-based JOBS_TOTAL would never
  increment in production because the worker runs in a separate process
  from the API and the registry is process-local.

  Replaced JOBS_TOTAL with a custom Collector that issues
  SELECT status, COUNT(*) FROM jobs GROUP BY status on every /metrics
  scrape. Result: the metric stays accurate regardless of which process
  wrote the row.

  Also corrected the metrics.py docstring (the old comment claimed the
  counter was 'incremented by the worker', which was the bug). Removed the
  JOBS_TOTAL.inc() calls from the sync route; the DB collector covers both
  paths now. JOB_PROCESSING_SECONDS stays as an API-process histogram with
  an updated docstring noting its scope; cross-process latency belongs to
  derived dashboards over jobs.created_at/updated_at.

  Added regression test test_metrics_jobs_total_reflects_worker_writes.

  Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
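
Item 3 of the commit message describes the `?sync=` query parameter defaulting to None and resolving to `not queue_enabled` inside the route. The route code itself is not part of this excerpt, so the following is only a minimal sketch of that resolution rule. The helper name `resolve_sync` is hypothetical, and the assumption that an explicit `?sync=` value takes precedence over the setting is mine.

```python
from __future__ import annotations


def resolve_sync(sync: bool | None, queue_enabled: bool) -> bool:
    """Return True when the request should run the pipeline inline.

    Hypothetical helper for illustration only. Assumption: an explicit
    ?sync= value wins; when the parameter is omitted (None), the request
    runs inline exactly when the queue is disabled.
    """
    if sync is not None:
        return sync
    return not queue_enabled
```

With this rule, a default install (queue disabled, no `?sync=` given) never touches the broker, while the test suite, which sets QUEUE_ENABLED=true, still exercises the 202 path.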
2026-04-25 16:50:51 +00:00
parent 33b38aacc7
commit 2112023b6e
31 changed files with 1646 additions and 105 deletions
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -2,10 +2,54 @@

 from __future__ import annotations

+import os
+from collections.abc import Iterator
+from pathlib import Path
+
 import numpy as np
 import pytest


+@pytest.fixture(autouse=True)
+def _isolated_runtime(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> Iterator[None]:
+    """Per-test sqlite + blob storage so tests don't share state.
+
+    Setting these env vars before ``Settings`` is first read in the test gives
+    each test its own DB file and blob root. We also clear the lru_cache on
+    `get_settings`, the engine, and the sessionmaker so the fresh paths take
+    effect even if a previous test already loaded settings.
+    """
+    db_path = tmp_path / "test.sqlite"
+    blob_dir = tmp_path / "blobs"
+    monkeypatch.setenv("DATABASE_URL", f"sqlite:///{db_path}")
+    monkeypatch.setenv("BLOB_STORAGE_DIR", str(blob_dir))
+    monkeypatch.setenv("STORAGE_LOCAL_DIR", str(tmp_path / "storage"))
+    monkeypatch.setenv("API_KEYS", "")
+    # The async API path is exercised by the test suite, so default it on
+    # here. The shipped default is ``QUEUE_ENABLED=false``, so the route falls
+    # back to the inline pipeline when no Redis is configured.
+    monkeypatch.setenv("QUEUE_ENABLED", "true")
+    # Force Celery to run tasks inline so we don't need a broker.
+    monkeypatch.setenv("CELERY_TASK_ALWAYS_EAGER", "true")
+
+    from ocr_sprint.config import get_settings
+    from ocr_sprint.db.base import reset_engine_cache
+    from ocr_sprint.worker.celery_app import celery_app
+
+    get_settings.cache_clear()
+    reset_engine_cache()
+    # `celery_app` is built once at import-time, so flip the eager flag on the
+    # already-instantiated instance for this test.
+    celery_app.conf.task_always_eager = True
+    celery_app.conf.task_eager_propagates = True
+
+    yield
+
+    get_settings.cache_clear()
+    reset_engine_cache()
+    os.environ.pop("CELERY_TASK_ALWAYS_EAGER", None)
+
+
 @pytest.fixture
 def blank_bgr_image() -> np.ndarray:
    """A 600x800 white BGR image (uint8) — useful for preprocessing smoke tests."""
--- a/tests/unit/test_api.py
+++ b/tests/unit/test_api.py
@@ -23,35 +23,9 @@ def client() -> TestClient:
    return TestClient(create_app())


-def test_health_endpoint(client: TestClient) -> None:
-    response = client.get("/api/v1/health")
-    assert response.status_code == 200
-    assert response.json()["status"] == "ok"
-
-
-def test_documents_rejects_empty_upload(client: TestClient) -> None:
-    response = client.post(
-        "/api/v1/documents",
-        files={"file": ("empty.pdf", b"", "application/pdf")},
-    )
-    assert response.status_code == 400
-
-
-def test_documents_rejects_unknown_format(
-    client: TestClient,
-    monkeypatch: pytest.MonkeyPatch,
-) -> None:
-    response = client.post(
-        "/api/v1/documents",
-        files={"file": ("x.bin", b"random garbage bytes here", "application/octet-stream")},
-    )
-    assert response.status_code == 400
-
-
-def test_documents_returns_pipeline_output(
-    client: TestClient,
-    monkeypatch: pytest.MonkeyPatch,
-) -> None:
+@pytest.fixture
+def fake_pipeline(monkeypatch: pytest.MonkeyPatch) -> PipelineOutput:
+    """Patch run_pipeline everywhere it's referenced."""
    fake_result = ExtractionResult(
        header=HeaderFields(
            nomor_sprint="Sprin/1/I/2025",
@@ -70,14 +44,36 @@ def test_documents_returns_pipeline_output(
    def _fake_run(_content: bytes) -> PipelineOutput:
        return fake_output

-    # Patch the symbol *imported into* the routes module.
    monkeypatch.setattr(orch_module, "run_pipeline", _fake_run)
    from ocr_sprint.api.routes import documents as docs_module

    monkeypatch.setattr(docs_module, "run_pipeline", _fake_run)
+    from ocr_sprint.worker import tasks as tasks_module

+    monkeypatch.setattr(tasks_module, "run_pipeline", _fake_run)
+    return fake_output
+
+
+def test_health_endpoint(client: TestClient) -> None:
+    response = client.get("/api/v1/health")
+    assert response.status_code == 200
+    assert response.json()["status"] == "ok"
+
+
+def test_documents_rejects_empty_upload(client: TestClient) -> None:
    response = client.post(
        "/api/v1/documents",
+        files={"file": ("empty.pdf", b"", "application/pdf")},
+    )
+    assert response.status_code == 400
+
+
+def test_documents_sync_returns_pipeline_output(
+    client: TestClient,
+    fake_pipeline: PipelineOutput,
+) -> None:
+    response = client.post(
+        "/api/v1/documents?sync=true",
        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
    )
    assert response.status_code == 200
@@ -85,3 +81,169 @@ def test_documents_returns_pipeline_output(
    assert body["status"] == "completed"
    assert body["confidence"] == 0.97
    assert body["data"]["header"]["nomor_sprint"] == "Sprin/1/I/2025"
+
+
+def test_documents_async_returns_202_then_polls_to_completion(
+    client: TestClient,
+    fake_pipeline: PipelineOutput,
+) -> None:
+    """Default flow: POST returns 202, GET returns the eventual completion.
+
+    With CELERY_TASK_ALWAYS_EAGER set in conftest, the worker runs inline,
+    so by the time POST returns the task has already finished and GET sees
+    a `completed` row.
+    """
+    post = client.post(
+        "/api/v1/documents",
+        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
+    )
+    assert post.status_code == 202
+    job_id = post.json()["job_id"]
+
+    get = client.get(f"/api/v1/documents/{job_id}")
+    assert get.status_code == 200
+    body = get.json()
+    assert body["status"] == "completed"
+    assert body["confidence"] == 0.97
+
+
+def test_documents_defaults_to_sync_when_queue_disabled(
+    client: TestClient,
+    fake_pipeline: PipelineOutput,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """Regression: with ``QUEUE_ENABLED=false`` the route must NOT enqueue,
+    otherwise a default install with no Redis returns 500.
+    """
+    monkeypatch.setenv("QUEUE_ENABLED", "false")
+    from ocr_sprint.config import get_settings
+
+    get_settings.cache_clear()
+
+    # Pretend the broker is unreachable; if the route still enqueues, the
+    # call would blow up here.
+    def _no_broker(_self: object, *_args: object, **_kwargs: object) -> None:
+        raise AssertionError("queue path taken when queue is disabled")
+
+    from ocr_sprint.worker import tasks as task_module
+
+    monkeypatch.setattr(task_module.process_document_task, "delay", _no_broker)
+
+    post = client.post(
+        "/api/v1/documents",
+        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
+    )
+    assert post.status_code == 200, post.text
+    body = post.json()
+    assert body["status"] == "completed"
+
+
+def test_documents_get_unknown_id_returns_404(client: TestClient) -> None:
+    response = client.get("/api/v1/documents/00000000-0000-0000-0000-000000000000")
+    assert response.status_code == 404
+
+
+def test_documents_async_marks_failed_on_pipeline_error(
+    client: TestClient,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    def _explode(_content: bytes) -> PipelineOutput:
+        raise RuntimeError("boom")
+
+    from ocr_sprint.worker import tasks as tasks_module
+
+    monkeypatch.setattr(tasks_module, "run_pipeline", _explode)
+
+    post = client.post(
+        "/api/v1/documents",
+        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
+    )
+    assert post.status_code == 202
+    job_id = post.json()["job_id"]
+
+    get = client.get(f"/api/v1/documents/{job_id}")
+    body = get.json()
+    assert body["status"] == "failed"
+    assert "boom" in (body.get("error") or "")
+
+
+def test_documents_sync_persists_failed_row_when_pipeline_raises(
+    client: TestClient,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """Regression: an exception in the sync pipeline must NOT roll back the
+    pending row + ``mark_failed`` write. Otherwise the blob on disk has no
+    DB record pointing at it.
+    """
+
+    def _explode(_content: bytes) -> PipelineOutput:
+        raise RuntimeError("kapow")
+
+    from ocr_sprint.api.routes import documents as docs_module
+
+    monkeypatch.setattr(docs_module, "run_pipeline", _explode)
+
+    # ``raise_server_exceptions=False`` lets the test see the 500 response
+    # rather than re-raising the underlying RuntimeError from the route.
+    silent = TestClient(client.app, raise_server_exceptions=False)
+    post = silent.post(
+        "/api/v1/documents?sync=true",
+        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
+    )
+    assert post.status_code == 500
+
+    # The row must still be visible to GET, with status=failed.
+    from ocr_sprint.db.base import session_scope
+    from ocr_sprint.db.repositories import JobRepository
+
+    with session_scope() as session:
+        # Find the most recent row.
+        from ocr_sprint.db.models import JobRow
+
+        row = session.query(JobRow).order_by(JobRow.created_at.desc()).first()
+        assert row is not None, "create() must persist even when pipeline blows up"
+        assert row.status == "failed"
+        assert "kapow" in (row.error or "")
+        assert row.blob_key  # blob is referenced — not orphaned
+
+    # GET must surface the failure too (this is the client-visible contract).
+    get = client.get(f"/api/v1/documents/{row.job_id}")
+    assert get.status_code == 200
+    assert get.json()["status"] == "failed"
+    assert JobRepository  # silence import-only warning
+
+
+def test_metrics_endpoint_exposes_request_counter(
+    client: TestClient,
+    fake_pipeline: PipelineOutput,
+) -> None:
+    client.post(
+        "/api/v1/documents?sync=true",
+        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
+    )
+    metrics = client.get("/metrics")
+    assert metrics.status_code == 200
+    body = metrics.text
+    assert "http_requests_total" in body
+    assert "ocr_jobs_total" in body
+
+
+def test_metrics_jobs_total_reflects_worker_writes(
+    client: TestClient,
+    fake_pipeline: PipelineOutput,
+) -> None:
+    """Regression: when the worker (eager mode here) marks a job complete,
+    /metrics must reflect that — the previous Counter-based implementation
+    would have stayed at zero because the worker's increments don't reach
+    the API process's in-memory registry.
+    """
+    post = client.post(
+        "/api/v1/documents",
+        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
+    )
+    assert post.status_code == 202
+
+    body = client.get("/metrics").text
+    # ``ocr_jobs_total{status="completed"} 1.0`` — exact match to make sure
+    # the gauge-style metric is being populated from the DB.
+    assert 'ocr_jobs_total{status="completed"} 1.0' in body
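
The regression test above relies on `ocr_jobs_total` being computed from the jobs table at scrape time rather than from a process-local counter. The commit's real implementation is a custom Collector that is not shown in this excerpt; the sketch below only illustrates the idea with the standard library, rendering the exposition lines by hand. The function name `render_jobs_total`, the two-column table layout, and the `ORDER BY` (added for stable output) are assumptions.

```python
import sqlite3


def render_jobs_total(conn: sqlite3.Connection) -> str:
    """Build ocr_jobs_total samples from the jobs table on every call.

    Hypothetical stand-in for a scrape-time collector: because the counts
    come from the database, rows written by any process are visible.
    """
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM jobs GROUP BY status ORDER BY status"
    ).fetchall()
    lines = ["# TYPE ocr_jobs_total gauge"]
    for status, count in rows:
        # Prometheus text format: name{label="value"} <float>
        lines.append(f'ocr_jobs_total{{status="{status}"}} {float(count)}')
    return "\n".join(lines) + "\n"


# Usage: an in-memory table standing in for the real jobs table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("a", "completed"), ("b", "completed"), ("c", "failed")],
)
exposition = render_jobs_total(conn)
```

The trade-off is one `GROUP BY` query per scrape in exchange for a value that is correct no matter which process wrote the row.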
--- a/tests/unit/test_auth.py
+++ b/tests/unit/test_auth.py
@@ -0,0 +1,43 @@
+"""API key authentication."""
+
+from __future__ import annotations
+
+import pytest
+from fastapi.testclient import TestClient
+
+from ocr_sprint.config import get_settings
+from ocr_sprint.main import create_app
+
+
+def _client_with_keys(monkeypatch: pytest.MonkeyPatch, keys: str) -> TestClient:
+    monkeypatch.setenv("API_KEYS", keys)
+    get_settings.cache_clear()
+    return TestClient(create_app())
+
+
+def test_auth_disabled_when_keys_empty(monkeypatch: pytest.MonkeyPatch) -> None:
+    client = _client_with_keys(monkeypatch, "")
+    response = client.get("/api/v1/documents/00000000-0000-0000-0000-000000000000")
+    # 404 not 401: auth disabled, the endpoint just doesn't find the row.
+    assert response.status_code == 404
+
+
+def test_auth_rejects_missing_key(monkeypatch: pytest.MonkeyPatch) -> None:
+    client = _client_with_keys(monkeypatch, "secret-1,secret-2")
+    response = client.get("/api/v1/documents/00000000-0000-0000-0000-000000000000")
+    assert response.status_code == 401
+
+
+def test_auth_accepts_valid_key(monkeypatch: pytest.MonkeyPatch) -> None:
+    client = _client_with_keys(monkeypatch, "secret-1,secret-2")
+    response = client.get(
+        "/api/v1/documents/00000000-0000-0000-0000-000000000000",
+        headers={"X-API-Key": "secret-2"},
+    )
+    assert response.status_code == 404
+
+
+def test_health_is_unprotected(monkeypatch: pytest.MonkeyPatch) -> None:
+    client = _client_with_keys(monkeypatch, "secret-1")
+    response = client.get("/api/v1/health")
+    assert response.status_code == 200
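
The auth tests above pin down three behaviours: an empty `API_KEYS` disables auth, a missing `X-API-Key` header is rejected when keys are configured, and any one of the comma-separated keys is accepted. The dependency that enforces this is not in the excerpt, so the following is a sketch under assumptions: the helper names are hypothetical, and the constant-time comparison via `hmac.compare_digest` is my choice, not something the diff confirms.

```python
from __future__ import annotations

import hmac


def parse_api_keys(raw: str) -> list[str]:
    """Split a comma-separated API_KEYS value, dropping blank entries."""
    return [key.strip() for key in raw.split(",") if key.strip()]


def check_api_key(provided: str | None, raw_keys: str) -> bool:
    """Return True when the request may proceed (hypothetical helper)."""
    keys = parse_api_keys(raw_keys)
    if not keys:
        # No keys configured: auth is disabled, every request passes.
        return True
    if provided is None:
        # Keys configured but no header sent: the route would answer 401.
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return any(
        hmac.compare_digest(provided.encode(), key.encode()) for key in keys
    )
```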
--- a/tests/unit/test_blob_storage.py
+++ b/tests/unit/test_blob_storage.py
@@ -0,0 +1,85 @@
+"""Local-filesystem blob storage."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+
+from ocr_sprint.storage.blob import LocalFsBlobStorage
+
+
+@pytest.fixture
+def storage(tmp_path: Path) -> LocalFsBlobStorage:
+    return LocalFsBlobStorage(tmp_path / "blobs")
+
+
+def test_put_returns_dated_key(storage: LocalFsBlobStorage) -> None:
+    key = storage.put(b"hello", original_filename="surat.pdf")
+    # Layout is YYYY/MM/DD/<uuid>.pdf
+    parts = key.split("/")
+    assert len(parts) == 4
+    assert parts[3].endswith(".pdf")
+    assert storage.exists(key)
+    assert storage.get(key) == b"hello"
+
+
+def test_put_unknown_extension_falls_back_to_bin(storage: LocalFsBlobStorage) -> None:
+    key = storage.put(b"x", original_filename="weird.xyz")
+    assert key.endswith(".bin")
+
+
+def test_put_strips_directory_traversal(storage: LocalFsBlobStorage) -> None:
+    # ext is taken via Path().suffix, not from the raw filename, so a name
+    # like "../../etc/passwd" is harmless — the only thing the caller can
+    # influence is the extension.
+    key = storage.put(b"y", original_filename="../../etc/passwd")
+    assert "etc" not in key
+    assert key.endswith(".bin")
+
+
+def test_put_handles_missing_filename(storage: LocalFsBlobStorage) -> None:
+    key = storage.put(b"z", original_filename=None)
+    assert key.endswith(".bin")
+
+
+def test_get_unknown_key_raises(storage: LocalFsBlobStorage) -> None:
+    with pytest.raises(FileNotFoundError):
+        storage.get("2026/01/01/bogus.pdf")
+
+
+def test_delete_is_idempotent(storage: LocalFsBlobStorage) -> None:
+    key = storage.put(b"q", original_filename="x.png")
+    storage.delete(key)
+    assert not storage.exists(key)
+    storage.delete(key)  # second delete must not raise
+
+
+def test_resolve_rejects_path_escape(storage: LocalFsBlobStorage) -> None:
+    with pytest.raises(ValueError, match="escapes storage root"):
+        storage._resolve("../../../etc/passwd")
+
+
+def test_resolve_rejects_directory_prefix_collision(tmp_path: Path) -> None:
+    """Regression: ``startswith`` would mis-accept sibling dirs whose names
+    happen to begin with the storage root's basename. ``is_relative_to``
+    handles this correctly.
+    """
+    root = tmp_path / "blobs"
+    root.mkdir()
+    sibling = tmp_path / "blobs_evil"
+    sibling.mkdir()
+    storage = LocalFsBlobStorage(root)
+    with pytest.raises(ValueError, match="escapes storage root"):
+        storage._resolve("../blobs_evil/secret.txt")
+
+
+def test_exists_returns_false_for_escaped_key(storage: LocalFsBlobStorage) -> None:
+    # exists() must not raise even for malicious keys.
+    assert storage.exists("../../etc/passwd") is False
+
+
+def test_open_streams_content(storage: LocalFsBlobStorage) -> None:
+    key = storage.put(b"streamed", original_filename="x.png")
+    with storage.open(key) as fh:
+        assert fh.read() == b"streamed"
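
The prefix-collision regression tested above comes down to comparing paths as strings versus comparing them component by component. The sketch below contrasts the two checks; the function names are hypothetical and the real `_resolve` is not shown in this excerpt. `Path.is_relative_to` requires Python 3.9 or newer.

```python
from pathlib import Path


def naive_is_inside(base: Path, key: str) -> bool:
    """Buggy string-prefix check, shown only for contrast."""
    return str((base / key).resolve()).startswith(str(base.resolve()))


def resolve_key(base: Path, key: str) -> Path:
    """Resolve key under base, rejecting anything outside the root."""
    # Resolve the root too, so both sides of the comparison are normalized.
    base = base.resolve()
    candidate = (base / key).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"key {key!r} escapes storage root")
    return candidate


base = Path("/app/blobs")
# The string check wrongly accepts the sibling directory /app/blobs_evil,
# because "/app/blobs_evil/..." starts with "/app/blobs".
sibling_passes_naive_check = naive_is_inside(base, "../blobs_evil/secret.txt")
```

Resolving `base` before the comparison mirrors item 4 of the commit message: a resolved candidate compared against an unresolved root can disagree even for paths that are legitimately inside it.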
--- a/tests/unit/test_db_repository.py
+++ b/tests/unit/test_db_repository.py
@@ -0,0 +1,76 @@
+"""SQLAlchemy repository tests against a per-test sqlite db file."""
+
+from __future__ import annotations
+
+from uuid import uuid4
+
+import pytest
+
+from ocr_sprint.db.base import Base, get_engine, session_scope
+from ocr_sprint.db.repositories import JobNotFoundError, JobRepository
+from ocr_sprint.schemas.document import DocumentStatus, SourceKind
+
+
+@pytest.fixture
+def db_ready() -> None:
+    Base.metadata.create_all(bind=get_engine())
+
+
+def test_create_then_fetch(db_ready: None) -> None:
+    jid = uuid4()
+    with session_scope() as session:
+        JobRepository(session).create(
+            job_id=jid,
+            filename="x.pdf",
+            source_kind=SourceKind.PDF,
+            blob_key="2026/01/01/x.pdf",
+        )
+    with session_scope() as session:
+        row = JobRepository(session).get_or_raise(jid)
+    assert row.status == DocumentStatus.PENDING.value
+    assert row.source_kind == SourceKind.PDF.value
+    assert row.blob_key == "2026/01/01/x.pdf"
+
+
+def test_lifecycle_transitions(db_ready: None) -> None:
+    jid = uuid4()
+    with session_scope() as session:
+        JobRepository(session).create(
+            job_id=jid,
+            filename="x.pdf",
+            source_kind=SourceKind.PDF,
+            blob_key="k",
+        )
+    with session_scope() as session:
+        JobRepository(session).mark_processing(jid)
+    with session_scope() as session:
+        repo = JobRepository(session)
+        repo.mark_completed(
+            jid,
+            status=DocumentStatus.NEEDS_REVIEW,
+            confidence=0.88,
+            result={"header": {"nomor_sprint": "Sprin/1/2025"}},
+            review_flags=["low_ocr_confidence"],
+        )
+        row = repo.get_or_raise(jid)
+    assert row.status == DocumentStatus.NEEDS_REVIEW.value
+    assert row.confidence == 0.88
+    assert row.result == {"header": {"nomor_sprint": "Sprin/1/2025"}}
+    assert row.review_flags == ["low_ocr_confidence"]
+
+
+def test_mark_failed_truncates_long_error(db_ready: None) -> None:
+    jid = uuid4()
+    with session_scope() as session:
+        JobRepository(session).create(
+            job_id=jid, filename="x", source_kind=SourceKind.UNKNOWN, blob_key="k"
+        )
+    with session_scope() as session:
+        JobRepository(session).mark_failed(jid, error="x" * 5000)
+        row = JobRepository(session).get_or_raise(jid)
+    assert len(row.error or "") == 2048
+
+
+def test_unknown_job_raises(db_ready: None) -> None:
+    with session_scope() as session, pytest.raises(JobNotFoundError):
+        JobRepository(session).get_or_raise(uuid4())
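
The repository tests above run each state transition in its own `session_scope` block, which is the same pattern the commit message names as the fix for the sync-mode rollback bug. The sketch below reproduces that pattern with the standard library's `sqlite3` instead of SQLAlchemy. `tx_scope`, `run_job`, and the three-column schema are hypothetical; the 2048-character truncation mirrors the limit asserted in `test_mark_failed_truncates_long_error`.

```python
from __future__ import annotations

import sqlite3
from collections.abc import Callable, Iterator
from contextlib import contextmanager

SCHEMA = "CREATE TABLE IF NOT EXISTS jobs (job_id TEXT PRIMARY KEY, status TEXT, error TEXT)"


@contextmanager
def tx_scope(db_path: str) -> Iterator[sqlite3.Connection]:
    """One short-lived transaction per block: commit on success, else roll back."""
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


def run_job(db_path: str, job_id: str, pipeline: Callable[[], None]) -> None:
    """Persist each state transition independently of the caller's fate."""
    # 1. Commit the pending row right away, so it survives a later failure.
    with tx_scope(db_path) as conn:
        conn.execute(SCHEMA)
        conn.execute(
            "INSERT INTO jobs (job_id, status) VALUES (?, 'pending')", (job_id,)
        )
    try:
        pipeline()
    except Exception as exc:
        # 2. Record the failure in its own transaction, then re-raise.
        with tx_scope(db_path) as conn:
            conn.execute(
                "UPDATE jobs SET status = 'failed', error = ? WHERE job_id = ?",
                (str(exc)[:2048], job_id),
            )
        raise
    # 3. Success is likewise written in a separate transaction.
    with tx_scope(db_path) as conn:
        conn.execute(
            "UPDATE jobs SET status = 'completed' WHERE job_id = ?", (job_id,)
        )
```

Because the failure write is committed before the exception propagates, a rollback further up the call stack cannot erase it, and the stored blob keeps a row pointing at it.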