OCR-SPRIN-SERVICE/tests/unit/test_api.py
devin-ai-integration[bot] 2112023b6e Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics (#3)
* Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 4: fix sync-mode rollback orphaning blobs + use is_relative_to for path-escape check

Devin Review on PR #3 found two real bugs:

1. Sync path mark_failed was rolled back by the request-scoped session.
   When the pipeline raised an exception in ?sync=true mode, _run_inline
   modified the FastAPI session and re-raised; get_session caught the
   exception, called session.rollback(), and wiped both the create() and
   the mark_failed() writes. The blob was already on disk, so it was
   permanently orphaned with no DB record. Fix: commit the pending row
   immediately after create(), and run all subsequent state transitions in
   independent session_scope blocks (matching the worker task pattern).

2. _resolve used str.startswith for path-escape detection, which lets a
   sibling directory whose name begins with the storage root pass (e.g.
   /app/blobs_evil vs /app/blobs). Switched to Path.is_relative_to.
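
A minimal sketch of why the string-prefix check in item 2 lets a sibling directory through. The helper names `inside_root_startswith` and `inside_root_relative` are hypothetical and are not taken from `_resolve`; only the two example paths come from the message above. `is_relative_to` needs Python 3.9 or newer.

```python
from pathlib import PurePosixPath

# Storage root from the example in the commit message.
ROOT = PurePosixPath("/app/blobs")


def inside_root_startswith(candidate: PurePosixPath) -> bool:
    """Buggy variant: a plain string prefix match."""
    return str(candidate).startswith(str(ROOT))


def inside_root_relative(candidate: PurePosixPath) -> bool:
    """Fixed variant: compares whole path components, not characters."""
    return candidate.is_relative_to(ROOT)


sibling = PurePosixPath("/app/blobs_evil/secret.pdf")
legit = PurePosixPath("/app/blobs/2025/doc.pdf")

# The sibling directory shares the "/app/blobs" character prefix, so the
# string check accepts it while the component-wise check rejects it.
sibling_passes_prefix_check = inside_root_startswith(sibling)
sibling_passes_relative_check = inside_root_relative(sibling)
legit_passes_relative_check = inside_root_relative(legit)
```

`PurePosixPath` is used here only so the sketch behaves the same on every platform; real storage code would presumably resolve the candidate against the filesystem first.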

Added regression tests for both.
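
The rollback problem in item 1 can be reproduced in miniature. This is a sketch only: it uses the standard-library `sqlite3` module in place of the service's SQLAlchemy session and Postgres, and the `jobs` table, `run_sync_job`, and `commit_each_step` are hypothetical names introduced for illustration.

```python
import sqlite3
from typing import List


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
    conn.commit()
    return conn


def run_sync_job(conn: sqlite3.Connection, commit_each_step: bool) -> None:
    """Simulate create() -> pipeline error -> mark_failed() -> request rollback."""
    try:
        conn.execute("INSERT INTO jobs (status) VALUES ('pending')")  # create()
        if commit_each_step:
            conn.commit()  # fix: persist the pending row immediately
        try:
            raise RuntimeError("pipeline failed")
        except RuntimeError:
            conn.execute("UPDATE jobs SET status = 'failed'")  # mark_failed()
            if commit_each_step:
                conn.commit()  # fix: state transition commits on its own
            raise
    except RuntimeError:
        # Stand-in for the request-scoped session rolling back on error.
        conn.rollback()


def statuses(conn: sqlite3.Connection) -> List[str]:
    return [row[0] for row in conn.execute("SELECT status FROM jobs")]


buggy = make_db()
run_sync_job(buggy, commit_each_step=False)

fixed = make_db()
run_sync_job(fixed, commit_each_step=True)

buggy_rows = statuses(buggy)   # both writes were rolled back
fixed_rows = statuses(fixed)   # the failed row survives the rollback
```

In the buggy variant the table ends up empty even though the blob would already be on disk; in the fixed variant the final rollback has nothing left to undo.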

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 4: honor queue_enabled setting + resolve base_dir for path comparisons

Two more bugs found by Devin Review:

3. queue_enabled was declared in config and documented in .env.example but
   never read by the route. A fresh dev install with QUEUE_ENABLED=false
   (the default) would still enqueue, then fail with a Redis connection
   error. Fixed by making the ?sync= query param default to None and
   resolving to (not queue_enabled) inside the route. Tests now set
   QUEUE_ENABLED=true so the async flow stays exercised, and a new test
   verifies the inline fallback when the queue is disabled.

4. LocalFsBlobStorage stored base_dir as-is. _resolve resolved its
   candidate paths, so the empty-dir cleanup loop in delete() compared a
   resolved candidate against an unresolved base_dir and broke on the
   first iteration (no cleanup ever happened). Fixed by resolving base_dir
   once in __init__ so every path comparison is apples-to-apples.
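
The tri-state `?sync=` resolution from item 3 could look roughly like the sketch below. `resolve_sync` is a hypothetical helper, not the route's actual code, and letting an explicit `sync` value always win over `queue_enabled` is an assumption; the message above only specifies the `None` case.

```python
from typing import Optional


def resolve_sync(sync: Optional[bool], queue_enabled: bool) -> bool:
    """Decide whether a request runs inline (True) or is enqueued (False)."""
    if sync is None:
        # No explicit query param: run inline exactly when there is no queue.
        return not queue_enabled
    # Assumption: an explicit ?sync=true/false is honored as given.
    return sync


default_without_queue = resolve_sync(None, queue_enabled=False)
default_with_queue = resolve_sync(None, queue_enabled=True)
explicit_sync_with_queue = resolve_sync(True, queue_enabled=True)
```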

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 4: derive ocr_jobs_total from DB so worker writes are visible at /metrics

Devin Review correctly noted the Counter-based JOBS_TOTAL would never
increment in production because the worker runs in a separate process from
the API and the registry is process-local. Replaced JOBS_TOTAL with a
custom Collector that issues SELECT status, COUNT(*) FROM jobs GROUP BY
status on every /metrics scrape. Result: the metric stays accurate
regardless of which process wrote the row.
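
A rough sketch of the DB-derived metric idea, kept to the standard library. It swaps in `sqlite3` for Postgres and renders exposition-style lines by hand instead of registering a `prometheus_client` collector; `job_counts`, `render_jobs_total`, and the table layout are hypothetical, and only the query and the metric name come from the message above.

```python
import sqlite3
from typing import Dict, List


def job_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Run the per-status count query on every scrape."""
    rows = conn.execute("SELECT status, COUNT(*) FROM jobs GROUP BY status")
    return {status: count for status, count in rows}


def render_jobs_total(counts: Dict[str, int]) -> List[str]:
    """Format the counts as one sample line per status label."""
    return [
        f'ocr_jobs_total{{status="{status}"}} {float(count)}'
        for status, count in sorted(counts.items())
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
# Rows written by "another process" are visible because the value is read
# from the shared database rather than from in-memory counter state.
conn.executemany(
    "INSERT INTO jobs (status) VALUES (?)",
    [("completed",), ("failed",), ("completed",)],
)
conn.commit()

lines = render_jobs_total(job_counts(conn))
```

Because the value is recomputed from the table on each scrape, it behaves like a gauge: it can go down if rows are deleted, which a true counter never would.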

Also corrected the metrics.py docstring (the old comment claimed the
counter was 'incremented by the worker', which was the bug).

Removed the JOBS_TOTAL.inc() calls from the sync route — the DB collector
covers both paths now. JOB_PROCESSING_SECONDS stays as an API-process
histogram with an updated docstring noting its scope; cross-process
latency belongs to derived dashboards over jobs.created_at/updated_at.

Added regression test test_metrics_jobs_total_reflects_worker_writes.

Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
2026-04-25 16:50:51 +00:00


"""API tests with the OCR engine mocked.

These tests do NOT load PaddleOCR; instead they monkeypatch the orchestrator
so we can exercise the FastAPI surface without the heavy ML init cost.
"""

from __future__ import annotations

from datetime import date

import pytest
from fastapi.testclient import TestClient

from ocr_sprint.main import create_app
from ocr_sprint.pipeline import orchestrator as orch_module
from ocr_sprint.pipeline.orchestrator import PipelineOutput
from ocr_sprint.schemas.document import DocumentStatus, SourceKind
from ocr_sprint.schemas.extraction import ExtractionResult, HeaderFields


@pytest.fixture
def client() -> TestClient:
    return TestClient(create_app())


@pytest.fixture
def fake_pipeline(monkeypatch: pytest.MonkeyPatch) -> PipelineOutput:
    """Patch run_pipeline everywhere it's referenced."""
    fake_result = ExtractionResult(
        header=HeaderFields(
            nomor_sprint="Sprin/1/I/2025",
            tanggal=date(2025, 1, 1),
            satuan_penerbit="POLRES TEST",
        ),
        confidence=0.97,
    )
    fake_output = PipelineOutput(
        source_kind=SourceKind.PDF,
        status=DocumentStatus.COMPLETED,
        confidence=0.97,
        result=fake_result,
    )

    def _fake_run(_content: bytes) -> PipelineOutput:
        return fake_output

    # Patch every module that imported run_pipeline by name, since each one
    # holds its own reference to the original function.
    monkeypatch.setattr(orch_module, "run_pipeline", _fake_run)
    from ocr_sprint.api.routes import documents as docs_module

    monkeypatch.setattr(docs_module, "run_pipeline", _fake_run)
    from ocr_sprint.worker import tasks as tasks_module

    monkeypatch.setattr(tasks_module, "run_pipeline", _fake_run)
    return fake_output


def test_health_endpoint(client: TestClient) -> None:
    response = client.get("/api/v1/health")
    assert response.status_code == 200
    assert response.json()["status"] == "ok"


def test_documents_rejects_empty_upload(client: TestClient) -> None:
    response = client.post(
        "/api/v1/documents",
        files={"file": ("empty.pdf", b"", "application/pdf")},
    )
    assert response.status_code == 400


def test_documents_sync_returns_pipeline_output(
    client: TestClient,
    fake_pipeline: PipelineOutput,
) -> None:
    response = client.post(
        "/api/v1/documents?sync=true",
        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
    )
    assert response.status_code == 200
    body = response.json()
    assert body["status"] == "completed"
    assert body["confidence"] == 0.97
    assert body["data"]["header"]["nomor_sprint"] == "Sprin/1/I/2025"


def test_documents_async_returns_202_then_polls_to_completion(
    client: TestClient,
    fake_pipeline: PipelineOutput,
) -> None:
    """Default flow: POST returns 202, GET returns the eventual completion.

    With CELERY_TASK_ALWAYS_EAGER set in conftest, the worker runs inline,
    so by the time POST returns the task has already finished and GET sees
    a `completed` row.
    """
    post = client.post(
        "/api/v1/documents",
        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
    )
    assert post.status_code == 202
    job_id = post.json()["job_id"]
    get = client.get(f"/api/v1/documents/{job_id}")
    assert get.status_code == 200
    body = get.json()
    assert body["status"] == "completed"
    assert body["confidence"] == 0.97


def test_documents_defaults_to_sync_when_queue_disabled(
    client: TestClient,
    fake_pipeline: PipelineOutput,
    monkeypatch: pytest.MonkeyPatch,
) -> None:
    """Regression: with ``QUEUE_ENABLED=false`` the route must NOT enqueue,
    otherwise a default install with no Redis returns 500.
    """
    monkeypatch.setenv("QUEUE_ENABLED", "false")
    from ocr_sprint.config import get_settings

    get_settings.cache_clear()

    # Pretend the broker is unreachable; if the route still enqueues, the
    # call would blow up here. The replacement is set on the task object
    # itself, so it is called unbound and takes no ``self`` argument.
    def _no_broker(*_args: object, **_kwargs: object) -> None:
        raise AssertionError("queue path taken when queue is disabled")

    from ocr_sprint.worker import tasks as task_module

    monkeypatch.setattr(task_module.process_document_task, "delay", _no_broker)
    post = client.post(
        "/api/v1/documents",
        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
    )
    assert post.status_code == 200, post.text
    body = post.json()
    assert body["status"] == "completed"


def test_documents_get_unknown_id_returns_404(client: TestClient) -> None:
    response = client.get("/api/v1/documents/00000000-0000-0000-0000-000000000000")
    assert response.status_code == 404


def test_documents_async_marks_failed_on_pipeline_error(
    client: TestClient,
    monkeypatch: pytest.MonkeyPatch,
) -> None:
    def _explode(_content: bytes) -> PipelineOutput:
        raise RuntimeError("boom")

    from ocr_sprint.worker import tasks as tasks_module

    monkeypatch.setattr(tasks_module, "run_pipeline", _explode)
    post = client.post(
        "/api/v1/documents",
        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
    )
    assert post.status_code == 202
    job_id = post.json()["job_id"]
    get = client.get(f"/api/v1/documents/{job_id}")
    body = get.json()
    assert body["status"] == "failed"
    assert "boom" in (body.get("error") or "")


def test_documents_sync_persists_failed_row_when_pipeline_raises(
    client: TestClient,
    monkeypatch: pytest.MonkeyPatch,
) -> None:
    """Regression: an exception in the sync pipeline must NOT roll back the
    pending row + ``mark_failed`` write. Otherwise the blob on disk has no
    DB record pointing at it.
    """

    def _explode(_content: bytes) -> PipelineOutput:
        raise RuntimeError("kapow")

    from ocr_sprint.api.routes import documents as docs_module

    monkeypatch.setattr(docs_module, "run_pipeline", _explode)
    # ``raise_server_exceptions=False`` lets the test see the 500 response
    # rather than re-raising the underlying RuntimeError from the route.
    silent = TestClient(client.app, raise_server_exceptions=False)
    post = silent.post(
        "/api/v1/documents?sync=true",
        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
    )
    assert post.status_code == 500
    # The row must still be visible to GET, with status=failed.
    from ocr_sprint.db.base import session_scope
    from ocr_sprint.db.repositories import JobRepository

    with session_scope() as session:
        # Find the most recent row.
        from ocr_sprint.db.models import JobRow

        row = session.query(JobRow).order_by(JobRow.created_at.desc()).first()
        assert row is not None, "create() must persist even when pipeline blows up"
        assert row.status == "failed"
        assert "kapow" in (row.error or "")
        assert row.blob_key  # blob is referenced, not orphaned
        # GET must surface the failure too (this is the client-visible contract).
        get = client.get(f"/api/v1/documents/{row.job_id}")
        assert get.status_code == 200
        assert get.json()["status"] == "failed"
    assert JobRepository  # silence import-only warning


def test_metrics_endpoint_exposes_request_counter(
    client: TestClient,
    fake_pipeline: PipelineOutput,
) -> None:
    client.post(
        "/api/v1/documents?sync=true",
        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
    )
    metrics = client.get("/metrics")
    assert metrics.status_code == 200
    body = metrics.text
    assert "http_requests_total" in body
    assert "ocr_jobs_total" in body


def test_metrics_jobs_total_reflects_worker_writes(
    client: TestClient,
    fake_pipeline: PipelineOutput,
) -> None:
    """Regression: when the worker (eager mode here) marks a job complete,
    /metrics must reflect that. The previous Counter-based implementation
    would have stayed at zero because the worker's increments don't reach
    the API process's in-memory registry.
    """
    post = client.post(
        "/api/v1/documents",
        files={"file": ("x.pdf", b"%PDF-1.4\n%fake", "application/pdf")},
    )
    assert post.status_code == 202
    body = client.get("/metrics").text
    # Exact match on ``ocr_jobs_total{status="completed"} 1.0`` to make sure
    # the gauge-style metric is being populated from the DB.
    assert 'ocr_jobs_total{status="completed"} 1.0' in body