* Phase 4: async pipeline (Celery+Redis), Postgres job state, local-fs blob storage, API-key auth, Prometheus metrics

  Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 4: fix sync-mode rollback orphaning blobs + use is_relative_to for path-escape check

  Devin Review on PR #3 found two real bugs:

  1. Sync path mark_failed was rolled back by the request-scoped session. When the pipeline raised an exception in ?sync=true mode, _run_inline modified the FastAPI session and re-raised; get_session caught the exception, called session.rollback(), and wiped both the create() and the mark_failed() writes. The blob was already on disk, so it was permanently orphaned with no DB record. Fix: commit the pending row immediately after create(), and run all subsequent state transitions in independent session_scope blocks (matching the worker task pattern).

  2. _resolve used str.startswith for path-escape detection, which lets a sibling directory whose name begins with the storage root pass (e.g. /app/blobs_evil vs /app/blobs). Switched to Path.is_relative_to.

  Added regression tests for both.

  Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 4: honor queue_enabled setting + resolve base_dir for path comparisons

  Two more bugs found by Devin Review:

  3. queue_enabled was declared in config and documented in .env.example but never read by the route. A fresh dev install with QUEUE_ENABLED=false (the default) would still enqueue, then fail with a Redis connection error. Fixed by making the ?sync= query param default to None and resolving to (not queue_enabled) inside the route. Tests now set QUEUE_ENABLED=true so the async flow stays exercised, and a new test verifies the inline fallback when the queue is disabled.

  4. LocalFsBlobStorage stored base_dir as-is. _resolve resolved its candidate paths, so the empty-dir cleanup loop in delete() compared a resolved candidate against an unresolved base_dir and broke on the first iteration (no cleanup ever happened). Fixed by resolving base_dir once in __init__ so every path comparison is apples-to-apples.

  Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

* Phase 4: derive ocr_jobs_total from DB so worker writes are visible at /metrics

  Devin Review correctly noted the Counter-based JOBS_TOTAL would never increment in production because the worker runs in a separate process from the API and the registry is process-local.

  Replaced JOBS_TOTAL with a custom Collector that issues SELECT status, COUNT(*) FROM jobs GROUP BY status on every /metrics scrape. Result: the metric stays accurate regardless of which process wrote the row.

  Also corrected the metrics.py docstring (the old comment claimed the counter was 'incremented by the worker', which was the bug). Removed the JOBS_TOTAL.inc() calls from the sync route; the DB collector covers both paths now.

  JOB_PROCESSING_SECONDS stays as an API-process histogram with an updated docstring noting its scope; cross-process latency belongs to derived dashboards over jobs.created_at/updated_at.

  Added regression test test_metrics_jobs_total_reflects_worker_writes.

  Co-Authored-By: adrian kuman firmansah <adriancuman@gmail.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: adrian kuman firmansah <adriancuman@gmail.com>
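The path-escape fix described in item 2 of the commit message can be illustrated with a small standalone sketch. The helper names `escapes_startswith` and `escapes_relative` are hypothetical and exist only for this comparison; the module itself calls `Path.is_relative_to` on a resolved path. `PurePosixPath` is used here so the example behaves the same on any operating system.

```python
from pathlib import PurePosixPath


def escapes_startswith(root: str, candidate: str) -> bool:
    """Old-style check: plain string prefix comparison (flawed)."""
    return not candidate.startswith(root)


def escapes_relative(root: str, candidate: str) -> bool:
    """Component-wise containment check via PurePath.is_relative_to (Python 3.9+)."""
    return not PurePosixPath(candidate).is_relative_to(PurePosixPath(root))


root = "/app/blobs"
sibling = "/app/blobs_evil/secret.pdf"   # outside the root, but shares its prefix
inside = "/app/blobs/2026/04/25/a.pdf"   # genuinely inside the root

# The string check treats the sibling directory as contained in the root.
prefix_check_flags_sibling = escapes_startswith(root, sibling)
# The path-aware check compares whole components, so the sibling is rejected.
path_check_flags_sibling = escapes_relative(root, sibling)
path_check_flags_inside = escapes_relative(root, inside)

# Note: pure paths compare lexically and do not collapse "..", which is why
# the storage class resolves the candidate before testing containment.
```

In this sketch the prefix check fails to flag the sibling path, while the component-wise check flags it and still accepts the path that lies inside the root.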
147 lines
5.2 KiB
Python
"""Blob storage abstraction.

The MVP only ships a local-filesystem backend. The `BlobStorage` Protocol is
deliberately small (put / get / open / exists / delete) so that an S3- or
MinIO-backed implementation can be dropped in later without touching API code.

Layout on disk:

    {blob_storage_dir}/
      2026/04/25/
        <uuid4>.<ext>

The date hierarchy keeps the directory listing manageable when the service
processes thousands of documents per day, and makes manual rsync-based
backup straightforward.
"""

from __future__ import annotations

from datetime import datetime, timezone
from pathlib import Path
from typing import BinaryIO, Protocol
from uuid import uuid4

from ocr_sprint.config import get_settings
from ocr_sprint.utils.logging import get_logger

_logger = get_logger(__name__)

# Map of upload extensions we'll honor when persisting blobs. Anything else
# falls back to `.bin` and the OCR pipeline's magic-byte sniffing handles
# the actual content kind.
_KNOWN_EXTS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".webp"}


class BlobStorage(Protocol):
    """Minimal interface a blob backend must satisfy."""

    def put(self, content: bytes, original_filename: str | None = None) -> str:
        """Persist `content` and return an opaque key the caller can use later."""

    def get(self, key: str) -> bytes:
        """Return the raw bytes for `key`. Raises FileNotFoundError on miss."""

    def open(self, key: str) -> BinaryIO:
        """Return a binary file-like object for streaming reads."""

    def exists(self, key: str) -> bool:
        """True if `key` is currently stored."""

    def delete(self, key: str) -> None:
        """Remove a blob. No-op if it doesn't exist."""


class LocalFsBlobStorage:
    """Filesystem-backed implementation rooted at `base_dir`."""

    def __init__(self, base_dir: Path) -> None:
        # Resolve once so every subsequent path comparison (escape check,
        # empty-dir cleanup) is apples-to-apples: ``Path.parents`` of a
        # resolved key would otherwise never equal a relative ``base_dir``.
        base_dir.mkdir(parents=True, exist_ok=True)
        self.base_dir = base_dir.resolve()

    # ---------- helpers ----------

    @staticmethod
    def _safe_ext(original_filename: str | None) -> str:
        if not original_filename:
            return ".bin"
        suffix = Path(original_filename).suffix.lower()
        return suffix if suffix in _KNOWN_EXTS else ".bin"

    def _resolve(self, key: str) -> Path:
        # Defensive: keys come from the DB but we still reject paths that try
        # to escape the blob root. ``Path.is_relative_to`` does proper path
        # containment; string ``startswith`` would let ``/app/blobs_evil``
        # slip past when the root is ``/app/blobs``.
        candidate = (self.base_dir / key).resolve()
        if not candidate.is_relative_to(self.base_dir):
            raise ValueError(f"Blob key escapes storage root: {key!r}")
        return candidate

    # ---------- BlobStorage protocol ----------

    def put(self, content: bytes, original_filename: str | None = None) -> str:
        now = datetime.now(timezone.utc)
        date_dir = Path(f"{now:%Y/%m/%d}")
        ext = self._safe_ext(original_filename)
        key = str(date_dir / f"{uuid4().hex}{ext}")
        target = self._resolve(key)
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file in the same directory then rename. This avoids
        # a half-written blob being read by a concurrent worker.
        tmp = target.with_suffix(target.suffix + ".tmp")
        tmp.write_bytes(content)
        tmp.rename(target)
        _logger.info("blob.put", key=key, size=len(content))
        return key

    def get(self, key: str) -> bytes:
        path = self._resolve(key)
        if not path.exists():
            raise FileNotFoundError(f"Blob not found: {key}")
        return path.read_bytes()

    def open(self, key: str) -> BinaryIO:
        path = self._resolve(key)
        if not path.exists():
            raise FileNotFoundError(f"Blob not found: {key}")
        return path.open("rb")

    def exists(self, key: str) -> bool:
        try:
            return self._resolve(key).exists()
        except ValueError:
            return False

    def delete(self, key: str) -> None:
        try:
            path = self._resolve(key)
        except ValueError:
            return
        if path.exists():
            path.unlink()
            _logger.info("blob.delete", key=key)
            # Best-effort cleanup of empty date dirs so we don't accumulate
            # 365 directories per year forever. ``self.base_dir`` is already
            # resolved (see __init__), so it can be compared against
            # ``path.parents`` directly.
            for parent in path.parents:
                if parent == self.base_dir or self.base_dir not in parent.parents:
                    break
                try:
                    parent.rmdir()
                except OSError:
                    break


def get_blob_storage() -> BlobStorage:
    """Build the configured blob backend. Single-process cache lives in `Settings`."""
    s = get_settings()
    return LocalFsBlobStorage(s.blob_storage_dir)


__all__ = ["BlobStorage", "LocalFsBlobStorage", "get_blob_storage"]
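
The module docstring notes that another backend can be dropped in without touching API code. The following is a minimal sketch of what that could look like, assuming a dict-backed store. `InMemoryBlobStorage` and `round_trip` are hypothetical names that do not exist in the module, and the protocol is redeclared locally (with `runtime_checkable` added) only so the sketch runs on its own.

```python
import io
from typing import BinaryIO, Dict, Optional, Protocol, runtime_checkable
from uuid import uuid4


@runtime_checkable
class BlobStorage(Protocol):
    """Local copy of the module's protocol so this sketch is standalone."""

    def put(self, content: bytes, original_filename: Optional[str] = None) -> str: ...
    def get(self, key: str) -> bytes: ...
    def open(self, key: str) -> BinaryIO: ...
    def exists(self, key: str) -> bool: ...
    def delete(self, key: str) -> None: ...


class InMemoryBlobStorage:
    """Hypothetical dict-backed backend, e.g. for unit tests."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, content: bytes, original_filename: Optional[str] = None) -> str:
        key = uuid4().hex  # opaque key; the filename is ignored in this sketch
        self._blobs[key] = content
        return key

    def get(self, key: str) -> bytes:
        try:
            return self._blobs[key]
        except KeyError:
            # Mirror the contract of the filesystem backend on a miss.
            raise FileNotFoundError(f"Blob not found: {key}") from None

    def open(self, key: str) -> BinaryIO:
        return io.BytesIO(self.get(key))

    def exists(self, key: str) -> bool:
        return key in self._blobs

    def delete(self, key: str) -> None:
        self._blobs.pop(key, None)  # no-op if the key is absent


def round_trip(storage: BlobStorage, payload: bytes) -> bytes:
    """Caller code depends only on the protocol, not on a concrete backend."""
    key = storage.put(payload, "scan.pdf")
    data = storage.get(key)
    storage.delete(key)
    return data
```

Because `Protocol` typing is structural, the in-memory class needs no inheritance; any object with the five matching methods can be passed wherever a `BlobStorage` is expected.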