Why use Celery instead of a single asyncio event loop for DSR polling?

A single asyncio loop on one worker cannot survive a process restart, cannot fan work across a fleet, and cannot enforce per-tenant isolation at the broker level. Celery adds durable at-least-once delivery, named priority queues bound to dedicated worker pools, and tenant-scoped routing keys, so a crash mid-poll redelivers the task and staging payloads cannot traverse production queues.

How do you make Celery safe for long-running DSR polls that may outlive a worker?

Set task_acks_late to True with worker_prefetch_multiplier of 1 and task_reject_on_worker_lost to True. Tasks are then acknowledged only after they complete, so a worker crash mid-extraction redelivers the task rather than silently dropping a subject's request.

Which failures should Celery retry and which should be dead-lettered?

Restrict autoretry_for to ConnectionError and TimeoutError and give transient 5xx faults capped exponential backoff with jitter disabled for audit reproducibility. Structural 4xx client errors bypass retries and route immediately to dsr.dlq.manual, because retrying them only burns the GDPR Article 12(3) and CCPA §1798.130 SLA window.

How should database timeouts relate to Celery task limits?

Enforce statement_timeout at or below task_time_limit, and task_time_limit at or below broker_heartbeat times three. This ordering guarantees no query outlives its task, so a worker killed mid-poll cannot leave an uncommitted cursor locking a compliance table.

What makes the audit trail acceptable to a regulator?

Every validated payload is serialized deterministically, hashed with SHA-256, and appended beside the task request id in an append-only WORM ledger, following the audit-and-accountability controls of NIST SP 800-53 Rev. 5. A consumer can verify extraction accuracy from the hash without re-exposing raw PII, and every success has exactly one ledger entry while every permanent failure has exactly one DLQ entry.

Implementing Celery for Async Polling in DSR Pipelines

Data Subject Request (DSR) fulfillment operates under statutory clocks — the “without undue delay and in any event within one month” mandate of GDPR Article 12(3) and the 45-day baseline of CCPA §1798.130(a)(2) — while the systems being polled are fragmented SaaS endpoints, legacy relational stores, and event-driven microservices. Anyone building distributed discovery hits the same wall: a single async event loop on one worker (the pattern described in Async Polling & Queue Management) cannot survive a process restart, cannot fan work across a fleet, and cannot enforce per-tenant isolation at the broker level. This page shows how to configure Celery as the distributed execution backbone for that stage of Cross-System Data Discovery & Sync, because Celery’s defaults — round-robin routing, ack-on-receipt, uncapped retries — are actively hostile to a compliance workload. The engineer who reaches this page has an async prototype and needs it to run durably, isolated per tenant, and with an audit trail a supervisory authority will accept.

The task lifecycle is what every configuration choice on this page protects: a validated DSR task is routed to a queue by sensitivity, polled with a bounded timeout, and either emitted to the audit trail on success, retried on a transient fault, or dead-lettered on a permanent one.

Prerequisites

Python 3.11+ — the code uses zoneinfo, PEP 604 unions, and modern type hints.
Celery 5.3+ with a broker. RabbitMQ is assumed because a topic exchange is required for the tenant-scoped routing keys below; Redis works but does not support topic routing natively.
Pydantic v2 (pydantic>=2.6) for boundary validation using ConfigDict and field_validator.
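httpx for bounded upstream polling and psycopg2 for the read-replica connection pool. The Step 2 example uses psycopg2.pool; psycopg 3.x also works but needs the separate psycopg_pool package and a different pool API.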
Infrastructure: a broker reachable over TLS, a read-only Postgres replica DSN, a hardware-backed secret manager (Vault or AWS Secrets Manager) for token rotation, and an append-only (WORM) store for the audit ledger. Never hardcode any of these — they arrive through environment variables validated at startup.
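
The startup validation mentioned above could be sketched as follows. This is a minimal illustration, not the project's actual loader: DSR_READ_REPLICA_DSN is the variable Step 2 reads, while DSR_BROKER_URL, DSR_LEDGER_URI, and the DsrSettings class are hypothetical names introduced here. It also assumes the broker URL uses the amqps:// scheme for AMQP over TLS.

```python
from dataclasses import dataclass
from typing import Mapping
from urllib.parse import urlsplit

# Hypothetical variable names, except DSR_READ_REPLICA_DSN (used in Step 2).
REQUIRED_VARS = ("DSR_BROKER_URL", "DSR_READ_REPLICA_DSN", "DSR_LEDGER_URI")


@dataclass(frozen=True)
class DsrSettings:
    broker_url: str
    read_replica_dsn: str
    ledger_uri: str


def load_settings(env: Mapping[str, str]) -> DsrSettings:
    """Fail fast if a required variable is missing or the broker URL is not TLS."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required environment variables: {', '.join(missing)}")
    # Assumption: the broker is addressed with the amqps:// scheme (AMQP over TLS).
    if urlsplit(env["DSR_BROKER_URL"]).scheme != "amqps":
        raise RuntimeError("DSR_BROKER_URL must use the amqps:// scheme")
    return DsrSettings(
        broker_url=env["DSR_BROKER_URL"],
        read_replica_dsn=env["DSR_READ_REPLICA_DSN"],
        ledger_uri=env["DSR_LEDGER_URI"],
    )
```

Calling load_settings(os.environ) once, before the Celery app is created, surfaces a missing secret as a startup error rather than as a failed poll hours later.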

This page assumes you have already stood up the single-process async poller and the typed pools from Database Connector Configuration; Celery replaces the ad-hoc event loop, not the connectors underneath it.

Step-by-step implementation

Step 1 — Declare a tenant-aware queue topology

Celery’s default round-robin routing collapses under DSR workloads, where PII classification, data-residency mandates under GDPR Article 44, and retention policy dictate execution priority. Replace implicit routing with explicit queues bound to dedicated worker pools: high-sensitivity extraction routes to dsr.pii.critical, bulk archival synchronization uses dsr.archive.standard, and terminal failures land in dsr.dlq.manual. Prefixing routing keys with environment and tenant ({env}.{tenant}.dsr.extract) makes cross-boundary bleed structurally impossible — a staging payload cannot traverse a production queue because the broker will not match the key.

Celery 5 uses lowercase configuration keys. The uppercase pre-4.0 variants are still read for backward compatibility, but mixing the two styles in one config raises ImproperlyConfigured, so keep everything lowercase.

# celery_config.py  (Celery 5.x, lowercase settings)
from kombu import Exchange, Queue

dsr_exchange = Exchange("dsr_exchange", type="topic")

task_queues: tuple[Queue, ...] = (
    Queue("dsr.pii.critical",     exchange=dsr_exchange, routing_key="prod.eu.dsr.extract"),
    Queue("dsr.archive.standard", exchange=dsr_exchange, routing_key="prod.us.dsr.archive"),
    Queue("dsr.dlq.manual",       exchange=dsr_exchange, routing_key="prod.*.dsr.dlq"),
)

task_default_exchange = "dsr_exchange"
task_default_exchange_type = "topic"
task_default_routing_key = "prod.eu.dsr.extract"

broker_heartbeat = 15          # detect stale workers before partial state corrupts
broker_pool_limit = 10
task_acks_late = True          # redeliver if a worker dies mid-poll
worker_prefetch_multiplier = 1 # long polls must not hoard prefetched tasks
task_reject_on_worker_lost = True

task_acks_late = True with worker_prefetch_multiplier = 1 is the pairing that makes at-least-once delivery safe for long-running polls: a task is acknowledged only after it completes, so a worker crash mid-extraction redelivers the task rather than silently dropping a subject’s request.
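
To see why the {env}.{tenant} prefix from Step 1 keeps environments apart, the sketch below builds a routing key and checks it against a binding pattern using AMQP topic rules (* matches exactly one word, # matches zero or more). build_routing_key and topic_matches are hypothetical helpers written for illustration; the broker performs the real matching.

```python
def build_routing_key(env: str, tenant: str, action: str) -> str:
    """Compose '{env}.{tenant}.dsr.{action}', rejecting wildcard or dotted segments."""
    for segment in (env, tenant, action):
        if not segment or any(ch in segment for ch in ".*#"):
            raise ValueError(f"invalid routing key segment: {segment!r}")
    return f"{env}.{tenant}.dsr.{action}"


def topic_matches(pattern: str, key: str) -> bool:
    """Illustrative re-implementation of AMQP topic matching for '*' and '#'."""
    p, k = pattern.split("."), key.split(".")

    def match(i: int, j: int) -> bool:
        if i == len(p):
            return j == len(k)
        if p[i] == "#":
            # '#' may consume zero or more words
            return any(match(i + 1, jj) for jj in range(j, len(k) + 1))
        if j == len(k):
            return False
        return (p[i] == "*" or p[i] == k[j]) and match(i + 1, j + 1)

    return match(0, 0)


prod_key = build_routing_key("prod", "eu", "dlq")
staging_key = build_routing_key("staging", "eu", "dlq")
print(topic_matches("prod.*.dsr.dlq", prod_key))     # True
print(topic_matches("prod.*.dsr.dlq", staging_key))  # False
```

Validating each segment before composing the key matters as much as the prefix itself: a tenant value containing a dot or wildcard could otherwise widen the match.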

Step 2 — Align database timeouts with the task lifecycle

When a worker is killed mid-poll, an uncommitted cursor can lock a compliance table indefinitely. Connection strings arrive only through environment variables with read-only scopes (PGSSLMODE=require, a server-side statement_timeout), and the cursor’s hard limit is synchronized with Celery’s task_time_limit so no query can outlive its task. Initialize the pool once at module level and wrap every extraction in a context manager that guarantees cursor closure regardless of outcome.

import os
from contextlib import contextmanager
from typing import Iterator

import psycopg2
from psycopg2.extensions import cursor as PgCursor
from psycopg2.pool import ThreadedConnectionPool

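# Initialize once at import time; shared across all tasks in the worker process.
# With the prefork pool, build this per child process instead (for example in a
# worker_process_init handler) so forked workers never share a socket.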
DSR_DB_POOL = ThreadedConnectionPool(
    minconn=2,
    maxconn=10,
    dsn=os.environ["DSR_READ_REPLICA_DSN"],
    connect_timeout=10,
    options="-c statement_timeout=30000",  # 30s server-side hard limit
)


@contextmanager
def get_dsr_cursor() -> Iterator[PgCursor]:
    """Yield a read-scoped cursor with guaranteed close, commit/rollback, and return."""
    conn = DSR_DB_POOL.getconn()
    try:
        with conn.cursor() as cur:   # cursor is closed on exit, success or failure
            yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        DSR_DB_POOL.putconn(conn)    # always hand the connection back to the pool
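
The cleanup guarantee of this pattern can be checked without a database. The sketch below repeats the same commit/rollback/return logic against stand-in objects. FakePool, FakeConn, FakeCursor, and pooled_cursor are hypothetical test doubles, not part of psycopg2.

```python
from contextlib import contextmanager
from typing import Any, Iterator


class FakeCursor:
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class FakeConn:
    def __init__(self) -> None:
        self.events: list[str] = []
        self.last_cursor: FakeCursor | None = None

    def cursor(self) -> FakeCursor:
        self.last_cursor = FakeCursor()
        return self.last_cursor

    def commit(self) -> None:
        self.events.append("commit")

    def rollback(self) -> None:
        self.events.append("rollback")


class FakePool:
    def __init__(self) -> None:
        self.conn = FakeConn()
        self.returned = 0

    def getconn(self) -> FakeConn:
        return self.conn

    def putconn(self, conn: FakeConn) -> None:
        self.returned += 1


@contextmanager
def pooled_cursor(pool: Any) -> Iterator[Any]:
    """Same shape as get_dsr_cursor, parameterized on the pool for testing."""
    conn = pool.getconn()
    cur = conn.cursor()
    try:
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()         # runs on success and on failure
        pool.putconn(conn)  # connection always goes back to the pool


pool = FakePool()
try:
    with pooled_cursor(pool):
        raise RuntimeError("simulated extraction failure")
except RuntimeError:
    pass
print(pool.conn.events, pool.returned)  # ['rollback'] 1
```

This only covers failures that raise inside the Python process. A worker that is killed outright never reaches the finally block, which is why the server-side statement_timeout stays in place as the backstop.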

Step 3 — Encode deterministic retry semantics and DLQ fallback

Retry logic in a DSR pipeline must be deterministic, not probabilistic, so the audit trail is reproducible. Pass autoretry_for, retry_backoff, and retry_backoff_max to the @app.task() decorator so the policy sits beside the task it governs. Recent Celery releases also accept these options as class attributes on a Task base class; pick one style and do not split a policy across both. Transient faults (429/503, connection resets, timeouts) get exponential backoff capped at a strict SLA boundary; structural 4xx client errors bypass retries and route straight to the dead-letter queue for manual triage. Every retry preserves self.request.id, maintaining the immutable audit trail that the GDPR Article 17 erasure record and CCPA §1798.130 response record both depend on. This mirrors the transient-versus-permanent failure taxonomy defined for SaaS API Sync Strategies.

import time

import httpx
from celery import Celery, Task

app = Celery("dsr_worker")
app.config_from_object("celery_config")


class ComplianceTask(Task):
    """Base task that emits a structured compliance event on every retry."""

    def on_retry(self, exc: Exception, task_id: str, args, kwargs, einfo) -> None:
        self.log_compliance_event(
            "TASK_RETRY",
            {
                "task_id": task_id,
                "attempt": self.request.retries,
                "error_type": type(exc).__name__,
                "timestamp": time.time(),
                "tenant": kwargs.get("tenant_id"),
            },
        )

    def log_compliance_event(self, event_type: str, payload: dict) -> None:
        """Emit to the centralized SIEM / structured logger (implement per stack)."""
        ...


@app.task(
    base=ComplianceTask,
    bind=True,
    name="dsr.poll_endpoint",
    autoretry_for=(ConnectionError, TimeoutError),
    max_retries=5,
    retry_backoff=True,
    retry_backoff_max=3600,   # cap well inside the statutory SLA window
    retry_jitter=False,       # deterministic backoff for audit reproducibility
)
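def poll_endpoint(self, tenant_id: str, endpoint_url: str, auth_token: str) -> dict | None:
    """Poll one discovery endpoint; retry transient faults, DLQ permanent ones."""
    try:
        with httpx.Client(timeout=30) as client:
            resp = client.get(endpoint_url, headers={"Authorization": f"Bearer {auth_token}"})
            resp.raise_for_status()
            return resp.json()
    # httpx exceptions do not subclass the builtins listed in autoretry_for,
    # so translate them; TimeoutException must precede its parent TransportError.
    except httpx.TimeoutException as exc:
        raise TimeoutError(str(exc)) from exc
    except httpx.TransportError as exc:
        raise ConnectionError(str(exc)) from exc
    except httpx.HTTPStatusError as exc:
        status = exc.response.status_code
        if 400 <= status < 500 and status != 429:
            # Permanent client error: hand to the DLQ, treat as handled (do NOT re-raise).
            route_to_dlq.apply_async(
                kwargs={"tenant_id": tenant_id, "reason": str(exc)},
                queue="dsr.dlq.manual",
            )
            return None
        # Anything else (429, 5xx) is treated as transient: raise a retryable type.
        raise ConnectionError(f"transient upstream status {status}") from exc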


@app.task(name="dsr.route_to_dlq")
def route_to_dlq(tenant_id: str, reason: str) -> None:
    """Persist a permanently-failed payload for manual compliance triage."""
    ...
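
With retry_jitter disabled, the retry delays become a fixed sequence that can be written down in advance. The sketch below reproduces that sequence. It assumes Celery's documented backoff rule: the first retry waits 1 second and each later delay doubles, capped at retry_backoff_max, and a numeric retry_backoff acts as a multiplier. backoff_schedule is a hypothetical helper, not a Celery API.

```python
def backoff_schedule(max_retries: int, factor: int = 1, maximum: int = 3600) -> list[int]:
    """Delay in seconds before each retry: min(maximum, factor * 2**attempt)."""
    return [min(maximum, factor * 2 ** attempt) for attempt in range(max_retries)]


schedule = backoff_schedule(max_retries=5)
print(schedule)       # [1, 2, 4, 8, 16]
print(sum(schedule))  # 31
```

Under these assumptions, five retries with retry_backoff=True finish in about 31 seconds of waiting, so the 3600-second cap is never reached. If upstream outages are expected to last longer than that, a larger integer retry_backoff or a higher max_retries may be needed, still bounded by the SLA window.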

Step 4 — Validate and cryptographically seal every payload

Every payload extracted during async polling must pass schema validation before persistence — a malformed response can corrupt a downstream state machine or raise a false compliance alert. Validate at the task boundary with Pydantic v2, then serialize deterministically, hash with SHA-256, and append the hash beside self.request.id in the WORM audit ledger. Cryptographic hashing lets a downstream consumer verify extraction accuracy without ever re-exposing raw PII, satisfying the audit-and-accountability controls of NIST SP 800-53 Rev. 5 (AU family). This is the same contract enforced site-wide by Schema Validation Rules.

import hashlib
import json
from typing import Literal

from pydantic import BaseModel, ConfigDict, field_validator


class DSRExtractPayload(BaseModel):
    """Validated extraction result sealed into the audit ledger."""

    model_config = ConfigDict(str_strip_whitespace=True, extra="forbid")

    subject_id: str
    data_category: Literal["profile", "activity", "financial"]
    records: list[dict]

    @field_validator("subject_id")
    @classmethod
    def opaque_identifier(cls, v: str) -> str:
        """Reject anything but an opaque token so raw PII never enters the ledger key."""
        if not v.replace("-", "").isalnum():
            raise ValueError("subject_id must be an opaque alphanumeric token")
        return v


def validate_and_seal(raw: dict) -> dict:
    """Validate against the DSR schema and attach a SHA-256 integrity hash."""
    payload = DSRExtractPayload(**raw)                       # raises on malformed input
    # Sort keys so nested record dicts hash identically regardless of insertion order.
    canonical = json.dumps(payload.model_dump(mode="json"), sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**payload.model_dump(), "integrity_hash": digest}

extra="forbid" enforces data minimization at the boundary per GDPR Article 5(1)(c): any undeclared field is rejected rather than silently persisted into a regulated ledger.
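
One way the ledger side of this step might look is sketched below: the entry stores only the task id and the digest, and a consumer holding the payload recomputes the digest to verify it. The sketch assumes the seal is computed over a compact, sorted-key JSON serialization. canonical_bytes, ledger_entry, and verify are hypothetical names, and the WORM store itself is out of scope.

```python
import hashlib
import hmac
import json


def canonical_bytes(payload: dict) -> bytes:
    """Assumed canonical form: compact JSON with sorted keys."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal(payload: dict) -> str:
    return hashlib.sha256(canonical_bytes(payload)).hexdigest()


def ledger_entry(task_id: str, payload: dict) -> dict:
    """Record the task id and digest only; raw records never enter the ledger."""
    return {"task_id": task_id, "integrity_hash": seal(payload)}


def verify(entry: dict, payload: dict) -> bool:
    """Recompute the digest from the payload and compare in constant time."""
    return hmac.compare_digest(entry["integrity_hash"], seal(payload))


original = {"subject_id": "abc-123", "data_category": "profile", "records": [{"a": 1, "b": 2}]}
reordered = {"records": [{"b": 2, "a": 1}], "data_category": "profile", "subject_id": "abc-123"}
entry = ledger_entry("task-0001", original)
print(verify(entry, reordered))  # True
```

Because the entry carries no record content, an auditor can confirm that a delivered export matches what was extracted without the ledger ever holding PII.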

Configuration reference

Parameter	Type	Value used here	Compliance note
`task_acks_late`	`bool`	`True`	Ack after completion so a worker crash redelivers rather than drops a subject’s request.
`worker_prefetch_multiplier`	`int`	`1`	Prevents long polls from hoarding prefetched tasks and stalling a partition.
`task_reject_on_worker_lost`	`bool`	`True`	Requeues a task whose worker vanished, closing an at-least-once gap.
`broker_heartbeat`	`int` (s)	`15`	Detects stale workers before partial extraction state corrupts a table.
`retry_backoff_max`	`int` (s)	`3600`	Cap must stay inside the GDPR Art. 12(3) / CCPA §1798.130 SLA window.
`retry_jitter`	`bool`	`False`	Deterministic backoff keeps retry timing reproducible for auditors.
`max_retries`	`int`	`5`	Bounded so a poison task cannot burn the clock indefinitely.
`statement_timeout`	`int` (ms)	`30000`	Must be ≤ `task_time_limit` so no query outlives its task.

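The safe ordering to enforce in review is retry_backoff_max ≤ contractual SLA window, and statement_timeout ≤ task_time_limit ≤ broker_heartbeat × 3, with all three compared in the same unit (statement_timeout is set in milliseconds, the Celery settings in seconds). The Step 1 config does not set task_time_limit, so define it explicitly before relying on this ordering.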
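
The ordering above can be turned into a check that runs in CI or at worker startup. The following is a minimal sketch: check_timeout_ordering is a hypothetical helper, and the 45-second task_time_limit and 30-day SLA window in the example call are assumed values, not settings taken from this page.

```python
def check_timeout_ordering(
    *,
    statement_timeout_ms: int,
    task_time_limit_s: int,
    broker_heartbeat_s: int,
    retry_backoff_max_s: int,
    sla_window_s: int,
) -> list[str]:
    """Return a list of violated constraints; an empty list means the ordering holds."""
    problems: list[str] = []
    # statement_timeout is in milliseconds; convert before comparing.
    if statement_timeout_ms / 1000 > task_time_limit_s:
        problems.append("statement_timeout exceeds task_time_limit")
    if task_time_limit_s > broker_heartbeat_s * 3:
        problems.append("task_time_limit exceeds broker_heartbeat x 3")
    if retry_backoff_max_s > sla_window_s:
        problems.append("retry_backoff_max exceeds the SLA window")
    return problems


problems = check_timeout_ordering(
    statement_timeout_ms=30_000,
    task_time_limit_s=45,          # assumed value; not set in the Step 1 config
    broker_heartbeat_s=15,
    retry_backoff_max_s=3600,
    sla_window_s=30 * 24 * 3600,   # assumed 30-day window
)
print(problems)  # []
```

Returning every violation rather than raising on the first one gives a reviewer the full picture in a single run.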

Verification

Confirm correctness before trusting the pipeline with real subject data. Assert that permanent failures never retry, that queues map to worker pools, and that the audit hash is stable and reproducible.

import hashlib
import json

from celery.contrib.testing.worker import start_worker


def test_permanent_4xx_routes_to_dlq(monkeypatch) -> None:
    """A 403 must be handed to the DLQ and return None, never re-raising."""
    calls: list[dict] = []
    monkeypatch.setattr(
        "dsr_worker.route_to_dlq",
        type("Stub", (), {"apply_async": staticmethod(lambda **kw: calls.append(kw))}),
    )
    # ... invoke poll_endpoint against a stub returning 403 ...
    assert calls and calls[0]["queue"] == "dsr.dlq.manual"


def test_integrity_hash_is_deterministic() -> None:
    """Two logically identical payloads must produce the same SHA-256 seal."""
    a = validate_and_seal({"subject_id": "abc-123", "data_category": "profile", "records": []})
    b = validate_and_seal({"subject_id": "abc-123", "data_category": "profile", "records": []})
    assert a["integrity_hash"] == b["integrity_hash"]

At startup, the worker banner should list every declared queue (dsr.pii.critical, dsr.archive.standard, dsr.dlq.manual) under its [queues] section, followed by a celery@w1 ready. log line. The compliance assertion to hold under load: every task that reaches SUCCESS has exactly one ledger entry keyed by self.request.id, and every PERMANENT failure has exactly one dsr.dlq.manual entry. No task may be absent from both the ledger and the DLQ.
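
The ledger/DLQ invariant can be checked mechanically after a load test. The sketch below assumes that each DLQ entry records the id of the task that failed and that terminal states are exported as the strings SUCCESS and PERMANENT. reconcile is a hypothetical helper, and its inputs would come from your own ledger and DLQ exports.

```python
from collections import Counter


def reconcile(outcomes: dict[str, str], ledger_ids: list[str], dlq_ids: list[str]) -> list[str]:
    """List every task that breaks the exactly-one-ledger-or-DLQ-entry invariant."""
    ledger, dlq = Counter(ledger_ids), Counter(dlq_ids)
    violations: list[str] = []
    for task_id, state in sorted(outcomes.items()):
        if state == "SUCCESS":
            ok = ledger[task_id] == 1 and dlq[task_id] == 0
        elif state == "PERMANENT":
            ok = dlq[task_id] == 1 and ledger[task_id] == 0
        else:
            continue  # pending or retrying tasks are not yet covered by the invariant
        if not ok:
            violations.append(
                f"{task_id}: state={state} ledger={ledger[task_id]} dlq={dlq[task_id]}"
            )
    return violations


outcomes = {"t1": "SUCCESS", "t2": "PERMANENT", "t3": "SUCCESS"}
print(reconcile(outcomes, ledger_ids=["t1"], dlq_ids=["t2"]))
# ['t3: state=SUCCESS ledger=0 dlq=0']
```

Counting entries rather than checking membership also catches duplicates, which at-least-once delivery can produce if the ledger write is not idempotent.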

Troubleshooting

Tasks silently vanish after a worker restart
Root cause: task_acks_late left at its default False, so tasks are acked on receipt and lost when the worker dies mid-poll.
Fix: set task_acks_late = True and task_reject_on_worker_lost = True; pair with worker_prefetch_multiplier = 1.

ImproperlyConfigured on startup
Root cause: mixing pre-4.0 uppercase keys (CELERY_TASK_QUEUES) with Celery 5 lowercase keys in the same config.
Fix: convert every setting to lowercase.

Staging payloads appear on production queues
Root cause: a default exchange or a routing key without the {env}.{tenant} prefix, so the topic exchange matches too broadly.
Fix: bind every queue through the topic exchange with an environment- and tenant-scoped routing key and remove any direct-exchange fallback.

Retry storm exhausts the upstream API quota
Root cause: autoretry_for catching a broad Exception, so 4xx client errors are retried like transient faults.
Fix: restrict autoretry_for to (ConnectionError, TimeoutError) and route 4xx explicitly to the DLQ, as in Step 3.

Locked compliance table during network partitions
Root cause: a cursor whose lifetime exceeds the task’s, leaving an uncommitted transaction when the worker is killed.
Fix: set the server-side statement_timeout ≤ task_time_limit and wrap every query in the get_dsr_cursor() context manager so the connection is always returned.

Related

Async Polling & Queue Management — the parent stage: bounded async ingestion, priority routing, and failure categorization this Celery setup makes distributed.
Database Connector Configuration — the typed, pooled connectors and statement_timeout alignment the tasks above depend on.
SaaS API Sync Strategies — the transient-versus-permanent failure taxonomy this retry policy encodes.
Schema Validation Rules — the Pydantic gate that rejects malformed payloads before they reach the audit ledger.
Cross-System Data Discovery & Sync — the discovery stage this queue layer belongs to.
