Connecting PostgreSQL and Snowflake for DSR Discovery

Discovery for a Data Subject Request (DSR) almost always has to span two very different stores at once: a transactional PostgreSQL database that holds the live operational record of a subject, and a Snowflake warehouse that holds the historical, analytical, and derived copies of that same subject. For anyone building the Database Connector Configuration layer, the bridge between these two engines is the hardest and most compliance-sensitive part of the job: it is where connection pools exhaust under concurrent sweeps, where TLS silently degrades, and where a JSONB column quietly loses structure on its way into a VARIANT. Within the broader Cross-System Data Discovery & Sync architecture, this page is the concrete recipe for making that Postgres-to-Snowflake extraction deterministic, read-only, and provably bounded in time. The goal is that records reach the Schema Validation Rules gate intact, and that the statutory response window (one month under GDPR Article 12(3), 45 days under CCPA §1798.130(a)(2)) is never at the mercy of a leaked cursor.

The bridge is not a one-off ETL script. It is a compliance control point that must fail closed, never mutate source data, and emit enough telemetry to reconstruct exactly what it read from each engine and when. The steps below walk through initializing both connectors safely, coercing types across the relational/semi-structured boundary without corruption, gating on schema drift, and routing failures to a dead-letter queue rather than dropping a subject’s records.

Prerequisites

This implementation targets Python 3.11+ (for datetime.UTC and mature asyncio semantics) and the following libraries:

psycopg2-binary (2.9+) — PostgreSQL driver with a thread-safe connection pool.
snowflake-connector-python (3.x) — the official Snowflake connector, which manages TLS and OCSP internally.
pydantic (v2.5+) — for typed, frozen configuration models validated at startup.
A secret manager (HashiCorp Vault or AWS Secrets Manager) for credential injection and rotation, consistent with the key-lifecycle guidance in NIST SP 800-57 Part 1.

Both source engines must expose a read-only role scoped to the tables in the discovery manifest. The connector must never hold write or DDL privileges — least-privilege access is what lets you attest to a regulator that discovery could not have mutated a subject’s record. Credentials are resolved at call time, never baked into images or committed to configuration.

Step-by-step implementation

1. Model and validate connector configuration

Configuration drift is a leading cause of silent DSR corruption, so both endpoints are described by a frozen Pydantic v2 model that fails fast at startup if a required field is missing or malformed. This mirrors the typed-config discipline established across the Database Connector Configuration layer.

from pydantic import BaseModel, ConfigDict, SecretStr, field_validator


class BridgeConfig(BaseModel):
    """Validated, immutable configuration for the Postgres->Snowflake bridge."""

    model_config = ConfigDict(frozen=True)

    pg_dsn: SecretStr
    sf_account: str
    sf_user: str
    sf_password: SecretStr
    sf_warehouse: str
    sf_database: str
    sf_schema: str
    egress_proxy_host: str | None = None
    egress_proxy_port: int = 8080
    statement_timeout_ms: int = 30_000
    max_drift_pct: float = 5.0

    @field_validator("max_drift_pct")
    @classmethod
    def drift_in_range(cls, v: float) -> float:
        if not 0.0 <= v <= 100.0:
            raise ValueError("max_drift_pct must be between 0 and 100")
        return v

Because the model is frozen=True, no code path can mutate a resolved connection string mid-sweep, and SecretStr keeps credentials out of logs and repr() output.

2. Initialize a bounded, time-budgeted PostgreSQL pool

PostgreSQL drivers expose edge cases when querying high-cardinality PII tables under heavy OLTP load. A common failure mode is pool exhaustion during concurrent discovery sweeps: psycopg2's ThreadedConnectionPool does not block when every connection is checked out, so getconn() raises psycopg2.pool.PoolError, while connections the server drops mid-sweep surface as psycopg2.OperationalError: server closed the connection unexpectedly. The pool is initialized once at module level (not inside the context manager) so it is shared across all workers, and every session carries a hard statement_timeout.

import os
import logging
from contextlib import contextmanager

import psycopg2
import psycopg2.pool

logger = logging.getLogger("dsr.pipeline")

# Initialize once at import time; reuse across all workers.
_PG_POOL = psycopg2.pool.ThreadedConnectionPool(
    minconn=2,
    maxconn=10,
    dsn=os.environ["PG_DSN"],
    connect_timeout=5,
    options="-c statement_timeout=30000 -c default_transaction_read_only=on",
)


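@contextmanager
def get_pg_cursor():
    """Check out a read-only, health-checked cursor with guaranteed cleanup."""
    conn = _PG_POOL.getconn()  # Raises psycopg2.pool.PoolError if the pool is exhausted.
    broken = False
    try:
        with conn.cursor() as health:
            health.execute("SELECT 1")  # Pre-flight liveness check.
        with conn.cursor() as cur:
            yield cur  # The cursor is closed when the caller's block exits.
        conn.commit()
    except psycopg2.OperationalError as exc:
        broken = True  # The connection may be dead; do not hand it to another worker.
        logger.error("Postgres connection error during discovery: %s", exc)
        raise
    except Exception:
        if not conn.closed:
            conn.rollback()
        raise
    finally:
        # close=True discards a suspect connection instead of returning it to the pool.
        _PG_POOL.putconn(conn, close=broken)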

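default_transaction_read_only=on makes every transaction read-only by default, and the 30-second statement_timeout bounds every query so a pathological plan cannot hold a connection open past the sweep window. Because a session can override that default with SET, the read-only role from the prerequisites remains the actual least-privilege guarantee; the session setting is defense in depth. The context manager guarantees the connection is always returned to the pool, so a failed query cannot leak a connection and starve later sweeps into a missed statutory deadline.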
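Since ThreadedConnectionPool.getconn() raises as soon as the pool is empty rather than waiting, one way to make concurrent sweeps queue for a connection is to put a semaphore in front of the pool. The following is a minimal sketch under that assumption; BoundedCheckout, max_concurrent, and wait_timeout_s are hypothetical names, and the pool argument is any object exposing getconn() and putconn().

```python
import threading
from contextlib import contextmanager


class BoundedCheckout:
    """Hypothetical wrapper: wait for a free slot instead of failing on an empty pool."""

    def __init__(self, pool, max_concurrent: int, wait_timeout_s: float = 5.0):
        self._pool = pool  # Any object exposing getconn() / putconn(conn).
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._wait_timeout_s = wait_timeout_s

    @contextmanager
    def connection(self):
        # Wait at most wait_timeout_s so the sweep stays time-budgeted.
        if not self._slots.acquire(timeout=self._wait_timeout_s):
            raise TimeoutError("no free pool slot within the wait budget")
        try:
            conn = self._pool.getconn()
            try:
                yield conn
            finally:
                self._pool.putconn(conn)
        finally:
            self._slots.release()
```

Setting max_concurrent equal to the pool's maxconn means a worker either gets a connection or fails with an explicit TimeoutError after a known delay, which is easier to alert on than an intermittent pool error.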

3. Open an OCSP fail-closed Snowflake session

TLS handshake failures against Snowflake usually stem from mismatched OCSP response caching or corporate proxy interception. The Snowflake connector manages TLS internally, so you pass ocsp_fail_open=False directly as a connection parameter rather than constructing an ssl.SSLContext (the connector does not accept an ssl_context argument).

import os

import snowflake.connector


def build_snowflake_session() -> "snowflake.connector.SnowflakeConnection":
    """Open a fail-closed Snowflake session routed through the egress proxy."""
    return snowflake.connector.connect(
        user=os.environ["SF_USER"],
        password=os.environ["SF_PASSWORD"],
        account=os.environ["SF_ACCOUNT"],
        warehouse=os.environ["SF_WAREHOUSE"],
        database=os.environ["SF_DATABASE"],
        schema=os.environ["SF_SCHEMA"],
        ocsp_fail_open=False,       # Refuse to transmit if revocation status is unknown.
        proxy_host=os.environ.get("EGRESS_PROXY_HOST"),
        proxy_port=int(os.environ.get("EGRESS_PROXY_PORT", "8080")),
        network_timeout=15,
        login_timeout=10,
        session_parameters={"QUERY_TAG": "dsr_discovery"},
    )

With ocsp_fail_open=False, the pipeline refuses to move PII if certificate-revocation status cannot be verified — a fail-closed posture consistent with the transit-protection expectations in NIST SP 800-52 Rev. 2. The QUERY_TAG makes every discovery query attributable in Snowflake’s QUERY_HISTORY, which is part of the audit trail.
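The session builder above passes a proxy port even when no proxy host is set. A small helper that assembles the keyword arguments first makes that case explicit and keeps the fail-closed settings in one testable place. This is a sketch, not the connector's own API: snowflake_connect_kwargs is a hypothetical name, and it only reuses the parameters and environment variable names already shown in this step.

```python
from collections.abc import Mapping


def snowflake_connect_kwargs(env: Mapping[str, str]) -> dict:
    """Build keyword arguments for snowflake.connector.connect from an env-like mapping."""
    kwargs = {
        "user": env["SF_USER"],
        "password": env["SF_PASSWORD"],
        "account": env["SF_ACCOUNT"],
        "warehouse": env["SF_WAREHOUSE"],
        "database": env["SF_DATABASE"],
        "schema": env["SF_SCHEMA"],
        "ocsp_fail_open": False,  # Fail closed on unknown revocation status.
        "network_timeout": 15,
        "login_timeout": 10,
        "session_parameters": {"QUERY_TAG": "dsr_discovery"},
    }
    proxy_host = env.get("EGRESS_PROXY_HOST")
    if proxy_host:
        # Only send proxy settings when a proxy is actually configured.
        kwargs["proxy_host"] = proxy_host
        kwargs["proxy_port"] = int(env.get("EGRESS_PROXY_PORT", "8080"))
    return kwargs
```

The session is then opened with snowflake.connector.connect(**snowflake_connect_kwargs(os.environ)), and a unit test can assert that ocsp_fail_open is False without touching the network.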

4. Gate on schema drift before any transfer

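Column presence and type must be reconciled across the two engines before a single row moves, or a downstream Snowflake audit will later flag inconsistencies you cannot explain. This pre-flight diff compares column presence in the two information_schema.columns catalogs and halts the sweep if the share of Postgres columns missing from Snowflake exceeds the configured threshold. It also fetches data_type for each column, but type compatibility needs an explicit cross-engine mapping because the two catalogs name types differently.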

def validate_schema_drift(pg_cur, sf_cur, table_name: str, max_drift_pct: float = 5.0) -> None:
    """Compare column presence across engines; raise if drift exceeds threshold.

    Matches on table name only; add a table_schema filter if the name is not unique.
    """
    pg_cur.execute(
        "SELECT column_name, data_type FROM information_schema.columns "
        "WHERE table_name = %s",
        (table_name,),
    )
    pg_cols = {row[0].lower(): row[1] for row in pg_cur.fetchall()}
    if not pg_cols:
        # Fail closed: an unknown or unreadable table must not pass as 0% drift.
        raise RuntimeError(f"No columns found for {table_name!r} in PostgreSQL; halting sweep")

    sf_cur.execute(
        "SELECT column_name, data_type FROM information_schema.columns "
        "WHERE table_name = %s",
        (table_name.upper(),),
    )
    sf_cols = {row[0].lower(): row[1] for row in sf_cur.fetchall()}

    # Drift is the share of Postgres columns that have no Snowflake counterpart.
    missing_in_sf = set(pg_cols) - set(sf_cols)
    drift_ratio = len(missing_in_sf) / len(pg_cols)
    if drift_ratio > (max_drift_pct / 100):
        raise RuntimeError(
            f"Schema drift {drift_ratio:.2%} exceeds {max_drift_pct:.1f}% budget; "
            f"columns missing in Snowflake: {sorted(missing_in_sf)}"
        )

Halting here — rather than transferring partial rows — is what keeps the discovery manifest trustworthy. A malformed manifest is worse than a delayed one, because it produces an access response (GDPR Article 15) that omits records the subject is entitled to see.
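The gate above checks column presence only. Reconciling types as well needs a mapping, because the two catalogs report different type names for equivalent columns. The sketch below shows one possible shape for that check. PG_TO_SF_COMPATIBLE and find_type_mismatches are hypothetical names, and the mapping is a deliberately incomplete assumption that should be verified against the data_type values your own catalogs return.

```python
# Assumed, incomplete mapping: Postgres data_type -> acceptable Snowflake data_type values.
PG_TO_SF_COMPATIBLE = {
    "integer": {"NUMBER"},
    "bigint": {"NUMBER"},
    "numeric": {"NUMBER"},
    "text": {"TEXT"},
    "character varying": {"TEXT"},
    "uuid": {"TEXT"},
    "boolean": {"BOOLEAN"},
    "date": {"DATE"},
    "jsonb": {"VARIANT"},
    "timestamp with time zone": {"TIMESTAMP_TZ", "TIMESTAMP_LTZ"},
    "timestamp without time zone": {"TIMESTAMP_NTZ"},
}


def find_type_mismatches(
    pg_cols: dict[str, str], sf_cols: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Return {column: (pg_type, sf_type)} for columns present in both but incompatible."""
    mismatches = {}
    for name, pg_type in pg_cols.items():
        sf_type = sf_cols.get(name)
        if sf_type is None:
            continue  # Presence drift is reported by the presence gate, not here.
        allowed = PG_TO_SF_COMPATIBLE.get(pg_type.lower())
        # Unmapped Postgres types count as mismatches so the check fails closed.
        if allowed is None or sf_type.upper() not in allowed:
            mismatches[name] = (pg_type, sf_type)
    return mismatches
```

Both arguments have the same shape as the pg_cols and sf_cols dictionaries built inside validate_schema_drift, so the check could run right after the presence diff and raise if the returned dictionary is non-empty.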

5. Coerce JSONB to VARIANT without losing fidelity

PostgreSQL JSONB structures containing nested PII arrays must serialize into Snowflake VARIANT columns without losing structural fidelity. A deterministic encoder handles the non-native types (datetime, UUID, Decimal, bytes) that would otherwise crash json.dumps mid-sweep.

import json
import uuid
from datetime import datetime, date
from decimal import Decimal


class DSRSerializer(json.JSONEncoder):
    """Deterministic JSON encoder with explicit fallbacks for non-native types."""

    def default(self, obj: object) -> object:
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        if isinstance(obj, uuid.UUID):
            return str(obj)
        if isinstance(obj, Decimal):
            return str(obj)  # Preserve precision; stored in VARIANT as a string, so cast on read.
        if isinstance(obj, bytes):
            return obj.decode("utf-8", errors="replace")  # Lossy for bytes that are not valid UTF-8.
        return super().default(obj)


def serialize_jsonb_to_variant(pii_payload: dict) -> str:
    """Serialize a JSONB payload for insertion into a Snowflake VARIANT column."""
    return json.dumps(pii_payload, cls=DSRSerializer, ensure_ascii=False, sort_keys=True)

Decimal is serialized as a string rather than coerced to float, so monetary or high-precision identifier values survive the round trip intact; sort_keys=True makes the output byte-stable so identical payloads hash identically for the deduplication stage.
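To illustrate the byte-stability claim, the sketch below fingerprints an already-serialized payload. payload_fingerprint is a hypothetical helper, and SHA-256 is an assumption here; the source does not say which hash the deduplication stage uses.

```python
import hashlib
import json


def payload_fingerprint(serialized: str) -> str:
    """Return the SHA-256 hex digest of an already-serialized payload."""
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Two dicts with the same content but different key insertion order.
first = json.dumps({"email": "x@example.com", "tags": ["a", "b"]},
                   sort_keys=True, ensure_ascii=False)
second = json.dumps({"tags": ["a", "b"], "email": "x@example.com"},
                    sort_keys=True, ensure_ascii=False)

# sort_keys=True yields the same string, and therefore the same digest, for both.
same_digest = payload_fingerprint(first) == payload_fingerprint(second)
```

Key order is normalized but array order is not: ["a", "b"] and ["b", "a"] serialize differently and produce different fingerprints, which is usually the desired behavior for ordered PII arrays.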

Configuration reference

Parameter	Type	Default	Compliance note
`statement_timeout_ms`	int	`30000`	Bounds each Postgres query so no cursor can hold a connection past the sweep window that feeds the Article 12(3) clock.
`maxconn` (pool)	int	`10`	Caps concurrent sessions; prevents pool exhaustion from stalling a subject’s discovery mid-sweep.
`default_transaction_read_only`	str	`on`	Makes transactions read-only by default at the session level; the read-only role remains the enforcing control, since a session can override this setting.
`ocsp_fail_open`	bool	`False`	Fail-closed: refuse to transmit PII if certificate-revocation status is unverifiable.
`network_timeout` (Snowflake)	int	`15` (seconds)	Bounds warehouse round trips so a hung load cannot silently blow the deadline.
`max_drift_pct`	float	`5.0`	Halts the sweep if the two catalogs diverge beyond this ratio, protecting manifest completeness.
`QUERY_TAG`	str	`dsr_discovery`	Makes every query attributable in Snowflake `QUERY_HISTORY` for the audit trail.

Verification

Confirm correctness with a unit test that asserts the serializer is lossless for the awkward types and that the drift gate fails closed. Discovery is a pure, deterministic stage, so identical inputs must always yield identical output.

import json
import uuid
from datetime import datetime, timezone
from decimal import Decimal

import pytest


def test_serializer_roundtrips_non_native_types():
    payload = {
        "id": uuid.UUID("12345678-1234-5678-1234-567812345678"),
        "seen_at": datetime(2026, 7, 1, 9, 30, tzinfo=timezone.utc),
        "balance": Decimal("1042.55"),
    }
    out = json.loads(serialize_jsonb_to_variant(payload))
    assert out["id"] == "12345678-1234-5678-1234-567812345678"
    assert out["seen_at"].startswith("2026-07-01T09:30")
    assert out["balance"] == "1042.55"  # Precision preserved as a string.


def test_drift_gate_fails_closed(pg_cur_stub, sf_cur_stub):
    # Postgres has an email column Snowflake is missing -> 50% drift.
    with pytest.raises(RuntimeError, match="exceeds"):
        validate_schema_drift(pg_cur_stub, sf_cur_stub, "subjects", max_drift_pct=5.0)
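
The drift test relies on two fixtures, pg_cur_stub and sf_cur_stub, that are not defined on this page. One way to provide them is sketched below, assuming a cursor stand-in that replays canned catalog rows is enough. StubCursor and the two factory functions are hypothetical names.

```python
class StubCursor:
    """Minimal stand-in for a DB-API cursor that replays canned catalog rows."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.executed = []  # (sql, params) pairs, kept for assertions.

    def execute(self, sql, params=None):
        self.executed.append((sql, params))

    def fetchall(self):
        return list(self._rows)


def make_pg_cur_stub() -> StubCursor:
    # Postgres side: two columns, lower-case names as the catalog reports them.
    return StubCursor([("id", "uuid"), ("email", "text")])


def make_sf_cur_stub() -> StubCursor:
    # Snowflake side: the email column is missing, so half the columns drift.
    return StubCursor([("ID", "TEXT")])


# In conftest.py (assumes pytest is installed), the factories become fixtures:
#
#     @pytest.fixture
#     def pg_cur_stub():
#         return make_pg_cur_stub()
#
#     @pytest.fixture
#     def sf_cur_stub():
#         return make_sf_cur_stub()
```

With these rows, one of the two Postgres columns is absent from Snowflake, which gives the 50% drift that the test comment describes and exceeds the 5% budget.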

In production logs you should expect one structured line per sweep boundary carrying a trace ID that spans the Postgres extraction, serialization, and Snowflake phases, plus a SELECT 1 health check per checked-out connection. A clean run emits no compliance_gate_violation metric; any drift or serialization failure emits exactly one, tagged with the drift type and a hashed subject identifier (never the raw value).
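A sweep-boundary log line of that kind could be produced as sketched below. The field names (event, phase, trace_id, subject_hash) and the helper names are hypothetical, and using a keyed HMAC-SHA-256 for the hashed subject identifier is an assumption; the source only requires that the raw value never appears.

```python
import hashlib
import hmac
import json
import logging
import uuid

logger = logging.getLogger("dsr.pipeline")


def hash_subject_id(subject_id: str, key: bytes) -> str:
    """Keyed hash, so the raw identifier never reaches the logs."""
    return hmac.new(key, subject_id.encode("utf-8"), hashlib.sha256).hexdigest()


def sweep_log_line(phase: str, trace_id: str, subject_id: str, key: bytes, **fields) -> str:
    """Render one structured log line for a sweep boundary."""
    record = {
        "event": "sweep_boundary",
        "phase": phase,  # e.g. "pg_extract", "serialize", "sf_load"
        "trace_id": trace_id,  # Same value across all phases of one sweep.
        "subject_hash": hash_subject_id(subject_id, key),
        **fields,
    }
    return json.dumps(record, sort_keys=True)


def new_trace_id() -> str:
    """Generate a trace ID once per sweep and pass it to every phase."""
    return uuid.uuid4().hex
```

A caller would emit logger.info(sweep_log_line("pg_extract", trace_id, subject_id, key)) at each boundary. The key should come from the secret manager, since an unkeyed hash of a low-entropy identifier such as an email address can be reversed by enumeration.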

Troubleshooting

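psycopg2.OperationalError: server closed the connection unexpectedly : Root cause: the server side dropped the session, typically because an idle connection was cut by a network device or a server-side timeout, the backend was terminated, or the server restarted. Two related errors are easy to confuse with it: an exhausted client pool raises psycopg2.pool.PoolError from getconn(), and a query that exceeds statement_timeout is cancelled with psycopg2.errors.QueryCanceled. Fix: confirm the pool is module-level (not per-call), size maxconn within your Postgres max_connections headroom, and verify the health check and the putconn in the finally block are running so connections are actually returned.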

Snowflake OCSP revocation-check failures / handshake stalls : Root cause: a corporate proxy intercepting TLS or a stale OCSP cache while ocsp_fail_open=False. Fix: route through the dedicated egress proxy via proxy_host/proxy_port with a pinned CA bundle. Do not switch to ocsp_fail_open=True to “unblock” a deadline; that silently accepts unverifiable certificates for PII in transit.

VARIANT rows arrive as escaped strings instead of parsed objects : Root cause: double-encoding. The JSON string is inserted into VARIANT without a PARSE_JSON on load, so Snowflake stores a string literal rather than structured data. Fix: parse on load with an INSERT INTO ... SELECT PARSE_JSON(%s) statement. Snowflake does not accept PARSE_JSON inside a VALUES clause, and wrapping it in TO_VARIANT adds nothing because PARSE_JSON already returns a VARIANT.

Schema-drift gate fires on a legitimately new column : Root cause: a benign additive change in Postgres that Snowflake has not yet mirrored. Fix: apply the DDL to Snowflake first (schema-first migration), then re-run discovery. Never raise max_drift_pct to 100 to force the sweep through; that defeats the manifest-completeness guarantee the Schema Validation Rules stage relies on.

Silent truncation of nested PII arrays : Root cause: json.dumps was called without cls=DSRSerializer, or met a type the encoder does not handle, and the resulting TypeError was swallowed deep in a comprehension, dropping the row. Fix: route the row and its error metadata to the dead-letter queue with a compliance_gate_violation metric rather than swallowing the exception; the missing subject data would otherwise never surface until an audit.
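The last three fixes can be combined into one load step: parse the payload on insert, and dead-letter any failure instead of dropping it. The sketch below is one possible shape under stated assumptions. load_or_dead_letter is a hypothetical helper, the dead_letters list stands in for whatever queue client the pipeline uses, and the cursor is any DB-API cursor using the %s parameter style, as the earlier Snowflake queries on this page do.

```python
import re

# Plain identifiers only, because table and column names cannot be bound as parameters.
_TABLE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*){0,2}$")
_COLUMN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def load_or_dead_letter(
    cur, table: str, column: str, serialized: str, subject_hash: str, dead_letters: list
) -> bool:
    """Insert one serialized payload as structured VARIANT; dead-letter it on failure."""
    if not _TABLE.match(table) or not _COLUMN.match(column):
        raise ValueError("table and column must be plain SQL identifiers")
    # PARSE_JSON sits in a SELECT because Snowflake rejects it inside a VALUES clause.
    sql = f"INSERT INTO {table} ({column}) SELECT PARSE_JSON(%s)"
    try:
        cur.execute(sql, (serialized,))
        return True
    except Exception as exc:  # Broad on purpose: no failure may drop a row silently.
        dead_letters.append(
            {
                "metric": "compliance_gate_violation",
                "subject_hash": subject_hash,  # Hashed identifier, never the raw value.
                "error_type": type(exc).__name__,
                "error": str(exc),
                "payload": serialized,  # Keep the row so it can be replayed.
            }
        )
        return False
```

Returning a boolean rather than re-raising lets the sweep continue with other subjects while the failed row stays recoverable. Because the dead-letter entry carries the payload, the queue holding it needs the same access controls as the source tables.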

Related

Database Connector Configuration: the parent connector-configuration guide covering pooled, time-budgeted, credential-isolated relational extraction
Schema Validation Rules: the validation gate that accepts each row into the manifest or dead-letters it
Async Polling & Queue Management: the queue and circuit-breaker lifecycle that decouples slow warehouse loads from fast Postgres reads
Handling Rate Limits in Salesforce API Sync: the same fail-closed, backoff-driven discipline applied to SaaS API sources
Cross-System Data Discovery & Sync: the parent architecture this Postgres-to-Snowflake bridge feeds
