Why extract PDF text with coordinates instead of a plain string?

Linear string extraction concatenates text across columns and interleaves address lines with legal boilerplate on contracts, invoices, and scanned statements. Carrying each token's page and bounding box lets the pipeline reassemble a visual block by shared x-offset and descending vertical position, so a multi-line billing address maps to a single CRM field deterministically instead of being guessed from delimiters.

How does a format checksum override a low NLP confidence score?

The router evaluates the NLP score and an optional format pattern together. If an NLP model returns 0.62 for a nine-digit token but a strict SSN pattern with the SSA range exclusions matches it exactly, the match is elevated to auto-redact rather than dropped as a false negative. When both a CRM field and a PDF token plausibly match, the tie is broken by spatial proximity to an anchor phrase such as Billing To or Account Holder, not by chance.

How is a PDF redacted so the identifier is truly unrecoverable?

A cosmetic black rectangle over live text leaves the identifier in the content stream, where plain text extraction or copy-paste still recovers it; that fails the integrity-and-confidentiality principle of GDPR Art. 5(1)(f). Secure redaction burns the region into the page raster or deletes the underlying content stream, strips document metadata, and renders the sanitized pages to a new document rather than editing the original in place.

How is the audit manifest kept from becoming a new PII store?

The manifest records only derived values: a UTC timestamp, the source filename, a salted SHA-256 hash of each redacted value, and a manifest hash over the whole record. Because the raw identifier is never written, the manifest proves field-by-field coverage for the accountability principle of GDPR Art. 5(2) and the record-keeping duty of Art. 30 without re-creating the disclosure the pipeline exists to prevent. It is written to WORM storage so it cannot be altered after the fact.

Syncing Structured CRM Data with Unstructured PDFs

When a Data Subject Request (DSR) names a person whose authoritative identity lives in Salesforce, HubSpot, or Dynamics 365, the hard part is not reading their CRM row — it is proving that every scanned contract, template-generated invoice, and OCR’d statement holding a copy of that identity was found and reconciled against it before anything was redacted. This is the concrete implementation the Structured vs Unstructured Data Sync workflow wraps in a deterministic join, and it sits inside the broader PII Extraction & Redaction Pipelines architecture as the reconciliation layer between two storage worlds that were never designed to agree. Any engineer building PDF ingestion for a DSR pipeline hits the same three traps: linear text dumps that concatenate columns and interleave address lines with boilerplate, ambiguous matches where a CRM field and a PDF token overlap with no tie-breaker, and redaction applied by naive string replacement that leaves the identifier alive in a hidden layer or in document metadata. The goal here is a join that is coordinate-aware, deterministic under retry, and cryptographically auditable, so referential integrity survives layout drift, partial matches, and replay.

Prerequisites

Python 3.11+: for dataclasses, timezone-aware UTC audit timestamps (datetime.timezone.utc), and the X | None union syntax used in the routing layer.
pdfplumber>=0.11: exposes the per-word x0, top, x1, bottom coordinates needed for spatial reconstruction; PyMuPDF is an acceptable alternative for the redaction render.
pydantic>=2.6: the CRM anchor row and every extracted candidate are modelled as Pydantic v2 records, so a malformed source or a missing subject_id fails deterministically at ingestion rather than mid-pipeline.
A certified PDF redaction library: true redaction must burn the region into the page raster or delete the underlying content stream, not overlay a black rectangle. Cosmetic covering leaves the identifier recoverable and fails the integrity-and-confidentiality principle of GDPR Art. 5(1)(f).
Infra: a dead-letter queue for documents that fail OCR or exceed the per-document timeout, a role-gated manual-review store, and an append-only (WORM) audit sink such as S3 Object Lock in compliance mode.
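The dead-letter routing named in the infrastructure item can be sketched as follows. This is a minimal illustration, not the pipeline's actual worker: `process_with_timeout`, the in-memory `dlq` list, and the record shape are all hypothetical stand-ins for a real queue client.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable


def process_with_timeout(
    doc_id: str,
    work: Callable[[str], Any],
    timeout_s: float,
    dlq: list[dict],
) -> Any:
    """Run `work(doc_id)`; quarantine the document if it is slow or fails.

    `dlq` is a plain list standing in for a real dead-letter queue (assumption).
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(work, doc_id)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The document exceeded its budget: record it and move on.
        dlq.append({"doc_id": doc_id, "reason": "timeout"})
        return None
    except Exception as exc:
        # OCR or parsing failure: quarantine instead of crashing the batch.
        dlq.append({"doc_id": doc_id, "reason": f"error: {type(exc).__name__}"})
        return None
    finally:
        # Do not block on the worker; a thread cannot be killed, only abandoned.
        pool.shutdown(wait=False)
```

A thread that overruns keeps running in the background, so a production worker would more likely isolate each document in a subprocess that can be terminated. The sketch only shows the routing decision.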

Contextually ambiguous entities — a name that could be the subject or a witness on the same contract — belong to the NLP-Based Entity Recognition stage; this module stays deterministic wherever the format allows it.

Step-by-step implementation

Step 1 — Extract text with its coordinates, not as a flat string

Naive extraction fails on multi-column contracts, embedded financial tables, and rotated scan layers: a linear string dump from pdfplumber or PyMuPDF concatenates names across columns and interleaves address lines with legal boilerplate. Carrying each token’s bounding box (x0, top, x1, bottom) preserves the spatial context that later steps need to decide which visual block a token belongs to, and gives incident triage a coordinate to replay against.

import logging
from dataclasses import dataclass

import pdfplumber

logger = logging.getLogger("dsr_sync.spatial_parser")


@dataclass(frozen=True)
class SpatialToken:
    """A single extracted token with its page and bounding box."""

    page: int
    text: str
    bbox: tuple[float, float, float, float]  # x0, top, x1, bottom


def extract_spatial_tokens(pdf_path: str) -> list[SpatialToken]:
    """Extract every word with coordinates for spatial reconstruction."""
    tokens: list[SpatialToken] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            for word in page.extract_words(x_tolerance=3, y_tolerance=3):
                tokens.append(
                    SpatialToken(
                        page=page_num,
                        text=word["text"],
                        bbox=(word["x0"], word["top"], word["x1"], word["bottom"]),
                    )
                )
    logger.info("Extracted %d spatial tokens from %s", len(tokens), pdf_path)
    return tokens

When a CRM row expects a discrete billing_address but the extractor returns a string spanning three visual lines, the bounding boxes tell you the tokens share an x0 and step down in top — the signal used to reassemble the block before schema mapping, rather than guessing from delimiters.
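One way to act on that signal is sketched below. It assumes all tokens come from a single page and that the block's left edge (`anchor_x0`) is already known. `group_lines`, `reassemble_block`, and the gap thresholds are illustrative names and values, not part of pdfplumber.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialToken:
    page: int
    text: str
    bbox: tuple[float, float, float, float]  # x0, top, x1, bottom


def group_lines(tokens: list[SpatialToken], y_tol: float = 3.0) -> list[list[SpatialToken]]:
    """Group tokens into visual lines by similar `top`, ordered top to bottom."""
    lines: list[list[SpatialToken]] = []
    for tok in sorted(tokens, key=lambda t: (t.bbox[1], t.bbox[0])):
        if lines and abs(lines[-1][0].bbox[1] - tok.bbox[1]) <= y_tol:
            lines[-1].append(tok)
        else:
            lines.append([tok])
    return [sorted(line, key=lambda t: t.bbox[0]) for line in lines]


def reassemble_block(
    tokens: list[SpatialToken],
    anchor_x0: float,
    x_tol: float = 3.0,
    max_word_gap: float = 12.0,
    max_line_gap: float = 18.0,
) -> str:
    """Join consecutive lines that start at `anchor_x0` into one string."""
    parts: list[str] = []
    last_top = None
    for line in group_lines(tokens):
        # Find the token on this line that sits on the block's left edge.
        start = next(
            (i for i, t in enumerate(line) if abs(t.bbox[0] - anchor_x0) <= x_tol),
            None,
        )
        if start is None:
            continue
        top = line[start].bbox[1]
        if last_top is not None and top - last_top > max_line_gap:
            break  # large vertical gap: the block has ended
        words = [line[start]]
        for tok in line[start + 1:]:
            if tok.bbox[0] - words[-1].bbox[2] > max_word_gap:
                break  # wide horizontal gap: the next column starts here
            words.append(tok)
        parts.append(" ".join(w.text for w in words))
        last_top = top
    return " ".join(parts)
```

The thresholds are in PDF points and would need tuning per template family. A line spacing larger than `max_line_gap` ends the block early by design.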

Step 2 — Normalize deterministically, then align to the CRM anchor

Extracted text rarely aligns directly with a CRM field schema. OCR introduces ligature and encoding artifacts; whitespace is inconsistent; entity names carry stray punctuation. Normalization must be deterministic — identical input yields byte-identical output on every replay — so the same source never produces two different join keys.

import re
import unicodedata


def normalize_block(raw: str) -> str:
    """Fold OCR artifacts, collapse whitespace, and enforce a stable form."""
    folded = unicodedata.normalize("NFKC", raw)
    folded = re.sub(r"[^\x20-\x7EÀ-ɏ]+", " ", folded)
    return re.sub(r"\s+", " ", folded).strip()


def align_to_crm(
    tokens: list[SpatialToken], crm_schema: dict[str, re.Pattern[str]]
) -> dict[str, str]:
    """Map normalized tokens to CRM fields using anchored pattern matching."""
    aligned: dict[str, str] = {}
    for field, pattern in crm_schema.items():
        for token in tokens:
            candidate = normalize_block(token.text)
            if pattern.search(candidate):
                aligned[field] = candidate
                break  # first spatial-order match wins; deterministic
    return aligned

This deterministic layer is what bridges the structured and unstructured worlds: it enforces the same schema contract the parent Structured vs Unstructured Data Sync workflow applies at the connector boundary, refusing to hand a probabilistic “close enough” string downstream.
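The replay guarantee can be illustrated with a small sketch: two OCR variants of the same visual block should collapse to the same join key. The `normalize_block` body repeats the Step 2 logic so the example runs on its own; `join_key` is a hypothetical helper, and hashing the normalized string is an assumption about how a key might be derived.

```python
import hashlib
import re
import unicodedata


def normalize_block(raw: str) -> str:
    """Same folding as Step 2: NFKC, drop stray characters, collapse whitespace."""
    folded = unicodedata.normalize("NFKC", raw)
    folded = re.sub(r"[^\x20-\x7E\u00C0-\u024F]+", " ", folded)
    return re.sub(r"\s+", " ", folded).strip()


def join_key(raw: str) -> str:
    """Derive a stable key from the normalized form (hypothetical helper)."""
    return hashlib.sha256(normalize_block(raw).encode("utf-8")).hexdigest()


# An OCR variant with an "fi" ligature, a non-breaking space, and a line break...
ocr_variant = "Of\ufb01ce\u00a0 Park,\n Suite 4"
# ...and the clean form a CRM export might hold.
clean_form = "Office Park, Suite 4"

same_key = join_key(ocr_variant) == join_key(clean_form)
```

Here `same_key` is `True` because NFKC expands the ligature and maps the non-breaking space to a plain space before whitespace is collapsed.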

Step 3 — Route each candidate by tiered confidence

Transformer models rarely hit perfect precision on legacy templates — handwritten signatures, stamped watermarks, and low-resolution scans all erode the score. A tiered router sends high-certainty matches straight to redaction, holds ambiguous ones for review, and discards the rest, while a regex checksum can elevate a mediocre NLP score to certainty when the format itself is verifiable.

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncResult:
    field_name: str
    value: str
    nlp_confidence: float
    regex_confirmed: bool
    routing: str  # "AUTO_REDACT" | "MANUAL_REVIEW" | "DISCARD"


def evaluate(
    field: str,
    value: str,
    nlp_score: float,
    checksum: re.Pattern[str] | None = None,
    nlp_threshold: float = 0.80,
    review_floor: float = 0.45,
) -> SyncResult:
    """Tiered routing: a format checksum can override a weak NLP score."""
    # A full-string format match counts as confirmation regardless of the NLP score.
    regex_confirmed = bool(checksum and checksum.fullmatch(value))
    if nlp_score >= nlp_threshold or regex_confirmed:
        routing = "AUTO_REDACT"
    elif nlp_score >= review_floor:
        routing = "MANUAL_REVIEW"
    else:
        routing = "DISCARD"
    return SyncResult(field, value, nlp_score, regex_confirmed, routing)

If an NLP model returns 0.62 for a nine-digit token but a strict SSN pattern with SSA range exclusions confirms the format, the pipeline elevates the match rather than opening a false-negative disclosure gap. The reverse case — a CRM field and a PDF entity that both plausibly match — is broken by spatial proximity to an anchor phrase (Billing To:, Account Holder:) rather than by coin-flip. The exact thresholds and score-blending live in the Confidence Scoring Thresholds workflow so they are tuned and versioned in one place.
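A strict SSN pattern of the kind described might look like the sketch below. SSNs carry no check digit, so "checksum" here means a full-string format match that excludes the ranges the SSA never issues: area 000, 666, and 900 to 999, group 00, and serial 0000. The names `STRICT_SSN` and `ssn_format_confirmed` are illustrative.

```python
import re

# Negative lookaheads reject the never-issued ranges before the digits are consumed.
STRICT_SSN = re.compile(
    r"(?!000|666|9\d{2})\d{3}"  # area: not 000, 666, or 900-999
    r"-(?!00)\d{2}"             # group: not 00
    r"-(?!0000)\d{4}"           # serial: not 0000
)


def ssn_format_confirmed(value: str) -> bool:
    """True only when the whole token is a structurally valid SSN."""
    return STRICT_SSN.fullmatch(value) is not None
```

Passing `STRICT_SSN` as the `checksum` argument to `evaluate` would then elevate a 0.62-scored token only when it survives these exclusions. A token such as 666-12-3456 would fall back to routing on the NLP score alone.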

Step 4 — Redact securely and write an immutable audit manifest

Redaction that survives scrutiny renders sanitized pages to a new document and records what was removed without recording the removed values in plaintext. Every AUTO_REDACT decision is bound by hash to the extraction that produced it, so an auditor can prove the pipeline acted on the state it claimed to.

import hashlib
import json
import os
from datetime import datetime, timezone


def _salted_hash(value: str, salt: str) -> str:
    """Keep identifiers out of the audit trail while proving coverage."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def build_audit_manifest(
    results: list[SyncResult], pdf_path: str, salt: str
) -> dict:
    """Bind every redaction to its source without logging the raw value."""
    redacted = [
        {"field": r.field_name, "value_hash": _salted_hash(r.value, salt)}
        for r in results
        if r.routing == "AUTO_REDACT"
    ]
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source_pdf": os.path.basename(pdf_path),
        "redacted_fields": redacted,
    }
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["manifest_hash"] = hashlib.sha256(payload).hexdigest()
    return manifest


def persist_manifest(manifest: dict, worm_dir: str) -> str:
    """Write to WORM storage (e.g. S3 Object Lock in compliance mode)."""
    name = f"audit_{manifest['manifest_hash'][:12]}.json"
    path = os.path.join(worm_dir, name)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, sort_keys=True)
    return path

Storing only a salted hash of each redacted value keeps the manifest from becoming a new identifier store while still proving, field by field, that the redaction happened. That supports the accountability principle of GDPR Art. 5(2) and the record-keeping duty of Art. 30 without re-creating the disclosure the pipeline exists to prevent.
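An auditor-side check could be sketched as follows, assuming the manifest layout produced by build_audit_manifest: salt prepended to the value, SHA-256, and a manifest hash computed over the sorted-key JSON of every field except `manifest_hash`. Both function names are hypothetical.

```python
import hashlib
import json


def verify_manifest(manifest: dict) -> bool:
    """Recompute the manifest hash over everything except the hash itself."""
    body = {k: v for k, v in manifest.items() if k != "manifest_hash"}
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest() == manifest.get("manifest_hash")


def covers_value(manifest: dict, value: str, salt: str) -> bool:
    """Check that a known identifier was redacted, without storing it anywhere."""
    expected = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return any(
        entry["value_hash"] == expected for entry in manifest["redacted_fields"]
    )
```

Only someone who holds both the salt and the candidate identifier can confirm coverage, which is the property that keeps the manifest from doubling as a lookup table.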

Step 5 — Push ambiguous matches to a review queue and propagate overrides

Everything routed MANUAL_REVIEW enters a role-gated queue that shows the reviewer the original bounding box, the extracted candidate, the CRM canonical value, and the confidence breakdown. A confirmed correction propagates back as a labelled training example and, crucially, re-enters the commit path so no held identifier is silently dropped.

import hashlib


def to_review_ticket(result: SyncResult, crm_value: str, salt: str) -> dict:
    """Serialize an ambiguous match for a role-gated review endpoint."""

    def salted(value: str) -> str:
        # Salt every digest: an unsalted hash of a short identifier such as an
        # SSN can be brute-forced, which would turn the ticket into a PII store.
        return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

    ticket_id = salted(f"{result.field_name}:{result.value}")[:12]
    return {
        "ticket_id": f"DSR-REVIEW-{ticket_id}",
        "field": result.field_name,
        "extracted_hash": salted(result.value),
        "crm_canonical_hash": salted(crm_value),
        "nlp_confidence": result.nlp_confidence,
        "regex_confirmed": result.regex_confirmed,
        "requires_override": True,
    }

The reviewer’s decision is logged with a reviewer ID, a UTC timestamp, and a signature, then fed back to recalibrate the NLP threshold and the regex library. Over successive DSR cycles this closes the partial-match ambiguity gap and steadily lowers manual-intervention volume. The transport must be mutually authenticated (mTLS) so a review payload — which references, but does not carry, the raw identifier — cannot be replayed onto the production redaction workers.

Configuration reference

Parameter	Type	Default	Compliance note
`x_tolerance` / `y_tolerance`	`int`	`3`	pdfplumber word grouping; too loose merges adjacent columns, too tight splits an address token.
`nlp_threshold`	`float`	`0.80`	At/above this a match auto-redacts; tune per template family in the confidence workflow.
`review_floor`	`float`	`0.45`	Below this a match is discarded; above it and under threshold it goes to review, not silent drop.
`per_document_timeout`	`float` (s)	`4.0`	OCR or backtracking past this quarantines the doc to the DLQ, protecting the DSR statutory deadline.
`manifest_salt`	`str` (secret)	required	Salts the value hashes so the audit manifest is not a rainbow-table-recoverable identifier store.
`worm_retention`	`str`	per policy	S3 Object Lock retention; must meet the accountability record duty of GDPR Art. 30.
`redaction_mode`	`str`	`burn`	`burn` rasterizes/deletes the content stream; a cosmetic overlay is non-compliant under GDPR Art. 5(1)(f).
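The table's parameters could be carried in a validated, immutable object along the lines below. `SyncConfig` and its checks are a hypothetical sketch; `worm_retention` is omitted because its format depends on the retention policy in force.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncConfig:
    """Defaults mirror the configuration table; the class name is illustrative."""

    manifest_salt: str  # required secret, no default
    x_tolerance: int = 3
    y_tolerance: int = 3
    nlp_threshold: float = 0.80
    review_floor: float = 0.45
    per_document_timeout: float = 4.0  # seconds
    redaction_mode: str = "burn"

    def __post_init__(self) -> None:
        if not self.manifest_salt:
            raise ValueError("manifest_salt is required")
        if not 0.0 <= self.review_floor < self.nlp_threshold <= 1.0:
            raise ValueError("need 0 <= review_floor < nlp_threshold <= 1")
        if self.per_document_timeout <= 0:
            raise ValueError("per_document_timeout must be positive")
        if self.redaction_mode != "burn":
            # A cosmetic overlay is rejected outright rather than merely warned about.
            raise ValueError("only redaction_mode='burn' is compliant")
```

Failing at construction time means a misordered threshold pair or a missing salt stops the worker before any document is read.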

Verification

Confirm correctness with a matrix that pins the two behaviours most likely to regress: the checksum override elevating a weak NLP score, and the manifest never leaking a raw value.

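import json
import re

import pytest

# evaluate, SyncResult, and build_audit_manifest are the Step 3 and Step 4 definitions.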

SSN_RE = re.compile(r"\d{3}-\d{2}-\d{4}")


@pytest.mark.parametrize(
    "nlp_score, value, checksum, expected",
    [
        (0.62, "123-45-6789", SSN_RE, "AUTO_REDACT"),   # checksum elevates
        (0.62, "not-an-ssn", SSN_RE, "MANUAL_REVIEW"),  # checksum fails, mid score
        (0.10, "noise", None, "DISCARD"),               # below review floor
        (0.91, "jane@corp.io", None, "AUTO_REDACT"),    # strong NLP alone
    ],
)
def test_routing(nlp_score, value, checksum, expected):
    result = evaluate("ssn", value, nlp_score, checksum=checksum)
    assert result.routing == expected


def test_manifest_never_stores_raw_value():
    """A redacted identifier must not appear in plaintext in the manifest."""
    results = [SyncResult("ssn", "123-45-6789", 0.62, True, "AUTO_REDACT")]
    manifest = build_audit_manifest(results, "contract.pdf", salt="s3cr3t")
    assert "123-45-6789" not in json.dumps(manifest)
    assert manifest["redacted_fields"][0]["value_hash"]

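A correct run emits one manifest per document carrying manifest_hash, the source filename, and a salted hash per redacted field, never the value itself. The compliance assertion is reproducibility: given the same PDF, the same CRM anchor, and the same salt and library version, the routing decisions and the per-field value hashes must be byte-for-byte identical on replay. Because build_audit_manifest also hashes the wall-clock timestamp, manifest_hash itself reproduces only when the timestamp is pinned, for example by injecting it in tests.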
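A complementary check is to re-extract text from the redacted output and confirm that no identifier survives. The sketch below assumes the page texts and the metadata dictionary have already been pulled from the new PDF by whatever extractor is in use; `find_leaks` is a hypothetical helper.

```python
import re


def _squash(text: str) -> str:
    """Lowercase and drop separators so '123 45 6789' matches '123-45-6789'."""
    return re.sub(r"[^0-9a-z]", "", text.lower())


def find_leaks(
    identifiers: list[str],
    page_texts: list[str],
    metadata: dict[str, object],
) -> list[tuple[str, str]]:
    """Return (identifier, location) pairs where an identifier is still readable."""
    haystacks = {f"page {i}": t for i, t in enumerate(page_texts, start=1)}
    haystacks.update({f"metadata:{k}": str(v) for k, v in metadata.items()})
    leaks: list[tuple[str, str]] = []
    for ident in identifiers:
        needle = _squash(ident)
        if not needle:
            continue  # nothing searchable left after squashing
        for where, text in haystacks.items():
            if needle in _squash(text):
                leaks.append((ident, where))
    return leaks
```

An empty result only shows that this extractor can no longer see the value. It is a regression guard, not proof that the content stream is clean, and squashing separators can raise occasional false positives across adjacent digit runs.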

Troubleshooting

Columns bleed together in the extracted text : Root cause: linear string extraction with no coordinate context, or x_tolerance set too loose. Fix: extract per-word bounding boxes (Step 1) and reassemble blocks by shared x0 and descending top; tighten x_tolerance for multi-column layouts.

A weak NLP score drops a valid, format-verifiable identifier : Root cause: routing on the NLP score alone. Fix: pass a format checksum (SSN, email, IBAN) to evaluate so a confirmed format elevates the match to AUTO_REDACT instead of leaking it as a false negative.

Two fields both match the same PDF token : Root cause: overlapping patterns with no tie-breaker. Fix: rank candidates by spatial proximity to anchor phrases (Billing To:, Account Holder:) and, failing that, route to review rather than guessing.

Redacted text is still recoverable in the output PDF : Root cause: a cosmetic black rectangle over live text or an untouched content stream / metadata. Fix: set redaction_mode="burn" to rasterize or delete the underlying content stream and strip document metadata before write.

The audit manifest becomes a new PII store : Root cause: logging the redacted value in plaintext. Fix: store only the salted value_hash, the source reference, and the manifest hash, keeping identifiers out of operational records per GDPR Art. 5(1)(f).

Related

Structured vs Unstructured Data Sync: the parent workflow that defines the deterministic, idempotent join this page implements for PDFs.
Confidence Scoring Thresholds: where the routing thresholds and NLP/regex score blending used in Step 3 are tuned and versioned.
NLP-Based Entity Recognition: the probabilistic layer that disambiguates entities too contextually ambiguous for a deterministic format match.
Schema Validation Rules: the ingestion contract that rejects malformed sources before they reach this reconciliation layer.
PII Extraction & Redaction Pipelines: the parent pipeline architecture this reconciliation layer feeds.
