Fine-Tuning spaCy for Legal Document PII Extraction

Legal Data Subject Requests (DSRs) force a recognizer to find personal data inside dense, adversarial text: multi-party contracts, deposition transcripts, and regulatory filings where a single missed name is a confidentiality breach and a single over-eager span destroys data the subject was entitled to receive. This is the problem that privacy and data engineers hit when a general-purpose model — en_core_web_trf out of the box — is pointed at a signed exhibit and starts tagging the caption block, the Bates numbers, and the notary boilerplate as PERSON. Fine-tuning fixes the recognizer, but only if the training pipeline is deterministic and the output is auditable.

This page is the implementation guide for fine-tuning spaCy’s transformer NER on legal corpora, within the broader NLP-Based Entity Recognition stage of the PII Extraction & Redaction Pipelines architecture. The recognizer built here does not make the final masking decision — it emits confidence-bearing spans that flow into confidence scoring and thresholds for routing. Everything below treats training as a reproducible artifact: hashed inputs, pinned seeds, versioned config, and a calibrated confidence score that a reviewer can defend under the accountability duty of GDPR Art. 5(2).

Prerequisites

Python 3.11+. The standard-library zoneinfo module (available since 3.9) handles deadline stamping in the audit record, and insertion-ordered dicts (guaranteed since 3.7) keep batches reproducible.
spaCy 3.7+ with the transformers extra: pip install "spacy[transformers]==3.7.*" then python -m spacy download en_core_web_trf.
A CUDA GPU (or Apple Metal) for transformer fine-tuning; pip install "spacy[cuda12x]" matched to your driver. CPU-only training on en_core_web_trf is impractical for a legal corpus.
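Pydantic v2 for validating extracted spans before they leave the recognizer, scikit-learn for fitting the confidence calibrator, and the third-party regex package (pip install regex), which Step 1 needs for the \P{Cc} Unicode property class that the standard-library re module does not support.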
A labeled legal corpus exported as spaCy DocBin files (train.spacy, dev.spacy) with a held-out dev split stratified by document type (contract, transcript, filing) so calibration is not skewed by one genre.
Immutable object storage (S3 Object Lock, or any WORM target) to anchor the model, its config.cfg, and the SHA-256 hashes of every training document for audit reconstruction under GDPR Art. 5(2).
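The stratified dev split named in the corpus prerequisite can be produced deterministically. The following is a minimal sketch, assuming each document is described by a hypothetical (doc_id, doc_type) pair; the helper name stratified_split and the 20% dev fraction are illustrative choices, not part of spaCy.

```python
import random
from collections import defaultdict


def stratified_split(
    docs: list[tuple[str, str]],
    dev_fraction: float = 0.2,
    seed: int = 42,
) -> tuple[list[str], list[str]]:
    """Split doc ids into (train, dev), sampling each doc_type separately."""
    by_type: dict[str, list[str]] = defaultdict(list)
    for doc_id, doc_type in docs:
        by_type[doc_type].append(doc_id)

    rng = random.Random(seed)  # local RNG so global random state is untouched
    train: list[str] = []
    dev: list[str] = []
    for doc_type in sorted(by_type):  # sorted so the order never depends on input order
        ids = sorted(by_type[doc_type])
        rng.shuffle(ids)
        # At least one dev document per genre, provided the genre has two or more.
        n_dev = max(1, round(len(ids) * dev_fraction)) if len(ids) > 1 else 0
        dev.extend(ids[:n_dev])
        train.extend(ids[n_dev:])
    return train, dev


# Hypothetical corpus: 10 contracts, 5 transcripts, 5 filings.
docs = [(f"contract-{i}", "contract") for i in range(10)]
docs += [(f"transcript-{i}", "transcript") for i in range(5)]
docs += [(f"filing-{i}", "filing") for i in range(5)]
train_ids, dev_ids = stratified_split(docs)
```

Because the split depends only on the document ids and the seed, it can be regenerated later from the audit record instead of being stored separately.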

Step-by-step implementation

Step 1: Deterministic preprocessing and span flattening

Legal PDFs break standard tokenization. OCR of scanned exhibits injects phantom whitespace, mid-word hyphenation splits, and ligature corruption that shift character offsets unpredictably — and offsets are exactly what a redaction engine acts on. Cross-references like the undersigned, hereinafter "Vendor" also tempt annotators into nested spans, which the transition-based parser cannot represent and which corrupt BIOES tagging.

Normalize every document to a canonical form before annotation, and hash the raw bytes first so the audit trail can prove which input produced which model. Annotation guidelines must forbid nested spans in favour of a single flattened label per token.

import hashlib
import unicodedata
import regex as re

_LIGATURES = {"ﬁ": "fi", "ﬂ": "fl", "ﬀ": "ff"}


def sanitize_legal_document(raw_bytes: bytes) -> tuple[str, str]:
    """Normalize OCR artifacts and return (clean_text, sha256_audit_hash).

    The hash is computed over the raw bytes *before* normalization so the
    audit record can reconstruct exactly which input produced a span.
    """
    doc_hash = hashlib.sha256(raw_bytes).hexdigest()
    text = raw_bytes.decode("utf-8", errors="ignore")

    # Strip non-printable control chars, keep newline/carriage-return/tab.
    text = re.sub(r"[^\P{Cc}\n\r\t]+", "", text)

    # Repair ligatures, then normalize to NFC so composed/decomposed
    # accented characters compare identically and offsets stay stable.
    for glyph, repl in _LIGATURES.items():
        text = text.replace(glyph, repl)
    text = unicodedata.normalize("NFC", text)

    text = re.sub(r"- *\n", "", text)        # rejoin mid-word hyphenation
    text = re.sub(r"[ \t]{2,}", " ", text)    # collapse phantom whitespace
    text = re.sub(r"(\n\s*){3,}", "\n\n", text)  # preserve clause boundaries

    return text.strip(), doc_hash
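The no-nested-spans rule can also be enforced mechanically before export, rather than relying on annotators alone. The following is a minimal sketch, assuming annotations arrive as (start, end, label) character offsets against the clean text. The helper flatten_spans is hypothetical, and longest-span-wins is only one possible policy (spaCy's spacy.util.filter_spans applies a similar preference to Span objects).

```python
Span = tuple[int, int, str]


def flatten_spans(spans: list[Span]) -> tuple[list[Span], list[Span]]:
    """Keep the longest span in each overlap group; return (kept, dropped)."""
    # Longest first; ties broken by earliest start, then label, for determinism.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0], s[2]))
    kept: list[Span] = []
    dropped: list[Span] = []
    for span in ordered:
        # Two half-open ranges overlap when each starts before the other ends.
        overlaps = any(span[0] < k[1] and k[0] < span[1] for k in kept)
        (dropped if overlaps else kept).append(span)
    return sorted(kept), sorted(dropped)


# A PERSON nested inside an ORG defined term, plus an unrelated SSN.
spans = [(0, 20, "ORG"), (4, 12, "PERSON"), (30, 41, "SSN")]
kept, dropped = flatten_spans(spans)
```

The dropped list is returned instead of discarded so it can be stored in annotation metadata for audit while only the kept spans go into training.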

Step 2: Version the training config with a frozen transformer warmup

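The training loop hinges on a versioned config.cfg sourced from en_core_web_trf. On legal text, citations and boilerplate formatting dominate the loss landscape, and the optimizer learns to ignore substantive PII. Freezing the transformer forces the parser head to converge on entity structure before the contextual embeddings start moving, which stabilizes low-frequency classes like SSN and ACCOUNT_NO. Note that frozen_components applies to the whole training run, not to a set number of steps, so the config below describes the warmup phase only. One way to unfreeze is a second run that starts from the saved model with the transformer removed from frozen_components.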

[paths]
train = "./corpus/train.spacy"
dev = "./corpus/dev.spacy"

[system]
gpu_allocator = "pytorch"
seed = 42

[nlp]
lang = "en"
pipeline = ["transformer", "ner"]
batch_size = 128

[components.transformer]
source = "en_core_web_trf"

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false

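[training]
seed = 42
max_steps = 15000
patience = 2500
frozen_components = ["transformer"]
# A frozen transformer must still annotate docs so the ner listener receives its output.
annotating_components = ["transformer"]

[training.optimizer]
@optimizers = "Adam.v1"
grad_clip = 1.0

# The learning rate is defined once, as a schedule block. A scalar learn_rate
# key in [training.optimizer] would conflict with this section.
[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 500
total_steps = 15000
initial_rate = 0.00005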

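Validate the config before every run and commit it alongside the corpus hash. The listing above shows only the settings this guide changes; the ner model block also needs its remaining required settings, including the listener layer that connects it to the transformer, and config validation should report them if they are missing. See the official spaCy training documentation for the authoritative schema: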

python -m spacy debug config ./config.cfg
python -m spacy debug data ./config.cfg  # inspect label balance + span alignment
python -m spacy train ./config.cfg --output ./model --gpu-id 0
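The warmup-then-decay shape configured in Step 2 can be sketched in plain Python to sanity-check the three schedule parameters before a long run. This is an approximation written from the config values (initial_rate, warmup_steps, total_steps); it is not thinc's source, and the exact per-step values thinc produces may differ. The function name warmup_linear_rate is hypothetical.

```python
def warmup_linear_rate(
    step: int,
    initial_rate: float = 0.00005,
    warmup_steps: int = 500,
    total_steps: int = 15000,
) -> float:
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp from 0 toward initial_rate across the warmup window.
        factor = step / max(1, warmup_steps)
    else:
        # Decay from initial_rate at the end of warmup to 0 at total_steps.
        remaining = total_steps - step
        factor = max(0.0, remaining / max(1, total_steps - warmup_steps))
    return initial_rate * factor


# Inspect a few checkpoints of the assumed schedule.
checkpoints = {s: warmup_linear_rate(s) for s in (0, 250, 500, 7750, 15000)}
```

Under these assumptions the rate peaks at step 500 and is halved again by step 7750, which is worth comparing against patience = 2500 when deciding whether early stopping will cut the decay phase short.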

Step 3: Stabilize the training loop with pinned seeds and gradient clipping

When validation F1 plateaus below 0.82, the culprit is usually systematic false positives on privilege disclaimers and jurisdictional footers combined with vanishing gradients on rare PII classes. The grad_clip = 1.0 and warmup_linear schedule in Step 2 handle this for the CLI path. When you drive training programmatically (for example, to plug in a custom label-weighted sampler over an imbalanced legal corpus), pin the seed and reuse the same clipping so runs are repeatable on the same hardware and library versions; GPU kernels can still introduce small run-to-run differences. The loop below does not reproduce the warmup schedule, so add one if you need parity with the CLI run.

import spacy
from spacy.training import Example
from spacy.util import fix_random_seed, minibatch

# (text, {"entities": [(start, end, label)]}) — offsets from Step 1's clean text.
TrainingItem = tuple[str, dict[str, list[tuple[int, int, str]]]]


def train_ner(
    nlp: spacy.Language,
    train_data: list[TrainingItem],
    epochs: int = 30,
) -> spacy.Language:
    """Deterministically fine-tune the NER head with gradient clipping."""
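    fix_random_seed(42)
    # Register every training label, then continue from the existing weights.
    # nlp.initialize() would reset the component weights instead of fine-tuning them.
    ner = nlp.get_pipe("ner")
    for _, annotations in train_data:
        for _, _, label in annotations["entities"]:
            ner.add_label(label)
    optimizer = nlp.resume_training()
    optimizer.grad_clip = 1.0  # match the CLI config; suppress weight blowups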

    for epoch in range(epochs):
        losses: dict[str, float] = {}
        for batch in minibatch(train_data, size=32):
            examples = [
                Example.from_dict(nlp.make_doc(text), annotations)
                for text, annotations in batch
            ]
            nlp.update(examples, drop=0.1, losses=losses, sgd=optimizer)
        print(f"epoch={epoch} ner_loss={losses.get('ner', 0.0):.4f}")
    return nlp
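The label-weighted sampler mentioned above is not shown in the training loop. One possible shape for it is sketched here, assuming the same (text, {"entities": [...]}) items. The helper weighted_epoch, its inverse-frequency weighting, and the per-epoch seed derivation are illustrative assumptions, not a spaCy API.

```python
import random
from collections import Counter

TrainingItem = tuple[str, dict[str, list[tuple[int, int, str]]]]


def weighted_epoch(
    train_data: list[TrainingItem], seed: int, epoch: int
) -> list[TrainingItem]:
    """Resample one epoch, favouring documents that contain rare labels."""
    counts = Counter(
        label for _, ann in train_data for _, _, label in ann["entities"]
    )
    total = sum(counts.values())
    floor = 1.0 / total if total else 1.0  # weight for documents with no entities
    # Each document is weighted by its rarest label (inverse corpus frequency).
    weights = [
        max((1.0 / counts[label] for _, _, label in ann["entities"]), default=floor)
        for _, ann in train_data
    ]
    # A separate, reproducible random stream per epoch.
    rng = random.Random(seed * 100_003 + epoch)
    return rng.choices(train_data, weights=weights, k=len(train_data))


# Hypothetical imbalanced corpus: one SSN document, nine PERSON documents.
data: list[TrainingItem] = [("SSN 123-45-6789", {"entities": [(4, 15, "SSN")]})]
data += [
    (f"Signed by Party {i}", {"entities": [(10, 17, "PERSON")]}) for i in range(9)
]
epoch_items = weighted_epoch(data, seed=42, epoch=0)
```

In the loop above, the result of weighted_epoch(train_data, 42, epoch) could be passed to minibatch in place of train_data. Sampling is with replacement, so common documents may be skipped in a given epoch.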

Step 4: Calibrate confidence with Platt scaling on a held-out split

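spaCy's transition-based NER does not, by default, expose a calibrated per-span probability, so a compliance router has nothing defensible to threshold on. Fit a Platt-style logistic-regression calibrator on the held-out dev split, stratified by document type: run the trained pipeline over the dev texts, mark each predicted span 1 if it exactly matches a gold span (same offsets and label) and 0 otherwise, and regress that outcome on the span's contextual vector. Classic Platt scaling fits the sigmoid to a single raw score; the span vector stands in here because no such score is available. The score handed to the downstream router is then an estimate of the probability that the predicted span is correct.

import numpy as np
import spacy
from sklearn.linear_model import LogisticRegression


def fit_span_calibrator(
    nlp: spacy.Language,
    dev_docs: list[spacy.tokens.Doc],
) -> LogisticRegression:
    """Fit P(predicted span is correct | span vector) on gold-annotated dev docs."""
    features: list[np.ndarray] = []
    labels: list[int] = []
    for gold_doc in dev_docs:
        gold = {(e.start_char, e.end_char, e.label_) for e in gold_doc.ents}
        pred_doc = nlp(gold_doc.text)
        for ent in pred_doc.ents:
            # Assumes ent.vector is populated (for example via doc.tensor or a
            # user hook); a pipeline without static vectors may return an
            # empty or all-zero vector, which would make the fit meaningless.
            features.append(ent.vector)
            key = (ent.start_char, ent.end_char, ent.label_)
            labels.append(1 if key in gold else 0)

    if len(set(labels)) < 2:
        raise ValueError("calibration needs both correct and incorrect predictions")
    calibrator = LogisticRegression(max_iter=1000, class_weight="balanced")
    calibrator.fit(np.asarray(features), np.asarray(labels))
    return calibrator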
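Whether a score of 0.9 really means a 90% chance of being correct can be checked with a reliability table. The sketch below bins predictions by confidence and compares mean confidence with observed accuracy in each bin. The helper names and the ten-bin default are assumptions, and y_true / y_prob stand for the per-span correctness flags and calibrated scores from a held-out split.

```python
def reliability_table(
    y_true: list[int], y_prob: list[float], n_bins: int = 10
) -> list[tuple[float, float, int]]:
    """Return (mean_confidence, observed_accuracy, count) for each non-empty bin."""
    bins: list[list[tuple[int, float]]] = [[] for _ in range(n_bins)]
    for truth, prob in zip(y_true, y_prob):
        idx = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 lands in the last bin
        bins[idx].append((truth, prob))
    rows: list[tuple[float, float, int]] = []
    for members in bins:
        if members:
            conf = sum(p for _, p in members) / len(members)
            acc = sum(t for t, _ in members) / len(members)
            rows.append((conf, acc, len(members)))
    return rows


def expected_calibration_error(
    y_true: list[int], y_prob: list[float], n_bins: int = 10
) -> float:
    """Count-weighted mean gap between confidence and accuracy across bins."""
    n = len(y_true)
    rows = reliability_table(y_true, y_prob, n_bins)
    return sum(count / n * abs(conf - acc) for conf, acc, count in rows)


# Four spans all scored 0.9, of which only three were correct.
ece = expected_calibration_error([1, 1, 1, 0], [0.9, 0.9, 0.9, 0.9])
```

In this toy case the model claims 90% confidence but is right 75% of the time, so the gap is about 0.15. A calibrator that is working should shrink that gap on data it was not fitted on.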

Step 5: Route calibrated spans with a fail-closed fallback

The recognizer emits validated spans; the tiered gate decides their fate. Spans at or above the high threshold are auto-redacted, spans between the two thresholds go to manual review, and under the confidentiality duty of GDPR Art. 5(1)(f) anything below the review floor must fail closed into quarantine rather than passing through unmasked. Validate each span with a Pydantic v2 model so a malformed span is rejected before it reaches the audit store, and tag every routed span with an audit_id for non-repudiation.

import secrets
import logging
import spacy
from pydantic import BaseModel, ConfigDict, Field, field_validator, model_validator
from sklearn.linear_model import LogisticRegression

logger = logging.getLogger("pii_router")


class RoutedSpan(BaseModel):
    model_config = ConfigDict(frozen=True, extra="forbid")

    text: str = Field(min_length=1)
    label: str
    start: int = Field(ge=0)
    end: int = Field(gt=0)
    confidence: float = Field(ge=0.0, le=1.0)
    action: str
    audit_id: str

    @field_validator("action")
    @classmethod
    def _known_action(cls, v: str) -> str:
        if v not in {"auto_redact", "manual_review", "quarantine"}:
            raise ValueError(f"unknown routing action: {v}")
        return v

    @model_validator(mode="after")
    def _ordered_offsets(self) -> "RoutedSpan":
        # Reject empty or inverted spans; the field bounds alone allow end <= start.
        if self.end <= self.start:
            raise ValueError(f"end {self.end} must be greater than start {self.start}")
        return self


def route_spans(
    doc: spacy.tokens.Doc,
    calibrator: LogisticRegression,
    high: float = 0.95,
    low: float = 0.75,
) -> list[RoutedSpan]:
    """Route each span by calibrated confidence; fail closed below `low`."""
    audit_id = secrets.token_hex(16)
    routed: list[RoutedSpan] = []
    for ent in doc.ents:
        conf = float(calibrator.predict_proba(ent.vector.reshape(1, -1))[0][1])
        if conf >= high:
            action = "auto_redact"
        elif conf >= low:
            action = "manual_review"
        else:
            action = "quarantine"
            logger.warning("low-confidence span quarantined id=%s", audit_id)
        routed.append(
            RoutedSpan(
                text=ent.text, label=ent.label_,
                start=ent.start_char, end=ent.end_char,
                confidence=conf, action=action, audit_id=audit_id,
            )
        )
    return routed
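The band logic inside route_spans can be isolated into a pure function so the threshold boundaries are unit-testable without loading a model. The following is a minimal sketch under that assumption. decide_action is a hypothetical helper, and treating NaN or out-of-range scores as quarantine is an added fail-closed choice that goes beyond what the listing above does.

```python
def decide_action(conf: float, high: float = 0.95, low: float = 0.75) -> str:
    """Map a calibrated confidence to a routing action, failing closed."""
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low < high <= 1")
    # NaN fails every comparison, so it falls into this branch as well.
    if not 0.0 <= conf <= 1.0:
        return "quarantine"
    if conf >= high:
        return "auto_redact"
    if conf >= low:
        return "manual_review"
    return "quarantine"


# Boundary behaviour at and just below each threshold.
decisions = [decide_action(c) for c in (0.95, 0.9499, 0.75, 0.7499)]
```

Both thresholds are inclusive on their lower edge, matching the >= comparisons in route_spans.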

Configuration reference

Parameter	Type	Default	Compliance note
`training.seed` / `fix_random_seed`	`int`	`42`	Pinned seed makes runs reproducible; required to reconstruct a model under GDPR Art. 5(2) accountability.
`frozen_components`	`list[str]`	`["transformer"]`	Holds the backbone fixed for the whole run so rare PII classes converge in the parser head first; unfreezing requires a follow-up run.
`training.optimizer.grad_clip`	`float`	`1.0`	Prevents catastrophic weight updates on imbalanced legal corpora.
`warmup_steps`	`int`	`500`	Linear learning-rate warmup avoids large, destabilizing updates in the first steps.
`learn_rate`	`float`	`0.00005`	Low LR protects pretrained legal-language features from washing out.
`high` (auto-redact)	`float`	`0.95`	Only high-confidence spans auto-mask under GDPR Art. 5(1)(f).
`low` (review floor)	`float`	`0.75`	Below this the span fails closed to quarantine, never passes through.
`class_weight` (calibrator)	`str`	`"balanced"`	Counters the label imbalance typical of legal PII.

Verification

Confirm correctness on three axes before promoting a model. First, evaluate held-out F1 per label — never a single aggregate — because a strong PERSON score can hide a failing SSN recall that leaks identifiers.

python -m spacy evaluate ./model/model-best ./corpus/dev.spacy \
    --gpu-id 0 --output ./metrics.json

Second, assert the per-label floor and the calibrator’s reliability in a unit test:

import json
from sklearn.metrics import brier_score_loss


def test_per_label_recall_floor():
    # Use a context manager so the metrics file handle is always closed.
    with open("./metrics.json", encoding="utf-8") as fh:
        metrics = json.load(fh)
    for label in ("PERSON", "SSN", "EMAIL", "ACCOUNT_NO"):
        recall = metrics["ents_per_type"][label]["r"]
        assert recall >= 0.90, f"{label} recall {recall:.3f} below 0.90 floor"


def test_calibrator_is_reliable(y_true, y_prob):
    # A well-calibrated router keeps Brier score low; drift raises it.
    assert brier_score_loss(y_true, y_prob) <= 0.10
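For intuition about the 0.10 ceiling, the Brier score for binary outcomes is the mean squared gap between each predicted probability and the 0/1 outcome. The sketch below is a plain-Python illustration with made-up numbers and a hypothetical helper name; the unit test itself keeps using scikit-learn's brier_score_loss.

```python
def brier_score(y_true: list[int], y_prob: list[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


outcomes = [1, 1, 0, 1]  # illustrative correctness flags for four spans

# Hedged scores that roughly track the outcomes: (0.01 + 0.04 + 0.04 + 0.16) / 4
tracking = brier_score(outcomes, [0.9, 0.8, 0.2, 0.6])

# Uniformly over-confident scores: one confident miss dominates the average.
overconfident = brier_score(outcomes, [0.99, 0.99, 0.99, 0.99])
```

The first set of scores comes out at about 0.0625 and passes the gate; the second comes out at about 0.245 and fails, even though three of its four predictions were right.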

Third, expect the audit record for every promoted model to carry the corpus hash, the pinned seed, and the config.cfg digest written to WORM storage — that triple is what makes a redaction decision defensible under GDPR Art. 5(2).
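The triple described above can be assembled as a small JSON record before it is written to WORM storage. The sketch below is one possible layout: the field names, the order-independent corpus digest, and the helper names are assumptions, and the upload to object storage is left out.

```python
import hashlib
import json
from datetime import datetime, timezone


def corpus_digest(doc_hashes: list[str]) -> str:
    """Fold per-document SHA-256 hex digests into one order-independent hash."""
    h = hashlib.sha256()
    for digest in sorted(doc_hashes):  # sorting removes dependence on file order
        h.update(bytes.fromhex(digest))
    return h.hexdigest()


def build_audit_record(doc_hashes: list[str], config_bytes: bytes, seed: int) -> str:
    """Serialize the corpus hash, config digest, and seed as canonical JSON."""
    record = {
        "corpus_sha256": corpus_digest(doc_hashes),
        "config_sha256": hashlib.sha256(config_bytes).hexdigest(),
        "seed": seed,
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical inputs: two document hashes from Step 1 and the raw config bytes.
hashes = [hashlib.sha256(b"exhibit-a").hexdigest(), hashlib.sha256(b"exhibit-b").hexdigest()]
record_json = build_audit_record(hashes, b"[training]\nseed = 42\n", seed=42)
```

Hashing the config as raw bytes means any edit, including whitespace, produces a different digest.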

Troubleshooting

Validation F1 plateaus below 0.82
Root cause: boilerplate tokens (captions, Bates numbers, footers) dominate the loss and starve rare PII classes.
Fix: run spacy debug data to confirm the imbalance, apply label-weighted sampling, and keep the transformer in frozen_components so the parser head converges first.

Span offsets drift after preprocessing
Root cause: NFC normalization or ligature repair ran after annotation, so stored offsets no longer index the served text.
Fix: normalize once in Step 1, annotate against the normalized text, and re-run spacy debug data; misaligned spans surface as "Misaligned tokens" warnings.
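Offset drift can also be caught before training with a direct check. The sketch assumes the annotation export stores each span's surface string alongside its offsets, which is not something the pipeline above guarantees; find_drifted_spans is a hypothetical helper.

```python
StoredSpan = tuple[int, int, str, str]  # (start, end, label, surface text)


def find_drifted_spans(text: str, spans: list[StoredSpan]) -> list[StoredSpan]:
    """Return spans whose stored surface no longer matches text[start:end]."""
    return [
        span for span in spans
        if text[span[0]:span[1]] != span[3]
    ]


# Annotated against raw OCR text that still contains the one-character "fi" ligature.
raw_text = "\ufb01le of Jane Doe"
annotation = [(7, 15, "PERSON", "Jane Doe")]

# Ligature repair lengthens the text by one character, shifting later offsets.
clean_text = raw_text.replace("\ufb01", "fi")

ok_on_raw = find_drifted_spans(raw_text, annotation)
drifted_on_clean = find_drifted_spans(clean_text, annotation)
```

The same annotation is valid against the raw text and invalid against the normalized text, which is the failure this troubleshooting entry describes.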

Non-deterministic entity boundaries between runs
Root cause: an unpinned seed or GPU nondeterminism.
Fix: call fix_random_seed(42), set seed = 42 in the config, and quarantine any pipeline whose output diverges from the recorded model hash rather than serving it.
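The comparison against a recorded model hash could look like the sketch below, which assumes the recorded value is a SHA-256 over every file in the model directory. Both helpers and the digest layout (relative path, a zero byte, then contents, in sorted path order) are illustrative assumptions, not a spaCy feature.

```python
import hashlib
from pathlib import Path


def model_dir_digest(model_dir: Path) -> str:
    """Hash every file under model_dir in a stable, sorted order."""
    h = hashlib.sha256()
    for path in sorted(p for p in model_dir.rglob("*") if p.is_file()):
        # Include the relative path so renames change the digest too.
        h.update(path.relative_to(model_dir).as_posix().encode("utf-8"))
        h.update(b"\0")
        # Reads each file fully; stream in chunks for very large weight files.
        h.update(path.read_bytes())
    return h.hexdigest()


def should_quarantine(model_dir: Path, recorded_digest: str) -> bool:
    """Fail closed: any mismatch with the recorded digest blocks serving."""
    return model_dir_digest(model_dir) != recorded_digest
```

A deployment step might call should_quarantine(Path("./model/model-best"), recorded) with the digest taken from the audit record, and refuse to load the pipeline when it returns True.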

Calibrator over-confident on unseen document genre
Root cause: the held-out dev split was not stratified by document type, so a transcript-heavy calibrator misjudges filings.
Fix: stratify the split, refit, and gate promotion on the brier_score_loss assertion above.

Nested-span annotations rejected during training
Root cause: overlapping labels (e.g. a PERSON inside an ORG defined-term) violate BIOES.
Fix: flatten to one label per token per the Step 1 guidelines; keep the discarded label in the annotation metadata for audit, not in the training span.

Related

NLP-Based Entity Recognition: the parent stage, covering the normalization, bounded inference, and span validation this model plugs into.
Confidence Scoring & Thresholds: the routing gate that consumes the calibrated span scores produced here.
Deterministic Regex Architecture for Email and SSN Detection: the deterministic overlay that cross-checks the recognizer’s probabilistic spans.
Syncing Structured CRM Data with Unstructured PDFs: reconciling extracted spans against system-of-record fields.
Up to PII Extraction & Redaction Pipelines: the end-to-end architecture this recognizer serves.
