Redaction Strategies
The four strategies
Every redaction approach trades off reversibility (can you get the original data back?) against accuracy preservation (does the redacted document remain useful to downstream systems?) against compliance fit (does it satisfy your regulatory requirement?). No single strategy wins on all three axes.
Strategy comparison matrix
| Strategy | Reversible? | Accuracy impact | Compliance fit | Implementation complexity |
|---|---|---|---|---|
| Token masking | No | High: breaks format-dependent downstream processing | GDPR, CCPA (erasure) | Trivial |
| Type-aware replacement | No (by default) | Low: plausible values preserve flow | HIPAA de-identification (Safe Harbor or Expert Determination) | Low |
| Pseudonymization | Yes (mapping table) | Low: consistent IDs preserve relationships | GDPR Art. 4(5), HIPAA Expert Determination | Medium (mapping store required) |
| Format-preserving encryption (FPE) | Yes (key) | Very low: format identical to original | PCI DSS tokenization, CUI handling | High (cryptographic library required) |
When each strategy breaks agent usefulness
Every strategy has a failure mode that breaks downstream agents. Knowing these upfront prevents the common mistake of applying the most aggressive strategy by default:
- Token masking a schema file: If your agent is doing a schema migration and a column is named ssn, replacing it with [REDACTED] produces ALTER TABLE users ADD COLUMN [REDACTED] VARCHAR(11), which is invalid SQL that breaks the migration. Token masking must be scoped to values, not identifiers.
- Type-aware replacement in financial reconciliation: If downstream validation checks that the credit card BIN prefix matches an issuer, synthetic card numbers will fail validation and surface as suspicious transactions in your audit log.
- Pseudonymization without stable storage: If two documents reference the same person and your mapping table resets between runs, the same real name produces different pseudonyms — breaking cross-document entity resolution.
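- FPE on short strings: FPE works best on fixed-format, fixed-alphabet domains such as card numbers and SSNs. Applied to short names, it produces unreadable ciphertext that loses the plausibility benefit while keeping the key-management burden that reversibility brings. The NIST FPE modes also require a minimum domain size, so very short inputs may not be encryptable at all.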
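The pseudonymization failure in the list above can be reproduced in a few lines. This is an illustrative sketch only: InMemoryPseudonymizer and keyed_pseudonym are hypothetical names, and the secret is a placeholder. It contrasts a mapping table that lives in process memory, and so resets between runs, with a pseudonym derived from a keyed hash, which stays stable for as long as the secret does.

```python
import hashlib
import hmac
import uuid


class InMemoryPseudonymizer:
    """Hypothetical mapper: random pseudonyms kept in a dict that dies with the process."""

    def __init__(self) -> None:
        self._mapping: dict[str, str] = {}

    def pseudonym(self, value: str) -> str:
        # A fresh random ID is minted the first time each value is seen.
        if value not in self._mapping:
            self._mapping[value] = f"NAME-{uuid.uuid4().hex[:12]}"
        return self._mapping[value]


def keyed_pseudonym(value: str, secret: bytes) -> str:
    """Derive the pseudonym from the value and a secret, so stability needs no table."""
    digest = hmac.new(secret, value.encode(), hashlib.sha256).hexdigest()[:12]
    return f"NAME-{digest}"


# Two simulated runs, each starting with an empty in-memory table.
run_1 = InMemoryPseudonymizer().pseudonym("Alice Example")
run_2 = InMemoryPseudonymizer().pseudonym("Alice Example")
print("in-memory mapping stable across runs:", run_1 == run_2)

# The keyed variant depends only on the value and the secret.
secret = b"example-secret-for-illustration-only"
first = keyed_pseudonym("Alice Example", secret)
second = keyed_pseudonym("Alice Example", secret)
print("keyed pseudonym stable across runs:", first == second)
```

The keyed variant trades the mapping store for key management: pseudonyms stay stable across documents, but they cannot be inverted without a separately stored mapping.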
De-identification vs anonymization
These terms are often used interchangeably but have precise meanings in regulatory contexts, particularly under HIPAA:
HIPAA defines two routes to de-identification. Safe Harbor requires removal of 18 specific identifiers (names, SSNs, all date elements other than year, geographic units smaller than a state, and so on). Expert Determination is the alternative path: a qualified expert, typically a statistician, certifies that the re-identification risk is very small. A dataset that meets either standard is considered de-identified: it no longer qualifies as Protected Health Information, so HIPAA restrictions no longer apply to it.
Pseudonymized data generally does not get that treatment. The mapping table makes re-identification possible, so HIPAA still applies. Your choice of redaction strategy therefore directly determines your HIPAA obligations for the output.
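As one small illustration of a Safe Harbor style transformation, the sketch below reduces dates to their year. It assumes dates appear in ISO YYYY-MM-DD form, and generalize_dates_to_year is a hypothetical helper. It covers only the date rule and none of the other identifiers, so it is not a de-identification routine by itself.

```python
import re

# Matches ISO-8601 calendar dates such as 1984-03-17 (an assumed input format).
_ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def generalize_dates_to_year(text: str) -> str:
    """Replace every YYYY-MM-DD date in the text with its year only."""
    return _ISO_DATE.sub(lambda match: match.group(1), text)


note = "DOB 1984-03-17, admitted 2021-11-02."
print(generalize_dates_to_year(note))  # DOB 1984, admitted 2021.
```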
Python implementations
The following module implements all four strategies with a consistent interface. Each function takes the original text and a list of detected entities (from the detector in Post 02) and returns a redacted string plus a list of redaction records for the audit log (Post 05).
from __future__ import annotations
import hashlib
import hmac
import os
import re
from dataclasses import dataclass
from typing import Literal
# ── Shared types ─────────────────────────────────────────────────────────────
Strategy = Literal["mask", "replace", "pseudonymize", "fpe"]
@dataclass
class RedactionRecord:
    # SHA-256 of the original value; never store the value itself.
    # An unsalted hash of a low-entropy value (e.g. an SSN) can be brute-forced,
    # so consider a keyed HMAC if the audit log is widely readable.
    original_hash: str
    pii_type: str
    strategy: Strategy
    replacement: str
    start: int  # character offset in the original text (inclusive)
    end: int    # character offset in the original text (exclusive)
# ── Fake value generators for type-aware replacement ─────────────────────────
_FAKE_BY_TYPE: dict[str, str] = {
    "SSN": "999-00-0000",
    "EMAIL": "redacted@example.com",  # placeholder; example.com is reserved (RFC 2606)
    "PHONE": "555-000-0000",
    "CREDIT_CARD": "4000-0000-0000-0000",
    "PASSPORT": "X00000000",
    "IP_ADDRESS": "192.0.2.0",  # TEST-NET-1, RFC 5737
    "NAME": "Jane Doe",
    "ADDRESS": "1 Placeholder Ln, Anytown, ST 00000",
    "DATE_OF_BIRTH": "1900-01-01",
    "MEDICAL_RECORD": "MRN-REDACTED",
}
def _fake_value(pii_type: str, original: str) -> str:
    """Return a fixed placeholder for the type; `original` is currently unused."""
    return _FAKE_BY_TYPE.get(pii_type, f"[{pii_type}]")
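# ── Pseudonymization: deterministic HMAC-based mapping ───────────────────────
# The fallback secret is for local testing only; set PII_PSEUDO_SECRET in production.
_PSEUDO_SECRET = os.getenv("PII_PSEUDO_SECRET", "change-me-in-production")
def _pseudonymize(value: str, pii_type: str) -> str:
    """
    Produce a deterministic pseudonym from a keyed hash.
    HMAC is one-way: the secret lets you recompute a pseudonym, not invert it,
    so reversal requires a separately stored pseudonym-to-value mapping.
    """
    digest = hmac.new(
        _PSEUDO_SECRET.encode(),
        value.encode(),
        hashlib.sha256,
    ).hexdigest()[:12]  # 48 bits; truncation raises collision risk on large datasets
    return f"{pii_type}-{digest}"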
# ── Format-Preserving Encryption stub (FF3-1) ─────────────────────────────────
# Production: use the 'ff3' PyPI package (AES-FF3-1, NIST SP 800-38G Rev. 1)
# Stub shown here for illustration; it is neither cryptographically secure nor reversible.
def _fpe_encrypt(value: str, pii_type: str) -> str:
    """
    Stub FPE: in production replace with ff3.FF3Cipher().encrypt().
    Preserves length, separators, and digit positions for numeric types.
    """
    digits = re.sub(r'\D', '', value)
    if not digits:
        return f"[FPE:{pii_type}]"
    # Real FPE would encrypt these digits while preserving length.
    # XOR with a 32-bit constant can leave the leading digits of long numbers unchanged.
    encrypted_digits = str(int(digits) ^ 0xDEADBEEF)[:len(digits)].zfill(len(digits))
    # Write the new digits back one position at a time so separators stay in place
    # and every digit position is rewritten, not just the first run of digits.
    replacements = iter(encrypted_digits)
    return re.sub(r'\d', lambda m: next(replacements), value)
# ── Core redaction engine ─────────────────────────────────────────────────────
@dataclass
class DetectedEntity:
    text: str
    start: int
    end: int
    pii_type: str
    confidence: float
    reason: str
def redact(
    text: str,
    entities: list[DetectedEntity],
    strategy: Strategy = "mask",
) -> tuple[str, list[RedactionRecord]]:
    """
    Apply the chosen strategy to all detected entities.
    Returns (redacted_text, list_of_redaction_records).
    Entities are sorted by start position here; an entity that overlaps
    an earlier one is skipped.
    """
    entities_sorted = sorted(entities, key=lambda e: e.start)
    records: list[RedactionRecord] = []
    result_parts: list[str] = []
    cursor = 0
    for entity in entities_sorted:
        if entity.start < cursor:
            # Skip overlapping entity. Note that any part of it extending past
            # the previous entity stays unredacted.
            continue
        result_parts.append(text[cursor : entity.start])
        original_hash = hashlib.sha256(entity.text.encode()).hexdigest()
        if strategy == "mask":
            replacement = f"[{entity.pii_type}]"
        elif strategy == "replace":
            replacement = _fake_value(entity.pii_type, entity.text)
        elif strategy == "pseudonymize":
            replacement = _pseudonymize(entity.text, entity.pii_type)
        elif strategy == "fpe":
            replacement = _fpe_encrypt(entity.text, entity.pii_type)
        else:
            replacement = f"[{entity.pii_type}]"  # unknown strategy: fall back to masking
        result_parts.append(replacement)
        records.append(RedactionRecord(
            original_hash=original_hash,
            pii_type=entity.pii_type,
            strategy=strategy,
            replacement=replacement,
            start=entity.start,
            end=entity.end,
        ))
        cursor = entity.end
    result_parts.append(text[cursor:])
    return "".join(result_parts), records
FPE preserves format: a 16-digit credit card number encrypted with AES-FF3-1
produces a different 16-digit number. Downstream systems that validate only length and
character set therefore still receive a structurally valid input. Luhn checksums and BIN
prefixes are not preserved automatically: if downstream checks depend on them, keep the BIN
in the clear and recompute the check digit after encryption.
For pseudonymization use cases that need to maintain a persistent mapping, store the value → pseudonym
map in an encrypted key-value store (e.g., HashiCorp Vault KV or AWS Secrets Manager), not
in the application database.
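Whether a format-preserved card number passes a Luhn check depends on its final digit, so it can be worth verifying the output before handing it downstream. The following is a minimal sketch of the standard Luhn algorithm. The function names are hypothetical, and the code is independent of any FPE library.

```python
def luhn_checksum(digits: str) -> int:
    """Return the Luhn sum modulo 10 for a string of digits (0 means valid)."""
    total = 0
    # Walk right to left, doubling every second digit.
    for index, char in enumerate(reversed(digits)):
        d = int(char)
        if index % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10


def is_luhn_valid(number: str) -> bool:
    """Check a card number, ignoring separators such as spaces or dashes."""
    digits = "".join(c for c in number if c.isdigit())
    return bool(digits) and luhn_checksum(digits) == 0


def with_luhn_check_digit(payload: str) -> str:
    """Append the check digit that makes payload + digit pass the Luhn test."""
    # Appending "0" puts the payload digits in the positions they will finally occupy.
    check = (10 - luhn_checksum(payload + "0")) % 10
    return payload + str(check)


print(is_luhn_valid("4111 1111 1111 1111"))      # True
print(is_luhn_valid("4000-0000-0000-0000"))      # False
print(with_luhn_check_digit("400000000000000"))  # 4000000000000002
```

One way to combine this with FPE, assuming downstream systems require Luhn-valid numbers, is to encrypt only the first 15 digits and append the recomputed check digit.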